Showing papers in "IEEE Embedded Systems Letters in 2011"


Journal ArticleDOI
TL;DR: The design of the versatile Opal platform that couples a Cortex M3 MCU with two IEEE 802.15.4 radios for supporting sensing applications with high transfer rates without sacrificing communication range is presented.
Abstract: Design of current sensor network platforms has favored low power operation at the cost of communication throughput or range, which severely limits support for real-time monitoring applications with high throughput requirements. This letter presents the design of the versatile Opal platform that couples a Cortex M3 MCU with two IEEE 802.15.4 radios for supporting sensing applications with high transfer rates without sacrificing communication range. We present experiments that evaluate Opal's throughput and range when operating with one or two radios, and we compare these results with an Iris-based node and TelosB nodes. We introduce the spatial energy cost metric that measures the energy to transfer one bit of information in a unit area for comparing the performance of the platforms. The results show that Opal operating with dual radios increases the throughput compared to single radio platforms with the same data-rate by a factor of 3.7, without sacrificing communication range. Opal operating with one radio can deliver a 460% increase in throughput over other single radio nodes at reduced range. We also analyze the implications of Opal's design for multihop communication, showing that the dual radio architecture removes the bandwidth bottleneck in multihop communications that is inherent to single radio platforms.

62 citations


Journal ArticleDOI
TL;DR: In this letter, virtual reconfigurable architectures, referred to as intermediate fabrics, are evaluated, which enable near-instant placement and routing of applications for commercial FPGAs.
Abstract: Field-programmable gate arrays (FPGAs) suffer from lower application design productivity than other devices, which is largely due to compilation taking hours or even days. Making FPGA compilation comparable to software compilation is critical for continued FPGA usage due to competitive technologies, such as graphics-processing units, that use languages with runtime compilation models. In this letter, we evaluate virtual reconfigurable architectures, referred to as intermediate fabrics, which enable near-instant placement and routing of applications for commercial FPGAs.

46 citations


Journal ArticleDOI
TL;DR: A resource-efficient custom processor, the differential equation processing element (DEPE), specifically designed for efficient solution of ODEs on FPGAs, is introduced along with its accompanying compilation tools.
Abstract: Models of physical systems, such as of human physiology or of chemical reactions, typically comprise numerous ordinary differential equations (ODEs). Today's designers commonly consider simulating physical models utilizing field-programmable gate arrays (FPGAs). This letter introduces a resource-efficient custom processor, the differential equation processing element (DEPE), specifically designed for efficient solution of ODEs on FPGAs, and also introduces its accompanying compilation tools. We show that a single DEPE on a Xilinx Virtex6 130T FPGA executes several physiological models faster than real-time while requiring only a few hundred FPGA lookup tables (LUTs). Experiments with a commercial high-level synthesis (HLS) tool show that while a single DEPE is 5-50× slower than HLS circuits, DEPE is 10-200× smaller. We show that a single DEPE is only 10× slower than a relatively massive and costly 3 GHz Pentium 4 desktop processor for ODE solving, and its speed is also competitive with a 700 MHz TI digital signal processor and a 450 MHz ARM9 processor. DEPE is 4×-17× faster than a Xilinx MicroBlaze soft-core processor and 3×-6× smaller. DEPE thus represents an excellent processing element for use by itself for small physical models, and in future parallel networks for larger models.

29 citations


Journal ArticleDOI
TL;DR: This letter characterizes such energy sources by means of two performance measures: expected time before the first task failure and the fraction of tasks that fail before the battery dies, and presents semi-Markov models to evaluate these measures.
Abstract: Batteries and supercapacitors are complementary: batteries have a high energy-to-weight ratio but are limited in the power levels they can support; supercapacitors can provide high levels of power but have a much lower energy-to-weight ratio. A battery-supercapacitor duo can therefore prove useful in embedded systems serving sporadic, energy-intensive tasks: the battery charges the capacitor at a low, fairly steady rate, which maximizes the energy that can be drawn from it, while the supercapacitor satisfies the impulse power demands of the application. In this letter, we characterize such energy sources by means of two performance measures: expected time before the first task failure and the fraction of tasks that fail before the battery dies. For the case of rare (but energy-intensive) sporadic tasks, we present semi-Markov models to evaluate these measures. For more frequent task arrivals, we provide simulation results. This letter demonstrates the impact of various parameters on our performance measures: power draw, capacitor sizing, and the battery rest scheduling policy.

27 citations


Journal ArticleDOI
TL;DR: A novel fault-tolerant allocating algorithm called “best-fit empty area compact (BF-EAC),” and its on-chip implementation on a Xilinx Virtex-4 field-programmable gate array (FPGA), which circumvents emerging faults while maintaining more compact empty areas for emerging tasks.
Abstract: This letter presents efficient and modular task scheduler and allocator support for dynamically and partially reconfigurable electronic systems. This enables hardware tasks to be preempted and arbitrarily placed at an optimal position on the chip on-the-fly. In particular, we present a novel fault-tolerant allocating algorithm called “best-fit empty area compact (BF-EAC),” and its on-chip implementation on a Xilinx Virtex-4 field-programmable gate array (FPGA), which circumvents emerging faults while maintaining more compact empty areas for emerging tasks. We also present an implementation of the earliest deadline first (EDF) scheduling heuristic used to optimize the chronological order of execution of hardware tasks to meet real-time constraints. Put together, the placement and scheduling architecture efficiently exploits chip resources with a μs-grade computing speed and a lightweight footprint (less than 500 slices).

23 citations
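
The deadline-driven scheduling heuristic named in the abstract above can be illustrated in software. The following is a minimal sketch, not the letter's on-chip scheduler: it assumes a single execution resource, preemption at unit time steps, and hypothetical names (`Task`, `edf_schedule`) chosen for illustration only.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class Task:
    # Only the absolute deadline takes part in ordering, so the heap
    # always exposes the ready task with the earliest deadline.
    deadline: int
    name: str = field(compare=False)
    release: int = field(compare=False)
    remaining: int = field(compare=False)  # execution time still needed


def edf_schedule(tasks):
    """Simulate preemptive EDF in unit time steps.

    Returns (timeline, missed): the task name run in each time unit
    (None when idle) and the names of tasks that finished late.
    """
    pending = sorted(tasks, key=lambda t: t.release)
    ready, timeline, missed = [], [], []
    now, i = 0, 0
    while i < len(pending) or ready:
        # Admit every task whose release time has arrived.
        while i < len(pending) and pending[i].release <= now:
            heapq.heappush(ready, pending[i])
            i += 1
        if not ready:
            timeline.append(None)  # nothing to run: idle slot
            now += 1
            continue
        task = ready[0]  # earliest absolute deadline
        task.remaining -= 1
        timeline.append(task.name)
        now += 1
        if task.remaining == 0:
            heapq.heappop(ready)
            if now > task.deadline:
                missed.append(task.name)
    return timeline, missed


if __name__ == "__main__":
    demo = [
        Task(deadline=10, name="A", release=0, remaining=3),
        Task(deadline=4, name="B", release=1, remaining=2),
        Task(deadline=12, name="C", release=2, remaining=1),
    ]
    print(edf_schedule(demo))
```

In the demo, task B arrives after A has started but carries an earlier deadline, so it preempts A; a hardware implementation would additionally have to account for reconfiguration and placement costs, which this sketch ignores.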


Journal ArticleDOI
TL;DR: The design and implementation of a lossless compression system for hyperspectral images on a processor-plus-field-programmable gate array (FPGA)-based embedded platform shows a 21× speed-up compared to a purely software implementation, with performance bounded by the software section that realizes the entropy coder.
Abstract: This letter presents the design and implementation of a lossless compression system for hyperspectral images on a processor-plus-field-programmable gate array (FPGA)-based embedded platform. The software execution time of the compression algorithm was profiled first, which led to the decision to accelerate the most time-consuming module, interband prediction, in hardware. An efficient algorithm-to-hardware mapping led to a high-throughput accelerator design in the FPGA capable of processing 16.5 M pixels/s. A set of optimization techniques was applied systematically to enhance the overall system performance. These include a hierarchical memory access scheme to resolve the bus bandwidth limitation, DMA-assisted data transfers to shorten the hardware/software (HW/SW) communication, and various coding styles and compiler options to optimize the software execution. The final result shows a 21× speed-up compared to a purely software implementation, with performance bounded by the software section that realizes the entropy coder. A 27× speed-up can be achieved if a simplified coder is used.

19 citations


Journal ArticleDOI
TL;DR: This letter builds a low-power system using an Atom processor, an ION, a GPU, and a field-programmable gate array (FPGA)-based custom accelerator, and studies its performance and power characteristics using four representative workloads.
Abstract: Embedded learning applications in automobiles, surveillance, robotics, and defense are computationally intensive, and process large amounts of real-time data. Systems for such workloads have to balance stringent performance constraints within limited power budgets. High performance computer processing units (CPUs) and graphics processing units (GPUs) cannot be used in an embedded platform due to power issues. In this letter, we propose a low power heterogeneous system consisting of an Atom processor supported by multiple accelerators that target these workloads, and seek to find if such a system can satisfy performance requirements in an energy-efficient manner. We build our low-power system using an Atom processor, an ION, a GPU, and a field-programmable gate array (FPGA)-based custom accelerator, and study its performance and power characteristics using four representative workloads. With such a system, we show an energy improvement of 42-85% over a server comprising a 2.27 GHz quadcore Xeon coupled to a 1.3 GHz 240 core Tesla GPU.

19 citations


Journal ArticleDOI
TL;DR: A Hamming code based error detection and correction (EDAC) circuit that can protect the configuration memory of a reconfigurable device from SEUs and DEUs is developed.
Abstract: As the size of integrated circuits has reached the nanoscale, embedded memories are more sensitive to single-event upsets (SEUs) or double-event upsets (DEUs), due to their low threshold voltage. In particular, reconfigurable systems, containing a large number of configuration memories to implement customer circuits, are more likely to suffer from soft errors caused by SEUs and DEUs. In this letter, we develop a Hamming code based error detection and correction (EDAC) circuit that can protect the configuration memory of a reconfigurable device from SEUs. Evaluation reveals that compared to the conventional triple modular redundancy (TMR) protected field-programmable gate array (FPGA) tile, the proposed EDAC protected FPGA tile shows about 2.3 times better dependability on the influence of DEUs. Moreover, as the FPGA array size increases, the dependability advantage of EDAC increases exponentially. The main drawback of EDAC is that it has about 1.6 times greater area overhead than TMR.

15 citations
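
To make the Hamming-code idea in the abstract above concrete, here is a minimal software sketch of a textbook Hamming(7,4) code. It is an assumption-laden illustration, not the letter's EDAC circuit: the code length, the bit layout, and the function names `hamming74_encode` and `hamming74_correct` are chosen here for brevity.

```python
def hamming74_encode(data_bits):
    """Encode 4 data bits as a 7-bit codeword.

    Codeword positions 1..7 hold: p1 p2 d1 p3 d2 d3 d4.
    """
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4  # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_correct(codeword):
    """Correct at most one flipped bit.

    Returns (data_bits, syndrome); syndrome is 0 when no error was
    detected, otherwise the 1-based position that was flipped back.
    """
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s3
    if syndrome:
        c[syndrome - 1] ^= 1  # flip the bit the syndrome points at
    return [c[2], c[4], c[5], c[6]], syndrome


if __name__ == "__main__":
    word = hamming74_encode([1, 0, 1, 1])
    word[4] ^= 1  # inject a single-bit upset at position 5
    print(hamming74_correct(word))
```

A plain Hamming(7,4) code of this kind corrects only single-bit errors; two simultaneous flips in one codeword yield a nonzero syndrome that points at the wrong bit, which is why double upsets need additional measures beyond this sketch.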


Journal ArticleDOI
TL;DR: This work presents a real-time scheduling perspective analysis of lazy and eager conflict detection policies used in STM, and presents an abstract model for this analysis.
Abstract: Transactional memory is a mechanism for controlling access to shared resources in concurrent programs. Though originally implemented in hardware, software implementations of transactional memory are now available as library extensions in all major programming languages. Lately, variants of software transactional memory (STM) with real-time support have been presented. The conflict detection policy used in STM, which can be of lazy or eager type, determines the point at which transactions are aborted. The conflict detection policy can have a significant effect on the schedulability of tasks sharing common resources. Using an abstract model, we present a real-time scheduling perspective analysis of lazy and eager conflict detection policies used in STM.

14 citations


Journal ArticleDOI
TL;DR: This work addresses the problem of providing customized microcoded DMM on MPSoC platforms with distributed memory organization by providing a solution that can serve approximately 7× more allocation requests compared to pure distributed memory platforms and perform 25% faster than the corresponding high-level implementation in C language.
Abstract: Multiprocessor systems-on-chip (MPSoCs) have attracted significant attention since they are recognized as a scalable paradigm to interconnect and organize a high number of cores. Current multicore embedded systems exhibit increased levels of dynamic behavior, leading to unexpected memory footprint variations unknown at design time. Dynamic memory management (DMM) is a promising solution for such types of dynamic systems. Although some efficient dynamic memory managers have been proposed for conventional bus-based MPSoC platforms, there are no DMM solutions that address the constraints and the opportunities created by the physical distribution of multiple memory nodes on the platform. In this work, we address the problem of providing customized microcoded DMM on MPSoC platforms with distributed memory organization. Customization is enabled at both the application level and the platform level. Results show that customized microcoded DMM can serve approximately 7× more allocation requests compared to pure distributed memory platforms and perform 25% faster than the corresponding high-level implementation in C language.

14 citations


Journal ArticleDOI
TL;DR: The design of an efficient, real-time data archival system to a secure digital flash memory card via reconfigurable hardware is presented; bidirectional access is shown to take place correctly, and data integrity is verified using cyclic redundancy code in both the field-programmable gate array (FPGA) chip and the SD card controller.
Abstract: The main objective of this letter is to present the design of an efficient, real-time data archival system to a secure digital flash memory card via reconfigurable hardware. The data access from the SD card is implemented completely using Verilog and hence, there is no use of any microcontroller or on-chip general purpose processors. And since the complete design is a single-purpose system, no extra hardware is required. The design has four independent modules for the required different operations on the SD memory card. These four modules are for single-block write, multiple-block write, single-block read, and multiple-block read operations. We show how the bidirectional access takes place correctly and the data integrity has been verified using cyclic redundancy code in both field-programmable gate array (FPGA) chip and the SD card controller.
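
The cyclic redundancy check mentioned above can be sketched in software. The abstract does not say which polynomials the design uses; the sketch below assumes the two checks commonly documented for the SD bus protocol, a 7-bit CRC (x^7 + x^3 + 1) on command frames and a 16-bit CCITT CRC (x^16 + x^12 + x^5 + 1, zero initial value) on data blocks. The function names and the `sd_command_frame` helper are illustrative, not taken from the letter's Verilog.

```python
def crc7(data: bytes) -> int:
    """Bitwise CRC-7, polynomial x^7 + x^3 + 1, initial value 0."""
    crc = 0
    for byte in data:
        for shift in range(7, -1, -1):  # most significant bit first
            bit = (byte >> shift) & 1
            msb = (crc >> 6) & 1
            crc = (crc << 1) & 0x7F
            if bit ^ msb:
                crc ^= 0x09
    return crc


def crc16_ccitt(data: bytes) -> int:
    """Bitwise CRC-16, polynomial 0x1021, initial value 0, no reflection."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


def sd_command_frame(index: int, argument: int) -> bytes:
    """Build a 6-byte command frame: start/transmission bits plus index,
    32-bit argument, then CRC-7 followed by the end bit (assumed layout)."""
    body = bytes([0x40 | (index & 0x3F)]) + argument.to_bytes(4, "big")
    return body + bytes([(crc7(body) << 1) | 1])


if __name__ == "__main__":
    print(sd_command_frame(0, 0).hex())
    print(hex(crc16_ccitt(bytes(512))))
```

Under these assumptions, the reset command (index 0, argument 0) produces a frame whose last byte is 0x95. A hardware version would typically compute the same remainder with a shift register clocked once per bit, which is what the inner loops imitate.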

Journal ArticleDOI
TL;DR: A TSV-aware partitioning algorithm that enables higher performance for application implementation onto 3-D field-programmable gate arrays (FPGAs) and leads to a more efficient utilization of the available (fabricated) interlayer connectivity.
Abstract: Integrating more functionality in a smaller form factor with higher performance and lower power consumption is pushing semiconductor technology scaling to its limits. Three-dimensional (3-D) chip stacking is touted as the silver bullet technology that can keep Moore's momentum and fuel the next wave of consumer electronics products. This letter introduces a TSV-aware partitioning algorithm that enables higher performance for application implementation onto 3-D field-programmable gate arrays (FPGAs). Unlike other algorithms that minimize the number of connections among layers, our solution leads to a more efficient utilization of the available (fabricated) interlayer connectivity. Experimental results show average reductions in delay and power consumption of about 28% and 26%, respectively, compared to similar 3-D computer-aided design (CAD) tools.

Journal ArticleDOI
TL;DR: This letter presents a systematic approach to efficiently handle a very large number of power domains in modern coarse-grained reconfigurable arrays in order to tightly match the different computational demands of processed algorithms with corresponding power consumption.
Abstract: This letter presents a systematic approach to efficiently handle a very large number of power domains in modern coarse-grained reconfigurable arrays in order to tightly match the different computational demands of processed algorithms with corresponding power consumption. It is based on a new highly scalable and generic power control network and additionally uses the state-of-the-art common power format based front-to-backend design methodology for a fully automated implementation. The power management is transparent to the user and is seamlessly integrated into the overall reconfiguration process: reconfiguration-controlled power gating. Furthermore, for the first time, a coarse-grained reconfigurable case study design with as many as 24 switchable power domains with detailed results on power savings and overheads is presented. The application of the proposed technique results in 60% active leakage and 90% standby leakage power reduction for several digital signal processing algorithms.

Journal ArticleDOI
TL;DR: A hybrid model-based quality priority algorithm is developed to reduce power consumption, required hardware resources, and computation time with a small quality degradation in ZQDCT computation.
Abstract: Due to the high computational complexity of discrete cosine transform (DCT) computation, prediction of zero quantized DCT (ZQDCT) coefficients has been extensively studied as a way to reduce that complexity. In this letter, we propose a reconfigurable architecture to support ZQDCT computation. Twelve different modes of DCT computation, including zonal coding, multiblock processing, and parallel-sequential stage mode, can be performed using the proposed architecture. We develop a hybrid model-based quality priority algorithm to reduce power consumption, required hardware resources, and computation time with a small quality degradation.

Journal ArticleDOI
TL;DR: A novel approach based on PI and PID controllers, widely used in control automation, is presented for optimizing resource utilization in Multiprocessor System-on-Chip (MPSoC).
Abstract: Adaptive multiprocessor systems are emerging as a promising solution for dealing with complex and unpredictable scenarios. Given the large variety of possible use cases that these platforms must support and the resulting workload variability, offline approaches are no longer sufficient because they cannot cope with time-varying workloads. This letter presents a novel approach based on PI and PID controllers, widely used in control automation, for optimizing resource utilization in Multiprocessor System-on-Chip (MPSoC). Several architecture characteristics, such as response time during frequency changes, noise, and perturbations, are modeled and validated in a high-level model, and the results are compared to information obtained on a homogeneous MPSoC platform prototype. Power and energy consumption figures are discussed and two controllers are proposed: one PI-based and one PID-based. Results show the system's capability to adapt under disturbing conditions while ensuring application performance constraints and reducing energy consumption.
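
For readers unfamiliar with the controllers named above, the following is a minimal discrete-time PID sketch. It is a textbook formulation under stated assumptions, not the controllers proposed in the letter: the class name `PID`, the gains, the optional output clamp, and the idea of mapping the output to a frequency setting are all hypothetical. Setting `kd=0` yields a PI controller.

```python
class PID:
    """Textbook discrete PID: u = kp*e + ki*sum(e*dt) + kd*de/dt."""

    def __init__(self, kp, ki=0.0, kd=0.0, out_min=None, out_max=None):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.out_min, self.out_max = out_min, out_max
        self.integral = 0.0
        self.prev_error = None  # no derivative term on the first sample

    def update(self, setpoint, measured, dt):
        error = setpoint - measured
        self.integral += error * dt
        if self.prev_error is None:
            derivative = 0.0
        else:
            derivative = (error - self.prev_error) / dt
        self.prev_error = error
        out = self.kp * error + self.ki * self.integral + self.kd * derivative
        # Optional clamp, e.g. to the range of legal frequency settings.
        if self.out_min is not None:
            out = max(self.out_min, out)
        if self.out_max is not None:
            out = min(self.out_max, out)
        return out


if __name__ == "__main__":
    # Hypothetical use: steer a measured throughput toward a target by
    # treating the controller output as a frequency adjustment.
    controller = PID(kp=0.5, ki=0.1, out_min=-100.0, out_max=100.0)
    for measured in (20.0, 24.0, 27.0):
        print(controller.update(setpoint=30.0, measured=measured, dt=1.0))
```

The integral term removes steady-state error under a sustained workload shift, while the derivative term reacts to how quickly the error changes; a practical controller would also need anti-windup handling when the output saturates, which this sketch omits.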

Journal ArticleDOI
TL;DR: It is argued that execution dependencies among tasks need to be suitably considered in various embedded software engineering activities such as debugging, regression testing, and computation of complexity metrics.
Abstract: Execution dependencies arise among the tasks of an embedded program due to issues such as task priority, task precedence, and intertask communication. We argue that execution dependencies among tasks need to be suitably considered in various embedded software engineering activities such as debugging, regression testing, and computation of complexity metrics. In this letter, we discuss how task execution dependencies among real-time tasks can be identified from static code analysis. Subsequently, we briefly describe an application of our analysis to regression test selection.

Journal ArticleDOI
TL;DR: This work proposes a design methodology that combines an efficient reconfigurable architecture and a related mapping flow that couples an efficient area usage and an adaptable communication infrastructure for island-based hardware architecture.
Abstract: Nowadays, hardware devices are meant to host the execution of many complex, multicore applications, whose functional and nonfunctional requirements vary according to the specific working domain. In this work, we propose a design methodology that combines an efficient reconfigurable architecture and a related mapping flow. In particular, the proposed island-based hardware architecture couples an efficient area usage and an adaptable communication infrastructure. The proposed mapping flow distributes the cores on the device to optimize both performance and reconfiguration related metrics.

Journal ArticleDOI
TL;DR: This letter formally derives a utilization-based necessary and sufficient scheduling condition for an STM system using lazy conflict detection, whose abort-restart execution semantics differ from the classical preemptive or nonpreemptive model.
Abstract: Software transactional memory (STM) is a transactional mechanism for controlling access to shared resources in memory. Recently, variants of STM with real-time support have been presented. Due to its abort-restart nature, the execution semantics of STM are different from the classical preemptive or nonpreemptive model. In this letter, we formally derive a utilization-based necessary and sufficient scheduling condition for an STM system using lazy conflict detection.

Journal ArticleDOI
TL;DR: This work provides a representation of tagged systems using the semantics of Kleene algebra that facilitates the usage of standard off-the-shelf theorem provers for reasoning about such systems for both behavioral verification through equivalence checking and property verification.
Abstract: The tagged signal model (TSM) is a formal framework for modeling heterogeneous embedded systems. In the present work, we provide a representation of tagged systems using the semantics of Kleene algebra. Such an algebraic representation facilitates the usage of standard off-the-shelf theorem provers for reasoning about such systems for both behavioral verification through equivalence checking and property verification.

Journal ArticleDOI
TL;DR: Cross-layer optimization techniques involving the cooperation between the buffer management layer and the FTL are proposed in this letter to maximize the degree of parallelism while keeping a low garbage collection overhead.
Abstract: Solid-state drives (SSDs) utilize parallel architectures to improve their IO throughput. Although log buffer based flash translation layers (FTLs) are widely employed in SSDs, little has been addressed on the issue of placing pages under a parallel architecture in such FTLs. In this letter, we evaluate three possible page placement policies and show that there is no best policy due to the tradeoff between the degree of parallelism and the garbage collection overhead. To achieve high performance SSDs, cross-layer optimization techniques involving the cooperation between the buffer management layer and the FTL are proposed in this letter to maximize the degree of parallelism while keeping a low garbage collection overhead. The basic idea is to let the FTL keep the garbage collection overhead low by eliminating high-cost cross-channel live page copying, while making the buffer management layer responsible for maximizing the degree of parallelism. Simulation results on five realistic or benchmark based workloads show that the proposed techniques reduce the response time of the SSD by up to 79%.

Journal ArticleDOI
TL;DR: Experimental results reveal that POES incurs almost no migrations at low workloads and achieves up to 32 times reduction in the number of migrations suffered with respect to the global ERfair scheduler on a set of two to 16 processors even when the average system load is as high as 85%.
Abstract: This letter presents partition-oriented ERfair scheduler (POES), a low-overhead proportional fair scheduler for hard real-time multiprocessor embedded systems. POES achieves lower overheads using an online partitioning/merging mechanism that retains the optimal schedulability of a fully global scheduler by merging processor groups as resources become critical while using partitioning for fast scheduling at other times. The principal objective is to remain only just as global at any given instant of time as is necessary to maintain ERfair schedulability of the system throughout the schedule length. Experimental results reveal that POES incurs almost no migrations at low workloads and achieves up to 32 times reduction in the number of migrations suffered with respect to the global ERfair scheduler on a set of two to 16 processors even when the average system load is as high as 85%. Theoretical analysis proves that POES typically has the same amortized complexity as that of the global ERfair algorithm.

Journal ArticleDOI
TL;DR: A methodology is proposed, managed by the memory controller, that optimizes data reliability at the physical level for critical data while exploiting transaction performance for noncritical data by partitioning the memory addressable space into different functional blocks.
Abstract: Improving the reliability of nonvolatile memories in embedded systems is a key design concern. We here propose a methodology, managed by the memory controller, that optimizes data reliability at the physical level for critical data while exploiting transaction performance for noncritical data. The reliability-performance tradeoff is obtained by partitioning the memory addressable space into different functional blocks, each one written by means of a specific optimized writing algorithm. The feasibility of the method is demonstrated by a case study exploiting the features of phase change memories (PCMs).

Journal ArticleDOI
TL;DR: The design of a resource manager and master scheduler for the OpenCLosE environment, that allows efficient realization of multiple applications within a multitasked platform, is described.
Abstract: We present a framework, OpenCLosE, for dynamic resource management and scheduling of applications written in open compute language (OpenCL) for heterogeneous multimedia and graphics platforms, such as those found in multimedia smartphones and automotive infotainment clusters. We describe the design of a resource manager and master scheduler for the OpenCLosE environment, that allows efficient realization of multiple applications within a multitasked platform.

Journal ArticleDOI
Phillip Stanley-Marbell
TL;DR: This letter presents the design, implementation, and evaluation of a miniature, energy-scalable, 24-processor module, L24, for compute-intensive in situ sensor data processing tasks, finding an optimum operating voltage that minimizes either time-to-solution, energy usage, or the energy-delay product.
Abstract: The in situ processing of vast amounts of data, available intermittently in networks of sensors, motivates investigation of means for achieving high performance when required, but ultralow-power dissipation when idle. One approach is the use of embedded multiprocessor systems, leading to tradeoffs between parallelism, performance, energy-efficiency, and cost. To evaluate these tradeoffs, and to gain insight for future system designs, this letter presents the design, implementation, and evaluation of a miniature, energy-scalable, 24-processor module, L24, for compute-intensive in situ sensor data processing tasks. The platform provides idle power dissipation over an order of magnitude lower than systems employing a monolithic processor of equivalent performance, while dynamic power dissipation remains competitive. Taking into account both application computation and interprocessor communication demands, it is shown that there may exist an optimum operating voltage that minimizes either time-to-solution, energy usage, or the energy-delay product. This optimum operating point is formulated analytically, calibrated with system measurements and instruction-level microarchitectural simulation, and evaluated for the hardware platform and application presented.

Journal ArticleDOI
TL;DR: An algorithm for automatic synthesis of SL/SF monitors from ESC-QC specifications, which is used for generating monitors for verification of controller models from active safety and body control applications.
Abstract: Requirements of embedded systems often describe the system behavior with quantitative constraints over parameters such as timing, memory, and other resources. In this letter, we present a visual language suited for scenario-based specification of requirements with quantitative constraints. Our language, known as event sequence charts with quantitative constraints (ESC-QC), is inspired by message sequence charts (MSC) and its variants. We introduce ESC-QC notations through an example from automotive requirements and then describe the formal syntax and semantics. Besides being useful for formal documentation and analysis of system requirements, ESC-QC specifications can be translated into monitors and used for run-time verification of designs. In automotive systems Simulink/Stateflow (SL/SF) is widely used for design of control systems. We have developed an algorithm for automatic synthesis of SL/SF monitors from ESC-QC specifications. We have used this algorithm for generating monitors for verification of controller models from active safety and body control applications.

Journal ArticleDOI
TL;DR: An approach to automatic software synthesis from SystemC based on coroutines instead of the traditional approaches based on real-time operating system (RTOS) threads, which results in an impressive reduction of runtime overheads compared to the thread-based approaches.
Abstract: SystemC is a widely used electronic system-level (ESL) design language that can be used to model both hardware and software at different stages of system design. There has been a lot of research on behavior synthesis of hardware from SystemC, but relatively little work on synthesizing embedded software for SystemC designs. In this letter, we present an approach to automatic software synthesis from SystemC based on coroutines instead of the traditional approaches based on real-time operating system (RTOS) threads. Performance evaluation results on some realistic applications show that our approach results in an impressive reduction of runtime overheads compared to the thread-based approaches.

Journal ArticleDOI
TL;DR: This letter presents the practical issues concerning late and insufficient verification of low-level software on hardware platforms developed by an industrial partner, and proposes a coverification platform based on process algebra.
Abstract: This letter presents the practical issues concerning late and insufficient verification of low-level software on hardware platforms developed by our industrial partner. To overcome these issues, we propose a coverification platform based on process algebra. The descriptions of hardware and software, and their interface are translated into a common process-algebraic platform, and formal verification techniques are used to check the conformance of the two descriptions. We present the results of our first attempt towards this goal, discuss the lessons learned, and present the road-map for future research.