
Showing papers in "ACM Transactions on Embedded Computing Systems in 2006"


Journal ArticleDOI
TL;DR: This research proposes a dynamic allocation methodology for global and stack data and program code that accounts for changing program requirements at runtime, has no software-caching tags, requires no runtime checks, has extremely low overheads, and yields 100% predictable memory access times.
Abstract: In this research, we propose a highly predictable, low-overhead, and, yet, dynamic memory-allocation strategy for embedded systems with scratch pad memory. A scratch pad is a fast compiler-managed SRAM memory that replaces the hardware-managed cache. It is motivated by its better real-time guarantees versus cache and by its significantly lower overheads in energy consumption, area, and overall runtime, even with a simple allocation scheme. Scratch pad allocation methods are primarily of two types. First, software-caching schemes emulate the workings of a hardware cache in software. Instructions are inserted before each load/store to check the software-maintained cache tags. Such methods incur large overheads in runtime, code size, energy consumption, and SRAM space for tags and deliver poor real-time guarantees, just like hardware caches. A second category of algorithms partitions variables at compile-time into the two banks. However, a drawback of such static allocation schemes is that they do not account for dynamic program behavior. It is easy to see why a data allocation that never changes at runtime cannot achieve the full locality benefits of a cache. We propose a dynamic allocation methodology for global and stack data and program code that (i) accounts for changing program requirements at runtime, (ii) has no software-caching tags, (iii) requires no runtime checks, (iv) has extremely low overheads, and (v) yields 100% predictable memory access times. In this method, data that is about to be accessed frequently is copied into the scratch pad using compiler-inserted code at fixed and infrequent points in the program. Earlier data is evicted if necessary. When compared to a provably optimal static allocation, results show that our scheme reduces runtime by up to 39.8% and energy by up to 31.3%, on average, for our benchmarks, depending on the SRAM size used.
The actual gain depends on the SRAM size, but our results show that close to the maximum benefit in runtime and energy is achieved for a substantial range of small SRAM sizes commonly found in embedded systems. Our comparison with a direct mapped cache shows that our method performs roughly as well as a cached architecture.
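The core mechanism described above — copying soon-to-be-hot data into the scratch pad at fixed, compiler-chosen program points and evicting earlier data — can be sketched roughly as follows. This is a minimal illustration, not the paper's actual algorithm; the plan, sizes, and variable names are all invented:

```python
# Illustrative sketch of dynamic scratch-pad allocation: a compile-time
# plan of copy/evict actions applied at fixed program points. No tags,
# no per-access runtime checks -- every action is decided ahead of time.

SPM_SIZE = 8  # scratch-pad capacity in words (invented)

class ScratchPad:
    def __init__(self, size):
        self.size = size
        self.resident = {}   # variable -> word count currently in SPM

    def used(self):
        return sum(self.resident.values())

    def bring_in(self, var, words, evict_order):
        # Evict earlier data (in the compiler-chosen order) until var fits.
        for victim in evict_order:
            if self.used() + words <= self.size:
                break
            if victim in self.resident:
                del self.resident[victim]  # copy-back to DRAM omitted here
        assert self.used() + words <= self.size
        self.resident[var] = words

# Compiler-chosen plan: at each program point, what to copy in and
# which earlier variables may be evicted to make room.
spm = ScratchPad(SPM_SIZE)
plan = [("A", 4, []), ("B", 4, []), ("C", 6, ["A", "B"])]
for var, words, evict in plan:
    spm.bring_in(var, words, evict)
print(sorted(spm.resident))  # ['C'] -- A and B were evicted to fit C
```

Because the plan is fixed at compile time, every memory access hits a known bank, which is what makes access times fully predictable.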

240 citations


Journal ArticleDOI
TL;DR: The design flow is utilized in the integration of state-of-the-art technology approaches, including a wireless terminal architecture, a network-on-chip, and multiprocessing utilizing RTOS in a SoC.
Abstract: This paper describes a complete design flow for multiprocessor systems-on-chip (SoCs) covering the design phases from system-level modeling to FPGA prototyping. The design of complex heterogeneous systems is enabled by raising the abstraction level and providing several system-level design automation tools. The system is modeled in a UML design environment following a new UML profile that specifies the practices for orthogonal application and architecture modeling. The design flow tools are governed in a single framework that combines the subtools into a seamless flow and visualizes the design process. Novel features also include an automated architecture exploration based on the system models in UML, as well as the automatic back- and forward-annotation of information in the design flow. The architecture exploration is based on the global optimization of systems that are composed of subsystems, which are then locally optimized for their particular purposes. As a result, the design flow produces an optimized component allocation, task mapping, and scheduling for the described application. In addition, it implements the entire system on an FPGA prototyping board. As a case study, the design flow is utilized in the integration of state-of-the-art technology approaches, including a wireless terminal architecture, a network-on-chip, and multiprocessing utilizing an RTOS in a SoC. In this study, a central part of a WLAN terminal is modeled, verified, optimized, and prototyped with the presented framework.

171 citations


Journal ArticleDOI
TL;DR: Algorithms and tools for reachability analysis of hybrid systems are presented by combining the notion of predicate abstraction with recent techniques for approximating the set of reachable states of linear systems using polyhedra, and it is shown that predicate abstraction of hybrid system can be used to prove bounded safety.
Abstract: Embedded systems are increasingly finding their way into a growing range of physical devices. These embedded systems often consist of a collection of software threads interacting concurrently with each other and with a physical, continuous environment. While continuous dynamics have been well studied in control theory, and discrete and distributed systems have been investigated in computer science, the combination of the two complexities leads us to the recent research on hybrid systems. This paper addresses the formal analysis of such hybrid systems. Predicate abstraction has emerged to be a powerful technique for extracting finite-state models from infinite-state discrete programs. This paper presents algorithms and tools for reachability analysis of hybrid systems by combining the notion of predicate abstraction with recent techniques for approximating the set of reachable states of linear systems using polyhedra. Given a hybrid system and a set of predicates, we consider the finite discrete quotient whose states correspond to all possible truth assignments to the input predicates. The tool performs an on-the-fly exploration of the abstract system. We present the basic techniques for guided search in the abstract state-space, optimizations of these techniques, implementation of these in our verifier, and case studies demonstrating the promise of the approach. We also address the completeness of our abstraction-based verification strategy by showing that predicate abstraction of hybrid systems can be used to prove bounded safety.
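The central abstraction step can be illustrated with a toy example (the predicates and states below are invented, and the sketch omits the continuous dynamics that the paper's tool handles): an abstract state is simply the tuple of truth values of the chosen predicates, so many concrete states collapse onto one abstract state.

```python
# Toy illustration of predicate abstraction: map a concrete state to the
# truth assignment of a fixed set of predicates. Hypothetical system.

predicates = [
    lambda s: s["x"] >= 0,       # p0
    lambda s: s["x"] < s["y"],   # p1
]

def abstract(state):
    """Abstract state = tuple of predicate truth values."""
    return tuple(p(state) for p in predicates)

# Two distinct concrete states that the abstraction cannot distinguish:
s1 = {"x": 1, "y": 5}
s2 = {"x": 2, "y": 3}
print(abstract(s1), abstract(s2))  # both (True, True)
```

With two predicates there are at most four abstract states, so an on-the-fly search of the abstract system explores a small finite quotient instead of the infinite concrete state-space.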

124 citations


Journal ArticleDOI
TL;DR: This paper presents the design, implementation, and evaluation of EnviroSuite, a programming framework that introduces a new paradigm, called environmentally immersive programming, to abstract distributed interactions with the environment.
Abstract: Sensor networks open a new frontier for embedded-distributed computing. Paradigms for sensor network programming-in-the-large have been identified as a significant challenge toward developing large-scale applications. Classical programming languages are too low-level. This paper presents the design, implementation, and evaluation of EnviroSuite, a programming framework that introduces a new paradigm, called environmentally immersive programming, to abstract distributed interactions with the environment. Environmentally immersive programming refers to an object-based programming model in which individual objects represent physical elements in the external environment. It allows the programmer to think directly in terms of environmental abstractions. EnviroSuite provides language primitives for environmentally immersive programming that map transparently into a support library of distributed algorithms for tracking and environmental monitoring. We show how nesC code of realistic applications is significantly simplified using EnviroSuite and demonstrate the resulting system performance on Mica2 and XSM platforms.

104 citations


Journal ArticleDOI
TL;DR: Application-specific networks-on-chip (ASNoC) and its design methodology are proposed, and comparison results show that ASNoC provides substantial improvements in power, performance, and cost compared to 2D mesh networks-on-chip.
Abstract: With the help of HW/SW codesign, system-on-chip (SoC) designs can effectively reduce cost, improve reliability, and produce versatile products. The growing complexity of SoC designs makes on-chip communication subsystem design as important as computation subsystem design. While a number of codesign methodologies have been proposed for on-chip computation subsystems, much work is still needed for on-chip communication subsystems. This paper proposes application-specific networks-on-chip (ASNoC) and its design methodology. ASNoC is used for two high-performance SoC applications. The methodology (1) can automatically generate an optimized ASNoC for different applications, (2) can generate a corresponding distributed shared memory along with an ASNoC, (3) can use both recorded and statistical communication traces for cycle-accurate performance analysis, (4) is based on a standardized network component library and floorplan to estimate power and area, (5) adopts an industrial-grade network modeling and simulation environment, OPNET, which makes the methodology ready to use, and (6) can be easily integrated into current HW/SW codesign flows. Using the methodology, ASNoCs are generated for an H.264 HDTV decoder SoC and a Smart Camera SoC. ASNoC and 2D mesh networks-on-chip are compared in performance, power, and area in detail. The comparison results show that ASNoC provides substantial improvements in power, performance, and cost compared to 2D mesh networks-on-chip. In the H.264 HDTV decoder SoC, ASNoC uses 39% less power, 59% less silicon area, 74% less metal area, 63% less switch capacity, and 69% less interconnection capacity to achieve 2X performance compared to a 2D mesh network-on-chip.

98 citations


Journal ArticleDOI
TL;DR: This paper contributes novel techniques for tight and flexible static timing analysis, particularly well-suited for dynamic scheduling schemes, and proposes a parametric approach toward bounding the WCET statically with respect to the frequency.
Abstract: Energy is a valuable resource in embedded systems, as the lifetime of many such systems is constrained by their battery capacity. Recent advances in processor design have added support for dynamic frequency/voltage scaling (DVS) for saving energy. Recent work on real-time scheduling focuses on saving energy in static as well as dynamic scheduling environments by exploiting idle time and slack, due to early task completion, for DVS of subsequent tasks. These scheduling algorithms rely on a priori knowledge of the worst-case execution time (WCET) of each task. They assume that DVS has no effect on the worst-case execution cycles (WCEC) of a task and scale the WCET according to the processor frequency. However, for systems with memory hierarchies, the WCEC typically does change under DVS because of frequency modulation. Hence, current assumptions used by DVS schemes result in a highly exaggerated WCET. This paper contributes novel techniques for tight and flexible static timing analysis, particularly well-suited for dynamic scheduling schemes. The technical contributions are as follows: (1) We assess the problem of changing execution cycles owing to scaling techniques. (2) We propose a parametric approach toward bounding the WCET statically with respect to the frequency. Using a parametric model, we can capture the effect of changes in frequency on the WCEC and, thus, accurately model the WCET over any frequency range. (3) We discuss the design and implementation of the frequency-aware static timing analysis (FAST) tool based on our prior experience with static timing analysis. (4) We demonstrate in experiments that our FAST tool provides safe upper bounds on the WCET, which are tight. The FAST tool allows us to capture the WCET of six benchmarks using equations that overestimate the WCET by less than 1%. FAST equations can also be used to improve existing DVS scheduling schemes to ensure that the effect of frequency scaling on the WCET is considered and that the WCET used is not exaggerated. (5) We leverage three DVS scheduling schemes by incorporating FAST into them and by showing that energy consumption further decreases. (6) We compare experimental results using two different energy models to verify the validity of the simulation methods. To the best of our knowledge, this study of DVS effects on timing analysis is unprecedented.
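The effect the paper addresses can be shown with a back-of-envelope parametric model (all constants below are made up; the paper's analysis is far more precise): processor-bound cycles are frequency-independent, but memory latency is fixed in wall-clock time, so the stall-cycle count — and hence the WCEC — grows with frequency.

```python
# Hedged sketch of a frequency-parametric WCET bound. Invented numbers.

T_MEM = 60e-9      # main-memory latency in seconds (assumed constant)
C_CPU = 2_000_000  # worst-case processor cycles (frequency-independent)
N_MEM = 10_000     # worst-case number of memory accesses

def wcec(f_hz):
    # Cycles spent stalled on memory grow linearly with frequency.
    return C_CPU + N_MEM * T_MEM * f_hz

def wcet(f_hz):
    return wcec(f_hz) / f_hz   # seconds, as a function of frequency

# Naively rescaling a WCET measured at 1 GHz down to 100 MHz assumes a
# fixed cycle count and therefore overestimates:
naive = wcet(1e9) * (1e9 / 1e8)
print(wcet(1e8), naive)  # parametric bound is tighter at low frequency
```

Here the parametric bound at 100 MHz is 20.6 ms while the naively scaled bound is 26 ms, which is the kind of exaggeration the FAST equations avoid.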

90 citations


Journal ArticleDOI
TL;DR: This work aims at providing a comparative energy and performance analysis of cache-coherence support schemes in MPSoCs by exploring different cache- coherent shared-memory communication schemes for a number of cache configurations and workloads.
Abstract: Shared memory is a common interprocessor communication paradigm for single-chip multiprocessor platforms. Snoop-based cache coherence is a very successful technique that provides a clean shared-memory programming abstraction in general-purpose chip multiprocessors, but there is no consensus on its usage in resource-constrained multiprocessor systems on chips (MPSoCs) for embedded applications. This work aims at providing a comparative energy and performance analysis of cache-coherence support schemes in MPSoCs. Thanks to the use of a complete multiprocessor simulation platform, which relies on accurate technology-homogeneous power models, we were able to explore different cache-coherent shared-memory communication schemes for a number of cache configurations and workloads.

48 citations


Journal ArticleDOI
TL;DR: This work has developed a generic instruction model and a generic decode algorithm that facilitates easy and efficient retargetability of the ISA-simulator for a wide range of processor architectures, such as RISC, CISC, VLIW, and variable length instruction-set processors.
Abstract: Instruction-set architecture (ISA) simulators are an integral part of today's processor and software design process. While the increasing complexity of architectures demands high-performance simulation, the increasing variety of available architectures makes retargetability a critical feature of an instruction-set simulator. Retargetability requires generic models, while high performance demands target-specific customizations. To address these contradictory requirements, we have developed a generic instruction model and a generic decode algorithm that facilitate easy and efficient retargetability of the ISA simulator for a wide range of processor architectures, such as RISC, CISC, VLIW, and variable-length instruction-set processors. The instruction model is used to generate compact and easy-to-debug instruction descriptions that are very similar to those in the architecture manual. These descriptions are used to generate high-performance simulators. Our retargetable framework combines the flexibility of interpretive simulation with the speed of compiled simulation. The generation of the simulator is completely separate from the simulation engine. Hence, we can incorporate any fast simulation technique in our retargetable framework without introducing any performance penalty. To demonstrate this, we have incorporated the fast IS-CS simulation engine in our retargetable framework, which has generated a 70% performance improvement over the best known simulators in this category. We illustrate the retargetability of our approach using two popular, yet different, realistic architectures: the SPARC and the ARM.
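A mask/match table is one common way such a generic decode step can be realized; the sketch below is a toy stand-in with invented encodings (not the paper's model or any real ISA), showing how one driver loop can serve any architecture once the table is regenerated:

```python
# Minimal table-driven generic decoder: each entry pairs a bit mask with
# the pattern the masked bits must equal. Hypothetical 16-bit encodings.

DECODE_TABLE = [
    # (mask,   pattern, mnemonic)
    (0xF000, 0x1000, "ADD"),
    (0xF000, 0x2000, "SUB"),
    (0xFF00, 0x3100, "BRANCH"),
]

def decode(word):
    """Return the mnemonic of the first table entry that matches."""
    for mask, pattern, mnemonic in DECODE_TABLE:
        if word & mask == pattern:
            return mnemonic
    return "UNKNOWN"

print(decode(0x1ABC), decode(0x31FF), decode(0x9999))
# ADD BRANCH UNKNOWN
```

Retargeting then amounts to generating a new `DECODE_TABLE` from the instruction descriptions, while the decode loop itself never changes.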

33 citations


Journal ArticleDOI
TL;DR: A method to detect memory overflows using compiler-inserted software run-time checks and techniques to grow the stack or heap segment after they overflow, into previously unutilized space, such as dead variables, free holes in the heap, and space freed by compressing live variables are presented.
Abstract: Embedded systems usually lack virtual memory and are vulnerable to memory overflow, since they lack a mechanism to detect overflow or to use swap space thereafter. We present a method to detect memory overflows using compiler-inserted software run-time checks. Its overheads in run-time and energy are 1.35% and 1.12%, respectively. Detection of overflow allows system-specific remedial action. We also present techniques to grow the stack or heap segment after it overflows, into previously unutilized space, such as dead variables, free holes in the heap, and space freed by compressing live variables. These may avoid the out-of-memory error if the space recovered is enough to complete execution. The reuse methods are able to grow the stack or heap beyond its overflow by an amount that varies widely by application---the amount of recovered space ranges from 0.7% to 93.5% of the combined stack and heap size.
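A compiler-inserted overflow check of the kind described might conceptually look like the following; Python stands in for the generated target code, and the addresses, layout, and names are invented:

```python
# Conceptual sketch of a software run-time check for stack overflow in a
# system without virtual memory: the stack grows down toward the heap,
# and a check inserted at each stack-growing point catches the collision.

HEAP_END = 0x2000    # current top of the heap (grows upward); invented
STACK_TOP = 0x3000   # stack grows downward from here; invented

def check_stack(sp, frame_size):
    """Inserted before a function prologue that grows the stack."""
    if sp - frame_size < HEAP_END:
        raise MemoryError("stack would collide with heap")
    return sp - frame_size

sp = STACK_TOP
sp = check_stack(sp, 0x800)       # fine: new sp 0x2800 is above HEAP_END
try:
    sp = check_stack(sp, 0x900)   # 0x1F00 < HEAP_END -> overflow detected
except MemoryError:
    print("overflow detected")    # remedial action would go here
```

On detection, the remedial action could be the paper's space-reuse techniques (growing into dead variables, heap holes, or compressed live data) rather than a hard failure.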

32 citations


Journal ArticleDOI
TL;DR: This paper proposes a set of techniques dedicated to the digital signal processing domain that lead to an optimized IP core integration and shows the effectiveness of the approach with a DCT core design case study.
Abstract: IP integration, which is one of the most important SoC design steps, requires taking into account communication and timing constraints. In that context, design and reuse can be improved using IP cores described at a high abstraction level. In this paper, we present an IP design approach that relies on three main phases: (1) constraint modeling, (2) IP constraint analysis steps for feasibility checking, and (3) synthesis. We propose a set of techniques dedicated to the digital signal processing domain that lead to an optimized IP core integration. Based on a generic architecture of components, the method we propose provides automatic generation of IP cores designed under integration constraints. We show the effectiveness of our approach with a DCT core design case study.

30 citations


Journal ArticleDOI
TL;DR: This paper presents the super-complex instruction-set computing (SuperCISC) Embedded Processor Architecture and, in particular, investigates performance and power consumption of this device compared to traditional processor architecture-based execution.
Abstract: Multiprocessor systems-on-chip (MPSoCs) have become a popular architectural technique to increase performance. However, MPSoCs may lead to undesirable power consumption characteristics for computing systems that have strict power budgets, such as PDAs, mobile phones, and notebook computers. This paper presents the super-complex instruction-set computing (SuperCISC) embedded processor architecture and, in particular, investigates the performance and power consumption of this device compared to traditional processor architecture-based execution. SuperCISC is a heterogeneous, multicore processor architecture designed to exceed the performance of traditional embedded processors while maintaining a reduced power budget compared to low-power embedded processors. At the heart of the SuperCISC processor is a multicore VLIW (Very Long Instruction Word) engine containing several homogeneous execution cores/functional units. In addition, complex and heterogeneous combinational hardware function cores are tightly integrated with the core VLIW engine, providing an opportunity for improved performance and reduced energy consumption. Our SuperCISC processor core has been synthesized for both a 90-nm Stratix II Field-Programmable Gate Array (FPGA) and a 160-nm standard-cell Application-Specific Integrated Circuit (ASIC) fabrication process from OKI, each operating at approximately 167 MHz for the VLIW core. We examine several reasons for the speedup and power improvement of the SuperCISC architecture, including predicated control flow, cycle compression, and a reduction in arithmetic power consumption, which we call power compression. Finally, testing our SuperCISC processor with multimedia and signal-processing benchmarks, we show how the SuperCISC processor can provide performance improvements ranging from 7X to 160X, with an average of 60X, while also providing orders of magnitude of power improvements for the computational kernels. The power improvements for our benchmark kernels range from just over 40X to over 400X, with average savings exceeding 130X. By combining these power and performance improvements, our total energy improvements all exceed 1000X. As these savings are limited to the computational kernels of the applications, which often consume approximately 90% of the execution time, we expect our savings to approach the ideal application improvement of 10X.

Journal ArticleDOI
TL;DR: This work proposes a method to automatically distribute programs such that the obtained parts can be run at different rates, which it calls rate desynchronization, and considers general programs whose control structure is a finite state automaton and with a DAG of actions in each state.
Abstract: Many embedded reactive programs perform computations at different rates, while still requiring the overall application to satisfy very tight temporal constraints. We propose a method to automatically distribute programs such that the obtained parts can be run at different rates, which we call rate desynchronization. We consider general programs whose control structure is a finite state automaton and with a DAG of actions in each state. The motivation is to take into account long-duration tasks inside the programs: these are tasks whose execution time is long compared to the other computations in the application, and whose maximal execution rate is known and bounded. Merely scheduling such a long duration task at a slow rate would not work since the whole program would be slowed down if compiled into sequential code. It would thus be impossible to meet the temporal constraints, unless such long duration tasks could be desynchronized from the remaining computations. This is precisely what our method achieves: it distributes the initial program into several parts, so that the parts performing the slow computations can be run at an appropriate rate, therefore not impairing the global reaction time of the program. We present in detail our method, all the involved algorithms, and a small running example. We also compare our method with the related work.

Journal ArticleDOI
TL;DR: This work analytically establishes several timeliness and nontimeliness properties of the ReUA algorithm, an energy-efficient, utility accrual, real-time scheduling algorithm that targets mobile embedded systems where system-level energy consumption is also a major concern.
Abstract: We present an energy-efficient, utility accrual, real-time scheduling algorithm called ReUA. ReUA considers an application model where activities are subject to time/utility function time constraints, mutual exclusion constraints on shared non-CPU resources, and statistical performance requirements on individual activity timeliness behavior. The algorithm targets mobile embedded systems where system-level energy consumption is also a major concern. For such a model, we consider the scheduling objectives of (1) satisfying the statistical performance requirements and (2) maximizing the system-level energy efficiency, while respecting resource constraints. Since the problem is NP-hard, ReUA allocates CPU cycles using statistical properties of application cycle demands, and heuristically computes schedules with a polynomial time cost. We analytically establish several timeliness and nontimeliness properties of the algorithm. Further, our simulation experiments illustrate ReUA's effectiveness and superiority.

Journal ArticleDOI
TL;DR: This paper introduces a novel collaborative approach between the compiler and the operating system (OS) to reduce energy consumption by using the compiler to annotate an application's source code with path-dependent information called power-management hints (PMHs).
Abstract: Managing energy consumption has become vitally important to battery-operated portable and embedded systems. Dynamic voltage scaling (DVS) reduces the processor's dynamic power consumption quadratically at the expense of linearly decreasing performance. When reducing energy with DVS for real-time systems, one must consider the performance penalty to ensure that deadlines can be met. In this paper, we introduce a novel collaborative approach between the compiler and the operating system (OS) to reduce energy consumption. We use the compiler to annotate an application's source code with path-dependent information called power-management hints (PMHs). This fine-grained information captures the temporal behavior of the application, which varies as different paths are executed. During program execution, the OS periodically changes the processor's frequency and voltage based on the temporal information provided by the PMHs. These speed adaptation points are called power-management points (PMPs). We evaluate our scheme using three embedded applications: a video decoder, automatic target recognition, and a sub-band tuner. Our scheme shows an energy reduction of up to 57% over no power management and up to 32% over a static power-management scheme. We compare our scheme to other schemes that solely utilize PMPs for power management and show experimentally that our scheme achieves more energy savings. We also analyze the advantages and disadvantages of our approach relative to another compiler-directed scheme.
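In the simplest view, a power-management point picks the lowest frequency that still meets the deadline given the worst-case cycles remaining on the current path (as reported by the hint). The sketch below uses invented frequency levels and numbers, not the paper's actual policy:

```python
# Toy speed-selection step at a power-management point (PMP): choose the
# slowest DVS level that can still finish the remaining worst-case work
# before the deadline. All values are hypothetical.

FREQS_MHZ = [100, 200, 400, 600, 800]   # available DVS levels (invented)

def choose_frequency(wcec_remaining, time_to_deadline_s):
    for f in FREQS_MHZ:                  # scan from slowest to fastest
        if wcec_remaining / (f * 1e6) <= time_to_deadline_s:
            return f
    return FREQS_MHZ[-1]                 # worst case: run flat out

# A hint on a shorter path reports fewer remaining cycles, so the OS can
# drop to a lower (more energy-efficient) frequency:
print(choose_frequency(30e6, 0.1))   # needs 400 MHz (30e6/4e8 = 0.075 s)
print(choose_frequency(5e6, 0.1))    # 100 MHz suffices (0.05 s)
```

This is what makes the path-dependent hints valuable: the same deadline admits a much lower speed on cheap paths than a path-oblivious scheme would dare use.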

Journal ArticleDOI
TL;DR: The work presented in this paper tackles not only the modeling side of embedded systems design, but also the validation of embedded system models through formal methods.
Abstract: This paper addresses the interrelation between control and data flow in embedded system models through a new design representation, called Dual Flow Net (DFN). A modeling formalism with very close-fitting control and data flow is achieved by this representation, as a consequence of enhancing its underlying Petri net structure. The work presented in this paper tackles not only the modeling side of embedded systems design, but also the validation of embedded system models through formal methods. Various introductory examples illustrate the applicability of the DFN principles, whereas the capability of the model to cope with complex designs is demonstrated through the design and verification of a real-life Ethernet coprocessor.

Journal ArticleDOI
TL;DR: For the first time, real-power and EM measurements are used to analyze the difficulty of launching new third-order DPA and DEMA attacks on a popular low-energy 32-bit embedded ARM processor.
Abstract: Future wireless embedded devices will be increasingly powerful, supporting many more applications, including one of the most crucial---security. Although many embedded devices offer more resistance to bus-probing attacks because of their compact size, susceptibility to power or electromagnetic analysis attacks must be analyzed. This paper presents a new split-mask countermeasure to thwart low-order differential power analysis (DPA) and differential EM analysis (DEMA). For the first time, real power and EM measurements are used to analyze the difficulty of launching new third-order DPA and DEMA attacks on a popular low-energy 32-bit embedded ARM processor. Results show that the new split-mask countermeasure provides increased security without large overheads in energy dissipation, compared to previous research. With the emergence of security applications in PDAs, cell phones, and other embedded devices, low-energy countermeasures for resistance to low-order DPA/DEMA are crucial for supporting the future wireless Internet.
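For background, split-mask belongs to the masking family of countermeasures, which process a secret as random shares so that intermediate values are decorrelated from the key. A generic one-mask sketch (not the paper's exact scheme):

```python
# Generic Boolean-masking illustration: a secret byte is split into two
# random shares whose XOR recombines to the secret. Neither share alone
# is statistically related to the secret, which is what defeats
# first-order DPA/DEMA. This is background, not the split-mask scheme.

import secrets

def split(secret_byte):
    m = secrets.randbits(8)          # fresh random mask each execution
    return m, secret_byte ^ m        # the two shares

def recombine(m, masked):
    return m ^ masked

key = 0xA7
m, masked = split(key)
assert recombine(m, masked) == key   # shares recombine to the secret
print("shares:", hex(m), hex(masked))
```

Higher-order attacks combine leakage from several shares, which is why the paper evaluates third-order DPA/DEMA against its countermeasure.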

Journal ArticleDOI
TL;DR: This paper shows how variables' liveness information can be used to dramatically reduce the addressing instructions required to access local variables on the program stack.
Abstract: The generation of efficient addressing code is a central problem in compiling for processors with restricted addressing modes, like digital signal processors (DSPs). Offset assignment (OA) is the problem of allocating scalar variables to memory, so as to minimize the need of addressing instructions. This problem is called simple offset assignment (SOA) when a single address register is available, and general offset assignment (GOA) when more address registers are used. This paper shows how variables' liveness information can be used to dramatically reduce the addressing instructions required to access local variables on the program stack. Two techniques that make effective use of variable coalescing to solve SOA and GOA are described, namely coalescing SOA (CSOA) and coalescing GOA (CGOA). In addition, a thorough comparison between these algorithms and others described in the literature is presented. The experimental results, when compiling MediaBench benchmark programs with the LANCE compiler, reveal a very significant improvement of the proposed techniques over the other available solutions to the problem.
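The cost model behind offset assignment can be made concrete (standard SOA setting with one address register and free auto-increment/decrement; the access sequence and assignments below are invented): each pair of consecutive accesses at non-neighbouring stack offsets costs one explicit addressing instruction.

```python
# Toy SOA cost model: with free auto-increment/decrement on the single
# address register, only transitions between non-adjacent offsets need an
# extra addressing instruction. A better offset assignment lowers cost.

def soa_cost(access_seq, offsets):
    cost = 0
    for a, b in zip(access_seq, access_seq[1:]):
        if abs(offsets[a] - offsets[b]) > 1:   # not reachable by +/-1
            cost += 1
    return cost

seq = ["a", "b", "a", "c", "b"]
bad  = {"a": 0, "b": 2, "c": 4}   # every transition needs a fix-up
good = {"a": 1, "b": 0, "c": 2}   # frequent pairs placed adjacently
print(soa_cost(seq, bad), soa_cost(seq, good))  # 4 1
```

Liveness-based coalescing goes further than this sketch: two variables that are never live simultaneously can share one offset, shrinking the layout and removing transitions altogether.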

Journal ArticleDOI
TL;DR: A new code improvement paradigm implemented in a system called VISTA that can help achieve the cost/performance trade-offs that embedded applications demand and include the use of a genetic algorithm to search for the most efficient sequence based on specified fitness criteria is described.
Abstract: Software designers face many challenges when developing applications for embedded systems. One major challenge is meeting the conflicting constraints of speed, code size, and power consumption. Embedded application developers often resort to hand-coded assembly language to meet these constraints since traditional optimizing compiler technology is usually of little help in addressing this challenge. The results are software systems that are not portable, less robust, and more costly to develop and maintain. Another limitation is that compilers traditionally apply the optimizations to a program in a fixed order. However, it has long been known that a single ordering of optimization phases will not produce the best code for every application. In fact, the smallest unit of compilation in most compilers is typically a function and the programmer has no control over the code improvement process other than setting flags to enable or disable certain optimization phases. This paper describes a new code improvement paradigm implemented in a system called VISTA that can help achieve the cost/performance trade-offs that embedded applications demand. The VISTA system opens the code improvement process and gives the application programmer, when necessary, the ability to finely control it. VISTA also provides support for finding effective sequences of optimization phases. This support includes the ability to interactively get static and dynamic performance information, which can be used by the developer to steer the code improvement process. This performance information is also internally used by VISTA for automatically selecting the best optimization sequence from several attempted. One such feature is the use of a genetic algorithm to search for the most efficient sequence based on specified fitness criteria. We include a number of experimental results that evaluate the effectiveness of using a genetic algorithm in VISTA to find effective optimization phase sequences.
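A genetic search over optimization-phase orders can be skeletonized as follows; the fitness function here is a toy surrogate and the phases are hypothetical, whereas VISTA's real fitness comes from measured static and dynamic performance of the compiled function:

```python
# Skeleton of a genetic algorithm over optimization-phase sequences:
# individuals are permutations of phase IDs, mutation swaps two phases,
# and elitism keeps the best candidates across generations.

import random

PHASES = list(range(6))   # 6 hypothetical optimization phases

def fitness(order):
    # Toy surrogate: reward running phase 0 early and phase 5 late.
    return order.index(5) - order.index(0)

def mutate(order):
    child = order[:]
    i, j = random.sample(range(len(child)), 2)
    child[i], child[j] = child[j], child[i]   # swap two phases
    return child

random.seed(0)   # deterministic for illustration
population = [random.sample(PHASES, len(PHASES)) for _ in range(20)]
for _ in range(50):
    population.sort(key=fitness, reverse=True)
    survivors = population[:10]                       # elitism
    population = survivors + [mutate(random.choice(survivors))
                              for _ in range(10)]

best = max(population, key=fitness)
print(best, fitness(best))
```

Replacing the toy fitness with a measured cost function (code size, cycle counts, or a weighted mix) turns the same loop into the kind of search VISTA performs.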

Journal ArticleDOI
TL;DR: An abstract RTOS model is introduced, as well as a new approach to refine an unscheduled high-level model to a high-level model with RTOS scheduling, based on the SystemC language, which enables the system designer to quickly evaluate different dynamic scheduling policies and make the optimal choice in early design stages.
Abstract: The scheduling decision for real-time embedded software applications has a great impact on system performance and, therefore, is an important issue in RTOS design. Moreover, it is highly desirable for the system designer to be able to evaluate and select the right scheduling policy at high abstraction levels, in order to allow faster exploration of the design space. In this paper, we address this problem by introducing an abstract RTOS model, as well as a new approach to refine an unscheduled high-level model to a high-level model with RTOS scheduling. This approach is based on the SystemC language and enables the system designer to quickly evaluate different dynamic scheduling policies and make the optimal choice in early design stages. Furthermore, we present a case study where our model is used to simulate and analyze a telecom system.

Journal ArticleDOI
TL;DR: This work proposes a .Net Framework-based methodology, which simplifies specification, synthesis, and validation of systems and enables the efficient creation/customization of EDA tools at low cost and development time.
Abstract: New sophisticated EDA tools and methodologies will be needed to make products viable in the future marketplace by simplifying the various design stages. These tools will permit system design at a high abstraction level and enable automatic refinement through several abstraction levels to obtain a final prototype. They will have to be based on representations that are clean, complete, and easy to manipulate. In order to develop these new EDA tools, key features such as standardization, metadata programming, reflectivity, and introspection are needed. This work proposes a .Net Framework-based methodology, which possesses all these required key features. This methodology simplifies specification, synthesis, and validation of systems and enables the efficient creation and customization of EDA tools at low cost and with short development time. We show the effectiveness of this methodology by presenting its application for the design of a new EDA tool called ESys .Net (Embedded System design with .Net). We emphasize the specification and simulation aspects of this tool.

Journal ArticleDOI
TL;DR: This paper develops energy-driven completion ratio guaranteed scheduling techniques for the implementation of embedded software on multiprocessor systems with multiple supply voltages and proposes a best-effort energy minimization algorithm (BEEM1) that achieves Qmax with the provably minimum energy consumption.
Abstract: This paper develops energy-driven completion ratio guaranteed scheduling techniques for the implementation of embedded software on multiprocessor systems with multiple supply voltages. We leverage the application's performance requirements, uncertainties in execution time, and tolerance for reasonable execution failures to scale each processor's supply voltage at run-time and reduce the multiprocessor system's total energy consumption. Specifically, we study how to trade the difference between the system's highest achievable completion ratio Qmax and the required completion ratio Q0 for energy saving. First, we propose a best-effort energy minimization algorithm (BEEM1) that achieves Qmax with the provably minimum energy consumption. We then relax its unrealistic assumption on the application's real execution time and develop algorithm BEEM2, which only requires the application's best- and worst-case execution times. Finally, we propose a hybrid offline/online completion ratio guaranteed energy minimization algorithm (QGEM) that provides the required Q0 with further energy reduction, based on the probabilistic distribution of the application's execution time. We implement the proposed algorithms and verify their energy efficiency on real-life DSP applications and the TGFF random benchmark suite. BEEM1, BEEM2, and QGEM all provide the required completion ratio, with average energy reductions of 28.7%, 26.4%, and 35.8%, respectively.
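The core trade-off — complete only enough jobs to meet Q0, each at the lowest feasible voltage — can be sketched as follows. This is a simplified model, not the BEEM/QGEM algorithms themselves: dynamic energy is taken as cycles × V², frequency is assumed to scale linearly with voltage, and voltages are kept in tenths so the arithmetic stays exact. All job parameters are invented.

```python
import math

VOLT_TENTHS = (6, 8, 10)     # normalized supply voltages 0.6, 0.8, 1.0 (in tenths)

def pick_voltage(cycles, deadline):
    """Lowest voltage whose (linearly scaled) clock still meets the deadline."""
    for vt in VOLT_TENTHS:
        if cycles * 10 <= deadline * vt:   # time = cycles / (vt/10) <= deadline
            return vt
    return None                             # infeasible even at Vmax

def energy(cycles, vt):
    return cycles * vt * vt                 # dynamic energy ~ cycles * V^2 (scaled units)

def min_energy(jobs, q0):
    """Complete only ceil(q0 * n) jobs -- the cheapest ones to finish -- and
    run each at its lowest feasible voltage. Assumes every job is feasible
    at Vmax; jobs = [(cycles, deadline)]."""
    keep = math.ceil(q0 * len(jobs))
    costs = sorted(energy(c, pick_voltage(c, d)) for c, d in jobs)
    return sum(costs[:keep])
```

For three jobs of 60, 80, and 100 cycles with a common deadline of 100 time units, requiring only a 2/3 completion ratio lets the scheduler drop the job that can only run at Vmax, cutting total energy well below the Q=1 case.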

Journal ArticleDOI
TL;DR: Three compiler-directed techniques that take advantage of schedule slacks to optimize leakage and dynamic energy consumption are presented and a unified energy-optimization strategy that integrates both dynamic and leakage energy-reduction schemes is provided.
Abstract: The mobile computing device market has been growing rapidly. This brings the technologies that optimize system energy to the forefront. As circuits continue to scale in the future, it will be important to optimize both leakage and dynamic energy. Effective optimization of leakage and dynamic energy consumption requires a vertical integration of techniques spanning from the circuit to the software level. Schedule slacks in codes executing on VLIW architectures present an opportunity for such an integration. In this paper, we present three compiler-directed techniques that take advantage of schedule slacks to optimize leakage and dynamic energy consumption. Integer ALU (IALU) components operating with multiple supply voltages are designed to provide different low-energy versions with different operational latencies. The goal of the first technique explored is to maximize the number of operations mapped to IALU components with the lowest energy consumption without extending the schedule length. We also consider a variant of this technique that saves more energy at the cost of some performance loss. The second technique uses two leakage-control mechanisms to reduce leakage energy consumption when no operations are scheduled on the component. Our evaluation of these two approaches, using fifteen benchmarks, shows that, depending on the number and duration of slacks, the availability of low-energy functional units, and the relative magnitudes of leakage and dynamic energy, optimizing either leakage or dynamic energy consumption can provide the larger energy gains. Finally, we provide a unified energy-optimization strategy that integrates both dynamic and leakage energy-reduction schemes. The proposed techniques have been incorporated into a cycle-accurate simulator using parameters extracted from circuit-level simulation. Our results show that the unified scheme generates better results than using either the dynamic or the leakage energy-reduction technique independently.
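The first technique's core decision — map an operation to a slower, lower-energy IALU variant only when its slack absorbs the extra latency — can be sketched like this. The variant latencies and energy numbers are invented for illustration.

```python
# IALU variants: (latency in cycles, energy per op); low-voltage version first.
VARIANTS = ((2, 5), (1, 10))

def map_ops(slacks):
    """For each IALU operation, pick the lowest-energy variant whose extra
    latency fits in the operation's schedule slack, so the overall schedule
    length is unchanged. slacks = slack cycles per operation."""
    latencies, total = [], 0
    for slack in slacks:
        for latency, e in VARIANTS:
            if latency - 1 <= slack:          # extra cycles vs. the fastest unit
                latencies.append(latency)
                total += e
                break
    return latencies, total

print(map_ops([0, 1, 3]))
```

With slacks of 0, 1, and 3 cycles, two of three operations move to the slow, low-voltage unit and total dynamic energy drops from 30 to 20 units with no schedule extension; the performance-losing variant mentioned above would relax the slack condition.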

Journal ArticleDOI
TL;DR: This study creates the NetBench benchmarking suite, designed to represent Network Processor workloads, and compares key characteristics, such as instructions per cycle, instruction distribution, cache behavior, and branch prediction accuracy with the programs from MediaBench.
Abstract: The Network Processor market is one of the fastest growing segments of the microprocessor industry today. In spite of this increasing market importance, there does not exist a common framework to compare the performance of different Network Processor designs. Our primary goal in this study is to fill this gap by creating the NetBench benchmarking suite. NetBench is designed to represent Network Processor workloads. It contains 11 programs that form 18 different applications. The programs are selected from all levels of packet processing: Small, low-level code fragments as well as large application-level programs are included in the suite. These applications are representative of the Network Processor applications in the market. Using the SimpleScalar simulator to model an ARM processor, we study these programs in detail and compare key characteristics, such as instructions per cycle, instruction distribution, cache behavior, and branch prediction accuracy, with the programs from MediaBench. Using statistical analysis, we show that the simulation results for the programs in NetBench have significantly different characteristics than programs in MediaBench. Finally, we present performance measurements from the Intel IXP1200 Network Processor to show how NetBench can be utilized.

Journal ArticleDOI
TL;DR: This work proposes a design space exploration technique for configurable multiprocessor platforms using arithmetic-level cycle-accurate hardware--software cosimulation, which significantly speeds up the cosimulation process for these platforms.
Abstract: Configurable multiprocessor platforms consist of multiple soft processors configured on FPGA devices. They have become an attractive choice for implementing many computing applications. In addition to the various ways of distributing software execution among the multiple soft processors, the application designer can customize soft processors and the connections between them in order to improve the performance of the applications running on the multiprocessor platform. State-of-the-art design tools rely on low-level simulation to explore the various design trade-offs offered by configurable multiprocessor platforms. These low-level, simulation-based exploration techniques are too time-consuming and can be a major bottleneck to efficient design space exploration on these platforms. We propose a design space exploration technique for configurable multiprocessor platforms using arithmetic-level cycle-accurate hardware--software cosimulation. Arithmetic-level abstractions of the hardware and software execution platforms are created within the proposed cosimulation environment. The configurable multiprocessor platforms are described using these arithmetic-level abstractions. Hardware and software simulators are tightly integrated to concurrently simulate the arithmetic behavior of the multiprocessor platform. The simulations within the integrated simulators are synchronized to provide cycle-accurate simulation results for the complete multiprocessor platform. By doing so, we significantly speed up the cosimulation process for configurable multiprocessor platforms. Exploration of the various hardware-software design trade-offs provided by configurable multiprocessor platforms can be performed within the proposed cycle-accurate cosimulation environment. After the final designs are identified, the corresponding low-level implementations with the desired cycle-accurate arithmetic behavior are generated automatically.
For illustrative purposes, we provide an implementation of our approach based on MATLAB/Simulink. We show the cosimulation of two numerical computation applications and one image-processing application on a popular configurable multiprocessor platform within the MATLAB/Simulink-based cosimulation environment. For these three applications, our arithmetic-level cosimulation approach leads to simulation-time speedups of more than 800x compared with the low-level simulation approaches. The designs of these applications identified using our arithmetic-level cosimulation approach achieve execution-time speedups of up to 5.6x, compared with other designs considered in our experiments.

Journal ArticleDOI
TL;DR: This work presents a compiler-based optimization strategy for reducing code size in embedded systems that maximizes the use of indirect addressing modes with postincrement/decrement capabilities available in DSP processors.
Abstract: In DSP processors, minimizing the amount of address calculations is critical for reducing code size and improving performance, since studies of programs have shown that instructions that manipulate address registers constitute a significant portion of the overall instruction count (up to 55%). This work presents a compiler-based optimization strategy to reduce the code size in embedded systems. Our strategy maximizes the use of indirect addressing modes with postincrement/decrement capabilities available in DSP processors. These modes can be exploited by ensuring that successive references to variables access consecutive memory locations. To achieve this spatial locality, our approach uses both access pattern modification (program code restructuring) and memory storage reordering (data layout restructuring). Experimental results on a set of benchmark codes show the effectiveness of our solution and indicate that our approach outperforms previous approaches to the problem. In addition to yielding significant reductions in instruction memory (storage) requirements, the proposed technique improves execution time.
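The benefit of storage reordering can be illustrated with a toy cost model: a transition between two consecutive variable accesses is free when the variables occupy adjacent memory slots (covered by a postincrement/decrement), and otherwise costs one explicit address-register update. This sketch brute-forces the best layout for a tiny invented example; the paper's approach uses heuristics plus access-pattern (code) restructuring instead.

```python
from itertools import permutations

def addressing_cost(layout, access_seq):
    """Explicit address-register updates needed under a given storage layout:
    a transition between consecutive accesses is free when the two variables
    sit in adjacent memory slots (reachable via postincrement/decrement)."""
    pos = {v: i for i, v in enumerate(layout)}
    return sum(1 for a, b in zip(access_seq, access_seq[1:])
               if abs(pos[a] - pos[b]) > 1)

def best_layout(variables, access_seq):
    # Exhaustive search is fine for a handful of variables; real compilers
    # use offset-assignment heuristics instead.
    return min(permutations(variables),
               key=lambda layout: addressing_cost(layout, access_seq))
```

For the access sequence a, c, a, c, b, d, b, d, the naive alphabetical layout needs 6 explicit address updates, while a layout that places frequently co-accessed variables next to each other needs none.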

Journal ArticleDOI
TL;DR: This article explores the possibility of formally verifying a class loader for the SSP implemented in the strategic programming language TL; an implementation of the core activities of an abstract class loader is presented, and its verification in ACL2 is considered.
Abstract: The SSP is a hardware implementation of a subset of the JVM for use in high-consequence embedded applications. In this context, a majority of the activities belonging to class loading, as it is defined in the specification of the JVM, can be performed statically. Static class loading has the net result of dramatically simplifying the design of the SSP, as well as increasing its performance. Because of the high-consequence nature of its applications, strong evidence must be provided that all aspects of the SSP have been implemented correctly. This includes the class loader. This article explores the possibility of formally verifying a class loader for the SSP implemented in the strategic programming language TL. Specifically, an implementation of the core activities of an abstract class loader is presented and its verification in ACL2 is considered.

Journal ArticleDOI
TL;DR: This paper provides software- and hardware-based solutions detecting both illegal references across the application memory spaces and dangling pointers within an application space, as well as an approach to divide/share the memory among the applications executing concurrently in the system.
Abstract: Our objective is to adapt the Java memory management to an embedded system, e.g., a wireless PDA executing concurrent multimedia applications within a single JVM. This paper provides software- and hardware-based solutions that detect both illegal references across the application memory spaces and dangling pointers within an application space. We give an approach to divide/share the memory among the applications executing concurrently in the system. We introduce and define application-specific memory, building upon the real-time specification for Java (RTSJ) from the real-time Java expert group. The memory model used in RTSJ imposes strict rules for assignment between memory areas, preventing the creation of dangling pointers, and thus maintaining the pointer safety of Java. Our implementation solution to ensure the checking of these rules before each assignment inserts write barriers that use a stack-based algorithm. This solution adversely affects both the performance and predictability of the RTSJ applications, which can be improved by using existing hardware support.
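A minimal sketch of the RTSJ assignment rule that such write barriers enforce, assuming a single stack of nested scopes so the check reduces to comparing nesting depths (the real stack-based algorithm handles the general scope structure):

```python
class ScopedArea:
    """A memory area on the current scope stack; depth 0 is the outermost
    (longest-lived, e.g. immortal/heap) area."""
    def __init__(self, name, depth):
        self.name, self.depth = name, depth

class IllegalAssignmentError(Exception):
    pass

def write_barrier(target, value):
    """RTSJ-style assignment rule, checked before every reference store:
    an object may only point to objects in the same scope or an outer
    (longer-lived) one, so a referent can never be reclaimed while still
    referenced -- no dangling pointers."""
    if value.depth > target.depth:
        raise IllegalAssignmentError(
            f"{target.name} may not reference inner-scope {value.name}")
```

Inlining such a check before each assignment is what costs performance and predictability; a hardware-assisted barrier performs the same comparison without the instruction overhead.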

Journal ArticleDOI
TL;DR: This work presents the STI transformations and examines how well the compiler integrates threads for two display applications, covering the integration procedure, the processor load, and code memory expansion.
Abstract: Embedded systems require control of many concurrent real-time activities, leading to system designs that feature a variety of hardware peripherals, with each providing a specific, dedicated service. These peripherals increase system size, cost, weight, and design time. Software thread integration (STI) provides low-cost thread concurrency on general-purpose processors by automatically interleaving multiple threads of control into one. This simplifies hardware to software migration (which eliminates dedicated hardware) and can help embedded system designers meet design constraints, such as size, weight and cost. We have developed concepts for performing STI and have implemented many in our automated postpass compiler Thrint. Here we present the transformations and examine how well the compiler integrates threads for two display applications. We examine the integration procedure, the processor load, and code memory expansion. Integration allows reclamation of CPU idle time, allowing run-time speedups of 1.6x to 3.6x.
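The flavor of STI can be conveyed with Python generators: two logical threads are interleaved into one sequential control flow. Note that real STI performs this fusion at compile time on the instruction level, producing a single function with no scheduler or run-time context switches; this run-time sketch (with invented threads) only illustrates the interleaving.

```python
def thread_a(log):
    for i in range(3):
        log.append(("A", i))       # real STI interleaves machine instructions;
        yield                      # here 'yield' stands in for an integration point

def thread_b(log):
    for i in range(2):
        log.append(("B", i))
        yield

def integrate(*threads):
    """Round-robin interleaving of generator 'threads' into one sequential
    control flow -- a run-time stand-in for STI's compile-time fusion."""
    active = list(threads)
    while active:
        for t in list(active):
            try:
                next(t)
            except StopIteration:
                active.remove(t)

log = []
integrate(thread_a(log), thread_b(log))
print(log)
```

The interleaved trace alternates A and B work within one flow of control, which is how integrated threads reclaim what would otherwise be CPU idle time.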

Journal ArticleDOI
TL;DR: The main benefits of concurrent hardware and software design methods are reduced design time and cost, in addition to easier handling of complexity.
Abstract: System-on-Chip (SoC) design complexity threatens continuation of current design schemes. In fact, designing SoCs requires concurrent design of complex embedded software and a sophisticated hardware platform that may include several heterogeneous CPU subsystems. The lack of early coordination between different teams belonging to different cultures causes delay and cost overheads that are no longer acceptable for the design of embedded systems. Programming models have been used to coordinate software and hardware communities for the design of classic computers. A programming model provides an abstraction of HW–SW interfaces and allows concurrent design of complex systems composed of sophisticated software and hardware platforms. Examples include APIs at different abstraction levels, RTOS libraries, and drivers, typically summarized as hardware-dependent software. This abstraction smoothes the design flow and eases interaction between different teams belonging to different cultures: hardware, software, and system architecture. The abstract hardware/software interface model defined in the early stage of the design process will then act as a contract between separate teams that may work concurrently to develop the hardware and the software parts. In addition, this scheme eases the integration phase, since both hardware and software have been developed to comply with a well-defined interface. The main benefits of concurrent hardware and software design methods are reduced design time and cost, in addition to easier handling of complexity.

Journal ArticleDOI
TL;DR: A post-register-allocation solution to merge the generated load/store instructions into their parallel counterparts using a multipass approach; the coloring problem for the MSG is proven NP-complete and solved with two heuristic algorithms of differing complexity.
Abstract: Many modern embedded processors such as DSPs support partitioned memory banks (also called X--Y memory or dual-bank memory) along with parallel load/store instructions to achieve higher code density and performance. In order to effectively utilize the parallel load/store instructions, the compiler must partition the memory-resident values and assign them to the X or Y bank. This paper gives a post-register-allocation solution to merge the generated load/store instructions into their parallel counterparts. Simultaneously, our framework performs allocation of values to X or Y memory banks. We first remove as many load/stores and register--register moves as possible through an excellent iterated coalescing based register allocator by Appel and George [1996]. We then attempt to parallelize the generated load/stores using a multipass approach. The basic phase of our approach attempts the merger of load/stores without duplication and web splitting. We model this problem as a graph-coloring problem in which each value is colored as either X or Y. We then construct a motion scheduling graph (MSG), based on the range of motion for each load/store instruction. MSG reflects potential instructions that could be merged. We propose a notion of pseudofixed boundaries so that the load/store movement is less affected by register dependencies. We prove that the coloring problem for MSG is NP-complete and solve it with two heuristic algorithms of differing complexity. We then propose a two-level iterative process to attempt instruction duplication, variable duplication, web splitting, and local conflict elimination to effectively merge the remaining load/stores. Finally, we clean up some multiple-aliased load/stores. To improve performance, we incorporate profiling information at each stage, coupled with some modifications to the algorithm.
We show that our framework results in parallelization of a large number of load/stores without much growth in data and code segments. The average speedup for our optimization pass reaches roughly 13% if no profile information is available and 17% with profile information. The average code and data segment growth is controlled within 13%.
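The bank-assignment step can be sketched as greedy 2-coloring: two values that we would like to access with a single parallel load/store must land in different banks. This is a toy stand-in for the paper's MSG-based heuristics, with invented value names.

```python
def assign_banks(values, pairs):
    """Greedy 2-coloring of values into X/Y banks. pairs = value pairs we
    want to merge into one parallel load/store; each such pair must end up
    in different banks."""
    bank = {}
    for v in values:
        # Banks already chosen for this value's partners in candidate pairs.
        neighbours = {bank.get(u) for a, b in pairs if v in (a, b)
                      for u in (a, b) if u != v}
        bank[v] = "Y" if "X" in neighbours else "X"
    return bank

def merged(bank, pairs):
    # Pairs whose operands sit in different banks can be fused into a
    # single parallel load/store instruction.
    return sum(1 for a, b in pairs if bank[a] != bank[b])
```

Because the underlying coloring problem is NP-complete, a greedy pass like this can leave some pairs unmerged in general; the paper's iterative phases (duplication, web splitting, conflict elimination) exist precisely to recover those cases.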