
Showing papers in "ACM Transactions on Embedded Computing Systems" in 2008


Journal ArticleDOI
TL;DR: Different approaches to the determination of upper bounds on execution times are described, and several commercially available tools and research prototypes are surveyed.
Abstract: The determination of upper bounds on execution times, commonly called worst-case execution times (WCETs), is a necessary step in the development and validation process for hard real-time systems. This problem is hard if the underlying processor architecture has components such as caches, pipelines, branch prediction, and other speculative features. This article describes different approaches to this problem and surveys several commercially available tools and research prototypes.
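As an illustration of the core computation (a deliberately simplified sketch, not any surveyed tool's algorithm): on a loop-free control-flow graph with safe per-block execution-time bounds, an upper bound on the program's execution time is a longest path. Real analyzers add loop bounds, cache and pipeline models, and typically solve an integer linear program instead.

```python
from functools import lru_cache

def wcet_upper_bound(edges, cost, entry, exit):
    """Toy static WCET bound: longest path over an acyclic CFG whose nodes
    carry per-basic-block worst-case execution times."""
    succs = {}
    for u, v in edges:
        succs.setdefault(u, []).append(v)

    @lru_cache(maxsize=None)
    def longest(block):
        if block == exit:
            return cost[exit]
        # Worst case: this block's cost plus the worst successor path.
        return cost[block] + max(longest(s) for s in succs[block])

    return longest(entry)
```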

1,946 citations


Journal ArticleDOI
TL;DR: This paper introduces the periodic resource model to characterize resource allocations provided to a single component and presents exact schedulability conditions for the standard Liu and Layland periodic task model and the proposed periodic resource models under EDF and RM scheduling.
Abstract: It is desirable to develop large complex systems using components based on systematic abstraction and composition. Our goal is to develop a compositional real-time scheduling framework to support abstraction and composition techniques for real-time aspects of components. In this paper, we present a formal description of compositional real-time scheduling problems, which are the component abstraction and composition problems. We identify issues that need to be addressed by solutions and provide our framework for the solutions, which is based on the periodic interface. Specifically, we introduce the periodic resource model to characterize resource allocations provided to a single component. We present exact schedulability conditions for the standard Liu and Layland periodic task model and the proposed periodic resource model under EDF and RM scheduling, and we show that the component abstraction and composition problems can be addressed with periodic interfaces through the exact schedulability conditions. We also provide the utilization bounds of a periodic task set over the periodic resource model and the abstraction bounds of periodic interfaces for a periodic task set under EDF and RM scheduling. We finally present the analytical bounds of overheads that our solution incurs in terms of resource utilization increase and evaluate the overheads through simulations.
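The flavor of the exact conditions can be sketched numerically. The code below assumes Shin and Lee's supply bound function for a periodic resource (Π, Θ) and implicit-deadline periodic tasks; the paper's exact formulation may differ in details.

```python
import math

def sbf(t, period, budget):
    """Worst-case supply of a periodic resource (Pi=period, Theta=budget)
    over any interval of length t (Shin & Lee-style bound)."""
    blackout = period - budget          # longest initial starvation
    if t < blackout:
        return 0.0
    k = math.floor((t - blackout) / period)
    return k * budget + max(0.0, t - 2 * blackout - k * period)

def dbf_edf(t, tasks):
    """EDF demand bound of implicit-deadline periodic tasks (T, C) in t."""
    return sum(math.floor(t / T) * C for T, C in tasks)

def edf_schedulable(tasks, period, budget):
    """Exact test: demand never exceeds supply at any deadline up to the
    hyperperiod (sufficient set of check points for this task model)."""
    hyper = math.lcm(*(T for T, _ in tasks))
    deadlines = sorted({k * T for T, _ in tasks
                        for k in range(1, hyper // T + 1)})
    return all(dbf_edf(t, tasks) <= sbf(t, period, budget) for t in deadlines)
```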

228 citations


Journal ArticleDOI
Chanik Park1, Won-Moon Cheon1, Jeong-Uk Kang1, Kangho Roh1, Wonhee Cho1, Jin-Soo Kim2 
TL;DR: This paper categorizes dominant parameters that affect performance and endurance and explores the design space of the FTL architecture based on a diverse workload analysis to decide which configuration of FTL mapping parameters yields the best performance depending on each NAND flash application behavior.
Abstract: In this article, a novel FTL (flash translation layer) architecture is proposed for NAND flash-based applications such as MP3 players, DSCs (digital still cameras) and SSDs (solid-state drives). Although the basic function of an FTL is to translate a logical sector address to a physical sector address in flash memory, efficient algorithms of an FTL have a significant impact on performance as well as the lifetime. After the dominant parameters that affect the performance and endurance are categorized, the design space of the FTL architecture is explored based on a diverse workload analysis. With the proposed FTL architectural framework, it is possible to decide which configuration of FTL mapping parameters yields the best performance, depending on the differing characteristics of various NAND flash-based applications.
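The basic translation can be sketched with a toy page-level FTL (an illustrative assumption, not the proposed architecture): out-of-place updates leave stale physical pages behind, which is why the choice of mapping and garbage-collection parameters dominates performance and endurance.

```python
class PageMappedFTL:
    """Toy page-level FTL: logical pages map to physical pages; updates go
    out-of-place and invalidate the old copy (garbage collection omitted)."""

    def __init__(self, num_physical_pages):
        self.l2p = {}                        # logical page -> physical page
        self.free = list(range(num_physical_pages))
        self.invalid = set()                 # stale pages awaiting erase

    def write(self, lpn, data, flash):
        ppn = self.free.pop(0)               # NAND pages cannot be overwritten
        flash[ppn] = data
        if lpn in self.l2p:
            self.invalid.add(self.l2p[lpn])  # old copy becomes stale
        self.l2p[lpn] = ppn

    def read(self, lpn, flash):
        return flash[self.l2p[lpn]]
```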

212 citations


Journal ArticleDOI
TL;DR: This paper explores the design and implementation of BORPH, an operating system designed for FPGA-based reconfigurable computers, together with a Simulink-based design flow that integrates with BORPH.
Abstract: This paper explores the design and implementation of BORPH, an operating system designed for FPGA-based reconfigurable computers. Hardware designs execute as normal UNIX processes under BORPH, having access to standard OS services, such as file system support. Hardware and software components of user designs may, therefore, run as communicating processes within BORPH's runtime environment. The familiar language-independent UNIX kernel interface facilitates easy design reuse and rapid application development. To develop hardware designs, a Simulink-based design flow that integrates with BORPH is employed. The performance of BORPH on two on-chip systems implemented on a BEE2 platform is compared.

141 citations


Journal ArticleDOI
TL;DR: This article shows how the AADL (Architecture Analysis and Design Language), which appeared in late 2004, helps solve these issues through a dedicated tool suite, and details the prototyping process and its current implementation, Ocarina.
Abstract: Building distributed real-time embedded systems requires a stringent methodology, from early requirement capture to full implementation. However, there is a strong link between the requirements and the final implementation (e.g., scheduling and resource dimensioning). Therefore, a rapid prototyping process based on automation of tedious and error-prone tasks (analysis and code generation) is required to speed up the development cycle. In this article, we show how the AADL (Architecture Analysis and Design Language), which appeared in late 2004, helps solve these issues thanks to a dedicated tool suite. We then detail the prototyping process and its current implementation: Ocarina.

128 citations


Journal ArticleDOI
TL;DR: This paper uses system-level energy consideration to derive the "optimal" scaling factor by which a task should be scaled if there are no deadline constraints and develops dynamic task-scheduling algorithms that make use of dynamic processor utilization and the optimal scaling factor to determine the speed setting of a task.
Abstract: Dynamic voltage scaling (DVS) is a well-known low-power design technique that reduces the processor energy by slowing down the DVS processor and stretching the task execution time. However, in a DVS system consisting of a DVS processor and multiple devices, slowing down the processor increases the device energy consumption and thereby the system-level energy consumption. In this paper, we first use system-level energy consideration to derive the "optimal" scaling factor by which a task should be scaled if there are no deadline constraints. Next, we develop dynamic task-scheduling algorithms that make use of dynamic processor utilization and the optimal scaling factor to determine the speed setting of a task. We present algorithm duEDF, which reduces the CPU energy consumption, and algorithm duSYS and its reduced-preemption version, duSYS_PC, which reduce the system-level energy. Experimental results on the video-phone task set show that when the CPU power is dominant, algorithm duEDF results in up to 45% energy savings compared to the non-DVS case. When the CPU power and device power are comparable, algorithms duSYS and duSYS_PC achieve up to 25% energy saving compared to the CPU-energy-efficient algorithm duEDF, and up to 12% energy saving over the non-DVS scheduling algorithm. However, if the device power is large compared to the CPU power, then we show that a DVS scheme does not result in the lowest energy. Finally, a comparison of the performance of algorithms duSYS and duSYS_PC shows that preemption control has minimal effect on system-level energy reduction.
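Assuming a cubic CPU power model P_cpu(s) = a·s³ and constant device power p_dev (an illustrative model, not necessarily the paper's), running C cycles at speed s costs E(s) = a·s²·C + p_dev·C/s, which has a closed-form minimizer:

```python
def optimal_scaling_factor(a, p_dev, s_min=0.1, s_max=1.0):
    """Speed minimizing E(s) = a*s^2*C + p_dev*C/s (the cycle count C
    cancels).  Setting dE/ds = 0 gives s* = (p_dev / (2a))**(1/3),
    clamped to the processor's feasible speed range."""
    s_star = (p_dev / (2.0 * a)) ** (1.0 / 3.0)
    return min(max(s_star, s_min), s_max)
```

Note that for large p_dev the unconstrained minimizer exceeds 1 and clamps to full speed, echoing the paper's observation that DVS does not yield the lowest energy when device power dominates.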

109 citations


Journal ArticleDOI
TL;DR: A design space exploration strategy together with a fast method for system performance analysis and an optimisation heuristic for task priority assignment and task mapping in the context of multiprocessor applications with stochastic execution times and in the presence of constraints on the percentage of missed deadlines.
Abstract: Both analysis and design optimisation of real-time systems have predominantly concentrated on hard real-time constraints. For a large class of applications, however, this is both unrealistic and leads to unnecessarily expensive implementations. This paper addresses the problem of task priority assignment and task mapping in the context of multiprocessor applications with stochastic execution times and in the presence of constraints on the percentage of missed deadlines. We propose a design space exploration strategy together with a fast method for system performance analysis. Experiments emphasize the efficiency of the proposed analysis method and optimisation heuristic in generating high-quality implementations of soft real-time systems with stochastic task execution times and constraints on deadline miss ratios.

54 citations


Journal ArticleDOI
TL;DR: This work proposes to replace the on-chip instruction cache with a scratchpad memory (SPM) and a small minicache and shows that by using the data cache as a victim buffer for the SPM, significant energy savings are possible.
Abstract: In this work, we present a dynamic memory allocation technique for a novel, horizontally partitioned memory subsystem targeting contemporary embedded processors with a memory management unit (MMU). We propose to replace the on-chip instruction cache with a scratchpad memory (SPM) and a small minicache. Serializing the address translation with the actual memory access enables the memory system to access either only the SPM or the minicache. Independent of the SPM size and based solely on profiling information, a postpass optimizer classifies the code of an application binary into a pageable and a cacheable code region. The latter is placed at a fixed location in the external memory and cached by the minicache. The former, the pageable code region, is copied on demand to the SPM before execution. Both the pageable code region and the SPM are logically divided into pages the size of an MMU memory page. Using the MMU's page-fault exception mechanism, a runtime scratchpad memory manager (SPMM) tracks page accesses and copies frequently executed code pages to the SPM before they get executed. In order to minimize the number of page transfers from the external memory to the SPM, good code placement techniques become more important with increasing sizes of the MMU pages. We discuss code-grouping techniques and provide an analysis of the effect of the MMU's page size on execution time, energy consumption, and external memory accesses. We show that by using the data cache as a victim buffer for the SPM, significant energy savings are possible. We evaluate our SPM allocation strategy with fifteen applications, including H.264, MP3, MPEG-4, and PGP. The proposed memory system requires 8% less die area compared to a fully cached configuration. On average, we achieve a 31% improvement in runtime performance and a 35% reduction in energy consumption with an MMU page size of 256 bytes.
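The runtime manager behaves like a page cache in front of the SPM. The sketch below uses plain LRU eviction, which is an assumption for illustration; the paper instead drives placement with profiling information and code-grouping techniques.

```python
from collections import OrderedDict

class ScratchpadPager:
    """Toy SPMM: on a page fault, copy the code page into the SPM, evicting
    the least-recently-used page when the SPM is full."""

    def __init__(self, spm_pages):
        self.spm = OrderedDict()    # resident pages, in LRU -> MRU order
        self.capacity = spm_pages
        self.transfers = 0          # copies from external memory to SPM

    def access(self, page):
        if page in self.spm:
            self.spm.move_to_end(page)    # hit: no transfer needed
            return
        if len(self.spm) >= self.capacity:
            self.spm.popitem(last=False)  # evict the LRU page
        self.spm[page] = True
        self.transfers += 1               # fault: copy page into the SPM
```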

47 citations


Journal ArticleDOI
TL;DR: In this article, the authors designed scalable deep-packet filters on field-programmable gate arrays (FPGAs) to search for all data-independent patterns simultaneously, which can scale linearly to support a greater number of patterns, as well as higher data throughput.
Abstract: Most network routers and switches provide some protection against network attacks. However, the rapidly increasing amount of damages reported over the past few years indicates the urgent need for tougher security. Deep-packet inspection is one of the solutions to capture packets that cannot be identified using the traditional methods. It uses a list of signatures to scan the entire content of the packet, providing the means to filter harmful packets out of the network. Since one signature does not depend on the other, the filtering process has a high degree of parallelism. Most software and hardware deep-packet filters that are in use today execute the tasks under a von Neumann architecture. Such an architecture cannot fully exploit this parallelism. For instance, one of the most widely used network intrusion-detection systems, Snort, configured with 845 patterns, running on a dual 1-GHz Pentium III system, can sustain a throughput of only 50 Mbps. The poor performance stems from the fact that the processor is programmed to execute several tasks sequentially instead of simultaneously. We designed scalable deep-packet filters on field-programmable gate arrays (FPGAs) to search for all data-independent patterns simultaneously. With FPGAs, we have the ability to reprogram the filter when there are any changes to the signature set. The smallest full-pattern matcher implementation for the latest Snort NIDS fits in a single 400k Xilinx FPGA (Spartan 3-XC3S400) with a sustained throughput of 1.6 Gbps. Given a larger FPGA, the design can scale linearly to support a greater number of patterns, as well as higher data throughput.
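A software model makes the exploited parallelism explicit: every signature is tested at every byte offset. Sequentially this is slow, which is precisely the point; on the FPGA, all of these comparisons run concurrently.

```python
def scan_packet(payload, signatures):
    """Software model of the parallel FPGA matchers: each signature is
    checked at every byte offset of the payload.  In hardware, all of
    these comparators operate in the same clock cycle rather than in a
    sequential loop."""
    hits = set()
    for sig in signatures:
        for i in range(len(payload) - len(sig) + 1):
            if payload[i:i + len(sig)] == sig:
                hits.add(sig)      # signature found somewhere in the packet
                break
    return hits
```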

43 citations


Journal ArticleDOI
TL;DR: An intertask communication protocol, called DBP, that is semantics-preserving and memory-optimal, which guarantees semantical preservation under all possible triggering patterns of the synchronous program and works under both fixed priority and earliest-deadline first scheduling.
Abstract: We study the implementation of a synchronous program as a set of multiple tasks running on the same computer, and scheduled by a real-time operating system using some preemptive scheduling policy, such as fixed priority or earliest-deadline first. Multitask implementations are necessary, for instance, in multiperiodic applications, when the worst-case execution time of the program is larger than its smallest period. In this case, a single-task implementation violates the schedulability assumption and, therefore, the synchrony hypothesis does not hold. We are aiming at semantics-preserving implementations, where, for a given input sequence, the output sequence produced by the implementation is the same as that produced by the original synchronous program, and this under all possible executions of the implementation. Straightforward implementation techniques are not semantics-preserving. We present an intertask communication protocol, called DBP, that is semantics-preserving and memory-optimal. DBP guarantees semantical preservation under all possible triggering patterns of the synchronous program: thus, it is applicable not only to time-, but also event-triggered applications. DBP works under both fixed priority and earliest-deadline first scheduling. DBP is a nonblocking protocol based on the use of intermediate buffers and manipulations of write-to/read-from pointers to these buffers: these manipulations happen upon arrivals, rather than executions of tasks, which is a distinguishing feature of DBP. DBP is memory-optimal in the sense that it uses as few buffers as needed, for any given triggering pattern. In the worst case, DBP requires, at most, N + 2 buffers for each writer, where N is the number of readers for this writer.

43 citations


Journal ArticleDOI
TL;DR: This paper shows how this causality information can be algebraically composed so that compositions of components acquire causality interfaces that are inferred from their components and the interconnections and gives a conservative approximation technique for handling dynamically changing causality properties.
Abstract: We consider concurrent models of computation where “actors” (components that are in charge of their own actions) communicate by exchanging messages. The interfaces of actors principally consist of “ports,” which mediate the exchange of messages. Actor-oriented architectures contrast with and complement object-oriented models by emphasizing the exchange of data between concurrent components rather than transformation of state. Examples of such models of computation include the classical actor model, synchronous languages, data-flow models, process networks, and discrete-event models. Many experimental and production languages used to design embedded systems are actor oriented and based on one of these models of computation. Many of these models of computation benefit considerably from having access to causality information about the components. This paper augments the interfaces of such components to include such causality information. It shows how this causality information can be algebraically composed so that compositions of components acquire causality interfaces that are inferred from their components and the interconnections. We illustrate the use of these causality interfaces to statically analyze timed models and synchronous language compositions for causality loops and data-flow models for deadlock. We also show that causality analysis for each communication cycle can be performed independently and in parallel, and it is only necessary to analyze one port for each cycle. Finally, we give a conservative approximation technique for handling dynamically changing causality properties.

Journal ArticleDOI
TL;DR: An algebra of tag structures is introduced to define heterogeneous parallel composition formally and Morphisms between tag structures are used to define relationships between heterogeneous models at different levels of abstraction.
Abstract: We present a compositional theory of heterogeneous reactive systems. The approach is based on the concept of tags marking the events of the signals of a system. Tags can be used for multiple purposes from indexing evolution in time (time stamping) to expressing relations among signals, like coordination (e.g., synchrony and asynchrony) and causal dependencies. The theory provides flexibility in system modeling because it can be used both as a unifying mathematical framework to relate heterogeneous models of computations and as a formal vehicle to implement complex systems by combining heterogeneous components. In particular, we introduce an algebra of tag structures to define heterogeneous parallel composition formally. Morphisms between tag structures are used to define relationships between heterogeneous models at different levels of abstraction. In particular, they can be used to represent design transformations from tightly synchronized specifications to loosely-synchronized implementations. The theory has an important application in the correct-by-construction deployment of synchronous design on distributed architectures.

Journal ArticleDOI
TL;DR: This paper proposes a surveillance service for sensor networks based on a distributed energy-efficient sensing coverage protocol that outperforms other state-of-the-art schemes by as much as 50% reduction in energy consumption and as much as 130% increase in the half-life of the network.
Abstract: For many sensor network applications, such as military surveillance, it is necessary to provide full sensing coverage to a security-sensitive area while, at the same time, minimizing energy consumption and extending system lifetime by leveraging the redundant deployment of sensor nodes. In this paper, we propose a surveillance service for sensor networks based on a distributed energy-efficient sensing coverage protocol. In the protocol, each node is able to dynamically decide a schedule for itself to guarantee a certain degree-of-coverage (DOC) with average energy consumption inversely proportional to the node density. Several optimizations and extensions are proposed to enhance the basic design with a better load-balance feature and a longer network lifetime. We consider and address the impact of the target size and the unbalanced initial energy capacity of individual nodes on the network lifetime. Several practical issues such as the localization error, irregular sensing range, and unreliable communication links are addressed as well. Simulation shows that our protocol extends system lifetime significantly with low energy consumption. It outperforms other state-of-the-art schemes by as much as 50% reduction in energy consumption and as much as 130% increase in the half-life of the network.

Journal ArticleDOI
TL;DR: A platform-based software design flow able to efficiently use the resources of the architecture and allowing easy experimentation of several mappings of the application onto the platform resources is proposed, which increases productivity and preserves design quality.
Abstract: Current multimedia applications demand complex heterogeneous multiprocessor architectures with specific communication infrastructure in order to achieve the required performance. Programming these architectures usually results in writing separate low-level code for the different processors (DSP, microcontroller), implying late global validation of the overall application with the hardware platform. We propose a platform-based software design flow able to efficiently use the resources of the architecture and allowing easy experimentation with several mappings of the application onto the platform resources. We use a high-level environment to capture both application and architecture initial representations. An executable software stack is generated automatically for each processor from the initial model. The software generation and validation is performed gradually, corresponding to different software abstraction levels. Specific software development platforms (abstract models of the architecture) are generated and used to allow debugging of the different software components with explicit hardware-software interaction. We applied this approach on a multimedia platform, involving a high-performance DSP and a RISC processor, to explore the communication architecture and generate efficient executable code for a multimedia application. Based on automatic tools, the proposed flow increases productivity and preserves design quality.

Journal ArticleDOI
TL;DR: It is proved that system-wide energy optimization for sporadic tasks is NP-hard in the strong sense and (pseudo-) polynomial-time solutions are developed by exploiting its inherent properties.
Abstract: We present a dynamic voltage scaling (DVS) technique that minimizes system-wide energy consumption for both periodic and sporadic tasks. It is known that a system consists of processors and a number of other components. Energy-aware processors can run at different speed levels; components like memory, I/O subsystems, and network interface cards can enter a standby state when the system is active but they are idle. Processor energy optimization solutions are not necessarily efficient from the perspective of systems. Current system-wide energy optimization studies are often limited to periodic tasks with heuristics yielding approximated solutions. In this paper, we develop an exact dynamic programming algorithm for periodic tasks on processors with practical discrete speed levels. The algorithm determines the lower bound of energy expenditure in pseudopolynomial time. An approximation algorithm is proposed to provide a performance guarantee with a given bound in polynomial running time. Because of their time efficiency, both the optimization and approximation algorithms can be adapted for online scheduling of sporadic tasks with irregular task releases. We prove that system-wide energy optimization for sporadic tasks is NP-hard in the strong sense. We develop (pseudo-) polynomial-time solutions by exploiting its inherent properties.
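The underlying problem can be stated compactly: pick one discrete speed per task so that EDF utilization stays at or below 1 while energy is minimized. The sketch below solves a small instance by exhaustive enumeration under an assumed per-task energy model; it is a reference formulation of the objective, not the paper's pseudopolynomial dynamic program.

```python
import itertools

def min_energy_assignment(tasks, speeds, energy):
    """Pick one discrete speed per periodic task (C, T) so that EDF
    utilization sum(C_i / (s_i * T_i)) <= 1 and total energy is minimal.
    `energy(C, T, s)` is a caller-supplied (assumed) per-task model.
    Returns (best_energy, speed_tuple) or None if infeasible."""
    best = None
    for choice in itertools.product(speeds, repeat=len(tasks)):
        util = sum(C / (s * T) for (C, T), s in zip(tasks, choice))
        if util > 1.0:
            continue                     # not EDF-schedulable at these speeds
        e = sum(energy(C, T, s) for (C, T), s in zip(tasks, choice))
        if best is None or e < best[0]:
            best = (e, choice)
    return best
```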

Journal ArticleDOI
TL;DR: Experimental results for some mobile applications show that the proposed DSOM module in user space can meet user-specified task-oriented goals and significantly improve battery utilization.
Abstract: Energy efficiency has become a very important and challenging issue for resource-constrained mobile computers. In this article, we propose a novel dynamic software management (DSOM) framework to improve battery utilization. We have designed and implemented a DSOM module in user space, independent of the operating system (OS), which explores quality-of-service (QoS) adaptation to reduce system energy and employs a priority-based preemption policy for multiple applications to avoid competition for limited energy resources. Software energy macromodels for mobile applications are employed to predict energy demand at each QoS level, so that the DSOM module is able to select the best possible trade-off between energy conservation and application QoS; it also honors the priority desired by the user. Our experimental results for some mobile applications (video player, speech recognizer, voice-over-IP) show that this approach can meet user-specified task-oriented goals and significantly improve battery utilization.

Journal ArticleDOI
TL;DR: This work presents a cosynthesis framework for design space exploration that considers heterogeneous scheduling while mapping multimedia applications onto such MPSoCs and generates a tradeoff space between energy and cost allowing designers to comparatively evaluate multiple system level mappings.
Abstract: Real-time multimedia applications are increasingly being mapped onto MPSoC (multiprocessor system-on-chip) platforms containing hardware-software IPs (intellectual property), along with a library of common scheduling policies such as EDF and RM. The choice of a scheduling policy for each IP is a key decision that greatly affects the design's ability to meet real-time constraints, and also directly affects the energy consumed by the design. We present a cosynthesis framework for design space exploration that considers heterogeneous scheduling while mapping multimedia applications onto such MPSoCs. In our approach, we select a suitable scheduling policy for each IP such that system energy is minimized—our framework also includes energy-reduction techniques utilizing dynamic power management. Experimental results on a realistic multimode multimedia terminal application demonstrate that our approach enables us to select design points with up to 60.5% reduced energy for a given area constraint, while meeting all real-time requirements. More importantly, our approach generates a tradeoff space between energy and cost allowing designers to comparatively evaluate multiple system level mappings.

Journal ArticleDOI
TL;DR: This paper proposes a new transaction-based modeling abstraction level (CCATB) to explore the communication design space, and describes the mechanisms that produce the speedup in CCATB models and also analyzes how the achieved simulation speedup scales with design complexity.
Abstract: Currently, system-on-chip (SoC) designs are becoming increasingly complex, with more and more components being integrated into a single SoC design. Communication between these components is increasingly dominating critical system paths and frequently becomes the source of performance bottlenecks. It, therefore, becomes imperative for designers to explore the communication space early in the design flow. Traditionally, system designers have used Pin-Accurate Bus Cycle Accurate (PA-BCA) models for early communication space exploration. These models capture all of the bus signals and strictly maintain cycle accuracy, which is useful for reliable performance exploration but results in slow simulation speeds for complex designs, even when they are modeled using high-level languages. Recently, there have been several efforts to use the Transaction-Level Modeling (TLM) paradigm for improving simulation performance in BCA models. However, these transaction-based BCA (T-BCA) models capture a lot of details that can be eliminated when exploring communication architectures. In this paper, we extend the TLM approach and propose a new transaction-based modeling abstraction level (CCATB) to explore the communication design space. Our abstraction level bridges the gap between the TLM and BCA levels, and yields an average performance speedup of 120% over PA-BCA and 67% over T-BCA models. The CCATB models are not only faster to simulate, but are also extremely accurate and take less time to model compared to both T-BCA and PA-BCA models. We describe the mechanisms that produce the speedup in CCATB models and also analyze how the achieved simulation speedup scales with design complexity. To demonstrate the effectiveness of using CCATB for exploration, we present communication space exploration case studies from the broadband communication and multimedia application domains.

Journal ArticleDOI
TL;DR: Stochastic bit-width estimation that follows a simulation-based probabilistic approach to estimate the bit-widths of integer variables using extreme value theory is introduced and enables more compact and power-efficient custom hardware designs than the compile-time integer bit-width analysis techniques.
Abstract: There is an increasing trend toward compiling from C to custom hardware for designing embedded systems in which the area and power consumption of application-specific functional units, registers, and memory blocks are heavily dependent on the bit-widths of integer operands used in computations. The actual bit-width required to store the values assigned to an integer variable during the execution of a program will not, in general, match the built-in C data types. Thus, precious area is wasted if the built-in data type sizes are used to declare the size of integer operands. In this paper, we introduce stochastic bit-width estimation, which follows a simulation-based probabilistic approach to estimate the bit-widths of integer variables using extreme value theory. The estimation technique is also empirically compared to two compile-time integer bit-width analysis techniques. Our experimental results show that the stochastic bit-width estimation technique dramatically reduces integer bit-widths and, therefore, enables more compact and power-efficient custom hardware designs than the compile-time integer bit-width analysis techniques. Up to 37% reduction in custom hardware area and 30% reduction in logic power consumption using stochastic bit-width estimation can be attained over ten integer applications implemented on an FPGA chip.
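A hedged sketch of the estimation idea: collect block maxima of a variable's simulated magnitudes, fit a Gumbel (extreme value) distribution by the method of moments, and size the variable for a high quantile of the fitted maximum. The paper's exact estimator may differ.

```python
import math

def stochastic_bitwidth(samples, block=50, quantile=0.9999):
    """Estimate a signed integer variable's bit-width from simulation data:
    fit a Gumbel distribution to block maxima of |value| and size the
    variable for a high quantile of the fitted maximum."""
    maxima = [max(abs(v) for v in samples[i:i + block])
              for i in range(0, len(samples) - block + 1, block)]
    n = len(maxima)
    mean = sum(maxima) / n
    var = sum((m - mean) ** 2 for m in maxima) / n
    beta = math.sqrt(6.0 * var) / math.pi        # Gumbel scale (moments fit)
    mu = mean - 0.5772156649 * beta              # Gumbel location
    x_q = mu - beta * math.log(-math.log(quantile))  # quantile of the maximum
    return max(1, math.ceil(math.log2(x_q + 1))) + 1  # +1 for the sign bit
```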

Journal ArticleDOI
TL;DR: This paper presents and evaluates a scheme for reducing the energy consumption of SDRAM (synchronous DRAM) memory access by using small, cachelike structures in the memory controller to prefetch an additional cache block(s) on SDRam reads and to combine block writes to the same S DRAM row.
Abstract: DRAM (dynamic random-access memory) energy consumption in low-power embedded systems can be very high, exceeding that of the data cache or even that of the processor. This paper presents and evaluates a scheme for reducing the energy consumption of SDRAM (synchronous DRAM) memory access by a combination of techniques that take advantage of SDRAM energy efficiencies in bank and row access. This is achieved by using small, cachelike structures in the memory controller to prefetch an additional cache block(s) on SDRAM reads and to combine block writes to the same SDRAM row. The results quantify the SDRAM energy consumption of MiBench applications and demonstrate significant savings in SDRAM energy consumption, 23%, on average, and reduction in the energy-delay product, 44%, on average. The approach also improves performance: the CPI is reduced by 26%, on average.
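The row locality that the scheme exploits is easy to quantify on a toy address trace. The model below (an illustrative simplification, not the paper's controller) counts row activations with and without a small buffer of recently used rows.

```python
def sdram_row_activations(accesses, row_size=1024, buffered_rows=4):
    """Count SDRAM row activations for an address trace, both with a single
    open row and with a small cachelike buffer keeping several recently
    used rows available.  Returns (unbuffered, buffered) counts."""
    def count(keep):
        open_rows, acts = [], 0
        for addr in accesses:
            row = addr // row_size
            if row in open_rows:
                open_rows.remove(row)
                open_rows.append(row)      # refresh LRU position
            else:
                acts += 1                  # row miss: activation required
                open_rows.append(row)
                if len(open_rows) > keep:
                    open_rows.pop(0)       # close the least recent row
        return acts
    return count(1), count(buffered_rows)
```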

Journal ArticleDOI
TL;DR: This article demonstrates, for the first time, that repeatable EM differential attacks are possible; the methodology remains repeatable in the presence of a complex embedded system, thus supporting attacks on real embedded systems.
Abstract: The susceptibility of wireless portable devices to electromagnetic (EM) attacks is largely unknown. If the EM waves emanating from the wireless device during a cryptographic computation leak sufficient information, it may be possible for an attacker to reconstruct the secret key. Possession of the secret cryptographic key would render all future wireless communications insecure and cause further potential problems, such as identity theft. Despite the complexities of a PDA wireless device, such as operating system events, interrupts, cache misses, and other interfering events, this article demonstrates that, for the first time, repeatable EM differential attacks are possible. The proposed differential analysis methodology involves precharacterization of the PDA device (thresholding and pattern recognition) and a new frequency-based differential analysis. Unlike previous research, the new methodology does not require perfect alignment of EM frames and is repeatable in the presence of a complex embedded system (including cache misses, operating system events, etc.), thus supporting attacks on real embedded systems. This research is important for future wireless embedded systems, which will increasingly demand higher levels of security.

Journal ArticleDOI
TL;DR: This work shows how timing analysis can be used to find the cases in which additional buffering and latency can be avoided, improving the space and time performance of the application.
Abstract: Automatic generation of a controller implementation from a synchronous reactive model is among the best practices for software development in the automotive and aeronautics industries, because of the possibility of simulation, model checking, and error-free implementation. This paper discusses an algorithm for optimizing the single-processor multitask implementation of Simulink models with real-time execution constraints, derived from the sampling rates of the functional blocks. Existing code generation tools enforce the addition of extra buffering and latencies whenever there is a rate transition among functional blocks. This work shows how timing analysis can be used to find the cases in which the additional buffering and latency can be avoided, improving the space and time performance of the application. The proposed search algorithm finds a solution with reduced, and possibly minimal, use of buffering, even for very high values of processor utilization.
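As a rough illustration of the idea (a deliberate simplification; the paper's timing analysis is far more general), consider a fixed-priority system in which a high-rate writer block feeds a low-rate reader activated at the same instants. A hypothetical sufficient check might drop the rate-transition buffer when the writer provably completes before the reader's code can run:

```python
def buffer_needed(writer_wcrt, writer_period, writer_prio, reader_prio):
    """Hypothetical sufficient condition (simplified): if the writer has
    higher priority and its worst-case response time fits within its
    period, the reader deterministically sees the current-period value,
    so the extra rate-transition buffer and its latency can be avoided."""
    direct_ok = writer_prio > reader_prio and writer_wcrt <= writer_period
    return not direct_ok

# writer: period 10 ms, worst-case response time 3 ms, higher priority
print(buffer_needed(3, 10, writer_prio=2, reader_prio=1))  # False: no buffer
```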

Journal ArticleDOI
TL;DR: This study formulated the minimal placement of bank selection instructions as a discrete optimization problem that is mapped to a partitioned boolean quadratic programming (PBQP) problem, and implemented the optimization as part of a PIC Microchip backend and evaluated the approach for several optimization objectives.
Abstract: We have devised an algorithm for minimal placement of bank selections in partitioned memory architectures. This algorithm is parameterizable for a chosen metric, such as speed, space, or energy. Bank switching is a technique that increases the code and data memory in microcontrollers without extending the address buses. Given a program in which variables have been assigned to data banks, we present a novel optimization technique that minimizes the overhead of bank switching through cost-effective placement of bank selection instructions. The placement is controlled by a number of different objectives, such as runtime, low power, small code size, or a combination of these parameters. We have formulated the minimal placement of bank selection instructions as a discrete optimization problem that is mapped to a partitioned boolean quadratic programming (PBQP) problem. We implemented the optimization as part of a PIC Microchip backend and evaluated the approach for several optimization objectives. Our benchmark suite comprises programs from MiBench and DSPStone plus a microcontroller real-time kernel and drivers for microcontroller hardware devices. Our optimization achieved a reduction in program memory space of between 2.7% and 18.2%, and an overall improvement in instruction cycles of between 5.0% and 28.8%. Our optimization achieved the minimal solution for all benchmark programs. We investigated the scalability of our approach toward the requirements of future generations of microcontrollers. This study was conducted as a worst-case analysis on the entire MiBench suite. Our results show that our optimization (1) scales well to larger numbers of memory banks, (2) scales well to the larger problem sizes that will become feasible with future microcontrollers, and (3) achieves minimal placement for more than 72% of all functions from MiBench.
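The overhead being minimized can be made concrete with a greedy pass over straight-line code, shown as a sketch only: it is much weaker than the paper's PBQP formulation (which also optimizes across control flow), and the mnemonics are simplified PIC-style placeholders.

```python
def place_banksel(accesses):
    """Emit a bank-select instruction only when the accessed variable
    lives in a different bank than the currently selected one. Greedy,
    straight-line-only simplification of bank-selection placement."""
    out, current = [], None
    for var, bank in accesses:
        if bank != current:
            out.append(f"BANKSEL {bank}")
            current = bank
        out.append(f"MOVF {var}")
    return out

# a and c live in bank 0, b in bank 1: a naive compiler emits 4 selects
# (one per access); the greedy pass emits only 3.
code = place_banksel([("a", 0), ("c", 0), ("b", 1), ("a", 0)])
```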

Journal ArticleDOI
TL;DR: This work improves reliability for multitasking embedded systems by proposing MTSS, a multitask stack sharing technique that can avoid the out-of-memory error if the extra space recovered is sufficient to complete execution.
Abstract: Out-of-memory errors are a serious source of unreliability in most embedded systems. Applications run out of main memory because of the frequent difficulty of estimating the memory requirement before deployment, either because it depends on input data, or because certain language features prevent estimation. The typical lack of disks and virtual memory in embedded systems has a serious consequence when an out-of-memory error occurs. Without swap space, the system crashes if its memory footprint exceeds the available memory by even 1 byte. This work improves reliability for multitasking embedded systems by proposing MTSS, a multitask stack sharing technique. If a task attempts to overflow the bounds of its allocated stack space, MTSS grows its stack into the stack memory space allocated for other tasks. This technique can avoid the out-of-memory error if the extra space recovered is sufficient to complete execution. Experiments show that MTSS is able to recover an average of 54% of the stack space allocated to the overflowing task in the free space of other tasks. In addition, unlike conventional systems, MTSS detects memory overflows, allowing the possibility of remedial action or a graceful exit if the recovered space is not enough. Alternatively, MTSS can be used for decreasing the required physical memory of an embedded system by reducing the initial memory allocated to each of the tasks and recovering the deficit by sharing stack with other tasks. The overheads of MTSS are low: the runtime and energy overheads are 3.1% and 3.2%, on average. These are tolerable given that reliability is the most important concern in virtually all systems, ahead of other concerns, such as runtime and energy.
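The borrowing idea can be sketched in a few lines. This is a toy accounting model, not the MTSS runtime; the task names and stack sizes are invented for illustration.

```python
class Task:
    def __init__(self, name, allocated, used):
        self.name, self.allocated, self.used = name, allocated, used
        self.borrowed = 0
    def free(self):
        return self.allocated - self.used

def grow_stack(task, need, others):
    """On overflow, grant `need` bytes to `task` out of the unused stack
    space of other tasks; returns False only on genuine out-of-memory,
    at which point remedial action or a graceful exit is still possible."""
    for donor in sorted(others, key=Task.free, reverse=True):
        if need == 0:
            break
        grant = min(need, donor.free())
        donor.used += grant        # reserve the donor's free tail
        task.borrowed += grant
        need -= grant
    return need == 0

audio = Task("audio", 4096, 4096)  # about to overflow its 4-KB stack
ui = Task("ui", 4096, 1024)
ok = grow_stack(audio, 512, [ui])  # 512-byte overflow absorbed by "ui"
```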

Journal ArticleDOI
TL;DR: Compile-time and instruction-set techniques for improving the accuracy of signal-processing algorithms run on fixed-point embedded processors are evaluated, and a novel instruction-set enhancement—fractional multiplication with internal left shift (FMLS)—is proposed to further leverage interoperand correlations uncovered by the IRP-SA scaling algorithm.
Abstract: This paper proposes and evaluates compile-time and instruction-set techniques for improving the accuracy of signal-processing algorithms run on fixed-point embedded processors. These techniques are proposed in the context of a profile-guided floating- to fixed-point compiler-based conversion process. A novel fixed-point scaling algorithm (IRP) is introduced that exploits correlations between values in a program by applying fixed-point scaling, retaining as much precision as possible without causing overflow. This approach is extended into a more aggressive scaling algorithm (IRP-SA) by leveraging the modulo nature of 2's complement addition and subtraction to discard most significant bits that may not be redundant sign-extension bits. A complementary scaling technique (IDS) is then proposed that enables the fixed-point scaling of a variable to be parameterized, depending upon the context of its definitions and uses. Finally, a novel instruction-set enhancement—fractional multiplication with internal left shift (FMLS)—is proposed to further leverage interoperand correlations uncovered by the IRP-SA scaling algorithm. FMLS preserves a different subset of the full product's bits than traditional fractional fixed-point or integer multiplication. On average, FMLS combined with IRP-SA improves accuracy on processors with uniform bitwidth register architectures by the equivalent of 0.61 bits of additional precision for a set of signal-processing benchmarks (up to 2 bits). Even without employing FMLS, the IRP-SA scaling algorithm achieves additional accuracy over two previous fixed-point scaling algorithms by averages of 1.71 and 0.49 bits. Furthermore, as FMLS combines multiplication with a scaling shift, it reduces execution time by an average of 9.8%. An implementation of IDS, specialized to single-nested loops, is found to improve the accuracy of a lattice filter benchmark by the equivalent of more than 16 bits of precision.
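The FMLS idea can be illustrated with Q15 arithmetic. This is a sketch under the paper's stated premise that the scaling analysis has proven the discarded high-order product bits to be redundant sign bits; the operand values are invented.

```python
def fract_mul(a, b, s=0):
    """16-bit Q15 fractional multiply with an internal left shift of s:
    the conventional fractional multiply (s = 0) keeps bits 30..15 of
    the 32-bit product; an FMLS-style instruction keeps bits
    (30-s)..(15-s) instead, preserving s extra low-order bits."""
    r = (a * b) >> (15 - s)        # arithmetic shift on Python ints
    assert -32768 <= r <= 32767    # the scaling analysis guarantees this
    return r

a = b = 328                        # ~0.01 in Q15 (round(0.01 * 2**15))
plain = fract_mul(a, b)            # 3:  3 / 2**15  ~ 9.2e-5
fmls = fract_mul(a, b, s=4)        # 52: 52 / 2**19 ~ 9.9e-5 (Q19 result)
```

For these small (correlated) operands the true product is about 1.0e-4, so the internally shifted result retains four extra significant bits and lands much closer to it than the conventional fractional multiply.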

Journal ArticleDOI
TL;DR: The proposed formal approach to fault-tolerance by program transformations highlights the benefits of separation of concerns and allows us to establish correctness properties and to compute optimal values of parameters to minimize fault-Tolerance overhead.
Abstract: We present a formal approach to implement fault-tolerance in real-time embedded systems. The initial fault-intolerant system consists of a set of independent periodic tasks scheduled onto a set of fail-silent processors connected by a reliable communication network. We transform the tasks such that, assuming the availability of an additional spare processor, the system tolerates one failure at a time (transient or permanent). Failure detection is implemented using heartbeating, and failure masking using checkpointing and rollback. These techniques are described and implemented by automatic program transformations on the tasks' programs. The proposed formal approach to fault-tolerance by program transformations highlights the benefits of separation of concerns. It allows us to establish correctness properties and to compute optimal values of parameters to minimize fault-tolerance overhead. We also present an implementation of our method, to demonstrate its feasibility and its efficiency.
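The checkpoint-and-rollback masking can be sketched as a toy transformation of a task's main loop. This is illustrative only: the real transformation in the paper also inserts heartbeats, targets fail-silent processors, and resumes on a spare processor.

```python
import copy

def run_task(steps, fail_at=None):
    """State is checkpointed before each step; a simulated transient
    failure discards the step's effects and resumes from the last
    checkpoint, so the final result is unchanged."""
    state = {"i": 0, "acc": 0}
    checkpoint = copy.deepcopy(state)
    failed = False
    while state["i"] < steps:
        checkpoint = copy.deepcopy(state)      # checkpoint before the step
        state["acc"] += state["i"]             # the step's computation
        state["i"] += 1
        if fail_at is not None and state["i"] == fail_at and not failed:
            failed = True                      # one transient failure
            state = copy.deepcopy(checkpoint)  # rollback: redo the step
    return state["acc"]

print(run_task(5), run_task(5, fail_at=3))     # both compute 0+1+2+3+4 = 10
```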

Journal ArticleDOI
TL;DR: This paper proposes a modified recursive flow classification (RFC) algorithm, Bitmap-RFC, which significantly reduces the memory requirements of RFC by applying a bitmap compression technique, and experiments with an architecture-aware design principle to guarantee the high performance of the classification algorithm on an NPU implementation.
Abstract: Packet classification is crucial for the Internet to provide more value-added services and guaranteed quality of service. Besides hardware-based solutions, many software-based classification algorithms have been proposed. However, classifying at 10 Gbps or higher is a challenging problem, and it remains one of the performance bottlenecks in core routers. In general, classification algorithms face the same challenge of balancing high classification speed against low memory requirements. This paper proposes a modified recursive flow classification (RFC) algorithm, Bitmap-RFC, which significantly reduces the memory requirements of RFC by applying a bitmap compression technique. To increase classification speed, we exploit the multithreaded architectural features in various algorithm development stages, from algorithm design to algorithm implementation. As a result, Bitmap-RFC strikes a good balance between speed and space: it maintains high classification speed while significantly reducing memory consumption. This paper investigates the main NPU software design aspects that have dramatic performance impacts on any NPU-based implementation: memory space reduction, instruction selection, data allocation, task partitioning, and latency hiding. We experiment with an architecture-aware design principle to guarantee high performance of the classification algorithm on an NPU implementation. The experimental results show that the Bitmap-RFC algorithm achieves 10 Gbps or higher and scales well on the Intel IXP2800 NPU.
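The bitmap compression trick can be sketched on a single RFC-style lookup table, where long runs of repeated equivalence-class IDs are common. This illustrates the general technique rather than the Bitmap-RFC implementation; real NPU code would use fixed-size words and a hardware population count.

```python
def compress(table):
    """Bitmap-compress a lookup table: bit i is set where entry i differs
    from entry i-1; only those distinct entries are stored."""
    bitmap, values = 0, []
    for i, v in enumerate(table):
        if i == 0 or v != table[i - 1]:
            bitmap |= 1 << i
            values.append(v)
    return bitmap, values

def lookup(bitmap, values, i):
    # index of the last set bit at or before i = popcount(prefix) - 1
    prefix = bitmap & ((1 << (i + 1)) - 1)
    return values[bin(prefix).count("1") - 1]

table = [5, 5, 5, 7, 7, 2, 2, 2]   # 8 entries, only 3 distinct runs
bm, vals = compress(table)         # vals == [5, 7, 2]
```

The space saving grows with run length, while each lookup costs only a mask and a popcount, which is why the compressed form can stay fast on multithreaded NPUs.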

Journal ArticleDOI
TL;DR: A design framework is proposed that can flexibly balance the tradeoff involving the system's code size, execution time, and energy consumption and is driven by an abstract system specification given by the system developer, so that the system development process can be automated.
Abstract: Real-time embedded systems are typically constrained in terms of three system performance criteria: space, time, and energy. The performance requirements are directly translated into constraints imposed on the system's resources, such as code size, execution time, and energy consumption. These resource constraints often interact or even conflict with each other in a complex manner, making it difficult for a system developer to apply a well-defined design methodology in developing a real-time embedded system. Motivated by this observation, we propose a design framework that can flexibly balance the tradeoff involving the system's code size, execution time, and energy consumption. Given a system specification and an optimization criterion, the proposed technique generates a set of design parameters in such a way that a system cost function is minimized while the given resource constraints are satisfied. Specifically, the technique derives a code generation decision for each task, so that a specific version of code is selected from among a number of different ones that have distinct characteristics in terms of code size and execution time. In addition, the design framework determines the voltage/frequency setting for a variable-voltage processor whose supply voltage can be adjusted at runtime in order to minimize energy consumption at the cost of correspondingly degraded execution performance. The proposed technique formulates this design process as a constrained optimization problem. We show that this optimization problem is NP-hard and then provide a heuristic solution to it. We show that these seemingly conflicting design goals can be pursued by using a simple optimization algorithm that works with a single optimization criterion. Moreover, the optimization is driven by an abstract system specification given by the system developer, so that the system development process can be automated. The results from our simulation show that the proposed algorithm finds a solution that is close to the optimal one, with an average error smaller than 1.0%.
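For intuition, the selection problem can be stated as a tiny exhaustive search. All numbers below are invented, and the quadratic energy model (energy per cycle proportional to f squared) is an assumption; real instances are NP-hard, which is why the paper resorts to a heuristic.

```python
from itertools import product

tasks = [                         # per task: (code size, cycles) per version
    [(100, 8000), (160, 5000)],   # invented numbers: small/slow vs big/fast
    [(120, 9000), (200, 6000)],
]
freqs = [1.0, 0.5]                # normalized frequency settings

def best(size_budget, deadline):
    """Pick one code version per task and one frequency minimizing energy
    (modeled as cycles * f**2) subject to code-size and deadline limits."""
    best_e, best_cfg = None, None
    for choice in product(*tasks):
        size = sum(s for s, _ in choice)
        cycles = sum(c for _, c in choice)
        for f in freqs:
            if size <= size_budget and cycles / f <= deadline:
                e = cycles * f * f
                if best_e is None or e < best_e:
                    best_e, best_cfg = e, (choice, f)
    return best_e, best_cfg

# Trading code size for cycles changes which frequency meets the deadline,
# and thus the energy: here the faster code enables the half-speed setting.
e, (versions, f) = best(size_budget=300, deadline=30000)
```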

Journal ArticleDOI
TL;DR: In this article, the authors present NWSLite, a prediction utility that addresses the shortcomings of traditional time series-based prediction methods on resource-constrained platforms, where configuring prediction algorithms is difficult.
Abstract: Time series-based prediction methods have a wide range of uses in embedded systems. Many OS algorithms and applications require accurate prediction of demand and supply of resources. However, configuring prediction algorithms is not easy, since the dynamics of the underlying data requires continuous observation of the prediction error and dynamic adaptation of the parameters to achieve high accuracy. Current prediction methods are either too costly to implement on resource-constrained devices or their parameterization is static, making them inappropriate and inaccurate for a wide range of datasets. This paper presents NWSLite, a prediction utility that addresses these shortcomings on resource-restricted platforms.
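One simple way to realize dynamic parameter adaptation (a sketch of the general idea only, not NWSLite's actual algorithm) is to run several cheap forecasters with different parameters in parallel and predict with whichever has the lowest accumulated error so far:

```python
class AdaptivePredictor:
    """Run exponential-smoothing forecasters with several alphas and
    trust the one with the lowest running absolute error."""
    def __init__(self, alphas=(0.1, 0.5, 0.9)):
        self.alphas = alphas
        self.est = [None] * len(alphas)   # per-alpha current estimate
        self.err = [0.0] * len(alphas)    # per-alpha accumulated error
    def predict(self):
        best = min(range(len(self.alphas)), key=lambda i: self.err[i])
        return self.est[best]
    def observe(self, x):
        for i, a in enumerate(self.alphas):
            if self.est[i] is None:
                self.est[i] = x           # initialize on first sample
            else:
                self.err[i] += abs(x - self.est[i])
                self.est[i] = a * x + (1 - a) * self.est[i]

p = AdaptivePredictor()
for x in [10, 10, 10, 50, 50, 50]:   # a step change favors high alpha
    p.observe(x)
```

After the step change the high-alpha forecaster has the smallest accumulated error, so the predictor automatically switches to it; this kind of self-selection is what static parameterization cannot do.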

Journal ArticleDOI
TL;DR: By using application-specific dynamic buffering techniques, the workload of these applications can be suitably “shaped” to fit the available processor bandwidth, analogous to traffic shaping, which is widely used in communication networks to optimally utilize network bandwidth.
Abstract: Today, most personal mobile devices (e.g., cell phones and PDAs) are multimedia-enabled and support a variety of concurrently running applications, such as audio/video players, word processors, and web browsers. Media-processing applications are often computationally expensive, and most of these devices typically have processors in the 100- to 400-MHz range. As a result, the user-perceived application response times are often poor when multiple applications run concurrently. In this paper, we show that by using application-specific dynamic buffering techniques, the workload of these applications can be suitably “shaped” to fit the available processor bandwidth. Our techniques are analogous to traffic shaping, which is widely used in communication networks to optimally utilize network bandwidth. Such shaping techniques have recently attracted a lot of attention in the context of embedded systems design (e.g., for dynamic voltage scaling). However, they have not been exploited for enhanced schedulability of multiple applications, as we do in this paper.
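The shaping argument can be made concrete with a toy per-frame demand trace (the numbers are invented): a k-frame playout buffer lets the processor be provisioned for the worst average demand over any k consecutive frames, rather than for the single-frame peak.

```python
demands = [2, 9, 3, 8, 2, 9, 3, 8]   # cycles (x 1e6) needed per media frame

def required_rate(demands, k):
    """Processor rate needed with a k-frame buffer: the worst-case
    average demand over any window of k consecutive frames."""
    worst = max(sum(demands[i:i + k]) for i in range(len(demands) - k + 1))
    return worst / k

print(required_rate(demands, 1), required_rate(demands, 4))  # 9.0 vs 5.5
```

With no buffer (k = 1) the processor must keep up with the 9-unit peak; a 4-frame buffer smooths the bursts so a 5.5-unit rate suffices, freeing bandwidth for the other concurrent applications.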