
Showing papers in "ACM Transactions on Embedded Computing Systems in 2014"


Journal ArticleDOI
TL;DR: An overview of the Delite compiler framework and the DSLs that have been developed with it is presented, and it is shown that they all achieve performance competitive with or exceeding C++ code.
Abstract: Developing high-performance software is a difficult task that requires the use of low-level, architecture-specific programming models (e.g., OpenMP for CMPs, CUDA for GPUs, MPI for clusters). It is typically not possible to write a single application that can run efficiently in different environments, leading to multiple versions and increased complexity. Domain-Specific Languages (DSLs) are a promising avenue to enable programmers to use high-level abstractions and still achieve good performance on a variety of hardware. This is possible because DSLs have higher-level semantics and restrictions than general-purpose languages, so DSL compilers can perform higher-level optimization and translation. However, the cost of developing performance-oriented DSLs is a substantial roadblock to their development and adoption. In this article, we present an overview of the Delite compiler framework and the DSLs that have been developed with it. Delite simplifies the process of DSL development by providing common components, like parallel patterns, optimizations, and code generators, that can be reused in DSL implementations. Delite DSLs are embedded in Scala, a general-purpose programming language, but use metaprogramming to construct an Intermediate Representation (IR) of user programs and compile to multiple languages (including C++, CUDA, and OpenCL). DSL programs are automatically parallelized and different parts of the application can run simultaneously on CPUs and GPUs. We present Delite DSLs for machine learning, data querying, graph analysis, and scientific computing and show that they all achieve performance competitive with or exceeding C++ code.
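
To illustrate the staging idea behind such embedded DSLs, here is a minimal Python sketch (not Delite's actual Scala API): operator overloading captures an IR of the user's program instead of evaluating it eagerly, and a backend could then optimize and lower that IR to C++, CUDA, or OpenCL.

```python
# Hypothetical sketch (not Delite's API): an embedded DSL can use operator
# overloading to build an IR instead of computing results eagerly.

class Sym:
    """An IR node recording an operation and its inputs."""
    def __init__(self, op, *args):
        self.op, self.args = op, args

    def __add__(self, other):
        return Sym("add", self, other)

    def __mul__(self, other):
        return Sym("mul", self, other)

    def __repr__(self):
        if self.op == "const":
            return str(self.args[0])
        return f"{self.op}({', '.join(map(repr, self.args))})"

def const(v):
    return Sym("const", v)

# User writes ordinary-looking arithmetic; the DSL captures an IR tree
# that a compiler backend could analyze, optimize, and code-generate.
x, y = const(2), const(3)
expr = x * y + x
print(expr)  # add(mul(2, 3), 2)
```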

139 citations


Journal ArticleDOI
TL;DR: The intention of this article is to summarize the current state of the art in research on how to build predictable yet performant systems, suggest precise definitions for the concept of "predictability," and present predictability concerns at different abstraction levels in embedded system design.
Abstract: A large class of embedded systems is distinguished from general-purpose computing systems by the need to satisfy strict requirements on timing, often under constraints on available resources. Predictable system design is concerned with the challenge of building systems for which timing requirements can be guaranteed a priori. Perhaps paradoxically, this problem has been made more difficult by the introduction of performance-enhancing architectural elements, such as caches, pipelines, and multithreading, which introduce a large degree of uncertainty and make guarantees harder to provide. The intention of this article is to summarize the current state of the art in research concerning how to build predictable yet performant systems. We suggest precise definitions for the concept of “predictability”, and present predictability concerns at different abstraction levels in embedded system design. First, we consider timing predictability of processor instruction sets. Thereafter, we consider how programming languages can be equipped with predictable timing semantics, covering both a language-based approach using the synchronous programming paradigm, as well as an environment that provides timing semantics for a mainstream programming language (in this case C). We present techniques for achieving timing predictability on multicores. Finally, we discuss how to handle predictability at the level of networked embedded systems where randomly occurring errors must be considered.

126 citations


Journal ArticleDOI
TL;DR: A novel technique is proposed to directly model the idle intervals of individual cores such that both DVFS and DPM can be optimized at the same time.
Abstract: Energy optimization is a critical design concern for embedded systems. Combining DVFS with DPM is considered a preferable technique for reducing energy consumption. Optimal DVFS+DPM algorithms for periodic independent tasks running on a uniprocessor exist in the literature. An optimal combination of DVFS and DPM for periodic dependent tasks on multicore systems has, however, not yet been reported. The challenge of this problem is that the idle intervals of cores are not easy to model. In this article, a novel technique is proposed to directly model the idle intervals of individual cores such that both DVFS and DPM can be optimized at the same time. Based on this technique, the energy optimization problem is formulated by means of mixed integer linear programming. We also present techniques to prune the exploration space of the formulation. Experimental results using real-world benchmarks demonstrate the effectiveness of our approach compared to existing approaches.
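
As a rough illustration of why the two knobs interact, consider the classic break-even reasoning below. All constants are invented, and the sketch is not the article's MILP formulation: sleeping through an idle interval pays off only if the interval exceeds the break-even time, and the frequency choice determines how long that interval is.

```python
# Illustrative numbers only (not from the article): decide, for one idle
# interval, whether DPM sleep beats idling, and pick the frequency whose
# total energy over a period is lowest.

P_IDLE, P_SLEEP, E_WAKE = 0.4, 0.05, 0.6   # watts, watts, joules (assumed)
BREAK_EVEN = E_WAKE / (P_IDLE - P_SLEEP)    # idle length where sleep pays off

def idle_energy(t):
    """Energy of an idle interval of t seconds with the better DPM choice."""
    return min(P_IDLE * t, P_SLEEP * t + E_WAKE)

def period_energy(freq, cycles, period, p_dyn=lambda f: f**3):
    """Dynamic energy of running `cycles` at `freq`, then idling to `period`."""
    busy = cycles / freq
    assert busy <= period, "frequency too low to meet the deadline"
    return p_dyn(freq) * busy + idle_energy(period - busy)

freqs = [0.6, 0.8, 1.0]
best = min(freqs, key=lambda f: period_energy(f, cycles=0.5, period=1.0))
print(f"break-even idle time: {BREAK_EVEN:.2f}s, best frequency: {best}")
```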

108 citations


Journal ArticleDOI
TL;DR: This paper addresses the scheduling-control co-design problem of determining the optimal sampling rates of feedback control loops sharing a WirelessHART network, formulating rate selection as a differentiable convex optimization problem, based on a new closed-form delay bound, that can be solved efficiently with a gradient descent method.
Abstract: With the advent of industrial standards such as WirelessHART, process industries are now gravitating towards wireless control systems. Due to limited bandwidth in a wireless network shared by multiple control loops, it is critical to optimize the overall control performance. In this article, we address the scheduling-control co-design problem of determining the optimal sampling rates of feedback control loops sharing a WirelessHART network. The objective is to minimize the overall control cost while ensuring that all data flows meet their end-to-end deadlines. The resulting constrained optimization based on existing delay bounds for WirelessHART networks is challenging since it is nondifferentiable, nonlinear, and not in closed form. We propose four methods to solve this problem. First, we present a subgradient method for rate selection. Second, we propose a greedy heuristic that usually achieves low control cost while significantly reducing the execution time. Third, we propose a global constrained optimization algorithm using a simulated annealing (SA) based penalty method. We study the SA method under both a constant-factor penalty and an adaptive penalty. Finally, we formulate rate selection as a differentiable convex optimization problem that provides a quick solution through a convex optimization technique. This is based on a new delay bound that is convex and differentiable, and hence simplifies the optimization problem. We study both the gradient descent method and the interior point method to solve it. We evaluate all methods through simulations based on topologies of a 74-node wireless sensor network testbed. The subgradient method tends to incur the longest execution time as well as the highest control cost among all methods. Among the SA-based constant penalty method, the greedy heuristic, and the gradient descent method, the first two represent the opposite ends of the trade-off between control cost and execution time, while the third strikes a balance between the two. We further observe that the SA-based adaptive penalty method is superior to the constant penalty method, and that the interior point method is superior to the gradient method. Thus, the interior point method and the SA-based adaptive penalty method are the two most effective approaches for rate selection. While both methods are competitive with each other in terms of control cost, the interior point method is significantly faster than the penalty method. As a result, the interior point method upon convex relaxation is more suitable for online rate adaptation than the SA-based adaptive penalty method due to their significant difference in runtime efficiency.
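
The following sketch conveys the flavor of gradient-based rate selection on a toy convex model. The cost function, constraint, and constants are invented, the rescaling step is a simple feasibility heuristic rather than an exact projection, and none of this reproduces the article's delay bounds.

```python
# A minimal sketch, not the article's algorithm: minimize a convex control
# cost sum(alpha_i / r_i) over sampling rates r_i, subject to a
# schedulability-style constraint sum(c_i * r_i) <= U, via projected
# gradient descent. alpha, c, U are made-up numbers.

alpha = [4.0, 2.0, 1.0]     # control-cost weights (assumed)
c     = [0.02, 0.03, 0.01]  # per-sample network demand (assumed)
U     = 1.0                 # capacity bound (assumed)

r = [10.0] * 3              # initial rates (Hz)
for _ in range(2000):
    # gradient of alpha_i / r_i w.r.t. r_i is -alpha_i / r_i**2, so a
    # descent step increases the rates
    r = [ri + 0.5 * (ai / ri**2) for ri, ai in zip(r, alpha)]
    # restore feasibility of sum(c_i * r_i) <= U by uniform scaling
    load = sum(ci * ri for ci, ri in zip(c, r))
    if load > U:
        r = [ri * U / load for ri in r]

print([round(ri, 2) for ri in r],
      "load:", round(sum(ci * ri for ci, ri in zip(c, r)), 3))
```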

77 citations


Journal ArticleDOI
TL;DR: The proposed methods not only consistently outperform the existing approaches in terms of throughput maximization, but also significantly improve the feasibility of tasks when a more stringent temperature constraint is imposed.
Abstract: In this article, we study the problem of how to maximize the throughput of a periodic real-time system under a given peak temperature constraint. We assume that different tasks in our system may have different power and thermal characteristics. Two scheduling approaches are presented. The first is built upon processors that can be in either active or sleep mode. By judiciously selecting tasks with different thermal characteristics as well as alternating the processor's active/sleep mode, the sleep period required to cool down the processor is kept at a minimum level and, as a result, the throughput is maximized. We further extend this approach to processors with dynamic voltage/frequency scaling (DVFS) capability. Our experiments on a large number of synthetic test cases as well as real benchmark programs show that the proposed methods not only consistently outperform the existing approaches in terms of throughput maximization, but also significantly improve the feasibility of tasks when a more stringent temperature constraint is imposed.
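
A toy first-order thermal model conveys the active/sleep intuition: run while the core stays under the temperature cap, prefer a cooler task near the cap, and sleep only when neither task is thermally safe. The constants and the two task thermal profiles below are invented, not taken from the article.

```python
# Illustrative first-order thermal model (invented constants): each task
# heats the core toward its own steady-state temperature; sleeping cools
# it toward ambient. Picking "cool" tasks near the cap minimizes sleep.

import math

T_AMB, T_MAX, TAU, DT = 45.0, 80.0, 0.05, 0.001    # degC, degC, s, s
T_SS = {"hot": 95.0, "cool": 85.0}                  # per-task steady states

def step(T, T_steady):
    """Advance the first-order model by DT toward T_steady."""
    return T_steady + (T - T_steady) * math.exp(-DT / TAU)

T, done = 50.0, {"hot": 0.0, "cool": 0.0}
for _ in range(1000):                  # simulate one second
    if step(T, T_SS["hot"]) < T_MAX:   # enough headroom for the hot task
        task = "hot"
    elif step(T, T_SS["cool"]) < T_MAX:
        task = "cool"                  # near the cap: pick a cooler task
    else:
        task = None                    # must sleep to cool down
    T = step(T, T_SS[task]) if task else step(T, T_AMB)
    if task:
        done[task] += DT

print({k: round(v, 3) for k, v in done.items()}, "final T:", round(T, 1))
```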

75 citations


Journal ArticleDOI
TL;DR: This work differs from previous efforts by modeling the interaction of the shared cache and shared bus with other basic microarchitectural components (e.g., pipeline and branch predictor), without assuming a timing-anomaly-free multicore architecture for computing the WCET.
Abstract: With the advent of multicore architectures, worst-case execution time (WCET) analysis has become an increasingly difficult problem. In this article, we propose a unified WCET analysis framework for multicore processors featuring both shared cache and shared bus. Compared to previous works, ours differs by modeling the interaction of the shared cache and shared bus with other basic microarchitectural components (e.g., pipeline and branch predictor). In addition, our framework does not assume a timing-anomaly-free multicore architecture for computing the WCET. A detailed experimental evaluation suggests that we can obtain reasonably tight WCET estimates for a wide range of benchmark programs.

70 citations


Journal ArticleDOI
TL;DR: This work introduces a novel class of massively parallel processor architectures called invasive Tightly-Coupled Processor Arrays (TCPAs) and presents a seamless mapping flow for TCPAs, based on a domain-specific language, and outlines a complete symbolic mapping approach.
Abstract: We introduce a novel class of massively parallel processor architectures called invasive Tightly-Coupled Processor Arrays (TCPAs). The presented processor class is a highly parameterizable template which can be tailored before runtime to fulfill customers' requirements such as performance, area cost, and energy efficiency. These programmable accelerators are well suited for domain-specific computing in the areas of signal, image, and video processing, as well as other streaming applications. To overcome future scaling issues (e.g., power consumption, reliability, resource management, as well as application parallelization and mapping), TCPAs are inherently designed in a way that supports self-adaptivity and resource awareness at the hardware level. Here, we follow a recently introduced resource-aware parallel computing paradigm called invasive computing, where an application can dynamically claim, execute, and release resources. Furthermore, we show how invasive computing can be used as an enabler for power management. For the first time, we present a seamless mapping flow for TCPAs, based on a domain-specific language. Moreover, we outline a complete symbolic mapping approach. Finally, we support our claims by comparing a TCPA against an ARM Mali-T604 GPU in terms of performance and energy efficiency.

66 citations


Journal ArticleDOI
TL;DR: A recurrent neural network is developed to solve the problem distributively in real time on the Bluetooth network; the convergence of the neural network and the feasibility of its solution to the defined problem are both theoretically proven.
Abstract: It is meaningful to design a strategy to roughly localize mobile phones without GPS by exploiting existing conditions and devices, especially in environments without GPS availability (e.g., tunnels, subway stations, etc.). The availability of Bluetooth devices on most phones and the existence of a number of GPS-equipped phones in a crowd of phone users enable us to design a Bluetooth-aided mobile phone localization strategy. With the positions of GPS-equipped phones as beacons, and with the Bluetooth connections between neighboring phones as proximity constraints, we formulate the problem as an inequality problem defined on the Bluetooth network. A recurrent neural network is developed to solve the problem distributively in real time. The convergence of the neural network and the feasibility of the solution to the defined problem are both theoretically proven. The hardware implementation architecture of the proposed neural network is also given in this article. As applications, rough localization of drivers in a tunnel and localization of customers in a supermarket are explored and simulated. Simulations demonstrate the effectiveness of the proposed method.
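
The sketch below conveys the constraint-satisfaction flavor of this formulation, using a naive iterative solver in place of the article's recurrent neural network; the coordinates, range, and links are invented.

```python
# Simplified sketch of the idea (the article uses a recurrent neural
# network): GPS phones are fixed anchors, a Bluetooth link between phones
# i and j imposes ||x_i - x_j|| <= R, and violating pairs are repeatedly
# pulled together until all proximity constraints hold.

import math

R = 10.0                                        # Bluetooth range (assumed)
pos     = {"a": (0.0, 0.0), "b": (40.0, 0.0),   # anchors (GPS phones)
           "u": (5.0, 5.0), "v": (30.0, 2.0)}   # initial guesses
anchors = {"a", "b"}
links   = [("a", "u"), ("u", "v"), ("v", "b")]  # proximity constraints

for _ in range(200):
    for i, j in links:
        (xi, yi), (xj, yj) = pos[i], pos[j]
        d = math.hypot(xi - xj, yi - yj)
        if d > R:                     # constraint violated: move each
            shift = (d - R) / d / 2   # free endpoint half the excess
            dx, dy = (xj - xi) * shift, (yj - yi) * shift
            if i not in anchors:
                pos[i] = (xi + dx, yi + dy)
            if j not in anchors:
                pos[j] = (xj - dx, yj - dy)

print({k: (round(x, 1), round(y, 1)) for k, (x, y) in pos.items()})
```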

54 citations


Journal ArticleDOI
TL;DR: A design-time (offline) multi-criterion optimization technique for application mapping on embedded multiprocessor systems to minimize energy consumption for all processor fault-scenarios and a scheduling technique based on self-timed execution to minimize the schedule storage and construction overhead at runtime are proposed.
Abstract: Task mapping and scheduling are critical in minimizing energy consumption while satisfying the performance requirement of applications enabled on heterogeneous multiprocessor systems. An area of growing concern for modern multiprocessor systems is the increase in the failure probability of one or more component processors. This is especially critical for applications where performance degradation (e.g., throughput) directly impacts the quality of service requirement. This article proposes a design-time (offline) multi-criterion optimization technique for application mapping on embedded multiprocessor systems to minimize energy consumption for all processor fault-scenarios. A scheduling technique is then proposed based on self-timed execution to minimize the schedule storage and construction overhead at runtime. Experiments conducted with synthetic and real applications from streaming and nonstreaming domains on heterogeneous MPSoCs demonstrate that the proposed technique minimizes energy consumption by 22% and design space exploration time by 100x, while satisfying the throughput requirement for all processor fault-scenarios. For scalable throughput applications, the proposed technique achieves 30% better throughput per unit energy, compared to the existing techniques. Additionally, the self-timed execution-based scheduling technique minimizes schedule construction time by 95% and storage overhead by 92%.

51 citations


Journal ArticleDOI
TL;DR: A method is introduced that is capable of identifying critical pathways in a network at runtime and can then dynamically reconfigure the network to optimize performance subject to the identified dominant flows.
Abstract: Modern network-on-chip (NoC) systems are required to handle complex runtime traffic patterns and unprecedented applications. The data traffic of these applications is difficult to fully comprehend at design time so as to optimize the network design. However, it has been discovered that the majority of dataflows in a network are dominated by less than 10% of the specific pathways. In this article, we introduce a method that is capable of identifying critical pathways in a network at runtime and can then dynamically reconfigure the network to optimize network performance subject to the identified dominant flows. An online learning and analysis scheme is employed to quickly discover the emerging dominant traffic flows and provide a statistical traffic prediction using regression analysis. The architecture of a self-tuning network is also discussed, which can be reconfigured by setting up the identified point-to-point paths for the dominant dataflows with large traffic volumes. The merits of this new approach are experimentally demonstrated using comprehensive NoC simulations. Compared to conventional network architectures over a range of realistic applications, the proposed self-tuning network approach can effectively reduce latency and power consumption by as much as 25% and 24%, respectively. We also evaluate the configuration time and additional hardware cost. This new approach demonstrates the capability of an adaptive NoC to handle more complex and dynamic applications.
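
A minimal sketch of the runtime idea: keep per-flow counters, extrapolate a least-squares trend per flow, and flag the few flows predicted to dominate traffic. The traffic numbers and the 25% dominance threshold are invented; this is not the article's scheme.

```python
# Invented traffic, illustrative only: predict each flow's next-window
# volume by a least-squares linear trend, then select dominant flows.

from collections import defaultdict

history = defaultdict(list)          # (src, dst) -> packet count per window

def record_window(counts):
    for flow, n in counts.items():
        history[flow].append(n)

def predict_next(samples):
    """Least-squares linear trend, extrapolated one window ahead."""
    n = len(samples)
    if n == 1:
        return samples[0]
    xs = range(n)
    mx, my = (n - 1) / 2, sum(samples) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, samples))
         / sum((x - mx) ** 2 for x in xs))
    return my + b * (n - mx)

# three measurement windows of per-flow packet counts
record_window({(0, 5): 90,  (3, 1): 8, (2, 7): 60})
record_window({(0, 5): 110, (3, 1): 6, (2, 7): 75})
record_window({(0, 5): 130, (3, 1): 9, (2, 7): 90})

pred = {f: predict_next(h) for f, h in history.items()}
total = sum(pred.values())
dominant = [f for f, p in pred.items() if p / total > 0.25]
print("set up point-to-point paths for:", dominant)
```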

47 citations


Journal ArticleDOI
TL;DR: This article finds that the traditional FSM synthesis procedure will introduce security risks and cannot guarantee trustworthiness in the implemented circuits, and proposes a novel approach to designing trusted circuits from the FSM specification.
Abstract: Sequential components are crucial for a real-time embedded system as they control the system based on the system's current state and real-life input. In this article, we explore the security and trust issues of sequential system design from the perspective of a finite state machine (FSM), which is the most popular model used to describe sequential systems. Specifically, we find that the traditional FSM synthesis procedure will introduce security risks and cannot guarantee trustworthiness in the implemented circuits. Indeed, we show that not only do there exist simple and effective ways to attack a sequential system, but it is also possible to insert a hardware Trojan horse into the design without introducing any significant design overhead. We then formally define the notion of trust in FSMs and propose a novel approach to designing trusted circuits from the FSM specification. We demonstrate both our findings on the security threats and the effectiveness of our proposed method on Microelectronics Center of North Carolina (MCNC) sequential circuit benchmarks.
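
A toy example of the synthesis risk described above (the states, inputs, and fill strategies are invented): transitions the designer leaves unspecified (don't-cares) may be completed arbitrarily during synthesis, and a malicious completion creates a backdoor while matching the specification on every specified transition.

```python
# Invented toy FSM: missing (state, input) pairs are don't-cares that
# synthesis is free to fill in -- possibly with a path into a protected
# state, without changing any specified behavior.

spec = {  # (state, input) -> next state; missing pairs are don't-cares
    ("IDLE",   "req"):  "BUSY",
    ("BUSY",   "done"): "IDLE",
    ("LOCKED", "key"):  "ADMIN",   # ADMIN should be reachable only via key
}

def synthesize(spec, states, inputs, fill):
    """Complete the FSM by assigning every don't-care with `fill`."""
    return {(s, i): spec.get((s, i), fill(s, i))
            for s in states for i in inputs}

states, inputs = ["IDLE", "BUSY", "LOCKED", "ADMIN"], ["req", "done", "key"]

benign    = synthesize(spec, states, inputs, fill=lambda s, i: s)       # self-loop
malicious = synthesize(spec, states, inputs, fill=lambda s, i: "ADMIN") # trojan

# Identical on all specified transitions, but one unspecified input now
# jumps straight into the protected state:
print(benign[("IDLE", "key")])     # IDLE  (harmless)
print(malicious[("IDLE", "key")])  # ADMIN (backdoor added by synthesis)
```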

Journal ArticleDOI
TL;DR: Experimental results indicate a strong motivation to consider the proposed architecture for future CMPs, as it can provide about 5× reduction in power consumption and improved throughput and access latencies, compared to traditional electrical 2D mesh and torus NoC architectures.
Abstract: With increasing application complexity and improvements in process technology, Chip MultiProcessors (CMPs) with tens to hundreds of cores on a chip are becoming a reality. Networks-on-Chip (NoCs) have emerged as scalable communication fabrics that can support high bandwidths for these massively parallel multicore systems. However, traditional electrical NoC implementations still need to overcome the challenges of high data transfer latencies and large power consumption. On-chip photonic interconnects with high performance-per-watt characteristics have recently been proposed as an alternative to address these challenges for intra-chip communication. In this article, we explore using low-cost photonic interconnects on a chip to enhance traditional electrical NoCs. Our proposed hybrid photonic ring-mesh NoC (METEOR) utilizes a configurable photonic ring waveguide coupled to a traditional 2D electrical mesh NoC. Experimental results indicate a strong motivation to consider the proposed architecture for future CMPs, as it can provide about 5× reduction in power consumption and improved throughput and access latencies, compared to traditional electrical 2D mesh and torus NoC architectures. Compared to other previously proposed hybrid photonic NoC fabrics such as the hybrid photonic torus, Corona, and Firefly, our proposed fabric is also shown to have lower photonic area overhead, power consumption, and energy-delay product, while maintaining competitive throughput and latency.

Journal ArticleDOI
TL;DR: This article presents a reinforcement learning (RL)-based DPM technique for optimal selection of timeout values in the different device states and shows that the proposed learning algorithm not only adequately explores the power-performance trade-off under nonstationary workloads but can also successfully perform online adjustment of the trade-off parameter in order to meet the user-specified constraint.
Abstract: Dynamic power management (DPM) refers to strategies which selectively change the operational states of a device during runtime to reduce the power consumption based on the past usage pattern, the current workload, and the given performance constraint. The power management problem becomes more challenging when the workload exhibits nonstationary behavior which may degrade the performance of any single or static DPM policy. This article presents a reinforcement learning (RL)-based DPM technique for optimal selection of timeout values in the different device states. Each timeout period determines how long the device will remain in a particular state before the transition decision is taken. The timeout selection is based on workload estimates derived from a Multilayer Artificial Neural Network (ML-ANN) and an objective function given by weighted performance and power parameters. Our DPM approach is further able to adapt the power-performance weights online to meet user-specified power and performance constraints. We have fully implemented our DPM algorithm on our embedded traffic surveillance platform and performed long-term experiments using real traffic data to demonstrate the effectiveness of the DPM. Our results show that the proposed learning algorithm not only adequately explores the power-performance trade-off under nonstationary workloads but can also successfully perform online adjustment of the trade-off parameter in order to meet the user-specified constraint.
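
A stateless, bandit-style simplification of the timeout-learning idea is sketched below (the article's full approach also uses the ML-ANN workload estimator and multiple device states, omitted here); the power numbers, idle-time distribution, and latency penalty are all invented.

```python
# Simplified epsilon-greedy learning of a timeout value (invented numbers,
# not the article's algorithm): estimate the average cost of each
# candidate timeout, where cost = idle-period energy + a wakeup penalty.

import random

TIMEOUTS = [0.1, 0.5, 2.0]            # candidate timeout values (seconds)
P_IDLE, P_SLEEP, E_WAKE, LAT_PEN = 0.4, 0.05, 0.6, 1.0
q = {t: 0.0 for t in TIMEOUTS}        # running cost estimates

def cost(timeout, idle):
    """Energy of one idle period plus a wakeup-latency penalty."""
    if idle <= timeout:               # timeout never expired: stayed idle
        return P_IDLE * idle
    return (P_IDLE * timeout + P_SLEEP * (idle - timeout)
            + E_WAKE + LAT_PEN)       # slept, then paid the wakeup cost

for step in range(5000):
    idle = random.expovariate(1 / 1.5)          # synthetic idle period
    t = (random.choice(TIMEOUTS) if random.random() < 0.1
         else min(q, key=q.get))                # q holds costs: lower wins
    q[t] += 0.05 * (cost(t, idle) - q[t])       # update the estimate

print({t: round(v, 3) for t, v in q.items()}, "-> chosen:", min(q, key=q.get))
```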

Journal ArticleDOI
TL;DR: A novel approach is presented based on reinforcement learning to predict the best policy among existing DPM policies and deterministic Markovian nonstationary policies (DMNSP), supporting different devices according to their DPM.
Abstract: In this work, an embedded system model is designed with one server that receives requests from a requester through a service queue monitored by a Power Manager (PM). A novel approach is presented, based on reinforcement learning, to predict the best policy among existing DPM policies and deterministic Markovian nonstationary policies (DMNSP). We apply reinforcement learning, namely a computational approach to understanding and automating goal-directed learning, that supports different devices according to their DPM. Reinforcement learning uses a formal framework defining the interaction between agent and environment in terms of states, response actions, and reward points. The capability of this approach is demonstrated by an event-driven simulator designed in Java with a power-manageable machine-to-machine device. Our experimental results show that the proposed dynamic power management with a timeout policy gives average power savings from 4% to 21%, and the novel dynamic power management with DMNSP gives average power savings from 10% to 28% more than already-proposed DPM policies.

Journal ArticleDOI
TL;DR: A Mini-Census ADaptive Support Region (MCADSR) stereo matching algorithm is used as a case study due to its high accuracy and representative operations in this domain, and several efficient optimization methods, including vertical-first cost aggregation, hybrid parallel processing, and a hardware-friendly integral image, are proposed.
Abstract: The domain of stereo vision is highly important in the fields of autonomous cars, video tolling, robotics, and aerial surveys. The specific feature of this domain is that we must handle not only pixel-by-pixel 2D processing in one image but also 3D processing for depth estimation, comparing information about a scene from several images with different perspectives. This feature brings challenges to memory resource utilization, because an extra dimension of data has to be buffered. Due to this memory limitation, few previous stereo vision implementations provide both accurate and high-speed processing for high-resolution images at the same time. To achieve domain-specific acceleration for stereo vision, the memory limitation has to be addressed. This article uses a Mini-Census ADaptive Support Region (MCADSR) stereo matching algorithm as a case study due to its high accuracy and representative operations in this domain. To relieve the memory limitation and achieve high-speed processing, the article proposes several efficient optimization methods, including vertical-first cost aggregation, hybrid parallel processing, and a hardware-friendly integral image. The article also presents a customizable system which provides both accurate and high-speed stereo matching for high-resolution images. The benefits of applying the optimization methods to the system are highlighted. With the aforesaid optimization and specific customization implemented on an FPGA, the demonstrated system can process 47.6 fps (frames per second) for a video size of 1920 × 1080 with a large disparity range of 256, and 129 fps for 1024 × 768 with a disparity range of 128. Our results are up to 1.64 times better than previous work in terms of Million Disparity Estimations per second (MDE/s). For accuracy, the 7.65% overall average error rate outperforms current work that provides real-time processing at this high resolution and large disparity range.
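
Of these methods, the hardware-friendly integral image is easy to convey in a few lines: after a single accumulation pass, the sum over any rectangular support region takes only four lookups, which is what makes large adaptive support regions affordable. A minimal sketch:

```python
# Minimal integral-image sketch: one pass builds the table, after which
# any rectangular sum is computed in O(1) with four lookups.

def integral(img):
    """ii[y][x] = sum of img over the half-open rectangle [0..y) x [0..x)."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row = 0
        for x in range(w):
            row += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row
    return ii

def rect_sum(ii, y0, x0, y1, x1):
    """Sum of img[y0:y1][x0:x1] via four lookups."""
    return ii[y1][x1] - ii[y0][x1] - ii[y1][x0] + ii[y0][x0]

img = [[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]]
ii = integral(img)
print(rect_sum(ii, 0, 0, 2, 2))  # 1+2+4+5 = 12
```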

Journal ArticleDOI
TL;DR: This work identifies and explores some limitations in the existing recursive-calculus-based approaches to compute the Worst-Case Traversal Time (WCTT) of a packet, and introduces a more general approach, namely “Branch, Prune and Collapse” (BPC) which offers a configurable parameter that provides a flexible trade-off between the computational complexity and the tightness of the computed estimate.
Abstract: “Many-core” systems based on a Network-on-Chip (NoC) architecture offer various opportunities in terms of performance and computing capabilities, but at the same time they pose many challenges for the deployment of real-time systems, which must fulfill specific timing requirements at runtime. It is therefore essential to identify, at design time, the parameters that have an impact on the execution time of the tasks deployed on these systems, as well as upper bounds on the other key parameters. The focus of this work is to determine an upper bound on the traversal time of a packet when it is transmitted over the NoC infrastructure. Towards this aim, we first identify and explore some limitations in the existing recursive-calculus-based approaches to compute the Worst-Case Traversal Time (WCTT) of a packet. Then, we extend the existing model by integrating the characteristics of the tasks that generate the packets. For this extended model, we propose an algorithm called “Branch and Prune” (BP). Our proposed method provides tighter, yet safe, estimates compared to the existing recursive-calculus-based approaches. Finally, we introduce a more general approach, namely “Branch, Prune and Collapse” (BPC), which offers a configurable parameter that provides a flexible trade-off between the computational complexity and the tightness of the computed estimate. The recursive-calculus methods and BP represent two special cases of BPC, with the trade-off parameter set to 1 or ∞, respectively. Through simulations, we analyze this trade-off, reason about the implications of certain choices, and also provide some case studies to observe the impact of task parameters on the WCTT estimates.

Journal ArticleDOI
TL;DR: The analysis shows that SFA is indeed an effective scheme under practical settings, even though it is not optimal, and any uni-core dynamic power management technique for reducing the energy consumption for idling can be easily incorporated individually on each core in the voltage island.
Abstract: Energy-efficient designs are important issues in computing systems. This article studies the energy efficiency of a simple and linear-time strategy, called the Single Frequency Approximation (SFA) scheme, for periodic real-time tasks on multicore systems with a shared supply voltage in a voltage island. The strategy executes all the cores at a single frequency to just meet the timing constraints. SFA has been adopted in the literature after task partitioning, but the worst-case performance of SFA in terms of the energy consumption incurred is an open problem. We provide a comprehensive analysis for SFA to derive the cycle utilization distribution for its worst-case behaviour for energy minimization. Our analysis shows that the energy consumption incurred by using SFA for task execution is at most 1.53 (1.74, 2.10, 2.69, respectively) times the energy consumption of the optimal voltage/frequency scaling, when the dynamic power consumption is a cubic function of the frequency and the voltage island has up to 4 (8, 16, 32, respectively) cores. The analysis shows that SFA is indeed an effective scheme under practical settings, even though it is not optimal. Furthermore, since all the cores run at a single frequency and no frequency alignment for Dynamic Voltage and Frequency Scaling (DVFS) between cores is needed, any uni-core dynamic power management technique for reducing the energy consumption for idling can easily be incorporated individually on each core in the voltage island. This article also provides an analysis of energy consumption for SFA combined with procrastination for Dynamic Power Management (DPM), resulting in an increment of 1 over the previous results for task execution. Furthermore, we also extend our analysis to derive the approximation factor of SFA for a multicore system with multiple voltage islands.
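
The scheme itself is simple enough to sketch. The utilizations below are invented, frequencies are normalized so that 1.0 just meets all deadlines, and dynamic power is modeled as f³ as in the analysis, so energy per executed cycle scales as f².

```python
# Minimal SFA sketch after task partitioning: all cores in the voltage
# island run at one frequency, the maximum per-core cycle utilization.
# Utilizations are invented; power model is f**3 (energy per cycle ~ f**2).

u = [0.9, 0.5, 0.3, 0.2]          # cycle utilization of each core

f_sfa = max(u)                    # single island frequency chosen by SFA
# energy over one hyperperiod: u_i cycles executed at frequency f -> u_i * f**2
e_sfa   = sum(ui * f_sfa**2 for ui in u)
e_ideal = sum(ui * ui**2 for ui in u)   # if each core had its own frequency

print(f"SFA frequency {f_sfa}, energy ratio {e_sfa / e_ideal:.2f}")
```

On this invented task set the ratio comes out around 1.7, within the worst-case bounds quoted in the abstract for small islands.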

Journal ArticleDOI
TL;DR: The UPP2SF model-translation tool is presented, which facilitates automatic conversion of verified models (in UPPAAL) to models that may be simulated and tested (in Simulink/Stateflow), together with the translation rules that ensure correct model conversion, applicable to a large class of models.
Abstract: Software-based control of life-critical embedded systems has become increasingly complex, and to a large extent has come to determine the safety of the human being. For example, implantable cardiac pacemakers have over 80,000 lines of code which are responsible for maintaining the heart within safe operating limits. As firmware-related recalls accounted for over 41% of the 600,000 devices recalled in the last decade, there is a need for rigorous model-driven design tools to generate verified code from verified software models. To this effect, we have developed the UPP2SF model-translation tool, which facilitates automatic conversion of verified models (in UPPAAL) to models that may be simulated and tested (in Simulink/Stateflow). We describe the translation rules that ensure correct model conversion, applicable to a large class of models. We demonstrate how UPP2SF is used in the model-driven design of a pacemaker whose model is (a) designed and verified in UPPAAL (using timed automata), (b) automatically translated to Stateflow for simulation-based testing, and then (c) automatically generated into modular code for hardware-level integration testing of timing-related errors. In addition, we show how UPP2SF may be used for worst-case execution time estimation early in the design stage. Using UPP2SF, we demonstrate the value of an integrated end-to-end modeling, verification, code-generation, and testing process for complex software-controlled embedded systems.

Journal ArticleDOI
TL;DR: A hybrid specification of dataflow and FSM models is proposed to specify the dynamic behavior of a system, distinguishing inter- and intra-application dynamism; the proposed technique is implemented in the HOPES design environment.
Abstract: As the number of processors in a chip increases and more functions are integrated, the system status changes dynamically due to various factors such as workload variation, QoS requirements, and unexpected component failures. A typical method to deal with the dynamics of the system is to make the mapping decision at runtime, based on local information about the system status. It is very challenging to guarantee the real-time performance of a given application in such a dynamically varying system. To solve this problem, we propose a hybrid specification of dataflow and FSM models to specify the dynamic behavior of a system, distinguishing inter- and intra-application dynamism. At the top level, each application is specified by a dataflow task, and the dynamic behavior is modeled as a control task that supervises the execution of applications. Inside a dataflow task, we specify the dynamic behavior in a similar way to FSM-based SADF, in which an application is specified by a synchronous dataflow graph for each mode of operation. This enables us to perform compile-time scheduling of each graph to maximize the throughput while varying the number of allocated processors, and to store the scheduling information. When a change in system state is detected at runtime, the number of processors allocated to the active tasks is determined dynamically, utilizing the stored scheduling information of those tasks in order to meet the real-time requirements. The proposed technique is implemented in the HOPES design environment. Through preliminary experiments with a simple smartphone example, we show the viability of the proposed methodology.

Journal ArticleDOI
TL;DR: This work presents Energy-Synchronized Communication (ESC) as a transparent middleware between the network layer and MAC layer that controls the amount and timing of RF activity at receiving nodes; it is implemented on MicaZ nodes with two state-of-the-art routing protocols.
Abstract: With advances in energy-harvesting techniques, it is now feasible to build sustainable sensor networks to support long-term applications. Unlike battery-powered sensor networks, the objective of sustainable sensor networks is to effectively utilize a continuous stream of ambient energy. Instead of pushing the limits of energy conservation, we aim to design energy-synchronized schemes that keep energy supplies and demands in balance. Specifically, this work presents Energy-Synchronized Communication (ESC) as a transparent middleware between the network layer and MAC layer that controls the amount and timing of RF activity at receiving nodes. In this work, we first derive a delay model for cross-traffic at individual nodes, which reveals an interesting stair effect. This effect allows us to design a localized energy synchronization control with O(d³) time complexity that shuffles or adjusts the working schedule of a node to optimize cross-traffic delays in the presence of changing duty cycle budgets, where d is the node degree in the network. Under different rates of energy fluctuations, shuffle-based and adjustment-based methods have different influences on logical connectivity and cross-traffic delay, due to the inconsistent views of working schedules among neighboring nodes before schedule updates. We study the trade-off between them and propose methods for updating working schedules efficiently. To evaluate our work, ESC is implemented on MicaZ nodes with two state-of-the-art routing protocols. Both testbed experiment and large-scale simulation results show significant performance improvements over randomized synchronization controls.

Journal ArticleDOI
TL;DR: Empirical evaluations demonstrate that user-space implementations of mechanisms to enforce different mixed-criticality scheduling approaches can be achieved atop Linux without kernel modification, with reasonably low (but in some cases nontrivial) overhead for mixed-criticality real-time task sets.
Abstract: Traditional fixed-priority scheduling analysis for periodic and sporadic task sets is based on the assumption that all tasks are equally critical to the correct operation of the system. Therefore, every task has to be schedulable under the chosen scheduling policy, and estimates of tasks' worst-case execution times must be conservative in case a task runs longer than is usual. To address the significant underutilization of a system's resources under normal operating conditions that can arise from these assumptions, several mixed-criticality scheduling approaches have been proposed. However, to date, there have been few quantitative comparisons of system schedulability or runtime overhead for the different approaches. In this article, we present a side-by-side implementation and evaluation of the known mixed-criticality scheduling approaches, for periodic and sporadic mixed-criticality tasks on uniprocessor systems, under a mixed-criticality scheduling model that is common to all these approaches. To make a fair evaluation of mixed-criticality scheduling, we also address previously open issues and propose modifications to improve particular approaches. Our empirical evaluations demonstrate that user-space implementations of mechanisms to enforce different mixed-criticality scheduling approaches can be achieved atop Linux without kernel modification, with reasonably low (but in some cases nontrivial) overhead for mixed-criticality real-time task sets.

Journal ArticleDOI
Abstract: Synchronous languages ensure determinate concurrency but at the price of restrictions on what programs are considered valid, or constructive. Meanwhile, sequential languages such as C and Java offer an intuitive, familiar programming paradigm but provide no guarantees with regard to determinate concurrency. The sequentially constructive (SC) model of computation (MoC) presented here harnesses the synchronous execution model to achieve determinate concurrency while taking advantage of familiar, convenient programming paradigms from sequential languages. In essence, the SC MoC extends the classical synchronous MoC by allowing variables to be read and written in any order and multiple times, as long as the sequentiality expressed in the program provides sufficient scheduling information to rule out race conditions. This allows the use of programming patterns familiar from sequential programming, such as testing and later setting the value of a variable, which are forbidden in the standard synchronous MoC. The SC MoC is a conservative extension in that programs considered constructive in the common synchronous MoC are also SC and retain the same semantics. In this article, we investigate classes of shared variable accesses, define SC-admissible scheduling as a restriction of “free scheduling,” derive the concept of sequential constructiveness, and present a priority-based scheduling algorithm for analyzing and compiling SC programs efficiently.

Journal ArticleDOI
TL;DR: MultiNets is deployed in a real-world scenario, and experimental results show that, depending on the user requirements, it outperforms the state-of-the-art Android system either by saving up to 33.75% energy, achieving near-optimal offloading, or achieving near-optimal throughput while substantially reducing TCP interruptions due to switching.
Abstract: MultiNets is a system supporting seamless switch-over between wireless interfaces on mobile devices in real time. MultiNets is configurable to run in three different modes: (i) Energy Saving mode--for choosing the interface that saves the most energy based on the condition of the device, (ii) Offload mode--for offloading data traffic from the cellular to WiFi network, and (iii) Performance mode--for selecting the network with the fastest data connectivity. MultiNets also provides a powerful API that gives the application developers: (i) the choice to select a network interface to communicate with a specific server, and (ii) the ability to simultaneously transfer data over multiple network interfaces. MultiNets is modular, easily integrable, lightweight, and applicable to various mobile operating systems. We implement MultiNets on Android devices as a showcase. MultiNets does not require any extra support from the network infrastructure and runs existing applications transparently. To evaluate MultiNets, we first collect data traces from 13 actual Android smartphone users over three months. We then use the collected traces to show that, by automatically switching to WiFi whenever it is available, MultiNets can offload on average 79.82% of the data traffic. We also illustrate that, by optimally switching between the interfaces, MultiNets can save on average 21.14 kJ of energy per day, which is equivalent to 27.4% of the daily energy usage. Using our API, we demonstrate that a video streaming application achieves a 43%--271% higher streaming rate when concurrently using WiFi and 3G interfaces. We deploy MultiNets in a real-world scenario, and our experimental results show that, depending on the user requirements, it outperforms the state-of-the-art Android system either by saving up to 33.75% energy, achieving near-optimal offloading, or achieving near-optimal throughput while substantially reducing TCP interruptions due to switching.

Journal ArticleDOI
TL;DR: Elon is a new mechanism for enabling efficient and long-term reprogramming in wireless sensor networks; it reduces the transferred code size significantly by introducing the concept of a replaceable component and significantly prolongs the reprogrammable lifetime.
Abstract: We present a new mechanism called Elon for enabling efficient and long-term reprogramming in wireless sensor networks. Elon reduces the transferred code size significantly by introducing the concept of a replaceable component. It avoids the cost of hardware reboot with a novel software reboot mechanism. Moreover, it significantly prolongs the reprogrammable lifetime (i.e., the time period during which the sensor nodes can be reprogrammed) by avoiding flash writes for TelosB nodes. Experimental results show that Elon transfers up to 120--389 times less information than Deluge, and 18--42 times less information than Stream. The software reboot mechanism that Elon applies reduces the rebooting cost by 50.4%--53.87% in terms of beacon packets, and 56.83% in terms of unsynchronized nodes. In addition, Elon prolongs the reprogrammable lifetime by a factor of 3.3.

Journal ArticleDOI
TL;DR: STEAM is an optimal closed-loop DEM controller designed for multicore processors, with excellent prediction of core temperatures and power consumption and the ability to control the core temperatures to within 3°C of the specified maximum.
Abstract: Recent empirical studies have shown that multicore scaling is fast becoming power limited, and consequently, an increasing fraction of a multicore processor has to be underclocked or powered off. Therefore, in addition to fundamental innovations in architecture, compilers, and parallelization of application programs, there is a need to develop practical and effective dynamic energy management (DEM) techniques for multicore processors. Existing DEM techniques mainly target reducing processor power consumption and temperature, and only a few of them have addressed improving energy efficiency for multicore systems. With energy efficiency taking center stage in all aspects of computing, the focus of DEM needs to be on finding practical methods to maximize processor efficiency. To this end, this article presents STEAM -- an optimal closed-loop DEM controller designed for multicore processors. The objective is to maximize energy efficiency by dynamic voltage and frequency scaling (DVFS). Energy efficiency is defined as the ratio of performance to power consumption, or performance-per-watt (PPW). This is the same as the number of instructions executed per joule. The PPW metric is actually replaced by P^αPW (performance^α-per-Watt), which allows for controlling the importance of performance versus power consumption by varying α. The proposed controller was implemented on a Linux system and tested with the Intel Sandy Bridge processor. There are three power management schemes, called governors, available with Intel platforms. They are referred to as (1) Powersave (lowest power consumption), (2) Performance (achieves highest performance), and (3) Ondemand. Our simple and lightweight controller, when executing SPEC CPU2006, PARSEC, and MiBench benchmarks, achieved an average of 18% improvement in energy efficiency (MIPS/Watt) over these ACPI policies. Moreover, STEAM also demonstrated excellent prediction of core temperatures and power consumption, and the ability to control the core temperatures to within 3°C of the specified maximum. Finally, the overhead of the STEAM implementation (in terms of CPU resources) is less than 0.25%. The entire implementation is self-contained and can be installed on any processor with very little prior knowledge of the processor.
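
The P^αPW metric is easy to illustrate. The frequency/MIPS/Watt triples below are invented, not Sandy Bridge measurements: the point is only that raising α shifts the chosen operating point toward higher performance.

```python
# Illustrative sketch of the performance^alpha-per-Watt objective with
# invented measurements: pick the DVFS setting maximizing MIPS**alpha / W.

SETTINGS = [  # (frequency GHz, MIPS, watts) -- assumed measurements
    (1.6, 3200, 18.0),
    (2.4, 4600, 30.0),
    (3.2, 5600, 52.0),
]

def best_setting(alpha):
    return max(SETTINGS, key=lambda s: s[1]**alpha / s[2])

for alpha in (0.5, 1.0, 2.0):
    f, mips, w = best_setting(alpha)
    print(f"alpha={alpha}: run at {f} GHz (score {mips**alpha / w:.1f})")
```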

Journal ArticleDOI
TL;DR: OCEAN enforces on-chip SRAM reliability with a fault-tolerant buffer that protects a portion of the processed data used to recover from runtime errors, and optimally selects the buffer size to minimize the energy overhead under timing and area constraints.
Abstract: Recent process technology advances trigger reliability issues that degrade the Quality-of-Service (QoS) required by embedded Systems-on-Chip (SoCs). To maintain the required QoS with acceptable overheads, we propose OCEAN, a novel cross-layer error mitigation scheme. OCEAN enforces on-chip SRAM reliability with a fault-tolerant buffer. We utilize this buffer to protect a portion of the processed data, which is used to recover from runtime errors. We optimally select the buffer size to minimize the energy overhead under timing and area constraints. OCEAN achieves full error mitigation with 10.1% average energy overhead compared to baseline operation that does not include any error correction capability, and 65% energy savings compared to a cross-layer error mitigation mechanism.

Journal ArticleDOI
TL;DR: Modifications to the traditional bin-packing techniques are proposed, and novel techniques are designed that take into account the DVFS model supported by the platform.
Abstract: Asymmetric multiprocessor systems are considered power-efficient multiprocessor architectures. Furthermore, efficient task allocation (partitioning) can achieve more energy efficiency on these asymmetric multiprocessor platforms. This article addresses the problem of energy-aware static partitioning of periodic real-time tasks on asymmetric multiprocessor (multicore) embedded systems. The article formulates the problem according to the Dynamic Voltage and Frequency Scaling (DVFS) model supported by the platform and shows that it is an NP-hard problem. Then, the article outlines optimal reference partitioning techniques for each case of the DVFS model under suitable assumptions. Finally, the article proposes modifications to the traditional bin-packing techniques and designs novel techniques that take the DVFS model supported by the platform into account. All algorithms and techniques are simulated and compared. The simulations show promising results: the proposed techniques reduce energy consumption by 75% compared to traditional methods when DVFS is not supported, and by 50% when per-core DVFS is supported by the platform.
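
A sketch of one traditional baseline that such techniques modify is worst-fit decreasing partitioning, which balances per-core utilization; with per-core DVFS and convex (here cubic) power, balanced loads keep energy low. The task set is invented, and this is not the article's algorithm.

```python
# Worst-fit decreasing partitioning sketch (invented task set): place each
# task, largest first, on the least-loaded core. With per-core DVFS each
# core then runs at f = load, and energy ~ sum(load * f**2) stays low
# because the convex power function rewards balanced loads.

tasks = [0.45, 0.40, 0.30, 0.25, 0.20, 0.15, 0.10]  # task utilizations
CORES = 3

loads = [0.0] * CORES
for u in sorted(tasks, reverse=True):                 # decreasing order
    k = min(range(CORES), key=loads.__getitem__)      # worst fit: emptiest core
    if loads[k] + u > 1.0:
        raise RuntimeError("task set not schedulable with this heuristic")
    loads[k] += u

energy = sum(l * l**2 for l in loads)   # per-core DVFS at f_k = load_k
print("loads:", loads, f"energy: {energy:.3f}")
```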

Journal ArticleDOI
TL;DR: This work compares the performance and capabilities of various CoreManager HW/SW solutions, based on ASIC, RISC, and ASIP paradigms, and demonstrates that the proposed ASIP-based solution approaches the performance of the ASIC realization while preserving the full flexibility of the software (RISC-based) implementation.
Abstract: Heterogeneity and parallelism in MPSoCs for 4G (and beyond) communications signal processing are inevitable in order to meet stringent power constraints and performance requirements. The question arises of how to cope with the problem of system programmability and runtime management incurred by the statically or even dynamically varying number and type of processing elements. This work addresses this challenge by proposing the concept of a heterogeneous many-core platform called Tomahawk. Apart from the definition of the system architecture, in this approach a unified framework including a model of computation, a programming interface, and a dedicated runtime management unit called CoreManager is proposed. The increase of system complexity in terms of application parallelism and number of resources may lead to a dramatic increase in management costs, hence causing performance degradation. For this reason, the efficient implementation of the CoreManager becomes a major issue in system design. This work compares the performance and capabilities of various CoreManager HW/SW solutions, based on ASIC, RISC, and ASIP paradigms. The results demonstrate that the proposed ASIP-based solution approaches the performance of the ASIC realization, while preserving the full flexibility of the software (RISC-based) implementation.

Journal ArticleDOI
TL;DR: This article targets the assignment of the scheduling parameters to minimize memory usage for systems of practical interest, including designs compliant with automotive standards, and proposes algorithms either proven optimal or shown to improve on randomized optimization methods like simulated annealing.
Abstract: In the development of real-time embedded applications, especially those on systems-on-chip, an efficient use of RAM memory is as important as the effective scheduling of the computation resources. The protection of communication and state variables accessed by concurrent tasks must provide real-time schedulability guarantees while using the least amount of memory. Several schemes, including preemption thresholds, have been developed to improve schedulability and save stack space by selectively disabling preemption. However, the design synthesis problem is still open. In this article, we target the assignment of the scheduling parameters to minimize memory usage for systems of practical interest, including designs compliant with automotive standards. We propose algorithms either proven optimal or shown to improve on randomized optimization methods like simulated annealing.
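
The stack-saving mechanism behind preemption thresholds can be sketched directly (this is an illustration of the general idea, not the article's synthesis algorithm): task j can preempt task i only if j's priority exceeds i's preemption threshold, so the worst-case shared stack is the heaviest chain in that can-preempt relation. The priorities, thresholds, and stack sizes below are invented.

```python
# Invented task set illustrating why preemption thresholds save stack:
# task j can preempt task i only if pri(j) > threshold(i), so the shared
# stack bound is the heaviest chain in the can-preempt relation.

import functools

# (priority, preemption threshold, stack bytes); higher number = higher prio
tasks = [(1, 1, 512), (2, 2, 256), (3, 3, 384), (4, 4, 128)]

def max_stack(tasks):
    @functools.lru_cache(maxsize=None)
    def chain(i):
        """Worst-case stack with task i at the bottom of the chain."""
        pri_i, th_i, st_i = tasks[i]
        preemptors = [chain(j) for j, (pri_j, _, _) in enumerate(tasks)
                      if pri_j > th_i]
        return st_i + max(preemptors, default=0)
    return max(chain(i) for i in range(len(tasks)))

print("fully preemptive:", max_stack(tasks))           # 512+256+384+128
# raising every threshold to the maximum priority disables preemption:
nonpreemptive = [(p, 4, s) for p, _, s in tasks]
print("non-preemptive:  ", max_stack(nonpreemptive))   # max single stack
```

Between these two extremes, intermediate threshold assignments trade a bounded loss of schedulability for stack savings, which is the design space the article's algorithms explore.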

Journal ArticleDOI
TL;DR: This work studies the analysis of MRU, a non-LRU replacement policy employed in mainstream processor architectures like Intel Nehalem, proposes a new cache hit/miss classification, k-Miss, to better capture MRU behavior, and develops formal conditions and efficient techniques to decide the k-Miss memory accesses.
Abstract: Most previous work on cache analysis for WCET estimation assumes a particular replacement policy called LRU. In contrast, much less work has been done for non-LRU policies, since they are generally considered to be very unpredictable. However, most commercial processors are actually equipped with these non-LRU policies, since they are more efficient in terms of hardware cost, power consumption, and thermal output, while still maintaining almost as good average-case performance as LRU. In this work, we study the analysis of MRU, a non-LRU replacement policy employed in mainstream processor architectures like Intel Nehalem. Our work shows that the predictability of MRU has been significantly underestimated before, mainly because the existing cache analysis techniques and metrics do not match MRU well. As our main technical contribution, we propose a new cache hit/miss classification, k-Miss, to better capture the MRU behavior, and develop formal conditions and efficient techniques to decide k-Miss memory accesses. A remarkable feature of our analysis is that the k-Miss classifications under MRU are derived from the analysis results of the same program under LRU. Therefore, our approach inherits the advantages in efficiency and precision of the state-of-the-art LRU analysis techniques based on abstract interpretation. Experiments with instruction caches show that our proposed MRU analysis has both good precision and high efficiency, and the obtained estimated WCET is rather close to (typically 1%∼8% more than) that obtained by the state-of-the-art LRU analysis, which indicates that MRU is also a good candidate for cache replacement policies in real-time systems.
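
For concreteness, here is a small simulation of the MRU-bit policy as it is commonly described in this line of work (one status bit per line, set on access; when all bits would become set, the others are cleared; the victim is the first line with a cleared bit). This is our reading of the policy, not code from the article.

```python
# Small MRU-bit replacement simulation (our reading of the policy, not the
# article's code): one status bit per line, a "global flip" when all bits
# would be set, victim = first line whose bit is 0.

def simulate_mru(accesses, ways=4):
    lines = [None] * ways          # cached blocks, one per way
    bits  = [0] * ways             # MRU status bits
    hits = 0

    def touch(i):
        bits[i] = 1
        if all(bits):              # global flip: keep only line i marked
            for j in range(ways):
                bits[j] = 1 if j == i else 0

    for block in accesses:
        if block in lines:
            hits += 1
            touch(lines.index(block))
        else:
            if None in lines:               # cold miss: fill an empty way
                victim = lines.index(None)
            else:                           # evict first line with bit 0
                victim = bits.index(0)
            lines[victim] = block
            touch(victim)
    return hits

seq = ["a", "b", "c", "d", "e"] * 2   # cyclic pattern over 5 blocks, 4 ways
print("MRU hits:", simulate_mru(seq))  # 1 here, whereas LRU scores 0 hits
```

Even on this tiny cyclic pattern the behavior diverges from LRU, which is exactly why LRU-based classifications need the k-Miss refinement to transfer to MRU.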