Showing papers in "ACM Transactions in Embedded Computing Systems in 2014"

PDF

Open Access

Journal Article•DOI•

Delite: A Compiler Architecture for Performance-Oriented Embedded Domain-Specific Languages

[...]

Arvind K. Sujeeth¹, Kevin J. Brown¹, HyoukJoong Lee¹, Tiark Rompf², Hassan Chafi¹, Martin Odersky³, Kunle Olukotun¹ - Show less +3 more•Institutions (3)

Stanford University¹, Oracle Corporation², École Polytechnique Fédérale de Lausanne³

01 Apr 2014-ACM Transactions in Embedded Computing Systems

TL;DR: An overview of the Delite compiler framework and DSLs that have been developed with it is presented and it is shown that they all achieve performance competitive to or exceeding Cpp code.

...read moreread less

Abstract: Developing high-performance software is a difficult task that requires the use of low-level, architecture-specific programming models (e.g., OpenMP for CMPs, CUDA for GPUs, MPI for clusters). It is typically not possible to write a single application that can run efficiently in different environments, leading to multiple versions and increased complexity. Domain-Specific Languages (DSLs) are a promising avenue to enable programmers to use high-level abstractions and still achieve good performance on a variety of hardware. This is possible because DSLs have higher-level semantics and restrictions than general-purpose languages, so DSL compilers can perform higher-level optimization and translation. However, the cost of developing performance-oriented DSLs is a substantial roadblock to their development and adoption. In this article, we present an overview of the Delite compiler framework and the DSLs that have been developed with it. Delite simplifies the process of DSL development by providing common components, like parallel patterns, optimizations, and code generators, that can be reused in DSL implementations. Delite DSLs are embedded in Scala, a general-purpose programming language, but use metaprogramming to construct an Intermediate Representation (IR) of user programs and compile to multiple languages (including Cpp, CUDA, and OpenCL). DSL programs are automatically parallelized and different parts of the application can run simultaneously on CPUs and GPUs. We present Delite DSLs for machine learning, data querying, graph analysis, and scientific computing and show that they all achieve performance competitive to or exceeding Cpp code.

...read moreread less

139 citations

Journal Article•DOI•

Building timing predictable embedded systems

[...]

Philip Axer¹, Rolf Ernst¹, Heiko Falk², Alain Girault³, Daniel Grund⁴, Nan Guan⁵, Bengt Jonsson⁵, Peter Marwedel⁶, Jan Reineke⁴, Christine Rochange⁷, Maurice Sebastian¹, Reinhard von Hanxleden, Reinhard Wilhelm⁴, Wang Yi⁵ - Show less +10 more•Institutions (7)

Braunschweig University of Technology¹, University of Ulm², University of Grenoble³, Saarland University⁴, Uppsala University⁵, Technical University of Dortmund⁶, University of Toulouse⁷

10 Mar 2014-ACM Transactions in Embedded Computing Systems

TL;DR: The intention of this article is to summarize the current state of the art in research concerning how to build predictable yet performant systems, and suggest precise definitions for the concept of “predictability”, and present predictability concerns at different abstraction levels in embedded system design.

...read moreread less

Abstract: A large class of embedded systems is distinguished from general-purpose computing systems by the need to satisfy strict requirements on timing, often under constraints on available resources. Predictable system design is concerned with the challenge of building systems for which timing requirements can be guaranteed a priori. Perhaps paradoxically, this problem has become more difficult by the introduction of performance-enhancing architectural elements, such as caches, pipelines, and multithreading, which introduce a large degree of uncertainty and make guarantees harder to provide. The intention of this article is to summarize the current state of the art in research concerning how to build predictable yet performant systems. We suggest precise definitions for the concept of “predictability”, and present predictability concerns at different abstraction levels in embedded system design. First, we consider timing predictability of processor instruction sets. Thereafter, we consider how programming languages can be equipped with predictable timing semantics, covering both a language-based approach using the synchronous programming paradigm, as well as an environment that provides timing semantics for a mainstream programming language (in this case C). We present techniques for achieving timing predictability on multicores. Finally, we discuss how to handle predictability at the level of networked embedded systems where randomly occurring errors must be considered.

...read moreread less

126 citations

Journal Article•DOI•

Energy optimization for real-time multiprocessor system-on-chip with optimal DVFS and DPM combination

[...]

Gang Chen¹, Kai Huang¹, Alois Knoll¹•Institutions (1)

Technische Universität München¹

28 Mar 2014-ACM Transactions in Embedded Computing Systems

TL;DR: A novel technique are proposed to directly model the idle intervals of individual cores such that both DVFS and DPM can be optimized at the same time.

...read moreread less

Abstract: Energy optimization is a critical design concern for embedded systems. Combining DVFSpDPM is considered as one preferable technique to reduce energy consumption. There have been optimal DVFSpDPM algorithms for periodic independent tasks running on uniprocessor in the literature. Optimal combination of DVFS and DPM for periodic dependent tasks on multicore systems is however not yet reported. The challenge of this problem is that the idle intervals of cores are not easy to model. In this article, a novel technique is proposed to directly model the idle intervals of individual cores such that both DVFS and DPM can be optimized at the same time. Based on this technique, the energy optimization problem is formulated by means of mixed integrated linear programming. We also present techniques to prune the exploration space of the formulation. Experimental results using real-world benchmarks demonstrate the effectiveness of our approach compared to existing approaches.

...read moreread less

108 citations

Journal Article•DOI•

Near optimal rate selection for wireless control systems

[...]

Abusayeed Saifullah¹, Chengjie Wu¹, Paras Babu Tiwari¹, You Xu¹, Yong Fu¹, Chenyang Lu¹, Yixin Chen¹ - Show less +3 more•Institutions (1)

Washington University in St. Louis¹

01 Apr 2014-ACM Transactions in Embedded Computing Systems

TL;DR: This paper addresses the scheduling-control co-design problem of determining the optimal sampling rates of feedback control loops sharing a Wireless Hart network by formulate rate selection as a differentiable convex optimization problem that provides a closed-form solution through a gradient descent method.

...read moreread less

Abstract: With the advent of industrial standards such as WirelessHART, process industries are now gravitating towards wireless control systems. Due to limited bandwidth in a wireless network shared by multiple control loops, it is critical to optimize the overall control performance. In this article, we address the scheduling-control co-design problem of determining the optimal sampling rates of feedback control loops sharing a WirelessHART network. The objective is to minimize the overall control cost while ensuring that all data flows meet their end-to-end deadlines. The resulting constrained optimization based on existing delay bounds for WirelessHART networks is challenging since it is nondifferentiable, nonlinear, and not in closed-form. We propose four methods to solve this problem. First, we present a subgradient method for rate selection. Second, we propose a greedy heuristic that usually achieves low control cost while significantly reducing the execution time. Third, we propose a global constrained optimization algorithm using a simulated annealing (SA) based penalty method. We study SA method under both constant factor penalty and adaptive penalty. Finally, we formulate rate selection as a differentiable convex optimization problem that provides a quick solution through a convex optimization technique. This is based on a new delay bound that is convex and differentiable, and hence simplifies the optimization problem. We study both the gradient descent method and the interior point method to solve it. We evaluate all methods through simulations based on topologies of a 74-node wireless sensor network testbed. The subgradient method is disposed to incur the longest execution time as well as the highest control cost among all methods. Among the SA-based constant penalty method, the greedy heuristic, and the gradient descent method, the first two represent the opposite ends of the tradeoff between control cost and execution time, while the third one hits the balance between the two. We further observe that the SA based adaptive penalty method is superior to the constant penalty method, and that the interior point method is superior to the gradient method. Thus, the interior point method and the SA-based adaptive penalty method are the two most effective approaches for rate selection. While both methods are competitive against each other in terms of control cost, the interior point method is significantly faster than the penalty method. As a result, the interior point method upon convex relaxation is more suitable for online rate adaptation than the SA based adaptive penalty method due to their significant difference in run-time efficiency.

...read moreread less

77 citations

Journal Article•DOI•

Throughput maximization for periodic real-time systems under the maximal temperature constraint

[...]

Huang Huang¹, Vivek Chaturvedi¹, Gang Quan¹, Jeffrey Fan¹, Meikang Qiu² - Show less +1 more•Institutions (2)

Florida International University¹, San Jose State University²

27 Jan 2014-ACM Transactions in Embedded Computing Systems

TL;DR: The proposed methods not only consistently outperform the existing approaches in terms of throughput maximization, but also significantly improve the feasibility of tasks when a more stringent temperature constraint is imposed.

...read moreread less

Abstract: In this article, we study the problem of how to maximize the throughput of a periodic real-time system under a given peak temperature constraint. We assume that different tasks in our system may have different power and thermal characteristics. Two scheduling approaches are presented. The first is built upon processors that can be in either active or sleep mode. By judiciously selecting tasks with different thermal characteristics as well as alternating the processor's active/sleep mode, the sleep period required to cool down the processor is kept at a minimum level, and, as the result, the throughput is maximized. We further extend this approach for processors with dynamic voltage/frequency scaling (DVFS) capability. Our experiments on a large number of synthetic test cases as well as real benchmark programs show that the proposed methods not only consistently outperform the existing approaches in terms of throughput maximization, but also significantly improve the feasibility of tasks when a more stringent temperature constraint is imposed.

...read moreread less

75 citations

Journal Article•DOI•

A Unified WCET analysis framework for multicore platforms

[...]

Sudipta Chattopadhyay¹, Lee Kee Chong¹, Abhik Roychoudhury¹, Timon Kelter², Peter Marwedel², Heiko Falk³ - Show less +2 more•Institutions (3)

National University of Singapore¹, Technical University of Dortmund², University of Ulm³

01 Apr 2014-ACM Transactions in Embedded Computing Systems

TL;DR: This work differs by modeling the interaction of shared cache and shared bus with other basic micro-architectural components (e.g. pipeline and branch predictor) by assuming a timing anomaly free multi-core architecture for computing the WCET.

...read moreread less

Abstract: With the advent of multicore architectures, worst-case execution time (WCET) analysis has become an increasingly difficult problem. In this article, we propose a unified WCET analysis framework for multicore processors featuring both shared cache and shared bus. Compared to other previous works, our work differs by modeling the interaction of shared cache and shared bus with other basic microarchitectural components (e.g., pipeline and branch predictor). In addition, our framework does not assume a timing anomaly free multicore architecture for computing the WCET. A detailed experiment methodology suggests that we can obtain reasonably tight WCET estimates in a wide range of benchmark programs.

...read moreread less

70 citations

Journal Article•DOI•

Invasive Tightly-Coupled Processor Arrays: A Domain-Specific Architecture/Compiler Co-Design Approach

[...]

Frank Hannig¹, Vahid Lari¹, Srinivas Boppu¹, Alexandru Tanase¹, Oliver Reiche¹ - Show less +1 more•Institutions (1)

University of Erlangen-Nuremberg¹

01 Apr 2014-ACM Transactions in Embedded Computing Systems

TL;DR: This work introduces a novel class of massively parallel processor architectures called invasive Tightly-Coupled Processor Arrays (TCPAs) and presents a seamless mapping flow for TCPAs, based on a domain-specific language, and outlines a complete symbolic mapping approach.

...read moreread less

Abstract: We introduce a novel class of massively parallel processor architectures called invasive Tightly-Coupled Processor Arrays (TCPAs). The presented processor class is a highly parameterizable template which can be tailored before runtime to fulfill costumers' requirements such as performance, area cost, and energy efficiency. These programmable accelerators are well suited for domain-specific computing from the areas of signal, image, and video processing as well as other streaming processing applications. To overcome future scaling issues (e.g., power consumption, reliability, resource management, as well as application parallelization and mapping), TCPAs are inherently designed in way that they support self-adaptivity and resource awareness at hardware level. Here, we follow a recently introduced resource-aware parallel computing paradigm called invasive computing where an application can dynamically claim, execute, and release the resources. Furthermore, we show how invasive computing can be used as an enabler for power management. For the first time, we present a seamless mapping flow for TCPAs, based on a domain-specific language. Moreover, we outline a complete symbolic mapping approach. Finally, we support our claims by comparing a TCPA against an ARM Mali-T604 GPU in terms of performance and energy efficiency.

...read moreread less

66 citations

Journal Article•DOI•

Bluetooth aided mobile phone localization: A nonlinear neural circuit approach

[...]

Shuai Li¹, Yuesheng Lou, Bo Liu²•Institutions (2)

Stevens Institute of Technology¹, University of Massachusetts Amherst²

10 Mar 2014-ACM Transactions in Embedded Computing Systems

TL;DR: A recurrent neural network is developed to solve the problem distributively in real time on the Bluetooth network and the solution feasibility to the defined problem are both theoretically proven.

...read moreread less

Abstract: It is meaningful to design a strategy to roughly localize mobile phones without a GPS by exploiting existing conditions and devices especially in environments without GPS availability (e.g., tunnels, subway stations, etc.). The availability of Bluetooth devices for most phones and the existence of a number of GPS equipped phones in a crowd of phone users enable us to design a Bluetooth aided mobile phone localization strategy. With the position of GPS equipped phones as beacons, and with the Bluetooth connection between neighbor phones as proximity constraints, we formulate the problem into an inequality problem defined on the Bluetooth network. A recurrent neural network is developed to solve the problem distributively in real time. The convergence of the neural network and the solution feasibility to the defined problem are both theoretically proven. The hardware implementation architecture of the proposed neural network is also given in this article. As applications, rough localizations of drivers in a tunnel and localization of customers in a supermarket are explored and simulated. Simulations demonstrate the effectiveness of the proposed method.

...read moreread less

54 citations

Journal Article•DOI•

Energy-aware task mapping and scheduling for reliable embedded computing systems

[...]

Anup Das¹, Akash Kumar¹, Bharadwaj Veeravalli¹•Institutions (1)

National University of Singapore¹

27 Jan 2014-ACM Transactions in Embedded Computing Systems

TL;DR: A design-time (offline) multi-criterion optimization technique for application mapping on embedded multiprocessor systems to minimize energy consumption for all processor fault-scenarios and a scheduling technique based on self-timed execution to minimize the schedule storage and construction overhead at runtime are proposed.

...read moreread less

Abstract: Task mapping and scheduling are critical in minimizing energy consumption while satisfying the performance requirement of applications enabled on heterogeneous multiprocessor systems. An area of growing concern for modern multiprocessor systems is the increase in the failure probability of one or more component processors. This is especially critical for applications where performance degradation (e.g., throughput) directly impacts the quality of service requirement. This article proposes a design-time (offline) multi-criterion optimization technique for application mapping on embedded multiprocessor systems to minimize energy consumption for all processor fault-scenarios. A scheduling technique is then proposed based on self-timed execution to minimize the schedule storage and construction overhead at runtime. Experiments conducted with synthetic and real applications from streaming and nonstreaming domains on heterogeneous MPSoCs demonstrate that the proposed technique minimizes energy consumption by 22p and design space exploration time by 100x, while satisfying the throughput requirement for all processor fault-scenarios. For scalable throughput applications, the proposed technique achieves 30p better throughput per unit energy, compared to the existing techniques. Additionally, the self-timed execution-based scheduling technique minimizes schedule construction time by 95p and storage overhead by 92p.

...read moreread less

51 citations

Journal Article•DOI•

On self-tuning networks-on-chip for dynamic network-flow dominance adaptation

[...]

Xiaohang Wang¹, Mei Yang², Yingtao Jiang², Peng Liu³, Masoud Daneshtalab⁴, Maurizio Palesi⁵, Terrence Mak⁶ - Show less +3 more•Institutions (6)

Chinese Academy of Sciences¹, University of Nevada, Las Vegas², Zhejiang University³, University of Turku⁴, Kore University of Enna⁵, The Chinese University of Hong Kong⁶

27 Jan 2014-ACM Transactions in Embedded Computing Systems

TL;DR: A method that is capable of identifying critical pathways in a network at run-time and, then, can dynamically reconfigure the network to optimize for the network performance subjected to the identified dominated flows is introduced.

...read moreread less

Abstract: Modern network-on-chip (NoC) systems are required to handle complex runtime traffic patterns and unprecedented applications. Data traffics of these applications are difficult to fully comprehend at design time so as to optimize the network design. However, it has been discovered that the majority of dataflows in a network are dominated by less than 10p of the specific pathways. In this article, we introduce a method that is capable of identifying critical pathways in a network at runtime and can then dynamically reconfigure the network to optimize for network performance subject to the identified dominated flows. An online learning and analysis scheme is employed to quickly discover the emerging dominated traffic flows and provides a statistical traffic prediction using regression analysis. The architecture of a self-tuning network is also discussed which can be reconfigured by setting up the identified point-to-point paths for the dominance dataflows in large traffic volumes. The merits of this new approach are experimentally demonstrated using comprehensive NoC simulations. Compared to the conventional network architectures over a range of realistic applications, the proposed self-tuning network approach can effectively reduce the latency and power consumption by as much as 25p and 24p, respectively. We also evaluated the configuration time and additional hardware cost. This new approach demonstrates the capability of an adaptive NoC to handle more complex and dynamic applications.

...read moreread less

47 citations

Journal Article•DOI•

Designing Trusted Embedded Systems from Finite State Machines

[...]

Carson Dunbar¹, Gang Qu¹•Institutions (1)

University of Maryland, College Park¹

06 Oct 2014-ACM Transactions in Embedded Computing Systems

TL;DR: This article finds that the traditional FSM synthesis procedure will introduce security risks and cannot guarantee trustworthiness in the implemented circuits, and proposes a novel approach to designing trusted circuits from the FSM specification.

...read moreread less

Abstract: Sequential components are crucial for a real-time embedded system as they control the system based on the system's current state and real life input. In this article, we explore the security and trust issues of sequential system design from the perspective of a finite state machine (FSM), which is the most popular model used to describe sequential systems. Specifically, we find that the traditional FSM synthesis procedure will introduce security risks and cannot guarantee trustworthiness in the implemented circuits. Indeed, we show that not only do there exist simple and effective ways to attack a sequential system, it is also possible to insert a hardware Trojan Horse into the design without introducing any significant design overhead. We then formally define the notion of trust in FSM and propose a novel approach to designing trusted circuits from the FSM specification. We demonstrate both our findings on the security threats and the effectiveness of our proposed method on Microelectronics Center of North Carolina (MCNC) sequential circuit benchmarks.

...read moreread less

Journal Article•DOI•

METEOR: Hybrid photonic ring-mesh network-on-chip for multicore architectures

[...]

Shirish Bahirat¹, Sudeep Pasricha¹•Institutions (1)

Colorado State University¹

28 Mar 2014-ACM Transactions in Embedded Computing Systems

TL;DR: Experimental results indicate a strong motivation to consider the proposed architecture for future CMPs, as it can provide about 5× reduction in power consumption and improved throughput and access latencies, compared to traditional electrical 2D mesh and torus NoC architectures.

...read moreread less

Abstract: With increasing application complexity and improvements in process technology, Chip MultiProcessors (CMPs) with tens to hundreds of cores on a chip are becoming a reality. Networks-on-Chip (NoCs) have emerged as scalable communication fabrics that can support high bandwidths for these massively parallel multicore systems. However, traditional electrical NoC implementations still need to overcome the challenges of high data transfer latencies and large power consumption. On-chip photonic interconnects with high performance-per-watt characteristics have recently been proposed as an alternative to address these challenges for intra-chip communication. In this article, we explore using low-cost photonic interconnects on a chip to enhance traditional electrical NoCs. Our proposed hybrid photonic ring-mesh NoC (METEOR) utilizes a configurable photonic ring waveguide coupled to a traditional 2D electrical mesh NoC. Experimental results indicate a strong motivation to consider the proposed architecture for future CMPs, as it can provide about 5× reduction in power consumption and improved throughput and access latencies, compared to traditional electrical 2D mesh and torus NoC architectures. Compared to other previously proposed hybrid photonic NoC fabrics such as the hybrid photonic torus, Corona, and Firefly, our proposed fabric is also shown to have lower photonic area overhead, power consumption, and energy-delay product, while maintaining competitive throughput and latency.

...read moreread less

Journal Article•DOI•

Online learning of timeout policies for dynamic power management

[...]

Umair Ali Khan¹, Bernhard Rinner¹•Institutions (1)

Alpen-Adria-Universität Klagenfurt¹

10 Mar 2014-ACM Transactions in Embedded Computing Systems

TL;DR: This article presents a reinforcement learning (RL)-based DPM technique for optimal selection of timeout values in the different device states and shows that the proposed learning algorithm not only adequately explores the power-performance trade-off with nonstationary workload but can also successfully perform online adjustment of the trade-offs parameter in order to meet the user-specified constraint.

...read moreread less

Abstract: Dynamic power management (DPM) refers to strategies which selectively change the operational states of a device during runtime to reduce the power consumption based on the past usage pattern, the current workload, and the given performance constraint. The power management problem becomes more challenging when the workload exhibits nonstationary behavior which may degrade the performance of any single or static DPM policy. This article presents a reinforcement learning (RL)-based DPM technique for optimal selection of timeout values in the different device states. Each timeout period determines how long the device will remain in a particular state before the transition decision is taken. The timeout selection is based on workload estimates derived from a Multilayer Artificial Neural Network (ML-ANN) and an objective function given by weighted performance and power parameters. Our DPM approach is further able to adapt the power-performance weights online to meet user-specified power and performance constraints, respectively. We have completely implemented our DPM algorithm on our embedded traffic surveillance platform and performed long-term experiments using real traffic data to demonstrate the effectiveness of the DPM. Our results show that the proposed learning algorithm not only adequately explores the power-performance trade-off with nonstationary workload but can also successfully perform online adjustment of the trade-off parameter in order to meet the user-specified constraint.

...read moreread less

Journal Article•DOI•

Real-Time Power Management for Embedded M2M Using Intelligent Learning Methods

[...]

Anand Paul¹•Institutions (1)

Kyungpook National University¹

23 Jul 2014-ACM Transactions in Embedded Computing Systems

TL;DR: A novel approach is presented based on reinforcement learning to predict the best policy amidst existing DPM policies and deterministic markovian nonstationary policies (DMNSP) that supports different devices according to their DPM.

...read moreread less

Abstract: In this work, an embedded system working model is designed with one server that receives requests by a requester by a service queue that is monitored by a Power Manager (PM). A novel approach is presented based on reinforcement learning to predict the best policy amidst existing DPM policies and deterministic markovian nonstationary policies (DMNSP). We apply reinforcement learning, namely a computational approach to understanding and automating goal-directed learning that supports different devices according to their DPM. Reinforcement learning uses a formal framework defining the interaction between agent and environment in terms of states, response action, and reward points. The capability of this approach is demonstrated by an event-driven simulator designed using Java with a power-manageable machine-to-machine device. Our experiment result shows that the proposed dynamic power management with timeout policy gives average power saving from 4p to 21p and the novel dynamic power management with DMNSP gives average power saving from 10p to 28p more than already proposed DPM policies.

...read moreread less

Journal Article•DOI•

Hardware Acceleration for an Accurate Stereo Vision System Using Mini-Census Adaptive Support Region

[...]

Yi Shan¹, Yuchen Hao¹, Wenqiang Wang¹, Yu Wang¹, Xu Chen¹, Huazhong Yang¹, Wayne Luk² - Show less +3 more•Institutions (2)

Tsinghua University¹, Imperial College London²

01 Apr 2014-ACM Transactions in Embedded Computing Systems

TL;DR: A Mini-Census ADaptive Support Region (MCADSR) stereo matching algorithm is used as a case study due to its high accuracy and representative operations in this domain and several efficient optimization methods including vertical-first cost aggregation, hybrid parallel processing, and hardware-friendly integral image are proposed.

...read moreread less

Abstract: Domain of stereo vision is highly important in the fields of autonomous cars, video tolling, robotics, and aerial surveys. The specific feature of this domain is that we should handle not only the pixel-by-pixel 2D processing in one image but also the 3D processing for depth estimation by comparing information about a scene from several images with different perspectives. This feature brings challenges to memory resource utilization, because an extra dimension of data has to be buffered. Due to the memory limitation, few of previous stereo vision implementations provide both accurate and high-speed processing for high-resolution images at the same time.To achieve domain-specific acceleration for stereo vision, the memory limitation has to be addressed. This article uses a Mini-Census ADaptive Support Region (MCADSR) stereo matching algorithm as a case study due to its high accuracy and representative operations in this domain. To relieve the memory limitation and achieve high-speed processing, the article proposes several efficient optimization methods including vertical-first cost aggregation, hybrid parallel processing, and hardware-friendly integral image. The article also presents a customizable system which provides both accurate and high-speed stereo matching for high-resolution images. The benefits of applying the optimization methods to the system are highlighted.With the aforesaid optimization and specific customization implemented on FPGA, the demonstrated system can process 47.6 fps (frames per second) and 129 fps for video size of 1920 × 1080 with a large disparity range of 256 and 1024 × 768 with a disparity range of 128, respectively. Our results are up to 1.64 times better than previous work in terms of Million Disparity Estimation per second (MDE/s). For accuracy, the 7.65p overall average error rate outperforms current work which can provide real-time processing with this high-resolution and large disparity range.

...read moreread less

Journal Article•DOI•

NoC contention analysis using a branch-and-prune algorithm

[...]

Dakshina Dasari¹, Borislav Nikolic¹, Vincent Nélis¹, Stefan M. Petters¹•Institutions (1)

Polytechnic Institute of Porto¹

28 Mar 2014-ACM Transactions in Embedded Computing Systems

TL;DR: This work identifies and explores some limitations in the existing recursive-calculus-based approaches to compute the Worst-Case Traversal Time (WCTT) of a packet, and introduces a more general approach, namely “Branch, Prune and Collapse” (BPC) which offers a configurable parameter that provides a flexible trade-off between the computational complexity and the tightness of the computed estimate.

...read moreread less

Abstract: “Many-core” systems based on a Network-on-Chip (NoC) architecture offer various opportunities in terms of performance and computing capabilities, but at the same time they pose many challenges for the deployment of real-time systems, which must fulfill specific timing requirements at runtime. It is therefore essential to identify, at design time, the parameters that have an impact on the execution time of the tasks deployed on these systems and the upper bounds on the other key parameters. The focus of this work is to determine an upper bound on the traversal time of a packet when it is transmitted over the NoC infrastructure. Towards this aim, we first identify and explore some limitations in the existing recursive-calculus-based approaches to compute the Worst-Case Traversal Time (WCTT) of a packet. Then, we extend the existing model by integrating the characteristics of the tasks that generate the packets. For this extended model, we propose an algorithm called “Branch and Prune” (BP). Our proposed method provides tighter and safe estimates than the existing recursive-calculus-based approaches. Finally, we introduce a more general approach, namely “Branch, Prune and Collapse” (BPC) which offers a configurable parameter that provides a flexible trade-off between the computational complexity and the tightness of the computed estimate. The recursive-calculus methods and BP present two special cases of BPC when a trade-off parameter is 1 or ∞, respectively. Through simulations, we analyze this trade-off, reason about the implications of certain choices, and also provide some case studies to observe the impact of task parameters on the WCTT estimates.

...read moreread less

Journal Article•DOI•

Energy Efficiency Analysis for the Single Frequency Approximation (SFA) Scheme

[...]

Santiago Pagani¹, Jian-Jia Chen¹•Institutions (1)

Karlsruhe Institute of Technology¹

06 Oct 2014-ACM Transactions in Embedded Computing Systems

TL;DR: The analysis shows that SFA is indeed an effective scheme under practical settings, even though it is not optimal, and any uni-core dynamic power management technique for reducing the energy consumption for idling can be easily incorporated individually on each core in the voltage island.

...read moreread less

Abstract: Energy-efficient designs are important issues in computing systems This article studies the energy efficiency of a simple and linear-time strategy, called the Single Frequency Approximation (SFA) scheme, for periodic real-time tasks on multicore systems with a shared supply voltage in a voltage island The strategy executes all the cores at a single frequency to just meet the timing constraints SFA has been adopted in the literature after task partitioning, but the worst-case performance of SFA in terms of energy consumption incurred is an open problem We provide comprehensive analysis for SFA to derive the cycle utilization distribution for its worst-case behaviour for energy minimization Our analysis shows that the energy consumption incurred by using SFA for task execution is at most 153 (174, 210, 269, respectively), compared to the energy consumption of the optimal voltage/frequency scaling, when the dynamic power consumption is a cubic function of the frequency and the voltage island has up to 4 (8, 16, 32, respectively) cores The analysis shows that SFA is indeed an effective scheme under practical settings, even though it is not optimal Furthermore, since all the cores run at a single frequency and no frequency alignment for Dynamic Voltage and Frequency Scaling (DVFS) between cores is needed, any unicore dynamic power management technique for reducing the energy consumption for idling can be easily incorporated individually on each core in the voltage island This article also provides an analysis of energy consumption for SFA combined with procrastination for Dynamic Power Management (DPM), resulting in an increment of 1 from the previous results for task execution Furthermore, we also extend our analysis for deriving the approximation factor of SFA for a multicore system with multiple voltage islands

...read moreread less

Journal Article•DOI•

Safety-critical medical device development using the UPP2SF model translation tool

[...]

Miroslav Pajic¹, Zhihao Jiang¹, Insup Lee¹, Oleg Sokolsky¹, Rahul Mangharam¹ - Show less +1 more•Institutions (1)

University of Pennsylvania¹

01 Apr 2014-ACM Transactions in Embedded Computing Systems

TL;DR: The UPP2SF model-translation tool, which facilitates automatic conversion of verified models (in UPPAAL) to models that may be simulated and tested (in Simulink/Stateflow) and the translation rules that ensure correct model conversion are described, applicable to a large class of models.

...read moreread less

Abstract: Software-based control of life-critical embedded systems has become increasingly complex, and to a large extent has come to determine the safety of the human being. For example, implantable cardiac pacemakers have over 80,000 lines of code which are responsible for maintaining the heart within safe operating limits. As firmware-related recalls accounted for over 41p of the 600,000 devices recalled in the last decade, there is a need for rigorous model-driven design tools to generate verified code from verified software models. To this effect, we have developed the UPP2SF model-translation tool, which facilitates automatic conversion of verified models (in UPPAAL) to models that may be simulated and tested (in Simulink/Stateflow). We describe the translation rules that ensure correct model conversion, applicable to a large class of models. We demonstrate how UPP2SF is used in the model-driven design of a pacemaker whose model is (a) designed and verified in UPPAAL (using timed automata), (b) automatically translated to Stateflow for simulation-based testing, and then (c) automatically generated into modular code for hardware-level integration testing of timing-related errors. In addition, we show how UPP2SF may be used for worst-case execution time estimation early in the design stage. Using UPP2SF, we demonstrate the value of integrated end-to-end modeling, verification, code-generation and testing process for complex software-controlled embedded systems.

...read moreread less

Journal Article•DOI•

Dynamic Behavior Specification and Dynamic Mapping for Real-Time Embedded Systems: HOPES Approach

[...]

Hanwoong Jung¹, Chanhee Lee¹, Shin-haeng Kang¹, Sungchan Kim², Hyunok Oh³, Soonhoi Ha¹ - Show less +2 more•Institutions (3)

Seoul National University¹, Chonbuk National University², Hanyang University³

01 Apr 2014-ACM Transactions in Embedded Computing Systems

TL;DR: A hybrid specification of dataflow and FSM models is proposed to specify the dynamic behavior of a system distinguishing inter- and intra-application dynamism and the proposed technique is implemented in the HOPES design environment.

...read moreread less

Abstract: As the number of processors in a chip increases and more functions are integrated, the system status will change dynamically due to various factors such as the workload variation, QoS requirement, and unexpected component failure A typical method to deal with the dynamics of the system is to decide the mapping decision at runtime, based on the local information of the system status It is very challenging to guarantee any real-time performance of a certain application in such a dynamically varying system To solve this problem, we propose a hybrid specification of dataflow and FSM models to specify the dynamic behavior of a system distinguishing inter- and intra-application dynamism At the top level, each application is specified by a dataflow task and the dynamic behavior is modeled as a control task that supervises the execution of applications Inside a dataflow task, we specify the dynamic behavior using a similar way as FSM-based SADF in which an application is specified by a synchronous dataflow graph for each mode of operation It enables us to perform compile-time scheduling of each graph to maximize the throughput varying the number of allocated processors, and store the scheduling information When a change in system state is detected at runtime, the number of allocated processors to the active tasks is determined dynamically utilizing the stored scheduling information of those tasks in order to meet the real-time requirements The proposed technique is implemented in the HOPES design environment Through preliminary experiments with a simple smartphone example, we show the viability of the proposed methodology

...read moreread less

Journal Article•DOI•

Achieving energy-synchronized communication in energy-harvesting wireless sensor networks

[...]

Yu Gu¹, Liang He¹, Ting Zhu², Tian He³•Institutions (3)

Singapore University of Technology and Design¹, Binghamton University², University of Minnesota³

27 Jan 2014-ACM Transactions in Embedded Computing Systems

TL;DR: This work presents Energy-Synchronized Communication (ESC) as a transparent middleware between the network layer and MAC layer that controls the amount and timing of RF activity at receiving nodes that is implemented on MicaZ nodes with two state-of-the-art routing protocols.

...read moreread less

Abstract: With advances in energy-harvesting techniques, it is now feasible to build sustainable sensor networks to support long-term applications. Unlike battery-powered sensor networks, the objective of sustainable sensor networks is to effectively utilize a continuous stream of ambient energy. Instead of pushing the limits of energy conservation, we aim to design energy-synchronized schemes that keep energy supplies and demands in balance. Specifically, this work presents Energy-Synchronized Communication (ESC) as a transparent middleware between the network layer and MAC layer that controls the amount and timing of RF activity at receiving nodes. In this work, we first derive a delay model for cross-traffic at individual nodes, which reveals an interesting stair effect. This effect allows us to design a localized energy synchronization control with o(d3) time complexity that shuffles or adjusts the working schedule of a node to optimize cross-traffic delays in the presence of changing duty cycle budgets, where d is the node degree in the network. Under different rates of energy fluctuations, shuffle-based and adjustment-based methods have different influences on logical connectivity and cross-traffic delay, due to the inconsistent views of working schedules among neighboring nodes before schedule updates. We study the trade-off between them and propose methods for updating working schedules efficiently. To evaluate our work, ESC is implemented on MicaZ nodes with two state-of-the-art routing protocols. Both testbed experiment and large-scale simulation results show significant performance improvements over randomized synchronization controls.

...read moreread less

Journal Article•DOI•

Implementation and evaluation of mixed-criticality scheduling approaches for sporadic tasks

[...]

Huang-Ming Huang¹, Christopher Gill¹, Chenyang Lu¹•Institutions (1)

Washington University in St. Louis¹

01 Apr 2014-ACM Transactions in Embedded Computing Systems

TL;DR: Empirical evaluations demonstrate that user-space implementations of mechanisms to enforce different mixed-criticality scheduling approaches can be achieved atop Linux without kernel modification, with reasonably low (but in some cases nontrivial) overhead for mixed- criticality real-time task sets.

...read moreread less

Abstract: Traditional fixed-priority scheduling analysis for periodic and sporadic task sets is based on the assumption that all tasks are equally critical to the correct operation of the system. Therefore, every task has to be schedulable under the chosen scheduling policy, and estimates of tasks' worst-case execution times must be conservative in case a task runs longer than is usual. To address the significant underutilization of a system's resources under normal operating conditions that can arise from these assumptions, several mixed-criticality scheduling approaches have been proposed. However, to date, there have been few quantitative comparisons of system schedulability or runtime overhead for the different approaches.In this article, we present a side-by-side implementation and evaluation of the known mixed-criticality scheduling approaches, for periodic and sporadic mixed-criticality tasks on uniprocessor systems, under a mixed-criticality scheduling model that is common to all these approaches. To make a fair evaluation of mixed-criticality scheduling, we also address previously open issues and propose modifications to improve particular approaches. Our empirical evaluations demonstrate that user-space implementations of mechanisms to enforce different mixed-criticality scheduling approaches can be achieved atop Linux without kernel modification, with reasonably low (but in some cases nontrivial) overhead for mixed-criticality real-time task sets.

...read moreread less

Journal Article•DOI•

Sequentially Constructive Concurrency—A Conservative Extension of the Synchronous Model of Computation

[...]

Reinhard von Hanxleden¹, Michael Mendler², Joaquin Aguado², Björn Duderstadt¹, Insa Fuhrmann¹, Christian Motika¹, Stephen R. Mercer³, Owen O'Brien³, Partha S. Roop⁴ - Show less +5 more•Institutions (4)

University of Kiel¹, University of Bamberg², National Instruments³, University of Auckland⁴

28 Jul 2014-ACM Transactions in Embedded Computing Systems

Abstract: Synchronous languages ensure determinate concurrency but at the price of restrictions on what programs are considered valid, or constructive. Meanwhile, sequential languages such as C and Java offer an intuitive, familiar programming paradigm but provide no guarantees with regard to determinate concurrency. The sequentially constructive (SC) model of computation (MoC) presented here harnesses the synchronous execution model to achieve determinate concurrency while taking advantage of familiar, convenient programming paradigms from sequential languages.In essence, the SC MoC extends the classical synchronous MoC by allowing variables to be read and written in any order and multiple times, as long as the sequentiality expressed in the program provides sufficient scheduling information to rule out race conditions. This allows to use programming patterns familiar from sequential programming, such as testing and later setting the value of a variable, which are forbidden in the standard synchronous MoC. The SC MoC is a conservative extension in that programs considered constructive in the common synchronous MoC are also SC and retain the same semantics. In this article, we investigate classes of shared variable accesses, define SC-admissible scheduling as a restriction of “free scheduling,” derive the concept of sequential constructiveness, and present a priority-based scheduling algorithm for analyzing and compiling SC programs efficiently.

...read moreread less

Journal Article•DOI•

MultiNets: A system for real-time switching between multiple network interfaces on mobile devices

[...]

Shahriar Nirjon¹, Angela Nicoara², Cheng-Hsin Hsu², Jatinder Pal Singh², John A. Stankovic¹ - Show less +1 more•Institutions (2)

University of Virginia¹, Telekom Innovation Laboratories²

01 Apr 2014-ACM Transactions in Embedded Computing Systems

TL;DR: MultiNets is deployed in a real-world scenario and its experimental results show that depending on the user requirements, it outperforms the state-of-the-art Android system either by saving up to 33.75p energy, achieving near-optimal offloading, or achievingNear-Optimal throughput while substantially reducing TCP interruptions due to switching.

...read moreread less

Abstract: MultiNets is a system supporting seamless switch-over between wireless interfaces on mobile devices in real-time. MultiNets is configurable to run in three different modes: (i) Energy Saving mode--for choosing the interface that saves the most energy based on the condition of the device, (ii) Offload mode--for offloading data traffic from the cellular to WiFi network, and (iii) Performance mode--for selecting the network for the fastest data connectivity. MultiNets also provides a powerful API that gives the application developers: (i) the choice to select a network interface to communicate with a specific server, and (ii) the ability to simultaneously transfer data over multiple network interfaces. MultiNets is modular, easily integrable, lightweight, and applicable to various mobile operating systems. We implement MultiNets on Android devices as a show case. MultiNets does not require any extra support from the network infrastructure and runs existing applications transparently.To evaluate MultiNets, we first collect data traces from 13 actual Android smartphone users over three months. We then use the collected traces to show that, by automatically switching to WiFi whenever it is available, MultiNets can offload on average 79.82p of the data traffic. We also illustrate that, by optimally switching between the interfaces, MultiNets can save on average 21.14 KJ of energy per day, which is equivalent to 27.4p of the daily energy usage. Using our API, we demonstrate that a video streaming application achieves 43--271p higher streaming rate when concurrently using WiFi and 3G interfaces. We deploy MultiNets in a real-world scenario and our experimental results show that depending on the user requirements, it outperforms the state-of-the-art Android system either by saving up to 33.75p energy, achieving near-optimal offloading, or achieving near-optimal throughput while substantially reducing TCP interruptions due to switching.

...read moreread less

Journal Article•DOI•

Elon: Enabling efficient and long-term reprogramming for wireless sensor networks

[...]

Wei Dong¹, Yunhao Liu², Chun Chen¹, Lin Gu³, Xiaofan Wu¹ - Show less +1 more•Institutions (3)

Zhejiang University¹, Tsinghua University², Hong Kong University of Science and Technology³

10 Mar 2014-ACM Transactions in Embedded Computing Systems

TL;DR: A new mechanism called Elon for enabling efficient and long-term reprogramming in wireless sensor networks that reduces the transferred code size significantly by introducing the concept of replaceable component and significantly prolongs the reprogrammable lifetime.

...read moreread less

Abstract: We present a new mechanism called Elon for enabling efficient and long-term reprogramming in wireless sensor networks. Elon reduces the transferred code size significantly by introducing the concept of replaceable component. It avoids the cost of hardware reboot with a novel software reboot mechanism. Moreover, it significantly prolongs the reprogrammable lifetime (i.e., the time period during which the sensor nodes can be reprogrammed) by avoiding flash writes for TelosB nodes. Experimental results show that Elon transfers up to 120--389 times less information than Deluge, and 18--42 times less information than Stream. The software reboot mechanism that Elon applies reduces the rebooting cost by 50.4p--53.87p in terms of beacon packets, and 56.83p in terms of unsynchronized nodes. In addition, Elon prolongs the reprogrammable lifetime by a factor of 3.3.

...read moreread less

Journal Article•DOI•

STEAM: A Smart Temperature and Energy Aware Multicore Controller

[...]

Vinay Hanumaiah¹, Digant Desai¹, Benjamin Gaudette¹, Carole-Jean Wu¹, Sarma Vrudhula¹ - Show less +1 more•Institutions (1)

Arizona State University¹

06 Oct 2014-ACM Transactions in Embedded Computing Systems

TL;DR: STEAM -- an optimal closed-loop DEM controller designed for multicore processors with an excellent prediction of core temperatures and power consumption, and the ability to control the core temperatures within 3ˆC of the specified maximum.

...read moreread less

Abstract: Recent empirical studies have shown that multicore scaling is fast becoming power limited, and consequently, an increasing fraction of a multicore processor has to be under clocked or powered off. Therefore, in addition to fundamental innovations in architecture, compilers and parallelization of application programs, there is a need to develop practical and effective dynamic energy management (DEM) techniques for multicore processors.Existing DEM techniques mainly target reducing processor power consumption and temperature, and only few of them have addressed improving energy efficiency for multicore systems. With energy efficiency taking a center stage in all aspects of computing, the focus of the DEM needs to be on finding practical methods to maximize processor efficiency. Towards this, this article presents STEAM -- an optimal closed-loop DEM controller designed for multicore processors. The objective is to maximize energy efficiency by dynamic voltage and frequency scaling (DVFS). Energy efficiency is defined as the ratio of performance to power consumption or performance-per-watt (PPW). This is the same as the number of instructions executed per Joule. The PPW metric is actually replaced by PαPW (performanceα-per-Watt), which allows for controlling the importance of performance versus power consumption by varying α.The proposed controller was implemented on a Linux system and tested with the Intel Sandy Bridge processor. There are three power management schemes called governors, available with Intel platforms. They are referred to as (1) Powersave (lowest power consumption), (2) Performance (achieves highest performance), and (3) Ondemand. Our simple and lightweight controller when executing SPEC CPU2006, PARSEC, and MiBench benchmarks have achieved an average of 18p improvement in energy efficiency (MIPS/Watt) over these ACPI policies. Moreover, STEAM also demonstrated an excellent prediction of core temperatures and power consumption, and the ability to control the core temperatures within 3ˆC of the specified maximum. Finally, the overhead of the STEAM implementation (in terms of CPU resources) is less than 0.25p. The entire implementation is self-contained and can be installed on any processor with very little prior knowledge of the processor.

...read moreread less

Journal Article•DOI•

OCEAN: An Optimized HW/SW Reliability Mitigation Approach for Scratchpad Memories in Real-Time SoCs

[...]

Mohamed M. Sabry¹, David Atienza¹, Francky Catthoor²•Institutions (2)

École Polytechnique Fédérale de Lausanne¹, Katholieke Universiteit Leuven²

01 Apr 2014-ACM Transactions in Embedded Computing Systems

TL;DR: OCEAN enforces on-chip SRAMs reliability with a fault-tolerant buffer to protect a portion of the processed data used to restore from runtime error, and optimally select the buffer size to minimize the energy overhead, with timing and area constraints.

...read moreread less

Abstract: Recent process technology advances trigger reliability issues that degrade the Quality-of-Service (QoS) required by embedded Systems-on-Chip (SoCs). To maintain the required QoS with acceptable overheads, we propose OCEAN, a novel cross-layer error mitigation. OCEAN enforces on-chip SRAMs reliability with a fault-tolerant buffer. We utilize this buffer to protect a portion of the processed data used to restore from runtime error. We optimally select the buffer size to minimize the energy overhead, with timing and area constraints. OCEAN achieves full error mitigation with 10.1p average energy overhead compared to base-line operation that does not include any error correction capability, and 65p energy savings, compared to a cross-layer error mitigation mechanism.

...read moreread less

Journal Article•DOI•

Energy-efficient task allocation techniques for asymmetric multiprocessor embedded systems

[...]

Abdullah Elewi¹, Mohamed Shalan², Medhat Awadalla¹, Elsayed M. Saad¹•Institutions (2)

Helwan University¹, American University in Cairo²

27 Jan 2014-ACM Transactions in Embedded Computing Systems

TL;DR: Modifications to the traditional bin-packing techniques are proposed and novel techniques taking into account the DVFS model supported by the platform are designed.

...read moreread less

Abstract: Asymmetric multiprocessor systems are considered power-efficient multiprocessor architectures Furthermore, efficient task allocation (partitioning) can achieve more energy efficiency at these asymmetric multiprocessor platforms This article addresses the problem of energy-aware static partitioning of periodic real-time tasks on asymmetric multiprocessor (multicore) embedded systems The article formulates the problem according to the Dynamic Voltage and Frequency Scaling (DVFS) model supported by the platform and shows that it is an NP-hard problem Then, the article outlines optimal reference partitioning techniques for each case of DVFS model with suitable assumptions Finally, the article proposes modifications to the traditional bin-packing techniques and designs novel techniques taking into account the DVFS model supported by the platform All algorithms and techniques are simulated and compared The simulation shows promising results, where the proposed techniques reduced the energy consumption by 75p compared to traditional methods when DVFS is not supported and by 50p when per-core DVFS is supported by the platform

...read moreread less

Journal Article•DOI•

Tomahawk: Parallelism and heterogeneity in communications signal processing MPSoCs

[...]

Oliver Arnold¹, Emil Matus¹, Benedikt Noethen¹, Markus Winter¹, Torsten Limberg, Gerhard Fettweis¹ - Show less +2 more•Institutions (1)

Dresden University of Technology¹

28 Mar 2014-ACM Transactions in Embedded Computing Systems

TL;DR: This work compares the performance and capabilities of various CoreManager HW/SW solutions, based on ASIC, RISC and ASIP paradigms and demonstrates that the proposed ASIP-based solution approaches the performance of the ASIC realization, while preserving the full flexibility of the software (RISC-based) implementation.

...read moreread less

Abstract: Heterogeneity and parallelism in MPSoCs for 4G (and beyond) communications signal processing are inevitable in order to meet stringent power constraints and performance requirements. The question arises on how to cope with the problem of system programmability and runtime management incurred by the statically or even dynamically varying number and type of processing elements. This work addresses this challenge by proposing the concept of a heterogeneous many-core platform called Tomahawk. Apart from the definition of the system architecture, in this approach a unified framework including a model of computation, a programming interface and a dedicated runtime management unit called CoreManager is proposed. The increase of system complexity in terms of application parallelism and number of resources may lead to a dramatic increase of the management costs, hence causing performance degradation. For this reason, the efficient implementation of the CoreManager becomes a major issue in system design. This work compares the performance and capabilities of various CoreManager HW/SW solutions, based on ASIC, RISC and ASIP paradigms. The results demonstrate that the proposed ASIP-based solution approaches the performance of the ASIC realization, while preserving the full flexibility of the software (RISC-based) implementation.

...read moreread less

Journal Article•DOI•

Minimizing Stack and Communication Memory Usage in Real-Time Embedded Applications

[...]

Haibo Zeng¹, Marco Di Natale, Qi Zhu²•Institutions (2)

McGill University¹, University of California, Riverside²

23 Jul 2014-ACM Transactions in Embedded Computing Systems

TL;DR: This article targets the assignment of the scheduling parameters to minimize memory usage for systems of practical interest, including designs compliant with automotive standards and proposes algorithms either proven optimal or shown to improve on randomized optimization methods like simulated annealing.

...read moreread less

Abstract: In the development of real-time embedded applications, especially those on systems-on-chip, an efficient use of RAM memory is as important as the effective scheduling of the computation resources. The protection of communication and state variables accessed by concurrent tasks must provide real-time schedulability guarantees while using the least amount of memory. Several schemes, including preemption thresholds, have been developed to improve schedulability and save stack space by selectively disabling preemption. However, the design synthesis problem is still open. In this article, we target the assignment of the scheduling parameters to minimize memory usage for systems of practical interest, including designs compliant with automotive standards. We propose algorithms either proven optimal or shown to improve on randomized optimization methods like simulated annealing.

...read moreread less

Journal Article•DOI•

WCET analysis with MRU cache: Challenging LRU for predictability

[...]

Nan Guan¹, Mingsong Lv¹, Wang Yi¹, Ge Yu¹•Institutions (1)

Northeastern University (China)¹

01 Apr 2014-ACM Transactions in Embedded Computing Systems

TL;DR: This work studies the analysis of MRU, a non-LRU replacement policy employed in mainstream processor architectures like Intel Nehalem, and proposes a new cache hit/miss classification, k-Miss, to better capture the MRU behavior, and develops formal conditions and efficient techniques to decide the k- Miss memory accesses.

...read moreread less

Abstract: Most previous work on cache analysis for WCET estimation assumes a particular replacement policy called LRU. In contrast, much less work has been done for non-LRU policies, since they are generally considered to be very unpredictable. However, most commercial processors are actually equipped with these non-LRU policies, since they are more efficient in terms of hardware cost, power consumption and thermal output, while still maintaining almost as good average-case performance as LRU.In this work, we study the analysis of MRU, a non-LRU replacement policy employed in mainstream processor architectures like Intel Nehalem. Our work shows that the predictability of MRU has been significantly underestimated before, mainly because the existing cache analysis techniques and metrics do not match MRU well. As our main technical contribution, we propose a new cache hit/miss classification, k-Miss, to better capture the MRU behavior, and develop formal conditions and efficient techniques to decide k-Miss memory accesses. A remarkable feature of our analysis is that the k-Miss classifications under MRU are derived by the analysis result of the same program under LRU. Therefore, our approach inherits the advantages in efficiency and precision of the state-of-the-art LRU analysis techniques based on abstract interpretation. Experiments with instruction caches show that our proposed MRU analysis has both good precision and high efficiency, and the obtained estimated WCET is rather close to (typically 1p∼8p more than) that obtained by the state-of-the-art LRU analysis, which indicates that MRU is also a good candidate for cache replacement policies in real-time systems.

...read moreread less