
Showing papers in "ACM Transactions on Embedded Computing Systems in 2015"


Journal ArticleDOI
TL;DR: A hybrid task mapping algorithm that combines a static mapping exploration and a dynamic mapping optimization to achieve an overall improvement of system efficiency is presented.
Abstract: The application workloads in modern MPSoC-based embedded systems are becoming increasingly dynamic. Different applications concurrently execute and contend for resources in such systems, which could cause serious changes in the intensity and nature of the workload demands over time. To cope with the dynamism of application workloads at runtime and improve the efficiency of the underlying system architecture, this article presents a hybrid task mapping algorithm that combines a static mapping exploration and a dynamic mapping optimization to achieve an overall improvement of system efficiency. We evaluate our algorithm using a heterogeneous MPSoC system with three real applications. Experimental results reveal the effectiveness of our proposed algorithm by comparing derived solutions to the ones obtained from several other runtime mapping algorithms. In test cases with three simultaneously active applications, the mapping solutions derived by our approach have average performance improvements ranging from 45.9% to 105.9% and average energy savings ranging from 14.6% to 23.5%.

74 citations


Journal ArticleDOI
TL;DR: Results obtained from the FT modeling reveal that an FT WSN composed of duplex sensor nodes can result in as high as a 100% MTTF increase and approximately a 350% improvement in reliability over a Non-Fault-Tolerant (NFT) WSN.
Abstract: Technological advancements in communications and embedded systems have led to the proliferation of Wireless Sensor Networks (WSNs) in a wide variety of application domains. These application domains include but are not limited to mission-critical (e.g., security, defense, space, satellite) or safety-related (e.g., health care, active volcano monitoring) systems. One commonality across all WSN application domains is the need to meet application requirements (e.g., lifetime, reliability). Many application domains require that sensor nodes be deployed in harsh environments, such as on the ocean floor or in an active volcano, making these nodes more prone to failures. Sensor node failures can be catastrophic for critical or safety-related systems. This article models and analyzes fault detection and fault tolerance in WSNs. To determine the effectiveness and accuracy of fault detection algorithms, we simulate these algorithms using ns-2. We investigate the synergy between fault detection and fault tolerance and use the fault detection algorithms’ accuracies in our modeling of Fault-Tolerant (FT) WSNs. We develop Markov models for characterizing WSN reliability and Mean Time to Failure (MTTF) to facilitate WSN application-specific design. Results obtained from our FT modeling reveal that an FT WSN composed of duplex sensor nodes can result in as high as a 100% MTTF increase and approximately a 350% improvement in reliability over a Non-Fault-Tolerant (NFT) WSN. The article also highlights future research directions for the design and deployment of reliable and trustworthy WSNs.

56 citations
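
As a rough illustration of the kind of Markov reliability modeling the article describes, the sketch below computes the MTTF gain of a duplex sensor node over a simplex one using a small absorbing Markov chain. The failure rate, the coverage values, and the two-state duplex structure are illustrative assumptions, not the authors' model or data; the larger gains reported in the article come from richer models than this one.

```python
# Sketch: MTTF of a simplex (NFT) vs. duplex (FT) sensor node via a small
# absorbing Markov chain. Failure rate and coverage values are invented.

def simplex_mttf(lam):
    """Single sensor node with constant failure rate lam: MTTF = 1/lam."""
    return 1.0 / lam

def duplex_mttf(lam, coverage):
    """Two redundant units, aggregate failure rate 2*lam while both are up.
    With probability `coverage` (the fault detection accuracy) the failure
    is detected and the node degrades to simplex mode; otherwise it is
    fatal. Expected absorption time: E[T2] = 1/(2*lam) + coverage * E[T1]."""
    return 1.0 / (2.0 * lam) + coverage * (1.0 / lam)

lam = 1.0 / 100.0                    # one failure per 100 days (assumed)
for c in (0.90, 0.99, 1.00):         # detection accuracies, e.g., from ns-2
    gain = duplex_mttf(lam, c) / simplex_mttf(lam) - 1.0
    print(f"coverage={c:.2f}: MTTF gain over NFT node = {100 * gain:.0f}%")
```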


Journal ArticleDOI
TL;DR: A portable solution for enforcing runtime resource management decisions on multicore Linux systems, based on the standard control groups framework, is shown.
Abstract: The extremely advanced process technology reached by silicon manufacturing (smaller than 32nm) has led to the production of computational platforms and SoCs featuring a considerable amount of resources. Whereas on one side such multi- and many-core platforms show growing performance capabilities, on the other side they are more and more affected by power, thermal, and reliability issues. Moreover, the increased computational capabilities allow congested usage scenarios with workloads subject to mixed and time-varying requirements. Effective usage of the resources should take into account both the application requirements and resource availability, with an arbiter, namely a resource manager, in charge of resolving the resource contention among demanding applications. Current operating systems (OS) have only limited knowledge about application-specific behaviors and their time-varying requirements. Dedicated system interfaces to collect such inputs and forward them to the OS (e.g., its scheduler) are thus an interesting research area that aims at integrating the OS with an ad hoc resource manager. Such a component can exploit efficient low-level OS interfaces and mechanisms to extend its capabilities of controlling tasks and system resources. Because of the specific tasks and timings of a resource manager, this component can be easily and effectively developed as a user-space extension lying in between the OS and the controlled application. This article, which focuses on multicore Linux systems, shows a portable solution to enforce runtime resource management decisions based on the standard control groups framework. A burst and a mixed workload analysis, performed on a multicore-based NUMA platform, have reported some promising results both in terms of performance and power saving.

42 citations
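
For readers unfamiliar with the control groups interface the article builds on, here is a minimal sketch of a user-space manager enforcing a CPU partitioning decision. It assumes cgroup v1 controllers mounted under /sys/fs/cgroup and root privileges; the group name, CPU list, and share values are invented.

```python
# Sketch: a user-space resource manager enforcing a partitioning decision
# through the cgroups filesystem. Assumes cgroup v1 controllers mounted
# under /sys/fs/cgroup; on cgroup v2, cpu.weight replaces cpu.shares.
import os

CG_ROOT = "/sys/fs/cgroup"

def make_partition(name, cpus, cpu_shares, pids):
    """Confine `pids` to `cpus` and weight their CPU time by `cpu_shares`."""
    cpuset_dir = os.path.join(CG_ROOT, "cpuset", name)
    cpu_dir = os.path.join(CG_ROOT, "cpu", name)
    os.makedirs(cpuset_dir, exist_ok=True)
    os.makedirs(cpu_dir, exist_ok=True)
    with open(os.path.join(cpuset_dir, "cpuset.cpus"), "w") as f:
        f.write(cpus)                  # e.g., "0-1"
    with open(os.path.join(cpuset_dir, "cpuset.mems"), "w") as f:
        f.write("0")                   # single NUMA node assumed here
    with open(os.path.join(cpu_dir, "cpu.shares"), "w") as f:
        f.write(str(cpu_shares))
    for pid in pids:                   # attach the managed tasks
        for d in (cpuset_dir, cpu_dir):
            with open(os.path.join(d, "tasks"), "w") as f:
                f.write(str(pid))

# Example: pin a bursty application to cores 0-1 with double CPU weight.
# make_partition("burst_app", "0-1", 2048, [12345])
```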


Journal ArticleDOI
TL;DR: An optimization approach that includes an Integer Linear Programming (ILP) optimization model and a scheme to dynamically determine thread-to-core assignment is proposed and simulation analysis shows energy savings and performance gains for a variety of workloads compared to state-of-the-art schemes.
Abstract: The current trend to move from homogeneous to heterogeneous multicore systems provides compelling opportunities for achieving performance and energy efficiency goals. Running multiple threads on multicore systems poses challenges in sharing limited resources, such as memory bandwidth. We propose an optimization approach that includes an Integer Linear Programming (ILP) optimization model and a scheme to dynamically determine thread-to-core assignment. We present simulation analysis that shows energy savings and performance gains for a variety of workloads compared to state-of-the-art schemes. We implemented and evaluated a prototype of our thread assignment approach at user level, leveraging Linux scheduling and performance-monitoring capabilities.

40 citations
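
To make the optimization concrete, the sketch below solves a toy instance of the thread-to-core assignment problem, with exhaustive search standing in for the ILP solver. The energy table, per-thread bandwidth demands, and per-cluster bandwidth caps are invented for illustration.

```python
# Sketch: optimal thread-to-core assignment by exhaustive search, standing
# in for the article's ILP model. All numbers below are invented.
from itertools import permutations

threads = ["t0", "t1", "t2", "t3"]
cores = ["big0", "big1", "little0", "little1"]
cluster = {"big0": "big", "big1": "big", "little0": "little", "little1": "little"}
BW_CAP = {"big": 8, "little": 6}            # GB/s cap per cluster memory path

energy = {                                   # energy[thread][core]
    "t0": {"big0": 9, "big1": 9, "little0": 4, "little1": 4},
    "t1": {"big0": 7, "big1": 7, "little0": 6, "little1": 6},
    "t2": {"big0": 8, "big1": 8, "little0": 3, "little1": 3},
    "t3": {"big0": 6, "big1": 6, "little0": 5, "little1": 5},
}
bandwidth = {"t0": 5, "t1": 2, "t2": 4, "t3": 1}   # GB/s demand per thread

def feasible(assign):
    """Co-located threads must not exceed their cluster's bandwidth cap."""
    used = {"big": 0, "little": 0}
    for t, c in assign.items():
        used[cluster[c]] += bandwidth[t]
    return all(used[k] <= BW_CAP[k] for k in BW_CAP)

best, best_cost = None, float("inf")
for perm in permutations(cores):             # one thread per core
    assign = dict(zip(threads, perm))
    cost = sum(energy[t][assign[t]] for t in threads)
    if feasible(assign) and cost < best_cost:
        best, best_cost = assign, cost
print(best, best_cost)
```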


Journal ArticleDOI
TL;DR: A multilayered work-stealing-style algorithm for distributing work efficiently among mobile devices is presented, and attainable speedups are compared for different topologies of devices networked with Bluetooth, justifying a topology-flexible opportunistic approach.
Abstract: With the proliferation of mobile devices, and their increasingly powerful embedded processors and storage, vast resources increasingly surround users. We have been investigating the concept of on-demand ad hoc forming of groups of nearby mobile devices in the midst of crowds to cooperatively perform computationally intensive tasks as a service to local mobile users, or what we call mobile crowd computing. As devices can vary in processing power and some can leave a group unexpectedly or new devices join in, there is a need for algorithms that can distribute work in a flexible manner and still work with different arrangements of devices that can arise in an ad hoc fashion. In this article, we first argue for the feasibility of such use of crowd-embedded computations using theoretical justifications and reporting on our experiments on Bluetooth-based proximity sensing. We then present a multilayered work-stealing style algorithm for distributing work efficiently among mobile devices and compare speedups attainable for different topologies of devices networked with Bluetooth, justifying a topology-flexible opportunistic approach. While our experiments are with Bluetooth and mobile devices, the approach is applicable to ecosystems of various embedded devices with powerful processors, networking technologies, and storage that will increasingly surround users.

40 citations
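
A minimal sketch of the work-stealing core idea follows, assuming homogeneous devices and ignoring the Bluetooth topology and the article's layering: each device pops work from the back of its own deque and, when idle, steals from the front of a random victim's deque.

```python
# Sketch: deque-based work stealing among "devices" (threads). Owners pop
# from the back of their own deque; idle devices steal from the front of a
# random victim's deque. Task counts and workloads are invented.
import random
import threading
from collections import deque

NUM_DEVICES = 4
TASKS_PER_DEVICE = 25
deques = [deque(range(TASKS_PER_DEVICE)) for _ in range(NUM_DEVICES)]
locks = [threading.Lock() for _ in range(NUM_DEVICES)]
remaining = NUM_DEVICES * TASKS_PER_DEVICE
count_lock = threading.Lock()

def device(my_id):
    global remaining
    while True:
        with count_lock:
            if remaining == 0:
                return
        task = None
        with locks[my_id]:
            if deques[my_id]:
                task = deques[my_id].pop()               # LIFO: own work
        if task is None:
            victim = random.randrange(NUM_DEVICES)       # steal attempt
            if victim != my_id:
                with locks[victim]:
                    if deques[victim]:
                        task = deques[victim].popleft()  # FIFO: stolen work
        if task is not None:
            _ = task * task                              # stand-in computation
            with count_lock:
                remaining -= 1

workers = [threading.Thread(target=device, args=(i,)) for i in range(NUM_DEVICES)]
for w in workers:
    w.start()
for w in workers:
    w.join()
print("all tasks completed")
```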


Journal ArticleDOI
TL;DR: This article formally defines, and proves the validity of, a theoretical framework that modifies a Kripke model to the least possible extent in order to satisfy a given HML formula.
Abstract: This article concerns the maximal synthesis for Hennessy-Milner Logic on Kripke structures with labeled transitions. We formally define, and prove the validity of, a theoretical framework that modifies a Kripke model to the least possible extent in order to satisfy a given HML formula. Applications of this work can be found in the field of controller synthesis and supervisory control for discrete-event systems. Synthesis is realized technically by first projecting the given Kripke model onto a bisimulation-equivalent partial tree representation, thereby unfolding up to the depth of the synthesized formula. Operational rules then define the required adaptations upon this structure in order to achieve validity of the synthesized formula. Synthesis might result in multiple valid adaptations, which are all related to the original model via simulation. Each simulant of the original Kripke model, which satisfies the synthesized formula, is also related to one of the synthesis results via simulation. This indicates maximality, or maximal permissiveness, in the context of supervisory control. In addition to the formal construction of synthesis as presented in this article, we present it in algorithmic form and analyze its computational complexity. Computer-verified proofs for two important theorems in this article have been created using the Coq proof assistant.

32 citations


Journal ArticleDOI
TL;DR: In this paper, a hybrid symbolic-numeric method is presented to compute exact inequality invariants of hybrid systems efficiently, where the modified Newton refinement and rational vector recovery techniques are applied to obtain exact polynomial invariants with rational coefficients, which exactly satisfy the conditions of invariants.
Abstract: In this article, we address the problem of safety verification of nonlinear hybrid systems. A hybrid symbolic-numeric method is presented to compute exact inequality invariants of hybrid systems efficiently. Some numerical invariants of a hybrid system can first be obtained by solving a bilinear SOS program via the PENBMI solver or an iterative method; modified Newton refinement and rational vector recovery techniques are then applied to obtain exact polynomial invariants with rational coefficients, which exactly satisfy the conditions of invariants. Experiments on some benchmarks are given to illustrate the efficiency of our algorithm.

30 citations
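
The rational vector recovery step can be illustrated with Python's Fraction type: round numerically computed coefficients to nearby small rationals so the candidate invariant can be checked in exact arithmetic. The coefficient vector and denominator bound below are invented, and a real verification would check the invariant conditions symbolically.

```python
# Sketch: rational vector recovery. A numerical solver returns approximate
# invariant coefficients; we round them to nearby small rationals and then
# continue in exact arithmetic. All values are illustrative.
from fractions import Fraction

def recover(vec, max_den=64):
    """Best rational approximation with denominator <= max_den per entry."""
    return [Fraction(v).limit_denominator(max_den) for v in vec]

# Noisy coefficients for a candidate p(x, y) = a*x^2 + b*x*y + c*y^2 + d:
numeric = [0.49999987, -0.33333341, 1.00000022, 0.24999995]
a, b, c, d = recover(numeric)
print(a, b, c, d)                  # 1/2 -1/3 1 1/4

# Exact evaluation: no floating-point error remains, so the invariant
# conditions could now be verified symbolically (e.g., with a CAS).
p = a * 2 ** 2 + b * 2 * 3 + c * 3 ** 2 + d
assert p == Fraction(37, 4)        # p(2, 3) computed exactly
```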


Journal ArticleDOI
TL;DR: A new method for automatically updating a Wi-Fi indoor positioning model on a cloud server by employing uploaded sensor data obtained from the smartphone sensors of a specific user who spends a lot of time in a given environment (e.g., a worker in the environment).
Abstract: In this article, we propose a new method for automatically updating a Wi-Fi indoor positioning model on a cloud server by employing uploaded sensor data obtained from the smartphone sensors of a specific user who spends a lot of time in a given environment (e.g., a worker in the environment). In this work, we attempt to track the user with pedestrian dead reckoning techniques, and at the same time we obtain Wi-Fi scan data from a mobile device possessed by the user. With the scan data and the estimated coordinates uploaded to a cloud server, we can automatically create a pair consisting of a scan and its corresponding indoor coordinates during the user's daily life and update an indoor positioning model on the server by using the information. With this approach, we try to cope with the instability of Wi-Fi-based positioning methods caused by changing environmental dynamics, that is, layout changes and the movement or removal of Wi-Fi access points. Therefore, ordinary users (e.g., customers) who do not have rich sensors can benefit from the continually updated positioning model.

24 citations
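
A minimal sketch of the overall loop, under simplifying assumptions (a plain nearest-neighbor fingerprint model, invented access point IDs and RSSI values): the tracked worker's uploads rebuild the model on the server, and ordinary users are then located against it.

```python
# Sketch: continually updating a Wi-Fi fingerprint model from (scan,
# PDR-estimated position) pairs uploaded by one frequent user, then
# locating ordinary users by nearest-neighbor matching. Data is invented.
import math

fingerprints = []   # list of (position (x, y), {ap_id: rssi_dbm})

def update_model(pdr_position, wifi_scan):
    """Server side: called for each upload from the tracked worker."""
    fingerprints.append((pdr_position, wifi_scan))

def locate(wifi_scan, k=3):
    """k-nearest-neighbor position estimate for an ordinary user's scan."""
    def dist(stored):
        shared = set(stored) & set(wifi_scan)
        if not shared:
            return float("inf")
        return math.sqrt(sum((stored[a] - wifi_scan[a]) ** 2 for a in shared))
    best = sorted(fingerprints, key=lambda fp: dist(fp[1]))[:k]
    xs = [p[0] for p, _ in best]
    ys = [p[1] for p, _ in best]
    return (sum(xs) / len(xs), sum(ys) / len(ys))

# The worker's daily uploads rebuild the model as the environment changes:
update_model((0.0, 0.0), {"ap1": -40, "ap2": -70})
update_model((5.0, 0.0), {"ap1": -60, "ap2": -50})
update_model((10.0, 0.0), {"ap1": -75, "ap2": -42})
print(locate({"ap1": -58, "ap2": -52}, k=1))   # ~ (5.0, 0.0)
```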


Journal ArticleDOI
TL;DR: This article proposes a novel control-theoretic software monitoring solution for coordinating time predictability and memory utilization in runtime monitoring of systems that interact with the physical world and constructs a buffer sharing mechanism in which controllers dynamically share the memory space to negate the effect of bursts of environment actions.
Abstract: The goal of runtime monitoring is to inspect the well-being of a system by employing a monitor process that reads the state of the system during execution and evaluates a set of properties expressed in some specification language. The main challenge in runtime monitoring is dealing with the costs imposed in terms of resource utilization. In the context of cyber-physical systems, it is crucial for a software monitoring solution to be time predictable to improve scheduling, as well as support composition of monitoring solutions with an overall predictable behavior. Moreover, a small memory footprint is often required in components of cyber-physical systems, especially in deeply embedded systems. In this article, we propose a novel control-theoretic software monitoring solution for coordinating time predictability and memory utilization in runtime monitoring of systems that interact with the physical world. The controllers attempt to reduce monitoring jitter and maximize memory utilization while simultaneously ensuring the soundness of evaluation of properties. For systems where multiple properties are required to be monitored simultaneously, we construct a buffer sharing mechanism in which controllers dynamically share the memory space to negate the effect of bursts of environment actions, thus reducing jitter due to transient high loads. To validate our design choices, we present three case studies: (1) a Bluetooth mobile payment system, which shows a sporadic rate of events during peak hours; (2) a laser beam stabilizer for target tracking; and (3) a monitoring system for air/fuel ratio in a car engine exhaust and the CAM inlet position in the engine’s cylinders. The experimental results of the case studies demonstrate up to 40% improvement in time predictability of the monitoring solution when compared to a basic event-triggered approach. Moreover, memory utilization reaches an average of 90% when using our dynamic buffer resizing mechanism.

23 citations
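
As a toy version of the buffer sharing idea, the sketch below uses a proportional controller to shift a fixed memory budget between two monitors according to their buffer occupancy. The gain, budget, and bounds are illustrative, not the article's tuned controllers.

```python
# Sketch: dynamic buffer sharing between two monitors. A proportional
# controller grows the buffer of whichever property stream is bursting
# and shrinks the other, keeping the total memory budget fixed.
TOTAL = 1024          # total buffer budget in events (assumed)
K_P = 0.5             # proportional gain (assumed)

sizes = {"prop_A": 512, "prop_B": 512}

def rebalance(fill):
    """fill[m] in [0, 1]: occupancy of monitor m's buffer this period."""
    error = fill["prop_A"] - fill["prop_B"]       # > 0: A is under pressure
    shift = int(K_P * error * TOTAL / 2)
    sizes["prop_A"] = max(64, min(TOTAL - 64, sizes["prop_A"] + shift))
    sizes["prop_B"] = TOTAL - sizes["prop_A"]     # budget is conserved

# A burst of environment actions hits property A:
for fill_a, fill_b in [(0.95, 0.20), (0.90, 0.25), (0.60, 0.30)]:
    rebalance({"prop_A": fill_a, "prop_B": fill_b})
    print(sizes)
```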


Journal ArticleDOI
TL;DR: Experimental results show that WAPTM can significantly reduce writes in page tables, proving the feasibility and potential of prolonging the lifetime of PCM-based main memory through reducing writes at the OS level.
Abstract: Non-volatile memories such as phase change memory (PCM) and memristor are being actively studied as an alternative to DRAM-based main memory in embedded systems because of their properties, which include low power consumption and high density. Though PCM is one of the most promising candidates with commercial products available, its adoption has been greatly compromised by limited write endurance. As main memory is one of the most heavily accessed components, it is critical to prolong the lifetime of PCM. In this article, we present write-activity-aware page table management (WAPTM), a simple yet effective page table management scheme for reducing unnecessary writes, by redesigning system software and exploiting write-activity-aware features provided by the hardware. We implemented WAPTM in Google Android based on the ARM architecture and evaluated it with real Android applications. Experimental results show that WAPTM can significantly reduce writes in page tables, proving the feasibility and potential of prolonging the lifetime of PCM-based main memory through reducing writes at the OS level.

21 citations


Journal ArticleDOI
TL;DR: A configurable real-time multichannel memory controller architecture with a novel method for logical-to-physical address translation and two design-time methods to map memory clients to the memory channels, one an optimal algorithm based on an integer programming formulation of the mapping problem, and the other a fast heuristic algorithm.
Abstract: Ever-increasing demands for main memory bandwidth and memory speed/power tradeoff led to the introduction of memories with multiple memory channels, such as Wide IO DRAM. Efficient utilization of a multichannel memory as a shared resource in multiprocessor real-time systems depends on mapping of the memory clients to the memory channels according to their requirements on latency, bandwidth, communication, and memory capacity. However, there is currently no real-time memory controller for multichannel memories, and there is no methodology to optimally configure multichannel memories in real-time systems. As a first work toward this direction, we present two main contributions in this article: (1) a configurable real-time multichannel memory controller architecture with a novel method for logical-to-physical address translation and (2) two design-time methods to map memory clients to the memory channels, one an optimal algorithm based on an integer programming formulation of the mapping problem, and the other a fast heuristic algorithm. We demonstrate the real-time guarantees on bandwidth and latency provided by our multichannel memory controller architecture by experimental evaluation. Furthermore, we compare the performance of the mapping problem formulation in a solver and the heuristic algorithm against two existing mapping algorithms in terms of computation time and mapping success ratio. We show that an optimal solution can be found in 2 hours using the solver and in less than 1 second with less than 7% mapping failure using the heuristic for realistically sized problems. Finally, we demonstrate configuring a Wide IO DRAM in a high-definition (HD) video and graphics processing system to emphasize the practical applicability and effectiveness of this work.
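
The fast heuristic side of the contribution might look roughly like the following first-fit-decreasing sketch; the client bandwidth/capacity figures and per-channel limits are invented.

```python
# Sketch: a fast heuristic mapping memory clients to channels, standing in
# for the article's heuristic (the ILP gives the optimal counterpart).
# First-fit decreasing on bandwidth, with per-channel bandwidth and
# capacity limits. All numbers are invented.
clients = [
    # (name, bandwidth GB/s, capacity MB)
    ("video_in", 3.2, 64), ("gpu", 2.5, 128), ("cpu", 1.5, 96),
    ("video_out", 1.2, 32), ("dsp", 0.8, 16),
]
channels = [{"bw": 4.0, "cap": 256, "clients": []} for _ in range(4)]

def map_clients(clients, channels):
    for name, bw, cap in sorted(clients, key=lambda c: -c[1]):
        for ch in channels:                        # first channel that fits
            if ch["bw"] >= bw and ch["cap"] >= cap:
                ch["bw"] -= bw
                ch["cap"] -= cap
                ch["clients"].append(name)
                break
        else:
            return None                            # mapping failure
    return channels

result = map_clients(clients, channels)
for i, ch in enumerate(result):
    print(f"channel {i}: {ch['clients']}")
```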

Journal ArticleDOI
TL;DR: This work builds upon the work of Steinke and Nutt, develops three general concepts that allow one to quickly determine the complexity of a testing problem, and gives a single SAT encoding of the testing problem that works for all memory models in the Steinke-Nutt hierarchy.
Abstract: To improve the performance of the memory system, multiprocessors implement weak memory consistency models. Weak memory models admit different views of the processes on their load and store instructions, thus allowing for computations that are not sequentially consistent. Program analyses have to take into account the memory model of the targeted hardware. This is challenging because numerous memory models have been developed, and every memory model requires its own analysis. In this article, we study a prominent approach to program analysis: testing. The testing problem takes as input sequences of operations, one for each process in the concurrent program. The task is to check whether these sequences can be interleaved to an execution of the entire program that respects the constraints of a memory model under consideration. We determine the complexity of the testing problem for most of the known memory models. Moreover, we study the impact on the complexity of parameters, such as the number of concurrent processes, the length of their executions, and the number of shared variables. What differentiates our contribution from related results is a uniform approach that avoids considering each memory model on its own. We build upon work of Steinke and Nutt. They showed that the existing memory models form a hierarchy where one model is called weaker than another one if it includes the latter’s behavior. Using the Steinke-Nutt hierarchy, we develop three general concepts that allow us to quickly determine the complexity of a testing problem. First, we generalize the technique of problem reductions from complexity theory. So-called range reductions propagate hardness results between memory models, and we apply them to establish NP lower bounds for the stronger memory models. Second, for the weaker models, we present polynomial-time testing algorithms that are inspired by determinization algorithms for automata. Finally, we describe a single SAT encoding of the testing problem that works for all memory models in the Steinke-Nutt hierarchy to prove their membership in NP. Our results are general enough to carry over to future weak memory models. Moreover, they show that SAT solvers are adequate tools for testing.
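
For the special case of sequential consistency, the testing problem can be shown in a few lines by brute-force search over interleavings; this stands in for, and is far weaker than, the article's uniform treatment and its SAT encoding.

```python
# Sketch: the testing problem for sequential consistency by brute force.
# Each process contributes a sequence of operations; we search for an
# interleaving in which every read returns the latest write to its
# variable (initial values are 0). Real memory models are more general.
def sc_testable(programs):
    n = len(programs)

    def dfs(idx, mem):
        if all(idx[p] == len(programs[p]) for p in range(n)):
            return True
        for p in range(n):
            if idx[p] == len(programs[p]):
                continue
            op, var, val = programs[p][idx[p]]
            nxt = idx[:p] + (idx[p] + 1,) + idx[p + 1:]
            if op == "w":
                if dfs(nxt, {**mem, var: val}):
                    return True
            elif mem.get(var, 0) == val:        # read must see latest write
                if dfs(nxt, mem):
                    return True
        return False

    return dfs((0,) * n, {})

# Dekker-style litmus test: both reads returning 0 is not SC-feasible.
p0 = [("w", "x", 1), ("r", "y", 0)]
p1 = [("w", "y", 1), ("r", "x", 0)]
print(sc_testable([p0, p1]))   # False: no SC interleaving exists
```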

Journal ArticleDOI
TL;DR: MC-SRP (Mixed-Criticality Stack Resource Policy), a resource synchronization protocol for EDF-VD, which allows resource sharing among tasks at the same criticality level and guarantees that each task is blocked at most once in each criticality mode, is presented.
Abstract: In a mixed-criticality system, multiple tasks with different levels of criticality may coexist on the same hardware platform. The scheduling algorithm EDF-VD (Earliest Deadline First with Virtual Deadlines) has been proposed for mixed-criticality systems, which assumes tasks do not share any common resources. We present MC-SRP (Mixed-Criticality Stack Resource Policy), a resource synchronization protocol for EDF-VD, which allows resource sharing among tasks at the same criticality level and guarantees that each task is blocked at most once in each criticality mode. In addition, we present MC-SRPT (MC-SRP with Thresholds) for reducing the application stack size requirement in resource-constrained embedded systems.

Journal ArticleDOI
TL;DR: The architecture implements an adaptive support weight stereo correspondence algorithm that integrates image segmentation information in an attempt to increase the robustness of the matching process and presents an effective processing speed/disparity map accuracy trade-off.
Abstract: Emerging embedded vision systems utilize disparity estimation as a means to perceive depth information to intelligently interact with their host environment and take appropriate actions. Such systems demand high processing performance and accurate depth perception while requiring low energy consumption, especially when dealing with mobile and embedded applications, such as robotics, navigation, and security. The majority of real-time dedicated hardware implementations of disparity estimation systems have adopted local algorithms relying on simple cost aggregation strategies with fixed and rectangular correlation windows. However, such algorithms generally suffer from significant ambiguity along depth borders and areas with low texture. To this end, this article presents the hardware architecture of a disparity estimation system that enables good performance in both accuracy and speed. The architecture implements an adaptive support weight stereo correspondence algorithm that integrates image segmentation information in an attempt to increase the robustness of the matching process. The article also presents hardware-oriented algorithmic modifications/optimization techniques that make the algorithm hardware-friendly and suitable for efficient dedicated hardware implementation. A comparison to the literature asserts that an FPGA implementation of the proposed architecture is among the fastest implementations in terms of million disparity estimations per second (MDE/s), and with an overall accuracy of 90.21%, it presents an effective processing speed/disparity map accuracy trade-off.

Journal ArticleDOI
TL;DR: This article investigates the potential of multilevel phase analysis (MLPA), where phase analyses of different granularities are combined to improve the overall accuracy, and observes that coarse-grained phases can better capture the overall program characteristics with fewer phases.
Abstract: Phase analysis, which classifies the set of execution intervals with similar execution behavior and resource requirements, has been widely used in a variety of systems, including dynamic cache reconfiguration, prefetching, race detection, and sampling simulation. Although phase granularity has been a major factor in the accuracy of phase analysis, it has not been well investigated, and most systems usually adopt a fine-grained scheme. However, such a scheme can only take account of recent local phase information and could be frequently interfered by temporary noise due to instant phase changes, which might notably limit the accuracy. In this article, we make the first investigation on the potential of multilevel phase analysis (MLPA), where different granularity phase analyses are combined together to improve the overall accuracy. The key observation is that the coarse-grained intervals belonging to the same phase usually consist of stably distributed fine-grained phases. Moreover, the phase of a coarse-grained interval can be accurately identified based on the fine-grained intervals at the beginning of its execution. Based on the observation, we design and implement an MLPA scheme. In such a scheme, a coarse-grained phase is first identified based on the fine-grained intervals at the beginning of its execution. The following fine-grained phases in it are then predicted based on the sequence of fine-grained phases in the coarse-grained phase. Experimental results show that such a scheme can notably improve the prediction accuracy. Using a Markov fine-grained phase predictor as the baseline, MLPA can improve prediction accuracy by 20%, 39%, and 29% for next phase, phase change, and phase length prediction for SPEC2000, respectively, yet incur only about 2% time overhead and 40% space overhead (about 360 bytes in total). To demonstrate the effectiveness of MLPA, we apply it to a dynamic cache reconfiguration system that dynamically adjusts the cache size to reduce the power consumption and access time of the data cache. Experimental results show that MLPA can further reduce the average cache size by 15% compared to the fine-grained scheme. Moreover, for MLPA, we also observe that coarse-grained phases can better capture the overall program characteristics with fewer phases and that the last representative phase could be classified at a very early program position, leading to fewer execution intervals being functionally simulated. Based on this observation, we also design a multilevel sampling simulation technique that combines both fine- and coarse-grained phase analysis for sampling simulation. Such a scheme uses fine-grained simulation points to represent only the selected coarse-grained simulation points instead of the entire program execution; thus, it could further reduce both the functional and detailed simulation time. Experimental results show that MLPA for sampling simulation can achieve a speedup in simulation time of about 8.3X with similar accuracy compared to 10M SimPoint.
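
A minimal two-level Markov predictor along these lines, with simplified phase IDs and a toy coarse-phase identification rule (the dominant fine phase at the start of an interval):

```python
# Sketch: two-level (coarse/fine) Markov phase prediction. A coarse phase
# is identified from the fine-grained phases at the beginning of an
# interval, and the next fine phase is then predicted from a per-coarse
# Markov table. Phase IDs and the identification rule are stand-ins.
from collections import defaultdict, Counter

markov = defaultdict(Counter)    # (coarse, fine) -> counts of next fine phase

def identify_coarse(fine_prefix):
    """Identify the coarse phase by the dominant fine phase in the prefix."""
    return Counter(fine_prefix).most_common(1)[0][0]

def train(fine_seq, head=3):
    coarse = identify_coarse(fine_seq[:head])
    for a, b in zip(fine_seq, fine_seq[1:]):
        markov[(coarse, a)][b] += 1

def predict(coarse, fine):
    counts = markov[(coarse, fine)]
    return counts.most_common(1)[0][0] if counts else fine   # default: stay

train(list("ABABAB"))     # a loop-like coarse interval alternates A and B
train(list("ACCCCC"))     # a C-dominated coarse interval mostly stays in C

prefix = list("ABA")                       # start of a new coarse interval
coarse = identify_coarse(prefix)           # -> 'A' (the loop-like phase)
print(predict("A", "A"))                   # -> 'B' inside the loop-like phase
print(predict("C", "A"))                   # -> 'C' inside the C-dominated one
```

The same fine phase yields different predictions in different coarse phases, which is exactly the extra context the multilevel scheme exploits.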

Journal ArticleDOI
TL;DR: An aggressive data reduction algorithm based on error inference within sensor segments that integrates three parallel dynamic error control mechanisms to optimize the trade-off between energy saving and data validity is proposed and evaluated.
Abstract: In wireless sensor networks, owing to the limited energy of the sensor nodes, it is important to design a dynamic scheduling scheme with data management that reduces energy consumption as much as possible. However, traditional techniques treat data management as an isolated process on only selected individual nodes. In this article, we propose an aggressive data reduction architecture, which is based on error control within sensor segments and integrates three parallel dynamic control mechanisms. We demonstrate that this architecture not only achieves energy savings but also guarantees the data accuracy specified by the application. Furthermore, based on this architecture, we propose two implementations. The experimental results show that both implementations can increase energy savings while keeping the error at a predefined and acceptable level. We observed that, compared with the basic implementation, the enhanced implementation achieves relatively higher data accuracy. Moreover, the enhanced implementation is more suitable for harsh environmental monitoring applications. Further, when both implementations achieve the same accuracy, the enhanced implementation saves more energy. Extensive experiments on realistic historical soil temperature data confirm the efficacy and efficiency of the two implementations.
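
One simple instance of error-controlled data reduction is a dead-band filter: transmit a reading only when it drifts from the last reported value by more than the application's error bound. The sketch below uses invented soil temperature data and an invented threshold; the article's segment-level mechanisms are more elaborate.

```python
# Sketch: error-controlled data reduction on a sensor node. A reading is
# transmitted only when it deviates from the last reported value by more
# than the error bound eps, so the sink's view stays within +/- eps.
def reduce_stream(readings, eps):
    reported = None
    sent, suppressed = [], 0
    for t, value in enumerate(readings):
        if reported is None or abs(value - reported) > eps:
            reported = value
            sent.append((t, value))        # radio transmission (costly)
        else:
            suppressed += 1                # sink reuses last reported value
    return sent, suppressed

soil_temp = [14.0, 14.1, 14.1, 14.3, 14.9, 15.0, 15.1, 15.0, 14.2]
sent, suppressed = reduce_stream(soil_temp, eps=0.5)
print(f"sent {len(sent)} of {len(soil_temp)} readings:", sent)
```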

Journal ArticleDOI
TL;DR: The stubborn set method of Petri nets is considered and its extension to time Petri nets is investigated, establishing some useful sufficient conditions for stubborn sets that preserve deadlocks and k-boundedness of places.
Abstract: The main limitation of the verification approaches based on state enumeration is the state explosion problem. The partial order reduction techniques aim at attenuating this problem by reducing the number of transitions to be fired from each state while preserving properties of interest. Among the reduction techniques proposed in the literature, this article considers the stubborn set method of Petri nets and investigates its extension to time Petri nets. It establishes some useful sufficient conditions for stubborn sets, which preserve deadlocks and k-boundedness of places.

Journal ArticleDOI
TL;DR: It is shown that a previous approach to WF-diagnosability in the literature has a major flaw, a corrected notion is presented, and an efficient verification method based on a reduction to LTL-X model checking is given, exploiting the ability of existing model checkers to handle weak fairness directly.
Abstract: In partially observed Petri nets, diagnosis is the task of detecting whether the given sequence of observed labels indicates that some unobservable fault has occurred. Diagnosability is an associated property of the Petri net, stating that in any possible execution, an occurrence of a fault can eventually be diagnosed.In this article, we consider diagnosability under the weak fairness (WF) assumption, which intuitively states that no transition from a given set can stay enabled forever—it must eventually either fire or be disabled. We show that a previous approach to WF-diagnosability in the literature has a major flaw and present a corrected notion. Moreover, we present an efficient method for verifying WF-diagnosability based on a reduction to LTL-X model checking. An important advantage of this method is that the LTL-X formula is fixed—in particular, the WF assumption does not have to be expressed as a part of it (which would make the formula length proportional to the size of the specification), but rather the ability of existing model checkers to handle weak fairness directly is exploited.

Journal ArticleDOI
TL;DR: This article performs a comprehensive quantitative analysis of the benefits provided by the runtime reconfigurability of an MLC NAND flash controller through the combined effect of an adaptable memory programming circuitry coupled with runtime adaptation of the ECC correction capability.
Abstract: NAND flash memories are becoming the predominant technology in the implementation of mass storage systems for both embedded and high-performance applications. However, when considering data and code storage in Non-Volatile Memories (NVMs), such as NAND flash memories, reliability and performance become a serious concern for systems designers. Designing NAND flash-based systems based on worst-case scenarios leads to waste of resources in terms of performance, power consumption, and storage capacity. This is clearly in contrast with the request for runtime reconfigurability, adaptivity, and resource optimization in modern computing systems. There is a clear trend toward supporting differentiated access modes in flash memory controllers, each one setting a differentiated tradeoff point in the performance-reliability optimization space. This is supported by the possibility of tuning the NAND flash memory performance, reliability, and power consumption through several tuning knobs, such as the flash programming algorithm and the flash error correcting code. However, to successfully exploit these degrees of freedom, it is mandatory to clearly understand the effect that the combined tuning of these parameters has on the full NVM subsystem. This article performs a comprehensive quantitative analysis of the benefits provided by the runtime reconfigurability of an MLC NAND flash controller through the combined effect of an adaptable memory programming circuitry coupled with runtime adaptation of the ECC correction capability. The full NVM subsystem is taken into account, starting from a characterization of the low-level circuitry to the effect of the adaptation on a wide set of realistic benchmarks, in order to provide readers a clear view of the benefit this combined adaptation may provide at the system level.

Journal ArticleDOI
TL;DR: Libra is proposed, which builds on flash storage made solely of MLC flash and uses the memory devices in SLC mode when appropriate, and exploits the fact that writing a single bit per cell in an MLC provides characteristics close to those of an ordinary SLC.
Abstract: Hybrid flash storages combine a small Single-Level Cell (SLC) partition with a large Multilevel Cell (MLC) partition. Compared to MLC-only solutions, the SLC partition exploits fast and short local write updates, while the MLC part brings large capacity. On the whole, hybrid storage achieves a tangible performance improvement for a moderate extra cost. Yet, device lifetime is an important aspect often overlooked: in a hybrid system, a large ratio of writes may be directed to the small SLC partition, thus generating a local stress that could exhaust the SLC lifetime significantly sooner than the MLC partition's. To address this issue, we propose Libra, which builds on flash storage made solely of MLC flash and uses the memory devices in SLC mode when appropriate; that is, we exploit the fact that writing a single bit per cell in an MLC provides characteristics close to those of an ordinary SLC. In our scheme, the cell bit-density of a block can be decided dynamically by the flash controller, and the physical location of the SLC partition can now be moved around the whole device, balancing wear across it. This article provides a thorough analysis and characterization of the SLC mode for MLCs and gives evidence that the inherent flexibility provided by Libra simplifies considerably the stress balance on the device. Overall, our technique improves lifetime by up to one order of magnitude at no cost when compared to any hybrid storage that relies on a static SLC-MLC partitioning.
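
The core mechanism can be sketched as a tiny allocator that picks the least-worn erased block for every write and sets its bit density on the fly, so SLC-mode wear migrates across the whole device. Capacities, the hotness rule, and the data structures are illustrative, not Libra's actual flash translation layer.

```python
# Sketch: a flash translation layer that decides per-block bit density
# dynamically: hot (frequently rewritten) data goes to blocks used in SLC
# mode, cold data to MLC mode, and allocation always picks the least-worn
# erased block so wear stays balanced device-wide. Numbers are invented.
import heapq

class Block:
    def __init__(self, bid):
        self.bid, self.erases = bid, 0
    def __lt__(self, other):                 # heap orders blocks by wear
        return self.erases < other.erases

free_blocks = [Block(i) for i in range(8)]
heapq.heapify(free_blocks)

def allocate(hot):
    blk = heapq.heappop(free_blocks)         # least-worn free block anywhere
    blk.mode = "SLC" if hot else "MLC"       # one bit/cell vs. two bits/cell
    blk.erases += 1                          # program/erase cycle consumed
    return blk

def release(blk):
    heapq.heappush(free_blocks, blk)         # block rejoins the wear pool

for i in range(20):                          # bursty, mostly hot writes
    b = allocate(hot=(i % 4 != 0))
    release(b)
print(sorted((b.bid, b.erases) for b in free_blocks))  # wear stays balanced
```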

Journal ArticleDOI
TL;DR: An event-based scheduling policy is formally defined, the notion of correctness of a scheduling policy is proposed in terms of weak termination, and a method is obtained that can automatically decompose the scheduling policies of a concurrent reactive system into atomic scheduling policies.
Abstract: The traditional research on scheduling focuses on task scheduling and schedulability analysis in concurrent reactive systems. In this article, we dedicate ourselves to event-based scheduling. We first formally define an event-based scheduling policy and propose the notion of the correctness of a scheduling policy in terms of weak termination. Then we investigate the correctness of the decomposition of scheduling controls and finally obtain a decentralized scheduling method. The method can automatically decompose the scheduling policies of a concurrent reactive system into atomic scheduling policies. Every atomic scheduling policy corresponds to one subsystem. Each of the subsystems is a completely independent system, which may be developed and deployed independently. An experiment demonstrates these results, which may help engineers to design correct and efficient scheduling policies for a concurrent reactive system.

Journal ArticleDOI
TL;DR: A comprehensive software scheme is proposed that transforms the traditionally single-threaded CGRA into a multithreaded coprocessor to be used as a power-efficient accelerator for multith readed embedded processors.
Abstract: Recent industry trends show a drastic rise in the use of hand-held embedded devices, from everyday applications to medical (e.g., monitoring devices) and critical defense applications (e.g., sensor nodes). The two key requirements in the design of such devices are their processing capabilities and battery life. There is therefore an urgency to build high-performance and power-efficient embedded devices, inspiring researchers to develop novel system designs for the same. The use of a coprocessor (application-specific hardware) to offload power-hungry computations is gaining favor among system designers to suit their power budgets. We propose the use of CGRAs (Coarse-Grained Reconfigurable Arrays) as a power-efficient coprocessor. Though CGRAs have been widely used for streaming applications, the extensive compiler support required limits its applicability and use as a general purpose coprocessor. In addition, a CGRA structure can efficiently execute only one statically scheduled kernel at a time, which is a serious limitation when used as an accelerator to a multithreaded or multitasking processor. In this work, we envision a multithreaded CGRA where multiple schedules (or kernels) can be executed simultaneously on the CGRA (as a coprocessor). We propose a comprehensive software scheme that transforms the traditionally single-threaded CGRA into a multithreaded coprocessor to be used as a power-efficient accelerator for multithreaded embedded processors. Our software scheme includes (1) a compiler framework that integrates with existing CGRA mapping techniques to prepare kernels for execution on the multithreaded CGRA and (2) a runtime mechanism that dynamically schedules multiple kernels (offloaded from the processor) to execute simultaneously on the CGRA coprocessor. Our multithreaded CGRA coprocessor implementation thus makes it possible to achieve improved power-efficient computing in modern multithreaded embedded systems.

Journal ArticleDOI
TL;DR: This work presents a generic analysis tool for psi-calculus instances, enabling symbolic execution and (bi)simulation checking for both unicast and broadcast communication, and describes the theoretical foundations of the tool, including an improved symbolic operational semantics.
Abstract: Psi-calculi is a parametric framework for extensions of the pi-calculus with arbitrary data and logic. All instances of the framework inherit machine-checked proofs of the metatheory such as compositionality and bisimulation congruence. We present a generic analysis tool for psi-calculus instances, enabling symbolic execution and (bi)simulation checking for both unicast and broadcast communication. The tool also provides a library for implementing new psi-calculus instances. We provide examples from traditional communication protocols and wireless sensor networks. We also describe the theoretical foundations of the tool, including an improved symbolic operational semantics, with additional support for scoped broadcast communication.

Journal ArticleDOI
TL;DR: SOpenCL (Silicon to OpenCL), a tool that automatically converts OpenCL kernels to RTL, is used to introduce FPGAs as a potential platform for efficiently executing simulations coded in OpenCL; depending on the design parameters to be simulated and on the dimension and phase of the design, the GPU or FPGA may suit different purposes more conveniently, thus providing different acceleration factors over conventional multicore CPUs.
Abstract: The design cycle for complex special-purpose computing systems is extremely costly and time-consuming. It involves a multiparametric design space exploration for optimization, followed by design verification. Designers of special purpose VLSI implementations often need to explore parameters, such as optimal bitwidth and data representation, through time-consuming Monte Carlo simulations. A prominent example of this simulation-based exploration process is the design of decoders for error correcting systems, such as the Low-Density Parity-Check (LDPC) codes adopted by modern communication standards, which involves thousands of Monte Carlo runs for each design point. Currently, high-performance computing offers a wide set of acceleration options that range from multicore CPUs to Graphics Processing Units (GPUs) and Field Programmable Gate Arrays (FPGAs). The exploitation of diverse target architectures is typically associated with developing multiple code versions, often using distinct programming paradigms. In this context, we evaluate the concept of retargeting a single OpenCL program to multiple platforms, thereby significantly reducing design time. A single OpenCL-based parallel kernel is used without modifications or code tuning on multicore CPUs, GPUs, and FPGAs. We use SOpenCL (Silicon to OpenCL), a tool that automatically converts OpenCL kernels to RTL in order to introduce FPGAs as a potential platform to efficiently execute simulations coded in OpenCL. We use LDPC decoding simulations as a case study. Experimental results were obtained by testing a variety of regular and irregular LDPC codes that range from short/medium (e.g., 8,000 bit) to long length (e.g., 64,800 bit) DVB-S2 codes. We observe that, depending on the design parameters to be simulated, on the dimension and phase of the design, the GPU or FPGA may suit different purposes more conveniently, thus providing different acceleration factors over conventional multicore CPUs.

Journal ArticleDOI
TL;DR: This work provides a full construction of the proposed Anonymous Split E-Cash scheme and shows how the protocol’s computational complexity can be relaxed by a secure split of computations: nonsensitive operations are delegated to the powerful platform, while sensitive computations are kept in a secure environment.
Abstract: Anonymous E-Cash was first introduced in 1982 as a digital, privacy-preserving alternative to physical cash. A lot of research has since then been devoted to extend and improve its properties, leading to the appearance of multiple schemes. Despite this progress, the practical feasibility of E-Cash systems is still today an open question. Payment tokens are typically portable hardware devices in smart card form, resource constrained due to their size, and therefore not suited to support largely complex protocols such as E-Cash. Migrating to more powerful mobile platforms, for instance, smartphones, seems a natural alternative. However, this implies moving computations from trusted and dedicated execution environments to generic multiapplication platforms, which may result in security vulnerabilities. In this work, we propose a new anonymous E-Cash system to overcome this limitation. Motivated by existing payment schemes based on MTM (Mobile Trusted Module) architectures, we consider at design time a model in which user payment tokens are composed of two modules: an untrusted but powerful execution platform (e.g., smartphone) and a trusted but constrained platform (e.g., secure element). We show how the protocol’s computational complexity can be relaxed by a secure split of computations: nonsensitive operations are delegated to the powerful platform, while sensitive computations are kept in a secure environment. We provide a full construction of our proposed Anonymous Split E-Cash scheme and show that it fully complies with the main properties of an ideal E-Cash system. Finally, we test its performance by implementing it on an Android smartphone equipped with a Java-Card-compatible secure element.

Journal ArticleDOI
TL;DR: A new refinement relation for modal transition systems (MTSs) is defined, using an MTS-specific variant of testing in the sense of De Nicola and Hennessy, and it is demonstrated that the conjunction of two MTSs is an MTS, in contrast to the standard modal refinement.
Abstract: With the aim to preserve deadlock freedom, we define a new refinement preorder for modal transition systems (MTSs), using an MTS-specific variant of testing inspired by De Nicola and Hennessy. We characterize this refinement with a kind of failure semantics and show that it “supports itself,” for example, in the sense of thoroughness—in contrast to standard modal refinements. We present a conjunction operator with respect to our new refinement, which is quite different from existing ones. It always returns an MTS—again in contrast to the case of modal refinement. Finally, we also consider De Nicola’s and Hennessy’s may- and must-testing, where the latter leads to a semantics that is also compositional for hiding.

Journal ArticleDOI
TL;DR: Dylog is presented, a dynamic logging facility for networked embedded systems that employs binary instrumentation for dynamically inserting or removing logging statements, enabling interactive debugging at runtime and significantly reducing the communication cost.
Abstract: Event logging is an important technique for networked embedded systems like wireless sensor networks. It can greatly help developers to understand complex system behaviors and diagnose program bugs. Existing logging facilities do not well satisfy three practical requirements: flexibility, efficiency, and high synchronization accuracy. To simultaneously satisfy these requirements, we present Dylog, a dynamic logging facility for networked embedded systems. Dylog employs several techniques. First, Dylog uses binary instrumentation for dynamically inserting or removing logging statements, enabling flexible and interactive debugging at runtime. Second, Dylog incorporates an efficient storage system and log collection protocol for recording and transferring the logging messages. Third, Dylog employs a lightweight data-driven approach for reconstructing the synchronized time of the logging messages. Dylog uses MAC-layer timestamping and drift compensation to achieve high synchronization accuracy. We implement Dylog on the TinyOS 2.1.1/TelosB platform. Results show the following: (1) Dylog incurs a small overhead. Indirections in Dylog incur an additional execution overhead of less than 1%. Dylog reduces the logging storage size by approximately 50% compared with the standard TinyOS radio printf library. Dylog reduces the patch size by more than 90%, compared with incremental reprogramming. (2) Dylog reduces the synchronization overhead by 78% in terms of transmission cost, compared with a traditional time synchronization protocol, FTSP, and it can achieve a high time synchronization accuracy of 5.4μs. (3) Dylog can help diagnose system problems effectively at the source-code level for three real-world scenarios.

Journal ArticleDOI
TL;DR: This article solves the problem of efficiently managing code on an SMM architecture by proposing a cost calculation graph and developing two heuristics, CMSM (Code Mapping for Software Managed multicores) and CMSM_advanced, that result in efficient code management execution on the local scratchpad memory.
Abstract: Scaling the memory hierarchy is a major challenge when we scale the number of cores in a multicore processor. Software Managed Multicore (SMM) architectures come up as one of the promising solutions. In an SMM architecture, there are no caches, and each core has only a local scratchpad memory [Banakar et al. 2002]. As the local memory usually is small, large applications cannot be directly executed on it. Code and data of the task mapped to each core need to be managed between global memory and local memory. This article solves the problem of efficiently managing code on an SMM architecture. The primary requirement for generating efficient code assignments is a correct management cost model. In this article, we address this problem by proposing a cost calculation graph. In addition, we develop two heuristics, CMSM (Code Mapping for Software Managed multicores) and CMSM_advanced, that result in efficient code management execution on the local scratchpad memory. Experimental results collected after executing applications from the MiBench suite [Guthaus et al. 2001] demonstrate that merely by adopting the correct management cost calculation, even using previous code assignment schemes, we can improve performance by an average of 12%. Combining the correct management cost model and a more optimized code mapping algorithm together, our heuristics can reduce runtime in more than 80% of the cases, and by up to 20% on our set of benchmarks, compared to the state-of-the-art code assignment approach [Jung et al. 2010]. When compared with Integer Linear Programming (ILP) results, CMSM_advanced performs an average of 5% worse. We also simulate the benchmarks on a cache-based system and find that the code management overhead on an SMM core with our code management is much less than the memory latency of a cache-based system.
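
A greedy stand-in for the mapping heuristics follows, with an invented cost rule in place of the article's cost calculation graph: functions that call each other frequently are kept in different scratchpad regions so they do not repeatedly evict one another.

```python
# Sketch: greedy code-to-region mapping for a scratchpad. The management
# cost of a candidate mapping approximates the DMA swap traffic it would
# trigger. Function sizes, call counts, and the cost rule are invented.
functions = {"main": 2, "fft": 4, "filter": 3, "log": 1}      # size in KB
calls = {("main", "fft"): 100, ("fft", "filter"): 80, ("main", "log"): 5}
SPM_REGIONS = 2          # scratchpad split into this many code regions

def mapping_cost(assign):
    """Caller/callee pairs sharing a region thrash; weight by call count."""
    cost = 0
    for (f, g), freq in calls.items():
        if f in assign and g in assign and assign[f] == assign[g]:
            cost += freq * (functions[f] + functions[g])   # swap traffic
    return cost

def greedy_map():
    assign = {}
    for f in sorted(functions, key=functions.get, reverse=True):
        best = min(range(SPM_REGIONS),
                   key=lambda r: mapping_cost({**assign, f: r}))
        assign[f] = best
    return assign

m = greedy_map()
print(m, "cost =", mapping_cost(m))
```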

Journal ArticleDOI
TL;DR: This work presents the first IT that allows us to specify a parametric number of interfaces and provides a fully algorithmic procedure, implemented in a tool, for checking the compatibility of and refinement between parametrised interfaces.
Abstract: Interface theories (ITs) enable us to analyse the compatibility of interfaces and refine them while preserving their compatibility. However, most ITs are for finite-state interfaces, whereas computing systems are often parametrised, involving components whose number cannot be fixed. We present, to our knowledge, the first IT that allows us to specify a parametric number of interfaces. Moreover, we provide a fully algorithmic procedure, implemented in a tool, for checking the compatibility of and refinement between parametrised interfaces. Finally, we show that the restrictions of the technique are necessary; removing any of them renders the refinement checking problem undecidable.

Journal ArticleDOI
TL;DR: FFT-Cache is proposed, a flexible fault-tolerant cache that uses a flexible defect map to configure its architecture to achieve significant reduction in energy consumption through aggressive voltage scaling while maintaining high error reliability.
Abstract: Caches are known to consume a large part of total microprocessor power. Traditionally, voltage scaling has been used to reduce both dynamic and leakage power in caches. However, aggressive voltage reduction causes process-variation-induced failures in cache SRAM arrays, which compromise cache reliability. In this article, we propose FFT-Cache, a flexible fault-tolerant cache that uses a flexible defect map to configure its architecture to achieve significant reduction in energy consumption through aggressive voltage scaling while maintaining high error reliability. FFT-Cache uses a portion of faulty cache blocks as redundancy—using block-level or line-level replication within or between sets—to tolerate other faulty cache lines and blocks. Our configuration algorithm categorizes the cache lines based on the degree of conflict between their blocks to reduce the granularity of redundancy replacement. FFT-Cache thereby sacrifices a minimal number of cache lines to avoid impacting performance while tolerating the maximum amount of defects. Our experimental results on a processor executing SPEC2K benchmarks demonstrate that the operational voltage of both L1/L2 caches can be reduced down to 375 mV, which achieves up to 80% reduction in the dynamic power and up to 48% reduction in the leakage power. This comes with only a small performance loss (