
Showing papers in "ACM Transactions on Embedded Computing Systems" in 2018


Journal ArticleDOI
TL;DR: A novel scheme for device-free PAssive Detection of moving humans with dynamic Speed (PADS), where both the full information (amplitude and phase) of CSI and the space diversity across multiple antennas in MIMO systems are exploited to extract and shape sensitive metrics for accurate and robust target detection.
Abstract: Device-free passive detection is an emerging technology to detect whether there exist any moving entities in the areas of interest without attaching any device to them. It is an essential primitive for a broad range of applications including intrusion detection for safety precautions, patient monitoring in hospitals, child and elder care at home, and so forth. Despite the prevalence of the Received Signal Strength (RSS) signal feature, most robust and reliable solutions resort to a finer-grained channel descriptor at the physical layer, e.g., the Channel State Information (CSI) in the 802.11n standard. Among a large body of emerging techniques, however, few have explored the full potential of CSI for human detection. Moreover, the space diversity supported by today's popular multiantenna systems is not investigated to a comparable extent as frequency diversity. In this article, we propose a novel scheme for device-free PAssive Detection of moving humans with dynamic Speed (PADS). Both the full information (amplitude and phase) of CSI and the space diversity across multiple antennas in MIMO systems are exploited to extract and shape sensitive metrics for accurate and robust target detection. We prototype PADS on commercial WiFi devices, and experimental results in different scenarios demonstrate that PADS achieves great performance improvement in spite of dynamic human movements.
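The core idea, thresholding the variability of CSI streams, can be sketched in a few lines. This toy detector is a hypothetical simplification, not the PADS algorithm: the feature choice (variance of amplitude and of unwrapped phase differences) and the threshold are illustrative only.

```python
import math
import statistics

def detect_motion(csi_amp, csi_phase, threshold=0.5):
    """Toy detector: motion perturbs both CSI amplitude and phase, so high
    variance in either stream signals a moving target. A hypothetical
    simplification of the PADS idea, not the paper's algorithm."""
    amp_var = statistics.pvariance(csi_amp)
    # Unwrap consecutive phase differences so a slowly rotating phase
    # is not mistaken for motion.
    diffs = [math.remainder(b - a, 2 * math.pi)
             for a, b in zip(csi_phase, csi_phase[1:])]
    phase_var = statistics.pvariance(diffs)
    return amp_var > threshold or phase_var > threshold

# Static environment: nearly constant CSI readings.
static = detect_motion([1.0, 1.01, 0.99, 1.0], [0.1, 0.11, 0.1, 0.09])
# Moving target: large amplitude swings.
moving = detect_motion([1.0, 2.5, 0.2, 3.0], [0.1, 1.5, 2.8, 0.4])
```

A real system would fuse such metrics across subcarriers and antenna pairs rather than thresholding a single stream.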

74 citations


Journal ArticleDOI
TL;DR: This work adapts the decomposition-based framework for federated scheduling and proposes an energy-sub-optimal scheduler and derives an approximation algorithm to identify processors to be merged together for further improvements in energy-efficiency.
Abstract: This work studies energy-aware real-time scheduling of a set of sporadic Directed Acyclic Graph (DAG) tasks with implicit deadlines. While meeting all real-time constraints, we try to identify the best task allocation and execution pattern such that the average power consumption of the whole platform is minimized. To our knowledge, this is the first work that addresses the power consumption issue in scheduling multiple DAG tasks on multi-cores and allows intra-task processor sharing. First, we adapt the decomposition-based framework for federated scheduling and propose an energy-sub-optimal scheduler. Then, we derive an approximation algorithm to identify processors to be merged together for further improvements in energy-efficiency. The effectiveness of the proposed approach is evaluated both theoretically via approximation ratio bounds and also experimentally through simulation study. Experimental results on randomly generated workloads show that our algorithms achieve an energy saving of 60% to 68% compared to existing DAG task schedulers.
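The energy argument underlying such schedulers can be illustrated with a standard DVFS sketch: if dynamic power grows roughly as k·f^α with α ≈ 3, the energy-optimal speed for a deadline-constrained workload is the lowest frequency that still meets the deadline. This is only the textbook convexity argument, not the paper's federated scheduler; the constants and names are illustrative.

```python
def min_energy_frequency(workload_cycles, deadline_s, alpha=3.0, k=1.0):
    """Run as slow as the deadline allows: with dynamic power ~ k*f^alpha,
    the energy-optimal speed is the lowest one that still finishes on time.
    (A textbook DVFS sketch, not the paper's scheduler.)"""
    f = workload_cycles / deadline_s          # minimum feasible frequency (Hz)
    exec_time = workload_cycles / f           # equals the deadline
    energy = k * (f ** alpha) * exec_time
    return f, energy

f, e = min_energy_frequency(2e9, 4.0)   # 2G cycles, 4 s deadline -> 0.5 GHz
```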

46 citations


Journal ArticleDOI
TL;DR: A survey on energy-efficient multicore scheduling algorithms for hard real-time systems based on Partitioned, Semi-Partitioned, and Global scheduling techniques for both homogeneous and heterogeneous multicores is presented.
Abstract: As real-time embedded systems are evolving in scale and complexity, the demand for a higher performance at a minimum energy consumption has become a necessity. Consequently, many embedded systems are now adopting multicore architectures into their design. However, scheduling on multicores is not a trivial task and scheduling to minimize the energy consumption further increases the complexity of the problem. This problem is especially aggravated for hard real-time systems where failure to meet a deadline can be catastrophic. Such scheduling algorithms yearn for a polynomial time complexity for the task-to-core assignment problem with an objective to minimize the overall energy consumption. There is now a trend toward heterogeneous multicores where cores differ in power, performance, and architectural capabilities. The desired performance and energy consumption is attained by assigning a task to the core that is best suited for it. In this article, we present a survey on energy-efficient multicore scheduling algorithms for hard real-time systems. We summarize various algorithms reported in the literature and classify them based on Partitioned, Semi-Partitioned, and Global scheduling techniques for both homogeneous and heterogeneous multicores. We also present a detailed discussion on various open issues within this domain.
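A minimal example of the Partitioned class surveyed here is First-Fit task partitioning under a per-core utilization bound (a set of implicit-deadline tasks is EDF-schedulable on one core when its total utilization is at most 1). This generic sketch is not drawn from any specific surveyed algorithm.

```python
def partition_first_fit(utilizations, num_cores, capacity=1.0):
    """First-Fit task partitioning: place each task on the first core whose
    total utilization stays within capacity. Returns the per-task core
    assignment, or None if some task fits on no core."""
    cores = [0.0] * num_cores     # current utilization of each core
    assignment = []
    for u in utilizations:
        for i, load in enumerate(cores):
            if load + u <= capacity + 1e-9:   # small slack for float rounding
                cores[i] += u
                assignment.append(i)
                break
        else:
            return None  # no core can host this task
    return assignment

plan = partition_first_fit([0.6, 0.5, 0.4, 0.3], num_cores=2)
```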

36 citations


Journal ArticleDOI
TL;DR: An energy cooperation scheme for battery-free wireless networks with RF harvesting is proposed; the problem of optimizing system performance with limited harvested energy is formulated, and simulations show the new energy cooperation protocol performs better than the original battery-free wireless network solution.
Abstract: Radio frequency (RF) energy harvesting techniques are becoming a potential method to power battery-free wireless networks. In RF energy harvesting communications, energy cooperation enables shaping and optimization of the energy arrivals at the energy-receiving node to improve the overall system performance. In this article, we propose a scheme that enables energy cooperation in battery-free wireless networks with RF harvesting. We first study the battery-free wireless network with RF energy harvesting and then formulate the problem of optimizing system performance with limited harvested energy through a new energy cooperation protocol. Finally, extensive simulation results show that our energy cooperation protocol performs better than the original battery-free wireless network solution.

35 citations


Journal ArticleDOI
TL;DR: This article proposes a categorization of available controllers, and introduces an analytical performance model based on worst-case latency, and conducts an extensive evaluation for all state-of-the-art controllers based on a common simulation platform.
Abstract: Recently, the research community has introduced several predictable dynamic random-access memory (DRAM) controller designs that provide improved worst-case timing guarantees for real-time embedded systems. The proposed controllers significantly differ in terms of arbitration, configuration, and simulation environment, making it difficult to assess the contribution of each approach. To bridge this gap, this article provides the first comprehensive evaluation of state-of-the-art predictable DRAM controllers. We propose a categorization of available controllers, and introduce an analytical performance model based on worst-case latency. We then conduct an extensive evaluation of all state-of-the-art controllers based on a common simulation platform, and discuss findings and recommendations.
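A simple instance of such an analytical worst-case latency model: under work-conserving round-robin arbitration, a request can wait behind one in-flight request from each of the other requestors before being served. The uniform service time below is an illustrative simplification; real predictable controllers bound latency per DRAM command type.

```python
def worst_case_latency(num_requestors, t_service):
    """Worst-case latency under round-robin arbitration with a uniform
    service time: wait for (num_requestors - 1) other requests, then be
    served, giving L = (num_requestors - 1) * t_service + t_service.
    A simplified model in the spirit of the article's analytical comparison."""
    return num_requestors * t_service

lat = worst_case_latency(num_requestors=4, t_service=30)  # 30 ns/access -> 120 ns
```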

33 citations


Journal ArticleDOI
TL;DR: A novel and simple method to detect hijacking based only on gyroscope measurements and GPS data, without using any accelerometer in the detection procedure; its low computational complexity makes it suitable for implementation on drones with microcontrollers.
Abstract: With the fast growth of civil drones, their security problems meet significant challenges. A commercial drone may be hijacked by a GPS-spoofing attack for illegal activities, such as terrorist attacks. The target of this article is to develop a technique that only uses onboard gyroscopes to determine whether a drone has been hijacked. Ideally, GPS data and the angular velocities measured by gyroscopes can be used to estimate the acceleration of a drone, which can be further compared with the measurement of the accelerometer to detect whether a drone has been hijacked. However, the detection results may not always be accurate due to calculation and measurement errors, especially when no hijacking occurs in curved trajectories. To overcome this, we propose a novel and simple method to detect hijacking based only on gyroscope measurements and GPS data, without using any accelerometer in the detection procedure. The computational complexity of our method is very low, making it suitable for implementation on drones with microcontrollers. On the other hand, because the proposed method does not rely on any accelerometer to detect attacks, it receives less information in the detection procedure, which may reduce detection accuracy in some special situations. Since the accelerometer-based method can compensate for this flaw, high detection accuracy can still be guaranteed by combining the two methods. Experiments with a quad-rotor drone are conducted to show the effectiveness of the proposed method and the combined method.
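The accelerometer-free consistency check can be sketched as follows: heading changes inferred from consecutive GPS displacements should agree with the yaw rotation integrated from the gyroscope, and a spoofed trajectory breaks that agreement. This is a hypothetical simplification of the detection idea; the tolerance and inputs are invented for the example.

```python
import math

def hijack_alarm(gps_positions, gyro_yaw_rates, dt, tol=0.2):
    """Toy consistency check: heading changes inferred from GPS displacements
    should match the yaw rotation integrated from the gyroscope. A spoofed
    GPS trajectory breaks this agreement. Thresholds are illustrative."""
    headings = [math.atan2(y2 - y1, x2 - x1)
                for (x1, y1), (x2, y2) in zip(gps_positions, gps_positions[1:])]
    for i in range(len(headings) - 1):
        gps_turn = math.remainder(headings[i + 1] - headings[i], 2 * math.pi)
        gyro_turn = gyro_yaw_rates[i] * dt
        if abs(gps_turn - gyro_turn) > tol:
            return True  # measurements disagree: possible spoofing
    return False

# Straight flight, zero yaw rate: consistent.
ok = hijack_alarm([(0, 0), (1, 0), (2, 0), (3, 0)], [0.0, 0.0], dt=1.0)
# GPS claims a sharp left turn the gyro never saw.
bad = hijack_alarm([(0, 0), (1, 0), (1, 1), (1, 2)], [0.0, 0.0], dt=1.0)
```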

32 citations


Journal ArticleDOI
TL;DR: A self-sustaining sensing system that draws energy from indoor environments, adapts its duty-cycle to the harvested energy, and pays back the environment by enhancing the awareness of the indoor microclimate through an “energy-free” sensing.
Abstract: Whereas a lot of efforts have been put on energy conservation in wireless sensor networks (WSNs), the limited lifetime of these systems still hampers their practical deployments. This situation is further exacerbated indoors, as conventional energy harvesting (e.g., solar) may not always work. To enable long-lived indoor sensing, we report in this article a self-sustaining sensing system that draws energy from indoor environments, adapts its duty-cycle to the harvested energy, and pays back the environment by enhancing the awareness of the indoor microclimate through an “energy-free” sensing. First of all, given the pervasive operation of heating, ventilation, and air conditioning (HVAC) systems indoors, our system harvests energy from airflow introduced by the HVAC systems to power each sensor node. Secondly, as the harvested power is tiny, an extremely low but synchronous duty-cycle has to be applied whereas the system gets no energy surplus to support existing synchronization schemes. So, we design two complementary synchronization schemes that cost virtually no energy. Finally, we exploit the feature of our harvester to sense the airflow speed in an energy-free manner. To our knowledge, this is the first indoor wireless sensing system that encapsulates energy harvesting, network operating, and sensing all together.
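The duty-cycle adaptation can be illustrated with the standard energy-neutrality condition: pick the largest on-time fraction d such that the average draw d * P_active + (1 - d) * P_sleep stays within the harvested power. This generic sketch is not the article's scheme; all power figures are illustrative.

```python
def duty_cycle(harvested_uw, active_uw, sleep_uw, floor=0.001):
    """Energy-neutral operation: the largest on-time fraction d satisfying
    d*active + (1-d)*sleep <= harvested. A generic harvesting-aware sketch,
    not the article's synchronization scheme; powers are in microwatts."""
    if harvested_uw <= sleep_uw:
        return floor  # cannot even cover sleep power; run at a minimal floor
    d = (harvested_uw - sleep_uw) / (active_uw - sleep_uw)
    return max(floor, min(1.0, d))

d = duty_cycle(harvested_uw=50.0, active_uw=5000.0, sleep_uw=5.0)  # under 1%
```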

22 citations


Journal ArticleDOI
TL;DR: A framework for automatically mining temporal properties that are in the form of timed regular expressions (TREs) from system traces using an abstract structure of the property to serve as an acceptor is proposed.
Abstract: Temporal properties define the order of occurrence and timing constraints on event occurrence. Such specifications are important for safety-critical real-time systems. We propose a framework for automatically mining temporal properties that are in the form of timed regular expressions (TREs) from system traces. Using an abstract structure of the property, the framework constructs a finite state machine to serve as an acceptor. We analytically derive speedup for the fragment and confirm the speedup using empirical validation with synthetic traces. The framework is evaluated on industrial-strength safety-critical real-time applications using traces with more than 1 million entries.
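A tiny hand-written acceptor conveys the flavor of checking a timed property over a trace; the mined TREs are far more general. The property, event labels, and bound below are invented for the example: every `first` event must be followed by a matching `then` event within `max_delta` time units.

```python
def accepts_within(events, first, then, max_delta):
    """Tiny acceptor for the timed property "every `first` is followed by a
    `then` within max_delta time units", over a trace of (timestamp, label)
    pairs. A hand-rolled sketch, far simpler than the mined TRE automata."""
    pending = []  # timestamps of unmatched `first` events
    for t, label in events:
        # A pending `first` whose deadline has passed means the property failed.
        if any(t - t0 > max_delta for t0 in pending):
            return False
        if label == first:
            pending.append(t)
        elif label == then and pending:
            pending.pop(0)  # discharge the oldest obligation
    return not pending  # finite trace: every obligation must be discharged

good = accepts_within([(0, "req"), (3, "ack"), (5, "req"), (7, "ack")],
                      "req", "ack", 4)
bad = accepts_within([(0, "req"), (9, "ack")], "req", "ack", 4)
```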

21 citations


Journal ArticleDOI
TL;DR: This article presents implementations for Addition, Rotation, and eXclusive-or (ARX)-based block ciphers, including LEA and HIGHT, on IoT devices, including 8-bit AVR, 16-bit MSP, 32-bit ARM, and 32-bit ARM-NEON processors.
Abstract: In this article, we present implementations of Addition, Rotation, and eXclusive-or (ARX)-based block ciphers, including LEA and HIGHT, on IoT devices, including 8-bit AVR, 16-bit MSP, 32-bit ARM, and 32-bit ARM-NEON processors. We optimized 32-/8-bitwise ARX operations for the LEA and HIGHT block ciphers by considering variations in word size, the number of general purpose registers, and the instruction set of the target IoT devices. Finally, we achieved the most compact implementations of the LEA and HIGHT block ciphers. The implementations were fairly evaluated through the Fair Evaluation of Lightweight Cryptographic Systems framework, and our implementations won the competitions in the first and second rounds.
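The three primitive operations the paper optimizes can be shown in one generic mixing step on 32-bit words. This round structure is illustrative only; the actual LEA and HIGHT key schedules and round functions differ.

```python
MASK32 = 0xFFFFFFFF

def rotl32(x, r):
    """32-bit left rotation."""
    return ((x << r) | (x >> (32 - r))) & MASK32

def arx_mix(a, b, key, r1=9, r2=5):
    """One generic ARX mixing step: modular Addition, Rotation, and
    eXclusive-or on 32-bit words, the three operations the paper optimizes
    per platform. Illustrative only; not the real LEA/HIGHT round."""
    a = rotl32((a + (b ^ key)) & MASK32, r1)
    b = rotl32((b + key) & MASK32, r2) ^ a
    return a, b

a, b = arx_mix(0x01234567, 0x89ABCDEF, 0x0F0F0F0F)
```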

20 citations


Journal ArticleDOI
TL;DR: In this paper, the authors present a methodology for hybrid application mapping, which comprises a design space exploration coupled with a formal performance analysis, with verified real-time guarantees for each individual application.
Abstract: Executing multiple applications on a single MPSoC brings the major challenge of satisfying multiple quality requirements regarding real-time, energy, and so on. Hybrid application mapping denotes the combination of design-time analysis with run-time application mapping. In this article, we present such a methodology, which comprises a design space exploration coupled with a formal performance analysis. This results in several resource reservation configurations, optimized for multiple objectives, with verified real-time guarantees for each individual application. The Pareto-optimal configurations are handed over to run-time management, which searches for a suitable mapping according to this information. To provide any real-time guarantees, the performance analysis needs to be composable and the influence of the applications on each other has to be bounded. We achieve this either by spatial or a novel temporal isolation for tasks and by exploiting composable networks-on-chip (NoCs). With the proposed temporal isolation, tasks of different applications can be mapped to the same resource, while, with spatial isolation, one computing resource can be exclusively used by only one application. The experiments reveal that the success rate in finding feasible application mappings can be increased by the proposed temporal isolation by up to 30% and energy consumption can be reduced compared to spatial isolation.
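The hand-over from design-time exploration to run-time management hinges on keeping only Pareto-optimal configurations. A minimal filter over two objectives (here hypothetically energy and latency) looks like this; the objective names and values are illustrative.

```python
def pareto_front(configs):
    """Keep only configurations not dominated in (energy, latency):
    a config dominates another if it is no worse in both objectives and
    strictly better in at least one. Sketch of the exploration hand-over
    described above; objective names are illustrative."""
    front = []
    for c in configs:
        dominated = any(d != c and d[0] <= c[0] and d[1] <= c[1]
                        and (d[0] < c[0] or d[1] < c[1]) for d in configs)
        if not dominated:
            front.append(c)
    return front

front = pareto_front([(5, 10), (6, 8), (7, 7), (8, 8), (6, 12)])
```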

19 citations


Journal ArticleDOI
TL;DR: This work presents a sufficient schedulability test, which has pseudo-polynomial-time complexity if the number of processors is fixed, and runs experiments with synthetic software benchmarks on a quad-core Intel multicore processor with the Linux/RK operating system, finding that for each task, its maximum measured response time was bounded by the upper bound computed by the theory.
Abstract: Consider fixed-priority preemptive partitioned scheduling of constrained-deadline sporadic tasks on a multiprocessor. A task generates a sequence of jobs and each job has a deadline that must be met. Assume tasks have Corunner-dependent execution times; i.e., the execution time of a job J depends on the set of jobs that happen to execute (on other processors) at instants when J executes. We present a model that describes Corunner-dependent execution times. For this model, we show that exact schedulability testing is co-NP-hard in the strong sense. Facing this complexity, we present a sufficient schedulability test, which has pseudo-polynomial-time complexity if the number of processors is fixed. We ran experiments with synthetic software benchmarks on a quad-core Intel multicore processor with the Linux/RK operating system and found that for each task, its maximum measured response time was bounded by the upper bound computed by our theory.
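The style of bound involved can be seen in the classic single-core fixed-priority response-time iteration, which the article's sufficient test extends with corunner-dependent execution times. The task parameters below are invented for the example.

```python
import math

def response_time(task, higher_prio):
    """Classic fixed-priority response-time iteration:
        R = C + sum_j ceil(R / T_j) * C_j   over higher-priority tasks j.
    Shown in its standard single-core form; the article's test builds on
    this style of bound. Tasks are (C, T, D) triples."""
    c, _t, d = task
    r = c
    while True:
        r_next = c + sum(math.ceil(r / tj) * cj for cj, tj, _dj in higher_prio)
        if r_next > d:
            return None        # deadline miss: unschedulable at this priority
        if r_next == r:
            return r           # fixed point reached
        r = r_next

# (C, T, D) triples; the list holds the higher-priority tasks.
r = response_time((2, 20, 20), higher_prio=[(1, 4, 4), (2, 10, 10)])
```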

Journal ArticleDOI
TL;DR: A general approach is proposed that translates loops consisting of short-SIMD instructions to machine-independent IR, conducts SIMD loop transformation/optimization at this IR level, and finally translates to long- SIMD instructions.
Abstract: Recent trends in SIMD architecture have tended toward longer vector lengths, and more enhanced SIMD features have been introduced in newer vector instruction sets. However, legacy or proprietary applications compiled with short-SIMD ISA cannot benefit from the long-SIMD architecture that supports improved parallelism and enhanced vector primitives, resulting in only a small fraction of potential peak performance. This article presents a dynamic binary translation technique that enables short-SIMD binaries to exploit benefits of new SIMD architectures by rewriting short-SIMD loop code. We propose a general approach that translates loops consisting of short-SIMD instructions to machine-independent IR, conducts SIMD loop transformation/optimization at this IR level, and finally translates to long-SIMD instructions. Two solutions are presented to enforce SIMD load/store alignment, one for the problem caused by the binary translator’s internal translation condition and one general approach using dynamic loop peeling optimization. Benchmark results show that average speedups of 1.51× and 2.48× are achieved for an ARM NEON to x86 AVX2 and x86 AVX-512 loop transformation, respectively.

Journal ArticleDOI
TL;DR: A lightweight do-it-all cryptographic design that offers the basic underlying functionalities to secure embedded communication systems in tiny devices and performs a thorough security analysis of the new design with respect to its diffusion, differential and linear, and algebraic properties.
Abstract: The emerging areas in which highly resource constrained devices interact wirelessly to accomplish tasks have led manufacturers to embed communication systems in them. Tiny low-end devices such as sensor network nodes and Radio Frequency Identification (RFID) tags are of particular importance due to their vulnerability to security attacks, which makes protecting their communication privacy and authenticity an essential matter. In this work, we present a lightweight do-it-all cryptographic design that offers the basic underlying functionalities to secure embedded communication systems in tiny devices. Specifically, we revisit the design approach of the sLiSCP family of lightweight cryptographic permutations, which was proposed in SAC 2017. sLiSCP is designed to be used in a unified duplex sponge construction to provide minimal overhead for multiple cryptographic functionalities within one hardware design. The design of sLiSCP follows a 4-subblock Type-2 Generalized Feistel-like Structure (GFS) with unkeyed round-reduced Simeck as the round function, which is an extremely efficient building block in terms of its hardware area requirements. In sLiSCP-light, we tweak the GFS design and turn it into an elegant Partial Substitution-Permutation Network construction, which further reduces the hardware areas of the sLiSCP permutations by around 16% of their original values. The new design also enhances the bit diffusion and algebraic properties of the permutations and enables us to reduce the number of steps, thus achieving a better throughput in both the hashing and authentication modes. We perform a thorough security analysis of the new design with respect to its diffusion, differential and linear, and algebraic properties. For sLiSCP-light-192, we report parallel implementation hardware areas of 1,820 (respectively, 1,892) GE in CMOS 65nm (respectively, 130nm) ASIC. The areas for sLiSCP-light-256 are 2,397 and 2,500 GE in CMOS 65nm and 130nm ASIC, respectively. Overall, the unified duplex sponge mode of sLiSCP-light-192, which provides (authenticated) encryption and hashing functionalities, satisfies the area (1,958 GE), power (3.97 μW), and throughput (44.4 kbps) requirements of passive RFID tags.

Journal ArticleDOI
TL;DR: This article proposes a real-time aggregation scheduling method for WIA-PA networks that combines the real-time theory with the classical bin-packing method to aggregate original packets into the minimum number of aggregated packets, and demonstrates that this method outperforms the traditional bin-packing method.
Abstract: The IEC standard WIA-PA is a communication protocol for industrial wireless sensor networks. Its special features, including a hierarchical topology, hybrid centralized-distributed management, and packet aggregation, make it suitable for large-scale industrial wireless sensor networks. Industrial systems place strict real-time requirements on wireless sensor networks. However, the WIA-PA standard does not specify the transmission methods, which are vital to the real-time performance of wireless networks, and little work has been done to address this problem. In this article, we propose a real-time aggregation scheduling method for WIA-PA networks. First, to satisfy the real-time constraints on dataflows, we propose a method that combines real-time theory with the classical bin-packing method to aggregate original packets into the minimum number of aggregated packets. The simulation results indicate that our method outperforms the traditional bin-packing method, aggregating up to 35% fewer packets, and improves the real-time performance by up to 10%. Second, to make it possible to solve the scheduling problem of WIA-PA networks using classical scheduling algorithms, we transform the ragged time slots of WIA-PA networks into a universal model. In the simulation, a large number of WIA-PA networks are randomly generated to evaluate the performance of several real-time scheduling algorithms. By comparing the results, we find that the earliest deadline first real-time scheduling algorithm is the preferred method for WIA-PA networks.
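The combination step can be sketched with classical First-Fit-Decreasing bin packing: sort packets largest-first and place each into the first aggregated frame with room. The article additionally folds per-flow real-time constraints into this step; the sizes and capacity here are illustrative.

```python
def aggregate_first_fit_decreasing(packet_sizes, frame_capacity):
    """First-Fit-Decreasing aggregation: sort packets largest-first and
    place each into the first aggregated frame with room, opening a new
    frame only when none fits. A classical bin-packing sketch; the article
    additionally enforces per-flow deadline constraints."""
    frames = []    # remaining capacity per aggregated frame
    contents = []  # packet sizes placed in each frame
    for size in sorted(packet_sizes, reverse=True):
        for i, free in enumerate(frames):
            if size <= free:
                frames[i] -= size
                contents[i].append(size)
                break
        else:
            frames.append(frame_capacity - size)
            contents.append([size])
    return contents

frames = aggregate_first_fit_decreasing([30, 70, 20, 50, 40], frame_capacity=100)
```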

Journal ArticleDOI
TL;DR: This paper highlights the challenges of teaching and learning the English language in Nigerian schools, discusses the importance of English as the official language of communication in Nigeria, and suggests useful strategies that teachers and learners may adopt so that teaching and learning of English becomes simpler.
Abstract: The importance of the English language cannot be overemphasized, due to its role in social, political, economic, and environmental development. The English language functions as a vehicle of interaction and an instrument of communication. This paper discusses the importance of English as the official language of communication in Nigeria. It highlights the challenges of teaching and learning the English language in Nigerian schools, and also discusses useful strategies that teachers and learners of English in Nigerian schools may adopt so that teaching and learning of English becomes simpler.

Journal ArticleDOI
TL;DR: A novel and exact worst-case response time (WCRT) analysis method for message-processing tasks in the gateway of a controller area network (CAN) cluster integrated by a central gateway.
Abstract: A typical automotive integrated architecture is a controller area network (CAN) cluster integrated by a central gateway. This study proposes a novel and exact worst-case response time (WCRT) analysis method for message-processing tasks in the gateway. We first propose a round search method to obtain a lower bound on response time (LBRT) and an upper bound on response time (UBRT). We then obtain the exact WCRT, lying between the LBRT and UBRT, with an effective non-exhaustive exploration. Experimental results on a real CAN message set reveal that the proposed exact analysis method can prune 99.99999% of the combinations explored on large-scale CAN clusters.

Journal ArticleDOI
TL;DR: This article evaluates a variant of TES, i.e., the Hash-Counter-Hash scheme, which involves polynomial hashing, as other variants are either similar or do not involve finite field multiplication, the most involved operation in TES.
Abstract: Through pseudorandom permutation, tweakable enciphering schemes (TES) constitute block cipher modes of operation which perform length-preserving computations. The state-of-the-art research has focused on different aspects of TES, including implementations on hardware [field-programmable gate array (FPGA)/application-specific integrated circuit (ASIC)] and software (hard/soft-core microcontroller) platforms, algorithmic security, and applicability to sensitive, security-constrained usage models. In this article, we propose efficient approaches for protecting such schemes against natural and malicious faults. Specifically, noting that intelligent attackers are not merely confined to injecting multiple faults, one major benchmark for the proposed schemes is evaluation under biased and burst fault models. We evaluate a variant of TES, i.e., the Hash-Counter-Hash scheme, which involves polynomial hashing, as other variants are either similar or do not involve finite field multiplication, which is by far the most involved operation in TES. In addition, we benchmark the overhead and performance degradation on the ASIC platform. The results of our error injection simulations and ASIC implementations show the suitability of the proposed approaches for a wide range of applications including deeply embedded systems.

Journal ArticleDOI
TL;DR: In this paper, the authors examine dynamic energy consumption caused by data during software execution on deeply embedded microprocessors, and conclude that any energy model targeting tightness must either sacrifice safety or accept overapproximation proportional to data-dependent energy.
Abstract: This article examines dynamic energy consumption caused by data during software execution on deeply embedded microprocessors, which can be significant on some devices. In worst-case energy consumption analysis, energy models are used to find the most costly execution path. Taking each instruction’s worst-case energy produces a safe but overly pessimistic upper bound. Algorithms for safe and tight bounds would be desirable. We show that finding exact worst-case energy is NP-hard, and that tight bounds cannot be approximated with guaranteed safety. We conclude that any energy model targeting tightness must either sacrifice safety or accept overapproximation proportional to data-dependent energy.

Journal ArticleDOI
TL;DR: An automotive climate control methodology that is aware of battery behavior and performance while maintaining the passenger’s thermal comfort is proposed; battery stress is reduced while cabin temperature is maintained by predicting and optimizing the system states in the near future.
Abstract: Electric Vehicles (EVs) as a zero-emission means of transportation encounter challenges in battery design that cause range anxiety for drivers. Besides the electric motor, the Heating, Ventilation, and Air Conditioning (HVAC) system is another major contributor to the power consumption that may influence the EV battery lifetime and driving range. In state-of-the-art methodologies for battery management systems, the battery performance is monitored and improved, while in automotive climate control, the passenger’s thermal comfort is the main objective. Hence, the influence of the HVAC power on the battery behavior, for the purpose of jointly optimized battery management and climate control, has not been considered. In this article, we propose an automotive climate control methodology that is aware of the battery behavior and performance, while maintaining the passenger’s thermal comfort. In our methodology, battery parameters and cabin temperature are modeled and estimated, and the HVAC utilization is optimized and adjusted with respect to the electric motor and HVAC power requests. Therefore, the battery stress is reduced, while the cabin temperature is maintained, by predicting and optimizing the system states in the near future. We have implemented our methodology and compared its performance to the state-of-the-art in terms of battery lifetime improvement and energy consumption reduction. We have also conducted experiments and analyses to explore multiple control window sizes, drive profiles, ambient temperatures, and modeling error rates in the methodology. It is shown that our battery-aware climate control can extend the battery lifetime by up to 13.2% and reduce the energy consumption by up to 14.4%.

Journal ArticleDOI
TL;DR: This article proposes DLSpace, a novel distributed log space allocation strategy that divides log space into block-level log space and page-level log space to significantly extend solid-state disk lifetime.
Abstract: Due to the limited number of program/erase cycles (i.e., P/Es) of NAND Flash, excessive out-of-place update and erase-before-write operations wear out these P/Es during garbage collections, which adversely shortens solid state disk (i.e., SSD) lifetime. The log space in the NAND Flash space of an SSD acts as a buffer for updated pages, which lowers garbage-collection frequency while reducing consumption of P/Es to extend SSD lifetime. In this article, we propose DLSpace, a novel distributed log space allocation strategy, which divides log space into block-level log space and page-level log space to significantly extend SSD lifetime. DLSpace's log page space is dedicated to the data pages in a data block. Such log page space buffers only page-update operations in its data block, thereby using log blocks to postpone garbage collection. DLSpace is conducive to fully utilizing pages in data and log blocks to avoid erasures of blocks with free pages. Consequently, DLSpace decreases write amplification by reducing excessive valid page-rewrite and block-erase operations under random-write-intensive workloads. We carried out quantitative research on the extension of SSD lifetime by virtue of three metrics (i.e., write amplification, the number of block-erase operations, and the delay time before the first garbage collection occurs). Experimental results reveal that, compared with the existing traditional allocation strategy for log space (i.e., TLSpace), DLSpace reduces write amplification and the number of erase operations by up to 55.2% and 64.1%, respectively. DLSpace also extends TLSpace's delay time before garbage collection by 73.3% to optimize SSD lifetime.
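The headline metric is easy to state: write amplification is total flash page programs (host writes plus valid pages rewritten during garbage collection) divided by host writes. The formula is the standard one; the counts below are invented to show how absorbing updates in log space lowers the factor.

```python
def write_amplification(host_writes, gc_rewrites):
    """Write amplification factor: total flash page programs (host writes
    plus valid pages rewritten during garbage collection) divided by host
    writes. The standard metric that DLSpace is reported to reduce."""
    return (host_writes + gc_rewrites) / host_writes

# A log-space scheme that absorbs updates cuts GC rewrites, e.g. 4000 -> 1500.
wa_baseline = write_amplification(10_000, 4_000)   # 1.4
wa_logspace = write_amplification(10_000, 1_500)   # 1.15
```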

Journal ArticleDOI
TL;DR: This work presents a distributed runtime resource management framework for many-core systems utilizing a network-on-chip (NoC) infrastructure that manages to allocate resources efficiently at runtime, leading to gains of up to 30% in application execution latency compared to relevant state-of-the-art distributed resource management frameworks.
Abstract: As technology constantly strengthens its presence in all aspects of human life, computing systems integrate a high number of processing cores, whereas applications become more complex and greedy for computational resources. Inevitably, this high increase in processing elements combined with the unpredictable resource requirements of executed applications at design time impose new design constraints to resource management of many-core systems, turning the distributed functionality into a necessity. In this work, we present a distributed runtime resource management framework for many-core systems utilizing a network-on-chip (NoC) infrastructure. Specifically, we couple the concept of distributed management with parallel applications by assigning different roles to the available computing resources. The presented design is based on the idea of local controllers and managers, whereas an on-chip intercommunication scheme ensures decision distribution. The evaluation of the proposed framework was performed on an Intel Single-Chip Cloud Computer, an actual NoC-based, many-core system. Experimental results show that the proposed scheme manages to allocate resources efficiently at runtime, leading to gains of up to 30% in application execution latency compared to relevant state-of-the-art distributed resource management frameworks.

Journal ArticleDOI
TL;DR: This work takes advantage of the fact that the iteration is a sequence of smaller behaviours, each captured in a scenario, that are typically repeated many times, and allows a compositional worst-case throughput analysis of the repeated scenarios by raising the matrices to the power of the number of repetitions.
Abstract: Multi-scale dataflow models have actors acting at multiple granularity levels, e.g., a dataflow model of a video processing application with operations on frame, line, and pixel level. The state-of-the-art timing analysis methods for both static and dynamic dataflow types aggregate the behaviours across all granularity levels into one, often large, iteration, which is repeated without exploiting the structure within such an iteration. This poses scalability issues to dataflow analysis, because behaviour of the large iteration is analysed by some form of simulation that involves a large number of actor firings. We take a fresh perspective of what is happening inside the large iteration. We take advantage of the fact that the iteration is a sequence of smaller behaviours, each captured in a scenario, that are typically repeated many times. We use the (max, +) linear model of dataflow to represent each of the scenarios with a matrix. This allows a compositional worst-case throughput analysis of the repeated scenarios by raising the matrices to the power of the number of repetitions, which scales logarithmically with the number of repetitions, whereas the existing throughput analysis scales linearly. We moreover provide the first exact worst-case latency analysis for scenario-aware dataflow. This compositional latency analysis also scales logarithmically when applied to multi-scale dataflow models. We apply our new throughput and latency analysis to several realistic applications. The results confirm that our approach provides a fast and accurate analysis.
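The logarithmic scaling claimed above comes from exponentiation by squaring in the (max, +) semiring. The following sketch (illustrative only; the matrix values are hypothetical and this is not the authors' implementation) shows how a scenario matrix can be raised to the number of repetitions using O(log n) matrix products:

```python
def maxplus_mul(A, B):
    """(max, +) matrix product: (A ⊗ B)[i][j] = max_k (A[i][k] + B[k][j])."""
    n = len(A)
    return [[max(A[i][k] + B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def maxplus_power(A, reps):
    """Raise A to the reps-th (max, +) power by repeated squaring,
    using O(log reps) products instead of reps - 1."""
    result = None
    base = A
    while reps:
        if reps & 1:  # this bit of the exponent contributes a factor
            result = base if result is None else maxplus_mul(result, base)
        base = maxplus_mul(base, base)
        reps >>= 1
    return result
```

For r repetitions of a scenario, `maxplus_power(M, r)` needs only about log2(r) squarings plus at most as many extra products, which is where the logarithmic (rather than linear) scaling comes from.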

Journal ArticleDOI
TL;DR: This work focuses on the implementation of the spatial isolation of resources for sensitive applications in order to minimize the induced performance overhead and avoids cache sharing with sensitive processes.
Abstract: Current cache Side-Channel Attack (SCA) countermeasures have not been designed for many-core architectures and need to be revisited in order to be practical for these new technologies. Spatial isolation of resources for sensitive applications has been proposed, taking advantage of the large number of resources offered by these architectures. This solution avoids cache sharing with sensitive processes. Consequently, their cache activity cannot be monitored and cache SCAs cannot be performed. This work focuses on the implementation of this technique in order to minimize the induced performance overhead. Different strategies for the management of isolated secure zones are implemented and compared.

Journal ArticleDOI
TL;DR: A cache performance evaluation framework equipped with three analytical models, which can more accurately predict cache misses, MLPs, and the average cache miss service time, respectively is proposed.
Abstract: Utilizing analytical models to evaluate proposals or provide guidance in high-level architecture decisions is becoming increasingly attractive. A number of methods concerning cache behaviors and quantified insights have emerged in the last decade, such as the stack distance theory and memory level parallelism (MLP) estimations. However, prior research normally oversimplified the factors that need to be considered in out-of-order processors, such as the effects triggered by reordered memory instructions, multiple dependences among memory instructions, and merged accesses in the same MSHR entry. These ignored influences result in low and unstable precision in recent analytical models. By quantifying the aforementioned effects, this article proposes a cache performance evaluation framework equipped with three analytical models, which can more accurately predict cache misses, MLPs, and the average cache miss service time, respectively. Similar to prior studies, these analytical models are all fed with profiled software characteristics, in which case the architecture evaluation process can be accelerated significantly when compared with cycle-accurate simulations. We evaluate the accuracy of the proposed models against gem5 cycle-accurate simulations with 16 benchmarks chosen from Mobybench Suite 2.0, Mibench 1.0, and Mediabench II. The average root mean square errors for predicting cache misses, MLPs, and the average cache miss service time are around 4%, 5%, and 8%, respectively. Meanwhile, the average error of predicting the stall time due to cache misses by our framework is as low as 8%. The whole cache performance estimation can be sped up by about 15 times versus gem5 cycle-accurate simulations and 4 times when compared with recent studies. Furthermore, we have studied the relationships between different performance metrics and reorder buffer sizes by using our models.
As an application case of the framework, we also demonstrate how to use our framework combined with McPAT to find out Pareto optimal configurations for cache design space explorations.
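For reference, the stack-distance theory the abstract builds on can be sketched in a few lines. This is a deliberately simplified model (fully associative LRU cache, no MLP, no out-of-order or MSHR effects, which are exactly the factors the article's models add); the trace values are hypothetical:

```python
def stack_distances(trace):
    """Return the LRU stack distance of each access (None for cold misses).

    The stack distance is the number of distinct addresses touched since
    the previous access to the same address."""
    stack = []          # most recently used address last
    dists = []
    for addr in trace:
        if addr in stack:
            pos = stack.index(addr)
            dists.append(len(stack) - 1 - pos)
            stack.pop(pos)
        else:
            dists.append(None)  # cold (compulsory) miss
        stack.append(addr)
    return dists

def predict_misses(trace, capacity):
    """Misses in a fully associative LRU cache of `capacity` blocks:
    cold misses plus accesses whose stack distance >= capacity."""
    return sum(1 for d in stack_distances(trace)
               if d is None or d >= capacity)
```

Because the distance histogram depends only on the trace, one profiling pass can predict miss counts for every capacity, which is what makes such models far cheaper than cycle-accurate simulation.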

Journal ArticleDOI
Hwajeong Seo1
TL;DR: This work presents very compact and generic implementations of multiplication and squaring operations on the 16-bit MSP430X processors for ECC, utilizing the new 32-bit multiplier and advanced multiplication and squaring routines.
Abstract: On low-end embedded processors, implementing Elliptic Curve Cryptography (ECC) is considered a challenging task due to their limited computation power and storage. In particular, the multi-precision multiplication and squaring operations are the most expensive operations in ECC implementations. In order to enhance performance, many works have presented efficient multiplication and squaring routines on the target devices. Recent works show that 128-bit security level ECC can be computed within a second, which is practically fast enough for IoT services. However, previous approaches missed another important storage issue (i.e., program size, ROM). Considering that embedded processors only have a few KB of ROM, we need to pay attention to compact ROM size with reasonable performance. In this article, we present very compact and generic implementations of multiplication and squaring operations on the 16-bit MSP430X processors for ECC. The implementations utilize the new 32-bit multiplier and advanced multiplication and squaring routines. Since the proposed routines are generic, arbitrary operand lengths are supported with high speed and small code size. With the proposed multiplication and squaring routines, we implemented Curve25519 on the MSP430X processors. The scalar multiplication is performed within 6,666,895 clock cycles and 4,054 bytes. Compared with previous speed-optimized works, our memory-efficient version reduces the code size by 59.8%, at the cost of a 20.5% increase in execution time.
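To illustrate the kind of routine being optimized, here is a plain operand-scanning multi-precision multiply over 16-bit limbs, mirroring the MSP430X word size. This is a generic textbook sketch, not the article's hand-tuned code:

```python
WORD = 16                # limb width in bits, matching a 16-bit MCU word
MASK = (1 << WORD) - 1

def mp_mul(a, b):
    """Operand-scanning multi-precision multiply of little-endian
    16-bit limb arrays; returns len(a) + len(b) result limbs."""
    r = [0] * (len(a) + len(b))
    for i, ai in enumerate(a):
        carry = 0
        for j, bj in enumerate(b):
            t = r[i + j] + ai * bj + carry   # fits in 32 bits + carry
            r[i + j] = t & MASK
            carry = t >> WORD
        r[i + len(b)] += carry               # top limb is still empty here
    return r

def to_limbs(x, n):
    """Split integer x into n little-endian 16-bit limbs."""
    return [(x >> (WORD * k)) & MASK for k in range(n)]

def from_limbs(limbs):
    """Recombine little-endian 16-bit limbs into an integer."""
    return sum(v << (WORD * k) for k, v in enumerate(limbs))
```

The inner product `ai * bj` is exactly what the MSP430X's hardware multiplier accelerates; the article's contribution lies in routines that keep such loops both fast and small in ROM.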

Journal ArticleDOI
TL;DR: A motivating teaching strategy is tentatively recommended to create situations that may encourage the production of self-repairs and give the learners more opportunity to use the target language.
Abstract: With the development of cognitive linguistics, speech errors have attracted the attention of many scholars. This article focuses on the common phenomenon of speech errors, intending to probe the complicated process of language production on the basis of their categorization and causes, grounded in comprehensive theoretical analysis. The article also points out the significance of its application in English teaching, especially in the domain of oral language. A motivating teaching strategy is tentatively recommended: creating situations that may encourage the production of self-repairs and give learners more opportunity to use the target language.

Journal ArticleDOI
TL;DR: This work proposes an integer linear programming--based method that terminates very fast while producing the optimum solution, considering both uniform and weighted cost of reagents, and can be used conveniently in tandem with several existing sample-preparation algorithms for improving their performance.
Abstract: Sample preparation plays a crucial role in almost all biochemical applications, since a predominant portion of biochemical analysis time is associated with sample collection, transportation, and preparation. Many sample-preparation algorithms have been proposed in the literature that are suitable for execution on programmable digital microfluidic (DMF) platforms. In most of the existing DMF-based sample-preparation algorithms, a fixed target ratio is provided as input, and the corresponding mixing tree is generated as output. However, in many biochemical applications, target mixtures with exact component proportions may not be needed. From a biochemical perspective, it may be sufficient to prepare a mixture in which the input reagents lie within a range of concentration factors. The choice of a particular valid ratio, however, strongly impacts solution-preparation cost and time. To address this problem, we propose a concentration-resilient ratio-selection method over the input ratio space so that the reactant cost is minimized. We propose an integer linear programming--based method that terminates very fast while producing the optimum solution, considering both uniform and weighted costs of reagents. Experimental results reveal that the proposed method can be used conveniently in tandem with several existing sample-preparation algorithms for improving their performance.
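The ratio-selection problem can be illustrated with a tiny brute-force stand-in for the article's ILP formulation. Everything here is a hypothetical example: ratios are assumed to sum to 2^depth (as in (1:1)-mixing models), and the range bounds and cost vector are invented for illustration:

```python
from itertools import product
from math import ceil, floor

def select_ratio(ranges, costs, depth=4):
    """Among all integer ratios x1:...:xk with sum(x) = 2**depth and each
    x_i / 2**depth inside the allowed band ranges[i] = (lo, hi), return
    the one minimizing the weighted reagent cost sum(costs[i] * x_i).

    A brute-force sketch of what the article solves exactly with ILP."""
    total = 1 << depth
    spans = [range(max(0, ceil(lo * total)), floor(hi * total) + 1)
             for lo, hi in ranges]
    candidates = [(sum(c * x for c, x in zip(costs, xs)), xs)
                  for xs in product(*spans)
                  if sum(xs) == total]
    return min(candidates)[1] if candidates else None
```

With three reagents allowed anywhere in [0.2, 0.5] and the first reagent three times as expensive, the search pushes the costly reagent to its lower bound, e.g. `select_ratio([(0.2, 0.5)] * 3, [3, 1, 1])` picks a ratio of the form 4:x:y out of 16. An ILP solver reaches the same optimum without enumeration, which is why it scales to realistic ratio spaces.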

Journal ArticleDOI
TL;DR: In this paper, the authors present vectorization and scheduling methods to effectively exploit multiple forms of parallelism for throughput optimization on hybrid CPU-GPU platforms, while conforming to system-level memory constraints.
Abstract: The increasing use of heterogeneous embedded systems with multi-core CPUs and Graphics Processing Units (GPUs) presents important challenges in effectively exploiting pipeline, task, and data-level parallelism to meet throughput requirements of digital signal processing applications. Moreover, in the presence of system-level memory constraints, hand optimization of code to satisfy these requirements is inefficient and error prone and can therefore greatly slow down development time or result in highly underutilized processing resources. In this article, we present vectorization and scheduling methods to effectively exploit multiple forms of parallelism for throughput optimization on hybrid CPU-GPU platforms, while conforming to system-level memory constraints. The methods operate on synchronous dataflow representations, which are widely used in the design of embedded systems for signal and information processing. We show that our novel methods can significantly improve system throughput compared to previous vectorization and scheduling approaches under the same memory constraints. In addition, we present a practical case study of applying our methods to significantly improve the throughput of an orthogonal frequency division multiplexing receiver system for wireless communications.
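Scheduling and vectorization for synchronous dataflow (SDF) start from the repetition vector given by the balance equations, where for each edge the source must produce as many tokens per iteration as the destination consumes. A minimal sketch of solving them (the graph is hypothetical, and consistency checking of over-determined graphs is omitted):

```python
from fractions import Fraction
from math import lcm

def repetitions(edges, actors):
    """Solve the SDF balance equations q[src] * prod = q[dst] * cons
    for the smallest positive integer repetition vector.

    edges: list of (src, dst, tokens_produced, tokens_consumed)."""
    q = {actors[0]: Fraction(1)}      # seed one actor, propagate rates
    changed = True
    while changed:
        changed = False
        for src, dst, prod, cons in edges:
            if src in q and dst not in q:
                q[dst] = q[src] * prod / cons
                changed = True
            elif dst in q and src not in q:
                q[src] = q[dst] * cons / prod
                changed = True
    # scale fractional rates up to the smallest integer solution
    scale = lcm(*(f.denominator for f in q.values()))
    return {a: int(f * scale) for a, f in q.items()}
```

For a chain A -(2:3)-> B -(1:2)-> C, actor A must fire 3 times, B twice, and C once per iteration; vectorizing then amounts to batching these firings, trading buffer memory for throughput.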

Journal ArticleDOI
TL;DR: A new RAID5 architecture called channel-RAID5 with mirroring (CR5M) with associated data reconstruction strategy called mirroring-assisted channel-level reconstruction (MCR) is developed to further shrink the window of vulnerability.
Abstract: Simply applying an existing redundant array of independent disks (RAID) technique to enhance data reliability within a single solid-state drive for safety-critical mobile applications significantly degrades performance. In this article, we first propose a new RAID5 architecture called channel-RAID5 with mirroring (CR5M) to alleviate the performance degradation problem. Next, an associated data reconstruction strategy called mirroring-assisted channel-level reconstruction (MCR) is developed to further shrink the window of vulnerability. Experimental results demonstrate that compared with channel-RAID5 (CR5), CR5M improves performance by up to 40.2%. Compared with disk-oriented reconstruction, a traditional data reconstruction scheme, MCR on average improves data recovery speed by 75% while delivering similar performance during reconstruction.
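CR5M and MCR are architectural schemes, but the recovery primitive they build on is ordinary RAID5 XOR parity, which can be sketched as follows (the channel layout and block contents below are hypothetical):

```python
def parity(blocks):
    """XOR parity of equal-length byte blocks (one per channel)."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)

def reconstruct(surviving, parity_block):
    """Rebuild the block of one failed channel: because XOR is its own
    inverse, the lost data is the XOR of all survivors and the parity."""
    return parity(list(surviving) + [parity_block])
```

Reconstruction must read every surviving channel, which is the window of vulnerability the article shrinks: the mirror copy in CR5M lets MCR serve some of that data without touching all channels.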

Journal ArticleDOI
Hamza Omar1, Qingchuan Shi1, Masab Ahmad1, Halit Dogan1, Omer Khan1 
TL;DR: This article proposes the idea of declarative resilience, which selectively applies strong resiliency schemes to code regions that are crucial for program correctness (crucial code) and lightweight resiliency to code regions that are susceptible only to program-accuracy deviations under soft errors (non-crucial code).
Abstract: To protect multicores from soft-error perturbations, research has explored various resiliency schemes that provide high soft-error coverage. However, these schemes incur high performance and energy overheads. We observe that not all soft-error perturbations affect program correctness, and some soft errors only affect program accuracy, i.e., the program completes with certain acceptable deviations from the error-free outcome. Thus, it is practical to improve processor efficiency by trading off resiliency overheads with program accuracy. This article proposes the idea of declarative resilience that selectively applies strong resiliency schemes for code regions that are crucial for program correctness (crucial code) and lightweight resiliency for code regions that are susceptible to program accuracy deviations as a result of soft errors (non-crucial code). At the application level, crucial and non-crucial code is identified based on its impact on the program outcome. A cross-layer architecture enables efficient resilience along with holistic soft-error coverage. Only program accuracy is compromised in the worst-case scenario of a soft-error strike during non-crucial code execution. For a set of machine-learning and graph analytic benchmarks, declarative resilience reduces performance overhead over a state-of-the-art system that applies strong resiliency to all program code regions from ∼1.43× to ∼1.2×.