
Showing papers in "IEEE Embedded Systems Letters in 2021"


Journal ArticleDOI
TL;DR: This letter proposes a parallelization methodology to maximize the throughput of a single DL application using both GPU and NPU by exploiting various types of parallelism on TensorRT.
Abstract: As deep learning inference applications are increasing, an embedded device tends to equip neural processing units (NPUs) in addition to a CPU and a GPU. For fast and efficient development of deep learning applications, TensorRT is provided as the SDK for the NVIDIA hardware platform, including an optimizer and runtime that deliver low latency and high throughput for deep learning inference. Like most deep learning frameworks, TensorRT assumes that the inference is executed on a single processing element, GPU or NPU, not both. In this letter, we propose a parallelization methodology to maximize the throughput of a single deep learning application using both GPU and NPU by exploiting various types of parallelism on TensorRT. With six real-life benchmarks, we achieve 81%–391% throughput improvement over the baseline inference using GPU only.
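
A rough sketch of the idea (not the letter's implementation): interleave inference batches between two engines so the GPU and the NPU run concurrently. `gpu_engine`, `npu_engine`, and their `infer` method are hypothetical stand-ins for per-device inference contexts such as those TensorRT builds for GPU and DLA targets.

```python
import threading, queue

def run_worker(engine, in_q, out_q):
    while True:
        item = in_q.get()
        if item is None:                 # poison pill: stop the worker
            break
        idx, batch = item
        out_q.put((idx, engine.infer(batch)))

def dual_device_throughput(gpu_engine, npu_engine, batches):
    """Data parallelism across devices: round-robin batches between the
    GPU and NPU queues so both process independent batches in parallel."""
    qs, out_q = [queue.Queue(), queue.Queue()], queue.Queue()
    workers = [threading.Thread(target=run_worker, args=(e, q, out_q))
               for e, q in zip((gpu_engine, npu_engine), qs)]
    for w in workers:
        w.start()
    for i, b in enumerate(batches):
        qs[i % 2].put((i, b))
    for q in qs:
        q.put(None)
    for w in workers:
        w.join()
    return sorted(out_q.get() for _ in batches)

class _Stub:                             # stand-in engine for demonstration
    def __init__(self, name): self.name = name
    def infer(self, batch): return (self.name, batch)

print(dual_device_throughput(_Stub("gpu"), _Stub("npu"), list(range(6))))
```

The letter additionally exploits pipeline parallelism within a single network; this sketch shows only the simplest batch-level split.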

33 citations


Journal ArticleDOI
TL;DR: In this paper, the authors propose a spatial decomposition technique that decomposes a neuron function with many presynaptic connections into a sequence of homogeneous neural units, where each neural unit is a function computation node with two presynaptic connections.
Abstract: With growing model complexity, mapping spiking neural network (SNN)-based applications to tile-based neuromorphic hardware is becoming increasingly challenging. This is because the synaptic storage resources on a tile, viz., a crossbar, can accommodate only a fixed number of presynaptic connections per postsynaptic neuron. For complex SNN models that have many presynaptic connections per neuron, some connections may need to be pruned after training to fit onto the tile resources, leading to a loss in model quality, e.g., accuracy. In this letter, we propose a novel unrolling technique that decomposes a neuron function with many presynaptic connections into a sequence of homogeneous neural units, where each neural unit is a function computation node with two presynaptic connections. This spatial decomposition technique significantly improves crossbar utilization and retains all presynaptic connections, so no model quality is lost to connection pruning. We integrate the proposed technique within an existing SNN mapping framework and evaluate it using machine learning applications on DYNAP-SE, a state-of-the-art neuromorphic hardware platform. Our results demonstrate an average 60% lower crossbar requirement, $9\times$ higher synapse utilization, 62% lower wasted energy on the hardware, and between 0.8% and 4.6% increase in model quality.
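
A toy sketch of the unrolling, assuming inputs are paired level by level into a balanced tree (the letter's exact unrolling order may differ; a pure chain works the same way):

```python
def decompose(presynaptic_inputs):
    """Unroll an N-input neuron into homogeneous 2-input units.
    Returns (left, right, out) triples; intermediate results get
    fresh names t0, t1, ... so every unit fits a 2-input crossbar slot."""
    nodes, frontier, t = [], list(presynaptic_inputs), 0
    while len(frontier) > 1:
        nxt = []
        for i in range(0, len(frontier) - 1, 2):
            out = f"t{t}"; t += 1
            nodes.append((frontier[i], frontier[i + 1], out))
            nxt.append(out)
        if len(frontier) % 2:            # odd leftover propagates upward
            nxt.append(frontier[-1])
        frontier = nxt
    return nodes

# A 5-input neuron becomes four homogeneous 2-input units:
print(decompose(["x0", "x1", "x2", "x3", "x4"]))
```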

26 citations


Journal ArticleDOI
TL;DR: This letter characterizes the reconfiguration cost of dynamic partial reconfiguration (DPR) in terms of time, defines a “DPR Profitability” concept targeting real-time systems, and validates the approach on a real DPR-compliant platform, showing it is general enough to be applied to modern DPR-compliant platforms.
Abstract: Modern field-programmable gate arrays offer dynamic partial reconfiguration (DPR) capabilities, a characteristic that opens new scheduling opportunities for real-time applications running on heterogeneous platforms. To evaluate when it is really useful to exploit a DPR, in this letter, we present the characterization of its reconfiguration cost in terms of time and a definition of the “DPR Profitability” concept targeting real-time systems. To obtain such results, the components involved in a DPR process have been identified and an innovative approach to calculate the DPR time and its worst-case bound is provided. We validate our approach on a real DPR-compliant platform, showing that our proposal is general enough to be applied to modern DPR-compliant platforms.
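
The profitability test can be pictured with back-of-the-envelope arithmetic: reconfiguration pays off only if its time cost is recovered by the accelerated execution. A minimal sketch with a simplified time model and made-up numbers (the letter derives a much more careful worst-case bound):

```python
def dpr_time(bitstream_bytes, config_port_Bps, overhead_s=0.0):
    """Naive reconfiguration-time model: partial-bitstream transfer
    through the configuration port plus a fixed management overhead."""
    return bitstream_bytes / config_port_Bps + overhead_s

def is_profitable(sw_exec_s, hw_exec_s, reconf_s):
    """DPR is profitable when reconfiguring and then running in hardware
    still beats simply executing the task without reconfiguration."""
    return reconf_s + hw_exec_s < sw_exec_s

# Example: a 4 MiB partial bitstream over a 400 MB/s configuration port.
t_reconf = dpr_time(4 * 2**20, 400e6)
print(f"reconfig {t_reconf * 1e3:.1f} ms,",
      "profitable" if is_profitable(0.050, 0.010, t_reconf) else "not profitable")
```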

16 citations


Journal ArticleDOI
TL;DR: This letter proposes an unsigned approximate multiplier architecture segmented into three portions: the least significant portion that contributes least to the partial product (PP) is replaced with a new constant compensation term to improve hardware savings without sacrificing accuracy.
Abstract: This letter proposes an unsigned approximate multiplier architecture segmented into three portions: the least significant portion, which contributes least to the partial product, is replaced with a new constant compensation term to improve hardware savings without sacrificing accuracy. The partial products in the middle portion are simplified using a new 4:2 approximate compressor, and the error due to approximation is compensated using a simple yet efficient error correction module. The most significant portion of the multiplier is implemented using exact logic, as approximating it would result in a large error. Experimental results for an 8-bit multiplier show that power and power-delay product are reduced by up to 47.7% and 55.2%, respectively, in comparison with the exact design, and by 36.9% and 39.5%, respectively, in comparison with existing designs, without significant compromise on accuracy.
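
A behavioral model of the least-significant segment makes the idea concrete: partial-product columns below a cut-off are dropped and replaced by one constant. The cut-off `k` and the compensation constant below are illustrative; the letter's middle-segment 4:2 compressor and error-correction module are omitted.

```python
def approx_mul8(a, b, k=4, comp=None):
    """Toy model: sum only partial products of column weight >= k and
    add a constant compensating for the dropped low columns on average."""
    if comp is None:
        comp = 1 << (k - 1)              # rough midpoint of the dropped range
    pp_sum = 0
    for i in range(8):
        for j in range(8):
            if i + j >= k and (a >> i) & 1 and (b >> j) & 1:
                pp_sum += 1 << (i + j)
    return pp_sum + comp

errs = [abs(approx_mul8(a, b) - a * b) for a in range(256) for b in range(256)]
print("mean abs error over all 8-bit inputs:", sum(errs) / len(errs))
```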

16 citations


Journal ArticleDOI
TL;DR: A multivariate time-series classification system that fuses multirate sensor measurements within the latent space of a deep neural network and investigates the feasibility of categorizing ten different everyday surfaces using a proposed convolutional neural network, which is trained in an end-to-end manner.
Abstract: In this letter, we propose a multivariate time-series classification system that fuses multirate sensor measurements within the latent space of a deep neural network. The system identifies the surface category based on audio and inertial measurements generated from the surface impact, each of which naturally has a different sampling rate and resolution. We investigate the feasibility of categorizing ten different everyday surfaces using the proposed convolutional neural network, which is trained in an end-to-end manner. To validate our approach, we developed an embedded system and collected 60 000 data samples under a variety of conditions. The experimental results show 93% accuracy on a blind test dataset, with end-to-end classification taking less than 300 ms in an embedded machine environment. We conclude this letter with a discussion of the results and future directions of research.
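
A minimal PyTorch sketch of latent-space fusion for two sampling rates: each modality gets its own convolutional branch, pooled to equal-length latent vectors that are concatenated before classification. All rates, channel counts, and layer sizes are illustrative assumptions, not the authors' architecture.

```python
import torch
import torch.nn as nn

class LatentFusionNet(nn.Module):
    def __init__(self, n_classes=10):
        super().__init__()
        self.audio = nn.Sequential(       # high-rate 1-channel audio branch
            nn.Conv1d(1, 16, 64, stride=16), nn.ReLU(),
            nn.AdaptiveAvgPool1d(32), nn.Flatten())
        self.imu = nn.Sequential(         # low-rate 6-axis inertial branch
            nn.Conv1d(6, 16, 8, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(32), nn.Flatten())
        self.head = nn.Linear(2 * 16 * 32, n_classes)

    def forward(self, audio, imu):
        z = torch.cat([self.audio(audio), self.imu(imu)], dim=1)  # fuse latents
        return self.head(z)

net = LatentFusionNet()
logits = net(torch.randn(4, 1, 4800), torch.randn(4, 6, 200))
print(logits.shape)  # torch.Size([4, 10])
```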

15 citations


Journal ArticleDOI
TL;DR: An enhanced DVFS method based on reinforcement learning to reduce the power consumption of sporadic tasks at runtime in multicore embedded systems without task-reliability degradation and deadline misses is proposed.
Abstract: Dynamic voltage and frequency scaling (DVFS) is one of the most popular and exploited techniques to reduce power consumption in multicore embedded systems. However, this technique might lead to task-reliability degradation because scaling the voltage and frequency increases the fault rate and the worst-case execution time of the tasks. In order to preserve task reliability at an acceptable level while achieving power savings, in this letter we propose an enhanced DVFS method based on reinforcement learning to reduce the power consumption of sporadic tasks at runtime in multicore embedded systems without task-reliability degradation. The reinforcement learner makes decisions based on the power savings and task-reliability variations due to DVFS and selects a suitable voltage-frequency level for all tasks such that the timing constraints are met. Experimental evaluation was performed on different configurations and with different numbers of tasks to investigate the efficiency of the proposed method. Our experiments show that the proposed method reduces power consumption more efficiently than existing approaches, without reliability degradation or deadline misses.
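
A toy flavor of the approach, reduced to a stateless bandit with made-up power, WCET, and reliability models (the letter's learner, state space, and reward are more elaborate):

```python
import random

LEVELS = [(0.8, 0.6), (0.9, 0.8), (1.0, 1.0)]   # normalized (V, f) pairs

def reward(level, slack, fault_rate):
    """Illustrative reward: dynamic-power saving (~ V^2 * f) minus a
    penalty when the lower frequency stretches WCET past the slack or
    inflates the fault rate.  All constants are placeholders."""
    v, f = LEVELS[level]
    power_saving = 1.0 - v * v * f
    penalty = 10.0 if 1.0 / f > 1.0 + slack else fault_rate / f
    return power_saving - penalty

Q = [[0.0] * len(LEVELS) for _ in range(8)]      # 8 coarse slack states
alpha, eps = 0.1, 0.1
for _ in range(5000):
    s = random.randrange(8)
    a = (random.randrange(len(LEVELS)) if random.random() < eps
         else max(range(len(LEVELS)), key=lambda i: Q[s][i]))
    Q[s][a] += alpha * (reward(a, slack=s / 8.0, fault_rate=0.05) - Q[s][a])
print([max(range(len(LEVELS)), key=lambda i: Q[s][i]) for s in range(8)])
```

Tasks with more slack learn to tolerate lower voltage-frequency levels, while tight-deadline states stay at the highest level.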

14 citations


Journal ArticleDOI
TL;DR: An efficient CNN training architecture is designed using a systolic array whose processing elements support the batch normalization (BN) functions in both the training and inference processes, implementing an improved, hardware-friendly BN algorithm: range batch normalization (RBN).
Abstract: In recent years, convolutional neural networks (CNNs) have been widely used. However, their ever-increasing number of parameters makes it challenging to train them on GPUs, which is time and energy expensive. This has prompted researchers to turn their attention to training on more energy-efficient hardware. The batch normalization (BN) layer is widely used in various state-of-the-art CNNs and is an indispensable layer for accelerating CNN training. As the amount of computation in the convolutional layers declines, the relative importance of the BN layer continues to increase. However, traditional CNN training accelerators pay little attention to the efficient hardware implementation of the BN layer. In this letter, we design an efficient CNN training architecture using a systolic array whose processing elements support the BN functions in both the training and inference processes. The BN function implemented is an improved, hardware-friendly BN algorithm: range batch normalization (RBN). The experimental results show that the implementation of RBN saves 10% of hardware resources and reduces power by 10.1% and delay by 4.6% on average. We implement the accelerator on the field-programmable gate array VU440, and the power consumption of its core computing engine is 8.9 W.
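
The hardware appeal of range BN is that it swaps variance computation (squaring and accumulation) for a min/max scan. A NumPy sketch of the normalization, assuming the Gaussian-statistics scale constant — the expected range of $n$ samples is about $2\sqrt{2\ln n}$ standard deviations — noting that the letter's RBN variant may scale slightly differently:

```python
import numpy as np

def range_batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Estimate sigma from the min-max range instead of the variance:
    comparators replace multipliers in hardware."""
    n = x.shape[0]
    mu = x.mean(axis=0)
    rng = x.max(axis=0) - x.min(axis=0)
    sigma_hat = rng / (2.0 * np.sqrt(2.0 * np.log(n)))
    return gamma * (x - mu) / (sigma_hat + eps) + beta

x = np.random.randn(256, 8) * 3.0 + 1.0
y = range_batch_norm(x)
print(y.mean(axis=0).round(2), y.std(axis=0).round(2))  # ~0 mean, ~1 std
```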

14 citations


Journal ArticleDOI
TL;DR: A design space exploration (DSE) strategy is formulated to explore trade-offs in accuracy, runtime, cost, and energy consumption arising due to flexibility in choosing DNN topology, DPU configuration, and FPGA model.
Abstract: Many emerging systems concurrently execute multiple applications that use deep neural networks (DNNs) as a key portion of the computation. To speed up the execution of such DNNs, various hardware accelerators have been proposed in recent works. The deep learning processor unit (DPU) from Xilinx is one such accelerator, targeted at field-programmable gate array (FPGA)-based systems. We study the runtime and energy consumption of different DNNs on a range of DPU configurations and derive useful insights. Using these insights, we formulate a design space exploration (DSE) strategy to explore tradeoffs in accuracy, runtime, cost, and energy consumption arising from the flexibility in choosing the DNN topology, DPU configuration, and FPGA model. The proposed strategy provides a reduction of $28\times$ in the number of design points to be simulated and $23\times$ in the pruning time.
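
The essence of such a DSE is pruning dominated design points before detailed simulation. A generic Pareto-filter sketch over a (DNN, DPU configuration, FPGA) grid; the names and stand-in evaluator are illustrative, not the paper's models:

```python
from itertools import product

def pareto_front(points, evaluate):
    """Keep only non-dominated points (accuracy up; runtime, energy, and
    cost down); dominated points need never be simulated in detail."""
    scored = [(p, evaluate(*p)) for p in points]
    def dominates(a, b):
        geq = (a[0] >= b[0], a[1] <= b[1], a[2] <= b[2], a[3] <= b[3])
        return all(geq) and a != b
    return [p for p, m in scored
            if not any(dominates(m2, m) for _, m2 in scored)]

space = list(product(["resnet18", "mobilenetv2"],       # DNN topology
                     ["B512", "B1024", "B4096"],        # DPU configuration
                     ["zcu102", "zcu104"]))             # FPGA model

# Stand-in evaluator returning (accuracy, runtime, energy, cost):
fake = lambda d, u, f: (hash((d, u)) % 100 / 100, hash((u, f)) % 50,
                        hash((f, d)) % 30, hash((d, u, f)) % 20)
print(len(space), "->", len(pareto_front(space, fake)), "design points")
```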

13 citations


Journal ArticleDOI
TL;DR: CNN outperforms the accuracy obtained by the threshold-based algorithm by more than 7%.
Abstract: Driver drowsiness is one of the major causes of accidents and fatal road crashes, causing a high human and economic cost. Recently, automatic drowsiness detection has begun to be recognized as a promising solution, receiving growing attention from industry and academics. In this letter, we propose to embed a convolutional neural network (CNN)-based solution in smart connected glasses to detect eye blinks and use them to estimate the driver’s drowsiness level. This innovative solution is compared with a more traditional method, based on a detection threshold mechanism. The performance, battery lifetime, and memory footprint of both solutions are assessed for embedded implementation in the connected glasses. The results demonstrate that CNN outperforms the accuracy obtained by the threshold-based algorithm by more than 7%. Moreover, increased overheads in terms of memory and battery lifetime are acceptable, thus making CNN a viable solution for drowsiness detection in wearable devices.

12 citations


Journal ArticleDOI
TL;DR: In this paper, the authors propose a compression-aware and high-accuracy deep learning framework called CHISEL that outperforms the best-known works in the area while maintaining localization robustness on embedded devices.
Abstract: GPS technology has revolutionized the way we localize and navigate outdoors. However, the poor reception of GPS signals in buildings makes it unsuitable for indoor localization. WiFi fingerprinting-based indoor localization is one of the most promising ways to meet this demand. Unfortunately, most work in the domain fails to resolve challenges associated with deployability on resource-limited embedded devices. In this work, we propose a compression-aware and high-accuracy deep learning framework called CHISEL that outperforms the best-known works in the area while maintaining localization robustness on embedded devices.

12 citations


Journal ArticleDOI
TL;DR: A nondestructive Trojan detection technique based on thermal maps and inception neural networks (INNs), achieving over 98.2% detection accuracy after training the INNs with 150 000 thermal maps.
Abstract: Hardware Trojan detection on modern integrated circuits (ICs) is a challenging task since the inspector may have no idea about the location and size of the embedded Trojan circuit. To achieve accurate Trojan detection without relying on hardware reverse engineering, a nondestructive technique based on thermal maps and inception neural networks (INNs) is proposed in this letter. The thermal maps generated by a Trojan-free (TF) IC chip and multiple emulated Trojan-infected (TI) IC chips are first collected and optimized as the critical side-channel leakage. Then, INNs are utilized to analyze these optimized thermal maps and extract the information of the embedded Trojans with the assistance of customized filters. As the results show, after training the INNs with 150 000 thermal maps, a Trojan detection accuracy above 98.2% can be achieved.

Journal ArticleDOI
TL;DR: An embedded UAC system with the STM32H743 processor as the core and a peripheral sending/receiving circuit as the signal conditioning module is proposed, together with a fast and robust frame synchronization algorithm based on the segmented fast Fourier transform.
Abstract: The underwater acoustic communication (UAC) modem is an important infrastructure for underwater network construction. In recent years, with the performance improvement of STM32 processors, realizing reliable UAC on a high-performance STM32 processor helps reduce system power consumption, hardware cost, and development difficulty. In this letter, we propose an embedded UAC system with the STM32H743 processor as the core and a peripheral sending/receiving circuit as the signal conditioning module. The system supports a variety of modulation and demodulation methods, including single/multicarrier frequency-shift keying and orthogonal frequency division multiplexing. Furthermore, to reduce the computational cost of the system, a fast and robust frame synchronization algorithm based on the segmented fast Fourier transform is applied. Sea trials show that the system can realize reliable UAC transmission at 100 b/s–1 kb/s over distances of 5–8 km in shallow water.
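
One plausible reading of the segmented-FFT synchronizer, sketched below: rather than one long correlation, each candidate window is split into short segments whose FFT magnitudes at the sync-tone bin are summed noncoherently, keeping per-step cost low on an MCU. All parameters and the tone-preamble assumption are illustrative, not the letter's exact algorithm.

```python
import numpy as np

def segmented_fft_sync(rx, fs, f_sync, seg_len=256, n_seg=8, threshold=6.0):
    """Slide over rx; in each window, FFT n_seg short segments and sum
    the magnitude at the expected sync-tone bin."""
    bin_idx = int(round(f_sync * seg_len / fs))
    win, best_pos, best_score, noise = seg_len * n_seg, None, 0.0, 1e-12
    for pos in range(0, len(rx) - win, seg_len):
        segs = rx[pos:pos + win].reshape(n_seg, seg_len)
        spec = np.abs(np.fft.rfft(segs, axis=1))
        score = spec[:, bin_idx].sum()
        noise = max(noise, np.median(spec) * n_seg)   # crude noise floor
        if score > best_score:
            best_pos, best_score = pos, score
    return (best_pos, best_score) if best_score > threshold * noise else (None, 0.0)

fs, f0 = 48000, 6000
tone = np.sin(2 * np.pi * f0 * np.arange(2048) / fs)
rx = np.concatenate([np.random.randn(3000) * 0.1,
                     tone + np.random.randn(2048) * 0.1,
                     np.random.randn(3000) * 0.1])
print(segmented_fft_sync(rx, fs, f0))   # finds the preamble near sample 3000
```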

Journal ArticleDOI
TL;DR: In this paper, the authors propose an analytical modeling technique for priority-aware NoCs under bursty traffic, which has less than 10% modeling error with respect to a cycle-accurate NoC simulator.
Abstract: Networks-on-Chip (NoCs) used in commercial many-core processors typically incorporate priority arbitration. Moreover, they experience bursty traffic due to application workloads. However, most state-of-the-art NoC analytical performance analysis techniques assume fair arbitration and simple traffic models. To address these limitations, we propose an analytical modeling technique for priority-aware NoCs under bursty traffic. Experimental evaluations with synthetic and bursty traffic show that the proposed approach has less than 10% modeling error with respect to a cycle-accurate NoC simulator.
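
For orientation, the classical nonpreemptive-priority M/G/1 waiting-time formula is the kind of baseline such analyses start from; it assumes Poisson arrivals, which is exactly the limitation the letter's bursty-traffic model lifts. A sketch:

```python
def priority_wait(lam, es, es2):
    """W_k = R / ((1 - sigma_{k-1}) * (1 - sigma_k)), class 0 highest,
    with residual work R = sum_i lam_i * E[S_i^2] / 2 and cumulative
    utilization sigma_k = sum_{i<=k} lam_i * E[S_i]."""
    R = sum(l * m2 / 2 for l, m2 in zip(lam, es2))
    waits, sigma = [], 0.0
    for l, m in zip(lam, es):
        waits.append(R / ((1 - sigma) * (1 - sigma - l * m)))
        sigma += l * m
    return waits

# Two traffic classes sharing a router port (illustrative numbers,
# deterministic one-cycle service so E[S^2] = 1):
print(priority_wait(lam=[0.3, 0.4], es=[1.0, 1.0], es2=[1.0, 1.0]))
```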

Journal ArticleDOI
TL;DR: In this paper, the authors describe a parallel synchronous software model that executes as $N$ parallel threads on a processor with word-length $N$, where each thread is a single-bit synchronous machine with precise, contention-free timing while still executing as an independent machine.
Abstract: In typical embedded applications, the precise execution time of the program does not matter and it is sufficient to meet a real-time deadline. However, modern applications in information security have become much more time-sensitive due to the risk of timing side-channel leakage. The timing of such programs needs to be data-independent and precise. We describe a parallel synchronous software model, which executes as $N$ parallel threads on a processor with word-length $N$. Each thread is a single-bit synchronous machine with precise, contention-free timing, while each of the $N$ threads still executes as an independent machine. The resulting software supports fine-grained parallel execution. In contrast to earlier work to obtain precise and repeatable timing in software, our solution does not require modifications to the processor architecture nor specialized instruction scheduling techniques. In addition, all threads run in parallel and without contention, which eliminates the problem of thread scheduling. We use hardware (HDL) semantics to describe a thread as a single-bit synchronous machine. Using logic synthesis and code generation, we derive a parallel synchronous implementation of this design. We illustrate the synchronous parallel programming model with practical examples from cryptography and other applications with precise timing requirements.
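
The execution model is essentially bitslicing: one machine word holds the same state bit of $N$ independent single-bit machines, so every bitwise operation advances all $N$ threads in lockstep with data-independent timing. A toy sketch with $N = 64$ two-bit counters, where the gate equations come straight from the counter's next-state logic:

```python
N = 64
MASK = (1 << N) - 1            # one lane per bit position

def step(b1, b0):
    """One synchronous update of all lanes: (b1, b0) <- (b1, b0) + 1.
    The same two gate-level equations execute for every lane at once."""
    new_b0 = ~b0 & MASK        # bit 0 toggles every cycle
    new_b1 = (b1 ^ b0) & MASK  # bit 1 toggles when bit 0 was 1
    return new_b1, new_b0

b1 = b0 = 0                    # all 64 counters start at 0
for _ in range(3):
    b1, b0 = step(b1, b0)
assert b1 == MASK and b0 == MASK   # every lane now holds 3 (binary 11)
print("all", N, "lanes advanced in lockstep")
```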

Journal ArticleDOI
TL;DR: This letter presents an area-optimized, low-latency, and energy-efficient architecture for an accurate signed multiplier, which can be used for FPGA-based implementations of applications utilizing signed numbers.
Abstract: Multiplication is one of the most extensively used arithmetic operations in a wide range of applications, such as multimedia processing and artificial neural networks. For such applications, the multiplier is one of the major contributors to energy consumption, critical path delay, and resource utilization. These effects are more pronounced in field-programmable gate array (FPGA)-based designs. However, most state-of-the-art designs target ASIC-based systems, and the few FPGA-based designs that exist are largely limited to unsigned numbers, which require extra circuitry to support signed operations. To overcome these limitations for FPGA-based implementations of applications utilizing signed numbers, this letter presents an area-optimized, low-latency, and energy-efficient architecture for an accurate signed multiplier. Compared to the Vivado area-optimized multiplier IP, our implementations offer up to 40.0%, 43.0%, and 70.0% reductions in area, latency, and energy, respectively. The RTL implementations of our designs will be released as an open-source library at https://cfaed.tu-dresden.de/pd-downloads.

Journal ArticleDOI
TL;DR: It is shown that systematically converting native instructions from Android apps into images using Hilbert space-filling curves and entropy visualization techniques enables CNNs to reliably detect malicious apps with near-ideal accuracy.
Abstract: Traditional research on mobile malware detection has focused on approaches that rely on analyzing bytecode to uncover malicious apps. Unfortunately, cybercriminals can bypass such methods by embedding malware directly in native machine code, making traditional methods inadequate. Another challenge that detection solutions face is scalability. The sheer number of malware samples released every year makes it difficult for solutions to efficiently scale their coverage. This letter presents an energy-efficient solution that uses convolutional neural networks (CNNs) to defend against malware. We show that systematically converting native instructions from Android apps into images using Hilbert space-filling curves and entropy visualization techniques enables CNNs to reliably detect malicious apps with near-ideal accuracy. We characterize popular CNN architectures that are known to perform well on different computer vision tasks and evaluate their effectiveness against malware using an Android malware dataset.
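
The locality property that makes this work: bytes adjacent in the binary stay adjacent in the image, so 2-D convolutional filters see meaningful neighborhoods. A sketch using the standard Hilbert distance-to-coordinate conversion (the entropy-visualization coloring step is omitted):

```python
def d2xy(n, d):
    """Map distance d along a Hilbert curve to (x, y) in an n-by-n grid,
    n a power of two (standard iterative conversion)."""
    x = y = 0
    s, t = 1, d
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:                      # rotate the quadrant
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x, y = x + s * rx, y + s * ry
        t //= 4
        s *= 2
    return x, y

def bytes_to_image(code, n=16):
    """Lay instruction bytes along the curve; pad/truncate to n*n."""
    img = [[0] * n for _ in range(n)]
    for d in range(n * n):
        x, y = d2xy(n, d)
        img[y][x] = code[d] if d < len(code) else 0
    return img

img = bytes_to_image(bytes(range(200)), n=16)
print(img[0][:8])   # first row of the resulting "texture"
```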

Journal ArticleDOI
TL;DR: A novel restructuring of the 2bit-SC (2b-SC) precomputation decoder architecture is carried out to reduce the latency by 20% while reducing the hardware complexity.
Abstract: Polar codes are among the recently developed error-correcting codes, and they are popular due to their capacity-achieving nature. The architecture of the successive cancellation (SC) decoding algorithm is composed of a recursive processing element (PE). The PE comprises various blocks, including a signed adder, subtractor, comparator, multiplexers, and a few logic gates, so the latency of the PE is a primary concern. Hence, a high-speed architecture for implementing the SC decoding algorithm for polar codes is proposed. In the proposed work, a novel restructuring of the 2bit-SC (2b-SC) precomputation decoder architecture is carried out to reduce the latency by 20% while reducing the hardware complexity. Compared to the 2b-SC precomputation decoder, the proposed architecture also has 19% higher throughput for (1024, 512) polar codes with a 45% reduction in gate count.
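
For context, the PE evaluates the two standard SC update functions on log-likelihood ratios; precomputation schemes like 2b-SC compute both outcomes of the `g` function ahead of the bit decision so only a selection remains on the critical path. A sketch in min-sum form:

```python
def f_node(l1, l2):
    """SC 'f' function, min-sum form: sign(l1)*sign(l2)*min(|l1|, |l2|).
    In hardware this is the comparator/mux path of the PE."""
    same_sign = (l1 >= 0) == (l2 >= 0)
    return (1 if same_sign else -1) * min(abs(l1), abs(l2))

def g_node(l1, l2, b):
    """SC 'g' function: add or subtract depending on the already-decoded
    partial-sum bit b."""
    return l2 + l1 if b == 0 else l2 - l1

# Precomputation: evaluate both outcomes before b is known, then select.
g_both = lambda l1, l2: (l2 + l1, l2 - l1)

print(f_node(-2.5, 1.0), g_node(-2.5, 1.0, 0), g_both(-2.5, 1.0))
```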

Journal ArticleDOI
TL;DR: Monolithic 3-D (M3D) integration is adopted for the GPU scratchpad memory, yielding an M3D SPM that enhances matrix multiplication and improves the system performance by 46.3% for $32\times 32$ matrix multiplication.
Abstract: Convolutional neural networks (CNNs) are one of the most popular machine learning algorithms. The convolutional layers, which account for most of the execution time of CNNs, are implemented with matrix multiplication because the convolution operation performs dot products between filters and local regions of the input. On the other hand, GPUs with thousands of cores have proven to significantly accelerate matrix multiplication compared to CPUs with a limited number of cores, especially for large matrices. However, the current memory architecture allows only one row access at a time, so multiple accesses are necessary to read the column data of the second matrix, slowing down matrix multiplication. In this study, we adopt monolithic 3-D (M3D) integration for the GPU scratchpad memory (SPM) to enhance matrix multiplication. The M3D SPM allows one access to read the column data of the second matrix, similar to the case of the first matrix. The simulation results show that our M3D SPM improves the system performance by 46.3% for $32\times 32$ matrix multiplication over a conventional 2-D SPM, where the column data of the second matrix are read sequentially.

Journal ArticleDOI
TL;DR: It is found that the proposed language enhancements potentially bring significant benefits to programming in C++ for embedded computers, but that the implementation imposes constraints that may prevent its widespread acceptance among the embedded development community.
Abstract: Coroutines will be added to C++ as part of the C++20 standard. Coroutines provide native language support for asynchronous operations. This letter evaluates the C++ coroutine specification from the perspective of embedded systems developers. We find that the proposed language features are generally beneficial, but that memory management of the coroutine state needs to be improved. Our experiments on an ARM Cortex-M4 microcontroller evaluate the time and memory costs of coroutines in comparison with alternatives, and we show that context switching with coroutines is significantly faster than with thread-based real-time operating systems. Furthermore, we analyzed the impact of these language features on prototypical Internet of Things sensor software. We find that the proposed language enhancements potentially bring significant benefits to programming in C++ for embedded computers, but that the implementation imposes constraints that may prevent its widespread acceptance among the embedded development community.

Journal ArticleDOI
Deyu Lin, Weidong Min, Jianfeng Xu, Jiaxun Yang, Jianlin Zhang
TL;DR: This letter presents a novel routing method that applies the theory of energy welfare and compressive sensing, from the perspective of social welfare, to improve the energy efficiency among different clusters during intercluster routing decision making.
Abstract: This letter presents a novel routing method to improve the energy efficiency among different clusters during intercluster routing decision making. To this end, the theory of energy welfare is applied to promote energy equality. In addition, compressive sensing (CS) theory is utilized in intracluster data acquisition to further reduce data redundancy. Subsequently, an energy-efficient routing scheme based on CS, designed from the perspective of social welfare, is proposed. Finally, extensive experiments were conducted, and the numerical results verify its effectiveness in improving the energy efficiency and prolonging the network lifetime of wireless sensor networks.

Journal ArticleDOI
TL;DR: An SoC-based platform conceived for scientific experimentation, with a fully modular and configurable design, which achieves a configurable UWB-capable sampling rate through an equivalent-time sampling scheme.
Abstract: Research and development of algorithms for processing impulse radio ultrawideband signals is a trending issue within remote sensing applications, personal area networks, and RF imaging among other areas. We have designed an SoC-based platform, conceived for scientific experimentation, with a fully modular and configurable design. Built with off-the-shelf components, our design achieves a configurable UWB-capable sampling rate through an equivalent-time sampling scheme. In this letter, we introduce the system architecture, its main interfaces, and the rationale behind each module implementation.

Journal ArticleDOI
Liu Yang, Qi Wang, Li Qianhui, Xiaolei Yu, Jing He, Zongliang Huo
TL;DR: In this article, a time-saving channel parameter estimation method for TLC NAND flash memory is proposed, which reduces estimation time by three improvements: (1) reducing fitted parameters in one iteration step, (2) using pre-derived values as initial guess values to decrease iteration steps, and (3) utilizing parallelism between data sensing operations and computation.
Abstract: As the storage density of NAND flash increases, reliability is significantly degraded, making NAND flash memory more sensitive to noise. Among all noise sources, retention noise is a major one. Error correction based on channel parameter estimation is an essential method for dealing with retention noise. In this letter, a time-saving channel parameter estimation method for TLC NAND flash memory is proposed. The proposed method reduces estimation time through three improvements: 1) reducing the fitted parameters in one iteration step; 2) using pre-derived values as initial guesses to decrease the number of iteration steps; and 3) exploiting parallelism between data sensing operations and computation. Compared with previous work, the proposed method estimates parameters with higher accuracy and lower time overhead, as verified by experimental results.

Journal ArticleDOI
TL;DR: This letter explores the low-power heterogeneous architecture of the Nvidia Jetson TX2 by proposing a parallel solution to the CCSDS-123 compressor on embedded systems, reducing development effort compared with the production of dedicated circuits, while maintaining low energy consumption.
Abstract: The consultative committee for space data systems (CCSDS)-123 is a standard for lossless compression of multispectral and hyperspectral images with applications in on-board power-constrained systems, such as satellites and military drones. This letter explores the low-power heterogeneous architecture of the Nvidia Jetson TX2 by proposing a parallel solution to the CCSDS-123 compressor on embedded systems, reducing development effort compared with the production of dedicated circuits while maintaining low energy consumption. This solution parallelizes the predictor on a low-power graphics processing unit (GPU) while the encoders exploit the heterogeneous multiple cores of the CPUs and GPU concurrently. We report more than 16.6 Gb/s for the predictor and 1.4 Gb/s for the whole system, requiring less than 6.3 W and providing an efficiency of 245.6 Mb/s/W.

Journal ArticleDOI
TL;DR: BALDER is a learning framework capable of automatically choosing optimal execution configurations according to the parallel application at hand, aiming to maximize the tradeoff between aging and performance.
Abstract: Computation has been pushed to the edge to decrease latency and alleviate the computational burden of IoT applications in the cloud. However, the increasing processing demands of edge applications necessitate platforms that exploit thread-level parallelism (TLP). Yet, power and heat dissipation rise as TLP inadvertently increases or when parallelism is not cleverly exploited, which may result from nonideal use of a given parallel programming interface (PPI). Besides common issues, such as the need for more robust power sources and better cooling, heat also adversely affects aging, accelerating phenomena such as negative bias temperature instability (NBTI) and hot-carrier injection (HCI), which further reduce processor lifetime. Hence, considering that increasing the lifespan of an edge device is key, so that the number of times the application set may execute until its end-of-life is maximized, we propose BALDER, a learning framework capable of automatically choosing optimal execution configurations (PPI and number of threads) according to the parallel application at hand, aiming to maximize the tradeoff between aging and performance. When executing ten well-known applications on two multicore embedded architectures, we show that BALDER can find a nearly optimal configuration for all our experiments.

Journal ArticleDOI
TL;DR: This letter presents SecPump, a new open-source wireless infusion pump platform dedicated to security researchers, which intends to provide a framework for security evaluation, tailored for countermeasure development against security flaws related to medical devices.
Abstract: This letter presents SecPump, a new open-source wireless infusion pump platform dedicated to security researchers. The novelty of the platform is that it is “plug and play.” Indeed, SecPump simulates a functional infusion pump system on a single board without requiring additional hardware or mechanical components. The presented cyber-physical platform intends to provide a framework for security evaluation, tailored for countermeasure development against security flaws related to medical devices. This letter presents the functionality of the cyber-physical device, its wireless features, and its portability across several hardware architectures. Finally, both hardware and software attacks are showcased on the platform.

Journal ArticleDOI
TL;DR: A new design for a 5G NR low-density parity check code decoder running on a GPU is presented, which improves on the layered algorithm by increasing parallelism on a single code word.
Abstract: The graphical processing unit (GPU), as a digital signal processing accelerator for cloud RAN, is investigated. This letter presents a new design for a 5G NR low-density parity check code decoder running on a GPU. The algorithm adapts flexibly to the GPU architecture to achieve high resource utilization as well as low latency. It improves on the layered algorithm by increasing parallelism on a single code word. The flexible GPU decoder (on a 24-core GPU) was found to have $5\times$ higher throughput than a recent GPU flooding decoder and $3\times$ higher throughput than a field-programmable gate array (FPGA) decoder (757K gates). The flexible GPU decoder exhibits one-third of the FPGA's decoding power efficiency, which is typical of general-purpose processors. For rapid deployment and flexibility, GPUs may be suitable as cloud RAN accelerators.

Journal ArticleDOI
TL;DR: In this paper, the authors propose a mapping method that allows the memory regions of subsequent activation layers to overlap and thus utilize the memory more efficiently, decreasing the activation memory by up to 32.9% compared to traditional ping-pong buffering.
Abstract: While the accuracy of convolutional neural networks (CNNs) has achieved vast improvements through larger and deeper network architectures, the memory footprint for storing their parameters and activations has increased as well. This trend especially challenges power- and resource-limited accelerator designs, which are often restricted to storing all network data in on-chip memory to avoid interfacing energy-hungry external memories. Maximizing the network size that fits on a given accelerator thus requires maximizing its memory utilization. While the traditionally used ping-pong buffering technique maps subsequent activation layers to disjoint memory regions, we propose a mapping method that allows these regions to overlap and thus utilizes the memory more efficiently. This letter presents the mathematical model to compute the maximum activation-memory overlap and thus the lower bound of on-chip memory needed to perform layer-by-layer processing of CNNs on memory-limited accelerators. Our experiments with various real-world object detector networks show that the proposed mapping technique can decrease the activation memory by up to 32.9%, reducing the overall memory for the entire network by up to 23.9% compared to traditional ping-pong buffering. For higher-resolution denoising networks, we achieve activation memory savings of 48.8%. Additionally, we implement a face detector network on a field-programmable gate array-based camera to validate these memory savings on a complete end-to-end system.
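
The gap the letter exploits can be seen with a small sizing model: ping-pong buffering reserves two disjoint regions, while any schedule fundamentally needs only a layer's input and output alive at once. A sketch with illustrative activation sizes (the letter's model additionally computes safe within-layer overlap, ignored here):

```python
def pingpong_bytes(acts):
    """Two disjoint buffers: even layers land in A, odd layers in B,
    each sized for the largest activation it must hold."""
    return max(acts[0::2]) + max(acts[1::2])

def overlap_lower_bound(acts):
    """Each layer needs its input and output simultaneously, so
    max_i(act[i] + act[i+1]) lower-bounds the activation memory."""
    return max(a + b for a, b in zip(acts, acts[1:]))

acts = [200, 50, 60, 180, 40]            # activation sizes in KiB (made up)
pp, lb = pingpong_bytes(acts), overlap_lower_bound(acts)
print(pp, lb, f"-> up to {100 * (pp - lb) / pp:.1f}% potential saving")
```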

Journal ArticleDOI
TL;DR: This letter proposes a mutable architecture-based watermarking scheme called WATERMARCH, a novel technique of authenticated obfuscation utilizing a hash-based message authentication code (HMAC) to cryptographically mesh the obfuscation and watermark with the original design, with no additional overhead beyond the underlying obfuscation method.
Abstract: Field programmable gate array (FPGA) bitstreams contain information on the functionality of all hardware intellectual property (IP) cores used in a given design, so if an attacker gains access to the bitstream, they can mount attacks on the IP. Various mechanisms have been proposed to protect IP from reverse engineering and theft. However, there are no examples of IP obfuscation in FPGA bitstreams that also intrinsically enable tamper detection and authentication at no additional hardware cost. In this letter, we propose a mutable architecture-based watermarking scheme called WATERMARCH, a novel technique of authenticated obfuscation utilizing a hash-based message authentication code (HMAC) to cryptographically mesh the obfuscation and watermark with the original design, with no additional overhead beyond the underlying obfuscation method. While collaboration between the IP owner and FPGA vendor is necessary to facilitate parsing of the bitstream, once the bitstream is parsable, the watermark can be extracted to prove authorship of the IP or confirm the presence of malicious IP modification, providing tremendous benefits to both IP owners and end users.
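
To illustrate only the authentication mechanics (not WATERMARCH's zero-overhead meshing into the obfuscation), a stand-in sketch with Python's standard `hmac` library, where the tag is simply appended to the bitstream bytes:

```python
import hmac, hashlib

def embed_watermark(bitstream: bytes, key: bytes) -> bytes:
    """Append an HMAC-SHA256 tag over the design bytes; any tampering
    with the body breaks verification."""
    return bitstream + hmac.new(key, bitstream, hashlib.sha256).digest()

def verify_watermark(marked: bytes, key: bytes) -> bool:
    body, tag = marked[:-32], marked[-32:]
    expected = hmac.new(key, body, hashlib.sha256).digest()
    return hmac.compare_digest(expected, tag)

key = b"ip-owner-secret"
marked = embed_watermark(b"\x00\x01fpga-bitstream-bytes...", key)
print(verify_watermark(marked, key))                              # True: authentic
print(verify_watermark(marked[:-40] + b"X" + marked[-39:], key))  # tampered -> False
```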

Journal ArticleDOI
TL;DR: A novel design of an embedded cardiorespiratory monitoring system for wheelchair users consisting of a sensor node, a smartphone, and a cloud server to achieve a fully integrated radar system is proposed.
Abstract: A novel design of an embedded cardiorespiratory monitoring system for wheelchair users is proposed. The entire system is composed of a sensor node, a smartphone, and a cloud server. The sensor node consists of two parts: an ultra-wideband pulse radar, used to obtain continuous vital-sign signals, and a data processing module, which processes the sampled signals to estimate the heart rate and respiration rate and is implemented in an embedded system to achieve a fully integrated radar system. The smartphone functions as the data bridge between the sensor node and the cloud server, which is responsible for sending emergency messages. Experimental results show that the proposed system works reliably in both static and dynamic wheelchair scenarios.

Journal ArticleDOI
TL;DR: The proposed cryptosystem is synthesized and implemented on Intel Cyclone 10 GX and Xilinx Kintex-7 FPGAs to evaluate throughput, and it achieves 25.73–57.1 Mb/s.
Abstract: Attacks on and tampering with sensitive data continue to increase the risks to economic processes and human activities. These risks are key factors driving the development and implementation of security systems. Therefore, improving cryptography is essential to enhance the security of critical data. For example, elliptic curve cryptography (ECC) over the Galois field $GF(2^{163})$ is a public-key (asymmetric) cryptographic technique that demands mapping a message (163 bit) to a point in the prime subgroup of the elliptic curve. To the best of our knowledge, such mapping methods are not yet available on field-programmable gate arrays (FPGAs). Also, asymmetric encryption schemes often do not consider encrypting/decrypting data packets because of their computational complexity and performance limitations. In this letter, we propose and develop a concurrent reconfigurable cryptosystem to encrypt and decrypt streams of data using ECC on an FPGA. First, we present a hardware design and implementation that maps a plain message onto the elliptic curve based on an isomorphic transformation; second, we architect the elliptic curve ElGamal public-key encryption method using point addition and multiplication on a Koblitz elliptic curve on the FPGA. Our proposed cryptosystem is synthesized and implemented on Intel Cyclone 10 GX and Xilinx Kintex-7 FPGAs to evaluate throughput, achieving 25.73–57.1 Mb/s.