
Showing papers in "IEEE Embedded Systems Letters in 2021"


Journal ArticleDOI
TL;DR: This letter proposes a parallelization methodology to maximize the throughput of a single DL application using both GPU and NPU by exploiting various types of parallelism on TensorRT.
Abstract: As deep learning inference applications are increasing, an embedded device tends to equip neural processing units (NPUs) in addition to a CPU and a GPU. For fast and efficient development of deep learning applications, TensorRT is provided as the SDK for the NVIDIA hardware platform, including an optimizer and runtime that deliver low latency and high throughput for deep learning inference. Like most deep learning frameworks, TensorRT assumes that the inference is executed on a single processing element, GPU or NPU, not both. In this letter, we propose a parallelization methodology to maximize the throughput of a single deep learning application using both GPU and NPU by exploiting various types of parallelism on TensorRT. With six real-life benchmarks, we achieve 81%–391% throughput improvement over the baseline inference using GPU only.
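
A rough sketch of the idea (not the letter's implementation): interleave inference batches between two engines so the GPU and the NPU run concurrently. `gpu_engine`, `npu_engine`, and their `infer` method are hypothetical stand-ins for per-device inference contexts such as those TensorRT builds for GPU and DLA targets.

```python
import threading, queue

def run_worker(engine, in_q, out_q):
    while True:
        item = in_q.get()
        if item is None:                 # poison pill: stop the worker
            break
        idx, batch = item
        out_q.put((idx, engine.infer(batch)))

def dual_device_throughput(gpu_engine, npu_engine, batches):
    """Data parallelism across devices: round-robin batches between the
    GPU and NPU queues so both process independent batches in parallel."""
    qs, out_q = [queue.Queue(), queue.Queue()], queue.Queue()
    workers = [threading.Thread(target=run_worker, args=(e, q, out_q))
               for e, q in zip((gpu_engine, npu_engine), qs)]
    for w in workers:
        w.start()
    for i, b in enumerate(batches):
        qs[i % 2].put((i, b))
    for q in qs:
        q.put(None)
    for w in workers:
        w.join()
    return sorted(out_q.get() for _ in batches)

class _Stub:                             # stand-in engine for demonstration
    def __init__(self, name): self.name = name
    def infer(self, batch): return (self.name, batch)

print(dual_device_throughput(_Stub("gpu"), _Stub("npu"), list(range(6))))
```

The letter additionally exploits pipeline parallelism within a single network; this sketch shows only the simplest batch-level split.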

33 citations


Journal ArticleDOI
TL;DR: In this paper, the authors propose a spatial decomposition technique that decomposes a neuron function with many presynaptic connections into a sequence of homogeneous neural units, where each neural unit is a function computation node with two presynaptic connections.
Abstract: With growing model complexity, mapping spiking neural network (SNN)-based applications to tile-based neuromorphic hardware is becoming increasingly challenging. This is because the synaptic storage resources on a tile, viz., a crossbar, can accommodate only a fixed number of presynaptic connections per postsynaptic neuron. For complex SNN models that have many presynaptic connections per neuron, some connections may need to be pruned after training to fit onto the tile resources, leading to a loss in model quality, e.g., accuracy. In this letter, we propose a novel unrolling technique that decomposes a neuron function with many presynaptic connections into a sequence of homogeneous neural units, where each neural unit is a function computation node with two presynaptic connections. This spatial decomposition technique significantly improves crossbar utilization and retains all presynaptic connections, so no model quality is lost to connection pruning. We integrate the proposed technique within an existing SNN mapping framework and evaluate it using machine learning applications on DYNAP-SE, a state-of-the-art neuromorphic hardware platform. Our results demonstrate an average 60% lower crossbar requirement, $9\times$ higher synapse utilization, 62% lower wasted energy on the hardware, and between 0.8% and 4.6% increase in model quality.
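
A toy sketch of the unrolling, assuming inputs are paired level by level into a balanced tree (the letter's exact unrolling order may differ; a pure chain works the same way):

```python
def decompose(presynaptic_inputs):
    """Unroll an N-input neuron into homogeneous 2-input units.
    Returns (left, right, out) triples; intermediate results get
    fresh names t0, t1, ... so every unit fits a 2-input crossbar slot."""
    nodes, frontier, t = [], list(presynaptic_inputs), 0
    while len(frontier) > 1:
        nxt = []
        for i in range(0, len(frontier) - 1, 2):
            out = f"t{t}"; t += 1
            nodes.append((frontier[i], frontier[i + 1], out))
            nxt.append(out)
        if len(frontier) % 2:            # odd leftover propagates upward
            nxt.append(frontier[-1])
        frontier = nxt
    return nodes

# A 5-input neuron becomes four homogeneous 2-input units:
print(decompose(["x0", "x1", "x2", "x3", "x4"]))
```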

26 citations


Journal ArticleDOI
TL;DR: This letter characterizes the reconfiguration cost of dynamic partial reconfiguration (DPR) in terms of time, defines a “DPR Profitability” concept targeting real-time systems, and validates the approach on a real DPR-compliant platform, showing it is general enough to be applied to modern DPR-compliant platforms.
Abstract: Modern field-programmable gate arrays offer dynamic partial reconfiguration (DPR) capabilities, a characteristic that opens new scheduling opportunities for real-time applications running on heterogeneous platforms. To evaluate when it is really useful to exploit a DPR, in this letter, we present the characterization of its reconfiguration cost in terms of time and a definition of the “DPR Profitability” concept targeting real-time systems. To obtain such results, the components involved in a DPR process have been identified and an innovative approach to calculate the DPR time and its worst-case bound is provided. We validate our approach on a real DPR-compliant platform, showing that our proposal is general enough to be applied to modern DPR-compliant platforms.
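
The profitability test can be pictured with back-of-the-envelope arithmetic: reconfiguration pays off only if its time cost is recovered by the accelerated execution. A minimal sketch with a simplified time model and made-up numbers (the letter derives a much more careful worst-case bound):

```python
def dpr_time(bitstream_bytes, config_port_Bps, overhead_s=0.0):
    """Naive reconfiguration-time model: partial-bitstream transfer
    through the configuration port plus a fixed management overhead."""
    return bitstream_bytes / config_port_Bps + overhead_s

def is_profitable(sw_exec_s, hw_exec_s, reconf_s):
    """DPR is profitable when reconfiguring and then running in hardware
    still beats simply executing the task without reconfiguration."""
    return reconf_s + hw_exec_s < sw_exec_s

# Example: a 4 MiB partial bitstream over a 400 MB/s configuration port.
t_reconf = dpr_time(4 * 2**20, 400e6)
print(f"reconfig {t_reconf * 1e3:.1f} ms,",
      "profitable" if is_profitable(0.050, 0.010, t_reconf) else "not profitable")
```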

16 citations


Journal ArticleDOI
TL;DR: This letter proposes an unsigned approximate multiplier architecture segmented into three portions: the least significant portion that contributes least to the partial product (PP) is replaced with a new constant compensation term to improve hardware savings without sacrificing accuracy.
Abstract: This letter proposes an unsigned approximate multiplier architecture segmented into three portions: the least significant portion, which contributes least to the partial product, is replaced with a new constant compensation term to improve hardware savings without sacrificing accuracy. The partial products in the middle portion are simplified using a new 4:2 approximate compressor, and the error due to approximation is compensated using a simple yet efficient error correction module. The most significant portion of the multiplier is implemented using exact logic, as approximating it would result in a large error. Experimental results for an 8-bit multiplier show that power and power-delay product are reduced by up to 47.7% and 55.2%, respectively, in comparison with the exact design, and by 36.9% and 39.5%, respectively, in comparison with existing designs, without significant compromise on accuracy.
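
A behavioral model of the least-significant segment makes the idea concrete: partial-product columns below a cut-off are dropped and replaced by one constant. The cut-off `k` and the compensation constant below are illustrative; the letter's middle-segment 4:2 compressor and error-correction module are omitted.

```python
def approx_mul8(a, b, k=4, comp=None):
    """Toy model: sum only partial products of column weight >= k and
    add a constant compensating for the dropped low columns on average."""
    if comp is None:
        comp = 1 << (k - 1)              # rough midpoint of the dropped range
    pp_sum = 0
    for i in range(8):
        for j in range(8):
            if i + j >= k and (a >> i) & 1 and (b >> j) & 1:
                pp_sum += 1 << (i + j)
    return pp_sum + comp

errs = [abs(approx_mul8(a, b) - a * b) for a in range(256) for b in range(256)]
print("mean abs error over all 8-bit inputs:", sum(errs) / len(errs))
```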

16 citations


Journal ArticleDOI
TL;DR: A multivariate time-series classification system that fuses multirate sensor measurements within the latent space of a deep neural network and investigates the feasibility of categorizing ten different everyday surfaces using a proposed convolutional neural network, which is trained in an end-to-end manner.
Abstract: In this letter, we propose a multivariate time-series classification system that fuses multirate sensor measurements within the latent space of a deep neural network. The system identifies the surface category based on audio and inertial measurements generated from the surface impact, each of which naturally has a different sampling rate and resolution. We investigate the feasibility of categorizing ten different everyday surfaces using the proposed convolutional neural network, which is trained in an end-to-end manner. To validate our approach, we developed an embedded system and collected 60 000 data samples under a variety of conditions. The experimental results show 93% accuracy on a blind test dataset, with end-to-end classification taking less than 300 ms in an embedded machine environment. We conclude this letter with a discussion of the results and future directions of research.
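
A minimal PyTorch sketch of latent-space fusion for two sampling rates: each modality gets its own convolutional branch, pooled to equal-length latent vectors that are concatenated before classification. All rates, channel counts, and layer sizes are illustrative assumptions, not the authors' architecture.

```python
import torch
import torch.nn as nn

class LatentFusionNet(nn.Module):
    def __init__(self, n_classes=10):
        super().__init__()
        self.audio = nn.Sequential(       # high-rate 1-channel audio branch
            nn.Conv1d(1, 16, 64, stride=16), nn.ReLU(),
            nn.AdaptiveAvgPool1d(32), nn.Flatten())
        self.imu = nn.Sequential(         # low-rate 6-axis inertial branch
            nn.Conv1d(6, 16, 8, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(32), nn.Flatten())
        self.head = nn.Linear(2 * 16 * 32, n_classes)

    def forward(self, audio, imu):
        z = torch.cat([self.audio(audio), self.imu(imu)], dim=1)  # fuse latents
        return self.head(z)

net = LatentFusionNet()
logits = net(torch.randn(4, 1, 4800), torch.randn(4, 6, 200))
print(logits.shape)  # torch.Size([4, 10])
```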

15 citations


Journal ArticleDOI
TL;DR: An enhanced DVFS method based on reinforcement learning to reduce the power consumption of sporadic tasks at runtime in multicore embedded systems without task-reliability degradation and deadline misses is proposed.
Abstract: Dynamic voltage and frequency scaling (DVFS) is one of the most popular and exploited techniques to reduce power consumption in multicore embedded systems. However, this technique might lead to task-reliability degradation because scaling the voltage and frequency increases the fault rate and the worst-case execution time of the tasks. In order to preserve task reliability at an acceptable level while achieving power savings, in this letter we propose an enhanced DVFS method based on reinforcement learning to reduce the power consumption of sporadic tasks at runtime in multicore embedded systems without task-reliability degradation. The reinforcement learner makes decisions based on the power savings and task-reliability variations due to DVFS and selects a suitable voltage-frequency level for all tasks such that the timing constraints are met. Experimental evaluation was performed on different configurations and with different numbers of tasks to investigate the efficiency of the proposed method. Our experiments show that the proposed method reduces power consumption more efficiently than existing approaches, without reliability degradation or deadline misses.
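
A toy flavor of the approach, reduced to a stateless bandit with made-up power, WCET, and reliability models (the letter's learner, state space, and reward are more elaborate):

```python
import random

LEVELS = [(0.8, 0.6), (0.9, 0.8), (1.0, 1.0)]   # normalized (V, f) pairs

def reward(level, slack, fault_rate):
    """Illustrative reward: dynamic-power saving (~ V^2 * f) minus a
    penalty when the lower frequency stretches WCET past the slack or
    inflates the fault rate.  All constants are placeholders."""
    v, f = LEVELS[level]
    power_saving = 1.0 - v * v * f
    penalty = 10.0 if 1.0 / f > 1.0 + slack else fault_rate / f
    return power_saving - penalty

Q = [[0.0] * len(LEVELS) for _ in range(8)]      # 8 coarse slack states
alpha, eps = 0.1, 0.1
for _ in range(5000):
    s = random.randrange(8)
    a = (random.randrange(len(LEVELS)) if random.random() < eps
         else max(range(len(LEVELS)), key=lambda i: Q[s][i]))
    Q[s][a] += alpha * (reward(a, slack=s / 8.0, fault_rate=0.05) - Q[s][a])
print([max(range(len(LEVELS)), key=lambda i: Q[s][i]) for s in range(8)])
```

Tasks with more slack learn to tolerate lower voltage-frequency levels, while tight-deadline states stay at the highest level.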

14 citations


Journal ArticleDOI
TL;DR: An efficient CNN training architecture is designed using a systolic array whose processing elements support the batch normalization (BN) functions in both the training and inference processes, implementing an improved, hardware-friendly BN algorithm: range batch normalization (RBN).
Abstract: In recent years, convolutional neural networks (CNNs) have been widely used. However, their ever-increasing number of parameters makes it challenging to train them on GPUs, which is time and energy expensive. This has prompted researchers to turn their attention to training on more energy-efficient hardware. The batch normalization (BN) layer is widely used in various state-of-the-art CNNs and is an indispensable layer for accelerating CNN training. As the amount of computation in the convolutional layers declines, the relative importance of the BN layer continues to increase. However, traditional CNN training accelerators pay little attention to the efficient hardware implementation of the BN layer. In this letter, we design an efficient CNN training architecture using a systolic array whose processing elements support the BN functions in both the training and inference processes. The BN function implemented is an improved, hardware-friendly BN algorithm: range batch normalization (RBN). The experimental results show that the implementation of RBN saves 10% of hardware resources and reduces power by 10.1% and delay by 4.6% on average. We implement the accelerator on the field-programmable gate array VU440, and the power consumption of its core computing engine is 8.9 W.
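
The hardware appeal of range BN is that it swaps variance computation (squaring and accumulation) for a min/max scan. A NumPy sketch of the normalization, assuming the Gaussian-statistics scale constant — the expected range of $n$ samples is about $2\sqrt{2\ln n}$ standard deviations — noting that the letter's RBN variant may scale slightly differently:

```python
import numpy as np

def range_batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Estimate sigma from the min-max range instead of the variance:
    comparators replace multipliers in hardware."""
    n = x.shape[0]
    mu = x.mean(axis=0)
    rng = x.max(axis=0) - x.min(axis=0)
    sigma_hat = rng / (2.0 * np.sqrt(2.0 * np.log(n)))
    return gamma * (x - mu) / (sigma_hat + eps) + beta

x = np.random.randn(256, 8) * 3.0 + 1.0
y = range_batch_norm(x)
print(y.mean(axis=0).round(2), y.std(axis=0).round(2))  # ~0 mean, ~1 std
```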

14 citations


Journal ArticleDOI
TL;DR: A design space exploration (DSE) strategy is formulated to explore trade-offs in accuracy, runtime, cost, and energy consumption arising due to flexibility in choosing DNN topology, DPU configuration, and FPGA model.
Abstract: Many emerging systems concurrently execute multiple applications that use deep neural networks (DNNs) as a key portion of the computation. To speed up the execution of such DNNs, various hardware accelerators have been proposed in recent works. The deep learning processor unit (DPU) from Xilinx is one such accelerator, targeted at field-programmable gate array (FPGA)-based systems. We study the runtime and energy consumption of different DNNs on a range of DPU configurations and derive useful insights. Using these insights, we formulate a design space exploration (DSE) strategy to explore tradeoffs in accuracy, runtime, cost, and energy consumption arising from the flexibility in choosing the DNN topology, DPU configuration, and FPGA model. The proposed strategy provides a reduction of $28\times$ in the number of design points to be simulated and $23\times$ in the pruning time.
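
The essence of such a DSE is pruning dominated design points before detailed simulation. A generic Pareto-filter sketch over a (DNN, DPU configuration, FPGA) grid; the names and stand-in evaluator are illustrative, not the paper's models:

```python
from itertools import product

def pareto_front(points, evaluate):
    """Keep only non-dominated points (accuracy up; runtime, energy, and
    cost down); dominated points need never be simulated in detail."""
    scored = [(p, evaluate(*p)) for p in points]
    def dominates(a, b):
        geq = (a[0] >= b[0], a[1] <= b[1], a[2] <= b[2], a[3] <= b[3])
        return all(geq) and a != b
    return [p for p, m in scored
            if not any(dominates(m2, m) for _, m2 in scored)]

space = list(product(["resnet18", "mobilenetv2"],       # DNN topology
                     ["B512", "B1024", "B4096"],        # DPU configuration
                     ["zcu102", "zcu104"]))             # FPGA model

# Stand-in evaluator returning (accuracy, runtime, energy, cost):
fake = lambda d, u, f: (hash((d, u)) % 100 / 100, hash((u, f)) % 50,
                        hash((f, d)) % 30, hash((d, u, f)) % 20)
print(len(space), "->", len(pareto_front(space, fake)), "design points")
```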

13 citations


Journal ArticleDOI
TL;DR: CNN outperforms the accuracy obtained by the threshold-based algorithm by more than 7%.
Abstract: Driver drowsiness is one of the major causes of accidents and fatal road crashes, causing a high human and economic cost. Recently, automatic drowsiness detection has begun to be recognized as a promising solution, receiving growing attention from industry and academics. In this letter, we propose to embed a convolutional neural network (CNN)-based solution in smart connected glasses to detect eye blinks and use them to estimate the driver’s drowsiness level. This innovative solution is compared with a more traditional method, based on a detection threshold mechanism. The performance, battery lifetime, and memory footprint of both solutions are assessed for embedded implementation in the connected glasses. The results demonstrate that CNN outperforms the accuracy obtained by the threshold-based algorithm by more than 7%. Moreover, increased overheads in terms of memory and battery lifetime are acceptable, thus making CNN a viable solution for drowsiness detection in wearable devices.

12 citations


Journal ArticleDOI
TL;DR: In this paper, the authors propose a compression-aware and high-accuracy deep learning framework called CHISEL that outperforms the best-known works in the area while maintaining localization robustness on embedded devices.
Abstract: GPS technology has revolutionized the way we localize and navigate outdoors. However, the poor reception of GPS signals in buildings makes it unsuitable for indoor localization. WiFi fingerprinting-based indoor localization is one of the most promising ways to meet this demand. Unfortunately, most work in the domain fails to resolve challenges associated with deployability on resource-limited embedded devices. In this work, we propose a compression-aware and high-accuracy deep learning framework called CHISEL that outperforms the best-known works in the area while maintaining localization robustness on embedded devices.

12 citations


Journal ArticleDOI
TL;DR: A nondestructive Trojan detection technique based on thermal maps and inception neural networks (INNs), achieving over 98.2% detection accuracy after training the INNs with 150 000 thermal maps.
Abstract: Hardware Trojan detection on modern integrated circuits (ICs) is a challenging task since the inspector may have no idea about the location and size of the embedded Trojan circuit. To achieve accurate Trojan detection without relying on hardware reverse engineering, a nondestructive technique based on thermal maps and inception neural networks (INNs) is proposed in this letter. The thermal maps generated by a Trojan-free (TF) IC chip and multiple emulated Trojan-infected (TI) IC chips are first collected and optimized as the critical side-channel leakage. Then, INNs are utilized to analyze these optimized thermal maps and extract the information of the embedded Trojans with the assistance of customized filters. As the results show, after training the INNs with 150 000 thermal maps, a Trojan detection accuracy above 98.2% can be achieved.

Journal ArticleDOI
TL;DR: An embedded UAC system with the STM32H743 processor as the core and a peripheral sending/receiving circuit as the signal conditioning module is proposed, together with a fast and robust frame synchronization algorithm based on the segmented fast Fourier transform.
Abstract: The underwater acoustic communication (UAC) modem is an important infrastructure for underwater network construction. In recent years, with the performance improvement of STM32 processors, realizing reliable UAC on a high-performance STM32 processor helps reduce system power consumption, hardware cost, and development difficulty. In this letter, we propose an embedded UAC system with the STM32H743 processor as the core and a peripheral sending/receiving circuit as the signal conditioning module. The system supports a variety of modulation and demodulation methods, including single/multicarrier frequency-shift keying and orthogonal frequency division multiplexing. Furthermore, to reduce the computational cost of the system, a fast and robust frame synchronization algorithm based on the segmented fast Fourier transform is applied. Sea trials show that the system can realize reliable UAC transmission at 100 b/s–1 kb/s over distances of 5–8 km in shallow water.
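
One plausible reading of the segmented-FFT synchronizer, sketched below: rather than one long correlation, each candidate window is split into short segments whose FFT magnitudes at the sync-tone bin are summed noncoherently, keeping per-step cost low on an MCU. All parameters and the tone-preamble assumption are illustrative, not the letter's exact algorithm.

```python
import numpy as np

def segmented_fft_sync(rx, fs, f_sync, seg_len=256, n_seg=8, threshold=6.0):
    """Slide over rx; in each window, FFT n_seg short segments and sum
    the magnitude at the expected sync-tone bin."""
    bin_idx = int(round(f_sync * seg_len / fs))
    win, best_pos, best_score, noise = seg_len * n_seg, None, 0.0, 1e-12
    for pos in range(0, len(rx) - win, seg_len):
        segs = rx[pos:pos + win].reshape(n_seg, seg_len)
        spec = np.abs(np.fft.rfft(segs, axis=1))
        score = spec[:, bin_idx].sum()
        noise = max(noise, np.median(spec) * n_seg)   # crude noise floor
        if score > best_score:
            best_pos, best_score = pos, score
    return (best_pos, best_score) if best_score > threshold * noise else (None, 0.0)

fs, f0 = 48000, 6000
tone = np.sin(2 * np.pi * f0 * np.arange(2048) / fs)
rx = np.concatenate([np.random.randn(3000) * 0.1,
                     tone + np.random.randn(2048) * 0.1,
                     np.random.randn(3000) * 0.1])
print(segmented_fft_sync(rx, fs, f0))   # finds the preamble near sample 3000
```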

Journal ArticleDOI
TL;DR: In this paper, the authors propose an analytical modeling technique for priority-aware NoCs under bursty traffic, which has less than 10% modeling error with respect to a cycle-accurate NoC simulator.
Abstract: Networks-on-Chip (NoCs) used in commercial many-core processors typically incorporate priority arbitration. Moreover, they experience bursty traffic due to application workloads. However, most state-of-the-art NoC analytical performance analysis techniques assume fair arbitration and simple traffic models. To address these limitations, we propose an analytical modeling technique for priority-aware NoCs under bursty traffic. Experimental evaluations with synthetic and bursty traffic show that the proposed approach has less than 10% modeling error with respect to a cycle-accurate NoC simulator.
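
For orientation, the classical nonpreemptive-priority M/G/1 waiting-time formula is the kind of baseline such analyses start from; it assumes Poisson arrivals, which is exactly the limitation the letter's bursty-traffic model lifts. A sketch:

```python
def priority_wait(lam, es, es2):
    """W_k = R / ((1 - sigma_{k-1}) * (1 - sigma_k)), class 0 highest,
    with residual work R = sum_i lam_i * E[S_i^2] / 2 and cumulative
    utilization sigma_k = sum_{i<=k} lam_i * E[S_i]."""
    R = sum(l * m2 / 2 for l, m2 in zip(lam, es2))
    waits, sigma = [], 0.0
    for l, m in zip(lam, es):
        waits.append(R / ((1 - sigma) * (1 - sigma - l * m)))
        sigma += l * m
    return waits

# Two traffic classes sharing a router port (illustrative numbers,
# deterministic one-cycle service so E[S^2] = 1):
print(priority_wait(lam=[0.3, 0.4], es=[1.0, 1.0], es2=[1.0, 1.0]))
```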

Journal ArticleDOI
TL;DR: In this paper, the authors describe a parallel synchronous software model that executes as $N$ parallel threads on a processor with word-length $N$, where each thread is a single-bit synchronous machine with precise, contention-free timing while still executing as an independent machine.
Abstract: In typical embedded applications, the precise execution time of the program does not matter and it is sufficient to meet a real-time deadline. However, modern applications in information security have become much more time-sensitive due to the risk of timing side-channel leakage. The timing of such programs needs to be data-independent and precise. We describe a parallel synchronous software model, which executes as $N$ parallel threads on a processor with word-length $N$. Each thread is a single-bit synchronous machine with precise, contention-free timing, while each of the $N$ threads still executes as an independent machine. The resulting software supports fine-grained parallel execution. In contrast to earlier work to obtain precise and repeatable timing in software, our solution does not require modifications to the processor architecture nor specialized instruction scheduling techniques. In addition, all threads run in parallel and without contention, which eliminates the problem of thread scheduling. We use hardware (HDL) semantics to describe a thread as a single-bit synchronous machine. Using logic synthesis and code generation, we derive a parallel synchronous implementation of this design. We illustrate the synchronous parallel programming model with practical examples from cryptography and other applications with precise timing requirements.
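
The execution model is essentially bitslicing: one machine word holds the same state bit of $N$ independent single-bit machines, so every bitwise operation advances all $N$ threads in lockstep with data-independent timing. A toy sketch with $N = 64$ two-bit counters, where the gate equations come straight from the counter's next-state logic:

```python
N = 64
MASK = (1 << N) - 1            # one lane per bit position

def step(b1, b0):
    """One synchronous update of all lanes: (b1, b0) <- (b1, b0) + 1.
    The same two gate-level equations execute for every lane at once."""
    new_b0 = ~b0 & MASK        # bit 0 toggles every cycle
    new_b1 = (b1 ^ b0) & MASK  # bit 1 toggles when bit 0 was 1
    return new_b1, new_b0

b1 = b0 = 0                    # all 64 counters start at 0
for _ in range(3):
    b1, b0 = step(b1, b0)
assert b1 == MASK and b0 == MASK   # every lane now holds 3 (binary 11)
print("all", N, "lanes advanced in lockstep")
```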

Journal ArticleDOI
TL;DR: This letter presents an area-optimized, low-latency, and energy-efficient architecture for an accurate signed multiplier, which can be used for FPGA-based implementations of applications utilizing signed numbers.
Abstract: Multiplication is one of the most extensively used arithmetic operations in a wide range of applications, such as multimedia processing and artificial neural networks. For such applications, the multiplier is one of the major contributors to energy consumption, critical path delay, and resource utilization. These effects are more pronounced in field-programmable gate array (FPGA)-based designs. However, most state-of-the-art designs target ASIC-based systems, and the few FPGA-based designs that exist are largely limited to unsigned numbers, which require extra circuitry to support signed operations. To overcome these limitations for FPGA-based implementations of applications utilizing signed numbers, this letter presents an area-optimized, low-latency, and energy-efficient architecture for an accurate signed multiplier. Compared to the Vivado area-optimized multiplier IP, our implementations offer up to 40.0%, 43.0%, and 70.0% reductions in area, latency, and energy, respectively. The RTL implementations of our designs will be released as an open-source library at https://cfaed.tu-dresden.de/pd-downloads.

Journal ArticleDOI
TL;DR: It is shown that systematically converting native instructions from Android apps into images using Hilbert space-filling curves and entropy visualization techniques enables CNNs to reliably detect malicious apps with near-ideal accuracy.
Abstract: Traditional research on mobile malware detection has focused on approaches that rely on analyzing bytecode to uncover malicious apps. Unfortunately, cybercriminals can bypass such methods by embedding malware directly in native machine code, making traditional methods inadequate. Another challenge that detection solutions face is scalability. The sheer number of malware samples released every year makes it difficult for solutions to efficiently scale their coverage. This letter presents an energy-efficient solution that uses convolutional neural networks (CNNs) to defend against malware. We show that systematically converting native instructions from Android apps into images using Hilbert space-filling curves and entropy visualization techniques enables CNNs to reliably detect malicious apps with near-ideal accuracy. We characterize popular CNN architectures that are known to perform well on different computer vision tasks and evaluate their effectiveness against malware using an Android malware dataset.
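
The locality property that makes this work: bytes adjacent in the binary stay adjacent in the image, so 2-D convolutional filters see meaningful neighborhoods. A sketch using the standard Hilbert distance-to-coordinate conversion (the entropy-visualization coloring step is omitted):

```python
def d2xy(n, d):
    """Map distance d along a Hilbert curve to (x, y) in an n-by-n grid,
    n a power of two (standard iterative conversion)."""
    x = y = 0
    s, t = 1, d
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:                      # rotate the quadrant
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x, y = x + s * rx, y + s * ry
        t //= 4
        s *= 2
    return x, y

def bytes_to_image(code, n=16):
    """Lay instruction bytes along the curve; pad/truncate to n*n."""
    img = [[0] * n for _ in range(n)]
    for d in range(n * n):
        x, y = d2xy(n, d)
        img[y][x] = code[d] if d < len(code) else 0
    return img

img = bytes_to_image(bytes(range(200)), n=16)
print(img[0][:8])   # first row of the resulting "texture"
```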

Journal ArticleDOI
TL;DR: A novel restructuring of the 2bit-SC (2b-SC) precomputation decoder architecture is carried out to reduce the latency by 20% while reducing the hardware complexity.
Abstract: Polar codes are among the recently developed error-correcting codes, and they are popular due to their capacity-achieving nature. The architecture of the successive cancellation (SC) decoding algorithm is composed of a recursive processing element (PE). The PE comprises various blocks, including a signed adder, subtractor, comparator, multiplexers, and a few logic gates, so the latency of the PE is a primary concern. Hence, a high-speed architecture for implementing the SC decoding algorithm for polar codes is proposed. In the proposed work, a novel restructuring of the 2bit-SC (2b-SC) precomputation decoder architecture is carried out to reduce the latency by 20% while reducing the hardware complexity. Compared to the 2b-SC precomputation decoder, the proposed architecture also has 19% higher throughput for (1024, 512) polar codes with a 45% reduction in gate count.
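
For context, the PE evaluates the two standard SC update functions on log-likelihood ratios; precomputation schemes like 2b-SC compute both outcomes of the `g` function ahead of the bit decision so only a selection remains on the critical path. A sketch in min-sum form:

```python
def f_node(l1, l2):
    """SC 'f' function, min-sum form: sign(l1)*sign(l2)*min(|l1|, |l2|).
    In hardware this is the comparator/mux path of the PE."""
    same_sign = (l1 >= 0) == (l2 >= 0)
    return (1 if same_sign else -1) * min(abs(l1), abs(l2))

def g_node(l1, l2, b):
    """SC 'g' function: add or subtract depending on the already-decoded
    partial-sum bit b."""
    return l2 + l1 if b == 0 else l2 - l1

# Precomputation: evaluate both outcomes before b is known, then select.
g_both = lambda l1, l2: (l2 + l1, l2 - l1)

print(f_node(-2.5, 1.0), g_node(-2.5, 1.0, 0), g_both(-2.5, 1.0))
```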

Journal ArticleDOI
TL;DR: Monolithic 3-D (M3D) integration is adopted for the GPU scratchpad memory, yielding an M3D SPM that enhances matrix multiplication and improves the system performance by 46.3% for $32\times 32$ matrix multiplication.
Abstract: Convolutional neural networks (CNNs) are one of the most popular machine learning algorithms. The convolutional layers, which account for most of the execution time of CNNs, are implemented with matrix multiplication because the convolution operation performs dot products between filters and local regions of the input. On the other hand, GPUs with thousands of cores have proven to significantly accelerate matrix multiplication compared to CPUs with a limited number of cores, especially for large matrices. However, the current memory architecture allows only one row access at a time, so multiple accesses are necessary to read the column data of the second matrix, slowing down matrix multiplication. In this study, we adopt monolithic 3-D (M3D) integration for the GPU scratchpad memory (SPM) to enhance matrix multiplication. The M3D SPM allows one access to read the column data of the second matrix, similar to the case of the first matrix. The simulation results show that our M3D SPM improves the system performance by 46.3% for $32\times 32$ matrix multiplication over a conventional 2-D SPM, where the column data of the second matrix are read sequentially.

Journal ArticleDOI
TL;DR: It is found that the proposed language enhancements potentially bring significant benefits to programming in C++ for embedded computers, but that the implementation imposes constraints that may prevent its widespread acceptance among the embedded development community.
Abstract: Coroutines will be added to C++ as part of the C++20 standard. Coroutines provide native language support for asynchronous operations. This letter evaluates the C++ coroutine specification from the perspective of embedded systems developers. We find that the proposed language features are generally beneficial, but that memory management of the coroutine state needs to be improved. Our experiments on an ARM Cortex-M4 microcontroller evaluate the time and memory costs of coroutines in comparison with alternatives, and we show that context switching with coroutines is significantly faster than with thread-based real-time operating systems. Furthermore, we analyzed the impact of these language features on prototypical Internet of Things sensor software. We find that the proposed language enhancements potentially bring significant benefits to programming in C++ for embedded computers, but that the implementation imposes constraints that may prevent its widespread acceptance among the embedded development community.

Journal ArticleDOI
Deyu Lin, Weidong Min, Jianfeng Xu, Jiaxun Yang, Jianlin Zhang
TL;DR: This letter presents a novel routing method that applies the theory of energy welfare and compressive sensing, from the perspective of social welfare, to improve the energy efficiency among different clusters during intercluster routing decision making.
Abstract: This letter presents a novel routing method to improve the energy efficiency among different clusters during intercluster routing decision making. To this end, the theory of energy welfare is applied to promote energy equality. In addition, compressive sensing (CS) theory is utilized in intracluster data acquisition to further reduce data redundancy. Subsequently, an energy-efficient routing scheme based on CS, designed from the perspective of social welfare, is proposed. Finally, extensive experiments were conducted, and the numerical results verify its effectiveness in improving the energy efficiency and prolonging the network lifetime of wireless sensor networks.

Journal ArticleDOI
TL;DR: An SoC-based platform conceived for scientific experimentation, with a fully modular and configurable design, which achieves a configurable UWB-capable sampling rate through an equivalent-time sampling scheme.
Abstract: Research and development of algorithms for processing impulse radio ultrawideband signals is a trending issue within remote sensing applications, personal area networks, and RF imaging among other areas. We have designed an SoC-based platform, conceived for scientific experimentation, with a fully modular and configurable design. Built with off-the-shelf components, our design achieves a configurable UWB-capable sampling rate through an equivalent-time sampling scheme. In this letter, we introduce the system architecture, its main interfaces, and the rationale behind each module implementation.

Journal ArticleDOI
Liu Yang, Qi Wang, Li Qianhui, Xiaolei Yu, Jing He, Zongliang Huo
TL;DR: In this article, a time-saving channel parameter estimation method for TLC NAND flash memory is proposed, which reduces estimation time by three improvements: (1) reducing fitted parameters in one iteration step, (2) using pre-derived values as initial guess values to decrease iteration steps, and (3) utilizing parallelism between data sensing operations and computation.
Abstract: As the storage density of NAND flash increases, reliability is significantly degraded, making NAND flash memory more sensitive to noise. Among all noise sources, retention noise is a major one. Error correction based on channel parameter estimation is an essential method for dealing with retention noise. In this letter, a time-saving channel parameter estimation method for TLC NAND flash memory is proposed. The proposed method reduces estimation time through three improvements: 1) reducing the fitted parameters in one iteration step; 2) using pre-derived values as initial guesses to decrease the number of iteration steps; and 3) exploiting parallelism between data sensing operations and computation. Compared with previous work, the proposed method estimates parameters with higher accuracy and lower time overhead, as verified by experimental results.

Journal ArticleDOI
TL;DR: This letter explores the low-power heterogeneous architecture of the Nvidia Jetson TX2 by proposing a parallel solution to the CCSDS-123 compressor on embedded systems, reducing development effort compared with the production of dedicated circuits, while maintaining low energy consumption.
Abstract: The consultative committee for space data systems (CCSDS)-123 is a standard for lossless compression of multispectral and hyperspectral images with applications in on-board power-constrained systems, such as satellites and military drones. This letter explores the low-power heterogeneous architecture of the Nvidia Jetson TX2 by proposing a parallel solution to the CCSDS-123 compressor on embedded systems, reducing development effort compared with the production of dedicated circuits while maintaining low energy consumption. This solution parallelizes the predictor on a low-power graphics processing unit (GPU) while the encoders exploit the heterogeneous multiple cores of the CPUs and GPU concurrently. We report more than 16.6 Gb/s for the predictor and 1.4 Gb/s for the whole system, requiring less than 6.3 W and providing an efficiency of 245.6 Mb/s/W.

Journal ArticleDOI
TL;DR: BALDER is a learning framework capable of automatically choosing optimal execution configurations according to the parallel application at hand, aiming to maximize the tradeoff between aging and performance.
Abstract: Computation has been pushed to the edge to decrease latency and alleviate the computational burden of IoT applications in the cloud. However, the increasing processing demands of edge applications necessitate platforms that exploit thread-level parallelism (TLP). Yet, power and heat dissipation rise as TLP inadvertently increases or when parallelism is not cleverly exploited, which may result from nonideal use of a given parallel programming interface (PPI). Besides common issues, such as the need for more robust power sources and better cooling, heat also adversely affects aging, accelerating phenomena such as negative bias temperature instability (NBTI) and hot-carrier injection (HCI), which further reduce processor lifetime. Hence, considering that increasing the lifespan of an edge device is key, so that the number of times the application set may execute until its end-of-life is maximized, we propose BALDER, a learning framework capable of automatically choosing optimal execution configurations (PPI and number of threads) according to the parallel application at hand, aiming to maximize the tradeoff between aging and performance. When executing ten well-known applications on two multicore embedded architectures, we show that BALDER can find a nearly optimal configuration for all our experiments.

Journal ArticleDOI
TL;DR: This letter presents SecPump, a new open-source wireless infusion pump platform dedicated to security researchers, which intends to provide a framework for security evaluation, tailored for countermeasure development against security flaws related to medical devices.
Abstract: This letter presents SecPump, a new open-source wireless infusion pump platform dedicated to security researchers. The novelty of the platform is that it is “plug and play.” Indeed, SecPump simulates a functional infusion pump system on a single board without requiring additional hardware or mechanical components. The presented cyber-physical platform intends to provide a framework for security evaluation, tailored for countermeasure development against security flaws related to medical devices. This letter presents the functionality of the cyber-physical device, its wireless features, and its portability across several hardware architectures. Finally, both hardware and software attacks are showcased on the platform.

Journal ArticleDOI
TL;DR: A new design for a 5G NR low-density parity check code decoder running on a GPU is presented, which improves on the layered algorithm by increasing parallelism on a single code word.
Abstract: The graphical processing unit (GPU), as a digital signal processing accelerator for cloud RAN, is investigated. This letter presents a new design for a 5G NR low-density parity check code decoder running on a GPU. The algorithm adapts flexibly to the GPU architecture to achieve high resource utilization as well as low latency. It improves on the layered algorithm by increasing parallelism on a single code word. The flexible GPU decoder (on a 24-core GPU) was found to have $5\times$ higher throughput than a recent GPU flooding decoder and $3\times$ higher throughput than a field-programmable gate array (FPGA) decoder (757K gates). The flexible GPU decoder exhibits one-third of the FPGA's decoding power efficiency, which is typical of general-purpose processors. For rapid deployment and flexibility, GPUs may be suitable as cloud RAN accelerators.

Journal ArticleDOI
TL;DR: In this paper, the authors propose a mapping method that allows the memory regions of subsequent activation layers to overlap and thus utilize the memory more efficiently, decreasing the activation memory by up to 32.9% compared to traditional ping-pong buffering.
Abstract: While the accuracy of convolutional neural networks (CNNs) has achieved vast improvements through larger and deeper network architectures, the memory footprint for storing their parameters and activations has increased as well. This trend especially challenges power- and resource-limited accelerator designs, which are often restricted to storing all network data in on-chip memory to avoid interfacing energy-hungry external memories. Maximizing the network size that fits on a given accelerator thus requires maximizing its memory utilization. While the traditionally used ping-pong buffering technique maps subsequent activation layers to disjoint memory regions, we propose a mapping method that allows these regions to overlap and thus utilizes the memory more efficiently. This letter presents the mathematical model to compute the maximum activation-memory overlap and thus the lower bound of on-chip memory needed to perform layer-by-layer processing of CNNs on memory-limited accelerators. Our experiments with various real-world object detector networks show that the proposed mapping technique can decrease the activation memory by up to 32.9%, reducing the overall memory for the entire network by up to 23.9% compared to traditional ping-pong buffering. For higher-resolution denoising networks, we achieve activation memory savings of 48.8%. Additionally, we implement a face detector network on a field-programmable gate array-based camera to validate these memory savings on a complete end-to-end system.
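
The gap the letter exploits can be seen with a small sizing model: ping-pong buffering reserves two disjoint regions, while any schedule fundamentally needs only a layer's input and output alive at once. A sketch with illustrative activation sizes (the letter's model additionally computes safe within-layer overlap, ignored here):

```python
def pingpong_bytes(acts):
    """Two disjoint buffers: even layers land in A, odd layers in B,
    each sized for the largest activation it must hold."""
    return max(acts[0::2]) + max(acts[1::2])

def overlap_lower_bound(acts):
    """Each layer needs its input and output simultaneously, so
    max_i(act[i] + act[i+1]) lower-bounds the activation memory."""
    return max(a + b for a, b in zip(acts, acts[1:]))

acts = [200, 50, 60, 180, 40]            # activation sizes in KiB (made up)
pp, lb = pingpong_bytes(acts), overlap_lower_bound(acts)
print(pp, lb, f"-> up to {100 * (pp - lb) / pp:.1f}% potential saving")
```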

Journal ArticleDOI
TL;DR: This letter proposes a mutable architecture-based watermarking scheme called WATERMARCH, a novel technique of authenticated obfuscation utilizing a hash-based message authentication code (HMAC) to cryptographically mesh the obfuscation and watermark with the original design, with no additional overhead beyond the underlying obfuscation method.
Abstract: Field programmable gate array (FPGA) bitstreams contain information on the functionality of all hardware intellectual property (IP) cores used in a given design, so if an attacker gains access to the bitstream, they can mount attacks on the IP. Various mechanisms have been proposed to protect IP from reverse engineering and theft. However, there are no examples of IP obfuscation in FPGA bitstreams that also intrinsically enable tamper detection and authentication at no additional hardware cost. In this letter, we propose a mutable architecture-based watermarking scheme called WATERMARCH, a novel technique of authenticated obfuscation utilizing a hash-based message authentication code (HMAC) to cryptographically mesh the obfuscation and watermark with the original design, with no additional overhead beyond the underlying obfuscation method. While collaboration between the IP owner and FPGA vendor is necessary to facilitate parsing of the bitstream, once the bitstream is parsable, the watermark can be extracted to prove authorship of the IP or confirm the presence of malicious IP modification, providing tremendous benefits to both IP owners and end users.
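
To illustrate only the authentication mechanics (not WATERMARCH's zero-overhead meshing into the obfuscation), a stand-in sketch with Python's standard `hmac` library, where the tag is simply appended to the bitstream bytes:

```python
import hmac, hashlib

def embed_watermark(bitstream: bytes, key: bytes) -> bytes:
    """Append an HMAC-SHA256 tag over the design bytes; any tampering
    with the body breaks verification."""
    return bitstream + hmac.new(key, bitstream, hashlib.sha256).digest()

def verify_watermark(marked: bytes, key: bytes) -> bool:
    body, tag = marked[:-32], marked[-32:]
    expected = hmac.new(key, body, hashlib.sha256).digest()
    return hmac.compare_digest(expected, tag)

key = b"ip-owner-secret"
marked = embed_watermark(b"\x00\x01fpga-bitstream-bytes...", key)
print(verify_watermark(marked, key))                              # True: authentic
print(verify_watermark(marked[:-40] + b"X" + marked[-39:], key))  # tampered -> False
```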

Journal ArticleDOI
TL;DR: A novel design of an embedded cardiorespiratory monitoring system for wheelchair users consisting of a sensor node, a smartphone, and a cloud server to achieve a fully integrated radar system is proposed.
Abstract: A novel design of an embedded cardiorespiratory monitoring system for wheelchair users is proposed. The entire system is composed of a sensor node, a smartphone, and a cloud server. The sensor node consists of two parts: an ultra-wideband pulse radar, used to obtain continuous vital-sign signals, and a data processing module, which processes the sampled signals to estimate the heart rate and respiration rate and is implemented in an embedded system to achieve a fully integrated radar system. The smartphone functions as the data bridge between the sensor node and the cloud server, which is responsible for sending emergency messages. Experimental results show that the proposed system works reliably in both static and dynamic wheelchair scenarios.

Journal ArticleDOI
TL;DR: The proposed cryptosystem is synthesized and implemented on Intel Cyclone 10 GX and Xilinx Kintex-7 FPGAs to evaluate throughput, and it achieves 25.73–57.1 Mb/s.
Abstract: Attacks on and tampering with sensitive data continue to increase the risks to economic processes and human activities. These risks are key factors driving the development and implementation of security systems. Therefore, improving cryptography is essential to enhance the security of critical data. For example, elliptic curve cryptography (ECC) over the Galois field $GF(2^{163})$ is a public-key (asymmetric) cryptographic technique that demands mapping a message (163 bit) to a point in the prime subgroup of the elliptic curve. To the best of our knowledge, such mapping methods are not yet available on field-programmable gate arrays (FPGAs). Also, asymmetric encryption schemes often do not consider encrypting/decrypting data packets because of their computational complexity and performance limitations. In this letter, we propose and develop a concurrent reconfigurable cryptosystem to encrypt and decrypt streams of data using ECC on an FPGA. First, we present a hardware design and implementation that maps a plain message onto the elliptic curve based on an isomorphic transformation; second, we architect the elliptic curve ElGamal public-key encryption method using point addition and multiplication on a Koblitz elliptic curve on the FPGA. Our proposed cryptosystem is synthesized and implemented on Intel Cyclone 10 GX and Xilinx Kintex-7 FPGAs to evaluate throughput, achieving 25.73–57.1 Mb/s.