
Showing papers in "IEEE Transactions on Very Large Scale Integration Systems in 2019"


Journal ArticleDOI
TL;DR: This paper presents a Tera-OPS streaming hardware accelerator implementing a you-only-look-once (YOLO) CNN, which outperforms the “one-size-fits-all” designs in both performance and power efficiency.
Abstract: Convolutional neural networks (CNNs) require numerous computations and external memory accesses. Frequent accesses to off-chip memory cause slow processing and large power dissipation. For real-time object detection with high throughput and power efficiency, this paper presents a Tera-OPS streaming hardware accelerator implementing a you-only-look-once (YOLO) CNN. The parameters of the YOLO CNN are retrained and quantized with the PASCAL VOC data set using binary weight and flexible low-bit activation. The binary weight enables storing the entire network model in block RAMs of a field-programmable gate array (FPGA) to reduce off-chip accesses aggressively and, thereby, achieve significant performance enhancement. In the proposed design, all convolutional layers are fully pipelined for enhanced hardware utilization. The input image is delivered to the accelerator line-by-line. Similarly, the output from the previous layer is transmitted to the next layer line-by-line. The intermediate data are fully reused across layers, thereby eliminating external memory accesses. The decreased dynamic random access memory (DRAM) accesses reduce DRAM power consumption. Furthermore, as the convolutional layers are fully parameterized, it is easy to scale up the network. In this streaming design, each convolution layer is mapped to a dedicated hardware block. Therefore, it outperforms the “one-size-fits-all” designs in both performance and power efficiency. This CNN implemented using a VC707 FPGA achieves a throughput of 1.877 tera operations per second (TOPS) at 200 MHz with batch processing while consuming 18.29 W of on-chip power, the best power efficiency among the compared previous research. As for object detection accuracy, it achieves a mean average precision (mAP) of 64.16% for the PASCAL VOC 2007 data set, which is only 2.63% lower than the mAP of the same YOLO network with full precision.

259 citations
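For intuition, here is a minimal behavioral sketch (Python, not the authors' RTL) of the line-by-line streaming idea: a K-row line buffer holds just enough input rows to produce one output row, so intermediate feature maps never leave the chip. Function names and the float datatype are illustrative stand-ins; the actual design uses binary weights and low-bit activations in fixed-function hardware.

```python
# Behavioral sketch of line-by-line streaming for a KxK convolution: only K
# input rows are ever buffered, and each output row is streamed to the next
# layer as soon as it is ready, so no intermediate feature map goes off-chip.
from collections import deque
import numpy as np

def stream_conv2d(rows, weights):
    """rows: iterator over 1-D input rows; weights: (K, K) kernel."""
    K = weights.shape[0]
    line_buf = deque(maxlen=K)               # models the on-chip line buffer
    for row in rows:
        line_buf.append(np.asarray(row, dtype=np.float32))
        if len(line_buf) == K:
            window = np.stack(line_buf)      # (K, W) sliding band of rows
            W = window.shape[1]
            out = np.empty(W - K + 1, dtype=np.float32)
            for x in range(W - K + 1):
                out[x] = np.sum(window[:, x:x + K] * weights)
            yield out                        # streamed to the next layer

# With binary weights (+1/-1), the products above degenerate into
# sign-controlled additions, which is what lets the model live in BRAM.
```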


Journal ArticleDOI
Florian Zaruba1, Luca Benini1
TL;DR: A thorough power, performance, and efficiency analysis of the RISC-V ISA targeting baseline “application class” functionality, i.e., supporting the Linux OS and its application environment based on the authors' open-source single-issue in-order implementation of the 64-bit ISA variant (RV64GC) called Ariane.
Abstract: The open-source RISC-V instruction set architecture (ISA) is gaining traction, both in industry and academia. The ISA is designed to scale from microcontrollers to server-class processors. Furthermore, openness promotes the availability of various open-source and commercial implementations. Our main contribution in this paper is a thorough power, performance, and efficiency analysis of the RISC-V ISA targeting baseline “application class” functionality, i.e., supporting the Linux OS and its application environment, based on our open-source single-issue in-order implementation of the 64-bit ISA variant (RV64GC) called Ariane. Our analysis is based on a detailed power and efficiency analysis of the RISC-V ISA extracted from silicon measurements and calibrated simulation of an Ariane instance (RV64IMC) taped-out in GlobalFoundries 22FDX technology. Ariane runs at up to 1.7 GHz and achieves up to 40 Gop/s/W energy efficiency, which is superior to similar cores presented in the literature. We provide insight into the interplay between functionality required for application-class execution (e.g., virtual memory, caches, and multiple modes of privileged operation) and energy cost. We also compare Ariane with RISCY, a simpler and slower microcontroller-class core. Our analysis confirms that supporting application-class execution implies a nonnegligible energy-efficiency loss and that compute performance is more cost-effectively boosted by instruction extensions (e.g., packed SIMD) than by high-frequency operation.

195 citations


Journal ArticleDOI
TL;DR: The experimental results show that the distributed authentication can be processed by individual vehicles within 1 ms, which meets the real-time requirement and is much more efficient, in terms of the processing time and storage requirement, than existing approaches.
Abstract: Privacy-preserving authentication is considered the first line of defense against attacks, in addition to preserving the identity privacy of the vehicles in vehicular ad hoc networks (VANETs). However, the existing authentication schemes suffer from drawbacks such as nontransparency of the trusted authorities (TAs), heavy workload to revoke certificates, and high computation overhead to authenticate identities and messages. In this paper, we propose a blockchain-based privacy-preserving authentication (BPPA) scheme for VANETs. In BPPA, all the certificates and transactions are recorded permanently and immutably in the blockchain to make the activities of the semi-TAs transparent and verifiable. However, how to use such a blockchain effectively for authentication in real driving scenarios (e.g., high speed or a large number of messages during congestion) remains a challenge. With a novel data structure named the Merkle Patricia tree (MPT), we extend the conventional blockchain structure to provide a distributed authentication scheme without the revocation list. To achieve conditional privacy, we allow a vehicle to use multiple certificates. The linkability between the certificates and the real identity is encrypted and stored in the blockchain and can only be revealed in case of disputes. We evaluate the validity and performance of BPPA on the Hyperledger Fabric (HLF) platform for each entity. The experimental results show that the distributed authentication can be processed by individual vehicles within 1 ms, which meets the real-time requirement and is much more efficient, in terms of processing time and storage requirement, than existing approaches.

135 citations
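The revocation-list-free check rests on Merkle inclusion proofs. Below is a brief sketch of the verification step, simplified to a plain binary Merkle tree (the paper's scheme uses a Merkle Patricia tree); all helper names are hypothetical.

```python
# Simplified sketch: a vehicle verifies that a certificate is recorded in a
# block via a Merkle inclusion proof instead of consulting a revocation list.
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def verify_inclusion(cert: bytes, proof, root: bytes) -> bool:
    """proof: list of (sibling_hash, sibling_is_left) pairs, leaf to root."""
    node = h(cert)
    for sibling, is_left in proof:
        node = h(sibling + node) if is_left else h(node + sibling)
    return node == root   # compare against the root in the block header
```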


Journal ArticleDOI
TL;DR: A novel Toeplitz matrix–vector product (TMVP)-based decomposition strategy is employed to derive an efficient subquadratic space-complexity systolic multiplier, which has a lower area-delay product (ADP) than the existing ones.
Abstract: Systolic finite field multiplier over $GF(2^{m})$ , because of its superior features such as high throughput and regularity, is highly desirable for many demanding cryptosystems. On the other side, however, obtaining a high-performance systolic multiplier with relatively low hardware cost is still a challenging task, since the systolic structure usually involves large area complexity. Based on this consideration, in this paper, we propose to carry out two novel coherent interdependent efforts. First, a new digit-serial multiplication algorithm based on polynomial basis over the binary field $GF(2^{m})$ is proposed. A novel Toeplitz matrix–vector product (TMVP)-based decomposition strategy is employed to derive an efficient subquadratic space complexity. Second, the proposed algorithm is then innovatively mapped into a low-complexity systolic multiplier, which involves lower area-time complexities than the existing ones. A series of resource optimization techniques has also been applied to the multiplier to optimize the proposed design further (to the best of our knowledge, this is the first report on a digit-serial systolic multiplier based on the TMVP approach covering all irreducible polynomials). The complexity analysis and comparison that follow confirm the efficiency of the proposed multiplier, that is, it has a lower area-delay product (ADP) than the existing ones. The extension of the proposed multiplier for bit-parallel implementation is also considered in this paper.

119 citations
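The subquadratic space complexity stems from the classic two-way TMVP split, which trades one size-2n product for three size-n products. A small Python self-check over GF(2), where addition is XOR (illustrative only; the paper maps the recursion onto a systolic array):

```python
# Two-way TMVP split over GF(2): T = [[T1, T0], [T2, T1]], V = [V0; V1]
# costs three half-size TMVPs instead of four. In the real algorithm the
# blocks are themselves Toeplitz, so the split recurses.
import numpy as np

def tmvp_gf2(T, V):
    """Reference Toeplitz matrix-vector product over GF(2)."""
    return (T @ V) % 2

def tmvp2_gf2(T1, T0, T2, V0, V1):
    P1 = tmvp_gf2(T1, V0 ^ V1)
    P2 = tmvp_gf2(T0 ^ T1, V1)
    P3 = tmvp_gf2(T1 ^ T2, V0)
    return np.concatenate([P1 ^ P2, P1 ^ P3])  # [T1V0+T0V1 ; T2V0+T1V1]

rng = np.random.default_rng(1)
n = 4
T1, T0, T2 = rng.integers(0, 2, (3, n, n))
V0, V1 = rng.integers(0, 2, (2, n))
full = np.block([[T1, T0], [T2, T1]])
assert np.array_equal(tmvp2_gf2(T1, T0, T2, V0, V1),
                      tmvp_gf2(full, np.concatenate([V0, V1])))
```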


Journal ArticleDOI
TL;DR: An optimized block-floating-point (BFP) arithmetic is adopted in the accelerator for efficient inference of deep neural networks, improving energy and hardware efficiency by three times.
Abstract: Convolutional neural networks (CNNs) are widely used and have achieved great success in computer vision and speech processing applications. However, deploying the large-scale CNN model in the embedded system is subject to the constraints of computation and memory. In this paper, an optimized block-floating-point (BFP) arithmetic is adopted in our accelerator for efficient inference of deep neural networks. The feature maps and model parameters are represented in 16-bit and 8-bit formats, respectively, in the off-chip memory, which can reduce memory and off-chip bandwidth requirements by 50% and 75% compared to the 32-bit FP counterpart. The proposed 8-bit BFP arithmetic with optimized rounding and shifting-operation-based quantization schemes improves the energy and hardware efficiency by three times. One CNN model can be deployed in our accelerator without retraining at the cost of an accuracy loss of not more than 0.12%. The proposed reconfigurable accelerator with three parallelism dimensions, ping-pong off-chip DDR3 memory access, and an optimized on-chip buffer group is implemented on the Xilinx VC709 evaluation board. Our accelerator achieves a performance of 760.83 GOP/s and 82.88 GOP/s/W under a 200-MHz working frequency, significantly outperforming previous accelerators.

116 citations
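As a rough illustration of BFP quantization (the bit widths, rounding, and helper names below are assumptions for illustration, not the paper's exact scheme): each block of values shares one exponent derived from the block's largest magnitude, and every value keeps only a short mantissa.

```python
# Illustrative block-floating-point (BFP) quantization/dequantization.
import numpy as np

def bfp_quantize(block, mant_bits=8):
    max_mag = np.max(np.abs(block))
    if max_mag == 0:
        return np.zeros_like(block, dtype=np.int32), 0
    shared_exp = int(np.floor(np.log2(max_mag))) + 1  # one exponent per block
    scale = 2.0 ** (shared_exp - (mant_bits - 1))     # weight of one LSB
    lo, hi = -(1 << (mant_bits - 1)), (1 << (mant_bits - 1)) - 1
    mant = np.clip(np.round(block / scale), lo, hi)   # short 2's-c mantissas
    return mant.astype(np.int32), shared_exp

def bfp_dequantize(mant, shared_exp, mant_bits=8):
    return mant * 2.0 ** (shared_exp - (mant_bits - 1))

x = np.array([0.91, -0.43, 0.07, 0.5])
m, e = bfp_quantize(x)
print(bfp_dequantize(m, e))   # close to x, using one shared exponent
```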


Journal ArticleDOI
TL;DR: The proposed approximate multiplier has an almost Gaussian error distribution with a near-zero mean value and is exploited in the structure of a JPEG encoder, sharpening, and classification applications, indicating that the quality degradation of the output is negligible.
Abstract: A scalable approximate multiplier, called truncation- and rounding-based scalable approximate multiplier (TOSAM) is presented, which reduces the number of partial products by truncating each of the input operands based on their leading one-bit position. In the proposed design, multiplication is performed by shift, add, and small fixed-width multiplication operations resulting in large improvements in the energy consumption and area occupation compared to those of the exact multiplier. To improve the total accuracy, input operands of the multiplication part are rounded to the nearest odd number. Because input operands are truncated based on their leading one-bit positions, the accuracy becomes weakly dependent on the width of the input operands and the multiplier becomes scalable. Higher improvements in design parameters (e.g., area and energy consumption) can be achieved as the input operand widths increase. To evaluate the efficiency of the proposed approximate multiplier, its design parameters are compared with those of an exact multiplier and some other recently proposed approximate multipliers. Results reveal that the proposed approximate multiplier with a mean absolute relative error in the range of 11%–0.3% improves delay, area, and energy consumption up to 41%, 90%, and 98%, respectively, compared to those of the exact multiplier. It also outperforms other approximate multipliers in terms of speed, area, and energy consumption. The proposed approximate multiplier has an almost Gaussian error distribution with a near-zero mean value. We exploit it in the structure of a JPEG encoder, sharpening, and classification applications. The results indicate that the quality degradation of the output is negligible. In addition, we suggest an accuracy configurable TOSAM where the energy consumption of the multiplication operation can be adjusted based on the minimum required accuracy.

99 citations
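A behavioral sketch of the TOSAM idea for unsigned operands follows; the fragment width h and the post-truncation round-to-odd step are simplifications of the paper's scheme, and signed handling is omitted.

```python
# Truncate each operand below its leading one, round the kept fragment to
# the nearest odd value, multiply the short fragments exactly, shift back.

def round_to_odd(x: int) -> int:
    return x if x & 1 else x + 1

def tosam_mul(a: int, b: int, h: int = 4) -> int:
    if a == 0 or b == 0:
        return 0
    sa = max(a.bit_length() - 1 - h, 0)   # truncation below the leading one
    sb = max(b.bit_length() - 1 - h, 0)
    ta, tb = round_to_odd(a >> sa), round_to_odd(b >> sb)
    return (ta * tb) << (sa + sb)          # small exact multiply + shift

a, b = 1000, 3000
approx = tosam_mul(a, b)
print(approx, a * b, abs(approx - a * b) / (a * b))  # small relative error
```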


Journal ArticleDOI
TL;DR: In this article, the authors show that the standard 8 transistor (8T) digital SRAM array can be configured as an analog-like in-memory multibit dot-product engine (DPE).
Abstract: Large-scale digital computing almost exclusively relies on the von Neumann architecture, which comprises separate units for storage and computations. The energy-expensive transfer of data from the memory units to the computing cores results in the well-known von Neumann bottleneck. Various approaches aimed toward bypassing the von Neumann bottleneck are being extensively explored in the literature. These include in-memory computing based on CMOS and beyond-CMOS technologies, wherein by making modifications to the memory array, vector computations can be carried out as close to the memory units as possible. Interestingly, in-memory techniques based on CMOS technology are of special importance due to the ubiquitous presence of field-effect transistors and the resultant ease of large-scale manufacturing and commercialization. On the other hand, perhaps the most important computation required for applications such as machine learning, etc., comprises the dot-product operation. Emerging nonvolatile memristive technologies have been shown to be very efficient in computing analog dot products in an in situ fashion. The memristive analog computation of the dot product results in much faster operation as opposed to digital vector in-memory bitwise Boolean computations. However, challenges with respect to large-scale manufacturing coupled with the limited endurance of memristors have hindered rapid commercialization of memristive-based computing solutions. In this paper, we show that the standard 8 transistor (8T) digital SRAM array can be configured as an analog-like in-memory multibit dot-product engine (DPE). By applying appropriate analog voltages to the read ports of the 8T SRAM array and sensing the output current, an approximate analog–digital DPE can be implemented. We present two different configurations for enabling multibit dot-product computations in the 8T SRAM cell array, without modifying the standard bit-cell structure. We also demonstrate the robustness of the present proposal in the presence of nonidealities such as the effect of line resistances and transistor threshold voltage variations. Since our proposal preserves the standard 8T-SRAM array structure, it can be used as a storage element with standard read–write instructions and also as an on-demand analog-like dot-product accelerator.

90 citations
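A behavioral view of the proposed DPE (conductance values, the variation model, and all names below are illustrative assumptions, not the paper's silicon data): stored bits set per-cell read-port conductances, inputs arrive as analog read voltages, and the summed bit-line current approximates the dot product.

```python
# Behavioral model: sensed column current ~ sum of conductance * voltage.
import numpy as np

rng = np.random.default_rng(0)
G_ON, G_OFF = 50e-6, 0.0      # assumed on/off read-port conductances (S)

def column_current(weight_bits, v_in, sigma_vt=0.05):
    g = np.where(weight_bits == 1, G_ON, G_OFF)
    g = g * (1 + sigma_vt * rng.standard_normal(g.shape))  # Vt-variation proxy
    return np.sum(g * v_in)   # summed bit-line current

w = np.array([1, 0, 1, 1, 0, 1, 0, 1])
v = np.array([0.3, 0.1, 0.2, 0.4, 0.0, 0.1, 0.3, 0.2])
print(column_current(w, v), G_ON * np.sum(w * v))  # noisy vs. ideal
```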


Journal ArticleDOI
TL;DR: The HSPICE simulation results show that the write speed and power consumption of the proposed RSP-14T are improved by ~65% and ~50%, respectively, compared with those of the radiation hardened design (RHD)-12T memory cell.
Abstract: In this paper, a novel radiation-hardened 14-transistor SRAM bitcell with optimized speed and power (RSP-14T) is proposed for space applications. Through circuit- and layout-level optimization in a 65-nm CMOS technology, the 3-D TCAD mixed-mode simulation results show that the novel structure provides increased resilience to single-event upset as well as single-event–multiple-node upsets due to the charge sharing among OFF-transistors. Moreover, the HSPICE simulation results show that the write speed and power consumption of the proposed RSP-14T are improved by ~65% and ~50%, respectively, compared with those of the radiation-hardened design (RHD)-12T memory cell.

87 citations


Journal ArticleDOI
TL;DR: This brief proposes using the 3-D vertical channel NAND array architecture to implement the vector–matrix multiplication (VMM) for the first time; based on array-level SPICE simulation, the bias condition, including the selector layer and the unselected layers, is optimized to achieve high computation accuracy.
Abstract: Three-dimensional NAND flash technology is one of the most competitive integrated solutions for high-volume massive data storage. So far, there are few investigations on how to use 3-D NAND flash for in-memory computing in the neural network accelerator. In this brief, we propose using the 3-D vertical channel NAND array architecture to implement the vector–matrix multiplication (VMM) for the first time. Based on the array-level SPICE simulation, the bias condition, including the selector layer and the unselected layers, is optimized to achieve high computation accuracy of the VMM. Since the VMM can be performed layer by layer in a 3-D NAND array, the read-out latency is largely improved compared to the conventional single-cell read-out operation. The impact of device-to-device variation on the computation accuracy is also analyzed.

79 citations


Journal ArticleDOI
TL;DR: Simulation results show that FeFET-based NV TCAMs offer lower area overhead than MTJ and CMOS equivalents, as well as better search energy-delay products (EDPs) than TCAM designs based on MTJ.
Abstract: Among the beyond-complementary metal–oxide–semiconductor (CMOS) devices being explored, ferroelectric field-effect transistors (FeFETs) are considered as one of the most promising. FeFETs are being studied by all major semiconductor manufacturers, and experimentally, FeFETs are making rapid progress. FeFETs also stand out with the unique hysteretic $I_{\text{ds}}$–$V_{\text{gs}}$ characteristic that allows a device to function as both a switch and a nonvolatile (NV) storage element. We exploit this FeFET property to build two categories of fine-grained logic-in-memory (LiM) circuits: 1) ternary content addressable memory (TCAM), which integrates efficient and compact logic/processing elements into various levels of the memory hierarchy, and 2) basic logic function units for constructing larger and more complex LiM circuits. Two writing schemes (with and without negative supply voltages, respectively) for FeFETs are introduced in our LiM designs. The resulting designs are compared with existing LiM approaches based on CMOS, magnetic tunnel junctions (MTJs), resistive random access memories (ReRAMs), ferroelectric tunnel junctions (FTJs), etc., that afford the same circuit-level functionality. Simulation results show that FeFET-based NV TCAMs offer lower area overhead than MTJ (79% less) and CMOS (42% less) equivalents, as well as better search energy-delay products (EDPs) than TCAM designs based on MTJ ( $149\times $ ), ReRAM ( $1.7\times $ ), and CMOS ( $1.3\times $ ) in array evaluations. NV FeFET-based LiM basic circuit blocks are also more efficient than functional equivalents based on MTJs in terms of propagation delay ( $4.2\times $ ) and dynamic power ( $2.5\times $ ). A case study for an FeFET-based LiM accumulator further demonstrates that by employing the FeFET as both a switch and an NV storage element, the FeFET-based accumulator can save area (36%) and power consumption (40%) when compared with a conventional CMOS accumulator with the same structure.

74 citations


Journal ArticleDOI
TL;DR: An optimized schoolbook polynomial multiplication (SPM) for compact LBC is proposed, exploiting the symmetric nature of Gaussian noise for bit reduction and achieving high hardware efficiency with reduced hardware area costs.
Abstract: Lattice-based cryptography (LBC) is one of the most promising classes of post-quantum cryptography (PQC) that is being considered for standardization. This brief proposes an optimized schoolbook polynomial multiplication (SPM) for compact LBC. We exploit the symmetric nature of Gaussian noise for bit reduction. Additionally, a single field-programmable gate array (FPGA) DSP block is used for two parallel multiplication operations per clock cycle. These optimizations enable a significant $2.2\times $ speedup along with reduced resources for dimension $n=256$ . The overall efficiency (throughput per slice) is $1.28\times $ higher than the conventional SPM, as well as contributing to a more compact LBC system compared to previously reported designs. The results targeting the FPGA platform show that the proposed design can achieve high hardware efficiency with reduced hardware area costs.
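For reference, schoolbook polynomial multiplication in the ring $Z_q[x]/(x^n+1)$ (the typical ring-LWE setting) has the loop structure below; the parameters are toy values, and the paper's DSP packing and Gaussian-noise bit reduction are hardware optimizations layered on top of this structure.

```python
# Reference schoolbook polynomial multiplication (SPM) mod (x^n + 1, q).

def spm(a, b, q, n):
    c = [0] * n
    for i in range(n):
        for j in range(n):
            k = i + j
            if k < n:
                c[k] = (c[k] + a[i] * b[j]) % q          # x^k term
            else:
                c[k - n] = (c[k - n] - a[i] * b[j]) % q  # x^n = -1 wraps
    return c

q, n = 12289, 8                  # toy sizes; the paper uses n = 256
print(spm([1, 2] + [0] * 6, [0, 1] + [0] * 6, q, n))  # (1+2x)*x = x + 2x^2
```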

Journal ArticleDOI
TL;DR: A unified architecture named UniWiG is proposed, where both Winograd-based convolution and GEMM can be accelerated using the same set of processing elements, which leads to efficient utilization of FPGA hardware resources while computing all layers in the CNN.
Abstract: Deep neural networks have revolutionized a variety of applications in varying domains like autonomous vehicles, weather forecasting, cancer detection, surveillance, traffic management, and so on. The convolutional neural network (CNN) is the state-of-the-art technique for many machine learning tasks in the image and video processing domains. Deployment of CNNs on embedded systems with lower processing power and a smaller power budget is a challenging task. Recent studies have shown the effectiveness of the field-programmable gate array (FPGA) as a hardware accelerator for CNNs that can deliver high performance at low power budgets. The majority of computations in CNNs involve 2-D convolution. The Winograd minimal filtering-based algorithm is the most efficient technique for calculating convolution for smaller filter sizes. CNNs also consist of fully connected layers that are computed using general matrix multiplication (GEMM). In this article, we propose a unified architecture named UniWiG, where both Winograd-based convolution and GEMM can be accelerated using the same set of processing elements. This approach leads to efficient utilization of FPGA hardware resources while computing all layers in the CNN. The proposed architecture shows performance improvement in the range of $1.4\times $ to $4.02\times $ with only 13% additional FPGA resources with respect to the baseline GEMM-based architecture. We have mapped popular CNN models like AlexNet and VGG-16 onto the proposed accelerator, and the measured performance compares favorably with other state-of-the-art implementations. We have also analyzed the vulnerability of the accelerator to side-channel attacks. Preliminary investigations show that the UniWiG architecture is more robust to memory side-channel attacks than direct convolution-based techniques.
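The Winograd side of the unified datapath builds on the minimal-filtering transform. A worked 1-D instance, F(2,3), produces two outputs of a 3-tap convolution with four multiplications instead of six (the standard Lavin–Gray transforms; the paper's 2-D case nests this construction):

```python
# Winograd F(2,3): 2 outputs of a 3-tap filter using 4 multiplications.

def winograd_f23(d, g):
    """d: 4 input samples, g: 3 filter taps -> 2 convolution outputs."""
    m1 = (d[0] - d[2]) * g[0]
    m2 = (d[1] + d[2]) * (g[0] + g[1] + g[2]) / 2
    m3 = (d[2] - d[1]) * (g[0] - g[1] + g[2]) / 2
    m4 = (d[1] - d[3]) * g[2]
    return [m1 + m2 + m3, m2 - m3 - m4]

d, g = [1.0, 2.0, 3.0, 4.0], [0.5, 0.25, 0.125]
direct = [sum(d[i + k] * g[k] for k in range(3)) for i in range(2)]
print(winograd_f23(d, g), direct)   # identical: [1.375, 2.25]
```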

Journal ArticleDOI
TL;DR: A novel design for a 1-bit arithmetic logic unit based on silicon nanowire reconfigurable FETs with area, normalized circuit delay, and activity gains of 30%, 34%, and 36%, respectively, as compared with the contemporary CMOS version.
Abstract: An early evaluation in terms of circuit design is essential in order to assess the feasibility and practicability aspects of emerging nanotechnologies. Reconfigurable nanotechnologies, such as silicon or germanium nanowire-based reconfigurable field-effect transistors, hold great promise as suitable primitives for enabling multiple functionalities per computational unit. However, contemporary CMOS circuit designs, when applied directly to this emerging nanotechnology, often result in suboptimal designs. For example, 31% and 71% larger areas were obtained for our two exemplary designs. Hence, new approaches delivering tailored circuit designs are needed to truly tap the exciting feature set of these reconfigurable nanotechnologies. To this effect, we propose six functionally enhanced logic gates based on a reconfigurable nanowire technology and employ these logic gates in efficient circuit designs. We carry out a detailed comparative study for a reconfigurable multifunctional circuit, which shows better normalized circuit delay (20.14%), area (32.40%), and activity as the power metric (40%) while exhibiting similar functionality as compared with the CMOS reference design. We further propose a novel design for a 1-bit arithmetic logic unit based on silicon nanowire reconfigurable FETs with area, normalized circuit delay, and activity gains of 30%, 34%, and 36%, respectively, as compared with the contemporary CMOS version.

Journal ArticleDOI
TL;DR: A self-optimizing and self-programming computing system (SOSPCS) design framework is pioneered that achieves both programmability and flexibility and exploits computing heterogeneity; experiments conclude that SOSPCS provides performance improvement and energy reduction compared to state-of-the-art approaches.
Abstract: There exists an urgent need for determining the right amount and type of specialization while making a heterogeneous system as programmable and flexible as possible. Therefore, in this paper, we pioneer a self-optimizing and self-programming computing system (SOSPCS) design framework that achieves both programmability and flexibility and exploits computing heterogeneity [e.g., CPUs, GPUs, and hardware accelerators (HWAs)]. First, at compile time, we form a task pool consisting of hybrid tasks with different processing element (PE) affinities according to target applications. Tasks preferred to be executed on GPUs or accelerators are detected from target applications by neural networks. Tasks suitable to run on CPUs are formed by community detection to minimize data movement overhead. Next, a distributed reinforcement learning-based approach is used at runtime to allow agents to map the tasks onto the network-on-chip-based heterogeneous PEs by learning an optimal policy based on $Q$ values in the environment. We have conducted experiments on a heterogeneous platform consisting of CPUs, GPUs, and HWAs with deep learning algorithms such as matrix multiplication, ReLU, and sigmoid functions. We concluded that SOSPCS provides performance improvement up to $4.12\times $ and energy reduction up to $3.24\times $ compared to the state-of-the-art approaches.
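A minimal tabular Q-learning sketch of the runtime mapping step follows; the state, action, and reward encodings are simplified placeholders for the paper's formulation.

```python
# Agents pick a processing element (PE) per ready task and update Q-values
# from an observed reward (e.g., negative execution time plus energy terms).
import random
from collections import defaultdict

ALPHA, GAMMA, EPS = 0.1, 0.9, 0.1
PES = ["cpu0", "gpu0", "hwa0"]
Q = defaultdict(float)                       # Q[(state, pe)] -> value

def choose_pe(state):
    if random.random() < EPS:
        return random.choice(PES)                        # explore
    return max(PES, key=lambda pe: Q[(state, pe)])       # exploit

def update(state, pe, reward, next_state):
    best_next = max(Q[(next_state, p)] for p in PES)
    Q[(state, pe)] += ALPHA * (reward + GAMMA * best_next - Q[(state, pe)])

s = "matmul|queue_hi"                        # hypothetical state encoding
pe = choose_pe(s)
update(s, pe, reward=-1.3, next_state="relu|queue_hi")
```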

Journal ArticleDOI
TL;DR: It is shown that randomness is not a requirement for this computational paradigm, and two methods for maintaining constant bit-stream lengths via approximations, based on low-discrepancy sequences, are discussed.
Abstract: Stochastic logic performs computation on data represented by random bit-streams. The representation allows complex arithmetic to be performed with very simple logic, but it suffers from high latency and poor precision. Furthermore, the results are always somewhat inaccurate due to random fluctuations. In this paper, we show that randomness is not a requirement for this computational paradigm. If properly structured, the same arithmetical constructs can operate on deterministic bit-streams, with the data represented uniformly by the fraction of 1’s versus 0’s. This paper presents three approaches for the computation: relatively prime stream lengths, rotation, and clock division. Unlike stochastic methods, all three of our deterministic methods produce completely accurate results. The cost of generating the deterministic streams is a small fraction of the cost of generating streams from random/pseudorandom sources. Most importantly, the latency is reduced by a factor of $({1}/{2^{n}})$ , where $n$ is the equivalent number of bits of precision. When computing in unary, the bit-stream length increases with each level of logic. This is an inevitable consequence of the representation, but it can result in unmanageable bit-stream lengths. We discuss two methods for maintaining constant bit-stream lengths via approximations, based on low-discrepancy sequences. These methods provide the best accuracy and area $\times $ delay product. They are fast-converging and therefore offer progressive precision.
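A sketch of the relatively-prime-lengths method: two unary streams with coprime lengths are fed through an AND gate; by the Chinese remainder theorem, every bit pairing occurs exactly once over the product of the lengths, so the output density is the exact product of the input densities.

```python
# Deterministic bit-stream multiplication with coprime stream lengths.
from math import gcd

def unary_stream(ones, length):
    return [1 if i < ones else 0 for i in range(length)]

def det_multiply(a_ones, a_len, b_ones, b_len):
    assert gcd(a_len, b_len) == 1        # coprime lengths are the trick
    a, b = unary_stream(a_ones, a_len), unary_stream(b_ones, b_len)
    total = a_len * b_len
    out = [a[i % a_len] & b[i % b_len] for i in range(total)]   # AND gate
    return sum(out) / total

print(det_multiply(2, 3, 3, 5))   # (2/3)*(3/5) = 0.4, exactly
```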

Journal ArticleDOI
TL;DR: A floating memristor model with minimum metal–oxide–semiconductor field-effect transistor count is reported, along with applications of the MOS-memristor model that comprise an Op-Amp-based Schmitt trigger circuit, a high-frequency modulation scheme, and an associative learning process.
Abstract: This research paper reports on a floating memristor model with minimum metal–oxide–semiconductor field-effect transistor count. The proposed structure uses only three nMOS transistors with a constant current bias and a single external capacitor. It offers less design complexity as compared to other existing memristor designs. The heart of the proposed design incorporates a MOS-based feedback circuit as an electronically controlled element for the memristance value. The memristor model has been designed with 0.18- $\mu \text{m}$ Taiwan Semiconductor Manufacturing Company Ltd. (TSMC) CMOS parameters. The pinched hysteresis loop of the memristor for different frequency ranges and its composite characteristics are well analyzed using PSPICE simulation. The operating frequency of the reported memristor extends up to the few-megahertz range. In addition, applications of the proposed MOS-memristor model are described, comprising an Op-Amp-based Schmitt trigger circuit, a high-frequency modulation scheme, and an associative learning process. Finally, the postlayout simulation and experimental results are presented to validate the workability of the proposed memristor model.

Journal ArticleDOI
TL;DR: In this paper, the impacts of the simultaneous switching noise (SSN) in carbon nanotube field-effect transistor-based ternary circuits are investigated; the results indicate that MWCNT bundle power interconnects reduce the SSN-induced delay at the output of the tenth stage for interconnects with 200- $\mu \text{m}$ length.
Abstract: In this paper, the impacts of the simultaneous switching noise (SSN) in carbon nanotube field-effect transistor-based ternary circuits are investigated. These effects, including the peak noise on the $V_{\mathrm {DD}}$ and ground rails and the SSN-induced delay and output noise, are compared between traditional Cu and multiwall carbon nanotube bundle power interconnects in ternary circuits. Simulations are performed using HSPICE for global power interconnects at the 14- and 7-nm technology nodes. The results indicate that for interconnects with 200- $\mu \text{m}$ length, the peak SSN voltage on the $V_{\mathrm {DD}}$ and ground rails for a power distribution network, including ten ternary buffers, using multi-walled carbon nanotube (MWCNT) bundle power interconnects is 53% and 40% lower, respectively, compared to Cu power interconnects in the last stage at the 14-nm node. Also, with scaling down the technology to 7 nm, these improvements increase to 60% and 59%, respectively. Moreover, MWCNT bundle power interconnects reduce the SSN-induced delay at the output of the tenth stage for interconnects with 200- $\mu \text{m}$ length on average by 82% as compared to the Cu interconnects at the 14-nm node. This improvement is 73% for the 7-nm technology node.

Journal ArticleDOI
TL;DR: This work presents profiling-based cross-device power SCA attacks using deep-learning techniques on 8-bit AVR microcontroller devices running AES-128; results show that the designed MLP with PCA-based preprocessing outperforms a convolutional neural network with four-device training by ~20% in terms of the average test accuracy.
Abstract: Power side-channel analysis (SCA) has been of immense interest to most embedded designers to evaluate the physical security of the system. This work presents profiling-based cross-device power SCA attacks using deep-learning techniques on 8-bit AVR microcontroller devices running AES-128. First, we show the practical issues that arise in these profiling-based cross-device attacks due to significant device-to-device variations. Second, we show that utilizing principal component analysis (PCA)-based preprocessing and multidevice training, a multilayer perceptron (MLP)-based 256-class classifier can achieve an average accuracy of 99.43% in recovering the first keybyte from all the 30 devices in our data set, even in the presence of significant interdevice variations. Results show that the designed MLP with PCA-based preprocessing outperforms a convolutional neural network (CNN) with four-device training by ~20% in terms of the average test accuracy of cross-device attack for the aligned traces captured using the ChipWhisperer hardware. Finally, to extend the practicality of these cross-device attacks, another preprocessing step, namely, dynamic time warping (DTW) has been utilized to remove any misalignment among the traces, before performing PCA. DTW along with PCA followed by the 256-class MLP classifier provides ≥10.97% higher accuracy than the CNN-based approach for cross-device attack even in the presence of up to 50 time-sample misalignments between the traces.
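A sketch of the profiling pipeline using scikit-learn (shapes, hyperparameters, and the placeholder data are assumptions, not the paper's settings; real traces and key-byte labels would come from the profiling devices):

```python
# PCA preprocessing followed by a 256-class MLP key-byte classifier.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline

X = np.random.randn(2000, 500)             # placeholder power traces
y = np.random.randint(0, 256, size=2000)   # placeholder key-byte labels

model = make_pipeline(
    PCA(n_components=50),                                # preprocessing
    MLPClassifier(hidden_layer_sizes=(200, 200), max_iter=50),
)
model.fit(X, y)                            # train on profiling devices
print(model.predict(X[:1]))                # attack a trace from a new device
```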

Journal ArticleDOI
TL;DR: The proposed SRAM cell is well suited for bit-interleaving architecture, which helps to improve the soft-error immunity with error correction coding; the read static noise margin (RSNM) and the write margin (WM) are significantly improved due to its built-in write/read-assist scheme.
Abstract: This paper presents a half-select disturb-free 11T static random access memory (SRAM) cell for ultralow-voltage operations. The proposed SRAM cell is well suited for bit-interleaving architecture, which helps to improve the soft-error immunity with error correction coding. The read static noise margin (RSNM) and the write margin (WM) are significantly improved due to its built-in write/read-assist scheme. The experimental results in a 40-nm standard CMOS technology indicate that, at a 0.5-V supply voltage, the RSNM of the proposed SRAM cell is $19.8\times $ and $0.96\times $ that of min-area 6T and 8T SRAM cells, respectively. It achieves $11.84\times $ and $9.56\times $ higher WM, correspondingly. As a result, a lower minimum operation voltage is obtained. In addition, its leakage power consumption is reduced by 53.3% and 44.5% when compared with min-area 6T and 8T SRAM cells, respectively.

Journal ArticleDOI
TL;DR: This paper proposes a novel methodology of per-device PUF configuration and a new PUF variant derived from the popular FPGA-specific Anderson PUF, which has several advantages over existing work, including the Anderson PUF on which it is based.
Abstract: Reconfigurable systems often require secret keys to encrypt and decrypt data. Applications requiring high security commonly generate keys based on physical unclonable functions (PUFs), circuits that use random manufacturing variations to produce secret keys that are unique to each device. Implementing PUFs on field-programmable gate arrays (FPGAs) is usually difficult, because the designer has limited control over layout, and each PUF system requires a large area overhead to correct errors in the PUF response bits. In this paper, we extend the state of the art for FPGA-based weak PUFs using a novel methodology of per-device configuration and a new PUF variant derived from the popular FPGA-specific Anderson PUF. The PUF is evaluated using Xilinx XC7Z020 programmable systems-on-chip from the Zynq-7000 family on Zynq ZedBoard platforms. The design we propose has several advantages over existing work, including the Anderson PUF on which it is based. Our design is tunable to minimize the response bias and can be implemented using the common SLICEL components on Xilinx FPGAs. Moreover, the proposed PUF design enables an efficient per-device configuration that reduces the bit error rate by over $10\times $ at room temperature and improves response stability by over $2\times $ across all temperatures. We demonstrate that the proposed per-device PUF configuration step leads to roughly $2\times $ savings in area resources for PUFs and error correction as used in key generation.

Journal ArticleDOI
TL;DR: Results show that backscattering-based detection outperforms the EM side channel, confirm that dormant HTs are much more difficult to detect than HTs that have been activated, and show how detection is affected by changing the HT’s size and physical location on the IC.
Abstract: This paper describes a new physical side channel, i.e., the backscattering side channel, created by transmitting a signal toward the integrated circuit (IC), where the internal impedance changes caused by on-chip switching activity modulate the signal that is backscattered (reflected) from the IC. To demonstrate how this new side channel can be used to detect small changes in circuit impedances, we propose a new method for nondestructively detecting hardware Trojans (HTs) from outside the chip. We experimentally confirm, using measurements on one physical instance for training and nine other physical instances for testing, that the new side channel, when combined with an HT detection method, allows detection of a dormant HT in 100% of the HT-afflicted measurements for a number of different HTs while producing no false positives in HT-free measurements. Furthermore, additional experiments are conducted to compare the backscattering-based detection to one that uses the traditional EM-emanation-based side channel. These results show that backscattering-based detection outperforms the EM side channel, confirm that dormant HTs are much more difficult to detect than HTs that have been activated, and show how detection is affected by changing the HT’s size and physical location on the IC.

Journal ArticleDOI
TL;DR: Evaluation results show beyond-FinFET comparison speed and enhanced linearity for the proposed NCFET-based clocked comparator and VTC, respectively, and improvement is achieved by exploiting the steeper slope and increased output impedance ofNCFETs.
Abstract: Negative-capacitance FETs (NCFETs) are a promising candidate for low-power circuits with intrinsic features, e.g., the steep switching slope. Prior works have shown potential for enabling low-power digital logic and memory design with NCFETs. Yet, it is still not quite clear how to harness these new features of NCFETs for analog functionalities. This article provides more insights into the circuit design space with the new device characteristics and investigates its deployment in analog circuits, specifically, time-domain analog-to-digital converters (ADCs) and phase-locked loops (PLLs). We propose and optimize a novel digital-based clocked comparator and a capacitor-based voltage-to-time converter (VTC), which are essential building blocks in ADCs and PLLs. Evaluation results show beyond-FinFET comparison speed and enhanced linearity for the proposed NCFET-based clocked comparator and VTC, respectively. Such improvement is achieved by exploiting the steeper slope and increased output impedance of NCFETs. Further design details and a discussion are provided in this article.

Journal ArticleDOI
TL;DR: An efficient analysis and modeling technique is proposed that enables designers to assess the timing behavior of hybrid full adder circuits at the block level and anticipate their performance in multistage circuits.
Abstract: One of the critical issues in the advancement of very large scale integration circuit design is the estimation of the timing behavior of arithmetic circuits. The concept of logical effort provides a proficient approach to comprehend and assess the timing behavior of circuits with the conventional CMOS (C-CMOS) structure. However, this technique does not work for circuits with a hybrid structure. On the other hand, numerous circuits with the hybrid structure, which are faster and consume less power than C-CMOS ones, have been proposed for different applications such as portable and IoT devices. In this regard, a simple and efficient timing-behavior method like conventional logical effort is indispensable for the analysis of hybrid adder circuits. This paper proposes an efficient analysis and modeling technique that enables designers to assess the timing behavior of hybrid full adder circuits at the block level and anticipate their performance in multistage circuits. The gain and selection factor are introduced as criteria for accurate selection and optimization of hybrid adder cells, measurable on a single test bench, for managing the energy-efficiency and performance tradeoff. The proposed method is investigated using 32-nm CMOS and FinFET technologies.
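For reference, the conventional logical-effort model that this paper generalizes expresses stage delay as effort delay plus parasitic delay; the optimal N-stage path delay follows from equalizing stage efforts (Sutherland's standard formulation, with $g$, $h$, $p$ the stage logical effort, electrical effort, and parasitic delay, and $G$, $B$, $H$, $P$ their path-level counterparts):

```latex
d = g\,h + p,
\qquad
\hat{D} = N\,(G B H)^{1/N} + P .
```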

Journal ArticleDOI
TL;DR: Analysis of the weighted sum and weight update functions in one-selector–one-RRAM (1S–1R)-based crossbar arrays indicates that different selectors suited to each operation mode (inference or training) are preferred in the neuromorphic computing system.
Abstract: The impact of selector devices on the inference and training accuracy of a resistive random access memory (RRAM)-based neuromorphic computing system is rarely studied. In this paper, we analyze the weighted sum and weight update functions in one-selector–one-RRAM (1S–1R)-based crossbar arrays. We first develop a Verilog-A model based on the lateral evolution of the filament to describe analog conductance tuning in the filamentary RRAM. We then perform an array-level SPICE simulation on the 1S–1R arrays, where the exponential and threshold selectors are employed. In the inference stage, the read-out current is vulnerable to the inevitable IR drop caused by the wire resistance. Our finding reveals that the use of a threshold selector allows the 1S–1R device to have a linear I–V relation, improving the immunity to the IR drop. On the other hand, the threshold selector distorts the analog RRAM’s linear weight update during the training stage. Instead, an introduction of the exponential selector enables the desirable properties of the analog RRAM to be maintained even in the 1S–1R device. These results indicate that different selectors suited to each operation mode (inference or training) are preferred in the neuromorphic computing system.

Journal ArticleDOI
TL;DR: Efficient approaches are presented for constructing an Oracle policy to optimize different objective functions, such as energy and performance per Watt; the Oracle policies enable the design of low-overhead power management policies that achieve near-optimal performance matching the Oracle.
Abstract: The complexity of heterogeneous mobile platforms is growing at a rate faster than our ability to manage them optimally at runtime. For example, state-of-the-art systems-on-chip (SoCs) enable controlling the type (Big/Little), number, and frequency of active cores. Managing these platforms becomes challenging with the increase in the type, number, and supported frequency levels of the cores. However, existing solutions used in mobile platforms still rely on simple heuristics based on the utilization of cores. This paper presents a novel and practical imitation learning (IL) framework for dynamically controlling the type (Big/Little), number, and frequencies of active cores in heterogeneous mobile processors. We present efficient approaches for constructing an Oracle policy to optimize different objective functions, such as energy and performance per Watt (PPW). The Oracle policies enable us to design low-overhead power management policies that achieve near-optimal performance matching the Oracle. Experiments on a commercial platform with 19 benchmarks show an average 101% PPW improvement compared to the default interactive governor.
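A behavioral-cloning sketch of the IL framework's core loop follows; the features, action encoding, and classifier choice are illustrative assumptions rather than the paper's exact design.

```python
# Offline: an Oracle labels each runtime state with the best configuration
# (core type, count, frequency). Online: a cheap supervised policy imitates it.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

states = np.random.rand(5000, 4)                 # e.g., [util, mem, power, temp]
oracle_actions = np.random.randint(0, 12, 5000)  # placeholder Oracle labels
# (12 = e.g., 2 core types x 2 core counts x 3 frequency levels)

policy = DecisionTreeClassifier(max_depth=6)     # low runtime overhead
policy.fit(states, oracle_actions)               # imitate the Oracle

print(policy.predict(states[:1]))                # per-interval decision
```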

Journal ArticleDOI
TL;DR: An NVM-based CIM architecture employing a Preset-XNOR operation in/with the spin–orbit torque magnetic random access memory (SOT-MRAM) to accelerate the computation of BNNs (PXNOR-BNN) is proposed.
Abstract: Convolutional neural networks (CNNs) have demonstrated superior capability in computer vision, speech recognition, autonomous driving, and so forth, which are opening up an artificial intelligence (AI) era. However, conventional CNNs require significant matrix computation and memory usage, leading to power and memory issues for mobile deployment and embedded chips. On the algorithm side, the emerging binary neural networks (BNNs) promise portable intelligence by replacing the costly massive floating-point compute-and-accumulate operations with lightweight bit-wise XNOR and popcount operations. On the hardware side, computing-in-memory (CIM) architectures built on non-volatile memory (NVM) present outstanding performance regarding high speed and good power efficiency. In this paper, we propose an NVM-based CIM architecture employing a Preset-XNOR operation in/with the spin–orbit torque magnetic random access memory (SOT-MRAM) to accelerate the computation of BNNs (PXNOR-BNN). PXNOR-BNN performs the XNOR operation of BNNs inside the computing-buffer array with only slight modifications of the peripheral circuits. Based on the layer evaluation results, PXNOR-BNN can achieve similar performance compared with the read-based SOT-MRAM counterpart. Finally, the end-to-end estimation demonstrates $12.3\times $ speedup compared with the baseline with 96.6-image/s/W throughput efficiency.
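The BNN primitive being accelerated reduces a binary dot product to XNOR plus popcount. A bit-level sketch, with {-1, +1} values encoded as {0, 1} and the LSB holding the first element:

```python
# Binary dot product: dot = 2 * popcount(XNOR(a, w)) - n.

def bnn_dot(a_bits: int, w_bits: int, n: int) -> int:
    xnor = ~(a_bits ^ w_bits) & ((1 << n) - 1)   # bitwise XNOR, masked to n
    return 2 * bin(xnor).count("1") - n          # popcount -> signed dot

# a = [+1, -1, +1, +1] -> 0b1101, w = [+1, +1, -1, +1] -> 0b1011
print(bnn_dot(0b1101, 0b1011, 4))   # matches sum(a[i] * w[i]) = 0
```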

Journal ArticleDOI
TL;DR: This multiplier is a variant of the serial–parallel (SP) modified radix-4 Booth multiplier that adds only the nonzero Booth encodings and skips over the zero operations, making the latency dependent on the multiplier value.
Abstract: In this paper, we present a two-speed, radix-4, serial–parallel multiplier for accelerating applications such as digital filters, artificial neural networks, and other machine learning algorithms. Our multiplier is a variant of the serial–parallel (SP) modified radix-4 Booth multiplier that adds only the nonzero Booth encodings and skips over the zero operations, making the latency dependent on the multiplier value. Two subcircuits with different critical paths are utilized so that throughput and latency are improved for a subset of multiplier values. The multiplier is evaluated on an Intel Cyclone V field-programmable gate array against standard parallel–parallel and SP multipliers across four different process–voltage–temperature corners. We show that for bit widths of 32 and 64, our optimizations can result in a $1.42\times $ – $3.36\times $ improvement over the standard parallel Booth multiplier in terms of area–time, depending on the input set.
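A software model of the two-speed idea (the serial control and the two critical paths are abstracted into a loop; only nonzero Booth digits consume an add cycle, so the cycle count depends on the multiplier value):

```python
# Radix-4 Booth digits in {-2,-1,0,1,2}; zero digits are skipped cheaply.

def booth_radix4_digits(m: int, bits: int):
    m &= (1 << bits) - 1
    prev = 0
    for i in range(0, bits, 2):
        window = ((m >> i) & 0b11) << 1 | prev   # bits i+1, i, i-1
        yield {0: 0, 1: 1, 2: 1, 3: 2, 4: -2, 5: -1, 6: -1, 7: 0}[window], i
        prev = (m >> (i + 1)) & 1

def two_speed_mul(a: int, m: int, bits: int = 16):
    result, add_cycles = 0, 0
    for digit, shift in booth_radix4_digits(m, bits):
        if digit != 0:                     # only nonzero encodings are added
            result += (digit * a) << shift
            add_cycles += 1
    return result, add_cycles

print(two_speed_mul(37, 100))   # (3700, 3): three nonzero Booth digits
```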

Journal ArticleDOI
TL;DR: The proposed SSP includes novel detection, feature extraction, and improved K-means algorithms for better clustering accuracy, online clustering performance, lower power, and smaller area per channel; its power per channel is the lowest among the compared state-of-the-art SSPs.
Abstract: This paper presents a power- and area-efficient spike sorting processor (SSP) for real-time neural recordings. The proposed SSP includes novel detection, feature extraction, and improved K-means algorithms for better clustering accuracy, online clustering performance, and lower power and smaller area per channel. Time-multiplexed registers are utilized in the detector for dynamic power reduction. Finally, an ultralow-voltage 8T static random access memory (SRAM) is developed to reduce area and leakage consumption when compared to D flip-flop-based memory. The proposed SSP, fabricated in 65-nm CMOS process technology, consumes only 0.175 $\mu \text{W}$ /channel when processing 128 input channels at 3.2 MHz and 0.54 V, which is the lowest among the compared state-of-the-art SSPs. The proposed SSP also occupies 0.003 mm$^2$/channel, which allows 333 channels/mm$^2$.
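An illustrative software model of the three pipeline stages (thresholds, window sizes, the toy features, and the data are placeholders, not the paper's hardware algorithms):

```python
# Threshold detection -> simple features -> K-means clustering.
import numpy as np

def detect_spikes(x, thr, win=16):
    idx = np.flatnonzero((x[1:] > thr) & (x[:-1] <= thr))  # rising crossings
    return [x[i:i + win] for i in idx if i + win <= len(x)]

def features(s):
    return np.array([s.max(), s.min(), float(s.argmax())])  # toy features

def kmeans(F, k, iters=20):
    C = F[np.random.choice(len(F), k, replace=False)]       # init centroids
    for _ in range(iters):
        labels = np.argmin(((F[:, None] - C) ** 2).sum(-1), axis=1)
        C = np.array([F[labels == j].mean(0) if np.any(labels == j) else C[j]
                      for j in range(k)])
    return labels

x = np.random.randn(20000)                 # placeholder recording
spikes = detect_spikes(x, thr=3.0)
if spikes:
    F = np.stack([features(s) for s in spikes])
    print(kmeans(F, k=min(3, len(F))))
```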

Journal ArticleDOI
TL;DR: A variation- and noise-tolerant learning algorithm and a postsilicon process-variation compensation technique are proposed that reduce the accuracy degradation of the corrupted fully connected network without requiring any additional monitoring circuitry.
Abstract: Recently, analog and mixed-signal neural network processors have been extensively studied due to their better energy efficiency and small footprint. However, analog computing is more vulnerable to circuit nonidealities, such as process variation, than its digital counterparts. On-chip calibration circuits can be adopted to measure and compensate for those effects, but this leads to unavoidable area and power overheads. In this brief, we propose a variation- and noise-tolerant learning algorithm and a postsilicon process-variation compensation technique that do not require any additional monitoring circuitry. The proposed techniques reduce the accuracy degradation in the corrupted fully connected network down to 1% under a large amount of variation, including 10% unit-capacitor mismatch, 8-mVrms comparator noise, and 20-mVrms comparator offset.

Journal ArticleDOI
TL;DR: This paper presents a fast implementation of ECC scalar multiplication for any generic Montgomery curve over the Galois field GF(p) without the constraint of a specialized modulus, and shows that the proposed architecture is as fast as scalar multiplication on special curves like Curve25519, albeit with a small area overhead.
Abstract: Elliptic curve-based cryptography (ECC) has become the automatic choice for public key cryptography due to its lightweightness compared to Rivest–Shamir–Adleman (RSA). The most important operation in ECC is elliptic curve scalar multiplication, and its efficient implementation has gathered significant attention in the research community. Fast implementation of ECC scalar multiplication is often desired for speed-critical applications such as runtime authentication in automated cars, web server certification, and so on. Such fast architectures are usually achieved by implementing ECC scalar multiplication in fields with a pseudo-Mersenne prime or a Solinas prime. In this paper, we aim to provide a fast implementation of ECC scalar multiplication for any generic Montgomery curve over the Galois field GF(p) without the constraint of using any specialized modulus. We show that the proposed ECC scalar multiplication architecture is as fast as scalar multiplication on special curves like Curve25519, albeit with a small area overhead. The proposed architecture can be modified to support ECC scalar multiplication on both Montgomery and short Weierstrass curves.
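A reference sketch of the core operation: x-only Montgomery-ladder scalar multiplication on a Montgomery curve $By^2 = x^3 + Ax^2 + x$ over GF(p), valid for any odd prime p with no special-form modulus assumed (Python 3.8+ for the modular inverse via pow; the toy parameters are illustrative only).

```python
# x-only Montgomery ladder: returns the x-coordinate of k*P for k >= 1.

def ladder(k: int, x1: int, A: int, p: int) -> int:
    a24 = (A + 2) * pow(4, -1, p) % p
    X2, Z2, X3, Z3 = 1, 0, x1, 1           # (infinity, P) in X:Z coordinates
    for i in reversed(range(k.bit_length())):
        bit = (k >> i) & 1
        if bit:                             # conditional swap on the key bit
            X2, X3, Z2, Z3 = X3, X2, Z3, Z2
        t1, t2 = (X2 + Z2) % p, (X2 - Z2) % p
        t3, t4 = (X3 + Z3) % p, (X3 - Z3) % p
        da, cb = t4 * t1 % p, t3 * t2 % p
        X3, Z3 = (da + cb) ** 2 % p, x1 * (da - cb) ** 2 % p    # xADD
        aa, bb = t1 * t1 % p, t2 * t2 % p
        e = (aa - bb) % p
        X2, Z2 = aa * bb % p, e * (bb + a24 * e) % p            # xDBL
        if bit:
            X2, X3, Z2, Z3 = X3, X2, Z3, Z2
    return X2 * pow(Z2, -1, p) % p          # affine x = X / Z

# Toy check on y^2 = x^3 + 6x^2 + x over GF(101): x([2]P) for x(P) = 2.
print(ladder(2, 2, 6, 101))   # 32, matching the direct doubling formula
```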