
Showing papers in "IEEE Transactions on Very Large Scale Integration (VLSI) Systems" in 2022


Journal ArticleDOI
TL;DR: In this article, an integer-N type-II sub-sampling phase-locked loop (SS-PLL) is proposed to suppress the spur-induced binary frequency-shift keying (BFSK) modulation effect and shorten the settling time.
Abstract: This brief describes an integer-N type-II sub-sampling phase-locked loop (SS-PLL) incorporating a push–pull sub-sampling phase detector to significantly suppress the spur-induced binary frequency-shift keying (BFSK) modulation effect and a low-power fast-locking frequency-locked loop (FLL) to shorten the settling time. Prototyped in 65-nm CMOS, the SS-PLL at 3.3 GHz shows a reference spur of −82.2 dBc, an integrated rms jitter of 64.9 fs (1 kHz to 40 MHz), and an in-band phase noise (PN) of −128.4 dBc/Hz at 1-MHz offset. The corresponding jitter–power figure of merit (FOM) is −255 dB. The entire SS-PLL consumes 7.5 mW, with only 90 µW associated with the FLL.
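As a quick sanity check, the −255-dB jitter–power FOM quoted above can be reproduced from the reported rms jitter and total power, assuming the usual definition with jitter referred to 1 s and power to 1 mW:

```latex
\mathrm{FOM} = 10\log_{10}\!\left[\left(\frac{\sigma_t}{1\,\mathrm{s}}\right)^{2}\cdot\frac{P}{1\,\mathrm{mW}}\right]
= 10\log_{10}\!\left[\left(64.9\times10^{-15}\right)^{2}\cdot 7.5\right]\approx -255\ \mathrm{dB}.
```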

18 citations


Journal ArticleDOI
TL;DR: An ensemble of weak CNNs is used to build a robust, low-cost classifier that effectively improves system reliability under soft errors with an overhead much lower than that of TMR.
Abstract: Convolutional neural networks (CNNs) are widely used in computer vision and natural language processing. Field-programmable gate arrays (FPGAs) are popular accelerators for CNNs. However, when used in critical applications, the reliability of FPGA-based CNNs becomes a priority because FPGAs are prone to soft errors. Traditional protection schemes, such as triple modular redundancy (TMR), introduce a large overhead, which is not acceptable on resource-limited platforms. This article proposes to use an ensemble of weak CNNs to build a robust classifier at low cost. To obtain a group of base CNNs with low complexity and balanced similarity and diversity, residual neural networks (ResNets) with different depths (20/32/44/56 layers) are combined in the ensemble system to replace a single strong ResNet 110. In addition, a robust combiner is designed based on the reliability evaluation of a single ResNet. Single ResNets with different depths and different ensemble schemes are implemented on an FPGA accelerator based on the Xilinx Zynq 7000 SoC. The reliability of the ensemble systems is evaluated on a large-scale fault injection platform and compared with that of the TMR-protected ResNet 110 and ResNet 20. Experimental results show that the proposed ensembles effectively improve system reliability under soft errors with an overhead much lower than that of TMR.
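The combiner can be pictured with a minimal software sketch; the reliability-weighted averaging below is only an illustration (the paper's combiner is designed from the reliability evaluation of a single ResNet, and the model callables here are placeholders):

```python
import numpy as np

def ensemble_predict(models, x, weights=None):
    """Combine class probabilities from several weak CNNs into one decision.

    models  : list of callables, each returning a class-probability vector for input x
    weights : optional per-model reliability weights (uniform if None)
    """
    probs = np.stack([m(x) for m in models])             # shape: (n_models, n_classes)
    if weights is None:
        weights = np.ones(len(models)) / len(models)
    combined = np.average(probs, axis=0, weights=weights)
    return int(np.argmax(combined))                       # index of the predicted class
```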

18 citations


Journal ArticleDOI
TL;DR: An area-efficient and highly unified reconfigurable multicore number theoretic transform (NTT)/inverse NTT (INTT) architecture (named MCNA) is presented, which employs NTT and INTT for polynomial multiplication with a variable number of reconfigurable processing elements; a novel memory access pattern named “cyclic-sharing” is also proposed to reduce memory capacity by 25%.
Abstract: The ring learning with errors (RLWE)-based fully homomorphic encryption (FHE) scheme has become one of the most promising FHE schemes. However, its performance is limited by homomorphic multiplication, especially the polynomial multiplication, which occupies the major computing resources. Therefore, efficient implementation of polynomial multiplication is crucial for high-performance FHE applications. In this article, we present an area-efficient and highly unified reconfigurable multicore number theoretic transform (NTT)/inverse NTT (INTT) architecture (named MCNA), which employs NTT and INTT for the polynomial multiplier with a variable number of reconfigurable processing elements. To reduce latency, MCNA merges the preprocessing and postprocessing into the constant-geometry NTT and INTT, respectively. Also, a reconfigurable modular multiplier based on digital signal processing (DSP) slices is proposed to speed up the modular multiplication. To avoid designing an independent memory access pattern for the INTT, a unified read/write structure for NTT/INTT is presented. Furthermore, a novel memory access pattern named “cyclic-sharing” is proposed to reduce memory capacity by 25%. MCNA is evaluated on a Xilinx Virtex-7 field-programmable gate array (FPGA) platform. Running at a 250-MHz clock frequency, the throughput of MCNA for NTT/INTT achieves 2.78×–9.32× improvements in comparison to prior works, while the area efficiency in terms of lookup tables (LUTs) and flip-flops (FFs) is improved by 1.25×–4.79×. For polynomial multiplication, the throughput of MCNA achieves 3.73×–7.69× enhancements, as well as 1.13×–14.8× area-efficiency improvements.
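The reason the NTT dominates RLWE polynomial multiplication is that it converts an O(n²) convolution into cheap pointwise products. A minimal illustrative sketch over a toy ring Z_17[x]/(x^8 − 1) (textbook O(n²) transform, not the paper's constant-geometry pipeline; the parameters are chosen only so the example is self-contained):

```python
Q, N, W = 17, 8, 9   # toy modulus, transform length, primitive N-th root of unity mod Q

def ntt(a, root):
    # naive O(N^2) number theoretic transform: A[k] = sum_j a[j] * root^(j*k) mod Q
    return [sum(a[j] * pow(root, j * k, Q) for j in range(N)) % Q for k in range(N)]

def poly_mul(a, b):
    # cyclic convolution mod (x^N - 1): forward NTT, pointwise multiply, inverse NTT
    A, B = ntt(a, W), ntt(b, W)
    C = [(x * y) % Q for x, y in zip(A, B)]
    inv_n = pow(N, Q - 2, Q)                      # modular inverse of N
    return [(c * inv_n) % Q for c in ntt(C, pow(W, Q - 2, Q))]

# (1 + 2x)(3 + 4x) = 3 + 10x + 8x^2
print(poly_mul([1, 2, 0, 0, 0, 0, 0, 0], [3, 4, 0, 0, 0, 0, 0, 0]))
```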

14 citations


Journal ArticleDOI
TL;DR: A novel 8T compute SRAM (CSRAM) for reliable and high-speed in-memory searching and compound logic-in-memory computations is proposed and a thorough circuit-level analysis reveals that the pMOS-based compute access port is essential for significantly mitigating the read disturbance.
Abstract: To efficiently implement searching and logic functions with SRAM-based in-memory computing (IMC), we need to perform computations on bitlines (BLs) (called compute access) via multiple wordline (WL) activations. However, this may cause prominent read disturbance when the IMC is implemented with the standard 6T SRAM. To address this reliability issue, existing solutions adopt either auxiliary assistance circuits or alternative bitcell topologies, but they incur substantial penalties in access speed or array density. In this article, we propose a novel 8T compute SRAM (CSRAM) for reliable and high-speed in-memory searching and compound logic-in-memory computations. Our 8T CSRAM features a pair of pMOS access transistors and split WLs dedicated to the compute access. A thorough circuit-level analysis reveals that the pMOS-based compute access port is essential for significantly mitigating the read disturbance. Moreover, we propose an elevated precharge voltage scheme and a low-skewed inverter-based sense amplifier to improve the sensing speed. We have validated the proposed 8T CSRAM design in a 16-Kb array in a 28-nm CMOS technology. Compared to the state-of-the-art 8T CSRAM, results show that our design is not only reliable but also 3.1 times faster, with a maximum operating frequency of up to 2.44 GHz.

12 citations


Journal ArticleDOI
TL;DR: In this paper , Ordered Reliability Bits GRAND (ORBGRAND) is a soft-input variant that outperforms hard-input GRAND and is suitable for parallel hardware implementation, which achieves an average throughput of up to 42.5$ Gbps for a code length of $128$ at a target FER of $10^{-7}$.
Abstract: Ultra-reliable low-latency communication (URLLC), a major 5G New-Radio use case, is the key enabler for applications with strict reliability and latency requirements. These applications necessitate the use of short-length and high-rate codes. Guessing Random Additive Noise Decoding (GRAND) is a recently proposed Maximum Likelihood (ML) decoding technique for these short-length and high-rate codes. Rather than decoding the received vector, GRAND tries to infer the noise that corrupted the transmitted codeword during transmission through the communication channel. As a result, GRAND can decode any code, structured or unstructured. GRAND has hard-input as well as soft-input variants. Among these variants, Ordered Reliability Bits GRAND (ORBGRAND) is a soft-input variant that outperforms hard-input GRAND and is suitable for parallel hardware implementation. This work reports the first hardware architecture for ORBGRAND, which achieves an average throughput of up to 42.5 Gbps for a code length of 128 at a target FER of 10⁻⁷. Furthermore, the proposed hardware can be used to decode any code as long as the length and rate constraints are met. In comparison to the GRANDAB, a hard-input variant of GRAND, the proposed architecture enhances decoding performance by at least 2 dB. When compared to the state-of-the-art fast dynamic successive cancellation flip decoder (Fast-DSCF) using a 5G polar (128,105) code, the proposed ORBGRAND VLSI implementation has 49× higher average throughput, 32× more energy efficiency, and 5× more area efficiency while maintaining similar decoding performance.
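The guessing principle behind ORBGRAND is simple to state in software: sort bit positions by reliability, then test putative noise patterns in order of increasing logistic weight (the sum of the reliability ranks of the flipped bits) until the syndrome check passes. The sketch below is a serial illustration of that principle for any binary linear code given by its parity-check matrix H, not a model of the parallel hardware:

```python
import numpy as np
from itertools import combinations

def orbgrand_decode(llr, H, max_flips=3):
    """Serial ORBGRAND-style decoding: guess noise patterns in logistic-weight order.

    llr : per-bit log-likelihood ratios (sign -> hard decision, magnitude -> reliability)
    H   : parity-check matrix over GF(2), shape (n - k, n)
    """
    y = (llr < 0).astype(int)                    # hard-decision word (llr > 0 means bit 0)
    order = np.argsort(np.abs(llr))              # bit indices, least reliable first
    patterns = [()]                              # rank sets to flip; () = "no error"
    for k in range(1, max_flips + 1):
        patterns += list(combinations(range(len(llr)), k))
    patterns.sort(key=lambda p: sum(r + 1 for r in p))   # logistic weight = sum of ranks
    for p in patterns:
        cand = y.copy()
        cand[order[list(p)]] ^= 1                # apply the guessed noise pattern
        if not ((H @ cand) % 2).any():           # all parity checks satisfied -> codeword
            return cand
    return y                                     # give up after the guess budget
```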

12 citations


Journal ArticleDOI
TL;DR: In this article, a register-transfer-level (RTL) power analysis tool (PAT) framework is presented to perform a technology-independent power side-channel (PSC) assessment of cryptographic hardware at the RTL stage.
Abstract: Power side-channel (PSC) attacks received significant attention over the past two decades due to their effectiveness in breaking mathematically strong cryptographic implementations. However, most existing PSC assessment frameworks apply only to post-silicon implementations; this is unfavorable to the industry due to the lack of flexibility in fixing the design and the high cost/time penalty incurred in redoing the entire design cycle. This article presents the register-transfer-level power analysis tool (RTL-PAT) framework to perform a technology-independent PSC assessment of cryptographic (pre- and post-quantum) hardware at the RTL stage. Performing assessment at the RTL gives designers the utmost flexibility to quickly apply the countermeasures locally. RTL-PAT can also serve as a front-end sign-off framework for PSC leakage, allowing a designer to make changes in the early design stage, which would otherwise be difficult/time-consuming to perform in subsequent design stages. Furthermore, RTL-PAT can analyze both FPGA and ASIC design flows for standalone IPs and SoCs. In this article, we present the efficacy of RTL-PAT on several cryptographic implementations. The results are presented for standalone IPs, which include different AES implementations (Galois field, lookup table, pipelined, and threshold implementation) and PRESENT cipher. We also analyze a large-scale SoC, which includes the post-quantum SABER implementation and AES. The results show that the framework effectively identifies the leaky modules and validates the efficacy of PSC countermeasures implemented in the RTL. The obtained RTL-PAT assessment results are validated with the post-silicon t-statistics assessment as well.
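Leakage assessments of this kind are commonly scored with Welch's t-test (TVLA) between a fixed-input and a random-input trace set; the snippet below sketches that generic scoring step over simulated toggle/power traces, not RTL-PAT's internal metric:

```python
import numpy as np

def tvla_t_statistic(traces_fixed, traces_random):
    """Welch's t-test per sample point; |t| > 4.5 is the customary leakage threshold.

    traces_* : 2-D arrays with one power (or toggle-count) trace per row
    """
    m1, m2 = traces_fixed.mean(axis=0), traces_random.mean(axis=0)
    v1, v2 = traces_fixed.var(axis=0, ddof=1), traces_random.var(axis=0, ddof=1)
    n1, n2 = len(traces_fixed), len(traces_random)
    return (m1 - m2) / np.sqrt(v1 / n1 + v2 / n2)
```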

10 citations


Journal ArticleDOI
TL;DR: The miniature and high-speed features of the Dual-f Keccak design are found to be adequate for multimodal biometric authentication applications, and the design performs better in terms of throughput and operating frequency.
Abstract: A synchronized padder block and a compact dynamic round constant (RC) generator are proposed in this work to achieve a highly efficient Keccak architecture. The proposed design yields high security with an option of 1024 bits as the capacity c, while limiting the round count to less than 12 for the base design. Fusion schemes are adopted as a cost-effective approach in the base design to explore and arrive at the most efficient architecture for a biometric access control application. The hybrid architecture, designed as a pipeline structure with ≤2 stages, eliminates the need for on-chip digital signal processing (DSP) and block random access memory (BRAM) slices. Although fusion schemes might increase the area, the minimized structural RC design, coupled with a low-cost architecture, keeps the overall area moderately low. Among the proposed architectures, the dual round function (Dual-f) design performs better in terms of throughput and operating frequency. Thus, when implemented, Dual-f achieves the highest efficiency of all with 12.85 Mb/s/slice and 15.11 Mb/s/slice on Virtex-5 and Virtex-7 devices, respectively. The miniature and high-speed features of the Dual-f Keccak design are found to be adequate for multimodal biometric authentication applications.

10 citations


Journal ArticleDOI
TL;DR: In this paper, experimental results are presented for a multiple-input operational transconductance amplifier (MI-OTA) that employs three linearization techniques: bulk driving (BD), source degeneration, and input voltage attenuation created by the multiple-input metal-oxide-semiconductor transistor (MI-MOST) technique.
Abstract: This article presents the experimental results for a multiple-input operational transconductance amplifier (MI-OTA). To achieve extended linearity under a 0.5-V low supply voltage, the circuit employs three linearization techniques: the bulk-driven (BD) technique, source degeneration, and the input voltage attenuation created by the multiple-input metal-oxide-semiconductor transistor (MI-MOST) technique. Although the linearization techniques reduce the dc gain, self-cascode transistors are used to boost the gain of the MI-OTA. Furthermore, the MI-MOST simplifies the internal structure of the OTA and may reduce the complexity of the applications. The MI-OTA operates in the subthreshold region and offers tunability by a bias current in the nanoampere range. The circuit is capable of operating from a 0.5-V supply voltage while consuming 24.77 nW. The circuit was fabricated in the 0.18-µm Taiwan Semiconductor Manufacturing Company (TSMC) CMOS technology, and it occupies a 0.01153-mm² silicon area. Extensive simulation and experimental results confirm the benefits and robustness of the design.

10 citations


Journal ArticleDOI
TL;DR: A compact and efficient in-MEmory NTT accelerator is presented, named MeNTT, which explores an optimized computation in and near a 6T SRAM array, and a novel mapping strategy reduces the data flow between the NTT stages into a unique pattern, which greatly simplifies the routing among processing units.
Abstract: Lattice-based cryptography (LBC), exploiting learning with errors (LWE) problems, is a promising candidate for postquantum cryptography. The number theoretic transform (NTT) is the latency- and energy-dominant process in the computation of LWE problems. This article presents a compact and efficient in-MEmory NTT accelerator, named MeNTT, which explores an optimized computation in and near a 6T SRAM array. Specifically designed peripherals enable fast and efficient modular operations. Moreover, a novel mapping strategy reduces the data flow between the NTT stages into a unique pattern, which greatly simplifies the routing among processing units (i.e., the SRAM columns in this work), reducing the energy and area overheads. The accelerator achieves significant latency and energy reductions over prior art.

9 citations


Journal ArticleDOI
TL;DR: In this paper, a flexible and reconfigurable PCB test bed derived from the popular open-source programmable logic controller (PLC) platform “OpenPLC” is developed, together with a Trojan detection framework that utilizes and analyzes multimodal side channels.
Abstract: Malicious modifications to printed circuit boards (PCBs) are known as hardware Trojans. These may arise when mala fide third parties alter PCBs premanufacturing or postmanufacturing, and they are a concern in safety-critical applications, such as industrial control systems. In this research, we examine how data-driven detection can be utilized to detect such Trojans at run-time. We develop a flexible and reconfigurable PCB test bed derived from the popular open-source programmable logic controller (PLC) platform “OpenPLC.” We then develop a Trojan detection framework, which utilizes and analyzes multimodal side channels (e.g., timing, magnetic signals, power, and hardware performance counters). We consider a defender-configurable input/output (I/O) loopback test, comparison with design-document baselines, and magnetometer-aided monitoring of system behavior under defender-chosen excitations. Our approach extends to golden-free environments: golden (known-good) versions of the PCBs are assumed to be unavailable, but design information, datasheets, and component-level data are available. We demonstrate the efficacy of our approach on a range of Trojans instantiated in the test bed.

8 citations


Journal ArticleDOI
TL;DR: In this article, a compute-in-memory (CIM)-based ultralow-power framework for probabilistic localization of insect-scale drones is proposed, where the likelihood function used for drone localization can be efficiently implemented by connecting many multi-input inverters in parallel.
Abstract: We propose a novel compute-in-memory (CIM)-based ultralow-power framework for probabilistic localization of insect-scale drones. Localization is a critical subroutine for path planning and rotor control in drones, where a drone is required to continuously estimate its pose (position and orientation) in flying space. Conventional probabilistic localization approaches rely on a 3-D Gaussian mixture model (GMM)-based representation of a 3-D map. A GMM model with hundreds of mixture functions is typically needed to adequately learn and represent the intricacies of the map. Meanwhile, localization using complex GMM map models is computationally intensive. Since insect-scale drones operate under an extremely limited area/power budget, continuous localization using GMM models entails much higher operating energy, thereby limiting the flying duration and/or size of the drone due to a larger battery. Addressing the computational challenges of localization in an insect-scale drone using a CIM approach, we propose a novel framework of 3-D map representation using a harmonic mean of the “Gaussian-like” mixture (HMGM) model. We show that the short-circuit current of a multi-input floating-gate CMOS-based inverter follows the harmonic mean of a Gaussian-like function. Therefore, the likelihood function used for drone localization can be efficiently implemented by connecting many multi-input inverters in parallel, each programmed with the parameters of the 3-D map model represented as HMGM. When the depth measurements are projected to the input of the implementation, the summed current of the inverters emulates the likelihood of the measurement. We have characterized our approach on an RGB-D scenes dataset. The proposed localization framework is ~25× more energy-efficient than a traditional 8-bit digital GMM-based processor, paving the way for tiny autonomous drones.
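Numerically, the HMGM map scores a point with the harmonic mean of Gaussian-like component values rather than the GMM's weighted sum, which is the quantity the parallel multi-input inverters approximate in analog. A small illustrative sketch (component centers and widths are placeholders, not a learned map):

```python
import numpy as np

def gaussian_like(x, mu, sigma):
    # unnormalized Gaussian-like bump centered at mu
    return np.exp(-np.sum((x - mu) ** 2) / (2.0 * sigma ** 2))

def hmgm_likelihood(x, centers, sigmas):
    """Harmonic mean of 'Gaussian-like' mixture components evaluated at point x."""
    vals = np.array([gaussian_like(x, mu, s) for mu, s in zip(centers, sigmas)])
    return len(vals) / np.sum(1.0 / (vals + 1e-12))   # epsilon guards against division by zero

# toy 3-D map with two components (placeholder parameters)
centers = [np.array([0.0, 0.0, 1.0]), np.array([2.0, 1.0, 1.5])]
sigmas = [0.5, 0.8]
print(hmgm_likelihood(np.array([0.1, 0.0, 1.0]), centers, sigmas))
```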

Journal ArticleDOI
TL;DR: A novel latency-hiding architecture for recurrent neural network (RNN) acceleration using column-wise matrix–vector multiplication (MVM) instead of the state-of-the-art row-wise operation, which eliminates data dependencies, increases HW utilization, and enhances system throughput.
Abstract: This article presents a reconfigurable accelerator for REcurrent Neural networks with fine-grained cOlumn-Wise matrix–vector multiplicatioN (RENOWN). We propose a novel latency-hiding architecture for recurrent neural network (RNN) acceleration using column-wise matrix–vector multiplication (MVM) instead of the state-of-the-art row-wise operation. This hardware (HW) architecture can eliminate data dependencies to improve the throughput of RNN inference systems. Besides, we introduce a configurable checkerboard tiling strategy which allows large weight matrices, while incorporating various configurations of element-based parallelism (EP) and vector-based parallelism (VP). These optimizations improve the exploitation of parallelism to increase HW utilization and enhance system throughput. Evaluation results show that our design can achieve over 29.6 tera operations per second (TOPS) which would be among the highest for field-programmable gate array (FPGA)-based RNN designs. Compared to state-of-the-art accelerators on FPGAs, our design achieves 3.7–14.8 times better performance and has the highest HW utilization.
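The latency-hiding argument rests on the ordering of the matrix–vector multiplication (MVM): a column-wise schedule needs only one input element to apply an entire weight column, so accumulation can begin before the rest of the vector (in an RNN, the rest of the newly produced state) is available, whereas a row-wise schedule needs the whole vector before any output finishes. A small sketch of the two equivalent orderings (illustration only):

```python
import numpy as np

def mvm_row_wise(W, x):
    # each output y[i] needs the *entire* input vector x before it can complete
    return np.array([np.dot(W[i, :], x) for i in range(W.shape[0])])

def mvm_column_wise(W, x):
    # apply one column at a time: only x[j] is needed for column j, so work can
    # overlap with the production of later input/state elements
    y = np.zeros(W.shape[0])
    for j in range(W.shape[1]):
        y += W[:, j] * x[j]
    return y

W, x = np.random.randn(4, 3), np.random.randn(3)
assert np.allclose(mvm_row_wise(W, x), mvm_column_wise(W, x))
```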

Journal ArticleDOI
TL;DR: In this paper, the authors propose a high-performance hardware architecture for the supersingular isogeny key encapsulation (SIKE) protocol, which includes an improved multiplier based on the high-performance finite field multiplication (HFFM) algorithm.
Abstract: Supersingular isogeny key encapsulation (SIKE) is a promising candidate in the NIST postquantum cryptography (PQC) standardization process, with the smallest key lengths. It is the only isogeny-based cryptographic scheme in the NIST list that leverages traditional elliptic curve cryptography (ECC) arithmetic; however, its high computational complexity is one of its limiting factors. In this work, we propose a high-performance hardware architecture for the SIKE protocol. The architecture includes an improved multiplier based on the high-performance finite field multiplication (HFFM) algorithm, which is 15%–20.7% faster than the previous HFFM-based multiplier, and a unified adder/subtractor with radix 3^b. In addition, it comprises an efficient scheduling strategy that decomposes all the functions of SIKE into finite-field F_p operations and then schedules them effectively through optimized multiplication chains for maximal performance. The proposed architecture is synthesized and implemented on a Xilinx Virtex-7 FPGA for all four variants of SIKE, covering security levels 1 to 5, and achieves 2.6%–7.8% faster speeds while consuming a smaller equivalent number of slices (ENS) than the state-of-the-art designs. In terms of the area–time (AT) product, the proposed architecture is 14.2%–34.5% lower than the previous architecture.

Journal ArticleDOI
TL;DR: In this paper, the authors propose an algorithm–hardware co-optimized framework to accelerate Transformers by utilizing general N:M sparsity patterns, which can achieve significant speedup.
Abstract: The Transformer has been an indispensable staple in deep learning. However, for real-life applications, it is very challenging to deploy efficient Transformers due to the immense parameters and operations of the models. To relieve this burden, exploiting sparsity is an effective approach to accelerate Transformers. Newly emerging Ampere graphics processing units (GPUs) leverage a 2:4 sparsity pattern to achieve model acceleration, but it can hardly meet the diverse algorithm and hardware constraints encountered when deploying models. By contrast, we propose an algorithm–hardware co-optimized framework to flexibly and efficiently accelerate Transformers by utilizing general N:M sparsity patterns. First, from an algorithm perspective, we propose a sparsity inheritance mechanism along with inherited dynamic pruning (IDP) to rapidly obtain a series of N:M sparse candidate Transformers. A model compression scheme is further proposed to significantly reduce the storage requirement for deployment. Second, from a hardware perspective, we present a flexible and efficient hardware architecture, namely, STA, to achieve significant speedup when deploying N:M sparse Transformers. STA features not only a computing engine unifying both sparse–dense and dense–dense matrix multiplications with high computational efficiency but also a scalable softmax module eliminating the latency from intermediate off-chip data communication. Experimental results show that, compared to other methods, N:M sparse Transformers generated using IDP achieve an average of 6.7% improvement in accuracy with high training efficiency. Moreover, STA can achieve 14.47× and 11.33× speedups compared to the Intel i9-9900X and the NVIDIA RTX 2080 Ti, respectively, and perform 2.00×–19.47× faster inference than state-of-the-art field-programmable gate array (FPGA)-based accelerators for Transformers.
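N:M sparsity keeps at most N nonzero weights in every group of M consecutive weights along a row; the Ampere 2:4 pattern is the special case N = 2, M = 4. A minimal magnitude-based N:M pruning sketch (illustration only, not the paper's inherited dynamic pruning):

```python
import numpy as np

def prune_n_m(weights, n=2, m=4):
    """Zero all but the n largest-magnitude weights in each row-wise group of m."""
    w = weights.copy()
    rows, cols = w.shape
    assert cols % m == 0, "row length must be a multiple of m"
    groups = w.reshape(rows, cols // m, m)
    # indices of the (m - n) smallest-magnitude entries in every group
    drop = np.argsort(np.abs(groups), axis=-1)[..., : m - n]
    np.put_along_axis(groups, drop, 0.0, axis=-1)
    return groups.reshape(rows, cols)

w = np.random.randn(2, 8)
print(prune_n_m(w, n=2, m=4))   # every group of 4 now holds at most 2 nonzeros
```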

Journal ArticleDOI
TL;DR: An efficient integrated HT detection technique based on evaluating changes in the integrated parasitic capacitances is presented; it shows very promising results in detecting HTs with a zero-delay effect, which is a challenging case for the conventional delay-based side-channel signal analysis method.
Abstract: The increasing concern about the security and reliability of integrated circuits (ICs) manufactured abroad has attracted academia and industry to develop hardware Trojan (HT) detection approaches. This article presents an efficient integrated HT detection technique based on evaluating changes in the integrated parasitic capacitances. The HT detection circuit consists of a capacitively coupled, low-power, low-noise operational transconductance amplifier (OTA), which can detect capacitance fluctuations in the range of 10 aF. The HT detection circuit consumes 5.88 µW from a 1.8-V power supply in 180-nm CMOS technology. The detection method is based on clustering the IC and monitoring each cluster's flag. The flag-set circuit is designed to sense the parasitic capacitance and change its status accordingly. The proposed technique can detect the HT circuit before the activation of the IC. Moreover, this technique shows very promising results in detecting HTs with a zero-delay effect, which is a challenging issue for the conventional delay-based side-channel signal analysis method. More significantly, the proposed method does not require a golden IC for HT detection and can detect the HT using simulation-based data. The proposed method creates a recognizable difference signal between the capacitive behavior of an infected IC and that of a pure IC. This results in a high confidence level in the proposed detection method. The proposed idea is implemented on ISCAS'85 benchmark circuits, and the detection outcomes and the statistical simulations are presented.

Journal ArticleDOI
TL;DR: A fully automated detection framework is introduced, containing systematic methodologies for test generation, signature extraction, signal processing, threshold calculation, and metric-based decision-making that effectively enable the synergistic self-referencing approach.
Abstract: The globalization of the semiconductor supply chain has developed a new set of challenges for security researchers. Among them, malicious alterations of hardware designs at an untrusted facility, or Trojan insertion, are considered one of the most difficult challenges. While side-channel analysis-based hardware Trojan detection techniques have shown great potential, most solutions, proposed over the past decade, require the availability of golden (i.e., Trojan-free) chips and are susceptible to process variations. Few techniques that do not require a golden chip depend on simulation-based modeling of the side-channel signature, which may not be reliable for differentiating between process and Trojan induced variations. Furthermore, most of these techniques are evaluated either using very few Trojan inserted chips or simulation-based test setup. Spatial and temporal self-referencing-based detection mechanisms proposed earlier effectively eliminate the need for a golden chip and the impact of process variations. However, these techniques have not been adequately studied to achieve high detection sensitivity. In this article, we propose a golden-free multidimensional self-referencing technique that analyzes the side-channel signatures in both the time and frequency domains to significantly broaden the Trojan coverage and strengthen the detection confidence. We introduce a fully automated detection framework containing systematic methodologies for test generation, signature extraction, signal processing, threshold calculation, and metric-based decision-making that effectively enables the synergistic self-referencing approach. Finally, we evaluate the proposed technique through a comprehensive hardware measurement setup consisting of 96 Trojan-inserted test chips. Along with achieving a high detection coverage, we demonstrate that the analysis of spatial and temporal discrepancies in both frequency and time domains helps to reliably detect small hard-to-detect Trojans under process and measurement induced variations.

Journal ArticleDOI
TL;DR: It is shown that the posit-based multiplier requires a higher power-delay product (PDP) and area, whereas the fixed-posit multiplier reduces PDP and area consumption by 71% and 36%, respectively, compared to (Devnath et al., 2020) for the same bit-width.
Abstract: This brief compares quantized floating-point representations in posit and fixed-posit formats for a wide variety of pre-trained deep neural networks (DNNs). We observe that the fixed-posit representation is far more suitable for DNNs, as it results in a faster and lower-power computation circuit. We show that accuracy remains within 0.3% and 0.57% of the top-1 accuracy for posit and fixed-posit quantization, respectively. We further show that the posit-based multiplier requires a higher power-delay product (PDP) and area, whereas the fixed-posit multiplier reduces PDP and area consumption by 71% and 36%, respectively, compared to (Devnath et al., 2020) for the same bit-width.

Journal ArticleDOI
TL;DR: In this article, a configurable FP multiple-precision PE design is proposed with the LPC structure, which achieves the best energy efficiency, at 975.13 GFLOPS/W. The proposed design is realized in a 28-nm process with a 1.429-GHz clock frequency.
Abstract: There is an emerging need to design configurable accelerators for high-performance computing (HPC) and artificial intelligence (AI) applications in different precisions. Thus, the floating-point (FP) processing element (PE), which is the key basic unit of the accelerators, must meet multiple-precision requirements with energy-efficient operation. However, the existing structures using the high-precision-split (HPS) and low-precision-combination (LPC) methods result in a low utilization rate of the multiplication array and a long multiterm processing period, respectively. In this article, a configurable FP multiple-precision PE design is proposed with the LPC structure. Half precision, single precision, and double precision are supported. A 100% utilization rate of the multiplication array is achieved for all precisions, with improved speed in the comparison and summation process. The proposed design is realized in a 28-nm process with a 1.429-GHz clock frequency. Compared with the existing multiple-precision FP methods, the proposed structure achieves 63% and 88% area savings for FP16 and FP32 operations, respectively. Maximum throughput rates of 4× and 20× are obtained when compared with fixed FP32 and FP64 operations. Compared with the previous multiple-precision PEs, the proposed one achieves the best energy efficiency, at 975.13 GFLOPS/W.
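The low-precision-combination (LPC) idea is that a wide multiplication is assembled from the same narrow multiplier blocks that serve the low-precision modes. A minimal integer illustration of composing a 2w-bit product from four w-bit partial products (mantissa alignment, exponents, and the paper's scheduling are omitted):

```python
def wide_mul(a, b, w=8):
    """Compose a 2w-bit unsigned multiply from four w-bit multiplies (schoolbook split)."""
    mask = (1 << w) - 1
    a_lo, a_hi = a & mask, a >> w
    b_lo, b_hi = b & mask, b >> w
    # four narrow partial products, shifted to their weight and summed
    return (a_lo * b_lo) + ((a_lo * b_hi + a_hi * b_lo) << w) + ((a_hi * b_hi) << (2 * w))

assert wide_mul(0xBEEF, 0xCAFE) == 0xBEEF * 0xCAFE
```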

Journal ArticleDOI
TL;DR: This article introduces Cross-PUF attacks, where a model is created using the power consumption of one PUF instance to attack another PUF created from the same GDSII file, and proposes a lightweight countermeasure based on dual-rail and random initialization logic approaches, called DRILL, which is shown to be highly effective in thwarting Cross-PUF attacks.
Abstract: Unintentional uncontrollable variations in the manufacturing process of integrated circuits are used to realize silicon primitives known as physical unclonable functions (PUFs). These primitives are used to create unique signatures for security purposes. Investigating the vulnerabilities of PUFs is of utmost importance to uphold their usefulness in secure applications. One such investigation includes exploring the susceptibility of PUFs to modeling attacks that aim at extracting the PUFs’ behavior. To date, these attacks have mainly focused on a single PUF instance where the targeted PUF is attacked using the model built based on the very same PUF’s challenge–response pairs or power side channel. In this article, we move one step forward and introduce Cross-PUF attacks where a model is created using the power consumption of one PUF instance to attack another PUF created from the same GDSII file. Through SPICE simulations, we show that these attacks are highly effective in modeling PUF behaviors even in the presence of noise and mismatches in temperature and aging of the PUF used for modeling versus the targeted PUF. To mitigate the Cross-PUF attacks, we then propose a lightweight countermeasure based on dual-rail and random initialization logic approaches called DRILL. We show that DRILL is highly effective in thwarting Cross-PUF attacks.

Journal ArticleDOI
TL;DR: This article develops a hardware-efficient architecture for the fractional-order correntropy adaptive filter (FoCAF) for efficient real-time VLSI implementation and demonstrates that the reformulations cause negligible performance degradation under a 16-bit fixed-point implementation.
Abstract: Conventional adaptive filters, which assume a Gaussian distribution for signal and noise, exhibit significant performance degradation when operating in non-Gaussian environments. Recently proposed fractional-order adaptive filters (FoAFs) address this concern by assuming that the signal and noise are symmetric α-stable random processes. However, the literature does not include any VLSI architectures for these algorithms. Toward that end, this article develops a hardware-efficient architecture for the fractional-order correntropy adaptive filter (FoCAF). We first reformulate the FoCAF for its efficient real-time VLSI implementation and then demonstrate that these reformulations cause negligible performance degradation under a 16-bit fixed-point implementation. Using this reformulated algorithm, we design an FoCAF architecture. Furthermore, we analyze the critical path of the design to select the appropriate level of pipelining based on the sampling rate of the application. According to the critical-path analysis, the FoCAF design is pipelined using retiming techniques to obtain the delayed FoCAF (DFoCAF), which is then synthesized using 45-nm CMOS technology. Synthesis results reveal that the DFoCAF architecture requires a minimal increase in hardware over the prominent least mean square (LMS) filter architecture and achieves a significant increase in performance in symmetric α-stable environments where the LMS fails to converge.
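For orientation, a generic correntropy-based adaptive-filter update (the maximum correntropy criterion applied to LMS; the paper's fractional-order FoCAF recursion differs) weights each error by a Gaussian kernel so that impulsive, non-Gaussian outliers barely move the weights:

```python
import numpy as np

def mcc_lms_step(w, x, d, mu=0.05, sigma=1.0):
    """One maximum-correntropy-criterion LMS update (robust to impulsive noise).

    w : current weight vector, x : input regressor, d : desired sample
    """
    e = d - np.dot(w, x)                                 # a-priori error
    kernel = np.exp(-(e ** 2) / (2.0 * sigma ** 2))      # ~0 for outliers -> tiny step
    return w + mu * kernel * e * x, e
```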

Journal ArticleDOI
TL;DR: This study proposes a configurable 6-transistor (6T) static random access memory (SRAM) array with a multilevel shared structure for in-memory computing (IMC), which aims to embed computing in memory to reduce memory–processor data transfers.
Abstract: Frequent to-and-from data transfers in the von Neumann architecture limit the overall throughput. One of the promising approaches used to overcome von Neumann bottleneck is in-memory computing (IMC) that aims to embed computing in memory to reduce the transfer of memory-processor data. This study proposes a configurable 6-transistor (6T) static random access memory (SRAM) array with a multilevel shared structure for IMC. A multilevel shared structure can effectively improve the utilization rate of the module. In addition to the conventional SRAM operation, the configurable structure can also perform the sum of absolute differences (SAD) and Hamming distance (HD) calculations. To quickly identify the minimum value among multiple calculation results, a four-input sense amplifier (SA) is proposed. The performance of the proposed memory is simulated in a 65-nm CMOS process. The post-layout simulation results show good linearity of the multirow read in the SAD and HD modes. The mean time required by the four-input SA to obtain the result is 190 ps. The SAD and HD calculations yield consumptions of 67.44 fJ/byte and 0.64 fJ/bit, respectively, at 0.8 V. Furthermore, a single column-sharing comparator consumes 2.78 and 3.41 pJ at 0.8 V in the SAD and HD modes, respectively.
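The two in-memory operations the array supports are straightforward to state in software; a reference sketch of the sum of absolute differences (SAD) and Hamming distance (HD) that the bitcell array evaluates in the mixed-signal domain:

```python
def sad(a, b):
    """Sum of absolute differences between two equal-length byte sequences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def hamming_distance(a, b):
    """Number of bit positions in which two integers differ."""
    return bin(a ^ b).count("1")

print(sad([12, 200, 7], [10, 198, 7]))   # -> 4
print(hamming_distance(0b1011, 0b0001))  # -> 2
```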

Journal ArticleDOI
TL;DR: In this paper, the authors present a cryptographic hardware accelerator supporting multiple AES-based block cipher modes, including the more advanced cipher-based MAC (CMAC), counter with CBC-MAC (CCM), Galois/counter mode (GCM), and XOR-encrypt-XOR-based tweaked-codebook mode with ciphertext stealing (XTS) modes.
Abstract: This article presents a cryptographic hardware (HW) accelerator supporting multiple advanced encryption standard (AES)-based block cipher modes, including the more advanced cipher-based MAC (CMAC), counter with CBC-MAC (CCM), Galois/counter mode (GCM), and XOR-encrypt-XOR-based tweaked-codebook mode with ciphertext stealing (XTS) modes. The proposed design implements advanced and innovative features in HW, such as AES key secure management, on-chip clock randomization, and access privilege mechanisms. The system has been tested in a RISC-V-based system-on-chip (SoC), specifically designed for this purpose, on a Xilinx UltraScale+ FPGA, analyzing resource usage and power consumption together with system performance. The cryptoprocessor has then been synthesized on a 7-nm CMOS standard-cell technology; performance, complexity, and power consumption are analyzed and compared with the state of the art. The proposed cryptoprocessor is ready to be embedded within the innovative European Processor Initiative (EPI) chip.

Journal ArticleDOI
TL;DR: Based on the proposed algorithm and additional architectural optimizations, a new low-latency and highly accurate VLSI architecture is presented in this manuscript for computing the eigenvalues and eigenvectors of a real-symmetric matrix.
Abstract: This article proposes a low-latency parallel Jacobi-method-based algorithm for computing the eigenvalues and eigenvectors of an n × n real-symmetric matrix. It is a coordinate rotation digital computer (CORDIC)-based iterative algorithm that comprises multiple rotations, and hence the key contribution of our work is to reduce the time cost of each rotation, thereby alleviating the total latency for computing eigenvalues and eigenvectors using the parallel Jacobi method. Based on this proposed algorithm and additional architectural optimizations, a new low-latency and highly accurate VLSI architecture is presented in this manuscript for computing the eigenvalues and eigenvectors of a real-symmetric matrix. Subsequently, this work proposes a reconfigurable algorithm and its VLSI architecture for computing the eigenvalues and eigenvectors of complex Hermitian (CH), complex skew-Hermitian (CSH), and real skew-symmetric (RSS) matrices. Performance analysis of the proposed architectures demonstrates a minimal error percentage of 0.0106%, which is adequate for a wide range of real-time applications. The proposed architectures are implemented in hardware on a Zynq UltraScale+ field-programmable gate array (FPGA) board, achieving a short latency of 9.377 µs while operating at a maximum clock frequency of 172.75 MHz. Comparison of our implementation results with the reported works shows that the proposed architecture incurs 43.75% lower latency and 89.4% better accuracy than the state-of-the-art implementation.
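As a software reference for the method being parallelized, the cyclic Jacobi iteration repeatedly applies plane rotations, each chosen to zero one off-diagonal pair; the hardware contribution is realizing each rotation with CORDIC, which is not modeled in this sketch:

```python
import numpy as np

def jacobi_eig(A, sweeps=10):
    """Eigen-decomposition of a real-symmetric matrix by cyclic Jacobi rotations."""
    A = A.astype(float).copy()
    n = A.shape[0]
    V = np.eye(n)                                    # accumulates eigenvectors
    for _ in range(sweeps):
        for p in range(n - 1):
            for q in range(p + 1, n):
                if abs(A[p, q]) < 1e-12:
                    continue
                theta = 0.5 * np.arctan2(2 * A[p, q], A[q, q] - A[p, p])
                c, s = np.cos(theta), np.sin(theta)
                J = np.eye(n)
                J[p, p], J[q, q], J[p, q], J[q, p] = c, c, s, -s
                A = J.T @ A @ J                      # zeroes A[p, q] and A[q, p]
                V = V @ J
    return np.diag(A), V                             # eigenvalues, eigenvectors (columns)

M = np.array([[4.0, 1.0], [1.0, 3.0]])
vals, _ = jacobi_eig(M)
print(np.allclose(np.sort(vals), np.linalg.eigvalsh(M)))   # True
```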

Journal ArticleDOI
TL;DR: An energy-aware adaptive NN inference implementation is proposed that utilizes one of two exits with different accuracies and computation costs, providing flexibility to trade off accuracy against processing time for different application requirements.
Abstract: Implementing neural network (NN) inference in a millimeter-scale system is challenging due to limited energy and storage size. This article proposes an energy-aware adaptive NN inference implementation that utilizes one of two exits with different accuracies and computation options. The early-exit path provides a shorter processing time but less accuracy than the main-exit path. To compensate for the reduced accuracy, it additionally applies the main-exit path if the entropy of the early-exit inference is higher than a predetermined value. The NN is implemented with a custom low-power 180-nm CMOS processor chip and a 90-nm embedded flash memory chip and tested on the CIFAR-10 dataset. The measurement results show that the implemented convolutional NN (CNN) reduces processing time, and thus energy consumption, by 43.9% compared with a main-exit-only method, while its accuracy decreases from 69.9% to 66.2%. We also explore the minimum battery capacity required at each optimal configuration for accuracy and/or energy consumption to achieve energy-autonomous operation under measured exemplary light profiles. It requires a minimum battery capacity of 855 mJ, acceptable for the target miniature system with two millimeter-scale batteries (684 mJ each). Compared with the state-of-the-art CNN technique (BranchyNet) allowing early stopping, the proposed design improves the accuracy by 0.7% and 3.3% to maintain energy-autonomous operation with two and one millimeter-scale batteries, respectively. Compared with the state-of-the-art lightweight CNN technique (MobileNet), this work provides flexibility with a tradeoff between accuracy and processing time for different application requirements.
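The exit rule described above (stop at the early exit unless its prediction looks too uncertain) can be sketched as an entropy test on the early classifier's softmax output; the threshold and the two model callables below are placeholders:

```python
import numpy as np

def adaptive_infer(x, early_exit, main_exit, entropy_threshold=0.5):
    """Use the cheap early-exit path; fall back to the main path if entropy is high."""
    p = early_exit(x)                                   # class-probability vector
    entropy = -np.sum(p * np.log(p + 1e-12))
    if entropy <= entropy_threshold:                    # confident -> stop early, save energy
        return int(np.argmax(p))
    return int(np.argmax(main_exit(x)))                 # uncertain -> run the full network
```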

Journal ArticleDOI
TL;DR: In this paper, a factored systolic array (FSA) architecture is presented in which a carry propagation adder (CPA) and a carry-save adder (CSA) perform hybrid accumulation on the least significant bits (LSBs) and most significant bits (MSBs), respectively, inside each processing element.
Abstract: Deep learning applications have become ubiquitous in today's era, which has led to vast development in machine learning (ML) accelerators. Systolic arrays have been a primary part of ML accelerator architecture. To fully leverage systolic arrays, it is necessary to explore the computer arithmetic data-path components and their tradeoffs in accelerators. We present a novel factored systolic array (FSA) architecture, in which a carry propagation adder (CPA) and a carry-save adder (CSA) perform hybrid accumulation on the least significant bits (LSBs) and most significant bits (MSBs), respectively, inside each processing element. In addition, a small CPA that completes the accumulation of the MSBs, along with rounding logic, is placed for each column of the array, which not only reduces the area, delay, and power but also balances the combinational and sequential area tradeoffs. We demonstrate the hybrid accumulator with partial CPA factoring in “Gemmini,” an open-source practical systolic array accelerator; the factoring technique does not change the functionality of the base design. We implemented three baselines, the original Gemmini and two variants of it, and show that the proposed approach leads to a significant overall reduction in area in the range of 12.8%–50.2% and in power in the range of 18.6%–41%, with improved or similar delay in comparison to the baselines.

Journal ArticleDOI
TL;DR: A more accurate and robust decryption scheme for ring-BinLWE based on the 2's complement ring is proposed, which significantly improves the decoding rate while adding only a modest resource overhead.
Abstract: Learning with errors (LWE) over a ring based on the binary distribution (ring-BinLWE) has become a potential Internet-of-Things (IoT) confidentiality solution with its anti-quantum-attack properties and uncomplicated calculations. Compared with ring-LWE based on the discrete Gaussian distribution, the decryption scheme of ring-LWE based on the binary distribution needs to be re-determined due to the asymmetry of the error distribution. Directly applying the ring-LWE decryption function based on the discrete Gaussian distribution can cause serious misjudgment. In this article, we propose a more accurate and robust decryption scheme for ring-BinLWE based on the 2's complement ring. Compared with the previous decryption function, the re-derived decryption function significantly improves the decoding rate by 50%. Furthermore, based on the proposed decryption function, high-performance and lightweight hardware architectures for terminal devices in the IoT are proposed, respectively, which are scalable and can be easily adapted to ring-BinLWE hardware deployments with other parameter sets. For the parameter set n = 256, q = 256, the high-performance implementation consumes 7.6k LUTs, 6.2k FFs, and 2.3k slices on a Spartan-6 field-programmable gate array (FPGA) platform. Compared with the previous implementation, our resource overhead increases by only 23% while the decryption accuracy is significantly improved by 50%. The lightweight implementation for the parameter set n = 256, q = 256 consumes only 230 LUTs, 338 FFs, and 84 slices on the Spartan-6 FPGA platform. Compared with the previous work, the area × time (AT) product is reduced by 47.8%, which is more suitable for deployment on resource-constrained IoT nodes.

Journal ArticleDOI
TL;DR: An efficient high-throughput depth engine is proposed to generate high-quality 3-D depth maps for speckle-pattern structured-light depth cameras with significant reduction of computational complexity in contrast to the sum-of-absolute-distance (SAD) method.
Abstract: In this article, an efficient high-throughput depth engine is proposed to generate high-quality 3-D depth maps for speckle-pattern structured-light depth cameras. A dynamic-binarization (DB) method is introduced with a significant reduction of computational complexity in contrast to the sum-of-absolute-distance (SAD) method. The depth map evaluation shows good robustness compared with other window-based correlation methods. Parallel architecture and reuse of intermediate results are employed for efficient hardware implementation. Our design is verified on a field-programmable gate array (FPGA) and implemented in the SMIC 55-nm CMOS technology, achieving a frame rate of 1731.77 fps (640 × 480) with an area efficiency of 3.75 fps/KGE. The proposed engine shows a 2.71× improvement in area efficiency in contrast to the SAD-based implementation. In addition, the subpixel estimation algorithm deployed in postprocessing is optimized for efficient hardware implementation, reducing the gate count by 69.2% without significant performance loss.

Journal ArticleDOI
TL;DR: In this paper , a 12-bit 20-MS/s asynchronous successive approximation register (SAR) analog-to-digital converter (ADC) is presented by using the digital place-and-route (DPR) tools.
Abstract: A 12-bit 20-MS/s asynchronous successive approximation register (SAR) analog-to-digital converter (ADC) designed using digital place-and-route (DPR) tools is presented. The macrocells for the capacitive digital-to-analog converter, the bootstrapped switch, and the dynamic comparator are presented, along with custom standard cells for the dynamic SAR logic. Using these macrocells and custom standard cells, the layout of this SAR ADC is completed with the DPR tools. Several techniques are presented to mitigate the parasitic capacitances, the current density of the metal interconnections, and the nonideal effects caused by the DPR tools. This SAR ADC is fabricated in 40-nm CMOS technology, and its active area is 0.0067 mm². Compared with the full-custom method, the proposed DPR flow completes the interconnection wiring 288 times faster. Its power dissipation is 363 µW at 20 MS/s, and the calculated Walden FoM is 23 fJ/conversion-step at the Nyquist frequency.
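The quoted Walden figure of merit follows the standard definition relating power, sampling rate, and effective resolution; as a rough consistency check under the reported numbers (363 µW, 20 MS/s, 23 fJ/conversion-step at Nyquist), the implied ENOB is about 9.6 bits:

```latex
\mathrm{FoM_W}=\frac{P}{2^{\mathrm{ENOB}}\cdot f_s}
\;\;\Rightarrow\;\;
\mathrm{ENOB}=\log_2\!\frac{P}{\mathrm{FoM_W}\cdot f_s}
=\log_2\!\frac{363\,\mu\mathrm{W}}{23\,\mathrm{fJ}\times 20\,\mathrm{MS/s}}\approx 9.6\ \text{bits}.
```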

Journal ArticleDOI
TL;DR: In this article, an FPGA-based dual-hiding asynchronous-logic (async-logic) advanced encryption standard (AES) accelerator is presented, which is highly resistant to SCAs yet incurs low area/energy overheads.
Abstract: Encryption in a field-programmable gate array (FPGA) often provides a good security solution to protect data privacy in Internet-of-Things systems, but this security solution can be compromised by side-channel attacks (SCAs). In this article, we present an FPGA-based dual-hiding asynchronous-logic (async-logic) advanced encryption standard (AES) accelerator, which is highly resistant to SCAs yet has low area/energy overheads. The proposed AES accelerator achieves vertical (amplitude) SCA hiding via an area-efficient dual-rail mapping approach and a zero-value (ZV) compensated substitution box (S-Box), while enhancing the horizontal (temporal) SCA hiding of async-logic operations via a timing-boundary-free input arrival-time randomizer and a skewed-delay controller. A comprehensive SCA evaluation is performed with 11 SCA models, and we show that our proposed design can offer strong SCA resistance with a measurement-to-disclosure (MTD) of >20 million traces. To the best of our knowledge, our design is the most secure AES design evaluated with the largest number of traces in an FPGA. To compare the design overheads for security, we quantify the figure of merit as the normalized Area × Energy / MTD(All) × 10⁶. The figure of merit of our proposed design is 403× smaller than that of the benchmark dual-rail synchronous-logic design and 95× smaller than that of a reported async-logic design.

Journal ArticleDOI
TL;DR: In this article, a 12-bit column-parallel two-step single-slope analog-to-digital converter (SS ADC) is proposed that simultaneously realizes residue storage and zero-crossing detection.
Abstract: This article presents a 12-bit column-parallel two-step single-slope analog-to-digital converter (SS ADC). By merging the analog memory capacitor and the input sampling capacitor, the proposed two-step SS ADC simultaneously realizes residue storage and zero-crossing detection. The fixed decision point guarantees a static comparator offset. A constant input common-mode level resistor ramp generator, which exploits a current-mode R-2R digital-to-analog converter (DAC) and a variable-feedback R-string DAC, is developed to enhance the ADC linearity limited by the finite common-mode rejection ratio (CMRR) of the operational amplifier. Using a bottom-up foreground self-calibration, the harmonic distortion caused by both parasitic capacitance and resistor mismatch is mitigated. The prototype is fabricated in a 130-nm CMOS process. The proposed two-step SS ADC consumes 62 µW when operating at a 100-kS/s sampling frequency and yields a peak spurious-free dynamic range (SFDR) of 76.47 dB with a signal-to-noise-and-distortion ratio (SNDR) of 60.78 dB. The measured differential nonlinearity (DNL) and integral nonlinearity (INL) are 0.83/−1 and 4.78/−3.31 LSB, respectively.