Showing papers on "Arithmetic logic unit published in 2021"


Journal ArticleDOI
TL;DR: In this article, an 8-bit wide, bit-parallel datapath composed of an arithmetic logic unit and register files for high-throughput oriented SFQ microprocessors based on a gate-level-pipeline structure was demonstrated.
Abstract: We successfully demonstrated an 8-bit-wide, bit-parallel datapath composed of an arithmetic logic unit and register files for high-throughput oriented SFQ microprocessors based on a gate-level-pipeline structure. Achieving high-speed operation in the bit-parallel datapath is difficult because of feedback paths. We used concurrent-flow clocking and counter-flow clocking in combination to solve the timing problem at the feedback path in the datapath, and we optimized the number of JJs and pipeline stages in the register file for solving the timing issue. We designed the datapath with the cell library for the AIST 10 kA/cm$^2$ Advanced Process. The total number of pipeline stages, Josephson junctions, and circuit area of the designed datapath were 52, 18448, and 3.81 mm × 4.05 mm, respectively. We obtained a relatively wide bias margin of the designed datapath at the target clock frequency of 50 GHz, and it operated up to 64 GHz in on-chip high-speed testing.

26 citations


Journal ArticleDOI
TL;DR: This paper presents a QCA technology-based reversible ALU using basic reversible blocks and a novel reversible block, namely the BS1 block, which performs logic and arithmetic operations in the proposed scheme.
Abstract: Quantum dot cellular automata (QCA) technology is considered one of the most suitable replacements to reduce CMOS-based digital circuit design problems at the nanoscale due to its tiny size, low latency, and very low power consumption. The arithmetic logic unit (ALU) is one of the main components of microprocessors; in other words, it acts as their heart. This paper presents a QCA technology-based reversible ALU using basic reversible blocks and a novel reversible block, namely the BS1 block. The proposed block performs logic and arithmetic operations in the proposed scheme. The simulations of the proposed design are carried out with QCADesigner. According to the simulation results, the proposed structure achieves 35%, 27%, and 30% improvements in quantum cost, number of cells, and occupied area, respectively, compared with previous research.

21 citations


Journal ArticleDOI
TL;DR: In this paper, the authors investigate a solution trading-off performance and complexity to execute the lattice-based algorithms CRYSTALS-Kyber and -Dilithium.
Abstract: In recent years, public-key cryptography has become a fundamental component of digital infrastructures. Such a scenario has to face a new and increasing threat, represented by quantum computers. It is well known that quantum computers in the next years will be able to run algorithms capable of breaking the security of currently widespread cryptographic schemes used for public-key cryptography. Post-quantum cryptography aims to define and execute algorithms on classical computer architectures, able to withstand attacks from quantum computers. The National Institute of Standards and Technology is currently running a selection process to define one or more quantum-resistant public-key algorithms and lattice-based cryptographic constructions are considered one of the leading candidates. However, such algorithms require non-negligible computational resources to be executed. One viable solution is to accelerate them totally or partially in hardware, to alleviate the workload of the main processing unit. In this paper, we investigate a solution trading-off performance and complexity to execute the lattice-based algorithms CRYSTALS-Kyber and -Dilithium: we introduce a dedicated Post-Quantum Arithmetic Logic Unit, embedded directly in the pipeline of a RISC-V processor. This results in an almost negligible area overhead with a large impact on the algorithms speed-up and a consistent reduction in the energy required per single operation.
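The abstract does not list the operations of the Post-Quantum ALU. As a rough sketch of the kind of arithmetic that lattice-based schemes rely on, the Python below shows modular add, subtract, and multiply plus a Cooley-Tukey butterfly as used in number-theoretic transforms. The moduli are the published Kyber and Dilithium primes; the function names and the choice of a butterfly as the example are assumptions for illustration, not the paper's design.

```python
# Illustrative sketch only; not the ALU described in the paper.
KYBER_Q = 3329          # prime modulus of CRYSTALS-Kyber
DILITHIUM_Q = 8380417   # prime modulus of CRYSTALS-Dilithium


def mod_add(a, b, q):
    """Addition in Z_q."""
    return (a + b) % q


def mod_sub(a, b, q):
    """Subtraction in Z_q (result is always in 0..q-1)."""
    return (a - b) % q


def mod_mul(a, b, q):
    """Multiplication in Z_q."""
    return (a * b) % q


def ct_butterfly(a, b, zeta, q):
    """Cooley-Tukey butterfly: (a, b) -> (a + zeta*b, a - zeta*b) mod q."""
    t = mod_mul(zeta, b, q)
    return mod_add(a, t, q), mod_sub(a, t, q)


if __name__ == "__main__":
    print(ct_butterfly(1, 2, 3, KYBER_Q))
```

A hardware unit would typically replace the `%` operator with a dedicated reduction circuit; the sketch only shows the arithmetic being computed.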

14 citations


Journal ArticleDOI
TL;DR: Using the 64-Bit arithmetic logic unit and a Pseudo Random Bit Sequence generator as reference circuits, the use of Synopsys tools for superconductor IC design is demonstrated including spice circuit simulations, plotting waveforms, margins analysis, Monte-Carlo simulations, HDL simulations with timing back-annotation, and IC validation including design rule checker and layout-versus-schematic checker.
Abstract: HYPRES developed an advanced design flow and design infrastructure for single-flux-quantum (SFQ) superconductor integrated circuits using standard CMOS based EDA tools along with internally developed tools and has been successfully using this flow for the past several years. The design infrastructure includes the process design kit, advanced simulation methodology, and IC verification rule decks. The superconductor hierarchical circuit analyzer developed by HYPRES serves as bedrock of our simulation methodology facilitating circuit analysis and debugging including extraction of circuit parameter margins, analysis of Monte-Carlo simulations with process corners, as well as automated timing characterization. Using this proven design flow and infrastructure as a knowledge source, we have collaborated with Synopsys to enhance their tools for a full native tool enabled design flow and infrastructure, which represents a significant expansion in design capabilities and capacity for superconducting electronics. Using the 64-Bit arithmetic logic unit and a Pseudo Random Bit Sequence (PRBS) generator as reference circuits, we demonstrate the use of Synopsys tools for superconductor IC design including spice circuit simulations, plotting waveforms, margins analysis, Monte-Carlo simulations, HDL simulations with timing back-annotation, and IC validation including design rule checker and layout-versus-schematic checker.

13 citations


Journal ArticleDOI
TL;DR: In this paper, a multimem SHA-256 accelerator was proposed to reduce the critical path delay and significantly increase the processing rate of the SHA algorithm at the system-on-chip (SoC) level.
Abstract: The development of a low-cost high-performance secure hash algorithm (SHA)-256 accelerator has recently received extensive interest because SHA-256 is important in widespread applications, such as cryptocurrencies, data security, data integrity, and digital signatures. Unfortunately, most current research has focused on the performance of the SHA-256 accelerator itself but not on the system level, in which the data transfer between the external memory and accelerator occupies a large time fraction. In this paper, we address this problem with a novel SHA-256 architecture named the multimem SHA-256 accelerator that achieves high performance at the system on chip (SoC) level. Notably, our accelerator employs three novel techniques, the pipelined arithmetic logic unit (ALU), multimem processing element (PE), and shift buffer in shift buffer out (SBi-SBo), to reduce the critical path delay and significantly increase the processing rate. Experiments on a field-programmable gate array (FPGA) and an application-specific integrated circuit (ASIC) show that the proposed accelerator achieves significantly better processing rate and hardware efficiency than previous works. The accelerator accuracy is verified on a real hardware platform (FPGA ZCU102). The accelerator is synthesized and laid out with 180 nm complementary metal oxide semiconductor (CMOS) technology with a chip size of 8.5 mm × 8.5 mm, consumes 1.86 W, and provides a maximum processing rate of 40.96 Gbps at 80 MHz and 1.8 V. With FPGA Xilinx 16 nm FinFET technology, the accelerator processing rate is as high as 284 Gbps.

11 citations


Journal ArticleDOI
01 Oct 2021-Optik
TL;DR: An all-optical ALU exclusively using Micro-Ring Resonator (MRR) to perform addition and comparison is proposed, which makes the proposed design novel and easy to integrate with VLSIO.

9 citations


Book ChapterDOI
25 Feb 2021
TL;DR: In this paper, a sum module is designed to give high driving capability to the next adder circuits in multi-bit addition while consuming little power.
Abstract: In the past years, applications in digital signal processing, image processing, microprocessors, and logical operations have been executed using the arithmetic logic unit (ALU), taking advantage of the features of the FinFET. FinFET technology is booming in the integrated circuit and chip manufacturing industries. The 1-bit FinFET full adder is divided into three modules: the XOR module, the sum module, and the carry module. In this paper, a sum module is designed. The main requirement of the 1-bit sum module is to give high driving capability to the next adder circuits in multi-bit addition. The circuit used in the sum module must consume little power. The proposed full adder circuit is designed with Cadence Virtuoso tools and 20 nm FinFET technology. The proposed circuit design consumes low power and has less delay and reduced chip area. Other performance characteristics, such as time delay and power consumption, are analyzed by comparison with standard circuits.

7 citations


Journal ArticleDOI
TL;DR: The simulation results indicate that the proposed CNTFET-RRAM integration enables the compact circuit realization with good robustness, and due to the addition of RRAM as circuit element, the proposed ALU has the advantage of non-volatility.
Abstract: Due to the difficulties associated with scaling of silicon transistors, various technologies beyond binary logic processing are actively being investigated. Ternary logic circuit implementation with carbon nanotube field effect transistors (CNTFETs) and resistive random access memory (RRAM) integration is considered as a possible technology option. CNTFETs are currently being preferred for implementing ternary circuits due to their desirable multiple threshold voltage and geometry-dependent properties, whereas the RRAM is used due to its multilevel cell capability which enables storage of multiple resistance states within a single cell. This article presents the 2-trit arithmetic logic unit (ALU) design using CNTFETs and RRAM as the design elements. The proposed ALU incorporates a transmission gate block, a function select block, and various ternary function processing modules. The ALU design optimization is achieved by introducing a controlled ternary adder–subtractor module instead of separate adder and subtractor circuits. The simulations are analyzed and validated using Synopsys HSPICE simulation software with standard 32 nm CNTFET technology under different operating conditions (supply voltages) to test the robustness of the designs. The simulation results indicate that the proposed CNTFET-RRAM integration enables the compact circuit realization with good robustness. Moreover, due to the addition of RRAM as circuit element, the proposed ALU has the advantage of non-volatility.

6 citations


Journal ArticleDOI
TL;DR: The design of a novel multilayer portable, dynamic, fault-tolerant, power-efficient, thermally stable reversible ALU is proposed which is explored through QCA-ES and area density, delay, fault tolerance and thermal stability are investigated.
Abstract: Arithmetic logic unit (ALU), a core component of a processor, is one of the thrust areas of the current research. Presently, ALU is designed by transistor-based CMOS technique and its individual components are placed in different layers. The current design is affected by the limitations of Moore’s law and design complexity. At present, ‘Quantum cellular automata electro-spin (QCA-ES)’ technology is a widely accepted alternative to ‘CMOS’ for minimizing the above-discussed problems. In this research paper, the design of a novel multilayer portable, dynamic, fault-tolerant, power-efficient, thermally stable reversible ALU is proposed which is explored through QCA-ES. All the arithmetic and logical components of ALU are separately placed in different layers. Area density, delay, fault tolerance and thermal stability are investigated. A specific type of gate, known as reversible gate (modified 3:3 ‘TSG’ gate), is used in this proposed design with QCA technology to get the optimized design ALU with low occupied area, complexity, delay and power dissipation. Investigation of a fault-free design and saturated amplitude level (of output) change with respect to temperature increment in the proposed device are also discussed in this paper. Not only the thermal stability (up to a temperature of 6 K) but also an investigation on cell complexity of the 100% fault-free (against multiple cell-omission, cell-displacement, cell-orientation change and extra cell deposition), multilayer nano-device is represented in this work. ‘QCA-Designer’ software is used in this research work to design and develop layout of the proposed components in quantum-field and find out the occupied area, delay and complexity of proposed design. ‘QCA-Pro’ software is used for getting the value of dissipated power.

6 citations


Journal ArticleDOI
01 Apr 2021-Optik
TL;DR: A ripple ring resonator (RRR) structure is proposed and analyzed by coupling two and three optical microring resonators (OMRRs); it gives promising results and has integration compatibility with complementary metal–oxide semiconductor (CMOS) photonic technology.

5 citations


Journal ArticleDOI
TL;DR: This work built upon the 1T1R crossbar topology and adopted a logic design style in which all computations are equivalent to modified memory read operations for higher reliability, performed either in a word-wise or bit-wise manner, owing to an enhanced peripheral circuitry.
Abstract: Resistive switching devices (memristors) constitute a promising device technology that has emerged for the development of future energy-efficient general-purpose computational memories. Research has been done both at device and circuit level for the realization of primitive logic operations with memristors. Likewise, important efforts are placed on the development of logic synthesis algorithms for resistive RAM (ReRAM)-based computing. However, system-level design of computational memories has not been given significant consideration, and developing arithmetic logic unit (ALU) functionality entirely using ReRAM-based word-wise arithmetic operations remains a challenging task. In this context, we present our results in circuit- and system-level design, towards implementing a ReRAM-based general-purpose computational memory with ALU functionality. We built upon the 1T1R crossbar topology and adopted a logic design style in which all computations are equivalent to modified memory read operations for higher reliability, performed either in a word-wise or bit-wise manner, owing to an enhanced peripheral circuitry. Moreover, we present the concept of a segmented ReRAM architecture with functional and topological features that benefit flexibility of data movement and improve latency of multi-level (sequential) in-memory computations. Robust system functionality is validated via LTspice circuit simulations for an n-bit word-wise binary adder, showing promising performance features compared to other state-of-the-art implementations.

Journal ArticleDOI
TL;DR: In this paper, a 3D Ex-OR, parity generator, parity checker, multiplexer, and arithmetic logic unit (ALU) functionality are synthesized using pNML technology.
Abstract: The approach to designing digital circuits using three-dimensional (3D) perpendicular nanomagnetic logic (pNML) is thoroughly investigated. Nanomagnetic logic (NML) technology eventually optimizes the circuit performance in comparison with conventional metal–oxide–semiconductor (MOS) technology, which suffers from the hot carrier, velocity saturation, and short-channel effects, which may considerably degrade device performance. In contrast, nanomagnetic logic is immune to radiation; it behaves as nonvolatile memory and shows zero leakage current, as required for use in high-speed and low-cost nanoelectronics applications. In this paper, novel and organized designs, e.g., for 3D Ex-OR, parity generator, parity checker, multiplexer, and arithmetic logic unit (ALU) functionality, are synthesized using pNML technology. Previous designs are not compact in terms of delay, layer count, or bounded area. To overcome this, new designs for the mentioned functionalities are proposed based on pNML with smaller area and lower latency compared with previous circuits.

Journal ArticleDOI
TL;DR: RISC-V3 is a CPU architecture that is compatible with the RISC-V instruction set but uses an RNS number representation internally to speed up instruction execution and increase system performance; RISC-V suits RNS because it has no flags register, which is expensive to calculate when using an RNS.
Abstract: Redundant number systems (RNS) are a well-known technique to speed up arithmetic circuits. However, in a complete CPU, arithmetic circuits using RNS have so far only been included at the subcircuit level, e.g., inside the Arithmetic Logic Unit (ALU) for realization of the division. Still, extending this approach to create a CPU with a complete data path based on RNS can be beneficial for speeding up data processing, because it avoids conversions in the ALU between RNS and binary number representations. Therefore, in this paper we present a new CPU architecture called RISC-V3, which is compatible with the RISC-V instruction set but uses an RNS number representation internally to speed up instruction execution times and therefore increase the system performance. RISC-V is very suitable for RNS because it does not have a flags register, which is expensive to calculate when using an RNS. To present reliable performance numbers, arithmetic circuits using RNS were realized in different semiconductor technologies. Moreover, an instruction set simulator was used to estimate system performance for a benchmark suite (Embench). Our results show that we are up to 81% faster with the RISC-V3 architecture compared to a binary one, depending on the executed benchmark and CMOS technology.
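The abstract does not say which redundant representation RISC-V3 uses. To illustrate why a redundant data path can avoid slow conversions, the sketch below uses carry-save form, one common redundant number system, chosen here purely as an assumption. A value is held as a (sum, carry) pair, each addition needs only bitwise logic with no carry propagation, and conversion to plain binary happens once at the end. The 32-bit width and all names are hypothetical.

```python
# Illustrative sketch: carry-save form is an assumed example of a redundant
# number system, not necessarily the representation used in RISC-V3.
MASK = 0xFFFFFFFF  # assumed 32-bit word width


def csa(x, y, z):
    """3:2 compressor: returns (s, c) with s + c == x + y + z (mod 2**32)."""
    s = x ^ y ^ z                                # bitwise sum, no carries
    c = ((x & y) | (x & z) | (y & z)) << 1       # carries, shifted into place
    return s & MASK, c & MASK


def cs_add(acc, operand):
    """Add a binary operand to a carry-save accumulator without carry propagation."""
    s, c = acc
    return csa(s, c, operand)


def to_binary(acc):
    """Single carry-propagating addition, needed only when leaving redundant form."""
    s, c = acc
    return (s + c) & MASK


if __name__ == "__main__":
    acc = (0, 0)
    for v in (5, 7, 123456):
        acc = cs_add(acc, v)
    print(to_binary(acc))
```

Keeping intermediate results as `(s, c)` pairs is what makes every `cs_add` step constant depth; the cost is that comparisons and flags need the full conversion, which is consistent with the abstract's remark about the flags register.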

Proceedings ArticleDOI
25 Jun 2021
TL;DR: A four bit arithmetic logic unit which is energy efficient and temperature invariant implemented using the dual mode pass transistor logic using Cadence® Virtuoso® Schematic Editor is presented.
Abstract: In this paper, we present a four bit arithmetic logic unit which is energy efficient and temperature invariant implemented using the dual mode pass transistor logic. The basic logic gates such as NOR and NAND are designed using both CMOS logic and dual mode pass transistor logic and are used in the proposed design. Simulations performed demonstrated that DMPL can reduce the computed worst case delay by 42.39% and 39.13% for NOR and NAND gates respectively in dynamic mode and average power dissipation by 67.96% and 24.09% for NOR and NAND gates respectively in static mode. In the implemented Arithmetic and Logic Unit, we observe a reduction in worst case delay and average power dissipation by 62.67% and 28.28%. The proposed logic was implemented in 90 nm bulk technology using Cadence® Virtuoso® Schematic Editor.

Proceedings ArticleDOI
04 Feb 2021
TL;DR: In this article, a QCA constructed full adder logic circuit based on five input majority gates is designed and simulated, and a 2:1 multiplexer is implemented in multilayer 1-bit ALU to reduce the overall cell count.
Abstract: Quantum-dot Cellular Automata (QCA) is an innovative and promising nanoscale computing paradigm for designing arithmetic digital circuits of any size and can be considered a suitable alternative to the digital Complementary Metal Oxide Semiconductor process. QCA also provides a promising means of improving computation in numerous applications, with high compaction density and the ability to carry out computations at ultra-high switching speeds. The key component of the Central Processing Unit is the Arithmetic Logic Unit (ALU), which executes arithmetic and logical operations, so the design of full adder circuits and low-order multiplexer circuits of different sizes is important. In this proposed work, a QCA full adder logic circuit based on a five-input majority gate is designed and simulated. A QCA multilayer 1-bit ALU structure that can implement both logical and arithmetic operations is also designed. To perform arithmetic operations, a novel 2:1 multiplexer with a reduced cell count is proposed, and this multiplexer is used in the multilayer 1-bit ALU to reduce the overall cell count. The proposed designs of the full adder and other universal gates therefore use fewer cells, which in turn leads to reduced circuit size and lower power dissipation than previous designs. The proposed architecture is simulated and verified using the QCAD tool. The simulation results show that the proposed work achieves superior performance in terms of circuit area and cell count.
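The abstract does not give the gate-level arrangement of its full adder. One well-known majority-logic formulation, shown here as a behavioural sketch and not as the paper's layout, computes the carry with a three-input majority gate and the sum with a five-input majority gate fed by the inverted carry twice. Function names are hypothetical.

```python
# Behavioural sketch of a majority-gate full adder; the exact gate
# arrangement in the paper is not given in the abstract, so this is an
# assumed, commonly used formulation.
def maj3(a, b, c):
    """Three-input majority gate."""
    return 1 if a + b + c >= 2 else 0


def maj5(a, b, c, d, e):
    """Five-input majority gate."""
    return 1 if a + b + c + d + e >= 3 else 0


def inv(a):
    """Inverter."""
    return 1 - a


def full_adder(a, b, cin):
    """Returns (sum, carry_out) using only majority gates and inverters."""
    cout = maj3(a, b, cin)
    s = maj5(a, b, cin, inv(cout), inv(cout))
    return s, cout


if __name__ == "__main__":
    for a in (0, 1):
        for b in (0, 1):
            for cin in (0, 1):
                print(a, b, cin, full_adder(a, b, cin))
```

The identity works because the two inverted-carry inputs tip the five-input vote exactly when one or three of the operand bits are set.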

Journal ArticleDOI
TL;DR: The proposed structure for T-latch has a lower number of cells, occupied area and lower power consumption than existing methods, and up-down circuits are designed for the first time in QCA technology.
Abstract: One of the major problems in designing highly compact integrated circuits is the power consumption of the circuits. Therefore, several technologies have been introduced to overcome the problems facing MOSFET technology. One of these technologies is Quantum-Dot Cellular Automata (QCA), which has several advantages. In this paper, we focus on computational logic gates based on the T-latch circuit. The T-latch is the basis of many circuits in the arithmetic logic unit (ALU). The proposed structure for the T-latch has a lower number of cells, occupied area, and power consumption than existing methods. In the proposed T-latch, compared to the previous best designs, the cross-sectional area was reduced by 6.45% and the power consumption by 44.49%. Also in this paper, a T-latch with a reset terminal and a T-latch with both set and reset terminals were designed for the first time. In addition, using the proposed T-latch, a 3-bit bidirectional up-down counter (204 quantum cells, 0.26 µm² cross-sectional area, and a delay of 5.25 clock cycles), a 3-bit up-down counter with a reset pin, and a 3-bit up-down counter with set and reset terminals were built. The proposed up-down circuits are designed for the first time in QCA technology. All designs and simulations were carried out in QCADesigner software.
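As a behavioural illustration of how toggle elements form a 3-bit up-down counter, the sketch below models each stage as an ideal toggle cell, meaning the stored bit flips when its T input is 1. The toggle equations are the textbook synchronous-counter ones and are an assumption here; the paper's QCA cell layout, clocking zones, and set/reset terminals are not modelled.

```python
# Behavioural sketch only: textbook synchronous up/down counter built from
# ideal toggle elements. The QCA implementation details are not modelled.
def toggle(q, t):
    """Ideal toggle element: next state flips when t == 1."""
    return q ^ t


def step(state, up):
    """Advance a 3-bit counter one clock; state is (q0, q1, q2), q0 = LSB."""
    q0, q1, q2 = state
    if up:
        t1, t2 = q0, q0 & q1                      # toggle when lower bits are all 1
    else:
        t1, t2 = 1 - q0, (1 - q0) & (1 - q1)      # toggle when lower bits are all 0
    return toggle(q0, 1), toggle(q1, t1), toggle(q2, t2)


def value(state):
    """Integer value of the counter state."""
    q0, q1, q2 = state
    return q0 | (q1 << 1) | (q2 << 2)


if __name__ == "__main__":
    s = (0, 0, 0)
    for _ in range(8):
        s = step(s, up=True)
        print(value(s))
```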

Posted Content
TL;DR: Neural Arithmetic Logic Modules are a growing but niche area of interest; this paper reviews the field's progress, starting with the Neural Arithmetic Logic Unit (NALU), analyses the design choices of recent units, and highlights inconsistencies in a fundamental experiment that prevent direct comparison across papers.
Abstract: Neural Arithmetic Logic Modules have become a growing area of interest, though remain a niche field. These units are small neural networks which aim to achieve systematic generalisation in learning arithmetic operations such as {+, -, *, /} while also being interpretive in their weights. This paper is the first in discussing the current state of progress of this field, explaining key works, starting with the Neural Arithmetic Logic Unit (NALU). Focusing on the shortcomings of NALU, we provide an in-depth analysis to reason about design choices of recent units. A cross-comparison between units is made on experiment setups and findings, where we highlight inconsistencies in a fundamental experiment causing the inability to directly compare across papers. We finish by providing a novel discussion of existing applications for NALU and research directions requiring further exploration.
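For readers unfamiliar with the unit the survey starts from, here is a minimal forward-pass sketch of the original NALU formulation in pure Python. It has an additive path, a multiplicative path computed in log space, and a learned gate that mixes them. Parameter names (`W_hat`, `M_hat`, `G`) and the `eps` value are illustrative assumptions; training is not shown.

```python
# Minimal forward-pass sketch of a NALU layer (no training); parameter names
# and eps are illustrative assumptions.
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def matvec(W, x):
    """Matrix-vector product for lists of lists."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def nalu_forward(x, W_hat, M_hat, G, eps=1e-7):
    # Weights are pushed towards {-1, 0, 1} by the tanh * sigmoid product.
    W = [[math.tanh(wh) * sigmoid(mh) for wh, mh in zip(rw, rm)]
         for rw, rm in zip(W_hat, M_hat)]
    a = matvec(W, x)                                    # additive path
    log_x = [math.log(abs(xi) + eps) for xi in x]
    m = [math.exp(v) for v in matvec(W, log_x)]         # multiplicative path
    g = [sigmoid(v) for v in matvec(G, x)]              # gate between paths
    return [gi * ai + (1.0 - gi) * mi for gi, ai, mi in zip(g, a, m)]


if __name__ == "__main__":
    x = [2.0, 3.0]
    big = [[20.0, 20.0]]                                # saturates W to ~[1, 1]
    print(nalu_forward(x, big, big, [[10.0, 10.0]]))    # gate ~1: close to 2 + 3
    print(nalu_forward(x, big, big, [[-10.0, -10.0]]))  # gate ~0: close to 2 * 3
```

The log-space path takes `abs(x)`, so it loses sign information, which is one of the commonly discussed limitations of the unit.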

Book ChapterDOI
01 Jan 2021
TL;DR: Different types of full adders are implemented using CNTFET and their power delay product (PDP) is analysed for single and multiple threshold voltages of CNTFET, and the low and high PDP of full adders are identified.
Abstract: Adder is a basic building block of the arithmetic logic unit (ALU). Designing of optimized adder circuit inherently makes a pavement for obtaining optimized ALU design. The implementation of metal–oxide–semiconductor field-effect transistor (MOSFET)-based very large-scale integration (VLSI) circuits in the nanoscale range is reached saturation condition. This is due to the MOSFET that meets significant issues like producing more leakage current and highly dependent on PVT variation during nanoscale fabrication. The carbon nanotube field-effect transistor (CNTFET) can overcome the demerits of MOSFET, and it supports low-power, delay-optimized VLSI circuit design. In this paper, different types of full adders are implemented using CNTFET and their power delay product (PDP) is analysed for single and multiple threshold voltages of CNTFET. From the simulation, the low and high PDP of full adders are identified. The PDP of full adders is optimized by varying the threshold voltage of CNTFET. The simulation is carried out using the HSPICE simulation tool. The Stanford University 32-nm-CNTFET model is used for the simulation.

Book ChapterDOI
01 Jan 2021
TL;DR: In this article, the design of an important Quantum Information Processor (QIP) module, i.e., the arithmetic logic unit (ALU), is shown, with the entire design made on top of the quantum Clifford+T group.
Abstract: The quest of efficient quantum circuit is to achieve quantum supremacy in theory as well as in practice. The foremost obstacle is to protect the cohesive time of extremely fragile quantum states from inherent noise. To address this issue, Quantum Error Correction Code (QECC) with fault-tolerant quantum circuit is most desirable. Aiming to contribute toward designing an efficient Quantum Information Processor (QIP); in this work, we have shown the design of an important QIP module, i.e., Arithmetic Logic Unit (ALU). The entire design has been made on top of quantum Clifford+T-group. In the design phase, initially, we formulate a 1-bit design and then to make a generalized representation of the ALU, multiple smaller modules have been integrated. For ensuring improved features in this component, the design has been made fault-tolerant, circuit optimization rules are executed to minimize the design metrics and parallelism in high latency T-gate is ensured. In a way to check the functional correctness of our proposed design, several logical operations have been successfully tested over it.

19 May 2021
TL;DR: This project presents the design of an 8-bit arithmetic logic unit (ALU) with eight different operations; the ALU is a crucial part of a digital computer, designed to compute all the arithmetic and logic operations, including decoding operations.
Abstract: This project presents the design of an 8-bit Arithmetic Logic Unit (ALU) with eight different operations. The ALU is one of the most crucial parts of a digital computer; it is designed to compute all the arithmetic and logic operations, including decoding operations, that need to be done for almost any data processed by the central processing unit (CPU). In digital circuit applications, important attributes to consider include maximizing speed and minimizing power consumption. Higher power consumption results in more heat dissipation and higher cooling cost, and makes the system more prone to failures and malfunctions. However, while low power consumption can reduce heat-related problems, it can also reduce the performance of the system. Therefore, this project covers the design of an 8-bit ALU with eight operations using Intel Quartus Prime Development Suite, followed by determining the optimized voltage and frequency using Maplesoft Maple. It then verifies the functionality and performance of an ALU that implements the dynamic voltage and frequency scaling (DVFS) technique and optimizes the ALU's frequency, timing, and power by maximizing power saving. The project focuses on three conditions: high-performance, optimized, and low-performance tests. The integration of all sub-modules creates an efficient and effective solution for ALU design. The graph generated from Maple supports the DVFS technique, which improved timing performance by 25% over the conventional technique.
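The abstract does not name the eight operations. The behavioural sketch below assumes a typical set (ADD, SUB, AND, OR, XOR, NOT, SHL, SHR) selected by a 3-bit opcode, with results truncated to 8 bits. The opcode assignments and names are hypothetical, and DVFS is not modelled.

```python
# Behavioural sketch of an 8-bit, eight-operation ALU. The operation set and
# opcode encoding are assumptions; the abstract does not specify them.
MASK8 = 0xFF

ADD, SUB, AND, OR, XOR, NOT, SHL, SHR = range(8)  # hypothetical 3-bit opcodes

_OPS = {
    ADD: lambda a, b: a + b,
    SUB: lambda a, b: a - b,
    AND: lambda a, b: a & b,
    OR:  lambda a, b: a | b,
    XOR: lambda a, b: a ^ b,
    NOT: lambda a, b: ~a,        # unary: operand b is ignored
    SHL: lambda a, b: a << 1,    # shift left by one bit, b ignored
    SHR: lambda a, b: a >> 1,    # logical shift right by one bit, b ignored
}


def alu8(opcode, a, b=0):
    """Apply the selected operation and wrap the result to 8 bits."""
    a &= MASK8
    b &= MASK8
    return _OPS[opcode](a, b) & MASK8


if __name__ == "__main__":
    print(alu8(ADD, 200, 100))   # wraps modulo 256
    print(alu8(SUB, 5, 10))      # two's-complement wraparound
```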

Patent
02 Feb 2021
TL;DR: A direct memory access controller, configured to be used in a computing node of a system on chip (SoC), includes: (1) an input buffer for receiving packets of data coming from an input/output interface of the computing node; (2) a write control module for controlling writing of data extracted from each packet to a local memory of the computing node shared by at least one processor other than the direct memory access controller; and (3) an arithmetic logic unit for executing microprograms.
Abstract: A direct memory access controller, configured to be used in a computing node of a system on chip (SoC), includes: (1) an input buffer for receiving packets of data coming from an input/output interface of the computing node; (2) a write control module for controlling writing of data extracted from each packet to a local memory of the computing node shared by at least one processor other than the direct memory access controller; and (3) an arithmetic logic unit for executing microprograms. The write control module is configured to control the execution by the arithmetic logic unit of at least one microprogram including instruction lines for arithmetic and/or logical calculation concerning only storage addresses for storing the data received by the input buffer for a reorganization of the data in the shared local memory. Optionally, at least one microprogram may be stored in a register, and at least two operating modes (e.g., restart mode and pause mode) of the at least one microprogram stored in the register may be configurable. Exemplary microprograms can (1) provide image processing parameters including sizes of columns of image blocks, (2) provide image processing parameters including numbers of successive pieces of data to be processed which are to be written to successive addresses in the shared local memory, and (3) utilize a sequential write mode and/or an absolute-offset write mode. Microprograms may be selected based on an identifier included in a header of each packet.

Patent
26 Jan 2021
TL;DR: In this article, power reduction in a computer processor based on detection of whether data destined for input to an arithmetic logic unit (ALU) has a particular value has been discussed, where data is written to a register prior to performing an arithmetic or logical operation using the data as an operand.
Abstract: Techniques are described for power reduction in a computer processor based on detection of whether data destined for input to an arithmetic logic unit (ALU) has a particular value. The data is written to a register prior to performing an arithmetic or logical operation using the data as an operand. Depending on a timing of when the data is supplied to the register, the determination is made before or after the data is written to the register, and a memory associated with the register is updated with a result of the determination. Contents of the memory are used to make a decision whether to allow the ALU to perform the arithmetic or logical operation. The memory can be implemented as a non-architectural register.

Patent
01 Apr 2021
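TL;DR: A container measurement system is described in which a depth map acquisition unit, provided in a work machine that loads a container, acquires a depth map of the container, and an arithmetic logic unit uses the depth map to calculate the three-dimensional position of a flat section constituting the container.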
Abstract: This container measurement system comprises a depth map acquisition unit and an arithmetic logic unit. The depth map acquisition unit is provided in a work machine that loads a container and is capable of acquiring a depth map of the container. The arithmetic logic unit processes the depth map of the container acquired by the depth map acquisition unit. The arithmetic logic unit calculates, on the basis of the container depth map, the three-dimensional position of a flat section constituting the container. The arithmetic logic unit calculates, on the basis of the three-dimensional position of the flat section, three-dimensional information containing the three-dimensional position and the three-dimensional shape of the container.

Proceedings ArticleDOI
12 Mar 2021
TL;DR: In this paper, a dual-core digital signal processor (DSP) with a primary core and a secondary core sharing the same data memory and arithmetic logic unit (ALU) was designed to enable harmonic analysis in smart meters.
Abstract: To enable harmonic analysis in smart meters, we design an energy metering chip that supports quasi-synchronous sampling. The quasi-synchronous sampling is realized by the Newton polynomial interpolation algorithm. To implement the overall metering algorithm, we design a dedicated dual-core digital signal processor (DSP) with a primary core and a secondary core sharing the same data memory and arithmetic logic unit (ALU). The DSP also has a special data memory architecture with a dynamic shifting and virtual address mechanism, which simplifies the control in the DSP program. The quasi-synchronous sampling technique can achieve an accuracy of 0.3% in harmonic metering, meeting the requirement of single-phase smart meters.
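
The Newton polynomial interpolation named in this abstract can be sketched in Python as follows. This is a generic textbook form using divided differences, not the chip's implementation: the abstract gives no interpolation order, window placement, or number format, so the `order=3` default, the window-centering rule, and the function names are all assumptions.

```python
def newton_coefficients(xs, ys):
    """Divided-difference coefficients of the Newton interpolating polynomial."""
    coeffs = list(ys)
    n = len(xs)
    for level in range(1, n):
        # Work from the end so lower-order differences are still available.
        for i in range(n - 1, level - 1, -1):
            coeffs[i] = (coeffs[i] - coeffs[i - 1]) / (xs[i] - xs[i - level])
    return coeffs


def newton_evaluate(xs, coeffs, x):
    """Evaluate the Newton form at x with Horner-style nesting."""
    result = coeffs[-1]
    for i in range(len(coeffs) - 2, -1, -1):
        result = result * (x - xs[i]) + coeffs[i]
    return result


def resample(samples, period, new_times, order=3):
    """Estimate the signal at new_times from uniformly spaced samples.

    Assumed scheme: for each requested time, interpolate through the
    order+1 consecutive samples roughly centered on it.
    """
    n = len(samples)
    out = []
    for t in new_times:
        start = int(t / period) - order // 2
        start = max(0, min(start, n - order - 1))  # keep window in range
        xs = [(start + k) * period for k in range(order + 1)]
        ys = samples[start:start + order + 1]
        out.append(newton_evaluate(xs, newton_coefficients(xs, ys), t))
    return out


# A cubic sampled at t = 0..5 is reproduced exactly by cubic interpolation.
samples = [t**3 - 2 * t + 1 for t in range(6)]
print(resample(samples, 1.0, [2.5]))
```

In a metering context the new sample times would be chosen to cover a whole number of signal periods, which is what makes the sampling quasi-synchronous; that scheduling step is not shown here.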

Journal ArticleDOI
TL;DR: Designing and developing cities in an inclusive and sustainable direction is a monumental challenge. Smart city technologies currently offer the most promising solution, but their impact will be significantly reduced without long-term, widespread adoption by citizens.
Abstract: The challenge of how cities can be designed and developed in an inclusive and sustainable direction is monumental. Smart city technologies currently offer the most promising solution for long-term sustainability, but the impact of such solutions will be significantly reduced without long-term, widespread adoption by citizens.

Journal ArticleDOI
TL;DR: In this paper, an extremely low-latency modular multiplier is devised based on a modified algorithm by fully parallelizing and highly optimizing the small-size multipliers and the reduction submodules.
Abstract: The supersingular isogeny key encapsulation (SIKE) protocol, as one of the post-quantum protocol candidates, is widely regarded as the best alternative for curve-based cryptography. However, the long latency, caused by the serial large-degree isogeny computation which is dominated by modular multiplications, has made it less competitive than most popular post-quantum candidates. In this paper, we propose a high-speed and low-latency architecture for our recently presented optimized SIKE algorithm. Firstly, we design a new field arithmetic logic unit (FALU) with many algorithmic transformations and architectural optimizations. Especially, for the FALU, an extremely low-latency modular multiplier is devised based on a modified algorithm by fully parallelizing and highly optimizing the small-size multipliers and the reduction submodules. Secondly, we develop a compact control logic and update the instructions based on the benchmark provided in the newest SIKE library, fitting well with our design. Thirdly, an efficient memory access method is proposed by scheduling the input and output of the arithmetic logic unit (ALU) in two identical RAMs, which can significantly reduce the latency. Finally, we code the proposed architectures using the Verilog language and integrate them into the SIKE library. The implementation results on a Xilinx Virtex-7 FPGA show that for SIKEp751, our design only costs 9.3 ms with a frequency of 155.8 MHz, about $2\times$ faster than the state-of-the-art, and achieves the best area efficiency among existing works. Particularly, the modular multiplier merely needs 16 clock cycles, reducing the delay by nearly one order of magnitude with a small factor of increase in hardware resource.

Book ChapterDOI
05 Mar 2021
TL;DR: In this paper, an intermediate product (IP) shifter, which shifts the operand 1 bit to the right, was designed and implemented using CMOS logic and clock-less techniques, namely Multi-Threshold Null Convention Logic (MTNCL) and the proposed Multi-Threshold Dual-Spacer Dual-Rail Delay-Insensitive Logic (MTD3L).
Abstract: A floating point multiplier (FPM) is one of the building blocks of various components such as arithmetic logic units (ALUs), digital signal processors (DSPs), and computational dynamic range applications. The most commonly used standard for the FPM is Institute of Electrical and Electronics Engineers (IEEE) 754, whose format is divided into three fields: sign, exponent, and mantissa. The operation of an FPM consists of three stages: pre-normalization, multiplication, and post-normalization. The normalization process is of utmost importance for any floating point computation. Thus, this paper deals with the post-normalization process of 32-bit and 64-bit FPMs using an intermediate product (IP) shifter design, which shifts the operand 1 bit to the right. For single precision and double precision FPMs, we require 47-bit and 105-bit IP shifters using 2:1 multiplexers. The IP shifter is designed and implemented using various approaches such as CMOS logic and clock-less techniques: Multi-Threshold Null Convention Logic (MTNCL) and the proposed Multi-Threshold Dual-Spacer Dual-Rail Delay-Insensitive Logic (MTD3L). The IP shifter is designed at gate level using Mentor Graphics EDA tools with 130 nm technology, and the proposed technique is compared with existing approaches in terms of power dissipation, delay, and power-delay product (PDP).
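
The post-normalization step can be illustrated with a small bit-level Python model. It rests on one plausible reading of the abstract, not on the chapter's actual circuit: the product of two 24-bit (or 53-bit) mantissas with the hidden bit included is 48 (or 106) bits wide, and a 1-bit right shift built from 2:1 multiplexers yields the 47-bit (or 105-bit) result. The names `mux2` and `post_normalize` are hypothetical, and rounding and sticky-bit handling are left out.

```python
def mux2(sel, a, b):
    """2:1 multiplexer on single bits: returns b when sel is 1, else a."""
    return b if sel else a


def post_normalize(product, mant_bits=24):
    """Shift an intermediate mantissa product right by one bit if needed.

    product: integer product of two normalized mantissas, each mant_bits
    wide with the hidden bit included, so it is 2 * mant_bits bits wide.
    Returns (shifted_product, exponent_increment).
    """
    width = 2 * mant_bits
    # The top bit is set when the product is >= 2.0 and must be renormalized.
    sel = (product >> (width - 1)) & 1
    shifted = 0
    for i in range(width - 1):  # one 2:1 mux per output bit: 47 or 105 of them
        bit = mux2(sel, (product >> i) & 1, (product >> (i + 1)) & 1)
        shifted |= bit << i
    return shifted, sel


# 1.5 * 1.5 = 2.25 in single precision: the top bit is set, so shift once.
m = 0xC00000  # 1.5 as a 24-bit mantissa with hidden bit
result, exp_inc = post_normalize(m * m)
print(hex(result), exp_inc)
```

After the shift the leading one always sits at bit `width - 2`, which is the normalized position in this model, and the returned increment is added to the exponent.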

Patent
02 Apr 2021
TL;DR: In this article, an IT device consisting of a plurality of ALUs (9), a set of registers (11), a memory (13), and a control unit (5) controlling the ALUs is described.
Abstract: The invention relates to an IT device comprising: a plurality of ALUs (9); a set of registers (11); a memory (13); a memory interface between the registers (11) and the memory (13); a control unit (5) controlling the ALUs (9), generating: at least one cycle i including both the implementation of at least one first calculation by an arithmetic logic unit (9) and the downloading of a first data set (AA4_7; BB4_7) from the memory (13) to at least one register (11); and at least one cycle ii, subsequent to the at least one cycle i, including the implementation of a second calculation by an arithmetic logic unit (9), for which at least part (A4; B4) of the first data set (AA4_7; BB4_7) forms at least one operand.

Patent
18 May 2021
TL;DR: In this paper, a processor includes a front end including circuitry to decode a first instruction to set a performance register for an execution unit and a second instruction, and an allocator with circuitry to assign the second instruction to the execution unit.
Abstract: A processor includes a front end including circuitry to decode a first instruction to set a performance register for an execution unit and a second instruction, and an allocator including circuitry to assign the second instruction to the execution unit to execute the second instruction. The execution unit includes circuitry to select between a normal computation and an accelerated computation based on a mode field of the performance register, perform the selected computation, and select between a normal result associated with the normal computation and an accelerated result associated with the accelerated computation based on the mode field.

Patent
23 Mar 2021
TL;DR: In this article, a processor may comprise a plurality of processing elements (PEs) that each may comprise an arithmetic logic unit (ALU), a data buffer associated with the ALU, and an indicator associated with the data buffer to indicate whether a piece of data inside the data buffer is to be reused for repeated execution of a same instruction as a pipeline stage.
Abstract: Processors, systems and methods are provided for thread level parallel processing. A processor may comprise a plurality of processing elements (PEs) that each may comprise an arithmetic logic unit (ALU), a data buffer associated with the ALU, and an indicator associated with the data buffer to indicate whether a piece of data inside the data buffer is to be reused for repeated execution of a same instruction as a pipeline stage.