
Showing papers in "arXiv: Hardware Architecture in 2016"


Posted Content•
TL;DR: In this article, the authors present a HW accelerator optimized for BinaryConnect CNNs that achieves 1510 GOp/s on a core area of only 1.33 MGE and with a power dissipation of 153 mW in UMC 65 nm technology at 1.2 V.
Abstract: Convolutional Neural Networks (CNNs) have revolutionized the world of image classification over the last few years, pushing computer vision close to and beyond human accuracy. The computational effort of today's CNNs requires power-hungry parallel processors and GP-GPUs. Recent efforts in designing CNN Application-Specific Integrated Circuits (ASICs) and accelerators for System-On-Chip (SoC) integration have achieved very promising results. Unfortunately, even these highly optimized engines are still above the power envelope imposed by mobile and deeply embedded applications and face hard limitations caused by CNN weight I/O and storage. On the algorithmic side, highly competitive classification accuracy can be achieved by properly training CNNs with binary weights. This novel algorithmic approach brings major optimization opportunities in the arithmetic core, by removing the need for expensive multiplications, as well as in weight storage and I/O costs. In this work, we present a HW accelerator optimized for BinaryConnect CNNs that achieves 1510 GOp/s on a core area of only 1.33 MGE and with a power dissipation of 153 mW in UMC 65 nm technology at 1.2 V. Our accelerator outperforms the state of the art in terms of ASIC energy efficiency as well as area efficiency, with 61.2 TOp/s/W and 1135 GOp/s/MGE, respectively.
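
Illustrative arithmetic (not part of the abstract): the short sketch below recomputes the two efficiency metrics from the throughput, area, and power figures quoted above. The variable names are ours. The area efficiency matches the quoted 1135 GOp/s/MGE. Dividing the 1.2 V throughput by the 1.2 V power gives roughly 9.9 TOp/s/W, so the quoted 61.2 TOp/s/W presumably refers to a different operating point; the revised abstract in the next entry attributes it to 0.6 V.

```python
# Figures quoted in the abstract above.
throughput_gops = 1510.0   # GOp/s
core_area_mge = 1.33       # million gate equivalents (MGE)
power_w = 0.153            # 153 mW at 1.2 V

# Area efficiency: throughput per unit of core area.
area_efficiency = throughput_gops / core_area_mge        # GOp/s/MGE

# Energy efficiency at the 1.2 V operating point, converted to TOp/s/W.
energy_efficiency = throughput_gops / power_w / 1000.0   # TOp/s/W

print(f"area efficiency:   {area_efficiency:.0f} GOp/s/MGE")
print(f"energy efficiency: {energy_efficiency:.2f} TOp/s/W at 1.2 V")
```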

162 citations


Posted Content•
TL;DR: This paper presents an accelerator optimized for binary-weight CNNs that significantly outperforms the state-of-the-art in terms of energy and area efficiency and removes the need for expensive multiplications, as well as reducing I/O bandwidth and storage.
Abstract: Convolutional neural networks (CNNs) have revolutionized the world of computer vision over the last few years, pushing image classification beyond human accuracy. The computational effort of today's CNNs requires power-hungry parallel processors or GP-GPUs. Recent developments in CNN accelerators for system-on-chip integration have reduced energy consumption significantly. Unfortunately, even these highly optimized devices are above the power envelope imposed by mobile and deeply embedded applications and face hard limitations caused by CNN weight I/O and storage. This prevents the adoption of CNNs in future ultra-low power Internet of Things end-nodes for near-sensor analytics. Recent algorithmic and theoretical advancements enable competitive classification accuracy even when limiting CNNs to binary (+1/-1) weights during training. These new findings bring major optimization opportunities in the arithmetic core by removing the need for expensive multiplications, as well as reducing I/O bandwidth and storage. In this work, we present an accelerator optimized for binary-weight CNNs that achieves 1510 GOp/s at 1.2 V on a core area of only 1.33 MGE (Million Gate Equivalent) or 0.19 mm$^2$ and with a power dissipation of 895 µW in UMC 65 nm technology at 0.6 V. Our accelerator significantly outperforms the state-of-the-art in terms of energy and area efficiency achieving 61.2 TOp/s/W@0.6 V and 1135 GOp/s/MGE@1.2 V, respectively.
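
Illustrative sketch (not the paper's hardware): the abstract's claim that binary (+1/-1) weights remove the need for multiplications can be seen in a plain dot product. With each weight stored as a single sign bit, every multiply-accumulate becomes an add or a subtract. The function names and the encoding (bit 1 for +1, bit 0 for -1) are assumptions chosen for this example.

```python
def dot_full(activations, weights):
    """Reference dot product using ordinary multiplications."""
    return sum(a * w for a, w in zip(activations, weights))


def dot_binary(activations, sign_bits):
    """Dot product with +1/-1 weights stored as one bit each.

    Assumed encoding: bit 1 means weight +1, bit 0 means weight -1.
    No multiplication is performed, only additions and subtractions.
    """
    acc = 0
    for a, bit in zip(activations, sign_bits):
        acc = acc + a if bit else acc - a
    return acc


activations = [3, -1, 4, 2]
sign_bits = [1, 0, 0, 1]                       # one bit of storage per weight
weights = [1 if b else -1 for b in sign_bits]  # the same weights as +1/-1 values

print(dot_binary(activations, sign_bits), dot_full(activations, weights))
```

Both calls return the same value, while the binary version needs one bit per weight instead of a full fixed-point word, which is where the I/O and storage savings mentioned in the abstract come from.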

158 citations


Journal Article•DOI•
TL;DR: Fulmine, a system-on-chip (SoC) based on a tightly-coupled multi-core cluster augmented with specialized blocks for compute-intensive data processing and encryption functions, supporting software programmability for regular computing tasks is proposed.
Abstract: Near-sensor data analytics is a promising direction for IoT endpoints, as it minimizes energy spent on communication and reduces network load - but it also poses security concerns, as valuable data is stored or sent over the network at various stages of the analytics pipeline. Using encryption to protect sensitive data at the boundary of the on-chip analytics engine is a way to address data security issues. To cope with the combined workload of analytics and encryption in a tight power envelope, we propose Fulmine, a System-on-Chip based on a tightly-coupled multi-core cluster augmented with specialized blocks for compute-intensive data processing and encryption functions, supporting software programmability for regular computing tasks. The Fulmine SoC, fabricated in 65nm technology, consumes less than 20mW on average at 0.8V achieving an efficiency of up to 70pJ/B in encryption, 50pJ/px in convolution, or up to 25MIPS/mW in software. As a strong argument for real-life flexible application of our platform, we show experimental results for three secure analytics use cases: secure autonomous aerial surveillance with a state-of-the-art deep CNN consuming 3.16pJ per equivalent RISC op; local CNN-based face detection with secured remote recognition in 5.74pJ/op; and seizure detection with encrypted data collection from EEG within 12.7pJ/op.

92 citations


Posted Content•
TL;DR: This work presents a new CNN accelerator paradigm and an accompanying automated design methodology that partitions the available FPGA resources into multiple processors, each of which is tailored for a different subset of the CNN convolutional layers.
Abstract: Convolutional neural networks (CNNs) are revolutionizing machine learning, but they present significant computational challenges. Recently, many FPGA-based accelerators have been proposed to improve the performance and efficiency of CNNs. Current approaches construct a single processor that computes the CNN layers one at a time; the processor is optimized to maximize the throughput at which the collection of layers is computed. However, this approach leads to inefficient designs because the same processor structure is used to compute CNN layers of radically varying dimensions. We present a new CNN accelerator paradigm and an accompanying automated design methodology that partitions the available FPGA resources into multiple processors, each of which is tailored for a different subset of the CNN convolutional layers. Using the same FPGA resources as a single large processor, multiple smaller specialized processors increase computational efficiency and lead to a higher overall throughput. Our design methodology achieves 3.8x higher throughput than the state-of-the-art approach on evaluating the popular AlexNet CNN on a Xilinx Virtex-7 FPGA. For the more recent SqueezeNet and GoogLeNet, the speedups are 2.2x and 2.0x.

73 citations


Posted Content•
TL;DR: This paper describes the design of a 1024-core processor chip in 16nm FinFet technology that contains an array of 1024 64-bit RISC processors, 64MB of on-chip SRAM, three 136-bit wide mesh Networks-On-Chip, and 1024 programmable IO pins.
Abstract: This paper describes the design of a 1024-core processor chip in 16nm FinFet technology. The chip ("Epiphany-V") contains an array of 1024 64-bit RISC processors, 64MB of on-chip SRAM, three 136-bit wide mesh Networks-On-Chip, and 1024 programmable IO pins. The chip has taped out and is being manufactured by TSMC. This research was developed with funding from the Defense Advanced Research Projects Agency (DARPA). The views, opinions and/or findings expressed are those of the author and should not be interpreted as representing the official views or policies of the Department of Defense or the U.S. Government.

67 citations


Posted Content•
TL;DR: Buddy, a new mechanism that exploits the analog operation of DRAM to perform bulk bitwise operations completely inside the DRAM chip, is proposed and evaluated, showing that Buddy significantly outperforms the state-of-the-art.
Abstract: Bitwise operations are an important component of modern day programming. Many widely-used data structures (e.g., bitmap indices in databases) rely on fast bitwise operations on large bit vectors to achieve high performance. Unfortunately, in existing systems, regardless of the underlying architecture (e.g., CPU, GPU, FPGA), the throughput of such bulk bitwise operations is limited by the available memory bandwidth. We propose Buddy, a new mechanism that exploits the analog operation of DRAM to perform bulk bitwise operations completely inside the DRAM chip. Buddy consists of two components. First, simultaneous activation of three DRAM rows that are connected to the same set of sense amplifiers enables us to perform bitwise AND and OR operations. Second, the inverters present in each sense amplifier enable us to perform bitwise NOT operations, with modest changes to the DRAM array. These two components make Buddy functionally complete. Our implementation of Buddy largely exploits the existing DRAM structure and interface, and incurs low overhead (1% of DRAM chip area). Our evaluations based on SPICE simulations show that, across seven commonly-used bitwise operations, Buddy provides between 10.9X and 25.6X improvement in raw throughput and between 25.1X and 59.5X reduction in energy consumption. We evaluate three real-world data-intensive applications that exploit bitwise operations: 1) bitmap indices, 2) BitWeaving, and 3) bitvector-based implementation of sets. Our evaluations show that Buddy significantly outperforms the state-of-the-art.
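
Illustrative sketch (a software model, not the DRAM circuit): one way to reason about the two components described above is to assume that activating three rows at once leaves the bitwise majority of the three rows on the sense amplifiers. That majority model is an assumption of this sketch and is not stated in the abstract. Under it, a control row of all zeros yields AND and a control row of all ones yields OR. Rows are modeled as Python integers, and all function names are hypothetical.

```python
def maj3(a, b, c):
    """Bitwise majority of three equal-width bit vectors held in ints."""
    return (a & b) | (b & c) | (a & c)


def bulk_and(a, b):
    # Control row of all zeros: majority(a, b, 0) equals a AND b.
    return maj3(a, b, 0)


def bulk_or(a, b, width):
    # Control row of all ones: majority(a, b, 1...1) equals a OR b.
    return maj3(a, b, (1 << width) - 1)


def bulk_not(a, width):
    # Models the sense-amplifier inverter: flip every bit of the row.
    return ~a & ((1 << width) - 1)


WIDTH = 8
row_a = 0b11001010
row_b = 0b10100110

print(format(bulk_and(row_a, row_b), "08b"))
print(format(bulk_or(row_a, row_b, WIDTH), "08b"))
print(format(bulk_not(row_a, WIDTH), "08b"))

# AND plus NOT gives NAND, which is why the abstract can call the
# pair of components functionally complete.
nand = bulk_not(bulk_and(row_a, row_b), WIDTH)
```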

65 citations


Journal Article•DOI•
TL;DR: This dissertation provides a detailed analysis of DRAM latency by using both circuit-level simulation with a detailed DRAM model and FPGA-based profiling of real DRAM modules, and proposes a new technique, Architectural-Variation-Aware DRAM (AVA-DRAM), which reduces DRAM latency at low cost.
Abstract: In modern systems, DRAM-based main memory is significantly slower than the processor. Consequently, processors spend a long time waiting to access data from main memory, making the long main memory access latency one of the most critical bottlenecks to achieving high system performance. Unfortunately, the latency of DRAM has remained almost constant in the past decade. This is mainly because DRAM has been optimized for cost-per-bit, rather than access latency. As a result, DRAM latency is not reducing with technology scaling, and continues to be an important performance bottleneck in modern and future systems. This dissertation seeks to achieve low latency DRAM-based memory systems at low cost in three major directions. The key idea of these three major directions is to enable and exploit latency heterogeneity in DRAM architecture. First, based on the observation that long bitlines in DRAM are one of the dominant sources of DRAM latency, we propose a new DRAM architecture, Tiered-Latency DRAM (TL-DRAM), which divides the long bitline into two shorter segments using an isolation transistor, allowing one segment to be accessed with reduced latency. Second, we propose a fine-grained DRAM latency reduction mechanism, Adaptive-Latency DRAM, which optimizes DRAM latency for the common operating conditions of individual DRAM modules. We observe that DRAM manufacturers incorporate a very large timing margin as a provision against the worst-case operating conditions, which is accessing the slowest cell across all DRAM products with the worst latency at the highest temperature, even though such a slowest cell and such an operating condition are rare. Our mechanism dynamically optimizes DRAM latency to the current operating condition of the accessed DRAM module, thereby reliably improving system performance. Third, we observe that cells closer to the peripheral logic can be much faster than cells farther from the peripheral logic (a phenomenon we call architectural variation). Based on this observation, we propose a new technique, Architectural-Variation-Aware DRAM (AVA-DRAM), which reduces DRAM latency at low cost, by profiling and identifying only the inherently slower regions in DRAM to dynamically determine the lowest latency DRAM can operate at without causing failures. This dissertation provides a detailed analysis of DRAM latency by using both circuit-level simulation with a detailed DRAM model and FPGA-based profiling of real DRAM modules. Our latency analysis shows that our low latency DRAM mechanisms enable significant latency reductions, leading to large improvement in both system performance and energy efficiency across a variety of workloads in our evaluated systems, while ensuring reliable DRAM operation.

47 citations


Posted Content•
TL;DR: GRVI is an FPGA-efficient RISC-V RV32I soft processor, and Phalanx is a parallel processor and accelerator array framework that groups processors and accelerators into shared memory clusters interconnected by a 300-bit-wide Hoplite NOC.
Abstract: GRVI is an FPGA-efficient RISC-V RV32I soft processor. Phalanx is a parallel processor and accelerator array framework. Groups of processors and accelerators form shared memory clusters. Clusters are interconnected with each other and with extreme bandwidth I/O and memory devices by a 300-bit-wide Hoplite NOC. An example Kintex UltraScale KU040 system has 400 RISC-V cores, peak throughput of 100,000 MIPS, peak shared memory bandwidth of 600 GB/s, NOC bisection bandwidth of 700 Gbps, and uses 13 W.

44 citations


Posted Content•
TL;DR: This work proposes an FPGA accelerator with a new architecture of deeply pipelined OpenCL kernels, which achieves a similar peak performance of 33.9 GOPS with a 34% resource reduction on DSP blocks compared to previous work; the design is openly accessible and can be reused to explore new architectures for neural network accelerators.
Abstract: Convolutional neural networks (CNNs) have been widely employed in many applications such as image classification, video analysis and speech recognition. Being compute-intensive, CNN computations are mainly accelerated by GPUs with high power dissipations. Recently, studies were carried out exploiting FPGA as CNN accelerator because of its reconfigurability and energy efficiency advantage over GPU, especially when OpenCL-based high-level synthesis tools are now available providing fast verification and implementation flows. Previous OpenCL-based design only focused on creating a generic framework to identify performance-related hardware parameters, without utilizing FPGA's special capability of pipelining kernel functions to minimize memory bandwidth requirement. In this work, we propose an FPGA accelerator with a new architecture of deeply pipelined OpenCL kernels. Data reuse and task mapping techniques are also presented to improve design efficiency. The proposed schemes are verified by implementing two representative large-scale CNNs, AlexNet and VGG on Altera Stratix-V A7 FPGA. We have achieved a similar peak performance of 33.9 GOPS with a 34% resource reduction on DSP blocks compared to previous work. Our design is openly accessible and thus can be reused to explore new architectures for neural network accelerators.

39 citations


Posted Content•
TL;DR: A new fault-tolerant majority voter is proposed which is found to be more robust to faults than the existing voters in the presence of faults occurring internally and/or externally to the voter.
Abstract: For digital system designs, triple modular redundancy (TMR), which is a 3-tuple version of N-modular redundancy, is widely preferred for many mission-control and safety-critical applications. The TMR scheme involves two-times duplication of the simplex system hardware, with a majority voter ensuring correctness provided at least two out of three copies of the system remain operational. Thus the majority voter plays a pivotal role in ensuring the correct operation of the system. The fundamental assumption implicit in the TMR scheme is that the majority voter does not become faulty, which may not hold well for implementations based on the latest technology nodes with dimensions of the order of just tens of nanometers. To overcome the drawbacks of the classical majority voter, some new voter designs were put forward in the literature with the aim of enhancing the fault tolerance. However, these voter designs generally ensure the correct system operation in the presence of either a faulty function module or a faulty voter, considered only in isolation. Since multiple faults may no longer be excluded in the nanoelectronics regime, simultaneous fault occurrences on both the function module and the voter should be considered, and the fault tolerance of the voters has to be analyzed under such a scenario. In this context, this article proposes a new fault-tolerant majority voter which is found to be more robust to faults than the existing voters in the presence of faults occurring internally and/or externally to the voter. Moreover, the proposed voter features lower power dissipation, delay, and area metrics based on the simulation results obtained by using a 32/28nm CMOS process.
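
Illustrative sketch (the classical voter, not the new voter proposed in the article): the code below models a bitwise 2-out-of-3 majority voter and the three situations the abstract distinguishes: one faulty module, two faulty modules, and a fault inside the voter itself. The fault model (a single flipped bit, expressed as an XOR mask) and all names are assumptions made for this example.

```python
def majority_vote(a, b, c):
    """Classical TMR voter: each output bit follows at least two of three inputs."""
    return (a & b) | (b & c) | (a & c)


def faulty_voter(a, b, c, fault_mask):
    """Voter with an internal fault, modeled as flipped output bits."""
    return majority_vote(a, b, c) ^ fault_mask


golden = 0b10110010                 # fault-free module output
upset = golden ^ 0b00001000         # the same output with one bit flipped

# One faulty module: the voter masks the error.
masked = majority_vote(golden, golden, upset)

# Two modules hit on the same bit: the voter passes the error through.
unmasked = majority_vote(golden, upset, upset)

# All modules correct but the voter itself is faulty: the output is wrong,
# which is the single point of failure the abstract is concerned with.
voter_fault = faulty_voter(golden, golden, golden, 0b00000001)

print(masked == golden, unmasked == golden, voter_fault == golden)
```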

39 citations


Posted Content•
TL;DR: This thesis proposes page overlays, a framework that augments the existing virtual memory framework with the ability to track a new version of a subset of cache lines within each virtual page, and Gather-Scatter DRAM, a technique that exploits DRAM organization to effectively gather/scatter values with power-of-2 strided access patterns.
Abstract: In most modern systems, the memory subsystem is managed and accessed at multiple different granularities at various resources. We observe that such multi-granularity management results in significant inefficiency in the memory subsystem. Specifically, we observe that 1) page-granularity virtual memory unnecessarily triggers large memory operations, and 2) existing cache-line granularity memory interface is inefficient for performing bulk data operations and operations that exhibit poor spatial locality. To address these problems, we present a series of techniques in this thesis. First, we propose page overlays, a framework that augments the existing virtual memory framework with the ability to track a new version of a subset of cache lines within each virtual page. We show that this extension is powerful by demonstrating its benefits on a number of applications. Second, we show that DRAM can be used to perform more complex operations than just store data. We propose RowClone, a mechanism to perform bulk data copy and initialization completely inside DRAM, and Buddy RAM, a mechanism to perform bulk bitwise operations using DRAM. Both these techniques achieve an order-of-magnitude improvement in the efficiency of the respective operations. Third, we propose Gather-Scatter DRAM, a technique that exploits DRAM organization to effectively gather/scatter values with power-of-2 strided access patterns. For these access patterns, GS-DRAM achieves near-ideal bandwidth and cache utilization, without increasing the latency of fetching data from memory. Finally, we propose the Dirty-Block Index, a new way of tracking dirty blocks. In addition to improving the efficiency of bulk data coherence, DBI has several applications including high-performance memory scheduling, efficient cache lookup bypassing, and enabling heterogeneous ECC.

Journal Article•DOI•
TL;DR: In this article, the authors introduce Pareto curves in the energy/op and mm$^2$/(ops/s) metric space for compute units, accelerators, and on-chip memory/interconnect.
Abstract: The key challenge to improving performance in the age of Dark Silicon is how to leverage transistors when they cannot all be used at the same time. In modern SOCs, these transistors are often used to create specialized accelerators which improve energy efficiency for some applications by 10-1000X. While this might seem like the magic bullet we need, for most CPU applications more energy is dissipated in the memory system than in the processor: these large gains in efficiency are only possible if the DRAM and memory hierarchy are mostly idle. We refer to this desirable state as Dark Memory, and it only occurs for applications with an extreme form of locality. To show our findings, we introduce Pareto curves in the energy/op and mm$^2$/(ops/s) metric space for compute units, accelerators, and on-chip memory/interconnect. These Pareto curves allow us to solve the power, performance, area constrained optimization problem to determine which accelerators should be used, and how to set their design parameters to optimize the system. This analysis shows that memory accesses create a floor to the achievable energy-per-op. Thus high performance requires Dark Memory, which in turn requires co-design of the algorithm for parallelism and locality, with the hardware.
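
Illustrative sketch (hypothetical data, not the paper's): a Pareto curve in the two-metric space named above keeps only the design points that no other point beats on both energy per operation and area per unit of throughput. The design names and numbers below are invented for the example, and lower is better on both axes.

```python
def pareto_front(points):
    """Return the non-dominated points, sorted by energy per op.

    points: list of (name, energy_per_op, area_per_throughput).
    A point is dominated if another point is no worse on both metrics
    and strictly better on at least one.
    """
    front = []
    for name, e, a in points:
        dominated = any(
            e2 <= e and a2 <= a and (e2 < e or a2 < a)
            for _, e2, a2 in points
        )
        if not dominated:
            front.append((name, e, a))
    return sorted(front, key=lambda p: p[1])


# Hypothetical design points: (name, pJ/op, mm^2 per (Gop/s)).
designs = [
    ("A", 1.0, 8.0),
    ("B", 2.0, 4.0),
    ("C", 3.0, 5.0),   # worse than B on both metrics
    ("D", 4.0, 1.0),
    ("E", 1.5, 9.0),   # worse than A on both metrics
]

front = pareto_front(designs)
print([name for name, _, _ in front])
```

Only the points on the front are candidates when solving the constrained optimization the abstract describes; dominated designs can be discarded up front.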

Proceedings Article•DOI•
TL;DR: This paper makes a case for full-stack MCM verification and provides a toolflow, TriCheck, capable of verifying that the HLL, compiler, ISA, and implementation collectively uphold MCM requirements, and showcases TriCheck's ability to evaluate a proposed ISA MCM.
Abstract: Memory consistency models (MCMs), which govern inter-module interactions in a shared memory system, are a significant, yet often under-appreciated, aspect of system design. MCMs are defined at the various layers of the hardware-software stack, requiring thoroughly verified specifications, compilers, and implementations at the interfaces between layers. Current verification techniques evaluate segments of the system stack in isolation, such as proving compiler mappings from a high-level language (HLL) to an ISA or proving validity of a microarchitectural implementation of an ISA. This paper makes a case for full-stack MCM verification and provides a toolflow, TriCheck, capable of verifying that the HLL, compiler, ISA, and implementation collectively uphold MCM requirements. The work showcases TriCheck's ability to evaluate a proposed ISA MCM in order to ensure that each layer and each mapping is correct and complete. Specifically, we apply TriCheck to the open source RISC-V ISA, seeking to verify accurate, efficient, and legal compilations from C11. We uncover under-specifications and potential inefficiencies in the current RISC-V ISA documentation and identify possible solutions for each. As an example, we find that a RISC-V-compliant microarchitecture allows 144 outcomes forbidden by C11 to be observed out of 1,701 litmus tests examined. Overall, this paper demonstrates the necessity of full-stack verification for detecting MCM-related bugs in the hardware-software stack.
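
Illustrative sketch (a toy enumerator, not TriCheck): the abstract counts litmus-test outcomes that are forbidden by the language model yet observable on hardware. To show what a litmus test and a forbidden outcome are, the code below runs the well-known message-passing test under sequential consistency by enumerating every interleaving that preserves each thread's program order. The choice of sequential consistency as the reference model, and all names, are assumptions of this example; the paper itself works with C11 and RISC-V.

```python
from itertools import combinations

# Message-passing litmus test; x and y start at 0.
# Thread 0 publishes data (x), then sets a flag (y).
T0 = [("store", "x", 1), ("store", "y", 1)]
# Thread 1 reads the flag, then reads the data.
T1 = [("load", "y", "r1"), ("load", "x", "r2")]


def interleavings(t0, t1):
    """Yield every merge of t0 and t1 that keeps each thread's order."""
    n = len(t0) + len(t1)
    for slots in combinations(range(n), len(t0)):
        i0, i1 = iter(t0), iter(t1)
        yield [next(i0) if k in slots else next(i1) for k in range(n)]


def run(schedule):
    """Execute one interleaving against a single shared memory."""
    mem = {"x": 0, "y": 0}
    regs = {}
    for op, addr, arg in schedule:
        if op == "store":
            mem[addr] = arg
        else:
            regs[arg] = mem[addr]
    return (regs["r1"], regs["r2"])


sc_outcomes = {run(s) for s in interleavings(T0, T1)}
print(sorted(sc_outcomes))

# Seeing the flag set (r1 == 1) but stale data (r2 == 0) never occurs here.
forbidden = (1, 0)
print(forbidden in sc_outcomes)
```

A checker in the spirit of the abstract would compare such a set of permitted outcomes against what an implementation can actually produce and flag any outcome that appears only on the implementation side.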

Posted Content•
TL;DR: This report aims to identify opportunities where architecture research can bridge the gap between the application and device domains.
Abstract: Application trends, device technologies and the architecture of systems drive progress in information technologies. However, the former engines of such progress - Moore's Law and Dennard Scaling - are rapidly reaching the point of diminishing returns. The time has come for the computing community to boldly confront a new challenge: how to secure a foundational future for information technology's continued progress. The computer architecture community engaged in several visioning exercises over the years. Five years ago, we released a white paper, 21st Century Computer Architecture, which influenced funding programs in both academia and industry. More recently, the IEEE Rebooting Computing Initiative explored the future of computing systems in the architecture, device, and circuit domains. This report stems from an effort to continue this dialogue, reach out to the applications and devices/circuits communities, and understand their trends and vision. We aim to identify opportunities where architecture research can bridge the gap between the application and device domains.

Posted Content•
TL;DR: This paper describes a multi-functional deep in-memory processor for inference applications by embedding pitch-matched low-SNR analog processing into a standard 6T 16KB SRAM array in 65 nm CMOS.
Abstract: This paper describes a multi-functional deep in-memory processor for inference applications. Deep in-memory processing is achieved by embedding pitch-matched low-SNR analog processing into a standard 6T 16KB SRAM array in 65 nm CMOS. Four applications are demonstrated. The prototype achieves up to 56X (97X estimated for multi-bank scenario) energy savings with negligible (<1%) accuracy degradation in all four applications as compared to the conventional architecture.

Posted Content•
TL;DR: This article describes RowClone, a mechanism that exploits DRAM technology to perform bulk copy and initialization operations completely inside main memory, and a complementary work that uses DRAM to perform bulk bitwise AND and OR operations inside main memory; both techniques significantly improve the performance and energy efficiency of the respective operations.
Abstract: In existing systems, the off-chip memory interface allows the memory controller to perform only read or write operations. Therefore, to perform any operation, the processor must first read the source data and then write the result back to memory after performing the operation. This approach consumes high latency, bandwidth, and energy for operations that work on a large amount of data. Several works have proposed techniques to process data near memory by adding a small amount of compute logic closer to the main memory chips. In this article, we describe two techniques proposed by recent works that take this approach of processing in memory further by exploiting the underlying operation of the main memory technology to perform more complex tasks. First, we describe RowClone, a mechanism that exploits DRAM technology to perform bulk copy and initialization operations completely inside main memory. We then describe a complementary work that uses DRAM to perform bulk bitwise AND and OR operations inside main memory. These two techniques significantly improve the performance and energy efficiency of the respective operations.

Proceedings Article•DOI•
TL;DR: In this article, a cross-layer resilience framework is proposed to achieve desired resilience targets at minimal costs (energy, power, execution time, area) by combining resilience techniques across various layers of the system stack (circuit, logic, architecture, software, algorithm).
Abstract: We present a first of its kind framework which overcomes a major challenge in the design of digital systems that are resilient to reliability failures: achieve desired resilience targets at minimal costs (energy, power, execution time, area) by combining resilience techniques across various layers of the system stack (circuit, logic, architecture, software, algorithm). This is also referred to as cross-layer resilience. In this paper, we focus on radiation-induced soft errors in processor cores. We address both single-event upsets (SEUs) and single-event multiple upsets (SEMUs) in terrestrial environments. Our framework automatically and systematically explores the large space of comprehensive resilience techniques and their combinations across various layers of the system stack (586 cross-layer combinations in this paper), derives cost-effective solutions that achieve resilience targets at minimal costs, and provides guidelines for the design of new resilience techniques. We demonstrate the practicality and effectiveness of our framework using two diverse designs: a simple, in-order processor core and a complex, out-of-order processor core. Our results demonstrate that a carefully optimized combination of circuit-level hardening, logic-level parity checking, and micro-architectural recovery provides a highly cost-effective soft error resilience solution for general-purpose processor cores. For example, a 50x improvement in silent data corruption rate is achieved at only 2.1% energy cost for an out-of-order core (6.1% for an in-order core) with no speed impact. However, selective circuit-level hardening alone, guided by a thorough analysis of the effects of soft errors on application benchmarks, provides a cost-effective soft error resilience solution as well (with ~1% additional energy cost for a 50x improvement in silent data corruption rate).

Posted Content•
TL;DR: A number of standard-cell based majority voter designs relevant to TMR architectures are presented, and their power, delay and area parameters are estimated based on physical realization using a 32/28nm CMOS process.
Abstract: N-modular redundancy (NMR) is commonly used to enhance the fault tolerance of a circuit/system, when subject to a fault-inducing environment such as in space or military systems, where upsets due to radiation phenomena, temperature and/or other environmental conditions are anticipated. Triple Modular Redundancy (TMR), which is a 3-tuple version of NMR, is widely preferred for mission-control space, military, and aerospace, and safety-critical nuclear, power, medical, and industrial control and automation systems. The TMR scheme involves the two-times duplication of a simplex system hardware, with a majority voter ensuring correctness provided at least two out of three copies of the hardware remain operational. Thus the majority voter plays a pivotal role in ensuring the correct operation of the TMR scheme. In this paper, a number of standard-cell based majority voter designs relevant to TMR architectures are presented, and their power, delay and area parameters are estimated based on physical realization using a 32/28nm CMOS process.

Posted Content•
TL;DR: This work concentrates on leveraging the additional capacity offered by replacing SRAM-based L2 with Spin-Transfer Torque Random Access Memory (STT-RAM) to accommodate frequently accessed cache blocks in exclusive read mode in favor of reducing the overall read service time.
Abstract: As the capacity and complexity of the on-chip cache memory hierarchy increase, the service cost of critical loads from the Last Level Cache (LLC), which are frequently repeated, has become a major concern. The processor may stall for a considerable interval while waiting to access the data stored in the cache blocks in LLC, if there are no independent instructions to execute. To provide accelerated service to the critical load requests from LLC, this work concentrates on leveraging the additional capacity offered by replacing SRAM-based L2 with Spin-Transfer Torque Random Access Memory (STT-RAM) to accommodate frequently accessed cache blocks in exclusive read mode in favor of reducing the overall read service time. Our proposed technique partitions L2 cache into two STT-RAM arrangements with different write performance and data retention time. The retention-relaxed STT-RAM arrays are utilized to effectively deal with the regular L2 cache requests while the high retention STT-RAM arrays in L2 are selected for maintaining repeatedly read accessed cache blocks from LLC by incurring negligible energy consumption for data retention. Our experimental results show that the proposed technique can reduce the mean L2 read miss ratio by 51.4% and increase the IPC by 11.7% on average across PARSEC benchmark suite while significantly decreasing the total L2 energy consumption compared to conventional SRAM-based L2 design.

Posted Content•
TL;DR: How an overlay fabric that allows the user to rapidly add debug instrumentation to a design can be created and exploited is discussed.
Abstract: FPGAs are going mainstream. Major companies that were not traditionally FPGA-focused are now seeking ways to exploit the benefits of reconfigurable technology and provide it to their customers. In order to do so, a debug ecosystem that provides for effective visibility into a working design and quick debug turn-around times is essential. Overlays have the opportunity to play a key role in this ecosystem. In this overview paper, we discuss how an overlay fabric that allows the user to rapidly add debug instrumentation to a design can be created and exploited. We discuss the requirements of such an overlay and some of the research challenges and opportunities that need to be addressed. To make our exposition concrete, we use two previously-published examples of overlays that have been developed to implement debug instrumentation.

Posted Content•
TL;DR: An FPGA-based fixed-point DNN system that uses only on-chip memory, avoiding external DRAM accesses, is developed, and its execution time and energy consumption are compared with those of a GPU-based implementation.
Abstract: Deep neural networks (DNNs) demand a very large amount of computation and weight storage, and thus efficient implementation using special-purpose hardware is highly desired. In this work, we have developed an FPGA-based fixed-point DNN system that uses only on-chip memory, avoiding external DRAM accesses. The execution time and energy consumption of the developed system are compared with those of a GPU-based implementation. Since the capacity of memory in the FPGA is limited, only 3-bit weights are used for this implementation, and training-based fixed-point weight optimization is employed. The implementation using a Xilinx XC7Z045 is tested on the MNIST handwritten digit recognition benchmark and a phoneme recognition task on the TIMIT corpus. The obtained speed is about one quarter of that of a GPU-based implementation and much better than that of a PC-based one. The power consumption is less than 5 W at full-speed operation, resulting in much higher efficiency compared to GPU-based systems.
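To make the 3-bit weight idea concrete, the following is a minimal sketch of a uniform symmetric quantizer. It is an illustration only: the abstract does not describe the quantizer or the training-based optimization actually used, so the function names, the symmetric code range of -3 to +3, and the fixed step size are all assumptions.

```python
def quantize_weights(weights, step, bits=3):
    """Map each weight to a signed integer code (assumed symmetric range).

    For bits=3 the codes lie in [-3, 3]; values outside the range are clipped.
    """
    max_code = 2 ** (bits - 1) - 1
    codes = []
    for w in weights:
        code = round(w / step)                      # nearest quantization level
        code = max(-max_code, min(max_code, code))  # clip to the code range
        codes.append(code)
    return codes


def dequantize(codes, step):
    """Recover the fixed-point weight values represented by the codes."""
    return [c * step for c in codes]


def storage_bits(num_weights, bits=3):
    """On-chip storage needed for the weight codes alone, in bits."""
    return num_weights * bits
```

For example, `quantize_weights([0.75, -0.25, 0.0, 2.0], step=0.25)` clips the out-of-range value 2.0 to the largest code. A training-based scheme such as the one the abstract mentions would additionally adapt the weights to the quantization grid, which this sketch does not attempt.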

Journal Article•DOI•
TL;DR: This article presents two area/latency optimized gate-level asynchronous full adder designs which correspond to early output logic; the proposed full adders are constructed using the delay-insensitive dual-rail code and adhere to four-phase return-to-zero handshaking.
Abstract: This article presents two area/latency optimized gate level asynchronous full adder designs which correspond to early output logic. The proposed full adders are constructed using the delay-insensitive dual-rail code and adhere to the four-phase return-to-zero handshaking. For an asynchronous ripple carry adder (RCA) constructed using the proposed early output full adders, the relative-timing assumption becomes necessary and the inherent advantages of the relative-timed RCA are: (1) computation with valid inputs, i.e., forward latency is data-dependent, and (2) computation with spacer inputs involves a bare minimum constant reverse latency of just one full adder delay, thus resulting in the optimal cycle time. With respect to different 32-bit RCA implementations, and in comparison with the optimized strong-indication, weak-indication, and early output full adder designs, one of the proposed early output full adders achieves respective reductions in latency by 67.8, 12.3 and 6.1 %, while the other proposed early output full adder achieves corresponding reductions in area by 32.6, 24.6 and 6.9 %, with practically no power penalty. Further, the proposed early output full adders based asynchronous RCAs enable minimum reductions in cycle time by 83.4, 15, and 8.8 % when considering carry-propagation over the entire RCA width of 32-bits, and maximum reductions in cycle time by 97.5, 27.4, and 22.4 % for the consideration of a typical carry chain length of 4 full adder stages, when compared to the least of the cycle time estimates of various strong-indication, weak-indication, and early output asynchronous RCAs of similar size. All the asynchronous full adders and RCAs were realized using standard cells in a semi-custom design fashion based on a 32/28 nm CMOS process technology.
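The dual-rail code and the four-phase return-to-zero protocol named in the abstract can be sketched behaviorally as follows. This is a simplified model under stated assumptions: the rail ordering `(rail1, rail0)` and all function names are hypothetical, and the adder here waits for all inputs to be valid, so it does not reproduce the early output behavior or the gate-level structure of the proposed designs.

```python
SPACER = (0, 0)  # both rails low: no data present


def encode(bit):
    """Map a logic value to an assumed (rail1, rail0) codeword."""
    return (1, 0) if bit else (0, 1)


def is_valid(pair):
    """A codeword is valid data when exactly one rail is high."""
    return pair in ((0, 1), (1, 0))


def decode(pair):
    if not is_valid(pair):
        raise ValueError(f"not a valid dual-rail codeword: {pair}")
    return pair[0]


def full_adder(a, b, cin):
    """Behavioral dual-rail full adder; returns (sum, carry_out).

    Outputs stay at the spacer until every input carries valid data.
    """
    if not (is_valid(a) and is_valid(b) and is_valid(cin)):
        return SPACER, SPACER
    total = decode(a) + decode(b) + decode(cin)
    return encode(total & 1), encode(total >> 1)


def four_phase_cycle(a_bit, b_bit, cin_bit):
    """One handshake cycle: a valid-data phase, then a return-to-zero phase."""
    data_phase = full_adder(encode(a_bit), encode(b_bit), encode(cin_bit))
    spacer_phase = full_adder(SPACER, SPACER, SPACER)
    return data_phase, spacer_phase
```

Calling `four_phase_cycle(1, 1, 0)` yields a data phase whose sum decodes to 0 and whose carry decodes to 1, followed by a spacer phase in which both outputs return to `(0, 0)`.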

Posted Content•
TL;DR: In this article, the authors describe the design of an open-source RISC-V processor core specifically designed for NT operation in tightly coupled multi-core clusters, which introduces instruction extensions and micro-architectural optimizations to increase the computational density and to minimize the pressure towards the shared memory hierarchy.
Abstract: Endpoint devices for the Internet-of-Things not only need to work under an extremely tight power envelope of a few milliwatts, but also need to be flexible in their computing capabilities, from a few kOPS to GOPS. Near-threshold (NT) operation can achieve higher energy efficiency, and performance scalability can be gained through parallelism. In this paper we describe the design of an open-source RISC-V processor core specifically designed for NT operation in tightly coupled multi-core clusters. We introduce instruction extensions and microarchitectural optimizations to increase the computational density and to minimize the pressure on the shared memory hierarchy. For typical data-intensive sensor processing workloads the proposed core is on average 3.5x faster and 3.2x more energy-efficient, thanks to a smart L0 buffer to reduce cache access contentions and support for compressed instructions. SIMD extensions, such as dot-products, and a built-in L0 storage further reduce the shared memory accesses by 8x, reducing contentions by 3.2x. With four NT-optimized cores, the cluster is operational from 0.6 V to 1.2 V, achieving a peak efficiency of 67 MOPS/mW in a low-cost 65 nm bulk CMOS technology. In a low-power 28 nm FDSOI process a peak efficiency of 193 MOPS/mW (40 MHz, 1 mW) can be achieved.

Journal Article•
TL;DR: Experimental results show that integrated software-hardware applications using the proposed tightly-coupled architecture achieve performance comparable to hardware-only accelerators while the proposed architecture provides additional run-time flexibility.
Abstract: FPGA overlays are commonly implemented as coarse-grained reconfigurable architectures with the goal of improving designers' productivity by balancing flexibility and ease of configuration of the underlying fabric. To truly facilitate full application acceleration, it is often necessary to also include a highly efficient processor that integrates and collaborates with the accelerators while maintaining the benefits of being implemented within the same overlay framework. This paper presents an open-source soft processor that is designed to tightly couple with FPGA accelerators as part of an overlay framework. RISC-V is chosen as the instruction set for its openness and portability, and the soft processor is designed as a 4-stage pipeline to balance resource consumption and performance when implemented on FPGAs. The processor is generically implemented so as to promote design portability and compatibility across different FPGA platforms. Experimental results show that integrated software-hardware applications using the proposed tightly-coupled architecture achieve performance comparable to hardware-only accelerators while the proposed architecture provides additional run-time flexibility. The processor has been synthesized for both low-end and high-performance FPGA families from different vendors, achieving a highest frequency of 268.67 MHz and resource consumption comparable to existing RISC-V designs.

Posted Content•
TL;DR: This report makes the case that a well-designed Reduced Instruction Set Computer (RISC) can match, and even exceed, the performance and code density of existing commercial Complex Instruction Set Computers (CISC) while maintaining the simplicity and cost-effectiveness that underpin the original RISC goals.
Abstract: This report makes the case that a well-designed Reduced Instruction Set Computer (RISC) can match, and even exceed, the performance and code density of existing commercial Complex Instruction Set Computers (CISC) while maintaining the simplicity and cost-effectiveness that underpin the original RISC goals. We begin by comparing the dynamic instruction counts and dynamic instruction bytes fetched for the popular proprietary ARMv7, ARMv8, IA-32, and x86-64 Instruction Set Architectures (ISAs) against the free and open RISC-V RV64G and RV64GC ISAs when running the SPEC CINT2006 benchmark suite. RISC-V was designed as a very small ISA to support a wide range of implementations, and has a less mature compiler toolchain. However, we observe that on SPEC CINT2006 RV64G executes on average 16% more instructions than x86-64, 3% more instructions than IA-32, 9% more instructions than ARMv8, but 4% fewer instructions than ARMv7. CISC x86 implementations break up complex instructions into smaller internal RISC-like micro-ops, and the RV64G instruction count is within 2% of the x86-64 retired micro-op count. RV64GC, the compressed variant of RV64G, is the densest ISA studied, fetching 8% fewer dynamic instruction bytes than x86-64. We observed that much of the increased RISC-V instruction count is due to a small set of common multi-instruction idioms. Exploiting this fact, the RV64G and RV64GC effective instruction count can be reduced by 5.4% on average by leveraging macro-op fusion. Combining the compressed RISC-V ISA extension with macro-op fusion provides both the densest ISA and the fewest dynamic operations retired per program, reducing the motivation to add more instructions to the ISA. This approach retains a single simple ISA suitable for both low-end and high-end implementations, where high-end implementations can boost performance through microarchitectural techniques.
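The effect of macro-op fusion on the effective instruction count can be sketched with a small trace-counting model. This is a hedged illustration, not the methodology of the report: the instruction tuple layout, the `FUSIBLE_PAIRS` set, and the fusion condition (the second instruction overwrites and reads the first instruction's destination) are assumptions chosen for the example.

```python
from typing import List, Tuple

# (opcode, destination register, source registers); a hypothetical trace format
Instr = Tuple[str, str, Tuple[str, ...]]

# Illustrative opcode pairs only; a real decoder would use its own idiom list.
FUSIBLE_PAIRS = {("slli", "add"), ("lui", "addi")}


def can_fuse(first: Instr, second: Instr) -> bool:
    """Fuse when the pair is a known idiom and the second instruction
    both reads and overwrites the first instruction's destination."""
    op1, rd1, _ = first
    op2, rd2, srcs2 = second
    return (op1, op2) in FUSIBLE_PAIRS and rd1 == rd2 and rd1 in srcs2


def count_macro_ops(trace: List[Instr]) -> int:
    """Count operations after greedily fusing adjacent fusible pairs."""
    macro_ops = 0
    i = 0
    while i < len(trace):
        if i + 1 < len(trace) and can_fuse(trace[i], trace[i + 1]):
            i += 2  # two instructions retire as one macro-op
        else:
            i += 1
        macro_ops += 1
    return macro_ops


trace = [
    ("slli", "a5", ("a1",)),
    ("add", "a5", ("a5", "a0")),
    ("ld", "a4", ("a5",)),
    ("lui", "t0", ()),
    ("addi", "t0", ("t0",)),
    ("add", "a2", ("a3", "a4")),
]
reduction = 1 - count_macro_ops(trace) / len(trace)
```

In this toy trace two of the six instructions are absorbed by fusion. The example is constructed to show the mechanism, so its reduction ratio says nothing about the 5.4% average reported for SPEC CINT2006.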

Posted Content•
TL;DR: The goal of this work is to understand and exploit design-induced variation to develop low-cost mechanisms to dynamically find and use the lowest latency a DRAM chip can reliably operate at and thus improve overall system performance while ensuring reliable system operation.
Abstract: Variation has been shown to exist across the cells within a modern DRAM chip. Prior work has studied and exploited several forms of this variation, such as manufacturing-process- or temperature-induced variation. We empirically observe a new form of variation that exists within a DRAM chip, induced by the design and placement of different components in the DRAM chip, where different regions in DRAM, based on their relative distance from the peripheral structures, require different minimum access latencies for reliable operation. In particular, cells closer to the peripheral structures can be accessed much faster than cells that are farther away. We call this phenomenon design-induced variation in DRAM. Our goal, in this work, is to understand and exploit design-induced variation to develop low-cost mechanisms to dynamically find and use the lowest latency a DRAM chip can reliably operate at and thus improve overall system performance while ensuring reliable system operation. To this end, we first experimentally demonstrate and analyze design-induced variation in modern DRAM devices by testing and characterizing 96 DIMMs (768 DRAM chips). Our characterization identifies DRAM regions that are vulnerable to errors, if operated at lower latency, and finds consistency in their locations across a given DRAM chip generation, due to design-induced variation. Based on our experimental analysis, we develop two mechanisms that reliably reduce DRAM latency.

Posted Content•
TL;DR: In this paper, the authors proposed a folding-flash 1 GS/s ADC with a folding factor of two, which uses a fully matched input stage, an unbalanced latch stage, and a two-clock operation scheme.
Abstract: We present the design of a low-power 4-bit 1 GS/s folding-flash ADC with a folding factor of two. The design of a new unbalanced double-tail dynamic comparator affords ultra-low-power operation and a high dynamic range. Unlike conventional approaches, this design uses a fully matched input stage, an unbalanced latch stage, and a two-clock operation scheme. A combination of these features yields a significant reduction of the kick-back noise, while allowing the design flexibility for adjusting the trip points of the comparators. As a result, the ADC achieves an SNDR of 22.3 dB at 100 MHz and 21.8 dB at 500 MHz (i.e., the Nyquist frequency). The maximum INL and DNL are about 0.2 LSB. The converter consumes about 700 µW from a 1 V supply, yielding a figure of merit of 65 fJ/conversion-step. These attributes make the proposed folding-flash ADC attractive for next-generation wireless applications.
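The quoted figure of merit can be sanity-checked with a short calculation. The sketch below assumes the commonly used Walden definition, FoM = P / (2^ENOB · f_s) with ENOB = (SNDR − 1.76) / 6.02; the abstract does not state which definition or which SNDR value the authors used, so both choices here are assumptions.

```python
def enob(sndr_db):
    """Effective number of bits from SNDR in dB (standard sine-wave formula)."""
    return (sndr_db - 1.76) / 6.02


def walden_fom(power_w, sndr_db, sample_rate_hz):
    """Walden figure of merit in joules per conversion step (assumed definition)."""
    return power_w / (2 ** enob(sndr_db) * sample_rate_hz)


POWER_W = 700e-6       # about 700 µW, from the abstract
SAMPLE_RATE_HZ = 1e9   # 1 GS/s, from the abstract

fom_low_freq_fj = walden_fom(POWER_W, 22.3, SAMPLE_RATE_HZ) * 1e15
fom_nyquist_fj = walden_fom(POWER_W, 21.8, SAMPLE_RATE_HZ) * 1e15
```

Under these assumptions the result is roughly 66 fJ per conversion step with the 100 MHz SNDR and roughly 70 fJ with the Nyquist SNDR, so the quoted 65 fJ appears closer to the low-frequency figure, given that the power is stated only approximately.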

Posted Content•
TL;DR: The proposed architecture, Non-volatile HTM (NVHTM), leverages large-scale solid state flash memory to realize an optimal memory organization, area and power envelope and is a proof of concept that storage processing is a viable platform for large-scale HTM network models.
Abstract: Hierarchical Temporal Memory (HTM) is a biomimetic machine learning algorithm imbibing the structural and algorithmic properties of the neocortex. Two main functional components of HTM that enable spatio-temporal processing are the spatial pooler and temporal memory. In this research, we explore a scalable hardware realization of the spatial pooler closely coupled with the mathematical formulation of the spatial pooler. This class of neuromorphic algorithms is advantageous in solving a subset of future engineering problems by extracting nonintuitive patterns in complex data. The proposed architecture, Non-volatile HTM (NVHTM), leverages large-scale solid state flash memory to realize an optimal memory organization, area and power envelope. A behavioral model of NVHTM is evaluated against the MNIST dataset, yielding 91.98% classification accuracy. A full custom layout is developed to validate the design in a TSMC 180 nm process. The area and power profile of the spatial pooler are 30.538 mm² and 64.394 mW, respectively. This design is a proof of concept that storage processing is a viable platform for large-scale HTM network models.

Posted Content•
TL;DR: In this paper, a low-power precision-scalable processor for ConvNets or convolutional neural networks (CNN) is implemented in a 40 nm technology, which achieves a peak 102 GOPS running at 204 MHz.
Abstract: A low-power precision-scalable processor for ConvNets or convolutional neural networks (CNN) is implemented in a 40 nm technology. Its 256 parallel processing units achieve a peak 102 GOPS running at 204 MHz. To minimize energy consumption while maintaining throughput, this work is the first to both exploit the sparsity of convolutions and implement dynamic precision scalability, enabling supply and energy scaling. The processor is fully C-programmable, consumes 25-288 mW at 204 MHz and scales efficiency from 0.3-2.6 real TOPS/W. This system thereby outperforms the state of the art by up to 3.9x in energy efficiency.

Posted Content•
TL;DR: FPMax implements four FPUs optimized for latency or throughput workloads in two precisions, fabricated in 28nm UTBB FDSOI, and improves the energy efficiency by about 20%; at 10% activity this saving is almost 2x.
Abstract: FPMax implements four FPUs optimized for latency or throughput workloads in two precisions, fabricated in 28 nm UTBB FDSOI. Each unit's parameters, e.g., pipeline stages, Booth encoding, etc., were optimized to yield 1.42 ns latency at 110 GFLOPS/W (SP) and 1.39 ns latency at 36 GFLOPS/W (DP). At 100% activity, body-bias control improves the energy efficiency by about 20%; at 10% activity this saving is almost 2x. Keywords: FPU, energy efficiency, hardware generator, SOI