arXiv: Hardware Architecture
About: arXiv: Hardware Architecture is an academic journal. The journal publishes majorly in the area(s): Field-programmable gate array & Cache. Over the lifetime, 1558 publications have been published receiving 14120 citations.
Papers published on a yearly basis
TL;DR: This article reviews the "Grid problem," and presents an extensible and open Grid architecture, in which protocols, services, application programming interfaces, and software development kits are categorized according to their roles in enabling resource sharing.
Abstract: "Grid" computing has emerged as an important new field, distinguished from conventional distributed computing by its focus on large-scale resource sharing, innovative applications, and, in some cases, high-performance orientation. In this article, we define this new field. First, we review the "Grid problem," which we define as flexible, secure, coordinated resource sharing among dynamic collections of individuals, institutions, and resources-what we refer to as virtual organizations. In such settings, we encounter unique authentication, authorization, resource access, resource discovery, and other challenges. It is this class of problem that is addressed by Grid technologies. Next, we present an extensible and open Grid architecture, in which protocols, services, application programming interfaces, and software development kits are categorized according to their roles in enabling resource sharing. We describe requirements that we believe any such mechanisms must satisfy, and we discuss the central role played by the intergrid protocols that enable interoperability among different Grid systems. Finally, we discuss how Grid technologies relate to other contemporary technologies, including enterprise integration, application service provider, storage service provider, and peer-to-peer computing. We maintain that Grid concepts and technologies complement and have much to contribute to these other approaches.
TL;DR: This paper evaluates a custom ASIC-called a Tensor Processing Unit (TPU)-deployed in datacenters since 2015 that accelerates the inference phase of neural networks (NN) and compares it to a server-class Intel Haswell CPU and an Nvidia K80 GPU, which are contemporaries deployed in the samedatacenters.
Abstract: Many architects believe that major improvements in cost-energy-performance must now come from domain-specific hardware. This paper evaluates a custom ASIC---called a Tensor Processing Unit (TPU)---deployed in datacenters since 2015 that accelerates the inference phase of neural networks (NN). The heart of the TPU is a 65,536 8-bit MAC matrix multiply unit that offers a peak throughput of 92 TeraOps/second (TOPS) and a large (28 MiB) software-managed on-chip memory. The TPU's deterministic execution model is a better match to the 99th-percentile response-time requirement of our NN applications than are the time-varying optimizations of CPUs and GPUs (caches, out-of-order execution, multithreading, multiprocessing, prefetching, ...) that help average throughput more than guaranteed latency. The lack of such features helps explain why, despite having myriad MACs and a big memory, the TPU is relatively small and low power. We compare the TPU to a server-class Intel Haswell CPU and an Nvidia K80 GPU, which are contemporaries deployed in the same datacenters. Our workload, written in the high-level TensorFlow framework, uses production NN applications (MLPs, CNNs, and LSTMs) that represent 95% of our datacenters' NN inference demand. Despite low utilization for some applications, the TPU is on average about 15X - 30X faster than its contemporary GPU or CPU, with TOPS/Watt about 30X - 80X higher. Moreover, using the GPU's GDDR5 memory in the TPU would triple achieved TOPS and raise TOPS/Watt to nearly 70X the GPU and 200X the CPU.
TL;DR: In this paper, the authors proposed two complementary mechanisms, DARP (Dynamic Access Refresh Parallelization) and SARP (Subarray Access Refresh Parallelization), to address the drawbacks of per-bank refresh by building more efficient techniques to parallelize refreshes and accesses within DRAM.
Abstract: Modern DRAM cells are periodically refreshed to prevent data loss due to leakage. Commodity DDR DRAM refreshes cells at the rank level. This degrades performance significantly because it prevents an entire rank from serving memory requests while being refreshed. DRAM designed for mobile platforms, LPDDR DRAM, supports an enhanced mode, called per-bank refresh, that refreshes cells at the bank level. This enables a bank to be accessed while another in the same rank is being refreshed, alleviating part of the negative performance impact of refreshes. However, there are two shortcomings of per-bank refresh. First, the per-bank refresh scheduling scheme does not exploit the full potential of overlapping refreshes with accesses across banks because it restricts the banks to be refreshed in a sequential round-robin order. Second, accesses to a bank that is being refreshed have to wait. To mitigate the negative performance impact of DRAM refresh, we propose two complementary mechanisms, DARP (Dynamic Access Refresh Parallelization) and SARP (Subarray Access Refresh Parallelization). The goal is to address the drawbacks of per-bank refresh by building more efficient techniques to parallelize refreshes and accesses within DRAM. First, instead of issuing per-bank refreshes in a round-robin order, DARP issues per-bank refreshes to idle banks in an out-of-order manner. Furthermore, DARP schedules refreshes during intervals when a batch of writes are draining to DRAM. Second, SARP exploits the existence of mostly-independent subarrays within a bank. With minor modifications to DRAM organization, it allows a bank to serve memory accesses to an idle subarray while another subarray is being refreshed. Extensive evaluations show that our mechanisms improve system performance and energy efficiency compared to state-of-the-art refresh policies and the benefit increases as DRAM density increases.
TL;DR: The ARM Scalable Vector Extension supports a vector-length agnostic programming model that lets code run and scale automatically across all vector lengths without recompilation and introduces several innovative features that begin to overcome some of the traditional barriers to autovectorization.
Abstract: This article describes the ARM Scalable Vector Extension (SVE). Several goals guided the design of the architecture. First was the need to extend the vector processing capability associated with the ARM AArch64 execution state to better address the computational requirements in domains such as high-performance computing, data analytics, computer vision, and machine learning. Second was the desire to introduce an extension that can scale across multiple implementations, both now and into the future, allowing CPU designers to choose the vector length most suitable for their power, performance, and area targets. Finally, the architecture should avoid imposing a software development cost as the vector length changes and where possible reduce it by improving the reach of compiler auto-vectorization technologies. SVE achieves these goals. It allows implementations to choose a vector register length between 128 and 2,048 bits. It supports a vector-length agnostic programming model that lets code run and scale automatically across all vector lengths without recompilation. Finally, it introduces several innovative features that begin to overcome some of the traditional barriers to autovectorization.
TL;DR: In this article, the authors present a HW accelerator optimized for BinaryConnect CNNs that achieves 1510 GOp/s on a core area of only 1.33 MGE and with a power dissipation of 153 mW in UMC 65 nm technology at 1.2 V.
Abstract: Convolutional Neural Networks (CNNs) have revolutionized the world of image classification over the last few years, pushing the computer vision close beyond human accuracy. The required computational effort of CNNs today requires power-hungry parallel processors and GP-GPUs. Recent efforts in designing CNN Application-Specific Integrated Circuits (ASICs) and accelerators for System-On-Chip (SoC) integration have achieved very promising results. Unfortunately, even these highly optimized engines are still above the power envelope imposed by mobile and deeply embedded applications and face hard limitations caused by CNN weight I/O and storage. On the algorithmic side, highly competitive classification accuracy can be achieved by properly training CNNs with binary weights. This novel algorithm approach brings major optimization opportunities in the arithmetic core by removing the need for the expensive multiplications as well as in the weight storage and I/O costs. In this work, we present a HW accelerator optimized for BinaryConnect CNNs that achieves 1510 GOp/s on a core area of only 1.33 MGE and with a power dissipation of 153 mW in UMC 65 nm technology at 1.2 V. Our accelerator outperforms state-of-the-art performance in terms of ASIC energy efficiency as well as area efficiency with 61.2 TOp/s/W and 1135 GOp/s/MGE, respectively.
Related Journals (5)
IEEE Transactions on Very Large Scale Integration Systems
5K papers, 151.1K citations
IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems
6.7K papers, 241.6K citations
IEEE Transactions on Parallel and Distributed Systems
5.2K papers, 237.8K citations
Journal of Parallel and Distributed Computing
4.2K papers, 120.6K citations
arXiv: Neural and Evolutionary Computing
4.4K papers, 97.7K citations