scispace - formally typeset
Search or ask a question

Showing papers in "arXiv: Hardware Architecture in 2019"


Journal ArticleDOI
TL;DR: In this article, the authors conduct a thorough evaluation on five latest types of modern GPU interconnects: PCIe, NVLink-V1, NVlink-V2, NVSwitch, and NVLinkSLI.
Abstract: High performance multi-GPU computing becomes an inevitable trend due to the ever-increasing demand on computation capability in emerging domains such as deep learning, big data and planet-scale simulations. However, the lack of deep understanding on how modern GPUs can be connected and the real impact of state-of-the-art interconnect technology on multi-GPU application performance become a hurdle. In this paper, we fill the gap by conducting a thorough evaluation on five latest types of modern GPU interconnects: PCIe, NVLink-V1, NVLink-V2, NVLink-SLI and NVSwitch, from six high-end servers and HPC platforms: NVIDIA P100-DGX-1, V100-DGX-1, DGX-2, OLCF's SummitDev and Summit supercomputers, as well as an SLI-linked system with two NVIDIA Turing RTX-2080 GPUs. Based on the empirical evaluation, we have observed four new types of GPU communication network NUMA effects: three are triggered by NVLink's topology, connectivity and routing, while one is caused by PCIe chipset design issue. These observations indicate that, for an application running in a multi-GPU node, choosing the right GPU combination can impose considerable impact on GPU communication efficiency, as well as the application's overall performance. Our evaluation can be leveraged in building practical multi-GPU performance models, which are vital for GPU task allocation, scheduling and migration in a shared environment (e.g., AI cloud and HPC centers), as well as communication-oriented performance tuning.

80 citations


Posted Content
TL;DR: Ambit, a recently-proposed mechanism to perform bulk bitwise operations completely inside main memory, exploits the internal organization and analog operation of DRAM-based memory to achieve low cost, high performance, and low energy.
Abstract: Many applications heavily use bitwise operations on large bitvectors as part of their computation. In existing systems, performing such bulk bitwise operations requires the processor to transfer a large amount of data on the memory channel, thereby consuming high latency, memory bandwidth, and energy. In this paper, we describe Ambit, a recently-proposed mechanism to perform bulk bitwise operations completely inside main memory. Ambit exploits the internal organization and analog operation of DRAM-based memory to achieve low cost, high performance, and low energy. Ambit exposes a new bulk bitwise execution model to the host processor. Evaluations show that Ambit significantly improves the performance of several applications that use bulk bitwise operations, including databases.

77 citations


Posted Content
TL;DR: This paper surveys the prior art on NMC across various dimensions (architecture, applications, tools, etc.) and identifies the key challenges and open issues with future research directions and provides a glimpse of the approach to near-memory computing.
Abstract: The conventional approach of moving data to the CPU for computation has become a significant performance bottleneck for emerging scale-out data-intensive applications due to their limited data reuse. At the same time, the advancement in 3D integration technologies has made the decade-old concept of coupling compute units close to the memory --- called near-memory computing (NMC) --- more viable. Processing right at the "home" of data can significantly diminish the data movement problem of data-intensive applications. In this paper, we survey the prior art on NMC across various dimensions (architecture, applications, tools, etc.) and identify the key challenges and open issues with future research directions. We also provide a glimpse of our approach to near-memory computing that includes i) NMC specific microarchitecture independent application characterization ii) a compiler framework to offload the NMC kernels on our target NMC platform and iii) an analytical model to evaluate the potential of NMC.

39 citations


Posted Content
TL;DR: Ara as discussed by the authors is a vector processor based on the version 0.5 draft of RISC-V's vector extension, implemented in GlobalFoundries 22FDX FD-SOI technology.
Abstract: In this paper, we present Ara, a 64-bit vector processor based on the version 0.5 draft of RISC-V's vector extension, implemented in GlobalFoundries 22FDX FD-SOI technology. Ara's microarchitecture is scalable, as it is composed of a set of identical lanes, each containing part of the processor's vector register file and functional units. It achieves up to 97% FPU utilization when running a 256 x 256 double precision matrix multiplication on sixteen lanes. Ara runs at more than 1 GHz in the typical corner (TT/0.80V/25 oC) achieving a performance up to 33 DP-GFLOPS. In terms of energy efficiency, Ara achieves up to 41 DP-GFLOPS/W under the same conditions, which is slightly superior to similar vector processors found in literature. An analysis on several vectorizable linear algebra computation kernels for a range of different matrix and vector sizes gives insight into performance limitations and bottlenecks for vector processors and outlines directions to maintain high energy efficiency even for small matrix sizes where the vector architecture achieves suboptimal utilization of the available FPUs.

24 citations


Posted Content
TL;DR: In this article, the authors propose a framework that uses machine learning to accurately estimate the optimal performance and resource utilization of a high-level synthesis (HLS) design, which is used to bridge the accuracy gap and enable developers to estimate the throughput or throughput-to-area of hardware design with more than 95% accuracy and alleviates the need to perform actual implementation for estimation.
Abstract: The emergence of High-Level Synthesis (HLS) tools shifted the paradigm of hardware design by making the process of mapping high-level programming languages to hardware design such as C to VHDL/Verilog feasible. HLS tools offer a plethora of techniques to optimize designs for both area and performance, but resource usage and timing reports of HLS tools mostly deviate from the post-implementation results. In addition, to evaluate a hardware design performance, it is critical to determine the maximum achievable clock frequency. Obtaining such information using static timing analysis provided by CAD tools is difficult, due to the multitude of tool options. Moreover, a binary search to find the maximum frequency is tedious, time-consuming, and often does not obtain the optimal result. To address these challenges, we propose a framework, called Pyramid, that uses machine learning to accurately estimate the optimal performance and resource utilization of an HLS design. For this purpose, we first create a database of C-to-FPGA results from a diverse set of benchmarks. To find the achievable maximum clock frequency, we use Minerva, which is an automated hardware optimization tool. Minerva determines the close-to-optimal settings of tools, using static timing analysis and a heuristic algorithm, and targets either optimal throughput or throughput-to-area. Pyramid uses the database to train an ensemble machine learning model to map the HLS-reported features to the results of Minerva. To this end, Pyramid re-calibrates the results of HLS to bridge the accuracy gap and enable developers to estimate the throughput or throughput-to-area of hardware design with more than 95% accuracy and alleviates the need to perform actual implementation for estimation.

24 citations


Posted Content
TL;DR: Applied machine learning applied system-wide to simulation and run-time optimization, and in many individual components, including memory systems, branch predictors, networks-on-chip, and GPUs present a promising future for increasingly automated architectural design.
Abstract: Machine learning has enabled significant benefits in diverse fields, but, with a few exceptions, has had limited impact on computer architecture. Recent work, however, has explored broader applicability for design, optimization, and simulation. Notably, machine learning based strategies often surpass prior state-of-the-art analytical, heuristic, and human-expert approaches. This paper reviews machine learning applied system-wide to simulation and run-time optimization, and in many individual components, including memory systems, branch predictors, networks-on-chip, and GPUs. The paper further analyzes current practice to highlight useful design strategies and identify areas for future work, based on optimized implementation strategies, opportune extensions to existing work, and ambitious long term possibilities. Taken together, these strategies and techniques present a promising future for increasingly automated architectural design.

20 citations


Proceedings ArticleDOI
TL;DR: The BRisc-V Toolbox is an open-source, parameterized, synthesizable set of RTL modules for designing RISC-V based single and multi-core architecture systems, designed with a high degree of modularity.
Abstract: In this work, we introduce a platform for register-transfer level (RTL) architecture design space exploration. The platform is an open-source, parameterized, synthesizable set of RTL modules for designing RISC-V based single and multi-core architecture systems. The platform is designed with a high degree of modularity. It provides highly-parameterized, composable RTL modules for fast and accurate exploration of different RISC-V based core complexities, multi-level caching and memory organizations, system topologies, router architectures, and routing schemes. The platform can be used for both RTL simulation and FPGA based emulation. The hardware modules are implemented in synthesizable Verilog using no vendor-specific blocks. The platform includes a RISC-V compiler toolchain to assist in developing software for the cores, a web-based system configuration graphical user interface (GUI) and a web-based RISC-V assembly simulator. The platform supports a myriad of RISC-V architectures, ranging from a simple single cycle processor to a multi-core SoC with a complex memory hierarchy and a network-on-chip. The modules are designed to support incremental additions and modifications. The interfaces between components are particularly designed to allow parts of the processor such as whole cache modules, cores or individual pipeline stages, to be modified or replaced without impacting the rest of the system. The platform allows researchers to quickly instantiate complete working RISC-V multi-core systems with synthesizable RTL and make targeted modifications to fit their needs. The complete platform (including Verilog source code) can be downloaded at this https URL.

19 citations


Posted Content
TL;DR: The goal of hlslib is to create a community-driven arena of bleeding edge development, which can move quicker, and provides more powerful abstractions than what is provided by vendors; and to collect a wide range of example codes that can be reused directly or inspire other work.
Abstract: High-level synthesis (HLS) tools have brought FPGA development into the mainstream, by allowing programmers to design architectures using familiar languages such as C, C++, and OpenCL. While the move to these languages has brought significant benefits, many aspects of traditional software engineering are still unsupported, or not exploited by developers in practice. Furthermore, designing reconfigurable architectures requires support for hardware constructs, such as FIFOs and shift registers, that are not native to CPU-oriented languages. To address this gap, we have developed hlslib, a collection of software tools, plug-in hardware modules, and code samples, designed to enhance the productivity of HLS developers. The goal of hlslib is two-fold: first, create a community-driven arena of bleeding edge development, which can move quicker, and provides more powerful abstractions than what is provided by vendors; and second, collect a wide range of example codes, both minimal proofs of concept, and larger, real-world applications, that can be reused directly or inspire other work. hlslib is offered as an open source library, containing CMake files, C++ headers, convenience scripts, and examples codes, and is receptive to any contribution that can benefit HLS developers, through general functionality or examples.

18 citations


Posted Content
TL;DR: Buddy Compression is an architecture that makes novel use of compression to utilize a larger buddy-memory from the host or disaggregated memory, effectively increasing the memory capacity of the GPU.
Abstract: GPUs offer orders-of-magnitude higher memory bandwidth than traditional CPU-only systems. However, GPU device memory tends to be relatively small and the memory capacity can not be increased by the user. This paper describes Buddy Compression, a scheme to increase both the effective GPU memory capacity and bandwidth while avoiding the downsides of conventional memory-expanding strategies. Buddy Compression compresses GPU memory, splitting each compressed memory entry between high-speed device memory and a slower-but-larger disaggregated memory pool (or system memory). Highly-compressible memory entries can thus be accessed completely from device memory, while incompressible entries source their data using both on and off-device accesses. Increasing the effective GPU memory capacity enables us to run larger-memory-footprint HPC workloads and larger batch-sizes or models for DL workloads than current memory capacities would allow. We show that our solution achieves an average compression ratio of 2.2x on HPC workloads and 1.5x on DL workloads, with a slowdown of just 1~2%.

17 citations


Proceedings ArticleDOI
TL;DR: In this article, the authors summarize the recent research progress in eNVM-based in-memory processing from various aspects, including the adopted memory technologies, locations of the inmemory processing in the system, supported arithmetics, as well as applied applications.
Abstract: The conventional von Neumann architecture has been revealed as a major performance and energy bottleneck for rising data-intensive applications. %, due to the intensive data movements. The decade-old idea of leveraging in-memory processing to eliminate substantial data movements has returned and led extensive research activities. The effectiveness of in-memory processing heavily relies on memory scalability, which cannot be satisfied by traditional memory technologies. Emerging non-volatile memories (eNVMs) that pose appealing qualities such as excellent scaling and low energy consumption, on the other hand, have been heavily investigated and explored for realizing in-memory processing architecture. In this paper, we summarize the recent research progress in eNVM-based in-memory processing from various aspects, including the adopted memory technologies, locations of the in-memory processing in the system, supported arithmetics, as well as applied applications.

16 citations


Journal ArticleDOI
TL;DR: In this paper, the authors propose a highly adaptable last level STT-RAM cache (HALLS) that allows the LLC configurations and retention time to be adapted to applications' runtime execution requirements.
Abstract: Spin-Transfer Torque RAM (STT-RAM) is widely considered a promising alternative to SRAM in the memory hierarchy due to STT-RAM's non-volatility, low leakage power, high density, and fast read speed. The STT-RAM's small feature size is particularly desirable for the last-level cache (LLC), which typically consumes a large area of silicon die. However, long write latency and high write energy still remain challenges of implementing STT-RAMs in the CPU cache. An increasingly popular method for addressing this challenge involves trading off the non-volatility for reduced write speed and write energy by relaxing the STT-RAM's data retention time. However, in order to maximize energy saving potential, the cache configurations, including STT-RAM's retention time, must be dynamically adapted to executing applications' variable memory needs. In this paper, we propose a highly adaptable last level STT-RAM cache (HALLS) that allows the LLC configurations and retention time to be adapted to applications' runtime execution requirements. We also propose low-overhead runtime tuning algorithms to dynamically determine the best (lowest energy) cache configurations and retention times for executing applications. Compared to prior work, HALLS reduced the average energy consumption by 60.57% in a quad-core system, while introducing marginal latency overhead.

Proceedings ArticleDOI
TL;DR: In this article, an application-tailored data-driven fully automated method for functional approximation of combinational circuits is proposed, which is possible by employing a weighted mean error distance (WMED) metric for steering the circuit approximation process by means of genetic programming.
Abstract: We propose an application-tailored data-driven fully automated method for functional approximation of combinational circuits. We demonstrate how an application-level error metric such as the classification accuracy can be translated to a component-level error metric needed for an efficient and fast search in the space of approximate low-level components that are used in the application. This is possible by employing a weighted mean error distance (WMED) metric for steering the circuit approximation process which is conducted by means of genetic programming. WMED introduces a set of weights (calculated from the data distribution measured on a selected signal in a given application) determining the importance of each input vector for the approximation process. The method is evaluated using synthetic benchmarks and application-specific approximate MAC (multiply-and-accumulate) units that are designed to provide the best trade-offs between the classification accuracy and power consumption of two image classifiers based on neural networks.

Proceedings ArticleDOI
TL;DR: In this article, the Dual Spatial Pattern Prefetcher (DSPatch) is proposed to dynamically adapt to available bandwidth, boosting prefetch count and prefetch coverage when headroom exists and throttling down to achieve high accuracy when the bandwidth utilization is close to peak.
Abstract: High main memory latency continues to limit performance of modern high-performance out-of-order cores. While DRAM latency has remained nearly the same over many generations, DRAM bandwidth has grown significantly due to higher frequencies, newer architectures (DDR4, LPDDR4, GDDR5) and 3D-stacked memory packaging (HBM). Current state-of-the-art prefetchers do not do well in extracting higher performance when higher DRAM bandwidth is available. Prefetchers need the ability to dynamically adapt to available bandwidth, boosting prefetch count and prefetch coverage when headroom exists and throttling down to achieve high accuracy when the bandwidth utilization is close to peak. To this end, we present the Dual Spatial Pattern Prefetcher (DSPatch) that can be used as a standalone prefetcher or as a lightweight adjunct spatial prefetcher to the state-of-the-art delta-based Signature Pattern Prefetcher (SPP). DSPatch builds on a novel and intuitive use of modulated spatial bit-patterns. The key idea is to: (1) represent program accesses on a physical page as a bit-pattern anchored to the first "trigger" access, (2) learn two spatial access bit-patterns: one biased towards coverage and another biased towards accuracy, and (3) select one bit-pattern at run-time based on the DRAM bandwidth utilization to generate prefetches. Across a diverse set of workloads, using only 3.6KB of storage, DSPatch improves performance over an aggressive baseline with a PC-based stride prefetcher at the L1 cache and the SPP prefetcher at the L2 cache by 6% (9% in memory-intensive workloads and up to 26%). Moreover, the performance of DSPatch+SPP scales with increasing DRAM bandwidth, growing from 6% over SPP to 10% when DRAM bandwidth is doubled.

Posted Content
TL;DR: One possible 3D-stacked microarchitecture is defined, dubbed BiHiwe, that leverages clustering and hierarchical design to best utilize power-efficiency of the mixed-signal domain and 3D stacking and also builds models for noise, computational non-idealities, and variations.
Abstract: Low-power potential of mixed-signal design makes it an alluring option to accelerate Deep Neural Networks (DNNs). However, mixed-signal circuitry suffers from limited range for information encoding, susceptibility to noise, and Analog to Digital (A/D) conversion overheads. This paper aims to address these challenges by offering and leveraging the insight that a vector dot-product (the basic operation in DNNs) can be bit-partitioned into groups of spatially parallel low-bitwidth operations, and interleaved across multiple elements of the vectors. As such, the building blocks of our accelerator become a group of wide, yet low-bitwidth multiply-accumulate units that operate in the analog domain and share a single A/D converter. The low-bitwidth operation tackles the encoding range limitation and facilitates noise mitigation. Moreover, we utilize the switched-capacitor design for our bit-level reformulation of DNN operations. The proposed switched-capacitor circuitry performs the group multiplications in the charge domain and accumulates the results of the group in its capacitors over multiple cycles. The capacitive accumulation combined with wide bit-partitioned operations alleviate the need for A/D conversion per operation. With such mathematical reformulation and its switched-capacitor implementation, we define a 3D-stacked microarchitecture, dubbed BIHIWE.

Posted Content
TL;DR: In this article, the authors propose an efficient framework to throttle the power consumption of multi-FPGA platforms by dynamically scaling the voltage and frequency during runtime according to prediction of, and adjustment to the workload level, while maintaining the desired Quality of Service (QoS).
Abstract: The continuous growth of big data applications with high computational and scalability demands has resulted in increasing popularity of cloud computing. Optimizing the performance and power consumption of cloud resources is therefore crucial to relieve the costs of data centers. In recent years, multi-FPGA platforms have gained traction in data centers as low-cost yet high-performance solutions particularly as acceleration engines, thanks to the high degree of parallelism they provide. Nonetheless, the size of data centers workloads varies during service time, leading to significant underutilization of computing resources while consuming a large amount of power, which turns out as a key factor of data center inefficiency, regardless of the underlying hardware structure. In this paper, we propose an efficient framework to throttle the power consumption of multi-FPGA platforms by dynamically scaling the voltage and hereby frequency during runtime according to prediction of, and adjustment to the workload level, while maintaining the desired Quality of Service (QoS). This is in contrast to, and more efficient than, conventional approaches that merely scale (i.e., power-gate) the computing nodes or frequency. The proposed framework carefully exploits a pre-characterized library of delay-voltage, and power-voltage information of FPGA resources, which we show is indispensable to obtain the efficient operating point due to the different sensitivity of resources w.r.t. voltage scaling, particularly considering multiple power rails residing in these devices. Our evaluations by implementing state-of-the-art deep neural network accelerators revealed that, providing an average power reduction of 4.0X, the proposed framework surpasses the previous works by 33.6% (up to 83%).

Posted Content
TL;DR: An OS-level solution to prevent denial of service (DoS) attacks on shared caches in multicore platforms is proposed by extending a state-of-the-art memory bandwidth regulation mechanism and implemented in Linux to show its effectiveness in protecting against cache DoS attacks.
Abstract: In this paper we investigate the feasibility of denial-of-service (DoS) attacks on shared caches in multicore platforms. With carefully engineered attacker tasks, we are able to cause more than 300X execution time increases on a victim task running on a dedicated core on a popular embedded multicore platform, regardless of whether we partition its shared cache or not. Based on careful experimentation on real and simulated multicore platforms, we identify an internal hardware structure of a non-blocking cache, namely the cache writeback buffer, as a potential target of shared cache DoS attacks. We propose an OS-level solution to prevent such DoS attacks by extending a state-of-the-art memory bandwidth regulation mechanism. We implement the proposed mechanism in Linux on a real multicore platform and show its effectiveness in protecting against cache DoS attacks.

Journal ArticleDOI
TL;DR: The 8-bit ERSFQ ALU design employs wave-pipelined instruction execution and features modular bit-slice architecture that is easily extendable to any number of bits and adaptable to current recycling.
Abstract: We have designed and tested a parallel 8-bit ERSFQ arithmetic logic unit (ALU). The ALU design employs wave-pipelined instruction execution and features modular bit-slice architecture that is easily extendable to any number of bits and adaptable to current recycling. A carry signal synchronized with an asynchronous instruction propagation provides the wave-pipeline operation of the ALU. The ALU instruction set consists of 14 arithmetical and logical instructions. It has been designed and simulated for operation up to a 10 GHz clock rate at the 10-kA/cm2 fabrication process. The ALU is embedded into a shift-register-based high-frequency testbed with on-chip clock generator to allow for comprehensive high frequency testing for all possible operands. The 8-bit ERSFQ ALU, comprising 6840 Josephson junctions, has been fabricated with MIT Lincoln Lab 10-kA/cm2 SFQ5ee fabrication process featuring eight Nb wiring layers and a high-kinetic inductance layer needed for ERSFQ technology. We evaluated the bias margins for all instructions and various operands at both low and high frequency clock. At low frequency, clock and all instruction propagation through ALU were observed with bias margins of +/-11% and +/-9%, respectively. Also at low speed, the ALU exhibited correct functionality for all arithmetical and logical instructions with +/-6% bias margins. We tested the 8-bit ALU for all instructions up to 2.8 GHz clock frequency.

Posted Content
TL;DR: The proposed SSR is a lightweight, non-invasive RISC-V ISA extension which implicitly encodes memory accesses as register reads/writes, eliminating a large number of loads/stores and provides a significant, 2x to 5x, architectural speedup across different kernels at a small 11 percent increase in core area.
Abstract: Single-issue processor cores are very energy efficient but suffer from the von Neumann bottleneck, in that they must explicitly fetch and issue the loads/storse necessary to feed their ALU/FPU. Each instruction spent on moving data is a cycle not spent on computation, limiting ALU/FPU utilization to 33% on reductions. We propose "Stream Semantic Registers" to boost utilization and increase energy efficiency. SSR is a lightweight, non-invasive RISC-V ISA extension which implicitly encodes memory accesses as register reads/writes, eliminating a large number of loads/stores. We implement the proposed extension in the RTL of an existing multi-core cluster and synthesize the design for a modern 22nm technology. Our extension provides a significant, 2x to 5x, architectural speedup across different kernels at a small 11% increase in core area. Sequential code runs 3x faster on a single core, and 3x fewer cores are needed in a cluster to achieve the same performance. The utilization increase to almost 100% in leads to a 2x energy efficiency improvement in a multi-core cluster. The extension reduces instruction fetches by up to 3.5x and instruction cache power consumption by up to 5.6x. Compilers can automatically map loop nests to SSRs, making the changes transparent to the programmer.

Proceedings ArticleDOI
TL;DR: Touché, a framework for storing multiple compressed blocks from arbitrary addresses within a cacheline without tag area overheads, is proposed and achieves an average speedup of 12% (ideal 13%) when compared to an uncompressed baseline.
Abstract: Compression is seen as a simple technique to increase the effective cache capacity. Unfortunately, compression techniques either incur tag area overheads or restrict data placement to only include neighboring compressed cache blocks to mitigate tag area overheads. Ideally, we should be able to place arbitrary compressed cache blocks without any placement restrictions and tag area overheads. This paper proposes Touche, a framework that enables storing multiple arbitrary compressed cache blocks within a physical cacheline without any tag area overheads. The Touche framework consists of three components. The first component, called the ``Signature'' (SIGN) engine, creates shortened signatures from the tag addresses of compressed blocks. Due to this, the SIGN engine can store multiple signatures in each tag entry. On a cache access, the physical cacheline is accessed only if there is a signature match (which has a negligible probability of false positive). The second component, called the ``Tag Appended Data'' (TADA) mechanism, stores the full tag addresses with data. TADA enables Touche to detect false positive signature matches by ensuring that the actual tag address is available for comparison. The third component, called the ``Superblock Marker'' (SMARK) mechanism, uses a unique marker in the tag entry to indicate the occurrence of compressed cache blocks from neighboring physical addresses in the same cacheline. Touche is completely hardware-based and achieves an average speedup of 12\% (ideal 13\%) when compared to an uncompressed baseline.

Posted Content
TL;DR: Refresh Triggered Computation can be applied to a wide range of applications, whose memory access patterns remain predictable for sufficiently long time, and can reduce the DRAM refresh energy from 25% to 96%, for the least aggressive and the most aggressive designs.
Abstract: To employ a Convolutional Neural Network (CNN) in an energy-constrained embedded system, it is critical for the CNN implementation to be highly energy efficient. Many recent studies propose CNN accelerator architectures with custom computation units that try to improve energy-efficiency and performance of CNNs by minimizing data transfers from DRAM-based main memory. However, in these architectures, DRAM is still responsible for half of the overall energy consumption of the system, on average. A key factor of the high energy consumption of DRAM is the refresh overhead, which is estimated to consume 40% of the total DRAM energy. In this paper, we propose a new mechanism, Refresh Triggered Computation (RTC), that exploits the memory access patterns of CNN applications to reduce the number of refresh operations. We propose three RTC designs (min-RTC, mid-RTC, and full-RTC), each of which requires a different level of aggressiveness in terms of customization to the DRAM subsystem. All of our designs have small overhead. Even the most aggressive RTC design (i.e., full-RTC) imposes an area overhead of only 0.18% in a 16 Gb DRAM chip and can have less overhead for denser chips. Our experimental evaluation on six well-known CNNs show that RTC reduces average DRAM energy consumption by 24.4% and 61.3%, for the least aggressive and the most aggressive RTC implementations, respectively. Besides CNNs, we also evaluate our RTC mechanism on three workloads from other domains. We show that RTC saves 31.9% and 16.9% DRAM energy for Face Recognition and Bayesian Confidence Propagation Neural Network (BCPNN), respectively. We believe RTC can be applied to other applications whose memory access patterns remain predictable for a sufficiently long time.

Posted Content
TL;DR: This paper considers, for the first time, a QoS-driven coordinated resource management algorithm (RMA) that dynamically adjusts the size of the per-core last-level cache partitions and theper-core voltage-frequency settings to save energy while respecting QoS requirements of every application in multi-programmed workloads run on multi-core systems.
Abstract: Reducing the energy expended to carry out a computational task is important. In this work, we explore the prospects of meeting Quality-of-Service requirements of tasks on a multi-core system while adjusting resources to expend a minimum of energy. This paper considers, for the first time, a QoS-driven coordinated resource management algorithm (RMA) that dynamically adjusts the size of the per-core last-level cache partitions and the per-core voltage-frequency settings to save energy while respecting QoS requirements of every application in multi-programmed workloads run on multi-core systems. It does so by doing configuration-space exploration across the spectrum of LLC partition sizes and Dynamic Voltage Frequency Scaling (DVFS) settings at runtime at negligible overhead. We show that the energy of 4-core and 8-core systems can be reduced by up to 18% and 14%, respectively, compared to a baseline with even distribution of cache resources and a fixed mid-range core voltage-frequency setting. The energy savings can potentially reach 29% if the QoS targets are relaxed to 40% longer execution time.

Posted Content
TL;DR: DRIM platform is developed that harnesses DRAM as computational memory and transforms it into a fundamental processing unit that outperforms recent processing-in-DRAM platforms with up to 3.7x better performance.
Abstract: With Von-Neumann computing architectures struggling to address computationally- and memory-intensive big data analytic task today, Processing-in-Memory (PIM) platforms are gaining growing interests. In this way, processing-in-DRAM architecture has achieved remarkable success by dramatically reducing data transfer energy and latency. However, the performance of such system unavoidably diminishes when dealing with more complex applications seeking bulk bit-wise X(N)OR- or addition operations, despite utilizing maximum internal DRAM bandwidth and in-memory parallelism. In this paper, we develop DRIM platform that harnesses DRAM as computational memory and transforms it into a fundamental processing unit. DRIM uses the analog operation of DRAM sub-arrays and elevates it to implement bit-wise X(N)OR operation between operands stored in the same bit-line, based on a new dual-row activation mechanism with a modest change to peripheral circuits such sense amplifiers. The simulation results show that DRIM achieves on average 71x and 8.4x higher throughput for performing bulk bit-wise X(N)OR-based operations compared with CPU and GPU, respectively. Besides, DRIM outperforms recent processing-in-DRAM platforms with up to 3.7x better performance.

Posted Content
TL;DR: The paper provides insights on how the current 'F' extension and the custom op-code space of RISC-V can be leveraged/modified to support Posit arithmetic and presents a workaround on how certain applications can be modified minimally to exploit the existing Risc-V tool-chain.
Abstract: Owing to the failure of Dennard's scaling the last decade has seen a steep growth of prominent new paradigms leveraging opportunities in computer architecture. Two technologies of interest are Posit and RISC-V. Posit was introduced in mid-2017 as a viable alternative to IEEE 754-2008. Posit promises more accuracy, higher dynamic range, and fewer unused states along with simpler hardware designs as compared to IEEE 754-2008. RISC-V, on the other hand, provides a commercial-grade open-source ISA. It is not only elegant and simple but also highly extensible and customizable, thereby facilitating novel micro-architectural research and exploration. In this paper, we bring these two technologies together and propose the first Posit Enabled RISC-V core. The paper provides insights on how the current 'F' extension and the custom op-code space of RISC-V can be leveraged/modified to support Posit arithmetic. We also present implementation details of a parameterized and feature-complete Posit FPU which is integrated with the RISC-V compliant SHAKTI C-class core either as an execution unit or as an accelerator. To fully leverage the potential of Posit, we further enhance our Posit FPU, with minimal overheads, to support two different exponent sizes (with posit-size being 32-bits). This allows applications to switch from high-accuracy computation mode to a mode with higher dynamic-range at run-time. In the absence of viable software tool-chain to enable porting of applications in the Posit domain, we present a workaround on how certain applications can be modified minimally to exploit the existing RISC-V tool-chain. We also provide examples of applications which can perform better with Posit as compared to IEEE 754-2008. The proposed Posit FPU consumes 3507 slice LUTs and 1294 slice registers on an Artix-7-100T Xilinx FPGA while capable of operating at 100 MHz.

Posted Content
TL;DR: P is proposed, a new mechanism that enables partition-level parallelism within each PCM bank, and exploits such parallelism by using the memory controller’s access scheduling decisions, and reduces average PCM access latency and improves average system performance.
Abstract: Phase-change memory (PCM) devices have multiple banks to serve memory requests in parallel. Unfortunately, if two requests go to the same bank, they have to be served one after another, leading to lower system performance. We observe that a modern PCM bank is implemented as a collection of partitions that operate mostly independently while sharing a few global peripheral structures, which include the sense amplifiers (to read) and the write drivers (to write). Based on this observation, we propose PALP, a new mechanism that enables partition-level parallelism within each PCM bank, and exploits such parallelism by using the memory controller's access scheduling decisions. PALP consists of three new contributions. First, we introduce new PCM commands to enable parallelism in a bank's partitions in order to resolve the read-write bank conflicts, with minimal changes needed to PCM logic and its interface. Second, we propose simple circuit modifications that introduce a new operating mode for the write drivers, in addition to their default mode of serving write requests. When configured in this new mode, the write drivers can resolve the read-read bank conflicts, working jointly with the sense amplifiers. Finally, we propose a new access scheduling mechanism in PCM that improves performance by prioritizing those requests that exploit partition-level parallelism over other requests, including the long outstanding ones. While doing so, the memory controller also guarantees starvation-freedom and the PCM's running-average-power-limit (RAPL). We evaluate PALP with workloads from the MiBench and SPEC CPU2017 Benchmark suites. Our results show that PALP reduces average PCM access latency by 23%, and improves average system performance by 28% compared to the state-of-the-art approaches.

Posted Content
TL;DR: A novel deep reinforcement framework is proposed, taking routerless networks-on-chip (NoC) as an evaluation case study, and successfully resolves problems with prior design approaches being either unreliable due to random searches or inflexible due to severe design space restrictions.
Abstract: Machine learning applied to architecture design presents a promising opportunity with broad applications. Recent deep reinforcement learning (DRL) techniques, in particular, enable efficient exploration in vast design spaces where conventional design strategies may be inadequate. This paper proposes a novel deep reinforcement framework, taking routerless networks-on-chip (NoC) as an evaluation case study. The new framework successfully resolves problems with prior design approaches being either unreliable due to random searches or inflexible due to severe design space restrictions. The framework learns (near-)optimal loop placement for routerless NoCs with various design constraints. A deep neural network is developed using parallel threads that efficiently explore the immense routerless NoC design space with a Monte Carlo search tree. Experimental results show that, compared with conventional mesh, the proposed deep reinforcement learning (DRL) routerless design achieves a 3.25x increase in throughput, 1.6x reduction in packet latency, and 5x reduction in power. Compared with the state-of-the-art routerless NoC, DRL achieves a 1.47x increase in throughput, 1.18x reduction in packet latency, and 1.14x reduction in average hop count albeit with slightly more power overhead.

Posted Content
TL;DR: This survey discusses some recent pieces of research that have been done to improve high-performance storage systems with caching and tiering techniques and discusses the key insight for intelligently moving data around different devices based on the needs.
Abstract: Although every individual invented storage technology made a big step towards perfection, none of them is spotless. Different data store essentials such as performance, availability, and recovery requirements have not met together in a single economically affordable medium, yet. One of the most influential factors is price. So, there has always been a trade-off between having a desired set of storage choices and the costs. To address this issue, a network of various types of storing media is used to deliver the high performance of expensive devices such as solid state drives and non-volatile memories, along with the high capacity of inexpensive ones like hard disk drives. In software, caching and tiering are long-established concepts for handling file operations and moving data automatically within such a storage network and manage data backup in low-cost media. Intelligently moving data around different devices based on the needs is the key insight for this matter. In this survey, we discuss some recent pieces of research that have been done to improve high-performance storage systems with caching and tiering techniques.

Posted Content
TL;DR: A thermal-aware voltage scaling flow is proposed that effectively utilizes the thermal margin to reduce power consumption without degrading performance and can be employed for energy optimization as well, whereby power consumption and delay are compromised to accomplish the tasks with minimum energy.
Abstract: Cutting edge FPGAs are not energy efficient as conventionally presumed to be, and therefore, aggressive power-saving techniques have become imperative. The clock rate of an FPGA-mapped design is set based on worst-case conditions to ensure reliable operation under all circumstances. This usually leaves a considerable timing margin that can be exploited to reduce power consumption by scaling voltage without lowering clock frequency. There are hurdles for such opportunistic voltage scaling in FPGAs because (a) critical paths change with designs, making timing evaluation difficult as voltage changes, (b) each FPGA resource has particular power-delay trade-off with voltage, (c) data corruption of configuration cells and memory blocks further hampers voltage scaling. In this paper, we propose a systematical approach to leverage the available thermal headroom of FPGA-mapped designs for power and energy improvement. By comprehensively analyzing the timing and power consumption of FPGA building blocks under varying temperatures and voltages, we propose a thermal-aware voltage scaling flow that effectively utilizes the thermal margin to reduce power consumption without degrading performance. We show the proposed flow can be employed for energy optimization as well, whereby power consumption and delay are compromised to accomplish the tasks with minimum energy. Lastly, we propose a simulation framework to be able to examine the efficiency of the proposed method for other applications that are inherently tolerant to a certain amount of error, granting further power saving opportunity. Experimental results over a set of industrial benchmarks indicate up to 36% power reduction with the same performance, and 66% total energy saving when energy is the optimization target.

Book ChapterDOI
TL;DR: This paper presents a technique, named SHIELD, to mitigate RDE in STT-RAM LLCs, which uses data compression to reduce the number of read operations from STt-RAM blocks to avoid RDE, and also to reduceThe number of bits written to cache during both write and restore operations.
Abstract: Due to its high density and close to SRAM read latency, spin transfer torque RAM (STT-RAM) is considered one of the most promising emerging memory technologies for designing large last level caches (LLCs). However, in a deep submicron region, STT-RAM shows read disturbance error (RDE) whereby a read operation may modify the stored data value, and this presents a severe threat to performance and reliability of STT-RAM caches. In this paper, we present a technique, named SHIELD, to mitigate RDE in STT-RAM LLCs. SHIELD uses data compression to reduce the number of read operations from STT-RAM blocks to avoid RDE, and also to reduce the number of bits written to cache during both write and restore operations. Experimental results have shown that SHIELD provides significant improvement in performance and energy efficiency. SHIELD consumes smaller energy than the two previous RDE-mitigation techniques, namely high current restore required read (HCRR, also called restore-after-read) and low current long latency read (LCLL) and even an ideal RDE-free STT-RAM cache.

Posted Content
TL;DR: This paper describes how Storage Class Memory functioned, while highlighting the problems of endurance and performance that these memory types face, and discusses the future possibilities of Multi-Level Cells (MLCs), as well as how SCM can be used to construct accelerators.
Abstract: Storage Class Memory (SCM) is a class of memory technology which has recently become viable for use. Their namearises from the fact that they exhibit non-volatility of data, similar to secondary storage while also having latencies comparable toprimary memory and byte-addressibility. In this area, Phase Change Memory (PCM), Spin-Transfer-Torque Random Access Memory(STT-RAM), and Resistive RAM (ReRAM) have emerged as the major contenders for commercial and industrial use. In this paper, wedescribe how these memory types function, while highlighting the problems of endurance and performance that these memory typesface. We also discuss the future possibilities of Multi-Level Cells (MLCs), as well as how SCM can be used to construct accelerators.

Posted Content
TL;DR: Pangloss is an efficient high-performance data prefetcher that approximates a Markov chain on delta transitions that is able to reconstruct a variety of both simple and complex access patterns with a limited information scope and space/logic complexity.
Abstract: We present Pangloss, an efficient high-performance data prefetcher that approximates a Markov chain on delta transitions. With a limited information scope and space/logic complexity, it is able to reconstruct a variety of both simple and complex access patterns. This is achieved by a highly-efficient representation of the Markov chain to provide accurate values for transition probabilities. In addition, we have added a mechanism to reconstruct delta transitions originally obfuscated by the out-of-order execution or page transitions, such as when streaming data from multiple sources. Our single-level (L2) prefetcher achieves a geometric speedup of 1.7% and 3.2% over selected state-of-the-art baselines (KPCP and BOP). When combined with an equivalent for the L1 cache (L1 & L2), the speedups rise to 6.8% and 8.4%, and 40.4% over non-prefetch. In the multi-core evaluation, there seems to be a considerable performance improvement as well.