Showing papers in "IEEE Computer Architecture Letters in 2019"


Journal ArticleDOI
TL;DR: An efficient deep Q-learning methodology to optimize the performance per watt (PPW) is proposed and experiments show that the PPW achieved by the proposed approach is within 1 percent of the optimal value obtained by an oracle.
Abstract: Heterogeneous multiprocessor system-on-chips (SoCs) provide a wide range of parameters that can be managed dynamically. For example, one can control the type (big/little), number, and frequency of active cores in state-of-the-art mobile processors at runtime. These runtime choices lead to more than a 10× range in execution time, a 5× range in power consumption, and a 50× range in performance per watt. Therefore, it is crucial to make optimal power management decisions as a function of dynamically varying workloads at runtime. This paper presents a reinforcement learning approach for dynamically controlling the number and frequency of active big and little cores in mobile processors. We propose an efficient deep Q-learning methodology to optimize the performance per watt (PPW). Experiments using the Odroid XU3 mobile platform show that the PPW achieved by the proposed approach is within 1 percent of the optimal value obtained by an oracle.
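A minimal sketch of the control loop such an approach implies is shown below, using plain tabular Q-learning rather than the paper's deep Q-network; the (big-core count, frequency level) action space, the single-phase state, and the measure_ppw stand-in are illustrative assumptions, not the authors' implementation.

```python
import random
from collections import defaultdict

# Hypothetical action space: (number of active big cores, frequency level).
ACTIONS = [(cores, freq) for cores in (1, 2, 3, 4) for freq in (0, 1, 2)]

def measure_ppw(action):
    """Placeholder for an on-device measurement of performance per watt."""
    cores, freq = action
    perf = cores ** 0.7 * (1.0 + 0.5 * freq)     # diminishing returns in cores
    power = cores * (1.0 + freq) ** 2            # power grows fast with frequency
    return perf / power

Q = defaultdict(float)                           # Q[(state, action)] -> value
alpha, gamma, epsilon = 0.1, 0.9, 0.1
state = "phase_0"                                # would come from runtime counters

def choose_action(state):
    if random.random() < epsilon:                              # explore
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])           # exploit

for step in range(1000):
    action = choose_action(state)
    reward = measure_ppw(action)                 # PPW is the reward signal
    next_state = state                           # single-phase toy workload
    best_next = max(Q[(next_state, a)] for a in ACTIONS)
    Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])

print(max(ACTIONS, key=lambda a: Q[(state, a)]))  # learned (cores, freq) setting
```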

48 citations


Journal ArticleDOI
TL;DR: This work defines and characterizes Orbital Edge Computing, describes power and software optimizations for the orbital edge, and uses formation flying to parallelize computation in space.
Abstract: Edge computing is an emerging paradigm aiding responsiveness, reliability, and scalability of terrestrial computing and sensing networks like cellular and IoT. However, edge computing is largely unexplored in high-data-rate nanosatellite constellations. Cubesats are small, energy-limited sensors separated from the cloud by hundreds of kilometers of atmosphere and space. As they proliferate, centralized architectures impede advanced applications. In this work, we define and characterize Orbital Edge Computing. We describe power and software optimizations for the orbital edge, and we use formation flying to parallelize computation in space.

45 citations


Journal ArticleDOI
TL;DR: In this article, the spatial correlation of zero-valued activations within the CNN output feature maps is exploited to reduce the number of multiply-accumulate (MAC) operations per input.
Abstract: Convolutional neural networks (CNNs) are a widely used form of deep neural networks, introducing state-of-the-art results for different problems such as image classification, computer vision tasks, and speech recognition. However, CNNs are compute intensive, requiring billions of multiply-accumulate (MAC) operations per input. To reduce the number of MACs in CNNs, we propose a value prediction method that exploits the spatial correlation of zero-valued activations within the CNN output feature maps, thereby saving convolution operations. Our method reduces the number of MAC operations by 30.4 percent, averaged on three modern CNNs for ImageNet, with top-1 accuracy degradation of 1.7 percent, and top-5 accuracy degradation of 1.1 percent.
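A toy rendition of the idea follows: a naive convolution loop that predicts an output activation to be zero, and skips its MAC window, when already-computed neighboring outputs are zero. The specific neighborhood test and the ReLU placement are assumptions for illustration, not the paper's predictor.

```python
import numpy as np

def conv_with_zero_prediction(inp, kernel):
    """Toy 2D convolution that skips MAC windows predicted to produce zero."""
    k = kernel.shape[0]
    out = np.zeros((inp.shape[0] - k + 1, inp.shape[1] - k + 1))
    skipped = 0
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Spatial-correlation heuristic: if the upper and left outputs
            # (already computed) are zero, predict zero and skip the window.
            if i > 0 and j > 0 and out[i - 1, j] == 0 and out[i, j - 1] == 0:
                skipped += 1
                continue
            out[i, j] = max(np.sum(inp[i:i + k, j:j + k] * kernel), 0.0)  # conv + ReLU
    return out, skipped

rng = np.random.default_rng(0)
inp = np.maximum(rng.standard_normal((16, 16)), 0.0)     # sparse, ReLU-like input
out, skipped = conv_with_zero_prediction(inp, rng.standard_normal((3, 3)))
print(f"skipped {skipped} of {out.size} output windows")
```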

35 citations


Journal ArticleDOI
TL;DR: PIMSim enables architectural simulation of PIM and implements three simulation modes to provide a wide range of speed/accuracy tradeoffs and offers detailed performance and energy models to simulate PIM-enabled instructions, compiler, in-memory processing logic, various memory devices, and PIM coherence.
Abstract: With the advent of big data applications and new process technologies, Processing-in-Memory (PIM) is attracting much attention in memory research as architecture studies gradually shift from processors toward heterogeneous designs. Reliable and efficient PIM architecture modeling is becoming increasingly urgent for researchers who want to experiment on critical issues arising from detailed implementations of their proposed PIM designs. This paper proposes PIMSim, a full-system and highly configurable PIM simulator that facilitates circuit-, architecture-, and system-level research. PIMSim enables architectural simulation of PIM and implements three simulation modes to provide a wide range of speed/accuracy tradeoffs. It offers detailed performance and energy models to simulate PIM-enabled instructions, the compiler, in-memory processing logic, various memory devices, and PIM coherence. PIMSim is open source and available at https://github.com/vineodd/PIMSim .

34 citations


Journal ArticleDOI
TL;DR: The proposed framework first extracts the precise times at which a charge pump in the hardware is activated to support neural computations within a workload, then uses a characterized NBTI reliability model to estimate the charge pump's aging during the workload execution.
Abstract: Neuromorphic hardware with non-volatile memory (NVM) can implement machine learning workloads in an energy-efficient manner. Unfortunately, certain NVMs such as phase change memory (PCM) require high voltages for correct operation. These voltages are supplied from an on-chip charge pump. If the charge pump is activated too frequently, its internal CMOS devices do not recover from stress, accelerating their aging and leading to negative bias temperature instability (NBTI) generated defects. Forcefully discharging the stressed charge pump can lower the aging rate of its CMOS devices, but makes the neuromorphic hardware unavailable to perform computations while its charge pump is being discharged. This negatively impacts performance metrics such as the latency and accuracy of the machine learning workload being executed. In this letter, we propose a novel framework to exploit workload-specific performance and lifetime trade-offs in neuromorphic computing. Our framework first extracts the precise times at which a charge pump in the hardware is activated to support neural computations within a workload. This timing information is then used with a characterized NBTI reliability model to estimate the charge pump's aging during the workload execution. We use our framework to evaluate workload-specific performance and reliability impacts of using 1) different SNN mapping strategies and 2) different charge pump discharge strategies. We show that our framework can be used by system designers to explore performance and reliability trade-offs early in the design of neuromorphic hardware such that appropriate reliability-oriented design margins can be set.
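A sketch of the aging-estimation step might look like the following, where charge-pump activation windows extracted from a workload are folded into a toy stress/recovery accumulator; the rate constants and the linear model are placeholders, not the characterized NBTI model used in the letter.

```python
def estimate_pump_aging(activations, t_end, stress_rate=1e-3, recovery_rate=5e-4):
    """Toy NBTI stress/recovery accumulation over charge-pump activation windows.

    `activations` is a list of (start, end) times, in seconds, during which the
    charge pump is on; stress accrues while it is on, and it partially recovers
    while idle.
    """
    aging, prev_end = 0.0, 0.0
    for start, end in sorted(activations):
        idle = max(0.0, start - prev_end)
        aging = max(0.0, aging - recovery_rate * idle)   # partial recovery while idle
        aging += stress_rate * (end - start)             # stress while the pump is on
        prev_end = end
    return max(0.0, aging - recovery_rate * max(0.0, t_end - prev_end))

# Activation windows as they might be extracted from a spiking workload's schedule.
print(estimate_pump_aging([(0.0, 2.0), (5.0, 6.5), (7.0, 9.0)], t_end=12.0))
```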

33 citations


Journal ArticleDOI
TL;DR: This paper presents PPT-GPU, a scalable and accurate simulation framework that enables GPU code developers and architects to predict the performance of applications in a fast and accurate manner on different GPU architectures.
Abstract: Performance modeling is a challenging problem due to the complexities of hardware architectures. In this paper, we present PPT-GPU, a scalable and accurate simulation framework that enables GPU code developers and architects to predict the performance of applications in a fast and accurate manner on different GPU architectures. PPT-GPU is part of the open source Performance Prediction Toolkit (PPT) developed at Los Alamos National Laboratory. We extend the old GPU model in PPT that predicts the runtimes of computational physics codes to offer better prediction accuracy; to this end, we add models for the different memory hierarchies found in GPUs and latencies for different instructions. To further show the utility of PPT-GPU, we compare our model against real GPU device(s) and the widely used cycle-accurate simulator GPGPU-Sim, using different workloads from the Rodinia and Parboil benchmarks. The results indicate that the performance predicted by PPT-GPU is within a 10 percent error compared to the real device(s). In addition, PPT-GPU is highly scalable: it is up to 450x faster than GPGPU-Sim with more accurate results.

31 citations


Journal ArticleDOI
TL;DR: RTSim, an open source cycle-accurate memory simulator that enables performance evaluation of domain-wall-based racetrack memories as well as skyrmion-based RTMs, is proposed; it was developed in collaboration between physicists and computer scientists.
Abstract: Racetrack memories (RTMs) have drawn considerable attention from computer architects of late. Owing to their ultra-high capacity and access latency comparable to SRAM, RTMs are promising candidates to revolutionize the memory subsystem. In order to evaluate their performance and suitability at various levels in the memory hierarchy, it is crucial to have RTM-specific simulation tools that accurately model their behavior and enable exhaustive design space exploration. To this end, we propose RTSim, an open source cycle-accurate memory simulator that enables performance evaluation of domain-wall-based racetrack memories. Skyrmion-based RTMs can also be modeled with RTSim because they are architecturally similar to domain-wall-based RTMs. RTSim is developed in collaboration with physicists and computer scientists. It accurately models RTM-specific shift operations, access port management, and the sequencing of memory commands, besides handling the routine read/write operations. RTSim is built on top of NVMain 2.0, offering a larger design space for exploration.
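The RTM-specific cost that such a simulator must account for comes largely from shifting domains under access ports; a simplified sketch of that accounting is shown below, assuming fixed port home positions, whereas the simulator tracks the actual track alignment after every shift.

```python
def shift_cost(target_domain, port_positions):
    """Shifts needed to align a racetrack domain with its nearest access port."""
    return min(abs(target_domain - p) for p in port_positions)

# Hypothetical track: 64 domains served by 4 evenly spaced access ports.
ports = [8, 24, 40, 56]
accesses = [3, 30, 30, 60, 12]                    # domains touched by a request stream
shifts = [shift_cost(a, ports) for a in accesses]
print(f"per-access shifts: {shifts}, total: {sum(shifts)}")
```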

28 citations


Journal ArticleDOI
TL;DR: This paper explores the design space of an SMT-SA variant and evaluates its performance, area efficiency, and energy consumption, and suggests a tiling method to reduce area overheads.
Abstract: Systolic arrays (SAs) are highly parallel pipelined structures capable of executing various tasks such as matrix multiplication and convolution. They comprise a grid of usually homogeneous processing units (PUs) that are responsible for the multiply-accumulate (MAC) operations in the case of matrix multiplication. It is not rare for a PU input to be zero-valued, in which case the PU becomes idle and the array becomes underutilized. In this paper we consider a solution that employs the underutilized PUs via simultaneous multithreading (SMT). We explore the design space of an SMT-SA variant and evaluate its performance, area efficiency, and energy consumption. In addition, we suggest a tiling method to reduce area overheads. Our evaluation shows that a 4-thread FP16-based SMT-SA achieves speedups of up to 3.6× compared to a conventional SA, with 1.7× area overhead and negligible energy overhead.
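A toy model of one SMT-SA processing unit is sketched below: several threads offer operand pairs each cycle, zero-valued pairs are skipped for free, and the single MAC unit serves the first non-zero pair. The arbitration order and thread count are assumptions for illustration.

```python
def smt_pu_cycle(thread_operands, acc):
    """One cycle of a toy SMT processing unit with a single shared MAC.

    `thread_operands` holds one (a, b) pair per thread; pairs containing a zero
    are skipped for free, and the first all-nonzero pair gets the multiplier.
    """
    for tid, (a, b) in enumerate(thread_operands):
        if a != 0 and b != 0:
            acc[tid] += a * b          # the shared MAC unit serves this thread
            return acc, tid
    return acc, None                   # every pair had a zero operand: MAC idles

acc = [0, 0, 0, 0]
acc, served = smt_pu_cycle([(0, 5), (3, 0), (2, 4), (1, 1)], acc)
print(acc, "thread served:", served)   # thread 2 uses the cycle threads 0/1 would waste
```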

26 citations


Journal ArticleDOI
TL;DR: Experiments show that the combination of multiple sub-resource borrowing techniques enhances total throughput by up to 26 percent, and by 9.5 percent on average, over the baseline spatial multitasking GPU.
Abstract: As GPUs have become essential components for embedded computing systems, a shared GPU with multiple CPU cores needs to efficiently support concurrent execution of multiple different applications. Spatial multitasking, which assigns a different number of streaming multiprocessors (SMs) to different applications, is one of the most common solutions for this. However, it is not a panacea for maximizing total resource utilization, because an SM consists of many different sub-resources such as caches, execution units, and scheduling units, and the per-kernel requirements for these sub-resources are not well matched to their fixed sizes inside an SM. To solve this resource requirement mismatch problem, this paper proposes GPU Weaver, a dynamic sub-resource management system for multitasking GPUs. GPU Weaver maximizes sub-resource utilization through a shared resource controller (SRC) that is added between neighboring SMs. The SRC dynamically identifies idle sub-resources of an SM and allows them to be used by the neighboring SM when possible. Experiments show that the combination of multiple sub-resource borrowing techniques enhances total throughput by up to 26 percent, and by 9.5 percent on average, over the baseline spatial multitasking GPU.

20 citations


Journal ArticleDOI
TL;DR: The proposed approach, Power-Inference accuracy Trading (PIT), monitors the server's load, and accordingly adjusts the precision of the DNN model and the DVFS setting of GPU to trade-off the accuracy and power consumption with response time.
Abstract: Traditionally, DVFS has been the main mechanism to trade off performance and power. We observe that Deep Neural Network (DNN) applications offer the possibility to trade off performance, power, and accuracy using both DVFS and numerical precision levels. Our proposed approach, Power-Inference accuracy Trading (PIT), monitors the server's load and accordingly adjusts the precision of the DNN model and the DVFS setting of the GPU to trade off accuracy and power consumption against response time. At high loads and tight request arrivals, PIT leverages the INT8-precision instructions of the GPU to dynamically change the precision of deployed DNN models and boosts the GPU frequency to execute the requests faster, at the expense of reduced accuracy and higher power consumption. However, when the requests' arrival rate is relaxed and there is slack time for requests, PIT deploys high-precision versions of the models to improve accuracy and reduces the GPU frequency to decrease power consumption. We implement and deploy PIT on a state-of-the-art server equipped with a Tesla P40 GPU. Experimental results demonstrate that, depending on the load, PIT can improve response time by up to 11 percent compared to a job scheduler that uses only FP32 precision. It also reduces energy consumption by up to 28 percent, while achieving around 99.5 percent of the accuracy of FP32-only execution.
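The control decision PIT makes can be caricatured as a small policy function like the one below; the queue-length and slack thresholds and the two operating points are invented for illustration and are not the paper's tuned policy.

```python
def pit_policy(queue_len, slack_ms, deadline_ms):
    """Toy load-driven controller trading inference accuracy and power for latency."""
    if queue_len > 8 or slack_ms < 0.2 * deadline_ms:
        # Tight arrivals: quantized model plus a boosted clock meets response times
        # at the cost of accuracy and power.
        return {"precision": "int8", "gpu_freq": "boost"}
    # Relaxed arrivals: full-precision model plus a lowered clock saves power.
    return {"precision": "fp32", "gpu_freq": "low"}

print(pit_policy(queue_len=12, slack_ms=4, deadline_ms=50))
print(pit_policy(queue_len=1, slack_ms=40, deadline_ms=50))
```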

17 citations


Journal ArticleDOI
TL;DR: This paper presents a design that guarantees 100 percent detection of DRAM disturbance errors or row hammering by malicious programs with a small and fixed hardware cost based on a novel idea called disturbance bin counter (DBC).
Abstract: DRAM disturbance errors are increasingly a concern for computer system reliability and security. There have been a number of designs to detect and prevent them; however, no existing design guarantees 100 percent detection (no false negatives) at a small and fixed hardware cost. This paper presents such a design based on a novel idea called the disturbance bin counter (DBC). Each DBC is a complex counter that maintains an upper bound on the disturbances for a bin of DRAM rows. DBC accesses are not on the critical path of processor execution and thus incur no performance overhead. The design is optimized at the circuit level to minimize the storage requirement. Our simulation results using multi-core SPEC CPU2006 workloads show that no false positive occurs with a 1,024-entry DBC table, which requires only 4.5 KB of storage. The design can be incorporated into a memory controller to guarantee the detection of DRAM disturbance errors or row hammering by malicious programs.
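A software sketch of the bin-counter idea is given below: every row activation is charged to its bin's counter, so the counter is a safe upper bound on the disturbance any row in the bin has received, and crossing a threshold triggers a response such as refreshing the bin's neighbor rows. The bin count, threshold, and reset-on-trigger behavior are illustrative assumptions.

```python
from collections import defaultdict

class DisturbanceBinCounter:
    """Toy per-bin upper-bound counting for DRAM disturbance (row hammer) detection."""
    def __init__(self, num_bins=1024, threshold=50_000, rows_per_bank=65_536):
        self.rows_per_bin = rows_per_bank // num_bins
        self.threshold = threshold
        self.counters = defaultdict(int)

    def on_row_activate(self, row):
        b = row // self.rows_per_bin
        self.counters[b] += 1                    # charge every activation in the bin
        if self.counters[b] >= self.threshold:   # to one counter: a safe upper bound
            self.counters[b] = 0
            return b                             # caller refreshes the bin's neighbors
        return None

dbc = DisturbanceBinCounter()
for i in range(60_000):
    hit = dbc.on_row_activate(row=12_345)
    if hit is not None:
        print(f"bin {hit} hit the disturbance bound after {i + 1} activations")
```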

Journal ArticleDOI
TL;DR: This study presents a microarchitectural mitigation technique for shielding transient state from covert channels during speculative execution, which prevents transient execution attacks at a cost of 18 percent average performance degradation.
Abstract: Hardware security has recently re-surfaced as a first-order concern to the confidentiality protections of computing systems. Meltdown and Spectre introduced a new class of exploits that leverage transient state as an attack surface and have revealed fundamental security vulnerabilities of speculative execution in high-performance processors. These attacks derive benefit from the fact that programs may speculatively execute instructions outside their legal control flows. This insight is then utilized for gaining access to restricted data and exfiltrating it by means of a covert channel. This study presents a microarchitectural mitigation technique for shielding transient state from covert channels during speculative execution. Unlike prior work that has focused on closing individual covert channels used to leak sensitive information, this approach prevents the use of speculative data by downstream instructions until doing so is determined to be safe. This prevents transient execution attacks at a cost of 18 percent average performance degradation.

Journal ArticleDOI
TL;DR: This work proposes precise runahead execution (PRE), a novel approach to manage free processor resources to execute the detected instruction chains in runahead mode without flushing the pipeline, which achieves an additional 21.1 percent performance improvement over the recent runahead proposals while reducing energy consumption.
Abstract: Runahead execution improves processor performance by accurately prefetching long-latency memory accesses. When a long-latency load causes the instruction window to fill up and halt the pipeline, the processor enters runahead mode and keeps speculatively executing code to trigger accurate prefetches. A recent improvement tracks the chain of instructions that leads to the long-latency load, stores it in a runahead buffer, and executes only this chain during runahead execution, with the purpose of generating more prefetch requests. Unfortunately, all these prior runahead proposals have shortcomings that limit performance and energy efficiency because they discard the full instruction window to enter runahead mode and then flush the pipeline to restart normal operation. This significantly constrains the performance benefits and increases the energy overhead of runahead execution. In addition, the runahead buffer limits prefetch coverage by tracking only a single chain of instructions that leads to the same long-latency load. We propose precise runahead execution (PRE) to mitigate the shortcomings of prior work. PRE leverages the renaming unit to track all the dependency chains leading to long-latency loads. PRE uses a novel approach to manage free processor resources so that the detected instruction chains execute in runahead mode without flushing the pipeline. Our results show that PRE achieves an additional 21.1 percent performance improvement over the recent runahead proposals while reducing energy consumption by 6.1 percent.
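The chain-tracking step can be illustrated with a simple backward slice over a register-dependence trace, standing in for what PRE derives from the renaming unit; the trace format and example below are made up.

```python
def dependence_chain(trace, load_idx):
    """Backward slice of a long-latency load over a register-dependence trace.

    `trace` is a list of (dst_reg, src_regs) tuples in program order; the walk
    mimics tracking producers through the renaming stage.
    """
    needed = set(trace[load_idx][1])            # registers the load depends on
    chain = [load_idx]
    for i in range(load_idx - 1, -1, -1):
        dst, srcs = trace[i]
        if dst in needed:                       # producer of a needed register
            chain.append(i)
            needed.discard(dst)
            needed.update(srcs)
    return list(reversed(chain))

trace = [("r1", []), ("r2", ["r1"]), ("r7", []), ("r3", ["r2", "r7"]), ("r4", ["r3"])]
print(dependence_chain(trace, load_idx=4))      # -> [0, 1, 2, 3, 4]
```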

Journal ArticleDOI
TL;DR: This paper proposes a compact, low-overhead, and yet flexible in-memory interconnect architecture that efficiently implements routing for next-state activation, and can be applied to the existing in- memory automata processing architectures.
Abstract: Accelerating finite automata processing benefits regular-expression workloads and a wide range of other applications that do not map obviously to regular expressions, including pattern mining, bioinformatics, and machine learning. Existing in-memory automata processing accelerators suffer from inefficient routing architectures: they are either incapable of efficiently placing and routing a highly connected automaton or require an excessive amount of hardware resources. In this paper, we propose a compact, low-overhead, and yet flexible in-memory interconnect architecture that efficiently implements routing for next-state activation and can be applied to existing in-memory automata processing architectures. We use 8T SRAM subarrays to evaluate our interconnect. Compared to the Cache Automaton routing design, our interconnect reduces the number of switches by 7× and therefore reduces the area overhead of the interconnect. It also has a faster row cycle time because of shorter wires, and it consumes less power.

Journal ArticleDOI
TL;DR: In this article, the authors propose quantum circuits for runtime assertions, which can be used for both software debugging and error detection, and they design quantum circuits to assert classical states, entanglement, and superposition states.
Abstract: In this paper, we propose quantum circuits for runtime assertions, which can be used for both software debugging and error detection. Runtime assertion is challenging in quantum computing for two key reasons. First, a quantum bit (qubit) cannot be copied, which is known as the no-cloning theorem. Second, when a qubit is measured, its superposition state collapses into a classical state, losing the inherent parallel information. In this paper, we overcome these challenges with runtime computation through ancilla qubits, which are used to indirectly collect information about the qubits of interest. We design quantum circuits to assert classical states, entanglement, and superposition states.
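A sketch of the classical-state assertion, written with Qiskit, is shown below: a CNOT copies the asserted qubit's expected classical value onto a fresh |0> ancilla, and only the ancilla is measured, so if the qubit really is in a classical state the check does not disturb it. The gate choice and helper name are illustrative; the paper's circuits for entanglement and superposition assertions are more involved.

```python
from qiskit import QuantumCircuit

def assert_classical_zero(qc, data_qubit, ancilla_qubit, cbit):
    """Indirect check that `data_qubit` holds |0>.

    A CNOT copies the (expected classical) value onto a |0> ancilla and only
    the ancilla is measured; if the assertion holds, the measurement always
    reads 0 and leaves the program qubit undisturbed.
    """
    qc.cx(data_qubit, ancilla_qubit)
    qc.measure(ancilla_qubit, cbit)

qc = QuantumCircuit(2, 1)        # qubit 0: program qubit, qubit 1: ancilla
# ... program gates acting on qubit 0 would go here ...
assert_classical_zero(qc, data_qubit=0, ancilla_qubit=1, cbit=0)
print(qc.draw())
```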

Journal ArticleDOI
TL;DR: A novel performance-aware hybrid coherency interface is proposed, where different accelerators use different coherency models, decided at design time based on the target applications so as to optimize the overall system performance.
Abstract: The modern system-on-chip (SoC) of the current exascale computing era is complex. These SoCs not only consist of several general-purpose processing cores but also integrate many specialized hardware accelerators. Three common coherency interfaces are used to integrate the accelerators with the memory hierarchy: non-coherent, coherent with the last-level cache (LLC), and fully coherent. However, using a single coherence interface for all the accelerators in an SoC can lead to significant overheads: in the non-coherent model, accelerators directly access the main memory, which can carry a considerable performance penalty; in the LLC-coherent model, the accelerators access the LLC but may suffer from a performance bottleneck due to contention among several accelerators; and the fully-coherent model, which relies on private caches, can incur non-trivial power/area overheads. Given the limitations of each of these interfaces, this paper proposes a novel performance-aware hybrid coherency interface, where different accelerators use different coherency models, decided at design time based on the target applications so as to optimize the overall system performance. A new Bayesian optimization based framework is also proposed to determine the optimal hybrid coherency interface, i.e., use machine learning to select the best coherency model for each of the accelerators in the SoC in terms of performance. For image processing and classification workloads, the proposed framework determined that a hybrid interface achieves up to 23 percent better performance compared to the other 'homogeneous' interfaces, where all the accelerators use a single coherency model.
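For a handful of accelerators the design space is small enough to enumerate, which the sketch below does as a stand-in for the paper's Bayesian-optimization search; the accelerator names, coherency labels, and simulate_performance placeholder are assumptions.

```python
import itertools
import random

COHERENCE_MODELS = ["non_coherent", "llc_coherent", "fully_coherent"]
ACCELERATORS = ["conv", "fft", "sort"]          # hypothetical accelerators in the SoC

def simulate_performance(assignment):
    """Placeholder for the (slow) evaluation a Bayesian optimizer would query."""
    random.seed(hash(tuple(sorted(assignment.items()))) & 0xFFFFFFFF)
    return random.uniform(0.5, 1.5)             # pretend throughput

best = None
for combo in itertools.product(COHERENCE_MODELS, repeat=len(ACCELERATORS)):
    assignment = dict(zip(ACCELERATORS, combo))
    perf = simulate_performance(assignment)
    if best is None or perf > best[1]:
        best = (assignment, perf)
print(best)                                     # per-accelerator coherency choice
```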

Journal ArticleDOI
TL;DR: A locality-aware GPU register file that enables data sharing for memory-intensive big data workloads on GPUs without relying on small on-chip memories is proposed; it achieves over 2× speedup and saves up to 57 percent of register space.
Abstract: In many emerging applications such as deep learning, large data sets are essential to generate reliable solutions. In these big data workloads, memory latency and bandwidth are the main performance bottlenecks. In this article, we propose a locality-aware GPU register file that enables data sharing for memory-intensive big data workloads on GPUs without relying on small on-chip memories. We exploit two types of data sharing patterns commonly found in big data workloads and have warps opportunistically share data in physical registers instead of issuing memory loads separately and storing the same data redundantly in their registers as well as in the small shared memory. With an extended register file mapping mechanism, our proposed design enables warps to share data by simply mapping to the same physical registers or by reconstructing it from data already in the register file. The proposed sharing not only reduces memory transactions but also further decreases register file usage. The spared registers make room for applying orthogonal optimizations for energy and performance improvement. Our evaluation on two deep learning workloads and matrixMul shows that the proposed locality-aware GPU register file achieves over 2× speedup and saves up to 57 percent of register space.

Journal ArticleDOI
TL;DR: A novel deep-learning based framework is presented that employs a genetic algorithm to efficiently guide exploration through the large design space while utilizing deep learning methods to provide fast performance prediction of design points instead of relying on slow full system simulations.
Abstract: As throughput-oriented processors incur a significant number of data accesses, the placement of memory controllers (MCs) has a critical impact on overall performance. However, due to the lack of a systematic way to explore the huge design space of MC placements, only a few ad-hoc placements have been proposed, leaving much of the opportunity unexploited. In this paper, we present a novel deep-learning based framework that explores this opportunity intelligently and automatically. The proposed framework employs a genetic algorithm to efficiently guide exploration through the large design space while utilizing deep learning methods to provide fast performance prediction of design points instead of relying on slow full-system simulations. Evaluation shows that the proposed deep learning models achieve a 282X speedup for the search process, and the MC placement found by our framework improves the average performance (IPC) of 18 benchmarks by 19.3 percent over the best-known placement found by human intuition.
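The overall search structure can be sketched as a genetic-algorithm loop whose fitness calls a fast surrogate instead of a simulator; the mesh size, mutation operator, and surrogate_ipc proxy below are invented for illustration and do not reproduce the paper's trained model.

```python
import random

GRID, NUM_MCS = 64, 8          # hypothetical 8x8 mesh with 8 memory controllers
POP, GENS = 32, 50

def surrogate_ipc(placement):
    """Placeholder for the trained model that predicts IPC for a placement."""
    # Toy proxy: reward placements spread away from the mesh centre.
    return sum(abs(p % 8 - 3.5) + abs(p // 8 - 3.5) for p in placement)

def random_placement():
    return tuple(random.sample(range(GRID), NUM_MCS))

def mutate(placement):
    spots = set(placement)
    spots.remove(random.choice(placement))                     # drop one MC position
    spots.add(random.choice([s for s in range(GRID) if s not in spots]))
    return tuple(spots)

population = [random_placement() for _ in range(POP)]
for _ in range(GENS):
    population.sort(key=surrogate_ipc, reverse=True)
    parents = population[:POP // 2]                            # selection
    children = [mutate(random.choice(parents)) for _ in range(POP - len(parents))]
    population = parents + children
print(max(population, key=surrogate_ipc))
```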

Journal ArticleDOI
TL;DR: WOTIR targets GPUs with no direct NVLink communication path and re-routes communication through intermediate GPUs to bridge NVLink segments and avoid PCIe communications, which allows the maximum possible utilization of the NVLink bandwidth between the GPUs without routing through the PCIe bus.
Abstract: In order to address the vast needs of disparate domains, computing engines are becoming more sophisticated and complex. A typical high-performance computational engine is composed of several accelerator units, in most cases GPUs, plus one or more CPU controllers. All these components are becoming increasingly interconnected to satisfy the bandwidth and latency-tolerance demands of modern workloads. Due to these constraints, solutions to efficiently interconnect them or to systematically manage their traffic, such as PCIe v3 and NVLink v1/v2 on the hardware side and the NVIDIA Collective Communication Library (NCCL) and AMD ROCm layer on the software side, are becoming more commonplace inside HPC systems and cloud data centers. However, as the number of accelerators increases, workloads (especially machine learning) might not be able to fully exploit the computational substrate due to inefficient use of hardware interconnects. Such scenarios can lead to performance bottlenecks where high-bandwidth links are not used by the underlying libraries and under-performing links are overused. This work proposes Workload Optimization Through Inter-GPU Re-routing (WOTIR), which consists of enhanced NCCL-based collective primitives that aim to boost bandwidth utilization (through more efficient routing) and reduce communication overhead. WOTIR targets GPUs with no direct NVLink communication path (which otherwise leads to PCIe communication) and instead re-routes communication through intermediate GPUs to bridge NVLink segments and avoid PCIe transfers. This method allows the maximum possible utilization of the NVLink bandwidth between the GPUs without routing through the PCIe bus. Using this method, we see a reduction of up to 34 percent in execution time for selected machine learning workloads when non-optimal GPU allocations arise.

Journal ArticleDOI
TL;DR: The goal of this work is to improve performance by sophisticated grouping that balances bandwidth and LLC requirements, while at the same time providing a fair execution environment by prioritizing applications that experience the least accumulated progress.
Abstract: Chip multiprocessors (CMPs) have become dominant in both the server and embedded domains as they accommodate an increasing number of cores in order to satisfy workload demands. However, when applications run concurrently, they compete for shared resources such as the Last Level Cache (LLC) and main memory bandwidth. Applications are affected by contention in various ways, and uneven degradation makes the system unreliable and the overall performance unpredictable. The goal of this work is to improve performance through sophisticated grouping that balances bandwidth and LLC requirements, while at the same time providing a fair execution environment by prioritizing applications that have experienced the least accumulated progress. The proposed scheduler achieves an average performance gain of 16 percent over the Linux scheduler and 6.3 percent over another performance-oriented scheduler. Additionally, it keeps unfairness very close to that of two fairness-oriented schedulers.
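Two ingredients of such a scheduler, contention-balancing grouping and least-progress-first fairness, can be sketched as below; the greedy heuristic, the equal weighting of bandwidth and LLC demand, and the example numbers are assumptions, not the paper's algorithm.

```python
def group_applications(apps, num_groups):
    """Greedy grouping that balances bandwidth and LLC demand across groups.

    `apps` maps name -> (bandwidth_demand, llc_demand), in arbitrary units.
    """
    groups = [{"apps": [], "bw": 0.0, "llc": 0.0} for _ in range(num_groups)]
    # Place the most demanding applications first, each into the least-loaded group.
    for name, (bw, llc) in sorted(apps.items(), key=lambda kv: -(kv[1][0] + kv[1][1])):
        g = min(groups, key=lambda grp: grp["bw"] + grp["llc"])
        g["apps"].append(name)
        g["bw"] += bw
        g["llc"] += llc
    return groups

def pick_next(progress):
    """Fairness rule: run the application with the least accumulated progress."""
    return min(progress, key=progress.get)

apps = {"mcf": (8.0, 6.0), "lbm": (9.0, 2.0), "gcc": (2.0, 3.0), "namd": (1.0, 1.0)}
print(group_applications(apps, num_groups=2))
print(pick_next({"mcf": 0.42, "lbm": 0.55, "gcc": 0.61, "namd": 0.47}))
```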

Journal ArticleDOI
TL;DR: A priority-based scheduling policy is proposed which aims to overlap the data transfers and GPU execution for different applications to alleviate bandwidth contention and improve overall multi-GPU system throughput.
Abstract: Multi-GPU systems are widely used in data centers to provide significant speedups to compute-intensive workloads such as deep neural network training. However, limited PCIe bandwidth between the CPU and multiple GPUs becomes a major performance bottleneck. We observe that relying on a traditional Round-Robin-based PCIe scheduling policy can result in severe bandwidth competition and stall the execution of multiple GPUs. In this article, we propose a priority-based scheduling policy which aims to overlap the data transfers and GPU execution for different applications to alleviate this bandwidth contention. We also propose a dynamic priority policy for semi-QoS management that can help applications to meet QoS requirements and improve overall multi-GPU system throughput. Experimental results show that the system throughput is improved by 7.6 percent on average using our priority-based PCIe scheduling scheme as compared with a Round-Robin-based PCIe scheduler. Leveraging semi-QoS management can help to meet defined QoS goals, while preserving application throughput.

Journal ArticleDOI
TL;DR: A novel approach to understand the performance characteristics of NN workloads for accelerator designs is proposed; it helps users select representative applications out of the large pool of possible applications, while providing insightful guidelines for the design of NN accelerators.
Abstract: The tremendous impact of deep learning algorithms over a wide range of application domains has encouraged a surge of neural network (NN) accelerator research. An evolving benchmark suite and its associated benchmark method are needed to incorporate emerging NN models and characterize NN workloads. In this paper, we propose a novel approach to understand the performance characteristic of NN workloads for accelerator designs. Our approach takes as input an application candidate pool and conducts an operator-level analysis and application-level analysis to understand the performance characteristics of both basic tensor primitives and whole applications. We conduct a case study on the TensorFlow model zoo by using this proposed characterization method. We find that tensor operators with the same functionality can have very different performance characteristics under different input sizes, while operators with different functionality can have similar characteristics. Additionally, we observe that without operator-level analysis, the application bottleneck is mischaracterized for 15 out of 57 models from the TensorFlow model zoo. Overall, our characterization method helps users select representative applications out of the large pool of possible applications, while providing insightful guidelines for the design of NN accelerators.
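The operator-level analysis amounts to characterizing each tensor primitive under its actual input sizes, for instance by its arithmetic intensity; the sketch below uses a made-up FLOPs-per-byte threshold and hypothetical counts purely to illustrate how the same operator can flip between compute- and memory-bound.

```python
def operator_profile(ops):
    """Classify tensor operators by arithmetic intensity (FLOPs per byte moved)."""
    profile = {}
    for name, (flops, bytes_moved) in ops.items():
        intensity = flops / bytes_moved
        kind = "compute-bound" if intensity > 10 else "memory-bound"
        profile[name] = (kind, round(intensity, 2))
    return profile

# Hypothetical counts: the same operator under two input sizes behaves differently.
ops = {
    "conv2d_3x3_large": (2.3e9, 4.1e7),
    "conv2d_3x3_small": (1.1e7, 9.0e6),
    "softmax": (2.0e6, 8.0e6),
}
print(operator_profile(ops))
```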

Journal ArticleDOI
TL;DR: The information leakage problem in memristor crossbar arrays (MCAs) is described, and how it can potentially be exploited from the application level is discussed, highlighting the need for future research to mitigate (and potentially eliminate) information leakage in crossbar memories in future computing systems.
Abstract: Memristors are emerging Non-Volatile Memories (NVMs) that are promising for building future memory systems. Unlike DRAM, memristors are non-volatile, i.e., they can retain data after power loss. In contrast to DRAM, where each cell is associated with a pass transistor, memristor cells can be implemented without such a transistor, enabling high-density ReRAM systems. Moreover, memristors leverage a unique crossbar architecture to improve the density of memory modules. Memristors have been considered for building future data centers with both energy-efficiency and high memory capacity goals. Surprisingly, we observe that using memristors in multi-tenant environments, e.g., cloud systems, entails new security vulnerabilities. In particular, the crossbar contents can severely affect the write latency of any data cells within the same crossbar. With various memory interleaving options (used to optimize performance), a single crossbar might be shared among several applications/users from different security domains. Therefore, such content-dependent latency can open a new source of information leakage. In this article, we describe the information leakage problem in memristor crossbar arrays (MCAs) and discuss how it can potentially be exploited from the application level. Our work highlights the need for future research to mitigate (and potentially eliminate) information leakage in crossbar memories in future computing systems.

Journal ArticleDOI
TL;DR: A simulation framework of RNA for CNN inference is presented that encompasses a ReRAM-aware NN training tool, a CNN-oriented mapper, and a micro-architecture simulator, enabling comprehensive architectural exploration and end-to-end evaluation.
Abstract: ReRAM-based neural network accelerators (RNAs) could remarkably outshine their digital counterparts in terms of computational efficiency and performance. However, open software tools for broad architectural exploration and end-to-end evaluation are still missing. We present a simulation framework of RNA for CNN inference that encompasses a ReRAM-aware NN training tool, a CNN-oriented mapper, and a micro-architecture simulator. The main characteristics of ReRAM and its circuits are reflected by the configurable simulator, as well as by the customized training algorithm. The function of the simulator's core components is verified against the corresponding circuit simulation of a real chip design. This framework enables comprehensive architectural exploration and end-to-end evaluation, and its preliminary version is available at https://github.com/CRAFT-THU/XB-Sim.

Journal ArticleDOI
TL;DR: OffDIMM, a software-assisted DRAM power-management scheme that collaborates with OS-level memory onlining/offlining, is proposed; it reduces background power by 24 percent on average without notable performance overheads.
Abstract: The power and energy consumed by main memory systems in data-center servers have increased as DRAM capacity and bandwidth increase. In particular, background power accounts for a considerable fraction of the total DRAM power consumption, and this fraction will increase further in the near future, especially when slowing technology scaling forces us to provide the necessary DRAM capacity by plugging in more DRAM modules or stacking more DRAM chips in a DRAM package. Although current DRAM architecture supports low-power states at rank granularity that turn off some components during idle periods, techniques to exploit memory-level parallelism render rank-granularity power states ineffective. Furthermore, the long wake-up latency is one of the obstacles to adopting aggressive power management (PM) with deep power-down states. Tackling these limitations, we propose OffDIMM, a software-assisted DRAM PM scheme that collaborates with OS-level memory onlining/offlining. OffDIMM maps a memory block in the address space of the OS to a subarray group or groups of DRAM, and sets a deep power-down state for the subarray group when offlining the block. Through dynamic OS-level memory onlining/offlining based on the current memory usage, our experimental results show that OffDIMM reduces background power by 24 percent on average without notable performance overheads.
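The OS-level half of this scheme rests on the standard Linux memory-hotplug sysfs interface sketched below; mapping an offlined block to a DRAM subarray group and driving the deep power-down state is the part OffDIMM adds in hardware/firmware and is not shown.

```python
import pathlib

SYS_MEMORY = pathlib.Path("/sys/devices/system/memory")

def block_size_bytes():
    """Size of one hot-pluggable memory block (the sysfs value is hex-encoded)."""
    return int((SYS_MEMORY / "block_size_bytes").read_text(), 16)

def set_memory_block(block_id, state):
    """Online/offline one Linux memory block via the memory-hotplug sysfs API.

    Requires root, and the block must be offlinable (movable) for "offline" to succeed.
    """
    (SYS_MEMORY / f"memory{block_id}" / "state").write_text(state + "\n")

print(f"memory block size: {block_size_bytes() / 2**20:.0f} MiB")
# Example (needs root): take a block out of use so its DRAM region can be put
# into a deep power-down state, then bring it back under memory pressure.
# set_memory_block(32, "offline")
# set_memory_block(32, "online")
```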

Journal ArticleDOI
TL;DR: The Memory Divergence Model (MDM) is able to accurately represent the key performance-related behavior of GPU applications and thereby reduces average performance prediction error by 14× compared to the state-of-the-art GPUMech approach across the authors' memory-divergent applications.
Abstract: Analytical performance models yield valuable architectural insight without incurring the excessive runtime overheads of simulation. In this work, we study contemporary GPU applications and find that the key performance-related behavior of such applications is distinct from traditional GPU applications. The key issue is that these GPU applications are memory-intensive and have poor spatial locality, which implies that the loads of different threads commonly access different cache blocks. Such memory-divergent applications quickly exhaust the number of misses the L1 cache can process concurrently, and thereby cripple the GPU's ability to use Memory-Level Parallelism (MLP) and Thread-Level Parallelism (TLP) to hide memory latencies. Our Memory Divergence Model (MDM) is able to accurately represent this behavior and thereby reduces average performance prediction error by 14× compared to the state-of-the-art GPUMech approach across our memory-divergent applications.

Journal ArticleDOI
TL;DR: A scalable integrated system-architecture modeling approach for hardware accelerators, based on the gem5 simulation framework, is proposed; its core is an LLVM-based simulation engine for modeling any customized data-path with respect to inherent data/instruction-level parallelism and available compute units.
Abstract: This article proposes a scalable integrated system-architecture modeling approach for hardware accelerators based on the gem5 simulation framework. The core of the proposed modeling is an LLVM-based simulation engine for modeling any customized data-path with respect to the inherent data/instruction-level parallelism (derived from the algorithm) and the available compute units (defined by the user). The simulation framework also offers a general-purpose communication interface that allows a scalable and flexible connection into the gem5 ecosystem through the Python API of gem5, enabling modifications to the system hierarchy without the need to rebuild the underlying simulator. Our simulation framework currently supports full-system simulation (both bare-metal and a full Linux kernel) for ARM-based systems, with future plans to add support for RISC-V. The LLVM-based modeling and modular integration with gem5 allow long-term simulation expansion and sustainable design modeling for emerging applications with demands for acceleration.

Journal ArticleDOI
TL;DR: A stack size modulation scheme that enables a hot address detector (HAD) to efficiently counteract various memory write streams is proposed; it achieves a detection rate of 94 percent while reducing the execution time by 57 percent.
Abstract: Phase-change memory (PCM) is an emerging non-volatile memory device that offers faster access than flash memory. However, PCM suffers from a critical problem: the number of write operations it can endure is limited. The previous practical attack detector (PAD), which uses a small memory space called a stack, adopts an algebraic mapping-based wear leveling (AWL) algorithm. Thanks to its successful detection of malicious attacks, the PAD-AWL dramatically improves the lifetime of PCM. To enhance system factors such as write latency, the proposed method replaces the AWL algorithm with a table-based wear leveling (TWL) algorithm. Since the fixed stack size of the previous PAD is inefficient at detecting attack-like hot addresses, a stack size modulation scheme that enables a hot address detector (HAD) to efficiently counteract various memory write streams is proposed. Compared with the previous AWL-based algorithm, the integration with the TWL algorithm demands only 24 percent of the total number of swaps per write, and the proposed HAD with the stack size modulation scheme achieves a detection rate of 94 percent while reducing the execution time by 57 percent.
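A toy software model of a stack-based hot address detector with size modulation is shown below; the LRU-ordered stack, the threshold, and the resize policy are illustrative guesses at the mechanism, not the paper's exact design.

```python
from collections import OrderedDict

class HotAddressDetector:
    """Toy stack-based detector for attack-like hot write addresses."""
    def __init__(self, stack_size=16, hot_threshold=32):
        self.stack_size = stack_size
        self.hot_threshold = hot_threshold
        self.stack = OrderedDict()             # address -> write count, LRU ordered

    def resize(self, new_size):
        """Stack-size modulation: adapt capacity to the current write stream."""
        self.stack_size = new_size
        while len(self.stack) > self.stack_size:
            self.stack.popitem(last=False)     # evict the coldest entry

    def on_write(self, addr):
        count = self.stack.pop(addr, 0) + 1
        self.stack[addr] = count               # re-insert as most recently used
        if len(self.stack) > self.stack_size:
            self.stack.popitem(last=False)
        return count >= self.hot_threshold     # True -> treat as hot / attack-like

had = HotAddressDetector()
hits = sum(had.on_write(0xDEAD) for _ in range(40))
print("hot detections:", hits)
```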

Journal ArticleDOI
TL;DR: The preliminary results successfully provide 2x-4x extra speedup during tuning of commonly-used software optimizations on the matrix-multiply kernel, and the approach helped to identify counter-intuitive causes that hurt multicore scalability of an optimized deep-learning benchmark on a Cascade Lake server.
Abstract: Modern server systems employ many features that are difficult to exploit by software developers. This paper calls for a new performance optimization approach that uses designated metrics with expected optimal values. A key insight is that expected values of these metrics are essential in order to verify that no performance is wasted during incremental utilization of processor features. We define sample primary metrics for modern architectures and present three distinct techniques that help to determine their optimal values. Our preliminary results successfully provide 2x-4x extra speedup during tuning of commonly-used software optimizations on the matrix-multiply kernel. Additionally, our approach helped to identify counter-intuitive causes that hurt multicore scalability of an optimized deep-learning benchmark on a Cascade Lake server.

Journal ArticleDOI
TL;DR: Rusty, a framework that leverages the power of Long Short-Term Memory networks to forecast, at runtime, performance metrics of applications executed on systems under interference, is presented.
Abstract: Modern cloud-scale data centers are adopting workload co-location as an effective mechanism for improving resource utilization. However, workload co-location stresses resource availability in an unconventional and unpredictable manner. Efficient resource management requires continuous and ideally predictive runtime knowledge of system metrics, sensitive both to workload demands, e.g., CPU, memory, etc., and to interference effects induced by co-location. In this paper, we present Rusty, a framework that addresses the aforementioned challenges by leveraging the power of Long Short-Term Memory networks to forecast, at runtime, performance metrics of applications executed on systems under interference. We evaluate Rusty under a diverse set of interference scenarios for a plethora of cloud workloads, showing that Rusty achieves extremely high prediction accuracy, up to 0.99 in terms of the $R^2$ value, while satisfying the strict latency constraints required to be usable at runtime.
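A minimal PyTorch sketch of the forecasting core, an LSTM regressor mapping a window of recent system metrics to the next sample, is given below with synthetic data standing in for the runtime traces; the layer sizes and training details are assumptions, not Rusty's configuration.

```python
import torch
import torch.nn as nn

class MetricForecaster(nn.Module):
    """Minimal LSTM regressor: a window of past system metrics -> the next sample."""
    def __init__(self, num_metrics=4, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(num_metrics, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_metrics)

    def forward(self, x):                  # x: (batch, window, num_metrics)
        out, _ = self.lstm(x)
        return self.head(out[:, -1, :])    # predict the next metric vector

model = MetricForecaster()
optim = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

# Synthetic stand-in for traces of runtime metrics (IPC, cache misses, ...).
x = torch.randn(256, 32, 4)
y = torch.randn(256, 4)
for _ in range(5):
    optim.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optim.step()
print(float(loss))
```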