Showing papers in "IEEE Computer Architecture Letters in 2019"


Journal ArticleDOI
TL;DR: An efficient deep Q-learning methodology to optimize the performance per watt (PPW) is proposed and experiments show that the PPW achieved by the proposed approach is within 1 percent of the optimal value obtained by an oracle.
Abstract: Heterogeneous multiprocessor system-on-chips (SoCs) provide a wide range of parameters that can be managed dynamically. For example, one can control the type (big/little), number, and frequency of active cores in state-of-the-art mobile processors at runtime. These runtime choices lead to more than a 10× range in execution time, a 5× range in power consumption, and a 50× range in performance per watt. Therefore, it is crucial to make optimal power management decisions as a function of dynamically varying workloads at runtime. This paper presents a reinforcement learning approach for dynamically controlling the number and frequency of active big and little cores in mobile processors. We propose an efficient deep Q-learning methodology to optimize the performance per watt (PPW). Experiments using the Odroid XU3 mobile platform show that the PPW achieved by the proposed approach is within 1 percent of the optimal value obtained by an oracle.
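A minimal sketch of the control loop such an approach implies is shown below, using plain tabular Q-learning rather than the paper's deep Q-network; the (big-core count, frequency level) action space, the single-phase state, and the measure_ppw stand-in are illustrative assumptions, not the authors' implementation.

```python
import random
from collections import defaultdict

# Hypothetical action space: (number of active big cores, frequency level).
ACTIONS = [(cores, freq) for cores in (1, 2, 3, 4) for freq in (0, 1, 2)]

def measure_ppw(action):
    """Placeholder for an on-device measurement of performance per watt."""
    cores, freq = action
    perf = cores ** 0.7 * (1.0 + 0.5 * freq)     # diminishing returns in cores
    power = cores * (1.0 + freq) ** 2            # power grows fast with frequency
    return perf / power

Q = defaultdict(float)                           # Q[(state, action)] -> value
alpha, gamma, epsilon = 0.1, 0.9, 0.1
state = "phase_0"                                # would come from runtime counters

def choose_action(state):
    if random.random() < epsilon:                              # explore
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])           # exploit

for step in range(1000):
    action = choose_action(state)
    reward = measure_ppw(action)                 # PPW is the reward signal
    next_state = state                           # single-phase toy workload
    best_next = max(Q[(next_state, a)] for a in ACTIONS)
    Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])

print(max(ACTIONS, key=lambda a: Q[(state, a)]))  # learned (cores, freq) setting
```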

48 citations


Journal ArticleDOI
TL;DR: This work defines and characterizes Orbital Edge Computing, describes power and software optimizations for the orbital edge, and uses formation flying to parallelize computation in space.
Abstract: Edge computing is an emerging paradigm aiding responsiveness, reliability, and scalability of terrestrial computing and sensing networks like cellular and IoT. However, edge computing is largely unexplored in high-data-rate nanosatellite constellations. Cubesats are small, energy-limited sensors separated from the cloud by hundreds of kilometers of atmosphere and space. As they proliferate, centralized architectures impede advanced applications. In this work, we define and characterize Orbital Edge Computing. We describe power and software optimizations for the orbital edge, and we use formation flying to parallelize computation in space.

45 citations


Journal ArticleDOI
TL;DR: In this article, the spatial correlation of zero-valued activations within the CNN output feature maps is exploited to reduce the number of multiply-accumulate (MAC) operations per input.
Abstract: Convolutional neural networks (CNNs) are a widely used form of deep neural networks, introducing state-of-the-art results for different problems such as image classification, computer vision tasks, and speech recognition. However, CNNs are compute intensive, requiring billions of multiply-accumulate (MAC) operations per input. To reduce the number of MACs in CNNs, we propose a value prediction method that exploits the spatial correlation of zero-valued activations within the CNN output feature maps, thereby saving convolution operations. Our method reduces the number of MAC operations by 30.4 percent, averaged on three modern CNNs for ImageNet, with top-1 accuracy degradation of 1.7 percent, and top-5 accuracy degradation of 1.1 percent.
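A toy rendition of the idea follows: a naive convolution loop that predicts an output activation to be zero, and skips its MAC window, when already-computed neighboring outputs are zero. The specific neighborhood test and the ReLU placement are assumptions for illustration, not the paper's predictor.

```python
import numpy as np

def conv_with_zero_prediction(inp, kernel):
    """Toy 2D convolution that skips MAC windows predicted to produce zero."""
    k = kernel.shape[0]
    out = np.zeros((inp.shape[0] - k + 1, inp.shape[1] - k + 1))
    skipped = 0
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Spatial-correlation heuristic: if the upper and left outputs
            # (already computed) are zero, predict zero and skip the window.
            if i > 0 and j > 0 and out[i - 1, j] == 0 and out[i, j - 1] == 0:
                skipped += 1
                continue
            out[i, j] = max(np.sum(inp[i:i + k, j:j + k] * kernel), 0.0)  # conv + ReLU
    return out, skipped

rng = np.random.default_rng(0)
inp = np.maximum(rng.standard_normal((16, 16)), 0.0)     # sparse, ReLU-like input
out, skipped = conv_with_zero_prediction(inp, rng.standard_normal((3, 3)))
print(f"skipped {skipped} of {out.size} output windows")
```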

35 citations


Journal ArticleDOI
TL;DR: PIMSim enables architectural simulation of PIM and implements three simulation modes to provide a wide range of speed/accuracy tradeoffs and offers detailed performance and energy models to simulate PIM-enabled instructions, compiler, in-memory processing logic, various memory devices, and PIM coherence.
Abstract: With the advent of big data applications and new process technologies, Processing-in-Memory (PIM) is attracting much attention in memory research as architecture studies gradually shift from processors toward heterogeneous designs. Reliable and efficient PIM architecture modeling is becoming increasingly urgent for researchers who want to experiment on critical issues arising from detailed implementations of their proposed PIM designs. This paper proposes PIMSim, a full-system and highly configurable PIM simulator that facilitates circuit-, architecture-, and system-level research. PIMSim enables architectural simulation of PIM and implements three simulation modes to provide a wide range of speed/accuracy tradeoffs. It offers detailed performance and energy models to simulate PIM-enabled instructions, the compiler, in-memory processing logic, various memory devices, and PIM coherence. PIMSim is open source and available at https://github.com/vineodd/PIMSim .

34 citations


Journal ArticleDOI
TL;DR: The proposed framework first extracts the precise times at which a charge pump in the hardware is activated to support neural computations within a workload, then uses a characterized NBTI reliability model to estimate the charge pump's aging during the workload execution.
Abstract: Neuromorphic hardware with non-volatile memory (NVM) can implement machine learning workloads in an energy-efficient manner. Unfortunately, certain NVMs such as phase change memory (PCM) require high voltages for correct operation. These voltages are supplied from an on-chip charge pump. If the charge pump is activated too frequently, its internal CMOS devices do not recover from stress, accelerating their aging and leading to negative bias temperature instability (NBTI) generated defects. Forcefully discharging the stressed charge pump can lower the aging rate of its CMOS devices, but makes the neuromorphic hardware unavailable to perform computations while its charge pump is being discharged. This negatively impacts performance metrics such as the latency and accuracy of the machine learning workload being executed. In this letter, we propose a novel framework to exploit workload-specific performance and lifetime trade-offs in neuromorphic computing. Our framework first extracts the precise times at which a charge pump in the hardware is activated to support neural computations within a workload. This timing information is then used with a characterized NBTI reliability model to estimate the charge pump's aging during the workload execution. We use our framework to evaluate workload-specific performance and reliability impacts of using 1) different SNN mapping strategies and 2) different charge pump discharge strategies. We show that our framework can be used by system designers to explore performance and reliability trade-offs early in the design of neuromorphic hardware such that appropriate reliability-oriented design margins can be set.
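A sketch of the aging-estimation step might look like the following, where charge-pump activation windows extracted from a workload are folded into a toy stress/recovery accumulator; the rate constants and the linear model are placeholders, not the characterized NBTI model used in the letter.

```python
def estimate_pump_aging(activations, t_end, stress_rate=1e-3, recovery_rate=5e-4):
    """Toy NBTI stress/recovery accumulation over charge-pump activation windows.

    `activations` is a list of (start, end) times, in seconds, during which the
    charge pump is on; stress accrues while it is on, and it partially recovers
    while idle.
    """
    aging, prev_end = 0.0, 0.0
    for start, end in sorted(activations):
        idle = max(0.0, start - prev_end)
        aging = max(0.0, aging - recovery_rate * idle)   # partial recovery while idle
        aging += stress_rate * (end - start)             # stress while the pump is on
        prev_end = end
    return max(0.0, aging - recovery_rate * max(0.0, t_end - prev_end))

# Activation windows as they might be extracted from a spiking workload's schedule.
print(estimate_pump_aging([(0.0, 2.0), (5.0, 6.5), (7.0, 9.0)], t_end=12.0))
```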

33 citations


Journal ArticleDOI
TL;DR: This paper presents PPT-GPU, a scalable and accurate simulation framework that enables GPU code developers and architects to predict the performance of applications in a fast and accurate manner on different GPU architectures.
Abstract: Performance modeling is a challenging problem due to the complexities of hardware architectures. In this paper, we present PPT-GPU, a scalable and accurate simulation framework that enables GPU code developers and architects to predict the performance of applications in a fast and accurate manner on different GPU architectures. PPT-GPU is part of the open source Performance Prediction Toolkit (PPT) developed at Los Alamos National Laboratory. We extend the old GPU model in PPT that predicts the runtimes of computational physics codes to offer better prediction accuracy; to this end, we add models for the different memory hierarchies found in GPUs and latencies for different instructions. To further show the utility of PPT-GPU, we compare our model against real GPU device(s) and the widely used cycle-accurate simulator GPGPU-Sim, using different workloads from the Rodinia and Parboil benchmarks. The results indicate that the performance predicted by PPT-GPU is within a 10 percent error compared to the real device(s). In addition, PPT-GPU is highly scalable: it is up to 450x faster than GPGPU-Sim with more accurate results.

31 citations


Journal ArticleDOI
TL;DR: RTSim, an open source cycle-accurate memory simulator that enables performance evaluation of domain-wall-based racetrack memories as well as skyrmion-based RTMs, is proposed; it was developed in collaboration between physicists and computer scientists.
Abstract: Racetrack memories (RTMs) have drawn considerable attention from computer architects of late. Owing to their ultra-high capacity and access latency comparable to SRAM, RTMs are promising candidates to revolutionize the memory subsystem. In order to evaluate their performance and suitability at various levels in the memory hierarchy, it is crucial to have RTM-specific simulation tools that accurately model their behavior and enable exhaustive design space exploration. To this end, we propose RTSim, an open source cycle-accurate memory simulator that enables performance evaluation of domain-wall-based racetrack memories. Skyrmion-based RTMs can also be modeled with RTSim because they are architecturally similar to domain-wall-based RTMs. RTSim is developed in collaboration with physicists and computer scientists. It accurately models RTM-specific shift operations, access port management, and the sequencing of memory commands, besides handling the routine read/write operations. RTSim is built on top of NVMain 2.0, offering a larger design space for exploration.
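The RTM-specific cost that such a simulator must account for comes largely from shifting domains under access ports; a simplified sketch of that accounting is shown below, assuming fixed port home positions, whereas the simulator tracks the actual track alignment after every shift.

```python
def shift_cost(target_domain, port_positions):
    """Shifts needed to align a racetrack domain with its nearest access port."""
    return min(abs(target_domain - p) for p in port_positions)

# Hypothetical track: 64 domains served by 4 evenly spaced access ports.
ports = [8, 24, 40, 56]
accesses = [3, 30, 30, 60, 12]                    # domains touched by a request stream
shifts = [shift_cost(a, ports) for a in accesses]
print(f"per-access shifts: {shifts}, total: {sum(shifts)}")
```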

28 citations


Journal ArticleDOI
TL;DR: This paper explores the design space of an SMT-SA variant and evaluates its performance, area efficiency, and energy consumption, and suggests a tiling method to reduce area overheads.
Abstract: Systolic arrays (SAs) are highly parallel pipelined structures capable of executing various tasks such as matrix multiplication and convolution. They comprise a grid of usually homogeneous processing units (PUs) that are responsible for the multiply-accumulate (MAC) operations in the case of matrix multiplication. It is not rare for a PU input to be zero-valued, in which case the PU becomes idle and the array becomes underutilized. In this paper we consider a solution that employs the underutilized PUs via simultaneous multithreading (SMT). We explore the design space of an SMT-SA variant and evaluate its performance, area efficiency, and energy consumption. In addition, we suggest a tiling method to reduce area overheads. Our evaluation shows that a 4-thread FP16-based SMT-SA achieves speedups of up to 3.6× compared to a conventional SA, with 1.7× area overhead and negligible energy overhead.
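A toy model of one SMT-SA processing unit is sketched below: several threads offer operand pairs each cycle, zero-valued pairs are skipped for free, and the single MAC unit serves the first non-zero pair. The arbitration order and thread count are assumptions for illustration.

```python
def smt_pu_cycle(thread_operands, acc):
    """One cycle of a toy SMT processing unit with a single shared MAC.

    `thread_operands` holds one (a, b) pair per thread; pairs containing a zero
    are skipped for free, and the first all-nonzero pair gets the multiplier.
    """
    for tid, (a, b) in enumerate(thread_operands):
        if a != 0 and b != 0:
            acc[tid] += a * b          # the shared MAC unit serves this thread
            return acc, tid
    return acc, None                   # every pair had a zero operand: MAC idles

acc = [0, 0, 0, 0]
acc, served = smt_pu_cycle([(0, 5), (3, 0), (2, 4), (1, 1)], acc)
print(acc, "thread served:", served)   # thread 2 uses the cycle threads 0/1 would waste
```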

26 citations


Journal ArticleDOI
TL;DR: Experiments show that the combination of multiple sub-resource borrowing techniques enhances total throughput by up to 26 percent, and by 9.5 percent on average, over the baseline spatial multitasking GPU.
Abstract: As GPUs have become essential components for embedded computing systems, a shared GPU with multiple CPU cores needs to efficiently support concurrent execution of multiple different applications. Spatial multitasking, which assigns a different number of streaming multiprocessors (SMs) to different applications, is one of the most common solutions for this. However, it is not a panacea for maximizing total resource utilization, because an SM consists of many different sub-resources such as caches, execution units, and scheduling units, and the per-kernel requirements for these sub-resources are not well matched to their fixed sizes inside an SM. To solve this resource requirement mismatch problem, this paper proposes GPU Weaver, a dynamic sub-resource management system for multitasking GPUs. GPU Weaver maximizes sub-resource utilization through a shared resource controller (SRC) that is added between neighboring SMs. The SRC dynamically identifies idle sub-resources of an SM and allows them to be used by the neighboring SM when possible. Experiments show that the combination of multiple sub-resource borrowing techniques enhances total throughput by up to 26 percent, and by 9.5 percent on average, over the baseline spatial multitasking GPU.

20 citations


Journal ArticleDOI
TL;DR: The proposed approach, Power-Inference accuracy Trading (PIT), monitors the server's load, and accordingly adjusts the precision of the DNN model and the DVFS setting of GPU to trade-off the accuracy and power consumption with response time.
Abstract: Traditionally, DVFS has been the main mechanism to trade off performance and power. We observe that Deep Neural Network (DNN) applications offer the possibility to trade off performance, power, and accuracy using both DVFS and numerical precision levels. Our proposed approach, Power-Inference accuracy Trading (PIT), monitors the server's load and accordingly adjusts the precision of the DNN model and the DVFS setting of the GPU to trade off accuracy and power consumption against response time. At high loads and tight request arrivals, PIT leverages the INT8-precision instructions of the GPU to dynamically change the precision of deployed DNN models and boosts the GPU frequency to execute the requests faster, at the expense of reduced accuracy and higher power consumption. However, when the requests' arrival rate is relaxed and there is slack time for requests, PIT deploys high-precision versions of the models to improve accuracy and reduces the GPU frequency to decrease power consumption. We implement and deploy PIT on a state-of-the-art server equipped with a Tesla P40 GPU. Experimental results demonstrate that, depending on the load, PIT can improve response time by up to 11 percent compared to a job scheduler that uses only FP32 precision. It also reduces energy consumption by up to 28 percent, while achieving around 99.5 percent of the accuracy of FP32-only execution.
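The control decision PIT makes can be caricatured as a small policy function like the one below; the queue-length and slack thresholds and the two operating points are invented for illustration and are not the paper's tuned policy.

```python
def pit_policy(queue_len, slack_ms, deadline_ms):
    """Toy load-driven controller trading inference accuracy and power for latency."""
    if queue_len > 8 or slack_ms < 0.2 * deadline_ms:
        # Tight arrivals: quantized model plus a boosted clock meets response times
        # at the cost of accuracy and power.
        return {"precision": "int8", "gpu_freq": "boost"}
    # Relaxed arrivals: full-precision model plus a lowered clock saves power.
    return {"precision": "fp32", "gpu_freq": "low"}

print(pit_policy(queue_len=12, slack_ms=4, deadline_ms=50))
print(pit_policy(queue_len=1, slack_ms=40, deadline_ms=50))
```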

17 citations


Journal ArticleDOI
TL;DR: This paper presents a design that guarantees 100 percent detection of DRAM disturbance errors or row hammering by malicious programs with a small and fixed hardware cost based on a novel idea called disturbance bin counter (DBC).
Abstract: DRAM disturbance errors are increasingly a concern for computer system reliability and security. There have been a number of designs to detect and prevent them; however, no existing design guarantees 100 percent detection (no false negatives) at a small and fixed hardware cost. This paper presents such a design based on a novel idea called the disturbance bin counter (DBC). Each DBC is a complex counter that maintains an upper bound on the disturbances for a bin of DRAM rows. DBC accesses are not on the critical path of processor execution and thus incur no performance overhead. The design is optimized at the circuit level to minimize the storage requirement. Our simulation results using multi-core SPEC CPU2006 workloads show that no false positive occurs with a 1,024-entry DBC table, which requires only 4.5 KB of storage. The design can be incorporated into a memory controller to guarantee the detection of DRAM disturbance errors or row hammering by malicious programs.
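A software sketch of the bin-counter idea is given below: every row activation is charged to its bin's counter, so the counter is a safe upper bound on the disturbance any row in the bin has received, and crossing a threshold triggers a response such as refreshing the bin's neighbor rows. The bin count, threshold, and reset-on-trigger behavior are illustrative assumptions.

```python
from collections import defaultdict

class DisturbanceBinCounter:
    """Toy per-bin upper-bound counting for DRAM disturbance (row hammer) detection."""
    def __init__(self, num_bins=1024, threshold=50_000, rows_per_bank=65_536):
        self.rows_per_bin = rows_per_bank // num_bins
        self.threshold = threshold
        self.counters = defaultdict(int)

    def on_row_activate(self, row):
        b = row // self.rows_per_bin
        self.counters[b] += 1                    # charge every activation in the bin
        if self.counters[b] >= self.threshold:   # to one counter: a safe upper bound
            self.counters[b] = 0
            return b                             # caller refreshes the bin's neighbors
        return None

dbc = DisturbanceBinCounter()
for i in range(60_000):
    hit = dbc.on_row_activate(row=12_345)
    if hit is not None:
        print(f"bin {hit} hit the disturbance bound after {i + 1} activations")
```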

Journal ArticleDOI
TL;DR: This study presents a microarchitectural mitigation technique for shielding transient state from covert channels during speculative execution, which prevents transient execution attacks at a cost of 18 percent average performance degradation.
Abstract: Hardware security has recently re-surfaced as a first-order concern to the confidentiality protections of computing systems. Meltdown and Spectre introduced a new class of exploits that leverage transient state as an attack surface and have revealed fundamental security vulnerabilities of speculative execution in high-performance processors. These attacks derive benefit from the fact that programs may speculatively execute instructions outside their legal control flows. This insight is then utilized for gaining access to restricted data and exfiltrating it by means of a covert channel. This study presents a microarchitectural mitigation technique for shielding transient state from covert channels during speculative execution. Unlike prior work that has focused on closing individual covert channels used to leak sensitive information, this approach prevents the use of speculative data by downstream instructions until doing so is determined to be safe. This prevents transient execution attacks at a cost of 18 percent average performance degradation.

Journal ArticleDOI
TL;DR: This work proposes precise runahead execution (PRE), a novel approach to manage free processor resources to execute the detected instruction chains in runahead mode without flushing the pipeline, which achieves an additional 21.1 percent performance improvement over the recent runahead proposals while reducing energy consumption.
Abstract: Runahead execution improves processor performance by accurately prefetching long-latency memory accesses. When a long-latency load causes the instruction window to fill up and halt the pipeline, the processor enters runahead mode and keeps speculatively executing code to trigger accurate prefetches. A recent improvement tracks the chain of instructions that leads to the long-latency load, stores it in a runahead buffer, and executes only this chain during runahead execution, with the purpose of generating more prefetch requests. Unfortunately, all these prior runahead proposals have shortcomings that limit performance and energy efficiency because they discard the full instruction window to enter runahead mode and then flush the pipeline to restart normal operation. This significantly constrains the performance benefits and increases the energy overhead of runahead execution. In addition, the runahead buffer limits prefetch coverage by tracking only a single chain of instructions that leads to the same long-latency load. We propose precise runahead execution (PRE) to mitigate the shortcomings of prior work. PRE leverages the renaming unit to track all the dependency chains leading to long-latency loads. PRE uses a novel approach to manage free processor resources so that the detected instruction chains execute in runahead mode without flushing the pipeline. Our results show that PRE achieves an additional 21.1 percent performance improvement over the recent runahead proposals while reducing energy consumption by 6.1 percent.
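The chain-tracking step can be illustrated with a simple backward slice over a register-dependence trace, standing in for what PRE derives from the renaming unit; the trace format and example below are made up.

```python
def dependence_chain(trace, load_idx):
    """Backward slice of a long-latency load over a register-dependence trace.

    `trace` is a list of (dst_reg, src_regs) tuples in program order; the walk
    mimics tracking producers through the renaming stage.
    """
    needed = set(trace[load_idx][1])            # registers the load depends on
    chain = [load_idx]
    for i in range(load_idx - 1, -1, -1):
        dst, srcs = trace[i]
        if dst in needed:                       # producer of a needed register
            chain.append(i)
            needed.discard(dst)
            needed.update(srcs)
    return list(reversed(chain))

trace = [("r1", []), ("r2", ["r1"]), ("r7", []), ("r3", ["r2", "r7"]), ("r4", ["r3"])]
print(dependence_chain(trace, load_idx=4))      # -> [0, 1, 2, 3, 4]
```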

Journal ArticleDOI
TL;DR: This paper proposes a compact, low-overhead, and yet flexible in-memory interconnect architecture that efficiently implements routing for next-state activation, and can be applied to the existing in- memory automata processing architectures.
Abstract: Accelerating finite automata processing benefits regular-expression workloads and a wide range of other applications that do not map obviously to regular expressions, including pattern mining, bioinformatics, and machine learning. Existing in-memory automata processing accelerators suffer from inefficient routing architectures: they are either incapable of efficiently placing and routing a highly connected automaton or require an excessive amount of hardware resources. In this paper, we propose a compact, low-overhead, and yet flexible in-memory interconnect architecture that efficiently implements routing for next-state activation and can be applied to existing in-memory automata processing architectures. We use 8T SRAM subarrays to evaluate our interconnect. Compared to the Cache Automaton routing design, our interconnect reduces the number of switches by 7× and therefore reduces the area overhead of the interconnect. It also has a faster row cycle time because of shorter wires, and it consumes less power.

Journal ArticleDOI
TL;DR: In this article, the authors propose quantum circuits for runtime assertions, which can be used for both software debugging and error detection, and they design quantum circuits to assert classical states, entanglement, and superposition states.
Abstract: In this paper, we propose quantum circuits for runtime assertions, which can be used for both software debugging and error detection. Runtime assertion is challenging in quantum computing for two key reasons. First, a quantum bit (qubit) cannot be copied, which is known as the no-cloning theorem. Second, when a qubit is measured, its superposition state collapses into a classical state, losing the inherent parallel information. In this paper, we overcome these challenges with runtime computation through ancilla qubits, which are used to indirectly collect information about the qubits of interest. We design quantum circuits to assert classical states, entanglement, and superposition states.
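A sketch of the classical-state assertion, written with Qiskit, is shown below: a CNOT copies the asserted qubit's expected classical value onto a fresh |0> ancilla, and only the ancilla is measured, so if the qubit really is in a classical state the check does not disturb it. The gate choice and helper name are illustrative; the paper's circuits for entanglement and superposition assertions are more involved.

```python
from qiskit import QuantumCircuit

def assert_classical_zero(qc, data_qubit, ancilla_qubit, cbit):
    """Indirect check that `data_qubit` holds |0>.

    A CNOT copies the (expected classical) value onto a |0> ancilla and only
    the ancilla is measured; if the assertion holds, the measurement always
    reads 0 and leaves the program qubit undisturbed.
    """
    qc.cx(data_qubit, ancilla_qubit)
    qc.measure(ancilla_qubit, cbit)

qc = QuantumCircuit(2, 1)        # qubit 0: program qubit, qubit 1: ancilla
# ... program gates acting on qubit 0 would go here ...
assert_classical_zero(qc, data_qubit=0, ancilla_qubit=1, cbit=0)
print(qc.draw())
```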

Journal ArticleDOI
TL;DR: A novel performance-aware hybrid coherency interface is proposed, where different accelerators use different coherency models, decided at design time based on the target applications so as to optimize the overall system performance.
Abstract: The modern system-on-chip (SoC) of the current exascale computing era is complex. These SoCs not only consist of several general-purpose processing cores but also integrate many specialized hardware accelerators. Three common coherency interfaces are used to integrate the accelerators with the memory hierarchy: non-coherent, coherent with the last-level cache (LLC), and fully coherent. However, using a single coherence interface for all the accelerators in an SoC can lead to significant overheads: in the non-coherent model, accelerators directly access the main memory, which can carry a considerable performance penalty; in the LLC-coherent model, the accelerators access the LLC but may suffer from a performance bottleneck due to contention among several accelerators; and the fully-coherent model, which relies on private caches, can incur non-trivial power/area overheads. Given the limitations of each of these interfaces, this paper proposes a novel performance-aware hybrid coherency interface, where different accelerators use different coherency models, decided at design time based on the target applications so as to optimize the overall system performance. A new Bayesian optimization based framework is also proposed to determine the optimal hybrid coherency interface, i.e., use machine learning to select the best coherency model for each of the accelerators in the SoC in terms of performance. For image processing and classification workloads, the proposed framework determined that a hybrid interface achieves up to 23 percent better performance compared to the other 'homogeneous' interfaces, where all the accelerators use a single coherency model.
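For a handful of accelerators the design space is small enough to enumerate, which the sketch below does as a stand-in for the paper's Bayesian-optimization search; the accelerator names, coherency labels, and simulate_performance placeholder are assumptions.

```python
import itertools
import random

COHERENCE_MODELS = ["non_coherent", "llc_coherent", "fully_coherent"]
ACCELERATORS = ["conv", "fft", "sort"]          # hypothetical accelerators in the SoC

def simulate_performance(assignment):
    """Placeholder for the (slow) evaluation a Bayesian optimizer would query."""
    random.seed(hash(tuple(sorted(assignment.items()))) & 0xFFFFFFFF)
    return random.uniform(0.5, 1.5)             # pretend throughput

best = None
for combo in itertools.product(COHERENCE_MODELS, repeat=len(ACCELERATORS)):
    assignment = dict(zip(ACCELERATORS, combo))
    perf = simulate_performance(assignment)
    if best is None or perf > best[1]:
        best = (assignment, perf)
print(best)                                     # per-accelerator coherency choice
```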

Journal ArticleDOI
TL;DR: A locality-aware GPU register file that enables data sharing for memory-intensive big data workloads on GPUs without relying on small on-chip memories is proposed; it achieves over 2× speedup and saves up to 57 percent of register space.
Abstract: In many emerging applications such as deep learning, large data sets are essential to generate reliable solutions. In these big data workloads, memory latency and bandwidth are the main performance bottlenecks. In this article, we propose a locality-aware GPU register file that enables data sharing for memory-intensive big data workloads on GPUs without relying on small on-chip memories. We exploit two types of data sharing patterns commonly found in big data workloads and have warps opportunistically share data in physical registers instead of issuing memory loads separately and storing the same data redundantly in their registers as well as in the small shared memory. With an extended register file mapping mechanism, our proposed design enables warps to share data by simply mapping to the same physical registers or by reconstructing it from data already in the register file. The proposed sharing not only reduces memory transactions but also further decreases register file usage. The spared registers make room for applying orthogonal optimizations for energy and performance improvement. Our evaluation on two deep learning workloads and matrixMul shows that the proposed locality-aware GPU register file achieves over 2× speedup and saves up to 57 percent of register space.

Journal ArticleDOI
TL;DR: A novel deep-learning based framework is presented that employs a genetic algorithm to efficiently guide exploration through the large design space while utilizing deep learning methods to provide fast performance prediction of design points instead of relying on slow full system simulations.
Abstract: As throughput-oriented processors incur a significant number of data accesses, the placement of memory controllers (MCs) has a critical impact on overall performance. However, due to the lack of a systematic way to explore the huge design space of MC placements, only a few ad-hoc placements have been proposed, leaving much of the opportunity unexploited. In this paper, we present a novel deep-learning based framework that explores this opportunity intelligently and automatically. The proposed framework employs a genetic algorithm to efficiently guide exploration through the large design space while utilizing deep learning methods to provide fast performance prediction of design points instead of relying on slow full-system simulations. Evaluation shows that the proposed deep learning models achieve a 282X speedup for the search process, and the MC placement found by our framework improves the average performance (IPC) of 18 benchmarks by 19.3 percent over the best-known placement found by human intuition.
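The overall search structure can be sketched as a genetic-algorithm loop whose fitness calls a fast surrogate instead of a simulator; the mesh size, mutation operator, and surrogate_ipc proxy below are invented for illustration and do not reproduce the paper's trained model.

```python
import random

GRID, NUM_MCS = 64, 8          # hypothetical 8x8 mesh with 8 memory controllers
POP, GENS = 32, 50

def surrogate_ipc(placement):
    """Placeholder for the trained model that predicts IPC for a placement."""
    # Toy proxy: reward placements spread away from the mesh centre.
    return sum(abs(p % 8 - 3.5) + abs(p // 8 - 3.5) for p in placement)

def random_placement():
    return tuple(random.sample(range(GRID), NUM_MCS))

def mutate(placement):
    spots = set(placement)
    spots.remove(random.choice(placement))                     # drop one MC position
    spots.add(random.choice([s for s in range(GRID) if s not in spots]))
    return tuple(spots)

population = [random_placement() for _ in range(POP)]
for _ in range(GENS):
    population.sort(key=surrogate_ipc, reverse=True)
    parents = population[:POP // 2]                            # selection
    children = [mutate(random.choice(parents)) for _ in range(POP - len(parents))]
    population = parents + children
print(max(population, key=surrogate_ipc))
```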

Journal ArticleDOI
TL;DR: WOTIR targets GPUs with no direct NVLink communication path and re-routes communication through intermediate GPUs to bridge NVLink segments and avoid PCIe communications, which allows the maximum possible utilization of the NVLink bandwidth between the GPUs without routing through the PCIe bus.
Abstract: In order to address the vast needs of disparate domains, computing engines are becoming more sophisticated and complex. A typical high-performance computational engine is composed of several accelerator units, in most cases GPUs, plus one or more CPU controllers. All these components are becoming increasingly interconnected to satisfy the bandwidth and latency-tolerance demands of modern workloads. Due to these constraints, solutions to efficiently interconnect them or to systematically manage their traffic, such as PCIe v3 and NVLink v1/v2 on the hardware side and the NVIDIA Collective Communication Library (NCCL) and AMD ROCm layer on the software side, are becoming more commonplace inside HPC systems and cloud data centers. However, as the number of accelerators increases, workloads (especially machine learning) might not be able to fully exploit the computational substrate due to inefficient use of hardware interconnects. Such scenarios can lead to performance bottlenecks where high-bandwidth links are not used by the underlying libraries and under-performing links are overused. This work proposes Workload Optimization Through Inter-GPU Re-routing (WOTIR), which consists of enhanced NCCL-based collective primitives that aim to boost bandwidth utilization (through more efficient routing) and reduce communication overhead. WOTIR targets GPUs with no direct NVLink communication path (which otherwise leads to PCIe communication) and instead re-routes communication through intermediate GPUs to bridge NVLink segments and avoid PCIe transfers. This method allows the maximum possible utilization of the NVLink bandwidth between the GPUs without routing through the PCIe bus. Using this method, we see a reduction of up to 34 percent in execution time for selected machine learning workloads when non-optimal GPU allocations arise.

Journal ArticleDOI
TL;DR: The goal of this work is to improve performance by sophisticated grouping that balances bandwidth and LLC requirements, while at the same time providing a fair execution environment by prioritizing applications that experience the least accumulated progress.
Abstract: Chip multiprocessors (CMPs) have become dominant in both the server and embedded domains as they accommodate an increasing number of cores in order to satisfy workload demands. However, when applications run concurrently, they compete for shared resources such as the Last Level Cache (LLC) and main memory bandwidth. Applications are affected by contention in various ways, and uneven degradation makes the system unreliable and the overall performance unpredictable. The goal of this work is to improve performance through sophisticated grouping that balances bandwidth and LLC requirements, while at the same time providing a fair execution environment by prioritizing applications that have experienced the least accumulated progress. The proposed scheduler achieves an average performance gain of 16 percent over the Linux scheduler and 6.3 percent over another performance-oriented scheduler. Additionally, it keeps unfairness very close to that of two fairness-oriented schedulers.
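Two ingredients of such a scheduler, contention-balancing grouping and least-progress-first fairness, can be sketched as below; the greedy heuristic, the equal weighting of bandwidth and LLC demand, and the example numbers are assumptions, not the paper's algorithm.

```python
def group_applications(apps, num_groups):
    """Greedy grouping that balances bandwidth and LLC demand across groups.

    `apps` maps name -> (bandwidth_demand, llc_demand), in arbitrary units.
    """
    groups = [{"apps": [], "bw": 0.0, "llc": 0.0} for _ in range(num_groups)]
    # Place the most demanding applications first, each into the least-loaded group.
    for name, (bw, llc) in sorted(apps.items(), key=lambda kv: -(kv[1][0] + kv[1][1])):
        g = min(groups, key=lambda grp: grp["bw"] + grp["llc"])
        g["apps"].append(name)
        g["bw"] += bw
        g["llc"] += llc
    return groups

def pick_next(progress):
    """Fairness rule: run the application with the least accumulated progress."""
    return min(progress, key=progress.get)

apps = {"mcf": (8.0, 6.0), "lbm": (9.0, 2.0), "gcc": (2.0, 3.0), "namd": (1.0, 1.0)}
print(group_applications(apps, num_groups=2))
print(pick_next({"mcf": 0.42, "lbm": 0.55, "gcc": 0.61, "namd": 0.47}))
```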

Journal ArticleDOI
TL;DR: A priority-based scheduling policy is proposed which aims to overlap the data transfers and GPU execution for different applications to alleviate bandwidth contention and improve overall multi-GPU system throughput.
Abstract: Multi-GPU systems are widely used in data centers to provide significant speedups to compute-intensive workloads such as deep neural network training. However, limited PCIe bandwidth between the CPU and multiple GPUs becomes a major performance bottleneck. We observe that relying on a traditional Round-Robin-based PCIe scheduling policy can result in severe bandwidth competition and stall the execution of multiple GPUs. In this article, we propose a priority-based scheduling policy which aims to overlap the data transfers and GPU execution for different applications to alleviate this bandwidth contention. We also propose a dynamic priority policy for semi-QoS management that can help applications to meet QoS requirements and improve overall multi-GPU system throughput. Experimental results show that the system throughput is improved by 7.6 percent on average using our priority-based PCIe scheduling scheme as compared with a Round-Robin-based PCIe scheduler. Leveraging semi-QoS management can help to meet defined QoS goals, while preserving application throughput.

Journal ArticleDOI
TL;DR: A novel approach to understand the performance characteristics of NN workloads for accelerator designs is proposed; it helps users select representative applications out of the large pool of possible applications, while providing insightful guidelines for the design of NN accelerators.
Abstract: The tremendous impact of deep learning algorithms over a wide range of application domains has encouraged a surge of neural network (NN) accelerator research. An evolving benchmark suite and its associated benchmark method are needed to incorporate emerging NN models and characterize NN workloads. In this paper, we propose a novel approach to understand the performance characteristic of NN workloads for accelerator designs. Our approach takes as input an application candidate pool and conducts an operator-level analysis and application-level analysis to understand the performance characteristics of both basic tensor primitives and whole applications. We conduct a case study on the TensorFlow model zoo by using this proposed characterization method. We find that tensor operators with the same functionality can have very different performance characteristics under different input sizes, while operators with different functionality can have similar characteristics. Additionally, we observe that without operator-level analysis, the application bottleneck is mischaracterized for 15 out of 57 models from the TensorFlow model zoo. Overall, our characterization method helps users select representative applications out of the large pool of possible applications, while providing insightful guidelines for the design of NN accelerators.
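The operator-level analysis amounts to characterizing each tensor primitive under its actual input sizes, for instance by its arithmetic intensity; the sketch below uses a made-up FLOPs-per-byte threshold and hypothetical counts purely to illustrate how the same operator can flip between compute- and memory-bound.

```python
def operator_profile(ops):
    """Classify tensor operators by arithmetic intensity (FLOPs per byte moved)."""
    profile = {}
    for name, (flops, bytes_moved) in ops.items():
        intensity = flops / bytes_moved
        kind = "compute-bound" if intensity > 10 else "memory-bound"
        profile[name] = (kind, round(intensity, 2))
    return profile

# Hypothetical counts: the same operator under two input sizes behaves differently.
ops = {
    "conv2d_3x3_large": (2.3e9, 4.1e7),
    "conv2d_3x3_small": (1.1e7, 9.0e6),
    "softmax": (2.0e6, 8.0e6),
}
print(operator_profile(ops))
```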

Journal ArticleDOI
TL;DR: The information leakage problem in memristor crossbar arrays (MCAs) is described, and how it can potentially be exploited from the application level is discussed, highlighting the need for future research to mitigate (and potentially eliminate) information leakage in crossbar memories in future computing systems.
Abstract: Memristors are emerging Non-Volatile Memories (NVMs) that are promising for building future memory systems. Unlike DRAM, memristors are non-volatile, i.e., they can retain data after power loss. In contrast to DRAM, where each cell is associated with a pass transistor, memristor cells can be implemented without such a transistor, enabling high-density ReRAM systems. Moreover, memristors leverage a unique crossbar architecture to improve the density of memory modules. Memristors have been considered for building future data centers with both energy-efficiency and high memory capacity goals. Surprisingly, we observe that using memristors in multi-tenant environments, e.g., cloud systems, entails new security vulnerabilities. In particular, the crossbar contents can severely affect the write latency of any data cells within the same crossbar. With various memory interleaving options (used to optimize performance), a single crossbar might be shared among several applications/users from different security domains. Therefore, such content-dependent latency can open a new source of information leakage. In this article, we describe the information leakage problem in memristor crossbar arrays (MCAs) and discuss how it can potentially be exploited from the application level. Our work highlights the need for future research to mitigate (and potentially eliminate) information leakage in crossbar memories in future computing systems.

Journal ArticleDOI
TL;DR: A simulation framework of RNA for CNN inference is presented that encompasses a ReRAM-aware NN training tool, a CNN-oriented mapper, and a micro-architecture simulator, enabling comprehensive architectural exploration and end-to-end evaluation.
Abstract: ReRAM-based neural network accelerators (RNAs) could remarkably outshine their digital counterparts in terms of computational efficiency and performance. However, open software tools for broad architectural exploration and end-to-end evaluation are still missing. We present a simulation framework of RNA for CNN inference that encompasses a ReRAM-aware NN training tool, a CNN-oriented mapper, and a micro-architecture simulator. The main characteristics of ReRAM and its circuits are reflected by the configurable simulator, as well as by the customized training algorithm. The function of the simulator's core components is verified against the corresponding circuit simulation of a real chip design. This framework enables comprehensive architectural exploration and end-to-end evaluation, and its preliminary version is available at https://github.com/CRAFT-THU/XB-Sim.

Journal ArticleDOI
TL;DR: OffDIMM, a software-assisted DRAM power-management scheme that collaborates with OS-level memory onlining/offlining, is proposed; it reduces background power by 24 percent on average without notable performance overheads.
Abstract: The power and energy consumed by main memory systems in data-center servers have increased as DRAM capacity and bandwidth increase. In particular, background power accounts for a considerable fraction of the total DRAM power consumption, and this fraction will increase further in the near future, especially when slowing technology scaling forces us to provide the necessary DRAM capacity by plugging in more DRAM modules or stacking more DRAM chips in a DRAM package. Although current DRAM architecture supports low-power states at rank granularity that turn off some components during idle periods, techniques to exploit memory-level parallelism render rank-granularity power states ineffective. Furthermore, the long wake-up latency is one of the obstacles to adopting aggressive power management (PM) with deep power-down states. Tackling these limitations, we propose OffDIMM, a software-assisted DRAM PM scheme that collaborates with OS-level memory onlining/offlining. OffDIMM maps a memory block in the address space of the OS to a subarray group or groups of DRAM, and sets a deep power-down state for the subarray group when offlining the block. Through dynamic OS-level memory onlining/offlining based on the current memory usage, our experimental results show that OffDIMM reduces background power by 24 percent on average without notable performance overheads.
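The OS-level half of this scheme rests on the standard Linux memory-hotplug sysfs interface sketched below; mapping an offlined block to a DRAM subarray group and driving the deep power-down state is the part OffDIMM adds in hardware/firmware and is not shown.

```python
import pathlib

SYS_MEMORY = pathlib.Path("/sys/devices/system/memory")

def block_size_bytes():
    """Size of one hot-pluggable memory block (the sysfs value is hex-encoded)."""
    return int((SYS_MEMORY / "block_size_bytes").read_text(), 16)

def set_memory_block(block_id, state):
    """Online/offline one Linux memory block via the memory-hotplug sysfs API.

    Requires root, and the block must be offlinable (movable) for "offline" to succeed.
    """
    (SYS_MEMORY / f"memory{block_id}" / "state").write_text(state + "\n")

print(f"memory block size: {block_size_bytes() / 2**20:.0f} MiB")
# Example (needs root): take a block out of use so its DRAM region can be put
# into a deep power-down state, then bring it back under memory pressure.
# set_memory_block(32, "offline")
# set_memory_block(32, "online")
```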

Journal ArticleDOI
TL;DR: The Memory Divergence Model (MDM) is able to accurately represent the key performance-related behavior of GPU applications and thereby reduces average performance prediction error by 14× compared to the state-of-the-art GPUMech approach across the authors' memory-divergent applications.
Abstract: Analytical performance models yield valuable architectural insight without incurring the excessive runtime overheads of simulation. In this work, we study contemporary GPU applications and find that the key performance-related behavior of such applications is distinct from traditional GPU applications. The key issue is that these GPU applications are memory-intensive and have poor spatial locality, which implies that the loads of different threads commonly access different cache blocks. Such memory-divergent applications quickly exhaust the number of misses the L1 cache can process concurrently, and thereby cripple the GPU's ability to use Memory-Level Parallelism (MLP) and Thread-Level Parallelism (TLP) to hide memory latencies. Our Memory Divergence Model (MDM) is able to accurately represent this behavior and thereby reduces average performance prediction error by 14× compared to the state-of-the-art GPUMech approach across our memory-divergent applications.

Journal ArticleDOI
TL;DR: A scalable integrated system-architecture modeling approach for hardware accelerators, based on the gem5 simulation framework, is proposed; its core is an LLVM-based simulation engine for modeling any customized data-path with respect to inherent data/instruction-level parallelism and available compute units.
Abstract: This article proposes a scalable integrated system-architecture modeling approach for hardware accelerators based on the gem5 simulation framework. The core of the proposed modeling is an LLVM-based simulation engine for modeling any customized data-path with respect to the inherent data/instruction-level parallelism (derived from the algorithm) and the available compute units (defined by the user). The simulation framework also offers a general-purpose communication interface that allows a scalable and flexible connection into the gem5 ecosystem through the Python API of gem5, enabling modifications to the system hierarchy without the need to rebuild the underlying simulator. Our simulation framework currently supports full-system simulation (both bare-metal and a full Linux kernel) for ARM-based systems, with future plans to add support for RISC-V. The LLVM-based modeling and modular integration with gem5 allow long-term simulation expansion and sustainable design modeling for emerging applications with demands for acceleration.

Journal ArticleDOI
TL;DR: A stack size modulation scheme that enables a hot address detector (HAD) to efficiently counteract various memory write streams is proposed; it achieves a detection rate of 94 percent while reducing the execution time by 57 percent.
Abstract: Phase-change memory (PCM) is an emerging non-volatile memory device that offers faster access than flash memory. However, PCM suffers from a critical problem: the number of write operations it can endure is limited. The previous practical attack detector (PAD), which uses a small memory space called a stack, adopts an algebraic mapping-based wear leveling (AWL) algorithm. Thanks to its successful detection of malicious attacks, the PAD-AWL dramatically improves the lifetime of PCM. To enhance system factors such as write latency, the proposed method replaces the AWL algorithm with a table-based wear leveling (TWL) algorithm. Since the fixed stack size of the previous PAD is inefficient at detecting attack-like hot addresses, a stack size modulation scheme that enables a hot address detector (HAD) to efficiently counteract various memory write streams is proposed. Compared with the previous AWL-based algorithm, the integration with the TWL algorithm demands only 24 percent of the total number of swaps per write, and the proposed HAD with the stack size modulation scheme achieves a detection rate of 94 percent while reducing the execution time by 57 percent.
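A toy software model of a stack-based hot address detector with size modulation is shown below; the LRU-ordered stack, the threshold, and the resize policy are illustrative guesses at the mechanism, not the paper's exact design.

```python
from collections import OrderedDict

class HotAddressDetector:
    """Toy stack-based detector for attack-like hot write addresses."""
    def __init__(self, stack_size=16, hot_threshold=32):
        self.stack_size = stack_size
        self.hot_threshold = hot_threshold
        self.stack = OrderedDict()             # address -> write count, LRU ordered

    def resize(self, new_size):
        """Stack-size modulation: adapt capacity to the current write stream."""
        self.stack_size = new_size
        while len(self.stack) > self.stack_size:
            self.stack.popitem(last=False)     # evict the coldest entry

    def on_write(self, addr):
        count = self.stack.pop(addr, 0) + 1
        self.stack[addr] = count               # re-insert as most recently used
        if len(self.stack) > self.stack_size:
            self.stack.popitem(last=False)
        return count >= self.hot_threshold     # True -> treat as hot / attack-like

had = HotAddressDetector()
hits = sum(had.on_write(0xDEAD) for _ in range(40))
print("hot detections:", hits)
```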

Journal ArticleDOI
TL;DR: The preliminary results successfully provide 2x-4x extra speedup during tuning of commonly-used software optimizations on the matrix-multiply kernel, and the approach helped to identify counter-intuitive causes that hurt multicore scalability of an optimized deep-learning benchmark on a Cascade Lake server.
Abstract: Modern server systems employ many features that are difficult to exploit by software developers. This paper calls for a new performance optimization approach that uses designated metrics with expected optimal values. A key insight is that expected values of these metrics are essential in order to verify that no performance is wasted during incremental utilization of processor features. We define sample primary metrics for modern architectures and present three distinct techniques that help to determine their optimal values. Our preliminary results successfully provide 2x-4x extra speedup during tuning of commonly-used software optimizations on the matrix-multiply kernel. Additionally, our approach helped to identify counter-intuitive causes that hurt multicore scalability of an optimized deep-learning benchmark on a Cascade Lake server.

Journal ArticleDOI
TL;DR: Rusty, a framework that leverages the power of Long Short-Term Memory networks to forecast, at runtime, performance metrics of applications executed on systems under interference, is presented.
Abstract: Modern cloud-scale data centers are adopting workload co-location as an effective mechanism for improving resource utilization. However, workload co-location stresses resource availability in an unconventional and unpredictable manner. Efficient resource management requires continuous and ideally predictive runtime knowledge of system metrics, sensitive both to workload demands, e.g., CPU, memory, etc., and to interference effects induced by co-location. In this paper, we present Rusty, a framework that addresses the aforementioned challenges by leveraging the power of Long Short-Term Memory networks to forecast, at runtime, performance metrics of applications executed on systems under interference. We evaluate Rusty under a diverse set of interference scenarios for a plethora of cloud workloads, showing that Rusty achieves extremely high prediction accuracy, up to 0.99 in terms of the $R^2$ value, while satisfying the strict latency constraints required to be usable at runtime.
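A minimal PyTorch sketch of the forecasting core, an LSTM regressor mapping a window of recent system metrics to the next sample, is given below with synthetic data standing in for the runtime traces; the layer sizes and training details are assumptions, not Rusty's configuration.

```python
import torch
import torch.nn as nn

class MetricForecaster(nn.Module):
    """Minimal LSTM regressor: a window of past system metrics -> the next sample."""
    def __init__(self, num_metrics=4, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(num_metrics, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_metrics)

    def forward(self, x):                  # x: (batch, window, num_metrics)
        out, _ = self.lstm(x)
        return self.head(out[:, -1, :])    # predict the next metric vector

model = MetricForecaster()
optim = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

# Synthetic stand-in for traces of runtime metrics (IPC, cache misses, ...).
x = torch.randn(256, 32, 4)
y = torch.randn(256, 4)
for _ in range(5):
    optim.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optim.step()
print(float(loss))
```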