scispace - formally typeset

Shaizeen Aga

Researcher at Advanced Micro Devices

Publications: 20
Citations: 363

Shaizeen Aga is an academic researcher at Advanced Micro Devices whose work spans topics in computer science and memory management. The author has an h-index of 6, having co-authored 16 publications that have received 261 citations. Previous affiliations of Shaizeen Aga include the University of Michigan and Qualcomm.

Papers
Proceedings Article

Compute Caches

TL;DR: This paper presents the Compute Cache architecture, which enables in-place computation in caches by using emerging bit-line SRAM circuit technology to repurpose existing cache elements, transforming them into very large active vector computational units.
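The bit-line computing idea summarized above can be sketched in a few lines: activating two SRAM rows simultaneously lets the sense amplifiers observe a logic function of both rows at once (AND on the bit-lines, NOR on their complements), so a whole cache line acts as one wide vector operand. The following Python model is an illustrative assumption-laden sketch, not the paper's implementation; cache lines are modeled as 512-bit integers.

```python
# Illustrative model of bit-line in-SRAM computing (not the paper's circuit).
LINE_BITS = 512               # model a 64-byte cache line as a 512-bit vector
MASK = (1 << LINE_BITS) - 1   # keep results within the line width

def inplace_and(row_a: int, row_b: int) -> int:
    """Element-wise AND of two cache lines, as sensed on the bit-lines."""
    return row_a & row_b

def inplace_nor(row_a: int, row_b: int) -> int:
    """Element-wise NOR, as sensed on the complementary bit-lines."""
    return ~(row_a | row_b) & MASK

a, b = 0b1100, 0b1010
print(bin(inplace_and(a, b)))           # 0b1000
print(bin(inplace_nor(a, b) & 0b1111))  # 0b1 (low 4 bits: 0001)
```

The key property the architecture exploits is that these operations cover an entire cache line per activation, giving very wide vector parallelism without moving data out of the cache.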
Proceedings Article

InvisiMem: Smart Memory Defenses for Memory Bus Side Channel

TL;DR: It is demonstrated that smart memory (memory with compute capability and a packetized interface) can dramatically simplify defenses against memory bus side channels, with one to two orders of magnitude lower overheads in performance, space, energy, and memory bandwidth compared to prior solutions.
Proceedings Article

Efficiently enforcing strong memory ordering in GPUs

TL;DR: This paper shows that the performance cost of SC and TSO relative to DRF-0 is insignificant for most GPGPU applications, owing to warp-level parallelism and in-order execution, and proposes a GPU-specific non-speculative SC design that exploits the high spatial locality and temporally private data in GPU applications.
Proceedings Article

MOCA: Memory Object Classification and Allocation in Heterogeneous Memory Systems

TL;DR: A memory object classification and allocation framework (MOCA) is designed to characterize memory objects and allocate each to its best-fit memory module, improving performance and energy efficiency.
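The classify-then-allocate flow described above can be illustrated with a short sketch. Everything here is a hypothetical assumption for illustration only: the metric (`accesses_per_kiloinstr`), the thresholds, and the module names (HBM, DDR, NVM) are not taken from MOCA itself.

```python
# Hypothetical sketch of classifying memory objects and mapping each class
# to a best-fit module in a heterogeneous memory system. Metrics, thresholds,
# and module names are illustrative assumptions, not MOCA's actual policy.
from dataclasses import dataclass

@dataclass
class MemObject:
    name: str
    accesses_per_kiloinstr: float  # profiled access intensity
    size_mb: float

def classify(obj: MemObject) -> str:
    """Label an object by its profiled access intensity."""
    if obj.accesses_per_kiloinstr > 10.0:
        return "bandwidth-critical"
    if obj.accesses_per_kiloinstr > 1.0:
        return "latency-sensitive"
    return "cold"

# Best-fit module per class (illustrative mapping).
PLACEMENT = {
    "bandwidth-critical": "HBM",  # high-bandwidth memory
    "latency-sensitive": "DDR",   # conventional DRAM
    "cold": "NVM",                # capacity-optimized non-volatile memory
}

def allocate(objs):
    """Map each profiled object to its best-fit memory module."""
    return {o.name: PLACEMENT[classify(o)] for o in objs}

objs = [MemObject("matrix", 25.0, 64),
        MemObject("index", 4.0, 8),
        MemObject("log", 0.2, 256)]
print(allocate(objs))  # {'matrix': 'HBM', 'index': 'DDR', 'log': 'NVM'}
```

The design point being illustrated is that placement decisions follow per-object profiles rather than a single global policy, which is what lets a heterogeneous system serve hot and cold objects from different modules.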
Proceedings Article

Co-ML: a case for Collaborative ML acceleration using near-data processing

TL;DR: A case is made for a more collaborative approach to ML acceleration, termed Co-ML, in which memory plays an active role and is responsible for NDP-amenable (near-data processing) computations, while compute-intensive computations are executed on the host accelerator as before.