
Showing papers on "Memory management" published in 2020


Journal ArticleDOI
Onur Mutlu, Jeremie S. Kim
TL;DR: Kim et al. as mentioned in this paper comprehensively survey the scientific literature on RowHammer-based attacks as well as mitigation techniques to prevent RowHammer, and discuss what other related vulnerabilities may be lurking in DRAM and other types of memories, e.g., NAND flash memory or phase change memory, that can potentially threaten the foundations of secure systems.
Abstract: This retrospective paper describes the RowHammer problem in dynamic random access memory (DRAM), which was initially introduced by Kim et al. at the ISCA 2014 Conference. RowHammer is a prime (and perhaps the first) example of how a circuit-level failure mechanism can cause a practical and widespread system security vulnerability. It is the phenomenon that repeatedly accessing a row in a modern DRAM chip causes bit flips in physically adjacent rows at consistently predictable bit locations. RowHammer is caused by a hardware failure mechanism called DRAM disturbance errors, which is a manifestation of circuit-level cell-to-cell interference in a scaled memory technology. Researchers from Google Project Zero demonstrated in 2015 that this hardware failure mechanism can be effectively exploited by user-level programs to gain kernel privileges on real systems. Many other follow-up works demonstrated other practical attacks exploiting RowHammer. In this paper, we comprehensively survey the scientific literature on RowHammer-based attacks as well as mitigation techniques to prevent RowHammer. We also discuss what other related vulnerabilities may be lurking in DRAM and other types of memories, e.g., NAND flash memory or phase change memory, that can potentially threaten the foundations of secure systems, as these memory technologies scale to higher densities. We conclude by describing and advocating a principled approach to memory reliability and security research that can enable us to better anticipate and prevent such vulnerabilities.

153 citations


Proceedings ArticleDOI
01 Nov 2020
TL;DR: The Zero Redundancy Optimizer (ZeRO) as mentioned in this paper eliminates memory redundancies in data- and model-parallel training while retaining low communication volume and high computational granularity, allowing the model size to scale in proportion to the number of devices.
Abstract: Large deep learning models offer significant accuracy gains, but training billions to trillions of parameters is challenging. Existing solutions such as data and model parallelism exhibit fundamental limitations when fitting these models into limited device memory while retaining computation, communication, and development efficiency. We develop a novel solution, the Zero Redundancy Optimizer (ZeRO), to optimize memory, vastly improving training speed while increasing the model size that can be efficiently trained. ZeRO eliminates memory redundancies in data- and model-parallel training while retaining low communication volume and high computational granularity, allowing us to scale the model size in proportion to the number of devices with sustained high efficiency. Our analysis of memory requirements and communication volume demonstrates that ZeRO has the potential to scale beyond 1 trillion parameters using today's hardware. We implement and evaluate ZeRO: it trains large models of over 100B parameters with super-linear speedup on 400 GPUs, achieving a throughput of 15 petaflops. This represents an 8x increase in model size and a 10x increase in achievable performance over the state-of-the-art. In terms of usability, ZeRO can train large models of up to 13B parameters (e.g., larger than Megatron GPT 8.3B and T5 11B) without requiring model parallelism, which is harder for scientists to apply. Last but not least, researchers have used the system breakthroughs of ZeRO to create Turing-NLG, the world's largest language model at the time (17B parameters), with record-breaking accuracy.
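As a back-of-the-envelope check of these claims, the ZeRO paper's memory breakdown for mixed-precision Adam (2 bytes of fp16 parameters, 2 bytes of fp16 gradients, and 12 bytes of fp32 optimizer state per parameter) can be turned into a small calculator; the staging below follows the paper's three optimization stages, and the function name is ours:

```python
# Per-device memory for mixed-precision Adam training, following the ZeRO
# paper's breakdown: 2 bytes (fp16 params) + 2 bytes (fp16 grads) + 12 bytes
# (fp32 master params, momentum, variance) per model parameter.
def per_device_gb(params, n_devices, stage):
    p, g, opt = 2 * params, 2 * params, 12 * params  # bytes
    if stage == 0:            # plain data parallelism: full replica everywhere
        total = p + g + opt
    elif stage == 1:          # ZeRO-1: partition optimizer states
        total = p + g + opt / n_devices
    elif stage == 2:          # ZeRO-2: also partition gradients
        total = p + (g + opt) / n_devices
    else:                     # ZeRO-3: also partition parameters
        total = (p + g + opt) / n_devices
    return total / 2**30      # GiB

# A 7.5B-parameter model on 64 GPUs:
for s in range(4):
    print(f"stage {s}: {per_device_gb(7.5e9, 64, s):.1f} GiB")
# stage 0: 111.8 GiB, stage 1: 29.2 GiB, stage 2: 15.5 GiB, stage 3: 1.7 GiB
```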

126 citations


Proceedings ArticleDOI
01 May 2020
TL;DR: By employing Rowhammer as a read side channel, it is demonstrated that Rowhammer is a threat not only to integrity but to confidentiality as well, revealing the first security implication of successfully-corrected bit flips, which were previously considered benign.
Abstract: The Rowhammer bug is a reliability issue in DRAM cells that can enable an unprivileged adversary to flip the values of bits in neighboring rows on the memory module. Previous work has exploited this for various types of fault attacks across security boundaries, where the attacker flips inaccessible bits, often resulting in privilege escalation. It is widely assumed, however, that bit flips within the adversary’s own private memory have no security implications, as the attacker can already modify its private memory via regular write operations. We demonstrate that this assumption is incorrect by employing Rowhammer as a read side channel. More specifically, we show how an unprivileged attacker can exploit the data dependence between Rowhammer-induced bit flips and the bits in nearby rows to deduce these bits, including values belonging to other processes and the kernel. Thus, the primary contribution of this work is to show that Rowhammer is a threat not only to integrity, but to confidentiality as well. Furthermore, in contrast to Rowhammer write side channels, which require persistent bit flips, our read channel succeeds even when ECC memory detects and corrects every bit flip. Thus, we demonstrate the first security implication of successfully-corrected bit flips, which were previously considered benign. To demonstrate the implications of this read side channel, we present an end-to-end attack on OpenSSH 7.9 that extracts an RSA-2048 key from the root-level SSH daemon. To accomplish this, we develop novel techniques for massaging memory from user space into an exploitable state, and use the DRAM row-buffer timing side channel to locate the physically contiguous memory necessary for double-sided Rowhammering. Unlike previous Rowhammer attacks, our attack does not require the use of huge pages, and it works on Ubuntu Linux under its default configuration settings.

96 citations


Proceedings ArticleDOI
09 Mar 2020
TL;DR: Capuchin is proposed, a tensor-based GPU memory management module that reduces the memory footprint via tensor eviction/prefetching and recomputation, making memory management decisions based on the dynamic tensor access pattern tracked at runtime.
Abstract: In recent years, deep learning has gained unprecedented success in various domains; the key to this success is larger and deeper deep neural networks (DNNs) that achieve very high accuracy. On the other hand, since GPU global memory is a scarce resource, large models also pose a significant challenge due to the memory requirements of the training process. This restriction limits flexibility in exploring DNN architectures. In this paper, we propose Capuchin, a tensor-based GPU memory management module that reduces the memory footprint via tensor eviction/prefetching and recomputation. The key feature of Capuchin is that it makes memory management decisions based on the dynamic tensor access pattern tracked at runtime. This design is motivated by the observation that the access pattern to tensors is regular across training iterations. Based on the identified patterns, one can exploit the total memory optimization space and offer fine-grained, flexible control of when and how to perform memory optimization techniques. We deploy Capuchin in a widely used deep learning framework, TensorFlow, and show that Capuchin can reduce the memory footprint by up to 85% across 6 state-of-the-art DNNs compared to the original TensorFlow. In particular, for the NLP model BERT, the maximum batch size that Capuchin can achieve outperforms TensorFlow and gradient-checkpointing by 7x and 2.1x, respectively. We also show that Capuchin outperforms vDNN and gradient-checkpointing by up to 286% and 55% under the same memory oversubscription.
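The swap-versus-recompute decision described here can be pictured with a toy cost model: free the tensors that yield the most memory per second of overhead first, choosing per tensor whichever of swapping or recomputation is cheaper. The class layout, timings, and ranking metric below are our own simplification for illustration, not Capuchin's actual policy:

```python
# Toy swap-vs-recompute planner driven by tracked tensor access costs.
from dataclasses import dataclass

@dataclass
class Tensor:
    name: str
    bytes: int
    swap_in_time: float     # seconds to prefetch back from host memory
    recompute_time: float   # seconds to regenerate from saved inputs

def plan_evictions(tensors, bytes_needed):
    plan, freed = [], 0
    # Rank candidates by overhead per byte freed (cheapest relief first).
    for t in sorted(tensors,
                    key=lambda t: min(t.swap_in_time, t.recompute_time) / t.bytes):
        if freed >= bytes_needed:
            break
        action = "swap" if t.swap_in_time <= t.recompute_time else "recompute"
        plan.append((t.name, action))
        freed += t.bytes
    return plan

feats = [Tensor("conv1_out", 512 << 20, 0.020, 0.004),
         Tensor("attn_out", 256 << 20, 0.010, 0.030)]
print(plan_evictions(feats, 600 << 20))
# [('conv1_out', 'recompute'), ('attn_out', 'swap')]
```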

94 citations


Proceedings ArticleDOI
01 Feb 2020
TL;DR: The DLA is required to support various network topologies and highly precise neural operations in smartphone applications, and the Android neural network APIs currently specify the use of asymmetric quantization (ASYMM-Q), providing better precision than conventional symmetric quantization.
Abstract: Recent advancements in deep learning (DL) have led to the wide adoption of AI applications, such as image recognition [1], image de-noising, and speech recognition, in 5G smartphones. For a satisfactory user experience, there are stringent requirements on the real-time response of smartphone applications. To meet the performance expectations for DL, numerous deep learning accelerators (DLAs) have been proposed for DL inference on edge devices [2]–[5]. As depicted in Fig. 7.1.1, the major challenge in designing a DLA for smartphones is achieving the required computing efficiency while limited by the power budget and memory bandwidth (BW). Since the overall power consumption of a smartphone system-on-a-chip (SoC) is usually constrained to 2-to-3W and the available DRAM BW is around 10-to-30GB/s, the power budget allocated to a DLA must be below 1W, with the memory BW limited to 1-to-10GB/s. While operating under such constraints, the DLA is required to support various network topologies and highly precise neural operations in smartphone applications. For instance, the Android neural network APIs currently specify the use of asymmetric quantization (ASYMM-Q), which provides better precision than conventional symmetric quantization.
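For reference, asymmetric uint8 quantization of the kind the Android NNAPI specifies maps a float range [min, max] onto [0, 255] with a zero point, so ranges not centered on zero waste no codes. A minimal sketch (our own illustration, not the NNAPI implementation):

```python
import numpy as np

def asymm_quantize(x, num_bits=8):
    qmax = 2**num_bits - 1
    lo, hi = float(x.min()), float(x.max())
    lo, hi = min(lo, 0.0), max(hi, 0.0)       # range must contain zero
    scale = (hi - lo) / qmax
    zero_point = int(round(-lo / scale))       # the code that represents 0.0
    q = np.clip(np.round(x / scale) + zero_point, 0, qmax).astype(np.uint8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    return scale * (q.astype(np.float32) - zero_point)

x = np.array([-0.3, 0.0, 0.7, 1.2], dtype=np.float32)
q, s, zp = asymm_quantize(x)
print(q, s, zp, dequantize(q, s, zp))
# [  0  51 170 255], scale ~0.00588, zero_point 51, round-trips exactly here
```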

68 citations


Proceedings ArticleDOI
06 Jul 2020
TL;DR: SpreadSketch is presented, an invertible sketch data structure for network-wide superspreader detection with theoretical guarantees on memory space, performance, and accuracy; trace-driven evaluation shows that SpreadSketch achieves higher accuracy and performance than state-of-the-art sketches.
Abstract: Superspreaders (i.e., hosts with numerous distinct connections) remain severe threats to production networks. How to accurately detect superspreaders in real-time at scale remains a non-trivial yet challenging issue. We present SpreadSketch, an invertible sketch data structure for network-wide superspreader detection with the theoretical guarantees on memory space, performance, and accuracy. SpreadSketch tracks candidate super-spreaders and embeds estimated fan-outs in binary hash strings inside small and static memory space, such that multiple SpreadSketch instances can be merged to provide a network-wide measurement view for recovering superspreaders and their estimated fan-outs. We present formal theoretical analysis on SpreadSketch in terms of space and time complexities as well as error bounds. Trace-driven evaluation shows that SpreadSketch achieves higher accuracy and performance over state-of-the-art sketches. Furthermore, we prototype SpreadSketch in P4 and show its feasible deployment in commodity hardware switches.
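In rough strokes, invertibility means each bucket itself stores a candidate key next to its fan-out evidence, so heavy sources can be enumerated from the structure directly. The bucket layout and level-bitmap estimator below are a drastic simplification of the abstract, for illustration only:

```python
import hashlib

def h(x, seed):
    digest = hashlib.blake2b(f"{seed}:{x}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")

ROWS, COLS = 3, 1024
# Each bucket holds [candidate_source, candidate_level, level_bitmap].
buckets = [[[None, -1, 0] for _ in range(COLS)] for _ in range(ROWS)]

def update(src, dst):
    # Geometric "level" of this (src, dst) pair: level l occurs with
    # probability 2^-(l+1), so only high-fan-out sources reach high levels.
    v = h((src, dst), 99) | 1 << 63            # guard against v == 0
    level = (v & -v).bit_length() - 1
    for r in range(ROWS):
        b = buckets[r][h(src, r) % COLS]
        b[2] |= 1 << min(level, 63)            # crude log-scale fan-out record
        if level >= b[1]:                      # remember the highest-level source
            b[0], b[1] = src, level

def candidates(min_bits=4):
    # Invertibility: candidate superspreaders are read back out of the buckets.
    return {c for row in buckets for c, _, bm in row
            if c is not None and bin(bm).count("1") >= min_bits}
```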

57 citations


Proceedings ArticleDOI
01 Oct 2020
TL;DR: DUAL is proposed, a Digital-based Unsupervised learning AcceLeration that supports a wide range of popular algorithms on conventional crossbar memory and provides quality comparable to existing clustering algorithms while using a binary representation and a simplified distance metric.
Abstract: Today’s applications generate a large amount of data that need to be processed by learning algorithms. In practice, the majority of the data are not associated with any labels. Unsupervised learning, i.e., clustering, is the most commonly used approach for analyzing such data. However, running clustering algorithms on traditional cores results in high energy consumption and slow processing speed due to the large amount of data movement between memory and processing units. In this paper, we propose DUAL, a Digital-based Unsupervised learning AcceLeration, which supports a wide range of popular algorithms on conventional crossbar memory. Instead of working with the original data, DUAL maps all data points into high-dimensional space, replacing complex clustering operations with memory-friendly operations. We accordingly design a PIM-based architecture that supports all essential operations in a highly parallel and scalable way and enables in-place computation, allowing data points to remain in memory. We have evaluated DUAL on several popular clustering algorithms over a wide range of large-scale datasets. Our evaluation shows that DUAL provides quality comparable to existing clustering algorithms while using a binary representation and a simplified distance metric. DUAL also provides 58.8× speedup and 251.2× energy efficiency improvement compared to the state-of-the-art solution running on GPU.
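The encoding step sketched here, mapping each point into high-dimensional space so that clustering needs only memory-friendly operations, can be illustrated with a generic hyperdimensional encoder (random projection plus sign); DUAL's actual encoder and PIM mapping differ:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 4096                                    # hypervector dimensionality

def encode(x, proj):
    # Random projection then sign -> binary hypervector in {0,1}^D.
    return (x @ proj > 0).astype(np.uint8)

def hamming(a, b):
    return int(np.count_nonzero(a != b))    # simplified distance metric

X = rng.normal(size=(100, 16))              # 100 points, 16 features
proj = rng.normal(size=(16, D))
H = np.array([encode(x, proj) for x in X])

# One k-means-style assignment step using Hamming distance to binary centroids;
# in a crossbar, this distance reduces to XOR plus popcount done in memory.
centroids = H[rng.choice(len(H), 3, replace=False)]
labels = [min(range(3), key=lambda k: hamming(hv, centroids[k])) for hv in H]
```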

50 citations


Journal ArticleDOI
TL;DR: This work proposes a memory-efficient learning procedure that exploits the reversibility of the network's layers to enable physics-based learning for large-scale computational imaging systems.
Abstract: Critical aspects of computational imaging systems, such as experimental design and image priors, can be optimized through deep networks formed by the unrolled iterations of classical physics-based reconstructions. Termed physics-based networks, they incorporate both the known physics of the system via its forward model, and the power of deep learning via data-driven training. However, for realistic large-scale physics-based networks, computing gradients via backpropagation is infeasible due to the memory limitations of graphics processing units. In this work, we propose a memory-efficient learning procedure that exploits the reversibility of the network's layers to enable physics-based learning for large-scale computational imaging systems. We demonstrate our method on a compressed sensing example, as well as two large-scale real-world systems: 3D multi-channel magnetic resonance imaging and super-resolution optical microscopy.
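Why reversibility removes the activation-memory bottleneck is easiest to see with the standard additive-coupling construction used in reversible networks; the paper's physics-based layers obtain invertibility from their own structure, so this is an illustration of the principle, not their method:

```python
import numpy as np

# Additive coupling: y1 = x1 + F(x2), y2 = x2 + G(y1). Inputs are recovered
# exactly from outputs, so no activations need to be stored for backprop.
F = lambda v: np.tanh(v)          # stand-in sub-networks; any F, G work,
G = lambda v: 0.5 * v**2          # they never need to be invertible themselves

def forward(x1, x2):
    y1 = x1 + F(x2)
    y2 = x2 + G(y1)
    return y1, y2

def inverse(y1, y2):
    x2 = y2 - G(y1)               # recover x2 first...
    x1 = y1 - F(x2)               # ...then x1, exactly
    return x1, x2

x1, x2 = np.random.randn(4), np.random.randn(4)
assert np.allclose(inverse(*forward(x1, x2)), (x1, x2))
```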

49 citations


Proceedings ArticleDOI
11 Oct 2020
TL;DR: DistDGL as mentioned in this paper is a system for training GNNs in a mini-batch fashion on a cluster of machines based on the Deep Graph Library (DGL), a popular GNN development framework.
Abstract: Graph neural networks (GNNs) have shown great success in learning from graph-structured data. They are widely used in various applications, such as recommendation, fraud detection, and search. In these domains, the graphs are typically large, containing hundreds of millions of nodes and several billion edges. To tackle this challenge, we develop DistDGL, a system for training GNNs in a mini-batch fashion on a cluster of machines. DistDGL is based on the Deep Graph Library (DGL), a popular GNN development framework. DistDGL distributes the graph and its associated data (initial features and embeddings) across the machines and uses this distribution to derive a computational decomposition by following an owner-compute rule. DistDGL follows a synchronous training approach and allows ego-networks forming the mini-batches to include non-local nodes. To minimize the overheads associated with distributed computations, DistDGL uses a high-quality, lightweight min-cut graph partitioning algorithm along with multiple balancing constraints. This allows it to reduce communication overheads and statically balance the computations. It further reduces communication by replicating halo nodes and by using sparse embedding updates. The combination of these design choices allows DistDGL to train high-quality models while achieving high parallel efficiency and memory scalability. We demonstrate our optimizations on both inductive and transductive GNN models. Our results show that DistDGL achieves linear speedup without compromising model accuracy and requires only 13 seconds to complete a training epoch for a graph with 100 million nodes and 3 billion edges on a cluster with 16 machines.

48 citations


Proceedings ArticleDOI
18 May 2020
TL;DR: The approach, called xMP, provides (in-guest) selective memory protection primitives that allow VMs to isolate sensitive data in user or kernel space in disjoint xMP domains, and takes advantage of virtualization extensions, but after initialization, it does not require any hypervisor intervention.
Abstract: Attackers leverage memory corruption vulnerabilities to establish primitives for reading from or writing to the address space of a vulnerable process. These primitives form the foundation for code-reuse and data-oriented attacks. While various defenses against the former class of attacks have proven effective, mitigation of the latter remains an open problem. In this paper, we identify various shortcomings of the x86 architecture regarding memory isolation, and leverage virtualization to build an effective defense against data-oriented attacks. Our approach, called xMP, provides (in-guest) selective memory protection primitives that allow VMs to isolate sensitive data in user or kernel space in disjoint xMP domains. We interface the Xen altp2m subsystem with the Linux memory management system, lending VMs the flexibility to define custom policies. Contrary to conventional approaches, xMP takes advantage of virtualization extensions, but after initialization, it does not require any hypervisor intervention. To ensure the integrity of in-kernel management information and pointers to sensitive data within isolated domains, xMP protects pointers with HMACs bound to an immutable context, so that integrity validation succeeds only in the right context. We have applied xMP to protect the page tables and process credentials of the Linux kernel, as well as sensitive data in various user-space applications. Overall, our evaluation shows that xMP introduces minimal overhead for real-world workloads and applications, and offers effective protection against data-oriented attacks.

47 citations


Proceedings ArticleDOI
30 Jul 2020
TL;DR: NetLock is a new centralized lock manager that co-designs servers and network switches to achieve high performance without sacrificing flexibility in policy support; its key idea is to exploit the capability of emerging programmable switches to directly process lock requests in the switch data plane.
Abstract: Lock managers are widely used by distributed systems. Traditional centralized lock managers can easily support policies between multiple users using global knowledge, but they suffer from low performance. In contrast, emerging decentralized approaches are faster but cannot provide flexible policy support. Furthermore, performance in both cases is limited by the server capability. We present NetLock, a new centralized lock manager that co-designs servers and network switches to achieve high performance without sacrificing flexibility in policy support. The key idea of NetLock is to exploit the capability of emerging programmable switches to directly process lock requests in the switch data plane. Due to the limited switch memory, we design a memory management mechanism to seamlessly integrate the switch and server memory. To realize the locking functionality in the switch, we design a custom data plane module that efficiently pools multiple register arrays together to maximize memory utilization. We have implemented a NetLock prototype with a Barefoot Tofino switch and a cluster of commodity servers. Evaluation results show that NetLock improves throughput by 14.0-18.4x and reduces the average and 99th-percentile latency by 4.7-20.3x and 10.4-18.7x over DSLR, a state-of-the-art RDMA-based solution, while providing flexible policy support.

Proceedings ArticleDOI
09 Mar 2020
TL;DR: This work provides the first comprehensive analysis of major inefficiencies that arise in the page fault handling mechanisms employed in modern GPUs, and proposes a GPU runtime software and hardware solution that increases the batch size and reduces the number of batches, thereby amortizing the fault handling time.
Abstract: While unified virtual memory and demand paging in modern GPUs provide convenient abstractions to programmers for working with large-scale applications, they come at a significant performance cost. We provide the first comprehensive analysis of major inefficiencies that arise in page fault handling mechanisms employed in modern GPUs. To amortize the high costs in fault handling, the GPU runtime processes a large number of GPU page faults together. We observe that this batched processing of page faults introduces large-scale serialization that greatly hurts the GPU's execution throughput. We show real machine measurements that corroborate our findings. Our goal is to mitigate these inefficiencies and enable efficient demand paging for GPUs. To this end, we propose a GPU runtime software and hardware solution that (1) increases the batch size (i.e., the number of page faults handled together), thereby amortizing the fault handling time, and reduces the number of batches by supporting CPU-like thread block context switching, and (2) takes page eviction off the critical path with no hardware changes by overlapping evictions with CPU-to-GPU page migrations. Our evaluation demonstrates that the proposed solution provides an average speedup of 2x over the state-of-the-art page prefetching. We show that our solution increases the batch size by 2.27x and reduces the total number of batches by 51% on average. We also show that the average batch processing time is reduced by 27%.

Journal ArticleDOI
TL;DR: In this article, the authors proposed FSpiNN, an optimization framework for obtaining memory-efficient and energy-efficient SNNs for training and inference processing, with unsupervised learning capability while maintaining accuracy.
Abstract: Spiking neural networks (SNNs) are gaining interest due to their event-driven processing, which potentially consumes low power/energy in hardware platforms while offering unsupervised learning capability due to the spike-timing-dependent plasticity (STDP) rule. However, state-of-the-art SNNs require a large memory footprint to achieve high accuracy, thereby making them difficult to deploy on embedded systems, for instance, on battery-powered mobile devices and IoT edge nodes. Toward this, we propose FSpiNN, an optimization framework for obtaining memory-efficient and energy-efficient SNNs for training and inference processing, with unsupervised learning capability while maintaining accuracy. It is achieved by: 1) reducing the computational requirements of neuronal and STDP operations; 2) improving the accuracy of STDP-based learning; 3) compressing the SNN through fixed-point quantization; and 4) incorporating the memory and energy requirements in the optimization process. FSpiNN reduces the computational requirements by reducing the number of neuronal operations, the STDP-based synaptic weight updates, and the STDP complexity. To improve the accuracy of learning, FSpiNN employs timestep-based synaptic weight updates and adaptively determines the STDP potentiation factor and the effective inhibition strength. The experimental results show that, compared to the state-of-the-art work, FSpiNN achieves 7.5x memory saving and improves energy efficiency by 3.5x on average for training and by 1.8x on average for inference, across the MNIST and Fashion MNIST datasets, with no accuracy loss for a network with 4900 excitatory neurons, thereby enabling energy-efficient SNNs for edge devices/embedded systems.

Journal ArticleDOI
TL;DR: This work introduces PRECISION, an algorithm that uses Partial Recirculation to find top flows on a programmable switch and achieves higher accuracy than previous heavy hitter detection algorithms that avoid recirculation, and suggests two algorithms for the hierarchical heavy hitters detection problem.
Abstract: Programmable network switches promise flexibility and high throughput, enabling applications such as load balancing and traffic engineering. Network measurement is a fundamental building block for such applications, including tasks such as the identification of heavy hitters (largest flows) or the detection of traffic changes. However, high-throughput packet processing architectures place certain limitations on the programming model, such as restricted branching, limited capability for memory access, and a limited number of processing stages. These limitations restrict the types of measurement algorithms that can run on programmable switches. In this paper, we focus on the Reconfigurable Match Tables (RMT) programmable high-throughput switch architecture, and carefully examine its constraints on designing measurement algorithms. We demonstrate our findings while solving the heavy hitter problem. We introduce PRECISION, an algorithm that uses Partial Recirculation to find top flows on a programmable switch. By recirculating a small fraction of packets, PRECISION simplifies the access to stateful memory to conform with RMT limitations and achieves higher accuracy than previous heavy hitter detection algorithms that avoid recirculation. We also evaluate each of the adaptations made by PRECISION and analyze its effect on the measurement accuracy. Finally, we suggest two algorithms for the hierarchical heavy hitters detection problem in which the goal is identifying the subnets that send excessive traffic and are potentially malicious. To the best of our knowledge, our work is the first to do so on RMT switches.
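The partial-recirculation idea can be sketched in a few lines: a packet that misses in every stage is recirculated with small probability to claim the minimum-valued entry, so heavy flows eventually win slots with bounded extra pipeline traffic. The probability rule below is a common probabilistic-replacement heuristic and a simplification of PRECISION's actual policy; dicts stand in for per-stage hash-indexed register arrays:

```python
import random

STAGES, SLOTS = 2, 4096
tables = [dict() for _ in range(STAGES)]    # per-stage flow -> counter (toy)

def process(flow):
    for t in tables:
        if flow in t:                        # hit in some stage: just increment
            t[flow] += 1
            return
    # Miss everywhere: consider the stage holding the overall minimum entry.
    victim_table = min(tables, key=lambda t: min(t.values(), default=0))
    min_count = min(victim_table.values(), default=0)
    if len(victim_table) < SLOTS:
        victim_table[flow] = 1               # free slot, no recirculation needed
    elif random.random() < 1.0 / (min_count + 1):
        # "Recirculated" packet claims the minimum entry on its second pass.
        victim = min(victim_table, key=victim_table.get)
        del victim_table[victim]
        victim_table[flow] = min_count + 1
```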

Journal ArticleDOI
Juhyoung Lee, Jinsu Lee, Hoi-Jun Yoo
TL;DR: The SRNPU is the first ASIC implementation of the CNN-based SR algorithm which supports real-time Full-HD up-scaling and achieves higher restoration performance and power efficiency than previous SR hardware implementations.
Abstract: In this article, we propose an energy-efficient convolutional neural network (CNN) based super-resolution (SR) processor, the super-resolution neural processing unit (SRNPU), for mobile applications. Traditionally, it is hard to realize real-time CNN-based SR on resource-limited platforms like mobile devices due to the massive computational workload and communication bandwidth with external memory. The SRNPU supports tile-based selective super-resolution (TSSR), which dynamically selects a properly sized CNN in a tile-by-tile manner. TSSR reduces the computational workload of CNN-based SR by 31.1% while maintaining image restoration performance. Moreover, the proposed selective caching based convolutional layer fusion (SC2LF) reduces external memory bandwidth by 78.8% with a 93.2% smaller on-chip memory footprint compared with previous layer fusion methods, by caching only short-reuse-distance intermediate feature maps. Additionally, a reconfigurable cyclic ring architecture in the SRNPU maintains high PE utilization by amortizing the reloading process caused by SC2LF operation under various convolutional layer configurations. The SRNPU is fabricated in 65 nm CMOS technology and occupies a 4×4 mm2 die area. It has a peak power efficiency of 1.9 TOPS/W at 0.75 V, 50 MHz, and achieves 31.8 fps ×2-scale Full-HD generation and 88.3 fps ×4-scale Full-HD generation with higher restoration performance and power efficiency than previous SR hardware implementations. To the best of our knowledge, the SRNPU is the first ASIC implementation of a CNN-based SR algorithm that supports real-time Full-HD upscaling.

Posted Content
TL;DR: A transformer serving system called TurboTransformers, which consists of a computing runtime and a serving framework that can achieve the state-of-the-art transformer model serving performance on GPU platforms and can be seamlessly integrated into PyTorch code with a few lines of code.
Abstract: The transformer is the most critical algorithmic innovation in the Natural Language Processing (NLP) field in recent years. Unlike Recurrent Neural Network (RNN) models, transformers can process the sequence-length dimension in parallel, leading to better accuracy on long sequences. However, efficient deployments of them for online services in data centers equipped with GPUs are not easy. First, the additional computation introduced by transformer structures makes it more challenging to meet the latency and throughput constraints of serving. Second, NLP tasks take in sentences of variable length. The variability of input dimensions brings a severe problem to efficient memory management and serving optimization. This paper presents a transformer serving system called TurboTransformers, which consists of a computing runtime and a serving framework that solve the above challenges. Three innovative features make it stand out from other similar works. An efficient parallel algorithm is proposed for GPU-based batch reduction operations, like Softmax and LayerNorm, which are major hot spots besides BLAS routines. A memory allocation algorithm, which better balances memory footprint and allocation/free efficiency, is designed for variable-length input situations. A serving framework equipped with a new batch scheduler using dynamic programming achieves the optimal throughput on variable-length requests. The system achieves state-of-the-art transformer model serving performance on GPU platforms and can be seamlessly integrated into PyTorch code with a few lines of code.

Proceedings ArticleDOI
11 Jun 2020
TL;DR: This paper studies the problem of autotuning the memory allocation for applications running on modern distributed data processing systems, and shows that an empirically-driven "white-box" algorithm, called RelM, provides close-to-optimal tuning at a fraction of the overheads compared to state-of-the-art AI-driven "black-box" algorithms, namely, Bayesian Optimization (BO) and Deep Distributed Policy Gradient (DDPG).
Abstract: There is a lot of interest today in building autonomous (or, self-driving) data processing systems. An emerging school of thought is to leverage AI-driven "black-box" algorithms for this purpose. In this paper, we present a contrarian view. We study the problem of autotuning the memory allocation for applications running on modern distributed data processing systems. We show that RelM, an empirically-driven "white-box" algorithm that we have developed, provides close-to-optimal tuning at a fraction of the overheads compared to state-of-the-art AI-driven "black-box" algorithms, namely, Bayesian Optimization (BO) and Deep Distributed Policy Gradient (DDPG). The main reason for RelM's superior performance is that memory management in modern memory-based data analytics systems is an interplay of algorithms at multiple levels: (i) at the resource-management level, across the various containers allocated by resource managers like Kubernetes and YARN; (ii) at the container level, among the OS, pods, and processes such as the Java Virtual Machine (JVM); (iii) at the application level, for caching, aggregation, data shuffles, and application data structures; and (iv) at the JVM level, across various pools such as the Young and Old Generation. RelM understands these interactions and uses them in building an analytical solution to autotune the memory management knobs. In another contribution, called Guided-BO (GBO), we use RelM's analytical models to speed up BO. Through an evaluation based on Apache Spark, we showcase that RelM's recommendations are significantly better than what commonly used Spark deployments provide and are close to those obtained by brute-force exploration, while GBO provides optimality guarantees at a cost overhead that is higher than RelM's but still significantly lower than that of state-of-the-art AI-driven policies.

Proceedings ArticleDOI
18 May 2020
TL;DR: A programmer-agnostic runtime that leverages hardware access counters to automatically categorize memory allocations based on access pattern and frequency is proposed, and it is shown that although the scheme is designed to address memory oversubscription, it has no impact on performance when working sets fit in device-local memory.
Abstract: Unified Memory in heterogeneous systems serves a wide range of applications. However, the limited capacity of device memory becomes a first-order performance bottleneck for data-intensive general-purpose applications with growing working sets. The performance overhead under memory oversubscription depends on the memory access pattern of the corresponding workload. While a regular application with sequential, dense memory accesses suffers from long-latency write-backs, the performance of an irregular application with sparse, seldom accesses to large data sets degrades due to page thrashing. Although smart spatio-temporal prefetching and large-page eviction yield good performance in general, remote zero-copy access to host-pinned memory proves to be beneficial for irregular, data-intensive applications. Further, new-generation GPUs have introduced hardware access counters to delay page migration and reduce memory thrashing. However, the responsibility of deciding which strategy is the best fit for a given application relies heavily on the programmer, based on a thorough understanding of the memory access pattern obtained through intrusive profiling. In this work, we propose a programmer-agnostic runtime that leverages the hardware access counters to automatically categorize memory allocations based on access pattern and frequency. The proposed heuristic adaptively navigates between remote zero-copy access to host-pinned memory and first-touch page migration based on the trade-off between low-latency remote access and high-bandwidth local access. We show that although it is designed to address memory oversubscription, our scheme has no impact on performance when working sets fit in the device-local memory. Experimental results show that our scheme provides performance improvements of 22% to 78% for irregular applications under 125% memory oversubscription compared to the state of the art. At the same time, regular applications are not impacted by the framework.
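The policy choice this runtime automates boils down to a counter-driven promotion rule; the threshold, class, and names below are our invention, purely to illustrate the trade-off between cheap remote access and amortized migration:

```python
# Toy policy switch between zero-copy remote access and page migration,
# driven by hardware-style access counters per allocation. Threshold invented.
MIGRATE_THRESHOLD = 32       # re-accesses that justify paying migration cost

class Allocation:
    def __init__(self, name):
        self.name = name
        self.access_count = 0
        self.policy = "zero-copy"          # start with cheap remote access

    def on_access(self):
        self.access_count += 1
        # Dense re-use amortizes a page migration; sparse one-shot access
        # is cheaper served remotely over the interconnect.
        if self.policy == "zero-copy" and self.access_count >= MIGRATE_THRESHOLD:
            self.policy = "migrate"        # promote to device-local pages

a = Allocation("edge_list")
for _ in range(40):
    a.on_access()
print(a.policy)   # migrate
```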

Proceedings ArticleDOI
09 Mar 2020
TL;DR: Although the results focus on memory allocation, the paper addresses ML-specific questions such as tolerating mispredictions and amortizing expensive predictions across application execution, and the questions it identifies apply to other system-level problems with strict latency and resource requirements where machine learning could be applied.
Abstract: Modern C++ servers have memory footprints that vary widely over time, causing persistent heap fragmentation of up to 2x from long-lived objects allocated during peak memory usage. This fragmentation is exacerbated by the use of huge (2MB) pages, a requirement for high performance on large heap sizes. Reducing fragmentation automatically is challenging because C++ memory managers cannot move objects. This paper presents a new approach to huge-page fragmentation. It combines modern machine learning techniques with a novel memory manager (LLAMA) that manages the heap based on object lifetimes and huge pages (divided into blocks and lines). A neural-network-based language model predicts lifetime classes using symbolized calling contexts. The model learns context-sensitive per-allocation-site lifetimes from previous runs, generalizes over different binary versions, and extrapolates from samples to unobserved calling contexts. Instead of size classes, LLAMA's heap is organized by lifetime classes that are dynamically adjusted based on observed behavior at a block granularity. LLAMA reduces memory fragmentation by up to 78% while using only huge pages on several production servers. We address ML-specific questions such as tolerating mispredictions and amortizing expensive predictions across application execution. Although our results focus on memory allocation, the questions we identify apply to other system-level problems with strict latency and resource requirements where machine learning could be applied.
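The lifetime-class organization can be sketched as follows; a lookup table stands in for the paper's learned lifetime predictor, and the class boundaries, names, and contexts are placeholders:

```python
import collections

# Lifetime classes in ms; LLAMA learns these from symbolized calling contexts
# with a neural model -- we fake the prediction with a lookup table.
LIFETIME_CLASSES = [1, 100, 10_000, float("inf")]
predicted_ms = {"handler:parse": 1, "cache:insert": 10_000}  # "trained" offline

heap = collections.defaultdict(list)      # lifetime class -> blocks of objects

def allocate(size, calling_context):
    ms = predicted_ms.get(calling_context, float("inf"))     # unseen -> longest
    cls = next(c for c in LIFETIME_CLASSES if ms <= c)
    # Objects with similar lifetimes share blocks, so whole blocks free
    # together instead of leaving long-lived stragglers in huge pages.
    heap[cls].append((calling_context, size))
    return cls

allocate(64, "handler:parse")     # short-lived: packed with other short-lived
allocate(4096, "cache:insert")    # long-lived: never fragments the same block
```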

Proceedings ArticleDOI
01 Feb 2020
TL;DR: Impala, a multi-stride in-memory automata processing architecture, is presented, which introduces three-fold area, throughput, and energy benefits at the expense of increased offline compilation time.
Abstract: High-throughput and concurrent processing of thousands of patterns on each byte of an input stream is critical for many applications with real-time processing needs, such as network intrusion detection, spam filters, virus scanners, and many more. The demand for accelerated pattern matching has motivated several recent in-memory accelerator architectures for automata processing, which is an efficient computation model for pattern matching. Our key observations are: (1) all these architectures are based on 8-bit symbol processing (derived from ASCII), and our analysis on a large set of real-world automata benchmarks reveals that the 8-bit processing dramatically underutilizes hardware resources, and (2) multi-stride symbol processing, a major source of throughput growth, is not explored in the existing in-memory solutions. This paper presents Impala, a multi-stride in-memory automata processing architecture by leveraging our observations. The key insight of our work is that transforming 8-bit processing to 4-bit processing exponentially reduces hardware resources for state-matching and improves resource utilization. This, in turn, brings the opportunity to have a denser design, and be able to utilize more memory columns to process multiple symbols per cycle with a linear increase in state-matching resources. Impala thus introduces three-fold area, throughput, and energy benefits at the expense of increased offline compilation time. Our empirical evaluations on a wide range of automata benchmarks reveal that Impala has on average 2.7X (up to 3.7X) higher throughput per unit area and 1.22X lower power consumption than Cache Automaton, which is the best performing prior work.
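The 4-bit transformation that drives Impala's density gains is simple to state: split every 8-bit input symbol into two nibbles, shrinking the alphabet from 256 to 16 while doubling the symbol stream's length. A minimal sketch:

```python
def to_nibbles(data: bytes):
    # Each 8-bit symbol becomes two 4-bit symbols (high nibble, low nibble).
    out = []
    for byte in data:
        out.append(byte >> 4)
        out.append(byte & 0xF)
    return out

# Matching "ab" over 4-bit symbols: the automaton now needs alphabet-16
# state-matching columns instead of alphabet-256, at twice the symbol rate,
# which is what frees memory columns for multi-symbol strides.
print(to_nibbles(b"ab"))   # [6, 1, 6, 2]  (0x61 = 'a', 0x62 = 'b')
```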

Proceedings ArticleDOI
30 May 2020
TL;DR: Results show that CA paging is highly effective at creating vast contiguous mappings, even when memory is fragmented, and SpOT exploits the created contiguity and reduces address translation overhead of nested paging from ~16.5% to ~0.9%.
Abstract: We propose synergistic software and hardware mechanisms that alleviate the address translation overhead, focusing particularly on virtualized execution. On the software side, we propose contiguity-aware (CA) paging, a novel physical memory allocation technique that creates larger-than-a-page contiguous mappings while preserving the flexibility of demand paging. CA paging applies to the hypervisor and guest OS memory manager independently, as well as to native systems. Moreover, CA paging benefits any address translation scheme that leverages contiguous mappings. On the hardware side, we propose SpOT, a simple micro-architectural mechanism to hide TLB miss latency by exploiting the regularity of large contiguous mappings to predict address translations in both native and virtualized systems. We implement and emulate the proposed techniques for the x86-64 architecture in Linux and KVM, and evaluate them across a variety of memory-intensive workloads. Our results show that: (i) CA paging is highly effective at creating vast contiguous mappings, even when memory is fragmented, and (ii) SpOT exploits the created contiguity and reduces address translation overhead of nested paging from ~16.5% to ~0.9%.

Proceedings ArticleDOI
Soroush Bateni, Zhendong Wang, Yuankun Zhu, Yang Hu, Cong Liu
21 Apr 2020
TL;DR: This work develops a performance model that can predict system overhead for each memory management method based on application characteristics and proposes a runtime scheduler that can significantly relieve the system memory pressure and reduce the multitasking co-run response time.
Abstract: Cutting-edge embedded system applications, such as self-driving cars and unmanned drone software, are reliant on integrated CPU/GPU platforms for their DNNs-driven workload, such as perception and other highly parallel components. In this work, we set out to explore the hidden performance implication of GPU memory management methods of integrated CPU/GPU architecture. Through a series of experiments on micro-benchmarks and real-world workloads, we find that the performance under different memory management methods may vary according to application characteristics. Based on this observation, we develop a performance model that can predict system overhead for each memory management method based on application characteristics. Guided by the performance model, we further propose a runtime scheduler. By conducting per-task memory management policy switching and kernel overlapping, the scheduler can significantly relieve the system memory pressure and reduce the multitasking co-run response time. We have implemented and extensively evaluated our system prototype on the NVIDIA Jetson TX2, Drive PX2, and Xavier AGX platforms, using both Rodinia benchmark suite and two real-world case studies of drone software and autonomous driving software.

Proceedings ArticleDOI
01 May 2020
TL;DR: Cornucopia is a lightweight capability revocation system for CHERI that implements non-probabilistic C/C++ temporal memory safety for standard heap allocations and extends the CheriBSD virtual-memory subsystem to track capability flow through memory and provides a concurrent kernel-resident revocation service that is amenable to multi-processor and hardware acceleration.
Abstract: Use-after-free violations of temporal memory safety continue to plague software systems, underpinning many high-impact exploits. The CHERI capability system shows great promise in achieving C and C++ language spatial memory safety, preventing out-of-bounds accesses. Enforcing language-level temporal safety on CHERI requires capability revocation, traditionally achieved either via table lookups (avoided for performance in the CHERI design) or by identifying capabilities in memory to revoke them (similar to a garbage-collector sweep). CHERIvoke, a prior feasibility study, suggested that CHERI’s tagged capabilities could make this latter strategy viable, but it modeled only architectural limits and did not consider the full implementation or evaluation of the approach. Cornucopia is a lightweight capability revocation system for CHERI that implements non-probabilistic C/C++ temporal memory safety for standard heap allocations. It extends the CheriBSD virtual-memory subsystem to track capability flow through memory and provides a concurrent kernel-resident revocation service that is amenable to multi-processor and hardware acceleration. We demonstrate an average overhead of less than 2% and a worst case of 8.9% for concurrent revocation on compatible SPEC CPU2006 benchmarks on a multi-core CHERI CPU on FPGA, and we validate Cornucopia against the Juliet test suite’s corpus of temporally unsafe programs. We test its compatibility with a large corpus of C programs by using a revoking allocator as the system allocator while booting multi-user CheriBSD. Cornucopia is a viable strategy for always-on temporal heap memory safety, suitable for production environments.

Proceedings ArticleDOI
30 May 2020
TL;DR: This work enables the allocation of large 2MB pages even in the presence of fragmented physical memory via perforated pages, and evaluates the effectiveness of perforated pages with timing simulations under diverse and realistic fragmentation scenarios.
Abstract: The availability of large pages has dramatically improved the efficiency of address translation for applications that use large contiguous regions of memory. However, large pages can be difficult to allocate due to fragmented memory, non-movable pages, or the need to split a large page into regular pages when part of the large page is forced to have a different permission status from the rest of the page. Furthermore, they can also be expensive due to memory bloating caused by sparse accesses to application data. In this work, we enable the allocation of large 2MB pages even in the presence of fragmented physical memory via perforated pages. Perforated pages permit the OS to punch 4KB page-sized holes in the physical address range allocated to a large page and re-map them to other addresses as needed. This not only enables the system to benefit from large pages in the presence of fragmentation, but also allows for different permissions to exist within a large page, enhancing sharing flexibility. In addition, it allows unused parts of a large page to be used elsewhere, mitigating memory bloating. To minimize changes to the system, perforated pages reuse the 4KB-level page table entries to store the hole locations and translates holes into regular 4KB pages. For performance, the proposed technique caches the translations for hole pages in the TLBs and track holes via cached bitmaps in the L2 TLB. By enabling large pages in the presence of physical memory fragmentation, perforated pages increase the applicability and resulting benefits of large pages with only minor changes to the hardware and OS. In this work, we evaluate the effectiveness of perforated pages with timing simulations under diverse and realistic fragmentation scenarios. Our results show that even with fragmented memory, perforated pages accomplish 93.2% to 99.9% of the performance achievable by ideal memory allocation, and 2.0% to 11.5% better performance over the conventional system running with fragmented memory.
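A toy model of the translation path described here: a bitmap identifies which 4KB slots of a 2MB mapping are holes, and only those fall back to 4KB-granularity entries. Names and structure below are our illustration, not the proposed hardware design:

```python
PAGE_4K, PAGE_2M = 4096, 2 << 20

class PerforatedMapping:
    def __init__(self, virt_base, phys_base):
        self.virt_base, self.phys_base = virt_base, phys_base
        self.hole_bitmap = 0               # bit i set => 4KB subpage i is a hole
        self.hole_remap = {}               # subpage index -> remapped 4KB frame

    def punch_hole(self, index, new_frame):
        self.hole_bitmap |= 1 << index
        self.hole_remap[index] = new_frame

    def translate(self, vaddr):
        off = vaddr - self.virt_base
        idx = off // PAGE_4K
        if self.hole_bitmap >> idx & 1:    # hole: use the 4KB-level entry
            return self.hole_remap[idx] + off % PAGE_4K
        return self.phys_base + off        # fast path: contiguous 2MB mapping

m = PerforatedMapping(0x40000000, 0x80000000)
m.punch_hole(3, 0xdead000)
print(hex(m.translate(0x40000000 + 3 * PAGE_4K + 0x10)))   # 0xdead010 (hole)
print(hex(m.translate(0x40000000 + 0x10)))                 # 0x80000010 (fast)
```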

Posted Content
TL;DR: It is argued that keeping all entities in memory is unnecessary, and a memory-augmented neural network that tracks only a small bounded number of entities at a time is proposed, thus guaranteeing a linear runtime in the length of the document.
Abstract: Long document coreference resolution remains a challenging task due to the large memory and runtime requirements of current models. Recent work on incremental coreference resolution using just the global representation of entities shows practical benefits but requires keeping all entities in memory, which can be impractical for long documents. We argue that keeping all entities in memory is unnecessary, and we propose a memory-augmented neural network that tracks only a small bounded number of entities at a time, thus guaranteeing a linear runtime in the length of the document. We show that (a) the model remains competitive with models with high memory and computational requirements on OntoNotes and LitBank, and (b) the model learns an efficient memory management strategy that easily outperforms a rule-based strategy.
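The bounded entity memory amounts to a fixed number of slots plus an eviction rule; this toy uses least-recently-mentioned eviction (one plausible strategy, whereas the paper learns its own) to show why per-token work, and hence total runtime, stays bounded:

```python
class BoundedEntityMemory:
    def __init__(self, num_slots=20):
        self.num_slots = num_slots
        self.slots = {}                    # entity id -> last mention position

    def observe(self, entity, position):
        if entity in self.slots or len(self.slots) < self.num_slots:
            self.slots[entity] = position  # update or fill a free slot
            return
        # Memory full: evict the entity mentioned longest ago. Constant work
        # per mention (bounded slots) => linear runtime in document length.
        stalest = min(self.slots, key=self.slots.get)
        del self.slots[stalest]
        self.slots[entity] = position

mem = BoundedEntityMemory(num_slots=2)
for pos, ent in enumerate(["Anna", "Bob", "Anna", "Carol"]):
    mem.observe(ent, pos)
print(mem.slots)   # {'Anna': 2, 'Carol': 3} -- Bob was evicted
```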

Journal ArticleDOI
TL;DR: Various in-memory computing primitives in both CMOS and emerging nonvolatile memory (NVM) technologies are discussed and how such primitives can be incorporated in standalone machine learning accelerator architectures are described.
Abstract: Machine learning applications, especially deep neural networks (DNNs), have seen ubiquitous use in computer vision, speech recognition, and robotics. However, the growing complexity of DNN models has necessitated efficient hardware implementations. The key compute primitives of DNNs are matrix-vector multiplications, which lead to significant data movement between memory and processing units in today's von Neumann systems. A promising alternative would be colocating memory and processing elements, which can be further extended to performing computations inside the memory itself. We believe in-memory computing is a propitious candidate for future DNN accelerators, since it mitigates the memory wall bottleneck. In this article, we discuss various in-memory computing primitives in both CMOS and emerging nonvolatile memory (NVM) technologies. Subsequently, we describe how such primitives can be incorporated into standalone machine learning accelerator architectures. Finally, we analyze the challenges associated with designing such in-memory computing accelerators and explore future opportunities.

Journal ArticleDOI
TL;DR: MV-Sketch as discussed by the authors tracks candidate heavy flows inside the sketch data structure via the idea of majority voting, such that it incurs small memory access overhead in both update and query operations, while achieving high detection accuracy.
Abstract: Fast detection of heavy flows (e.g., heavy hitters and heavy changers) in massive network traffic is challenging due to the stringent requirements of fast packet processing and limited resource availability. Invertible sketches are summary data structures that can recover heavy flows with small memory footprints and bounded errors, yet existing invertible sketches incur high memory access overhead that leads to performance degradation. We present MV-Sketch, a fast and compact invertible sketch that supports heavy flow detection with small and static memory allocation. MV-Sketch tracks candidate heavy flows inside the sketch data structure via the idea of majority voting, such that it incurs small memory access overhead in both update and query operations, while achieving high detection accuracy. We present theoretical analysis on the memory usage, performance, and accuracy of MV-Sketch in both local and network-wide scenarios. We further show how MV-Sketch can be implemented and deployed on P4-based programmable switches subject to hardware deployment constraints. We conduct evaluation in both software and hardware environments. Trace-driven evaluation in software shows that MV-Sketch achieves higher accuracy than existing invertible sketches, with up to 3.38x throughput gain. We also show how to boost the performance of MV-Sketch with SIMD instructions. Furthermore, we evaluate MV-Sketch on a Barefoot Tofino switch and show how MV-Sketch achieves line-rate measurement with limited hardware resource overhead.
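The majority-voting idea behind each bucket is a Misra-Gries-style pairing of a total counter with a candidate counter, which is why each update touches so little memory. A single-bucket toy in Python (field names follow our reading of the abstract, not necessarily the paper's notation):

```python
class MVBucket:
    """One bucket: total count V, candidate key K, majority counter C."""
    def __init__(self):
        self.V, self.K, self.C = 0, None, 0

    def update(self, key, weight=1):
        self.V += weight
        if self.K == key:
            self.C += weight
        else:
            self.C -= weight
            if self.C < 0:                 # majority flipped: adopt new candidate
                self.K, self.C = key, -self.C

    def estimate(self, key):
        # Majority-vote bounds: the candidate's true count lies between C and V,
        # so (V + C) / 2 is the midpoint estimate; non-candidates get (V - C) / 2.
        return (self.V + self.C) // 2 if key == self.K else (self.V - self.C) // 2

b = MVBucket()
for k in ["f1", "f1", "f2", "f1", "f3", "f1"]:
    b.update(k)
print(b.K, b.estimate("f1"))   # f1 4  (true count of f1 is 4)
```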

Proceedings ArticleDOI
01 Oct 2020
TL;DR: This paper proposes AOS, a low-overhead always-on heap memory safety solution that implements a novel bounds-checking mechanism and introduces a micro-architectural unit to remove the need for memory checking instructions.
Abstract: Memory safety violations, caused by illegal use of pointers in unsafe programming languages such as C and C++, have been a major threat to modern computer systems. However, implementing a low-overhead yet robust runtime memory safety solution is still challenging. Various hardware-based mechanisms have been proposed, but their significant hardware requirements have limited their feasibility, and their performance overhead is too high to be an always-on solution. In this paper, we propose AOS, a low-overhead always-on heap memory safety solution that implements a novel bounds-checking mechanism. We identify that the major challenges of existing bounds-checking approaches are 1) the extra instruction overhead for memory checking and metadata propagation and 2) the complex metadata addressing. To address these challenges, using Arm PA primitives, we leverage the unused upper bits of a pointer to store a key and have it propagated along with the pointer address, eliminating propagation overhead. Then, we use the embedded key to index a hashed bounds table to achieve efficient metadata management. We also introduce a micro-architectural unit to remove the need for memory checking instructions. We show that AOS overcomes all the aforementioned challenges and demonstrate its feasibility as an efficient runtime memory safety solution. Our evaluation on the SPEC 2006 workloads shows an 8.4% performance overhead on average.
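The key-in-upper-bits scheme can be mocked up in a few lines: the key travels with the pointer for free, and a hashed table indexed by it holds the bounds. The bit layout, table size, and hash below are illustrative assumptions, not the paper's encoding:

```python
KEY_SHIFT = 48                              # stash key in unused upper VA bits
bounds_table = {}                           # hashed key -> (base, size)

def malloc_checked(base, size, key):
    bounds_table[key % 4096] = (base, size) # hashed bounds-table index (toy)
    return (key << KEY_SHIFT) | base        # tagged pointer

def check_access(tagged_ptr, width=1):
    key = tagged_ptr >> KEY_SHIFT           # key propagated with the pointer
    addr = tagged_ptr & ((1 << KEY_SHIFT) - 1)
    base, size = bounds_table[key % 4096]
    if not (base <= addr and addr + width <= base + size):
        raise MemoryError(f"out-of-bounds access at {hex(addr)}")
    return addr

p = malloc_checked(0x10000, 64, key=0xBEEF)
check_access(p + 63)                        # fine: last in-bounds byte
# check_access(p + 64)                      # would raise MemoryError
```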

Journal ArticleDOI
TL;DR: It is found that Intel's persistent memory is highly sensitive to data locality, size, and access patterns, which becomes clearer when both virtual memory page size and data layout are optimized for locality.
Abstract: We evaluated Intel® Optane™ DC Persistent Memory and found that Intel's persistent memory is highly sensitive to data locality, size, and access patterns, which becomes clearer when both virtual memory page size and data layout are optimized for locality. Using the PolyBench high-performance computing benchmark suite and controlling for mapped page size, we evaluate persistent memory (PMEM) performance relative to DRAM. In particular, the Linux PMEM support preferentially maps persistent memory to large pages while always mapping DRAM to small pages. We observed that using large pages for PMEM and small pages for DRAM can create a 5x difference in performance, dwarfing other effects discussed in the literature. We found PMEM performance comparable to DRAM performance for the majority of tests when controlling for page size and optimizing for data locality.