
Showing papers on "Memory management" published in 2020


Journal ArticleDOI
Onur Mutlu, Jeremie S. Kim
TL;DR: Kim et al. as mentioned in this paper comprehensively survey the scientific literature on RowHammer-based attacks as well as mitigation techniques to prevent RowHammer, and discuss what other related vulnerabilities may be lurking in DRAM and other types of memories, e.g., NAND flash memory or phase change memory, that can potentially threaten the foundations of secure systems.
Abstract: This retrospective paper describes the RowHammer problem in dynamic random access memory (DRAM), which was initially introduced by Kim et al. at the ISCA 2014 Conference. RowHammer is a prime (and perhaps the first) example of how a circuit-level failure mechanism can cause a practical and widespread system security vulnerability. It is the phenomenon that repeatedly accessing a row in a modern DRAM chip causes bit flips in physically adjacent rows at consistently predictable bit locations. RowHammer is caused by a hardware failure mechanism called DRAM disturbance errors, which is a manifestation of circuit-level cell-to-cell interference in a scaled memory technology. Researchers from Google Project Zero demonstrated in 2015 that this hardware failure mechanism can be effectively exploited by user-level programs to gain kernel privileges on real systems. Many other follow-up works demonstrated other practical attacks exploiting RowHammer. In this paper, we comprehensively survey the scientific literature on RowHammer-based attacks as well as mitigation techniques to prevent RowHammer. We also discuss what other related vulnerabilities may be lurking in DRAM and other types of memories, e.g., NAND flash memory or phase change memory, that can potentially threaten the foundations of secure systems, as these memory technologies scale to higher densities. We conclude by describing and advocating a principled approach to memory reliability and security research that can enable us to better anticipate and prevent such vulnerabilities.

153 citations


Proceedings ArticleDOI
01 Nov 2020
TL;DR: The Zero Redundancy Optimizer (ZeRO) as mentioned in this paper eliminates memory redundancies in data- and model-parallel training while retaining low communication volume and high computational granularity, allowing the model size to scale in proportion to the number of devices.
Abstract: Large deep learning models offer significant accuracy gains, but training billions to trillions of parameters is challenging. Existing solutions such as data and model parallelism exhibit fundamental limitations when fitting these models into limited device memory while retaining computation, communication, and development efficiency. We develop a novel solution, the Zero Redundancy Optimizer (ZeRO), to optimize memory, vastly improving training speed while increasing the model size that can be efficiently trained. ZeRO eliminates memory redundancies in data- and model-parallel training while retaining low communication volume and high computational granularity, allowing us to scale the model size in proportion to the number of devices with sustained high efficiency. Our analysis of memory requirements and communication volume demonstrates that ZeRO has the potential to scale beyond 1 trillion parameters using today's hardware. We implement and evaluate ZeRO: it trains large models of over 100B parameters with super-linear speedup on 400 GPUs, achieving a throughput of 15 petaflops. This represents an 8x increase in model size and a 10x increase in achievable performance over the state-of-the-art. In terms of usability, ZeRO can train large models of up to 13B parameters (e.g., larger than Megatron GPT 8.3B and T5 11B) without requiring model parallelism, which is harder for scientists to apply. Last but not least, researchers have used the system breakthroughs of ZeRO to create Turing-NLG, the world's largest language model at the time (17B parameters), with record-breaking accuracy.
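As a back-of-the-envelope check of these claims, the ZeRO paper's memory breakdown for mixed-precision Adam (2 bytes of fp16 parameters, 2 bytes of fp16 gradients, and 12 bytes of fp32 optimizer state per parameter) can be turned into a small calculator; the staging below follows the paper's three optimization stages, and the function name is ours:

```python
# Per-device memory for mixed-precision Adam training, following the ZeRO
# paper's breakdown: 2 bytes (fp16 params) + 2 bytes (fp16 grads) + 12 bytes
# (fp32 master params, momentum, variance) per model parameter.
def per_device_gb(params, n_devices, stage):
    p, g, opt = 2 * params, 2 * params, 12 * params  # bytes
    if stage == 0:            # plain data parallelism: full replica everywhere
        total = p + g + opt
    elif stage == 1:          # ZeRO-1: partition optimizer states
        total = p + g + opt / n_devices
    elif stage == 2:          # ZeRO-2: also partition gradients
        total = p + (g + opt) / n_devices
    else:                     # ZeRO-3: also partition parameters
        total = (p + g + opt) / n_devices
    return total / 2**30      # GiB

# A 7.5B-parameter model on 64 GPUs:
for s in range(4):
    print(f"stage {s}: {per_device_gb(7.5e9, 64, s):.1f} GiB")
# stage 0: 111.8 GiB, stage 1: 29.2 GiB, stage 2: 15.5 GiB, stage 3: 1.7 GiB
```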

126 citations


Proceedings ArticleDOI
01 May 2020
TL;DR: By employing Rowhammer as a read side channel, it is demonstrated that Rowhammer is a threat not only to integrity but to confidentiality as well, revealing the first security implication of successfully-corrected bit flips, which were previously considered benign.
Abstract: The Rowhammer bug is a reliability issue in DRAM cells that can enable an unprivileged adversary to flip the values of bits in neighboring rows on the memory module. Previous work has exploited this for various types of fault attacks across security boundaries, where the attacker flips inaccessible bits, often resulting in privilege escalation. It is widely assumed, however, that bit flips within the adversary’s own private memory have no security implications, as the attacker can already modify its private memory via regular write operations. We demonstrate that this assumption is incorrect by employing Rowhammer as a read side channel. More specifically, we show how an unprivileged attacker can exploit the data dependence between Rowhammer-induced bit flips and the bits in nearby rows to deduce these bits, including values belonging to other processes and the kernel. Thus, the primary contribution of this work is to show that Rowhammer is a threat not only to integrity, but to confidentiality as well. Furthermore, in contrast to Rowhammer write side channels, which require persistent bit flips, our read channel succeeds even when ECC memory detects and corrects every bit flip. Thus, we demonstrate the first security implication of successfully-corrected bit flips, which were previously considered benign. To demonstrate the implications of this read side channel, we present an end-to-end attack on OpenSSH 7.9 that extracts an RSA-2048 key from the root-level SSH daemon. To accomplish this, we develop novel techniques for massaging memory from user space into an exploitable state, and use the DRAM row-buffer timing side channel to locate the physically contiguous memory necessary for double-sided Rowhammering. Unlike previous Rowhammer attacks, our attack does not require the use of huge pages, and it works on Ubuntu Linux under its default configuration settings.

96 citations


Proceedings ArticleDOI
09 Mar 2020
TL;DR: Capuchin is proposed, a tensor-based GPU memory management module that reduces the memory footprint via tensor eviction/prefetching and recomputation, making memory management decisions based on the dynamic tensor access pattern tracked at runtime.
Abstract: In recent years, deep learning has gained unprecedented success in various domains; the key to this success is larger and deeper deep neural networks (DNNs) that achieve very high accuracy. On the other hand, since GPU global memory is a scarce resource, large models also pose a significant challenge due to the memory requirements of the training process. This restriction limits flexibility in exploring DNN architectures. In this paper, we propose Capuchin, a tensor-based GPU memory management module that reduces the memory footprint via tensor eviction/prefetching and recomputation. The key feature of Capuchin is that it makes memory management decisions based on the dynamic tensor access pattern tracked at runtime. This design is motivated by the observation that the access pattern to tensors is regular across training iterations. Based on the identified patterns, one can exploit the total memory optimization space and offer fine-grained, flexible control of when and how to perform memory optimization techniques. We deploy Capuchin in a widely used deep learning framework, TensorFlow, and show that Capuchin can reduce the memory footprint by up to 85% across 6 state-of-the-art DNNs compared to the original TensorFlow. In particular, for the NLP model BERT, the maximum batch size that Capuchin can achieve outperforms TensorFlow and gradient-checkpointing by 7x and 2.1x, respectively. We also show that Capuchin outperforms vDNN and gradient-checkpointing by up to 286% and 55% under the same memory oversubscription.
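The swap-versus-recompute decision described here can be pictured with a toy cost model: free the tensors that yield the most memory per second of overhead first, choosing per tensor whichever of swapping or recomputation is cheaper. The class layout, timings, and ranking metric below are our own simplification for illustration, not Capuchin's actual policy:

```python
# Toy swap-vs-recompute planner driven by tracked tensor access costs.
from dataclasses import dataclass

@dataclass
class Tensor:
    name: str
    bytes: int
    swap_in_time: float     # seconds to prefetch back from host memory
    recompute_time: float   # seconds to regenerate from saved inputs

def plan_evictions(tensors, bytes_needed):
    plan, freed = [], 0
    # Rank candidates by overhead per byte freed (cheapest relief first).
    for t in sorted(tensors,
                    key=lambda t: min(t.swap_in_time, t.recompute_time) / t.bytes):
        if freed >= bytes_needed:
            break
        action = "swap" if t.swap_in_time <= t.recompute_time else "recompute"
        plan.append((t.name, action))
        freed += t.bytes
    return plan

feats = [Tensor("conv1_out", 512 << 20, 0.020, 0.004),
         Tensor("attn_out", 256 << 20, 0.010, 0.030)]
print(plan_evictions(feats, 600 << 20))
# [('conv1_out', 'recompute'), ('attn_out', 'swap')]
```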

94 citations


Proceedings ArticleDOI
01 Feb 2020
TL;DR: The DLA is required to support various network topologies and highly precise neural operations in smartphone applications, and the Android neural network APIs currently specify the use of asymmetric quantization (ASYMM-Q), providing better precision than conventional symmetric quantization.
Abstract: Recent advancements in deep learning (DL) have led to the wide adoption of AI applications, such as image recognition [1], image de-noising, and speech recognition, in 5G smartphones. For a satisfactory user experience, there are stringent requirements on the real-time response of smartphone applications. To meet the performance expectations for DL, numerous deep learning accelerators (DLAs) have been proposed for DL inference on edge devices [2]–[5]. As depicted in Fig. 7.1.1, the major challenge in designing a DLA for smartphones is achieving the required computing efficiency while limited by the power budget and memory bandwidth (BW). Since the overall power consumption of a smartphone system-on-a-chip (SoC) is usually constrained to 2-to-3W and the available DRAM BW is around 10-to-30GB/s, the power budget allocated to a DLA must be below 1W, with the memory BW limited to 1-to-10GB/s. While operating under such constraints, the DLA is required to support various network topologies and highly precise neural operations in smartphone applications. For instance, the Android neural network APIs currently specify the use of asymmetric quantization (ASYMM-Q), which provides better precision than conventional symmetric quantization.
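For reference, asymmetric uint8 quantization of the kind the Android NNAPI specifies maps a float range [min, max] onto [0, 255] with a zero point, so ranges not centered on zero waste no codes. A minimal sketch (our own illustration, not the NNAPI implementation):

```python
import numpy as np

def asymm_quantize(x, num_bits=8):
    qmax = 2**num_bits - 1
    lo, hi = float(x.min()), float(x.max())
    lo, hi = min(lo, 0.0), max(hi, 0.0)       # range must contain zero
    scale = (hi - lo) / qmax
    zero_point = int(round(-lo / scale))       # the code that represents 0.0
    q = np.clip(np.round(x / scale) + zero_point, 0, qmax).astype(np.uint8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    return scale * (q.astype(np.float32) - zero_point)

x = np.array([-0.3, 0.0, 0.7, 1.2], dtype=np.float32)
q, s, zp = asymm_quantize(x)
print(q, s, zp, dequantize(q, s, zp))
# [  0  51 170 255], scale ~0.00588, zero_point 51, round-trips exactly here
```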

68 citations


Proceedings ArticleDOI
06 Jul 2020
TL;DR: SpreadSketch is presented, an invertible sketch data structure for network-wide superspreader detection with theoretical guarantees on memory space, performance, and accuracy; trace-driven evaluation shows that SpreadSketch achieves higher accuracy and performance than state-of-the-art sketches.
Abstract: Superspreaders (i.e., hosts with numerous distinct connections) remain severe threats to production networks. How to accurately detect superspreaders in real-time at scale remains a non-trivial yet challenging issue. We present SpreadSketch, an invertible sketch data structure for network-wide superspreader detection with the theoretical guarantees on memory space, performance, and accuracy. SpreadSketch tracks candidate super-spreaders and embeds estimated fan-outs in binary hash strings inside small and static memory space, such that multiple SpreadSketch instances can be merged to provide a network-wide measurement view for recovering superspreaders and their estimated fan-outs. We present formal theoretical analysis on SpreadSketch in terms of space and time complexities as well as error bounds. Trace-driven evaluation shows that SpreadSketch achieves higher accuracy and performance over state-of-the-art sketches. Furthermore, we prototype SpreadSketch in P4 and show its feasible deployment in commodity hardware switches.
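In rough strokes, invertibility means each bucket itself stores a candidate key next to its fan-out evidence, so heavy sources can be enumerated from the structure directly. The bucket layout and level-bitmap estimator below are a drastic simplification of the abstract, for illustration only:

```python
import hashlib

def h(x, seed):
    digest = hashlib.blake2b(f"{seed}:{x}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")

ROWS, COLS = 3, 1024
# Each bucket holds [candidate_source, candidate_level, level_bitmap].
buckets = [[[None, -1, 0] for _ in range(COLS)] for _ in range(ROWS)]

def update(src, dst):
    # Geometric "level" of this (src, dst) pair: level l occurs with
    # probability 2^-(l+1), so only high-fan-out sources reach high levels.
    v = h((src, dst), 99) | 1 << 63            # guard against v == 0
    level = (v & -v).bit_length() - 1
    for r in range(ROWS):
        b = buckets[r][h(src, r) % COLS]
        b[2] |= 1 << min(level, 63)            # crude log-scale fan-out record
        if level >= b[1]:                      # remember the highest-level source
            b[0], b[1] = src, level

def candidates(min_bits=4):
    # Invertibility: candidate superspreaders are read back out of the buckets.
    return {c for row in buckets for c, _, bm in row
            if c is not None and bin(bm).count("1") >= min_bits}
```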

57 citations


Proceedings ArticleDOI
01 Oct 2020
TL;DR: DUAL is proposed, a Digital-based Unsupervised learning AcceLeration that supports a wide range of popular algorithms on conventional crossbar memory and provides quality comparable to existing clustering algorithms while using a binary representation and a simplified distance metric.
Abstract: Today’s applications generate a large amount of data that need to be processed by learning algorithms. In practice, the majority of the data are not associated with any labels. Unsupervised learning, i.e., clustering, is the most commonly used approach for analyzing such data. However, running clustering algorithms on traditional cores results in high energy consumption and slow processing speed due to the large amount of data movement between memory and processing units. In this paper, we propose DUAL, a Digital-based Unsupervised learning AcceLeration, which supports a wide range of popular algorithms on conventional crossbar memory. Instead of working with the original data, DUAL maps all data points into high-dimensional space, replacing complex clustering operations with memory-friendly operations. We accordingly design a PIM-based architecture that supports all essential operations in a highly parallel and scalable way and enables in-place computation, allowing data points to remain in memory. We have evaluated DUAL on several popular clustering algorithms over a wide range of large-scale datasets. Our evaluation shows that DUAL provides quality comparable to existing clustering algorithms while using a binary representation and a simplified distance metric. DUAL also provides 58.8× speedup and 251.2× energy efficiency improvement compared to the state-of-the-art solution running on GPU.
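The encoding step sketched here, mapping each point into high-dimensional space so that clustering needs only memory-friendly operations, can be illustrated with a generic hyperdimensional encoder (random projection plus sign); DUAL's actual encoder and PIM mapping differ:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 4096                                    # hypervector dimensionality

def encode(x, proj):
    # Random projection then sign -> binary hypervector in {0,1}^D.
    return (x @ proj > 0).astype(np.uint8)

def hamming(a, b):
    return int(np.count_nonzero(a != b))    # simplified distance metric

X = rng.normal(size=(100, 16))              # 100 points, 16 features
proj = rng.normal(size=(16, D))
H = np.array([encode(x, proj) for x in X])

# One k-means-style assignment step using Hamming distance to binary centroids;
# in a crossbar, this distance reduces to XOR plus popcount done in memory.
centroids = H[rng.choice(len(H), 3, replace=False)]
labels = [min(range(3), key=lambda k: hamming(hv, centroids[k])) for hv in H]
```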

50 citations


Journal ArticleDOI
TL;DR: This work proposes a memory-efficient learning procedure that exploits the reversibility of the network's layers to enable physics-based learning for large-scale computational imaging systems.
Abstract: Critical aspects of computational imaging systems, such as experimental design and image priors, can be optimized through deep networks formed by the unrolled iterations of classical physics-based reconstructions. Termed physics-based networks, they incorporate both the known physics of the system via its forward model, and the power of deep learning via data-driven training. However, for realistic large-scale physics-based networks, computing gradients via backpropagation is infeasible due to the memory limitations of graphics processing units. In this work, we propose a memory-efficient learning procedure that exploits the reversibility of the network's layers to enable physics-based learning for large-scale computational imaging systems. We demonstrate our method on a compressed sensing example, as well as two large-scale real-world systems: 3D multi-channel magnetic resonance imaging and super-resolution optical microscopy.
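Why reversibility removes the activation-memory bottleneck is easiest to see with the standard additive-coupling construction used in reversible networks; the paper's physics-based layers obtain invertibility from their own structure, so this is an illustration of the principle, not their method:

```python
import numpy as np

# Additive coupling: y1 = x1 + F(x2), y2 = x2 + G(y1). Inputs are recovered
# exactly from outputs, so no activations need to be stored for backprop.
F = lambda v: np.tanh(v)          # stand-in sub-networks; any F, G work,
G = lambda v: 0.5 * v**2          # they never need to be invertible themselves

def forward(x1, x2):
    y1 = x1 + F(x2)
    y2 = x2 + G(y1)
    return y1, y2

def inverse(y1, y2):
    x2 = y2 - G(y1)               # recover x2 first...
    x1 = y1 - F(x2)               # ...then x1, exactly
    return x1, x2

x1, x2 = np.random.randn(4), np.random.randn(4)
assert np.allclose(inverse(*forward(x1, x2)), (x1, x2))
```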

49 citations


Proceedings ArticleDOI
11 Oct 2020
TL;DR: DistDGL as mentioned in this paper is a system for training GNNs in a mini-batch fashion on a cluster of machines based on the Deep Graph Library (DGL), a popular GNN development framework.
Abstract: Graph neural networks (GNNs) have shown great success in learning from graph-structured data. They are widely used in various applications, such as recommendation, fraud detection, and search. In these domains, the graphs are typically large, containing hundreds of millions of nodes and several billion edges. To tackle this challenge, we develop DistDGL, a system for training GNNs in a mini-batch fashion on a cluster of machines. DistDGL is based on the Deep Graph Library (DGL), a popular GNN development framework. DistDGL distributes the graph and its associated data (initial features and embeddings) across the machines and uses this distribution to derive a computational decomposition by following an owner-compute rule. DistDGL follows a synchronous training approach and allows ego-networks forming the mini-batches to include non-local nodes. To minimize the overheads associated with distributed computations, DistDGL uses a high-quality, lightweight min-cut graph partitioning algorithm along with multiple balancing constraints. This allows it to reduce communication overheads and statically balance the computations. It further reduces communication by replicating halo nodes and by using sparse embedding updates. The combination of these design choices allows DistDGL to train high-quality models while achieving high parallel efficiency and memory scalability. We demonstrate our optimizations on both inductive and transductive GNN models. Our results show that DistDGL achieves linear speedup without compromising model accuracy and requires only 13 seconds to complete a training epoch for a graph with 100 million nodes and 3 billion edges on a cluster with 16 machines.

48 citations


Proceedings ArticleDOI
18 May 2020
TL;DR: The approach, called xMP, provides (in-guest) selective memory protection primitives that allow VMs to isolate sensitive data in user or kernel space in disjoint xMP domains, and takes advantage of virtualization extensions, but after initialization, it does not require any hypervisor intervention.
Abstract: Attackers leverage memory corruption vulnerabilities to establish primitives for reading from or writing to the address space of a vulnerable process. These primitives form the foundation for code-reuse and data-oriented attacks. While various defenses against the former class of attacks have proven effective, mitigation of the latter remains an open problem. In this paper, we identify various shortcomings of the x86 architecture regarding memory isolation, and leverage virtualization to build an effective defense against data-oriented attacks. Our approach, called xMP, provides (in-guest) selective memory protection primitives that allow VMs to isolate sensitive data in user or kernel space in disjoint xMP domains. We interface the Xen altp2m subsystem with the Linux memory management system, lending VMs the flexibility to define custom policies. Contrary to conventional approaches, xMP takes advantage of virtualization extensions, but after initialization, it does not require any hypervisor intervention. To ensure the integrity of in-kernel management information and pointers to sensitive data within isolated domains, xMP protects pointers with HMACs bound to an immutable context, so that integrity validation succeeds only in the right context. We have applied xMP to protect the page tables and process credentials of the Linux kernel, as well as sensitive data in various user-space applications. Overall, our evaluation shows that xMP introduces minimal overhead for real-world workloads and applications, and offers effective protection against data-oriented attacks.

47 citations


Proceedings ArticleDOI
30 Jul 2020
TL;DR: NetLock is a new centralized lock manager that co-designs servers and network switches to achieve high performance without sacrificing flexibility in policy support; its key idea is to exploit the capability of emerging programmable switches to directly process lock requests in the switch data plane.
Abstract: Lock managers are widely used by distributed systems. Traditional centralized lock managers can easily support policies between multiple users using global knowledge, but they suffer from low performance. In contrast, emerging decentralized approaches are faster but cannot provide flexible policy support. Furthermore, performance in both cases is limited by the server capability. We present NetLock, a new centralized lock manager that co-designs servers and network switches to achieve high performance without sacrificing flexibility in policy support. The key idea of NetLock is to exploit the capability of emerging programmable switches to directly process lock requests in the switch data plane. Due to the limited switch memory, we design a memory management mechanism to seamlessly integrate the switch and server memory. To realize the locking functionality in the switch, we design a custom data plane module that efficiently pools multiple register arrays together to maximize memory utilization. We have implemented a NetLock prototype with a Barefoot Tofino switch and a cluster of commodity servers. Evaluation results show that NetLock improves throughput by 14.0-18.4x and reduces the average and 99th-percentile latency by 4.7-20.3x and 10.4-18.7x over DSLR, a state-of-the-art RDMA-based solution, while providing flexible policy support.

Proceedings ArticleDOI
09 Mar 2020
TL;DR: This work provides the first comprehensive analysis of major inefficiencies that arise in the page fault handling mechanisms employed in modern GPUs, and proposes a GPU runtime software and hardware solution that increases the batch size and reduces the number of batches, thereby amortizing the fault handling time.
Abstract: While unified virtual memory and demand paging in modern GPUs provide convenient abstractions to programmers for working with large-scale applications, they come at a significant performance cost. We provide the first comprehensive analysis of major inefficiencies that arise in page fault handling mechanisms employed in modern GPUs. To amortize the high costs in fault handling, the GPU runtime processes a large number of GPU page faults together. We observe that this batched processing of page faults introduces large-scale serialization that greatly hurts the GPU's execution throughput. We show real machine measurements that corroborate our findings. Our goal is to mitigate these inefficiencies and enable efficient demand paging for GPUs. To this end, we propose a GPU runtime software and hardware solution that (1) increases the batch size (i.e., the number of page faults handled together), thereby amortizing the fault handling time, and reduces the number of batches by supporting CPU-like thread block context switching, and (2) takes page eviction off the critical path with no hardware changes by overlapping evictions with CPU-to-GPU page migrations. Our evaluation demonstrates that the proposed solution provides an average speedup of 2x over the state-of-the-art page prefetching. We show that our solution increases the batch size by 2.27x and reduces the total number of batches by 51% on average. We also show that the average batch processing time is reduced by 27%.

Journal ArticleDOI
TL;DR: In this article, the authors proposed FSpiNN, an optimization framework for obtaining memory-efficient and energy-efficient SNNs for training and inference processing, with unsupervised learning capability while maintaining accuracy.
Abstract: Spiking neural networks (SNNs) are gaining interest due to their event-driven processing, which potentially consumes low power/energy in hardware platforms while offering unsupervised learning capability due to the spike-timing-dependent plasticity (STDP) rule. However, state-of-the-art SNNs require a large memory footprint to achieve high accuracy, thereby making them difficult to deploy on embedded systems, for instance, on battery-powered mobile devices and IoT edge nodes. Toward this, we propose FSpiNN, an optimization framework for obtaining memory-efficient and energy-efficient SNNs for training and inference processing, with unsupervised learning capability while maintaining accuracy. It is achieved by: 1) reducing the computational requirements of neuronal and STDP operations; 2) improving the accuracy of STDP-based learning; 3) compressing the SNN through fixed-point quantization; and 4) incorporating the memory and energy requirements in the optimization process. FSpiNN reduces the computational requirements by reducing the number of neuronal operations, the STDP-based synaptic weight updates, and the STDP complexity. To improve the accuracy of learning, FSpiNN employs timestep-based synaptic weight updates and adaptively determines the STDP potentiation factor and the effective inhibition strength. The experimental results show that, compared to the state-of-the-art work, FSpiNN achieves 7.5x memory saving and improves energy efficiency by 3.5x on average for training and by 1.8x on average for inference, across the MNIST and Fashion MNIST datasets, with no accuracy loss for a network with 4900 excitatory neurons, thereby enabling energy-efficient SNNs for edge devices/embedded systems.

Journal ArticleDOI
TL;DR: This work introduces PRECISION, an algorithm that uses Partial Recirculation to find top flows on a programmable switch and achieves higher accuracy than previous heavy hitter detection algorithms that avoid recirculation, and suggests two algorithms for the hierarchical heavy hitters detection problem.
Abstract: Programmable network switches promise flexibility and high throughput, enabling applications such as load balancing and traffic engineering. Network measurement is a fundamental building block for such applications, including tasks such as the identification of heavy hitters (largest flows) or the detection of traffic changes. However, high-throughput packet processing architectures place certain limitations on the programming model, such as restricted branching, limited capability for memory access, and a limited number of processing stages. These limitations restrict the types of measurement algorithms that can run on programmable switches. In this paper, we focus on the Reconfigurable Match Tables (RMT) programmable high-throughput switch architecture, and carefully examine its constraints on designing measurement algorithms. We demonstrate our findings while solving the heavy hitter problem. We introduce PRECISION, an algorithm that uses Partial Recirculation to find top flows on a programmable switch. By recirculating a small fraction of packets, PRECISION simplifies the access to stateful memory to conform with RMT limitations and achieves higher accuracy than previous heavy hitter detection algorithms that avoid recirculation. We also evaluate each of the adaptations made by PRECISION and analyze its effect on the measurement accuracy. Finally, we suggest two algorithms for the hierarchical heavy hitters detection problem in which the goal is identifying the subnets that send excessive traffic and are potentially malicious. To the best of our knowledge, our work is the first to do so on RMT switches.
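The partial-recirculation idea can be sketched in a few lines: a packet that misses in every stage is recirculated with small probability to claim the minimum-valued entry, so heavy flows eventually win slots with bounded extra pipeline traffic. The probability rule below is a common probabilistic-replacement heuristic and a simplification of PRECISION's actual policy; dicts stand in for per-stage hash-indexed register arrays:

```python
import random

STAGES, SLOTS = 2, 4096
tables = [dict() for _ in range(STAGES)]    # per-stage flow -> counter (toy)

def process(flow):
    for t in tables:
        if flow in t:                        # hit in some stage: just increment
            t[flow] += 1
            return
    # Miss everywhere: consider the stage holding the overall minimum entry.
    victim_table = min(tables, key=lambda t: min(t.values(), default=0))
    min_count = min(victim_table.values(), default=0)
    if len(victim_table) < SLOTS:
        victim_table[flow] = 1               # free slot, no recirculation needed
    elif random.random() < 1.0 / (min_count + 1):
        # "Recirculated" packet claims the minimum entry on its second pass.
        victim = min(victim_table, key=victim_table.get)
        del victim_table[victim]
        victim_table[flow] = min_count + 1
```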

Journal ArticleDOI
Juhyoung Lee, Jinsu Lee, Hoi-Jun Yoo
TL;DR: The SRNPU is the first ASIC implementation of the CNN-based SR algorithm which supports real-time Full-HD up-scaling and achieves higher restoration performance and power efficiency than previous SR hardware implementations.
Abstract: In this article, we propose an energy-efficient convolutional neural network (CNN) based super-resolution (SR) processor, the super-resolution neural processing unit (SRNPU), for mobile applications. Traditionally, it is hard to realize real-time CNN-based SR on resource-limited platforms like mobile devices due to the massive computational workload and communication bandwidth with external memory. The SRNPU supports tile-based selective super-resolution (TSSR), which dynamically selects a properly sized CNN in a tile-by-tile manner. TSSR reduces the computational workload of CNN-based SR by 31.1% while maintaining image restoration performance. Moreover, the proposed selective caching based convolutional layer fusion (SC2LF) reduces external memory bandwidth by 78.8% with a 93.2% smaller on-chip memory footprint compared with previous layer fusion methods, by caching only short-reuse-distance intermediate feature maps. Additionally, a reconfigurable cyclic ring architecture in the SRNPU maintains high PE utilization by amortizing the reloading process caused by SC2LF operation under various convolutional layer configurations. The SRNPU is fabricated in 65 nm CMOS technology and occupies a 4×4 mm2 die area. It has a peak power efficiency of 1.9 TOPS/W at 0.75 V, 50 MHz, and achieves 31.8 fps ×2-scale Full-HD generation and 88.3 fps ×4-scale Full-HD generation with higher restoration performance and power efficiency than previous SR hardware implementations. To the best of our knowledge, the SRNPU is the first ASIC implementation of a CNN-based SR algorithm that supports real-time Full-HD upscaling.

Posted Content
TL;DR: A transformer serving system called TurboTransformers, which consists of a computing runtime and a serving framework that can achieve the state-of-the-art transformer model serving performance on GPU platforms and can be seamlessly integrated into PyTorch code with a few lines of code.
Abstract: The transformer is the most critical algorithmic innovation in the Natural Language Processing (NLP) field in recent years. Unlike Recurrent Neural Network (RNN) models, transformers can process the sequence-length dimension in parallel, leading to better accuracy on long sequences. However, efficient deployments of them for online services in data centers equipped with GPUs are not easy. First, the additional computation introduced by transformer structures makes it more challenging to meet the latency and throughput constraints of serving. Second, NLP tasks take in sentences of variable length. The variability of input dimensions brings a severe problem to efficient memory management and serving optimization. This paper presents a transformer serving system called TurboTransformers, which consists of a computing runtime and a serving framework that solve the above challenges. Three innovative features make it stand out from other similar works. An efficient parallel algorithm is proposed for GPU-based batch reduction operations, like Softmax and LayerNorm, which are major hot spots besides BLAS routines. A memory allocation algorithm, which better balances memory footprint and allocation/free efficiency, is designed for variable-length input situations. A serving framework equipped with a new batch scheduler using dynamic programming achieves the optimal throughput on variable-length requests. The system achieves state-of-the-art transformer model serving performance on GPU platforms and can be seamlessly integrated into PyTorch code with a few lines of code.

Proceedings ArticleDOI
11 Jun 2020
TL;DR: This paper studies the problem of autotuning the memory allocation for applications running on modern distributed data processing systems, and shows that an empirically-driven "white-box" algorithm, called RelM, provides close-to-optimal tuning at a fraction of the overheads compared to state-of-the-art AI-driven "black-box" algorithms, namely, Bayesian Optimization (BO) and Deep Distributed Policy Gradient (DDPG).
Abstract: There is a lot of interest today in building autonomous (or, self-driving) data processing systems. An emerging school of thought is to leverage AI-driven "black-box" algorithms for this purpose. In this paper, we present a contrarian view. We study the problem of autotuning the memory allocation for applications running on modern distributed data processing systems. We show that RelM, an empirically-driven "white-box" algorithm that we have developed, provides close-to-optimal tuning at a fraction of the overheads compared to state-of-the-art AI-driven "black-box" algorithms, namely, Bayesian Optimization (BO) and Deep Distributed Policy Gradient (DDPG). The main reason for RelM's superior performance is that memory management in modern memory-based data analytics systems is an interplay of algorithms at multiple levels: (i) at the resource-management level, across the various containers allocated by resource managers like Kubernetes and YARN; (ii) at the container level, among the OS, pods, and processes such as the Java Virtual Machine (JVM); (iii) at the application level, for caching, aggregation, data shuffles, and application data structures; and (iv) at the JVM level, across various pools such as the Young and Old Generation. RelM understands these interactions and uses them in building an analytical solution to autotune the memory management knobs. In another contribution, called Guided-BO (GBO), we use RelM's analytical models to speed up BO. Through an evaluation based on Apache Spark, we showcase that RelM's recommendations are significantly better than what commonly used Spark deployments provide and are close to those obtained by brute-force exploration, while GBO provides optimality guarantees at a cost overhead that is higher than RelM's but still significantly lower than that of state-of-the-art AI-driven policies.

Proceedings ArticleDOI
18 May 2020
TL;DR: A programmer-agnostic runtime that leverages hardware access counters to automatically categorize memory allocations based on access pattern and frequency is proposed, and it is shown that although the scheme is designed to address memory oversubscription, it has no impact on performance when working sets fit in device-local memory.
Abstract: Unified Memory in heterogeneous systems serves a wide range of applications. However, the limited capacity of device memory becomes a first-order performance bottleneck for data-intensive general-purpose applications with growing working sets. The performance overhead under memory oversubscription depends on the memory access pattern of the corresponding workload. While a regular application with sequential, dense memory accesses suffers from long-latency write-backs, the performance of an irregular application with sparse, seldom accesses to large data sets degrades due to page thrashing. Although smart spatio-temporal prefetching and large-page eviction yield good performance in general, remote zero-copy access to host-pinned memory proves to be beneficial for irregular, data-intensive applications. Further, new-generation GPUs have introduced hardware access counters to delay page migration and reduce memory thrashing. However, the responsibility of deciding which strategy is the best fit for a given application relies heavily on the programmer, based on a thorough understanding of the memory access pattern obtained through intrusive profiling. In this work, we propose a programmer-agnostic runtime that leverages the hardware access counters to automatically categorize memory allocations based on access pattern and frequency. The proposed heuristic adaptively navigates between remote zero-copy access to host-pinned memory and first-touch page migration based on the trade-off between low-latency remote access and high-bandwidth local access. We show that although it is designed to address memory oversubscription, our scheme has no impact on performance when working sets fit in the device-local memory. Experimental results show that our scheme provides performance improvements of 22% to 78% for irregular applications under 125% memory oversubscription compared to the state of the art. At the same time, regular applications are not impacted by the framework.
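The policy choice this runtime automates boils down to a counter-driven promotion rule; the threshold, class, and names below are our invention, purely to illustrate the trade-off between cheap remote access and amortized migration:

```python
# Toy policy switch between zero-copy remote access and page migration,
# driven by hardware-style access counters per allocation. Threshold invented.
MIGRATE_THRESHOLD = 32       # re-accesses that justify paying migration cost

class Allocation:
    def __init__(self, name):
        self.name = name
        self.access_count = 0
        self.policy = "zero-copy"          # start with cheap remote access

    def on_access(self):
        self.access_count += 1
        # Dense re-use amortizes a page migration; sparse one-shot access
        # is cheaper served remotely over the interconnect.
        if self.policy == "zero-copy" and self.access_count >= MIGRATE_THRESHOLD:
            self.policy = "migrate"        # promote to device-local pages

a = Allocation("edge_list")
for _ in range(40):
    a.on_access()
print(a.policy)   # migrate
```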

Proceedings ArticleDOI
09 Mar 2020
TL;DR: Although the results focus on memory allocation, the paper addresses ML-specific questions such as tolerating mispredictions and amortizing expensive predictions across application execution, and the questions it identifies apply to other system-level problems with strict latency and resource requirements where machine learning could be applied.
Abstract: Modern C++ servers have memory footprints that vary widely over time, causing persistent heap fragmentation of up to 2x from long-lived objects allocated during peak memory usage. This fragmentation is exacerbated by the use of huge (2MB) pages, a requirement for high performance on large heap sizes. Reducing fragmentation automatically is challenging because C++ memory managers cannot move objects. This paper presents a new approach to huge-page fragmentation. It combines modern machine learning techniques with a novel memory manager (LLAMA) that manages the heap based on object lifetimes and huge pages (divided into blocks and lines). A neural-network-based language model predicts lifetime classes using symbolized calling contexts. The model learns context-sensitive per-allocation-site lifetimes from previous runs, generalizes over different binary versions, and extrapolates from samples to unobserved calling contexts. Instead of size classes, LLAMA's heap is organized by lifetime classes that are dynamically adjusted based on observed behavior at a block granularity. LLAMA reduces memory fragmentation by up to 78% while using only huge pages on several production servers. We address ML-specific questions such as tolerating mispredictions and amortizing expensive predictions across application execution. Although our results focus on memory allocation, the questions we identify apply to other system-level problems with strict latency and resource requirements where machine learning could be applied.
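The lifetime-class organization can be sketched as follows; a lookup table stands in for the paper's learned lifetime predictor, and the class boundaries, names, and contexts are placeholders:

```python
import collections

# Lifetime classes in ms; LLAMA learns these from symbolized calling contexts
# with a neural model -- we fake the prediction with a lookup table.
LIFETIME_CLASSES = [1, 100, 10_000, float("inf")]
predicted_ms = {"handler:parse": 1, "cache:insert": 10_000}  # "trained" offline

heap = collections.defaultdict(list)      # lifetime class -> blocks of objects

def allocate(size, calling_context):
    ms = predicted_ms.get(calling_context, float("inf"))     # unseen -> longest
    cls = next(c for c in LIFETIME_CLASSES if ms <= c)
    # Objects with similar lifetimes share blocks, so whole blocks free
    # together instead of leaving long-lived stragglers in huge pages.
    heap[cls].append((calling_context, size))
    return cls

allocate(64, "handler:parse")     # short-lived: packed with other short-lived
allocate(4096, "cache:insert")    # long-lived: never fragments the same block
```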

Proceedings ArticleDOI
01 Feb 2020
TL;DR: Impala, a multi-stride in-memory automata processing architecture, is presented, which introduces three-fold area, throughput, and energy benefits at the expense of increased offline compilation time.
Abstract: High-throughput and concurrent processing of thousands of patterns on each byte of an input stream is critical for many applications with real-time processing needs, such as network intrusion detection, spam filters, virus scanners, and many more. The demand for accelerated pattern matching has motivated several recent in-memory accelerator architectures for automata processing, which is an efficient computation model for pattern matching. Our key observations are: (1) all these architectures are based on 8-bit symbol processing (derived from ASCII), and our analysis on a large set of real-world automata benchmarks reveals that the 8-bit processing dramatically underutilizes hardware resources, and (2) multi-stride symbol processing, a major source of throughput growth, is not explored in the existing in-memory solutions. This paper presents Impala, a multi-stride in-memory automata processing architecture by leveraging our observations. The key insight of our work is that transforming 8-bit processing to 4-bit processing exponentially reduces hardware resources for state-matching and improves resource utilization. This, in turn, brings the opportunity to have a denser design, and be able to utilize more memory columns to process multiple symbols per cycle with a linear increase in state-matching resources. Impala thus introduces three-fold area, throughput, and energy benefits at the expense of increased offline compilation time. Our empirical evaluations on a wide range of automata benchmarks reveal that Impala has on average 2.7X (up to 3.7X) higher throughput per unit area and 1.22X lower power consumption than Cache Automaton, which is the best performing prior work.
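The 4-bit transformation that drives Impala's density gains is simple to state: split every 8-bit input symbol into two nibbles, shrinking the alphabet from 256 to 16 while doubling the symbol stream's length. A minimal sketch:

```python
def to_nibbles(data: bytes):
    # Each 8-bit symbol becomes two 4-bit symbols (high nibble, low nibble).
    out = []
    for byte in data:
        out.append(byte >> 4)
        out.append(byte & 0xF)
    return out

# Matching "ab" over 4-bit symbols: the automaton now needs alphabet-16
# state-matching columns instead of alphabet-256, at twice the symbol rate,
# which is what frees memory columns for multi-symbol strides.
print(to_nibbles(b"ab"))   # [6, 1, 6, 2]  (0x61 = 'a', 0x62 = 'b')
```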

Proceedings ArticleDOI
30 May 2020
TL;DR: Results show that CA paging is highly effective at creating vast contiguous mappings, even when memory is fragmented, and SpOT exploits the created contiguity and reduces address translation overhead of nested paging from ~16.5% to ~0.9%.
Abstract: We propose synergistic software and hardware mechanisms that alleviate the address translation overhead, focusing particularly on virtualized execution. On the software side, we propose contiguity-aware (CA) paging, a novel physical memory allocation technique that creates larger-than-a-page contiguous mappings while preserving the flexibility of demand paging. CA paging applies to the hypervisor and guest OS memory manager independently, as well as to native systems. Moreover, CA paging benefits any address translation scheme that leverages contiguous mappings. On the hardware side, we propose SpOT, a simple micro-architectural mechanism to hide TLB miss latency by exploiting the regularity of large contiguous mappings to predict address translations in both native and virtualized systems. We implement and emulate the proposed techniques for the x86-64 architecture in Linux and KVM, and evaluate them across a variety of memory-intensive workloads. Our results show that: (i) CA paging is highly effective at creating vast contiguous mappings, even when memory is fragmented, and (ii) SpOT exploits the created contiguity and reduces address translation overhead of nested paging from ~16.5% to ~0.9%.

Proceedings ArticleDOI
Soroush Bateni, Zhendong Wang, Yuankun Zhu, Yang Hu, Cong Liu
21 Apr 2020
TL;DR: This work develops a performance model that can predict system overhead for each memory management method based on application characteristics and proposes a runtime scheduler that can significantly relieve the system memory pressure and reduce the multitasking co-run response time.
Abstract: Cutting-edge embedded system applications, such as self-driving cars and unmanned drone software, are reliant on integrated CPU/GPU platforms for their DNNs-driven workload, such as perception and other highly parallel components. In this work, we set out to explore the hidden performance implication of GPU memory management methods of integrated CPU/GPU architecture. Through a series of experiments on micro-benchmarks and real-world workloads, we find that the performance under different memory management methods may vary according to application characteristics. Based on this observation, we develop a performance model that can predict system overhead for each memory management method based on application characteristics. Guided by the performance model, we further propose a runtime scheduler. By conducting per-task memory management policy switching and kernel overlapping, the scheduler can significantly relieve the system memory pressure and reduce the multitasking co-run response time. We have implemented and extensively evaluated our system prototype on the NVIDIA Jetson TX2, Drive PX2, and Xavier AGX platforms, using both Rodinia benchmark suite and two real-world case studies of drone software and autonomous driving software.

Proceedings ArticleDOI
01 May 2020
TL;DR: Cornucopia is a lightweight capability revocation system for CHERI that implements non-probabilistic C/C++ temporal memory safety for standard heap allocations and extends the CheriBSD virtual-memory subsystem to track capability flow through memory and provides a concurrent kernel-resident revocation service that is amenable to multi-processor and hardware acceleration.
Abstract: Use-after-free violations of temporal memory safety continue to plague software systems, underpinning many high-impact exploits. The CHERI capability system shows great promise in achieving C and C++ language spatial memory safety, preventing out-of-bounds accesses. Enforcing language-level temporal safety on CHERI requires capability revocation, traditionally achieved either via table lookups (avoided for performance in the CHERI design) or by identifying capabilities in memory to revoke them (similar to a garbage-collector sweep). CHERIvoke, a prior feasibility study, suggested that CHERI’s tagged capabilities could make this latter strategy viable, but it modeled only architectural limits and did not consider the full implementation or evaluation of the approach. Cornucopia is a lightweight capability revocation system for CHERI that implements non-probabilistic C/C++ temporal memory safety for standard heap allocations. It extends the CheriBSD virtual-memory subsystem to track capability flow through memory and provides a concurrent kernel-resident revocation service that is amenable to multi-processor and hardware acceleration. We demonstrate an average overhead of less than 2% and a worst case of 8.9% for concurrent revocation on compatible SPEC CPU2006 benchmarks on a multi-core CHERI CPU on FPGA, and we validate Cornucopia against the Juliet test suite’s corpus of temporally unsafe programs. We test its compatibility with a large corpus of C programs by using a revoking allocator as the system allocator while booting multi-user CheriBSD. Cornucopia is a viable strategy for always-on temporal heap memory safety, suitable for production environments.

Proceedings ArticleDOI
30 May 2020
TL;DR: This work enables the allocation of large 2MB pages even in the presence of fragmented physical memory via perforated pages, and evaluates the effectiveness of perforated pages with timing simulations under diverse and realistic fragmentation scenarios.
Abstract: The availability of large pages has dramatically improved the efficiency of address translation for applications that use large contiguous regions of memory. However, large pages can be difficult to allocate due to fragmented memory, non-movable pages, or the need to split a large page into regular pages when part of the large page is forced to have a different permission status from the rest of the page. Furthermore, they can also be expensive due to memory bloating caused by sparse accesses to application data. In this work, we enable the allocation of large 2MB pages even in the presence of fragmented physical memory via perforated pages. Perforated pages permit the OS to punch 4KB page-sized holes in the physical address range allocated to a large page and re-map them to other addresses as needed. This not only enables the system to benefit from large pages in the presence of fragmentation, but also allows for different permissions to exist within a large page, enhancing sharing flexibility. In addition, it allows unused parts of a large page to be used elsewhere, mitigating memory bloating. To minimize changes to the system, perforated pages reuse the 4KB-level page table entries to store the hole locations and translates holes into regular 4KB pages. For performance, the proposed technique caches the translations for hole pages in the TLBs and track holes via cached bitmaps in the L2 TLB. By enabling large pages in the presence of physical memory fragmentation, perforated pages increase the applicability and resulting benefits of large pages with only minor changes to the hardware and OS. In this work, we evaluate the effectiveness of perforated pages with timing simulations under diverse and realistic fragmentation scenarios. Our results show that even with fragmented memory, perforated pages accomplish 93.2% to 99.9% of the performance achievable by ideal memory allocation, and 2.0% to 11.5% better performance over the conventional system running with fragmented memory.
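A toy model of the translation path described here: a bitmap identifies which 4KB slots of a 2MB mapping are holes, and only those fall back to 4KB-granularity entries. Names and structure below are our illustration, not the proposed hardware design:

```python
PAGE_4K, PAGE_2M = 4096, 2 << 20

class PerforatedMapping:
    def __init__(self, virt_base, phys_base):
        self.virt_base, self.phys_base = virt_base, phys_base
        self.hole_bitmap = 0               # bit i set => 4KB subpage i is a hole
        self.hole_remap = {}               # subpage index -> remapped 4KB frame

    def punch_hole(self, index, new_frame):
        self.hole_bitmap |= 1 << index
        self.hole_remap[index] = new_frame

    def translate(self, vaddr):
        off = vaddr - self.virt_base
        idx = off // PAGE_4K
        if self.hole_bitmap >> idx & 1:    # hole: use the 4KB-level entry
            return self.hole_remap[idx] + off % PAGE_4K
        return self.phys_base + off        # fast path: contiguous 2MB mapping

m = PerforatedMapping(0x40000000, 0x80000000)
m.punch_hole(3, 0xdead000)
print(hex(m.translate(0x40000000 + 3 * PAGE_4K + 0x10)))   # 0xdead010 (hole)
print(hex(m.translate(0x40000000 + 0x10)))                 # 0x80000010 (fast)
```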

Posted Content
TL;DR: It is argued that keeping all entities in memory is unnecessary, and a memory-augmented neural network that tracks only a small bounded number of entities at a time is proposed, thus guaranteeing a linear runtime in the length of the document.
Abstract: Long document coreference resolution remains a challenging task due to the large memory and runtime requirements of current models. Recent work on incremental coreference resolution using just the global representation of entities shows practical benefits but requires keeping all entities in memory, which can be impractical for long documents. We argue that keeping all entities in memory is unnecessary, and we propose a memory-augmented neural network that tracks only a small bounded number of entities at a time, thus guaranteeing a linear runtime in the length of the document. We show that (a) the model remains competitive with models with high memory and computational requirements on OntoNotes and LitBank, and (b) the model learns an efficient memory management strategy that easily outperforms a rule-based strategy.
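The bounded entity memory amounts to a fixed number of slots plus an eviction rule; this toy uses least-recently-mentioned eviction (one plausible strategy, whereas the paper learns its own) to show why per-token work, and hence total runtime, stays bounded:

```python
class BoundedEntityMemory:
    def __init__(self, num_slots=20):
        self.num_slots = num_slots
        self.slots = {}                    # entity id -> last mention position

    def observe(self, entity, position):
        if entity in self.slots or len(self.slots) < self.num_slots:
            self.slots[entity] = position  # update or fill a free slot
            return
        # Memory full: evict the entity mentioned longest ago. Constant work
        # per mention (bounded slots) => linear runtime in document length.
        stalest = min(self.slots, key=self.slots.get)
        del self.slots[stalest]
        self.slots[entity] = position

mem = BoundedEntityMemory(num_slots=2)
for pos, ent in enumerate(["Anna", "Bob", "Anna", "Carol"]):
    mem.observe(ent, pos)
print(mem.slots)   # {'Anna': 2, 'Carol': 3} -- Bob was evicted
```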

Journal ArticleDOI
TL;DR: Various in-memory computing primitives in both CMOS and emerging nonvolatile memory (NVM) technologies are discussed and how such primitives can be incorporated in standalone machine learning accelerator architectures are described.
Abstract: Machine learning applications, especially deep neural networks (DNNs), have seen ubiquitous use in computer vision, speech recognition, and robotics. However, the growing complexity of DNN models has necessitated efficient hardware implementations. The key compute primitives of DNNs are matrix-vector multiplications, which lead to significant data movement between memory and processing units in today's von Neumann systems. A promising alternative would be colocating memory and processing elements, which can be further extended to performing computations inside the memory itself. We believe in-memory computing is a propitious candidate for future DNN accelerators, since it mitigates the memory wall bottleneck. In this article, we discuss various in-memory computing primitives in both CMOS and emerging nonvolatile memory (NVM) technologies. Subsequently, we describe how such primitives can be incorporated into standalone machine learning accelerator architectures. Finally, we analyze the challenges associated with designing such in-memory computing accelerators and explore future opportunities.

Journal ArticleDOI
TL;DR: MV-Sketch as discussed by the authors tracks candidate heavy flows inside the sketch data structure via the idea of majority voting, such that it incurs small memory access overhead in both update and query operations, while achieving high detection accuracy.
Abstract: Fast detection of heavy flows (e.g., heavy hitters and heavy changers) in massive network traffic is challenging due to the stringent requirements of fast packet processing and limited resource availability. Invertible sketches are summary data structures that can recover heavy flows with small memory footprints and bounded errors, yet existing invertible sketches incur high memory access overhead that leads to performance degradation. We present MV-Sketch, a fast and compact invertible sketch that supports heavy flow detection with small and static memory allocation. MV-Sketch tracks candidate heavy flows inside the sketch data structure via the idea of majority voting, such that it incurs small memory access overhead in both update and query operations, while achieving high detection accuracy. We present theoretical analysis on the memory usage, performance, and accuracy of MV-Sketch in both local and network-wide scenarios. We further show how MV-Sketch can be implemented and deployed on P4-based programmable switches subject to hardware deployment constraints. We conduct evaluation in both software and hardware environments. Trace-driven evaluation in software shows that MV-Sketch achieves higher accuracy than existing invertible sketches, with up to 3.38x throughput gain. We also show how to boost the performance of MV-Sketch with SIMD instructions. Furthermore, we evaluate MV-Sketch on a Barefoot Tofino switch and show how MV-Sketch achieves line-rate measurement with limited hardware resource overhead.
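The majority-voting idea behind each bucket is a Misra-Gries-style pairing of a total counter with a candidate counter, which is why each update touches so little memory. A single-bucket toy in Python (field names follow our reading of the abstract, not necessarily the paper's notation):

```python
class MVBucket:
    """One bucket: total count V, candidate key K, majority counter C."""
    def __init__(self):
        self.V, self.K, self.C = 0, None, 0

    def update(self, key, weight=1):
        self.V += weight
        if self.K == key:
            self.C += weight
        else:
            self.C -= weight
            if self.C < 0:                 # majority flipped: adopt new candidate
                self.K, self.C = key, -self.C

    def estimate(self, key):
        # Majority-vote bounds: the candidate's true count lies between C and V,
        # so (V + C) / 2 is the midpoint estimate; non-candidates get (V - C) / 2.
        return (self.V + self.C) // 2 if key == self.K else (self.V - self.C) // 2

b = MVBucket()
for k in ["f1", "f1", "f2", "f1", "f3", "f1"]:
    b.update(k)
print(b.K, b.estimate("f1"))   # f1 4  (true count of f1 is 4)
```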

Proceedings ArticleDOI
01 Oct 2020
TL;DR: This paper proposes AOS, a low-overhead always-on heap memory safety solution that implements a novel bounds-checking mechanism and introduces a micro-architectural unit to remove the need for memory checking instructions.
Abstract: Memory safety violations, caused by illegal use of pointers in unsafe programming languages such as C and C++, have been a major threat to modern computer systems. However, implementing a low-overhead yet robust runtime memory safety solution is still challenging. Various hardware-based mechanisms have been proposed, but their significant hardware requirements have limited their feasibility, and their performance overhead is too high to be an always-on solution. In this paper, we propose AOS, a low-overhead always-on heap memory safety solution that implements a novel bounds-checking mechanism. We identify that the major challenges of existing bounds-checking approaches are 1) the extra instruction overhead for memory checking and metadata propagation and 2) the complex metadata addressing. To address these challenges, using Arm PA primitives, we leverage the unused upper bits of a pointer to store a key and have it propagated along with the pointer address, eliminating propagation overhead. Then, we use the embedded key to index a hashed bounds table to achieve efficient metadata management. We also introduce a micro-architectural unit to remove the need for memory checking instructions. We show that AOS overcomes all the aforementioned challenges and demonstrate its feasibility as an efficient runtime memory safety solution. Our evaluation on the SPEC 2006 workloads shows an 8.4% performance overhead on average.
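The key-in-upper-bits scheme can be mocked up in a few lines: the key travels with the pointer for free, and a hashed table indexed by it holds the bounds. The bit layout, table size, and hash below are illustrative assumptions, not the paper's encoding:

```python
KEY_SHIFT = 48                              # stash key in unused upper VA bits
bounds_table = {}                           # hashed key -> (base, size)

def malloc_checked(base, size, key):
    bounds_table[key % 4096] = (base, size) # hashed bounds-table index (toy)
    return (key << KEY_SHIFT) | base        # tagged pointer

def check_access(tagged_ptr, width=1):
    key = tagged_ptr >> KEY_SHIFT           # key propagated with the pointer
    addr = tagged_ptr & ((1 << KEY_SHIFT) - 1)
    base, size = bounds_table[key % 4096]
    if not (base <= addr and addr + width <= base + size):
        raise MemoryError(f"out-of-bounds access at {hex(addr)}")
    return addr

p = malloc_checked(0x10000, 64, key=0xBEEF)
check_access(p + 63)                        # fine: last in-bounds byte
# check_access(p + 64)                      # would raise MemoryError
```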

Journal ArticleDOI
TL;DR: It is found that Intel's persistent memory is highly sensitive to data locality, size, and access patterns, which becomes clearer when both virtual memory page size and data layout are optimized for locality.
Abstract: We evaluated Intel® Optane™ DC Persistent Memory and found that Intel's persistent memory is highly sensitive to data locality, size, and access patterns, which becomes clearer when both virtual memory page size and data layout are optimized for locality. Using the PolyBench high-performance computing benchmark suite and controlling for mapped page size, we evaluate persistent memory (PMEM) performance relative to DRAM. In particular, the Linux PMEM support preferentially maps persistent memory to large pages while always mapping DRAM to small pages. We observed that using large pages for PMEM and small pages for DRAM can create a 5x difference in performance, dwarfing other effects discussed in the literature. We found PMEM performance comparable to DRAM performance for the majority of tests when controlling for page size and optimizing for data locality.