
Showing papers on "Memory management" published in 2019


Journal ArticleDOI
TL;DR: An optimized block-floating-point (BFP) arithmetic is adopted in the accelerator for efficient inference of deep neural networks, improving energy and hardware efficiency threefold.
Abstract: Convolutional neural networks (CNNs) are widely used and have achieved great success in computer vision and speech processing applications. However, deploying the large-scale CNN model in the embedded system is subject to the constraints of computation and memory. An optimized block-floating-point (BFP) arithmetic is adopted in our accelerator for efficient inference of deep neural networks in this paper. The feature maps and model parameters are represented in 16-bit and 8-bit formats, respectively, in the off-chip memory, which can reduce memory and off-chip bandwidth requirements by 50% and 75% compared to the 32-bit FP counterpart. The proposed 8-bit BFP arithmetic with optimized rounding and shifting-operation-based quantization schemes improves the energy and hardware efficiency by three times. One CNN model can be deployed in our accelerator without retraining at the cost of an accuracy loss of not more than 0.12%. The proposed reconfigurable accelerator with three parallelism dimensions, ping-pong off-chip DDR3 memory access, and an optimized on-chip buffer group is implemented on the Xilinx VC709 evaluation board. Our accelerator achieves a performance of 760.83 GOP/s and 82.88 GOP/s/W under a 200-MHz working frequency, significantly outperforming previous accelerators.
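
The feature-map and weight formats above follow the block-floating-point idea: a block of values shares a single exponent while each value keeps a short mantissa. A minimal NumPy sketch of that idea is shown below; the block size, bit widths, and the paper's optimized rounding and shifting-based quantization are not reproduced here.

```python
import numpy as np

def bfp_quantize(block, mantissa_bits=8):
    """Quantize a 1-D block to a shared-exponent block-floating-point format."""
    max_abs = np.max(np.abs(block))
    if max_abs == 0:
        return np.zeros(block.shape, dtype=np.int32), 0
    # Shared exponent chosen so the largest value fits in the mantissa range.
    shared_exp = int(np.ceil(np.log2(max_abs))) - (mantissa_bits - 1)
    scale = 2.0 ** shared_exp
    lo, hi = -(2 ** (mantissa_bits - 1)), 2 ** (mantissa_bits - 1) - 1
    mantissas = np.clip(np.round(block / scale), lo, hi).astype(np.int32)
    return mantissas, shared_exp

def bfp_dequantize(mantissas, shared_exp):
    return mantissas.astype(np.float32) * (2.0 ** shared_exp)

x = np.random.randn(16).astype(np.float32)
m, e = bfp_quantize(x)
print("max abs error:", np.max(np.abs(x - bfp_dequantize(m, e))))
```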

116 citations


Journal ArticleDOI
TL;DR: In this paper, the authors present a broad range of computational techniques to improve applicability, run time, and memory management of automatic differentiation packages, including operation overloading, region based memory, and expression templates.
Abstract: Derivatives play a critical role in computational statistics, examples being Bayesian inference using Hamiltonian Monte Carlo sampling and the training of neural networks. Automatic differentiation is a powerful tool to automate the calculation of derivatives and is preferable to more traditional methods, especially when differentiating complex algorithms and mathematical functions. The implementation of automatic differentiation, however, requires some care to ensure efficiency. Modern differentiation packages deploy a broad range of computational techniques to improve applicability, run time, and memory management. Among these techniques are operation overloading, region-based memory, and expression templates. There also exist several mathematical techniques which can yield high performance gains when applied to complex algorithms. For example, semi-analytical derivatives can reduce by orders of magnitude the runtime required to numerically solve and differentiate an algebraic equation. Open problems include the extension of current packages to provide more specialized routines, and efficient methods to perform higher-order differentiation.
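
Operator overloading is one of the implementation techniques named above. Below is a minimal, package-agnostic sketch of forward-mode automatic differentiation using dual numbers; production AD packages add reverse mode, tapes, expression templates, and the memory-management machinery discussed in the paper.

```python
class Dual:
    """A value paired with its derivative; arithmetic propagates both."""
    def __init__(self, value, deriv=0.0):
        self.value, self.deriv = value, deriv
    def _wrap(self, other):
        return other if isinstance(other, Dual) else Dual(other)
    def __add__(self, other):
        other = self._wrap(other)
        return Dual(self.value + other.value, self.deriv + other.deriv)
    __radd__ = __add__
    def __mul__(self, other):
        other = self._wrap(other)
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)
    __rmul__ = __mul__

def derivative(f, x):
    """Evaluate df/dx at x by seeding the derivative slot with 1."""
    return f(Dual(x, 1.0)).deriv

# d/dx of x * (x + 3) at x = 2 is 2x + 3 = 7
print(derivative(lambda x: x * (x + 3), 2.0))
```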

68 citations


Proceedings ArticleDOI
01 Apr 2019
TL;DR: MV-Sketch tracks candidate heavy flows inside the sketch data structure via majority voting, so that it incurs small memory access overhead in both update and query operations while achieving high detection accuracy.
Abstract: Fast detection of heavy flows (e.g., heavy hitters and heavy changers) in massive network traffic is challenging due to the stringent requirements of fast packet processing and limited resource availability. Invertible sketches are summary data structures that can recover heavy flows with small memory footprints and bounded errors, yet existing invertible sketches incur high memory access overhead that leads to performance degradation. We present MV-Sketch, a fast and compact invertible sketch that supports heavy flow detection with small and static memory allocation. MV-Sketch tracks candidate heavy flows inside the sketch data structure via the idea of majority voting, such that it incurs small memory access overhead in both update and query operations, while achieving high detection accuracy. We present theoretical analysis on the memory usage, performance, and accuracy of MV-Sketch. Trace-driven evaluation shows that MV-Sketch achieves higher accuracy than existing invertible sketches, with up to $3.38 \times$ throughput gain. We also show how to boost the performance of MV-Sketch with SIMD instructions.
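
The bucket update at the heart of the majority-voting idea can be sketched compactly. The toy sketch below keeps, per hashed bucket, a total count and a Boyer-Moore-style candidate; the row and bucket counts, hash functions, and estimator are illustrative simplifications rather than MV-Sketch's exact design.

```python
import random

class MVSketchLike:
    """Toy invertible sketch with Boyer-Moore majority voting in each bucket."""
    def __init__(self, rows=4, buckets=1024, seed=1):
        rng = random.Random(seed)
        self.seeds = [rng.randrange(1 << 30) for _ in range(rows)]
        # Each bucket keeps [V total count, K candidate key, C candidate count].
        self.table = [[[0, None, 0] for _ in range(buckets)] for _ in range(rows)]
        self.buckets = buckets

    def update(self, key, weight=1):
        for row, seed in enumerate(self.seeds):
            cell = self.table[row][hash((seed, key)) % self.buckets]
            cell[0] += weight                  # V: all traffic hashed to the bucket
            if cell[1] == key:
                cell[2] += weight              # same candidate: strengthen it
            else:
                cell[2] -= weight              # different key: weaken the candidate
                if cell[2] < 0:                # majority flips: adopt the new key
                    cell[1], cell[2] = key, -cell[2]

    def estimate(self, key):
        ests = []
        for row, seed in enumerate(self.seeds):
            V, K, C = self.table[row][hash((seed, key)) % self.buckets]
            ests.append((V + C) // 2 if K == key else (V - C) // 2)
        return min(ests)

s = MVSketchLike()
for flow, count in [("f1", 500), ("f2", 30), ("f3", 20)]:
    s.update(flow, count)
print(s.estimate("f1"))   # close to 500; small flows perturb it only slightly
```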

62 citations


Journal ArticleDOI
TL;DR: A distributed variance-reduced strategy is developed for a collection of interacting agents connected by a graph topology; it is shown to converge linearly to the exact solution and to be more memory efficient than alternative algorithms.
Abstract: This paper develops a distributed variance-reduced strategy for a collection of interacting agents that are connected by a graph topology. The resulting diffusion-AVRG (where AVRG stands for “amortized variance-reduced gradient”) algorithm is shown to have linear convergence to the exact solution, and is more memory efficient than other alternative algorithms. When a batch implementation is employed, it is observed in simulations that diffusion-AVRG is more computationally efficient than exact diffusion or EXTRA, while maintaining almost the same communication efficiency.

59 citations


Proceedings ArticleDOI
25 Mar 2019
TL;DR: A slice-aware memory management scheme in which frequently used data can be accessed faster via the LLC; a key-value store can potentially improve its average performance by up to 12.2% and 11.4% for 100% and 95% GET workloads, respectively.
Abstract: In modern (Intel) processors, Last Level Cache (LLC) is divided into multiple slices and an undocumented hashing algorithm (aka Complex Addressing) maps different parts of memory address space among these slices to increase the effective memory bandwidth. After a careful study of Intel's Complex Addressing, we introduce a slice-aware memory management scheme, wherein frequently used data can be accessed faster via the LLC. Using our proposed scheme, we show that a key-value store can potentially improve its average performance ~12.2% and ~11.4% for 100% & 95% GET workloads, respectively. Furthermore, we propose CacheDirector, a network I/O solution which extends Direct Data I/O (DDIO) and places the packet's header in the slice of the LLC that is closest to the relevant processing core. We implemented CacheDirector as an extension to DPDK and evaluated our proposed solution for latency-critical applications in Network Function Virtualization (NFV) systems. Evaluation results show that CacheDirector makes packet processing faster by reducing tail latencies (90-99th percentiles) by up to 119 μs (~21.5%) for optimized NFV service chains that are running at 100 Gbps. Finally, we analyze the effectiveness of slice-aware memory management to realize cache isolation.
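
Conceptually, slice-aware allocation means choosing buffer addresses whose slice mapping matches the slice nearest the requesting core. The sketch below illustrates only that selection step; the hypothetical_slice_of function is a made-up stand-in, since Intel's Complex Addressing hash is undocumented and must be reverse-engineered per CPU.

```python
def hypothetical_slice_of(phys_addr, n_slices=8):
    """Toy XOR-fold of address bits; the real mapping is an undocumented hash."""
    x, folded = phys_addr >> 6, 0          # drop the 64-byte cache-line offset
    while x:
        folded ^= x & (n_slices - 1)
        x >>= 3
    return folded

def pick_buffers(candidate_addrs, preferred_slice, n_slices=8):
    """Keep only candidate buffer addresses that map to the preferred LLC slice."""
    return [a for a in candidate_addrs
            if hypothetical_slice_of(a, n_slices) == preferred_slice]

# Allocate many 4 KiB-aligned candidates, keep the ones landing on slice 3.
candidates = [0x10000000 + i * 4096 for i in range(64)]
print(pick_buffers(candidates, preferred_slice=3)[:4])
```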

56 citations


Journal ArticleDOI
TL;DR: A new LSRTM approach that uses the excitation amplitude (EA) imaging condition to suppress crosstalk noise; it avoids frequent memory transfers and is well suited to graphics processing unit (GPU) parallelization.
Abstract: Least-squares reverse time migration (LSRTM) can provide higher quality images than conventional reverse time migration, which is helpful to image simultaneous-source data. However, it still faces the problems of the crosstalk noise, great computation time, and storage requirement. We propose a new LSRTM approach by using the excitation amplitude (EA) imaging condition to suppress the crosstalk noise. Since only the maximum amplitude or limited local maximum amplitudes at each imaging point and the corresponding travel time step(s) need to be saved, the great storage problem can be naturally solved. Consequently, the proposed algorithm can avoid the frequent memory transfer and is suitable for the graphics processing unit (GPU) parallelization. Besides, the shared memory with high bandwidth is used to optimize the GPU-based algorithm. In order to further improve the image quality of EA imaging condition, we adopt the shaping regularization as a constraint. The single-source tests with Marmousi and salt models show the feasibility of our algorithm to image the complex and subsalt structures, among which a wrong background velocity is used to test its sensitivity to the velocity error. The noise-free and noise-included simultaneous-source examples demonstrate the ability of EA imaging condition to suppress the crosstalk noise. During the implementation of the GPU parallelization, we find that the shared memory cannot always optimize the GPU parallel algorithm and just works well for the eighth- or higher order spatial finite difference scheme.
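
The storage saving of the EA imaging condition comes from keeping only the peak amplitude and its time step at each grid point. A minimal NumPy sketch of that bookkeeping is shown below; the actual GPU implementation, shaping regularization, and imaging step are beyond this illustration.

```python
import numpy as np

def track_excitation(wavefield_snapshots):
    """wavefield_snapshots: iterable of 2-D arrays, one per time step."""
    max_amp, max_step = None, None
    for it, snap in enumerate(wavefield_snapshots):
        amp = np.abs(snap)
        if max_amp is None:
            max_amp = amp.copy()
            max_step = np.zeros(amp.shape, dtype=np.int32)
            continue
        newer = amp > max_amp
        max_amp[newer] = amp[newer]
        max_step[newer] = it
    return max_amp, max_step   # O(grid) storage instead of O(grid * nt)

# Toy example: 100 random "snapshots" on a 64 x 64 grid.
snaps = (np.random.randn(64, 64) for _ in range(100))
amp, step = track_excitation(snaps)
print(amp.shape, step.max())
```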

55 citations


Proceedings ArticleDOI
22 Jun 2019
TL;DR: This work presents MGPUSim, a cycle-accurate, extensively validated multi-GPU simulator based on AMD's Graphics Core Next 3 (GCN3) instruction set architecture, and proposes the Locality API, an API extension that lets the GPU programmer avoid the complexity of multi-GPU programming while precisely controlling data placement in the multi-GPU memory.
Abstract: The rapidly growing popularity and scale of data-parallel workloads demand a corresponding increase in raw computational power of Graphics Processing Units (GPUs). As single-GPU platforms struggle to satisfy these performance demands, multi-GPU platforms have started to dominate the high-performance computing world. The advent of such systems raises a number of design challenges, including the GPU microarchitecture, multi-GPU interconnect fabric, runtime libraries, and associated programming models. The research community currently lacks a publicly available and comprehensive multi-GPU simulation framework to evaluate next- generation multi-GPU system designs. In this work, we present MGPUSim, a cycle-accurate, extensively validated, multi-GPU simulator, based on AMD's Graphics Core Next 3 (GCN3) instruction set architecture. MGPUSim comes with in-built support for multi-threaded execution to enable fast, parallelized, and accurate simulation. In terms of performance accuracy, MGPUSim differs by only 5.5% on average from the actual GPU hardware. We also achieve a 3.5x and a 2.5x average speedup running functional emulation and detailed timing simulation, respectively, on a 4-core CPU, while delivering the same accuracy as serial simulation. We illustrate the flexibility and capability of the simulator through two concrete design studies. In the first, we propose the Locality API, an API extension that allows the GPU programmer to both avoid the complexity of multi-GPU programming, while precisely controlling data placement in the multi-GPU memory. In the second design study, we propose Progressive Page Splitting Migration (PASI), a customized multi-GPU memory management system enabling the hardware to progressively improve data placement. For a discrete 4-GPU system, we observe that the Locality API can speed up the system by 1.6x (geometric mean), and PASI can improve the system performance by 2.6x (geometric mean) across all benchmarks, compared to a unified 4-GPU platform.

51 citations


Proceedings ArticleDOI
02 Jun 2019
TL;DR: A layer-conscious memory management framework for FPGA-based DNN hardware accelerators that exploits layer diversity and the disjoint lifespans of memory buffers to use on-chip memory efficiently, improving the performance of memory-bound layers and thus the overall performance of DNNs.
Abstract: Deep Neural Networks (DNNs) are becoming more and more complex than before. Previous hardware accelerator designs neglect the layer diversity in terms of computation and communication behavior. On-chip memory resources are underutilized for the memory bounded layers, leading to suboptimal performance. In addition, the increasing complexity of DNN structures makes it difficult to do on-chip memory allocation. To address these issues, we propose a layer conscious memory management framework for FPGA-based DNN hardware accelerators. Our framework exploits the layer diversity and the disjoint lifespan information of memory buffers to efficiently utilize the on-chip memory to improve the performance of the layers bounded by memory and thus the entire performance of DNNs. It consists of four key techniques working coordinately with each other. We first devise a memory allocation algorithm to allocate on-chip buffers for the memory bound layers. In addition, buffer sharing between different layers is applied to improve on-chip memory utilization. Finally, buffer prefetching and splitting are used to further reduce latency. Experiments show that our techniques can achieve 1.36X performance improvement compared with previous designs.
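
The buffer-sharing idea rests on disjoint lifespans: two buffers whose layer lifetimes never overlap can occupy the same on-chip region. The greedy sketch below illustrates that principle only; it is not the paper's allocation algorithm, and the buffer names, lifespans, and sizes are hypothetical.

```python
def share_buffers(buffers):
    """buffers: list of (name, first_layer, last_layer, size_bytes)."""
    regions = []      # each region: {"size": bytes reserved, "intervals": [(s, e)]}
    assignment = {}
    for name, start, end, size in sorted(buffers, key=lambda b: -b[3]):
        for idx, region in enumerate(regions):
            # Disjoint lifespan with every buffer already in the region -> reuse it.
            if all(end < s or start > e for s, e in region["intervals"]):
                region["intervals"].append((start, end))
                region["size"] = max(region["size"], size)
                assignment[name] = idx
                break
        else:
            regions.append({"size": size, "intervals": [(start, end)]})
            assignment[name] = len(regions) - 1
    return assignment, sum(r["size"] for r in regions)

bufs = [("conv1_out", 0, 1, 512), ("conv2_out", 1, 2, 256), ("fc_out", 3, 4, 128)]
print(share_buffers(bufs))   # fc_out reuses conv1_out's region; total 768 bytes
```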

48 citations


Journal ArticleDOI
TL;DR: An efficient deep Q-learning methodology to optimize the performance per watt (PPW) is proposed and experiments show that the PPW achieved by the proposed approach is within 1 percent of the optimal value obtained by an oracle.
Abstract: Heterogeneous multiprocessor system-on-chips (SoCs) provide a wide range of parameters that can be managed dynamically. For example, one can control the type (big/little), number and frequency of active cores in state-of-the-art mobile processors at runtime. These runtime choices lead to more than 10× range in execution time, 5× range in power consumption, and 50× range in performance per watt. Therefore, it is crucial to make optimum power management decisions as a function of dynamically varying workloads at runtime. This paper presents a reinforcement learning approach for dynamically controlling the number and frequency of active big and little cores in mobile processors. We propose an efficient deep Q-learning methodology to optimize the performance per watt (PPW). Experiments using Odroid XU3 mobile platform show that the PPW achieved by the proposed approach is within 1 percent of the optimal value obtained by an oracle.
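
The control loop can be pictured as a reinforcement-learning agent whose state is the current workload, whose action is a (core count, frequency) configuration, and whose reward is the measured performance per watt. The sketch below uses a simplified tabular Q-learning update; the paper itself uses a deep Q-network, and the states, actions, and reward values here are placeholders.

```python
import random
from collections import defaultdict

ACTIONS = [(cores, mhz) for cores in (1, 2, 4) for mhz in (600, 1200, 1800)]

def choose_action(Q, state, epsilon=0.1):
    if random.random() < epsilon:
        return random.choice(ACTIONS)                  # explore
    return max(ACTIONS, key=lambda a: Q[(state, a)])   # exploit

def q_update(Q, state, action, reward, next_state, alpha=0.1, gamma=0.9):
    best_next = max(Q[(next_state, a)] for a in ACTIONS)
    Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])

Q = defaultdict(float)
state = "high_load"                 # e.g., a binned workload descriptor
action = choose_action(Q, state)
# In a real governor the reward would be the measured performance-per-watt
# after running the chosen (cores, frequency) configuration for one interval.
q_update(Q, state, action, reward=3.2, next_state="medium_load")
print(action, Q[(state, action)])
```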

48 citations


Journal ArticleDOI
TL;DR: This paper introduces Memos, a memory management framework which can hierarchically schedule memory resources over the entire memory hierarchy including cache, channels, and main memory comprising DRAM and NVM simultaneously.
Abstract: The emerging hybrid DRAM-NVM architecture is challenging the existing memory management mechanism at the level of the architecture and operating system. In this paper, we introduce Memos, a memory management framework which can hierarchically schedule memory resources over the entire memory hierarchy including cache, channels, and main memory comprising DRAM and NVM simultaneously. Powered by our newly designed kernel-level monitoring module, which samples memory patterns by combining TLB monitoring with page walks, and by a page migration engine, Memos can dynamically optimize the data placement in the memory hierarchy in response to the memory access pattern, current resource utilization, and memory medium features. Our experimental results show that Memos can achieve high memory utilization, improving system throughput by around 20.0 percent; reduce the memory energy consumption by up to 82.5 percent; and improve the NVM lifetime by up to 34X.
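
One ingredient of such hierarchical scheduling is deciding which pages belong in DRAM and which in NVM. The sketch below shows a toy hotness-based split with writes weighted more heavily; Memos itself drives this decision from kernel-level TLB and page-walk sampling, so the counters and weighting here are only illustrative.

```python
def plan_placement(pages, dram_slots):
    """pages: dict page_id -> (read_count, write_count). Returns (dram_set, nvm_set)."""
    # Weight writes more heavily: NVM writes are slow and wear out the medium.
    hotness = {p: r + 4 * w for p, (r, w) in pages.items()}
    ranked = sorted(pages, key=lambda p: hotness[p], reverse=True)
    return set(ranked[:dram_slots]), set(ranked[dram_slots:])

pages = {"A": (100, 50), "B": (10, 0), "C": (300, 5), "D": (5, 40)}
print(plan_placement(pages, dram_slots=2))   # hottest pages ("C", "A") go to DRAM
```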

47 citations


Journal ArticleDOI
TL;DR: This paper proposes MO2R2S, a multiobjective optimization approach for the development of reconfigurable real-time systems, focusing on three optimization criteria: response time, memory allocation, and energy consumption.
Abstract: This paper deals with reconfigurable real-time systems that should be adapted to their environment under real-time constraints. The reconfiguration allows moving from one implementation to another by adding/removing/modifying parameters of real-time software tasks which should meet related deadlines. Implementing those systems as threads generates complex system code due to the large number of threads, which may lead to reconfiguration time overhead as well as increased energy consumption and memory allocation. Thus, this paper proposes MO2R2S, a multiobjective optimization approach for the development of reconfigurable real-time systems. Given a specification, the proposed approach aims to produce an optimal design while ensuring the system feasibility. We focus on three optimization criteria: 1) response time; 2) memory allocation; and 3) energy consumption. To address the portability issue, the optimal design is then transformed to an abstract code that may in turn be transformed to a concrete code specific to procedural programming (i.e., POSIX) or an object-oriented language (i.e., RT-Java). The MO2R2S approach allows reducing the number of threads by minimizing the redundancy between the implementation sets. An experimental study shows that this optimization decreases memory allocation by 28.89%, energy consumption by 40.2%, and response time by 61.32%.

Journal ArticleDOI
TL;DR: This paper proposes centralized uncoded placement and linear delivery schemes which are optimized by solving a linear program, and derives a lower bound on the delivery-memory tradeoff with uncoded placement that accounts for the heterogeneity in cache sizes.
Abstract: In cache-aided networks, the server populates the cache memories at the users during low-traffic periods in order to reduce the delivery load during peak-traffic hours. In turn, there exists a fundamental tradeoff between the delivery load on the server and the cache sizes at the users. In this paper, we study this tradeoff in a multicast network, where the server is connected to users with unequal cache sizes and the number of users is less than or equal to the number of library files. We propose centralized uncoded placement and linear delivery schemes which are optimized by solving a linear program. Additionally, we derive a lower bound on the delivery memory tradeoff with uncoded placement that accounts for the heterogeneity in cache sizes. We explicitly characterize this tradeoff for the case of three end-users, as well as an arbitrary number of end-users when the total memory size at the users is small, and when it is large. Next, we consider a system where the server is connected to the users via rate-limited links of different capacities and the server assigns the users’ cache sizes subject to a total cache budget. We characterize the optimal cache sizes that minimize the delivery completion time with uncoded placement and linear delivery. In particular, the optimal memory allocation balances between assigning larger cache sizes to users with low capacity links and uniform memory allocation.

Journal ArticleDOI
TL;DR: This paper proposes NNPIM, a novel processing-in-memory architecture that significantly accelerates the inference phase of neural networks inside the memory, and introduces simple optimization techniques that significantly improve NN performance and reduce overall energy consumption.
Abstract: Neural networks (NNs) have shown great ability to process emerging applications such as speech recognition, language recognition, image classification, video segmentation, and gaming. It is therefore important to make NNs efficient. Although attempts have been made to improve NNs’ computation cost, the data movement between memory and processing cores is the main bottleneck for NNs’ energy consumption and execution time. This makes the implementation of NNs significantly slower on traditional CPU/GPU cores. In this paper, we propose a novel processing in-memory architecture, called NNPIM, that significantly accelerates the inference phase of neural networks inside the memory. First, we design a crossbar memory architecture that supports fast addition, multiplication, and search operations inside the memory. Second, we introduce simple optimization techniques that significantly improve NNs’ performance and reduce the overall energy consumption. We also map all NN functionalities using parallel in-memory components. To further improve the efficiency, our design supports weight sharing to reduce the number of computations in memory and consequently speed up NNPIM computation. We compare the efficiency of our proposed NNPIM with GPU and the state-of-the-art PIM architectures. Our evaluation shows that our design can achieve 131.5× higher energy efficiency and is 48.2× faster as compared to the NVIDIA GTX 1080 GPU architecture. Compared to state-of-the-art neural network accelerators, NNPIM can achieve on average 3.6× higher energy efficiency and is 4.6× faster, while providing the same classification accuracy.

Proceedings ArticleDOI
08 Jun 2019
TL;DR: Panthera is proposed, a semantics-aware, fully automated memory management technique for Big Data processing over hybrid memories that reduces energy by 32 – 52% at only a 1 – 9% execution time overhead.
Abstract: Modern data-parallel systems such as Spark rely increasingly on in-memory computing that can significantly improve the efficiency of iterative algorithms. To process real-world datasets, modern data-parallel systems often require extremely large amounts of memory, which are both costly and energy-inefficient. Emerging non-volatile memory (NVM) technologies offers high capacity compared to DRAM and low energy compared to SSDs. Hence, NVMs have the potential to fundamentally change the dichotomy between DRAM and durable storage in Big Data processing. However, most Big Data applications are written in managed languages (e.g., Scala and Java) and executed on top of a managed runtime (e.g., the Java Virtual Machine) that already performs various dimensions of memory management. Supporting hybrid physical memories adds in a new dimension, creating unique challenges in data replacement and migration. This paper proposes Panthera, a semantics-aware, fully automated memory management technique for Big Data processing over hybrid memories. Panthera analyzes user programs on a Big Data system to infer their coarse-grained access patterns, which are then passed down to the Panthera runtime for efficient data placement and migration. For Big Data applications, the coarse-grained data division is accurate enough to guide GC for data layout, which hardly incurs data monitoring and moving overhead. We have implemented Panthera in OpenJDK and Apache Spark. An extensive evaluation with various datasets and applications demonstrates that Panthera reduces energy by 32 – 52% at only a 1 – 9% execution time overhead.

Proceedings ArticleDOI
01 Feb 2019
TL;DR: A novel approach that “mines” the unexploited opportunity of on-chip data reuse by introducing the abstraction of logical buffers to address the lack of flexibility in existing buffer architectures and by proposing a sequence of procedures that can effectively reuse both shortcut and non-shortcut feature maps.
Abstract: Off-chip memory traffic has been a major performance bottleneck in deep learning accelerators. While reusing on-chip data is a promising way to reduce off-chip traffic, the opportunity of reusing shortcut connection data in deep networks (e.g., residual networks) has been largely neglected. This shortcut data accounts for nearly 40% of the total feature map data. In this paper, we propose Shortcut Mining, a novel approach that “mines” the unexploited opportunity of on-chip data reuse. We introduce the abstraction of logical buffers to address the lack of flexibility in existing buffer architectures, and then propose a sequence of procedures which, collectively, can effectively reuse both shortcut and non-shortcut feature maps. The proposed procedures are also able to reuse shortcut data across any number of intermediate layers without using additional buffer resources. Experimental results from prototyping on FPGAs show that the proposed Shortcut Mining achieves 53.3%, 58%, and 43% reduction in off-chip feature map traffic for SqueezeNet, ResNet-34, and ResNet-152, respectively, and a 1.93X increase in throughput compared with a state-of-the-art accelerator.

Journal ArticleDOI
TL;DR: A single-ended 6-T (SE6T) static random access memory (SRAM) cell is proposed, which has about 50% less dynamic power than the conventional 6-T SRAM cell at the same bit error rate (BER).
Abstract: Image processing and other multimedia applications require large embedded storage. In some earlier works, approximate memory has been shown as a potential energy-efficient solution for such error-tolerant applications. In this brief, we propose a single-ended 6-T (SE6T) static random access memory (SRAM) cell which has about 50% less dynamic power compared to the conventional 6-T SRAM cell with the same bit error rate (BER). Since image processing applications are tolerant to errors, ultra-low-voltage power-efficient embedded memories with BER can be used for storage. We show that a 1 KB (256 × 32) SE6T memory consumes 0.45× dynamic power, 0.83× leakage power and takes 0.60× area as compared to conventional 6T SRAM memory for similar peak signal-to-noise ratio (PSNR). We have also proposed a heterogeneous SE6T 1K SRAM memory and show that for a given power budget, PSNR improves by at least 14 dB compared to when a homogeneous (identically sized bit-cells) SE6T SRAM memory is used. When compared with heterogeneous 6T SRAM memory, the heterogeneous SE6T SRAM memory consumes 0.44× dynamic power, 0.86× leakage power and takes 0.6× area for almost similar PSNR. For a given PSNR, the SE6T memory is cumulatively better in terms of design complexity, area and power when compared with other hybrid and heterogeneous approximate memories.

Journal ArticleDOI
TL;DR: The challenges and emerging solutions in testing three classes of memories are discussed: 3D stacked memories, resistive memories, and spin-transfer-torque magnetic memories.
Abstract: The research and prototyping of new memory technologies are getting a lot of attention in order to enable new (computer) architectures and provide new opportunities for today's and future applications. Delivering high quality and reliability products was and will remain a crucial step in the introduction of new technologies. Therefore, appropriate fault modelling, test development and design for testability (DfT) is needed. This paper overviews and discusses the challenges and the emerging solutions in testing three classes of memories: 3D stacked memories, Resistive memories and Spin-Transfer-Torque Magnetic memories. Defects mechanisms, fault models, and emerging test solutions will be discussed.

Journal ArticleDOI
Meiqi Wang, Zhisheng Wang, Jinming Lu, Jun Lin, Zhongfeng Wang
TL;DR: Efficient hardware architectures for accelerating the compressed LSTM are proposed, which support the inference of multiple layers and multiple time steps; the computation process is judiciously reorganized and the memory access pattern is well optimized, which alleviates the limited memory bandwidth bottleneck and enables higher throughput.
Abstract: Long Short-Term Memory (LSTM) and its variants have been widely adopted in many sequential learning tasks, such as speech recognition and machine translation. Significant accuracy improvements can be achieved using complex LSTM models with a large memory requirement and high computational complexity, which is time-consuming and energy demanding. The low-latency and energy-efficiency requirements of real-world applications make model compression and hardware acceleration for LSTM an urgent need. In this paper, several hardware-efficient network compression schemes are introduced first, including structured top-k pruning, clipped gating, and multiplication-free quantization, to reduce the model size and the number of matrix operations by 32× and 21.6×, respectively, with negligible accuracy loss. Furthermore, efficient hardware architectures for accelerating the compressed LSTM are proposed, which support the inference of multiple layers and multiple time steps. The computation process is judiciously reorganized and the memory access pattern is well optimized, which alleviates the limited memory bandwidth bottleneck and enables higher throughput. Moreover, the parallel processing strategy is carefully designed to make full use of the sparsity introduced by pruning and clipped gating with high hardware utilization efficiency. Implemented on an Intel Arria 10 SX 660 FPGA running at 200 MHz, the proposed design achieves 1.4–2.2× energy efficiency and requires significantly fewer hardware resources compared with state-of-the-art LSTM implementations.
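
Structured top-k pruning keeps only the k largest-magnitude weights within each row (or block) of a weight matrix, which preserves a regular structure the hardware can exploit. A minimal NumPy sketch follows; k and the granularity are illustrative, and the clipped gating and multiplication-free quantization steps are not shown.

```python
import numpy as np

def topk_prune_rows(W, k):
    """Keep the k largest-magnitude weights in every row of W, zero the rest."""
    pruned = np.zeros_like(W)
    idx = np.argsort(-np.abs(W), axis=1)[:, :k]       # top-k column indices per row
    rows = np.arange(W.shape[0])[:, None]
    pruned[rows, idx] = W[rows, idx]
    return pruned

W = np.random.randn(4, 8).astype(np.float32)
print(topk_prune_rows(W, k=2))   # exactly two nonzeros per row
```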

Proceedings ArticleDOI
04 Apr 2019
TL;DR: Split-CNN is demonstrated to achieve significantly higher training scalability by dramatically reducing the memory requirements of training algorithms on GPU accelerators, and empirical evidence shows that splitting at randomly chosen boundaries can even yield accuracy gains over the baseline CNN due to its regularization effect.
Abstract: We present an interdisciplinary study to tackle the memory bottleneck of training deep convolutional neural networks (CNN). Firstly, we introduce Split Convolutional Neural Network (Split-CNN) that is derived from the automatic transformation of the state-of-the-art CNN models. The main distinction between Split-CNN and regular CNN is that Split-CNN splits the input images into small patches and operates on these patches independently before entering later stages of the CNN model. Secondly, we propose a novel heterogeneous memory management system (HMMS) to utilize the memory-friendly properties of Split-CNN. Through experiments, we demonstrate that Split-CNN achieves significantly higher training scalability by dramatically reducing the memory requirements of training algorithms on GPU accelerators. Furthermore, we provide empirical evidence that splitting at randomly chosen boundaries can even result in accuracy gains over baseline CNN due to its regularization effect.
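
The core transformation is easy to picture: the input is cut into patches, the early layers run on each patch independently (so only one patch's activations need be resident at a time), and the patch outputs are reassembled for the later stages. The NumPy sketch below illustrates just that data flow with a placeholder early stage; it is not the paper's transformation tool or HMMS.

```python
import numpy as np

def split_and_process(image, patch, early_stage):
    """image: (H, W, C); early_stage: fn(patch array) -> feature patch."""
    H, W, _ = image.shape
    rows = []
    for y in range(0, H, patch):
        row = [early_stage(image[y:y + patch, x:x + patch])
               for x in range(0, W, patch)]
        rows.append(np.concatenate(row, axis=1))
    return np.concatenate(rows, axis=0)    # reassembled feature map for later stages

img = np.random.rand(64, 64, 3)
out = split_and_process(img, patch=16,
                        early_stage=lambda p: p.mean(axis=2, keepdims=True))
print(out.shape)   # (64, 64, 1) with this toy "early stage"
```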

Posted Content
TL;DR: This paper proposes two orthogonal approaches to reduce the memory cost from the system perspective by exploiting the iterative nature of the training algorithm of deep learning to derive the lifetime and read/write order of all variables.
Abstract: GPU (graphics processing unit) has been used for many data-intensive applications. Among them, deep learning systems are one of the most important consumer systems for GPU nowadays. As deep learning applications impose deeper and larger models in order to achieve higher accuracy, memory management becomes an important research topic for deep learning systems, given that GPU has limited memory size. Many approaches have been proposed towards this issue, e.g., model compression and memory swapping. However, they either degrade the model accuracy or require a lot of manual intervention. In this paper, we propose two orthogonal approaches to reduce the memory cost from the system perspective. Our approaches are transparent to the models, and thus do not affect the model accuracy. They are achieved by exploiting the iterative nature of the training algorithm of deep learning to derive the lifetime and read/write order of all variables. With the lifetime semantics, we are able to implement a memory pool with minimal fragments. However, the optimization problem is NP-complete. We propose a heuristic algorithm that reduces up to 13.3% of memory compared with Nvidia's default memory pool with equal time complexity. With the read/write semantics, the variables that are not in use can be swapped out from GPU to CPU to reduce the memory footprint. We propose multiple swapping strategies to automatically decide which variable to swap and when to swap out (in), which reduces the memory cost by up to 34.2% without communication overhead.
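
The lifetime-based memory pool can be illustrated with a small greedy placer: variables whose lifetimes overlap must not share bytes, while others may. The sketch below is an illustration of that constraint, not the paper's heuristic, and the tensor names and lifetimes are hypothetical.

```python
def overlaps_time(a_start, a_end, b_start, b_end):
    return not (a_end < b_start or a_start > b_end)

def overlaps_space(a_off, a_size, b_off, b_size):
    return not (a_off + a_size <= b_off or a_off >= b_off + b_size)

def assign_offsets(variables):
    """variables: list of (name, size, first_use, last_use). Returns (offsets, pool size)."""
    placed, offsets = [], {}
    for name, size, start, end in sorted(variables, key=lambda v: -v[1]):
        offset = 0
        while True:
            conflicts = [o + sz for o, sz, s, e in placed
                         if overlaps_time(start, end, s, e)
                         and overlaps_space(offset, size, o, sz)]
            if not conflicts:
                break
            offset = min(conflicts)        # jump just past the lowest conflicting block
        placed.append((offset, size, start, end))
        offsets[name] = offset
    return offsets, max((o + sz for o, sz, _, _ in placed), default=0)

tensors = [("act1", 4096, 0, 2), ("act2", 4096, 1, 3), ("grad1", 4096, 2, 4)]
print(assign_offsets(tensors))   # all three lifetimes overlap, so no byte is shared
```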

Journal ArticleDOI
Jinsun Cho, Doohyeok Lim, Sola Woo, Kyungah Cho, Sangsig Kim
TL;DR: A novel static random access memory unit cell design and its array structure consisting of single-gated feedback field-effect transistors (FBFETs) demonstrates the promising potential of the FBFET SRAM for high-performance, high-density, and low-power memory applications.
Abstract: In this paper, we propose a novel static random access memory (SRAM) unit cell design and its array structure consisting of single-gated feedback field-effect transistors (FBFETs). To verify the SRAM characteristics, the basic memory operations and write disturbances of the unit cell are investigated through the mixed-mode technology computer-aided design simulations. The unit cell exhibits the superior SRAM characteristics including a write speed of 0.6 ns, a fast read-out speed of ~0.1 ns, and a retention time of 3600 s. Furthermore, the unit cell design exhibits advantages in density, with a small cell area of 8F², and in the power consumption; the standby power consumption is 0.24 nW/bit for holding “1” and negligible for holding “0.” Moreover, our SRAM array shows reliable 3 × 3 array operations without any disturbances. This paper demonstrates the promising potential of the FBFET SRAM for high-performance, high-density, and low-power memory applications.

Proceedings ArticleDOI
20 May 2019
TL;DR: vDNN addresses the GPU memory bottleneck and enables training of deep networks that are larger than GPU memory; vDNN++ is presented, which extends vDNN and resolves the identified issues.
Abstract: Deep learning has been widely adopted for different applications of artificial intelligence - speech recognition, natural language processing, computer vision etc. The growing size of Deep Neural Networks (DNNs) has compelled the researchers to design memory efficient and performance optimal algorithms. Apart from algorithmic improvements, specialized hardware like Graphics Processing Units (GPUs) are being widely employed to accelerate the training and inference phases of deep networks. However, the limited GPU memory capacity limits the upper bound on the size of networks that can be offloaded to and trained using GPUs. vDNN addresses the GPU memory bottleneck issue and provides a solution which enables training of deep networks that are larger than GPU memory. In our work, we characterize and identify multiple bottlenecks with vDNN like delayed computation start, high pinned memory requirements and GPU memory fragmentation. We present vDNN++ which extends vDNN and resolves the identified issues. Our results show that the performance of vDNN++ is comparable or better (up to 60% relative improvement) than vDNN. We propose different heuristics and order for memory allocation, and empirically evaluate the extent of memory fragmentation with them. We are also able to reduce the pinned memory requirement by up to 60%.

Journal ArticleDOI
Youngeun Kwon, Minsoo Rhu
TL;DR: This article summarizes the recent work on designing an accelerator-centric, disaggregated memory system for DL.
Abstract: As the complexity of deep learning (DL) models scales up, computer architects are faced with a memory “capacity” wall, where the limited physical memory inside the accelerator device constrains the algorithm that can be trained and deployed. This article summarizes our recent work on designing an accelerator-centric, disaggregated memory system for DL.

Proceedings ArticleDOI
Juhyoung Lee, Dongjoo Shin, Jinsu Lee, Jinmook Lee, Sanghoon Kang, Hoi-Jun Yoo
09 Jun 2019
TL;DR: A high-throughput CNN super resolution (SR) processor that has three key features: selective caching based layer fusion to minimize external memory access (EMA), memory compaction scheme for smaller on-chip memory footprint, and cyclic ring core architecture to increase the throughput with improved core utilization.
Abstract: A high-throughput CNN super resolution (SR) processor is proposed for memory efficient SR processing. It has three key features: 1) selective caching based layer fusion to minimize external memory access (EMA), 2) memory compaction scheme for smaller on-chip memory footprint, and 3) cyclic ring core architecture to increase the throughput with improved core utilization. As a result, the implemented processor achieves 60 frames-per-second throughput in generating full HD images.

Proceedings ArticleDOI
26 Mar 2019
TL;DR: PageSeer is proposed, a scheme that effectively hides the swap overhead, services many requests from the DRAM, and initiates several types of page swaps, building a complete solution for hybrid memory.
Abstract: Hybrid main memories composed of DRAM and Non-Volatile Memory (NVM) combine the capacity benefits of NVM with the low-latency properties of DRAM. For highest performance, data segments should be exchanged between the two types of memories dynamically—a process known as segment swapping—based on the access patterns to the segments in the program. The key difficulty in hardware-managed swapping is to identify the appropriate segments to swap between the memories at the right time in the execution. To perform hardware-managed segment swapping both accurately and with substantial lead time, this paper proposes to use hints from the page walk in a TLB miss. We call the scheme PageSeer. During the generation of the physical address for a page in a TLB miss, the memory controller is informed. The controller uses historic data on the accesses to that page and to a subsequently-referenced page (i.e., its follower page), to potentially initiate swaps for the page and for its follower. We call these actions MMU-Triggered Prefetch Swaps. PageSeer also initiates other types of page swaps, building a complete solution for hybrid memory. Our evaluation of PageSeer with simulations of 26 workloads shows that PageSeer effectively hides the swap overhead and services many requests from the DRAM. Compared to a state-of-the-art hardware-only scheme for hybrid memory management, PageSeer on average improves performance by 19% and reduces the average main memory access time by 29%. Keywords: Hybrid Memory Systems; Non-Volatile Memory; Virtual Memory; Page Walks; Page Swapping.
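
The MMU-triggered prefetch idea can be sketched as a small controller-side state machine: on a TLB miss, learn the page's follower from the miss stream and, if the page or its predicted follower resides in NVM, schedule a swap into DRAM ahead of the demand accesses. The sketch below is conceptual only; table sizes, swap timing, and PageSeer's other swap types are omitted.

```python
class PrefetchSwapper:
    def __init__(self, location):
        self.location = dict(location)   # page -> "DRAM" or "NVM"
        self.follower = {}               # page -> most recently observed next page
        self.last_miss = None
        self.swap_queue = []

    def on_tlb_miss(self, page):
        # Learn the follower relationship from consecutive TLB misses.
        if self.last_miss is not None:
            self.follower[self.last_miss] = page
        self.last_miss = page
        # Swap in the missing page and, speculatively, its predicted follower.
        for candidate in (page, self.follower.get(page)):
            if candidate is not None and self.location.get(candidate) == "NVM":
                self.swap_queue.append(candidate)
                self.location[candidate] = "DRAM"   # modeled as instantaneous here

mc = PrefetchSwapper({"P1": "NVM", "P2": "NVM"})
for p in ("P1", "P2", "P1"):
    mc.on_tlb_miss(p)
print(mc.swap_queue)   # ['P1', 'P2']
```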

Journal ArticleDOI
TL;DR: An assessment of Unified Memory performance with data prefetching and memory oversubscription is presented, along with recommendations and hints for similar codes regarding expected performance on modern and already widely available GPUs.
Abstract: The paper presents assessment of Unified Memory performance with data prefetching and memory oversubscription. Several versions of code are used with: standard memory management, standard Unified Memory and optimized Unified Memory with programmer-assisted data prefetching. Evaluation of execution times is provided for four applications: Sobel and image rotation filters, stream image processing and computational fluid dynamic simulation, performed on Pascal and Volta architecture GPUs—NVIDIA GTX 1080 and NVIDIA V100 cards. Furthermore, we evaluate the possibility of allocating more memory than available on GPUs and assess performance of codes using the three aforementioned implementations, including memory oversubscription available in CUDA. Results serve as recommendations and hints for other similar codes regarding expected performance on modern and already widely available GPUs.

Proceedings ArticleDOI
17 Nov 2019
TL;DR: This paper proposes several user-transparent unified memory management schemes to achieve adaptive implicit and explicit data transfer and prevent data thrashing, and implements the proposed schemes to improve OpenMP GPU offloading performance.
Abstract: To improve programmability and productivity, recent GPUs adopt a virtual memory address space shared with CPUs (e.g., NVIDIA's unified memory). Unified memory migrates the data management burden from programmers to system software and hardware, and enables GPUs to address datasets that exceed their memory capacity. Our experiments show that while the implicit data transfer of unified memory may bring better data movement efficiency, page fault overhead and data thrashing can erase its benefits. In this paper, we propose several user-transparent unified memory management schemes to 1) achieve adaptive implicit and explicit data transfer and 2) prevent data thrashing. Unlike previous approaches which mostly rely on the runtime and thus suffer from large overhead, we demonstrate the benefits of exploiting key information from compiler analyses, including data locality, access density, and target reuse distance, to accomplish our goal. We implement the proposed schemes to improve OpenMP GPU offloading performance. Our evaluation shows that our schemes improve the GPU performance and memory efficiency significantly.

Proceedings ArticleDOI
01 Jan 2019
TL;DR: This work presents a stability-based approach for filter-level pruning of CNNs that reduces the number of FLOPS and the GPU memory footprint, significantly outperforming other state-of-the-art filter pruning methods.
Abstract: Convolutional neural networks (CNN) have achieved impressive performance on the wide variety of tasks (classification, detection, etc.) across multiple domains at the cost of high computational and memory requirements. Thus, leveraging CNNs for real-time applications necessitates model compression approaches that not only reduce the total number of parameters but reduce the overall computation as well. In this work, we present a stability-based approach for filter-level pruning of CNNs. We evaluate our proposed approach on different architectures (LeNet, VGG-16, ResNet, and Faster RCNN) and datasets and demonstrate its generalizability through extensive experiments. Moreover, our compressed models can be used at run-time without requiring any special libraries or hardware. Our model compression method reduces the number of FLOPS by an impressive factor of 6.03X and GPU memory footprint by more than 17X, significantly outperforming other state-of-the-art filter pruning methods.
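
Filter-level pruning removes whole output filters, so the layer genuinely shrinks and needs no sparse kernels or special hardware at run time. The sketch below shows that mechanics with a simple L1-norm ranking, which is not the authors' stability-based criterion; it only illustrates why the compressed model runs without special libraries.

```python
import numpy as np

def prune_filters(conv_weights, keep_ratio=0.5):
    """conv_weights: (out_channels, in_channels, kH, kW) -> (pruned weights, kept indices)."""
    scores = np.abs(conv_weights).sum(axis=(1, 2, 3))     # one score per output filter
    n_keep = max(1, int(round(keep_ratio * len(scores))))
    keep = np.sort(np.argsort(-scores)[:n_keep])          # indices of surviving filters
    return conv_weights[keep], keep

W = np.random.randn(64, 32, 3, 3).astype(np.float32)
W_pruned, kept = prune_filters(W, keep_ratio=0.25)
print(W_pruned.shape)   # (16, 32, 3, 3); the next layer must drop matching input channels
```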

Journal ArticleDOI
TL;DR: This paper proposes Maximizes Cache Value (MCV), an efficient cache management policy that minimizes memory access cost in a hybrid main memory platform for edge computing.
Abstract: Edge computing is proposed to bridge mobile devices with cloud computing data centers in the era of mobile big data, as an intermediate level of computing power. One important issue in edge computing is how to improve performance for mobile devices. Current systems utilize cache in multicore systems to reduce memory access cost with an acceptable hardware cost. However, existing cache management policies are unable to maximize cache value in the newly developed hybrid memory platform that combines phase-change memory and dynamic random-access memory. In this paper, we propose Maximizes Cache Value (MCV), an efficient cache management policy that minimizes memory access cost in a hybrid main memory platform for edge computing. Extensive simulation studies indicate that this strategy can improve performance in hybrid main memory-based edge computing for mobile devices.
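
The notion of cache value can be illustrated by weighting a line's usefulness by the cost of re-fetching it from its backing medium: a line backed by slow phase-change memory is worth more to keep than an equally recent line backed by DRAM. The sketch below uses an invented scoring of that form, not the MCV policy's actual value function; costs and recency counters are placeholders.

```python
MISS_COST = {"DRAM": 1.0, "PCM": 4.0}   # illustrative relative re-fetch costs

def choose_victim(cache_lines):
    """cache_lines: dict line_id -> {"medium": "DRAM"|"PCM", "recency": int}."""
    def value(line_id):
        meta = cache_lines[line_id]
        # A line is worth keeping if it is recently used AND expensive to re-fetch.
        return MISS_COST[meta["medium"]] * (1 + meta["recency"])
    return min(cache_lines, key=value)   # evict the least valuable line

lines = {"a": {"medium": "DRAM", "recency": 5},
         "b": {"medium": "PCM", "recency": 1},
         "c": {"medium": "DRAM", "recency": 0}}
print(choose_victim(lines))   # 'c': cold and cheap to re-fetch from DRAM
```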

Proceedings ArticleDOI
25 Mar 2019
TL;DR: A SAT-based algorithm for quantum memory management, based on the reversible pebbling game, is proposed; it returns a valid clean-up strategy that takes the limitations of the quantum hardware into account.
Abstract: Quantum memory management is becoming a pressing problem, especially given the recent research effort to develop new and more complex quantum algorithms. The only existing automatic method for quantum states clean-up relies on the availability of many extra resources. In this work, we propose an automatic tool for quantum memory management. We show how this problem exactly matches the reversible pebbling game. Based on that, we develop a SAT-based algorithm that returns a valid clean-up strategy, taking the limitations of the quantum hardware into account. The developed tool empowers the designer with the flexibility required to explore the trade-off between memory resources and number of operations. We present two show-cases to prove the validity of our approach. First, we apply the algorithm to straight-line programs, widely used in cryptographic applications. Second, we perform a comparison with the existing approach, showing an average improvement of 52.77%.