Topic

Memory management

About: Memory management is a research topic. Over its lifetime, 16,743 publications have been published within this topic, receiving 312,028 citations. The topic is also known as: memory allocation.


Papers
Proceedings ArticleDOI
17 Sep 2005
TL;DR: A general-purpose compiler approach, called memory coloring, for efficiently allocating the arrays in a program to an SPM by adapting an existing graph-colouring algorithm for register allocation to assign the arrays in the program to the register file.
Abstract: Scratchpad memory (SPM), a fast software-managed on-chip SRAM, is now widely used in modern embedded processors. Compared to hardware-managed cache, it is more efficient in performance, power and area cost, and has the added advantage of better time predictability. This paper introduces a general-purpose compiler approach, called memory coloring, for efficiently allocating the arrays in a program to an SPM. The novelty of our approach lies in partitioning an SPM into a "register file", splitting the live ranges of arrays to create potential data transfer statements between the SPM and off-chip memory, and finally, adapting an existing graph-colouring algorithm for register allocation to assign the arrays in the program to the register file. Our approach is efficient due to the practical efficiency of graph-colouring algorithms. We have implemented this work in SUIF and machSUIF. Preliminary results over benchmarks show that our approach represents a promising solution to automatic SPM management.
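
To make the idea concrete, here is a minimal sketch of the colouring step: arrays whose live ranges interfere must receive different SPM partitions, and arrays that cannot be coloured spill to off-chip memory. The interference graph, array count, and partition count below are invented for illustration; this is not the paper's SUIF/machSUIF implementation, which also splits live ranges before colouring.

```c
#include <stdbool.h>
#include <stdio.h>

#define NUM_ARRAYS 5
#define NUM_SLOTS  3   /* SPM partitioned into 3 equal "registers" (assumed) */

/* interfere[i][j] is true when arrays i and j have overlapping live ranges */
static const bool interfere[NUM_ARRAYS][NUM_ARRAYS] = {
    {0,1,1,0,0},
    {1,0,1,1,0},
    {1,1,0,0,1},
    {0,1,0,0,1},
    {0,0,1,1,0},
};

int main(void) {
    int slot[NUM_ARRAYS]; /* assigned SPM slot, or -1 if spilled off-chip */
    for (int i = 0; i < NUM_ARRAYS; i++) {
        bool used[NUM_SLOTS] = {false};
        /* mark slots taken by interfering, already-coloured arrays */
        for (int j = 0; j < i; j++)
            if (interfere[i][j] && slot[j] >= 0)
                used[slot[j]] = true;
        slot[i] = -1;
        /* greedy colouring: take the first free slot, if any */
        for (int c = 0; c < NUM_SLOTS; c++)
            if (!used[c]) { slot[i] = c; break; }
        if (slot[i] >= 0)
            printf("array %d -> SPM slot %d\n", i, slot[i]);
        else
            printf("array %d -> off-chip memory (spilled)\n", i);
    }
    return 0;
}
```

In the actual approach, live-range splitting inserts data transfer statements between the SPM and off-chip memory, so an array that spills here could still occupy the SPM for part of its lifetime.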

117 citations

Journal ArticleDOI
TL;DR: The results show that there are memory management policies implemented in the system that can improve the performance of programs written using the simpler uniform memory access (UMA) programming model, and there appears to be no single policy that can be considered the best over a set of test applications.
Abstract: Non-uniformity of memory access is an almost inevitable feature of memory architecture in shared memory multiprocessor designs that can scale to large numbers of processors. One implication of NUMA architectures is that the placement and movement of code and data become crucial to performance. As memory architectures become more complex and the nonuniformity becomes less well hidden, systems software must assume a larger role in providing memory management support for the programmer. This paper investigates the role of the operating system. We take an experimental approach to evaluating a wide range of memory management policies. The target NUMA environment is BBN's GP-1000 multiprocessor. Extensive local modifications have been made to the memory management subsystem of BBN's nX operating system to support multiple policy implementations. Policy comparisons are based on the measured performance of real parallel applications. Our results show that there are memory management policies implemented in our system that can improve the performance of programs written using the simpler uniform memory access (UMA) programming model. While achieving the level of performance of a highly tuned NUMA program is still a difficult problem, some examples come close. There appears to be no single policy that can be considered the best over our set of test applications. Investigations into the contributions made by individual policy features toward overall behavior of the workload provide some insight into the design of a set of effective policies.
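
The policy space the paper explores (where the OS places pages relative to the threads that touch them) still exists in today's systems. As a rough modern analogue, the following C sketch uses the Linux libnuma API (link with -lnuma) to request two classic placement policies; it only illustrates the policy concepts and is not the BBN nX interface studied in the paper.

```c
/* Two classic NUMA placement policies via Linux libnuma (link with -lnuma). */
#include <numa.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA is not supported on this system\n");
        return EXIT_FAILURE;
    }
    size_t sz = 64 * 1024 * 1024;

    /* "Local" policy: back pages from the calling thread's node,
     * minimising access latency for that thread. */
    double *local = numa_alloc_local(sz);

    /* "Interleaved" policy: stripe pages round-robin across all nodes,
     * trading best-case latency for bandwidth and load balance. */
    double *spread = numa_alloc_interleaved(sz);

    printf("nodes: %d, local=%p interleaved=%p\n",
           numa_max_node() + 1, (void *)local, (void *)spread);

    if (local)  numa_free(local, sz);
    if (spread) numa_free(spread, sz);
    return 0;
}
```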

117 citations

Proceedings ArticleDOI
10 Jun 2006
TL;DR: This paper is the first to integrate a software transactional memory system with a malloc/free-based memory allocator, and presents the first algorithm which ensures that space allocated in an aborted transaction is properly freed and does not lead to a space blowup.
Abstract: Emerging multi-core processors promise to provide an exponentially increasing number of hardware threads with every generation. Applications will need to be highly concurrent to fully use the power of these processors. To enable maximum concurrency, libraries (such as malloc/free packages) would therefore need to use non-blocking algorithms. But lock-free algorithms are notoriously difficult to reason about and inappropriate for average programmers. Transactional memory promises to significantly ease concurrent programming for the average programmer. This paper describes a highly efficient non-blocking malloc/free algorithm that supports memory allocation and deallocation inside transactional code blocks. Thus this paper describes a memory allocator that is suitable for emerging multi-core applications, while supporting modern concurrency constructs. This paper makes several novel contributions. It is the first to integrate a software transactional memory system with a malloc/free-based memory allocator. We present the first algorithm which ensures that space allocated in an aborted transaction is properly freed and does not lead to a space blowup. Unlike previous lock-free malloc packages, our algorithm avoids atomic operations on typical code paths, making our algorithm substantially more efficient.
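
The space-safety property (no blowup from aborted transactions) rests on two pieces of bookkeeping: allocations performed inside a transaction must be reclaimed on abort, and frees must be deferred until commit so an abort can undo them. The hypothetical, single-threaded C sketch below shows only that bookkeeping; the paper's actual allocator is non-blocking and avoids atomic operations on common paths, which this sketch does not attempt.

```c
#include <stdlib.h>

#define TX_LOG_MAX 64

typedef struct {
    void *allocs[TX_LOG_MAX]; int n_allocs;  /* undo these on abort    */
    void *frees[TX_LOG_MAX];  int n_frees;   /* apply these on commit  */
} tx_t;

void *tx_malloc(tx_t *tx, size_t sz) {
    void *p = malloc(sz);
    if (p && tx->n_allocs < TX_LOG_MAX) tx->allocs[tx->n_allocs++] = p;
    return p;
}

/* Defer the free: until commit, other transactions may still read p. */
void tx_free(tx_t *tx, void *p) {
    if (tx->n_frees < TX_LOG_MAX) tx->frees[tx->n_frees++] = p;
}

void tx_commit(tx_t *tx) {            /* deferred frees take effect now */
    for (int i = 0; i < tx->n_frees; i++) free(tx->frees[i]);
    tx->n_allocs = tx->n_frees = 0;
}

void tx_abort(tx_t *tx) {             /* reclaim space the doomed transaction allocated */
    for (int i = 0; i < tx->n_allocs; i++) free(tx->allocs[i]);
    tx->n_allocs = tx->n_frees = 0;
}

int main(void) {
    tx_t tx = {0};
    void *p = tx_malloc(&tx, 128);
    tx_free(&tx, p);   /* logged, not freed yet        */
    tx_commit(&tx);    /* p is actually released here  */
    return 0;
}
```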

117 citations

Proceedings ArticleDOI
04 Dec 2010
TL;DR: This work proposes dynamically reconfigurable predictive mechanisms that exploit the full dynamic range allowed in the JEDEC DDRx SDRAM specifications, and refers to the overall scheme as Elastic Refresh, in that the refresh policy is stretched to fit the currently executing workload, such that the maximum benefit of the DRAM flexibility is realized.
Abstract: High density memory is becoming more important as many execution streams are consolidated onto single chip many-core processors. DRAM is ubiquitous as a main memory technology, but while DRAM’s per-chip density and frequency continue to scale, the time required to refresh its dynamic cells has grown at an alarming rate. This paper shows how currently-employed methods to schedule refresh operations are ineffective in mitigating the significant performance degradation caused by longer refresh times. Current approaches are deficient: they do not effectively exploit the flexibility of DRAMs to postpone refresh operations. This work proposes dynamically reconfigurable predictive mechanisms that exploit the full dynamic range allowed in the JEDEC DDRx SDRAM specifications. The proposed mechanisms are shown to mitigate much of the penalties seen with dense DRAM devices. We refer to the overall scheme as Elastic Refresh, in that the refresh policy is stretched to fit the currently executing workload, such that the maximum benefit of the DRAM flexibility is realized. We extend the GEMS on SIMICS tool-set to include Elastic Refresh. Simulations show the proposed solution provides a 10% average performance improvement over existing techniques across the entire SPEC CPU suite, and up to a 41% improvement for certain workloads.
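
The "flexibility" being exploited is the JEDEC provision that a controller may postpone a bounded number of refresh commands (up to eight in DDR3) and issue them later. A toy C sketch of that postpone-while-busy idea follows; the fixed queue-depth test stands in for the paper's dynamically reconfigurable predictors, and the function names are invented.

```c
#include <stdbool.h>
#include <stdio.h>

#define MAX_POSTPONED 8   /* JEDEC DDR3 bound on deferred REF commands */

typedef struct {
    int postponed;        /* REF commands owed to the rank */
} refresh_state_t;

/* Called once per tREFI interval; returns true if a REF is issued now. */
bool on_refresh_deadline(refresh_state_t *st, int request_queue_depth) {
    if (request_queue_depth > 0 && st->postponed < MAX_POSTPONED) {
        st->postponed++;  /* demand traffic is waiting: defer this REF */
        return false;
    }
    return true;          /* rank idle, or out of slack: refresh on time */
}

/* Called when the controller detects an idle rank: drain owed REFs. */
int on_idle(refresh_state_t *st) {
    int issue_now = st->postponed;
    st->postponed = 0;
    return issue_now;
}

int main(void) {
    refresh_state_t st = {0};
    for (int t = 0; t < 12; t++) {
        int depth = (t < 10) ? 4 : 0;   /* busy for 10 intervals, then idle */
        if (on_refresh_deadline(&st, depth))
            printf("t=%d: REF issued\n", t);
        else
            printf("t=%d: REF postponed (%d outstanding)\n", t, st.postponed);
    }
    printf("idle drain: %d REFs issued\n", on_idle(&st));
    return 0;
}
```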

116 citations

Journal ArticleDOI
TL;DR: An optimized block-floating-point (BFP) arithmetic is adopted in the accelerator for efficient inference of deep neural networks, improving energy and hardware efficiency by three times.
Abstract: Convolutional neural networks (CNNs) are widely used and have achieved great success in computer vision and speech processing applications. However, deploying the large-scale CNN model in the embedded system is subject to the constraints of computation and memory. In this paper, an optimized block-floating-point (BFP) arithmetic is adopted in our accelerator for efficient inference of deep neural networks. The feature maps and model parameters are represented in 16-bit and 8-bit formats, respectively, in the off-chip memory, which can reduce memory and off-chip bandwidth requirements by 50% and 75% compared to the 32-bit FP counterpart. The proposed 8-bit BFP arithmetic with optimized rounding and shifting-operation-based quantization schemes improves the energy and hardware efficiency by three times. One CNN model can be deployed in our accelerator without retraining at the cost of an accuracy loss of not more than 0.12%. The proposed reconfigurable accelerator with three parallelism dimensions, ping-pong off-chip DDR3 memory access, and an optimized on-chip buffer group is implemented on the Xilinx VC709 evaluation board. Our accelerator achieves a performance of 760.83 GOP/s and 82.88 GOP/s/W under a 200-MHz working frequency, significantly outperforming previous accelerators.
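
Block-floating-point stores one shared exponent per block of values and reduces each value to a small fixed-point mantissa, so multiplies become integer operations with shift-based scaling. The C sketch below quantises one block into signed 8-bit mantissas plus a shared exponent; the block size, rounding, and saturation choices are illustrative assumptions, not the accelerator's optimised scheme.

```c
#include <math.h>
#include <stdio.h>

#define BLOCK 8   /* assumed block size for the example */

/* Quantise one block of floats to signed 8-bit mantissas + shared exponent. */
void bfp_quantise(const float *in, signed char *mant, int *shared_exp) {
    float max_abs = 0.0f;
    for (int i = 0; i < BLOCK; i++)
        if (fabsf(in[i]) > max_abs) max_abs = fabsf(in[i]);

    /* Shared exponent chosen so the largest value fits in 7 mantissa bits. */
    int e = 0;
    frexpf(max_abs, &e);          /* max_abs = m * 2^e with 0.5 <= m < 1 */
    *shared_exp = e - 7;

    for (int i = 0; i < BLOCK; i++) {
        /* shift-and-round: in[i] / 2^shared_exp, rounded to nearest */
        long q = lroundf(ldexpf(in[i], -*shared_exp));
        if (q >  127) q =  127;   /* saturate into the signed 8-bit range */
        if (q < -128) q = -128;
        mant[i] = (signed char)q;
    }
}

int main(void) {
    const float x[BLOCK] = {0.91f, -0.33f, 0.02f, 0.5f, -0.75f, 0.1f, 0.0f, -0.06f};
    signed char m[BLOCK];
    int e;
    bfp_quantise(x, m, &e);
    printf("shared exponent: %d\n", e);
    for (int i = 0; i < BLOCK; i++)
        printf("%6.3f ~ %4d * 2^%d = %.4f\n", x[i], m[i], e, ldexpf((float)m[i], e));
    return 0;
}
```

Because every value in the block shares one exponent, dequantisation is a single shift, which is what makes the shifting-operation-based scheme cheap in hardware.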

116 citations


Network Information
Related Topics (5)
Cache: 59.1K papers, 976.6K citations (94% related)
Scalability: 50.9K papers, 931.6K citations (92% related)
Server: 79.5K papers, 1.4M citations (89% related)
Virtual machine: 43.9K papers, 718.3K citations (87% related)
Scheduling (computing): 78.6K papers, 1.3M citations (86% related)
Performance Metrics
No. of papers in the topic in previous years

Year    Papers
2023    33
2022    88
2021    629
2020    467
2019    461
2018    591