
Showing papers on "Memory management" published in 2010


Book
03 Dec 2010
TL;DR: This book grants the reader a comprehensive overview of the state-of-the-art in system-level memory management (data transfer and storage) related issues for complex data-dominated real-time signal and data processing applications.
Abstract: From the Publisher: This book grants the reader a comprehensive overview of the state-of-the-art in system-level memory management (data transfer and storage) related issues for complex data-dominated real-time signal and data processing applications. The authors introduce their own system-level data transfer and storage exploration methodology for data-dominated video applications. This methodology tackles the power and area reduction cost components in the architecture for this target domain, namely the system-level busses and the background memories. For the most critical tasks in the methodology, prototype tools have been developed to reduce the design time. To the researcher the book will serve as an excellent reference source, both for the overall description of the methodology and for the detailed descriptions of the system-level methodologies and synthesis techniques and algorithms. To the design engineers and CAD managers it offers an invaluable insight into the anticipated evolution of commercially available design tools as well as allowing them to utilize the book's concepts in their own research and development.

599 citations


Proceedings ArticleDOI
Howard S. David, Eugene Gorbatov, Ulf R. Hanebutte, Rahul Khanna, Christian Le
18 Aug 2010
TL;DR: This paper proposes a new approach for measuring memory power and demonstrates its applicability to a novel power limiting algorithm, achieving up to 40% lower performance impact than the state-of-the-art baseline across the power limiting range.
Abstract: The drive for higher performance and energy efficiency in data-centers has influenced trends toward increased power and cooling requirements in the facilities. Since enterprise servers rarely operate at their peak capacity, efficient power capping is deemed a critical component of modern enterprise computing environments. In this paper we propose a new power measurement and power limiting architecture for main memory. Specifically, we describe a new approach for measuring memory power and demonstrate its applicability to a novel power limiting algorithm. We implement and evaluate our approach in modern servers and show that we achieve up to 40% lower performance impact when compared to the state-of-the-art baseline across the power limiting range.

533 citations


Proceedings ArticleDOI
Anthony Nguyen, Nadathur Satish, Jatin Chhugani, Changkyu Kim, Pradeep Dubey
13 Nov 2010
TL;DR: A novel 3.5D-blocking algorithm that performs 2.5D-spatial and temporal blocking of the input grid into on-chip memory for both CPUs and GPUs, amenable to thread-level and data-level parallelism and scaling near-linearly with SIMD width and core count.
Abstract: Stencil computation sweeps over a spatial grid over multiple time steps to perform nearest-neighbor computations. The bandwidth-to-compute requirement for a large class of stencil kernels is very high, and their performance is bound by the available memory bandwidth. Since memory bandwidth grows more slowly than compute, the performance of stencil kernels will not scale with increasing compute density. We present a novel 3.5D-blocking algorithm that performs 2.5D-spatial and temporal blocking of the input grid into on-chip memory for both CPUs and GPUs. The resultant algorithm is amenable to both thread-level and data-level parallelism, and scales near-linearly with the SIMD width and multiple cores. Our performance numbers are faster than or comparable to state-of-the-art stencil implementations on CPUs and GPUs. Our implementation of the 7-point stencil is 1.5X faster on CPUs, and 1.8X faster on GPUs, for single-precision floating-point inputs than previously reported numbers. For Lattice Boltzmann methods, the corresponding speedup on CPUs is 2.1X.
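The abstract describes the blocking scheme only at a high level. As a rough illustration of the spatial half of the idea (tiling the x/y plane so the working set fits in on-chip memory while streaming along z), here is a minimal C sketch of a blocked 7-point stencil; the grid and tile sizes are invented, and the temporal-blocking dimension that makes the algorithm "3.5D" is omitted for brevity.

```c
/* Minimal sketch of spatial blocking for a 7-point stencil, illustrating
 * the idea behind 2.5D blocking: tile the x/y plane so a working set fits
 * in on-chip memory, and stream through z. Temporal blocking (the extra
 * dimension in "3.5D") is omitted. Grid and tile sizes are illustrative. */
#include <stdio.h>
#include <stdlib.h>

#define N  64         /* grid points per dimension (illustrative) */
#define TX 16         /* tile size in x */
#define TY 16         /* tile size in y */
#define IDX(x,y,z) ((z)*N*N + (y)*N + (x))

static void step(const float *in, float *out) {
    for (int yy = 1; yy < N-1; yy += TY)
    for (int xx = 1; xx < N-1; xx += TX)        /* 2D tiles in x/y ... */
    for (int z = 1; z < N-1; z++)               /* ... streamed along z */
    for (int y = yy; y < yy+TY && y < N-1; y++)
    for (int x = xx; x < xx+TX && x < N-1; x++)
        out[IDX(x,y,z)] = (in[IDX(x,y,z)] +
                           in[IDX(x-1,y,z)] + in[IDX(x+1,y,z)] +
                           in[IDX(x,y-1,z)] + in[IDX(x,y+1,z)] +
                           in[IDX(x,y,z-1)] + in[IDX(x,y,z+1)]) / 7.0f;
}

int main(void) {
    float *a = calloc((size_t)N*N*N, sizeof *a);
    float *b = calloc((size_t)N*N*N, sizeof *b);
    if (!a || !b) return 1;
    a[IDX(N/2, N/2, N/2)] = 1.0f;               /* point source */
    step(a, b);
    printf("center after one step: %f\n", b[IDX(N/2, N/2, N/2)]);
    free(a); free(b);
    return 0;
}
```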

299 citations


Journal ArticleDOI
TL;DR: In this article, the authors study the device requirements of a resistive cross-point memory array under the worst-case write and read operations and compare the effect of the memory cell resistance values and resistance ratio for determining the maximum array size.
Abstract: Cross-point memory architecture offers high device density, yet it suffers from substantial sneak path leakages, which result in large power dissipation and a small sensing margin. The parasitic resistance associated with the interconnects further degrades the output signal and imposes an additional limitation on the maximum allowable array size. In this paper, we study the device requirements of a resistive cross-point memory array under the worst-case write and read operations. We focus on the data pattern dependence of the memory array and compare the effect of the memory cell resistance values and resistance ratio for determining the maximum array size. The number of cells in the array can reach 10^6 with a signal swing > 50% of the reading voltage when Ron is beyond 3 MΩ and Roff/Ron is greater than 2. A large memory cell resistance value can further reduce the power consumption, obviate the need for a large Roff/Ron ratio, and avoid the inclusion of cell selection devices. The effect of the nonlinearity of the I-V characteristics of the memory cells is also investigated. The nonlinearity calls for a substantial tradeoff between the memory cell resistance values and the resistance ratio, and must be taken into consideration for the device design.

271 citations


Proceedings ArticleDOI
01 Dec 2010
TL;DR: In this article, a modified bottom electrode is proposed for the memory device to maintain the memory window and to endure resistive switching up to 10^10 cycles, greatly improving the performance of the HfOx-based bipolar resistive memory.
Abstract: The memory performance of the HfOx-based bipolar resistive memory, including switching speed and memory reliability, is greatly improved in this work. A record-high switching speed down to 300 ps is achieved. The cycling test sheds clear light on the wearing behavior of resistance states, and the correlation between the over-RESET phenomenon and the worn low-resistance state in the devices is discussed. A modified bottom electrode is proposed for the memory device to maintain the memory window and to endure resistive switching up to 10^10 cycles.

256 citations


Proceedings ArticleDOI
13 Jun 2010
TL;DR: This paper considers stereo matching using loopy belief propagation and makes large numbers of disparity levels as efficient as small ones: by hierarchically reducing the disparity search range, it solves the message updating problem in time linear in the number of pixels in the image while requiring only constant memory space.
Abstract: In this paper, we consider the problem of stereo matching using loopy belief propagation. Unlike previous methods which focus on the original spatial resolution, we hierarchically reduce the disparity search range. By fixing the number of disparity levels on the original resolution, our method solves the message updating problem in a time linear in the number of pixels contained in the image and requires only constant memory space. Specifically, for an 800 × 600 image with 300 disparities, our message updating method is about 30× faster (1.5 seconds) than the standard method, and requires only about 0.6% of the memory (9 MB). Also, our algorithm lends itself to a parallel implementation. Our GPU implementation (NVIDIA GeForce 8800 GTX) is about 10× faster than our CPU implementation. Given the trend toward higher-resolution images, stereo matching using belief propagation with a large number of disparity levels as efficiently as with a small one makes our method future-proof. In addition to the computational and memory advantages, our method is straightforward to implement.

235 citations


Proceedings ArticleDOI
19 Jun 2010
TL;DR: MMS as discussed by the authors is a robust architecture for efficiently incorporating MLC PCM devices in main memory, based on the observation that memory requirements vary across workloads and that systems are typically over-provisioned in terms of memory capacity.
Abstract: Phase Change Memory (PCM) is emerging as a scalable and power efficient technology to architect future main memory systems. The scalability of PCM is enhanced by the property that PCM devices can store multiple bits per cell. While such Multi-Level Cell (MLC) devices can offer high density, this benefit comes at the expense of increased read latency, which can cause significant performance degradation. This paper proposes Morphable Memory System (MMS), a robust architecture for efficiently incorporating MLC PCM devices in main memory. MMS is based on the observation that memory requirements vary across workloads, and that systems are typically over-provisioned in terms of memory capacity. So, during a phase of low memory usage, some of the MLC devices can be operated at fewer bits per cell to obtain lower latency. When the workload requires full memory capacity, these devices can be restored to high-density MLC operation to provide full main-memory capacity. We provide the runtime monitors, the hardware-OS interface, and the detailed mechanism for implementing MMS. Our evaluations on an 8-core 8GB MLC PCM-based system show that MMS provides, on average, low latency access for 95% of all memory requests, thereby improving overall system performance by 40%.
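As a toy illustration of the morphing policy the abstract describes, the C sketch below switches MLC PCM devices between a dense 2-bit/cell mode and a faster 1-bit/cell mode according to measured memory demand. Every name, size, and threshold here is hypothetical; the paper's actual runtime monitors and hardware-OS interface are far richer.

```c
/* Illustrative sketch of the Morphable Memory System policy: when
 * measured demand is low, run some MLC PCM devices at 1 bit/cell for
 * lower read latency; when demand rises, restore 2 bits/cell for full
 * capacity. Sizes and thresholds are invented for illustration. */
#include <stdio.h>

#define NDEV   8
#define GB_MLC 2   /* 2 bits/cell: 2 GB/device (illustrative) */
#define GB_SLC 1   /* 1 bit/cell:  1 GB/device, lower read latency */

static int bits[NDEV];  /* bits per cell for each device: 1 or 2 */

static int capacity(void) {
    int gb = 0;
    for (int i = 0; i < NDEV; i++) gb += (bits[i] == 2) ? GB_MLC : GB_SLC;
    return gb;
}

/* Morph devices one at a time, preferring the low-latency 1-bit mode
 * whenever the remaining capacity still covers the measured demand. */
static void morph(int demand_gb) {
    for (int i = 0; i < NDEV; i++) bits[i] = 2;          /* start dense */
    for (int i = 0; i < NDEV; i++)
        if (capacity() - (GB_MLC - GB_SLC) >= demand_gb)
            bits[i] = 1;                                 /* safe to go fast */
}

int main(void) {
    int demands[] = { 4, 9, 16, 6 };   /* hypothetical usage samples, GB */
    for (int k = 0; k < 4; k++) {
        morph(demands[k]);
        int fast = 0;
        for (int i = 0; i < NDEV; i++) fast += (bits[i] == 1);
        printf("demand %2d GB -> capacity %2d GB, %d/%d devices fast\n",
               demands[k], capacity(), fast, NDEV);
    }
    return 0;
}
```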

211 citations


Patent
08 Mar 2010
TL;DR: In this paper, a coloring information storage unit that stores coloring information generated based on a data characteristic of write target data to be written into at least one of the nonvolatile semiconductor memories and the volatile semiconductor memory was proposed.
Abstract: A memory management device of an example of the invention controls writing into and reading from a main memory including a nonvolatile semiconductor memory and a volatile semiconductor memory in response to writing requests and reading requests from a processor. The memory management device includes a coloring information storage unit that stores coloring information generated based on a data characteristic of write target data to be written into at least one of the nonvolatile semiconductor memory and the volatile semiconductor memory, and a writing management unit that references the coloring information to determine a region into which the write target data is written, selected from the nonvolatile semiconductor memory and the volatile semiconductor memory.

205 citations


Proceedings ArticleDOI
04 Oct 2010
TL;DR: DieHarder as mentioned in this paper analyzes a range of widely deployed memory allocators, including those used by Windows, Linux, FreeBSD and OpenBSD, and shows that they remain vulnerable to heap-based attacks.
Abstract: Heap-based attacks depend on a combination of memory management errors and an exploitable memory allocator. Many allocators include ad hoc countermeasures against particular exploits, but their effectiveness against future exploits has been uncertain. This paper presents the first formal treatment of the impact of allocator design on security. It analyzes a range of widely-deployed memory allocators, including those used by Windows, Linux, FreeBSD and OpenBSD, and shows that they remain vulnerable to attack. It then presents DieHarder, a new allocator whose design was guided by this analysis. DieHarder provides the highest degree of security from heap-based attacks of any practical allocator of which we are aware, while imposing modest performance overhead. In particular, the Firefox web browser runs as fast with DieHarder as with the Linux allocator.
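For a flavor of one ingredient of allocator security in this vein, the toy allocator below places objects at uniformly random free slots in an over-provisioned region, making heap addresses hard for an attacker to predict. This is only a sketch of the randomized-placement idea, not DieHarder's actual design, which also randomizes reuse, separates metadata from the heap, and more.

```c
/* Toy sketch of randomized placement in an over-provisioned heap: pick a
 * uniformly random free slot instead of the "next" one, so an overflow
 * is unlikely to land on predictable live data. Illustration only. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define SLOTS     64        /* provisioned slots, more than ever live */
#define SLOT_SIZE 32

static unsigned char heap[SLOTS * SLOT_SIZE];
static int used[SLOTS];

static void *rand_alloc(void) {
    for (int tries = 0; tries < 8 * SLOTS; tries++) {
        int s = rand() % SLOTS;                 /* random candidate slot */
        if (!used[s]) { used[s] = 1; return heap + s * SLOT_SIZE; }
    }
    return NULL;                                /* heap effectively full */
}

static void rand_free(void *p) {
    used[((unsigned char *)p - heap) / SLOT_SIZE] = 0;
}

int main(void) {
    srand((unsigned)time(NULL));
    void *a = rand_alloc(), *b = rand_alloc();
    if (!a || !b) return 1;
    printf("a at slot %ld, b at slot %ld\n",
           (long)(((unsigned char *)a - heap) / SLOT_SIZE),
           (long)(((unsigned char *)b - heap) / SLOT_SIZE));
    rand_free(a); rand_free(b);
    return 0;
}
```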

199 citations


Proceedings ArticleDOI
13 Nov 2010
TL;DR: This paper introduces another layer of address translation coupled with an on-chip memory controller that can dynamically migrate data between on-package and off-package memory, either in hardware or with operating system assistance depending on the migration granularity.
Abstract: System-in-Package (SiP) and 3D integration are promising technologies to bring more memory onto a microprocessor package to mitigate the "memory wall" problem. In this paper, instead of using them to build caches, we study a heterogeneous main memory using both on- and off-package memories, providing both fast, high-bandwidth on-package accesses and expandable, low-cost commodity off-package memory capacity. We introduce another layer of address translation coupled with an on-chip memory controller that can dynamically migrate data between on-package and off-package memory, either in hardware or with operating system assistance depending on the migration granularity. Our experimental results demonstrate that such a design can achieve, on average, 83% of the effectiveness of the ideal case where all memory can be placed in high-speed on-package memory for our simulated benchmarks.
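A minimal sketch of the extra translation layer might look as follows: a remap table maps each data page to either a fast on-package frame or a slow off-package frame, and a hot page is migrated up by copying it and repointing its entry. The capacities and the hotness threshold are invented for illustration.

```c
/* Sketch of a second-level translation table for a heterogeneous main
 * memory: entries point at on- or off-package frames, and hot pages are
 * migrated by copying and repointing. Sizes and policy are invented. */
#include <stdio.h>
#include <string.h>

#define PAGE       4096
#define ON_FRAMES  2          /* tiny on-package memory */
#define OFF_FRAMES 8

static char on_mem [ON_FRAMES ][PAGE];
static char off_mem[OFF_FRAMES][PAGE];

typedef struct { int on_package; int frame; int accesses; } remap_entry;
static remap_entry table[OFF_FRAMES];   /* one entry per data page */
static int on_used[ON_FRAMES];

static char *resolve(int page) {        /* the extra translation step */
    remap_entry *e = &table[page];
    e->accesses++;
    return e->on_package ? on_mem[e->frame] : off_mem[e->frame];
}

static void maybe_migrate(int page) {
    remap_entry *e = &table[page];
    if (e->on_package || e->accesses < 3) return;   /* invented threshold */
    for (int f = 0; f < ON_FRAMES; f++) {
        if (!on_used[f]) {
            memcpy(on_mem[f], off_mem[e->frame], PAGE);  /* copy up */
            on_used[f] = 1;
            e->on_package = 1;
            e->frame = f;                        /* repoint the mapping */
            printf("page %d migrated on-package (frame %d)\n", page, f);
            return;
        }
    }
}

int main(void) {
    for (int p = 0; p < OFF_FRAMES; p++)
        table[p] = (remap_entry){ 0, p, 0 };
    for (int i = 0; i < 5; i++) { resolve(3); maybe_migrate(3); }
    return 0;
}
```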

147 citations


Proceedings ArticleDOI
19 Jun 2010
TL;DR: This paper demonstrates that performance limiting effects of highly-threaded architectures can be overcome, and shows that through awareness of the physical main memory layout and by focusing on writes, both read and write average latency can be shortened, memory power reduced, and overall system performance improved.
Abstract: In computer architecture, caches have primarily been viewed as a means to hide memory latency from the CPU. Cache policies have focused on anticipating the CPU's data needs, and are mostly oblivious to the main memory. In this paper, we demonstrate that the era of many-core architectures has created new main memory bottlenecks, and mandates a new approach: coordination of cache policy with main memory characteristics. Using the cache for memory optimization purposes, we propose a Virtual Write Queue which dramatically expands the memory controller's visibility of processor behavior, at low implementation overhead. Through memory-centric modification of existing policies, such as scheduled writebacks, this paper demonstrates that performance limiting effects of highly-threaded architectures can be overcome. We show that through awareness of the physical main memory layout and by focusing on writes, both read and write average latency can be shortened, memory power reduced, and overall system performance improved. Through full-system cycle-accurate simulations of SPEC cpu2006, we demonstrate that the proposed Virtual Write Queue achieves an average 10.9% system-level throughput improvement on memory-intensive workloads, along with an overall reduction of 8.7% in memory power across the whole suite.

Patent
Ho-Fan Kang
26 Mar 2010
TL;DR: In this article, the authors present a storage subsystem comprising a nonvolatile solid-state memory array and a controller, which is configured to execute, in the memory array, memory commands from the queue in a sequence that is based at least in part on a throttling ratio provided by the system operation module.
Abstract: Embodiments of the invention are directed to a storage subsystem comprising a non-volatile solid-state memory array and a controller. In one embodiment, the controller includes a system operation module configured to manage system memory operations and a queue configured to receive memory commands from a host system and the system operation module. The controller is configured to execute, in the memory array, memory commands from the queue in a sequence that is based at least in part on a throttling ratio provided by the system operation module.

Patent
16 Jul 2010
TL;DR: In this article, a method for caching in a processor system having virtual memory is presented, the method comprising: monitoring slow memory in the processor system to determine frequently accessed pages; copying each frequently accessed page from slow memory to a location in fast memory; and updating virtual address page tables to reflect the new location.
Abstract: In a first embodiment of the present invention, a method for caching in a processor system having virtual memory is provided, the method comprising: monitoring slow memory in the processor system to determine frequently accessed pages; and, for a frequently accessed page in slow memory: copying the frequently accessed page from slow memory to a location in fast memory; and updating virtual address page tables to reflect the location of the frequently accessed page in fast memory.

Proceedings ArticleDOI
28 Mar 2010
TL;DR: This research addresses the cost of measuring stack distance, which requires the number of unique memory objects to be counted between successive accesses to the same data object and therefore involves complex and inefficient data collection.
Abstract: Efficient execution on modern architectures requires good data locality, which can be measured by the powerful stack distance abstraction. Based on this abstraction, the miss rate for LRU caches of any size can be predicted. However, measuring stack distance requires the number of unique memory objects to be counted between successive accesses to the same data object, which requires complex and inefficient data collection.
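Concretely, the stack distance of an access is the number of distinct addresses touched since the previous access to the same address; a distance of d means a hit in any LRU cache holding more than d lines, which is why the distance histogram predicts miss rates for every cache size at once. The brute-force measurement below, over a toy trace, shows why naive collection is expensive.

```c
/* Brute-force stack-distance measurement over a toy address trace: for
 * each access, find the previous access to the same address and count
 * the distinct addresses in between. This is exactly the costly
 * collection the abstract refers to; real tools must do better. */
#include <stdio.h>

#define TRACE_LEN 8

int main(void) {
    int trace[TRACE_LEN] = { 1, 2, 3, 2, 1, 4, 1, 2 };  /* toy addresses */
    for (int i = 0; i < TRACE_LEN; i++) {
        int prev = -1;
        for (int j = i - 1; j >= 0; j--)
            if (trace[j] == trace[i]) { prev = j; break; }
        if (prev < 0) { printf("addr %d: cold miss\n", trace[i]); continue; }
        /* count distinct addresses strictly between prev and i */
        int distinct = 0;
        for (int j = prev + 1; j < i; j++) {
            int seen = 0;
            for (int k = prev + 1; k < j; k++)
                if (trace[k] == trace[j]) { seen = 1; break; }
            if (!seen) distinct++;
        }
        printf("addr %d: stack distance %d\n", trace[i], distinct);
    }
    return 0;
}
```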

Book ChapterDOI
01 Apr 2010
TL;DR: A hybrid architecture for the NAND flash memory storage, of which the log region is implemented using phase change random access memory (PRAM), which has the following advantages: the PRAM log region allows in-place updating so that it significantly improves the usage efficiency of log pages by eliminating out-of-date log records.
Abstract: In recent years, many systems have employed NAND flash memory as storage devices because of its advantages of higher performance (compared to the traditional hard disk drive), high density, random access, increasing capacity, and falling cost. On the other hand, the performance of NAND flash memory is limited by its “erase-before-write” requirement. Log-based structures have been used to alleviate this problem by writing updated data to the clean space. Prior log-based methods, however, cannot avoid excessive erase operations when there are frequent updates, which quickly consume free pages, especially when some data are updated repeatedly. In this paper, we propose a hybrid architecture for the NAND flash memory storage, of which the log region is implemented using phase change random access memory (PRAM). Compared to traditional log-based architectures, it has the following advantages: (1) the PRAM log region allows in-place updating so that it significantly improves the usage efficiency of log pages by eliminating out-of-date log records; (2) it greatly reduces the traffic of reading from the NAND flash memory storage since the size of logs loaded for the read operation is decreased; (3) the energy consumption of the storage system is reduced as the overhead of writing and reading log data is decreased with the PRAM log region; (4) the lifetime of NAND flash memory is increased because the number of erase operations is reduced. To facilitate the PRAM log region, we propose several management policies. The simulation results show that our proposed methods can substantially improve the performance, energy consumption, and lifetime of the NAND flash memory storage.

Book ChapterDOI
15 Jul 2010
TL;DR: A class of relaxed memory models, defined in Coq, parameterised by the chosen permitted local reorderings of reads and writes, and the visibility of inter- and intra-processor communications through memory is presented.
Abstract: We present a class of relaxed memory models, defined in Coq, parameterised by the chosen permitted local reorderings of reads and writes, and the visibility of inter- and intra-processor communications through memory (e.g. store atomicity relaxation). We prove results on the required behaviour and placement of memory fences to restore a given model (such as Sequential Consistency) from a weaker one. Based on this class of models we develop a tool, diy, that systematically and automatically generates and runs litmus tests to determine properties of processor implementations. We detail the results of our experiments on Power and the model we base on them. This work identified a rare implementation error in Power 5 memory barriers (for which IBM is providing a workaround); our results also suggest that Power 6 does not suffer from this problem.
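A hand-written analogue of the kind of litmus test diy generates is the classic "store buffering" shape, sketched in C11 below: without fences, both threads may read 0 because stores sit in store buffers; with a sequentially consistent fence between each store and load, that outcome must disappear. Compile with -pthread; the observable outcomes depend on the machine it runs on.

```c
/* Store-buffering (SB) litmus test in C11: run both threads many times
 * and count the relaxed outcome r0 == 0 && r1 == 0. With the seq_cst
 * fences in place it must never appear; comment them out to hunt for it. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static atomic_int x, y;
static int r0, r1;
static pthread_barrier_t bar;

static void *t0(void *arg) {
    (void)arg;
    pthread_barrier_wait(&bar);                     /* start together */
    atomic_store_explicit(&x, 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);      /* the fence under test */
    r0 = atomic_load_explicit(&y, memory_order_relaxed);
    return NULL;
}

static void *t1(void *arg) {
    (void)arg;
    pthread_barrier_wait(&bar);
    atomic_store_explicit(&y, 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);
    r1 = atomic_load_explicit(&x, memory_order_relaxed);
    return NULL;
}

int main(void) {
    int forbidden = 0;
    pthread_barrier_init(&bar, NULL, 2);
    for (int i = 0; i < 100000; i++) {
        pthread_t a, b;
        atomic_store(&x, 0);
        atomic_store(&y, 0);
        pthread_create(&a, NULL, t0, NULL);
        pthread_create(&b, NULL, t1, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        if (r0 == 0 && r1 == 0) forbidden++;
    }
    printf("r0==0 && r1==0 observed %d times\n", forbidden);
    return 0;
}
```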

Proceedings ArticleDOI
28 Mar 2010
TL;DR: It is demonstrated that NUMA can indeed be a significant problem for scientific applications, showing that it can mean the difference between an application scaling perfectly and failing to scale at all.
Abstract: Until recently, most high-end scientific applications have been immune to performance problems caused by Non-Uniform Memory Access (NUMA). However, current trends in microprocessor design are pushing NUMA to smaller and smaller scales. This paper examines the current state of NUMA and makes several contributions. First, we summarize the performance problems that NUMA can present for multi-threaded applications and describe methods of addressing them. Second, we demonstrate that NUMA can indeed be a significant problem for scientific applications, showing that it can mean the difference between an application scaling perfectly and failing to scale at all. Third, we describe, in increasing order of usefulness, three methods of using hardware performance counters to aid in finding NUMA-related problems. Finally, we introduce Memphis, a data-centric toolset that uses Instruction Based Sampling to help pinpoint problematic memory accesses, and demonstrate how we used it to improve the performance of several production-level codes - HYCOM, XGC1 and CAM - by 13%, 23% and 24% respectively.
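One common remedy for the problems the paper describes is to exploit the first-touch placement policy: on Linux, a page is typically allocated on the NUMA node of the thread that first writes it, so initializing data with the same thread partitioning as the later compute loop keeps accesses node-local. The OpenMP sketch below illustrates this; it assumes a first-touch policy and static scheduling, and is not taken from the paper. Compile with -fopenmp.

```c
/* First-touch NUMA initialization sketch: the parallel init loop uses
 * the same static partitioning as the compute loop, so each thread's
 * pages are placed on its own node (assuming a first-touch policy).
 * A serial init would instead place every page on one node. */
#include <stdio.h>
#include <stdlib.h>

#define N (1 << 24)

int main(void) {
    double *a = malloc(N * sizeof *a);
    if (!a) return 1;

    /* parallel first-touch: pages land near the threads that use them */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < N; i++)
        a[i] = 0.0;

    double sum = 0.0;
    #pragma omp parallel for schedule(static) reduction(+:sum)
    for (long i = 0; i < N; i++) {
        a[i] += 1.0;            /* same partitioning: local accesses */
        sum += a[i];
    }
    printf("sum = %f\n", sum);
    free(a);
    return 0;
}
```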

Proceedings ArticleDOI
13 Jun 2010
TL;DR: This paper targets embedded Chip Multiprocessors with Scratch Pad Memory (SPM) and non-volatile main memory and introduces data migration and recomputation techniques to reduce the number of write activities on non-volatile memories.
Abstract: Recent advances in circuit and process technologies have pushed non-volatile memory technologies into a new era. These technologies exhibit appealing properties such as low power consumption, non-volatility, shock-resistivity, and high density. However, there are challenges that need answers on the road to applying non-volatile memories as main memory in computer systems. First, non-volatile memories have a limited number of write/erase cycles compared with DRAM memory. Second, write activities on non-volatile memory are more expensive than on DRAM memory in terms of energy consumption and access latency. Both challenges benefit from a reduction of the write activities on the non-volatile memory. In this paper, we target embedded Chip Multiprocessors (CMPs) with Scratch Pad Memory (SPM) and non-volatile main memory. We introduce data migration and recomputation techniques to reduce the number of write activities on non-volatile memories. Experimental results show that the proposed methods can reduce the number of writes by 59.41% on average, which means that the non-volatile memory can last 2.8 times as long as before. Meanwhile, the finish time of programs is reduced by 31.81% on average.

Proceedings ArticleDOI
30 Nov 2010
TL;DR: This work benchmarks two virtual machine monitors, OpenVZ and KVM, specifically focusing on I/O throughput, and concludes that KVM's I/O performance is suboptimal, potentially due to memory management problems in the hypervisor.
Abstract: The benefits of virtualization are typically considered to be server consolidation (leading to the reduction of power and cooling costs), increased availability, isolation, ease of operating system deployment and simplified disaster recovery. High Performance Computing (HPC) environments pose one main challenge for virtualization: the need to maximize throughput with minimal loss of CPU and I/O efficiency. However, virtualization is usually evaluated in terms of enterprise workloads and assumes that servers are underutilized and can be consolidated. In this paper we evaluate the performance of several virtual machine technologies in the context of HPC. A fundamental requirement of current high performance workloads is that both CPU and I/O must be highly efficient for tasks such as MPI jobs. This work benchmarks two virtual machine monitors, OpenVZ and KVM, specifically focusing on I/O throughput since CPU efficiency has been extensively studied [1]. OpenVZ offers near native I/O performance. Amazon's EC2 "Cluster Compute Node" product is also considered for comparative purposes and performs quite well. The EC2 "Cluster Compute Node" product utilizes the Xen hypervisor in hvm mode and 10 Gbit/s Ethernet for high throughput communication. Therefore, we also briefly studied Xen on our hardware platform (in hvm mode) to determine if there are still areas of improvement in KVM that allow EC2 to outperform KVM (with InfiniBand host channel adapters operating at 20 Gbit/s) in MPI benchmarks. We conclude that KVM's I/O performance is suboptimal, potentially due to memory management problems in the hypervisor. Amazon's EC2 service is promising, although further investigation is necessary to understand the effects of network based storage on I/O throughput in compute nodes. Amazon's offering may be attractive for users searching for "InfiniBand-like" performance without the upfront investment required to build an InfiniBand cluster, or users wishing to dynamically expand their cluster during periods of high demand.

Proceedings Article
23 Jun 2010
TL;DR: The first comprehensive description of the usage of the discard command on a real flash device is presented and two enhancements to provide fast online garbage collection of free VM pages are shown.
Abstract: With the decreasing price of flash memory, systems will increasingly use solid-state storage for virtual-memory paging rather than disks. FlashVM is a system architecture and a core virtual memory subsystem built in the Linux kernel that uses dedicated flash for paging. FlashVM focuses on three major design goals for memory management on flash: high performance, reduced flash wear out for improved reliability, and efficient garbage collection. FlashVM modifies the paging system along code paths for allocating, reading and writing back pages to optimize for the performance characteristics of flash. It also reduces the number of page writes using zero-page sharing and page sampling that prioritize the eviction of clean pages. In addition, we present the first comprehensive description of the usage of the discard command on a real flash device and show two enhancements to provide fast online garbage collection of free VM pages. Overall, the FlashVM system provides up to 94% reduction in application execution time and is four times more responsive than swapping to disk. Furthermore, it improves reliability by writing up to 93% fewer pages than Linux, and provides a garbage collection mechanism that is up to 10 times faster than Linux with discard support.
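As an illustration of the zero-page-sharing optimization mentioned in the abstract, the sketch below checks whether a page is all zeroes before paging it out and, if so, records a flag instead of issuing a 4 KB flash write, saving both write bandwidth and flash wear. The metadata layout and accounting here are invented.

```c
/* Zero-page sharing sketch: detect all-zero pages at swap-out time and
 * skip the flash write entirely, noting a "zero" flag in the page's
 * swap metadata instead. Illustration only; not FlashVM's actual code. */
#include <stdio.h>
#include <string.h>

#define PAGE 4096

static int page_is_zero(const unsigned char *p) {
    /* p[0]==0 and p[i]==p[i-1] for all i implies the page is all zero */
    return p[0] == 0 && memcmp(p, p + 1, PAGE - 1) == 0;
}

static int pages_written;       /* stand-in for real swap-out accounting */

static void swap_out(const unsigned char *page, int *zero_flag) {
    if (page_is_zero(page)) {
        *zero_flag = 1;         /* no flash write at all */
    } else {
        *zero_flag = 0;
        pages_written++;        /* would issue the flash write here */
    }
}

int main(void) {
    static unsigned char zero_page[PAGE], data_page[PAGE];
    memset(data_page, 0xAB, PAGE);
    int z;
    swap_out(zero_page, &z); printf("zero page shared: %d\n", z);
    swap_out(data_page, &z); printf("data page shared: %d\n", z);
    printf("flash writes issued: %d\n", pages_written);
    return 0;
}
```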

Proceedings ArticleDOI
03 May 2010
TL;DR: An effective management scheme for heterogeneous SLC and MLC regions of combined flash memory is proposed and a design technique which is able to determine the optimal proportion between the two regions that maximizes performance and energy reduction is proposed, guaranteeing the lifespan constraint.
Abstract: Flash memory-based non-volatile cache (NVC) is emerging as an effective solution for enhancing both the performance and the energy consumption of storage systems. In order to attain significant performance and energy gains from NVC, it would be better to use multi-level-cell (MLC) flash memories, since they can provide a large NVC capacity at low cost. However, the number of available program/erase cycles of MLC flash memory is smaller than that of single-level-cell (SLC) flash memory, which limits the lifespan of an NVC. In order to overcome this limitation, SLC/MLC combined flash memory is a promising solution for use in NVC. This paper proposes an effective management scheme for heterogeneous SLC and MLC regions of combined flash memory. It also proposes a design technique which is able to determine the optimal proportion between the two regions that maximizes performance and energy reduction while guaranteeing the lifespan constraint. We show experimentally how performance, lifespan, and energy consumption of the NVC-embedded hard disk change depending upon the configuration of the combined flash memory. We also show the superiority of the proposed NVC management policy in comparison to alternative policies.

Proceedings ArticleDOI
20 Sep 2010
TL;DR: A novel VM migration approach, named Migration with Data Deduplication (MDD), which introduces data deduplication into migration, which utilizes the self-similarity of run-time memory image, uses hash based fingerprints to find identical and similar memory pages, and employs Run Length Encode (RLE) to eliminate redundant memory data during migration.
Abstract: As one of the key characteristics of virtualization, live virtual machine (VM) migration provides great benefits for load balancing, power management, fault tolerance and other system maintenance issues in modern clusters and data centers. Although Pre-Copy is a widely used migration algorithm, it transfers a lot of duplicated memory image data from source to destination, which results in longer migration time and downtime. This paper proposes a novel VM migration approach, named Migration with Data Deduplication (MDD), which introduces data deduplication into migration. MDD utilizes the self-similarity of the run-time memory image, uses hash based fingerprints to find identical and similar memory pages, and employs Run Length Encode (RLE) to eliminate redundant memory data during migration. Experiments demonstrate that compared with Xen's default Pre-Copy migration algorithm, MDD can reduce total data transferred during migration by 56.60%, total migration time by 34.93%, and downtime by 26.16% on average.
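The fingerprinting step might be sketched as follows: hash each page, and if an identical fingerprint has already been transferred, send a short reference instead of the full page. The hash function (FNV-1a), table size, and page contents below are illustrative only; the real system additionally finds similar (not just identical) pages and RLE-encodes their differences.

```c
/* Fingerprint-based page deduplication sketch for VM migration: hash
 * each page and send an 8-byte reference when an identical fingerprint
 * was already transferred. Illustration only, not MDD's actual code. */
#include <stdio.h>
#include <string.h>
#include <stdint.h>

#define PAGE   4096
#define NPAGES 8
#define TBL    1024

static uint64_t fnv1a(const unsigned char *p, size_t n) {
    uint64_t h = 1469598103934665603ULL;        /* FNV-1a 64-bit */
    for (size_t i = 0; i < n; i++) { h ^= p[i]; h *= 1099511628211ULL; }
    return h;
}

static uint64_t seen[TBL];                      /* fingerprints sent so far */

int main(void) {
    static unsigned char ram[NPAGES][PAGE];     /* pages 0,1,3..7 identical */
    memset(ram[2], 0x55, PAGE);                 /* page 2 differs */
    long bytes_sent = 0;
    for (int p = 0; p < NPAGES; p++) {
        uint64_t h = fnv1a(ram[p], PAGE);
        int slot = (int)(h % TBL), dup = 0;
        while (seen[slot]) {                    /* linear probing */
            if (seen[slot] == h) { dup = 1; break; }
            slot = (slot + 1) % TBL;
        }
        if (dup) {
            bytes_sent += 8;                    /* send just the reference */
        } else {
            seen[slot] = h;
            bytes_sent += PAGE;                 /* send the full page */
        }
    }
    printf("transferred %ld bytes instead of %d\n", bytes_sent, NPAGES * PAGE);
    return 0;
}
```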

Proceedings ArticleDOI
25 Jun 2010
TL;DR: It is demonstrated that dynamic scheduling and efficient memory management are critical problems in achieving high efficiency on irregular workloads, and task-donation is the preferred choice because it performs comparably to task-stealing while incurring less memory overhead.
Abstract: We explore software mechanisms for managing irregular tasks on graphics processing units (GPUs). We demonstrate that dynamic scheduling and efficient memory management are critical problems in achieving high efficiency on irregular workloads. We experiment with several task-management techniques, ranging from the use of a single monolithic task queue to distributed queuing with task stealing and donation. On irregular workloads, we show that both centralized and distributed queues have more than 100 times as much idle time as our task-stealing and -donation queues. Our preferred choice is task-donation because it performs comparably to task-stealing while incurring less memory overhead. To help in this analysis, we use an artificial task-management system that monitors performance and memory usage to quantify the impact of these different techniques. We validate our results by implementing a Reyes renderer, with its irregular split-and-dice workload, that is able to achieve real-time framerates on a single GPU.

Proceedings ArticleDOI
04 Dec 2010
TL;DR: This work proposes dynamically reconfigurable predictive mechanisms that exploit the full dynamic range allowed in the JEDEC DDRx SDRAM specifications, and refers to the overall scheme as Elastic Refresh, in that the refresh policy is stretched to fit the currently executing workload, such that the maximum benefit of the DRAM flexibility is realized.
Abstract: High density memory is becoming more important as many execution streams are consolidated onto single chip many-core processors. DRAM is ubiquitous as a main memory technology, but while DRAM's per-chip density and frequency continue to scale, the time required to refresh its dynamic cells has grown at an alarming rate. This paper shows how currently-employed methods to schedule refresh operations are ineffective in mitigating the significant performance degradation caused by longer refresh times. Current approaches are deficient: they do not effectively exploit the flexibility of DRAMs to postpone refresh operations. This work proposes dynamically reconfigurable predictive mechanisms that exploit the full dynamic range allowed in the JEDEC DDRx SDRAM specifications. The proposed mechanisms are shown to mitigate much of the penalty seen with dense DRAM devices. We refer to the overall scheme as Elastic Refresh, in that the refresh policy is stretched to fit the currently executing workload, such that the maximum benefit of the DRAM flexibility is realized. We extend the GEMS on SIMICS tool-set to include Elastic Refresh. Simulations show the proposed solution provides a 10% average performance improvement over existing techniques across the entire SPEC CPU suite, and up to a 41% improvement for certain workloads.
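The flexibility being exploited is that DDRx devices may postpone a bounded number of regular refresh commands (up to 8) and catch up later. The toy model below defers refresh while demand requests are queued and drains the backlog during idle intervals, counting how many requests end up stalled behind forced refreshes; the traffic trace and tick granularity are invented, and the paper's predictive policy is considerably more sophisticated.

```c
/* Toy model of refresh postponement: one refresh comes due per interval;
 * defer it while demand requests are queued, drain the backlog when the
 * bank is idle, and force refreshes only at the postponement limit. */
#include <stdio.h>

#define MAX_POSTPONED 8   /* DDRx postponement limit */

int main(void) {
    /* pending demand requests per refresh interval (hypothetical trace) */
    int demand[16] = { 3, 4, 2, 5, 1, 2, 3, 4, 2, 6, 4, 1, 0, 1, 0, 0 };
    int owed = 0, stalls = 0;

    for (int t = 0; t < 16; t++) {
        owed++;                               /* one refresh due per tREFI */
        if (demand[t] == 0 || owed >= MAX_POSTPONED) {
            if (demand[t] > 0)
                stalls += demand[t];          /* forced refresh blocks reads */
            while (owed > 0 && (demand[t] == 0 || owed >= MAX_POSTPONED))
                owed--;                       /* issue REF command(s) */
        }
    }
    printf("demand requests stalled behind refresh: %d\n", stalls);
    return 0;
}
```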

Proceedings ArticleDOI
04 Dec 2010
TL;DR: This paper proposes a scalable approach to data-dependence profiling that addresses both runtime and memory overhead in a single framework, called SD3, and reduces the runtime overhead by parallelizing the dependence profiling step itself and compress memory accesses that exhibit stride patterns and compute data dependences directly in a compressed format.
Abstract: As multicore processors are deployed in mainstream computing, the need for software tools to help parallelize programs is increasing dramatically. Data-dependence profiling is an important technique to exploit parallelism in programs. More specifically, manual or automatic parallelization can use the outcomes of data-dependence profiling to guide where to parallelize in a program. However, state-of-the-art data-dependence profiling techniques are not scalable as they suffer from two major issues when profiling large and long-running applications: (1) runtime overhead and (2) memory overhead. Existing data-dependence profilers are either unable to profile large-scale applications or only report very limited information. In this paper, we propose a scalable approach to data-dependence profiling that addresses both runtime and memory overhead in a single framework. Our technique, called SD3, reduces the runtime overhead by parallelizing the dependence profiling step itself. To reduce the memory overhead, we compress memory accesses that exhibit stride patterns and compute data dependences directly in a compressed format. We demonstrate that SD3 reduces the runtime overhead when profiling SPEC 2006 by a factor of 4.1X and 9.7X on eight cores and 32 cores, respectively. For the memory overhead, we successfully profile SPEC 2006 with the reference input, while the previous approaches fail even with the train input. In some cases, we observe more than a 20X improvement in memory consumption and a 16X speedup in profiling time when 32 cores are used.
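The stride compression at the heart of SD3's memory savings can be sketched simply: instead of recording every address, collapse the reference stream into (base, stride, count) triples and reason about dependences in that compressed form. The greedy run detection and demo trace below are illustrative, not the paper's exact algorithm.

```c
/* Stride compression sketch: collapse an address stream into
 * (base, stride, count) runs via greedy run detection, so a long
 * strided loop costs one record instead of millions. */
#include <stdio.h>
#include <stdint.h>

typedef struct { uint64_t base; int64_t stride; int count; } stride_run;

static int compress(const uint64_t *addr, int n, stride_run *out) {
    int runs = 0, i = 0;
    while (i < n) {
        stride_run r = { addr[i], 0, 1 };
        if (i + 1 < n) {
            r.stride = (int64_t)(addr[i + 1] - addr[i]);
            while (i + r.count < n &&
                   (int64_t)(addr[i + r.count] - addr[i + r.count - 1])
                       == r.stride)
                r.count++;                   /* extend the current run */
        }
        out[runs++] = r;
        i += r.count;
    }
    return runs;
}

int main(void) {
    /* e.g. an array walked by 8 bytes, then scattered, then by 4 bytes */
    uint64_t trace[] = { 1000, 1008, 1016, 1024, 5000, 7000, 7004, 7008 };
    stride_run runs[8];
    int n = compress(trace, 8, runs);
    for (int k = 0; k < n; k++)
        printf("base=%llu stride=%lld count=%d\n",
               (unsigned long long)runs[k].base,
               (long long)runs[k].stride, runs[k].count);
    return 0;
}
```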

Patent
12 Nov 2010
TL;DR: In this paper, a block consolidation utility is executed to eliminate at least some of the partially-filled blocks by consolidating the file fragments into a fewer number of the memory blocks.
Abstract: A method of maintaining a solid-state drive so that free space within memory blocks of the drive becomes free usable space to the drive. The drive comprises cells organized in pages that are organized in memory blocks in which at least user files are stored. A defragmentation utility is executed to cause at least some of the memory blocks that are partially filled with data and contain file fragments to be combined or aligned, and to cause at least some of the memory blocks that contain only invalid data to be combined or aligned. A block consolidation utility is then executed to eliminate at least some of the partially-filled blocks by consolidating the file fragments into a fewer number of the memory blocks. The consolidation utility also increases the number of memory blocks that contain only invalid data. All of the memory blocks containing only invalid data are then erased.

Patent
Toshiaki Minami
19 Feb 2010
TL;DR: In this paper, a memory control apparatus generates a plurality of commands whose unit of data transfer is smaller than that of a memory access request, and when memory access requests are transmitted from a plurality of request sources, issues the plurality of commands to a memory in alternate order for each request source.
Abstract: A memory control apparatus generates a plurality of commands whose unit of data transfer is smaller than the unit of data transfer of a memory access request, and, when memory access requests are transmitted from a plurality of request sources, issues the plurality of commands to a memory in alternate order for each request source. The plurality of memory access requests are thereby executed concurrently by time division.
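A toy model of the claimed scheme: each source's large request is split into smaller fixed-size transfer commands, and the controller issues commands in alternate (round-robin) order across sources, so the requests proceed concurrently by time division. The sizes and source count below are invented.

```c
/* Round-robin command interleaving sketch: split each source's request
 * into fixed-size commands and issue them in alternate order across
 * sources, so all requests make progress concurrently. */
#include <stdio.h>

#define NSRC      3
#define CMD_BYTES 64

int main(void) {
    int remaining[NSRC] = { 256, 128, 192 };  /* request sizes per source */
    int pending = NSRC;
    while (pending > 0) {
        for (int s = 0; s < NSRC; s++) {      /* alternate order per source */
            if (remaining[s] <= 0) continue;
            printf("issue %d-byte command for source %d (%d left)\n",
                   CMD_BYTES, s, remaining[s] - CMD_BYTES);
            remaining[s] -= CMD_BYTES;
            if (remaining[s] <= 0) pending--;
        }
    }
    return 0;
}
```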

Patent
17 May 2010
TL;DR: Embodiments for managing memory faults are described, including a memory controller module that manages memory cells and reports memory faults, an error buffer module that stores memory fault information received from the memory controller, and a notification module in communication with the error buffer module.
Abstract: Embodiments are described for managing memory faults. An example system can include a memory controller module to manage memory cells and report memory faults. An error buffer module can store memory fault information received from the memory controller. A notification module can be in communication with the error buffer module. The notification module may generate a notification of a memory fault in a memory access operation. A system software module can provide services and manage executing programs on a processor. In addition, the system software module can receive the notifications of the memory fault for the memory access operation. A notification handler may be activated by an interrupt when the notification of the memory fault in the memory access operation is received.

Proceedings ArticleDOI
20 Oct 2010
TL;DR: A framework for automatic inference of memory fences in concurrent programs, assisting the programmer in this complex task is presented, and it is used to infer correct and efficient placements of fences for several non-trivial algorithms, including practical concurrent data structures.
Abstract: This paper addresses the problem of placing memory fences in a concurrent program running on a relaxed memory model. Modern architectures implement relaxed memory models which may reorder memory operations or execute them non-atomically. Special instructions called memory fences are provided to the programmer, allowing control of this behavior. To ensure correctness of many algorithms, in particular of non-blocking ones, a programmer is often required to explicitly insert memory fences into her program. However, she must use as few fences as possible, or the benefits of the relaxed architecture may be lost. Placing memory fences is challenging and very error prone, as it requires subtle reasoning about the underlying memory model. We present a framework for automatic inference of memory fences in concurrent programs, assisting the programmer in this complex task. Given a finite-state program, a safety specification and a description of the memory model, our framework computes a set of ordering constraints that guarantee the correctness of the program under the memory model. The computed constraints are maximally permissive: removing any constraint from the solution would permit an execution violating the specification. Our framework then realizes the computed constraints as additional fences in the input program. We implemented our approach in a tool called FENDER and used it to infer correct and efficient placements of fences for several non-trivial algorithms, including practical concurrent data structures.

Patent
Dotan Sokolov, Barak Rotbard
22 Mar 2010
TL;DR: In this paper, the authors present a non-volatile memory system for data storage, which includes a host having a host memory and a memory controller that is separate from the host.
Abstract: A method for data storage includes, in a system that includes a host having a host memory and a memory controller that is separate from the host and stores data for the host in a non-volatile memory including multiple analog memory cells, storing in the host memory information items relating to respective groups of the analog memory cells of the non-volatile memory. A command that causes the memory controller to access a given group of the analog memory cells is received from the host. In response to the command, a respective information item relating to the given group of the analog memory cells is retrieved from the host memory by the memory controller, and the given group of the analog memory cells is accessed using the retrieved information item.