
Showing papers on "Memory management" published in 2010


Book
03 Dec 2010
TL;DR: This book grants the reader a comprehensive overview of the state-of-the-art in system-level memory management (data transfer and storage) related issues for complex data-dominated real-time signal and data processing applications.
Abstract: From the Publisher: This book grants the reader a comprehensive overview of the state-of-the-art in system-level memory management (data transfer and storage) related issues for complex data-dominated real-time signal and data processing applications. The authors introduce their own system-level data transfer and storage exploration methodology for data-dominated video applications. This methodology tackles the power and area reduction cost components in the architecture for this target domain, namely the system-level busses and the background memories. For the most critical tasks in the methodology, prototype tools have been developed to reduce the design time. To the researcher the book will serve as an excellent reference source, both for the overall description of the methodology and for the detailed descriptions of the system-level methodologies and synthesis techniques and algorithms. To the design engineers and CAD managers it offers an invaluable insight into the anticipated evolution of commercially available design tools as well as allowing them to utilize the book's concepts in their own research and development.

599 citations


Proceedings ArticleDOI
Howard S. David, Eugene Gorbatov, Ulf R. Hanebutte, Rahul Khanna, Christian Le
18 Aug 2010
TL;DR: This paper proposes a new approach for measuring memory power and demonstrates its applicability to a novel power limiting algorithm, achieving up to 40% lower performance impact than the state-of-the-art baseline across the power limiting range.
Abstract: The drive for higher performance and energy efficiency in data-centers has influenced trends toward increased power and cooling requirements in the facilities. Since enterprise servers rarely operate at their peak capacity, efficient power capping is deemed a critical component of modern enterprise computing environments. In this paper we propose a new power measurement and power limiting architecture for main memory. Specifically, we describe a new approach for measuring memory power and demonstrate its applicability to a novel power limiting algorithm. We implement and evaluate our approach in modern servers and show that we achieve up to 40% lower performance impact when compared to the state-of-the-art baseline across the power limiting range.

533 citations


Proceedings ArticleDOI
Anthony Nguyen, Nadathur Satish, Jatin Chhugani, Changkyu Kim, Pradeep Dubey
13 Nov 2010
TL;DR: A novel 3.5D-blocking algorithm that performs 2.5D-spatial and temporal blocking of the input grid into on-chip memory for both CPUs and GPUs, amenable to thread-level and data-level parallelism and scaling near-linearly with SIMD width and core count.
Abstract: Stencil computation sweeps over a spatial grid over multiple time steps to perform nearest-neighbor computations. The bandwidth-to-compute requirement for a large class of stencil kernels is very high, and their performance is bound by the available memory bandwidth. Since memory bandwidth grows more slowly than compute, the performance of stencil kernels will not scale with increasing compute density. We present a novel 3.5D-blocking algorithm that performs 2.5D-spatial and temporal blocking of the input grid into on-chip memory for both CPUs and GPUs. The resultant algorithm is amenable to both thread-level and data-level parallelism, and scales near-linearly with the SIMD width and multiple cores. Our performance numbers are faster than or comparable to state-of-the-art stencil implementations on CPUs and GPUs. Our implementation of the 7-point stencil is 1.5X faster on CPUs, and 1.8X faster on GPUs, for single-precision floating-point inputs than previously reported numbers. For Lattice Boltzmann methods, the corresponding speedup on CPUs is 2.1X.
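The abstract describes the blocking scheme only at a high level. As a rough illustration of the spatial half of the idea (tiling the x/y plane so the working set fits in on-chip memory while streaming along z), here is a minimal C sketch of a blocked 7-point stencil; the grid and tile sizes are invented, and the temporal-blocking dimension that makes the algorithm "3.5D" is omitted for brevity.

```c
/* Minimal sketch of spatial blocking for a 7-point stencil, illustrating
 * the idea behind 2.5D blocking: tile the x/y plane so a working set fits
 * in on-chip memory, and stream through z. Temporal blocking (the extra
 * dimension in "3.5D") is omitted. Grid and tile sizes are illustrative. */
#include <stdio.h>
#include <stdlib.h>

#define N  64         /* grid points per dimension (illustrative) */
#define TX 16         /* tile size in x */
#define TY 16         /* tile size in y */
#define IDX(x,y,z) ((z)*N*N + (y)*N + (x))

static void step(const float *in, float *out) {
    for (int yy = 1; yy < N-1; yy += TY)
    for (int xx = 1; xx < N-1; xx += TX)        /* 2D tiles in x/y ... */
    for (int z = 1; z < N-1; z++)               /* ... streamed along z */
    for (int y = yy; y < yy+TY && y < N-1; y++)
    for (int x = xx; x < xx+TX && x < N-1; x++)
        out[IDX(x,y,z)] = (in[IDX(x,y,z)] +
                           in[IDX(x-1,y,z)] + in[IDX(x+1,y,z)] +
                           in[IDX(x,y-1,z)] + in[IDX(x,y+1,z)] +
                           in[IDX(x,y,z-1)] + in[IDX(x,y,z+1)]) / 7.0f;
}

int main(void) {
    float *a = calloc((size_t)N*N*N, sizeof *a);
    float *b = calloc((size_t)N*N*N, sizeof *b);
    if (!a || !b) return 1;
    a[IDX(N/2, N/2, N/2)] = 1.0f;               /* point source */
    step(a, b);
    printf("center after one step: %f\n", b[IDX(N/2, N/2, N/2)]);
    free(a); free(b);
    return 0;
}
```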

299 citations


Journal ArticleDOI
TL;DR: In this article, the authors study the device requirements of a resistive cross-point memory array under the worst-case write and read operations and compare the effect of the memory cell resistance values and resistance ratio for determining the maximum array size.
Abstract: Cross-point memory architecture offers high device density, yet it suffers from substantial sneak path leakages, which result in large power dissipation and a small sensing margin. The parasitic resistance associated with the interconnects further degrades the output signal and imposes an additional limitation on the maximum allowable array size. In this paper, we study the device requirements of a resistive cross-point memory array under the worst-case write and read operations. We focus on the data pattern dependence of the memory array and compare the effect of the memory cell resistance values and resistance ratio for determining the maximum array size. The number of cells in the array can reach 10^6 with a signal swing > 50% of the reading voltage when Ron is beyond 3 MΩ and Roff/Ron is greater than 2. A large memory cell resistance value can further reduce the power consumption, obviate the need for a large Roff/Ron ratio, and avoid the inclusion of cell selection devices. The effect of the nonlinearity of the I-V characteristics of the memory cells is also investigated. The nonlinearity calls for a substantial tradeoff between the memory cell resistance values and the resistance ratio, and must be taken into consideration for the device design.

271 citations


Proceedings ArticleDOI
01 Dec 2010
TL;DR: In this article, a modified bottom electrode is proposed for the memory device to maintain the memory window and to endure resistive switching up to 10^10 cycles, greatly improving the performance of the HfOx-based bipolar resistive memory.
Abstract: The memory performance of the HfOx-based bipolar resistive memory, including switching speed and memory reliability, is greatly improved in this work. A record-high switching speed down to 300 ps is achieved. The cycling test sheds clear light on the wearing behavior of resistance states, and the correlation between the over-RESET phenomenon and the worn low-resistance state in the devices is discussed. A modified bottom electrode is proposed for the memory device to maintain the memory window and to endure resistive switching up to 10^10 cycles.

256 citations


Proceedings ArticleDOI
13 Jun 2010
TL;DR: This paper considers stereo matching using loopy belief propagation and makes large numbers of disparity levels as efficient as small ones: by hierarchically reducing the disparity search range, it solves the message updating problem in time linear in the number of pixels in the image while requiring only constant memory space.
Abstract: In this paper, we consider the problem of stereo matching using loopy belief propagation. Unlike previous methods which focus on the original spatial resolution, we hierarchically reduce the disparity search range. By fixing the number of disparity levels on the original resolution, our method solves the message updating problem in a time linear in the number of pixels contained in the image and requires only constant memory space. Specifically, for an 800 × 600 image with 300 disparities, our message updating method is about 30× faster (1.5 seconds) than the standard method, and requires only about 0.6% of the memory (9 MB). Also, our algorithm lends itself to a parallel implementation. Our GPU implementation (NVIDIA GeForce 8800 GTX) is about 10× faster than our CPU implementation. Given the trend toward higher-resolution images, stereo matching using belief propagation with a large number of disparity levels as efficiently as with a small one makes our method future-proof. In addition to the computational and memory advantages, our method is straightforward to implement.

235 citations


Proceedings ArticleDOI
19 Jun 2010
TL;DR: MMS as discussed by the authors is a robust architecture for efficiently incorporating MLC PCM devices in main memory, based on the observation that memory requirements vary across workloads and that systems are typically over-provisioned in terms of memory capacity.
Abstract: Phase Change Memory (PCM) is emerging as a scalable and power efficient technology to architect future main memory systems. The scalability of PCM is enhanced by the property that PCM devices can store multiple bits per cell. While such Multi-Level Cell (MLC) devices can offer high density, this benefit comes at the expense of increased read latency, which can cause significant performance degradation. This paper proposes Morphable Memory System (MMS), a robust architecture for efficiently incorporating MLC PCM devices in main memory. MMS is based on the observation that memory requirements vary across workloads, and that systems are typically over-provisioned in terms of memory capacity. So, during a phase of low memory usage, some of the MLC devices can be operated at fewer bits per cell to obtain lower latency. When the workload requires full memory capacity, these devices can be restored to high-density MLC operation to provide full main-memory capacity. We provide the runtime monitors, the hardware-OS interface, and the detailed mechanism for implementing MMS. Our evaluations on an 8-core 8GB MLC PCM-based system show that MMS provides, on average, low latency access for 95% of all memory requests, thereby improving overall system performance by 40%.
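As a toy illustration of the morphing policy the abstract describes, the C sketch below switches MLC PCM devices between a dense 2-bit/cell mode and a faster 1-bit/cell mode according to measured memory demand. Every name, size, and threshold here is hypothetical; the paper's actual runtime monitors and hardware-OS interface are far richer.

```c
/* Illustrative sketch of the Morphable Memory System policy: when
 * measured demand is low, run some MLC PCM devices at 1 bit/cell for
 * lower read latency; when demand rises, restore 2 bits/cell for full
 * capacity. Sizes and thresholds are invented for illustration. */
#include <stdio.h>

#define NDEV   8
#define GB_MLC 2   /* 2 bits/cell: 2 GB/device (illustrative) */
#define GB_SLC 1   /* 1 bit/cell:  1 GB/device, lower read latency */

static int bits[NDEV];  /* bits per cell for each device: 1 or 2 */

static int capacity(void) {
    int gb = 0;
    for (int i = 0; i < NDEV; i++) gb += (bits[i] == 2) ? GB_MLC : GB_SLC;
    return gb;
}

/* Morph devices one at a time, preferring the low-latency 1-bit mode
 * whenever the remaining capacity still covers the measured demand. */
static void morph(int demand_gb) {
    for (int i = 0; i < NDEV; i++) bits[i] = 2;          /* start dense */
    for (int i = 0; i < NDEV; i++)
        if (capacity() - (GB_MLC - GB_SLC) >= demand_gb)
            bits[i] = 1;                                 /* safe to go fast */
}

int main(void) {
    int demands[] = { 4, 9, 16, 6 };   /* hypothetical usage samples, GB */
    for (int k = 0; k < 4; k++) {
        morph(demands[k]);
        int fast = 0;
        for (int i = 0; i < NDEV; i++) fast += (bits[i] == 1);
        printf("demand %2d GB -> capacity %2d GB, %d/%d devices fast\n",
               demands[k], capacity(), fast, NDEV);
    }
    return 0;
}
```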

211 citations


Patent
08 Mar 2010
TL;DR: In this paper, a coloring information storage unit that stores coloring information generated based on a data characteristic of write target data to be written into at least one of the nonvolatile semiconductor memories and the volatile semiconductor memory was proposed.
Abstract: A memory management device of an example of the invention controls writing into and reading from a main memory including a nonvolatile semiconductor memory and a volatile semiconductor memory in response to writing requests and reading requests from a processor. The memory management device includes a coloring information storage unit that stores coloring information generated based on a data characteristic of write target data to be written into at least one of the nonvolatile semiconductor memory and the volatile semiconductor memory, and a writing management unit that references the coloring information to determine a region into which the write target data is written, selected from the nonvolatile semiconductor memory and the volatile semiconductor memory.

205 citations


Proceedings ArticleDOI
04 Oct 2010
TL;DR: DieHarder as mentioned in this paper analyzes a range of widely deployed memory allocators, including those used by Windows, Linux, FreeBSD and OpenBSD, and shows that they remain vulnerable to heap-based attacks.
Abstract: Heap-based attacks depend on a combination of memory management errors and an exploitable memory allocator. Many allocators include ad hoc countermeasures against particular exploits, but their effectiveness against future exploits has been uncertain. This paper presents the first formal treatment of the impact of allocator design on security. It analyzes a range of widely-deployed memory allocators, including those used by Windows, Linux, FreeBSD and OpenBSD, and shows that they remain vulnerable to attack. It then presents DieHarder, a new allocator whose design was guided by this analysis. DieHarder provides the highest degree of security from heap-based attacks of any practical allocator of which we are aware, while imposing modest performance overhead. In particular, the Firefox web browser runs as fast with DieHarder as with the Linux allocator.
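For a flavor of one ingredient of allocator security in this vein, the toy allocator below places objects at uniformly random free slots in an over-provisioned region, making heap addresses hard for an attacker to predict. This is only a sketch of the randomized-placement idea, not DieHarder's actual design, which also randomizes reuse, separates metadata from the heap, and more.

```c
/* Toy sketch of randomized placement in an over-provisioned heap: pick a
 * uniformly random free slot instead of the "next" one, so an overflow
 * is unlikely to land on predictable live data. Illustration only. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define SLOTS     64        /* provisioned slots, more than ever live */
#define SLOT_SIZE 32

static unsigned char heap[SLOTS * SLOT_SIZE];
static int used[SLOTS];

static void *rand_alloc(void) {
    for (int tries = 0; tries < 8 * SLOTS; tries++) {
        int s = rand() % SLOTS;                 /* random candidate slot */
        if (!used[s]) { used[s] = 1; return heap + s * SLOT_SIZE; }
    }
    return NULL;                                /* heap effectively full */
}

static void rand_free(void *p) {
    used[((unsigned char *)p - heap) / SLOT_SIZE] = 0;
}

int main(void) {
    srand((unsigned)time(NULL));
    void *a = rand_alloc(), *b = rand_alloc();
    if (!a || !b) return 1;
    printf("a at slot %ld, b at slot %ld\n",
           (long)(((unsigned char *)a - heap) / SLOT_SIZE),
           (long)(((unsigned char *)b - heap) / SLOT_SIZE));
    rand_free(a); rand_free(b);
    return 0;
}
```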

199 citations


Proceedings ArticleDOI
13 Nov 2010
TL;DR: This paper introduces another layer of address translation coupled with an on-chip memory controller that can dynamically migrate data between on-package and off-package memory, either in hardware or with operating system assistance depending on the migration granularity.
Abstract: System-in-Package (SiP) and 3D integration are promising technologies to bring more memory onto a microprocessor package to mitigate the "memory wall" problem. In this paper, instead of using them to build caches, we study a heterogeneous main memory using both on- and off-package memories, providing both fast, high-bandwidth on-package accesses and expandable, low-cost commodity off-package memory capacity. We introduce another layer of address translation coupled with an on-chip memory controller that can dynamically migrate data between on-package and off-package memory, either in hardware or with operating system assistance depending on the migration granularity. Our experimental results demonstrate that such a design can achieve, on average, 83% of the effectiveness of the ideal case where all memory can be placed in high-speed on-package memory for our simulated benchmarks.
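A minimal sketch of the extra translation layer might look as follows: a remap table maps each data page to either a fast on-package frame or a slow off-package frame, and a hot page is migrated up by copying it and repointing its entry. The capacities and the hotness threshold are invented for illustration.

```c
/* Sketch of a second-level translation table for a heterogeneous main
 * memory: entries point at on- or off-package frames, and hot pages are
 * migrated by copying and repointing. Sizes and policy are invented. */
#include <stdio.h>
#include <string.h>

#define PAGE       4096
#define ON_FRAMES  2          /* tiny on-package memory */
#define OFF_FRAMES 8

static char on_mem [ON_FRAMES ][PAGE];
static char off_mem[OFF_FRAMES][PAGE];

typedef struct { int on_package; int frame; int accesses; } remap_entry;
static remap_entry table[OFF_FRAMES];   /* one entry per data page */
static int on_used[ON_FRAMES];

static char *resolve(int page) {        /* the extra translation step */
    remap_entry *e = &table[page];
    e->accesses++;
    return e->on_package ? on_mem[e->frame] : off_mem[e->frame];
}

static void maybe_migrate(int page) {
    remap_entry *e = &table[page];
    if (e->on_package || e->accesses < 3) return;   /* invented threshold */
    for (int f = 0; f < ON_FRAMES; f++) {
        if (!on_used[f]) {
            memcpy(on_mem[f], off_mem[e->frame], PAGE);  /* copy up */
            on_used[f] = 1;
            e->on_package = 1;
            e->frame = f;                        /* repoint the mapping */
            printf("page %d migrated on-package (frame %d)\n", page, f);
            return;
        }
    }
}

int main(void) {
    for (int p = 0; p < OFF_FRAMES; p++)
        table[p] = (remap_entry){ 0, p, 0 };
    for (int i = 0; i < 5; i++) { resolve(3); maybe_migrate(3); }
    return 0;
}
```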

147 citations


Proceedings ArticleDOI
19 Jun 2010
TL;DR: This paper demonstrates that performance limiting effects of highly-threaded architectures can be overcome, and shows that through awareness of the physical main memory layout and by focusing on writes, both read and write average latency can be shortened, memory power reduced, and overall system performance improved.
Abstract: In computer architecture, caches have primarily been viewed as a means to hide memory latency from the CPU. Cache policies have focused on anticipating the CPU's data needs, and are mostly oblivious to the main memory. In this paper, we demonstrate that the era of many-core architectures has created new main memory bottlenecks, and mandates a new approach: coordination of cache policy with main memory characteristics. Using the cache for memory optimization purposes, we propose a Virtual Write Queue which dramatically expands the memory controller's visibility of processor behavior, at low implementation overhead. Through memory-centric modification of existing policies, such as scheduled writebacks, this paper demonstrates that performance limiting effects of highly-threaded architectures can be overcome. We show that through awareness of the physical main memory layout and by focusing on writes, both read and write average latency can be shortened, memory power reduced, and overall system performance improved. Through full-system cycle-accurate simulations of SPEC cpu2006, we demonstrate that the proposed Virtual Write Queue achieves an average 10.9% system-level throughput improvement on memory-intensive workloads, along with an overall reduction of 8.7% in memory power across the whole suite.

Patent
Ho-Fan Kang
26 Mar 2010
TL;DR: In this article, the authors present a storage subsystem comprising a nonvolatile solid-state memory array and a controller, which is configured to execute, in the memory array, memory commands from the queue in a sequence that is based at least in part on a throttling ratio provided by the system operation module.
Abstract: Embodiments of the invention are directed to a storage subsystem comprising a non-volatile solid-state memory array and a controller. In one embodiment, the controller includes a system operation module configured to manage system memory operations and a queue configured to receive memory commands from a host system and the system operation module. The controller is configured to execute, in the memory array, memory commands from the queue in a sequence that is based at least in part on a throttling ratio provided by the system operation module.

Patent
16 Jul 2010
TL;DR: In this article, a method for caching in a processor system having virtual memory is presented, the method comprising: monitoring slow memory in the processor system to determine frequently accessed pages; copying each frequently accessed page from slow memory to a location in fast memory; and updating virtual address page tables to reflect the new location.
Abstract: In a first embodiment of the present invention, a method for caching in a processor system having virtual memory is provided, the method comprising: monitoring slow memory in the processor system to determine frequently accessed pages; and, for a frequently accessed page in slow memory: copying the frequently accessed page from slow memory to a location in fast memory; and updating virtual address page tables to reflect the location of the frequently accessed page in fast memory.

Proceedings ArticleDOI
28 Mar 2010
TL;DR: This research addresses the cost of measuring stack distance, which requires the number of unique memory objects to be counted between successive accesses to the same data object and therefore involves complex and inefficient data collection.
Abstract: Efficient execution on modern architectures requires good data locality, which can be measured by the powerful stack distance abstraction. Based on this abstraction, the miss rate for LRU caches of any size can be predicted. However, measuring stack distance requires the number of unique memory objects to be counted between successive accesses to the same data object, which requires complex and inefficient data collection.
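Concretely, the stack distance of an access is the number of distinct addresses touched since the previous access to the same address; a distance of d means a hit in any LRU cache holding more than d lines, which is why the distance histogram predicts miss rates for every cache size at once. The brute-force measurement below, over a toy trace, shows why naive collection is expensive.

```c
/* Brute-force stack-distance measurement over a toy address trace: for
 * each access, find the previous access to the same address and count
 * the distinct addresses in between. This is exactly the costly
 * collection the abstract refers to; real tools must do better. */
#include <stdio.h>

#define TRACE_LEN 8

int main(void) {
    int trace[TRACE_LEN] = { 1, 2, 3, 2, 1, 4, 1, 2 };  /* toy addresses */
    for (int i = 0; i < TRACE_LEN; i++) {
        int prev = -1;
        for (int j = i - 1; j >= 0; j--)
            if (trace[j] == trace[i]) { prev = j; break; }
        if (prev < 0) { printf("addr %d: cold miss\n", trace[i]); continue; }
        /* count distinct addresses strictly between prev and i */
        int distinct = 0;
        for (int j = prev + 1; j < i; j++) {
            int seen = 0;
            for (int k = prev + 1; k < j; k++)
                if (trace[k] == trace[j]) { seen = 1; break; }
            if (!seen) distinct++;
        }
        printf("addr %d: stack distance %d\n", trace[i], distinct);
    }
    return 0;
}
```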

Book ChapterDOI
01 Apr 2010
TL;DR: A hybrid architecture for the NAND flash memory storage, of which the log region is implemented using phase change random access memory (PRAM), which has the following advantages: the PRAM log region allows in-place updating so that it significantly improves the usage efficiency of log pages by eliminating out-of-date log records.
Abstract: In recent years, many systems have employed NAND flash memory as storage devices because of its advantages of higher performance (compared to the traditional hard disk drive), high density, random access, increasing capacity, and falling cost. On the other hand, the performance of NAND flash memory is limited by its “erase-before-write” requirement. Log-based structures have been used to alleviate this problem by writing updated data to the clean space. Prior log-based methods, however, cannot avoid excessive erase operations when there are frequent updates, which quickly consume free pages, especially when some data are updated repeatedly. In this paper, we propose a hybrid architecture for the NAND flash memory storage, of which the log region is implemented using phase change random access memory (PRAM). Compared to traditional log-based architectures, it has the following advantages: (1) the PRAM log region allows in-place updating so that it significantly improves the usage efficiency of log pages by eliminating out-of-date log records; (2) it greatly reduces the traffic of reading from the NAND flash memory storage since the size of logs loaded for the read operation is decreased; (3) the energy consumption of the storage system is reduced as the overhead of writing and reading log data is decreased with the PRAM log region; (4) the lifetime of NAND flash memory is increased because the number of erase operations is reduced. To facilitate the PRAM log region, we propose several management policies. The simulation results show that our proposed methods can substantially improve the performance, energy consumption, and lifetime of the NAND flash memory storage.

Book ChapterDOI
15 Jul 2010
TL;DR: A class of relaxed memory models, defined in Coq, parameterised by the chosen permitted local reorderings of reads and writes, and the visibility of inter- and intra-processor communications through memory is presented.
Abstract: We present a class of relaxed memory models, defined in Coq, parameterised by the chosen permitted local reorderings of reads and writes, and the visibility of inter- and intra-processor communications through memory (e.g. store atomicity relaxation). We prove results on the required behaviour and placement of memory fences to restore a given model (such as Sequential Consistency) from a weaker one. Based on this class of models we develop a tool, diy, that systematically and automatically generates and runs litmus tests to determine properties of processor implementations. We detail the results of our experiments on Power and the model we base on them. This work identified a rare implementation error in Power 5 memory barriers (for which IBM is providing a workaround); our results also suggest that Power 6 does not suffer from this problem.
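A hand-written analogue of the kind of litmus test diy generates is the classic "store buffering" shape, sketched in C11 below: without fences, both threads may read 0 because stores sit in store buffers; with a sequentially consistent fence between each store and load, that outcome must disappear. Compile with -pthread; the observable outcomes depend on the machine it runs on.

```c
/* Store-buffering (SB) litmus test in C11: run both threads many times
 * and count the relaxed outcome r0 == 0 && r1 == 0. With the seq_cst
 * fences in place it must never appear; comment them out to hunt for it. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static atomic_int x, y;
static int r0, r1;
static pthread_barrier_t bar;

static void *t0(void *arg) {
    (void)arg;
    pthread_barrier_wait(&bar);                     /* start together */
    atomic_store_explicit(&x, 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);      /* the fence under test */
    r0 = atomic_load_explicit(&y, memory_order_relaxed);
    return NULL;
}

static void *t1(void *arg) {
    (void)arg;
    pthread_barrier_wait(&bar);
    atomic_store_explicit(&y, 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);
    r1 = atomic_load_explicit(&x, memory_order_relaxed);
    return NULL;
}

int main(void) {
    int forbidden = 0;
    pthread_barrier_init(&bar, NULL, 2);
    for (int i = 0; i < 100000; i++) {
        pthread_t a, b;
        atomic_store(&x, 0);
        atomic_store(&y, 0);
        pthread_create(&a, NULL, t0, NULL);
        pthread_create(&b, NULL, t1, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        if (r0 == 0 && r1 == 0) forbidden++;
    }
    printf("r0==0 && r1==0 observed %d times\n", forbidden);
    return 0;
}
```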

Proceedings ArticleDOI
28 Mar 2010
TL;DR: It is demonstrated that NUMA can indeed be a significant problem for scientific applications, showing that it can mean the difference between an application scaling perfectly and failing to scale at all.
Abstract: Until recently, most high-end scientific applications have been immune to performance problems caused by Non-Uniform Memory Access (NUMA). However, current trends in microprocessor design are pushing NUMA to smaller and smaller scales. This paper examines the current state of NUMA and makes several contributions. First, we summarize the performance problems that NUMA can present for multi-threaded applications and describe methods of addressing them. Second, we demonstrate that NUMA can indeed be a significant problem for scientific applications, showing that it can mean the difference between an application scaling perfectly and failing to scale at all. Third, we describe, in increasing order of usefulness, three methods of using hardware performance counters to aid in finding NUMA-related problems. Finally, we introduce Memphis, a data-centric toolset that uses Instruction Based Sampling to help pinpoint problematic memory accesses, and demonstrate how we used it to improve the performance of several production-level codes - HYCOM, XGC1 and CAM - by 13%, 23% and 24% respectively.
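One common remedy for the problems the paper describes is to exploit the first-touch placement policy: on Linux, a page is typically allocated on the NUMA node of the thread that first writes it, so initializing data with the same thread partitioning as the later compute loop keeps accesses node-local. The OpenMP sketch below illustrates this; it assumes a first-touch policy and static scheduling, and is not taken from the paper. Compile with -fopenmp.

```c
/* First-touch NUMA initialization sketch: the parallel init loop uses
 * the same static partitioning as the compute loop, so each thread's
 * pages are placed on its own node (assuming a first-touch policy).
 * A serial init would instead place every page on one node. */
#include <stdio.h>
#include <stdlib.h>

#define N (1 << 24)

int main(void) {
    double *a = malloc(N * sizeof *a);
    if (!a) return 1;

    /* parallel first-touch: pages land near the threads that use them */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < N; i++)
        a[i] = 0.0;

    double sum = 0.0;
    #pragma omp parallel for schedule(static) reduction(+:sum)
    for (long i = 0; i < N; i++) {
        a[i] += 1.0;            /* same partitioning: local accesses */
        sum += a[i];
    }
    printf("sum = %f\n", sum);
    free(a);
    return 0;
}
```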

Proceedings ArticleDOI
13 Jun 2010
TL;DR: This paper targets embedded Chip Multiprocessors with Scratch Pad Memory (SPM) and non-volatile main memory and introduces data migration and recomputation techniques to reduce the number of write activities on non-volatile memories.
Abstract: Recent advances in circuit and process technologies have pushed non-volatile memory technologies into a new era. These technologies exhibit appealing properties such as low power consumption, non-volatility, shock-resistivity, and high density. However, there are challenges that need answers on the road to applying non-volatile memories as main memory in computer systems. First, non-volatile memories have a limited number of write/erase cycles compared with DRAM memory. Second, write activities on non-volatile memory are more expensive than on DRAM memory in terms of energy consumption and access latency. Both challenges benefit from a reduction of the write activities on the non-volatile memory. In this paper, we target embedded Chip Multiprocessors (CMPs) with Scratch Pad Memory (SPM) and non-volatile main memory. We introduce data migration and recomputation techniques to reduce the number of write activities on non-volatile memories. Experimental results show that the proposed methods can reduce the number of writes by 59.41% on average, which means that the non-volatile memory can last 2.8 times as long as before. Meanwhile, the finish time of programs is reduced by 31.81% on average.

Proceedings ArticleDOI
30 Nov 2010
TL;DR: This work benchmarks two virtual machine monitors, OpenVZ and KVM, specifically focusing on I/O throughput, and concludes that KVM's I/O performance is suboptimal, potentially due to memory management problems in the hypervisor.
Abstract: The benefits of virtualization are typically considered to be server consolidation (leading to the reduction of power and cooling costs), increased availability, isolation, ease of operating system deployment and simplified disaster recovery. High Performance Computing (HPC) environments pose one main challenge for virtualization: the need to maximize throughput with minimal loss of CPU and I/O efficiency. However, virtualization is usually evaluated in terms of enterprise workloads and assumes that servers are underutilized and can be consolidated. In this paper we evaluate the performance of several virtual machine technologies in the context of HPC. A fundamental requirement of current high performance workloads is that both CPU and I/O must be highly efficient for tasks such as MPI jobs. This work benchmarks two virtual machine monitors, OpenVZ and KVM, specifically focusing on I/O throughput since CPU efficiency has been extensively studied [1]. OpenVZ offers near native I/O performance. Amazon's EC2 "Cluster Compute Node" product is also considered for comparative purposes and performs quite well. The EC2 "Cluster Compute Node" product utilizes the Xen hypervisor in hvm mode and 10 Gbit/s Ethernet for high throughput communication. Therefore, we also briefly studied Xen on our hardware platform (in hvm mode) to determine if there are still areas of improvement in KVM that allow EC2 to outperform KVM (with InfiniBand host channel adapters operating at 20 Gbit/s) in MPI benchmarks. We conclude that KVM's I/O performance is suboptimal, potentially due to memory management problems in the hypervisor. Amazon's EC2 service is promising, although further investigation is necessary to understand the effects of network based storage on I/O throughput in compute nodes. Amazon's offering may be attractive for users searching for "InfiniBand-like" performance without the upfront investment required to build an InfiniBand cluster, or users wishing to dynamically expand their cluster during periods of high demand.

Proceedings Article
23 Jun 2010
TL;DR: The first comprehensive description of the usage of the discard command on a real flash device is presented and two enhancements to provide fast online garbage collection of free VM pages are shown.
Abstract: With the decreasing price of flash memory, systems will increasingly use solid-state storage for virtual-memory paging rather than disks. FlashVM is a system architecture and a core virtual memory subsystem built in the Linux kernel that uses dedicated flash for paging. FlashVM focuses on three major design goals for memory management on flash: high performance, reduced flash wear out for improved reliability, and efficient garbage collection. FlashVM modifies the paging system along code paths for allocating, reading and writing back pages to optimize for the performance characteristics of flash. It also reduces the number of page writes using zero-page sharing and page sampling that prioritize the eviction of clean pages. In addition, we present the first comprehensive description of the usage of the discard command on a real flash device and show two enhancements to provide fast online garbage collection of free VM pages. Overall, the FlashVM system provides up to 94% reduction in application execution time and is four times more responsive than swapping to disk. Furthermore, it improves reliability by writing up to 93% fewer pages than Linux, and provides a garbage collection mechanism that is up to 10 times faster than Linux with discard support.
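As an illustration of the zero-page-sharing optimization mentioned in the abstract, the sketch below checks whether a page is all zeroes before paging it out and, if so, records a flag instead of issuing a 4 KB flash write, saving both write bandwidth and flash wear. The metadata layout and accounting here are invented.

```c
/* Zero-page sharing sketch: detect all-zero pages at swap-out time and
 * skip the flash write entirely, noting a "zero" flag in the page's
 * swap metadata instead. Illustration only; not FlashVM's actual code. */
#include <stdio.h>
#include <string.h>

#define PAGE 4096

static int page_is_zero(const unsigned char *p) {
    /* p[0]==0 and p[i]==p[i-1] for all i implies the page is all zero */
    return p[0] == 0 && memcmp(p, p + 1, PAGE - 1) == 0;
}

static int pages_written;       /* stand-in for real swap-out accounting */

static void swap_out(const unsigned char *page, int *zero_flag) {
    if (page_is_zero(page)) {
        *zero_flag = 1;         /* no flash write at all */
    } else {
        *zero_flag = 0;
        pages_written++;        /* would issue the flash write here */
    }
}

int main(void) {
    static unsigned char zero_page[PAGE], data_page[PAGE];
    memset(data_page, 0xAB, PAGE);
    int z;
    swap_out(zero_page, &z); printf("zero page shared: %d\n", z);
    swap_out(data_page, &z); printf("data page shared: %d\n", z);
    printf("flash writes issued: %d\n", pages_written);
    return 0;
}
```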

Proceedings ArticleDOI
03 May 2010
TL;DR: An effective management scheme for heterogeneous SLC and MLC regions of combined flash memory is proposed and a design technique which is able to determine the optimal proportion between the two regions that maximizes performance and energy reduction is proposed, guaranteeing the lifespan constraint.
Abstract: Flash memory-based non-volatile cache (NVC) is emerging as an effective solution for enhancing both the performance and the energy consumption of storage systems. In order to attain significant performance and energy gains from NVC, it would be better to use multi-level-cell (MLC) flash memories, since they can provide a large NVC capacity at low cost. However, the number of available program/erase cycles of MLC flash memory is smaller than that of single-level-cell (SLC) flash memory, which limits the lifespan of an NVC. In order to overcome this limitation, SLC/MLC combined flash memory is a promising solution for use in NVC. This paper proposes an effective management scheme for heterogeneous SLC and MLC regions of combined flash memory. It also proposes a design technique which is able to determine the optimal proportion between the two regions that maximizes performance and energy reduction while guaranteeing the lifespan constraint. We show experimentally how performance, lifespan, and energy consumption of the NVC-embedded hard disk change depending upon the configuration of the combined flash memory. We also show the superiority of the proposed NVC management policy in comparison to alternative policies.

Proceedings ArticleDOI
20 Sep 2010
TL;DR: A novel VM migration approach, named Migration with Data Deduplication (MDD), which introduces data deduplication into migration, which utilizes the self-similarity of run-time memory image, uses hash based fingerprints to find identical and similar memory pages, and employs Run Length Encode (RLE) to eliminate redundant memory data during migration.
Abstract: As one of the key characteristics of virtualization, live virtual machine (VM) migration provides great benefits for load balancing, power management, fault tolerance and other system maintenance issues in modern clusters and data centers. Although Pre-Copy is a widely used migration algorithm, it transfers a lot of duplicated memory image data from source to destination, which results in longer migration time and downtime. This paper proposes a novel VM migration approach, named Migration with Data Deduplication (MDD), which introduces data deduplication into migration. MDD utilizes the self-similarity of the run-time memory image, uses hash based fingerprints to find identical and similar memory pages, and employs Run Length Encode (RLE) to eliminate redundant memory data during migration. Experiments demonstrate that compared with Xen's default Pre-Copy migration algorithm, MDD can reduce total data transferred during migration by 56.60%, total migration time by 34.93%, and downtime by 26.16% on average.
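The fingerprinting step might be sketched as follows: hash each page, and if an identical fingerprint has already been transferred, send a short reference instead of the full page. The hash function (FNV-1a), table size, and page contents below are illustrative only; the real system additionally finds similar (not just identical) pages and RLE-encodes their differences.

```c
/* Fingerprint-based page deduplication sketch for VM migration: hash
 * each page and send an 8-byte reference when an identical fingerprint
 * was already transferred. Illustration only, not MDD's actual code. */
#include <stdio.h>
#include <string.h>
#include <stdint.h>

#define PAGE   4096
#define NPAGES 8
#define TBL    1024

static uint64_t fnv1a(const unsigned char *p, size_t n) {
    uint64_t h = 1469598103934665603ULL;        /* FNV-1a 64-bit */
    for (size_t i = 0; i < n; i++) { h ^= p[i]; h *= 1099511628211ULL; }
    return h;
}

static uint64_t seen[TBL];                      /* fingerprints sent so far */

int main(void) {
    static unsigned char ram[NPAGES][PAGE];     /* pages 0,1,3..7 identical */
    memset(ram[2], 0x55, PAGE);                 /* page 2 differs */
    long bytes_sent = 0;
    for (int p = 0; p < NPAGES; p++) {
        uint64_t h = fnv1a(ram[p], PAGE);
        int slot = (int)(h % TBL), dup = 0;
        while (seen[slot]) {                    /* linear probing */
            if (seen[slot] == h) { dup = 1; break; }
            slot = (slot + 1) % TBL;
        }
        if (dup) {
            bytes_sent += 8;                    /* send just the reference */
        } else {
            seen[slot] = h;
            bytes_sent += PAGE;                 /* send the full page */
        }
    }
    printf("transferred %ld bytes instead of %d\n", bytes_sent, NPAGES * PAGE);
    return 0;
}
```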

Proceedings ArticleDOI
25 Jun 2010
TL;DR: It is demonstrated that dynamic scheduling and efficient memory management are critical problems in achieving high efficiency on irregular workloads, and task-donation is the preferred choice because it performs comparably to task-stealing while incurring less memory overhead.
Abstract: We explore software mechanisms for managing irregular tasks on graphics processing units (GPUs). We demonstrate that dynamic scheduling and efficient memory management are critical problems in achieving high efficiency on irregular workloads. We experiment with several task-management techniques, ranging from the use of a single monolithic task queue to distributed queuing with task stealing and donation. On irregular workloads, we show that both centralized and distributed queues have more than 100 times as much idle time as our task-stealing and -donation queues. Our preferred choice is task-donation because it performs comparably to task-stealing while incurring less memory overhead. To help in this analysis, we use an artificial task-management system that monitors performance and memory usage to quantify the impact of these different techniques. We validate our results by implementing a Reyes renderer, with its irregular split-and-dice workload, that is able to achieve real-time framerates on a single GPU.

Proceedings ArticleDOI
04 Dec 2010
TL;DR: This work proposes dynamically reconfigurable predictive mechanisms that exploit the full dynamic range allowed in the JEDEC DDRx SDRAM specifications, and refers to the overall scheme as Elastic Refresh, in that the refresh policy is stretched to fit the currently executing workload, such that the maximum benefit of the DRAM flexibility is realized.
Abstract: High density memory is becoming more important as many execution streams are consolidated onto single chip many-core processors. DRAM is ubiquitous as a main memory technology, but while DRAM's per-chip density and frequency continue to scale, the time required to refresh its dynamic cells has grown at an alarming rate. This paper shows how currently-employed methods to schedule refresh operations are ineffective in mitigating the significant performance degradation caused by longer refresh times. Current approaches are deficient: they do not effectively exploit the flexibility of DRAMs to postpone refresh operations. This work proposes dynamically reconfigurable predictive mechanisms that exploit the full dynamic range allowed in the JEDEC DDRx SDRAM specifications. The proposed mechanisms are shown to mitigate much of the penalty seen with dense DRAM devices. We refer to the overall scheme as Elastic Refresh, in that the refresh policy is stretched to fit the currently executing workload, such that the maximum benefit of the DRAM flexibility is realized. We extend the GEMS on SIMICS tool-set to include Elastic Refresh. Simulations show the proposed solution provides a 10% average performance improvement over existing techniques across the entire SPEC CPU suite, and up to a 41% improvement for certain workloads.
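The flexibility being exploited is that DDRx devices may postpone a bounded number of regular refresh commands (up to 8) and catch up later. The toy model below defers refresh while demand requests are queued and drains the backlog during idle intervals, counting how many requests end up stalled behind forced refreshes; the traffic trace and tick granularity are invented, and the paper's predictive policy is considerably more sophisticated.

```c
/* Toy model of refresh postponement: one refresh comes due per interval;
 * defer it while demand requests are queued, drain the backlog when the
 * bank is idle, and force refreshes only at the postponement limit. */
#include <stdio.h>

#define MAX_POSTPONED 8   /* DDRx postponement limit */

int main(void) {
    /* pending demand requests per refresh interval (hypothetical trace) */
    int demand[16] = { 3, 4, 2, 5, 1, 2, 3, 4, 2, 6, 4, 1, 0, 1, 0, 0 };
    int owed = 0, stalls = 0;

    for (int t = 0; t < 16; t++) {
        owed++;                               /* one refresh due per tREFI */
        if (demand[t] == 0 || owed >= MAX_POSTPONED) {
            if (demand[t] > 0)
                stalls += demand[t];          /* forced refresh blocks reads */
            while (owed > 0 && (demand[t] == 0 || owed >= MAX_POSTPONED))
                owed--;                       /* issue REF command(s) */
        }
    }
    printf("demand requests stalled behind refresh: %d\n", stalls);
    return 0;
}
```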

Proceedings ArticleDOI
04 Dec 2010
TL;DR: This paper proposes a scalable approach to data-dependence profiling that addresses both runtime and memory overhead in a single framework, called SD3, and reduces the runtime overhead by parallelizing the dependence profiling step itself and compress memory accesses that exhibit stride patterns and compute data dependences directly in a compressed format.
Abstract: As multicore processors are deployed in mainstream computing, the need for software tools to help parallelize programs is increasing dramatically. Data-dependence profiling is an important technique to exploit parallelism in programs. More specifically, manual or automatic parallelization can use the outcomes of data-dependence profiling to guide where to parallelize in a program. However, state-of-the-art data-dependence profiling techniques are not scalable as they suffer from two major issues when profiling large and long-running applications: (1) runtime overhead and (2) memory overhead. Existing data-dependence profilers are either unable to profile large-scale applications or only report very limited information. In this paper, we propose a scalable approach to data-dependence profiling that addresses both runtime and memory overhead in a single framework. Our technique, called SD3, reduces the runtime overhead by parallelizing the dependence profiling step itself. To reduce the memory overhead, we compress memory accesses that exhibit stride patterns and compute data dependences directly in a compressed format. We demonstrate that SD3 reduces the runtime overhead when profiling SPEC 2006 by a factor of 4.1X and 9.7X on eight cores and 32 cores, respectively. For the memory overhead, we successfully profile SPEC 2006 with the reference input, while the previous approaches fail even with the train input. In some cases, we observe more than a 20X improvement in memory consumption and a 16X speedup in profiling time when 32 cores are used.
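The stride compression at the heart of SD3's memory savings can be sketched simply: instead of recording every address, collapse the reference stream into (base, stride, count) triples and reason about dependences in that compressed form. The greedy run detection and demo trace below are illustrative, not the paper's exact algorithm.

```c
/* Stride compression sketch: collapse an address stream into
 * (base, stride, count) runs via greedy run detection, so a long
 * strided loop costs one record instead of millions. */
#include <stdio.h>
#include <stdint.h>

typedef struct { uint64_t base; int64_t stride; int count; } stride_run;

static int compress(const uint64_t *addr, int n, stride_run *out) {
    int runs = 0, i = 0;
    while (i < n) {
        stride_run r = { addr[i], 0, 1 };
        if (i + 1 < n) {
            r.stride = (int64_t)(addr[i + 1] - addr[i]);
            while (i + r.count < n &&
                   (int64_t)(addr[i + r.count] - addr[i + r.count - 1])
                       == r.stride)
                r.count++;                   /* extend the current run */
        }
        out[runs++] = r;
        i += r.count;
    }
    return runs;
}

int main(void) {
    /* e.g. an array walked by 8 bytes, then scattered, then by 4 bytes */
    uint64_t trace[] = { 1000, 1008, 1016, 1024, 5000, 7000, 7004, 7008 };
    stride_run runs[8];
    int n = compress(trace, 8, runs);
    for (int k = 0; k < n; k++)
        printf("base=%llu stride=%lld count=%d\n",
               (unsigned long long)runs[k].base,
               (long long)runs[k].stride, runs[k].count);
    return 0;
}
```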

Patent
12 Nov 2010
TL;DR: In this paper, a block consolidation utility is executed to eliminate at least some of the partially-filled blocks by consolidating the file fragments into a fewer number of the memory blocks.
Abstract: A method of maintaining a solid-state drive so that free space within memory blocks of the drive becomes free usable space to the drive. The drive comprises cells organized in pages that are organized in memory blocks in which at least user files are stored. A defragmentation utility is executed to cause at least some of the memory blocks that are partially filled with data and contain file fragments to be combined or aligned, and to cause at least some of the memory blocks that contain only invalid data to be combined or aligned. A block consolidation utility is then executed to eliminate at least some of the partially-filled blocks by consolidating the file fragments into a fewer number of the memory blocks. The consolidation utility also increases the number of memory blocks that contain only invalid data. All of the memory blocks containing only invalid data are then erased.

Patent
Toshiaki Minami
19 Feb 2010
TL;DR: In this paper, a memory control apparatus generates a plurality of commands whose unit of data transfer is smaller than that of a memory access request, and when memory access requests are transmitted from a plurality of request sources, issues the plurality of commands to a memory in alternate order for each request source.
Abstract: A memory control apparatus generates a plurality of commands whose unit of data transfer is smaller than the unit of data transfer of a memory access request, and, when memory access requests are transmitted from a plurality of request sources, issues the plurality of commands to a memory in alternate order for each request source. The plurality of memory access requests are thereby executed concurrently by time division.
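A toy model of the claimed scheme: each source's large request is split into smaller fixed-size transfer commands, and the controller issues commands in alternate (round-robin) order across sources, so the requests proceed concurrently by time division. The sizes and source count below are invented.

```c
/* Round-robin command interleaving sketch: split each source's request
 * into fixed-size commands and issue them in alternate order across
 * sources, so all requests make progress concurrently. */
#include <stdio.h>

#define NSRC      3
#define CMD_BYTES 64

int main(void) {
    int remaining[NSRC] = { 256, 128, 192 };  /* request sizes per source */
    int pending = NSRC;
    while (pending > 0) {
        for (int s = 0; s < NSRC; s++) {      /* alternate order per source */
            if (remaining[s] <= 0) continue;
            printf("issue %d-byte command for source %d (%d left)\n",
                   CMD_BYTES, s, remaining[s] - CMD_BYTES);
            remaining[s] -= CMD_BYTES;
            if (remaining[s] <= 0) pending--;
        }
    }
    return 0;
}
```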

Patent
17 May 2010
TL;DR: Embodiments for managing memory faults are described, including a memory controller module that manages memory cells and reports memory faults, an error buffer module that stores memory fault information received from the memory controller, and a notification module in communication with the error buffer module.
Abstract: Embodiments are described for managing memory faults. An example system can include a memory controller module to manage memory cells and report memory faults. An error buffer module can store memory fault information received from the memory controller. A notification module can be in communication with the error buffer module. The notification module may generate a notification of a memory fault in a memory access operation. A system software module can provide services and manage executing programs on a processor. In addition, the system software module can receive the notifications of the memory fault for the memory access operation. A notification handler may be activated by an interrupt when the notification of the memory fault in the memory access operation is received.

Proceedings ArticleDOI
20 Oct 2010
TL;DR: A framework for automatic inference of memory fences in concurrent programs, assisting the programmer in this complex task is presented, and it is used to infer correct and efficient placements of fences for several non-trivial algorithms, including practical concurrent data structures.
Abstract: This paper addresses the problem of placing memory fences in a concurrent program running on a relaxed memory model. Modern architectures implement relaxed memory models which may reorder memory operations or execute them non-atomically. Special instructions called memory fences are provided to the programmer, allowing control of this behavior. To ensure correctness of many algorithms, in particular of non-blocking ones, a programmer is often required to explicitly insert memory fences into her program. However, she must use as few fences as possible, or the benefits of the relaxed architecture may be lost. Placing memory fences is challenging and very error prone, as it requires subtle reasoning about the underlying memory model. We present a framework for automatic inference of memory fences in concurrent programs, assisting the programmer in this complex task. Given a finite-state program, a safety specification and a description of the memory model, our framework computes a set of ordering constraints that guarantee the correctness of the program under the memory model. The computed constraints are maximally permissive: removing any constraint from the solution would permit an execution violating the specification. Our framework then realizes the computed constraints as additional fences in the input program. We implemented our approach in a tool called FENDER and used it to infer correct and efficient placements of fences for several non-trivial algorithms, including practical concurrent data structures.

Patent
Dotan Sokolov, Barak Rotbard
22 Mar 2010
TL;DR: In this paper, the authors present a non-volatile memory system for data storage, which includes a host having a host memory and a memory controller that is separate from the host.
Abstract: A method for data storage includes, in a system that includes a host having a host memory and a memory controller that is separate from the host and stores data for the host in a non-volatile memory including multiple analog memory cells, storing in the host memory information items relating to respective groups of the analog memory cells of the non-volatile memory. A command that causes the memory controller to access a given group of the analog memory cells is received from the host. In response to the command, a respective information item relating to the given group of the analog memory cells is retrieved from the host memory by the memory controller, and the given group of the analog memory cells is accessed using the retrieved information item.