
Showing papers on "Memory management published in 2011"


Proceedings ArticleDOI
05 Mar 2011
TL;DR: A lightweight, high-performance persistent object system called NV-heaps is implemented that provides transactional semantics while preventing these errors and providing a model for persistence that is easy to use and reason about.
Abstract: Persistent, user-defined objects present an attractive abstraction for working with non-volatile program state. However, the slow speed of persistent storage (i.e., disk) has restricted their design and limited their performance. Fast, byte-addressable, non-volatile technologies, such as phase change memory, will remove this constraint and allow programmers to build high-performance, persistent data structures in non-volatile storage that is almost as fast as DRAM. Creating these data structures requires a system that is lightweight enough to expose the performance of the underlying memories but also ensures safety in the presence of application and system failures by avoiding familiar bugs such as dangling pointers, multiple free()s, and locking errors. In addition, the system must prevent new types of hard-to-find pointer safety bugs that only arise with persistent objects. These bugs are especially dangerous since any corruption they cause will be permanent. We have implemented a lightweight, high-performance persistent object system called NV-heaps that provides transactional semantics while preventing these errors and providing a model for persistence that is easy to use and reason about. We implement search trees, hash tables, sparse graphs, and arrays using NV-heaps, BerkeleyDB, and Stasis. Our results show that NV-heap performance scales with thread count and that data structures implemented using NV-heaps outperform BerkeleyDB and Stasis implementations by 32x and 244x, respectively, by avoiding the operating system and minimizing other software overheads. We also quantify the cost of enforcing the safety guarantees that NV-heaps provide and measure the costs of NV-heap primitive operations.

850 citations
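
The failure-atomicity NV-heaps provides can be pictured with a minimal, self-contained sketch: a single-field undo-log transaction over a memory-mapped file. This is our illustration of the general technique, not the authors' code; account_t, heap.img, and tx_update are invented for the example.

```c
/* Minimal undo-log transaction over a memory-mapped file (sketch). A real
 * NV-heap adds allocation, pointer-safety checks, and GC on top of this. */
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

typedef struct { long balance; long log_valid; long log_balance; } account_t;

static void tx_update(account_t *a, long delta) {
    a->log_balance = a->balance;      /* 1. record the undo value        */
    a->log_valid = 1;
    msync(a, sizeof *a, MS_SYNC);     /* 2. persist the log entry        */
    a->balance += delta;              /* 3. apply the update             */
    msync(a, sizeof *a, MS_SYNC);     /* 4. persist the new state        */
    a->log_valid = 0;                 /* 5. retire the log entry         */
    msync(a, sizeof *a, MS_SYNC);
}

int main(void) {
    int fd = open("heap.img", O_RDWR | O_CREAT, 0644);
    ftruncate(fd, sizeof(account_t));
    account_t *a = mmap(NULL, sizeof(account_t), PROT_READ | PROT_WRITE,
                        MAP_SHARED, fd, 0);
    if (a->log_valid)                 /* recovery after a crash: roll back */
        a->balance = a->log_balance;
    tx_update(a, 100);
    munmap(a, sizeof(account_t));
    return close(fd);
}
```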


Proceedings ArticleDOI
12 Nov 2011
TL;DR: This paper proposes a new file system, called SCMFS, which is implemented on the virtual address space and utilizes the existing memory management module in the operating system to perform block management and keep the space contiguous for each file.
Abstract: This paper considers the problem of how to implement a file system on Storage Class Memory (SCM), which is directly connected to the memory bus, byte addressable, and non-volatile. In this paper, we propose a new file system, called SCMFS, which is implemented on the virtual address space. In SCMFS, we utilize the existing memory management module in the operating system to perform block management and to keep the space contiguous for each file. The simplicity of SCMFS not only makes it easy to implement, but also improves performance. We have implemented a prototype in Linux and evaluated its performance through multiple benchmarks.

331 citations
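
A user-space analogue of the paper's central idea, keeping each file contiguous in virtual address space while the memory manager attaches physical pages on demand, might look like the sketch below; scm_file, the 1 GiB region, and the function names are our stand-ins, not SCMFS's actual structures.

```c
/* Each file owns a large contiguous virtual region; growing the file only
 * commits more pages, so file data never fragments in the address space. */
#include <stddef.h>
#include <sys/mman.h>

#define FILE_VSPACE (1UL << 30)            /* 1 GiB address space per file */

typedef struct { unsigned char *base; size_t committed; } scm_file;

int scm_file_create(scm_file *f) {
    f->base = mmap(NULL, FILE_VSPACE, PROT_NONE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    f->committed = 0;
    return f->base == MAP_FAILED ? -1 : 0;
}

int scm_file_grow(scm_file *f, size_t new_size) {
    if (new_size <= f->committed) return 0;
    /* Commit pages in place; the region stays contiguous by construction. */
    if (mprotect(f->base, new_size, PROT_READ | PROT_WRITE) != 0) return -1;
    f->committed = new_size;
    return 0;
}
```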


Proceedings ArticleDOI
03 Dec 2011
TL;DR: In this paper, the authors present an alternative approach to reduce inter-application interference in the memory system: application-aware memory channel partitioning (MCP), which maps the data of applications that are likely to severely interfere with each other to different memory channels.
Abstract: Main memory is a major shared resource among cores in a multicore system. If the interference between different applications' memory requests is not controlled effectively, system performance can degrade significantly. Previous work aimed to mitigate the problem of interference between applications by changing the scheduling policy in the memory controller, i.e., by prioritizing memory requests from applications in a way that benefits system performance. In this paper, we first present an alternative approach to reducing inter-application interference in the memory system: application-aware memory channel partitioning (MCP). The idea is to map the data of applications that are likely to severely interfere with each other to different memory channels. The key principles are to partition onto separate channels 1) the data of light (memory non-intensive) and heavy (memory-intensive) applications, and 2) the data of applications with low and high row-buffer locality. Second, we observe that interference can be further reduced with a combination of memory channel partitioning and scheduling, which we call integrated memory partitioning and scheduling (IMPS). The key idea is to 1) always prioritize very light applications in the memory scheduler, since such applications cause negligible interference to others, and 2) use MCP to reduce interference among the remaining applications. We evaluate MCP and IMPS on a variety of multi-programmed workloads and system configurations and compare them to four previously proposed state-of-the-art memory scheduling policies. Averaged over 240 workloads on a 24-core system with 4 memory channels, MCP improves system throughput by 7.1% over an application-unaware memory scheduler and 1% over the previous best scheduler, while avoiding modifications to existing memory schedulers. IMPS improves system throughput by 11.1% over an application-unaware scheduler and 5% over the previous best scheduler, while incurring much lower hardware complexity than the latter.

281 citations
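
The classification step behind MCP can be sketched as a small decision function: bucket applications by memory intensity and row-buffer locality, then steer each bucket's pages to its own channel group. The thresholds below are illustrative, not the paper's tuned values.

```c
/* Per-application profile gathered by the OS (sketch). */
typedef struct { double mpki; double row_hit_rate; } app_profile;

/* Map an application to a channel group per the two MCP principles:
 * separate light from heavy, and high-locality from low-locality. */
int mcp_pick_channel_group(const app_profile *p) {
    if (p->mpki < 1.0)         return 0;  /* light apps: isolated group    */
    if (p->row_hit_rate > 0.5) return 1;  /* heavy, high row-buffer locality */
    return 2;                             /* heavy, low row-buffer locality  */
}
```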


Proceedings ArticleDOI
05 Mar 2011
TL;DR: The results demonstrate that MemScale reduces energy consumption significantly compared to modern memory energy management approaches, and it is concluded that the potential benefits of the MemScale mechanisms and policy more than compensate for their small hardware cost.
Abstract: Main memory is responsible for a large and increasing fraction of the energy consumed by servers. Prior work has focused on exploiting DRAM low-power states to conserve energy. However, these states require entire DRAM ranks to be idled, which is difficult to achieve even in lightly loaded servers. In this paper, we propose to conserve memory energy while improving its energy-proportionality by creating active low-power modes for it. Specifically, we propose MemScale, a scheme wherein we apply dynamic voltage and frequency scaling (DVFS) to the memory controller and dynamic frequency scaling (DFS) to the memory channels and DRAM devices. MemScale is guided by an operating system policy that determines the DVFS/DFS mode of the memory subsystem based on the current need for memory bandwidth, the potential energy savings, and the performance degradation that applications are willing to withstand. Our results demonstrate that MemScale reduces energy consumption significantly compared to modern memory energy management approaches. We conclude that the potential benefits of the MemScale mechanisms and policy more than compensate for their small hardware cost.

277 citations
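
The OS policy can be imagined as a periodic loop that picks the slowest memory frequency whose predicted slowdown stays within the allowed degradation; the toy performance model below is a stand-in for the paper's counter-based model, and all names are ours.

```c
#include <stddef.h>

typedef struct { int mhz; double peak_gbs; } mem_mode;  /* sorted slow->fast */

/* Pick the slowest DVFS/DFS mode whose estimated slowdown stays within the
 * user-specified bound (e.g. 0.10 for 10%). */
int memscale_pick_mode(const mem_mode *modes, size_t n,
                       double demand_gbs, double max_slowdown) {
    for (size_t i = 0; i < n; i++) {
        double util = demand_gbs / modes[i].peak_gbs;
        double est_slowdown = util > 0.8 ? util - 0.8 : 0.0;  /* toy model */
        if (est_slowdown <= max_slowdown)
            return (int)i;                /* slowest acceptable mode wins */
    }
    return (int)n - 1;                    /* nothing fits: run full speed */
}
```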


Book
17 Aug 2011
TL;DR: The Garbage Collection Handbook: The Art of Automatic Memory Management brings together a wealth of knowledge gathered by automatic memory management researchers and developers over the past fifty years and compares the most important approaches and state-of-the-art techniques in a single, accessible framework.
Abstract: Published in 1996, Richard Jones's Garbage Collection was a milestone in the area of automatic memory management. The field has grown considerably since then, sparking a need for an updated look at the latest state-of-the-art developments. The Garbage Collection Handbook: The Art of Automatic Memory Management brings together a wealth of knowledge gathered by automatic memory management researchers and developers over the past fifty years. The authors compare the most important approaches and state-of-the-art techniques in a single, accessible framework. The book addresses new challenges to garbage collection made by recent advances in hardware and software. It explores the consequences of these changes for designers and implementers of high performance garbage collectors. Along with simple and traditional algorithms, the book covers parallel, incremental, concurrent, and real-time garbage collection. Algorithms and concepts are often described with pseudocode and illustrations. The nearly universal adoption of garbage collection by modern programming languages makes a thorough understanding of this topic essential for any programmer. This authoritative handbook gives expert insight on how different collectors work as well as the various issues currently facing garbage collectors. Armed with this knowledge, programmers can confidently select and configure the many choices of garbage collectors. Web Resource: The book's online bibliographic database at www.gchandbook.org includes over 2,500 garbage collection-related publications. Continually updated, it contains abstracts for some entries and URLs or DOIs for most of the electronically available ones. The database can be searched online or downloaded as BibTeX, PostScript, or PDF.

211 citations
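
For readers new to the area, the "simple and traditional algorithms" the book starts from reduce to something like the mark-sweep fragment below; real collectors replace the recursion with an explicit work list and add the parallel, incremental, and concurrent machinery the book describes.

```c
#include <stddef.h>

typedef struct obj { struct obj *left, *right; int marked; } obj;

/* Mark: flag everything reachable from a root. */
static void mark(obj *o) {
    if (o == NULL || o->marked) return;
    o->marked = 1;
    mark(o->left);
    mark(o->right);
}

/* Sweep: reclaim unmarked objects and reset marks on survivors. */
static void sweep(obj *heap[], int n, void (*reclaim)(obj *)) {
    for (int i = 0; i < n; i++) {
        if (heap[i] == NULL) continue;
        if (!heap[i]->marked) { reclaim(heap[i]); heap[i] = NULL; }
        else heap[i]->marked = 0;
    }
}
```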


Proceedings ArticleDOI
02 Apr 2011
TL;DR: Dr. Memory is presented, a memory checking tool that operates on both Windows and Linux applications that handles the complex and not fully documented Windows environment, and avoids reporting false positive memory leaks that plague traditional leak locating algorithms.
Abstract: Memory corruption, reading uninitialized memory, using freed memory, and other memory-related errors are among the most difficult programming bugs to identify and fix due to the delay and non-determinism linking the error to an observable symptom. Dedicated memory checking tools are invaluable for finding these errors. However, such tools are difficult to build, and because they must monitor all memory accesses by the application, they incur significant overhead. Accuracy is another challenge: memory errors are not always straightforward to identify, and numerous false positive error reports can make a tool unusable. A third obstacle to creating such a tool is that it depends on low-level operating system and architectural details, making it difficult to port to other platforms and difficult to target proprietary systems like Windows. This paper presents Dr. Memory, a memory checking tool that operates on both Windows and Linux applications. Dr. Memory handles the complex and not fully documented Windows environment, and avoids reporting false positive memory leaks that plague traditional leak locating algorithms. Dr. Memory employs efficient instrumentation techniques; a direct comparison with the state-of-the-art Valgrind Memcheck tool reveals that Dr. Memory is twice as fast as Memcheck on average and up to four times faster on individual benchmarks.

171 citations
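
Tools in this class typically maintain shadow memory that records, for every application byte, whether it is addressable and whether it is defined. The sketch below shows that bookkeeping conceptually; the encoding, arena, and callback names are ours, not Dr. Memory's actual implementation.

```c
#include <stdint.h>
#include <stdio.h>

#define ARENA (1u << 20)
static uint8_t shadow[ARENA];  /* 0 = unaddressable, 1 = undefined, 2 = defined */

/* Callbacks the tool's instrumentation would insert around each access;
 * 'off' is the offset of the access within the tracked arena. */
void on_alloc(uint32_t off, uint32_t n) { while (n--) shadow[off++] = 1; }
void on_write(uint32_t off, uint32_t n) { while (n--) shadow[off++] = 2; }
void on_free (uint32_t off, uint32_t n) { while (n--) shadow[off++] = 0; }

void on_read(uint32_t off, uint32_t n) {
    for (uint32_t i = 0; i < n; i++) {
        if (shadow[off + i] == 0)
            fprintf(stderr, "access to unaddressable byte %u\n", off + i);
        else if (shadow[off + i] == 1)
            fprintf(stderr, "read of uninitialized byte %u\n", off + i);
    }
}
```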


Proceedings ArticleDOI
04 Jun 2011
TL;DR: i-NVMM is introduced, a data privacy protection scheme for NVMM in which the main memory is encrypted incrementally, i.e., different data in the main memory is encrypted at different times depending on whether the data is predicted to still be useful to the processor.
Abstract: Emerging technologies for building non-volatile main memory (NVMM) systems suffer from a security vulnerability where information lingers on long after the system is powered down, enabling an attacker with physical access to the system to extract sensitive information off the memory. The goal of this study is to find a solution for such a security vulnerability. We introduce i-NVMM, a data privacy protection scheme for NVMM, where the main memory is encrypted incrementally, i.e., different data in the main memory is encrypted at different times depending on whether the data is predicted to still be useful to the processor. The motivation behind incremental encryption is the observation that the working set of an application is much smaller than its resident set. By identifying the working set and encrypting the remaining part of the resident set, i-NVMM can keep the majority of the main memory encrypted at all times without penalizing performance much. Our experiments demonstrate promising results. i-NVMM keeps 78% of the main memory encrypted across SPEC2006 benchmarks, yet incurs only 3.7% execution time overhead, and has a negligible impact on the write endurance of NVMM, all achieved with relatively simple hardware support in the memory module.

162 citations
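
Incremental encryption can be sketched as an epoch loop: pages untouched during an epoch are predicted cold and encrypted in the background, and encrypted pages are decrypted on demand when touched again. The structures are invented for illustration, and crypt_page stands in for the memory module's real cipher engine.

```c
#include <stdint.h>

#define NPAGES 1024
static struct { uint8_t touched, encrypted; } pm[NPAGES];

/* Placeholder for the memory module's real cipher engine. */
static void crypt_page(int page) { (void)page; }

void on_access(int page) {                /* on every CPU touch */
    pm[page].touched = 1;
    if (pm[page].encrypted) { crypt_page(page); pm[page].encrypted = 0; }
}

void epoch_end(void) {                    /* periodic background sweep */
    for (int p = 0; p < NPAGES; p++) {
        if (!pm[p].touched && !pm[p].encrypted) {
            crypt_page(p);                /* cold page: leave it encrypted */
            pm[p].encrypted = 1;
        }
        pm[p].touched = 0;                /* start the next epoch clean */
    }
}
```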


Proceedings ArticleDOI
23 May 2011
TL;DR: A static wear leveling algorithm, named Rejuvenator, is proposed that is adaptive to changes in workloads and minimizes the cost of expensive data migrations for large-scale NAND flash memory.
Abstract: NAND flash memory is fast replacing traditional magnetic storage media due to its better performance and low power requirements. However, the endurance of flash memory is still a critical issue in using it for large-scale enterprise applications. Rethinking the basic design of NAND flash memory is essential to realize its maximum potential in large-scale storage. NAND flash memory is organized as blocks, and blocks in turn have pages. A block can be erased reliably only a limited number of times, and frequent block erase operations on a few blocks reduce the lifetime of the flash memory. Wear leveling helps to prevent the early wear-out of blocks in the flash memory. In order to achieve efficient wear leveling, data is moved around throughout the flash memory. The existing wear leveling algorithms do not scale for large-scale NAND flash based SSDs. In this paper we propose a static wear leveling algorithm, named Rejuvenator, for large-scale NAND flash memory. Rejuvenator is adaptive to changes in workloads and minimizes the cost of expensive data migrations. Our evaluation of Rejuvenator is based on detailed simulations with large-scale enterprise workloads and synthetic micro-benchmarks.

160 citations
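
The core of a static wear leveler of this kind is keeping every block's erase count inside a bounded window and steering hot data to the least-worn blocks. A minimal sketch, with list maintenance and migration triggers omitted and the window width TAU invented:

```c
#include <limits.h>

#define NBLOCKS 4096
#define TAU 16                        /* allowed wear window (illustrative) */
static int erase_cnt[NBLOCKS];

static int min_wear(void) {
    int m = INT_MAX;
    for (int b = 0; b < NBLOCKS; b++)
        if (erase_cnt[b] < m) m = erase_cnt[b];
    return m;
}

/* Hot data -> least-worn block; cold data -> most-worn block still inside
 * the window [min_wear, min_wear + TAU). */
int pick_block(int is_hot) {
    int lo = min_wear(), best = -1;
    for (int b = 0; b < NBLOCKS; b++) {
        if (erase_cnt[b] >= lo + TAU) continue;       /* outside window */
        if (best < 0 ||
            ( is_hot && erase_cnt[b] < erase_cnt[best]) ||
            (!is_hot && erase_cnt[b] > erase_cnt[best]))
            best = b;
    }
    return best;
}
```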


Proceedings ArticleDOI
30 Mar 2011
TL;DR: SSDAlloc is introduced, a hybrid main memory management system that allows developers to treat solid-state disk as an extension of the RAM in a system, usable as a larger, slower form of RAM instead of just a cache for the hard drive.
Abstract: We introduce SSDAlloc, a hybrid main memory management system that allows developers to treat solid-state disk (SSD) as an extension of the RAM in a system. SSDAlloc moves the SSD upward in the memory hierarchy, usable as a larger, slower form of RAM instead of just a cache for the hard drive. Using SSDAlloc, applications can nearly transparently extend their memory footprints to hundreds of gigabytes and beyond without restructuring, well beyond the RAM capacities of most servers. Additionally, SSDAlloc can extract 90% of the SSD's raw performance while increasing the lifetime of the SSD by up to 32 times. Other approaches either require intrusive application changes or deliver only 6-30% of the SSD's raw performance.

149 citations
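
The "SSD as slower RAM" idea can be approximated in user space with protection faults: reserve a large region, then materialize a page from the SSD on first touch. This is a hedged sketch of the general mechanism, with error handling and SSDAlloc's object-per-page layout omitted; the ssd_read call in the comment is hypothetical.

```c
#include <signal.h>
#include <stdint.h>
#include <sys/mman.h>

#define REGION (1UL << 30)                 /* virtual arena backed by SSD */
static unsigned char *region;

static void fault(int sig, siginfo_t *si, void *ctx) {
    (void)sig; (void)ctx;
    void *page = (void *)((uintptr_t)si->si_addr & ~4095UL);
    mprotect(page, 4096, PROT_READ | PROT_WRITE);
    /* ssd_read(page_offset(page), page, 4096);  <- hypothetical SSD fetch */
}

int main(void) {
    struct sigaction sa = { .sa_sigaction = fault, .sa_flags = SA_SIGINFO };
    sigaction(SIGSEGV, &sa, NULL);
    region = mmap(NULL, REGION, PROT_NONE,
                  MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    region[12345] = 7;   /* faults once; later touches run at RAM speed */
    return 0;
}
```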


Proceedings ArticleDOI
03 Dec 2011
TL;DR: This paper proposes a memory scheduling algorithm designed specifically for parallel applications, targeting two common synchronization primitives that cause inter-dependence of threads: locks and barriers, and shows that it speeds up a set of memory-intensive parallel applications by 12.6% compared to the best previous memory scheduling technique.
Abstract: A primary use of chip-multiprocessor (CMP) systems is to speed up a single application by exploiting thread-level parallelism. In such systems, threads may slow each other down by issuing memory requests that interfere in the shared memory subsystem. This inter-thread memory system interference can significantly degrade parallel application performance. Better memory request scheduling may mitigate such performance degradation. However, previously proposed memory scheduling algorithms for CMPs are designed for multi-programmed workloads where each core runs an independent application, and thus do not take into account the inter-dependent nature of threads in a parallel application. In this paper, we propose a memory scheduling algorithm designed specifically for parallel applications. Our approach has two main components, targeting two common synchronization primitives that cause inter-dependence of threads: locks and barriers. First, the runtime system estimates threads holding the locks that cause the most serialization as the set of limiter threads, which are prioritized by the memory scheduler. Second, the memory scheduler shuffles thread priorities to reduce the time threads take to reach the barrier. We show that our memory scheduler speeds up a set of memory-intensive parallel applications by 12.6% compared to the best previous memory scheduling technique.

147 citations
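
The two ideas translate into a simple priority rule: limiter threads (estimated from lock contention, which we elide here) always get top priority, and the remaining threads' priorities are shuffled each epoch so they progress toward the barrier together. A toy version, with all names invented:

```c
#define NTHREADS 16
static int priority[NTHREADS];   /* 0 is highest */

/* is_limiter[] comes from the runtime's lock-serialization estimate. */
void assign_priorities(const int *is_limiter, int epoch) {
    for (int t = 0; t < NTHREADS; t++) {
        if (is_limiter[t])
            priority[t] = 0;                        /* always prioritized */
        else
            priority[t] = 1 + (t + epoch) % NTHREADS;  /* rotating shuffle */
    }
}
```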


Proceedings ArticleDOI
23 May 2011
TL;DR: A novel hot data identification scheme adopting multiple bloom filters to efficiently capture finer-grained recency as well as frequency is proposed and demonstrates that its multiple bloom filter-based scheme outperforms the state-of-the-art scheme.
Abstract: Hot data identification can be applied to a variety of fields. Particularly in flash memory, it has a critical impact on performance (due to garbage collection) as well as life span (due to wear leveling). Although hot data identification is an issue of paramount importance in flash memory, little investigation has been made. Moreover, all existing schemes focus almost exclusively on a frequency viewpoint. However, recency must also be considered equally with frequency for effective hot data identification. In this paper, we propose a novel hot data identification scheme adopting multiple bloom filters to efficiently capture finer-grained recency as well as frequency. In addition to this scheme, we propose a Window-based Direct Address Counting (WDAC) algorithm to approximate ideal hot data identification as our baseline. Unlike the existing baseline algorithm, which cannot appropriately capture recency information due to its exponential batch decay, our WDAC algorithm, using a sliding window concept, can capture very fine-grained recency information. Our experimental evaluation with diverse realistic workloads, including real SSD traces, demonstrates that our multiple bloom filter-based scheme outperforms the state-of-the-art scheme. In particular, ours not only consumes 50% less memory and requires up to 58% less computational overhead, but also improves performance by up to 65%.
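
A minimal sketch of the multiple-bloom-filter idea, assuming invented sizes and hash functions: writes populate the current filter, filters are rotated periodically so old ones decay away, and the number of filters containing an LBA serves as a combined frequency-and-recency hotness score.

```c
#include <stdint.h>
#include <string.h>

#define V 4                                /* number of bloom filters */
#define M 4096                             /* bits per filter */
static uint8_t bf[V][M / 8];
static int cur;                            /* filter receiving new writes */

static unsigned h1(uint32_t x) { return (x * 2654435761u) % M; }
static unsigned h2(uint32_t x) { return ((x ^ (x >> 13)) * 0x9e3779b1u) % M; }

void record_write(uint32_t lba) {
    bf[cur][h1(lba) / 8] |= 1u << (h1(lba) % 8);
    bf[cur][h2(lba) / 8] |= 1u << (h2(lba) % 8);
}

/* Hotness = number of filters containing the LBA: recent, frequent writes
 * appear in many filters; stale ones fade as filters are recycled. */
int hotness(uint32_t lba) {
    int score = 0;
    for (int v = 0; v < V; v++)
        if ((bf[v][h1(lba) / 8] >> (h1(lba) % 8) & 1) &&
            (bf[v][h2(lba) / 8] >> (h2(lba) % 8) & 1))
            score++;
    return score;                          /* compare against a threshold */
}

void decay_epoch(void) {                   /* rotate: oldest filter is reset */
    cur = (cur + 1) % V;
    memset(bf[cur], 0, sizeof bf[cur]);
}
```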

Patent
David Flynn, David Nellans, John Strasser, James G. Peterson, Robert Wipfel
13 Dec 2011
TL;DR: An auto-commit memory is capable of implementing a pre-configured, triggered commit action in response to a failure condition, such as a loss of power, invalid shutdown, fault, or the like.
Abstract: An auto-commit memory is capable of implementing a pre-configured, triggered commit action in response to a failure condition, such as a loss of power, invalid shutdown, fault, or the like. A computing device may access the auto-commit memory using memory access semantics (using a memory mapping mechanism or the like), bypassing system calls typically required in virtual memory operations. Since the auto-commit memory is pre-configured to commit data stored thereon in the event of a failure, users of the auto-commit memory may view these memory semantic operations as being instantly committed. Since operations to commit the data are taken out of the write-commit path, the performance of applications that are write-commit bound may be significantly improved.

Proceedings ArticleDOI
12 Nov 2011
TL;DR: This paper proposes a simple API, Dymaxion, that allows programmers to optimize memory mappings to improve the efficiency of memory accesses on heterogeneous platforms and achieves 3.3× speedup on GPU kernels and 20% overall performance improvement, including the PCI-E transfer, over the original CUDA implementations on an NVIDIA GTX 480 GPU.
Abstract: Graphics processors (GPUs) have emerged as an important platform for general purpose computing. GPUs offer a large number of parallel cores and have access to high memory bandwidth; however, data structure layouts in GPU memory often lead to sub-optimal performance for programs designed with a CPU memory interface — or no particular memory interface at all! — in mind. This implies that application performance is highly sensitive to irregularity in memory access patterns. This issue is all the more important due to the growing disparity between core and DRAM clocks; memory interfaces have increasingly become bottlenecks in computer systems. In this paper, we propose a simple API, Dymaxion, that allows programmers to optimize memory mappings to improve the efficiency of memory accesses on heterogeneous platforms. Use of Dymaxion requires only minimal modifications to existing CUDA programs. Our current framework extends NVIDIA's CUDA API with the addition of memory layout remapping and index transformation. We consider the overhead of layout remapping and effectively hide it through chunking and overlapping with PCI-E transfer. We present the implementation of Dymaxion and its optimizations and evaluate a variety of important memory access patterns. Using four case studies, we are able to achieve 3.3× speedup on GPU kernels and 20% overall performance improvement, including the PCI-E transfer, over the original CUDA implementations on an NVIDIA GTX 480 GPU. We also explore the importance of maintaining per-device data layouts and cross-device data mappings with a case study of concurrent CPU-GPU execution.
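
A representative memory-layout remapping of the kind such an API performs is the array-of-structures to structure-of-arrays transformation, which makes neighboring GPU threads touch neighboring addresses (coalesced access). The function below is our host-side illustration, not Dymaxion's actual API.

```c
#include <stddef.h>

typedef struct { float x, y, z; } point;

/* AoS -> SoA index transformation: field f of element i lands at f*n + i,
 * so thread i reading field x touches out[i], out[i+1] is thread i+1, etc. */
void remap_aos_to_soa(const point *in, float *out, size_t n) {
    for (size_t i = 0; i < n; i++) {
        out[0 * n + i] = in[i].x;
        out[1 * n + i] = in[i].y;
        out[2 * n + i] = in[i].z;
    }
}
```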


Proceedings ArticleDOI
04 Jun 2011
TL;DR: This work presents a novel device, the SpecTLB, that exploits the predictable behavior of reservation-based physical memory allocators to interpolate address translations, effectively enabling the use of small pages for fine-grained allocation and protection while avoiding the associated latency penalties of small pages.
Abstract: Data-intensive computing applications are using more and more memory and are placing an increasing load on the virtual memory system. While the use of large pages can help alleviate the overhead of address translation, they limit the control the operating system has over memory allocation and protection. We present a novel device, the SpecTLB, that exploits the predictable behavior of reservation-based physical memory allocators to interpolate address translations. Our device provides speculative translations for many TLB misses on small pages without referencing the page table. While these interpolations must be confirmed, doing so can be done in parallel with speculative execution. This effectively hides the execution latency of these TLB misses. In simulation, the SpecTLB is able to overlap an average of 57% of page table walks with successful speculative execution over a wide variety of applications. We also show that the SpecTLB outperforms a state-of-the-art TLB prefetching scheme for virtually all tested applications with significant TLB miss rates. Moreover, we show that the SpecTLB is efficient since mispredictions are extremely rare, occurring in less than 1% of TLB misses. In essence, the SpecTLB effectively enables the use of small pages to achieve fine-grained allocation and protection, while avoiding the associated latency penalties of small pages.
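
The interpolation itself can be sketched compactly: if a reservation's base translation is known, a TLB miss inside that reservation predicts the physical address at the same offset, to be confirmed by the real page-table walk running in the background. The structures and the 2 MiB reservation size are illustrative.

```c
#include <stdint.h>

#define RES_SHIFT 21                 /* 2 MiB reservation (illustrative) */

typedef struct { uint64_t vres, pres; int valid; } res_entry;

/* On a TLB miss, a matching reservation yields a speculative physical
 * address: the same offset inside the reservation as on the virtual side. */
int spectlb_predict(const res_entry *e, uint64_t vaddr, uint64_t *spec_pa) {
    if (!e->valid || (vaddr >> RES_SHIFT) != e->vres) return 0;
    uint64_t off = vaddr & ((1ULL << RES_SHIFT) - 1);
    *spec_pa = (e->pres << RES_SHIFT) | off;
    return 1;                        /* use speculatively until confirmed */
}
```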

Patent
29 Dec 2011
TL;DR: In this paper, the authors describe a storage subsystem that includes a main memory area that is accessible via standard memory access commands (such as ATA commands), and a restricted memory area which is accessible only via one or more non-standard commands.
Abstract: A solid-state storage subsystem, such as a non-volatile memory card or drive, includes a main memory area that is accessible via standard memory access commands (such as ATA commands), and a restricted memory area that is accessible only via one or more non-standard commands. The restricted memory area stores information used to control access to, and/or use of, information stored in the main memory area. As one example, the restricted area may store one or more identifiers, such as a unique subsystem identifier, needed to decrypt an executable or data file stored in the main memory area. A host software component is configured to retrieve the information from the subsystem's restricted memory area, and to use the information to control access to and/or use of the information in the main memory area.

Proceedings ArticleDOI
04 Jun 2011
TL;DR: C4, the Continuously Concurrent Compacting Collector, an updated generational form of the Pauseless GC Algorithm, is introduced and described, along with details of its implementation on modern X86 hardware.
Abstract: C4, the Continuously Concurrent Compacting Collector, an updated generational form of the Pauseless GC Algorithm [7], is introduced and described, along with details of its implementation on modern X86 hardware. It uses a read barrier to support concurrent compaction, concurrent remapping, and concurrent incremental update tracing. C4 differentiates itself from other generational garbage collectors by supporting simultaneous-generational concurrency: the different generations are collected using concurrent (non stop-the-world) mechanisms that can be simultaneously and independently active. C4 is able to continuously perform concurrent young generation collections, even during long periods of concurrent full heap collection, allowing C4 to sustain high allocation rates and maintain the efficiency typical of generational collectors, without sacrificing response times or reverting to stop-the-world operation. Azul Systems has been shipping a commercial implementation of the Pauseless GC mechanism since 2005. Three successive generations of Azul's Vega series systems relied on custom multi-core processors and a custom OS kernel to deliver both the scale and features needed to support Pauseless GC. In 2010, Azul released its first software-only commercial implementation of C4 for modern commodity X86 hardware, using Linux kernel enhancements to support the required feature set. We discuss implementation details of C4 on X86, including the Linux virtual and physical memory management enhancements that were used to support the high rate of virtual memory operations required for sustained pauseless operation. We discuss updates to the collector's management of the heap for efficient generational collection and provide throughput and pause time data while running sustained workloads.
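
The read barrier with self-healing that the Pauseless family relies on can be pictured in a few lines: every loaded reference is checked, and a stale reference to a relocated object is fixed up in its source slot before use. The per-object fields below are stand-ins; C4 actually encodes this state in page metadata and in the reference bits themselves.

```c
typedef struct obj { struct obj *fwd; int relocating; } obj;

/* Loaded-value barrier: check each reference as it is loaded, follow the
 * forwarding pointer if its object is being compacted, and heal the source
 * slot so no later load pays the cost again. */
obj *load_ref(obj **slot) {
    obj *ref = *slot;
    if (ref && ref->relocating) {
        ref = ref->fwd;              /* object moved: use the new copy   */
        *slot = ref;                 /* self-heal the stale reference    */
    }
    return ref;
}
```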

Proceedings ArticleDOI
Zoltan Majo, Thomas R. Gross
04 Jun 2011
TL;DR: This work presents a detailed analysis of a commercially available NUMA-multicore architecture, the Intel Nehalem, and describes two scheduling algorithms: maximum-local, which optimizes for maximum data locality, and N-MASS, which reduces data locality to avoid the performance degradation caused by cache contention.
Abstract: Multiprocessors based on processors with multiple cores usually include a non-uniform memory architecture (NUMA); even current 2-processor systems with 8 cores exhibit non-uniform memory access times. As the cores of a processor share a common cache, the issues of memory management and process mapping must be revisited. We find that optimizing only for data locality can counteract the benefits of cache contention avoidance and vice versa. Therefore, system software must take both data locality and cache contention into account to achieve good performance, and memory management cannot be decoupled from process scheduling. We present a detailed analysis of a commercially available NUMA-multicore architecture, the Intel Nehalem. We describe two scheduling algorithms: maximum-local, which optimizes for maximum data locality, and its extension, N-MASS, which reduces data locality to avoid the performance degradation caused by cache contention. N-MASS is fine-tuned to support memory management on NUMA-multicores and improves performance by up to 32%, and by 7% on average, over the default setup in current Linux implementations.

Proceedings ArticleDOI
14 Mar 2011
TL;DR: This paper proposes a heterogeneous main memory with three different memory modules, where each module is heavily optimized for one of the three parameters at the cost of compromising the other two.
Abstract: Main memory plays a critical role in a computer system's performance and energy efficiency. Three key parameters define a main memory system's efficiency: latency, bandwidth, and power. Current memory systems try to balance these three parameters to achieve reasonable efficiency for most programs. However, in a multi-core system, applications with various memory demands are executed simultaneously. This paper proposes a heterogeneous main memory with three different memory modules, where each module is heavily optimized for one of the three parameters at the cost of compromising the other two. Based on the memory access characteristics of an application, the operating system allocates its pages in a memory module that satisfies its memory requirements. When compared to a homogeneous memory system, we demonstrate through cycle-accurate simulations that our design results in about a 13.5% increase in system performance and a 20% improvement in memory power.
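
The OS placement policy the abstract describes amounts to classifying each application's pages and steering them to the module that matches the dominant need; the classification statistics and thresholds below are invented for illustration.

```c
enum mem_module { LATENCY_OPT, BANDWIDTH_OPT, POWER_OPT };

/* Steer pages to the module optimized for the application's dominant need. */
enum mem_module place_page(double accesses_per_kinstr, double bw_share) {
    if (accesses_per_kinstr < 0.5) return POWER_OPT;     /* cold pages      */
    if (bw_share > 0.3)            return BANDWIDTH_OPT; /* streaming pages */
    return LATENCY_OPT;                                  /* pointer-chasing */
}
```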

Proceedings ArticleDOI
Zhiwei Qin, Yi Wang, Duo Liu, Zili Shao, Yong Guan
05 Jun 2011
TL;DR: This paper proposes two approaches, namely concentrated mapping and postponed reclamation, to effectively reduce valid page copies in the design of the MLC flash translation layer and thereby reduce garbage collection overhead.
Abstract: The new write constraints of multi-level cell (MLC) NAND flash memory make most of the existing flash translation layer (FTL) schemes inefficient or inapplicable. In this paper, we solve several fundamental problems in the design of the MLC flash translation layer. The objective is to reduce the garbage collection overhead so as to reduce the average system response time. We make the key observation that valid page copies are the essential garbage collection overhead. Based on this observation, we propose two approaches, namely concentrated mapping and postponed reclamation, to effectively reduce valid page copies. We conduct experiments on a set of benchmarks from both the real world and synthetic traces. The experimental results show that our scheme achieves a significant reduction in the average system response time compared with previous work.
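
The paper's key observation, that copying still-valid pages is the essential garbage-collection cost, is easy to make concrete with a greedy victim selector; concentrated mapping then aims to make low-valid-count victim blocks common by clustering hot data together. A sketch with invented sizes:

```c
#define NBLK 1024

/* Reclaim the block with the fewest valid pages: fewer valid pages means
 * fewer copies before the erase, i.e. less garbage-collection overhead. */
int pick_victim(const int valid_pages[NBLK]) {
    int best = 0;
    for (int b = 1; b < NBLK; b++)
        if (valid_pages[b] < valid_pages[best]) best = b;
    return best;
}
```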

Proceedings ArticleDOI
27 Oct 2011
TL;DR: This paper presents CloudSec, a new virtualization-aware monitoring appliance that provides active, transparent and real-time security monitoring for hosted VMs in the IaaS model and evaluates the system monitoring accuracy and the performance overhead of CloudSec.
Abstract: The Infrastructure-as-a-Service (IaaS) cloud computing model has become a compelling computing solution with a proven ability to reduce costs and improve resource efficiency. Virtualization has a key role in supporting the IaaS model. However, virtualization also makes the infrastructure a target for potent rootkits because of the loss-of-control problem over the hosted Virtual Machines (VMs). This makes traditional in-guest security solutions, which rely on operating system kernel trustworthiness, no longer effective for securing the virtual infrastructure of the IaaS model. In this paper, we briefly explore the security problem of the IaaS cloud computing model, and present CloudSec, a new virtualization-aware monitoring appliance that provides active, transparent and real-time security monitoring for hosted VMs in the IaaS model. CloudSec utilizes virtual machine introspection techniques to provide fine-grained inspection of a VM's physical memory without installing any monitoring code inside the VM. It actively reconstructs and monitors the dynamically changing kernel data structure instances, as a prior step toward providing protection for kernel data structures. We have implemented a proof-of-concept prototype using VMsafe libraries on a VMware ESX platform. We have evaluated the system monitoring accuracy and the performance overhead of CloudSec.

Proceedings ArticleDOI
05 Jun 2011
TL;DR: This paper implements a set of storage management techniques in the open source CCNx prototype and performs extensive experiments in a real testbed under fairly realistic network conditions, clarifying the relation between CCN chunk-level caching and the Quality of Experience (QoE) perceived by end users.
Abstract: Content-Centric Networking is a new communication architecture that rethinks the Internet communication model, designed for point-to-point connections between hosts, and centers it around content dissemination and retrieval. Most of the issues faced by the current IP infrastructure in terms of mobility management, security, and scalability, which are exacerbated by today's Internet trends, find a natural solution in CCN's shift from IP addresses to named data. In this paper we explore the impact of storage management on the performance of multiple applications sharing the same CCN infrastructure, and we quantify the effectiveness of static storage partitioning and dynamic management techniques in providing service differentiation. To this purpose, we implement a set of storage management techniques in the open source CCNx prototype and perform extensive experiments in a real testbed under fairly realistic network conditions. Our experimental results allow us to clarify the relation between CCN chunk-level caching and the Quality of Experience (QoE) perceived by end users.

Proceedings ArticleDOI
04 Jun 2011
TL;DR: The evaluation shows that performance is improved by 61% without ECC and 44% with ECC in memory-intensive applications, while the reduction in memory power consumption and traffic is significant.
Abstract: We propose adaptive granularity to combine the best of fine-grained and coarse-grained memory accesses. We augment virtual memory to allow each page to specify its preferred granularity of access based on spatial locality and error-tolerance tradeoffs. We use sector caches and sub-ranked memory systems to implement adaptive granularity. We also show how to incorporate adaptive granularity into memory access scheduling. We evaluate our architecture with and without ECC using memory intensive benchmarks from the SPEC, Olden, PARSEC, SPLASH2, and HPCS benchmark suites and micro-benchmarks. The evaluation shows that performance is improved by 61% without ECC and 44% with ECC in memory-intensive applications, while the reduction in memory power consumption (29% without ECC and 14% with ECC) and traffic (78% without ECC and 66% with ECC) is significant.
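
The mechanism hinges on letting each page declare its preferred access width to the memory system. A minimal sketch, with an invented per-page hint and threshold standing in for the paper's page-table extension and profiling:

```c
#include <stdint.h>

/* One per-page hint selects the fetch width the memory system should use. */
struct page_info { uint64_t pfn; uint8_t fine_grain; };

unsigned fetch_bytes(const struct page_info *p) {
    return p->fine_grain ? 8u : 64u;  /* low spatial locality: narrow fetch */
}

/* The OS sets the hint from an observed spatial-locality statistic. */
void set_granularity(struct page_info *p, double words_used_per_line) {
    p->fine_grain = words_used_per_line < 2.0;   /* threshold is invented */
}
```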

Patent
30 Sep 2011
TL;DR: In this article, a system and method for integrating a memory and storage hierarchy including a nonvolatile memory tier within a computer system is described, where PCMS memory devices are used as one tier in the hierarchy, sometimes referred to as far memory.
Abstract: A system and method are described for integrating a memory and storage hierarchy including a non-volatile memory tier within a computer system. In one embodiment, PCMS memory devices are used as one tier in the hierarchy, sometimes referred to as “far memory.” Higher-performance memory devices such as DRAM are placed in front of the far memory and are used to mask some of the performance limitations of the far memory. These higher-performance memory devices are referred to as “near memory.”

Proceedings ArticleDOI
05 Jun 2011
TL;DR: In order to reduce DRAM refresh energy, which occupies a significant portion of total memory energy, a runtime-adaptive method of DRAM decay is presented, along with two methods, DRAM bypass and dirty data keeping, for further reduction in refresh energy and memory access latency.
Abstract: Hybrid main memory consisting of DRAM and non-volatile memory is attractive since the non-volatile memory can give the advantage of low standby power while DRAM provides high performance and better active power. In this work, we address the power management of such a hybrid main memory consisting of DRAM and phase-change RAM (PRAM). In order to reduce DRAM refresh energy which occupies a significant portion of total memory energy, we present a runtime-adaptive method of DRAM decay. In addition, we present two methods, DRAM bypass and dirty data keeping, for further reduction in refresh energy and memory access latency, respectively. The experiments show that by reducing DRAM refreshes, we can obtain 23.5%~94.7% reduction in the energy consumption with negligible performance overhead compared with the conventional DRAM-only main memory.
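
The runtime-adaptive decay can be sketched as a row-level timeout whose length adapts to how often decayed rows end up being re-referenced; all constants and names below are illustrative, not the paper's.

```c
#define NROWS 8192
static unsigned last_access[NROWS], now, decay_interval = 1000;

/* A row idle longer than the decay interval is written back to PRAM and
 * its DRAM refresh is disabled. */
int should_decay(unsigned row) {
    return now - last_access[row] > decay_interval;
}

/* Adapt: frequent wake-ups of decayed rows mean we decayed too eagerly,
 * so back off; otherwise tighten to save more refresh energy. */
void tune_interval(unsigned rereferenced, unsigned decayed) {
    if (decayed && rereferenced * 5 > decayed)   /* >20% woke up again */
        decay_interval *= 2;
    else if (decay_interval > 100)
        decay_interval /= 2;
}
```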

Journal ArticleDOI
TL;DR: A new message passing scheme named tile-based BP that reduces the memory and bandwidth to a fraction of the ordinary BP algorithms without performance degradation by splitting the MRF into many tiles and only storing the messages across the neighboring tiles is proposed.
Abstract: Loopy belief propagation (BP) is an effective solution for assigning labels to the nodes of a graphical model such as the Markov random field (MRF), but it requires high memory, bandwidth, and computational costs. Furthermore, the iterative, pixel-wise, and sequential operations of BP make it difficult to parallelize the computation. In this paper, we propose two techniques to address these issues. The first technique is a new message passing scheme named tile-based BP that reduces the memory and bandwidth to a fraction of the ordinary BP algorithms without performance degradation by splitting the MRF into many tiles and only storing the messages across the neighboring tiles. The tile-wise processing also enables data reuse and pipelining, resulting in efficient hardware implementation. The second technique is an O(L) fast message construction algorithm that exploits the properties of robust functions for parallelization. We apply these two techniques to a very large-scale integration circuit for stereo matching that generates high-resolution disparity maps in near real-time. We also implement the proposed schemes on a graphics processing unit (GPU); this implementation is four times faster than standard BP on the GPU.
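
A back-of-the-envelope account of where the memory saving comes from: ordinary BP keeps four directional messages per pixel resident, while the tile scheme keeps only the messages crossing tile boundaries plus one tile's working set on chip. The functions below are our rough approximation of that accounting, not the paper's exact figures.

```c
#include <stddef.h>

/* Ordinary BP: four directional messages per pixel stay resident. */
size_t bp_messages(size_t w, size_t h) { return 4 * w * h; }

/* Tile-based BP: only messages crossing tile boundaries persist, plus one
 * t x t tile's working set kept on chip (assumes t divides w and h). */
size_t tile_bp_messages(size_t w, size_t h, size_t t) {
    size_t horiz = 2 * w * (h / t - 1);   /* across horizontal boundaries */
    size_t vert  = 2 * h * (w / t - 1);   /* across vertical boundaries   */
    return horiz + vert + 4 * t * t;
}
```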

Patent
21 Sep 2011
TL;DR: In this paper, the Elementary Network Description (END) format is proposed to describe a large-scale neuronal model and implementations of software or hardware engines to simulate such a model efficiently, and the architecture of such neuromorphic engines is optimal for high-performance parallel processing of spiking networks with spike-timing dependent plasticity.
Abstract: A simple format is disclosed and referred to as Elementary Network Description (END). The format can fully describe a large-scale neuronal model and embodiments of software or hardware engines to simulate such a model efficiently. The architecture of such neuromorphic engines is optimal for high-performance parallel processing of spiking networks with spike-timing dependent plasticity. Methods for managing memory in a processing system are described whereby memory can be allocated among a plurality of elements and rules configured for each element such that the parallel execution of the spiking networks is optimal.

Patent
04 Mar 2011
TL;DR: In this article, the operating method of a data storage device includes storing data in the buffer memory according to an external request, and determining whether the data stored in buffer memory is data accompanying a buffer program operation of the memory cell array.
Abstract: A data storage device includes a non-volatile memory device which includes a memory cell array; and a memory controller which includes a buffer memory and which controls the non-volatile memory device. The operating method of the data storage device includes storing data in the buffer memory according to an external request, and determining whether the data stored in the buffer memory is data accompanying a buffer program operation of the memory cell array. When the data stored in the buffer memory is data accompanying the buffer program operation, the method further includes determining whether a main program operation on the memory cell array is required, and when a main program operation on the memory cell array is required, determining a program pattern of the main program operation on the memory cell array. The method further includes issuing a set of commands for the main program operation on the memory cell array to the multi-bit memory device based on the determined program pattern.


Patent
Ahmed Said Sallam
29 Mar 2011
TL;DR: In this paper, a system for securing an electronic device may include a memory, a processor, one or more operating systems residing in the memory for execution by the processor, and a security agent configured to execute on the electronic device at a level below all of the operating systems of the electronic devices accessing the memory.
Abstract: A system for securing an electronic device may include a memory, a processor, one or more operating systems residing in the memory for execution by the processor, and a security agent configured to execute on the electronic device at a level below all of the operating systems of the electronic device accessing the memory. The security agent may be further configured to: (i) trap attempted accesses to the memory, wherein each of such attempted accesses may, individually or in the aggregate, indicate the presence of self-modifying malware; (ii) in response to trapping each attempted access to the memory, record information associated with the attempted access in a history; and (iii) in response to a triggering attempted access associated with a particular memory location, analyze information in the history associated with the particular memory location to determine if suspicious behavior has occurred with respect to the particular memory location.