
Showing papers on "Memory management published in 2006"


Proceedings ArticleDOI
16 Oct 2006
TL;DR: This paper recommends benchmarking selection and evaluation methodologies, and introduces the DaCapo benchmarks, a set of open source, client-side Java benchmarks that improve over SPEC Java in a variety of ways, including more complex code, richer object behaviors, and more demanding memory system requirements.
Abstract: Since benchmarks drive computer science research and industry product development, which ones we use and how we evaluate them are key questions for the community. Despite complex runtime tradeoffs due to dynamic compilation and garbage collection required for Java programs, many evaluations still use methodologies developed for C, C++, and Fortran. SPEC, the dominant purveyor of benchmarks, compounded this problem by institutionalizing these methodologies for their Java benchmark suite. This paper recommends benchmarking selection and evaluation methodologies, and introduces the DaCapo benchmarks, a set of open source, client-side Java benchmarks. We demonstrate that the complex interactions of (1) architecture, (2) compiler, (3) virtual machine, (4) memory management, and (5) application require more extensive evaluation than C, C++, and Fortran, which stress (4) much less and do not require (3). We use and introduce new value, time-series, and statistical metrics for static and dynamic properties such as code complexity, code size, heap composition, and pointer mutations. No benchmark suite is definitive, but these metrics show that DaCapo improves over SPEC Java in a variety of ways, including more complex code, richer object behaviors, and more demanding memory system requirements. This paper takes a step towards improving methodologies for choosing and evaluating benchmarks to foster innovation in system design and implementation for Java and other managed languages.

1,561 citations


Proceedings ArticleDOI
27 Feb 2006
TL;DR: This paper presents a new implementation of transactional memory, log-based transactional memory (LogTM), that makes commits fast by storing old values to a per-thread log in cacheable virtual memory and storing new values in place.
Abstract: Transactional memory (TM) simplifies parallel programming by guaranteeing that transactions appear to execute atomically and in isolation. Implementing these properties includes providing data version management for the simultaneous storage of both new (visible if the transaction commits) and old (retained if the transaction aborts) values. Most (hardware) TM systems leave old values "in place" (the target memory address) and buffer new values elsewhere until commit. This makes aborts fast, but penalizes (the much more frequent) commits. In this paper, we present a new implementation of transactional memory, log-based transactional memory (LogTM), that makes commits fast by storing old values to a per-thread log in cacheable virtual memory and storing new values in place. LogTM makes two additional contributions. First, LogTM extends a MOESI directory protocol to enable both fast conflict detection on evicted blocks and fast commit (using lazy cleanup). Second, LogTM handles aborts in (library) software with little performance penalty. Evaluations running micro- and SPLASH-2 benchmarks on a 32-way multiprocessor support our decision to optimize for commit by showing that only 1-2% of transactions abort.
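For intuition, here is a minimal single-threaded sketch of LogTM-style eager version management: the old value goes to an undo log, the new value is written in place, so commit is trivial and abort replays the log backwards. The types, sizes, and names are invented for illustration; LogTM itself does this in hardware with per-thread logs in cacheable virtual memory.

```c
#include <stdio.h>

typedef struct { int *addr; int old; } LogEntry;

typedef struct {
    LogEntry entries[1024];  /* per-thread undo log */
    int      len;
} TxLog;

/* Write the new value in place, saving the old value to the log first. */
static void tx_write(TxLog *log, int *addr, int value) {
    log->entries[log->len].addr = addr;
    log->entries[log->len].old  = *addr;
    log->len++;
    *addr = value;               /* eager, in-place update */
}

/* Commit is fast: just discard the log. */
static void tx_commit(TxLog *log) { log->len = 0; }

/* Abort walks the log backwards, restoring old values. */
static void tx_abort(TxLog *log) {
    for (int i = log->len - 1; i >= 0; i--)
        *log->entries[i].addr = log->entries[i].old;
    log->len = 0;
}

int main(void) {
    int balance = 100;
    TxLog log = { .len = 0 };
    tx_write(&log, &balance, 80);
    tx_abort(&log);              /* conflict detected: roll back */
    printf("balance after abort: %d\n", balance);  /* prints 100 */
    return 0;
}
```

The design choice the paper's evaluation supports is visible here: aborts do extra work walking the log, but the far more frequent commits do almost none.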

724 citations


Journal ArticleDOI
11 Jun 2006
TL;DR: Analytical and experimental results are presented that show DieHard's resilience to a wide range of memory errors, including a heap-based buffer overflow in an actual application.
Abstract: Applications written in unsafe languages like C and C++ are vulnerable to memory errors such as buffer overflows, dangling pointers, and reads of uninitialized data. Such errors can lead to program crashes, security vulnerabilities, and unpredictable behavior. We present DieHard, a runtime system that tolerates these errors while probabilistically maintaining soundness. DieHard uses randomization and replication to achieve probabilistic memory safety by approximating an infinite-sized heap. DieHard's memory manager randomizes the location of objects in a heap that is at least twice as large as required. This algorithm prevents heap corruption and provides a probabilistic guarantee of avoiding memory errors. For additional safety, DieHard can operate in a replicated mode where multiple replicas of the same application are run simultaneously. By initializing each replica with a different random seed and requiring agreement on output, the replicated version of DieHard increases the likelihood of correct execution because errors are unlikely to have the same effect across all replicas. We present analytical and experimental results that show DieHard's resilience to a wide range of memory errors, including a heap-based buffer overflow in an actual application.
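A toy sketch of the randomized-placement idea (not DieHard's actual implementation): objects land at uniformly random free slots in a heap provisioned with at least twice as many slots as live objects, so probes terminate quickly and an overflow past one object is unlikely to corrupt another. Slot counts and sizes below are invented.

```c
#include <stdlib.h>

#define SLOTS 128      /* at least 2x the maximum number of live objects */
#define SLOT_SIZE 64

static unsigned char heap[SLOTS][SLOT_SIZE];
static unsigned char used[SLOTS];

/* Place the object at a uniformly random free slot. */
static void *die_alloc(size_t n) {
    if (n > SLOT_SIZE) return NULL;
    for (;;) {                 /* expected O(1) probes while heap <= half full */
        int i = rand() % SLOTS;
        if (!used[i]) { used[i] = 1; return heap[i]; }
    }
}

static void die_free(void *p) {
    size_t off = (size_t)((unsigned char *)p - &heap[0][0]);
    used[off / SLOT_SIZE] = 0;
}
```

Keeping the heap at most half full is what makes both the probe count and the probability of adjacent live objects small.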

486 citations


Proceedings ArticleDOI
11 Nov 2006
TL;DR: This work has implemented a complete programming system, including a compiler and runtime systems for Cell processor-based blade systems and distributed memory clusters, and demonstrates efficient performance running Sequoia programs on both of these platforms.
Abstract: We present Sequoia, a programming language designed to facilitate the development of memory hierarchy aware parallel programs that remain portable across modern machines featuring different memory hierarchy configurations. Sequoia abstractly exposes hierarchical memory in the programming model and provides language mechanisms to describe communication vertically through the machine and to localize computation to particular memory locations within it. We have implemented a complete programming system, including a compiler and runtime systems for Cell processor-based blade systems and distributed memory clusters, and demonstrate efficient performance running Sequoia programs on both of these platforms.

482 citations


Patent
13 Apr 2006
TL;DR: In this article, a memory storage system for storing information organized in sectors within a nonvolatile memory bank is disclosed, where sectors are organized into blocks with each sector identified by a host-provided logical block address (LBA).

Abstract: In one embodiment of the present invention, a memory storage system for storing information organized in sectors within a nonvolatile memory bank is disclosed. The memory bank is defined by sector storage locations spanning one or more rows of a nonvolatile memory device, each sector including a user data portion and an overhead portion. The sectors are organized into blocks, with each sector identified by a host-provided logical block address (LBA). Each block is identified by a modified LBA derived from the host-provided LBA and a virtual physical block address (PBA); the host-provided LBA is received by the storage device from the host to identify a sector of information to be accessed, while an actual PBA is developed by the storage device to identify a free location within the memory bank where the accessed sector is to be stored. The storage system includes a memory controller coupled to the host and a nonvolatile memory bank coupled to the memory controller via a memory bus. The memory bank is included in a non-volatile semiconductor memory unit and has storage blocks, each of which includes a first row-portion and a corresponding second row-portion located in the memory unit; each row-portion provides storage space for two of the sectors, and the speed of write operations is increased by writing sector information to the memory unit simultaneously.
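The LBA-to-PBA indirection the patent builds on can be pictured with a small map from host-visible logical addresses to controller-chosen physical ones. Everything below (table sizes, names, the linear free-location scan) is a hypothetical sketch of the general technique, not the patented design.

```c
#include <stdint.h>

#define N_SECTORS 1024
#define UNMAPPED  0xFFFF

static uint16_t lba_to_pba[N_SECTORS];  /* logical -> physical map */
static uint8_t  pba_used[N_SECTORS];

static void map_init(void) {
    for (int i = 0; i < N_SECTORS; i++) lba_to_pba[i] = UNMAPPED;
}

/* Controller-side write: pick a free physical location, retire the old one. */
static int write_sector(uint16_t lba) {
    for (uint16_t pba = 0; pba < N_SECTORS; pba++) {
        if (!pba_used[pba]) {
            if (lba_to_pba[lba] != UNMAPPED)
                pba_used[lba_to_pba[lba]] = 0;  /* old copy is now stale */
            pba_used[pba] = 1;
            lba_to_pba[lba] = pba;
            return pba;                         /* program sector data here */
        }
    }
    return -1;  /* no free location: erase/reclaim needed first */
}
```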

462 citations


Journal ArticleDOI
01 May 2006
TL;DR: Compatibility of GA with MPI enables the programmer to take advantage of existing MPI software and libraries when available and appropriate, and demonstrates the attractiveness of using higher-level abstractions to write parallel code.

Abstract: This paper describes capabilities, evolution, performance, and applications of the Global Arrays (GA) toolkit. GA was created to provide application programmers with an interface that allows them to distribute data while maintaining a global index space and a programming syntax similar to that available when programming on a single processor. The goal of GA is to free the programmer from the low-level management of communication and allow them to deal with their problems at the level at which they were originally formulated. At the same time, compatibility of GA with MPI enables the programmer to take advantage of existing MPI software and libraries when available and appropriate. The variety of applications that have been implemented using Global Arrays attests to the attractiveness of using higher-level abstractions to write parallel code.
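The global-index-space idea can be shown with a toy block distribution: a global index resolves to an owner process and a local offset, which is what lets GA-style one-sided get/put target the right memory without the programmer writing messages. The distribution and names here are illustrative, not GA's actual API.

```c
#include <stdio.h>

#define GLOBAL_N 100
#define NPROCS   4

typedef struct { int owner; int offset; } Location;

/* Resolve a global index under a 1-D block distribution. */
static Location locate(int global_index) {
    int block = (GLOBAL_N + NPROCS - 1) / NPROCS;   /* ceiling division */
    Location loc = { global_index / block, global_index % block };
    return loc;
}

int main(void) {
    Location l = locate(42);
    /* In GA proper, a get/put call would perform the one-sided transfer. */
    printf("global index 42 lives on rank %d at offset %d\n",
           l.owner, l.offset);
    return 0;
}
```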

341 citations


Journal ArticleDOI
TL;DR: A hardware implementation of unbounded transactional memory, called UTM, is described, which exploits the common case for performance without sacrificing correctness on transactions whose footprint can be nearly as large as virtual memory.
Abstract: This article advances the following thesis: transactional memory should be virtualized to support transactions of arbitrary footprint and duration. Such support should be provided through hardware and be made visible to software through the machine's instruction set architecture. We call a transactional memory system unbounded if the system can handle transactions of arbitrary duration that have footprints nearly as big as the system's virtual memory. The primary goal of unbounded transactional memory is to make concurrent programming easier without incurring much implementation overhead. Unbounded transactional-memory architectures can achieve high performance in the common case of small transactions, without sacrificing correctness in large transactions.

295 citations


Proceedings ArticleDOI
03 Dec 2006
TL;DR: This paper introduces the content addressed delayed input DFA (CD2FA), which provides a compact representation of regular expressions while matching the throughput of traditional uncompressed DFAs.

Abstract: Modern deep packet inspection systems use regular expressions to define various patterns of interest in network data streams. Deterministic finite automata (DFA) are commonly used to parse regular expressions. DFAs are fast, but can require prohibitively large amounts of memory for patterns arising in network applications. Traditional DFA table compression only slightly reduces the memory required and requires an additional memory access per input character. Alternative representations of regular expressions, such as NFAs and delayed input DFAs (D2FA), require less memory but sacrifice throughput. In this paper we introduce the content addressed delayed input DFA (CD2FA), which provides a compact representation of regular expressions while matching the throughput of traditional uncompressed DFAs. A CD2FA addresses successive states of a D2FA using their content, rather than a "content-less" identifier. This makes selected information available earlier in the state traversal process, which makes it possible to avoid unnecessary memory accesses. We demonstrate that such content-addressing can be effectively used to obtain automata that are very compact and can achieve high throughput. Specifically, we show that for an application using thousands of patterns defined by regular expressions, CD2FAs use as little as 10% of the space required by a conventional compressed DFA, and match the throughput of an uncompressed DFA.
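The baseline CD2FA is measured against is the uncompressed DFA, whose appeal is exactly one table access per input byte. A minimal matcher for the regular expression (ab)+ makes that baseline concrete; the table layout is illustrative and the input is assumed to be over the alphabet {a, b}.

```c
#include <stdio.h>

/* DFA for (ab)+ over {a,b}: 0 = start, 1 = saw 'a', 2 = accept, 3 = dead. */
static const int next[4][2] = {
    /* 'a' 'b' */
    {  1,  3 },  /* state 0 */
    {  3,  2 },  /* state 1 */
    {  1,  3 },  /* state 2 */
    {  3,  3 },  /* state 3 */
};

static int matches(const char *s) {
    int state = 0;
    for (; *s; s++)
        state = next[state][*s == 'b'];  /* one memory access per character */
    return state == 2;
}

int main(void) {
    printf("%d %d\n", matches("abab"), matches("aba"));  /* prints: 1 0 */
    return 0;
}
```

The memory problem the paper attacks is that for thousands of real patterns this table has many thousands of states times a 256-symbol alphabet; CD2FA keeps the one-access-per-byte behavior while shrinking the table.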

205 citations


Proceedings ArticleDOI
20 Oct 2006
TL;DR: This paper creates a prototype implementation of techniques that can be used by a VMM to passively infer useful information about a guest operating system's unified buffer cache and virtual memory system, and implements a novel working set size estimator which allows the VMM to make more informed memory allocation decisions.

Abstract: Virtualization is increasingly being used to address server management and administration issues like flexible resource allocation, service isolation and workload migration. In a virtualized environment, the virtual machine monitor (VMM) is the primary resource manager and is an attractive target for implementing system features like scheduling, caching, and monitoring. However, the lack of runtime information within the VMM about guest operating systems, sometimes called the semantic gap, is a significant obstacle to efficiently implementing some kinds of services. In this paper we explore techniques that can be used by a VMM to passively infer useful information about a guest operating system's unified buffer cache and virtual memory system. We have created a prototype implementation of these techniques inside the Xen VMM called Geiger and show that it can accurately infer when pages are inserted into and evicted from a system's buffer cache. We explore several nuances involved in passively implementing eviction detection that have not previously been addressed, such as the importance of tracking disk block liveness, the effect of file system journaling, and the importance of accounting for the unified caches found in modern operating systems. Using case studies we show that the information provided by Geiger enables a VMM to implement useful VMM-level services. We implement a novel working set size estimator which allows the VMM to make more informed memory allocation decisions. We also show that a VMM can be used to drastically improve the hit rate in remote storage caches by using eviction-based cache placement without modifying the application or operating system storage interface. Both case studies hint at a future where inference techniques enable a broad new class of VMM-level functionality.
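One of the core inference heuristics can be sketched in a few lines, under the assumption (mine, for illustration) that the VMM sees disk reads into guest page frames: remember which disk block last filled each frame, and when a frame is refilled with a different block, the block it previously held must have been evicted from the guest's buffer cache.

```c
#include <stdio.h>

#define N_FRAMES 8
static long frame_block[N_FRAMES];  /* disk block cached in each frame, -1 = none */

/* Called by the (hypothetical) VMM hook when a disk read fills a frame. */
static void on_disk_read(int frame, long block) {
    if (frame_block[frame] != -1 && frame_block[frame] != block)
        printf("inferred eviction of block %ld from frame %d\n",
               frame_block[frame], frame);
    frame_block[frame] = block;
}

int main(void) {
    for (int i = 0; i < N_FRAMES; i++) frame_block[i] = -1;
    on_disk_read(3, 100);
    on_disk_read(3, 200);   /* block 100 must have been evicted */
    return 0;
}
```

The nuances the paper lists (block liveness, journaling, unified caches) are precisely the cases where this naive rule misfires and needs refinement.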

185 citations


Patent
02 Nov 2006
TL;DR: In this article, write operations store data in different physical memory locations that share a common logical address; sequence information indicates which write operation occurred last, and the available erased memory locations can be split into a list of erased memory locations available to be used and a list of erased memory locations not available to be used.

Abstract: Write operations store data in different physical memory locations. Each of the physical memory locations is associated with a logical address that is shared in common among the physical addresses. Sequence information stored in the physical memory location indicates which one of the write operations occurred last. The available erased memory locations can be split into a list of erased memory locations available to be used and a list of erased memory locations not available to be used. Then, on a failure, only the list of erased memory locations available to be used needs to be analyzed to reconstruct the consumption states of memory locations.
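A hypothetical structure makes the sequence-information idea concrete: each physical copy of a logical sector carries a monotonically increasing write sequence number, and the newest copy is simply the one with the largest value. Field names and sizes below are assumptions for illustration.

```c
#include <stdint.h>

typedef struct {
    uint32_t logical_addr;  /* shared among all copies of this sector */
    uint32_t sequence;      /* larger = written later */
    uint8_t  data[512];
} PhysSector;

/* Among n physical copies of one logical address, find the latest write. */
static const PhysSector *latest(const PhysSector *copies, int n) {
    const PhysSector *best = &copies[0];
    for (int i = 1; i < n; i++)
        if (copies[i].sequence > best->sequence)
            best = &copies[i];
    return best;
}
```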

178 citations


Patent
Alan Welsh Sinclair
08 Feb 2006
TL;DR: The file-based interface between the host and memory system allows the memory system controller to utilize the data storage blocks within the memory with increased efficiency as discussed by the authors, without the use of any intermediate logical addresses or a virtual address space for the memory.
Abstract: Host system data files are written directly to a large erase block flash memory system with a unique identification of each file and offsets of data within the file, but without the use of any intermediate logical addresses or a virtual address space for the memory. Directory information of where the files are stored in the memory is maintained within the memory system by its controller, rather than by the host. The file-based interface between the host and memory systems allows the memory system controller to utilize the data storage blocks within the memory with increased efficiency.

Journal ArticleDOI
Seung-Ho Lim, Kyu Ho Park
TL;DR: The flash file system proposed in this paper is designed for NAND flash memory storage while considering the existing file system characteristics and outperformed other flash file systems both in booting time and garbage collection overheads.
Abstract: In this paper, we present an efficient flash file system for flash memory storage. Flash memory, especially NAND flash memory, has become a major method for data storage. Currently, a block level translation interface is required between an existing file system and flash memory chips due to its physical characteristics. However, the approach of existing file systems on top of the emulating block interface has many restrictions and is, thus, inefficient because existing file systems are designed for disk-based storage systems. The flash file system proposed in this paper is designed for NAND flash memory storage while considering the existing file system characteristics. Our target performance metrics are the system booting time and garbage collection overheads, which are important issues in flash memory. In our experiments, the proposed flash file system outperformed other flash file systems both in booting time and garbage collection overheads.

Patent
26 Jul 2006
TL;DR: In this paper, a data preservation system for flash memory systems operating with a host system is described, in which the flash memory system receives a host-system power supply, energizes an auxiliary energy store with it, and communicates with the host system via an interface bus.

Abstract: A data preservation system for flash memory systems operating with a host system. The flash memory system receives a host-system power supply, energizes an auxiliary energy store with it, and communicates with the host system via an interface bus. Upon loss of the host-system power supply, the flash memory system actively isolates the connection to the host-system power supply, isolates the interface bus, and employs the auxiliary energy store to continue write operations to flash memory.

Patent
12 Jun 2006
TL;DR: In this article, a nonvolatile memory system is presented with a plurality of data blocks in predetermined physical address units and a controller for controlling the nonvolatile memory in response to an access request from outside.

Abstract: A memory system permitting a number of alternative memory blocks to be made ready in order to extend the rewritable life, and thereby contributing to enhanced reliability of information storage, is to be provided. The memory system is provided with a nonvolatile memory having a plurality of data blocks in predetermined physical address units and a controller for controlling the nonvolatile memory in response to an access request from outside. Each of the data blocks has areas for holding a rewrite count and error check information regarding each data area. The controller, in a read operation on the nonvolatile memory, checks for any error in the area subject to the read according to the error check information. When there is an error, if the rewrite count is greater than a predetermined value, the controller replaces the pertinent data block with another data block; if it is not, the controller corrects the data in the data block pertaining to the error.

Proceedings ArticleDOI
01 Sep 2006
TL;DR: The authors believe this is the first known use of cost-benefit analysis and control theory in database memory tuning across heterogeneous memory consumers.

Abstract: DB2 for Linux, UNIX, and Windows Version 9.1 introduces the Self-Tuning Memory Manager (STMM), which provides adaptive self-tuning of both database memory heaps and cumulative database memory allocation. This technology provides state-of-the-art memory tuning combining control theory, runtime simulation modeling, cost-benefit analysis, and operating system resource analysis. In particular, the novel use of cost-benefit analysis and control theory techniques makes STMM a breakthrough technology in database memory management. The cost-benefit analysis allows STMM to tune memory between radically different memory consumers such as compiled statement cache, sort, and buffer pools. These methods allow for the fast convergence of memory settings while also providing stability in the presence of system noise. In numerous experiments spanning OLTP, DSS, and mixed environments, STMM has been found to tune memory allocation as well as expert human administrators. We believe this is the first known use of cost-benefit analysis and control theory in database memory tuning across heterogeneous memory consumers.
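A caricature of the cost-benefit step (not STMM's actual algorithm): each consumer reports an estimated benefit per additional page, and memory migrates from the lowest-benefit consumer to the highest-benefit one. The consumers and numbers below are invented for illustration.

```c
#include <stdio.h>

struct Consumer { const char *name; double benefit_per_page; long pages; };

/* Move one step of memory from the least to the most valuable consumer. */
static void rebalance(struct Consumer *c, int n, long step) {
    int hi = 0, lo = 0;
    for (int i = 1; i < n; i++) {
        if (c[i].benefit_per_page > c[hi].benefit_per_page) hi = i;
        if (c[i].benefit_per_page < c[lo].benefit_per_page) lo = i;
    }
    if (hi != lo && c[lo].pages >= step) {
        c[lo].pages -= step;     /* take memory from the low-value use */
        c[hi].pages += step;     /* give it to the high-value use */
    }
}

int main(void) {
    struct Consumer c[] = {
        { "buffer pool", 4.0, 10000 },
        { "sort heap",   9.5,  2000 },
        { "stmt cache",  1.2,  3000 },
    };
    rebalance(c, 3, 256);
    for (int i = 0; i < 3; i++)
        printf("%-11s %ld pages\n", c[i].name, c[i].pages);
    return 0;
}
```

The control-theory side of STMM, which this sketch omits, governs the step size so the system converges quickly yet stays stable under noise.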

Patent
27 Jan 2006
TL;DR: In this paper, a host processor is coupled to a memory controller and configurable to retrieve from the memory controller information indicative of the health of a non-volatile memory device operatively coupled to the controller.
Abstract: A host processor is coupled to a memory controller and configurable to retrieve from the memory controller information indicative of the health of a non-volatile memory device operatively coupled to the memory controller. A host system uses the information to monitor the health of the non-volatile memory device.

Patent
27 Jun 2006
TL;DR: In this article, the authors describe a storage subsystem that includes a main memory area that is accessible via standard memory access commands (such as ATA commands), and a restricted memory area which is accessible only via one or more non-standard commands.
Abstract: A solid-state storage subsystem, such as a non-volatile memory card or drive, includes a main memory area that is accessible via standard memory access commands (such as ATA commands), and a restricted memory area that is accessible only via one or more non-standard commands. The restricted memory area stores information used to control access to, and/or use of, information stored in the main memory area. As one example, the restricted area may store one or more identifiers, such as a unique subsystem identifier, needed to decrypt an executable or data file stored in the main memory area. A host software component is configured to retrieve the information from the subsystem's restricted memory area, and to use the information to control access to and/or use of the information in the main memory area.

Proceedings ArticleDOI
03 Dec 2006
TL;DR: This paper shows how to modify the crossproduct method in a way that drastically reduces the memory requirement without compromising on performance, and proposes a new approach to packet classification which combines architectural and algorithmic techniques.
Abstract: Ternary content addressable memory (TCAM), although widely used for general packet classification, is an expensive and high power-consuming device. Algorithmic solutions which rely on commodity memory chips are relatively inexpensive and power-efficient but have not been able to match the generality and performance of TCAMs. Therefore, the development of fast and power-efficient algorithmic packet classification techniques continues to be a research subject. In this paper we propose a new approach to packet classification which combines architectural and algorithmic techniques. Our starting point is the well-known crossproduct algorithm which is fast but has significant memory overhead due to the extra rules needed to represent the crossproducts. We show how to modify the crossproduct method in a way that drastically reduces the memory requirement without compromising on performance. Unnecessary accesses to the off-chip memory are avoided by filtering them through on-chip Bloom filters. For packets that match p rules in a rule set, our algorithm requires just 4 + p + ε independent memory accesses to return all matching rules, where ε ≪ 1 is a small constant that depends on the false positive rate of the Bloom filters. Using two commodity SRAM chips, a throughput of 38 million packets per second can be achieved. For rule set sizes ranging from a few hundred to several thousand filters, the average rule set expansion factor attributable to the algorithm is just 1.2 to 1.4. The average memory consumption per rule is 32 to 45 bytes.
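The on-chip filtering step relies on a standard Bloom filter: a lookup goes to slow off-chip memory only when every probed bit is set, while a single clear bit proves the key is absent. The sketch below uses two toy hash functions as stand-ins; the paper's filter sizing and hash count differ.

```c
#include <stdint.h>
#include <stdbool.h>

#define BITS 4096

static uint8_t bloom[BITS / 8];

static uint32_t h1(uint32_t x) { return (x * 2654435761u) % BITS; }
static uint32_t h2(uint32_t x) { return ((x ^ (x >> 16)) * 40503u) % BITS; }

static void bloom_add(uint32_t key) {
    bloom[h1(key) / 8] |= 1u << (h1(key) % 8);
    bloom[h2(key) / 8] |= 1u << (h2(key) % 8);
}

/* false => key definitely absent: skip the off-chip memory access. */
static bool bloom_maybe(uint32_t key) {
    return ((bloom[h1(key) / 8] >> (h1(key) % 8)) & 1) &&
           ((bloom[h2(key) / 8] >> (h2(key) % 8)) & 1);
}
```

The false positive rate, which the abstract's ε term reflects, is tuned by the filter size and the number of hash functions.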

Patent
21 Sep 2006
TL;DR: In this article, a resource reservation application running as a guest application on the virtual machine reserves a location in guest virtual memory and the corresponding physical memory can be reclaimed and allocated to another virtual machine.
Abstract: Memory assigned to a virtual machine is reclaimed. A resource reservation application running as a guest application on the virtual machine reserves a location in guest virtual memory. The corresponding physical memory can be reclaimed and allocated to another virtual machine. The resource reservation application allows detection of guest virtual memory page-out by the guest operating system. Measuring guest virtual memory page-out is useful for determining memory conditions inside the guest operating system. Given the determined memory conditions, memory allocation and reclaiming can be used to control memory conditions in the virtual machine, with the objective of achieving some target conditions.

Patent
07 Dec 2006
TL;DR: In this paper, a multi-stage video memory management system for a vehicle event recorder is provided that includes the management of a plurality of stage memories and the transfer of data there between.
Abstract: A multi-stage video memory management system for a vehicle event recorder is provided that includes the management of a plurality of stage memories and the transfer of data therebetween. A managed loop memory receives data from a video camera in real-time and continuously overwrites expired data determined to be no longer useful. Data in the managed loop memory is transferred to a more stable memory in response to an event to be recorded. An event trigger first produces a signal causing data transfer between the managed loop memory and an on-board, high-capacity buffer memory, suitable for storing video series associated with a plurality of events. Subsequently, a permanent data store receives data from the high-capacity buffer memory whenever the system reaches a predetermined distance from a download station.

Patent
30 Aug 2006
TL;DR: In this article, a system and method are presented comprising a non-volatile memory including one or more memory blocks to store data, a controller, and a wear-leveling table populated with pointers to unallocated memory blocks; the controller identifies pointers in the table and allocates the associated memory blocks for the storage of data.
Abstract: A system and method comprising a non-volatile memory including one or more memory blocks to store data, a controller to allocate one or more of the memory blocks to store data, and a wear-leveling table populated with pointers to unallocated memory blocks in the non-volatile memory, the controller to identify one or more pointers in the wear-leveling table and to allocate the unallocated memory blocks associated with the identified pointers for the storage of data.
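As a sketch of how such a table might be consulted, the code below hands out the least-worn free block, a common wear-leveling heuristic. That policy, and all names and sizes here, are assumptions for illustration rather than the patent's specific method.

```c
#include <stddef.h>

struct FlashBlock { unsigned erase_count; /* ... block metadata ... */ };

#define TABLE_SIZE 64
static struct FlashBlock *free_table[TABLE_SIZE];  /* NULL = empty slot */

/* Allocate the least-worn unallocated block and remove its pointer. */
static struct FlashBlock *allocate_block(void) {
    int best = -1;
    for (int i = 0; i < TABLE_SIZE; i++) {
        if (free_table[i] &&
            (best < 0 ||
             free_table[i]->erase_count < free_table[best]->erase_count))
            best = i;
    }
    if (best < 0) return NULL;                /* table exhausted */
    struct FlashBlock *b = free_table[best];
    free_table[best] = NULL;                  /* block is now allocated */
    return b;
}
```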

Proceedings ArticleDOI
10 Jun 2006
TL;DR: This paper is the first to integrate a software transactional memory system with a malloc/free based memory allocator and presents the first algorithm which ensures that space allocated in an aborted transaction is properly freed and does not lead to a space blowup.
Abstract: Emerging multi-core processors promise to provide an exponentially increasing number of hardware threads with every generation. Applications will need to be highly concurrent to fully use the power of these processors. To enable maximum concurrency, libraries (such as malloc/free packages) would therefore need to use non-blocking algorithms. But lock-free algorithms are notoriously difficult to reason about and inappropriate for average programmers. Transactional memory promises to significantly ease concurrent programming for the average programmer. This paper describes a highly efficient non-blocking malloc/free algorithm that supports memory allocation and deallocation inside transactional code blocks. Thus this paper describes a memory allocator that is suitable for emerging multi-core applications, while supporting modern concurrency constructs. This paper makes several novel contributions. It is the first to integrate a software transactional memory system with a malloc/free based memory allocator. We present the first algorithm which ensures that space allocated in an aborted transaction is properly freed and does not lead to a space blowup. Unlike previous lock-free malloc packages, our algorithm avoids atomic operations on typical code paths, making our algorithm substantially more efficient.
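One piece of the problem the paper solves can be sketched simply: allocations made inside a transaction are logged so that an abort can release them, avoiding the space blowup the authors mention. A real STM integration must also defer frees of pre-existing blocks until commit; this single-threaded sketch shows only the allocation side, with invented names.

```c
#include <stdlib.h>

#define MAX_ALLOCS 64

struct Tx {
    void *allocated[MAX_ALLOCS];  /* blocks obtained inside this transaction */
    int   n;
};

static void *tx_malloc(struct Tx *tx, size_t sz) {
    void *p = malloc(sz);
    if (p && tx->n < MAX_ALLOCS)
        tx->allocated[tx->n++] = p;   /* remember for a potential abort */
    return p;
}

/* On abort, every speculative allocation is returned to the allocator. */
static void tx_on_abort(struct Tx *tx) {
    while (tx->n > 0)
        free(tx->allocated[--tx->n]);
}

/* On commit, the blocks survive; just forget the log. */
static void tx_on_commit(struct Tx *tx) { tx->n = 0; }
```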

Journal ArticleDOI
TL;DR: A novel reference-counting algorithm suitable for a multiprocessor system that does not require any synchronized operation in its write barrier (not even a compare-and-swap type of synchronization) and allows eliminating a large fraction of the reference-count updates, thus drastically reducing reference counting's traditional overhead.

Abstract: Reference-counting is traditionally considered unsuitable for multiprocessor systems. According to conventional wisdom, the update of reference slots and reference-counts requires atomic or synchronized operations. In this work we demonstrate this is not the case by presenting a novel reference-counting algorithm suitable for a multiprocessor system that does not require any synchronized operation in its write barrier (not even a compare-and-swap type of synchronization). A second novelty of this algorithm is that it allows eliminating a large fraction of the reference-count updates, thus drastically reducing the traditional reference-counting overhead. This article includes a full proof of the algorithm showing that it is safe (does not reclaim live objects) and live (eventually reclaims all unreachable objects). We have implemented our algorithm on Sun Microsystems' Java Virtual Machine (JVM) 1.2.2 and ran it on a four-way IBM Netfinity 8500R server with 550-MHz Intel Pentium III Xeon processors and 2 GB of physical memory. Our results show that the algorithm has extremely low latency and throughput comparable to the stop-the-world mark-and-sweep algorithm used in the original JVM.
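The flavor of the update-elimination idea can be sketched (single-threaded and heavily simplified; the real algorithm's concurrent machinery is far subtler): the write barrier records a slot's old value only on the slot's first modification in a collection cycle, and the collector later applies just the net change, never counting intermediate values.

```c
#include <stddef.h>

typedef struct Obj { int rc; /* ... object fields ... */ } Obj;

typedef struct { Obj **slot; Obj *old; } SlotLog;

#define LOG_MAX 256
static SlotLog logbuf[LOG_MAX];
static int log_len;

static int logged(Obj **slot) {
    for (int i = 0; i < log_len; i++)
        if (logbuf[i].slot == slot) return 1;
    return 0;
}

/* Write barrier: no atomic operations; old value saved once per cycle. */
static void write_ref(Obj **slot, Obj *val) {
    if (!logged(slot) && log_len < LOG_MAX)
        logbuf[log_len++] = (SlotLog){ slot, *slot };
    *slot = val;
}

/* Collector: only the net difference per slot adjusts the counts, so a
 * slot overwritten k times costs 2 count updates instead of 2k. */
static void update_counts(void) {
    for (int i = 0; i < log_len; i++) {
        if (logbuf[i].old)   logbuf[i].old->rc--;
        if (*logbuf[i].slot) (*logbuf[i].slot)->rc++;
    }
    log_len = 0;
}
```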

Patent
08 Sep 2006
TL;DR: In this paper, an apparatus and method are described for identifying uncommitted memory in a system RAM during an initialization process of a computer system, such as a boot procedure or power-on self test, during which memory management is uncontrolled.
Abstract: An apparatus and method are described for identifying uncommitted memory in a system RAM during an initialization process of a computer system, such as a boot procedure or power-on self test, during which memory management is uncontrolled. In various embodiments of the invention, repeating patterns that are indicative of uncommitted memory blocks are identified within a conventional memory area of the system RAM. At least some of the uncommitted memory blocks are allocated for use by an option ROM or other BIOS data and a table is created identifying these uncommitted memory blocks. After the BIOS code exits the system RAM, the table is used to restore the uncommitted memory blocks into their previous data states.
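The detection step might look like the following sketch, which treats a block filled with a single repeated byte (e.g., 0x00 or 0xFF fill) as uncommitted. The block size and fill values are assumptions for illustration, not the patent's specified patterns.

```c
#include <stdbool.h>
#include <stddef.h>

#define BLOCK 4096

/* A block whose bytes all repeat one fill value is likely uncommitted. */
static bool looks_uncommitted(const unsigned char *p) {
    unsigned char first = p[0];
    if (first != 0x00 && first != 0xFF) return false;
    for (size_t i = 1; i < BLOCK; i++)
        if (p[i] != first) return false;
    return true;
}
```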

Proceedings ArticleDOI
10 Jun 2006
TL;DR: Streamflow enables low-overhead simultaneous allocation by multiple threads, adapts to sequential allocation at speeds comparable to those of custom sequential allocators, favors the transparent exploitation of temporal and spatial object access locality, and reduces allocator-induced cache conflicts and false sharing.

Abstract: We present Streamflow, a new multithreaded memory manager designed for low-overhead, high-performance memory allocation while transparently favoring locality. Streamflow enables low-overhead simultaneous allocation by multiple threads and adapts to sequential allocation at speeds comparable to those of custom sequential allocators. It favors the transparent exploitation of temporal and spatial object access locality, and reduces allocator-induced cache conflicts and false sharing, all using a unified design based on segregated heaps. Streamflow introduces an innovative design which uses only synchronization-free operations in the most common case of local allocations and deallocations, while requiring minimal, non-blocking synchronization in the less common case of remote deallocations. Spatial locality at the cache and page level is favored by eliminating small-object headers, reducing allocator-induced conflicts via contiguous allocation of page blocks in physical memory, reducing allocator-induced false sharing by using segregated heaps, and achieving better TLB performance and fewer page faults via the use of superpages. Combining these locality optimizations with the drastic reduction of synchronization and latency overhead allows Streamflow to perform comparably with optimized sequential allocators and outperform (on a shared-memory system with four two-way SMT processors) four state-of-the-art multiprocessor allocators by sizeable margins in our experiments. The allocation-intensive sequential and parallel benchmarks used in our experiments represent a variety of behaviors, including mostly local object allocation-deallocation patterns and producer-consumer allocation-deallocation patterns.
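The split between synchronization-free local operations and non-blocking remote deallocation can be sketched with two free lists per heap: one private to the owner thread, one a CAS-based lock-free stack that other threads push onto. This is a greatly simplified, single-size-class sketch of the general pattern, not Streamflow's actual data structures.

```c
#include <stdatomic.h>
#include <stddef.h>

struct Chunk { struct Chunk *next; };

struct Heap {
    struct Chunk *local_free;                 /* touched by owner thread only */
    _Atomic(struct Chunk *) remote_free;      /* other threads push here */
};

static void *alloc_chunk(struct Heap *h) {
    if (!h->local_free)   /* out of local chunks: reclaim all remote frees */
        h->local_free = atomic_exchange(&h->remote_free, NULL);
    struct Chunk *c = h->local_free;
    if (c) h->local_free = c->next;
    return c;
}

static void free_local(struct Heap *h, void *p) {      /* no atomics at all */
    struct Chunk *c = p;
    c->next = h->local_free;
    h->local_free = c;
}

static void free_remote(struct Heap *h, void *p) {     /* lock-free push */
    struct Chunk *c = p;
    c->next = atomic_load(&h->remote_free);
    while (!atomic_compare_exchange_weak(&h->remote_free, &c->next, c))
        ;  /* on failure, c->next was reloaded with the current head; retry */
}
```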

Patent
10 Jan 2006
TL;DR: In this article, a processor includes a virtualization system with memory virtualization support that maps a reference to guest-physical memory, made by guest software running on a virtual machine that in turn runs on a host machine, to a reference to host-physical memory of the host machine.

Abstract: A processor including a virtualization system of the processor with a memory virtualization support system to map a reference to guest-physical memory, made by guest software executable on a virtual machine that is in turn executable on a host machine on which the processor is operable, to a reference to host-physical memory of the host machine.

Patent
29 Jun 2006
TL;DR: In this paper, a method of storing multiple first bit-patterns in nonvolatile memory of a device is proposed, where each bit pattern is associated with a second bit pattern associated with the device.
Abstract: A method of storing multiple first bit-patterns in non-volatile memory of a device, the method comprising, for each of the first bit-patterns to be stored: (a) applying a one way function to a third bit-pattern based on a second bit-pattern associated with the device, thereby to generate a first result; (b) applying a second function to the first result and the first bit-pattern, thereby to generate a second result; and (c) storing the second result in the memory, thereby indirectly storing the first bit-pattern; wherein the third bit-patterns used for the respective first bit-patterns are relatively unique compared to each other.
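The scheme can be illustrated with stand-in functions: a toy mixing hash plays the one-way function and XOR plays the second function, so stored = H(third) XOR first, and first is recovered only by a party able to recompute H(third). Everything here is a non-cryptographic placeholder for illustration, not the patent's actual functions.

```c
#include <stdint.h>
#include <stdio.h>

static uint64_t toy_hash(uint64_t x) {       /* stand-in one-way function */
    x ^= x >> 33; x *= 0xff51afd7ed558ccdULL;
    x ^= x >> 33; x *= 0xc4ceb9fe1a85ec53ULL;
    return x ^ (x >> 33);
}

int main(void) {
    uint64_t first  = 0xCAFEBABEULL;  /* bit-pattern to store indirectly */
    uint64_t third  = 0x1234ULL;      /* based on the device's bit-pattern */
    uint64_t stored = toy_hash(third) ^ first;   /* what memory holds */

    uint64_t recovered = toy_hash(third) ^ stored;
    printf("recovered: %llx\n", (unsigned long long)recovered);
    return 0;
}
```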

Patent
30 Jun 2006
TL;DR: In this paper, a probabilistic and/or decision-theoretic model(s) of application usage is employed to predict application use and in view of bounded or limited-availability memory.
Abstract: Architecture that employs probabilistic and/or decision-theoretic model(s) of application usage to predict application use and in view of bounded or limited-availability memory. The model(s) is applied with cost-benefit analysis to guide memory management in an operating system, in particular, for both decisions about prefetching and memory retention versus deletion or “paging out” of memory of lower priority items, to free up space for higher value items. Contextual information is employed in addition to computer action monitoring for predicting next applications to be launched. Prefetching is optimized so as to minimize user perceived latencies.
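The cost-benefit rule implied by the abstract reduces to comparing expected gains against expected losses; a one-line sketch, with every parameter an invented placeholder:

```c
#include <stdbool.h>

/* Prefetch an application only when the expected latency saved exceeds
 * the expected cost of refetching whatever would be evicted to make room. */
static bool should_prefetch(double p_launch, double latency_saved_ms,
                            double p_reuse_evicted, double refetch_cost_ms) {
    return p_launch * latency_saved_ms > p_reuse_evicted * refetch_cost_ms;
}
```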

Proceedings ArticleDOI
27 Feb 2006
TL;DR: Two novel performance-directed energy management techniques that maximize the utilization of memory devices by increasing the level of concurrency between multiple DMA transfers from different I/O buses to the same memory device are proposed.
Abstract: As increasingly larger memories are used to bridge the widening gap between processor and disk speeds, main memory energy consumption is becoming increasingly dominant. Even though much prior research has been conducted on memory energy management, no study has focused on data servers, where main memory is predominantly accessed by DMAs instead of processors. In this paper, we study DMA-aware techniques for memory energy management in data servers. We first characterize the effect of DMA accesses on memory energy and show that, due to the mismatch between memory and I/O bus bandwidths, significant energy is wasted when memory is idle but still active during DMA transfers. To reduce this waste, we propose two novel performance-directed energy management techniques that maximize the utilization of memory devices by increasing the level of concurrency between multiple DMA transfers from different I/O buses to the same memory device. We evaluate our techniques using a detailed trace-driven simulator, and storage and database server traces. The results show that our techniques can effectively minimize the amount of idle energy waste during DMA transfers and, consequently, conserve up to 38.6% more memory energy than previous approaches while providing similar performance.

Proceedings ArticleDOI
25 Apr 2006
TL;DR: This paper examines applying phase analysis algorithms and how to adapt them to parallel applications running on shared memory processors, and examines using the phase analysis to pick simulation points to guide multithreaded simulation.
Abstract: Most programs are repetitive, where similar behavior can be seen at different execution times. Algorithms have been proposed that automatically group similar portions of a program's execution into phases, where samples of execution in the same phase have homogeneous behavior and similar resource requirements. In this paper, we examine applying these phase analysis algorithms and how to adapt them to parallel applications running on shared memory processors. Our approach relies on a separate representation of each thread's activity. We first focus on showing its ability to identify similar intervals of execution across threads for a single run. We then show that it is effective at identifying similar behavior of a program when the number of threads is varied between runs. This can be used by developers to examine how different phases scale across different numbers of threads. Finally, we examine using the phase analysis to pick simulation points to guide multithreaded simulation.
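Phase analysis of this kind typically summarizes each fixed-length interval of a thread's execution as a normalized vector of code-region execution counts and groups intervals whose vectors are close. A minimal similarity check, with dimensions and threshold invented for illustration:

```c
#include <math.h>
#include <stdio.h>

#define DIMS 4   /* number of tracked code regions */

/* Manhattan distance between two normalized count vectors. */
static double distance(const double *a, const double *b) {
    double d = 0;
    for (int i = 0; i < DIMS; i++) d += fabs(a[i] - b[i]);
    return d;
}

int main(void) {
    double t1[DIMS] = { 0.70, 0.10, 0.10, 0.10 };  /* interval of thread 1 */
    double t2[DIMS] = { 0.65, 0.15, 0.10, 0.10 };  /* interval of thread 2 */
    printf("same phase? %s\n", distance(t1, t2) < 0.2 ? "yes" : "no");
    return 0;
}
```

Keeping a separate vector stream per thread, as the paper's approach does, is what allows intervals to be matched both across threads within a run and across runs with different thread counts.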