Showing papers on "Memory management published in 2013"


Proceedings ArticleDOI
19 May 2013
TL;DR: The current knowledge about various protection techniques is systematized by setting up a general model for memory corruption attacks; the model shows which policies can stop which attacks and is used to analyze why protection mechanisms enforcing stricter policies are not deployed.
Abstract: Memory corruption bugs in software written in low-level languages like C or C++ are one of the oldest problems in computer security. The lack of safety in these languages allows attackers to alter the program's behavior or take full control over it by hijacking its control flow. This problem has existed for more than 30 years and a vast number of potential solutions have been proposed, yet memory corruption attacks continue to pose a serious threat. Real world exploits show that all currently deployed protections can be defeated. This paper sheds light on the primary reasons for this by describing attacks that succeed on today's systems. We systematize the current knowledge about various protection techniques by setting up a general model for memory corruption attacks. Using this model we show what policies can stop which attacks. The model identifies weaknesses of currently deployed techniques, as well as other proposed protections enforcing stricter policies. We analyze the reasons why protection mechanisms implementing stricter policies are not deployed. To achieve wide adoption, protection mechanisms must support a multitude of features and must satisfy a host of requirements. Especially important is performance, as experience shows that only solutions whose overhead is within reasonable bounds get deployed. A comparison of different enforceable policies helps designers of new protection mechanisms in finding the balance between effectiveness (security) and efficiency. We identify some open research problems, and provide suggestions on improving the adoption of newer techniques.

635 citations


Proceedings ArticleDOI
19 May 2013
TL;DR: This paper shows that an adversary can implement a generic side channel attack against the memory management system to deduce information about the privileged address space layout and can successfully circumvent kernel space ASLR on current operating systems.
Abstract: Due to the prevalence of control-flow hijacking attacks, a wide variety of defense methods to protect both user space and kernel space code have been developed in the past years. A few examples that have received widespread adoption include stack canaries, non-executable memory, and Address Space Layout Randomization (ASLR). When implemented correctly (i.e., a given system fully supports these protection methods and no information leak exists), the attack surface is significantly reduced and typical exploitation strategies are severely thwarted. All modern desktop and server operating systems support these techniques and ASLR has also been added to different mobile operating systems recently. In this paper, we study the limitations of kernel space ASLR against a local attacker with restricted privileges. We show that an adversary can implement a generic side channel attack against the memory management system to deduce information about the privileged address space layout. Our approach is based on the intrinsic property that the different caches are shared resources on computer systems. We introduce three implementations of our methodology and show that our attacks are feasible on four different x86-based CPUs (both 32- and 64-bit architectures) and also applicable to virtual machines. As a result, we can successfully circumvent kernel space ASLR on current operating systems. Furthermore, we also discuss mitigation strategies against our attacks, and propose and implement a defense solution with negligible performance overhead.

370 citations


Proceedings ArticleDOI
07 Nov 2013
TL;DR: It is shown that the effects of the memory bottleneck can be reduced by a flexible memory hierarchy that supports the complex data access patterns in CNN workloads and ensures that on-chip memory size is minimized, which reduces area and energy usage.
Abstract: In the near future, cameras will be used everywhere as flexible sensors for numerous applications. For mobility and privacy reasons, the required image processing should be local on embedded computer platforms with performance requirements and energy constraints. Dedicated acceleration of Convolutional Neural Networks (CNN) can achieve these targets with enough flexibility to perform multiple vision tasks. A challenging problem for the design of efficient accelerators is the limited amount of external memory bandwidth. We show that the effects of the memory bottleneck can be reduced by a flexible memory hierarchy that supports the complex data access patterns in CNN workloads. The efficiency of the on-chip memories is maximized by our scheduler that uses tiling to optimize for data locality. Our design flow ensures that on-chip memory size is minimized, which reduces area and energy usage. The design flow is evaluated by a High Level Synthesis implementation on a Virtex 6 FPGA board. Compared to accelerators with standard scratchpad memories, the FPGA resources can be reduced by up to 13× while maintaining the same performance. Alternatively, when the same amount of FPGA resources is used, our accelerators are up to 11× faster.

361 citations


Journal ArticleDOI
TL;DR: This paper presents an online loop-closure detection approach for large-scale and long-term operation based on a memory management method, which limits the number of locations used for loop-closure detection so that the computation time remains under real-time constraints.
Abstract: In appearance-based localization and mapping, loop-closure detection is the process used to determine if the current observation comes from a previously visited location or a new one. As the size of the internal map increases, so does the time required to compare new observations with all stored locations, eventually limiting online processing. This paper presents an online loop-closure detection approach for large-scale and long-term operation. The approach is based on a memory management method, which limits the number of locations used for loop-closure detection so that the computation time remains under real-time constraints. The idea consists of keeping the most recent and frequently observed locations in a working memory (WM) that is used for loop-closure detection, and transferring the others into a long-term memory (LTM). When a match is found between the current location and one stored in WM, associated locations that are stored in LTM can be updated and remembered for additional loop-closure detections. Results demonstrate the approach's adaptability and scalability using ten standard datasets from other appearance-based loop-closure approaches, one custom dataset using real images taken over a 2-km loop of our university campus, and one custom dataset (7 h) using virtual images from the racing video game “Need for Speed: Most Wanted”.
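
As a rough illustration of the WM/LTM transfer policy described above (not the authors' implementation), the following C sketch keeps a bounded working memory of weighted locations, demotes the weakest ones to long-term memory, and retrieves the neighbours of a matched location back into WM; the constants, weights, and ±1 neighbourhood are invented for the example.

```c
/* Minimal sketch of bounded working-memory management for loop closure. */
#include <stdio.h>

#define MAX_LOC  64
#define WM_LIMIT 4            /* max locations kept in working memory */

typedef struct { int id; int weight; int in_wm; } location_t;

static location_t loc[MAX_LOC];
static int nloc;

static void add_location(int weight)
{
    loc[nloc].id = nloc; loc[nloc].weight = weight; loc[nloc].in_wm = 1;
    nloc++;

    /* If WM is over budget, transfer the lowest-weight resident location
     * to LTM so loop-closure comparisons stay within a fixed budget. */
    int resident = 0;
    for (int i = 0; i < nloc; i++) resident += loc[i].in_wm;
    while (resident > WM_LIMIT) {
        int victim = -1;
        for (int i = 0; i < nloc; i++)
            if (loc[i].in_wm && (victim < 0 || loc[i].weight < loc[victim].weight))
                victim = i;
        loc[victim].in_wm = 0;          /* demote to long-term memory */
        resident--;
    }
}

/* On a loop closure with location `id`, bring its LTM neighbours back into
 * WM so nearby loop closures can be detected again (WM may temporarily
 * exceed its budget until the next transfer). */
static void retrieve_neighbours(int id)
{
    for (int i = id - 1; i <= id + 1; i++)
        if (i >= 0 && i < nloc) loc[i].in_wm = 1;
}

int main(void)
{
    int weights[] = { 5, 1, 7, 2, 9, 3 };
    for (int i = 0; i < 6; i++) add_location(weights[i]);
    retrieve_neighbours(1);             /* pretend location 1 matched */
    for (int i = 0; i < nloc; i++)
        printf("location %d: %s\n", loc[i].id, loc[i].in_wm ? "WM" : "LTM");
    return 0;
}
```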

337 citations


Proceedings ArticleDOI
23 Jun 2013
TL;DR: This work proposes mapping part of a process's linear virtual address space with a direct segment, while page mapping the rest of the virtual address space to remove the TLB miss overhead for big-memory workloads.
Abstract: Our analysis shows that many "big-memory" server workloads, such as databases, in-memory caches, and graph analytics, pay a high cost for page-based virtual memory. They consume as much as 10% of execution cycles on TLB misses, even using large pages. On the other hand, we find that these workloads use read-write permission on most pages, are provisioned not to swap, and rarely benefit from the full flexibility of page-based virtual memory. To remove the TLB miss overhead for big-memory workloads, we propose mapping part of a process's linear virtual address space with a direct segment, while page mapping the rest of the virtual address space. Direct segments use minimal hardware---base, limit and offset registers per core---to map contiguous virtual memory regions directly to contiguous physical memory. They eliminate the possibility of TLB misses for key data structures such as database buffer pools and in-memory key-value stores. Memory mapped by a direct segment may be converted back to paging when needed. We prototype direct-segment software support for x86-64 in Linux and emulate direct-segment hardware. For our workloads, direct segments eliminate almost all TLB misses and reduce the execution time wasted on TLB misses to less than 0.5%.
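
The translation rule behind a direct segment is simple enough to sketch in a few lines. The following C fragment is a minimal software illustration (not the paper's hardware or Linux code): the register names, constants, and identity-mapped paging stub are assumptions made for the example.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdbool.h>

/* Hypothetical per-core direct-segment registers; field names are illustrative. */
typedef struct {
    uint64_t base;   /* first virtual address covered by the segment */
    uint64_t limit;  /* first virtual address past the segment       */
    uint64_t offset; /* VA-to-PA displacement for the segment        */
} direct_segment_t;

/* Stand-in for the conventional paged translation path. */
static uint64_t page_table_walk(uint64_t va) { return va; /* identity stub */ }

/* Addresses inside [base, limit) bypass the TLB entirely; everything else
 * falls back to ordinary paging. */
static uint64_t translate(const direct_segment_t *seg, uint64_t va, bool *hit)
{
    if (va >= seg->base && va < seg->limit) {
        *hit = true;
        return va + seg->offset;   /* direct mapping: no TLB miss possible */
    }
    *hit = false;
    return page_table_walk(va);
}

int main(void)
{
    direct_segment_t seg = { .base = 0x10000000, .limit = 0x50000000, .offset = 0x2000000 };
    bool hit;
    uint64_t pa = translate(&seg, 0x10000040, &hit);
    printf("0x10000040 -> 0x%llx (%s)\n", (unsigned long long)pa,
           hit ? "direct segment" : "paged");
    return 0;
}
```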

319 citations


Proceedings ArticleDOI
26 May 2013
TL;DR: Three key solution directions are surveyed: enabling new DRAM architectures, functions, interfaces, and better integration of the DRAM and the rest of the system, designing a memory system that employs emerging memory technologies and takes advantage of multiple different technologies, and providing predictable performance and QoS to applications sharing the memory system.
Abstract: The memory system is a fundamental performance and energy bottleneck in almost all computing systems. Recent system design, application, and technology trends that require more capacity, bandwidth, efficiency, and predictability out of the memory system make it an even more important system bottleneck. At the same time, DRAM technology is experiencing difficult technology scaling challenges that make the maintenance and enhancement of its capacity, energy-efficiency, and reliability significantly more costly with conventional techniques. In this paper, after describing the demands and challenges faced by the memory system, we examine some promising research and design directions to overcome challenges posed by memory scaling. Specifically, we survey three key solution directions: 1) enabling new DRAM architectures, functions, interfaces, and better integration of the DRAM and the rest of the system, 2) designing a memory system that employs emerging memory technologies and takes advantage of multiple different technologies, 3) providing predictable performance and QoS to applications sharing the memory system. We also briefly describe our ongoing related work in combating scaling challenges of NAND flash memory.

270 citations


Proceedings ArticleDOI
07 Dec 2013
TL;DR: Kiln is a persistent memory design that adopts a nonvolatile cache and a nonvolatile main memory to enable atomic in-place updates without logging or copy-on-write and can achieve 2× performance improvement compared with NVRAM-based persistent memory with write-ahead logging.
Abstract: Persistent memory is an emerging technology which allows in-memory persistent data objects to be updated at much higher throughput than when using disks as persistent storage. Previous persistent memory designs use logging or copy-on-write mechanisms to update persistent data, which unfortunately reduces the system performance to roughly half that of a native system with no persistence support. One of the great challenges in this application class is therefore how to efficiently enable atomic, consistent, and durable updates to ensure data persistence that survives application and/or system failures. Our goal is to design a persistent memory system with performance very close to that of a native system. We propose Kiln, a persistent memory design that adopts a nonvolatile cache and a nonvolatile main memory to enable atomic in-place updates without logging or copy-on-write. Our evaluation shows that Kiln can achieve 2× performance improvement compared with NVRAM-based persistent memory with write-ahead logging. In addition, our design has numerous practical advantages: a simple and intuitive abstract interface, microarchitecture-level optimizations, fast recovery from failures, and eliminating redundant writes to nonvolatile storage media.

239 citations


Patent
15 Mar 2013
TL;DR: In this paper, a mapping module is configured to determine whether to associate a range of data with the auto-commit memory, and a bypass module is used to service a request for the data directly from the auto-commit memory.
Abstract: Apparatuses, systems, methods, and computer program products are disclosed for providing access to auto-commit memory. An auto-commit memory module is configured to cause an auto-commit memory to commit stored data to a non-volatile memory medium in response to a failure condition. A mapping module is configured to determine whether to associate a range of data with the auto-commit memory. A bypass module is configured to service a request for the range of data directly from the auto-commit memory in response to the auto-commit mapping module determining to associate the range of data with the auto-commit memory.

184 citations


Patent
09 May 2013
TL;DR: In this article, a switching device is configured to route memory requests based on mappings between the data addresses in memory requests from a consumer device relating to a data object and information about the storage locations, in the one or more memory resources, of the data from that data object.
Abstract: Systems, methods and devices for distributed memory management comprising a network component configured for network communication with one or more memory resources that store data and one or more consumer devices that use data, the network component comprising a switching device in operative communication with a mapping resource, wherein the mapping resource is configured to associate mappings between data addresses associated with memory requests from a consumer device relating to a data object and information relating to a storage location in the one or more memory resources associated with the data from the data object, wherein each data address has contained therein identification information for identifying the data from the data object associated with that data address; and the switching device is configured to route memory requests based on the mappings.

173 citations


Proceedings ArticleDOI
07 Dec 2013
TL;DR: It is shown that any compression algorithm can be adapted to fit the requirements of LCP, and two previously proposed compression algorithms are adapted to LCP: Frequent Pattern Compression and Base-Delta-Immediate Compression.
Abstract: Data compression is a promising approach for meeting the increasing memory capacity demands expected in future systems. Unfortunately, existing compression algorithms do not translate well when directly applied to main memory because they require the memory controller to perform non-trivial computation to locate a cache line within a compressed memory page, thereby increasing access latency and degrading system performance. Prior proposals for addressing this performance degradation problem are either costly or energy inefficient. By leveraging the key insight that all cache lines within a page should be compressed to the same size, this paper proposes a new approach to main memory compression — Linearly Compressed Pages (LCP) — that avoids the performance degradation problem without requiring costly or energy-inefficient hardware. We show that any compression algorithm can be adapted to fit the requirements of LCP, and we specifically adapt two previously-proposed compression algorithms to LCP: Frequent Pattern Compression and Base-Delta-Immediate Compression. Evaluations using benchmarks from SPEC CPU2006 and five server benchmarks show that our approach can significantly increase the effective memory capacity (by 69% on average). In addition to the capacity gains, we evaluate the benefit of transferring consecutive compressed cache lines between the memory controller and main memory. Our new mechanism considerably reduces the memory bandwidth requirements of most of the evaluated benchmarks (by 24% on average), and improves overall performance (by 6.1%/13.9%/10.7% for single-/two-/four-core workloads on average) compared to a baseline system that does not employ main memory compression. LCP also decreases energy consumed by the main memory subsystem (by 9.5% on average over the best prior mechanism).
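
The key arithmetic behind LCP can be illustrated with a short sketch: because every cache line in a linearly compressed page has the same compressed size, line i sits at a fixed offset computable with one multiply-add. The C fragment below is an illustration only; the constants are invented, and the real design's metadata and exception regions for incompressible lines are omitted.

```c
#include <stdint.h>
#include <stdio.h>

/* Illustrative LCP parameters (placeholders, not the paper's configuration). */
#define CACHE_LINE_SIZE      64u   /* uncompressed line size in bytes        */
#define COMPRESSED_LINE_SIZE 16u   /* fixed compressed size within the page  */

/* Because every line in a linearly compressed page has the same size, the
 * memory controller can locate line i without walking per-line metadata. */
static uint64_t compressed_line_address(uint64_t page_base, unsigned line_index)
{
    return page_base + (uint64_t)line_index * COMPRESSED_LINE_SIZE;
}

int main(void)
{
    uint64_t page_base = 0x100000;
    for (unsigned i = 0; i < 4; i++)
        printf("line %u -> 0x%llx\n", i,
               (unsigned long long)compressed_line_address(page_base, i));
    return 0;
}
```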

Proceedings ArticleDOI
03 Nov 2013
TL;DR: This paper introduces techniques for robust wear-aware memory allocation, prevention of erroneous writes, and consistency-preserving updates that are cache-efficient, and demonstrates a B+-tree implementation modified to make full use of the toolkit.
Abstract: This paper presents three building blocks for enabling the efficient and safe design of persistent data stores for emerging non-volatile memory technologies. Taking the fullest advantage of the low latency and high bandwidths of emerging memories such as phase change memory (PCM), spin torque, and memristor necessitates a serious look at placing these persistent storage technologies on the main memory bus. Doing so, however, introduces critical challenges of not sacrificing the data reliability and consistency that users demand from storage. This paper introduces techniques for (1) robust wear-aware memory allocation, (2) prevention of erroneous writes, and (3) consistency-preserving updates that are cache-efficient. We show through our evaluation that these techniques are efficiently implementable and effective by demonstrating a B+-tree implementation modified to make full use of our toolkit.

Journal ArticleDOI
01 Sep 2013
TL;DR: The results show that for higher skewed workloads the anti-caching architecture has a performance advantage of up to 9× over either of the other architectures tested, for a data size 8× larger than memory.
Abstract: The traditional wisdom for building disk-based relational database management systems (DBMS) is to organize data in heavily-encoded blocks stored on disk, with a main memory block cache. In order to improve performance given high disk latency, these systems use a multi-threaded architecture with dynamic record-level locking that allows multiple transactions to access the database at the same time. Previous research has shown that this results in substantial overhead for on-line transaction processing (OLTP) applications [15]. The next generation DBMSs seek to overcome these limitations with an architecture based on main-memory-resident data. To overcome the restriction that all data fit in main memory, we propose a new technique, called anti-caching, where cold data is moved to disk in a transactionally-safe manner as the database grows in size. Because data initially resides in memory, an anti-caching architecture reverses the traditional storage hierarchy of disk-based systems. Main memory is now the primary storage device. We implemented a prototype of our anti-caching proposal in a high-performance, main memory OLTP DBMS and performed a series of experiments across a range of database sizes, workload skews, and read/write mixes. We compared its performance with an open-source, disk-based DBMS optionally fronted by a distributed main memory cache. Our results show that for higher skewed workloads the anti-caching architecture has a performance advantage of up to 9× over either of the other architectures tested, for a data size 8× larger than memory.
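
A minimal sketch of the anti-caching idea (not the paper's prototype) is shown below: tuples live in memory, carry an LRU timestamp, and when the table exceeds its memory budget the coldest tuples are written to an on-disk block and marked as evicted. The table size, budget, and payload format are invented for the example, and transactional safety is ignored.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define NTUPLES 8
#define BUDGET  4                 /* max tuples allowed to stay in memory */

typedef struct {
    int           in_memory;      /* 1 = resident, 0 = evicted stub        */
    long          disk_offset;    /* where the payload lives when evicted  */
    unsigned long last_use;       /* logical clock for LRU                 */
    char          payload[32];
} tuple_t;

static tuple_t table[NTUPLES];
static unsigned long clock_tick;

static void access_tuple(int i) { table[i].last_use = ++clock_tick; }

/* Evict the coldest resident tuples to the anti-cache file until the table
 * is back under its memory budget, leaving a stub with the disk offset. */
static void evict_cold(FILE *anticache)
{
    int resident = 0;
    for (int i = 0; i < NTUPLES; i++) resident += table[i].in_memory;
    while (resident > BUDGET) {
        int victim = -1;
        for (int i = 0; i < NTUPLES; i++)
            if (table[i].in_memory &&
                (victim < 0 || table[i].last_use < table[victim].last_use))
                victim = i;
        fseek(anticache, 0, SEEK_END);
        table[victim].disk_offset = ftell(anticache);
        fwrite(table[victim].payload, sizeof table[victim].payload, 1, anticache);
        memset(table[victim].payload, 0, sizeof table[victim].payload);
        table[victim].in_memory = 0;
        resident--;
    }
}

int main(void)
{
    FILE *anticache = tmpfile();
    for (int i = 0; i < NTUPLES; i++) {
        table[i].in_memory = 1;
        snprintf(table[i].payload, sizeof table[i].payload, "tuple-%d", i);
        access_tuple(i);
    }
    access_tuple(1); access_tuple(2);   /* keep a couple of tuples hot */
    evict_cold(anticache);
    for (int i = 0; i < NTUPLES; i++)
        printf("tuple %d: %s\n", i, table[i].in_memory ? "in memory" : "evicted");
    fclose(anticache);
    return 0;
}
```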

Proceedings ArticleDOI
07 Dec 2013
TL;DR: This work designs and evaluates a locality-aware memory hierarchy for throughput processors, such as GPUs, that retains the advantages of coarse-grained accesses for spatially and temporally local programs while permitting selective fine- grained access to memory.
Abstract: As GPUs' compute capabilities grow, their memory hierarchy increasingly becomes a bottleneck. Current GPU memory hierarchies use coarse-grained memory accesses to exploit spatial locality, maximize peak bandwidth, simplify control, and reduce cache meta-data storage. These coarse-grained memory accesses, however, are a poor match for emerging GPU applications with irregular control flow and memory access patterns. Meanwhile, the massive multi-threading of GPUs and the simplicity of their cache hierarchies make CPU-specific memory system enhancements ineffective for improving the performance of irregular GPU applications. We design and evaluate a locality-aware memory hierarchy for throughput processors, such as GPUs. Our proposed design retains the advantages of coarse-grained accesses for spatially and temporally local programs while permitting selective fine-grained access to memory. By adaptively adjusting the access granularity, memory bandwidth and energy are reduced for data with low spatial/temporal locality without wasting control overheads or prefetching potential for data with high spatial locality. As such, our locality-aware memory hierarchy improves GPU performance, energy-efficiency, and memory throughput for a large range of applications.

Patent
11 Mar 2013
TL;DR: In this article, the error recovery mechanism is configured for receiving notification of the failing memory channel, for performing a recovery operation on the failed memory channel while other memory channels are performing normal system operations, for bringing the recovered channel back into operational mode with the other memory channel for store operations, and for removing any stale data after the recovery operation is complete.
Abstract: Providing heterogeneous recovery in a redundant memory system that includes a memory controller, a plurality of memory channels in communication with the memory controller, an error detection code mechanism configured for detecting a failing memory channel, and an error recovery mechanism. The error recovery mechanism is configured for receiving notification of the failing memory channel, for performing a recovery operation on the failing memory channel while other memory channels are performing normal system operations, for bringing the recovered channel back into operational mode with the other memory channels for store operations, for continuing to mark the recovered channel to guard against stale data, for removing any stale data after the recovery operation is complete, and for removing the mark on the recovered channel to allow the normal system operations with all of the memory channels, the removing based on the removing any stale data being complete.

Patent
05 Mar 2013
TL;DR: In this paper, the authors present a method for accessing data of a range of virtual memory from a non-volatile medium using a persistent identifier associated with referenced data and written data.
Abstract: Apparatuses, systems, methods, and computer program products are disclosed for hybrid checkpointed memory. A method includes referencing data of a range of virtual memory of a host. The referenced data is already stored by a non-volatile medium. A method includes writing, to a non-volatile medium, data of a range of virtual memory that is not stored by the non-volatile medium. A method includes providing access to data of a range of virtual memory from a non-volatile medium using a persistent identifier associated with referenced data and written data.

Journal ArticleDOI
TL;DR: A multipath delay commutator (MDC)-based architecture and memory scheduling to implement fast Fourier transform (FFT) processors for multiple input multiple output-orthogonal frequency division multiplexing (MIMO-OFDM) systems with variable length are presented.
Abstract: This paper presents a multipath delay commutator (MDC)-based architecture and memory scheduling to implement fast Fourier transform (FFT) processors for multiple input multiple output-orthogonal frequency division multiplexing (MIMO-OFDM) systems with variable length. Based on the MDC architecture, we propose to use radix-Ns butterflies at each stage, where Ns is the number of data streams, so that there is only one butterfly needed in each stage. Consequently, a 100% utilization rate in computational elements is achieved. Moreover, thanks to the simple control mechanism of the MDC, we propose simple memory scheduling methods for input data and output bit/set-reversing, which again results in a full utilization rate in memory usage. Since the memory requirements usually dominate the die area of FFT/inverse fast Fourier transform (IFFT) processors, the proposed scheme can effectively reduce the memory size and thus the die area as well. Furthermore, to apply the proposed scheme in practical applications, we let Ns=4 and implement a 4-stream FFT/IFFT processor with variable length including 2048, 1024, 512, and 128 for MIMO-OFDM systems. This processor can be used in IEEE 802.16 WiMAX and 3GPP long term evolution applications. The processor was implemented with a UMC 90-nm CMOS technology with a core area of 3.1 mm². The power consumption at 40 MHz was 63.72/62.92/57.51/51.69 mW for 2048/1024/512/128-FFT, respectively, in the post-layout simulation. Finally, we analyze the complexity and performance of the implemented processor and compare it with other processors. The results show advantages of the proposed scheme in terms of area and power consumption.

Proceedings ArticleDOI
15 Apr 2013
TL;DR: This work extends ballooning to applications so that memory can be efficiently and effectively moved between virtualized instances as the demands of each change over time, with significantly lower memory requirements.
Abstract: Systems software like databases and language runtimes typically manage memory themselves to exploit application knowledge unavailable to the OS. Traditionally deployed on dedicated machines, they are designed to be statically configured with memory sufficient for peak load. In virtualization scenarios (cloud computing, server consolidation), however, static peak provisioning of RAM to applications dramatically reduces the efficiency and cost-saving benefits of virtualization. Unfortunately, existing memory "ballooning" techniques used to dynamically reallocate physical memory between VMs badly impact the performance of applications which manage their own memory. We address this problem by extending ballooning to applications (here, a database engine and Java runtime) so that memory can be efficiently and effectively moved between virtualized instances as the demands of each change over time. The results are significantly lower memory requirements to provide the same performance guarantees to a collocated set of VMs running such applications, with minimal overhead or intrusive changes to application code.

Proceedings ArticleDOI
Weirong Jiang1
21 Oct 2013
TL;DR: This paper presents a scalable random access memory (RAM)-based TCAM architecture aiming for efficient implementation on state-of-the-art FPGAs, and is the first FPGA design that implements a TCAM larger than 1 Mbits.
Abstract: Ternary Content Addressable Memory (TCAM) is widely used in network infrastructure for various search functions. There has been a growing interest in implementing TCAM using reconfigurable hardware such as Field Programmable Gate Array (FPGA). Most existing FPGA-based TCAM designs are based on brute-force implementations, which result in inefficient on-chip resource usage. As a result, existing designs support only a small TCAM size even with large FPGA devices. They also suffer from significant throughput degradation in implementing a large TCAM, mainly caused by deep priority encoding. This paper presents a scalable random access memory (RAM)-based TCAM architecture aiming for efficient implementation on state-of-the-art FPGAs. We give a formal study on RAM-based TCAM to unveil the ideas and the algorithms behind it. To conquer the timing challenge, we propose a modular architecture consisting of arrays of small-size RAM-based TCAM units. After decoupling the update logic from each unit, the modular architecture allows us to share each update engine among multiple units. This leads to resource saving. The capability of explicit range matching is also offered to avoid range-to-ternary conversion for search functions that require range matching. Implementation on a Xilinx Virtex 7 FPGA shows that our design can support a large TCAM of up to 2.4 Mbits while sustaining high throughput of 150 million packets per second. The resource usage scales linearly with the TCAM size. The architecture is configurable, allowing various performance trade-offs to be exploited. To the best of our knowledge, this is the first FPGA design that implements a TCAM larger than 1 Mbits.
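
The general idea behind RAM-based TCAM emulation can be sketched in software: the search key is cut into small chunks, each chunk indexes a RAM whose words are bitmaps of the rules matching that chunk value, and a lookup ANDs the bitmaps and priority-encodes the survivors. The C sketch below illustrates this generic technique with an invented rule table and an 8-bit key; it is not the paper's FPGA architecture or update logic.

```c
#include <stdint.h>
#include <stdio.h>

#define NRULES  4
#define NCHUNKS 2                       /* 8-bit keys, two 4-bit chunks */

static uint32_t ram[NCHUNKS][16];       /* one 16-word RAM per chunk */

/* Ternary rules as value/mask pairs; a mask bit of 1 means "care". */
static const uint8_t rule_val[NRULES]  = { 0x1F, 0x10, 0xA0, 0x00 };
static const uint8_t rule_mask[NRULES] = { 0xFF, 0xF0, 0xF0, 0x00 };

/* Precompute, for every chunk value, the bitmap of rules it satisfies. */
static void build_rams(void)
{
    for (int c = 0; c < NCHUNKS; c++)
        for (int v = 0; v < 16; v++) {
            uint32_t bitmap = 0;
            for (int r = 0; r < NRULES; r++) {
                int shift = 4 * (NCHUNKS - 1 - c);
                uint8_t rv = (rule_val[r]  >> shift) & 0xF;
                uint8_t rm = (rule_mask[r] >> shift) & 0xF;
                if ((v & rm) == (rv & rm))
                    bitmap |= 1u << r;
            }
            ram[c][v] = bitmap;
        }
}

/* AND the per-chunk bitmaps, then priority-encode (lowest rule index wins). */
static int lookup(uint8_t key)
{
    uint32_t match = ~0u;
    for (int c = 0; c < NCHUNKS; c++)
        match &= ram[c][(key >> (4 * (NCHUNKS - 1 - c))) & 0xF];
    for (int r = 0; r < NRULES; r++)
        if (match & (1u << r))
            return r;
    return -1;
}

int main(void)
{
    build_rams();
    printf("key 0x1F -> rule %d\n", lookup(0x1F));  /* exact-match rule 0 */
    printf("key 0x17 -> rule %d\n", lookup(0x17));  /* prefix rule 1      */
    printf("key 0x55 -> rule %d\n", lookup(0x55));  /* wildcard rule 3    */
    return 0;
}
```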

Patent
22 Aug 2013
TL;DR: In this article, a method for enabling inter-process communication between a first application and a second application, the first application running within a first context and the second application running in a second context of a virtualization system is described.
Abstract: A method for enabling inter-process communication between a first application and a second application, the first application running within a first context and the second application running within a second context of a virtualization system is described. The method includes receiving a request to attach a shared region of memory to a memory allocation, identifying a list of one or more physical memory pages defining the shared region that corresponds to the handle, and mapping guest memory pages corresponding to the allocation to the physical memory pages. The request is received by a framework from the second application and includes a handle that uniquely identifies the shared region of memory as well as an identification of at least one guest memory page corresponding to the memory allocation. The framework is a component of a virtualization software, which executes in a context distinct from the context of the first application.

Journal ArticleDOI
TL;DR: An energy profiler tool for the systems that use ARM7TDMI processors is developed by embedding the model parameters in an instruction-level profiler from the SimpleScalar toolset which provides valuable information and guidelines for software energy optimization.
Abstract: Estimating the energy consumption of applications is a key aspect in optimizing embedded systems energy consumption. This paper proposes a simple yet accurate instruction-level energy estimation model for embedded systems. As a case study, the model parameters were determined for a commonly used ARM7TDMI-based microcontroller. The total energy includes the energy consumption of the processor core, Flash memory, memory controller, and SRAM. The model parameters are instructions opcode, number of shift operations, register bank bit flips, instructions weight and their Hamming distance, and different types of memory accesses. Also, the effect of pipeline stalls has been considered. In order to validate the proposed model, a physical hardware platform equipped with energy measurement capabilities was developed. We have conducted experiments on several embedded applications from MiBench benchmark suite and the results show less than 6% error in the energy consumption estimation. We have also developed an energy profiler tool for the systems that use ARM7TDMI processors by embedding the model parameters in an instruction-level profiler from the SimpleScalar toolset which provides valuable information and guidelines for software energy optimization.
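
An instruction-level model of this kind boils down to summing, per executed instruction, a base opcode cost plus terms weighted by Hamming distance, register bit flips, memory accesses, and stall cycles. The C sketch below shows that structure only; the coefficients and the trace are placeholders, not the calibrated ARM7TDMI parameters from the paper.

```c
#include <stdio.h>

typedef struct {
    double base_cost;     /* per-opcode base energy (nJ)                  */
    int    hamming_prev;  /* Hamming distance to the previous instruction */
    int    reg_bit_flips; /* register-bank bit flips                      */
    int    mem_accesses;  /* Flash/SRAM accesses                          */
    int    stall_cycles;  /* pipeline stalls                              */
} instr_event_t;

/* Placeholder coefficients (nJ per unit), invented for the example. */
static const double C_HAMMING = 0.02, C_FLIP = 0.01, C_MEM = 1.5, C_STALL = 0.4;

/* Sum the per-instruction contributions over an execution trace. */
static double estimate_energy(const instr_event_t *trace, int n)
{
    double total = 0.0;
    for (int i = 0; i < n; i++)
        total += trace[i].base_cost
               + C_HAMMING * trace[i].hamming_prev
               + C_FLIP    * trace[i].reg_bit_flips
               + C_MEM     * trace[i].mem_accesses
               + C_STALL   * trace[i].stall_cycles;
    return total;
}

int main(void)
{
    instr_event_t trace[] = {
        { 1.0, 0, 3, 1, 0 },   /* e.g. a load  */
        { 0.6, 9, 2, 0, 1 },   /* e.g. an add  */
        { 0.6, 4, 1, 0, 0 },   /* e.g. a move  */
    };
    printf("estimated energy: %.2f nJ\n",
           estimate_energy(trace, (int)(sizeof trace / sizeof trace[0])));
    return 0;
}
```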

Patent
30 Sep 2013
TL;DR: In this paper, the authors present a system and method for clients, a control module, and storage modules to participate in a unified address space in order to and read and write data efficiently using direct-memory access.
Abstract: A system and method for clients, a control module, and storage modules to participate in a unified address space in order to read and write data efficiently using direct-memory access. The method for writing data includes determining a first location in a first memory to write a first copy of the data, a second location in a second memory to write a second copy of the data, where the first memory is located in a first storage module including a first persistent storage and the second memory is located in a second storage module including a second persistent storage. The method further includes programming a direct memory access engine to read the data from client memory and issue a first write request to a multicast address, where the first location, the second location, and a third location are associated with the multicast address.

Proceedings ArticleDOI
24 Jun 2013
TL;DR: This paper investigates the security implication of memory deduplication from the perspectives of both attackers and defenders, and demonstrates two new attacks to create a covert channel and detect virtualization.
Abstract: Memory deduplication has been widely used in various commodity hypervisors. By merging identical memory contents, it allows more virtual machines to run concurrently on top of a hypervisor. However, while this technique improves memory efficiency, it has a large impact on system security. In particular, memory deduplication is usually implemented using a variant of copy-on-write techniques, for which writing to a shared page incurs a longer access time than writing to a non-shared one. In this paper, we investigate the security implication of memory deduplication from the perspectives of both attackers and defenders. On one hand, using this timing artifact, we demonstrate two new attacks to create a covert channel and detect virtualization, respectively. On the other hand, we also show that memory deduplication can be leveraged to safeguard Linux kernel integrity.
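
The measurement primitive behind such dedup-based attacks can be sketched as a write-timing probe: after the dedup scanner has had time to merge a page with known contents, the first write to that page must break copy-on-write and is measurably slower. The C sketch below illustrates the probe under assumptions not taken from the paper (the fill pattern, the 60-second wait, and single-page probing are arbitrary); on Linux/KSM the region must also be marked MADV_MERGEABLE, while cross-VM settings rely on the hypervisor's page sharing.

```c
#define _DEFAULT_SOURCE
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <time.h>
#include <sys/mman.h>

#define PAGE 4096

static uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

int main(void)
{
    unsigned char *buf = mmap(NULL, PAGE, PROT_READ | PROT_WRITE,
                              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return 1;
    memset(buf, 0x41, PAGE);            /* contents the other party also maps */
#ifdef MADV_MERGEABLE
    madvise(buf, PAGE, MADV_MERGEABLE); /* opt in to KSM merging on Linux */
#endif

    sleep(60);                          /* give the dedup scanner time to merge */

    uint64_t t0 = now_ns();
    ((volatile unsigned char *)buf)[0] ^= 1;  /* first write: COW break if merged */
    uint64_t t1 = now_ns();

    /* A slow first write suggests the page had been deduplicated, i.e. an
     * identical page exists elsewhere; that observation is one covert bit. */
    printf("first-write latency: %llu ns\n", (unsigned long long)(t1 - t0));
    munmap(buf, PAGE);
    return 0;
}
```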

Proceedings ArticleDOI
08 Apr 2013
TL;DR: HANDS is a framework that dynamically pre-fetches fingerprints from disk into memory cache according to working sets statistically derived from access patterns, making it suitable for a wide range of storage systems without the need to modify host file systems.
Abstract: Deduplicating in-line data on primary storage is hampered by the disk bottleneck problem, an issue which results from the need to keep an index mapping portions of data to hash values in memory in order to detect duplicate data without paying the performance penalty of disk paging. The index size is proportional to the volume of unique data, so placing the entire index into RAM is not cost effective with a deduplication ratio below 45%. HANDS reduces the amount of in-memory index storage required by up to 99% while still achieving between 30% and 90% of the deduplication a full memory-resident index provides, making primary deduplication cost effective in workloads with deduplication rates as low as 8%. HANDS is a framework that dynamically pre-fetches fingerprints from disk into memory cache according to working sets statistically derived from access patterns. We use a simple neighborhood grouping as our statistical technique to demonstrate the effectiveness of our approach. HANDS is modular and requires only spatio-temporal data, making it suitable for a wide range of storage systems without the need to modify host file systems.

Patent
26 Feb 2013
TL;DR: In this article, metadata updates are stored in a first tier of a multi-tier nonvolatile memory structure responsive to access operations associated with data objects in the memory structure, and the stored metadata update are further migrated to a different location within the first tier responsive to an accumulated count of said access operations.
Abstract: Method and apparatus for managing data in a memory. In accordance with some embodiments, metadata updates are stored in a first tier of a multi-tier non-volatile memory structure responsive to access operations associated with data objects in the memory structure. The stored metadata updates are logged in a second, lower tier of the memory structure. The stored metadata updates are further migrated to a different location within the first tier responsive to an accumulated count of said access operations.

Proceedings ArticleDOI
24 Jun 2013
TL;DR: This paper proposes a simple and low-overhead technique that enables main-memory databases to efficiently migrate cold data to secondary storage by relying on the OS's virtual memory paging mechanism, and transparently re-organizes the in-memory data structures to reduce paging I/O and improve hit rates.
Abstract: Even though main memory is becoming large enough to fit most OLTP databases, it may not always be the best option. OLTP workloads typically exhibit skewed access patterns where some records are hot (frequently accessed) but many records are cold (infrequently or never accessed). Therefore, it is more economical to store the coldest records on a fast secondary storage device such as a solid-state disk. However, main-memory DBMSs have no knowledge of secondary storage, while traditional disk-based databases, designed for workloads where data resides on HDD, introduce too much overhead for the common case where the working set is memory resident. In this paper, we propose a simple and low-overhead technique that enables main-memory databases to efficiently migrate cold data to secondary storage by relying on the OS's virtual memory paging mechanism. We propose to log accesses at the tuple level, process the access traces offline to identify relevant access patterns, and then transparently re-organize the in-memory data structures to reduce paging I/O and improve hit rates. The hot/cold data separation is performed on demand and incrementally through careful memory management, without any change to the underlying data structures. We validate experimentally the data re-organization proposal and show that OS paging can be efficient: a TPC-C database can grow two orders of magnitude larger than the available memory size without a noticeable impact on performance.
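
The offline re-organization step can be pictured with a small sketch: given per-tuple access counts from the log, tuples are re-packed so that hot tuples share pages and cold tuples cluster on pages the OS can page out. The C fragment below only illustrates that sorting step, with invented tuple IDs and counts; the paper's incremental, on-demand separation and its interaction with the DBMS data structures are not shown.

```c
#include <stdio.h>
#include <stdlib.h>

typedef struct { int id; unsigned long accesses; } tuple_meta_t;

/* Order tuples from hottest to coldest (ties broken by id for stability). */
static int by_hotness_desc(const void *a, const void *b)
{
    const tuple_meta_t *x = a, *y = b;
    if (x->accesses != y->accesses)
        return (y->accesses > x->accesses) ? 1 : -1;
    return x->id - y->id;
}

int main(void)
{
    /* Access counts as they might come out of a tuple-level access log. */
    tuple_meta_t tuples[] = {
        {0, 3}, {1, 9120}, {2, 0}, {3, 54}, {4, 8800}, {5, 1}, {6, 0}, {7, 47},
    };
    int n = (int)(sizeof tuples / sizeof tuples[0]);

    /* The sorted order gives the new physical layout: hot tuples first
     * (kept resident), cold tuples last (candidates for OS paging). */
    qsort(tuples, (size_t)n, sizeof tuples[0], by_hotness_desc);

    printf("new layout (hot -> cold): ");
    for (int i = 0; i < n; i++)
        printf("%d ", tuples[i].id);
    printf("\n");
    return 0;
}
```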

Proceedings ArticleDOI
22 Sep 2013
TL;DR: This work proposes an automated approach that combines performance counters and execution logs to diagnose memory-related issues in load tests and performs three case studies on two systems.
Abstract: Load tests ensure that software systems are able to perform under the expected workloads. The current state of load test analysis requires significant manual review of performance counters and execution logs, and a high degree of system-specific expertise. In particular, memory-related issues (e.g., memory leaks or spikes), which may degrade performance and cause crashes, are difficult to diagnose. Performance analysts must correlate hundreds of megabytes or gigabytes of performance counters (to understand resource usage) with execution logs (to understand system behaviour). However, little work has been done to combine these two types of information to assist performance analysts in their diagnosis. We propose an automated approach that combines performance counters and execution logs to diagnose memory-related issues in load tests. We perform three case studies on two systems: one open-source system and one large-scale enterprise system. Our approach flags ≤ 0.1% of the execution logs with a precision ≥ 80%.

Proceedings ArticleDOI
10 Jun 2013
TL;DR: This paper addresses new system design issues that will occur when a large quantity of emerging persistent RAM (PRAM) is put on the main memory bus of a platform and proposes Memorage, a system architecture that virtually manages all available physical resources for memory and storage in an integrated manner.
Abstract: This paper addresses new system design issues that will occur when a large quantity of emerging persistent RAM (PRAM) is put on the main memory bus of a platform. First, we anticipate that continued technology advances will enable us to integrate (portions of) the system storage within the PRAM modules on a system board. This change calls for comprehensive re-examination of the system design concepts that assume "slow" disk and the block I/O concept. Next, we propose Memorage, a system architecture that virtually manages all available physical resources for memory and storage in an integrated manner. Memorage leverages the existing OS virtual memory (VM) manager to improve the performance of memory-intensive workloads and achieve longer lifetime of the main memory. We design and implement a prototype system in the Linux OS to study the effectiveness of Memorage. Obtained results are promising; Memorage is shown to offer additional physical memory capacity to demanding workloads much more efficiently than a conventional VM manager. Under memory pressure, the performance of studied memory-intensive multiprogramming workloads was improved by up to 40.5% with an average of 16.7%. Moreover, Memorage is shown to extend the lifetime of the PRAM main memory by 3.9 or 6.9 times on a system with 8 GB PRAM main memory and a 240 GB or 480 GB PRAM storage.

Proceedings ArticleDOI
15 Jul 2013
TL;DR: A fast and memory-efficient Dynamic Time Warping (MES-DTW) algorithm is proposed for the task of Query-by-Example Spoken Term Detection (QbE-STD), and the system used to perform it is described, including an energy-based quantification for speech/non-speech detection and an overlap detector for putative matches.
Abstract: In this paper we propose a fast and memory efficient Dynamic Time Warping (MES-DTW) algorithm for the task of Query-by-Example Spoken Term Detection (QbE-STD). The proposed algorithm is based on the subsequence-DTW (S-DTW) algorithm, which allows the search for small spoken queries within a much bigger search collection of spoken documents by considering fixed start-end points in the query and discovering optimal matching subsequences along the search collection. The proposed algorithm applies some modifications to S-DTW that make it better suited for the QbE-STD task, including a way to perform the matching with virtually no system memory, optimal when querying large scale databases. We also describe the system used to perform QbE-STD, including an energy-based quantification for speech/non-speech detection and an overlap detector for putative matches. We test the system proposed using the Mediaeval 2012 spoken-web-search dataset and show that, in addition to the memory savings, the proposed algorithm brings an advantage in terms of matching accuracy (up to 0.235 absolute MTWV increase) and speed (around 25% faster) in comparison to the original S-DTW.
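
The core of subsequence DTW is a dynamic program with a free start point along the search collection, and memory-efficient variants keep only two rows (or columns) of the cost matrix. The C sketch below shows that recurrence on scalar features with an absolute-difference cost; it is a generic S-DTW illustration, not the MES-DTW system, which operates on acoustic frame features and adds further modifications.

```c
#include <stdio.h>
#include <stdlib.h>
#include <float.h>
#include <math.h>

/* Return the lowest accumulated cost of matching the whole query against
 * some subsequence of the collection, plus the end index of that match. */
static double subsequence_dtw(const double *q, int n,
                              const double *x, int m, int *end_idx)
{
    double *prev = malloc((size_t)(m + 1) * sizeof *prev);
    double *cur  = malloc((size_t)(m + 1) * sizeof *cur);
    for (int j = 0; j <= m; j++) prev[j] = 0.0;   /* free start anywhere in x */

    for (int i = 1; i <= n; i++) {
        cur[0] = DBL_MAX;                          /* query must start at q[0] */
        for (int j = 1; j <= m; j++) {
            double cost = fabs(q[i - 1] - x[j - 1]);
            double best = prev[j];                      /* insertion */
            if (cur[j - 1]  < best) best = cur[j - 1];  /* deletion  */
            if (prev[j - 1] < best) best = prev[j - 1]; /* match     */
            cur[j] = (best == DBL_MAX) ? DBL_MAX : cost + best;
        }
        double *tmp = prev; prev = cur; cur = tmp;  /* reuse the two rows */
    }

    double best = DBL_MAX;
    *end_idx = -1;
    for (int j = 1; j <= m; j++)
        if (prev[j] < best) { best = prev[j]; *end_idx = j - 1; }
    free(prev); free(cur);
    return best;
}

int main(void)
{
    double query[]      = { 1.0, 2.0, 3.0 };
    double collection[] = { 0.1, 0.9, 2.1, 3.0, 0.2, 0.1 };
    int end;
    double cost = subsequence_dtw(query, 3, collection, 6, &end);
    printf("best match ends at frame %d with cost %.2f\n", end, cost);
    return 0;
}
```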

Patent
14 Mar 2013
TL;DR: In this paper, the authors present a system for managing instances of virtual memory components for storing computer readable information for use by at least one first computing device, the system comprising a physical memory component, a computing processor component, an operating system, a virtual machine monitor, and virtual memory storage appliances.
Abstract: Systems, methods and devices for management of instances of virtual memory components for storing computer readable information for use by at least one first computing device, the system comprising at least one physical computing device, each physical computing device being communicatively coupled over a network and comprising: a physical memory component, a computing processor component, an operating system, a virtual machine monitor, and virtual memory storage appliances; at least one of the virtual memory storage appliances being configured to (a) accept memory instructions from the at least one first computing device, (b) instantiate instances of at least one virtual memory component, (c) allocate memory resources from at least one physical memory component for use by any one of the at least one virtual memory components, optionally according to a pre-defined policy; and (d) implement memory instructions on the at least one physical memory component.