Showing papers on "Memory management published in 2016"


Proceedings ArticleDOI
05 Jun 2016
TL;DR: This work proposes Pinatubo, a Processing In Non-volatile memory ArchiTecture for bUlk Bitwise Operations, which redesigns the read circuitry so that it can compute the bitwise logic of two or more memory rows very efficiently, and support one-step multi-row operations.
Abstract: Processing-in-memory (PIM) provides high bandwidth, massive parallelism, and high energy efficiency by implementing computations in main memory, thereby eliminating the overhead of data movement between CPU and memory. While most recent work has focused on PIM in DRAM memory with 3D die-stacking technology, we propose to leverage the unique features of emerging non-volatile memory (NVM), such as resistance-based storage and current sensing, to enable efficient PIM design in NVM. We propose Pinatubo, a Processing In Non-volatile memory ArchiTecture for bUlk Bitwise Operations. Instead of integrating complex logic inside the cost-sensitive memory, Pinatubo redesigns the read circuitry so that it can compute the bitwise logic of two or more memory rows very efficiently and support one-step multi-row operations. Experimental results on data-intensive graph processing and database applications show that Pinatubo achieves a ∼500× speedup and ∼28,000× energy saving on bitwise operations, and a 1.12× overall speedup and 1.11× overall energy saving over a conventional processor.
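
To make the sensing trick concrete, here is a minimal Python sketch of the idea (our simplification, not the paper's circuitry): activating several rows at once and comparing the aggregate cell current on each bitline against a reference threshold yields a bulk OR or AND directly at the sense amplifiers.

```python
# Toy model of Pinatubo-style one-step multi-row bitwise operations.
# The threshold model below is our illustration, not the paper's design.

def sense_or(rows):
    """OR: the reference is set so any single conducting cell trips the amp."""
    return [int(sum(col) >= 1) for col in zip(*rows)]

def sense_and(rows):
    """AND: the reference trips only if every activated cell conducts."""
    n = len(rows)
    return [int(sum(col) == n) for col in zip(*rows)]

row_a = [1, 0, 1, 1, 0, 0, 1, 0]
row_b = [1, 1, 0, 1, 0, 1, 1, 0]
row_c = [0, 1, 0, 1, 0, 0, 1, 1]

# All three rows are activated together; the result is latched at the sense
# amplifiers without any per-bit traffic to the CPU.
print(sense_or([row_a, row_b, row_c]))   # [1, 1, 1, 1, 0, 1, 1, 1]
print(sense_and([row_a, row_b, row_c]))  # [0, 0, 0, 1, 0, 0, 1, 0]
```

Only the reference threshold changes between operations, which is why the scheme avoids placing logic gates inside the cost-sensitive array.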

389 citations


Proceedings ArticleDOI
24 Oct 2016
TL;DR: It is shown that deterministic Rowhammer attacks are feasible on commodity mobile platforms and that they cannot be mitigated by current defenses, and the first Rowhammer-based Android root exploit is presented, relying on no software vulnerability, and requiring no user permissions.
Abstract: Recent work shows that the Rowhammer hardware bug can be used to craft powerful attacks and completely subvert a system. However, existing efforts either describe probabilistic (and thus unreliable) attacks or rely on special (and often unavailable) memory management features to place victim objects in vulnerable physical memory locations. Moreover, prior work only targets x86 and researchers have openly wondered whether Rowhammer attacks on other architectures, such as ARM, are even possible. We show that deterministic Rowhammer attacks are feasible on commodity mobile platforms and that they cannot be mitigated by current defenses. Rather than assuming special memory management features, our attack, DRAMMER, solely relies on the predictable memory reuse patterns of standard physical memory allocators. We implement DRAMMER on Android/ARM, demonstrating the practicability of our attack, but also discuss a generalization of our approach to other Linux-based platforms. Furthermore, we show that traditional x86-based Rowhammer exploitation techniques no longer work on mobile platforms and address the resulting challenges towards practical mobile Rowhammer attacks. To support our claims, we present the first Rowhammer-based Android root exploit relying on no software vulnerability, and requiring no user permissions. In addition, we present an analysis of several popular smartphones and find that many of them are susceptible to our DRAMMER attack. We conclude by discussing potential mitigation strategies and urging our community to address the concrete threat of faulty DRAM chips in widespread commodity platforms.

293 citations


Journal ArticleDOI
TL;DR: A survey of software techniques that have been proposed to exploit the advantages and mitigate the disadvantages of NVMs when used for designing memory systems, and, in particular, secondary storage and main memory.
Abstract: Non-volatile memory (NVM) devices, such as Flash, phase change RAM, spin transfer torque RAM, and resistive RAM, offer several advantages and challenges when compared to conventional memory technologies, such as DRAM and magnetic hard disk drives (HDDs). In this paper, we present a survey of software techniques that have been proposed to exploit the advantages and mitigate the disadvantages of NVMs when used for designing memory systems, and, in particular, secondary storage (e.g., solid state drive) and main memory. We classify these software techniques along several dimensions to highlight their similarities and differences. Given that NVMs are growing in popularity, we believe that this survey will motivate further research in the field of software technology for NVMs.

244 citations


Journal ArticleDOI
18 Jun 2016
TL;DR: Extensive evaluations across a variety of modern memory-intensive GPU workloads show that TOM significantly improves performance compared to a baseline GPU system that cannot offload computation to 3D-stacked memories.
Abstract: Main memory bandwidth is a critical bottleneck for modern GPU systems due to limited off-chip pin bandwidth. 3D-stacked memory architectures provide a promising opportunity to significantly alleviate this bottleneck by directly connecting a logic layer to the DRAM layers with high bandwidth connections. Recent work has shown promising potential performance benefits from an architecture that connects multiple such 3D-stacked memories and offloads bandwidth-intensive computations to a GPU in each of the logic layers. An unsolved key challenge in such a system is how to enable computation offloading and data mapping to multiple 3D-stacked memories without burdening the programmer, such that any application can transparently benefit from near-data processing capabilities in the logic layer. Our paper develops two new mechanisms to address this key challenge. First, a compiler-based technique that automatically identifies code to offload to a logic-layer GPU based on a simple cost-benefit analysis. Second, a software/hardware cooperative mechanism that predicts which memory pages will be accessed by offloaded code, and places those pages in the memory stack closest to the offloaded code, to minimize off-chip bandwidth consumption. We call the combination of these two programmer-transparent mechanisms TOM: Transparent Offloading and Mapping. Our extensive evaluations across a variety of modern memory-intensive GPU workloads show that, without requiring any program modification, TOM significantly improves performance (by 30% on average, and up to 76%) compared to a baseline GPU system that cannot offload computation to 3D-stacked memories.
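
The offload criterion can be pictured with a toy cost model (the function and constants below are our invention, not the paper's compiler analysis): offload a candidate block when the off-chip traffic it saves exceeds a fixed cost of shipping the computation.

```python
# Hypothetical sketch of a TOM-style cost-benefit offload test.

def should_offload(offchip_bytes_if_local: int,
                   offchip_bytes_if_offloaded: int,
                   launch_overhead_bytes: int = 4096) -> bool:
    """Offload when the off-chip traffic saved outweighs a fixed
    launch/communication overhead (the overhead value is made up)."""
    saved = offchip_bytes_if_local - offchip_bytes_if_offloaded
    return saved > launch_overhead_bytes

# A reduction that would stream 1 MB over the off-chip link but returns only
# a few bytes is a clear win; a tiny block is not worth the launch cost.
print(should_offload(1 << 20, 8))   # True
print(should_offload(1 << 10, 8))   # False
```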

234 citations


Proceedings ArticleDOI
01 Oct 2016
TL;DR: The In-Memory PoInter Chasing Accelerator (IMPICA) leverages the logic layer within 3D-stacked memory for linked data structure traversal and addresses the key challenges of how to achieve high parallelism in the presence of serial accesses in pointer chasing, and how to effectively perform virtual-to-physical address translation on the memory side without requiring expensive accesses to the CPU's memory management unit.
Abstract: Pointer chasing is a fundamental operation, used by many important data-intensive applications (e.g., databases, key-value stores, graph processing workloads) to traverse linked data structures. This operation is both memory bound and latency sensitive, as it (1) exhibits irregular access patterns that cause frequent cache and TLB misses, and (2) requires the data from every memory access to be sent back to the CPU to determine the next pointer to access. Our goal is to accelerate pointer chasing by performing it inside main memory, thereby avoiding inefficient and high-latency data transfers between main memory and the CPU. To this end, we propose the In-Memory PoInter Chasing Accelerator (IMPICA), which leverages the logic layer within 3D-stacked memory for linked data structure traversal.
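
A toy traversal makes the bottleneck visible (the node layout and addresses below are invented for illustration): each load's address comes out of the previous load, so the accesses cannot be overlapped, and this serial chain is exactly what IMPICA executes next to the DRAM instead of across the memory bus.

```python
# Why pointer chasing is latency-bound: every access depends on the last one.

memory = {            # address -> (key, value, next_address); our toy layout
    0x100: (3, "a", 0x240),
    0x240: (7, "b", 0x310),
    0x310: (9, "c", None),
}

def chase(head, target_key):
    """Serial traversal: on real hardware, each memory[addr] lookup is a
    potential cache/TLB miss whose result is needed to form the next one."""
    addr = head
    while addr is not None:
        key, value, nxt = memory[addr]
        if key == target_key:
            return value
        addr = nxt
    return None

print(chase(0x100, 9))   # "c" -- three dependent accesses, zero parallelism
```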

205 citations


Proceedings ArticleDOI
22 May 2016
TL;DR: This paper demonstrates that the deduplication side channel is much more powerful than previously assumed, potentially providing an attacker with a weird machine to read arbitrary data in the system and presents an end-to-end JavaScript-based attack against the new Microsoft Edge browser.
Abstract: Memory deduplication, a well-known technique to reduce the memory footprint across virtual machines, is now also a default-on feature inside the Windows 8.1 and Windows 10 operating systems. Deduplication maps multiple identical copies of a physical page onto a single shared copy with copy-on-write semantics. As a result, a write to such a shared page triggers a page fault and is thus measurably slower than a write to a normal page. Prior work has shown that an attacker able to craft pages on the target system can use this timing difference as a simple single-bit side channel to discover that certain pages exist in the system. In this paper, we demonstrate that the deduplication side channel is much more powerful than previously assumed, potentially providing an attacker with a weird machine to read arbitrary data in the system. We first show that an attacker controlling the alignment and reuse of data in memory is able to perform byte-by-byte disclosure of sensitive data (such as randomized 64 bit pointers). Next, even without control over data alignment or reuse, we show that an attacker can still disclose high-entropy randomized pointers using a birthday attack. To show these primitives are practical, we present an end-to-end JavaScript-based attack against the new Microsoft Edge browser, in the absence of software bugs and with all defenses turned on. Our attack combines our deduplication-based primitives with a reliable Rowhammer exploit to gain arbitrary memory read and write access in the browser. We conclude by extending our JavaScript-based attack to cross-process system-wide exploitation (using the popular nginx web server as an example) and discussing mitigation strategies.
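
The birthday-attack step can be sanity-checked with a short calculation (the entropy and trial counts below are illustrative, not the paper's measured parameters):

```python
# Birthday-style estimate for matching a randomized secret via deduplication:
# the attacker sprays `guesses` crafted pages while the victim exposes
# `victim_values` candidate secrets of `entropy_bits` unknown bits each.
import math

def collision_probability(entropy_bits, victim_values, guesses):
    """P(some guess equals some victim value), standard approximation."""
    space = 2.0 ** entropy_bits
    return 1.0 - math.exp(-victim_values * guesses / space)

# Brute-forcing a full 64-bit pointer with a single target is hopeless:
print(collision_probability(64, 1, 10_000))         # ~0.0
# With many victim values and many guesses over far fewer unknown bits,
# one deduplication pass becomes likely to score a hit:
print(collision_probability(28, 1 << 14, 1 << 14))  # ~0.63
```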

201 citations


Proceedings ArticleDOI
21 Mar 2016
TL;DR: A new hash function, Argon2, is presented, designed to protect low-entropy secrets without secret keys; it can provide ASIC- and botnet-resistance by filling memory at 0.6 cycles per byte in a non-compressible way.
Abstract: We present a new hash function, Argon2, designed to protect low-entropy secrets without secret keys. It requires a certain (but tunable) amount of memory, imposes prohibitive time-memory and computation-memory tradeoffs on memory-saving users, and is exceptionally fast on a regular PC. Overall, it can provide ASIC- and botnet-resistance by filling memory at 0.6 cycles per byte in a non-compressible way.
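
For readers who want to experiment, below is a minimal usage sketch with the third-party argon2-cffi Python binding (the package choice and the parameter values are ours, not recommendations from the paper):

```python
# pip install argon2-cffi
from argon2 import PasswordHasher

ph = PasswordHasher(
    time_cost=3,             # passes over memory
    memory_cost=64 * 1024,   # KiB of memory an attacker must also pay for
    parallelism=4,           # lanes, exploiting multicore CPUs
)

encoded = ph.hash("correct horse battery staple")
print(encoded)   # e.g. $argon2id$v=19$m=65536,t=3,p=4$...
print(ph.verify(encoded, "correct horse battery staple"))  # True
```

Raising memory_cost is the knob that imposes the time-memory tradeoff on GPU and ASIC attackers.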

190 citations


Proceedings ArticleDOI
Minsoo Rhu, Natalia Gimelshein, Jason Clemons, Arslan Zulfiqar, Stephen W. Keckler
15 Oct 2016
TL;DR: In this article, the authors propose a runtime memory manager that virtualizes the memory usage of DNNs such that both GPU and CPU memory can simultaneously be utilized for training larger DNNs.
Abstract: The most widely used machine learning frameworks require users to carefully tune their memory usage so that the deep neural network (DNN) fits into the DRAM capacity of a GPU. This restriction hampers researchers' flexibility to study different machine learning algorithms, forcing them either to use a less desirable network architecture or to parallelize processing across multiple GPUs. We propose a runtime memory manager that virtualizes the memory usage of DNNs such that both GPU and CPU memory can simultaneously be utilized for training larger DNNs. Our virtualized DNN (vDNN) reduces the average GPU memory usage of AlexNet by up to 89%, OverFeat by 91%, and GoogLeNet by 95%, a significant reduction in the memory requirements of DNNs. Similar experiments on VGG-16, one of the deepest and most memory-hungry DNNs to date, demonstrate the memory-efficiency of our proposal. vDNN enables VGG-16 with batch size 256 (requiring 28 GB of memory) to be trained on a single NVIDIA Titan X GPU card containing 12 GB of memory, with an 18% performance loss compared to a hypothetical, oracular GPU with enough memory to hold the entire DNN.
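
The scheduling idea can be sketched in a few lines (a pure-Python model; the real system drives asynchronous CUDA copies, and every name below is ours): offload a layer's activations to host memory once the next layer has consumed them, and prefetch them back just before the corresponding backward step.

```python
# Conceptual vDNN-style activation offload/prefetch (toy model only).

gpu, host = {}, {}

def forward(layers):
    for i, layer in enumerate(layers):
        gpu[i] = f"activations_of_{layer}"
        if i > 0:
            host[i - 1] = gpu.pop(i - 1)   # consumed: offload to host RAM

def backward(layers):
    for i in reversed(range(len(layers))):
        if i > 0:
            gpu[i - 1] = host.pop(i - 1)   # prefetch for the next step,
        gpu.pop(i, None)                   # overlapping copy with compute

layers = ["conv1", "conv2", "conv3", "fc"]
forward(layers)
print(sorted(gpu), sorted(host))   # [3] [0, 1, 2]: one layer resident on GPU
backward(layers)
print(gpu, host)                   # {} {}: everything released
```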

178 citations


Proceedings ArticleDOI
19 Oct 2016
TL;DR: This paper presents Makalu, a system that addresses non-volatile memory management and offers an integrated allocator and recovery-time garbage collector that maintains internal consistency, avoids NVRAM memory leaks, and is efficient, all in the face of failures.
Abstract: Byte addressable non-volatile memory (NVRAM) is likely to supplement, and perhaps eventually replace, DRAM. Applications can then persist data structures directly in memory instead of serializing them and storing them onto a durable block device. However, failures during execution can leave data structures in NVRAM unreachable or corrupt. In this paper, we present Makalu, a system that addresses non-volatile memory management. Makalu offers an integrated allocator and recovery-time garbage collector that maintains internal consistency, avoids NVRAM memory leaks, and is efficient, all in the face of failures. We show that a careful allocator design can support a less restrictive and a much more familiar programming model than existing persistent memory allocators. Our allocator significantly reduces the per allocation persistence overhead by lazily persisting non-essential metadata and by employing a post-failure recovery-time garbage collector. Experimental results show that the resulting online speed and scalability of our allocator are comparable to well-known transient allocators, and significantly better than state-of-the-art persistent allocators.
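
The recovery-time collection reduces to an ordinary mark-and-sweep over the persistent heap image, rooted at the durable root set (a toy model of the concept below, not Makalu's implementation):

```python
# After a crash, blocks unreachable from persistent roots are garbage, so the
# allocator never had to durably log every allocation's liveness metadata.

heap = {                 # block_id -> block_ids it points to
    "root_table": ["a", "c"],
    "a": ["b"],
    "b": [],
    "c": [],
    "leaked": ["b"],     # allocated pre-crash, never linked to a root
}

def recover(roots):
    reachable, stack = set(), list(roots)
    while stack:                      # mark phase
        blk = stack.pop()
        if blk not in reachable:
            reachable.add(blk)
            stack.extend(heap[blk])
    for blk in list(heap):            # sweep phase: reclaim the rest
        if blk not in reachable:
            del heap[blk]

recover(["root_table"])
print(sorted(heap))   # ['a', 'b', 'c', 'root_table'] -- 'leaked' reclaimed
```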

126 citations


Proceedings ArticleDOI
24 Oct 2016
TL;DR: CacheBar, as presented in this paper, is a software approach to mitigate access-driven side-channel attacks that leverage last-level caches (LLCs) shared across cores to leak information between security domains (e.g., tenants in a cloud).
Abstract: We present a software approach to mitigate access-driven side-channel attacks that leverage last-level caches (LLCs) shared across cores to leak information between security domains (e.g., tenants in a cloud). Our approach dynamically manages physical memory pages shared between security domains to disable sharing of LLC lines, thus preventing "Flush-Reload" side channels via LLCs. It also manages cacheability of memory pages to thwart cross-tenant "Prime-Probe" attacks in LLCs. We have implemented our approach as a memory management subsystem called CacheBar within the Linux kernel to intervene on such side channels across container boundaries, as containers are a common method for enforcing tenant isolation in Platform-as-a-Service (PaaS) clouds. Through formal verification, principled analysis, and empirical evaluation, we show that CacheBar achieves strong security with small performance overheads for PaaS workloads.

Proceedings ArticleDOI
07 Nov 2016
TL;DR: OpenRAM is introduced, an open-source memory compiler that provides a platform for the generation, characterization, and verification of fabricable memory designs across various technologies, sizes, and configurations and enables research in computer architecture, system-on-chip design, memory circuit and device research, and computer-aided design.
Abstract: Computer systems research is often inhibited by the availability of memory designs. Existing Process Design Kits (PDKs) frequently lack memory compilers, while expensive commercial solutions only provide memory models with immutable cells, limited configurations, and restrictive licenses. Manually creating memories can be time-consuming and tedious, and the designs are usually inflexible. This paper introduces OpenRAM, an open-source memory compiler that provides a platform for the generation, characterization, and verification of fabricable memory designs across various technologies, sizes, and configurations. It enables research in computer architecture, system-on-chip design, memory circuit and device research, and computer-aided design.
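
OpenRAM is written in Python and driven by a Python configuration file; the sketch below follows the shape of its documented examples (parameter names should be checked against the project's current documentation before use):

```python
# myconfig.py -- illustrative OpenRAM configuration (values are arbitrary).
word_size = 32                  # bits per word
num_words = 1024                # depth: a 4 KB SRAM macro
tech_name = "scn4m_subm"        # the SCMOS demo technology
output_path = "temp"
output_name = f"sram_{word_size}x{num_words}"
```

The compiler is then invoked on the configuration (e.g., `python openram.py myconfig`) to generate the layout, netlist, and characterization views.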

Proceedings ArticleDOI
Tianhao Zheng, David Nellans, Arslan Zulfiqar, Mark Stephenson, Stephen W. Keckler
12 Mar 2016
TL;DR: Without modifying the GPU execution pipeline, it is shown that it is possible to largely hide the performance overheads of GPU paged memory, converting an average 2× slowdown into a 12% speedup when compared to programmer-directed transfers.
Abstract: Despite industrial investment in both on-die GPUs and next generation interconnects, the highest performing parallel accelerators shipping today continue to be discrete GPUs. Connected via PCIe, these GPUs utilize their own privately managed physical memory that is optimized for high bandwidth. These separate memories force GPU programmers to manage the movement of data between the CPU and GPU, in addition to the on-chip GPU memory hierarchy. To simplify this process, GPU vendors are developing software runtimes that automatically page memory in and out of the GPU on-demand, reducing programmer effort and enabling computation across datasets that exceed the GPU memory capacity. Because this memory migration occurs over a high latency and low bandwidth link (compared to GPU memory), these software runtimes may result in significant performance penalties. In this work, we explore the features needed in GPU hardware and software to close the performance gap of GPU paged memory versus legacy programmer-directed memory management. Without modifying the GPU execution pipeline, we show it is possible to largely hide the performance overheads of GPU paged memory, converting an average 2× slowdown into a 12% speedup when compared to programmer-directed transfers. Additionally, we examine the performance impact that GPU memory oversubscription has on application run times, enabling application designers to make informed decisions on how to shard their datasets across hosts and GPU instances.
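
A back-of-the-envelope model shows why demand faulting is costly and why prefetching closes the gap (all constants below are invented; only the resulting ratio is meant to echo the roughly 2× slowdown reported for naive paged memory):

```python
# Toy latency model: demand faulting serializes PCIe transfers with
# execution, while prefetching overlaps them with compute.

PAGE_FAULT_US = 9.0   # demand fault: stall + serialized transfer (made up)
PREFETCH_US   = 2.0   # amortized cost of an overlapped copy (made up)
COMPUTE_US    = 5.0   # work per page of data (made up)

def runtime_us(pages: int, prefetch: bool) -> float:
    per_page = COMPUTE_US + (PREFETCH_US if prefetch else PAGE_FAULT_US)
    return pages * per_page

pages = 10_000
print(runtime_us(pages, False) / runtime_us(pages, True))   # 2.0
```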

Journal ArticleDOI
TL;DR: This paper looks at the state-of-the-art TinyOS and the different dimensions of its design paradigm, programming model, execution model, scheduling algorithms, concurrency, memory management, hardware support platforms, and other features.
Abstract: The wireless sensor network (WSN) is an active area for modern-day research groups. Tiny sensor nodes are deployed in diverse environments, but with limited resources. These scarce resources compel researchers to employ an operating system that requires little memory and minimal power. Tiny operating system (TinyOS) is a widely used operating system for sensor nodes, which provides concurrency and flexibility while adhering to the constraints of scarce resources. Comparatively, TinyOS is considered to be the most robust, innovative, energy-efficient, and widely used operating system in sensor networks. This paper looks at the state-of-the-art TinyOS and the different dimensions of its design paradigm, programming model, execution model, scheduling algorithms, concurrency, memory management, hardware support platforms, and other features. The addition of different features makes TinyOS the operating system of choice for WSNs. Sensing nodes running TinyOS show more flexibility in supporting diverse types of sensing applications.

Journal ArticleDOI
TL;DR: Some of the novel emerging memory technologies and how they can enable energy-efficient implementation of large neuromorphic computing systems and the emerging trends and challenges in the path towards successful implementations of large learning systems that could be ubiquitously deployed for a wide variety of cognitive computing tasks are outlined.
Abstract: In this paper, we review some of the novel emerging memory technologies and how they can enable energy-efficient implementation of large neuromorphic computing systems. We will highlight some of the key aspects of biological computation that are being mimicked in these novel nanoscale devices, and discuss various strategies employed to implement them efficiently. Though large scale learning systems have not been implemented using these devices yet, we will discuss the ideal specifications and metrics to be satisfied by these devices based on theoretical estimations and simulations. We also outline the emerging trends and challenges in the path towards successful implementations of large learning systems that could be ubiquitously deployed for a wide variety of cognitive computing tasks.

Journal ArticleDOI
TL;DR: A survey of techniques for using compression in cache and main memory systems, classifying the techniques based on key parameters to highlight their similarities and differences.
Abstract: As the number of cores on a chip increases and key applications become even more data-intensive, memory systems in modern processors have to deal with increasingly large amounts of data. In the face of such challenges, data compression is a promising approach to increase effective memory system capacity and also provide performance and energy advantages. This paper presents a survey of techniques for using compression in cache and main memory systems. It also classifies the techniques based on key parameters to highlight their similarities and differences. It discusses compression in CPUs and GPUs, conventional and non-volatile memory (NVM) systems, and 2D and 3D memory systems. We hope that this survey will help researchers gain insight into the potential role of compression in the memory components of future extreme-scale systems.

Journal ArticleDOI
TL;DR: An improved design of Newcache is presented, in terms of security, circuit design and simplicity, and it is shown that Newcache can be used as L1 data and instruction caches to improve security without impacting performance.
Abstract: Newcache is a secure cache that can thwart cache side-channel attacks to prevent the leakage of secret information. All caches today are susceptible to cache side-channel attacks, despite software isolation of memory pages in virtual address spaces or virtual machines. These cache attacks can leak secret encryption keys or private identity keys, nullifying any protection provided by strong cryptography. Newcache uses a novel dynamic, randomized memory-to-cache mapping to thwart contention-based side-channel attacks, rather than the static mapping used by conventional set-associative caches. In this article, the authors present an improved design of Newcache, in terms of security, circuit design, and simplicity. They show Newcache's security against a suite of cache side-channel attacks. They evaluate Newcache's system performance for cloud computing, smartphone, and SPEC benchmarks and find that Newcache performs as well as conventional set-associative caches, and sometimes better. They also designed a VLSI test chip with a 32-Kbyte Newcache and a 32-Kbyte, eight-way, set-associative cache and verified that the access latency, power, and area of the two caches are comparable. These results show that Newcache can be used as L1 data and instruction caches to improve security without impacting performance.
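
The contrast with a conventional cache can be shown in miniature (our abstraction of the mapping idea only; the real design uses dedicated index-remapping hardware):

```python
# Static set indexing vs. a Newcache-style dynamic randomized mapping.
import random

NUM_SETS, LINE_BITS = 64, 6

def static_index(addr: int) -> int:
    """Conventional cache: the set is a fixed function of the address, so
    observing contention tells an attacker which addresses were touched."""
    return (addr >> LINE_BITS) % NUM_SETS

_mapping = {}   # line -> randomized slot, assigned on first touch

def randomized_index(addr: int) -> int:
    """Randomized mapping: slot assignment is dynamic and secret, so
    contention no longer reveals the victim's addresses."""
    line = addr >> LINE_BITS
    if line not in _mapping:
        _mapping[line] = random.randrange(NUM_SETS)
    return _mapping[line]

addr = 0x7F3ABC40
print(static_index(addr))       # the same on every run
print(randomized_index(addr))   # differs from run to run
```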

Proceedings ArticleDOI
14 Mar 2016
TL;DR: A resistive configurable associative memory (ReCAM) that enables selective approximation and asymmetric voltage overscaling to manage delivered efficiency and reduce energy.
Abstract: Modern computing machines are increasingly characterized by large-scale parallelism in hardware (such as GPGPUs) and the advent of large-scale and innovative memory blocks. Parallelism enables expanded performance tradeoffs, whereas memories enable reuse of computational work. To be effective, however, one needs to ensure energy efficiency with minimal reuse overheads. In this paper, we describe a resistive configurable associative memory (ReCAM) that enables selective approximation and asymmetric voltage overscaling to manage delivered efficiency. The ReCAM structure matches an input pattern with pre-stored ones by applying an approximate search on selected bit indices (bitline-configurable) or selected pre-stored patterns (row-configurable). To further reduce energy, we explore proper ReCAM sizing, various configurable search operations with low-overhead voltage overscaling, and different ReCAM update policies. Experimental results on the AMD Southern Islands GPU for eight applications show that bitline-configurable and row-configurable ReCAM achieve on average 43.6% and 44.5% energy savings, respectively, with an acceptable quality loss of 10%.
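
The bitline-configurable search reduces to masked matching (a functional toy model; the energy effects of disabling bitlines obviously do not appear in Python):

```python
# Approximate associative search over selected bit indices.

patterns = [0b1011_0110, 0b1011_1111, 0b0100_0110, 0b1111_0000]

def approximate_search(query: int, mask: int):
    """Rows matching `query` on the masked bit positions. Full precision is
    mask = 0xFF; dropping low-order bitlines approximates the match and, in
    ReCAM, skips charging those lines."""
    return [i for i, p in enumerate(patterns) if (p ^ query) & mask == 0]

query = 0b1011_0111
print(approximate_search(query, 0b1111_1111))  # []     : no exact match
print(approximate_search(query, 0b1111_0000))  # [0, 1] : top-4-bit matches
```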

Proceedings ArticleDOI
17 Apr 2016
TL;DR: This work analyzes, using real-system measurements, shared virtual memory across the CPU and an integrated GPU, and presents a detailed measurement study of a commercially available integrated APU that illustrates these effects and motivates future research opportunities.
Abstract: Computing is becoming increasingly heterogeneous with accelerators like GPUs being tightly integrated with CPUs on the same die. Extending the CPU's virtual addressing mechanism to these accelerators is a key step in making accelerators easily programmable. In this work, we analyze, using real-system measurements, shared virtual memory across the CPU and an integrated GPU. We make several key observations and highlight consequent research opportunities: (1) servicing a TLB miss from the GPU can be an order of magnitude slower than that from the CPU and consequently it is imperative to enable many concurrent TLB misses to hide this larger latency; (2) divergence in memory accesses impacts the GPU's address translation more than the rest of the memory hierarchy, and research in designing address translation mechanisms tolerant to this effect is imperative; and (3) page faults from the GPU are considerably slower than that from the CPU and software-hardware co-design is essential for efficient implementation of page faults from throughput-oriented accelerators like GPUs. We present a detailed measurement study of a commercially available integrated APU that illustrates these effects and motivates future research opportunities.

Proceedings ArticleDOI
14 Mar 2016
TL;DR: In this paper, the authors propose a highly modular software-defined architecture for the next-generation datacentre, where SoC-based microservers, memory modules and accelerators are placed in separate modular server trays interconnected via a high-speed, low-latency opto-electronic system fabric, and can be allocated in arbitrary sets, as driven by fit-for-purpose resource/power management software.
Abstract: For quite some time now, computing servers, whether low-power or high-end designs, have been created around a common design principle: the main board and its hardware components form a baseline, monolithic building block that the rest of the hardware/software stack design builds upon. This proportionality of compute/memory/network/storage resources is fixed at design time and remains static throughout machine lifetime, with known ramifications in terms of low system resource utilization, costly upgrade cycles, and degraded energy proportionality. dReDBox takes on the challenge of revolutionizing the low-power computing market by breaking server boundaries through materialization of the concept of disaggregation. Besides proposing a highly modular software-defined architecture for the next-generation datacentre, dReDBox will specify, design, and prototype a novel hardware architecture in which SoC-based microservers, memory modules, and accelerators are placed in separate modular server trays interconnected via a high-speed, low-latency opto-electronic system fabric, and can be allocated in arbitrary sets, as driven by fit-for-purpose resource/power management software. These blocks will employ state-of-the-art low-power components and be amenable to deployment in various integration form factors and target scenarios. dReDBox aims to deliver a full-fledged, vertically integrated datacentre-in-a-box prototype to showcase the superiority of disaggregation in terms of scalability, efficiency, reliability, performance, and energy reduction, which will be demonstrated in three pilot use-cases.

Journal ArticleDOI
TL;DR: The Blacklisting Memory Scheduler (BLISS) is proposed, which achieves high system performance and fairness while incurring low hardware cost and complexity, and is based on two new observations.
Abstract: In a multicore system, applications running on different cores interfere at main memory. This inter-application interference degrades overall system performance and unfairly slows down applications. Prior works have developed application-aware memory request schedulers to tackle this problem. State-of-the-art application-aware memory request schedulers prioritize memory requests of applications that are vulnerable to interference, by ranking individual applications based on their memory access characteristics and enforcing a total rank order. In this paper, we observe that state-of-the-art application-aware memory schedulers have two major shortcomings. First, such schedulers trade off hardware complexity in order to achieve high performance or fairness, since ranking applications individually with a total order based on memory access characteristics leads to high hardware cost and complexity. Such complexity could prevent the scheduler from meeting the stringent timing requirements of state-of-the-art DDR protocols. Second, ranking can unfairly slow down applications that are at the bottom of the ranking stack, thereby sometimes leading to high slowdowns and low overall system performance. To overcome these shortcomings, we propose the Blacklisting Memory Scheduler (BLISS), which achieves high system performance and fairness while incurring low hardware cost and complexity. The BLISS design is based on two new observations. First, we find that, to mitigate interference, it is sufficient to separate applications into only two groups, one containing applications that are vulnerable to interference and another containing applications that cause interference, instead of ranking individual applications with a total order. The vulnerable-to-interference group is prioritized over the interference-causing group. Second, we show that this grouping can be efficiently performed by simply counting the number of consecutive requests served from each application. We evaluate BLISS across a wide variety of workloads and system configurations and compare its performance and hardware complexity (via RTL implementations) with five state-of-the-art memory schedulers. Our evaluations show that BLISS achieves 5 percent better system performance and 25 percent better fairness than the best-performing previous memory scheduler while greatly reducing the critical path latency and hardware area cost of the memory scheduler (by 79 and 43 percent, respectively), thereby achieving a good trade-off between performance, fairness, and hardware complexity.
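
The core grouping mechanism is simple enough to sketch directly (the threshold and the clearing policy below are our placeholders; the paper evaluates concrete settings):

```python
# BLISS in miniature: an application whose requests are served too many
# times back-to-back is flagged as interference-causing and deprioritized.

BLACKLIST_THRESHOLD = 4   # placeholder value

class Blacklister:
    def __init__(self):
        self.last_app, self.streak = None, 0
        self.blacklisted = set()

    def on_request_served(self, app_id):
        if app_id == self.last_app:
            self.streak += 1
            if self.streak >= BLACKLIST_THRESHOLD:
                self.blacklisted.add(app_id)   # cleared periodically in the
        else:                                  # real scheme, to avoid starvation
            self.last_app, self.streak = app_id, 1

    def priority(self, app_id):
        return 0 if app_id in self.blacklisted else 1   # higher wins

b = Blacklister()
for app in ["A", "A", "A", "A", "B"]:   # A streams; B is vulnerable
    b.on_request_served(app)
print(b.priority("A"), b.priority("B"))  # 0 1 -> B now wins arbitration
```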

Proceedings ArticleDOI
13 Nov 2016
TL;DR: In this article, the memory efficiency of various CNN layers is studied, revealing the performance implications of both data layouts and memory access patterns, with speedups of up to 27.9× for a single layer and up to 5.6× on whole networks.
Abstract: Leveraging large data sets, deep Convolutional Neural Networks (CNNs) achieve state-of-the-art recognition accuracy. Due to their substantial compute and memory operations, however, they require significant execution time. The massive parallel computing capability of GPUs makes them one of the ideal platforms to accelerate CNNs, and a number of GPU-based CNN libraries have been developed. While existing work mainly focuses on the computational efficiency of CNNs, their memory efficiency has been largely overlooked. Yet CNNs have intricate data structures, and their memory behavior can have a significant impact on performance. In this work, we study the memory efficiency of various CNN layers and reveal the performance implications of both data layouts and memory access patterns. Experiments show the universal effect of our proposed optimizations on both single layers and various networks, with speedups of up to 27.9× for a single layer and up to 5.6× on whole networks.

Proceedings ArticleDOI
22 May 2016
TL;DR: Shreds, a set of OS-backed programming primitives that addresses developers' currently unmet needs for fine-grained, convenient, and efficient protection of sensitive memory content against in-process adversaries, is proposed.
Abstract: Once attackers have injected code into a victim program's address space, or found a memory disclosure vulnerability, all sensitive data and code inside that address space are subject to theft or manipulation. Unfortunately, this broad type of attack is hard to prevent, even if software developers wish to cooperate, mostly because the conventional memory protection only works at process level and previously proposed in-process memory isolation methods are not practical for wide adoption. We propose shreds, a set of OS-backed programming primitives that addresses developers' currently unmet needs for fine-grained, convenient, and efficient protection of sensitive memory content against in-process adversaries. A shred can be viewed as a flexibly defined segment of a thread execution (hence the name). Each shred is associated with a protected memory pool, which is accessible only to code running in the shred. Unlike previous works, shreds offer in-process private memory without relying on separate page tables, nested paging, or even modified hardware. Plus, shreds provide the essential data flow and control flow guarantees for running sensitive code. We have built the compiler toolchain and the OS module that together enable shreds on Linux. We demonstrated the usage of shreds and evaluated their performance using five non-trivial open-source programs, including OpenSSH and Lighttpd. The results show that shreds are fairly easy to use and incur low runtime overhead (4.67%).
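
Conceptually, a shred behaves like a scoped handle to a private memory pool. The Python rendering below is only a mental model (the real primitives are C APIs enforced by a compiler toolchain and kernel module; every name here is ours):

```python
# Mental model of the shred abstraction: code inside the scope can touch the
# pool; nothing outside it -- not even other threads in the process -- can.
from contextlib import contextmanager

_pools = {}   # pool_id -> private dict standing in for protected memory

@contextmanager
def shred(pool_id):
    pool = _pools.setdefault(pool_id, {})
    try:
        yield pool    # real shreds: the kernel unlocks the pool's pages here
    finally:
        pass          # real shreds: the pool is locked again on exit

with shred("tls_keys") as pool:
    pool["priv_key"] = b"secret-key-material"   # lives only in the pool

# Outside the scope there is no handle to the pool; the real system enforces
# this in hardware/kernel even against injected code, not by convention.
```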

Journal ArticleDOI
Shay Gueron
01 Nov 2016
TL;DR: The MEE is a successful feat of real-world cryptographic engineering: it's the first time such cryptographic memory protection has been added to a widely deployed general-purpose processor.
Abstract: Intel's Software Guard Extensions allows general-purpose computing platforms to run software in a trustworthy manner and securely handle encrypted data. To satisfy the technology's security goals, the external system memory must be cryptographically protected. A new hardware unit added to the processor's memory controller--the Memory Encryption Engine (MEE)--was recently developed to protect the confidentiality, integrity, and freshness of this external memory traffic, against eavesdropping and tampering. The MEE is a successful feat of real-world cryptographic engineering: it's the first time such cryptographic memory protection has been added to a widely deployed general-purpose processor.

Proceedings ArticleDOI
14 Jun 2016
TL;DR: This work exposes available memory on remote servers using a lightweight file API that allows an SMP RDBMS to leverage the benefits of remote memory with modest changes, and implements several novel scenarios to demonstrate these benefits.
Abstract: Memory is a crucial resource in relational databases (RDBMSs). When there is insufficient memory, RDBMSs are forced to use slower media such as SSDs or HDDs, which can significantly degrade workload performance. Cloud database services are deployed in data centers where network adapters supporting remote direct memory access (RDMA) at low latency and high bandwidth are becoming prevalent. We study the novel problem of how a Symmetric Multi-Processing (SMP) RDBMS, whose memory demands exceed locally-available memory, can leverage available remote memory in the cluster accessed via RDMA to improve query performance. We expose available memory on remote servers using a lightweight file API that allows an SMP RDBMS to leverage the benefits of remote memory with modest changes. We identify and implement several novel scenarios to demonstrate these benefits, and address design challenges that are crucial for efficient implementation. We implemented the scenarios in the Microsoft SQL Server engine and present the first end-to-end study to demonstrate benefits of remote memory for a variety of micro-benchmarks and industry-standard benchmarks. Compared to using disks when memory is insufficient, we improve the throughput and latency of queries with short reads and writes by 3× to 10×, while improving the latency of multiple TPC-H and TPC-DS queries by 2× to 100×.
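
The "remote memory behind a file API" idea can be mocked up locally (a plain local file stands in for the RDMA-backed region; the class and method names are ours):

```python
# Toy stand-in: the buffer manager spills cold pages to a "remote memory
# file" and pays network latency instead of storage latency when it reads
# them back. Here an ordinary file plays the remote region's role.

PAGE = 8192

class RemoteMemoryFile:
    def __init__(self, path):
        self.f = open(path, "w+b")   # real system: RDMA-registered memory

    def write_page(self, page_no: int, data: bytes):
        assert len(data) == PAGE
        self.f.seek(page_no * PAGE)
        self.f.write(data)

    def read_page(self, page_no: int) -> bytes:
        self.f.seek(page_no * PAGE)
        return self.f.read(PAGE)

rm = RemoteMemoryFile("remote_mem.bin")
rm.write_page(7, b"x" * PAGE)
print(len(rm.read_page(7)))   # 8192
```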

Journal ArticleDOI
TL;DR: The analysis and numerical evaluation suggest that the proposed Replisom system has significant potential in reducing the delay, energy consumption, and cost for cloud offloading of IoT applications given the massive number of devices with tiny memory sizes.
Abstract: Augmenting the long-term evolution (LTE)-evolved NodeB (eNB) with cloud resources offers a low-latency, resilient, and LTE-aware environment for offloading the Internet of Things (IoT) services and applications. By means of device memory replication, the IoT applications deployed at an LTE-integrated edge cloud can scale their computing and storage requirements to support different resource-intensive service offerings. Despite this potential, the massive number of IoT devices limits the LTE edge cloud's responsiveness, as the LTE radio interface becomes the major bottleneck given the unscalability of its uplink access and data transfer procedures to support a large number of devices that simultaneously replicate their memory objects with the LTE edge cloud. We propose Replisom, an LTE-aware edge cloud architecture and an LTE-optimized memory replication protocol that relaxes the LTE bottlenecks with a delay- and radio-resource-efficient memory replication protocol based on device-to-device communication technology and sparse recovery from the theory of compressed sampling. Replisom effectively schedules the memory replication occasions to resolve contention for the radio resources as a large number of devices simultaneously transmit their memory replicas. Our analysis and numerical evaluation suggest that this system has significant potential in reducing the delay, energy consumption, and cost for cloud offloading of IoT applications given the massive number of devices with tiny memory sizes.

Journal ArticleDOI
Ramzi Mahmoudi, Mohamed Akil
TL;DR: In this paper, the authors present an enhanced computation method for smoothing 2D objects in the binary case, which provides parallel computation and better memory management, while preserving the topology of the original image by using homotopic transformations defined in the framework of digital topology.
Abstract: To prepare images for better segmentation, we need preprocessing applications, such as smoothing, to reduce noise. In this paper, we present an enhanced computation method for smoothing 2D objects in the binary case. Unlike existing approaches, the proposed method provides parallel computation and better memory management, while preserving the topology (number of connected components) of the original image by using homotopic transformations defined in the framework of digital topology. We introduce an adapted parallelization strategy called split, distribute, and merge (SDM), which allows efficient parallelization of a large class of topological operators. To achieve a good speedup and better memory allocation, we paid attention to task scheduling and management. The distributed work during the smoothing process is done by a variable number of threads. Tests on a 2D grayscale image (512×512), using a shared-memory parallel machine (SMPM) with 8 CPU cores (two Xeon E5405 processors running at 2 GHz), showed a speedup of 5.2 with a cache success rate of 70%.
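
The SDM strategy itself is a generic split/distribute/merge skeleton, sketched below with a placeholder smoothing operator (the paper's homotopic operators and band-border handling are substantially more involved):

```python
# Split the image into horizontal bands, process bands in parallel workers,
# then merge the results back in order.
from multiprocessing import Pool

def smooth_band(band):
    """Placeholder for a smoothing pass over one band."""
    return [[min(v, 1) for v in row] for row in band]

def sdm_smooth(image, workers=8):
    step = max(1, len(image) // workers)
    bands = [image[i:i + step] for i in range(0, len(image), step)]  # split
    with Pool(workers) as pool:
        done = pool.map(smooth_band, bands)                          # distribute
    return [row for band in done for row in band]                    # merge

if __name__ == "__main__":
    img = [[1] * 512 for _ in range(512)]
    out = sdm_smooth(img)
    print(len(out), len(out[0]))   # 512 512
```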

Proceedings ArticleDOI
Luna Xu, Min Li, Li Zhang, Ali R. Butt, Yandong Wang, Zane Zhenhua Hu
23 May 2016
TL;DR: MEMTUNE dynamically tunes computation/caching memory partitions at runtime based on workload memory demand and in-memory data cache needs, and, if needed, leverages scheduling information from the analytic framework to evict data that will not be needed in the near future.
Abstract: Memory is a crucial resource for big data processing frameworks such as Spark and M3R, where the memory is used both for computation and for caching intermediate storage data. Consequently, optimizing memory is the key to extracting high performance. The extant approach is to statically split the memory for computation and caching based on workload profiling. This approach is unable to capture the varying workload characteristics and dynamic memory demands. Another factor that affects caching efficiency is the choice of data placement and eviction policy. The extant LRU policy is oblivious of task scheduling information from the analytic frameworks, and thus can lead to lost optimization opportunities. In this paper, we address the above issues by designing MEMTUNE, a dynamic memory manager for in-memory data analytics. MEMTUNE dynamically tunes computation/caching memory partitions at runtime based on workload memory demand and in-memory data cache needs. Moreover, if needed, the scheduling information from the analytic framework is leveraged to evict data that will not be needed in the near future. Finally, MEMTUNE also supports task-level data prefetching with a configurable window size to more effectively overlap computation with I/O. Our experiments show that MEMTUNE improves memory utilization, yields an overall performance gain of up to 46%, and achieves a cache hit ratio of up to 41% compared to standard Spark.
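
The tuning loop can be sketched as a simple feedback controller (the signals, step size, and thresholds below are hypothetical):

```python
# Shift memory between the execution and cache partitions each interval,
# based on which side showed pressure.

STEP_GB = 1   # hypothetical adjustment granularity

def retune(cache_gb, exec_gb, spill_events, cache_miss_ratio):
    """Grow execution memory when tasks spill; grow the cache when
    computation is comfortable but the cache thrashes."""
    if spill_events > 0 and cache_gb > STEP_GB:
        cache_gb, exec_gb = cache_gb - STEP_GB, exec_gb + STEP_GB
    elif spill_events == 0 and cache_miss_ratio > 0.3 and exec_gb > STEP_GB:
        cache_gb, exec_gb = cache_gb + STEP_GB, exec_gb - STEP_GB
    return cache_gb, exec_gb

split = (16, 16)   # start from an even split of 32 GB
for spills, misses in [(3, 0.1), (1, 0.1), (0, 0.5), (0, 0.1)]:
    split = retune(*split, spills, misses)
    print(split)   # (15, 17) (14, 18) (15, 17) (15, 17)
```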

Proceedings ArticleDOI
02 Jun 2016
TL;DR: In this paper, the authors present a methodology for designing applications in a way that enables certifying their confidentiality, which consists of forcing the application to communicate with the external world through a narrow interface, compiling it with runtime checks that aid verification, and linking it with a small runtime that implements the narrow interface.
Abstract: Hardware support for isolated execution (such as Intel SGX) enables development of applications that keep their code and data confidential even while running in a hostile or compromised host. However, automatically verifying that such applications satisfy confidentiality remains challenging. We present a methodology for designing such applications in a way that enables certifying their confidentiality. Our methodology consists of forcing the application to communicate with the external world through a narrow interface, compiling it with runtime checks that aid verification, and linking it with a small runtime that implements the narrow interface. The runtime includes services such as secure communication channels and memory management. We formalize this restriction on the application as Information Release Confinement (IRC), and we show that it allows us to decompose the task of proving confidentiality into (a) one-time, human-assisted functional verification of the runtime to ensure that it does not leak secrets, (b) automatic verification of the application's machine code to ensure that it satisfies IRC and does not directly read or corrupt the runtime's internal state. We present /CONFIDENTIAL: a verifier for IRC that is modular, automatic, and keeps our compiler out of the trusted computing base. Our evaluation suggests that the methodology scales to real-world applications.

Journal ArticleDOI
TL;DR: MemGuard separates memory bandwidth into two parts, guaranteed and best effort, and provides bandwidth reservation for the guaranteed bandwidth for temporal isolation, with efficient reclaiming to maximally utilize the reserved bandwidth.
Abstract: Memory bandwidth in modern multi-core platforms is highly variable for many reasons, which is a big challenge in designing real-time systems as applications are becoming increasingly memory-intensive. In this work, we proposed, designed, and implemented an efficient memory bandwidth reservation system that we call MemGuard. MemGuard separates memory bandwidth into two parts: guaranteed and best effort. It provides bandwidth reservation for the guaranteed bandwidth for temporal isolation, with efficient reclaiming to maximally utilize the reserved bandwidth. It further improves performance by exploiting the best-effort bandwidth after satisfying each core's reserved bandwidth. MemGuard is evaluated with SPEC2006 benchmarks on a real hardware platform, and the results demonstrate that it is able to provide memory performance isolation with minimal impact on overall throughput.
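
The reservation-plus-reclaim policy fits in a few lines (a toy per-period model; the real system enforces budgets with per-core performance counters and throttling interrupts):

```python
# Each core gets a guaranteed budget of memory accesses per period; unused
# budget is donated to a global pool that busy cores may reclaim before
# being throttled to best-effort service.

PERIOD_BUDGETS = {"core0": 1000, "core1": 1000}   # accesses per period

def run_period(demands):
    granted, pool = {}, 0
    for core, want in demands.items():            # spend reservations first
        used = min(want, PERIOD_BUDGETS[core])
        granted[core] = used
        pool += PERIOD_BUDGETS[core] - used       # donate the leftover
    for core, want in demands.items():            # then reclaim from the pool
        extra = min(want - granted[core], pool)
        granted[core] += extra
        pool -= extra                             # beyond this: throttled
    return granted

# core0 is memory-light this period, so core1 reclaims core0's slack instead
# of being cut off at its own 1000-access reservation.
print(run_period({"core0": 200, "core1": 1900}))  # {'core0': 200, 'core1': 1800}
```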