
Showing papers on "Memory management published in 2017"


Journal ArticleDOI
TL;DR: A fundamentally different approach is needed, in which the cache contents are used as side information for coded communication over the shared link; such a coded caching scheme is proposed and proved to be close to optimal.
Abstract: We consider a network consisting of a file server connected through a shared link to a number of users, each equipped with a cache. Knowing the popularity distribution of the files, the goal is to optimally populate the caches, such as to minimize the expected load of the shared link. For a single cache, it is well known that storing the most popular files is optimal in this setting. However, we show here that this is no longer the case for multiple caches. Indeed, caching only the most popular files can be highly suboptimal. Instead, a fundamentally different approach is needed, in which the cache contents are used as side information for coded communication over the shared link. We propose such a coded caching scheme and prove that it is close to optimal.
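As a back-of-the-envelope illustration of the gain described above (using the standard rate expressions from the coded caching literature, not text taken from this abstract): with N equal-size files, K users, and a per-user cache holding M files' worth of data, uncoded delivery of cached-then-requested files needs a link load of roughly K(1 - M/N) files, while the coded scheme achieves approximately

R_{\mathrm{uncoded}}(M) \approx K\left(1 - \frac{M}{N}\right), \qquad R_{\mathrm{coded}}(M) = K\left(1 - \frac{M}{N}\right)\cdot\frac{1}{1 + KM/N}

The extra factor 1/(1 + KM/N) keeps the load bounded as the number of users grows; for example, K = N = 10 and M = 5 gives a load of about 0.83 files instead of 5.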

224 citations


Journal ArticleDOI
TL;DR: A tool is designed that carefully models I/O power in the memory system, explores the design space, and gives the user the ability to define new types of memory interconnects/topologies; a new relay-on-board chip that partitions a DDR channel into multiple cascaded channels is also introduced.
Abstract: Historically, server designers have opted for simple memory systems by picking one of a few commoditized DDR memory products. We are already witnessing a major upheaval in the off-chip memory hierarchy, with the introduction of many new memory products—buffer-on-board, LRDIMM, HMC, HBM, and NVMs, to name a few. Given the plethora of choices, it is expected that different vendors will adopt different strategies for their high-capacity memory systems, often deviating from DDR standards and/or integrating new functionality within memory systems. These strategies will likely differ in their choice of interconnect and topology, with a significant fraction of memory energy being dissipated in I/O and data movement. To make the case for memory interconnect specialization, this paper makes three contributions. First, we design a tool that carefully models I/O power in the memory system, explores the design space, and gives the user the ability to define new types of memory interconnects/topologies. The tool is validated against SPICE models, and is integrated into version 7 of the popular CACTI package. Our analysis with the tool shows that several design parameters have a significant impact on I/O power. We then use the tool to help craft novel specialized memory system channels. We introduce a new relay-on-board chip that partitions a DDR channel into multiple cascaded channels. We show that this simple change to the channel topology can improve performance by 22% for DDR DRAM and lower cost by up to 65% for DDR DRAM. This new architecture does not require any changes to DIMMs, and it efficiently supports hybrid DRAM/NVM systems. Finally, as an example of a more disruptive architecture, we design a custom DIMM and parallel bus that moves away from the DDR3/DDR4 standards. To reduce energy and improve performance, the baseline data channel is split into three narrow parallel channels and the on-DIMM interconnects are operated at a lower frequency. In addition, this allows us to design a two-tier error protection strategy that reduces data transfers on the interconnect. This architecture yields a performance improvement of 18% and a memory power reduction of 23%. The cascaded channel and narrow channel architectures serve as case studies for the new tool and show the potential for benefit from re-organizing basic memory interconnects.
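To make the kind of I/O power analysis described above concrete, here is a minimal first-order sketch in Python. It is not the CACTI 7 model; the capacitance, termination, and timing numbers are illustrative assumptions.

# First-order off-chip I/O energy per bit: switching energy plus termination
# energy. All parameter values below are made-up illustration numbers.
def io_energy_per_bit_pj(v_swing=1.2, c_line_pf=15.0, activity=0.5,
                         r_term_ohm=50.0, bit_time_ns=0.625):
    e_dyn = activity * c_line_pf * v_swing ** 2          # pF * V^2 -> pJ
    e_term = (v_swing ** 2 / r_term_ohm) * bit_time_ns   # W * ns -> nJ
    return e_dyn + e_term * 1e3                          # nJ -> pJ

for name, c_pf in [("point-to-point", 15.0), ("multi-drop, 2 ranks", 30.0)]:
    print(f"{name}: ~{io_energy_per_bit_pj(c_line_pf=c_pf):.1f} pJ/bit")

Even this crude model shows why channel topology matters: doubling the line capacitance of a multi-drop channel doubles the switching term of the per-bit energy.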

217 citations


Proceedings ArticleDOI
04 Apr 2017
TL;DR: DUDETM is presented, a crash-consistent durable transaction system that avoids the drawbacks of both undo logging and redo logging and can be implemented with existing hardware TMs with minor hardware modifications, leading to a further 1.7× speedup.
Abstract: Emerging non-volatile memory (NVM) offers non-volatility, byte-addressability and fast access at the same time. To make the best use of these properties, it has been shown by empirical evidence that programs should access NVM directly through CPU load and store instructions, so that the overhead of a traditional file system or database can be avoided. Thus, durable transactions become a common choice of applications for accessing persistent memory data in a crash consistent manner. However, existing durable transaction systems employ either undo logging, which requires a fence for every memory write, or redo logging, which requires intercepting all memory reads within transactions. This paper presents DUDETM, a crash-consistent durable transaction system that avoids the drawbacks of both undo logging and redo logging. DUDETM uses shadow DRAM to decouple the execution of a durable transaction into three fully asynchronous steps. The advantage is that only minimal fences and no memory read instrumentation are required. This design also enables an out-of-the-box transactional memory (TM) to be used as an independent component in our system. The evaluation results show that DUDETM adds durability to a TM system with only 7.4%–24.6% throughput degradation. Compared to the existing durable transaction systems, DUDETM provides 1.7× to 4.4× higher throughput. Moreover, DUDETM can be implemented with existing hardware TMs with minor hardware modifications, leading to a further 1.7× speedup.
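A minimal, single-threaded Python sketch of the decoupled execute/persist/reproduce flow the abstract describes (dicts stand in for shadow DRAM, the persistent log, and the NVM image; cache-line flushes and fences are omitted, so this illustrates the structure rather than DUDETM itself):

shadow_dram, nvm_log, nvm_image = {}, [], {}

def run_transaction(writes):
    # Step 1 (execute): update the volatile shadow copy and build a redo log.
    redo = []
    for addr, val in writes.items():
        shadow_dram[addr] = val
        redo.append((addr, val))
    return redo

def persist(redo):
    # Step 2 (persist): append the redo log to persistent memory; a single
    # fence after the append would make the transaction durable.
    nvm_log.append(list(redo))

def reproduce():
    # Step 3 (reproduce): asynchronously replay committed logs into the NVM image.
    while nvm_log:
        for addr, val in nvm_log.pop(0):
            nvm_image[addr] = val

persist(run_transaction({"x": 1, "y": 2}))
reproduce()
print(nvm_image)  # {'x': 1, 'y': 2}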

179 citations


Journal ArticleDOI
07 Aug 2017
TL;DR: With a combination of high performance and nonvolatility, the arrival of 3D XPoint memory promises to fundamentally change the memory-storage hierarchy at the hardware, system software, and application levels.
Abstract: With a combination of high performance and nonvolatility, the arrival of 3D XPoint memory promises to fundamentally change the memory-storage hierarchy at the hardware, system software, and application levels. This memory will be deployed first as a block addressable storage device, known as the Intel Optane SSD, and even in this familiar form it will drive basic system change. Access times consistently as fast as, or faster than, the rest of the system will blur the line between storage and memory. The low latencies from these solid-state drives (SSDs) allow rethinking even basic storage methodologies to be more memory-like. For example, the manner in which storage performance is measured shifts from input–output operations (IOs) at a given queue depth to response time for a given load, as memory is typically measured. System changes to match the low latency of these SSDs are already advanced, and in many cases they enable the application to utilize the SSD’s performance. In other cases, additional work is required, particularly on policies set originally with slow storage in mind. On top of these already-capable systems are real applications. System-level tests show that applications such as key–value stores and real-time analytics can benefit immediately. These application benefits include significantly faster runtime (up to 3×) and access to larger data sets than supported in DRAM. Newly viable mechanisms for expanding application memory footprint include native application support or native operating system paging, a significant change in the use of SSDs. The next step in this convergence is 3D XPoint memory accessed through processor load/store operations. Significant operating system support is already in place. The implications of consistently low latency storage and fast persistent memory on computing are great, with applications and systems that use this new technology as storage being the first to benefit.

179 citations


Journal ArticleDOI
Denis Foley1, John M. Danskin1
TL;DR: Nvidia's high-performance Pascal GPU GP100 features in-package high-bandwidth memory, support for efficient FP16 operations, unified memory, and instruction preemption, and incorporates Nvidia's NVLink I/O for high-bandwidth connections between GPUs and between GPUs and CPUs.
Abstract: This article introduces Nvidia's high-performance Pascal GPU. GP100 features in-package high-bandwidth memory, support for efficient FP16 operations, unified memory, and instruction preemption, and incorporates Nvidia's NVLink I/O for high-bandwidth connections between GPUs and between GPUs and CPUs.

159 citations


Proceedings ArticleDOI
04 Apr 2017
TL;DR: This work presents Thermostat, an application-transparent huge-page-aware mechanism to place pages in a dual-technology hybrid memory system that achieves both the cost advantages of two-tiered memory and the performance advantages of transparent huge pages; its effectiveness is evaluated on representative cloud computing workloads running under KVM virtualization.
Abstract: The advent of new memory technologies that are denser and cheaper than commodity DRAM has renewed interest in two-tiered main memory schemes. Infrequently accessed application data can be stored in such memories to achieve significant memory cost savings. Past research on two-tiered main memory has assumed a 4KB page size. However, 2MB huge pages are performance critical in cloud applications with large memory footprints, especially in virtualized cloud environments, where nested paging drastically increases the cost of 4KB page management. We present Thermostat, an application-transparent huge-page-aware mechanism to place pages in a dual-technology hybrid memory system while achieving both the cost advantages of two-tiered memory and performance advantages of transparent huge pages. We present an online page classification mechanism that accurately classifies both 4KB and 2MB pages as hot or cold while incurring no observable performance overhead across several representative cloud applications. We implement Thermostat in Linux kernel version 4.5 and evaluate its effectiveness on representative cloud computing workloads running under KVM virtualization. We emulate slow memory with performance characteristics approximating near-future high-density memory technology and show that Thermostat migrates up to 50% of application footprint to slow memory while limiting performance degradation to 3%, thereby reducing memory cost up to 30%.
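A toy Python sketch of the placement decision described above: huge pages whose sampled access rate falls below a threshold are treated as cold and moved to the slow tier within a budget. The threshold, budget, and page names are invented illustration values, not Thermostat's actual mechanism.

def place_pages(access_rate_per_page, cold_threshold=10, slow_budget_pages=2):
    # Pages below the threshold are "cold"; pick the coldest ones first,
    # up to the slow-memory budget.
    cold = sorted((p for p, r in access_rate_per_page.items() if r < cold_threshold),
                  key=lambda p: access_rate_per_page[p])
    slow_tier = cold[:slow_budget_pages]
    fast_tier = [p for p in access_rate_per_page if p not in slow_tier]
    return fast_tier, slow_tier

rates = {"2MB_page_A": 500, "2MB_page_B": 3, "2MB_page_C": 0, "2MB_page_D": 40}
print(place_pages(rates))
# (['2MB_page_A', '2MB_page_D'], ['2MB_page_C', '2MB_page_B'])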

146 citations


Journal ArticleDOI
TL;DR: The dark memory state is discussed, Pareto curves for compute units, accelerators, and on-chip memory are presented, and the need for HW/SW codesign for parallelism and locality is motivated.
Abstract: Unlike traditional dark silicon works that attack the computing logic, this article puts a focus on the memory part, which dissipates most of the energy for memory-bound CPU applications. This article discusses the dark memory state, presents Pareto curves for compute units, accelerators, and on-chip memory, and motivates the need for HW/SW codesign for parallelism and locality. –Muhammad Shafique, Vienna University of Technology

121 citations


Proceedings ArticleDOI
14 Oct 2017
TL;DR: Mosaic is introduced, a GPU memory manager that provides application-transparent support for multiple page sizes, uses base pages to transfer data over the system I/O bus, and allocates physical memory in a way that preserves base page contiguity.
Abstract: Contemporary discrete GPUs support rich memory management features such as virtual memory and demand paging. These features simplify GPU programming by providing a virtual address space abstraction similar to CPUs and eliminating manual memory management, but they introduce high performance overheads during (1) address translation and (2) page faults. A GPU relies on high degrees of thread-level parallelism (TLP) to hide memory latency. Address translation can undermine TLP, as a single miss in the translation lookaside buffer (TLB) invokes an expensive serialized page table walk that often stalls multiple threads. Demand paging can also undermine TLP, as multiple threads often stall while they wait for an expensive data transfer over the system I/O (e.g., PCIe) bus when the GPU demands a page. In modern GPUs, we face a trade-off on how the page size used for memory management affects address translation and demand paging. The address translation overhead is lower when we employ a larger page size (e.g., 2MB large pages, compared with conventional 4KB base pages), which increases TLB coverage and thus reduces TLB misses. Conversely, the demand paging overhead is lower when we employ a smaller page size, which decreases the system I/O bus transfer latency. Support for multiple page sizes can help relax the page size trade-off so that address translation and demand paging optimizations work together synergistically. However, existing page coalescing (i.e., merging base pages into a large page) and splintering (i.e., splitting a large page into base pages) policies require costly base page migrations that undermine the benefits multiple page sizes provide. In this paper, we observe that GPGPU applications present an opportunity to support multiple page sizes without costly data migration, as the applications perform most of their memory allocation en masse (i.e., they allocate a large number of base pages at once). We show that this en masse allocation allows us to create intelligent memory allocation policies which ensure that base pages that are contiguous in virtual memory are allocated to contiguous physical memory pages. As a result, coalescing and splintering operations no longer need to migrate base pages. We introduce Mosaic, a GPU memory manager that provides application-transparent support for multiple page sizes. Mosaic uses base pages to transfer data over the system I/O bus, and allocates physical memory in a way that (1) preserves base page contiguity and (2) ensures that a large page frame contains pages from only a single memory protection domain. We take advantage of this allocation strategy to design a novel in-place page size selection mechanism that avoids data migration. This mechanism allows the TLB to use large pages, reducing address translation overhead. During data transfer, this mechanism enables the GPU to transfer only the base pages that are needed by the application over the system I/O bus, keeping demand paging overhead low. Our evaluations show that Mosaic reduces address translation overheads while efficiently achieving the benefits of demand paging, compared to a contemporary GPU that uses only a 4KB page size. Relative to a state-of-the-art GPU memory manager, Mosaic improves the performance of homogeneous and heterogeneous multi-application workloads by 55.5% and 29.7% on average, respectively, coming within 6.8% and 15.4% of the performance of an ideal TLB where all TLB requests are hits.
CCS CONCEPTS • Computer systems organization → Parallel architectures;
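The contiguity-preserving allocation idea can be illustrated with a small Python sketch (hypothetical bookkeeping, not Mosaic's implementation): each large-page frame is dedicated to one application, so en-masse base-page allocations land contiguously and a later coalesce is only a metadata change, with no data migration.

BASE_PAGES_PER_FRAME = 512   # base pages per 2MB large-page frame

class Allocator:
    def __init__(self, num_frames):
        self.free_frames = list(range(num_frames))
        self.frames = {}  # frame -> {"owner": app, "used": n, "coalesced": bool}

    def alloc(self, app, n_base_pages):
        mappings = []
        while n_base_pages > 0:
            frame = self._frame_for(app)
            take = min(n_base_pages, BASE_PAGES_PER_FRAME - self.frames[frame]["used"])
            start = self.frames[frame]["used"]
            mappings.append((frame, start, start + take))  # contiguous within the frame
            self.frames[frame]["used"] += take
            n_base_pages -= take
        return mappings

    def _frame_for(self, app):
        # Reuse a partially filled frame owned by this app, else take a fresh one.
        for f, meta in self.frames.items():
            if meta["owner"] == app and meta["used"] < BASE_PAGES_PER_FRAME:
                return f
        f = self.free_frames.pop(0)
        self.frames[f] = {"owner": app, "used": 0, "coalesced": False}
        return f

    def coalesce(self, frame):
        # In place: the frame already holds one app's contiguous base pages,
        # so only the page-table/TLB attribute changes.
        if self.frames[frame]["used"] == BASE_PAGES_PER_FRAME:
            self.frames[frame]["coalesced"] = True

a = Allocator(num_frames=4)
print(a.alloc("app0", 600))   # [(0, 0, 512), (1, 0, 88)]
a.coalesce(0)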

110 citations


Proceedings ArticleDOI
24 Jun 2017
TL;DR: The thesis is that efficient NMP calls for an algorithm-hardware co-design that favors algorithms with sequential accesses, enabling simple hardware that accesses memory in streams; the Mondrian Data Engine is introduced as an instance of such a co-designed NMP architecture for data analytics.
Abstract: The increasing demand for extracting value out of ever-growing data poses an ongoing challenge to system designers, a task only made trickier by the end of Dennard scaling. As the performance density of traditional CPU-centric architectures stagnates, advancing compute capabilities necessitates novel architectural approaches. Near-memory processing (NMP) architectures are reemerging as promising candidates to improve computing efficiency through tight coupling of logic and memory. NMP architectures are especially fitting for data analytics, as they provide immense bandwidth to memory-resident data and dramatically reduce data movement, the main source of energy consumption. Modern data analytics operators are optimized for CPU execution and hence rely on large caches and employ random memory accesses. In the context of NMP, such random accesses result in wasteful DRAM row buffer activations that account for a significant fraction of the total memory access energy. In addition, utilizing NMP's ample bandwidth with fine-grained random accesses requires complex hardware that cannot be accommodated under NMP's tight area and power constraints. Our thesis is that efficient NMP calls for an algorithm-hardware co-design that favors algorithms with sequential accesses to enable simple hardware that accesses memory in streams. We introduce an instance of such a co-designed NMP architecture for data analytics, the Mondrian Data Engine. Compared to a CPU-centric and a baseline NMP system, the Mondrian Data Engine improves the performance of basic data analytics operators by up to 49x and 5x, and efficiency by up to 28x and 5x, respectively.
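A tiny software analogue of the sequential-versus-random argument above, in Python: the same group-by aggregation can be computed with random hash-table accesses or by sorting first and then making one sequential pass; the second access pattern is the streaming style the proposed co-design favors (this is only an access-pattern illustration, not the Mondrian Data Engine).

from collections import defaultdict

rows = [("blue", 3), ("red", 1), ("blue", 4), ("red", 2), ("green", 5)]

def groupby_hash(rows):              # random accesses into a hash table
    acc = defaultdict(int)
    for key, val in rows:
        acc[key] += val
    return dict(acc)

def groupby_sort(rows):              # sort, then a single sequential pass
    out, cur_key, cur_sum = {}, None, 0
    for key, val in sorted(rows):    # sorting makes equal keys adjacent
        if key != cur_key and cur_key is not None:
            out[cur_key] = cur_sum
            cur_sum = 0
        cur_key = key
        cur_sum += val
    if cur_key is not None:
        out[cur_key] = cur_sum
    return out

assert groupby_hash(rows) == groupby_sort(rows) == {"blue": 7, "red": 3, "green": 5}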

110 citations


Journal ArticleDOI
TL;DR: The algorithms to build and query Hashedcubes are described, along with how it can drive well-known interactive visualizations such as binned scatterplots, linked histograms and heatmaps; the typical query is answered fast enough to easily sustain interactivity.
Abstract: We propose Hashedcubes, a data structure that enables real-time visual exploration of large datasets that improves the state of the art by virtue of its low memory requirements, low query latencies, and implementation simplicity. In some instances, Hashedcubes notably requires two orders of magnitude less space than recent data cube visualization proposals. In this paper, we describe the algorithms to build and query Hashedcubes, and how it can drive well-known interactive visualizations such as binned scatterplots, linked histograms and heatmaps. We report memory usage, build time and query latencies for a variety of synthetic and real-world datasets, and find that although sometimes Hashedcubes offers slightly slower querying times than the state of the art, the typical query is answered fast enough to easily sustain interactivity. In datasets with hundreds of millions of elements, only about 2% of the queries take longer than 40ms. Finally, we discuss the limitations of the data structure, potential space-time tradeoffs, and future research directions.
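A one-dimensional Python illustration of the pivot idea behind this kind of structure (hedged: Hashedcubes nests such pivots across several dimensions, and the records here are invented): once records are sorted by a bin key, each bin occupies a contiguous [begin, end) range, so a per-bin count is a subtraction and a refined filter only scans that contiguous slice.

from bisect import bisect_left, bisect_right

records = sorted([("us", 3), ("br", 1), ("us", 9), ("de", 4), ("br", 7), ("us", 2)])
keys = [k for k, _ in records]

def pivot(bin_key):
    return bisect_left(keys, bin_key), bisect_right(keys, bin_key)

def count(bin_key):
    begin, end = pivot(bin_key)
    return end - begin

def count_filtered(bin_key, predicate):
    begin, end = pivot(bin_key)
    return sum(1 for _, v in records[begin:end] if predicate(v))

print(count("us"))                            # 3
print(count_filtered("us", lambda v: v > 2))  # 2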

98 citations


Proceedings ArticleDOI
23 Apr 2017
TL;DR: NVthreads is a drop-in replacement for the pthreads library and requires only tens of lines of program changes to leverage non-volatile memory, and infers consistent states via synchronization points, uses the process memory to buffer uncommitted changes, and logs writes to ensure a program's data is recoverable even after a crash.
Abstract: Non-volatile memory technologies, such as memristor and phase-change memory, will allow programs to persist data with regular memory instructions. Liberated from the overhead to serialize and deserialize data to storage devices, programs can aim for high performance and still be crash fault-tolerant. Unfortunately, to leverage non-volatile memory, existing systems require hardware changes or extensive program modifications. We present NVthreads, a programming model and runtime that adds persistence to existing multi-threaded C/C++ programs. NVthreads is a drop-in replacement for the pthreads library and requires only tens of lines of program changes to leverage non-volatile memory. NVthreads infers consistent states via synchronization points, uses the process memory to buffer uncommitted changes, and logs writes to ensure a program's data is recoverable even after a crash. NVthreads' page-level mechanisms result in good performance: applications that use NVthreads can be more than 2× faster than state-of-the-art systems that favor fine-grained tracking of writes. After a failure, iterative applications that use NVthreads gain speedups by resuming execution.
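A conceptual Python sketch of the "consistent states at synchronization points" idea above (dicts stand in for process memory and the persistent log; the real system tracks dirty pages rather than keys): writes inside a critical section are buffered and logged as a unit at release, so recovery can replay only whole, committed intervals.

durable_log = []          # stand-in for the NVM-backed write log
memory = {}               # committed program state

class LoggedCriticalSection:
    def __enter__(self):
        self.buffer = {}                 # uncommitted changes live here
        return self.buffer

    def __exit__(self, exc_type, exc, tb):
        if exc_type is None:
            durable_log.append(dict(self.buffer))   # log the whole interval
            memory.update(self.buffer)              # then publish it
        return False

with LoggedCriticalSection() as buf:
    buf["counter"] = memory.get("counter", 0) + 1
    buf["flag"] = True

def recover():
    state = {}
    for interval in durable_log:   # replay only fully logged intervals
        state.update(interval)
    return state

print(memory, recover() == memory)   # {'counter': 1, 'flag': True} True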

Proceedings ArticleDOI
01 May 2017
TL;DR: This work applies a novel technique, called privilege overlaying, wherein operations requiring privileged execution are identified and only these operations execute in privileged mode, which provides the foundation on which code-integrity, adapted control-flow hijacking defenses, and protections for sensitive IO are applied.
Abstract: Embedded systems are ubiquitous in every aspect of modern life. As the Internet of Things expands, our dependence on these systems increases. Many of these interconnected systems are and will be low cost bare-metal systems, executing without an operating system. Bare-metal systems rarely employ any security protection mechanisms, and their development assumptions (unrestricted access to all memory and instructions) and constraints (runtime, energy, and memory) make applying protections challenging. To address these challenges we present EPOXY, an LLVM-based embedded compiler. We apply a novel technique, called privilege overlaying, wherein operations requiring privileged execution are identified and only these operations execute in privileged mode. This provides the foundation on which code-integrity, adapted control-flow hijacking defenses, and protections for sensitive IO are applied. We also design fine-grained randomization schemes that work within the constraints of bare-metal systems to provide further protection against control-flow and data corruption attacks. These defenses prevent code injection attacks and ROP attacks from scaling across large sets of devices. We evaluate the performance of our combined defense mechanisms for a suite of 75 benchmarks and 3 real-world IoT applications. Our results for the application case studies show that EPOXY has, on average, a 1.8% increase in execution time and a 0.5% increase in energy usage.

Proceedings ArticleDOI
01 Feb 2017
TL;DR: Hierarchical intelligence at the edge enhances radio bandwidth and power efficiency by trading off computation and communication at edge devices; however, off-chip weight storage in DRAM implies significant additional power consumption due to intensive off-chip data movement.
Abstract: Deep learning has proven to be a powerful tool for a wide range of applications, such as speech recognition and object detection, among others. Recently there has been increased interest in deep learning for mobile IoT [1] to enable intelligence at the edge and shield the cloud from a deluge of data by only forwarding meaningful events. This hierarchical intelligence thereby enhances radio bandwidth and power efficiency by trading off computation and communication at edge devices. Since many mobile applications are “always-on” (e.g., voice commands), low power is a critical design constraint. However, prior works have focused on high performance reconfigurable processors [2–3] optimized for large-scale deep neural networks (DNNs) that consume >50mW. Off-chip weight storage in DRAM is also common in the prior works [2–3], which implies significant additional power consumption due to intensive off-chip data movement.

Proceedings ArticleDOI
24 Jun 2017
TL;DR: This work proposes a new approach to access pattern obfuscation, called ObfusMem, which adds the memory to the trusted computing base, incorporates cryptographic engines within the memory, and encrypts commands and addresses on the memory bus, so that the access pattern is cryptographically obfuscated from external observers.
Abstract: Trustworthy software requires strong privacy and security guarantees from a secure trust base in hardware. While chipmakers provide hardware support for basic security and privacy primitives such as enclaves and memory encryption, these primitives do not address hiding of the memory access pattern, information about which may enable attacks on the system or reveal characteristics of sensitive user data. State-of-the-art approaches to protecting the access pattern are largely based on Oblivious RAM (ORAM). Unfortunately, current ORAM implementations suffer from very significant practicality and overhead concerns, including roughly an order of magnitude slowdown, more than 100% memory capacity overheads, and the potential for system deadlock. Memory technology trends are moving towards 3D and 2.5D integration, enabling significant logic capabilities and sophisticated memory interfaces. Leveraging the trends, we propose a new approach to access pattern obfuscation, called ObfusMem. ObfusMem adds the memory to the trusted computing base and incorporates cryptographic engines within the memory. ObfusMem encrypts commands and addresses on the memory bus, hence the access pattern is cryptographically obfuscated from external observers. Our evaluation shows that ObfusMem incurs an overhead of 10.9% on average, which is about an order of magnitude faster than ORAM implementations. Furthermore, ObfusMem does not incur capacity overheads and does not amplify writes. We analyze and compare the security protections provided by ObfusMem and ORAM, and highlight their differences.
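A toy Python model of bus-level obfuscation in the spirit of the abstract: the host and the memory-side engine share a key, and the (command, address) pair is XORed with a per-request keystream before crossing the untrusted bus. The SHA-256-based keystream and 9-byte packet format are illustrative stand-ins, not the ObfusMem protocol.

import hashlib, struct

KEY = b"shared-secret-established-at-boot"

def keystream(counter, n):
    return hashlib.sha256(KEY + struct.pack(">Q", counter)).digest()[:n]

def host_send(command, address, counter):
    packet = struct.pack(">BQ", command, address)            # 1B cmd + 8B addr
    return bytes(a ^ b for a, b in zip(packet, keystream(counter, len(packet))))

def memory_receive(ciphertext, counter):
    packet = bytes(a ^ b for a, b in zip(ciphertext, keystream(counter, len(ciphertext))))
    return struct.unpack(">BQ", packet)

wire = host_send(command=1, address=0xDEADBEEF, counter=42)
print(wire.hex())                        # opaque to a bus observer
print(memory_receive(wire, counter=42))  # (1, 3735928559)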

Journal ArticleDOI
TL;DR: Compared with a variety of Intel i7s and Nvidia GPUs, the KiloCore at 1.1 V has geometric mean improvements of 4.3× higher throughput per area and 9.4× higher energy efficiency for AES encryption, 4095-b low-density parity-check decoding, 4096-point complex fast Fourier transform, and 100-B record sorting applications.
Abstract: A processor array containing 1000 independent processors and 12 memory modules was fabricated in 32-nm partially depleted silicon on insulator CMOS. The programmable processors occupy 0.055 mm2 each, contain no algorithm-specific hardware, and operate up to an average maximum clock frequency of 1.78 GHz at 1.1 V. At 0.9 V, processors operating at an average of 1.24 GHz dissipate 17 mW while issuing one instruction per cycle. At 0.56 V, processors operating at an average of 115 MHz dissipate 0.61 mW while issuing one instruction per cycle, resulting in an energy consumption of 5.3 pJ/instruction. On-die communication is performed by complementary circuit and packet-based networks that yield a total array bisection bandwidth of 4.2 Tb/s. Independent memory modules handle data and instructions and operate up to an average maximum clock frequency of 1.77 GHz at 1.1 V. All processors, their packet routers, and the memory modules contain unconstrained clock oscillators within independent clock domains that adapt to large supply voltage noise. Compared with a variety of Intel i7s and Nvidia GPUs, the KiloCore at 1.1 V has geometric mean improvements of 4.3× higher throughput per area and 9.4× higher energy efficiency for AES encryption, 4095-b low-density parity-check decoding, 4096-point complex fast Fourier transform, and 100-B record sorting applications.

Proceedings ArticleDOI
24 Jun 2017
TL;DR: HeteroOS is a novel application-transparent OS-level solution for managing memory heterogeneity in virtualized systems that shows up to 2x performance improvement compared to the state-of-the-art VMM-exclusive approach.
Abstract: Heterogeneous memory management combined with server virtualization in datacenters is expected to increase the software and OS management complexity. State-of-the-art solutions rely exclusively on the hypervisor (VMM) for expensive page hotness tracking and migrations, limiting the benefits from heterogeneity. To address this, we design HeteroOS, a novel application-transparent OS-level solution for managing memory heterogeneity in virtualized systems. The HeteroOS design first makes the guest-OSes heterogeneity-aware and then extracts rich OS-level information about applications' memory usage to place data in the 'right' memory, avoiding page migrations. When such pro-active placements are not possible, HeteroOS combines the power of the guest-OSes' information about applications with the VMM's hardware control to track for hotness and migrate only performance-critical pages. Finally, HeteroOS also designs efficient heterogeneous memory sharing across multiple guest-VMs. Evaluation of HeteroOS with memory, storage, and network-intensive datacenter applications shows up to 2x performance improvement compared to the state-of-the-art VMM-exclusive approach.

Journal ArticleDOI
TL;DR: A novel hybrid compression method for a widely used RNN variant, long short-term memory (LSTM), is presented to tackle implementation challenges; it can reduce memory usage by more than 95% with negligible accuracy loss when verified on language modeling and speech recognition tasks.
Abstract: Recurrent neural networks (RNNs) have achieved the state-of-the-art performance on various sequence learning tasks due to their powerful sequence modeling capability. However, RNNs usually require a large number of parameters and high computational complexity. Hence, it is quite challenging to implement complex RNNs on embedded devices with stringent memory and latency requirements. In this paper, we first present a novel hybrid compression method for a widely used RNN variant, long short-term memory (LSTM), to tackle these implementation challenges. By properly using circulant matrices, forward nonlinear function approximation, and efficient quantization schemes with a retrain-based training strategy, the proposed compression method can reduce more than 95% of memory usage with negligible accuracy loss when verified under language modeling and speech recognition tasks. An efficient scalable parallel hardware architecture is then proposed for the compressed LSTM. With an innovative chessboard division method for matrix–vector multiplications, the parallelism of the proposed hardware architecture can be freely chosen under a certain latency requirement. Specifically, for the circulant matrix–vector multiplications employed in the compressed LSTM, the circulant matrices are judiciously reorganized to fit in with the chessboard division and minimize the number of memory accesses required for the matrix multiplications. The proposed architecture is modeled using register transfer language (RTL) and synthesized under the TSMC 90-nm CMOS technology. With 518.5-kB on-chip memory, we are able to process a 512 × 512 compressed LSTM in 1.71 μs, corresponding to 2.46 TOPS on the uncompressed one, at a cost of 30.77-mm2 chip area. The implementation results demonstrate that the proposed design can achieve significantly high flexibility and area efficiency, which satisfies many real-time applications on embedded devices. It is worth mentioning that the memory-efficient approach of accelerating LSTM developed in this paper is also applicable to other RNN variants.
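The memory saving from the circulant structure mentioned above can be seen in a few lines of Python/NumPy (illustrative sizes, not the paper's configuration): an n×n circulant block is fully defined by a single length-n vector, and its matrix–vector product reduces to FFTs.

import numpy as np

n = 512
c = np.random.randn(n)                     # one column defines the whole block
x = np.random.randn(n)

# Explicit circulant matrix, built here only to check the FFT result.
C = np.empty((n, n))
for j in range(n):
    C[:, j] = np.roll(c, j)

y_dense = C @ x                                                # O(n^2), n*n weights
y_fft = np.real(np.fft.ifft(np.fft.fft(c) * np.fft.fft(x)))    # O(n log n), n weights

print(np.allclose(y_dense, y_fft))          # True
print(f"weights: dense {n*n}, circulant {n} (~{n}x smaller)")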

Proceedings ArticleDOI
18 Jun 2017
TL;DR: This paper proposes an ultra-efficient approximate processing in-memory architecture, called APIM, which exploits the analog characteristics of non-volatile memories to support addition and multiplication inside the crossbar memory, while storing the data.
Abstract: Recent years have witnessed a rapid growth in the domain of Internet of Things (IoT). This network of billions of devices generates and exchanges huge amounts of data. The limited cache capacity and memory bandwidth make transferring and processing such data on traditional CPUs and GPUs highly inefficient, both in terms of energy consumption and delay. However, many IoT applications are statistical at heart and can accept a degree of inaccuracy in their computation. This enables designers to reduce the complexity of processing by approximating the results for a desired accuracy. In this paper, we propose an ultra-efficient approximate processing in-memory architecture, called APIM, which exploits the analog characteristics of non-volatile memories to support addition and multiplication inside the crossbar memory, while storing the data. The proposed design eliminates the overhead involved in transferring data to the processor by virtually bringing the processor inside memory. APIM dynamically configures the precision of computation for each application in order to tune the level of accuracy during runtime. Our experimental evaluation running six general OpenCL applications shows that the proposed design achieves up to 20× performance improvement and provides 480× improvement in energy-delay product, ensuring acceptable quality of service. In exact mode, it achieves 28× energy savings and 4.8× speed up compared to the state-of-the-art GPU cores.

Proceedings ArticleDOI
01 Sep 2017
TL;DR: UH-MEM, a page management mechanism for various hybrid memories, systematically estimates the utility of migrating a page between different memory types and uses this information to guide data placement.
Abstract: While the memory footprints of cloud and HPC applications continue to increase, fundamental issues with DRAM scaling are likely to prevent traditional main memory systems, composed of monolithic DRAM, from greatly growing in capacity. Hybrid memory systems can mitigate the scaling limitations of monolithic DRAM by pairing together multiple memory technologies (e.g., different types of DRAM, or DRAM and non-volatile memory) at the same level of the memory hierarchy. The goal of a hybrid main memory is to combine the different advantages of the multiple memory types in a cost-effective manner while avoiding the disadvantages of each technology. Memory pages are placed in and migrated between the different memories within a hybrid memory system, based on the properties of each page. It is important to make intelligent page management (i.e., placement and migration) decisions, as they can significantly affect system performance. In this paper, we propose utility-based hybrid memory management (UH-MEM), a new page management mechanism for various hybrid memories, that systematically estimates the utility (i.e., the system performance benefit) of migrating a page between different memory types, and uses this information to guide data placement. UH-MEM operates in two steps. First, it estimates how much a single application would benefit from migrating one of its pages to a different type of memory, by comprehensively considering access frequency, row buffer locality, and memory-level parallelism. Second, it translates the estimated benefit of a single application to an estimate of the overall system performance benefit from such a migration. We evaluate the effectiveness of UH-MEM with various types of hybrid memories, and show that it significantly improves system performance on each of these hybrid memories. For a memory system with DRAM and non-volatile memory, UH-MEM improves performance by 14% on average (and up to 26%) compared to the best of three evaluated state-of-the-art mechanisms across a large number of data-intensive workloads.
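A simplified Python rendering of the two-step utility estimate described above; the scoring formula, latency numbers, and weighting are invented for illustration and are not the paper's exact model.

def page_benefit(accesses, row_hit_rate, mlp, fast_lat_ns=50, slow_lat_ns=150):
    # Step 1: per-application benefit of moving a page to fast memory.
    # Row-buffer hits are cheap even in slow memory; misses pay the full gap,
    # and high memory-level parallelism overlaps misses, shrinking the saving.
    per_access_gain = (1.0 - row_hit_rate) * (slow_lat_ns - fast_lat_ns)
    return accesses * per_access_gain / mlp

def system_benefit(app_benefit, app_slowdown_share):
    # Step 2: scale the per-application gain by that app's weight in
    # overall system performance.
    return app_benefit * app_slowdown_share

candidates = {
    "pageA": page_benefit(accesses=1000, row_hit_rate=0.2, mlp=1.5),
    "pageB": page_benefit(accesses=1000, row_hit_rate=0.9, mlp=4.0),
}
best = max(candidates, key=lambda p: system_benefit(candidates[p], app_slowdown_share=0.6))
print(best, candidates)   # pageA is the better migration candidate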

Proceedings ArticleDOI
01 Jan 2017
TL;DR: A Multi-Purpose In-Memory Processing (MPIM) system is designed that can be used both as main memory and for processing, achieving up to 5.5× energy savings and 19× speedup for search operations compared to an AMD GPU-based implementation.
Abstract: Running Internet of Things applications on general purpose processors results in a large energy and performance overhead, due to the high cost of data movement. Processing in-memory is a promising solution to reduce the data movement cost by processing the data locally inside the memory. In this paper, we design a Multi-Purpose In-Memory Processing (MPIM) system, which can be used as main memory and for processing. MPIM consists of multiple crossbar memories with the capability of efficient in-memory computations. Instead of transferring the large dataset to the processors, MPIM provides two important in-memory processing capabilities: (i) data searching for the nearest neighbor and (ii) bitwise operations including OR, AND and XOR with small analog sense amplifiers. The experimental results show that MPIM can achieve up to 5.5× energy savings and 19× speedup for the search operations as compared to an AMD GPU-based implementation. For bitwise vector processing, we show a 11000× energy improvement with a 62× speedup over SIMD-based computation, while outperforming other state-of-the-art in-memory processing techniques.

Proceedings ArticleDOI
24 Jun 2017
TL;DR: This paper proposes a novel HW-SW hybrid translation architecture, which can adapt to different memory mappings efficiently and can provide scalable translation coverage improvements with minor hardware changes, while allowing the flexibility of memory allocation.
Abstract: To mitigate excessive TLB misses in large memory applications, techniques such as large pages, variable length segments, and HW coalescing, increase the coverage of limited hardware translation entries by exploiting the contiguous memory allocation. However, recent studies show that in non-uniform memory systems, using large pages often leads to performance degradation, or allocating large chunks of memory becomes more difficult due to memory fragmentation. Although each of the prior techniques favors its own best chunk size, diverse contiguity of memory allocation in real systems cannot always provide the optimal chunk of each technique. Under such fragmented and diverse memory allocations, this paper proposes a novel HW-SW hybrid translation architecture, which can adapt to different memory mappings efficiently. In the proposed hybrid coalescing technique, the operating system encodes memory contiguity information in a subset of page table entries, called anchor entries. During address translation through TLBs, an anchor entry provides translation for contiguous pages following the anchor entry. As a smaller number of anchor entries can cover a large portion of virtual address space, the efficiency of TLB can be significantly improved. The most important benefit of hybrid coalescing is its ability to change the coverage of the anchor entry dynamically, reflecting the current allocation contiguity status. By using the contiguity information directly set by the operating system, the technique can provide scalable translation coverage improvements with minor hardware changes, while allowing the flexibility of memory allocation. Our experimental results show that across diverse allocation scenarios with different distributions of contiguous memory chunks, the proposed scheme can effectively reap the potential translation coverage improvement from the existing contiguity.
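A small Python sketch of anchor-based translation as described above (simplified: plain dicts, a fixed anchor distance of 8 pages, and an identity-style mapping chosen only for illustration): an anchor entry records how many pages after it are physically contiguous, so any VPN inside that run can be translated from the anchor alone, and only uncovered VPNs fall back to a regular walk.

ANCHOR_DISTANCE = 8   # one anchor entry per 8 pages (illustrative)

page_table = {vpn: 1000 + vpn for vpn in range(16)}       # contiguous mapping
anchors = {0: (page_table[0], 8), 8: (page_table[8], 8)}  # vpn -> (phys base, contiguous run)

def translate(vpn):
    anchor_vpn = (vpn // ANCHOR_DISTANCE) * ANCHOR_DISTANCE
    if anchor_vpn in anchors:
        base_ppn, run = anchors[anchor_vpn]
        if vpn - anchor_vpn < run:
            return base_ppn + (vpn - anchor_vpn)   # covered by the anchor entry
    return page_table[vpn]                         # fall back to a normal page-table walk

print(translate(3), translate(13))   # 1003 1013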

Journal ArticleDOI
Wang Kang1, Haotian Wang1, Zhaohao Wang1, Youguang Zhang1, Weisheng Zhao1 
12 May 2017
TL;DR: A cost-efficient IMP/NMP solution in spin-transfer torque magnetic random access memory (STT–MRAM) without adding any processing units on the memory chip is proposed.
Abstract: In the current big data era, the memory wall issue between the processor and the memory has become one of the most critical bottlenecks for the conventional von Neumann computer architecture. In-memory processing (IMP) or near-memory processing (NMP) paradigms have been proposed to address this problem by adding a small number of processing units inside/near the memory. Unfortunately, although intensively studied, prior IMP/NMP platforms have been practically unsuccessful because of the fabrication complexity and cost inefficiency of integrating the processing units and memory on the same chip. Recently, emerging nonvolatile memories provide a new possibility for efficiently implementing the IMP/NMP paradigm. In this paper, we propose a cost-efficient IMP/NMP solution in spin-transfer torque magnetic random access memory (STT–MRAM) without adding any processing units on the memory chip. The key idea behind the proposed IMP/NMP solution is to exploit the peripheral circuitry already existing inside memory (or with minimal changes) to perform bitwise logic operations. Such an IMP/NMP platform enables rather fast logic operations, as the logic results can be obtained immediately through just a memory-like readout operation. Memory read and logic NOT, AND/NAND, and OR/NOR operations can be achieved and dynamically configured within the same STT–MRAM chip. Functionality and performance are evaluated with hybrid simulations under the 40 nm technology node.

Proceedings ArticleDOI
27 Mar 2017
TL;DR: The Reconfigurable Vector Unit (RVU) is presented that enables massive and adaptive in-memory processing, extending the native HMC instructions and also increasing its effectiveness.
Abstract: Nowadays, applications that predominantly perform lookups over large databases are becoming more popular, with column-stores as the database system architecture of choice. For these applications, Hybrid Memory Cubes (HMCs) can provide bandwidth of up to 320 GB/s and represent the best choice to keep up with the throughput for these ever increasing databases. However, even with the high available memory bandwidth and processing power, in order to achieve the peak performance, data movements through the memory hierarchy consume an unnecessary amount of time and energy. In order to accelerate database operations, and reduce the energy consumption of the system, this paper presents the Reconfigurable Vector Unit (RVU) that enables massive and adaptive in-memory processing, extending the native HMC instructions and also increasing its effectiveness. RVU enables the programmer to reconfigure it to perform as a large vector unit or multiple small vector units to better adjust to the application needs during different computation phases. Due to its adaptability, RVU is capable of achieving a performance increase of 27× on average and reducing DRAM energy consumption by 29% when compared to an x86 processor with 16 cores. Compared with the state-of-the-art mechanism capable of performing large vector operations with fixed size inside the HMC, RVU performs up to 12% better in terms of performance and improves energy consumption by 53%.

Proceedings ArticleDOI
01 Oct 2017
TL;DR: In this article, a novel memory network model named Read-Write Memory Network (RWMN) is proposed to perform question answering tasks for large-scale, multimodal movie story understanding.
Abstract: We propose a novel memory network model named Read-Write Memory Network (RWMN) to perform question answering tasks for large-scale, multimodal movie story understanding. The key focus of our RWMN model is to design the read network and the write network that consist of multiple convolutional layers, which enable memory read and write operations to have high capacity and flexibility. While existing memory-augmented network models treat each memory slot as an independent block, our use of multi-layered CNNs allows the model to read and write sequential memory cells as chunks, which is more reasonable to represent a sequential story because adjacent memory blocks often have strong correlations. For evaluation, we apply our model to all the six tasks of the MovieQA benchmark [24], and achieve the best accuracies on several tasks, especially on the visual QA task. Our model shows a potential to better understand not only the content in the story, but also more abstract information, such as relationships between characters and the reasons for their actions.

Journal ArticleDOI
01 Aug 2017
TL;DR: PAllocator is presented, a fail-safe persistent SCM allocator whose design emphasizes high concurrency and capacity scalability and outperforms state-of-the-art persistent allocators by up to one order of magnitude, both in operation throughput and recovery time.
Abstract: Storage Class Memory (SCM) is a novel class of memory technologies that promise to revolutionize database architectures. SCM is byte-addressable and exhibits latencies similar to those of DRAM, while being non-volatile. Hence, SCM could replace both main memory and storage, enabling a novel single-level database architecture without the traditional I/O bottleneck. Fail-safe persistent SCM allocation can be considered conditio sine qua non for enabling this novel architecture paradigm for database management systems. In this paper we present PAllocator, a fail-safe persistent SCM allocator whose design emphasizes high concurrency and capacity scalability. Contrary to previous works, PAllocator thoroughly addresses the important challenge of persistent memory fragmentation by implementing an efficient defragmentation algorithm. We show that PAllocator outperforms state-of-the-art persistent allocators by up to one order of magnitude, both in operation throughput and recovery time, and enables up to 2.39x higher operation throughput on a persistent B-Tree.

Proceedings ArticleDOI
11 Dec 2017
TL;DR: It is shown that by accurately estimating the resource level that is needed, a co-location scheme can effectively determine how many applications can be co-located on the same host to improve the system throughput, by taking into consideration the memory and CPU requirements of co-running application tasks.
Abstract: Data analytic applications built upon big data processing frameworks such as Apache Spark are an important class of applications. Many of these applications are not latency-sensitive and thus can run as batch jobs in data centers. By running multiple applications on a computing host, task co-location can significantly improve the server utilization and system throughput. However, effective task co-location is a non-trivial task, as it requires an understanding of the computing resource requirement of the co-running applications, in order to determine what tasks, and how many of them, can be co-located. State-of-the-art co-location schemes either require the user to supply the resource demands which are often far beyond what is needed; or use a one-size-fits-all function to estimate the requirement, which, unfortunately, is unlikely to capture the diverse behaviors of applications. In this paper, we present a mixture-of-experts approach to model the memory behavior of Spark applications. We achieve this by learning, off-line, a range of specialized memory models on a range of typical applications; we then determine at runtime which of the memory models, or experts, best describes the memory behavior of the target application. We show that by accurately estimating the resource level that is needed, a co-location scheme can effectively determine how many applications can be co-located on the same host to improve the system throughput, by taking into consideration the memory and CPU requirements of co-running application tasks. Our technique is applied to a set of representative data analytic applications built upon the Apache Spark framework. We evaluated our approach for system throughput and average normalized turnaround time on a multi-core cluster. Our approach achieves over 83.9% of the performance delivered using an ideal memory predictor. We obtain, on average, 8.69x improvement on system throughput and a 49% reduction on turnaround time over executing application tasks in isolation, which translates to a 1.28x and 1.68x improvement over a state-of-the-art co-location scheme for system throughput and turnaround time respectively.
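A toy Python version of the mixture-of-experts idea above: a few specialized memory models ("experts") are fit offline, the expert that best explains a short profiling run of the target application is selected online, and its prediction sizes the co-location. The experts, their linear form, and all numbers are made up for illustration, not the paper's models.

experts = {
    # name: predicted memory (GB) as a function of input size (GB)
    "in-memory-iterative": lambda x: 2.0 + 3.0 * x,
    "streaming":           lambda x: 1.0 + 0.2 * x,
    "shuffle-heavy":       lambda x: 0.5 + 1.2 * x,
}

def pick_expert(profile):
    # profile: list of (input_size_gb, observed_memory_gb) from a short run
    def error(model):
        return sum((model(x) - y) ** 2 for x, y in profile)
    return min(experts, key=lambda name: error(experts[name]))

def max_colocated_tasks(host_mem_gb, input_size_gb, expert_name, headroom=0.9):
    per_task = experts[expert_name](input_size_gb)
    return int(host_mem_gb * headroom // per_task)

profile = [(1.0, 1.3), (2.0, 1.4), (4.0, 1.8)]        # looks like a streaming job
chosen = pick_expert(profile)
print(chosen, max_colocated_tasks(host_mem_gb=64, input_size_gb=8.0, expert_name=chosen))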

Journal ArticleDOI
TL;DR: This article developed a conceptual model to link relevant concepts in psychology and tourism research to each stage of the long-term memory (LTM) system, and focus groups were conducted to examine how practitioners are helping tourists encode, consolidate, and retrieve their memories in the context of this model.
Abstract: This research first develops a conceptual model to link relevant concepts in psychology and tourism research to each stage of the long-term memory (LTM) system. It combines insights from mindfulness, positive affect, and quality of conscious experience to understand how tourists encode information; research in short-term memory and working memory as well as social identity to address the consolidation of information; and concepts of remembering, false memory, and storytelling to highlight information retrieval. Next, focus groups were conducted to examine how practitioners are helping tourists encode, consolidate, and retrieve their memories in the context of this model (Study 1). Finally, in-depth interviews were conducted to complement the practitioner’s perspective by reflecting the tourist’s voice that is relevant in each stage of the LTM system (Study 2). Overall, this research connects findings from the practitioner’s viewpoint with the tourist’s voice to present a framework of memory manage...

Proceedings ArticleDOI
24 Jun 2017
TL;DR: It is demonstrated that smart memory, memory with compute capability and a packetized interface, can dramatically simplify this problem; the resulting InvisiMem designs have one to two orders of magnitude lower overheads for performance, space, energy, and memory bandwidth, compared to prior solutions.
Abstract: A practically feasible low-overhead hardware design that provides strong defenses against memory bus side channel remains elusive. This paper observes that smart memory, memory with compute capability and a packetized interface, can dramatically simplify this problem. InvisiMem expands the trust base to include the logic layer in the smart memory to implement cryptographic primitives, which aid in addressing several memory bus side channel vulnerabilities efficiently. This allows the secure host processor to send encrypted addresses over the untrusted memory bus, and thereby eliminates the need for expensive address obfuscation techniques based on Oblivious RAM (ORAM). In addition, smart memory enables efficient solutions for ensuring freshness without using expensive Merkle trees, and mitigates memory bus timing channel using constant heart-beat packets. We demonstrate that InvisiMem designs have one to two orders of magnitude of lower overheads for performance, space, energy, and memory bandwidth, compared to prior solutions.

Journal ArticleDOI
TL;DR: A scalable counter architecture called Counter Tree is designed, which leverages a 2-D counter sharing scheme to achieve far better memory efficiency while significantly extending the estimation range; its performance is further improved by adding a status bit to each counter.
Abstract: Per-flow traffic measurement, which is to count the number of packets for each active flow during a certain measurement period, has many applications in traffic engineering, classification of routing distribution or network usage pattern, service provision, anomaly detection, and network forensics. In order to keep up with the high throughput of modern routers or switches, the online module for per-flow traffic measurement should use high-bandwidth SRAM that allows fast memory accesses. Due to limited SRAM space, exact counting, which requires keeping a counter for each flow, does not scale to large networks consisting of numerous flows. Some recent work takes a different approach to estimate the flow sizes using counter architectures that can fit into tight SRAM. However, existing counter architectures have limitations, either still requiring considerable SRAM space or having a small estimation range. In this paper, we design a scalable counter architecture called Counter Tree, which leverages a 2-D counter sharing scheme to achieve far better memory efficiency and in the meantime extend the estimation range significantly. Furthermore, we improve the performance of Counter Tree by adding a status bit to each counter. Extensive experiments with real network traces demonstrate that our counter architecture can produce accurate estimates for flows of all sizes under very tight memory space.

Journal ArticleDOI
TL;DR: A dynamic adaptive replacement policy (DARP) in the shared last-level cache for the DRAM/PCM hybrid main memory is proposed; experimental results show that DARP improves memory access efficiency by 25.4%.
Abstract: The increasing demand on the main memory capacity is one of the main big data challenges. Dynamic random access memory (DRAM) does not represent the best choice for a main memory, due to high power consumption and low density. However, nonvolatile memory, such as the phase-change memory (PCM), represents an additional choice because of its low power consumption and high-density characteristics. Nevertheless, the high access latency and limited write endurance have so far prevented the PCM from replacing the DRAM. Therefore, a hybrid memory, which combines both the DRAM and the PCM, has become a good alternative to the traditional DRAM memory. The disadvantages of both DRAM and PCM are challenges for the hybrid memory. In this paper, a dynamic adaptive replacement policy (DARP) in the shared last-level cache for the DRAM/PCM hybrid main memory is proposed. The DARP distinguishes the cache data into PCM data and DRAM data; then, the algorithm adopts different replacement policies for each data type. Specifically, for the PCM data, the least recently used (LRU) replacement policy is adopted, and for the DRAM data, the DARP is employed according to the process behavior. Experimental results show that the DARP improves memory access efficiency by 25.4%.
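An illustrative Python sketch of a type-aware victim choice in one last-level cache set, in the spirit of the policy above: lines are tagged with their backing memory, PCM-backed lines fall back to plain LRU, and DRAM-backed lines are ranked by a simple recency-and-frequency score. The scoring function and preference for evicting DRAM-backed lines are stand-ins, not the paper's exact DARP heuristic.

def pick_victim(cache_set, now):
    # cache_set: list of dicts {tag, backing, last_use, use_count}
    dram = [l for l in cache_set if l["backing"] == "DRAM"]
    pcm = [l for l in cache_set if l["backing"] == "PCM"]
    if dram:
        # Prefer evicting a DRAM-backed line that is old and rarely reused;
        # re-fetching it is cheap compared to a PCM access.
        return min(dram, key=lambda l: l["use_count"] / (now - l["last_use"] + 1))
    return min(pcm, key=lambda l: l["last_use"])    # plain LRU among PCM lines

lines = [
    {"tag": 0xA, "backing": "DRAM", "last_use": 90, "use_count": 2},
    {"tag": 0xB, "backing": "DRAM", "last_use": 40, "use_count": 1},
    {"tag": 0xC, "backing": "PCM",  "last_use": 10, "use_count": 9},
]
print(hex(pick_victim(lines, now=100)["tag"]))   # 0xb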