
Showing papers on "Memory management published in 2023"


Journal ArticleDOI
TL;DR: AM4 as discussed by the authors is a combined STT-MTJ-based Content Addressable Memory (CAM), Ternary CAM (TCAM), approximate matching (similarity search) CAM (ACAM), and in-memory Associative Processor (AP) design, inspired by the recently announced Samsung MRAM crossbar.
Abstract: In-memory computing seeks to minimize data movement and alleviate the memory wall by computing in-situ, in the same place that the data is located. One of the key emerging technologies that promises to enable such computing-in-memory is the spin-transfer torque magnetic tunnel junction (STT-MTJ). This paper proposes AM4, a combined STT-MTJ-based Content Addressable Memory (CAM), Ternary CAM (TCAM), approximate matching (similarity search) CAM (ACAM), and in-memory Associative Processor (AP) design, inspired by the recently announced Samsung MRAM crossbar. We demonstrate and evaluate the performance and energy-efficiency of the AM4-based AP using a variety of data-intensive workloads. We show that an AM4-based AP outperforms state-of-the-art solutions both in performance (with an average speedup of about 10×) and energy-efficiency (by about 60× on average).

3 citations
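As a rough software analogue of what the ACAM mode computes, the sketch below (illustrative only: the word width, data, and linear scan are assumptions, and real CAM hardware compares all rows in parallel) returns the stored row with the smallest Hamming distance to a query:

```cpp
#include <bitset>
#include <cstddef>
#include <iostream>
#include <vector>

// Software model of an approximate-matching CAM (ACAM): every stored word is
// compared against the query "in parallel" (here, a loop), and the entry with
// the smallest Hamming distance wins. Width and contents are illustrative.
constexpr std::size_t kWidth = 64;

std::size_t best_match(const std::vector<std::bitset<kWidth>>& rows,
                       const std::bitset<kWidth>& query) {
    std::size_t best = 0, best_dist = kWidth + 1;
    for (std::size_t i = 0; i < rows.size(); ++i) {
        std::size_t dist = (rows[i] ^ query).count();  // Hamming distance
        if (dist < best_dist) { best_dist = dist; best = i; }
    }
    return best;
}

int main() {
    std::vector<std::bitset<kWidth>> rows = {
        std::bitset<kWidth>(0xDEADBEEF), std::bitset<kWidth>(0xCAFEBABE)};
    std::cout << "closest row: "
              << best_match(rows, std::bitset<kWidth>(0xDEADBEEB)) << '\n';
}
```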


Journal ArticleDOI
TL;DR: Vecmem as mentioned in this paper is a library of memory resources which allows efficient and user-friendly allocation of memory on CUDA, HIP, and SYCL devices through standard C++ containers.
Abstract: Programmers using the C++ programming language are increasingly taught to manage memory implicitly through containers provided by the C++ standard library. However, heterogeneous programming platforms often require explicit allocation and deallocation of memory. This discrepancy in memory management strategies can be daunting and problematic for C++ developers who are not already familiar with heterogeneous programming. The C++17 standard introduces the concept of memory resources, which allow the user to control how standard library containers allocate memory; we believe that this addition to the C++17 standard is a powerful tool towards the unification of memory management for heterogeneous systems with best-practice C++ development. In this paper, we present vecmem, a library of memory resources which allows efficient and user-friendly allocation of memory on CUDA, HIP, and SYCL devices through standard C++ containers. We investigate the design and use cases of such a library, the potential performance gains over naive memory allocation, and the limitations of this memory allocation model.

1 citation
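The memory-resource mechanism the abstract refers to can be illustrated with the standard C++17 std::pmr facilities alone; vecmem's own CUDA/HIP/SYCL resources are not shown here, and the point is only that swapping the resource changes where a standard container's memory comes from:

```cpp
#include <array>
#include <cstddef>
#include <memory_resource>
#include <vector>

int main() {
    // A fixed buffer serves as the arena; the container never calls new/delete.
    std::array<std::byte, 4096> buffer;
    std::pmr::monotonic_buffer_resource arena(buffer.data(), buffer.size());

    // std::pmr::vector allocates through the polymorphic memory resource.
    // A library like vecmem can substitute a resource that hands out
    // CUDA/HIP/SYCL device or pinned host memory instead, with no change
    // to the container-using code.
    std::pmr::vector<int> v(&arena);
    for (int i = 0; i < 100; ++i) v.push_back(i);
}
```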


Proceedings ArticleDOI
05 Jun 2023
TL;DR: In this article, the authors enhance the remote memory architecture with Near Memory Processing (NMP), a capability that offloads particular compute tasks from the client to the server side, as illustrated in Figure 1.
Abstract: Traditional von Neumann computing architectures are struggling to keep up with the rapidly growing demand for scale, performance, power-efficiency and memory capacity. One promising approach to this challenge is Remote Memory, in which memory is accessed over an RDMA fabric [1]. We enhance the remote memory architecture with Near Memory Processing (NMP), a capability that offloads particular compute tasks from the client to the server side as illustrated in Figure 1. Similar motivation drove IBM to offload object processing to their remote KV storage [2]. NMP offload adds latency and server resource costs; therefore, it should only be used when the offload value is substantial, specifically, to save: network bandwidth (e.g. Filter/Aggregate), round trip time (e.g. tree Lookup) and/or distributed locks (e.g. Append to a shared journal).
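A minimal sketch of the stated offload criterion, with an entirely hypothetical cost model (none of these names or parameters come from the paper): offload only when the saved bytes, round trips, or locks outweigh the added latency and server cost:

```cpp
// Hypothetical cost model for deciding whether to offload an operation to the
// memory server, following the criteria in the abstract: offload pays off when
// it saves network bytes (filter/aggregate), round trips (pointer chasing), or
// distributed locks. All parameters are illustrative, not from the paper.
struct OffloadCandidate {
    double bytes_without_offload;   // data the client would have to pull
    double bytes_with_offload;      // result size the server sends back
    int    round_trips_saved;       // e.g. levels of a tree lookup
    double offload_latency_penalty; // extra server-side processing time (us)
};

bool should_offload(const OffloadCandidate& c,
                    double us_per_byte, double us_per_round_trip) {
    double saved = (c.bytes_without_offload - c.bytes_with_offload) * us_per_byte
                 + c.round_trips_saved * us_per_round_trip;
    return saved > c.offload_latency_penalty;
}
```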

Posted ContentDOI
13 Feb 2023
TL;DR: In this article, the authors extend the lock-free general purpose memory allocator LRMalloc to support the Optimistic Access (OA) method, which is able to simplify the memory reclamation method implementation and also allow memory to be reused by other parts of the same process.
Abstract: Lock-free data structures are an important tool for the development of concurrent programs as they provide scalability, low latency and avoid deadlocks, livelocks and priority inversion. However, they require some sort of additional support to guarantee memory reclamation. The Optimistic Access (OA) method has most of the desired properties for memory reclamation, but since it allows memory to be accessed after being reclaimed, it is incompatible with the traditional memory management model. This renders it unable to release memory to the memory allocator/operating system, and, as such, it requires a complex memory recycling mechanism. In this paper, we extend the lock-free general purpose memory allocator LRMalloc to support the OA method. By doing so, we are able to simplify the memory reclamation method implementation and also allow memory to be reused by other parts of the same process. We further exploit the virtual memory system provided by the operating system and hardware in order to make it possible to release reclaimed memory to the operating system.

Proceedings ArticleDOI
01 Feb 2023
TL;DR: In this article, a microbenchmarking tool called pmmeter is proposed to measure the performance impact of synchronization instructions on a given sequence of memory accesses on Intel Optane DCPMM.
Abstract: Persistent memory is an emerging memory device, which offers durable, relatively large memory space cost-effectively. This technology potentially enables a new horizon for a broad spectrum of data-intensive applications. Historically, memory space has been deemed to be volatile for application programs. Only very recently has the processor industry introduced a variety of memory synchronization instructions to ensure the durability of persistent memory. Yet there is no established programming practice for how application programs can fully exploit those instructions. This paper proposes pmmeter, a new microbenchmarking tool that we have developed to measure the primary performance of persistent memory. pmmeter allows for clarifying the performance impact that the choice of synchronization instructions incurs on a given sequence of memory accesses. This paper presents our experiment to demonstrate that pmmeter unveils the synchronization cost of Intel Optane DCPMM.
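For context, a typical durability sequence of the kind such a tool would measure looks roughly like the following (a sketch assuming an x86 CPU with the CLWB instruction; the paper may benchmark other sequences such as CLFLUSHOPT or non-temporal stores):

```cpp
#include <immintrin.h>
#include <cstddef>
#include <cstdint>

// Minimal persist sketch for persistent memory (e.g. Intel Optane DCPMM):
// write back every cache line covering [addr, addr+len), then fence so the
// flushes are ordered before any later stores. Compile with CLWB support
// (e.g. -mclwb with GCC/Clang).
void persist(const void* addr, std::size_t len) {
    constexpr std::size_t kLine = 64;
    auto p   = reinterpret_cast<std::uintptr_t>(addr) & ~(kLine - 1);
    auto end = reinterpret_cast<std::uintptr_t>(addr) + len;
    for (; p < end; p += kLine)
        _mm_clwb(reinterpret_cast<void*>(p));  // write back, keep line cached
    _mm_sfence();                              // order flushes vs. later stores
}
```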

Journal ArticleDOI
TL;DR: Similarity-Managed Hybrid Memory System (SM-HMS) as discussed by the authors leverages the memory access similarity among nodes in a cluster, quantified as the distance between per-node memory access vectors, to improve hybrid memory system performance.
Abstract: With increasing problem complexity, more irregular applications are deployed on high-performance clusters due to the parallel working paradigm, and they yield irregular memory access behaviors across nodes. However, the irregularity of memory access behaviors is not comprehensively studied, which results in low utilization of the integrated hybrid memory system composed of stacked DRAM and off-chip DRAM. To address this problem, we devise a novel method called Similarity-Managed Hybrid Memory System (SM-HMS) to improve the hybrid memory system performance by leveraging the memory access similarity among nodes in a cluster. Within SM-HMS, two techniques are proposed, Memory Access Similarity Measuring and Similarity-based Memory Access Behavior Sharing. To quantify the memory access similarity, the memory access behavior of each node is vectorized, and the distance between two vectors is used as the memory access similarity. The calculated memory access similarity is used to share memory access behaviors precisely across nodes. With the shared memory access behaviors, SM-HMS divides the stacked DRAM into two sections, the sliding window section and the outlier section. The shared memory access behaviors guide the replacement of the sliding window section while the outlier section is managed in the LRU manner. Our evaluation results with a set of irregular applications on various clusters consisting of up to 256 nodes have shown that SM-HMS outperforms the state-of-the-art approaches, Cameo, Chameleon, and Hybrid2, on job finish time reduction by up to 58.6%, 56.7%, and 31.3%, with 46.1%, 41.6%, and 19.3% on average, respectively. SM-HMS can also achieve up to 98.6% (91.9% on average) of the ideal hybrid memory system performance.
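The vectorize-and-compare step admits a direct sketch; cosine similarity is used below purely for illustration, since the abstract only specifies "the distance between two vectors":

```cpp
#include <cmath>
#include <numeric>
#include <vector>

// Each node summarizes its memory access behavior as a vector (e.g. access
// counts per region). Similarity is derived from the relation between two
// such vectors; cosine similarity is used here for illustration only.
double similarity(const std::vector<double>& a, const std::vector<double>& b) {
    double dot = std::inner_product(a.begin(), a.end(), b.begin(), 0.0);
    double na = std::sqrt(std::inner_product(a.begin(), a.end(), a.begin(), 0.0));
    double nb = std::sqrt(std::inner_product(b.begin(), b.end(), b.begin(), 0.0));
    return (na == 0.0 || nb == 0.0) ? 0.0 : dot / (na * nb);
}
```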

DissertationDOI
10 Mar 2023
TL;DR: In this article, the authors developed two frameworks for instant processing and block processing of memory sharing in over-committed cloud data centres, which are useful tools for the management of cloud data centres.
Abstract: This thesis studies memory sharing systematically for handling memory overload of physical machines in over-committed cloud data centres. It develops two frameworks for instant processing and block processing of memory sharing. The developed memory sharing frameworks are useful tools for the management of cloud data centres.

Proceedings ArticleDOI
01 Feb 2023
TL;DR: AstriFlash as discussed by the authors is a hardware-software co-design that tightly integrates flash and DRAM with ns-scale overheads, achieving 95% of a DRAM-only system's throughput while maintaining the required 99th-percentile tail latency and reducing the memory cost by 20×.
Abstract: Modern datacenters host datasets in DRAM to offer large-scale online services with tight tail-latency requirements. Unfortunately, as DRAM is expensive and increasingly difficult to scale, datacenter operators are forced to consider denser storage technologies. While modern flash-based storage exhibits μs-scale access latency, which is well within the tail-latency constraints of many online services, the traditional demand paging abstraction used to manage memory and storage incurs high overheads and prohibits flash usage in online services. We introduce AstriFlash, a hardware-software co-design that tightly integrates flash and DRAM with ns-scale overheads. Our evaluation of server workloads with cycle-accurate full-system simulation shows that AstriFlash achieves 95% of a DRAM-only system's throughput while maintaining the required 99th-percentile tail latency and reducing the memory cost by 20×.

Proceedings ArticleDOI
14 Apr 2023
TL;DR: In this article, the authors propose a UMAT scheme for automatically optimizing unified memory-based data transfer management, which guides runtime management of data transfers by calculating the access frequency of data objects in the unified memory space.
Abstract: OpenMP 4.5 refines the offload feature to better support heterogeneous computing. Optimizing data transfer in OpenMP offload programs currently requires explicit specification by the programmer, but such manual programming is neither efficient nor performant. Although DCU-supported unified memory provides a scheme for compilers to implicitly manage data transfers, target offload programs using unified memory perform poorly when program requests exceed the physical memory size. Therefore, this paper proposes a UMAT scheme for automatically optimizing unified memory-based data transfer management, which guides runtime management of data transfers by calculating the access frequency of data objects in the unified memory space. Test results show that the scheme yields significant performance improvements for target offload programs using unified memory.

Proceedings ArticleDOI
27 Jan 2023
TL;DR: In this paper, the authors propose a modified version of run-length encoding that scales aptly with the matrix size and sparsity density of binary sparse matrices and performs better than many state-of-the-art algorithms.
Abstract: Binary sparse matrix storage is one of the critical problems in embedded system applications. Storing these matrices in memory efficiently is important, and the increase in matrix size has a significant impact on the memory requirement. Sometimes, it might not be possible to accommodate the data due to memory constraints. In this work, we have analyzed some of the state-of-the-art methods deployed for storing these matrices in the system-on-chip memory and have demonstrated the shortcomings of each. Thus, we propose a modified version of run-length encoding which scales aptly with the matrix size and sparsity density of the binary sparse matrix. Through simulations we have shown that the proposed method performs better than many state-of-the-art algorithms.
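For reference, baseline run-length encoding of a binary matrix (stored row-major as a flat bit vector) looks like the sketch below; the paper's modified variant is not reproduced here:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Baseline RLE for a binary matrix flattened row-major into a bit vector:
// since values alternate between 0 and 1, it suffices to store the first bit
// followed by the run lengths. The paper proposes a modified RLE that scales
// better with matrix size and sparsity; this sketch shows only the baseline.
std::vector<std::uint32_t> rle_encode(const std::vector<bool>& bits) {
    std::vector<std::uint32_t> runs;
    if (bits.empty()) return runs;
    runs.push_back(bits.front());  // first element records the initial bit
    std::uint32_t len = 1;
    for (std::size_t i = 1; i < bits.size(); ++i) {
        if (bits[i] == bits[i - 1]) {
            ++len;
        } else {
            runs.push_back(len);
            len = 1;
        }
    }
    runs.push_back(len);
    return runs;
}
```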

Proceedings ArticleDOI
01 Feb 2023
TL;DR: Baryon as discussed by the authors leverages both memory compression and data sub-blocking techniques to improve the utilization of fast memory capacity and slow memory bandwidth, with only moderate metadata overheads and management complexity.
Abstract: Hybrid memory systems are able to achieve both high performance and large capacity when combining fast commodity DDR memories with larger but slower non-volatile memories in a heterogeneous way. However, it is critical to best utilize the limited fast memory capacity and slow memory bandwidth in such systems to gain the maximum efficiency. In this paper, we propose a novel hybrid memory design, Baryon, that leverages both memory compression and data sub-blocking techniques to improve the utilization of fast memory capacity and slow memory bandwidth, with only moderate metadata overheads and management complexity. Baryon reserves a small fast memory area to efficiently manage and stabilize the irregular and frequently varying data layouts resulting from compression and sub-blocking, and selectively commits only stable blocks to the rest of the fast memory space. It also adopts a novel dual-format metadata scheme to support flexible address remapping under such complex data layouts with low storage cost. Baryon is completely transparent to software, and works with both cache and flat schemes of hybrid memories. Our evaluation shows Baryon achieves up to 1.68× and on average 1.27× performance improvements over state-of-the-art designs.

Proceedings ArticleDOI
22 Jun 2023
TL;DR: File-Based Memory Management (FBMM) as discussed by the authors brings extensibility to kernel memory management by writing memory management systems as filesystems using VFS.
Abstract: Modern memory hierarchies are increasingly complex, with more memory types and richer topologies. Unfortunately, kernel memory managers lack the extensibility that many other parts of the kernel use to support diversity. This makes it difficult to add and deploy support for new memory configurations, such as tiered memory: engineers must navigate and modify the monolithic memory management code to add support, and custom kernels are needed to deploy such support until it is upstreamed. We take inspiration from filesystems and note that VFS, the extensible interface for filesystems, supports a huge variety of filesystems for different media and different use cases, and importantly, has interfaces for memory management operations such as controlling virtual-to-physical mapping and handling page faults. We propose writing memory management systems as filesystems using VFS, bringing extensibility to kernel memory management. We call this idea File-Based Memory Management (FBMM). Using this approach, many recent memory management extensions, e.g., tiering support, can be written without modifying existing memory management code. We prototype FBMM in Linux to show that the overhead of extensibility is low (within 1.6%) and that it enables useful extensions.
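From user space, the idea might look roughly like this (a hypothetical sketch: the mount point, file name, and semantics are illustrative, not FBMM's actual interface):

```cpp
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

// Hypothetical user-space view of FBMM: if a tiering policy is packaged as a
// filesystem mounted at /mnt/tiered_mem, getting memory governed by that
// policy is just creating and mapping a file there. Path and names are
// illustrative only.
int main() {
    int fd = open("/mnt/tiered_mem/region0", O_RDWR | O_CREAT, 0600);
    if (fd < 0) return 1;
    if (ftruncate(fd, 1 << 20) != 0) return 1;  // 1 MiB region
    void* p = mmap(nullptr, 1 << 20, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) return 1;
    // Page faults on this range are handled by the memory-management
    // "filesystem", which decides the physical placement (e.g. memory tier).
    static_cast<char*>(p)[0] = 42;
    munmap(p, 1 << 20);
    close(fd);
}
```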

Proceedings ArticleDOI
02 Feb 2023
TL;DR: In this article, the authors proposed a contiguous memory management mechanism for a large-scale CPU-accelerator hybrid architecture (CLMalloc) to simultaneously support the different types of memory requirements of CPU and accelerator programs.
Abstract: Heterogeneous accelerators play a crucial role in improving computer performance. General-purpose computers reduce the frequent communication between traditional accelerators with separate memory and the host computer through fast communication links. Some high-speed devices such as supercomputers integrate the accelerator and CPU on one chip, and the shared memory is managed by the operating system, which shifts the performance bottleneck from data acquisition to accelerator addressing. Existing memory management mechanisms typically reserve contiguous physical memory locally for peripherals for efficient direct memory access. However, in large computer systems with multiple memory nodes, the accelerator's memory access behavior is limited by the local memory capacity. The difficulty of addressing accelerators across nodes prevents computers from maximizing the benefits of massive memory. This paper proposes a contiguous memory management mechanism for a large-scale CPU-accelerator hybrid architecture (CLMalloc) to simultaneously support the different types of memory requirements of CPU and accelerator programs. In simulation experiments, CLMalloc achieves similar (or even better) performance to the system functions malloc/free. Compared with the DMA-based baseline, the space utilization of CLMalloc is increased by 2×, and the latency is reduced by 80% to 90%.

Proceedings ArticleDOI
08 May 2023
TL;DR: In this article, the authors proposed vTMM, a tiered memory management system specifically designed for virtualization, which automatically determines page hotness and migrates pages between fast and slow memory.
Abstract: The memory demand of virtual machines (VMs) is increasing, while the traditional DRAM-only memory system has limited capacity and high power consumption. The tiered memory system can effectively expand the memory capacity and increase the cost efficiency. Virtualization introduces new challenges for memory tiering, specifically enforcing performance isolation, minimizing context switching, and providing resource overcommit. However, none of the state-of-the-art designs consider virtualization and thus address these challenges; we observe that a VM with tiered memory incurs up to a 2× slowdown compared to a DRAM-only VM. This paper proposes vTMM, a tiered memory management system specifically designed for virtualization. vTMM automatically determines page hotness and migrates pages between fast and slow memory to achieve better performance. A key insight in vTMM is to leverage the unique system characteristics in virtualization to meet the above challenges. Specifically, vTMM tracks memory accesses with page-modification logging (PML) and a multi-level queue design. Next, vTMM quantifies the page "temperature" and makes a fine-grained page classification with bucket-sorting. vTMM performs page migration with PML while providing resource overcommit by transparently resizing VM memory through the two-dimensional page tables. In combination, the above techniques minimize overhead, ensure performance isolation and provide dynamic memory partitioning to improve the overall system performance. We evaluate vTMM on a real DRAM+NVM system and a simulated CXL-Memory system. The results show that vTMM outperforms NUMA balancing, Intel Optane memory mode and Nimble (an OS-level tiered memory management system) for VM tiered memory management. Multi-VM co-running results show that vTMM improves the performance of a DRAM+NVM system by 50%–140% and a CXL-Memory system by 16%–40%, respectively.
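The "temperature plus bucket-sorting" classification step can be sketched as follows (details such as the temperature range and selection policy are assumptions, not vTMM's exact design):

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

// Sketch of temperature-based page classification: each page carries a small
// "temperature" derived from recent access counts, pages are bucket-sorted by
// temperature, and the hottest pages up to the fast-memory budget are chosen
// as migration candidates. The 0..7 range and policy are illustrative.
struct Page { std::uint64_t pfn; std::uint8_t temperature; };  // 0..7

std::vector<std::uint64_t> pick_hot(const std::vector<Page>& pages,
                                    std::size_t fast_budget) {
    std::array<std::vector<std::uint64_t>, 8> buckets;
    for (const Page& p : pages) buckets[p.temperature].push_back(p.pfn);

    std::vector<std::uint64_t> hot;
    for (int t = 7; t >= 0 && hot.size() < fast_budget; --t)  // hottest first
        for (std::uint64_t pfn : buckets[t]) {
            if (hot.size() == fast_budget) break;
            hot.push_back(pfn);
        }
    return hot;
}
```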

Posted ContentDOI
23 May 2023
TL;DR: In this article, the authors propose a simple method to improve BlenderBot3 by integrating memory management ability into it; the method requires little cost for data construction, does not affect performance on other tasks, and reduces external memory.
Abstract: Open-domain conversation systems integrate multiple conversation skills into a single system through a modular approach. One of the limitations of the system, however, is the absence of management capability for external memory. In this paper, we propose a simple method to improve BlenderBot3 by integrating memory management ability into it. Since no training data exists for this purpose, we propose an automated dataset creation method for memory management. Our method 1) requires little cost for data construction, 2) does not affect performance in other tasks, and 3) reduces external memory. We show that our proposed model BlenderBot3-M^3, which is multi-task trained with memory management, outperforms BlenderBot3 with a relative 4% performance gain in terms of F1 score.

Proceedings ArticleDOI
17 Apr 2023
TL;DR: In this paper, the authors address the read endurance problem in SCM and tackle the challenge of achieving uniform memory read distribution across the memory space at a balanced management cost, showing significant improvements in memory lifetime while maintaining competitive performance.
Abstract: Modern computer systems continue to demand larger memory capacities in a cost-effective way, and storage class memory (SCM) has emerged as a promising solution with byte-addressability, non-volatility, and high bit-density. While past research has primarily focused on mitigating the write endurance problem in SCM, the impact of read damage on memory lifetime has been frequently overlooked, despite reads dominating typical memory workloads. Therefore, this work aims to address the read endurance problem in SCM and tackle the challenge of achieving uniform memory read distribution across the memory space while maintaining a balanced management cost. Our proposed idea is evaluated through a series of experiments, showing significant improvements in memory lifetime while maintaining competitive performance. This work highlights the need for a comprehensive consideration of both read and write endurance in SCM, providing insights into developing effective management strategies for next-generation memory systems.
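Since the abstract does not spell out the paper's mechanism, the sketch below only illustrates the general uniform-distribution idea with a Start-Gap-style algebraic remapping (a classic write wear-leveling technique, simplified here to a pure rotation):

```cpp
#include <cstdint>

// Spreading accesses uniformly over physical lines via an algebraic remap, in
// the spirit of Start-Gap wear leveling (originally for writes). Real
// Start-Gap migrates one line per step through a spare "gap" line; the pure
// rotation below abstracts that away, and the paper's actual read-leveling
// scheme may differ entirely.
struct Rotator {
    std::uint64_t lines;      // number of physical memory lines
    std::uint64_t start = 0;  // rotation offset, advanced periodically

    // Map a logical line to a physical line; as `start` advances, every
    // logical line visits every physical line, evening out read wear.
    std::uint64_t remap(std::uint64_t logical) const {
        return (logical + start) % lines;
    }
    void advance() { start = (start + 1) % lines; }  // e.g. every N accesses
};
```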

Proceedings ArticleDOI
22 Jun 2023
TL;DR: Soft memory as mentioned in this paper is a software-level abstraction on top of standard primary storage that, under memory pressure, makes memory revocable for reallocation elsewhere, and it has low overhead.
Abstract: Memory is the bottleneck resource in today's datacenters because it is inflexible: low-priority processes are routinely killed to free up resources during memory pressure. This wastes CPU cycles upon re-running killed jobs and incentivizes datacenter operators to run at low memory utilization for safety. This paper introduces soft memory, a software-level abstraction on top of standard primary storage that, under memory pressure, makes memory revocable for re-allocation elsewhere. We prototype soft memory with the Redis key-value store, and find that it has low overhead.
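The abstraction might take roughly the following shape (a hypothetical sketch; the names and API are illustrative, not the paper's):

```cpp
#include <cstddef>
#include <cstdlib>
#include <mutex>

// Hypothetical shape of a "soft memory" buffer: the runtime may revoke it
// under memory pressure, so owners must tolerate losing the contents and
// re-deriving them (e.g. a cache entry). Names and API are illustrative.
class SoftBuffer {
    std::mutex m_;
    void* data_ = nullptr;
    std::size_t size_ = 0;
public:
    explicit SoftBuffer(std::size_t n) : data_(std::malloc(n)), size_(n) {}
    ~SoftBuffer() { revoke(); }

    // Called by a memory-pressure handler: give the bytes back for reuse.
    void revoke() {
        std::lock_guard<std::mutex> g(m_);
        std::free(data_);
        data_ = nullptr;
        size_ = 0;
    }

    // Run fn over the buffer if it is still resident; false means "revoked,
    // recompute your data". The lock keeps revocation out while fn runs.
    template <class Fn>
    bool with(Fn&& fn) {
        std::lock_guard<std::mutex> g(m_);
        if (!data_) return false;
        fn(data_, size_);
        return true;
    }
};
```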

Proceedings ArticleDOI
01 Feb 2023
TL;DR: Wang et al. as mentioned in this paper proposed a multi-granularity shadow paging (MGSP) strategy, which smartly utilizes the redo and undo logs as shadow logs to provide a light-weight crash-resilient mechanism for MMIO.
Abstract: The complex software stack has become the performance bottleneck of systems with high-speed Non-Volatile Memory (NVM). Memory-mapped I/O (MMIO) could avoid the long-stack overhead by bypassing the kernel, but its performance is limited by existing crash-resilient mechanisms. We propose a Multi-Granularity Shadow Paging (MGSP) strategy, which smartly utilizes the redo and undo logs as shadow logs to provide a light-weight crash-resilient mechanism for MMIO. In addition, a multi-granularity strategy is designed to provide high-performance updating and locking for reducing runtime overhead, where strong consistency is preserved with a lock-free metadata log. Experimental results show that the proposed MGSP achieves 1.1~4.21× performance improvement for writes and 2.56~3.76× improvement for multi-threaded writes compared with the underlying file system. For SQLite, MGSP can improve the database performance by 29.4% for Mobibench and 36.5% for TPCC, on average.

Journal ArticleDOI
TL;DR: DirectCXL as mentioned in this paper is a directly accessible memory disaggregation design that connects a host processor complex straight to remote memory resources over CXL's memory protocol (CXL.mem).
Abstract: Compute Express Link (CXL) has recently attracted great attention thanks to its excellent hardware heterogeneity management and resource disaggregation capabilities. Even though there is as yet no commercially available product or platform integrating CXL into memory pooling, it is expected to make memory disaggregation far more practical and efficient than ever before. In this article, we propose directly accessible memory disaggregation, DirectCXL, which connects a host processor complex straight to remote memory resources over CXL's memory protocol (CXL.mem). Our empirical evaluation shows that DirectCXL exhibits around 7× better performance than remote direct memory access (RDMA)-based memory pooling for diverse real-world workloads.

Proceedings ArticleDOI
27 Mar 2023
TL;DR: In this paper, a VM-based disaggregated cloud memory platform (DCM) virtualizes the memory device of a remote server connected to a high-speed network as an expansion of local memory.
Abstract: A VM-based disaggregated cloud memory platform (DCM) virtualizes the memory device of a remote server connected to a high-speed network as an expansion of local memory. DCM provides large memory to applications to increase throughput. However, DCM is not well-suited to managing fair memory usage between processes when they run concurrently in a VM. This is because DCM has no mechanism to provide independent memory space to each process. As a result, DCM does not guarantee fairness and performance to processes. Partitioning memory for each process is a way to solve this problem. However, in DCM, the host kernel running DCM cannot obtain the memory page information of a process (including memory page address and PID) running in the guest kernel, so it cannot segregate memory pages by process. Therefore, this paper proposes MFence, an efficient method for the host kernel to obtain the memory page information needed to partition memory for each process in DCM. MFence was evaluated using two Linux servers connected by a 100 Gbps IB network. Extensive evaluation has confirmed that MFence provides memory partitioning that ensures fairness between processes and improves overall performance.

Proceedings ArticleDOI
10 Jul 2023
TL;DR: Multi-Tag as discussed by the authors is a hardware-software co-design utilizing a multi-granular tagging structure that provides strong protection against spatial and temporal memory safety violations by combining object-granular memory tags with page-granular tags stored in the page table entries.
Abstract: Memory safety vulnerabilities are a severe threat to modern computer systems, allowing adversaries to leak or modify security-critical data. To protect systems from this attack vector, full memory safety is required. As software-based countermeasures tend to induce significant runtime overheads, which is not acceptable for production code, hardware assistance is needed. Tagged memory architectures, e.g., already offered by the ARM MTE and SPARC ADI extensions, assign meta-information to memory objects, thus allowing memory safety policies to be implemented. However, due to the high tag collision probability caused by the small tag sizes, the protection guarantees of these schemes are limited. This paper presents Multi-Tag, the first hardware-software co-design utilizing a multi-granular tagging structure that provides strong protection against spatial and temporal memory safety violations. By combining object-granular memory tags with page-granular tags stored in the page table entries, Multi-Tag overcomes the limitation of small tag sizes. Introducing page-granular tags significantly enhances the probabilistic protection capabilities of memory tagging without increasing the memory overhead or the system's complexity. We develop a prototype implementation comprising a gem5 model of the tagged architecture, a Linux kernel extension, and an LLVM-based compiler toolchain. The simulated performance overhead for the SPEC CPU2017 and nbench-byte benchmarks highlights the practicability of our design.
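A didactic software model of the underlying tag check (in the spirit of ARM MTE, not the paper's hardware design) shows why small tags collide and why an extra granularity helps:

```cpp
#include <cstdint>
#include <unordered_map>

// Simplified model of memory tagging: a 4-bit tag lives in unused upper bits
// of the pointer and must equal the tag assigned to the 16-byte granule being
// accessed. With only 16 tag values, distinct objects can collide; Multi-Tag's
// additional page-granular tags reduce that probability. Didactic model only.
constexpr std::uint64_t kTagShift = 56;
constexpr std::uint64_t kGranule  = 16;

std::unordered_map<std::uint64_t, std::uint8_t> granule_tags;  // addr/16 -> tag

bool check_access(std::uint64_t tagged_ptr) {
    std::uint8_t  ptr_tag = (tagged_ptr >> kTagShift) & 0xF;
    std::uint64_t addr    = tagged_ptr & ((1ULL << kTagShift) - 1);
    auto it = granule_tags.find(addr / kGranule);
    return it != granule_tags.end() && it->second == ptr_tag;  // mismatch = trap
}
```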

Proceedings ArticleDOI
17 Jun 2023
TL;DR: In this paper, the authors propose Implicit Memory Tagging (IMT), a novel approach that provides no-overhead hardware-accelerated memory tagging by leveraging the system error correcting code (ECC) to check for the equivalence of a memory tag in addition to its regular duties of detecting and correcting data errors.
Abstract: Memory safety is a major security concern for unsafe programming languages, including C/C++ and CUDA/OpenACC. Hardware-accelerated memory tagging is an effective mechanism for detecting memory safety violations; however, its adoption is challenged by significant meta-data storage and memory traffic overheads. This paper proposes Implicit Memory Tagging (IMT), a novel approach that provides no-overhead hardware-accelerated memory tagging by leveraging the system error correcting code (ECC) to check for the equivalence of a memory tag in addition to its regular duties of detecting and correcting data errors. Implicit Memory Tagging relies on a new class of ECC codes called Alias-Free Tagged ECC (AFT-ECC) that can unambiguously identify tag mismatches in the absence of data errors, while maintaining the efficacy of ECC when data errors are present. When applied to GPUs, IMT addresses the increasing importance of GPU memory safety and the costs of adding meta-data to GPU memory. Ultimately, IMT detects memory safety violations without meta-data storage or memory access overheads. In practice, IMT can provide larger tag sizes than existing industry memory tagging implementations, enhancing security.

Journal ArticleDOI
TL;DR: The von Neumann architecture has been the status quo since the dawn of modern computing as discussed by the authors, and the excessive amounts of data movement between processor and memory/storage in more and more real-world applications (e.g., machine learning and AI applications) have made the processor-centric design a severe power and performance bottleneck.
Abstract: The von Neumann architecture has been the status quo since the dawn of modern computing. Computers built on the von Neumann architecture are composed of an intelligent master processor (e.g., CPU) and dumb memory/storage devices incapable of computation (e.g., memory and disk). However, the skyrocketing data volume in modern computing is calling such status quo into question. The excessive amounts of data movement between processor and memory/storage in more and more real-world applications (e.g., machine learning and AI applications) have made the processor-centric design a severe power and performance bottleneck. The slowing of Moore's Law also raises the need for a memory-centric design, which is emerging on top of recent material advancements and manufacturing innovations to open a paradigm shift. By doing computation right inside or near the memory, the memory-centric design promises massive throughput and energy savings.

Proceedings ArticleDOI
12 May 2023
TL;DR: In this paper, the authors evaluate the performance impact of CUDA unified memory using the heterogeneous pixel reconstruction code from the CMS experiment as a realistic use case of a GPU-targeting HEP reconstruction software.
Abstract: memory management can depend heavily on the application. In this paper we evaluate the performance impact of CUDA unified memory using the heterogeneous pixel reconstruction code from the CMS experiment as a realistic use case of a GPU-targeting HEP reconstruction software. We also compare the programming model using CUDA unified memory to the explicit management of separate CPU and GPU memory spaces.

Journal ArticleDOI
TL;DR: In this paper, a dynamic capacity service (DCS) implementation for CXL pooled memory is presented, which can substantially improve system memory utilization by dynamically allocating and releasing memory resources on demand.
Abstract: Compute Express Link (CXL) pooled memory is gaining attention from the industry as a viable memory disaggregation solution offering memory expansion and alleviating memory overprovisioning. One essential feature for the efficient use of the pooled memory is to dynamically allocate or release memory from the pool based on hosts' demands. We refer to this feature as dynamic capacity service (DCS). This article introduces one of the industry's first DCS implementations for CXL pooled memory. We demonstrate fully functional DCS by implementing a field-programmable gate array-based CXL pooled memory prototype and full software stacks. Our experiment shows that DCS can substantially improve system memory utilization by dynamically allocating and releasing memory resources on demand. We also present the lessons learned from the DCS implementation.

Journal ArticleDOI
TL;DR: ChainSketch as mentioned in this paper uses the selective replacement strategy to mitigate the over-estimation issue and utilizes the hash chain and compact structure to improve memory efficiency for detecting heavy flows.
Abstract: Identifying heavy flows is essential for network management. However, it is challenging to detect heavy flows quickly and accurately under highly dynamic traffic and the rapid growth of network capacity. Existing heavy flow detection schemes must trade off efficiency, accuracy and speed, and they still require a large amount of memory to obtain acceptable performance. To address this issue, we propose ChainSketch, which has the advantages of good memory efficiency, high accuracy and fast detection. Specifically, ChainSketch uses a selective replacement strategy to mitigate the over-estimation issue. Meanwhile, ChainSketch utilizes hash chaining and a compact structure to improve memory efficiency. We implement ChainSketch on an OVS platform, a P4-based testbed and large-scale simulations to process heavy hitter and heavy changer detection. The results of trace-driven tests show that ChainSketch greatly improves the F1-score by up to 3.43× compared with the state-of-the-art solutions, especially for small memory.
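The "selective replacement" idea can be sketched generically (this is not ChainSketch's actual algorithm; the probabilistic takeover rule below is one common realization of the idea):

```cpp
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Generic selective-replacement bucket: a new flow does not simply overwrite
// the resident flow of its hash bucket; it takes over only with probability
// 1/(resident_count+1). Established heavy flows are thus rarely displaced and
// small flows are not over-estimated. Illustrative only, not ChainSketch.
struct Bucket { std::uint64_t key = 0; std::uint32_t count = 0; };

class HeavyHitter {
    std::vector<Bucket> table_;
    std::mt19937 rng_{42};
public:
    explicit HeavyHitter(std::size_t slots) : table_(slots) {}

    void insert(std::uint64_t flow) {
        Bucket& b = table_[flow % table_.size()];
        if (b.count == 0 || b.key == flow) { b.key = flow; ++b.count; return; }
        std::uniform_int_distribution<std::uint32_t> d(0, b.count);
        if (d(rng_) == 0) { b.key = flow; b.count = 1; }  // selective replacement
    }
};
```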

Proceedings ArticleDOI
05 Feb 2023
TL;DR: In this article, the authors analyzed the memory access patterns that occur in main memory during the training process of various CNN models and found that BP accounted for 83.4% of the total main memory accesses on average.
Abstract: Convolutional neural network (CNN) models require deeper networks and more training data for better performance, which in turn results in greater computational and memory requirements. In this paper, we analyze the memory access patterns that occur in main memory during the training processes of various CNN models. CNN training is a linear procedure consisting of a forward pass (FP) and a backward pass (BP). As a result of the analysis, we found that BP accounted for 83.4% of the total main memory accesses on average. Therefore, CNN training including FP and BP is much more memory-intensive than CNN inference using only FP. This demonstrates that CNN training is a suitable application for near-data processing to reduce memory bottlenecks and conserve computational resources.

Proceedings ArticleDOI
22 Jun 2023
TL;DR: In this paper, the authors make a case for offloading memory allocation from main processing cores to other processing units to boost performance, reduce energy consumption, and customize services to specific applications or application domains.
Abstract: Memory allocation and management have a significant impact on performance and energy of modern applications. We observe that performance can vary by as much as 72% in some applications based on which memory allocator is used. Many current allocators are multi-threaded to support concurrent allocation requests from different threads. However, such multi-threading comes at the cost of maintaining complex metadata that is tightly coupled and intertwined with user data. When memory management functions and other user programs run on the same core, the metadata used by management functions may pollute the processor caches and other resources. In this paper, we make a case for offloading memory allocation (and other similar management functions) from main processing cores to other processing units to boost performance, reduce energy consumption, and customize services to specific applications or application domains. To offload these multi-threaded fine-granularity functions, we propose to decouple the metadata of these functions from the rest of application data to reduce the overhead of inter-thread metadata synchronization. We draw attention to the following key questions to realize this opportunity: (a) What are the tradeoffs and challenges in offloading memory allocation to a dedicated core? (b) Should we use general-purpose cores or special-purpose cores for executing critical system management functions? (c) Can this methodology apply to heterogeneous systems (e.g., with GPUs, accelerators) and other service functions as well?

Proceedings ArticleDOI
06 Jun 2023
TL;DR: In this article, the authors apply the concept of memory shading to memory tagging, and present HWASanIO, a HWASan-based sanitizer implementing the memory shading concept to detect intra-object violations.
Abstract: C/C++ are often used in high-performance areas with critical security demands, such as operating systems, browsers, and libraries. One major drawback from a security standpoint is their susceptibility to memory bugs, which are often hard to spot during development. A possible solution is the deployment of a memory safety framework such as the memory tagging framework Hardware-assisted AddressSanitizer (HWASan). The dynamic analysis tool instruments object allocations and inserts additional check logic to detect memory violations during runtime. A current limitation of memory tagging is its inability to detect intra-object memory violations, i.e., over- and underflows between fields and members of structs and classes. This work addresses the issue by applying the concept of memory shading to memory tagging. We then present HWASanIO, a HWASan-based sanitizer implementing the memory shading concept to detect intra-object violations. Our evaluation shows that this increases the bug detection rate from 85.4% to 100% in the memory corruption test cases of the Juliet Test Suite while maintaining high interoperability with existing C/C++ code.
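The intra-object violation class it targets fits in a few lines: the overflow below never leaves the enclosing allocation, so a single allocation-wide tag cannot catch it:

```cpp
#include <cstring>

// The bug class HWASanIO targets: an overflow that stays inside the enclosing
// allocation. Allocation-granular tagging (plain HWASan/MTE) assigns one tag
// to the whole struct, so writing past `name` into `is_admin` matches the tag
// and goes undetected; field-granular "memory shading" can catch it.
struct Account {
    char name[16];
    bool is_admin;  // offset 16, directly after the array
};

int main() {
    Account a{};
    // 17 bytes into a 16-byte field: stays inside `a` (sizeof(Account) == 17),
    // silently flipping is_admin to true.
    std::memcpy(a.name, "AAAAAAAAAAAAAAAA\x01", 17);
    return a.is_admin;  // returns 1 -- corrupted by the intra-object overflow
}
```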