
Showing papers on "Memory management" published in 2022


Journal ArticleDOI
TL;DR: A distributed framework that reimplements a state-of-the-art algorithm, CoFex, using MapReduce is presented; it improves computational efficiency by more than two orders of magnitude while retaining the same high accuracy.
Abstract: Protein-protein interactions are of great significance for humans to understand the functional mechanisms of proteins. With the rapid development of high-throughput genomic technologies, massive protein-protein interaction (PPI) data have been generated, making it very difficult to analyze them efficiently. To address this problem, this paper presents a distributed framework by reimplementing one of the state-of-the-art algorithms, i.e., CoFex, using MapReduce. To do so, an in-depth analysis of its limitations is conducted from the perspectives of efficiency and memory consumption when applying it to large-scale PPI data analysis and prediction. Respective solutions are then devised to overcome these limitations. In particular, we adopt a novel tree-based data structure to reduce the heavy memory consumption caused by the huge sequence information of proteins. After that, its procedure is modified by following the MapReduce framework to perform the prediction task in a distributed manner. A series of extensive experiments has been conducted to evaluate the performance of our framework in terms of both efficiency and accuracy. Experimental results demonstrate that the proposed framework can considerably improve computational efficiency by more than two orders of magnitude while retaining the same high accuracy.
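As a rough illustration of the distributed reformulation described above (not the paper's actual CoFex implementation), the sketch below expresses pairwise PPI prediction as map and reduce functions; the sequences, pairs, and scoring function are placeholders.

```python
from collections import defaultdict

# Placeholder score: a CoFex-like method would extract coevolutionary features
# from the two protein sequences instead of this toy length comparison.
def extract_features_and_score(seq_a, seq_b):
    return abs(len(seq_a) - len(seq_b)) / max(len(seq_a), len(seq_b), 1)

def map_phase(protein_pairs, sequences):
    """Map: emit (protein_pair, score) for each candidate pair."""
    for a, b in protein_pairs:
        yield (a, b), extract_features_and_score(sequences[a], sequences[b])

def reduce_phase(mapped):
    """Reduce: group emitted scores by pair (the shuffle step) and aggregate."""
    grouped = defaultdict(list)
    for pair, score in mapped:
        grouped[pair].append(score)
    return {pair: max(scores) for pair, scores in grouped.items()}

sequences = {"P1": "MKVL", "P2": "MALWT", "P3": "MSTN"}
pairs = [("P1", "P2"), ("P1", "P3")]
print(reduce_phase(map_phase(pairs, sequences)))
```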

48 citations


Proceedings ArticleDOI
06 Jun 2022
TL;DR: This paper proposes a novel OS-level application-transparent page placement mechanism (TPP) for efficient memory management that outperforms NUMA balancing and AutoTiering, state-of-the-art solutions for tiered memory, by 10–17%.
Abstract: The increasing demand for memory in hyperscale applications has led to memory becoming a large portion of the overall datacenter spend. The emergence of coherent interfaces like CXL enables main memory expansion and offers an efficient solution to this problem. In such systems, the main memory can comprise different memory technologies with varied characteristics. In this paper, we characterize the memory usage patterns of a wide range of datacenter applications across the server fleet of Meta. We thereby demonstrate the opportunities to offload colder pages to slower memory tiers for these applications. Without efficient memory management, however, such systems can significantly degrade performance. We propose a novel OS-level application-transparent page placement mechanism (TPP) for CXL-enabled memory. TPP employs a lightweight mechanism to identify and place hot/cold pages to appropriate memory tiers. It enables proactive page demotion from local memory to CXL-Memory. This technique ensures memory headroom for new page allocations, which are often related to request processing and tend to be short-lived and hot. At the same time, TPP can promptly promote performance-critical hot pages trapped in the slow CXL-Memory to the fast local memory, while minimizing both sampling overhead and unnecessary migrations. TPP works transparently without any application-specific knowledge and can be deployed globally as a kernel release. We evaluate TPP with diverse memory-sensitive workloads in the production server fleet with early samples of new x86 CPUs with CXL 1.1 support. TPP makes a tiered memory system nearly as performant as an ideal baseline (<1% gap) that has all the memory in the local tier. It is 18% better than today's Linux, and 5-17% better than existing solutions including NUMA Balancing and AutoTiering. Most of the TPP patches have been merged in the Linux v5.18 release, while the remaining ones are pending further discussion.
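The core policy described above (demote cold pages to the CXL tier proactively, promote hot pages trapped there back to local memory) can be sketched abstractly; the capacity, thresholds, and page-temperature model below are illustrative assumptions, not the kernel mechanism itself.

```python
LOCAL_CAPACITY = 4          # pages that fit in the fast local tier (toy number)
PROMOTE_THRESHOLD = 2       # accesses observed on a CXL page before promotion

local_tier, cxl_tier = {}, {}   # page_id -> access count since last migration

def allocate(page_id):
    """New allocations land in the fast local tier; demote a cold page if full."""
    if len(local_tier) >= LOCAL_CAPACITY:
        demote_coldest()
    local_tier[page_id] = 0

def access(page_id):
    if page_id in local_tier:
        local_tier[page_id] += 1
    elif page_id in cxl_tier:
        cxl_tier[page_id] += 1
        if cxl_tier[page_id] >= PROMOTE_THRESHOLD:   # hot page trapped in CXL
            promote(page_id)

def promote(page_id):
    if len(local_tier) >= LOCAL_CAPACITY:
        demote_coldest()
    local_tier[page_id] = cxl_tier.pop(page_id)

def demote_coldest():
    """Proactive demotion keeps headroom for short-lived, hot allocations."""
    coldest = min(local_tier, key=local_tier.get)
    cxl_tier[coldest] = 0
    del local_tier[coldest]

for p in range(5):
    allocate(p)                 # the fifth allocation forces a demotion
for _ in range(3):
    access(0)                   # page 0, now in CXL, becomes hot and is promoted
print("local:", sorted(local_tier), "cxl:", sorted(cxl_tier))
```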

35 citations


Proceedings ArticleDOI
22 Feb 2022
TL;DR: Memory disaggregation has attracted great attention recently because of its benefits in efficient memory utilization and ease of management, as mentioned in this paper. So far, memory disaggregation research has taken one of two approaches: building/emulating memory nodes using regular servers or building them using raw memory devices with no processing power.
Abstract: Memory disaggregation has attracted great attention recently because of its benefits in efficient memory utilization and ease of management. So far, all memory disaggregation research has taken one of two approaches: building/emulating memory nodes using regular servers or building them using raw memory devices with no processing power. The former incurs higher monetary cost and faces tail latency and scalability limitations, while the latter introduces performance, security, and management problems.

22 citations


Journal ArticleDOI
TL;DR: vPipe, as mentioned in this paper, provides dynamic layer partitioning and memory management for pipeline parallelism via an online search for a near-optimal partitioning/memory management plan and a live layer migration protocol for rebalancing the layer distribution across a training pipeline.
Abstract: DNNs of increasing computational complexity have achieved unprecedented successes in various areas such as machine vision and natural language processing (NLP); e.g., the recent advanced Transformer has billions of parameters. However, as large-scale DNNs significantly exceed a GPU's physical memory limit, they cannot be trained by conventional methods such as data parallelism. Pipeline parallelism, which partitions a large DNN into small subnets and trains them on different GPUs, is a plausible solution. Unfortunately, the layer partitioning and memory management in existing pipeline parallel systems are fixed during training, making them easily impeded by out-of-memory errors and GPU under-utilization. These drawbacks are amplified when performing neural architecture search (NAS) such as the evolved Transformer, where different network architectures of the Transformer need to be trained repeatedly. vPipe is the first system that transparently provides dynamic layer partitioning and memory management for pipeline parallelism. vPipe has two unique contributions: (1) an online algorithm for searching a near-optimal layer partitioning and memory management plan, and (2) a live layer migration protocol for re-balancing the layer distribution across a training pipeline. vPipe improved the training throughput of two notable baselines (Pipedream and GPipe) by 61.4-463.4 percent and 24.8-291.3 percent, respectively, on various large DNNs and training settings.
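The partition-search side of this idea can be illustrated with a simple dynamic program that splits layers into pipeline stages while minimizing the heaviest stage; vPipe's actual online algorithm also weighs memory management (swap/recompute) choices, which are omitted in this sketch with made-up layer costs.

```python
import itertools

def partition_layers(layer_costs, num_stages):
    """Split layers contiguously into num_stages pipeline stages so that the
    most heavily loaded stage is as light as possible (bottleneck objective)."""
    n = len(layer_costs)
    prefix = [0] + list(itertools.accumulate(layer_costs))
    INF = float("inf")
    # dp[s][i] = best achievable bottleneck using s stages for the first i layers
    dp = [[INF] * (n + 1) for _ in range(num_stages + 1)]
    cut = [[0] * (n + 1) for _ in range(num_stages + 1)]
    dp[0][0] = 0
    for s in range(1, num_stages + 1):
        for i in range(1, n + 1):
            for j in range(s - 1, i):
                bottleneck = max(dp[s - 1][j], prefix[i] - prefix[j])
                if bottleneck < dp[s][i]:
                    dp[s][i], cut[s][i] = bottleneck, j
    # Recover the stage boundaries.
    stages, i = [], n
    for s in range(num_stages, 0, -1):
        j = cut[s][i]
        stages.append(list(range(j, i)))
        i = j
    return dp[num_stages][n], stages[::-1]

costs = [4, 1, 1, 6, 2, 2, 3]            # per-layer compute cost (illustrative)
print(partition_layers(costs, 3))         # bottleneck 7, e.g. [[0,1,2],[3],[4,5,6]]
```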

18 citations


Proceedings ArticleDOI
12 Jun 2022
TL;DR: This work introduces flexible CXL memory expansion using a CXL type 3 prototype and evaluates its performance in an IMDBMS, showing that CXL memory devices interfaced with PCIe Gen5 are appropriate for memory expansion with nearly no throughput degradation in OLTP workloads and less than 8% throughput degradation in OLAP workloads.
Abstract: Limited memory volume is always a performance bottleneck in an in-memory database management system (IMDBMS) as the data size keeps increasing. To overcome the physical memory limitation, heterogeneous and disaggregated computing platforms are proposed, such as Gen-Z, CCIX, OpenCAPI, and CXL. In this work, we introduce flexible CXL memory expansion using a CXL type 3 prototype and evaluate its performance in an IMDBMS. Our evaluation shows that CXL memory devices interfaced with PCIe Gen5 are appropriate for memory expansion with nearly no throughput degradation in OLTP workloads and less than 8% throughput degradation in OLAP workloads. Thus, CXL memory is a good candidate for memory expansion with lower TCO in IMDBMSs.

10 citations


Journal ArticleDOI
TL;DR: A Silent-PIM is proposed that performs the PIM computation with standard DRAM memory requests, requiring no hardware modifications and allowing the PIM memory device to perform the computation while servicing non-PIM applications' memory requests.
Abstract: Deep Neural Network (DNN) and Recurrent Neural Network (RNN) applications, which are rapidly becoming attractive to the market, process a large amount of low-locality data; thus, memory bandwidth limits their peak performance. Therefore, many data centers actively adopt high-bandwidth memory like HBM2/HBM2E to resolve the problem. However, this approach would not provide a complete solution since it still transfers the data from the memory to the computing unit. Thus, processing-in-memory (PIM), which performs the computation inside memory, has attracted attention. However, most previous methods require the modification or extension of core pipelines and memory system components like memory controllers, making the practical implementation of PIM very challenging and expensive in development. In this article, we propose Silent-PIM, which performs the PIM computation with standard DRAM memory requests, thus requiring no hardware modifications and allowing the PIM memory device to perform the computation while servicing non-PIM applications' memory requests. We can achieve our design goal by preserving the standard memory request behaviors and satisfying the DRAM standard timing requirements. In addition, using standard memory requests makes it possible to use DMA as the PIM's offloading engine, processing PIM memory requests quickly while letting a core perform other tasks. We compared the performance of three Long Short-Term Memory (LSTM) kernels on real platforms: the Silent-PIM modeled on an FPGA, a GPU, and a CPU. For (p × 512) × (512 × 2048) matrix multiplication with a batch size p varying from 1 to 128, the Silent-PIM performed up to 16.9x and 24.6x faster than the GPU and CPU, respectively, at p = 1, the case with no data reuse. At p = 128, the highest data reuse case, the GPU performance was the highest, but the PIM performance was still higher than the CPU execution. Similarly, for (p × 2048) element-wise multiplication and addition, where there was no data reuse, the Silent-PIM always achieved higher performance than both the CPU and GPU. The PIM's EDP was also superior to the others in all cases with no data reuse.
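A quick arithmetic-intensity estimate (ours, not from the paper, with an assumed FP16 operand width) shows why the p = 1 case above has no data reuse and is therefore memory-bandwidth-bound, the regime where in-memory computation pays off:

```python
# (p x 512) x (512 x 2048) matrix multiply, 2-byte elements (assumed precision).
p, k, n = 1, 512, 2048
bytes_per_elem = 2

macs = p * k * n                               # multiply-accumulate operations
weight_bytes = k * n * bytes_per_elem          # the weight matrix dominates traffic
act_bytes = (p * k + p * n) * bytes_per_elem   # input and output activations

ops = 2 * macs                                 # 1 MAC = 1 multiply + 1 add
traffic = weight_bytes + act_bytes
print(f"arithmetic intensity ~ {ops / traffic:.2f} FLOP/byte at p={p}")
# ~1 FLOP/byte: every weight is read exactly once, so DRAM bandwidth, not
# compute, caps throughput -- the no-data-reuse case described above.
p = 128
print(f"at p={p}: ~ {2 * p * k * n / (weight_bytes + (p*k + p*n) * bytes_per_elem):.1f} FLOP/byte")
```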

8 citations


Journal ArticleDOI
TL;DR: In this paper, the authors discuss trends in NVM Express storage, Compute Express Link, and heterogeneous memory as well as object storage with experts, and give a view of the future of compute, memory, and storage.
Abstract: Storage and memory technologies are changing to support big data applications. This article discusses trends in NVM Express storage, Compute Express Link, and heterogeneous memory as well as object storage with experts, and it gives a view of the future of compute, memory, and storage.

6 citations


Proceedings ArticleDOI
28 Jun 2022
TL;DR: MegTaiChi is a dynamic tensor-based memory management optimization module for DNN training that first achieves an efficient coordination of tensor partition and tensor rematerialization, enabling heuristic, adaptive, and fine-grained memory management.
Abstract: In real applications, it is common to train deep neural networks (DNNs) on modest clusters. With the continuous increase of model size and batch size, the training of DNNs becomes challenging under a restricted memory budget. Tensor partition and tensor rematerialization are two major memory optimization techniques to enable larger model sizes and batch sizes within the limited-memory constraint. However, the related algorithms fail to fully extract the memory reduction opportunity, because they ignore the invariable characteristics of dynamic computational graphs and the variation among same-size tensors at different memory locations. In this work, we propose MegTaiChi, a dynamic tensor-based memory management optimization module for DNN training, which first achieves an efficient coordination of tensor partition and tensor rematerialization. The key feature of MegTaiChi is that it makes memory management decisions based on dynamic tensor access patterns tracked at runtime. This design is motivated by the observation that the access pattern to tensors is regular during training iterations. Based on the identified patterns, MegTaiChi exploits the total memory optimization space and achieves heuristic, adaptive and fine-grained memory management. The experimental results show that MegTaiChi can reduce the memory footprint by up to 11% for ResNet-50 and 10.5% for GL-base compared with DTR. For the training of 6 representative DNNs, MegTaiChi outperforms MegEngine and Sublinear in maximum batch size by 5X and 2.4X. Compared with FlexFlow, Gshard and ZeRo-3, MegTaiChi achieves 1.2X, 1.8X and 1.5X performance speedups respectively on average. For the million-scale face recognition application, MegTaiChi achieves a 1.8X speedup compared with the optimal empirical parallelism strategy on 256 GPUs.
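The tensor-rematerialization side of this idea can be sketched as an eviction policy that frees the cheapest-to-recompute tensor when the memory budget is exceeded and recomputes it on the next access; MegTaiChi's actual policy additionally uses the tracked dynamic access patterns and coordinates with tensor partitioning, neither of which appears in this toy sketch.

```python
class RematPool:
    """Toy tensor cache: evict under memory pressure, recompute on demand."""

    def __init__(self, budget_bytes):
        self.budget = budget_bytes
        self.resident = {}        # name -> bytes currently materialized
        self.recipes = {}         # name -> (recompute_fn, size, recompute_cost)

    def register(self, name, recompute_fn, size, recompute_cost):
        self.recipes[name] = (recompute_fn, size, recompute_cost)

    def get(self, name):
        if name not in self.resident:           # rematerialize a freed tensor
            fn, size, _ = self.recipes[name]
            self._make_room(size)
            fn()                                # run the recorded computation
            self.resident[name] = size
        return name

    def _make_room(self, needed):
        while sum(self.resident.values()) + needed > self.budget:
            # Evict the tensor that frees the most bytes per unit of
            # recompute cost (a DTR-like heuristic).
            victim = max(self.resident,
                         key=lambda n: self.recipes[n][1] / self.recipes[n][2])
            del self.resident[victim]

pool = RematPool(budget_bytes=100)
pool.register("act1", lambda: None, size=60, recompute_cost=1.0)
pool.register("act2", lambda: None, size=60, recompute_cost=5.0)
pool.get("act1"); pool.get("act2")    # act1 is evicted to fit act2
pool.get("act1")                       # transparently recomputed
print(sorted(pool.resident))
```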

5 citations


Proceedings ArticleDOI
07 Nov 2022
TL;DR: CETIS as mentioned in this paper is a generic and efficient intra-process memory isolation mechanism, which provides memory file abstraction for the isolated memory regions and a set of APIs to access said regions, and also comes with a compiler-assisted tool chain for users to build secure applications easily.
Abstract: Intel control-flow enforcement technology (CET) is a new hardware feature available in recent Intel processors. It supports coarse-grained control-flow integrity for software to defeat memory corruption attacks. In this paper, we retrofit CET, particularly the write-protected shadow pages of CET used for implementing shadow stacks, to develop a generic and efficient intra-process memory isolation mechanism, dubbed CETIS. To provide user-friendly interfaces, a CETIS framework was developed, which provides a memory-file abstraction for the isolated memory regions and a set of APIs to access said regions. CETIS also comes with a compiler-assisted tool chain for users to build secure applications easily. The practicality of using CETIS to protect CPI, CFIXX, and JIT-compilers was demonstrated, and the evaluation reveals that CETIS performs better than state-of-the-art intra-process memory isolation mechanisms, such as MPK.

4 citations


Proceedings ArticleDOI
01 May 2022
TL;DR: TSPLIT is a fine-grained DNN memory management system that breaks apart memory bottlenecks while maintaining the efficiency of DNNs training by proposing a model-guided approach to holistically exploit the tensor-split and its joint optimization with out-of-core execution methods (via offload and recompute).
Abstract: As Deep Neural Networks (DNNs) grow deeper and larger, training them on existing accelerators (e.g., GPUs) is challenging due to limited device memory capacity. Existing memory management systems reduce the memory footprint via tensor offloading and recomputing. However, this coarse-grained, one-tensor-at-a-time memory management often incurs high peak GPU memory usage and cannot fully utilize available hardware resources (e.g., PCIe). In this paper, we propose TSPLIT, a fine-grained DNN memory management system that breaks apart memory bottlenecks while maintaining the efficiency of DNN training. TSPLIT achieves this by proposing a model-guided approach to holistically exploit the tensor-split and its joint optimization with out-of-core execution methods (via offload and recompute). We further provide an efficient implementation of TSPLIT with the proposed splittable tensor abstraction, a profiling-based planner, and an optimized DNN runtime. Evaluations on 6 DNN models show that, compared to vDNN and SuperNeurons, TSPLIT can scale the maximum model size by up to 10.5× and 3.1× and improve throughput by up to 4.7× and 2.7× under the same memory over-subscription, respectively.
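A minimal sketch of the tensor-split idea (ours, with made-up sizes): a large tensor is broken into micro-tensors so that offloading can be planned at sub-tensor granularity instead of one whole tensor at a time; a real planner would also weigh PCIe bandwidth, recompute cost, and overlap with compute.

```python
from dataclasses import dataclass

@dataclass
class MicroTensor:
    name: str
    bytes: int
    on_gpu: bool = True

def split_tensor(name, total_bytes, num_splits):
    """Break one logical tensor into equally sized micro-tensors."""
    chunk = total_bytes // num_splits
    return [MicroTensor(f"{name}[{i}]", chunk) for i in range(num_splits)]

def fit_under_budget(micro_tensors, gpu_budget):
    """Offload just enough micro-tensors (largest first) to respect the budget."""
    used = sum(t.bytes for t in micro_tensors if t.on_gpu)
    for t in sorted(micro_tensors, key=lambda t: t.bytes, reverse=True):
        if used <= gpu_budget:
            break
        t.on_gpu = False            # schedule an async copy to host memory
        used -= t.bytes
    return used

parts = split_tensor("activations_layer7", total_bytes=8 << 20, num_splits=8)
peak = fit_under_budget(parts, gpu_budget=5 << 20)
print(peak, [t.name for t in parts if not t.on_gpu])
```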

4 citations


Journal ArticleDOI
TL;DR: A framework that optimizes the memory usage by performing memory defragmentation operations in HLS many-accelerator architectures that share on-chip memories is proposed and results highlight the effectiveness of the proposed solution to eliminate memory allocation failures due to memory fragmentation.
Abstract: Many-accelerator platforms have been introduced to maximize the FPGA's throughput. However, as the high saturation rate of the FPGA's on-chip memories limits the number of synthesized accelerators, frameworks for Dynamic Memory Management (DMM) that allow the synthesized designs to allocate/de-allocate on-chip memory resources during run-time have been suggested. Although those frameworks manage to increase the accelerators' density by minimizing the utilized memory resources, the parallel execution of many accelerators may cause severe memory fragmentation and thus memory allocation failures. In this work, a framework that optimizes memory usage by performing memory defragmentation operations in HLS many-accelerator architectures that share on-chip memories is proposed. Experimental results highlight the effectiveness of the proposed solution in eliminating memory allocation failures due to memory fragmentation, reducing memory allocation failures by up to 32% on average and decreasing the memory size requirements by up to 5% with controllable latency and resource utilization overhead.
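As a generic illustration of defragmentation in a shared memory pool (not the paper's HLS hardware mechanism), the sketch below compacts live allocations toward the start of an address space so that a previously failing contiguous request can succeed; names and sizes are invented.

```python
def compact(allocations, pool_size):
    """Slide live blocks toward address 0 and return the new layout plus the
    size of the largest contiguous free region afterwards."""
    cursor, moved = 0, {}
    for name, (base, size) in sorted(allocations.items(), key=lambda kv: kv[1][0]):
        moved[name] = (cursor, size)       # a real system must also fix up pointers
        cursor += size
    return moved, pool_size - cursor

# Fragmented pool: 40 bytes free in total, but no single 30-byte hole exists.
pool_size = 100
allocs = {"A": (0, 20), "B": (30, 20), "C": (70, 20)}
layout, largest_free = compact(allocs, pool_size)
print(layout, "largest contiguous free region:", largest_free)
```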


Journal ArticleDOI
TL;DR: GShare, as described in this paper, is a centralized GPU memory management framework that enables GPU memory sharing among containers and enforces each container's GPU memory limit by mediating memory allocation calls.
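The allocation-mediation idea can be sketched as a wrapper that checks a per-container quota before forwarding to the real allocator; the container IDs, quotas, and the stand-in allocator below are illustrative assumptions, not GShare's API.

```python
class QuotaedAllocator:
    """Mediates allocation calls and enforces a per-container byte limit."""

    def __init__(self, limits):
        self.limits = dict(limits)        # container_id -> byte quota
        self.used = {cid: 0 for cid in limits}

    def alloc(self, container_id, nbytes):
        if self.used[container_id] + nbytes > self.limits[container_id]:
            raise MemoryError(f"{container_id}: GPU memory limit exceeded")
        self.used[container_id] += nbytes
        return object()                   # stand-in for a real device pointer

    def free(self, container_id, nbytes):
        self.used[container_id] -= nbytes

gpu = QuotaedAllocator({"ctr-a": 2 << 30, "ctr-b": 1 << 30})
buf = gpu.alloc("ctr-a", 1 << 30)         # fine: within ctr-a's 2 GiB quota
try:
    gpu.alloc("ctr-b", 2 << 30)           # rejected: exceeds ctr-b's 1 GiB quota
except MemoryError as e:
    print(e)
```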

Journal ArticleDOI
TL;DR: MespaConfig as mentioned in this paper is a job-level configuration optimizer for distributed in-memory computing jobs, which improves the performance of six typical programs by up to 12× compared with default configurations.
Abstract: Distributed in-memory computing frameworks usually have many parameters (e.g., the shuffle buffer size) that form a configuration for each execution. A well-tuned configuration can bring large performance improvements. However, to improve resource utilization, jobs often share the same cluster, which causes dynamic cluster load conditions. According to our observation, the variation of cluster load reduces the effectiveness of configuration tuning. Besides, as a common problem of cluster computing jobs, overestimation of resources also occurs during configuration tuning. It is challenging to efficiently find the optimal configuration in a shared cluster while also sparing memory. In this article, we introduce MespaConfig, a job-level configuration optimizer for distributed in-memory computing jobs. Advancements of MespaConfig over previous work include memory-sparing and load-sensitive features. We evaluate MespaConfig on 6 typical Spark programs under different load conditions. The evaluation results show that MespaConfig improves the performance of the six programs by up to 12× compared with default configurations. MespaConfig also achieves at most a 41 percent reduction in configuration memory usage and reduces the optimization time overhead by 10.8× compared with the state-of-the-art approach.

Journal ArticleDOI
TL;DR: In this article, the authors propose a circuit that performs address calculation in the memory instead of the CPU, which can significantly reduce CPU-CIM address transactions and thereby yield considerable savings in energy, latency, and bus traffic.
Abstract: Computation-in-Memory (CIM) is an emerging computing paradigm to address memory bottleneck challenges in computer architecture. A CIM unit cannot fully replace a general-purpose processor. Still, it significantly reduces the amount of data transfer between a traditional memory unit and the processor by enriching the transferred information. Data transactions between processor and memory consist of memory access addresses and values. While the main focus in the field of in-memory computing is to apply computations to the content of the memory (values), the importance of CPU-CIM address transactions and of the calculations that generate the sequence of access addresses for data-dominated applications is generally overlooked. However, the amount of information transferred for "addresses" can easily be more than half of the total transferred bits in many applications. In this article, we propose an in-memory Address Calculation Accelerator circuit. Our simulation results show that calculating address sequences inside the memory (instead of the CPU) can significantly reduce CPU-CIM address transactions and therefore contribute to considerable savings in energy, latency, and bus traffic. For the chosen application of guided image filtering, in-memory address calculation results in almost two orders of magnitude reduction in address transactions over the memory bus.
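A back-of-the-envelope example (ours, with assumed image, window, and bus widths) of how much bus traffic the access addresses themselves can consume for a windowed image filter such as the guided filter mentioned above:

```python
# Illustrative numbers: a 1024x1024 8-bit image filtered with a 3x3 window,
# 32-bit addresses, one address transferred per pixel read.
width = height = 1024
window = 3 * 3
addr_bits, data_bits = 32, 8

reads = width * height * window
addr_traffic = reads * addr_bits
data_traffic = reads * data_bits
print(f"address bits are {addr_traffic / (addr_traffic + data_traffic):.0%} of bus traffic")
# -> 80%: the window addresses follow a regular pattern, so generating the
#    sequence inside the memory removes most of this address traffic.
```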

Journal ArticleDOI
TL;DR: In this article, the authors proposed thermal-constrained memory management (TCMM) for 3D hybrid DRAM-PCM memory, which applies different methods to reduce the power and constrain the peak temperature.

Proceedings ArticleDOI
12 Jun 2022
TL;DR: The memory-centric transformation has just begun with PIM (Processing-In-Memory) and is expected to evolve by bringing memory and logic closer together with advanced packaging techniques in order to achieve optimal system performance.
Abstract: Innovation in the memory semiconductor industry has continued to provide a number of key solutions to address the challenges of ever-changing, data-driven computing. However, besides the demand for high performance, low power, low cost, and high capacity, there is also an increasing demand for more smart functionalities in or near memory to minimize the data movement. In this paper, we will share our vision of memory innovation. First, we begin the journey with memory extension, in which the conventional scaling in both DRAM and NAND can be pushed further to defy the device scaling limits. Then, the journey will ultimately lead to the memory-centric transformation. The memory-centric transformation has just begun with PIM (Processing-In-Memory) and is expected to evolve by bringing memory and logic closer together with advanced packaging techniques in order to achieve optimal system performance. In addition, new solutions enabled by new interfaces such as CXL (Compute Express Link) will be introduced to enhance the current value proposition of the memory technology. Last but not least, our endeavors as a responsible member of the global community will be introduced. Our ongoing efforts are focused on reducing carbon emissions, water usage, and power consumption in all our products and manufacturing processes. SK hynix truly believes that the journey of Memory would only be possible when the ICT industry as a whole embraces open innovation to create a better and more sustainable world.

Journal ArticleDOI
TL;DR: Nonvolatile Memory Express over Fabrics and Compute Express Link, combined with new memory technologies, are creating computational storage and capabilities near and in memory, driving new computer architectures for use in data centers, at the network edge, and in endpoint devices, as mentioned in this paper.
Abstract: Nonvolatile Memory Express over Fabrics and Compute Express Link, combined with new memory technologies, are creating computational storage and capabilities near and in memory, driving new computer architectures for use in data centers, at the network edge, and in endpoint devices.

Proceedings ArticleDOI
18 Jun 2022
TL;DR: Fence-Free Crash-consistent Concurrent Defragmentation (FFCCD) introduces architecture support for concurrent defragmentation that enables a fence-free design and fast read barrier, reducing two major overheads of defragmenting persistent memory.
Abstract: Persistent Memory (PM) is increasingly supplementing or substituting DRAM as main memory. Prior work has focused on reusability and memory leaks of persistent memory but has not addressed a problem amplified by persistence: persistent memory fragmentation, the continuous worsening of fragmentation of persistent memory throughout its usage. This paper reveals the challenges and proposes the first systematic crash-consistent solution, Fence-Free Crash-consistent Concurrent Defragmentation (FFCCD). FFCCD reuses the persistent pointer format, root nodes, and typed allocation provided by the persistent memory programming model to enable concurrent defragmentation on PM. FFCCD introduces architecture support for concurrent defragmentation that enables a fence-free design and a fast read barrier, reducing two major overheads of defragmenting persistent memory. The technique is effective (28-73% fragmentation reduction) and fast (4.1% execution time overhead).
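A generic software illustration (ours, not FFCCD's architectural support) of why defragmentation needs a read barrier: while objects are being relocated, every dereference consults a forwarding table so that stale persistent pointers still reach the moved object.

```python
heap = {0x1000: {"payload": "A"}, 0x2000: {"payload": "B"}}
forwarding = {}                     # old address -> new address during defrag

def read_barrier(addr):
    """Every load of a persistent pointer goes through this check."""
    return forwarding.get(addr, addr)

def relocate(old_addr, new_addr):
    heap[new_addr] = heap.pop(old_addr)
    forwarding[old_addr] = new_addr   # FFCCD makes this step crash-consistent
                                      # and fence-free with hardware support

relocate(0x2000, 0x1100)              # compaction moves object B
stale_pointer = 0x2000                # an application still holds the old address
print(heap[read_barrier(stale_pointer)]["payload"])   # -> "B"
```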

Journal ArticleDOI
TL;DR: This work proposes a RISC-V framework that supports logic-in-memory operations and substitutes data memory with a circuit capable of storing data and of performing in-memory computation, and demonstrates an improvement in algorithm execution speed and a reduction in energy consumption.
Abstract: Most modern CPU architectures are based on the von Neumann principle, where memory and processing units are separate entities. Although processing unit performance has improved over the years, memory capacity has not followed the same trend, creating a performance gap between them. This problem is known as the "memory wall" and severely limits the performance of a microprocessor. One of the most promising solutions is the "logic-in-memory" approach. It consists of merging memory and logic units, enabling data to be processed directly inside the memory itself. Here we propose a RISC-V framework that supports logic-in-memory operations. We substitute data memory with a circuit capable of storing data and of performing in-memory computation. The framework is based on a standard memory interface, so different logic-in-memory architectures can be inserted inside the microprocessor, based on both CMOS and emerging technologies. The main advantage of this framework is the possibility of comparing the performance of different logic-in-memory solutions on code execution. We demonstrate the effectiveness of the framework using a CMOS volatile memory and a memory based on a new emerging technology, racetrack logic. The results demonstrate an improvement in algorithm execution speed and a reduction in energy consumption.

Proceedings ArticleDOI
01 Jun 2022
TL;DR: This paper presents a partial-surgery approach that forces in-memory KVSes to prune damaged objects and reconstruct their internals using undamaged ones, significantly outperforming the conventional all-clean approach.
Abstract: Memory errors that can be detected but cannot be fixed by error correction code (ECC) modules, called ECC-uncorrectable errors, have a severe impact on the availability of datacenter applications. In-memory key-value stores (KVSes) suffer relatively more from ECC-uncorrectable errors than other applications because they typically allocate a large amount of memory and manage KVs and their running states in their address spaces. The standard way of recovery is the all-clean approach, which reboots the damaged applications. This eliminates all the memory objects, causing a significant performance degradation of the in-memory KVSes. This paper presents a partial-surgery approach that forces in-memory KVSes to prune the damaged objects and reconstruct their internals by using undamaged ones. We prototyped our approach on memcached 1.4.39 and Redis 5.0.3, and conducted several experiments. The results show that the prototypes successfully recover from our injected memory errors and significantly outperform the conventional all-clean approach.
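A toy illustration (ours) of the partial-surgery idea: instead of discarding the whole store after an uncorrectable error, the damaged objects are pruned and the index is rebuilt from the undamaged ones; real KVSes must also repair internal structures such as slab lists and LRU queues, which this sketch ignores.

```python
class DamagedMemory(Exception):
    """Raised when reading an object whose page had an ECC-uncorrectable error."""

store = {"user:1": "alice", "user:2": "bob", "user:3": "carol"}
damaged_keys = {"user:2"}            # objects hit by the injected memory error

def read_object(key):
    if key in damaged_keys:
        raise DamagedMemory(key)
    return store[key]

def partial_surgery(keys):
    """Prune unreadable objects and rebuild the index from healthy ones."""
    healthy = {}
    for key in keys:
        try:
            healthy[key] = read_object(key)
        except DamagedMemory:
            pass                      # drop the damaged object; clients see a miss
    return healthy

rebuilt = partial_surgery(list(store))
print(sorted(rebuilt))                # -> ['user:1', 'user:3']; no full restart
```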

Journal ArticleDOI
TL;DR: The Compute Express Link (CXL) Consortium, as mentioned in this paper, creates standards for disaggregating memory and building memory pools indirectly connected to central processing units (CPUs) in data centers.
Abstract: With memory requirements growing to process increasing data for machine learning and other data-intensive applications, we need better ways to utilize installed memory. The CXL protocol enables creating pools of memory and accelerators, allowing memory disaggregation and enabling composable virtual machines or containers which can be spun up or down on demand and make more efficient use of expensive memory. New software will make CXL memory pools even more useful by addressing needs such as enhanced security of data in disaggregated memory and state consistency preservation in the face of decoupled CPU and memory failures.

Memory Disaggregation in the Data Center

Data centers, especially the large ones, are constantly seeking to optimize their resource utilization. With scale comes increasing pressure to get the most out of one's hardware. The requirement to use compute resources more efficiently, for instance, led to widespread use of virtual machines running on servers and, more recently, to creating virtual machines or containers utilizing disaggregated (separated) storage and networking components. Disaggregation usually results in interconnected pools of computer resources such as processors, networks, and storage, which can then be re-aggregated using software to configure virtual machines or containers for running various processes. Software-based combination of pooled computer resources is also known as composable infrastructure. Storage pooling today focuses on using NVMe running on fabrics (NVMe-oF), allowing arrays of SSDs in a storage pool that can then be assigned to provide storage for containers or virtual machines that can be spun up and spun down at will, resulting in much higher utilization of storage resources. New memory networking standards are now making it possible to disaggregate memory beyond today's direct connection to a CPU toward memory pools that can be shared on an interconnection network and allocated as part of a data center's composable infrastructure. Let's examine these developments that will help future data centers tame their memory needs.

In 2016, Rao and Porter [1] found memory disaggregation over traditional networks favorable for Apache Spark's memory-intensive and highly partitionable workloads. In 2017, Barroso et al. [2] anticipated the changing access characteristics of data in data centers and encouraged software developers to address a gap in their stacks when it came to accessing data that was approximately one microsecond away. A form of disaggregating memory was possible even before Rao and Porter's work: hardware proposals for standalone memory blades [4] anticipated many of the aspects of modern memory disaggregation fabrics. In 2019 the Compute Express Link (CXL) Consortium was formed to create standards for disaggregating memory and creating memory pools indirectly connected to central processing units (CPUs). In November 2020 the CXL Consortium released its 2.0 specification [3]. The CXL 3.0 specification release is expected sometime in 2022. CXL runs on the PCIe bus and uses advances in serial link technology (such as high-speed SerDes) and the decades-old idea that a handful of serial lanes, grouped 4x to 16x wide, can serve as a system-expansion interconnect. CXL-enabled systems are expected by the end of 2022 or early 2023, based upon the latest PCIe specification, generation 5. CXL makes protocol-layer enhancements to PCIe that make it especially apt for memory attachment.

First, it allows long I/O packets and short cache-line-grain accesses to share the same physical link by supporting arbitration at flow-digit (flit) level, so that load-store operations and I/O Direct Memory Access (DMA) operations can share the same physical link without memory accesses incurring exorbitant latencies due to I/O Transport Layer packets crossing switch ports in front of memory data. Second, it specifies coherence protocols that allow caches and buffers to be coherently connected to processors inside a disaggregated heterogeneous system composed of both traditional elements, such as general-purpose CPUs with their tightly coupled memory devices, and novel elements, such as far memory and domain-specific accelerators (FPGAs, GPUs, and CGRAs with highly integrated SRAM or HBM DRAM). Figure 1 shows some CXL memory/accelerator pooling approaches (image courtesy of the CXL Consortium).

From In-server and Distributed Memory to Disaggregated Memory

Each generation of CXL will allow memory to be deployed farther from the CPU with increasing flexibility in terms of the capacity deployed, the dynamic configuration of host memory capacity, and the number of hosts able to share and efficiently access fabric-attached memory. The benefits of this are best understood in contrast with the traditional bespoke deployment of dual-inline memory modules (DIMMs) on the DDR buses of CPU sockets, each socket exposing 4, 6, or even 8 DDR channels and allowing 2 (lately just 1, due to capacitive loading) DIMMs per channel. Those CPUs were interconnected via a switched or point-to-point symmetric coherency fabric that allowed uniform or non-uniform latency of load-store access to each other's memory. The lanes of PCIe emanated from CPU sockets separately, often with 96 or 128 lanes per socket, and were routed to I/O devices such as Network Interface Cards (NICs) or Solid State Disks (SSDs), with or without switches and retimers on the backplane or midplane. In other words, CPUs attached to memory one way and to I/O another. Due to the disaggregation of I/O, first providing access to storage over Fibre Channel and IP networks in the 1990s, and subsequently using the more expensive NICs and SmartNICs (Xsigo, Virtensys and Mellanox Multihost NICs) during the 2000s and 2010s, PCIe was created to meet the need for system-expansion fabrics capable of supporting RDMA (Remote Direct Memory Access). Although CPUs and their application software also adopted RDMA for efficient interprocessor communication, the heavy software path of setting up and tearing down the memory registrations required for safe, zero-copy RDMA, and the heavy queue-pair-based issue and completion paths of RDMA read and write operations, remind one more of storage protocols (such as NVMe) than of memory access. By contrast, it is expected that even the higher CXL latencies (compared to DIMMs) will be an order of magnitude lower than the lower RDMA Read round-trip time (RTT).

Disaggregation-related Trends and their Implications

Some of the implications of memory disaggregation are similar to those of storage disaggregation in the late 1990s. When any resource decouples from a host server, it must be managed differently. Starting with power-up and boot, there are fewer ordering guarantees over the power-up sequence across disaggregated components. Due to the independence of procurement and decommissioning of resources, and due to independent failures, there are fewer assurances of co-availability.

On the positive side, components that could not previously be independently scaled may now do so. Independent manageability required of the freshly disaggregated components creates opportunity for value-added services. For instance, storage arrays developed many new software-based capabilities not previously available in hard disk drives, such as snapshots, cloning, and thin provisioning, to name a few. We likewise expect disaggregated memory nodes to evolve from devices into subsystems with a growing list of novel software-based capabilities. Independent scaling of computation and memory is to be contrasted with homogeneous scale-out, where the sins of bespoke memory deployment were compounded by eager overprovisioning and the inability to acquire more memory without the cost and latency of additional processors. Moreover, the economic impact of bespoke memory deployment runs deep in today's data centers. First, memory has now become the costliest element of a data center server's bill of materials, accounting for as much as 50 percent of the overall cost compared with 25 percent in 2009 [4]. Due to this, as many as 5-7 server stock keeping units (SKUs) are commonly found in a 100,000-server cloud data center, mainly differing in their memory capacity. The use of these fixed SKUs can result in up to 34% of memory capacity remaining idle. Second, due to the inability to dynamically grow the memory capacity of a server to match demand, applications are forced to consider either tolerating Out of Memory errors or moving their data to larger instances, just when the footprint of their state is at its peak, neither of which is particularly palatable to modern DevOps. Third, as if that wasn't enough trouble, the capacity needs of applications vary wildly [4]. Speaking at the 5th International Symposium on Heterogeneous Integration, John Shalf, the CTO of the National Energy Research Scientific Computing Center (NERSC) at the US Department of Energy, observed that server workloads use less than 25% of their memory 75% of the time [5]. So wasteful is bespoke deployment of memory in the data center that a resource procured by data center operators at approximately $4/GB is then rented out to cloud service operators at approximately $22-$30/GB per year, probably to make up for the losses in a poorly architected value chain. In their 2022 ASPLOS paper, Microsoft Azure researchers [6] estimate that they can save approximately 10 percent of overall memory cost by placing just the cold pages (infrequently accessed provisioned memory) in a CXL-based far memory tier shared between 16 and 32 servers.

Industry's roadmap of memory disaggregation

Given that the demand for memory keeps rising due to the growth of memory-intensive workloads, architects will need to get much more aggressive about leveraging memory as a far, fungible, and shared resource. There has been some recognition that bottom-up, hardware developments such as CXL are merely a
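A back-of-the-envelope calculation (ours) using the figures quoted in the article above illustrates the scale of stranded-memory cost; the fleet size and per-server capacity are assumptions.

```python
# Figures quoted above: ~$4/GB procurement cost, up to 34% of capacity idle
# under fixed SKUs, and ~10% overall memory-cost savings from pooling cold
# pages over CXL (the Microsoft Azure estimate). Fleet size is an assumption.
servers = 100_000
gb_per_server = 512                    # assumed memory per server
cost_per_gb = 4.0                      # USD, procurement

fleet_memory_cost = servers * gb_per_server * cost_per_gb
idle_capacity_cost = fleet_memory_cost * 0.34
cxl_pooling_savings = fleet_memory_cost * 0.10

print(f"fleet memory spend:    ${fleet_memory_cost / 1e6:,.0f}M")
print(f"cost of idle capacity: ${idle_capacity_cost / 1e6:,.0f}M (34% stranded)")
print(f"CXL cold-page pooling: ${cxl_pooling_savings / 1e6:,.0f}M (~10% saved)")
```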


Book ChapterDOI
TL;DR: FFIChecker as mentioned in this paper is an automated static analysis and bug detection tool dedicated to memory management issues across the Rust/C Foreign Function Interface (FFI) boundaries, which can detect cross-language memory management problems.
Abstract: Rust is a promising system-level programming language that can prevent memory corruption bugs using its strong type system and ownership-based memory management scheme. In practice, programmers usually write Rust code in conjunction with other languages such as C/C++ through the Foreign Function Interface (FFI). For example, many notable projects are developed using Rust and other programming languages, such as Firefox, Google Fuchsia OS, and the Linux kernel. Although it is widely believed that gradually re-implementing security-critical components in Rust is a way of enhancing software security, using FFI is inherently unsafe. In this paper, we show that memory management across FFI boundaries is error-prone. Any incorrect use of FFI may corrupt Rust's ownership system, leading to memory safety issues. To tackle this problem, we design and build FFIChecker, an automated static analysis and bug detection tool dedicated to memory management issues across the Rust/C FFI. We evaluate our tool by checking 987 Rust packages crawled from the official package registry and reveal 34 bugs in 12 packages. Our experiments show that FFIChecker is a useful tool for detecting real-world cross-language memory management issues with a reasonable amount of computational resources.

Proceedings ArticleDOI
14 Sep 2022
TL;DR: In this article, the authors examine the costs and benefits of garbage collection through four studies, exploring the differences in costs between manual memory management and garbage collection, and find that the costs and benefits of garbage collection are likely to remain subject to contentious discussion.
Abstract: Automatic memory management relieves programmers of the burden of having to reason about object lifetimes in order to soundly reclaim allocated memory. However, this automation comes at a cost. The cost and benefits of garbage collection relative to manual memory management have been the subject of contention for a long time, and will likely remain so. However, until now, the question is surprisingly under-studied. We examine the costs and benefits of garbage collection through four studies. We conduct this study in a contemporary setting using recent CPU microarchitectures, and novel methodologies including a mark-sweep collector built upon off-the-shelf free-list allocators, allowing us to shed new light on garbage collection overheads in a modern context. The costs and benefits of garbage collection are likely to remain subject to contentious discussion. However, the methodologies and evaluations we present here provide a deeper understanding of the differences in costs between manual memory management and garbage collection.
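The methodology above mentions a mark-sweep collector built upon off-the-shelf free-list allocators; the sketch below (ours, greatly simplified) shows the two phases over a toy object graph, with the "free list" reduced to a pool that reclaimed objects are returned to.

```python
class Obj:
    def __init__(self, name):
        self.name, self.refs, self.marked = name, [], False

def mark(roots):
    """Mark phase: trace everything reachable from the roots."""
    stack = list(roots)
    while stack:
        obj = stack.pop()
        if not obj.marked:
            obj.marked = True
            stack.extend(obj.refs)

def sweep(heap, free_list):
    """Sweep phase: return unmarked objects to the underlying allocator."""
    live = []
    for obj in heap:
        if obj.marked:
            obj.marked = False          # reset for the next collection cycle
            live.append(obj)
        else:
            free_list.append(obj)       # stand-in for free-list deallocation
    return live

a, b, c = Obj("a"), Obj("b"), Obj("c")
a.refs.append(b)                        # c is unreachable garbage
heap, free_list = [a, b, c], []
mark(roots=[a])
heap = sweep(heap, free_list)
print([o.name for o in heap], [o.name for o in free_list])   # ['a','b'] ['c']
```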

Journal ArticleDOI
TL;DR: In this article, the authors propose remote memory sharing for mutualizing memory management in virtualized computing infrastructures: a system that monitors the working set of virtual machines, reclaims unused memory, and makes it available (as a remote swap device) to virtual machines that need memory.
Abstract: Resource management is a critical issue in today's virtualized computing infrastructures. Consolidation is the main technique used to optimize such infrastructures. Regarding memory management, it allows gathering overloaded and underloaded VMs on the same server so that memory can be mutualized. However, because of infrastructure constraints and the complexity of managing multiple resources, consolidation can hardly optimize memory management. In this article, we propose to rely on remote memory sharing for mutualizing memory. We implemented a system which monitors the working set of virtual machines, reclaims unused memory, and makes it available (as a remote swap device) for virtual machines which need memory. Our evaluations with HPC and Big Data benchmarks demonstrate the effectiveness of this approach. We show that remote memory can improve the performance of a standard Spark benchmark by up to 17 percent with an average performance degradation of 1.5 percent (for the providing application).

Journal ArticleDOI
TL;DR: The motivation behind this study is to review the existing contiguous memory allocator (CMA) method by identifying and removing its drawbacks.
Abstract: The demand for contiguous memory allocation has expanded across everyday devices. Existing systems meet it by using various reservation techniques. There are other methods to achieve contiguous memory allocation in the Linux kernel, such as input/output memory management units (IOMMUs), scatter/gather direct memory access (DMA), and memory statically reserved at boot time. But these solutions have their own drawbacks: an IOMMU requires additional hardware, and configuring additional hardware increases the cost and power consumption of the system, while statically reserved memory goes to waste when it is not used for its specific purpose. It is very difficult to access contiguous memory on low-end devices that are unable to provide real contiguous memory. One existing method, the contiguous memory allocator (CMA), provides dynamic contiguous memory. It overcomes most of these problems, but CMA itself has some drawbacks: it cannot guarantee that future contiguous allocations will not fail. The motivation behind this study is to review the existing contiguous memory allocation (CMA) method by identifying and removing its drawbacks.
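A small simulation (ours) of why dynamic contiguous allocation can fail even when plenty of memory is free, which is the failure mode the CMA work above tries to mitigate:

```python
def largest_hole(pages):
    """Length of the longest run of free (False) pages."""
    best = run = 0
    for used in pages:
        run = 0 if used else run + 1
        best = max(best, run)
    return best

pages = [False] * 64                 # 64 free pages
for i in range(0, 64, 2):            # long-lived allocations pin every other page
    pages[i] = True

free_pages = pages.count(False)
print(f"free pages: {free_pages}, largest contiguous run: {largest_hole(pages)}")
# -> 32 pages are free, yet a request for even 2 contiguous pages fails;
#    this is why CMA migrates or reserves pages to keep large regions allocatable.
```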

Journal ArticleDOI
TL;DR: XUnified, as described in this paper, is an advice controller that combines offline training with online adaptation to guide the optimal use of unified memory and discrete memory for various applications at runtime.
Abstract: Unified Memory is a single memory address space that is accessible by any processor (GPUs or CPUs) in a system. NVIDIA's unified memory creates a pool of managed memory on top of physically separated CPU and GPU memories. NVIDIA's unified memory automatically migrates page-level data on demand, so programmers can quickly develop CUDA codes on heterogeneous machines. However, it is extremely difficult for programmers to decide when and how to efficiently use NVIDIA's unified memory because (1) users are usually unaware of which unified memory hint (e.g., ReadMostly, PreferredLocation, AccessedBy) should be used for a data object in the application, and (2) it is tedious and error-prone to do manual memory management (i.e., manual code modifications) for various applications with different data objects or inputs. We present XUnified, an advice controller that combines offline training with online adaptation to guide the optimal use of unified memory and discrete memory for various applications at runtime. The offline phase uses profiler-generated metrics to train a machine learning model, which is then used to predict the optimal memory advice choice and apply it to applications at runtime. We evaluate XUnified on NVIDIA Volta GPUs with a set of heterogeneous computing benchmarks. Results show that it achieves 94.0% prediction accuracy in correctly identifying the optimal memory advice choice, with a maximal 34.3% reduction in kernel execution time.
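A schematic sketch (ours) of the offline-training/online-advice split described above; the profiler metrics, the nearest-neighbor "model", and the labels are invented placeholders for XUnified's trained model, and the final hint application is only indicated in a comment.

```python
# Offline: profiler metrics for known data objects, labeled with the advice
# that performed best (metrics and labels here are invented for illustration).
training_set = [
    ({"read_ratio": 0.97, "gpu_access_share": 0.90}, "ReadMostly"),
    ({"read_ratio": 0.40, "gpu_access_share": 0.95}, "PreferredLocation:GPU"),
    ({"read_ratio": 0.50, "gpu_access_share": 0.15}, "AccessedBy:CPU"),
]

def train(samples):
    """Stand-in for the offline-trained model: memorize labeled examples."""
    return samples

def predict(model, metrics):
    """Online phase: pick the advice of the nearest profiled neighbor."""
    def dist(a, b):
        return sum((a[k] - b[k]) ** 2 for k in a)
    return min(model, key=lambda s: dist(s[0], metrics))[1]

model = train(training_set)
new_object = {"read_ratio": 0.92, "gpu_access_share": 0.88}
advice = predict(model, new_object)
print(advice)   # -> "ReadMostly"; the runtime would then apply this hint
                #    (e.g., via cudaMemAdvise) to the corresponding allocation.
```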

Proceedings ArticleDOI
Xiangyan Sun
03 Dec 2022
TL;DR: In this paper, a memory management strategy that eliminates full-memory concerns arising from the data dependencies of the Bahl-Cocke-Jelinek-Raviv (BCJR) algorithm is presented.
Abstract: Reliable wireless communication is enabled by adopting advanced Forward Error Correction (FEC) codes, such as Turbo codes. However, a Turbo decoder for real-time communication must accommodate the limited resources offered by a DSP processor. This motivates benchmarking the Super Harvard Architecture (SHARC) Digital Signal Processor (DSP), which is considered a highly flexible platform. However, implementing Turbo codes, which are based on the Bahl-Cocke-Jelinek-Raviv (BCJR) algorithm, on a DSP requires extensive processing time and high memory usage due to the algorithm's data dependencies. Motivated by this, hardware considerations for implementing a coded receiver under limited memory and processing time are proposed. A memory management strategy that eliminates full-memory concerns arising from the data dependencies of the BCJR is presented. We show that jointly optimizing dynamic memory allocation and external memory accesses significantly improves the use of the available computing resources and lessens the memory requirements. Results showed that the proposed coded receiver requires a memory size of 21 symbol lengths.
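A rough memory-accounting sketch (ours, with assumed trellis, block, window, and metric sizes) of why the BCJR forward/backward data dependencies are memory-hungry on a DSP and how windowed processing shrinks the footprint:

```python
# Assumed parameters: an 8-state trellis, a 6144-bit block, 32-bit fixed-point
# state metrics, and a 64-step processing window.
states, block_len, bytes_per_metric, window = 8, 6144, 4, 64

# Classic BCJR keeps every forward (alpha) metric until the backward pass runs.
full_alpha_bytes = states * block_len * bytes_per_metric

# Sliding-window BCJR keeps alphas only for the current window plus one
# boundary metric set per window for re-initialization.
windowed_bytes = states * window * bytes_per_metric \
               + states * (block_len // window) * bytes_per_metric

print(f"full-block alpha storage: {full_alpha_bytes / 1024:.0f} KiB")
print(f"sliding-window storage:   {windowed_bytes / 1024:.1f} KiB")
# The same dependency (the backward pass needs the forward metrics) is what
# the paper's DSP memory management strategy has to work around.
```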

Proceedings ArticleDOI
14 Mar 2022
TL;DR: In this article, the authors propose a memory management methodology that leverages a data structure refinement approach to improve data placement results, in terms of execution time and energy consumption, evaluated on machine learning algorithms.
Abstract: Memory systems that combine multiple memory technologies with different performance and energy characteristics are becoming mainstream. Existing data placement strategies evolve to map application requirements to the underlying heterogeneous memory systems. In this work, we propose a memory management methodology that leverages a data structure refinement approach to improve data placement results, in terms of execution time and energy consumption. The methodology is evaluated on three machine learning algorithms deployed on various NVM technologies, both on emulated and on real DRAM/NVM systems. Results show execution time improvements of up to 57% and energy consumption gains of up to 41%.
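A simplified sketch (ours) of refinement-guided data placement on a DRAM/NVM system: each data structure is assigned to the tier that best matches its observed access behavior, subject to DRAM capacity. The profiles, sizes, and heat formula below are invented for illustration.

```python
# Per-data-structure profile: (size in MiB, accesses per second, write fraction)
profiles = {
    "hash_index":    (256, 9_000_000, 0.40),
    "feature_cache": (512, 2_000_000, 0.05),
    "training_log":  (1024,   50_000, 0.90),
}
DRAM_CAPACITY_MIB = 512

def placement(profiles, dram_capacity):
    """Greedy refinement: the hottest (and most write-heavy) structures go to
    DRAM until it is full; the rest go to NVM, which tolerates reads better
    than writes."""
    ranked = sorted(profiles.items(),
                    key=lambda kv: kv[1][1] * (1 + kv[1][2]),   # heat * write bias
                    reverse=True)
    plan, used = {}, 0
    for name, (size, _, _) in ranked:
        if used + size <= dram_capacity:
            plan[name], used = "DRAM", used + size
        else:
            plan[name] = "NVM"
    return plan

print(placement(profiles, DRAM_CAPACITY_MIB))
# -> {'hash_index': 'DRAM', 'feature_cache': 'NVM', 'training_log': 'NVM'}
```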