
Showing papers on "Distributed memory" published in 2017


Proceedings ArticleDOI
24 Sep 2017
TL;DR: This work built Hotpot, a kernel-level DSPM system that provides low-latency, transparent memory accesses, data persistence, data reliability, and high availability, and implemented it in the Linux kernel.
Abstract: Next-generation non-volatile memories (NVMs) will provide byte addressability, persistence, high density, and DRAM-like performance. They have the potential to benefit many datacenter applications. However, most previous research on NVMs has focused on using them in a single-machine environment. It is still unclear how to best utilize them in distributed, datacenter environments. We introduce Distributed Shared Persistent Memory (DSPM), a new framework for using persistent memories in distributed datacenter environments. DSPM provides a new abstraction that allows applications to both perform traditional memory load and store instructions and to name, share, and persist their data. We built Hotpot, a kernel-level DSPM system that provides low-latency, transparent memory accesses, data persistence, data reliability, and high availability. The key ideas of Hotpot are to integrate distributed memory caching and data replication techniques and to exploit application hints. We implemented Hotpot in the Linux kernel and demonstrated its benefits by building a distributed graph engine on Hotpot and porting a NoSQL database to Hotpot. Our evaluation shows that Hotpot outperforms a recent distributed shared memory system by 1.3× to 3.2× and a recent distributed PM-based file system by 1.5× to 3.0×.

104 citations


Proceedings ArticleDOI
01 Sep 2017
TL;DR: UH-MEM as discussed by the authors is a page management mechanism for various hybrid memories that systematically estimates the utility of migrating a page between different memory types, and uses this information to guide data placement.
Abstract: While the memory footprints of cloud and HPC applications continue to increase, fundamental issues with DRAM scaling are likely to prevent traditional main memory systems, composed of monolithic DRAM, from greatly growing in capacity. Hybrid memory systems can mitigate the scaling limitations of monolithic DRAM by pairing together multiple memory technologies (e.g., different types of DRAM, or DRAM and non-volatile memory) at the same level of the memory hierarchy. The goal of a hybrid main memory is to combine the different advantages of the multiple memory types in a cost-effective manner while avoiding the disadvantages of each technology. Memory pages are placed in and migrated between the different memories within a hybrid memory system, based on the properties of each page. It is important to make intelligent page management (i.e., placement and migration) decisions, as they can significantly affect system performance. In this paper, we propose utility-based hybrid memory management (UH-MEM), a new page management mechanism for various hybrid memories, that systematically estimates the utility (i.e., the system performance benefit) of migrating a page between different memory types, and uses this information to guide data placement. UH-MEM operates in two steps. First, it estimates how much a single application would benefit from migrating one of its pages to a different type of memory, by comprehensively considering access frequency, row buffer locality, and memory-level parallelism. Second, it translates the estimated benefit of a single application to an estimate of the overall system performance benefit from such a migration. We evaluate the effectiveness of UH-MEM with various types of hybrid memories, and show that it significantly improves system performance on each of these hybrid memories. For a memory system with DRAM and non-volatile memory, UH-MEM improves performance by 14% on average (and up to 26%) compared to the best of three evaluated state-of-the-art mechanisms across a large number of data-intensive workloads.

78 citations
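
As a rough illustration of the utility-estimation idea described in the abstract above, the following Python sketch ranks pages for migration using access frequency, row buffer locality, and memory-level parallelism. All field names, weights, and latency values are illustrative assumptions, not parameters from the paper.

# Hedged sketch of a utility-based page-migration policy in the spirit of
# UH-MEM. Names, weights, and latency numbers are illustrative assumptions.

from dataclasses import dataclass

@dataclass
class PageStats:
    page_id: int
    access_freq: float          # accesses per interval
    row_buffer_hit_rate: float  # 0.0 .. 1.0
    mlp: float                  # average memory-level parallelism when accessed

def single_app_benefit(p: PageStats, fast_lat_ns=50.0, slow_lat_ns=300.0):
    """Estimate per-application stall time saved by moving this page to fast memory.
    Row-buffer hits and high MLP reduce the effective penalty of slow memory."""
    effective_slow = slow_lat_ns * (1.0 - 0.5 * p.row_buffer_hit_rate)
    saved_per_access = max(effective_slow - fast_lat_ns, 0.0) / max(p.mlp, 1.0)
    return p.access_freq * saved_per_access

def system_utility(p: PageStats, app_weight: float):
    """Translate the per-application benefit into a system-level utility,
    e.g. by weighting with the application's contribution to system slowdown."""
    return app_weight * single_app_benefit(p)

pages = [
    PageStats(0, access_freq=900, row_buffer_hit_rate=0.1, mlp=1.2),
    PageStats(1, access_freq=900, row_buffer_hit_rate=0.9, mlp=4.0),
]
ranked = sorted(pages, key=lambda p: system_utility(p, 1.0), reverse=True)
print([p.page_id for p in ranked])  # page 0 ranks first: low locality, low MLP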


Proceedings ArticleDOI
24 Sep 2017
TL;DR: This paper enumerates the challenges of remote memory, discusses their feasibility, explains how some of them are addressed by recent work, and indicates other promising ways to tackle them.
Abstract: As the latency of the network approaches that of memory, it becomes increasingly attractive for applications to use remote memory---random-access memory at another computer that is accessed using the virtual memory subsystem. This is an old idea whose time has come, in the age of fast networks. To work effectively, remote memory must address many technical challenges. In this paper, we enumerate these challenges, discuss their feasibility, explain how some of them are addressed by recent work, and indicate other promising ways to tackle them. Some challenges remain as open problems, while others deserve more study. In this paper, we hope to provide a broad research agenda around this topic, by proposing more problems than solutions.

70 citations


Proceedings ArticleDOI
01 Sep 2017
TL;DR: This work-in-progress builds on existing heuristics for pruning the number of wedge checks by ordering based on degree, and presents a brief experimental evaluation on two large real scale-free graphs: a 128B edge web-graph and a 1.4B edge twitter follower graph, and a weak scaling study on synthetic Graph500 RMAT graphs up to 274.9 billion edges.
Abstract: Triangle counting has long been a challenge problem for sparse graphs containing high-degree "hub" vertices that exist in many real-world scenarios. These high-degree vertices create a quadratic number of wedges, or 2-edge paths, which for brute force algorithms require closure checking or wedge checks. Our work-in-progress builds on existing heuristics for pruning the number of wedge checks by ordering based on degree and other simple metrics. Such heuristics can dramatically reduce the number of required wedge checks for exact triangle counting for both real and synthetic scale-free graphs. Our triangle counting algorithm is implemented using HavoqGT, an asynchronous vertex-centric graph analytics framework for distributed memory. We present a brief experimental evaluation on two large real scale-free graphs: a 128B edge web-graph and a 1.4B edge twitter follower graph, and a weak scaling study on synthetic Graph500 RMAT graphs up to 274.9 billion edges.

59 citations
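
The degree-ordering heuristic mentioned in the abstract above can be illustrated with a small serial Python sketch: each edge is directed from its lower-ranked endpoint to its higher-ranked one (rank = degree, ties by id), so wedges are only generated at low-degree vertices and far fewer closure (wedge) checks are needed. This is a toy, single-node version of the idea; the paper's implementation is distributed and asynchronous on HavoqGT.

# Hedged sketch of degree-ordered exact triangle counting.

from collections import defaultdict
from itertools import combinations

def count_triangles(edges):
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:
            adj[u].add(v)
            adj[v].add(u)

    deg = {v: len(n) for v, n in adj.items()}
    rank = lambda v: (deg[v], v)          # total order: degree, then vertex id

    # Directed adjacency: keep only neighbors of higher rank.
    out = {v: {w for w in adj[v] if rank(w) > rank(v)} for v in adj}

    triangles = 0
    for v in out:                                   # wedges are centered at v
        for a, b in combinations(out[v], 2):        # each pair is one wedge check
            if b in out.get(a, ()) or a in out.get(b, ()):
                triangles += 1
    return triangles

print(count_triangles([(0, 1), (1, 2), (0, 2), (2, 3)]))  # -> 1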


Journal ArticleDOI
27 Apr 2017-Sensors
TL;DR: This paper proposes a static memory deduplication (SMD) technique which can reduce memory capacity requirement and provide performance optimization in cloud computing and demonstrates that the cost in terms of the response time is negligible.
Abstract: In a cloud computing environment, the number of virtual machines (VMs) on a single physical server and the number of applications running on each VM are continuously growing. This has led to an enormous increase in the demand of memory capacity and subsequent increase in the energy consumption in the cloud. Lack of enough memory has become a major bottleneck for scalability and performance of virtualization interfaces in cloud computing. To address this problem, memory deduplication techniques which reduce memory demand through page sharing are being adopted. However, such techniques suffer from overheads in terms of number of online comparisons required for the memory deduplication. In this paper, we propose a static memory deduplication (SMD) technique which can reduce memory capacity requirement and provide performance optimization in cloud computing. The main innovation of SMD is that the process of page detection is performed offline, thus potentially reducing the performance cost, especially in terms of response time. In SMD, page comparisons are restricted to the code segment, which has the highest shared content. Our experimental results show that SMD efficiently reduces memory capacity requirement and improves performance. We demonstrate that, compared to other approaches, the cost in terms of the response time is negligible.

49 citations
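
A minimal Python sketch of the core mechanism described above: offline deduplication restricted to code pages, where each fixed-size page of a VM's code segment is hashed and only one copy per unique hash is stored. The page size, hash choice, and data layout are illustrative assumptions, not the paper's implementation.

# Hedged sketch of offline, code-segment-only page deduplication.

import hashlib

PAGE_SIZE = 4096

def dedup_code_pages(code_segments):
    """code_segments: dict of vm_name -> bytes of that VM's code segment.
    Returns (shared_store, per_vm_page_tables); identical pages share one entry."""
    shared = {}            # digest -> page bytes (single stored copy)
    page_tables = {}       # vm_name -> list of digests, one per page
    for vm, blob in code_segments.items():
        table = []
        for off in range(0, len(blob), PAGE_SIZE):
            page = blob[off:off + PAGE_SIZE]
            digest = hashlib.sha1(page).hexdigest()
            shared.setdefault(digest, page)   # store once, reuse thereafter
            table.append(digest)
        page_tables[vm] = table
    return shared, page_tables

# Two VMs running the same binary share all of their code pages.
segs = {"vm0": b"\x90" * 2 * PAGE_SIZE, "vm1": b"\x90" * 2 * PAGE_SIZE}
shared, tables = dedup_code_pages(segs)
print(len(shared), sum(len(t) for t in tables.values()))  # 1 unique page, 4 mappings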


Journal ArticleDOI
TL;DR: In comparison to previous long-characteristics methods, this work greatly improves the parallel performance of the adaptive long-characteristics method by developing a new, completely asynchronous and non-blocking communication algorithm.

48 citations


Proceedings ArticleDOI
01 Sep 2017
TL;DR: The results of the evaluation reveal that the proposal is able to identify the key objects to be promoted into fast on-package memory in order to optimize performance, in some cases even surpassing hardware-based solutions.
Abstract: Multi-tiered memory systems, such as those based on Intel® Xeon Phi™ processors, are equipped with several memory tiers with different characteristics including, among others, capacity, access latency, bandwidth, energy consumption, and volatility. The proper distribution of the application data objects into the available memory layers is key to shortening the time-to-solution, but the way developers and end-users determine the most appropriate memory tier in which to place the application data objects has not been properly addressed to date. In this paper we present a novel methodology to build an extensible framework to automatically identify and place the application's most relevant memory objects into the Intel Xeon Phi fast on-package memory. Our proposal works on top of in-production binaries by first exploring the application behavior and then substituting the dynamic memory allocations. This makes the proposal valuable even for end-users who do not have the possibility of modifying the application source code. We demonstrate the value of a framework based on our methodology for several relevant HPC applications, using different allocation strategies to help end-users improve performance with minimal intervention. The results of our evaluation reveal that our proposal is able to identify the key objects to be promoted into fast on-package memory in order to optimize performance, in some cases even surpassing hardware-based solutions.

39 citations


Journal ArticleDOI
TL;DR: A cycle-accurate simulator for the Hybrid Memory Cube, called CasHMC, provides a cycle-by-cycle simulation of every module in an HMC and generates analysis results including a bandwidth graph and statistical data.
Abstract: 3D-stacked DRAM has been actively studied to overcome the limits of conventional DRAM. The Hybrid Memory Cube (HMC) is a type of 3D-stacked DRAM that has drawn great attention because of its usability for server systems and processing-in-memory (PIM) architecture. Since HMC is not directly stacked on the processor die where the central processing units (CPUs) and graphic processing units (GPUs) are integrated, HMC has to be linked to other processor components through high speed serial links. Therefore, the communication bandwidth and latency should be carefully estimated to evaluate the performance of HMC. However, most existing HMC simulators employ only simple HMC modeling. In this paper, we propose a cycle-accurate simulator for hybrid memory cube called CasHMC. It provides a cycle-by-cycle simulation of every module in an HMC and generates analysis results including a bandwidth graph and statistical data. Furthermore, CasHMC is implemented in C++ as a single wrapped object that includes an HMC controller, communication links, and HMC memory. Instantiating this single wrapped object facilitates simultaneous simulation in parallel with other simulators that generate memory access patterns such as a processor simulator or a memory trace generator.

38 citations


Journal ArticleDOI
TL;DR: The design, implementation and performance of the new hybrid parallelization scheme in the Monte Carlo radiative transfer code SKIRT, which has been used extensively for modelling the continuum radiation of dusty astrophysical systems including late-type galaxies and dusty tori, are described.

35 citations


Posted Content
TL;DR: This chapter studies the problem of traversing large graphs using the breadth-first search order on distributed-memory supercomputers, and considers both the traditional level-synchronous top-down algorithm and the recently discovered direction-optimizing algorithm.
Abstract: This chapter studies the problem of traversing large graphs using the breadth-first search order on distributed-memory supercomputers. We consider both the traditional level-synchronous top-down algorithm and the recently discovered direction-optimizing algorithm. We analyze the performance and scalability trade-offs in using different local data structures such as CSR and DCSC, enabling in-node multithreading, and graph decompositions such as 1D and 2D decomposition. (Authors: Aydin Buluc, Scott Beamer, Kamesh Madduri, Krste Asanovic, and David Patterson.)

34 citations
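
The direction-optimizing idea discussed in this chapter can be sketched in a few lines of single-node Python: run top-down steps while the frontier is small, and switch to bottom-up steps (each unvisited vertex searches its neighbors for a frontier parent) once the frontier's edge count grows large. The switching threshold below is an illustrative heuristic, not the chapter's tuned policy.

# Hedged, single-node sketch of direction-optimizing BFS.

def bfs(adj, source, alpha=14):
    n = len(adj)
    parent = [-1] * n
    parent[source] = source
    frontier = [source]
    while frontier:
        frontier_edges = sum(len(adj[v]) for v in frontier)
        unvisited = [v for v in range(n) if parent[v] == -1]
        unexplored_edges = sum(len(adj[v]) for v in unvisited)
        next_frontier = []
        if frontier_edges * alpha > unexplored_edges:
            # Bottom-up step: each unvisited vertex probes its neighbors
            # for one that lies on the current frontier.
            frontier_set = set(frontier)
            for v in unvisited:
                for u in adj[v]:
                    if u in frontier_set:
                        parent[v] = u
                        next_frontier.append(v)
                        break
        else:
            # Top-down step: expand every vertex on the current frontier.
            for u in frontier:
                for v in adj[u]:
                    if parent[v] == -1:
                        parent[v] = u
                        next_frontier.append(v)
        frontier = next_frontier
    return parent

adj = [[1, 2], [0, 3], [0, 3], [1, 2]]   # a 4-cycle
print(bfs(adj, 0))                        # -> [0, 0, 0, 1]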


Journal ArticleDOI
TL;DR: The proposed tomographic reconstruction engine can efficiently process large-scale tomographic data using many compute nodes and minimize reconstruction times.
Abstract: Modern synchrotron light sources and detectors produce data at such scale and complexity that large-scale computation is required to unleash their full power. One of the widely used imaging techniques that generates data at tens of gigabytes per second is computed tomography (CT). Although CT experiments result in rapid data generation, the analysis and reconstruction of the collected data may require hours or even days of computation time with a medium-sized workstation, which hinders the scientific progress that relies on the results of analysis. We present Trace, a data-intensive computing engine that we have developed to enable high-performance implementation of iterative tomographic reconstruction algorithms for parallel computers. Trace provides fine-grained reconstruction of tomography datasets using both (thread-level) shared memory and (process-level) distributed memory parallelization. Trace utilizes a special data structure called replicated reconstruction object to maximize application performance. We also present the optimizations that we apply to the replicated reconstruction objects and evaluate them using tomography datasets collected at the Advanced Photon Source. Our experimental evaluations show that our optimizations and parallelization techniques can provide 158× speedup using 32 compute nodes (384 cores) over a single-core configuration and decrease the end-to-end processing time of a large sinogram (with 4501 × 1 × 22,400 dimensions) from 12.5 h to <5 min per iteration. The proposed tomographic reconstruction engine can efficiently process large-scale tomographic data using many compute nodes and minimize reconstruction times.

Book
01 Jun 2017
TL;DR: Ligra, the first high-level shared-memory framework for parallel graph traversal algorithms, is introduced, enabling short and concise implementations that deliver performance competitive with that of highly optimized code and up to orders of magnitude faster than previous systems designed for distributed memory.
Abstract: Parallelism is the key to achieving high performance in computing. However, writing efficient and scalable parallel programs is notoriously difficult, and often requires significant expertise. To address this challenge, it is crucial to provide programmers with high-level tools to enable them to develop solutions easily, and at the same time emphasize the theoretical and practical aspects of algorithm design to allow the solutions developed to run efficiently under many different settings. This book, a revised version of the thesis that won the 2015 ACM Doctoral Dissertation Award, addresses this challenge using a three-pronged approach consisting of the design of shared-memory programming techniques, frameworks, and algorithms for important problems in computing. It provides evidence that with appropriate programming techniques, frameworks, and algorithms, shared-memory programs can be simple, fast, and scalable, both in theory and in practice. The results serve to ease the transition into the multicore era. The book starts by introducing tools and techniques for deterministic parallel programming, including means for encapsulating nondeterminism via powerful commutative building blocks, as well as a novel framework for executing sequential iterative loops in parallel, which lead to deterministic parallel algorithms that are efficient both in theory and in practice. The book then introduces Ligra, the first high-level shared-memory framework for parallel graph traversal algorithms. The framework enables short and concise implementations that deliver performance competitive with that of highly optimized code and up to orders of magnitude faster than previous systems designed for distributed memory. Finally, the book bridges the gap between theory and practice in parallel algorithm design by introducing the first algorithms for a variety of important problems on graphs and strings that are both practical and theoretically efficient.

Journal ArticleDOI
TL;DR: A complete methodology for designing the private local memories (PLMs) of multiple accelerators that, based on the memory requirements of each accelerator, automatically determines an area-efficient PLM architecture to guarantee performance and reduce memory cost using technology-related information.
Abstract: In modern system-on-chip architectures, specialized accelerators are increasingly used to improve performance and energy efficiency. The growing complexity of these systems requires the use of system-level design methodologies featuring high-level synthesis (HLS) for generating these components efficiently. Existing HLS tools, however, have limited support for the system-level optimization of memory elements, which typically occupy most of the accelerator area. We present a complete methodology for designing the private local memories (PLMs) of multiple accelerators. Based on the memory requirements of each accelerator, our methodology automatically determines an area-efficient architecture for the PLMs to guarantee performance and reduce the memory cost based on technology-related information. We implemented a prototype tool, called Mnemosyne, that embodies our methodology within a commercial HLS flow. We designed 13 complex accelerators for selected applications from two recently-released benchmark suites (Perfect and CortexSuite). With our approach we are able to reduce the memory cost of single accelerators by up to 45%. Moreover, when reusing memory IPs across accelerators, we achieve area savings that range between 17% and 55% compared to the case where the PLMs are designed separately.

Proceedings ArticleDOI
TL;DR: This paper analyzes the Intel KNL system and quantifies the impact of the most important factors on application performance using a set of applications representative of scientific and data-analytics workloads, showing that applications with regular memory access benefit from MCDRAM.
Abstract: Hardware accelerators have become a de-facto standard to achieve high performance on current supercomputers and there are indications that this trend will increase in the future. Modern accelerators feature high-bandwidth memory next to the computing cores. For example, the Intel Knights Landing (KNL) processor is equipped with 16 GB of high-bandwidth memory (HBM) that works together with conventional DRAM memory. Theoretically, HBM can provide 5x higher bandwidth than conventional DRAM. However, many factors impact the effective performance achieved by applications, including the application memory access pattern, the problem size, the threading level and the actual memory configuration. In this paper, we analyze the Intel KNL system and quantify the impact of the most important factors on the application performance by using a set of applications that are representative of scientific and data-analytics workloads. Our results show that applications with regular memory access benefit from MCDRAM, achieving up to 3x performance when compared to the performance obtained using only DRAM. On the contrary, applications with random memory access pattern are latency-bound and may suffer from performance degradation when using only MCDRAM. For those applications, the use of additional hardware threads may help hide latency and achieve higher aggregated bandwidth when using HBM.

Journal ArticleDOI
TL;DR: This work applies sparse matrix dense matrix multiplication (SpMM) in a semi-external memory (SEM) fashion to three important data analysis tasks—PageRank, eigensolving, and non-negative matrix factorization—and shows that the SEM implementations significantly advance the state of the art.
Abstract: Sparse matrix multiplication is traditionally performed in memory and scales to large matrices using the distributed memory of multiple nodes. In contrast, we scale sparse matrix multiplication beyond memory capacity by implementing sparse matrix dense matrix multiplication (SpMM) in a semi-external memory (SEM) fashion; i.e., we keep the sparse matrix on commodity SSDs and dense matrices in memory. Our SEM-SpMM incorporates many in-memory optimizations for large power-law graphs. It outperforms the in-memory implementations of Trilinos and Intel MKL and scales to billion-node graphs, far beyond the limitations of memory. Furthermore, on a single large parallel machine, our SEM-SpMM operates as fast as the distributed implementations of Trilinos using five times as much processing power. We also run our implementation in memory (IM-SpMM) to quantify the overhead of keeping data on SSDs. SEM-SpMM achieves almost 100 percent performance of IM-SpMM on graphs when the dense matrix has more than four columns; it achieves at least 65 percent performance of IM-SpMM on all inputs. We apply our SpMM to three important data analysis tasks—PageRank, eigensolving, and non-negative matrix factorization—and show that our SEM implementations significantly advance the state of the art.
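
A hedged single-node sketch of the semi-external idea described above: keep the sparse matrix on disk (here as a file of (row, col, value) triples), stream it in chunks, and hold only the dense matrix and the result in memory. The file format, chunk size, and NumPy-based kernel are illustrative assumptions, not the paper's SSD-optimized implementation.

# Hedged sketch of semi-external sparse matrix x dense matrix multiplication.

import numpy as np

def sem_spmm(triples_path, dense, n_rows, chunk=100_000):
    """triples_path: binary file of (row, col, val) float64 triples.
    dense: (n_cols, k) in-memory dense matrix. Returns the (n_rows, k) product."""
    result = np.zeros((n_rows, dense.shape[1]))
    with open(triples_path, "rb") as f:
        while True:
            buf = np.fromfile(f, dtype=np.float64, count=3 * chunk)
            if buf.size == 0:
                break
            t = buf.reshape(-1, 3)
            rows, cols, vals = t[:, 0].astype(int), t[:, 1].astype(int), t[:, 2]
            # Accumulate val * dense[col, :] into result[row, :] for this chunk.
            np.add.at(result, rows, vals[:, None] * dense[cols])
    return result

# Tiny demo: the 2x2 sparse matrix [[2, 0], [0, 3]] stored as triples on disk.
np.array([[0, 0, 2.0], [1, 1, 3.0]]).tofile("spmm_triples.bin")
dense = np.arange(6, dtype=np.float64).reshape(2, 3)   # 2 x 3 dense matrix
print(sem_spmm("spmm_triples.bin", dense, n_rows=2))    # [[0 2 4], [9 12 15]]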

Journal ArticleDOI
TL;DR: A new method for distributed parallel BFS can compute BFS for a one-trillion-vertex graph within half a second, using large supercomputers such as the K-Computer.
Abstract: There are many large-scale graphs in the real world, such as Web graphs and social graphs. Interest in large-scale graph analysis has grown in recent years. Breadth-First Search (BFS) is one of the most fundamental graph algorithms, used as a component of many graph algorithms. Our new method for distributed parallel BFS can compute BFS for a one-trillion-vertex graph within half a second, using large supercomputers such as the K-Computer. Using our proposed algorithm, the K-Computer was ranked 1st in the Graph500 in June and November 2015 and June 2016, using all of its 82,944 available nodes to achieve 38,621.4 GTEPS. Based on the hybrid BFS algorithm by Beamer (Proceedings of the 2013 IEEE 27th International Symposium on Parallel and Distributed Processing Workshops and PhD Forum, IPDPSW ’13, IEEE Computer Society, Washington, 2013), we devise sets of optimizations for scaling to an extreme number of nodes, including a new efficient graph data structure and several optimization techniques such as vertex reordering and load balancing. Our performance evaluation on the K-Computer shows that our new BFS is 3.19 times faster on 30,720 nodes than the base version using the previously known best techniques.

Journal ArticleDOI
TL;DR: In this paper, the authors describe the design, implementation and performance of the new hybrid parallelization scheme in their Monte Carlo radiative transfer code SKIRT, which has been used extensively for modeling the continuum radiation of dusty astrophysical systems including late-type galaxies and dusty tori.
Abstract: We describe the design, implementation and performance of the new hybrid parallelization scheme in our Monte Carlo radiative transfer code SKIRT, which has been used extensively for modeling the continuum radiation of dusty astrophysical systems including late-type galaxies and dusty tori. The hybrid scheme combines distributed memory parallelization, using the standard Message Passing Interface (MPI) to communicate between processes, and shared memory parallelization, providing multiple execution threads within each process to avoid duplication of data structures. The synchronization between multiple threads is accomplished through atomic operations without high-level locking (also called lock-free programming). This improves the scaling behavior of the code and substantially simplifies the implementation of the hybrid scheme. The result is an extremely flexible solution that adjusts to the number of available nodes, processors and memory, and consequently performs well on a wide variety of computing architectures.

Journal ArticleDOI
TL;DR: In this article, the authors analyzed the memory capacity requirements of important HPC benchmarks and applications and found that most of the HPC applications under study have per-core memory footprints in the range of hundreds of megabytes, but also detect applications and use cases that require gigabytes per core.
Abstract: An important aspect of High-Performance Computing (HPC) system design is the choice of main memory capacity. This choice becomes increasingly important now that 3D-stacked memories are entering the market. Compared with conventional Dual In-line Memory Modules (DIMMs), 3D memory chiplets provide better performance and energy efficiency but lower memory capacities. Therefore, the adoption of 3D-stacked memories in the HPC domain depends on whether we can find use cases that require much less memory than is available now. This study analyzes the memory capacity requirements of important HPC benchmarks and applications. We find that the High-Performance Conjugate Gradients (HPCG) benchmark could be an important success story for 3D-stacked memories in HPC, but High-Performance Linpack (HPL) is likely to be constrained by 3D memory capacity. The study also emphasizes that the analysis of memory footprints of production HPC applications is complex and that it requires an understanding of application scalability and target category, i.e., whether the users target capability or capacity computing. The results show that most of the HPC applications under study have per-core memory footprints in the range of hundreds of megabytes, but we also detect applications and use cases that require gigabytes per core. Overall, the study identifies the HPC applications and use cases with memory footprints that could be provided by 3D-stacked memory chiplets, making a first step toward adoption of this novel technology in the HPC domain.

Proceedings ArticleDOI
04 Apr 2017
TL;DR: Mallacc is proposed, an in-core hardware accelerator designed for broad use across a number of high-performance, modern memory allocators, which accelerates the three primary operations of a typical memory allocation request: size class computation, retrieval of a free memory block, and sampling of memory usage.
Abstract: Recent work shows that dynamic memory allocation consumes nearly 7% of all cycles in Google datacenters. With the trend towards increased specialization of hardware, we propose Mallacc, an in-core hardware accelerator designed for broad use across a number of high-performance, modern memory allocators. The design of Mallacc is quite different from traditional throughput-oriented hardware accelerators. Because memory allocation requests tend to be very frequent, fast, and interspersed inside other application code, accelerators must be optimized for latency rather than throughput, and area overheads must be kept to a bare minimum. Mallacc accelerates the three primary operations of a typical memory allocation request: size class computation, retrieval of a free memory block, and sampling of memory usage. Our results show that malloc latency can be reduced by up to 50% with a hardware cost of less than 1500 μm² of silicon area, less than 0.006% of a typical high-performance processor core.
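
To make the three accelerated operations concrete, here is a hedged Python sketch of the software path a small malloc request typically takes: size-class computation, free-list retrieval, and occasional usage sampling. The size-class table and sampling interval are illustrative, not those of any production allocator or of Mallacc itself.

# Hedged sketch of the allocator fast path that Mallacc targets.

SIZE_CLASSES = [16, 32, 48, 64, 96, 128, 192, 256]            # bytes (illustrative)
free_lists = {c: [] for c in SIZE_CLASSES}                     # per-class free blocks
SAMPLE_EVERY = 512 * 1024                                      # sample ~every 512 KiB
_bytes_until_sample = SAMPLE_EVERY

def size_class(request):
    """Size-class computation: smallest class that fits the request."""
    for c in SIZE_CLASSES:
        if request <= c:
            return c
    return None                                                # large-allocation path

def malloc(request):
    global _bytes_until_sample
    c = size_class(request)
    if c is None:
        return ("large", request)
    block = free_lists[c].pop() if free_lists[c] else ("fresh", c)  # free-list retrieval
    _bytes_until_sample -= c
    if _bytes_until_sample <= 0:                               # memory usage sampling
        _bytes_until_sample = SAMPLE_EVERY
        print(f"sampled allocation of class {c}")
    return block

free_lists[32].append(("recycled", 32))
print(malloc(24))   # -> ('recycled', 32): 24 bytes rounds up to the 32-byte class
print(malloc(24))   # -> ('fresh', 32): that free list is now empty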

Journal ArticleDOI
TL;DR: A brief review of various memory-centric systems that implement different approaches to merging or placing memory near the processing elements, and a deep analysis of several well-known memory-centric systems, are given.
Abstract: The growing rate of technology improvements has caused dramatic advances in processor performance, significantly speeding up processor operating frequencies and increasing the number of instructions that can be processed in parallel. This development in processor technology has brought performance improvements to computer systems, but not for all types of applications. The reason resides in the well-known von Neumann bottleneck, which arises in the communication between the processor and main memory in a standard processor-centric system. This problem has been reviewed by many scientists, who have proposed different approaches to improving memory bandwidth and latency. This paper provides a brief review of these techniques and also gives a deep analysis of various memory-centric systems that implement different approaches to merging or placing the memory near the processing elements. Within this analysis we discuss the advantages, disadvantages, and applications (purposes) of several well-known memory-centric systems.

Journal ArticleDOI
TL;DR: A novel conflict-free access scheme for memory-based fast Fourier transform (FFT) processors is presented and proved to satisfy the constraints of the mixed-radix, continuous-flow, parallel-processing, and variable-size FFT computations.
Abstract: This brief presents a novel conflict-free access scheme for memory-based fast Fourier transform (FFT) processors. It is proved to satisfy the constraints of the mixed-radix, continuous-flow, parallel-processing, and variable-size FFT computations. An address generation unit is also designed and outperforms existing architectures with reduced gate delay and lower hardware complexity.
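
One classic way to obtain conflict-free parallel access for in-place FFTs is to assign the datum at address a to bank (digit-sum of a in base r) mod B, so that the r operands of a butterfly fall in distinct banks. The Python sketch below illustrates that general idea for radix-2 with two banks; it is a well-known scheme used here for illustration only, not necessarily the address generation unit proposed in this brief.

# Hedged sketch of a digit-sum bank assignment for conflict-free FFT access.

def bank(addr, radix, num_banks):
    s = 0
    while addr:
        s += addr % radix
        addr //= radix
    return s % num_banks

# Radix-2 FFT of size 8 with 2 banks: the first stage pairs addresses a and a+4,
# which always map to different banks (their digit sums differ by one).
N, RADIX, BANKS = 8, 2, 2
for a in range(N // 2):
    print(a, a + N // 2, "->", bank(a, RADIX, BANKS), bank(a + N // 2, RADIX, BANKS))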

Journal ArticleDOI
TL;DR: A decomposition of the Tikhonov Regularization (TR) functional is introduced which splits it into several TR functionals, suitably modified to enforce the matching of their solutions, leading to a reduction of the overall execution time.
Abstract: We introduce a decomposition of the Tikhonov Regularization (TR) functional which splits it into several TR functionals, suitably modified in order to enforce the matching of their solutions. As a consequence, instead of solving one problem we can solve several problems that reproduce the initial one at smaller dimensions. Such an approach leads to a reduction of the time complexity of the resulting algorithm. Since the subproblems are solved in parallel, this decomposition also leads to a reduction of the overall execution time. The main outcome of the decomposition is that the parallel algorithm is oriented to exploit the highest performance of parallel architectures where concurrency is implemented at both the coarsest and finest levels of granularity. Performance analysis is discussed in terms of algorithm and software scalability. Validation is performed on a reference parallel architecture made of a distributed-memory multiprocessor and a Graphics Processing Unit. Results are presented for the data assimilation problem in oceanographic models.
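
In generic notation, the kind of splitting described above can be sketched as follows; the operator A, data b, restriction operators R_ij, and weights λ, μ are placeholders rather than the paper's exact formulation. The global TR functional

    J(x) = \|Ax - b\|_2^2 + \lambda \|x\|_2^2

is replaced by p smaller TR functionals, one per (possibly overlapping) block x_i of the unknowns, whose extra terms enforce matching of the local solutions on the overlaps:

    J_i(x_i) = \|A_i x_i - b_i\|_2^2 + \lambda \|x_i\|_2^2
               + \mu \sum_{j \in \mathcal{N}(i)} \|R_{ij} x_i - R_{ji} x_j\|_2^2 ,
    \qquad i = 1, \dots, p,

where R_{ij} restricts x_i to its overlap with block j. Each J_i can then be minimized concurrently, which is what enables the coarse-grained (per-subproblem) and fine-grained (per-subproblem solver) parallelism mentioned in the abstract.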

Journal ArticleDOI
01 Jan 2017
TL;DR: A model to represent and analyze memory disaggregation has been designed, and a statistics-based, queuing-based full-system simulator was developed to rapidly and accurately analyze application performance in disaggregated systems.
Abstract: Next-generation data centers will likely be based on the emerging paradigm of disaggregated function-blocks-as-a-unit, departing from the current state of mainboard-as-a-unit. Multiple functional blocks, or bricks, such as compute, memory, and peripherals will be spread throughout the entire system and interconnected via one or multiple high-speed networks. The amount of memory available will be very large and distributed among multiple bricks. This new architecture brings various benefits that are desirable in today’s data centers, such as fine-grained technology upgrade cycles, fine-grained resource allocation, and access to a larger amount of memory and accelerators. An analysis of the impact and benefits of memory disaggregation is presented in this paper. One of the biggest challenges when analyzing these architectures is that memory accesses must be modeled correctly in order to obtain accurate results. However, modeling every memory access would generate an overhead so high that it could make simulation unfeasible for real data center applications. A model to represent and analyze memory disaggregation has been designed, and a statistics-based, queuing-based full-system simulator was developed to rapidly and accurately analyze application performance in disaggregated systems. With a mean error of 10%, simulation results pointed out that the network layers may introduce overheads that degrade applications’ performance by up to 66%. Initial results also suggest that low memory access bandwidth may degrade applications’ performance by up to 20%.
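
As a back-of-the-envelope illustration of why the network layers dominate, the toy Python model below combines the fraction of accesses that leave the compute brick with a per-access network overhead to estimate slowdown. The numbers and the linear stall model are illustrative assumptions, not parameters or outputs of the paper's simulator.

# Hedged toy model of disaggregated-memory slowdown.

def slowdown(remote_fraction, local_ns=100.0, network_overhead_ns=400.0,
             stall_sensitivity=0.5):
    """Average access latency grows with the remote fraction; only part of the
    extra latency becomes end-to-end slowdown, since computation overlaps some of it."""
    avg_access = ((1 - remote_fraction) * local_ns
                  + remote_fraction * (local_ns + network_overhead_ns))
    return 1.0 + stall_sensitivity * (avg_access / local_ns - 1.0)

for rf in (0.1, 0.5, 1.0):
    print(f"{rf:.0%} remote -> {slowdown(rf):.2f}x runtime")  # 1.20x, 2.00x, 3.00x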

Proceedings Article
12 Jul 2017
TL;DR: This paper identifies, quantifies and demonstrates memory elasticity, an intrinsic property of data-parallel tasks, which allows tasks to run with significantly less memory than they would ideally need while only paying a moderate performance penalty.
Abstract: Understanding the performance of data-parallel workloads when resource-constrained has significant practical importance but unfortunately has received only limited attention. This paper identifies, quantifies and demonstrates memory elasticity, an intrinsic property of data-parallel tasks. Memory elasticity allows tasks to run with significantly less memory than they would ideally need while only paying a moderate performance penalty. For example, we find that given as little as 10% of ideal memory, PageRank and NutchIndexing Hadoop reducers become only 1.2×/1.75× and 1.08× slower. We show that memory elasticity is prevalent in the Hadoop, Spark, Tez and Flink frameworks. We also show that memory elasticity is predictable in nature by building simple models for Hadoop and extending them to Tez and Spark. To demonstrate the potential benefits of leveraging memory elasticity, this paper further explores its application to cluster scheduling. In this setting, we observe that the resource vs. time trade-off enabled by memory elasticity becomes a task queuing time vs. task runtime trade-off. Tasks may complete faster when scheduled with less memory because their waiting time is reduced. We show that a scheduler can turn this task-level trade-off into improved job completion time and cluster-wide memory utilization. We have integrated memory elasticity into Apache YARN. We show gains of up to 60% in average job completion time on a 50-node Hadoop cluster. Extensive simulations show similar improvements over a large number of scenarios.
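
The task-level trade-off described above can be illustrated with a toy calculation: completion time is queuing delay plus runtime, and a memory-elastic task that accepts a small runtime penalty may start much sooner. The penalty curve and all numbers below are assumed for illustration, not measurements from the paper.

# Hedged toy model of the queuing-time vs. runtime trade-off.

def completion_time(ideal_runtime_s, wait_s, memory_fraction, penalty):
    runtime = ideal_runtime_s * penalty(memory_fraction)
    return wait_s + runtime

# Assumed penalty shape, loosely inspired by the observation that some reducers
# run only ~1.2x slower with 10% of ideal memory (not measured data).
penalty = lambda f: 1.0 if f >= 1.0 else 1.0 + 0.25 * (1.0 - f)

# Full memory but a long queue, vs. 10% of memory scheduled almost immediately.
print(completion_time(100, wait_s=80, memory_fraction=1.0, penalty=penalty))  # 180.0
print(completion_time(100, wait_s=5,  memory_fraction=0.1, penalty=penalty))  # 127.5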

Journal ArticleDOI
TL;DR: This paper presents a novel hybrid (shared + distributed memory) parallel algorithm to efficiently detect high-quality communities in massive social networks, and shows the scalability and quality of this algorithm.

Journal ArticleDOI
TL;DR: KiloCore, an array of 1,000 independent processors and 12 memory modules, has been designed to efficiently support applications expressed as groups of fine-grained interconnected tasks, and has been fabricated in 32-nm PD-SOI CMOS.
Abstract: Many important applications can be expressed as a group of fine-grained interconnected tasks, in which individual tasks require under 100 instructions and little data memory. KiloCore, an array of 1,000 independent processors and 12 memory modules, has been designed to efficiently support these applications, and has been fabricated in 32-nm PD-SOI CMOS. Each programmable processor occupies 0.055 mm² and supports energy-efficient computation of small tasks, requiring 17 mW to operate with a clock frequency of 1.24 GHz at 0.9 V. Processors may operate up to 1.78 GHz at 1.1 V, or down to 115 MHz and 0.61 mW at 0.56 V. Coarse-grained tasks are supported with the assistance of the independent memory modules, which can each supply 64 Kbytes of data and instructions to neighboring processors. Processors are connected using complementary circuit and packet-based networks, which offer a total array bisection bandwidth of up to 4.2 Tbps at 1.1 V. Fine-grained tasks are found to have low communication link densities in sampled applications, allowing a large majority of links to be assigned to the energy-efficient, high-performance circuit network.

Journal ArticleDOI
TL;DR: This article details research from Micron Technology in the area of processing in memory as a form of memory-centric computing, which allows DRAM to provide functionality in a heterogeneous system to alleviate the pressures of the von Neumann barrier.
Abstract: Recent activity in near-data processing has built or proposed systems that can exploit technologies such as 3D stacks, in-situ computing, or dataflow devices. However, little effort has been applied to exploit the natural parallelism and throughput of DRAM. This article details research from Micron Technology in the area of processing in memory as a form of memory-centric computing. In-Memory Intelligence (IMI) attempts to place a massive array of bit-serial computing elements on pitch with the memory array, as close to the information as possible. This contrasts with near-memory devices that rely on some form of storage but must communicate with that storage via a fast, low-latency interface. Initial simulations and models show stair-step improvements in performance and power for various applications. Such technology allows DRAM to provide functionality in a heterogeneous system to alleviate the pressures of the von Neumann barrier.

Journal ArticleDOI
TL;DR: A parallel Cartesian-grid-based time-dependent Schrödinger equation (TDSE) solver for modeling laser–atom interactions that uses a split-operator method combined with fast Fourier transforms (FFT) on a three-dimensional Cartesian grid, resulting in good parallel scaling on modern supercomputers.

Journal ArticleDOI
12 Jun 2017
TL;DR: In this paper, the authors propose a neural network model of auto-associative, distributed memory for storage and retrieval of many items (vectors) where the number of stored items can exceed the vector dimension.
Abstract: Introduction. Neural network models of autoassociative, distributed memory allow storage and retrieval of many items (vectors) where the number of stored items can exceed the vector dimension (the ...

Proceedings ArticleDOI
24 Jun 2017
TL;DR: This work first analyzes several memory network (MN) topologies with different mixes of memory package technologies to understand the key tradeoffs and bottlenecks for such systems, and introduces three techniques to address MN latency issues.
Abstract: High-performance computing, enterprise, and datacenter servers are driving demands for higher total memory capacity as well as memory performance. Memory "cubes" with high per-package capacity (from 3D integration) along with high-speed point-to-point interconnects provide a scalable memory system architecture with the potential to deliver both capacity and performance. Multiple such cubes connected together can form a "Memory Network" (MN), but the design space for such MNs is quite vast, including multiple topology types and multiple memory technologies per memory cube. In this work, we first analyze several MN topologies with different mixes of memory package technologies to understand the key tradeoffs and bottlenecks for such systems. We find that most of a MN's performance challenges arise from the interconnection network that binds the memory cubes together. In particular, the arbitration schemes used to route through MNs, the ratio of NVM to DRAM, and the specific topologies used have a dramatic impact on performance and energy results. Our initial analysis indicates that introducing non-volatile memory to the MN presents a unique tradeoff between memory array latency and network latency. We observe that placing NVM cubes in a specific order in the MN improves performance by reducing the network size/diameter, up to a certain NVM-to-DRAM ratio. Novel MN topologies and arbitration schemes also provide performance and energy deltas by reducing the hop count of requests and responses in the MN. Based on our analyses, we introduce three techniques to address MN latency issues: (1) a distance-based arbitration scheme to improve queuing latencies throughout the network, (2) a skip-list topology, derived from the classic data structure, to improve network latency and link usage, and (3) the MetaCube, a denser memory cube that leverages advanced packaging technologies to improve latency by reducing MN size.