Proceedings ArticleDOI
Application-transparent near-memory processing architecture with memory channel network
Mohammad Alian, Seungwon Min, Hadi Asghari-Moghaddam, Ashutosh Dhar, Dong Kai Wang, Thomas Roewer, Adam J. McPadden, Oliver O'Halloran, Deming Chen, Jinjun Xiong, Daehoon Kim, Wen-mei W. Hwu, Nam Sung Kim
pp. 802–814
TL;DR
Memory Channel Network (MCN) can serve as an application-transparent framework that seamlessly unifies near-memory processing within a server and distributed computing across such servers for data-intensive applications.
Abstract
The physical memory capacity of servers is expected to increase drastically with the deployment of forthcoming non-volatile memory technologies, a welcome improvement for emerging data-intensive applications. For such servers to be cost-effective, however, we must also cost-effectively increase compute throughput and memory bandwidth commensurate with the increase in memory capacity, without compromising application readiness. Tackling this challenge, we present the Memory Channel Network (MCN) architecture. First, we propose the MCN DIMM, an extension of a buffered DIMM in which a small but capable processor, called the MCN processor, is integrated with a buffer device on the DIMM for near-memory processing. Second, we implement device drivers that give the host and MCN processors in a server the illusion that they are independent heterogeneous nodes connected through an Ethernet link. This allows the host and MCN processors in a server to run a given data-intensive application together using popular distributed computing frameworks such as MPI and Spark, without any change to the host processor hardware or its application software, while offering the benefits of high-bandwidth, low-latency communication between the host and MCN processors over the memory channels. As such, MCN can serve as an application-transparent framework that seamlessly unifies near-memory processing within a server and distributed computing across such servers for data-intensive applications. Our simulation running the full software stack shows that a server with 8 MCN DIMMs offers 4.56× higher throughput and consumes 47.5% less energy than a cluster with 9 conventional nodes connected through Ethernet links, as it facilitates up to 8.17× higher aggregate DRAM bandwidth utilization. Lastly, we demonstrate the feasibility of MCN with an IBM POWER8 system and an experimental buffered DIMM.
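The abstract's key mechanism is that device drivers make a region of DIMM memory behave like an Ethernet link between the host and the MCN processor. The toy sketch below illustrates that idea only conceptually; it is not the authors' driver code. The ring layout, the names `host_driver` and `mcn_driver`, and the use of two threads sharing one buffer are all illustrative assumptions standing in for two processors sharing DIMM memory.

```python
# Conceptual sketch: a polling "virtual NIC" whose wire is shared memory.
# The host-side driver writes frames into a ring buffer; the MCN-processor
# side polls the ring and delivers the frames, as over a network link.
import threading
import time

RING_SLOTS = 4    # frame slots in the shared "NIC" ring (toy size)
SLOT_SIZE = 64    # bytes per slot in this toy model

ring = bytearray(RING_SLOTS * SLOT_SIZE)  # stands in for DIMM memory
head = 0  # next slot the host driver writes (producer index)
tail = 0  # next slot the MCN driver reads (consumer index)

def host_driver(frames):
    """Host NIC driver: enqueue outgoing frames into the memory ring."""
    global head
    for payload in frames:
        while (head + 1) % RING_SLOTS == tail:
            time.sleep(0)  # ring full: yield, as a polling driver would
        base = head * SLOT_SIZE
        data = payload.encode()[:SLOT_SIZE]
        ring[base:base + SLOT_SIZE] = data.ljust(SLOT_SIZE, b"\x00")
        head = (head + 1) % RING_SLOTS  # publish the frame

def mcn_driver(n_frames, received):
    """MCN-processor driver: poll the ring and deliver incoming frames."""
    global tail
    for _ in range(n_frames):
        while tail == head:
            time.sleep(0)  # ring empty: keep polling
        base = tail * SLOT_SIZE
        frame = bytes(ring[base:base + SLOT_SIZE]).rstrip(b"\x00")
        received.append(frame.decode())
        tail = (tail + 1) % RING_SLOTS  # free the slot

received = []
frames = ["SYN", "payload-0", "payload-1", "FIN"]
consumer = threading.Thread(target=mcn_driver, args=(len(frames), received))
consumer.start()
host_driver(frames)
consumer.join()
print(received)  # frames arrive in order, as over a point-to-point link
```

In the real design, each side's driver registers as a standard network interface, so unmodified TCP/IP-based stacks such as MPI and Spark run on top of it unchanged; that application transparency is the point of the architecture.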
Citations
Proceedings ArticleDOI
TensorDIMM: A Practical Near-Memory Processing Architecture for Embeddings and Tensor Operations in Deep Learning
TL;DR: In this article, the authors present a vertically integrated hardware/software co-design, which includes a custom DIMM module enhanced with near-memory processing cores tailored for DL tensor operations.
Journal ArticleDOI
Processing-in-memory: A workload-driven perspective
TL;DR: This article describes the work on systematically identifying opportunities for PIM in real applications and quantifies potential gains for popular emerging applications (e.g., machine learning, data analytics, genome analysis) and describes challenges that remain for the widespread adoption of PIM.
Proceedings ArticleDOI
Hardware Architecture and Software Stack for PIM Based on Commercial DRAM Technology : Industrial Product
Sukhan Lee, Shin-haeng Kang, Jae-Hoon Lee, Hyeon-Su Kim, Eojin Lee, Seung-Woo Seo, Hosang Yoon, Seung-Won Lee, Kyoung-Hwan Lim, Hyun-Sung Shin, Jin-Hyun Kim, Seongil O, Anand Iyer, David T. Wang, Kyomin Sohn, Nam Sung Kim
TL;DR: The authors propose an innovative yet practical processing-in-memory (PIM) architecture based on commercial DRAM technology, which improves the performance of memory-bound neural network kernels and applications by 11.2× and 3.5×, respectively.
Proceedings ArticleDOI
NERO: A Near High-Bandwidth Memory Stencil Accelerator for Weather Prediction Modeling
Gagandeep Singh, Dionysios Diamantopoulos, Christoph Hagleitner, Juan Gómez-Luna, Sander Stuijk, Onur Mutlu, Henk Corporaal
TL;DR: NERO, an FPGA+HBM-based accelerator connected to an IBM POWER9 host system through IBM CAPI2 (Coherent Accelerator Processor Interface), is developed; the authors conclude that near-memory acceleration is a promising means of achieving both high performance and high energy efficiency for weather prediction modeling.
References
Proceedings ArticleDOI
The Hadoop Distributed File System
TL;DR: The architecture of HDFS is described and experience using HDFS to manage 25 petabytes of enterprise data at Yahoo! is reported on.
Proceedings Article
Spark: cluster computing with working sets
TL;DR: Spark can outperform Hadoop by 10x in iterative machine learning jobs, and can be used to interactively query a 39 GB dataset with sub-second response time.
Journal ArticleDOI
The gem5 simulator
Nathan Binkert, Bradford M. Beckmann, Gabriel Black, Steven K. Reinhardt, Ali G. Saidi, Arkaprava Basu, Joel Hestness, Derek R. Hower, Tushar Krishna, Somayeh Sardashti, Rathijit Sen, Korey Sewell, Muhammad Shoaib, Nilay Vaish, Mark D. Hill, David A. Wood
TL;DR: The high level of collaboration on the gem5 project, combined with the previous success of the component parts and a liberal BSD-like license, make gem5 a valuable full-system simulation tool.
Proceedings ArticleDOI
McPAT: an integrated power, area, and timing modeling framework for multicore and manycore architectures
TL;DR: Combining the power, area, and timing results of McPAT with performance simulation of PARSEC benchmarks at the 22nm technology node, for both common in-order and out-of-order manycore designs, shows that when die cost is not taken into account, clustering 8 cores together gives the best energy-delay product, whereas when cost is taken into account, configuring clusters with 4 cores gives the best EDA²P and EDAP.
Journal ArticleDOI
The NAS Parallel Benchmarks
David H. Bailey, Eric Barszcz, John T. Barton, D. S. Browning, Russell Carter, Leonardo Dagum, Rod Fatoohi, Paul O. Frederickson, T. A. Lasinski, Robert Schreiber, Horst D. Simon, V. Venkatakrishnan, Sisira Weeratunga
TL;DR: A new set of benchmarks has been developed for the performance evaluation of highly parallel supercomputers; the benchmarks mimic the computation and data movement characteristics of large-scale computational fluid dynamics applications.