scispace - formally typeset
Proceedings ArticleDOI

Application-transparent near-memory processing architecture with memory channel network

TLDR
Memory Channel Network can serve as an application-transparent framework which can seamlessly unify near-memory processing within a server and distributed computing across such servers for data-intensive applications.
Abstract
The physical memory capacity of servers is expected to increase drastically with the deployment of the forthcoming non-volatile memory technologies. This is a welcomed improvement for the emerging data-intensive applications. For such servers to be cost-effective, nonetheless, we must cost-effectively increase compute throughput and memory bandwidth commensurate with the increase in memory capacity without compromising the application readiness. Tackling this challenge, we present Memory Channel Network (MCN) architecture in this paper. Specifically, first, we propose an MCN DIMM, an extension of a buffered DIMM where a small but capable processor called MCN processor is integrated with a buffer device on the DIMM for near-memory processing. Second, we implement device drivers to give the host and MCN processors in a server an illusion that they are independent heterogeneous nodes connected through an Ethernet link. These allow the host and MCN processors in a server to run a given data-intensive application together based on popular distributed computing frameworks such as MPI and Spark without any change in the host processor hardware and its application software, while offering the benefits of high-bandwidth and low-latency communication between the host and MCN processors over the memory channels. As such, MCN can serve as an application-transparent framework which can seamlessly unify the near-memory processing within a server and the distributed computing across such servers for data-intensive applications. Our simulation running the full software stack shows that a server with 8 MCN DIMMs offers 4.56 x higher throughput and consume 47.5% less energy than a cluster with 9 conventional nodes connected through Ethernet links, as it facilitates up to 8.17 x higher aggregate DRAM bandwidth utilization. Lastly, we demonstrate the feasibility of MCN with an IBM POWER8 system and an experimental buffered DIMM.

read more

Citations
More filters
Proceedings ArticleDOI

TensorDIMM: A Practical Near-Memory Processing Architecture for Embeddings and Tensor Operations in Deep Learning

TL;DR: In this article, the authors present a vertically integrated hardware/software co-design, which includes a custom DIMM module enhanced with near-memory processing cores tailored for DL tensor operations.
Journal ArticleDOI

Processing-in-memory: A workload-driven perspective

TL;DR: This article describes the work on systematically identifying opportunities for PIM in real applications and quantifies potential gains for popular emerging applications (e.g., machine learning, data analytics, genome analysis) and describes challenges that remain for the widespread adoption of PIM.
Proceedings ArticleDOI

Hardware Architecture and Software Stack for PIM Based on Commercial DRAM Technology : Industrial Product

TL;DR: Wang et al. as discussed by the authors proposed an innovative yet practical processing-in-memory (PIM) architecture, which improves the performance of memory-bound neural network kernels and applications by 11.2× and 3.5× respectively.
Posted Content

TensorDIMM: A Practical Near-Memory Processing Architecture for Embeddings and Tensor Operations in Deep Learning

TL;DR: This paper presents a vertically integrated hardware/software co-design, which includes a custom DIMM module enhanced with near-memory processing cores tailored for DL tensor operations, populated inside a GPU-centric system interconnect as a remote memory pool.
Proceedings ArticleDOI

NERO: A Near High-Bandwidth Memory Stencil Accelerator for Weather Prediction Modeling

TL;DR: NERO, an FPGA+HBM-based accelerator connected through IBM CAPI2 (Coherent Accelerator Processor Interface) to an IBM POWER9 host system is developed and it is concluded that employing near-memory acceleration solutions for weather prediction modeling is promising as a means to achieve both high performance and high energy efficiency.
References
More filters
Proceedings ArticleDOI

The Hadoop Distributed File System

TL;DR: The architecture of HDFS is described and experience using HDFS to manage 25 petabytes of enterprise data at Yahoo! is reported on.
Proceedings Article

Spark: cluster computing with working sets

TL;DR: Spark can outperform Hadoop by 10x in iterative machine learning jobs, and can be used to interactively query a 39 GB dataset with sub-second response time.
Proceedings ArticleDOI

McPAT: an integrated power, area, and timing modeling framework for multicore and manycore architectures

TL;DR: Combining power, area, and timing results of McPAT with performance simulation of PARSEC benchmarks at the 22nm technology node for both common in-order and out-of-order manycore designs shows that when die cost is not taken into account clustering 8 cores together gives the best energy-delay product, whereas when cost is taking into account configuring clusters with 4 cores gives thebest EDA2P and EDAP.
Journal ArticleDOI

The Nas Parallel Benchmarks

TL;DR: A new set of benchmarks has been developed for the performance evaluation of highly parallel supercom puters that mimic the computation and data move ment characteristics of large-scale computational fluid dynamics applications.
Related Papers (5)