Topic

Multi-channel memory architecture

About: Multi-channel memory architecture is a research topic. Over its lifetime, 329 publications have been published within this topic, receiving 5,548 citations. The topic is also known as multi-channel memory and multi-channel RAM.


Papers
Proceedings ArticleDOI
07 Dec 2013
TL;DR: RowClone is proposed, a new and simple mechanism to perform bulk copy and initialization completely within DRAM — eliminating the need to transfer any data over the memory channel to perform such operations.
Abstract: Several system-level operations trigger bulk data copy or initialization. Even though these bulk data operations do not require any computation, current systems transfer a large quantity of data back and forth on the memory channel to perform such operations. As a result, bulk data operations consume high latency, bandwidth, and energy — degrading both system performance and energy efficiency. In this work, we propose RowClone, a new and simple mechanism to perform bulk copy and initialization completely within DRAM — eliminating the need to transfer any data over the memory channel to perform such operations. Our key observation is that DRAM can internally and efficiently transfer a large quantity of data (multiple KBs) between a row of DRAM cells and the associated row buffer. Based on this, our primary mechanism can quickly copy an entire row of data from a source row to a destination row by first copying the data from the source row to the row buffer and then from the row buffer to the destination row, via two back-to-back activate commands. This mechanism, which we call the Fast Parallel Mode of RowClone, reduces the latency and energy consumption of a 4KB bulk copy operation by 11.6× and 74.4×, respectively, and a 4KB bulk zeroing operation by 6.0× and 41.5×, respectively. To efficiently copy data between rows that do not share a row buffer, we propose a second mode of RowClone, the Pipelined Serial Mode, which uses the shared internal bus of a DRAM chip to quickly copy data between two banks. RowClone requires only a 0.01% increase in DRAM chip area. We quantitatively evaluate the benefits of RowClone by focusing on fork, one of the frequently invoked system calls, and five other copy and initialization intensive applications. Our results show that RowClone can significantly improve both single-core and multi-core system performance, while also significantly reducing main memory bandwidth and energy consumption.

385 citations
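
The Fast Parallel Mode described in the RowClone abstract above (source row to row buffer, then row buffer to destination row) can be illustrated with a toy functional model. This is only a sketch: the class `ToySubarray`, its method names, and the `channel_bytes` counter are hypothetical, and the model captures data movement only, not DRAM timing, energy, or the reported 11.6× and 74.4× figures.

```python
class ToySubarray:
    """Toy model of DRAM rows that share one row buffer (hypothetical names)."""

    def __init__(self, num_rows, row_bytes):
        self.rows = [bytearray(row_bytes) for _ in range(num_rows)]
        self.row_buffer = bytearray(row_bytes)
        self.channel_bytes = 0  # bytes moved over the modeled memory channel

    def activate(self, row):
        # First activate: the source row's contents are latched into the row buffer.
        self.row_buffer[:] = self.rows[row]

    def activate_onto(self, row):
        # Second, back-to-back activate: the row buffer's contents land in this row.
        self.rows[row][:] = self.row_buffer

    def in_dram_copy(self, src, dst):
        # Copy entirely inside the subarray; nothing crosses the channel.
        self.activate(src)
        self.activate_onto(dst)

    def conventional_copy(self, src, dst):
        # Baseline: read the row out over the channel, then write it back.
        data = bytes(self.rows[src])
        self.channel_bytes += len(data)  # read traffic
        self.rows[dst][:] = data
        self.channel_bytes += len(data)  # write traffic


if __name__ == "__main__":
    sub = ToySubarray(num_rows=4, row_bytes=4096)
    sub.rows[0][:] = bytes([0xAB]) * 4096
    sub.in_dram_copy(0, 1)
    print(sub.rows[1] == sub.rows[0], sub.channel_bytes)
    sub.conventional_copy(0, 2)
    print(sub.rows[2] == sub.rows[0], sub.channel_bytes)
```

In this model the in-DRAM copy leaves `channel_bytes` at zero, while the conventional path counts each byte twice (one read, one write), which is the channel traffic the abstract says RowClone eliminates.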

Proceedings ArticleDOI
03 Dec 2011
TL;DR: In this paper, the authors present an alternative approach to reduce inter-application interference in the memory system: application-aware memory channel partitioning (MCP), which maps the data of applications that are likely to severely interfere with each other to different memory channels.
Abstract: Main memory is a major shared resource among cores in a multicore system. If the interference between different applications' memory requests is not controlled effectively, system performance can degrade significantly. Previous work aimed to mitigate the problem of interference between applications by changing the scheduling policy in the memory controller, i.e., by prioritizing memory requests from applications in a way that benefits system performance. In this paper, we first present an alternative approach to reducing inter-application interference in the memory system: application-aware memory channel partitioning (MCP). The idea is to map the data of applications that are likely to severely interfere with each other to different memory channels. The key principles are to partition onto separate channels 1) the data of light (memory non-intensive) and heavy (memory-intensive) applications, 2) the data of applications with low and high row-buffer locality. Second, we observe that interference can be further reduced with a combination of memory channel partitioning and scheduling, which we call integrated memory partitioning and scheduling (IMPS). The key idea is to 1) always prioritize very light applications in the memory scheduler since such applications cause negligible interference to others, 2) use MCP to reduce interference among the remaining applications. We evaluate MCP and IMPS on a variety of multi-programmed workloads and system configurations and compare them to four previously proposed state-of-the-art memory scheduling policies. Averaged over 240 workloads on a 24-core system with 4 memory channels, MCP improves system throughput by 7.1% over an application-unaware memory scheduler and 1% over the previous best scheduler, while avoiding modifications to existing memory schedulers. IMPS improves system throughput by 11.1% over an application-unaware scheduler and 5% over the previous best scheduler, while incurring much lower hardware complexity than the latter.

281 citations
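
The two partitioning principles in the MCP abstract above (separate light from heavy applications, and separate low from high row-buffer locality) can be sketched as a small classifier. Everything specific here is an assumption made for illustration: the `AppProfile` fields, the use of misses per kilo-instruction as the intensity metric, both threshold values, and the even split of channels between groups are not taken from the abstract.

```python
from dataclasses import dataclass


@dataclass
class AppProfile:
    name: str
    mpki: float          # assumed intensity metric: misses per kilo-instruction
    row_hit_rate: float  # assumed locality metric: fraction of row-buffer hits


def classify(app, mpki_threshold=1.0, hit_threshold=0.5):
    """Place an application in one of three groups (thresholds are illustrative)."""
    if app.mpki < mpki_threshold:
        return "light"
    if app.row_hit_rate >= hit_threshold:
        return "heavy_high_rbl"
    return "heavy_low_rbl"


def partition_channels(apps, num_channels, mpki_threshold=1.0, hit_threshold=0.5):
    """Map each application name to a set of channels disjoint from other groups."""
    groups = {}
    for app in apps:
        label = classify(app, mpki_threshold, hit_threshold)
        groups.setdefault(label, []).append(app.name)
    if not groups:
        return {}
    labels = sorted(groups)
    if len(labels) > num_channels:
        raise ValueError("need at least one channel per non-empty group")
    # Assumed policy: split channels as evenly as possible between groups.
    base, extra = divmod(num_channels, len(labels))
    mapping, start = {}, 0
    for i, label in enumerate(labels):
        count = base + (1 if i < extra else 0)
        channels = list(range(start, start + count))
        start += count
        for name in groups[label]:
            mapping[name] = channels
    return mapping


if __name__ == "__main__":
    apps = [
        AppProfile("editor", mpki=0.2, row_hit_rate=0.9),
        AppProfile("stream", mpki=20.0, row_hit_rate=0.9),
        AppProfile("random", mpki=20.0, row_hit_rate=0.1),
    ]
    print(partition_channels(apps, num_channels=4))
```

A real policy would likely size each group's share of channels by its bandwidth demand rather than evenly; the even split is used here only to keep the sketch short.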

Patent
22 Nov 1996
TL;DR: In this article, a general purpose programmable media processor for processing and transmitting a media data stream of audio, video, radio, graphics, encryption, authentication, and networking information in real-time is presented.
Abstract: A general purpose, programmable media processor for processing and transmitting a media data stream of audio, video, radio, graphics, encryption, authentication, and networking information in real-time. The media processor incorporates an execution unit that maintains substantially peak data throughput of media data streams. The execution unit includes a dynamically partitionable multi-precision arithmetic unit, programmable switch and programmable extended mathematical element. A high bandwidth external interface supplies media data streams at substantially peak rates to a general purpose register file and the multi-precision execution unit. A memory management unit, and instruction and data cache/buffers are also provided. High bandwidth memory controllers are linked in series to provide a memory channel to the general purpose, programmable media processor. The general purpose, programmable media processor is disposed in a network fabric consisting of fiber optic cable, coaxial cable and twisted pair wires to transmit, process and receive single or unified media data streams. Parallel general purpose media processors are disposed throughout the network in a distributed virtual manner to allow for multi-processor operations and sharing of resources through the network. A method for receiving, processing and transmitting media data streams over the communications fabric is also provided.

263 citations

Proceedings ArticleDOI
08 Nov 2008
TL;DR: A novel idea called mini-rank for DDRx (DDR/DDR2/DDR3) DRAMs is proposed, which uses a small bridge chip on each DRAM DIMM to break a conventional DRAM rank into multiple smaller mini-ranks so as to reduce the number of devices involved in a single memory access.
Abstract: The widespread use of multicore processors has dramatically increased the demand for high memory bandwidth and large memory capacity. As DRAM subsystem designs stretch to meet the demand, memory power consumption is now approaching that of processors. However, the conventional DRAM architecture prevents any meaningful power and performance trade-offs for memory-intensive workloads. We propose a novel idea called mini-rank for DDRx (DDR/DDR2/DDR3) DRAMs, which uses a small bridge chip on each DRAM DIMM to break a conventional DRAM rank into multiple smaller mini-ranks so as to reduce the number of devices involved in a single memory access. The design dramatically reduces the memory power consumption with only a slight increase in the memory idle latency. It does not change the DDRx bus protocol and its configuration can be adapted for the best performance-power trade-offs. Our experimental results using four-core multiprogramming workloads show that using x32 mini-ranks reduces memory power by 27.0% with 2.8% performance penalty and using x16 mini-ranks reduces memory power by 44.1% with 7.4% performance penalty on average for memory-intensive workloads, respectively.

256 citations
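
The trade-off in the mini-rank abstract above (fewer devices per access, slightly longer transfers) can be made concrete with some back-of-the-envelope arithmetic. The sketch rests on assumptions the abstract does not state: a 64-bit conventional rank built from x8 devices, a 64-byte cache line, and a reading of "x32" and "x16" as the mini-rank data width in bits. Both function names are hypothetical.

```python
def devices_per_access(access_width_bits, device_width_bits=8):
    """Number of DRAM devices that respond to one access of the given width."""
    if access_width_bits % device_width_bits:
        raise ValueError("access width must be a multiple of device width")
    return access_width_bits // device_width_bits


def beats_per_line(access_width_bits, line_bytes=64):
    """Data-bus beats needed to move one cache line at the given width."""
    return (line_bytes * 8) // access_width_bits


if __name__ == "__main__":
    # 64 = assumed conventional rank width; 32 and 16 = assumed mini-rank widths.
    for width in (64, 32, 16):
        print(width, devices_per_access(width), beats_per_line(width))
```

Under these assumptions, halving the access width halves the number of devices involved in each access and doubles the beats needed per cache line, which matches the direction of the power and latency effects the abstract reports.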

Journal ArticleDOI
TL;DR: MC implements a form of virtual shared memory that permits applications to completely bypass the operating system and perform cluster communication directly from the user level, cutting communication latency by up to two orders of magnitude and overhead by up to three.
Abstract: A memory-based networking approach provides clusters of computers up to 1,000 times the communication performance of conventional networks, with no compromise in cost or reliability. The memory channel for PCI's performance gains are the result of a system design approach that exploits natural cluster constraints to define a memory-based network. MC implements a form of virtual shared memory that permits applications to completely bypass the operating system and perform cluster communication directly from the user level. The hardware's simple and powerful communication model supports error handling at almost no cost or complexity to the application; guaranteed ordering under errors is the key innovation. The end result: Real-world cluster communication latency dropped by up to two orders of magnitude, and overhead by up to three orders of magnitude. These improvements elevate a lowly set of standard PCI computers running Unix into an impressive, highly available, parallel computing system.

155 citations


Network Information
Related Topics (5)
Cache: 59.1K papers, 976.6K citations, 77% related
Semiconductor memory: 45.4K papers, 663.1K citations, 74% related
Scalability: 50.9K papers, 931.6K citations, 72% related
Compiler: 26.3K papers, 578.5K citations, 71% related
Integrated circuit: 82.7K papers, 1M citations, 71% related
Performance Metrics
No. of papers in the topic in previous years
Year  Papers
2021  11
2020  18
2019  23
2018  15
2017  17
2016  28