Topic

Multi-core processor

About: Multi-core processor is a research topic. Over its lifetime, 15,435 publications have been published within this topic, receiving 266,049 citations. The topic is also known as: multi-core & multicore processor.


Papers
Proceedings ArticleDOI
03 Dec 2003
TL;DR: This paper proposes and evaluates single-ISA heterogeneous multi-core architectures as a mechanism to reduce processor power dissipation; results indicate a 39% average energy reduction while sacrificing only 3% in performance.
Abstract: This paper proposes and evaluates single-ISA heterogeneous multi-core architectures as a mechanism to reduce processor power dissipation. Our design incorporates heterogeneous cores representing different points in the power/performance design space; during an application's execution, system software dynamically chooses the most appropriate core to meet specific performance and power requirements. Our evaluation of this architecture shows significant energy benefits. For an objective function that optimizes for energy efficiency with a tight performance threshold, for 14 SPEC benchmarks, our results indicate a 39% average energy reduction while sacrificing only 3% in performance. An objective function that optimizes for energy-delay with looser performance bounds achieves, on average, nearly a factor-of-three improvement in energy-delay product while sacrificing only 22% in performance. The energy savings are substantially greater than those from chip-wide voltage/frequency scaling.

809 citations
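The core-selection policy the abstract describes (system software picking, at run time, the core that best meets current performance and power requirements) can be illustrated in a few lines. The sketch below is a simplified, hypothetical version of that idea, not the paper's actual mechanism: the core names, IPS and power figures, and the 90% performance threshold are all illustrative assumptions.

```python
# Minimal sketch of dynamic core selection on a single-ISA heterogeneous
# multi-core: among cores whose estimated performance for the current phase
# is within a threshold of the fastest core, pick the one with the lowest
# energy per instruction. All values below are illustrative assumptions.

from dataclasses import dataclass

@dataclass
class CoreEstimate:
    name: str
    ips: float     # estimated instructions per second for this phase
    watts: float   # estimated power draw for this phase

def pick_core(estimates, perf_threshold=0.90):
    """Return the core minimizing energy per instruction, among cores whose
    performance is at least perf_threshold of the fastest core's."""
    best_ips = max(e.ips for e in estimates)
    eligible = [e for e in estimates if e.ips >= perf_threshold * best_ips]
    # Energy per instruction = power / throughput; smaller is better.
    return min(eligible, key=lambda e: e.watts / e.ips)

if __name__ == "__main__":
    phase = [
        CoreEstimate("big-ooo", ips=2.0e9, watts=20.0),       # aggressive out-of-order core
        CoreEstimate("mid-core", ips=1.9e9, watts=9.0),        # simpler core, nearly as fast here
        CoreEstimate("tiny-inorder", ips=0.8e9, watts=1.5),    # too slow for this threshold
    ]
    print("run this phase on:", pick_core(phase).name)
```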

Proceedings ArticleDOI
25 Oct 2008
TL;DR: Mars hides the programming complexity of the GPU behind the simple and familiar MapReduce interface, and is up to 16 times faster than its CPU-based counterpart for six common web applications on a quad-core machine.
Abstract: We design and implement Mars, a MapReduce framework, on graphics processors (GPUs). MapReduce is a distributed programming framework originally proposed by Google for the ease of development of web search applications on a large number of commodity CPUs. Compared with CPUs, GPUs have an order of magnitude higher computation power and memory bandwidth, but are harder to program since their architectures are designed as a special-purpose co-processor and their programming interfaces are typically for graphics applications. As the first attempt to harness GPU's power for MapReduce, we developed Mars on an NVIDIA G80 GPU, which contains over one hundred processors, and evaluated it in comparison with Phoenix, the state-of-the-art MapReduce framework on multi-core CPUs. Mars hides the programming complexity of the GPU behind the simple and familiar MapReduce interface. It is up to 16 times faster than its CPU-based counterpart for six common web applications on a quad-core machine.

793 citations
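For readers unfamiliar with the programming model Mars exposes, the sketch below shows the map/shuffle/reduce contract in plain Python, using word count as the example. It is not Mars's GPU API; the function names and the sequential in-memory shuffle are assumptions made purely for illustration.

```python
# CPU-only sketch of the MapReduce programming model: map emits (key, value)
# pairs, the framework groups values by key, and reduce folds each group.

from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    # Map: each record emits zero or more (key, value) pairs.
    intermediate = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            intermediate[key].append(value)
    # Reduce: each key's values are folded into a single result.
    return {key: reduce_fn(key, values) for key, values in intermediate.items()}

# Word count, a common MapReduce example.
def map_fn(line):
    for word in line.split():
        yield word.lower(), 1

def reduce_fn(word, counts):
    return sum(counts)

if __name__ == "__main__":
    docs = ["the quick brown fox", "the lazy dog", "the fox"]
    print(run_mapreduce(docs, map_fn, reduce_fn))
```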

Proceedings ArticleDOI
15 Nov 2008
TL;DR: In this article, the authors present performance results for dense linear algebra using recent NVIDIA GPUs, argue that modern GPUs should be viewed as multithreaded multicore vector units, and exploit blocking, as on vector computers, together with the heterogeneity of the system.
Abstract: We present performance results for dense linear algebra using recent NVIDIA GPUs. Our matrix-matrix multiply routine (GEMM) runs up to 60% faster than the vendor's implementation and approaches the peak of hardware capabilities. Our LU, QR and Cholesky factorizations achieve up to 80--90% of the peak GEMM rate. Our parallel LU running on two GPUs achieves up to ~540 Gflop/s. These results are accomplished by challenging the accepted view of the GPU architecture and programming guidelines. We argue that modern GPUs should be viewed as multithreaded multicore vector units. We exploit blocking similarly to vector computers and heterogeneity of the system by computing both on GPU and CPU. This study includes detailed benchmarking of the GPU memory system that reveals sizes and latencies of caches and TLB. We present a couple of algorithmic optimizations aimed at increasing parallelism and regularity in the problem that provide us with slightly higher performance.

787 citations
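The blocking the authors borrow from vector computers can be illustrated independently of the GPU. The sketch below is a plain NumPy version of a blocked GEMM; the loop structure, the 64-wide tile size, and the matrix shapes are chosen only for illustration, and this is in no way the paper's GPU kernel.

```python
# Sketch of blocked GEMM: compute C = A @ B one tile at a time so each tile
# of A and B is reused many times while it is "hot" (in cache on a CPU, in
# shared memory / registers on a GPU).

import numpy as np

def blocked_gemm(A, B, block=64):
    m, k = A.shape
    k2, n = B.shape
    assert k == k2
    C = np.zeros((m, n), dtype=A.dtype)
    for i in range(0, m, block):
        for j in range(0, n, block):
            for p in range(0, k, block):
                # Each small update reuses the A and B tiles roughly `block` times.
                C[i:i+block, j:j+block] += A[i:i+block, p:p+block] @ B[p:p+block, j:j+block]
    return C

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.standard_normal((256, 192))
    B = rng.standard_normal((192, 320))
    assert np.allclose(blocked_gemm(A, B), A @ B)
```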

Journal ArticleDOI
01 Aug 2008
TL;DR: This article consists of a collection of slides from the author's conference presentation; some of the topics discussed include architecture convergence, the Larrabee architecture, and the graphics pipeline.
Abstract: This paper presents a many-core visual computing architecture code-named Larrabee, a new software rendering pipeline, a many-core programming model, and performance analysis for several applications. Larrabee uses multiple in-order x86 CPU cores that are augmented by a wide vector processor unit, as well as some fixed-function logic blocks. This provides dramatically higher performance per watt and per unit of area than out-of-order CPUs on highly parallel workloads. It also greatly increases the flexibility and programmability of the architecture as compared to standard GPUs. A coherent on-die second-level cache allows efficient inter-processor communication and high-bandwidth local data access by CPU cores. Task scheduling is performed entirely in software in Larrabee, rather than in fixed-function logic. The customizable software graphics rendering pipeline for this architecture uses binning in order to reduce required memory bandwidth, minimize lock contention, and increase opportunities for parallelism relative to standard GPUs. The Larrabee native programming model supports a variety of highly parallel applications that use irregular data structures. Performance analysis on those applications demonstrates Larrabee's potential for a broad range of parallel computation.

784 citations
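The binning step of the software pipeline can be sketched in a few lines: assign each triangle to the screen-space tiles its bounding box overlaps, so each tile can later be shaded by one core with purely local framebuffer traffic. The sketch below is an illustrative assumption (tile size, data layout, and function names are invented), not Larrabee's actual renderer.

```python
# Minimal sketch of triangle binning for a tile-based software rasterizer.

TILE = 64  # pixels per tile edge (illustrative choice)

def bin_triangles(triangles, width, height):
    """triangles: list of ((x0, y0), (x1, y1), (x2, y2)) in pixel coordinates.
    Returns {(tile_x, tile_y): [triangle_index, ...]}."""
    tiles_x = (width + TILE - 1) // TILE
    tiles_y = (height + TILE - 1) // TILE
    bins = {(tx, ty): [] for tx in range(tiles_x) for ty in range(tiles_y)}
    for idx, tri in enumerate(triangles):
        xs = [p[0] for p in tri]
        ys = [p[1] for p in tri]
        # Conservative bound: every tile touched by the triangle's bounding box.
        tx0 = max(0, int(min(xs)) // TILE)
        tx1 = min(tiles_x - 1, int(max(xs)) // TILE)
        ty0 = max(0, int(min(ys)) // TILE)
        ty1 = min(tiles_y - 1, int(max(ys)) // TILE)
        for ty in range(ty0, ty1 + 1):
            for tx in range(tx0, tx1 + 1):
                bins[(tx, ty)].append(idx)
    return bins

if __name__ == "__main__":
    tris = [((10, 10), (120, 30), (40, 90)), ((500, 400), (520, 460), (560, 410))]
    for tile, tri_ids in bin_triangles(tris, 640, 480).items():
        if tri_ids:
            print(tile, tri_ids)
```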

Proceedings Article
Adam Coates, Brody Huval, Tao Wang, David J. Wu, Bryan Catanzaro, Andrew Ng
16 Jun 2013
TL;DR: This paper presents technical details and results from the authors' own system based on Commodity Off-The-Shelf High Performance Computing (COTS HPC) technology, a cluster of GPU servers with InfiniBand interconnects and MPI, and shows that it can scale to networks with over 11 billion parameters using just 16 machines.
Abstract: Scaling up deep learning algorithms has been shown to lead to increased performance in benchmark tasks and to enable discovery of complex high-level features. Recent efforts to train extremely large networks (with over 1 billion parameters) have relied on cloud-like computing infrastructure and thousands of CPU cores. In this paper, we present technical details and results from our own system based on Commodity Off-The-Shelf High Performance Computing (COTS HPC) technology: a cluster of GPU servers with InfiniBand interconnects and MPI. Our system is able to train networks with 1 billion parameters on just 3 machines in a couple of days, and we show that it can scale to networks with over 11 billion parameters using just 16 machines. As this infrastructure is much more easily marshaled by others, the approach enables much wider-spread research with extremely large neural networks.

740 citations
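To make the MPI side of such a cluster concrete, the sketch below shows one common pattern for distributing training with MPI: data-parallel gradient averaging using mpi4py's Allreduce. Note that this is not the paper's scheme (the authors use a model-parallel arrangement across GPU servers); the fake NumPy "gradient", parameter count, and learning rate are all illustrative assumptions.

```python
# Data-parallel training sketch: each worker computes a gradient on its shard,
# the gradients are averaged across workers with MPI Allreduce, and every
# worker applies the same update.
#
# Run with, e.g.:  mpiexec -n 4 python allreduce_sketch.py

import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

def local_gradient(step, n_params):
    # Stand-in for a backward pass on this worker's shard of the minibatch.
    rng = np.random.default_rng(seed=step * size + rank)
    return rng.standard_normal(n_params).astype(np.float32)

params = np.zeros(1_000_000, dtype=np.float32)
lr = 0.01

for step in range(5):
    grad = local_gradient(step, params.size)
    avg = np.empty_like(grad)
    # Sum gradients from all workers, then divide to get the average.
    comm.Allreduce(grad, avg, op=MPI.SUM)
    avg /= size
    params -= lr * avg

if rank == 0:
    print("finished", step + 1, "steps on", size, "workers")
```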


Network Information
Related Topics (5)
Scalability: 50.9K papers, 931.6K citations, 93% related
Cache: 59.1K papers, 976.6K citations, 91% related
Server: 79.5K papers, 1.4M citations, 90% related
Scheduling (computing): 78.6K papers, 1.3M citations, 85% related
Network packet: 159.7K papers, 2.2M citations, 85% related
Performance Metrics
No. of papers in the topic in previous years:
Year    Papers
2023    197
2022    477
2021    367
2020    545
2019    766
2018    809