Author

Doug Carmean

Bio: Doug Carmean is an academic researcher from Intel. The author has contributed to research on topics including multi-core processors and scoreboarding. The author has an h-index of 9 and has co-authored 13 publications receiving 1407 citations.

Papers

Journal Article

Larrabee: a many-core x86 architecture for visual computing

Larry D. Seiler¹, Doug Carmean¹, Eric Sprangle¹, Tom Forsyth¹, Michael Abrash, Pradeep Dubey¹, Stephen Junkins¹, Adam T. Lake¹, Jeremy Sugerman², Robert Dale Cavin¹, Roger Espasa¹, Ed Grochowski¹, Toni Juan¹, Pat Hanrahan²

Intel¹, Stanford University²

01 Aug 2008

TL;DR: This article consists of a collection of slides from the author's conference presentation, some of the topics discussed include: architecture convergence; Larrabee architecture; and graphics pipeline.

Abstract: This paper presents a many-core visual computing architecture code named Larrabee, a new software rendering pipeline, a manycore programming model, and performance analysis for several applications. Larrabee uses multiple in-order x86 CPU cores that are augmented by a wide vector processor unit, as well as some fixed function logic blocks. This provides dramatically higher performance per watt and per unit of area than out-of-order CPUs on highly parallel workloads. It also greatly increases the flexibility and programmability of the architecture as compared to standard GPUs. A coherent on-die 2nd level cache allows efficient inter-processor communication and high-bandwidth local data access by CPU cores. Task scheduling is performed entirely with software in Larrabee, rather than in fixed function logic. The customizable software graphics rendering pipeline for this architecture uses binning in order to reduce required memory bandwidth, minimize lock contention, and increase opportunities for parallelism relative to standard GPUs. The Larrabee native programming model supports a variety of highly parallel applications that use irregular data structures. Performance analysis on those applications demonstrates Larrabee's potential for a broad range of parallel computation.

784 citations

Journal Article

Increasing processor performance by implementing deeper pipelines

Eric Sprangle¹, Doug Carmean¹

Intel¹

01 May 2002

TL;DR: It is shown that, in the same process technology, designing deeper pipelines can increase the processor frequency by 100%, which, when combined with larger on-chip caches, can yield performance improvements of 35% to 90% over a Pentium® 4-like processor.

Abstract: One architectural method for increasing processor performance involves increasing the frequency by implementing deeper pipelines. This paper will explore the relationship between performance and pipeline depth using a Pentium® 4 processor-like architecture as a baseline and will show that deeper pipelines can continue to increase performance. This paper will show that the branch misprediction latency is the single largest contributor to performance degradation as pipelines are stretched, and therefore branch prediction and fast branch recovery will continue to increase in importance. We will also show that higher performance cores, implemented with longer pipelines for example, will put more pressure on the memory system, and therefore require larger on-chip caches. Finally, we will show that, in the same process technology, designing deeper pipelines can increase the processor frequency by 100%, which, when combined with larger on-chip caches, can yield performance improvements of 35% to 90% over a Pentium® 4-like processor.

231 citations

Patent

Distribution of tasks among asymmetric processing elements

Herbert H. J. Hum, Eric Sprangle, Doug Carmean, Raghavan Kumar

26 Sep 2014

TL;DR: This patent describes techniques to control power and processing among a plurality of asymmetric cores by migrating processes or threads among the cores according to the performance and power needs of the system.

Abstract: Techniques to control power and processing among a plurality of asymmetric cores. In one embodiment, one or more asymmetric cores are power managed to migrate processes or threads among a plurality of cores according to the performance and power needs of the system.

148 citations

Journal Article

Mapping High-Fidelity Volume Rendering for Medical Imaging to CPU, GPU and Many-Core Architectures

Mikhail Smelyanskiy¹, David R. Holmes², Jatin Chhugani¹, A. Larson², Doug Carmean¹, Dennis P. Hanson², Pradeep Dubey¹, Kurt E. Augustine², Daehyun Kim¹, A. Kyker¹, Victor W. Lee¹, Anthony-Trung D. Nguyen¹, Lauren H. Seiler¹, Richard A. Robb²

Intel¹, Mayo Clinic²

01 Nov 2009 - IEEE Transactions on Visualization and Computer Graphics

TL;DR: This work describes a thread- and data-parallel implementation of ray-casting that makes it amenable to key architectural trends of three modern commodity parallel architectures: multi-core, GPU, and an upcoming many-core Intel® architecture code-named Larrabee.

Abstract: Medical volumetric imaging requires high fidelity, high performance rendering algorithms. We motivate and analyze new volumetric rendering algorithms that are suited to modern parallel processing architectures. First, we describe the three major categories of volume rendering algorithms and confirm through an imaging scientist-guided evaluation that ray-casting is the most acceptable. We describe a thread- and data-parallel implementation of ray-casting that makes it amenable to key architectural trends of three modern commodity parallel architectures: multi-core, GPU, and an upcoming many-core Intel® architecture code-named Larrabee. We achieve more than an order of magnitude performance improvement on a number of large 3D medical datasets. We further describe a data compression scheme that significantly reduces data-transfer overhead. This allows our approach to scale well to large numbers of Larrabee cores.

88 citations

Proceedings Article

Enabling scalability and performance in a large scale CMP environment

Bratin Saha¹, Ali-Reza Adl-Tabatabai¹, Anwar Ghuloum¹, Mohan Rajagopalan¹, Richard L. Hudson¹, Leaf Petersen¹, Vijay Menon¹, Brian R. Murphy¹, Tatiana Shpeisman¹, Eric Sprangle¹, Anwar Rohillah¹, Doug Carmean¹, Jesse Fang¹

Intel¹

21 Mar 2007

TL;DR: This paper presents the architecture of McRT and discusses the experiences with the system, including experimental evaluation that led to several interesting, non-intuitive findings, providing key insights about the structure of the system stack at this scale.

Abstract: Hardware trends suggest that large-scale CMP architectures, with tens to hundreds of processing cores on a single piece of silicon, are imminent within the next decade. While existing CMP machines have traditionally been handled in the same way as SMPs, this magnitude of parallelism introduces several fundamental challenges at the architectural level and this, in turn, translates to novel challenges in the design of the software stack for these platforms. This paper presents the "Many Core Run Time" (McRT), a software prototype of an integrated language runtime that was designed to explore configurations of the software stack for enabling performance and scalability on large scale CMP platforms. This paper presents the architecture of McRT and discusses our experiences with the system, including experimental evaluation that led to several interesting, non-intuitive findings, providing key insights about the structure of the system stack at this scale. A key contribution of this paper is to demonstrate how McRT enables near linear improvements in performance and scalability for desktop workloads such as the popular XviD encoder and a set of RMS (recognition, mining, and synthesis) applications. Another key contribution of this work is its use of McRT to explore non-traditional system configurations such as a light-weight executive in which McRT runs on "bare metal" and replaces the traditional OS. Such configurations are becoming an increasingly attractive alternative to leverage heterogeneous computing units as seen in today's CPU-GPU configurations.

83 citations

Cited by

Proceedings Article

The multikernel: a new OS architecture for scalable multicore systems

Andrew Baumann¹, Paul Barham², Pierre-Évariste Dagand³, Tim Harris², Rebecca Isaacs², Simon Peter¹, Timothy Roscoe¹, Adrian Schüpbach¹, Akhilesh Singhania¹

ETH Zurich¹, Microsoft², École normale supérieure de Cachan³

11 Oct 2009

TL;DR: This work investigates a new OS structure, the multikernel, that treats the machine as a network of independent cores, assumes no inter-core sharing at the lowest level, and moves traditional OS functionality to a distributed system of processes that communicate via message-passing.

Abstract: Commodity computer systems contain more and more processor cores and exhibit increasingly diverse architectural tradeoffs, including memory hierarchies, interconnects, instruction sets and variants, and IO configurations. Previous high-performance computing systems have scaled in specific cases, but the dynamic nature of modern client and server workloads, coupled with the impossibility of statically optimizing an OS for all workloads and hardware variants, poses serious challenges for operating system structures. We argue that the challenge of future multicore hardware is best met by embracing the networked nature of the machine, rethinking OS architecture using ideas from distributed systems. We investigate a new OS structure, the multikernel, that treats the machine as a network of independent cores, assumes no inter-core sharing at the lowest level, and moves traditional OS functionality to a distributed system of processes that communicate via message-passing. We have implemented a multikernel OS to show that the approach is promising, and we describe how traditional scalability problems for operating systems (such as memory management) can be effectively recast using messages and can exploit insights from distributed systems and networking. An evaluation of our prototype on multicore systems shows that, even on present-day machines, the performance of a multikernel is comparable with a conventional OS, and can scale better to support future hardware.

926 citations

Journal Article

The future of microprocessors

Shekhar Borkar¹, Andrew A. Chien²

Intel¹, University of California, San Diego²

01 May 2011 - Communications of the ACM

TL;DR: Energy efficiency is the new fundamental limiter of processor performance, way beyond numbers of processors.

Abstract: Energy efficiency is the new fundamental limiter of processor performance, way beyond numbers of processors.

920 citations

Proceedings Article

Debunking the 100X GPU vs. CPU myth: an evaluation of throughput computing on CPU and GPU

Victor W. Lee¹, Changkyu Kim¹, Jatin Chhugani¹, Michael E. Deisher¹, Daehyun Kim¹, Anthony D. Nguyen¹, Nadathur Satish¹, Mikhail Smelyanskiy¹, Srinivas Chennupaty¹, Per Hammarlund¹, Ronak Singhal¹, Pradeep Dubey¹

Intel¹

19 Jun 2010

TL;DR: This paper discusses optimization techniques for both CPU and GPU, analyzes what architecture features contributed to performance differences between the two architectures, and recommends a set of architectural features which provide significant improvement in architectural efficiency for throughput kernels.

Abstract: Recent advances in computing have led to an explosion in the amount of data being generated. Processing the ever-growing data in a timely manner has made throughput computing an important aspect for emerging applications. Our analysis of a set of important throughput computing kernels shows that there is an ample amount of parallelism in these kernels which makes them suitable for today's multi-core CPUs and GPUs. In the past few years there have been many studies claiming GPUs deliver substantial speedups (between 10X and 1000X) over multi-core CPUs on these kernels. To understand where such large performance difference comes from, we perform a rigorous performance analysis and find that after applying optimizations appropriate for both CPUs and GPUs the performance gap between an Nvidia GTX280 processor and the Intel Core i7-960 processor narrows to only 2.5x on average. In this paper, we discuss optimization techniques for both CPU and GPU, analyze what architecture features contributed to performance differences between the two architectures, and recommend a set of architectural features which provide significant improvement in architectural efficiency for throughput kernels.

810 citations

Proceedings Article

An analytical model for a GPU architecture with memory-level and thread-level parallelism awareness

Sunpyo Hong¹, Hyesoon Kim¹

Georgia Institute of Technology¹

20 Jun 2009

TL;DR: A simple analytical model is proposed that estimates the execution time of massively parallel programs. It considers the number of running threads and the memory bandwidth to estimate the cost of memory requests, and from that the overall execution time of a program.

Abstract: GPU architectures are increasingly important in the multi-core era due to their high number of parallel processors. Programming thousands of massively parallel threads is a big challenge for software engineers, but understanding the performance bottlenecks of those parallel programs on GPU architectures to improve application performance is even more difficult. Current approaches rely on programmers to tune their applications by exploiting the design space exhaustively without fully understanding the performance characteristics of their applications. To provide insights into the performance bottlenecks of parallel applications on GPU architectures, we propose a simple analytical model that estimates the execution time of massively parallel programs. The key component of our model is estimating the number of parallel memory requests (we call this the memory warp parallelism) by considering the number of running threads and memory bandwidth. Based on the degree of memory warp parallelism, the model estimates the cost of memory requests, thereby estimating the overall execution time of a program. Comparisons between the outcome of the model and the actual execution time in several GPUs show that the geometric mean of absolute error of our model on micro-benchmarks is 5.4% and on GPU computing applications is 13.3%. All the applications are written in the CUDA programming language.

672 citations

Proceedings Article

PacketShader: a GPU-accelerated software router

Sangjin Han¹, Keon Jang¹, KyoungSoo Park¹, Sue Moon¹

KAIST¹

30 Aug 2010

TL;DR: The evaluation results show that GPU brings significantly higher throughput over the CPU-only implementation, confirming the effectiveness of GPU for computation and memory-intensive operations in packet processing.

Abstract: We present PacketShader, a high-performance software router framework for general packet processing with Graphics Processing Unit (GPU) acceleration. PacketShader exploits the massively-parallel processing power of GPU to address the CPU bottleneck in current software routers. Combined with our high-performance packet I/O engine, PacketShader outperforms existing software routers by more than a factor of four, forwarding 64B IPv4 packets at 39 Gbps on a single commodity PC. We have implemented IPv4 and IPv6 forwarding, OpenFlow switching, and IPsec tunneling to demonstrate the flexibility and performance advantage of PacketShader. The evaluation results show that GPU brings significantly higher throughput over the CPU-only implementation, confirming the effectiveness of GPU for computation and memory-intensive operations in packet processing.

585 citations
