Home
/
Authors
/
Toni Juan

Author

Toni Juan

Other affiliations: Polytechnic University of Catalonia

Bio: Toni Juan is an academic researcher from Intel. The author has contributed to research in topics: Cache & Vector processor. The author has an hindex of 14, co-authored 18 publications receiving 2042 citations. Previous affiliations of Toni Juan include Polytechnic University of Catalonia.

Papers

PDF

Open Access

More filters

Journal Article•DOI•

Larrabee: a many-core x86 architecture for visual computing

[...]

Larry D. Seiler¹, Doug Carmean¹, Eric Sprangle¹, Tom Forsyth¹, Michael Abrash, Pradeep Dubey¹, Stephen Junkins¹, Adam T. Lake¹, Jeremy Sugerman², Robert Dale Cavin¹, Roger Espasa¹, Ed Grochowski¹, Toni Juan¹, Pat Hanrahan² - Show less +10 more•Institutions (2)

Intel¹, Stanford University²

01 Aug 2008

TL;DR: This article consists of a collection of slides from the author's conference presentation, some of the topics discussed include: architecture convergence; Larrabee architecture; and graphics pipeline.

...read moreread less

Abstract: This paper presents a many-core visual computing architecture code named Larrabee, a new software rendering pipeline, a manycore programming model, and performance analysis for several applications. Larrabee uses multiple in-order x86 CPU cores that are augmented by a wide vector processor unit, as well as some fixed function logic blocks. This provides dramatically higher performance per watt and per unit of area than out-of-order CPUs on highly parallel workloads. It also greatly increases the flexibility and programmability of the architecture as compared to standard GPUs. A coherent on-die 2nd level cache allows efficient inter-processor communication and high-bandwidth local data access by CPU cores. Task scheduling is performed entirely with software in Larrabee, rather than in fixed function logic. The customizable software graphics rendering pipeline for this architecture uses binning in order to reduce required memory bandwidth, minimize lock contention, and increase opportunities for parallelism relative to standard GPUs. The Larrabee native programming model supports a variety of highly parallel applications that use irregular data structures. Performance analysis on those applications demonstrates Larrabee's potential for a broad range of parallel computation.

...read moreread less

784 citations

Journal Article•DOI•

Larrabee: A Many-Core x86 Architecture for Visual Computing

[...]

Larry D. Seiler¹, Douglas M. Carmean¹, Eric Sprangle¹, Tom Forsyth¹, Pradeep Dubey¹, Stephen Junkins¹, Adam T. Lake¹, Robert Dale Cavin¹, Roger Espasa¹, Edward T. Grochowski¹, Toni Juan¹, Michael Abrash, Jeremy Sugerman², Pat Hanrahan² - Show less +10 more•Institutions (2)

Intel¹, Stanford University²

01 Jan 2009-IEEE Micro

TL;DR: The Larrabee many-core visual computing architecture uses multiple in-order x86 cores augmented by wide vector processor units, together with some fixed-function logic, which increases the architecture's programmability as compared to standard GPUs.

...read moreread less

Abstract: The Larrabee many-core visual computing architecture uses multiple in-order x86 cores augmented by wide vector processor units, together with some fixed-function logic. This increases the architecture's programmability as compared to standard GPUs. The article describes the Larrabee architecture, a software renderer optimized for it, and other highly parallel applications. The article analyzes performance through scalability studies based on real-world workloads.

...read moreread less

379 citations

Journal Article•DOI•

Asim: a performance model framework

[...]

Joel Emer¹, Pritpal S. Ahuja¹, E. Borch¹, Artur Klauser¹, Chi-Keung Luk¹, S. Manne¹, Shubhendu S. Mukherjee¹, Harish Patil¹, Steven Wallace¹, Nathan Binkert², Roger Espasa³, Toni Juan³ - Show less +8 more•Institutions (3)

Intel¹, University of Michigan², Polytechnic University of Catalonia³

01 Feb 2002-IEEE Computer

TL;DR: Asim provides a modular and reusable framework for creating many models that helps break down the performance-modeling problem into individual pieces that can be modeled separately, while its reusability allows using a software component repeatedly in different contexts.

...read moreread less

Abstract: The longevity and usefulness of a microprocessor performance model has historically depended on the model writer's skills and discipline. However, at Compaq the models became extremely complex and unmanageable because designers lacked a structured way to develop them. To cope with these complexities, Compaq researchers began developing Asim in late 1998 to allow model writers to faithfully represent the detailed timing of complex modern machines and effectively manage the large software projects needed to model such machines. Asim addresses these needs by providing a modular and reusable framework for creating many models. The framework's modularity helps break down the performance-modeling problem into individual pieces that can be modeled separately, while its reusability allows using a software component repeatedly in different contexts.

...read moreread less

221 citations

Proceedings Article•DOI•

Dynamic history-length fitting: a third level of adaptivity for branch prediction

[...]

Toni Juan, Sanji Sanjeevan, Juan J. Navarro

16 Apr 1998

TL;DR: A method is introduced that dynamically determines the optimum history length during execution, adapting to the specific requirements of any code, input data and system workload, which adds an extra level of adaptivity to two-level adaptive branch predictors.

...read moreread less

Abstract: Accurate branch prediction is essential for obtaining high performance in pipelined superscalar processors that execute instructions speculatively. Some of the best current predictors combine a part of the branch address with a fixed amount of global history of branch outcomes in order to make a prediction. These predictors cannot perform uniformly well across all workloads because the best amount of history to be used depends on the code, the input data and the frequency of context switches. Consequently, all predictors that use a fixed history length are therefore unable to perform up to their maximum potential.We introduce a method---called DHLF---that dynamically determines the optimum history length during execution, adapting to the specific requirements of any code, input data and system workload. Our proposal adds an extra level of adaptivity to two-level adaptive branch predictors. The DHLF method can be applied to any one of the predictors that combine global branch history with the branch address. We apply the DHLF method to gshare (dhlf-gshare) and obtain near-optimal results for all SPECint95 benchmarks, with and without context switches. Some results are also presented for gskewed (dhlf-gskewed), confirming that other predictors can benefit from our proposal.

...read moreread less

116 citations

Journal Article•DOI•

Tarantula: a vector extension to the alpha architecture

[...]

Roger Espasa, Federico Ardanaz, Joel Emer, Stephen Felix, Julio Gago, Roger Gramunt, Isaac Hernandez, Toni Juan, Geoff Lowney, Matthew Mattina, André Seznec - Show less +7 more

01 May 2002

TL;DR: Tarantula is an aggressive floating point machine targeted at technical, scientific and bioinformatics workloads that fully integrates into a virtual-memory cache-coherent system without changes to its coherency protocol, and achieves excellent "real-computation" per transistor and per watt ratios.

...read moreread less

Abstract: Tarantula is an aggressive floating point machine targeted at technical, scientific and bioinformatics workloads, originally planned as a follow-on candidate to the EV8 processor [6, 5]. Tarantula adds to the EV8 core a vector unit capable of 32 double-precision flops per cycle. The vector unit fetches data directly from a 16 MByte second level cache with a peak bandwidth of sixty four 64-bit values per cycle. The whole chip is backed by a memory controller capable of delivering over 64 GBytes/s of raw band- width. Tarantula extends the Alpha ISA with new vector instructions that operate on new architectural state. Salient features of the architecture and implementation are: (1) it fully integrates into a virtual-memory cache-coherent system without changes to its coherency protocol, (2) provides high bandwidth for non-unit stride memory accesses, (3) supports gather/scatter instructions efficiently, (4) fully integrates with the EV8 core with a narrow, streamlined interface, rather than acting as a co-processor, (5) can achieve a peak of 104 operations per cycle, and (6) achieves excellent "real-computation" per transistor and per watt ratios. Our detailed simulations show that Tarantula achieves an average speedup of 5X over EV8, out of a peak speedup in terms of flops of 8X. Furthermore, performance on gather/scatter intensive benchmarks such as Radix Sort is also remarkable: a speedup of almost 3X over EV8 and 15 sustained operations per cycle. Several benchmarks exceed 20 operations per cycle.

...read moreread less

112 citations

1
2
3
4
…

Cited by

PDF

Open Access

More filters

Proceedings Article•DOI•

The multikernel: a new OS architecture for scalable multicore systems

[...]

Andrew Baumann¹, Paul Barham², Pierre-Évariste Dagand³, Tim Harris², Rebecca Isaacs², Simon Peter¹, Timothy Roscoe¹, Adrian Schüpbach¹, Akhilesh Singhania¹ - Show less +5 more•Institutions (3)

ETH Zurich¹, Microsoft², École normale supérieure de Cachan³

11 Oct 2009

TL;DR: This work investigates a new OS structure, the multikernel, that treats the machine as a network of independent cores, assumes no inter-core sharing at the lowest level, and moves traditional OS functionality to a distributed system of processes that communicate via message-passing.

...read moreread less

Abstract: Commodity computer systems contain more and more processor cores and exhibit increasingly diverse architectural tradeoffs, including memory hierarchies, interconnects, instruction sets and variants, and IO configurations. Previous high-performance computing systems have scaled in specific cases, but the dynamic nature of modern client and server workloads, coupled with the impossibility of statically optimizing an OS for all workloads and hardware variants pose serious challenges for operating system structures.We argue that the challenge of future multicore hardware is best met by embracing the networked nature of the machine, rethinking OS architecture using ideas from distributed systems. We investigate a new OS structure, the multikernel, that treats the machine as a network of independent cores, assumes no inter-core sharing at the lowest level, and moves traditional OS functionality to a distributed system of processes that communicate via message-passing.We have implemented a multikernel OS to show that the approach is promising, and we describe how traditional scalability problems for operating systems (such as memory management) can be effectively recast using messages and can exploit insights from distributed systems and networking. An evaluation of our prototype on multicore systems shows that, even on present-day machines, the performance of a multikernel is comparable with a conventional OS, and can scale better to support future hardware.

...read moreread less

926 citations

Journal Article•DOI•

The future of microprocessors

[...]

Shekhar Borkar¹, Andrew A. Chien²•Institutions (2)

Intel¹, University of California, San Diego²

01 May 2011-Communications of The ACM

TL;DR: Energy efficiency is the new fundamental limiter of processor performance, way beyond numbers of processors.

...read moreread less

Abstract: Energy efficiency is the new fundamental limiter of processor performance, way beyond numbers of processors.

...read moreread less

920 citations

Proceedings Article•DOI•

A systematic methodology to compute the architectural vulnerability factors for a high-performance microprocessor

[...]

Shubhendu S. Mukherjee¹, Christopher T. Weaver¹, Joel Emer¹, Steven K. Reinhardt¹, Todd Austin¹ - Show less +1 more•Institutions (1)

University of Michigan¹

03 Dec 2003

TL;DR: This paper identifies numerous cases, such as prefetches, dynamicallydead code, and wrong-path instructions, in which a fault will not affect correct execution, and shows AVFs of 28% and 9% for the instruction queue and execution units, respectively,averaged across dynamic sections of the entire CPU2000benchmark suite.

...read moreread less

Abstract: Single-event upsets from particle strikes have become a key challenge in microprocessor design. Techniques to deal with these transients faults exist, but come at a cost. Designers clearly require accurate estimates of processor error rates to make appropriate cost/reliability tradeoffs. This paper describes a method for generating these estimates. A key aspect of this analysis is that some single-bit faults (such as those occurring in the branch predictor) do not produce an error in a program's output. We define a structure's architectural vulnerability factor (AVF) as the probability that a fault in that particular structure do not result in an error. A structure's error rate is the product of its raw error rate, as determined by process and circuit technology, and the AVF. Unfortunately, computing AVFs of complex structures, such as the instruction queue, can be quite involved. We identify numerous cases, such as prefetches, dynamically dead code, and wrong-path instructions, in which a fault do not affect, correct execution. We instrument a detailed 1A64 processor simulator to map bit-level microarchitectural state to these cases, generating per-structure AVF estimates. This analysis shows AVFs of 28% and 9% for the instruction queue and execution units, respectively, averaged across dynamic sections of the entire CPU2000 benchmark suite.

...read moreread less

915 citations

Journal Article•DOI•

The M5 Simulator: Modeling Networked Systems

[...]

Nathan Binkert¹, Ronald G. Dreslinski¹, Lisa R. Hsu¹, Kevin Lim¹, Ali G. Saidi¹, Steven K. Reinhardt¹ - Show less +2 more•Institutions (1)

University of Michigan¹

01 Jul 2006-IEEE Micro

TL;DR: The M5 simulator provides features necessary for simulating networked hosts, including full-system capability, a detailed I/O subsystem, and the ability to simulate multiple networked systems deterministically.

...read moreread less

Abstract: The M5 simulator is developed specifically to enable research in TCP/IP networking. The M5 simulator provides features necessary for simulating networked hosts, including full-system capability, a detailed I/O subsystem, and the ability to simulate multiple networked systems deterministically. M5's usefulness as a general-purpose architecture simulator and its liberal open-source license has led to its adoption by several academic and commercial groups

...read moreread less

839 citations

Proceedings Article•DOI•

Debunking the 100X GPU vs. CPU myth: an evaluation of throughput computing on CPU and GPU

[...]

Victor W. Lee¹, Changkyu Kim¹, Jatin Chhugani¹, Michael E. Deisher¹, Daehyun Kim¹, Anthony D. Nguyen¹, Nadathur Satish¹, Mikhail Smelyanskiy¹, Srinivas Chennupaty¹, Per Hammarlund¹, Ronak Singhal¹, Pradeep Dubey¹ - Show less +8 more•Institutions (1)

Intel¹

19 Jun 2010

TL;DR: This paper discusses optimization techniques for both CPU and GPU, analyzes what architecture features contributed to performance differences between the two architectures, and recommends a set of architectural features which provide significant improvement in architectural efficiency for throughput kernels.

...read moreread less

Abstract: Recent advances in computing have led to an explosion in the amount of data being generated. Processing the ever-growing data in a timely manner has made throughput computing an important aspect for emerging applications. Our analysis of a set of important throughput computing kernels shows that there is an ample amount of parallelism in these kernels which makes them suitable for today's multi-core CPUs and GPUs. In the past few years there have been many studies claiming GPUs deliver substantial speedups (between 10X and 1000X) over multi-core CPUs on these kernels. To understand where such large performance difference comes from, we perform a rigorous performance analysis and find that after applying optimizations appropriate for both CPUs and GPUs the performance gap between an Nvidia GTX280 processor and the Intel Core i7-960 processor narrows to only 2.5x on average. In this paper, we discuss optimization techniques for both CPU and GPU, analyze what architecture features contributed to performance differences between the two architectures, and recommend a set of architectural features which provide significant improvement in architectural efficiency for throughput kernels.

...read moreread less

810 citations

1
2
3
4
…
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200

Collapse