Author

Jeremy Sugerman

Other affiliations: Nvidia, VMware

Bio: Jeremy Sugerman is an academic researcher from Stanford University. The author has contributed to research on virtual machines and full virtualization. The author has an h-index of 12 and has co-authored 16 publications that have received 4,570 citations. Previous affiliations of Jeremy Sugerman include Nvidia and VMware.

Papers

Journal Article•DOI•

Brook for GPUs: stream computing on graphics hardware

Ian Buck¹, Tim Foley¹, Daniel Reiter Horn¹, Jeremy Sugerman¹, Kayvon Fatahalian¹, Mike Houston¹, Pat Hanrahan¹

Stanford University¹

01 Aug 2004

TL;DR: This paper presents Brook for GPUs, a system for general-purpose computation on programmable graphics hardware that abstracts and virtualizes many aspects of graphics hardware, and presents an analysis of the effectiveness of the GPU as a compute engine compared to the CPU.

Abstract: In this paper, we present Brook for GPUs, a system for general-purpose computation on programmable graphics hardware. Brook extends C to include simple data-parallel constructs, enabling the use of the GPU as a streaming co-processor. We present a compiler and runtime system that abstracts and virtualizes many aspects of graphics hardware. In addition, we present an analysis of the effectiveness of the GPU as a compute engine compared to the CPU, to determine when the GPU can outperform the CPU for a particular algorithm. We evaluate our system with five applications, the SAXPY and SGEMV BLAS operators, image segmentation, FFT, and ray tracing. For these applications, we demonstrate that our Brook implementations perform comparably to hand-written GPU code and up to seven times faster than their CPU counterparts.
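As a rough illustration of the data-parallel style the abstract describes, SAXPY can be written as a per-element kernel applied independently to every position of two input streams. The sketch below is plain Python, not Brook syntax; the names `saxpy_kernel` and `map_kernel` are hypothetical and chosen only for this example.

```python
from typing import Callable, List, Sequence


def saxpy_kernel(alpha: float, x: float, y: float) -> float:
    """Per-element kernel: the output depends only on this element's inputs."""
    return alpha * x + y


def map_kernel(
    kernel: Callable[[float, float, float], float],
    alpha: float,
    xs: Sequence[float],
    ys: Sequence[float],
) -> List[float]:
    """Apply the kernel to each pair of stream elements.

    Every element is independent, so a parallel device could evaluate
    all of them at once; here they are evaluated one after another.
    """
    if len(xs) != len(ys):
        raise ValueError("input streams must have the same length")
    return [kernel(alpha, x, y) for x, y in zip(xs, ys)]


result = map_kernel(saxpy_kernel, 2.0, [1.0, 2.0, 3.0], [10.0, 20.0, 30.0])
print(result)  # [12.0, 24.0, 36.0]
```

Because the kernel never reads a neighbouring element, the order of evaluation does not matter, which is what makes this kind of operation a natural fit for a streaming co-processor.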

1,288 citations

Proceedings Article•

Virtualizing I/O Devices on VMware Workstation's Hosted Virtual Machine Monitor

Jeremy Sugerman¹, Ganesh Venkitachalam¹, Beng-Hong Lim¹

VMware¹

25 Jun 2001

TL;DR: Results indicate that with optimizations, VMware Workstation’s hosted virtualization architecture can match native I/O throughput on standard PCs.

Abstract: Virtual machines were developed by IBM in the 1960’s to provide concurrent, interactive access to a mainframe computer. Each virtual machine is a replica of the underlying physical machine and users are given the illusion of running directly on the physical machine. Virtual machines also provide benefits like isolation and resource sharing, and the ability to run multiple flavors and configurations of operating systems. VMware Workstation brings such mainframe-class virtual machine technology to PC-based desktop and workstation computers. This paper focuses on VMware Workstation’s approach to virtualizing I/O devices. PCs have a staggering variety of hardware, and are usually pre-installed with an operating system. Instead of replacing the pre-installed OS, VMware Workstation uses it to host a user-level application (VMApp) component, as well as to schedule a privileged virtual machine monitor (VMM) component. The VMM directly provides high-performance CPU virtualization while the VMApp uses the host OS to virtualize I/O devices and shield the VMM from the variety of devices. A crucial question is whether virtualizing devices via such a hosted architecture can meet the performance required of high throughput, low latency devices. To this end, this paper studies the virtualization and performance of an Ethernet adapter on VMware Workstation. Results indicate that with optimizations, VMware Workstation’s hosted virtualization architecture can match native I/O throughput on standard PCs. Although a straightforward hosted implementation is CPU-limited due to virtualization overhead on a 733 MHz Pentium® III system on a 100 Mb/s Ethernet, a series of optimizations targeted at reducing CPU utilization allows the system to match native network throughput. Further optimizations are discussed both within and outside a hosted architecture.

808 citations

Journal Article•DOI•

Larrabee: a many-core x86 architecture for visual computing

Larry D. Seiler¹, Doug Carmean¹, Eric Sprangle¹, Tom Forsyth¹, Michael Abrash, Pradeep Dubey¹, Stephen Junkins¹, Adam T. Lake¹, Jeremy Sugerman², Robert Dale Cavin¹, Roger Espasa¹, Ed Grochowski¹, Toni Juan¹, Pat Hanrahan²

Intel¹, Stanford University²

01 Aug 2008

TL;DR: This article consists of a collection of slides from the author's conference presentation, some of the topics discussed include: architecture convergence; Larrabee architecture; and graphics pipeline.

Abstract: This paper presents a many-core visual computing architecture code named Larrabee, a new software rendering pipeline, a manycore programming model, and performance analysis for several applications. Larrabee uses multiple in-order x86 CPU cores that are augmented by a wide vector processor unit, as well as some fixed function logic blocks. This provides dramatically higher performance per watt and per unit of area than out-of-order CPUs on highly parallel workloads. It also greatly increases the flexibility and programmability of the architecture as compared to standard GPUs. A coherent on-die 2nd level cache allows efficient inter-processor communication and high-bandwidth local data access by CPU cores. Task scheduling is performed entirely with software in Larrabee, rather than in fixed function logic. The customizable software graphics rendering pipeline for this architecture uses binning in order to reduce required memory bandwidth, minimize lock contention, and increase opportunities for parallelism relative to standard GPUs. The Larrabee native programming model supports a variety of highly parallel applications that use irregular data structures. Performance analysis on those applications demonstrates Larrabee's potential for a broad range of parallel computation.

784 citations

Journal Article•DOI•

Larrabee: A Many-Core x86 Architecture for Visual Computing

Larry D. Seiler¹, Douglas M. Carmean¹, Eric Sprangle¹, Tom Forsyth¹, Pradeep Dubey¹, Stephen Junkins¹, Adam T. Lake¹, Robert Dale Cavin¹, Roger Espasa¹, Edward T. Grochowski¹, Toni Juan¹, Michael Abrash, Jeremy Sugerman², Pat Hanrahan²

Intel¹, Stanford University²

01 Jan 2009-IEEE Micro

TL;DR: The Larrabee many-core visual computing architecture uses multiple in-order x86 cores augmented by wide vector processor units, together with some fixed-function logic, which increases the architecture's programmability as compared to standard GPUs.

Abstract: The Larrabee many-core visual computing architecture uses multiple in-order x86 cores augmented by wide vector processor units, together with some fixed-function logic. This increases the architecture's programmability as compared to standard GPUs. The article describes the Larrabee architecture, a software renderer optimized for it, and other highly parallel applications. The article analyzes performance through scalability studies based on real-world workloads.

379 citations

Proceedings Article•DOI•

Understanding the efficiency of GPU algorithms for matrix-matrix multiplication

Kayvon Fatahalian¹, Jeremy Sugerman¹, Pat Hanrahan¹

Stanford University¹

29 Aug 2004

TL;DR: An in-depth analysis of dense matrix-matrix multiplication, which reuses each element of input matrices O(n) times, finds even near-optimal GPU implementations are pronouncedly less efficient than current cache-aware CPU approaches.

Abstract: Utilizing graphics hardware for general purpose numerical computations has become a topic of considerable interest. The implementation of streaming algorithms, typified by highly parallel computations with little reuse of input data, has been widely explored on GPUs. We relax the streaming model's constraint on input reuse and perform an in-depth analysis of dense matrix-matrix multiplication, which reuses each element of input matrices O(n) times. Its regular data access pattern and highly parallel computational requirements suggest matrix-matrix multiplication as an obvious candidate for efficient evaluation on GPUs but, surprisingly, we find even near-optimal GPU implementations are pronouncedly less efficient than current cache-aware CPU approaches. We find the key cause of this inefficiency is that the GPU can fetch less data and yet execute more arithmetic operations per clock than the CPU when both are operating out of their closest caches. The lack of high bandwidth access to cached data will impair the performance of GPU implementations of any computation featuring significant input reuse.
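The O(n) reuse mentioned in the abstract can be checked directly by counting reads in a textbook triple-loop multiplication. This is a minimal sketch in plain Python, not the GPU implementation the paper analyzes; the function name `matmul_with_read_counts` is hypothetical.

```python
from collections import Counter
from typing import List, Tuple

Matrix = List[List[float]]


def matmul_with_read_counts(a: Matrix, b: Matrix) -> Tuple[Matrix, Counter, Counter]:
    """Multiply two n x n matrices and count how often each input element is read."""
    n = len(a)
    reads_a: Counter = Counter()
    reads_b: Counter = Counter()
    c = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            acc = 0.0
            for k in range(n):
                reads_a[(i, k)] += 1  # a[i][k] is needed for every column j
                reads_b[(k, j)] += 1  # b[k][j] is needed for every row i
                acc += a[i][k] * b[k][j]
            c[i][j] = acc
    return c, reads_a, reads_b


a = [[1.0, 2.0], [3.0, 4.0]]
b = [[5.0, 6.0], [7.0, 8.0]]
c, reads_a, reads_b = matmul_with_read_counts(a, b)
print(c)                      # [[19.0, 22.0], [43.0, 50.0]]
print(set(reads_a.values()))  # {2}: each element of a is read n times
```

Each input element is read n times, so how quickly repeated reads can be served from nearby memory matters far more here than in a streaming computation that touches each input once.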

353 citations

Cited by

Proceedings Article•DOI•

Scalable parallel programming with CUDA

John R. Nickolls¹, Ian Buck¹, Michael Garland¹, Kevin Skadron²

Nvidia¹, University of Virginia²

11 Aug 2008

TL;DR: Presents a collection of slides covering the following topics: CUDA parallel programming model; CUDA toolkit and libraries; performance optimization; and application development.

Abstract: The advent of multicore CPUs and manycore GPUs means that mainstream processor chips are now parallel systems. Furthermore, their parallelism continues to scale with Moore's law. The challenge is to develop mainstream application software that transparently scales its parallelism to leverage the increasing number of processor cores, much as 3D graphics applications transparently scale their parallelism to manycore GPUs with widely varying numbers of cores.

2,216 citations

Journal Article•DOI•

A Survey of General-Purpose Computation on Graphics Hardware

John D. Owens¹, David Luebke², Naga K. Govindaraju³, Mark J. Harris², Jens Krüger⁴, Aaron Lefohn, Timothy John Purcell²

University of California, Davis¹, Nvidia², Microsoft³, Technische Universität München⁴

01 Mar 2007-Computer Graphics Forum

TL;DR: This report describes, summarizes, and analyzes the latest research in mapping general-purpose computation to graphics hardware.

Abstract: The rapid increase in the performance of graphics hardware, coupled with recent improvements in its programmability, have made graphics hardware a compelling platform for computationally demanding tasks in a wide variety of application domains. In this report, we describe, summarize, and analyze the latest research in mapping general-purpose computation to graphics hardware. We begin with the technical motivations that underlie general-purpose computation on graphics processors (GPGPU) and describe the hardware and software developments that have led to the recent interest in this field. We then aim the main body of this report at two separate audiences. First, we describe the techniques used in mapping general-purpose computation to graphics hardware. We believe these techniques will be generally useful for researchers who plan to develop the next generation of GPGPU algorithms and techniques. Second, we survey and categorize the latest developments in general-purpose application development on graphics hardware. This survey should be of particular interest to researchers who are interested in using the latest GPGPU applications in their systems of interest.

1,998 citations

Proceedings Article•

A Survey of General-Purpose Computation on Graphics Hardware.

John D. Owens¹, David Luebke², Naga K. Govindaraju³, Mark J. Harris², Jens Krüger⁴, Aaron Lefohn, Timothy John Purcell²

University of California, Davis¹, Nvidia², Microsoft³, Technische Universität München⁴

01 Jan 2005

TL;DR: The techniques used in mapping general-purpose computation to graphics hardware will be generally useful for researchers who plan to develop the next generation of GPGPU algorithms and techniques.

1,728 citations

Journal Article•DOI•

Routine Microsecond Molecular Dynamics Simulations with AMBER on GPUs. 1. Generalized Born

Andreas W. Götz¹, Mark J. Williamson¹, Dong Xu¹, Duncan Poole², Scott M. Le Grand², Ross C. Walker¹

University of California, San Diego¹, Nvidia²

15 Apr 2012-Journal of Chemical Theory and Computation

TL;DR: An implementation of generalized Born implicit solvent all-atom classical molecular dynamics within the AMBER program package that runs entirely on CUDA enabled NVIDIA graphics processing units (GPUs) and shows performance that is on par with, and in some cases exceeds, that of traditional supercomputers.

Abstract: We present an implementation of generalized Born implicit solvent all-atom classical molecular dynamics (MD) within the AMBER program package that runs entirely on CUDA enabled NVIDIA graphics processing units (GPUs). We discuss the algorithms that are used to exploit the processing power of the GPUs and show the performance that can be achieved in comparison to simulations on conventional CPU clusters. The implementation supports three different precision models in which the contributions to the forces are calculated in single precision floating point arithmetic but accumulated in double precision (SPDP), or everything is computed in single precision (SPSP) or double precision (DPDP). In addition to performance, we have focused on understanding the implications of the different precision models on the outcome of implicit solvent MD simulations. We show results for a range of tests including the accuracy of single point force evaluations and energy conservation as well as structural properties pertaining to protein dynamics. The numerical noise due to rounding errors within the SPSP precision model is sufficiently large to lead to an accumulation of errors which can result in unphysical trajectories for long time scale simulations. We recommend the use of the mixed-precision SPDP model since the numerical results obtained are comparable with those of the full double precision DPDP model and the reference double precision CPU implementation but at significantly reduced computational cost. Our implementation provides performance for GB simulations on a single desktop that is on par with, and in some cases exceeds, that of traditional supercomputers.
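The difference between accumulating single-precision contributions in single precision (SPSP) and in double precision (SPDP) can be seen in a toy summation. The sketch below is not AMBER code and does not compute forces; it assumes single precision can be emulated by rounding Python floats through the standard `struct` module, and the helper names `to_f32` and `accumulate` are hypothetical.

```python
import math
import struct
from typing import Sequence


def to_f32(value: float) -> float:
    """Round a Python float (double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", value))[0]


def accumulate(terms: Sequence[float], single_precision_sum: bool) -> float:
    """Sum the terms, optionally rounding the running total to single precision."""
    total = 0.0
    for term in terms:
        total += term
        if single_precision_sum:
            total = to_f32(total)  # emulate a single-precision accumulator
    return total


# Every contribution is itself a single-precision number in both models.
terms = [to_f32(0.1)] * 100_000
reference = math.fsum(terms)  # correctly rounded sum of the same terms

spdp = accumulate(terms, single_precision_sum=False)
spsp = accumulate(terms, single_precision_sum=True)
print("SPDP error:", abs(spdp - reference))
print("SPSP error:", abs(spsp - reference))
```

In this toy case the single-precision accumulator drifts much further from the reference than the double-precision one, which is the kind of accumulated rounding error the abstract attributes to the SPSP model.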

1,645 citations

Proceedings Article•

A Virtual Machine Introspection Based Architecture for Intrusion Detection.

Tal Garfinkel¹, Mendel Rosenblum¹

Stanford University¹

01 Jan 2003

TL;DR: This paper presents an architecture that retains the visibility of a host-based IDS, but pulls the IDS outside of the host for greater attack resistance, achieved through the use of a virtual machine monitor.

Abstract: Today’s architectures for intrusion detection force the IDS designer to make a difficult choice. If the IDS resides on the host, it has an excellent view of what is happening in that host’s software, but is highly susceptible to attack. On the other hand, if the IDS resides in the network, it is more resistant to attack, but has a poor view of what is happening inside the host, making it more susceptible to evasion. In this paper we present an architecture that retains the visibility of a host-based IDS, but pulls the IDS outside of the host for greater attack resistance. We achieve this through the use of a virtual machine monitor. Using this approach allows us to isolate the IDS from the monitored host but still retain excellent visibility into the host’s state. The VMM also offers us the unique ability to completely mediate interactions between the host software and the underlying hardware. We present a detailed study of our architecture, including Livewire, a prototype implementation. We demonstrate Livewire by implementing a suite of simple intrusion detection policies and using them to detect real attacks.

1,629 citations
