Author

Larry D. Seiler

Bio: Larry D. Seiler is an academic researcher from Intel. The author has contributed to research in topics: Rendering (computer graphics) & Pixel. The author has an h-index of 25 and has co-authored 82 publications receiving 2953 citations. Previous affiliations of Larry D. Seiler include Hewlett-Packard & Mitsubishi Electric Research Laboratories.


Papers
Journal Article
01 Aug 2008
TL;DR: This article consists of a collection of slides from the author's conference presentation, some of the topics discussed include: architecture convergence; Larrabee architecture; and graphics pipeline.
Abstract: This paper presents a many-core visual computing architecture code named Larrabee, a new software rendering pipeline, a manycore programming model, and performance analysis for several applications. Larrabee uses multiple in-order x86 CPU cores that are augmented by a wide vector processor unit, as well as some fixed function logic blocks. This provides dramatically higher performance per watt and per unit of area than out-of-order CPUs on highly parallel workloads. It also greatly increases the flexibility and programmability of the architecture as compared to standard GPUs. A coherent on-die 2nd level cache allows efficient inter-processor communication and high-bandwidth local data access by CPU cores. Task scheduling is performed entirely with software in Larrabee, rather than in fixed function logic. The customizable software graphics rendering pipeline for this architecture uses binning in order to reduce required memory bandwidth, minimize lock contention, and increase opportunities for parallelism relative to standard GPUs. The Larrabee native programming model supports a variety of highly parallel applications that use irregular data structures. Performance analysis on those applications demonstrates Larrabee's potential for a broad range of parallel computation.

784 citations

Proceedings Article
01 Jul 1999
TL;DR: This paper describes VolumePro, the world’s first single-chip realtime volume rendering system for consumer PCs, which implements ray-casting with parallel slice-by-slice processing and has hardware for gradient estimation, classification, and per-sample Phong illumination.
Abstract: This paper describes VolumePro, the world’s first single-chip realtime volume rendering system for consumer PCs. VolumePro implements ray-casting with parallel slice-by-slice processing. Our discussion of the architecture focuses mainly on the rendering pipeline and the memory organization. VolumePro has hardware for gradient estimation, classification, and per-sample Phong illumination. The system does not perform any pre-processing and makes parameter adjustments and changes to the volume data immediately visible. We describe several advanced features of VolumePro, such as gradient magnitude modulation of opacity and illumination, supersampling, cropping and cut planes. The system renders 500 million interpolated, Phong illuminated, composited samples per second. This is sufficient to render volumes with up to 16 million voxels (e.g., 256³) at 30 frames per second. CR Categories: B.4.2 [Hardware]: Input/Output and Data Communications—Input/Output Devices: Image display; C.3 [Computer Systems Organization]: Special-Purpose and Application-Based Systems—Real-time and embedded systems; I.3.1 [Computer Graphics]: Hardware Architecture—Graphics processor.

428 citations
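
The throughput figures in the VolumePro abstract above can be checked with a few lines of arithmetic. This is a minimal sketch, assuming one sample per voxel per frame (the abstract does not state the sampling rate); the function name `required_samples_per_second` is illustrative and not taken from the paper.

```python
# Back-of-envelope check of the VolumePro throughput figures.
# Assumption (not stated in the abstract): one sample per voxel per frame.


def required_samples_per_second(side: int, fps: int) -> int:
    """Samples/s needed to render a cubic volume of side**3 voxels at fps."""
    return side ** 3 * fps


voxels = 256 ** 3                               # voxels in a 256-cubed volume
needed = required_samples_per_second(256, 30)   # rate needed for 30 frames/s
quoted = 500_000_000                            # rate quoted in the abstract

print(f"voxels per volume: {voxels:,}")
print(f"samples/s needed:  {needed:,}")
print(f"ratio to quoted:   {needed / quoted:.3f}")
```

Under that assumption the required rate lands within about one percent of the quoted 500 million samples per second, which is consistent with the "up to 16 million voxels at 30 frames per second" claim.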

Journal Article
TL;DR: The Larrabee many-core visual computing architecture uses multiple in-order x86 cores augmented by wide vector processor units, together with some fixed-function logic, which increases the architecture's programmability as compared to standard GPUs.
Abstract: The Larrabee many-core visual computing architecture uses multiple in-order x86 cores augmented by wide vector processor units, together with some fixed-function logic. This increases the architecture's programmability as compared to standard GPUs. The article describes the Larrabee architecture, a software renderer optimized for it, and other highly parallel applications. The article analyzes performance through scalability studies based on real-world workloads.

379 citations

Patent
20 Aug 2001
TL;DR: In a graphics pipeline, a merge buffer merges two fragments associated with the same pixel when they belong to the same tessellated surface, come from adjacent primitives, face the same direction, and are sufficiently similar that merging is unlikely to introduce visible artifacts.
Abstract: In a graphics pipeline, a rasterizer circuit generates fragments for an image having multiple surfaces that have been tessellated into primitive objects, such as triangles. First and second fragments are associated with the same pixel. A merge buffer merges the first fragment with the second fragment when the two fragments belong to the same tessellated surface, the first fragment's primitive is adjacent to the second fragment's primitive, both fragments face either toward or away from the viewer, and the first and second fragment are sufficiently similar that merging is unlikely to introduce visually objectionable artifacts. A frame buffer receives fragments from the merge buffer, stores the fragments, combines the fragments into pixels, and outputs the pixels to a display.

122 citations
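
The four merge conditions listed in the patent abstract above can be read as a single predicate. The sketch below is only an illustration of that reading: the `Fragment` fields, the `adjacent` set, and the depth-difference threshold standing in for "sufficiently similar" are all hypothetical and are not taken from the patent.

```python
from dataclasses import dataclass
from typing import FrozenSet, Set, Tuple


@dataclass(frozen=True)
class Fragment:
    """Hypothetical fragment record; field names are illustrative only."""
    pixel: Tuple[int, int]   # pixel the fragment is associated with
    surface_id: int          # tessellated surface the fragment came from
    primitive_id: int        # primitive (e.g. triangle) within that surface
    front_facing: bool       # True if facing toward the viewer
    depth: float             # stand-in attribute for the similarity test


def can_merge(a: Fragment, b: Fragment,
              adjacent: Set[FrozenSet[int]],
              max_depth_delta: float = 0.01) -> bool:
    """Return True when all conditions from the abstract hold.

    `adjacent` holds unordered pairs of primitive ids that share an edge.
    The depth threshold is an assumed proxy for "sufficiently similar".
    """
    return (
        a.pixel == b.pixel
        and a.surface_id == b.surface_id
        and frozenset((a.primitive_id, b.primitive_id)) in adjacent
        and a.front_facing == b.front_facing
        and abs(a.depth - b.depth) <= max_depth_delta
    )
```

For example, two front-facing fragments at the same pixel from adjacent triangles 10 and 11 of one surface would pass the check, while a fragment from a different surface would not.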

Proceedings Article
01 Aug 1998
TL;DR: Neon is a single chip that performs like a multichip design and accelerates OpenGL [19] 3D rendering, as well as X11 and Windows/NT 2D rendering.
Abstract: High-performance 3D graphics accelerators traditionally require multiple chips on multiple boards, including geometry, rasterizing, pixel processing, and texture mapping chips. These designs are often scalable: they can increase performance by using more chips. Scalability has obvious costs: a minimal configuration needs several chips, and some configurations must replicate texture maps. A less obvious cost is the almost irresistible temptation to replicate chips to increase performance, rather than to design individual chips for higher performance in the first place. In contrast, Neon is a single chip that performs like a multichip design. Neon accelerates OpenGL [19] 3D rendering, as well as X11 [20] and Windows/NT 2D rendering. Since our pin budget limited peak memory bandwidth, we designed Neon from the memory system upward in order to reduce bandwidth requirements. Neon has no special-purpose memories; its eight independent 32-bit memory controllers can access color buffers, Z depth buffers, stencil buffers, and texture data. To fit our gate budget, we shared logic among different operations with similar implementation requirements, and left floating point calculations to Digital's Alpha CPUs. Neon's performance is between HP's Visualize fx4 and fx6, and is well above SGI's MXE for most operations. Neon-based boards cost much less than these competitors, due to a small part count and use of commodity SDRAMs.

75 citations


Cited by
Proceedings Article
11 Oct 2009
TL;DR: This work investigates a new OS structure, the multikernel, that treats the machine as a network of independent cores, assumes no inter-core sharing at the lowest level, and moves traditional OS functionality to a distributed system of processes that communicate via message-passing.
Abstract: Commodity computer systems contain more and more processor cores and exhibit increasingly diverse architectural tradeoffs, including memory hierarchies, interconnects, instruction sets and variants, and IO configurations. Previous high-performance computing systems have scaled in specific cases, but the dynamic nature of modern client and server workloads, coupled with the impossibility of statically optimizing an OS for all workloads and hardware variants, poses serious challenges for operating system structures. We argue that the challenge of future multicore hardware is best met by embracing the networked nature of the machine, rethinking OS architecture using ideas from distributed systems. We investigate a new OS structure, the multikernel, that treats the machine as a network of independent cores, assumes no inter-core sharing at the lowest level, and moves traditional OS functionality to a distributed system of processes that communicate via message-passing. We have implemented a multikernel OS to show that the approach is promising, and we describe how traditional scalability problems for operating systems (such as memory management) can be effectively recast using messages and can exploit insights from distributed systems and networking. An evaluation of our prototype on multicore systems shows that, even on present-day machines, the performance of a multikernel is comparable with a conventional OS, and can scale better to support future hardware.

926 citations

Journal Article
TL;DR: Energy efficiency is the new fundamental limiter of processor performance, way beyond numbers of processors.
Abstract: Energy efficiency is the new fundamental limiter of processor performance, way beyond numbers of processors.

920 citations

Proceedings Article
19 Jun 2010
TL;DR: This paper discusses optimization techniques for both CPU and GPU, analyzes what architecture features contributed to performance differences between the two architectures, and recommends a set of architectural features which provide significant improvement in architectural efficiency for throughput kernels.
Abstract: Recent advances in computing have led to an explosion in the amount of data being generated. Processing the ever-growing data in a timely manner has made throughput computing an important aspect for emerging applications. Our analysis of a set of important throughput computing kernels shows that there is an ample amount of parallelism in these kernels which makes them suitable for today's multi-core CPUs and GPUs. In the past few years there have been many studies claiming GPUs deliver substantial speedups (between 10X and 1000X) over multi-core CPUs on these kernels. To understand where such large performance difference comes from, we perform a rigorous performance analysis and find that after applying optimizations appropriate for both CPUs and GPUs the performance gap between an Nvidia GTX280 processor and the Intel Core i7-960 processor narrows to only 2.5x on average. In this paper, we discuss optimization techniques for both CPU and GPU, analyze what architecture features contributed to performance differences between the two architectures, and recommend a set of architectural features which provide significant improvement in architectural efficiency for throughput kernels.

810 citations

Proceedings Article
20 Jun 2009
TL;DR: A simple analytical model is proposed that estimates the execution time of massively parallel programs. It estimates the number of parallel memory requests from the number of running threads and the memory bandwidth, derives the cost of memory requests from that, and from this the overall execution time of a program.
Abstract: GPU architectures are increasingly important in the multi-core era due to their high number of parallel processors. Programming thousands of massively parallel threads is a big challenge for software engineers, but understanding the performance bottlenecks of those parallel programs on GPU architectures to improve application performance is even more difficult. Current approaches rely on programmers to tune their applications by exploiting the design space exhaustively without fully understanding the performance characteristics of their applications. To provide insights into the performance bottlenecks of parallel applications on GPU architectures, we propose a simple analytical model that estimates the execution time of massively parallel programs. The key component of our model is estimating the number of parallel memory requests (we call this the memory warp parallelism) by considering the number of running threads and memory bandwidth. Based on the degree of memory warp parallelism, the model estimates the cost of memory requests, thereby estimating the overall execution time of a program. Comparisons between the outcome of the model and the actual execution time in several GPUs show that the geometric mean of absolute error of our model on micro-benchmarks is 5.4% and on GPU computing applications is 13.3%. All the applications are written in the CUDA programming language.

672 citations

Journal Article
TL;DR: An important class of 3D transfer functions for scalar data is demonstrated, and the application of multi-dimensional transfer functions to multivariate data is described, and a set of direct manipulation widgets that make specifying such transfer functions intuitive and convenient are presented.
Abstract: Most direct volume renderings produced today employ 1D transfer functions which assign color and opacity to the volume based solely on the single scalar quantity which comprises the data set. Though they have not received widespread attention, multi-dimensional transfer functions are a very effective way to extract materials and their boundaries for both scalar and multivariate data. However, identifying good transfer functions is difficult enough in 1D, let alone 2D or 3D. This paper demonstrates an important class of 3D transfer functions for scalar data, and describes the application of multi-dimensional transfer functions to multivariate data. We present a set of direct manipulation widgets that make specifying such transfer functions intuitive and convenient. We also describe how to use modern graphics hardware to both interactively render with multidimensional transfer functions and to provide interactive shadows for volumes. The transfer functions, widgets and hardware combine to form a powerful system for interactive volume exploration.

623 citations
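
The distinction the abstract above draws between 1D and multi-dimensional transfer functions can be illustrated with a small lookup. This is a minimal sketch, not the paper's method: the control points are invented, and using gradient magnitude as the second axis is an assumption made here for illustration (the abstract only says the extra dimensions help extract material boundaries).

```python
from bisect import bisect_right

# Hypothetical control points: (scalar value, (r, g, b, opacity)).
CONTROL_POINTS = [
    (0.0, (0.0, 0.0, 0.0, 0.0)),
    (0.5, (1.0, 0.5, 0.0, 0.2)),
    (1.0, (1.0, 1.0, 1.0, 1.0)),
]


def transfer_1d(value, points=CONTROL_POINTS):
    """1D transfer function: scalar value -> (r, g, b, opacity).

    Piecewise-linear interpolation between control points; the input is
    clamped to the range covered by the control points.
    """
    xs = [x for x, _ in points]
    value = min(max(value, xs[0]), xs[-1])
    i = min(bisect_right(xs, value), len(xs) - 1)
    (x0, c0), (x1, c1) = points[i - 1], points[i]
    t = (value - x0) / (x1 - x0)
    return tuple(a + t * (b - a) for a, b in zip(c0, c1))


def transfer_2d(value, gradient_magnitude, points=CONTROL_POINTS):
    """2D variant: opacity is also scaled by gradient magnitude.

    Assumption: gradient magnitude, clamped to [0, 1], is the second axis,
    so homogeneous regions (low gradient) become more transparent.
    """
    r, g, b, alpha = transfer_1d(value, points)
    weight = min(max(gradient_magnitude, 0.0), 1.0)
    return (r, g, b, alpha * weight)
```

With the same scalar value, `transfer_1d` always returns the same color and opacity, whereas `transfer_2d` can suppress samples that sit inside a uniform material and keep those near a boundary.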