Jeremy Sugerman
Researcher at Stanford University
Publications - 16
Citations - 4654
Jeremy Sugerman is an academic researcher from Stanford University. The author has contributed to research in topics: Virtual machine & Full virtualization. The author has an h-index of 12 and has co-authored 16 publications receiving 4570 citations. Previous affiliations of Jeremy Sugerman include Nvidia & VMware.
Papers
Journal ArticleDOI
Brook for GPUs: stream computing on graphics hardware
Ian Buck, Tim Foley, Daniel Reiter Horn, Jeremy Sugerman, Kayvon Fatahalian, Mike Houston, Pat Hanrahan +6 more
TL;DR: This paper presents Brook for GPUs, a system for general-purpose computation on programmable graphics hardware that abstracts and virtualizes many aspects of graphics hardware, and analyzes the effectiveness of the GPU as a compute engine compared to the CPU.
Proceedings Article
Virtualizing I/O Devices on VMware Workstation's Hosted Virtual Machine Monitor
TL;DR: Results indicate that with optimizations, VMware Workstation’s hosted virtualization architecture can match native I/O throughput on standard PCs.
Journal ArticleDOI
Larrabee: a many-core x86 architecture for visual computing
Larry D. Seiler, Doug Carmean, Eric Sprangle, Tom Forsyth, Michael Abrash, Pradeep Dubey, Stephen Junkins, Adam T. Lake, Jeremy Sugerman, Robert Dale Cavin, Roger Espasa, Ed Grochowski, Toni Juan, Pat Hanrahan +13 more
TL;DR: This article consists of a collection of slides from the author's conference presentation, some of the topics discussed include: architecture convergence; Larrabee architecture; and graphics pipeline.
Journal ArticleDOI
Larrabee: A Many-Core x86 Architecture for Visual Computing
Larry D. Seiler, Douglas M. Carmean, Eric Sprangle, Tom Forsyth, Pradeep Dubey, Stephen Junkins, Adam T. Lake, Robert Dale Cavin, Roger Espasa, Edward T. Grochowski, Toni Juan, Michael Abrash, Jeremy Sugerman, Pat Hanrahan +13 more
TL;DR: The Larrabee many-core visual computing architecture uses multiple in-order x86 cores augmented by wide vector processor units, together with some fixed-function logic, which increases the architecture's programmability as compared to standard GPUs.
Proceedings ArticleDOI
Understanding the efficiency of GPU algorithms for matrix-matrix multiplication
TL;DR: An in-depth analysis of dense matrix-matrix multiplication, which reuses each element of the input matrices O(n) times, finds that even near-optimal GPU implementations are pronouncedly less efficient than current cache-aware CPU approaches.