Topic

CUDA Pinned memory

About: CUDA pinned memory (page-locked host memory) is a research topic. Over its lifetime, 1,097 publications have been published within this topic, receiving 30,198 citations.
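
Pinned (page-locked) host memory is what gives the topic its name: because the operating system cannot page such memory out, the GPU's DMA engine can read and write it directly, which speeds up host-device copies and is a prerequisite for truly asynchronous transfers. Below is a minimal sketch using the standard CUDA runtime calls; the kernel, buffer size, and error-check macro are illustrative, not drawn from any paper on this page.

    // Minimal sketch: pinned host memory enabling asynchronous copies.
    // Illustrative only; error handling reduced to a single check macro.
    #include <cuda_runtime.h>
    #include <cstdio>

    #define CHECK(call) do { cudaError_t e = (call); \
        if (e != cudaSuccess) { printf("CUDA error: %s\n", cudaGetErrorString(e)); return 1; } } while (0)

    __global__ void scale(float *d, int n) {   // illustrative kernel
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) d[i] *= 2.0f;
    }

    int main() {
        const int n = 1 << 20;
        float *h, *d;
        CHECK(cudaMallocHost(&h, n * sizeof(float)));  // pinned (page-locked) host buffer
        CHECK(cudaMalloc(&d, n * sizeof(float)));
        for (int i = 0; i < n; ++i) h[i] = 1.0f;

        cudaStream_t s;
        CHECK(cudaStreamCreate(&s));
        // cudaMemcpyAsync is only truly asynchronous with pinned host memory.
        CHECK(cudaMemcpyAsync(d, h, n * sizeof(float), cudaMemcpyHostToDevice, s));
        scale<<<(n + 255) / 256, 256, 0, s>>>(d, n);
        CHECK(cudaMemcpyAsync(h, d, n * sizeof(float), cudaMemcpyDeviceToHost, s));
        CHECK(cudaStreamSynchronize(s));

        CHECK(cudaStreamDestroy(s));
        CHECK(cudaFree(d));
        CHECK(cudaFreeHost(h));                        // pinned memory is freed with cudaFreeHost
        return 0;
    }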


Papers
Journal ArticleDOI
TL;DR: As computer CPUs get faster, primary memories tend to be organized in parallel banks; this paper discusses important questions in the design and use of such memories.
Abstract: As computer CPUs get faster, primary memories tend to be organized in parallel banks. The fastest machines now being developed can fetch of the order of 100 words in parallel. Unless memory and compiler designers are careful, serious memory conflicts and resulting performance degradation may result. Some of the important questions of design and use of such memories are discussed.
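
The bank-conflict problem the paper describes for CPU primary memory has a well-known analogue in CUDA's shared memory, which is divided into banks that serialize concurrent accesses from a warp. The classic mitigation is to pad the leading dimension of a shared array. The sketch below (a standard tiled transpose, offered as my illustration rather than anything from the paper) assumes a square width-by-width image, width a multiple of 32, and 32x32 thread blocks.

    // Shared-memory bank conflicts and the classic padding fix.
    // A 32x32 tile read column-wise would hit the same bank in every
    // thread of a warp; padding each row to 33 floats spreads the
    // accesses across distinct banks.
    __global__ void transpose_tile(const float *in, float *out, int width) {
        __shared__ float tile[32][33];   // 33, not 32: avoids bank conflicts

        int x = blockIdx.x * 32 + threadIdx.x;
        int y = blockIdx.y * 32 + threadIdx.y;
        tile[threadIdx.y][threadIdx.x] = in[y * width + x];
        __syncthreads();

        // Swapped block indices give the transposed output location.
        x = blockIdx.y * 32 + threadIdx.x;
        y = blockIdx.x * 32 + threadIdx.y;
        out[y * width + x] = tile[threadIdx.x][threadIdx.y];  // column read, conflict-free with padding
    }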

306 citations

Book ChapterDOI
28 Nov 2008
TL;DR: CUDA-lite, an enhancement to CUDA, is presented, along with preliminary results indicating that auto-generated code can perform comparably to hand-written code.
Abstract: The computer industry has transitioned into multi-core and many-core parallel systems. The CUDA programming environment from NVIDIA is an attempt to make programming many-core GPUs more accessible to programmers. However, there are still many burdens placed upon the programmer to maximize performance when using CUDA. One such burden is dealing with the complex memory hierarchy. Efficient and correct usage of the various memories is essential, making a difference of 2-17x in performance. Currently, the task of determining the appropriate memory to use and the coding of data transfer between memories is still left to the programmer. We believe that this task can be better performed by automated tools. We present CUDA-lite, an enhancement to CUDA, as one such tool. We leverage programmer knowledge via annotations to perform transformations and show preliminary results that indicate auto-generated code can have performance comparable to hand coding.
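
One memory-hierarchy burden that tools like CUDA-lite automate is arranging coalesced global-memory access, a major contributor to the 2-17x performance range cited above. The two hypothetical kernels below (hand-written for illustration, not CUDA-lite output) copy the same 2D array; only the mapping of threads to elements differs, and only the second issues coalesced loads.

    // Uncoalesced: one thread per row. Within a warp, consecutive threads
    // access addresses that differ by `pitch`, so each load by a warp
    // touches many separate memory segments.
    __global__ void copy_rows_slow(const float *in, float *out, int pitch, int n) {
        int row = blockIdx.x * blockDim.x + threadIdx.x;
        if (row < n)
            for (int col = 0; col < pitch; ++col)
                out[row * pitch + col] = in[row * pitch + col];
    }

    // Coalesced: one thread per column. Within a warp, consecutive threads
    // read consecutive addresses, so each warp load is served by a minimal
    // number of memory transactions.
    __global__ void copy_rows_fast(const float *in, float *out, int pitch, int n) {
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        if (col < pitch)
            for (int row = 0; row < n; ++row)
                out[row * pitch + col] = in[row * pitch + col];
    }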

257 citations

Proceedings ArticleDOI
29 Aug 2004
TL;DR: By combining memory objects with floating-point fragment programs, this work has implemented a particle engine that entirely avoids the transfer of particle data at runtime.
Abstract: We present a system for real-time animation and rendering of large particle sets using GPU computation and memory objects in OpenGL. Memory objects can be used both as containers for geometry data stored on the graphics card and as render targets, providing an effective means for the manipulation and rendering of particle data on the GPU. To fully take advantage of this mechanism, efficient GPU realizations of algorithms used to perform particle manipulation are essential. Our system implements a versatile particle engine, including inter-particle collisions and visibility sorting. By combining memory objects with floating-point fragment programs, we have implemented a particle engine that entirely avoids the transfer of particle data at runtime. Our system can be seen as a forerunner of a new class of graphics algorithms, exploiting memory objects or similar concepts on upcoming graphics hardware to avoid bus bandwidth becoming the major performance bottleneck.
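
The paper's central trick, keeping particle state resident in GPU memory so the bus is never crossed per frame, predates CUDA but translates into it directly. A hedged CUDA analogue follows (the paper itself uses OpenGL memory objects and floating-point fragment programs): allocate state once on the device and update it in place every frame. The names and the integration scheme are illustrative.

    // CUDA analogue of the paper's approach: particle state lives in device
    // memory for the whole run; the host never copies it back per frame.
    #include <cuda_runtime.h>

    struct Particles {       // illustrative structure-of-arrays layout
        float4 *pos;         // xyz position, w unused
        float4 *vel;         // xyz velocity, w unused
    };

    __global__ void step(float4 *pos, float4 *vel, int n, float dt) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        vel[i].y -= 9.81f * dt;          // gravity (simple Euler integration)
        pos[i].x += vel[i].x * dt;
        pos[i].y += vel[i].y * dt;
        pos[i].z += vel[i].z * dt;
    }

    void simulate(Particles p, int n, int frames, float dt) {
        // Each frame touches only device memory; a renderer would read the
        // same buffers via graphics interop rather than a host round trip.
        for (int f = 0; f < frames; ++f)
            step<<<(n + 255) / 256, 256>>>(p.pos, p.vel, n, dt);
        cudaDeviceSynchronize();
    }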

255 citations

Proceedings ArticleDOI
06 Nov 2005
TL;DR: This paper proposes using GPUs in approximately the reverse way: to assist in "converting pictures into numbers" (i.e. computer vision) and provides a simple API which implements some common computer vision algorithms.
Abstract: Graphics and vision are approximate inverses of each other: ordinarily Graphics Processing Units (GPUs) are used to convert "numbers into pictures" (i.e. computer graphics). In this paper, we propose using GPUs in approximately the reverse way: to assist in "converting pictures into numbers" (i.e. computer vision). The OpenVIDIA project uses single or multiple graphics cards to accelerate image analysis and computer vision. It is a library and API aimed at providing a graphics hardware accelerated processing framework for image processing and computer vision. OpenVIDIA explores the creation of a parallel computer architecture built entirely from commodity hardware, using multiple Graphics Processing Units in parallel as a general-purpose parallel computer architecture. It provides a simple API which implements some common computer vision algorithms. Many components can be used immediately, and because the project is open source, the code is intended to serve as templates and examples for how similar algorithms are mapped onto graphics hardware. Implemented are image processing techniques (Canny edge detection, filtering), image feature handling (identifying and matching features) and image registration, to name a few.
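
For a flavor of how such vision primitives map onto GPU threads, here is a generic CUDA sketch of a 3x3 box filter, one thread per output pixel. This is my illustration of the pattern; OpenVIDIA itself targets fragment programs on graphics hardware rather than CUDA.

    // Generic GPU image-processing primitive of the kind OpenVIDIA
    // accelerates: each thread computes one output pixel of a 3x3 box
    // filter, clamping the window at the image borders.
    __global__ void box3x3(const unsigned char *in, unsigned char *out,
                           int width, int height) {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x >= width || y >= height) return;

        int sum = 0, count = 0;
        for (int dy = -1; dy <= 1; ++dy)
            for (int dx = -1; dx <= 1; ++dx) {
                int nx = x + dx, ny = y + dy;
                if (nx >= 0 && nx < width && ny >= 0 && ny < height) {
                    sum += in[ny * width + nx];
                    ++count;
                }
            }
        out[y * width + x] = (unsigned char)(sum / count);
    }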

250 citations

Book ChapterDOI
20 Mar 2010
TL;DR: An automatic code transformation system that generates parallel CUDA code from sequential input C code for regular (affine) programs; the generated code's performance is quite close to that of hand-optimized CUDA code and considerably better than the benchmarks' performance on a multicore CPU.
Abstract: Graphics Processing Units (GPUs) offer tremendous computational power. CUDA (Compute Unified Device Architecture) provides a multi-threaded parallel programming model, facilitating high performance implementations of general-purpose computations. However, the explicitly managed memory hierarchy and multi-level parallel view make manual development of high-performance CUDA code rather complicated. Hence the automatic transformation of sequential input programs into efficient parallel CUDA programs is of considerable interest. This paper describes an automatic code transformation system that generates parallel CUDA code from input sequential C code, for regular (affine) programs. Using and adapting publicly available tools that have made polyhedral compiler optimization practically effective, we develop a C-to-CUDA transformation system that generates two-level parallel CUDA code that is optimized for efficient data access. The performance of automatically generated code is compared with manually optimized CUDA code for a number of benchmarks. The performance of the automatically generated CUDA code is quite close to hand-optimized CUDA code and considerably better than the benchmarks' performance on a multicore CPU.
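
To make "regular (affine) programs" concrete: these are loop nests whose bounds and array subscripts are affine functions of the loop indices, which is what lets polyhedral tools analyze and parallelize them automatically. The sketch below shows the shape of such a transformation on a trivial SAXPY loop; it is hand-written, and the system's actual output is tiled, two-level parallel code optimized for data access.

    // Input: a regular (affine) sequential loop nest in C.
    void saxpy_seq(int n, float a, const float *x, float *y) {
        for (int i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];
    }

    // Output shape: the parallel loop dimension mapped onto CUDA threads.
    // (Hand-written sketch; generated code would also tile and stage data.)
    __global__ void saxpy_cuda(int n, float a, const float *x, float *y) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            y[i] = a * x[i] + y[i];
    }

    // Illustrative launch: saxpy_cuda<<<(n + 255) / 256, 256>>>(n, a, d_x, d_y);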

229 citations


Network Information
Related Topics (5)
Server: 79.5K papers, 1.4M citations (84% related)
Cache: 59.1K papers, 976.6K citations (84% related)
Mobile computing: 51.3K papers, 1M citations (81% related)
Scheduling (computing): 78.6K papers, 1.3M citations (79% related)
Web service: 57.6K papers, 989K citations (79% related)
Performance Metrics
No. of papers in the topic in previous years:

Year    Papers
2021    1
2020    3
2019    7
2018    7
2017    45
2016    54