
Fast Computation of Database Operations using Graphics
Processors
Naga K. Govindaraju Brandon Lloyd Wei Wang Ming Lin Dinesh Manocha
University of North Carolina at Chapel Hill
{naga, blloyd, weiwang, lin, dm}@cs.unc.edu
http://gamma.cs.unc.edu/DataBase
ABSTRACT
We present new algorithms for performing fast computa-
tion of several common database operations on commod-
ity graphics processors. Specifically, we consider operations
such as conjunctive selections, aggregations, and semi-linear
queries, which are essential computational components of
typical database, data warehousing, and data mining appli-
cations. While graphics processing units (GPUs) have been
designed for fast display of geometric primitives, we utilize
the inherent pipelining and parallelism, single instruction
and multiple data (SIMD) capabilities, and vector process-
ing functionality of GPUs, for evaluating boolean predicate
combinations and semi-linear queries on attributes and exe-
cuting database operations efficiently. Our algorithms take
into account some of the limitations of the programming
model of current GPUs and perform no data rearrange-
ments. Our algorithms have been implemented on a pro-
grammable GPU (e.g. NVIDIA’s GeForce FX 5900) and
applied to databases consisting of up to a million records.
We have compared their performance with an optimized im-
plementation of CPU-based algorithms. Our experiments
indicate that the graphics processor available on commodity
computer systems is an effective co-processor for performing
database operations.
Keywords: graphics processor, query optimization, selec-
tion query, aggregation, selectivity analysis, semi-linear query.
1. INTRODUCTION
As database technology becomes pervasive, Database Man-
agement Systems (DBMSs) have been deployed in a wide
variety of applications. The rapid growth of data volume
for the past decades has intensified the need for high-speed
database management systems. Most database queries and,
more recently, data warehousing and data mining applica-
tions, are very data- and computation-intensive and there-
fore demand high processing power. Researchers have ac-
tively sought to design and develop architectures and algo-
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage, and that copies
bear this notice and the full citation on the first page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior specific
permission and/or a fee.
SIGMOD 2004 June 13-18, 2004, Paris, France.
Copyright 2004 ACM 1-58113-859-8/04/06 ...$5.00.
rithms for faster query execution. Special attention has been
given to increasing the performance of selection, aggregation,
and join operations on large databases. These operations
are widely used as fundamental primitives for building com-
plex database queries and for supporting on-line analytic
processing (OLAP) and data mining procedures. The effi-
ciency of these operations has a significant impact on the
performance of a database system.
As the current trend of database architecture moves from
disk-based systems towards main-memory databases, appli-
cations have become increasingly computation- and memory-
bound. Recent work [3, 21] investigating the processor and
memory behaviors of current DBMSs has demonstrated a
significant increase in the query execution time due to mem-
ory stalls (on account of data and instruction misses), branch
mispredictions, and resource stalls (due to instruction de-
pendencies and hardware specific characteristics). Increased
attention has been given to redesigning traditional database
algorithms for fully utilizing the available architectural fea-
tures and for exploiting parallel execution possibilities, min-
imizing memory and resource stalls, and reducing branch
mispredictions [2, 5, 20, 24, 31, 32, 34, 37].
1.1 Graphics Processing Units
In this paper, we exploit the computational power of graph-
ics processing units (GPUs) for database operations. In the
last decade, high-performance 3D graphics hardware has be-
come as ubiquitous as floating-point hardware. Graphics
processors are now a part of almost every personal computer,
game console, or workstation. In fact, the two major com-
putational components of a desktop computer system are its
main central processing unit (CPU) and its graphics processing unit (GPU). While
CPUs are used for general purpose computation, GPUs have
been primarily designed for transforming, rendering, and
texturing geometric primitives, such as triangles. The driv-
ing application of GPUs has been fast rendering for visual
simulation, virtual reality, and computer gaming.
GPUs are increasingly being used as co-processors to CPUs.
GPUs are extremely fast and are capable of processing tens
of millions of geometric primitives per second. The peak
performance of GPUs has been increasing at the rate of
2.5 to 3.0 times a year, much faster than Moore's law
for CPUs. At this rate, the GPU’s peak performance may
move into the teraflop range by 2006 [19]. Most of this per-
formance arises from multiple processing units and stream
processing. The GPU treats the vertices and pixels consti-
tuting graphics primitives as streams. Multiple vertex and

pixel processing engines on a GPU are connected via data
flows. These processing engines perform simple operations
in parallel.
Recently, GPUs have become programmable, allowing a
user to write fragment programs that are executed on pixel
processing engines. The pixel processing engines have di-
rect access to the texture memory and can perform vector
operations with floating point arithmetic. These capabil-
ities have been successfully exploited for many geometric
and scientific applications. As graphics hardware becomes
increasingly programmable and powerful, the roles of CPUs
and GPUs in computing are being redefined.
1.2 Main Contributions
In this paper, we present novel algorithms for fast com-
putation of database operations on GPUs. The operations
include predicates, boolean combinations, and aggregations.
We utilize the SIMD capabilities of pixel processing engines
within a GPU to perform these operations efficiently. We
have used these algorithms for selection queries on one or
more attributes and generic aggregation queries including
selectivity analysis on large databases.
Our algorithms take into account some of the limitations
of the current programming model of GPUs which make it
difficult to perform data rearrangement. We present novel
algorithms for performing multi-attribute comparisons, semi-
linear queries, range queries, computing the kth largest num-
ber, and other aggregates. These algorithms have been im-
plemented using fragment programs and have been applied
to large databases composed of up to a million records. The
performance of these algorithms depends on the instruction
sets available for fragment programs, the number of frag-
ment processors, and the underlying clock rate of the GPU.
We also perform a preliminary comparison between GPU-
based algorithms running on a NVIDIA GeForceFX 5900 Ul-
tra graphics processor and optimized CPU-based algorithms
running on dual 2.8 GHz Intel Xeon processors.
We show that algorithms for semi-linear and selection
queries map very well to GPUs and we are able to ob-
tain significant performance improvement over CPU-based
implementations. The algorithms for aggregates obtain a
modest gain of a 2 to 4 times speedup over CPU-based imple-
mentations. Overall, the GPU can be used as an effective
co-processor for many database operations.
1.3 Organization
The rest of the paper is organized as follows. We briefly
survey related work on database operations and use of GPUs
for geometric and scientific computing in Section 2. We give
an overview of the graphics architectural pipeline in Section
3. We present algorithms for database operations includ-
ing predicates, boolean combinations, and aggregations in
Section 4. We describe their implementation in Section 5
and compare their performance with optimized CPU-based
implementations. We analyze the performance in Section 6
and outline the cases where GPU-based algorithms can offer
considerable gain over CPU-based algorithms.
2. RELATED WORK
In this section, we highlight the related research in main-
memory database operations and general purpose computa-
tion using GPUs.
2.1 Hardware Accelerated Database Opera-
tions
Many acceleration techniques have been proposed for data-
base operations. Ailamaki et al. [3] analyzed the execution
time of commercial DBMSs and observed that almost half
of the time is spent in stalls. This indicates that the perfor-
mance of a DBMS can be significantly improved by reducing
stalls.
Meki and Kambayashi used a vector processor for accel-
erating the execution of relational database operations in-
cluding selection, projection, and join [24]. To utilize the
efficiency of pipelining and parallelism that a vector pro-
cessor provides, the implementation of each operation was
redesigned for increasing the vectorization rate and the vec-
tor length. The limitation of using a vector processor is that
the load-store instruction can have high latency [37].
Modern CPUs have SIMD instructions that allow a single
basic operation to be performed on multiple data elements
in parallel. Zhu and Ross described SIMD implementation
of many important database operations including sequential
scans, aggregation, indexed searches, and joins [37]. Consid-
erable performance gains were achieved by exploiting the in-
herent parallelism of SIMD instructions and reducing branch
mispredictions.
Recently, Sun et al. presented the use of graphics processors
for spatial selections and joins [35]. They use color blending
capabilities available on graphics processors to test if two
polygons intersect in screen-space. Their experiments on
graphics processors indicate a speedup of nearly 5 times on
intersection joins and within-distance joins when compared
against their software implementation. The technique fo-
cuses on pruning intersections between triangles based on
their 2D overlap and is quite conservative.
2.2 General-Purpose Computing Using GPUs
In theory, GPUs are capable of performing any computa-
tion that can be mapped to the stream-computing model.
This model has been exploited for ray-tracing [29], global
illumination [30] and geometric computations [22].
The programming model of GPUs is somewhat limited,
mainly due to the lack of random access writes. This limi-
tation makes it more difficult to implement many data struc-
tures and common algorithms such as sorting. Purcell et al.
[30] present an implementation of bitonic merge sort, where
the output routing from one step to another is known in
advance. The algorithm is implemented as a fragment pro-
gram and each stage of the sorting algorithm is performed
as one rendering pass. However, the algorithm can be quite
slow for database operations on large databases.
GPUs have been used for performing many discretized
geometric computations [22]. These include using stencil
buffer hardware for interference computations [33], using
depth-buffer hardware to perform distance field and proxim-
ity computations [15], and visibility queries for interactive
walkthroughs and shadow generation [12].
High throughput and direct access to texture memory
make fragment processors powerful computation engines for
certain numerical algorithms, including dense matrix-matrix
multiplication [18], general purpose vector processing [36],
visual simulation based on coupled-map lattices [13], linear
algebra operations [17], sparse matrix solvers for conjugate
gradient and multigrid [4], a multigrid solver for boundary
value problems [11], geometric computations [1, 16], etc.

3. OVERVIEW
In this section, we introduce the basic functionality avail-
able on GPUs and give an overview of the architectural
pipeline. More details are given in [9].
3.1 Graphics Pipeline
A GPU is designed to rapidly transform the geometric
description of a scene into the pixels on the screen that con-
stitute a final image. Pixels are stored on the graphics card
in a frame-buffer. The frame buffer is conceptually divided
into three buffers according to the different values stored at
each pixel:
Color Buffer: Stores the color components of each
pixel in the frame-buffer. Color is typically divided
into red, green, and blue channels with an alpha chan-
nel that is used for blending effects.
Depth Buffer: Stores a depth value associated with
each pixel. The depth is used to determine surface
visibility.
Stencil Buffer: Stores a stencil value for each pixel.
It is called the stencil buffer because it is typically
used for enabling/disabling writes to portions of the
frame-buffer.
Figure 1: Graphics architectural pipeline overview: This
figure shows the various units of a modern GPU. Each unit
is designed for performing a specific operation efficiently.
The transformation of geometric primitives (points, lines,
triangles, etc.) to pixels is performed by the graphics pipeline,
consisting of several functional units, each optimized for per-
forming a specific operation. Figure 1 shows the various stages
involved in rendering a primitive.
Vertex Processing Engine: This unit receives ver-
tices as input and transforms them to points on the
screen.
Setup Engine: Transformed vertex data is streamed
to the setup engine which generates slope and initial
value information for color, depth, and other param-
eters associated with the primitive vertices. This in-
formation is used during rasterization for constructing
fragments at each pixel location covered by the prim-
itive.
Pixel Processing Engines: Before the fragments
are written as pixels to the frame buffer, they pass
through the pixel processing engines or fragment pro-
cessors. A series of tests can be used for discarding a
fragment before it is written to the frame buffer. Each
test performs a comparison using a user-specified re-
lational operator and discards the fragment if the test
fails.
Alpha test: Compares a fragment’s alpha value
to a user-specified reference value.
Stencil test: Compares the stencil value of a
fragment’s corresponding pixel with a user-specified
reference value.
Depth test: Compares the depth value of a frag-
ment to the depth value of the corresponding pixel
in the frame buffer.
The relational operator can be any of the following: =, <,
>, ≤, ≥, and ≠. In addition, there are two operators, never
and always, that do not require a reference value.
Current generations of GPUs have a pixel processing en-
gine that is programmable. The user can supply a custom
fragment program to be executed on each fragment. For ex-
ample, a fragment program can compute the alpha value of
a fragment as a complex function of the fragment’s other
color components or its depth.
3.2 Visibility and Occlusion Queries
Current GPUs can perform visibility and occlusion queries
[27]. When a primitive is rasterized, it is converted to frag-
ments. These fragments may or may not be written to pixels
in the frame buffer, depending on whether they pass the
alpha, stencil, and depth tests. An occlusion query re-
turns the pixel pass count, the number of fragments that
pass the different tests. We use these queries for performing
aggregation computations (see Section 4).
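A minimal C++ sketch of reading back the pixel pass count, assuming the OpenGL 1.5 / ARB_occlusion_query interface (an OpenGL 1.5 header or extension loader) and a hypothetical renderQuad() helper that issues the geometry, is:

#include <GL/gl.h>

void renderQuad();  // hypothetical helper: renders the screen-filling quadrilateral

// Returns the number of fragments that passed the alpha, stencil, and depth tests.
GLuint countPassingFragments() {
    GLuint query, passed = 0;
    glGenQueries(1, &query);
    glBeginQuery(GL_SAMPLES_PASSED, query);
    renderQuad();
    glEndQuery(GL_SAMPLES_PASSED);
    glGetQueryObjectuiv(query, GL_QUERY_RESULT, &passed);  // blocks until the result is ready
    glDeleteQueries(1, &query);
    return passed;                                         // the "pixel pass count"
}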
3.3 Data Representation on the GPUs
Our goal is to utilize the inherent parallelism and vector
processing capabilities of the GPUs for database operations.
A key aspect is the underlying data representation.
Data is stored on the GPU as textures. Textures are 2D
arrays of values. They are usually used for applying images
to rendered surfaces. They may contain multiple channels.
For example, an RGBA texture has four color channels -
red, green, blue, and alpha. A number of different data for-
mats can be used for textures including 8-bit bytes, 16-bit
integers, and floating point. We store data in textures in the
floating-point format. This format can precisely represent
integers up to 24 bits.
To perform computations on the values stored in a tex-
ture, we render a single quadrilateral that covers the win-
dow. The texture is applied to the quadrilateral such that
the individual elements of the texture, texels, line up with
the pixels in the frame-buffer. Rendering the textured quadri-
lateral causes a fragment to be generated for every data
value in the texture. Fragment programs are used for per-
forming computations using the data value from the texture.
Then the alpha, stencil, and depth tests can be used to per-
form comparisons.
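A simplified C++/OpenGL sketch of this setup is shown below. It assumes a floating-point internal format (GL_RGBA32F, from ARB_texture_float or later; the hardware used in this paper exposed vendor-specific float formats), an identity transformation, and legacy immediate-mode rendering.

#include <GL/gl.h>

// Upload one RGBA texel per record and stream the data through the fragment
// processors by drawing a single window-sized textured quadrilateral.
void uploadAndProcess(const float* records, int width, int height) {
    GLuint tex;
    glGenTextures(1, &tex);
    glBindTexture(GL_TEXTURE_2D, tex);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_NEAREST);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_NEAREST);
    glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA32F, width, height, 0,
                 GL_RGBA, GL_FLOAT, records);

    // Each texel lines up with one pixel, so one fragment (and one
    // fragment-program invocation) is generated per data value.
    glEnable(GL_TEXTURE_2D);
    glBegin(GL_QUADS);
      glTexCoord2f(0, 0); glVertex2f(-1, -1);
      glTexCoord2f(1, 0); glVertex2f( 1, -1);
      glTexCoord2f(1, 1); glVertex2f( 1,  1);
      glTexCoord2f(0, 1); glVertex2f(-1,  1);
    glEnd();
}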
3.4 Stencil Tests
Graphics processors use stencil tests for restricting com-
putations to a portion of the frame-buffer based on the value
in the stencil buffer. Abstractly, we can consider the stencil
buffer as a mask on the screen. Each fragment that enters

the pixel processing engine corresponds to a pixel in the
frame-buffer. The stencil test compares the stencil value of
a fragment’s corresponding pixel against a reference value.
Fragments that fail the comparison operation are rejected
from the rasterization pipeline.
Stencil operations can modify the stencil value of a frag-
ment’s corresponding pixel. Examples of such stencil oper-
ations include
KEEP: Keep the stencil value in stencil buffer. We
use this operation if we do not want to modify the
stencil value.
INCR: Increment the stencil value by one.
DECR: Decrement the stencil value by one.
ZERO: Set the stencil value to zero.
REPLACE: Set the stencil value to the reference
value.
INVERT: Bitwise invert the stencil value.
For each fragment there are three possible outcomes based
on the stencil and depth tests. Based on the outcome of the
tests, the corresponding stencil operation is performed:
Op1: when a fragment fails the stencil test,
Op2: when a fragment passes the stencil test and fails
the depth test,
Op3: when the fragment passes the stencil and depth
tests.
We illustrate these operations with the following pseudo-
code for the StencilOp routine:
StencilOp( Op1, Op2, Op3)
if (stencil test passed) /* perform stencil test */
/* fragment passed stencil test */
if(depth test passed) /* perform depth test */
/* fragment passed stencil and depth test */
perform Op3 on stencil value
else
/* fragment passed stencil test */
/* but failed depth test */
perform Op2 on stencil value
end if
else
/* fragment failed stencil test */
perform Op1 on stencil value
end if
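In OpenGL terms, the StencilOp routine corresponds to glStencilOp and the stencil test itself to glStencilFunc. The following sketch shows one example combination (pass where the stencil value equals 1, then increment it), not the complete set of configurations we use.

#include <GL/gl.h>

// Pass where the stencil value equals 1; keep on stencil fail (Op1) and on
// depth fail (Op2), increment when both tests pass (Op3).
void stencilPassEqualOneThenIncrement() {
    glEnable(GL_STENCIL_TEST);
    glStencilFunc(GL_EQUAL, 1, 0xFF);
    glStencilOp(GL_KEEP, GL_KEEP, GL_INCR);  // Op1, Op2, Op3
}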
4. BASIC DATABASE OPERATIONS USING
GPUS
In this section, we give a brief overview of basic database
operations that are performed efficiently on a GPU. Given
a relational table T of m attributes (a_1, a_2, ..., a_m), a basic
SQL query is in the form of
SELECT A
FROM T
WHERE C
where A may be a list of attributes or aggregations (SUM,
COUNT, AVG, MIN, MAX) defined on individual attributes,
and C is a boolean combination (using AND, OR, EXIST,
NOT EXIST) of predicates that have the form a_i op a_j
or a_i op constant. The operator op may be any of the
following: =, ≠, >, ≥, <, ≤. In essence, queries specified in
this form involve three categories of basic operations: pred-
icates, boolean combinations, and aggregations. Our goal
is to design efficient algorithms for performing these opera-
tions using graphics processors.
Predicates: Predicates in the form of a_i op constant
can be evaluated via the depth test and stencil test.
The comparison between two attributes, a_i op a_j, can
be transformed into a semi-linear query a_i - a_j op 0,
which can be executed on the GPUs.
Boolean combinations: A boolean combination of pred-
icates can always be rewritten in a conjunctive normal
form (CNF). The stencil test can be used repeatedly
for evaluating a series of logical operators with the in-
termediate results stored in the stencil buffer.
Aggregations: This category includes simple operations
such as COUNT, SUM, AVG, MIN, MAX, all of which
can be implemented using the counting capability of
the occlusion queries on GPUs.
To perform these operations on a relational table using
GPUs, we store the attributes of each record in multiple
channels of a single texel, or the same texel location in mul-
tiple textures.
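For concreteness, a record with four attributes can be packed so that its attributes occupy the red, green, blue, and alpha channels of a single texel. The C++ sketch below, with illustrative field names, prepares such an array for upload as a texture (Section 3.3).

#include <vector>

struct Record { float a1, a2, a3, a4; };

// One RGBA texel per record: R <- a1, G <- a2, B <- a3, A <- a4.
std::vector<float> packForTexture(const std::vector<Record>& table) {
    std::vector<float> texels;
    texels.reserve(table.size() * 4);
    for (const Record& rec : table) {
        texels.push_back(rec.a1);
        texels.push_back(rec.a2);
        texels.push_back(rec.a3);
        texels.push_back(rec.a4);
    }
    return texels;   // upload with glTexImage2D as in Section 3.3
}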
4.1 Predicate Evaluation
In this section, we present novel GPU-based algorithms for
performing comparisons as well as the semi-linear queries.
4.1.1 Comparison between an Attribute and a Constant
We can implement a comparison between an attribute
tex and a constant d by using the depth test function-
ality of graphics hardware. The stencil buffer can be config-
ured to store the result of the depth test. This is important
not only for evaluating a single comparison but also for con-
structing more complex boolean combinations of multiple
predicates.
To use the depth test for performing comparisons, at-
tribute values need to be stored in the depth buffer. We
use a simple fragment program for copying the attribute
values from the texture memory to the depth buffer.
A comparison operation against a depth value d is imple-
mented by rendering a screen filling quadrilateral with depth
d. In this operation, the rasterization hardware uses the
comparison function for testing each attribute value stored
in the depth buffer against d. The comparison function is
specified using the depth function. Routine 4.1 describes
the pseudo-code for our implementation.
4.1.2 Comparison between Two Attributes
The comparison between two attributes, a_i op a_j, can
be transformed into a special semi-linear query (a_i - a_j op
0), which can be performed very efficiently using the vector
processors on the GPUs. Here, we propose a fast algorithm
that can perform any general semi-linear query on GPUs.

Compare( tex, op, d )
1 CopyToDepth( tex )
2 set depth test function to op
3 RenderQuad( d )
CopyToDepth( tex )
1 set up fragment program
2 RenderTexturedQuad( tex )
ROUTINE 4.1: Compare compares the attribute values
stored in texture tex against d using the comparison function
op. CopyToDepth called on line 1 copies the attribute values in
tex into the depth buffer. CopyToDepth uses a simple fragment
program on each pixel of the screen for performing the copy oper-
ation. On line 2, the depth test is configured to use the compar-
ison operator op. The function RenderQuad(d) called on line
3 generates a fragment at a specified depth d for each pixel on
the screen. Rasterization hardware compares the fragment depth
d against the attribute values in depth buffer using the operation
op.
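One possible realization of Compare in C++/OpenGL is sketched below. Here copyAttributesToDepth is a hypothetical stand-in for the fragment-program copy pass of line 1, and since the OpenGL depth test compares the incoming fragment depth d against the stored value, the operator passed to glDepthFunc may need to be mirrored (e.g., attribute < d becomes GL_GREATER). Attribute values are assumed to have been scaled into the valid depth range.

#include <GL/gl.h>

void copyAttributesToDepth(GLuint attributeTex);   // hypothetical helper for line 1

void compare(GLuint attributeTex, GLenum op, float d) {
    copyAttributesToDepth(attributeTex);            // line 1: attributes -> depth buffer

    // Line 2: configure the depth test with the (possibly mirrored) operator.
    glEnable(GL_DEPTH_TEST);
    glDepthFunc(op);
    glDepthMask(GL_FALSE);                          // compare only; keep stored attribute values
    glColorMask(GL_FALSE, GL_FALSE, GL_FALSE, GL_FALSE);

    // Record the outcome in the stencil buffer for later boolean combination.
    glClearStencil(0);
    glClear(GL_STENCIL_BUFFER_BIT);
    glEnable(GL_STENCIL_TEST);
    glStencilFunc(GL_ALWAYS, 1, 0xFF);
    glStencilOp(GL_KEEP, GL_KEEP, GL_REPLACE);      // passing fragments set stencil to 1

    // Line 3: screen-filling quad at depth d, one fragment per record
    // (transform and depth-range mapping details omitted).
    glBegin(GL_QUADS);
      glVertex3f(-1.0f, -1.0f, d);
      glVertex3f( 1.0f, -1.0f, d);
      glVertex3f( 1.0f,  1.0f, d);
      glVertex3f(-1.0f,  1.0f, d);
    glEnd();
}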
Semi-linear Queries on GPUs
Applications encountered in Geographical Information Sys-
tems (GIS), geometric modeling, and spatial databases de-
fine geometric data objects as linear inequalities of the at-
tributes in a relational database [28]. Such geometric data
objects are called semi-linear sets. GPUs are capable of fast
computation on semi-linear sets. A linear combination of m
attributes is represented as:
Σ_{i=1}^{m} s_i · a_i

where each s_i is a scalar multiplier and each a_i is an attribute
of a record in the database. The above expression can be
considered as a dot product of two vectors s and a, where
s = (s_1, s_2, ..., s_m) and a = (a_1, a_2, ..., a_m).
Semilinear( tex, s, op, b )
1 enable fragment program SemilinearFP( s, op, b )
2 RenderTexturedQuad( tex )
SemilinearFP( s, op, b )
1 a = value from tex
2 if not ( dot( s, a ) op b )
3 discard fragment
ROUTINE 4.2: Semilinear computes the semi-linear query
by performing a linear combination of the attribute values in tex
with the scalar constants in s. Using the operator op, it compares
the scalar value resulting from the linear combination with b. To
perform this operation, we render a screen-filling quad and generate
fragments on which the semi-linear query is executed. For each
fragment, the fragment program SemilinearFP discards fragments
that fail the query.
Semilinear computes the semi-linear query:
(s · a) op b
where op is a comparison operator and b is a scalar con-
stant. The attributes a_i are stored in separate channels in
the texture tex. There is a limit of four channels per texture.
Longer vectors can be split into multiple textures, each with
four components. The fragment program SemilinearFP()
performs the dot product of a texel from tex with s and
compares the result to b. It discards the fragment if the
comparison fails. Line 2 renders a textured quadrilateral
using the fragment program. Semilinear maps very well
to the parallel pixel processing as well as vector processing
capabilities available on the GPUs. This algorithm can also
be extended for evaluating polynomial queries.
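To make the per-record computation explicit, the following CPU reference in C++ evaluates the same query with op fixed to < for illustration; SemilinearFP evaluates the identical expression for all records in parallel, one fragment per record.

#include <array>
#include <cstddef>
#include <vector>

// Returns one flag per record: true if the record satisfies (s · a) < b.
std::vector<bool> semilinearCPU(const std::vector<std::array<float, 4>>& records,
                                const std::array<float, 4>& s, float b) {
    std::vector<bool> result(records.size());
    for (std::size_t r = 0; r < records.size(); ++r) {
        float dot = 0.0f;
        for (int i = 0; i < 4; ++i)
            dot += s[i] * records[r][i];   // s · a
        result[r] = (dot < b);             // op fixed to < in this sketch
    }
    return result;
}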
EvalCNF( A )
1 Clear Stencil to 1.
2 For each of A_i, i = 1, .., k
3 do
4   if ( mod(i, 2) ) /* valid stencil value is 1 */
5     Stencil Test to pass if stencil value is equal to 1
6     StencilOp(KEEP, KEEP, INCR)
7   else /* valid stencil value is 2 */
8     Stencil Test to pass if stencil value is equal to 2
9     StencilOp(KEEP, KEEP, DECR)
10  endif
11  For each B^i_j, j = 1, .., m_i
12  do
13    Perform B^i_j using Compare
14  end for
15  if ( mod(i, 2) ) /* valid stencil value is 2 */
16    if a stencil value on screen is 1, replace it with 0
17  else /* valid stencil value is 1 */
18    if a stencil value on screen is 2, replace it with 0
19  endif
20 end for
ROUTINE 4.3: EvalCNF is used to evaluate a CNF ex-
pression. Initially, the stencil is initialized to 1. This is used
for performing TRUE AND A_1. While evaluating each formula
A_i, Line 4 sets the appropriate stencil test and stencil operations
based on whether i is even or odd. If i is even, valid portions
on screen have stencil value 2. Otherwise, valid portions have
stencil value 1. Lines 11-14 invalidate portions on screen that
satisfy (A_1 ∧ A_2 ∧ ... ∧ A_{i-1}) and fail (A_1 ∧ A_2 ∧ ... ∧ A_i). Lines
15-19 compute the disjunction of the B^i_j for each predicate A_i. At
the end of line 19, valid portions on screen have stencil value 2
if i is odd and 1 otherwise. At the end of line 20, records
corresponding to non-zero stencil values satisfy A.
4.2 Boolean Combination
Complex boolean combinations are often formed by com-
bining simple predicates with the logical operators AND,
OR, NOT. In these cases, the stencil operation is specified
to store the result of a predicate. We use the function Sten-
cilOp (as defined in Section 3.4) to initialize the appropriate
stencil operation for storing the result in stencil buffer.
Our algorithm evaluates a boolean expression represented
as a CNF expression. We assume that the CNF expression
has no NOT operators. If a simple predicate in this ex-
pression has a NOT operator, we can invert the comparison
operation and eliminate the NOT operator. A CNF expres-
sion C_k is represented as A_1 ∧ A_2 ∧ ... ∧ A_k, where each A_i is
represented as B^i_1 ∨ B^i_2 ∨ ... ∨ B^i_{m_i}. Each B^i_j, j = 1, 2, .., m_i,
is a simple predicate.
The CNF C_k can be evaluated using the recursion C_k =
C_{k-1} ∧ A_k. C_0 is considered as TRUE. We use the pseu-
docode in Routine 4.3 for evaluating C_k. Our approach uses
three stencil values, 0, 1, and 2, for validating data. Data values
corresponding to the stencil value 0 are always invalid. Ini-
tially, the stencil values are initialized to 1. If i is the iter-
ation value for the loop in line 2, lines 3-19 evaluate C_i.
The valid stencil value is 1 or 2 depending on whether i is
even or odd, respectively. At the end of line 19, portions on
the screen with non-zero stencil value satisfy the CNF C_k.
We can easily modify our algorithm for handling a boolean
expression represented as a DNF.
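For reference, the per-record boolean value that EvalCNF accumulates through the stencil buffer corresponds to the following CPU sketch, where each inner vector holds the already-evaluated predicates B^i_j of one clause A_i (names are illustrative):

#include <vector>

// True if the record satisfies A = A_1 AND ... AND A_k, with each clause
// A_i = B^i_1 OR ... OR B^i_{m_i}.
bool evalCNF(const std::vector<std::vector<bool>>& clauses) {
    for (const std::vector<bool>& clause : clauses) {   // conjunction over the A_i
        bool anyTrue = false;
        for (bool predicate : clause)                    // disjunction over the B^i_j
            anyTrue = anyTrue || predicate;
        if (!anyTrue) return false;                      // record invalidated (stencil set to 0)
    }
    return true;                                         // non-zero stencil: record satisfies A
}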

References
Sparse matrix solvers on the GPU: conjugate gradients and multigrid.
Cg: a system for programming graphics hardware in a C-like language.
Ray tracing on programmable graphics hardware.
Fast computation of generalized Voronoi diagrams using graphics hardware.
DBMSs on a Modern Processor: Where Does Time Go?