Author

Jatin Chhugani

Other affiliations: IBM, Johns Hopkins University, eBay

Bio: Jatin Chhugani is a researcher from Intel who has contributed to research on topics including SIMD and rendering (computer graphics). The author has an h-index of 28 and has co-authored 80 publications receiving 4,728 citations. Previous affiliations include IBM and Johns Hopkins University.

Papers published on a yearly basis

Years with publications: 2023, 2016, 2015, 2014, 2012, 2011, 2010, 2009, 2008, 2007, 2005, 2004, 2003, 2002, 2001, 2000, 1999

Papers

Proceedings Article•DOI•

Debunking the 100X GPU vs. CPU myth: an evaluation of throughput computing on CPU and GPU

Victor W. Lee¹, Changkyu Kim¹, Jatin Chhugani¹, Michael E. Deisher¹, Daehyun Kim¹, Anthony D. Nguyen¹, Nadathur Satish¹, Mikhail Smelyanskiy¹, Srinivas Chennupaty¹, Per Hammarlund¹, Ronak Singhal¹, Pradeep Dubey¹

Intel¹

19 Jun 2010

TL;DR: This paper discusses optimization techniques for both CPU and GPU, analyzes what architecture features contributed to performance differences between the two architectures, and recommends a set of architectural features which provide significant improvement in architectural efficiency for throughput kernels.

Abstract: Recent advances in computing have led to an explosion in the amount of data being generated. Processing the ever-growing data in a timely manner has made throughput computing an important aspect for emerging applications. Our analysis of a set of important throughput computing kernels shows that there is an ample amount of parallelism in these kernels which makes them suitable for today's multi-core CPUs and GPUs. In the past few years there have been many studies claiming GPUs deliver substantial speedups (between 10X and 1000X) over multi-core CPUs on these kernels. To understand where such large performance difference comes from, we perform a rigorous performance analysis and find that after applying optimizations appropriate for both CPUs and GPUs the performance gap between an Nvidia GTX280 processor and the Intel Core i7-960 processor narrows to only 2.5x on average. In this paper, we discuss optimization techniques for both CPU and GPU, analyze what architecture features contributed to performance differences between the two architectures, and recommend a set of architectural features which provide significant improvement in architectural efficiency for throughput kernels.

810 citations

Proceedings Article•DOI•

ClearPath: highly parallel collision avoidance for multi-agent simulation

Stephen J. Guy¹, Jatin Chhugani², Changkyu Kim², Nadathur Satish², Ming C. Lin¹, Dinesh Manocha¹, Pradeep Dubey²

University of North Carolina at Chapel Hill¹, Intel²

01 Aug 2009

TL;DR: The approach extends the notion of velocity obstacles from robotics and formulates the conditions for collision free navigation as a quadratic optimization problem and uses a discrete optimization method to efficiently compute the motion of each agent.

Abstract: We present a new local collision avoidance algorithm between multiple agents for real-time simulations. Our approach extends the notion of velocity obstacles from robotics and formulates the conditions for collision-free navigation as a quadratic optimization problem. We use a discrete optimization method to efficiently compute the motion of each agent. The resulting algorithm can be parallelized by exploiting data-level and thread-level parallelism. The overall approach, ClearPath, is general and can robustly handle dense scenarios with tens or hundreds of thousands of heterogeneous agents in a few milliseconds. Compared to prior collision avoidance algorithms, we observe more than an order of magnitude performance improvement.

336 citations

Proceedings Article•DOI•

FAST: fast architecture sensitive tree search on modern CPUs and GPUs

Changkyu Kim¹, Jatin Chhugani¹, Nadathur Satish¹, Eric Sedlar², Anthony D. Nguyen¹, Tim Kaldewey², Victor W. Lee¹, Scott A. Brandt³, Pradeep Dubey¹

Intel¹, Oracle Corporation², University of California, Santa Cruz³

06 Jun 2010

TL;DR: FAST is an extremely fast architecture sensitive layout of the index tree logically organized to optimize for architecture features like page size, cache line size, and SIMD width of the underlying hardware, and achieves a 6X performance improvement over uncompressed index search for large keys on CPUs.

Abstract: In-memory tree-structured index search is a fundamental database operation. Modern processors provide tremendous computing power by integrating multiple cores, each with wide vector units. There has been much work to exploit modern processor architectures for database primitives like scan, sort, join and aggregation. However, unlike other primitives, tree search presents significant challenges due to irregular and unpredictable data accesses in tree traversal. In this paper, we present FAST, an extremely fast architecture-sensitive layout of the index tree. FAST is a binary tree logically organized to optimize for architecture features like page size, cache line size, and SIMD width of the underlying hardware. FAST eliminates the impact of memory latency, and exploits thread-level and data-level parallelism on both CPUs and GPUs to achieve 50 million (CPU) and 85 million (GPU) queries per second, 5X (CPU) and 1.7X (GPU) faster than the best previously reported performance on the same architectures. FAST supports efficient bulk updates by rebuilding index trees in less than 0.1 seconds for datasets as large as 64M keys and naturally integrates compression techniques, overcoming the memory bandwidth bottleneck and achieving a 6X performance improvement over uncompressed index search for large keys on CPUs.

323 citations

Journal Article•DOI•

Sort vs. Hash revisited: fast join implementation on modern multi-core CPUs

Changkyu Kim¹, Tim Kaldewey², Victor W. Lee¹, Eric Sedlar², Anthony D. Nguyen¹, Nadathur Satish¹, Jatin Chhugani¹, Andrea Di Blas², Pradeep Dubey¹

Intel¹, Oracle Corporation²

01 Aug 2009

TL;DR: This paper re-examines two popular join algorithms to determine if the latest computer architecture trends shift the tide that has favored hash join for many years and offers multicore implementations of hash join and sort-merge join which consistently outperform all previously reported results.

Abstract: Join is an important database operation. As computer architectures evolve, the best join algorithm may change hands. This paper re-examines two popular join algorithms -- hash join and sort-merge join -- to determine if the latest computer architecture trends shift the tide that has favored hash join for many years. For a fair comparison, we implemented the most optimized parallel version of both algorithms on the latest Intel Core i7 platform. Both implementations scale well with the number of cores in the system and take advantage of the latest processor features for performance. Our hash-based implementation achieves more than 100M tuples per second, which is 17X faster than the best reported performance on CPUs and 8X faster than that reported for GPUs. Moreover, the performance of our hash join implementation is consistent over a wide range of input data sizes from 64K to 128M tuples and is not affected by data skew. We compare this implementation to our highly optimized sort-based implementation that achieves 47M to 80M tuples per second. We developed analytical models to study how both algorithms would scale with upcoming processor architecture trends. Our analysis projects that current architectural trends of wider SIMD, more cores, and smaller memory bandwidth per core imply better scalability potential for sort-merge join. Consequently, sort-merge join is likely to outperform hash join on upcoming chip multiprocessors. In summary, we offer multicore implementations of hash join and sort-merge join which consistently outperform all previously reported results. We further conclude that the tide that favors the hash join algorithm has not changed yet, but the change is just around the corner.
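
For readers unfamiliar with the two algorithms compared in this abstract, the following is a minimal single-threaded sketch of each. It is an illustration only and not the paper's optimized multicore, SIMD implementation; the function names and the list-of-`(key, value)`-tuples input format are assumptions made for this example.

```python
from collections import defaultdict


def hash_join(left, right):
    """Equi-join two lists of (key, value) tuples using a hash table.

    Build phase: index every row of `left` by key.
    Probe phase: look up each row of `right` in that index.
    """
    table = defaultdict(list)
    for key, value in left:
        table[key].append(value)
    return [(key, lv, rv) for key, rv in right for lv in table.get(key, [])]


def sort_merge_join(left, right):
    """Equi-join two lists of (key, value) tuples by sorting, then merging."""
    left = sorted(left, key=lambda row: row[0])
    right = sorted(right, key=lambda row: row[0])
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of equal keys on each side, then emit their
            # cross product so duplicate keys are handled correctly.
            i_end = i
            while i_end < len(left) and left[i_end][0] == lk:
                i_end += 1
            j_end = j
            while j_end < len(right) and right[j_end][0] == lk:
                j_end += 1
            for _, lv in left[i:i_end]:
                for _, rv in right[j:j_end]:
                    out.append((lk, lv, rv))
            i, j = i_end, j_end
    return out
```

Both functions return the same set of `(key, left_value, right_value)` rows, possibly in a different order. The hash join does random lookups into a table, while the sort-merge join does sequential passes over sorted data, which is the access-pattern difference the abstract's scalability argument rests on.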

311 citations

Proceedings Article•DOI•

3.5-D Blocking Optimization for Stencil Computations on Modern CPUs and GPUs

Anthony Nguyen¹, Nadathur Satish¹, Jatin Chhugani¹, Changkyu Kim¹, Pradeep Dubey¹

Intel¹

13 Nov 2010

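TL;DR: A novel 3.5D-blocking algorithm that performs 2.5D-spatial and temporal blocking of the input grid into on-chip memory for both CPUs and GPUs.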

Abstract: Stencil computation sweeps over a spatial grid over multiple time steps to perform nearest-neighbor computations. The bandwidth-to-compute requirement for a large class of stencil kernels is very high, and their performance is bound by the available memory bandwidth. Since memory bandwidth grows slower than compute, the performance of stencil kernels will not scale with increasing compute density. We present a novel 3.5D-blocking algorithm that performs 2.5D-spatial and temporal blocking of the input grid into on-chip memory for both CPUs and GPUs. The resultant algorithm is amenable to both thread-level and data-level parallelism, and scales near-linearly with the SIMD width and multiple cores. Our performance numbers are faster or comparable to state-of-the-art stencil implementations on CPUs and GPUs. Our implementation of 7-point stencil is 1.5X faster on CPUs, and 1.8X faster on GPUs for single-precision floating point inputs than previously reported numbers. For Lattice Boltzmann methods, the corresponding speedup number on CPUs is 2.1X.
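
As a point of reference for the kernel named in this abstract, here is a minimal, unblocked sketch of one 7-point stencil sweep over a 3D grid. It shows only the baseline nearest-neighbor computation, not the paper's blocking algorithm; the function name, the nested-list grid layout, and the `alpha` and `beta` weights are assumptions chosen for illustration.

```python
def stencil_7pt(grid, alpha=0.5, beta=1.0 / 12.0):
    """Apply one 7-point stencil sweep to a 3D grid stored as nested lists.

    Each interior cell becomes alpha * centre + beta * (sum of its six
    face neighbours). Boundary cells are copied through unchanged.
    The weights alpha and beta are illustrative assumptions.
    """
    nz, ny, nx = len(grid), len(grid[0]), len(grid[0][0])
    out = [[row[:] for row in plane] for plane in grid]
    for z in range(1, nz - 1):
        for y in range(1, ny - 1):
            for x in range(1, nx - 1):
                neighbours = (
                    grid[z - 1][y][x] + grid[z + 1][y][x]
                    + grid[z][y - 1][x] + grid[z][y + 1][x]
                    + grid[z][y][x - 1] + grid[z][y][x + 1]
                )
                out[z][y][x] = alpha * grid[z][y][x] + beta * neighbours
    return out


def run_time_steps(grid, steps):
    """Repeat the sweep; every step re-reads the whole grid."""
    for _ in range(steps):
        grid = stencil_7pt(grid)
    return grid
```

Each cell update reads seven values and performs only a handful of arithmetic operations, and every time step streams the full grid again, which is why the abstract describes such kernels as bound by memory bandwidth.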

299 citations

Cited by

Journal Article•DOI•

Second-generation PLINK: rising to the challenge of larger and richer datasets

Christopher C. Chang, Carson C. Chow¹, Laurent C. A. M. Tellier², Shashaank Vattikuti¹, Shaun Purcell³, James J. Lee⁴

National Institutes of Health¹, University of Copenhagen², Icahn School of Medicine at Mount Sinai³, University of Minnesota⁴

25 Feb 2015-GigaScience

TL;DR: The second-generation versions of PLINK will offer dramatic improvements in performance and compatibility, and for the first time, users without access to high-end computing resources can perform several essential analyses of the feature-rich and very large genetic datasets coming into use.

Abstract: Background: PLINK 1 is a widely used open-source C/C++ toolset for genome-wide association studies (GWAS) and research in population genetics. However, the steady accumulation of data from imputation and whole-genome sequencing studies has exposed a strong need for faster and scalable implementations of key functions, such as logistic regression, linkage disequilibrium estimation, and genomic distance evaluation. In addition, GWAS and population-genetic data now frequently contain genotype likelihoods, phase information, and/or multiallelic variants, none of which can be represented by PLINK 1’s primary data format. Findings: To address these issues, we are developing a second-generation codebase for PLINK. The first major release from this codebase, PLINK 1.9, introduces extensive use of bit-level parallelism, O(√n)-time/constant-space Hardy-Weinberg equilibrium and Fisher’s exact tests, and many other algorithmic improvements. In combination, these changes accelerate most operations by 1-4 orders of magnitude, and allow the program to handle datasets too large to fit in RAM. We have also developed an extension to the data format which adds low-overhead support for genotype likelihoods, phase, multiallelic variants, and reference vs. alternate alleles, which is the basis of our planned second release (PLINK 2.0). Conclusions: The second-generation versions of PLINK will offer dramatic improvements in performance and compatibility. For the first time, users without access to high-end computing resources can perform several essential analyses of the feature-rich and very large genetic datasets coming into use.

7,038 citations

Proceedings Article•DOI•

The PARSEC benchmark suite: characterization and architectural implications

Christian Bienia¹, Sanjeev Kumar², Jaswinder Pal Singh¹, Kai Li¹

Princeton University¹, Intel²

25 Oct 2008

TL;DR: This paper presents and characterizes the Princeton Application Repository for Shared-Memory Computers (PARSEC), a benchmark suite for studies of Chip-Multiprocessors (CMPs), and shows that the benchmark suite covers a wide spectrum of working sets, locality, data sharing, synchronization and off-chip traffic.

Abstract: This paper presents and characterizes the Princeton Application Repository for Shared-Memory Computers (PARSEC), a benchmark suite for studies of Chip-Multiprocessors (CMPs). Previously available benchmarks for multiprocessors have focused on high-performance computing applications and used a limited number of synchronization methods. PARSEC includes emerging applications in recognition, mining and synthesis (RMS) as well as systems applications which mimic large-scale multithreaded commercial programs. Our characterization shows that the benchmark suite covers a wide spectrum of working sets, locality, data sharing, synchronization and off-chip traffic. The benchmark suite has been made available to the public.

3,514 citations

Journal Article•DOI•

Second-generation PLINK: rising to the challenge of larger and richer datasets

Christopher C. Chang, Carson C. Chow¹, Laurent C. A. M. Tellier², Shashaank Vattikuti¹, Shaun Purcell³, James J. Lee⁴

National Institutes of Health¹, University of Copenhagen², Icahn School of Medicine at Mount Sinai³, University of Minnesota⁴

17 Oct 2014-arXiv: Genomics

TL;DR: PLINK is a widely used open-source C/C++ toolset for genome-wide association studies (GWAS) and research in population genetics; this paper describes its second-generation codebase.

Abstract: PLINK 1 is a widely used open-source C/C++ toolset for genome-wide association studies (GWAS) and research in population genetics. However, the steady accumulation of data from imputation and whole-genome sequencing studies has exposed a strong need for even faster and more scalable implementations of key functions. In addition, GWAS and population-genetic data now frequently contain probabilistic calls, phase information, and/or multiallelic variants, none of which can be represented by PLINK 1's primary data format. To address these issues, we are developing a second-generation codebase for PLINK. The first major release from this codebase, PLINK 1.9, introduces extensive use of bit-level parallelism, O(sqrt(n))-time/constant-space Hardy-Weinberg equilibrium and Fisher's exact tests, and many other algorithmic improvements. In combination, these changes accelerate most operations by 1-4 orders of magnitude, and allow the program to handle datasets too large to fit in RAM. This will be followed by PLINK 2.0, which will introduce (a) a new data format capable of efficiently representing probabilities, phase, and multiallelic variants, and (b) extensions of many functions to account for the new types of information. The second-generation versions of PLINK will offer dramatic improvements in performance and compatibility. For the first time, users without access to high-end computing resources can perform several essential analyses of the feature-rich and very large genetic datasets coming into use.

3,513 citations

Journal Article•

When is nearest neighbor meaningful

Kevin S. Beyer, Jonathan Goldstein, Raghu Ramakrishnan, Uri Shaft

01 Jan 1999-Lecture Notes in Computer Science

TL;DR: The authors explore the effect of dimensionality on the nearest neighbor problem and show that under a broad set of conditions (much broader than independent and identically distributed dimensions), as dimensionality increases, the distance to the nearest data point approaches the distance to the farthest data point.

Abstract: We explore the effect of dimensionality on the nearest neighbor problem. We show that under a broad set of conditions (much broader than independent and identically distributed dimensions), as dimensionality increases, the distance to the nearest data point approaches the distance to the farthest data point. To provide a practical perspective, we present empirical results on both real and synthetic data sets that demonstrate that this effect can occur for as few as 10-15 dimensions. These results should not be interpreted to mean that high-dimensional indexing is never meaningful; we illustrate this point by identifying some high-dimensional workloads for which this effect does not occur. However, our results do emphasize that the methodology used almost universally in the database literature to evaluate high-dimensional indexing techniques is flawed, and should be modified. In particular, most such techniques proposed in the literature are not evaluated versus simple linear scan, and are evaluated over workloads for which nearest neighbor is not meaningful. Often, even the reported experiments, when analyzed carefully, show that linear scan would outperform the techniques being proposed on the workloads studied in high (10-15) dimensionality!
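
The effect described in this abstract can be observed with a small experiment. The sketch below is an illustration under stated assumptions, not the paper's methodology: it draws points uniformly at random with independent dimensions (the narrow special case the abstract mentions), and the function name, point count, and seed are hypothetical choices.

```python
import math
import random


def farthest_to_nearest_ratio(dim, n_points=200, seed=0):
    """Ratio of farthest to nearest Euclidean distance from a random query.

    Points and the query are drawn uniformly from the unit hypercube
    (an assumption of this sketch). A ratio close to 1 means the nearest
    and farthest neighbours are almost equally far away.
    """
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    distances = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        squared = sum((p - q) ** 2 for p, q in zip(point, query))
        distances.append(math.sqrt(squared))
    return max(distances) / min(distances)


if __name__ == "__main__":
    # Print the ratio for a few dimensionalities to see the trend.
    for dim in (2, 10, 50, 200):
        print(dim, round(farthest_to_nearest_ratio(dim), 2))
```

On data like this the ratio shrinks toward 1 as the dimensionality grows, so the nearest neighbor is barely closer than the farthest one, which is the sense in which the abstract calls the query no longer meaningful.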

1,992 citations

Posted Content•

Billion-scale similarity search with GPUs

Jeff Johnson¹, Matthijs Douze¹, Hervé Jégou¹

Facebook¹

28 Feb 2017-arXiv: Computer Vision and Pattern Recognition

TL;DR: In this paper, the authors propose a design for k-selection that operates at up to 55% of theoretical peak performance, enabling a nearest neighbor implementation that is 8.5x faster than prior GPU state of the art.

Abstract: Similarity search finds application in specialized database systems handling complex data such as images or videos, which are typically represented by high-dimensional features and require specific indexing structures. This paper tackles the problem of better utilizing GPUs for this task. While GPUs excel at data-parallel tasks, prior approaches are bottlenecked by algorithms that expose less parallelism, such as k-min selection, or make poor use of the memory hierarchy. We propose a design for k-selection that operates at up to 55% of theoretical peak performance, enabling a nearest neighbor implementation that is 8.5x faster than prior GPU state of the art. We apply it in different similarity search scenarios, by proposing optimized design for brute-force, approximate and compressed-domain search based on product quantization. In all these setups, we outperform the state of the art by large margins. Our implementation enables the construction of a high accuracy k-NN graph on 95 million images from the Yfcc100M dataset in 35 minutes, and of a graph connecting 1 billion vectors in less than 12 hours on 4 Maxwell Titan X GPUs. We have open-sourced our approach for the sake of comparison and reproducibility.

1,663 citations
