Home
/
Authors
/
Jaswinder Pal Singh

Author

Jaswinder Pal Singh

Bio: Jaswinder Pal Singh is an academic researcher from Princeton University. The author has contributed to research in topics: Synchronization (computer science) & Garbage collection. The author has an hindex of 4, co-authored 5 publications receiving 3378 citations.

Papers

PDF

Open Access

More filters

Proceedings Article•DOI•

The PARSEC benchmark suite: characterization and architectural implications

[...]

Christian Bienia¹, Sanjeev Kumar², Jaswinder Pal Singh¹, Kai Li¹•Institutions (2)

Princeton University¹, Intel²

25 Oct 2008

TL;DR: This paper presents and characterizes the Princeton Application Repository for Shared-Memory Computers (PARSEC), a benchmark suite for studies of Chip-Multiprocessors (CMPs), and shows that the benchmark suite covers a wide spectrum of working sets, locality, data sharing, synchronization and off-chip traffic.

...read moreread less

Abstract: This paper presents and characterizes the Princeton Application Repository for Shared-Memory Computers (PARSEC), a benchmark suite for studies of Chip-Multiprocessors (CMPs). Previous available benchmarks for multiprocessors have focused on high-performance computing applications and used a limited number of synchronization methods. PARSEC includes emerging applications in recognition, mining and synthesis (RMS) as well as systems applications which mimic large-scale multithreaded commercial programs. Our characterization shows that the benchmark suite covers a wide spectrum of working sets, locality, data sharing, synchronization and off-chip traffic. The benchmark suite has been made available to the public.

...read moreread less

3,514 citations

Proceedings Article•DOI•

Creating and preserving locality of java applications at allocation and garbage collection times

[...]

Yefim Shuf¹, Manish Gupta¹, Hubertus Franke¹, Andrew W. Appel², Jaswinder Pal Singh² - Show less +1 more•Institutions (2)

IBM¹, Princeton University²

04 Nov 2002

TL;DR: An allocation time object placement technique based on the recently introduced notion of prolific (frequently instantiated) types that attempts to co-locate, at allocation time, objects of prolific types that are connected via object references and a novel locality based graph traversal technique.

...read moreread less

Abstract: The growing gap between processor and memory speeds is motivating the need for optimization strategies that improve data locality. A major challenge is to devise techniques suitable for pointer-intensive applications. This paper presents two techniques aimed at improving the memory behavior of pointer-intensive applications with dynamic memory allocation, such as those written in Java. First, we present an allocation time object placement technique based on the recently introduced notion of prolific (frequently instantiated) types. We attempt to co-locate, at allocation time, objects of prolific types that are connected via object references. Then, we present a novel locality based graph traversal technique. The benefits of this technique, when applied to garbage collection (GC), are twofold: (i) it improves the performance of GC due to better locality during a heap traversal and (ii) it restructures surviving objects in a way that enhances locality. On multiprocessors, this technique can further reduce overhead due to synchronization and false sharing. The experimental results, on a well-known suite of Java benchmarks (SPECjvm98 [26], SPECjbb2000 [27], and jOlden [4]), from an implementation of these techniques in the Jikes RVM [1], are very encouraging. The object co-allocation technique improves application performance by up to 21% (10% on average) in the Jikes RVM configured with a non-copying mark-and-sweep collector. The locality-based traversal technique reduces GC times by up to 20% (10% on average) and improves the performance of applications by up to 14% (6% on average) in the Jikes RVM configured with a copying semi-space collector. Both techniques combined can improve application performance by up to 22% (10% on average) in the Jikes RVM configured with a non-copying mark-and-sweep collector.

...read moreread less

73 citations

Journal Article•DOI•

A scalable clustered camera system for multiple object tracking

[...]

Senem Velipasalar¹, Jason Schlessman², Cheng-Yao Chen², Wayne Wolf³, Jaswinder Pal Singh² - Show less +1 more•Institutions (3)

University of Nebraska–Lincoln¹, Princeton University², Georgia Institute of Technology³

09 Jul 2008-Eurasip Journal on Image and Video Processing

TL;DR: The scalable clustered camera system is introduced, which is a peer-to-peer multicamera system for multiple object tracking, and each camera in the presented system performs its own tracking, keeping its own trajectories for each target object, which provides fault tolerance.

...read moreread less

Abstract: Reliable and efficient tracking of objects by multiple cameras is an important and challenging problem, which finds wide-ranging application areas. Most existing systems assume that data from multiple cameras is processed on a single processing unit or by a centralized server. However, these approaches are neither scalable nor fault tolerant. We propose multicamera algorithms that operate on peer-to-peer computing systems. Peer-to-peer vision systems require codesign of image processing and distributed computing algorithms as well as sophisticated communication protocols, which should be carefully designed and verified to avoid deadlocks and other problems. This paper introduces the scalable clustered camera system, which is a peer-to-peer multicamera system for multiple object tracking. Instead of transferring control of tracking jobs from one camera to another, each camera in the presented system performs its own tracking, keeping its own trajectories for each target object, which provides fault tolerance. A fast and robust tracking algorithm is proposed to perform tracking on each camera view, while maintaining consistent labeling. In addition, a novel communication protocol is introduced, which can handle the problems caused by communication delays and different processor loads and speeds, and incorporates variable synchronization capabilities, so as to allow flexibility with accuracy tradeoffs. This protocol was exhaustively verified by using the SPIN verification tool. The success of the proposed system is demonstrated on different scenarios captured by multiple cameras placed in different setups. Also, simulation and verification results for the protocol are presented.

...read moreread less

28 citations

Proceedings Article•DOI•

Optimizing Communication Scheduling Using Dataflow Semantics

[...]

Adrian Soviani¹, Jaswinder Pal Singh¹•Institutions (1)

Princeton University¹

22 Sep 2009

TL;DR: It is shown how coarse grain dataflow semantics (CGD) applied to SPMD algorithms makes application development and design space exploration simpler compared to message passing, at the same time providing on par performance.

...read moreread less

Abstract: We show how coarse grain dataflow semantics (CGD) applied to SPMD algorithms makes application development and design space exploration simpler compared to message passing, at the same time providing on par performance. CGD applications are specified as dependencies between computation modules and data distributions. Communication and synchronization are added automatically and optimized for specific architectures, relieving programmers of this task. Many high level algorithm changes are easy to implement in CGD by redefining data distributions. These include exposing communication overlap by decreasing task grain, and aggregating communication by replicating data and computation. We briefly present a coordination language with dataflow semantics that implements the CGD model. Our implementation currently supports MPI, SHMEM, and pthreads. Results on Altix 4700 show our optimized CGD FT is 27% faster than original NPB 2.3 MPI implementation, and optimized CGD stencil has a 41% advantage over handwritten MPI.

...read moreread less

5 citations

Proceedings Article•DOI•

Estimating Application Hierarchical Bandwidth Requirements Using BSP Family Models

[...]

Adrian Soviani¹, Jaswinder Pal Singh¹•Institutions (1)

Princeton University¹

21 May 2012

TL;DR: A hierarchical bandwidth machine model (alpha DBSP) is presented that naturally extends the Decomposable BSP (DBSP) model by associating a bandwidth growth factor alpha to each message pattern and estimates the hierarchical bandwidth required by a given application.

...read moreread less

Abstract: There has been a vast amount of work to develop programming models that provide good performance across machine architectures, are easy to use, and have predictable performance. Similarly, the design and optimization of architectures to achieve optimal performance for an application class remains a challenging task. Accurate cost modeling is essential for both application development and system design. Many scientific computing codes are developed by using libraries that provide custom-built collective communication primitives. For example, the family of Bulk Synchronous Parallel (BSP) machine models provides suitable tools for analyzing such problems. However, modeling the effect of bandwidth limitations for globally unbalanced communication and estimating the hierarchical bandwidth used by applications remain key challenges. We present a hierarchical bandwidth machine model (alpha DBSP) that naturally extends the Decomposable BSP (DBSP) model by associating a bandwidth growth factor alpha to each message pattern. Algorithms executed on alpha DBSP have a runtime that is at least as good as DBSP. Hence, there are globally unbalanced problems for which alpha DBSP analysis is simpler or more accurate We present three scientific computing kernels that illustrate the differences between alpha DBSP and DBSP analysis. Similar to the BSP family models, alpha DBSP predicts collective communication execution time for a given machine. Additionally, alpha DBSP estimates the hierarchical bandwidth required by a given application. System architects may use this estimation to design machines that avoid bandwidth bottlenecks for their target application class.

...read moreread less

4 citations

Cited by

PDF

Open Access

More filters

Proceedings Article•DOI•

Rodinia: A benchmark suite for heterogeneous computing

[...]

Shuai Che¹, Michael Boyer¹, Jiayuan Meng¹, David Tarjan¹, Jeremy W. Sheaffer¹, Sang-Ha Lee¹, Kevin Skadron¹ - Show less +3 more•Institutions (1)

University of Virginia¹

04 Oct 2009

TL;DR: This characterization shows that the Rodinia benchmarks cover a wide range of parallel communication patterns, synchronization techniques and power consumption, and has led to some important architectural insight, such as the growing importance of memory-bandwidth limitations and the consequent importance of data layout.

...read moreread less

Abstract: This paper presents and characterizes Rodinia, a benchmark suite for heterogeneous computing. To help architects study emerging platforms such as GPUs (Graphics Processing Units), Rodinia includes applications and kernels which target multi-core CPU and GPU platforms. The choice of applications is inspired by Berkeley's dwarf taxonomy. Our characterization shows that the Rodinia benchmarks cover a wide range of parallel communication patterns, synchronization techniques and power consumption, and has led to some important architectural insight, such as the growing importance of memory-bandwidth limitations and the consequent importance of data layout.

...read moreread less

2,697 citations

Proceedings Article•DOI•

McPAT: an integrated power, area, and timing modeling framework for multicore and manycore architectures

[...]

Sheng Li¹, Jung Ho Ahn², Richard Strong³, Jay B. Brockman¹, Dean M. Tullsen³, Norman P. Jouppi⁴ - Show less +2 more•Institutions (4)

University of Notre Dame¹, Seoul National University², University of California, San Diego³, Hewlett-Packard⁴

12 Dec 2009

TL;DR: Combining power, area, and timing results of McPAT with performance simulation of PARSEC benchmarks at the 22nm technology node for both common in-order and out-of-order manycore designs shows that when die cost is not taken into account clustering 8 cores together gives the best energy-delay product, whereas when cost is taking into account configuring clusters with 4 cores gives thebest EDA2P and EDAP.

...read moreread less

Abstract: This paper introduces McPAT, an integrated power, area, and timing modeling framework that supports comprehensive design space exploration for multicore and manycore processor configurations ranging from 90nm to 22nm and beyond. At the microarchitectural level, McPAT includes models for the fundamental components of a chip multiprocessor, including in-order and out-of-order processor cores, networks-on-chip, shared caches, integrated memory controllers, and multiple-domain clocking. At the circuit and technology levels, McPAT supports critical-path timing modeling, area modeling, and dynamic, short-circuit, and leakage power modeling for each of the device types forecast in the ITRS roadmap including bulk CMOS, SOI, and double-gate transistors. McPAT has a flexible XML interface to facilitate its use with many performance simulators. Combined with a performance simulator, McPAT enables architects to consistently quantify the cost of new ideas and assess tradeoffs of different architectures using new metrics like energy-delay-area2 product (EDA2P) and energy-delay-area product (EDAP). This paper explores the interconnect options of future manycore processors by varying the degree of clustering over generations of process technologies. Clustering will bring interesting tradeoffs between area and performance because the interconnects needed to group cores into clusters incur area overhead, but many applications can make good use of them due to synergies of cache sharing. Combining power, area, and timing results of McPAT with performance simulation of PARSEC benchmarks at the 22nm technology node for both common in-order and out-of-order manycore designs shows that when die cost is not taken into account clustering 8 cores together gives the best energy-delay product, whereas when cost is taken into account configuring clusters with 4 cores gives the best EDA2P and EDAP.

...read moreread less

2,487 citations

Journal Article•DOI•

Roofline: an insightful visual performance model for multicore architectures

[...]

Samuel Williams¹, Andrew Waterman², David A. Patterson²•Institutions (2)

Lawrence Berkeley National Laboratory¹, University of California, Berkeley²

01 Apr 2009-Communications of The ACM

TL;DR: The Roofline model offers insight on how to improve the performance of software and hardware in the rapidly changing world of connected devices.

...read moreread less

Abstract: The Roofline model offers insight on how to improve the performance of software and hardware.

...read moreread less

2,181 citations

Proceedings Article•DOI•

DianNao: a small-footprint high-throughput accelerator for ubiquitous machine-learning

[...]

Tianshi Chen¹, Zidong Du¹, Ninghui Sun¹, Jia Wang¹, Chengyong Wu¹, Yunji Chen¹, Olivier Temam² - Show less +3 more•Institutions (2)

Chinese Academy of Sciences¹, French Institute for Research in Computer Science and Automation²

24 Feb 2014

TL;DR: This study designs an accelerator for large-scale CNNs and DNNs, with a special emphasis on the impact of memory on accelerator design, performance and energy, and shows that it is possible to design an accelerator with a high throughput, capable of performing 452 GOP/s in a small footprint.

...read moreread less

Abstract: Machine-Learning tasks are becoming pervasive in a broad range of domains, and in a broad range of systems (from embedded systems to data centers). At the same time, a small set of machine-learning algorithms (especially Convolutional and Deep Neural Networks, i.e., CNNs and DNNs) are proving to be state-of-the-art across many applications. As architectures evolve towards heterogeneous multi-cores composed of a mix of cores and accelerators, a machine-learning accelerator can achieve the rare combination of efficiency (due to the small number of target algorithms) and broad application scope. Until now, most machine-learning accelerator designs have focused on efficiently implementing the computational part of the algorithms. However, recent state-of-the-art CNNs and DNNs are characterized by their large size. In this study, we design an accelerator for large-scale CNNs and DNNs, with a special emphasis on the impact of memory on accelerator design, performance and energy. We show that it is possible to design an accelerator with a high throughput, capable of performing 452 GOP/s (key NN operations such as synaptic weight multiplications and neurons outputs additions) in a small footprint of 3.02 mm2 and 485 mW; compared to a 128-bit 2GHz SIMD processor, the accelerator is 117.87x faster, and it can reduce the total energy by 21.08x. The accelerator characteristics are obtained after layout at 65 nm. Such a high throughput in a small footprint can open up the usage of state-of-the-art machine-learning algorithms in a broad set of systems and for a broad set of applications.

...read moreread less

1,582 citations

Journal Article•DOI•

Dark Silicon and the End of Multicore Scaling

[...]

Hadi Esmaeilzadeh¹, Emily Blem², R. St. Amant³, Karthikeyan Sankaralingam², Doug Burger⁴ - Show less +1 more•Institutions (4)

University of Washington¹, University of Wisconsin-Madison², University of Texas at Austin³, Microsoft⁴

01 May 2012-IEEE Micro

TL;DR: A comprehensive study that projects the speedup potential of future multicores and examines the underutilization of integration capacity-dark silicon-is timely and crucial.

...read moreread less

Abstract: A key question for the microprocessor research and design community is whether scaling multicores will provide the performance and value needed to scale down many more technology generations. To provide a quantitative answer to this question, a comprehensive study that projects the speedup potential of future multicores and examines the underutilization of integration capacity-dark silicon-is timely and crucial.

...read moreread less

1,556 citations

1
2
3
4
…
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200

Collapse