Author

Jaswinder Pal Singh

Bio: Jaswinder Pal Singh is an academic researcher from Princeton University. The author has contributed to research in topics: Synchronization (computer science) & Garbage collection. The author has an h-index of 4, co-authored 5 publications receiving 3378 citations.

Papers
Proceedings ArticleDOI
25 Oct 2008
TL;DR: This paper presents and characterizes the Princeton Application Repository for Shared-Memory Computers (PARSEC), a benchmark suite for studies of Chip-Multiprocessors (CMPs), and shows that the benchmark suite covers a wide spectrum of working sets, locality, data sharing, synchronization and off-chip traffic.
Abstract: This paper presents and characterizes the Princeton Application Repository for Shared-Memory Computers (PARSEC), a benchmark suite for studies of Chip-Multiprocessors (CMPs). Previously available benchmarks for multiprocessors have focused on high-performance computing applications and used a limited number of synchronization methods. PARSEC includes emerging applications in recognition, mining and synthesis (RMS) as well as systems applications which mimic large-scale multithreaded commercial programs. Our characterization shows that the benchmark suite covers a wide spectrum of working sets, locality, data sharing, synchronization and off-chip traffic. The benchmark suite has been made available to the public.

3,514 citations

Proceedings ArticleDOI
04 Nov 2002
TL;DR: An allocation time object placement technique based on the recently introduced notion of prolific (frequently instantiated) types that attempts to co-locate, at allocation time, objects of prolific types that are connected via object references and a novel locality based graph traversal technique.
Abstract: The growing gap between processor and memory speeds is motivating the need for optimization strategies that improve data locality. A major challenge is to devise techniques suitable for pointer-intensive applications. This paper presents two techniques aimed at improving the memory behavior of pointer-intensive applications with dynamic memory allocation, such as those written in Java. First, we present an allocation time object placement technique based on the recently introduced notion of prolific (frequently instantiated) types. We attempt to co-locate, at allocation time, objects of prolific types that are connected via object references. Then, we present a novel locality based graph traversal technique. The benefits of this technique, when applied to garbage collection (GC), are twofold: (i) it improves the performance of GC due to better locality during a heap traversal and (ii) it restructures surviving objects in a way that enhances locality. On multiprocessors, this technique can further reduce overhead due to synchronization and false sharing. The experimental results, on a well-known suite of Java benchmarks (SPECjvm98 [26], SPECjbb2000 [27], and jOlden [4]), from an implementation of these techniques in the Jikes RVM [1], are very encouraging. The object co-allocation technique improves application performance by up to 21% (10% on average) in the Jikes RVM configured with a non-copying mark-and-sweep collector. The locality-based traversal technique reduces GC times by up to 20% (10% on average) and improves the performance of applications by up to 14% (6% on average) in the Jikes RVM configured with a copying semi-space collector. Both techniques combined can improve application performance by up to 22% (10% on average) in the Jikes RVM configured with a non-copying mark-and-sweep collector.

73 citations

Journal ArticleDOI
TL;DR: The scalable clustered camera system is introduced, which is a peer-to-peer multicamera system for multiple object tracking, and each camera in the presented system performs its own tracking, keeping its own trajectories for each target object, which provides fault tolerance.
Abstract: Reliable and efficient tracking of objects by multiple cameras is an important and challenging problem, which finds wide-ranging application areas. Most existing systems assume that data from multiple cameras is processed on a single processing unit or by a centralized server. However, these approaches are neither scalable nor fault tolerant. We propose multicamera algorithms that operate on peer-to-peer computing systems. Peer-to-peer vision systems require codesign of image processing and distributed computing algorithms as well as sophisticated communication protocols, which should be carefully designed and verified to avoid deadlocks and other problems. This paper introduces the scalable clustered camera system, which is a peer-to-peer multicamera system for multiple object tracking. Instead of transferring control of tracking jobs from one camera to another, each camera in the presented system performs its own tracking, keeping its own trajectories for each target object, which provides fault tolerance. A fast and robust tracking algorithm is proposed to perform tracking on each camera view, while maintaining consistent labeling. In addition, a novel communication protocol is introduced, which can handle the problems caused by communication delays and different processor loads and speeds, and incorporates variable synchronization capabilities, so as to allow flexibility with accuracy tradeoffs. This protocol was exhaustively verified by using the SPIN verification tool. The success of the proposed system is demonstrated on different scenarios captured by multiple cameras placed in different setups. Also, simulation and verification results for the protocol are presented.

28 citations

Proceedings ArticleDOI
22 Sep 2009
TL;DR: It is shown how coarse grain dataflow semantics (CGD) applied to SPMD algorithms makes application development and design space exploration simpler compared to message passing, at the same time providing on par performance.
Abstract: We show how coarse grain dataflow semantics (CGD) applied to SPMD algorithms makes application development and design space exploration simpler compared to message passing, at the same time providing on par performance. CGD applications are specified as dependencies between computation modules and data distributions. Communication and synchronization are added automatically and optimized for specific architectures, relieving programmers of this task. Many high level algorithm changes are easy to implement in CGD by redefining data distributions. These include exposing communication overlap by decreasing task grain, and aggregating communication by replicating data and computation. We briefly present a coordination language with dataflow semantics that implements the CGD model. Our implementation currently supports MPI, SHMEM, and pthreads. Results on an Altix 4700 show that our optimized CGD FT is 27% faster than the original NPB 2.3 MPI implementation, and that the optimized CGD stencil has a 41% advantage over handwritten MPI.

5 citations

Proceedings ArticleDOI
21 May 2012
TL;DR: A hierarchical bandwidth machine model (alpha DBSP) is presented that naturally extends the Decomposable BSP (DBSP) model by associating a bandwidth growth factor alpha to each message pattern and estimates the hierarchical bandwidth required by a given application.
Abstract: There has been a vast amount of work to develop programming models that provide good performance across machine architectures, are easy to use, and have predictable performance. Similarly, the design and optimization of architectures to achieve optimal performance for an application class remains a challenging task. Accurate cost modeling is essential for both application development and system design. Many scientific computing codes are developed by using libraries that provide custom-built collective communication primitives. For example, the family of Bulk Synchronous Parallel (BSP) machine models provides suitable tools for analyzing such problems. However, modeling the effect of bandwidth limitations for globally unbalanced communication and estimating the hierarchical bandwidth used by applications remain key challenges. We present a hierarchical bandwidth machine model (alpha DBSP) that naturally extends the Decomposable BSP (DBSP) model by associating a bandwidth growth factor alpha to each message pattern. Algorithms executed on alpha DBSP have a runtime that is at least as good as DBSP. Hence, there are globally unbalanced problems for which alpha DBSP analysis is simpler or more accurate. We present three scientific computing kernels that illustrate the differences between alpha DBSP and DBSP analysis. Similar to the BSP family models, alpha DBSP predicts collective communication execution time for a given machine. Additionally, alpha DBSP estimates the hierarchical bandwidth required by a given application. System architects may use this estimation to design machines that avoid bandwidth bottlenecks for their target application class.

4 citations


Cited by
Proceedings ArticleDOI
04 Oct 2009
TL;DR: This characterization shows that the Rodinia benchmarks cover a wide range of parallel communication patterns, synchronization techniques and power consumption, and has led to some important architectural insight, such as the growing importance of memory-bandwidth limitations and the consequent importance of data layout.
Abstract: This paper presents and characterizes Rodinia, a benchmark suite for heterogeneous computing. To help architects study emerging platforms such as GPUs (Graphics Processing Units), Rodinia includes applications and kernels which target multi-core CPU and GPU platforms. The choice of applications is inspired by Berkeley's dwarf taxonomy. Our characterization shows that the Rodinia benchmarks cover a wide range of parallel communication patterns, synchronization techniques and power consumption, and has led to some important architectural insight, such as the growing importance of memory-bandwidth limitations and the consequent importance of data layout.

2,697 citations

Proceedings ArticleDOI
12 Dec 2009
TL;DR: Combining power, area, and timing results of McPAT with performance simulation of PARSEC benchmarks at the 22nm technology node for both common in-order and out-of-order manycore designs shows that when die cost is not taken into account clustering 8 cores together gives the best energy-delay product, whereas when cost is taken into account configuring clusters with 4 cores gives the best EDA2P and EDAP.
Abstract: This paper introduces McPAT, an integrated power, area, and timing modeling framework that supports comprehensive design space exploration for multicore and manycore processor configurations ranging from 90nm to 22nm and beyond. At the microarchitectural level, McPAT includes models for the fundamental components of a chip multiprocessor, including in-order and out-of-order processor cores, networks-on-chip, shared caches, integrated memory controllers, and multiple-domain clocking. At the circuit and technology levels, McPAT supports critical-path timing modeling, area modeling, and dynamic, short-circuit, and leakage power modeling for each of the device types forecast in the ITRS roadmap including bulk CMOS, SOI, and double-gate transistors. McPAT has a flexible XML interface to facilitate its use with many performance simulators. Combined with a performance simulator, McPAT enables architects to consistently quantify the cost of new ideas and assess tradeoffs of different architectures using new metrics like energy-delay-area² product (EDA2P) and energy-delay-area product (EDAP). This paper explores the interconnect options of future manycore processors by varying the degree of clustering over generations of process technologies. Clustering will bring interesting tradeoffs between area and performance because the interconnects needed to group cores into clusters incur area overhead, but many applications can make good use of them due to synergies of cache sharing. Combining power, area, and timing results of McPAT with performance simulation of PARSEC benchmarks at the 22nm technology node for both common in-order and out-of-order manycore designs shows that when die cost is not taken into account clustering 8 cores together gives the best energy-delay product, whereas when cost is taken into account configuring clusters with 4 cores gives the best EDA2P and EDAP.

2,487 citations

Journal ArticleDOI
TL;DR: The Roofline model offers insight on how to improve the performance of software and hardware in the rapidly changing world of connected devices.
Abstract: The Roofline model offers insight on how to improve the performance of software and hardware.

2,181 citations

Proceedings ArticleDOI
24 Feb 2014
TL;DR: This study designs an accelerator for large-scale CNNs and DNNs, with a special emphasis on the impact of memory on accelerator design, performance and energy, and shows that it is possible to design an accelerator with a high throughput, capable of performing 452 GOP/s in a small footprint.
Abstract: Machine-Learning tasks are becoming pervasive in a broad range of domains, and in a broad range of systems (from embedded systems to data centers). At the same time, a small set of machine-learning algorithms (especially Convolutional and Deep Neural Networks, i.e., CNNs and DNNs) are proving to be state-of-the-art across many applications. As architectures evolve towards heterogeneous multi-cores composed of a mix of cores and accelerators, a machine-learning accelerator can achieve the rare combination of efficiency (due to the small number of target algorithms) and broad application scope. Until now, most machine-learning accelerator designs have focused on efficiently implementing the computational part of the algorithms. However, recent state-of-the-art CNNs and DNNs are characterized by their large size. In this study, we design an accelerator for large-scale CNNs and DNNs, with a special emphasis on the impact of memory on accelerator design, performance and energy. We show that it is possible to design an accelerator with a high throughput, capable of performing 452 GOP/s (key NN operations such as synaptic weight multiplications and neurons outputs additions) in a small footprint of 3.02 mm² and 485 mW; compared to a 128-bit 2GHz SIMD processor, the accelerator is 117.87x faster, and it can reduce the total energy by 21.08x. The accelerator characteristics are obtained after layout at 65 nm. Such a high throughput in a small footprint can open up the usage of state-of-the-art machine-learning algorithms in a broad set of systems and for a broad set of applications.

1,582 citations

Journal ArticleDOI
TL;DR: A comprehensive study that projects the speedup potential of future multicores and examines the underutilization of integration capacity (dark silicon) is timely and crucial.
Abstract: A key question for the microprocessor research and design community is whether scaling multicores will provide the performance and value needed to scale down many more technology generations. To provide a quantitative answer to this question, a comprehensive study that projects the speedup potential of future multicores and examines the underutilization of integration capacity (dark silicon) is timely and crucial.

1,556 citations