scispace - formally typeset
Search or ask a question
Author

Christian Bienia

Other affiliations: Intel
Bio: Christian Bienia is an academic researcher from Princeton University. The author has contributed to research in topics: Benchmark (computing) & Parsec. The author has an hindex of 8, co-authored 9 publications receiving 4748 citations. Previous affiliations of Christian Bienia include Intel.

Papers
More filters
Proceedings Article•DOI•
25 Oct 2008
TL;DR: This paper presents and characterizes the Princeton Application Repository for Shared-Memory Computers (PARSEC), a benchmark suite for studies of Chip-Multiprocessors (CMPs), and shows that the benchmark suite covers a wide spectrum of working sets, locality, data sharing, synchronization and off-chip traffic.
Abstract: This paper presents and characterizes the Princeton Application Repository for Shared-Memory Computers (PARSEC), a benchmark suite for studies of Chip-Multiprocessors (CMPs). Previous available benchmarks for multiprocessors have focused on high-performance computing applications and used a limited number of synchronization methods. PARSEC includes emerging applications in recognition, mining and synthesis (RMS) as well as systems applications which mimic large-scale multithreaded commercial programs. Our characterization shows that the benchmark suite covers a wide spectrum of working sets, locality, data sharing, synchronization and off-chip traffic. The benchmark suite has been made available to the public.

3,514 citations

Kai Li1, Christian Bienia1•
01 Jan 2011
TL;DR: A methodology to design effective benchmark suites is developed and its effectiveness is demonstrated by developing and deploying a benchmark suite for evaluating multiprocessors called PARSEC, which has been adopted by many architecture groups in both research and industry.
Abstract: Benchmarking has become one of the most important methods for quantitative performance evaluation of processor and computer system designs. Benchmarking of modern multiprocessors such as chip multiprocessors is challenging because of their application domain, scalability and parallelism requirements. In my thesis, I have developed a methodology to design effective benchmark suites and demonstrated its effectiveness by developing and deploying a benchmark suite for evaluating multiprocessors. More specifically, this thesis includes several contributions. First, the thesis shows that a new benchmark suite for multiprocessors is needed because the behavior of modern parallel programs is significantly different from those represented by SPLASH-2, the most popular parallel benchmark suite developed over ten years ago. Second, the thesis quantitatively describes the requirements and characteristics of a set of multithreaded programs and their underlying technology trends. Third, the thesis presents a systematic approach to scale and select benchmark inputs with the goal of optimizing benchmarking accuracy subject to constrained execution or simulation time. Finally, the thesis describes a parallel benchmark suite called PARSEC for evaluating modern shared-memory multiprocessors. Since its initial release, PARSEC has been adopted by many architecture groups in both research and industry.

1,043 citations

Proceedings Article•DOI•
30 Sep 2008
TL;DR: It is shown that PARSEC workloads are fundamentally different from SPLASH-2 benchmarks, and the observed differences can be explained with two technology trends, the proliferation of CMPs and the accelerating growth of world data.
Abstract: The PARSEC benchmark suite was recently released and has been adopted by a significant number of users within a short amount of time. This new collection of workloads is not yet fully understood by researchers. In this study we compare the SPLASH-2 and PARSEC benchmark suites with each other to gain insights into differences and similarities between the two program collections. We use standard statistical methods and machine learning to analyze the suites for redundancy and overlap on chip-multiprocessors (CMPs). Our analysis shows that PARSEC workloads are fundamentally different from SPLASH-2 benchmarks. The observed differences can be explained with two technology trends, the proliferation of CMPs and the accelerating growth of world data.

239 citations

Journal Article•DOI•
TL;DR: An analysis of Second Life illustrates the demands metaverses applications place on clients, servers, and the network and suggests possible optimizations.
Abstract: Unlike online games, metaverses present a single seamless, persistent world where users can transparently roam around without predefined objectives. An analysis of Second Life illustrates the demands such applications place on clients, servers, and the network and suggests possible optimizations.

175 citations

Journal Article•DOI•
TL;DR: PARSEC3.0 is introduced, a new version of PARSEC suite that implements a user-level network stack and generates three network workloads with this stack to cover network domain and integrates Splash-2 and splash-2x into PARSEC framework so that researchers use these benchmark suite conveniently.
Abstract: Benchmarks play a very important role in accelerating the development and research of CMP. As one of them, the PARSEC suite continues to be updated and revised over and over again so that it can offer better support for researchers. The former versions of PARSEC have enough workloads to evaluate the property of CMP about CPU, cache and memory, but it lacks of applications based on network stack to assess the performance of CMPs in respect of network. In this work, we introduce PARSEC3.0, a new version of PARSEC suite that implements a user-level network stack and generates three network workloads with this stack to cover network domain. We explore the input sets of splash-2 and expand them to multiple scales, a.k.a, splash-2x. We integrate splash-2 and splash-2x into PARSEC framework so that researchers use these benchmark suite conveniently. Finally, we evaluate the u-TCP/IP stack and new network workloads, and analyze the characterizes of splash-2 and splash-2x

50 citations


Cited by
More filters
Proceedings Article•DOI•
25 Oct 2008
TL;DR: This paper presents and characterizes the Princeton Application Repository for Shared-Memory Computers (PARSEC), a benchmark suite for studies of Chip-Multiprocessors (CMPs), and shows that the benchmark suite covers a wide spectrum of working sets, locality, data sharing, synchronization and off-chip traffic.
Abstract: This paper presents and characterizes the Princeton Application Repository for Shared-Memory Computers (PARSEC), a benchmark suite for studies of Chip-Multiprocessors (CMPs). Previous available benchmarks for multiprocessors have focused on high-performance computing applications and used a limited number of synchronization methods. PARSEC includes emerging applications in recognition, mining and synthesis (RMS) as well as systems applications which mimic large-scale multithreaded commercial programs. Our characterization shows that the benchmark suite covers a wide spectrum of working sets, locality, data sharing, synchronization and off-chip traffic. The benchmark suite has been made available to the public.

3,514 citations

Proceedings Article•DOI•
04 Oct 2009
TL;DR: This characterization shows that the Rodinia benchmarks cover a wide range of parallel communication patterns, synchronization techniques and power consumption, and has led to some important architectural insight, such as the growing importance of memory-bandwidth limitations and the consequent importance of data layout.
Abstract: This paper presents and characterizes Rodinia, a benchmark suite for heterogeneous computing. To help architects study emerging platforms such as GPUs (Graphics Processing Units), Rodinia includes applications and kernels which target multi-core CPU and GPU platforms. The choice of applications is inspired by Berkeley's dwarf taxonomy. Our characterization shows that the Rodinia benchmarks cover a wide range of parallel communication patterns, synchronization techniques and power consumption, and has led to some important architectural insight, such as the growing importance of memory-bandwidth limitations and the consequent importance of data layout.

2,697 citations

Proceedings Article•DOI•
12 Dec 2009
TL;DR: Combining power, area, and timing results of McPAT with performance simulation of PARSEC benchmarks at the 22nm technology node for both common in-order and out-of-order manycore designs shows that when die cost is not taken into account clustering 8 cores together gives the best energy-delay product, whereas when cost is taking into account configuring clusters with 4 cores gives thebest EDA2P and EDAP.
Abstract: This paper introduces McPAT, an integrated power, area, and timing modeling framework that supports comprehensive design space exploration for multicore and manycore processor configurations ranging from 90nm to 22nm and beyond. At the microarchitectural level, McPAT includes models for the fundamental components of a chip multiprocessor, including in-order and out-of-order processor cores, networks-on-chip, shared caches, integrated memory controllers, and multiple-domain clocking. At the circuit and technology levels, McPAT supports critical-path timing modeling, area modeling, and dynamic, short-circuit, and leakage power modeling for each of the device types forecast in the ITRS roadmap including bulk CMOS, SOI, and double-gate transistors. McPAT has a flexible XML interface to facilitate its use with many performance simulators. Combined with a performance simulator, McPAT enables architects to consistently quantify the cost of new ideas and assess tradeoffs of different architectures using new metrics like energy-delay-area2 product (EDA2P) and energy-delay-area product (EDAP). This paper explores the interconnect options of future manycore processors by varying the degree of clustering over generations of process technologies. Clustering will bring interesting tradeoffs between area and performance because the interconnects needed to group cores into clusters incur area overhead, but many applications can make good use of them due to synergies of cache sharing. Combining power, area, and timing results of McPAT with performance simulation of PARSEC benchmarks at the 22nm technology node for both common in-order and out-of-order manycore designs shows that when die cost is not taken into account clustering 8 cores together gives the best energy-delay product, whereas when cost is taken into account configuring clusters with 4 cores gives the best EDA2P and EDAP.

2,487 citations

Journal Article•DOI•
TL;DR: The Roofline model offers insight on how to improve the performance of software and hardware in the rapidly changing world of connected devices.
Abstract: The Roofline model offers insight on how to improve the performance of software and hardware.

2,181 citations

Proceedings Article•DOI•
24 Feb 2014
TL;DR: This study designs an accelerator for large-scale CNNs and DNNs, with a special emphasis on the impact of memory on accelerator design, performance and energy, and shows that it is possible to design an accelerator with a high throughput, capable of performing 452 GOP/s in a small footprint.
Abstract: Machine-Learning tasks are becoming pervasive in a broad range of domains, and in a broad range of systems (from embedded systems to data centers). At the same time, a small set of machine-learning algorithms (especially Convolutional and Deep Neural Networks, i.e., CNNs and DNNs) are proving to be state-of-the-art across many applications. As architectures evolve towards heterogeneous multi-cores composed of a mix of cores and accelerators, a machine-learning accelerator can achieve the rare combination of efficiency (due to the small number of target algorithms) and broad application scope. Until now, most machine-learning accelerator designs have focused on efficiently implementing the computational part of the algorithms. However, recent state-of-the-art CNNs and DNNs are characterized by their large size. In this study, we design an accelerator for large-scale CNNs and DNNs, with a special emphasis on the impact of memory on accelerator design, performance and energy. We show that it is possible to design an accelerator with a high throughput, capable of performing 452 GOP/s (key NN operations such as synaptic weight multiplications and neurons outputs additions) in a small footprint of 3.02 mm2 and 485 mW; compared to a 128-bit 2GHz SIMD processor, the accelerator is 117.87x faster, and it can reduce the total energy by 21.08x. The accelerator characteristics are obtained after layout at 65 nm. Such a high throughput in a small footprint can open up the usage of state-of-the-art machine-learning algorithms in a broad set of systems and for a broad set of applications.

1,582 citations