Author

Stéphane Zuckerman

Bio: Stéphane Zuckerman is an academic researcher from the University of Delaware. The author has contributed to research on topics including execution models and dataflow. The author has an h-index of 8 and has co-authored 30 publications receiving 304 citations. Previous affiliations of Stéphane Zuckerman include Cergy-Pontoise University and Michigan Technological University.

Papers
Proceedings ArticleDOI
05 Jun 2011
TL;DR: This research is exploring a fine-grain, event-driven model in support of adaptive operation of exascale-class machines, developing a Codelet Program Execution Model which breaks applications into codelets and the dependencies (control and data) between these objects.
Abstract: As computing has moved relentlessly through giga-, tera-, and peta-scale systems, exa-scale (a million trillion operations/sec.) computing is currently under active research. DARPA has recently sponsored the "UHPC" [1] (ubiquitous high-performance computing) program, encouraging partnership with academia and industry to explore such systems. Among the requirements are the development of novel techniques in "self-awareness" in support of performance, energy-efficiency, and resiliency. Trends in processor and system architecture, driven by power and complexity, point us toward very high-core-count designs and extreme software parallelism to solve exascale-class problems. Our research is exploring a fine-grain, event-driven model in support of adaptive operation of these machines. We are developing a Codelet Program Execution Model which breaks applications into codelets (small bits of functionality) and dependencies (control and data) between these objects. It then uses this decomposition to accomplish advanced scheduling, to accommodate code and data motion within the system, and to permit flexible exploitation of parallelism in support of goals for performance and power.
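As a rough illustration of the decomposition described above, the sketch below models a codelet as a small piece of work guarded by a dependency counter: a codelet fires once all of its control/data dependencies have been signaled, then signals its successors. The names (Codelet, signal, fire) are made up for this example and are not the actual UHPC or DARTS API; a real runtime would hand ready codelets to a scheduler rather than run them inline.

// Minimal sketch (hypothetical names, not the actual UHPC/DARTS API):
// a codelet is a small unit of work guarded by a dependency counter.
#include <atomic>
#include <functional>
#include <iostream>
#include <vector>

struct Codelet {
    std::atomic<int> pending;          // unsatisfied control/data dependencies
    std::function<void()> work;        // the "small bit of functionality"
    std::vector<Codelet*> successors;  // codelets that depend on this one

    Codelet(int deps, std::function<void()> w) : pending(deps), work(std::move(w)) {}

    // A predecessor calls this when one dependency is satisfied.
    void signal() { if (pending.fetch_sub(1) == 1) fire(); }

    // All dependencies satisfied: run the work, then signal the successors.
    // A real runtime would enqueue the ready codelet on a scheduler instead.
    void fire() {
        work();
        for (Codelet* s : successors) s->signal();
    }
};

int main() {
    // A tiny diamond-shaped graph: A feeds B and C, which both feed D.
    Codelet d(2, [] { std::cout << "D: combine partial results\n"; });
    Codelet b(1, [] { std::cout << "B: partial work\n"; });
    Codelet c(1, [] { std::cout << "C: partial work\n"; });
    Codelet a(0, [] { std::cout << "A: produce inputs\n"; });
    a.successors = {&b, &c};
    b.successors = {&d};
    c.successors = {&d};
    a.fire();  // A has no dependencies, so it may run immediately
}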

90 citations

Book ChapterDOI
26 Aug 2013
TL;DR: The validity of fine-grain execution as a promising and viable execution model for future and current architectures is explored, and it is shown that the runtime is on par with or better than AMD's highly optimized parallel library for matrix multiplication, outperforming it on average by 1.40×.
Abstract: Chip architectures are shifting from few, faster, functionally heavy cores to abundant, slower, simpler cores to address pressing physical limitations such as energy consumption and heat expenditure. As architectural trends continue to fluctuate, we propose a novel program execution model, the Codelet model, which is designed for new systems tasked with efficiently managing varying resources. The Codelet model is a fine-grained, dataflow-inspired model extended to address the cumbersome resources available in new architectures. In the following, we define the Codelet execution model as well as provide an implementation named DARTS. Utilizing DARTS and two predominant kernels, matrix multiplication and the Graph 500's breadth-first search, we explore the validity of fine-grain execution as a promising and viable execution model for future and current architectures. We show that our runtime is on par with or performs better than AMD's highly optimized parallel library for matrix multiplication, outperforming it on average by 1.40× with a speedup of up to 4×. Our implementation of the parallel BFS outperforms Graph 500's reference implementation (with or without dynamic scheduling) on average by 1.50× with a speedup of up to 2.38×.
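To make the notion of fine-grain execution on such a kernel concrete, the sketch below decomposes a blocked matrix multiplication into independent, tile-sized tasks, one per output tile, so no two tasks write the same elements. It is only an illustration of the decomposition under assumed sizes (N, T) and uses std::async in place of a codelet scheduler; it is neither the DARTS implementation nor the AMD library mentioned above.

// Illustration only: tile-sized tasks for C = A * B (row-major, N x N),
// parallelized over output tiles so the tasks are independent.
#include <future>
#include <vector>

constexpr int N = 256;  // matrix dimension (assumed)
constexpr int T = 64;   // tile size (assumed, must divide N)

using Matrix = std::vector<double>;

// One task: compute the T x T tile of C whose top-left corner is (ti, tj).
void tile_task(const Matrix& A, const Matrix& B, Matrix& C, int ti, int tj) {
    for (int tk = 0; tk < N; tk += T)
        for (int i = ti; i < ti + T; ++i)
            for (int k = tk; k < tk + T; ++k) {
                double a = A[i * N + k];
                for (int j = tj; j < tj + T; ++j)
                    C[i * N + j] += a * B[k * N + j];
            }
}

int main() {
    Matrix A(N * N, 1.0), B(N * N, 2.0), C(N * N, 0.0);
    std::vector<std::future<void>> tasks;
    for (int ti = 0; ti < N; ti += T)
        for (int tj = 0; tj < N; tj += T)
            tasks.push_back(std::async(std::launch::async, tile_task,
                                       std::cref(A), std::cref(B), std::ref(C), ti, tj));
    for (auto& t : tasks) t.get();
    return C[0] == 2.0 * N ? 0 : 1;  // each element is the sum of N products 1*2
}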

61 citations

01 Jan 2011
TL;DR: This research is exploring a fine-grain, event-driven model in support of adaptive operation of exascale-class machines, developing a Codelet Program Execution Model which breaks applications into codelets and the dependencies (control and data) between these objects.
Abstract: As computing has moved relentlessly through giga-, tera-, and peta-scale systems, exa-scale (a million trillion operations/sec.) computing is currently under active research. DARPA has recently sponsored the "UHPC" [1] (ubiquitous high-performance computing) program, encouraging partnership with academia and industry to explore such systems. Among the requirements are the development of novel techniques in "self-awareness" in support of performance, energy-efficiency, and resiliency. Trends in processor and system architecture, driven by power and complexity, point us toward very high-core-count designs and extreme software parallelism to solve exascale-class problems. Our research is exploring a fine-grain, event-driven model in support of adaptive operation of these machines. We are developing a Codelet Program Execution Model which breaks applications into codelets (small bits of functionality) and dependencies (control and data) between these objects. It then uses this decomposition to accomplish advanced scheduling, to accommodate code and data motion within the system, and to permit flexible exploitation of parallelism in support of goals for performance and power.

17 citations

Book ChapterDOI
08 Oct 2009
TL;DR: A new technique, decremental analysis (DECAN), is introduced to iteratively identify the individual instructions responsible for performance bottlenecks; the combined approach helps discover problems related to memory access locality and loop unrolling whose resolution leads to a sequential performance improvement of a factor of 2.
Abstract: Current hardware trends place increasing pressure on programmers and tools to optimize scientific code. Numerous tools and techniques exist, but no single tool is a panacea; instead, different tools have different strengths. Therefore, an assortment of performance tuning utilities and strategies is necessary to best utilize scarce resources (e.g., bandwidth, functional units, cache). This paper describes a combined methodology for the optimization process. The strategy combines static assembly analysis using MAQAO with dynamic information from hardware performance monitoring (HPM) and memory traces. We introduce a new technique, decremental analysis (DECAN), to iteratively identify the individual instructions responsible for performance bottlenecks. We present case studies on applications from several independent software vendors (ISVs) on an SMP Xeon Core 2 platform. These strategies help discover problems related to memory access locality and loop unrolling whose resolution leads to a sequential performance improvement of a factor of 2.
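The decremental idea can be illustrated at source level with a toy experiment: time a loop, then time a variant with the suspected-bottleneck operations removed, and attribute the difference to those operations. DECAN itself works by patching instructions (e.g., memory accesses) directly in the binary; the sketch below, with made-up kernels, only mirrors that reasoning.

// Toy source-level analogue of decremental analysis (not the DECAN tool):
// compare a loop that streams through memory with an arithmetically
// identical loop whose loads have been removed, and attribute the gap
// to the memory accesses.
#include <chrono>
#include <iostream>
#include <vector>

using clk = std::chrono::steady_clock;

double with_loads(const std::vector<double>& x) {
    double s = 0.0;
    for (double v : x) s += v * 1.000001;                    // load + multiply + add
    return s;
}

double loads_removed(std::size_t n) {
    double s = 0.0, v = 1.0;
    for (std::size_t i = 0; i < n; ++i) s += v * 1.000001;   // multiply + add only
    return s;
}

int main() {
    std::vector<double> x(1 << 24, 1.0);   // ~128 MiB of doubles

    auto t0 = clk::now();
    double s1 = with_loads(x);
    auto t1 = clk::now();
    double s2 = loads_removed(x.size());
    auto t2 = clk::now();

    auto ms = [](clk::time_point a, clk::time_point b) {
        return std::chrono::duration<double, std::milli>(b - a).count();
    };
    std::cout << "with loads:    " << ms(t0, t1) << " ms (sum " << s1 << ")\n"
              << "loads removed: " << ms(t1, t2) << " ms (sum " << s2 << ")\n"
              << "difference attributed to memory accesses: "
              << ms(t0, t1) - ms(t1, t2) << " ms\n";
}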

12 citations


Cited by
Proceedings ArticleDOI
13 May 2013
TL;DR: Swift/T is composed of several enabling technologies to address scalability challenges, offers a high-level optimizing compiler for user programming and debugging, and provides tools for binding user code in C/C++/Fortran into a logical script.
Abstract: Many scientific applications are conceptually built up from independent component tasks as a parameter study, optimization, or other search. Large batches of these tasks may be executed on high-end computing systems; however, the coordination of the independent processes, their data, and their data dependencies is a significant scalability challenge. Many problems must be addressed, including load balancing, data distribution, notifications, concurrent programming, and linking to existing codes. In this work, we present Swift/T, a programming language and runtime that enables the rapid development of highly concurrent, task-parallel applications. Swift/T is composed of several enabling technologies to address scalability challenges, offers a high-level optimizing compiler for user programming and debugging, and provides tools for binding user code in C/C++/Fortran into a logical script. In this work, we describe the Swift/T solution and present scaling results from the IBM Blue Gene/P and Blue Gene/Q.
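The workload Swift/T targets, large batches of independent component tasks, can be sketched in plain C++: the example below runs a made-up parameter study with std::async and gathers the results. It is not Swift/T code and says nothing about its actual language or runtime; it only shows the shape of the coordination problem.

// A made-up parameter study: many independent runs of a stand-in
// "simulation", launched as separate tasks and then gathered.
// This illustrates the kind of workload described above, not Swift/T itself.
#include <cmath>
#include <future>
#include <iostream>
#include <vector>

double simulate(double p) {            // stand-in for an expensive component task
    return std::sin(p) * p;
}

int main() {
    std::vector<double> params;
    for (int i = 0; i < 64; ++i) params.push_back(0.1 * i);

    std::vector<std::future<double>> runs;
    for (double p : params)            // every point is an independent task
        runs.push_back(std::async(std::launch::async, simulate, p));

    double best_p = 0.0, best_v = -1.0e300;
    for (std::size_t i = 0; i < runs.size(); ++i) {   // gather and pick the best run
        double v = runs[i].get();
        if (v > best_v) { best_v = v; best_p = params[i]; }
    }
    std::cout << "best parameter " << best_p << " -> value " << best_v << "\n";
}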

141 citations

Proceedings ArticleDOI
23 Feb 2013
TL;DR: An initial evaluation of Runnemede is presented that shows the design process for the on-chip network, demonstrates 2-4x improvements in memory energy from explicit control of on- chip memory, and illustrates the impact of hardware-software co-design on the energy consumption of a synthetic aperture radar algorithm on the architecture.
Abstract: DARPA's Ubiquitous High-Performance Computing (UHPC) program asked researchers to develop computing systems capable of achieving energy efficiencies of 50 GOPS/Watt, assuming 2018-era fabrication technologies. This paper describes Runnemede, the research architecture developed by the Intel-led UHPC team. Runnemede is being developed through a co-design process that considers the hardware, the runtime/OS, and applications simultaneously. Near-threshold voltage operation, fine-grained power and clock management, and separate execution units for runtime and application code are used to reduce energy consumption. Memory energy is minimized through application-managed on-chip memory and direct physical addressing. A hierarchical on-chip network reduces communication energy, and a codelet-based execution model supports extreme parallelism and fine-grained tasks. We present an initial evaluation of Runnemede that shows the design process for our on-chip network, demonstrates 2-4x improvements in memory energy from explicit control of on-chip memory, and illustrates the impact of hardware-software co-design on the energy consumption of a synthetic aperture radar algorithm on our architecture.

101 citations

Book
01 Jan 2000
TL;DR: This paper discusses how to measure a large computational load globally, using as much architectural detail as needed. Besides the traditional goals of sequential and parallel system performance, these methods are useful for energy optimization.
Abstract: Computer performance improvement embraces many issues, but is severely hampered by existing approaches that examine one or a few topics at a time. Each problem solved leads to another saturation point and serious problem. In the most frustrating cases, solving some problems exacerbates others and achieves no net performance gain. This paper discusses how to measure a large computational load globally, using as much architectural detail as needed. Besides the traditional goals of sequential and parallel system performance, these methods are useful for energy optimization.

76 citations

Book
17 Sep 2008
TL;DR: This book constitutes the refereed proceedings of the 14th International Conference on Parallel Computing, Euro-Par 2008, held in Las Palmas de Gran Canaria, Spain, in August 2008, and presents 86 revised papers.
Abstract: This book constitutes the refereed proceedings of the 14th International Conference on Parallel Computing, Euro-Par 2008, held in Las Palmas de Gran Canaria, Spain, in August 2008. The 86 revised papers presented were carefully reviewed and selected from 264 submissions. The papers are organized in topical sections on support tools and environments; performance prediction and evaluation; scheduling and load balancing; high performance architectures and compilers; parallel and distributed databases; grid and cluster computing; peer-to-peer computing; distributed systems and algorithms; parallel and distributed programming; parallel numerical algorithms; distributed and high-performance multimedia; theory and algorithms for parallel computation; and high performance networks.

65 citations

Book ChapterDOI
26 Aug 2013
TL;DR: The validity of fine-grain execution as a promising and viable execution model for future and current architectures is explored, and it is shown that the runtime is on par with or better than AMD's highly optimized parallel library for matrix multiplication, outperforming it on average by 1.40×.
Abstract: Chip architectures are shifting from few, faster, functionally heavy cores to abundant, slower, simpler cores to address pressing physical limitations such as energy consumption and heat expenditure. As architectural trends continue to fluctuate, we propose a novel program execution model, the Codelet model, which is designed for new systems tasked with efficiently managing varying resources. The Codelet model is a fine-grained, dataflow-inspired model extended to address the cumbersome resources available in new architectures. In the following, we define the Codelet execution model as well as provide an implementation named DARTS. Utilizing DARTS and two predominant kernels, matrix multiplication and the Graph 500's breadth-first search, we explore the validity of fine-grain execution as a promising and viable execution model for future and current architectures. We show that our runtime is on par with or performs better than AMD's highly optimized parallel library for matrix multiplication, outperforming it on average by 1.40× with a speedup of up to 4×. Our implementation of the parallel BFS outperforms Graph 500's reference implementation (with or without dynamic scheduling) on average by 1.50× with a speedup of up to 2.38×.

61 citations