
Showing papers by "Sameer Shende published in 2013"


Proceedings ArticleDOI
10 Jun 2013
TL;DR: The DOE-funded XPRESS project and the role of autonomic performance support in Exascale systems are described and results are presented that highlight the challenges of highly integrative observation and runtime analysis.
Abstract: Extreme-scale computing requires a new perspective on the role of performance observation in the Exascale system software stack. Because of the anticipated high concurrency and dynamic operation in these systems, it is no longer reasonable to expect that a post-mortem performance measurement and analysis methodology will suffice. Rather, there is a strong need for performance observation that merges first- and third-person observation, in situ analysis, and introspection across stack layers, and that serves online dynamic feedback and adaptation. In this paper we describe the DOE-funded XPRESS project and the role of autonomic performance support in Exascale systems. XPRESS will build an integrated Exascale software stack (called OpenX) that supports the ParalleX execution model and is targeted towards future Exascale platforms. An initial version of an autonomic performance environment called APEX has been developed for OpenX using the current TAU performance technology, and results are presented that highlight the challenges of highly integrative observation and runtime analysis.
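The contrast the abstract draws, in situ observation with online feedback versus post-mortem analysis, can be illustrated with a toy monitor. This is a generic sketch in Python, not the APEX or TAU API: the `InSituMonitor` class and its methods are hypothetical names invented for illustration.

```python
import time
from collections import defaultdict

class InSituMonitor:
    """Toy in-situ performance monitor (illustrative only): records
    per-region timings during execution so the runtime can query them
    online, rather than waiting for a post-mortem trace."""
    def __init__(self):
        self.samples = defaultdict(list)

    def region(self, name):
        # context manager that times one named code region
        return _Timer(self, name)

    def record(self, name, elapsed):
        self.samples[name].append(elapsed)

    def mean(self, name):
        s = self.samples[name]
        return sum(s) / len(s) if s else 0.0

class _Timer:
    def __init__(self, monitor, name):
        self.monitor, self.name = monitor, name
    def __enter__(self):
        self.start = time.perf_counter()
        return self
    def __exit__(self, *exc):
        self.monitor.record(self.name, time.perf_counter() - self.start)

mon = InSituMonitor()
for _ in range(3):
    with mon.region("step"):
        sum(range(10000))  # stand-in for a computation phase
# a runtime could now consult mon.mean("step") and adapt while running
```

The key design point is that the measurement data stays resident and queryable during the run, which is what enables the dynamic feedback loop the paper argues Exascale systems will require.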

22 citations


Proceedings ArticleDOI
18 Dec 2013
TL;DR: This paper proposes a language, MIL, for the development of program analysis tools based on static binary instrumentation to ease the integration of static, global program analysis with instrumentation and shows how this enables both a precise targeting of the code regions to analyze and a better understanding of the optimized program behavior.
Abstract: As software complexity increases, the analysis of code behavior during its execution is becoming more important. Instrumentation techniques, through the insertion of code directly into binaries, are essential for program analyses used in debugging, runtime profiling, and performance evaluation. In the context of high-performance parallel applications, building an instrumentation framework is quite challenging. One difficulty is the need to capture both coarse-grain behavior, such as the execution time of different functions, and finer-grain actions, in order to pinpoint performance issues. In this paper, we propose a language, MIL, for the development of program analysis tools based on static binary instrumentation. The key feature of MIL is to ease the integration of static, global program analysis with instrumentation. We show how this enables both a precise targeting of the code regions to analyze and a better understanding of the optimized program behavior.

18 citations


Proceedings ArticleDOI
10 Jun 2013
TL;DR: A set of static and dynamic scheduling algorithms for block-sparse tensor contractions within the NWChem computational chemistry code for different degrees of sparsity (and therefore load imbalance) are explored.
Abstract: Developing effective yet scalable load-balancing methods for irregular computations is critical to the successful application of simulations in a variety of disciplines at petascale and beyond. This paper explores a set of static and dynamic scheduling algorithms for block-sparse tensor contractions within the NWChem computational chemistry code for different degrees of sparsity (and therefore load imbalance). In this particular application, a relatively large amount of task information can be obtained at minimal cost, which enables the use of static partitioning techniques that take the entire task list as input. However, fully static partitioning is incapable of dealing with dynamic variation of task costs, such as from transient network contention or operating system noise, so we also consider hybrid schemes that utilize dynamic scheduling within subgroups. These two schemes, which have not been previously implemented in NWChem or its proxies (i.e. quantum chemistry mini-apps), are compared to the original centralized dynamic load-balancing algorithm as well as an improved centralized scheme. In all cases, we separate the scheduling of tasks from the execution of tasks into an inspector phase and an executor phase. The impact of these methods upon the application is substantial on a large InfiniBand cluster: execution time is reduced by as much as 50% at scale. The technique is applicable to any scientific application requiring load balance where performance models or estimations of kernel execution times are available.
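The inspector/executor split described above can be sketched in a few lines. This is a minimal Python illustration, not NWChem code: the inspector statically partitions tasks across workers with greedy longest-processing-time (LPT) scheduling, a standard static-partitioning heuristic chosen here for illustration, using the cheaply obtained cost estimates; the executor then runs each worker's preassigned list.

```python
import heapq

def inspect(tasks, costs, nworkers):
    """Inspector phase: build a static per-worker plan with greedy LPT,
    assigning the most expensive tasks first to the least-loaded worker."""
    heap = [(0.0, w) for w in range(nworkers)]  # (accumulated cost, worker)
    heapq.heapify(heap)
    plan = [[] for _ in range(nworkers)]
    for t in sorted(tasks, key=lambda t: -costs[t]):
        load, w = heapq.heappop(heap)
        plan[w].append(t)
        heapq.heappush(heap, (load + costs[t], w))
    return plan

def execute(plan, run):
    """Executor phase: each worker simply runs its preassigned tasks."""
    return [[run(t) for t in part] for part in plan]

# example: eight "blocks" with skewed cost estimates, two workers
costs = {f"block{i}": c for i, c in enumerate([9, 7, 5, 3, 2, 2, 1, 1])}
plan = inspect(list(costs), costs, nworkers=2)
loads = [sum(costs[t] for t in part) for part in plan]  # balanced loads
```

Because all costs are known up front, the inspector pays its scheduling cost once; the hybrid schemes in the paper additionally allow dynamic stealing within subgroups to absorb transient cost variation that no static plan can anticipate.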

15 citations


Proceedings ArticleDOI
01 Oct 2013
TL;DR: This paper explores a set of static and dynamic scheduling algorithms for block-sparse tensor contractions within the NWChem computational chemistry code for different degrees of sparsity (and therefore load imbalance) in order to develop effective yet scalable load-balancing methods for irregular computations.
Abstract: Developing effective yet scalable load-balancing methods for irregular computations is critical to the successful application of simulations in a variety of disciplines at petascale and beyond. This paper explores a set of static and dynamic scheduling algorithms for block-sparse tensor contractions within the NWChem computational chemistry code for different degrees of sparsity (and therefore load imbalance). In this particular application, a relatively large amount of task information can be obtained at minimal cost, which enables the use of static partitioning techniques that take the entire task list as input. However, fully static partitioning is incapable of dealing with dynamic variation of task costs, such as from transient network contention or operating system noise, so we also consider hybrid schemes that utilize dynamic scheduling within subgroups. These two schemes, which have not been previously implemented in NWChem or its proxies (i.e. quantum chemistry mini-apps), are compared to the original centralized dynamic load-balancing algorithm as well as an improved centralized scheme. In all cases, we separate the scheduling of tasks from the execution of tasks into an inspector phase and an executor phase. The impact of these methods upon the application is substantial on a large InfiniBand cluster: execution time is reduced by as much as 50% at scale. The technique is applicable to any scientific application requiring load balance where performance models or estimations of kernel execution times are available.

11 citations


Proceedings ArticleDOI
17 Nov 2013
TL;DR: This paper summarizes a strategy for parallelizing a legacy Fortran 77 program using the object-oriented (OO) and coarray features that entered Fortran in the 2003 and 2008 standards, respectively, and studies the resulting performance.
Abstract: This paper summarizes a strategy for parallelizing a legacy Fortran 77 program using the object-oriented (OO) and coarray features that entered Fortran in the 2003 and 2008 standards, respectively. OO programming (OOP) facilitates the construction of an extensible suite of model-verification and performance tests that drive the development. Coarray parallel programming facilitates a rapid evolution from a serial application to a parallel application capable of running on multi-core processors and many-core accelerators in shared and distributed memory. We delineate 17 code modernization steps used to refactor and parallelize the program, and study the resulting performance. Our scaling studies show that the performance bottleneck was the implementation of the collective sum procedure. Replacing the sequential procedure with a binary tree procedure improved the scaling performance of the program. This bottleneck will be resolved in the future by new collective procedures in Fortran 2015.
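The sequential-versus-tree distinction behind the bottleneck fix can be shown with a small sketch. This is a generic Python illustration of a binary-tree reduction, not the paper's Fortran coarray code: pairwise partial sums are combined in ceil(log2(n)) rounds instead of n-1 strictly sequential additions, so within each round the pair combinations are independent and could proceed in parallel across images.

```python
def tree_sum(values):
    """Binary-tree reduction: repeatedly combine adjacent pairs until
    one value remains. For n inputs this takes ceil(log2(n)) rounds,
    versus n-1 dependent steps for a sequential accumulation."""
    vals = list(values)
    while len(vals) > 1:
        paired = [vals[i] + vals[i + 1] for i in range(0, len(vals) - 1, 2)]
        if len(vals) % 2:          # odd leftover carries to the next round
            paired.append(vals[-1])
        vals = paired
    return vals[0] if vals else 0

# eight contributions combine in 3 rounds rather than 7 sequential adds
total = tree_sum(range(8))
```

The Fortran 2018 standard (developed under the working name Fortran 2015) provides this pattern as the intrinsic collective CO_SUM, which is the resolution the abstract anticipates.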

9 citations