
Showing papers by "Sameer Shende" published in 2019


Proceedings ArticleDOI
01 Nov 2019
TL;DR: TAU uses the Kokkos profiling system to expose performance factors under user-specified parallel kernel names for lambda functions and C++ functors, highlighting performance problems across multiple layers of the runtime system without modifying the application binary.
Abstract: The TAU Performance System® provides a multi-level instrumentation strategy for instrumentation of Kokkos applications. Kokkos provides a performance portable API for expressing parallelism at the node level. TAU uses the Kokkos profiling system to expose performance factors using user-specified parallel kernel names for lambda functions or C++ functors. It can also use instrumentation at the OpenMP, CUDA, pthread, or other runtime levels to expose the implementation details giving a dual focus of higher-level abstractions as well as low-level execution dynamics. This multi-level instrumentation strategy adopted by TAU can highlight performance problems across multiple layers of the runtime system without modifying the application binary.

6 citations
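
A minimal sketch of the user-specified kernel names the abstract refers to, using the standard Kokkos API: the string label attached to each parallel construct is what a Kokkos-aware profiler such as TAU picks up through the Kokkos profiling hooks. How the tool library gets loaded (for example via the Kokkos tools environment variable or tau_exec) is an assumption about the workflow, not something taken from the paper.

```cpp
#include <Kokkos_Core.hpp>
#include <cstdio>

int main(int argc, char* argv[]) {
  Kokkos::initialize(argc, argv);
  {
    const int N = 1000000;
    Kokkos::View<double*> x("x", N);

    // The string label is forwarded to any loaded profiling tool through the
    // Kokkos profiling hooks, so this lambda shows up under a readable name
    // instead of a mangled functor/lambda type.
    Kokkos::parallel_for("initialize_x", N, KOKKOS_LAMBDA(const int i) {
      x(i) = 2.0 * i;
    });

    double sum = 0.0;
    Kokkos::parallel_reduce("sum_x", N, KOKKOS_LAMBDA(const int i, double& partial) {
      partial += x(i);
    }, sum);

    std::printf("sum = %f\n", sum);
  }
  Kokkos::finalize();
  return 0;
}
```

Without these labels, the same kernels would be reported under compiler-generated functor or lambda type names, which is exactly what the user-specified naming in the abstract avoids.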


Journal ArticleDOI
01 Jul 2019
TL;DR: The work done in the MPI runtime to enable both transparent and application-level checkpointing mechanisms is described, and it is shown how existing checkpointing methods can be practically applied to a thread-based MPI implementation given sufficient runtime collaboration.
Abstract: Fault-tolerance has always been an important topic when it comes to running massively parallel programs at scale. Statistically, hardware and software failures are expected to occur more often on systems gathering millions of computing units. Moreover, the larger jobs are, the more computing hours would be wasted by a crash. In this paper, we describe the work done in our MPI runtime to enable both transparent and application-level checkpointing mechanisms. Unlike the MPI 4.0 User-Level Failure Mitigation (ULFM) interface, our work targets solely Checkpoint/Restart and ignores other features such as resiliency. We show how existing checkpointing methods can be practically applied to a thread-based MPI implementation given sufficient runtime collaboration. The two main contributions are the preservation of high-speed network performance during transparent C/R and the over-subscription of checkpoint data replication thanks to dedicated user-level scheduler support. These techniques are measured on MPI benchmarks such as IMB, Lulesh and Heatdis, and associated overhead and trade-offs are discussed.

6 citations
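
For the application-level path, the general pattern is that the program saves and restores its own state at well-defined points, with the MPI runtime cooperating so those points are globally consistent. The sketch below shows a generic coordinated checkpoint/restart loop of that kind; it is not the interface of the authors' runtime, and the per-rank file names, checkpoint interval, and barrier-based coordination are illustrative assumptions.

```cpp
#include <mpi.h>
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical per-rank checkpoint file name (illustrative only).
static std::string ckpt_name(int rank) {
  char buf[64];
  std::snprintf(buf, sizeof(buf), "ckpt_rank%04d.bin", rank);
  return std::string(buf);
}

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rank = 0;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  const int total_steps = 1000;
  const int ckpt_interval = 100;          // checkpoint every 100 iterations
  std::vector<double> state(1 << 20, 0.0);
  int start_step = 0;

  // Restart path: if a checkpoint exists, reload the local state and the
  // iteration counter before resuming the main loop.
  if (FILE* f = std::fopen(ckpt_name(rank).c_str(), "rb")) {
    std::fread(&start_step, sizeof(start_step), 1, f);
    std::fread(state.data(), sizeof(double), state.size(), f);
    std::fclose(f);
  }

  for (int step = start_step; step < total_steps; ++step) {
    // ... application work and communication for this step ...

    if ((step + 1) % ckpt_interval == 0) {
      // Reach a globally consistent point: in this simple pattern no
      // messages are in flight once every rank has entered the barrier.
      MPI_Barrier(MPI_COMM_WORLD);
      FILE* f = std::fopen(ckpt_name(rank).c_str(), "wb");
      int next = step + 1;
      std::fwrite(&next, sizeof(next), 1, f);
      std::fwrite(state.data(), sizeof(double), state.size(), f);
      std::fclose(f);
    }
  }

  MPI_Finalize();
  return 0;
}
```

The transparent mechanism described in the abstract instead captures process state without such application hooks, which is why preserving high-speed network performance across a checkpoint is called out as one of the paper's contributions.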


Proceedings ArticleDOI
05 Aug 2019
TL;DR: Results from larger-scale experiments show that efficiency and robustness can be maintained while offering new flexibility and programmability that leverages the power of the core TAU system and allows significant and compelling extensions to be realized.
Abstract: Several robust performance systems have been created for parallel machines with the ability to observe diverse aspects of application execution on different hardware platforms. All of these are designed with the objective to support measurement methods that are efficient, portable, and scalable. For these reasons, the performance measurement infrastructure is tightly embedded with the application code and runtime execution environment. As parallel software and systems evolve, especially towards more heterogeneous, asynchronous, and dynamic operation, it is expected that the requirements for performance observation and awareness will change. For instance, heterogeneous machines introduce new types of performance data to capture and performance behaviors to characterize. Furthermore, there is a growing interest in interacting with the performance infrastructure for in situ analytics and policy-based control. The problem is that an existing performance system architecture could be constrained in its ability to evolve to meet these new requirements. The paper reports our research efforts to address this concern in the context of the TAU Performance System. In particular, we consider the use of a powerful plugin model to both capture existing capabilities in TAU and to extend its functionality in ways it was not necessarily conceived originally. The TAU plugin architecture supports three types of plugin paradigms: EVENT, TRIGGER, and AGENT. We demonstrate how each operates under several different scenarios. Results from larger-scale experiments are shown to highlight the fact that efficiency and robustness can be maintained, while new flexibility and programmability can be offered that leverages the power of the core TAU system while allowing significant and compelling extensions to be realized.

4 citations
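
To make the three plugin paradigms concrete, the sketch below mocks up what an EVENT, a TRIGGER, and an AGENT plugin each respond to: events pushed by the measurement system, triggers fired explicitly by the application or runtime, and a self-driven activity running alongside the application. The struct and function names here are hypothetical illustrations only, not TAU's actual plugin API or header.

```cpp
#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>

// EVENT: the measurement system pushes events (timer entry/exit, sampling,
// phase changes, ...) into the plugin as they occur.
struct EventPlugin {
  void on_function_entry(const char* name) { std::printf("[event] enter %s\n", name); }
  void on_function_exit(const char* name)  { std::printf("[event] exit  %s\n", name); }
};

// TRIGGER: the application or runtime explicitly fires a named trigger
// (e.g. at the end of a simulation iteration) and the plugin reacts,
// for instance by dumping or analyzing the current profile.
struct TriggerPlugin {
  void on_trigger(const char* trigger_name) {
    std::printf("[trigger] %s fired: snapshot profile here\n", trigger_name);
  }
};

// AGENT: the plugin owns its own activity (a background thread here) that
// runs alongside the application, e.g. serving performance data to an
// external in situ analytics or policy-control client.
struct AgentPlugin {
  std::atomic<bool> running{false};
  std::thread worker;
  void start() {
    running = true;
    worker = std::thread([this] {
      while (running) std::this_thread::sleep_for(std::chrono::milliseconds(100));
    });
  }
  void stop() { running = false; worker.join(); }
};

int main() {
  EventPlugin ev;  TriggerPlugin tr;  AgentPlugin ag;
  ag.start();
  ev.on_function_entry("solve");
  tr.on_trigger("end_of_iteration");
  ev.on_function_exit("solve");
  ag.stop();
  return 0;
}
```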


Journal ArticleDOI
TL;DR: In this paper, the authors describe the work done in their MPI runtime to enable both transparent and application-level checkpointing mechanisms, and show how existing checkpointing methods can be practically applied to a thread-based MPI implementation given sufficient runtime collaboration.
Abstract: Fault-tolerance has always been an important topic when it comes to running massively parallel programs at scale. Statistically, hardware and software failures are expected to occur more often on systems gathering millions of computing units. Moreover, the larger jobs are, the more computing hours would be wasted by a crash. In this paper, we describe the work done in our MPI runtime to enable both transparent and application-level checkpointing mechanisms. Unlike the MPI 4.0 User-Level Failure Mitigation (ULFM) interface, our work targets solely Checkpoint/Restart and ignores other features such as resiliency. We show how existing checkpointing methods can be practically applied to a thread-based MPI implementation given sufficient runtime collaboration. The two main contributions are the preservation of high-speed network performance during transparent C/R and the over-subscription of checkpoint data replication thanks to dedicated user-level scheduler support. These techniques are measured on MPI benchmarks such as IMB, Lulesh and Heatdis, and associated overhead and trade-offs are discussed.

3 citations


Proceedings ArticleDOI
11 Sep 2019
TL;DR: This paper outlines how runtime-level work-sharing through what the authors define as MPI tasks, which can make progress independently of the calling stream's context, could enable optimizations previously tied to specific MPI runtimes that execute MPI processes in shared memory (e.g. thread-based MPI).
Abstract: Since the beginning, MPI has defined the rank as an implicit attribute associated with the MPI process's environment. In particular, each MPI process generally runs inside a given UNIX process and is associated with a fixed identifier in its WORLD communicator. However, this state of affairs is about to change with the rise of new abstractions such as MPI Sessions. In this paper, we outline how such an evolution could enable optimizations which were previously linked to specific MPI runtimes executing MPI processes in shared memory (e.g. thread-based MPI). By implementing runtime-level work-sharing through what we define as MPI tasks, which can make progress independently of the calling stream's context, we show that there is potential for improved asynchronous progress. In the absence of a Sessions implementation, this assumption is validated in the context of a thread-based MPI where nonblocking collectives (NBC) were implemented on top of Extended Generic Requests progressed by any rank on the node, thanks to an MPI extension enabling threads to dynamically share their MPI context.

1 citation
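
The nonblocking collectives mentioned here sit on top of generalized requests. The sketch below uses the standard MPI generalized-request API, where user code is responsible for progress and signals completion with MPI_Grequest_complete; the "Extended Generic Requests" of the abstract additionally register a poll function so the runtime itself (any rank or thread on the node, in the paper's design) can drive progress. The worker thread and empty status fields are illustrative assumptions.

```cpp
#include <mpi.h>
#include <thread>

// Callbacks required by standard MPI generalized requests.
static int query_fn(void* /*extra_state*/, MPI_Status* status) {
  // Fill in the status that MPI_Wait/MPI_Test will report for this request.
  MPI_Status_set_elements(status, MPI_BYTE, 0);
  MPI_Status_set_cancelled(status, 0);
  status->MPI_SOURCE = MPI_UNDEFINED;
  status->MPI_TAG = MPI_UNDEFINED;
  return MPI_SUCCESS;
}
static int free_fn(void* /*extra_state*/)              { return MPI_SUCCESS; }
static int cancel_fn(void* /*extra_state*/, int /*c*/) { return MPI_SUCCESS; }

int main(int argc, char** argv) {
  // MPI_THREAD_MULTIPLE is needed because a second thread calls into MPI.
  int provided = 0;
  MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

  // Create a request whose completion is driven by user-level work rather
  // than by the MPI library itself.
  MPI_Request req;
  MPI_Grequest_start(query_fn, free_fn, cancel_fn, nullptr, &req);

  // With standard generalized requests the user must progress the operation
  // and eventually call MPI_Grequest_complete. The extended variant adds a
  // poll callback so progress can instead happen inside MPI_Wait/MPI_Test.
  std::thread worker([req]() mutable {
    // ... perform the collective's communication/computation here ...
    MPI_Grequest_complete(req);
  });

  MPI_Status status;
  MPI_Wait(&req, &status);   // returns once the worker signals completion
  worker.join();

  MPI_Finalize();
  return 0;
}
```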