Showing papers by "Robert E. Walkup" published in 2005


Journal ArticleDOI
TL;DR: A heuristic map is implemented that attempts to sequentially map a domain and its communication neighbors either to the same BG/L node or to near-neighbor nodes on the BG/L torus, while keeping the number of domains mapped to a BG/L node constant.
Abstract: A general method for optimizing problem layout on the Blue Gene®/L (BG/L) supercomputer is described. The method takes as input the communication matrix of an arbitrary problem as an array with entries C(i, j), which represents the data communicated from domain i to domain j. Given C(i, j), we implement a heuristic map that attempts to sequentially map a domain and its communication neighbors either to the same BG/L node or to near-neighbor nodes on the BG/L torus, while keeping the number of domains mapped to a BG/L node constant. We then generate a Markov chain of maps using Monte Carlo simulation with free energy $F = \sum_{i,j} C(i, j)\,H(i, j)$, where H(i, j) is the smallest number of hops on the BG/L torus between domain i and domain j. For two large parallel applications, SAGE and UMT2000, the method was tested against the default Message Passing Interface rank order layout on up to 2,048 BG/L nodes. It produced maps that improved communication efficiency by up to 45%.
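The cost function and the Markov chain of maps described in this abstract can be illustrated with a minimal sketch. It is not the authors' code: the function names, the swap-based proposal, the Metropolis acceptance rule, and the fixed temperature are all assumptions made for illustration, since the abstract only states that Monte Carlo simulation is used with free energy F. Swapping the nodes of two domains is chosen here because it keeps the number of domains per node constant.

```python
import math
import random

def torus_hops(a, b, dims):
    """Smallest number of hops between node coordinates a and b on a torus (wraparound links)."""
    return sum(min(abs(x - y), d - abs(x - y)) for x, y, d in zip(a, b, dims))

def free_energy(C, placement, dims):
    """F = sum over i, j of C[i][j] * H(i, j); placement[i] is the node coordinate of domain i."""
    n = len(C)
    return sum(C[i][j] * torus_hops(placement[i], placement[j], dims)
               for i in range(n) for j in range(n) if C[i][j])

def anneal(C, placement, dims, steps=2000, temperature=1.0, seed=0):
    """Assumed Metropolis chain: swap two domains, accept with probability min(1, exp(-dF/T))."""
    rng = random.Random(seed)
    current = list(placement)
    f_cur = free_energy(C, current, dims)
    best, f_best = list(current), f_cur
    for _ in range(steps):
        i, j = rng.sample(range(len(current)), 2)
        current[i], current[j] = current[j], current[i]  # propose a swap
        f_new = free_energy(C, current, dims)
        if f_new <= f_cur or rng.random() < math.exp(-(f_new - f_cur) / temperature):
            f_cur = f_new  # accept the proposed map
            if f_cur < f_best:
                best, f_best = list(current), f_cur
        else:
            current[i], current[j] = current[j], current[i]  # reject: undo the swap
    return best, f_best
```

Calling `anneal(C, start, dims)` with a starting map (for example, the default rank order) returns the lowest-cost map visited and its F value. A practical implementation would update F incrementally after each swap instead of recomputing the full sum.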

80 citations


Journal ArticleDOI
TL;DR: This work provides a variety of performance analysis tools for the new Blue Gene®/L supercomputer, and demonstrates their usefulness and applicability with case studies of application optimization.
Abstract: Good performance monitoring is the basis of modern performance analysis tools for application optimization. We are providing a variety of such performance analysis tools for the new Blue Gene®/L supercomputer. Those tools can be divided into two categories: single-node performance tools and multinode performance tools. From a single-node perspective, we provide standard interfaces and libraries, such as PAPI and libHPM, that provide access to the hardware performance counters for applications running on the Blue Gene/L compute nodes. From a multinode perspective, we focus on tools that analyze Message Passing Interface (MPI) behavior. Those tools work by first collecting message-passing trace data when a program runs. The trace data is then used by graphical interface tools that analyze the behavior of applications. Using the current prototype tools, we demonstrate their usefulness and applicability with case studies of application optimization.

10 citations


Patent
06 Oct 2005
TL;DR: In this paper, a general computer-implemented method and apparatus to optimize problem layout on a massively parallel supercomputer is described, which takes as input the communication matrix of an arbitrary problem in the form of an array whose entries C(i, j) are the amount of data communicated from domain i to domain j. Given C(i, j), a heuristic map is first implemented which attempts sequentially to map a domain and its communications neighbors either to the same supercomputer node or to near-neighbor nodes on the supercomputer torus while keeping the
Abstract: A general computer-implemented method and apparatus to optimize problem layout on a massively parallel supercomputer is described. The method takes as input the communication matrix of an arbitrary problem in the form of an array whose entries C(i, j) are the amount of data communicated from domain i to domain j. Given C(i, j), a heuristic map is first implemented which attempts sequentially to map a domain and its communications neighbors either to the same supercomputer node or to near-neighbor nodes on the supercomputer torus while keeping the number of domains mapped to a supercomputer node constant (as much as possible). Next, a Markov chain of maps is generated from the initial map using Monte Carlo simulation with free energy (cost function) $F = \sum_{i,j} C(i, j)\,H(i, j)$, where H(i, j) is the smallest number of hops on the supercomputer torus between domain i and domain j. On the cases tested, it was found that the method produces good mappings and has the potential to be used as a general layout optimization tool for parallel codes. At the moment, the serial code implemented to test the method is unoptimized, so the computation time to find the optimum map can be several hours on a typical PC. A production implementation would require good parallel code for the algorithm, which could itself be run on a supercomputer.
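The hop count H(i, j) in the cost function can be written out explicitly. The following is a sketch under stated assumptions: the torus has D dimensions with extents L_1, ..., L_D, and x_d(i) denotes the coordinate, in dimension d, of the node that hosts domain i. This notation is introduced here for illustration and does not appear in the abstract.

```latex
H(i, j) = \sum_{d=1}^{D} \min\bigl( \lvert x_d(i) - x_d(j) \rvert,\; L_d - \lvert x_d(i) - x_d(j) \rvert \bigr)
```

For example, on an 8 × 8 × 8 torus, domains on nodes (0, 0, 0) and (7, 1, 4) are separated by 1 + 1 + 4 = 6 hops, because the wraparound link makes coordinate 7 adjacent to coordinate 0. Two domains mapped to the same node have H(i, j) = 0 and so contribute nothing to F.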

2 citations