Journal ArticleDOI

The design of ultra scalable MPI collective communication on the K computer

TL;DR: Evaluation on up to 55,296 nodes of the K computer shows that the new MPI collective implementation outperforms the existing one for long messages by a factor of 4 to 11 and that the short-message algorithms complement the long-message ones.
Abstract
This paper proposes the design of ultra-scalable MPI collective communication for the K computer, which consists of 82,944 computing nodes and is the world's first system to exceed 10 PFLOPS. The nodes are connected by the Tofu interconnect, which introduces a six-dimensional mesh/torus topology. Existing MPI libraries, however, perform poorly on such a direct network because they assume typical cluster environments. We therefore design collective algorithms optimized for the K computer, prioritizing collision-freeness for long messages and low latency for short messages. The long-message algorithms use multiple RDMA network interfaces and rely on neighbor communication in order to achieve high bandwidth and avoid message collisions. The short-message algorithms, on the other hand, are designed to reduce software overhead, which grows with the number of relaying nodes. Evaluation on up to 55,296 nodes of the K computer shows that the new implementation outperforms the existing one for long messages by a factor of 4 to 11. The results also show that the short-message algorithms complement the long-message ones.
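The paper itself targets Tofu's six-dimensional torus and its RDMA engines, but the core long-message idea, building a collective out of neighbor-only exchanges so that no two messages contend for the same link, can be illustrated on the simplest torus: a ring. The sketch below is a hypothetical plain-Python simulation (not the paper's implementation) of a ring allreduce built from a reduce-scatter phase followed by an allgather phase; the function name and chunk schedule are illustrative choices, not taken from the paper.

```python
# Hypothetical sketch: a ring allreduce simulated in plain Python.
# Long-message allreduce is commonly composed of a reduce-scatter
# followed by an allgather, where every step exchanges one chunk with
# an immediate ring neighbor -- the collision-free, bandwidth-friendly
# pattern the abstract describes for torus networks.

def ring_allreduce(buffers):
    """Simulate an allreduce (sum) over `buffers`, one list per node.

    Each of the p nodes owns a vector split into p chunks. In step s of
    reduce-scatter, node r sends chunk (r - s) mod p to node (r + 1) mod p;
    after p - 1 steps, node r holds the fully reduced chunk (r + 1) mod p.
    The allgather phase then circulates the reduced chunks the same way.
    """
    p = len(buffers)
    n = len(buffers[0])
    assert n % p == 0, "vector length must divide evenly into p chunks"
    chunk = n // p
    data = [list(b) for b in buffers]  # working copy, one buffer per node

    def chunk_slice(i):
        return slice(i * chunk, (i + 1) * chunk)

    # Reduce-scatter: each step, every node passes one chunk to its
    # right-hand neighbor, which accumulates it into its own buffer.
    # Sends are snapshotted first to model a synchronous exchange.
    for step in range(p - 1):
        sends = []
        for r in range(p):
            i = (r - step) % p
            sends.append((r, i, data[r][chunk_slice(i)]))
        for r, i, payload in sends:
            dst = (r + 1) % p  # neighbor-only communication
            s = chunk_slice(i)
            data[dst][s] = [a + b for a, b in zip(data[dst][s], payload)]

    # Allgather: circulate the fully reduced chunks around the ring,
    # overwriting instead of accumulating.
    for step in range(p - 1):
        sends = []
        for r in range(p):
            i = (r + 1 - step) % p
            sends.append((r, i, data[r][chunk_slice(i)]))
        for r, i, payload in sends:
            dst = (r + 1) % p  # again, no node talks to a non-neighbor
            data[dst][chunk_slice(i)] = payload

    return data
```

Because every message travels exactly one hop to a distinct neighbor, no two messages ever share a link in the same step, which is the collision-freeness property; the real algorithms extend this idea across the six torus dimensions and multiple RDMA interfaces.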


Citations
Journal ArticleDOI

Performance evaluation of ultra-large-scale first-principles electronic structure calculation code on the K computer

TL;DR: A real-space density functional theory code performs first-principles electronic structure calculations and enables analysis of the behavior of a silicon nanowire with a diameter of 10–20 nm in large-scale simulations.
Journal ArticleDOI

Multi-axis decomposition of density functional program for strong scaling up to 82,944 nodes on the K computer: Compactly folded 3D-FFT communicators in the 6D torus network

TL;DR: A multi-axis decomposition scheme is developed in which both the G-vector and band axes are decomposed and the 3D-FFT communicators are folded compactly; it keeps the innermost do-loop lengths sufficiently long and restrains the growth of MPI communication costs as the number of nodes increases.
Book ChapterDOI

Scalable Hierarchical Aggregation and Reduction Protocol (SHARP)TM Streaming-Aggregation Hardware Design and Evaluation

TL;DR: The new hardware-based streaming-aggregation capability added to Mellanox’s Scalable Hierarchical Aggregation and Reduction Protocol in its HDR InfiniBand switches is described, along with the range of data sizes for which streaming aggregation performs better than the low-latency aggregation algorithm.
Proceedings ArticleDOI

MPI Sessions: Leveraging Runtime Infrastructure to Increase Scalability of Applications at Exascale

TL;DR: This paper offers a new approach, extending MPI with a new concept called Sessions, which makes two key contributions: a tighter integration with the underlying runtime system and a scalable route to communication groups.
References
Proceedings ArticleDOI

Optimization of MPI collective communication on BlueGene/L systems

TL;DR: This paper discusses the implementation of machine-optimized MPI collectives on BlueGene/L, describing the algorithms and presenting performance results measured with targeted micro-benchmarks on real Blue Gene/L hardware with up to 4096 compute nodes.
Journal ArticleDOI

Tofu: A 6D Mesh/Torus Interconnect for Exascale Computers

TL;DR: A new architecture with a six-dimensional mesh/torus topology achieves highly scalable and fault-tolerant interconnection networks for large-scale supercomputers that can exceed 10 petaflops.
Book ChapterDOI

Optimization of Collective Reduction Operations

TL;DR: Five algorithms optimized for different choices of vector size and number of processes are presented, focusing on bandwidth-dominated protocols for power-of-two and non-power-of-two numbers of processes and optimizing the load balance in communication and computation.
Journal ArticleDOI

Data communication in parallel architectures

TL;DR: The most common data exchange operations in parallel numerical methods are examined and different methods for performing them efficiently on each of several ensemble architectures are proposed and analyzed.
Proceedings ArticleDOI

Open MPI: A High-Performance, Heterogeneous MPI

TL;DR: This work describes Open MPI's architecture for heterogeneous network and processor support, and demonstrates the transparency to the application developer while maintaining very high levels of performance.