Journal ArticleDOI

The design of ultra scalable MPI collective communication on the K computer

TL;DR: Evaluation on up to 55,296 nodes of the K computer shows that the new MPI collective implementation outperforms the existing one for long messages by a factor of 4 to 11 and that the short-message algorithms complement the long-message ones.
Abstract
This paper proposes the design of ultra-scalable MPI collective communication for the K computer, which consists of 82,944 computing nodes and is the world's first system to exceed 10 PFLOPS. The nodes are connected by the Tofu interconnect, which introduces a six-dimensional mesh/torus topology. Existing MPI libraries, however, perform poorly on such a direct network because they assume typical cluster environments. We therefore design collective algorithms optimized for the K computer, prioritizing collision-freeness for long messages and low latency for short messages. The long-message algorithms use multiple RDMA network interfaces and rely on neighbor communication in order to achieve high bandwidth and avoid message collisions. The short-message algorithms, on the other hand, are designed to reduce software overhead, which grows with the number of relaying nodes. Evaluation on up to 55,296 nodes of the K computer shows that the new implementation outperforms the existing one for long messages by a factor of 4 to 11. The results also show that the short-message algorithms complement the long-message ones.
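The paper itself targets Tofu's six-dimensional torus and its RDMA engines, but the core long-message idea, building a collective out of neighbor-only exchanges so that no two messages contend for the same link, can be illustrated on the simplest torus: a ring. The sketch below is a hypothetical plain-Python simulation (not the paper's implementation) of a ring allreduce built from a reduce-scatter phase followed by an allgather phase; the function name and chunk schedule are illustrative choices, not taken from the paper.

```python
# Hypothetical sketch: a ring allreduce simulated in plain Python.
# Long-message allreduce is commonly composed of a reduce-scatter
# followed by an allgather, where every step exchanges one chunk with
# an immediate ring neighbor -- the collision-free, bandwidth-friendly
# pattern the abstract describes for torus networks.

def ring_allreduce(buffers):
    """Simulate an allreduce (sum) over `buffers`, one list per node.

    Each of the p nodes owns a vector split into p chunks. In step s of
    reduce-scatter, node r sends chunk (r - s) mod p to node (r + 1) mod p;
    after p - 1 steps, node r holds the fully reduced chunk (r + 1) mod p.
    The allgather phase then circulates the reduced chunks the same way.
    """
    p = len(buffers)
    n = len(buffers[0])
    assert n % p == 0, "vector length must divide evenly into p chunks"
    chunk = n // p
    data = [list(b) for b in buffers]  # working copy, one buffer per node

    def chunk_slice(i):
        return slice(i * chunk, (i + 1) * chunk)

    # Reduce-scatter: each step, every node passes one chunk to its
    # right-hand neighbor, which accumulates it into its own buffer.
    # Sends are snapshotted first to model a synchronous exchange.
    for step in range(p - 1):
        sends = []
        for r in range(p):
            i = (r - step) % p
            sends.append((r, i, data[r][chunk_slice(i)]))
        for r, i, payload in sends:
            dst = (r + 1) % p  # neighbor-only communication
            s = chunk_slice(i)
            data[dst][s] = [a + b for a, b in zip(data[dst][s], payload)]

    # Allgather: circulate the fully reduced chunks around the ring,
    # overwriting instead of accumulating.
    for step in range(p - 1):
        sends = []
        for r in range(p):
            i = (r + 1 - step) % p
            sends.append((r, i, data[r][chunk_slice(i)]))
        for r, i, payload in sends:
            dst = (r + 1) % p  # again, no node talks to a non-neighbor
            data[dst][chunk_slice(i)] = payload

    return data
```

Because every message travels exactly one hop to a distinct neighbor, no two messages ever share a link in the same step, which is the collision-freeness property; the real algorithms extend this idea across the six torus dimensions and multiple RDMA interfaces.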


Citations
Journal ArticleDOI

Performance evaluation of ultra-large-scale first-principles electronic structure calculation code on the K computer

TL;DR: A real-space density functional theory code performs first-principles electronic structure calculations and enables analysis of the behavior of a silicon nanowire with a diameter of 10–20 nm in large-scale simulations.
Journal ArticleDOI

Multi-axis decomposition of density functional program for strong scaling up to 82,944 nodes on the K computer: Compactly folded 3D-FFT communicators in the 6D torus network

TL;DR: A multi-axis decomposition scheme is developed in which both the G-vector and band axes are decomposed and the 3D-FFT communicators are folded compactly; it keeps the innermost do-loop lengths sufficiently long and restrains the growth of MPI communication costs as the number of nodes increases.
Book ChapterDOI

Scalable Hierarchical Aggregation and Reduction Protocol (SHARP)TM Streaming-Aggregation Hardware Design and Evaluation

TL;DR: The new hardware-based streaming-aggregation capability added to Mellanox’s Scalable Hierarchical Aggregation and Reduction Protocol in its HDR InfiniBand switches is described, along with the range of data sizes for which streaming aggregation performs better than the low-latency aggregation algorithm.
Proceedings ArticleDOI

MPI Sessions: Leveraging Runtime Infrastructure to Increase Scalability of Applications at Exascale

TL;DR: This paper offers a new approach, extending MPI with a new concept called Sessions, which makes two key contributions: a tighter integration with the underlying runtime system and a scalable route to communication groups.
References
Proceedings ArticleDOI

Optimization of MPI collective communication on BlueGene/L systems

TL;DR: This paper discusses the implementation of machine-optimized MPI collectives on BlueGene/L, describing the algorithms and presenting performance results measured with targeted micro-benchmarks on real Blue Gene/L hardware with up to 4096 compute nodes.
Journal ArticleDOI

Tofu: A 6D Mesh/Torus Interconnect for Exascale Computers

TL;DR: A new architecture with a six-dimensional mesh/torus topology achieves highly scalable and fault-tolerant interconnection networks for large-scale supercomputers that can exceed 10 petaflops.
Book ChapterDOI

Optimization of Collective Reduction Operations

TL;DR: Five algorithms optimized for different choices of vector size and number of processes are presented, focusing on bandwidth-dominated protocols for power-of-two and non-power-of-two numbers of processes and optimizing the load balance in communication and computation.
Journal ArticleDOI

Data communication in parallel architectures

TL;DR: The most common data exchange operations in parallel numerical methods are examined and different methods for performing them efficiently on each of several ensemble architectures are proposed and analyzed.
Proceedings ArticleDOI

Open MPI: A High-Performance, Heterogeneous MPI

TL;DR: This work describes Open MPI's architecture for heterogeneous network and processor support, and demonstrates the transparency to the application developer while maintaining very high levels of performance.