Journal ArticleDOI
The design of ultra scalable MPI collective communication on the K computer
Tomoya Adachi,Naoyuki Shida,Kenichi Miura,Shinji Sumimoto,Atsuya Uno,Motoyoshi Kurokawa,Fumiyoshi Shoji,Mitsuo Yokokawa +7 more
TL;DR: The evaluation results on up to 55,296 nodes of the K computer show that the new implementation of MPI collective communication outperforms the existing one for long messages by a factor of 4 to 11, and that the short-message algorithms complement the long-message ones.
Abstract:
This paper proposes the design of ultra scalable MPI collective communication for the K computer, which consists of 82,944 computing nodes and is the world's first system to exceed 10 PFLOPS. The nodes are connected by a Tofu interconnect that introduces a six-dimensional mesh/torus topology. Existing MPI libraries, however, perform poorly on such a direct network system, since they assume typical cluster environments. Thus, we design collective algorithms optimized for the K computer.
In designing the algorithms, we place importance on collision freedom for long messages and low latency for short messages. The long-message algorithms use multiple RDMA network interfaces and rely on neighbor communication in order to achieve high bandwidth and avoid message collisions. The short-message algorithms, on the other hand, are designed to reduce software overhead, which grows with the number of relaying nodes. Evaluation results on up to 55,296 nodes of the K computer show that the new implementation outperforms the existing one for long messages by a factor of 4 to 11. They also show that the short-message algorithms complement the long-message ones.
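The collision-free neighbor-communication idea in the abstract can be illustrated with a ring allreduce, in which every node only ever sends to its +1 neighbor, so no two messages contend for the same link in the same direction. The sketch below is a minimal single-process simulation of that pattern under our own naming; it is not the paper's Tofu/RDMA implementation, only a model of the algorithm style it describes.

```python
def ring_allreduce(buffers):
    """Elementwise-sum allreduce over a ring, using only neighbor sends.

    buffers: list of p equal-length lists, one per simulated node.
    Returns a list of p result lists, all equal to the elementwise sum.
    """
    p = len(buffers)
    n = len(buffers[0])
    assert n % p == 0, "for simplicity, the vector must split into p chunks"
    chunk = n // p
    data = [list(b) for b in buffers]  # private copy per node

    def chunk_of(node, c):
        return data[node][c * chunk:(c + 1) * chunk]

    # Phase 1, reduce-scatter: in step s, node r sends its partial copy of
    # chunk (r - s) mod p to neighbor r + 1, which accumulates it. The
    # outgoing list is built before any receive is applied, modeling the
    # fact that all sends in a step happen concurrently.
    for s in range(p - 1):
        outgoing = [(r, (r - s) % p, chunk_of(r, (r - s) % p)) for r in range(p)]
        for r, c, payload in outgoing:
            dst = (r + 1) % p
            for i, v in enumerate(payload):
                data[dst][c * chunk + i] += v

    # Phase 2, allgather: node r now owns the fully reduced chunk
    # (r + 1) mod p; circulate the finished chunks the same way,
    # overwriting instead of accumulating.
    for s in range(p - 1):
        outgoing = [(r, (r + 1 - s) % p, chunk_of(r, (r + 1 - s) % p)) for r in range(p)]
        for r, c, payload in outgoing:
            dst = (r + 1) % p
            data[dst][c * chunk:(c + 1) * chunk] = payload

    return data
```

The reduce-scatter phase circulates partial sums until each node owns one fully reduced chunk; the allgather phase then circulates the finished chunks. Both phases use only neighbor sends, which is what makes the schedule collision-free on a ring and bandwidth-efficient for long messages.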
Citations
Journal ArticleDOI
Performance evaluation of ultra-large-scale first-principles electronic structure calculation code on the K computer
Yukihiro Hasegawa,Jun-Ichi Iwata,Miwako Tsuji,Daisuke Takahashi,Atsushi Oshiyama,Kazuo Minami,Taisuke Boku,Hikaru Inoue,Yoshito Kitazawa,Ikuo Miyoshi,Mitsuo Yokokawa +10 more
TL;DR: A real-space density functional theory code performs first-principles electronic structure calculations and enables analysis of the behavior of a silicon nanowire 10–20 nm in diameter at that simulation scale.
Journal ArticleDOI
The K computer Operations: Experiences and Statistics
K. Yamamoto,Atsuya Uno,Hitoshi Murai,Toshiyuki Tsukamoto,Fumiyoshi Shoji,Shuji Matsui,Ryuichi Sekizawa,Fumichika Sueyasu,Hiroshi Uchiyama,Mitsuo Okamoto,Nobuo Ohgushi,Katsutoshi Takashina,Daisuke Wakabayashi,Yuki Taguchi,Mitsuo Yokokawa +14 more
TL;DR: It is found that the K computer is an extremely stable system with high utilization.
Journal ArticleDOI
Multi-axis decomposition of density functional program for strong scaling up to 82,944 nodes on the K computer: Compactly folded 3D-FFT communicators in the 6D torus network
Takahiro Yamasaki,Akiyoshi Kuroda,Toshihiro Kato,Jun Nara,Junichiro Koga,Tsuyoshi Uda,Kazuo Minami,Takahisa Ohno +8 more
TL;DR: A multi-axis decomposition scheme in which both G-vectors and band axes are decomposed and 3D-FFT communicators are folded compactly is developed, which retains the inner-most do-loop lengths sufficiently long and restrains the increased MPI communication costs as the number of nodes increases.
Book ChapterDOI
Scalable Hierarchical Aggregation and Reduction Protocol (SHARP)™ Streaming-Aggregation Hardware Design and Evaluation
Richard L. Graham,Lion Levi,Devendar Burredy,Gil Bloch,Gilad Shainer,David Cho,George Elias,Daniel Klein,Joshua Ladd,Ophir Maor,Ami Marelli,Valentin Petrov,Evyatar Romlet,Yong Qin,Ido Zemah +14 more
TL;DR: The new hardware-based streaming-aggregation capability added to Mellanox's Scalable Hierarchical Aggregation and Reduction Protocol in its HDR InfiniBand switches is described, along with the range of data sizes for which streaming aggregation performs better than the low-latency aggregation algorithm.
Proceedings ArticleDOI
MPI Sessions: Leveraging Runtime Infrastructure to Increase Scalability of Applications at Exascale
Daniel J. Holmes,Kathryn Mohror,Ryan E. Grant,Anthony Skjellum,Martin Schulz,Wesley Bland,Jeffrey M. Squyres +6 more
TL;DR: This paper offers a new approach that extends MPI with a new concept called Sessions, making two key contributions: a tighter integration with the underlying runtime system and a scalable route to communication groups.
References
Proceedings ArticleDOI
Optimization of MPI collective communication on BlueGene/L systems
George Almási,Philip Heidelberger,Charles J. Archer,Xavier Martorell,C. Christopher Erway,José E. Moreira,Burkhard Steinmacher-Burow,Yili Zheng +7 more
TL;DR: This paper discusses the implementation of machine-optimized MPI collectives on Blue Gene/L, describing the algorithms and presenting performance results measured with targeted micro-benchmarks on real Blue Gene/L hardware with up to 4096 compute nodes.
Journal ArticleDOI
Tofu: A 6D Mesh/Torus Interconnect for Exascale Computers
TL;DR: A new architecture with a six-dimensional mesh/torus topology achieves highly scalable and fault-tolerant interconnection networks for large-scale supercomputers that can exceed 10 petaflops.
Book ChapterDOI
Optimization of Collective Reduction Operations
TL;DR: Five algorithms optimized for different choices of vector size and number of processes are presented, focusing on bandwidth-dominated protocols for power-of-two and non-power-of-two numbers of processes and optimizing the load balance in communication and computation.
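For the power-of-two process counts mentioned above, a classic latency-optimized allreduce is recursive doubling: at step k, node r exchanges its full vector with partner r XOR 2^k and both accumulate, so after log2 p steps every node holds the global sum. The sketch below is a simplified single-process simulation of that one well-known family member, not a reproduction of the paper's five algorithms.

```python
def recursive_doubling_allreduce(buffers):
    """Elementwise-sum allreduce via pairwise butterfly exchanges.

    buffers: list of p equal-length lists, one per simulated node;
    p must be a power of two for this sketch.
    """
    p = len(buffers)
    assert p & (p - 1) == 0 and p > 0, "power-of-two node count assumed"
    data = [list(b) for b in buffers]  # private copy per node

    k = 1
    while k < p:
        # Snapshot models all exchanges in a step happening concurrently.
        snapshot = [list(d) for d in data]
        for r in range(p):
            partner = r ^ k  # XOR picks the butterfly partner at distance k
            for i in range(len(data[r])):
                data[r][i] += snapshot[partner][i]
        k *= 2
    return data
```

Because every step moves the full vector, the algorithm finishes in log2 p steps, which makes it attractive for short vectors where latency, not bandwidth, dominates; for long vectors, reduce-scatter-based schemes win instead.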
Journal ArticleDOI
Data communication in parallel architectures
Youcef Saad,Martin H. Schultz +1 more
TL;DR: The most common data exchange operations in parallel numerical methods are examined and different methods for performing them efficiently on each of several ensemble architectures are proposed and analyzed.
Proceedings ArticleDOI
Open MPI: A High-Performance, Heterogeneous MPI
TL;DR: This work describes Open MPI's architecture for heterogeneous network and processor support, and demonstrates the transparency to the application developer while maintaining very high levels of performance.