scispace - formally typeset

Rong Shi

Researcher at Ohio State University

Publications -  9
Citations -  188

Rong Shi is an academic researcher from Ohio State University. The author has contributed to research on topics including InfiniBand and CUDA. The author has an h-index of 8 and has co-authored 9 publications receiving 165 citations.

Papers
Proceedings Article

Designing efficient small message transfer mechanism for inter-node MPI communication on InfiniBand GPU clusters

TL;DR: This is the first study to propose efficient designs for GPU communication at small message sizes using the eager protocol; experimental results demonstrate up to 59% and 63% reductions in latency for GPU-to-GPU and CPU-to-GPU point-to-point communication, respectively.
Proceedings Article

Evaluating Scalability Bottlenecks by Workload Extrapolation

TL;DR: This paper extrapolates workloads to a bottleneck node and develops PatternMiner, a semi-automatic tool that identifies how workload patterns change with scale; it is able to emulate a cluster of up to 60,000 nodes with only 8 physical machines to evaluate the NameNode and ResourceManager.
Proceedings Article

High performance MPI library over SR-IOV enabled InfiniBand clusters

TL;DR: This is the first study to offer a high-performance MPI library that supports efficient locality-aware MPI communication over SR-IOV-enabled InfiniBand clusters and significantly improves performance for point-to-point and collective operations.
Proceedings Article

HAND: A Hybrid Approach to Accelerate Non-contiguous Data Movement Using MPI Datatypes on GPU Clusters

TL;DR: This is the first attempt to propose a hybrid, adaptive solution that integrates all existing schemes to optimize arbitrary non-contiguous data movement using MPI datatypes on GPU clusters.
Proceedings Article

A scalable and portable approach to accelerate hybrid HPL on heterogeneous CPU-GPU clusters

TL;DR: This paper presents a novel two-level workload partitioning approach for HPL that distributes workload based on the compute power of CPU/GPU nodes across the cluster and takes advantage of asynchronous kernel launches and CUDA copies to overlap computation and CPU-GPU data movement.