Lena Oden
Researcher at FernUniversität in Hagen
Publications - 26
Citations - 261
Lena Oden is an academic researcher at the FernUniversität in Hagen. She has contributed to research on topics including performance per watt and InfiniBand. She has an h-index of 9 and has co-authored 21 publications receiving 211 citations. Previous affiliations of Lena Oden include Heidelberg University and Forschungszentrum Jülich.
Papers
Proceedings Article
GGAS: Global GPU address spaces for efficient communication in heterogeneous clusters
Lena Oden, Holger Fröning, +1 more
TL;DR: This work introduces global address spaces that enable direct communication between distributed GPUs without CPU involvement, avoiding context switches and unnecessary copies and thereby dramatically reducing communication overhead.
Proceedings Article
Infiniband-Verbs on GPU: A Case Study of Controlling an Infiniband Network Device from the GPU
TL;DR: The results show that complex networking protocols like InfiniBand Verbs (ibverbs) are better handled by CPUs, despite the time penalties of context switching, because the overhead of work-request generation cannot be parallelized and is ill-suited to the highly parallel programming model of GPUs.
Proceedings Article
Why is MPI so slow?: analyzing the fundamental limits in implementing MPI-3.1
Ken Raffenetti, Abdelhalim Amer, Lena Oden, Charles J. Archer, Wesley Bland, Hajime Fujita, Yanfei Guo, Tomislav Janjusic, Dmitry Durnov, Michael Alan Blocksome, Min Si, Sangmin Seo, Akhil Langer, Gengbin Zheng, Masamichi Takagi, Paul Coffman, Jithin Jose, Sayantan Sur, Alexander Sannikov, Sergey Oblomov, Michael Chuvelev, Masayuki Hatanaka, Xin Zhao, Paul Fischer, Thilina Rathnayake, Matthew Otten, Misun Min, Pavan Balaji, +27 more
TL;DR: This paper provides an in-depth analysis of the software overheads in the MPI performance-critical path and exposes mandatory performance overheads that are unavoidable given the MPI-3.1 specification.
Proceedings Article
Energy-efficient collective reduce and allreduce operations on distributed GPUs
TL;DR: Global GPU Address Spaces (GGAS) enable direct GPU-to-GPU communication in heterogeneous clusters that is fully in line with the GPU's thread-collective execution model and requires neither CPU assistance nor staging copies in host memory.
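GGAS itself relies on specialized hardware support, but the collective it accelerates is easy to illustrate in plain code. The following is a conceptual pure-Python sketch of a ring allreduce (sum), the classic algorithm behind distributed reduce/allreduce operations; all names are illustrative, no GPUs or networking are involved, and buffer lengths are assumed divisible by the rank count purely to keep the chunking simple.

```python
def ring_allreduce(bufs):
    """In-place sum-allreduce over `bufs`, one list of numbers per
    simulated rank. Illustrative sketch only: assumes every buffer's
    length is divisible by the number of ranks."""
    n = len(bufs)           # number of simulated ranks in the ring
    m = len(bufs[0]) // n   # chunk size (one chunk per rank)

    # Phase 1: reduce-scatter. In step s, rank r adds its copy of
    # chunk (r - s) into neighbour (r + 1)'s buffer; after n - 1
    # steps, rank r holds the fully reduced chunk (r + 1) mod n.
    for s in range(n - 1):
        for r in range(n):
            c, dst = (r - s) % n, (r + 1) % n
            for i in range(c * m, (c + 1) * m):
                bufs[dst][i] += bufs[r][i]

    # Phase 2: allgather. The finished chunks circulate once more
    # around the ring; receivers simply overwrite their stale copy.
    for s in range(n - 1):
        for r in range(n):
            c, dst = (r + 1 - s) % n, (r + 1) % n
            bufs[dst][c * m:(c + 1) * m] = bufs[r][c * m:(c + 1) * m]
    return bufs
```

For example, `ring_allreduce([[1, 2, 3, 4], [10, 20, 30, 40]])` leaves every rank holding `[11, 22, 33, 44]`. Each rank only ever exchanges data with its ring neighbour, which is what makes the pattern attractive for direct device-to-device transfers.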
Proceedings Article
Lessons learned from comparing C-CUDA and Python-Numba for GPU-Computing
TL;DR: This paper compares the performance of Numba-CUDA and C-CUDA for different kinds of applications and finds that C-CUDA applications still outperform the Numba versions, especially for computation-heavy workloads.
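Both languages in the comparison express the same one-thread-per-element kernel structure. The sketch below shows that structure in pure Python so it runs without a GPU; in actual Numba-CUDA code the kernel body would carry an `@cuda.jit` decorator, the index would come from `cuda.grid(1)`, and the launch loop would be a parallel grid of threads. Function names here are illustrative.

```python
def saxpy_kernel(tid, a, x, y, out):
    """One "thread": computes a single element of out = a*x + y,
    mirroring the body of a Numba-CUDA (or C-CUDA) saxpy kernel."""
    if tid < len(out):  # bounds guard for over-provisioned grids
        out[tid] = a * x[tid] + y[tid]

def launch(n_threads, a, x, y, out):
    """Stand-in for a kernel launch: on a GPU these iterations
    would all run concurrently; here we simply loop over them."""
    for tid in range(n_threads):
        saxpy_kernel(tid, a, x, y, out)
    return out
```

For example, `launch(4, 2.0, [1, 2, 3], [10, 20, 30], [0.0] * 3)` fills the output with `[12.0, 24.0, 36.0]`; the fourth thread is masked off by the bounds guard, just as excess threads are in a real CUDA grid.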