
Showing papers by "Xiaowen Chu published in 2020"


Proceedings ArticleDOI
22 Feb 2020
TL;DR: This paper presents an incentive mechanism, FMore, based on a multi-dimensional procurement auction with K winners, which is not only lightweight and incentive compatible but also encourages more high-quality, low-cost edge nodes to participate in learning, eventually improving the performance of federated learning.
Abstract: Federated learning coupled with Mobile Edge Computing (MEC) is considered one of the most promising solutions for AI-driven service provision. Plenty of studies focus on federated learning from the performance and security aspects, but they neglect the incentive mechanism. In MEC, edge nodes are unwilling to participate in learning voluntarily, and they differ in the provision of multi-dimensional resources, both of which might deteriorate the performance of federated learning. Also, lightweight schemes appeal to edge nodes in MEC. These features require the incentive mechanism to be well designed for MEC. In this paper, we present an incentive mechanism, FMore, with a multi-dimensional procurement auction of K winners. FMore is not only lightweight and incentive compatible but also encourages more high-quality edge nodes with low cost to participate in learning, eventually improving the performance of federated learning. We also present theoretical results on the Nash equilibrium strategies of edge nodes and employ expected utility theory to provide guidance to the aggregator. Both extensive simulations and real-world experiments demonstrate that the proposed scheme can effectively reduce the number of training rounds and drastically improve model accuracy for challenging AI tasks.
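To make the auction concrete, below is a minimal sketch of a K-winner reverse (procurement) auction; the cost-per-quality scoring and the second-price-style clearing rule are illustrative assumptions, not FMore's published mechanism.

```python
def k_winner_procurement(bids, k):
    """Select k edge nodes with the lowest cost per unit of quality.

    bids: list of (node_id, cost, quality) tuples reported by edge nodes.
    Returns the k winners plus a clearing score taken from the (k+1)-th
    ranked bid -- a second-price-style rule that rewards truthful bidding
    (an illustrative choice, not FMore's exact payment rule).
    """
    scored = sorted(bids, key=lambda b: b[1] / b[2])  # cost per unit quality
    winners = scored[:k]
    clearing_score = scored[k][1] / scored[k][2] if len(scored) > k else None
    return winners, clearing_score

bids = [("n1", 4.0, 2.0), ("n2", 3.0, 3.0), ("n3", 5.0, 1.0), ("n4", 2.0, 2.0)]
winners, score = k_winner_procurement(bids, k=2)  # picks n2 and n4
```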

108 citations


Posted Content
TL;DR: This literature survey traces back the origin and principal development timeline of NMT, investigates the important branches, categorizes different research orientations, and discusses some future research trends in this field.
Abstract: In recent years, natural language processing (NLP) has seen great advances driven by deep learning techniques. In the sub-field of machine translation, a new approach named Neural Machine Translation (NMT) has emerged and attracted massive attention from both academia and industry. However, despite the large body of research produced in the past several years, little work has investigated the development process of this new technology trend. This literature survey traces back the origin and principal development timeline of NMT, investigates the important branches, categorizes different research orientations, and discusses some future research trends in this field.

73 citations


Posted Content
TL;DR: A comprehensive survey of communication-efficient distributed training algorithms, covering both system-level and algorithmic-level optimizations, which helps readers understand which algorithms are more efficient under specific distributed environments and extrapolate potential directions for further optimization.
Abstract: Distributed deep learning has become very common for reducing the overall training time by exploiting multiple computing devices (e.g., GPUs/TPUs) as the sizes of deep models and data sets increase. However, data communication between computing devices can be a potential bottleneck that limits system scalability. How to address the communication problem in distributed deep learning has recently become a hot research topic. In this paper, we provide a comprehensive survey of communication-efficient distributed training algorithms, covering both system-level and algorithmic-level optimizations. At the system level, we demystify the system design and implementation choices that reduce the communication cost. At the algorithmic level, we compare different algorithms with theoretical convergence bounds and communication complexity. Specifically, we first propose a taxonomy of data-parallel distributed training algorithms along four main dimensions: communication synchronization, system architectures, compression techniques, and parallelism of communication and computing. We then discuss the studies addressing the problems in these four dimensions and compare their communication costs. We further compare the convergence rates of different algorithms, which tells us how fast the algorithms converge to the solution in terms of iterations. Based on the system-level communication cost analysis and the theoretical convergence speed comparison, we help readers understand which algorithms are more efficient under specific distributed environments and extrapolate potential directions for further optimizations.
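As a concrete instance of the compression dimension in this taxonomy, the sketch below shows top-k gradient sparsification with error feedback (local residual accumulation), a common algorithmic-level technique covered by such surveys; it is a generic illustration, not code from the paper.

```python
import numpy as np

def topk_sparsify(grad, residual, k):
    """Top-k gradient sparsification with error feedback.

    Adds the residual (error accumulated from previous rounds) to the
    fresh gradient, keeps only the k largest-magnitude entries for
    communication, and stores the rest back into the residual.
    """
    acc = grad + residual
    idx = np.argpartition(np.abs(acc), -k)[-k:]   # indices of k largest
    sparse = np.zeros_like(acc)
    sparse[idx] = acc[idx]
    new_residual = acc - sparse                   # feed the error back
    return sparse, new_residual

g = np.random.randn(1000)
r = np.zeros_like(g)
sparse_g, r = topk_sparsify(g, r, k=10)           # ~1% density to transmit
```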

62 citations


Proceedings ArticleDOI
06 Jul 2020
TL;DR: The trade-off between communications and computations (including backward computation and gradient sparsification) is formulated as an optimization problem, and an optimal solution to the problem is derived.
Abstract: Distributed synchronous stochastic gradient descent (SGD) algorithms are widely used in large-scale deep learning applications, but it is known that the communication bottleneck limits the scalability of the distributed system. Gradient sparsification is a promising technique that significantly reduces communication traffic, while pipelining can further overlap communications with computations. However, gradient sparsification introduces extra computation time, and pipelining requires many layer-wise communications, which incur significant communication startup overheads. Merging the gradients of neighboring layers can reduce the startup overheads, but on the other hand it increases the computation time of sparsification and the waiting time for gradient computation. In this paper, we formulate the trade-off between communications and computations (including backward computation and gradient sparsification) as an optimization problem, and derive an optimal solution to the problem. We further develop the optimal merged gradient sparsification algorithm with SGD (OMGS-SGD) for distributed training of deep learning. We conduct extensive experiments to verify the convergence properties and scaling performance of OMGS-SGD. Experimental results show that OMGS-SGD achieves up to 31% end-to-end time efficiency improvement over the state-of-the-art sparsified SGD while preserving convergence performance nearly identical to that of the original SGD without sparsification on a 16-GPU cluster connected with 1Gbps Ethernet.
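The startup-overhead argument can be made precise with the standard alpha-beta communication model; the symbols below (α: per-message startup latency, β: per-byte transfer time, m_l: bytes of layer l's sparsified gradient) are introduced here for illustration and are not the paper's notation.

```latex
% Layer-wise vs. merged communication under the alpha-beta model:
T_{\text{layer-wise}} = \sum_{l=1}^{L} \left( \alpha + \beta m_l \right),
\qquad
T_{\text{merged}} = \alpha + \beta \sum_{l \in \mathcal{G}} m_l .
% Merging a group G of layers saves (|G| - 1) alpha of startup overhead,
% but delays the start of communication until the last layer in G has
% finished its backward pass and enlarges the tensor to be sparsified --
% exactly the trade-off OMGS-SGD optimizes.
```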

54 citations


Proceedings ArticleDOI
18 May 2020
TL;DR: This paper demystifies how Tensor Cores on the NVIDIA Turing architecture work in great detail, including the instructions used, the registers and data layout required, as well as the throughput and latency of Tensor Core operations.
Abstract: Half-precision matrix multiply has played a key role in the training of deep learning models. The newly designed Nvidia Tensor Cores offer native instructions for half-precision small matrix multiply, based on which Half-precision General Matrix Multiply (HGEMM) routines are developed and can be accessed through high-level APIs. In this paper, we, for the first time, demystify how Tensor Cores on the NVIDIA Turing architecture work in great detail, including the instructions used, the registers and data layout required, as well as the throughput and latency of Tensor Core operations. We further benchmark the memory system of Turing GPUs and conduct a quantitative analysis of the performance. Our analysis shows that the bandwidth of DRAM, L2 cache, and shared memory is the new bottleneck for HGEMM, whose performance was previously believed to be bound by computation. Based on our newly discovered features of Tensor Cores, we apply a series of optimization techniques to the Tensor Core-based HGEMM, including blocking size optimization, data layout redesign, data prefetching, and instruction scheduling. Extensive evaluation results show that our optimized HGEMM routine achieves an average of 1.73× and 1.46× speedup over the native implementation of cuBLAS 10.1 on NVIDIA Turing RTX2070 and T4 GPUs, respectively. The code of our implementation is written in native hardware assembly (SASS).

46 citations


Posted ContentDOI
09 Jun 2020 - medRxiv
TL;DR: An automated deep learning methodology is designed to generate a lightweight deep learning model, MNas3DNet41, that achieves an accuracy of 87.14%, F1-score of 87.25%, and AUC of 0.957, which are on par with the best models designed by AI experts.
Abstract: The COVID-19 pandemic has spread all over the world for months. As its transmissibility and high pathogenicity seriously threaten people's lives, accurate and fast detection of COVID-19 infection is crucial. Although many recent studies have shown that deep learning based solutions can help detect COVID-19 from chest CT scans, there is a lack of consistent and systematic comparison and evaluation of these techniques. In this paper, we first build a clean and segmented CT dataset called Clean-CC-CCII by fixing the errors and removing noise in the large CT scan dataset CC-CCII, which has three classes: novel coronavirus pneumonia (NCP), common pneumonia (CP), and normal controls (Normal). After cleaning, our dataset consists of a total of 340,190 slices from 3,993 scans of 2,698 patients. We then benchmark and compare the performance of a series of state-of-the-art (SOTA) 3D and 2D convolutional neural networks (CNNs). The results show that 3D CNNs outperform 2D CNNs in general. With extensive hyperparameter tuning, we find that the 3D CNN model DenseNet3D121 achieves the highest accuracy of 88.63% (F1-score 88.14% and AUC 0.940), and another 3D CNN model, ResNet3D34, achieves the best AUC of 0.959 (accuracy 87.83% and F1-score 86.04%). We further demonstrate that the mixup data augmentation technique can largely improve model performance. Finally, we design an automated deep learning methodology to generate a lightweight deep learning model, MNas3DNet41, that achieves an accuracy of 87.14%, F1-score of 87.25%, and AUC of 0.957, which are on par with the best models designed by AI experts. Automated deep learning design is a promising methodology that can help health-care professionals develop effective deep learning models using their private data sets. Our Clean-CC-CCII dataset and source code are available at: https://github.com/arthursdays/HKBU_HPML_COVID-19.
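The mixup augmentation credited above with large gains has a one-line core; below is a minimal numpy sketch (α is the Beta-distribution hyperparameter, and labels are assumed one-hot — both illustrative choices).

```python
import numpy as np

def mixup_batch(x, y, alpha=0.4):
    """Mixup data augmentation: convex combinations of random sample pairs.

    x: batch of inputs, shape (B, ...); y: one-hot labels, shape (B, C).
    Draws lambda ~ Beta(alpha, alpha) and mixes each sample with a
    randomly permuted partner, mixing the labels with the same weight.
    """
    lam = np.random.beta(alpha, alpha)
    perm = np.random.permutation(len(x))
    x_mixed = lam * x + (1 - lam) * x[perm]
    y_mixed = lam * y + (1 - lam) * y[perm]
    return x_mixed, y_mixed
```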

43 citations


Proceedings ArticleDOI
24 Jul 2020
TL;DR: This paper traces over 215,000 blocks from February 2016 to February 2020 and conducts a broad range of measurements on pool evolution, labeled transactions, and real-time network traffic, discovering new and interesting observations and features.
Abstract: The Bitcoin network has received much attention from both industry and academia due to Bitcoin's recent success. Mining pools, the main components of the Bitcoin network, dominate the computing resources and play essential roles in network security and performance. Although many measurements of the Bitcoin network are available, little is known about the details of mining pool behaviors (e.g., mining revenue and transaction collection strategies) and their effects on Bitcoin end users (e.g., transaction fees, transaction delay, and transaction acceptance rate). This paper aims to fill this gap with a systematic study of mining pools. We traced over 215,000 blocks from February 2016 to February 2020 and collected over 4.12 TB of unconfirmed transactions. We then conducted a broad range of measurements on pool evolution, labeled transactions (blocks), and real-time network traffic, and discovered new and interesting observations and features. Specifically, our measurements showed the following. 1) A few mining pools continuously controlled most of the computing resources of the Bitcoin network. 2) Mining pools were caught in a prisoner's dilemma, competing to increase their computing resources even though the unit profit of the computing resource decreases. 3) Mining pools were stuck in a Malthusian trap, reaching a stage at which the Bitcoin incentives are inadequate for feeding the exponential growth of the computing resources. 4) The market price and transaction fees were not sensitive to past halvings of block rewards. 5) Feerate played a dominating role in the transaction collection strategy of the top mining pools. Our measurements and analysis help the Bitcoin community to understand and improve the Bitcoin network.
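The feerate-dominated collection strategy observed for the top pools is essentially greedy knapsack filling by fee per weight unit; a simplified sketch follows (real pool software, e.g. Bitcoin Core, also scores ancestor packages, which is omitted here).

```python
def collect_transactions(mempool, max_block_weight=4_000_000):
    """Greedy transaction collection by feerate (fee per weight unit).

    mempool: list of (txid, fee_satoshi, weight) tuples.
    Picks the highest-feerate transactions first until the block is full;
    4,000,000 weight units is Bitcoin's consensus block weight limit.
    """
    by_feerate = sorted(mempool, key=lambda t: t[1] / t[2], reverse=True)
    block, used = [], 0
    for txid, fee, weight in by_feerate:
        if used + weight <= max_block_weight:
            block.append(txid)
            used += weight
    return block
```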

40 citations


Proceedings ArticleDOI
24 Mar 2020
TL;DR: FADNet as discussed by the authors exploits efficient 2D based correlation layers with stacked blocks to preserve fast computation, and combines the residual structures to make the deeper model easier to learn, and contains multi-scale predictions so as to exploit a multiscale weight scheduling training technique to improve the accuracy.
Abstract: Deep neural networks (DNNs) have achieved great success in the area of computer vision. The disparity estimation problem tends to be addressed by DNNs, which achieve much better prediction accuracy in stereo matching than traditional hand-crafted feature based methods. On one hand, however, the designed DNNs require significant memory and computation resources to accurately predict the disparity, especially those 3D convolution based networks, which makes them difficult to deploy in real-time applications. On the other hand, existing computation-efficient networks lack expressive capability on large-scale datasets and therefore cannot make accurate predictions in many scenarios. To this end, we propose an efficient and accurate deep network for disparity estimation named FADNet with three main features: 1) it exploits efficient 2D based correlation layers with stacked blocks to preserve fast computation; 2) it combines residual structures to make the deeper model easier to learn; and 3) it contains multi-scale predictions so as to exploit a multi-scale weight scheduling training technique to improve accuracy. We conduct experiments to demonstrate the effectiveness of FADNet on two popular datasets, Scene Flow and KITTI 2015. Experimental results show that FADNet achieves state-of-the-art prediction accuracy and runs an order of magnitude faster than existing 3D models. The code of FADNet is available at https://github.com/HKBU-HPML/FADNet.
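The multi-scale weight scheduling idea can be sketched as a scheduled weighting of per-scale losses; the linear coarse-to-fine ramp below is an illustrative stand-in, not FADNet's published schedule.

```python
import numpy as np

def multiscale_loss(losses, epoch, total_epochs):
    """Combine per-scale disparity losses with scheduled weights.

    losses: list of scalar losses, coarsest scale first. Early in
    training most weight sits on the coarse scales; it is gradually
    shifted toward the finest (full-resolution) prediction.
    """
    n = len(losses)
    progress = epoch / total_epochs                     # 0 -> 1
    raw = np.array([(1 - progress) * (n - i) + progress * (i + 1)
                    for i in range(n)], dtype=float)
    weights = raw / raw.sum()                           # normalize to 1
    return float(np.dot(weights, losses))
```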

38 citations


Proceedings ArticleDOI
19 Feb 2020
TL;DR: This paper presents an optimized implementation of single-precision Winograd convolution on NVIDIA Volta and Turing GPUs, and builds a SASS assembler, TuringAs, for Volta and Turing that enables tuning performance at the native assembly level.
Abstract: In this paper, we present an optimized implementation of single-precision Winograd convolution on NVIDIA Volta and Turing GPUs. Compared with the state-of-the-art Winograd convolution in cuDNN 7.6.1, our implementation achieves up to 2.13× speedup on Volta V100 and up to 2.65× speedup on Turing RTX2070. On both Volta and Turing GPUs, our implementation achieves up to 93% of the device peak. Apart from analyzing and benchmarking different high-level optimization options, we also build a SASS assembler, TuringAs, for Volta and Turing that enables tuning performance at the native assembly level. The new optimization opportunities uncovered by TuringAs not only improve Winograd convolution but can also benefit CUDA compilers and native assembly programming. We have released TuringAs as open-source software. To the best of our knowledge, this is the first publicly available assembler for Volta and Turing GPUs.
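The kernel being optimized implements Winograd minimal filtering; the 1-D F(2,3) base case below (standard transform matrices, written in numpy for clarity rather than SASS) computes two outputs of a 3-tap filter with four multiplications instead of six.

```python
import numpy as np

# Standard Winograd F(2,3) transform matrices.
B_T = np.array([[1, 0, -1, 0],
                [0, 1,  1, 0],
                [0, -1, 1, 0],
                [0, 1,  0, -1]], dtype=float)
G = np.array([[1,    0,   0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0,    0,   1]], dtype=float)
A_T = np.array([[1, 1,  1,  0],
                [0, 1, -1, -1]], dtype=float)

def winograd_f23(d, g):
    """F(2,3): two outputs of a 3-tap correlation over a 4-sample tile,
    using 4 elementwise multiplies (the Hadamard product) instead of 6."""
    return A_T @ ((G @ g) * (B_T @ d))

d = np.array([1.0, 2.0, 3.0, 4.0])   # input tile
g = np.array([0.5, 1.0, -1.0])       # filter
# Direct correlation for comparison: [d0*g0+d1*g1+d2*g2, d1*g0+d2*g1+d3*g2]
direct = np.array([d[0]*g[0] + d[1]*g[1] + d[2]*g[2],
                   d[1]*g[0] + d[2]*g[1] + d[3]*g[2]])
assert np.allclose(winograd_f23(d, g), direct)
```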

38 citations


Proceedings ArticleDOI
11 May 2020
TL;DR: A comprehensive empirical study on the performance and energy efficiency of several popular off-the-shelf processors in training DNNs by benchmarking a representative set of deep learning workloads, which provides an informative guide for end users to select proper AI accelerators.
Abstract: Deep learning has become widely used in complex AI applications. Yet, training a deep neural network (DNN) model requires a considerable amount of computation, a long running time, and much energy. Nowadays, many-core AI accelerators (e.g., GPUs and TPUs) are designed to improve the performance of AI training. However, processors from different vendors perform dissimilarly in terms of performance and energy consumption. To investigate the differences among several popular off-the-shelf processors (i.e., Intel CPU, NVIDIA GPU, AMD GPU, and Google TPU) in training DNNs, we carry out a comprehensive empirical study on the performance and energy efficiency of these processors by benchmarking a representative set of deep learning workloads, including computation-intensive operations, classical convolutional neural networks (CNNs), recurrent neural networks (LSTM), Deep Speech 2, and Transformer. Unlike existing end-to-end benchmarks, which only report the training time, we investigate the impact of the hardware, the vendor's software library, and the deep learning framework on the performance and energy consumption of AI training. Our evaluation methods and results not only provide an informative guide for end users to select proper AI accelerators, but also expose some opportunities for hardware vendors to improve their software libraries.

28 citations


Proceedings ArticleDOI
01 Nov 2020
TL;DR: A thorough performance evaluation of the first long term support release of Hyperledger Fabric, finding that the validate phase was likely the system bottleneck due to the low validation speed of chaincode, and that the execute phase exhibited good scalability under the OR endorsement policy but not under the AND endorsement policy.
Abstract: Hyperledger Fabric is a popular open-source project for deploying permissioned blockchains. Many performance characteristics of the latest Hyperledger Fabric (e.g., the performance characteristics of each phase, the impacts of ordering services, bottlenecks, and scalability) are still not well understood due to the performance complexity of distributed systems. We conducted a thorough performance evaluation of the first long term support release of Hyperledger Fabric. We studied the performance characteristics of each phase (execute, order, and validate) according to Hyperledger Fabric's new execute-order-validate architecture. We also studied the ordering services, including Solo, Kafka, and Raft. Our experimental results revealed the following findings. 1) The execute phase exhibited good scalability under the OR endorsement policy but not under the AND endorsement policy. 2) We were not able to find a significant performance difference among the three ordering services. 3) The validate phase was likely the system bottleneck due to the low validation speed of chaincode. Overall, our work helps to understand and improve Hyperledger Fabric.

Posted Content
TL;DR: This work introduces a decentralized training algorithm in which each worker exchanges a highly compressed model with a single, dynamically selected peer per communication round, significantly reducing communication traffic while preserving convergence.
Abstract: Distributed learning techniques such as federated learning have enabled multiple workers to train machine learning models together to reduce the overall training time. However, current distributed training algorithms (centralized or decentralized) suffer from the communication bottleneck on multiple low-bandwidth workers (and on the server under the centralized architecture). Although decentralized algorithms generally have lower communication complexity than their centralized counterparts, they still suffer from the communication bottleneck for workers with low network bandwidth. To deal with the communication problem while preserving convergence performance, we introduce a novel decentralized training algorithm with the following key features: 1) it does not require a parameter server to maintain the model during training, which avoids communication pressure on any single peer; 2) each worker only needs to communicate with a single peer at each communication round, using a highly compressed model, which can significantly reduce the communication traffic on the worker (we theoretically prove that our sparsification algorithm still preserves convergence properties); and 3) each worker dynamically selects its peer at different communication rounds to better utilize the bandwidth resources. We conduct experiments with convolutional neural networks on 32 workers to verify the effectiveness of our proposed algorithm compared to seven existing methods. Experimental results show that our algorithm significantly reduces communication traffic and generally selects relatively high-bandwidth peers.
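A schematic round of the described scheme, as a hedged numpy sketch: each worker exchanges a top-k-compressed model with one peer and averages. Peer choice here is uniform random, whereas the paper selects peers dynamically by bandwidth, and a real implementation would also track the uncompressed residual.

```python
import numpy as np

def topk_compress(vec, ratio=0.01):
    """Keep only the largest-magnitude entries (top-k compression)."""
    k = max(1, int(len(vec) * ratio))
    idx = np.argpartition(np.abs(vec), -k)[-k:]
    out = np.zeros_like(vec)
    out[idx] = vec[idx]
    return out

def gossip_round(models, rng):
    """One communication round: every worker receives a compressed model
    from a single randomly chosen peer and averages it with its own."""
    n = len(models)
    new_models = []
    for i in range(n):
        j = rng.choice([p for p in range(n) if p != i])
        recv = topk_compress(models[j])       # what peer j would transmit
        new_models.append(0.5 * (models[i] + recv))
    return new_models

rng = np.random.default_rng(0)
models = [rng.standard_normal(1000) for _ in range(4)]
models = gossip_round(models, rng)
```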

Posted Content
TL;DR: Experimental results show that FADNet achieves state-of-the-art prediction accuracy and runs an order of magnitude faster than existing 3D models.
Abstract: Deep neural networks (DNNs) have achieved great success in the area of computer vision. The disparity estimation problem tends to be addressed by DNNs, which achieve much better prediction accuracy in stereo matching than traditional hand-crafted feature based methods. On one hand, however, the designed DNNs require significant memory and computation resources to accurately predict the disparity, especially those 3D convolution based networks, which makes them difficult to deploy in real-time applications. On the other hand, existing computation-efficient networks lack expressive capability on large-scale datasets and therefore cannot make accurate predictions in many scenarios. To this end, we propose an efficient and accurate deep network for disparity estimation named FADNet with three main features: 1) it exploits efficient 2D based correlation layers with stacked blocks to preserve fast computation; 2) it combines residual structures to make the deeper model easier to learn; and 3) it contains multi-scale predictions so as to exploit a multi-scale weight scheduling training technique to improve accuracy. We conduct experiments to demonstrate the effectiveness of FADNet on two popular datasets, Scene Flow and KITTI 2015. Experimental results show that FADNet achieves state-of-the-art prediction accuracy and runs an order of magnitude faster than existing 3D models. The code of FADNet is available at this https URL.

Posted Content
TL;DR: This paper proposes a new computation- and communication-efficient top-k sparsification communication library for distributed training, improves system scalability by optimizing I/O with a simple yet efficient multi-level data caching mechanism, and optimizes the update operation by introducing a novel parallel tensor operator.
Abstract: Distributed training techniques have been widely deployed in large-scale deep neural network (DNN) training on dense-GPU clusters. However, on public cloud clusters, due to the moderate inter-connection bandwidth between instances, traditional state-of-the-art distributed training systems cannot scale well when training large-scale models. In this paper, we propose a new computation- and communication-efficient top-k sparsification communication library for distributed training. To further improve system scalability, we optimize I/O by proposing a simple yet efficient multi-level data caching mechanism, and we optimize the update operation by introducing a novel parallel tensor operator. Experimental results on a 16-node Tencent Cloud cluster (each node with 8 Nvidia Tesla V100 GPUs) show that our system is 25%-40% faster than existing state-of-the-art systems on CNNs and the Transformer. We finally break the record on DAWNBench by training ResNet-50 to 93% top-5 accuracy on ImageNet.

Journal ArticleDOI
TL;DR: A fine-grained analytical model is presented to estimate the execution time of GPU kernels with both core and memory frequency scaling; it captures the kernel performance scaling behaviors under different frequency settings and achieves decent accuracy.
Abstract: Contemporary graphics processing units (GPUs) support dynamic voltage and frequency scaling to balance computational performance and energy consumption. However, accurate and straightforward performance estimation for a given GPU kernel under different frequency settings is still lacking for real hardware, which is essential to determine the best frequency configuration for energy saving. In this article, we reveal a fine-grained analytical model to estimate the execution time of GPU kernels with both core and memory frequency scaling. Compared to cycle-level simulators, which are too slow to apply on real hardware, our model only needs simple and one-off micro-benchmarks to extract a set of hardware parameters and kernel performance counters, without any source code analysis. Our experimental results show that the proposed performance model can capture the kernel performance scaling behaviors under different frequency settings and achieve decent accuracy (average errors of 3.85, 8.6, 8.82, and 8.83 percent on a set of 20 GPU kernels with four modern Nvidia GPUs).
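A common first-order view of such core/memory frequency models is sketched below; this is an illustrative bound, much coarser than the paper's fine-grained pipeline decomposition, and the cycle counts C_comp and C_mem are symbols introduced here.

```latex
% First-order DVFS performance model: kernel time is limited by compute
% cycles at the core clock or memory cycles at the memory clock,
% whichever pipeline dominates.
T(f_{\mathrm{core}}, f_{\mathrm{mem}}) \;\approx\;
\max\!\left( \frac{C_{\mathrm{comp}}}{f_{\mathrm{core}}},\;
             \frac{C_{\mathrm{mem}}}{f_{\mathrm{mem}}} \right)
```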

Journal ArticleDOI
TL;DR: ESetStore is presented, a prototype erasure-coded storage system that aims to achieve fast recovery from failures and can be an enhancement for existing solutions, such as Partial-parallel-repair (PPR), to further improve recovery performance.
Abstract: Erasure codes have been used extensively in large-scale storage systems to reduce the storage overhead of triplication-based storage systems. One key performance issue introduced by erasure codes is the long time needed to recover from a single failure, which occurs constantly in large-scale storage systems. We present ESetStore, a prototype erasure-coded storage system that aims to achieve fast recovery from failures. ESetStore is novel in the following aspects. We proposed a data placement algorithm named ESet for ESetStore that can aggregate adequate I/O resources from the available storage servers to recover from each single failure. We designed and implemented efficient read and write operations for our erasure-coded storage system via effective use of the available I/O and computation resources. We evaluated the performance of ESetStore with extensive experiments on a cluster with 50 storage servers. The evaluation results demonstrate that recovery performance grows linearly as available I/O resources are harvested. Under some mild conditions on our defined parameter, recovery I/O parallelism, we can achieve optimal recovery performance, in which ESet enables minimal recovery time. Rather than being an alternative for improving recovery performance, our work can serve as an enhancement to existing solutions, such as Partial-parallel-repair (PPR), to further improve recovery performance.

Journal ArticleDOI
TL;DR: A comprehensive study of erasure coding performance at the network edge, examining whether it can match the network performance of 5G and Wi-Fi 6, together with an OpenMP-based acceleration of erasure coding on a multi-core CPU.
Abstract: The emerging computing paradigm of edge computing is expected to store and process data at the network edge with reduced latency and improved network bandwidth. To the best of our knowledge, key performance issues such as the coding performance of erasure-coded storage systems have not been investigated for edge computing. In this paper, we present an erasure-coded storage system for edge computing. Unlike data center and cloud storage systems, it employs edge devices to perform the encoding and decoding operations, which can become a performance bottleneck for the whole storage system due to limited computing power. Hence, we present a comprehensive study of the performance of erasure coding to see whether it can match the network performance of 5G and Wi-Fi 6 at the network edge. We use the popular edge device Jetson Nano and two state-of-the-art coding libraries, Jerasure and G-CRS. Our evaluation results reveal unsatisfactory performance for Jerasure and high variance for G-CRS. To obtain better and more stable performance, we accelerate erasure coding with OpenMP on a multi-core CPU. Our work demonstrates that our acceleration brings stable performance and matches the network bandwidth of 5G and Wi-Fi 6 for some commonly used cases. Besides, our work offers a better understanding of erasure-coded storage systems for edge computing and can serve as a reference for further optimization of such systems at the network edge.
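The encode/decode data-flow being benchmarked can be seen in the simplest erasure code, a single XOR parity (RAID-5-like: k data blocks plus one parity block, tolerating one loss). Jerasure and G-CRS use Reed-Solomon-style codes over GF(2^8), but the recovery pattern is analogous; this is a toy illustration only.

```python
import numpy as np

def encode_parity(data_blocks):
    """Single-parity erasure code: parity = XOR of all k data blocks."""
    parity = np.zeros_like(data_blocks[0])
    for block in data_blocks:
        parity ^= block
    return parity

def recover(stripe, lost_index):
    """Recover one lost block by XOR-ing all surviving blocks
    (data blocks plus the parity block)."""
    survivors = [b for i, b in enumerate(stripe) if i != lost_index]
    out = np.zeros_like(survivors[0])
    for b in survivors:
        out ^= b
    return out

k = 4
data = [np.random.randint(0, 256, 1024, dtype=np.uint8) for _ in range(k)]
stripe = data + [encode_parity(data)]
assert np.array_equal(recover(stripe, 2), data[2])
```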

Posted Content
TL;DR: It is shown that DL models with low model intensity are difficult to scale out even with the best available lossless algorithm over 100Gbps InfiniBand, and that the system architecture and scheduling algorithms have a critical impact on the scaling property.
Abstract: Nowadays, large and complex deep learning (DL) models are increasingly trained in a distributed manner across multiple worker machines, in which extensive communications between workers pose serious scaling problems. In this article, we present a quantitative survey of communication optimization techniques for data parallel distributed DL. We first identify the major communication challenges and classify the existing solutions into three levels, namely the learning algorithm, the system architecture, and the network infrastructure. We present the state-of-the-art communication optimization techniques and conduct a comparative study of seven common lossless distributed DL methods on a 32-GPU cluster with 100Gbps InfiniBand (IB). We show that (1) the DL models with low model intensity (such as BERT and BERT-Large) are difficult to scale out even with the best available lossless algorithm over 100Gbps IB; (2) the system architecture and scheduling algorithms have a critical impact on the scaling property. We conclude the article with discussions on the open issues for further investigations.

Proceedings ArticleDOI
06 Jul 2020
TL;DR: This paper designs an 802.11ax-based dense WiFi network to provide WiFi services to a large number of users within a given area with the following objectives: to minimize the number of access points (APs); to fulfil the users’ throughput requirement; and to be resistant to AP failures.
Abstract: IEEE 802.11ax is a promising standard for the next-generation WiFi network, which uses orthogonal frequency division multiple access (OFDMA) to segregate the wireless spectrum into time-frequency resource units (RUs). In this paper, we aim at designing an 802.11ax-based dense WiFi network to provide WiFi services to a large number of users within a given area with the following objectives: (1) to minimize the number of access points (APs); (2) to fulfil the users’ throughput requirement; and (3) to be resistant to AP failures. We formulate the above into a joint AP placement and power-channel-RU assignment optimization problem, which is NP-hard. To tackle this problem, we first derive an analytical model to estimate each user’s throughput under the mechanism of OFDMA and a widely used interference model. We then design a heuristic algorithm to find high-quality solutions with polynomial time complexity. Simulation results show that our algorithm can achieve the optimal performance for a small area of 50×50 m². For a larger area of 100×80 m², where we cannot find the optimal solution through an exhaustive search, our algorithm can reduce the number of APs by 32-55% as compared to the random and greedy solutions.
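The AP-minimization objective is set-cover-like; the toy greedy sketch below uses a plain coverage-radius abstraction. It is not the paper's heuristic, which additionally assigns power, channels, and RUs, and accounts for failure resistance.

```python
import numpy as np

def greedy_ap_placement(users, candidates, radius):
    """Greedy AP placement sketch: repeatedly add the candidate AP that
    covers the most still-uncovered users until everyone is covered.

    users, candidates: (N, 2) / (M, 2) arrays of planar coordinates;
    radius: assumed coverage radius in meters.
    """
    uncovered = np.ones(len(users), dtype=bool)
    chosen = []
    while uncovered.any():
        best, best_gain = None, 0
        for idx, ap in enumerate(candidates):
            if idx in chosen:
                continue
            covered = np.linalg.norm(users - ap, axis=1) <= radius
            gain = int(np.sum(uncovered & covered))
            if gain > best_gain:
                best, best_gain = idx, gain
        if best is None:              # remaining users are unreachable
            break
        chosen.append(best)
        uncovered &= np.linalg.norm(users - candidates[best], axis=1) > radius
    return chosen
```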

Proceedings ArticleDOI
22 Feb 2020
TL;DR: In this paper, the authors review recent work on reducing the communication time of the two types of distributed deep learning architectures, centralized and decentralized, in which the overhead of exchanging gradients or models among workers is a potential bottleneck that limits system scalability.
Abstract: The increasing size of machine learning models, especially deep neural network models, can improve the model generalization capability. However, large models require more training data and more computing resources (such as GPU clusters) to train. In distributed training, the communication overhead of exchanging gradients or models among workers becomes a potential system bottleneck that limits the system scalability. Recently, many research works aim to reduce communication time of two types of distributed deep learning architectures, centralized and decentralized.

Proceedings ArticleDOI
24 Aug 2020
TL;DR: In this article, the authors proposed a new distributed optimization method named LAGS-SGD, which combines synchronous stochastic gradient descent with a novel layer-wise adaptive gradient sparsification (LAGS) scheme.
Abstract: To reduce the long training time of large deep neural network (DNN) models, distributed synchronous stochastic gradient descent (S-SGD) is commonly used on a cluster of workers. However, the speedup brought by multiple workers is limited by the communication overhead. Two approaches, namely pipelining and gradient sparsification, have been separately proposed to alleviate the impact of communication overheads. Yet, the gradient sparsification methods can only initiate the communication after the backpropagation, and hence miss the pipelining opportunity. In this paper, we propose a new distributed optimization method named LAGS-SGD, which combines S-SGD with a novel layer-wise adaptive gradient sparsification (LAGS) scheme. In LAGS-SGD, every worker independently selects a small set of "significant" gradients from each layer, whose size can be adapted to the communication-to-computation ratio of that layer. The layer-wise nature of LAGS-SGD opens the opportunity of overlapping communications with computations, while its adaptive nature makes it flexible in controlling the communication time. We prove that LAGS-SGD has convergence guarantees and the same order of convergence rate as vanilla S-SGD under a weak analytical assumption. Extensive experiments are conducted to verify the analytical assumption and the convergence performance of LAGS-SGD. Experimental results on a 16-GPU cluster show that LAGS-SGD outperforms the original S-SGD and existing sparsified S-SGD without obvious loss of model accuracy.
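The layer-wise adaptivity can be sketched as choosing each layer's top-k density from its communication-to-computation ratio; the proportional rule and parameter names below are illustrative stand-ins for LAGS-SGD's actual adaptation scheme.

```python
def adaptive_densities(layer_sizes, comp_times, bandwidth, base_density=0.001):
    """Pick a per-layer top-k density adapted to the communication-to-
    computation ratio: layers whose transfer can hide behind plenty of
    backward-computation time may keep more gradients; others are
    compressed harder.

    layer_sizes: number of fp32 gradients per layer;
    comp_times: backward time (s) available to overlap per layer;
    bandwidth: link bandwidth in bytes/s.
    """
    densities = []
    for size, comp in zip(layer_sizes, comp_times):
        full_comm = 4 * size / bandwidth     # time to send the dense layer
        ratio = comp / full_comm             # how much of it we can hide
        densities.append(min(1.0, base_density * max(1.0, ratio)))
    return densities
```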

Posted Content
TL;DR: This paper establishes a new DDL job scheduling framework that organizes DDL jobs as Directed Acyclic Graphs (DAGs) and considers communication contention between nodes, and proposes an efficient algorithm, LWF-κ, to balance GPU utilization and consolidate the allocated GPUs for each job.
Abstract: Distributed Deep Learning (DDL) has rapidly grown in popularity since it helps boost training performance on high-performance GPU clusters. Efficient job scheduling is indispensable to maximize the overall performance of the cluster when training multiple jobs simultaneously. However, existing schedulers do not consider the communication contention of multiple communication tasks from different distributed training jobs, which could deteriorate system performance and prolong job completion time. In this paper, we first establish a new DDL job scheduling framework that organizes DDL jobs as Directed Acyclic Graphs (DAGs) and considers communication contention between nodes. We then propose an efficient algorithm, LWF-κ, to balance GPU utilization and consolidate the allocated GPUs for each job. When scheduling communication tasks, we observe that neither avoiding all contention nor blindly accepting it is optimal for minimizing job completion time. We thus propose a provable algorithm, AdaDUAL, to efficiently schedule those communication tasks. Based on AdaDUAL, we finally propose Ada-SRSF for the DDL job scheduling problem. Simulations on a 64-GPU cluster connected with 10 Gbps Ethernet show that LWF-κ achieves up to 1.59× improvement over the classical first-fit algorithms. More importantly, Ada-SRSF reduces the average job completion time by 20.1% and 36.7% compared to the SRSF(1) scheme (avoiding all contention) and the SRSF(2) scheme (blindly accepting all two-way communication contention), respectively.

Journal ArticleDOI
TL;DR: An unbiased semi-supervised cluster tree is proposed that is learnt using only very few labeled data, with a K-means algorithm building each level of the hierarchical tree in a top-down manner.
Abstract: Conventionally, it is a prerequisite to acquire a large amount of annotated data to train an accurate classifier. However, the acquisition of such a dataset is usually infeasible due to the high annotation cost. Therefore, semi-supervised learning has emerged and attracted increasing research effort in recent years. Essentially, semi-supervised learning is sensitive to how the unlabeled data is sampled, and model performance might seriously deteriorate if biased unlabeled data is sampled at an early stage. In this paper, an unbiased semi-supervised cluster tree is proposed that is learnt using only very few labeled data. Specifically, a K-means algorithm is adopted to build each level of this hierarchical tree in a top-down manner. The number of clusters is determined by the number of classes contained in the labeled data. The confidence error of the cluster tree is theoretically analyzed and then used to prune the tree. Empirical studies on several datasets demonstrate that the proposed semi-supervised cluster tree is superior to state-of-the-art semi-supervised learning algorithms with respect to classification accuracy.
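A hedged sklearn sketch of one level of the described construction follows; recursion into impure clusters and the confidence-error pruning are omitted, and the helper name is hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_level(X, labeled_idx, y_labeled):
    """One top-down level of a semi-supervised cluster tree (sketch).

    Clusters all points (labeled and unlabeled) into K = #classes groups,
    then tags each cluster with the majority class among the few labeled
    points that fall inside it.
    """
    k = len(np.unique(y_labeled))                 # #clusters = #classes seen
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    tags = {}
    for c in range(k):
        members = set(np.where(km.labels_ == c)[0].tolist())
        inside = [y for i, y in zip(labeled_idx, y_labeled) if i in members]
        tags[c] = max(set(inside), key=inside.count) if inside else None
    return km, tags
```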

Posted Content
TL;DR: Extensive experiments on the Scene Flow and KITTI datasets show that EDNet outperforms the previous 3D CNN based works and achieves state-of-the-art performance with significantly faster speed and less memory consumption.
Abstract: Existing state-of-the-art disparity estimation works mostly leverage the 4D concatenation volume and construct a very deep 3D convolution neural network (CNN) for disparity regression, which is inefficient due to the high memory consumption and slow inference speed. In this paper, we propose a network named EDNet for efficient disparity estimation. Firstly, we construct a combined volume which incorporates contextual information from the squeezed concatenation volume and feature similarity measurement from the correlation volume. The combined volume can be next aggregated by 2D convolutions which are faster and require less memory than 3D convolutions. Secondly, we propose an attention-based spatial residual module to generate attention-aware residual features. The attention mechanism is applied to provide intuitive spatial evidence about inaccurate regions with the help of error maps at multiple scales and thus improve the residual learning efficiency. Extensive experiments on the Scene Flow and KITTI datasets show that EDNet outperforms the previous 3D CNN based works and achieves state-of-the-art performance with significantly faster speed and less memory consumption.

Proceedings ArticleDOI
01 Nov 2020
TL;DR: ICIStrategy, a multi-node collaborative storage strategy based on intra-cluster integrity, aims to relieve storage pressure by reducing the amount of data each participant needs to store, and to reduce communication overhead by collaboratively storing and verifying blocks through in-cluster nodes.
Abstract: A blockchain is essentially a distributed ledger shared by all nodes in the system. All nodes in a blockchain are equal, and each node holds all transactions and blocks in the network. As the network continues to expand, the amount of data grows linearly, and participants face the problem of storage limitation: the blockchain is hard to scale. This paper introduces ICIStrategy, a multi-node collaborative storage strategy based on intra-cluster integrity. In ICIStrategy, we divide all participants into several clusters. Each cluster is required to hold all data of the network, whereas a node within a cluster does not need to maintain data integrity. The strategy aims to relieve storage pressure by reducing the amount of data each participant needs to store, and to reduce communication overhead by collaboratively storing and verifying blocks through in-cluster nodes. Moreover, ICIStrategy can greatly reduce the overhead of bootstrapping. We describe the mode of operation of our strategy, analyze the performance of ICIStrategy, and conduct simulation experiments. The results of several comparative experiments show that our strategy needs just 25% of the storage space required by RapidChain, which indeed solves the problem of storage limitation and improves blockchain performance.

Posted Content
TL;DR: This paper uses the roofline performance model of GPUs to design an efficient SpDM algorithm called GCOOSpDM, which exploits coalesced global memory access, fast shared memory reuse, and more operations per byte of global memory traffic.
Abstract: Multiplication of a sparse matrix by a dense matrix (SpDM) is widely used in many areas like scientific computing and machine learning. However, existing works overlook the performance optimization of SpDM on modern many-core architectures like GPUs. Sparse storage data structures keep sparse matrices in a memory-saving format, but they bring difficulties in optimizing the performance of SpDM on modern GPUs due to the irregular data access of the sparse structure, which results in lower resource utilization and poorer performance. In this paper, we refer to the roofline performance model of GPUs to design an efficient SpDM algorithm called GCOOSpDM, in which we exploit coalesced global memory access, fast shared memory reuse, and more operations per byte of global memory traffic. Experiments are evaluated on three Nvidia GPUs (i.e., GTX 980, GTX Titan X Pascal, and Tesla P100) with CUDA-8.0, using a large number of matrices including a public dataset and randomly generated matrices. Experimental results show that GCOOSpDM achieves 1.5-8× speedup over Nvidia's library cuSPARSE on many matrices. We also analyze instruction-level operations on a particular GPU to understand the performance gap between GCOOSpDM and cuSPARSE. The profiled instructions confirm that cuSPARSE spends a lot of time on slow memory access (including DRAM access and L2 cache access), while GCOOSpDM moves such slow memory access to faster shared memory, which mainly contributes to the performance gain. Results also show that GCOOSpDM outperforms the dense algorithm (cuBLAS) at lower sparsity than cuSPARSE does on GPUs.
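The computation GCOOSpDM accelerates is just C = A·B with A sparse; a reference Python loop over a COO representation makes the irregular access pattern visible. The GPU kernel's actual grouped-COO layout and shared-memory staging are not modeled here.

```python
import numpy as np
from scipy.sparse import random as sparse_random

def coo_spdm(rows, cols, vals, B, m):
    """Reference SpDM: C = A @ B with sparse A given in COO form.

    Each nonzero a_ij scatters v * B[j, :] into row i of C -- the
    irregular, row-hopping access that is hard to coalesce on GPUs,
    and that GCOOSpDM's grouped layout targets.
    """
    C = np.zeros((m, B.shape[1]))
    for i, j, v in zip(rows, cols, vals):
        C[i, :] += v * B[j, :]
    return C

A = sparse_random(64, 64, density=0.05, format="coo", random_state=0)
B = np.random.randn(64, 32)
assert np.allclose(coo_spdm(A.row, A.col, A.data, B, 64), A.toarray() @ B)
```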

Posted Content
27 May 2020
TL;DR: A systematic survey of communication-efficient distributed deep learning that summarizes the state-of-the-art techniques and provides a taxonomy with three levels: optimization algorithm, system architecture, and communication infrastructure.
Abstract: In recent years, distributed deep learning techniques have been widely deployed to accelerate the training of deep learning models by exploiting multiple computing nodes. However, the extensive communications among workers dramatically limit system scalability. In this article, we provide a systematic survey of communication-efficient distributed deep learning. Specifically, we first identify the communication challenges in distributed deep learning. We then summarize the state-of-the-art techniques in this direction and provide a taxonomy with three levels: optimization algorithm, system architecture, and communication infrastructure. Afterwards, we present a comparative study of seven different distributed deep learning techniques on a 32-GPU cluster with both 10Gbps Ethernet and 100Gbps InfiniBand. We finally discuss some challenges and open issues for possible future investigations.

Proceedings ArticleDOI
01 Nov 2020
TL;DR: In this article, the authors conduct a comprehensive study on the inference performance and energy efficiency of a Transformer model trained for the language translation service, and propose the Aligned scheduling scheme that improves throughput and energy efficiency by up to 2.86× and 2.73×, respectively, at the cost of 40% average latency loss.
Abstract: Inference-as-a-service (IAAS) has recently been launched by cloud service providers to support on-demand AI applications. Many natural language processing (NLP) services are based on the Transformer sequence transduction model. However, the inference process of the Transformer model consumes a significant amount of energy due to the large model size (e.g., billions of parameters) and tremendous computation. How to reduce the energy consumption of IAAS without violating the service-level agreement (SLA) has become a practical challenge for service providers. In this work, we conduct a comprehensive study on the inference performance and energy efficiency of a Transformer model trained for the language translation service. First, we empirically characterize some essential performance metrics, including latency, throughput, and energy consumption, on three different GPUs with diversified workload configurations. The detailed workload separation facilitates a thorough and deep understanding of the inference process of the Transformer model. Second, we provide an energy consumption model for the Transformer based on the observed data. Finally, we propose the Aligned scheduling scheme that improves throughput and energy efficiency by up to 2.86× and 2.73×, respectively, at the cost of 40% average latency loss. Our findings provide a full scope of Transformer inference, and suggest that workload balancing and scheduling have great potential for energy-efficient Transformer inference services.

Proceedings ArticleDOI
01 Dec 2020
TL;DR: GCOOSpDM as mentioned in this paper exploits coalesced global memory access, fast shared memory reuse, and more operations per byte of global memory traffic.
Abstract: Multiplication of a sparse matrix by a dense matrix (SpDM) is widely used in many areas like scientific computing and machine learning. However, existing work overlooks the performance optimization of SpDM on modern many-core architectures like GPUs. Sparse storage data structures keep sparse matrices in a memory-saving format, but they bring difficulties in optimizing the performance of SpDM on modern GPUs due to the irregular data access of the sparse structure, which results in lower resource utilization and poorer performance. In this paper, we refer to the roofline performance model of GPUs to design an efficient SpDM algorithm called GCOOSpDM, in which we exploit coalesced global memory access, fast shared memory reuse, and more operations per byte of global memory traffic. Experiments are evaluated on three Nvidia GPUs (i.e., GTX 980, GTX Titan X Pascal, and Tesla P100) using a large number of matrices including a public dataset and randomly generated matrices. Experimental results show that GCOOSpDM achieves 1.5-8x speedup over Nvidia's library cuSPARSE on many matrices.

Proceedings ArticleDOI
23 Feb 2020
TL;DR: A novel GPGPU performance estimation model with both core and memory frequency scaling is proposed, together with a cross-benchmarking suite that simulates kernels with a wide range of instruction distributions, which is used to study the correlation between kernel performance counters and kernel performance.
Abstract: Dynamic Voltage and Frequency Scaling (DVFS) on General-Purpose Graphics Processing Units (GPGPUs) is now becoming one of the most significant techniques for balancing computational performance and energy consumption. However, there are still few fast and accurate models for predicting GPU kernel execution time under different core and memory frequency settings, which is important for determining the best frequency configuration for energy saving. Accordingly, a novel GPGPU performance estimation model with both core and memory frequency scaling is herein proposed. We design a cross-benchmarking suite, which simulates kernels with a wide range of instruction distributions. The synthetic kernels generated by this suite can be used for model pre-training or as supplementary training samples. We then apply two different machine learning algorithms, Support Vector Regression (SVR) and Gradient Boosting Decision Tree (GBDT), to study the correlation between kernel performance counters and kernel performance. The models trained only with our cross-benchmarking suite achieve satisfactory accuracy (16%-22% mean absolute error) on 24 unseen real application kernels. Validated on three modern GPUs with a wide frequency scaling range, using a collection of 24 real application kernels, the proposed model is able to achieve accurate results (5.1%, 2.8%, and 6.5% mean absolute error) for the target GPUs (GTX 980, Titan X Pascal, and Tesla P100).
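The modeling step reduces to supervised regression from performance counters (plus the frequency settings) to kernel time; below is a hedged sklearn sketch with synthetic placeholder data and illustrative feature names, not the paper's trained model.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Hypothetical training set: rows are (kernel, frequency setting) pairs,
# features are performance counters plus core/memory frequencies, and the
# target is measured execution time. All values here are placeholders.
rng = np.random.default_rng(0)
X = rng.random((200, 5))   # e.g. [inst_fp32, inst_mem, occupancy, f_core, f_mem]
y = rng.random(200)        # measured kernel times (synthetic)

model = GradientBoostingRegressor(n_estimators=200, max_depth=3)
model.fit(X[:160], y[:160])                       # train on 80% of samples
mae = np.mean(np.abs(model.predict(X[160:]) - y[160:]))  # held-out error
```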