
Showing papers by "Xiaowen Chu published in 2020"


Proceedings ArticleDOI
22 Feb 2020
TL;DR: This paper presents an incentive mechanism, FMore, based on a multi-dimensional procurement auction with K winners, which is not only lightweight and incentive compatible but also encourages more high-quality, low-cost edge nodes to participate in learning, eventually improving the performance of federated learning.
Abstract: Federated learning coupled with Mobile Edge Computing (MEC) is considered one of the most promising solutions for AI-driven service provision. Plenty of studies focus on federated learning from the performance and security aspects, but they neglect the incentive mechanism. In MEC, edge nodes are unwilling to participate in learning voluntarily, and they differ in the provision of multi-dimensional resources, both of which might deteriorate the performance of federated learning. Also, lightweight schemes appeal to edge nodes in MEC. These features require the incentive mechanism to be well designed for MEC. In this paper, we present an incentive mechanism, FMore, with a multi-dimensional procurement auction of K winners. FMore is not only lightweight and incentive compatible but also encourages more high-quality edge nodes with low cost to participate in learning, eventually improving the performance of federated learning. We also present theoretical results on the Nash equilibrium strategies of edge nodes and employ expected utility theory to provide guidance to the aggregator. Both extensive simulations and real-world experiments demonstrate that the proposed scheme can effectively reduce the number of training rounds and drastically improve model accuracy for challenging AI tasks.
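To make the auction concrete, below is a minimal sketch of a K-winner reverse (procurement) auction; the cost-per-quality scoring and the second-price-style clearing rule are illustrative assumptions, not FMore's published mechanism.

```python
def k_winner_procurement(bids, k):
    """Select k edge nodes with the lowest cost per unit of quality.

    bids: list of (node_id, cost, quality) tuples reported by edge nodes.
    Returns the k winners plus a clearing score taken from the (k+1)-th
    ranked bid -- a second-price-style rule that rewards truthful bidding
    (an illustrative choice, not FMore's exact payment rule).
    """
    scored = sorted(bids, key=lambda b: b[1] / b[2])  # cost per unit quality
    winners = scored[:k]
    clearing_score = scored[k][1] / scored[k][2] if len(scored) > k else None
    return winners, clearing_score

bids = [("n1", 4.0, 2.0), ("n2", 3.0, 3.0), ("n3", 5.0, 1.0), ("n4", 2.0, 2.0)]
winners, score = k_winner_procurement(bids, k=2)  # picks n2 and n4
```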

108 citations


Posted Content
TL;DR: This literature survey traces back the origin and principal development timeline of NMT, investigates the important branches, categorizes different research orientations, and discusses some future research trends in this field.
Abstract: In recent years, natural language processing (NLP) has seen great advances driven by deep learning techniques. In the sub-field of machine translation, a new approach named Neural Machine Translation (NMT) has emerged and attracted massive attention from both academia and industry. However, despite the large body of research produced in the past several years, little work has investigated the development process of this new technology trend. This literature survey traces back the origin and principal development timeline of NMT, investigates the important branches, categorizes different research orientations, and discusses some future research trends in this field.

73 citations


Posted Content
TL;DR: A comprehensive survey of communication-efficient distributed training algorithms, covering both system-level and algorithmic-level optimizations, which helps readers understand which algorithms are more efficient under specific distributed environments and extrapolate potential directions for further optimization.
Abstract: Distributed deep learning has become very common for reducing the overall training time by exploiting multiple computing devices (e.g., GPUs/TPUs) as the sizes of deep models and data sets increase. However, data communication between computing devices can be a potential bottleneck that limits system scalability. How to address the communication problem in distributed deep learning has recently become a hot research topic. In this paper, we provide a comprehensive survey of communication-efficient distributed training algorithms, covering both system-level and algorithmic-level optimizations. At the system level, we demystify the system design and implementation choices that reduce the communication cost. At the algorithmic level, we compare different algorithms with theoretical convergence bounds and communication complexity. Specifically, we first propose a taxonomy of data-parallel distributed training algorithms along four main dimensions: communication synchronization, system architectures, compression techniques, and parallelism of communication and computing. We then discuss the studies addressing the problems in these four dimensions and compare their communication costs. We further compare the convergence rates of different algorithms, which tells us how fast the algorithms converge to the solution in terms of iterations. Based on the system-level communication cost analysis and the theoretical convergence speed comparison, we help readers understand which algorithms are more efficient under specific distributed environments and extrapolate potential directions for further optimizations.
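As a concrete instance of the compression dimension in this taxonomy, the sketch below shows top-k gradient sparsification with error feedback (local residual accumulation), a common algorithmic-level technique covered by such surveys; it is a generic illustration, not code from the paper.

```python
import numpy as np

def topk_sparsify(grad, residual, k):
    """Top-k gradient sparsification with error feedback.

    Adds the residual (error accumulated from previous rounds) to the
    fresh gradient, keeps only the k largest-magnitude entries for
    communication, and stores the rest back into the residual.
    """
    acc = grad + residual
    idx = np.argpartition(np.abs(acc), -k)[-k:]   # indices of k largest
    sparse = np.zeros_like(acc)
    sparse[idx] = acc[idx]
    new_residual = acc - sparse                   # feed the error back
    return sparse, new_residual

g = np.random.randn(1000)
r = np.zeros_like(g)
sparse_g, r = topk_sparsify(g, r, k=10)           # ~1% density to transmit
```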

62 citations


Proceedings ArticleDOI
06 Jul 2020
TL;DR: The trade-off between communications and computations (including backward computation and gradient sparsification) is formulated as an optimization problem, and an optimal solution to the problem is derived.
Abstract: Distributed synchronous stochastic gradient descent (SGD) algorithms are widely used in large-scale deep learning applications, but it is known that the communication bottleneck limits the scalability of the distributed system. Gradient sparsification is a promising technique that significantly reduces communication traffic, while pipelining can further overlap communications with computations. However, gradient sparsification introduces extra computation time, and pipelining requires many layer-wise communications, which incur significant communication startup overheads. Merging the gradients of neighboring layers can reduce the startup overheads, but on the other hand it increases the computation time of sparsification and the waiting time for gradient computation. In this paper, we formulate the trade-off between communications and computations (including backward computation and gradient sparsification) as an optimization problem, and derive an optimal solution to the problem. We further develop the optimal merged gradient sparsification algorithm with SGD (OMGS-SGD) for distributed training of deep learning. We conduct extensive experiments to verify the convergence properties and scaling performance of OMGS-SGD. Experimental results show that OMGS-SGD achieves up to 31% end-to-end time efficiency improvement over the state-of-the-art sparsified SGD while preserving convergence performance nearly identical to that of the original SGD without sparsification on a 16-GPU cluster connected with 1Gbps Ethernet.
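The startup-overhead argument can be made precise with the standard alpha-beta communication model; the symbols below (α: per-message startup latency, β: per-byte transfer time, m_l: bytes of layer l's sparsified gradient) are introduced here for illustration and are not the paper's notation.

```latex
% Layer-wise vs. merged communication under the alpha-beta model:
T_{\text{layer-wise}} = \sum_{l=1}^{L} \left( \alpha + \beta m_l \right),
\qquad
T_{\text{merged}} = \alpha + \beta \sum_{l \in \mathcal{G}} m_l .
% Merging a group G of layers saves (|G| - 1) alpha of startup overhead,
% but delays the start of communication until the last layer in G has
% finished its backward pass and enlarges the tensor to be sparsified --
% exactly the trade-off OMGS-SGD optimizes.
```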

54 citations


Proceedings ArticleDOI
18 May 2020
TL;DR: This paper demystifies how Tensor Cores on the NVIDIA Turing architecture work in great detail, including the instructions used, the registers and data layout required, as well as the throughput and latency of Tensor Core operations.
Abstract: Half-precision matrix multiply has played a key role in the training of deep learning models. The newly designed Nvidia Tensor Cores offer native instructions for half-precision small matrix multiply, based on which Half-precision General Matrix Multiply (HGEMM) routines are developed and can be accessed through high-level APIs. In this paper, we, for the first time, demystify how Tensor Cores on the NVIDIA Turing architecture work in great detail, including the instructions used, the registers and data layout required, as well as the throughput and latency of Tensor Core operations. We further benchmark the memory system of Turing GPUs and conduct a quantitative analysis of the performance. Our analysis shows that the bandwidth of DRAM, L2 cache, and shared memory is the new bottleneck for HGEMM, whose performance was previously believed to be bound by computation. Based on our newly discovered features of Tensor Cores, we apply a series of optimization techniques to the Tensor Core-based HGEMM, including blocking size optimization, data layout redesign, data prefetching, and instruction scheduling. Extensive evaluation results show that our optimized HGEMM routine achieves an average of 1.73× and 1.46× speedup over the native implementation of cuBLAS 10.1 on NVIDIA Turing RTX2070 and T4 GPUs, respectively. The code of our implementation is written in native hardware assembly (SASS).

46 citations


Posted ContentDOI
09 Jun 2020 - medRxiv
TL;DR: An automated deep learning methodology is designed to generate a lightweight deep learning model, MNas3DNet41, that achieves an accuracy of 87.14%, F1-score of 87.25%, and AUC of 0.957, which are on par with the best models designed by AI experts.
Abstract: The COVID-19 pandemic has spread all over the world for months. As its transmissibility and high pathogenicity seriously threaten people's lives, accurate and fast detection of COVID-19 infection is crucial. Although many recent studies have shown that deep learning based solutions can help detect COVID-19 from chest CT scans, there is a lack of consistent and systematic comparison and evaluation of these techniques. In this paper, we first build a clean and segmented CT dataset called Clean-CC-CCII by fixing the errors and removing noise in the large CT scan dataset CC-CCII, which has three classes: novel coronavirus pneumonia (NCP), common pneumonia (CP), and normal controls (Normal). After cleaning, our dataset consists of a total of 340,190 slices from 3,993 scans of 2,698 patients. We then benchmark and compare the performance of a series of state-of-the-art (SOTA) 3D and 2D convolutional neural networks (CNNs). The results show that 3D CNNs outperform 2D CNNs in general. With extensive hyperparameter tuning, we find that the 3D CNN model DenseNet3D121 achieves the highest accuracy of 88.63% (F1-score 88.14% and AUC 0.940), and another 3D CNN model, ResNet3D34, achieves the best AUC of 0.959 (accuracy 87.83% and F1-score 86.04%). We further demonstrate that the mixup data augmentation technique can largely improve model performance. Finally, we design an automated deep learning methodology to generate a lightweight deep learning model, MNas3DNet41, that achieves an accuracy of 87.14%, F1-score of 87.25%, and AUC of 0.957, which are on par with the best models designed by AI experts. Automated deep learning design is a promising methodology that can help health-care professionals develop effective deep learning models using their private data sets. Our Clean-CC-CCII dataset and source code are available at: https://github.com/arthursdays/HKBU_HPML_COVID-19.
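The mixup augmentation credited above with large gains has a one-line core; below is a minimal numpy sketch (α is the Beta-distribution hyperparameter, and labels are assumed one-hot — both illustrative choices).

```python
import numpy as np

def mixup_batch(x, y, alpha=0.4):
    """Mixup data augmentation: convex combinations of random sample pairs.

    x: batch of inputs, shape (B, ...); y: one-hot labels, shape (B, C).
    Draws lambda ~ Beta(alpha, alpha) and mixes each sample with a
    randomly permuted partner, mixing the labels with the same weight.
    """
    lam = np.random.beta(alpha, alpha)
    perm = np.random.permutation(len(x))
    x_mixed = lam * x + (1 - lam) * x[perm]
    y_mixed = lam * y + (1 - lam) * y[perm]
    return x_mixed, y_mixed
```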

43 citations


Proceedings ArticleDOI
24 Jul 2020
TL;DR: This paper traces over 215,000 blocks from February 2016 to February 2020 and conducts a broad range of measurements on pool evolution, labeled transactions, and real-time network traffic, discovering new and interesting observations and features.
Abstract: The Bitcoin network has received much attention from both industry and academia due to Bitcoin's recent success. Mining pools, the main components of the Bitcoin network, dominate the computing resources and play essential roles in network security and performance. Although many measurements of the Bitcoin network are available, little is known about the details of mining pool behaviors (e.g., mining revenue and transaction collection strategies) and their effects on Bitcoin end users (e.g., transaction fees, transaction delay, and transaction acceptance rate). This paper aims to fill this gap with a systematic study of mining pools. We traced over 215,000 blocks from February 2016 to February 2020 and collected over 4.12 TB of unconfirmed transactions. We then conducted a broad range of measurements on pool evolution, labeled transactions (blocks), and real-time network traffic, and discovered new and interesting observations and features. Specifically, our measurements showed the following. 1) A few mining pools continuously controlled most of the computing resources of the Bitcoin network. 2) Mining pools were caught in a prisoner's dilemma, competing to increase their computing resources even though the unit profit of the computing resource decreases. 3) Mining pools were stuck in a Malthusian trap, reaching a stage at which the Bitcoin incentives are inadequate for feeding the exponential growth of the computing resources. 4) The market price and transaction fees were not sensitive to past halvings of block rewards. 5) Feerate played a dominating role in the transaction collection strategy of the top mining pools. Our measurements and analysis help the Bitcoin community to understand and improve the Bitcoin network.
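The feerate-dominated collection strategy observed for the top pools is essentially greedy knapsack filling by fee per weight unit; a simplified sketch follows (real pool software, e.g. Bitcoin Core, also scores ancestor packages, which is omitted here).

```python
def collect_transactions(mempool, max_block_weight=4_000_000):
    """Greedy transaction collection by feerate (fee per weight unit).

    mempool: list of (txid, fee_satoshi, weight) tuples.
    Picks the highest-feerate transactions first until the block is full;
    4,000,000 weight units is Bitcoin's consensus block weight limit.
    """
    by_feerate = sorted(mempool, key=lambda t: t[1] / t[2], reverse=True)
    block, used = [], 0
    for txid, fee, weight in by_feerate:
        if used + weight <= max_block_weight:
            block.append(txid)
            used += weight
    return block
```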

40 citations


Proceedings ArticleDOI
24 Mar 2020
TL;DR: FADNet as discussed by the authors exploits efficient 2D based correlation layers with stacked blocks to preserve fast computation, and combines the residual structures to make the deeper model easier to learn, and contains multi-scale predictions so as to exploit a multiscale weight scheduling training technique to improve the accuracy.
Abstract: Deep neural networks (DNNs) have achieved great success in the area of computer vision. The disparity estimation problem tends to be addressed by DNNs, which achieve much better prediction accuracy in stereo matching than traditional hand-crafted feature based methods. On one hand, however, the designed DNNs require significant memory and computation resources to accurately predict the disparity, especially those 3D convolution based networks, which makes them difficult to deploy in real-time applications. On the other hand, existing computation-efficient networks lack expressive capability on large-scale datasets and therefore cannot make accurate predictions in many scenarios. To this end, we propose an efficient and accurate deep network for disparity estimation named FADNet with three main features: 1) it exploits efficient 2D based correlation layers with stacked blocks to preserve fast computation; 2) it combines residual structures to make the deeper model easier to learn; and 3) it contains multi-scale predictions so as to exploit a multi-scale weight scheduling training technique to improve accuracy. We conduct experiments to demonstrate the effectiveness of FADNet on two popular datasets, Scene Flow and KITTI 2015. Experimental results show that FADNet achieves state-of-the-art prediction accuracy and runs an order of magnitude faster than existing 3D models. The code of FADNet is available at https://github.com/HKBU-HPML/FADNet.
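The multi-scale weight scheduling idea can be sketched as a scheduled weighting of per-scale losses; the linear coarse-to-fine ramp below is an illustrative stand-in, not FADNet's published schedule.

```python
import numpy as np

def multiscale_loss(losses, epoch, total_epochs):
    """Combine per-scale disparity losses with scheduled weights.

    losses: list of scalar losses, coarsest scale first. Early in
    training most weight sits on the coarse scales; it is gradually
    shifted toward the finest (full-resolution) prediction.
    """
    n = len(losses)
    progress = epoch / total_epochs                     # 0 -> 1
    raw = np.array([(1 - progress) * (n - i) + progress * (i + 1)
                    for i in range(n)], dtype=float)
    weights = raw / raw.sum()                           # normalize to 1
    return float(np.dot(weights, losses))
```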

38 citations


Proceedings ArticleDOI
19 Feb 2020
TL;DR: This paper presents an optimized implementation of single-precision Winograd convolution on NVIDIA Volta and Turing GPUs, and builds a SASS assembler, TuringAs, for Volta and Turing that enables tuning performance at the native assembly level.
Abstract: In this paper, we present an optimized implementation of single-precision Winograd convolution on NVIDIA Volta and Turing GPUs. Compared with the state-of-the-art Winograd convolution in cuDNN 7.6.1, our implementation achieves up to 2.13× speedup on Volta V100 and up to 2.65× speedup on Turing RTX2070. On both Volta and Turing GPUs, our implementation achieves up to 93% of the device peak. Apart from analyzing and benchmarking different high-level optimization options, we also build a SASS assembler, TuringAs, for Volta and Turing that enables tuning performance at the native assembly level. The new optimization opportunities uncovered by TuringAs not only improve Winograd convolution but can also benefit CUDA compilers and native assembly programming. We have released TuringAs as open-source software. To the best of our knowledge, this is the first publicly available assembler for Volta and Turing GPUs.
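The kernel being optimized implements Winograd minimal filtering; the 1-D F(2,3) base case below (standard transform matrices, written in numpy for clarity rather than SASS) computes two outputs of a 3-tap filter with four multiplications instead of six.

```python
import numpy as np

# Standard Winograd F(2,3) transform matrices.
B_T = np.array([[1, 0, -1, 0],
                [0, 1,  1, 0],
                [0, -1, 1, 0],
                [0, 1,  0, -1]], dtype=float)
G = np.array([[1,    0,   0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0,    0,   1]], dtype=float)
A_T = np.array([[1, 1,  1,  0],
                [0, 1, -1, -1]], dtype=float)

def winograd_f23(d, g):
    """F(2,3): two outputs of a 3-tap correlation over a 4-sample tile,
    using 4 elementwise multiplies (the Hadamard product) instead of 6."""
    return A_T @ ((G @ g) * (B_T @ d))

d = np.array([1.0, 2.0, 3.0, 4.0])   # input tile
g = np.array([0.5, 1.0, -1.0])       # filter
# Direct correlation for comparison: [d0*g0+d1*g1+d2*g2, d1*g0+d2*g1+d3*g2]
direct = np.array([d[0]*g[0] + d[1]*g[1] + d[2]*g[2],
                   d[1]*g[0] + d[2]*g[1] + d[3]*g[2]])
assert np.allclose(winograd_f23(d, g), direct)
```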

38 citations


Proceedings ArticleDOI
11 May 2020
TL;DR: A comprehensive empirical study on the performance and energy efficiency of several popular off-the-shelf processors in training DNNs by benchmarking a representative set of deep learning workloads, which provides an informative guide for end users to select proper AI accelerators.
Abstract: Deep learning has become widely used in complex AI applications. Yet, training a deep neural network (DNN) model requires a considerable amount of computation, a long running time, and much energy. Nowadays, many-core AI accelerators (e.g., GPUs and TPUs) are designed to improve the performance of AI training. However, processors from different vendors perform dissimilarly in terms of performance and energy consumption. To investigate the differences among several popular off-the-shelf processors (i.e., Intel CPU, NVIDIA GPU, AMD GPU, and Google TPU) in training DNNs, we carry out a comprehensive empirical study on the performance and energy efficiency of these processors by benchmarking a representative set of deep learning workloads, including computation-intensive operations, classical convolutional neural networks (CNNs), recurrent neural networks (LSTM), Deep Speech 2, and Transformer. Unlike existing end-to-end benchmarks, which only report the training time, we investigate the impact of the hardware, the vendor's software library, and the deep learning framework on the performance and energy consumption of AI training. Our evaluation methods and results not only provide an informative guide for end users to select proper AI accelerators, but also expose some opportunities for hardware vendors to improve their software libraries.

28 citations


Proceedings ArticleDOI
01 Nov 2020
TL;DR: A thorough performance evaluation of the first long term support release of Hyperledger Fabric, finding that the validate phase was likely the system bottleneck due to the low validation speed of chaincode, and that the execute phase exhibited good scalability under the OR endorsement policy but not under the AND endorsement policy.
Abstract: Hyperledger Fabric is a popular open-source project for deploying permissioned blockchains. Many performance characteristics of the latest Hyperledger Fabric (e.g., the performance characteristics of each phase, the impacts of ordering services, bottlenecks, and scalability) are still not well understood due to the performance complexity of distributed systems. We conducted a thorough performance evaluation of the first long term support release of Hyperledger Fabric. We studied the performance characteristics of each phase (execute, order, and validate) according to Hyperledger Fabric's new execute-order-validate architecture. We also studied the ordering services, including Solo, Kafka, and Raft. Our experimental results revealed the following findings. 1) The execute phase exhibited good scalability under the OR endorsement policy but not under the AND endorsement policy. 2) We were not able to find a significant performance difference among the three ordering services. 3) The validate phase was likely the system bottleneck due to the low validation speed of chaincode. Overall, our work helps to understand and improve Hyperledger Fabric.

Posted Content
TL;DR: This work introduces a decentralized training algorithm in which each worker exchanges a highly compressed model with a single, dynamically selected peer per communication round, significantly reducing communication traffic while preserving convergence.
Abstract: Distributed learning techniques such as federated learning have enabled multiple workers to train machine learning models together to reduce the overall training time. However, current distributed training algorithms (centralized or decentralized) suffer from the communication bottleneck on multiple low-bandwidth workers (and on the server under the centralized architecture). Although decentralized algorithms generally have lower communication complexity than their centralized counterparts, they still suffer from the communication bottleneck for workers with low network bandwidth. To deal with the communication problem while preserving convergence performance, we introduce a novel decentralized training algorithm with the following key features: 1) it does not require a parameter server to maintain the model during training, which avoids communication pressure on any single peer; 2) each worker only needs to communicate with a single peer at each communication round, using a highly compressed model, which can significantly reduce the communication traffic on the worker (we theoretically prove that our sparsification algorithm still preserves convergence properties); and 3) each worker dynamically selects its peer at different communication rounds to better utilize the bandwidth resources. We conduct experiments with convolutional neural networks on 32 workers to verify the effectiveness of our proposed algorithm compared to seven existing methods. Experimental results show that our algorithm significantly reduces communication traffic and generally selects relatively high-bandwidth peers.
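A schematic round of the described scheme, as a hedged numpy sketch: each worker exchanges a top-k-compressed model with one peer and averages. Peer choice here is uniform random, whereas the paper selects peers dynamically by bandwidth, and a real implementation would also track the uncompressed residual.

```python
import numpy as np

def topk_compress(vec, ratio=0.01):
    """Keep only the largest-magnitude entries (top-k compression)."""
    k = max(1, int(len(vec) * ratio))
    idx = np.argpartition(np.abs(vec), -k)[-k:]
    out = np.zeros_like(vec)
    out[idx] = vec[idx]
    return out

def gossip_round(models, rng):
    """One communication round: every worker receives a compressed model
    from a single randomly chosen peer and averages it with its own."""
    n = len(models)
    new_models = []
    for i in range(n):
        j = rng.choice([p for p in range(n) if p != i])
        recv = topk_compress(models[j])       # what peer j would transmit
        new_models.append(0.5 * (models[i] + recv))
    return new_models

rng = np.random.default_rng(0)
models = [rng.standard_normal(1000) for _ in range(4)]
models = gossip_round(models, rng)
```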

Posted Content
TL;DR: Experimental results show that FADNet achieves state-of-the-art prediction accuracy and runs an order of magnitude faster than existing 3D models.
Abstract: Deep neural networks (DNNs) have achieved great success in the area of computer vision. The disparity estimation problem tends to be addressed by DNNs, which achieve much better prediction accuracy in stereo matching than traditional hand-crafted feature based methods. On one hand, however, the designed DNNs require significant memory and computation resources to accurately predict the disparity, especially those 3D convolution based networks, which makes them difficult to deploy in real-time applications. On the other hand, existing computation-efficient networks lack expressive capability on large-scale datasets and therefore cannot make accurate predictions in many scenarios. To this end, we propose an efficient and accurate deep network for disparity estimation named FADNet with three main features: 1) it exploits efficient 2D based correlation layers with stacked blocks to preserve fast computation; 2) it combines residual structures to make the deeper model easier to learn; and 3) it contains multi-scale predictions so as to exploit a multi-scale weight scheduling training technique to improve accuracy. We conduct experiments to demonstrate the effectiveness of FADNet on two popular datasets, Scene Flow and KITTI 2015. Experimental results show that FADNet achieves state-of-the-art prediction accuracy and runs an order of magnitude faster than existing 3D models. The code of FADNet is available at this https URL.

Posted Content
TL;DR: This paper proposes a new computation- and communication-efficient top-k sparsification communication library for distributed training, improves system scalability by optimizing I/O with a simple yet efficient multi-level data caching mechanism, and optimizes the update operation by introducing a novel parallel tensor operator.
Abstract: Distributed training techniques have been widely deployed in large-scale deep neural network (DNN) training on dense-GPU clusters. However, on public cloud clusters, due to the moderate inter-connection bandwidth between instances, traditional state-of-the-art distributed training systems cannot scale well when training large-scale models. In this paper, we propose a new computation- and communication-efficient top-k sparsification communication library for distributed training. To further improve system scalability, we optimize I/O by proposing a simple yet efficient multi-level data caching mechanism, and we optimize the update operation by introducing a novel parallel tensor operator. Experimental results on a 16-node Tencent Cloud cluster (each node with 8 Nvidia Tesla V100 GPUs) show that our system is 25%-40% faster than existing state-of-the-art systems on CNNs and the Transformer. We finally break the record on DAWNBench by training ResNet-50 to 93% top-5 accuracy on ImageNet.

Journal ArticleDOI
TL;DR: A fine-grained analytical model is presented to estimate the execution time of GPU kernels with both core and memory frequency scaling; it captures the kernel performance scaling behaviors under different frequency settings and achieves decent accuracy.
Abstract: Contemporary graphics processing units (GPUs) support dynamic voltage and frequency scaling to balance computational performance and energy consumption. However, accurate and straightforward performance estimation for a given GPU kernel under different frequency settings is still lacking for real hardware, which is essential to determine the best frequency configuration for energy saving. In this article, we reveal a fine-grained analytical model to estimate the execution time of GPU kernels with both core and memory frequency scaling. Compared to cycle-level simulators, which are too slow to apply on real hardware, our model only needs simple and one-off micro-benchmarks to extract a set of hardware parameters and kernel performance counters, without any source code analysis. Our experimental results show that the proposed performance model can capture the kernel performance scaling behaviors under different frequency settings and achieve decent accuracy (average errors of 3.85, 8.6, 8.82, and 8.83 percent on a set of 20 GPU kernels with four modern Nvidia GPUs).
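A common first-order view of such core/memory frequency models is sketched below; this is an illustrative bound, much coarser than the paper's fine-grained pipeline decomposition, and the cycle counts C_comp and C_mem are symbols introduced here.

```latex
% First-order DVFS performance model: kernel time is limited by compute
% cycles at the core clock or memory cycles at the memory clock,
% whichever pipeline dominates.
T(f_{\mathrm{core}}, f_{\mathrm{mem}}) \;\approx\;
\max\!\left( \frac{C_{\mathrm{comp}}}{f_{\mathrm{core}}},\;
             \frac{C_{\mathrm{mem}}}{f_{\mathrm{mem}}} \right)
```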

Journal ArticleDOI
TL;DR: ESetStore is presented, a prototype erasure-coded storage system that aims to achieve fast recovery from failures and can be an enhancement for existing solutions, such as Partial-parallel-repair (PPR), to further improve recovery performance.
Abstract: Erasure codes have been used extensively in large-scale storage systems to reduce the storage overhead of triplication-based storage systems. One key performance issue introduced by erasure codes is the long time needed to recover from a single failure, which occurs constantly in large-scale storage systems. We present ESetStore, a prototype erasure-coded storage system that aims to achieve fast recovery from failures. ESetStore is novel in the following aspects. We proposed a data placement algorithm named ESet for ESetStore that can aggregate adequate I/O resources from the available storage servers to recover from each single failure. We designed and implemented efficient read and write operations for our erasure-coded storage system via effective use of the available I/O and computation resources. We evaluated the performance of ESetStore with extensive experiments on a cluster with 50 storage servers. The evaluation results demonstrate that recovery performance grows linearly as available I/O resources are harvested. Under some mild conditions on our defined parameter, recovery I/O parallelism, we can achieve optimal recovery performance, in which ESet enables minimal recovery time. Rather than being an alternative for improving recovery performance, our work can serve as an enhancement to existing solutions, such as Partial-parallel-repair (PPR), to further improve recovery performance.

Journal ArticleDOI
TL;DR: A comprehensive study of erasure coding performance at the network edge, examining whether it can match the network performance of 5G and Wi-Fi 6, together with an OpenMP-based acceleration of erasure coding on a multi-core CPU.
Abstract: The emerging computing paradigm of edge computing is expected to store and process data at the network edge with reduced latency and improved network bandwidth. To the best of our knowledge, key performance issues such as the coding performance of erasure-coded storage systems have not been investigated for edge computing. In this paper, we present an erasure-coded storage system for edge computing. Unlike data center and cloud storage systems, it employs edge devices to perform the encoding and decoding operations, which can become a performance bottleneck for the whole storage system due to limited computing power. Hence, we present a comprehensive study of the performance of erasure coding to see whether it can match the network performance of 5G and Wi-Fi 6 at the network edge. We use the popular edge device Jetson Nano and two state-of-the-art coding libraries, Jerasure and G-CRS. Our evaluation results reveal unsatisfactory performance for Jerasure and high variance for G-CRS. To obtain better and more stable performance, we accelerate erasure coding with OpenMP on a multi-core CPU. Our work demonstrates that our acceleration brings stable performance and matches the network bandwidth of 5G and Wi-Fi 6 for some commonly used cases. Besides, our work offers a better understanding of erasure-coded storage systems for edge computing and can serve as a reference for further optimization of such systems at the network edge.
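The encode/decode data-flow being benchmarked can be seen in the simplest erasure code, a single XOR parity (RAID-5-like: k data blocks plus one parity block, tolerating one loss). Jerasure and G-CRS use Reed-Solomon-style codes over GF(2^8), but the recovery pattern is analogous; this is a toy illustration only.

```python
import numpy as np

def encode_parity(data_blocks):
    """Single-parity erasure code: parity = XOR of all k data blocks."""
    parity = np.zeros_like(data_blocks[0])
    for block in data_blocks:
        parity ^= block
    return parity

def recover(stripe, lost_index):
    """Recover one lost block by XOR-ing all surviving blocks
    (data blocks plus the parity block)."""
    survivors = [b for i, b in enumerate(stripe) if i != lost_index]
    out = np.zeros_like(survivors[0])
    for b in survivors:
        out ^= b
    return out

k = 4
data = [np.random.randint(0, 256, 1024, dtype=np.uint8) for _ in range(k)]
stripe = data + [encode_parity(data)]
assert np.array_equal(recover(stripe, 2), data[2])
```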

Posted Content
TL;DR: It is shown that DL models with low model intensity are difficult to scale out even with the best available lossless algorithm over 100Gbps InfiniBand, and that the system architecture and scheduling algorithms have a critical impact on the scaling property.
Abstract: Nowadays, large and complex deep learning (DL) models are increasingly trained in a distributed manner across multiple worker machines, in which extensive communications between workers pose serious scaling problems. In this article, we present a quantitative survey of communication optimization techniques for data parallel distributed DL. We first identify the major communication challenges and classify the existing solutions into three levels, namely the learning algorithm, the system architecture, and the network infrastructure. We present the state-of-the-art communication optimization techniques and conduct a comparative study of seven common lossless distributed DL methods on a 32-GPU cluster with 100Gbps InfiniBand (IB). We show that (1) the DL models with low model intensity (such as BERT and BERT-Large) are difficult to scale out even with the best available lossless algorithm over 100Gbps IB; (2) the system architecture and scheduling algorithms have a critical impact on the scaling property. We conclude the article with discussions on the open issues for further investigations.

Proceedings ArticleDOI
06 Jul 2020
TL;DR: This paper designs an 802.11ax-based dense WiFi network to provide WiFi services to a large number of users within a given area with the following objectives: to minimize the number of access points (APs); to fulfil the users’ throughput requirement; and to be resistant to AP failures.
Abstract: IEEE 802.11ax is a promising standard for the next-generation WiFi network, which uses orthogonal frequency division multiple access (OFDMA) to segregate the wireless spectrum into time-frequency resource units (RUs). In this paper, we aim at designing an 802.11ax-based dense WiFi network to provide WiFi services to a large number of users within a given area with the following objectives: (1) to minimize the number of access points (APs); (2) to fulfil the users’ throughput requirement; and (3) to be resistant to AP failures. We formulate the above into a joint AP placement and power-channel-RU assignment optimization problem, which is NP-hard. To tackle this problem, we first derive an analytical model to estimate each user’s throughput under the mechanism of OFDMA and a widely used interference model. We then design a heuristic algorithm to find high-quality solutions with polynomial time complexity. Simulation results show that our algorithm can achieve the optimal performance for a small area of 50×50 m². For a larger area of 100×80 m², where we cannot find the optimal solution through an exhaustive search, our algorithm can reduce the number of APs by 32-55% as compared to the random and greedy solutions.
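The AP-minimization objective is set-cover-like; the toy greedy sketch below uses a plain coverage-radius abstraction. It is not the paper's heuristic, which additionally assigns power, channels, and RUs, and accounts for failure resistance.

```python
import numpy as np

def greedy_ap_placement(users, candidates, radius):
    """Greedy AP placement sketch: repeatedly add the candidate AP that
    covers the most still-uncovered users until everyone is covered.

    users, candidates: (N, 2) / (M, 2) arrays of planar coordinates;
    radius: assumed coverage radius in meters.
    """
    uncovered = np.ones(len(users), dtype=bool)
    chosen = []
    while uncovered.any():
        best, best_gain = None, 0
        for idx, ap in enumerate(candidates):
            if idx in chosen:
                continue
            covered = np.linalg.norm(users - ap, axis=1) <= radius
            gain = int(np.sum(uncovered & covered))
            if gain > best_gain:
                best, best_gain = idx, gain
        if best is None:              # remaining users are unreachable
            break
        chosen.append(best)
        uncovered &= np.linalg.norm(users - candidates[best], axis=1) > radius
    return chosen
```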

Proceedings ArticleDOI
22 Feb 2020
TL;DR: In this paper, the authors review recent work on reducing the communication time of the two types of distributed deep learning architectures, centralized and decentralized, in which the overhead of exchanging gradients or models among workers is a potential bottleneck that limits system scalability.
Abstract: The increasing size of machine learning models, especially deep neural network models, can improve the model generalization capability. However, large models require more training data and more computing resources (such as GPU clusters) to train. In distributed training, the communication overhead of exchanging gradients or models among workers becomes a potential system bottleneck that limits the system scalability. Recently, many research works aim to reduce communication time of two types of distributed deep learning architectures, centralized and decentralized.

Proceedings ArticleDOI
24 Aug 2020
TL;DR: In this article, the authors proposed a new distributed optimization method named LAGS-SGD, which combines synchronous stochastic gradient descent with a novel layer-wise adaptive gradient sparsification (LAGS) scheme.
Abstract: To reduce the long training time of large deep neural network (DNN) models, distributed synchronous stochastic gradient descent (S-SGD) is commonly used on a cluster of workers. However, the speedup brought by multiple workers is limited by the communication overhead. Two approaches, namely pipelining and gradient sparsification, have been separately proposed to alleviate the impact of communication overheads. Yet, the gradient sparsification methods can only initiate the communication after the backpropagation, and hence miss the pipelining opportunity. In this paper, we propose a new distributed optimization method named LAGS-SGD, which combines S-SGD with a novel layer-wise adaptive gradient sparsification (LAGS) scheme. In LAGS-SGD, every worker independently selects a small set of "significant" gradients from each layer, whose size can be adapted to the communication-to-computation ratio of that layer. The layer-wise nature of LAGS-SGD opens the opportunity of overlapping communications with computations, while its adaptive nature makes it flexible in controlling the communication time. We prove that LAGS-SGD has convergence guarantees and the same order of convergence rate as vanilla S-SGD under a weak analytical assumption. Extensive experiments are conducted to verify the analytical assumption and the convergence performance of LAGS-SGD. Experimental results on a 16-GPU cluster show that LAGS-SGD outperforms the original S-SGD and existing sparsified S-SGD without obvious loss of model accuracy.
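The layer-wise adaptivity can be sketched as choosing each layer's top-k density from its communication-to-computation ratio; the proportional rule and parameter names below are illustrative stand-ins for LAGS-SGD's actual adaptation scheme.

```python
def adaptive_densities(layer_sizes, comp_times, bandwidth, base_density=0.001):
    """Pick a per-layer top-k density adapted to the communication-to-
    computation ratio: layers whose transfer can hide behind plenty of
    backward-computation time may keep more gradients; others are
    compressed harder.

    layer_sizes: number of fp32 gradients per layer;
    comp_times: backward time (s) available to overlap per layer;
    bandwidth: link bandwidth in bytes/s.
    """
    densities = []
    for size, comp in zip(layer_sizes, comp_times):
        full_comm = 4 * size / bandwidth     # time to send the dense layer
        ratio = comp / full_comm             # how much of it we can hide
        densities.append(min(1.0, base_density * max(1.0, ratio)))
    return densities
```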

Posted Content
TL;DR: This paper establishes a new DDL job scheduling framework that organizes DDL jobs as Directed Acyclic Graphs (DAGs) and considers communication contention between nodes, and proposes an efficient algorithm, LWF-κ, to balance GPU utilization and consolidate the allocated GPUs for each job.
Abstract: Distributed Deep Learning (DDL) has rapidly grown in popularity since it helps boost training performance on high-performance GPU clusters. Efficient job scheduling is indispensable to maximize the overall performance of the cluster when training multiple jobs simultaneously. However, existing schedulers do not consider the communication contention of multiple communication tasks from different distributed training jobs, which could deteriorate system performance and prolong job completion time. In this paper, we first establish a new DDL job scheduling framework that organizes DDL jobs as Directed Acyclic Graphs (DAGs) and considers communication contention between nodes. We then propose an efficient algorithm, LWF-κ, to balance GPU utilization and consolidate the allocated GPUs for each job. When scheduling communication tasks, we observe that neither avoiding all contention nor blindly accepting it is optimal for minimizing job completion time. We thus propose a provable algorithm, AdaDUAL, to efficiently schedule those communication tasks. Based on AdaDUAL, we finally propose Ada-SRSF for the DDL job scheduling problem. Simulations on a 64-GPU cluster connected with 10 Gbps Ethernet show that LWF-κ achieves up to 1.59× improvement over the classical first-fit algorithms. More importantly, Ada-SRSF reduces the average job completion time by 20.1% and 36.7% compared to the SRSF(1) scheme (avoiding all contention) and the SRSF(2) scheme (blindly accepting all two-way communication contention), respectively.

Journal ArticleDOI
TL;DR: An unbiased semi-supervised cluster tree is proposed that is learnt using only very few labeled data, with a K-means algorithm building each level of the hierarchical tree in a top-down manner.
Abstract: Conventionally, it is a prerequisite to acquire a large amount of annotated data to train an accurate classifier. However, the acquisition of such a dataset is usually infeasible due to the high annotation cost. Therefore, semi-supervised learning has emerged and attracted increasing research effort in recent years. Essentially, semi-supervised learning is sensitive to how the unlabeled data is sampled, and model performance might seriously deteriorate if biased unlabeled data is sampled at an early stage. In this paper, an unbiased semi-supervised cluster tree is proposed that is learnt using only very few labeled data. Specifically, a K-means algorithm is adopted to build each level of this hierarchical tree in a top-down manner. The number of clusters is determined by the number of classes contained in the labeled data. The confidence error of the cluster tree is theoretically analyzed and then used to prune the tree. Empirical studies on several datasets demonstrate that the proposed semi-supervised cluster tree is superior to state-of-the-art semi-supervised learning algorithms with respect to classification accuracy.
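A hedged sklearn sketch of one level of the described construction follows; recursion into impure clusters and the confidence-error pruning are omitted, and the helper name is hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_level(X, labeled_idx, y_labeled):
    """One top-down level of a semi-supervised cluster tree (sketch).

    Clusters all points (labeled and unlabeled) into K = #classes groups,
    then tags each cluster with the majority class among the few labeled
    points that fall inside it.
    """
    k = len(np.unique(y_labeled))                 # #clusters = #classes seen
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    tags = {}
    for c in range(k):
        members = set(np.where(km.labels_ == c)[0].tolist())
        inside = [y for i, y in zip(labeled_idx, y_labeled) if i in members]
        tags[c] = max(set(inside), key=inside.count) if inside else None
    return km, tags
```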

Posted Content
TL;DR: Extensive experiments on the Scene Flow and KITTI datasets show that EDNet outperforms the previous 3D CNN based works and achieves state-of-the-art performance with significantly faster speed and less memory consumption.
Abstract: Existing state-of-the-art disparity estimation works mostly leverage the 4D concatenation volume and construct a very deep 3D convolution neural network (CNN) for disparity regression, which is inefficient due to the high memory consumption and slow inference speed. In this paper, we propose a network named EDNet for efficient disparity estimation. Firstly, we construct a combined volume which incorporates contextual information from the squeezed concatenation volume and feature similarity measurement from the correlation volume. The combined volume can be next aggregated by 2D convolutions which are faster and require less memory than 3D convolutions. Secondly, we propose an attention-based spatial residual module to generate attention-aware residual features. The attention mechanism is applied to provide intuitive spatial evidence about inaccurate regions with the help of error maps at multiple scales and thus improve the residual learning efficiency. Extensive experiments on the Scene Flow and KITTI datasets show that EDNet outperforms the previous 3D CNN based works and achieves state-of-the-art performance with significantly faster speed and less memory consumption.

Proceedings ArticleDOI
01 Nov 2020
TL;DR: ICIStrategy, a multi-node collaborative storage strategy based on intra-cluster integrity, aims to relieve storage pressure by reducing the amount of data each participant needs to store, and to reduce communication overhead by collaboratively storing and verifying blocks through in-cluster nodes.
Abstract: A blockchain is essentially a distributed ledger shared by all nodes in the system. All nodes in a blockchain are equal, and each node holds all transactions and blocks in the network. As the network continues to expand, the amount of data grows linearly, and participants face the problem of storage limitation: the blockchain is hard to scale. This paper introduces ICIStrategy, a multi-node collaborative storage strategy based on intra-cluster integrity. In ICIStrategy, we divide all participants into several clusters. Each cluster is required to hold all data of the network, whereas a node within a cluster does not need to maintain data integrity. The strategy aims to relieve storage pressure by reducing the amount of data each participant needs to store, and to reduce communication overhead by collaboratively storing and verifying blocks through in-cluster nodes. Moreover, ICIStrategy can greatly reduce the overhead of bootstrapping. We describe the mode of operation of our strategy, analyze the performance of ICIStrategy, and conduct simulation experiments. The results of several comparative experiments show that our strategy needs just 25% of the storage space required by RapidChain, which indeed solves the problem of storage limitation and improves blockchain performance.

Posted Content
TL;DR: This paper uses the roofline performance model of GPUs to design an efficient SpDM algorithm called GCOOSpDM, which exploits coalesced global memory access, fast shared memory reuse, and more operations per byte of global memory traffic.
Abstract: Multiplication of a sparse matrix by a dense matrix (SpDM) is widely used in many areas like scientific computing and machine learning. However, existing works overlook the performance optimization of SpDM on modern many-core architectures like GPUs. Sparse storage data structures keep sparse matrices in a memory-saving format, but they bring difficulties in optimizing the performance of SpDM on modern GPUs due to the irregular data access of the sparse structure, which results in lower resource utilization and poorer performance. In this paper, we refer to the roofline performance model of GPUs to design an efficient SpDM algorithm called GCOOSpDM, in which we exploit coalesced global memory access, fast shared memory reuse, and more operations per byte of global memory traffic. Experiments are evaluated on three Nvidia GPUs (i.e., GTX 980, GTX Titan X Pascal, and Tesla P100) with CUDA-8.0, using a large number of matrices including a public dataset and randomly generated matrices. Experimental results show that GCOOSpDM achieves 1.5-8× speedup over Nvidia's library cuSPARSE on many matrices. We also analyze instruction-level operations on a particular GPU to understand the performance gap between GCOOSpDM and cuSPARSE. The profiled instructions confirm that cuSPARSE spends a lot of time on slow memory access (including DRAM access and L2 cache access), while GCOOSpDM moves such slow memory access to faster shared memory, which mainly contributes to the performance gain. Results also show that GCOOSpDM outperforms the dense algorithm (cuBLAS) at lower sparsity than cuSPARSE does on GPUs.
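The computation GCOOSpDM accelerates is just C = A·B with A sparse; a reference Python loop over a COO representation makes the irregular access pattern visible. The GPU kernel's actual grouped-COO layout and shared-memory staging are not modeled here.

```python
import numpy as np
from scipy.sparse import random as sparse_random

def coo_spdm(rows, cols, vals, B, m):
    """Reference SpDM: C = A @ B with sparse A given in COO form.

    Each nonzero a_ij scatters v * B[j, :] into row i of C -- the
    irregular, row-hopping access that is hard to coalesce on GPUs,
    and that GCOOSpDM's grouped layout targets.
    """
    C = np.zeros((m, B.shape[1]))
    for i, j, v in zip(rows, cols, vals):
        C[i, :] += v * B[j, :]
    return C

A = sparse_random(64, 64, density=0.05, format="coo", random_state=0)
B = np.random.randn(64, 32)
assert np.allclose(coo_spdm(A.row, A.col, A.data, B, 64), A.toarray() @ B)
```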

Posted Content
27 May 2020
TL;DR: A systematic survey of communication-efficient distributed deep learning that summarizes the state-of-the-art techniques and provides a taxonomy with three levels: optimization algorithm, system architecture, and communication infrastructure.
Abstract: In recent years, distributed deep learning techniques have been widely deployed to accelerate the training of deep learning models by exploiting multiple computing nodes. However, the extensive communications among workers dramatically limit system scalability. In this article, we provide a systematic survey of communication-efficient distributed deep learning. Specifically, we first identify the communication challenges in distributed deep learning. We then summarize the state-of-the-art techniques in this direction and provide a taxonomy with three levels: optimization algorithm, system architecture, and communication infrastructure. Afterwards, we present a comparative study of seven different distributed deep learning techniques on a 32-GPU cluster with both 10Gbps Ethernet and 100Gbps InfiniBand. We finally discuss some challenges and open issues for possible future investigations.

Proceedings ArticleDOI
01 Nov 2020
TL;DR: In this article, the authors conduct a comprehensive study on the inference performance and energy efficiency of a Transformer model trained for the language translation service, and propose the Aligned scheduling scheme that improves throughput and energy efficiency by up to 2.86× and 2.73×, respectively, at the cost of 40% average latency loss.
Abstract: Inference-as-a-service (IAAS) has recently been launched by cloud service providers to support on-demand AI applications. Many natural language processing (NLP) services are based on the Transformer sequence transduction model. However, the inference process of the Transformer model consumes a significant amount of energy due to the large model size (e.g., billions of parameters) and tremendous computation. How to reduce the energy consumption of IAAS without violating the service-level agreement (SLA) has become a practical challenge for service providers. In this work, we conduct a comprehensive study on the inference performance and energy efficiency of a Transformer model trained for the language translation service. First, we empirically characterize some essential performance metrics, including latency, throughput, and energy consumption, on three different GPUs with diversified workload configurations. The detailed workload separation facilitates a thorough and deep understanding of the inference process of the Transformer model. Second, we provide an energy consumption model for the Transformer based on the observed data. Finally, we propose the Aligned scheduling scheme that improves throughput and energy efficiency by up to 2.86× and 2.73×, respectively, at the cost of 40% average latency loss. Our findings provide a full scope of Transformer inference, and suggest that workload balancing and scheduling have great potential for energy-efficient Transformer inference services.

Proceedings ArticleDOI
01 Dec 2020
TL;DR: GCOOSpDM as mentioned in this paper exploits coalesced global memory access, fast shared memory reuse, and more operations per byte of global memory traffic.
Abstract: Multiplication of a sparse matrix by a dense matrix (SpDM) is widely used in many areas like scientific computing and machine learning. However, existing work overlooks the performance optimization of SpDM on modern many-core architectures like GPUs. Sparse storage data structures keep sparse matrices in a memory-saving format, but they bring difficulties in optimizing the performance of SpDM on modern GPUs due to the irregular data access of the sparse structure, which results in lower resource utilization and poorer performance. In this paper, we refer to the roofline performance model of GPUs to design an efficient SpDM algorithm called GCOOSpDM, in which we exploit coalesced global memory access, fast shared memory reuse, and more operations per byte of global memory traffic. Experiments are evaluated on three Nvidia GPUs (i.e., GTX 980, GTX Titan X Pascal, and Tesla P100) using a large number of matrices including a public dataset and randomly generated matrices. Experimental results show that GCOOSpDM achieves 1.5-8x speedup over Nvidia's library cuSPARSE on many matrices.

Proceedings ArticleDOI
23 Feb 2020
TL;DR: A novel GPGPU performance estimation model with both core and memory frequency scaling is proposed, together with a cross-benchmarking suite that simulates kernels with a wide range of instruction distributions, which is used to study the correlation between kernel performance counters and kernel performance.
Abstract: Dynamic Voltage and Frequency Scaling (DVFS) on General-Purpose Graphics Processing Units (GPGPUs) is now becoming one of the most significant techniques for balancing computational performance and energy consumption. However, there are still few fast and accurate models for predicting GPU kernel execution time under different core and memory frequency settings, which is important for determining the best frequency configuration for energy saving. Accordingly, a novel GPGPU performance estimation model with both core and memory frequency scaling is herein proposed. We design a cross-benchmarking suite, which simulates kernels with a wide range of instruction distributions. The synthetic kernels generated by this suite can be used for model pre-training or as supplementary training samples. We then apply two different machine learning algorithms, Support Vector Regression (SVR) and Gradient Boosting Decision Tree (GBDT), to study the correlation between kernel performance counters and kernel performance. The models trained only with our cross-benchmarking suite achieve satisfactory accuracy (16%-22% mean absolute error) on 24 unseen real application kernels. Validated on three modern GPUs with a wide frequency scaling range, using a collection of 24 real application kernels, the proposed model is able to achieve accurate results (5.1%, 2.8%, and 6.5% mean absolute error) for the target GPUs (GTX 980, Titan X Pascal, and Tesla P100).
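The modeling step reduces to supervised regression from performance counters (plus the frequency settings) to kernel time; below is a hedged sklearn sketch with synthetic placeholder data and illustrative feature names, not the paper's trained model.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Hypothetical training set: rows are (kernel, frequency setting) pairs,
# features are performance counters plus core/memory frequencies, and the
# target is measured execution time. All values here are placeholders.
rng = np.random.default_rng(0)
X = rng.random((200, 5))   # e.g. [inst_fp32, inst_mem, occupancy, f_core, f_mem]
y = rng.random(200)        # measured kernel times (synthetic)

model = GradientBoostingRegressor(n_estimators=200, max_depth=3)
model.fit(X[:160], y[:160])                       # train on 80% of samples
mae = np.mean(np.abs(model.predict(X[160:]) - y[160:]))  # held-out error
```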