Author

Xiuxian Guan

Bio: Xiuxian Guan is an academic researcher from the University of Hong Kong. The author has contributed to research in topics: Computer science & Pipeline (computing). The author has an h-index of 1 and has co-authored 1 publication receiving 2 citations.

Papers
Journal ArticleDOI
TL;DR: vPipe, as presented in this paper, provides dynamic layer partitioning and memory management for pipeline parallelism via an online search for a near-optimal partitioning/memory-management plan and a live layer migration protocol that rebalances the layer distribution across a training pipeline.
Abstract: The increasing computational complexity of DNNs has achieved unprecedented successes in various areas such as machine vision and natural language processing (NLP); e.g., the recent advanced Transformer has billions of parameters. However, as large-scale DNNs significantly exceed a GPU's physical memory limit, they cannot be trained by conventional methods such as data parallelism. Pipeline parallelism, which partitions a large DNN into small subnets and trains them on different GPUs, is a plausible solution. Unfortunately, the layer partitioning and memory management in existing pipeline parallel systems are fixed during training, making them easily impeded by out-of-memory errors and GPU under-utilization. These drawbacks are amplified when performing neural architecture search (NAS), such as for the Evolved Transformer, where different Transformer architectures need to be trained repeatedly. vPipe is the first system that transparently provides dynamic layer partitioning and memory management for pipeline parallelism. vPipe has two unique contributions: (1) an online algorithm for searching a near-optimal layer partitioning and memory management plan, and (2) a live layer migration protocol for rebalancing the layer distribution across a training pipeline. vPipe improved the training throughput of two notable baselines (PipeDream and GPipe) by 61.4-463.4 percent and 24.8-291.3 percent, respectively, on various large DNNs and training settings.
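
The following sketch illustrates only the repartitioning idea behind a system like vPipe, not its actual algorithm: given freshly profiled per-layer costs, contiguous layers are regrouped into pipeline stages of roughly equal cost, and the plan is recomputed when the cost profile drifts (e.g., when NAS swaps in a different block). The cost model, greedy policy, and example numbers are assumptions made for the sketch.

```python
# Hypothetical sketch: rebalance a pipeline's layer partitioning from profiled
# per-layer costs. vPipe additionally plans memory (swap/recompute) and migrates
# layers live; here only a single scalar cost per layer is balanced.

def repartition(layer_costs, num_stages):
    """Greedily assign contiguous layers to stages, targeting equal total cost."""
    target = sum(layer_costs) / num_stages
    stages, current, acc = [], [], 0.0
    for i, cost in enumerate(layer_costs):
        can_close = (
            current
            and len(stages) < num_stages - 1                      # stages still available
            and len(layer_costs) - i >= num_stages - len(stages)  # layers left for them
        )
        if can_close and acc + cost > target:
            stages.append(current)
            current, acc = [], 0.0
        current.append(i)
        acc += cost
    stages.append(current)
    return stages

# Costs drift during training, so the plan is recomputed and layers migrate.
print(repartition([1, 1, 1, 1, 1, 1, 1, 1], num_stages=4))  # [[0,1],[2,3],[4,5],[6,7]]
print(repartition([1, 1, 4, 4, 1, 1, 1, 1], num_stages=4))  # [[0,1],[2],[3],[4,5,6,7]]
```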

18 citations

Journal ArticleDOI
TL;DR: COORP, the implementation of coordinated preemption, reduced the violation of latency requirements from 53.9% (EDCA) to 8.8% (comparable to SchedWiFi), and achieved a learning reward comparable (at times equal) to EDCA's, which grew up to 76% faster than with SchedWiFi.
Abstract: In coordinated robotic learning, multiple robots share the same wireless channel for communication, bringing together latency-sensitive (LS) network flows for control and bandwidth-hungry (BH) flows for distributed learning. Unfortunately, existing wireless networking systems cannot coordinate these two kinds of flows to meet their respective requirements: 1) prioritized contention systems (e.g., EDCA) prevent LS messages from acquiring the wireless channel in time, because multiple wireless network interface cards (WNICs) with BH messages are contending for the channel; 2) global planning systems (e.g., SchedWiFi) have to reserve a notable time window in the shared channel for each LS flow, suffering severe bandwidth degradation (up to 42%). We present the coordinated preemption method to meet the requirements of both LS flows and BH flows. Globally (among multiple robots), coordinated preemption eliminates unnecessary contention among BH flows by making them transmit in a round-robin manner, so that LS flows have the highest chance to win the contention against BH flows, without sacrificing overall bandwidth from the perspective of coordinated robotic learning applications. Locally (within the same robot), coordinated preemption predicts, in real time, the periodic transmission of LS flows from the upper application and conservatively limits the BH packets buffered in the WNIC only just before LS packets arrive, reducing the bandwidth devoted to preemption. COORP, our implementation of coordinated preemption, reduced the violation of latency requirements from 53.9% (EDCA) to 8.8% (comparable to SchedWiFi). Regarding learning quality, COORP achieved a learning reward comparable (at times equal) to EDCA's, and the reward grew up to 76% faster than with SchedWiFi.
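
A minimal sketch of the two levels of coordinated preemption follows; it is an assumed simplification for illustration, not the COORP implementation (the class names, period, and guard window below are invented). Globally, BH senders take turns via a round-robin token; locally, each robot stops queuing BH packets into the WNIC shortly before the next predicted LS transmission.

```python
# Hypothetical sketch of coordinated preemption: round-robin BH transmission
# across robots, plus a local guard window before predicted LS traffic.
import itertools

class BhRoundRobin:
    """Global: only the current token holder may transmit bandwidth-hungry data."""
    def __init__(self, robots):
        self._cycle = itertools.cycle(robots)
        self.holder = next(self._cycle)

    def may_transmit(self, robot):
        return robot == self.holder

    def pass_token(self):
        self.holder = next(self._cycle)
        return self.holder

class LocalGate:
    """Local: predict periodic LS traffic and hold BH packets just before it."""
    def __init__(self, ls_period_ms, guard_ms):
        self.ls_period_ms = ls_period_ms   # e.g., a 10 ms control loop
        self.guard_ms = guard_ms           # stop queuing BH this long before LS
        self.next_ls_ms = ls_period_ms

    def observe_ls(self, now_ms):
        # Refine the prediction whenever an LS packet actually leaves the app.
        self.next_ls_ms = now_ms + self.ls_period_ms

    def bh_allowed(self, now_ms):
        return now_ms < self.next_ls_ms - self.guard_ms

# Robot "r1" queues BH (learning) packets only while it holds the token and no
# LS control message is imminent.
token, gate = BhRoundRobin(["r1", "r2", "r3"]), LocalGate(ls_period_ms=10, guard_ms=2)
gate.observe_ls(now_ms=0)
print(token.may_transmit("r1"), gate.bh_allowed(now_ms=5))  # True True
print(gate.bh_allowed(now_ms=9))                            # False (guard window)
```
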
Journal ArticleDOI
TL;DR: Fold3D, as discussed by the authors, slices a DNN into multiple segments, so that the computational tasks processing the same DNN segment can be scheduled together, and the communication tasks that synchronize this segment can be launched and overlapped with other segments' computational tasks.
Abstract: Training a large DNN (e.g., GPT-3) efficiently on commodity clouds is challenging even with the latest 3D parallel training systems (e.g., Megatron v3.0). In particular, along the pipeline parallelism dimension, computational tasks that produce a whole DNN's gradients with multiple input batches should be concurrently activated; along the data parallelism dimension, a set of heavyweight communications (for aggregating the accumulated outputs of computational tasks) is inevitably serialized after the pipelined tasks, undermining training performance over commodity cloud networks (e.g., in Megatron, data parallelism left all GPUs idle for over 44% of the training time). To deserialize these communication and computational tasks, we propose AIAO scheduling (for 3D parallelism), which slices a DNN into multiple segments, so that the computational tasks processing the same DNN segment can be scheduled together, and the communication tasks that synchronize this segment can be launched and overlapped (deserialized) with other segments' computational tasks. We realized this idea in our Fold3D training system. Extensive evaluation shows that Fold3D eliminated most of the 44% all-GPU idle time in Megatron (caused by data parallelism), leading to a 25.2%–42.1% training throughput improvement over four notable baselines across various settings; Fold3D's high performance also scaled to many GPUs.
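
The sketch below captures only the overlap principle behind AIAO-style scheduling, under assumed timings and stand-in functions (it is not Fold3D): as soon as one DNN segment finishes its computation for a step, its gradient synchronization is launched asynchronously and overlaps with the next segment's computation instead of being serialized after the whole pipeline.

```python
# Hypothetical sketch: overlap per-segment gradient all-reduce with the
# computation of later segments (stand-ins via sleeps, not real GPU work).
import time
from concurrent.futures import ThreadPoolExecutor

def compute_segment(seg):
    time.sleep(0.05)                 # stand-in for forward/backward of a segment
    return f"grads[{seg}]"

def allreduce(grads):
    time.sleep(0.05)                 # stand-in for the data-parallel all-reduce
    return f"synced {grads}"

def train_step(segments):
    pending = []
    with ThreadPoolExecutor(max_workers=1) as comm:
        for seg in segments:
            grads = compute_segment(seg)                   # compute this segment
            pending.append(comm.submit(allreduce, grads))  # overlap its comm
        return [f.result() for f in pending]               # only the tail is exposed

start = time.time()
train_step(segments=range(4))
print(f"{time.time() - start:.2f}s")  # ~0.25s here vs. ~0.40s fully serialized
```
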
Proceedings ArticleDOI
01 Oct 2022
TL;DR: ROG confines the granularity of transmission and synchronization to each row of a layer's parameters and schedules the transmission of each row adaptively to the fluctuating bandwidth, so that the ML training process can update a stale robot with its partial but most important gradients to avoid triggering stalls, while provably guaranteeing convergence.
Abstract: Critical robotic tasks such as rescue and disaster response increasingly leverage ML (machine learning) models deployed on a team of wireless robots, where data-parallel (DP) training over the Internet of Things of these robots (robotic IoT) can harness the distributed hardware resources to adapt the models to changing environments as soon as possible. Unfortunately, because DP synchronization is required across all robots, the instability of wireless networks (i.e., fluctuating bandwidth due to occlusion and varying communication distance) often leads to severe stalls of robots, which hurt training accuracy within a tight time budget and waste energy on stalling. Existing methods for coping with the instability of datacenter networks cannot handle this straggler effect, because they conduct model-granulated transmission scheduling, which is much more coarse-grained than the granularity of transient network instability in real-world robotic IoT networks, so a previously computed schedule mismatches the bandwidth that actually varies during transmission. We present ROG, the first ROw-Granulated distributed training system optimized for ML training over unstable wireless networks. ROG confines the granularity of transmission and synchronization to each row of a layer's parameters and schedules the transmission of each row adaptively to the fluctuating bandwidth. In this way, the ML training process can update a stale robot with its partial but most important gradients to avoid triggering stalls, while provably guaranteeing convergence. The evaluation shows that, given the same training time, ROG achieved about a 4.9%~6.5% training accuracy gain compared with the baselines and saved 20.4%~50.7% of the energy needed to achieve the same training accuracy.
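
A rough sketch of the row-granular idea follows; it is an assumed simplification rather than ROG's actual scheduler or convergence machinery (the importance metric and budget are illustrative): under a fluctuating bandwidth budget, only the gradient rows that fit this round are transmitted, most important rows first, and the remaining rows stay stale until later rounds.

```python
# Hypothetical sketch: importance-ordered, row-granular partial synchronization.
import numpy as np

def select_rows(grad, budget_rows):
    """Pick the most important rows (largest L2 norm) that fit the budget."""
    importance = np.linalg.norm(grad, axis=1)
    return np.argsort(importance)[::-1][:budget_rows]

def partial_sync(local_grad, shared_grad, budget_rows):
    """Apply only the transmitted rows; the rest are updated in later rounds."""
    rows = select_rows(local_grad, budget_rows)
    shared_grad[rows] = local_grad[rows]
    return rows

rng = np.random.default_rng(0)
local = rng.normal(size=(8, 4))     # one layer's gradient, 8 rows
shared = np.zeros_like(local)       # the aggregator's (stale) view of it
sent = partial_sync(local, shared, budget_rows=3)  # bandwidth allows 3 rows now
print("rows sent this round:", sorted(sent.tolist()))
```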

Cited by
Proceedings ArticleDOI
23 Mar 2022
Abstract: We present the design of a new large-scale orchestration layer for accelerators. Our system, Pathways, is explicitly designed to enable exploration of new systems and ML research ideas, while retaining state-of-the-art performance for current models. Pathways uses a sharded dataflow graph of asynchronous operators that consume and produce futures, and efficiently gang-schedules heterogeneous parallel computations on thousands of accelerators while coordinating data transfers over their dedicated interconnects. Pathways makes use of a novel asynchronous distributed dataflow design that lets the control plane execute in parallel despite dependencies in the data plane. This design, with careful engineering, allows Pathways to adopt a single-controller model that makes it easier to express complex new parallelism patterns. We demonstrate that Pathways can achieve performance parity (~100% accelerator utilization) with state-of-the-art systems when running SPMD computations over 2048 TPUs, while also delivering throughput comparable to the SPMD case for Transformer models that are pipelined across 16 stages, or sharded across two islands of accelerators connected over a data center network.
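
As a toy illustration (far simpler than Pathways, with invented names and no real accelerators), the sketch below shows the single-controller, futures-based dataflow style the abstract describes: the controller builds a graph of asynchronous operators whose outputs are futures, and each operator runs on its island's executor as soon as its inputs resolve.

```python
# Hypothetical sketch: a single controller chaining asynchronous operators via
# futures across two "islands" (thread pools stand in for accelerator islands).
from concurrent.futures import ThreadPoolExecutor

islands = {"island_a": ThreadPoolExecutor(1), "island_b": ThreadPoolExecutor(1)}

def op(island, fn, *input_futures):
    """Schedule fn on an island; it blocks only on its own inputs (futures in, future out)."""
    def run():
        return fn(*[f.result() for f in input_futures])
    return islands[island].submit(run)

# Controller-side graph: two shards computed in parallel, then combined.
x = op("island_a", lambda: [1, 2, 3])
y = op("island_b", lambda: [4, 5, 6])
z = op("island_a", lambda a, b: sum(a) + sum(b), x, y)
print(z.result())  # 21
```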

51 citations

Proceedings ArticleDOI
28 Feb 2022
Abstract: Supernet training, a prevalent and important paradigm in Neural Architecture Search, embeds the whole DNN architecture search space into one monolithic supernet, iteratively activates a subset of the supernet (i.e., a subnet) to fit each batch of data, and searches for a high-quality subnet that meets specific requirements. Although training subnets in parallel on multiple GPUs is desirable for acceleration, there is an inherent race hazard: concurrent subnets may access the same DNN layers. Existing systems neither efficiently parallelize subnets' training executions nor resolve the race hazard deterministically, leading to unreproducible training procedures and potentially non-trivial accuracy loss. We present NASPipe, the first high-performance and reproducible distributed supernet training system, built on a causal synchronous parallel (CSP) pipeline scheduling abstraction: NASPipe partitions a supernet across GPUs and concurrently executes multiple generated sub-tasks (subnets) in a pipelined manner; meanwhile, it oversees the correlations between the subnets and deterministically resolves any causal dependency caused by subnets' layer sharing. To obtain high performance, NASPipe's CSP scheduler exploits the fact that the larger a supernet spans, the fewer dependencies manifest between chronologically close subnets; therefore, it aggressively schedules the subnets with larger chronological orders into execution, but only if they are not causally dependent on unfinished preceding subnets. Moreover, to relieve the excessive GPU memory burden of holding the whole supernet's parameters, NASPipe uses a context-switch technique that stashes the whole supernet in CPU memory, precisely predicts the subnets' schedule, and pre-fetches/evicts a subnet before/after its execution. The evaluation shows that NASPipe is the only system that retains supernet training reproducibility, while achieving comparable and even higher performance (up to 7.8X) compared to three recent pipeline training systems (e.g., GPipe).
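
The following sketch shows only the causal-ordering rule at the heart of CSP scheduling, under assumed subnet and layer names (it is not NASPipe's scheduler, which also handles pipelining and context switching): a chronologically later subnet may be scheduled early only if it shares no layers with any unfinished earlier subnet.

```python
# Hypothetical sketch: which subnets are safe to schedule under a CSP-style rule.
def ready_subnets(subnets, finished):
    """subnets: list of (chronological_id, set_of_shared_layer_ids), in order."""
    ready = []
    for i, (sid, layers) in enumerate(subnets):
        if sid in finished:
            continue
        blockers = set()                      # layers held by earlier, unfinished subnets
        for earlier_id, earlier_layers in subnets[:i]:
            if earlier_id not in finished:
                blockers |= earlier_layers
        if not (layers & blockers):           # no causal dependency -> schedulable
            ready.append(sid)
    return ready

subnets = [
    (0, {"attn.0", "ffn.0"}),
    (1, {"attn.0", "ffn.1"}),   # shares attn.0 with subnet 0 -> must wait for it
    (2, {"attn.2", "ffn.2"}),   # disjoint -> may run ahead of chronological order
]
print(ready_subnets(subnets, finished=set()))  # [0, 2]
print(ready_subnets(subnets, finished={0}))    # [1, 2]
```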

6 citations

Posted Content
TL;DR: CoCoNeT, as discussed by the authors, provides a DSL to express a program with both computation and communication, allowing users to work at a high level of abstraction and apply powerful optimizations, such as fusion or overlapping of communication and computation.
Abstract: The recent trend towards increasingly large machine learning models requires both training and inference tasks to be distributed. Considering the huge cost of training these models, it is imperative to unlock optimizations in computation and communication to obtain the best performance. However, the current logical separation between computation and communication kernels in deep learning frameworks misses the optimization opportunities across this barrier. Breaking this abstraction with a holistic consideration can enable many optimizations that improve the performance of distributed workloads. Manually applying these optimizations requires modifications to the underlying computation and communication libraries for each scenario, which is time-consuming and error-prone. Therefore, we present CoCoNeT, with a DSL to express a program with both computation and communication. CoCoNeT contains several machine-learning-aware transformations to optimize a program and a compiler to generate high-performance kernels. Providing both computation and communication as first-class constructs allows users to work at a high level of abstraction and apply powerful optimizations, such as fusion or overlapping of communication and computation. CoCoNeT enables us to optimize data-, model-, and pipeline-parallel workloads in large language models with only a few lines of code. Experiments show that CoCoNeT significantly outperforms state-of-the-art distributed machine learning implementations.
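
As a rough, language-agnostic illustration of why co-expressing computation and communication matters (this is plain Python, not CoCoNeT's DSL, and the toy all-reduce below is an assumption of the sketch): when the optimizer update is visible to the communication step, the two can be fused into a single pass over the gradient data instead of a communicate-then-compute sequence.

```python
# Hypothetical sketch: fusing a weight update into a toy all-reduce.
def allreduce(grads_per_worker):
    """Toy all-reduce: element-wise sum of every worker's gradient."""
    return [sum(vals) for vals in zip(*grads_per_worker)]

def unfused_step(weights, grads_per_worker, lr):
    g = allreduce(grads_per_worker)                      # pass 1: communicate
    return [w - lr * gi for w, gi in zip(weights, g)]    # pass 2: compute

def fused_step(weights, grads_per_worker, lr):
    # The update is applied while reduced values are produced: one pass over
    # the data, which is what a fused kernel would do on the GPU.
    return [w - lr * sum(vals)
            for w, vals in zip(weights, zip(*grads_per_worker))]

weights = [1.0, 2.0, 3.0]
grads = [[0.1, 0.1, 0.1], [0.3, 0.1, 0.5]]   # two data-parallel workers
assert unfused_step(weights, grads, 0.5) == fused_step(weights, grads, 0.5)
print(fused_step(weights, grads, 0.5))       # approximately [0.8, 1.9, 2.7]
```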

4 citations

Proceedings ArticleDOI
22 Feb 2022
TL;DR: In this paper, the authors propose to break the separation between computation and communication kernels in machine learning frameworks, which can enable many optimizations that improve the performance of distributed workloads; however, manually applying these optimizations requires modifying the underlying computation and communication libraries for each scenario, which is both time-consuming and error-prone.
Abstract: Recent trends towards large machine learning models require both training and inference tasks to be distributed. Considering the huge cost of training these models, it is imperative to unlock optimizations in computation and communication to obtain the best performance. However, the current logical separation between computation and communication kernels in machine learning frameworks misses optimization opportunities across this barrier. Breaking this abstraction can provide many optimizations to improve the performance of distributed workloads. However, manually applying these optimizations requires modifying the underlying computation and communication libraries for each scenario, which is both time-consuming and error-prone.

4 citations