Author

Xusheng Chen

Bio: Xusheng Chen is an academic researcher at the University of Hong Kong who has contributed to research on computer science and scalability. The author has an h-index of 4 and has co-authored 16 publications that have received 91 citations.

Papers

Proceedings Article

APUS: fast and scalable paxos on RDMA

Cheng Wang¹, Jianyu Jiang¹, Xusheng Chen¹, Ning Yi¹, Heming Cui¹

University of Hong Kong¹

24 Sep 2017

TL;DR: This paper presents APUS, the first RDMA-based Paxos protocol that aims to be fast and scalable to client connections and hosts, and evaluates it on nine widely used server programs.

Abstract: State machine replication (SMR) uses Paxos to enforce the same inputs for a program (e.g., Redis) replicated on a number of hosts, tolerating various types of failures. Unfortunately, traditional Paxos protocols incur prohibitive performance overhead on server programs due to their high consensus latency on TCP/IP. Worse, the consensus latency of extant Paxos protocols increases drastically when more concurrent client connections or hosts are added. This paper presents APUS, the first RDMA-based Paxos protocol that aims to be fast and scalable to client connections and hosts. APUS intercepts inbound socket calls of an unmodified server program, assigns a total order for all input requests, and uses fast RDMA primitives to replicate these requests concurrently. We evaluated APUS on nine widely-used server programs (e.g., Redis and MySQL). APUS incurred a mean overhead of 4.3% in response time and 4.2% in throughput. We integrated APUS with an SMR system Calvin. Our Calvin-APUS integration was 8.2X faster than the extant Calvin-ZooKeeper integration. The consensus latency of APUS outperformed an RDMA-based consensus protocol by 4.9X. APUS source code and raw results are released on github.com/hku-systems/apus.

82 citations

Proceedings Article

Bidl: A High-throughput, Low-latency Permissioned Blockchain Framework for Datacenter Networks

Ji Qi¹, Xusheng Chen¹, Yunpeng Jiang¹, Jianyu Jiang¹, Tianxiang Shen¹, Shixiong Zhao¹, Sen Wang², Gong Zhang², Li Chen², Man Ho Au¹, Heming Cui¹

University of Hong Kong¹, Huawei²

26 Oct 2021

TL;DR: Bidl is the first permissioned blockchain framework highly optimized for datacenter networks; its shepherded parallel workflow carries a sequencer to parallelize the consensus protocol and transaction execution speculatively.

Abstract: A permissioned blockchain framework typically runs an efficient Byzantine consensus protocol and is attractive to deploy fast trading applications among a large number of mutually untrusted participants (e.g., companies). Unfortunately, all existing permissioned blockchain frameworks adopt sequential workflows for invoking the consensus protocol and executing applications' transactions, making the performance of these applications much lower than deploying them in traditional systems (e.g., in-datacenter stock exchange). We propose Bidl, the first permissioned blockchain framework highly optimized for datacenter networks. We leverage the network ordering in such networks to create a shepherded parallel workflow, which carries a sequencer to parallelize the consensus protocol and transaction execution speculatively. However, the presence of malicious participants (e.g., a malicious sequencer) can easily perturb the parallel workflow to greatly degrade Bidl's performance. To achieve stable high performance, Bidl efficiently shepherds all participants by detecting their misbehaviors, and performs denylist-based view changes to replace or deny malicious participants. Compared with three fast permissioned blockchain frameworks, Bidl's parallel workflow reduces applications' latency by up to 72.7% and improves their throughput by up to 4.3x in the presence of malicious participants. Bidl is suitable to be integrated with traditional stock exchange systems. Bidl's code is released on github.com/hku-systems/bidl.

18 citations

Journal Article

vPipe: A Virtualized Acceleration System for Achieving Efficient and Scalable Pipeline Parallel DNN Training

Shixiong Zhao¹, Fanxin Li¹, Xusheng Chen¹, Xiuxian Guan¹, Jianyu Jiang¹, Dong Huang¹, Yuhao Qing¹, Sen Wang², Peng Wang², Gong Zhang², Cheng Li³, Ping Luo¹, Heming Cui¹

University of Hong Kong¹, Huawei², University of Science and Technology of China³

01 Mar 2022 - IEEE Transactions on Parallel and Distributed Systems

TL;DR: vPipe transparently provides dynamic layer partitioning and memory management for pipeline parallelism through an online algorithm that searches for a near-optimal partitioning and memory management plan, and a live layer migration protocol that rebalances the layer distribution across a training pipeline.

Abstract: The increasing computational complexity of DNNs achieved unprecedented successes in various areas such as machine vision and natural language processing (NLP), e.g., the recent advanced Transformer has billions of parameters. However, as large-scale DNNs significantly exceed GPU’s physical memory limit, they cannot be trained by conventional methods such as data parallelism. Pipeline parallelism that partitions a large DNN into small subnets and trains them on different GPUs is a plausible solution. Unfortunately, the layer partitioning and memory management in existing pipeline parallel systems are fixed during training, making them easily impeded by out-of-memory errors and the GPU under-utilization. These drawbacks amplify when performing neural architecture search (NAS) such as the evolved Transformer, where different network architectures of Transformer needed to be trained repeatedly. vPipe is the first system that transparently provides dynamic layer partitioning and memory management for pipeline parallelism. vPipe has two unique contributions, including (1) an online algorithm for searching a near-optimal layer partitioning and memory management plan, and (2) a live layer migration protocol for re-balancing the layer distribution across a training pipeline. vPipe improved the training throughput of two notable baselines (Pipedream and GPipe) by 61.4-463.4 percent and 24.8-291.3 percent on various large DNNs and training settings.

18 citations

Proceedings Article

Uranus: Simple, Efficient SGX Programming and its Applications

Jianyu Jiang¹, Xusheng Chen¹, TszOn Li¹, Cheng Wang¹, Tianxiang Shen¹, Shixiong Zhao¹, Heming Cui¹, Cho-Li Wang¹, Fengwei Zhang²

University of Hong Kong¹, Southern University of Science and Technology²

05 Oct 2020

TL;DR: Uranus effectively tackles the two major vulnerabilities in the code-reuse approach by presenting two new protocols: a Java bytecode attestation protocol for dynamically loaded functions; and an OS-decoupled, efficient GC protocol optimized for data-handling applications running in enclaves.

Abstract: Applications written in Java have strengths to tackle diverse threats in public clouds, but these applications are still prone to privileged attacks when processing plaintext data. Intel SGX is powerful to tackle these attacks, and traditional SGX systems rewrite a Java application's sensitive functions, which process plaintext data, using C/C++ SGX API. Although this code-rewrite approach achieves good efficiency and a small TCB, it requires SGX expert knowledge and can be tedious and error-prone. To tackle the limitations of rewriting Java to C/C++, recent SGX systems propose a code-reuse approach, which runs a default JVM in an SGX enclave to execute the sensitive Java functions. However, both recent study and this paper find that running a default JVM in enclaves incurs two major vulnerabilities, Iago attacks, and control flow leakage of sensitive functions, due to the usage of OS features in JVM. In this paper, Uranus creates easy-to-use Java programming abstractions for application developers to annotate sensitive functions, and Uranus automatically runs these functions in SGX at runtime. Uranus effectively tackles the two major vulnerabilities in the code-reuse approach by presenting two new protocols: 1) a Java bytecode attestation protocol for dynamically loaded functions; and 2) an OS-decoupled, efficient GC protocol optimized for data-handling applications running in enclaves. We implemented Uranus in Linux and applied it to two diverse data-handling applications: Spark and ZooKeeper. Evaluation shows that: 1) Uranus achieves the same security guarantees as two relevant SGX systems for these two applications with only a few annotations; 2) Uranus has reasonable performance overhead compared to the native, insecure applications; and 3) Uranus defends against privileged attacks. Uranus source code and evaluation results are released on https://github.com/hku-systems/uranus.

15 citations

Proceedings Article

Achieving low tail-latency and high scalability for serializable transactions in edge computing

Xusheng Chen¹, Haoze Song¹, Jianyu Jiang¹, Chaoyi Ruan, Cheng Li, Sen Wang², Gong Zhang², Reynold Cheng¹, Heming Cui¹

University of Hong Kong¹, Huawei²

21 Apr 2021

TL;DR: Dast (Decentralized Anticipate and STretch) is presented as the first edge database that can meet stringent low-tail-latency and scalability requirements while guaranteeing serializability.

Abstract: A distributed database utilizing the wide-spread edge computing servers to provide low-latency data access with the serializability guarantee is highly desirable for emerging edge computing applications. In an edge database, nodes are divided into regions, and a transaction can be categorized as intra-region (IRT) or cross-region (CRT) based on whether it accesses data in different regions. In addition to serializability, we insist that a practical edge database should provide low tail latency for both IRTs and CRTs, and such low latency must be scalable to a large number of regions. Unfortunately, none of existing geo-replicated serializable databases or edge databases can meet such requirements. In this paper, we present Dast (Decentralized Anticipate and STretch), the first edge database that can meet the stringent performance requirements with serializability. Our key idea is to order transactions by anticipating when they are ready to execute: Dast binds an IRT to the latest timestamp and binds a CRT to a future timestamp to avoid the coordination of CRTs blocking IRTs. Dast also carries a new stretchable clock abstraction to tolerate inaccurate anticipations and to handle cross-region data reads. Our evaluation shows that, compared to three relevant serializable databases, Dast's 99-percentile latency was 87.9%~93.2% lower for IRTs and 27.7%~70.4% lower for CRTs; Dast's low latency is scalable to a large number of regions.

12 citations

Cited by

Proceedings Article

LITE Kernel RDMA Support for Datacenter Applications

Shin-Yeh Tsai¹, Yiying Zhang¹

Purdue University¹

14 Oct 2017

TL;DR: The authors build LITE, a Local Indirection TiEr for RDMA in the Linux kernel that virtualizes native RDMA into a flexible, high-level, easy-to-use abstraction and allows applications to safely share resources.

Abstract: Recently, there is an increasing interest in building data-center applications with RDMA because of its low-latency, high-throughput, and low-CPU-utilization benefits. However, RDMA is not readily suitable for datacenter applications. It lacks a flexible, high-level abstraction; its performance does not scale; and it does not provide resource sharing or flexible protection. Because of these issues, it is difficult to build RDMA-based applications and to exploit RDMA's performance benefits. To solve these issues, we built LITE, a Local Indirection TiEr for RDMA in the Linux kernel that virtualizes native RDMA into a flexible, high-level, easy-to-use abstraction and allows applications to safely share resources. Despite the widely-held belief that kernel bypassing is essential to RDMA's low-latency performance, we show that using a kernel-level indirection can achieve both flexibility and low-latency, scalable performance at the same time. To demonstrate the benefits of LITE, we developed several popular datacenter applications on LITE, including a graph engine, a MapReduce system, a Distributed Shared Memory system, and a distributed atomic logging system. These systems are easy to build and deliver good performance. For example, our implementation of PageRank uses only 20 lines of LITE code, while outperforming PowerGraph by 3.5x to 5.6x.

137 citations

Proceedings Article

Deconstructing RDMA-enabled distributed transactions: hybrid is better

Xingda Wei¹, Zhiyuan Dong¹, Rong Chen¹, Haibo Chen¹

Shanghai Jiao Tong University¹

08 Oct 2018

TL;DR: The authors conduct an end-to-end comparison of prior designs on the same codebase, find that none of them is optimal, and build DrTM+H, a new hybrid distributed transaction system that always embraces the optimal RDMA primitives at each phase of transactional execution.

Abstract: There is currently an active debate on which RDMA primitive (i.e., one-sided or two-sided) is optimal for distributed transactions. Such a debate has led to a number of optimizations based on one RDMA primitive, which was shown with better performance than the other. In this paper, we perform a systematic comparison between different RDMA primitives with a combination of various optimizations using representative OLTP workloads. More specifically, we first implement and compare different RDMA primitives with existing and our new optimizations upon a single well-tuned execution framework. This gives us insights into the performance characteristics of different RDMA primitives. Then we investigate the implementation of optimistic concurrency control (OCC) by comparing different RDMA primitives using a phase-by-phase approach with various transactions from TPC-C, SmallBank, and TPC-E. Our results show that no single primitive (one-sided or two-sided) wins over the other on all phases. We further conduct an end-to-end comparison of prior designs on the same codebase and find none of them is optimal. Based on the above studies, we build DrTM+H, a new hybrid distributed transaction system that always embraces the optimal RDMA primitives at each phase of transactional execution. Evaluations using popular OLTP workloads including TPC-C and SmallBank show that DrTM+H achieves over 7.3 and 90.4 million transactions per second on a 16-node RDMA-capable cluster (ConnectX-4) respectively, without locality assumption. This number outperforms the pure one-sided and two-sided systems by up to 1.89X and 2.96X for TPC-C with over 49% and 65% latency reduction. Further, DrTM+H scales well with a large number of connections on modern RDMA network.

90 citations

Journal Article

PolarFS: an ultra-low latency and failure resilient distributed file system for shared storage cloud database

Wei Cao¹, Zhenjun Liu¹, Peng Wang², Sen Chen¹, Caifeng Zhu, Song Zheng, Yuhui Wang¹, Guoqing Ma

Alibaba Group¹, Fudan University²

01 Aug 2018

TL;DR: The authors develop ParallelRaft, a consensus protocol derived from Raft that breaks Raft's strict serialization by exploiting the out-of-order I/O completion tolerance capability of databases.

Abstract: PolarFS is a distributed file system with ultra-low latency and high availability, designed for the POLARDB database service, which is now available on the Alibaba Cloud. PolarFS utilizes a lightweight network stack and I/O stack in user-space, taking full advantage of the emerging techniques like RDMA, NVMe, and SPDK. In this way, the end-to-end latency of PolarFS has been reduced drastically and our experiments show that the write latency of PolarFS is quite close to that of local file system on SSD. To keep replica consistency while maximizing I/O throughput for PolarFS, we develop ParallelRaft, a consensus protocol derived from Raft, which breaks Raft's strict serialization by exploiting the out-of-order I/O completion tolerance capability of databases. ParallelRaft inherits the understandability and easy implementation of Raft while providing much better I/O scalability for PolarFS. We also describe the shared storage architecture of PolarFS, which gives a strong support for POLARDB.

76 citations

Proceedings Article

Pathways: Asynchronous Distributed Dataflow for ML

Paul Barham, Aakanksha Chowdhery, Jeffrey Dean, Sanjay Ghemawat, Steven Hand, D. Hurt, Michael Isard, Hyeontaek Lim, Ruoming Pang, Sudip Roy, Brennan Saeta, Parker Schuh, Ryan Sepassi, Laurent El Shafey, Chandramohan A. Thekkath, Yonghui Wu

23 Mar 2022

Abstract: We present the design of a new large scale orchestration layer for accelerators. Our system, Pathways, is explicitly designed to enable exploration of new systems and ML research ideas, while retaining state of the art performance for current models. Pathways uses a sharded dataflow graph of asynchronous operators that consume and produce futures, and efficiently gang-schedules heterogeneous parallel computations on thousands of accelerators while coordinating data transfers over their dedicated interconnects. Pathways makes use of a novel asynchronous distributed dataflow design that lets the control plane execute in parallel despite dependencies in the data plane. This design, with careful engineering, allows Pathways to adopt a single-controller model that makes it easier to express complex new parallelism patterns. We demonstrate that Pathways can achieve performance parity (~100% accelerator utilization) with state-of-the-art systems when running SPMD computations over 2048 TPUs, while also delivering throughput comparable to the SPMD case for Transformer models that are pipelined across 16 stages, or sharded across two islands of accelerators connected over a data center network.

51 citations

Proceedings Article

REINFORCE: achieving efficient failure resiliency for network function virtualization based services

Sameer G. Kulkarni¹, Guyue Liu², Kadangode K. Ramakrishnan³, Mayutan Arumaithurai¹, Timothy Wood², Xiaoming Fu¹

University of Göttingen¹, George Washington University², University of California, Riverside³

04 Dec 2018

TL;DR: REINFORCE is an integrated framework to support efficient resiliency for NFs and NF service chains that minimizes the number of state transfers by exploiting the concept of external synchrony, and leverages opportunistic batching and multi-buffering to optimize performance.

Abstract: Ensuring high availability (HA) for software-based networks is a critical design feature that will help the adoption of software-based network functions (NFs) in production networks. It is important for NFs to avoid outages and maintain mission-critical operations. However, HA support for NFs on the critical data path can result in unacceptable performance degradation. We present REINFORCE, an integrated framework to support efficient resiliency for NFs and NF service chains. REINFORCE includes timely failure detection and consistent failover mechanisms. REINFORCE replicates state to standby NFs (local and remote) while enforcing correctness. It minimizes the number of state transfers by exploiting the concept of external synchrony, and leverages opportunistic batching and multi-buffering to optimize performance. Experimental results show that, even at line-rate packet processing (10 Gbps), REINFORCE achieves chain-level failover across servers in a LAN (or within the same node) within 10 ms (100 μs), incurring less than 10% (1%) performance overhead, and adds average latency of only ~400 μs (5 μs), with a worst-case latency of less than 1 ms (10 μs).

46 citations
