Home
/
Authors
/
Xiaolong Ma

Author

Xiaolong Ma

Other affiliations: Chinese Academy of Sciences, College of William & Mary, Syracuse University

Bio: Xiaolong Ma is an academic researcher from Northeastern University. The author has contributed to research in topics: Pruning (decision trees) & Computer science. The author has an hindex of 17, co-authored 103 publications receiving 1095 citations. Previous affiliations of Xiaolong Ma include Chinese Academy of Sciences & College of William & Mary.

Papers published on a yearly basis

2023
2022
2021
2020
2019
2018
2017
2016
2015
2014
2013
2012
2011
2010
2009
2008
2007

Papers

PDF

Open Access

More filters

Proceedings Article•DOI•

CirCNN: accelerating and compressing deep neural networks using block-circulant weight matrices

[...]

Caiwen Ding¹, Siyu Liao², Yanzhi Wang¹, Zhe Li¹, Ning Liu¹, Youwei Zhuo³, Chao Wang³, Xuehai Qian³, Yu Bai⁴, Geng Yuan¹, Xiaolong Ma¹, Yipeng Zhang¹, Jian Tang¹, Qinru Qiu¹, Xue Lin⁵, Bo Yuan² - Show less +12 more•Institutions (5)

Syracuse University¹, City University of New York², University of Southern California³, California State University, Fullerton⁴, Northeastern University⁵

14 Oct 2017

TL;DR: The CirCNN architecture is proposed, a universal DNN inference engine that can be implemented in various hardware/software platforms with configurable network architecture (e.g., layer type, size, scales, etc) and FFT can be used as the key computing kernel which ensures universal and small-footprint implementations.

...read moreread less

Abstract: Large-scale deep neural networks (DNNs) are both compute and memory intensive. As the size of DNNs continues to grow, it is critical to improve the energy efficiency and performance while maintaining accuracy. For DNNs, the model size is an important factor affecting performance, scalability and energy efficiency. Weight pruning achieves good compression ratios but suffers from three drawbacks: 1) the irregular network structure after pruning, which affects performance and throughput; 2) the increased training complexity; and 3) the lack of rigirous guarantee of compression ratio and inference accuracy.To overcome these limitations, this paper proposes CirCNN, a principled approach to represent weights and process neural networks using block-circulant matrices. CirCNN utilizes the Fast Fourier Transform (FFT)-based fast multiplication, simultaneously reducing the computational complexity (both in inference and training) from $\mathrm {O}(n^{2})$ to $\mathrm {O}(n$ log n) and the storage complexity from $\mathrm {O}(n^{2})$ to O(n), with negligible accuracy loss. Compared to other approaches, CirCNN is distinct due to its mathematical rigor: the DNNs based on CirCNN can converge to the same “effectiveness” as DNNs without compression. We propose the CirCNN architecture, a universal DNN inference engine that can be implemented in various hardware/software platforms with configurable network architecture (e.g., layer type, size, scales, etc In CirCNN architecture: 1) Due to the recursive property, FFT can be used as the key computing kernel which ensures universal and small-footprint implementations. 2) The compressed but regular network structure avoids the pitfalls of the network pruning and facilitates high performance and throughput with highly pipelined and parallel design. To demonstrate the performance and energy efficiency, we test CIR-CNN in FPGA, ASIC and embedded processors. Our results show that CirCNN architecture achieves very high energy efficiency and performance with a small hardware footprint. Based on the FPGA implementation and ASIC synthesis results, CirCNN achieves 6 - 102X energy efficiency improvements compared with the best state-of-the-art results.CCS Concepts• Computer systems organization$\rightarrow $ Embedded hardware;

...read moreread less

262 citations

Proceedings Article•DOI•

PatDNN: Achieving Real-Time DNN Execution on Mobile Devices with Pattern-based Weight Pruning

[...]

Wei Niu¹, Xiaolong Ma², Sheng Lin², Shihao Wang², Xuehai Qian³, Xue Lin², Yanzhi Wang², Bin Ren¹ - Show less +4 more•Institutions (3)

College of William & Mary¹, Northeastern University², University of Southern California³

09 Mar 2020

TL;DR: The proposed PatDNN is an end-to-end framework to efficiently execute DNN on mobile devices with the help of a novel model compression technique---pattern-based pruning based on an extended ADMM solution framework---and a set of thorough architecture-aware compiler/code generation-based optimizations, i.e., filter kernel reordering, compressed weight storage, register load redundancy elimination, and parameter auto-tuning.

...read moreread less

Abstract: With the emergence of a spectrum of high-end mobile devices, many applications that formerly required desktop-level computation capability are being transferred to these devices. However, executing Deep Neural Networks (DNNs) inference is still challenging considering the high computation and storage demands, specifically, if real-time performance with high accuracy is needed. Weight pruning of DNNs is proposed, but existing schemes represent two extremes in the design space: non-structured pruning is fine-grained, accurate, but not hardware friendly; structured pruning is coarse-grained, hardware-efficient, but with higher accuracy loss. In this paper, we advance the state-of-the-art by introducing a new dimension, fine-grained pruning patterns inside the coarse-grained structures, revealing a previously unknown point in the design space. With the higher accuracy enabled by fine-grained pruning patterns, the unique insight is to use the compiler to re-gain and guarantee high hardware efficiency. In other words, our method achieves the best of both worlds, and is desirable across theory/algorithm, compiler, and hardware levels. The proposed PatDNN is an end-to-end framework to efficiently execute DNN on mobile devices with the help of a novel model compression technique---pattern-based pruning based on an extended ADMM solution framework---and a set of thorough architecture-aware compiler/code generation-based optimizations, i.e., filter kernel reordering, compressed weight storage, register load redundancy elimination, and parameter auto-tuning. Evaluation results demonstrate that PatDNN outperforms three state-of-the-art end-to-end DNN frameworks, TensorFlow Lite, TVM, and Alibaba Mobile Neural Network with speedup up to 44.5X, 11.4X, and 7.1X, respectively, with no accuracy compromise. Real-time inference of representative large-scale DNNs (e.g., VGG-16, ResNet-50) can be achieved using mobile devices.

...read moreread less

178 citations

Journal Article•DOI•

AutoCompress: An Automatic DNN Structured Pruning Framework for Ultra-High Compression Rates.

[...]

Ning Liu¹, Xiaolong Ma², Zhiyuan Xu³, Yanzhi Wang², Jian Tang¹, Jieping Ye¹ - Show less +2 more•Institutions (3)

DiDi¹, Northeastern University², Syracuse University³

03 Apr 2020

TL;DR: This work proposes AutoCompress, an automatic structured pruning framework with the following key performance improvements: effectively incorporate the combination of structured pruned schemes in the automatic process, and adopt the state-of-art ADMM-based structured weight pruning as the core algorithm.

...read moreread less

Abstract: Structured weight pruning is a representative model compression technique of DNNs to reduce the storage and computation requirements and accelerate inference. An automatic hyperparameter determination process is necessary due to the large number of flexible hyperparameters. This work proposes AutoCompress, an automatic structured pruning framework with the following key performance improvements: (i) effectively incorporate the combination of structured pruning schemes in the automatic process; (ii) adopt the state-of-art ADMM-based structured weight pruning as the core algorithm, and propose an innovative additional purification step for further weight reduction without accuracy loss; and (iii) develop effective heuristic search method enhanced by experience-based guided search, replacing the prior deep reinforcement learning technique which has underlying incompatibility with the target pruning problem. Extensive experiments on CIFAR-10 and ImageNet datasets demonstrate that AutoCompress is the key to achieve ultra-high pruning rates on the number of weights and FLOPs that cannot be achieved before. As an example, AutoCompress outperforms the prior work on automatic model compression by up to 33× in pruning rate (120× reduction in the actual parameter count) under the same accuracy. Significant inference speedup has been observed from the AutoCompress framework on actual measurements on smartphone. We release models of this work at anonymous link: http://bit.ly/2VZ63dS.

...read moreread less

156 citations

Proceedings Article•DOI•

CirCNN: Accelerating and Compressing Deep Neural Networks Using Block-CirculantWeight Matrices

[...]

Caiwen Ding, Siyu Liao, Yanzhi Wang, Zhe Li, Ning Liu, Youwei Zhuo, Chao Wang, Xuehai Qian, Yu Bai, Geng Yuan, Xiaolong Ma, Yipeng Zhang, Jian Tang, Qinru Qiu, Xue Lin, Bo Yuan - Show less +12 more

29 Aug 2017-arXiv: Computer Vision and Pattern Recognition

TL;DR: CirCNN as discussed by the authors utilizes the Fast Fourier Transform (FFT)-based fast multiplication, simultaneously reducing the computational complexity (both in inference and training) and the storage complexity from O(n2) to O(nlogn) with negligible accuracy loss.

...read moreread less

Abstract: Large-scale deep neural networks (DNNs) are both compute and memory intensive. As the size of DNNs continues to grow, it is critical to improve the energy efficiency and performance while maintaining accuracy. For DNNs, the model size is an important factor affecting performance, scalability and energy efficiency. Weight pruning achieves good compression ratios but suffers from three drawbacks: 1) the irregular network structure after pruning; 2) the increased training complexity; and 3) the lack of rigorous guarantee of compression ratio and inference accuracy. To overcome these limitations, this paper proposes CirCNN, a principled approach to represent weights and process neural networks using block-circulant matrices. CirCNN utilizes the Fast Fourier Transform (FFT)-based fast multiplication, simultaneously reducing the computational complexity (both in inference and training) from O(n2) to O(nlogn) and the storage complexity from O(n2) to O(n), with negligible accuracy loss. Compared to other approaches, CirCNN is distinct due to its mathematical rigor: it can converge to the same effectiveness as DNNs without compression. The CirCNN architecture, a universal DNN inference engine that can be implemented on various hardware/software platforms with configurable network architecture. To demonstrate the performance and energy efficiency, we test CirCNN in FPGA, ASIC and embedded processors. Our results show that CirCNN architecture achieves very high energy efficiency and performance with a small hardware footprint. Based on the FPGA implementation and ASIC synthesis results, CirCNN achieves 6-102X energy efficiency improvements compared with the best state-of-the-art results.

...read moreread less

98 citations

Journal Article•DOI•

PCONV: The Missing but Desirable Sparsity in DNN Weight Pruning for Real-Time Execution on Mobile Devices.

[...]

Xiaolong Ma¹, Fu-Ming Guo¹, Wei Niu², Xue Lin¹, Jian Tang³, Kaisheng Ma⁴, Bin Ren², Yanzhi Wang¹ - Show less +4 more•Institutions (4)

Northeastern University¹, College of William & Mary², DiDi³, Tsinghua University⁴

03 Apr 2020

TL;DR: PCONV as mentioned in this paper introduces a new sparsity dimension, comprising a new fine-grained pruning pattern inside the coarse grained structures, which can achieve real-time inference on large-scale DNNs.

...read moreread less

Abstract: Model compression techniques on Deep Neural Network (DNN) have been widely acknowledged as an effective way to achieve acceleration on a variety of platforms, and DNN weight pruning is a straightforward and effective method. There are currently two mainstreams of pruning methods representing two extremes of pruning regularity: non-structured, fine-grained pruning can achieve high sparsity and accuracy, but is not hardware friendly; structured, coarse-grained pruning exploits hardware-efficient structures in pruning, but suffers from accuracy drop when the pruning rate is high. In this paper, we introduce PCONV, comprising a new sparsity dimension, – fine-grained pruning patterns inside the coarse-grained structures. PCONV comprises two types of sparsities, Sparse Convolution Patterns (SCP) which is generated from intra-convolution kernel pruning and connectivity sparsity generated from inter-convolution kernel pruning. Essentially, SCP enhances accuracy due to its special vision properties, and connectivity sparsity increases pruning rate while maintaining balanced workload on filter computation. To deploy PCONV, we develop a novel compiler-assisted DNN inference framework and execute PCONV models in real-time without accuracy compromise, which cannot be achieved in prior work. Our experimental results show that, PCONV outperforms three state-of-art end-to-end DNN frameworks, TensorFlow-Lite, TVM, and Alibaba Mobile Neural Network with speedup up to 39.2 ×, 11.4 ×, and 6.3 ×, respectively, with no accuracy loss. Mobile devices can achieve real-time inference on large-scale DNNs.

...read moreread less

95 citations

1
2
3
4
…
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24

Collapse

Cited by

PDF

Open Access

More filters

Pattern Recognition and Machine Learning

[...]

Christopher M. Bishop¹•Institutions (1)

Microsoft¹

01 Jan 2006

TL;DR: Probability distributions of linear models for regression and classification are given in this article, along with a discussion of combining models and combining models in the context of machine learning and classification.

...read moreread less

Abstract: Probability Distributions.- Linear Models for Regression.- Linear Models for Classification.- Neural Networks.- Kernel Methods.- Sparse Kernel Machines.- Graphical Models.- Mixture Models and EM.- Approximate Inference.- Sampling Methods.- Continuous Latent Variables.- Sequential Data.- Combining Models.

...read moreread less

10,141 citations

Journal Article•DOI•

Model Compression and Hardware Acceleration for Neural Networks: A Comprehensive Survey

[...]

Lei Deng¹, Guoqi Li¹, Song Han², Luping Shi¹, Yuan Xie³ - Show less +1 more•Institutions (3)

Tsinghua University¹, Massachusetts Institute of Technology², University of California, Santa Barbara³

20 Mar 2020

TL;DR: This article reviews the mainstream compression approaches such as compact model, tensor decomposition, data quantization, and network sparsification, and answers the question of how to leverage these methods in the design of neural network accelerators and present the state-of-the-art hardware architectures.

...read moreread less

Abstract: Domain-specific hardware is becoming a promising topic in the backdrop of improvement slow down for general-purpose processors due to the foreseeable end of Moore’s Law. Machine learning, especially deep neural networks (DNNs), has become the most dazzling domain witnessing successful applications in a wide spectrum of artificial intelligence (AI) tasks. The incomparable accuracy of DNNs is achieved by paying the cost of hungry memory consumption and high computational complexity, which greatly impedes their deployment in embedded systems. Therefore, the DNN compression concept was naturally proposed and widely used for memory saving and compute acceleration. In the past few years, a tremendous number of compression techniques have sprung up to pursue a satisfactory tradeoff between processing efficiency and application accuracy. Recently, this wave has spread to the design of neural network accelerators for gaining extremely high performance. However, the amount of related works is incredibly huge and the reported approaches are quite divergent. This research chaos motivates us to provide a comprehensive survey on the recent advances toward the goal of efficient compression and execution of DNNs without significantly compromising accuracy, involving both the high-level algorithms and their applications in hardware design. In this article, we review the mainstream compression approaches such as compact model, tensor decomposition, data quantization, and network sparsification. We explain their compression principles, evaluation metrics, sensitivity analysis, and joint-way use. Then, we answer the question of how to leverage these methods in the design of neural network accelerators and present the state-of-the-art hardware architectures. In the end, we discuss several existing issues such as fair comparison, testing workloads, automatic compression, influence on security, and framework/hardware-level support, and give promising topics in this field and the possible challenges as well. This article attempts to enable readers to quickly build up a big picture of neural network compression and acceleration, clearly evaluate various methods, and confidently get started in the right way.

...read moreread less

499 citations

Proceedings Article•DOI•

A configurable cloud-scale DNN processor for real-time AI

[...]

Jeremy Fowers¹, Kalin Ovtcharov¹, Michael K. Papamichael¹, Todd Massengill¹, Ming Liu¹, Lo Daniel¹, Shlomi Alkalay¹, Michael Haselman¹, Logan Adams¹, Mahdi Ghandi¹, Stephen F. Heil¹, Prerak Patel¹, Adam Sapek¹, Gabriel Weisz¹, Lisa Woods¹, Sitaram Lanka¹, Steven K. Reinhardt¹, Adrian M. Caulfield¹, Eric S. Chung¹, Doug Burger¹ - Show less +16 more•Institutions (1)

Microsoft¹

02 Jun 2018

TL;DR: This paper describes the NPU architecture for Project Brainwave, a production-scale system for real-time AI, and achieves more than an order of magnitude improvement in latency and throughput over state-of-the-art GPUs on large RNNs at a batch size of 1.5 teraflops.

...read moreread less

Abstract: Interactive AI-powered services require low-latency evaluation of deep neural network (DNN) models—aka ""real-time AI"". The growing demand for computationally expensive, state-of-the-art DNNs, coupled with diminishing performance gains of general-purpose architectures, has fueled an explosion of specialized Neural Processing Units (NPUs). NPUs for interactive services should satisfy two requirements: (1) execution of DNN models with low latency, high throughput, and high efficiency, and (2) flexibility to accommodate evolving state-of-the-art models (e.g., RNNs, CNNs, MLPs) without costly silicon updates. This paper describes the NPU architecture for Project Brainwave, a production-scale system for real-time AI. The Brainwave NPU achieves more than an order of magnitude improvement in latency and throughput over state-of-the-art GPUs on large RNNs at a batch size of 1. The NPU attains this performance using a single-threaded SIMD ISA paired with a distributed microarchitecture capable of dispatching over 7M operations from a single instruction. The spatially distributed microarchitecture, scaled up to 96,000 multiply-accumulate units, is supported by hierarchical instruction decoders and schedulers coupled with thousands of independently addressable high-bandwidth on-chip memories, and can transparently exploit many levels of fine-grain SIMD parallelism. When targeting an FPGA, microarchitectural parameters such as native datapaths and numerical precision can be "synthesis specialized" to models at compile time, enabling atypically high FPGA performance competitive with hardened NPUs. When running on an Intel Stratix 10 280 FPGA, the Brainwave NPU achieves performance ranging from ten to over thirty-five teraflops, with no batching, on large, memory-intensive RNNs.

...read moreread less

498 citations

Book•

Mathematical programming

[...]

Michael J. Todd

01 Jan 1997

437 citations

Proceedings Article•DOI•

Machine Learning at Facebook: Understanding Inference at the Edge

[...]

Carole-Jean Wu¹, David Brooks¹, Kevin Chen¹, Douglas Chen¹, Sy Choudhury¹, Marat Dukhan¹, Kim Hazelwood¹, Eldad Isaac¹, Yangqing Jia¹, Bill Jia¹, Tommer Leyvand¹, Hao Lu¹, Yang Lu¹, Lin Qiao¹, Brandon Reagen¹, Joe Spisak¹, Fei Sun¹, Andrew Tulloch¹, Peter Vajda¹, Xiaodong Wang¹, Yanghan Wang¹, Bram Wasti¹, Yiming Wu¹, Ran Xian¹, Sungjoo Yoo¹, Sungjoo Yoo², Peizhao Zhang¹ - Show less +23 more•Institutions (2)

Facebook¹, Seoul National University²

26 Mar 2019

TL;DR: This paper takes a datadriven approach to present the opportunities and design challenges faced by Facebook in order to enable machine learning inference locally on smartphones and other edge platforms.

...read moreread less

Abstract: At Facebook, machine learning provides a wide range of capabilities that drive many aspects of user experience including ranking posts, content understanding, object detection and tracking for augmented and virtual reality, speech and text translations. While machine learning models are currently trained on customized datacenter infrastructure, Facebook is working to bring machine learning inference to the edge. By doing so, user experience is improved with reduced latency (inference time) and becomes less dependent on network connectivity. Furthermore, this also enables many more applications of deep learning with important features only made available at the edge. This paper takes a datadriven approach to present the opportunities and design challenges faced by Facebook in order to enable machine learning inference locally on smartphones and other edge platforms.

...read moreread less

385 citations

1
2
3
4
…
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200

Collapse