
Showing papers by "Kai Li" published in 2018


Proceedings ArticleDOI
10 Feb 2018
TL;DR: This work proposes and implements an algorithm for N-dimensional Winograd-based convolution that allows arbitrary kernel sizes and is optimized for manycore CPUs, and achieves high hardware utilization through a series of optimizations.
Abstract: Recent work on Winograd-based convolution allows for a great reduction of computational complexity, but existing implementations are limited to 2D data and a single kernel size of 3 by 3. They achieve only slightly better, and often worse, performance than well-optimized direct convolution implementations. We propose and implement an algorithm for N-dimensional Winograd-based convolution that allows arbitrary kernel sizes and is optimized for manycore CPUs. Our algorithm achieves high hardware utilization through a series of optimizations. Our experiments show that on modern ConvNets, our optimized implementation is on average more than 3x, and sometimes 8x, faster than other state-of-the-art CPU implementations on an Intel Xeon Phi manycore processor. Moreover, our implementation on the Xeon Phi achieves competitive performance for 2D ConvNets and superior performance for 3D ConvNets, compared with the best GPU implementations.
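As an illustration of the arithmetic saving the abstract refers to, here is a minimal NumPy sketch of the classic 1D Winograd transform F(2,3), which produces two outputs of a 3-tap filter with four multiplies instead of six. This is only the textbook base case, not the paper's N-dimensional, arbitrary-kernel-size algorithm or its manycore optimizations.

```python
import numpy as np

# Winograd minimal filtering F(2,3): two outputs of a 1D 3-tap correlation
# from a 4-element input tile using 4 multiplies instead of the 6 needed
# by direct computation (transforms from Lavin & Gray, 2016).
BT = np.array([[1,  0, -1,  0],
               [0,  1,  1,  0],
               [0, -1,  1,  0],
               [0,  1,  0, -1]], dtype=np.float32)
G  = np.array([[1.0,  0.0, 0.0],
               [0.5,  0.5, 0.5],
               [0.5, -0.5, 0.5],
               [0.0,  0.0, 1.0]], dtype=np.float32)
AT = np.array([[1, 1,  1,  0],
               [0, 1, -1, -1]], dtype=np.float32)

def winograd_f23(d, g):
    """d: input tile of 4 samples, g: 3-tap kernel -> 2 outputs."""
    U = G @ g        # kernel transform (precomputable, reused across tiles)
    V = BT @ d       # input-tile transform
    M = U * V        # the 4 elementwise multiplies (the expensive part)
    return AT @ M    # inverse transform back to 2 spatial outputs

d = np.array([1.0, 2.0, 3.0, 4.0], dtype=np.float32)
g = np.array([1.0, 0.0, -1.0], dtype=np.float32)
print(winograd_f23(d, g))            # -> [-2. -2.]
print(np.correlate(d, g, "valid"))   # direct computation gives the same values
```

Higher-dimensional variants apply the same transform along each axis; the paper's contribution lies in generalizing this pattern to arbitrary kernel sizes and dimensions and optimizing it for manycore CPUs.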

43 citations


Journal ArticleDOI
TL;DR: Findings indicate that hsa_circ_0001859 is deeply involved in the chronic inflammatory process in synovial tissue and can compete with ATF2 for miR-204/211.
Abstract: Background. circRNAs are part of the competitive endogenous RNA network, putatively functioning as miRNA sponges and playing a crucial role in the development of numerous diseases. However, studies of circRNAs in rheumatoid arthritis (RA) are limited. This work aims to identify the expression pattern of circRNAs in synovial tissues and their inflammatory regulation mechanism. Methods. We first compared mRNA expression in rheumatoid arthritis patients with that in healthy volunteers by GEO database mining to identify gene loci specifically expressed in synovial tissues. Functional enrichment algorithms were then used to draw the interactome diagram of circRNAs-miRNAs-mRNAs. Finally, loss-of-function and rescue assays of the candidate circRNAs were performed in vitro. Results. A total of 29 differentially expressed circRNAs related to rheumatoid arthritis were discovered. Silencing of hsa_circ_0001859 suppressed ATF2 expression and decreased inflammatory activity in SW982 cells. Hsa_circ_0001859 could compete with ATF2 for miR-204/211. Discussion. These findings indicate that hsa_circ_0001859 is deeply involved in the chronic inflammatory process in synovial tissue.
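The GEO mining step can be pictured as a toy differential-expression screen like the sketch below; the data, thresholds, and statistical test here are illustrative placeholders, not the authors' actual pipeline.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Toy differential-expression screen in the spirit of the GEO mining step:
# rows are gene loci, columns are samples (RA vs. healthy synovial tissue).
# Data, thresholds, and the test are illustrative, not the authors' pipeline.
rng = np.random.default_rng(0)
loci = [f"locus_{i}" for i in range(200)]
ra      = pd.DataFrame(rng.normal(5, 1, (200, 6)), index=loci)
healthy = pd.DataFrame(rng.normal(5, 1, (200, 6)), index=loci)
ra.iloc[:10] += 2.0   # spike a few loci so the screen has something to find

log2_fc = ra.mean(axis=1) - healthy.mean(axis=1)     # assumes log2-scale values
pvals   = stats.ttest_ind(ra, healthy, axis=1).pvalue
hits = pd.DataFrame({"log2FC": log2_fc, "p": pvals})
hits = hits[(hits["log2FC"].abs() > 1) & (hits["p"] < 0.05)]
print(hits.sort_values("p").head())
```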

41 citations


Posted Content
TL;DR: This paper proposes Graphi, a generic and high-performance execution engine that efficiently executes a computation graph in parallel on manycore CPUs; it minimizes interference on both software and hardware resources, discovers the best parallel setting with a profiler, and further optimizes graph execution with critical-path-first scheduling.
Abstract: For a deep learning model, efficient execution of its computation graph is key to achieving high performance. Previous work has focused on improving the performance of individual nodes of the computation graph, while ignoring the parallelization of the graph as a whole. However, we observe that running multiple operations simultaneously without interference is critical to efficiently performing parallelizable small operations. Attempts to execute the computation graph in parallel in deep learning frameworks usually involve substantial resource contention among concurrent operations, leading to inferior performance on manycore CPUs. To address these issues, in this paper we propose Graphi, a generic and high-performance execution engine that efficiently executes a computation graph in parallel on manycore CPUs. Specifically, Graphi minimizes interference on both software and hardware resources, discovers the best parallel setting with a profiler, and further optimizes graph execution with critical-path-first scheduling. Our experiments show that parallel execution consistently outperforms sequential execution. The training times of four different neural networks with Graphi are 2.1x to 9.5x faster than with TensorFlow on a 68-core Intel Xeon Phi processor.
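The critical-path-first idea can be sketched as a list scheduler that prioritizes ready nodes by the longest chain of remaining work. The graph, costs, and scheduling loop below are illustrative only, not Graphi's implementation.

```python
import heapq
from collections import defaultdict
from functools import lru_cache

# Minimal sketch of critical-path-first list scheduling over a small
# computation graph. Node costs are estimated op run times; edges are
# data dependencies.
cost  = {"a": 4, "b": 1, "c": 3, "d": 2, "e": 5}
edges = [("a", "c"), ("b", "c"), ("c", "e"), ("d", "e")]

succ, indeg = defaultdict(list), {n: 0 for n in cost}
for u, v in edges:
    succ[u].append(v)
    indeg[v] += 1

@lru_cache(maxsize=None)
def cp_length(n):
    """Longest remaining path starting at n -- used as the node's priority."""
    return cost[n] + max((cp_length(s) for s in succ[n]), default=0)

ready = [(-cp_length(n), n) for n in cost if indeg[n] == 0]
heapq.heapify(ready)
order = []
while ready:
    _, n = heapq.heappop(ready)   # run the ready node heading the longest path
    order.append(n)
    for s in succ[n]:
        indeg[s] -= 1
        if indeg[s] == 0:
            heapq.heappush(ready, (-cp_length(s), s))

print(order)   # ['a', 'b', 'c', 'd', 'e'] -- 'a' first: it heads the critical path
```

In a real engine the popped nodes would be dispatched to worker threads as they become ready, so keeping the critical path moving shortens the overall makespan.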

18 citations


Posted Content
TL;DR: A Roofline performance model is used to analyze three highly optimized implementations of convolutional neural networks on modern multi- and many-core CPUs, and explains why, and under what conditions, the FFT-based implementations outperform the Winograd-based one on modern CPUs.
Abstract: Winograd-based convolution has quickly gained traction as a preferred approach to implement convolutional neural networks (ConvNets) on various hardware platforms because it requires fewer floating point operations than FFT-based or direct convolutions. This paper compares three highly optimized implementations (regular FFT-, Gauss-FFT-, and Winograd-based convolutions) on modern multi- and many-core CPUs. Although all three implementations employed the same optimizations for modern CPUs, our experimental results with two popular ConvNets (VGG and AlexNet) show that the FFT-based implementations generally outperform the Winograd-based approach, contrary to popular belief. To understand the results, we use a Roofline performance model to analyze the three implementations in detail, looking at each of their computation phases and considering not only the number of floating point operations but also the memory bandwidth and the cache sizes. The performance analysis explains why, and under what conditions, the FFT-based implementations outperform the Winograd-based one on modern CPUs.
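The Roofline argument boils down to a single bound: attainable throughput is the minimum of peak compute and memory bandwidth times arithmetic intensity. The sketch below shows that arithmetic with placeholder hardware numbers; the figures are not taken from the paper.

```python
# Roofline bound: attainable throughput is capped by either peak compute
# or memory bandwidth times arithmetic intensity (FLOPs per byte moved).
# Hardware numbers and phase counts below are illustrative placeholders.
PEAK_GFLOPS = 3000.0   # hypothetical peak single-precision compute (GFLOP/s)
PEAK_GBS    = 400.0    # hypothetical memory bandwidth (GB/s)

def roofline_bound(flops, bytes_moved):
    intensity = flops / bytes_moved                 # FLOPs per byte
    return min(PEAK_GFLOPS, PEAK_GBS * intensity)   # attainable GFLOP/s

# One compute-heavy phase (e.g. elementwise products in the transformed
# domain) vs. one transform phase that streams large tensors:
phases = [("elementwise", 2e9, 1e8), ("transform", 5e8, 5e8)]
for name, flops, bytes_moved in phases:
    bound = roofline_bound(flops, bytes_moved)
    kind = "compute" if bound == PEAK_GFLOPS else "memory"
    print(f"{name:11s} intensity={flops / bytes_moved:5.1f} FLOP/B  "
          f"bound={bound:6.0f} GFLOP/s  ({kind}-bound)")
```

Phases with low arithmetic intensity hit the bandwidth roof long before the compute roof, which is the kind of effect the paper's phase-by-phase analysis quantifies.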

15 citations


11 Apr 2018
TL;DR: PZnet is a CPU-only engine that can be used to perform inference for a variety of 3D convolutional net architectures, and outperforms MKL-based CPU implementations of PyTorch and TensorFlow for the popular 3D U-net architecture.
Abstract: Convolutional nets have been shown to achieve state-of-the-art accuracy in many biomedical image analysis tasks. To deploy convolutional nets in practical working systems, it is also important to solve the efficient inference problem: one should be able to apply an already-trained convolutional network to many large images using limited computational resources. 3D images are especially relevant because biological tissues are 3D and 3D data volumes are typically large. While it is common to use GPUs for convolutional net inference, there may be environments where CPUs are more abundant or accessible. In this paper we present PZnet, a CPU-only engine that can be used to perform inference for a variety of 3D convolutional net architectures. PZnet outperforms MKL-based CPU implementations of PyTorch and TensorFlow by more than 3.5x for the popular 3D U-net architecture. Moreover, based on current pricing of preemptible or spot instances, cloud CPU inference with PZnet is competitive in cost with cloud GPU inference for U-net-style architectures.
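The cost-competitiveness claim is a throughput-per-dollar comparison; a back-of-the-envelope version is sketched below with hypothetical prices and throughputs (substitute current spot/preemptible quotes and measured voxels-per-second to redo it for a given setup).

```python
# Throughput-per-dollar comparison for cloud inference. All prices and
# throughputs are hypothetical placeholders, not figures from the paper.
instances = {
    # name:            ($ per hour, megavoxels inferred per second)
    "cpu_preemptible": (0.30, 4.0),
    "gpu_spot":        (0.90, 14.0),
}

for name, (price_per_hr, mvox_per_s) in instances.items():
    mvox_per_dollar = mvox_per_s * 3600 / price_per_hr
    print(f"{name:15s} {mvox_per_dollar:8.0f} megavoxels per dollar")
```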