
Showing papers by "Kai Li" published in 2018


Proceedings ArticleDOI
10 Feb 2018
TL;DR: This work proposes and implements an algorithm for N-dimensional Winograd-based convolution that allows arbitrary kernel sizes and is optimized for manycore CPUs, and achieves high hardware utilization through a series of optimizations.
Abstract: Recent work on Winograd-based convolution allows for a great reduction of computational complexity, but existing implementations are limited to 2D data and a single kernel size of 3 by 3. They achieve only slightly better, and often worse, performance than well-optimized direct convolution implementations. We propose and implement an algorithm for N-dimensional Winograd-based convolution that allows arbitrary kernel sizes and is optimized for manycore CPUs. Our algorithm achieves high hardware utilization through a series of optimizations. Our experiments show that on modern ConvNets, our optimized implementation is on average more than 3x, and sometimes 8x, faster than other state-of-the-art CPU implementations on an Intel Xeon Phi manycore processor. Moreover, our implementation on the Xeon Phi achieves competitive performance for 2D ConvNets and superior performance for 3D ConvNets, compared with the best GPU implementations.
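As an illustration of the arithmetic saving the abstract refers to, here is a minimal NumPy sketch of the classic 1D Winograd transform F(2,3), which produces two outputs of a 3-tap filter with four multiplies instead of six. This is only the textbook base case, not the paper's N-dimensional, arbitrary-kernel-size algorithm or its manycore optimizations.

```python
import numpy as np

# Winograd minimal filtering F(2,3): two outputs of a 1D 3-tap correlation
# from a 4-element input tile using 4 multiplies instead of the 6 needed
# by direct computation (transforms from Lavin & Gray, 2016).
BT = np.array([[1,  0, -1,  0],
               [0,  1,  1,  0],
               [0, -1,  1,  0],
               [0,  1,  0, -1]], dtype=np.float32)
G  = np.array([[1.0,  0.0, 0.0],
               [0.5,  0.5, 0.5],
               [0.5, -0.5, 0.5],
               [0.0,  0.0, 1.0]], dtype=np.float32)
AT = np.array([[1, 1,  1,  0],
               [0, 1, -1, -1]], dtype=np.float32)

def winograd_f23(d, g):
    """d: input tile of 4 samples, g: 3-tap kernel -> 2 outputs."""
    U = G @ g        # kernel transform (precomputable, reused across tiles)
    V = BT @ d       # input-tile transform
    M = U * V        # the 4 elementwise multiplies (the expensive part)
    return AT @ M    # inverse transform back to 2 spatial outputs

d = np.array([1.0, 2.0, 3.0, 4.0], dtype=np.float32)
g = np.array([1.0, 0.0, -1.0], dtype=np.float32)
print(winograd_f23(d, g))            # -> [-2. -2.]
print(np.correlate(d, g, "valid"))   # direct computation gives the same values
```

Higher-dimensional variants apply the same transform along each axis; the paper's contribution lies in generalizing this pattern to arbitrary kernel sizes and dimensions and optimizing it for manycore CPUs.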

43 citations


Journal ArticleDOI
TL;DR: Findings indicate that hsa_circ_0001859 is deeply involved in the chronic inflammatory process in synovial tissue and can compete with ATF2 for miR-204/211.
Abstract: Background. circRNAs are part of the competitive endogenous RNA network, putatively functioning as miRNA sponges and playing a crucial role in the development of numerous diseases. However, studies of circRNAs in rheumatoid arthritis (RA) are limited. This work aims to identify the expression pattern of circRNAs in synovial tissues and their inflammatory regulation mechanism. Methods. We first compared mRNA expression in rheumatoid arthritis patients with that in healthy volunteers by GEO database mining to identify gene loci specifically expressed in synovial tissues. Functional enrichment algorithms were then used to draw the interactome diagram of circRNAs-miRNAs-mRNAs. Finally, loss-of-function and rescue assays of the candidate circRNAs were performed in vitro. Results. A total of 29 differentially expressed circRNAs related to rheumatoid arthritis were discovered. Silencing of hsa_circ_0001859 suppressed ATF2 expression and decreased inflammatory activity in SW982 cells. Hsa_circ_0001859 could compete with ATF2 for miR-204/211. Discussion. These findings indicate that hsa_circ_0001859 is deeply involved in the chronic inflammatory process in synovial tissue.
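The GEO mining step can be pictured as a toy differential-expression screen like the sketch below; the data, thresholds, and statistical test here are illustrative placeholders, not the authors' actual pipeline.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Toy differential-expression screen in the spirit of the GEO mining step:
# rows are gene loci, columns are samples (RA vs. healthy synovial tissue).
# Data, thresholds, and the test are illustrative, not the authors' pipeline.
rng = np.random.default_rng(0)
loci = [f"locus_{i}" for i in range(200)]
ra      = pd.DataFrame(rng.normal(5, 1, (200, 6)), index=loci)
healthy = pd.DataFrame(rng.normal(5, 1, (200, 6)), index=loci)
ra.iloc[:10] += 2.0   # spike a few loci so the screen has something to find

log2_fc = ra.mean(axis=1) - healthy.mean(axis=1)     # assumes log2-scale values
pvals   = stats.ttest_ind(ra, healthy, axis=1).pvalue
hits = pd.DataFrame({"log2FC": log2_fc, "p": pvals})
hits = hits[(hits["log2FC"].abs() > 1) & (hits["p"] < 0.05)]
print(hits.sort_values("p").head())
```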

41 citations


Posted Content
TL;DR: This paper proposes Graphi, a generic and high-performance execution engine that efficiently executes a computation graph in parallel on manycore CPUs; it minimizes interference on both software and hardware resources, discovers the best parallel setting with a profiler, and further optimizes graph execution with critical-path-first scheduling.
Abstract: For a deep learning model, efficient execution of its computation graph is key to achieving high performance. Previous work has focused on improving the performance of individual nodes of the computation graph, while ignoring the parallelization of the graph as a whole. However, we observe that running multiple operations simultaneously without interference is critical to efficiently performing parallelizable small operations. Attempts to execute the computation graph in parallel in deep learning frameworks usually involve substantial resource contention among concurrent operations, leading to inferior performance on manycore CPUs. To address these issues, in this paper we propose Graphi, a generic and high-performance execution engine that efficiently executes a computation graph in parallel on manycore CPUs. Specifically, Graphi minimizes interference on both software and hardware resources, discovers the best parallel setting with a profiler, and further optimizes graph execution with critical-path-first scheduling. Our experiments show that parallel execution consistently outperforms sequential execution. The training times of four different neural networks with Graphi are 2.1x to 9.5x faster than with TensorFlow on a 68-core Intel Xeon Phi processor.
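The critical-path-first idea can be sketched as a list scheduler that prioritizes ready nodes by the longest chain of remaining work. The graph, costs, and scheduling loop below are illustrative only, not Graphi's implementation.

```python
import heapq
from collections import defaultdict
from functools import lru_cache

# Minimal sketch of critical-path-first list scheduling over a small
# computation graph. Node costs are estimated op run times; edges are
# data dependencies.
cost  = {"a": 4, "b": 1, "c": 3, "d": 2, "e": 5}
edges = [("a", "c"), ("b", "c"), ("c", "e"), ("d", "e")]

succ, indeg = defaultdict(list), {n: 0 for n in cost}
for u, v in edges:
    succ[u].append(v)
    indeg[v] += 1

@lru_cache(maxsize=None)
def cp_length(n):
    """Longest remaining path starting at n -- used as the node's priority."""
    return cost[n] + max((cp_length(s) for s in succ[n]), default=0)

ready = [(-cp_length(n), n) for n in cost if indeg[n] == 0]
heapq.heapify(ready)
order = []
while ready:
    _, n = heapq.heappop(ready)   # run the ready node heading the longest path
    order.append(n)
    for s in succ[n]:
        indeg[s] -= 1
        if indeg[s] == 0:
            heapq.heappush(ready, (-cp_length(s), s))

print(order)   # ['a', 'b', 'c', 'd', 'e'] -- 'a' first: it heads the critical path
```

In a real engine the popped nodes would be dispatched to worker threads as they become ready, so keeping the critical path moving shortens the overall makespan.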

18 citations


Posted Content
TL;DR: A Roofline performance model is used to analyze three highly optimized implementations of convolutional neural networks on modern multi- and many-core CPUs, and explains why, and under what conditions, the FFT-based implementations outperform the Winograd-based one on modern CPUs.
Abstract: Winograd-based convolution has quickly gained traction as a preferred approach to implement convolutional neural networks (ConvNets) on various hardware platforms because it requires fewer floating point operations than FFT-based or direct convolutions. This paper compares three highly optimized implementations (regular FFT-, Gauss-FFT-, and Winograd-based convolutions) on modern multi- and many-core CPUs. Although all three implementations employed the same optimizations for modern CPUs, our experimental results with two popular ConvNets (VGG and AlexNet) show that the FFT-based implementations generally outperform the Winograd-based approach, contrary to popular belief. To understand the results, we use a Roofline performance model to analyze the three implementations in detail, looking at each of their computation phases and considering not only the number of floating point operations but also the memory bandwidth and the cache sizes. The performance analysis explains why, and under what conditions, the FFT-based implementations outperform the Winograd-based one on modern CPUs.
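The Roofline argument boils down to a single bound: attainable throughput is the minimum of peak compute and memory bandwidth times arithmetic intensity. The sketch below shows that arithmetic with placeholder hardware numbers; the figures are not taken from the paper.

```python
# Roofline bound: attainable throughput is capped by either peak compute
# or memory bandwidth times arithmetic intensity (FLOPs per byte moved).
# Hardware numbers and phase counts below are illustrative placeholders.
PEAK_GFLOPS = 3000.0   # hypothetical peak single-precision compute (GFLOP/s)
PEAK_GBS    = 400.0    # hypothetical memory bandwidth (GB/s)

def roofline_bound(flops, bytes_moved):
    intensity = flops / bytes_moved                 # FLOPs per byte
    return min(PEAK_GFLOPS, PEAK_GBS * intensity)   # attainable GFLOP/s

# One compute-heavy phase (e.g. elementwise products in the transformed
# domain) vs. one transform phase that streams large tensors:
phases = [("elementwise", 2e9, 1e8), ("transform", 5e8, 5e8)]
for name, flops, bytes_moved in phases:
    bound = roofline_bound(flops, bytes_moved)
    kind = "compute" if bound == PEAK_GFLOPS else "memory"
    print(f"{name:11s} intensity={flops / bytes_moved:5.1f} FLOP/B  "
          f"bound={bound:6.0f} GFLOP/s  ({kind}-bound)")
```

Phases with low arithmetic intensity hit the bandwidth roof long before the compute roof, which is the kind of effect the paper's phase-by-phase analysis quantifies.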

15 citations


11 Apr 2018
TL;DR: PZnet is a CPU-only engine that can be used to perform inference for a variety of 3D convolutional net architectures, and outperforms MKL-based CPU implementations of PyTorch and TensorFlow for the popular 3D U-net architecture.
Abstract: Convolutional nets have been shown to achieve state-of-the-art accuracy in many biomedical image analysis tasks. To deploy convolutional nets in practical working systems, it is also important to solve the efficient inference problem: one should be able to apply an already-trained convolutional network to many large images using limited computational resources. 3D images are especially relevant because biological tissues are 3D and 3D data volumes are typically large. While it is common to use GPUs for convolutional net inference, there may be environments where CPUs are more abundant or accessible. In this paper we present PZnet, a CPU-only engine that can be used to perform inference for a variety of 3D convolutional net architectures. PZnet outperforms MKL-based CPU implementations of PyTorch and TensorFlow by more than 3.5x for the popular 3D U-net architecture. Moreover, based on current pricing of preemptible or spot instances, cloud CPU inference with PZnet is competitive in cost with cloud GPU inference for U-net-style architectures.
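The cost-competitiveness claim is a throughput-per-dollar comparison; a back-of-the-envelope version is sketched below with hypothetical prices and throughputs (substitute current spot/preemptible quotes and measured voxels-per-second to redo it for a given setup).

```python
# Throughput-per-dollar comparison for cloud inference. All prices and
# throughputs are hypothetical placeholders, not figures from the paper.
instances = {
    # name:            ($ per hour, megavoxels inferred per second)
    "cpu_preemptible": (0.30, 4.0),
    "gpu_spot":        (0.90, 14.0),
}

for name, (price_per_hr, mvox_per_s) in instances.items():
    mvox_per_dollar = mvox_per_s * 3600 / price_per_hr
    print(f"{name:15s} {mvox_per_dollar:8.0f} megavoxels per dollar")
```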