
Showing papers by "Xuehai Qian" published in 2020


Proceedings ArticleDOI
09 Mar 2020
TL;DR: The proposed PatDNN is an end-to-end framework to efficiently execute DNNs on mobile devices with the help of a novel model compression technique (pattern-based pruning based on an extended ADMM solution framework) and a set of thorough architecture-aware compiler/code-generation-based optimizations: filter kernel reordering, compressed weight storage, register load redundancy elimination, and parameter auto-tuning.
Abstract: With the emergence of a spectrum of high-end mobile devices, many applications that formerly required desktop-level computation capability are being transferred to these devices. However, executing Deep Neural Network (DNN) inference is still challenging given the high computation and storage demands, especially if real-time performance with high accuracy is needed. Weight pruning of DNNs has been proposed, but existing schemes represent two extremes in the design space: non-structured pruning is fine-grained and accurate but not hardware friendly; structured pruning is coarse-grained and hardware-efficient but suffers higher accuracy loss. In this paper, we advance the state of the art by introducing a new dimension, fine-grained pruning patterns inside the coarse-grained structures, revealing a previously unknown point in the design space. With the higher accuracy enabled by fine-grained pruning patterns, the unique insight is to use the compiler to regain and guarantee high hardware efficiency. In other words, our method achieves the best of both worlds and is desirable across the theory/algorithm, compiler, and hardware levels. The proposed PatDNN is an end-to-end framework to efficiently execute DNNs on mobile devices with the help of a novel model compression technique (pattern-based pruning based on an extended ADMM solution framework) and a set of thorough architecture-aware compiler/code-generation-based optimizations: filter kernel reordering, compressed weight storage, register load redundancy elimination, and parameter auto-tuning. Evaluation results demonstrate that PatDNN outperforms three state-of-the-art end-to-end DNN frameworks, TensorFlow Lite, TVM, and Alibaba Mobile Neural Network, with speedups up to 44.5X, 11.4X, and 7.1X, respectively, with no accuracy compromise. Real-time inference of representative large-scale DNNs (e.g., VGG-16, ResNet-50) can be achieved on mobile devices.
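To make the pattern idea concrete, here is a minimal NumPy sketch of pattern-based pruning on a single 3x3 convolution kernel: each kernel keeps only the weights selected by one pattern from a small library. The four patterns and the magnitude-based assignment rule are illustrative stand-ins, not PatDNN's actual pattern library or its ADMM-based training step.

```python
import numpy as np

# Hypothetical pattern library: each pattern keeps 4 of the 9 weights.
PATTERNS = [
    np.array([[1, 1, 0], [1, 1, 0], [0, 0, 0]], dtype=bool),
    np.array([[0, 1, 1], [0, 1, 1], [0, 0, 0]], dtype=bool),
    np.array([[0, 0, 0], [1, 1, 0], [1, 1, 0]], dtype=bool),
    np.array([[0, 0, 0], [0, 1, 1], [0, 1, 1]], dtype=bool),
]

def assign_pattern(kernel):
    """Pick the pattern preserving the most weight magnitude, then
    zero out every weight outside the chosen pattern."""
    scores = [np.abs(kernel[p]).sum() for p in PATTERNS]
    best = PATTERNS[int(np.argmax(scores))]
    return kernel * best, best

kernel = np.random.randn(3, 3)
pruned, mask = assign_pattern(kernel)
```

Because every surviving kernel has the same number of nonzeros in one of a few known shapes, a compiler can reorder and pack kernels by pattern, which is what restores hardware efficiency.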

178 citations


Proceedings ArticleDOI
09 Mar 2020
TL;DR: Capuchin is proposed, a tensor-based GPU memory management module that reduces the memory footprint via tensor eviction/prefetching and recomputation, making memory management decisions based on dynamic tensor access patterns tracked at runtime.
Abstract: In recent years, deep learning has gained unprecedented success in various domains; the key to this success is larger and deeper deep neural networks (DNNs) that achieve very high accuracy. On the other hand, since GPU global memory is a scarce resource, large models also pose a significant challenge due to their memory requirements during training. This restriction limits the flexibility of DNN architecture exploration. In this paper, we propose Capuchin, a tensor-based GPU memory management module that reduces the memory footprint via tensor eviction/prefetching and recomputation. The key feature of Capuchin is that it makes memory management decisions based on dynamic tensor access patterns tracked at runtime. This design is motivated by the observation that the access pattern to tensors is regular across training iterations. Based on the identified patterns, one can exploit the total memory optimization space and offer fine-grained and flexible control of when and how to perform memory optimization techniques. We deploy Capuchin in a widely-used deep learning framework, TensorFlow, and show that Capuchin can reduce the memory footprint by up to 85% among 6 state-of-the-art DNNs compared to the original TensorFlow. In particular, for the NLP model BERT, the maximum batch size under Capuchin outperforms TensorFlow and gradient-checkpointing by 7x and 2.1x, respectively. We also show that Capuchin outperforms vDNN and gradient-checkpointing by up to 286% and 55% under the same memory oversubscription.
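As a rough illustration of the cost-based choice between eviction/prefetching and recomputation, the toy policy below compares a tensor's transfer time against the idle gap before its next access. The fields and decision rule are assumptions made for the sketch, not Capuchin's exact runtime heuristics.

```python
from dataclasses import dataclass

@dataclass
class TensorInfo:
    size_mb: float        # memory freed if this tensor is released
    swap_time_ms: float   # time to copy to/from host over PCIe
    recompute_ms: float   # time to regenerate it from its inputs
    idle_gap_ms: float    # gap between this access and the next one

def plan_action(t: TensorInfo) -> str:
    # If the transfer fits inside the gap before the tensor is needed
    # again, swapping is effectively free (it can be prefetched back).
    if t.swap_time_ms <= t.idle_gap_ms:
        return "swap"
    # Otherwise prefer recomputation when it is cheaper than the
    # exposed (non-overlapped) part of the transfer.
    if t.recompute_ms < t.swap_time_ms - t.idle_gap_ms:
        return "recompute"
    return "keep"

t = TensorInfo(size_mb=256, swap_time_ms=3.0, recompute_ms=1.2, idle_gap_ms=0.5)
print(plan_action(t))  # -> "recompute"
```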

94 citations


Proceedings ArticleDOI
09 Mar 2020
TL;DR: Prague is proposed, a high-performance heterogeneity-aware asynchronous decentralized training approach that achieves both high performance in homogeneous environments and heterogeneity tolerance through intensive synchronization optimization, exploring the interplay between algorithm and system implementation, or statistical and hardware efficiency.
Abstract: Distributed deep learning training usually adopts All-Reduce as the synchronization mechanism for data-parallel algorithms due to its high performance in homogeneous environments. However, its performance is bounded by the slowest worker among all workers, so it is significantly slower in heterogeneous settings. AD-PSGD, a recently proposed synchronization method which provides numerically fast convergence and heterogeneity tolerance, suffers from deadlock issues and high synchronization overhead. Is it possible to get the best of both worlds: a distributed training method that has both the high performance of All-Reduce in homogeneous environments and the heterogeneity tolerance of AD-PSGD? In this paper, we propose Prague, a high-performance heterogeneity-aware asynchronous decentralized training approach. We achieve this goal with intensive synchronization optimization by exploring the interplay between algorithm and system implementation, or statistical and hardware efficiency. To reduce synchronization cost, we propose a novel communication primitive, Partial All-Reduce, that enables fast synchronization among a group of workers. To reduce serialization cost, we propose static group scheduling in homogeneous environments and simple techniques, i.e., Group Buffer and Group Division, to largely eliminate conflicts with slightly reduced randomness. Our experiments show that in a homogeneous environment, Prague is 1.2x faster than the state-of-the-art implementation of All-Reduce, 5.3x faster than Parameter Server, and 3.7x faster than AD-PSGD. In a heterogeneous setting, Prague tolerates slowdowns well and achieves a 4.4x speedup over All-Reduce.
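A minimal sketch of the Partial All-Reduce idea: synchronization averages only within a scheduler-chosen group of workers rather than across all of them, so a straggler outside the group cannot stall the step. The dictionary-of-arrays setup is illustrative; the real primitive operates on model replicas over a network.

```python
import numpy as np

def partial_all_reduce(models, group):
    """Average model replicas only within `group` (a subset of worker
    ids), leaving workers outside the group untouched."""
    avg = np.mean([models[w] for w in group], axis=0)
    for w in group:
        models[w] = avg.copy()

# Toy run: 6 workers; a scheduler forms groups of 3 per step.
models = {w: np.random.randn(4) for w in range(6)}
partial_all_reduce(models, group=[0, 2, 5])  # workers 1, 3, 4 keep running
```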

41 citations


Proceedings ArticleDOI
TL;DR: PatDNN as discussed by the authors is an end-to-end framework to efficiently execute DNNs on mobile devices with the help of a novel model compression technique (pattern-based pruning based on an extended ADMM solution framework) and a set of thorough architecture-aware compiler- and code-generation-based optimizations.
Abstract: With the emergence of a spectrum of high-end mobile devices, many applications that formerly required desktop-level computation capability are being transferred to these devices. However, executing the inference of Deep Neural Networks (DNNs) is still challenging given the high computation and storage demands, especially if real-time performance with high accuracy is needed. Weight pruning of DNNs has been proposed, but existing schemes represent two extremes in the design space: non-structured pruning is fine-grained and accurate but not hardware friendly; structured pruning is coarse-grained and hardware-efficient but suffers higher accuracy loss. In this paper, we introduce a new dimension, fine-grained pruning patterns inside the coarse-grained structures, revealing a previously unknown point in the design space. With the higher accuracy enabled by fine-grained pruning patterns, the unique insight is to use the compiler to regain and guarantee high hardware efficiency. In other words, our method achieves the best of both worlds and is desirable across the theory/algorithm, compiler, and hardware levels. The proposed PatDNN is an end-to-end framework to efficiently execute DNNs on mobile devices with the help of a novel model compression technique (pattern-based pruning based on an extended ADMM solution framework) and a set of thorough architecture-aware compiler- and code-generation-based optimizations (filter kernel reordering, compressed weight storage, register load redundancy elimination, and parameter auto-tuning). Evaluation results demonstrate that PatDNN outperforms three state-of-the-art end-to-end DNN frameworks, TensorFlow Lite, TVM, and Alibaba Mobile Neural Network, with speedups up to 44.5x, 11.4x, and 7.1x, respectively, with no accuracy compromise. Real-time inference of representative large-scale DNNs (e.g., VGG-16, ResNet-50) can be achieved on mobile devices.

37 citations


Proceedings ArticleDOI
09 Mar 2020
TL;DR: This paper rethinks NVM deployment and makes a case for the asymmetric byte-addressable non-volatile memory architecture, which decouples servers from persistent data storage, and builds a framework based on this architecture that implements high-performance persistent data structure updates, concurrency control, and crash-consistency and replication.
Abstract: The byte-addressable non-volatile memory (NVM) is a promising technology since it simultaneously provides DRAM-like performance, disk-like capacity, and persistency. The current NVM deployment with byte-addressability is symmetric, where NVM devices are directly attached to servers. Due to its higher density, NVM provides much larger capacity and should be shared among servers. Unfortunately, in the symmetric setting, the availability of NVM devices is affected by the specific machine they are attached to. High availability can be achieved by replicating data to NVM on a remote machine. However, this requires full replication of the data structure in local memory, limiting the size of the working set. This paper rethinks NVM deployment and makes a case for the asymmetric byte-addressable non-volatile memory architecture, which decouples servers from persistent data storage. In the proposed AsymNVM architecture, NVM devices (i.e., back-end nodes) can be shared by multiple servers (i.e., front-end nodes) and provide recoverable persistent data structures. The asymmetric architecture, which follows the industry trend of resource disaggregation, is made possible by high-performance networks (e.g., RDMA). At the same time, it leads to a number of key problems, such as the still relatively long network latency, the persistency bottleneck, and the simple interface of the back-end NVM nodes. We build the AsymNVM framework based on this architecture, implementing: 1) high-performance persistent data structure updates; 2) NVM data management; 3) concurrency control; and 4) crash-consistency and replication. The key idea to remove the persistency bottleneck is the use of an operation log that reduces stall time due to RDMA writes and enables efficient batching and caching in front-end nodes. To evaluate performance, we construct eight widely used data structures and two transaction applications based on the AsymNVM framework. In a 10-node cluster equipped with real NVM devices, results show that AsymNVM achieves similar or better performance compared to the best possible symmetric architecture while enjoying the benefits of disaggregation. We find that the speedup brought by the proposed optimizations is drastic: 5-12x among all benchmarks.
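The operation-log idea can be sketched as follows: each update persists a small logical record (one RDMA write in the paper's setting, mocked here with a callback), while the actual data structure mutations are batched and applied lazily. The class, method names, and batching threshold are invented for this illustration.

```python
import json
import time

class OperationLog:
    """Toy front-end operation log: persist one small logical record
    per update, and apply batches of updates to the back-end data
    structure lazily instead of writing it synchronously."""

    def __init__(self, rdma_write, batch_size=32):
        self.rdma_write = rdma_write   # stand-in for an RDMA write verb
        self.batch = []
        self.batch_size = batch_size

    def append(self, op, *args):
        record = json.dumps({"op": op, "args": args, "ts": time.time()})
        self.rdma_write(record.encode())   # durable once this returns
        self.batch.append((op, args))      # cached locally, applied later
        if len(self.batch) >= self.batch_size:
            self.flush()

    def flush(self):
        # A real system would apply the batched logical operations to
        # the back-end persistent structure; the sketch just drains them.
        applied, self.batch = self.batch, []
        return applied

log = OperationLog(rdma_write=lambda payload: None)  # no-op transport
log.append("insert", 42)
```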

32 citations


Posted Content
TL;DR: The first solution that applies different quantization schemes to different rows of the weight matrix is proposed: an FPGA-centric mixed-scheme quantization (MSQ) with an ensemble of the proposed SP2 and fixed-point schemes, which can maintain or even increase accuracy due to better matching with weight distributions.
Abstract: Deep Neural Networks (DNNs) have achieved extraordinary performance in various application domains. To support diverse DNN models, efficient implementations of DNN inference on edge-computing platforms, e.g., ASICs, FPGAs, and embedded systems, are extensively investigated. Due to the huge model size and amount of computation, model compression is a critical step in deploying DNN models on edge devices. This paper focuses on weight quantization, a hardware-friendly model compression approach that is complementary to weight pruning. Unlike existing methods that use the same quantization scheme for all weights, we propose the first solution that applies different quantization schemes to different rows of the weight matrix. It is motivated by (1) the distributions of the weights in different rows not being the same; and (2) the potential of achieving better utilization of heterogeneous FPGA hardware resources. To achieve that, we first propose a hardware-friendly quantization scheme named sum-of-power-of-2 (SP2), suitable for Gaussian-like weight distributions, in which the multiplication arithmetic can be replaced with logic shifters and adders, thereby enabling highly efficient implementations with the FPGA LUT resources. In contrast, the existing fixed-point quantization is suitable for uniform-like weight distributions and can be implemented efficiently by DSP. Then, to fully explore the resources, we propose an FPGA-centric mixed-scheme quantization (MSQ) with an ensemble of the proposed SP2 and fixed-point schemes. Combining the two schemes can maintain or even increase accuracy due to better matching with weight distributions.
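A small sketch of sum-of-power-of-2 (SP2) quantization: each weight is mapped to the nearest value of the form ±(2^a + 2^b), so a multiplication reduces to two shifts and an add. The exponent range below is an assumption made for the example, not the paper's configuration.

```python
import numpy as np

def sp2_quantize(w, exp_range=range(0, -8, -1)):
    """Quantize w to the closest signed 2^a + 2^b with a, b drawn from
    a small exponent set (a == b yields pure powers of two)."""
    candidates = sorted({2.0**a + 2.0**b
                         for a in exp_range for b in exp_range} | {0.0})
    best = min(candidates, key=lambda c: abs(abs(w) - c))
    return np.sign(w) * best

weights = np.random.randn(4) * 0.5
q = np.vectorize(sp2_quantize)(weights)  # elementwise SP2 levels
```

Because the two exponents fully determine the multiplier, a row quantized with SP2 maps onto LUT shifters/adders, while fixed-point rows keep using DSP slices, which is the resource split MSQ exploits.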

31 citations


Proceedings ArticleDOI
01 Feb 2020
TL;DR: This work presents AccPar, a principled and systematic method of determining the tensor partition among heterogeneous accelerator arrays that considers the complete tensor partition space and can reveal previously unknown parallelism configurations.
Abstract: Deep neural network (DNN) accelerators, as an example of domain-specific architecture, have demonstrated great success in DNN inference. However, architectural acceleration for the equally important DNN training has not yet been fully studied. With forward data propagation, backward error propagation, and gradient calculation, DNN training is a more complicated process with higher computation and communication intensity. Because recent research demonstrates a diminishing specialization return, namely the "accelerator wall", we believe that a promising approach is to explore coarse-grained parallelism among multiple performance-bounded accelerators to support DNN training. Distributing computations on multiple heterogeneous accelerators to achieve high throughput and balanced execution, however, remains challenging. We present AccPar, a principled and systematic method of determining the tensor partition among heterogeneous accelerator arrays. Compared to prior empirical or unsystematic methods, AccPar considers the complete tensor partition space and can reveal previously unknown parallelism configurations. AccPar optimizes the performance based on a cost model that takes into account both the computation and communication costs of a heterogeneous execution environment. Hence, our method can avoid the drawbacks of existing approaches that use communication as a proxy for performance. The enhanced flexibility of tensor partitioning in AccPar allows flexible ratios of computation to be distributed among accelerators with different performance. The proposed search algorithm is also applicable to the emerging multi-path patterns in modern DNNs such as ResNet. We simulate AccPar on a heterogeneous accelerator array composed of both TPU-v2 and TPU-v3 accelerators for the training of large-scale DNN models such as AlexNet, the VGG series, and the ResNet series. The average performance improvements of the state-of-the-art "one weird trick" (OWT), HYPAR, and AccPar, normalized to the baseline data-parallelism scheme where each accelerator replicates the model and processes different input data in parallel, are 2.98x, 3.78x, and 6.30x, respectively.
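The flavor of such a cost model can be shown with a toy search: each candidate partition assigns per-accelerator compute and communication shares, the step time is the maximum across accelerators, and the cheapest partition wins. All throughput, bandwidth, and workload numbers below are made up; this is a sketch in the spirit of AccPar, not its actual model.

```python
def layer_cost(flops, comm_bytes, tput, bw):
    # Compute time plus communication time for one accelerator.
    return flops / tput + comm_bytes / bw

def best_partition(partitions, accels):
    """`partitions` maps a name to per-accelerator (flops, comm_bytes)
    shares; a heterogeneous split gives faster accelerators more work."""
    best, best_cost = None, float("inf")
    for name, shares in partitions.items():
        # The step finishes when the slowest accelerator finishes.
        cost = max(layer_cost(f, c, a["tput"], a["bw"])
                   for (f, c), a in zip(shares, accels))
        if cost < best_cost:
            best, best_cost = name, cost
    return best, best_cost

accels = [{"tput": 180e12, "bw": 100e9},   # TPU-v3-like numbers (made up)
          {"tput": 45e12,  "bw": 100e9}]   # TPU-v2-like numbers (made up)
partitions = {
    "batch-50/50": [(1.0e12, 8e6), (1.0e12, 8e6)],
    "batch-80/20": [(1.6e12, 8e6), (0.4e12, 8e6)],
}
print(best_partition(partitions, accels))  # the skewed split wins
```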

30 citations


Proceedings ArticleDOI
09 Mar 2020
TL;DR: DNNGuard is proposed, an elastic heterogeneous DNN accelerator architecture that can efficiently orchestrate the simultaneous execution of the original (target) DNN network and the detection algorithm or network that detects adversarial sample attacks; it is implemented based on RISC-V and NVDLA.
Abstract: Recent studies show that Deep Neural Networks (DNNs) are vulnerable to adversarial samples that are generated by perturbing correctly classified inputs to cause the misclassification of DNN models. This can potentially lead to disastrous consequences, especially in security-sensitive applications such as unmanned vehicles, finance, and healthcare. Existing adversarial defense methods require a variety of computing units to effectively detect adversarial samples. However, deploying adversarial sample defense methods in existing DNN accelerators leads to many key issues in terms of cost, computational efficiency, and information security. Moreover, existing DNN accelerators cannot provide effective support for the special computation required in the defense methods. To address these new challenges, this paper proposes DNNGuard, an elastic heterogeneous DNN accelerator architecture that can efficiently orchestrate the simultaneous execution of the original (target) DNN network and the detection algorithm or network that detects adversarial sample attacks. The architecture tightly couples the DNN accelerator with the CPU core in one chip for efficient data transfer and information protection. An elastic DNN accelerator is designed to run the target network and the detection network simultaneously. Besides the capability to execute two networks at the same time, DNNGuard also supports non-DNN computing and allows special layers of the neural network to be effectively supported by the CPU core. To reduce off-chip traffic and improve resource utilization, we propose a dynamic resource scheduling mechanism. To build a general implementation framework, we propose an extended AI instruction set for neural network synchronization, task scheduling, and efficient data interaction. We implement DNNGuard based on RISC-V and NVDLA, and evaluate its performance impact with six target networks and three typical detection networks. Experimental results show that DNNGuard can effectively validate the legitimacy of the input samples in parallel with the target DNN model, achieving an average 1.42x speedup compared with the state-of-the-art accelerators.

22 citations


Journal ArticleDOI
TL;DR: The proposed framework significantly mitigates the model accuracy loss due to reduced data precision in a cohesive manner, constructing a comprehensive STT-MRAM accelerator system for fast NN computation with high energy efficiency and low cost.
Abstract: A large variety of applications rely on deep learning to process big data, learn sophisticated features, and perform complicated tasks. Utilizing the unique characteristics of emerging non-volatile memory (NVM), including the crossbar array structure and gray-scale cell resistances, to perform neural network (NN) computation is a well-studied approach to accelerating deep learning applications. Compared to other NVM technologies, STT-MRAM has unique advantages in performing NN computation. However, state-of-the-art research has not utilized STT-MRAM for deep learning acceleration due to its device- and architecture-level challenges. Consequently, this paper enables STT-MRAM, for the first time, as an effective and practical deep learning accelerator. In particular, it proposes iCELIA, a full-stack framework spanning multiple design levels, including device-level fabrication, circuit-level enhancements, architecture-level synaptic weight quantization, and system-level accelerator design. The primary contributions of iCELIA over our prior work CELIA include a new non-uniform weight quantization scheme and a much enhanced accelerator system design. The proposed framework significantly mitigates the model accuracy loss due to reduced data precision in a cohesive manner, constructing a comprehensive STT-MRAM accelerator system for fast NN computation with high energy efficiency and low cost.
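Non-uniform weight quantization of the kind argued for here can be sketched with 1-D k-means (Lloyd's algorithm), which places more quantization levels where the weight distribution is dense. This is a generic stand-in under that assumption, not iCELIA's exact scheme.

```python
import numpy as np

def nonuniform_quantize(weights, levels=16, iters=20):
    """1-D k-means: the cluster centers become the non-uniform
    quantization levels; dense regions of the distribution get
    more (closer-spaced) levels than the tails."""
    centers = np.quantile(weights, np.linspace(0, 1, levels))  # seed
    for _ in range(iters):
        idx = np.argmin(np.abs(weights[:, None] - centers[None, :]), axis=1)
        for k in range(levels):
            if np.any(idx == k):
                centers[k] = weights[idx == k].mean()
    return centers[idx], centers  # quantized weights, level table

w = np.random.randn(10000) * 0.1      # Gaussian-like layer weights
qw, level_table = nonuniform_quantize(w)
```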

18 citations


Posted Content
TL;DR: This work proposes a randomized index mechanism for the branch predictor that disrupts the correspondence between the branch instruction address and the branch predictor entry, thus increasing the noise for malicious perception attacks.
Abstract: Recently exposed vulnerabilities reveal the necessity to improve the security of branch predictors. Branch predictors record history about the execution of different programs, and such information from different processes is stored in the same structure and is thus accessible to each other. This leaves attackers with opportunities for malicious training and malicious perception. Instead of flush-based or physical isolation of hardware resources, we want to achieve isolation of the content in these hardware tables with lightweight processing using randomization as follows. (1) Content encoding. We propose to use hardware-based thread-private random numbers to encode the contents of the branch predictor tables (both direction and destination histories), which we call XOR-BP. Specifically, the data is encoded by an XOR operation with the key before being written to the table and decoded after being read from the table. Such a mechanism obfuscates the information, making cross-process or cross-privilege-level analysis and perception difficult. It achieves a similar effect to logical isolation but adds little in terms of space or time overheads. (2) Index encoding. We propose a randomized index mechanism for the branch predictor (Noisy-XOR-BP). Similar to XOR-BP, another thread-private random number is used together with the branch instruction address as the input to compute the index of the branch predictor. This randomized indexing mechanism disrupts the correspondence between the branch instruction address and the branch predictor entry, thus increasing the noise for malicious perception attacks. Our analyses using an FPGA-based RISC-V processor prototype and additional auxiliary simulations suggest that the proposed mechanisms incur a very small performance cost while providing strong protection.
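A toy software model of the two mechanisms: entries are XOR-encoded with a thread-private key (XOR-BP), and the table index is salted with a second key (Noisy-XOR-BP). The table geometry and key widths are arbitrary choices for the sketch.

```python
import secrets

class XorBP:
    """Toy XOR-BP/Noisy-XOR-BP branch target table: contents are
    XOR-encoded with a per-thread key, and the index is derived from
    PC XOR a second per-thread key, so another thread sees neither
    meaningful contents nor meaningful entry positions."""

    def __init__(self, bits=10):
        self.mask = (1 << bits) - 1
        self.table = [0] * (1 << bits)
        self.data_key = secrets.randbits(32)     # thread-private
        self.index_key = secrets.randbits(bits)  # thread-private

    def _index(self, pc):
        return (pc ^ self.index_key) & self.mask  # randomized indexing

    def update(self, pc, target):
        self.table[self._index(pc)] = target ^ self.data_key  # encode

    def predict(self, pc):
        return self.table[self._index(pc)] ^ self.data_key    # decode

bp = XorBP()
bp.update(0x400123, 0x400200)
assert bp.predict(0x400123) == 0x400200  # same thread decodes correctly
```

A different thread, holding different keys, would read decoy values from decoy positions, which is the "logical isolation" effect the paper describes.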

16 citations


Posted Content
TL;DR: AccQOC is a comprehensive static/dynamic hybrid workflow to transform gate groups to pulses using QOC (Quantum Optimal Control) within a reasonable compilation time budget; it reduces compilation time by generating an ordered sequence of groups in which the sum of similarity among consecutive groups in the sequence is minimized.
Abstract: In the last decades, we have witnessed the rapid growth of Quantum Computing. In the current Noisy Intermediate-Scale Quantum (NISQ) era, the capability of a quantum machine is limited by the decoherence time, gate fidelity, and the number of Qubits. Current quantum computing applications are far from real "quantum supremacy" due to the fragile physical Qubits, which can only be entangled for a few microseconds. Recent works use quantum optimal control to reduce the latency of quantum circuits, thereby effectively increasing quantum volume. However, the key challenge of this technique is the large overhead due to long compilation time. In this paper, we propose AccQOC, a comprehensive static/dynamic hybrid workflow to transform gate groups (equivalent to matrices) to pulses using QOC (Quantum Optimal Control) with a reasonable compilation time budget. AccQOC is composed of static pre-compilation and accelerated dynamic compilation. With the methodology of AccQOC, we reach a balance between compilation time and overall latency. The results show that accelerated compilation based on MST achieves a 9.88x compilation speedup compared to the standard compilation of each group while maintaining an average 2.43x latency reduction compared with gate-based compilation.

Posted Content
TL;DR: PermDNN, a novel approach to generate and execute hardware-friendly structured sparse DNN models using permuted diagonal matrices, is proposed, along with a multi-processing-element (PE) fully-connected (FC) layer-targeted computing engine that is highly scalable and flexible.
Abstract: The deep neural network (DNN) has emerged as the most important and popular artificial intelligence (AI) technique. The growth of model size poses a key energy efficiency challenge for the underlying computing platform. Thus, model compression becomes a crucial problem. However, current approaches are limited by various drawbacks. Specifically, the network sparsification approach suffers from irregularity, heuristic nature, and large indexing overhead. On the other hand, the recent structured matrix-based approach (i.e., CirCNN) is limited by the relatively complex arithmetic computation (i.e., FFT), less flexible compression ratio, and its inability to fully utilize input sparsity. To address these drawbacks, this paper proposes PermDNN, a novel approach to generate and execute hardware-friendly structured sparse DNN models using permuted diagonal matrices. Compared with the unstructured sparsification approach, PermDNN eliminates the drawbacks of indexing overhead, heuristic compression effects, and time-consuming retraining. Compared with the circulant structure-imposing approach, PermDNN enjoys the benefits of higher reduction in computational complexity, flexible compression ratio, simple arithmetic computation, and full utilization of input sparsity. We propose the PermDNN architecture, a multi-processing-element (PE) fully-connected (FC) layer-targeted computing engine. The entire architecture is highly scalable and flexible, and hence it can support the needs of different applications with different model configurations. We implement a 32-PE design using CMOS 28nm technology. Compared with EIE, PermDNN achieves 3.3x~4.8x higher throughput, 5.9x~8.5x better area efficiency, and 2.8x~4.0x better energy efficiency on different workloads. Compared with CirCNN, PermDNN achieves 11.51x higher throughput and 3.89x better energy efficiency.
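The storage and compute savings of a permuted diagonal block are easy to see in a sketch: row i holds its single nonzero at column (i + p) mod n, so an n-by-n block needs only n values plus one offset, and a matrix-vector product becomes a gather plus an elementwise multiply. The layout below is a hypothetical illustration of that structure.

```python
import numpy as np

def permdiag_matvec(diag_vals, p, x):
    """y = M @ x for an n-by-n permuted-diagonal block M whose row i
    has its only nonzero diag_vals[i] at column (i + p) % n. Storage
    is n values plus one offset instead of n*n weights."""
    n = len(diag_vals)
    idx = (np.arange(n) + p) % n
    return diag_vals * x[idx]

# Cross-check against the equivalent dense block.
n, p = 8, 3
diag_vals = np.random.randn(n)
x = np.random.randn(n)
dense = np.zeros((n, n))
dense[np.arange(n), (np.arange(n) + p) % n] = diag_vals
assert np.allclose(permdiag_matvec(diag_vals, p, x), dense @ x)
```

Unlike unstructured sparsity, the nonzero positions are implied by a single offset p, so no per-weight index storage or retraining-driven mask is needed.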

Proceedings ArticleDOI
11 Jun 2020
TL;DR: SympleGraph is proposed, a novel distributed graph processing framework that precisely enforces loop-carried dependency, i.e., when a condition is satisfied by a neighbor, all following neighbors can be skipped and achieves a good trade-off between precise semantics and parallelism.
Abstract: Graph analytics is an important way to understand relationships in real-world applications. In the age of big data, graphs have grown to billions of edges. This motivates distributed graph processing. Graph processing frameworks ask programmers to specify graph computations in user-defined functions (UDFs) of a graph-oriented programming model. Due to the nature of distributed execution, current frameworks cannot precisely enforce the semantics of UDFs, leading to unnecessary computation and communication. In essence, there exists a gap between the programming model and runtime execution. This paper proposes SympleGraph, a novel distributed graph processing framework that precisely enforces loop-carried dependency, i.e., when a condition is satisfied by a neighbor, all following neighbors can be skipped. SympleGraph instruments the UDFs to express the loop-carried dependency; the distributed execution framework then enforces the precise semantics by performing dependency propagation dynamically. Enforcing loop-carried dependency requires the sequential processing of the neighbors of each vertex distributed across different nodes. Therefore, the major challenge is to enable sufficient parallelism to achieve high performance. We propose to use circulant scheduling in the framework to allow different machines to process disjoint sets of edges/vertices in parallel while satisfying the sequential requirement. It achieves a good trade-off between precise semantics and parallelism. The significant speedups in most graphs and algorithms indicate that the benefits of eliminating unnecessary computation and communication overshadow the reduced parallelism. Communication efficiency is further optimized by 1) selectively propagating dependency for large-degree vertices to increase net benefits; and 2) double buffering to hide communication latency. In a 16-node cluster, SympleGraph outperforms the state-of-the-art systems Gemini and D-Galois on average by 1.42x and 3.30x, and by up to 2.30x and 7.76x, respectively. The communication reduction compared to Gemini is 40.95% on average and up to 67.48%.
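A minimal sketch of enforcing the loop-carried dependency: a vertex's neighbor list is split across machines but processed in a fixed order, and once the UDF's condition fires, a "done" flag propagates so remaining partitions skip their neighbors entirely. Function and variable names are invented for the sketch.

```python
def apply_udf_with_dependency(partitions, udf):
    """Process a vertex's neighbor list, split across machines, in a
    fixed sequential order; once the UDF signals completion, the flag
    is propagated so all following partitions are skipped."""
    done = False
    for part in partitions:          # one neighbor sublist per machine
        if done:
            continue                 # whole remote partition skipped
        for v in part:
            if udf(v):               # True means "break the loop"
                done = True
                break

# Toy UDF: stop scanning once any neighbor is already visited.
visited = {3, 9}
apply_udf_with_dependency([[1, 2, 3], [4, 5], [9]],
                          udf=lambda v: v in visited)
```

In the real system, circulant scheduling lets each machine work on different vertices' partitions in parallel while each individual vertex's partitions are still visited in this sequential order.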

Proceedings ArticleDOI
30 May 2020
TL;DR: In this article, the authors proposed an approach to reduce the compilation time by generating an ordered sequence of groups in which the sum of similarity among consecutive groups in the sequence is minimized.
Abstract: In the last decades, we have witnessed the rapid growth of Quantum Computing. In the current Noisy Intermediate-Scale Quantum (NISQ) era, the capability of a quantum machine is limited by the decoherence time, gate fidelity, and the number of Qubits. Current quantum computing applications are far from real "quantum supremacy" due to the fragile physical Qubits, which can only be entangled for a few microseconds. Recent works use quantum optimal control to reduce the latency of quantum circuits, thereby effectively increasing quantum volume. However, the key challenge of this technique is the large overhead due to long compilation time. In this paper, we propose AccQOC, a comprehensive static/dynamic hybrid workflow to transform gate groups (equivalent to matrices) to pulses using QOC (Quantum Optimal Control) with a reasonable compilation time budget. AccQOC is composed of static pre-compilation and accelerated dynamic compilation. After the quantum program is mapped to the quantum circuit with our heuristic mapping algorithm considering crosstalk, we leverage static pre-compilation to generate pulses for the frequently used groups, eliminating the dynamic compilation time for them. The pulse is generated using QOC with binary search to determine the latency. For a new program, we use the same policy to generate groups, thus avoiding overhead for the "covered" groups. The dynamic compilation deals with "un-covered" groups with accelerated pulse generation. The key insight is that the pulse of a group can be generated faster based on the generated pulse of a similar group. We propose to reduce the compilation time by generating an ordered sequence of groups in which the sum of similarity among consecutive groups in the sequence is minimized. We can find the sequence by constructing a similarity graph, a complete graph in which each vertex is a gate group and the weight of an edge is the similarity between the two groups it connects, and then constructing a Minimum Spanning Tree (MST) for the similarity graph. With the methodology of AccQOC, we reach a balance between compilation time and overall latency. The results show that accelerated compilation based on MST achieves a 9.88x compilation speedup compared to the standard compilation of each group while maintaining an average 2.43x latency reduction compared with gate-based compilation.
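The similarity-graph/MST step can be sketched as follows: build the complete graph over gate groups with a distance-like edge weight, extract a minimum spanning tree with Prim's algorithm, and compile groups in a tree walk so each QOC solve is seeded by a similar, already-compiled group. The group encoding and distance function below are placeholders, not AccQOC's actual similarity metric.

```python
def mst_order(groups, distance):
    """Prim's MST over the complete similarity graph (edge weight =
    distance between two gate groups), then a DFS walk of the tree:
    each group is compiled right after a nearby one whose finished
    pulse can seed its QOC solve."""
    n = len(groups)
    in_tree, edges = {0}, []
    while len(in_tree) < n:
        u, v = min(((u, v) for u in in_tree for v in range(n)
                    if v not in in_tree),
                   key=lambda e: distance(groups[e[0]], groups[e[1]]))
        edges.append((u, v))
        in_tree.add(v)
    children = {}
    for u, v in edges:
        children.setdefault(u, []).append(v)
    order, stack = [], [0]           # DFS walk gives the compile order
    while stack:
        node = stack.pop()
        order.append(node)
        stack.extend(children.get(node, []))
    return [groups[i] for i in order]

# Toy distance: number of differing gates between two 2-gate groups.
groups = ["HX", "HZ", "XZ", "HH"]
dist = lambda a, b: sum(x != y for x, y in zip(a, b))
print(mst_order(groups, dist))
```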

Posted Content
TL;DR: DwarvesGraph is presented, the first graph mining system based on pattern decomposition, which first decomposes the target pattern into several subpatterns and then computes the count of each; it is capable of mining large patterns that existing systems cannot handle.
Abstract: This paper presents DwarvesGraph, the first compilation-based graph pattern mining (GPM) system based on pattern decomposition algorithms, which decompose a pattern into several subpatterns and find the count of each. Such algorithms can be orders of magnitude faster than the algorithms used in current GPM systems, because the execution time of pattern enumeration drastically increases with pattern size. We define a novel partial-embedding-centric programming model that supports various applications. We propose an efficient on-the-fly aggregation of subpattern embeddings to reduce memory consumption and random accesses. The DwarvesGraph compiler, using the abstract syntax tree (AST) as its intermediate representation (IR), can apply both conventional optimizations and a novel pattern-aware loop rewriting optimization to eliminate redundant computation that cannot be removed with standard methods. To estimate implementation cost based on the AST, we propose a simple locality-aware cost model and an advanced approximate-mining-based cost model to accurately capture the characteristics of real-world graphs. DwarvesGraph fully automates algorithm generation, optimization, and selection in the search space. As a general GPM system, DwarvesGraph achieves performance much closer to the best native pattern decomposition algorithms.

Posted Content
28 Feb 2020
TL;DR: RCC is built, the first unified and comprehensive RDMA-enabled distributed transaction processing framework supporting six concurrency control protocols using either two-sided or one-sided primitives, and the first and most comprehensive study of the six representative distributed concurrency control protocols is conducted on two clusters with different RDMA network capabilities.
Abstract: On-line transaction processing (OLTP) applications require efficient distributed transaction execution. When a transaction accesses multiple records on remote machines, network performance is a crucial factor affecting transaction latency and throughput. Due to its high bandwidth and very low latency, RDMA (Remote Direct Memory Access) has achieved much higher performance for distributed transactions than traditional TCP-based systems. RDMA provides primitives for both two-sided and one-sided communication. Although recent works have intensively studied the benefits of RDMA in distributed transaction systems, they either focus on primitive-level comparisons of the two communication models (one-sided vs. two-sided) or study only one concurrency control protocol. A comprehensive understanding of the implications of RDMA for various concurrency control protocols is an open problem. In this paper, we build RCC, the first unified and comprehensive RDMA-enabled distributed transaction processing framework supporting six concurrency control protocols using either two-sided or one-sided primitives. We intensively optimize the performance of each protocol without bias, using known techniques such as co-routines, outstanding requests, and doorbell batching. Based on RCC, we conduct the first and, to the best of our knowledge, most comprehensive study of the six representative distributed concurrency control protocols on two clusters with different RDMA network capabilities.

Proceedings ArticleDOI
07 Sep 2020
TL;DR: TUPIM is a significant advance over the state-of-the-art because it transparently extends the scope of PIM to all applications, without requiring any source code, programming-model, or compiler modifications.
Abstract: Recently, processing-in-memory (PIM) has been gaining much attention because it can minimize data movement by performing computation in memory. Existing PIM solutions require a number of additional setup-time procedures, including code re-writing and re-compiling, code annotations, detailed program profiling, etc. These requirements, however, potentially prevent existing executable binaries from benefiting from PIM architectures. Legacy binaries without any source code cannot run on existing PIM architectures at all. To solve these challenges, we propose a transparent and universal PIM (TUPIM), a novel PIM architecture that can execute unmodified binaries while taking advantage of PIM. TUPIM is a significant advance over the state-of-the-art because it transparently extends the scope of PIM to all applications, without requiring any source code, programming-model, or compiler modifications. Experiments show that TUPIM achieves a 2.2x speedup on average (up to 3.67x) and 15.7% energy reduction compared with conventional CPU-only execution.

Journal ArticleDOI
TL;DR: This article presents a hybrid framework for the efficient performance estimation and work-group size pruning of OpenCL workloads on GPUs, which can predict the runtime performance with an average error of 17.04 percent and reduce the program design space by an average of 78.47 percent.
Abstract: Graphics Processing Units (GPUs) play a vital role in the state-of-the-art high-performance scientific computing realm, and work towards their performance analysis is crucial but nontrivial. Extant GPU performance models are far from practical use, while fine-grained GPU simulation requires a considerably large time cost. Moreover, massive amounts of designs with various program inputs and parameter settings pose a challenge for the efficient performance estimation and tuning of parallel GPU applications. To this end, this article presents a hybrid framework for the efficient performance estimation and work-group size pruning of OpenCL workloads on GPUs. The framework contains a static module that extracts the kernel execution trace from the high-level source code and a dynamic module that mimics the kernel execution flow to estimate runtime performance. For design space pruning, an extra analysis is performed to filter out redundant work-group sizes with duplicated execution traces and inferior pipelines. The proposed framework does not require any program runs to estimate the performance and find the optimal or near-optimal designs. Experiments on four Commercial Off-The-Shelf (COTS) Nvidia GPUs show that the framework can predict the runtime performance with an average error of 17.04 percent and reduce the program design space by an average of 78.47 percent.
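The pruning step can be illustrated as follows: candidate work-group sizes whose (statically extracted) execution traces are identical would pipeline identically, so only one representative per trace class needs to be evaluated. The trace model below is a made-up placeholder standing in for the framework's real trace extraction.

```python
from collections import defaultdict

def prune_workgroup_sizes(candidates, trace_of):
    """Group candidate work-group sizes by a hash of their execution
    trace and keep one representative per group; the rest of the
    design space is redundant."""
    groups = defaultdict(list)
    for wg in candidates:
        groups[hash(trace_of(wg))].append(wg)
    return [sizes[0] for sizes in groups.values()]

# Toy trace model: execution shape depends only on a capped wave count.
candidates = [32, 64, 128, 256, 512, 1024]
trace = lambda wg: min(4, 1024 // wg)
print(prune_workgroup_sizes(candidates, trace))  # [32, 512, 1024]
```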

Posted Content
TL;DR: The key problem solved is the first demonstration of a reversible cache coherence protocol that naturally rolls back the effects of squashed speculative execution; two concrete coherence protocols are designed, ReversiCC-Lazy and ReversiCC-Eager, providing the same functionality with different trade-offs.
Abstract: Recent works such as InvisiSpec, SafeSpec, and CleanupSpec, among others, provide promising solutions to defend against speculation-induced (transient) attacks. However, they introduce delay either when a speculative load becomes safe, in the redo approach, or when it is squashed, in the undo approach. We argue that this is due to the lack of fundamental mechanisms for reversing the effects of speculation in a cache coherence protocol. Based on a mostly unmodified coherence protocol, the redo approach avoids leaving traces at the expense of double loads; the undo approach "stops the world" in recovery to avoid interference. This paper provides the first solution to this fundamental problem. Specifically, we propose ReversiSpec, a comprehensive solution to mitigate speculation-induced attacks. ReversiSpec is a reversible approach that uses speculative buffers in all cache levels to record the effects of speculative execution. When a speculative load becomes safe, a merge operation adds the effects of speculative execution to the global state. When a speculative load is squashed, a purge operation clears the buffered speculative execution states from the speculative buffer. The key problem solved by the paper is the first demonstration of a reversible cache coherence protocol that naturally rolls back the effects of squashed speculative execution. We design two concrete coherence protocols, ReversiCC-Lazy and ReversiCC-Eager, providing the same functionality with different trade-offs. Our solution closes a crucial gap in modern architecture: just like the mechanisms that roll back speculation effects inside a processor, ReversiSpec provides the mechanisms to roll back the state of the whole coherence protocol.
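A toy software analogue of the speculative buffer with merge/purge semantics: speculative loads fill a side buffer rather than the cache, merge commits the lines when speculation is safe, and purge discards them without leaving a footprint. This abstracts away the coherence-protocol machinery that is the paper's actual contribution; names are invented.

```python
class SpeculativeBuffer:
    """Toy per-cache speculative buffer: speculative loads fill the
    buffer instead of the cache; merge() commits the lines to the
    globally visible state, purge() discards them without a trace."""

    def __init__(self, cache):
        self.cache = cache          # dict: line address -> data
        self.pending = {}           # speculative lines, not yet visible

    def spec_load(self, addr, fetch):
        if addr in self.cache:
            return self.cache[addr]
        if addr not in self.pending:
            self.pending[addr] = fetch(addr)   # fill buffer, not cache
        return self.pending[addr]

    def merge(self):                # speculation turned out safe
        self.cache.update(self.pending)
        self.pending.clear()

    def purge(self):                # squash: leave no side effects
        self.pending.clear()

cache = {}
buf = SpeculativeBuffer(cache)
buf.spec_load(0x80, fetch=lambda a: f"line@{a:#x}")
buf.purge()
assert 0x80 not in cache            # no footprint after a squash
```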

Posted Content
TL;DR: IntersectX as discussed by the authors is a vertical approach to accelerate graph mining with a stream instruction set extension and architectural support based on a conventional processor, which can be considered a natural extension of the traditional instructions for ordinary scalar values.
Abstract: Graph mining applications try to find all embeddings that match specific patterns. Compared to traditional graph computation, graph mining applications are computation intensive. The state-of-the-art method, pattern enumeration, specifically constructs the embeddings that satisfy the pattern, leading to significant speedups over the exhaustive-check method. However, the key operation, intersection, poses challenges to conventional architectures and takes substantial execution time. In this paper, we propose IntersectX, a vertical approach to accelerate graph mining with a stream instruction set extension and architectural support based on a conventional processor. The stream-based ISA can be considered a natural extension of the traditional instructions for ordinary scalar values. We develop the IntersectX architecture composed of specialized mechanisms that efficiently implement the stream ISA extensions, including: (1) the Stream Mapping Table (SMT) that records the mapping between stream IDs and stream registers; (2) the Stream Cache (S-Cache) that enables efficient stream data movement; (3) tracking of the dependency between streams with an intersection property; (4) the Stream Value Processing Unit (SVPU) that implements sparse value computations; and (5) the nested intersection translator that generates micro-op sequences for implementing nested intersections. We implement the IntersectX ISA and architecture on zsim. We use 7 popular graph mining applications (triangle/three-chain/tailed-triangle counting, 3-motif mining, 4/5-clique counting, and FSM) on 10 real graphs. Our experiments show that IntersectX significantly outperforms our CPU baseline and GRAMER, a state-of-the-art graph mining accelerator. IntersectX's speedups over the CPU baseline and GRAMER are on average 10.7x and 40.1x (up to 83.9x and 181.8x), respectively.
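The core operation the stream ISA accelerates is the intersection of sorted adjacency lists. A two-pointer software version, plus its use in triangle counting, looks like this; it is a plain Python stand-in for what the stream registers and SVPU do in hardware.

```python
def intersect(a, b):
    """Two-pointer intersection of two sorted adjacency lists, the key
    operation of pattern enumeration."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i]); i += 1; j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

# Triangle counting: one triangle per common neighbor w > v of edge (u, v).
adj = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1, 3], 3: [0, 2]}
tris = sum(len(intersect(adj[u], [w for w in adj[v] if w > v]))
           for u in adj for v in adj[u] if v > u)
print(tris)  # 2 triangles: (0,1,2) and (0,2,3)
```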

Posted Content
Linghao Song, Fan Chen, Xuehai Qian, Hai Li, Yiran Chen 
TL;DR: ReFloat enables the principled local fine-tuning of floating-point representation and reconciles the conflicting goals by storing the exponent offsets from a common base among matrix values in a block, which is the granularity of computation in ReRAM.
Abstract: We propose ReFloat, a data format and an accelerator architecture for low-cost and high-performance floating-point processing in ReRAM for scientific computing. ReFloat reduces the number of bits for floating-point representation and processing to achieve a smaller number of ReRAM crossbars and fewer processing cycles for floating-point matrix-vector multiplication. In the ReFloat data format, for the scalars in a matrix block, an exponent base is derived by optimization; each scalar then stores only an exponent offset from that base, and the number of fraction bits is reduced. The exponent base optimization is enabled by taking advantage of the value locality in real-world matrices. After defining the ReFloat data format, we develop the conversion scheme from the default double-precision floating-point format to the ReFloat format, the computation procedure, and the low-cost, high-performance floating-point processing architecture in ReRAM. With ReFloat, we find that for all 12 matrices, only 3 bits for the matrix exponent, matrix fraction, and vector exponent, and 8 or 16 bits for the vector fraction, are sufficient to ensure convergence of the CG and BiCGSTAB solvers. This translates to an average speedup of 20.10x/24.59x on CG/BiCGSTAB compared with a GPU baseline and an average speedup of 18.86x/54.00x on CG/BiCGSTAB compared with a state-of-the-art ReRAM-based accelerator for scientific computing.
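A rough sketch of the encoding: pick a shared exponent base per matrix block, then keep only a sign, a small exponent offset, and a truncated fraction per value. The bit widths and the mean-exponent base rule below are illustrative assumptions, not the paper's optimized base selection.

```python
import numpy as np

def refloat_encode(block, off_bits=3, frac_bits=3):
    """Toy ReFloat-style encoding of one matrix block: a shared
    exponent base, a clipped per-value exponent offset, and a
    fraction rounded to frac_bits fractional bits. Returns the
    reconstructed (lossy) block and the chosen base."""
    exps = np.floor(np.log2(np.abs(block) + 1e-300))
    base = int(np.round(exps.mean()))          # shared exponent base
    lo, hi = -(2**(off_bits - 1)), 2**(off_bits - 1) - 1
    off = np.clip(exps - base, lo, hi)         # small per-value offset
    scale = 2.0 ** (off + base)
    frac = np.round(np.abs(block) / scale * 2**frac_bits) / 2**frac_bits
    return np.sign(block) * frac * scale, base

block = np.random.randn(4, 4)
approx, base = refloat_encode(block)
print(np.max(np.abs(block - approx)))  # small, thanks to value locality
```

Value locality is what makes the clipping step mostly harmless: exponents within a block cluster near the base, so a 3-bit offset usually suffices.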

Posted Content
TL;DR: IntersectX as discussed by the authors is a graph pattern mining accelerator with stream instruction set extension and architectural support based on a conventional processor, which can be considered as a natural extension to the traditional instructions that operate on scalar values.
Abstract: Graph pattern mining applications try to find all embeddings that match specific patterns. Compared to traditional graph computation, graph mining applications are computation-intensive. The state-of-the-art method, pattern enumeration, constructs the embeddings that match the pattern. The key operation, the intersection of two edge lists, poses challenges to conventional architectures and requires substantial execution time. In this paper, we propose IntersectX, a vertically designed accelerator for pattern enumeration with a stream instruction set extension and architectural support based on a conventional processor. The stream-based ISA can be considered a natural extension of the traditional instructions that operate on scalar values. We develop the IntersectX architecture composed of specialized mechanisms that efficiently implement the stream ISA extensions, including: (1) the Stream Mapping Table (SMT) that records the mapping between stream IDs and stream registers; (2) the read-only Stream Cache (S-Cache) that enables efficient stream data movement; (3) tracking of the dependency between streams with an intersection property; (4) the Stream Value Processing Unit (SVPU) that implements sparse value computations; and (5) the nested intersection translator that generates micro-op sequences for implementing nested intersections. We implement the IntersectX ISA and architecture on zSim and test it with seven popular graph mining applications (triangle/three-chain/tailed-triangle counting, 3-motif mining, 4/5-clique counting, and FSM) on ten real graphs. We develop our own implementation of AutoMine (InHouseAutomine). The results show that IntersectX significantly outperforms InHouseAutomine on CPU, on average by 10.7x and up to 83.9x, and GRAMER, a state-of-the-art graph pattern mining accelerator based on exhaustive checking, on average by 40.1x and up to 181.8x.

Posted Content
TL;DR: The reversible coherence protocol (RCP) enables invisible speculative loads in a general memory hierarchy by including speculative loads and merge/purge operations in the interface between the processor and cache coherence.
Abstract: We propose the first Reversible Coherence Protocol (RCP), a new protocol designed from the ground up that enables invisible speculative loads. RCP takes a bold approach by including speculative loads and merge/purge operations in the interface between the processor and cache coherence, and allowing them to participate in the coherence protocol. This means that speculative loads, ordinary loads/stores, and merge/purge operations can all affect the state of a given cache line. RCP is the first coherence protocol that enables the commit and squash of speculative loads among distributed cache components in a general memory hierarchy. RCP incurs an average slowdown of (3.0%, 8.3%, 7.4%) on (SPEC2006, SPEC2017, PARSEC), which is lower than the (26.5%, 12%, 18.3%) of InvisiSpec and the (3.2%, 9.4%, 24.2%) of CleanupSpec. The coherence traffic overhead is on average 46%, compared to 40% and 27% for InvisiSpec and CleanupSpec, respectively. Even with the higher traffic overhead (~46%), the performance overhead of RCP is lower than InvisiSpec and comparable to CleanupSpec. This reveals a key advantage of RCP: the coherence actions triggered by the merge and purge operations are not on the critical path of execution and can be performed in the cache hierarchy concurrently with processor execution.

Posted Content
TL;DR: RCC is a significant advance over the state-of-the-art: it can both provide performance insights and be used as common infrastructure for fast prototyping of new implementations, and it shows that three hybrid designs are indeed better than both one-sided and two-sided implementations by up to 17.8%.
Abstract: In this paper, we develop RCC, the first unified and comprehensive RDMA-enabled distributed transaction processing framework supporting six serializable concurrency control protocols: not only the classical protocols NOWAIT, WAITDIE, and OCC, but also the more advanced MVCC and SUNDIAL, and even CALVIN, the deterministic concurrency control protocol. Our goal is to unbiasedly compare the protocols in a common execution environment, with the concurrency control protocol being the only changeable component. We focus on correct and efficient implementation using key techniques, such as co-routines, outstanding requests, and doorbell batching, with two-sided and one-sided communication primitives. Based on RCC, we obtain deep insights that cannot be obtained with any existing system. Most importantly, we obtain the execution-stage latency breakdowns with one-sided and two-sided primitives for each protocol, which are analyzed to develop more efficient hybrid implementations. Our results show that three hybrid designs are indeed better than both one-sided and two-sided implementations by up to 17.8%. We believe that RCC is a significant advance over the state-of-the-art; it can both provide performance insights and be used as common infrastructure for fast prototyping of new implementations.