
Showing papers by "Xuehai Qian" published in 2023


Journal Article
TL;DR: DyNNamic is a dynamically reshaping PE array accelerator that uses kernel-wise filter decomposition to partition the convolution operation into two compact stages: Shared Kernels Convolution (SKC) and Weighted Accumulation (WA).
Abstract: Convolutional layers dominate the computation and energy costs of Deep Neural Network (DNN) inference. Recent algorithmic works attempt to reduce these bottlenecks via compact DNN structures and model compression. Likewise, state-of-the-art accelerator designs leverage spatiotemporal characteristics of convolutional layers to reduce data movement overhead and improve throughput. Although both are independently effective at reducing latency and energy costs, combining these approaches does not guarantee cumulative improvements due to inefficient mapping. This inefficiency can be attributed to (1) inflexibility of underlying hardware and (2) inherent reduction of data-reuse opportunities of compact DNN structures. To address these issues, we propose a dynamically reshaping, high data-reuse PE array accelerator, namely DyNNamic. DyNNamic leverages kernel-wise filter decomposition to partition the convolution operation into two compact stages: Shared Kernels Convolution (SKC) and Weighted Accumulation (WA). Because both stages have vastly different dimensions, DyNNamic reshapes its PE array to effectively map the algorithm to the architecture. The architecture then exploits data-reuse opportunities created by the SKC stage, further reducing data movement with negligible overhead. We evaluate our approach on various representative networks and compare against state-of-the-art accelerators. On average, DyNNamic outperforms DianNao by $8.4\times$ and $12.3\times$ in terms of inference energy and latency, respectively.
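
The two-stage split described above can be illustrated with a small software sketch. The abstract does not spell out the exact decomposition, so this assumes one plausible reading: every 2D kernel is a weighted sum of a few shared kernels, `W[o][c] = sum_k coeffs[o][c][k] * shared_kernels[k]`. Under that assumption, linearity of convolution lets the layer run as an SKC stage (each input channel convolved with each shared kernel) followed by a WA stage (a weighted sum of those feature maps). All function names, the data layout, and the toy values are hypothetical and are not taken from the paper.

```python
def conv2d_valid(image, kernel):
    """Plain 2D cross-correlation (no padding, stride 1) on nested lists."""
    kh, kw = len(kernel), len(kernel[0])
    oh, ow = len(image) - kh + 1, len(image[0]) - kw + 1
    return [[sum(image[i + a][j + b] * kernel[a][b]
                 for a in range(kh) for b in range(kw))
             for j in range(ow)] for i in range(oh)]


def skc_stage(inputs, shared_kernels):
    """Stage 1 (assumed SKC): convolve every input channel with every shared kernel."""
    return [[conv2d_valid(channel, k) for k in shared_kernels] for channel in inputs]


def wa_stage(features, coeffs):
    """Stage 2 (assumed WA): each output channel is a weighted sum of SKC feature maps."""
    oh, ow = len(features[0][0]), len(features[0][0][0])
    outputs = []
    for per_output in coeffs:                      # per_output[c][k]
        fmap = [[0] * ow for _ in range(oh)]
        for c, per_channel in enumerate(per_output):
            for k, weight in enumerate(per_channel):
                for i in range(oh):
                    for j in range(ow):
                        fmap[i][j] += weight * features[c][k][i][j]
        outputs.append(fmap)
    return outputs


def direct_conv(inputs, shared_kernels, coeffs):
    """Reference: rebuild each full kernel, then run a standard convolution."""
    kh, kw = len(shared_kernels[0]), len(shared_kernels[0][0])
    oh, ow = len(inputs[0]) - kh + 1, len(inputs[0][0]) - kw + 1
    outputs = []
    for per_output in coeffs:
        fmap = [[0] * ow for _ in range(oh)]
        for c, per_channel in enumerate(per_output):
            full = [[sum(w * shared_kernels[k][a][b] for k, w in enumerate(per_channel))
                     for b in range(kw)] for a in range(kh)]
            partial = conv2d_valid(inputs[c], full)
            for i in range(oh):
                for j in range(ow):
                    fmap[i][j] += partial[i][j]
        outputs.append(fmap)
    return outputs


# Toy layer (hypothetical values): 2 input channels, 2 shared kernels, 3 output channels.
inputs = [
    [[1, 2, 0, 1], [0, 1, 3, 2], [2, 1, 0, 1], [1, 0, 2, 3]],
    [[0, 1, 1, 0], [2, 0, 1, 1], [1, 3, 0, 2], [0, 1, 2, 1]],
]
shared_kernels = [[[1, 0], [0, 1]], [[0, 1], [-1, 0]]]
coeffs = [                                         # coeffs[o][c][k]
    [[1, 2], [0, 1]],
    [[-1, 1], [2, 0]],
    [[3, 0], [1, -2]],
]

decomposed = wa_stage(skc_stage(inputs, shared_kernels), coeffs)
reference = direct_conv(inputs, shared_kernels, coeffs)
```

In this sketch the SKC stage runs one 2D convolution per (input channel, shared kernel) pair, independent of the number of output channels, while the WA stage is a pointwise weighted sum. The two stages therefore have very different shapes, which is consistent with the abstract's motivation for reshaping the PE array between them.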

Peer Review
17 Jul 2023
TL;DR: This paper implements a QGAS model that rapidly proposes promising ansatz architectures and evaluates them with application benchmarks, including quantum chemistry and quantum finance tasks.
Abstract: Large Language Models (LLMs) contribute significantly to the development of conversational AI and have great potential to assist scientific research in various areas. This paper attempts to address the following questions: What opportunities does the current generation of generative pre-trained transformers (GPTs) offer for the development of noisy intermediate-scale quantum (NISQ) technologies? Additionally, what potential does the forthcoming generation of GPTs possess to push the frontier of research in fault-tolerant quantum computing (FTQC)? In this paper, we implement a QGAS model, which can rapidly propose promising ansatz architectures and evaluate them with application benchmarks including quantum chemistry and quantum finance tasks. Our results demonstrate that after a limited number of prompt guidelines and iterations, we can obtain a high-performance ansatz that produces results comparable to those achieved by state-of-the-art quantum architecture search methods. This study provides a simple overview of GPT's capabilities in supporting quantum computing research while highlighting the limitations of the current GPT. Additionally, we discuss futuristic applications for LLMs in quantum research.

Proceedings Article
27 Jan 2023
TL;DR: SGraph is a system that answers dynamic pairwise queries over evolving graphs with sub-second latency by pruning vertex activations with both upper bounds and tighter lower bounds on the query result.
Abstract: Many real-time OLAP systems have been proposed to query evolving data with sub-second latency. Although this feature is highly attractive, it is very hard to achieve for analytic graph queries that can only be answered after accessing every connected vertex. Fortunately, researchers recently observed that answering pairwise queries is enough for many real-world scenarios. These pairwise queries avoid the exhaustive nature and hence may only need to access a small portion of the graph. The crux of achieving low latency is the extent to which the system can eliminate unnecessary computations. This pruning process, according to our investigation, is usually achieved by estimating certain upper bounds of the query result in existing systems. However, our evaluation results demonstrate that these existing upper-bound-only pruning techniques can only prune about half of the vertex activations, which is still far from achieving the sub-second latency goal on large graphs. In contrast, we found that it is possible to substantially accelerate the processing if we are able to not only estimate the upper bounds, but also foresee a tighter lower bound for certain pairs of vertices in the graph. Our experiments show that fewer than 1% of the vertices are activated when using this novel lower-bound-based pruning technique. Based on this observation, we build SGraph, a system that is able to answer dynamic pairwise queries over evolving graphs with sub-second latency. It can ingest millions of updates per second and simultaneously answer pairwise queries with a latency that is several orders of magnitude smaller than state-of-the-art systems.
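
The difference between upper-bound-only pruning and pruning with an additional lower bound can be sketched on a point-to-point shortest-path query. The abstract does not say how SGraph derives its bounds, so this sketch makes its own assumptions: the graph is static, undirected, and has non-negative integer weights; distances from a few "hub" vertices are precomputed; the upper bound is the best path through a hub; and the lower bound comes from the triangle inequality. The function names, the hub choice, and the toy graph are hypothetical.

```python
import heapq

INF = float("inf")


def dijkstra(adj, source):
    """Exact single-source distances, used here to precompute hub distances."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, INF):
            continue
        for v, w in adj[u]:
            if d + w < dist.get(v, INF):
                dist[v] = d + w
                heapq.heappush(heap, (d + w, v))
    return dist


def pairwise_distance(adj, hub_dist, s, t, use_lower_bound=True):
    """Return (distance from s to t, number of activated vertices)."""
    # Upper bound: the length of the best s -> hub -> t path.
    upper = min((d.get(s, INF) + d.get(t, INF) for d in hub_dist), default=INF)

    def lower(v):
        # Triangle inequality on an undirected graph: |d(h, v) - d(h, t)| <= d(v, t).
        if not use_lower_bound:
            return 0
        return max((abs(d[v] - d[t]) for d in hub_dist if v in d and t in d), default=0)

    dist = {s: 0}
    heap = [(0, s)]
    activated = 0
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, INF):
            continue
        if u == t:
            return d, activated
        activated += 1
        for v, w in adj[u]:
            nd = d + w
            # Skip v if even the optimistic estimate through v exceeds the upper bound.
            if nd + lower(v) > upper:
                continue
            if nd < dist.get(v, INF):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist.get(t, INF), activated


# Toy graph: a path 0-1-2-3-4-10 plus five distractor leaves (5..9) attached to vertex 0.
edges = [(0, 1, 1), (1, 2, 1), (2, 3, 1), (3, 4, 1), (4, 10, 1)]
edges += [(0, v, 1) for v in range(5, 10)]
adj = {v: [] for v in range(11)}
for a, b, w in edges:
    adj[a].append((b, w))
    adj[b].append((a, w))

hub_dist = [dijkstra(adj, hub) for hub in (2, 10)]   # hubs chosen arbitrarily
upper_only = pairwise_distance(adj, hub_dist, 0, 4, use_lower_bound=False)
with_lower = pairwise_distance(adj, hub_dist, 0, 4, use_lower_bound=True)
```

Both calls return the same distance, but the variant with the lower bound never activates the distractor leaves, because the estimated cost of any path through them already exceeds the upper bound. This is only meant to convey the pruning idea; it does not model graph updates or SGraph's actual data structures.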

Proceedings Article
27 Jan 2023
TL;DR: This paper proposes Khuzdul, a distributed execution engine with a well-defined abstraction that can be integrated with existing single-machine graph pattern mining (GPM) systems to provide efficiency and scalability at the same time.
Abstract: This paper proposes Khuzdul, a distributed execution engine with a well-defined abstraction that can be integrated with existing single-machine graph pattern mining (GPM) systems to provide efficiency and scalability at the same time. The key novelty is the extendable embedding abstraction, which can express pattern enumeration algorithms, allow fine-grained task scheduling, and enable low-cost GPM-specific data reuse to reduce communication cost. The effective BFS-DFS hybrid exploration generates sufficient concurrent tasks for communication-computation overlapping with bounded memory consumption. Two scalable distributed GPM systems are implemented by porting Automine and GraphPi on Khuzdul. Our evaluation shows that Khuzdul-based systems significantly outperform state-of-the-art distributed GPM systems with partitioned graphs by up to 75.5× (on average 19.0×), achieve similar or even better performance compared with the fastest distributed GPM systems with replicated graphs, and scale to massive graphs with more than one hundred billion edges on a commodity cluster.
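
The idea of a BFS-DFS hybrid exploration with bounded memory can be sketched on a single machine. This is not Khuzdul's abstraction or API; it is a minimal illustration that assumes clique counting as the pattern and uses a hypothetical `chunk_size` parameter. Partial embeddings are extended in chunks (the BFS part, which yields a batch of independent tasks), and each chunk is fully explored before the next one is materialized (the DFS part), so at most `chunk_size` embeddings are held per level.

```python
from itertools import islice


def extend(adj, chunk):
    """Lazily extend each partial embedding in the chunk by one vertex."""
    for emb in chunk:
        for v in sorted(adj[emb[-1]]):
            # Grow in increasing vertex order so each clique is enumerated once,
            # and require v to be adjacent to every vertex already in the embedding.
            if v > emb[-1] and all(v in adj[u] for u in emb[:-1]):
                yield emb + (v,)


def count_cliques_hybrid(adj, k, chunk_size):
    """Count k-cliques while holding at most chunk_size embeddings per level."""
    stats = {"count": 0, "largest_chunk": 0}

    def explore(stream, size):
        while True:
            # BFS part: materialize a bounded batch of same-size embeddings.
            chunk = list(islice(stream, chunk_size))
            if not chunk:
                return
            stats["largest_chunk"] = max(stats["largest_chunk"], len(chunk))
            if size == k:
                stats["count"] += len(chunk)
            else:
                # DFS part: finish this batch's extensions before pulling the next batch.
                explore(extend(adj, chunk), size + 1)

    explore(((v,) for v in sorted(adj)), 1)
    return stats


# Toy graph (hypothetical): a 4-clique on {0, 1, 2, 3} plus vertex 4 adjacent to 0 and 1.
adj = {
    0: {1, 2, 3, 4},
    1: {0, 2, 3, 4},
    2: {0, 1, 3},
    3: {0, 1, 2},
    4: {0, 1},
}
triangles = count_cliques_hybrid(adj, 3, chunk_size=2)
four_cliques = count_cliques_hybrid(adj, 4, chunk_size=2)
```

A larger `chunk_size` exposes more concurrent work per level at the cost of memory; a chunk size of 1 degenerates to pure DFS. In a distributed setting, the batch at each level is what could overlap communication with computation, although that part is not modeled here.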