Author

Kung

Bio: Kung is an academic researcher from Carnegie Mellon University. The author has contributed to research on topics including signal processing and digital image processing. The author has an h-index of 1 and has co-authored 1 publication receiving 2,212 citations.

Papers
Journal Article
Kung
TL;DR: The basic principle of systolic architectures is reviewed and it is explained why they should result in cost-effective, high-performance special-purpose systems for a wide range of problems.
Abstract: High-performance, special-purpose computer systems are typically used to meet specific application requirements or to off-load computations that are especially taxing to general-purpose computers. As hardware cost and size continue to drop and processing requirements become well-understood in areas such as signal and image processing, more special-purpose systems are being constructed. However, since most of these systems are built on an ad hoc basis for specific tasks, methodological work in this area is rare. Because the knowledge gained from individual experiences is neither accumulated nor properly organized, the same errors are repeated. I/O and computation imbalance is a notable example: often, the fact that I/O interfaces cannot keep up with device speed is discovered only after constructing a high-speed, special-purpose device. We intend to help correct this ad hoc approach by providing a general guideline, specifically, the concept of systolic architecture, a general methodology for mapping high-level computations into hardware structures. In a systolic system, data flows from the computer memory in a rhythmic fashion, passing through many processing elements before it returns to memory, much as blood circulates to and from the heart. The system works like an automobile assembly line where different people work on the same car at different times and many cars are assembled simultaneously. An assembly line is always linear, however, and systolic systems are sometimes two-dimensional. They can be rectangular, triangular, or hexagonal to make use of higher degrees of parallelism. Moreover, to implement a variety of computations, data flow in a systolic system may be at multiple speeds in multiple directions: both inputs and (partial) results flow, whereas only results flow in classical pipelined systems. Generally speaking, a systolic system is easy to implement because of its regularity and easy to reconfigure (to meet various outside constraints) because of its modularity. The systolic architectural concept was developed at Carnegie-Mellon University, and versions of systolic processors are being designed and built by several industrial and governmental organizations. This article reviews the basic principle of systolic architectures and explains why they should result in cost-effective, high-performance special-purpose systems for a wide range of problems.
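
The rhythmic, neighbor-to-neighbor data movement described above can be made concrete with a small simulation. The sketch below (plain Python with NumPy, all names hypothetical) models one common systolic organization, a weight-stationary rectangular array for matrix multiplication, rather than the specific linear or hexagonal designs from the article: weights stay resident in the cells, operands march one cell per cycle in from the left, and partial sums trickle down the columns until finished results emerge at the bottom.

```python
import numpy as np

def systolic_matmul(A, B):
    """Cycle-by-cycle sketch of a weight-stationary systolic array computing C = A @ B.

    B (K x N) is preloaded into the grid of cells (cell (k, n) holds B[k, n]).
    Rows of A stream in from the left, skewed by one cycle per row; partial
    sums flow down the columns and leave the bottom row, one per column per cycle.
    """
    M, K = A.shape
    K2, N = B.shape
    assert K == K2

    a_reg = np.zeros((K, N))   # activation registers (values moving left -> right)
    p_reg = np.zeros((K, N))   # partial-sum registers (values moving top -> bottom)
    C = np.zeros((M, N))

    for t in range(M + N + K - 2):          # enough cycles to drain the pipeline
        new_a = np.zeros_like(a_reg)
        new_p = np.zeros_like(p_reg)
        for k in range(K):
            for n in range(N):
                # Operand arrives from the left neighbor (or the time-skewed feeder).
                if n == 0:
                    a_in = A[t - k, k] if 0 <= t - k < M else 0.0
                else:
                    a_in = a_reg[k, n - 1]
                # Partial sum arrives from above (fresh accumulator at the top row).
                p_in = 0.0 if k == 0 else p_reg[k - 1, n]
                new_a[k, n] = a_in                    # pass the operand onward
                new_p[k, n] = p_in + B[k, n] * a_in   # one multiply-accumulate per cell
        a_reg, p_reg = new_a, new_p
        # Collect finished results leaving the bottom row (also skewed in time).
        for n in range(N):
            m = t - n - (K - 1)
            if 0 <= m < M:
                C[m, n] = p_reg[K - 1, n]
    return C

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A, B = rng.standard_normal((5, 4)), rng.standard_normal((4, 3))
    assert np.allclose(systolic_matmul(A, B), A @ B)
```

Every cell communicates only with its immediate neighbors, which is the regularity and modularity the abstract credits for making systolic designs easy to implement and reconfigure.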

2,319 citations


Cited by
Posted Content
TL;DR: This paper evaluates a custom ASIC, called a Tensor Processing Unit (TPU), deployed in datacenters since 2015 that accelerates the inference phase of neural networks (NN), and compares it to a server-class Intel Haswell CPU and an Nvidia K80 GPU, contemporaries deployed in the same datacenters.
Abstract: Many architects believe that major improvements in cost-energy-performance must now come from domain-specific hardware. This paper evaluates a custom ASIC---called a Tensor Processing Unit (TPU)---deployed in datacenters since 2015 that accelerates the inference phase of neural networks (NN). The heart of the TPU is a 65,536 8-bit MAC matrix multiply unit that offers a peak throughput of 92 TeraOps/second (TOPS) and a large (28 MiB) software-managed on-chip memory. The TPU's deterministic execution model is a better match to the 99th-percentile response-time requirement of our NN applications than are the time-varying optimizations of CPUs and GPUs (caches, out-of-order execution, multithreading, multiprocessing, prefetching, ...) that help average throughput more than guaranteed latency. The lack of such features helps explain why, despite having myriad MACs and a big memory, the TPU is relatively small and low power. We compare the TPU to a server-class Intel Haswell CPU and an Nvidia K80 GPU, which are contemporaries deployed in the same datacenters. Our workload, written in the high-level TensorFlow framework, uses production NN applications (MLPs, CNNs, and LSTMs) that represent 95% of our datacenters' NN inference demand. Despite low utilization for some applications, the TPU is on average about 15X - 30X faster than its contemporary GPU or CPU, with TOPS/Watt about 30X - 80X higher. Moreover, using the GPU's GDDR5 memory in the TPU would triple achieved TOPS and raise TOPS/Watt to nearly 70X the GPU and 200X the CPU.
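
The headline throughput figure can be sanity-checked with one line of arithmetic. The sketch below assumes the 65,536 MACs form a 256 x 256 systolic array and uses the roughly 700 MHz clock reported in the TPU paper (the clock rate is not stated in this abstract, so treat it as an assumption), counting each MAC as two operations.

```python
# Rough sanity check of the peak-throughput figure quoted above.
macs = 256 * 256          # 65,536 8-bit MACs in the matrix multiply unit
ops_per_mac = 2           # one multiply plus one add
clock_hz = 700e6          # nominal TPU clock frequency (assumed, from the TPU paper)

peak_tops = macs * ops_per_mac * clock_hz / 1e12
print(f"Peak throughput: {peak_tops:.1f} TOPS")   # ~91.8 TOPS, i.e. the ~92 TOPS quoted
```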

3,067 citations

Posted Content
TL;DR: The results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy.
Abstract: Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
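
Stochastic rounding, the key ingredient in this result, is easy to state: a value is rounded up with probability equal to its fractional distance from the lower representable number, so the rounding error is zero in expectation. A minimal NumPy sketch follows (illustrative names and parameters, not the paper's accelerator implementation).

```python
import numpy as np

def stochastic_round_fixed_point(x, frac_bits=8, word_bits=16, rng=None):
    """Quantize x to signed fixed point with `frac_bits` fractional bits using stochastic rounding.

    x is scaled by 2**frac_bits, then rounded down or up at random, with
    P(round up) equal to the fractional remainder.  The expected value of the
    quantized number equals x (up to saturation), which is what lets tiny
    gradient updates survive 16-bit training on average.
    """
    rng = rng or np.random.default_rng()
    scale = 2.0 ** frac_bits
    scaled = np.asarray(x) * scale
    floor = np.floor(scaled)
    rounded = floor + (rng.random(np.shape(scaled)) < (scaled - floor))
    # Saturate to the representable range of a signed word_bits-bit integer.
    lo, hi = -(2 ** (word_bits - 1)), 2 ** (word_bits - 1) - 1
    return np.clip(rounded, lo, hi) / scale

# Example: rounding 0.3 with 2 fractional bits (grid spacing 0.25) lands on
# 0.25 about 80% of the time and on 0.5 about 20% of the time.
vals = stochastic_round_fixed_point(np.full(100_000, 0.3), frac_bits=2)
print(vals.mean())   # close to 0.3 in expectation
```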

1,234 citations

Journal ArticleDOI
01 Jul 1984
TL;DR: The combination of decreasing feature sizes and increasing chip sizes is leading to a communication crisis in the area of VLSI circuits and systems, and the possibility of applying optical and electrooptical technologies to such interconnection problems is investigated.
Abstract: The combination of decreasing feature sizes and increasing chip sizes is leading to a communication crisis in the area of VLSI circuits and systems. It is anticipated that the speeds of MOS circuits will soon be limited by interconnection delays, rather than gate delays. This paper investigates the possibility of applying optical and electrooptical technologies to such interconnection problems. The origins of the communication crisis are discussed. Those aspects of electrooptic technology that are applicable to the generation, routing, and detection of light at the level of chips and boards are reviewed. Algorithmic implications of interconnections are discussed, with emphasis on the definition of a hierarchy of interconnection problems from the signal-processing area having an increasing level of complexity. One potential application of optical interconnections is to the problem of clock distribution, for which a single signal must be routed to many parts of a chip or board. More complex is the problem of supplying data interconnections via optical technology. Areas in need of future research are identified.
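
The "communication crisis" the abstract refers to follows from how wire delay scales: the distributed RC delay of an on-chip wire grows roughly with the square of its length, while gate delays shrink with feature size. A back-of-the-envelope sketch (all resistance and capacitance values are generic assumptions for illustration, not figures from the paper):

```python
# Illustrative scaling argument only; the per-mm values are rough assumptions.
r_per_mm = 1_000.0      # wire resistance, ohms per mm (assumed)
c_per_mm = 0.2e-12      # wire capacitance, farads per mm (assumed)

def rc_delay_ns(length_mm):
    # Distributed RC delay grows with the square of wire length (~0.5 * R * C * L^2).
    return 0.5 * (r_per_mm * length_mm) * (c_per_mm * length_mm) * 1e9

for length in (1, 5, 10, 20):   # cross-chip wires get longer as chips grow
    print(f"{length:>2} mm wire: ~{rc_delay_ns(length):.2f} ns RC delay")
# Doubling the wire length quadruples the delay while gate delays keep shrinking,
# which is the imbalance that motivates optical clock and data distribution.
```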

1,187 citations

Proceedings Article
06 Jul 2015
TL;DR: In this article, the effect of limited precision data representation and computation on neural network training was studied, and it was shown that deep networks can be trained using only a 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy.
Abstract: Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
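
This is the conference version of the preprint listed above, so the same rounding idea applies. The short sketch below (illustrative Python, not the paper's accelerator) shows why the rounding scheme dominates training behavior: with round-to-nearest, an update smaller than half the quantization step is always lost, while stochastic rounding preserves it in expectation.

```python
import numpy as np

rng = np.random.default_rng(1)
step = 2.0 ** -8                 # quantization step of a 16-bit format with 8 fractional bits
update = step / 10.0             # a gradient update far smaller than one step
w_nearest = w_stochastic = 0.0

for _ in range(1000):
    # Round-to-nearest: the tiny update always rounds back to the old weight.
    w_nearest = np.round((w_nearest + update) / step) * step
    # Stochastic rounding: the update survives with probability update / step.
    scaled = (w_stochastic + update) / step
    w_stochastic = (np.floor(scaled) + (rng.random() < scaled - np.floor(scaled))) * step

print(w_nearest)      # stays exactly 0.0
print(w_stochastic)   # close to 1000 * update, about 0.39, on average
```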

1,142 citations

Journal Article
TL;DR: The historical interweaving of data and knowledge is traced, and a comprehensive introduction to knowledge graphs, their data models, and the techniques used to create and refine them is provided.
Abstract: In this paper we provide a comprehensive introduction to knowledge graphs, which have recently garnered significant attention from both industry and academia in scenarios that require exploiting diverse, dynamic, large-scale collections of data. After some opening remarks, we motivate and contrast various graph-based data models and query languages that are used for knowledge graphs. We discuss the roles of schema, identity, and context in knowledge graphs. We explain how knowledge can be represented and extracted using a combination of deductive and inductive techniques. We summarise methods for the creation, enrichment, quality assessment, refinement, and publication of knowledge graphs. We provide an overview of prominent open knowledge graphs and enterprise knowledge graphs, their applications, and how they use the aforementioned techniques. We conclude with high-level future research directions for knowledge graphs.
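
The graph-based data model the abstract contrasts with other representations can be illustrated with a handful of triples and a one-line traversal. A minimal sketch in plain Python follows; the entities and predicates are invented for illustration, and a real system would use an RDF/SPARQL or property-graph store.

```python
# A toy knowledge graph as a set of (subject, predicate, object) triples.
# Entities and facts are invented for illustration only.
triples = {
    ("Santiago", "capital_of", "Chile"),
    ("Chile",    "part_of",    "South America"),
    ("Arica",    "located_in", "Chile"),
    ("Chile",    "borders",    "Peru"),
}

def objects(subject, predicate):
    """Simple pattern match: all o such that (subject, predicate, o) is in the graph."""
    return {o for s, p, o in triples if s == subject and p == predicate}

# A two-hop traversal over schema-free predicates, the kind of query that graph
# query languages (e.g. SPARQL or Cypher) express declaratively:
for country in objects("Santiago", "capital_of"):
    print(country, "is part of", objects(country, "part_of"))
```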

560 citations