Author

Tushar Krishna

Bio: Tushar Krishna is an academic researcher from the Massachusetts Institute of Technology. The author has contributed to research in the areas of computer science and networks-on-chip. The author has an h-index of 19 and has co-authored 29 publications receiving 6,728 citations. Previous affiliations of Tushar Krishna include the Indian Institutes of Technology and Princeton University.

Papers
Journal Article
TL;DR: The high level of collaboration on the gem5 project, combined with the previous success of the component parts and a liberal BSD-like license, make gem5 a valuable full-system simulation tool.
Abstract: The gem5 simulation infrastructure is the merger of the best aspects of the M5 [4] and GEMS [9] simulators. M5 provides a highly configurable simulation framework, multiple ISAs, and diverse CPU models. GEMS complements these features with a detailed and flexible memory system, including support for multiple cache coherence protocols and interconnect models. Currently, gem5 supports most commercial ISAs (ARM, ALPHA, MIPS, Power, SPARC, and x86), including booting Linux on three of them (ARM, ALPHA, and x86). The project is the result of the combined efforts of many academic and industrial institutions, including AMD, ARM, HP, MIPS, Princeton, MIT, and the Universities of Michigan, Texas, and Wisconsin. Over the past ten years, M5 and GEMS have been used in hundreds of publications and have been downloaded tens of thousands of times. The high level of collaboration on the gem5 project, combined with the previous success of the component parts and a liberal BSD-like license, make gem5 a valuable full-system simulation tool.
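
gem5 is configured through Python scripts that instantiate and connect simulated components. As a rough illustration only, the sketch below follows the public "Learning gem5" tutorial; class and port names (SystemXBar, cpu_side_ports, MemCtrl) assume a recent gem5 release and may differ in older versions.

```python
# Minimal gem5 syscall-emulation configuration, sketched after the
# public "Learning gem5" tutorial. Names assume a recent gem5 release.
import m5
from m5.objects import *

system = System()
system.clk_domain = SrcClockDomain(clock='1GHz',
                                   voltage_domain=VoltageDomain())
system.mem_mode = 'timing'                    # timing memory model
system.mem_ranges = [AddrRange('512MB')]

system.cpu = TimingSimpleCPU()                # one of gem5's CPU models
system.membus = SystemXBar()
system.cpu.icache_port = system.membus.cpu_side_ports
system.cpu.dcache_port = system.membus.cpu_side_ports

system.cpu.createInterruptController()
system.cpu.interrupts[0].pio = system.membus.mem_side_ports   # x86-specific
system.cpu.interrupts[0].int_requestor = system.membus.cpu_side_ports
system.cpu.interrupts[0].int_responder = system.membus.mem_side_ports
system.system_port = system.membus.cpu_side_ports

system.mem_ctrl = MemCtrl()
system.mem_ctrl.dram = DDR3_1600_8x8()
system.mem_ctrl.dram.range = system.mem_ranges[0]
system.mem_ctrl.port = system.membus.mem_side_ports

binary = 'tests/test-progs/hello/bin/x86/linux/hello'
system.workload = SEWorkload.init_compatible(binary)
system.cpu.workload = Process(cmd=[binary])
system.cpu.createThreads()

root = Root(full_system=False, system=system)
m5.instantiate()
exit_event = m5.simulate()
print('Exited @ tick %i because %s' % (m5.curTick(), exit_event.getCause()))
```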

4,039 citations

Journal Article
TL;DR: Eyeriss is an accelerator for state-of-the-art deep convolutional neural networks (CNNs) that optimizes for the energy efficiency of the entire system, including the accelerator chip and off-chip DRAM, by reconfiguring the architecture.
Abstract: Eyeriss is an accelerator for state-of-the-art deep convolutional neural networks (CNNs). It optimizes for the energy efficiency of the entire system, including the accelerator chip and off-chip DRAM, for various CNN shapes by reconfiguring the architecture. CNNs are widely used in modern AI systems but also bring challenges on throughput and energy efficiency to the underlying hardware, because their computation requires a large amount of data, creating significant data movement between on-chip and off-chip memory that is more energy-consuming than the computation itself. Minimizing the energy cost of data movement for any CNN shape is therefore the key to high throughput and energy efficiency. Eyeriss achieves these goals by using a proposed processing dataflow, called row stationary (RS), on a spatial architecture with 168 processing elements. The RS dataflow reconfigures the computation mapping of a given shape, which optimizes energy efficiency by maximally reusing data locally to reduce expensive data movement, such as DRAM accesses. Compression and data gating are also applied to further improve energy efficiency. Eyeriss processes the convolutional layers at 35 frames/s and 0.0029 DRAM accesses per multiply-and-accumulate (MAC) for AlexNet at 278 mW (batch size N = 4), and 0.7 frames/s and 0.0035 DRAM accesses/MAC for VGG-16 at 236 mW (N = 3).
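
To make the row-stationary idea concrete, here is a toy illustration (sizes and structure are hypothetical, not Eyeriss's actual 168-PE mapping): each "PE" holds one filter row and one input row locally and reuses them across all output positions, so expensive fetches happen once per row rather than once per MAC.

```python
import numpy as np

def pe_row_conv(filter_row, input_row):
    """One PE: slide a filter row over an input row, producing a row of
    partial sums while both operands stay in local storage."""
    R, W = len(filter_row), len(input_row)
    out = np.zeros(W - R + 1)
    for x in range(W - R + 1):        # each output position
        for r in range(R):            # reuse the locally held rows
            out[x] += filter_row[r] * input_row[x + r]
    return out

def conv2d_row_stationary(filt, image):
    """2-D convolution as an accumulation of per-row 1-D convolutions:
    output row y sums the PE results for filter row r against image
    row y + r (the spatial mapping of rows onto PEs in RS)."""
    R = filt.shape[0]
    H_out = image.shape[0] - R + 1
    W_out = image.shape[1] - filt.shape[1] + 1
    out = np.zeros((H_out, W_out))
    for y in range(H_out):
        for r in range(R):
            out[y] += pe_row_conv(filt[r], image[y + r])
    return out

filt, img = np.random.rand(3, 3), np.random.rand(8, 8)
reference = np.array([[np.sum(filt * img[y:y+3, x:x+3])
                       for x in range(6)] for y in range(6)])
assert np.allclose(conv2d_row_stationary(filt, img), reference)
```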

2,165 citations

Proceedings Article
26 Apr 2009
TL;DR: In this article, a detailed cycle-accurate interconnection network model (GARNET) is proposed to simulate a CMP architecture with virtual channel (VC) flow control.
Abstract: Until very recently, microprocessor designs were computation-centric. On-chip communication was frequently ignored. This was because of fast, single-cycle on-chip communication. The interconnect power was also insignificant compared to the transistor power. With uniprocessor designs providing diminishing returns and the advent of chip multiprocessors (CMPs) in mainstream systems, the on-chip network that connects different processing cores has become a critical part of the design. Transistor miniaturization has led to high global wire delay, and interconnect power comparable to transistor power. CMP design proposals can no longer ignore the interaction between the memory hierarchy and the interconnection network that connects various elements. This necessitates a detailed and accurate interconnection network model within a full-system evaluation framework. Ignoring the interconnect details might lead to inaccurate results when simulating a CMP architecture. It also becomes important to analyze the impact of interconnection network optimization techniques on full system behavior. In this light, we developed a detailed cycle-accurate interconnection network model (GARNET), inside the GEMS full-system simulation framework. GARNET models a classic five-stage pipelined router with virtual channel (VC) flow control. Microarchitectural details, such as flit-level input buffers, routing logic, allocators and the crossbar switch, are modeled. GARNET, along with GEMS, provides a detailed and accurate memory system timing model. To demonstrate the importance and potential impact of GARNET, we evaluate a shared and private L2 CMP with a realistic state-of-the-art interconnection network against the original GEMS simple network. The objective of the evaluation was to figure out which configuration is better for a particular workload. We show that not modeling the interconnect in detail might lead to an incorrect outcome. We also evaluate Express Virtual Channels (EVCs), an on-chip network flow control proposal, in a full-system fashion. We show that in improving on-chip network latency-throughput, EVCs do lead to better overall system runtime, however, the impact varies widely across applications.
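
To see the magnitude of the modeling gap GARNET addresses, consider a rough low-load latency estimate (an illustrative calculation, not GARNET code): an idealized one-cycle-per-hop network versus a classic five-stage virtual-channel router pipeline with one-cycle links.

```python
# Rough low-load latency comparison on a mesh (illustrative numbers,
# not GARNET output): an idealized 1-cycle-per-hop network vs. a
# classic five-stage VC router (buffer write, route compute, VC
# allocation, switch allocation, switch traversal) plus 1-cycle links.

def manhattan_hops(src, dst):
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])

def simple_latency(hops):
    return hops                                # one cycle per hop, no router detail

def five_stage_latency(hops, pipe=5, link=1):
    return (hops + 1) * pipe + hops * link     # hops+1 routers, hops links

h = manhattan_hops((0, 0), (7, 7))             # corner to corner on an 8x8 mesh
print(f"hops={h}: simple={simple_latency(h)} cycles, "
      f"detailed={five_stage_latency(h)} cycles")
# hops=14: simple=14 cycles, detailed=89 cycles
```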

719 citations

Journal Article
14 Jun 2014
TL;DR: SCORPIO is presented, an ordered mesh Network-on-Chip (NoC) architecture with a separate fixed-latency, bufferless network to achieve distributed global ordering, designed to plug-and-play with existing multicore IP and with practicality, timing, area, and power as top concerns.
Abstract: In the many-core era, scalable coherence and on-chip interconnects are crucial for shared memory processors. While snoopy coherence is common in small multicore systems, directory-based coherence is the de facto choice for scalability to many cores, as snoopy coherence relies on ordered interconnects, which do not scale. However, directory-based coherence does not scale beyond tens of cores due to excessive directory area overhead or inaccurate sharer tracking. Prior techniques supporting ordering on arbitrary unordered networks are impractical for full multicore chip designs. We present SCORPIO, an ordered mesh Network-on-Chip (NoC) architecture with a separate fixed-latency, bufferless network to achieve distributed global ordering. Message delivery is decoupled from the ordering, allowing messages to arrive in any order and at any time, and still be correctly ordered. The architecture is designed to plug-and-play with existing multicore IP and with practicality, timing, area, and power as top concerns. Full-system 36- and 64-core simulations on SPLASH-2 and PARSEC benchmarks show an average application runtime reduction of 24.1% and 12.9%, in comparison to distributed directory and AMD HyperTransport coherence protocols, respectively. The SCORPIO architecture is incorporated in an 11 mm-by-13 mm chip prototype, fabricated in IBM 45 nm SOI technology, comprising 36 Freescale e200 Power Architecture cores with private L1 and L2 caches interfacing with the NoC via ARM AMBA, along with two Cadence on-chip DDR2 controllers. The chip prototype achieves a post-synthesis operating frequency of 1 GHz (833 MHz post-layout) with an estimated power of 28.8 W (768 mW per tile), while the network consumes only 10% of tile area and 19% of tile power.
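
The core decoupling idea can be sketched in a few lines (a hypothetical simplification, not the chip's logic): assume the fixed-latency bufferless network has already assigned each message a global sequence number; endpoints then buffer main-network payloads that arrive out of order and commit them strictly in that global order.

```python
# Toy sketch of SCORPIO-style delivery/ordering decoupling (hypothetical
# simplification). Payloads on the unordered main mesh may arrive in any
# order; each endpoint commits them in the globally agreed sequence.

class OrderedEndpoint:
    def __init__(self):
        self.next_seq = 0    # next global sequence number to commit
        self.pending = {}    # seq -> payload, held until its turn

    def on_payload(self, seq, payload):
        """Main-network delivery: any order, any time."""
        self.pending[seq] = payload
        return self._drain()

    def _drain(self):
        """Commit consecutive messages from the head of the global order."""
        committed = []
        while self.next_seq in self.pending:
            committed.append(self.pending.pop(self.next_seq))
            self.next_seq += 1
        return committed

ep = OrderedEndpoint()
print(ep.on_payload(2, "C"))   # [] -- must wait for messages 0 and 1
print(ep.on_payload(0, "A"))   # ['A']
print(ep.on_payload(1, "B"))   # ['B', 'C']
```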

124 citations

Proceedings Article
23 Feb 2013
TL;DR: This work proposes an on-chip network called SMART (Single-cycle Multi-hop Asynchronous Repeated Traversal) that aims to present a single-cycle data-path all the way from the source to the destination.
Abstract: As the number of on-chip cores increases, scalable on-chip topologies such as meshes inevitably add multiple hops in each network traversal. The best we can do right now is to design 1-cycle routers, such that the low-load network latency between a source and destination is equal to the number of routers + links (i.e., hops × 2) between them. OS/compiler and cache coherence protocol designers often try to limit communication to within a few hops, since on-chip latency is critical for their scalability. In this work, we propose an on-chip network called SMART (Single-cycle Multi-hop Asynchronous Repeated Traversal) that aims to present a single-cycle data-path all the way from the source to the destination. We do not add any additional fast physical express links in the data-path; instead we drive the shared crossbars and links asynchronously up to multiple hops within a single cycle. We design a router + link microarchitecture to achieve such a traversal, and a flow-control technique to arbitrate and set up multi-hop paths within a cycle. A placed-and-routed design at 45 nm achieves 11 hops within a 1 GHz cycle for paths without turns (9 for paths with turns). We observe a 5-8X reduction in low-load latencies across synthetic traffic patterns on an 8×8 CMP, compared to a baseline 1-cycle router. Full-system simulations with SPLASH-2 and PARSEC benchmarks demonstrate 27/52% and 20/59% reductions in runtime and EDP for Private/Shared L2 designs.
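
The headline result translates into a rough latency model (an illustrative simplification of the paper's model): with up to HPC_max hops traversable per cycle, a route of h hops takes about ceil(h / HPC_max) cycles instead of roughly 2h cycles on a baseline 1-cycle-router mesh.

```python
import math

# Rough latency model for SMART-style multi-hop bypass (an illustrative
# simplification): a flit covers up to hpc_max hops per cycle, vs. one
# router cycle plus one link cycle per hop in the baseline mesh.

def baseline_cycles(hops):
    return hops * 2                        # router + link per hop

def smart_cycles(hops, hpc_max=11):       # 11 hops/cycle without turns at 1 GHz
    return math.ceil(hops / hpc_max)

for h in (4, 8, 14):
    print(f"{h:2d} hops: baseline={baseline_cycles(h):2d} cycles, "
          f"SMART~{smart_cycles(h)} cycle(s)")
```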

117 citations


Cited by
Journal Article
TL;DR: The high level of collaboration on the gem5 project, combined with the previous success of the component parts and a liberal BSD-like license, make gem5 a valuable full-system simulation tool.
Abstract: The gem5 simulation infrastructure is the merger of the best aspects of the M5 [4] and GEMS [9] simulators. M5 provides a highly configurable simulation framework, multiple ISAs, and diverse CPU models. GEMS complements these features with a detailed and flexible memory system, including support for multiple cache coherence protocols and interconnect models. Currently, gem5 supports most commercial ISAs (ARM, ALPHA, MIPS, Power, SPARC, and x86), including booting Linux on three of them (ARM, ALPHA, and x86). The project is the result of the combined efforts of many academic and industrial institutions, including AMD, ARM, HP, MIPS, Princeton, MIT, and the Universities of Michigan, Texas, and Wisconsin. Over the past ten years, M5 and GEMS have been used in hundreds of publications and have been downloaded tens of thousands of times. The high level of collaboration on the gem5 project, combined with the previous success of the component parts and a liberal BSD-like license, make gem5 a valuable full-system simulation tool.

4,039 citations

Journal Article
20 Nov 2017
TL;DR: In this paper, the authors provide a comprehensive tutorial and survey about the recent advances toward the goal of enabling efficient processing of DNNs, discuss various hardware platforms and architectures that support DNNs, and highlight key trends in reducing the computation cost of deep neural networks either solely via hardware design changes or via joint hardware and DNN algorithm changes.
Abstract: Deep neural networks (DNNs) are currently widely used for many artificial intelligence (AI) applications including computer vision, speech recognition, and robotics. While DNNs deliver state-of-the-art accuracy on many AI tasks, this comes at the cost of high computational complexity. Accordingly, techniques that enable efficient processing of DNNs to improve energy efficiency and throughput without sacrificing application accuracy or increasing hardware cost are critical to the wide deployment of DNNs in AI systems. This article aims to provide a comprehensive tutorial and survey about the recent advances toward the goal of enabling efficient processing of DNNs. Specifically, it will provide an overview of DNNs, discuss various hardware platforms and architectures that support DNNs, and highlight key trends in reducing the computation cost of DNNs either solely via hardware design changes or via joint hardware design and DNN algorithm changes. It will also summarize various development resources that enable researchers and practitioners to quickly get started in this field, and highlight important benchmarking metrics and design considerations that should be used for evaluating the rapidly growing number of DNN hardware designs, optionally including algorithmic codesigns, being proposed in academia and industry. The reader will take away the following concepts from this article: understand the key design considerations for DNNs; be able to evaluate different DNN hardware implementations with benchmarks and comparison metrics; understand the tradeoffs between various hardware architectures and platforms; be able to evaluate the utility of various DNN design techniques for efficient processing; and understand recent implementation trends and opportunities.
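
Two of the benchmarking metrics the article emphasizes, MAC count and weight count, follow directly from layer shapes. The sketch below shows the arithmetic, using AlexNet-conv1-like dimensions purely as an example.

```python
# Per-layer cost metrics for a convolutional layer: the number of
# multiply-accumulates (MACs) and the number of weights. Layer shapes
# here are AlexNet-conv1-like, used purely as an example.

def conv_layer_costs(H_out, W_out, C_in, C_out, R, S):
    macs = H_out * W_out * C_out * C_in * R * S   # one MAC per output tap
    weights = C_out * C_in * R * S                # filter parameters
    return macs, weights

macs, weights = conv_layer_costs(H_out=55, W_out=55, C_in=3,
                                 C_out=96, R=11, S=11)
print(f"MACs={macs:,}  weights={weights:,}")
# MACs=105,415,200  weights=34,848
```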

2,391 citations

Journal Article
TL;DR: Eyeriss is an accelerator for state-of-the-art deep convolutional neural networks (CNNs) that optimizes for the energy efficiency of the entire system, including the accelerator chip and off-chip DRAM, by reconfiguring the architecture.
Abstract: Eyeriss is an accelerator for state-of-the-art deep convolutional neural networks (CNNs). It optimizes for the energy efficiency of the entire system, including the accelerator chip and off-chip DRAM, for various CNN shapes by reconfiguring the architecture. CNNs are widely used in modern AI systems but also bring challenges on throughput and energy efficiency to the underlying hardware, because their computation requires a large amount of data, creating significant data movement between on-chip and off-chip memory that is more energy-consuming than the computation itself. Minimizing the energy cost of data movement for any CNN shape is therefore the key to high throughput and energy efficiency. Eyeriss achieves these goals by using a proposed processing dataflow, called row stationary (RS), on a spatial architecture with 168 processing elements. The RS dataflow reconfigures the computation mapping of a given shape, which optimizes energy efficiency by maximally reusing data locally to reduce expensive data movement, such as DRAM accesses. Compression and data gating are also applied to further improve energy efficiency. Eyeriss processes the convolutional layers at 35 frames/s and 0.0029 DRAM accesses per multiply-and-accumulate (MAC) for AlexNet at 278 mW (batch size N = 4), and 0.7 frames/s and 0.0035 DRAM accesses/MAC for VGG-16 at 236 mW (N = 3).

2,165 citations

Journal Article
01 Jul 2017
TL;DR: A new architecture for a fully optical neural network is demonstrated that enables a computational speed enhancement of at least two orders of magnitude, and a power efficiency improvement of three orders of magnitude, over state-of-the-art electronics.
Abstract: Artificial neural networks have dramatically improved performance for many machine learning tasks. We demonstrate a new architecture for a fully optical neural network that enables a computational speed enhancement of at least two orders of magnitude, and an improvement in power efficiency of three orders of magnitude, over state-of-the-art electronics.

1,955 citations

Journal Article
TL;DR: In this article, communications system design is framed as an end-to-end reconstruction task that jointly optimizes transmitter and receiver components in a single process, an idea that can be extended to networks of multiple transmitters and receivers.
Abstract: We present and discuss several novel applications of deep learning for the physical layer. By interpreting a communications system as an autoencoder, we develop a fundamental new way to think about communications system design as an end-to-end reconstruction task that seeks to jointly optimize transmitter and receiver components in a single process. We show how this idea can be extended to networks of multiple transmitters and receivers and present the concept of radio transformer networks as a means to incorporate expert domain knowledge in the machine learning model. Lastly, we demonstrate the application of convolutional neural networks on raw IQ samples for modulation classification, which achieves competitive accuracy with respect to traditional schemes relying on expert features. The paper concludes with a discussion of open challenges and areas for future investigation.
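
The autoencoder view of the physical layer is easy to prototype. The sketch below is an assumed minimal setup (layer sizes, noise level, and power constraint are illustrative, not the paper's exact configuration): a transmitter maps one of M messages to n channel uses, an additive-noise layer models the channel, and a receiver classifies the noisy observation back to a message index.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy end-to-end communications autoencoder (assumed minimal setup,
# not the paper's exact architecture).
M, n = 16, 7                                  # 16 messages, 7 channel uses
encoder = nn.Sequential(nn.Linear(M, 32), nn.ReLU(), nn.Linear(32, n))
decoder = nn.Sequential(nn.Linear(n, 32), nn.ReLU(), nn.Linear(32, M))
opt = torch.optim.Adam([*encoder.parameters(), *decoder.parameters()],
                       lr=1e-3)

for step in range(2000):
    msgs = torch.randint(0, M, (256,))        # random message batch
    x = encoder(F.one_hot(msgs, M).float())   # transmitter
    x = x / x.norm(dim=1, keepdim=True)       # unit-energy constraint
    y = x + 0.1 * torch.randn_like(x)         # AWGN channel (illustrative noise)
    loss = F.cross_entropy(decoder(y), msgs)  # end-to-end reconstruction loss
    opt.zero_grad(); loss.backward(); opt.step()
```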

1,879 citations