Home
/
Authors
/
Nilay Vaish

Author

Nilay Vaish

Bio: Nilay Vaish is an academic researcher from University of Wisconsin-Madison. The author has contributed to research in topics: Engineering optimization & Network planning and design. The author has an hindex of 5, co-authored 7 publications receiving 3447 citations.

Papers

PDF

Open Access

More filters

Journal Article•DOI•

The gem5 simulator

[...]

Nathan Binkert¹, Bradford M. Beckmann², Gabriel Black³, Steven K. Reinhardt², Ali G. Saidi, Arkaprava Basu⁴, Joel Hestness⁵, Derek R. Hower⁴, Tushar Krishna⁶, Somayeh Sardashti⁴, Rathijit Sen⁴, Korey Sewell⁷, Muhammad Shoaib⁴, Nilay Vaish⁴, Mark D. Hill⁴, Darien Wood⁴ - Show less +12 more•Institutions (7)

Hewlett-Packard¹, Advanced Micro Devices², Google³, University of Wisconsin-Madison⁴, University of Texas at Austin⁵, Massachusetts Institute of Technology⁶, University of Michigan⁷

31 Aug 2011-ACM Sigarch Computer Architecture News

TL;DR: The high level of collaboration on the gem5 project, combined with the previous success of the component parts and a liberal BSD-like license, make gem5 a valuable full-system simulation tool.

...read moreread less

Abstract: The gem5 simulation infrastructure is the merger of the best aspects of the M5 [4] and GEMS [9] simulators. M5 provides a highly configurable simulation framework, multiple ISAs, and diverse CPU models. GEMS complements these features with a detailed and exible memory system, including support for multiple cache coherence protocols and interconnect models. Currently, gem5 supports most commercial ISAs (ARM, ALPHA, MIPS, Power, SPARC, and x86), including booting Linux on three of them (ARM, ALPHA, and x86).The project is the result of the combined efforts of many academic and industrial institutions, including AMD, ARM, HP, MIPS, Princeton, MIT, and the Universities of Michigan, Texas, and Wisconsin. Over the past ten years, M5 and GEMS have been used in hundreds of publications and have been downloaded tens of thousands of times. The high level of collaboration on the gem5 project, combined with the previous success of the component parts and a liberal BSD-like license, make gem5 a valuable full-system simulation tool.

...read moreread less

4,039 citations

Posted Content•

The gem5 Simulator: Version 20.0+

[...]

Jason Lowe-Power¹, Abdul Ahmad², Adria Armejach³, Adrian Herrera, Alec Roelke⁴, Amin Farmahini-Farahani, Andrea Mondelli, Andreas Hansson, Andreas Sandberg, Anthony Gutierrez, Austin Harris, Ayaz Akram⁵, Bagus Hanindhito⁶, Binh Pham⁷, Bobby R. Bruce, Boris Shingarov, Brad Beckmann, Carlos Escuin, Christian Menard⁸, Christian Weis², Daniel Rodrigues Carvalho⁹, Darien Wood, Dibakar Gope, Éder F. Zulian, Gabe Black, Gedare Bloom, Giacomo Travaglini, Hamidreza Khaleghzadeh, Hanhwi Jang, Hoa Nguyen, Hongil Yoon, Ilias Vougioukas, Javier Setoain, Jayneel Gandhi, Jeronimo Castrillon⁸, Krishnendra Nathella, Lena E. Olson, Lizhong Chen, Mahyar Samani, Marc S. Orr, Marjan Fariborz, Matteo Andreozzi, Matthew D. Sinclair, Matthew James Horsnell, Matthias Jung¹⁰, Michael Upton, Miquel Moreto, Mohammad Alian¹¹, Nicolas Derumigny, Nikos Nikoleris¹², Nilay Vaish, Nils Asmussen, Norbert Wehn², Omar Naji, Pablo Prieto, Pouya Fotouhi, Radhika Jagtap, Rahul Thakur, Raza Jafri, Reiley Jeyapaul, Rico Amslinger, Ryan Gambord, Srikant Bharadwaj, Stephan Diestelhorst, Subash Kannoth, Swapnil Haria, Syed Ali, Thomas Grass¹³, Tiago Muck, Timothy Hayes, Timothy M. Jones, Tommaso Marinelli, Trivikram Reddy, Tuan Ta, Tushar Krishna¹⁴, Wendy Arnott Elsasser, William S.-Y. Wang, Yuetsu Kodama, Zhengrong Wang - Show less +75 more•Institutions (14)

University of California, Davis¹, Kaiserslautern University of Technology², Barcelona Supercomputing Center³, University of Virginia⁴, Western Michigan University⁵, Bandung Institute of Technology⁶, Vietnam Academy of Science and Technology⁷, Dresden University of Technology⁸, French Institute for Research in Computer Science and Automation⁹, Fraunhofer Society¹⁰, University of Illinois at Urbana–Champaign¹¹, Uppsala University¹², RWTH Aachen University¹³, Georgia Institute of Technology¹⁴

06 Jan 2021-arXiv: Hardware Architecture

TL;DR: How the gem5 simulator has transitioned to a formal governance model to enable continued improvement and community support for the next 20 years of computer architecture research is discussed.

...read moreread less

Abstract: The open-source and community-supported gem5 simulator is one of the most popular tools for computer architecture research. This simulation infrastructure allows researchers to model modern computer hardware at the cycle level, and it has enough fidelity to boot unmodified Linux-based operating systems and run full applications for multiple architectures including x86, Arm®, and RISC-V. The gem5 simulator has been under active development over the last nine years since the original gem5 release. In this time, there have been over 7000 commits to the codebase from over 250 unique contributors which have improved the simulator by adding new features, fixing bugs, and increasing the code quality. In this paper, we give an overview of gem5's usage and features, describe the current state of the gem5 simulator, and enumerate the major changes since the initial release of gem5. We also discuss how the gem5 simulator has transitioned to a formal governance model to enable continued improvement and community support for the next 20 years of computer architecture research.

...read moreread less

84 citations

Book•

Optimization and Mathematical Modeling in Computer Architecture

[...]

Tony Nowatzki¹, Michael C. Ferris¹, Karthikeyan Sankaralingam¹, Cristian Estan², Nilay Vaish, Darien Wood - Show less +2 more•Institutions (2)

University of Wisconsin-Madison¹, Broadcom²

01 Oct 2013

TL;DR: This book motivate and describe the use of mathematical modeling, specifically optimization based on mixed integer linear programming (MILP) as a way to design and evaluate computer systems, and presents four detailed case studies showing how MILP can be used and quantifying by how much it outperforms traditional design exploration techniques.

...read moreread less

Abstract: In the last few decades computer systems and the underlying hardware have steadily become larger and more complex. The need to increase their efficiency through architectural innovation has not abated, but quantitatively evaluating the effect of various choices has become more difficult. Performance and resource consumption are determined by complex interactions between many modules, each with many possible alternative implementations. We need powerful computer programs to explore large design spaces, but the traditional approach of developing simulators, building prototypes, or writing heuristic-based algorithms in traditional programming languages is often tedious and slow. Fortunately mathematical optimization has made great advances in theory, and many fast commercial and academic solvers are now available. In this book we motivate and describe the use of mathematical modeling, specifically optimization based on mixed integer linear programming (MILP) as a way to design and evaluate computer systems. The major advantage is that the architect or system software writer only needs to describe what the problem is, not how to find a good solution. This greatly speeds up their work and, as our case studies show, it can often lead to better solutions than the traditional approach. In this book we give an overview of modeling techniques used to describe computer systems to mathematical optimization tools. We give a brief introduction to various classes of mathematical optimization frameworks with special focus on mixed integer linear programming which provides a good balance between solver time and expressiveness. We present four detailed case studies -- instruction set customization, data center resource management, spatial architecture scheduling, and resource allocation in tiled architectures -- showing how MILP can be used and quantifying by how much it outperforms traditional design exploration techniques. This book should help a skilled systems designer to learn techniques for using MILP in their problems, and the skilled optimization expert to understand the types of computer systems problems that MILP can be applied to. Fully operational source code for the examples used in this book is provided through the NEOS System at www.neos-guide.org/content/computer-architecture Table of Contents: Acknowledgments / Introduction / An Overview of Optimization / Case Study: Instruction Set Customization / Case Study: Data Center Resource Management / Case Study: Spatial Architecture Scheduling / Case Study: Resource Allocation in Tiled Architectures / Conclusions / Bibliography / Authors' Biographies

...read moreread less

18 citations

Proceedings Article•DOI•

Experiences in Co-designing a Packet Classification Algorithm and a Flexible Hardware Platform

[...]

Nilay Vaish¹, Thawan Kooburat¹, Lorenzo De Carli¹, Karthikeyan Sankaralingam¹, Cristian Estan² - Show less +1 more•Institutions (2)

University of Wisconsin-Madison¹, NetLogic Microsystems²

03 Oct 2011

TL;DR: This work identifies and evaluates changes to the original algorithm and to the platform that can improve throughput and memory utilization and confirms that this solution achieves high throughput (142 million packets per second) and low power (3.1 Watts).

...read moreread less

Abstract: Algorithmic solutions to the packet classification problem in network equipment have long been a subject of study in academia and industry and with increases in network speeds they are becoming even more important. Since general purpose processors cannot meet performance and cost requirements, researchers have been assuming that ASICs or FPGAs are necessary for hardware implementation. Industry and academia have been working on SRAM-based platforms specialized for tables used in network equipment, but existing publications only describe the mapping of simpler exact match or prefix match lookups to such platforms. In this paper we adopt a software-hardware co-design approach mapping the EffiCuts algorithm to the PLUG platform. Our work confirms that this solution achieves high throughput (142 million packets per second) and low power (3.1 Watts). It identifies and evaluates changes to the original algorithm and to the platform that can improve throughput and memory utilization.

...read moreread less

9 citations

Patent•

Simulating vector execution

[...]

Bradford M. Beckmann, Nilay Vaish, Steven K. Reinhardt

22 Jun 2012

TL;DR: In this paper, a simulator detects a given region in code generated by a compiler and then serially executes the region for at least two iterations using a functional-based simulation and using instructions with operands which correspond to P or less lanes of single-instruction multiple-data (SIMD) execution.

...read moreread less

Abstract: A system and method for simulating new instructions without compiler support for the new instructions. A simulator detects a given region in code generated by a compiler. The given region may be a candidate for vectorization or may be a region already vectorized. In response to the detection, the simulator suspends execution of a time-based simulation. The simulator then serially executes the region for at least two iterations using a functional-based simulation and using instructions with operands which correspond to P or less lanes of single-instruction-multiple-data (SIMD) execution. The value P is a maximum number of lanes of SIMD exection supported both by the compiler. The simulator stores checkpoint state during the serial execution. In response to determining no inter-iteration memory dependencies exist, the simulator returns to the time-based simulation and resumes execution using N-wide vector instructions.

...read moreread less

7 citations

Cited by

PDF

Open Access

More filters

Proceedings Article•DOI•

A detailed and flexible cycle-accurate Network-on-Chip simulator

[...]

Nan Jiang¹, Daniel U. Becker¹, George Michelogiannakis², James Balfour³, Brian Towles, David E. Shaw², John Kim¹, William J. Dally¹ - Show less +4 more•Institutions (3)

Stanford University¹, Lawrence Berkeley National Laboratory², Google³

21 Apr 2013

TL;DR: The simulator, BookSim, is designed for simulation flexibility and accurate modeling of network components and offers a large set of configurable network parameters in terms of topology, routing algorithm, flow control, and router microarchitecture, including buffer management and allocation schemes.

...read moreread less

Abstract: Network-on-Chips (NoCs) are becoming integral parts of modern microprocessors as the number of cores and modules integrated on a single chip continues to increase. Research and development of future NoC technology relies on accurate modeling and simulations to evaluate the performance impact and analyze the cost of novel NoC architectures. In this work, we present BookSim, a cycle-accurate simulator for NoCs. The simulator is designed for simulation flexibility and accurate modeling of network components. It features a modular design and offers a large set of configurable network parameters in terms of topology, routing algorithm, flow control, and router microarchitecture, including buffer management and allocation schemes. BookSim furthermore emphasizes detailed implementations of network components that accurately model the behavior of actual hardware. We have validated the accuracy of the simulator against RTL implementations of NoC routers.

...read moreread less

645 citations

Journal Article•DOI•

Ramulator: A Fast and Extensible DRAM Simulator

[...]

Yoongu Kim¹, Weikun Yang¹, Onur Mutlu¹•Institutions (1)

Carnegie Mellon University¹

01 Jan 2016-IEEE Computer Architecture Letters

TL;DR: This paper presents Ramulator, a fast and cycle-accurate DRAM simulator that is built from the ground up for extensibility, and is able to provide out-of-the-box support for a wide array of DRAM standards.

...read moreread less

Abstract: Recently, both industry and academia have proposed many different roadmaps for the future of DRAM. Consequently, there is a growing need for an extensible DRAM simulator, which can be easily modified to judge the merits of today's DRAM standards as well as those of tomorrow. In this paper, we present Ramulator , a fast and cycle-accurate DRAM simulator that is built from the ground up for extensibility. Unlike existing simulators, Ramulator is based on a generalized template for modeling a DRAM system, which is only later infused with the specific details of a DRAM standard. Thanks to such a decoupled and modular design, Ramulator is able to provide out-of-the-box support for a wide array of DRAM standards: DDR3/4, LPDDR3/4, GDDR5, WIO1/2, HBM, as well as some academic proposals (SALP, AL-DRAM, TL-DRAM, RowClone, and SARP). Importantly, Ramulator does not sacrifice simulation speed to gain extensibility: according to our evaluations, Ramulator is 2.5 $\times$ faster than the next fastest simulator. Ramulator is released under the permissive BSD license.

...read moreread less

535 citations

Proceedings Article•DOI•

DSENT - A Tool Connecting Emerging Photonics with Electronics for Opto-Electronic Networks-on-Chip Modeling

[...]

Chen Sun¹, Chia-Hsin Owen Chen¹, George Kurian¹, Lan Wei¹, Jason Miller¹, Anant Agarwal¹, Li-Shiuan Peh¹, Vladimir Stojanovic¹ - Show less +4 more•Institutions (1)

Massachusetts Institute of Technology¹

09 May 2012

TL;DR: DSENT, a NoC modeling tool for rapid design space exploration of electrical and opto-electrical networks, is presented and the results show the implications of different technology scenarios and the need to reduce laser and thermal tuning power in a photonic network due to their non-data-dependent nature.

...read moreread less

Abstract: With the rise of many-core chips that require substantial bandwidth from the network on chip (NoC), integrated photonic links have been investigated as a promising alternative to traditional electrical interconnects While numerous opto-electronic NoCs have been proposed, evaluations of photonic architectures have thus-far had to use a number of simplifications, reflecting the need for a modeling tool that accurately captures the tradeoffs for the emerging technology and its impacts on the overall network In this paper, we present DSENT, a NoC modeling tool for rapid design space exploration of electrical and opto-electrical networks We explain our modeling framework and perform an energy-driven case study, focusing on electrical technology scaling, photonic parameters, and thermal tuning Our results show the implications of different technology scenarios and, in particular, the need to reduce laser and thermal tuning power in a photonic network due to their non-data-dependent nature

...read moreread less

529 citations

Proceedings Article•DOI•

ZSim: fast and accurate microarchitectural simulation of thousand-core systems

[...]

Daniel Sanchez¹, Christos Kozyrakis²•Institutions (2)

Massachusetts Institute of Technology¹, Stanford University²

23 Jun 2013

TL;DR: Zsim, a fast, scalable, and accurate simulator, is built using bound-weave, a two-phase parallelization technique that scales parallel simulation on multicore hosts efficiently with minimal loss of accuracy, and lightweight user-level virtualization is implemented to support complex workloads.

...read moreread less

Abstract: Architectural simulation is time-consuming, and the trend towards hundreds of cores is making sequential simulation even slower. Existing parallel simulation techniques either scale poorly due to excessive synchronization, or sacrifice accuracy by allowing event reordering and using simplistic contention models. As a result, most researchers use sequential simulators and model small-scale systems with 16-32 cores. With 100-core chips already available, developing simulators that scale to thousands of cores is crucial.We present three novel techniques that, together, make thousand-core simulation practical. First, we speed up detailed core models (including OOO cores) with instruction-driven timing models that leverage dynamic binary translation. Second, we introduce bound-weave, a two-phase parallelization technique that scales parallel simulation on multicore hosts efficiently with minimal loss of accuracy. Third, we implement lightweight user-level virtualization to support complex workloads, including multiprogrammed, client-server, and managed-runtime applications, without the need for full-system simulation, sidestepping the lack of scalable OSs and ISAs that support thousands of cores.We use these techniques to build zsim, a fast, scalable, and accurate simulator. On a 16-core host, zsim models a 1024-core chip at speeds of up to 1,500 MIPS using simple cores and up to 300 MIPS using detailed OOO cores, 2-3 orders of magnitude faster than existing parallel simulators. Simulator performance scales well with both the number of modeled cores and the number of host cores. We validate zsim against a real Westmere system on a wide variety of workloads, and find performance and microarchitectural events to be within a narrow range of the real system.

...read moreread less

481 citations

Proceedings Article•DOI•

Ambit: in-memory accelerator for bulk bitwise operations using commodity DRAM technology

[...]

Vivek Seshadri¹, Donghyuk Lee¹, Thomas Mullins¹, Hasan Hassan², Amirali Boroumand¹, Jeremie S. Kim¹, Michael Kozuch³, Onur Mutlu¹, Phillip B. Gibbons¹, Todd C. Mowry¹ - Show less +6 more•Institutions (3)

Carnegie Mellon University¹, ETH Zurich², Intel³

14 Oct 2017

TL;DR: Ambit is proposed, an Accelerator-in-Memory for bulk bitwise operations that largely exploits existing DRAM structure, and hence incurs low cost on top of commodity DRAM designs (1% of DRAM chip area).

...read moreread less

Abstract: Many important applications trigger bulk bitwise operations, i.e., bitwise operations on large bit vectors. In fact, recent works design techniques that exploit fast bulk bitwise operations to accelerate databases (bitmap indices, BitWeaving) and web search (BitFunnel). Unfortunately, in existing architectures, the throughput of bulk bitwise operations is limited by the memory bandwidth available to the processing unit (e.g., CPU, GPU, FPGA, processing-in-memory).To overcome this bottleneck, we propose Ambit, an Accelerator-in-Memory for bulk bitwise operations. Unlike prior works, Ambit exploits the analog operation of DRAM technology to perform bitwise operations completely inside DRAM, thereby exploiting the full internal DRAM bandwidth. Ambit consists of two components. First, simultaneous activation of three DRAM rows that share the same set of sense amplifiers enables the system to perform bitwise AND and OR operations. Second, with modest changes to the sense amplifier, the system can use the inverters present inside the sense amplifier to perform bitwise NOT operations. With these two components, Ambit can perform any bulk bitwise operation efficiently inside DRAM. Ambit largely exploits existing DRAM structure, and hence incurs low cost on top of commodity DRAM designs (1% of DRAM chip area). Importantly, Ambit uses the modern DRAM interface without any changes, and therefore it can be directly plugged onto the memory bus.Our extensive circuit simulations show that Ambit works as expected even in the presence of significant process variation. Averaged across seven bulk bitwise operations, Ambit improves performance by 32X and reduces energy consumption by 35X compared to state-of-the-art systems. When integrated with Hybrid Memory Cube (HMC), a 3D-stacked DRAM with a logic layer, Ambit improves performance of bulk bitwise operations by 9.7X compared to processing in the logic layer of the HMC. Ambit improves the performance of three real-world data-intensive applications, 1) database bitmap indices, 2) BitWeaving, a technique to accelerate database scans, and 3) bit-vector-based implementation of sets, by 3X-7X compared to a state-of-the-art baseline using SIMD optimizations. We describe four other applications that can benefit from Ambit, including a recent technique proposed to speed up web search. We believe that large performance and energy improvements provided by Ambit can enable other applications to use bulk bitwise operations.CCS CONCEPTS• Computer systems organization → Single instruction, multiple data; • Hardware → Hardware accelerator; • Hardware → Dynamic memory;

...read moreread less

444 citations

1
2
3
4
…
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200

Collapse