Home
/
Authors
/
Michael R. Marty

Author

Michael R. Marty

Other affiliations: Wisconsin Alumni Research Foundation, University of Wisconsin-Madison

Bio: Michael R. Marty is an academic researcher from Google. The author has contributed to research in topics: Shared memory & Arbitration. The author has an hindex of 17, co-authored 35 publications receiving 4754 citations. Previous affiliations of Michael R. Marty include Wisconsin Alumni Research Foundation & University of Wisconsin-Madison.

Topics: Shared memory, Arbitration, Arbiter, Amdahl's law, Cache ...read more

Papers

PDF

Open Access

More filters

Journal Article•DOI•

Multifacet's general execution-driven multiprocessor simulator (GEMS) toolset

[...]

Milo M. K. Martin¹, Daniel J. Sorin², Bradford M. Beckmann³, Michael R. Marty³, Min Xu³, Alaa R. Alameldeen³, Kevin E. Moore³, Mark D. Hill³, Darien Wood³ - Show less +5 more•Institutions (3)

University of Pennsylvania¹, Duke University², University of Wisconsin-Madison³

01 Nov 2005-ACM Sigarch Computer Architecture News

TL;DR: The Wisconsin Multifacet Project has created a simulation toolset to characterize and evaluate the performance of multiprocessor hardware systems commonly used as database and web servers as mentioned in this paper, which includes a set of timing simulator modules for modeling the timing of the memory system and microprocessors.

...read moreread less

Abstract: The Wisconsin Multifacet Project has created a simulation toolset to characterize and evaluate the performance of multiprocessor hardware systems commonly used as database and web servers. We leverage an existing full-system functional simulation infrastructure (Simics [14]) as the basis around which to build a set of timing simulator modules for modeling the timing of the memory system and microprocessors. This simulator infrastructure enables us to run architectural experiments using a suite of scaled-down commercial workloads [3]. To enable other researchers to more easily perform such research, we have released these timing simulator modules as the Multifacet General Execution-driven Multiprocessor Simulator (GEMS) Toolset, release 1.0, under GNU GPL [9].

...read moreread less

1,515 citations

Journal Article•DOI•

Amdahl's Law in the Multicore Era

[...]

Mark D. Hill¹, Michael R. Marty²•Institutions (2)

University of Wisconsin-Madison¹, Google²

01 Jul 2008-IEEE Computer

TL;DR: Augmenting Amdahl's law with a corollary for multicore hardware makes it relevant to future generations of chips with multiple processor cores.

...read moreread less

Abstract: Augmenting Amdahl's law with a corollary for multicore hardware makes it relevant to future generations of chips with multiple processor cores. Obtaining optimal multicore performance will require further research in both extracting more parallelism and making sequential cores faster.

...read moreread less

1,245 citations

Proceedings Article•DOI•

Energy proportional datacenter networks

[...]

Dennis Abts¹, Michael R. Marty¹, Philip M. Wells¹, Peter Michael Klausler¹, Hong Liu¹ - Show less +1 more•Institutions (1)

Google¹

19 Jun 2010

TL;DR: It is demonstrated that energy proportional datacenter communication is indeed possible and that there is a significant power advantage to having independent control of each unidirectional channel comprising a network link.

...read moreread less

Abstract: Numerous studies have shown that datacenter computers rarely operate at full utilization, leading to a number of proposals for creating servers that are energy proportional with respect to the computation that they are performing. In this paper, we show that as servers themselves become more energy proportional, the datacenter network can become a significant fraction (up to 50%) of cluster power. In this paper we propose several ways to design a high-performance datacenter network whose power consumption is more proportional to the amount of traffic it is moving -- that is, we propose energy proportional datacenter networks. We first show that a flattened butterfly topology itself is inherently more power efficient than the other commonly proposed topology for high-performance datacenter networks. We then exploit the characteristics of modern plesiochronous links to adjust their power and performance envelopes dynamically. Using a network simulator, driven by both synthetic workloads and production datacenter traces, we characterize and understand design tradeoffs, and demonstrate an 85% reduction in power --- which approaches the ideal energy-proportionality of the network. Our results also demonstrate two challenges for the designers of future network switches: 1) We show that there is a significant power advantage to having independent control of each unidirectional channel comprising a network link, since many traffic patterns show very asymmetric use, and 2) system designers should work to optimize the high-speed channel designs to be more energy efficient by choosing optimal data rate and equalization technology. Given these assumptions, we demonstrate that energy proportional datacenter communication is indeed possible.

...read moreread less

473 citations

Proceedings Article•DOI•

LogTM-SE: Decoupling Hardware Transactional Memory from Caches

[...]

Luke Yen¹, Jayaram Bobba¹, Michael R. Marty¹, Kevin E. Moore¹, Haris Volos¹, Mark D. Hill¹, Michael M. Swift¹, Darien Wood¹ - Show less +4 more•Institutions (1)

University of Wisconsin-Madison¹

10 Feb 2007

TL;DR: This paper proposes a hardware transactional memory system called LogTM Signature Edition (LogTM-SE), which uses signatures to summarize a transactions read-and write-sets and detects conflicts on coherence requests (eager conflict detection), and allows cache victimization, unbounded nesting, thread context switching and migration, and paging.

...read moreread less

Abstract: This paper proposes a hardware transactional memory (HTM) system called LogTM Signature Edition (LogTM-SE). LogTM-SE uses signatures to summarize a transactions read-and write-sets and detects conflicts on coherence requests (eager conflict detection). Transactions update memory "in place" after saving the old value in a per-thread memory log (eager version management). Finally, a transaction commits locally by clearing its signature, resetting the log pointer, etc., while aborts must undo the log. LogTM-SE achieves two key benefits. First, signatures and logs can be implemented without changes to highly-optimized cache arrays because LogTM-SE never moves cached data, changes a blocks cache state, or flash clears bits in the cache. Second, transactions are more easily virtualized because signatures and logs are software accessible, allowing the operating system and runtime to save and restore this state. In particular, LogTM-SE allows cache victimization, unbounded nesting (both open and closed), thread context switching and migration, and paging

...read moreread less

384 citations

Proceedings Article•DOI•

ASR: Adaptive Selective Replication for CMP Caches

[...]

Bradford M. Beckmann¹, Michael R. Marty², Darien Wood²•Institutions (2)

Microsoft¹, University of Wisconsin-Madison²

09 Dec 2006

TL;DR: Adaptive Selective Replication (ASR) as discussed by the authors replicates cache blocks only when it estimates the benefit of replication (lower L2 hit latency) exceeds the cost (more L2 misses).

...read moreread less

Abstract: The large working sets of commercial and scientific workloads stress the L2 caches of Chip Multiprocessors (CMPs). Some CMPs use a shared L2 cache to maximize the on-chip cache capacity and minimize off-chip misses. Others use private L2 caches, replicating data to limit the delay due to global wires and minimize cache access time. Recent hybrid proposals use selective replication to balance latency and capacity, but their static replication rules result in performance degradation for some combinations of workloads and system configurations. This paper proposes Adaptive Selective Replication (ASR), a mechanism that dynamically monitors workload behavior to control replication. ASR replicates cache blocks only when it estimates the benefit of replication (lower L2 hit latency) exceeds the cost (more L2 misses). Full-system simulations of 8-processor CMPs show that ASR provides robust performance: improving performance by as much as 29% versus shared caches, 19% versus private caches, and 12% versus CMP-NuRapid [9] and Victim Replication [41]. Furthermore, while ASR does not improve the performance of all workloads, it provides performance stability by always performing at least comparably to the best alternative including Cooperative Caching [8].

...read moreread less

237 citations

1
2
3
4
…
5
6
7

Collapse

Cited by

PDF

Open Access

More filters

Journal Article•DOI•

The gem5 simulator

[...]

Nathan Binkert¹, Bradford M. Beckmann², Gabriel Black³, Steven K. Reinhardt², Ali G. Saidi, Arkaprava Basu⁴, Joel Hestness⁵, Derek R. Hower⁴, Tushar Krishna⁶, Somayeh Sardashti⁴, Rathijit Sen⁴, Korey Sewell⁷, Muhammad Shoaib⁴, Nilay Vaish⁴, Mark D. Hill⁴, Darien Wood⁴ - Show less +12 more•Institutions (7)

Hewlett-Packard¹, Advanced Micro Devices², Google³, University of Wisconsin-Madison⁴, University of Texas at Austin⁵, Massachusetts Institute of Technology⁶, University of Michigan⁷

31 Aug 2011-ACM Sigarch Computer Architecture News

TL;DR: The high level of collaboration on the gem5 project, combined with the previous success of the component parts and a liberal BSD-like license, make gem5 a valuable full-system simulation tool.

...read moreread less

Abstract: The gem5 simulation infrastructure is the merger of the best aspects of the M5 [4] and GEMS [9] simulators. M5 provides a highly configurable simulation framework, multiple ISAs, and diverse CPU models. GEMS complements these features with a detailed and exible memory system, including support for multiple cache coherence protocols and interconnect models. Currently, gem5 supports most commercial ISAs (ARM, ALPHA, MIPS, Power, SPARC, and x86), including booting Linux on three of them (ARM, ALPHA, and x86).The project is the result of the combined efforts of many academic and industrial institutions, including AMD, ARM, HP, MIPS, Princeton, MIT, and the Universities of Michigan, Texas, and Wisconsin. Over the past ten years, M5 and GEMS have been used in hundreds of publications and have been downloaded tens of thousands of times. The high level of collaboration on the gem5 project, combined with the previous success of the component parts and a liberal BSD-like license, make gem5 a valuable full-system simulation tool.

...read moreread less

4,039 citations

Proceedings Article•DOI•

McPAT: an integrated power, area, and timing modeling framework for multicore and manycore architectures

[...]

Sheng Li¹, Jung Ho Ahn², Richard Strong³, Jay B. Brockman¹, Dean M. Tullsen³, Norman P. Jouppi⁴ - Show less +2 more•Institutions (4)

University of Notre Dame¹, Seoul National University², University of California, San Diego³, Hewlett-Packard⁴

12 Dec 2009

TL;DR: Combining power, area, and timing results of McPAT with performance simulation of PARSEC benchmarks at the 22nm technology node for both common in-order and out-of-order manycore designs shows that when die cost is not taken into account clustering 8 cores together gives the best energy-delay product, whereas when cost is taking into account configuring clusters with 4 cores gives thebest EDA2P and EDAP.

...read moreread less

Abstract: This paper introduces McPAT, an integrated power, area, and timing modeling framework that supports comprehensive design space exploration for multicore and manycore processor configurations ranging from 90nm to 22nm and beyond. At the microarchitectural level, McPAT includes models for the fundamental components of a chip multiprocessor, including in-order and out-of-order processor cores, networks-on-chip, shared caches, integrated memory controllers, and multiple-domain clocking. At the circuit and technology levels, McPAT supports critical-path timing modeling, area modeling, and dynamic, short-circuit, and leakage power modeling for each of the device types forecast in the ITRS roadmap including bulk CMOS, SOI, and double-gate transistors. McPAT has a flexible XML interface to facilitate its use with many performance simulators. Combined with a performance simulator, McPAT enables architects to consistently quantify the cost of new ideas and assess tradeoffs of different architectures using new metrics like energy-delay-area2 product (EDA2P) and energy-delay-area product (EDAP). This paper explores the interconnect options of future manycore processors by varying the degree of clustering over generations of process technologies. Clustering will bring interesting tradeoffs between area and performance because the interconnects needed to group cores into clusters incur area overhead, but many applications can make good use of them due to synergies of cache sharing. Combining power, area, and timing results of McPAT with performance simulation of PARSEC benchmarks at the 22nm technology node for both common in-order and out-of-order manycore designs shows that when die cost is not taken into account clustering 8 cores together gives the best energy-delay product, whereas when cost is taken into account configuring clusters with 4 cores gives the best EDA2P and EDAP.

...read moreread less

2,487 citations

Journal Article•DOI•

Roofline: an insightful visual performance model for multicore architectures

[...]

Samuel Williams¹, Andrew Waterman², David A. Patterson²•Institutions (2)

Lawrence Berkeley National Laboratory¹, University of California, Berkeley²

01 Apr 2009-Communications of The ACM

TL;DR: The Roofline model offers insight on how to improve the performance of software and hardware in the rapidly changing world of connected devices.

...read moreread less

Abstract: The Roofline model offers insight on how to improve the performance of software and hardware.

...read moreread less

2,181 citations

Book•

The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines

[...]

Luiz Andre Barroso¹, Urs Hoelzle¹•Institutions (1)

Google¹

01 Jan 2008

TL;DR: The architecture of WSCs is described, the main factors influencing their design, operation, and cost structure, and the characteristics of their software base are described.

...read moreread less

Abstract: As computation continues to move into the cloud, the computing platform of interest no longer resembles a pizza box or a refrigerator, but a warehouse full of computers. These new large datacenters are quite different from traditional hosting facilities of earlier times and cannot be viewed simply as a collection of co-located servers. Large portions of the hardware and software resources in these facilities must work in concert to efficiently deliver good levels of Internet service performance, something that can only be achieved by a holistic approach to their design and deployment. In other words, we must treat the datacenter itself as one massive warehouse-scale computer (WSC). We describe the architecture of WSCs, the main factors influencing their design, operation, and cost structure, and the characteristics of their software base. We hope it will be useful to architects and programmers of today's WSCs, as well as those of future many-core platforms which may one day implement the equivalent of today's WSCs on a single board. Table of Contents: Introduction / Workloads and Software Infrastructure / Hardware Building Blocks / Datacenter Basics / Energy and Power Efficiency / Modeling Costs / Dealing with Failures and Repairs / Closing Remarks

...read moreread less

1,938 citations

Journal Article•DOI•

Multifacet's general execution-driven multiprocessor simulator (GEMS) toolset

[...]

University of Pennsylvania¹, Duke University², University of Wisconsin-Madison³

01 Nov 2005-ACM Sigarch Computer Architecture News

...read moreread less

1,515 citations

1
2
3
4
…
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200

Collapse