Author

Martin Margala

Bio: Martin Margala is an academic researcher from the University of Massachusetts Lowell. The author has contributed to research in topics: CMOS & Logic gate. The author has an h-index of 24, has co-authored 229 publications, and has received 1,867 citations. Previous affiliations of Martin Margala include the State University of New York System and the University of Alberta.


Papers
Patent
09 Apr 2003
TL;DR: A Processor-In-Memory (PIM) that includes a digital accelerator for image and graphics processing is presented; the accelerator is based on an ALU having multipliers for processing combinations of bits smaller than those in the input data (e.g., 4×4 multipliers if the input data are 8-bit numbers).
Abstract: A Processor-In-Memory (PIM) includes a digital accelerator for image and graphics processing. The digital accelerator is based on an ALU having multipliers for processing combinations of bits smaller than those in the input data (e.g., 4×4 multipliers if the input data are 8-bit numbers). The ALU implements various arithmetic algorithms for addition, multiplication, and other operations. A secondary processing logic includes adders in series and parallel to permit vector operations as well as operations on longer scalars. A self-repairing ALU is also disclosed.
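As a rough illustration of the sub-word arithmetic the abstract describes (composing an 8×8 multiplication from 4×4 partial products by shift-and-add), here is a minimal Python sketch. It is not taken from the patent; the function name and the exact decomposition are illustrative assumptions.

def mul8_from_4x4(a: int, b: int) -> int:
    """Compose an 8x8 multiply from four 4x4 partial products (hypothetical sketch)."""
    assert 0 <= a < 256 and 0 <= b < 256
    a_hi, a_lo = a >> 4, a & 0xF
    b_hi, b_lo = b >> 4, b & 0xF
    # Four 4x4 multiplications; each partial product fits in 8 bits.
    hh = a_hi * b_hi
    hl = a_hi * b_lo
    lh = a_lo * b_hi
    ll = a_lo * b_lo
    # Shift-and-add recombination, as an ALU built from small multipliers would do.
    return (hh << 8) + ((hl + lh) << 4) + ll

assert mul8_from_4x4(200, 123) == 200 * 123

The same recombination generalizes to wider operands, which is what allows an ALU built from small multipliers and adders to also serve vectors and longer scalars.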

161 citations

Proceedings ArticleDOI
11 Feb 2013
TL;DR: This paper takes Memcached, a complex software system, and implements its core functionality on an FPGA, able to tightly integrate networking, compute, and memory, and overcome many of the bottlenecks found in standard servers.
Abstract: Providing low-latency access to large amounts of data is one of the foremost requirements for many web services. To address these needs, systems such as Memcached have been created which provide a distributed, all in-memory key-value store. These systems are critical and often deployed across hundreds or thousands of servers. However, these systems are not well matched for commodity servers, as they require significant CPU resources to achieve reasonable network bandwidth, yet the core Memcached functions do not benefit from the high performance of standard server CPUs. In this paper, we demonstrate the design of an FPGA-based Memcached appliance. We take Memcached, a complex software system, and implement its core functionality on an FPGA. By leveraging the FPGA's design and utilizing its customizable logic to create a specialized appliance we are able to tightly integrate networking, compute, and memory. This integration allows us to overcome many of the bottlenecks found in standard servers. Our design provides performance on-par with baseline servers, but consumes only 9% of the power of the baseline. Scaled out, we see benefits at the data center level, substantially improving the performance-per-dollar while improving energy efficiency by 3.2X to 10.9X.
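For readers unfamiliar with what "core Memcached functionality" means, the hot path is essentially hash, bucket lookup, and value store/retrieve. The toy Python model below illustrates only that data path; it is a software stand-in, not the paper's FPGA pipeline, and the class and method names are invented for illustration.

import hashlib

class TinyKVStore:
    """Minimal in-memory key-value store modeling the Memcached-style GET/SET path."""
    def __init__(self, n_buckets=1024):
        self.buckets = [[] for _ in range(n_buckets)]

    def _bucket(self, key: bytes) -> list:
        # Hash the key and pick a bucket, as the lookup stage of the data path would.
        h = int.from_bytes(hashlib.sha1(key).digest()[:4], "big")
        return self.buckets[h % len(self.buckets)]

    def set(self, key: bytes, value: bytes) -> None:
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)   # overwrite existing entry
                return
        bucket.append((key, value))

    def get(self, key: bytes):
        for k, v in self._bucket(key):
            if k == key:
                return v
        return None

store = TinyKVStore()
store.set(b"user:42", b"alice")
assert store.get(b"user:42") == b"alice"

The paper's argument is that this simple GET/SET data path, together with the surrounding network handling, maps naturally onto tightly integrated FPGA logic, which is where the power savings come from.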

92 citations

Proceedings ArticleDOI
23 May 2004
TL;DR: A fast, low-power Sobel edge-detection processor targeted at image processing and volume rendering applications, designed and implemented in 0.18-µm CMOS technology.
Abstract: This paper describes a novel, fast, and low-power Sobel edge detection processor targeted for image processing and volume rendering applications. The Sobel processor was built as part of a real-time shear-warp factorization volume rendering system to compute a gradient. The Sobel operator processor was designed and implemented in 0.18-µm CMOS technology. Optimizations made at the mathematical-model level led to a simple, regular architecture. High speed and low power consumption were achieved through pipelining and parallelism at the component level. Employing non-full-swing CPL to design the Sobel processor sub-components reduced the power-delay product by up to 40%. Simulation results showed that the processor achieves a worst-case delay of 4.61 ns and dissipates an average of 8.24 mW at 1.8 V and 200 MHz.
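For reference, the gradient the processor computes is the standard 3×3 Sobel operator. The Python sketch below is a plain software model of that computation, using the common |Gx| + |Gy| magnitude approximation (an assumption here, not a detail from the paper); it says nothing about the CPL circuit techniques the paper uses.

import numpy as np

KX = np.array([[-1, 0, 1],
               [-2, 0, 2],
               [-1, 0, 1]])   # horizontal-gradient kernel
KY = KX.T                     # vertical-gradient kernel

def sobel_magnitude(img: np.ndarray) -> np.ndarray:
    """Return |Gx| + |Gy| for the interior pixels of a grayscale image."""
    h, w = img.shape
    out = np.zeros((h - 2, w - 2), dtype=np.int32)
    for y in range(h - 2):
        for x in range(w - 2):
            window = img[y:y + 3, x:x + 3].astype(np.int32)
            gx = int((window * KX).sum())
            gy = int((window * KY).sum())
            out[y, x] = abs(gx) + abs(gy)   # hardware-friendly magnitude approximation
    return out

Because each output pixel depends only on a 3×3 window, the per-pixel work pipelines and parallelizes naturally, which is the property the hardware implementation exploits.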

71 citations

Journal ArticleDOI
TL;DR: This paper presents the design and characterization of 12 full-adder circuits in the IBM 90-nm process, including three new full-adder circuits using the recently proposed split-path data-driven dynamic logic.
Abstract: This paper presents the design and characterization of 12 full-adder circuits in the IBM 90-nm process. These include three new full-adder circuits using the recently proposed split-path data-driven dynamic logic. Based on the logic function realized, the adders were characterized for performance and power consumption when operated under various supply voltages and fan-out loads. The adders were then further deployed in a 32-bit ripple-carry adder and an 8×4 multiplier to evaluate the impact of sum and carry propagation delays on the performance and power of these systems. Performance characterization of the adder circuits in the presence of process and voltage variations was also performed through Monte Carlo simulations. Besides analyzing and comparing circuit performance, the study also underlines the possible impact of the choice of logic function.
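As background for the ripple-carry and multiplier experiments, the logic function every one of these cells realizes is the 1-bit full adder. Below is a minimal behavioral Python model of that function and of a 32-bit ripple-carry adder built from it; it is an illustration only, not any of the paper's transistor-level circuits.

def full_adder(a: int, b: int, cin: int):
    """Return (sum, carry_out) for one-bit inputs."""
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout

def ripple_carry_add(x: int, y: int, width: int = 32) -> int:
    carry, result = 0, 0
    for i in range(width):
        s, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        result |= s << i
    return result   # carry out of the MSB is dropped, i.e. addition modulo 2**width

assert ripple_carry_add(0xFFFF_0001, 0x0000_FFFF) == (0xFFFF_0001 + 0x0000_FFFF) % 2**32

In the ripple-carry structure each stage's carry output feeds the next stage's carry input, which is why the sum and carry propagation delays of the individual cell dominate the performance and power of the larger datapaths evaluated in the paper.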

49 citations

Proceedings ArticleDOI
09 Aug 1999
TL;DR: This paper presents an extensive summary of the latest developments in low-power circuit techniques and methods for Static Random Access Memories, including capacitance reduction by using a divided word-line structure or single-bitline cross-point cell activation.
Abstract: This paper presents an extensive summary of the latest developments in low-power circuit techniques and methods for Static Random Access Memories. The key techniques for power reduction in both active and standby modes are: capacitance reduction, using a divided word-line structure or single-bitline cross-point cell activation; pulsed operation, using an ATD generator and reduced signal swings on high-capacitance predecode lines, write bus lines, and data lines; AC current reduction, using multistage decoding; operating-voltage reduction coupled with low-power sensing, using charge-transfer amplification, a step-down boosted word-line scheme, or fully current-mode read/write operation; and leakage-current suppression, using dual-Vt, auto-backgate-controlled multiple-Vt, or dynamic leakage cut-off techniques.

46 citations


Cited by
01 Jan 2016
Design of Analog CMOS Integrated Circuits (textbook).

1,038 citations

Book
02 Sep 2008
TL;DR: This article surveys the state of the art in electronics prognostics and health management; the four current approaches are built-in test (BIT), use of fuses and canary devices, monitoring and reasoning of failure precursors, and modeling accumulated damage based on measured life-cycle loads.
Abstract: There has been a growing interest in monitoring the ongoing "health" of products and systems in order to predict failures and provide warning to avoid catastrophic failure. Here, health is defined as the extent of degradation or deviation from an expected normal condition. While the application of health monitoring, also referred to as prognostics, is well established for assessment of mechanical systems, this is not the case for electronic systems. However, electronic systems are integral to the functionality of most systems today, and their reliability is often critical for system reliability. This paper presents the state-of-practice and the current state-of-research in the area of electronics prognostics and health management. Four current approaches include built-in-test (BIT), use of fuses and canary devices, monitoring and reasoning of failure precursors, and modeling accumulated damage based on measured life-cycle loads. Examples are provided for these different approaches, and the implementation challenges are discussed.

725 citations

Book
01 Jan 2008
TL;DR: In this paper, a physics of failure (PoF) based approach is proposed for the prediction of the future state of reliability of a system under its actual application conditions, which integrates sensor data with models that enable in situ assessment of the deviation or degradation of a product from an expected normal operating condition.
Abstract: Reliability is the ability of a product or system to perform as intended (i.e., without failure and within specified performance limits) for a specified time, in its life-cycle environment. Commonly used electronics reliability prediction methods (e.g., Mil-HDBK-217, 217-PLUS, PRISM, Telcordia, FIDES) based on handbook methods have been shown to be misleading and provide erroneous life predictions. The use of stress and damage models permits a far superior accounting of the reliability and the physics of failure (PoF); however, sufficient knowledge of the actual operating and environmental application conditions of the product is still required. This article presents a PoF-based prognostics and health management approach for effective reliability prediction. PoF is an approach that utilizes knowledge of a product's life-cycle loading and failure mechanisms to perform reliability modeling, design, and assessment. This method permits the assessment of the reliability of a system under its actual application conditions. It integrates sensor data with models that enable in situ assessment of the deviation or degradation of a product from an expected normal operating condition and the prediction of the future state of reliability. This article presents a formal implementation procedure, which includes failure modes, mechanisms, and effects analysis, data reduction and feature extraction from the life-cycle loads, damage accumulation, and assessment of uncertainty. Applications of PoF-based prognostics and health management are also discussed. Keywords: reliability; prognostics; physics of failure; design-for-reliability; reliability prediction
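To make the damage-accumulation step concrete, the sketch below shows one common way it can be instantiated: reduce the measured life-cycle loads to stress-range/cycle-count pairs and sum the damage fractions with a linear (Miner's-rule-style) model. The model form, coefficients, and function names are assumptions for illustration and are not taken from the article.

def cycles_to_failure(stress_range: float, fatigue_coeff: float = 1e12,
                      fatigue_exponent: float = 4.0) -> float:
    """Basquin-style S-N model, N_f = C * S**(-m); coefficients are placeholders."""
    return fatigue_coeff * stress_range ** (-fatigue_exponent)

def accumulated_damage(load_cycles) -> float:
    """Sum n_i / N_f(S_i) over (stress_range, cycle_count) pairs.
    Accumulated damage >= 1.0 is read as predicted failure."""
    return sum(n / cycles_to_failure(s) for s, n in load_cycles)

# Example: life-cycle loads reduced to stress-range / count pairs (e.g., by rainflow counting).
history = [(120.0, 1_000), (80.0, 10_000), (40.0, 100_000)]
damage = accumulated_damage(history)
remaining_fraction = max(0.0, 1.0 - damage)

A full PoF flow would replace the placeholder S-N model with the failure-mechanism model identified during failure modes, mechanisms, and effects analysis, and would carry the uncertainty of each input through to the remaining-life estimate.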

677 citations

Proceedings ArticleDOI
15 Oct 2016
TL;DR: A new cloud architecture that uses reconfigurable logic to accelerate both network plane functions and applications, and is much more scalable than prior work which used secondary rack-scale networks for inter-FPGA communication.
Abstract: Hyperscale datacenter providers have struggled to balance the growing need for specialized hardware (efficiency) with the economic benefits of homogeneity (manageability). In this paper we propose a new cloud architecture that uses reconfigurable logic to accelerate both network plane functions and applications. This Configurable Cloud architecture places a layer of reconfigurable logic (FPGAs) between the network switches and the servers, enabling network flows to be programmably transformed at line rate, enabling acceleration of local applications running on the server, and enabling the FPGAs to communicate directly, at datacenter scale, to harvest remote FPGAs unused by their local servers. We deployed this design over a production server bed, and show how it can be used for both service acceleration (Web search ranking) and network acceleration (encryption of data in transit at high speeds). This architecture is much more scalable than prior work which used secondary rack-scale networks for inter-FPGA communication. By coupling to the network plane, direct FPGA-to-FPGA messages can be achieved at comparable latency to previous work, without the secondary network. Additionally, the scale of direct inter-FPGA messaging is much larger. The average round-trip latencies observed in our measurements among 24, 1,000, and 250,000 machines are under 3, 9, and 20 microseconds, respectively. The Configurable Cloud architecture has been deployed at hyperscale in Microsoft's production datacenters worldwide.

512 citations

Proceedings ArticleDOI
02 Jun 2018
TL;DR: This paper describes the NPU architecture for Project Brainwave, a production-scale system for real-time AI, which achieves more than an order of magnitude improvement in latency and throughput over state-of-the-art GPUs on large RNNs at a batch size of 1.
Abstract: Interactive AI-powered services require low-latency evaluation of deep neural network (DNN) models, aka "real-time AI". The growing demand for computationally expensive, state-of-the-art DNNs, coupled with diminishing performance gains of general-purpose architectures, has fueled an explosion of specialized Neural Processing Units (NPUs). NPUs for interactive services should satisfy two requirements: (1) execution of DNN models with low latency, high throughput, and high efficiency, and (2) flexibility to accommodate evolving state-of-the-art models (e.g., RNNs, CNNs, MLPs) without costly silicon updates. This paper describes the NPU architecture for Project Brainwave, a production-scale system for real-time AI. The Brainwave NPU achieves more than an order of magnitude improvement in latency and throughput over state-of-the-art GPUs on large RNNs at a batch size of 1. The NPU attains this performance using a single-threaded SIMD ISA paired with a distributed microarchitecture capable of dispatching over 7M operations from a single instruction. The spatially distributed microarchitecture, scaled up to 96,000 multiply-accumulate units, is supported by hierarchical instruction decoders and schedulers coupled with thousands of independently addressable high-bandwidth on-chip memories, and can transparently exploit many levels of fine-grain SIMD parallelism. When targeting an FPGA, microarchitectural parameters such as native datapaths and numerical precision can be "synthesis specialized" to models at compile time, enabling atypically high FPGA performance competitive with hardened NPUs. When running on an Intel Stratix 10 280 FPGA, the Brainwave NPU achieves performance ranging from ten to over thirty-five teraflops, with no batching, on large, memory-intensive RNNs.
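As a very loose mental model of the batch-1 matrix-vector work a Brainwave-style NPU broadcasts from a single SIMD instruction, the Python sketch below splits a matrix-vector product into independent tiles of multiply-accumulates. The tile sizes and the sequential scheduling are illustrative assumptions only; the real ISA, microarchitecture, and numerics are as described in the paper, not modeled here.

import numpy as np

def tiled_matvec(W: np.ndarray, x: np.ndarray, tile_rows: int = 4, tile_cols: int = 8) -> np.ndarray:
    rows, cols = W.shape
    y = np.zeros(rows, dtype=W.dtype)
    # Each (r, c) tile is an independent block of multiply-accumulates that parallel
    # hardware tiles could execute concurrently; here they simply run in a loop.
    for r in range(0, rows, tile_rows):
        for c in range(0, cols, tile_cols):
            y[r:r + tile_rows] += W[r:r + tile_rows, c:c + tile_cols] @ x[c:c + tile_cols]
    return y

W = np.random.rand(16, 32).astype(np.float32)
x = np.random.rand(32).astype(np.float32)
assert np.allclose(tiled_matvec(W, x), W @ x, atol=1e-4)

In hardware the tiles execute concurrently across the distributed multiply-accumulate units, which is how the design sustains high throughput without batching.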

498 citations