Author

Martin Margala

Bio: Martin Margala is an academic researcher from the University of Massachusetts Lowell. The author has contributed to research in topics: CMOS & Logic gate. The author has an h-index of 24, has co-authored 229 publications, and has received 1,867 citations. Previous affiliations of Martin Margala include the State University of New York System and the University of Alberta.


Papers
Patent
09 Apr 2003
TL;DR: A Processor-In-Memory (PIM) that includes a digital accelerator for image and graphics processing is presented; the accelerator is based on an ALU having multipliers for processing combinations of bits smaller than those in the input data (e.g., 4×4 multipliers if the input data are 8-bit numbers).
Abstract: A Processor-In-Memory (PIM) includes a digital accelerator for image and graphics processing. The digital accelerator is based on an ALU having multipliers for processing combinations of bits smaller than those in the input data (e.g., 4×4 multipliers if the input data are 8-bit numbers). The ALU implements various arithmetic algorithms for addition, multiplication, and other operations. A secondary processing logic includes adders in series and parallel to permit vector operations as well as operations on longer scalars. A self-repairing ALU is also disclosed.
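As a rough illustration of the sub-word arithmetic the abstract describes (composing an 8×8 multiplication from 4×4 partial products by shift-and-add), here is a minimal Python sketch. It is not taken from the patent; the function name and the exact decomposition are illustrative assumptions.

def mul8_from_4x4(a: int, b: int) -> int:
    """Compose an 8x8 multiply from four 4x4 partial products (hypothetical sketch)."""
    assert 0 <= a < 256 and 0 <= b < 256
    a_hi, a_lo = a >> 4, a & 0xF
    b_hi, b_lo = b >> 4, b & 0xF
    # Four 4x4 multiplications; each partial product fits in 8 bits.
    hh = a_hi * b_hi
    hl = a_hi * b_lo
    lh = a_lo * b_hi
    ll = a_lo * b_lo
    # Shift-and-add recombination, as an ALU built from small multipliers would do.
    return (hh << 8) + ((hl + lh) << 4) + ll

assert mul8_from_4x4(200, 123) == 200 * 123

The same recombination generalizes to wider operands, which is what allows an ALU built from small multipliers and adders to also serve vectors and longer scalars.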

161 citations

Proceedings ArticleDOI
11 Feb 2013
TL;DR: This paper takes Memcached, a complex software system, and implements its core functionality on an FPGA, able to tightly integrate networking, compute, and memory, and overcome many of the bottlenecks found in standard servers.
Abstract: Providing low-latency access to large amounts of data is one of the foremost requirements for many web services. To address these needs, systems such as Memcached have been created which provide a distributed, all in-memory key-value store. These systems are critical and often deployed across hundreds or thousands of servers. However, these systems are not well matched for commodity servers, as they require significant CPU resources to achieve reasonable network bandwidth, yet the core Memcached functions do not benefit from the high performance of standard server CPUs. In this paper, we demonstrate the design of an FPGA-based Memcached appliance. We take Memcached, a complex software system, and implement its core functionality on an FPGA. By leveraging the FPGA's design and utilizing its customizable logic to create a specialized appliance we are able to tightly integrate networking, compute, and memory. This integration allows us to overcome many of the bottlenecks found in standard servers. Our design provides performance on-par with baseline servers, but consumes only 9% of the power of the baseline. Scaled out, we see benefits at the data center level, substantially improving the performance-per-dollar while improving energy efficiency by 3.2X to 10.9X.
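For readers unfamiliar with what "core Memcached functionality" means, the hot path is essentially hash, bucket lookup, and value store/retrieve. The toy Python model below illustrates only that data path; it is a software stand-in, not the paper's FPGA pipeline, and the class and method names are invented for illustration.

import hashlib

class TinyKVStore:
    """Minimal in-memory key-value store modeling the Memcached-style GET/SET path."""
    def __init__(self, n_buckets=1024):
        self.buckets = [[] for _ in range(n_buckets)]

    def _bucket(self, key: bytes) -> list:
        # Hash the key and pick a bucket, as the lookup stage of the data path would.
        h = int.from_bytes(hashlib.sha1(key).digest()[:4], "big")
        return self.buckets[h % len(self.buckets)]

    def set(self, key: bytes, value: bytes) -> None:
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)   # overwrite existing entry
                return
        bucket.append((key, value))

    def get(self, key: bytes):
        for k, v in self._bucket(key):
            if k == key:
                return v
        return None

store = TinyKVStore()
store.set(b"user:42", b"alice")
assert store.get(b"user:42") == b"alice"

The paper's argument is that this simple GET/SET data path, together with the surrounding network handling, maps naturally onto tightly integrated FPGA logic, which is where the power savings come from.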

92 citations

Proceedings ArticleDOI
23 May 2004
TL;DR: A fast, low-power Sobel edge-detection processor targeted at image processing and volume rendering applications, designed and implemented in 0.18-µm CMOS technology.
Abstract: This paper describes a novel, fast, and low-power Sobel edge detection processor targeted for image processing and volume rendering applications. The Sobel processor was built as part of a real-time shear-warp factorization volume rendering system to compute a gradient. The Sobel operator processor was designed and implemented in 0.18-µm CMOS technology. Optimizations made at the mathematical-model level led to a simple, regular architecture. High speed and low power consumption were achieved through pipelining and parallelism at the component level. Employing non-full-swing CPL to design the Sobel processor sub-components reduced the power-delay product by up to 40%. Simulation results showed that the processor achieves a worst-case delay of 4.61 ns and dissipates an average of 8.24 mW at 1.8 V and 200 MHz.
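For reference, the gradient the processor computes is the standard 3×3 Sobel operator. The Python sketch below is a plain software model of that computation, using the common |Gx| + |Gy| magnitude approximation (an assumption here, not a detail from the paper); it says nothing about the CPL circuit techniques the paper uses.

import numpy as np

KX = np.array([[-1, 0, 1],
               [-2, 0, 2],
               [-1, 0, 1]])   # horizontal-gradient kernel
KY = KX.T                     # vertical-gradient kernel

def sobel_magnitude(img: np.ndarray) -> np.ndarray:
    """Return |Gx| + |Gy| for the interior pixels of a grayscale image."""
    h, w = img.shape
    out = np.zeros((h - 2, w - 2), dtype=np.int32)
    for y in range(h - 2):
        for x in range(w - 2):
            window = img[y:y + 3, x:x + 3].astype(np.int32)
            gx = int((window * KX).sum())
            gy = int((window * KY).sum())
            out[y, x] = abs(gx) + abs(gy)   # hardware-friendly magnitude approximation
    return out

Because each output pixel depends only on a 3×3 window, the per-pixel work pipelines and parallelizes naturally, which is the property the hardware implementation exploits.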

71 citations

Journal ArticleDOI
TL;DR: This paper presents the design and characterization of 12 full-adder circuits in the IBM 90-nm process, including three new full-adder circuits using the recently proposed split-path data-driven dynamic logic.
Abstract: This paper presents the design and characterization of 12 full-adder circuits in the IBM 90-nm process. These include three new full-adder circuits using the recently proposed split-path data-driven dynamic logic. Based on the logic function realized, the adders were characterized for performance and power consumption when operated under various supply voltages and fan-out loads. The adders were then further deployed in a 32-bit ripple-carry adder and an 8×4 multiplier to evaluate the impact of sum and carry propagation delays on the performance and power of these systems. Performance characterization of the adder circuits in the presence of process and voltage variations was also performed through Monte Carlo simulations. Besides analyzing and comparing circuit performance, the study also underlines the possible impact of the choice of logic function.
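As background for the ripple-carry and multiplier experiments, the logic function every one of these cells realizes is the 1-bit full adder. Below is a minimal behavioral Python model of that function and of a 32-bit ripple-carry adder built from it; it is an illustration only, not any of the paper's transistor-level circuits.

def full_adder(a: int, b: int, cin: int):
    """Return (sum, carry_out) for one-bit inputs."""
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout

def ripple_carry_add(x: int, y: int, width: int = 32) -> int:
    carry, result = 0, 0
    for i in range(width):
        s, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        result |= s << i
    return result   # carry out of the MSB is dropped, i.e. addition modulo 2**width

assert ripple_carry_add(0xFFFF_0001, 0x0000_FFFF) == (0xFFFF_0001 + 0x0000_FFFF) % 2**32

In the ripple-carry structure each stage's carry output feeds the next stage's carry input, which is why the sum and carry propagation delays of the individual cell dominate the performance and power of the larger datapaths evaluated in the paper.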

49 citations

Proceedings ArticleDOI
09 Aug 1999
TL;DR: This paper presents an extensive summary of the latest developments in low-power circuit techniques and methods for Static Random Access Memories, including capacitance reduction by using a divided word-line structure or single-bitline cross-point cell activation.
Abstract: This paper presents an extensive summary of the latest developments in low-power circuit techniques and methods for Static Random Access Memories. The key techniques for power reduction in both active and standby modes are: capacitance reduction, using a divided word-line structure or single-bitline cross-point cell activation; pulsed operation, using an ATD generator and reduced signal swings on high-capacitance predecode lines, write bus lines, and data lines; AC current reduction, using multistage decoding; operating-voltage reduction coupled with low-power sensing, using charge-transfer amplification, a step-down boosted word-line scheme, or fully current-mode read/write operation; and leakage-current suppression, using dual-Vt, auto-backgate-controlled multiple-Vt, or dynamic leakage cut-off techniques.

46 citations


Cited by
01 Jan 2016
Design of Analog CMOS Integrated Circuits (textbook).

1,038 citations

Book
02 Sep 2008
TL;DR: This article surveys the state of the art in electronics prognostics and health management; the four current approaches are built-in test (BIT), use of fuses and canary devices, monitoring and reasoning of failure precursors, and modeling accumulated damage based on measured life-cycle loads.
Abstract: There has been a growing interest in monitoring the ongoing "health" of products and systems in order to predict failures and provide warning to avoid catastrophic failure. Here, health is defined as the extent of degradation or deviation from an expected normal condition. While the application of health monitoring, also referred to as prognostics, is well established for assessment of mechanical systems, this is not the case for electronic systems. However, electronic systems are integral to the functionality of most systems today, and their reliability is often critical for system reliability. This paper presents the state-of-practice and the current state-of-research in the area of electronics prognostics and health management. Four current approaches include built-in-test (BIT), use of fuses and canary devices, monitoring and reasoning of failure precursors, and modeling accumulated damage based on measured life-cycle loads. Examples are provided for these different approaches, and the implementation challenges are discussed.

725 citations

Book
01 Jan 2008
TL;DR: In this paper, a physics of failure (PoF) based approach is proposed for the prediction of the future state of reliability of a system under its actual application conditions, which integrates sensor data with models that enable in situ assessment of the deviation or degradation of a product from an expected normal operating condition.
Abstract: Reliability is the ability of a product or system to perform as intended (i.e., without failure and within specified performance limits) for a specified time, in its life-cycle environment. Commonly used electronics reliability prediction methods (e.g., Mil-HDBK-217, 217-PLUS, PRISM, Telcordia, FIDES) based on handbook methods have been shown to be misleading and provide erroneous life predictions. The use of stress and damage models permits a far superior accounting of the reliability and the physics of failure (PoF); however, sufficient knowledge of the actual operating and environmental application conditions of the product is still required. This article presents a PoF-based prognostics and health management approach for effective reliability prediction. PoF is an approach that utilizes knowledge of a product's life-cycle loading and failure mechanisms to perform reliability modeling, design, and assessment. This method permits the assessment of the reliability of a system under its actual application conditions. It integrates sensor data with models that enable in situ assessment of the deviation or degradation of a product from an expected normal operating condition and the prediction of the future state of reliability. This article presents a formal implementation procedure, which includes failure modes, mechanisms, and effects analysis, data reduction and feature extraction from the life-cycle loads, damage accumulation, and assessment of uncertainty. Applications of PoF-based prognostics and health management are also discussed. Keywords: reliability; prognostics; physics of failure; design-for-reliability; reliability prediction
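To make the damage-accumulation step concrete, the sketch below shows one common way it can be instantiated: reduce the measured life-cycle loads to stress-range/cycle-count pairs and sum the damage fractions with a linear (Miner's-rule-style) model. The model form, coefficients, and function names are assumptions for illustration and are not taken from the article.

def cycles_to_failure(stress_range: float, fatigue_coeff: float = 1e12,
                      fatigue_exponent: float = 4.0) -> float:
    """Basquin-style S-N model, N_f = C * S**(-m); coefficients are placeholders."""
    return fatigue_coeff * stress_range ** (-fatigue_exponent)

def accumulated_damage(load_cycles) -> float:
    """Sum n_i / N_f(S_i) over (stress_range, cycle_count) pairs.
    Accumulated damage >= 1.0 is read as predicted failure."""
    return sum(n / cycles_to_failure(s) for s, n in load_cycles)

# Example: life-cycle loads reduced to stress-range / count pairs (e.g., by rainflow counting).
history = [(120.0, 1_000), (80.0, 10_000), (40.0, 100_000)]
damage = accumulated_damage(history)
remaining_fraction = max(0.0, 1.0 - damage)

A full PoF flow would replace the placeholder S-N model with the failure-mechanism model identified during failure modes, mechanisms, and effects analysis, and would carry the uncertainty of each input through to the remaining-life estimate.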

677 citations

Proceedings ArticleDOI
15 Oct 2016
TL;DR: A new cloud architecture that uses reconfigurable logic to accelerate both network plane functions and applications, and is much more scalable than prior work which used secondary rack-scale networks for inter-FPGA communication.
Abstract: Hyperscale datacenter providers have struggled to balance the growing need for specialized hardware (efficiency) with the economic benefits of homogeneity (manageability). In this paper we propose a new cloud architecture that uses reconfigurable logic to accelerate both network plane functions and applications. This Configurable Cloud architecture places a layer of reconfigurable logic (FPGAs) between the network switches and the servers, enabling network flows to be programmably transformed at line rate, enabling acceleration of local applications running on the server, and enabling the FPGAs to communicate directly, at datacenter scale, to harvest remote FPGAs unused by their local servers. We deployed this design over a production server bed, and show how it can be used for both service acceleration (Web search ranking) and network acceleration (encryption of data in transit at high speeds). This architecture is much more scalable than prior work which used secondary rack-scale networks for inter-FPGA communication. By coupling to the network plane, direct FPGA-to-FPGA messages can be achieved at comparable latency to previous work, without the secondary network. Additionally, the scale of direct inter-FPGA messaging is much larger. The average round-trip latencies observed in our measurements among 24, 1,000, and 250,000 machines are under 3, 9, and 20 microseconds, respectively. The Configurable Cloud architecture has been deployed at hyperscale in Microsoft's production datacenters worldwide.

512 citations

Proceedings ArticleDOI
02 Jun 2018
TL;DR: This paper describes the NPU architecture for Project Brainwave, a production-scale system for real-time AI, which achieves more than an order of magnitude improvement in latency and throughput over state-of-the-art GPUs on large RNNs at a batch size of 1.
Abstract: Interactive AI-powered services require low-latency evaluation of deep neural network (DNN) models, aka "real-time AI". The growing demand for computationally expensive, state-of-the-art DNNs, coupled with diminishing performance gains of general-purpose architectures, has fueled an explosion of specialized Neural Processing Units (NPUs). NPUs for interactive services should satisfy two requirements: (1) execution of DNN models with low latency, high throughput, and high efficiency, and (2) flexibility to accommodate evolving state-of-the-art models (e.g., RNNs, CNNs, MLPs) without costly silicon updates. This paper describes the NPU architecture for Project Brainwave, a production-scale system for real-time AI. The Brainwave NPU achieves more than an order of magnitude improvement in latency and throughput over state-of-the-art GPUs on large RNNs at a batch size of 1. The NPU attains this performance using a single-threaded SIMD ISA paired with a distributed microarchitecture capable of dispatching over 7M operations from a single instruction. The spatially distributed microarchitecture, scaled up to 96,000 multiply-accumulate units, is supported by hierarchical instruction decoders and schedulers coupled with thousands of independently addressable high-bandwidth on-chip memories, and can transparently exploit many levels of fine-grain SIMD parallelism. When targeting an FPGA, microarchitectural parameters such as native datapaths and numerical precision can be "synthesis specialized" to models at compile time, enabling atypically high FPGA performance competitive with hardened NPUs. When running on an Intel Stratix 10 280 FPGA, the Brainwave NPU achieves performance ranging from ten to over thirty-five teraflops, with no batching, on large, memory-intensive RNNs.
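As a very loose mental model of the batch-1 matrix-vector work a Brainwave-style NPU broadcasts from a single SIMD instruction, the Python sketch below splits a matrix-vector product into independent tiles of multiply-accumulates. The tile sizes and the sequential scheduling are illustrative assumptions only; the real ISA, microarchitecture, and numerics are as described in the paper, not modeled here.

import numpy as np

def tiled_matvec(W: np.ndarray, x: np.ndarray, tile_rows: int = 4, tile_cols: int = 8) -> np.ndarray:
    rows, cols = W.shape
    y = np.zeros(rows, dtype=W.dtype)
    # Each (r, c) tile is an independent block of multiply-accumulates that parallel
    # hardware tiles could execute concurrently; here they simply run in a loop.
    for r in range(0, rows, tile_rows):
        for c in range(0, cols, tile_cols):
            y[r:r + tile_rows] += W[r:r + tile_rows, c:c + tile_cols] @ x[c:c + tile_cols]
    return y

W = np.random.rand(16, 32).astype(np.float32)
x = np.random.rand(32).astype(np.float32)
assert np.allclose(tiled_matvec(W, x), W @ x, atol=1e-4)

In hardware the tiles execute concurrently across the distributed multiply-accumulate units, which is how the design sustains high throughput without batching.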

498 citations