Author

Kambiz Samadi

Bio: Kambiz Samadi is an academic researcher from Qualcomm. The author has contributed to research in the topics Integrated circuit and Routing (electronic design automation). The author has an h-index of 24 and has co-authored 68 publications receiving 2,526 citations. Previous affiliations of Kambiz Samadi include the University of California, San Diego.


Papers
Proceedings ArticleDOI
20 Apr 2009
TL;DR: ORION 2.0 is an extensive enhancement of the original ORION models, with completely new subcomponent power models, area models, and improved and updated technology models; validation against recent Intel NoC prototypes confirms the need for accurate early-stage NoC power estimation.
Abstract: As industry moves towards many-core chips, networks-on-chip (NoCs) are emerging as the scalable fabric for interconnecting the cores. With power now the first-order design constraint, early-stage estimation of NoC power has become crucially important. ORION [29] was amongst the first NoC power models released, and has since been fairly widely used for early-stage power estimation of NoCs. However, when validated against recent NoC prototypes -- the Intel 80-core Teraflops chip and the Intel Scalable Communications Core (SCC) chip -- we saw significant deviation that can lead to erroneous NoC design choices. This prompted our development of ORION 2.0, an extensive enhancement of the original ORION models, which includes completely new subcomponent power models and area models, as well as improved and updated technology models. Validation against the two Intel chips confirms a substantial improvement in accuracy over the original ORION. A case study with these power models plugged within the COSI-OCC NoC design space exploration tool [23] confirms the need for, and value of, accurate early-stage NoC power estimation. To ensure the longevity of ORION 2.0, we will be releasing it wrapped within a semi-automated flow that automatically updates its models as new technology files become available.
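The abstract does not reproduce the model equations, but early-stage NoC power estimators of this kind are typically built from per-subcomponent dynamic power terms of the form P = αCV²f. A minimal Python sketch under that assumption; the subcomponent breakdown and every numeric value below are hypothetical, not taken from ORION 2.0:

```python
# Minimal sketch of an early-stage dynamic power estimate, P = a * C * V^2 * f.
# ORION 2.0's actual models are far more detailed (technology files, leakage,
# area models); the subcomponent names and numbers here are hypothetical.

def dynamic_power(activity, cap_farads, vdd_volts, freq_hz):
    """Dynamic switching power of one router subcomponent, in watts."""
    return activity * cap_farads * vdd_volts ** 2 * freq_hz

# Assumed per-subcomponent effective switched capacitances (illustrative only).
router_caps = {"input_buffers": 2.0e-12, "crossbar": 1.5e-12, "arbiter": 0.3e-12}

# 30% switching activity, 1.0 V supply, 1 GHz clock.
total_w = sum(dynamic_power(0.3, c, 1.0, 1e9) for c in router_caps.values())
print(f"router dynamic power ~ {total_w * 1e3:.2f} mW")  # ~1.14 mW
```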

799 citations

Journal ArticleDOI
TL;DR: This work presents ORION 2.0, an enhanced NoC power and area simulator, which offers significant accuracy improvement relative to its predecessor, ORION 1.0.
Abstract: As industry moves towards multicore chips, networks-on-chip (NoCs) are emerging as the scalable fabric for interconnecting the cores. With power now the first-order design constraint, early-stage estimation of NoC power has become crucially important. In this work, we present ORION 2.0, an enhanced NoC power and area simulator, which offers significant accuracy improvement relative to its predecessor, ORION 1.0.

272 citations

Proceedings ArticleDOI
02 Jun 2018
TL;DR: This paper proposes a predictive early activation technique, dubbed SnaPEA, which offers up to 63% speedup and 49% energy reduction across the convolution layers with no loss in classification accuracy.
Abstract: Deep Convolutional Neural Networks (CNNs) perform billions of operations for classifying a single input. To reduce these computations, this paper offers a solution that leverages a combination of runtime information and the algorithmic structure of CNNs. Specifically, in numerous modern CNNs, the outputs of compute-heavy convolution operations are fed to activation units that output zero if their input is negative. By exploiting this unique algorithmic property, we propose a predictive early activation technique, dubbed SnaPEA. This technique cuts the computation of convolution operations short if it determines that the output will be negative. SnaPEA can operate in two distinct modes, exact and predictive. In the exact mode, with no loss in classification accuracy, SnaPEA statically re-orders the weights based on their signs and periodically performs a single-bit sign check on the partial sum. Once the partial sum drops below zero, the rest of the computations can simply be ignored, since the output value will be zero in any case. In the predictive mode, which trades classification accuracy for larger savings, SnaPEA speculatively cuts the computation short even earlier than the exact mode. To control the accuracy, we develop a multi-variable optimization algorithm that thresholds the degree of speculation. As such, the proposed algorithm exposes a knob to gracefully navigate the trade-offs between classification accuracy and computation reduction. Compared to a state-of-the-art CNN accelerator, SnaPEA in the exact mode yields, on average, 28% speedup and 16% energy reduction in various modern CNNs without affecting their classification accuracy. With 3% loss in classification accuracy, on average, 67.8% of the convolutional layers can operate in the predictive mode. The average speedup and energy saving of these layers are 2.02x and 1.89x, respectively. The benefits grow to a maximum of 3.59x speedup and 3.14x energy reduction. Compared to static pruning approaches, which are complementary to the dynamic approach of SnaPEA, our proposed technique offers up to 63% speedup and 49% energy reduction across the convolution layers with no loss in classification accuracy.
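To make the exact mode concrete, here is a minimal Python sketch of the sign-ordered accumulation and partial-sum sign check described above. It assumes non-negative activations (e.g., outputs of a preceding ReLU layer), which is what makes the early exit lossless; the function and variable names are illustrative, not from the paper or its accelerator implementation:

```python
import numpy as np

def snapea_exact_dot(weights, activations):
    """Sketch of SnaPEA's exact mode for one convolution output:
    accumulate positive-weight terms first, then negative-weight terms.
    With activations >= 0, the partial sum never goes negative during the
    positive phase; once it drops below zero, every remaining term is
    non-positive (negative weight * non-negative activation), so the final
    pre-activation value is guaranteed negative and ReLU yields zero."""
    order = np.argsort(weights < 0, kind="stable")  # positive weights first
    partial = 0.0
    for i in order:
        partial += weights[i] * activations[i]
        if partial < 0:           # single-bit sign check on the partial sum
            return 0.0            # early exit: ReLU output is zero anyway
    return max(partial, 0.0)      # fused ReLU

# Usage: the result is zero, and the last term is never computed.
w = np.array([0.5, -1.0, -2.0])
x = np.array([0.2, 0.8, 0.9])
print(snapea_exact_dot(w, x))     # 0.0
```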

162 citations

Proceedings ArticleDOI
28 Jun 2005
TL;DR: In this article, OPC is driven by a topography map of the layout generated by CMP simulation; the wafer topography variations result in local defocus, which is explicitly modeled in the OPC insertion and verification flows.
Abstract: Depth of focus is the major contributor to lithographic process margin. One of the major causes of focus variation is imperfect planarization of fabrication layers. Presently, OPC (Optical Proximity Correction) methods are oblivious to the predictable nature of focus variation arising from wafer topography. As a result, designers suffer from manufacturing yield loss, as well as loss of design quality through unnecessary guardbanding. In this work, we propose a novel flow and method to drive OPC with a topography map of the layout that is generated by CMP simulation. The wafer topography variations result in local defocus, which we explicitly model in our OPC insertion and verification flows. Our experimental validation uses 90nm foundry libraries and industry-strength OPC and scattering bar recipes. We find that the proposed topography-aware OPC can yield up to 90% reduction in edge placement errors with only a small increase in mask cost.

122 citations

Proceedings ArticleDOI
11 Aug 2014
TL;DR: This paper develops, for the first time, a complete RTL-to-GDSII design flow for gate-level M3D, and uses this flow along with a 28nm PDK to build layouts for the OpenSPARC T2 core.
Abstract: In a gate-level monolithic 3D IC (M3D), all the transistors in a single logic gate occupy the same tier, and gates in different tiers are connected using nano-scale monolithic inter-tier vias. This design style has the benefit of the superior power-performance quality offered by flat implementations (unlike block-level M3D), and zero total silicon area overhead compared to 2D (unlike transistor-level M3D). In this paper we develop, for the first time, a complete RTL-to-GDSII design flow for gate-level M3D. Our tool flow is based on commercial tools built for 2D ICs and enhanced with our 3D specific methodologies. We use this flow along with a 28nm PDK to build layouts for the OpenSPARC T2 core. Our simulations show that at the same performance, gate-level M3D offers 16% total power reduction with 0% area overhead compared to commercial quality 2D IC designs.

109 citations


Cited by
Journal ArticleDOI
TL;DR: The high level of collaboration on the gem5 project, combined with the previous success of the component parts and a liberal BSD-like license, make gem5 a valuable full-system simulation tool.
Abstract: The gem5 simulation infrastructure is the merger of the best aspects of the M5 [4] and GEMS [9] simulators. M5 provides a highly configurable simulation framework, multiple ISAs, and diverse CPU models. GEMS complements these features with a detailed and flexible memory system, including support for multiple cache coherence protocols and interconnect models. Currently, gem5 supports most commercial ISAs (ARM, ALPHA, MIPS, Power, SPARC, and x86), including booting Linux on three of them (ARM, ALPHA, and x86). The project is the result of the combined efforts of many academic and industrial institutions, including AMD, ARM, HP, MIPS, Princeton, MIT, and the Universities of Michigan, Texas, and Wisconsin. Over the past ten years, M5 and GEMS have been used in hundreds of publications and have been downloaded tens of thousands of times. The high level of collaboration on the gem5 project, combined with the previous success of the component parts and a liberal BSD-like license, make gem5 a valuable full-system simulation tool.

4,039 citations

Proceedings ArticleDOI
12 Dec 2009
TL;DR: Combining power, area, and timing results of McPAT with performance simulation of PARSEC benchmarks at the 22nm technology node, for both common in-order and out-of-order manycore designs, shows that when die cost is not taken into account, clustering 8 cores together gives the best energy-delay product, whereas when cost is taken into account, configuring clusters with 4 cores gives the best EDA2P and EDAP.
Abstract: This paper introduces McPAT, an integrated power, area, and timing modeling framework that supports comprehensive design space exploration for multicore and manycore processor configurations ranging from 90nm to 22nm and beyond. At the microarchitectural level, McPAT includes models for the fundamental components of a chip multiprocessor, including in-order and out-of-order processor cores, networks-on-chip, shared caches, integrated memory controllers, and multiple-domain clocking. At the circuit and technology levels, McPAT supports critical-path timing modeling, area modeling, and dynamic, short-circuit, and leakage power modeling for each of the device types forecast in the ITRS roadmap, including bulk CMOS, SOI, and double-gate transistors. McPAT has a flexible XML interface to facilitate its use with many performance simulators. Combined with a performance simulator, McPAT enables architects to consistently quantify the cost of new ideas and assess tradeoffs of different architectures using new metrics like the energy-delay-area-squared product (EDA2P) and the energy-delay-area product (EDAP). This paper explores the interconnect options of future manycore processors by varying the degree of clustering over generations of process technologies. Clustering will bring interesting tradeoffs between area and performance because the interconnects needed to group cores into clusters incur area overhead, but many applications can make good use of them due to synergies of cache sharing. Combining power, area, and timing results of McPAT with performance simulation of PARSEC benchmarks at the 22nm technology node for both common in-order and out-of-order manycore designs shows that when die cost is not taken into account, clustering 8 cores together gives the best energy-delay product, whereas when cost is taken into account, configuring clusters with 4 cores gives the best EDA2P and EDAP.
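For reference, the composite metrics named in the abstract expand as simple products. A small sketch with the definitions as stated above; the example numbers are purely hypothetical:

```python
def edp(energy_j, delay_s):
    """Energy-delay product."""
    return energy_j * delay_s

def edap(energy_j, delay_s, area_mm2):
    """Energy-delay-area product (EDAP)."""
    return energy_j * delay_s * area_mm2

def eda2p(energy_j, delay_s, area_mm2):
    """Energy-delay-area-squared product (EDA2P): squaring the area term
    weights die cost more heavily, penalizing designs that trade silicon
    for speed."""
    return energy_j * delay_s * area_mm2 ** 2

# Hypothetical design point: 1 J per task, 2 ns delay, 50 mm^2 die.
print(edap(1.0, 2.0e-9, 50.0))   # 1e-07
print(eda2p(1.0, 2.0e-9, 50.0))  # 5e-06
```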

2,487 citations

Proceedings ArticleDOI
13 Dec 2014
TL;DR: This article introduces a custom multi-chip machine-learning architecture, showing that, on a subset of the largest known neural network layers, it is possible to achieve a speedup of 450.65x over a GPU, and reduce the energy by 150.31x on average for a 64-chip system.
Abstract: Many companies are deploying services, either for consumers or industry, which are largely based on machine-learning algorithms for sophisticated processing of large amounts of data. The state-of-the-art and most popular such machine-learning algorithms are Convolutional and Deep Neural Networks (CNNs and DNNs), which are known to be both computationally and memory intensive. A number of neural network accelerators have been recently proposed which can offer a high computational capacity/area ratio, but which remain hampered by memory accesses. However, unlike the memory wall faced by processors on general-purpose workloads, the memory footprint of CNNs and DNNs, while large, is not beyond the capability of the on-chip storage of a multi-chip system. This property, combined with the CNN/DNN algorithmic characteristics, can lead to high internal bandwidth and low external communications, which can in turn enable high-degree parallelism at a reasonable area cost. In this article, we introduce a custom multi-chip machine-learning architecture along those lines. We show that, on a subset of the largest known neural network layers, it is possible to achieve a speedup of 450.65x over a GPU, and reduce the energy by 150.31x on average for a 64-chip system. We implement the node down to the place and route at 28nm, containing a combination of custom storage and computational units, with industry-grade interconnects.

1,486 citations