scispace - formally typeset
Search or ask a question
Author

Gaurav Mittal

Bio: Gaurav Mittal is an academic researcher from IBM. The author has contributed to research in topics: Circuit design & POWER6. The author has an hindex of 3, co-authored 6 publications receiving 222 citations.

Papers
More filters
Proceedings ArticleDOI
18 Jun 2007
TL;DR: The POWER6trade microprocessor combines ultra-high frequency operation, aggressive power reduction, a highly scalable memory subsystem, and mainframe-like reliability, availability, and serviceability.
Abstract: The POWER6trade microprocessor combines ultra-high frequency operation, aggressive power reduction, a highly scalable memory subsystem, and mainframe-like reliability, availability, and serviceability. The 341mm2 700M transistor dual-core microprocessor is fabricated in a 65nm SOI process with 10 levels of low-k copper interconnect. It operates at clock frequencies over 5GHz in high-performance applications, and consumes under 100W in power-sensitive applications.

120 citations

Journal ArticleDOI
TL;DR: The organization of the design and the features of the processor core are described, before moving on to discuss the circuits used for analog elements, clock generation and distribution, and I/O designs, including special features for test, debug, and chip frequency tuning.
Abstract: This paper gives an overview of the latest member of the POWER™ processor family, POWER7™. Eight quad-threaded cores, operating at frequencies up to 4.14 GHz, are integrated together with two memory controllers and high speed system links on a 567 mm die, employing 1.2B transistors in a 45 nm CMOS SOI technology with 11 layers of low-k copper wiring. The technology features deep trench capacitors which are used to build a 32 MB embedded DRAM L3 based on a 0.067 m DRAM cell. The functionally equivalent chip transistor count would have been over 2.7B if the L3 had been implemented with a conventional 6 transistor SRAM cell. (A detailed paper about the eDRAM implementation will be given in a separate paper of this Journal). Deep trench capacitors are also used to reduce on-chip voltage island supply noise. This paper describes the organization of the design and the features of the processor core, before moving on to discuss the circuits used for analog elements, clock generation and distribution, and I/O designs. The final section describes the details of the clocked storage elements, including special features for test, debug, and chip frequency tuning.

53 citations

Journal ArticleDOI
TL;DR: Some of the circuit methodology and implementation innovations used in the development of POWER6, with particular emphasis on custom, synthesized, register file and SRAM design, as well as the electrical characterizations performed in the lab are described.
Abstract: The IBM POWER6 processor is a dual-core, 341 mm2, 790 million transistor chip fabricated using IBM's 65 nm partially-depleted SOI process. Capable of running at frequencies up to 5 GHz in high performance applications, it can also operate under 100 W for power-sensitive applications. Traditional power-intensive and deep-pipelining techniques used in high frequency design were abandoned in favor of more power efficient circuit design methodologies. The complexity and size of POWER6, together with its high operating frequency, presented a number of significant challenges for its multi-site team to complete the design on an aggressive schedule. This paper describes some of the circuit methodology and implementation innovations used in the development of POWER6, with particular emphasis on custom, synthesized, register file and SRAM design, as well as the electrical characterizations performed in the lab.

44 citations

Patent
Sanjay Dubey1, Gaurav Mittal1
19 Apr 2005
TL;DR: In this paper, a system and method that uses a text-based script file to capture a circuit design and allow a circuit designer to manipulate the script file is presented, where the circuit designer can add, delete, or move components using various tags and commands that are stored in the script files.
Abstract: A system and method that uses a text-based script file to capture a circuit design and allows a circuit designer to manipulate the script file. The circuit designer can add, delete, or move components using various tags and commands that are stored in the script file. When the design is complete, or ready to be tested, the script file is processed creating a layout representation file that is readable by a graphics-based circuit design tool.

3 citations

Patent
05 Jan 2017
TL;DR: In this article, a clock power component for each of a plurality of clock gating domains is identified, and a switching characteristic for each switching characteristic is combined into a domain combination list.
Abstract: Generating a contributor-based power abstract for a device, including: identifying a clock power component for each of a plurality of clock gating domains, identifying a switching characteristic for each of the clock gating domains, combining the switching characteristics for all of the clock gating domains into a domain combination list, performing a per-case simulation based at least on the domain combination list, calculating an effective capacitance for each of the clock gating domains based at least on the per-case simulation, and generating a power abstract for each of the clock gating domains based at least on the effective capacitance.

2 citations


Cited by
More filters
01 Jan 2010
TL;DR: The TILE64TM processor as mentioned in this paper is a multicore SoC targeting the high-performance demands of a wide range of embedded applications across networking and digital multimedia applications, with 64 tile processors arranged in an 8x8 array.
Abstract: The TILE64TM processor is a multicore SoC targeting the high-performance demands of a wide range of embedded applications across networking and digital multimedia applications. A figure shows a block diagram with 64 tile processors arranged in an 8x8 array. These tiles connect through a scalable 2D mesh network with high-speed I/Os on the periphery. Each general-purpose processor is identical and capable of running SMP Linux.

634 citations

Proceedings ArticleDOI
01 Feb 2008
TL;DR: The TILE64TM processor is a multicore SoC targeting the high-performance demands of a wide range of embedded applications across networking and digital multimedia applications.
Abstract: The TILE64TM processor is a multicore SoC targeting the high-performance demands of a wide range of embedded applications across networking and digital multimedia applications. A figure shows a block diagram with 64 tile processors arranged in an 8x8 array. These tiles connect through a scalable 2D mesh network with high-speed I/Os on the periphery. Each general-purpose processor is identical and capable of running SMP Linux.

587 citations

Proceedings ArticleDOI
01 Dec 2007
TL;DR: A range of cache refresh and placement schemes that are sensitive to retention time are proposed, and it is shown that most of the retention time variations can be masked by the microarchitecture when using these schemes.
Abstract: Process variations will greatly impact the stability, leakage power consumption, and performance of future microprocessors. These variations are especially detrimental to 6T SRAM (6-transistor static memory) structures and will become critical with continued technology scaling. In this paper, we propose new on-chip memory architectures based on novel 3T1D DRAM (3-transistor, 1-diode dynamic memory) cells. We provide a detailed comparison between 6T and 3T1D designs in the context of a L1 data cache. The effects of physical device variation on a 3T1D cache can be lumped into variation of data retention times. This paper proposes a range of cache refresh and placement schemes that are sensitive to reten- tion time, and we show that most of the retention time variations can be masked by the microarchitecture when using these schemes. We have performed detailed circuit and architectural simulations assuming different degrees of variability in advanced technology nodes, and we show that the resulting memory architecture can tol- erate large process variations with little or even no impact on per- formance when compared to ideal 6T SRAM designs. Furthermore, these designs are robust to memory cell stability issues and can achieve large power savings. These advantages make the new mem- ory architectures a promising choice for on-chip variation-tolerant cache structures required for next generation microprocessors.

359 citations

Journal ArticleDOI
09 Jun 2012
TL;DR: This work introduces a methodology for designing scalable and efficient scale-out server processors based on a metric of performance-density, and facilitates the design of optimal multi-core configurations, called pods.
Abstract: Scale-out datacenters mandate high per-server throughput to get the maximum benefit from the large TCO investment Emerging applications (eg, data serving and web search) that run in these datacenters operate on vast datasets that are not accommodated by on-die caches of existing server chips Large caches reduce the die area available for cores and lower performance through long access latency when instructions are fetched Performance on scale-out workloads is maximized through a modestly-sized last-level cache that captures the instruction footprint at the lowest possible access latency In this work, we introduce a methodology for designing scalable and efficient scale-out server processors Based on a metric of performance-density, we facilitate the design of optimal multi-core configurations, called pods Each pod is a complete server that tightly couples a number of cores to a small last-level cache using a fast interconnect Replicating the pod to fill the die area yields processors which have optimal performance density, leading to maximum per-chip throughput Moreover, as each pod is a stand-alone server, scale-out processors avoid the expense of global (ie, inter-pod) interconnect and coherence These features synergistically maximize throughput, lower design complexity, and improve technology scalability In 20nm technology, scale-out chips improve throughput by 5x-65x over conventional and by 16x-19x over emerging tiled organizations

185 citations

Journal ArticleDOI
TL;DR: The processor core and caches of the POWER7 processor chip are significantly enhanced to boost the performance of both single-threaded response-time-oriented, as well as multithreaded, throughput-oriented applications.
Abstract: The IBM POWER® processor is the dominant reduced instruction set computing microprocessor in the world today, with a rich history of implementation and innovation over the last 20 years. In this paper, we describe the key features of the POWER7® processor chip. On the chip is an eight-core processor, with each core capable of four-way simultaneous multithreaded operation. Fabricated in IBM's 45-nm silicon-on-insulator (SOI) technology with 11 levels of metal, the chip contains more than one billion transistors. The processor core and caches are significantly enhanced to boost the performance of both single-threaded response-time-oriented, as well as multithreaded, throughput-oriented applications. The memory subsystem contains three levels of on-chip cache, with SOI embedded dynamic random access memory (DRAM) devices used as the last level of cache. A new memory interface using buffered double-data-rate-three DRAM and improvements in reliability, availability, and serviceability are discussed

167 citations