scispace - formally typeset
Search or ask a question
Author

Greg Ruhl

Bio: Greg Ruhl is an academic researcher from Intel. The author has contributed to research in topics: CMOS & Floating-point unit. The author has an hindex of 4, co-authored 6 publications receiving 1110 citations.

Papers
More filters
Journal ArticleDOI
TL;DR: In this paper, an integrated network-on-chip architecture containing 80 tiles arranged as an 8x10 2D array of floating-point cores and packet-switched routers, both designed to operate at 4 GHz.
Abstract: This paper describes an integrated network-on-chip architecture containing 80 tiles arranged as an 8x10 2-D array of floating-point cores and packet-switched routers, both designed to operate at 4 GHz. Each tile has two pipelined single-precision floating-point multiply accumulators (FPMAC) which feature a single-cycle accumulation loop for high throughput. The on-chip 2-D mesh network provides a bisection bandwidth of 2 Terabits/s. The 15-FO4 design employs mesochronous clocking, fine-grained clock gating, dynamic sleep transistors, and body-bias techniques. In a 65-nm eight-metal CMOS process, the 275 mm2 custom design contains 100 M transistors. The fully functional first silicon achieves over 1.0 TFLOPS of performance on a range of benchmarks while dissipating 97 W at 4.27 GHz and 1.07 V supply.

645 citations

Proceedings ArticleDOI
13 Nov 2010
TL;DR: The programmer's view of this chip is described and RCCE is described: the native message passing model created for the SCC processor, an intermediate case, sharing traits of message passing and shared memory architectures.
Abstract: The number of cores integrated onto a single die is expected to climb steadily in the foreseeable future. This move to many-core chips is driven by a need to optimize performance per watt. How best to connect these cores and how to program the resulting many-core processor, however, is an open research question. Designs vary from GPUs to cache-coherent shared memory multiprocessors to pure distributed memory chips. The 48-core SCC processor reported in this paper is an intermediate case, sharing traits of message passing and shared memory architectures. The hardware has been described elsewhere. In this paper, we describe the programmer's view of this chip. In particular we describe RCCE: the native message passing model created for the SCC processor.

267 citations

Proceedings ArticleDOI
03 Apr 2012
TL;DR: An IA-32 processor fabricated in 32nm CMOS technology is described, demonstrating a reliable ultra-low voltage operation and energy efficient performance across the wide voltage range from 280mV to 1.2V.
Abstract: Near-threshold computing brings the promise of an order of magnitude improvement in energy efficiency over the current generation of microprocessors [1]. However, frequency degradation due to aggressive voltage scaling may not be acceptable across all single-threaded or performance-constrained applications. Enabling the processor to operate over a wide voltage range helps to achieve best possible energy efficiency while satisfying varying performance demands of the applications. This paper describes an IA-32 processor fabricated in 32nm CMOS technology [2], demonstrating a reliable ultra-low voltage operation and energy efficient performance across the wide voltage range from 280mV to 1.2V.

216 citations

Proceedings Article
01 Jan 2009
TL;DR: In this paper, the authors present a 2 Mb 2T PMOS gain cell macro on 65 nm logic process that has high bandwidth of 128 GBytes/sec, fast cycle time of 2 ns and 6-clock access time at 2 GHz.
Abstract: -We present 2 Mb 2T PMOS gain cell macro on 65 nm logic process that has high bandwidth of 128 GBytes/sec, fast cycle time of 2 ns and 6-clock cycles access time at 2 GHz. Macro features a full-rate pipelined architecture, ground precharge bitline, nondestructive read-out, partial write support and 128-row refresh to tolerate short refresh time. Cell is 2X denser than SRAM and is voltage compatible with logic.

4 citations


Cited by
More filters
Journal ArticleDOI
10 Jun 2009
TL;DR: The current performance and future demands of interconnects to and on silicon chips are examined and the requirements for optoelectronic and optical devices are project if optics is to solve the major problems of interConnects for future high-performance silicon chips.
Abstract: We examine the current performance and future demands of interconnects to and on silicon chips. We compare electrical and optical interconnects and project the requirements for optoelectronic and optical devices if optics is to solve the major problems of interconnects for future high-performance silicon chips. Optics has potential benefits in interconnect density, energy, and timing. The necessity of low interconnect energy imposes low limits especially on the energy of the optical output devices, with a ~ 10 fJ/bit device energy target emerging. Some optical modulators and radical laser approaches may meet this requirement. Low (e.g., a few femtofarads or less) photodetector capacitance is important. Very compact wavelength splitters are essential for connecting the information to fibers. Dense waveguides are necessary on-chip or on boards for guided wave optical approaches, especially if very high clock rates or dense wavelength-division multiplexing (WDM) is to be avoided. Free-space optics potentially can handle the necessary bandwidths even without fast clocks or WDM. With such technology, however, optics may enable the continued scaling of interconnect capacity required by future chips.

1,959 citations

Book
18 Apr 2008
TL;DR: This survey reviews the historical development of programmable logic devices, the fundamental programming technologies that the programmability is built on, and then describes the basic understandings gleaned from research on architectures.
Abstract: Field-Programmable Gate Arrays (FPGAs) have become one of the key digital circuit implementation media over the last decade. A crucial part of their creation lies in their architecture, which governs the nature of their programmable logic functionality and their programmable interconnect. FPGA architecture has a dramatic effect on the quality of the final device's speed performance, area efficiency, and power consumption. This survey reviews the historical development of programmable logic devices, the fundamental programming technologies that the programmability is built on, and then describes the basic understandings gleaned from research on architectures. We include a survey of the key elements of modern commercial FPGA architecture, and look toward future trends in the field.

491 citations

Proceedings ArticleDOI
Yan Pan1, Prabhat Kumar1, John Kim2, Gokhan Memik1, Yu Zhang1, Alok Choudhary1 
20 Jun 2009
TL;DR: Firefly is a hybrid, hierarchical network architecture that consists of clusters of nodes that are connected using conventional, electrical signaling while the inter-cluster communication is done using nanophotonics - exploiting the benefits of electrical signaling for short, local communication while nanophotinics is used only for global communication to realize an efficient on-chip network.
Abstract: Future many-core processors will require high-performance yet energy-efficient on-chip networks to provide a communication substrate for the increasing number of cores. Recent advances in silicon nanophotonics create new opportunities for on-chip networks. To efficiently exploit the benefits of nanophotonics, we propose Firefly - a hybrid, hierarchical network architecture. Firefly consists of clusters of nodes that are connected using conventional, electrical signaling while the inter-cluster communication is done using nanophotonics - exploiting the benefits of electrical signaling for short, local communication while nanophotonics is used only for global communication to realize an efficient on-chip network. Crossbar architecture is used for inter-cluster communication. However, to avoid global arbitration, the crossbar is partitioned into multiple, logical crossbars and their arbitration is localized. Our evaluations show that Firefly improves the performance by up to 57% compared to an all-electrical concentrated mesh (CMESH) topology on adversarial traffic patterns and up to 54% compared to an all-optical crossbar (OP XBAR) on traffic patterns with locality. If the energy-delay-product is compared, Firefly improves the efficiency of the on-chip network by up to 51% and 38% compared to CMESH and OP XBAR, respectively.

411 citations

Proceedings ArticleDOI
03 Jun 2012
TL;DR: Four key approaches are discussed - the four horsemen - that have emerged as top contenders for thriving in the dark silicon age and each class carries with its virtues deep-seated restrictions that requires a careful understanding of the underlying tradeoffs and benefits.
Abstract: Due to the breakdown of Dennardian scaling, the percentage of a silicon chip that can switch at full frequency is dropping exponentially with each process generation. This utilization wall forces designers to ensure that, at any point in time, large fractions of their chips are effectively dark or dim silicon, i.e., either idle or significantly underclocked. As exponentially larger fractions of a chip's transistors become dark, silicon area becomes an exponentially cheaper resource relative to power and energy consumption. This shift is driving a new class of architectural techniques that “spend” area to “buy” energy efficiency. All of these techniques seek to introduce new forms of heterogeneity into the computational stack. We envision that ultimately we will see widespread use of specialized architectures that leverage these techniques in order to attain orders-of-magnitude improvements in energy efficiency. However, many of these approaches also suffer from massive increases in complexity. As a result, we will need to look towards developing pervasively specialized architectures that insulate the hardware designer and the programmer from the underlying complexity of such systems. In this paper, I discuss four key approaches — the four horsemen — that have emerged as top contenders for thriving in the dark silicon age. Each class carries with its virtues deep-seated restrictions that requires a careful understanding of the underlying tradeoffs and benefits.

334 citations

Journal ArticleDOI
16 Jun 2009
TL;DR: This paper emphasizes the recently proposed 5 times 5 matrix switch comprising two-dimensionally cascaded microring resonator-based electrooptic switches coupled to a waveguide cross-grid on a silicon chip, and studies the feasibility of large-scale integration of the matrix switch.
Abstract: This paper reviews developments in cascaded microresonator-based matrix switches for silicon photonic interconnection networks in many-core computing applications. Specifically, we emphasize our recently proposed 5 times 5 matrix switch comprising two-dimensionally cascaded microring resonator-based electrooptic switches coupled to a waveguide cross-grid on a silicon chip. The cross-grid adopts low-loss low-crosstalk multimode-interference-based waveguide crossings. Such a microresonator-based matrix switch offers nonblocking interconnections among multiple inputs and multiple outputs, with the key merits of i) a tens to hundreds of micrometers-scale footprint, ii) gigabit/second-scale data transmission, iii) nanosecond-speed circuit-switching, iv) 100-muW-scale DC power consumption per link, and v) large-scale integration for networks-on-chips applications. We analyze in detail the microring resonator-based cross-grid switch design for high-data-rate signal transmission in the context of our proposed 5 times 5 matrix switch. We also study the feasibility of large-scale integration of the matrix switch. We report proof-of-concept experiments of a single cross-grid switch element and a 2 times 2 matrix switch, propose design guidelines, and discuss future engineering challenges.

272 citations