Author

Clark Roberts

Bio: Clark Roberts is an academic researcher from Intel. The author has contributed to research on the topics of CMOS and integrated circuits. The author has an h-index of 7 and has co-authored 9 publications receiving 858 citations.

Papers
Journal Article
TL;DR: This paper describes an integrated network-on-chip architecture containing 80 tiles arranged as an 8x10 2-D array of floating-point cores and packet-switched routers, both designed to operate at 4 GHz.
Abstract: This paper describes an integrated network-on-chip architecture containing 80 tiles arranged as an 8x10 2-D array of floating-point cores and packet-switched routers, both designed to operate at 4 GHz. Each tile has two pipelined single-precision floating-point multiply accumulators (FPMAC) which feature a single-cycle accumulation loop for high throughput. The on-chip 2-D mesh network provides a bisection bandwidth of 2 Terabits/s. The 15-FO4 design employs mesochronous clocking, fine-grained clock gating, dynamic sleep transistors, and body-bias techniques. In a 65-nm eight-metal CMOS process, the 275 mm² custom design contains 100 M transistors. The fully functional first silicon achieves over 1.0 TFLOPS of performance on a range of benchmarks while dissipating 97 W at 4.27 GHz and 1.07 V supply.
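
The figures in the abstract above can be cross-checked with a little arithmetic. The sketch below is only illustrative: it assumes that each FPMAC retires one multiply-accumulate per cycle and that one multiply-accumulate counts as 2 FLOPs, which is a common convention but is not stated in the abstract. The function and constant names are hypothetical.

```python
# Figures quoted in the abstract.
TILES = 80                # 8 x 10 array
FPMACS_PER_TILE = 2
CLOCK_HZ = 4.27e9         # measured operating point
POWER_W = 97.0
MEASURED_TFLOPS = 1.0     # "over 1.0 TFLOPS", so this is a lower bound

# Assumption (not in the abstract): 1 multiply-accumulate = 2 FLOPs per cycle.
FLOPS_PER_FPMAC_PER_CYCLE = 2


def peak_tflops(tiles, fpmacs_per_tile, flops_per_cycle, clock_hz):
    """Theoretical peak throughput in TFLOPS under the assumption above."""
    return tiles * fpmacs_per_tile * flops_per_cycle * clock_hz / 1e12


def gflops_per_watt(tflops, power_w):
    """Energy efficiency in GFLOPS/W."""
    return tflops * 1e3 / power_w


peak = peak_tflops(TILES, FPMACS_PER_TILE, FLOPS_PER_FPMAC_PER_CYCLE, CLOCK_HZ)
efficiency = gflops_per_watt(MEASURED_TFLOPS, POWER_W)

print(f"peak at 4.27 GHz: {peak:.3f} TFLOPS")
print(f"efficiency at >= 1.0 TFLOPS and 97 W: >= {efficiency:.1f} GFLOPS/W")
```

Under these assumptions the theoretical peak is about 1.37 TFLOPS, so the reported "over 1.0 TFLOPS" on benchmarks sits below the peak, as expected for sustained throughput.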

645 citations

Journal Article
18 Mar 2010
TL;DR: A 47 × 10 Gb/s chip-to-chip interface consuming 660 mW is demonstrated in 45 nm CMOS. The circuitry and interconnect are co-designed to minimize power and area for a wide parallel interface, and the interface also demonstrates fast power management for the I/O circuits.
Abstract: A 47 × 10 Gb/s chip-to-chip interface consuming 660 mW is demonstrated in 45 nm CMOS. The circuitry and interconnect are co-designed to minimize power and area for a wide parallel interface. Power is reduced by amortizing clocking, minimizing the span of clock signals and pairing a low-swing transmitter driver with a sensitive receiver sampler. The active silicon area is compressed by 64% relative to the C4 bumps using on-chip transmission line routing. A dense, top-side package connector and bridge enable both high off-chip interconnect density and low overall power by reducing equalization and deskew requirements. The interface also demonstrates fast power management for the I/O circuits. The receiver power can be reduced by 93% during standby and an integrated wake-up timer indicates that all lanes return reliably to active mode in <5 ns. The interface operates at 470 Gb/s with an aggregate bit error ratio better than 2 × 10⁻¹⁸ while consuming 1.4 mW/Gb/s and occupies 3.2 mm² active silicon area.
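
The headline numbers in this abstract are mutually consistent, which the following sketch checks. It uses only the quoted figures; the helper names are hypothetical, and the time-between-errors estimate treats the quoted bit error ratio as an upper bound on the error probability per bit, which is an interpretive assumption.

```python
# Figures quoted in the abstract.
LANES = 47
LANE_RATE_GBPS = 10.0
TOTAL_POWER_MW = 660.0
BER_BOUND = 2e-18         # "better than" this value, so an upper bound


def aggregate_gbps(lanes, lane_rate_gbps):
    """Aggregate interface bandwidth in Gb/s."""
    return lanes * lane_rate_gbps


def energy_pj_per_bit(power_mw, rate_gbps):
    """Energy per bit in pJ.

    1 mW / (1 Gb/s) = 1e-3 J/s / 1e9 bit/s = 1e-12 J/bit, so mW per Gb/s
    and pJ/bit are numerically identical.
    """
    return power_mw / rate_gbps


def min_seconds_between_errors(rate_gbps, ber_bound):
    """Lower bound on the mean time between bit errors, in seconds."""
    return 1.0 / (rate_gbps * 1e9 * ber_bound)


rate = aggregate_gbps(LANES, LANE_RATE_GBPS)
energy = energy_pj_per_bit(TOTAL_POWER_MW, rate)
days = min_seconds_between_errors(rate, BER_BOUND) / 86400.0

print(f"aggregate rate: {rate:.0f} Gb/s")
print(f"energy: {energy:.2f} pJ/bit (same value in mW/Gb/s)")
print(f"mean time between errors: more than {days:.1f} days")
```

The computed 1.40 pJ/bit matches the quoted 1.4 mW/Gb/s, and a bit error ratio below 2 × 10⁻¹⁸ at 470 Gb/s corresponds to fewer than one error in roughly 12 days on average.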

63 citations

Journal Article
TL;DR: This paper details the design of an 8-lane bidirectional link for both within-the-box and external communications in 22 nm CMOS technology. A low-profile connector with a high-density cable assembly supports up to 32 Gb/s per lane, and collaborative timing recovery enables lane characterization without degrading jitter performance.
Abstract: This paper details the design of an 8-lane bidirectional link for both within-the-box and external communications in 22 nm CMOS technology. A low profile connector with a high density cable assembly ensures a data rate of up to 32 Gb/s per lane while maintaining channel loss below 25 dB. Channel equalization is performed by a combination of a 3-tap feed-forward equalizer (FFE), single-stage continuous-time linear equalizer (CTLE) and a 6-tap decision-feedback equalizer (DFE). Collaborative timing recovery is used to enable lane characterization without degrading jitter performance. Phase error decimation, with a conditional phase detection scheme, is used to reduce the DFE complexity by 50%. Power consumption over a wide range of data rates from 4 to 32 Gb/s is reduced by using regulated CMOS clocking with lane bundling, low swing transmitter with a source-series terminated (SST) driver and a highly reconfigurable receiver with an active inductor CTLE. At a lane data rate of 32 Gb/s, over a 0.5 m cable with 16 dB of loss, a transceiver lane consumes 205 mW from a 1.07 V supply. The power scales down to 26 mW from a 0.72 V supply at 8 Gb/s, when transmitting over a channel with 8 dB loss. The active silicon area per lane is 0.079 mm².

50 citations

Journal Article
TL;DR: System-level optimization of duty-cycle and quadrature error correctors across the clock hierarchy provides optimized clock phase placement and thus enhances link performance and power, and a lane failover mechanism provides design robustness to mitigate channel or circuit defects.
Abstract: A scalable 64-lane chip-to-chip I/O, with per-lane data rate of 2-16 Gb/s is demonstrated in 32-nm low-power CMOS technology. At maximum aggregate bandwidth of 1.024 Tb/s across 50-cm channel length, the link consumes 2.7 W from a 1.08-V supply, corresponding to 2.6 pJ/bit. As bandwidth demand decreases, scaling the per-lane data rate to 4 Gb/s and power supply to 0.65 V provides 1/4 of the maximum bandwidth while consuming 0.2 W. Across a 1-m channel, the link operates at a maximum per-lane data rate of 16 Gb/s; thus, providing up to 1.024 Tb/s of aggregate bandwidth with 3.2 pJ/bit power efficiency from a 1.15-V supply. A length-matched dense interconnect topology allows clocking to be shared across multiple lanes to reduce area and power. Reconfigurable current/voltage mode transmitter driver and CMOS clocking enable a highly scalable power-efficient link. Optional low-dropout regulators provide >22-dB supply noise rejection at the package resonance frequency of 200 MHz. System-level optimization of duty-cycle and quadrature error correctors across the clock hierarchy provides optimized clock phase placement and, thus, enhances link performance and power. A lane failover mechanism provides design robustness to mitigate channel or circuit defects. The active circuitry occupies 1.3 mm².
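
The bandwidth and efficiency scaling described in this abstract follows from lane count, per-lane rate, and total power. The sketch below reproduces those relationships from the quoted operating points. The function names are hypothetical, and the total power for the 1-m channel is derived here from the quoted 3.2 pJ/bit; it is not a figure given in the abstract.

```python
LANES = 64  # from the abstract


def aggregate_tbps(lanes, lane_rate_gbps):
    """Aggregate bandwidth in Tb/s for a given per-lane rate in Gb/s."""
    return lanes * lane_rate_gbps / 1000.0


def pj_per_bit(power_w, bandwidth_tbps):
    """Energy per bit in pJ: W / (Tb/s) = J/s / (1e12 bit/s) = pJ/bit."""
    return power_w / bandwidth_tbps


# Operating points quoted in the abstract (50-cm channel).
full_bw = aggregate_tbps(LANES, 16)      # 16 Gb/s per lane, 1.08 V, 2.7 W
scaled_bw = aggregate_tbps(LANES, 4)     # 4 Gb/s per lane, 0.65 V, 0.2 W

full_eff = pj_per_bit(2.7, full_bw)
scaled_eff = pj_per_bit(0.2, scaled_bw)

# Derived, not quoted: total power implied by 3.2 pJ/bit over the 1-m channel.
implied_power_1m_w = 3.2 * full_bw

print(f"full rate:   {full_bw:.3f} Tb/s at {full_eff:.2f} pJ/bit")
print(f"scaled rate: {scaled_bw:.3f} Tb/s at {scaled_eff:.2f} pJ/bit")
print(f"implied power over 1 m: {implied_power_1m_w:.2f} W")
```

The full-rate point works out to about 2.64 pJ/bit, in line with the quoted 2.6 pJ/bit, and the scaled point delivers one quarter of the bandwidth at roughly 0.78 pJ/bit.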

45 citations

Proceedings Article
28 Mar 2013
TL;DR: This work developed a low-power dense 64-lane I/O system with per-port aggregate bandwidth up to 1Tb/s and 2.6pJ/bit power efficiency, and a high-density connector and cable attached to the top side of the package that enables this high interconnect density.
Abstract: High-performance computing (HPC) systems demand aggressive scaling of memory and I/O to achieve multiple terabits/sec of bandwidth. Minimizing I/O cost, area and power are crucial to achieving a practically realizable system with such large bandwidth. To meet these needs, we developed a low-power dense 64-lane I/O system with per-port aggregate bandwidth up to 1Tb/s and 2.6pJ/bit power efficiency. We developed a high-density connector and cable, attached to the top side of the package that enables this high interconnect density. A lane-failover mechanism provides design robustness for fault-tolerance. To further optimize power efficiency, the lane data rate scales from 2 to 16Gb/s with non-linear power efficiency of 0.8 to 2.6pJ/bit, providing scalable aggregate bandwidth of 0.128 to 1Tb/s. Highly power scalable circuits such as CMOS clocking and reconfigurable current-mode (CM) or voltage-mode (VM) TX driver enable the 8× bandwidth and 3× power efficiency scalability with aggressive supply voltage scaling (0.6 to 1.08V).

39 citations


Cited by
Journal Article
10 Jun 2009
TL;DR: The current performance and future demands of interconnects to and on silicon chips are examined, and the requirements for optoelectronic and optical devices are projected if optics is to solve the major problems of interconnects for future high-performance silicon chips.
Abstract: We examine the current performance and future demands of interconnects to and on silicon chips. We compare electrical and optical interconnects and project the requirements for optoelectronic and optical devices if optics is to solve the major problems of interconnects for future high-performance silicon chips. Optics has potential benefits in interconnect density, energy, and timing. The necessity of low interconnect energy imposes low limits especially on the energy of the optical output devices, with a ~ 10 fJ/bit device energy target emerging. Some optical modulators and radical laser approaches may meet this requirement. Low (e.g., a few femtofarads or less) photodetector capacitance is important. Very compact wavelength splitters are essential for connecting the information to fibers. Dense waveguides are necessary on-chip or on boards for guided wave optical approaches, especially if very high clock rates or dense wavelength-division multiplexing (WDM) is to be avoided. Free-space optics potentially can handle the necessary bandwidths even without fast clocks or WDM. With such technology, however, optics may enable the continued scaling of interconnect capacity required by future chips.

1,959 citations

Book
18 Apr 2008
TL;DR: This survey reviews the historical development of programmable logic devices, the fundamental programming technologies that the programmability is built on, and then describes the basic understandings gleaned from research on architectures.
Abstract: Field-Programmable Gate Arrays (FPGAs) have become one of the key digital circuit implementation media over the last decade. A crucial part of their creation lies in their architecture, which governs the nature of their programmable logic functionality and their programmable interconnect. FPGA architecture has a dramatic effect on the quality of the final device's speed performance, area efficiency, and power consumption. This survey reviews the historical development of programmable logic devices, the fundamental programming technologies that the programmability is built on, and then describes the basic understandings gleaned from research on architectures. We include a survey of the key elements of modern commercial FPGA architecture, and look toward future trends in the field.

491 citations

Proceedings Article
Yan Pan, Prabhat Kumar, John Kim, Gokhan Memik, Yu Zhang, Alok Choudhary
20 Jun 2009
TL;DR: Firefly is a hybrid, hierarchical network architecture that consists of clusters of nodes that are connected using conventional, electrical signaling while the inter-cluster communication is done using nanophotonics - exploiting the benefits of electrical signaling for short, local communication while nanophotonics is used only for global communication to realize an efficient on-chip network.
Abstract: Future many-core processors will require high-performance yet energy-efficient on-chip networks to provide a communication substrate for the increasing number of cores. Recent advances in silicon nanophotonics create new opportunities for on-chip networks. To efficiently exploit the benefits of nanophotonics, we propose Firefly - a hybrid, hierarchical network architecture. Firefly consists of clusters of nodes that are connected using conventional, electrical signaling while the inter-cluster communication is done using nanophotonics - exploiting the benefits of electrical signaling for short, local communication while nanophotonics is used only for global communication to realize an efficient on-chip network. Crossbar architecture is used for inter-cluster communication. However, to avoid global arbitration, the crossbar is partitioned into multiple, logical crossbars and their arbitration is localized. Our evaluations show that Firefly improves the performance by up to 57% compared to an all-electrical concentrated mesh (CMESH) topology on adversarial traffic patterns and up to 54% compared to an all-optical crossbar (OP XBAR) on traffic patterns with locality. If the energy-delay-product is compared, Firefly improves the efficiency of the on-chip network by up to 51% and 38% compared to CMESH and OP XBAR, respectively.

411 citations

Journal Article
16 Jun 2009
TL;DR: This paper emphasizes the recently proposed 5 × 5 matrix switch comprising two-dimensionally cascaded microring resonator-based electrooptic switches coupled to a waveguide cross-grid on a silicon chip, and studies the feasibility of large-scale integration of the matrix switch.
Abstract: This paper reviews developments in cascaded microresonator-based matrix switches for silicon photonic interconnection networks in many-core computing applications. Specifically, we emphasize our recently proposed 5 × 5 matrix switch comprising two-dimensionally cascaded microring resonator-based electrooptic switches coupled to a waveguide cross-grid on a silicon chip. The cross-grid adopts low-loss low-crosstalk multimode-interference-based waveguide crossings. Such a microresonator-based matrix switch offers nonblocking interconnections among multiple inputs and multiple outputs, with the key merits of i) a tens to hundreds of micrometers-scale footprint, ii) gigabit/second-scale data transmission, iii) nanosecond-speed circuit-switching, iv) 100-µW-scale DC power consumption per link, and v) large-scale integration for networks-on-chips applications. We analyze in detail the microring resonator-based cross-grid switch design for high-data-rate signal transmission in the context of our proposed 5 × 5 matrix switch. We also study the feasibility of large-scale integration of the matrix switch. We report proof-of-concept experiments of a single cross-grid switch element and a 2 × 2 matrix switch, propose design guidelines, and discuss future engineering challenges.

272 citations

Proceedings Article
13 Nov 2010
TL;DR: This paper describes the programmer's view of the 48-core SCC processor, an intermediate case sharing traits of message passing and shared memory architectures, and RCCE, the native message passing model created for it.
Abstract: The number of cores integrated onto a single die is expected to climb steadily in the foreseeable future. This move to many-core chips is driven by a need to optimize performance per watt. How best to connect these cores and how to program the resulting many-core processor, however, is an open research question. Designs vary from GPUs to cache-coherent shared memory multiprocessors to pure distributed memory chips. The 48-core SCC processor reported in this paper is an intermediate case, sharing traits of message passing and shared memory architectures. The hardware has been described elsewhere. In this paper, we describe the programmer's view of this chip. In particular we describe RCCE: the native message passing model created for the SCC processor.

267 citations