scispace - formally typeset
Search or ask a question
Author

Paul W. Coteus

Other affiliations: GlobalFoundries
Bio: Paul W. Coteus is an academic researcher from IBM. The author has contributed to research in topics: Interposer & Land grid array. The author has an hindex of 43, co-authored 233 publications receiving 8236 citations. Previous affiliations of Paul W. Coteus include GlobalFoundries.


Papers
More filters
Proceedings ArticleDOI
N. R. Adiga1, Gheorghe Almasi1, George Almási1, Y. Aridor1, Rajkishore Barik1, D. Beece1, Ralph Bellofatto1, Gyan Bhanot1, R. Bickford1, Matthias A. Blumrich1, A. A. Bright1, Jose R. Brunheroto1, Calin Cascaval2, José G. Castaños1, Waiman Chan1, Luis Ceze1, Paul W. Coteus1, Siddhartha Chatterjee1, Dong Chen1, G. Chiu1, Thomas Mario Cipolla1, Paul G. Crumley1, K.M. Desai1, A. Deutsch1, T. Domany1, M. B. Dombrowa1, Wilm E. Donath1, Maria Eleftheriou1, C. Christopher Erway1, J. Esch1, Blake G. Fitch1, J. Gagliano1, Alan Gara1, Rahul Garg1, Robert S. Germain1, Mark E. Giampapa1, B. Gopalsamy1, John A. Gunnels1, Manish Gupta1, Fred G. Gustavson1, Shawn A. Hall1, R. A. Haring1, D. Heidel1, P. Heidelberger1, L.M. Herger1, Dirk Hoenicke1, Rory D. Jackson1, T. Jamal-Eddine1, Gerard V. Kopcsay1, Elie Krevat1, Manish P. Kurhekar1, A.P. Lanzetta1, Derek Lieber1, L.K. Liu1, M. Lu1, M. Mendell1, A. Misra1, Yosef Moatti1, L. Mok1, José E. Moreira1, Ben J. Nathanson1, M. Newton1, Martin Ohmacht1, Adam J. Oliner1, Vinayaka Pandit1, R.B. Pudota1, Rick A. Rand1, R. Regan1, B. Rubin1, Albert E. Ruehli1, Silvius Rus1, Ramendra K. Sahoo1, A. Sanomiya1, Eugen Schenfeld1, M. Sharma1, E. Shmueli1, Suryabhan Singh1, Peilin Song1, Vijayalakshmi Srinivasan1, Burkhard Steinmacher-Burow1, Karin Strauss1, C. Surovic1, Richard A. Swetz1, Todd E. Takken1, R.B. Tremaine1, M. Tsao1, A. R. Umamaheshwaran1, P. Verma1, Pavlos M. Vranas1, T.J.C. Ward1, M. Wazlowski1, William A. Barrett1, C. Engel1, B. Drehmel1, B. Hilgart1, D. Hill1, F. Kasemkhani1, D. Krolak1, C.T. Li1, T. Liebsch1, James Anthony Marcella1, Adam J. Muff1, A. Okomo1, M. Rouse1, A. Schram1, Matthew R. Tubbs1, G. Ulsh1, Charles D. Wait1, J. Wittrup1, M. Bae3, Kenneth Alan Dockser3, Lynn Kissel2, M.K. Seager2, Jeffrey S. Vetter2, K. Yates2 
16 Nov 2002
TL;DR: An overview of the BlueGene/L Supercomputer, a massively parallel system of 65,536 nodes based on a new architecture that exploits system-on-a-chip technology to deliver target peak processing power of 360 teraFLOPS (trillion floating-point operations per second).
Abstract: This paper gives an overview of the BlueGene/L Supercomputer. This is a jointly funded research partnership between IBM and the Lawrence Livermore National Laboratory as part of the United States Department of Energy ASCI Advanced Architecture Research Program. Application performance and scaling studies have recently been initiated with partners at a number of academic and government institutions, including the San Diego Supercomputer Center and the California Institute of Technology. This massively parallel system of 65,536 nodes is based on a new architecture that exploits system-on-a-chip technology to deliver target peak processing power of 360 teraFLOPS (trillion floating-point operations per second). The machine is scheduled to be operational in the 2004--2005 time frame, at price/performance and power consumption/performance targets unobtainable with conventional architectures.

514 citations

Journal ArticleDOI
TL;DR: The key architectural features of BlueGene/L are introduced: the link chip component and five Blue Gene/L networks, the PowerPC® 440 core and floating-point enhancements, the on-chip and off-chip distributed memory system, the node- and system-level design for high reliability, and the comprehensive approach to fault isolation.
Abstract: The Blue Gene®/L computer is a massively parallel supercomputer based on IBM system-on-a-chip technology. It is designed to scale to 65,536 dual-processor nodes, with a peak performance of 360 teraflops. This paper describes the project objectives and provides an overview of the system architecture that resulted. We discuss our application-based approach and rationale for a low-power, highly integrated design. The key architectural features of Blue Gene/L are introduced in this paper: the link chip component and five Blue Gene/L networks, the PowerPC® 440 core and floating-point enhancements, the on-chip and off-chip distributed memory system, the node- and system-level design for high reliability, and the comprehensive approach to fault isolation.

422 citations

Journal ArticleDOI
01 May 2014
TL;DR: This report presents a report produced by a workshop on ‘Addressing failures in exascale computing’ held in Park City, Utah, 4–11 August 2012, which summarizes and builds on discussions on resilience.
Abstract: We present here a report produced by a workshop on 'Addressing failures in exascale computing' held in Park City, Utah, 4-11 August 2012. The charter of this workshop was to establish a common taxonomy about resilience across all the levels in a computing system, discuss existing knowledge on resilience across the various hardware and software layers of an exascale system, and build on those results, examining potential solutions from both a hardware and software perspective and focusing on a combined approach. The workshop brought together participants with expertise in applications, system software, and hardware; they came from industry, government, and academia, and their interests ranged from theory to implementation. The combination allowed broad and comprehensive discussions and led to this document, which summarizes and builds on those discussions.

406 citations

Journal ArticleDOI
TL;DR: In this paper, the authors analyzed short, medium, and long on-chip interconnections having linewidths of 0.45-52 /spl mu/m in a five-metal-layer structure.
Abstract: Short, medium, and long on-chip interconnections having linewidths of 0.45-52 /spl mu/m are analyzed in a five-metal-layer structure. We study capacitive coupling for short lines, inductive coupling for medium-length lines, inductance and resistance of the current return path in the power buses, and line resistive losses for the global wiring. Design guidelines and technology changes are proposed to achieve minimum delay and contain crosstalk for local and global wiring. Conditional expressions are given to determine when transmission-line effects are important for accurate delay and crosstalk prediction.

397 citations

Patent
10 Jan 2011
TL;DR: A multi-petascale Highly Efficient Parallel Supercomputer of 100 petaOPS-scale computing, at decreased cost, power and footprint, allows for a maximum packaging density of processing nodes from an interconnect point of view.
Abstract: A Multi-Petascale Highly Efficient Parallel Supercomputer of 100 petaOPS-scale computing, at decreased cost, power and footprint, and that allows for a maximum packaging density of processing nodes from an interconnect point of view The Supercomputer exploits technological advances in VLSI that enables a computing model where many processors can be integrated into a single Application Specific Integrated Circuit (ASIC) Each ASIC computing node comprises a system-on-chip ASIC utilizing four or more processors integrated into one die, with each having full access to all system resources and enabling adaptive partitioning of the processors to functions such as compute or messaging I/O on an application by application basis, and preferably, enable adaptive partitioning of functions in accordance with various algorithmic phases within an application, or if I/O or other processors are underutilized, then can participate in computation or communication nodes are interconnected by a five dimensional torus network with DMA that optimally maximize the throughput of packet communications between nodes and minimize latency

391 citations


Cited by
More filters
01 May 1993
TL;DR: Comparing the results to the fastest reported vectorized Cray Y-MP and C90 algorithm shows that the current generation of parallel machines is competitive with conventional vector supercomputers even for small problems.
Abstract: Three parallel algorithms for classical molecular dynamics are presented. The first assigns each processor a fixed subset of atoms; the second assigns each a fixed subset of inter-atomic forces to compute; the third assigns each a fixed spatial region. The algorithms are suitable for molecular dynamics models which can be difficult to parallelize efficiently—those with short-range forces where the neighbors of each atom change rapidly. They can be implemented on any distributed-memory parallel machine which allows for message-passing of data between independently executing processors. The algorithms are tested on a standard Lennard-Jones benchmark problem for system sizes ranging from 500 to 100,000,000 atoms on several parallel supercomputers--the nCUBE 2, Intel iPSC/860 and Paragon, and Cray T3D. Comparing the results to the fastest reported vectorized Cray Y-MP and C90 algorithm shows that the current generation of parallel machines is competitive with conventional vector supercomputers even for small problems. For large problems, the spatial algorithm achieves parallel efficiencies of 90% and a 1840-node Intel Paragon performs up to 165 faster than a single Cray C9O processor. Trade-offs between the three algorithms and guidelines for adapting them to more complex molecular dynamics simulations are also discussed.

29,323 citations

Journal ArticleDOI
17 Aug 2008
TL;DR: This paper shows how to leverage largely commodity Ethernet switches to support the full aggregate bandwidth of clusters consisting of tens of thousands of elements and argues that appropriately architected and interconnected commodity switches may deliver more performance at less cost than available from today's higher-end solutions.
Abstract: Today's data centers may contain tens of thousands of computers with significant aggregate bandwidth requirements. The network architecture typically consists of a tree of routing and switching elements with progressively more specialized and expensive equipment moving up the network hierarchy. Unfortunately, even when deploying the highest-end IP switches/routers, resulting topologies may only support 50% of the aggregate bandwidth available at the edge of the network, while still incurring tremendous cost. Non-uniform bandwidth among data center nodes complicates application design and limits overall system performance.In this paper, we show how to leverage largely commodity Ethernet switches to support the full aggregate bandwidth of clusters consisting of tens of thousands of elements. Similar to how clusters of commodity computers have largely replaced more specialized SMPs and MPPs, we argue that appropriately architected and interconnected commodity switches may deliver more performance at less cost than available from today's higher-end solutions. Our approach requires no modifications to the end host network interface, operating system, or applications; critically, it is fully backward compatible with Ethernet, IP, and TCP.

3,549 citations

Journal ArticleDOI
17 Nov 2011-Nature
TL;DR: Tunnels based on ultrathin semiconducting films or nanowires could achieve a 100-fold power reduction over complementary metal–oxide–semiconductor transistors, so integrating tunnel FETs with CMOS technology could improve low-power integrated circuits.
Abstract: Power dissipation is a fundamental problem for nanoelectronic circuits. Scaling the supply voltage reduces the energy needed for switching, but the field-effect transistors (FETs) in today's integrated circuits require at least 60 mV of gate voltage to increase the current by one order of magnitude at room temperature. Tunnel FETs avoid this limit by using quantum-mechanical band-to-band tunnelling, rather than thermal injection, to inject charge carriers into the device channel. Tunnel FETs based on ultrathin semiconducting films or nanowires could achieve a 100-fold power reduction over complementary metal-oxide-semiconductor (CMOS) transistors, so integrating tunnel FETs with CMOS technology could improve low-power integrated circuits.

2,390 citations

Journal ArticleDOI
01 Apr 2001
TL;DR: Wires that shorten in length as technologies scale have delays that either track gate delays or grow slowly relative to gate delays, which is good news since these "local" wires dominate chip wiring.
Abstract: Concern about the performance of wires wires in scaled technologies has led to research exploring other communication methods. This paper examines wire and gate delays as technologies migrate from 0.18-/spl mu/m to 0.035-/spl mu/m feature sizes to better understand the magnitude of the the wiring problem. Wires that shorten in length as technologies scale have delays that either track gate delays or grow slowly relative to gate delays. This result is good news since these "local" wires dominate chip wiring. Despite this scaling of local wire performance, computer-aided design (CAD) tools must still become move sophisticated in dealing with these wires. Under scaling, the total number of wires grows exponentially, so CAD tools will need to handle an ever-growing percentage of all the wires in order to keep designer workloads constant. Global wires present a more serious problem to designers. These are wires that do not scale in length since they communicate signals across the chip. The delay of these wives will remain constant if repeaters are used meaning that relative to gate delays, their delays scale upwards. These increased delays for global communication will drive architectures toward modular designs with explicit global latency mechanisms.

1,486 citations

Journal ArticleDOI
25 Oct 2010
TL;DR: This review introduces and summarizes progress in the development of the tunnel field- effect transistors (TFETs) including its origin, current experimental and theoretical performance relative to the metal-oxide-semiconductor field-effect transistor (MOSFET), basic current-transport theory, design tradeoffs, and fundamental challenges.
Abstract: Steep subthreshold swing transistors based on interband tunneling are examined toward extending the performance of electronics systems. In particular, this review introduces and summarizes progress in the development of the tunnel field-effect transistors (TFETs) including its origin, current experimental and theoretical performance relative to the metal-oxide-semiconductor field-effect transistor (MOSFET), basic current-transport theory, design tradeoffs, and fundamental challenges. The promise of the TFET is in its ability to provide higher drive current than the MOSFET as supply voltages approach 0.1 V.

1,389 citations