scispace - formally typeset
Search or ask a question

Showing papers by "Paul W. Coteus published in 2005"


Journal ArticleDOI
TL;DR: The key architectural features of BlueGene/L are introduced: the link chip component and five Blue Gene/L networks, the PowerPC® 440 core and floating-point enhancements, the on-chip and off-chip distributed memory system, the node- and system-level design for high reliability, and the comprehensive approach to fault isolation.
Abstract: The Blue Gene®/L computer is a massively parallel supercomputer based on IBM system-on-a-chip technology. It is designed to scale to 65,536 dual-processor nodes, with a peak performance of 360 teraflops. This paper describes the project objectives and provides an overview of the system architecture that resulted. We discuss our application-based approach and rationale for a low-power, highly integrated design. The key architectural features of Blue Gene/L are introduced in this paper: the link chip component and five Blue Gene/L networks, the PowerPC® 440 core and floating-point enhancements, the on-chip and off-chip distributed memory system, the node- and system-level design for high reliability, and the comprehensive approach to fault isolation.

422 citations


Journal ArticleDOI
TL;DR: Both the architecture and the microarchitecture of the torus and a network performance simulator are described and simulation results and hardware measurements are presented.
Abstract: The main interconnect of the massively parallel Blue Gene®/L is a three-dimensional torus network with dynamic virtual cut-through routing. This paper describes both the architecture and the microarchitecture of the torus and a network performance simulator. Both simulation results and hardware measurements are presented.

361 citations


Patent
18 Jul 2005
TL;DR: The Global Collective Network (GCN) as discussed by the authors is a system and method for enabling high-speed, low-latency global collective communications among interconnected processing nodes, which optimally enables collective reduction operations to be performed during parallel algorithm operations executing in a computer structure having a plurality of the interconnected nodes.
Abstract: A system and method for enabling high-speed, low-latency global collective communications among interconnected processing nodes. The global collective network optimally enables collective reduction operations to be performed during parallel algorithm operations executing in a computer structure having a plurality of the interconnected processing nodes. Router devices are included that interconnect the nodes of the network via links to facilitate performance of low-latency global processing operations at nodes of the virtual network and class structures. The global collective network may be configured to provide global barrier and interrupt functionality in asynchronous or synchronized manner. When implemented in a massively-parallel supercomputing structure, the global collective network is physically and logically partitionable according to needs of a processing algorithm.

49 citations


Journal ArticleDOI
TL;DR: How 1,024 dual-processor compute application-specific integrated circuits (ASICs) are packaged in a scalable rack, and how racks are combined and augmented with host computers and remote storage are described.
Abstract: As 1999 ended, IBM announced its intention to construct a one-petaflop supercomputer. The construction of this system was based on a cellular architecture--the use of relatively small but powerful building blocks used together in sufficient quantities to construct large systems. The first step on the road to a petaflop machine (one quadrillion floating-point operations in a second) is the Blue Gene®/L supercomputer. Blue Gene/L combines a low-power processor with a highly parallel architecture to achieve unparalleled computing performance per unit volume. Implementing the Blue Gene/L packaging involved trading off considerations of cost, power, cooling, signaling, electromagnetic radiation, mechanics, component selection, cabling, reliability, service strategy, risk, and schedule. This paper describes how 1,024 dual-processor compute application-specific integrated circuits (ASICs) are packaged in a scalable rack, and how racks are combined and augmented with host computers and remote storage. The Blue Gene/L interconnect, power, cooling, and control systems are described individually and as part of the synergistic whole.

47 citations


Proceedings ArticleDOI
04 May 2005
TL;DR: The BlueGene/L supercomputer as mentioned in this paper has been designed with a focus on power/performance efficiency to achieve high application performance under the thermal constraints of common data centers, and emphasis was put on system solutions to engineer a power-efficient system.
Abstract: The BlueGene/L supercomputer has been designed with a focus on power/performance efficiency to achieve high application performance under the thermal constraints of common data centers. To achieve this goal, emphasis was put on system solutions to engineer a power-efficient system. To exploit thread level parallelism, the BlueGene/L system can scale to 64 racks with a total of 65536 computer nodes consisting of a single compute ASIC integrating all system functions with two industry-standard PowerPC microprocessor cores in a chip multiprocessor configuration. Each PowerPC processor exploits data-level parallelism with a high-performance SIMD oating point unitTo support good application scaling on such a massive system, special emphasis was put on efficient communication primitives by including five highly optimized communification networks. After an initial introduction of the Blue-Gene/L system architecture, we analyze power/performance efficiency for the BlueGene system using performance and power characteristics for the overall system performance (as exemplified by peak performance numbers.To understand application scaling behavior, and its impact on performance and power/performance efficiency, we analyze the NAMD molecular dynamics package using the ApoA1 benchmark. We find that even for strong scaling problems, BlueGene/L systems can deliver superior performance scaling and deliver significant power/performance efficiency. Application benchmark power/performance scaling for the voltage-invariant energy delay 2 power/performance metric demonstrates that choosing a power-efficient 700MHz embedded PowerPC processor core and relying on application parallelism was the right decision to build a powerful, and power/performance efficient system

30 citations


Patent
07 Oct 2005
TL;DR: A 3D chip having at least one I/O layer connected to other 3D-chip layers by a vertical bus can accommodate protection and off-chip device drive circuits, customization circuits, translation circuits, conversions circuits and/or built-in self-test circuits as discussed by the authors.
Abstract: A 3D chip having at least one I/O layer connected to other 3D chip layers by a vertical bus such that the I/O layer(s) may accommodate protection and off-chip device drive circuits, customization circuits, translation circuits, conversions circuits and/or built-in self-test circuits capable of comprehensive chip or wafer level testing wherein the I/O layers function as a testhead. Substitution of I/O circuits or structures may be performed using E-fuses or the like responsive to such testing.

16 citations


Patent
30 Sep 2005
TL;DR: In this article, the authors proposed techniques or forming enhanced electrical connections, where an electrical connecting device comprises an electrically insulating carrier having one or more contact structures traversing a plane thereof.
Abstract: Techniques or forming enhanced electrical connections are provided. In one aspect, and electrical connecting device comprises an electrically insulating carrier having one or more contact structures traversing a plane thereof. Each contact structure comprises an elastomeric material having an electrically conductive layer running along at least one surface thereof continuously through the plane of the carrier.

15 citations


Patent
14 Apr 2005
TL;DR: In this article, a node fault detection apparatus identifies faulty nodes by comparing commutative error detection values associated with reproducible portions of the application program generated by a particular node from different runs of the program.
Abstract: Methods and apparatus perform fault isolation in multiple node computing systems using commutative error detection values for—example, checksums—to identify and to isolate faulty nodes. When information associated with a reproducible portion of a computer program is injected into a network by a node, a commutative error detection value is calculated. At intervals, node fault detection apparatus associated with the multiple node computer system retrieve commutative error detection values associated with the node and stores them in memory. When the computer program is executed again by the multiple node computer system, new commutative error detection values are created and stored in memory. The node fault detection apparatus identifies faulty nodes by comparing commutative error detection values associated with reproducible portions of the application program generated by a particular node from different runs of the application program. Differences in values indicate a possible faulty node.

13 citations


Patent
30 Sep 2005
TL;DR: In this paper, a probe structure for an electronic device is described, which includes an elastomeric material having an electrically conductive layer running along at least one surface thereof continuously through the plane of the carrier.
Abstract: A probe structure for an electronic device is provided. In one aspect, the probe structure includes an electrically insulating carrier having one or more contact structures traversing a plane thereof. Each contact structure includes an elastomeric material having an electrically conductive layer running along at least one surface thereof continuously through the plane of the carrier. The probe structure includes one or more other contact structures adapted for connection to a test apparatus.

6 citations


Patent
28 Nov 2005
TL;DR: In this paper, the authors propose a method for providing indeterminate read data latency in a memory system, which includes determining if a local data packet has been received and storing it into a buffer device, and if the buffer device contains a data packet and determining if an upstream driver for transmitting data packets to a memory controller via an upstream channel is idle.
Abstract: A method for providing indeterminate read data latency in a memory system. The method includes determining if a local data packet has been received and storing it into a buffer device. The method also includes determining if the buffer device contains a data packet and determining if an upstream driver for transmitting data packets to a memory controller via an upstream channel is idle, and in response thereto the data packet is transmitted to the upstream driver. The method further includes determining if an upstream data packet has been received and the upstream driver is not idle, then the upstream data packet is stored into the buffer device. The upstream data packet is selectively transmitted to the upstream driver. If the upstream driver is not idle, then any data packets in progress are continued being transmitted to the upstream driver.

1 citations


01 Jan 2005
TL;DR: The key architectural features of Blue Genet/L are introduced: the link chip component and five Blue Gene/L networks, the PowerPCt 440 core and floatingpoint enhancements, the on-chip and off-chip distributed memory system, the node- and system-level design for high reliability, and the comprehensive approach to fault isolation.
Abstract: The Blue Genet/L computer is a massively parallel supercomputer based on IBM system-on-a-chip technology. It is designed to scale to 65,536 dual-processor nodes, with a peak performance of 360 teraflops. This paper describes the project objectives and provides an overview of the system architecture that resulted. We discuss our application-based approach and rationale for a low-power, highly integrated design. The key architectural features of Blue Gene/L are introduced in this paper: the link chip component and five Blue Gene/L networks, the PowerPCt 440 core and floatingpoint enhancements, the on-chip and off-chip distributed memory system, the node- and system-level design for high reliability, and the comprehensive approach to fault isolation.