
Showing papers on "Massively parallel published in 2005"


Journal ArticleDOI
TL;DR: From this project came the people and ideas that underpinned VMware Inc., the original supplier of VMMs for commodity computing hardware, and the implications of having a VMM for commodity platforms intrigued both researchers and entrepreneurs.
Abstract: Developed more than 30 years ago to address mainframe computing problems, virtual machine monitors have resurfaced on commodity platforms, offering novel solutions to challenges in security, reliability, and administration. Stanford University researchers began to look at the potential of virtual machines to overcome difficulties that hardware and operating system limitations imposed: this time the problems stemmed from massively parallel processing (MPP) machines that were difficult to program and could not run existing operating systems. With virtual machines, researchers found they could make these unwieldy architectures look sufficiently similar to existing platforms to leverage the current operating systems. From this project came the people and ideas that underpinned VMware Inc., the original supplier of VMMs for commodity computing hardware. The implications of having a VMM for commodity platforms intrigued both researchers and entrepreneurs.

720 citations


Journal ArticleDOI
TL;DR: The key architectural features of Blue Gene/L are introduced: the link chip component and five Blue Gene/L networks, the PowerPC® 440 core and floating-point enhancements, the on-chip and off-chip distributed memory system, the node- and system-level design for high reliability, and the comprehensive approach to fault isolation.
Abstract: The Blue Gene®/L computer is a massively parallel supercomputer based on IBM system-on-a-chip technology. It is designed to scale to 65,536 dual-processor nodes, with a peak performance of 360 teraflops. This paper describes the project objectives and provides an overview of the system architecture that resulted. We discuss our application-based approach and rationale for a low-power, highly integrated design. The key architectural features of Blue Gene/L are introduced in this paper: the link chip component and five Blue Gene/L networks, the PowerPC® 440 core and floating-point enhancements, the on-chip and off-chip distributed memory system, the node- and system-level design for high reliability, and the comprehensive approach to fault isolation.

422 citations


Journal ArticleDOI
TL;DR: Both the architecture and the microarchitecture of the torus and a network performance simulator are described and simulation results and hardware measurements are presented.
Abstract: The main interconnect of the massively parallel Blue Gene®/L is a three-dimensional torus network with dynamic virtual cut-through routing. This paper describes both the architecture and the microarchitecture of the torus and a network performance simulator. Both simulation results and hardware measurements are presented.

361 citations


Journal ArticleDOI
TL;DR: ZEUS-MP as discussed by the authors is a massively parallel implementation of the ZEUS code for simulations on parallel computing platforms, which allows the advection of multiple chemical (or nuclear) species.
Abstract: This paper describes ZEUS-MP, a multi-physics, massively parallel, message-passing implementation of the ZEUS code. ZEUS-MP differs significantly from the ZEUS-2D code, the ZEUS-3D code, and an early "version 1" of ZEUS-MP distributed publicly in 1999. ZEUS-MP offers an MHD algorithm better suited for multidimensional flows than the ZEUS-2D module by virtue of modifications to the Method of Characteristics scheme first suggested by Hawley and Stone (1995), and is shown to compare quite favorably to the TVD scheme described by Ryu et al. (1998). ZEUS-MP is the first publicly available ZEUS code to allow the advection of multiple chemical (or nuclear) species. Radiation hydrodynamic simulations are enabled via an implicit flux-limited radiation diffusion (FLD) module. The hydrodynamic, MHD, and FLD modules may be used in one, two, or three space dimensions. Self-gravity may be included either through the assumption of a GM/r potential or a solution of Poisson's equation using one of three linear solver packages (conjugate-gradient, multigrid, and FFT) provided for that purpose. Point-mass potentials are also supported. Because ZEUS-MP is designed for simulations on parallel computing platforms, considerable attention is paid to the parallel performance characteristics of each module. Strong-scaling tests involving pure hydrodynamics (with and without self-gravity), MHD, and RHD are performed in which large problems (256^3 zones) are distributed among as many as 1024 processors of an IBM SP3. Parallel efficiency is a strong function of the amount of communication required between processors in a given algorithm, but all modules are shown to scale well on up to 1024 processors for the chosen fixed problem size.

333 citations


Book
01 Jan 2005
TL;DR: The interdisciplinary research monograph brings together results of a decade-long study into designing experimental and simulated prototypes of reaction-diffusion computing devices for image processing, path planning, robot navigation, computational geometry, logics and artificial intelligence.
Abstract: The interdisciplinary research monograph, which has been peer-reviewed by several international experts assigned by Elsevier, introduces groundbreaking original results in formal paradigms, architectures, and laboratory implementations of computers based on travelling waves in reaction-diffusion media. The monograph brings together results of a decade-long study into designing experimental and simulated prototypes of reaction-diffusion computing devices for image processing, path planning, robot navigation, computational geometry, logics, and artificial intelligence. The book has had impact in the field of massively parallel computing because of its comprehensive presentation of the theoretical and experimental foundations, cutting-edge computation techniques, chemical laboratory experimental setups, and hardware implementation technology employed in the development of novel nature-inspired computing devices. The monograph resulted from EPSRC grants GR/S63854/01 and EP/C004272/1.

302 citations


Journal ArticleDOI
TL;DR: A new method for the parallel evaluation of distance‐limited pairwise particle interactions that significantly reduces the amount of data transferred between processors by comparison with traditional methods is introduced.
Abstract: Classical molecular dynamics simulations of biological macromolecules in explicitly modeled solvent typically require the evaluation of interactions between all pairs of atoms separated by no more than some distance R, with more distant interactions handled using some less expensive method. Performing such simulations for periods on the order of a millisecond is likely to require the use of massive parallelism. The extent to which such simulations can be efficiently parallelized, however, has historically been limited by the time required for interprocessor communication. This article introduces a new method for the parallel evaluation of distance-limited pairwise particle interactions that significantly reduces the amount of data transferred between processors by comparison with traditional methods. Specifically, the amount of data transferred into and out of a given processor scales as O(R^(3/2) p^(-1/2)), where p is the number of processors, and with constant factors that should yield a substantial performance advantage in practice.

200 citations


Proceedings ArticleDOI
28 Jun 2005
TL;DR: The NonStop advanced architecture (NSAA) uses dual or triple modular redundant fault-tolerant servers built from standard HP 4-way SMP Itanium® 2 server processor modules, memory boards, and power infrastructure to improve system availability and reduce cost.

Abstract: For nearly 30 years the Hewlett Packard NonStop Enterprise Division (formerly Tandem Computers Inc.) has produced highly available, fault-tolerant, massively parallel NonStop computer systems. These vertically integrated systems use a proprietary operating system and specialized hardware for detecting, isolating, and recovering from faults. The NonStop advanced architecture (NSAA) uses dual or triple modular redundant fault-tolerant servers built from standard HP 4-way SMP Itanium® 2 server processor modules, memory boards, and power infrastructure. A unique synchronization mechanism allows fully compared operations from loosely synchronized processor modules. In addition, the NSAA improves system availability by additional hardware fault masking, and significantly lowers cost by leveraging existing high-volume Itanium server components.

185 citations


01 Jan 2005
TL;DR: Catamount is designed to be a low overhead operating system for a parallel computing environment that is limited to the minimum set needed to run a scientific computation.
Abstract: Catamount is designed to be a low overhead operating system for a parallel computing environment. Functionality is limited to the minimum set needed to run a scientific computation. The design choices and implementations will be presented.

126 citations



Journal ArticleDOI
TL;DR: The data structures, parallel implementation, and resulting performance of the IJ, Struct, and semiStruct interfaces are described; their scalability is investigated, successes as well as pitfalls of some of the approaches are presented, and ways of dealing with them are suggested.
Abstract: The software library hypre provides high-performance preconditioners and solvers for the solution of large, sparse linear systems on massively parallel computers as well as conceptual interfaces that allow users to access the library in the way they naturally think about their problems. These interfaces include a stencil-based structured interface (Struct); a semistructured interface (semiStruct), which is appropriate for applications that are mostly structured, for example, block structured grids, composite grids in structured adaptive mesh refinement applications, and overset grids; and a finite element interface (FEI) for unstructured problems, as well as a conventional linear-algebraic interface (IJ). It is extremely important to provide an efficient, scalable implementation of these interfaces in order to support the scalable solvers of the library, especially when using tens of thousands of processors. This article describes the data structures, parallel implementation, and resulting performance of the IJ, Struct and semiStruct interfaces. It investigates their scalability, presents successes as well as pitfalls of some of the approaches and suggests ways of dealing with them.

89 citations
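The IJ (linear-algebraic) interface mentioned above is the most conventional entry point into the library. As a rough illustration of the usage pattern, the sketch below assembles one locally owned row of a distributed matrix; the call names follow hypre's documented C API, but the exact headers, integer types (HYPRE_Int/HYPRE_BigInt), and signatures are assumptions that should be checked against the installed library version.

    /* Minimal sketch (not code from the article): building one row of a
     * distributed matrix through hypre's IJ interface.  Each rank owns the
     * contiguous row range [ilower, iupper]. */
    #include <mpi.h>
    #include "HYPRE.h"
    #include "HYPRE_IJ_mv.h"
    #include "HYPRE_parcsr_ls.h"

    void assemble_example_row(int ilower, int iupper)
    {
        HYPRE_IJMatrix A;
        HYPRE_ParCSRMatrix parcsr_A;

        HYPRE_IJMatrixCreate(MPI_COMM_WORLD, ilower, iupper, ilower, iupper, &A);
        HYPRE_IJMatrixSetObjectType(A, HYPRE_PARCSR);   /* store as ParCSR */
        HYPRE_IJMatrixInitialize(A);

        /* One 3-point stencil row with illustrative values; assumes the row
         * has neighbours on both sides in the global index space. */
        int row = ilower, ncols = 3;
        int cols[3]    = { row - 1, row, row + 1 };
        double vals[3] = { -1.0, 2.0, -1.0 };
        HYPRE_IJMatrixSetValues(A, 1, &ncols, &row, cols, vals);

        HYPRE_IJMatrixAssemble(A);                      /* exchanges off-rank entries */
        HYPRE_IJMatrixGetObject(A, (void **) &parcsr_A);
        /* parcsr_A can now be handed to a solver such as BoomerAMG. */
    }

The same Initialize/SetValues/Assemble pattern applies to IJ vectors; the Struct and semiStruct interfaces replace explicit row and column indices with stencil and grid descriptions.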


Journal ArticleDOI
TL;DR: A perspective is presented that retains the descriptive richness while providing a unifying framework that will lead to effective and affordable Petaflops-scale computing including the future role of computer centers as facilities for supporting high performance computing environments.
Abstract: In a recent paper, Gordon Bell and Jim Gray (2002) put forth a view of the past, present, and future of high-performance computing (HPC) that is both insightful and thought provoking. Identifying key trends with a grace and candor rarely encountered in a single work, the authors describe an evolutionary past drawn from their vast experience and project an enticing and compelling vision of HPC's future. Yet, the underlying assumptions implicit in their treatment, particularly those related to terminology and dominant trends, conflict with our own experience, common practices, and shared view of HPC's future directions. Taken from our vantage points of the Top500 list, the Lawrence Berkeley National Laboratory NERSC computer center, Beowulf-class computing, and research in petaflops-scale computing architectures, we offer an alternate perspective on several key issues in the form of a constructive counterpoint. One objective of this article is to restore the strength and value of the term "cluster" by degeneralizing its applicability to a restricted subset of parallel computers. We'll further consider this class in conjunction with its complementing terms constellation, Beowulf class, and massively parallel processing systems (MPPs), based on the classification used by the Top500 list, which has tracked the HPC field for more than a decade.

Book ChapterDOI
Jarkko Kari
04 Jul 2005
TL;DR: This paper is a short survey of research on reversible cellular automata over the past forty-plus years and discusses the classic results by Hedlund, Moore and Myhill that relate injectivity, surjectivity and reversibility with each other.
Abstract: Reversible cellular automata (RCA) are models of massively parallel computation that preserve information. This paper is a short survey of research on reversible cellular automata over the past forty-plus years. We discuss the classic results by Hedlund, Moore and Myhill that relate injectivity, surjectivity and reversibility with each other. Then we review algorithmic questions and some results on computational universality. Finally we talk about local reversibility vs. global reversibility.
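A standard way to obtain reversibility by construction, and a concrete way to see the local versus global reversibility distinction, is the second-order technique usually attributed to Fredkin: the next configuration is an arbitrary local function of the current one XORed with the previous one, so every step can be undone. The sketch below is a generic illustration of this construction, not one of the automata discussed in the survey.

    /* Second-order (Fredkin-style) reversible CA over binary states:
     *     next[i] = f(neighbourhood of cur at i) XOR prev[i]
     * is invertible for any local rule f, since prev[i] = f(...) XOR next[i].
     * Generic illustration only; not an automaton from the survey. */
    #include <stddef.h>

    void rca_step(const unsigned char *prev, const unsigned char *cur,
                  unsigned char *next, size_t n)
    {
        for (size_t i = 0; i < n; ++i) {
            unsigned char l = cur[(i + n - 1) % n];        /* periodic boundary */
            unsigned char c = cur[i];
            unsigned char r = cur[(i + 1) % n];
            unsigned char f = (unsigned char)(l ^ c ^ r);  /* any local rule works */
            next[i] = (unsigned char)(f ^ prev[i]);        /* XOR makes the step invertible */
        }
    }

    /* To run backwards, apply the same rule with the roles of prev and next swapped. */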

Journal ArticleDOI
TL;DR: Performance measurements show that message-passing services deliver performance close to the hardware limits of the machine, and dedicating one of the processors of a node to communication functions greatly improves the message-passing bandwidth, whereas running two processes per compute node (virtual node mode) can have a positive impact on application performance.
Abstract: The Blue Gene®/L (BG/L) supercomputer, with 65,536 dual-processor compute nodes, was designed from the ground up to support efficient execution of massively parallel message-passing programs. Part of this support is an optimized implementation of the Message Passing Interface (MPI), which leverages the hardware features of BG/L. MPI for BG/L is implemented on top of a more basic message-passing infrastructure called the message layer. This message layer can be used both to implement other higher-level libraries and directly by applications. MPI and the message layer are used in the two BG/L modes of operation: the coprocessor mode and the virtual node mode. Performance measurements show that our message-passing services deliver performance close to the hardware limits of the machine. They also show that dedicating one of the processors of a node to communication functions (coprocessor mode) greatly improves the message-passing bandwidth, whereas running two processes per compute node (virtual node mode) can have a positive impact on application performance.
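The bandwidth figures referred to above are the kind of numbers obtained from a point-to-point ping-pong test. The loop below is ordinary, portable MPI C code showing the measurement pattern; it is not taken from the BG/L message layer or its MPI port.

    /* Generic MPI ping-pong bandwidth probe between ranks 0 and 1.
     * Run with at least two ranks; reports the aggregate transfer rate. */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        const int nbytes = 1 << 20, reps = 100;
        char *buf = malloc(nbytes);
        int rank;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double t0 = MPI_Wtime();
        for (int i = 0; i < reps; ++i) {
            if (rank == 0) {
                MPI_Send(buf, nbytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, nbytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, nbytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                MPI_Send(buf, nbytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double dt = MPI_Wtime() - t0;

        if (rank == 0)   /* two messages of nbytes move per repetition */
            printf("ping-pong bandwidth: %.1f MB/s\n", 2.0 * reps * nbytes / dt / 1e6);

        MPI_Finalize();
        free(buf);
        return 0;
    }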

Journal ArticleDOI
TL;DR: A linear-scaling algorithm has been developed to perform large-scale molecular-dynamics (MD) simulations, in which interatomic forces are computed quantum mechanically in the framework of the density functional theory.

Journal ArticleDOI
TL;DR: A bifurcation analysis of a 1.6-million-unknown model of 3D Rayleigh–Bénard convection in a 5 × 5 × 1 box is successfully undertaken, showing that the algorithms can indeed scale to problems of this size while producing solutions of reasonable accuracy.
Abstract: We present the set of bifurcation tracking algorithms which have been developed in the LOCA software library to work with large-scale application codes that use fully coupled Newton's method with iterative linear solvers. Turning point (fold), pitchfork, and Hopf bifurcation tracking algorithms based on Newton's method have been implemented, with particular attention to the scalability to large problem sizes on parallel computers and to the ease of implementation with new application codes. The ease of implementation is accomplished by using block elimination algorithms to solve the Newton iterations of the augmented bifurcation tracking systems. The applicability of such algorithms for large applications is in doubt since the main computational kernel of these routines is the iterative linear solve of the same matrix that is being driven singular by the algorithm. To test the robustness and scalability of these algorithms, the LOCA library has been interfaced with the MPSalsa massively parallel finite element reacting flows code. A bifurcation analysis of a 1.6-million-unknown model of 3D Rayleigh–Bénard convection in a 5 × 5 × 1 box is successfully undertaken, showing that the algorithms can indeed scale to problems of this size while producing solutions of reasonable accuracy.
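To make the block elimination concrete: for turning-point (fold) tracking, a standard bordering scheme (sketched here generically, not necessarily LOCA's exact formulation, with c a normalization vector and g its scalar residual) solves the augmented Newton system using only solves with the original Jacobian J:

    \begin{bmatrix} J & F_{\lambda} \\ c^{T} & 0 \end{bmatrix}
    \begin{bmatrix} \Delta x \\ \Delta\lambda \end{bmatrix}
    = -\begin{bmatrix} F \\ g \end{bmatrix},
    \qquad
    J a = -F, \quad J b = F_{\lambda}, \quad
    \Delta\lambda = \frac{g + c^{T} a}{c^{T} b}, \quad
    \Delta x = a - \Delta\lambda\, b .

Each Newton step therefore costs two iterative solves with the same matrix J that the algorithm is driving toward singularity, which is exactly the robustness concern raised in the abstract.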

Proceedings ArticleDOI
04 Apr 2005
TL;DR: So-called "M×N" research, as part of the Common Component Architecture (CCA) effort, addresses these special and challenging needs, to provide generalized interfaces and tools that support flexible parallel data redistribution and parallel remote method invocation.
Abstract: With the increasing availability of high-performance massively parallel computer systems, the prevalence of sophisticated scientific simulation has grown rapidly. The complexity of the scientific models being simulated has also evolved, leading to a variety of coupled multi-physics simulation codes. Such cooperating parallel programs require fundamentally new interaction capabilities, to efficiently exchange parallel data structures and collectively invoke methods across programs. So-called "M×N" research, as part of the Common Component Architecture (CCA) effort, addresses these special and challenging needs, to provide generalized interfaces and tools that support flexible parallel data redistribution and parallel remote method invocation. Using this technology, distinct simulation codes with disparate distributed data decompositions can work together to achieve greater scientific discoveries.

Book
01 Jan 2005
TL;DR: Computer architecture deals with the physical configuration, logical structure, formats, protocols, and operational sequences for processing data, controlling the configuration, and controlling the operations of a computer.
Abstract: Advanced Computer Architecture and Parallel Processing (Wiley Series on Parallel and Distributed Computing, Vol. 2), by Hesham El-Rewini and Mostafa Abd-El-Barr. Computer architecture deals with the physical configuration, logical structure, formats, protocols, and operational sequences for processing data, as well as controlling the configuration and operations of a computer. The book covers topics such as multiprocessing (the use of two or more central processing units within a single computer system, and the ability of a system to support more than one processor or to allocate tasks between them) and network computing basics, including computer networks and client-server systems.

Proceedings ArticleDOI
12 Feb 2005
TL;DR: This work introduces scatter-add, which is the data-parallel form of the well-known scalar fetch-and-op, specifically tuned for SIMD/vector/stream style memory systems, and details the microarchitecture of a scatter-add implementation on a stream architecture, which requires less than 2% increase in die area yet shows performance speedups ranging from 1.45 to over 11 on a set of applications that require a scatter-add computation.
Abstract: Many important applications exhibit large amounts of data parallelism, and modern computer systems are designed to take advantage of it. While much of the computation in the multimedia and scientific application domains is data parallel, certain operations require costly serialization that increase the run time. Examples include superposition type updates in scientific computing and histogram computations in media processing. We introduce scatter-add, which is the data-parallel form of the well-known scalar fetch-and-op, specifically tuned for SIMD/vector/stream style memory systems. The scatter-add mechanism scatters a set of data values to a set of memory addresses and adds each data value to each referenced memory location instead of overwriting it. This novel architecture extension allows us to efficiently support data-parallel atomic update computations found in parallel programming languages such as HPF, and applies both to single-processor and multiprocessor SIMD data-parallel systems. We detail the microarchitecture of a scatter-add implementation on a stream architecture, which requires less than 2% increase in die area yet shows performance speedups ranging from 1.45 to over 11 on a set of applications that require a scatter-add computation.
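The semantics of scatter-add can be stated with a histogram example. The reference loop below (plain sequential C) is what the mechanism computes; the contribution of the paper is a memory-system implementation that performs these read-modify-write updates in data-parallel fashion, so duplicate indices accumulate correctly instead of overwriting one another as they would with an ordinary vector scatter.

    /* Reference semantics of scatter-add: each element adds its value into the
     * location it maps to.  Sequential C shown for clarity; the hardware
     * mechanism described above performs the atomic updates in parallel. */
    void scatter_add(float *hist, const int *bin, const float *val, int n)
    {
        for (int i = 0; i < n; ++i)
            hist[bin[i]] += val[i];   /* read-modify-write per element */
    }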

Proceedings ArticleDOI
14 Jun 2005
TL;DR: MADbench is presented, a lightweight version of the MADCAP CMB power spectrum estimation code that retains the operational complexity and integrated system requirements, and the integrated performance monitoring (IPM) package is introduced: a portable, lightweight, and scalable tool for effectively extracting MPI message-passing overheads.
Abstract: The cosmic microwave background (CMB) is an exquisitely sensitive probe of the fundamental parameters of cosmology. Extracting this information is computationally intensive, requiring massively parallel computing and sophisticated numerical algorithms. In this work we present MADbench, a lightweight version of the MADCAP CMB power spectrum estimation code that retains the operational complexity and integrated system requirements. In addition, to quantify communication behavior across a variety of architectural platforms, we introduce the integrated performance monitoring (IPM) package: a portable, lightweight, and scalable tool for effectively extracting MPI message-passing overheads. A performance characterization study is conducted on some of the world's most powerful supercomputers, including the superscalar Seaborg (IBM Power3+) and CC-NUMA Columbia (SGI Altix), as well as the vector-based Earth Simulator (NEC SX-6 enhanced) and Phoenix (Cray X1) systems. In-depth analysis shows that in order to bridge the gap between theoretical and sustained system performance, it is critical to gain a clear understanding of how the distinct parts of large-scale parallel applications interact with the individual subcomponents of HEC platforms.

Journal Article
TL;DR: In this article, the Integrated Performance Monitoring (IPM) package, a portable, lightweight, and scalable tool for effectively extracting MPI message-passing overheads, is introduced to quantify communication behavior across a variety of architectural platforms.
Abstract: The Cosmic Microwave Background (CMB) is an exquisitely sensitive probe of the fundamental parameters of cosmology. Extracting this information is computationally intensive, requiring massively parallel computing and sophisticated numerical algorithms. In this work we present MADbench, a lightweight version of the MADCAP CMB power spectrum estimation code that retains the operational complexity and integrated system requirements. In addition, to quantify communication behavior across a variety of architectural platforms, we introduce the Integrated Performance Monitoring (IPM) package: a portable, lightweight, and scalable tool for effectively extracting MPI message-passing overheads. A performance characterization study is conducted on some of the world's most powerful supercomputers, including the superscalar Seaborg (IBM Power3+) and CC-NUMA Columbia (SGI Altix), as well as the vector-based Earth Simulator (NEC SX-6 enhanced) and Phoenix (Cray X1) systems. In-depth analysis shows that in order to bridge the gap between theoretical and sustained system performance, it is critical to gain a clear understanding of how the distinct parts of large-scale parallel applications interact with the individual subcomponents of HEC platforms.

Journal ArticleDOI
TL;DR: The design of a dual-issue single-instruction, multiple-data-like (SIMD-like) extension of the IBM PowerPC® 440 floating-point unit (FPU) core and the compiler and algorithmic techniques to exploit it are described, and measurements show that the combination of algorithm, compiler, and hardware delivers a significant fraction of peak floating-point performance for compute-bound kernels, such as matrix multiplication.
Abstract: We describe the design of a dual-issue single-instruction, multiple-data-like (SIMD-like) extension of the IBM PowerPC® 440 floating-point unit (FPU) core and the compiler and algorithmic techniques to exploit it. This extended FPU is targeted at both the IBM massively parallel Blue Gene®/L machine and the more pervasive embedded platforms. We discuss the hardware and software codesign that was essential in order to fully realize the performance benefits of the FPU when constrained by the memory bandwidth limitations and high penalties for misaligned data access imposed by the memory hierarchy on a Blue Gene/L node. Using both hand-optimized and compiled code for key linear algebraic kernels, we validate the architectural design choices, evaluate the success of the compiler, and quantify the effectiveness of the novel algorithm design techniques. Our measurements show that the combination of algorithm, compiler, and hardware delivers a significant fraction of peak floating-point performance for compute-bound kernels, such as matrix multiplication, and delivers a significant fraction of peak memory bandwidth for memory-bound kernels, such as DAXPY, while remaining largely insensitive to data alignment.
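For reference, DAXPY, the memory-bound kernel cited above, is just the loop below. The plain C form is generic rather than the paper's tuned code; on a Blue Gene/L node its speed is set by memory bandwidth and by whether x and y are aligned so the SIMD-like FPU can issue paired loads and stores.

    /* DAXPY: y := a*x + y.  Streams 2n doubles in and n doubles out, so it is
     * bandwidth-bound; generic C, not the hand-optimized Blue Gene/L kernel. */
    void daxpy(int n, double a, const double *x, double *y)
    {
        for (int i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];
    }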

Book ChapterDOI
Dong-Sun Kim, Hyun-Sik Kim, Hongsik Kim, Gunhee Han, Duck-Jin Chung
30 May 2005
TL;DR: This paper proposes a high-performance neural network processor whose function can be changed by programming; it is based on a SIMD architecture optimized for neural network and image processing.
Abstract: Artificial Neural Networks (ANNs) and image processing require massively parallel computation of simple operators accompanied by heavy memory access. This type of operation therefore maps naturally onto Single Instruction Multiple Data (SIMD) stream parallel processing with distributed memory. This paper proposes a high-performance neural network processor whose function can be changed by programming. The proposed processor is based on a SIMD architecture that is optimized for neural network and image processing. The proposed processor supports 24 instructions and consists of 16 Processing Units (PUs) per chip. Each PU includes a 24-bit 2K-word Local Memory (LM) and a Processing Element (PE). The proposed architecture allows multichip expansion that minimizes the chip-to-chip communication bottleneck. The proposed processor is verified with an FPGA implementation, and its functionality is demonstrated with a character recognition application.

Book ChapterDOI
11 Sep 2005
TL;DR: A parallel implementation of an interior point method, using object-oriented programming techniques and exploiting different block structures of matrices, outperforms the industry-standard optimizer, shows very good parallel efficiency on a massively parallel architecture, and solves problems of unprecedented sizes reaching 10^9 variables.
Abstract: Solution methods for very large scale optimization problems are addressed in this paper. Interior point methods are demonstrated to provide unequalled efficiency in this context. They need a small (and predictable) number of iterations to solve a problem. A single iteration of an interior point method requires the solution of an indefinite system of equations. This system is regularized to guarantee the existence of a triangular decomposition. Hence the well-understood parallel computing techniques developed for positive definite matrices can be extended to this class of indefinite matrices. A parallel implementation of an interior point method is described in this paper. It uses object-oriented programming techniques and allows for exploiting different block structures of matrices. Our implementation outperforms the industry-standard optimizer, shows very good parallel efficiency on a massively parallel architecture, and solves problems of unprecedented sizes reaching 10^9 variables.
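To indicate what the regularization accomplishes, a generic primal-dual regularization of the interior point augmented system (a sketch of the standard technique, not necessarily the exact scheme used in this work) perturbs both diagonal blocks:

    \begin{bmatrix} -(Q + \Theta^{-1} + R_p) & A^{T} \\ A & R_d \end{bmatrix}
    \begin{bmatrix} \Delta x \\ \Delta y \end{bmatrix}
    =
    \begin{bmatrix} r_1 \\ r_2 \end{bmatrix},
    \qquad R_p \succ 0,\; R_d \succ 0 \ \text{(small, diagonal)},

where \Theta is the diagonal barrier scaling and Q = 0 for linear programs. With both regularization terms positive definite the matrix is symmetric quasidefinite, so a triangular LDL^T factorization with 1-by-1 pivots exists for any symmetric ordering, which is what allows parallel factorization techniques developed for positive definite matrices to be reused.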

Journal ArticleDOI
TL;DR: A new load balanced parallel implementation of a non-adaptive version of Greengard and Rokhlin's fast multipole method for distributed memory architectures with focus on applications in molecular dynamics is presented.

Proceedings ArticleDOI
04 May 2005
TL;DR: A two-state reversible cellular automaton (RCA) is described; it is shown to be capable of universal computation, and evidence is offered that it is also capable of universal construction.
Abstract: A novel two-state reversible cellular automaton (RCA) is described. This three-dimensional RCA is shown to be capable of universal computation. Additionally, evidence is offered that this RCA is capable of universal construction.

Proceedings ArticleDOI
12 Nov 2005
TL;DR: This effort represents the first time that a high-order variable-density incompressible flow solver with species diffusion has demonstrated sustained performance in the TeraFLOPS range.
Abstract: We describe Miranda, a massively parallel spectral/compact solver for variable-density incompressible flow, including viscosity and species diffusivity effects. Miranda utilizes FFTs and band-diagonal matrix solvers to compute spatial derivatives to at least 10th-order accuracy. We have successfully ported this communication-intensive application to BlueGene/L and have explored both direct block parallel and transpose-based parallelization strategies for its implicit solvers. We have discovered a mapping strategy which results in virtually perfect scaling of the transpose method up to 65,536 processors of the BlueGene/L machine. Sustained global communication rates in Miranda typically run at 85% of the theoretical peak speed of the BlueGene/L torus network, while sustained communication plus computation speeds reach 2.76 TeraFLOPS. This effort represents the first time that a high-order variable-density incompressible flow solver with species diffusion has demonstrated sustained performance in the TeraFLOPS range.
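The transpose-based strategy amounts to an all-to-all exchange that re-partitions the data so the dimension being differentiated becomes local to each processor before the FFT or band-diagonal solve is applied. A minimal sketch of such a redistribution step in generic MPI (not Miranda's actual code) is:

    /* Transpose step for a pencil decomposition: each of the p ranks in comm
     * sends one contiguous block of block_elems doubles to every other rank,
     * so the dimension to be differentiated ends up entirely local.  A real
     * solver also reorders elements into the new local layout afterwards. */
    #include <mpi.h>

    void transpose_blocks(const double *sendbuf, double *recvbuf,
                          int block_elems, MPI_Comm comm)
    {
        MPI_Alltoall((void *) sendbuf, block_elems, MPI_DOUBLE,
                     recvbuf,          block_elems, MPI_DOUBLE, comm);
    }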

Book ChapterDOI
TL;DR: A model of visual exploration of a scene by means of localized computations in neural populations, whose architecture allows the emergence of a coherent behaviour of sequential scanning of salient stimuli, is proposed.
Abstract: Although biomimetic autonomous robotics relies on the massively parallel architecture of the brain, the key issue is to temporally organize behaviour. The distributed representation of the sensory information has to be coherently processed to generate relevant actions. In the visual domain, we propose here a model of visual exploration of a scene by means of localized computations in neural populations whose architecture allows the emergence of a coherent behaviour of sequential scanning of salient stimuli. It has been implemented on a real robotic platform exploring a moving and noisy scene including several identical targets.

Journal ArticleDOI
TL;DR: A review of recent advances in simulations of magnetically confined plasmas is presented in this article, with illustrative examples, chosen from associated research areas such as microturbulence, magnetohydrodynamics and other topics.
Abstract: Scientific simulation, which provides a natural bridge between theory and experiment, is an essential tool for understanding complex plasma behaviour. Recent advances in simulations of magnetically confined plasmas are reviewed in this paper, with illustrative examples, chosen from associated research areas such as microturbulence, magnetohydrodynamics and other topics. Progress has been stimulated, in particular, by the exponential growth of computer speed along with significant improvements in computer technology. The advances in both particle and fluid simulations of fine-scale turbulence and large-scale dynamics have produced increasingly good agreement between experimental observations and computational modelling. This was enabled by two key factors: (a) innovative advances in analytic and computational methods for developing reduced descriptions of physics phenomena spanning widely disparate temporal and spatial scales and (b) access to powerful new computational resources. Excellent progress has been made in developing codes for which computer run-time and problem-size scale well with the number of processors on massively parallel processors (MPPs). Examples include the effective usage of the full power of multi-teraflop (multi-trillion floating point computations per second) MPPs to produce three-dimensional, general geometry, nonlinear particle simulations that have accelerated advances in understanding the nature of turbulence self-regulation by zonal flows. These calculations, which typically utilized billions of particles for thousands of time-steps, would not have been possible without access to powerful present generation MPP computers and the associated diagnostic and visualization capabilities. In looking towards the future, the current results from advanced simulations provide great encouragement for being able to include increasingly realistic dynamics to enable deeper physics insights into plasmas in both natural and laboratory environments. This should produce the scientific excitement which will help to (a) stimulate enhanced cross-cutting collaborations with other fields and (b) attract the bright young talent needed for the future health of the field of plasma science.

Journal ArticleDOI
TL;DR: The DL_POLY package provides a set of classical molecular dynamics programs that have application over a wide range of atomic and molecular systems, stretching from small systems consisting of a few hundred atoms running on a single processor to systems running on massively parallel computers with thousands of processors.
Abstract: The DL_POLY package provides a set of classical molecular dynamics programs that have application over a wide range of atomic and molecular systems. Written for parallel computers they offer capabilities stretching from small systems consisting of a few hundred atoms running on a single processor, up to systems of several million atoms running on massively parallel computers with thousands of processors. In this article we describe the structure of the programs and some applications.
