
Showing papers in "Scientific Programming in 1995"


Journal ArticleDOI
TL;DR: A study comparing three EEG representations - the unprocessed signals, a reduced-dimensional representation using the Karhunen-Loève transform, and a frequency-based representation - finds the best classification accuracy on untrained samples is 73% using the frequency-based representation.
Abstract: EEG analysis has played a key role in the modeling of the brain's cortical dynamics, but relatively little effort has been devoted to developing EEG as a limited means of communication. If several mental states can be reliably distinguished by recognizing patterns in EEG, then a paralyzed person could communicate to a device such as a wheelchair by composing sequences of these mental states. EEG pattern recognition is a difficult problem and hinges on the success of finding representations of the EEG signals in which the patterns can be distinguished. In this article, we report on a study comparing three EEG representations: the unprocessed signals, a reduced-dimensional representation using the Karhunen-Loève transform, and a frequency-based representation. Classification is performed with a two-layer neural network implemented on a CNAPS server (128-processor, SIMD architecture) by Adaptive Solutions, Inc. Execution time comparisons show over a hundredfold speedup over a Sun Sparc 10. The best classification accuracy on untrained samples is 73% using the frequency-based representation.
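
The abstract does not spell out the preprocessing pipeline, but the two reduced representations it names are standard; the sketch below (Python/NumPy) shows one plausible form of each - a Karhunen-Loève (principal component) projection and a band-power frequency representation. The channel count, sampling rate, and band edges are illustrative assumptions, not the values used in the study.

    # Minimal sketch of the two reduced EEG representations named in the abstract.
    # Window length, channel count, sampling rate, and frequency bands are
    # illustrative assumptions, not the values used in the study.
    import numpy as np

    def karhunen_loeve(windows, k=8):
        """Project EEG windows onto the top-k principal components (KL transform)."""
        X = windows.reshape(len(windows), -1)          # flatten each window
        Xc = X - X.mean(axis=0)                        # center
        _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
        return Xc @ Vt[:k].T                           # reduced-dimensional features

    def band_power(window, fs=250.0, bands=((4, 8), (8, 13), (13, 30))):
        """Frequency-based representation: mean spectral power per band per channel."""
        spec = np.abs(np.fft.rfft(window, axis=-1)) ** 2
        freqs = np.fft.rfftfreq(window.shape[-1], d=1.0 / fs)
        return np.array([spec[..., (freqs >= lo) & (freqs < hi)].mean(axis=-1)
                         for lo, hi in bands]).ravel()

    # Example: 100 windows of 6 channels x 500 samples of synthetic EEG.
    windows = np.random.randn(100, 6, 500)
    kl_feats = karhunen_loeve(windows)                        # shape (100, 8)
    freq_feats = np.stack([band_power(w) for w in windows])   # shape (100, 18)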

91 citations


Journal ArticleDOI
TL;DR: This article analyzes the behavior of the cache when data are accessed at a constant stride, and a simple formula is presented that accurately gives the cache efficiency for various cache parameters and data strides.
Abstract: An important issue in obtaining high performance on a scientific application running on a cache-based computer system is the behavior of the cache when data are accessed at a constant stride. Others who have discussed this issue have noted an odd phenomenon in such situations: A few particular innocent-looking strides result in sharply reduced cache efficiency. In this article, this problem is analyzed, and a simple formula is presented that accurately gives the cache efficiency for various cache parameters and data strides.
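
The article's contribution is a closed-form formula; purely as an illustration of the phenomenon it explains, the following Python sketch simulates a direct-mapped cache and measures the hit rate of repeated sweeps over a strided vector. The cache geometry here (64-byte lines, 1,024 lines, 8-byte elements) is an assumption, not the configuration analyzed in the article, but it reproduces the sharp drop at certain power-of-two strides.

    # Illustration only: simulate an assumed direct-mapped cache and measure the
    # hit rate of repeated sweeps over a vector accessed at a constant stride.
    # Certain strides map the whole working set onto only a few cache sets.
    def hit_rate(stride_elems, n_elems=256, sweeps=20,
                 line_bytes=64, n_lines=1024, elem_bytes=8):
        tags = [None] * n_lines              # one resident line tag per cache set
        hits = total = 0
        for _ in range(sweeps):
            for i in range(n_elems):
                line = (i * stride_elems * elem_bytes) // line_bytes
                if tags[line % n_lines] == line:
                    hits += 1
                else:
                    tags[line % n_lines] = line
                total += 1
        return hits / total

    for s in (1, 8, 511, 512, 1023, 1024):
        print(f"stride {s:5d} elements: hit rate {hit_rate(s):.2f}")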

34 citations


Journal ArticleDOI
TL;DR: This article provides a tutorial introduction to the main features of HPF, an informal standard for extensions to Fortran 90 to assist its implementation on parallel architectures, particularly for data-parallel computation.
Abstract: High Performance Fortran (HPF) is an informal standard for extensions to Fortran 90 to assist its implementation on parallel architectures, particularly for data-parallel computation. Among other things, it includes directives for specifying data distribution across multiple memories, and concurrent execution features. This article provides a tutorial introduction to the main features of HPF.
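
In HPF the distribution is expressed with directives such as !HPF$ DISTRIBUTE A(BLOCK); as a language-neutral illustration of what a BLOCK distribution means, the following Python sketch computes which abstract processor owns each element. The array length and processor count are arbitrary.

    # What an HPF-style BLOCK distribution means: element i of an n-element array
    # lives on the processor that owns the contiguous block containing i.
    def block_owner(i, n, p):
        block = -(-n // p)                   # ceiling(n / p) elements per processor
        return i // block

    n, p = 10, 4
    print([block_owner(i, n, p) for i in range(n)])   # [0, 0, 0, 1, 1, 1, 2, 2, 2, 3]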

27 citations


Journal ArticleDOI
TL;DR: A modified formulation of Strassen's matrix multiplication algorithm is presented in which the working storage requirement is reduced to O(4^n) and the modified formulation exhibits sufficient parallelism for efficient implementation on a shared memory multiprocessor.
Abstract: In this article, we present a program generation strategy of Strassen's matrix multiplication algorithm using a programming methodology based on tensor product formulas. In this methodology, block recursive programs such as the fast Fourier Transforms and Strassen's matrix multiplication algorithm are expressed as algebraic formulas involving tensor products and other matrix operations. Such formulas can be systematically translated to high-performance parallel/vector codes for various architectures. In this article, we present a nonrecursive implementation of Strassen's algorithm for shared memory vector processors such as the Cray Y-MP. A previous implementation of Strassen's algorithm synthesized from tensor product formulas required working storage of size O(7^n) for multiplying 2^n × 2^n matrices. We present a modified formulation in which the working storage requirement is reduced to O(4^n). The modified formulation exhibits sufficient parallelism for efficient implementation on a shared memory multiprocessor. Performance results on a Cray Y-MP8/64 are presented.
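
For reference, the seven-product step that underlies both the O(7^n) and O(4^n) storage figures is the classical Strassen recursion; the Python sketch below shows one level of it. It illustrates the algorithm only - the article's tensor product formulation, nonrecursive code generation, and storage management are not reproduced here.

    # One level of Strassen's algorithm for matrices of even order, using NumPy.
    # Shows the seven-product step only.
    import numpy as np

    def strassen_step(A, B):
        m = A.shape[0] // 2
        A11, A12, A21, A22 = A[:m, :m], A[:m, m:], A[m:, :m], A[m:, m:]
        B11, B12, B21, B22 = B[:m, :m], B[:m, m:], B[m:, :m], B[m:, m:]
        P1 = (A11 + A22) @ (B11 + B22)
        P2 = (A21 + A22) @ B11
        P3 = A11 @ (B12 - B22)
        P4 = A22 @ (B21 - B11)
        P5 = (A11 + A12) @ B22
        P6 = (A21 - A11) @ (B11 + B12)
        P7 = (A12 - A22) @ (B21 + B22)
        C = np.empty_like(A)
        C[:m, :m] = P1 + P4 - P5 + P7
        C[:m, m:] = P3 + P5
        C[m:, :m] = P2 + P4
        C[m:, m:] = P1 - P2 + P3 + P6
        return C

    A, B = np.random.rand(4, 4), np.random.rand(4, 4)
    assert np.allclose(strassen_step(A, B), A @ B)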

21 citations


Journal ArticleDOI
TL;DR: ObjectMath can increase productivity and quality, thus enabling users to solve problems that are too complex to handle with traditional tools, especially in application areas such as machine elements analysis, where complex nonlinear problems are the norm.
Abstract: ObjectMath is a language for scientific computing that integrates object-oriented constructs with features for symbolic and numerical computation. Using ObjectMath, complex mathematical models may be implemented in a natural way. The ObjectMath programming environment provides tools for generating efficient numerical code from such models. Symbolic computation is used to rewrite and simplify equations before code is generated. One novelty of the ObjectMath approach is that it provides a common language and an integrated environment for this kind of mixed symbolic/numerical computation. The motivation for this work is the current low-level state of the art in programming for scientific computing. Much numerical software is still being developed the traditional way in Fortran. This is especially true in application areas such as machine elements analysis, where complex nonlinear problems are the norm. We believe that tools like ObjectMath can increase productivity and quality, thus enabling users to solve problems that are too complex to handle with traditional tools.

19 citations


Journal ArticleDOI
TL;DR: This article presents a survey of language features for distributed memory multiprocessor systems (DMMs), in particular, systems that provide features for data partitioning and distribution.
Abstract: This article presents a survey of language features for distributed memory multiprocessor systems (DMMs), in particular, systems that provide features for data partitioning and distribution. In these systems the programmer is freed from consideration of the low-level details of the target architecture in that there is no need to program explicit processes or specify interprocess communication. Programs are written according to the shared memory programming paradigm but the programmer is required to specify, by means of directives, additional syntax or interactive methods, how the data of the program are decomposed and distributed.
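
Alongside the simple BLOCK mapping, the surveyed systems typically also offer CYCLIC and block-cyclic distributions; the minimal Python sketch below gives the block-cyclic ownership rule. The block size and processor count are arbitrary, and no particular surveyed language is implied.

    # Block-cyclic distribution: blocks of b consecutive elements are dealt out to
    # p processors round-robin, trading locality against load balance.
    def block_cyclic_owner(i, b, p):
        return (i // b) % p

    b, p = 2, 3
    print([block_cyclic_owner(i, b, p) for i in range(12)])
    # [0, 0, 1, 1, 2, 2, 0, 0, 1, 1, 2, 2]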

15 citations


Journal ArticleDOI
TL;DR: The modifications needed to achieve a data-parallel version of this model without explicit message passing are outlined and the achieved performance of different numerical solution methods within this model is presented and compared.
Abstract: In this article we describe the implementation of a numerical weather forecast model on a massively parallel computer system. This model is a production code used for routine weather forecasting at the meteorological institutes of several European countries. The modifications needed to achieve a data-parallel version of this model without explicit message passing are outlined. The achieved performance of different numerical solution methods within this model is presented and compared.

11 citations


Journal ArticleDOI
TL;DR: A self-similar coding style will accomplish what a vectorizable style has accomplished for vector machines by allowing the construction of robust, user-friendly, automatic translation systems that increase programmer productivity and generate fast, efficient code for MPPs.
Abstract: Massively parallel processors (MPPs) hold the promise of extremely high performance that, if realized, could be used to study problems of unprecedented size and complexity. One of the primary stumbling blocks to this promise has been the lack of tools to translate application codes to MPP form. In this article we show how application codes written in a subset of Fortran 77, called Fortran-P, can be translated to achieve good performance on several massively parallel machines. This subset can express codes that are self-similar, where the algorithm applied to the global data domain is also applied to each subdomain. We have found many codes that match the Fortran-P programming style and have converted them using our tools. We believe a self-similar coding style will accomplish what a vectorizable style has accomplished for vector machines by allowing the construction of robust, user-friendly, automatic translation systems that increase programmer productivity and generate fast, efficient code for MPPs.
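
The self-similar property - the same kernel applies to the global domain and, unchanged, to each subdomain - can be made concrete with a small sketch. The Jacobi-style stencil and the two-way decomposition below are illustrative assumptions, not the Fortran-P tool chain itself.

    # Illustration of self-similarity: the same relaxation kernel is applied to the
    # global grid and, unchanged, to each halo-padded subdomain.
    import numpy as np

    def relax(u):
        """One Jacobi-style sweep on the interior of a 2-D grid; the kernel is the
        same whether u is the global domain or a subdomain with a one-cell halo."""
        v = u.copy()
        v[1:-1, 1:-1] = 0.25 * (u[:-2, 1:-1] + u[2:, 1:-1] +
                                u[1:-1, :-2] + u[1:-1, 2:])
        return v

    global_grid = np.random.rand(16, 16)
    whole = relax(global_grid)

    # Split into two subdomains with one-row halos, apply the identical kernel,
    # and reassemble: the interior result matches the global computation.
    top, bottom = global_grid[:9, :], global_grid[7:, :]
    stitched = np.vstack([relax(top)[:8, :], relax(bottom)[1:, :]])
    assert np.allclose(stitched[1:-1, 1:-1], whole[1:-1, 1:-1])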

8 citations


Journal ArticleDOI
TL;DR: This article describes many of the issues in developing an efficient interface for communication on distributed memory machines, and how changing the interface to match the hardware more closely allows applications with fine-grained communication to run on these machines.
Abstract: This article describes many of the issues in developing an efficient interface for communication on distributed memory machines. Although the hardware component of message latency is less than 1 μs on many distributed memory machines, the software latency associated with sending and receiving typed messages is on the order of 50 μs. The reason for this imbalance is that the software interface does not match the hardware. By changing the interface to match the hardware more closely, applications with fine grained communication can be put on these machines. This article describes several tests performed and many of the issues involved in supporting low latency messages on distributed memory machines.
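
A back-of-the-envelope model makes the imbalance concrete: with the roughly 1 μs of hardware latency and 50 μs of software latency quoted in the abstract (the per-byte transfer rate below is an assumption), small messages are dominated almost entirely by software overhead.

    # Back-of-the-envelope message cost model: fixed per-message overheads plus a
    # per-byte transfer term. Overheads follow the abstract; the 10 ns/byte rate
    # is an assumption.
    def message_time_us(nbytes, software_overhead_us=50.0,
                        hardware_latency_us=1.0, per_byte_us=0.01):
        return software_overhead_us + hardware_latency_us + per_byte_us * nbytes

    for n in (8, 64, 1024, 65536):
        t = message_time_us(n)
        print(f"{n:6d} bytes: {t:8.1f} us, effective bandwidth {n / t:8.1f} MB/s")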

7 citations


Journal ArticleDOI
TL;DR: Detailed algorithms for all-to-all broadcast and reduction are given for arrays mapped by binary or binary-reflected Gray code encoding to the processing nodes of binary cube networks, together with algorithms for locally computing the array indices of the communicated data, which reduces the demand for communications bandwidth.
Abstract: Detailed algorithms for all-to-all broadcast and reduction are given for arrays mapped by binary or binary-reflected Gray code encoding to the processing nodes of binary cube networks. Algorithms are also given for the local computation of the array indices for the communicated data, thereby reducing the demand for the communications bandwidth. For the Connection Machine system CM-200, Hamiltonian cycle-based all-to-all communication algorithms yield a performance that is a factor of 2 to 10 higher than the performance offered by algorithms based on trees, butterfly networks, or the Connection Machine router. The peak data rate achieved for all-to-all broadcast on a 2,048-node Connection Machine system CM-200 is 5.4 Gbyte/s. The index order of the data in local memory depends on implementation details of the algorithms, but it is well defined. If a linear ordering is desired, then including the time for local data reordering reduces the effective peak data rate to 2.5 Gbyte/s.
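
The binary-reflected Gray code mapping that these algorithms assume is simple to state; the Python sketch below gives the encoding and its inverse and checks the neighbor property on which the Hamiltonian cycle schedules rely. The broadcast and reduction schedules themselves are not reproduced.

    # Binary-reflected Gray code used to map array coordinates to hypercube nodes:
    # consecutive array indices differ in exactly one bit, i.e., they sit on
    # neighboring nodes of the binary cube.
    def gray_encode(i):
        return i ^ (i >> 1)

    def gray_decode(g):
        i = 0
        while g:
            i ^= g
            g >>= 1
        return i

    codes = [gray_encode(i) for i in range(8)]
    print([f"{c:03b}" for c in codes])     # 000 001 011 010 110 111 101 100
    assert all(gray_decode(gray_encode(i)) == i for i in range(1 << 12))
    assert all(bin(codes[i] ^ codes[i + 1]).count("1") == 1 for i in range(7))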

5 citations


Journal ArticleDOI
TL;DR: The University of Southampton has developed the Graphical Benchmark Information Service (GBIS) on the World Wide Web to display interactively graphs of user-selected benchmark results from the GENESIS and PARKBENCH benchmark suites.
Abstract: Unlike single-processor benchmarks, multiprocessor benchmarks can yield tens of numbers for each benchmark on each computer, as factors such as the number of processors and problem size are varied. A graphical display of performance surfaces therefore provides a satisfactory way of comparing results. The University of Southampton has developed the Graphical Benchmark Information Service (GBIS) on the World Wide Web (WWW) to display interactively graphs of user-selected benchmark results from the GENESIS and PARKBENCH benchmark suites.

Journal ArticleDOI
TL;DR: A set of well-characterized Fortran benchmarks spanning a range of computational characteristics was used for the study and the data from the 590 system are compared with those from a single-processor CRAY C90 system as well as with other microprocessor-based systems.
Abstract: The results of benchmark tests on the superscalar IBM RISC System/6000 Model 590 are presented. A set of well-characterized Fortran benchmarks spanning a range of computational characteristics was used for the study. The data from the 590 system are compared with those from a single-processor CRAY C90 system as well as with other microprocessor-based systems, such as the Digital Equipment Corporation AXP 3000/500X and the Hewlett-Packard HP/735.

Journal ArticleDOI
TL;DR: This work describes the implementation of the numerical scheme, and presents experimental results which demonstrate that a problem requiring 600,000 mesh points and 6,000 time steps can be solved in under 8 hours using 32 processors.
Abstract: Flows in estuarial and coastal regions may be described by the shallow-water equations. The processes of pollution transport, sediment transport, and plume dispersion are driven by the underlying hydrodynamics. Accurate resolution of these processes requires a three-dimensional formulation with turbulence modeling, which is very demanding computationally. A numerical scheme has been developed which is both stable and accurate - we show that this scheme is also well suited to parallel processing, making the solution of massive complex problems a practical computing possibility. We describe the implementation of the numerical scheme on a Kendall Square Research KSR-1 multiprocessor, and present experimental results which demonstrate that a problem requiring 600,000 mesh points and 6,000 time steps can be solved in under 8 hours using 32 processors.
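
For reference, the depth-averaged (two-dimensional) shallow-water equations, in standard notation with free-surface elevation η, water depth h, velocities u and v, gravity g, and Coriolis parameter f, and omitting friction and viscous terms, read as follows; the article itself solves a more demanding three-dimensional formulation with turbulence modeling.

    \begin{aligned}
    \frac{\partial \eta}{\partial t} + \frac{\partial (hu)}{\partial x} + \frac{\partial (hv)}{\partial y} &= 0,\\
    \frac{\partial u}{\partial t} + u\frac{\partial u}{\partial x} + v\frac{\partial u}{\partial y} - fv &= -g\frac{\partial \eta}{\partial x},\\
    \frac{\partial v}{\partial t} + u\frac{\partial v}{\partial x} + v\frac{\partial v}{\partial y} + fu &= -g\frac{\partial \eta}{\partial y}.
    \end{aligned}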

Journal ArticleDOI
TL;DR: Experimental studies on benchmark programs concerning scientific computing show that most communication patterns in application programs are predictable at compile-time, and an execution model is proposed that utilizes this knowledge such that predictable communications are directly compiled and dynamic communications are emulated by scheduling an appropriate set of compiled communications.
Abstract: On most massively parallel architectures, the actual communication performance remains much less than the hardware capabilities. The main reason for this difference lies in the dynamic routing, because the software mechanisms for managing the routing represent a large overhead. This article presents experimental studies on benchmark programs concerning scientific computing; the results show that most communication patterns in application programs are predictable at compile-time. An execution model is proposed that utilizes this knowledge such that predictable communications are directly compiled and dynamic communications are emulated by scheduling an appropriate set of compiled communications. The performance of the model is evaluated, showing that performance is better in static cases and gracefully degrades with the growing complexity and dynamic aspect of the communication patterns.

Journal ArticleDOI
TL;DR: The porting and optimization of an explicit, time-dependent, computational fluid dynamics code on an 8,192-node MasPar MP-1 is described, and the performance of the code is slightly better than on a CRAY Y-MP for a functionally equivalent, optimized two-dimensional code.
Abstract: This article describes the porting and optimization of an explicit, time-dependent, computational fluid dynamics code on an 8,192-node MasPar MP-1. The MasPar is a very fine-grained, single instruction, multiple data parallel computer. The code uses the flux-corrected transport algorithm. We describe the techniques used to port and optimize the code, and the behavior of a test problem. The test problem used to benchmark the flux-corrected transport code on the MasPar was a two-dimensional exploding shock with periodic boundary conditions. We discuss the performance that our code achieved on the MasPar, and compare its performance on the MasPar with its performance on other architectures. The comparisons show that the performance of the code on the MasPar is slightly better than on a CRAY Y-MP for a functionally equivalent, optimized two-dimensional code.
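
The flux-corrected transport idea - advance with a monotone low-order flux, then add back as much of the high-order antidiffusive flux as the solution can accept without creating new extrema - is easiest to see in one dimension. The Python sketch below is a minimal 1-D FCT step for linear advection on a periodic grid with a Zalesak-style limiter; it illustrates the method only, not the article's two-dimensional MasPar code.

    # Minimal 1-D flux-corrected transport step for u_t + a u_x = 0 (a > 0),
    # periodic boundaries. anti[i] is the flux through the interface between
    # cells i and i+1.
    import numpy as np

    def fct_advect(u, a, dt, dx):
        lam = dt / dx
        c = a * lam                                   # Courant number
        up1 = np.roll(u, -1)                          # u[i+1] (periodic)
        f_low = a * u                                 # upwind (monotone) flux
        f_high = 0.5 * a * (u + up1) - 0.5 * c * a * (up1 - u)   # Lax-Wendroff flux
        anti = f_high - f_low                         # antidiffusive flux

        # Low-order ("transported-diffused") solution.
        utd = u - c * (u - np.roll(u, 1))

        # Zalesak-style limiter: the corrected solution may not exceed the local
        # extrema of the low-order solution.
        umax = np.maximum(np.maximum(np.roll(utd, 1), utd), np.roll(utd, -1))
        umin = np.minimum(np.minimum(np.roll(utd, 1), utd), np.roll(utd, -1))
        anti_left = np.roll(anti, 1)                  # flux through interface i-1/2
        p_plus = lam * (np.maximum(anti_left, 0) - np.minimum(anti, 0))
        p_minus = lam * (np.maximum(anti, 0) - np.minimum(anti_left, 0))
        r_plus = np.where(p_plus > 0,
                          np.minimum(1.0, (umax - utd) / np.where(p_plus > 0, p_plus, 1.0)),
                          0.0)
        r_minus = np.where(p_minus > 0,
                           np.minimum(1.0, (utd - umin) / np.where(p_minus > 0, p_minus, 1.0)),
                           0.0)
        limiter = np.where(anti >= 0,
                           np.minimum(np.roll(r_plus, -1), r_minus),
                           np.minimum(r_plus, np.roll(r_minus, -1)))
        anti = limiter * anti
        return utd - lam * (anti - np.roll(anti, 1))

    # Advect a square pulse once around a periodic domain; the profile stays sharp
    # and bounded (no new maxima or minima are created).
    n = 200
    a, dx = 1.0, 1.0 / 200
    dt = 0.4 * dx / a
    u = np.where((np.arange(n) > 40) & (np.arange(n) < 80), 1.0, 0.0)
    for _ in range(500):                              # 500 * a * dt = one period
        u = fct_advect(u, a, dt, dx)
    print(f"min {u.min():.2e}, max {u.max():.6f}")    # stays within [0, 1]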


Journal ArticleDOI
TL;DR: This article details how parallel computing techniques on a KSR-1 deliver, in near real time, the boundary-layer transition predictions needed to design laminar flow control systems for transport aircraft.
Abstract: The performance of transport aircraft can be considerably improved if the process by which the wing boundary layer becomes turbulent can be controlled and extensive areas of laminar flow maintained. In order to design laminar flow control systems, it is necessary to be able to predict the movement of the transition location in response to changes in control variables, e.g., surface suction. At present, the technique which is available to industry requires excessively long computational time - so long that it is not suitable for use in the "design process." Therefore, there is a clear need to produce a system which delivers results in near realtime, i.e., in seconds rather than hours. This article details how parallel computing techniques on a KSR-1 produce these performance improvements.


Journal ArticleDOI
TL;DR: New parallel algorithms for solving the problem of many body interactions in molecular dynamics (MD) using two parallelization methods are presented and demonstrated that they exploit parallelism effectively and can be used to simulate large crystals.
Abstract: We present new parallel algorithms for solving the problem of many body interactions in molecular dynamics (MD). Such algorithms are essential in the simulation of irradiation effects in crystals, where the high energy of the impinging particles dictates computing with large numbers of atoms and for many time cycles. We realized the algorithms using two parallelization methods and compared their performance. Experimental results obtained on a Meiko machine demonstrate that the new algorithms exploit parallelism effectively and can be used to simulate large crystals.
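
The abstract does not identify the two parallelization methods; as a generic illustration of the underlying many-body computation, the Python sketch below evaluates Lennard-Jones forces with an atom decomposition, where each (here simulated) worker computes the forces on its own contiguous slice of atoms. The potential parameters and system size are arbitrary.

    # Generic many-body force computation with atom decomposition: atoms are split
    # into contiguous slices, and each slice's forces can be computed independently
    # (here serially, standing in for parallel workers).
    import numpy as np

    def lj_forces(pos, lo, hi, eps=1.0, sigma=1.0):
        """Lennard-Jones forces on atoms lo..hi-1 due to all atoms (O(N^2), no cutoff)."""
        d = pos[lo:hi, None, :] - pos[None, :, :]      # displacement vectors
        r2 = (d ** 2).sum(-1)
        np.fill_diagonal(r2[:, lo:hi], np.inf)         # exclude self-interaction
        inv_r2 = sigma ** 2 / r2
        mag = 24 * eps * (2 * inv_r2 ** 6 - inv_r2 ** 3) / r2
        return (mag[..., None] * d).sum(axis=1)

    pos = np.random.rand(64, 3) * 10.0
    slices = [(0, 32), (32, 64)]                       # two "workers"
    forces = np.vstack([lj_forces(pos, lo, hi) for lo, hi in slices])
    print(forces.sum(axis=0))                          # ~0 by Newton's third law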


Journal ArticleDOI
TL;DR: The primary motivation behind this special issue stems from a desire to see how various scientists approach the task of scientific programming, and the following levels of abstraction were suggested.
Abstract: The primary motivation behind this special issue stems from a desire to see how various scientists approach the task of scientific programming. In general, each scientist must formulate a problem and derive a solution. The steps in this process are not fixed or prescribed; however, they are nonetheless somewhat universal. In the Call for Papers for this issue, I asked each author to describe the entire process from problem formulation to the realization of a solution. Each step in this process can be characterized by a statement, describing the problem, in a notation or language suitable to the current level of abstraction. By way of guidance, the following levels of abstraction were suggested in the Call for Papers. These were not enforced, but all the articles roughly follow this outline.


Journal ArticleDOI
TL;DR: A description of a combustion simulation's mathematical and computational methods is used to develop a version for parallel execution, yielding a reasonable performance improvement on small numbers of processors.
Abstract: We used a description of a combustion simulation's mathematical and computational methods to develop a version for parallel execution. The result was a reasonable performance improvement on small numbers of processors. We applied several important programming techniques, which we describe, in optimizing the application. This work has implications for programming languages, compiler design, and software engineering.