
Showing papers in "Scientific Programming in 2000"


Journal ArticleDOI
TL;DR: Extensions to OpenMP Fortran that implement data placement features needed for NUMA architectures are described, along with some of the techniques that the Compaq Fortran compiler uses to generate efficient code based on these extensions.
Abstract: This paper describes extensions to OpenMP that implement data placement features needed for NUMA architectures. OpenMP is a collection of compiler directives and library routines used to write portable parallel programs for shared-memory architectures. Writing efficient parallel programs for NUMA architectures, which have characteristics of both shared-memory and distributed-memory architectures, requires that a programmer control the placement of data in memory and the placement of computations that operate on that data. Optimal performance is obtained when computations occur on processors that have fast access to the data needed by those computations. OpenMP -- designed for shared-memory architectures -- does not by itself address these issues. The extensions to OpenMP Fortran presented here have been mainly taken from High Performance Fortran. The paper describes some of the techniques that the Compaq Fortran compiler uses to generate efficient code based on these extensions. It also describes some additional compiler optimizations, and concludes with some preliminary results.

95 citations
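The Fortran placement directives themselves are not reproduced in the abstract, so the sketch below illustrates the underlying NUMA data-placement concern with a generic stand-in technique: first-touch page placement in OpenMP C++, where each thread initializes the array sections it will later compute on. This is an illustration of the idea, not the Compaq extensions.

```cpp
// Illustrative sketch only: on first-touch systems a page is placed on the NUMA
// node of the thread that first writes it, so the parallel initialization loop
// below determines where the data lives.
#include <cstddef>
#include <cstdio>
#include <omp.h>

int main() {
    const std::size_t n = 1u << 24;
    double* a = new double[n];   // allocated but not yet touched
    double* b = new double[n];

    // Parallel first-touch initialization: each thread places "its" pages locally.
    #pragma omp parallel for schedule(static)
    for (std::size_t i = 0; i < n; ++i) { a[i] = 0.0; b[i] = 1.0; }

    // The compute loop uses the same static schedule, so each thread mostly
    // reads and writes memory that is local to its own NUMA node.
    #pragma omp parallel for schedule(static)
    for (std::size_t i = 0; i < n; ++i) a[i] += 2.0 * b[i];

    std::printf("a[0] = %f\n", a[0]);
    delete[] a;
    delete[] b;
    return 0;
}
```

Keeping the same static schedule in both loops keeps the iteration-to-thread mapping, and hence the data-to-computation affinity, consistent across the program.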


Journal ArticleDOI
TL;DR: Conventional software development of CFD codes is contrasted with a method based on coordinate-free mathematics, illustrated on the coating problem: the simulation of coating a wire with a polymer.
Abstract: It has long been acknowledged that the development of scientific applications is in need of better software engineering practices. Here we contrast the difference between conventional software development of CFD codes with a method based on coordinate free mathematics. The former approach leads to programs where different aspects, such as the discretisation technique and the coordinate systems, can get entangled with the solver algorithm. The latter approach yields programs that segregate these concerns into fully independent software modules. Such considerations are important for the construction of numerical codes for practical problems. The two approaches are illustrated on the coating problem: the simulation of coating a wire with a polymer.

24 citations
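As a rough illustration of the segregation of concerns described above, here is a hypothetical C++ sketch in which the solver step is written only against an abstract field interface, so the discretisation technique and the coordinate system stay behind that interface. The names and interface are assumptions for illustration, not the authors' code.

```cpp
// Hypothetical sketch of separating solver logic from discretisation and
// coordinate-system choices: the solver sees only abstract field operations.
#include <memory>

struct ScalarField {
    virtual ~ScalarField() = default;
    virtual void axpy(double alpha, const ScalarField& x) = 0;  // this += alpha * x
    virtual std::unique_ptr<ScalarField> laplacian() const = 0; // discretisation hidden
};

// A generic explicit diffusion step: u <- u + dt * nu * laplacian(u).
// It compiles unchanged whether the field lives on a Cartesian finite-difference
// grid, a curvilinear grid around the coated wire, or a finite-element mesh.
void diffuse_step(ScalarField& u, double nu, double dt) {
    auto lap = u.laplacian();
    u.axpy(dt * nu, *lap);
}
```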


Journal ArticleDOI
TL;DR: Sophus, as described in this paper, is a programming style for solvers of partial differential equations (PDEs), with a focus on abstract datatypes and an algebraic expression style similar to the expression style used in the mathematical theory.
Abstract: The abstract mathematical theory of partial differential equations (PDEs) is formulated in terms of manifolds, scalar fields, tensors, and the like, but these algebraic structures are hardly recognizable in actual PDE solvers. The general aim of the Sophus programming style is to bridge the gap between theory and practice in the domain of PDE solvers. Its main ingredients are a library of abstract datatypes corresponding to the algebraic structures used in the mathematical theory and an algebraic expression style similar to the expression style used in the mathematical theory. Because of its emphasis on abstract datatypes, Sophus is most naturally combined with object-oriented languages or other languages supporting abstract datatypes. The resulting source code patterns are beyond the scope of current compiler optimizations, but are sufficiently specific for a dedicated source-to-source optimizer. The limited, domain-specific, character of Sophus is the key to success here. This kind of optimization has been tested on computationally intensive Sophus style code with promising results. The general approach may be useful for other styles and in other application domains as well.

21 citations
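A minimal sketch of what an algebraic expression style can look like in an object-oriented language, assuming a hypothetical ScalarField abstract datatype with overloaded operators; the temporaries created by the straightforward operator versions are exactly the kind of pattern a dedicated source-to-source optimizer of the kind described here could rewrite into in-place updates. This is illustrative, not Sophus itself.

```cpp
// Sketch (not Sophus): an abstract datatype with overloaded operators so that
// solver code reads like the mathematics it implements.
#include <cstddef>
#include <vector>

class ScalarField {
public:
    explicit ScalarField(std::size_t n, double v = 0.0) : data_(n, v) {}

    ScalarField& operator+=(const ScalarField& o) {
        for (std::size_t i = 0; i < data_.size(); ++i) data_[i] += o.data_[i];
        return *this;
    }
    ScalarField& operator*=(double s) {
        for (double& x : data_) x *= s;
        return *this;
    }
    friend ScalarField operator+(ScalarField a, const ScalarField& b) { return a += b; }
    friend ScalarField operator*(double s, ScalarField a) { return a *= s; }

private:
    std::vector<double> data_;
};

// Algebraic expression style: the update mirrors the mathematical formula,
// at the cost of temporaries that a domain-specific optimizer could eliminate.
ScalarField step(const ScalarField& u, const ScalarField& a,
                 const ScalarField& b, double dt) {
    return u + dt * (a + b);
}
```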


Proceedings ArticleDOI
TL;DR: In this paper, the authors present two methods of assessing the need for atmospheric correction and address the importance of removing atmospheric effects in the satellite remote sensing of dark targets such as large water reservoirs.
Abstract: Solar radiation reflected by the Earth's surface to satellite sensors is modified by its interaction with the atmosphere. The objective of atmospheric correction is to determine true surface reflectance values by removing atmospheric effects from satellite images. Atmospheric correction is arguably the most important part of the pre-processing of satellite remotely sensed data and any omission produces erroneous results. The effects of the atmosphere are more severe for dark targets such as water reservoirs. The paper presents two methods of assessing the need for atmospheric correction, and addresses the importance of removing atmospheric effects in the satellite remote sensing of large reservoirs.

20 citations
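The abstract does not name the two assessment methods, so the following sketch illustrates the general idea of atmospheric correction with the classic dark-object subtraction approach, which is a stand-in rather than the paper's method: path radiance is estimated from the darkest pixels (for example deep clear water) and subtracted from the band before reflectance is derived.

```cpp
// Generic illustration of why omitting atmospheric correction hurts most over
// dark targets, using dark-object subtraction (DOS) as a stand-in method.
#include <algorithm>
#include <vector>

// Estimate path radiance (atmospheric scattering added to every pixel) as the
// minimum radiance in the band, assuming the scene contains at least one
// near-zero-reflectance "dark object" such as deep clear water.
double estimate_path_radiance(const std::vector<double>& band_radiance) {
    return *std::min_element(band_radiance.begin(), band_radiance.end());
}

// Subtract the path radiance from every pixel; over dark water this haze term
// can be a large fraction of the measured signal, which is why skipping the
// correction produces erroneous reflectance values for reservoirs.
std::vector<double> dark_object_subtract(std::vector<double> band_radiance) {
    const double haze = estimate_path_radiance(band_radiance);
    for (double& r : band_radiance) r = std::max(0.0, r - haze);
    return band_radiance;
}
```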


Journal ArticleDOI
TL;DR: The Penn State University/NCAR Mesoscale Model (MM5), as discussed by the authors, runs on distributed-memory (DM) parallel computers and is compatible with distributed-memory/shared-memory hybrid parallelization on distributed-memory clusters of symmetric multiprocessors.
Abstract: Beginning with the March 1998 release of the Penn State University/NCAR Mesoscale Model (MM5), and continuing through eight subsequent releases up to the present, the official version has run on distributed-memory (DM) parallel computers. Source translation and runtime library support minimize the impact of parallelization on the original model source code, with the result that the majority of code is line-for-line identical with the original version. Parallel performance and scaling are equivalent to earlier, hand-parallelized versions; the modifications have no effect when the code is compiled and run without the DM option. Supported computers include the IBM SP, Cray T3E, Fujitsu VPP, Compaq Alpha clusters, and clusters of PCs (so-called Beowulf clusters). The approach also is compatible with shared-memory parallel directives, allowing distributed-memory/shared-memory hybrid parallelization on distributed-memory clusters of symmetric multiprocessors.

19 citations
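The runtime library mentioned above has to supply communication operations such as halo (ghost-cell) exchange between neighbouring subdomains. The sketch below shows a generic one-dimensional halo exchange in C++ with MPI; it illustrates the kind of support routine involved, not the actual MM5 runtime, which operates on the original Fortran source.

```cpp
// Generic halo exchange for a 1-D decomposition of a 2-D field.
// 'local' holds ny interior rows of nx points plus one halo row on each side:
// row 0 and row ny+1 are halos filled from the neighbouring ranks.
#include <mpi.h>
#include <vector>

void exchange_halos(std::vector<double>& local, int nx, int ny, MPI_Comm comm) {
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    const int up   = (rank + 1 < size) ? rank + 1 : MPI_PROC_NULL;
    const int down = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;

    // Send my top interior row up, receive my bottom halo from below.
    MPI_Sendrecv(&local[ny * nx], nx, MPI_DOUBLE, up,   0,
                 &local[0],       nx, MPI_DOUBLE, down, 0,
                 comm, MPI_STATUS_IGNORE);
    // Send my bottom interior row down, receive my top halo from above.
    MPI_Sendrecv(&local[1 * nx],        nx, MPI_DOUBLE, down, 1,
                 &local[(ny + 1) * nx], nx, MPI_DOUBLE, up,   1,
                 comm, MPI_STATUS_IGNORE);
}
```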


Journal ArticleDOI
TL;DR: This work outlines a use of algebraic software methodologies and advanced program constructors to improve the abstraction level of software for scientific computing through the use of domain specific languages and appropriate software architectures.
Abstract: The use of domain specific languages and appropriate software architectures are currently seen as the way to enhance reusability and improve software productivity. Here we outline a use of algebraic software methodologies and advanced program constructors to improve the abstraction level of software for scientific computing. This leads us to the language of coordinate free numerics as an alternative to the traditional coordinate dependent array notation. This provides the backdrop for the three accompanying papers: {\it Coordinate Free Programming of Computational Fluid Dynamics Problems}, centered around an example of using coordinate free numerics, {\it Machine and Collection Abstractions for User-Implemented Data-Parallel Programming}, exploiting the higher abstraction level when parallelising code, and {\it An Algebraic Programming Style for Numerical Software and its Optimization}, looking at high-level transformations enabled by the domain specific programming style.

16 citations


Journal ArticleDOI
TL;DR: The main body of the paper describes how the OpenMP runtime environment uses page migration to implement implicit data distribution and redistribution schemes without programmer intervention, and provides a proof of concept that data distribution directives need not be introduced in OpenMP, preserving the simplicity and the portability of the programming model.
Abstract: This paper makes two important contributions. First, the paper investigates the performance implications of data placement in OpenMP programs running on modern NUMA multiprocessors. Data locality and minimization of the rate of remote memory accesses are critical for sustaining high performance on these systems. We show that due to the low remote-to-local memory access latency ratio of contemporary NUMA architectures, reasonably balanced page placement schemes, such as round-robin or random distribution, incur modest performance losses. Second, the paper presents a transparent, user-level page migration engine with an ability to gain back any performance loss that stems from suboptimal placement of pages in iterative OpenMP programs. The main body of the paper describes how our OpenMP runtime environment uses page migration for implementing implicit data distribution and redistribution schemes without programmer intervention. Our experimental results verify the effectiveness of the proposed framework and provide a proof of concept that data distribution directives need not be introduced in OpenMP, preserving the simplicity and the portability of the programming model.

16 citations
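A hedged sketch of the decision logic such a page migration engine might use between iterations of an iterative code: count the references each page receives from each NUMA node (for instance from sampled hardware counters) and migrate pages that are referenced mostly from a remote node. The migrate_page call is a placeholder (on Linux it could be realized with move_pages(2)); none of this is taken from the authors' implementation.

```cpp
// Illustrative page migration policy: move a page to the node that dominates
// its reference count, once that dominance exceeds a threshold.
#include <cstddef>
#include <map>
#include <vector>

struct PageStats {
    std::vector<long> accesses_per_node;  // index = NUMA node id
    int current_node = 0;
};

// Placeholder for the actual OS-level migration call (e.g. move_pages(2)).
void migrate_page(void* /*page_addr*/, int /*target_node*/) { /* not shown */ }

void migrate_hot_pages(std::map<void*, PageStats>& pages, double threshold = 0.66) {
    for (auto& [addr, st] : pages) {
        long total = 0, best_count = -1;
        int best_node = st.current_node;
        for (std::size_t n = 0; n < st.accesses_per_node.size(); ++n) {
            total += st.accesses_per_node[n];
            if (st.accesses_per_node[n] > best_count) {
                best_count = st.accesses_per_node[n];
                best_node = static_cast<int>(n);
            }
        }
        // Migrate only if a remote node clearly dominates the references.
        if (total > 0 && best_node != st.current_node &&
            static_cast<double>(best_count) / total > threshold) {
            migrate_page(addr, best_node);
            st.current_node = best_node;
        }
    }
}
```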


Journal ArticleDOI
TL;DR: P^3T+ is introduced, a performance estimator for mostly regular HPF (High Performance Fortran) programs that also partially covers message-passing (MPI) programs.
Abstract: Developing distributed and parallel programs on today's multiprocessor architectures is still a challenging task. Particularly distressing is the lack of effective performance tools that support the programmer in evaluating changes in code, problem and machine sizes, and target architectures. In this paper we introduce $P^3T+$, a performance estimator for mostly regular HPF (High Performance Fortran) programs that also partially covers message-passing (MPI) programs. $P^3T+$ is unique in modeling programs, compiler code transformations, and parallel and distributed architectures. It computes at compile-time a variety of performance parameters including work distribution, number of transfers, amount of data transferred, transfer times, computation times, and number of cache misses. Several novel technologies are employed to compute these parameters: loop iteration spaces, array access patterns, and data distributions are modeled by employing highly effective symbolic analysis. Communication is estimated by simulating the behavior of a communication library used by the underlying compiler. Computation times are predicted through pre-measured kernels on every target architecture of interest. We carefully model most critical architecture-specific factors such as cache line sizes, number of cache lines available, startup times, message transfer time per byte, etc. $P^3T+$ has been implemented and is closely integrated with the Vienna High Performance Compiler (VFC) to support programmers in developing parallel and distributed applications. Experimental results for realistic kernel codes taken from real-world applications are presented to demonstrate both the accuracy and the usefulness of $P^3T+$.

15 citations
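To make the kind of output concrete, here is a toy cost model in the spirit of the parameters listed above, combining a computation time derived from a pre-measured kernel rate with a startup-plus-per-byte communication model; the structure and the numbers are assumptions for illustration, not $P^3T+$'s actual internals.

```cpp
// Toy compile-time-style cost estimate: computation time from a pre-measured
// kernel rate plus a linear latency/bandwidth communication model.
#include <cstdio>

struct MachineModel {
    double startup_s;           // message startup time
    double per_byte_s;          // transfer time per byte
    double kernel_flops_per_s;  // pre-measured kernel rate on the target machine
};

double estimate_time(double flops, long num_transfers, double bytes_total,
                     const MachineModel& m) {
    const double t_comp = flops / m.kernel_flops_per_s;
    const double t_comm = num_transfers * m.startup_s + bytes_total * m.per_byte_s;
    return t_comp + t_comm;
}

int main() {
    // Hypothetical machine: 20 us startup, 100 MB/s bandwidth, 200 Mflop/s kernel rate.
    MachineModel m{20e-6, 1.0 / 100e6, 200e6};
    std::printf("estimated time: %.3f s\n", estimate_time(1e9, 1000, 50e6, m));
    return 0;
}
```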


Journal ArticleDOI
TL;DR: In the 1990's, computer manufacturers are increasingly turning to the development of parallel processor machines to meet the high performance needs of their customers and atmospheric scientists studying weather and climate phenomena require increasingly fine resolution models.
Abstract: In the 1990's, computer manufacturers are increasingly turning to the development of parallel processor machines to meet the high performance needs of their customers. Simultaneously, atmospheric scientists studying weather and climate phenomena ranging from hurricanes to El Niño to global warming require increasingly fine resolution models. Here, implementation of a parallel atmospheric general circulation model (GCM) which exploits the power of massively parallel machines is described. Using the horizontal data domain decomposition methodology, this FORTRAN 90 model is able to integrate a $0.6^{\circ}$ longitude by $0.5^{\circ}$ latitude problem at a rate of 19 Gigaflops on 512 processors of a Cray T3E 600, corresponding to 280 seconds of wall-clock time per simulated model day. At this resolution, the model has 64 times as many degrees of freedom and performs 400 times as many floating point operations per simulated day as the model it replaces.

10 citations
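As a quick consistency check on the quoted figures (an illustrative calculation based only on the numbers in the abstract, not taken from the paper), the sustained rate and wall-clock time imply the total work per simulated day:

```latex
% Work implied by the quoted rate and wall-clock time (illustrative arithmetic):
19\ \mathrm{Gflop/s} \times 280\ \mathrm{s} \approx 5.3 \times 10^{12}
  \ \text{floating point operations per simulated model day}
```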


Journal ArticleDOI
TL;DR: The modular approach used for the design of the LM is explained, the effects on the development are discussed, and some performance results are given.
Abstract: Nearly 30 years after introducing the first computer model for weather forecasting, the {\sl Deutscher Wetterdienst} (DWD) is developing the 4th generation of its numerical weather prediction (NWP) system. It consists of a global grid point model (GME) based on a triangular grid and a non-hydrostatic {\sl Lokal Modell} (LM). The operational demand for running this new system is immense and can only be met by parallel computers. From the experience gained in developing earlier NWP models, several new problems had to be taken into account during the design phase of the system. Most important were portability (including efficiency of the programs on several computer architectures) and ease of code maintainability. Also, the organization and administration of the work done by developers from different teams and institutions are more complex than they used to be. This paper describes the models and gives some performance results. The modular approach used for the design of the LM is explained and the effects on the development are discussed.

10 citations


Journal ArticleDOI
TL;DR: An optimized parallelization scheme for molecular dynamics simulations of large biomolecular systems, implemented in the production-quality molecular dynamics program NAMD, achieves speeds and speedups that are much higher than any reported in literature so far.
Abstract: We present an optimized parallelization scheme for molecular dynamics simulations of large biomolecular systems, implemented in the production-quality molecular dynamics program NAMD. With an object-based hybrid force and spatial decomposition scheme, and an aggressive measurement-based predictive load balancing framework, we have attained speeds and speedups that are much higher than any reported in literature so far. The paper first summarizes the broad methodology we are pursuing, and the basic parallelization scheme we used. It then describes the optimizations that were instrumental in increasing performance, and presents performance results on benchmark simulations.
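Measurement-based load balancing can be illustrated with a simple greedy scheme: take the per-object compute times measured during previous timesteps and repeatedly assign the heaviest remaining object to the least loaded processor. The sketch below shows only this idea; NAMD's actual framework also accounts for communication and uses more sophisticated strategies.

```cpp
// Greedy measurement-based assignment: heaviest measured object goes to the
// currently least loaded processor.
#include <algorithm>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

std::vector<int> greedy_assign(const std::vector<double>& measured_load, int num_procs) {
    // Sort object indices by decreasing measured load.
    std::vector<int> order(measured_load.size());
    for (std::size_t i = 0; i < order.size(); ++i) order[i] = static_cast<int>(i);
    std::sort(order.begin(), order.end(),
              [&](int a, int b) { return measured_load[a] > measured_load[b]; });

    // Min-heap of (current load, processor id).
    using Entry = std::pair<double, int>;
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> procs;
    for (int p = 0; p < num_procs; ++p) procs.push({0.0, p});

    std::vector<int> assignment(measured_load.size());
    for (int obj : order) {
        auto [load, p] = procs.top();
        procs.pop();
        assignment[obj] = p;
        procs.push({load + measured_load[obj], p});
    }
    return assignment;
}
```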


Proceedings ArticleDOI
TL;DR: The interpretation of the epithermal neutron spectra as indicating water ice at the poles appears premature, however, in the face of the proven presence of over 100 wppm solar-wind hydrogen in many soil and regolith breccia samples.
Abstract: The Lunar Prospector Team and others have interpreted epithermal neutron spectra from the lunar poles as potentially indicating the presence of large quantities of water ice. Water ice might be deposited as a consequence of cometary impacts on the Moon as has been predicted theoretically. The interpretation of the neutron spectra as indicating water ice at the poles appears premature, however, in the face of the proven presence of over 100 wppm solar-wind hydrogen in many soil and regolith breccia samples. Solar-wind hydrogen also would be concentrated and preserved as a distributed regolith component in permanently shadowed areas. No significant amount of implanted solar wind hydrogen in such areas would be lost to thermal cycling and both primary and pickup ions would be continuously deposited. Further, a continuous blanket of cometary water ice, precipitated on rare occasions in permanent shadow, would remain subject to solar-wind sputtering and micrometeoroid erosion comparable to the complete reworking of the upper few centimeters of regolith approximately every ten million years. Solar-wind hydrogen, therefore, probably accounts for most of the epithermal neutron signal. Other solar-wind volatiles of interest, such as

Journal ArticleDOI
TL;DR: Results for a wide range of processor numbers, model resolutions, and different vendor architectures are presented, and single node performance has been disappointing on RISC based systems, at least compared to vector processor performance.
Abstract: The Navy Operational Global Atmospheric Prediction System (NOGAPS) includes a state-of-the-art spectral forecast model similar to models run at several major operational numerical weather prediction (NWP) centers around the world. The model, developed by the Naval Research Laboratory (NRL) in Monterey, California, has run operationally at the Fleet Numerical Meteorological and Oceanographic Center (FNMOC) since 1982, and most recently is being run on a Cray C90 in a multi-tasked configuration. Typically the multi-tasked code runs on 10 to 15 processors with overall parallel efficiency of about 90%. The operational resolution is T159L30, but other operational and research applications run at significantly lower resolutions. A scalable NOGAPS forecast model has been developed by NRL in anticipation of a FNMOC C90 replacement in about 2001, as well as for current NOGAPS research requirements to run on DOD High-Performance Computing (HPC) scalable systems. The model is designed to run with message passing (MPI). Model design criteria include bit reproducibility for different processor numbers and reasonably efficient performance on fully shared memory, distributed memory, and distributed shared memory systems for a wide range of model resolutions. Results for a wide range of processor numbers, model resolutions, and different vendor architectures are presented. Single node performance has been disappointing on RISC-based systems, at least compared to vector processor performance. This is a common complaint, and will require careful re-examination of traditional numerical weather prediction (NWP) model software design and data organization to fully exploit future scalable architectures.
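Bit reproducibility across processor counts usually comes down to making the order of floating-point reductions independent of the decomposition. The sketch below shows one generic way to do that for a global sum, gathering the contributions in canonical global order before adding them; it illustrates the design criterion, not necessarily how NOGAPS implements it.

```cpp
// Reproducible global sum: gather local blocks in rank (i.e. global index)
// order and sum them left to right, so the result is bit-identical for any
// processor count, at the cost of serializing the final reduction on rank 0.
#include <mpi.h>
#include <vector>

double reproducible_sum(const std::vector<double>& local, MPI_Comm comm) {
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    int local_n = static_cast<int>(local.size());
    std::vector<int> counts(size), displs(size);
    MPI_Gather(&local_n, 1, MPI_INT, counts.data(), 1, MPI_INT, 0, comm);

    std::vector<double> global;
    if (rank == 0) {
        int total = 0;
        for (int p = 0; p < size; ++p) { displs[p] = total; total += counts[p]; }
        global.resize(total);
    }
    MPI_Gatherv(local.data(), local_n, MPI_DOUBLE, global.data(), counts.data(),
                displs.data(), MPI_DOUBLE, 0, comm);

    double sum = 0.0;
    if (rank == 0)
        for (double x : global) sum += x;  // fixed left-to-right order
    MPI_Bcast(&sum, 1, MPI_DOUBLE, 0, comm);
    return sum;
}
```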


Journal ArticleDOI
TL;DR: A framework with two conceptual classes, {\tt Machine} and {\tt Collection}, is proposed, giving the programmer full control of the parallel distribution of data while also allowing a normal sequential implementation of the {\tt Collection} class.
Abstract: Data parallelism has appeared as a fruitful approach to the parallelisation of compute-intensive programs. Data parallelism has the advantage of mimicking the sequential (and deterministic) structure of programs as opposed to task parallelism, where the explicit interaction of processes has to be programmed. In data parallelism data structures, typically collection classes in the form of large arrays, are distributed on the processors of the target parallel machine. Trying to extract distribution aspects from conventional code often runs into problems with a lack of uniformity in the use of the data structures and in the expression of data dependency patterns within the code. Here we propose a framework with two conceptual classes, {\tt Machine} and {\tt Collection}. The {\tt Machine} class abstracts hardware communication and distribution properties. This gives a programmer high-level access to the important parts of the low-level architecture. The {\tt Machine} class may readily be used in the implementation of a {\tt Collection} class, giving the programmer full control of the parallel distribution of data, as well as allowing normal sequential implementation of this class. Any program using such a collection class will be parallelisable, without requiring any modification, by choosing between sequential and parallel versions at link time. Experiments with a commercial application, built using the Sophus library which uses this approach to parallelisation, show good parallel speed-ups, without any adaptation of the application program being needed.
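A hypothetical C++ sketch of the two conceptual classes: the Machine class hides how data and communication map onto hardware, and a Collection implemented on top of it can be backed by either a sequential or a parallel Machine chosen at link time. The interfaces here are assumptions for illustration, not the paper's API.

```cpp
// Illustrative Machine/Collection split: the Collection delegates distribution
// and communication decisions to the Machine it was built with.
#include <cstddef>
#include <vector>

class Machine {
public:
    virtual ~Machine() = default;
    virtual int num_parts() const = 0;                              // how data is split
    virtual std::size_t local_size(std::size_t global_n) const = 0;
    virtual double global_sum(double local_contribution) const = 0; // communication
};

// Sequential "machine": one part, no communication.
class SequentialMachine : public Machine {
public:
    int num_parts() const override { return 1; }
    std::size_t local_size(std::size_t n) const override { return n; }
    double global_sum(double x) const override { return x; }
};

// A Collection stores only its local part; a parallel Machine implementation
// (e.g. MPI-backed) could be substituted at link time without changing this code.
class Collection {
public:
    Collection(std::size_t global_n, const Machine& m)
        : machine_(m), data_(m.local_size(global_n), 0.0) {}

    void fill(double v) { for (double& x : data_) x = v; }

    double sum() const {
        double local = 0.0;
        for (double x : data_) local += x;
        return machine_.global_sum(local);  // sequential or parallel, transparently
    }

private:
    const Machine& machine_;
    std::vector<double> data_;
};
```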

Journal ArticleDOI
TL;DR: A scalable, high-performance solution to multidimensional recurrences that arise in adaptive statistical designs that focuses on the problem of optimally assigning patients to treatments in clinical trials.
Abstract: We present a scalable, high-performance solution to multidimensional recurrences that arise in adaptive statistical designs. Adaptive designs are an important class of learning algorithms for a stochastic environment, and we focus on the problem of optimally assigning patients to treatments in clinical trials. While adaptive designs have significant ethical and cost advantages, they are rarely utilized because of the complexity of optimizing and analyzing them. Computational challenges include massive memory requirements, few calculations per memory access, and multiply-nested loops with dynamic indices. We analyze the effects of various parallelization options, and while standard approaches do not work well, with effort an efficient, highly scalable program can be developed. This allows us to solve problems thousands of times more complex than those solved previously, which helps make adaptive designs practical. Further, our work applies to many other problems involving neighbor recurrences, such as generalized string matching.
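To show the recurrence structure involved, here is a deliberately tiny sequential instance of an adaptive design: a Bayesian two-armed Bernoulli bandit that allocates n patients to maximize expected successes, where each state (s1, f1, s2, f2) refers only to neighbor states with one additional observation. The real problems are many orders of magnitude larger and are what the paper parallelizes; this sketch is an illustration, not the authors' code.

```cpp
// Tiny adaptive-design recurrence: optimal allocation of N patients between two
// Bernoulli treatments with uniform priors, maximizing expected successes.
#include <algorithm>
#include <array>
#include <cstdio>
#include <map>

using State = std::array<int, 4>;          // successes/failures on arm 1, then arm 2
static std::map<State, double> memo;
static const int N = 20;                   // total patients in this small example

double value(int s1, int f1, int s2, int f2) {
    if (s1 + f1 + s2 + f2 == N) return 0.0; // no patients left
    State key{s1, f1, s2, f2};
    auto it = memo.find(key);
    if (it != memo.end()) return it->second;

    // Posterior success probabilities under uniform priors.
    double p1 = (s1 + 1.0) / (s1 + f1 + 2.0);
    double p2 = (s2 + 1.0) / (s2 + f2 + 2.0);

    // Each option references only "neighbor" states with one more observation.
    double arm1 = p1 * (1.0 + value(s1 + 1, f1, s2, f2)) + (1.0 - p1) * value(s1, f1 + 1, s2, f2);
    double arm2 = p2 * (1.0 + value(s1, f1, s2 + 1, f2)) + (1.0 - p2) * value(s1, f1, s2, f2 + 1);
    double best = std::max(arm1, arm2);
    memo[key] = best;
    return best;
}

int main() {
    std::printf("Expected successes with optimal adaptive design (n=%d): %.4f\n",
                N, value(0, 0, 0, 0));
    return 0;
}
```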

Journal Article
TL;DR: The radial magnetic field distribution of vacuum circuit breakers was measured and compared with analysis, and short circuit test results for various contact materials and electrode structures were compared with the analyzed and measured field distributions, with the aim of improving breaking capacity.
Abstract: A magnetic field parallel to the arcing current can improve the breaking capacity of vacuum circuit breakers. In this paper, the magnetic field distribution in the radial direction was measured and compared with analysis. A series of short circuit tests was carried out with various conditions of contact material and electrode structure. These results were compared with the analysis and measurement of the magnetic field distribution.

Journal ArticleDOI
TL;DR: An overview of Version 1 of the HPD Standard is presented, together with an analysis of the process by which the standard was developed.
Abstract: Throughout 1998, the High Performance Debugging Forum worked on defining a base level standard for high performance debuggers. The standard had to meet the sometimes conflicting constraints of being useful to users, realistically implementable by developers, and architecturally independent across multiple platforms. To meet criteria for timeliness, the standard had to be defined in one year and in such a way that it could be implemented within an additional year. The Forum was successful, and in November 1998 released Version 1 of the HPD Standard. Implementations of the standard are currently underway. This paper presents an overview of Version 1 of the standard and an analysis of the process by which the standard was developed. The status of implementation efforts and plans for follow-on efforts are discussed as well.



Proceedings ArticleDOI
TL;DR: Using Si substrates, a structure that can simultaneously act as a thermal management system, a radiation shield, an optical material, a package, and a semiconductor substrate can be realized.
Abstract: Silicon (Si) has a strength-to-density ratio of 3.0 ($\sigma_{y}/\delta$ = 6.8 GPa / 2.3 g/cc), an order of magnitude higher than titanium, aluminum, or stainless steel. Silicon also demonstrates favorable thermal, optical, and electrical properties making it ideal for use as a structural foundation for autonomous, mesoscopic systems such as nanosatellites. Using Si substrates, a structure that can simultaneously act as a thermal management system, a radiation shield, an optical material, a package, and a semiconductor substrate can be realized.

Journal ArticleDOI
TL;DR: Config is a software component of the Graphical R-Matrix Atomic Collision Environment, implemented in C++, a language that supports object orientation, a powerful architectural paradigm for designing the structure of software systems, and genericity, an orthogonal dimension to the inheritance hierarchies facilitated by object-oriented languages.
Abstract: Config is a software component of the Graphical R-Matrix Atomic Collision Environment. Its development is documented as a case study combining several software engineering techniques: formal specification, generic programming, object-oriented programming, and design by contract. It is specified in VDM++ and implemented in C++, a language which is becoming more than a curiosity amongst the scientific programming community. C++ supports object orientation, a powerful architectural paradigm in designing the structure of software systems, and genericity, an orthogonal dimension to the inheritance hierarchies facilitated by object oriented languages. Support in C++ for design by contract can be added in library form. The combination of techniques makes a substantial contribution to the overall software quality.
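Since the abstract notes that design-by-contract support can be added to C++ in library form, here is a minimal sketch of what such library support can look like, with precondition and postcondition checks as macros around a checking function; it is a generic illustration, not the contract library actually used for Config.

```cpp
// Minimal design-by-contract support in library form: REQUIRE/ENSURE macros
// that report the violated condition and abort.
#include <cmath>
#include <cstdio>
#include <cstdlib>

inline void contract_check(bool ok, const char* kind, const char* expr,
                           const char* file, int line) {
    if (!ok) {
        std::fprintf(stderr, "%s violated: %s (%s:%d)\n", kind, expr, file, line);
        std::abort();
    }
}

#define REQUIRE(cond) contract_check((cond), "Precondition", #cond, __FILE__, __LINE__)
#define ENSURE(cond)  contract_check((cond), "Postcondition", #cond, __FILE__, __LINE__)

// Example use: the contract documents and enforces the routine's assumptions.
double safe_sqrt(double x) {
    REQUIRE(x >= 0.0);
    double r = std::sqrt(x);
    ENSURE(std::fabs(r * r - x) <= 1e-9 * (1.0 + x));
    return r;
}
```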

Journal Article
TL;DR: Even when several image features are combined, retrieval performance degrades if the weights assigned to the individual features are not appropriate, because the ranking of the retrieved images changes greatly; an automatic weighting scheme is therefore proposed.
Abstract: In content-based image retrieval, three representative image features (color, shape, and texture) are mainly used. A retrieval method that uses only a single feature does not perform well when the image content is complex or when the number of images to be compared grows. For this reason, many methods that combine several image features have been studied. However, even in a retrieval system that combines several features, if the weight assigned to each feature is not appropriate, the ranking of the retrieved result images changes greatly and retrieval performance degrades. To solve this problem, this paper proposes to improve retrieval performance by automatically assigning a weight to each feature when several image features are used in combination. The proposed method was tested on a database of 992 test images and its validity was confirmed through various performance evaluation measures; compared with a method using fixed weights, the proposed method showed improved retrieval performance.
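The abstract does not give the weighting formula, so the sketch below shows one plausible way to combine color, shape, and texture distances with automatically chosen weights, here simply the inverse of each feature's distance spread over the database so that no feature dominates the ranking by scale alone; this particular normalization is an assumption for illustration, not the paper's method.

```cpp
// Combine per-feature distances with automatic weights (inverse of the spread
// of each feature's distances over the database).
#include <cmath>
#include <vector>

struct FeatureDistances {
    double color, shape, texture;  // distance between query and one database image
};

std::vector<double> combined_distances(const std::vector<FeatureDistances>& d) {
    std::vector<double> out;
    if (d.empty()) return out;

    auto stddev = [&](auto get) {
        double mean = 0.0;
        for (const auto& x : d) mean += get(x);
        mean /= d.size();
        double var = 0.0;
        for (const auto& x : d) var += (get(x) - mean) * (get(x) - mean);
        return std::sqrt(var / d.size()) + 1e-12;  // avoid division by zero
    };
    const double wc = 1.0 / stddev([](const FeatureDistances& x) { return x.color; });
    const double ws = 1.0 / stddev([](const FeatureDistances& x) { return x.shape; });
    const double wt = 1.0 / stddev([](const FeatureDistances& x) { return x.texture; });

    out.reserve(d.size());
    for (const auto& x : d)
        out.push_back(wc * x.color + ws * x.shape + wt * x.texture);
    return out;  // rank images by ascending combined distance
}
```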

Journal ArticleDOI
TL;DR: The results indicate that performance degradation for both models on a single SX-4 node is primarily due to memory contention within the internal crossbar switch, and both models achieve close to ideal scaling on the VPP700.
Abstract: The NEC SX-4M cluster and Fujitsu VPP700 supercomputers are both based on custom vector processors using low-power CMOS technology. Their basic architectures and programming models are however somewhat different. A multi-node SX-4M cluster contains up to 32 processors per shared memory node, with a maximum of 16 nodes connected via the proprietary NEC IXS fibre channel crossbar network. A hybrid combination of inter-node MPI message-passing with intra-node tasking or threads is possible. The Fujitsu VPP700 is a fully distributed-memory vector machine with a crossbar interconnect which also supports MPI. The parallel performance of the MC2 model for high-resolution mesoscale forecasting over large domains and of the IFS RAPS 4.0 benchmark are presented for several different machine configurations. These include an SX-4/32, an SX-4/32M cluster and up to 100 PE's of the VPP700. Our results indicate that performance degradation for both models on a single SX-4 node is primarily due to memory contention within the internal crossbar switch. Multinode SX-4 performance is slightly better than single node. Longer vector lengths and SDRAM memory on the VPP700 result in lower per processor execution rates. Both models achieve close to ideal scaling on the VPP700.


Journal Article
TL;DR: Using a Bhattacharyya distance based error-prediction technique, feature vectors that minimize the predicted classification error can be extracted quickly, and the minimum number of feature vectors needed for pattern classification can be predicted.
Abstract: The Bhattacharyya distance has been used as a measure of class separability in pattern classification problems and provides useful information for feature extraction. In this paper, we propose a method that extracts the feature vectors minimizing the predicted classification error, using a recently published error-prediction technique based on the Bhattacharyya distance. Because the proposed feature extraction method predicts the classification error with the Bhattacharyya distance instead of computing the error directly when exhaustive-search or sequential-search optimization algorithms are applied, fast feature extraction is possible for high-dimensional data; in addition, the error-prediction property makes it possible to predict the minimum number of feature vectors required for pattern classification.
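For context, the standard textbook definitions behind this approach are the Bhattacharyya distance between two Gaussian class densities and the bound it gives on the Bayes error, which is what makes the distance usable as an error predictor during feature selection; these formulas are included for reference and are not taken from the paper.

```latex
% Bhattacharyya distance between two Gaussian class densities N(\mu_i, \Sigma_i):
B = \frac{1}{8}\,(\mu_2-\mu_1)^{\mathsf T}
      \Big[\tfrac{\Sigma_1+\Sigma_2}{2}\Big]^{-1}(\mu_2-\mu_1)
  + \frac{1}{2}\,
      \ln\frac{\big|\tfrac{\Sigma_1+\Sigma_2}{2}\big|}{\sqrt{|\Sigma_1|\,|\Sigma_2|}}

% Bound on the Bayes error for class priors P_1, P_2:
\varepsilon \le \sqrt{P_1 P_2}\; e^{-B}
```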