
Showing papers in "Scientific Programming in 1999"


Journal ArticleDOI
TL;DR: This work collected week-long, 1 Hz resolution traces of the Digital Unix 5 second exponential load average to find that relatively simple linear models are sufficient for short-range host load prediction.
Abstract: Understanding how host load changes over time is instrumental in predicting the execution time of tasks or jobs, such as in dynamic load balancing and distributed soft real-time systems. To improve this understanding, we collected week-long, 1 Hz resolution traces of the Digital Unix 5 second exponential load average on over 35 different machines including production and research cluster machines, compute servers, and desktop workstations. Separate sets of traces were collected at two different times of the year. The traces capture all of the dynamic load information available to user-level programs on these machines. We present a detailed statistical analysis of these traces here, including summary statistics, distributions, and time series analysis results. Two significant new results are that load is self-similar and that it displays epochal behavior. All of the traces exhibit a high degree of self-similarity with Hurst parameters ranging from 0.73 to 0.99, strongly biased toward the top of that range. The traces also display epochal behavior in that the local frequency content of the load signal remains quite stable for long periods of time (150-450 s mean) and changes abruptly at epoch boundaries. Despite these complex behaviors, we have found that relatively simple linear models are sufficient for short-range host load prediction.
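The practical conclusion above, that simple linear models suffice for short-range load prediction, can be illustrated with a minimal Java sketch (not the authors' code): an AR(1) predictor fitted by ordinary least squares over a sliding window of recent 1 Hz samples. The trace in main is synthetic stand-in data.

    import java.util.Random;

    /** Minimal sketch (not the authors' code): one-step-ahead host load
     *  prediction with an AR(1) model fitted by ordinary least squares
     *  over a sliding window of recent 1 Hz load samples. */
    public class LoadPredictor {

        /** Fit x[t+1] = a*x[t] + b over the window and predict the next value. */
        static double predictNext(double[] window) {
            int n = window.length - 1;
            double mx = 0, my = 0;
            for (int t = 0; t < n; t++) { mx += window[t]; my += window[t + 1]; }
            mx /= n; my /= n;
            double cov = 0, var = 0;
            for (int t = 0; t < n; t++) {
                cov += (window[t] - mx) * (window[t + 1] - my);
                var += (window[t] - mx) * (window[t] - mx);
            }
            double a = (var == 0) ? 0 : cov / var;   // slope
            double b = my - a * mx;                  // intercept
            return a * window[window.length - 1] + b;
        }

        public static void main(String[] args) {
            // Synthetic stand-in for a 1 Hz load-average trace.
            Random rng = new Random(42);
            double[] trace = new double[60];
            for (int t = 0; t < trace.length; t++)
                trace[t] = 1.5 + 0.5 * Math.sin(t / 10.0) + 0.05 * rng.nextGaussian();
            System.out.printf("predicted next load: %.3f%n", predictNext(trace));
        }
    }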

162 citations


Journal ArticleDOI
TL;DR: The Vienna Fortran Compiler (VFC) is introduced, a new source-to-source parallelization system for HPF+, an optimized version of HPF, which addresses the requirements of irregular applications.
Abstract: High Performance Fortran (HPF) offers an attractive high-level language interface for programming scalable parallel architectures providing the user with directives for the specification of data distribution and delegating to the compiler the task of generating an explicitly parallel program. Available HPF compilers can handle regular codes quite efficiently, but dramatic performance losses may be encountered for applications which are based on highly irregular, dynamically changing data structures and access patterns. In this paper we introduce the Vienna Fortran Compiler (VFC), a new source-to-source parallelization system for HPF+, an optimized version of HPF, which addresses the requirements of irregular applications. In addition to extended data distribution and work distribution mechanisms, HPF+ provides the user with language features for specifying certain information that decisively influence a program’s performance. This comprises data locality assertions, non-local access specifications and the possibility of reusing runtime-generated communication schedules of irregular loops. Performance measurements of kernels from advanced applications demonstrate that with a high-level data parallel language such as HPF+ a performance close to hand-written message-passing programs can be achieved even for highly irregular codes.
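For readers unfamiliar with HPF-style data distribution, the following minimal Java sketch (purely conceptual, not part of VFC or HPF+) shows what a directive such as !HPF$ DISTRIBUTE A(BLOCK) expresses declaratively: each processor owns one contiguous slice of the array, and the compiler applies the owner-computes rule to that slice.

    /** Conceptual sketch of the BLOCK data distribution that HPF-style
     *  directives express declaratively: each of p processors owns one
     *  contiguous slice of an n-element array. */
    public class BlockDistribution {

        /** Returns {lower, upper} (inclusive/exclusive) owned by processor rank. */
        static int[] ownedRange(int n, int p, int rank) {
            int block = (n + p - 1) / p;              // ceiling(n / p)
            int lo = Math.min(rank * block, n);
            int hi = Math.min(lo + block, n);
            return new int[] { lo, hi };
        }

        public static void main(String[] args) {
            int n = 10, p = 4;
            for (int rank = 0; rank < p; rank++) {
                int[] r = ownedRange(n, p, rank);
                System.out.printf("processor %d owns indices [%d, %d)%n", rank, r[0], r[1]);
            }
        }
    }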

70 citations


Journal Article
TL;DR: In this article, 40Ar/39Ar laser-probe dating of mylonitic fabrics from the Pelion Massif in the Pelagonian Zone of mainland Greece has characterized its Mid-Late Alpine deformation history.
Abstract: 40Ar/39Ar laser-probe dating of mylonitic fabrics from the Pelion Massif in the Pelagonian Zone of mainland Greece has characterized its Mid-Late Alpine deformation history. Following high pressure (HP) metamorphism, ductile deformation occurred under greenschist-facies conditions from c. 54 Ma, and continued to affect the Pelion Massif until c. 15 Ma. The prolonged episode of ductile deformation in the Pelion Massif has resulted in the formation of an Oligocene-Early Miocene ductile domal structure. The new geochronological data obtained for the Pelion contribute to a detailed record of the Alpine kinematic history in the Pelagonian Zone and allow a discussion of P-T-t data from Aegean HP rocks to characterize the regional thermotectonic history. Comparison with the P-T-t data from the Cycladic region reinforces the point that Mid-Eocene phengite ages, commonly taken as the age of peak HP metamorphism in the Cyclades, do not always reflect the metamorphic culmination, but rather the retrograde paths of the HP rocks. It is shown that, on a regional scale, termination of HP metamorphism is a diachronous process in the Aegean region, being c. 54 Ma in the north (Pelagonian Zone) and shifting to younger ages, chiefly c. 40 Ma in the Cyclades, and c. 20 Ma on Crete, as the present-day subduction zone is approached. In contrast to the diachronous exhumation of Aegean HP assemblages, the well documented Miocene phase of ductile regional extension appears to be synchronous across the whole Aegean region and affected basement rocks until c. 15 Ma.

35 citations


Journal ArticleDOI
TL;DR: The research issues involved in the JLAPACK project are described, and the LAPACK API will be considerably simplified to take advantage of Java’s object-oriented design.
Abstract: The JLAPACK project provides the LAPACK numerical subroutines translated from their subset Fortran 77 source into class files, executable by the Java Virtual Machine (JVM) and suitable for use by Java programmers. This makes it possible for Java applications or applets, distributed on the World Wide Web (WWW), to use established legacy numerical code that was originally written in Fortran. The translation is accomplished using a special-purpose Fortran-to-Java (source-to-source) compiler. The LAPACK API will be considerably simplified to take advantage of Java’s object-oriented design. This report describes the research issues involved in the JLAPACK project, and its current implementation and status.
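The flavor of source-to-source translated code can be sketched as follows. This is an illustrative example rather than actual JLAPACK output: a Fortran 77 routine typically maps to a static Java method over a flat, column-major double[] array with an explicit leading dimension, mirroring the Fortran calling convention.

    /** Illustrative sketch (not generated JLAPACK code) of the style a
     *  Fortran-to-Java translator produces: static methods over flat,
     *  column-major double[] arrays with explicit leading dimensions. */
    public class TranslatedStyle {

        /** Solve L*x = b for a lower-triangular n-by-n matrix L stored
         *  column-major in a with leading dimension lda; b is overwritten with x. */
        static void ltrsv(int n, double[] a, int lda, double[] b) {
            for (int i = 0; i < n; i++) {
                double s = b[i];
                for (int j = 0; j < i; j++)
                    s -= a[i + j * lda] * b[j];   // a(i,j) in column-major storage
                b[i] = s / a[i + i * lda];
            }
        }

        public static void main(String[] args) {
            // L = [2 0; 1 4] column-major, b = [2, 9]  =>  x = [1, 2]
            double[] a = { 2, 1, 0, 4 };
            double[] b = { 2, 9 };
            ltrsv(2, a, 2, b);
            System.out.printf("x = [%.1f, %.1f]%n", b[0], b[1]);
        }
    }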

30 citations


Journal ArticleDOI
TL;DR: JLAPACK, a subset of the LAPACK library in Java, is implemented, a high-performance Fortran 77 library used to solve common linear algebra problems, and performs comparably with the Fortran version using the native BLAS library.
Abstract: This paper describes the design and implementation of high performance numerical software in Java. Our primary goals are to characterize the performance of object-oriented numerical software written in Java and to investigate whether Java is a suitable language for such endeavors. We have implemented JLAPACK, a subset of the LAPACK library in Java. LAPACK is a high-performance Fortran 77 library used to solve common linear algebra problems. JLAPACK is an object-oriented library, using encapsulation, inheritance, and exception handling. It performs within a factor of four of the optimized Fortran version for certain platforms and test cases. When used with the native BLAS library, JLAPACK performs comparably with the Fortran version using the native BLAS library. We conclude that high-performance numerical software could be written in Java if a handful of concerns about language features and compilation strategies are adequately addressed.

25 citations


Journal ArticleDOI
TL;DR: OwlPack develops two object-oriented versions of LINPACK in Java, a true polymorphic version and a “Lite” version designed for higher performance, to drive research on compiler technology that will reward, rather than penalize, good object-oriented programming practice.
Abstract: Since the introduction of the Java programming language, there has been widespread interest in the use of Java for high-performance scientific computing. One major impediment to such use is the performance penalty paid relative to Fortran. To support our research on overcoming this penalty through compiler technology, we have developed a benchmark suite, called OwlPack, which is based on the popular LINPACK library. Although there are existing implementations of LINPACK in Java, most of these are produced by direct translation from Fortran. As such they do not reflect the style of programming that a good object-oriented programmer would use in Java. Our goal is to investigate how to make object-oriented scientific programming practical. Therefore we developed two object-oriented versions of LINPACK in Java, a true polymorphic version and a “Lite” version designed for higher performance. We used these libraries to perform a detailed performance analysis using several leading Java compilers and virtual machines, comparing the performance of the object-oriented versions of the benchmark with a version produced by direct translation from Fortran. Although Java implementations have made great strides, they still fall short on programs that use the full power of Java’s object-oriented features. Our ultimate goal is to drive research on compiler technology that will reward, rather than penalize, good object-oriented programming practice.
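The design tension the benchmark probes can be sketched in a few lines of Java (illustrative only, not OwlPack source): a fully polymorphic scalar abstraction allocates and dispatches in the inner loop, which is exactly what early JVMs penalized, while the "Lite" primitive-array style does not.

    /** Illustrative sketch (not OwlPack source) of polymorphic versus
     *  primitive-array numerical style. */
    public class DotProductStyles {

        // "Polymorphic" style: scalars behind an interface.
        interface Scalar { Scalar add(Scalar o); Scalar mul(Scalar o); double value(); }

        static final class DoubleScalar implements Scalar {
            final double v;
            DoubleScalar(double v) { this.v = v; }
            public Scalar add(Scalar o) { return new DoubleScalar(v + o.value()); }
            public Scalar mul(Scalar o) { return new DoubleScalar(v * o.value()); }
            public double value() { return v; }
        }

        static double polymorphicDot(Scalar[] x, Scalar[] y) {
            Scalar acc = new DoubleScalar(0);
            for (int i = 0; i < x.length; i++) acc = acc.add(x[i].mul(y[i]));   // allocates per step
            return acc.value();
        }

        // "Lite" style: primitive arrays, no allocation in the inner loop.
        static double liteDot(double[] x, double[] y) {
            double acc = 0;
            for (int i = 0; i < x.length; i++) acc += x[i] * y[i];
            return acc;
        }

        public static void main(String[] args) {
            double[] x = { 1, 2, 3 }, y = { 4, 5, 6 };
            Scalar[] sx = new Scalar[3], sy = new Scalar[3];
            for (int i = 0; i < 3; i++) { sx[i] = new DoubleScalar(x[i]); sy[i] = new DoubleScalar(y[i]); }
            System.out.println(polymorphicDot(sx, sy) + " == " + liteDot(x, y));  // 32.0 == 32.0
        }
    }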

17 citations


Journal ArticleDOI
TL;DR: The Java-to-C Interface (JCI) tool is described, which provides application programmers wishing to use Java with immediate accessibility to existing scientific packages and facilitates rapid development and reuse of existing code.
Abstract: Recent developments in processor capabilities, software tools, programming languages and programming paradigms have brought about new approaches to high performance computing. A steadfast component of this dynamic evolution has been the scientific community’s reliance on established scientific packages. As a consequence, programmers of high-performance applications are reluctant to embrace evolving languages such as Java. This paper describes the Java-to-C Interface (JCI) tool which provides application programmers wishing to use Java with immediate accessibility to existing scientific packages. The JCI tool also facilitates rapid development and reuse of existing code. These benefits are provided at minimal cost to the programmer. While beneficial to the programmer, the additional advantages of mixed-language programming in terms of application performance and portability are addressed in detail within the context of this paper. In addition, we discuss how the JCI tool is complementing other ongoing projects such as IBM’s High-Performance Compiler for Java (HPCJ) and IceT’s metacomputing environment.
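Hand-written, the glue that JCI generates automatically corresponds to a standard JNI binding. The sketch below uses real JNI mechanisms (a native method declaration plus System.loadLibrary), but the library name sciplib and the routine ddot are placeholders for whatever scientific package is being wrapped; running it requires building the corresponding native library.

    /** Hand-written equivalent of the glue a Java-to-C interface generator
     *  produces: a JNI binding to a routine from an existing C library.
     *  "sciplib" and "ddot" are placeholder names. */
    public class NativeBlas {

        // Declared in Java, implemented in C; the C header is produced by javac -h.
        public static native double ddot(int n, double[] x, double[] y);

        static {
            System.loadLibrary("sciplib");   // loads libsciplib.so / sciplib.dll at run time
        }

        public static void main(String[] args) {
            double[] x = { 1, 2, 3 }, y = { 4, 5, 6 };
            System.out.println("dot = " + ddot(3, x, y));
        }
    }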

15 citations


Journal ArticleDOI
TL;DR: The design of the Impulse architecture is described, and it is shown how an Impulse memory system can improve the performance of memory-bound scientific applications, decreasing the running time of the NAS conjugate gradient benchmark by 67%.
Abstract: Impulse is a new memory system architecture that adds two important features to a traditional memory controller. First, Impulse supports application-specific optimizations through configurable physical address remapping. By remapping physical addresses, applications control how their data is accessed and cached, improving their cache and bus utilization. Second, Impulse supports prefetching at the memory controller, which can hide much of the latency of DRAM accesses. Because it requires no modification to processor, cache, or bus designs, Impulse can be adopted in conventional systems. In this paper we describe the design of the Impulse architecture, and show how an Impulse memory system can improve the performance of memory-bound scientific applications. For instance, Impulse decreases the running time of the NAS conjugate gradient benchmark by 67%. We expect that Impulse will also benefit regularly strided, memory-bound applications of commercial importance, such as database and multimedia programs.
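Because the remapping happens in hardware, there is no direct software API to show; the following Java sketch is only a software analogue of the access pattern Impulse optimizes, gathering scattered elements into a dense shadow array so that later traversals are unit-stride and cache-friendly.

    /** Software analogue of the access pattern Impulse remaps in hardware:
     *  an indexed gather a[idx[i]] made to look like a dense, unit-stride walk. */
    public class GatherSketch {

        /** Build the dense "shadow" that the memory controller would synthesize on the fly. */
        static double[] gather(double[] a, int[] idx) {
            double[] shadow = new double[idx.length];
            for (int i = 0; i < idx.length; i++) shadow[i] = a[idx[i]];
            return shadow;
        }

        public static void main(String[] args) {
            double[] a = { 10, 20, 30, 40, 50 };
            int[] idx = { 4, 0, 2 };             // scattered accesses
            double[] shadow = gather(a, idx);    // dense copy
            double sum = 0;
            for (double v : shadow) sum += v;    // unit-stride traversal
            System.out.println("sum = " + sum);  // 90.0
        }
    }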

15 citations




Journal ArticleDOI
TL;DR: CRAUL (Compiler and Run-Time Integration for Adaptation Under Load), a system that dynamically balances computational load in a parallel application that combines compile-time support to identify data access patterns with a run-time system that uses the access information to intelligently distribute the parallel workload in loop-based programs.
Abstract: Clusters of workstations provide a cost-effective, high performance parallel computing environment. These environments, however, are often shared by multiple users, or may consist of heterogeneous machines. As a result, parallel applications executing in these environments must operate despite unequal computational resources. For maximum performance, applications should automatically adapt execution to maximize use of the available resources. Ideally, this adaptation should be transparent to the application programmer. In this paper, we present CRAUL (Compiler and Run-Time Integration for Adaptation Under Load), a system that dynamically balances computational load in a parallel application. Our target run-time is software-based distributed shared memory (SDSM). SDSM is a good target for parallelizing compilers since it reduces compile-time complexity by providing data caching and other support for dynamic load balancing. CRAUL combines compile-time support to identify data access patterns with a run-time system that uses the access information to intelligently distribute the parallel workload in loop-based programs. The distribution is chosen according to the relative power of the processors and so as to minimize SDSM overhead and maximize locality. We have evaluated the resulting load distribution in the presence of different types of load - computational, computational and memory intensive, and network load. CRAUL performs within 5-23% of ideal in the presence of load, and is able to improve on naive compiler-based work distribution that does not take locality into account even in the absence of load.
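The kind of distribution CRAUL computes can be illustrated with a small Java sketch (a simplified stand-in, not the CRAUL run-time): loop iterations are split into contiguous chunks whose sizes are proportional to the measured relative power of each processor, which preserves locality while balancing load.

    import java.util.Arrays;

    /** Simplified sketch of power-proportional loop distribution: contiguous
     *  chunks sized by each processor's measured relative speed. */
    public class PowerProportionalSplit {

        /** Returns chunk boundaries: processor p gets iterations [cuts[p], cuts[p+1]). */
        static int[] split(int iterations, double[] relativePower) {
            double total = Arrays.stream(relativePower).sum();
            int[] cuts = new int[relativePower.length + 1];
            double acc = 0;
            for (int p = 0; p < relativePower.length; p++) {
                acc += relativePower[p];
                cuts[p + 1] = (int) Math.round(iterations * acc / total);
            }
            return cuts;
        }

        public static void main(String[] args) {
            // One loaded machine (half speed) and two idle ones.
            int[] cuts = split(1000, new double[] { 0.5, 1.0, 1.0 });
            System.out.println(Arrays.toString(cuts));   // [0, 200, 600, 1000]
        }
    }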

Journal ArticleDOI
TL;DR: U-Net/SLE (Safe Language Extensions), a user-level network interface architecture which enables per-application customization of communication semantics through downloading of user extension applets, implemented as Java classfiles, to the network interface, is described.
Abstract: We describe U-Net/SLE (Safe Language Extensions), a user-level network interface architecture which enables per-application customization of communication semantics through downloading of user extension applets, implemented as Java classfiles, to the network interface. This architecture permits applications to safely specify code to be executed within the NI on message transmission and reception. By leveraging the existing U-Net model, applications may implement protocol code at the user level, within the NI, or using some combination of the two. Our current implementation, using the Myricom Myrinet interface and a small Java Virtual Machine subset, allows host communication overhead to be reduced and improves the overlap of communication and computation during protocol processing.
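A rough sketch of what a per-application extension might look like is given below; the interface and method names are hypothetical, not the U-Net/SLE API, but they convey the idea of user-supplied Java code invoked by the network interface on receive and transmit.

    /** Illustrative sketch (hypothetical names, not the U-Net/SLE API) of a
     *  per-application extension invoked by the network interface. */
    public class SleSketch {

        interface MessageHandler {
            /** Called on each received packet; returns true to deliver the
             *  packet to the host, false to consume it on the NI. */
            boolean onReceive(byte[] packet);
            /** Called before transmission; may rewrite the packet in place. */
            void onTransmit(byte[] packet);
        }

        /** Example extension: drop zero-length keepalives without involving the host. */
        static final MessageHandler keepaliveFilter = new MessageHandler() {
            public boolean onReceive(byte[] packet) { return packet.length > 0; }
            public void onTransmit(byte[] packet) { /* nothing to rewrite */ }
        };

        public static void main(String[] args) {
            System.out.println(keepaliveFilter.onReceive(new byte[0]));   // false: handled on NI
            System.out.println(keepaliveFilter.onReceive(new byte[64]));  // true: pass to host
        }
    }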

Journal ArticleDOI
TL;DR: The compilation process and the target system description for Menhir, the Matlab language compiler, are presented and preliminary performances are given and compared with MCC, the MathWorks Matlab compiler.
Abstract: In this paper we present Menhir, a compiler for generating sequential or parallel code from the Matlab language. The compiler has been designed in the context of using Matlab as a specification language. One of the major features of Menhir is its retargetability to generate parallel and sequential C or Fortran code. We present the compilation process and the target system description for Menhir. Preliminary performances are given and compared with MCC, the MathWorks Matlab compiler.

Journal ArticleDOI
TL;DR: This work explores nested data-parallel implementations of the sparse matrix-vector product and the Barnes-Hut n-body algorithm by hand-coding thread-based and flattening-based versions of these algorithms and evaluating their performance on an SGI Origin 2000 and an NEC SX-4, two shared-memory machines.
Abstract: Modern dialects of Fortran enjoy wide use and good support on high-performance computers as performance-oriented programming languages. By providing the ability to express nested data parallelism, modern Fortran dialects enable irregular computations to be incorporated into existing applications with minimal rewriting and without sacrificing performance within the regular portions of the application. Since performance of nested data-parallel computation is unpredictable and often poor using current compilers, we investigate threading and flattening, two source-to-source transformation techniques that can improve performance and performance stability. For experimental validation of these techniques, we explore nested data-parallel implementations of the sparse matrix-vector product and the Barnes-Hut n-body algorithm by hand-coding thread-based (using OpenMP directives) and flattening-based versions of these algorithms and evaluating their performance on an SGI Origin 2000 and an NEC SX-4, two shared-memory machines.
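The sparse matrix-vector product mentioned above is the canonical nested data-parallel kernel: the outer loop over rows is parallel while the inner loops have irregular lengths. The following Java sketch (not the paper's Fortran/OpenMP code) uses a parallel stream as a stand-in for the thread-based version.

    import java.util.stream.IntStream;

    /** Minimal sketch of a nested data-parallel kernel: CSR sparse
     *  matrix-vector product with a parallel outer loop over rows. */
    public class SparseMatVec {

        /** y = A*x with A in compressed sparse row form (rowPtr, colIdx, val). */
        static double[] spmv(int[] rowPtr, int[] colIdx, double[] val, double[] x) {
            double[] y = new double[rowPtr.length - 1];
            IntStream.range(0, y.length).parallel().forEach(i -> {
                double s = 0;
                for (int k = rowPtr[i]; k < rowPtr[i + 1]; k++)
                    s += val[k] * x[colIdx[k]];    // irregular inner (nested) loop
                y[i] = s;
            });
            return y;
        }

        public static void main(String[] args) {
            // 2x3 matrix [[1 0 2], [0 3 0]] times x = [1, 1, 1]
            int[] rowPtr = { 0, 2, 3 };
            int[] colIdx = { 0, 2, 1 };
            double[] val = { 1, 2, 3 };
            double[] y = spmv(rowPtr, colIdx, val, new double[] { 1, 1, 1 });
            System.out.println(y[0] + ", " + y[1]);   // 3.0, 3.0
        }
    }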

Journal ArticleDOI
TL;DR: This work presents a novel framework for integrating task and data parallelism for applications that exhibit constrained dynamism, and has been implemented using Stampede, a cluster programming system developed at the Cambridge Research Laboratory.
Abstract: There is an emerging class of real-time interactive applications that require the dynamic integration of task and data parallelism. An example is the Smart Kiosk, a free-standing computer device that provides information and entertainment to people in public spaces. The kiosk interface is computationally demanding: it employs vision and speech sensing and an animated graphical talking face for output. The computational demands of an interactive kiosk can vary widely with the number of customers and the state of the interaction. Unfortunately this makes it difficult to apply current techniques for integrated task and data parallel computing, which can produce optimal decompositions for static problems. Using experimental results from a color-based people tracking module, we demonstrate the existence of a small number of distinct operating regimes in the kiosk application. We refer to this type of program behavior as constrained dynamism. An application exhibiting constrained dynamism can execute efficiently by dynamically switching among a small number of statically determined fixed data parallel strategies. We present a novel framework for integrating task and data parallelism for applications that exhibit constrained dynamism. Our solution has been implemented using Stampede, a cluster programming system developed at the Cambridge Research Laboratory.
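A minimal sketch of constrained dynamism, with placeholder regimes rather than the Smart Kiosk's actual decompositions: the application switches among a few statically determined strategies as its operating regime changes, instead of recomputing an optimal decomposition continuously.

    /** Minimal sketch of constrained dynamism: switch among a few statically
     *  determined parallelization strategies as the operating regime changes.
     *  Regime thresholds and strategy names are placeholders. */
    public class RegimeSwitch {

        enum Strategy { SINGLE_TRACKER_DATA_PARALLEL, FEW_TRACKERS_MIXED, MANY_TRACKERS_TASK_PARALLEL }

        static Strategy choose(int customers) {
            if (customers <= 1) return Strategy.SINGLE_TRACKER_DATA_PARALLEL;
            if (customers <= 4) return Strategy.FEW_TRACKERS_MIXED;
            return Strategy.MANY_TRACKERS_TASK_PARALLEL;
        }

        public static void main(String[] args) {
            for (int c : new int[] { 0, 3, 9 })
                System.out.println(c + " customers -> " + choose(c));
        }
    }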



Journal Article
TL;DR: A new theoretical equation of the surge impedance is derived, Z = 60·{log(h/r0) - 1} + ZeP(h, r0, β), and the theoretical values are found to coincide comparatively well with the measured ones.
Abstract: The tower surge impedance derived from the electromagnetic field theory doesn't always coincide with the measured values satisfactorily. The theory derived by Lundholm is the most famous one, and is believed to have been established, but it doesn't coincide with the measured values. We investigated his theory precisely, and found his theory was incorrect. He derived the loop voltage method and skillfully used the vector potential and the electric and magnetic fields. In particular, he combined the vector potential with the electric field; we clarified that this is where the errors came in. The vector potential is the quantity from which the magnetic field is derived, therefore the electric field must be derived from the magnetic field coiled around it. In most cases, undoubtedly the electric field can be calculated from the vector potential. In this case, however, the magnetic field is propagating, therefore the vector potential is also propagating, so that the electric field derived from the vector potential is the circulating local field. The electric field, therefore, must be calculated considering the propagation phenomena and simultaneity. We derived a new theoretical equation of the surge impedance, Z = 60·{log(h/r0) - 1} + ZeP(h, r0, β), and found the theoretical values coincide comparatively well with the measured ones.

Journal ArticleDOI
TL;DR: An algorithm is created, called a metaheuristic, which automatically chooses a scheduling heuristic for each input program and produces better schedules in general than the heuristics upon which it is based.
Abstract: Task mapping and scheduling are two very difficult problems that must be addressed when a sequential program is transformed into a parallel program. Since these problems are NP-hard, compiler writers have opted to concentrate their efforts on optimizations that produce immediate gains in performance. As a result, current parallelizing compilers either use very simple methods to deal with task scheduling or they simply ignore it altogether. Unfortunately, the programmer does not have this luxury. The burden of repartitioning or rescheduling, should the compiler produce inefficient parallel code, lies entirely with the programmer. We were able to create an algorithm (called a metaheuristic), which automatically chooses a scheduling heuristic for each input program. The metaheuristic produces better schedules in general than the heuristics upon which it is based. This technique was tested on a suite of real scientific programs written in SISAL and simulated on four different network configurations. Averaged over all of the test cases, the metaheuristic outperformed all eight underlying scheduling algorithms, beating the best one by 2%, 12%, 13%, and 3% on the four separate network configurations. It is able to do this, not always by picking the best heuristic, but rather by avoiding the heuristics when they would produce very poor schedules. For example, while the metaheuristic only picked the best algorithm about 50% of the time for the 100 Gbps Ethernet, its worst decision was only 49% away from optimal. In contrast, the best of the eight scheduling algorithms was optimal 30% of the time, but its worst decision was 844% away from optimal.
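The selection step of such a metaheuristic can be sketched in a few lines (an illustrative simplification, not the paper's implementation): estimate a schedule length for each candidate heuristic on the input program and keep the shortest, which avoids a heuristic's pathological cases even when the choice is not always optimal.

    import java.util.LinkedHashMap;
    import java.util.Map;

    /** Illustrative simplification of a metaheuristic: pick, per input
     *  program, the scheduling heuristic with the shortest estimated
     *  schedule.  Heuristic names and numbers below are placeholders. */
    public class MetaHeuristic {

        static String pick(Map<String, Double> estimatedLength) {
            String best = null;
            for (Map.Entry<String, Double> e : estimatedLength.entrySet())
                if (best == null || e.getValue() < estimatedLength.get(best))
                    best = e.getKey();
            return best;
        }

        public static void main(String[] args) {
            // Estimated schedule lengths (seconds) for one program on one
            // network configuration; purely illustrative values.
            Map<String, Double> est = new LinkedHashMap<>();
            est.put("list-scheduling", 9.4);
            est.put("clustering", 7.8);
            est.put("round-robin", 31.0);
            System.out.println(pick(est));   // clustering
        }
    }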




Journal ArticleDOI
TL;DR: This paper has implemented Flick, a flexible and optimizing IDL compiler, and is using it to produce specialized high-performance code for complex distributed applications, and believes that the special IDL compilation techniques developed for Khazana will be useful in other applications with similar communication requirements.
Abstract: Distributed applications are complex by nature, so it is essential that there be effective software development tools to aid in the construction of these programs. Commonplace “middleware” tools, however, often impose a tradeoff between programmer productivity and application performance. For instance, many CORBA IDL compilers generate code that is too slow for high-performance systems. More importantly, these compilers provide inadequate support for sophisticated patterns of communication. We believe that these problems can be overcome, thus making IDL compilers and similar middleware tools useful for a broader range of systems. To this end we have implemented Flick, a flexible and optimizing IDL compiler, and are using it to produce specialized high-performance code for complex distributed applications. Flick can produce specially “decomposed” stubs that encapsulate different aspects of communication in separate functions, thus providing application programmers with fine-grain control over all messages. The design of our decomposed stubs was inspired by the requirements of a particular distributed application called Khazana, and in this paper we describe our experience to date in refitting Khazana with Flick-generated stubs. We believe that the special IDL compilation techniques developed for Khazana will be useful in other applications with similar communication requirements. [1]This research was supported in part by the Defense Advanced Research Projects Agency, monitored by the Department of the Army under contract number DABT63-94-C-0058, and the Air Force Research Laboratory, Rome Research Site, USAF, under agreement number F30602-96-2-0269. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation hereon.
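A hypothetical example of what a decomposed stub looks like to the application programmer is sketched below; the operation, wire format, and names are invented for illustration and are not Flick-generated code. The point is that marshaling, transport, and unmarshaling are exposed as separate pieces the application can schedule, batch, or overlap itself.

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;

    /** Illustrative sketch (hypothetical names and wire format, not Flick
     *  output) of a "decomposed" stub: marshal, transport, and unmarshal
     *  are separate steps under application control. */
    public class DecomposedStub {

        /** Marshal a request for a hypothetical read(key) operation. */
        static byte[] marshalRead(String key) throws IOException {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            out.write(1);                       // operation code for "read"
            out.write(key.getBytes("UTF-8"));
            return out.toByteArray();
        }

        /** Transport is left to the caller: any channel that moves bytes works. */
        interface Channel { byte[] roundTrip(byte[] request); }

        /** Unmarshal the reply; in this toy format the reply is the value bytes. */
        static String unmarshalRead(byte[] reply) {
            return new String(reply);
        }

        public static void main(String[] args) throws IOException {
            Channel loopback = request -> "value-for-demo".getBytes();  // stand-in transport
            byte[] req = marshalRead("page42");
            System.out.println(unmarshalRead(loopback.roundTrip(req)));
        }
    }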

Journal ArticleDOI
TL;DR: A new compile-time analysis technique is presented that can be used to parallelize most of the loops left unparallelized by the Stanford SUIF compiler's automatic parallelization system, and is designed to produce low-cost, directed run-time tests that allow the system to defer binding of parallelization until run-time when safety cannot be proven statically.
Abstract: This paper demonstrates that significant improvements to automatic parallelization technology require that existing systems be extended in two ways: (1) they must combine high-quality compile-time analysis with low-cost run-time testing; and (2) they must take control flow into account during analysis. We support this claim with the results of an experiment that measures the safety of parallelization at run time for loops left unparallelized by the Stanford SUIF compiler’s automatic parallelization system. We present results of measurements on programs from two benchmark suites - SPECfp95 and NAS sample benchmarks - which identify inherently parallel loops in these programs that are missed by the compiler. We characterize remaining parallelization opportunities, and find that most of the loops require run-time testing, analysis of control flow, or some combination of the two. We present a new compile-time analysis technique that can be used to parallelize most of these remaining loops. This technique is designed to not only improve the results of compile-time parallelization, but also to produce low-cost, directed run-time tests that allow the system to defer binding of parallelization until run-time when safety cannot be proven statically. We call this approach predicated array data-flow analysis. We augment array data-flow analysis, which the compiler uses to identify independent and privatizable arrays, by associating predicates with array data-flow values. Predicated array data-flow analysis allows the compiler to derive “optimistic” data-flow values guarded by predicates; these predicates can be used to derive a run-time test guaranteeing the safety of parallelization. [1]This work has been supported by DARPA Contract DABT63-95-C-0118 and NSF Contract ACI-9721368.
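The run-time test idea can be sketched as follows (an illustrative Java analogue, not SUIF output): when the independence of loop iterations depends on a value unknown at compile time, the compiler emits a cheap predicate and defers the parallel-versus-sequential decision to run time.

    import java.util.stream.IntStream;

    /** Illustrative analogue of a compiler-emitted run-time test guarding
     *  parallel execution of a loop whose independence depends on a value
     *  unknown at compile time. */
    public class PredicatedParallelLoop {

        /** a[i*stride] += i for i in [0, n).  Iterations write distinct elements,
         *  and are therefore independent, exactly when stride != 0; the compiler
         *  cannot know stride statically, so the test is deferred to run time. */
        static void kernel(double[] a, int n, int stride) {
            boolean independent = stride != 0;                   // run-time predicate
            if (independent) {
                IntStream.range(0, n).parallel().forEach(i -> a[i * stride] += i);
            } else {
                for (int i = 0; i < n; i++) a[i * stride] += i;  // sequential fallback
            }
        }

        public static void main(String[] args) {
            double[] a = new double[16];
            kernel(a, 8, 2);                 // predicate holds: loop runs in parallel
            System.out.println(a[14]);       // 7.0
            kernel(a, 8, 0);                 // all iterations hit a[0]: stays sequential
            System.out.println(a[0]);        // 0+1+...+7 = 28.0
        }
    }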

Journal ArticleDOI
TL;DR: The results indicate that the EM program tends to become computation-intensive in the KSR-1 shared-memory system, and memory-demanding in the CM-5 data-parallel system when the systems and the problems are scaled.
Abstract: Shared-memory and data-parallel programming models are two important paradigms for scientific applications. Both models provide high-level program abstractions, and simple and uniform views of network structures. The common features of the two models significantly simplify program coding and debugging for scientific applications. However, the underlying execution and overhead patterns are significantly different between the two models due to their programming constraints, and due to different and complex structures of interconnection networks and systems which support the two models. We performed this experimental study to present implications and comparisons of execution patterns on two commercial architectures. We implemented a standard electromagnetic simulation program (EM) and a linear system solver using the shared-memory model on the KSR-1 and the data-parallel model on the CM-5. Our objectives are to examine the execution pattern changes required for an implementation transformation between the two models; to study memory access patterns; to address scalability issues; and to investigate relative costs and advantages/disadvantages of using the two models for scientific computations. Our results indicate that the EM program tends to become computation-intensive in the KSR-1 shared-memory system, and memory-demanding in the CM-5 data-parallel system when the systems and the problems are scaled. The EM program, a highly data-parallel program, performed extremely well, and the linear system solver, a highly control-structured program, suffered significantly in the data-parallel model on the CM-5. Our study provides further evidence that matching execution patterns of algorithms to parallel architectures would achieve better performance. [1]This work is supported in part by the National Science Foundation under grants CCR-9102854 and CCR-9400719, by the U.S. Air Force under research agreement FD-204092-64157, by Air Force Office of Scientific Research under grant AFOSR-95-01-0215, and by a grant from Cray Research. Part of the experiments were conducted on the CM-5 machines in Los Alamos National Laboratory and in the National Center for Supercomputing Applications at the University of Illinois, and on the KSR-1 machines at Cornell University and at the University of Washington.