
Showing papers on "Speedup" published in 1998


Journal ArticleDOI
TL;DR: The quality of the produced partitions and orderings is comparable to that produced by the serial multilevel algorithm, which has been shown to outperform both spectral partitioning and multiple minimum degree.

496 citations


Journal ArticleDOI
TL;DR: The RI‐J technique to approximate Coulomb interactions (by means of an auxiliary basis set approximation for the electron density) even shows superlinear speedup on distributed memory architectures.
Abstract: The parallelization of density functional treatments of molecular electronic energy and first-order gradients is described, and the performance is documented. The quadrature required for exchange correlation terms and the treatment of exact Coulomb interaction scales virtually linearly up to 100 nodes. The RI-J technique to approximate Coulomb interactions (by means of an auxiliary basis set approximation for the electron density) even shows superlinear speedup on distributed memory architectures. The bottleneck is then linear algebra. Demonstrative application examples include molecules with up to 300 atoms and 3000 basis functions that can now be treated in a few hours per geometry optimization cycle in C1 symmetry. © 1998 John Wiley & Sons, Inc. J Comput Chem 19: 1746–1757, 1998

480 citations
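
For context on the RI-J idea mentioned above: the electron density is expanded in an auxiliary basis, which factorizes the four-index Coulomb integrals into three- and two-index quantities. A standard statement of the approximation (added here for context; the abstract does not spell it out) is:

```latex
% RI-J Coulomb matrix: D is the density matrix, (mu nu | P) are three-index
% integrals over the auxiliary basis {P}, and V_{PQ} = (P|Q) is its metric.
J_{\mu\nu} \approx \sum_{P,Q} (\mu\nu \mid P)\,(V^{-1})_{PQ}\, c_Q,
\qquad
c_Q = \sum_{\lambda\sigma} (Q \mid \lambda\sigma)\, D_{\lambda\sigma}
```

Because the expensive four-index integrals drop out, the per-node working set shrinks, and better per-node cache behavior at higher processor counts is one plausible source of the superlinear speedup reported above.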


Proceedings Article
13 Jul 1998
TL;DR: This work generalizes the Grover iteration in the light of a concept called amplitude amplification, and shows that the quadratic speedup obtained by the quantum searching algorithm over classical brute force can still be obtained for a large family of search problems for which good classical heuristics exist.
Abstract: We study some extensions of Grover's quantum searching algorithm. First, we generalize the Grover iteration in the light of a concept called amplitude amplification. Then, we show that the quadratic speedup obtained by the quantum searching algorithm over classical brute force can still be obtained for a large family of search problems for which good classical heuristics exist. Finally, as our main result, we combine ideas from Grover's and Shor's quantum algorithms to perform approximate counting, which can be seen as an amplitude estimation process.

421 citations
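
The quadratic speedup is easy to see numerically. Below is a minimal NumPy sketch (ours, not the authors' code) of the bare Grover iteration on a classical statevector: an oracle sign-flip followed by inversion about the mean. Roughly (π/4)√N iterations drive the marked item's probability near 1, versus the ~N/2 probes a classical brute-force search needs.

```python
import numpy as np

N = 64                                # search space size
marked = 37                           # hypothetical marked element

state = np.full(N, 1 / np.sqrt(N))    # uniform superposition

iterations = int(round(np.pi / 4 * np.sqrt(N)))   # ~ (pi/4) * sqrt(64) = 6
for _ in range(iterations):
    state[marked] *= -1               # oracle: phase-flip the marked amplitude
    state = 2 * state.mean() - state  # diffusion: inversion about the mean

print(iterations, state[marked] ** 2) # 6 iterations, P(marked) ~ 0.997
```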


Proceedings ArticleDOI
01 Nov 1998
TL;DR: An architecture that features dynamic multithreading execution of a single program that minimizes the impact of ICache misses and branch mispredictions by fetching and dispatching instructions out-of-order and uses a novel value prediction and recovery mechanism to reduce artificial data dependencies created by the use of a stack to manage run-time storage is presented.
Abstract: We present an architecture that features dynamic multithreading execution of a single program. Threads are created automatically by hardware at procedure and loop boundaries and executed speculatively on a simultaneous multithreading pipeline. Data prediction is used to alleviate dependency constraints and enable lookahead execution of the threads. A two-level hierarchy significantly enlarges the instruction window. Efficient selective recovery from the second level instruction window takes place after a mispredicted input to a thread is corrected. The second level is slower to access but has the advantage of large storage capacity. We show several advantages of this architecture: (1) it minimizes the impact of ICache misses and branch mispredictions by fetching and dispatching instructions out-of-order, (2) it uses a novel value prediction and recovery mechanism to reduce artificial data dependencies created by the use of a stack to manage run-time storage, and (3) it improves the execution throughput of a superscalar by 15% without increasing the execution resources or cache bandwidth, and by 30% with one additional ICache fetch port. The speedup was measured on the integer SPEC95 benchmarks, without any compiler support, using a detailed performance simulator.

339 citations


Journal ArticleDOI
TL;DR: Two new algorithms based on the A* technique are described which are considerably faster, are more memory-efficient, and give optimal solutions for distributed task-to-processor assignment problems.
Abstract: A distributed system comprising networked heterogeneous processors requires efficient task-to-processor assignment to achieve fast turnaround time. Although reasonable heuristics exist to address optimal processor assignment for small problems, larger problems require better algorithms. The authors describe two new algorithms based on the A* technique which are considerably faster, are more memory-efficient, and give optimal solutions. The first is a sequential algorithm that reduces the search space. The second proposes to lower time complexity, by running the assignment algorithm in parallel, and achieves significant speedup. The authors test their results on a library of task graphs and processor topologies.

262 citations
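
Since the abstract does not spell out the search formulation, here is a hypothetical sketch of how A* applies to task assignment: a state is a partial assignment of the first i tasks, g is its accumulated execution-plus-communication cost, and the admissible heuristic h sums each unassigned task's cheapest execution cost (it ignores communication, so it never overestimates). The authors' actual algorithms add search-space reduction and parallel search on top of this basic scheme.

```python
import heapq

def assign(exec_cost, comm_cost):
    """exec_cost[t][p]: cost of task t on processor p.
    comm_cost[t1][t2]: cost incurred if t1, t2 land on different processors."""
    n = len(exec_cost)
    h0 = sum(min(row) for row in exec_cost)
    heap = [(h0, 0, ())]                   # (g + h, g, partial assignment)
    while heap:
        f, g, partial = heapq.heappop(heap)
        i = len(partial)
        if i == n:
            return g, partial              # admissible h: first goal is optimal
        for p in range(len(exec_cost[0])):
            g2 = g + exec_cost[i][p]
            g2 += sum(comm_cost[j][i] for j in range(i) if partial[j] != p)
            h2 = sum(min(exec_cost[t]) for t in range(i + 1, n))
            heapq.heappush(heap, (g2 + h2, g2, partial + (p,)))

exec_cost = [[5, 9], [8, 2], [4, 4]]       # toy instance: 3 tasks, 2 processors
comm_cost = [[0, 3, 1], [3, 0, 6], [1, 6, 0]]
print(assign(exec_cost, comm_cost))        # prints an optimal assignment, cost 15
```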


Journal ArticleDOI
TL;DR: A natural similarity function for shape matching is used, based on concepts from mathematical morphology, and it is shown how it can be lower-bounded by a set of shape features for safely pruning candidates, thus giving fast and correct output.
Abstract: We investigate the problem of retrieving similar shapes from a large database; in particular, we focus on medical tumor shapes (finding tumors that are similar to a given pattern). We use a natural similarity function for shape matching, based on concepts from mathematical morphology, and we show how it can be lower-bounded by a set of shape features for safely pruning candidates, thus giving fast and correct output. These features can be organized in a spatial access method, leading to fast indexing for range queries and nearest-neighbor queries. In addition to the lower-bounding, our second contribution is the design of a fast algorithm for nearest-neighbor searching, achieving significant speedup while provably guaranteeing correctness. Our experiments demonstrate that roughly 90% of the candidates can be pruned using these techniques, resulting in up to 27 times better performance compared to sequential scanning.

204 citations
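
The lower-bounding argument generalizes beyond shapes. A sketch of the pattern (with hypothetical cheap_lb and exact_dist placeholders, not the paper's morphological functions) shows why the output stays correct: a candidate is discarded only when even an underestimate of its distance already exceeds the best exact distance found.

```python
def nearest(query, database, cheap_lb, exact_dist):
    """Exact nearest neighbor, assuming cheap_lb(q, o) <= exact_dist(q, o)."""
    best, best_d = None, float("inf")
    # Visiting candidates in increasing lower-bound order tightens best_d fast.
    for obj in sorted(database, key=lambda o: cheap_lb(query, o)):
        if cheap_lb(query, obj) >= best_d:
            break                       # every remaining candidate is pruned
        d = exact_dist(query, obj)      # expensive computation, rarely reached
        if d < best_d:
            best, best_d = obj, d
    return best, best_d
```

Sorting by lower bound lets the loop terminate early, which is essentially how the paper's nearest-neighbor algorithm achieves significant speedup while provably guaranteeing correctness.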


Book ChapterDOI
30 Mar 1998
TL;DR: It is argued that the focus should be on on-line open systems, and proposed that a standard workload should be used as a benchmark for schedulers, which will specify distributions of parallelism and runtime, as found by analyzing accounting traces.
Abstract: The evaluation of parallel job schedulers hinges on two things: the use of appropriate metrics, and the use of appropriate workloads on which the scheduler can operate. We argue that the focus should be on on-line open systems, and propose that a standard workload should be used as a benchmark for schedulers. This benchmark will specify distributions of parallelism and runtime, as found by analyzing accounting traces, and also internal structures that create different speedup and synchronization characteristics. As for metrics, we present some problems with slowdown and bounded slowdown that have been proposed recently.

191 citations
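
For reference, the metrics under discussion are usually defined as follows (stated from the standard definitions rather than quoted from this paper); the threshold τ in the bounded variant, often set around 10 seconds, keeps very short jobs from inflating the metric:

```latex
% For a job with wait time t_w and run time t_r:
\mathit{slowdown} = \frac{t_w + t_r}{t_r},
\qquad
\mathit{bounded\ slowdown} =
  \max\!\left(\frac{t_w + t_r}{\max(t_r,\,\tau)},\; 1\right)
```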


Journal ArticleDOI
TL;DR: In this paper, the fundamental role of entanglement as the essential nonclassical feature providing the computational speedup in known quantum algorithms is discussed, and the construction of the Fourier...
Abstract: We discuss the fundamental role of entanglement as the essential nonclassical feature providing the computational speedup in the known quantum algorithms. We review the construction of the Fourier ...

161 citations


Journal ArticleDOI
01 Dec 1998
TL;DR: In this article, a methodology is proposed to convert the current bay layout into the desirable layout by moving the fewest possible containers over the shortest possible travel distance; the problem is decomposed into three sub-problems: bay matching, move planning, and task sequencing.
Abstract: In order to speed up the loading of export containers onto a ship, re-marshaling is a common practice in port container terminals. It is assumed that the current yard map for containers is available and that a desirable bay layout is provided. A methodology is proposed to convert the current bay layout into the desirable layout by moving the fewest possible containers over the shortest possible travel distance. The problem is decomposed into three sub-problems: bay matching, move planning, and task sequencing. Bay matching pairs a specific current bay with a bay configuration in the target layout. In the move planning stage, the number of containers to be moved from one bay to another is determined. In the final stage, the completion time of the re-marshaling operation is minimized by sequencing the moving tasks. A mathematical model is suggested for each sub-problem, and a numerical example is provided to illustrate the solution procedure.

161 citations


Journal ArticleDOI
TL;DR: This paper considers the problem of scheduling dynamic parallel computations to achieve linear speedup without using significantly more space per processor than that required for a single-processor execution and concludes that there exist multithreaded computations such that no execution schedule can simultaneously achieve efficient time and efficient space.
Abstract: This paper considers the problem of scheduling dynamic parallel computations to achieve linear speedup without using significantly more space per processor than that required for a single-processor execution. Utilizing a new graph-theoretic model of multithreaded computation, execution efficiency is quantified by three important measures: $T_1$ is the time required to execute the computation on one processor, $T_\infty$ is the time required by an infinite number of processors, and $S_1$ is the space required to execute the computation on one processor. A computation executed on P processors is time-efficient if the time is $O(T_1/P + T_\infty)$, that is, it achieves linear speedup when $P=O(T_1/T_\infty)$, and it is space-efficient if it uses $O(S_1 P)$ total space, that is, the space per processor is within a constant factor of that required for a one-processor execution. The first result derived from this model shows that there exist multithreaded computations such that no execution schedule can simultaneously achieve efficient time and efficient space. But by restricting attention to "strict" computations---those in which all arguments to a procedure must be available before the procedure can be invoked---much more positive results are obtainable. Specifically, for any strict multithreaded computation, a simple online algorithm can compute a schedule that is both time-efficient and space-efficient. Unfortunately, because the algorithm uses a global queue, the overhead of computing the schedule can be substantial. This problem is overcome by a decentralized algorithm that can compute and execute a P-processor schedule online in expected time $O(T_1/P + T_\infty \lg P)$ and worst-case space $O(S_1 P \lg P)$, including overhead costs.

154 citations
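
In symbols, the time- and space-efficiency conditions from the abstract, and the bounds achieved online by the decentralized scheduler, are:

```latex
% Efficiency conditions for a P-processor execution:
T_P = O\!\left(\tfrac{T_1}{P} + T_\infty\right)
\quad\text{(linear speedup once } P = O(T_1/T_\infty)\text{)},
\qquad
S_P = O(S_1 P).

% Bounds achieved online by the decentralized algorithm:
T_P = O\!\left(\tfrac{T_1}{P} + T_\infty \lg P\right),
\qquad
S_P = O(S_1 P \lg P).
```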


Journal ArticleDOI
TL;DR: This paper presents a multistage random sampling fuzzy c-means-based clustering algorithm, which significantly reduces the computation time required to partition a data set into c classes.

Journal ArticleDOI
TL;DR: This work describes the first parallel algorithm with optimal speedup for constructing minimum-width tree decompositions of graphs of bounded treewidth, and gives faster parallel algorithms for all decision problems expressible in monadic second-order logic.
Abstract: We describe the first parallel algorithm with optimal speedup for constructing minimum-width tree decompositions of graphs of bounded treewidth. On n-vertex input graphs, the algorithm works in O((log n)^2) time using O(n) operations on the EREW PRAM. We also give faster parallel algorithms with optimal speedup for the problem of deciding whether the treewidth of an input graph is bounded by a given constant and for a variety of problems on graphs of bounded treewidth, including all decision problems expressible in monadic second-order logic. On n-vertex input graphs, the algorithms use O(n) operations together with O(log n log* n) time on the EREW PRAM, or O(log n) time on the CRCW PRAM.

Proceedings ArticleDOI
16 Apr 1998
TL;DR: This paper describes the efficient communication and synchronization mechanisms implemented in the Multi-ALU Processor (MAP) chip, including a thread creation instruction, register communication, and a hardware barrier that provide 10 times faster communication and 60 times faster synchronization than mechanisms that operate via a shared on-chip cache.
Abstract: Much of the improvement in computer performance over the last twenty years has come from faster transistors and architectural advances that increase parallelism. Historically, parallelism has been exploited either at the instruction level with a grain-size of a single instruction or by partitioning applications into coarse threads with grain-sizes of thousands of instructions. Fine-grain threads fill the parallelism gap between these extremes by enabling tasks with run lengths as small as 20 cycles. As this fine-grain parallelism is orthogonal to ILP and coarse threads, it complements both methods and provides an opportunity for greater speedup. This paper describes the efficient communication and synchronization mechanisms implemented in the Multi-ALU Processor (MAP) chip, including a thread creation instruction, register communication, and a hardware barrier. These register-based mechanisms provide 10 times faster communication and 60 times faster synchronization than mechanisms that operate via a shared on-chip cache. With a three-processor implementation of the MAP, fine-grain speedups of 1.2-2.1 are demonstrated on a suite of applications.

Proceedings ArticleDOI
18 May 1998
TL;DR: It is demonstrated that with relatively simple arbitration algorithms and a speedup that is independent of the switch size, it is possible to ensure delay guarantees which are comparable to those available for output-buffered switches.
Abstract: We investigate some issues related to providing QoS guarantees in input-buffered crossbars with speedup. We show that a speedup of 4 is sufficient to ensure 100% asymptotic throughput with any maximal matching algorithm employed by the arbiter. We present several algorithms which ensure different delay guarantees with a range of speedup values between 2 and 6. We demonstrate that with relatively simple arbitration algorithms and a speedup that is independent of the switch size, it is possible to ensure delay guarantees which are comparable to those available for output-buffered switches.

Proceedings ArticleDOI
01 Nov 1998
TL;DR: This paper evaluates the x86 architecture's multimedia extension (MMX) instruction set on a set of benchmarks to understand which aspects of native signal processing instruction sets are most useful, the current limitations, and how they can be utilized most efficiently.
Abstract: Many current general purpose processors are using extensions to the instruction set architecture to enhance the performance of digital signal processing (DSP) and multimedia applications. In this paper, we evaluate the x86 architecture's multimedia extension (MMX) instruction set on a set of benchmarks. Our benchmark suite includes kernels (filtering, fast Fourier transforms, and vector arithmetic) and applications (JPEG compression, Doppler radar processing, imaging, and G.722 speech encoding). Each benchmark has at least one non-MMX version in C and an MMX version that makes calls to an MMX assembly library. The versions differ in the implementation of filtering, vector arithmetic, and other relevant kernels. The observed speedup for the MMX versions of the suite ranges from less than 1.0 to 6.1. In addition to quantifying the speedup, we perform detailed instruction-level profiling using Intel's VTune profiling tool. Using VTune, we profile static and dynamic instructions, microarchitecture operations, and data references to isolate the specific reasons for speedup or lack thereof. This analysis allows one to understand which aspects of native signal processing instruction sets are most useful, the current limitations, and how they can be utilized most efficiently.

Journal ArticleDOI
01 Oct 1998
TL;DR: Results of simulations have shown that on the average, for a modestly sized memo-table, about 40% of the floating point multiplications and 50% of the floating point divisions in Multi-Media applications can be avoided by using the values within the memo-table, leading to an average computational speedup of more than 20%.
Abstract: This paper proposes a technique that enables performing multi-cycle (multiplication, division, square-root …) computations in a single cycle. The technique is based on the notion of memoing: saving the input and output of previous calculations and using the output if the input is encountered again. This technique is especially suitable for Multi-Media (MM) processing. In MM applications the local entropy of the data tends to be low, which results in repeated operations on the same datum. The inputs and outputs of assembly level operations are stored in cache-like lookup tables and accessed in parallel to the conventional computation. A successful lookup gives the result of a multi-cycle computation in a single cycle, and a failed lookup doesn't necessitate a penalty in computation time. Results of simulations have shown that on the average, for a modestly sized memo-table, about 40% of the floating point multiplications and 50% of the floating point divisions in Multi-Media applications can be avoided by using the values within the memo-table, leading to an average computational speedup of more than 20%.
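
A software analogue makes the mechanism concrete. This sketch (ours; the paper describes a hardware structure) models the memo-table as a small direct-mapped cache over operand pairs: a hit stands in for the single-cycle path, a miss for the normal multi-cycle computation.

```python
class MemoTable:
    def __init__(self, entries=1024):
        self.entries = entries
        self.table = [None] * entries        # each slot: (a, b, result)

    def lookup_or_compute(self, a, b, op):
        slot = hash((a, b)) % self.entries   # stand-in for a hardware index
        cached = self.table[slot]
        if cached is not None and cached[0] == a and cached[1] == b:
            return cached[2], True           # hit: multi-cycle op avoided
        result = op(a, b)                    # miss: fall through to the FPU
        self.table[slot] = (a, b, result)
        return result, False

memo = MemoTable()
hits = 0
data = [0.5, 1.25, 0.5, 2.0, 0.5, 1.25] * 100   # low-entropy stream, as in MM
for x in data:
    _, hit = memo.lookup_or_compute(x, 3.1415, lambda a, b: a * b)
    hits += hit
print(f"hit rate: {hits / len(data):.0%}")      # high reuse gives a high hit rate
```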

Proceedings ArticleDOI
01 Jun 1998
TL;DR: The main advantage of the algorithms is their simplicity; they seem simpler than previous sequential algorithms with the same work bounds, and might therefore also be useful in a sequential context.
Abstract: We present parallel algorithms for union, intersection and difference on ordered sets using random balanced binary trees (treaps [26]). For two sets of size n and m (m ≤ n) the algorithms run in expected O(m lg(n/m)) work and O(lg n) depth (parallel time) on an EREW PRAM with scan operations (implying O(lg^2 n) depth on a plain EREW PRAM). As with the sequential algorithms on treaps for insertion and deletion, the main advantage of our algorithms is their simplicity. In fact, our algorithms for set operations seem simpler than previous sequential algorithms with the same work bounds, and might therefore also be useful in a sequential context. To analyze the effectiveness of the algorithms we implemented both sequential and parallel versions of the algorithms and ran several experiments on them. Our parallel implementation uses the Cilk [5] shared memory runtime system on a 16-processor SGI Power Challenge and a 6-processor Sun Ultra Enterprise 3000. It shows reasonable speedup: 6.3 to 6.8 on 8 processors of the SGI, and 4.1 to 4.4 on 5 processors of the Sun.
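
The treap-based union at the heart of the paper is short enough to sketch. The sequential version below (adapted to plain Python; the paper's parallel variant runs the two recursive union calls in parallel, which is where the O(lg n) depth comes from) uses split to partition one treap around the other's root key:

```python
import random

class Node:
    def __init__(self, key):
        self.key, self.prio = key, random.random()
        self.left = self.right = None

def split(t, key):
    """Partition treap t into (keys < key, keys > key); a duplicate of
    `key` itself is dropped, giving set (not multiset) semantics."""
    if t is None:
        return None, None
    if t.key < key:
        l, r = split(t.right, key)
        t.right = l
        return t, r
    if t.key > key:
        l, r = split(t.left, key)
        t.left = r
        return l, t
    return t.left, t.right

def union(a, b):
    if a is None: return b
    if b is None: return a
    if a.prio < b.prio:
        a, b = b, a                  # higher-priority root stays on top
    l, r = split(b, a.key)
    a.left = union(a.left, l)        # these two calls are independent:
    a.right = union(a.right, r)      # the parallel version forks them
    return a
```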

Proceedings ArticleDOI
12 Oct 1998
TL;DR: It is shown that prediction rate is not a good indicator of speedup because over 40% of predictions made may not be useful in enhancing performance, and a simple hardware mechanism that eliminates many of these useless predictions is introduced.
Abstract: Value prediction is a technique that bypasses inter-instruction data dependencies by speculating on the outcomes of producer instructions, thereby allowing dependent consumer instructions to execute in parallel. This work makes several contributions in value prediction research. A hybrid value predictor that achieves an overall prediction rate of up to 83% is presented. The design of a value-predicting eight-wide superscalar machine with its speculative execution core is described. This design is able to achieve 8.6% to 23% IPC improvements on the SPEC benchmarks. Furthermore, it is shown that prediction rate is not a good indicator of speedup because over 40% of predictions made may not be useful in enhancing performance, and a simple hardware mechanism that eliminates many of these useless predictions is introduced.
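
The hybrid predictor itself is easy to sketch in software (our sketch, not the paper's hardware design): a last-value component and a stride component share a per-instruction table, and small saturating counters pick whichever component has been more accurate, withholding a prediction when neither is confident. The paper's further point is that even accurate predictions can be useless for speedup if no consumer was waiting on the value, which is what their filtering hardware targets.

```python
class HybridValuePredictor:
    def __init__(self):
        self.last = {}     # pc -> last observed value
        self.stride = {}   # pc -> last observed stride
        self.conf = {}     # pc -> per-component saturating counters

    def predict(self, pc):
        if pc not in self.last:
            return None
        c = self.conf[pc]
        if max(c.values()) < 2:          # low confidence: don't speculate
            return None
        if c["stride"] >= c["last"]:
            return self.last[pc] + self.stride[pc]
        return self.last[pc]

    def update(self, pc, value):
        if pc in self.last:
            c = self.conf[pc]
            for name, guess in (("last", self.last[pc]),
                                ("stride", self.last[pc] + self.stride[pc])):
                # 2-bit saturating counters, as in branch predictors
                c[name] = min(c[name] + 1, 3) if guess == value else max(c[name] - 1, 0)
            self.stride[pc] = value - self.last[pc]
        else:
            self.conf[pc] = {"last": 0, "stride": 0}
            self.stride[pc] = 0
        self.last[pc] = value
```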

Book ChapterDOI
15 Apr 1998
TL;DR: An efficient parallel algorithm FPM (Fast Parallel Mining) for mining association rules on a shared-nothing parallel system is proposed, and it is found that the two pruning techniques are very sensitive to data skewness, which describes the degree of non-uniformity of the itemset distribution among the database partitions.
Abstract: An efficient parallel algorithm FPM (Fast Parallel Mining) for mining association rules on a shared-nothing parallel system is proposed. It adopts the count distribution approach and incorporates two powerful candidate pruning techniques, i.e., distributed pruning and global pruning. It has a simple communication scheme which performs only one round of message exchange in each iteration. We found that the two pruning techniques are very sensitive to data skewness, which describes the degree of non-uniformity of the itemset distribution among the database partitions. Distributed pruning is very effective when data skewness is high. Global pruning is more effective than distributed pruning even in the mild data skewness case. We have implemented the algorithm on an IBM SP2 parallel machine. The performance studies confirm our observation on the relationship between the effectiveness of the two pruning techniques and data skewness. They also show that FPM consistently outperforms CD (Count Distribution), a parallel version of the popular Apriori algorithm [2, 3]. Furthermore, FPM exhibits good speedup, scaleup, and sizeup behavior.

Journal ArticleDOI
TL;DR: A parallel pseudospectral code for calculating the 3-D wavefield by concurrent use of a number of processors is described; the parallel algorithm is based on a partition of the computational domain, where the field quantities are distributed over a number of processors and the calculation is done concurrently in each subdomain with interprocessor communications.
Abstract: Three-dimensional pseudospectral modeling for a realistic scale problem is still computationally very intensive, even when using current powerful computers. To overcome this, we have developed a parallel pseudospectral code for calculating the 3-D wavefield by concurrent use of a number of processors. The parallel algorithm is based on a partition of the computational domain, where the field quantities are distributed over a number of processors and the calculation is concurrently done in each subdomain with interprocessor communications. Experimental performance tests using three different styles of parallel computers achieved fairly good speedup compared with conventional computation on a single processor: a maximum speedup of 26 using 32 processors of a Thinking Machine CM-5 parallel computer, 1.6 using a Digital Equipment DEC-Alpha two-CPU workstation, and 4.6 using a cluster of eight Sun Microsystems SPARC-Station 10 (SPARC-10) workstations connected by an Ethernet. The result of this test agrees well with the performance theoretically predicted for each system. To demonstrate the feasibility of our parallel algorithm, we show three examples: 3-D acoustic and elastic modeling of fault-zone trapped waves and the calculation of elastic wave propagation in a 3-D syncline model.

Journal ArticleDOI
TL;DR: A framework combining the data-partitioning techniques used by most parallel join algorithms in relational databases and the filter-and-refine strategy for spatial operation processing is proposed for parallel spatial join processing.
Abstract: The cost of spatial join processing can be very high because of the large sizes of spatial objects and the computation-intensive spatial operations. While parallel processing seems a natural solution to this problem, it is not clear how spatial data can be partitioned for this purpose. Various spatial data partitioning methods are examined in this paper. A framework combining the data-partitioning techniques used by most parallel join algorithms in relational databases and the filter-and-refine strategy for spatial operation processing is proposed for parallel spatial join processing. Object duplication caused by multi-assignment in spatial data partitioning can result in extra CPU cost as well as extra communication cost. We find that the key to overcome this problem is to preserve spatial locality in task decomposition. In this paper we show that a near-optimal speedup can be achieved for parallel spatial join processing using our new algorithms.

Proceedings ArticleDOI
01 Nov 1998
TL;DR: A novel compiler algorithm which automatically inserts instruction prefetch instructions into the executable to prefetch the targets of control transfers far enough in advance is proposed, which results in speedups ranging from 9.4% to 18.5% over the original execution time on an out-of-order superscalar processor.
Abstract: Instruction cache miss latency is becoming an increasingly important performance bottleneck, especially for commercial applications. Although instruction prefetching is an attractive technique for tolerating this latency, we find that existing prefetching schemes are insufficient for modern superscalar processors since they fail to issue prefetches early enough (particularly for non-sequential accesses). To overcome these limitations, we propose a new instruction prefetching technique whereby the hardware and software cooperate to hide the latency as follows. The hardware performs aggressive sequential prefetching combined with a novel prefetch filtering mechanism to allow it to get far ahead without polluting the cache. To hide the latency of non-sequential accesses, we propose and implement a novel compiler algorithm which automatically inserts instruction prefetch instructions into the executable to prefetch the targets of control transfers far enough in advance. Our experimental results demonstrate that this new approach results in speedups ranging from 9.4% to 18.5% (13.3% on average) over the original execution time on an out-of-order superscalar processor, which is more than double the average speedup of the best existing schemes (6.5%). This is accomplished by hiding an average of 71% of the original instruction stall time, compared with only 36% for the best existing schemes. We find that both the prefetch filtering and compiler-inserted prefetching components of our design are essential and complementary, that the compiler can limit the code expansion to less than 10% on average, and that our scheme is robust with respect to variations in miss latency and bandwidth.

Journal ArticleDOI
TL;DR: In this paper, a pipelined parallelization of PHOENIX is described, where the necessary data from a previous wavelength point is sent to the processor working on the succeeding wavelength point as soon as it is known.
Abstract: We describe an important addition to the parallel implementation of our generalized nonlocal thermodynamic equilibrium (NLTE) stellar atmosphere and radiative transfer computer program PHOENIX. In a previous paper in this series we described data and task parallel algorithms we have developed for radiative transfer, spectral line opacity, and NLTE opacity and rate calculations. These algorithms divided the work spatially or by spectral lines, distributing the radial zones, individual spectral lines, or characteristic rays among different processors, and in addition employed task parallelism for logically independent functions (such as atomic and molecular line opacities). For finite, monotonic velocity fields, the radiative transfer equation is an initial value problem in wavelength, and hence each wavelength point depends upon the previous one. However, for sophisticated NLTE models of both static and moving atmospheres needed to accurately describe, e.g., novae and supernovae, the number of wavelength points is very large (200,000-300,000) and hence parallelization over wavelength can lead both to considerable speedup in calculation time and the ability to make use of the aggregate memory available on massively parallel supercomputers. Here, we describe an implementation of a pipelined design for the wavelength parallelization of PHOENIX, where the necessary data from the processor working on a previous wavelength point is sent to the processor working on the succeeding wavelength point as soon as it is known. Our implementation uses a MIMD design based on a relatively small number of standard message passing interface (MPI) library calls and is fully portable between serial and parallel computers.
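
The pipelined wavelength scheme can be illustrated with a few lines of mpi4py (an illustrative sketch, not PHOENIX code, which is a large MPI application): wavelength points are dealt round-robin to ranks, and each rank blocks only until its predecessor has finished the preceding point, so all ranks compute concurrently once the pipeline fills.

```python
# run with, e.g.: mpiexec -n 4 python pipeline.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

n_points = 16                            # wavelength points, dealt round-robin

def solve_wavelength(w, upstream_state):
    return upstream_state + 1            # stand-in for the radiative solve

for w in range(rank, n_points, size):    # this rank owns w = rank, rank+size, ...
    if w == 0:
        state = 0                        # initial condition at the first point
    else:                                # wait only for the preceding point
        state = comm.recv(source=(rank - 1) % size, tag=w - 1)
    state = solve_wavelength(w, state)
    if w + 1 < n_points:                 # forward the result as soon as known
        comm.send(state, dest=(rank + 1) % size, tag=w)
    else:
        print("last wavelength point done:", state)
```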

Proceedings ArticleDOI
01 Jun 1998
TL;DR: In this article, a parallel algorithm for mining association rules with classification hierarchy on a shared-nothing parallel machine is proposed, where the candidate itemsets are partitioned over the processors, which exploits the aggregate memory of the system effectively.
Abstract: Association rule mining has recently attracted strong attention. Usually, a classification hierarchy over the data items is available. Users are interested in generalized association rules that span different levels of the hierarchy, since sometimes more interesting rules can be derived by taking the hierarchy into account. In this paper, we propose new parallel algorithms for mining association rules with a classification hierarchy on a shared-nothing parallel machine. Our algorithms partition the candidate itemsets over the processors, which exploits the aggregate memory of the system effectively. If the candidate itemsets are partitioned without considering the classification hierarchy, both the items and all their ancestor items have to be transmitted, which causes a prohibitively large amount of communication. Our method minimizes interprocessor communication by taking the hierarchy into account. Moreover, our algorithm fully utilizes the available memory space by identifying frequently occurring candidate itemsets and copying them to all the processors, so that they can be processed locally without any communication. This effectively reduces load skew among the processors. Several experiments were done by changing the granularity of the copied itemsets, from the whole tree down to small groups of frequent itemsets along the hierarchy. The coarser the grain, the simpler the control, but sufficient load balance is harder to achieve; the finer the grain, the more complicated the control, but the load can be balanced quite well. We implemented the proposed algorithms on an IBM SP-2. Performance evaluations show that our algorithms are effective for handling skew and attain good speedup.

Proceedings ArticleDOI
07 Nov 1998
TL;DR: A preliminary investigation of the first multi-processor Tera MTA, finding that the compilers of both machines were able to find the necessary threads or vector operations, after making standard changes to the random number generator.
Abstract: The Tera MTA is a revolutionary commercial computer based on a multithreaded processor architecture. In contrast to many other parallel architectures, the Tera MTA can effectively use high amounts of parallelism on a single processor. By running multiple threads on a single processor, it can tolerate memory latency and keep the processor saturated. If the computation is sufficiently large, it can benefit from running on multiple processors. A primary architectural goal of the MTA is to provide scalable performance over multiple processors. This paper is a preliminary investigation of the first multi-processor Tera MTA. In a previous paper [1] we reported that on the kernel NAS 2 benchmarks [2], a single-processor MTA system running at the architected clock speed would be similar in performance to a single processor of the Cray T90. We found that the compilers of both machines were able to find the necessary threads or vector operations, after making standard changes to the random number generator. In this paper we update the single-processor results in two ways: we use only actual clock speeds, and we report improvements given by further tuning of the MTA codes. We then investigate the performance of the best single-processor codes when run on a two-processor MTA, making no further tuning effort. The parallel efficiency of the codes ranges from 77% to 99%. An analysis shows that the "serial bottlenecks" -- unparallelized code sections and the cost of allocating and freeing the parallel hardware resources -- account for less than a percent of the runtimes. Thus, Amdahl's Law needn't take effect on the NAS benchmarks until there are hundreds of processors running thousands of threads. Instead, the major source of inefficiency appears to be an imperfect network connecting the processors to the memory. Ideally, the network can support one memory reference per instruction. The current hardware has defects that reduce the throughput to about 85% of this rate. Except for the EP benchmark, the tuned codes issue memory references at nearly the peak rate of one per instruction. Consequently, the network can support the memory references issued by one, but not two, processors. As a result, the parallel efficiency of EP is near-perfect, but the others are reduced accordingly. Another reason for imperfect speedup pertains to the compiler. While the definition of a thread in single-processor or multi-processor mode is essentially the same, there is a different implementation and an associated overhead when running on multiple processors. We characterize the overhead of running "frays" (a collection of threads running on a single processor) and "crews" (a collection of frays, one per processor).
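
For reference, the Amdahl's Law bound the authors invoke (standard formula, not quoted from the paper): with a serial fraction s of the runtime, P processors can speed a code up by at most

```latex
S(P) = \frac{1}{\,s + (1 - s)/P\,}
```

With s below 0.01, as measured here, the bound still permits a speedup of about 50 even at P = 100, so on a two-processor system the observed inefficiency must come from elsewhere; in this case, the memory network's roughly 85% throughput.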

Journal ArticleDOI
TL;DR: In this paper, the Suzuki-Trotter decomposition of exponential operators is applied to the numerical integration of spin systems, allowing much larger time steps than the predictor-corrector method.
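
The decomposition in question (stated here from the standard Suzuki-Trotter formulas, not from the paper) splits the exponential of a sum of noncommuting operators A and B into exactly computable factors; to second order,

```latex
e^{(A+B)\Delta t} \;=\; e^{A\Delta t/2}\, e^{B\Delta t}\, e^{A\Delta t/2}
  \;+\; O(\Delta t^{3})
```

For spin dynamics each factor can be applied as an exact rotation of the spins, so the spin length is conserved exactly; that built-in stability is a plausible reason such integrators tolerate much larger time steps than a predictor-corrector method.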

Book ChapterDOI
27 May 1998
TL;DR: In this paper, ideas from Grover's and Shor's quantum algorithms are combined to perform approximate counting, which can be seen as an amplitude estimation process, and it is shown that the quadratic speedup obtained by the quantum searching algorithm over classical brute force can still be obtained for a large family of search problems for which good classical heuristics exist.
Abstract: We study some extensions of Grover's quantum searching algorithm. First, we generalize the Grover iteration in the light of a concept called amplitude amplification. Then, we show that the quadratic speedup obtained by the quantum searching algorithm over classical brute force can still be obtained for a large family of search problems for which good classical heuristics exist. Finally, as our main result, we combine ideas from Grover's and Shor's quantum algorithms to perform approximate counting, which can be seen as an amplitude estimation process.

Proceedings ArticleDOI
16 Aug 1998
TL;DR: It is proved that if the switch uses virtual output queueing, and has an internal speedup of just four, it is possible for it to behave identically to an output-queued switch, regardless of the nature of the arriving traffic.
Abstract: Architectures based on a non-blocking fabric, such as a crosspoint switch, are attractive for use in high-speed LAN switches, ATM switches and IP routers. These fabrics, coupled with memory bandwidth limitations, dictate that queues be placed at the input of the switch. But it is well known that input-queueing can lead to low throughput, and does not allow the control of latency through the switch. This is in contrast to output-queueing, which maximizes throughput, and permits the accurate control of packet latency through scheduling. We ask the question: can a switch with combined input and output queueing be designed to behave identically to an output-queued switch? In this paper, we prove that if the switch uses virtual output queueing, and has an internal speedup of just four, it is possible for it to behave identically to an output-queued switch, regardless of the nature of the arriving traffic. Our proof is based on a novel scheduling algorithm, known as most urgent cell first. This result makes possible switches that perform as if they were output-queued, yet use memories that run more slowly.

Proceedings ArticleDOI
01 Jun 1998
TL;DR: A framework is developed for processing user-defined functions with data parallelism; the class of partitionable functions that can be processed in parallel is described, together with an extension that speeds up the processing of another large class of functions by means of parallel sorting.
Abstract: Nowadays parallel object-relational DBMS are envisioned as the next great wave, but there is still a lack of efficient implementation concepts for some parts of the proposed functionality. Thus one of the current goals for parallel object-relational DBMS is to move towards higher performance. In this paper we develop a framework that allows user-defined functions to be processed with data parallelism. We describe the class of partitionable functions that can be processed in parallel. We also propose an extension which speeds up the processing of another large class of functions by means of parallel sorting. Functions that can be processed by means of our techniques are often used, for example, in decision support queries on large data volumes. Hence a parallel execution is indispensable.
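
The partitionable-function idea maps naturally onto a local/merge pair of callbacks. The sketch below (a hypothetical API; the paper targets a parallel ORDBMS executor, not Python) computes a user-defined average by running the local step on each data partition in parallel and merging the partial states:

```python
from multiprocessing import Pool

def local_avg(partition):
    """Runs independently on each data partition (the parallel step)."""
    return (sum(partition), len(partition))

def merge_avg(states):
    """Combines the per-partition partial states into the final result."""
    total = sum(s for s, _ in states)
    count = sum(n for _, n in states)
    return total / count

if __name__ == "__main__":
    partitions = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]   # data-parallel layout
    with Pool(len(partitions)) as pool:
        states = pool.map(local_avg, partitions)
    print("avg =", merge_avg(states))                # 5.0
```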

01 Jan 1998
TL;DR: It is proved that if virtual output queueing is used, a combined input-output queued switch is always work-conserving if its speedup is greater than .
Abstract: At very high aggregate bandwidths, output queueing is impractical because of insufficient memory bandwidth. This problem is getting worse: memory bandwidth is improving slowly, whereas the demand for network bandwidth continues to grow exponentially. The difficulty is that output-queued switches require memories that run at a speedup of N, where N is equal to the number of switch ports. This paper addresses the following question: Is it possible for a switch to exactly match output queueing with a reduced speedup? We prove that if virtual output queueing is used, a combined input-output queued switch is always work-conserving if its speedup is greater than . This result is proved using a novel scheduling algorithm, the Home Territory Algorithm (HTA).