
Showing papers on "Speedup" published in 1998


Journal ArticleDOI
TL;DR: The quality of the produced partitions and orderings is comparable to that produced by the serial multilevel algorithm, which has been shown to outperform both spectral partitioning and multiple minimum degree.

496 citations


Journal ArticleDOI
TL;DR: The RI‐J technique to approximate Coulomb interactions (by means of an auxiliary basis set approximation for the electron density) even shows superlinear speedup on distributed memory architectures.
Abstract: The parallelization of density functional treatments of molecular electronic energy and first-order gradients is described, and the performance is documented. The quadrature required for exchange correlation terms and the treatment of exact Coulomb interaction scales virtually linearly up to 100 nodes. The RI-J technique to approximate Coulomb interactions (by means of an auxiliary basis set approximation for the electron density) even shows superlinear speedup on distributed memory architectures. The bottleneck is then linear algebra. Demonstrative application examples include molecules with up to 300 atoms and 3000 basis functions that can now be treated in a few hours per geometry optimization cycle in C1 symmetry. © 1998 John Wiley & Sons, Inc. J Comput Chem 19: 1746–1757, 1998

480 citations
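
For context on the RI-J idea mentioned above: the electron density is expanded in an auxiliary basis, which factorizes the four-index Coulomb integrals into three- and two-index quantities. A standard statement of the approximation (added here for context; the abstract does not spell it out) is:

```latex
% RI-J Coulomb matrix: D is the density matrix, (mu nu | P) are three-index
% integrals over the auxiliary basis {P}, and V_{PQ} = (P|Q) is its metric.
J_{\mu\nu} \approx \sum_{P,Q} (\mu\nu \mid P)\,(V^{-1})_{PQ}\, c_Q,
\qquad
c_Q = \sum_{\lambda\sigma} (Q \mid \lambda\sigma)\, D_{\lambda\sigma}
```

Because the expensive four-index integrals drop out, the per-node working set shrinks, and better per-node cache behavior at higher processor counts is one plausible source of the superlinear speedup reported above.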


Proceedings Article
13 Jul 1998
TL;DR: This work generalizes the Grover iteration in the light of a concept called amplitude amplification, and shows that the quadratic speedup obtained by the quantum searching algorithm over classical brute force can still be obtained for a large family of search problems for which good classical heuristics exist.
Abstract: We study some extensions of Grover's quantum searching algorithm. First, we generalize the Grover iteration in the light of a concept called amplitude amplification. Then, we show that the quadratic speedup obtained by the quantum searching algorithm over classical brute force can still be obtained for a large family of search problems for which good classical heuristics exist. Finally, as our main result, we combine ideas from Grover's and Shor's quantum algorithms to perform approximate counting, which can be seen as an amplitude estimation process.

421 citations
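
The quadratic speedup is easy to see numerically. Below is a minimal NumPy sketch (ours, not the authors' code) of the bare Grover iteration on a classical statevector: an oracle sign-flip followed by inversion about the mean. Roughly (π/4)√N iterations drive the marked item's probability near 1, versus the ~N/2 probes a classical brute-force search needs.

```python
import numpy as np

N = 64                                # search space size
marked = 37                           # hypothetical marked element

state = np.full(N, 1 / np.sqrt(N))    # uniform superposition

iterations = int(round(np.pi / 4 * np.sqrt(N)))   # ~ (pi/4) * sqrt(64) = 6
for _ in range(iterations):
    state[marked] *= -1               # oracle: phase-flip the marked amplitude
    state = 2 * state.mean() - state  # diffusion: inversion about the mean

print(iterations, state[marked] ** 2) # 6 iterations, P(marked) ~ 0.997
```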


Proceedings ArticleDOI
01 Nov 1998
TL;DR: An architecture that features dynamic multithreading execution of a single program that minimizes the impact of ICache misses and branch mispredictions by fetching and dispatching instructions out-of-order and uses a novel value prediction and recovery mechanism to reduce artificial data dependencies created by the use of a stack to manage run-time storage is presented.
Abstract: We present an architecture that features dynamic multithreading execution of a single program. Threads are created automatically by hardware at procedure and loop boundaries and executed speculatively on a simultaneous multithreading pipeline. Data prediction is used to alleviate dependency constraints and enable lookahead execution of the threads. A two-level hierarchy significantly enlarges the instruction window. Efficient selective recovery from the second level instruction window takes place after a mispredicted input to a thread is corrected. The second level is slower to access but has the advantage of large storage capacity. We show several advantages of this architecture: (1) it minimizes the impact of ICache misses and branch mispredictions by fetching and dispatching instructions out-of-order, (2) it uses a novel value prediction and recovery mechanism to reduce artificial data dependencies created by the use of a stack to manage run-time storage, and (3) it improves the execution throughput of a superscalar by 15% without increasing the execution resources or cache bandwidth, and by 30% with one additional ICache fetch port. The speedup was measured on the integer SPEC95 benchmarks, without any compiler support, using a detailed performance simulator.

339 citations


Journal ArticleDOI
TL;DR: Two new algorithms based on the A* technique are described which are considerably faster, are more memory-efficient, and give optimal solutions for distributed task-to-processor assignment problems.
Abstract: A distributed system comprising networked heterogeneous processors requires efficient task-to-processor assignment to achieve fast turnaround time. Although reasonable heuristics exist to address optimal processor assignment for small problems, larger problems require better algorithms. The authors describe two new algorithms based on the A* technique which are considerably faster, are more memory-efficient, and give optimal solutions. The first is a sequential algorithm that reduces the search space. The second proposes to lower time complexity, by running the assignment algorithm in parallel, and achieves significant speedup. The authors test their results on a library of task graphs and processor topologies.

262 citations
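
Since the abstract does not spell out the search formulation, here is a hypothetical sketch of how A* applies to task assignment: a state is a partial assignment of the first i tasks, g is its accumulated execution-plus-communication cost, and the admissible heuristic h sums each unassigned task's cheapest execution cost (it ignores communication, so it never overestimates). The authors' actual algorithms add search-space reduction and parallel search on top of this basic scheme.

```python
import heapq

def assign(exec_cost, comm_cost):
    """exec_cost[t][p]: cost of task t on processor p.
    comm_cost[t1][t2]: cost incurred if t1, t2 land on different processors."""
    n = len(exec_cost)
    h0 = sum(min(row) for row in exec_cost)
    heap = [(h0, 0, ())]                   # (g + h, g, partial assignment)
    while heap:
        f, g, partial = heapq.heappop(heap)
        i = len(partial)
        if i == n:
            return g, partial              # admissible h: first goal is optimal
        for p in range(len(exec_cost[0])):
            g2 = g + exec_cost[i][p]
            g2 += sum(comm_cost[j][i] for j in range(i) if partial[j] != p)
            h2 = sum(min(exec_cost[t]) for t in range(i + 1, n))
            heapq.heappush(heap, (g2 + h2, g2, partial + (p,)))

exec_cost = [[5, 9], [8, 2], [4, 4]]       # toy instance: 3 tasks, 2 processors
comm_cost = [[0, 3, 1], [3, 0, 6], [1, 6, 0]]
print(assign(exec_cost, comm_cost))        # prints an optimal assignment, cost 15
```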


Journal ArticleDOI
TL;DR: A natural similarity function for shape matching is used, based on concepts from mathematical morphology, and it is shown how it can be lower-bounded by a set of shape features for safely pruning candidates, thus giving fast and correct output.
Abstract: We investigate the problem of retrieving similar shapes from a large database; in particular, we focus on medical tumor shapes (finding tumors that are similar to a given pattern). We use a natural similarity function for shape matching, based on concepts from mathematical morphology, and we show how it can be lower-bounded by a set of shape features for safely pruning candidates, thus giving fast and correct output. These features can be organized in a spatial access method, leading to fast indexing for range queries and nearest-neighbor queries. In addition to the lower-bounding, our second contribution is the design of a fast algorithm for nearest-neighbor searching, achieving significant speedup while provably guaranteeing correctness. Our experiments demonstrate that roughly 90% of the candidates can be pruned using these techniques, resulting in up to 27 times better performance compared to sequential scanning.

204 citations
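
The lower-bounding argument generalizes beyond shapes. A sketch of the pattern (with hypothetical cheap_lb and exact_dist placeholders, not the paper's morphological functions) shows why the output stays correct: a candidate is discarded only when even an underestimate of its distance already exceeds the best exact distance found.

```python
def nearest(query, database, cheap_lb, exact_dist):
    """Exact nearest neighbor, assuming cheap_lb(q, o) <= exact_dist(q, o)."""
    best, best_d = None, float("inf")
    # Visiting candidates in increasing lower-bound order tightens best_d fast.
    for obj in sorted(database, key=lambda o: cheap_lb(query, o)):
        if cheap_lb(query, obj) >= best_d:
            break                       # every remaining candidate is pruned
        d = exact_dist(query, obj)      # expensive computation, rarely reached
        if d < best_d:
            best, best_d = obj, d
    return best, best_d
```

Sorting by lower bound lets the loop terminate early, which is essentially how the paper's nearest-neighbor algorithm achieves significant speedup while provably guaranteeing correctness.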


Book ChapterDOI
30 Mar 1998
TL;DR: It is argued that the focus should be on on-line open systems, and proposed that a standard workload should be used as a benchmark for schedulers, which will specify distributions of parallelism and runtime, as found by analyzing accounting traces.
Abstract: The evaluation of parallel job schedulers hinges on two things: the use of appropriate metrics, and the use of appropriate workloads on which the scheduler can operate. We argue that the focus should be on on-line open systems, and propose that a standard workload should be used as a benchmark for schedulers. This benchmark will specify distributions of parallelism and runtime, as found by analyzing accounting traces, and also internal structures that create different speedup and synchronization characteristics. As for metrics, we present some problems with slowdown and bounded slowdown that have been proposed recently.

191 citations
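
For reference, the metrics under discussion are usually defined as follows (stated from the standard definitions rather than quoted from this paper); the threshold τ in the bounded variant, often set around 10 seconds, keeps very short jobs from inflating the metric:

```latex
% For a job with wait time t_w and run time t_r:
\mathit{slowdown} = \frac{t_w + t_r}{t_r},
\qquad
\mathit{bounded\ slowdown} =
  \max\!\left(\frac{t_w + t_r}{\max(t_r,\,\tau)},\; 1\right)
```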


Journal ArticleDOI
TL;DR: In this paper, the fundamental role of entanglement as the essential nonclassical feature providing the computational speedup in known quantum algorithms is discussed, and the construction of the Fourier...
Abstract: We discuss the fundamental role of entanglement as the essential nonclassical feature providing the computational speedup in the known quantum algorithms. We review the construction of the Fourier ...

161 citations


Journal ArticleDOI
01 Dec 1998
TL;DR: In this article, a methodology is proposed to convert the current bay layout into the desirable layout by moving the fewest possible containers over the shortest possible travel distance; the problem is decomposed into three sub-problems: bay matching, move planning, and task sequencing.
Abstract: In order to speed up the loading of export containers onto a ship, re-marshaling is a common practice in port container terminals. It is assumed that the current yard map for containers is available and that a desirable bay layout is provided. A methodology is proposed to convert the current bay layout into the desirable layout by moving the fewest possible containers over the shortest possible travel distance. The problem is decomposed into three sub-problems: bay matching, move planning, and task sequencing. Bay matching pairs a specific current bay with a bay configuration in the target layout. In the move planning stage, the number of containers to be moved from one bay to another is determined. In the final stage, the completion time of the re-marshaling operation is minimized by sequencing the moving tasks. A mathematical model is suggested for each sub-problem, and a numerical example is provided to illustrate the solution procedure.

161 citations


Journal ArticleDOI
TL;DR: This paper considers the problem of scheduling dynamic parallel computations to achieve linear speedup without using significantly more space per processor than that required for a single-processor execution and concludes that there exist multithreaded computations such that no execution schedule can simultaneously achieve efficient time and efficient space.
Abstract: This paper considers the problem of scheduling dynamic parallel computations to achieve linear speedup without using significantly more space per processor than that required for a single-processor execution. Utilizing a new graph-theoretic model of multithreaded computation, execution efficiency is quantified by three important measures: $T_1$ is the time required to execute the computation on one processor, $T_\infty$ is the time required by an infinite number of processors, and $S_1$ is the space required to execute the computation on one processor. A computation executed on P processors is time-efficient if the time is $O(T_1/P + T_\infty)$, that is, it achieves linear speedup when $P=O(T_1/T_\infty)$, and it is space-efficient if it uses $O(S_1 P)$ total space, that is, the space per processor is within a constant factor of that required for a one-processor execution. The first result derived from this model shows that there exist multithreaded computations such that no execution schedule can simultaneously achieve efficient time and efficient space. But by restricting attention to "strict" computations---those in which all arguments to a procedure must be available before the procedure can be invoked---much more positive results are obtainable. Specifically, for any strict multithreaded computation, a simple online algorithm can compute a schedule that is both time-efficient and space-efficient. Unfortunately, because the algorithm uses a global queue, the overhead of computing the schedule can be substantial. This problem is overcome by a decentralized algorithm that can compute and execute a P-processor schedule online in expected time $O(T_1/P + T_\infty \lg P)$ and worst-case space $O(S_1 P \lg P)$, including overhead costs.

154 citations
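
In symbols, the time- and space-efficiency conditions from the abstract, and the bounds achieved online by the decentralized scheduler, are:

```latex
% Efficiency conditions for a P-processor execution:
T_P = O\!\left(\tfrac{T_1}{P} + T_\infty\right)
\quad\text{(linear speedup once } P = O(T_1/T_\infty)\text{)},
\qquad
S_P = O(S_1 P).

% Bounds achieved online by the decentralized algorithm:
T_P = O\!\left(\tfrac{T_1}{P} + T_\infty \lg P\right),
\qquad
S_P = O(S_1 P \lg P).
```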


Journal ArticleDOI
TL;DR: This paper presents a multistage random sampling fuzzy c-means-based clustering algorithm, which significantly reduces the computation time required to partition a data set into c classes.

Journal ArticleDOI
TL;DR: This work describes the first parallel algorithm with optimal speedup for constructing minimum-width tree decompositions of graphs of bounded treewidth, and gives faster parallel algorithms for all decision problems expressible in monadic second-order logic.
Abstract: We describe the first parallel algorithm with optimal speedup for constructing minimum-width tree decompositions of graphs of bounded treewidth. On n-vertex input graphs, the algorithm works in O((log n)^2) time using O(n) operations on the EREW PRAM. We also give faster parallel algorithms with optimal speedup for the problem of deciding whether the treewidth of an input graph is bounded by a given constant and for a variety of problems on graphs of bounded treewidth, including all decision problems expressible in monadic second-order logic. On n-vertex input graphs, the algorithms use O(n) operations together with O(log n log* n) time on the EREW PRAM, or O(log n) time on the CRCW PRAM.

Proceedings ArticleDOI
16 Apr 1998
TL;DR: This paper describes the efficient communication and synchronization mechanisms implemented in the Multi-ALU Processor (MAP) chip, including a thread creation instruction, register communication, and a hardware barrier that provide 10 times faster communication and 60 times faster synchronization than mechanisms that operate via a shared on-chip cache.
Abstract: Much of the improvement in computer performance over the last twenty years has come from faster transistors and architectural advances that increase parallelism. Historically, parallelism has been exploited either at the instruction level with a grain-size of a single instruction or by partitioning applications into coarse threads with grain-sizes of thousands of instructions. Fine-grain threads fill the parallelism gap between these extremes by enabling tasks with run lengths as small as 20 cycles. As this fine-grain parallelism is orthogonal to ILP and coarse threads, it complements both methods and provides an opportunity for greater speedup. This paper describes the efficient communication and synchronization mechanisms implemented in the Multi-ALU Processor (MAP) chip, including a thread creation instruction, register communication, and a hardware barrier. These register-based mechanisms provide 10 times faster communication and 60 times faster synchronization than mechanisms that operate via a shared on-chip cache. With a three-processor implementation of the MAP, fine-grain speedups of 1.2-2.1 are demonstrated on a suite of applications.

Proceedings ArticleDOI
18 May 1998
TL;DR: It is demonstrated that with relatively simple arbitration algorithms and a speedup that is independent of the switch size, it is possible to ensure delay guarantees which are comparable to those available for output-buffered switches.
Abstract: We investigate some issues related to providing QoS guarantees in input-buffered crossbars with speedup. We show that a speedup of 4 is sufficient to ensure 100% asymptotic throughput with any maximal matching algorithm employed by the arbiter. We present several algorithms which ensure different delay guarantees with a range of speedup values between 2 and 6. We demonstrate that with relatively simple arbitration algorithms and a speedup that is independent of the switch size, it is possible to ensure delay guarantees which are comparable to those available for output-buffered switches.

Proceedings ArticleDOI
01 Nov 1998
TL;DR: This paper evaluates the x86 architecture's multimedia extension (MMX) instruction set on a set of benchmarks to understand which aspects of native signal processing instruction sets are most useful, the current limitations, and how they can be utilized most efficiently.
Abstract: Many current general purpose processors are using extensions to the instruction set architecture to enhance the performance of digital signal processing (DSP) and multimedia applications. In this paper, we evaluate the x86 architecture's multimedia extension (MMX) instruction set on a set of benchmarks. Our benchmark suite includes kernels (filtering, fast Fourier transforms, and vector arithmetic) and applications (JPEG compression, Doppler radar processing, imaging, and G.722 speech encoding). Each benchmark has at least one non-MMX version in C and an MMX version that makes calls to an MMX assembly library. The versions differ in the implementation of filtering, vector arithmetic, and other relevant kernels. The observed speedup for the MMX versions of the suite ranges from less than 1.0 to 6.1. In addition to quantifying the speedup, we perform detailed instruction-level profiling using Intel's VTune profiling tool. Using VTune, we profile static and dynamic instructions, microarchitecture operations, and data references to isolate the specific reasons for speedup or lack thereof. This analysis allows one to understand which aspects of native signal processing instruction sets are most useful, the current limitations, and how they can be utilized most efficiently.

Journal ArticleDOI
01 Oct 1998
TL;DR: Results of simulations have shown that on the average, for a modestly sized memo-table, about 40% of the floating point multiplications and 50% of the floating point divisions in Multi-Media applications can be avoided by using the values within the memo-table, leading to an average computational speedup of more than 20%.
Abstract: This paper proposes a technique that enables performing multi-cycle (multiplication, division, square-root …) computations in a single cycle. The technique is based on the notion of memoing: saving the input and output of previous calculations and using the output if the input is encountered again. This technique is especially suitable for Multi-Media (MM) processing. In MM applications the local entropy of the data tends to be low, which results in repeated operations on the same datum. The inputs and outputs of assembly level operations are stored in cache-like lookup tables and accessed in parallel to the conventional computation. A successful lookup gives the result of a multi-cycle computation in a single cycle, and a failed lookup doesn't necessitate a penalty in computation time. Results of simulations have shown that on the average, for a modestly sized memo-table, about 40% of the floating point multiplications and 50% of the floating point divisions in Multi-Media applications can be avoided by using the values within the memo-table, leading to an average computational speedup of more than 20%.
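
A software analogue makes the mechanism concrete. This sketch (ours; the paper describes a hardware structure) models the memo-table as a small direct-mapped cache over operand pairs: a hit stands in for the single-cycle path, a miss for the normal multi-cycle computation.

```python
class MemoTable:
    def __init__(self, entries=1024):
        self.entries = entries
        self.table = [None] * entries        # each slot: (a, b, result)

    def lookup_or_compute(self, a, b, op):
        slot = hash((a, b)) % self.entries   # stand-in for a hardware index
        cached = self.table[slot]
        if cached is not None and cached[0] == a and cached[1] == b:
            return cached[2], True           # hit: multi-cycle op avoided
        result = op(a, b)                    # miss: fall through to the FPU
        self.table[slot] = (a, b, result)
        return result, False

memo = MemoTable()
hits = 0
data = [0.5, 1.25, 0.5, 2.0, 0.5, 1.25] * 100   # low-entropy stream, as in MM
for x in data:
    _, hit = memo.lookup_or_compute(x, 3.1415, lambda a, b: a * b)
    hits += hit
print(f"hit rate: {hits / len(data):.0%}")      # high reuse gives a high hit rate
```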

Proceedings ArticleDOI
01 Jun 1998
TL;DR: The main advantage of the algorithms is their simplicity; they seem simpler than previous sequential algorithms with the same work bounds, and might therefore also be useful in a sequential context.
Abstract: We present parallel algorithms for union, intersection and difference on ordered sets using random balanced binary trees (treaps [26]). For two sets of size n and m (m ≤ n) the algorithms run in expected O(m lg(n/m)) work and O(lg n) depth (parallel time) on an EREW PRAM with scan operations (implying O(lg^2 n) depth on a plain EREW PRAM). As with the sequential algorithms on treaps for insertion and deletion, the main advantage of our algorithms is their simplicity. In fact, our algorithms for set operations seem simpler than previous sequential algorithms with the same work bounds, and might therefore also be useful in a sequential context. To analyze the effectiveness of the algorithms we implemented both sequential and parallel versions of the algorithms and ran several experiments on them. Our parallel implementation uses the Cilk [5] shared memory runtime system on a 16-processor SGI Power Challenge and a 6-processor Sun Ultra Enterprise 3000. It shows reasonable speedup: 6.3 to 6.8 on 8 processors of the SGI, and 4.1 to 4.4 on 5 processors of the Sun.
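
The treap-based union at the heart of the paper is short enough to sketch. The sequential version below (adapted to plain Python; the paper's parallel variant runs the two recursive union calls in parallel, which is where the O(lg n) depth comes from) uses split to partition one treap around the other's root key:

```python
import random

class Node:
    def __init__(self, key):
        self.key, self.prio = key, random.random()
        self.left = self.right = None

def split(t, key):
    """Partition treap t into (keys < key, keys > key); a duplicate of
    `key` itself is dropped, giving set (not multiset) semantics."""
    if t is None:
        return None, None
    if t.key < key:
        l, r = split(t.right, key)
        t.right = l
        return t, r
    if t.key > key:
        l, r = split(t.left, key)
        t.left = r
        return l, t
    return t.left, t.right

def union(a, b):
    if a is None: return b
    if b is None: return a
    if a.prio < b.prio:
        a, b = b, a                  # higher-priority root stays on top
    l, r = split(b, a.key)
    a.left = union(a.left, l)        # these two calls are independent:
    a.right = union(a.right, r)      # the parallel version forks them
    return a
```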

Proceedings ArticleDOI
12 Oct 1998
TL;DR: It is shown that prediction rate is not a good indicator of speedup because over 40% of predictions made may not be useful in enhancing performance, and a simple hardware mechanism that eliminates many of these useless predictions is introduced.
Abstract: Value prediction is a technique that bypasses inter-instruction data dependencies by speculating on the outcomes of producer instructions, thereby allowing dependent consumer instructions to execute in parallel. This work makes several contributions in value prediction research. A hybrid value predictor that achieves an overall prediction rate of up to 83% is presented. The design of a value-predicting eight-wide superscalar machine with its speculative execution core is described. This design is able to achieve 8.6% to 23% IPC improvements on the SPEC benchmarks. Furthermore, it is shown that prediction rate is not a good indicator of speedup because over 40% of predictions made may not be useful in enhancing performance, and a simple hardware mechanism that eliminates many of these useless predictions is introduced.
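
The hybrid predictor itself is easy to sketch in software (our sketch, not the paper's hardware design): a last-value component and a stride component share a per-instruction table, and small saturating counters pick whichever component has been more accurate, withholding a prediction when neither is confident. The paper's further point is that even accurate predictions can be useless for speedup if no consumer was waiting on the value, which is what their filtering hardware targets.

```python
class HybridValuePredictor:
    def __init__(self):
        self.last = {}     # pc -> last observed value
        self.stride = {}   # pc -> last observed stride
        self.conf = {}     # pc -> per-component saturating counters

    def predict(self, pc):
        if pc not in self.last:
            return None
        c = self.conf[pc]
        if max(c.values()) < 2:          # low confidence: don't speculate
            return None
        if c["stride"] >= c["last"]:
            return self.last[pc] + self.stride[pc]
        return self.last[pc]

    def update(self, pc, value):
        if pc in self.last:
            c = self.conf[pc]
            for name, guess in (("last", self.last[pc]),
                                ("stride", self.last[pc] + self.stride[pc])):
                # 2-bit saturating counters, as in branch predictors
                c[name] = min(c[name] + 1, 3) if guess == value else max(c[name] - 1, 0)
            self.stride[pc] = value - self.last[pc]
        else:
            self.conf[pc] = {"last": 0, "stride": 0}
            self.stride[pc] = 0
        self.last[pc] = value
```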

Book ChapterDOI
15 Apr 1998
TL;DR: An efficient parallel algorithm FPM (Fast Parallel Mining) for mining association rules on a shared-nothing parallel system is proposed, and it is found that the two pruning techniques are very sensitive to data skewness, which describes the degree of non-uniformity of the itemset distribution among the database partitions.
Abstract: An efficient parallel algorithm FPM (Fast Parallel Mining) for mining association rules on a shared-nothing parallel system is proposed. It adopts the count distribution approach and incorporates two powerful candidate pruning techniques, i.e., distributed pruning and global pruning. It has a simple communication scheme which performs only one round of message exchange in each iteration. We found that the two pruning techniques are very sensitive to data skewness, which describes the degree of non-uniformity of the itemset distribution among the database partitions. Distributed pruning is very effective when data skewness is high. Global pruning is more effective than distributed pruning even in the mild data skewness case. We have implemented the algorithm on an IBM SP2 parallel machine. The performance studies confirm our observation on the relationship between the effectiveness of the two pruning techniques and data skewness. They also show that FPM consistently outperforms CD (Count Distribution), a parallel version of the popular Apriori algorithm [2, 3]. Furthermore, FPM exhibits good speedup, scaleup, and sizeup behavior.

Journal ArticleDOI
TL;DR: A parallel pseudospectral code for calculating the 3-D wavefield by concurrent use of a number of processors is described; the parallel algorithm is based on a partition of the computational domain, where the field quantities are distributed over a number of processors and the calculation is done concurrently in each subdomain with interprocessor communications.
Abstract: Three-dimensional pseudospectral modeling for a realistic scale problem is still computationally very intensive, even when using current powerful computers. To overcome this, we have developed a parallel pseudospectral code for calculating the 3-D wavefield by concurrent use of a number of processors. The parallel algorithm is based on a partition of the computational domain, where the field quantities are distributed over a number of processors and the calculation is concurrently done in each subdomain with interprocessor communications. Experimental performance tests using three different styles of parallel computers achieved fairly good speedup compared with conventional computation on a single processor: a maximum speedup of 26 using 32 processors of a Thinking Machine CM-5 parallel computer, 1.6 using a Digital Equipment DEC-Alpha two-CPU workstation, and 4.6 using a cluster of eight Sun Microsystems SPARC-Station 10 (SPARC-10) workstations connected by an Ethernet. The result of this test agrees well with the performance theoretically predicted for each system. To demonstrate the feasibility of our parallel algorithm, we show three examples: 3-D acoustic and elastic modeling of fault-zone trapped waves and the calculation of elastic wave propagation in a 3-D syncline model.

Journal ArticleDOI
TL;DR: A framework combining the data-partitioning techniques used by most parallel join algorithms in relational databases and the filter-and-refine strategy for spatial operation processing is proposed for parallel spatial join processing.
Abstract: The cost of spatial join processing can be very high because of the large sizes of spatial objects and the computation-intensive spatial operations. While parallel processing seems a natural solution to this problem, it is not clear how spatial data can be partitioned for this purpose. Various spatial data partitioning methods are examined in this paper. A framework combining the data-partitioning techniques used by most parallel join algorithms in relational databases and the filter-and-refine strategy for spatial operation processing is proposed for parallel spatial join processing. Object duplication caused by multi-assignment in spatial data partitioning can result in extra CPU cost as well as extra communication cost. We find that the key to overcome this problem is to preserve spatial locality in task decomposition. In this paper we show that a near-optimal speedup can be achieved for parallel spatial join processing using our new algorithms.

Proceedings ArticleDOI
01 Nov 1998
TL;DR: A novel compiler algorithm which automatically inserts instruction prefetch instructions into the executable to prefetch the targets of control transfers far enough in advance is proposed, which results in speedups ranging from 9.4% to 18.5% over the original execution time on an out-of-order superscalar processor.
Abstract: Instruction cache miss latency is becoming an increasingly important performance bottleneck, especially for commercial applications. Although instruction prefetching is an attractive technique for tolerating this latency, we find that existing prefetching schemes are insufficient for modern superscalar processors since they fail to issue prefetches early enough (particularly for non-sequential accesses). To overcome these limitations, we propose a new instruction prefetching technique whereby the hardware and software cooperate to hide the latency as follows. The hardware performs aggressive sequential prefetching combined with a novel prefetch filtering mechanism to allow it to get far ahead without polluting the cache. To hide the latency of non-sequential accesses, we propose and implement a novel compiler algorithm which automatically inserts instruction prefetch instructions into the executable to prefetch the targets of control transfers far enough in advance. Our experimental results demonstrate that this new approach results in speedups ranging from 9.4% to 18.5% (13.3% on average) over the original execution time on an out-of-order superscalar processor, which is more than double the average speedup of the best existing schemes (6.5%). This is accomplished by hiding an average of 71% of the original instruction stall time, compared with only 36% for the best existing schemes. We find that both the prefetch filtering and compiler-inserted prefetching components of our design are essential and complementary, that the compiler can limit the code expansion to less than 10% on average, and that our scheme is robust with respect to variations in miss latency and bandwidth.

Journal ArticleDOI
TL;DR: In this paper, a pipelined parallelization of PHOENIX is described, where the necessary data from a previous wavelength point is sent to the processor working on the succeeding wavelength point as soon as it is known.
Abstract: We describe an important addition to the parallel implementation of our generalized nonlocal thermodynamic equilibrium (NLTE) stellar atmosphere and radiative transfer computer program PHOENIX. In a previous paper in this series we described data and task parallel algorithms we have developed for radiative transfer, spectral line opacity, and NLTE opacity and rate calculations. These algorithms divided the work spatially or by spectral lines, distributing the radial zones, individual spectral lines, or characteristic rays among different processors, and in addition employed task parallelism for logically independent functions (such as atomic and molecular line opacities). For finite, monotonic velocity fields, the radiative transfer equation is an initial value problem in wavelength, and hence each wavelength point depends upon the previous one. However, for sophisticated NLTE models of both static and moving atmospheres needed to accurately describe, e.g., novae and supernovae, the number of wavelength points is very large (200,000-300,000) and hence parallelization over wavelength can lead both to considerable speedup in calculation time and the ability to make use of the aggregate memory available on massively parallel supercomputers. Here, we describe an implementation of a pipelined design for the wavelength parallelization of PHOENIX, where the necessary data from the processor working on a previous wavelength point is sent to the processor working on the succeeding wavelength point as soon as it is known. Our implementation uses a MIMD design based on a relatively small number of standard message passing interface (MPI) library calls and is fully portable between serial and parallel computers.
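
The pipelined wavelength scheme can be illustrated with a few lines of mpi4py (an illustrative sketch, not PHOENIX code, which is a large MPI application): wavelength points are dealt round-robin to ranks, and each rank blocks only until its predecessor has finished the preceding point, so all ranks compute concurrently once the pipeline fills.

```python
# run with, e.g.: mpiexec -n 4 python pipeline.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

n_points = 16                            # wavelength points, dealt round-robin

def solve_wavelength(w, upstream_state):
    return upstream_state + 1            # stand-in for the radiative solve

for w in range(rank, n_points, size):    # this rank owns w = rank, rank+size, ...
    if w == 0:
        state = 0                        # initial condition at the first point
    else:                                # wait only for the preceding point
        state = comm.recv(source=(rank - 1) % size, tag=w - 1)
    state = solve_wavelength(w, state)
    if w + 1 < n_points:                 # forward the result as soon as known
        comm.send(state, dest=(rank + 1) % size, tag=w)
    else:
        print("last wavelength point done:", state)
```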

Proceedings ArticleDOI
01 Jun 1998
TL;DR: In this article, a parallel algorithm for mining association rules with classification hierarchy on a shared-nothing parallel machine is proposed, where the candidate itemsets are partitioned over the processors, which exploits the aggregate memory of the system effectively.
Abstract: Association rule mining has recently attracted strong attention. Usually, a classification hierarchy over the data items is available. Users are interested in generalized association rules that span different levels of the hierarchy, since sometimes more interesting rules can be derived by taking the hierarchy into account. In this paper, we propose new parallel algorithms for mining association rules with a classification hierarchy on a shared-nothing parallel machine. Our algorithms partition the candidate itemsets over the processors, which exploits the aggregate memory of the system effectively. If the candidate itemsets are partitioned without considering the classification hierarchy, both the items and all their ancestor items have to be transmitted, which causes a prohibitively large amount of communication. Our method minimizes interprocessor communication by taking the hierarchy into account. Moreover, our algorithm fully utilizes the available memory space by identifying frequently occurring candidate itemsets and copying them to all the processors, so that they can be processed locally without any communication. This effectively reduces load skew among the processors. Several experiments were done by changing the granularity of the copied itemsets, from the whole tree down to small groups of frequent itemsets along the hierarchy. The coarser the grain, the simpler the control, but sufficient load balance is harder to achieve; the finer the grain, the more complicated the control, but the load can be balanced quite well. We implemented the proposed algorithms on an IBM SP-2. Performance evaluations show that our algorithms are effective for handling skew and attain good speedup.

Proceedings ArticleDOI
07 Nov 1998
TL;DR: A preliminary investigation of the first multi-processor Tera MTA, finding that the compilers of both machines were able to find the necessary threads or vector operations, after making standard changes to the random number generator.
Abstract: The Tera MTA is a revolutionary commercial computer based on a multithreaded processor architecture. In contrast to many other parallel architectures, the Tera MTA can effectively use high amounts of parallelism on a single processor. By running multiple threads on a single processor, it can tolerate memory latency and keep the processor saturated. If the computation is sufficiently large, it can benefit from running on multiple processors. A primary architectural goal of the MTA is to provide scalable performance over multiple processors. This paper is a preliminary investigation of the first multi-processor Tera MTA. In a previous paper [1] we reported that on the kernel NAS 2 benchmarks [2], a single-processor MTA system running at the architected clock speed would be similar in performance to a single processor of the Cray T90. We found that the compilers of both machines were able to find the necessary threads or vector operations, after making standard changes to the random number generator. In this paper we update the single-processor results in two ways: we use only actual clock speeds, and we report improvements given by further tuning of the MTA codes. We then investigate the performance of the best single-processor codes when run on a two-processor MTA, making no further tuning effort. The parallel efficiency of the codes ranges from 77% to 99%. An analysis shows that the "serial bottlenecks" -- unparallelized code sections and the cost of allocating and freeing the parallel hardware resources -- account for less than a percent of the runtimes. Thus, Amdahl's Law needn't take effect on the NAS benchmarks until there are hundreds of processors running thousands of threads. Instead, the major source of inefficiency appears to be an imperfect network connecting the processors to the memory. Ideally, the network can support one memory reference per instruction. The current hardware has defects that reduce the throughput to about 85% of this rate. Except for the EP benchmark, the tuned codes issue memory references at nearly the peak rate of one per instruction. Consequently, the network can support the memory references issued by one, but not two, processors. As a result, the parallel efficiency of EP is near-perfect, but the others are reduced accordingly. Another reason for imperfect speedup pertains to the compiler. While the definition of a thread in single-processor or multi-processor mode is essentially the same, there is a different implementation and an associated overhead when running on multiple processors. We characterize the overhead of running "frays" (a collection of threads running on a single processor) and "crews" (a collection of frays, one per processor).
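
For reference, the Amdahl's Law bound the authors invoke (standard formula, not quoted from the paper): with a serial fraction s of the runtime, P processors can speed a code up by at most

```latex
S(P) = \frac{1}{\,s + (1 - s)/P\,}
```

With s below 0.01, as measured here, the bound still permits a speedup of about 50 even at P = 100, so on a two-processor system the observed inefficiency must come from elsewhere; in this case, the memory network's roughly 85% throughput.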

Journal ArticleDOI
TL;DR: In this paper, the Suzuki-Trotter decomposition of exponential operators is applied to the numerical integration of spin systems, allowing much larger time steps than the predictor-corrector method.
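
The decomposition in question (stated here from the standard Suzuki-Trotter formulas, not from the paper) splits the exponential of a sum of noncommuting operators A and B into exactly computable factors; to second order,

```latex
e^{(A+B)\Delta t} \;=\; e^{A\Delta t/2}\, e^{B\Delta t}\, e^{A\Delta t/2}
  \;+\; O(\Delta t^{3})
```

For spin dynamics each factor can be applied as an exact rotation of the spins, so the spin length is conserved exactly; that built-in stability is a plausible reason such integrators tolerate much larger time steps than a predictor-corrector method.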

Book ChapterDOI
27 May 1998
TL;DR: In this paper, ideas from Grover's and Shor's quantum algorithms are combined to perform approximate counting, which can be seen as an amplitude estimation process, and it is shown that the quadratic speedup obtained by the quantum searching algorithm over classical brute force can still be obtained for a large family of search problems for which good classical heuristics exist.
Abstract: We study some extensions of Grover's quantum searching algorithm. First, we generalize the Grover iteration in the light of a concept called amplitude amplification. Then, we show that the quadratic speedup obtained by the quantum searching algorithm over classical brute force can still be obtained for a large family of search problems for which good classical heuristics exist. Finally, as our main result, we combine ideas from Grover's and Shor's quantum algorithms to perform approximate counting, which can be seen as an amplitude estimation process.

Proceedings ArticleDOI
16 Aug 1998
TL;DR: It is proved that if the switch uses virtual output queueing, and has an internal speedup of just four, it is possible for it to behave identically to an output-queued switch, regardless of the nature of the arriving traffic.
Abstract: Architectures based on a non-blocking fabric, such as a crosspoint switch, are attractive for use in high-speed LAN switches, ATM switches and IP routers. These fabrics, coupled with memory bandwidth limitations, dictate that queues be placed at the input of the switch. But it is well known that input-queueing can lead to low throughput, and does not allow the control of latency through the switch. This is in contrast to output-queueing, which maximizes throughput, and permits the accurate control of packet latency through scheduling. We ask the question: can a switch with combined input and output queueing be designed to behave identically to an output-queued switch? In this paper, we prove that if the switch uses virtual output queueing, and has an internal speedup of just four, it is possible for it to behave identically to an output-queued switch, regardless of the nature of the arriving traffic. Our proof is based on a novel scheduling algorithm, known as most urgent cell first. This result makes possible switches that perform as if they were output-queued, yet use memories that run more slowly.

Proceedings ArticleDOI
01 Jun 1998
TL;DR: A framework is developed for processing user-defined functions with data parallelism; the class of partitionable functions that can be processed in parallel is described, together with an extension that speeds up the processing of another large class of functions by means of parallel sorting.
Abstract: Nowadays parallel object-relational DBMS are envisioned as the next great wave, but there is still a lack of efficient implementation concepts for some parts of the proposed functionality. Thus one of the current goals for parallel object-relational DBMS is to move towards higher performance. In this paper we develop a framework that allows user-defined functions to be processed with data parallelism. We describe the class of partitionable functions that can be processed in parallel. We also propose an extension which speeds up the processing of another large class of functions by means of parallel sorting. Functions that can be processed by means of our techniques are often used, for example, in decision support queries on large data volumes. Hence a parallel execution is indispensable.
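
The partitionable-function idea maps naturally onto a local/merge pair of callbacks. The sketch below (a hypothetical API; the paper targets a parallel ORDBMS executor, not Python) computes a user-defined average by running the local step on each data partition in parallel and merging the partial states:

```python
from multiprocessing import Pool

def local_avg(partition):
    """Runs independently on each data partition (the parallel step)."""
    return (sum(partition), len(partition))

def merge_avg(states):
    """Combines the per-partition partial states into the final result."""
    total = sum(s for s, _ in states)
    count = sum(n for _, n in states)
    return total / count

if __name__ == "__main__":
    partitions = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]   # data-parallel layout
    with Pool(len(partitions)) as pool:
        states = pool.map(local_avg, partitions)
    print("avg =", merge_avg(states))                # 5.0
```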

01 Jan 1998
TL;DR: It is proved that if virtual output queueing is used, a combined input-output queued switch is always work-conserving if its speedup is greater than .
Abstract: At very high aggregate bandwidths, output queueing is impractical because of insufficient memory bandwidth. This problem is getting worse: memory bandwidth is improving slowly, whereas the demand for network bandwidth continues to grow exponentially. The difficulty is that output-queued switches require memories that run at a speedup of N, where N is equal to the number of switch ports. This paper addresses the following question: Is it possible for a switch to exactly match output queueing with a reduced speedup? We prove that if virtual output queueing is used, a combined input-output queued switch is always work-conserving if its speedup is greater than . This result is proved using a novel scheduling algorithm, the Home Territory Algorithm (HTA).