
Showing papers on "Speedup" published in 2007


01 Jan 2007
TL;DR: In this article, the authors develop a broadly applicable parallel programming method, one that is easily applied to many different learning algorithms, such as locally weighted linear regression (LWLR), k-means, logistic regression (LR), naive Bayes (NB), SVM, ICA, PCA, Gaussian discriminant analysis (GDA), EM, and backpropagation (NN).
Abstract: We are at the beginning of the multicore era. Computers will have increasingly many cores (processors), but there is still no good programming framework for these architectures, and thus no simple and unified way for machine learning to take advantage of the potential speedup. In this paper, we develop a broadly applicable parallel programming method, one that is easily applied to many different learning algorithms. Our work is in distinct contrast to the tradition in machine learning of designing (often ingenious) ways to speed up a single algorithm at a time. Specifically, we show that algorithms that fit the Statistical Query model [15] can be written in a certain ‘summation form,’ which allows them to be easily parallelized on multicore computers. We adapt Google's map-reduce [7] paradigm to demonstrate this parallel speedup technique on a variety of learning algorithms including locally weighted linear regression (LWLR), k-means, logistic regression (LR), naive Bayes (NB), SVM, ICA, PCA, Gaussian discriminant analysis (GDA), EM, and backpropagation (NN). Our experimental results show essentially linear speedup with an increasing number of processors.
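
As an illustration of the 'summation form' idea (our own sketch, not the authors' code), the sufficient statistics of ordinary least squares decompose into per-shard sums that map cleanly onto a map-reduce style pool of workers:

```python
# Minimal sketch of the 'summation form' idea: each worker computes partial
# sufficient statistics over its data shard (map), and the partial sums are
# combined to solve ordinary least squares exactly (reduce).
import numpy as np
from multiprocessing import Pool

def partial_stats(shard):
    X, y = shard
    return X.T @ X, X.T @ y                       # per-shard sums

def parallel_ols(X, y, n_workers=4):
    shards = list(zip(np.array_split(X, n_workers),
                      np.array_split(y, n_workers)))
    with Pool(n_workers) as pool:
        parts = pool.map(partial_stats, shards)   # map phase
    A = sum(p[0] for p in parts)                  # reduce phase
    b = sum(p[1] for p in parts)
    return np.linalg.solve(A, b)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(10_000, 5))
    y = X @ np.arange(1.0, 6.0) + 0.01 * rng.normal(size=10_000)
    print(parallel_ols(X, y))                     # ≈ [1, 2, 3, 4, 5]
```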

381 citations


Proceedings ArticleDOI
17 Jun 2007
TL;DR: It is shown that iterative removal of pairwise reconstructions with the largest residual and re-registration removes most non-existent epipolar geometries.
Abstract: It is known that the problem of multiview reconstruction can be solved in two steps: first estimate camera rotations and then translations using them. This paper presents new robust techniques for both of these steps: (i) given pairwise relative rotations, global camera rotations are estimated linearly in least squares; (ii) camera translations are estimated using a standard technique based on Second Order Cone Programming. Robustness is achieved by using only a subset of points according to a new criterion that diminishes the risk of choosing a mismatch. It is shown that only four points chosen in a special way are sufficient to represent a pairwise reconstruction almost as well as all points. This leads to a significant speedup. In image sets with repetitive or similar structures, non-existent epipolar geometries may be found, and because of them some rotations, and consequently translations, may be estimated incorrectly. It is shown that iterative removal of pairwise reconstructions with the largest residual and re-registration removes most non-existent epipolar geometries. The performance of the proposed method is demonstrated on difficult wide-baseline image sets.
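
A compact sketch of the linear rotation-registration step, under the assumed convention R_j ≈ R_ij · R_i (our illustration; the paper adds robust point selection on top): stack the constraints into one least-squares system, fix the gauge, and project each solved block back onto SO(3).

```python
# Linear rotation registration sketch: solve A X = B in least squares, where
# the unknown X stacks the global rotations R_0..R_{n-1} as 3x3 blocks, then
# project each block onto the nearest rotation matrix.
import numpy as np

def project_so3(M):
    U, _, Vt = np.linalg.svd(M)                  # nearest rotation (Frobenius)
    R = U @ Vt
    if np.linalg.det(R) < 0:                     # enforce det = +1
        R = U @ np.diag([1.0, 1.0, -1.0]) @ Vt
    return R

def average_rotations(n, pairs):
    # pairs: list of (i, j, R_ij) with the constraint R_j - R_ij R_i = 0
    A = np.zeros((3 * len(pairs) + 3, 3 * n))
    B = np.zeros((3 * len(pairs) + 3, 3))
    for k, (i, j, R_ij) in enumerate(pairs):
        A[3*k:3*k+3, 3*j:3*j+3] = np.eye(3)      # +R_j
        A[3*k:3*k+3, 3*i:3*i+3] = -R_ij          # -R_ij R_i
    A[-3:, 0:3] = np.eye(3)                      # gauge: R_0 = I
    B[-3:] = np.eye(3)
    X, *_ = np.linalg.lstsq(A, B, rcond=None)
    return [project_so3(X[3*i:3*i+3]) for i in range(n)]
```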

369 citations


Journal ArticleDOI
TL;DR: It is found that evolution toward goals that change over time can dramatically speed up evolution compared with evolution toward a fixed goal, and a way to accelerate optimization algorithms and improve evolutionary approaches in engineering is suggested.
Abstract: Simulations of biological evolution, in which computers are used to evolve systems toward a goal, often require many generations to achieve even simple goals. It is therefore of interest to look for generic ways, compatible with natural conditions, in which evolution in simulations can be accelerated. Here, we study the impact of temporally varying goals on the speed of evolution, defined as the number of generations needed for an initially random population to achieve a given goal. Using computer simulations, we find that evolution toward goals that change over time can, in certain cases, dramatically speed up evolution compared with evolution toward a fixed goal. The highest speedup is found under modularly varying goals, in which goals change over time such that each new goal shares some of the subproblems with the previous goal. The speedup increases with the complexity of the goal: the harder the problem, the larger the speedup. Modularly varying goals seem to push populations away from local fitness maxima and guide them toward evolvable and modular solutions. This study suggests that varying environments might significantly contribute to the speed of natural evolution. In addition, it suggests a way to accelerate optimization algorithms and improve evolutionary approaches in engineering.
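
The modularly-varying-goals setup is easy to mimic in a toy genetic algorithm. The sketch below (hypothetical modules, parameters, and fitness, for illustration only) recomposes the target from a shared module pool every few generations, so successive goals share subproblems:

```python
# Toy GA under modularly varying goals: every `epoch` generations the 16-bit
# target is reassembled from a pool of 8-bit modules.
import random

MODULES = [[1] * 8, [0] * 8, [1, 0] * 4, [0, 1] * 4]

def new_goal():
    return random.choice(MODULES) + random.choice(MODULES)

def fitness(ind, goal):
    return sum(a == b for a, b in zip(ind, goal))

def evolve(pop_size=100, gens=500, epoch=20, mut=0.02):
    pop = [[random.randint(0, 1) for _ in range(16)] for _ in range(pop_size)]
    goal = new_goal()
    for t in range(gens):
        if t % epoch == 0:
            goal = new_goal()                    # switch to a related goal
        pop.sort(key=lambda ind: -fitness(ind, goal))
        survivors = pop[: pop_size // 2]
        children = [[b ^ (random.random() < mut) for b in random.choice(survivors)]
                    for _ in range(pop_size - len(survivors))]
        pop = survivors + children
    return max(fitness(ind, goal) for ind in pop)

print(evolve())
```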

328 citations


Proceedings ArticleDOI
10 Nov 2007
TL;DR: This paper presents AMPS, an operating system scheduler that efficiently supports both SMP- and NUMA-style performance-asymmetric architectures, and shows that AMPS improves fairness and repeatability of application performance measurements.
Abstract: Recent research advocates asymmetric multi-core architectures, where cores in the same processor can have different performance. These architectures support single-threaded performance and multithreaded throughput at lower costs (e.g., die size and power). However, they also pose unique challenges to operating systems, which traditionally assume homogeneous hardware. This paper presents AMPS, an operating system scheduler that efficiently supports both SMP- and NUMA-style performance-asymmetric architectures. AMPS contains three components: asymmetry-aware load balancing, faster-core-first scheduling, and NUMA-aware migration. We have implemented AMPS in Linux kernel 2.6.16 and used CPU clock modulation to emulate performance asymmetry on an SMP and NUMA system. For various workloads, we show that AMPS achieves a median speedup of 1.16 with a maximum of 1.44 over stock Linux on the SMP, and a median of 1.07 with a maximum of 2.61 on the NUMA system. Our results also show that AMPS improves fairness and repeatability of application performance measurements.

274 citations


Proceedings Article
03 Dec 2007
TL;DR: Using five real-world text corpora, it is shown that distributed learning works very well for LDA models, i.e., perplexity and precision-recall scores for distributed learning are indistinguishable from those obtained with single-processor learning.
Abstract: We investigate the problem of learning a widely used latent-variable model, the Latent Dirichlet Allocation (LDA) or "topic" model, using distributed computation, where each of P processors sees only 1/P of the total data set. We propose two distributed inference schemes that are motivated from different perspectives. The first scheme uses local Gibbs sampling on each processor with periodic updates; it is simple to implement and can be viewed as an approximation to a single-processor implementation of Gibbs sampling. The second scheme relies on a hierarchical Bayesian extension of the standard LDA model to directly account for the fact that data are distributed across P processors; it has a theoretical guarantee of convergence but is more complex to implement than the approximate method. Using five real-world text corpora, we show that distributed learning works very well for LDA models, i.e., perplexity and precision-recall scores for distributed learning are indistinguishable from those obtained with single-processor learning. Our extensive experimental results include large-scale distributed computation on 1000 virtual processors, and speedup experiments of learning topics in a 100-million-word corpus using 16 processors.
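
The count-merging logic of the first (approximate) scheme can be summarized in a few lines. In the sketch below, `local_gibbs_sweep` is a placeholder we introduce for a worker's Gibbs sweep over its shard, not an API from the paper; only the merge step is shown:

```python
# Schematic of approximate distributed LDA: each worker samples its shard
# against a stale copy of the global topic-word counts; the per-worker
# deltas are then summed back into the global table.
import numpy as np

def distributed_sweep(shards, global_counts, local_gibbs_sweep):
    deltas = []
    for shard in shards:                          # conceptually parallel
        local = local_gibbs_sweep(shard, global_counts.copy())
        deltas.append(local - global_counts)      # what this worker changed
    return global_counts + sum(deltas)            # merged new global counts
```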

264 citations


Proceedings ArticleDOI
17 Jun 2007
TL;DR: A new efficient MRF optimization algorithm, called Fast-PD, is proposed, which generalizes α-expansion and can also guarantee an almost optimal solution for a much wider class of NP-hard MRF problems.
Abstract: A new efficient MRF optimization algorithm, called Fast-PD, is proposed, which generalizes α-expansion. One of its main advantages is that it offers a substantial speedup over that method, e.g. it can be at least 3-9 times faster than α-expansion. Its efficiency is a result of the fact that Fast-PD exploits information coming not only from the original MRF problem, but also from a dual problem. Furthermore, besides static MRFs, it can also be used for boosting the performance of dynamic MRFs, i.e. MRFs varying over time. On top of that, Fast-PD makes no compromise about the optimality of its solutions: it can compute exactly the same answer as α-expansion, but, unlike that method, it can also guarantee an almost optimal solution for a much wider class of NP-hard MRF problems. Results on static and dynamic MRFs demonstrate the algorithm's efficiency and power. E.g., Fast-PD has been able to compute disparity for stereoscopic sequences in real time, with the resulting disparity coinciding with that of α-expansion.

225 citations


Journal ArticleDOI
TL;DR: Two fast sparse approximation schemes for the least squares support vector machine (LS-SVM) are presented to overcome the limitation of LS-SVM that it is not applicable to large data sets, and to improve test speed.
Abstract: In this paper, we present two fast sparse approximation schemes for least squares support vector machine (LS-SVM), named FSALS-SVM and PFSALS-SVM, to overcome the limitation of LS-SVM that it is not applicable to large data sets and to improve test speed. FSALS-SVM iteratively builds the decision function by adding one basis function from a kernel-based dictionary at a time. The process is terminated by using a flexible and stable epsilon-insensitive stopping criterion. A probabilistic speedup scheme is employed to further improve the speed of FSALS-SVM, and the resulting classifier is named PFSALS-SVM. Our algorithms have two compelling features: low complexity and sparse solutions. Experiments on benchmark data sets show that our algorithms obtain sparse classifiers at a rather low cost without sacrificing generalization performance.
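
The flavor of such greedy basis addition, though not the paper's exact algorithm, can be sketched as a least-squares matching pursuit over kernel columns with an epsilon-insensitive stop:

```python
# Generic greedy sparse approximation: add the kernel column most correlated
# with the current residual, refit by least squares, and stop once the
# residual improvement drops below an epsilon threshold.
import numpy as np

def greedy_kernel_fit(K, y, eps=1e-3, max_basis=50):
    # K: (n, n) kernel matrix whose columns are candidate basis functions
    chosen, coef = [], np.zeros(0)
    residual = y.astype(float)
    while len(chosen) < max_basis:
        scores = np.abs(K.T @ residual)
        scores[chosen] = -np.inf                  # never re-pick a basis
        j = int(np.argmax(scores))
        trial = chosen + [j]
        new_coef, *_ = np.linalg.lstsq(K[:, trial], y, rcond=None)
        new_residual = y - K[:, trial] @ new_coef
        if np.linalg.norm(residual) - np.linalg.norm(new_residual) < eps:
            break                                 # epsilon-insensitive stop
        chosen, coef, residual = trial, new_coef, new_residual
    return chosen, coef
```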

184 citations


Proceedings ArticleDOI
01 Dec 2007
TL;DR: This paper offers a new and pragmatic approach to leveraging coarse-grained pipeline parallelism in C programs in the domain of streaming applications, such as audio, video, and digital signal processing, which exhibit regular flows of data.
Abstract: The emergence of multicore processors has heightened the need for effective parallel programming practices. In addition to writing new parallel programs, the next generation of programmers will be faced with the overwhelming task of migrating decades' worth of legacy C code into a parallel representation. Addressing this problem requires a toolset of parallel programming primitives that can broadly apply to both new and existing programs. While tools such as threads and OpenMP allow programmers to express task and data parallelism, support for pipeline parallelism is distinctly lacking. In this paper, we offer a new and pragmatic approach to leveraging coarse-grained pipeline parallelism in C programs. We target the domain of streaming applications, such as audio, video, and digital signal processing, which exhibit regular flows of data. To exploit pipeline parallelism, we equip the programmer with a simple set of annotations (indicating pipeline boundaries) and a dynamic analysis that tracks all communication across those boundaries. Our analysis outputs a stream graph of the application as well as a set of macros for parallelizing the program and communicating the data needed. We apply our methodology to six case studies, including MPEG-2 decoding, MP3 decoding, GMTI radar processing, and three SPEC benchmarks. Our analysis extracts a useful block diagram for each application, and the parallelized versions offer a 2.78x mean speedup on a 4-core machine.
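
The same pipeline-parallel structure can be mimicked in Python with threads and bounded queues standing in for the paper's annotated stage boundaries (a sketch with hypothetical stage functions, not the paper's C toolchain):

```python
# Thread-and-queue mockup of coarse-grained pipeline parallelism: each
# pipeline boundary becomes a bounded queue between stage threads.
import threading, queue

DONE = object()                                   # end-of-stream sentinel

def stage(fn, q_in, q_out):
    while (item := q_in.get()) is not DONE:
        q_out.put(fn(item))
    q_out.put(DONE)                               # propagate shutdown

def run_pipeline(source, *stage_fns):
    qs = [queue.Queue(maxsize=64) for _ in range(len(stage_fns) + 1)]
    threads = [threading.Thread(target=stage, args=(fn, qs[i], qs[i + 1]))
               for i, fn in enumerate(stage_fns)]
    for t in threads:
        t.start()
    for item in source:       # feed stage 0 (drain concurrently for long streams)
        qs[0].put(item)
    qs[0].put(DONE)
    out = []
    while (item := qs[-1].get()) is not DONE:
        out.append(item)
    for t in threads:
        t.join()
    return out

print(run_pipeline(range(8), lambda x: x + 1, lambda x: 2 * x, lambda x: -x))
```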

183 citations


Proceedings ArticleDOI
10 Jun 2007
TL;DR: Exoskeleton Sequencer (EXO), an architecture to represent heterogeneous accelerators as ISA-based MIMD architecture resources, and C for Heterogeneous Integration (CHI), an integrated C/C++ programming environment that supports accelerator-specific inline assembly and domain-specific languages are presented.
Abstract: Future mainstream microprocessors will likely integrate specialized accelerators, such as GPUs, onto a single die to achieve better performance and power efficiency. However, it remains a keen challenge to program such a heterogeneous multicore platform, since these specialized accelerators feature ISAs and functionality that are significantly different from the general-purpose CPU cores. In this paper, we present EXOCHI: (1) Exoskeleton Sequencer (EXO), an architecture to represent heterogeneous accelerators as ISA-based MIMD architecture resources, and a shared virtual memory heterogeneous multithreaded program execution model that tightly couples specialized accelerator cores with general-purpose CPU cores, and (2) C for Heterogeneous Integration (CHI), an integrated C/C++ programming environment that supports accelerator-specific inline assembly and domain-specific languages. The CHI compiler extends the OpenMP pragma for heterogeneous multithreading programming, and produces a single fat binary with code sections corresponding to different instruction sets. The runtime can judiciously spread parallel computation across the heterogeneous cores to optimize performance and power. We have prototyped the EXO architecture on a physical heterogeneous platform consisting of an Intel® Core™ 2 Duo processor and an 8-core 32-thread Intel® Graphics Media Accelerator X3000. In addition, we have implemented the CHI integrated programming environment with the Intel® C++ Compiler, runtime toolset, and debugger. On the EXO prototype system, we have enhanced a suite of production-quality media kernels for video and image processing to utilize the accelerator through the CHI programming interface, achieving significant speedup (1.41x to 10.97x) over execution on the IA32 CPU alone.

169 citations


Book ChapterDOI
06 Jun 2007
TL;DR: A dynamic technique for fast route planning in large road networks is introduced that can handle the practically relevant scenarios that arise in present-day navigation systems and has outstandingly low memory requirements of only a few bytes per node.
Abstract: We introduce a dynamic technique for fast route planning in large road networks. For the first time, it is possible to handle the practically relevant scenarios that arise in present-day navigation systems: when an edge weight changes (e.g., due to a traffic jam), we can update the preprocessed information in 2-40 ms, allowing subsequent fast queries in about one millisecond on average. When we want to perform only a single query, we can skip the comparatively expensive update step and directly perform a prudent query that automatically takes the changed situation into account. If the overall cost function changes (e.g., due to a different vehicle type), recomputing the preprocessed information typically takes less than two minutes. The foundation of our dynamic method is a new static approach that generalises and combines several previous speedup techniques. It has outstandingly low memory requirements of only a few bytes per node.

167 citations


Journal ArticleDOI
TL;DR: An extension of the speedup technique to multiple levels of partitions that can be seen as a compression of the precomputed data that preserves the correctness of the computed shortest paths is presented.
Abstract: We study an acceleration method for point-to-point shortest-path computations in large and sparse directed graphs with given nonnegative arc weights. The acceleration method is called the arc-flag approach and is based on Dijkstra's algorithm. In the arc-flag approach, we allow a preprocessing of the network data to generate additional information, which is then used to speed up shortest-path queries. In the preprocessing phase, the graph is divided into regions and information is gathered on whether an arc is on a shortest path into a given region. The arc-flag method combined with an appropriate partitioning and a bidirected search achieves an average speedup factor of more than 500 compared to the standard algorithm of Dijkstra on large networks (1 million nodes, 2.5 million arcs). This combination narrows down the search space of Dijkstra's algorithm to almost the size of the corresponding shortest path for long-distance shortest-path queries. We conduct an experimental study that evaluates which partitionings are best suited for the arc-flag method. In particular, we examine partitioning algorithms from computational geometry and a multiway arc separator partitioning. The evaluation was done on German road networks. The impact of different partitions on the speedup of the shortest-path algorithm is compared. Furthermore, we present an extension of the speedup technique to multiple levels of partitions. With this multilevel variant, the same speedup factors can be achieved with smaller space requirements. It can, therefore, be seen as a compression of the precomputed data that preserves the correctness of the computed shortest paths.
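
The query-time side of the arc-flag idea fits in a few lines. The following sketch assumes a hypothetical adjacency encoding in which each arc carries its set of flagged regions:

```python
# Arc-flag pruned Dijkstra: an arc is relaxed only if its flag for the
# target's region is set, i.e. only if preprocessing found it can lie on a
# shortest path into that region.
import heapq

def arcflag_dijkstra(adj, region_of, s, t):
    # adj[u] = [(v, weight, flags)], flags = set of region ids for this arc
    target_region = region_of[t]
    dist = {s: 0}
    pq = [(0, s)]
    while pq:
        d, u = heapq.heappop(pq)
        if u == t:
            return d
        if d > dist.get(u, float("inf")):
            continue                              # stale heap entry
        for v, w, flags in adj[u]:
            if target_region not in flags:
                continue                          # pruned by the arc flag
            if d + w < dist.get(v, float("inf")):
                dist[v] = d + w
                heapq.heappush(pq, (d + w, v))
    return float("inf")
```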

Journal ArticleDOI
TL;DR: This survey paper compares native double precision solvers with emulated- and mixed-precision solvers of linear systems of equations as they typically arise in finite element discretisations and concludes that the mixed precision approach works very well with the parallel co-processors gaining speedup factors and area savings, while maintaining the same accuracy as a reference solver executing everything in double precision.
Abstract: In this survey paper, we compare native double precision solvers with emulated- and mixed-precision solvers of linear systems of equations as they typically arise in finite element discretisations. The emulation utilises two single float numbers to achieve higher precision, while the mixed precision iterative refinement computes residuals and updates the solution vector in double precision but solves the residual systems in single precision. Both techniques have been known since the 1960s, but little attention has been devoted to their performance aspects. Motivated by changing paradigms in processor technology and the emergence of highly parallel devices with outstanding single float performance, we adapt the emulation and mixed precision techniques to coupled hardware configurations, where the parallel devices serve as scientific co-processors. The performance advantages are examined with respect to speedups over a native double precision implementation (time aspect) and reduced area requirements for a chip (space aspect). The paper begins with an overview of the theoretical background, algorithmic approaches and suitable hardware architectures. We then employ several conjugate gradient (CG) and multigrid solvers and study their behaviour for different parameter settings of the iterative refinement technique. Concrete speedup factors are evaluated on the coupled hardware configuration of a general-purpose CPU and a graphics processor. The dual performance aspect of potential area savings is assessed on a field programmable gate array (FPGA). In the last part, we test the applicability of the proposed mixed precision schemes with ill-conditioned matrices. We conclude that the mixed precision approach works very well with the parallel co-processors, gaining speedup factors of four to five and area savings of three to four, while maintaining the same accuracy as a reference solver executing everything in double precision.
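
The mixed precision iterative refinement loop itself is short. Below is a minimal NumPy sketch with float32 standing in for the co-processor precision; a production version would factor the single-precision matrix once rather than re-solving it every iteration:

```python
# Mixed precision iterative refinement: solve in float32 (the "fast"
# precision), accumulate residuals and solution updates in float64.
import numpy as np

def mixed_precision_solve(A, b, iters=10):
    A32 = A.astype(np.float32)
    x = np.zeros_like(b, dtype=np.float64)
    for _ in range(iters):
        r = b - A @ x                                   # residual in double
        d = np.linalg.solve(A32, r.astype(np.float32))  # cheap single solve
        x += d.astype(np.float64)                       # update in double
    return x
```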

Journal ArticleDOI
TL;DR: In this paper, a general low-thrust trade analysis tool is developed based on a global search for local indirect method solutions, and an efficient propagator is implemented with an implicit "bang-bang" thrusting structure with an a priori unknown number of switching times.
Abstract: The low-thrust spacecraft trajectory problem can be reduced to only a few parameters using calculus of variations and the well-known primer vector theory. This low dimensionality combined with the extraordinary speed of modern computers allows for rapid exploration of the parameter space and invites opportunities for global optimization. Accordingly, a general low-thrust trade analysis tool is developed based on a global search for local indirect method solutions. An efficient propagator is implemented with an implicit "bang-bang" thrusting structure that accommodates an a priori unknown number of switching times. An extension to the standard adjoint control transformation is introduced that provides additional physical insight and control over the anticipated evolution of the thrust profile. The uniformly random search enjoys a perfect linear speedup under parallel implementation. The method is applied specifically to multirevolution transfers in the Jupiter-Europa and Earth-Moon restricted three-body problems. In both cases, thousands of solutions are found in a single parallel run. The result is a global front of Pareto-optimal solutions across the competing objectives of flight time and final mass.
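
The "perfect linear speedup" claim comes from the embarrassingly parallel structure of multistart search: each random initial guess is refined independently. A schematic version, with `local_solve` as a placeholder for the paper's indirect-method solver:

```python
# Parallel multistart sketch: independent local solves from random initial
# guesses, farmed out to a process pool.
from multiprocessing import Pool
import random

def local_solve(seed):
    rng = random.Random(seed)
    guess = [rng.uniform(-1, 1) for _ in range(6)]   # random costate guess
    # ... a real solver would iterate from `guess` here ...
    cost = sum(g * g for g in guess)                 # placeholder objective
    return cost, guess

if __name__ == "__main__":
    with Pool() as pool:
        solutions = pool.map(local_solve, range(10_000))
    best_cost, best_guess = min(solutions)
    print(best_cost)
```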

Journal ArticleDOI
TL;DR: The WaveScalar instruction set is presented and a simulated implementation based on current technology is evaluated, finding that for single-threaded applications, the WaveCache achieves performance on par with conventional processors, but in less area.
Abstract: Silicon technology will continue to provide an exponential increase in the availability of raw transistors. Effectively translating this resource into application performance, however, is an open challenge that conventional superscalar designs will not be able to meet. We present WaveScalar as a scalable alternative to conventional designs. WaveScalar is a dataflow instruction set and execution model designed for scalable, low-complexity/high-performance processors. Unlike previous dataflow machines, WaveScalar can efficiently provide the sequential memory semantics that imperative languages require. To allow programmers to easily express parallelism, WaveScalar supports pthread-style, coarse-grain multithreading and dataflow-style, fine-grain threading. In addition, it permits blending the two styles within an application, or even a single function. To execute WaveScalar programs, we have designed a scalable, tile-based processor architecture called the WaveCache. As a program executes, the WaveCache maps the program's instructions onto its array of processing elements (PEs). The instructions remain at their processing elements for many invocations, and as the working set of instructions changes, the WaveCache removes unused instructions and maps new ones in their place. The instructions communicate directly with one another over a scalable, hierarchical on-chip interconnect, obviating the need for long wires and broadcast communication. This article presents the WaveScalar instruction set and evaluates a simulated implementation based on current technology. For single-threaded applications, the WaveCache achieves performance on par with conventional processors, but in less area. For coarse-grain threaded applications the WaveCache achieves nearly linear speedup with up to 64 threads and can sustain 7-14 multiply-accumulates per cycle on fine-grain threaded versions of well-known kernels. Finally, we apply both styles of threading to equake from SPEC2000 and speed it up by 9x compared to the serial version.

Proceedings ArticleDOI
Sebastian Winkel
01 Dec 2007
TL;DR: A highly efficient ILP model that was implemented experimentally in the Intel Itanium® product compiler features virtually the full scale of known EPIC scheduling optimizations, more than its heuristic counterpart in the compiler, GCS, and in contrast to the latter it computes optimal solutions in the form of schedules with minimal length.
Abstract: We present a global instruction scheduler based on integer linear programming (ILP) that was implemented experimentally in the Intel Itanium® product compiler. It features virtually the full scale of known EPIC scheduling optimizations, more than its heuristic counterpart in the compiler, GCS, and in contrast to the latter it computes optimal solutions in the form of schedules with minimal length. Due to our highly efficient ILP model it can solve problem instances with 500-750 instructions, and in combination with region scheduling we are able to schedule routines of arbitrary size. In experiments on five SPEC® CPU2006 integer benchmarks, ILP-scheduled code exhibits a 32% schedule length advantage and a 10% runtime speedup over GCS-scheduled code, at the highest compiler optimization levels typically used for SPEC submissions. We further study the impact of different code motion classes, region sizes, and target microarchitectures, gaining insights into the nature of the global instruction scheduling problem.
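
A toy analogue of such an ILP formulation (assuming the PuLP library is available; any ILP solver works, and this is far simpler than the paper's model): binary x[i][c] places instruction i in cycle c, issue width and dependence latencies are constraints, and the makespan is minimized.

```python
# Miniature instruction-scheduling ILP sketch using PuLP.
import pulp

def ilp_schedule(n, deps, latency, horizon, width=2):
    # deps: list of (i, j) meaning j must start >= latency[i] cycles after i
    prob = pulp.LpProblem("sched", pulp.LpMinimize)
    x = pulp.LpVariable.dicts("x", (range(n), range(horizon)), cat="Binary")
    makespan = pulp.LpVariable("makespan", lowBound=0)
    prob += makespan
    for i in range(n):
        prob += pulp.lpSum(x[i][c] for c in range(horizon)) == 1  # issue once
        start_i = pulp.lpSum(c * x[i][c] for c in range(horizon))
        prob += makespan >= start_i + latency[i]
    for c in range(horizon):
        prob += pulp.lpSum(x[i][c] for i in range(n)) <= width    # issue width
    for i, j in deps:
        si = pulp.lpSum(c * x[i][c] for c in range(horizon))
        sj = pulp.lpSum(c * x[j][c] for c in range(horizon))
        prob += sj >= si + latency[i]                             # latency gap
    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    return {i: next(c for c in range(horizon) if x[i][c].value() > 0.5)
            for i in range(n)}

# e.g. four instructions, chains 0->2 and 1->3, latency 2 on instruction 0:
print(ilp_schedule(4, [(0, 2), (1, 3)], [2, 1, 1, 1], horizon=6))
```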

Proceedings ArticleDOI
14 Mar 2007
TL;DR: This work presents performance measurements on several architectures and concludes that simple recompilation will provide partial parallelization of applications that make consistent use of the C++ Standard Template Library.
Abstract: Future gain in computing performance will not stem from increased clock rates, but from even more cores in a processor. Since automatic parallelization is still limited to easily parallelizable sections of the code, most applications will soon have to support parallelism explicitly. The Multi-Core Standard Template Library (MCSTL) simplifies parallelization by providing efficient parallel implementations of the algorithms in the C++ Standard Template Library. Thus, simple recompilation will provide partial parallelization of applications that make consistent use of the STL. We present performance measurements on several architectures. For example, our sorter achieves a speedup of 21 on an 8-core 32-thread SUN T1.
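
A rough Python analogue of a parallel library sort (the MCSTL itself is C++ with threads, not processes): sort chunks in workers, then k-way merge the results.

```python
# Parallel sort sketch: partition, sort chunks in a process pool, merge.
from multiprocessing import Pool
from heapq import merge
import random

def parallel_sort(data, n_workers=4):
    chunks = [data[i::n_workers] for i in range(n_workers)]
    with Pool(n_workers) as pool:
        sorted_chunks = pool.map(sorted, chunks)
    return list(merge(*sorted_chunks))

if __name__ == "__main__":
    data = [random.random() for _ in range(1_000_000)]
    assert parallel_sort(data) == sorted(data)
```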

Proceedings ArticleDOI
01 Dec 2007
TL;DR: This work proposes a profile-guided method for partitioning memory accesses across distributed data caches, which reduces stall cycles by up to 51% versus data-incognizant partitioning, and has an overall speedup average of 30% over a single core processor.
Abstract: The recent design shift towards multicore processors has spawned a significant amount of research in the area of program parallelization. The future abundance of cores on a single chip requires programmer and compiler intervention to increase the amount of parallel work possible. Much of the recent work has fallen into the areas of coarse-grain parallelization: new programming models and different ways to exploit threads and data-level parallelism. This work focuses on a complementary direction, improving performance through automated fine-grain parallelization. The main difficulty in achieving a performance benefit from fine-grain parallelism is the distribution of data memory accesses across the data caches of each core. Poor choices in the placement of data accesses can lead to increased memory stalls and low resource utilization. We propose a profile-guided method for partitioning memory accesses across distributed data caches. First, a profile determines affinity relationships between memory accesses and working set characteristics of individual memory operations in the program. Next, a program-level partitioning of the memory operations is performed to divide the memory accesses across the data caches. As a result, the data accesses are proactively dispersed to reduce memory stalls and improve computation parallelization. A final detailed partitioning of the computation instructions is performed with knowledge of the cache location of their associated data. Overall, our data partitioning reduces stall cycles by up to 51% versus data-incognizant partitioning, and has an overall speedup average of 30% over a single core processor.

Journal ArticleDOI
TL;DR: The analysis and experiments show that the PETS algorithm substantially outperforms existing scheduling algorithms such as Heterogeneous Earliest Finish Time (HEFT), Critical-Path-on-a-Processor (CPOP) and Levelized Min Time (LMT), in terms of schedule length ratio, speedup, efficiency, running time and frequency of best results.
Abstract: A heterogeneous computing environment is a suite of heterogeneous processors interconnected by high-speed networks, thereby promising high-speed processing of computationally intensive applications with diverse computing needs. Scheduling of an application modeled by a Directed Acyclic Graph (DAG) is a key issue when aiming at high performance in this kind of environment. The problem is generally addressed in terms of task scheduling, where tasks are the schedulable units of a program. The task scheduling problem has been shown to be NP-complete in general as well as in several restricted cases. In this study we present a simple scheduling algorithm based on list scheduling, namely, the low-complexity Performance Effective Task Scheduling (PETS) algorithm for heterogeneous computing systems, with complexity O(e(p + log v)), which provides effective results for applications represented by DAGs. The analysis and experiments based on both randomly generated graphs and graphs of some real applications show that the PETS algorithm substantially outperforms the existing scheduling algorithms such as Heterogeneous Earliest Finish Time (HEFT), Critical-Path-on-a-Processor (CPOP) and Levelized Min Time (LMT), in terms of schedule length ratio, speedup, efficiency, running time and frequency of best results.
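
A condensed list-scheduling sketch in the PETS/HEFT family (not the exact PETS priority rules, and communication costs are ignored here): rank tasks by average cost, then place each ready task on the processor where it finishes earliest.

```python
# Heterogeneous list scheduling: average-cost ranking + earliest-finish-time
# processor selection.
def list_schedule(tasks, deps, cost, n_procs):
    # cost[t][p]: runtime of task t on processor p; deps[t]: predecessors of t
    proc_free = [0.0] * n_procs
    finish, placement, scheduled = {}, {}, set()
    order = sorted(tasks, key=lambda t: -sum(cost[t]) / n_procs)
    while order:
        t = next(x for x in order if all(d in scheduled for d in deps[x]))
        order.remove(t)
        ready = max((finish[d] for d in deps[t]), default=0.0)
        best = min(range(n_procs),
                   key=lambda p: max(proc_free[p], ready) + cost[t][p])
        start = max(proc_free[best], ready)
        finish[t] = start + cost[t][best]
        proc_free[best] = finish[t]
        placement[t] = best
        scheduled.add(t)
    return placement, max(finish.values())
```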

Journal ArticleDOI
TL;DR: Small tables with only 25-200 entries were used to obtain this performance, while full enumeration is intractable for this example; versions of PE are shown to be closed-loop stable.

Proceedings Article
23 Sep 2007
TL;DR: This paper proposes two techniques: a cache-conscious FP-array (frequent pattern array) and a lock-free dataset tiling parallelization mechanism to address the problem of frequent pattern mining on a modern multi-core machine.
Abstract: Multi-core processors have proliferated across different domains in recent years. In this paper, we study the performance of frequent pattern mining on a modern multi-core machine. A detailed study shows that, even with the best implementation, current FP-tree based algorithms still under-utilize a multi-core system due to poor data locality and insufficient expression of parallelism. We propose two techniques: a cache-conscious FP-array (frequent pattern array) and a lock-free dataset tiling parallelization mechanism to address this problem. The FP-array efficiently improves data locality and makes use of the benefits from hardware and software prefetching. The result yields an overall 4.0x speedup compared with the state-of-the-art implementation. Furthermore, to unlock the power of the multi-core processor, a lock-free parallelization approach is proposed to restructure the FP-tree building algorithm. It not only eliminates the locks in building a single FP-tree with fine-grained threads, but also improves temporal data locality. To summarize, with the proposed cache-conscious FP-array and lock-free parallelization enhancements, the overall FP-tree algorithm achieves a 24-fold speedup on an 8-core machine. Finally, we believe the presented techniques can be applied to other data mining tasks as well, given the prevalence of multi-core processors.

Journal ArticleDOI
TL;DR: An end-to-end QMC application with core elements of the algorithm running on a GPU is reported on, demonstrating the speedup improvements possible for QMC in running on advanced hardware and exploring a path toward providing QMC level accuracy as a more standard tool.

Journal ArticleDOI
TL;DR: Having optimized an all-to-all routine that sends the data in an ordered fashion, it is shown that it is possible to completely prevent packet loss for any number of multi-CPU nodes, and the GROMACS scaling improves dramatically, even for switches that lack flow control.
Abstract: We investigate the parallel scaling of the GROMACS molecular dynamics code on Ethernet Beowulf clusters and what prerequisites are necessary for decent scaling even on such clusters with only limited bandwidth and high latency. GROMACS 3.3 scales well on supercomputers like the IBM p690 (Regatta) and on Linux clusters with a special interconnect like Myrinet or Infiniband. Because of the high single-node performance of GROMACS, however, on the widely used Ethernet-switched clusters the scaling typically breaks down when more than two computer nodes are involved, limiting the absolute speedup that can be gained to about 3 relative to a single-CPU run. With the LAM MPI implementation, the main scaling bottleneck is identified to be the all-to-all communication that is required every time step. During such an all-to-all communication step, a huge number of messages floods the network, and as a result many TCP packets are lost. We show that Ethernet flow control prevents network congestion and leads to substantial scaling improvements. For 16 CPUs, e.g., a speedup of 11 has been achieved. However, for more nodes this mechanism also fails. Having optimized an all-to-all routine that sends the data in an ordered fashion, we show that it is possible to completely prevent packet loss for any number of multi-CPU nodes. Thus, the GROMACS scaling improves dramatically, even for switches that lack flow control. In addition, for the common HP ProCurve 2848 switch we find that for optimum all-to-all performance it is essential how the nodes are connected to the switch's ports. This is also demonstrated for the example of the Car-Parrinello MD code.
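
The key property of an ordered all-to-all is that each round pairs every node with exactly one partner, so no receiver is oversubscribed. A round schedule of this kind can be generated with the classic circle method (our illustration of the principle, not the GROMACS routine):

```python
# Ordered all-to-all schedule via 1-factorization of the complete graph:
# n-1 rounds, each pairing every node with exactly one partner.
def ordered_alltoall_schedule(n):                 # n must be even
    nodes = list(range(n))
    rounds = []
    for _ in range(n - 1):
        rounds.append([(nodes[i], nodes[n - 1 - i]) for i in range(n // 2)])
        nodes = [nodes[0]] + [nodes[-1]] + nodes[1:-1]   # rotate all but first
    return rounds

for r, pairs in enumerate(ordered_alltoall_schedule(6)):
    print(f"round {r}: {pairs}")
```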

Proceedings ArticleDOI
Quanzhong Li, Minglong Sha, Volker Markl, K. Beyer, L. Colby, G. Lohman
10 Sep 2007
TL;DR: A novel method for processing pipelined join plans that dynamically arranges the join order of both inner and outer-most tables at run-time, and achieves adaptability by changing the pipeline itself, which avoids the bookkeeping and routing decisions associated with each row.
Abstract: Traditional query processing techniques based on static query optimization are ineffective in applications where statistics about the data are unavailable at the start of query execution or where the data characteristics are skewed and change dynamically. Several adaptive query processing techniques have been proposed in recent years to overcome the limitations of static query optimizers through either explicit re-optimization of plans during execution or by using a row-routing based approach. In this paper, we present a novel method for processing pipelined join plans that dynamically arranges the join order of both inner and outer-most tables at run-time. We extend the Eddies concept of "moments of symmetry" to reorder indexed nested-loop joins, the join method used by all commercial DBMSs for building pipelined query plans for applications for which low latencies are crucial. Unlike row-routing techniques, our approach achieves adaptability by changing the pipeline itself, which avoids the bookkeeping and routing decisions associated with each row. Operator selectivities monitored during query execution are used to change the execution plan at strategic points, and the change of execution plans utilizes a novel and efficient technique for avoiding duplicates in the query results. Our prototype implementation in a commercial DBMS shows a query execution speedup of up to 8 times.

Proceedings Article
06 Jan 2007
TL;DR: A fast algorithm for computing all shortest paths between source nodes s ∈ S and target nodes t ∈ T, based on highway hierarchies, which are also used for the currently fastest speedup techniques for shortest path queries in road networks.
Abstract: We present a fast algorithm for computing all shortest paths between source nodes s ∈ S and target nodes t ∈ T. This problem is important as an initial step for many operations research problems (e.g., the vehicle routing problem), which require the distances between S and T as input. Our approach is based on highway hierarchies, which are also used for the currently fastest speedup techniques for shortest path queries in road networks. We show how to use highway hierarchies so that, for example, a 10 000 × 10 000 distance table in the European road network can be computed in about one minute. These results are based on a simple basic idea, several refinements, and careful engineering of the approach. We also explain how the approach can be parallelized and how the computation can be restricted to computing only the k closest connections.
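
For contrast, the plain-Dijkstra baseline for a many-to-many table looks as follows (one search per source, stopping early once all targets are settled); the paper's bucket-based highway-hierarchy approach avoids this per-source cost:

```python
# Baseline many-to-many distance table: repeated Dijkstra with early exit.
import heapq

def distance_table(adj, sources, targets):
    tset, table = set(targets), {}
    for s in sources:
        dist, remaining = {s: 0}, set(tset)
        pq = [(0, s)]
        while pq and remaining:
            d, u = heapq.heappop(pq)
            if d > dist.get(u, float("inf")):
                continue                          # stale heap entry
            remaining.discard(u)                  # u is now settled
            for v, w in adj[u]:
                if d + w < dist.get(v, float("inf")):
                    dist[v] = d + w
                    heapq.heappush(pq, (d + w, v))
        table[s] = {t: dist.get(t, float("inf")) for t in targets}
    return table
```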

Proceedings ArticleDOI
26 Mar 2007
TL;DR: In exploring the optimum level of performance for Sweep3D, the Cell BE processor yielded many pleasant surprises, such as very high floating point performance, reaching 64% of the theoretical peak in double precision, and an overall performance speedup ranging from 4.5 times compared with "heavy iron" processors up to over 20 times compared with conventional processors.
Abstract: The Cell Broadband Engine (BE) processor provides the potential to achieve an impressive level of performance for scientific applications. This level of performance can be reached by exploiting several dimensions of parallelism, such as thread-level parallelism using several synergistic processing elements, data streaming parallelism, vector parallelism in the form of 128-bit SIMD operations, and pipeline parallelism by issuing multiple instructions in the same clock cycle. In our exploration to achieve the optimum level of performance for Sweep3D, we have enjoyed many pleasant surprises, such as a very high floating point performance, reaching 64% of the theoretical peak in double precision, and an overall performance speedup ranging from 4.5 times when compared with "heavy iron" processors, up to over 20 times with conventional processors.

Journal ArticleDOI
TL;DR: Fundamental bounds on the number of wires required to provide joint crosstalk avoidance and error correction using memoryless codes are presented, and a code construction is proposed that results in practical codec circuits with the number of wires within 35% of the fundamental bounds.
Abstract: A reliable high-speed bus employing low-swing signaling can be designed by encoding the bus to prevent crosstalk and provide error correction. Coding for on-chip buses requires additional bus wires and codec circuits. In this paper, fundamental bounds on the number of wires required to provide joint crosstalk avoidance and error correction using memoryless codes are presented. The authors propose a code construction that results in practical codec circuits with the number of wires being within 35% of the fundamental bounds. When applied to a 10-mm 32-bit bus in a 0.13-μm CMOS technology with low-swing signaling, one of the proposed codes provides a 2.14x speedup and 27.5% energy savings at the cost of a 2.1x area overhead, but without any loss in reliability.

Book ChapterDOI
22 Feb 2007
TL;DR: A condensed overview of new developments and extensions of classic results for Dijkstra's algorithm is provided, and how combinations of speed-up techniques can be realized to take advantage of different strategies is discussed.
Abstract: During the last years, several speed-up techniques for Dijkstra's algorithm have been published that maintain the correctness of the algorithm but reduce its running time for typical instances. They are usually based on a preprocessing that annotates the graph with additional information which can be used to prune or guide the search. Timetable information in public transport is a traditional application domain for such techniques. In this paper, we provide a condensed overview of new developments and extensions of classic results. Furthermore, we discuss how combinations of speed-up techniques can be realized to take advantage of different strategies.

Proceedings ArticleDOI
07 May 2007
TL;DR: In this article, the authors propose a low-cost priority-based NoC for cache access and cache coherency in future high-performance chip multiprocessors (CMPs).
Abstract: The paper introduces network-on-chip (NoC) design methodology and low-cost mechanisms for supporting efficient cache access and cache coherency in future high-performance chip multiprocessors (CMPs). We address previously proposed CMP architectures based on non-uniform cache architecture (NUCA) over NoC, analyze basic memory transactions, and translate them into a set of network transactions. We first show how a simple, generic NoC equipped with the needed module interface functionalities can provide infrastructure for the coherent access of both static and dynamic NUCA. Then we show how several low-cost mechanisms incorporated into such a vanilla NoC can facilitate CMP operation and boost the performance of a cache-coherent NUCA CMP. The basic mechanism is based on priority support embedded in the NoC, which differentiates between short control signals and long data messages to achieve a major reduction in cache access delay. The low-cost priority-based NoC is extremely useful for increasing the performance of almost any other CMP transaction. The priority-based NoC, along with the discussed NoC interfaces, is evaluated in detail using CMP-NoC simulations across several SPLASH-2 benchmarks and static Web content serving benchmarks, showing substantial L2 cache access delay reduction and overall program speedup.

Proceedings ArticleDOI
02 Jul 2007
TL;DR: This work proposes a two-step approach: a custom preparsing technique to resolve control dependencies in the input stream and expose MB-level data parallelism, and an MB-level scheduling technique to allocate and load-balance MB rendering tasks.
Abstract: The H.264 decoder has a sequential, control-intensive front end that makes it difficult to leverage the potential performance of emerging manycore processors. Preparsing is a functional parallelization technique to resolve this front-end bottleneck. However, the resulting parallel macroblock (MB) rendering tasks have highly input-dependent execution times and precedence constraints, which make them difficult to schedule efficiently on manycore processors. To address these issues, we propose a two-step approach: (i) a custom preparsing technique to resolve control dependencies in the input stream and expose MB-level data parallelism, and (ii) an MB-level scheduling technique to allocate and load-balance MB rendering tasks. The run-time MB-level scheduling increases the efficiency of parallel execution in the rest of the H.264 decoder, providing a 60% speedup over greedy dynamic scheduling and a 9-15% speedup over static compile-time scheduling for more than four processors. The preparsing technique coupled with run-time MB-level scheduling enables a potential 7x speedup for H.264 decoding.
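
Dynamic MB-level scheduling can be sketched as a dependence-counting wavefront: each macroblock waits on its left, top, and top-right neighbours (the usual H.264 intra dependences) and releases its dependents when done. A toy version with a thread pool, where `render_mb` is a placeholder for the input-dependent decoding work:

```python
# Wavefront scheduling of macroblocks: MB (x, y) waits on (x-1, y),
# (x, y-1), and (x+1, y-1); completed MBs release their dependents.
from concurrent.futures import ThreadPoolExecutor
import threading

def wavefront_decode(W, H, render_mb, n_workers=4):
    pending = {(x, y): sum(1 for dx, dy in [(x - 1, y), (x, y - 1), (x + 1, y - 1)]
                           if 0 <= dx < W and 0 <= dy < H)
               for x in range(W) for y in range(H)}
    lock, all_done, left = threading.Lock(), threading.Event(), [W * H]

    def run(mb):
        render_mb(mb)
        x, y = mb
        ready = []
        with lock:
            left[0] -= 1
            if left[0] == 0:
                all_done.set()
            for d in [(x + 1, y), (x, y + 1), (x - 1, y + 1)]:  # dependents
                if d in pending:
                    pending[d] -= 1
                    if pending[d] == 0:
                        ready.append(d)
        for d in ready:
            pool.submit(run, d)

    with ThreadPoolExecutor(n_workers) as pool:
        pool.submit(run, (0, 0))                  # only MB with no deps
        all_done.wait()

wavefront_decode(8, 4, lambda mb: None)           # 32 MBs, wavefront order
```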

Proceedings ArticleDOI
23 Jan 2007
TL;DR: Results show that in single-threaded mode, MiraXT compares well to other state-of-the-art solvers on industrial problems, and provides cutting-edge performance in threaded mode, as speedup is obtained on both SAT and UNSAT instances.
Abstract: This paper describes the multithreaded MiraXT SAT solver, which was designed to take advantage of current and future shared-memory multiprocessor systems. The paper highlights design and implementation details that allow the multiple threads to run and cooperate efficiently. Results show that in single-threaded mode, MiraXT compares well to other state-of-the-art solvers on industrial problems. In threaded mode, it provides cutting-edge performance, as speedup is obtained on both SAT and UNSAT instances.