Showing papers on "Speedup" published in 2002

Journal Article•DOI•

Performance-effective and low-complexity task scheduling for heterogeneous computing

Haluk Rahmi Topcuoglu¹, Salim Hariri², Min-You Wu³•Institutions (3)

Marmara University¹, University of Arizona², University of New Mexico³

01 Mar 2002-IEEE Transactions on Parallel and Distributed Systems

TL;DR: Two novel scheduling algorithms for a bounded number of heterogeneous processors with an objective to simultaneously meet high performance and fast scheduling time are presented, called the Heterogeneous Earliest-Finish-Time (HEFT) algorithm and the Critical-Path-on-a-Processor (CPOP) algorithm.

Abstract: Efficient application scheduling is critical for achieving high performance in heterogeneous computing environments. The application scheduling problem has been shown to be NP-complete in general cases as well as in several restricted cases. Because of its key importance, this problem has been extensively studied and various algorithms have been proposed in the literature which are mainly for systems with homogeneous processors. Although there are a few algorithms in the literature for heterogeneous processors, they usually require significantly high scheduling costs and they may not deliver good quality schedules with lower costs. In this paper, we present two novel scheduling algorithms for a bounded number of heterogeneous processors with an objective to simultaneously meet high performance and fast scheduling time, which are called the Heterogeneous Earliest-Finish-Time (HEFT) algorithm and the Critical-Path-on-a-Processor (CPOP) algorithm. The HEFT algorithm selects the task with the highest upward rank value at each step and assigns the selected task to the processor, which minimizes its earliest finish time with an insertion-based approach. On the other hand, the CPOP algorithm uses the summation of upward and downward rank values for prioritizing tasks. Another difference is in the processor selection phase, which schedules the critical tasks onto the processor that minimizes the total execution time of the critical tasks. In order to provide a robust and unbiased comparison with the related work, a parametric graph generator was designed to generate weighted directed acyclic graphs with various characteristics. The comparison study, based on both randomly generated graphs and the graphs of some real applications, shows that our scheduling algorithms significantly surpass previous approaches in terms of both quality and cost of schedules, which are mainly presented with schedule length ratio, speedup, frequency of best results, and average scheduling time metrics.
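
The upward-rank prioritization that HEFT relies on can be sketched as follows. This is a minimal illustration, assuming the usual definition of upward rank (a task's mean computation cost plus the largest, over its immediate successors, of mean communication cost plus the successor's rank); the abstract itself does not spell the formula out. The function name, the dictionary layout, and the four-task graph are hypothetical, and processor selection with the insertion-based policy is not shown.

```python
from functools import lru_cache


def upward_ranks(avg_cost, avg_comm, succ):
    """Return {task: upward rank} for a task DAG.

    avg_cost[t]      -- mean execution time of task t over all processors
    avg_comm[(t, s)] -- mean communication time on edge t -> s
    succ[t]          -- list of immediate successors of t (missing = exit task)
    """

    @lru_cache(maxsize=None)
    def rank(t):
        # An exit task has no successors, so its rank is just its own cost.
        tail = max(
            (avg_comm[(t, s)] + rank(s) for s in succ.get(t, [])),
            default=0.0,
        )
        return avg_cost[t] + tail

    return {t: rank(t) for t in avg_cost}


# Hypothetical diamond-shaped graph: A -> {B, C} -> D
avg_cost = {"A": 3.0, "B": 2.0, "C": 4.0, "D": 1.0}
avg_comm = {("A", "B"): 1.0, ("A", "C"): 2.0, ("B", "D"): 1.0, ("C", "D"): 3.0}
succ = {"A": ["B", "C"], "B": ["D"], "C": ["D"]}

ranks = upward_ranks(avg_cost, avg_comm, succ)
# Tasks are considered in decreasing order of upward rank.
order = sorted(ranks, key=ranks.get, reverse=True)
print(ranks, order)
```

Sorting by decreasing upward rank always yields a valid topological order when costs are positive, because a task's rank strictly exceeds that of each of its successors.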

2,961 citations

Journal Article•DOI•

A novel cross-diamond search algorithm for fast block motion estimation

Chun-Ho Cheung¹, Lai-Man Po¹•Institutions (1)

City University of Hong Kong¹

01 Dec 2002-IEEE Transactions on Circuits and Systems for Video Technology

TL;DR: The proposed cross-diamond search (CDS) algorithm employs the halfway-stop technique and finds small motion vectors with fewer search points than the DS algorithm while maintaining similar or even better search quality.

Abstract: In block motion estimation, search patterns with different shapes or sizes and the center-biased characteristics of motion-vector distribution have a large impact on the searching speed and quality of performance. We propose a novel algorithm using a cross-search pattern as the initial step and large/small diamond search (DS) patterns as the subsequent steps for fast block motion estimation. The initial cross-search pattern is designed to fit the cross-center-biased motion vector distribution characteristics of the real-world sequences by evaluating the nine relatively higher probable candidates located horizontally and vertically at the center of the search grid. The proposed cross-diamond search (CDS) algorithm employs the halfway-stop technique and finds small motion vectors with fewer search points than the DS algorithm while maintaining similar or even better search quality. The improvement of CDS over DS can be up to a 40% gain on speedup. Experimental results show that the CDS is much more robust, and provides faster searching speed and smaller distortions than other popular fast block-matching algorithms.
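
The initial cross-shaped step described in the abstract can be sketched roughly as below. This is only an illustration under stated assumptions: the nine candidates are taken to be the center plus offsets of one and two pixels along each axis, the matching cost is assumed to be the sum of absolute differences (SAD), and the early-stop test is reduced to "the center is the best candidate". The later large/small diamond steps are omitted, and all names and the toy frames are hypothetical.

```python
# Nine cross-shaped candidates (dx, dy); the center comes first so ties favor it.
CROSS = [(0, 0), (-1, 0), (1, 0), (-2, 0), (2, 0),
         (0, -1), (0, 1), (0, -2), (0, 2)]


def sad(cur, ref, bx, by, dx, dy, n):
    """Sum of absolute differences between the n x n block of `cur` at
    (bx, by) and the block of `ref` displaced by (dx, dy)."""
    return sum(
        abs(cur[by + j][bx + i] - ref[by + dy + j][bx + dx + i])
        for j in range(n)
        for i in range(n)
    )


def cross_step(cur, ref, bx, by, n):
    """Evaluate the cross candidates; return (best_vector, stop_early)."""
    h, w = len(ref), len(ref[0])
    best, best_cost = (0, 0), None
    for dx, dy in CROSS:
        # Skip candidates whose displaced block would leave the frame.
        if not (0 <= bx + dx and bx + dx + n <= w
                and 0 <= by + dy and by + dy + n <= h):
            continue
        cost = sad(cur, ref, bx, by, dx, dy, n)
        if best_cost is None or cost < best_cost:
            best, best_cost = (dx, dy), cost
    # Simplified early stop: a stationary block needs no further search.
    return best, best == (0, 0)


# Toy 8x8 frames: `cur` is `ref` shifted so the true motion vector is (1, 0).
ref = [[x * x + 10 * y for x in range(8)] for y in range(8)]
cur = [[ref[y][min(x + 1, 7)] for x in range(8)] for y in range(8)]

moving = cross_step(cur, ref, 2, 2, 2)
static = cross_step(ref, ref, 2, 2, 2)
print(moving, static)
```

For the stationary block the search ends after nine evaluations, which is where a center-biased pattern saves search points on low-motion sequences.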

392 citations

Proceedings Article•DOI•

The architecture of the DIVA processing-in-memory chip

Jeffrey Draper¹, Jacqueline Chame¹, Mary Hall¹, Craig S. Steele¹, Tim Barrett¹, Jeff LaCoss¹, John J. Granacki¹, Jaewook Shin¹, Chun Chen¹, Chang Woo Kang¹, Ihn Kim¹, Gokhan Daglikoca¹•Institutions (1)

Information Sciences Institute¹

22 Jun 2002

TL;DR: The DIVA (Data IntensiVe Architecture) system incorporates a collection of Processing-In-Memory chips as smart-memory co-processors to a conventional microprocessor, and a PIM-based architecture with many such chips yields significantly higher performance than a multiprocessor of a similar scale and at a much reduced hardware cost.

Abstract: The DIVA (Data IntensiVe Architecture) system incorporates a collection of Processing-In-Memory (PIM) chips as smart-memory co-processors to a conventional microprocessor. We have recently fabricated prototype DIVA PIMs. These chips represent the first smart-memory devices designed to support virtual addressing and capable of executing multiple threads of control. In this paper, we describe the prototype PIM architecture. We emphasize three unique features of DIVA PIMs, namely, the memory interface to the host processor, the 256-bit wide datapaths for exploiting on-chip bandwidth, and the address translation unit. We present detailed simulation results on eight benchmark applications. When just a single PIM chip is used, we achieve an average speedup of 3.3X over host-only execution, due to lower memory stall times and increased fine-grain parallelism. These 1-PIM results suggest that a PIM-based architecture with many such chips yields significantly higher performance than a multiprocessor of a similar scale and at a much reduced hardware cost.

363 citations

Book Chapter•DOI•

Solving Unsymmetric Sparse Systems of Linear Equations with PARDISO

Olaf Schenk¹, Klaus Gärtner•Institutions (1)

University of Basel¹

21 Apr 2002

TL;DR: Experiments demonstrate that a wide set of unsymmetric linear systems can be solved and high performance is consistently achieved for large sparse unsymmetric matrices from real world applications.

Abstract: Supernode pivoting for unsymmetric matrices coupled with supernode partitioning and asynchronous computation can achieve high gigaflop rates for parallel sparse LU factorization on shared memory parallel computers. The progress in weighted graph matching algorithms helps to extend these concepts further and prepermutation of rows is used to place large matrix entries on the diagonal. Supernode pivoting allows dynamical interchanges of columns and rows during the factorization process. The BLAS-3 level efficiency is retained. An enhanced left-right looking scheduling scheme is unaffected and results in good speedup on SMP machines without increasing the operation count. These algorithms have been integrated into the recent unsymmetric version of the PARDISO solver. Experiments demonstrate that a wide set of unsymmetric linear systems can be solved and high performance is consistently achieved for large sparse unsymmetric matrices from real world applications.

323 citations

Journal Article•DOI•

Prediction of analog performance parameters using fast transient testing

P.N. Variyam¹, S. Cherubal, Abhijit Chatterjee•Institutions (1)

Texas Instruments¹

07 Aug 2002-IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems

TL;DR: A fast transient testing methodology for predicting the performance parameters of analog circuits is presented; production test data showed a tenfold speedup in production testing, accurate prediction of the performance parameters, and a simpler test configuration.

Abstract: In this paper, a fast transient testing methodology for predicting the performance parameters of analog circuits is presented. A transient test signal is applied to the circuit under test (CUT) and the transient response of the circuit is sampled and analyzed to predict the circuit's performance parameters. An algorithm for generating the optimum transient test signal is presented. The methodology is demonstrated in a production environment using a low-power opamp. Results from production test data showed: 1) a ten times speedup in production testing; 2) accurate prediction of the performance parameters; and 3) a simpler test configuration.

286 citations

Journal Article•DOI•

Recent improvements in aerodynamic design optimization on unstructured meshes

Eric J. Nielsen¹, W. Kyle Anderson¹•Institutions (1)

Langley Research Center¹

01 Jan 2002-AIAA Journal

TL;DR: In this paper, a set of design codes based on a discrete adjoint method is extended to a multiprocessor environment using a shared memory approach, and a nearly linear speedup is demonstrated, and the consistency of the linearizations is shown to remain valid.

Abstract: Recent improvements in an unstructured-grid method for large-scale aerodynamic design are presented. Previous work had shown such computations to be prohibitively long in a sequential processing environment. Also, robust adjoint solutions and mesh movement procedures were difficult to realize, particularly for viscous flows. To overcome these limiting factors, a set of design codes based on a discrete adjoint method is extended to a multiprocessor environment using a shared memory approach. A nearly linear speedup is demonstrated, and the consistency of the linearizations is shown to remain valid. The full linearization of the residual is used to precondition the adjoint system, and a significantly improved convergence rate is obtained. A new mesh movement algorithm is implemented, and several advantages over an existing technique are presented.

253 citations

Proceedings Article•DOI•

Exponential algorithmic speedup by quantum walk

Andrew M. Childs, Richard Cleve, Enrico Deotto, Edward Farhi, Sam Gutmann, Daniel A. Spielman

24 Sep 2002-arXiv: Quantum Physics

TL;DR: An oracular (black-box) problem is constructed that a quantum algorithm based on a continuous time quantum walk solves exponentially faster than a classical computer can; it is proved that no classical algorithm can solve the problem with high probability in subexponential time.

Abstract: We construct an oracular (i.e., black box) problem that can be solved exponentially faster on a quantum computer than on a classical computer. The quantum algorithm is based on a continuous time quantum walk, and thus employs a different technique from previous quantum algorithms based on quantum Fourier transforms. We show how to implement the quantum walk efficiently in our oracular setting. We then show how this quantum walk can be used to solve our problem by rapidly traversing a graph. Finally, we prove that no classical algorithm can solve this problem with high probability in subexponential time.

247 citations

Journal Article•DOI•

A parallel implementation of ant colony optimization

Marcus Randall¹, Andrew Lewis²•Institutions (2)

Bond University¹, Griffith University²

01 Sep 2002-Journal of Parallel and Distributed Computing

TL;DR: Several parallel decomposition strategies are examined in Ant Colony Optimization applied to a specific problem, namely the travelling salesman problem, with encouraging speedup and efficiency results.
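
The two metrics named here follow the standard definitions: speedup is serial time divided by parallel time, and efficiency is speedup divided by the number of processors. A minimal sketch, with hypothetical function names and made-up timings that are not taken from the paper:

```python
def speedup(t_serial, t_parallel):
    """Ratio of serial run time to parallel run time."""
    return t_serial / t_parallel


def efficiency(t_serial, t_parallel, processors):
    """Speedup per processor; 1.0 corresponds to linear speedup."""
    return speedup(t_serial, t_parallel) / processors


# Hypothetical timings: 120 s on one processor, 20 s on eight.
s = speedup(120.0, 20.0)
e = efficiency(120.0, 20.0, 8)
print(s, e)
```

An efficiency below 1.0, as in this made-up case, usually reflects communication and synchronization overhead between the workers.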

199 citations

Patent•

Design data format and hierarchy management for processing

Michel Luc Cote¹, Christophe Pierrat¹•Institutions (1)

Synopsys¹

07 Jun 2002

TL;DR: Defining a phase shifting layout from an original layout can be time consuming; if the original layout is divided into useful groups, i.e., clusters that can be independently processed, the phase shifting process can be performed more rapidly.

Abstract: Definition of a phase shifting layout from an original layout can be time consuming. If the original layout is divided into useful groups, i.e. clusters that can be independently processed, then the phase shifting process can be performed more rapidly. If the shapes on the layout are enlarged, then the overlapping shapes can be grouped together to identify shapes that should be processed together. For large layouts, growing and grouping the shapes can be time consuming. Therefore, an approach that uses bins can speed up the clustering process, thereby allowing the phase shifting to be performed in parallel on multiple computers. Additional efficiencies result if identical clusters are identified and processing time saved so that repeated clusters of shapes only undergo the computationally expensive phase shifter placement and assignment process a single time.

192 citations

Journal Article•DOI•

Parallel evolutionary algorithms can achieve super-linear performance

Enrique Alba

15 Apr 2002-Information Processing Letters

TL;DR: The conclusion is that super-linear performance is possible for PEAs, theoretically and in practice, both in homogeneous and in heterogeneous parallel machines.

182 citations

Proceedings Article•DOI•

A stateless, content-directed data prefetching mechanism

Robert N. Cooksey¹, Stephan Jourdan¹, Dirk Grunwald²•Institutions (2)

Intel¹, University of Colorado Boulder²

01 Oct 2002

TL;DR: Content-Directed Data Prefetching is proposed, a data prefetching architecture that exploits the memory allocation used by operating systems and runtime systems to improve the performance of pointer-intensive applications constructed using modern language systems.

Abstract: Although central processor speeds continue to improve, improvements in overall system performance are increasingly hampered by memory latency, especially for pointer-intensive applications. To counter this loss of performance, numerous data and instruction prefetch mechanisms have been proposed. Recently, several proposals have posited a memory-side prefetcher; typically, these prefetchers involve a distinct processor that executes a program slice that would effectively prefetch data needed by the primary program. Alternative designs embody large state tables that learn the miss reference behavior of the processor and attempt to prefetch likely misses. This paper proposes Content-Directed Data Prefetching, a data prefetching architecture that exploits the memory allocation used by operating systems and runtime systems to improve the performance of pointer-intensive applications constructed using modern language systems. This technique is modeled after conservative garbage collection, and prefetches "likely" virtual addresses observed in memory references. This prefetching mechanism uses the underlying data of the application, and provides an 11.3% speedup using no additional processor state. By adding less than ½% space overhead to the second level cache, performance can be further increased to 12.6% across a range of "real world" applications.

Journal Article•DOI•

Using a user-level memory thread for correlation prefetching

Yan Solihin¹, Jaejin Lee², Josep Torrellas¹•Institutions (2)

University of Illinois at Urbana–Champaign¹, Michigan State University²

01 May 2002

TL;DR: This paper introduces the idea of using a User-Level Memory Thread (ULMT) for correlation prefetching, and shows that the scheme works well in combination with a conventional processor-side sequential prefetcher, in which case the average speedup increases to 1.46.

Abstract: This paper introduces the idea of using a User-Level Memory Thread (ULMT) for correlation prefetching. In this approach, a user thread runs on a general-purpose processor in main memory, either in the memory controller chip or in a DRAM chip. The thread performs correlation prefetching in software, sending the prefetched data into the L2 cache of the main processor. This approach requires minimal hardware beyond the memory processor: the correlation table is a software data structure that resides in main memory, while the main processor only needs a few modifications to its L2 cache so that it can accept incoming prefetches. In addition, the approach has wide usability, as it can effectively prefetch even for irregular applications. Finally, it is very flexible, as the prefetching algorithm can be customized by the user on an application basis. Our simulation results show that, through a new design of the correlation table and prefetching algorithm, our scheme delivers good results. Specifically, nine mostly-irregular applications show an average speedup of 1.32. Furthermore, our scheme works well in combination with a conventional processor-side sequential prefetcher, in which case the average speedup increases to 1.46. Finally, by exploiting the customization of the prefetching algorithm, we increase the average speedup to 1.53.
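
To make the idea of a software correlation table concrete, here is a minimal sketch of basic pair-based correlation prefetching: each miss address maps to the misses that most recently followed it, and those successors are returned as prefetch candidates when the address misses again. This is a generic textbook-style table, not the new table design or prefetching algorithm the paper proposes; the class name, the `num_succ` parameter, and the addresses are hypothetical.

```python
class CorrelationTable:
    """Maps a miss address to its most recently observed successor misses."""

    def __init__(self, num_succ=2):
        self.num_succ = num_succ  # successors kept per entry (assumed value)
        self.table = {}           # miss address -> successors, most recent first
        self.last_miss = None

    def observe_miss(self, addr):
        """Learn from this miss and return the addresses worth prefetching."""
        if self.last_miss is not None:
            succ = self.table.setdefault(self.last_miss, [])
            if addr in succ:
                succ.remove(addr)
            succ.insert(0, addr)      # keep the most recent successor first
            del succ[self.num_succ:]  # evict the oldest successors
        self.last_miss = addr
        return list(self.table.get(addr, []))


# Hypothetical miss stream that repeats the pattern 0x100, 0x200, 0x300.
t = CorrelationTable()
stream = [0x100, 0x200, 0x300, 0x100, 0x200, 0x300]
prefetches = [t.observe_miss(a) for a in stream]
print(prefetches)
```

The first pass through the pattern only trains the table; from the second pass on, each miss yields the address that followed it last time, which is why this style of prefetching can cover irregular but repeating access sequences.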

Journal Article•DOI•

Implementation approaches for the Advanced Encryption Standard algorithm

Xinmiao Zhang¹, Keshab K. Parhi¹•Institutions (1)

University of Minnesota¹

01 Jan 2002-IEEE Circuits and Systems Magazine

TL;DR: This paper addresses various approaches for efficient hardware implementation of the Advanced Encryption Standard algorithm with various methods to reduce the critical path and area of each round unit.

Abstract: This paper addresses various approaches for efficient hardware implementation of the Advanced Encryption Standard algorithm. The optimization methods can be divided into two classes: architectural optimization and algorithmic optimization. Architectural optimization exploits the strength of pipelining, loop unrolling and sub-pipelining. Speed is increased by processing multiple rounds simultaneously at the cost of increased area. Architectural optimization is not an effective solution in feedback mode. Loop unrolling is the only architecture that can achieve a slight speedup with significantly increased area. In non-feedback mode, sub-pipelining can achieve maximum speedup and the best speed/area ratio. Algorithmic optimization exploits algorithmic strength inside each round unit. Various methods to reduce the critical path and area of each round unit are presented. Resource sharing issues between encryptor and decryptor are also discussed. They become important issues when both encryptor and decryptor need to be implemented in a small area.

Proceedings Article•DOI•

Post-pass binary adaptation for software-based speculative precomputation

Steve Shih-wei Liao¹, Perry Wang¹, Hong Wang¹, Gerolf Hoflehner¹, Daniel M. Lavery¹, John Paul Shen¹•Institutions (1)

Intel¹

17 May 2002

TL;DR: This paper presents a post-pass compilation tool for generating SSP-enhanced binaries that is able to analyze a single-threaded application to generate prefetch threads, and identify and embed trigger points in the original binary to produce a new binary that has the prefetch threads attached.

Abstract: Recently, a number of thread-based prefetching techniques have been proposed. These techniques aim at improving the latency of single-threaded applications by leveraging multithreading resources to perform memory prefetching via speculative prefetch threads. Software-based speculative precomputation (SSP) is one such technique, proposed for multithreaded Itanium models. SSP does not require expensive hardware support; instead, it relies on the compiler to adapt binaries to perform prefetching on otherwise idle hardware thread contexts at run time. This paper presents a post-pass compilation tool for generating SSP-enhanced binaries. The tool is able to: (1) analyze a single-threaded application to generate prefetch threads; (2) identify and embed trigger points in the original binary; and (3) produce a new binary that has the prefetch threads attached. The execution of the new binary spawns the speculative prefetch threads, which are executed concurrently with the main thread. Our results indicate that for a set of pointer-intensive benchmarks, the prefetching performed by the speculative threads achieves an average of 87% speedup on an in-order processor and 5% speedup on an out-of-order processor.

Proceedings Article•DOI•

SSVM: a simple SVM algorithm

S.V.M. Vishwanathan¹, M. Narasimha Murty¹•Institutions (1)

Indian Institute of Science¹

07 Aug 2002

TL;DR: A fast iterative algorithm for identifying the support vectors of a given set of points using a greedy approach to pick points for inclusion in the candidate set, which is extremely competitive as compared to other conventional iterative algorithms like SMO and the NPA.

Abstract: We present a fast iterative algorithm for identifying the support vectors of a given set of points. Our algorithm works by maintaining a candidate support vector set. It uses a greedy approach to pick points for inclusion in the candidate set. When the addition of a point to the candidate set is blocked because of other points already present in the set, we use a backtracking approach to prune away such points. To speed up convergence we initialize our algorithm with the nearest pair of points from opposite classes. We then use an optimization based approach to increase or prune the candidate support vector set. The algorithm makes repeated passes over the data to satisfy the KKT constraints. The memory requirements of our algorithm scale as O(|S|²) in the average case, where |S| is the size of the support vector set. We show that the algorithm is extremely competitive as compared to other conventional iterative algorithms like SMO and the NPA. We present results on a variety of real life datasets to validate our claims.
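
The initialization step mentioned in the abstract, starting from the nearest pair of points from opposite classes, can be sketched as below. This is a brute-force illustration only: it assumes Euclidean distance in the input space (a kernel-induced distance could be used instead) and it does not attempt the greedy growth, backtracking, or KKT checks of the full algorithm. The function name and the sample points are hypothetical.

```python
import math
from itertools import product


def nearest_opposite_pair(points, labels):
    """Return (positive_point, negative_point) with the smallest distance."""
    pos = [p for p, y in zip(points, labels) if y > 0]
    neg = [p for p, y in zip(points, labels) if y <= 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    # Compare every positive point with every negative point.
    return min(product(pos, neg), key=lambda pq: math.dist(pq[0], pq[1]))


# Hypothetical 2-D data: two positive points on the left, two negative on the right.
points = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0), (5.0, 0.0)]
labels = [1, 1, -1, -1]

seed = nearest_opposite_pair(points, labels)
print(seed)
```

Such a pair is a natural seed for a candidate support vector set because points that lie closest to the other class are the ones most likely to end up on the margin.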

Proceedings Article•DOI•

Design and evaluation of compiler algorithms for pre-execution

Dongkeun Kim¹, Donald Yeung¹•Institutions (1)

University of Maryland, College Park¹

01 Oct 2002

TL;DR: This paper investigates a source-to-source C compiler for extracting pre-execution thread code automatically, thus relieving the programmer or hardware from this onerous task.

Abstract: Pre-execution is a promising latency tolerance technique that uses one or more helper threads running in spare hardware contexts ahead of the main computation to trigger long-latency memory operations early, hence absorbing their latency on behalf of the main computation. This paper investigates a source-to-source C compiler for extracting pre-execution thread code automatically, thus relieving the programmer or hardware from this onerous task. At the heart of our compiler are three algorithms. First, program slicing removes non-critical code for computing cache-missing memory references, reducing pre-execution overhead. Second, prefetch conversion replaces blocking memory references with non-blocking prefetch instructions to minimize pre-execution thread stalls. Finally, threading scheme selection chooses the best scheme for initiating pre-execution threads, speculatively parallelizing loops to generate thread-level parallelism when necessary for latency tolerance. We prototyped our algorithms using the Stanford University Intermediate Format (SUIF) framework and a publicly available program slicer, called Unravel [13], and we evaluated our compiler on a detailed architectural simulator of an SMT processor. Our results show compiler-based pre-execution improves the performance of 9 out of 13 applications, reducing execution time by 22.7%. Across all 13 applications, our technique delivers an average speedup of 17.0%. These performance gains are achieved fully automatically on conventional SMT hardware, with only minimal modifications to support pre-execution threads.

Book Chapter•DOI•

Strategies for the Parallel Implementation of Metaheuristics

Van-Dat Cung, Simone Martins¹, Celso C. Ribeiro¹, Catherine Roucairol•Institutions (1)

The Catholic University of America¹

01 Jan 2002

TL;DR: Some trends in parallel computing are reviewed and recent results about linear speedups that can be obtained with parallel implementations using multiple independent processors are reported.

Abstract: Parallel implementations of metaheuristics appear quite naturally as an effective alternative to speed up the search for approximate solutions of combinatorial optimization problems. They not only allow solving larger problems or finding improved solutions with respect to their sequential counterparts, but also lead to more robust algorithms. We review some trends in parallel computing and report recent results about linear speedups that can be obtained with parallel implementations using multiple independent processors. Parallel implementations of tabu search, GRASP, genetic algorithms, simulated annealing, and ant colonies are reviewed and discussed to illustrate the main strategies used in the parallelization of different metaheuristics and their hybrids.

Report•DOI•

Applying Fast String Matching to Intrusion Detection

Mike Fisk, George Varghese

01 Sep 2002

TL;DR: This paper develops a hybrid system that utilizes three different search algorithms, including one new algorithm presented in the paper; the resulting system matches many common packets 5 times faster, with an average speedup of 50%.

Abstract: The performance of signature-based network intrusion detection tools is dominated by the string matching of packets against many signatures. In this paper we study how the popular intrusion detection system Snort can be best optimized to utilize different string matching algorithms. We analyze the performance of Snort's current string matching algorithm, Boyer-Moore, and several alternate algorithms. We show that no single algorithm is fastest in the context of a real Snort rule set. Instead, we develop a hybrid system that utilizes three different search algorithms, including one new algorithm presented in this paper. The result is a system that matches many common packets 5 times faster with an average speedup of 50%. While the context of our analysis is intrusion detection, other problem domains such as virus scanning, firewalls, and layer seven switches benefit from our work.
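
As a rough illustration of the Boyer-Moore family of matchers discussed in the abstract, the sketch below implements the Horspool simplification, which uses only a bad-character shift table. It is not Snort's implementation and not the new algorithm from the report; the function name, payload, and signature string are hypothetical.

```python
def horspool_find(text: bytes, pattern: bytes) -> int:
    """Return the index of the first occurrence of pattern in text, or -1."""
    m, n = len(pattern), len(text)
    if m == 0:
        return 0
    # Shift for each byte: distance from its rightmost occurrence
    # (excluding the final pattern position) to the end of the pattern.
    shift = {b: m - 1 - i for i, b in enumerate(pattern[:-1])}
    pos = 0
    while pos + m <= n:
        if text[pos:pos + m] == pattern:
            return pos
        # Skip ahead based on the text byte aligned with the pattern's last byte;
        # bytes absent from the table allow a full-length skip.
        pos += shift.get(text[pos + m - 1], m)
    return -1


# Hypothetical packet payload and signature.
payload = b"GET /cgi-bin/phf HTTP/1.0"
hit = horspool_find(payload, b"/cgi-bin/phf")
miss = horspool_find(payload, b"/etc/passwd")
print(hit, miss)
```

Longer patterns allow larger skips, which is one reason a single algorithm is unlikely to be fastest across a rule set whose signatures vary widely in length.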

Journal Article•DOI•

Steps toward accurate reconstructions of phylogenies from gene-order data

Bernard M. E. Moret¹, Jijun Tang¹, Li-San Wang², Tandy Warnow²•Institutions (2)

University of New Mexico¹, University of Texas at Austin²

01 Nov 2002-Journal of Computer and System Sciences

TL;DR: New phylogenetic analyses of a subset of the Campanulaceae family are conducted, confirming various conjectures about the relationships among members of the subset and confirming that inversion can be viewed as the principal mechanism of evolution for their chloroplast genome.

Proceedings Article•DOI•

Combining strengths of circuit-based and CNF-based algorithms for a high-performance SAT solver

Malay K. Ganai¹, Lintao Zhang¹, P. Ashar², Aarti Gupta¹, Sharad Malik¹•Institutions (2)

Princeton University¹, NEC²

10 Jun 2002

TL;DR: This work demonstrates that by employing the same innovations as in advanced CNF-based SAT solvers, but in a hybrid approach where these two portions of the formula are represented differently and processed separately, it is possible to obtain the consistently highest performing SAT solver for circuit oriented problem domains.

Abstract: We propose satisfiability checking (SAT) techniques that lead to a consistent performance improvement of up to 3× over state-of-the-art SAT solvers like Chaff on important problem domains in VLSI CAD. We observe that in circuit oriented applications like ATPG and verification, different software engineering techniques are required for the portions of the formula corresponding to learnt clauses compared to the original formula. We demonstrate that by employing the same innovations as in advanced CNF-based SAT solvers, but in a hybrid approach where these two portions of the formula are represented differently and processed separately, it is possible to obtain the consistently highest performing SAT solver for circuit oriented problem domains. We also present controlled experiments to highlight where these gains come from. Once it is established that the hybrid approach is faster, it becomes possible to apply low overhead circuit-based heuristics that would be unavailable in the CNF domain for greater speedup.

Proceedings Article•DOI•

Sequential Conditional Generalized Iterative Scaling

Joshua T. Goodman¹•Institutions (1)

Microsoft¹

06 Jul 2002

TL;DR: A speedup for training conditional maximum entropy models is described: the algorithm is a simple variation on Generalized Iterative Scaling but converges roughly an order of magnitude faster, depending on the number of constraints and the way speed is measured.

Abstract: We describe a speedup for training conditional maximum entropy models. The algorithm is a simple variation on Generalized Iterative Scaling, but converges roughly an order of magnitude faster, depending on the number of constraints, and the way speed is measured. Rather than attempting to train all model parameters simultaneously, the algorithm trains them sequentially. The algorithm is easy to implement, typically uses only slightly more memory, and will lead to improvements for most maximum entropy problems.

Journal Article•DOI•

Tarantula: a vector extension to the alpha architecture

Roger Espasa, Federico Ardanaz, Joel Emer, Stephen Felix, Julio Gago, Roger Gramunt, Isaac Hernandez, Toni Juan, Geoff Lowney, Matthew Mattina, André Seznec

01 May 2002

TL;DR: Tarantula is an aggressive floating point machine targeted at technical, scientific and bioinformatics workloads that fully integrates into a virtual-memory cache-coherent system without changes to its coherency protocol, and achieves excellent "real-computation" per transistor and per watt ratios.

Abstract: Tarantula is an aggressive floating point machine targeted at technical, scientific and bioinformatics workloads, originally planned as a follow-on candidate to the EV8 processor [6, 5]. Tarantula adds to the EV8 core a vector unit capable of 32 double-precision flops per cycle. The vector unit fetches data directly from a 16 MByte second level cache with a peak bandwidth of sixty four 64-bit values per cycle. The whole chip is backed by a memory controller capable of delivering over 64 GBytes/s of raw bandwidth. Tarantula extends the Alpha ISA with new vector instructions that operate on new architectural state. Salient features of the architecture and implementation are: (1) it fully integrates into a virtual-memory cache-coherent system without changes to its coherency protocol, (2) provides high bandwidth for non-unit stride memory accesses, (3) supports gather/scatter instructions efficiently, (4) fully integrates with the EV8 core with a narrow, streamlined interface, rather than acting as a co-processor, (5) can achieve a peak of 104 operations per cycle, and (6) achieves excellent "real-computation" per transistor and per watt ratios. Our detailed simulations show that Tarantula achieves an average speedup of 5X over EV8, out of a peak speedup in terms of flops of 8X. Furthermore, performance on gather/scatter intensive benchmarks such as Radix Sort is also remarkable: a speedup of almost 3X over EV8 and 15 sustained operations per cycle. Several benchmarks exceed 20 operations per cycle.

Proceedings Article•DOI•

Synthesis of custom processors based on extensible platforms

Fei Sun¹, Srivaths Ravi², Anand Raghunathan², Niraj K. Jha¹•Institutions (2)

Princeton University¹, NEC²

10 Nov 2002

TL;DR: It is demonstrated that the number of custom instruction candidates grows rapidly with program size, leading to a large design space, and that the quality (speedup) of custom instructions varies significantly across this space, motivating the need for the proposed flow.

Abstract: Efficiency and flexibility are critical, but often conflicting, design goals in embedded system design. The recent emergence of extensible processors promises a favorable tradeoff between efficiency and flexibility, while keeping design turnaround times short. Current extensible processor design flows automate several tedious tasks, but typically require designers to manually select the parts of the program that are to be implemented as custom instructions. In this work, we describe an automatic methodology to select custom instructions to augment an extensible processor, in order to maximize its efficiency for a given application program. We demonstrate that the number of custom instruction candidates grows rapidly with program size, leading to a large design space, and that the quality (speedup) of custom instructions varies significantly across this space, motivating the need for the proposed flow. Our methodology features cost functions to guide the custom instruction selection process, as well as static and dynamic pruning techniques to eliminate inferior parts of the design space from consideration. Further, we employ a two-stage process, wherein a limited number of promising instruction candidates are first selected, and then evaluated in more detail through cycle-accurate instruction set simulation and synthesis of the corresponding hardware, to identify the custom instruction combinations that result in the highest program speedup or maximize speedup under a given area constraint. We have evaluated the proposed techniques using a state-of-the-art extensible processor platform, in the context of a commercial design flow. Experiments with several benchmark programs indicate that custom processors synthesized using automatic custom instruction selection can result in large improvements in performance (up to 5.4X, average of 3.4X), energy (up to 4.5X, average of 3.2X), and energy-delay product (up to 24.2X, average of 12.6X), while speeding up the design process significantly.
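The energy-delay figures quoted above relate to the other two metrics through the definition of the metric: energy-delay product is energy multiplied by execution time, so for a single program its improvement factor is the speedup times the energy improvement. The Python sketch below is only an illustration using the headline numbers from the abstract. The function name is hypothetical, and pairing the two maxima assumes they come from the same benchmark, which the abstract does not state.

```python
def edp_improvement(speedup: float, energy_improvement: float) -> float:
    """Improvement factor in energy-delay product (EDP) for one program.

    EDP = energy * delay, so old_EDP / new_EDP equals
    (old_energy / new_energy) * (old_delay / new_delay).
    """
    return speedup * energy_improvement


# Headline maxima from the abstract; assumed (not confirmed) to belong
# to the same benchmark program.
best_case = edp_improvement(5.4, 4.5)

# Product of the reported averages. An average of per-benchmark products
# need not equal this value, so it is only a rough cross-check.
from_averages = edp_improvement(3.4, 3.2)

print(f"max: {best_case:.1f}x, from averages: {from_averages:.1f}x")
```

The first product comes out close to the reported 24.2X. The second, about 10.9X, is below the reported 12.6X average, which is consistent with speedup and energy savings being positively correlated across benchmarks.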

Proceedings Article•DOI•

A general compiler framework for speculative multithreading

Anasua Bhowmik¹, Manoj Franklin¹•Institutions (1)

University of Maryland, College Park¹

10 Aug 2002

TL;DR: A compiler framework for partitioning a sequential program into multiple threads for parallel execution in an SpMT system, which supports a wide variety of threads, such as speculative threads, non-speculative threads, loop-centric threads, and out-of-order thread spawning.

Abstract: Speculative multithreading (SpMT) promises to be an effective mechanism for parallelizing non-numeric programs, which tend to use irregular data structures with pointers and have complex flows of control. Proper thread formation is crucial to obtaining good speedup in an SpMT system. This paper presents a compiler framework for partitioning a sequential program into multiple threads for parallel execution in an SpMT system. This framework is very general, and supports a wide variety of threads, such as speculative threads, non-speculative threads, loop-centric threads, and out-of-order thread spawning. The compiler uses profiling, intra-procedural pointer analysis, data dependence information and control dependence information. The compiler is implemented on the SUIF-MachSUIF platform. A simulation-based evaluation of the generated threads shows that an average speedup of 3 can be obtained with 6 processing elements for non-numeric programs. This speedup reduces to 2 if we use only loop-based threads.
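To put the reported speedups in context, a common derived metric is parallel efficiency, defined as speedup divided by the number of processing elements. The short Python sketch below applies that standard definition to the two figures in the abstract. The function name is hypothetical, and the efficiency values are computed here rather than reported by the authors.

```python
def parallel_efficiency(speedup: float, processing_elements: int) -> float:
    """Return speedup per processing element (1.0 means ideal linear scaling)."""
    if processing_elements <= 0:
        raise ValueError("processing_elements must be positive")
    return speedup / processing_elements


# Figures quoted in the abstract: 6 processing elements in both cases.
all_thread_types = parallel_efficiency(3.0, 6)   # general thread formation
loop_threads_only = parallel_efficiency(2.0, 6)  # loop-based threads only

print(f"all thread types: {all_thread_types:.2f}")
print(f"loop-based only:  {loop_threads_only:.2f}")
```

Under this definition, the general framework uses each processing element at about half of ideal, and restricting thread formation to loops lowers that to about one third.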

Journal Article•DOI•

Heterogeneous computing and parallel genetic algorithms

Enrique Alba¹, Antonio J. Nebro¹, José M. Troya¹•Institutions (1)

University of Málaga¹

01 Sep 2002-Journal of Parallel and Distributed Computing

TL;DR: This paper uses Java to implement a distributed PGA model and finds that heterogeneous computing can be as efficient as, or even more efficient than, homogeneous computing for parallel heuristics.

Journal Article•DOI•

Improving flexibility and efficiency by adding parallelism to genetic algorithms

Enrique Alba, José M. Troya

01 Apr 2002-Statistics and Computing

TL;DR: The aim is to shed some light on the advantages and drawbacks of various sequential and parallel GAs to help researchers using them in the very diverse application fields of evolutionary computation.

Abstract: In this paper we develop a study on several types of parallel genetic algorithms (PGAs). Our motivation is to bring some uniformity to the proposal, comparison, and knowledge exchange among the traditionally opposite kinds of serial and parallel GAs. We comparatively analyze the properties of steady-state, generational, and cellular genetic algorithms. Afterwards, this study is extended to consider a distributed model consisting of a ring of GA islands. The analyzed features are the time complexity, selection pressure, schema processing rates, efficacy in finding an optimum, efficiency, speedup, and resistance to scalability. Besides that, we briefly discuss how the migration policy affects the search. Also, some of the search properties of cellular GAs are investigated. The selected benchmark is a representative subset of problems containing real world difficulties. We often conclude that parallel GAs are numerically better and faster than equivalent sequential GAs. Our aim is to shed some light on the advantages and drawbacks of various sequential and parallel GAs to help researchers using them in the very diverse application fields of evolutionary computation.

Proceedings Article•DOI•

Parallel-beam backprojection: an FPGA implementation optimized for medical imaging

Srdjan Coric¹, Miriam Leeser¹, Eric L. Miller¹, Marc Trepanier²•Institutions (2)

Northeastern University¹, Mercury Systems²

24 Feb 2002

TL;DR: This paper presents an FPGA implementation of the parallel-beam backprojection algorithm used in CT for which all of the requirements are met and shows significant speedup over software versions of the same algorithm, and is more flexible than an ASIC implementation.

Abstract: Medical image processing in general and computerized tomography (CT) in particular can benefit greatly from hardware acceleration. This application domain is marked by computationally intensive algorithms requiring the rapid processing of large amounts of data. To date, reconfigurable hardware has not been applied to this important area. For efficient implementation and maximum speedup, fixed-point implementations are required. The associated quantization errors must be carefully balanced against the requirements of the medical community. Specifically, care must be taken so that very little error is introduced compared to floating-point implementations and the visual quality of the images is not compromised. In this paper, we present an FPGA implementation of the parallel-beam backprojection algorithm used in CT for which all of these requirements are met. We explore a number of quantization issues arising in backprojection and concentrate on minimizing error while maximizing efficiency. Our implementation shows significant speedup over software versions of the same algorithm, and is more flexible than an ASIC implementation. Our FPGA implementation can easily be adapted both to medical sensors with different dynamic ranges and to tomographic scanners employed in a wider range of application areas, including nondestructive evaluation and baggage inspection in airport terminals.

Proceedings Article•DOI•

A 10 Gbps full-AES crypto design with a twisted-BDD S-Box architecture

S. Morioka¹, Akashi Satoh¹•Institutions (1)

IBM¹

16 Sep 2002

TL;DR: A high-speed AES IP-core is presented, which runs at 780 MHz on a 0.13 µm CMOS standard cell library.

Abstract: In this paper, we present a high-speed AES IP-core, which runs at 780 MHz on a 0.13 µm CMOS standard cell library, and which achieves 10 Gbps throughput in all encryption modes, including CBC mode. Although the CBC mode is the most widely used and important, achieving such high throughput was difficult because pipelining techniques cannot be applied. To reduce the propagation delays of the S-Box, the most critical function block, we developed a special circuit architecture that we call twisted-BDD, where the fanout of signals is distributed in the S-Box circuit. Our S-Box is 1.5 to 2 times faster than the conventional S-Box implementations. The T-Box algorithm, which merges the S-Box and another primitive function (MixColumns) into a single function, is also used for an additional speedup.
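The relationship between the quoted clock rate and throughput can be checked with a back-of-the-envelope calculation. In CBC mode each block depends on the previous ciphertext, so blocks cannot overlap and throughput is block size times clock rate divided by cycles per block. The Python sketch below is an illustration only: the function name is hypothetical, and the value of 10 cycles per 128-bit block (one AES-128 round per clock cycle) is an assumption, not something stated in the abstract.

```python
def block_cipher_throughput_gbps(
    clock_hz: float,
    block_bits: int = 128,       # AES block size
    cycles_per_block: int = 10,  # ASSUMPTION: one AES-128 round per cycle
) -> float:
    """Throughput in Gbps when blocks are processed strictly one after another."""
    if cycles_per_block <= 0:
        raise ValueError("cycles_per_block must be positive")
    return clock_hz * block_bits / cycles_per_block / 1e9


# Clock rate quoted in the abstract: 780 MHz.
cbc_gbps = block_cipher_throughput_gbps(780e6)
print(f"{cbc_gbps:.3f} Gbps")
```

Under that assumption, 780 MHz yields roughly 9.98 Gbps, in line with the 10 Gbps figure quoted above. A different cycle count per block would scale the result proportionally.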

Proceedings Article•DOI•

Structural alignment of large-size proteins via Lagrangian relaxation

Alberto Caprara¹, Giuseppe Lancia²•Institutions (2)

University of Bologna¹, University of Padua²

18 Apr 2002

TL;DR: A new approach to the Contact Map Overlap problem for the comparison of protein structures is illustrated, able to solve optimally for the first time instances for PDB proteins with about 1000 residues and 2000 contacts.

Abstract: We illustrate a new approach to the Contact Map Overlap problem for the comparison of protein structures. The approach is based on formulating the problem as an integer linear program and then relaxing in a Lagrangian way a suitable set of constraints. This relaxation is solved by computing a sequence of simple alignment problems, each in quadratic time, and near-optimal Lagrangian multipliers are found by subgradient optimization. By our approach we achieved a substantial speedup over the best existing methods. We were able to solve optimally for the first time instances for PDB proteins with about 1000 residues and 2000 contacts. Moreover, within a few hours we compared 780 pairs in a testbed of 40 large proteins, finding the optimal solution in 150 cases. Finally, we compared 10,000 pairs of proteins from a test set of 269 proteins in the literature, which took a couple of days on a PC.
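The figure of 780 pairs matches the number of unordered pairs that can be drawn from 40 proteins, which suggests an all-against-all comparison of the testbed. The Python sketch below checks that arithmetic and derives the fraction of pairs solved to optimality. The function name is hypothetical, and the derived percentages are computed here rather than reported in the abstract.

```python
import math


def unordered_pairs(n: int) -> int:
    """Number of distinct unordered pairs among n items, i.e. n choose 2."""
    return math.comb(n, 2)


# Testbed of 40 large proteins, compared pairwise.
testbed_pairs = unordered_pairs(40)
solved_fraction = 150 / testbed_pairs  # 150 optimal solutions reported

# Larger test set of 269 proteins: 10,000 compared pairs is a subset of all pairs.
all_pairs_269 = unordered_pairs(269)

print(f"pairs among 40 proteins: {testbed_pairs}")
print(f"solved to optimality: {solved_fraction:.1%}")
print(f"pairs among 269 proteins: {all_pairs_269}")
```

By this count, about 19% of the 780 testbed pairs were solved to proven optimality, and the 10,000 pairs from the second experiment cover a bit over a quarter of the 36,046 possible pairs among 269 proteins.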

Journal Article•DOI•

Modelling the three-way catalytic converter with mechanistic kinetics using the Newton–Krylov method on a parallel computer

L.S. Mukadi¹, Robert E. Hayes¹•Institutions (1)

University of Alberta¹

15 Mar 2002-Computers & Chemical Engineering

TL;DR: A mathematical model for an automotive three-way catalytic converter based on experimental mechanistic kinetics is presented, and the use of parallel computing at a fine grain level on vector–vector and vector–matrix operations is shown to provide a large degree of speedup, which increases as the number of grid points increases.
