Topic

Speedup

About: Speedup is a research topic. Over the lifetime, 23618 publications have been published within this topic receiving 390005 citations.


Papers
Proceedings Article
25 Aug 1996
TL;DR: The dynamic clocking scheme provided a speedup ranging from 1.5 to 3 over the uni-frequency clocking for various low level pattern recognition and image processing algorithms that were mapped onto the chip.
Abstract: In this paper, we propose a dynamic frequency linear array processor, DFLAP, for real-time image processing applications. The architecture uses a novel concept of dynamic frequency clocking, which allows the chip to operate between a maximum frequency of 400 MHz and a minimum frequency of 50 MHz depending on the operation being performed. The dynamic clocking scheme is especially useful in the context of image processing applications, where certain tasks require only logic functions, others require only additions, and still others require multiplication or division. The proposed architecture provides speedup by supporting two levels of parallelism and using variable-frequency single-clock-cycle operations. DFLAP provides parallelism at the array level using multiple processing elements (PEs) and at the functional level by allowing concurrent use of various units in the PE. The array architecture contains N PEs, where the image size is N × N, and each PE in turn contains an 8-bit arithmetic/logic unit, an 8 × 8 single-cycle multiplier, a shifter, a neighbor communication unit, a 32 × 8 dual-port SRAM and a dynamic clocking unit (DCU). The DCU in each PE enables dynamic switching of clock frequencies. The dynamic clocking scheme provided a speedup ranging from 1.5 to 3 over the uni-frequency clocking for various low level pattern recognition and image processing algorithms that were mapped onto the chip.

7 citations
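
The speedup claim in the DFLAP abstract above can be illustrated with a small timing model. This is only a sketch under stated assumptions: the per-operation frequencies and the operation mix are hypothetical (the abstract gives only the 50 MHz and 400 MHz bounds), and the uni-frequency baseline is assumed to run every single-cycle operation at the slowest unit's clock.

```python
def dynamic_clock_speedup(op_counts, op_freq_mhz):
    """Speedup of per-operation clocking over one fixed clock.

    op_counts:   number of single-cycle operations of each kind
    op_freq_mhz: clock frequency (MHz) at which each kind can run
    """
    # Assumed baseline: a single clock limited by the slowest operation in use.
    base_freq = min(op_freq_mhz[op] for op in op_counts)
    t_uniform = sum(op_counts.values()) / base_freq
    # Dynamic clocking: each operation takes one cycle at its own frequency.
    t_dynamic = sum(n / op_freq_mhz[op] for op, n in op_counts.items())
    return t_uniform / t_dynamic


# Hypothetical frequencies inside the 50-400 MHz range from the abstract.
freqs = {"logic": 400.0, "add": 200.0, "mul": 50.0}
# Hypothetical operation mix for an image-processing kernel.
mix = {"logic": 400, "add": 200, "mul": 400}
print(dynamic_clock_speedup(mix, freqs))
```

Under this model the gain depends entirely on how much of the workload consists of fast operations; a workload made up only of multiplications would see no speedup at all.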

Proceedings Article
09 Sep 2015
TL;DR: A hybrid method which simultaneously exploits both CPU and GPU cores to provide the best performance based on selected parameters of the approximation scheme is presented, which achieves more than two orders of magnitude speedup over serial computation for many of the molecular energetics terms.
Abstract: Motivation. Despite several reported acceleration successes of programmable GPUs (Graphics Processing Units) for molecular modeling and simulation tools, the general focus has been on fast computation with small molecules. This was primarily due to the limited memory size on the GPU. Moreover, simultaneous use of CPU and GPU cores for a single kernel execution -- a necessity for achieving high parallelism -- has also not been fully considered. Results. We present fast computation methods for molecular mechanical (Lennard-Jones and Coulombic) and generalized Born solvation energetics which run on commodity multicore CPUs and manycore GPUs. The key idea is to trade off accuracy of pairwise, long-range atomistic energetics for higher speed of execution. A simple yet efficient CUDA kernel for GPU acceleration is presented which ensures high arithmetic intensity and memory efficiency. Our CUDA kernel uses a cache-friendly, recursive and linear-space octree data structure to handle very large molecular structures with up to several million atoms. Based on this CUDA kernel, we present a hybrid method which simultaneously exploits both CPU and GPU cores to provide the best performance based on selected parameters of the approximation scheme. Our CUDA kernels achieve more than two orders of magnitude speedup over serial computation for many of the molecular energetics terms. The hybrid method is shown to be able to achieve the best performance for all values of the approximation parameter. Availability. The source code and binaries are freely available as PMEOPA (Parallel Molecular Energetic using Octree Pairwise Approximation) and downloadable from http://cvcweb.ices.utexas.edu/software.

7 citations
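
For context on what the serial baseline in the abstract above computes, here is a minimal sketch of the exact O(n²) pairwise Lennard-Jones and Coulomb sums. It is not the PMEOPA code or its octree approximation; the function name, the single shared `epsilon`/`sigma` pair, and the unit-free `coulomb_k` constant are assumptions made for illustration.

```python
import math


def pairwise_energy(positions, charges, epsilon=1.0, sigma=1.0, coulomb_k=1.0):
    """Exact all-pairs Lennard-Jones and Coulomb energies (serial baseline).

    positions: list of (x, y, z) tuples; charges: list of floats.
    Returns (lennard_jones, coulomb) in the units implied by the parameters.
    """
    lj = 0.0
    coul = 0.0
    n = len(positions)
    for i in range(n):
        for j in range(i + 1, n):  # visit each unordered pair once
            r = math.dist(positions[i], positions[j])
            sr6 = (sigma / r) ** 6
            lj += 4.0 * epsilon * (sr6 * sr6 - sr6)   # 12-6 potential
            coul += coulomb_k * charges[i] * charges[j] / r
    return lj, coul
```

The quadratic loop is what an approximation scheme such as the octree method in the abstract trades accuracy against: distant atom groups are treated approximately instead of pair by pair.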

Proceedings Article
30 Aug 1992
TL;DR: The special purpose architecture is used to perform the band matrix multiplication in order to compute the local distance metric based on Itakura's log likelihood distance.
Abstract: Describes an area- and time-efficient systolic array architecture for computations in Dynamic Time Warping (DTW). The special purpose architecture is used to perform the band matrix multiplication in order to compute the local distance metric based on Itakura's log likelihood distance. The time complexity of the algorithm is O(nk), where n and k are the number of elements in the row of the first and second input matrices. The number of processors is equal to the bandwidth w of the output band matrix. The speedup of the parallel algorithm compared to the sequential algorithm is wz, where z is the number of multiplier stages within a PE. The parallel algorithm can be implemented as a single VLSI chip.

7 citations
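
As background for the DTW abstract above, the following is a minimal sequential sketch of band-constrained dynamic time warping. It is not the systolic architecture from the paper: the local distance here is a plain absolute difference rather than Itakura's log likelihood distance, and the function name and `band` parameter are illustrative assumptions.

```python
def banded_dtw(a, b, band):
    """DTW cost between sequences a and b, restricted to |i - j| <= band."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = best cumulative cost aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        # Only cells inside the band around the diagonal are evaluated.
        for j in range(max(1, i - band), min(m, i + band) + 1):
            d = abs(a[i - 1] - b[j - 1])  # assumed local distance
            cost[i][j] = d + min(cost[i - 1][j],      # step in a only
                                 cost[i][j - 1],      # step in b only
                                 cost[i - 1][j - 1])  # step in both
    return cost[n][m]


print(banded_dtw([1, 2, 3], [1, 2, 2, 3], band=1))
```

Restricting work to a band of width w is what makes it natural to assign one processor per band diagonal, as the abstract does; the sequential version above visits the same cells one at a time.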

01 Jan 2009
TL;DR: A pipelined architecture for hardware PSO implementation is presented and an execution speedup of several orders of magnitude is observed.
Abstract: Particle Swarm Optimization (PSO) is a popular population-based optimization algorithm. While PSO has been shown to perform well in a large variety of problems, it is typically implemented in software. Population-based optimization algorithms such as PSO are well suited for execution in parallel stages. This allows PSO to be implemented directly in hardware and achieve much faster execution times than possible in software. In this paper, a pipelined architecture for hardware PSO implementation is presented. Benchmark functions solved by software and hardware PSO implementations are compared. The hardware PSO design is implemented on a Xilinx Virtex-II Pro Development Kit for evaluation. By implementing PSO directly in hardware, an execution speedup of several orders of magnitude is observed.

7 citations
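
To make concrete what a software PSO baseline like the one mentioned in the abstract above looks like, here is a minimal global-best PSO sketch. It is a generic textbook form, not the paper's pipelined design; the parameter values (`w`, `c1`, `c2`, swarm size, iteration count) and the sphere test function are assumptions chosen for illustration.

```python
import random


def pso_minimize(f, dim, bounds, n_particles=20, iters=200,
                 w=0.7, c1=1.5, c2=1.5, seed=0):
    """Minimize f over a box using a basic global-best particle swarm."""
    rng = random.Random(seed)
    lo, hi = bounds
    pos = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]                 # each particle's best position
    pbest_val = [f(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]  # swarm-wide best
    for _ in range(iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                # Clamp the new position to the search box.
                pos[i][d] = min(hi, max(lo, pos[i][d] + vel[i][d]))
            val = f(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val


def sphere(x):
    """Hypothetical benchmark: sum of squares, minimum 0 at the origin."""
    return sum(v * v for v in x)


best_pos, best_val = pso_minimize(sphere, dim=2, bounds=(-5.0, 5.0))
```

Each particle's update depends only on its own state and the shared global best, which is the property that lets the per-particle work be laid out as parallel or pipelined stages.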

Book Chapter
24 Nov 1998
TL;DR: The Xilinx XC6216 Field Programmable Gate Array is described, along with how it is used to efficiently search a hybrid 2-state, 5-neighbour cellular automata rule space that exhibits computation universality.
Abstract: Cellular Automata architectures are attractive due to their fine grain parallelism, simple computational structures and local routing resources. Some researchers have used genetic algorithms to find CA that perform useful computations. The inherently parallel cellular automata model as well as the genetic algorithm are poorly suited to implementation on general purpose microprocessor based systems. Field Programmable Gate Arrays are an alternative that can provide significant speedup. This paper describes the Xilinx XC6216 Field Programmable Gate Array and how it is used to efficiently search a hybrid 2-state, 5-neighbour cellular automata rule space that exhibits computation universality. Its use in an image processing application, binary texture analysis, is discussed.

7 citations
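
The 2-state, 5-neighbour rule space mentioned in the abstract above can be sketched as follows. With two states and five inputs (the cell plus its north, east, south and west neighbours) there are 32 input patterns, so a rule fits in a 32-bit lookup table. The bit ordering, the wrap-around boundary, and the use of one uniform rule for all cells are assumptions; the paper's hybrid automata may assign different rules to different cells.

```python
def ca_step(grid, rule):
    """Advance a 2-state, 5-neighbour cellular automaton by one step.

    grid: list of rows of 0/1 values; rule: 32-bit integer lookup table.
    Bit k of `rule` is the next state for neighbourhood pattern k, where
    k = centre<<4 | north<<3 | east<<2 | south<<1 | west (assumed ordering).
    """
    rows, cols = len(grid), len(grid[0])
    nxt = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            idx = (grid[r][c] << 4
                   | grid[(r - 1) % rows][c] << 3   # north (wraps around)
                   | grid[r][(c + 1) % cols] << 2   # east
                   | grid[(r + 1) % rows][c] << 1   # south
                   | grid[r][(c - 1) % cols])       # west
            nxt[r][c] = (rule >> idx) & 1
    return nxt


# Example rule: next state is the parity (XOR) of the five inputs.
parity_rule = sum((bin(k).count("1") & 1) << k for k in range(32))
seed_grid = [[0, 0, 0],
             [0, 1, 0],
             [0, 0, 0]]
print(ca_step(seed_grid, parity_rule))
```

Every cell's update reads only its immediate neighbours, which is why the model maps well onto fine-grained parallel hardware and poorly onto a sequential loop like this one.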


Network Information
Related Topics (5)
Artificial neural network: 207K papers, 4.5M citations, 86% related
Cluster analysis: 146.5K papers, 2.9M citations, 86% related
Deep learning: 79.8K papers, 2.1M citations, 86% related
Software: 130.5K papers, 2M citations, 85% related
Optimization problem: 96.4K papers, 2.1M citations, 85% related
Performance
Metrics
No. of papers in the topic in previous years
Year    Papers
2023    945
2022    2,078
2021    1,318
2020    1,365
2019    1,370
2018    1,406