Topic

Pipeline (computing)

About: Pipeline (computing) is a research topic. Over its lifetime, 26,760 publications have been published within this topic, receiving 204,305 citations. The topic is also known as: data pipeline & computational pipeline.


Papers
Proceedings ArticleDOI
07 Oct 1996
TL;DR: A new non-restoring square root algorithm is presented in which, at each iteration, the "non-restoring" treatment applies to the partial remainder rather than to each bit of the square root, and only one traditional adder/subtractor is required per iteration.
Abstract: We present a new non-restoring square root algorithm that is very efficient to implement. Unlike other square root algorithms, the algorithm presented here has the following features. First, the focus of the "non-restoring" is on the partial remainder, not on each bit of the square root, with each iteration. Second, it requires only one traditional adder/subtractor in each iteration, i.e., it does not require other hardware components such as seed generators, multipliers, or even multiplexors. Third, it generates the correct resulting value even in the last bit position. Fourth, based on the resulting value of the last bit, a precise remainder can be obtained immediately without any correction or addition operation. Finally, it can be implemented at a very fast clock rate because of the very simple operations at each iteration. We illustrate two VLSI implementations of the new algorithm. One is a fully pipelined high-performance implementation that can accept a new square-root instruction each clock cycle, with each pipeline stage requiring a minimum gate count. The other is a low-cost implementation that uses only a single adder/subtractor for iterative operation.
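As a concrete illustration of the iteration the abstract describes, here is a minimal C sketch of a non-restoring integer square root: the sign of the (never-restored) partial remainder selects a single add or subtract per step and yields the next root bit. The 32-bit radicand / 16-bit root widths are illustrative, and the final remainder fix-up uses the common textbook correction rather than the paper's correction-free last-bit technique.

```c
#include <stdint.h>
#include <stdio.h>

/* Non-restoring square root: the signed partial remainder is never
 * restored; its sign picks subtract (4q+1) or add (4q+3) and yields
 * the next root bit. One add/subtract per iteration, as in the paper. */
uint16_t isqrt_nonrestoring(uint32_t d, uint32_t *rem)
{
    int32_t  r = 0;   /* partial remainder, allowed to go negative */
    uint32_t q = 0;   /* partial square root */

    for (int i = 15; i >= 0; i--) {
        int32_t next = 4 * r + (int32_t)((d >> (2 * i)) & 3u);
        if (r >= 0)
            r = next - (int32_t)((q << 2) | 1u);   /* try subtracting 4q+1 */
        else
            r = next + (int32_t)((q << 2) | 3u);   /* compensate with 4q+3 */
        q = (q << 1) | (r >= 0 ? 1u : 0u);         /* root bit from the sign */
    }
    if (r < 0)                        /* textbook fix-up for the exact remainder */
        r += (int32_t)((q << 1) | 1u);
    if (rem) *rem = (uint32_t)r;
    return (uint16_t)q;
}

int main(void)
{
    uint32_t rem;
    uint16_t q = isqrt_nonrestoring(1000000u, &rem);
    printf("root=%u rem=%u\n", (unsigned)q, rem);   /* prints root=1000 rem=0 */
    return 0;
}
```

Each loop iteration maps naturally to one pipeline stage in the fully pipelined implementation, or to one pass through the single adder/subtractor in the low-cost one.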

74 citations

Patent
06 Jul 1998
TL;DR: When a branch instruction with a forward target within a predetermined range is predicted taken, the processor allows sequential fetching to continue and selectively cancels only the sequential instructions that are not part of the predicted instruction sequence (i.e., the instructions between the predicted-taken branch and the target instruction identified by the forward branch target address).
Abstract: A processor is configured to detect a branch instruction having a forward branch target address within a predetermined range of the branch instruction's fetch address. If the branch instruction is predicted taken, then instead of canceling subsequent instructions and fetching from the branch target address, the processor allows sequential fetching to continue and selectively cancels the sequential instructions that are not part of the predicted instruction sequence (i.e., the instructions between the predicted-taken branch instruction and the target instruction identified by the forward branch target address). Instructions within the predicted instruction sequence that may already have been fetched prior to predicting the branch taken may be retained within the pipeline of the processor, and yet subsequent instructions may be fetched.
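The fetch policy can be sketched in a few lines of C. Everything here (the WINDOW constant standing in for the "predetermined range", the structure fields, the cancel flag) is hypothetical scaffolding to show the selective-cancel idea, not the patent's actual logic.

```c
#include <stdbool.h>
#include <stdint.h>

#define WINDOW 64u   /* hypothetical stand-in for the "predetermined range" */

typedef struct {
    uint32_t addr;      /* fetch address of the instruction */
    bool     canceled;  /* squashed in place but left in the pipeline */
} fetched_insn;

/* On a predicted-taken branch: if the forward target is close enough,
 * keep fetching sequentially and just mark the skipped-over instructions
 * canceled; otherwise fall back to a conventional flush and refetch. */
bool handle_predicted_taken(fetched_insn *pipe, int n,
                            uint32_t branch_addr, uint32_t target_addr)
{
    if (target_addr <= branch_addr || target_addr - branch_addr > WINDOW)
        return false;   /* out of range: caller flushes and refetches */

    for (int i = 0; i < n; i++)   /* selective cancel, no pipeline flush */
        if (pipe[i].addr > branch_addr && pipe[i].addr < target_addr)
            pipe[i].canceled = true;
    return true;        /* sequential fetching simply continues */
}
```

The payoff is that already-fetched instructions at or beyond the target survive in the pipeline instead of being refetched after a flush.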

74 citations

Journal ArticleDOI
01 Nov 2007
TL;DR: The approach achieves high memory performance on GPUs by tiling the computation, thereby improving cache efficiency, and it is used to improve the performance of GPU-based sorting, fast Fourier transform, and dense matrix multiplication algorithms.
Abstract: We present cache-efficient algorithms for scientific computations using graphics processing units (GPUs). Our approach is based on mapping the nested loops in the numerical algorithms to the texture mapping hardware and efficiently utilizing GPU caches. This mapping exploits the inherent parallelism, pipelining and high memory bandwidth on GPUs. We further improve the performance of numerical algorithms by accounting for the same relative memory address accesses performed at data elements in nested loops. Based on the similarity of memory accesses performed at the data elements in the input array, we decompose the input arrays into sub-arrays with similar memory access patterns and execute on the sub-arrays for faster execution. Our approach achieves high memory performance on GPUs by tiling the computation and thereby improving the cache-efficiency. Overall, our formulation for GPU-based algorithms extends the current graphics runtime APIs without exposing the underlying hardware complexity to the programmer. This makes it possible to achieve portability and higher performance across different GPUs. We use this approach to improve the performance of GPU-based sorting, fast Fourier transform and dense matrix multiplication algorithms. We also compare our results with prior GPU-based and CPU-based implementations on high-end processors. In practice, we observe 2-10x improvement in performance.
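The cache-blocking idea behind the abstract's tiling can be shown with an ordinary tiled matrix multiplication. The CPU-side C sketch below only illustrates the locality principle; the paper's actual mapping onto GPU texture hardware is not reproduced here, and TILE is an assumed tuning parameter.

```c
#include <stddef.h>

#define TILE 32   /* assumed block size; tuned to the cache in practice */

/* Tiled (cache-blocked) matrix multiply C = A * B for n x n row-major
 * matrices: each TILE x TILE block of the operands is reused while it
 * is still cache-resident, the same locality the paper engineers on GPUs. */
void matmul_tiled(size_t n, const float *a, const float *b, float *c)
{
    for (size_t i = 0; i < n * n; i++)
        c[i] = 0.0f;

    for (size_t ii = 0; ii < n; ii += TILE)
        for (size_t kk = 0; kk < n; kk += TILE)
            for (size_t jj = 0; jj < n; jj += TILE)
                /* sweep one cache-sized block at a time */
                for (size_t i = ii; i < ii + TILE && i < n; i++)
                    for (size_t k = kk; k < kk + TILE && k < n; k++) {
                        float aik = a[i * n + k];
                        for (size_t j = jj; j < jj + TILE && j < n; j++)
                            c[i * n + j] += aik * b[k * n + j];
                    }
}
```

Without the blocking, the innermost reuse of A and B rows falls out of cache for large n; with it, each block is loaded once and reused TILE times.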

74 citations

Proceedings ArticleDOI
27 Feb 2006
TL;DR: An integrated, hardware/software co-designed CISC processor is proposed and analyzed, and a proposed x86 implementation with complexity similar to a two-wide superscalar processor is shown to provide performance equivalent to a conventional four-wide superscalar processor.
Abstract: An integrated, hardware/software co-designed CISC processor is proposed and analyzed. The objectives are high performance and reduced complexity. Although the x86 ISA is targeted, the overall approach is applicable to other CISC ISAs. To provide high performance on frequently executed code sequences, fully transparent dynamic translation software decomposes CISC superblocks into RISC-style micro-ops. Then, pairs of dependent micro-ops are reordered and fused into macro-ops held in a large, concealed code cache. The macro-ops are fetched from the code cache and processed throughout the pipeline as single units. Consequently, instruction level communication and management are reduced, and processor resources such as the issue buffer and register file ports are better utilized. Moreover, fused instructions lead naturally to pipelined instruction scheduling (issue) logic, and collapsed 3-1 ALUs can be used, resulting in much simplified result forwarding logic. Steady state performance is evaluated for the SPEC2000 benchmarks, and a proposed x86 implementation with complexity similar to a two-wide superscalar processor is shown to provide performance (instructions per cycle) that is equivalent to a conventional four-wide superscalar processor.
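A toy sketch of the pairing step may help: scan a block of RISC-style micro-ops and fuse each producer with a following dependent consumer, so the pair can travel the pipeline as one macro-op. The struct fields and the single greedy forward scan are simplifications assumed for illustration; the paper's translator applies its own reordering and fusing heuristics.

```c
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    int dst, src1, src2;  /* register numbers; -1 if unused */
    int fused_with;       /* index of the fused partner, or -1 */
} micro_op;

/* Greedily pair each micro-op with the first later micro-op that consumes
 * its result, so the pair can be handled as one macro-op downstream. The
 * scan stops at a redefinition of dst so the dependence stays valid. */
void fuse_pairs(micro_op *ops, size_t n)
{
    for (size_t i = 0; i < n; i++)
        ops[i].fused_with = -1;

    for (size_t i = 0; i < n; i++) {
        if (ops[i].fused_with != -1 || ops[i].dst < 0)
            continue;
        for (size_t j = i + 1; j < n; j++) {
            bool consumes = ops[j].src1 == ops[i].dst ||
                            ops[j].src2 == ops[i].dst;
            if (consumes && ops[j].fused_with == -1) {
                ops[i].fused_with = (int)j;   /* head of the macro-op */
                ops[j].fused_with = (int)i;   /* tail of the macro-op */
                break;
            }
            if (consumes || ops[j].dst == ops[i].dst)
                break;   /* taken by a fused op, or value redefined */
        }
    }
}
```

Handling a fused pair as one unit is what shrinks instruction-level bookkeeping: one issue-buffer slot, one scheduling decision, and a 3-1 collapsed ALU can consume both operations at once.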

74 citations

Patent
31 Oct 2003
TL;DR: A pipeline accelerator includes a bus and a plurality of pipeline units, each coupled to the bus and including at least one respective hardwired-pipeline circuit.
Abstract: A pipeline accelerator includes a bus and a plurality of pipeline units, each unit coupled to the bus and including at least one respective hardwired-pipeline circuit. By including a plurality of pipeline units in the pipeline accelerator, one can increase the accelerator's data-processing performance as compared to a single-pipeline-unit accelerator. Furthermore, by designing the pipeline units so that they communicate via a common bus, one can alter the number of pipeline units, and thus alter the configuration and functionality of the accelerator, by merely coupling or uncoupling pipeline units to or from the bus. This eliminates the need to design or redesign the pipeline-unit interfaces each time one alters one of the pipeline units or alters the number of pipeline units within the accelerator.
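The plug-and-play property described above can be mimicked in software: if every unit exposes the same bus-facing interface, adding or removing a unit never requires touching another unit's interface. The C sketch below is purely illustrative; the types, the fixed-size unit table, and the broadcast delivery are assumptions, not the patent's design.

```c
#include <stddef.h>

/* Every pipeline unit exposes the same bus-facing interface; the bus
 * neither knows nor cares how many units are attached. */
typedef struct {
    const char *name;
    void (*process)(const void *data, size_t len);  /* hardwired-pipeline stand-in */
} pipeline_unit;

typedef struct {
    pipeline_unit *units[8];   /* fixed table size, purely for the sketch */
    size_t         count;
} bus;

/* "Coupling" a unit to the accelerator is just attaching it to the bus;
 * no other unit's interface has to change. */
int bus_attach(bus *b, pipeline_unit *u)
{
    if (b->count >= 8)
        return -1;
    b->units[b->count++] = u;
    return 0;
}

/* The bus delivers each message to every attached unit. */
void bus_send(const bus *b, const void *data, size_t len)
{
    for (size_t i = 0; i < b->count; i++)
        b->units[i]->process(data, len);
}
```

Scaling the accelerator up or down is then a matter of calling bus_attach more or fewer times, mirroring the patent's point that pipeline-unit interfaces need not be redesigned when the unit count changes.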

74 citations


Network Information
Related Topics (5)
Cache
59.1K papers, 976.6K citations
86% related
Scalability
50.9K papers, 931.6K citations
85% related
Server
79.5K papers, 1.4M citations
82% related
Electronic circuit
114.2K papers, 971.5K citations
82% related
CMOS
81.3K papers, 1.1M citations
81% related
Performance Metrics
No. of papers in the topic in previous years

Year    Papers
2022    18
2021    1,066
2020    1,556
2019    1,793
2018    1,754
2017    1,548