Topic

Pipeline (computing)

About: Pipeline (computing) is a research topic. Over its lifetime, 26,760 publications have been published within this topic, receiving 204,305 citations. The topic is also known as: data pipeline & computational pipeline.


Papers
Proceedings ArticleDOI
07 Oct 1996
TL;DR: A new non-restoring square root algorithm is presented in which, at each iteration, the "non-restoring" treatment applies to the partial remainder rather than to each bit of the square root, and only one traditional adder/subtractor is required per iteration.
Abstract: We present a new non-restoring square root algorithm that is very efficient to implement. Unlike other square root algorithms, the algorithm presented here has the following features. First, the focus of the "non-restoring" is on the partial remainder, not on each bit of the square root, with each iteration. Second, it requires only one traditional adder/subtractor in each iteration, i.e., it does not require other hardware components such as seed generators, multipliers, or even multiplexors. Third, it generates the correct resulting value even in the last bit position. Fourth, based on the resulting value of the last bit, a precise remainder can be obtained immediately without any correction or addition operation. Finally, it can be implemented at a very fast clock rate because of the very simple operations at each iteration. We illustrate two VLSI implementations of the new algorithm. One is a fully pipelined high-performance implementation that can accept a new square-root instruction each clock cycle, with each pipeline stage requiring a minimum gate count. The other is a low-cost implementation that uses only a single adder/subtractor for iterative operation.
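As a concrete illustration of the iteration the abstract describes, here is a minimal C sketch of a non-restoring integer square root: the sign of the (never-restored) partial remainder selects a single add or subtract per step and yields the next root bit. The 32-bit radicand / 16-bit root widths are illustrative, and the final remainder fix-up uses the common textbook correction rather than the paper's correction-free last-bit technique.

```c
#include <stdint.h>
#include <stdio.h>

/* Non-restoring square root: the signed partial remainder is never
 * restored; its sign picks subtract (4q+1) or add (4q+3) and yields
 * the next root bit. One add/subtract per iteration, as in the paper. */
uint16_t isqrt_nonrestoring(uint32_t d, uint32_t *rem)
{
    int32_t  r = 0;   /* partial remainder, allowed to go negative */
    uint32_t q = 0;   /* partial square root */

    for (int i = 15; i >= 0; i--) {
        int32_t next = 4 * r + (int32_t)((d >> (2 * i)) & 3u);
        if (r >= 0)
            r = next - (int32_t)((q << 2) | 1u);   /* try subtracting 4q+1 */
        else
            r = next + (int32_t)((q << 2) | 3u);   /* compensate with 4q+3 */
        q = (q << 1) | (r >= 0 ? 1u : 0u);         /* root bit from the sign */
    }
    if (r < 0)                        /* textbook fix-up for the exact remainder */
        r += (int32_t)((q << 1) | 1u);
    if (rem) *rem = (uint32_t)r;
    return (uint16_t)q;
}

int main(void)
{
    uint32_t rem;
    uint16_t q = isqrt_nonrestoring(1000000u, &rem);
    printf("root=%u rem=%u\n", (unsigned)q, rem);   /* prints root=1000 rem=0 */
    return 0;
}
```

Each loop iteration maps naturally to one pipeline stage in the fully pipelined implementation, or to one pass through the single adder/subtractor in the low-cost one.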

74 citations

Patent
06 Jul 1998
TL;DR: When a branch instruction with a forward target within a predetermined range is predicted taken, the processor allows sequential fetching to continue and selectively cancels only the sequential instructions that are not part of the predicted instruction sequence (i.e., the instructions between the predicted-taken branch and the target instruction identified by the forward branch target address).
Abstract: A processor is configured to detect a branch instruction having a forward branch target address within a predetermined range of the branch instruction's fetch address. If the branch instruction is predicted taken, then instead of canceling subsequent instructions and fetching from the branch target address, the processor allows sequential fetching to continue and selectively cancels the sequential instructions that are not part of the predicted instruction sequence (i.e., the instructions between the predicted-taken branch instruction and the target instruction identified by the forward branch target address). Instructions within the predicted instruction sequence that may already have been fetched prior to predicting the branch taken may be retained within the pipeline of the processor, and yet subsequent instructions may be fetched.
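The fetch policy can be sketched in a few lines of C. Everything here (the WINDOW constant standing in for the "predetermined range", the structure fields, the cancel flag) is hypothetical scaffolding to show the selective-cancel idea, not the patent's actual logic.

```c
#include <stdbool.h>
#include <stdint.h>

#define WINDOW 64u   /* hypothetical stand-in for the "predetermined range" */

typedef struct {
    uint32_t addr;      /* fetch address of the instruction */
    bool     canceled;  /* squashed in place but left in the pipeline */
} fetched_insn;

/* On a predicted-taken branch: if the forward target is close enough,
 * keep fetching sequentially and just mark the skipped-over instructions
 * canceled; otherwise fall back to a conventional flush and refetch. */
bool handle_predicted_taken(fetched_insn *pipe, int n,
                            uint32_t branch_addr, uint32_t target_addr)
{
    if (target_addr <= branch_addr || target_addr - branch_addr > WINDOW)
        return false;   /* out of range: caller flushes and refetches */

    for (int i = 0; i < n; i++)   /* selective cancel, no pipeline flush */
        if (pipe[i].addr > branch_addr && pipe[i].addr < target_addr)
            pipe[i].canceled = true;
    return true;        /* sequential fetching simply continues */
}
```

The payoff is that already-fetched instructions at or beyond the target survive in the pipeline instead of being refetched after a flush.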

74 citations

Journal ArticleDOI
01 Nov 2007
TL;DR: The approach achieves high memory performance on GPUs by tiling the computation, thereby improving cache efficiency, and it is used to improve the performance of GPU-based sorting, fast Fourier transform, and dense matrix multiplication algorithms.
Abstract: We present cache-efficient algorithms for scientific computations using graphics processing units (GPUs). Our approach is based on mapping the nested loops in the numerical algorithms to the texture mapping hardware and efficiently utilizing GPU caches. This mapping exploits the inherent parallelism, pipelining and high memory bandwidth on GPUs. We further improve the performance of numerical algorithms by accounting for the same relative memory address accesses performed at data elements in nested loops. Based on the similarity of memory accesses performed at the data elements in the input array, we decompose the input arrays into sub-arrays with similar memory access patterns and execute on the sub-arrays for faster execution. Our approach achieves high memory performance on GPUs by tiling the computation and thereby improving the cache-efficiency. Overall, our formulation for GPU-based algorithms extends the current graphics runtime APIs without exposing the underlying hardware complexity to the programmer. This makes it possible to achieve portability and higher performance across different GPUs. We use this approach to improve the performance of GPU-based sorting, fast Fourier transform and dense matrix multiplication algorithms. We also compare our results with prior GPU-based and CPU-based implementations on high-end processors. In practice, we observe 2-10x improvement in performance.
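The cache-blocking idea behind the abstract's tiling can be shown with an ordinary tiled matrix multiplication. The CPU-side C sketch below only illustrates the locality principle; the paper's actual mapping onto GPU texture hardware is not reproduced here, and TILE is an assumed tuning parameter.

```c
#include <stddef.h>

#define TILE 32   /* assumed block size; tuned to the cache in practice */

/* Tiled (cache-blocked) matrix multiply C = A * B for n x n row-major
 * matrices: each TILE x TILE block of the operands is reused while it
 * is still cache-resident, the same locality the paper engineers on GPUs. */
void matmul_tiled(size_t n, const float *a, const float *b, float *c)
{
    for (size_t i = 0; i < n * n; i++)
        c[i] = 0.0f;

    for (size_t ii = 0; ii < n; ii += TILE)
        for (size_t kk = 0; kk < n; kk += TILE)
            for (size_t jj = 0; jj < n; jj += TILE)
                /* sweep one cache-sized block at a time */
                for (size_t i = ii; i < ii + TILE && i < n; i++)
                    for (size_t k = kk; k < kk + TILE && k < n; k++) {
                        float aik = a[i * n + k];
                        for (size_t j = jj; j < jj + TILE && j < n; j++)
                            c[i * n + j] += aik * b[k * n + j];
                    }
}
```

Without the blocking, the innermost reuse of A and B rows falls out of cache for large n; with it, each block is loaded once and reused TILE times.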

74 citations

Proceedings ArticleDOI
27 Feb 2006
TL;DR: An integrated, hardware/software co-designed CISC processor is proposed and analyzed, and a proposed x86 implementation with complexity similar to a two-wide superscalar processor is shown to provide performance equivalent to a conventional four-wide superscalar processor.
Abstract: An integrated, hardware/software co-designed CISC processor is proposed and analyzed. The objectives are high performance and reduced complexity. Although the x86 ISA is targeted, the overall approach is applicable to other CISC ISAs. To provide high performance on frequently executed code sequences, fully transparent dynamic translation software decomposes CISC superblocks into RISC-style micro-ops. Then, pairs of dependent micro-ops are reordered and fused into macro-ops held in a large, concealed code cache. The macro-ops are fetched from the code cache and processed throughout the pipeline as single units. Consequently, instruction level communication and management are reduced, and processor resources such as the issue buffer and register file ports are better utilized. Moreover, fused instructions lead naturally to pipelined instruction scheduling (issue) logic, and collapsed 3-1 ALUs can be used, resulting in much simplified result forwarding logic. Steady state performance is evaluated for the SPEC2000 benchmarks, and a proposed x86 implementation with complexity similar to a two-wide superscalar processor is shown to provide performance (instructions per cycle) that is equivalent to a conventional four-wide superscalar processor.
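A toy sketch of the pairing step may help: scan a block of RISC-style micro-ops and fuse each producer with a following dependent consumer, so the pair can travel the pipeline as one macro-op. The struct fields and the single greedy forward scan are simplifications assumed for illustration; the paper's translator applies its own reordering and fusing heuristics.

```c
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    int dst, src1, src2;  /* register numbers; -1 if unused */
    int fused_with;       /* index of the fused partner, or -1 */
} micro_op;

/* Greedily pair each micro-op with the first later micro-op that consumes
 * its result, so the pair can be handled as one macro-op downstream. The
 * scan stops at a redefinition of dst so the dependence stays valid. */
void fuse_pairs(micro_op *ops, size_t n)
{
    for (size_t i = 0; i < n; i++)
        ops[i].fused_with = -1;

    for (size_t i = 0; i < n; i++) {
        if (ops[i].fused_with != -1 || ops[i].dst < 0)
            continue;
        for (size_t j = i + 1; j < n; j++) {
            bool consumes = ops[j].src1 == ops[i].dst ||
                            ops[j].src2 == ops[i].dst;
            if (consumes && ops[j].fused_with == -1) {
                ops[i].fused_with = (int)j;   /* head of the macro-op */
                ops[j].fused_with = (int)i;   /* tail of the macro-op */
                break;
            }
            if (consumes || ops[j].dst == ops[i].dst)
                break;   /* taken by a fused op, or value redefined */
        }
    }
}
```

Handling a fused pair as one unit is what shrinks instruction-level bookkeeping: one issue-buffer slot, one scheduling decision, and a 3-1 collapsed ALU can consume both operations at once.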

74 citations

Patent
31 Oct 2003
TL;DR: A pipeline accelerator includes a bus and a plurality of pipeline units, each coupled to the bus and including at least one respective hardwired-pipeline circuit.
Abstract: A pipeline accelerator includes a bus and a plurality of pipeline units, each unit coupled to the bus and including at least one respective hardwired-pipeline circuit. By including a plurality of pipeline units in the pipeline accelerator, one can increase the accelerator's data-processing performance as compared to a single-pipeline-unit accelerator. Furthermore, by designing the pipeline units so that they communicate via a common bus, one can alter the number of pipeline units, and thus alter the configuration and functionality of the accelerator, by merely coupling or uncoupling pipeline units to or from the bus. This eliminates the need to design or redesign the pipeline-unit interfaces each time one alters one of the pipeline units or alters the number of pipeline units within the accelerator.
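The plug-and-play property described above can be mimicked in software: if every unit exposes the same bus-facing interface, adding or removing a unit never requires touching another unit's interface. The C sketch below is purely illustrative; the types, the fixed-size unit table, and the broadcast delivery are assumptions, not the patent's design.

```c
#include <stddef.h>

/* Every pipeline unit exposes the same bus-facing interface; the bus
 * neither knows nor cares how many units are attached. */
typedef struct {
    const char *name;
    void (*process)(const void *data, size_t len);  /* hardwired-pipeline stand-in */
} pipeline_unit;

typedef struct {
    pipeline_unit *units[8];   /* fixed table size, purely for the sketch */
    size_t         count;
} bus;

/* "Coupling" a unit to the accelerator is just attaching it to the bus;
 * no other unit's interface has to change. */
int bus_attach(bus *b, pipeline_unit *u)
{
    if (b->count >= 8)
        return -1;
    b->units[b->count++] = u;
    return 0;
}

/* The bus delivers each message to every attached unit. */
void bus_send(const bus *b, const void *data, size_t len)
{
    for (size_t i = 0; i < b->count; i++)
        b->units[i]->process(data, len);
}
```

Scaling the accelerator up or down is then a matter of calling bus_attach more or fewer times, mirroring the patent's point that pipeline-unit interfaces need not be redesigned when the unit count changes.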

74 citations


Network Information
Related Topics (5)
Cache
59.1K papers, 976.6K citations
86% related
Scalability
50.9K papers, 931.6K citations
85% related
Server
79.5K papers, 1.4M citations
82% related
Electronic circuit
114.2K papers, 971.5K citations
82% related
CMOS
81.3K papers, 1.1M citations
81% related
Performance Metrics
No. of papers in the topic in previous years

Year    Papers
2022    18
2021    1,066
2020    1,556
2019    1,793
2018    1,754
2017    1,548