
Showing papers on "Pipeline (computing) published in 2005"


01 Dec 2005
TL;DR: The pipeline architecture being developed to deal with the IR imaging data from WFCAM and VISTA is described, along with the primary issues in building an end-to-end system capable of robustly removing instrument and night-sky signatures, monitoring data quality and system integrity, providing astrometric and photometric calibration, and generating photon noise-limited images and astronomical catalogues.
Abstract: The UKIRT Wide Field Camera (WFCAM) on Mauna Kea and the VISTA IR mosaic camera at ESO, Paranal, with respectively 4 Rockwell 2kx2k and 16 Raytheon 2kx2k IR arrays on 4m-class telescopes, represent an enormous leap in deep IR survey capability. With combined nightly data-rates of typically 1TB, automated pipeline processing and data management requirements are paramount. Pipeline processing of IR data is far more technically challenging than for optical data. IR detectors are inherently more unstable, while the sky emission is over 100 times brighter than most objects of interest, and varies in a complex spatial and temporal manner. In this presentation we describe the pipeline architecture being developed to deal with the IR imaging data from WFCAM and VISTA, and discuss the primary issues involved in an end-to-end system capable of: robustly removing instrument and night sky signatures; monitoring data quality and system integrity; providing astrometric and photometric calibration; and generating photon noise-limited images and astronomical catalogues. Accompanying papers by Emerson et al. and Hambly et al. provide an overview of the project and a detailed description of the science archive aspects.

166 citations


Journal ArticleDOI
TL;DR: A high-performance and memory-efficient pipeline architecture which performs the one-level two-dimensional (2-D) discrete wavelet transform (DWT) in the 5/3 and 9/7 filters by cascading the three key components.
Abstract: In this paper, we propose a high-performance and memory-efficient pipeline architecture which performs the one-level two-dimensional (2-D) discrete wavelet transform (DWT) in the 5/3 and 9/7 filters. In general, the internal memory size of 2-D architecture highly depends on the pipeline registers of one-dimensional (1-D) DWT. Based on the lifting-based DWT algorithm, the primitive data path is modified and an efficient pipeline architecture is derived to shorten the data path. Accordingly, under the same arithmetic resources, the 1-D DWT pipeline architecture can operate at a higher processing speed (up to 200 MHz in 0.25-µm technology) than other pipelined architectures with direct implementation. The proposed 2-D DWT architecture is composed of two 1-D processors (column and row processors). Based on the modified algorithm, the row processor can partially execute each row-wise transform with only two column-processed data. Thus, the pipeline registers of 1-D architecture do not fully turn into the internal memory of 2-D DWT. For an N×M image, only 3.5N internal memory is required for the 5/3 filter, and 5.5N is required for the 9/7 filter to perform the one-level 2-D DWT decomposition with the critical path of one multiplier delay (i.e., N and M indicate the height and width of an image). The pipeline data path is regular and practicable. Finally, the proposed architecture implements the 5/3 and 9/7 filters by cascading the three key components.
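
The reversible 5/3 lifting steps that such an architecture pipelines are compact enough to sketch in software. Below is a minimal, non-pipelined one-level 1-D model (integer lifting with symmetric extension, as in JPEG2000); it illustrates only the arithmetic being scheduled, not the paper's register and memory organization.

```python
def dwt53(x):
    """One-level reversible 5/3 lifting DWT of an even-length signal.

    Predict: d[i] = x[2i+1] - floor((x[2i] + x[2i+2]) / 2)
    Update:  s[i] = x[2i]   + floor((d[i-1] + d[i] + 2) / 4)
    """
    n = len(x)
    assert n >= 2 and n % 2 == 0
    d = []  # highpass coefficients
    for i in range(n // 2):
        right = x[2 * i + 2] if 2 * i + 2 < n else x[n - 2]  # symmetric ext.
        d.append(x[2 * i + 1] - (x[2 * i] + right) // 2)
    s = []  # lowpass coefficients
    for i in range(n // 2):
        dl = d[i - 1] if i > 0 else d[0]                     # symmetric ext.
        s.append(x[2 * i] + (dl + d[i] + 2) // 4)
    return s, d

print(dwt53([10, 12, 14, 16, 18, 20, 22, 24]))  # smooth ramp -> d mostly 0
```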

147 citations


Proceedings ArticleDOI
24 Sep 2005
TL;DR: An infrastructure for rapidly generating RTL models of soft processors, as well as a methodology for measuring their area, performance, and power, are presented.
Abstract: As more embedded systems are built using FPGA platforms, there is an increasing need to support processors in FPGAs. One option is the soft processor, a programmable instruction processor implemented in the reconfigurable logic of the FPGA. Commercial soft processors have been widely deployed, and hence we are motivated to understand their microarchitecture. We must re-evaluate microarchitecture in the soft processor context because an FPGA platform is significantly different than an ASIC platform---for example, the relative speed of memory and logic is quite different in the two platforms, as is the area cost. In this paper we present an infrastructure for rapidly generating RTL models of soft processors, as well as a methodology for measuring their area, performance, and power. Using our automatically-generated soft processors we explore the microarchitecture trade-off space including: (i) hardware vs software multiplication support; (ii) shifter implementations; and (iii) pipeline depth, organization, and forwarding. For example, we find that a 3-stage pipeline has better wall-clock-time performance than deeper pipelines, despite lower clock frequency. We also compare our designs to Altera's NiosII commercial soft processor variations and find that our automatically generated designs span the design space while remaining very competitive.
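
The exploration loop itself is simple to picture. Here is a toy sketch of the kind of sweep such an infrastructure automates, with invented parameter lists and an invented analytical cost model standing in for actual RTL generation and FPGA measurement; the pipeline-depth trade-off only qualitatively mirrors the paper's 3-stage finding.

```python
import itertools

# Hypothetical parameter space; the real flow emits RTL per point and
# measures area, frequency, and power on the FPGA.
PIPE_DEPTHS = [2, 3, 4, 5, 7]
SHIFTERS = ["serial", "lut-based", "multiplier-based"]
MULTIPLY = ["software", "hardware"]

def estimate(depth, shifter, mult):
    # Toy figures of merit (illustrative only): deeper pipelines raise
    # clock frequency but also raise CPI through hazards and branches.
    fmax_mhz = 60 + 15 * depth
    cpi = 1.0 + 0.15 * depth
    area_luts = (900 + 120 * depth
                 + {"serial": 50, "lut-based": 200, "multiplier-based": 120}[shifter]
                 + (400 if mult == "hardware" else 0))
    mips = fmax_mhz / cpi
    return mips, area_luts

best = max(itertools.product(PIPE_DEPTHS, SHIFTERS, MULTIPLY),
           key=lambda cfg: estimate(*cfg)[0] / estimate(*cfg)[1])
print("best MIPS-per-LUT config:", best)
```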

103 citations


Journal ArticleDOI
TL;DR: High-speed field-programmable gate array (FPGA) implementations of an adaptive least mean square (LMS) filter with application in an electronic support measures (ESM) digital receiver, are presented.
Abstract: High-speed field-programmable gate array (FPGA) implementations of an adaptive least mean square (LMS) filter with application in an electronic support measures (ESM) digital receiver are presented. They employ "fine-grained" pipelining, i.e., pipelining within the processor, which results in an increased output latency when used in the recursive LMS system. The major challenge is therefore to maintain a low-latency output whilst increasing the number of pipeline stages in the filter for higher speeds. Using the delayed LMS (DLMS) algorithm, fine-grained pipelined FPGA implementations using both the direct form (DF) and the transposed form (TF) are considered and compared. It is shown that the direct-form LMS filter utilizes the FPGA resources more efficiently, thereby allowing a 120 MHz sampling rate.
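
The delayed-LMS idea is easy to state in software: the coefficient update uses an error and a data vector that are D samples old, which is exactly what lets D pipeline registers be inserted into the hardware loop. A minimal behavioral sketch, with illustrative step size, tap count, and delay:

```python
import numpy as np

def dlms(x, desired, taps=8, mu=0.01, delay=4):
    """Delayed LMS: w is updated with an error (and tap vector) that
    are `delay` samples old, tolerating `delay` pipeline stages."""
    w = np.zeros(taps)
    y, e = np.zeros(len(x)), np.zeros(len(x))
    X = np.zeros(taps)                # current tap delay line
    hist = [np.zeros(taps)] * delay   # delayed copies of the tap vector
    for n in range(len(x)):
        X = np.roll(X, 1); X[0] = x[n]
        y[n] = w @ X
        e[n] = desired[n] - y[n]
        if n >= delay:
            w = w + mu * e[n - delay] * hist[0]
        hist = hist[1:] + [X.copy()]
    return y, e, w

# Usage: identify an unknown 8-tap FIR channel; w converges toward h.
rng = np.random.default_rng(0)
h = rng.normal(size=8)
x = rng.normal(size=4000)
d = np.convolve(x, h)[:len(x)]
_, e, w = dlms(x, d, taps=8, mu=0.02, delay=4)
```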

96 citations



Journal ArticleDOI
01 May 2005
TL;DR: This paper describes the microarchitecture of a novel network search processor that provides both high execution throughput and balanced memory distribution by dividing the tree into subtrees and allocating each subtree separately, allowing searches to begin at any pipeline stage.
Abstract: Pipelined forwarding engines are used in core routers to meet speed demands. Tree-based searches are pipelined across a number of stages to achieve high throughput, but this results in unevenly distributed memory. To address this imbalance, conventional approaches use either complex dynamic memory allocation schemes or over-provision each of the pipeline stages. This paper describes the microarchitecture of a novel network search processor which provides both high execution throughput and balanced memory distribution by dividing the tree into subtrees and allocating each subtree separately, allowing searches to begin at any pipeline stage. The architecture is validated by implementing and simulating state-of-the-art solutions for IPv4 lookup, VPN forwarding and packet classification. The new pipeline scheme and memory allocator can provide searches with a memory allocation efficiency that is within 1% of non-pipelined schemes.
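
A toy model of the memory-balancing idea, under invented subtree shapes: mapping trie level straight to stage piles nodes into the deepest stages, while letting each subtree enter a circular pipeline at its own start stage spreads them out. The real allocator and search logic are considerably more involved.

```python
from collections import Counter
import random

K = 8  # pipeline stages

def naive_stage(level):
    return level  # classic mapping: trie level -> stage

def circular_stage(subtree_id, depth):
    # Each subtree enters the circular pipeline at its own start stage,
    # so its nodes land in stages (start + depth) mod K.
    return (subtree_id % K + depth) % K

random.seed(1)
naive, circ = Counter(), Counter()
for st in range(64):                      # 64 toy subtrees
    nodes_per_level = [random.randint(1, 2 ** min(d, 6)) for d in range(K)]
    for depth, count in enumerate(nodes_per_level):
        naive[naive_stage(depth)] += count
        circ[circular_stage(st, depth)] += count
print("naive per-stage memory:   ", [naive[s] for s in range(K)])
print("circular per-stage memory:", [circ[s] for s in range(K)])
```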

89 citations


Journal ArticleDOI
TL;DR: A pipelined analog-to-digital converter (ADC) architecture suitable for high-speed (150 MHz), Nyquist-rate A/D conversion is presented and an experimental prototype of the proposed ADC has been integrated in a 0.18-µm CMOS technology.
Abstract: A pipelined analog-to-digital converter (ADC) architecture suitable for high-speed (150 MHz), Nyquist-rate A/D conversion is presented. At the input of the converter, two parallel track-and-hold circuits are used to separately drive the sub-ADC of a 2.8-b first pipeline stage and the input to two time-interleaved residue generation paths. Beyond the first pipeline stage, each residue path includes a cascade of two 1.5-b pipeline stages followed by a 4-b "backend" folding ADC. The full-scale residue range at the output of the pipeline stages is half that of the converter input range in order to conserve power in the operational amplifiers used in each residue path. An experimental prototype of the proposed ADC has been integrated in a 0.18-µm CMOS technology and operates from a 1.8-V supply. At a sampling rate of 150 MSample/s, it achieves a peak SNDR of 45.4 dB for an input frequency of 80 MHz. The power dissipation is 71 mW.
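
The digital behavior of a generic 1.5-b/stage pipeline is worth a sketch: each stage makes a coarse ternary decision and passes an amplified residue, and the redundancy tolerates comparator offsets up to ±Vref/4. This is an idealized model, not the paper's dual track-and-hold, time-interleaved arrangement.

```python
def stage_15bit(v):
    """One 1.5-b pipeline stage on a signal normalized to [-1, 1]:
    coarse ternary decision, then residue = 2*v - d (gain of 2)."""
    if v > 0.25:
        return 1, 2 * v - 1
    if v < -0.25:
        return -1, 2 * v + 1
    return 0, 2 * v

def pipeline_adc(v, stages=10):
    code, weight = 0.0, 0.5
    for _ in range(stages):
        d, v = stage_15bit(v)
        code += d * weight      # digital reconstruction: sum d_i / 2^(i+1)
        weight /= 2
    return code                 # estimate of the input, error < 2**-stages

print(pipeline_adc(0.3123456))  # ~0.312
```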

87 citations


Proceedings ArticleDOI
Yoonjin Kim, Mary Kiemb, Chulsoo Park, Jinyong Jung, Kiyoung Choi
07 Mar 2005
TL;DR: A reconfigurable array architecture template and a design space exploration flow for domain-specific optimization are suggested, and experimental results show that this approach is much more efficient, in both performance and area, compared to existing reconfigurable array architectures.
Abstract: Coarse-grained reconfigurable architectures aim to achieve goals of both high performance and flexibility. However, existing reconfigurable array architectures require many resources without considering the specific application domain. Functional resources that take long latency and/or large area can be pipelined and/or shared among the processing elements. Therefore, the hardware cost and the delay can be effectively reduced without any performance degradation for some application domains. We suggest such a reconfigurable array architecture template and a design space exploration flow for domain-specific optimization. Experimental results show that our approach is much more efficient, in both performance and area, compared to existing reconfigurable architectures.

86 citations


Journal ArticleDOI
TL;DR: This work presents a hardware-efficient design increasing throughput for the AES algorithm using a high-speed parallel pipelined architecture and achieves a high throughput of 29.77 Gbps in encryption, whereas the highest throughput reported in the literature is 21.54 Gbps.

79 citations


Patent
27 May 2005
TL;DR: Branch predictions are generated concurrently for multiple branch-type instructions to provide the instruction flow for a high-bandwidth pipeline; the predictions are then supplied for further processing of the corresponding branch-type instructions.
Abstract: Concurrently branch predicting for multiple branch-type instructions satisfies demands of high performance environments. Concurrently branch predicting for multiple branch-type instructions provides the instruction flow for a high bandwidth pipeline utilized in advanced performance environments. Branch predictions are concurrently generated for multiple branch-type instructions. The concurrently generated branch predictions are then supplied for further processing of the corresponding branch-type instructions.

78 citations


Book
01 Jan 2005
TL;DR: This book develops a methodology of logical confidence for trusting logic, defines a sufficiently expressive logic, and builds up the pipeline, memory, and ring structures of logically determined (clockless) systems.
Abstract:
Preface
Acknowledgments
1 Trusting Logic
  1.1 Mathematicianless Enlivenment of Logic Expression
  1.2 Emulating the Mathematician
  1.3 Supplementing the Expressivity of Boolean Logic
  1.4 Defining a Sufficiently Expressive Logic
  1.5 The Logically Determined System
  1.6 Trusting the Logic: A Methodology of Logical Confidence
  1.7 Summary
  1.8 Exercises
2 A Sufficiently Expressive Logic
  2.1 Searching for a New Logic
  2.2 Deriving a 3 Value Logic
  2.3 Deriving a 2 Value Logic
  2.4 Compromising Logical Completeness
  2.5 Summary
3 The Structure of Logically Determined Systems
  3.1 The Cycle
  3.2 Basic Pipeline Structures
  3.3 Control Variables and Wavefront Steering
  3.4 The Logically Determined System
  3.5 Initialization
  3.6 Testing
  3.7 Summary
  3.8 Exercises
4 2NCL Combinational Expression
  4.1 Function Classification
  4.2 The Library of 2NCL Operators
  4.3 2NCL Combinational Expression
  4.4 Example 1: Binary Plus Trinary to Quaternary Adder
  4.5 Example 2: Logic Unit
  4.6 Example 3: Minterm Construction
  4.7 Example 4: A Binary Clipper
  4.8 Example 5: A Code Detector
  4.9 Completeness Sufficiency
  4.10 Greater Combinational Composition
  4.11 Directly Mapping Boolean Combinational Expressions
  4.12 Summary
  4.13 Exercises
5 Cycle Granularity
  5.1 Partitioning Combinational Expressions
  5.2 Partitioning the Data Path
  5.3 Two-dimensional Pipelining: Orthogonal Pipelining Across a Data Path
  5.4 2D Wavefront Behavior
  5.5 2D Pipelined Operations
  5.6 Summary
  5.7 Exercises
6 Memory Elements
  6.1 The Ring Register
  6.2 Complex Function Registers
  6.3 The Consume/Produce Register Structure
  6.4 The Register File
  6.5 Delay Pipeline Memory
  6.6 Delay Tower
  6.7 FIFO Tower
  6.8 Stack Tower
  6.9 Wrapper for Standard Memory Modules
  6.10 Exercises
7 State Machines
  7.1 Basic State Machine Structure
  7.2 Exercises
8 Busses and Networks
  8.1 The Bus
  8.2 A Fan-out Steering Tree
  8.3 Fan-in Steering Trees Do Not Work
  8.4 Arbitrated Steering Structures
  8.5 Concurrent Crossbar Network
  8.6 Exercises
9 Multi-value Numeric Design
  9.1 Numeric Representation
  9.2 A Quaternary ALU
  9.3 A Binary ALU
  9.4 Comparison
  9.5 Summary
  9.6 Exercises
10 The Shadow Model of Pipeline Behavior
  10.1 Pipeline Structure
  10.2 The Pipeline Simulation Model
  10.3 Delays Affecting Throughput
  10.4 The Shadow Model
  10.5 The Value of the Shadow Model
  10.6 Exercises
11 Pipeline Buffering
  11.1 Enhancing Throughput
  11.2 Buffering for Constant Rate Throughput
  11.3 Summary of Buffering
  11.4 Exercises
12 Ring Behavior
  12.1 The Pipeline Ring
  12.2 Wavefront-limited Ring Behavior
  12.3 The Cycle-to-Wavefront Ratio
  12.4 Ring Signal Behavior
13 Interacting Pipeline Structures
  13.1 Preliminaries
  13.2 Example 1: The Basics of a Two-pipeline Structure
  13.3 Example 2: A Wavefront Delay Structure
  13.4 Example 3: Reducing the Period of the Slowest Cycle
  13.5 Exercises
14 Complex Pipeline Structures
  14.1 Linear Feedback Shift Register Example
  14.2 Grafting Pipelines
  14.3 The LFSR with a Slow Cycle
  14.4 Summary
  14.5 Exercises
Appendix A: Logically Determined Wavefront Flow
  A.1 Synchronization
  A.2 Wavefronts and Bubbles
  A.3 Wavefront Propagation
  A.4 Extended Simulation of Wavefront Flow
  A.5 Wavefront and Bubble Behavior in a System
Appendix B: Playing with 2NCL
  B.1 The SR Flip-flop Implementations
  B.2 Initialization
  B.3 Auto-produce and Auto-consume
Appendix C: Pipeline Simulation
References
Index

Proceedings ArticleDOI
03 Jan 2005
TL;DR: The MD5 designs presented in this paper are the fastest published FPGA-based architectures at the time of writing.
Abstract: Hardware implementation aspects of the MD5 hash algorithm are discussed in this paper. A general architecture for MD5 is proposed and several implementations are presented. An extensive study of effects of pipelining on delay, area requirements and throughput is performed, and finally certain architectures are recommended and compared to other published MD5 designs. The designs were implemented on a Xilinx Virtex-II XC2V4000-6 FPGA and a throughput of 586 Mbps was achieved with logic requirements of only 647 slices and 2 BlockRAMs. Methods to increase the throughput to gigabit-level were also studied and an implementation of parallel MD5 blocks achieving a throughput of over 5.8 Gbps was introduced. To the authors' knowledge, the MD5 designs presented in this paper are the fastest published FPGA-based architectures at the time of writing.
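
The pipelining study makes more sense against the algorithm's structure: MD5's 64 steps within one block are strictly sequential, so a deep pipeline pays off only across independent blocks or streams (hence the parallel-blocks design). Below is a plain software reference model of that round structure, checked against hashlib; this is the standard algorithm, not the paper's architecture.

```python
import hashlib, math, struct

def rotl32(x, n):
    return ((x << n) | (x >> (32 - n))) & 0xFFFFFFFF

def md5(data: bytes) -> str:
    s = ([7, 12, 17, 22] * 4 + [5, 9, 14, 20] * 4
         + [4, 11, 16, 23] * 4 + [6, 10, 15, 21] * 4)
    K = [int(abs(math.sin(i + 1)) * 2 ** 32) & 0xFFFFFFFF for i in range(64)]
    a0, b0, c0, d0 = 0x67452301, 0xEFCDAB89, 0x98BADCFE, 0x10325476
    msg = data + b"\x80"
    msg += b"\x00" * ((56 - len(msg)) % 64) + struct.pack("<Q", 8 * len(data))
    for off in range(0, len(msg), 64):
        M = struct.unpack("<16I", msg[off:off + 64])
        A, B, C, D = a0, b0, c0, d0
        for i in range(64):          # 64 strictly dependent steps
            if i < 16:   F, g = (B & C) | (~B & D), i
            elif i < 32: F, g = (D & B) | (~D & C), (5 * i + 1) % 16
            elif i < 48: F, g = B ^ C ^ D, (3 * i + 5) % 16
            else:        F, g = C ^ (B | ~D), (7 * i) % 16
            F = (F + A + K[i] + M[g]) & 0xFFFFFFFF
            A, D, C = D, C, B
            B = (B + rotl32(F, s[i])) & 0xFFFFFFFF
        a0, b0 = (a0 + A) & 0xFFFFFFFF, (b0 + B) & 0xFFFFFFFF
        c0, d0 = (c0 + C) & 0xFFFFFFFF, (d0 + D) & 0xFFFFFFFF
    return struct.pack("<4I", a0, b0, c0, d0).hex()

assert md5(b"abc") == hashlib.md5(b"abc").hexdigest()
```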

Proceedings ArticleDOI
23 May 2005
TL;DR: A hybrid task pipelining scheme is first presented to greatly reduce the internal memory size and bandwidth, and appropriate degrees of parallelism for each pipeline task are also proposed.
Abstract: The most critical issue of an H.264/AVC decoder is the system architecture design with balanced pipelining schedules and proper degrees of parallelism. In this paper, a hybrid task pipelining scheme is first presented to greatly reduce the internal memory size and bandwidth. Block-level, macroblock-level, and macroblock/frame-level pipelining schedules are arranged for CAVLD/IQ/IT/INTRA_PRED, INTER_PRED, and DEBLOCK, respectively. Appropriate degrees of parallelism for each pipeline task are also proposed. Moreover, efficient modules are contributed. The CAVLD unit smoothly decodes the bitstream into symbols without bubble cycles. The INTER_PRED unit highly exploits the data reuse between interpolation windows of neighboring blocks to save 60% of external memory bandwidth. The DEBLOCK unit doubles the processing capability of our previous work with only 35.3% of logic gate count overhead. The proposed baseline profile decoder architecture can support up to 2048×1024 30 fps videos with 217 K logic gates, 10 KB SRAMs, and 528.9 MB/s bus bandwidth when operating at 120 MHz.

Patent
Nicholas P. Wilt
09 Feb 2005
TL;DR: In this article, a system and methods for implementing histogram computation into the rasterization pipeline of a 3D graphics system is described, where statistical histogram data may be generated for input data of any kind or retrieved from any source that may be specified in a 2D array or specified in an immediate fashion to specialized data processing hardware.
Abstract: A system and methods for implementing histogram computation, for example, into the rasterization pipeline of a 3-D graphics system, are provided. With the histogram computation mechanism, statistical histogram data may be generated for input data of any kind or retrieved from any source that may be specified in a 2-D array or specified in an immediate fashion to specialized data processing hardware. Depending on the nature of the input data, the data may be filtered before passing the data to data processing hardware for further processing. The data processing hardware may then apply an additional function to the input data set before calculation of the histogram data. Then, at some point, the data processing hardware may apply a function to the data to map the derived data to a real-valued function that can then be quantized to a histogram element in the range specified from zero to the number of histogram elements minus one. The corresponding element in this histogram is then incremented according to the data received as it passes through the graphics processor. Advantageously, relatively expensive host computing resources are conserved, and developers are insulated from the tedious details required of implementing histogram computation from the ground up each time it becomes desirable to compute histogram data in connection with an application.
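
The core loop the patent describes reduces, in software terms, to mapping each (optionally transformed) value to a bin index in [0, bins-1] and incrementing that element. A hypothetical sketch with invented parameter names:

```python
def histogram(values, vmin, vmax, bins, f=lambda v: v):
    """Apply a mapping function, quantize to a bin index in
    [0, bins-1], and increment that histogram element."""
    h = [0] * bins
    scale = bins / float(vmax - vmin)
    for v in values:
        r = f(v)                      # optional derived/mapped value
        i = int((r - vmin) * scale)   # quantize to a histogram element
        i = max(0, min(bins - 1, i))  # clamp to the declared range
        h[i] += 1
    return h

print(histogram([0.1, 0.5, 0.52, 0.9], vmin=0.0, vmax=1.0, bins=4))
# [1, 0, 2, 1]
```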

Proceedings ArticleDOI
07 Nov 2005
TL;DR: The goal of this paper is to further reduce simulation time for architecture design space exploration by finding similarity between benchmarks and program inputs at the level of samples, and shows that this provides approximately the same accuracy as the SimPoint sampling approach while reducing the number of simulated instructions by a factor of 1.5.
Abstract: Modern architecture research relies heavily on detailed pipeline simulation. Simulating the full execution of an industry standard benchmark can take weeks to complete. Simulating the full execution of the whole benchmark suite for one architecture configuration can take months. To address this issue researchers have examined using targeted sampling based on phase behavior to significantly reduce the simulation time of each program in the benchmark suite. However, even with this sampling approach, simulating the full benchmark suite across a large range of architecture designs can take days to weeks to complete. The goal of this paper is to further reduce simulation time for architecture design space exploration. We reduce simulation time by finding similarity between benchmarks and program inputs at the level of samples (100M instructions of execution). This allows us to use a representative sample of execution from one benchmark to accurately represent a sample of execution of other benchmarks and inputs. The end result of our analysis is a small number of sample points of execution. These are selected across the whole benchmark suite in order to accurately represent the complete simulation of the whole benchmark suite for design space exploration. We show that this provides approximately the same accuracy as the SimPoint sampling approach while reducing the number of simulated instructions by a factor of 1.5.
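
The selection step plausibly reduces to clustering per-sample code signatures across the whole suite and simulating one representative per cluster. The sketch below uses invented random features and scikit-learn's k-means as a stand-in for the paper's actual similarity analysis:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Toy stand-in for per-sample code signatures (e.g., basic-block
# vectors) from 100M-instruction samples across many benchmarks/inputs.
samples = rng.random((500, 32))
labels = [f"bench{i % 10}.sample{i}" for i in range(500)]

k = 12
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(samples)

representatives = []
for c in range(k):
    members = np.flatnonzero(km.labels_ == c)
    # medoid: the member closest to the cluster centroid
    d = np.linalg.norm(samples[members] - km.cluster_centers_[c], axis=1)
    representatives.append((labels[members[d.argmin()]], len(members)))

# Simulate only the representatives; weight each result by cluster size.
for name, weight in representatives:
    print(f"simulate {name:20s} weight={weight}")
```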

Journal ArticleDOI
TL;DR: A low-power, high-speed architecture which performs two-dimensional forward and inverse discrete wavelet transform (DWT) for the set of filters in JPEG2000 is proposed, using a line-based lifting scheme.
Abstract: A low-power, high-speed architecture which performs two-dimensional forward and inverse discrete wavelet transform (DWT) for the set of filters in JPEG2000 is proposed, using a line-based lifting scheme. It consists of one row processor and one column processor, each of which contains four sub-filters, and the time-multiplexed row processor operates in parallel with the column processor. Optimized shift-add operations are substituted for multiplications, and edge extension is implemented by an embedded circuit. The whole architecture, optimized through pipelining for higher speed and hardware utilization, has been demonstrated on an FPGA. Two pixels per clock cycle can be encoded at 100 MHz. The architecture can be used as a compact and independent IP core for JPEG2000 VLSI implementation and various real-time image/video applications.
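
The shift-add substitution is the most self-contained piece to illustrate: a filter constant is rounded to fixed point and the multiplication is decomposed into one shift-add per set bit (a real design would use canonic signed digit form to cut adders further). A hypothetical sketch:

```python
def to_shift_adds(coeff, frac_bits=8):
    """Decompose a fixed-point constant into powers of two so that
    y = coeff * x becomes shifts and adds (no multiplier)."""
    c = round(coeff * (1 << frac_bits))
    return [b for b in range(c.bit_length()) if (c >> b) & 1]

def mul_const(x, coeff, frac_bits=8):
    acc = 0
    for b in to_shift_adds(coeff, frac_bits):
        acc += x << b          # one adder per set bit
    return acc >> frac_bits    # drop the fractional scaling

# Example with the 9/7 lifting constant alpha ~ 1.586 in magnitude
# (the sign would be handled separately in hardware):
print(mul_const(1000, 1.586), 1000 * 1.586)  # 1585 vs 1586.0
```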

Journal Article
TL;DR: In this paper, an adaptive particle filter algorithm is proposed for leak detection and location of gas pipelines, in which the variance of the artificial noise can be adjusted adaptively, which can improve the speed and accuracy.
Abstract: Leak detection and location play an important role in the management of a pipeline system. Some model-based methods, such as those based on the extended Kalman filter (EKF) or based on the strong tracking filter (STF), have been presented to solve this problem. But these methods need the nonlinear pipeline model to be linearized. Unfortunately, linearized transformations are only reliable if error propagation can be well approximated by a linear function, and this condition does not hold for a gas pipeline model. This will deteriorate the speed and accuracy of the detection and location. Particle filters are sequential Monte Carlo methods based on point mass (or “particle”) representations of probability densities, which can be applied to estimate states in nonlinear and non-Gaussian systems without linearization. Parameter estimation methods are widely used in fault detection and diagnosis (FDD), and have been applied to pipeline leak detection and location. However, the standard particle filter algorithm is not applicable to time-varying parameter estimation. To solve this problem, artificial noise has to be added to the parameters, but its variance is difficult to determine. In this paper, we propose an adaptive particle filter algorithm, in which the variance of the artificial noise can be adjusted adaptively. This method is applied to leak detection and location of gas pipelines. Simulation results show that fast and accurate leak detection and location can be achieved using this improved particle filter.
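
Here is a generic sketch of the adaptive-roughening idea, not the authors' pipeline model: particles track a parameter through artificial random-walk noise whose variance is scaled up when the innovation grows, so an abrupt change (a leak-like event in the invented scalar system below) can be re-acquired quickly.

```python
import numpy as np

rng = np.random.default_rng(2)

# Invented scalar system: y = theta * u + noise; theta jumps at t=100
# (a leak coefficient would play this role in a pipeline model).
T, N = 200, 500
theta_true = 1.0 + 0.5 * (np.arange(T) > 100)
u = rng.uniform(1.0, 2.0, T)
y = theta_true * u + rng.normal(0, 0.05, T)

theta = rng.normal(1.0, 0.2, N)        # particles for the parameter
w = np.full(N, 1.0 / N)
sigma_meas, est = 0.05, []
for t in range(T):
    innov = y[t] - np.mean(theta) * u[t]
    # adaptive artificial noise: larger when the innovation is large
    q = 0.002 + 0.05 * min(abs(innov), 1.0)
    theta = theta + rng.normal(0, q, N)  # roughening step
    like = np.exp(-0.5 * ((y[t] - theta * u[t]) / sigma_meas) ** 2)
    w = w * like
    w /= w.sum()
    est.append(np.sum(w * theta))
    if 1.0 / np.sum(w ** 2) < N / 2:     # resample when degenerate
        idx = rng.choice(N, N, p=w)
        theta, w = theta[idx], np.full(N, 1.0 / N)

print(est[90], est[-1])  # ~1.0 before the change, ~1.5 after
```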

Journal ArticleDOI
12 Jun 2005
TL;DR: A novel program transformation technique to exploit parallel and pipelined computing power of modern network processors is presented and results show that the method provides impressive speed up for the commonly used NPF IPv4 forwarding and IP forwarding benchmarks.
Abstract: Modern network processors employ parallel processing engines (PEs) to keep up with explosive internet packet processing demands. Most network processors further allow processing engines to be organized in a pipelined fashion to enable higher processing throughput and flexibility. In this paper, we present a novel program transformation technique to exploit the parallel and pipelined computing power of modern network processors. Our proposed method automatically partitions a sequential packet processing application into coordinated pipelined parallel subtasks which can be naturally mapped to contemporary high-performance network processors. Our transformation technique ensures that packet processing tasks are balanced among pipeline stages and that data transmission between pipeline stages is minimized. We have implemented the proposed transformation method in an auto-partitioning C compiler product for Intel Network Processors. Experimental results show that our method provides impressive speedup for the commonly used NPF IPv4 forwarding and IP forwarding benchmarks. For a 9-stage pipeline, our auto-partitioning C compiler obtained more than 4X speedup for the IPv4 forwarding PPS and the IP forwarding PPS (for both IPv4 and IPv6 traffic).
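
The balancing objective can be shown with a toy partitioner: split an ordered task list into contiguous pipeline stages whose costs stay near the average. The product compiler additionally minimizes the state transmitted between stages, which this sketch ignores; the task names and costs are invented.

```python
def partition(tasks, stages):
    """Greedy balanced partition of ordered (name, cost) tasks into
    `stages` contiguous groups near the ideal average stage cost."""
    total = sum(c for _, c in tasks)
    target = total / stages
    out, cur, cur_cost, remaining = [], [], 0, stages
    for i, (name, cost) in enumerate(tasks):
        cur.append(name)
        cur_cost += cost
        left = len(tasks) - i - 1
        if cur_cost >= target and remaining > 1 and left >= remaining - 1:
            out.append(cur)
            cur, cur_cost = [], 0
            remaining -= 1
    out.append(cur)
    return out

tasks = [("parse", 30), ("lookup", 120), ("classify", 60),
         ("meter", 20), ("modify", 50), ("queue", 40)]
print(partition(tasks, 3))
# [['parse', 'lookup'], ['classify', 'meter', 'modify'], ['queue']]
```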

Proceedings ArticleDOI
22 Aug 2005
TL;DR: Compared to previous schemes, this paper shows that SDP is the only scheme that scales well in all five scalability requirements, achieving scalability in throughput by simultaneously pipelining at the data-structure level and the hardware level.
Abstract: A truly scalable IP-lookup scheme must address five challenges of scalability, namely: routing-table size, lookup throughput, implementation cost, power dissipation, and routing-table update cost. Though several IP-lookup schemes have been proposed in the past, none of them does well in all five scalability requirements. Previous schemes pipeline tries by mapping trie levels to pipeline stages. We make the fundamental observation that because this mapping is static and oblivious of the prefix distribution, the schemes do not scale well when worst-case prefix distributions are considered. This paper is the first to meet all five requirements in the worst case. We propose scalable dynamic pipelining (SDP) which includes three key innovations: (1) We map trie nodes to pipeline stages based on the node height. Because the node height is directly determined by the prefix distribution, the node height succinctly provides sufficient information about the distribution. Our mapping enables us to prove a worst-case per-stage memory bound which is significantly tighter than those of previous schemes. (2) We exploit our mapping to propose a novel scheme for incremental route-updates. In our scheme a route-update requires exactly and only one write dispatched into the pipeline. This route-update cost is obviously the optimum and our scheme achieves the optimum in the worst case. (3) We achieve scalability in throughput by simultaneously pipelining at the data-structure level and the hardware level. SDP naturally scales in power and implementation cost. We not only present a theoretical analysis but also evaluate SDP and a number of previous schemes using detailed hardware simulation. Compared to previous schemes, we show that SDP is the only scheme that scales well in all five requirements.
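
Innovation (1) is sketchable in a few lines: compute each trie node's height (longest distance down to a leaf) and place the node in the stage given by that height, so stage occupancy follows the prefix distribution directly. A toy 1-bit trie, with the route-update machinery omitted:

```python
class Node:
    def __init__(self, left=None, right=None, prefix=None):
        self.left, self.right, self.prefix = left, right, prefix

def height(n, table):
    """Height = longest path from this node down to a leaf. SDP places
    a node of height h in stage h, which bounds per-stage memory by
    the prefix distribution itself rather than by trie depth."""
    if n is None:
        return -1
    h = 1 + max(height(n.left, table), height(n.right, table))
    table.setdefault(h, []).append(n)
    return h

# Toy 1-bit trie for prefixes 0*, 10*, 110*
t = Node(left=Node(prefix="0*"),
         right=Node(left=Node(prefix="10*"),
                    right=Node(left=Node(prefix="110*"))))
stages = {}
height(t, stages)
for h in sorted(stages):
    print(f"stage {h}: {len(stages[h])} node(s)")
```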

Proceedings ArticleDOI
12 Feb 2005
TL;DR: This paper examines the realistic benefits and limits of clock-gating in current generation high-performance processors, examines additional opportunities to avoid unnecessary clocking in real workload executions, and evaluates the power reduction benefits of two newly invented schemes called transparent pipeline clock-gating and elastic pipeline clock-gating.
Abstract: Clock-gating has been introduced as the primary means of dynamic power management in recent high-end commercial microprocessors. The temperature drop resulting from active power reduction can result in additional leakage power savings in future processors. In this paper we first examine the realistic benefits and limits of clock-gating in current generation high-performance processors (e.g. of the POWER4™ or POWER5™ class). We then look beyond classical clock-gating: we examine additional opportunities to avoid unnecessary clocking in real workload executions. In particular, we examine the power reduction benefits of a couple of newly invented schemes called transparent pipeline clock-gating and elastic pipeline clock-gating. Based on our experiences with current designs, we try to bound the practical limits of clock gating efficiency in future microprocessors.

Journal ArticleDOI
TL;DR: A dynamic model for a novel self-drive pipeline robot or "pig," which obtains its power from the kinetic energy of fluid flow in a pipe via a turbine and a reverse-traverse screw mechanism is presented.
Abstract: This paper presents a dynamic model for a novel self-drive pipeline robot or "pig," which obtains its power from the kinetic energy of fluid flow in a pipe via a turbine and a reverse-traverse screw mechanism. The new robot is designed to move both against and with the flowing fluid, which makes it different from conventional "pigs", which can only move with the flowing fluid. This bidirectional capability makes it very valuable to many industries, especially the oil and gas industries. Based on the model, the dynamic behavior of the new robot under different conditions has been analyzed in detail. In order to verify the validity of the dynamic model, a prototype machine and pipe-loop test rig was built, and experimental data obtained compared well with the theoretical analyses. Both the theoretical and experimental results validated the practicability of this novel robot structure. Furthermore, detailed analysis has been carried out, and the conclusions that have been drawn provide basic design principles for this new pipeline robot, and will assist in the aim of optimizing details of its design.


Proceedings ArticleDOI
29 Jun 2005
TL;DR: A discrete optimization model based on a linear programming formulation is presented as an alternative to the cascade of classifiers implemented in many language processing systems and it is shown that it performs better than a pipeline-based system.
Abstract: We present a discrete optimization model based on a linear programming formulation as an alternative to the cascade of classifiers implemented in many language processing systems. Since NLP tasks are correlated with one another, sequential processing does not guarantee optimal solutions. We apply our model in an NLG application and show that it performs better than a pipeline-based system.
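
A two-decision toy shows why a cascade can be suboptimal when tasks are correlated: committing greedily to the first classifier's best label can lock out the globally best joint assignment, which the linear-programming formulation recovers. The scores below are invented, and brute force stands in for the LP solver.

```python
from itertools import product

# Invented scores for two correlated NLP decisions; higher is better.
score1 = {"A": 0.6, "B": 0.4}
score2 = {("A", "x"): 0.1, ("A", "y"): 0.2,
          ("B", "x"): 0.9, ("B", "y"): 0.1}

# Pipeline/cascade: commit to the best first decision, then the second.
d1 = max(score1, key=score1.get)
d2 = max(("x", "y"), key=lambda v: score2[(d1, v)])
print("cascade:", (d1, d2), score1[d1] + score2[(d1, d2)])  # ('A','y') 0.8

# Joint optimization (what the ILP encodes) searches both at once.
best = max(product(score1, ("x", "y")),
           key=lambda p: score1[p[0]] + score2[p])
print("joint:  ", best, score1[best[0]] + score2[best])     # ('B','x') 1.3
```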

Journal ArticleDOI
01 May 2005
TL;DR: A hardware-based dynamic optimizer that continuously optimizes an application's instruction stream is evaluated in the context of a contemporary microarchitecture running current workloads; the evaluation reveals that the optimizer can directly execute 33% of instructions, resolve 29% of mispredicted branches, and generate addresses for 76% of memory operations.
Abstract: This paper presents a hardware-based dynamic optimizer that continuously optimizes an application's instruction stream. In continuous optimization, dataflow optimizations are performed using simple, table-based hardware placed in the rename stage of the processor pipeline. The continuous optimizer reduces dataflow height by performing constant propagation, reassociation, redundant load elimination, store forwarding, and silent store removal. To enhance the impact of the optimizations, the optimizer integrates values generated by the execution units back into the optimization process. Continuous optimization allows instructions with input values known at optimization time to be executed in the optimizer, leaving less work for the out-of-order portion of the pipeline. Continuous optimization can detect branch mispredictions earlier and thus reduce the misprediction penalty. In this paper, we present a detailed description of a hardware optimizer and evaluate it in the context of a contemporary microarchitecture running current workloads. Our analysis of SPECint, SPECfp, and mediabench workloads reveals that a hardware optimizer can directly execute 33% of instructions, resolve 29% of mispredicted branches, and generate addresses for 76% of memory operations. These positive effects combine to provide speedups in the range 0.99 to 1.27.
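
The flavor of table-based constant propagation at rename is easy to sketch: keep a register-to-constant map, execute instructions whose inputs are all known inside the optimizer, and invalidate destinations otherwise. This toy three-operand stream is far simpler than the paper's integrated optimizer:

```python
def optimize(stream):
    """Toy rename-stage constant table: fold instructions whose inputs
    are known constants so they never occupy the out-of-order core.
    The real hardware also performs reassociation, store forwarding,
    and silent-store removal."""
    const = {}       # reg -> known constant value
    remaining = []
    for op, dst, a, b in stream:
        if op == "li":                        # load-immediate: value known
            const[dst] = a
        elif op == "add" and a in const and b in const:
            const[dst] = const[a] + const[b]  # executed in the optimizer
        else:
            const.pop(dst, None)              # dst now unknown
            remaining.append((op, dst, a, b))
    return const, remaining

stream = [("li", "r1", 4, None), ("li", "r2", 6, None),
          ("add", "r3", "r1", "r2"),          # folds to r3 = 10
          ("load", "r4", "r3", None),         # must still execute
          ("add", "r5", "r4", "r1")]          # r4 unknown -> not folded
print(optimize(stream))
```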

Journal ArticleDOI
27 May 2005-Science
TL;DR: The strategy of denying growing tumors a blood supply continues to show clinical promise as new and improved drugs move through the pipeline.
Abstract: The strategy of denying growing tumors a blood supply continues to show clinical promise as new and improved drugs move through the pipeline.

Journal ArticleDOI
TL;DR: Novel approaches for pipelining of parallel nested multiplexer loops and decision feedback equalizers (DFEs) based on look-ahead techniques are presented, including a look-ahead approach that can guarantee improvement in performance either in the form of pipelining or parallelism.
Abstract: This paper presents novel approaches for pipelining of parallel nested multiplexer loops and decision feedback equalizers (DFEs) based on look-ahead techniques. Look-ahead techniques can be applied to pipeline a nested multiplexer loop in many possible ways. It is shown that not all the look-ahead approaches necessarily result in improved performance. A novel look-ahead approach is identified, which can guarantee improvement in performance either in the form of pipelining or parallelism. The proposed technique is demonstrated and applied to design multiplexer-loop-based DFEs with throughput in the range of 3.125-10 Gb/s.
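
The basic look-ahead rewrite on a multiplexer loop can be shown directly: substituting the recursion into itself stretches the feedback from one sample to two, creating room for a pipeline register. The sketch below only checks that the two forms agree; choosing among the many possible rewrites, which is the paper's point, is not modeled.

```python
def dfe_loop(c, a, y0=0):
    """Original recursion: y[n] = a[n] if c[n] else y[n-1].
    The 1-sample feedback loop limits the achievable clock period."""
    y, prev = [], y0
    for cn, an in zip(c, a):
        prev = an if cn else prev
        y.append(prev)
    return y

def dfe_lookahead(c, a, y0=0):
    """One-step look-ahead: y[n] = a[n] if c[n] else
    (a[n-1] if c[n-1] else y[n-2]), so the feedback spans 2 samples
    and one pipeline register can be retimed into the loop."""
    y = [a[0] if c[0] else y0]
    prev2 = y0                       # holds y[n-2]
    for n in range(1, len(c)):
        yn = a[n] if c[n] else (a[n - 1] if c[n - 1] else prev2)
        prev2 = y[n - 1]
        y.append(yn)
    return y

c = [0, 1, 0, 0, 1, 0]
a = [9, 8, 7, 6, 5, 4]
assert dfe_loop(c, a) == dfe_lookahead(c, a)
```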

Journal ArticleDOI
TL;DR: In this paper, a symbol-based block-interleaved pipelining (BIP) architecture is proposed for the maximum a posteriori probability (MAP) decoder in turbo-decoders.
Abstract: Iterative decoders such as turbo decoders have become integral components of modern broadband communication systems because of their ability to provide substantial coding gains. A key computational kernel in iterative decoders is the maximum a posteriori probability (MAP) decoder. The MAP decoder is recursive and complex, which makes high-speed implementations extremely difficult to realize. In this paper, we present block-interleaved pipelining (BIP) as a new high-throughput technique for MAP decoders. An area-efficient symbol-based BIP MAP decoder architecture is proposed by combining BIP with the well-known look-ahead computation. These architectures are compared with conventional parallel architectures in terms of speed-up, memory and logic complexity, and area. Compared to the parallel architecture, the BIP architecture provides the same speed-up with a reduction in logic complexity by a factor of M, where M is the level of parallelism. The symbol-based architecture provides a speed-up in the range from 1 to 2 with a logic complexity that grows exponentially with M and a state metric storage requirement that is reduced by a factor of M as compared to a parallel architecture. The symbol-based BIP architecture provides speed-up in the range M to 2M with an exponentially higher logic complexity and a reduced memory complexity compared to a parallel architecture. These high-throughput architectures are synthesized in a 2.5-V 0.25-µm CMOS standard cell library and post-layout simulations are conducted. For turbo decoder applications, we find that the BIP architecture provides a throughput gain of 1.96 at the cost of 63% area overhead. For turbo equalizer applications, the symbol-based BIP architecture enables us to achieve a throughput gain of 1.79 with an area savings of 25%.
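
A back-of-the-envelope cycle count conveys the BIP idea: a recursive MAP step on an L-deep pipelined unit stalls L cycles per step for a single block, while round-robin interleaving of M independent blocks fills the pipeline. The counts below are an idealized schedule model, not the decoder datapath.

```python
def cycles_single(M, K, L):
    """One block at a time: each of the K recursion steps must wait
    for the L-deep pipeline to produce its predecessor."""
    return M * K * L

def cycles_bip(M, K, L):
    """Round-robin over M independent blocks: within a round the M
    steps are independent and can be issued back-to-back."""
    return K * max(M, L) + (L - 1)

M, K, L = 4, 1000, 4
print(cycles_single(M, K, L))  # 16000
print(cycles_bip(M, K, L))     # 4003 -> ~4x higher throughput
```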

Patent
24 Aug 2005
TL;DR: A processor (1700) has a pipeline (1710, 1736, 1740) including a fetch stage (1710) and an execute stage (1870), a first storing circuit (aGHR 2130) associated with the fetch stage and operable to store a history of actual branches, and a second storing circuit (wGHR 2140) storing a pattern of predicted branches, with the execute stage coupled back to the first storing circuit.
Abstract: A processor (1700) for processing instructions has a pipeline (1710, 1736, 1740) including a fetch stage (1710) and an execute stage (1870), a first storing circuit (aGHR 2130) associated with said fetch stage (1710) and operable to store a history of actual branches, and a second storing circuit (wGHR 2140) associated with said fetch stage (1710) and operable to store a pattern of predicted branches, said second storing circuit (wGHR 2140) coupled to said first storing circuit (aGHR 2130), said execute stage (1870) coupled back to said first storing circuit (aGHR 2130). Other processors, wireless communications devices, systems, circuits, devices, branch prediction processes and methods of operation, processes of manufacture, and articles of manufacture are also disclosed and claimed.

Journal ArticleDOI
TL;DR: A parallel and pipelined architecture is proposed for the sub-pixel interpolation filter in an H.264/AVC-conformant HDTV decoder, offering 60% reduced memory data transfer and a dedicated buffer organization that converts tree-structured block-size reading into fixable and sequential processing.
Abstract: In this paper, we propose a parallel and pipelined architecture for the sub-pixel interpolation filter in an H.264/AVC-conformant HDTV decoder. To use the bus bandwidth efficiently, we bring forward three memory access optimization strategies to avoid redundant data transfer and improve data bus utilization. To improve the processing throughput, we use a parallel, multi-stage pipeline architecture that conducts data transmission and interpolation filtering in parallel. Moreover, to balance the tradeoff between the memory accessing scheme and the sub-pixel interpolation processing granularity, we devise a dedicated buffer organization to convert tree-structured block-size reading into fixable and sequential processing. Compared to traditional designs, our scheme offers 60% reduced memory data transfer. Clocking at 66 MHz, our design can support 1280×720 @ 30 Hz processing throughput. The proposed design is suitable for low-cost, real-time applications and can easily be applied in system-on-chip design.
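
The arithmetic kernel being pipelined here is the standard H.264 luma half-sample filter; the paper's contribution is the memory access and buffering around it. For reference:

```python
def halfpel(px):
    """H.264 luma half-sample interpolation: 6-tap FIR
    (1, -5, 20, 20, -5, 1) with rounding, then clip to 8 bits.
    `px` holds the six neighboring integer-position samples
    E,F,G,H,I,J around the half position (row- or column-wise)."""
    E, F, G, H, I, J = px
    b = E - 5 * F + 20 * G + 20 * H - 5 * I + J
    return max(0, min(255, (b + 16) >> 5))

print(halfpel([10, 20, 30, 40, 50, 60]))  # 35: about halfway between G=30, H=40
```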

Journal ArticleDOI
TL;DR: The most area-efficient pipeline FFT processors for WLAN MIMO-OFDM systems are presented in this paper, where it is shown that although the R2³SDF architecture is the most area-efficient approach for implementing pipeline FFT processors, RrMDC architectures are more efficient when more than three channels are used.
Abstract: The most area-efficient pipeline FFT processors for WLAN MIMO-OFDM systems are presented. It is shown that although the R2³SDF architecture is the most area-efficient approach for implementing pipeline FFT processors, RrMDC architectures are more efficient in MIMO-OFDM systems when more than three channels are used.