
Showing papers on "Pipeline (computing) published in 2005"


01 Dec 2005
TL;DR: The pipeline architecture being developed to deal with the IR imaging data from WFCAM and VISTA is described, along with the primary issues in building an end-to-end system capable of robustly removing instrument and night-sky signatures, monitoring data quality and system integrity, providing astrometric and photometric calibration, and generating photon noise-limited images and astronomical catalogues.
Abstract: The UKIRT Wide Field Camera (WFCAM) on Mauna Kea and the VISTA IR mosaic camera at ESO, Paranal, with respectively 4 Rockwell 2kx2k and 16 Raytheon 2kx2k IR arrays on 4m-class telescopes, represent an enormous leap in deep IR survey capability. With combined nightly data-rates of typically 1TB, automated pipeline processing and data management requirements are paramount. Pipeline processing of IR data is far more technically challenging than for optical data. IR detectors are inherently more unstable, while the sky emission is over 100 times brighter than most objects of interest, and varies in a complex spatial and temporal manner. In this presentation we describe the pipeline architecture being developed to deal with the IR imaging data from WFCAM and VISTA, and discuss the primary issues involved in an end-to-end system capable of: robustly removing instrument and night sky signatures; monitoring data quality and system integrity; providing astrometric and photometric calibration; and generating photon noise-limited images and astronomical catalogues. Accompanying papers by Emerson et al. and Hambly et al. provide an overview of the project and a detailed description of the science archive aspects.

166 citations


Journal ArticleDOI
TL;DR: A high-performance and memory-efficient pipeline architecture which performs the one-level two-dimensional (2-D) discrete wavelet transform (DWT) in the 5/3 and 9/7 filters by cascading the three key components.
Abstract: In this paper, we propose a high-performance and memory-efficient pipeline architecture which performs the one-level two-dimensional (2-D) discrete wavelet transform (DWT) in the 5/3 and 9/7 filters. In general, the internal memory size of 2-D architecture highly depends on the pipeline registers of one-dimensional (1-D) DWT. Based on the lifting-based DWT algorithm, the primitive data path is modified and an efficient pipeline architecture is derived to shorten the data path. Accordingly, under the same arithmetic resources, the 1-D DWT pipeline architecture can operate at a higher processing speed (up to 200 MHz in 0.25-µm technology) than other pipelined architectures with direct implementation. The proposed 2-D DWT architecture is composed of two 1-D processors (column and row processors). Based on the modified algorithm, the row processor can partially execute each row-wise transform with only two column-processed data. Thus, the pipeline registers of 1-D architecture do not fully turn into the internal memory of 2-D DWT. For an N×M image, only 3.5N internal memory is required for the 5/3 filter, and 5.5N is required for the 9/7 filter to perform the one-level 2-D DWT decomposition with the critical path of one multiplier delay (i.e., N and M indicate the height and width of an image). The pipeline data path is regular and practicable. Finally, the proposed architecture implements the 5/3 and 9/7 filters by cascading the three key components.
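
The reversible 5/3 lifting steps that such an architecture pipelines are compact enough to sketch in software. Below is a minimal, non-pipelined one-level 1-D model (integer lifting with symmetric extension, as in JPEG2000); it illustrates only the arithmetic being scheduled, not the paper's register and memory organization.

```python
def dwt53(x):
    """One-level reversible 5/3 lifting DWT of an even-length signal.

    Predict: d[i] = x[2i+1] - floor((x[2i] + x[2i+2]) / 2)
    Update:  s[i] = x[2i]   + floor((d[i-1] + d[i] + 2) / 4)
    """
    n = len(x)
    assert n >= 2 and n % 2 == 0
    d = []  # highpass coefficients
    for i in range(n // 2):
        right = x[2 * i + 2] if 2 * i + 2 < n else x[n - 2]  # symmetric ext.
        d.append(x[2 * i + 1] - (x[2 * i] + right) // 2)
    s = []  # lowpass coefficients
    for i in range(n // 2):
        dl = d[i - 1] if i > 0 else d[0]                     # symmetric ext.
        s.append(x[2 * i] + (dl + d[i] + 2) // 4)
    return s, d

print(dwt53([10, 12, 14, 16, 18, 20, 22, 24]))  # smooth ramp -> d mostly 0
```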

147 citations


Proceedings ArticleDOI
24 Sep 2005
TL;DR: An infrastructure for rapidly generating RTL models of soft processors, as well as a methodology for measuring their area, performance, and power, are presented.
Abstract: As more embedded systems are built using FPGA platforms, there is an increasing need to support processors in FPGAs. One option is the soft processor, a programmable instruction processor implemented in the reconfigurable logic of the FPGA. Commercial soft processors have been widely deployed, and hence we are motivated to understand their microarchitecture. We must re-evaluate microarchitecture in the soft processor context because an FPGA platform is significantly different than an ASIC platform---for example, the relative speed of memory and logic is quite different in the two platforms, as is the area cost. In this paper we present an infrastructure for rapidly generating RTL models of soft processors, as well as a methodology for measuring their area, performance, and power. Using our automatically-generated soft processors we explore the microarchitecture trade-off space including: (i) hardware vs software multiplication support; (ii) shifter implementations; and (iii) pipeline depth, organization, and forwarding. For example, we find that a 3-stage pipeline has better wall-clock-time performance than deeper pipelines, despite lower clock frequency. We also compare our designs to Altera's NiosII commercial soft processor variations and find that our automatically generated designs span the design space while remaining very competitive.
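
The exploration loop itself is simple to picture. Here is a toy sketch of the kind of sweep such an infrastructure automates, with invented parameter lists and an invented analytical cost model standing in for actual RTL generation and FPGA measurement; the pipeline-depth trade-off only qualitatively mirrors the paper's 3-stage finding.

```python
import itertools

# Hypothetical parameter space; the real flow emits RTL per point and
# measures area, frequency, and power on the FPGA.
PIPE_DEPTHS = [2, 3, 4, 5, 7]
SHIFTERS = ["serial", "lut-based", "multiplier-based"]
MULTIPLY = ["software", "hardware"]

def estimate(depth, shifter, mult):
    # Toy figures of merit (illustrative only): deeper pipelines raise
    # clock frequency but also raise CPI through hazards and branches.
    fmax_mhz = 60 + 15 * depth
    cpi = 1.0 + 0.15 * depth
    area_luts = (900 + 120 * depth
                 + {"serial": 50, "lut-based": 200, "multiplier-based": 120}[shifter]
                 + (400 if mult == "hardware" else 0))
    mips = fmax_mhz / cpi
    return mips, area_luts

best = max(itertools.product(PIPE_DEPTHS, SHIFTERS, MULTIPLY),
           key=lambda cfg: estimate(*cfg)[0] / estimate(*cfg)[1])
print("best MIPS-per-LUT config:", best)
```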

103 citations


Journal ArticleDOI
TL;DR: High-speed field-programmable gate array (FPGA) implementations of an adaptive least mean square (LMS) filter with application in an electronic support measures (ESM) digital receiver, are presented.
Abstract: High-speed field-programmable gate array (FPGA) implementations of an adaptive least mean square (LMS) filter with application in an electronic support measures (ESM) digital receiver are presented. They employ "fine-grained" pipelining, i.e., pipelining within the processor, which results in an increased output latency when used in the recursive LMS system. The major challenge is therefore to maintain a low-latency output whilst increasing the number of pipeline stages in the filter for higher speeds. Using the delayed LMS (DLMS) algorithm, fine-grained pipelined FPGA implementations using both the direct form (DF) and the transposed form (TF) are considered and compared. It is shown that the direct-form LMS filter utilizes the FPGA resources more efficiently, thereby allowing a 120 MHz sampling rate.
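
The delayed-LMS idea is easy to state in software: the coefficient update uses an error and a data vector that are D samples old, which is exactly what lets D pipeline registers be inserted into the hardware loop. A minimal behavioral sketch, with illustrative step size, tap count, and delay:

```python
import numpy as np

def dlms(x, desired, taps=8, mu=0.01, delay=4):
    """Delayed LMS: w is updated with an error (and tap vector) that
    are `delay` samples old, tolerating `delay` pipeline stages."""
    w = np.zeros(taps)
    y, e = np.zeros(len(x)), np.zeros(len(x))
    X = np.zeros(taps)                # current tap delay line
    hist = [np.zeros(taps)] * delay   # delayed copies of the tap vector
    for n in range(len(x)):
        X = np.roll(X, 1); X[0] = x[n]
        y[n] = w @ X
        e[n] = desired[n] - y[n]
        if n >= delay:
            w = w + mu * e[n - delay] * hist[0]
        hist = hist[1:] + [X.copy()]
    return y, e, w

# Usage: identify an unknown 8-tap FIR channel; w converges toward h.
rng = np.random.default_rng(0)
h = rng.normal(size=8)
x = rng.normal(size=4000)
d = np.convolve(x, h)[:len(x)]
_, e, w = dlms(x, d, taps=8, mu=0.02, delay=4)
```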

96 citations



Journal ArticleDOI
01 May 2005
TL;DR: This paper describes the microarchitecture of a novel network search processor that provides both high execution throughput and balanced memory distribution by dividing the tree into subtrees and allocating each subtree separately, allowing searches to begin at any pipeline stage.
Abstract: Pipelined forwarding engines are used in core routers to meet speed demands. Tree-based searches are pipelined across a number of stages to achieve high throughput, but this results in unevenly distributed memory. To address this imbalance, conventional approaches use either complex dynamic memory allocation schemes or over-provision each of the pipeline stages. This paper describes the microarchitecture of a novel network search processor which provides both high execution throughput and balanced memory distribution by dividing the tree into subtrees and allocating each subtree separately, allowing searches to begin at any pipeline stage. The architecture is validated by implementing and simulating state-of-the-art solutions for IPv4 lookup, VPN forwarding and packet classification. The new pipeline scheme and memory allocator can provide searches with a memory allocation efficiency that is within 1% of non-pipelined schemes.
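
A toy model of the memory-balancing idea, under invented subtree shapes: mapping trie level straight to stage piles nodes into the deepest stages, while letting each subtree enter a circular pipeline at its own start stage spreads them out. The real allocator and search logic are considerably more involved.

```python
from collections import Counter
import random

K = 8  # pipeline stages

def naive_stage(level):
    return level  # classic mapping: trie level -> stage

def circular_stage(subtree_id, depth):
    # Each subtree enters the circular pipeline at its own start stage,
    # so its nodes land in stages (start + depth) mod K.
    return (subtree_id % K + depth) % K

random.seed(1)
naive, circ = Counter(), Counter()
for st in range(64):                      # 64 toy subtrees
    nodes_per_level = [random.randint(1, 2 ** min(d, 6)) for d in range(K)]
    for depth, count in enumerate(nodes_per_level):
        naive[naive_stage(depth)] += count
        circ[circular_stage(st, depth)] += count
print("naive per-stage memory:   ", [naive[s] for s in range(K)])
print("circular per-stage memory:", [circ[s] for s in range(K)])
```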

89 citations


Journal ArticleDOI
TL;DR: A pipelined analog-to-digital converter (ADC) architecture suitable for high-speed (150 MHz), Nyquist-rate A/D conversion is presented and an experimental prototype of the proposed ADC has been integrated in a 0.18-µm CMOS technology.
Abstract: A pipelined analog-to-digital converter (ADC) architecture suitable for high-speed (150 MHz), Nyquist-rate A/D conversion is presented. At the input of the converter, two parallel track-and-hold circuits are used to separately drive the sub-ADC of a 2.8-b first pipeline stage and the input to two time-interleaved residue generation paths. Beyond the first pipeline stage, each residue path includes a cascade of two 1.5-b pipeline stages followed by a 4-b "backend" folding ADC. The full-scale residue range at the output of the pipeline stages is half that of the converter input range in order to conserve power in the operational amplifiers used in each residue path. An experimental prototype of the proposed ADC has been integrated in a 0.18-µm CMOS technology and operates from a 1.8-V supply. At a sampling rate of 150 MSample/s, it achieves a peak SNDR of 45.4 dB for an input frequency of 80 MHz. The power dissipation is 71 mW.
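
The digital behavior of a generic 1.5-b/stage pipeline is worth a sketch: each stage makes a coarse ternary decision and passes an amplified residue, and the redundancy tolerates comparator offsets up to ±Vref/4. This is an idealized model, not the paper's dual track-and-hold, time-interleaved arrangement.

```python
def stage_15bit(v):
    """One 1.5-b pipeline stage on a signal normalized to [-1, 1]:
    coarse ternary decision, then residue = 2*v - d (gain of 2)."""
    if v > 0.25:
        return 1, 2 * v - 1
    if v < -0.25:
        return -1, 2 * v + 1
    return 0, 2 * v

def pipeline_adc(v, stages=10):
    code, weight = 0.0, 0.5
    for _ in range(stages):
        d, v = stage_15bit(v)
        code += d * weight      # digital reconstruction: sum d_i / 2^(i+1)
        weight /= 2
    return code                 # estimate of the input, error < 2**-stages

print(pipeline_adc(0.3123456))  # ~0.312
```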

87 citations


Proceedings ArticleDOI
Yoonjin Kim, Mary Kiemb, Chulsoo Park, Jinyong Jung, Kiyoung Choi
07 Mar 2005
TL;DR: A reconfigurable array architecture template and a design space exploration flow for domain-specific optimization are suggested, and experimental results show that this approach is much more efficient, in both performance and area, compared to existing reconfigurable array architectures.
Abstract: Coarse-grained reconfigurable architectures aim to achieve goals of both high performance and flexibility. However, existing reconfigurable array architectures require many resources without considering the specific application domain. Functional resources that take long latency and/or large area can be pipelined and/or shared among the processing elements. Therefore, the hardware cost and the delay can be effectively reduced without any performance degradation for some application domains. We suggest such a reconfigurable array architecture template and a design space exploration flow for domain-specific optimization. Experimental results show that our approach is much more efficient, in both performance and area, compared to existing reconfigurable architectures.

86 citations


Journal ArticleDOI
TL;DR: This work presents a hardware-efficient design increasing throughput for the AES algorithm using a high-speed parallel pipelined architecture and achieves a high throughput of 29.77 Gbps in encryption, whereas the highest throughput reported in the literature is 21.54 Gbps.

79 citations


Patent
27 May 2005
TL;DR: Branch predictions are generated concurrently for multiple branch-type instructions to provide the instruction flow for a high-bandwidth pipeline; the predictions are then supplied for further processing of the corresponding branch-type instructions.
Abstract: Concurrently branch predicting for multiple branch-type instructions satisfies demands of high performance environments. Concurrently branch predicting for multiple branch-type instructions provides the instruction flow for a high bandwidth pipeline utilized in advanced performance environments. Branch predictions are concurrently generated for multiple branch-type instructions. The concurrently generated branch predictions are then supplied for further processing of the corresponding branch-type instructions.

78 citations


Book
01 Jan 2005
TL;DR: This book develops a methodology of logical confidence for trusting logic, defines a sufficiently expressive logic, and builds up the pipeline, memory, and ring structures of logically determined (clockless) systems.
Abstract:
Preface
Acknowledgments
1 Trusting Logic
  1.1 Mathematicianless Enlivenment of Logic Expression
  1.2 Emulating the Mathematician
  1.3 Supplementing the Expressivity of Boolean Logic
  1.4 Defining a Sufficiently Expressive Logic
  1.5 The Logically Determined System
  1.6 Trusting the Logic: A Methodology of Logical Confidence
  1.7 Summary
  1.8 Exercises
2 A Sufficiently Expressive Logic
  2.1 Searching for a New Logic
  2.2 Deriving a 3 Value Logic
  2.3 Deriving a 2 Value Logic
  2.4 Compromising Logical Completeness
  2.5 Summary
3 The Structure of Logically Determined Systems
  3.1 The Cycle
  3.2 Basic Pipeline Structures
  3.3 Control Variables and Wavefront Steering
  3.4 The Logically Determined System
  3.5 Initialization
  3.6 Testing
  3.7 Summary
  3.8 Exercises
4 2NCL Combinational Expression
  4.1 Function Classification
  4.2 The Library of 2NCL Operators
  4.3 2NCL Combinational Expression
  4.4 Example 1: Binary Plus Trinary to Quaternary Adder
  4.5 Example 2: Logic Unit
  4.6 Example 3: Minterm Construction
  4.7 Example 4: A Binary Clipper
  4.8 Example 5: A Code Detector
  4.9 Completeness Sufficiency
  4.10 Greater Combinational Composition
  4.11 Directly Mapping Boolean Combinational Expressions
  4.12 Summary
  4.13 Exercises
5 Cycle Granularity
  5.1 Partitioning Combinational Expressions
  5.2 Partitioning the Data Path
  5.3 Two-dimensional Pipelining: Orthogonal Pipelining Across a Data Path
  5.4 2D Wavefront Behavior
  5.5 2D Pipelined Operations
  5.6 Summary
  5.7 Exercises
6 Memory Elements
  6.1 The Ring Register
  6.2 Complex Function Registers
  6.3 The Consume/Produce Register Structure
  6.4 The Register File
  6.5 Delay Pipeline Memory
  6.6 Delay Tower
  6.7 FIFO Tower
  6.8 Stack Tower
  6.9 Wrapper for Standard Memory Modules
  6.10 Exercises
7 State Machines
  7.1 Basic State Machine Structure
  7.2 Exercises
8 Busses and Networks
  8.1 The Bus
  8.2 A Fan-out Steering Tree
  8.3 Fan-in Steering Trees Do Not Work
  8.4 Arbitrated Steering Structures
  8.5 Concurrent Crossbar Network
  8.6 Exercises
9 Multi-value Numeric Design
  9.1 Numeric Representation
  9.2 A Quaternary ALU
  9.3 A Binary ALU
  9.4 Comparison
  9.5 Summary
  9.6 Exercises
10 The Shadow Model of Pipeline Behavior
  10.1 Pipeline Structure
  10.2 The Pipeline Simulation Model
  10.3 Delays Affecting Throughput
  10.4 The Shadow Model
  10.5 The Value of the Shadow Model
  10.6 Exercises
11 Pipeline Buffering
  11.1 Enhancing Throughput
  11.2 Buffering for Constant Rate Throughput
  11.3 Summary of Buffering
  11.4 Exercises
12 Ring Behavior
  12.1 The Pipeline Ring
  12.2 Wavefront-limited Ring Behavior
  12.3 The Cycle-to-Wavefront Ratio
  12.4 Ring Signal Behavior
13 Interacting Pipeline Structures
  13.1 Preliminaries
  13.2 Example 1: The Basics of a Two-pipeline Structure
  13.3 Example 2: A Wavefront Delay Structure
  13.4 Example 3: Reducing the Period of the Slowest Cycle
  13.5 Exercises
14 Complex Pipeline Structures
  14.1 Linear Feedback Shift Register Example
  14.2 Grafting Pipelines
  14.3 The LFSR with a Slow Cycle
  14.4 Summary
  14.5 Exercises
Appendix A: Logically Determined Wavefront Flow
  A.1 Synchronization
  A.2 Wavefronts and Bubbles
  A.3 Wavefront Propagation
  A.4 Extended Simulation of Wavefront Flow
  A.5 Wavefront and Bubble Behavior in a System
Appendix B: Playing with 2NCL
  B.1 The SR Flip-flop Implementations
  B.2 Initialization
  B.3 Auto-produce and Auto-consume
Appendix C: Pipeline Simulation
References
Index

Proceedings ArticleDOI
03 Jan 2005
TL;DR: The MD5 designs presented in this paper are the fastest published FPGA-based architectures at the time of writing.
Abstract: Hardware implementation aspects of the MD5 hash algorithm are discussed in this paper. A general architecture for MD5 is proposed and several implementations are presented. An extensive study of effects of pipelining on delay, area requirements and throughput is performed, and finally certain architectures are recommended and compared to other published MD5 designs. The designs were implemented on a Xilinx Virtex-II XC2V4000-6 FPGA and a throughput of 586 Mbps was achieved with logic requirements of only 647 slices and 2 BlockRAMs. Methods to increase the throughput to gigabit-level were also studied and an implementation of parallel MD5 blocks achieving a throughput of over 5.8 Gbps was introduced. To the authors' knowledge, the MD5 designs presented in this paper are the fastest published FPGA-based architectures at the time of writing.
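
The pipelining study makes more sense against the algorithm's structure: MD5's 64 steps within one block are strictly sequential, so a deep pipeline pays off only across independent blocks or streams (hence the parallel-blocks design). Below is a plain software reference model of that round structure, checked against hashlib; this is the standard algorithm, not the paper's architecture.

```python
import hashlib, math, struct

def rotl32(x, n):
    return ((x << n) | (x >> (32 - n))) & 0xFFFFFFFF

def md5(data: bytes) -> str:
    s = ([7, 12, 17, 22] * 4 + [5, 9, 14, 20] * 4
         + [4, 11, 16, 23] * 4 + [6, 10, 15, 21] * 4)
    K = [int(abs(math.sin(i + 1)) * 2 ** 32) & 0xFFFFFFFF for i in range(64)]
    a0, b0, c0, d0 = 0x67452301, 0xEFCDAB89, 0x98BADCFE, 0x10325476
    msg = data + b"\x80"
    msg += b"\x00" * ((56 - len(msg)) % 64) + struct.pack("<Q", 8 * len(data))
    for off in range(0, len(msg), 64):
        M = struct.unpack("<16I", msg[off:off + 64])
        A, B, C, D = a0, b0, c0, d0
        for i in range(64):          # 64 strictly dependent steps
            if i < 16:   F, g = (B & C) | (~B & D), i
            elif i < 32: F, g = (D & B) | (~D & C), (5 * i + 1) % 16
            elif i < 48: F, g = B ^ C ^ D, (3 * i + 5) % 16
            else:        F, g = C ^ (B | ~D), (7 * i) % 16
            F = (F + A + K[i] + M[g]) & 0xFFFFFFFF
            A, D, C = D, C, B
            B = (B + rotl32(F, s[i])) & 0xFFFFFFFF
        a0, b0 = (a0 + A) & 0xFFFFFFFF, (b0 + B) & 0xFFFFFFFF
        c0, d0 = (c0 + C) & 0xFFFFFFFF, (d0 + D) & 0xFFFFFFFF
    return struct.pack("<4I", a0, b0, c0, d0).hex()

assert md5(b"abc") == hashlib.md5(b"abc").hexdigest()
```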

Proceedings ArticleDOI
23 May 2005
TL;DR: A hybrid task pipelining scheme is first presented to greatly reduce the internal memory size and bandwidth, and appropriate degrees of parallelism for each pipeline task are also proposed.
Abstract: The most critical issue of an H.264/AVC decoder is the system architecture design with balanced pipelining schedules and proper degrees of parallelism. In this paper, a hybrid task pipelining scheme is first presented to greatly reduce the internal memory size and bandwidth. Block-level, macroblock-level, and macroblock/frame-level pipelining schedules are arranged for CAVLD/IQ/IT/INTRA_PRED, INTER_PRED, and DEBLOCK, respectively. Appropriate degrees of parallelism for each pipeline task are also proposed. Moreover, efficient modules are contributed. The CAVLD unit smoothly decodes the bitstream into symbols without bubble cycles. The INTER_PRED unit highly exploits the data reuse between interpolation windows of neighboring blocks to save 60% of external memory bandwidth. The DEBLOCK unit doubles the processing capability of our previous work with only 35.3% of logic gate count overhead. The proposed baseline profile decoder architecture can support up to 2048×1024 30 fps videos with 217 K logic gates, 10 KB SRAMs, and 528.9 MB/s bus bandwidth when operating at 120 MHz.

Patent
Nicholas P. Wilt
09 Feb 2005
TL;DR: In this article, a system and methods for implementing histogram computation into the rasterization pipeline of a 3D graphics system is described, where statistical histogram data may be generated for input data of any kind or retrieved from any source that may be specified in a 2D array or specified in an immediate fashion to specialized data processing hardware.
Abstract: A system and methods for implementing histogram computation, for example, into the rasterization pipeline of a 3-D graphics system, are provided. With the histogram computation mechanism, statistical histogram data may be generated for input data of any kind or retrieved from any source that may be specified in a 2-D array or specified in an immediate fashion to specialized data processing hardware. Depending on the nature of the input data, the data may be filtered before passing the data to data processing hardware for further processing. The data processing hardware may then apply an additional function to the input data set before calculation of the histogram data. Then, at some point, the data processing hardware may apply a function to the data to map the derived data to a real-valued function that can then be quantized to a histogram element in the range specified from zero to the number of histogram elements minus one. The corresponding element in this histogram is then incremented according to the data received as it passes through the graphics processor. Advantageously, relatively expensive host computing resources are conserved, and developers are insulated from the tedious details required of implementing histogram computation from the ground up each time it becomes desirable to compute histogram data in connection with an application.
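
The core loop the patent describes reduces, in software terms, to mapping each (optionally transformed) value to a bin index in [0, bins-1] and incrementing that element. A hypothetical sketch with invented parameter names:

```python
def histogram(values, vmin, vmax, bins, f=lambda v: v):
    """Apply a mapping function, quantize to a bin index in
    [0, bins-1], and increment that histogram element."""
    h = [0] * bins
    scale = bins / float(vmax - vmin)
    for v in values:
        r = f(v)                      # optional derived/mapped value
        i = int((r - vmin) * scale)   # quantize to a histogram element
        i = max(0, min(bins - 1, i))  # clamp to the declared range
        h[i] += 1
    return h

print(histogram([0.1, 0.5, 0.52, 0.9], vmin=0.0, vmax=1.0, bins=4))
# [1, 0, 2, 1]
```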

Proceedings ArticleDOI
07 Nov 2005
TL;DR: The goal of this paper is to further reduce simulation time for architecture design space exploration by finding similarity between benchmarks and program inputs at the level of samples, and shows that this provides approximately the same accuracy as the SimPoint sampling approach while reducing the number of simulated instructions by a factor of 1.5.
Abstract: Modern architecture research relies heavily on detailed pipeline simulation. Simulating the full execution of an industry standard benchmark can take weeks to complete. Simulating the full execution of the whole benchmark suite for one architecture configuration can take months. To address this issue researchers have examined using targeted sampling based on phase behavior to significantly reduce the simulation time of each program in the benchmark suite. However, even with this sampling approach, simulating the full benchmark suite across a large range of architecture designs can take days to weeks to complete. The goal of this paper is to further reduce simulation time for architecture design space exploration. We reduce simulation time by finding similarity between benchmarks and program inputs at the level of samples (100M instructions of execution). This allows us to use a representative sample of execution from one benchmark to accurately represent a sample of execution of other benchmarks and inputs. The end result of our analysis is a small number of sample points of execution. These are selected across the whole benchmark suite in order to accurately represent the complete simulation of the whole benchmark suite for design space exploration. We show that this provides approximately the same accuracy as the SimPoint sampling approach while reducing the number of simulated instructions by a factor of 1.5.
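
The selection step plausibly reduces to clustering per-sample code signatures across the whole suite and simulating one representative per cluster. The sketch below uses invented random features and scikit-learn's k-means as a stand-in for the paper's actual similarity analysis:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Toy stand-in for per-sample code signatures (e.g., basic-block
# vectors) from 100M-instruction samples across many benchmarks/inputs.
samples = rng.random((500, 32))
labels = [f"bench{i % 10}.sample{i}" for i in range(500)]

k = 12
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(samples)

representatives = []
for c in range(k):
    members = np.flatnonzero(km.labels_ == c)
    # medoid: the member closest to the cluster centroid
    d = np.linalg.norm(samples[members] - km.cluster_centers_[c], axis=1)
    representatives.append((labels[members[d.argmin()]], len(members)))

# Simulate only the representatives; weight each result by cluster size.
for name, weight in representatives:
    print(f"simulate {name:20s} weight={weight}")
```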

Journal ArticleDOI
TL;DR: A low-power, high-speed architecture which performs two-dimensional forward and inverse discrete wavelet transform (DWT) for the set of filters in JPEG2000 is proposed, using a line-based lifting scheme.
Abstract: A low-power, high-speed architecture which performs two-dimensional forward and inverse discrete wavelet transform (DWT) for the set of filters in JPEG2000 is proposed, using a line-based lifting scheme. It consists of one row processor and one column processor, each of which contains four sub-filters, and the time-multiplexed row processor operates in parallel with the column processor. Optimized shift-add operations are substituted for multiplications, and edge extension is implemented by an embedded circuit. The whole architecture, optimized through pipelining for higher speed and hardware utilization, has been demonstrated on an FPGA. Two pixels per clock cycle can be encoded at 100 MHz. The architecture can be used as a compact and independent IP core for JPEG2000 VLSI implementation and various real-time image/video applications.
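
The shift-add substitution is the most self-contained piece to illustrate: a filter constant is rounded to fixed point and the multiplication is decomposed into one shift-add per set bit (a real design would use canonic signed digit form to cut adders further). A hypothetical sketch:

```python
def to_shift_adds(coeff, frac_bits=8):
    """Decompose a fixed-point constant into powers of two so that
    y = coeff * x becomes shifts and adds (no multiplier)."""
    c = round(coeff * (1 << frac_bits))
    return [b for b in range(c.bit_length()) if (c >> b) & 1]

def mul_const(x, coeff, frac_bits=8):
    acc = 0
    for b in to_shift_adds(coeff, frac_bits):
        acc += x << b          # one adder per set bit
    return acc >> frac_bits    # drop the fractional scaling

# Example with the 9/7 lifting constant alpha ~ 1.586 in magnitude
# (the sign would be handled separately in hardware):
print(mul_const(1000, 1.586), 1000 * 1.586)  # 1585 vs 1586.0
```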

Journal Article
TL;DR: In this paper, an adaptive particle filter algorithm is proposed for leak detection and location of gas pipelines, in which the variance of the artificial noise can be adjusted adaptively, which can improve the speed and accuracy.
Abstract: Leak detection and location play an important role in the management of a pipeline system. Some model-based methods, such as those based on the extended Kalman filter (EKF) or based on the strong tracking filter (STF), have been presented to solve this problem. But these methods need the nonlinear pipeline model to be linearized. Unfortunately, linearized transformations are only reliable if error propagation can be well approximated by a linear function, and this condition does not hold for a gas pipeline model. This will deteriorate the speed and accuracy of the detection and location. Particle filters are sequential Monte Carlo methods based on point mass (or “particle”) representations of probability densities, which can be applied to estimate states in nonlinear and non-Gaussian systems without linearization. Parameter estimation methods are widely used in fault detection and diagnosis (FDD), and have been applied to pipeline leak detection and location. However, the standard particle filter algorithm is not applicable to time-varying parameter estimation. To solve this problem, artificial noise has to be added to the parameters, but its variance is difficult to determine. In this paper, we propose an adaptive particle filter algorithm, in which the variance of the artificial noise can be adjusted adaptively. This method is applied to leak detection and location of gas pipelines. Simulation results show that fast and accurate leak detection and location can be achieved using this improved particle filter.
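
Here is a generic sketch of the adaptive-roughening idea, not the authors' pipeline model: particles track a parameter through artificial random-walk noise whose variance is scaled up when the innovation grows, so an abrupt change (a leak-like event in the invented scalar system below) can be re-acquired quickly.

```python
import numpy as np

rng = np.random.default_rng(2)

# Invented scalar system: y = theta * u + noise; theta jumps at t=100
# (a leak coefficient would play this role in a pipeline model).
T, N = 200, 500
theta_true = 1.0 + 0.5 * (np.arange(T) > 100)
u = rng.uniform(1.0, 2.0, T)
y = theta_true * u + rng.normal(0, 0.05, T)

theta = rng.normal(1.0, 0.2, N)        # particles for the parameter
w = np.full(N, 1.0 / N)
sigma_meas, est = 0.05, []
for t in range(T):
    innov = y[t] - np.mean(theta) * u[t]
    # adaptive artificial noise: larger when the innovation is large
    q = 0.002 + 0.05 * min(abs(innov), 1.0)
    theta = theta + rng.normal(0, q, N)  # roughening step
    like = np.exp(-0.5 * ((y[t] - theta * u[t]) / sigma_meas) ** 2)
    w = w * like
    w /= w.sum()
    est.append(np.sum(w * theta))
    if 1.0 / np.sum(w ** 2) < N / 2:     # resample when degenerate
        idx = rng.choice(N, N, p=w)
        theta, w = theta[idx], np.full(N, 1.0 / N)

print(est[90], est[-1])  # ~1.0 before the change, ~1.5 after
```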

Journal ArticleDOI
12 Jun 2005
TL;DR: A novel program transformation technique to exploit parallel and pipelined computing power of modern network processors is presented and results show that the method provides impressive speed up for the commonly used NPF IPv4 forwarding and IP forwarding benchmarks.
Abstract: Modern network processors employ parallel processing engines (PEs) to keep up with explosive internet packet processing demands. Most network processors further allow processing engines to be organized in a pipelined fashion to enable higher processing throughput and flexibility. In this paper, we present a novel program transformation technique to exploit the parallel and pipelined computing power of modern network processors. Our proposed method automatically partitions a sequential packet processing application into coordinated pipelined parallel subtasks which can be naturally mapped to contemporary high-performance network processors. Our transformation technique ensures that packet processing tasks are balanced among pipeline stages and that data transmission between pipeline stages is minimized. We have implemented the proposed transformation method in an auto-partitioning C compiler product for Intel Network Processors. Experimental results show that our method provides impressive speedup for the commonly used NPF IPv4 forwarding and IP forwarding benchmarks. For a 9-stage pipeline, our auto-partitioning C compiler obtained more than 4X speedup for the IPv4 forwarding PPS and the IP forwarding PPS (for both IPv4 and IPv6 traffic).
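
The balancing objective can be shown with a toy partitioner: split an ordered task list into contiguous pipeline stages whose costs stay near the average. The product compiler additionally minimizes the state transmitted between stages, which this sketch ignores; the task names and costs are invented.

```python
def partition(tasks, stages):
    """Greedy balanced partition of ordered (name, cost) tasks into
    `stages` contiguous groups near the ideal average stage cost."""
    total = sum(c for _, c in tasks)
    target = total / stages
    out, cur, cur_cost, remaining = [], [], 0, stages
    for i, (name, cost) in enumerate(tasks):
        cur.append(name)
        cur_cost += cost
        left = len(tasks) - i - 1
        if cur_cost >= target and remaining > 1 and left >= remaining - 1:
            out.append(cur)
            cur, cur_cost = [], 0
            remaining -= 1
    out.append(cur)
    return out

tasks = [("parse", 30), ("lookup", 120), ("classify", 60),
         ("meter", 20), ("modify", 50), ("queue", 40)]
print(partition(tasks, 3))
# [['parse', 'lookup'], ['classify', 'meter', 'modify'], ['queue']]
```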

Proceedings ArticleDOI
22 Aug 2005
TL;DR: Compared to previous schemes, this paper shows that SDP is the only scheme that scales well in all five scalability requirements, achieving scalability in throughput by simultaneously pipelining at the data-structure level and the hardware level.
Abstract: A truly scalable IP-lookup scheme must address five challenges of scalability, namely: routing-table size, lookup throughput, implementation cost, power dissipation, and routing-table update cost. Though several IP-lookup schemes have been proposed in the past, none of them does well in all five scalability requirements. Previous schemes pipeline tries by mapping trie levels to pipeline stages. We make the fundamental observation that because this mapping is static and oblivious of the prefix distribution, the schemes do not scale well when worst-case prefix distributions are considered. This paper is the first to meet all five requirements in the worst case. We propose scalable dynamic pipelining (SDP) which includes three key innovations: (1) We map trie nodes to pipeline stages based on the node height. Because the node height is directly determined by the prefix distribution, the node height succinctly provides sufficient information about the distribution. Our mapping enables us to prove a worst-case per-stage memory bound which is significantly tighter than those of previous schemes. (2) We exploit our mapping to propose a novel scheme for incremental route-updates. In our scheme a route-update requires exactly and only one write dispatched into the pipeline. This route-update cost is obviously the optimum and our scheme achieves the optimum in the worst case. (3) We achieve scalability in throughput by simultaneously pipelining at the data-structure level and the hardware level. SDP naturally scales in power and implementation cost. We not only present a theoretical analysis but also evaluate SDP and a number of previous schemes using detailed hardware simulation. Compared to previous schemes, we show that SDP is the only scheme that scales well in all five requirements.
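
Innovation (1) is sketchable in a few lines: compute each trie node's height (longest distance down to a leaf) and place the node in the stage given by that height, so stage occupancy follows the prefix distribution directly. A toy 1-bit trie, with the route-update machinery omitted:

```python
class Node:
    def __init__(self, left=None, right=None, prefix=None):
        self.left, self.right, self.prefix = left, right, prefix

def height(n, table):
    """Height = longest path from this node down to a leaf. SDP places
    a node of height h in stage h, which bounds per-stage memory by
    the prefix distribution itself rather than by trie depth."""
    if n is None:
        return -1
    h = 1 + max(height(n.left, table), height(n.right, table))
    table.setdefault(h, []).append(n)
    return h

# Toy 1-bit trie for prefixes 0*, 10*, 110*
t = Node(left=Node(prefix="0*"),
         right=Node(left=Node(prefix="10*"),
                    right=Node(left=Node(prefix="110*"))))
stages = {}
height(t, stages)
for h in sorted(stages):
    print(f"stage {h}: {len(stages[h])} node(s)")
```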

Proceedings ArticleDOI
12 Feb 2005
TL;DR: This paper examines the realistic benefits and limits of clock-gating in current generation high-performance processors, examines additional opportunities to avoid unnecessary clocking in real workload executions, and evaluates the power reduction benefits of two newly invented schemes called transparent pipeline clock-gating and elastic pipeline clock-gating.
Abstract: Clock-gating has been introduced as the primary means of dynamic power management in recent high-end commercial microprocessors. The temperature drop resulting from active power reduction can result in additional leakage power savings in future processors. In this paper we first examine the realistic benefits and limits of clock-gating in current generation high-performance processors (e.g. of the POWER4™ or POWER5™ class). We then look beyond classical clock-gating: we examine additional opportunities to avoid unnecessary clocking in real workload executions. In particular, we examine the power reduction benefits of a couple of newly invented schemes called transparent pipeline clock-gating and elastic pipeline clock-gating. Based on our experiences with current designs, we try to bound the practical limits of clock gating efficiency in future microprocessors.

Journal ArticleDOI
TL;DR: A dynamic model for a novel self-drive pipeline robot or "pig," which obtains its power from the kinetic energy of fluid flow in a pipe via a turbine and a reverse-traverse screw mechanism is presented.
Abstract: This paper presents a dynamic model for a novel self-drive pipeline robot or "pig," which obtains its power from the kinetic energy of fluid flow in a pipe via a turbine and a reverse-traverse screw mechanism. The new robot is designed to move both against and with the flowing fluid, which makes it different from conventional "pigs", which can only move with the flowing fluid. This bidirectional capability makes it very valuable to many industries, especially the oil and gas industries. Based on the model, the dynamic behavior of the new robot under different conditions has been analyzed in detail. In order to verify the validity of the dynamic model, a prototype machine and pipe-loop test rig was built, and experimental data obtained compared well with the theoretical analyses. Both the theoretical and experimental results validated the practicability of this novel robot structure. Furthermore, detailed analysis has been carried out, and the conclusions that have been drawn provide basic design principles for this new pipeline robot, and will assist in the aim of optimizing details of its design.


Proceedings ArticleDOI
29 Jun 2005
TL;DR: A discrete optimization model based on a linear programming formulation is presented as an alternative to the cascade of classifiers implemented in many language processing systems and it is shown that it performs better than a pipeline-based system.
Abstract: We present a discrete optimization model based on a linear programming formulation as an alternative to the cascade of classifiers implemented in many language processing systems. Since NLP tasks are correlated with one another, sequential processing does not guarantee optimal solutions. We apply our model in an NLG application and show that it performs better than a pipeline-based system.
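
A two-decision toy shows why a cascade can be suboptimal when tasks are correlated: committing greedily to the first classifier's best label can lock out the globally best joint assignment, which the linear-programming formulation recovers. The scores below are invented, and brute force stands in for the LP solver.

```python
from itertools import product

# Invented scores for two correlated NLP decisions; higher is better.
score1 = {"A": 0.6, "B": 0.4}
score2 = {("A", "x"): 0.1, ("A", "y"): 0.2,
          ("B", "x"): 0.9, ("B", "y"): 0.1}

# Pipeline/cascade: commit to the best first decision, then the second.
d1 = max(score1, key=score1.get)
d2 = max(("x", "y"), key=lambda v: score2[(d1, v)])
print("cascade:", (d1, d2), score1[d1] + score2[(d1, d2)])  # ('A','y') 0.8

# Joint optimization (what the ILP encodes) searches both at once.
best = max(product(score1, ("x", "y")),
           key=lambda p: score1[p[0]] + score2[p])
print("joint:  ", best, score1[best[0]] + score2[best])     # ('B','x') 1.3
```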

Journal ArticleDOI
01 May 2005
TL;DR: A hardware-based dynamic optimizer that continuously optimizes an application's instruction stream is evaluated in the context of a contemporary microarchitecture running current workloads; the evaluation reveals that the optimizer can directly execute 33% of instructions, resolve 29% of mispredicted branches, and generate addresses for 76% of memory operations.
Abstract: This paper presents a hardware-based dynamic optimizer that continuously optimizes an application's instruction stream. In continuous optimization, dataflow optimizations are performed using simple, table-based hardware placed in the rename stage of the processor pipeline. The continuous optimizer reduces dataflow height by performing constant propagation, reassociation, redundant load elimination, store forwarding, and silent store removal. To enhance the impact of the optimizations, the optimizer integrates values generated by the execution units back into the optimization process. Continuous optimization allows instructions with input values known at optimization time to be executed in the optimizer, leaving less work for the out-of-order portion of the pipeline. Continuous optimization can detect branch mispredictions earlier and thus reduce the misprediction penalty. In this paper, we present a detailed description of a hardware optimizer and evaluate it in the context of a contemporary microarchitecture running current workloads. Our analysis of SPECint, SPECfp, and mediabench workloads reveals that a hardware optimizer can directly execute 33% of instructions, resolve 29% of mispredicted branches, and generate addresses for 76% of memory operations. These positive effects combine to provide speedups in the range 0.99 to 1.27.
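
The flavor of table-based constant propagation at rename is easy to sketch: keep a register-to-constant map, execute instructions whose inputs are all known inside the optimizer, and invalidate destinations otherwise. This toy three-operand stream is far simpler than the paper's integrated optimizer:

```python
def optimize(stream):
    """Toy rename-stage constant table: fold instructions whose inputs
    are known constants so they never occupy the out-of-order core.
    The real hardware also performs reassociation, store forwarding,
    and silent-store removal."""
    const = {}       # reg -> known constant value
    remaining = []
    for op, dst, a, b in stream:
        if op == "li":                        # load-immediate: value known
            const[dst] = a
        elif op == "add" and a in const and b in const:
            const[dst] = const[a] + const[b]  # executed in the optimizer
        else:
            const.pop(dst, None)              # dst now unknown
            remaining.append((op, dst, a, b))
    return const, remaining

stream = [("li", "r1", 4, None), ("li", "r2", 6, None),
          ("add", "r3", "r1", "r2"),          # folds to r3 = 10
          ("load", "r4", "r3", None),         # must still execute
          ("add", "r5", "r4", "r1")]          # r4 unknown -> not folded
print(optimize(stream))
```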

Journal ArticleDOI
27 May 2005-Science
TL;DR: The strategy of denying growing tumors a blood supply continues to show clinical promise as new and improved drugs move through the pipeline.
Abstract: The strategy of denying growing tumors a blood supply continues to show clinical promise as new and improved drugs move through the pipeline.

Journal ArticleDOI
TL;DR: Novel approaches for pipelining of parallel nested multiplexer loops and decision feedback equalizers (DFEs) based on look-ahead techniques are presented, including a look-ahead approach that can guarantee improvement in performance either in the form of pipelining or parallelism.
Abstract: This paper presents novel approaches for pipelining of parallel nested multiplexer loops and decision feedback equalizers (DFEs) based on look-ahead techniques. Look-ahead techniques can be applied to pipeline a nested multiplexer loop in many possible ways. It is shown that not all the look-ahead approaches necessarily result in improved performance. A novel look-ahead approach is identified, which can guarantee improvement in performance either in the form of pipelining or parallelism. The proposed technique is demonstrated and applied to design multiplexer-loop-based DFEs with throughput in the range of 3.125-10 Gb/s.
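
The basic look-ahead rewrite on a multiplexer loop can be shown directly: substituting the recursion into itself stretches the feedback from one sample to two, creating room for a pipeline register. The sketch below only checks that the two forms agree; choosing among the many possible rewrites, which is the paper's point, is not modeled.

```python
def dfe_loop(c, a, y0=0):
    """Original recursion: y[n] = a[n] if c[n] else y[n-1].
    The 1-sample feedback loop limits the achievable clock period."""
    y, prev = [], y0
    for cn, an in zip(c, a):
        prev = an if cn else prev
        y.append(prev)
    return y

def dfe_lookahead(c, a, y0=0):
    """One-step look-ahead: y[n] = a[n] if c[n] else
    (a[n-1] if c[n-1] else y[n-2]), so the feedback spans 2 samples
    and one pipeline register can be retimed into the loop."""
    y = [a[0] if c[0] else y0]
    prev2 = y0                       # holds y[n-2]
    for n in range(1, len(c)):
        yn = a[n] if c[n] else (a[n - 1] if c[n - 1] else prev2)
        prev2 = y[n - 1]
        y.append(yn)
    return y

c = [0, 1, 0, 0, 1, 0]
a = [9, 8, 7, 6, 5, 4]
assert dfe_loop(c, a) == dfe_lookahead(c, a)
```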

Journal ArticleDOI
TL;DR: In this paper, a symbol-based block-interleaved pipelining (BIP) architecture is proposed for the maximum a posteriori probability (MAP) decoder in turbo-decoders.
Abstract: Iterative decoders such as turbo decoders have become integral components of modern broadband communication systems because of their ability to provide substantial coding gains. A key computational kernel in iterative decoders is the maximum a posteriori probability (MAP) decoder. The MAP decoder is recursive and complex, which makes high-speed implementations extremely difficult to realize. In this paper, we present block-interleaved pipelining (BIP) as a new high-throughput technique for MAP decoders. An area-efficient symbol-based BIP MAP decoder architecture is proposed by combining BIP with the well-known look-ahead computation. These architectures are compared with conventional parallel architectures in terms of speed-up, memory and logic complexity, and area. Compared to the parallel architecture, the BIP architecture provides the same speed-up with a reduction in logic complexity by a factor of M, where M is the level of parallelism. The symbol-based architecture provides a speed-up in the range from 1 to 2 with a logic complexity that grows exponentially with M and a state metric storage requirement that is reduced by a factor of M as compared to a parallel architecture. The symbol-based BIP architecture provides speed-up in the range M to 2M with an exponentially higher logic complexity and a reduced memory complexity compared to a parallel architecture. These high-throughput architectures are synthesized in a 2.5-V 0.25-µm CMOS standard cell library and post-layout simulations are conducted. For turbo decoder applications, we find that the BIP architecture provides a throughput gain of 1.96 at the cost of 63% area overhead. For turbo equalizer applications, the symbol-based BIP architecture enables us to achieve a throughput gain of 1.79 with an area savings of 25%.
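
A back-of-the-envelope cycle count conveys the BIP idea: a recursive MAP step on an L-deep pipelined unit stalls L cycles per step for a single block, while round-robin interleaving of M independent blocks fills the pipeline. The counts below are an idealized schedule model, not the decoder datapath.

```python
def cycles_single(M, K, L):
    """One block at a time: each of the K recursion steps must wait
    for the L-deep pipeline to produce its predecessor."""
    return M * K * L

def cycles_bip(M, K, L):
    """Round-robin over M independent blocks: within a round the M
    steps are independent and can be issued back-to-back."""
    return K * max(M, L) + (L - 1)

M, K, L = 4, 1000, 4
print(cycles_single(M, K, L))  # 16000
print(cycles_bip(M, K, L))     # 4003 -> ~4x higher throughput
```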

Patent
24 Aug 2005
TL;DR: A processor (1700) has a pipeline (1710, 1736, 1740) including a fetch stage (1710) and an execute stage (1870), a first storing circuit (aGHR 2130) associated with the fetch stage and operable to store a history of actual branches, and a second storing circuit (wGHR 2140) storing a pattern of predicted branches, with the execute stage coupled back to the first storing circuit.
Abstract: A processor (1700) for processing instructions has a pipeline (1710, 1736, 1740) including a fetch stage (1710) and an execute stage (1870), a first storing circuit (aGHR 2130) associated with said fetch stage (1710) and operable to store a history of actual branches, and a second storing circuit (wGHR 2140) associated with said fetch stage (1710) and operable to store a pattern of predicted branches, said second storing circuit (wGHR 2140) coupled to said first storing circuit (aGHR 2130), said execute stage (1870) coupled back to said first storing circuit (aGHR 2130). Other processors, wireless communications devices, systems, circuits, devices, branch prediction processes and methods of operation, processes of manufacture, and articles of manufacture are also disclosed and claimed.

Journal ArticleDOI
TL;DR: A parallel and pipelined architecture is proposed for the sub-pixel interpolation filter in an H.264/AVC-conformant HDTV decoder, offering 60% reduced memory data transfer and a dedicated buffer organization that converts tree-structured block-size reading into fixable and sequential processing.
Abstract: In this paper, we propose a parallel and pipelined architecture for the sub-pixel interpolation filter in an H.264/AVC-conformant HDTV decoder. To use the bus bandwidth efficiently, we bring forward three memory access optimization strategies to avoid redundant data transfer and improve data bus utilization. To improve the processing throughput, we use a parallel, multi-stage pipeline architecture that conducts data transmission and interpolation filtering in parallel. Moreover, to balance the tradeoff between the memory accessing scheme and the sub-pixel interpolation processing granularity, we devise a dedicated buffer organization to convert tree-structured block-size reading into fixable and sequential processing. Compared to traditional designs, our scheme offers 60% reduced memory data transfer. Clocking at 66 MHz, our design can support 1280×720 @ 30 Hz processing throughput. The proposed design is suitable for low-cost, real-time applications and can easily be applied in system-on-chip design.
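
The arithmetic kernel being pipelined here is the standard H.264 luma half-sample filter; the paper's contribution is the memory access and buffering around it. For reference:

```python
def halfpel(px):
    """H.264 luma half-sample interpolation: 6-tap FIR
    (1, -5, 20, 20, -5, 1) with rounding, then clip to 8 bits.
    `px` holds the six neighboring integer-position samples
    E,F,G,H,I,J around the half position (row- or column-wise)."""
    E, F, G, H, I, J = px
    b = E - 5 * F + 20 * G + 20 * H - 5 * I + J
    return max(0, min(255, (b + 16) >> 5))

print(halfpel([10, 20, 30, 40, 50, 60]))  # 35: about halfway between G=30, H=40
```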

Journal ArticleDOI
TL;DR: The most area-efficient pipeline FFT processors for WLAN MIMO-OFDM systems are presented in this paper, where it is shown that although the R2³SDF architecture is the most area-efficient approach for implementing pipeline FFT processors, RrMDC architectures are more efficient when more than three channels are used.
Abstract: The most area-efficient pipeline FFT processors for WLAN MIMO-OFDM systems are presented. It is shown that although the R2³SDF architecture is the most area-efficient approach for implementing pipeline FFT processors, RrMDC architectures are more efficient in MIMO-OFDM systems when more than three channels are used.