
Showing papers on "Pipeline (computing) published in 2003"


Patent
09 Jun 2003
TL;DR: In this article, a deferred shading graphics pipeline processor and method are provided encompassing numerous substructures, including one or more of deferred shading, a tiled frame buffer, and multiple-stage hidden surface removal processing.
Abstract: A deferred shading graphics pipeline processor and method are provided encompassing numerous substructures. Embodiments of the processor and method may include one or more of deferred shading, a tiled frame buffer, and multiple-stage hidden surface removal processing. In the deferred shading graphics pipeline, hidden surface removal is completed before pixel coloring is done. The pipeline processor comprises a command fetch and decode unit, a geometry unit, a mode extraction unit, a sort unit, a setup unit, a cull unit, a mode injection unit, a fragment unit, a texture unit, a Phong lighting unit, a pixel unit, and a backend unit.

468 citations
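The key ordering in the abstract above, hidden surface removal completing before any pixel coloring, can be illustrated with a toy sketch. This is a hypothetical simplification in Python, not the patent's pipeline: the real design interposes sort, cull, and mode-injection stages that are elided here, and all names are invented.

def deferred_shade(fragments):
    depth = {}    # (x, y) -> nearest z seen so far
    visible = {}  # (x, y) -> winning fragment

    # Stage 1: hidden surface removal (a plain z-test stands in for the
    # patent's multiple-stage culling).
    for f in fragments:  # f = {"x": ..., "y": ..., "z": ..., "material": ...}
        key = (f["x"], f["y"])
        if f["z"] < depth.get(key, float("inf")):
            depth[key] = f["z"]
            visible[key] = f

    # Stage 2: expensive texturing/Phong shading runs once per visible pixel.
    return {key: shade(f) for key, f in visible.items()}

def shade(fragment):
    return fragment["material"]  # stand-in for the texture and Phong units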


Journal ArticleDOI
TL;DR: The eXtreme Processing Platform (XPP) is a new runtime-reconfigurable data processing architecture based on a hierarchical array of coarse-grain, adaptive computing elements and a packet-oriented communication network that is well suited for applications in multimedia, telecommunications, simulation, signal processing, graphics, and similar stream-based application domains.
Abstract: The eXtreme Processing Platform (XPP) is a new runtime-reconfigurable data processing architecture. It is based on a hierarchical array of coarse-grain, adaptive computing elements, and a packet-oriented communication network. The strength of the XPP technology originates from the combination of array processing with unique, powerful run-time reconfiguration mechanisms. Parts of the array can be configured rapidly in parallel while neighboring computing elements are processing data. Reconfiguration is triggered externally or even by special event signals originating within the array, enabling self-reconfiguring designs. The XPP architecture is designed to support different types of parallelism: pipelining, instruction-level, data-flow, and task-level parallelism. This technology is therefore well suited for applications in multimedia, telecommunications, simulation, signal processing (DSP), graphics, and similar stream-based application domains. The anticipated peak performance of the first commercial device running at 150 MHz is estimated to be 57.6 GigaOps/sec, with a peak I/O bandwidth of several GByte/sec. Simulated applications achieve up to 43.5 GigaOps/sec (32-bit fixed point).

387 citations
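The quoted figures imply a degree of parallelism worth making explicit. Assuming the simple relationship peak ops/s = concurrent operations per cycle times clock frequency (an inference, not a statement from the paper), the numbers work out as follows:

# Back-of-envelope check of the quoted XPP peak numbers.
clock_hz = 150e6
peak_ops_per_s = 57.6e9
print(peak_ops_per_s / clock_hz)   # 384.0 -> implies 384 concurrent operations
print(43.5e9 / peak_ops_per_s)     # ~0.76: simulated apps reach ~76% of peak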


Journal ArticleDOI
Byung-Moo Min, P. Kim, F.W. Bowman, D.M. Boisvert, A.J. Aude
TL;DR: The proposed feedback signal polarity inverting (FSPI) technique addresses the drawback of the conventional amplifier sharing technique and helps to reduce power consumption in a 10-bit pipeline.
Abstract: A 10-bit 80-MS/s analog-to-digital converter (ADC) with an area- and power-efficient architecture is described. By sharing an amplifier between two successive pipeline stages, a 10-bit pipeline is realized using just four amplifiers with a separate sample-and-hold block. The proposed feedback signal polarity inverting (FSPI) technique addresses the drawback of the conventional amplifier sharing technique. A wide-swing wide-bandwidth telescopic amplifier and an early comparison technique with a constant delay circuit have been developed to further reduce power consumption. The ADC is implemented in a 0.18-μm dual-gate-oxidation CMOS process technology, achieves 72.8-dBc spurious-free dynamic range, 57.92-dBc signal-to-noise ratio, 9.29 effective number of bits (ENOB) for a 99-MHz input at full sampling rate, and consumes 69 mW from a 3-V supply. The ADC occupies 1.85 mm².

234 citations
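The reported ENOB can be cross-checked against the standard relation ENOB = (SINAD - 1.76 dB) / 6.02 dB. Taking the quoted 57.92 dB figure as an approximation of SINAD gives about 9.33 bits, close to the stated 9.29; the small gap suggests the ENOB was computed from a SINAD value slightly below the quoted SNR.

# ENOB from the standard relation ENOB = (SINAD - 1.76) / 6.02.
snr_db = 57.92
print(round((snr_db - 1.76) / 6.02, 2))   # 9.33, vs. the reported 9.29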


Proceedings ArticleDOI
27 Sep 2003
TL;DR: A statistically driven algorithm for forming clusters from which simulation points are chosen is presented, and algorithms for picking simulation points earlier in a program's execution are examined, in order to significantly reduce fast-forwarding time during simulation.
Abstract: Modern architecture research relies heavily on detailed pipeline simulation. Simulating the full execution of an industry-standard benchmark can take weeks to months to complete. To address this issue we have recently proposed using Simulation Points (found by examining only basic block execution frequency profiles) to increase the efficiency and accuracy of simulation. Simulation points are a small set of execution samples that, when combined, represent the complete execution of the program. In this paper we present a statistically driven algorithm for forming clusters from which simulation points are chosen, and examine algorithms for picking simulation points earlier in a program's execution, in order to significantly reduce fast-forwarding time during simulation. In addition, we show that simulation points can be used independent of the underlying architecture. The points are generated once for a program/input pair by examining only the code executed. We show the points accurately track hardware metrics (e.g., performance and cache miss rates) between different architecture configurations. They can therefore be used across different architecture configurations to allow a designer to make accurate trade-off decisions between different configurations.

222 citations
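A minimal sketch of the clustering step: each execution interval is summarized by a basic-block frequency vector, the vectors are clustered, and the member nearest each centroid becomes a simulation point weighted by cluster size. This sketch uses scikit-learn's KMeans for brevity and omits the paper's statistical selection of the cluster count and its early-point heuristics.

import numpy as np
from sklearn.cluster import KMeans

def pick_simulation_points(bbvs, k):
    """bbvs: (n_intervals, n_basic_blocks) basic-block frequency matrix."""
    bbvs = bbvs / bbvs.sum(axis=1, keepdims=True)    # normalize each interval
    km = KMeans(n_clusters=k, n_init=10).fit(bbvs)
    points = []
    for c in range(k):
        members = np.flatnonzero(km.labels_ == c)
        dist = np.linalg.norm(bbvs[members] - km.cluster_centers_[c], axis=1)
        # representative interval, plus a weight proportional to cluster size
        points.append((int(members[np.argmin(dist)]), len(members) / len(bbvs)))
    return points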


Journal ArticleDOI
TL;DR: This work presents detailed analysis and dedicated hardware architecture of the block-coding engine to execute the EBCOT algorithm efficiently and shows that about 60% of the processing time is reduced compared with sample-based straightforward implementation.
Abstract: Embedded block coding with optimized truncation (EBCOT) is the most important technology in the latest image-coding standard, JPEG 2000. The hardware design of the block-coding engine in EBCOT is critical because the operations are bit-level processing and occupy more than half of the computation time of the whole compression process. A general-purpose processor (GPP) is, therefore, very inefficient at processing these operations. We present a detailed analysis and a dedicated hardware architecture of the block-coding engine to execute the EBCOT algorithm efficiently. The context formation process in EBCOT is analyzed to gain insight into the characteristics of the operation. A column-based architecture and two speed-up methods, sample skipping (SS) and group-of-column skipping (GOCS), are then proposed for context generation. As for the arithmetic encoder design, pipeline and look-ahead techniques are used to speed up the processing. It is shown that about 60% of the processing time is saved compared with a sample-based straightforward implementation. A test chip has been designed, and simulation results show that it can process a 4.6-million-pixel image within 1 s, corresponding to a 2400 × 1800 image, or a CIF (352 × 288) 4:2:0 video sequence at 30 frames per second, at a 50-MHz working frequency.

220 citations
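The two speed-up methods amount to skipping work the scan order would otherwise visit one sample at a time. A hedged sketch of that intuition follows; needs_coding and code_sample are hypothetical stand-ins, and the real context-formation logic is considerably more involved.

def code_sample(sample):
    pass  # stand-in for context formation plus arithmetic coding

def scan_groups(groups, needs_coding):
    """groups: lists of columns; each column holds 4 vertical samples."""
    for group in groups:
        if not any(needs_coding(s) for col in group for s in col):
            continue                     # GOCS: skip a whole idle column group
        for col in group:
            for s in col:
                if needs_coding(s):      # SS: touch only the samples that
                    code_sample(s)       # participate in this coding pass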


Proceedings ArticleDOI
08 Feb 2003
TL;DR: The unique properties of scalar operand networks are discussed, alternative ways of implementing them are examined, and the implementation of one such network is described in detail in the Raw microprocessor.
Abstract: The bypass paths and multiported register files in microprocessors serve as an implicit interconnect to communicate operand values among pipeline stages and multiple ALUs. Previous superscalar designs implemented this interconnect using centralized structures that do not scale with increasing ILP demands. In search of scalability, recent microprocessor designs in industry and academia exhibit a trend towards distributed resources such as partitioned register files, banked caches, multiple independent compute pipelines, and even multiple program counters. Some of these partitioned microprocessor designs have begun to implement bypassing and operand transport using point-to-point interconnects rather than centralized networks. We call interconnects optimized for scalar data transport, whether centralized or distributed, scalar operand networks. Although these networks share many of the challenges of multiprocessor networks such as scalability and deadlock avoidance, they have many unique requirements, including ultra-low latencies (a few cycles versus tens of cycles) and ultra-fast operation-operand matching. This paper discusses the unique properties of scalar operand networks, examines alternative ways of implementing them, and describes in detail the implementation of one such network in the Raw microprocessor. The paper analyzes the performance of these networks for ILP workloads and the sensitivity of overall ILP performance to network properties.

205 citations


Journal ArticleDOI
TL;DR: This paper presents a high-speed parallel Reed-Solomon (RS) decoder architecture using the modified Euclidean algorithm for high-speed multigigabit-per-second fiber-optic systems, and suggests that a parallel RS decoder that can keep up with optical transmission rates could be implemented.
Abstract: This paper presents a high-speed parallel Reed-Solomon (RS) (255,239) decoder architecture using the modified Euclidean algorithm for high-speed multigigabit-per-second fiber-optic systems. Pipelining and parallelizing allow inputs to be received at very high fiber-optic rates and outputs to be delivered at correspondingly high rates with minimum delay. Parallel processing enables throughputs of 10 Gb/s and beyond, since the maximum achievable clock frequency is generally bounded by the critical path of the modified Euclidean algorithm block. The parallel RS decoders have been designed and implemented in a 0.13-μm CMOS standard-cell technology with a supply voltage of 1.1 V. It is suggested that a parallel RS decoder that can keep up with optical transmission rates, i.e., 10 Gb/s and beyond, could be implemented. The proposed four-channel parallel RS decoder operates at a clock frequency of 770 MHz and has a data processing rate of 26.6 Gb/s.

164 citations


Journal ArticleDOI
TL;DR: The GRAPE-6 system is a massively parallel special-purpose computer for astrophysical N-body simulations; it succeeds GRAPE-4, which achieved a theoretical peak speed of 1.08 Tflops, and reaches a peak speed of 64 Tflops.
Abstract: In this paper, we describe the architecture and performance of the GRAPE-6 system, a massively parallel special-purpose computer for astrophysical N-body simulations. GRAPE-6 is the successor of GRAPE-4, which was completed in 1995 and achieved a theoretical peak speed of 1.08 Tflops. As was the case with GRAPE-4, the primary application of GRAPE-6 is the simulation of collisional systems, though it can also be used for collisionless systems. The main differences between GRAPE-4 and GRAPE-6 are: (a) the processor chip of GRAPE-6 integrates six force-calculation pipelines, compared to the one pipeline of GRAPE-4 (which needed three clock cycles to calculate one interaction); (b) the clock speed is increased from 32 to 90 MHz; and (c) the total number of processor chips is increased from 1728 to 2048. These improvements result in a peak speed of 64 Tflops. We also discuss the design of the successor of GRAPE-6.

153 citations
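Both quoted peaks are consistent with the conventional GRAPE accounting of roughly 57 floating-point operations per pairwise force evaluation. That count is an assumption here (the paper defines its own operation count), but the arithmetic lines up for both machines:

OPS_PER_INTERACTION = 57   # conventional GRAPE flop count per interaction

grape4 = 1728 * (32e6 / 3) * OPS_PER_INTERACTION  # 1 pipeline, 3 cycles each
grape6 = 2048 * 6 * 90e6 * OPS_PER_INTERACTION    # 6 pipelines, 1 cycle each
print(f"GRAPE-4 ~ {grape4 / 1e12:.2f} Tflops")    # ~1.05, vs. quoted 1.08
print(f"GRAPE-6 ~ {grape6 / 1e12:.1f} Tflops")    # ~63.0, vs. quoted 64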


Journal ArticleDOI
TL;DR: The PowerTimer toolset is useful in assessing the typical and worst-case power swings that occur between successive cycle windows in a given workload execution and helps pinpoint potential inductive noise problems on the voltage rail that can be addressed by designing an appropriate package or by suitably tuning the dynamic power management controls within the processor.
Abstract: The PowerTimer toolset has been developed for use in early-stage, microarchitecture-level power-performance analysis of microprocessors. The key component of the toolset is a parameterized set of energy functions that can be used in conjunction with any given cycle-accurate microarchitectural simulator. The energy functions model the power consumption of primitive and hierarchically composed building blocks which are used in microarchitecture-level performance models. Examples of structures modeled are pipeline stage latches, queues, buffers and component read/write multiplexers, local clock buffers, register files, and cache array macros. The energy functions can be derived using purely analytical equations that are driven by organizational, circuit, and technology parameters or behavioral equations that are derived from empirical, circuit-level simulation experiments. After describing the modeling methodology, we present analysis results in the context of a current-generation superscalar processor simulator to illustrate the use and effectiveness of such early-stage models. In addition to average power and performance tradeoff analysis, PowerTimer is useful in assessing the typical and worst-case power (or current) swings that occur between successive cycle windows in a given workload execution. Such a characterization of workloads at the early stage of microarchitecture definition helps pinpoint potential inductive noise problems on the voltage rail that can be addressed by designing an appropriate package or by suitably tuning the dynamic power management controls within the processor.

123 citations
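A sketch of what a parameterized energy function looks like when hooked to a cycle-accurate simulator: the structure (per-access dynamic energy plus always-on leakage, driven by organizational parameters) follows the abstract, but the function name and all coefficients below are invented for illustration.

def queue_energy_this_cycle(entries, width_bits, reads, writes,
                            e_read_bit=0.4e-12, e_write_bit=0.6e-12,
                            leak_bit=5e-15):
    # Dynamic energy scales with this cycle's access counts...
    dynamic = (reads * e_read_bit + writes * e_write_bit) * width_bits
    # ...while leakage accrues every cycle regardless of activity.
    leakage = entries * width_bits * leak_bit
    return dynamic + leakage   # joules consumed by this structure this cycle

# Typical hookup, called once per simulated cycle:
# energy += queue_energy_this_cycle(32, 64, reads=issued, writes=dispatched)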


Proceedings ArticleDOI
Hai Li, Swarup Bhunia, Yi Chen, T. N. Vijaykumar, Kaushik Roy
08 Feb 2003
TL;DR: Deterministic clock gating (DCG) is introduced based on the key observation that for many of the stages in a modern pipeline, a circuit block's usage in a specific cycle in the near future is deterministically known a few cycles ahead of time.
Abstract: With the scaling of technology and the need for higher performance and more functionality, power dissipation is becoming a major bottleneck for microprocessor designs. Pipeline balancing (PLB), a previous technique, is essentially a methodology to clock-gate unused components whenever a program's instruction-level parallelism is predicted to be low. However, no nonpredictive methodologies are available in the literature for efficient clock gating. This paper introduces deterministic clock gating (DCG) based on the key observation that for many of the stages in a modern pipeline, a circuit block's usage in a specific cycle in the near future is deterministically known a few cycles ahead of time. Our experiments show an average of 19.9% reduction in processor power with virtually no performance loss for an 8-issue, out-of-order superscalar processor by applying DCG to execution units, pipeline latches, D-Cache wordline decoders, and result bus drivers. In contrast, PLB achieves 9.9% average power savings at 2.9% performance loss.

115 citations
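The "deterministic" part can be shown in a few lines: at decode time the instruction's kind fixes which blocks it will need in later stages, so their clock enables can be set exactly those cycles ahead, with no prediction. The stage latency and block names below are hypothetical.

DECODE_TO_EXECUTE = 3   # assumed decode -> execute latency in cycles

def schedule_enables(insn, cycle, enable_table):
    """enable_table[c] = set of blocks that must be clocked in cycle c."""
    c = cycle + DECODE_TO_EXECUTE
    if insn["uses_fpu"]:
        enable_table.setdefault(c, set()).add("FPU")
    if insn["is_mem_op"]:
        enable_table.setdefault(c, set()).add("DCACHE_WORDLINE_DECODER")
    # Any block absent from enable_table[c] remains clock-gated in cycle c.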


Proceedings ArticleDOI
22 Jun 2003
TL;DR: This work's approach to WCET prediction was implemented for the Motorola ColdFire 5307 and includes a static prediction of ∗ This work was partly supported by the RTD project IST-1999-20527 “DAEDALUS” of the European FP5 program.
Abstract: Hard real-time avionics systems like flight control software are expected to always react in time. Consequently, it is essential for the timing validation of the software that the worst-case execution time (WCET) of all tasks on a given hardware configuration be known. Modern processor components like caches, pipelines, and branch prediction complicate the determination of the WCET considerably since the execution time of a single instruction may depend on the execution history. The safe, yet overly pessimistic assumption of no cache hits, no overlapping executions in the processor pipeline, and constantly mispredicted branches results in a serious overestimation of the WCET. Our approach to WCET prediction was implemented for the Motorola ColdFire 5307. It includes a static prediction of ∗ This work was partly supported by the RTD project IST-1999-20527 “DAEDALUS” of the European FP5 program. cache and pipeline behavior, producing much tighter upper bounds for the execution times. The WCET analysis tool works on real applications. It is safe in the sense that the computed WCET is always an upper bound of the real WCET. It requires much less effort, while producing more precise results than conventional measurement-based methods.

Proceedings ArticleDOI
Allan M. Hartstein, Thomas R. Puzak
03 Dec 2003
TL;DR: The theory shows that the more important power is to the metric, the shorter the optimum pipeline length that results; as dynamic power grows, the optimal design point shifts to shorter pipelines, while clock gating pushes the optimum to deeper pipelines.
Abstract: The impact of pipeline length on both the power and performance of a microprocessor is explored both theoretically and by simulation. A theory is presented for a wide range of power/performance metrics, BIPS^m/W. The theory shows that the more important power is to the metric, the shorter the optimum pipeline length that results. For typical parameters neither BIPS/W nor BIPS^2/W yields an optimum, i.e., a non-pipelined design is optimal. For BIPS^3/W the optimum, averaged over all 55 workloads studied, occurs at a 22.5 FO4 design point, a 7-stage pipeline, but this value is highly dependent on the assumed growth in latch count with pipeline depth. As dynamic power grows, the optimal design point shifts to shorter pipelines. Clock gating pushes the optimum to deeper pipelines. Surprisingly, as leakage power grows, the optimum is also found to shift to deeper pipelines. The optimum pipeline depth varies for different classes of workloads: SPEC95 and SPEC2000 integer applications, traditional (legacy) database and on-line transaction processing applications, modern (e.g., Web) applications, and floating-point applications.
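The shape of the result can be reproduced with a toy model (emphatically not the paper's): each stage adds latch overhead to the cycle time, hazards cost flushes proportional to depth, and power grows with latch count. Sweeping BIPS^m/W shows the optimum moving deeper as m grows, i.e., shallower the more power matters. With these toy parameters every metric has an interior optimum, whereas the paper's parameters push the BIPS/W and BIPS^2/W optima all the way to the unpipelined point.

import numpy as np

T_LOGIC, T_LATCH = 22.0, 1.5   # FO4 of total logic, per-stage latch (assumed)
HAZARD, P_LATCH = 0.05, 0.1    # flush rate per stage, latch power per stage

def metric(n, m):
    cycle = T_LATCH + T_LOGIC / n          # cycle time shrinks with depth n
    cpi = 1.0 + HAZARD * n                 # hazard penalty grows with depth
    bips = 1.0 / (cycle * cpi)
    return bips**m / (1.0 + P_LATCH * n)   # BIPS^m per watt, arbitrary units

depths = np.arange(1, 40)
for m in (1, 2, 3):
    best = depths[np.argmax([metric(int(n), m) for n in depths])]
    print(f"BIPS^{m}/W optimum at ~{best} stages")   # deepens as m grows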

Journal ArticleDOI
TL;DR: Since the proposed algorithm can achieve a higher processing rate and better efficiency than the conventional algorithm, it is very suitable for OFDM/DMT applications such as WLAN, DAB/DVB, and ADSL/VDSL systems.
Abstract: In this paper, we propose a new efficient FFT algorithm for OFDM/DMT applications and present its pipeline implementation results. Since the proposed algorithm is based on the radix-4 butterfly unit, the processing rate can be twice as fast as that based on the radix-2^3 algorithm. Its implementation is also more area-efficient than an implementation of the conventional radix-4 algorithm, because the number of nontrivial multipliers is reduced, as when using the radix-2^3 algorithm. In order to compare the proposed algorithm with the conventional radix-4 algorithm, a 64-point MDC pipelined FFT processor based on the proposed algorithm was implemented. After logic synthesis using 0.35-μm CMOS technology, the logic gate count for the processor with the proposed algorithm is only about 70% of that for the processor with the conventional radix-4 algorithm. Since the proposed algorithm achieves a higher processing rate and better efficiency than the conventional algorithm, it is very suitable for OFDM/DMT applications such as WLAN, DAB/DVB, and ADSL/VDSL systems.

Patent
09 Dec 2003
TL;DR: In this paper, the authors present a system and method of executing instructions within a counterflow pipeline processor, where an instruction and one or more operands issue into the instruction pipeline and a determination is made at one of the execution units whether the instruction is ready for execution.
Abstract: A system and method of executing instructions within a counterflow pipeline processor. The counterflow pipeline processor includes an instruction pipeline, a data pipeline, a reorder buffer and a plurality of execution units. An instruction and one or more operands issue into the instruction pipeline, and a determination is made at one of the execution units whether the instruction is ready for execution. If so, the operands are loaded into the execution unit and the instruction executes. The execution unit is monitored for a result and, when the result arrives, it is stored into the result pipeline. If the instruction reaches the end of the pipeline without executing, it wraps around and is sent down the instruction pipeline again.

Proceedings ArticleDOI
27 Oct 2003
TL;DR: A new active range scanning technique suitable for moving or deformable surfaces, in that 3D data are acquired from a single image, using a 'one-shot' system that poses a weak temporal continuity constraint.
Abstract: We present a new active range scanning technique suitable for moving or deformable surfaces. It is a 'one-shot' system, in that 3D data are acquired from a single image. The projection pattern consists of equidistant black and white stripes combined with a limited number of colored, transversal stripes which aid in their identification. Instead of using a generic, static code that is supposed to work under all circumstances, a 'self-adaptive code' is used. Two modes of adaptation are provided. The first mode is off-line, and generates a robust identification code for the current projector-camera configuration. This configuration is supposed to remain fixed during the remainder of the 3D capturing session. The second mode occurs on-line. By introducing a feedback loop from the reconstruction output to the pattern generation, the pattern can be adapted so as to keep its decoding well-conditioned. Within the considered family of patterns, the parameters are optimized on the fly based on the current content of the scene. In effect, the scene content of the current frame influences the pattern projected during the next. This poses a weak temporal continuity constraint; only a very fast change of scene invalidates the assumption. The higher the throughput of the reconstruction pipeline, the less serious this constraint becomes. Our current pipeline runs at approximately 20 Hz. Our prototype uses only 'off the shelf' hardware.

Journal ArticleDOI
TL;DR: This paper presents an efficient VLSI architecture of the pipeline fast Fourier transform (FFT) processor based on radix-4 decimation-in-time algorithm with the use of digit-serial arithmetic units by combining both the feedforward and feedback commutator schemes.
Abstract: This paper presents an efficient VLSI architecture of the pipeline fast Fourier transform (FFT) processor based on radix-4 decimation-in-time algorithm with the use of digit-serial arithmetic units. By combining both the feedforward and feedback commutator schemes, the proposed architecture can not only achieve nearly 100% hardware utilization, but also require much less memory compared with the previous digit-serial FFT processors. Furthermore, in FFT processors, several modules of ROM are required for the storage of twiddle factors. By exploiting the redundancy of the factors, the overall ROM size can be effectively reduced by a factor of 2.
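The factor-of-2 ROM reduction follows from the half-period symmetry of the twiddle factors, W_N^(k+N/2) = -W_N^k, so only the first N/2 entries need storing. The paper's exact redundancy exploitation may differ, but the saving matches; here is a small self-checking sketch:

import cmath

N = 64
rom = [cmath.exp(-2j * cmath.pi * k / N) for k in range(N // 2)]  # half-size ROM

def twiddle(k):
    k %= N
    return rom[k] if k < N // 2 else -rom[k - N // 2]  # sign flip recovers rest

assert all(abs(twiddle(k) - cmath.exp(-2j * cmath.pi * k / N)) < 1e-12
           for k in range(N))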

Patent
31 Oct 2003
TL;DR: In this article, a pipeline accelerator includes a bus and a plurality of pipeline units, each unit coupled to the bus and including at least one respective hardwired-pipeline circuit.
Abstract: A pipeline accelerator includes a bus and a plurality of pipeline units, each unit coupled to the bus and including at least one respective hardwired-pipeline circuit. By including a plurality of pipeline units in the pipeline accelerator, one can increase the accelerator's data-processing performance as compared to a single-pipeline-unit accelerator. Furthermore, by designing the pipeline units so that they communicate via a common bus, one can alter the number of pipeline units, and thus alter the configuration and functionality of the accelerator, by merely coupling or uncoupling pipeline units to or from the bus. This eliminates the need to design or redesign the pipeline-unit interfaces each time one alters one of the pipeline units or alters the number of pipeline units within the accelerator.

Proceedings ArticleDOI
01 May 2003
TL;DR: This work reconciles the complexity/safety trade-off by decoupling worst-case timing analysis from the processor implementation through a virtual simple architecture (VISA), and shows that a VISA-compliant complex pipeline consumes 43-61% less power than an explicitly safe pipeline.
Abstract: Meeting deadlines is a key requirement in safe real-time systems. Worst-case execution times (WCET) of tasks are needed for safe planning. Contemporary worst-case timing analysis tools can safely and tightly bound execution time on in-order single-issue pipelines with caches and static branch prediction. However, this simple pipeline appears to be a complexity limit, due to the need for analyzability. This excludes a whole class of high-performance processors from many embedded systems. We reconcile the complexity/safety trade-off by decoupling worst-case timing analysis from the processor implementation, through a virtual simple architecture (VISA). A VISA is the timing specification of a hypothetical simple pipeline and is the basis for worst-case timing analysis. However, the underlying microarchitecture can be arbitrarily complex. A task is divided into multiple sub-tasks which provide a means to gauge progress on the complex pipeline. Each sub-task is assigned an interim deadline, or checkpoint, based on the latest allowable completion time of the sub-task on the hypothetical simple pipeline. If no checkpoints are missed, then the complex pipeline is as timely as the safe pipeline. If a checkpoint is missed, the pipeline switches to a simple mode of operation that directly implements the VISA, so that the execution time of unfinished sub-tasks is safely bounded. The significance of our approach is that we circumvent worst-case timing analysis of the complex pipeline by dynamically confirming that its behavior is bounded by worst-case timing analysis of a simpler proxy pipeline. The benefit of using a high-performance processor is that tasks finish much sooner than they would have on an explicitly safe processor. The new slack in the schedule can be exploited for higher throughput or lower power. With the VISA approach, an arbitrarily complex SMT processor can safely run non-real-time tasks at the same time as a real-time task. Alternatively, frequency/voltage can be safely lowered to take up the slack. We explore the latter application and show that a VISA-compliant complex pipeline consumes 43-61% less power than an explicitly safe pipeline.
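The checkpointing discipline can be sketched in a few lines. Sub-task checkpoints come from WCET analysis of the simple proxy pipeline; the complex pipeline runs until one is missed, after which execution proceeds in the explicitly safe simple mode. In the real design the switch happens immediately, mid-sub-task; this hypothetical sketch simplifies to sub-task granularity.

def run_task(subtasks, start_time, checkpoint_of, run_complex, run_simple):
    now, safe_mode = start_time, False
    for st in subtasks:
        if safe_mode:
            now = run_simple(st, now)      # bounded by the VISA's WCET
        else:
            now = run_complex(st, now)     # fast, but not analyzable
            if now > checkpoint_of(st):    # interim deadline missed:
                safe_mode = True           # fall back to the safe mode for
    return now                             # everything that remains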

Proceedings ArticleDOI
08 Feb 2003
TL;DR: This work focuses on reducing the power dissipated by mis-speculated instructions by proposing selective throttling as an effective way of triggering different power-aware techniques (fetch throttling, decode throttling or disabling the selection logic).
Abstract: With the constant advances in technology that lead to the increasing of the transistor count and processor frequency, power dissipation is becoming one of the major issues in high-performance processors. These processors increase their clock frequency by lengthening the pipeline, which puts more pressure on the branch prediction engine since branches take longer to be resolved. Branch mispredictions are responsible for around 28% of the power dissipated by a typical processor due to the useless activities performed by instructions that are squashed. This work focuses on reducing the power dissipated by mis-speculated instructions. We propose selective throttling as an effective way of triggering different power-aware techniques (fetch throttling, decode throttling or disabling the selection logic). The particular set of techniques applied to each branch is dynamically chosen depending on the branch prediction confidence level. For branches with a low confidence on the prediction, the most aggressive throttling mechanism is used whereas high confidence branch predictions trigger the least aggressive techniques. Results show that combining fetch bandwidth reduction along with select logic disabling provides the best performance both in terms of energy reduction and energy-delay improvement (14% and 9% respectively for 14 stages, and 17% and 12% respectively for 28 stages).
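A sketch of the selection policy the abstract describes, mapping prediction confidence to progressively more aggressive throttling. The thresholds and exact technique sets below are illustrative, not the paper's tuned values.

def throttle_actions(confidence):
    """confidence in [0, 1] from the branch confidence estimator."""
    if confidence < 0.3:    # low confidence: most aggressive combination
        return {"fetch_throttle", "decode_throttle", "disable_select_logic"}
    if confidence < 0.7:
        return {"fetch_throttle", "decode_throttle"}
    if confidence < 0.9:    # high confidence: least aggressive technique
        return {"fetch_throttle"}
    return set()            # near-certain prediction: no throttling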

Patent
18 Dec 2003
TL;DR: In this article, a BIST controller supporting a "resume" mode (225) in addition to a "pass/fail' mode (220) may be used to compensate for timing latencies introduced by pipeline staging in an embedded memory array.
Abstract: A BIST (Built-In Self Test) controller supporting a 'resume' mode (225) in addition to a 'pass/fail' mode (220) may be used to compensate for timing latencies introduced by pipeline staging in an embedded memory array (205) .

Patent
31 Oct 2003
TL;DR: In this paper, the authors propose a peer-vector machine with a hardwired pipeline accelerator coupled to the processor, where the buffer and data-transfer objects facilitate the transfer of data between the application and the accelerator.
Abstract: A computing machine includes a first buffer and a processor coupled to the buffer. The processor executes an application, a first data-transfer object, and a second data-transfer object, publishes data under the control of the application, loads the published data into the buffer under the control of the first data-transfer object, and retrieves the published data from the buffer under the control of the second data-transfer object. Alternatively, the processor retrieves data and loads the retrieved data into the buffer under the control of the first data-transfer object, unloads the data from the buffer under the control of the second data-transfer object, and processes the unloaded data under the control of the application. Where the computing machine is a peer-vector machine that includes a hardwired pipeline accelerator coupled to the processor, the buffer and data-transfer objects facilitate the transfer of data between the application and the accelerator.

Proceedings ArticleDOI
03 Dec 2003
TL;DR: A microarchitectural technique referred to as two-pass pipelining is proposed, wherein the program executes on two in-order back-end pipelines coupled by a queue; the design is both achievable and a good use of transistor resources, and results indicate that it can deliver significant speedups for in-order processor designs.
Abstract: Accommodating the uncertain latency of load instructions is one of the most vexing problems in in-order microarchitecture design and compiler development. Compilers can generate schedules with a high degree of instruction-level parallelism but cannot effectively accommodate unanticipated latencies; incorporating traditional out-of-order execution into the microarchitecture hides some of this latency but redundantly performs work done by the compiler and adds additional pipeline stages. Although effective techniques, such as prefetching and threading, have been proposed to deal with anticipable, long latency misses, the shorter, more diffuse stalls due to difficult-to-anticipate, first- or second-level misses are less easily hidden on in-order architectures. This paper addresses this problem by proposing a microarchitectural technique, referred to as two-pass pipelining, wherein the program executes on two in-order back-end pipelines coupled by a queue. The "advance" pipeline executes instructions greedily, without stalling on unanticipated latency dependences (executing independent instructions while otherwise blocking instructions are deferred). The "backup" pipeline allows concurrent resolution of instructions that were deferred in the other pipeline, resulting in the absorption of shorter misses and the overlap of longer ones. This paper argues that this design is both achievable and a good use of transistor resources and shows results indicating that it can deliver significant speedups for in-order processor designs.
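A toy rendering of the two-pass idea: the advance pass never stalls, deferring any instruction whose operands are not ready, while the backup pass drains the deferred queue in order once the misses resolve. The Insn objects, ready() test, and execute() callback are hypothetical; the real design couples two hardware pipelines by a queue rather than running two software passes.

from collections import deque

def two_pass(instructions, ready, execute):
    deferred = deque()
    for insn in instructions:                 # "advance" pipeline: greedy,
        if all(ready(src) for src in insn.sources):   # never stalls
            execute(insn)
        else:
            deferred.append(insn)             # defer on unanticipated latency
    while deferred:                           # "backup" pipeline: in-order
        execute(deferred.popleft())           # drain; shorter misses absorbed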

Patent
18 Mar 2003
TL;DR: In this article, a data transformation pipeline enables a user to develop complex end-to-end data transformation functionality by graphically describing and representing, via a GUI, a desired data flow from one or multiple sources to one or more destinations through various interconnected nodes (graph).
Abstract: The data transformation system in one embodiment, comprises a capability to receive data, a data destination and a capability to store transformed data, and a data transformation pipeline that constructs complex end-to-end data transformation functionality by pipelining data flowing from one or more sources to one or more destinations through various interconnected nodes for transforming the data as it flows. Each component in the pipeline possesses predefined data transformation functionality, and the logical connections between components define the data flow pathway in an operational sense. The data transformation pipeline enables a user to develop complex end-to-end data transformation functionality by graphically describing and representing, via a GUI, a desired data flow from one or more sources to one or more destinations through various interconnected nodes (graph). Each node in the graph selected by the user represents predefined data transformation functionality, and connections between nodes define the data flow pathway.
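A minimal sketch of the graph-driven data flow the patent describes: each node carries predefined transformation functionality, edges define the pathway, and evaluation pulls data from sources toward destinations. The node names and functions are illustrative only.

def evaluate(node, upstream, fns, memo=None):
    """upstream: node -> list of input nodes; fns: node -> callable."""
    memo = {} if memo is None else memo
    if node not in memo:
        inputs = [evaluate(u, upstream, fns, memo)
                  for u in upstream.get(node, [])]
        memo[node] = fns[node](*inputs)
    return memo[node]

upstream = {"clean": ["source"], "store": ["clean"]}
fns = {
    "source": lambda: [" Ada ", "", "Grace "],               # receive data
    "clean":  lambda rows: [r.strip() for r in rows if r.strip()],
    "store":  lambda rows: rows,                             # stand-in destination
}
print(evaluate("store", upstream, fns))                      # ['Ada', 'Grace']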

Proceedings ArticleDOI
08 Feb 2003
TL;DR: This paper describes an efficient hardware mechanism to dynamically track the data dependence chains of the instructions in the pipeline and uses this design in a new value-based branch prediction design using available register value information (ARVI).
Abstract: To continue to improve processor performance, microarchitects seek to increase the effective instruction-level parallelism (ILP) that can be exploited in applications. A fundamental limit to improving ILP is data dependences among instructions. If data dependence information is available at run-time, there are many ways to use it to improve ILP. Prior published examples include decoupled branch execution architectures and critical instruction detection. In this paper, we describe an efficient hardware mechanism to dynamically track the data dependence chains of the instructions in the pipeline. This information is available on a cycle-by-cycle basis to the microengine for optimizing its performance. We then use this design in a new value-based branch prediction design using available register value information (ARVI). From the use of data dependence information, the ARVI branch predictor has better prediction accuracy than a comparably sized hybrid branch predictor. With ARVI used as the second-level branch predictor, the improved prediction accuracy results in a 12.6% performance improvement on average across the SPEC95 integer benchmark suite.


Journal ArticleDOI
TL;DR: This work describes an architecture that uses a combination of a distributed memory architecture and one or more multithreaded processors to achieve the necessary performance, and presents a programming model for generic network applications that uses software pipelines.

Patent
23 Jul 2003
TL;DR: In this paper, a reconfigurable processor for processing digital logic functions includes a microcontroller, one or more decoders connected to the microcontroller and a plurality of interconnection busses.
Abstract: A reconfigurable processor for processing digital logic functions is described, including a microcontroller, one or more decoders connected to the microcontroller, a plurality of interconnection busses, and a plurality of processing elements. Each processing element connects to one or more other processing elements by local interconnection paths and to a decoder. The plurality of processing elements are arranged in one or more pipeline stages, each including one or more processing elements. A method of dynamically reconfiguring a pipelined processor is also described, including configuring, using a microcontroller, a plurality of pipeline stages each including one or more processing elements; processing data through one or more pipeline stages; reconfiguring, by the microcontroller, one or more pipeline stages to define one or more subsequent pipeline stages; and routing the processed data through the one or more reconfigured pipeline stages. The reconfiguration may take place while data is processed by other pipeline stages.

Proceedings ArticleDOI
25 Aug 2003
TL;DR: The effectiveness of PSU is compared with that of DVS in current and future process generations; evaluation results show that PSU will reduce energy consumption by 27-34% more than DVS in about 10 years.
Abstract: Recent mobile processors are required to exhibit both low energy consumption and high performance. To satisfy these requirements, dynamic voltage scaling (DVS) is currently employed. However, its effectiveness will be limited in the future because the variable supply-voltage range is shrinking. As an alternative, we previously proposed pipeline stage unification (PSU), which unifies multiple pipeline stages without reducing the supply voltage in a power-saving mode. This paper compares the effectiveness of PSU with that of DVS in current and future process generations. Our evaluation results show that PSU will reduce energy consumption by 27-34% more than DVS in about 10 years.

Proceedings ArticleDOI
01 Jan 2003
TL;DR: The logic cell is a complete asynchronous pipeline stage, and the interconnects are entirely delay insensitive, eliminating all timing issues from the place-and-route procedure.
Abstract: We present an architecture for a quasi delay-insensitive asynchronous field-programmable gate array. The logic cell is a complete asynchronous pipeline stage, and the interconnects are entirely delay insensitive, eliminating all timing issues from the place-and-route procedure.

Proceedings ArticleDOI
02 Jun 2003
TL;DR: A method for decomposing a high-level program description of a circuit into a system of concurrent modules that can each be implemented as asynchronous pre-charge half-buffer pipeline stages for the asynchronous R3000 MIPS microprocessor is presented.
Abstract: We present a method for decomposing a high-level program description of a circuit into a system of concurrent modules that can each be implemented as asynchronous pre-charge half-buffer pipeline stages (the circuits used in the asynchronous R3000 MIPS microprocessor). We apply it to designing the instruction fetch of an asynchronous 8051 microcontroller, with promising results. We discuss new clustering algorithms that will improve the performance figures further.