
Showing papers on "Pipeline (computing) published in 2003"


Patent
09 Jun 2003
TL;DR: In this article, a deferred shading graphics pipeline processor and method are provided encompassing numerous substructures, including one or more of deferred shading, a tiled frame buffer, and multiple-stage hidden surface removal processing.
Abstract: A deferred shading graphics pipeline processor and method are provided encompassing numerous substructures. Embodiments of the processor and method may include one or more of deferred shading, a tiled frame buffer, and multiple-stage hidden surface removal processing. In the deferred shading graphics pipeline, hidden surface removal is completed before pixel coloring is done. The pipeline processor comprises a command fetch and decode unit, a geometry unit, a mode extraction unit, a sort unit, a setup unit, a cull unit, a mode injection unit, a fragment unit, a texture unit, a Phong lighting unit, a pixel unit, and a backend unit.

468 citations
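The key ordering in the abstract above, hidden surface removal completing before any pixel coloring, can be illustrated with a toy sketch. This is a hypothetical simplification in Python, not the patent's pipeline: the real design interposes sort, cull, and mode-injection stages that are elided here, and all names are invented.

def deferred_shade(fragments):
    depth = {}    # (x, y) -> nearest z seen so far
    visible = {}  # (x, y) -> winning fragment

    # Stage 1: hidden surface removal (a plain z-test stands in for the
    # patent's multiple-stage culling).
    for f in fragments:  # f = {"x": ..., "y": ..., "z": ..., "material": ...}
        key = (f["x"], f["y"])
        if f["z"] < depth.get(key, float("inf")):
            depth[key] = f["z"]
            visible[key] = f

    # Stage 2: expensive texturing/Phong shading runs once per visible pixel.
    return {key: shade(f) for key, f in visible.items()}

def shade(fragment):
    return fragment["material"]  # stand-in for the texture and Phong units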


Journal ArticleDOI
TL;DR: The eXtreme Processing Platform (XPP) is a new runtime-reconfigurable data processing architecture based on a hierarchical array of coarse-grain, adaptive computing elements and a packet-oriented communication network that is well suited for applications in multimedia, telecommunications, simulation, signal processing, graphics, and similar stream-based application domains.
Abstract: The eXtreme Processing Platform (XPP) is a new runtime-reconfigurable data processing architecture. It is based on a hierarchical array of coarse-grain, adaptive computing elements, and a packet-oriented communication network. The strength of the XPP technology originates from the combination of array processing with unique, powerful run-time reconfiguration mechanisms. Parts of the array can be configured rapidly in parallel while neighboring computing elements are processing data. Reconfiguration is triggered externally or even by special event signals originating within the array, enabling self-reconfiguring designs. The XPP architecture is designed to support different types of parallelism: pipelining, instruction-level, data-flow, and task-level parallelism. This technology is therefore well suited for applications in multimedia, telecommunications, simulation, signal processing (DSP), graphics, and similar stream-based application domains. The anticipated peak performance of the first commercial device running at 150 MHz is estimated to be 57.6 GigaOps/sec, with a peak I/O bandwidth of several GByte/sec. Simulated applications achieve up to 43.5 GigaOps/sec (32-bit fixed point).

387 citations
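The quoted figures imply a degree of parallelism worth making explicit. Assuming the simple relationship peak ops/s = concurrent operations per cycle times clock frequency (an inference, not a statement from the paper), the numbers work out as follows:

# Back-of-envelope check of the quoted XPP peak numbers.
clock_hz = 150e6
peak_ops_per_s = 57.6e9
print(peak_ops_per_s / clock_hz)   # 384.0 -> implies 384 concurrent operations
print(43.5e9 / peak_ops_per_s)     # ~0.76: simulated apps reach ~76% of peak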


Journal ArticleDOI
Byung-Moo Min, P. Kim, F.W. Bowman, D.M. Boisvert, A.J. Aude
TL;DR: The proposed feedback signal polarity inverting (FSPI) technique addresses the drawback of the conventional amplifier sharing technique and helps to reduce power consumption in a 10-bit pipeline.
Abstract: A 10-bit 80-MS/s analog-to-digital converter (ADC) with an area- and power-efficient architecture is described. By sharing an amplifier between two successive pipeline stages, a 10-bit pipeline is realized using just four amplifiers with a separate sample-and-hold block. The proposed feedback signal polarity inverting (FSPI) technique addresses the drawback of the conventional amplifier sharing technique. A wide-swing wide-bandwidth telescopic amplifier and an early comparison technique with a constant delay circuit have been developed to further reduce power consumption. The ADC is implemented in a 0.18-μm dual-gate-oxidation CMOS process technology, achieves 72.8-dBc spurious-free dynamic range, 57.92-dBc signal-to-noise ratio, 9.29 effective number of bits (ENOB) for a 99-MHz input at full sampling rate, and consumes 69 mW from a 3-V supply. The ADC occupies 1.85 mm².

234 citations
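The reported ENOB can be cross-checked against the standard relation ENOB = (SINAD - 1.76 dB) / 6.02 dB. Taking the quoted 57.92 dB figure as an approximation of SINAD gives about 9.33 bits, close to the stated 9.29; the small gap suggests the ENOB was computed from a SINAD value slightly below the quoted SNR.

# ENOB from the standard relation ENOB = (SINAD - 1.76) / 6.02.
snr_db = 57.92
print(round((snr_db - 1.76) / 6.02, 2))   # 9.33, vs. the reported 9.29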


Proceedings ArticleDOI
27 Sep 2003
TL;DR: A statistically driven algorithm for forming clusters from which simulation points are chosen is presented, and algorithms for picking simulation points earlier in a program's execution are examined, in order to significantly reduce fast-forwarding time during simulation.
Abstract: Modern architecture research relies heavily on detailed pipeline simulation. Simulating the full execution of an industry-standard benchmark can take weeks to months to complete. To address this issue we have recently proposed using Simulation Points (found by examining only basic block execution frequency profiles) to increase the efficiency and accuracy of simulation. Simulation points are a small set of execution samples that, when combined, represent the complete execution of the program. In this paper we present a statistically driven algorithm for forming clusters from which simulation points are chosen, and examine algorithms for picking simulation points earlier in a program's execution, in order to significantly reduce fast-forwarding time during simulation. In addition, we show that simulation points can be used independent of the underlying architecture. The points are generated once for a program/input pair by examining only the code executed. We show the points accurately track hardware metrics (e.g., performance and cache miss rates) between different architecture configurations. They can therefore be used across different architecture configurations to allow a designer to make accurate trade-off decisions between different configurations.

222 citations
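A minimal sketch of the clustering step: each execution interval is summarized by a basic-block frequency vector, the vectors are clustered, and the member nearest each centroid becomes a simulation point weighted by cluster size. This sketch uses scikit-learn's KMeans for brevity and omits the paper's statistical selection of the cluster count and its early-point heuristics.

import numpy as np
from sklearn.cluster import KMeans

def pick_simulation_points(bbvs, k):
    """bbvs: (n_intervals, n_basic_blocks) basic-block frequency matrix."""
    bbvs = bbvs / bbvs.sum(axis=1, keepdims=True)    # normalize each interval
    km = KMeans(n_clusters=k, n_init=10).fit(bbvs)
    points = []
    for c in range(k):
        members = np.flatnonzero(km.labels_ == c)
        dist = np.linalg.norm(bbvs[members] - km.cluster_centers_[c], axis=1)
        # representative interval, plus a weight proportional to cluster size
        points.append((int(members[np.argmin(dist)]), len(members) / len(bbvs)))
    return points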


Journal ArticleDOI
TL;DR: This work presents detailed analysis and dedicated hardware architecture of the block-coding engine to execute the EBCOT algorithm efficiently and shows that about 60% of the processing time is reduced compared with sample-based straightforward implementation.
Abstract: Embedded block coding with optimized truncation (EBCOT) is the most important technology in the latest image-coding standard, JPEG 2000. The hardware design of the block-coding engine in EBCOT is critical because the operations are bit-level processing and occupy more than half of the computation time of the whole compression process. A general-purpose processor (GPP) is, therefore, very inefficient at processing these operations. We present a detailed analysis and a dedicated hardware architecture of the block-coding engine to execute the EBCOT algorithm efficiently. The context formation process in EBCOT is analyzed to gain insight into the characteristics of the operation. A column-based architecture and two speed-up methods, sample skipping (SS) and group-of-column skipping (GOCS), are then proposed for context generation. As for the arithmetic encoder design, pipeline and look-ahead techniques are used to speed up the processing. It is shown that about 60% of the processing time is saved compared with a sample-based straightforward implementation. A test chip has been designed, and simulation results show that it can process a 4.6-million-pixel image within 1 s, corresponding to a 2400 × 1800 image, or a CIF (352 × 288) 4:2:0 video sequence at 30 frames per second, at a 50-MHz working frequency.

220 citations
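The two speed-up methods amount to skipping work the scan order would otherwise visit one sample at a time. A hedged sketch of that intuition follows; needs_coding and code_sample are hypothetical stand-ins, and the real context-formation logic is considerably more involved.

def code_sample(sample):
    pass  # stand-in for context formation plus arithmetic coding

def scan_groups(groups, needs_coding):
    """groups: lists of columns; each column holds 4 vertical samples."""
    for group in groups:
        if not any(needs_coding(s) for col in group for s in col):
            continue                     # GOCS: skip a whole idle column group
        for col in group:
            for s in col:
                if needs_coding(s):      # SS: touch only the samples that
                    code_sample(s)       # participate in this coding pass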


Proceedings ArticleDOI
08 Feb 2003
TL;DR: The unique properties of scalar operand networks are discussed, alternative ways of implementing them are examined, and the implementation of one such network is described in detail in the Raw microprocessor.
Abstract: The bypass paths and multiported register files in microprocessors serve as an implicit interconnect to communicate operand values among pipeline stages and multiple ALUs. Previous superscalar designs implemented this interconnect using centralized structures that do not scale with increasing ILP demands. In search of scalability, recent microprocessor designs in industry and academia exhibit a trend towards distributed resources such as partitioned register files, banked caches, multiple independent compute pipelines, and even multiple program counters. Some of these partitioned microprocessor designs have begun to implement bypassing and operand transport using point-to-point interconnects rather than centralized networks. We call interconnects optimized for scalar data transport, whether centralized or distributed, scalar operand networks. Although these networks share many of the challenges of multiprocessor networks such as scalability and deadlock avoidance, they have many unique requirements, including ultra-low latencies (a few cycles versus tens of cycles) and ultra-fast operation-operand matching. This paper discusses the unique properties of scalar operand networks, examines alternative ways of implementing them, and describes in detail the implementation of one such network in the Raw microprocessor. The paper analyzes the performance of these networks for ILP workloads and the sensitivity of overall ILP performance to network properties.

205 citations


Journal ArticleDOI
TL;DR: This paper presents a high-speed parallel Reed-Solomon (RS) decoder architecture using the modified Euclidean algorithm for high-speed multigigabit-per-second fiber-optic systems, and suggests that a parallel RS decoder that can keep up with optical transmission rates could be implemented.
Abstract: This paper presents a high-speed parallel Reed-Solomon (RS) (255,239) decoder architecture using the modified Euclidean algorithm for high-speed multigigabit-per-second fiber-optic systems. Pipelining and parallelizing allow inputs to be received at very high fiber-optic rates and outputs to be delivered at correspondingly high rates with minimum delay. Parallel processing enables throughputs of 10 Gb/s and beyond, since the maximum achievable clock frequency is generally bounded by the critical path of the modified Euclidean algorithm block. The parallel RS decoders have been designed and implemented in a 0.13-μm CMOS standard-cell technology with a supply voltage of 1.1 V. It is suggested that a parallel RS decoder that can keep up with optical transmission rates, i.e., 10 Gb/s and beyond, could be implemented. The proposed four-channel parallel RS decoder operates at a clock frequency of 770 MHz and has a data processing rate of 26.6 Gb/s.

164 citations


Journal ArticleDOI
TL;DR: The GRAPE-6 system is a massively parallel special-purpose computer for astrophysical N-body simulations; it succeeds GRAPE-4, which achieved a theoretical peak speed of 1.08 Tflops, and reaches a peak speed of 64 Tflops.
Abstract: In this paper, we describe the architecture and performance of the GRAPE-6 system, a massively parallel special-purpose computer for astrophysical N-body simulations. GRAPE-6 is the successor of GRAPE-4, which was completed in 1995 and achieved a theoretical peak speed of 1.08 Tflops. As was the case with GRAPE-4, the primary application of GRAPE-6 is the simulation of collisional systems, though it can also be used for collisionless systems. The main differences between GRAPE-4 and GRAPE-6 are: (a) the processor chip of GRAPE-6 integrates six force-calculation pipelines, compared to the one pipeline of GRAPE-4 (which needed three clock cycles to calculate one interaction); (b) the clock speed is increased from 32 to 90 MHz; and (c) the total number of processor chips is increased from 1728 to 2048. These improvements result in a peak speed of 64 Tflops. We also discuss the design of the successor of GRAPE-6.

153 citations
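Both quoted peaks are consistent with the conventional GRAPE accounting of roughly 57 floating-point operations per pairwise force evaluation. That count is an assumption here (the paper defines its own operation count), but the arithmetic lines up for both machines:

OPS_PER_INTERACTION = 57   # conventional GRAPE flop count per interaction

grape4 = 1728 * (32e6 / 3) * OPS_PER_INTERACTION  # 1 pipeline, 3 cycles each
grape6 = 2048 * 6 * 90e6 * OPS_PER_INTERACTION    # 6 pipelines, 1 cycle each
print(f"GRAPE-4 ~ {grape4 / 1e12:.2f} Tflops")    # ~1.05, vs. quoted 1.08
print(f"GRAPE-6 ~ {grape6 / 1e12:.1f} Tflops")    # ~63.0, vs. quoted 64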


Journal ArticleDOI
TL;DR: The PowerTimer toolset is useful in assessing the typical and worst-case power swings that occur between successive cycle windows in a given workload execution and helps pinpoint potential inductive noise problems on the voltage rail that can be addressed by designing an appropriate package or by suitably tuning the dynamic power management controls within the processor.
Abstract: The PowerTimer toolset has been developed for use in early-stage, microarchitecture-level power-performance analysis of microprocessors. The key component of the toolset is a parameterized set of energy functions that can be used in conjunction with any given cycle-accurate microarchitectural simulator. The energy functions model the power consumption of primitive and hierarchically composed building blocks which are used in microarchitecture-level performance models. Examples of structures modeled are pipeline stage latches, queues, buffers and component read/write multiplexers, local clock buffers, register files, and cache array macros. The energy functions can be derived using purely analytical equations that are driven by organizational, circuit, and technology parameters or behavioral equations that are derived from empirical, circuit-level simulation experiments. After describing the modeling methodology, we present analysis results in the context of a current-generation superscalar processor simulator to illustrate the use and effectiveness of such early-stage models. In addition to average power and performance tradeoff analysis, PowerTimer is useful in assessing the typical and worst-case power (or current) swings that occur between successive cycle windows in a given workload execution. Such a characterization of workloads at the early stage of microarchitecture definition helps pinpoint potential inductive noise problems on the voltage rail that can be addressed by designing an appropriate package or by suitably tuning the dynamic power management controls within the processor.

123 citations
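A sketch of what a parameterized energy function looks like when hooked to a cycle-accurate simulator: the structure (per-access dynamic energy plus always-on leakage, driven by organizational parameters) follows the abstract, but the function name and all coefficients below are invented for illustration.

def queue_energy_this_cycle(entries, width_bits, reads, writes,
                            e_read_bit=0.4e-12, e_write_bit=0.6e-12,
                            leak_bit=5e-15):
    # Dynamic energy scales with this cycle's access counts...
    dynamic = (reads * e_read_bit + writes * e_write_bit) * width_bits
    # ...while leakage accrues every cycle regardless of activity.
    leakage = entries * width_bits * leak_bit
    return dynamic + leakage   # joules consumed by this structure this cycle

# Typical hookup, called once per simulated cycle:
# energy += queue_energy_this_cycle(32, 64, reads=issued, writes=dispatched)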


Proceedings ArticleDOI
Hai Li, Swarup Bhunia, Yi Chen, T. N. Vijaykumar, Kaushik Roy
08 Feb 2003
TL;DR: Deterministic clock gating (DCG) is introduced based on the key observation that for many of the stages in a modern pipeline, a circuit block's usage in a specific cycle in the near future is deterministically known a few cycles ahead of time.
Abstract: With the scaling of technology and the need for higher performance and more functionality, power dissipation is becoming a major bottleneck for microprocessor designs. Pipeline balancing (PLB), a previous technique, is essentially a methodology to clock-gate unused components whenever a program's instruction-level parallelism is predicted to be low. However, no nonpredictive methodologies are available in the literature for efficient clock gating. This paper introduces deterministic clock gating (DCG) based on the key observation that for many of the stages in a modern pipeline, a circuit block's usage in a specific cycle in the near future is deterministically known a few cycles ahead of time. Our experiments show an average of 19.9% reduction in processor power with virtually no performance loss for an 8-issue, out-of-order superscalar processor by applying DCG to execution units, pipeline latches, D-Cache wordline decoders, and result bus drivers. In contrast, PLB achieves 9.9% average power savings at 2.9% performance loss.

115 citations
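The "deterministic" part can be shown in a few lines: at decode time the instruction's kind fixes which blocks it will need in later stages, so their clock enables can be set exactly those cycles ahead, with no prediction. The stage latency and block names below are hypothetical.

DECODE_TO_EXECUTE = 3   # assumed decode -> execute latency in cycles

def schedule_enables(insn, cycle, enable_table):
    """enable_table[c] = set of blocks that must be clocked in cycle c."""
    c = cycle + DECODE_TO_EXECUTE
    if insn["uses_fpu"]:
        enable_table.setdefault(c, set()).add("FPU")
    if insn["is_mem_op"]:
        enable_table.setdefault(c, set()).add("DCACHE_WORDLINE_DECODER")
    # Any block absent from enable_table[c] remains clock-gated in cycle c.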


Proceedings ArticleDOI
22 Jun 2003
TL;DR: This work's approach to WCET prediction was implemented for the Motorola ColdFire 5307 and includes a static prediction of ∗ This work was partly supported by the RTD project IST-1999-20527 “DAEDALUS” of the European FP5 program.
Abstract: Hard real-time avionics systems like flight control software are expected to always react in time. Consequently, it is essential for the timing validation of the software that the worst-case execution time (WCET) of all tasks on a given hardware configuration be known. Modern processor components like caches, pipelines, and branch prediction complicate the determination of the WCET considerably since the execution time of a single instruction may depend on the execution history. The safe, yet overly pessimistic assumption of no cache hits, no overlapping executions in the processor pipeline, and constantly mispredicted branches results in a serious overestimation of the WCET. Our approach to WCET prediction was implemented for the Motorola ColdFire 5307. It includes a static prediction of ∗ This work was partly supported by the RTD project IST-1999-20527 “DAEDALUS” of the European FP5 program. cache and pipeline behavior, producing much tighter upper bounds for the execution times. The WCET analysis tool works on real applications. It is safe in the sense that the computed WCET is always an upper bound of the real WCET. It requires much less effort, while producing more precise results than conventional measurement-based methods.

Proceedings ArticleDOI
Allan M. Hartstein, Thomas R. Puzak
03 Dec 2003
TL;DR: The theory shows that the more important power is to the metric, the shorter the optimum pipeline length that results; as dynamic power grows, the optimal design point shifts to shorter pipelines, while clock gating pushes the optimum to deeper pipelines.
Abstract: The impact of pipeline length on both the power and performance of a microprocessor is explored both theoretically and by simulation. A theory is presented for a wide range of power/performance metrics, BIPS^m/W. The theory shows that the more important power is to the metric, the shorter the optimum pipeline length that results. For typical parameters neither BIPS/W nor BIPS^2/W yields an optimum, i.e., a non-pipelined design is optimal. For BIPS^3/W the optimum, averaged over all 55 workloads studied, occurs at a 22.5 FO4 design point, a 7-stage pipeline, but this value is highly dependent on the assumed growth in latch count with pipeline depth. As dynamic power grows, the optimal design point shifts to shorter pipelines. Clock gating pushes the optimum to deeper pipelines. Surprisingly, as leakage power grows, the optimum is also found to shift to deeper pipelines. The optimum pipeline depth varies for different classes of workloads: SPEC95 and SPEC2000 integer applications, traditional (legacy) database and on-line transaction processing applications, modern (e.g., Web) applications, and floating-point applications.
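The shape of the result can be reproduced with a toy model (emphatically not the paper's): each stage adds latch overhead to the cycle time, hazards cost flushes proportional to depth, and power grows with latch count. Sweeping BIPS^m/W shows the optimum moving deeper as m grows, i.e., shallower the more power matters. With these toy parameters every metric has an interior optimum, whereas the paper's parameters push the BIPS/W and BIPS^2/W optima all the way to the unpipelined point.

import numpy as np

T_LOGIC, T_LATCH = 22.0, 1.5   # FO4 of total logic, per-stage latch (assumed)
HAZARD, P_LATCH = 0.05, 0.1    # flush rate per stage, latch power per stage

def metric(n, m):
    cycle = T_LATCH + T_LOGIC / n          # cycle time shrinks with depth n
    cpi = 1.0 + HAZARD * n                 # hazard penalty grows with depth
    bips = 1.0 / (cycle * cpi)
    return bips**m / (1.0 + P_LATCH * n)   # BIPS^m per watt, arbitrary units

depths = np.arange(1, 40)
for m in (1, 2, 3):
    best = depths[np.argmax([metric(int(n), m) for n in depths])]
    print(f"BIPS^{m}/W optimum at ~{best} stages")   # deepens as m grows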

Journal ArticleDOI
TL;DR: Since the proposed algorithm can achieve a higher processing rate and better efficiency than the conventional algorithm, it is very suitable for OFDM/DMT applications such as WLAN, DAB/DVB, and ADSL/VDSL systems.
Abstract: In this paper, we propose a new efficient FFT algorithm for OFDM/DMT applications and present its pipeline implementation results. Since the proposed algorithm is based on the radix-4 butterfly unit, the processing rate can be twice as fast as that based on the radix-2^3 algorithm. Its implementation is also more area-efficient than an implementation of the conventional radix-4 algorithm, because the number of nontrivial multipliers is reduced, as when using the radix-2^3 algorithm. In order to compare the proposed algorithm with the conventional radix-4 algorithm, a 64-point MDC pipelined FFT processor based on the proposed algorithm was implemented. After logic synthesis using 0.35-μm CMOS technology, the logic gate count for the processor with the proposed algorithm is only about 70% of that for the processor with the conventional radix-4 algorithm. Since the proposed algorithm achieves a higher processing rate and better efficiency than the conventional algorithm, it is very suitable for OFDM/DMT applications such as WLAN, DAB/DVB, and ADSL/VDSL systems.

Patent
09 Dec 2003
TL;DR: In this paper, the authors present a system and method of executing instructions within a counterflow pipeline processor, where an instruction and one or more operands issue into the instruction pipeline and a determination is made at one of the execution units whether the instruction is ready for execution.
Abstract: A system and method of executing instructions within a counterflow pipeline processor. The counterflow pipeline processor includes an instruction pipeline, a data pipeline, a reorder buffer and a plurality of execution units. An instruction and one or more operands issue into the instruction pipeline, and a determination is made at one of the execution units whether the instruction is ready for execution. If so, the operands are loaded into the execution unit and the instruction executes. The execution unit is monitored for a result and, when the result arrives, it is stored into the result pipeline. If the instruction reaches the end of the pipeline without executing, it wraps around and is sent down the instruction pipeline again.

Proceedings ArticleDOI
27 Oct 2003
TL;DR: A new active range scanning technique suitable for moving or deformable surfaces, in that 3D data are acquired from a single image, using a 'one-shot' system that poses a weak temporal continuity constraint.
Abstract: We present a new active range scanning technique suitable for moving or deformable surfaces. It is a 'one-shot' system, in that 3D data are acquired from a single image. The projection pattern consists of equidistant black and white stripes combined with a limited number of colored, transversal stripes which aid in their identification. Instead of using a generic, static code that is supposed to work under all circumstances, a 'self-adaptive code' is used. Two modes of adaptation are provided. The first mode is off-line, and generates a robust identification code for the current projector-camera configuration. This configuration is supposed to remain fixed during the remainder of the 3D capturing session. The second mode occurs on-line. By introducing a feedback loop from the reconstruction output to the pattern generation, the pattern can be adapted so as to keep its decoding well-conditioned. Within the considered family of patterns, the parameters are optimized on the fly based on the current content of the scene. In effect, the scene content of the current frame influences the pattern projected during the next. This poses a weak temporal continuity constraint; only a very fast change of scene invalidates the assumption. The higher the throughput of the reconstruction pipeline, the less serious this constraint becomes. Our current pipeline runs at approximately 20 Hz. Our prototype uses only 'off the shelf' hardware.

Journal ArticleDOI
TL;DR: This paper presents an efficient VLSI architecture of the pipeline fast Fourier transform (FFT) processor based on radix-4 decimation-in-time algorithm with the use of digit-serial arithmetic units by combining both the feedforward and feedback commutator schemes.
Abstract: This paper presents an efficient VLSI architecture of the pipeline fast Fourier transform (FFT) processor based on radix-4 decimation-in-time algorithm with the use of digit-serial arithmetic units. By combining both the feedforward and feedback commutator schemes, the proposed architecture can not only achieve nearly 100% hardware utilization, but also require much less memory compared with the previous digit-serial FFT processors. Furthermore, in FFT processors, several modules of ROM are required for the storage of twiddle factors. By exploiting the redundancy of the factors, the overall ROM size can be effectively reduced by a factor of 2.
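The factor-of-2 ROM reduction follows from the half-period symmetry of the twiddle factors, W_N^(k+N/2) = -W_N^k, so only the first N/2 entries need storing. The paper's exact redundancy exploitation may differ, but the saving matches; here is a small self-checking sketch:

import cmath

N = 64
rom = [cmath.exp(-2j * cmath.pi * k / N) for k in range(N // 2)]  # half-size ROM

def twiddle(k):
    k %= N
    return rom[k] if k < N // 2 else -rom[k - N // 2]  # sign flip recovers rest

assert all(abs(twiddle(k) - cmath.exp(-2j * cmath.pi * k / N)) < 1e-12
           for k in range(N))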

Patent
31 Oct 2003
TL;DR: In this article, a pipeline accelerator includes a bus and a plurality of pipeline units, each unit coupled to the bus and including at least one respective hardwired-pipeline circuit.
Abstract: A pipeline accelerator includes a bus and a plurality of pipeline units, each unit coupled to the bus and including at least one respective hardwired-pipeline circuit. By including a plurality of pipeline units in the pipeline accelerator, one can increase the accelerator's data-processing performance as compared to a single-pipeline-unit accelerator. Furthermore, by designing the pipeline units so that they communicate via a common bus, one can alter the number of pipeline units, and thus alter the configuration and functionality of the accelerator, by merely coupling or uncoupling pipeline units to or from the bus. This eliminates the need to design or redesign the pipeline-unit interfaces each time one alters one of the pipeline units or alters the number of pipeline units within the accelerator.

Proceedings ArticleDOI
01 May 2003
TL;DR: This work reconciles the complexity/safety trade-off by decoupling worst-case timing analysis from the processor implementation through a virtual simple architecture (VISA), and shows that a VISA-compliant complex pipeline consumes 43-61% less power than an explicitly safe pipeline.
Abstract: Meeting deadlines is a key requirement in safe real-time systems. Worst-case execution times (WCET) of tasks are needed for safe planning. Contemporary worst-case timing analysis tools can safely and tightly bound execution time on in-order single-issue pipelines with caches and static branch prediction. However, this simple pipeline appears to be a complexity limit, due to the need for analyzability. This excludes a whole class of high-performance processors from many embedded systems. We reconcile the complexity/safety trade-off by decoupling worst-case timing analysis from the processor implementation, through a virtual simple architecture (VISA). A VISA is the timing specification of a hypothetical simple pipeline and is the basis for worst-case timing analysis. However, the underlying microarchitecture can be arbitrarily complex. A task is divided into multiple sub-tasks which provide a means to gauge progress on the complex pipeline. Each sub-task is assigned an interim deadline, or checkpoint, based on the latest allowable completion time of the sub-task on the hypothetical simple pipeline. If no checkpoints are missed, then the complex pipeline is as timely as the safe pipeline. If a checkpoint is missed, the pipeline switches to a simple mode of operation that directly implements the VISA, so that the execution time of unfinished sub-tasks is safely bounded. The significance of our approach is that we circumvent worst-case timing analysis of the complex pipeline by dynamically confirming that its behavior is bounded by worst-case timing analysis of a simpler proxy pipeline. The benefit of using a high-performance processor is that tasks finish much sooner than they would have on an explicitly safe processor. The new slack in the schedule can be exploited for higher throughput or lower power. With the VISA approach, an arbitrarily complex SMT processor can safely run non-real-time tasks at the same time as a real-time task. Alternatively, frequency/voltage can be safely lowered to take up the slack. We explore the latter application and show that a VISA-compliant complex pipeline consumes 43-61% less power than an explicitly safe pipeline.
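The checkpointing discipline can be sketched in a few lines. Sub-task checkpoints come from WCET analysis of the simple proxy pipeline; the complex pipeline runs until one is missed, after which execution proceeds in the explicitly safe simple mode. In the real design the switch happens immediately, mid-sub-task; this hypothetical sketch simplifies to sub-task granularity.

def run_task(subtasks, start_time, checkpoint_of, run_complex, run_simple):
    now, safe_mode = start_time, False
    for st in subtasks:
        if safe_mode:
            now = run_simple(st, now)      # bounded by the VISA's WCET
        else:
            now = run_complex(st, now)     # fast, but not analyzable
            if now > checkpoint_of(st):    # interim deadline missed:
                safe_mode = True           # fall back to the safe mode for
    return now                             # everything that remains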

Proceedings ArticleDOI
08 Feb 2003
TL;DR: This work focuses on reducing the power dissipated by mis-speculated instructions by proposing selective throttling as an effective way of triggering different power-aware techniques (fetch throttling, decode throttling or disabling the selection logic).
Abstract: With the constant advances in technology that lead to the increasing of the transistor count and processor frequency, power dissipation is becoming one of the major issues in high-performance processors. These processors increase their clock frequency by lengthening the pipeline, which puts more pressure on the branch prediction engine since branches take longer to be resolved. Branch mispredictions are responsible for around 28% of the power dissipated by a typical processor due to the useless activities performed by instructions that are squashed. This work focuses on reducing the power dissipated by mis-speculated instructions. We propose selective throttling as an effective way of triggering different power-aware techniques (fetch throttling, decode throttling or disabling the selection logic). The particular set of techniques applied to each branch is dynamically chosen depending on the branch prediction confidence level. For branches with a low confidence on the prediction, the most aggressive throttling mechanism is used whereas high confidence branch predictions trigger the least aggressive techniques. Results show that combining fetch bandwidth reduction along with select logic disabling provides the best performance both in terms of energy reduction and energy-delay improvement (14% and 9% respectively for 14 stages, and 17% and 12% respectively for 28 stages).
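A sketch of the selection policy the abstract describes, mapping prediction confidence to progressively more aggressive throttling. The thresholds and exact technique sets below are illustrative, not the paper's tuned values.

def throttle_actions(confidence):
    """confidence in [0, 1] from the branch confidence estimator."""
    if confidence < 0.3:    # low confidence: most aggressive combination
        return {"fetch_throttle", "decode_throttle", "disable_select_logic"}
    if confidence < 0.7:
        return {"fetch_throttle", "decode_throttle"}
    if confidence < 0.9:    # high confidence: least aggressive technique
        return {"fetch_throttle"}
    return set()            # near-certain prediction: no throttling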

Patent
18 Dec 2003
TL;DR: In this article, a BIST controller supporting a "resume" mode (225) in addition to a "pass/fail' mode (220) may be used to compensate for timing latencies introduced by pipeline staging in an embedded memory array.
Abstract: A BIST (Built-In Self Test) controller supporting a 'resume' mode (225) in addition to a 'pass/fail' mode (220) may be used to compensate for timing latencies introduced by pipeline staging in an embedded memory array (205) .

Patent
31 Oct 2003
TL;DR: In this paper, the authors propose a peer-vector machine with a hardwired pipeline accelerator coupled to the processor, where the buffer and data-transfer objects facilitate the transfer of data between the application and the accelerator.
Abstract: A computing machine includes a first buffer and a processor coupled to the buffer. The processor executes an application, a first data-transfer object, and a second data-transfer object, publishes data under the control of the application, loads the published data into the buffer under the control of the first data-transfer object, and retrieves the published data from the buffer under the control of the second data-transfer object. Alternatively, the processor retrieves data and loads the retrieved data into the buffer under the control of the first data-transfer object, unloads the data from the buffer under the control of the second data-transfer object, and processes the unloaded data under the control of the application. Where the computing machine is a peer-vector machine that includes a hardwired pipeline accelerator coupled to the processor, the buffer and data-transfer objects facilitate the transfer of data between the application and the accelerator.

Proceedings ArticleDOI
03 Dec 2003
TL;DR: A microarchitectural technique referred to as two-pass pipelining is proposed, wherein the program executes on two in-order back-end pipelines coupled by a queue; the design is both achievable and a good use of transistor resources, and results indicate that it can deliver significant speedups for in-order processor designs.
Abstract: Accommodating the uncertain latency of load instructions is one of the most vexing problems in in-order microarchitecture design and compiler development. Compilers can generate schedules with a high degree of instruction-level parallelism but cannot effectively accommodate unanticipated latencies; incorporating traditional out-of-order execution into the microarchitecture hides some of this latency but redundantly performs work done by the compiler and adds additional pipeline stages. Although effective techniques, such as prefetching and threading, have been proposed to deal with anticipable, long latency misses, the shorter, more diffuse stalls due to difficult-to-anticipate, first- or second-level misses are less easily hidden on in-order architectures. This paper addresses this problem by proposing a microarchitectural technique, referred to as two-pass pipelining, wherein the program executes on two in-order back-end pipelines coupled by a queue. The "advance" pipeline executes instructions greedily, without stalling on unanticipated latency dependences (executing independent instructions while otherwise blocking instructions are deferred). The "backup" pipeline allows concurrent resolution of instructions that were deferred in the other pipeline, resulting in the absorption of shorter misses and the overlap of longer ones. This paper argues that this design is both achievable and a good use of transistor resources and shows results indicating that it can deliver significant speedups for in-order processor designs.
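A toy rendering of the two-pass idea: the advance pass never stalls, deferring any instruction whose operands are not ready, while the backup pass drains the deferred queue in order once the misses resolve. The Insn objects, ready() test, and execute() callback are hypothetical; the real design couples two hardware pipelines by a queue rather than running two software passes.

from collections import deque

def two_pass(instructions, ready, execute):
    deferred = deque()
    for insn in instructions:                 # "advance" pipeline: greedy,
        if all(ready(src) for src in insn.sources):   # never stalls
            execute(insn)
        else:
            deferred.append(insn)             # defer on unanticipated latency
    while deferred:                           # "backup" pipeline: in-order
        execute(deferred.popleft())           # drain; shorter misses absorbed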

Patent
18 Mar 2003
TL;DR: In this article, a data transformation pipeline enables a user to develop complex end-to-end data transformation functionality by graphically describing and representing, via a GUI, a desired data flow from one or multiple sources to one or more destinations through various interconnected nodes (graph).
Abstract: The data transformation system in one embodiment, comprises a capability to receive data, a data destination and a capability to store transformed data, and a data transformation pipeline that constructs complex end-to-end data transformation functionality by pipelining data flowing from one or more sources to one or more destinations through various interconnected nodes for transforming the data as it flows. Each component in the pipeline possesses predefined data transformation functionality, and the logical connections between components define the data flow pathway in an operational sense. The data transformation pipeline enables a user to develop complex end-to-end data transformation functionality by graphically describing and representing, via a GUI, a desired data flow from one or more sources to one or more destinations through various interconnected nodes (graph). Each node in the graph selected by the user represents predefined data transformation functionality, and connections between nodes define the data flow pathway.
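A minimal sketch of the graph-driven data flow the patent describes: each node carries predefined transformation functionality, edges define the pathway, and evaluation pulls data from sources toward destinations. The node names and functions are illustrative only.

def evaluate(node, upstream, fns, memo=None):
    """upstream: node -> list of input nodes; fns: node -> callable."""
    memo = {} if memo is None else memo
    if node not in memo:
        inputs = [evaluate(u, upstream, fns, memo)
                  for u in upstream.get(node, [])]
        memo[node] = fns[node](*inputs)
    return memo[node]

upstream = {"clean": ["source"], "store": ["clean"]}
fns = {
    "source": lambda: [" Ada ", "", "Grace "],               # receive data
    "clean":  lambda rows: [r.strip() for r in rows if r.strip()],
    "store":  lambda rows: rows,                             # stand-in destination
}
print(evaluate("store", upstream, fns))                      # ['Ada', 'Grace']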

Proceedings ArticleDOI
08 Feb 2003
TL;DR: This paper describes an efficient hardware mechanism to dynamically track the data dependence chains of the instructions in the pipeline and uses this design in a new value-based branch prediction design using available register value information (ARVI).
Abstract: To continue to improve processor performance, microarchitects seek to increase the effective instruction-level parallelism (ILP) that can be exploited in applications. A fundamental limit to improving ILP is data dependences among instructions. If data dependence information is available at run-time, there are many ways to use it to improve ILP. Prior published examples include decoupled branch execution architectures and critical instruction detection. In this paper, we describe an efficient hardware mechanism to dynamically track the data dependence chains of the instructions in the pipeline. This information is available on a cycle-by-cycle basis to the microengine for optimizing its performance. We then use this design in a new value-based branch prediction design using available register value information (ARVI). From the use of data dependence information, the ARVI branch predictor has better prediction accuracy than a comparably sized hybrid branch predictor. With ARVI used as the second-level branch predictor, the improved prediction accuracy results in a 12.6% performance improvement on average across the SPEC95 integer benchmark suite.


Journal ArticleDOI
TL;DR: This work describes an architecture that uses a combination of a distributed memory architecture and one or more multithreaded processors to achieve the necessary performance, and presents a programming model for generic network applications that uses software pipelines.

Patent
23 Jul 2003
TL;DR: In this paper, a reconfigurable processor for processing digital logic functions includes a microcontroller, one or more decoders connected to the microcontroller and a plurality of interconnection busses.
Abstract: A reconfigurable processor for processing digital logic functions is described, including a microcontroller, one or more decoders connected to the microcontroller, a plurality of interconnection busses, and a plurality of processing elements. Each processing element connects to one or more other processing elements by local interconnection paths and to a decoder. The plurality of processing elements are arranged in one or more pipeline stages, each including one or more processing elements. A method of dynamically reconfiguring a pipelined processor is also described, including configuring, using a microcontroller, a plurality of pipeline stages each including one or more processing elements; processing data through one or more pipeline stages; reconfiguring, by the microcontroller, one or more pipeline stages to define one or more subsequent pipeline stages; and routing the processed data through the one or more reconfigured pipeline stages. The reconfiguration may take place while data is processed by other pipeline stages.

Proceedings ArticleDOI
25 Aug 2003
TL;DR: The effectiveness of PSU is compared with that of DVS in current and future process generations; evaluation results show that PSU will reduce energy consumption by 27-34% more than DVS in about 10 years.
Abstract: Recent mobile processors are required to exhibit both low energy consumption and high performance. To satisfy these requirements, dynamic voltage scaling (DVS) is currently employed. However, its effectiveness will be limited in the future because the variable supply-voltage range is shrinking. As an alternative, we previously proposed pipeline stage unification (PSU), which unifies multiple pipeline stages without reducing the supply voltage in a power-saving mode. This paper compares the effectiveness of PSU with that of DVS in current and future process generations. Our evaluation results show that PSU will reduce energy consumption by 27-34% more than DVS in about 10 years.

Proceedings ArticleDOI
01 Jan 2003
TL;DR: The logic cell is a complete asynchronous pipeline stage, and the interconnects are entirely delay insensitive, eliminating all timing issues from the place-and-route procedure.
Abstract: We present an architecture for a quasi delay-insensitive asynchronous field-programmable gate array. The logic cell is a complete asynchronous pipeline stage, and the interconnects are entirely delay insensitive, eliminating all timing issues from the place-and-route procedure.

Proceedings ArticleDOI
02 Jun 2003
TL;DR: A method for decomposing a high-level program description of a circuit into a system of concurrent modules that can each be implemented as asynchronous pre-charge half-buffer pipeline stages for the asynchronous R3000 MIPS microprocessor is presented.
Abstract: We present a method for decomposing a high-level program description of a circuit into a system of concurrent modules that can each be implemented as asynchronous pre-charge half-buffer pipeline stages (the circuits used in the asynchronous R3000 MIPS microprocessor). We apply it to designing the instruction fetch of an asynchronous 8051 microcontroller, with promising results. We discuss new clustering algorithms that will improve the performance figures further.