Showing papers on "Pipeline (computing)" published in 1999


Journal ArticleDOI
TL;DR: This paper describes an approach for bounding the worst- and best-case performance of large code segments on machines that exploit both pipelining and instruction caching, and indicates that the timing analyzer efficiently produces tight predictions of worst- and best-case performance for pipelining and instruction caching.
Abstract: Predicting the execution time of code segments in real-time systems is challenging. Most recently designed machines contain pipelines and caches. Pipeline hazards may result in multicycle delays. Instruction or data memory references may not be found in cache and these misses typically require several cycles to resolve. Whether an instruction will stall due to a pipeline hazard or a cache miss depends on the dynamic sequence of previous instructions executed and memory references performed. Furthermore, these penalties are not independent since delays due to pipeline stalls and cache miss penalties may overlap. This paper describes an approach for bounding the worst and best case performance of large code segments on machines that exploit both pipelining and instruction caching. First, a method is used to analyze a program's control flow to statically categorize the caching behavior of each instruction. Next, these categorizations are used in the pipeline analysis of sequences of instructions representing paths within the program. A timing analyzer uses the pipeline path analysis to estimate the worst and best-case execution performance of each loop and function in the program. Finally, a graphical user interface is invoked that allows a user to request timing predictions on portions of the program. The results indicate that the timing analyzer efficiently produces tight predictions of worst and best-case performance for pipelining and instruction caching.
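
To make the two-phase idea above concrete, here is a minimal sketch of bounding a loop's execution time from static cache categorizations ("always-hit", "always-miss", "first-miss"); the categories come from the abstract, but the penalty values, function names, and simplified cost model are illustrative assumptions, not the authors' analyzer.

```python
# Sketch of the two-phase approach described above: categorize each
# instruction's caching behavior statically, then bound a path's time by
# summing per-instruction pipeline costs plus miss penalties.
# All names and numbers are illustrative, not the authors' actual tool.

MISS_PENALTY = 9   # assumed cycles to service an instruction-cache miss
BASE_CYCLES  = 1   # assumed pipelined cost of one instruction

def path_bounds(path, iterations):
    """path: list of static categorizations for one loop path."""
    worst = best = 0
    for categ in path:
        worst += BASE_CYCLES + (MISS_PENALTY if categ != "always-hit" else 0)
        best  += BASE_CYCLES + (MISS_PENALTY if categ == "always-miss" else 0)
    # A "first-miss" penalty is paid on the first iteration only, so the
    # worst-case bound refunds it for the remaining iterations.
    first_miss = sum(1 for c in path if c == "first-miss")
    worst_total = worst * iterations - first_miss * MISS_PENALTY * (iterations - 1)
    best_total = best * iterations
    return best_total, worst_total

print(path_bounds(["always-hit", "first-miss", "always-miss"], iterations=100))
# -> (1200, 1209): the bounds differ only by the single first-iteration miss
```

Charging a "first-miss" penalty once across all iterations, rather than on every iteration, is what makes such predictions tight rather than merely safe.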

223 citations


Journal ArticleDOI
TL;DR: A new family of edge-triggered flip-flops has been developed that has the capability of easily incorporating logic functions with a small delay penalty, and greatly reduces the pipeline overhead.
Abstract: In an attempt to reduce the pipeline overhead, a new family of edge-triggered flip-flops has been developed. The flip-flops belong to a class of semidynamic and dynamic circuits that can interface to both static and dynamic circuits. The main features of the basic design are short latency, small clock load, small area, and a single-phase clock scheme. Furthermore, the flip-flop family has the capability of easily incorporating logic functions with a small delay penalty. This feature greatly reduces the pipeline overhead, since each flip-flop can be viewed as a special logic gate that serves as a synchronization element as well.

167 citations


Patent
Hock C. So1, Sau C. Wong1
22 Jun 1999
TL;DR: In this article, a memory architecture for a non-volatile analog or multiple-bits-per-cell memory includes multiple separate memory arrays and multiple read/write pipelines; each pipeline includes a sample-and-hold circuit that samples the programming voltage when the pipeline begins a write operation.
Abstract: A memory architecture for a non-volatile analog or multiple-bits-per-cell memory includes multiple separate memory arrays and multiple read/write pipelines. The multiple read/write pipelines share a read circuit and/or a write circuit to reduce the circuit area of each pipeline and the circuit area of the memory as a whole. In one embodiment, a shared write circuit generates a programming voltage that changes with an input signal representing values to be written to the memory. Each pipeline includes a sample-and-hold circuit that samples the programming voltage when the pipeline begins a write operation. The write circuit can additionally generate a verify voltage that a second sample-and-hold circuit in each pipeline samples when starting a write operation. In another embodiment, a shared read circuit generates a read signal that ramps across the range of permitted threshold voltages for the memory cells, and a sense amplifier in each pipeline clocks a sample-and-hold circuit or another temporary storage circuit when the sense amplifier senses a transition in conductivity of a selected memory cell. When clocked, the sample-and-hold circuit or other temporary storage circuit registers a signal that corresponds to the read signal and indicates a data value associated with the voltage of the read signal. In alternative embodiments, the signal registered is the read signal, a converted form of the read signal, or a multi-bit digital signal.

149 citations


01 Jan 1999
TL;DR: This paper classifies the behavior of the SPEC95 benchmark suite over the course of execution, correlating the behavior of IPC, branch prediction, value prediction, address prediction, cache performance, and reorder buffer occupancy.
Abstract: Modern architecture research relies heavily on detailed pipeline simulation. Furthermore, programs often exhibit interesting and important time-varying behavior on an extremely large scale. Very little analysis has been conducted to classify the time-varying behavior of popular benchmarks using detailed simulation for important architecture features. In this paper we classify the behavior of the SPEC95 benchmark suite over the course of their execution, correlating the behavior of IPC, branch prediction, value prediction, address prediction, cache performance, and reorder buffer occupancy. Branch prediction, cache performance, value prediction, and address prediction are currently some of the most influential architecture features driving microprocessor research, and we show important interactions and relationships between these features. In addition, we show that many programs have wildly different behavior during different parts of their execution, which makes the section of the program simulated of great importance to the relevance and correctness of a study. We show that the large-scale behavior of the programs is cyclic in nature, point out the length of cyclic behavior for these programs, and suggest where to simulate to achieve results representative of the program as a whole.
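
A rough sketch of this kind of analysis, assuming synthetic per-interval IPC numbers in place of a detailed simulator's output; the autocorrelation-based period search is one plausible way to expose the cyclic large-scale behavior the authors report, not necessarily their method.

```python
# Measure a statistic (here IPC) per fixed-size execution interval, then
# find the period of the cyclic large-scale behavior via autocorrelation.
# The per-interval IPC values below are synthetic stand-ins.

def autocorrelation(xs, lag):
    n, mean = len(xs) - lag, sum(xs) / len(xs)
    num = sum((xs[i] - mean) * (xs[i + lag] - mean) for i in range(n))
    den = sum((x - mean) ** 2 for x in xs)
    return num / den if den else 0.0

ipc_per_interval = [1.0, 1.4, 0.6, 1.1, 1.3, 0.7] * 40   # period-6 phases
best_lag = max(range(1, 30), key=lambda L: autocorrelation(ipc_per_interval, L))
print("program behavior repeats about every", best_lag, "intervals")   # ~6
```

Knowing the period tells you how long a simulation window must be (at least one full cycle) to be representative of the whole program.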

125 citations


Patent
Mario B. Ignagni1
29 Sep 1999
TL;DR: In this paper, an inertial measurement unit located in the pig determines a pipeline profile as it travels through the pipeline and correlates the measured profile with the GPS survey, and a pipeline surveying system including a pipeline pig can accurately provide pipeline profile after correlation with a previously determine Global Positioning System (GPS) survey.
Abstract: A pipeline surveying system including a pipeline pig can accurately provide a pipeline profile after correlation with a previously determined Global Positioning System (GPS) survey. An inertial measurement unit located in the pig determines a pipeline profile as it travels through the pipeline and correlates the measured profile with the GPS survey.

123 citations


Journal ArticleDOI
TL;DR: In this article, the effect of corrosion defect size on the remaining pipeline strength is modeled by a Markov process. An analytical solution of the probability transition matrix is obtained by solving the Kolmogorov forward differential equation.

113 citations


Patent
08 Jun 1999
TL;DR: In this paper, an oil pipeline is fitted with a stainless steel tube sheathed telecommunications grade optical fiber, forming part of a distributed temperature measurement means, such as to be in thermal contact with the oil flowing along the pipeline.
Abstract: An oil pipeline (10), which connects a Christmas tree (11) at its lower end to an oil rig (12) at its upper end, is fitted with a stainless steel tube sheathed telecommunications grade optical fibre (13), forming part of a distributed temperature measurement means, such as to be in thermal contact with the oil flowing along the pipeline (10). Also fitted, at respective ones of three discrete positions on the pipeline, are ultrasonic deposit thickness measurement means (14, 15 and 16), of which the one (14) nearest to the Christmas tree (11) is a few hundred metres downstream from a position where preliminary studies of the pipeline system have indicated that deposits are likely to form during oil production. A computer (17), located on the oil rig (12), is connected so as to provide control over, and to receive measurement signals from, the temperature measurement means and the deposit thickness measurement means. A model of at least deposit deposition (figure 4, not shown) stored in the computer (17) is revised in response to these signals. The model includes an approximation of the temperature, the pressure, and the viscosity and flow patterns of the oil along the length of the pipeline (10), and, from these parameters, an estimation of the nature and quantity or rate of deposit deposition along the length of the pipeline (10).

111 citations


Proceedings ArticleDOI
01 May 1999
TL;DR: A multi-level, fetch-block-oriented predictor, the Fetch Target Buffer (FTB), is decoupled from the instruction fetch and decode pipelines to afford it the fastest clock possible, and scales better to future process technologies than traditional single-level designs.
Abstract: In the pursuit of instruction-level parallelism, significant demands are placed on a processor's instruction delivery mechanism. Delivering the performance necessary to meet future processor execution targets requires that the performance of the instruction delivery mechanism scale with the execution core. Attaining these targets is a challenging task due to I-cache misses, branch mispredictions, and taken branches in the instruction stream. To further complicate matters, a VLSI interconnect scaling trend is materializing that further limits the performance of front-end designs in future generation process technologies. To counter these challenges, we present a fetch architecture that permits a faster cycle time than previous designs and scales better with future process technologies. Our design, called the Fetch Target Buffer, is a multi-level fetch block-oriented predictor. We decouple the FTB from the instruction fetch and decode pipelines to afford it the fastest clock possible. Through cycle-based simulation and circuit-level delay analysis, we find that our multi-level FTB design is capable of delivering instructions 25% faster than the best single-level BTB-based pipeline configuration. Moreover, we show that our design scales better to future process technologies than traditional single-level designs.
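
As a rough illustration of the decoupled, multi-level lookup the abstract describes, the sketch below models a two-level FTB; the entry counts, stored fields (target and fall-through address), and the promotion/eviction policy are assumptions for illustration only.

```python
# Hedged sketch of a two-level fetch-target-buffer lookup: a small, fast
# first level backed by a larger, slower second level.

class FetchTargetBuffer:
    def __init__(self, l1_entries=64, l2_entries=4096):
        self.l1, self.l2 = {}, {}
        self.l1_entries, self.l2_entries = l1_entries, l2_entries

    def insert(self, fetch_pc, target, fallthrough):
        if len(self.l2) >= self.l2_entries:
            self.l2.pop(next(iter(self.l2)))       # crude FIFO-ish eviction
        self.l2[fetch_pc] = (target, fallthrough)
        self._promote(fetch_pc)

    def _promote(self, fetch_pc):
        if len(self.l1) >= self.l1_entries:
            self.l1.pop(next(iter(self.l1)))
        self.l1[fetch_pc] = self.l2[fetch_pc]

    def predict(self, fetch_pc):
        if fetch_pc in self.l1:                    # fast path: small, quick table
            return self.l1[fetch_pc]
        if fetch_pc in self.l2:                    # slower path: refill level one
            self._promote(fetch_pc)
            return self.l2[fetch_pc]
        return None                                # fall back to sequential fetch

ftb = FetchTargetBuffer()
ftb.insert(0x400, target=0x800, fallthrough=0x404)
print(ftb.predict(0x400))   # (2048, 1028), i.e. (0x800, 0x404)
```

The design point the paper argues for is keeping the first level small enough to cycle as fast as the execution core, while the second level preserves capacity.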

103 citations


Journal ArticleDOI
TL;DR: In this article, a mathematical model for the assessment of the pressure head maxima that air pockets within a pipeline can originate on start-up is presented, based on a general model addressing the...
Abstract: A mathematical model for the assessment of the pressure head maxima that air pockets within a pipeline can originate on start-up is presented. This model is based on a general model addressing the ...

98 citations


Proceedings ArticleDOI
01 May 1999
TL;DR: A method to predict the behavior of pipelined superscalar processors is described and initial results of a prototypical implementation for the SuperSPARC I processor are reported.
Abstract: For real-time systems, not only the logical function is important but also the timing behavior; e.g., hard real-time systems must react within their deadlines. To guarantee this, it is necessary to know upper bounds for the worst-case execution times (WCETs). The accuracy of the prediction of WCETs depends strongly on the ability to model the features of the target processor. Cache memories, pipelines, and parallel functional units are architectural components which are responsible for the speed gain of modern processors. It is not trivial to determine their influence when predicting the worst-case execution time of programs. This paper describes a method to predict the behavior of pipelined superscalar processors and reports initial results of a prototypical implementation for the SuperSPARC I processor.

96 citations


Patent
David B. Kirk1, Gopal Solanki1, Curtis Priem1, Walter E. Donovan1, Joe L. Yeun1 
22 Mar 1999
TL;DR: In this paper, a graphics accelerator pipeline including a combiner stage capable of producing output values during each clock interval of the pipeline which map a plurality of textures to a single pixel or an individual texture to two pixels.
Abstract: A graphics accelerator pipeline including a combiner stage capable of producing output values during each clock interval of the pipeline which map a plurality of textures to a single pixel or an individual texture to two pixels.

Proceedings ArticleDOI
21 Apr 1999
TL;DR: A protection architecture is proposed for the Morph/AMRM reconfigurable processor that enables nearly the full power of reconfigurability in the processor core while requiring only a small number of fixed logic features to ensure safe, protected multiprocess execution.
Abstract: Technology scaling of CMOS processes brings relatively faster transistors (gates) and slower interconnects (wires), making viable the addition of reconfigurability to increase performance. In the Morph/AMRM system we are exploring the addition of reconfigurable logic, deeply integrated with the processor core, employing the reconfigurability to manage the cache, datapath, and pipeline resources more effectively. However, integration of reconfigurable logic introduces significant protection and safety challenges for multiprocess execution. We analyze the protection structures in a state-of-the-art microprocessor core (R10000), identifying the few critical logic blocks and demonstrating that the majority of the logic in the processor core can be safely reconfigured. Subsequently, we propose a protection architecture for the Morph/AMRM reconfigurable processor which enables nearly the full power of reconfigurability in the processor core while requiring only a small number of fixed logic features to ensure safe, protected multiprocess execution.

Proceedings ArticleDOI
16 Nov 1999
TL;DR: Experiments found that when the misprediction penalty of the added Huffman decoder stage was taken into account, a Tailored ISA approach produced higher performance, while Huffman compression of entire operations provided higher ROM size savings.
Abstract: During the last 15 years, embedded systems have grown in complexity and performance to rival desktop systems. The architectures of these systems present unique challenges to processor microarchitecture, including instruction encoding and instruction fetch processes. This paper presents new techniques for reducing embedded system code size without reducing functionality. The approach is to extract the pipeline decoder logic for an embedded VLIW processor into software at system development time. The code size reduction is achieved by Huffman compressing or tailor-encoding the ISA of the original program. Some interesting results were found. In particular, the degree of compression for the ROM doesn't translate into an improvement in instructions delivered per cycle. Experiments found that when the misprediction penalty of the added Huffman decoder stage was taken into account, a Tailored ISA approach produced higher performance. Methods that compress the entire operation using Huffman encodings, and decompress at I-cache hit time, still achieved a median performance advantage, while providing higher ROM size savings. All results were generated by an optimizing compiler and tool suite, and presented for an encoding similar to the Intel/HP IA-64 architecture.
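
The Huffman branch of the approach can be sketched briefly: build codes from operation frequencies and compare the encoded size against a fixed-width ROM image. The frequencies and the assumed 8-bit fixed opcode field below are invented for the example.

```python
# Minimal Huffman-compression sketch: frequent operations get short codes.

import heapq
from collections import Counter

def huffman_lengths(freqs):
    """Return {symbol: code length in bits} for the given frequencies."""
    heap = [(f, i, {sym: 0}) for i, (sym, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    tick = len(heap)                      # unique tiebreaker for the heap
    while len(heap) > 1:
        fa, _, a = heapq.heappop(heap)
        fb, _, b = heapq.heappop(heap)
        merged = {s: d + 1 for s, d in {**a, **b}.items()}   # one level deeper
        heapq.heappush(heap, (fa + fb, tick, merged))
        tick += 1
    return heap[0][2]

ops = Counter({"add": 400, "load": 300, "store": 150, "mul": 100, "div": 50})
lengths = huffman_lengths(ops)
compressed = sum(ops[s] * length for s, length in lengths.items())
fixed = sum(ops.values()) * 8             # assume an 8-bit fixed opcode field
print(f"opcode bits: {compressed} vs {fixed} "
      f"({100 * (1 - compressed / fixed):.0f}% smaller)")
```

The paper's caveat applies directly: the savings above say nothing about cycle time, because decoding variable-length codes adds a pipeline stage whose misprediction penalty can erase the gain.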

Patent
01 Apr 1999
TL;DR: In this article, an asynchronously pipelined SDRAM has separate pipeline stages that are controlled by asynchronous signals and data is synchronized to the clock at the end of the read data path before being read out of the chip.
Abstract: An asynchronously pipelined SDRAM has separate pipeline stages that are controlled by asynchronous signals. Rather than using a clock signal to synchronize data at each stage, an asynchronous signal is used to latch data at every stage. The asynchronous control signals are generated within the chip and are optimized to the different latency stages. Longer latency stages require larger delay elements, while shorter latency stages require shorter delay elements. The data is synchronized to the clock at the end of the read data path before being read out of the chip. Because the data has been latched at each pipeline stage, it suffers from less skew than would be seen in a conventional wave pipeline architecture. Furthermore, since the stages are independent of the system clock, the read data path can be run at any CAS latency as long as the re-synchronizing output is built to support it.

Patent
30 Jun 1999
TL;DR: In this article, a method and system for monitoring the performance of an instruction pipeline is presented, where the processor may contain a performance monitor for monitoring for the occurrence of an event within a data processing system.
Abstract: A method and system for monitoring the performance of an instruction pipeline is provided. The processor may contain a performance monitor for monitoring for the occurrence of an event within a data processing system. An event to be monitored may be specified through software control, and the occurrence of the specified event is monitored during the execution of an instruction in the execution pipeline of the processor. A particular instruction may be specified to execute within a threshold time for each stage of the instruction pipeline. The specified event may be the completion of a single tagged instruction beyond the specified threshold interval for a stage of the instruction pipeline. The performance monitor may contain a number of counters for counting multiple occurrences of specified events during the execution of multiple instructions, in which case the specified events may be the completion of tagged instructions beyond a threshold interval for any stage of the multiple stages of the execution pipeline. As the instruction moves through the processor, the performance monitor collects the events and provides the events for optimization analysis.
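
A behavioral sketch of the monitoring scheme, with hypothetical stage names and thresholds: the patent's hardware counts tagged instructions whose per-stage latency exceeds a threshold, which the code below mimics.

```python
# Count completions of tagged instructions that exceed a per-stage
# threshold. Stage names and threshold values are placeholders.

THRESHOLDS = {"fetch": 2, "decode": 2, "execute": 6, "writeback": 2}  # cycles

class PerformanceMonitor:
    def __init__(self):
        self.counters = {stage: 0 for stage in THRESHOLDS}

    def record(self, stage_times):
        """stage_times: {stage: cycles spent} for one tagged instruction."""
        for stage, cycles in stage_times.items():
            if cycles > THRESHOLDS[stage]:
                self.counters[stage] += 1

pm = PerformanceMonitor()
pm.record({"fetch": 1, "decode": 2, "execute": 14, "writeback": 1})  # long op
pm.record({"fetch": 5, "decode": 1, "execute": 3, "writeback": 1})   # slow fetch
print(pm.counters)  # {'fetch': 1, 'decode': 0, 'execute': 1, 'writeback': 0}
```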

Proceedings ArticleDOI
13 Dec 1999
TL;DR: A technique for worst-case execution time (WCET) analysis for pipelined processors for embedded real-time systems using a standard simulator instead of special-purpose pipeline modeling is presented.
Abstract: We present a technique for worst-case execution time (WCET) analysis for pipelined processors. Our technique uses a standard simulator instead of special-purpose pipeline modeling. Our technique handles CPUs that execute multiple shorter instructions in parallel with long-running instructions. The results of other machine analyses, like cache analysis, can be used in our pipeline analysis. Also, results from high-level program flow analysis can be used to tighten the execution time predictions. Our primary target is embedded real-time systems, and since processor simulators are standard equipment for embedded development work, our tool will be easy to port to relevant target processors.

27 Aug 1999
TL;DR: It is demonstrated that scaling the architecture leads to near-linear application speedup, and the effect of scaling the capacity and parallelism of the on-chip memory system on die area and sustained performance is evaluated.
Abstract: Next generation portable devices will require processors with both low energy consumption and high performance for media functions. At the same time, modern CMOS technology creates the need for highly scalable VLSI architectures. Conventional processor architectures fail to meet these requirements. This paper presents the architecture of Vector IRAM (VIRAM), a processor that combines vector processing with embedded DRAM technology. Vector processing achieves high multimedia performance with simple hardware, while embedded DRAM provides high memory bandwidth at low energy consumption. VIRAM provides flexible support for media data types, short vectors, and DSP features. The vector pipeline is enhanced to hide DRAM latency without using caches. The peak performance is 3.2 GFLOPS (single precision) and maximum memory bandwidth is 25.6 GBytes/s. With a target power consumption of 2 Watts for the vector pipeline and the memory system, VIRAM supports 1.6 GFLOPS/Watt. For a set of representative media kernels, VIRAM sustains on average 88% of its peak performance, outperforming conventional SIMD media extensions and DSP processors by factors of 4.5 to 17. Using a clustered implementation approach, the modular design can be scaled without complicating control logic. We demonstrate that scaling the architecture leads to near-linear application speedup. We also evaluate the effect of scaling the capacity and parallelism of the on-chip memory system on die area and sustained performance.
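
The headline figures are easy to sanity-check from the abstract alone (values taken from the text; the arithmetic below adds nothing new):

```python
# Quick consistency check of the VIRAM numbers quoted above.
peak_gflops = 3.2          # single precision, from the abstract
power_watts = 2.0          # vector pipeline + memory system target
print(peak_gflops / power_watts)   # 1.6 GFLOPS/Watt, as stated
print(0.88 * peak_gflops)          # ~2.8 GFLOPS sustained on media kernels
```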

Patent
31 Mar 1999
TL;DR: In this article, a method and system is presented for power savings within a pipelined design by performing intelligent stage gating, recognizing that not every operand applied to the input of a pipeline requires a recomputation in the different pipeline stages.
Abstract: A method and system for power savings within a pipelined design by performing intelligent stage gating. The present invention recognizes that not every operand applied to the input of a pipeline requires a recomputation in the different pipeline stages. Circuitry is used to generate a signal, C, indicating that this condition holds. C is then used to gate the register bank at the input of the first pipeline stage, thereby potentially saving power in the register bank. Moreover, C can also be stored in a register, the output of which: a) gates the register bank of the second stage; and b) connects to another register to store signal C to be used in the third stage. Power savings are provided by not clocking the register circuit of the stage, and in some instances, power is saved within the stage's associated combinational logic. In one embodiment, a register (to store C) is added in each stage of a pipeline to use C as a gating signal in the subsequent stage. This yields a structure in which signal C propagates through the pipeline in synchronization with the clock, successively gating the associated register banks. The value of C is generated whenever the output of the stage is inconsequential. For example, the output can be inconsequential in cases when duplicate operands are received in back-to-back clock cycles. Also, in maximum and minimum cases, an operand that is not larger or smaller, respectively, than the largest or smallest previously received operand can yield an inconsequential result.
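
A behavioral sketch of the gating idea, assuming the simplest trigger the patent names, duplicate operands in back-to-back cycles; the stage function and the cycle accounting are stand-ins for real hardware.

```python
# When the incoming operand makes the stage's output inconsequential
# (here, a duplicate of the previous operand), assert C and skip the
# register clocking and recomputation for that cycle.

def run_stage(operands, stage_fn):
    prev, held = object(), None       # sentinel: the first operand computes
    outputs, clocked = [], 0
    for op in operands:
        c = (op == prev)              # C: recomputation is inconsequential
        if not c:
            held = stage_fn(op)       # clock the register bank, do the work
            clocked += 1
            prev = op
        outputs.append(held)          # gated cycles just hold the old output
    return outputs, clocked

outs, clocked = run_stage([3, 3, 3, 7, 7, 2], stage_fn=lambda x: x * x)
print(outs, "- clocked", clocked, "of 6 cycles")
# [9, 9, 9, 49, 49, 4] - clocked 3 of 6 cycles
```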

Journal ArticleDOI
TL;DR: The ability to incorporate partitioning with pipelining at several levels of granularity enables the authors to attain high-throughput designs, and also distinguishes this paper from previously proposed hardware/software partitioning algorithms.
Abstract: In order to satisfy cost and performance requirements, digital signal processing and telecommunication systems are generally implemented with a combination of different components, from custom-designed chips to off-the-shelf processors. These components vary in their area, performance, programmability and so on, and the system functionality is partitioned amongst the components to best utilize this tradeoff. However, for performance critical designs, it is not sufficient to only implement the critical sections as custom-designed high-performance hardware, but it is also necessary to pipeline the system at several levels of granularity. We present a design flow and an algorithm to first allocate software and hardware components, and then partition and pipeline a throughput-constrained specification amongst the selected components. This is performed to best satisfy the throughput constraint at minimal application-specific integrated-circuit cost. Our ability to incorporate partitioning with pipelining at several levels of granularity enables us to attain high throughput designs, and also distinguishes this paper from previously proposed hardware/software partitioning algorithms.
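
For flavor, here is a greedy toy version of the allocation/partitioning step, with invented task costs; the paper's actual algorithm also pipelines at several granularities and optimizes ASIC cost more carefully than this sketch does.

```python
# Greedy hardware/software partitioning under a throughput constraint:
# move the most profitable tasks (speedup per unit area) to hardware
# until the period budget is met. All numbers are invented.

tasks = {            # task: (sw_time, hw_time, hw_area_cost)
    "filter": (40, 5, 8),
    "fft":    (30, 4, 10),
    "pack":   (10, 2, 3),
}

def partition(tasks, period_budget):
    in_hw = set()
    sw_time = sum(t[0] for t in tasks.values())
    order = sorted(tasks, reverse=True,
                   key=lambda k: (tasks[k][0] - tasks[k][1]) / tasks[k][2])
    for name in order:
        if sw_time <= period_budget:
            break
        sw, hw, _ = tasks[name]
        sw_time -= sw - hw           # task now runs in hardware
        in_hw.add(name)
    area = sum(tasks[n][2] for n in in_hw)
    return in_hw, sw_time, area

print(partition(tasks, period_budget=40))
# e.g. ({'filter', 'pack'}, 37, 11): meets the budget at area cost 11
```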

Journal ArticleDOI
TL;DR: Experiments are described that demonstrate the compression quality of the system and the execution speed of the pipelined interpreter; compressed code was found to be about five times more compact than native TriMedia code, with a slowdown of about eight times.

Abstract: This paper describes a system for compressed code generation. The code of applications is partitioned into time-critical and non-time-critical code. Critical code is compiled to native code, and non-critical code is compiled to a very dense virtual instruction set which is executed on a highly optimized interpreter. The system employs dictionary-based compression by means of superinstructions which correspond to patterns of frequently used base instructions. The code compression system is designed for the Philips TriMedia VLIW processor. The interpreter is pipelined to achieve a high interpretation speed. The pipeline consists of three stages: fetch, decode, and execute. While one instruction is being executed, the next instruction is decoded, and the next one after that is fetched from memory. On a TriMedia VLIW with a load latency of three cycles and a jump latency of four cycles, the interpreter achieves a peak performance of four cycles per instruction and a sustained performance of 6.27 cycles per instruction. Experiments are described that demonstrate the compression quality of the system and the execution speed of the pipelined interpreter; compressed code was found to be about five times more compact than native TriMedia code, with a slowdown of about eight times.
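
The dictionary-based compression can be illustrated compactly: frequent patterns of base instructions collapse into single superinstructions. The instruction stream and the pattern below are invented for the example.

```python
# Dictionary compression via superinstructions: a frequent pattern of
# base instructions is replaced by one dense virtual instruction.

program = ["load", "add", "store", "load", "add", "store", "jump"]
dictionary = {("load", "add", "store"): "SUPER_LAS"}   # hypothetical pattern

def compress(code, dictionary):
    out, i = [], 0
    while i < len(code):
        for pattern, superop in dictionary.items():
            if tuple(code[i:i + len(pattern)]) == pattern:
                out.append(superop)        # three instructions become one
                i += len(pattern)
                break
        else:
            out.append(code[i])            # no pattern match: copy through
            i += 1
    return out

print(compress(program, dictionary))   # ['SUPER_LAS', 'SUPER_LAS', 'jump']
```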

Journal ArticleDOI
01 Feb 1999
TL;DR: This work describes two high-speed first-in-first-out (FIFO) circuits that were used to compare the performance of asynchronous FIFOs with that of conventionally clocked shift registers.
Abstract: Asynchronous circuits are often perceived to operate slower than equivalent clocked circuits. We demonstrate with fabricated chips that asynchronous circuits can be every bit as fast as clocked circuits. We describe two high-speed first-in-first-out (FIFO) circuits that we used to compare the performance of asynchronous FIFOs with that of conventionally clocked shift registers. The first FIFO circuit uses a pulse-like protocol, which we call the Asynchronous Symmetric Persistent Pulse Protocol (asP*), to advance data along a pipeline of conventional latches. Use of this protocol requires careful management of circuit delays. The second FIFO circuit uses a transition signaling protocol and special transition latches to store data. These transition latches are fast, but they are about 50% larger than conventional latches. Measurements obtained from chips fabricated in 0.6 μm CMOS and from SPICE simulations show that the throughput of the first FIFO design matches that of a conventionally clocked shift register design, with a maximum throughput of 1.1 Giga data items per second. The throughput of the second design exceeds the performance of the asP* design and achieves a maximum throughput of 1.7 Giga data items per second. We have extensively tested the chips and have found them to operate reliably over a very wide range of conditions.
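
A toy model of how data ripples through a clockless FIFO, where an item advances whenever the next stage is empty; this illustrates only the dataflow, not the asP* pulse timing or the transition latches.

```python
# Self-timed FIFO dataflow: no global clock, items move forward locally
# whenever the downstream stage is empty.

def step(stages):
    moved = False
    for i in range(len(stages) - 1, 0, -1):   # drain from the output side
        if stages[i] is None and stages[i - 1] is not None:
            stages[i], stages[i - 1] = stages[i - 1], None
            moved = True
    return moved

fifo = ["a", "b", None, None]     # two items queued at the input side
while step(fifo):
    print(fifo)
# [None, 'a', 'b', None] then [None, None, 'a', 'b']:
# items ripple toward the empty output slots
```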

Patent
09 Apr 1999
TL;DR: In this article, an apparatus for processing data has a Single-Instruction-Multiple-Data (SIMD) architecture, and a number of features that improve performance and programmability.
Abstract: An apparatus for processing data has a Single-Instruction-Multiple-Data (SIMD) architecture, and a number of features that improve performance and programmability. The apparatus includes a rectangular array of processing elements and a controller. In one aspect, each of the processing elements includes one or more addressable storage means and other elements arranged in a pipelined architecture. The controller includes means for receiving a high level instruction, and converting each instruction into a sequence of one or more processing element microinstructions for simultaneously controlling each stage of the processing element pipeline. In doing so, the controller detects and resolves a number of resource conflicts, and automatically generates instructions for registering image operands that are skewed with respect to one another in the processing element array. In another aspect, a programmer references images via pointers to image descriptors that include the actual addresses of various bits of multi-bit data. Other features facilitate and speed up the movement of data into and out of the apparatus. 'Hit' detection and histogram logic are also included.


Proceedings ArticleDOI
20 Oct 1999
TL;DR: This work discusses the design and implementation of a high-speed, low power 1024-point pipeline FFT processor, which is efficient in terms of power consumption and chip area.
Abstract: We discuss the design and implementation of a high-speed, low power 1024-point pipeline FFT processor. Key features are flexible internal data length and a novel processing element. The FFT processor, which is implemented in a standard 0.35 μm CMOS process, is efficient in terms of power consumption and chip area.
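
For scale: a radix-2 pipeline FFT needs log2(N) butterfly stages, so a 1024-point design has ten. The recursive sketch below shows the butterfly arithmetic each stage performs; a hardware pipeline would stream samples through ten dedicated stages instead of recursing.

```python
# Radix-2 decimation-in-time FFT; each recursion level corresponds to
# one butterfly stage of a pipeline implementation.

import cmath
import math

def fft(x):                      # length must be a power of two
    n = len(x)
    if n == 1:
        return x
    even, odd = fft(x[0::2]), fft(x[1::2])
    tw = [cmath.exp(-2j * cmath.pi * k / n) * odd[k] for k in range(n // 2)]
    return [even[k] + tw[k] for k in range(n // 2)] + \
           [even[k] - tw[k] for k in range(n // 2)]

print(int(math.log2(1024)), "pipeline stages for a 1024-point radix-2 FFT")
print([round(abs(v), 3) for v in fft([1, 0, 0, 0, 1, 0, 0, 0])])
# [2.0, 0.0, 2.0, 0.0, 2.0, 0.0, 2.0, 0.0]: spectrum of a two-impulse input
```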

Patent
08 Oct 1999
TL;DR: In this paper, a parallel array VLIW digital signal processor is employed along with specialized complex multiplication instructions and communication operations between the processing elements which are overlapped with computation to provide very high performance operation.
Abstract: Efficient computation of complex multiplication results and very efficient fast Fourier transforms (FFTs) are provided. A parallel array VLIW digital signal processor is employed along with specialized complex multiplication instructions and communication operations between the processing elements which are overlapped with computation to provide very high performance operation. Successive iterations of a loop of tightly packed VLIWs are used allowing the complex multiplication pipeline hardware to be efficiently used. In addition, efficient techniques for supporting combined multiply accumulate operations are described.
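
The motivation for specialized complex-multiply instructions is cost: a complex product takes four real multiplications and two additions done directly, or three multiplications and five additions with the classic rewrite below, which matters when multipliers are the scarce pipeline resource.

```python
# Complex multiplication with three real multiplies instead of four.

def cmul_3mul(a, b, c, d):
    """(a + bj) * (c + dj) using three real multiplications."""
    t1 = c * (a + b)            # ac + bc
    t2 = a * (d - c)            # ad - ac
    t3 = b * (c + d)            # bc + bd
    return t1 - t3, t1 + t2     # (ac - bd, ad + bc)

print(cmul_3mul(1, 2, 3, 4), (1 + 2j) * (3 + 4j))   # (-5, 10) both ways
```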

Journal ArticleDOI
TL;DR: A very-large-scale integration architecture for Reed-Solomon (RS) decoding is presented that is scalable with respect to the throughput rate and new regular, multiplexed architectures have been derived for solving the key equation and performing finite field divisions.
Abstract: A very-large-scale integration architecture for Reed-Solomon (RS) decoding is presented that is scalable with respect to the throughput rate. This architecture enables given system specifications to be matched efficiently independent of a particular technology. The scalability is achieved by applying a systematic time-sharing technique. Based on this technique, new regular, multiplexed architectures have been derived for solving the key equation and performing finite field divisions. In addition to the flexibility, this approach leads to a small silicon area in comparison with several decoder implementations published in the past. The efficiency of the proposed architecture results from a fine granular pipeline scheme throughout each of the RS decoder components and a small overhead for the control circuitry. Implementation examples show that due to the pipeline strategy, data rates up to 1.28 Gbit/s are reached in a 0.5 μm CMOS technology.

Proceedings ArticleDOI
19 Jul 1999
TL;DR: A new approach to evolvable hardware is introduced, termed 'Complete Hardware Evolution' (CHE), which differs from Extrinsic and Intrinsic evolution in that the evolution process itself is implemented in hardware.
Abstract: In this paper a new approach to evolvable hardware is introduced, termed 'Complete Hardware Evolution' (CHE). This method differs from Extrinsic and Intrinsic evolution in that the evolution process itself is implemented in hardware. In addition, the evolution process implementation, referred to herein as the GA Pipeline, is implemented on the same chip as the evolving design. A prototype implementation of the GA Pipeline is presented which uses FPGA technology as the implementation medium.
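
A minimal software rendition of the generational loop a GA Pipeline would stage in hardware (selection, crossover, mutation, evaluation as consecutive stages); the fitness function and parameters are placeholders, not the paper's configuration.

```python
# Tiny generational GA over 8-bit genomes; each phase of the loop maps
# naturally onto one stage of a hardware GA pipeline.

import random
random.seed(1)

TARGET = 0b10110111
def fitness(genome):                      # count matching bits (toy problem)
    return 8 - bin(genome ^ TARGET).count("1")

def evolve(pop_size=16, generations=40):
    pop = [random.getrandbits(8) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[:pop_size // 2]     # selection stage
        children = []
        while len(children) < pop_size - len(parents):
            a, b = random.sample(parents, 2)
            mask = random.getrandbits(8)  # crossover stage: uniform mask
            child = (a & mask) | (b & ~mask & 0xFF)
            if random.random() < 0.2:     # mutation stage: flip one bit
                child ^= 1 << random.randrange(8)
            children.append(child)
        pop = parents + children          # evaluation happens next iteration
    return max(pop, key=fitness)

best = evolve()
print(f"{best:08b}", fitness(best))       # converges toward 10110111
```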

Journal ArticleDOI
01 Feb 1999
TL;DR: This paper describes the world's first commercial data-driven multimedia processors, developed jointly by Osaka University, Kochi University of Technology, and Sharp Corporation, and a super-pipelined implementation based on a self-timed clocking scheme.
Abstract: This paper describes the world's first commercial data-driven multimedia processors (DDMPs), developed jointly by Osaka University, Kochi University of Technology, and Sharp Corporation. The data-driven principle underlying the structure of these processors was realized in a super-pipelined implementation, which was in turn based on a self-timed clocking scheme. This design made it possible to realize single-chip DDMPs capable of executing tens of billions of signal processing operations per second with power consumption as low as 2 W. In terms of operations per watt, the processors exhibit threefold to tenfold improvement over conventional sequential digital signal processors (DSPs). The structure of this paper is as follows: 1) a brief introduction to the data-driven processing principle; 2) a detailed description of elementary modules for the realization of self-timed pipeline microprocessors; and 3) a description of the DDMPs developed thus far in the research project, which has continued for more than a decade. Also outlined here are the numerous advantages, in terms of both function and power consumption, of the self-timed pipeline over its synchronous counterparts. Commercially available DSP-oriented asynchronous data-driven processors and their practical applications to consumer appliances such as digital TV receivers are discussed; some programming examples are provided.

Book ChapterDOI
30 Aug 1999
TL;DR: This paper describes the study of a new field programmable gate array architecture based on on-line arithmetic, dedicated to single chip implementation of numerical algorithms in low-power signal processing and digital control applications.
Abstract: This paper describes the study of a new field programmable gate array architecture based on on-line arithmetic. This architecture, called Field Programmable On-line oPerators (FPOP), is dedicated to single chip implementation of numerical algorithms in low-power signal processing and digital control applications. FPOP is based on a reprogrammable array of on-line arithmetic operators. On-line arithmetic is a digit-serial arithmetic with most significant digits first using a redundant number system. The digit-level pipeline, the small number of communication wires between the operators and the small size of the arithmetic operators lead to high-performance parallel computations. In FPOP, the basic elements are arithmetic operators such as adders, subtracters, multipliers, dividers, square-rooters, sine or cosine operators.... An equation model is then sufficient to describe the mapping of the algorithm on the circuit. The digit-serial communication mode also significantly reduces the necessary programmable routing resources compared to standard FPGAs.
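
The enabling property of on-line arithmetic is the redundant digit set: because a value has several digit representations, an operator can commit most-significant digits before seeing all of its inputs. The snippet below only illustrates that redundancy for radix-2 signed-digit fractions; it is not an on-line operator itself.

```python
# Redundant signed-digit fractions (digits in {-1, 0, 1}, radix 2,
# most significant digit first, as used by on-line arithmetic).

def sd_value(digits, radix=2):
    """Value of the signed-digit fraction d1 d2 d3 ..."""
    return sum(d * radix ** -(i + 1) for i, d in enumerate(digits))

print(sd_value([1, -1]))   # 0.25: 1/2 - 1/4
print(sd_value([0, 1]))    # 0.25: same value, different leading digit
```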

Patent
George Lyons1, William Low1
09 Dec 1999
TL;DR: In this article, multiple independent images are presented on multiple display devices by driving the display devices with a common set of control signals and a multiplexed set of data signals that convey information representing interleaved components of each independent image.
Abstract: Multiple independent images are presented on multiple display devices by driving the display devices with a common set of control signals and a multiplexed set of data signals that convey information representing interleaved components of each independent image. In a preferred embodiment, a unique clock signal is provided to each respective display device that is aligned with the interleaved components of the image to be presented by that respective display. The control, data and clock signals may be obtained by multiplexing control and data signals received from display pipeline circuits, or by generating the signals using a composite circuit that implements the features of two or more multiplexed display pipeline circuits.