
Showing papers on "Pipeline (computing) published in 2007"


Journal ArticleDOI
TL;DR: This work presents a novel streaming CT framework that conceptualizes the reconstruction process as a steady flow of data across a computing pipeline, updating the reconstruction result immediately after the projections have been acquired.
Abstract: The recent emergence of various types of flat-panel x-ray detectors and C-arm gantries now enables the construction of novel imaging platforms for a wide variety of clinical applications. Many of these applications require interactive 3D image generation, which cannot be satisfied with inexpensive PC-based solutions using the CPU. We present a solution based on commodity graphics hardware (GPUs) to provide these capabilities. While GPUs have been employed for CT reconstruction before, our approach provides significant speedups by exploiting the various built-in hardwired graphics pipeline components for the most expensive CT reconstruction task, backprojection. We show that the timings so achieved are superior to those obtained when using the GPU merely as a multi-processor, without a drop in reconstruction quality. In addition, we also show how the data flow across the graphics pipeline can be optimized, by balancing the load among the pipeline components. The result is a novel streaming CT framework that conceptualizes the reconstruction process as a steady flow of data across a computing pipeline, updating the reconstruction result immediately after the projections have been acquired. Using a single PC equipped with a single high-end commodity graphics board (the Nvidia 8800 GTX), our system is able to process clinically-sized projection data at speeds meeting and exceeding the typical flat-panel detector data production rates, enabling throughput rates of 40-50 projections s⁻¹ for the reconstruction of 512³ volumes.
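The backprojection step the authors accelerate on the GPU can be illustrated with a minimal CPU sketch. The code below is a naive 2D parallel-beam version in NumPy (the paper targets 3D cone-beam data on graphics hardware); the function names and geometry conventions are our own, not the paper's.

```python
import numpy as np

def backproject(sinogram, angles, size):
    """Naive 2D parallel-beam backprojection: smear each projection
    back across the image along its acquisition angle."""
    image = np.zeros((size, size))
    # Pixel-center coordinates with the origin at the image center.
    coords = np.arange(size) - (size - 1) / 2.0
    x, y = np.meshgrid(coords, coords)
    n_det = sinogram.shape[1]
    for proj, theta in zip(sinogram, angles):
        # Detector coordinate of each pixel for this view.
        t = x * np.cos(theta) + y * np.sin(theta)
        idx = np.clip(np.round(t + (n_det - 1) / 2.0).astype(int), 0, n_det - 1)
        # Accumulate -- the expensive inner step the GPU pipeline accelerates.
        image += proj[idx]
    return image * np.pi / len(angles)
```

A streaming framework like the one described would invoke this per-view accumulation as each projection arrives, rather than after the full scan.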

250 citations


Proceedings ArticleDOI
A. Kumar, P. Kundu, A.P. Singh, L.-S. Peh, N.K. Jha
01 Oct 2007
TL;DR: This paper presents a detailed design of a novel high throughput and low latency switch allocation mechanism, a non-speculative single-cycle router pipeline which uses advanced bundles to remove control setup overhead, a low-complexity virtual channel allocator and a dynamically-managed shared buffer design which uses prefetching to minimize critical path delay.
Abstract: As chip multiprocessors (CMPs) become the only viable way to scale up and utilize the abundant transistors made available in current microprocessors, the design of on-chip networks is becoming critically important. These networks face unique design constraints and are required to provide extremely fast and high bandwidth communication, yet meet tight power and area budgets. In this paper, we present a detailed design of our on-chip network router targeted at a 36-core shared-memory CMP system in 65 nm technology. Our design targets an aggressive clock frequency of 3.6 GHz, thus posing tough design challenges that led to several unique circuit and microarchitectural innovations and design choices, including a novel high throughput and low latency switch allocation mechanism, a non-speculative single-cycle router pipeline which uses advanced bundles to remove control setup overhead, a low-complexity virtual channel allocator and a dynamically-managed shared buffer design which uses prefetching to minimize critical path delay. Our router takes up 1.19 mm² area and expends 551 mW power at 10% activity, delivering a single-cycle no-load latency at 3.6 GHz clock frequency while achieving a peak switching data rate in excess of 4.6 Tbits/s per router node.

217 citations


Proceedings ArticleDOI
09 Jun 2007
TL;DR: ReCycle, an architectural framework that comprehensively applies cycle time stealing to the pipeline - transferring the time slack of the faster stages to the slow ones by skewing clock arrival times to latching elements after fabrication, completely reclaiming the performance losses due to variation.
Abstract: Process variation affects processor pipelines by making some stages slower and others faster, therefore exacerbating pipeline unbalance. This reduces the frequency attainable by the pipeline. To improve performance, this paper proposes ReCycle, an architectural framework that comprehensively applies cycle time stealing to the pipeline - transferring the time slack of the faster stages to the slow ones by skewing clock arrival times to latching elements after fabrication. As a result, the pipeline can be clocked with a period equal to the average stage delay rather than the longest one. In addition, ReCycle's frequency gains are enhanced with Donor stages, which are empty stages added to "donate" slack to the slow stages. Finally, ReCycle can also convert slack into power reductions. For a 17FO4 pipeline, ReCycle increases the frequency by 12% and the application performance by 9% on average. Combining ReCycle and donor stages delivers improvements of 36% in frequency and 15% in performance on average, completely reclaiming the performance losses due to variation.
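The arithmetic behind ReCycle's claim is easy to sketch. Assuming per-stage delays are known after fabrication, a toy model of the achievable clock period (function names and numbers are illustrative, not from the paper) might look like:

```python
def clock_period(stage_delays):
    """Conventional clocking: the slowest stage sets the period."""
    return max(stage_delays)

def recycle_period(stage_delays):
    """Cycle time stealing (ReCycle's idea): skewing clock arrival times
    lets fast stages donate slack to slow ones, so the achievable period
    approaches the *average* stage delay."""
    return sum(stage_delays) / len(stage_delays)

def with_donor_stages(stage_delays, n_donor, donor_delay):
    """Donor stages are near-empty stages inserted purely to add slack;
    they lower the average delay at the cost of a longer pipeline."""
    padded = list(stage_delays) + [donor_delay] * n_donor
    return recycle_period(padded)
```

For delays [10, 12, 8, 14], the conventional period is 14, cycle stealing gives 11, and two donor stages of delay 2 bring it down to 8; the trade-off is extra pipeline depth.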

174 citations


Journal ArticleDOI
TL;DR: An asynchronous pipeline style is introduced for high-speed applications, called MOUSETRAP, which uses standard transparent latches and static logic in its datapath, and small latch controllers consisting of only a single gate per pipeline stage to handle more complex system architectures.
Abstract: An asynchronous pipeline style is introduced for high-speed applications, called MOUSETRAP. The pipeline uses standard transparent latches and static logic in its datapath, and small latch controllers consisting of only a single gate per pipeline stage. This simple structure is combined with an efficient and highly-concurrent event-driven protocol between adjacent stages. Post-layout SPICE simulations of a ten-stage pipeline with a 4-bit wide datapath indicate throughputs of 2.1-2.4 GHz in a 0.18-μm TSMC CMOS process. Similar results were obtained when the datapath width was extended to 16 bits. This performance is competitive even with that of wave pipelines, without the accompanying problems of complex timing and much design effort. Additionally, the new pipeline gracefully and robustly adapts to variable speed environments. The pipeline stages are extended to fork and join structures, to handle more complex system architectures.

159 citations


Proceedings ArticleDOI
11 Nov 2007
TL;DR: The implementations of the Smith-Waterman algorithm for both DNA and protein sequences on the XD1000 platform are presented and a multistage PE (processing element) design is brought forward which significantly reduces the FPGA resource usage and hence allows more parallelism to be exploited.
Abstract: An innovative reconfigurable supercomputing platform -- XD1000 -- is developed by XtremeData Inc. to exploit the rapid progress of FPGA technology and the high performance of HyperTransport interconnection. In this paper, we present the implementations of the Smith-Waterman algorithm for both DNA and protein sequences on the platform. The main features include: (1) we bring forward a multistage PE (processing element) design which significantly reduces the FPGA resource usage and hence allows more parallelism to be exploited; (2) our design features a pipelined control mechanism with uneven stage latencies -- a key to minimize the overall PE pipeline cycle time; (3) we also put forward a compressed substitution matrix storage structure, resulting in substantial decrease of the on-chip SRAM usage. Finally, we implement a 384-PE systolic array running at 66.7 MHz, which can achieve 25.6 GCUPS peak performance. Compared with the 2.2 GHz AMD Opteron host processor, the FPGA coprocessor achieves speedups of 185X and 250X, respectively.
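The recurrence that each PE in such a systolic array evaluates can be sketched in software. This is the textbook linear-gap Smith-Waterman score, not the authors' FPGA design, and the scoring parameters are illustrative:

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Textbook Smith-Waterman local-alignment score. On a systolic
    array, one anti-diagonal of H is computed per cycle, one cell per PE."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0,
                          H[i - 1][j - 1] + s,   # diagonal: (mis)match
                          H[i - 1][j] + gap,     # gap in b
                          H[i][j - 1] + gap)     # gap in a
            best = max(best, H[i][j])
    return best
```

The hardware wins come from evaluating a whole anti-diagonal in parallel; the compressed substitution matrix the paper describes would replace the simple match/mismatch test for protein alphabets.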

144 citations


Proceedings ArticleDOI
Quanzhong Li, Minglong Sha, Volker Markl, K. Beyer, L. Colby, G. Lohman
10 Sep 2007
TL;DR: A novel method for processing pipelined join plans that dynamically arranges the join order of both inner and outer-most tables at run-time and achieves adaptability by changing the pipeline itself which avoids the bookkeeping and routing decision associated with each row.
Abstract: Traditional query processing techniques based on static query optimization are ineffective in applications where statistics about the data are unavailable at the start of query execution or where the data characteristics are skewed and change dynamically. Several adaptive query processing techniques have been proposed in recent years to overcome the limitations of static query optimizers through either explicit re-optimization of plans during execution or by using a row-routing based approach. In this paper, we present a novel method for processing pipelined join plans that dynamically arranges the join order of both inner and outer-most tables at run-time. We extend the Eddies concept of "moments of symmetry" to reorder indexed nested-loop joins, the join method used by all commercial DBMSs for building pipelined query plans for applications for which low latencies are crucial. Unlike row-routing techniques, our approach achieves adaptability by changing the pipeline itself which avoids the bookkeeping and routing decision associated with each row. Operator selectivities monitored during query execution are used to change the execution plan at strategic points, and the change of execution plans utilizes a novel and efficient technique for avoiding duplicates in the query results. Our prototype implementation in a commercial DBMS shows a query execution speedup of up to 8 times.

110 citations


Journal ArticleDOI
TL;DR: Several optimization techniques over the negative tuples approach are presented that aim to reduce the overhead of processing negative tuples while avoiding the output delay of the query answer.
Abstract: Two research efforts have been conducted to realize sliding-window queries in data stream management systems, namely, query reevaluation and incremental evaluation. In the query reevaluation method, two consecutive windows are processed independently of each other. On the other hand, in the incremental evaluation method, the query answer for a window is obtained incrementally from the answer of the preceding window. In this paper, we focus on the incremental evaluation method. Two approaches have been adopted for the incremental evaluation of sliding-window queries, namely, the input-triggered approach and the negative tuples approach. In the input-triggered approach, only the newly inserted tuples flow in the query pipeline and tuple expiration is based on the timestamps of the newly inserted tuples. On the other hand, in the negative tuples approach, tuple expiration is separated from tuple insertion where a tuple flows in the pipeline for every inserted or expired tuple. The negative tuples approach avoids the unpredictable output delays that result from the input-triggered approach. However, negative tuples double the number of tuples through the query pipeline, thus reducing the pipeline bandwidth. Based on a detailed study of the incremental evaluation pipeline, we classify the incremental query operators into two classes according to whether an operator can avoid the processing of negative tuples or not. Based on this classification, we present several optimization techniques over the negative tuples approach that aim to reduce the overhead of processing negative tuples while avoiding the output delay of the query answer. A detailed experimental study, based on a prototype system implementation, shows the performance gains over the input-triggered approach of the negative tuples approach when accompanied with the proposed optimizations.
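The positive/negative tuple idea can be illustrated with a small sketch: a hypothetical operator that converts a timestamped input stream into the stream of insertions ('+') and expirations ('-') that downstream incremental operators consume. The names and the time-based window semantics here are our own simplification, not the paper's system:

```python
from collections import deque

def negative_tuple_stream(events, window):
    """Turn a timestamped (ts, value) stream into positive/negative
    tuples for a time-based sliding window: '+' when a tuple enters
    the window, '-' when its timestamp falls out of the window."""
    live = deque()   # tuples currently inside the window, oldest first
    out = []
    for ts, value in events:
        # Emit expirations for everything older than the window
        # before the new insertion.
        while live and live[0][0] <= ts - window:
            out.append(('-',) + live.popleft())
        live.append((ts, value))
        out.append(('+', ts, value))
    return out
```

A downstream incremental aggregate (e.g. COUNT) would add on each '+' and subtract on each '-'; the doubling of pipeline traffic the abstract mentions is visible in the output, which is what the proposed optimizations attack.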

84 citations


Proceedings ArticleDOI
07 Nov 2007
TL;DR: TERRASTREAM is a "pipelined" solution that consists of four main stages: construction of a digital elevation model (DEM), hydrological conditioning, extraction of river networks, and construction of a watershed hierarchy, which handles massive multi-gigabyte terrain data sets.
Abstract: We consider the problem of extracting a river network and a watershed hierarchy from a terrain given as a set of irregularly spaced points. We describe TERRASTREAM, a "pipelined" solution that consists of four main stages: construction of a digital elevation model (DEM), hydrological conditioning, extraction of river networks, and construction of a watershed hierarchy. Our approach has several advantages over existing methods. First, we design and implement the pipeline so that each stage is scalable to massive data sets; a single non-scalable stage would create a bottleneck and limit overall scalability. Second, we develop the algorithms in a general framework so that they work for both TIN and grid DEMs. Furthermore, TERRASTREAM is flexible and allows users to choose from various models and parameters, yet our pipeline is designed to reduce (or eliminate) the need for manual intervention between stages. We have implemented TERRASTREAM and we present experimental results on real elevation point sets, which show that our approach handles massive multi-gigabyte terrain data sets. For example, we can process a data set containing over 300 million points (over 20 GB of raw data) in under 26 hours, where most of the time (76%) is spent in the initial CPU-intensive DEM construction stage.

83 citations


Journal ArticleDOI
TL;DR: In this paper, a new application of neural networks (NN) for pipeline failure prediction is described, which shows higher correlations with recorded data when compared with the two existing statistical models, i.e., shifted time power model and shifted time exponential model.
Abstract: This paper describes investigations into the development of a new application of neural networks (NN) for prediction of pipeline failure. Results show higher correlations with recorded data when compared with the two existing statistical models: the shifted time power model gives results in total number of failures, and the shifted time exponential model gives results in number of failures per year. The database was large but neither complete nor fully accurate, and factors influencing pipeline deterioration were missing from it. Using the NN technique on this database produced models of pipeline failure, in terms of failures/km/year, that more closely matched the number of failures of a particular asset recorded for the period.

75 citations


Journal ArticleDOI
TL;DR: This paper identifies two basic methods for designing serial reduction circuits: the tree-traversal method and the striding method and proposes high-performance and area-efficient designs using each method.
Abstract: Field-programmable gate arrays (FPGAs) have become an attractive option for accelerating scientific applications. Many scientific operations such as matrix-vector multiplication and dot product involve the reduction of a sequentially produced stream of values. Unfortunately, because of the pipelining in FPGA-based floating-point units, data hazards may occur during these sequential reduction operations. Improperly designed reduction circuits can adversely impact the performance, impose unrealistic buffer requirements, and consume a significant portion of the FPGA. In this paper, we identify two basic methods for designing serial reduction circuits: the tree-traversal method and the striding method. Using accumulation as an example, we analyze the design trade-offs among the number of adders, buffer size, and latency. We then propose high-performance and area-efficient designs using each method. The proposed designs reduce multiple sets of sequentially delivered floating-point values without stalling the pipeline or imposing unrealistic buffer requirements. Using a Xilinx Virtex-II Pro FPGA as the target device, we implemented our designs and present performance and area results.
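To see why pipelined adder latency complicates stream reduction, consider a cycle-level software sketch loosely in the spirit of the tree-traversal method: operands for the same set are paired in a buffer and issued to a single adder with a multi-cycle latency, and returning partial sums re-enter as operands. All names and details here are our own simplification, not the paper's circuits:

```python
def simulate_reduction(values_per_set, latency):
    """Model one pipelined adder reducing several sets of sequentially
    arriving values without stalling. One value arrives per cycle."""
    arrivals = [(sid, v) for sid, vals in enumerate(values_per_set) for v in vals]
    counts = [len(vals) for vals in values_per_set]  # values left to fold per set
    buffer = {}        # set_id -> one waiting operand
    in_flight = []     # (finish_cycle, set_id, partial_sum)
    results = {}
    cycle = 0
    while len(results) < len(values_per_set):
        # Adder outputs returning this cycle re-enter as operands.
        ready = [t for t in in_flight if t[0] == cycle]
        in_flight = [t for t in in_flight if t[0] != cycle]
        incoming = [(sid, v) for _, sid, v in ready]
        if cycle < len(arrivals):
            incoming.append(arrivals[cycle])
        for sid, v in incoming:
            if counts[sid] == 1:          # last surviving value: set is done
                results[sid] = v
            elif sid in buffer:           # pair found: issue an add
                in_flight.append((cycle + latency, sid, buffer.pop(sid) + v))
                counts[sid] -= 1          # two operands fold into one
            else:
                buffer[sid] = v
        cycle += 1
    return results
```

The buffer holding unpaired operands is exactly the structure whose worst-case size the paper's designs bound; a naive design lets it grow with the number of interleaved sets in flight.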

75 citations


Journal ArticleDOI
01 Nov 2007
TL;DR: The approach achieves high memory performance on GPUs by tiling the computation and thereby improving the cache-efficiency, and uses this approach to improve the performance of GPU-based sorting, fast Fourier transform and dense matrix multiplication algorithms.
Abstract: We present cache-efficient algorithms for scientific computations using graphics processing units (GPUs). Our approach is based on mapping the nested loops in the numerical algorithms to the texture mapping hardware and efficiently utilizing GPU caches. This mapping exploits the inherent parallelism, pipelining and high memory bandwidth on GPUs. We further improve the performance of numerical algorithms by accounting for the same relative memory address accesses performed at data elements in nested loops. Based on the similarity of memory accesses performed at the data elements in the input array, we decompose the input arrays into sub-arrays with similar memory access patterns and execute on the sub-arrays for faster execution. Our approach achieves high memory performance on GPUs by tiling the computation and thereby improving the cache-efficiency. Overall, our formulation for GPU-based algorithms extends the current graphics runtime APIs without exposing the underlying hardware complexity to the programmer. This makes it possible to achieve portability and higher performance across different GPUs. We use this approach to improve the performance of GPU-based sorting, fast Fourier transform and dense matrix multiplication algorithms. We also compare our results with prior GPU-based and CPU-based implementations on high-end processors. In practice, we observe 2-10x improvement in performance.
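The tiling idea transfers directly to a CPU sketch: a blocked matrix multiply that works on tile-sized sub-arrays so each block is reused while cache-resident. This is an analogue of the paper's GPU texture-cache tiling, not its actual implementation; the tile size is illustrative:

```python
import numpy as np

def tiled_matmul(A, B, tile=32):
    """Blocked matrix multiply: C = A @ B computed tile-by-tile so each
    sub-block of A and B is reused while it stays in cache."""
    n, k = A.shape
    k2, m = B.shape
    assert k == k2, "inner dimensions must agree"
    C = np.zeros((n, m), dtype=A.dtype)
    for i0 in range(0, n, tile):
        for j0 in range(0, m, tile):
            for k0 in range(0, k, tile):
                # Accumulate the contribution of one tile pair.
                C[i0:i0 + tile, j0:j0 + tile] += (
                    A[i0:i0 + tile, k0:k0 + tile] @ B[k0:k0 + tile, j0:j0 + tile])
    return C
```

On a GPU of that era the same decomposition was expressed by mapping tiles to texture fetches; the access-pattern grouping of sub-arrays the abstract describes generalizes this beyond matrix multiply to sorting and FFT.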

Proceedings ArticleDOI
04 Jun 2007
TL;DR: This work introduces a heterogeneous multi-processor system using ASIPs as processing entities in a pipeline configuration, and proposes a heuristic to efficiently search the design space for a pipeline-based multi ASIP system.
Abstract: Multiprocessor SoC systems have led to the increasing use of parallel hardware along with the associated software. These approaches have included coprocessor, homogeneous processor (e.g. SMP) and application specific architectures (i.e. DSP, ASIC). ASIPs have emerged as a viable alternative to conventional processing entities (PEs) due to their configurability and programmability. In this work, we introduce a heterogeneous multi-processor system using ASIPs as processing entities in a pipeline configuration. A streaming application is taken and manually broken into a series of algorithmic stages (each of which makes up a stage in a pipeline). We formulate the problem of mapping each algorithmic stage in the system to an ASIP configuration, and propose a heuristic to efficiently search the design space for a pipeline-based multi-ASIP system. We have implemented the proposed heterogeneous multiprocessor methodology using a commercial extensible processor (Xtensa LX from Tensilica Inc.). We have evaluated our system by creating two benchmarks (MP3 and JPEG encoders) which are mapped to our proposed design platform. Our multiprocessor design provided a performance improvement of at least 4.11X (JPEG) and 3.36X (MP3) compared to the single processor design. The minimum cost obtained through our heuristic was within 5.47% and 5.74% of the best possible values for the JPEG and MP3 benchmarks respectively.

Proceedings ArticleDOI
22 Aug 2007
TL;DR: A dynamic path management scheme that exploits network traffic information during switch arbitration is proposed that improves the performance up to 30% while incurring only minimal area/power overhead.
Abstract: In modern multi-core system-on-chip (SoC) architectures, the design of innovative interconnection fabrics is indispensable. The concept of the network-on-chip (NoC) architecture has been proposed recently to better suit this requirement. Especially, the router architecture has a significant effect on the overall performance and energy consumption of the chip. We propose a dynamic path management scheme that exploits network traffic information during switch arbitration. Consequently, flits transferred across frequently used paths are expedited by traversing a reduced router pipeline. This technique, based on pipeline bypassing, is simulated and evaluated in terms of network latency and average power consumption. Simulation results with real-world application traces show that the architecture improves the performance up to 30% while incurring only minimal area/power overhead.

Journal ArticleDOI
TL;DR: This paper introduces a high-throughput asynchronous pipeline style, called high-capacity (HC) pipelines, targeted to datapaths that use dynamic logic, with a novel highly-concurrent handshake protocol, with fewer synchronization points between neighboring pipeline stages than almost all existing asynchronous dynamic pipelining approaches.
Abstract: This paper introduces a high-throughput asynchronous pipeline style, called high-capacity (HC) pipelines, targeted to datapaths that use dynamic logic. This approach includes a novel highly-concurrent handshake protocol, with fewer synchronization points between neighboring pipeline stages than almost all existing asynchronous dynamic pipelining approaches. Furthermore, the dynamic pipelines provide 100% buffering capacity, without explicit latches, by means of separate pullup and pulldown control for each pipeline stage: neighboring stages can store distinct data items, unlike almost all existing latchless dynamic asynchronous pipelines. As a result, very high throughput is obtained. Fabricated first-input-first-output (FIFO) designs, in 0.18-μm technology, were fully functional over a wide range of supply voltages (1.2 to over 2.5 V), exhibiting a corresponding range of throughputs from 1.0-2.4 giga items/s. In addition, an experimental finite-impulse response (FIR) filter chip was designed and fabricated with IBM Research, whose speed-critical core used an HC pipeline. The HC pipeline exhibited throughputs up to 1.8 giga items/s, and the overall filter achieved 1.32 giga items/s, thus obtaining 15% higher throughput and 50% lower latency than the fastest previously-reported synchronous FIR filter, also designed at IBM Research.


Patent
09 Jan 2007
TL;DR: In this article, the authors present a system and method for flow assurance and pipe condition monitoring in a pipeline for flowing hydrocarbons using at least one thermal sensor probe, in conjunction with one or more other sensors to accurately determine flow properties and/or pipeline condition.
Abstract: Embodiments of the present invention provide a system and method for flow assurance and pipe condition monitoring in a pipeline for flowing hydrocarbons using at least one thermal sensor probe, which may be used in conjunction with one or more other sensors to manage the sensing process and for data fusion to accurately determine flow properties and/or pipeline condition. By way of example, but not by way of limitation, in an embodiment of the present invention, a network of noninvasive sensors may provide output data that may be data-fused to determine properties of the pipeline and/or flow through the pipeline.

Proceedings ArticleDOI
25 Jun 2007
TL;DR: Division and square root algorithms are also described which take advantage of high-precision linear approximation hardware for obtaining a reciprocal or reciprocal square root approximation.
Abstract: The floating point unit of the next generation PowerPC is detailed. It has been tested at over 5 GHz. The design supports an extremely aggressive cycle time of 13 FO4 using a technology independent measure. For most dependent instructions, its fused multiply-add dataflow has only 6 effective pipeline stages. This is nearly equivalent to its predecessor, the Power 5, even though its technology independent frequency has increased over 70%. Overall the frequency has improved over 100%. It achieves this high performance through aggressive feedback paths, circuit design and layout. The pipeline has 7 stages but data may be fed back to dependent operations prior to rounding and complete normalization. Division and square root algorithms are also described which take advantage of high-precision linear approximation hardware for obtaining a reciprocal or reciprocal square root approximation.

Journal ArticleDOI
TL;DR: The hardware implementation of a simple, fast technique for depth estimation based on phase measurement, which avoids the problem of phase warping and is much less susceptible to camera noise and distortion than standard block-matching stereo systems is presented.
Abstract: We present the hardware implementation of a simple, fast technique for depth estimation based on phase measurement. This technique avoids the problem of phase warping and is much less susceptible to camera noise and distortion than standard block-matching stereo systems. The architecture exploits the parallel computing resources of FPGA devices to achieve a computation speed of 65 megapixels per second. For this purpose, we have designed a fine-grain pipeline structure that can be arranged with a customized frame-grabber module to process 52 frames per second at a resolution of 1280×960 pixels. We have measured the system's degradation due to bit quantization errors and compared its performance with other previous approaches. We have also used different Gabor-scale circuits, which can be selected by the user according to the application addressed and typical image structure in the target scenario.

Proceedings ArticleDOI
22 Aug 2007
TL;DR: This paper proposes a simple and effective linear pipeline architecture for trie-based IP lookup that achieves evenly distributed memory while realizing high throughput of one lookup per clock cycle and offers more freedom in mapping trie nodes to pipeline stages by supporting nops.
Abstract: Rapid growth in network link rates poses a strong demand on high speed IP lookup engines. Trie-based architectures are natural candidates for pipelined implementation to provide high throughput. However, simply mapping a trie level onto a pipeline stage results in unbalanced memory distribution over different stages. To address this problem, several novel pipelined architectures have been proposed. But their non-linear pipeline structure results in some new performance issues such as throughput degradation and delay variation. In this paper, we propose a simple and effective linear pipeline architecture for trie-based IP lookup. Our architecture achieves evenly distributed memory while realizing high throughput of one lookup per clock cycle. It offers more freedom in mapping trie nodes to pipeline stages by supporting nops. We implement our design as well as the state-of-the-art solutions on a commodity FPGA and evaluate their performance. Post place and route results show that our design can achieve a throughput of 80 Gbps, up to twice the throughput of reference solutions. It has constant delay, maintains input order, and supports incremental route updates without disrupting the ongoing IP lookup operations.
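The per-level work that a trie-based pipeline assigns to each stage can be sketched in software. This is a plain binary trie doing longest-prefix match, one level per step; the names are hypothetical, and the paper's contribution is the balanced stage mapping and nop support, not the trie itself:

```python
class TrieNode:
    """One node of a binary routing trie; each trie level corresponds
    to one pipeline stage in the hardware design."""
    __slots__ = ("children", "next_hop")
    def __init__(self):
        self.children = [None, None]
        self.next_hop = None

def insert(root, prefix_bits, next_hop):
    """Insert a route given its prefix as a '0'/'1' string."""
    node = root
    for b in prefix_bits:
        i = int(b)
        if node.children[i] is None:
            node.children[i] = TrieNode()
        node = node.children[i]
    node.next_hop = next_hop

def lookup(root, addr_bits):
    """Longest-prefix match, walking one trie level per step. In the
    pipelined hardware each step is a stage, so a new lookup can enter
    every clock cycle, giving one lookup per cycle throughput."""
    node, best = root, None
    for b in addr_bits:
        node = node.children[int(b)]
        if node is None:
            break
        if node.next_hop is not None:
            best = node.next_hop
    return best
```

Mapping each loop iteration to its own memory-plus-logic stage is what makes throughput independent of prefix length; the nops the paper adds let a node be placed in a later stage than its depth would dictate, which is how memory is balanced across stages.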

Journal ArticleDOI
TL;DR: This paper describes a 10-bit 205-MS/s pipeline analog-to-digital converter for flat panel display applications with the techniques to alleviate the design limitations in the deep-submicron CMOS process.
Abstract: This paper describes a 10-bit 205-MS/s pipeline analog-to-digital converter (ADC) for flat panel display applications with the techniques to alleviate the design limitations in the deep-submicron CMOS process. The switched source follower combined with a resistor-switch ladder eliminates the sampling switches and achieves high linearity for a large single-ended input signal. Multistage amplifiers adopting the complementary common-source topology increase the output swing range with lower transconductance variation and reduce the power consumption. The supply voltage for the analog blocks is provided by the low drop-out regulator for a high power-supply rejection ratio (PSRR) under the noisy operation environment. The pipeline stages of the ADC are optimized in the aspect of power consumption through the iterated calculation of the sampling capacitance and transconductance. The ADC occupies an active area of 1.0 mm² in a 90-nm CMOS process and achieves a 53-dB PSRR for a 100-MHz noise tone with the regulator and a 55.2-dB signal-to-noise-and-distortion ratio for a 30-MHz 1.0-VPP single-ended input at 205 MS/s. The ADC core dissipates 40 mW from a 1.0-V nonregulated supply voltage.

Journal ArticleDOI
TL;DR: This brief presents an application-specific instruction-set processor (ASIP) for real-time Retinex image and video filtering. Synthesized in CMOS technology, the ASIP stands out for its better energy-flexibility tradeoff versus reference ASIC and digital signal processing Retinex implementations.
Abstract: This brief presents an application-specific instruction-set processor (ASIP) for real-time Retinex image and video filtering. Design optimizations are addressed at algorithmic and architectural levels, the latter including a dedicated memory structure, an adapted pipeline, bypasses, a custom address generator and special looping structures. Synthesized in CMOS technology, the ASIP stands for its better energy-flexibility tradeoff versus reference ASIC and digital signal processing Retinex implementations.

Proceedings ArticleDOI
09 Mar 2007
TL;DR: This paper presents a design of a fast 2D-DCT hardware accelerator for a FPGA-based SoC and shows that this architecture provides optimal performance/area ratio with respect to several alternative designs.
Abstract: Multimedia applications, and in particular the encoding and decoding of standard image and video formats, are usually a typical target for systems-on-chip (SoC). The bi-dimensional discrete cosine transformation (2D-DCT) is a commonly used frequency transformation in graphic compression algorithms. Many hardware implementations, adopting disparate algorithms, have been proposed for field programmable gate arrays (FPGA). These designs focus either on performance or area, and often do not succeed in balancing the two aspects. In this paper, we present a design of a fast 2D-DCT hardware accelerator for a FPGA-based SoC. This accelerator makes use of a single seven-stage 1D-DCT pipeline able to alternate computation for the even and odd coefficients in every cycle. In addition, it uses special memories to perform the transpose operations. Our hardware takes 80 clock cycles at 107 MHz to generate a complete 8×8 2D DCT, from the writing of the first input sample to the reading of the last result (including the overhead of the interface logic). We show that this architecture provides optimal performance/area ratio with respect to several alternative designs.

Proceedings ArticleDOI
25 Jun 2007
TL;DR: The key objective is to see the effect of overclocking on superscalar processors for various benchmark applications, and analyze the associated overhead, in terms of extra hardware and error recovery penalty, when the clock frequency is adjusted dynamically.
Abstract: Synchronous circuits are typically clocked considering worst case timing paths so that timing errors are avoided under all circumstances. In the case of a pipelined processor, this has special implications since the operating frequency of the entire pipeline is limited by the slowest stage. Our goal, in this paper, is to achieve higher performance in superscalar processors by dynamically varying the operating frequency during run time past worst case limits. The key objective is to see the effect of overclocking on superscalar processors for various benchmark applications, and analyze the associated overhead, in terms of extra hardware and error recovery penalty, when the clock frequency is adjusted dynamically. We tolerate timing errors occurring at speeds higher than what the circuit is designed to operate at by implementing an efficient error detection and recovery mechanism. We also study the limitations imposed by minimum path constraints on our technique. Experimental results show that an average performance gain of up to 57% across all benchmark applications is achievable.

Journal ArticleDOI
TL;DR: A general mixed-integer nonlinear programming (MINLP) model is developed in this study to synthesize water networks in batch processes and is believed to be superior to the available ones.
Abstract: A general mixed-integer nonlinear programming (MINLP) model is developed in this study to synthesize water networks in batch processes. The proposed model formulation is believed to be superior to the available ones. In the past, the tasks of optimizing batch schedules, water-reuse subsystems, and wastewater treatment subsystems were performed individually. In this study, all three optimization problems are incorporated in the same mathematical programming model. By properly addressing the issue of interaction between subsystems, better overall designs can be generated. The resulting design specifications include the following: the production schedule, the number and sizes of buffer tanks, the physical configuration of the pipeline network, and the operating policies of water flows. The network structure can also be strategically manipulated by imposing suitable logic constraints. A series of illustrative examples are presented to demonstrate the effectiveness of the proposed approach.

Journal ArticleDOI
16 Jul 2007
TL;DR: This study shows that, in comparison to a conventional five-stage general-purpose processor, the FlexCore is up to 40% more efficient in terms of cycle count on a set of benchmarks from the embedded application domain and that both the fine grained control and the flexible interconnect contribute to the speedup.
Abstract: We introduce FlexCore, the first exemplar of an architecture based on the FlexSoC framework. Comprising the same datapath units found in a conventional five-stage pipeline, the FlexCore has an exposed datapath control and a flexible interconnect that allow the datapath to be dynamically reconfigured as a consequence of code generation. Additionally, the FlexCore allows specialized datapath units to be inserted and utilized within the same architecture and compilation framework. This study shows that, in comparison to a conventional five-stage general-purpose processor, the FlexCore is up to 40% more efficient in terms of cycle count on a set of benchmarks from the embedded application domain. We show that both the fine-grained control and the flexible interconnect contribute to the speedup. Furthermore, according to our VLSI implementation study, the FlexCore architecture offers both time and energy savings. The exposed FlexCore datapath requires a wide control word, and the conducted evaluation confirms that this increases the instruction bandwidth and memory footprint. This calls for efficient instruction decoding, as proposed in the FlexSoC framework.

Proceedings ArticleDOI
Mathys C. Walma1
24 Sep 2007
TL;DR: A method for pipelining the calculation of CRCs, such as ISO-3309 CRC32, that allows independent scaling of circuit frequency and data throughput by varying the data width and the number of pipeline stages, and allows calculation over data that is not the full width of the input.
Abstract: Traditional methods of calculating CRCs suffer from diminishing returns: doubling the data width does not double the maximum data throughput, because the worst-case timing path becomes slower. Feedback in the traditional implementation makes pipelining problematic. However, the on-chip data width used for high-throughput protocols is constantly increasing, driven in part by the battle to reduce static power consumption. This paper discusses a method for pipelining the calculation of CRCs, such as ISO-3309 CRC32. The method allows independent scaling of circuit frequency and data throughput by varying the data width and the number of pipeline stages, and pipeline latency can be traded for area while only slightly affecting timing. Additionally, it allows calculation over data that is not the full width of the input. This often happens at the end of a packet, although it can also happen mid-packet when data arrival is bursty. Finally, a fortunate side effect is the ability to efficiently update a known-good CRC value when a small subset of the data in the packet has changed, a function often desired in routers, for example when updating the TTL field in IPv4 packets.
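The "update a known-good CRC" side effect mentioned at the end follows from the linearity of CRCs over GF(2) and can be demonstrated in software. A minimal sketch, assuming the old and new packets have equal length (the helper name is ours, not the paper's):

```python
import zlib

def crc32_update_patch(old_crc, old_bytes, new_bytes):
    """Recompute a packet's CRC32 after a few bytes change, without
    touching the unchanged data.

    Uses the affine/linearity property of CRC32 over GF(2):
        crc(a ^ d) == crc(a) ^ crc(d) ^ crc(zeros)
    where d is the XOR difference between the packets and `zeros` is a
    zero buffer of the same length (it cancels the init/final XOR terms).
    """
    assert len(old_bytes) == len(new_bytes)
    diff = bytes(a ^ b for a, b in zip(old_bytes, new_bytes))
    zero = bytes(len(old_bytes))
    return old_crc ^ zlib.crc32(diff) ^ zlib.crc32(zero)
```

In hardware the same linearity is what lets the pipelined stages compute partial CRCs independently and merge them; here it yields the router-friendly "patch the TTL, fix the CRC" operation directly.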

Patent
31 Jan 2007
TL;DR: In this paper, the authors proposed a method of storing and transporting wind generated power in the form of compressed air energy via a pipeline, from a location where wind conditions are ideal, to a facility or community where energy is needed.
Abstract: The invention relates to a method of storing and transporting wind-generated power in the form of compressed air energy, via a pipeline, from a location where wind conditions are ideal to a facility or community where energy is needed. The method preferably comprises using at least one wind turbine to drive a compressor to compress air into storage, wherein the size and length of the pipeline can be adapted to reduce the pressure losses experienced along its length. The pipeline can be located on railroad ties or on the desert floor, or can be extended along paths where existing rights of way are provided. The facility or community can use the energy in the form of electricity, to drive pneumatic tools or equipment, or to generate chilled air as a by-product, which can be used for refrigeration, air conditioning or desalination. A utility or grid can be provided to generate compressed air energy when the wind is not blowing, wherein compressed air energy can be produced and stored during low-demand periods and used during high-demand periods.
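The claim that pipeline sizing reduces pressure losses can be illustrated with the Darcy-Weisbach equation. A minimal sketch under simplifying assumptions (constant friction factor, incompressible flow; a real compressed-air line would need a compressible-flow treatment, and all parameter values below are hypothetical):

```python
import math

def darcy_pressure_drop(length_m, diameter_m, mass_flow_kg_s,
                        density_kg_m3=1.2, friction_factor=0.02):
    """Darcy-Weisbach pressure drop along a pipe:
        dP = f * (L / D) * rho * v^2 / 2
    with the mean velocity v derived from the mass flow rate and the
    pipe cross-section. Friction factor is held constant here, which
    is a simplification (it really depends on Reynolds number and
    pipe roughness)."""
    area = math.pi * diameter_m ** 2 / 4.0
    velocity = mass_flow_kg_s / (density_kg_m3 * area)
    return friction_factor * (length_m / diameter_m) \
        * density_kg_m3 * velocity ** 2 / 2.0
```

At a fixed mass flow, velocity scales as 1/D², so the drop scales roughly as L/D⁵: modest increases in diameter buy large reductions in loss, which is the lever the patent's "size and length can be adapted" language points at.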


Proceedings ArticleDOI
04 Jul 2007
TL;DR: This paper bound the end-to-end delay of a job in a multistage pipeline as a function of higher-priority job execution times on different stages, so that the pipeline delay composition rule may be a step towards a general schedulability analysis foundation for large distributed systems.
Abstract: Uniprocessor schedulability theory made great strides, in part due to the simplicity of composing the delay of a job from the execution times of the higher-priority jobs that preempt it. In this paper, we bound the end-to-end delay of a job in a multistage pipeline as a function of higher-priority job execution times on different stages. We show that the end-to-end delay is bounded by that of a single virtual "bottleneck" stage plus a small additive component. This contribution effectively transforms the pipeline into a single-stage system, so that the wealth of schedulability analysis techniques derived for uniprocessors can be applied to decide the schedulability of the pipeline. The transformation does not require imposing artificial per-stage deadlines; rather, it models the pipeline as a whole and uses the end-to-end deadlines directly in the single-stage analysis. It also makes no assumptions on job arrival patterns or periodicity, and thus applies to periodic and aperiodic tasks alike. We show through simulations that this approach outperforms previous pipeline schedulability tests except for very short pipelines or when deadlines are sufficiently large; the reason lies in the way we account for execution overlap among stages. We discuss how previous approaches account for overlap and point out interesting differences that lead to different performance advantages in different cases. We hope that the pipeline delay composition rule derived in this paper may be a step towards a general schedulability analysis foundation for large distributed systems.
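The flavor of such a composition rule can be illustrated on a simplified FIFO pipeline (no preemption or priorities, so this is *not* the paper's theorem, only a sketch of the idea): the exact end-to-end makespan, computed by dynamic programming, is bounded by one "bottleneck" term per job plus one additive term per stage, and that bound is tighter than full serialization because stages overlap.

```python
def pipeline_makespan(exec_times):
    """Exact completion time of the last job through a FIFO pipeline.
    exec_times[i][s] = execution time of job i on stage s.
    A job starts stage s once it has left stage s-1 and the previous
    job has left stage s."""
    n, k = len(exec_times), len(exec_times[0])
    done = [[0.0] * k for _ in range(n)]
    for i in range(n):
        for s in range(k):
            ready = max(done[i - 1][s] if i else 0.0,
                        done[i][s - 1] if s else 0.0)
            done[i][s] = ready + exec_times[i][s]
    return done[-1][-1]

def composition_bound(exec_times):
    """Sub-additive bound in the spirit of a delay composition rule:
    each job contributes only its worst single stage, plus one additive
    term per stage (the worst job on that stage)."""
    per_job = sum(max(row) for row in exec_times)
    per_stage = sum(max(col) for col in zip(*exec_times))
    return per_job + per_stage
```

The bound is valid because any critical path through the job/stage grid enters each row from above at most once and each column from the left at most once, so every cell on it can be charged to either its job's maximum or its stage's maximum.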

Patent
22 Feb 2007
TL;DR: In this article, a vibratory pipeline diagnostic system is presented, which consists of at least one vibration generator (104) adapted to be affixed to a pipeline, at least 1 vibration sensor (107), and a processing device (111) in communication with the generator and sensor.
Abstract: A vibratory pipeline diagnostic system (100) is provided. The system (100) comprises at least one vibration generator (104) adapted to be affixed to a pipeline, at least one vibration sensor (107) adapted to be affixed to the pipeline, and a processing device (111) in communication with the at least one vibration generator (104) and the at least one vibration sensor (107). The processing device (111) is configured to vibrate a portion of the pipeline using the at least one vibration generator (104), receive a vibrational response to the vibration from the at least one vibration sensor (107), compare the vibrational response to one or more previous vibrational responses of the pipeline, and indicate a fault condition if the vibrational response differs from the one or more previous vibrational responses by more than a predetermined tolerance.
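The comparison step performed by the processing device can be sketched in software. A minimal sketch, assuming the responses are equal-length sample vectors and that a fault means the new response deviates from every stored baseline by more than the tolerance (the relative-RMS metric and the function names are our assumptions, not the patent's):

```python
def rel_rms_diff(a, b):
    """Relative RMS difference between a response and a baseline."""
    num = sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    den = sum(y ** 2 for y in b) ** 0.5 or 1.0
    return num / den

def fault_detected(response, baselines, tolerance):
    """Indicate a fault when the new vibrational response differs from
    every previously recorded baseline by more than `tolerance`.
    Matching any one baseline (e.g. a known-good operating mode)
    is treated as healthy."""
    return all(rel_rms_diff(response, b) > tolerance for b in baselines)
```

A deployment would more likely compare frequency-domain features than raw samples, but the tolerance-threshold logic against stored responses is the same.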