
Showing papers on "Pipeline (computing) published in 2009"


Proceedings ArticleDOI
01 Sep 2009
TL;DR: A system that can match and reconstruct 3D scenes from extremely large collections of photographs such as those found by searching for a given city on Internet photo sharing sites and is designed to scale gracefully with both the size of the problem and the amount of available computation.
Abstract: We present a system that can match and reconstruct 3D scenes from extremely large collections of photographs such as those found by searching for a given city (e.g., Rome) on Internet photo sharing sites. Our system uses a collection of novel parallel distributed matching and reconstruction algorithms, designed to maximize parallelism at each stage in the pipeline and minimize serialization bottlenecks. It is designed to scale gracefully with both the size of the problem and the amount of available computation. We have experimented with a variety of alternative algorithms at each stage of the pipeline and report on which ones work best in a parallel computing environment. Our experimental results demonstrate that it is now possible to reconstruct cities consisting of 150K images in less than a day on a cluster with 500 compute cores.
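
As a toy illustration of the stage-level parallelism the abstract describes (not the authors' system: the images, the candidate-pair set, and the match function below are invented placeholders), image-pair matching can be distributed across independent workers:

```python
# Hypothetical sketch: distribute candidate image-pair matching across
# worker processes, the kind of stage-level parallelism the paper's
# matching pipeline maximizes. The match function is a dummy stand-in.
from itertools import combinations
from multiprocessing import Pool

def match_pair(pair):
    """Placeholder for feature matching between two images.

    A real system would compare local feature descriptors; here we
    return the pair with a deterministic dummy "match count".
    """
    a, b = pair
    return (a, b, sum(map(ord, a + b)) % 100)

if __name__ == "__main__":
    images = [f"img_{i:05d}.jpg" for i in range(100)]
    pairs = list(combinations(images, 2))  # pruned in a real pipeline
    with Pool() as pool:
        # Workers match pairs independently; no serialization point
        # exists until the results are aggregated.
        results = pool.map(match_pair, pairs, chunksize=64)
    verified = [r for r in results if r[2] > 50]
    print(f"{len(verified)} of {len(pairs)} pairs kept")
```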

1,454 citations


Proceedings ArticleDOI
Mark T. Bohr
29 May 2009
TL;DR: The new era of microprocessor scaling is a system-on-a-chip approach that combines a diverse set of components using adaptive circuits, integrated sensors, sophisticated power-management techniques, and increased parallelism to build products that are many-core, multi-core, and multi-function.
Abstract: The time has passed when traditional MOSFET scaling techniques were adequate to meet the needs of microprocessor products, but that has not meant the end of Moore's Law nor the end of improvements in microprocessor performance and power. In the new era of device scaling, innovations in materials and device structure are just as important as dimensional scaling. The past trend of using smaller transistors to build larger microprocessor cores operating at higher frequency and consuming more power is also at an end. The new era of microprocessor scaling is a system-on-a-chip approach that combines a diverse set of components using adaptive circuits, integrated sensors, sophisticated power-management techniques, and increased parallelism to build products that are many-core, multi-core, and multi-function. Although many promising technologies and device options are in the research pipeline, we need to recognize that we are doing system integration, and the future challenge we face is learning how to integrate an ever wider range of heterogeneous elements.

172 citations


Journal ArticleDOI
TL;DR: The mechanistic model provides several advantages over prior modeling approaches, and, when estimating performance, it differs from detailed simulation of a 4-wide out-of-order processor by an average of 7%.
Abstract: A mechanistic model for out-of-order superscalar processors is developed and then applied to the study of microarchitecture resource scaling. The model divides execution time into intervals separated by disruptive miss events such as branch mispredictions and cache misses. Each type of miss event results in characterizable performance behavior for the execution time interval. By considering an interval's type and length (measured in instructions), execution time can be predicted for the interval. Overall execution time is then determined by aggregating the execution time over all intervals. The mechanistic model provides several advantages over prior modeling approaches, and, when estimating performance, it differs from detailed simulation of a 4-wide out-of-order processor by an average of 7%. The mechanistic model is applied to the general problem of resource scaling in out-of-order superscalar processors. First, we use the model to determine size relationships among microarchitecture structures in a balanced processor design. Second, we use the mechanistic model to study scaling of both pipeline depth and width in balanced processor designs. We corroborate previous results in this area and provide new results. For example, we show that at optimal design points, the pipeline depth times the square root of the processor width is nearly constant. Finally, we consider the behavior of unbalanced, overprovisioned processor designs based on insight gained from the mechanistic model. We show that in certain situations an overprovisioned processor may lead to improved overall performance. Designs where a processor's dispatch width is wider than its issue width are of particular interest.
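
A minimal sketch of the interval mechanics described above; the instruction counts and miss penalties here are invented for illustration, not taken from the paper:

```python
# Interval model sketch: execution time accumulates per interval as a
# steady-state term (instructions / dispatch width) plus the penalty of
# the miss event that ends the interval. Numbers below are made up.
intervals = [
    # (instructions in interval, miss event, penalty in cycles)
    (12_000, None,                0),
    (   300, "branch_mispredict", 15),
    ( 8_000, "L2_miss",           200),
    ( 5_000, None,                0),
]

WIDTH = 4  # a 4-wide out-of-order processor, as in the paper's validation

cycles = sum(n / WIDTH + penalty for n, _, penalty in intervals)
print(f"predicted cycles: {cycles:.0f}")
```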

168 citations


Proceedings ArticleDOI
12 Dec 2009
TL;DR: This work focuses on the design of a programmable, low-power accelerator for multimedia algorithms referred to as a Polymorphic Pipeline Array, or PPA, which is designed with flexibility and programmability as first-order requirements to enable the hardware to be dynamically customizable to the application.
Abstract: Mobile computing in the form of smart phones, netbooks, and personal digital assistants has become an integral part of our everyday lives. Moving ahead to the next generation of mobile devices, we believe that multimedia will become a more critical and product-differentiating feature. High definition audio and video as well as 3D graphics provide richer interfaces and compelling capabilities. However, these algorithms also bring different computational challenges than wireless signal processing. Multimedia algorithms are more complex, featuring more control flow and variable computational requirements, and their execution time is not dominated by innermost vector loops. Data access is also more complex: media applications typically operate on multi-dimensional vectors of data rather than single-dimensional vectors with simple strides. Thus, the design of current mobile platforms requires re-examination to account for these new application domains. In this work, we focus on the design of a programmable, low-power accelerator for multimedia algorithms referred to as a Polymorphic Pipeline Array, or PPA. The PPA is designed with flexibility and programmability as first-order requirements to enable the hardware to be dynamically customizable to the application. PPAs exploit pipeline parallelism found in streaming applications to create a coarse-grain hardware pipeline to execute streaming media applications. PPA resources are allocated to each stage depending on its size and ability to exploit fine-grain parallelism. Experimental results show that real-time media applications can take advantage of the static and dynamic configurability for increased power efficiency.

160 citations


Journal ArticleDOI
TL;DR: This paper presents a high-throughput decoder design for the Quasi-Cyclic (QC) Low-Density Parity-Check (LDPC) codes, and two new techniques are proposed, including parallel layered decoding architecture (PLDA) and critical path splitting.
Abstract: This paper presents a high-throughput decoder design for Quasi-Cyclic (QC) Low-Density Parity-Check (LDPC) codes. Two new techniques are proposed: parallel layered decoding architecture (PLDA) and critical path splitting. PLDA enables parallel processing for all layers by establishing dedicated message passing paths among them, allowing the decoder to avoid a large crossbar-based interconnect network. The critical path splitting technique is based on careful adjustment of the starting point of each layer to maximize the time intervals between adjacent layers, such that the critical path delay can be split into pipeline stages. Furthermore, min-sum and loosely coupled algorithms are employed for area efficiency. As a case study, a rate-1/2 2304-bit irregular LDPC decoder is implemented in a 90 nm CMOS ASIC process. The decoder achieves a maximum decoding throughput of 2.2 Gbps at 10 iterations. The operating frequency is 950 MHz after synthesis and the chip area is 2.9 mm².
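
For context, here is a generic scaled min-sum check-node update, the textbook algorithm underlying the decoder's min-sum choice; this is not the paper's hardware datapath, and the scaling factor is an assumed typical value:

```python
# Scaled min-sum check-node update (textbook form). For each edge, the
# outgoing message is the product of the signs of all *other* incoming
# messages times the minimum of their magnitudes, damped by a scale.
def check_node_update(msgs, scale=0.75):
    out = []
    for i in range(len(msgs)):
        others = msgs[:i] + msgs[i + 1:]
        sign = 1
        for m in others:
            if m < 0:
                sign = -sign
        out.append(scale * sign * min(abs(m) for m in others))
    return out

# Incoming log-likelihood ratios from four variable nodes:
print(check_node_update([2.0, -0.5, 1.5, -3.0]))
```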

130 citations


Journal ArticleDOI
TL;DR: The IBM System z10™ microprocessor is currently the fastest running 64-bit CISC (complex instruction set computer) microprocessor and implements new architectural features that allow better software optimization across compiled applications.
Abstract: The IBM System z10™ microprocessor is currently the fastest running 64-bit CISC (complex instruction set computer) microprocessor. This microprocessor operates at 4.4 GHz and provides up to two times performance improvement compared with its predecessor, the System z9® microprocessor. In addition to its ultrahigh-frequency pipeline, the z10™ microprocessor offers such performance enhancements as a sophisticated branch-prediction structure, a large second-level private cache, a data-prefetch engine, and a hardwired decimal floating-point arithmetic unit. The z10 microprocessor also implements new architectural features that allow better software optimization across compiled applications. These features include new instructions that help shorten the code path lengths and new facilities for software-directed cache management and the use of 1-MB virtual pages. The innovative microarchitecture of the z10 microprocessor and notable differences from its predecessors and the IBM POWER6™ microprocessor are discussed.

114 citations


Journal ArticleDOI
TL;DR: This paper describes a digitally calibrated pipeline analog-to-digital converter (ADC) implemented in 90 nm CMOS technology with a 1.2 V supply voltage that achieves 73 dB SNR and 90 dB SFDR at 100 MS/s sampling rate and 250 mW power consumption.
Abstract: This paper describes a digitally calibrated pipeline analog-to-digital converter (ADC) implemented in 90 nm CMOS technology with a 1.2 V supply voltage. A digital background calibration algorithm reduces the linearity requirements in the first stage of the pipeline chain. Range scaling in the first pipeline stage enables a maximum 1.6 Vpp input signal swing, and a charge-reset switch eliminates ISI-induced distortion. The 14b ADC achieves 73 dB SNR and 90 dB SFDR at a 100 MS/s sampling rate and 250 mW power consumption. The 73 dB SNDR performance is maintained within 3 dB up to the Nyquist input frequency, and the FOM is 0.68 pJ per conversion-step.
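
The reported figure of merit can be checked against the abstract's own numbers, assuming the conventional definition FOM = P / (2^ENOB · fs) with ENOB derived from the SNDR:

```python
# Recomputing the ADC figure of merit from the reported numbers.
P    = 250e-3   # power, W
fs   = 100e6    # sampling rate, samples/s
sndr = 73.0     # dB

enob = (sndr - 1.76) / 6.02   # ~11.8 effective bits
fom  = P / (2 ** enob * fs)   # J per conversion-step
print(f"ENOB = {enob:.2f} bits, FOM = {fom * 1e12:.2f} pJ/step")
# -> about 0.68 pJ/step, matching the reported value
```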

91 citations


Proceedings ArticleDOI
04 Mar 2009
TL;DR: The "Georgia Computes!" alliance, funded by the National Science Foundation's Broadening Participation in Computing program, seeks to improve the computing education pipeline in Georgia.
Abstract: Computing education suffers from low enrollment and a lack of diversity. Both of these problems require changes across the entire computing education pipeline. The "Georgia Computes!" alliance, funded by the National Science Foundation's Broadening Participation in Computing program, seeks to improve the computing education pipeline in Georgia. "Georgia Computes!" is having a measurable effect at each stage of the pipeline, but has not yet shown an impact across the whole pipeline.

84 citations


Proceedings ArticleDOI
06 Mar 2009
TL;DR: A new technique, called Common Activity-based Model for Power (CAMP), is proposed to estimate activity factors and power for microarchitectural structures, using relatively few input parameters based on general microprocessor utilization statistics.
Abstract: Microprocessor power has become a first-order constraint at run-time. Designers must employ aggressive power-management techniques at run-time to keep a processor's ballooning power requirements under control. Effective power management benefits from knowledge of run-time microprocessor power consumption in both the core and individual microarchitectural structures, such as caches, queues, and execution units. Increasingly feasible per-structure power-control techniques, such as fine-grain clock gating, power gating, and dynamic voltage/frequency scaling (DVFS), become more effective with run-time estimates of per-structure power. However, run-time computation of per-structure power estimates based on utilization requires daunting numbers of input statistics, which makes per-structure monitoring of run-time power a challenging problem. To address the challenges of estimating per-structure power in hardware, we propose a new technique, called Common Activity-based Model for Power (CAMP), to estimate activity factors and power for microarchitectural structures. Despite using relatively few input parameters (specifically, nine) based on general microprocessor utilization statistics (e.g., IPC and load rate), our linear-regression-based model estimates activity and dynamic power for over 100 structures in an out-of-order x86 pipeline, and core power, with an average error of 8%. Because the computations use few inputs, CAMP is simple enough to implement in hardware, providing run-time structure and core power estimates for dynamic power management. Because the input statistics are generic in nature and the model remains accurate across incremental microarchitectural refinements, CAMP provides simple, intuitive equations relating global microarchitectural statistics to structure activity and power. These equations provide a simple technique that can relate changes in one structure's activity to power variations in other structures across the pipeline.
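
A minimal sketch of the modeling approach, assuming ordinary least-squares fitting; the statistics, coefficients, and data below are invented stand-ins rather than CAMP's actual nine inputs:

```python
# Fit a linear model from a handful of global utilization statistics to
# one structure's power, then estimate power at run time with a single
# dot product -- the property that makes the model hardware-friendly.
import numpy as np

rng = np.random.default_rng(0)

# Rows: sampled execution windows. Columns: nine global stats
# (stand-ins for quantities such as IPC and load rate).
X = rng.random((200, 9))
true_w = rng.random(9)
y = X @ true_w + 0.01 * rng.standard_normal(200)  # "measured" power

w, *_ = np.linalg.lstsq(X, y, rcond=None)  # trained once, offline

window = rng.random(9)          # run-time statistics for one window
print("estimated power:", float(window @ w))
```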

84 citations


Journal ArticleDOI
TL;DR: The discrete Fourier transform (DFT) matrix factorization based on the Kronecker product is proposed to express the family of radix-rᵏ single-path delay commutator/single-path delay feedback (SDC/SDF) pipeline fast Fourier transform (FFT) architectures.
Abstract: This paper proposes to use the discrete Fourier transform (DFT) matrix factorization based on the Kronecker product to express the family of radix-rᵏ single-path delay commutator/single-path delay feedback (SDC/SDF) pipeline fast Fourier transform (FFT) architectures. The matricial expressions of the radix r, r², r³, and r⁴ decimation-in-frequency (DIF) SDC/SDF pipeline architectures are derived. These expressions can be written using a small set of operators, resulting in a compact representation of the algorithms. The derived expressions are general in terms of r and the number of points of the FFT, N. Expressions are given where it is not necessary that N is a power of rᵏ. The proposed set of operators can be mapped to equivalent hardware circuits. Thus, the designer can easily go from the matricial representations to their implementations and vice versa. As an example, the mapping of the operators is shown for radix 2, 2², 2³, and 2⁴, and the details of the corresponding SDC/SDF pipeline FFT architectures are presented. Furthermore, a general expression is given for the SDC/SDF radix-rᵏ pipeline architectures when k > 4. This general expression helps the designer to efficiently handle a wider design exploration space and select the optimum single-path architecture for a given value of N.
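
For reference, the standard radix-2 DIF Cooley-Tukey factorization of the DFT matrix in Kronecker form; the notation below is ours, and the paper's operator set generalizes this scheme to radix rᵏ:

```latex
\documentclass{article}
\usepackage{amsmath}
\begin{document}
% Radix-2 DIF step: butterflies, twiddle factors, two half-size DFTs,
% then an even/odd output permutation $P_N$.
\[
  F_N = P_N \left( I_2 \otimes F_{N/2} \right) T_N
        \left( F_2 \otimes I_{N/2} \right),
  \qquad
  T_N = \begin{pmatrix} I_{N/2} & 0 \\ 0 & \Omega_{N/2} \end{pmatrix},
\]
\[
  \Omega_{N/2} = \operatorname{diag}\!\left(1, \omega_N, \dots,
                 \omega_N^{N/2-1}\right),
  \qquad \omega_N = e^{-2\pi i / N}.
\]
\end{document}
```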

83 citations


Journal ArticleDOI
TL;DR: An H.264/AVC baseline-profile real-time encoder for HDTV-1080p at 30 fps is proposed in this paper, and the design considerations for its chief components are described, including high-throughput integer motion estimation, data-reusing fractional motion estimation, and hardware-friendly mode reduction for intra prediction.
Abstract: An H.264/AVC baseline-profile real-time encoder for HDTV-1080p at 30 fps is proposed in this paper. On the basis of the specifications and algorithm optimizations, the dedicated hardware engines and one 32-bit media embedded processor (MeP) equipped with hardware extensions are mapped onto a three-stage macroblock pipelining system architecture. This paper describes the design considerations for the chief components, including high-throughput integer motion estimation, data-reusing fractional motion estimation, and hardware-friendly mode reduction for intra prediction. An 11.5 Gbps, 64 Mb system-in-silicon DRAM is embedded to alleviate the external memory bandwidth. Using TSMC one-poly six-metal 0.18 μm CMOS technology, the prototype chip is implemented with 1140k logic gates and 108.3 KB of internal SRAM. The SoC core occupies a 27.1 mm² die area and consumes 1.41 W at 200 MHz under typical operating conditions.

Journal ArticleDOI
TL;DR: In this paper, the pig position, the optimum upstream flow rate, and the time at which the pig reaches the end of the pipeline are obtained by comparing simulation results with field data for liquid flow through the pipeline from KG to AG, located in Iran.

Journal ArticleDOI
29 Oct 2009 - Wear
TL;DR: In this article, a systematic study of pipeline steel degradation due to erosion-corrosion in a sand-containing, CO₂-saturated environment has been carried out, focusing on the total material loss, corrosion, erosion, and their interactions (synergy) as a function of environmental parameters (temperature, flow velocity, and sand content).

Proceedings ArticleDOI
Ying Yi, Wei Han, Xin Zhao, Ahmet T. Erdogan, Tughrul Arslan
20 Apr 2009
TL;DR: The results demonstrate that the proposed technique is able to generate high-quality mappings of realistic applications on the target multi-core architecture, achieving up to 1.3× parallel efficiency by employing only two dynamically reconfigurable processor cores.
Abstract: Multi-core architectures are increasingly being adopted in the design of emerging complex embedded systems. Key issues of designing such systems are on-chip interconnects, memory architecture, and task mapping and scheduling. This paper presents an integer linear programming formulation for the task mapping and scheduling problem. The technique incorporates profiling-driven loop level task partitioning, task transformations, functional pipelining, and memory architecture aware data mapping to reduce system execution time. Experiments are conducted to evaluate the technique by implementing a series of DSP applications on several multi-core architectures based on dynamically reconfigurable processor cores. The results demonstrate that the proposed technique is able to generate high-quality mappings of realistic applications on the target multi-core architecture, achieving up to 1.3× parallel efficiency by employing only two dynamically reconfigurable processor cores.
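
The paper formulates mapping and scheduling as an integer linear program; the exhaustive toy search below only makes the objective concrete (minimum makespan over task-to-core assignments), with invented task costs and with dependencies and communication ignored:

```python
# Enumerate task-to-core assignments and keep the one with the smallest
# makespan. Real formulations add precedence, communication, and memory
# constraints and are solved with an ILP solver rather than brute force.
from itertools import product

task_cost = {"fir": 40, "fft": 90, "viterbi": 70, "scale": 20}
CORES = 2  # two reconfigurable cores, as in the result quoted above

best = None
for assign in product(range(CORES), repeat=len(task_cost)):
    load = [0] * CORES
    for core, cost in zip(assign, task_cost.values()):
        load[core] += cost
    makespan = max(load)
    if best is None or makespan < best[0]:
        best = (makespan, assign)

print("best makespan:", best[0], "assignment:", best[1])
```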

Journal ArticleDOI
TL;DR: The R2²SDF was more efficient than the R4SDC in terms of throughput per area due to a simpler controller and an easier balanced rounding scheme, and it is shown that balanced stage rounding is an appropriate rounding scheme for pipeline FFT processors.
Abstract: This paper presents optimized implementations of two different pipeline FFT processors on Xilinx Spartan-3 and Virtex-4 FPGAs. Different optimization techniques and rounding schemes were explored. The implementation results achieved better performance with lower resource usage than prior art. The 16-bit 1024-point FFT with the R2²SDF architecture had a maximum clock frequency of 95.2 MHz and used 2802 slices on the Spartan-3, a throughput per area ratio of 0.034 Msamples/s/slice. The R4SDC architecture ran at 123.8 MHz and used 4409 slices on the Spartan-3, a throughput per area ratio of 0.028 Msamples/s/slice. On Virtex-4, the 16-bit 1024-point R2²SDF architecture ran at 235.6 MHz and used 2256 slices, giving a 0.104 Msamples/s/slice ratio; the 16-bit 1024-point R4SDC architecture ran at 219.2 MHz and used 3064 slices, giving a 0.072 Msamples/s/slice ratio. The R2²SDF was more efficient than the R4SDC in terms of throughput per area due to a simpler controller and an easier balanced rounding scheme. This paper also shows that balanced stage rounding is an appropriate rounding scheme for pipeline FFT processors.
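
Because these single-path pipelines accept one sample per clock, the reported throughput-per-area ratios follow directly from the clock rates and slice counts:

```python
# Reproducing the abstract's throughput-per-area figures: at one sample
# per clock, throughput in Msamples/s equals the clock in MHz.
designs = {
    "R2^2SDF Spartan-3": (95.2, 2802),
    "R4SDC   Spartan-3": (123.8, 4409),
    "R2^2SDF Virtex-4":  (235.6, 2256),
    "R4SDC   Virtex-4":  (219.2, 3064),
}
for name, (mhz, slices) in designs.items():
    print(f"{name}: {mhz / slices:.3f} Msamples/s/slice")
# -> 0.034, 0.028, 0.104, 0.072, matching the reported ratios
```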

Journal ArticleDOI
TL;DR: In this article, the authors present a flexible multiprocessor platform for high throughput turbo decoding using configurable application-specific instruction set processors (ASIP) combined with an efficient memory and communication interconnect scheme.
Abstract: Emerging digital communication applications and the underlying architectures encounter drastically increasing performance and flexibility requirements. In this paper, we present a novel flexible multiprocessor platform for high-throughput turbo decoding. The proposed platform enables exploiting all parallelism levels of turbo decoding applications to fulfill performance requirements. In order to fulfill flexibility requirements, the platform is structured around configurable application-specific instruction-set processors (ASIPs) combined with an efficient memory and communication interconnect scheme. The designed ASIP has a single-instruction multiple-data (SIMD) architecture with a specialized and extensible instruction set and 6-stage pipeline control. The attached memories and communication interfaces enable its integration in multiprocessor architectures. These multiprocessor architectures benefit from the recent shuffled decoding technique introduced in the turbo-decoding field to achieve higher throughput. The major characteristics of the proposed platform are its flexibility and scalability, which make it reusable for all simple and double binary turbo codes of existing and emerging standards. Results obtained for double binary WiMAX turbo codes demonstrate around 250 Mb/s throughput using a 16-ASIP multiprocessor architecture.

Patent
18 Aug 2009
TL;DR: In this article, a method for quality objective-based ETL pipeline optimization is provided, where an improvement objective is obtained from user input into a computing system, which represents a priority optimization desired by a user for improved ETL flows for an application designed to run in memory of the computing system.
Abstract: A method for quality objective-based ETL pipeline optimization is provided. An improvement objective is obtained from user input into a computing system. The improvement objective represents a priority optimization desired by a user for improved ETL flows for an application designed to run in memory of the computing system. An ETL flow is created in the memory of the computing system. The ETL flow is restructured for flow optimization with a processor of the computing system. The flow restructuring is based on the improvement objective. Flow restructuring can include application of flow rewriting optimization or application of an algebraic rewriting optimization. The optimized ETL flow is stored as executable code on a computer readable storage medium.

Book ChapterDOI
22 Apr 2009
TL;DR: A model is proposed that specifies the local execution context of a basic block as a set of parameters; the block's execution time can then be computed as a function of these parameters and used for computing the Worst-Case Execution Time of the program.
Abstract: The static analysis of the execution time of a program (i.e. the evaluation of this time for any input data set) can be useful for the purpose of optimizing the code or verifying that strict real-time deadlines can be met. This analysis generally goes through determining the execution times of partial execution paths, typically basic blocks. Now, as soon as the target processor architecture features a superscalar pipeline, possibly with dynamic instruction scheduling, the execution time of a basic block highly depends on the pipeline state, that is, on the instructions executed before it. In this paper, we propose a model to specify the local execution context of a basic block as a set of parameters. The execution time of the block can then be computed as a function of these parameters. We show how this model can be used to determine an upper bound on the execution time of a basic block, which can be used for computing the Worst-Case Execution Time of the program. Experimental results give an insight into the tightness of the estimations.
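
A sketch of the parametric idea with a hypothetical timing function; the context parameters below are invented, whereas the paper's parameters characterize actual pipeline state:

```python
# A basic block's execution time as a function of context parameters;
# an upper bound is the maximum over the parameter space.
from itertools import product

def block_time(load_hits: bool, free_issue_slots: int) -> int:
    """Hypothetical cycle count for one basic block."""
    cycles = 12
    if not load_hits:
        cycles += 100                       # memory-miss penalty
    cycles += max(0, 4 - free_issue_slots)  # stall on a busy window
    return cycles

wcet_bound = max(block_time(h, s)
                 for h, s in product([True, False], range(8)))
print("upper bound for this block:", wcet_bound, "cycles")
```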

Patent
13 Jul 2009
TL;DR: In this article, the authors proposed a semi-column-parallel pipeline architecture for analog-to-digital converters, which allows multiple column output lines to share an analog-to-digital converter.
Abstract: An imaging device with a semi-column-parallel pipeline analog-to-digital converter architecture. The semi-column-parallel pipeline architecture allows multiple column output lines to share an analog-to-digital converter. Analog-to-digital conversions are performed in a pipelined manner to reduce the conversion time, which results in shorter row times and increased frame rates and data throughput. The architecture also relaxes the pitch of the analog-to-digital converters, which allows high-performance, high-resolution analog-to-digital converters to be used. As such, the semi-column-parallel pipeline architecture overcomes the shortcomings of the typical serial and column-parallel architectures.

Proceedings ArticleDOI
04 Oct 2009
TL;DR: Microarchitecture approaches to pipelining and memory hierarchy that deliver repeatable timing and promise comparable or better performance compared to established techniques are described.
Abstract: This paper argues that repeatable timing is more important and more achievable than predictable timing. It describes microarchitecture approaches to pipelining and memory hierarchy that deliver repeatable timing and promise comparable or better performance compared to established techniques. Specifically, threads are interleaved in a pipeline to eliminate pipeline hazards, and a hierarchical memory architecture is outlined that hides memory latencies.
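
The pipelining approach can be sketched as a simplified software model (ours, not the paper's microarchitecture): with as many hardware threads as stages, adjacent stages always hold different threads, so intra-thread hazards cannot arise and per-thread timing becomes repeatable:

```python
# Round-robin thread interleaving through a 5-stage pipeline.
STAGES = 5
THREADS = 5  # one thread per stage

pipeline = [None] * STAGES
for cycle in range(8):
    # Fetch from the next thread; everything else shifts down a stage.
    pipeline = ["T%d" % (cycle % THREADS)] + pipeline[:-1]
    print(f"cycle {cycle}: {pipeline}")
    # A thread re-enters fetch only after its previous instruction has
    # left the pipeline, so no forwarding or interlocks are needed.
```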

Patent
09 Jan 2009
TL;DR: In this paper, a basket calculation engine is deployed to receive a stream of data and accelerate the computation of basket values based on that data, which is used to process financial market data to compute the net asset values (NAVs) of financial instrument baskets.
Abstract: A basket calculation engine is deployed to receive a stream of data and accelerate the computation of basket values based on that data. In a preferred embodiment, the basket calculation engine is used to process financial market data to compute the net asset values (NAVs) of financial instrument baskets. The basket calculation engine can be deployed on a coprocessor and can also be realized via a pipeline, the pipeline preferably comprising a basket association lookup module and a basket value updating module. The coprocessor is preferably a reconfigurable logic device such as a field programmable gate array (FPGA).
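
An illustrative software analogue of the two pipeline modules named above, with invented baskets and weights (the patented engine targets reconfigurable hardware, not Python):

```python
# Module 1: basket-association lookup (instrument -> baskets holding
# it). Module 2: incremental basket-value update on each price tick.
from collections import defaultdict

weights = {  # basket -> {instrument: share count}
    "TECH": {"AAA": 10, "BBB": 5},
    "BLUE": {"BBB": 8, "CCC": 12},
}
holders = defaultdict(list)            # association table, built once
for basket, members in weights.items():
    for sym in members:
        holders[sym].append(basket)

nav = {b: 0.0 for b in weights}
last = defaultdict(float)

def on_tick(sym: str, price: float) -> None:
    """Update the NAV of every basket containing `sym`."""
    delta = price - last[sym]
    last[sym] = price
    for basket in holders[sym]:
        nav[basket] += weights[basket][sym] * delta

for sym, px in [("AAA", 3.0), ("BBB", 2.0), ("BBB", 2.5), ("CCC", 1.0)]:
    on_tick(sym, px)
print(nav)  # {'TECH': 42.5, 'BLUE': 32.0}
```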

Journal ArticleDOI
TL;DR: A control theoretic approach to dynamic voltage/frequency scaling for data-flow models of computations mapped to multiprocessor systems-on-chip architectures is presented and nonlinear control approaches to deal with general streaming applications containing both pipeline and parallel stages are discussed.
Abstract: Runtime frequency and voltage adaptation has become very attractive for current and next generation embedded multicore platforms because it allows handling the workload variabilities arising in complex and dynamic utilization scenarios. The main challenge of dynamic frequency adaptation is to adjust the processing speed of each element to match the quality-of-service requirements in the presence of workload variations. In this paper, we present a control theoretic approach to dynamic voltage/frequency scaling for data-flow models of computations mapped to multiprocessor systems-on-chip architectures. We discuss, in particular, nonlinear control approaches to deal with general streaming applications containing both pipeline and parallel stages. Theoretical analysis and experiments, carried out by means of a cycle-accurate energy-aware multiprocessor simulation platform, are provided. We have applied the proposed control approach to realistic streaming applications such as Data Encryption Standard and software-based FM radio.
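
A minimal sketch of the feedback idea using a simple proportional-integral controller; the paper develops nonlinear controllers, which this does not reproduce, and the gains and workload numbers below are invented:

```python
# Steer one pipeline stage's clock so its input queue tracks a setpoint.
import random

random.seed(1)
SETPOINT, KP, KI = 50.0, 0.8, 0.1       # target queue level and gains
freq, queue, integ = 200.0, 60.0, 0.0   # MHz, queued items, integral

for step in range(20):
    arrivals = random.uniform(80, 120)          # produced upstream
    served = min(queue + arrivals, freq * 0.5)  # service rate ~ clock
    queue += arrivals - served
    err = queue - SETPOINT                      # too full -> speed up
    integ += err
    freq = max(50.0, min(800.0, freq + KP * err + KI * integ))

print(f"final frequency: {freq:.0f} MHz, queue: {queue:.1f} items")
```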

Patent
22 Dec 2009
TL;DR: In this article, an SRAM-based pipeline IP lookup architecture is presented, where a multitude of intersecting and different length pipelines are constructed on a two dimensional array of processing elements in a circular fashion.
Abstract: This invention first presents SRAM-based pipeline IP lookup architectures, including an SRAM-based systolic array architecture that utilizes the multi-pipeline parallelism idea, and elaborates on it as the base architecture, highlighting its advantages. In this base architecture, a multitude of intersecting and different-length pipelines are constructed on a two-dimensional array of processing elements in a circular fashion. The architecture supports the use of any type of prefix tree instead of the conventional binary prefix tree. The invention secondly proposes a novel use of an alternative and more advantageous prefix tree based on the binomial spanning tree to achieve a substantial performance increase. The new approach, enhanced with other extensions including four-side input and three-pointer implementations, considerably increases the parallelism and search capability of the base architecture and provides a much higher throughput than all existing IP lookup approaches, making, for example, a 7 Tbps router IP lookup front-end speed possible. Although the theoretical worst-case lookup delay in this systolic array structure is high, the average delay is quite low, large delays being observed only rarely. The structure in its new form is scalable in terms of processing elements and is also well suited to the IPv6 addressing scheme.
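
For contrast with the patent's generalized prefix trees, here is a plain software binary-trie longest-prefix match; each trie level corresponds to the kind of step a pipeline stage would perform:

```python
# Binary trie for longest-prefix match over bit-string addresses.
class TrieNode:
    __slots__ = ("children", "next_hop")
    def __init__(self):
        self.children = [None, None]
        self.next_hop = None

root = TrieNode()

def insert(prefix_bits: str, next_hop: str) -> None:
    node = root
    for b in prefix_bits:
        i = int(b)
        node.children[i] = node.children[i] or TrieNode()
        node = node.children[i]
    node.next_hop = next_hop

def lookup(addr_bits: str) -> str:
    node, best = root, None
    for b in addr_bits:          # one trie level per pipeline stage
        node = node.children[int(b)]
        if node is None:
            break
        best = node.next_hop or best
    return best

insert("10", "A")
insert("1011", "B")
print(lookup("10110000"))  # -> B (the longest matching prefix wins)
```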

Proceedings ArticleDOI
17 May 2009
TL;DR: A new methodology is proposed, based on formal verification and relative timing, to create and prove correct necessary constraints to support asynchronous design with traditional clocked CAD.
Abstract: Asynchronous circuit design can result in substantial benefits of reduced power, improved performance, and high modularity. However, asynchronous design styles are largely incompatible with clocked CAD, which has prevented wide-scale adoption. The key incompatibility is timing. Thus most commercial work relies on custom CAD or untimed delay-insensitive design methodologies. This paper proposes a new methodology, based on formal verification and relative timing, to create and prove correct the necessary constraints to support asynchronous design with traditional clocked CAD. These constraints support timing-driven synthesis, place and route, and behavior and timing validation of fully asynchronous designs using traditional clocked CAD flows. This flow is demonstrated through a simple example pipeline in IBM's 65nm process, showing the ability to retarget the design for improved power and performance.

Proceedings ArticleDOI
05 Apr 2009
TL;DR: This work proposes a novel scalable high-throughput, low-power SRAM-based linear pipeline architecture for IP lookup that maintains packet input order and supports in-place non-blocking route updates.
Abstract: Most high-speed Internet Protocol (IP) lookup implementations use tree traversal and pipelining. Due to the available on-chip memory and the number of I/O pins of Field Programmable Gate Arrays (FPGAs), state-of-the-art designs cannot support the current largest routing table (consisting of 257K prefixes in backbone routers). We propose a novel scalable high-throughput, low-power SRAM-based linear pipeline architecture for IP lookup. Using a single FPGA, the proposed architecture can support the current largest routing table, or even larger tables of up to 400K prefixes. Our architecture can also be easily partitioned, so as to use external SRAM to handle even larger routing tables (up to 1.7M prefixes). Our implementation shows a high throughput (340 mega lookups per second, or 109 Gbps), even when external SRAM is used. The use of SRAM (instead of TCAM) leads to an order of magnitude reduction in power dissipation. Additionally, the architecture supports power saving by allowing only a portion of the memory to be active on each memory access. Our design also maintains packet input order and supports in-place non-blocking route updates.

Proceedings ArticleDOI
12 Sep 2009
TL;DR: This paper proposes FastBCI, architectural support that achieves the granularity efficiency of a bulk copying/initialization instruction but without its pipeline and cache bottlenecks, which on average achieves speedups of 23% to 32%.
Abstract: Bulk memory copying and initialization is one of the most ubiquitous operations performed in current computer systems by both user applications and Operating Systems. While many current systems rely on a loop of loads and stores, there are proposals to introduce a single instruction to perform bulk memory copying. While such an instruction can improve performance by generating fewer TLB and cache accesses and requiring fewer pipeline resources, in this paper we show that the key to significantly improving performance is removing the pipeline and cache bottlenecks of the code that follows the instruction. We show that the bottlenecks arise due to (1) the pipeline being clogged by the copying instruction, (2) a lengthened critical path due to dependent instructions stalling while waiting for the copying to complete, and (3) the inability to specify (separately) the cacheability of the source and destination regions. We propose FastBCI, architectural support that achieves the granularity efficiency of a bulk copying/initialization instruction, but without its pipeline and cache bottlenecks. When applied to OS kernel buffer management, we show that on average FastBCI achieves speedups of between 23% and 32%, roughly 3x-4x that of an alternative scheme and 1.5x-2x that of a highly optimistic DMA with zero setup and interrupt overheads.

Journal ArticleDOI
TL;DR: In this article, an economic analysis computer tool is developed for the evaluation of carbon capture and storage (CCS) systems comprising a set of multiple CO₂ sources and storage locations.

Journal ArticleDOI
TL;DR: Preliminary tests show that the sensors can detect the presence of wall thinning in a steel pipe by classifying the attenuation and frequency changes of the propagating Lamb waves, and the SVM algorithm was able to classify the signals as normal in the absence of wall thinning.
Abstract: Oil and gas pipeline condition monitoring is a potentially challenging process due to varying temperature conditions, the harshness of the flowing commodity, and unpredictable terrains. Pipeline breakdown can potentially cost millions of dollars worth of loss, not to mention the serious environmental damage caused by the leaking commodity. The proposed techniques, although implemented on a lab-scale experimental rig, ultimately aim at providing a continuous monitoring system using an array of different sensors strategically positioned on the surface of the pipeline. The sensors used are piezoelectric ultrasonic sensors. The raw sensor signal is first processed using the discrete wavelet transform (DWT) as a feature extractor and then classified using the powerful learning machine called the support vector machine (SVM). Preliminary tests show that the sensors can detect the presence of wall thinning in a steel pipe by classifying the attenuation and frequency changes of the propagating Lamb waves. The SVM...
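
A sketch of the described processing chain using common libraries (PyWavelets and scikit-learn are our choice of tooling, not necessarily the authors'); the signals and the "wall thinning" effect below are synthetic:

```python
# DWT sub-band energies as features, classified with an SVM.
import numpy as np
import pywt
from sklearn.svm import SVC

def features(signal: np.ndarray) -> np.ndarray:
    coeffs = pywt.wavedec(signal, "db4", level=4)
    return np.array([np.sum(c ** 2) for c in coeffs])

rng = np.random.default_rng(0)
healthy = [rng.standard_normal(256) for _ in range(40)]
# Toy "wall thinning": attenuated traces with a shifted tone.
thinned = [0.5 * rng.standard_normal(256)
           + 0.3 * np.sin(0.9 * np.arange(256)) for _ in range(40)]

X = np.array([features(s) for s in healthy + thinned])
y = np.array([0] * 40 + [1] * 40)
clf = SVC(kernel="rbf").fit(X, y)
print("training accuracy:", clf.score(X, y))
```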

Journal ArticleDOI
TL;DR: In this paper, the authors propose a methodology for optimizing the operating performance of a pipeline network so as to minimize the total fuel consumption while maintaining the desired throughput in the line.
Abstract: As the gas industry has developed, gas pipeline networks have evolved over decades into very complex systems. A typical network today might consist of thousands of pipes, dozens of stations, and many other devices, such as valves and regulators. Inside each station, there can be several groups of compressor units of various vintages that were installed as the capacity of the system expanded. The compressor stations typically consume about 3-5% of the transported gas. It is estimated that global optimization of operations can considerably reduce the fuel consumed by the stations. Hence, the problem of minimizing fuel cost is of great importance. Consequently, the objective is to operate a given compressor station or a set of compressor stations so that the total fuel consumption is reduced while maintaining the desired throughput in the line. Two case studies illustrate the proposed methodology. Case 1 was chosen for its simple and small-size design, developed for the sake of illustration. The implementation of the methodology is thoroughly presented and typical results are analyzed. Case 2 was submitted by the French company Gaz de France. It is a more complex network containing several loops, supply nodes, and delivery points, referred to as a multisupply multidelivery transmission network. The key points of implementation of an optimization framework are presented. The treatment of both case studies provides some guidelines for optimization of the operating performances of pipeline networks, according to the complexity of the involved problems. © 2009 American Institute of Chemical Engineers AIChE J, 2010
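
A toy version of the stated objective: choose compressor operating points that meet a required throughput at minimum fuel. The flow and fuel curves below are invented; real models involve nonconvex pipe hydraulics and far larger search spaces:

```python
# Pick the cheapest pair of compressor speeds that still meets demand.
from itertools import product

speeds = [0.0, 0.7, 0.85, 1.0]  # available unit speeds (fraction of max)

def flow(s1, s2):
    return 60 + 40 * (s1 + s2)      # delivered throughput

def fuel(s1, s2):
    return 5 * (s1 ** 2 + s2 ** 2)  # gas burned by the units

REQUIRED = 120.0
feasible = [(fuel(a, b), a, b)
            for a, b in product(speeds, repeat=2)
            if flow(a, b) >= REQUIRED]
cost, s1, s2 = min(feasible)
print(f"run units at {s1:.2f} and {s2:.2f}, fuel = {cost:.2f}")
```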

Patent
08 Apr 2009
TL;DR: In this article, a system and method for facilitating increased graphics processing without deadlock is presented, which provides storage for execution unit pipeline results (e.g., texture pipeline results).
Abstract: A system and method for facilitating increased graphics processing without deadlock. Embodiments of the present invention provide storage for execution unit pipeline results (e.g., texture pipeline results). The storage allows increased processing of multiple threads, as a texture unit may be used to store information while corresponding locations of the register file are made available for reallocation to other threads. Embodiments further prevent deadlock by limiting the number of outstanding requests and ensuring that a set of requests is not issued unless there are resources available to complete every request in the set. Embodiments of the present invention thus provide for deadlock-free increased performance.
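
An illustrative software analogue of the deadlock-avoidance rule described above; the patent targets GPU hardware, so the semaphore-based sketch below is only an analogy:

```python
# Issue a set of texture requests only if enough result-storage slots
# are free to complete every request in the set.
import threading

class TexturePipeline:
    def __init__(self, result_slots: int):
        self.credits = threading.Semaphore(result_slots)

    def issue_batch(self, requests) -> bool:
        acquired = 0
        for _ in requests:
            if not self.credits.acquire(blocking=False):
                # Roll back: never leave a partially issued set that
                # could hold resources while waiting forever.
                for _ in range(acquired):
                    self.credits.release()
                return False
            acquired += 1
        return True  # caller releases credits as results drain

pipe = TexturePipeline(result_slots=4)
print(pipe.issue_batch(["texel"] * 3))  # True: 3 slots reserved
print(pipe.issue_batch(["texel"] * 3))  # False: only 1 slot left
```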