Showing papers on "Pipeline (computing) published in 2009"

PDF

Open Access

Proceedings Article•DOI•

[...]

Sameer Agarwal¹, Noah Snavely², Ian Simon¹, Steven M. Seitz¹, Richard Szeliski³ - Show less +1 more•Institutions (3)

University of Washington¹, Cornell University², Microsoft³

01 Sep 2009

TL;DR: A system that can match and reconstruct 3D scenes from extremely large collections of photographs such as those found by searching for a given city on Internet photo sharing sites and is designed to scale gracefully with both the size of the problem and the amount of available computation.

...read moreread less

Abstract: We present a system that can match and reconstruct 3D scenes from extremely large collections of photographs such as those found by searching for a given city (e.g., Rome) on Internet photo sharing sites. Our system uses a collection of novel parallel distributed matching and reconstruction algorithms, designed to maximize parallelism at each stage in the pipeline and minimize serialization bottlenecks. It is designed to scale gracefully with both the size of the problem and the amount of available computation. We have experimented with a variety of alternative algorithms at each stage of the pipeline and report on which ones work best in a parallel computing environment. Our experimental results demonstrate that it is now possible to reconstruct cities consisting of 150K images in less than a day on a cluster with 500 compute cores.

...read moreread less

1,454 citations

Proceedings Article•DOI•

The new era of scaling in an SoC world

[...]

Mark T. Bohr¹•Institutions (1)

Intel¹

29 May 2009

TL;DR: The new era of microprocessor scaling is a system-on-a-chip approach that combines a diverse set of components using adaptive circuits, integrated sensors, sophisticated power-management techniques, and increased parallelism to build products that are many-core, multi- core, and multi-function.

...read moreread less

Abstract: The time has passed when traditional MOSFET scaling techniques were adequate to meet the needs of microprocessor products, but that has not meant the end of Moore's Law nor the end of improvements in microprocessor performance and power. In the new era of device scaling, innovations in materials and device structure are just as important as dimensional scaling. The past trend of using smaller transistors to build larger microprocessor cores operating at higher frequency and consuming more power is also at an end. The new era of microprocessor scaling is a system-on-a-chip approach that combines a diverse set of components using adaptive circuits, integrated sensors, sophisticated power-management techniques, and increased parallelism to build products that are many-core, multi-core, and multi-function. Although many promising technologies and device options are in the research pipeline, we need to recognize that we are doing system integration, and the future challenge we face is learning how to integrate an ever wider range of heterogeneous elements.

...read moreread less

172 citations

Journal Article•DOI•

A mechanistic performance model for superscalar out-of-order processors

[...]

Stijn Eyerman¹, Lieven Eeckhout¹, Tejas Karkhanis², James E. Smith³•Institutions (3)

Ghent University¹, Advanced Micro Devices², University of Wisconsin-Madison³

29 May 2009-ACM Transactions on Computer Systems

TL;DR: The mechanistic model provides several advantages over prior modeling approaches, and, when estimating performance, it differs from detailed simulation of a 4-wide out-of-order processor by an average of 7%.

...read moreread less

Abstract: A mechanistic model for out-of-order superscalar processors is developed and then applied to the study of microarchitecture resource scaling. The model divides execution time into intervals separated by disruptive miss events such as branch mispredictions and cache misses. Each type of miss event results in characterizable performance behavior for the execution time interval. By considering an interval's type and length (measured in instructions), execution time can be predicted for the interval. Overall execution time is then determined by aggregating the execution time over all intervals. The mechanistic model provides several advantages over prior modeling approaches, and, when estimating performance, it differs from detailed simulation of a 4-wide out-of-order processor by an average of 7p.The mechanistic model is applied to the general problem of resource scaling in out-of-order superscalar processors. First, we use the model to determine size relationships among microarchitecture structures in a balanced processor design. Second, we use the mechanistic model to study scaling of both pipeline depth and width in balanced processor designs. We corroborate previous results in this area and provide new results. For example, we show that at optimal design points, the pipeline depth times the square root of the processor width is nearly constant. Finally, we consider the behavior of unbalanced, overprovisioned processor designs based on insight gained from the mechanistic model. We show that in certain situations an overprovisioned processor may lead to improved overall performance. Designs where a processor's dispatch width is wider than its issue width are of particular interest.

...read moreread less

168 citations

Proceedings Article•DOI•

Polymorphic pipeline array: a flexible multicore accelerator with virtualized execution for mobile multimedia applications

[...]

Hyunchul Park¹, Yongjun Park¹, Scott Mahlke¹•Institutions (1)

University of Michigan¹

12 Dec 2009

TL;DR: This work focuses on the design of a programmable, low-power accelerator for multimedia algorithms referred to as a Polymorphic Pipeline Array, or PPA, which is designed with flexibility and programmability as first-order requirements to enable the hardware to be dynamically customizable to the application.

...read moreread less

Abstract: Mobile computing in the form of smart phones, netbooks, and personal digital assistants has become an integral part of our everyday lives. Moving ahead to the next generation of mobile devices, we believe that multimedia will become a more critical and product-differentiating feature. High definition audio and video as well as 3D graphics provide richer interfaces and compelling capabilities. However, these algorithms also bring different computational challenges than wireless signal processing. Multimedia algorithms are more complex featuring more control flow and variable computational requirements where execution time is not dominated by innermost vector loops. Further, data access is more complex where media applications typically operate on multi-dimensional vectors of data rather than single-dimensional vectors with simple strides. Thus, the design of current mobile platforms requires re-examination to account for these new application domains. In this work, we focus on the design of a programmable, low-power accelerator for multimedia algorithms referred to as a Polymorphic Pipeline Array, or PPA. The PPA is designed with flexibility and programmability as first-order requirements to enable the hardware to be dynamically customizable to the application. PPAs exploit pipeline parallelism found in streaming applications to create a coarse-grain hardware pipeline to execute streaming media applications. PPA resources are allocated to each stage depending on its size and ability to exploit fine-grain parallelism. Experimental results show that real-time media applications can take advantage of the static and dynamic configurability for increased power efficiency.

...read moreread less

160 citations

Journal Article•DOI•

High-throughput layered decoder implementation for quasi-cyclic LDPC codes

[...]

Kai Zhang, Xinming Huang, Zhongfeng Wang¹•Institutions (1)

Broadcom¹

01 Aug 2009-IEEE Journal on Selected Areas in Communications

TL;DR: This paper presents a high-throughput decoder design for the Quasi-Cyclic (QC) Low-Density Parity-Check (LDPC) codes, and two new techniques are proposed, including parallel layered decoding architecture (PLDA) and critical path splitting.

...read moreread less

Abstract: This paper presents a high-throughput decoder design for the Quasi-Cyclic (QC) Low-Density Parity-Check (LDPC) codes. Two new techniques are proposed, including parallel layered decoding architecture (PLDA) and critical path splitting. PLDA enables parallel processing for all layers by establishing dedicated message passing paths among them. The decoder avoids crossbar-based large interconnect network. Critical path splitting technique is based on articulate adjustment of the starting point of each layer to maximize the time intervals between adjacent layers, such that the critical path delay can be split into pipeline stages. Furthermore, min-sum and loosely coupled algorithms are employed for area efficiency. As a case study, a rate-1/2 2304-bit irregular LDPC decoder is implemented using ASIC design in 90 nm CMOS process. The decoder can achieve the maximum decoding throughput of 2.2 Gbps at 10 iterations. The operating frequency is 950 MHz after synthesis and the chip area is 2.9 mm2.

...read moreread less

130 citations

Journal Article•DOI•

Design and microarchitecture of the IBM system z10 microprocessor

[...]

Chung-Lung Shum¹, Fadi Y. Busaba¹, S. Dao-Trong¹, Günter Gerwig¹, C. Jacobi¹, Thomas Koehler¹, Erwin Pfeffer¹, Brian R. Prasky¹, John G. Rell¹, Aaron Tsai¹ - Show less +6 more•Institutions (1)

IBM¹

01 Jan 2009-Ibm Journal of Research and Development

TL;DR: The IBM System z10™ microprocessor is currently the fastest running 64-bit CISC (complex instruction set computer) microprocessor and implements new architectural features that allow better software optimization across compiled applications.

...read moreread less

Abstract: The IBM System z10™ microprocessor is currently the fastest running 64-bit CISC (complex instruction set computer) microprocessor. This microprocessor operates at 4.4 GHz and provides up to two times performance improvement compared with its predecessor, the System z9® microprocessor. In addition to its ultrahigh-frequency pipeline, the z10™ microprocessor offers such performance enhancements as a sophisticated branch-prediction structure, a large second-level private cache, a data-prefetch engine, and a hardwired decimal floating-point arithmetic unit. The z10 microprocessor also implements new architectural features that allow better software optimization across compiled applications. These features include new instructions that help shorten the code path lengths and new facilities for software-directed cache management and the use of 1-MB virtual pages. The innovative microarchitecture of the z10 microprocessor and notable differences from its predecessors and the IBM POWER6™ microprocessor are discussed.

...read moreread less

114 citations

Journal Article•DOI•

A 1.2-V 250-mW 14-b 100-MS/s Digitally Calibrated Pipeline ADC in 90-nm CMOS

[...]

H. Van de Vel¹, Berry Buter¹, H. van der Ploeg, Maarten Vertregt¹, Govert Geelen¹, E.J.F. Paulus¹ - Show less +2 more•Institutions (1)

NXP Semiconductors¹

24 Mar 2009-IEEE Journal of Solid-state Circuits

TL;DR: This paper describes a digitally calibrated pipeline analog-to-digital converter (ADC) implemented in 90 nm CMOS technology with a 1.2 V supply voltage that achieves 73 dB SNR and 90 dB SFDR at 100 MS/s sampling rate and 250 mW power consumption.

...read moreread less

Abstract: This paper describes a digitally calibrated pipeline analog-to-digital converter (ADC) implemented in 90 nm CMOS technology with a 1.2 V supply voltage. A digital background calibration algorithm reduces the linearity requirements in the first stage of the pipeline chain. Range scaling in the first pipeline stage enables a maximal 1.6 Vpp input signal swing, and a charge-reset switch eliminates ISI-induced distortion. The 14b ADC achieves 73 dB SNR and 90 dB SFDR at 100 MS/s sampling rate and 250 mW power consumption. The 73 dB SNDR performance is maintained within 3 dB up to a Nyquist input frequency and the FOM is 0.68 pJ per conversion-step.

...read moreread less

91 citations

Proceedings Article•DOI•

"Georgia computes!": improving the computing education pipeline

[...]

Amy Bruckman¹, Maureen Biggers², Barbara Ericson¹, Tom McKlin, Jill Dimond¹, Betsy DiSalvo¹, Mike Hewner¹, Lijun Ni¹, Sarita Yardi¹ - Show less +5 more•Institutions (2)

Georgia Institute of Technology¹, Indiana University²

04 Mar 2009

TL;DR: The "Georgia Computes!" alliance, funded by the National Science Foundation's Broadening Participation in Computing program, seeks to improve the computing education pipeline in Georgia.

...read moreread less

Abstract: Computing education suffers from low enrollment and a lack of diversity. Both of these problems require changes across the entire computing education pipeline. The "Georgia Computes!" alliance, funded by the National Science Foundation's Broadening Participation in Computing program, seeks to improve the computing education pipeline in Georgia. "Georgia Computes!" is having a measurable effect at each stage of the pipeline, but has not yet shown an impact across the whole pipeline.

...read moreread less

84 citations

Proceedings Article•DOI•

CAMP: A technique to estimate per-structure power at run-time using a few simple parameters

[...]

Michael D. Powell¹, Arijit Biswas¹, Joel Emer¹, Shubhendu S. Mukherjee¹, Basit Riaz Sheikh², Shrirang Yardi³ - Show less +2 more•Institutions (3)

Intel¹, Cornell University², Virginia Tech³

06 Mar 2009

TL;DR: A new technique, called Common Activity-based Model for Power (CAMP), is proposed, to estimate activity factors and power for microarchitectural structures, using a relatively few input parameters based on general microprocessor utilization statistics.

...read moreread less

Abstract: Microprocessor power has become a first-order constraint at run-time. Designers must employ aggressive power-management techniques at run-time to keep a processor's ballooning power requirements under control. Effective power management benefits from knowledge of run-time microprocessor power consumption in both the core and individual microarchitectural structures, such as caches, queues, and execution units. Increasingly feasible per-structure power-control techniques, such as fine-grain clock gating, power gating, and dynamic voltage/frequency scaling (DVFS), become more effective from run-time estimates of per-structure power. However, run-time computation of per-structure power estimates based on utilization requires daunting numbers of input statistics, which makes per-structure monitoring of run-time power a challenging problem. To address the challenges of estimating per-structure power in hardware, we propose a new technique, called Common Activity-based Model for Power (CAMP), to estimate activity factors and power for microarchitectural structures. Despite using a relatively few input parameters-specifically nine-based on general microprocessor utilization statistics (e.g., IPC and load rate), our linear-regression-based model estimates activity and dynamic power for over 100 structures in an out-of-order x86 pipeline and core power with an average error of 8%. Because the computations utilize few inputs, CAMP is simple enough to implement in hardware, providing run-time structure and core power estimates for dynamic power management. Because the input statistics are generic in nature and the model remains accurate across incremental microarchitectural refinements, CAMP provides simple intuitive equations relating global microarchitectural statistics to structure activity and power. These equations provide a simple technique that can equate changes in one structure's activity to power variations in other structures across the pipeline.

...read moreread less

84 citations

Journal Article•DOI•

Radix $r^{k} $ FFTs: Matricial Representation and SDC/SDF Pipeline Implementation

[...]

A. Cortes¹, Igone Velez¹, Juan F. Sevillano¹•Institutions (1)

University of Navarra¹

01 Jul 2009-IEEE Transactions on Signal Processing

TL;DR: The discrete Fourier transform (DFT) matrix factorization based on the Kronecker product is proposed to express the family of radix rk single-path delay commutator/single- path delay feedback (SDC/SDF) pipeline fast Fouriers transform (FFT) architectures.

...read moreread less

Abstract: This paper proposes to use the discrete Fourier transform (DFT) matrix factorization based on the Kronecker product to express the family of radix rk single-path delay commutator/single-path delay feedback (SDC/SDF) pipeline fast Fourier transform (FFT) architectures. The matricial expressions of the radix r, r 2, r 3, and r 4 decimation-in-frequency (DIF) SDC/SDF pipeline architectures are derived. These expressions can be written using a small set of operators, resulting in a compact representation of the algorithms. The derived expressions are general in terms of r and the number of points of the FFT N. Expressions are given where it is not necessary that N is a power of rk. The proposed set of operators can be mapped to equivalent hardware circuits. Thus, the designer can easily go from the matricial representations to their implementations and vice versa. As an example, the mapping of the operators is shown for radix 2, 22, 23, and 24, and the details of the corresponding SDC/SDF pipeline FFT architectures are presented. Furthermore, a general expression is given for the SDC/SDF radix rk pipeline architectures when k > 4. This general expression helps the designer to efficiently handle a wider design exploration space and select the optimum single-path architecture for a given value of N.

...read moreread less

83 citations

Journal Article•DOI•

HDTV1080p H.264/AVC Encoder Chip Design and Performance Analysis

[...]

Zhenyu Liu¹, Yang Song², Ming Shao, Shen Li³, Lingfeng Li, Shunichi Ishiwata³, Michio Nakagawa³, Satoshi Goto¹, Takeshi Ikenaga¹ - Show less +5 more•Institutions (3)

Waseda University¹, Fujitsu², Toshiba³

27 Jan 2009-IEEE Journal of Solid-state Circuits

TL;DR: A H.264/AVC baseline-profile real-time encoder for HDTV-1080p at 30 fps is proposed in this paper and the design considerations for chief components, including high throughput integer motion estimation, data reusing fractionalmotion estimation, and hardware friendly mode reduction for intra prediction are described.

...read moreread less

Abstract: A H.264/AVC baseline-profile real-time encoder for HDTV-1080p at 30 fps is proposed in this paper. On the basis of the specifications and algorithm optimizations, the dedicated hardware engines and one 32-bit media embedded processor (MeP) equipped with hardware extensions are mapped into the three-stage macroblock pipelining system architecture. This paper describes the design considerations for chief components, including high throughput integer motion estimation, data reusing fractional motion estimation, and hardware friendly mode reduction for intra prediction. The 11.5 Gbps 64 Mb system-in-silicon DRAM is embedded to alleviate the external memory bandwidth. Using TSMC one-poly six-metal 0.18 mum CMOS technology, the prototype chip is implemented with 1140 k logic gates and 108.3 KB internal SRAM. The SoC core occupies 27.1 mm2 die area and consumes 1.41 W at 200 MHz execution speed in typical work conditions.

...read moreread less

Journal Article•DOI•

Mathematical modeling and simulation of pigging operation in gas and liquid pipelines

[...]

Feridun Esmaeilzadeh¹, Dariush Mowla¹, Maryam Asemani¹•Institutions (1)

Shiraz University¹

01 Nov 2009-Journal of Petroleum Science and Engineering

TL;DR: In this paper, the pig position, optimum flow rate in upstream flow and the time that the pig reaches the end of the pipeline are obtained from the simulation results with the field data of liquid flow through the pipeline from KG to AG located in Iran.

...read moreread less

Journal Article•DOI•

CO2 erosion–corrosion of pipeline steel (API X65) in oil and gas conditions—A systematic approach

[...]

Xinming Hu¹, Anne Neville¹•Institutions (1)

University of Leeds¹

29 Oct 2009-Wear

TL;DR: In this article, a systematic study of pipeline steel degradation due to erosion-corrosion containing sand in a CO 2 saturated environment has been carried out, focusing on the total material loss, corrosion, erosion and their interactions (synergy) as a function of environmental parameters (temperature, flow velocity and sand content).

...read moreread less

Proceedings Article•DOI•

An ILP formulation for task mapping and scheduling on multi-core architectures

[...]

Ying Yi¹, Wei Han¹, Xin Zhao¹, Ahmet T. Erdogan¹, Tughrul Arslan¹ - Show less +1 more•Institutions (1)

University of Edinburgh¹

20 Apr 2009

TL;DR: The results demonstrate that the proposed technique is able to generate high-quality mappings of realistic applications on the target multi-core architecture, achieving up to 1.3× parallel efficiency by employing only two dynamically reconfigurable processor cores.

...read moreread less

Abstract: Multi-core architectures are increasingly being adopted in the design of emerging complex embedded systems. Key issues of designing such systems are on-chip interconnects, memory architecture, and task mapping and scheduling. This paper presents an integer linear programming formulation for the task mapping and scheduling problem. The technique incorporates profiling-driven loop level task partitioning, task transformations, functional pipelining, and memory architecture aware data mapping to reduce system execution time. Experiments are conducted to evaluate the technique by implementing a series of DSP applications on several multi-core architectures based on dynamically reconfigurable processor cores. The results demonstrate that the proposed technique is able to generate high-quality mappings of realistic applications on the target multi-core architecture, achieving up to 1.3× parallel efficiency by employing only two dynamically reconfigurable processor cores.

...read moreread less

Journal Article•DOI•

Pipeline FFT architectures optimized for FPGAs

[...]

Bin Zhou¹, Yingning Peng², David Hwang¹•Institutions (2)

George Mason University¹, Tsinghua University²

01 Jan 2009-International Journal of Reconfigurable Computing

TL;DR: The R22SDF was more efficient than the R4SDC in terms of throughput per area due to a simpler controller and an easier balanced rounding scheme, and it is shown that balanced stage rounding is an appropriate rounding scheme for pipeline FFT processors.

...read moreread less

Abstract: This paper presents optimized implementations of two different pipeline FFT processors on Xilinx Spartan-3 and Virtex-4 FPGAs Different optimization techniques and rounding schemes were explored The implementation results achieved better performance with lower resource usage than prior art The 16-bit 1024-point FFT with the R22SDF architecture had a maximum clock frequency of 952 MHz and used 2802 slices on the Spartan-3, a throughput per area ratio of 0034 Msamples/s/slice The R4SDC architecture ran at 1238 MHz and used 4409 slices on the Spartan-3, a throughput per area ratio of 0028 Msamples/s/slice On Virtex-4, the 16-bit 1024-point R22SDF architecture ran at 2356 MHz and used 2256 slice, giving a 0104 Msamples/s/slice ratio; the 16-bit 1024-point R4SDC architecture ran at 2192 MHz and used 3064 slices, giving a 0072 Msamples/s/slice ratio The R22SDF was more efficient than the R4SDC in terms of throughput per area due to a simpler controller and an easier balanced rounding scheme This paper also shows that balanced stage rounding is an appropriate rounding scheme for pipeline FFT processors

...read moreread less

Journal Article•DOI•

From Parallelism Levels to a Multi-ASIP Architecture for Turbo Decoding

[...]

Olivier Muller, Amer Baghdadi, Michel Jezequel

01 Jan 2009-IEEE Transactions on Very Large Scale Integration Systems

TL;DR: In this article, the authors present a flexible multiprocessor platform for high throughput turbo decoding using configurable application-specific instruction set processors (ASIP) combined with an efficient memory and communication interconnect scheme.

...read moreread less

Abstract: Emerging digital communication applications and the underlying architectures encounter drastically increasing performance and flexibility requirements. In this paper, we present a novel flexible multiprocessor platform for high throughput turbo decoding. The proposed platform enables exploiting all parallelism levels of turbo decoding applications to fulfill performance requirements. In order to fulfill flexibility requirements, the platform is structured around configurable application-specific instruction-set processors (ASIP) combined with an efficient memory and communication interconnect scheme. The designed ASIP has an single instruction multiple data (SIMD) architecture with a specialized and extensible instruction-set and 6-stages pipeline control. The attached memories and communication interfaces enable its integration in multiprocessor architectures. These multiprocessor architectures benefit from the recent shuffled decoding technique introduced in the turbo-decoding field to achieve higher throughput. The major characteristics of the proposed platform are its flexibility and scalability which make it reusable for all simple and double binary turbo codes of existing and emerging standards. Results obtained for double binary WiMAX turbo codes demonstrate around 250 Mb/s throughput using 16-ASIP multiprocessor architecture.

...read moreread less

Patent•

Quality-driven ETL design optimization

[...]

Maria G. Castellanos¹, Umeshwar Dayal¹, Alkiviadis Simitsis¹, William K. Wilkinson¹•Institutions (1)

Hewlett-Packard¹

18 Aug 2009

TL;DR: In this article, a method for quality objective-based ETL pipeline optimization is provided, where an improvement objective is obtained from user input into a computing system, which represents a priority optimization desired by a user for improved ETL flows for an application designed to run in memory of the computing system.

...read moreread less

Abstract: A method for quality objective-based ETL pipeline optimization is provided. An improvement objective is obtained from user input into a computing system. The improvement objective represents a priority optimization desired by a user for improved ETL flows for an application designed to run in memory of the computing system. An ETL flow is created in the memory of the computing system. The ETL flow is restructured for flow optimization with a processor of the computing system. The flow restructuring is based on the improvement objective. Flow restructuring can include application of flow rewriting optimization or application of an algebraic rewriting optimization. The optimized ETL flow is stored as executable code on a computer readable storage medium.

...read moreread less

Book Chapter•DOI•

A Context-Parameterized Model for Static Analysis of Execution Times

[...]

Christine Rochange¹, Pascal Sainrat¹•Institutions (1)

University of Toulouse¹

22 Apr 2009

TL;DR: A model to specify the local execution context of a basic block as a set of parameters can then be computed as a function of these parameters, which can be used for computing the Worst-Case Execution Time of the program.

...read moreread less

Abstract: The static analysis of the execution time of a program (i.e. the evaluation of this time for any input data set) can be useful for the purpose of optimizing the code or verifying that strict real-time deadlines can be met. This analysis generally goes through determining the execution times of partial execution paths, typically basic blocks. Now, as soon as the target processor architecture features a superscalar pipeline, possibly with dynamic instruction scheduling, the execution time of a basic block highly depends on the pipeline state, that is on the instructions executed before it. In this paper, we propose a model to specify the local execution context of a basic block as a set of parameters. The execution time of the block can then be computed as a function of these parameters. We show how this model can be used to determine an upper bound of the execution time of a basic block, that can be used for computing the Worst-Case Execution Time of the program. Experimental results give an insight into the tightness of the estimations.

...read moreread less

Patent•

Image sensor with on-chip semi-column-parallel pipeline ADCs

[...]

Junichi Nakamura¹•Institutions (1)

Micron Technology¹

13 Jul 2009

TL;DR: In this article, the authors proposed a semi-column-parallel pipeline architecture for analog-to-digital converters, which allows multiple output lines to share an analog-To-Digital converter.

...read moreread less

Abstract: An imaging device with a semi-column-parallel pipeline analog-to-digital converter architecture. The semi-column-parallel pipeline architecture allows multiple column output lines to share an analog-to-digital converter. Analog-to-digital conversions are performed in a pipelined manner to reduce the conversion time, which results in shorter row times and increased frames rate and data throughput. The architecture also enhances the pitch of the analog-to-digital converters, which allows high performance, high resolution analog-to-digital converters to be used. As such, semi-column-parallel pipeline architecture overcomes the shortcomings of the typical serial and column-parallel architectures.

...read moreread less

Proceedings Article•DOI•

A disruptive computer design idea: Architectures with repeatable timing

[...]

Stephen A. Edwards¹, Sungjun Kim¹, Edward A. Lee², Isaac Liu², Hiren D. Patel², Martin Schoeberl³ - Show less +2 more•Institutions (3)

Columbia University¹, University of California, Berkeley², Vienna University of Technology³

04 Oct 2009

TL;DR: Microarchitecture approaches to pipelining and memory hierarchy that deliver repeatable timing and promise comparable or better performance compared to established techniques are described.

...read moreread less

Abstract: This paper argues that repeatable timing is more important and more achievable than predictable timing. It describes microarchitecture approaches to pipelining and memory hierarchy that deliver repeatable timing and promise comparable or better performance compared to established techniques. Specifically, threads are interleaved in a pipeline to eliminate pipeline hazards, and a hierarchical memory architecture is outlined that hides memory latencies.

...read moreread less

Patent•

Method and System for Low Latency Basket Calculation

[...]

David E. Taylor, Naveen Singla, Benjamin C. Brodie, Nathaniel McVicar, Justin Ryan Thiel¹, Ronald S. Indeck¹ - Show less +2 more•Institutions (1)

ABB Ltd¹

09 Jan 2009

TL;DR: In this paper, a basket calculation engine is deployed to receive a stream of data and accelerate the computation of basket values based on that data, which is used to process financial market data to compute the net asset values (NAVs) of financial instrument baskets.

...read moreread less

Abstract: A basket calculation engine is deployed to receive a stream of data and accelerate the computation of basket values based on that data. In a preferred embodiment, the basket calculation engine is used to process financial market data to compute the net asset values (NAVs) of financial instrument baskets. The basket calculation engine can be deployed on a coprocessor and can also be realized via a pipeline, the pipeline preferably comprising a basket association lookup module and a basket value updating module. The coprocessor is preferably a reconfigurable logic device such as a field programmable gate array (FPGA).

...read moreread less

Journal Article•DOI•

A Feedback-Based Approach to DVFS in Data-Flow Applications

[...]

A. Alimonda¹, Salvatore Carta¹, Andrea Acquaviva, Alessandro Pisano¹, Luca Benini² - Show less +1 more•Institutions (2)

University of Cagliari¹, University of Bologna²

01 Nov 2009-IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems

TL;DR: A control theoretic approach to dynamic voltage/frequency scaling for data-flow models of computations mapped to multiprocessor systems-on-chip architectures is presented and nonlinear control approaches to deal with general streaming applications containing both pipeline and parallel stages are discussed.

...read moreread less

Abstract: Runtime frequency and voltage adaptation has become very attractive for current and next generation embedded multicore platforms because it allows handling the workload variabilities arising in complex and dynamic utilization scenarios. The main challenge of dynamic frequency adaptation is to adjust the processing speed of each element to match the quality-of-service requirements in the presence of workload variations. In this paper, we present a control theoretic approach to dynamic voltage/frequency scaling for data-flow models of computations mapped to multiprocessor systems-on-chip architectures. We discuss, in particular, nonlinear control approaches to deal with general streaming applications containing both pipeline and parallel stages. Theoretical analysis and experiments, carried out by means of a cycle-accurate energy-aware multiprocessor simulation platform, are provided. We have applied the proposed control approach to realistic streaming applications such as Data Encryption Standard and software-based FM radio.

...read moreread less

Patent•

Systolic array architecture for fast ip lookup

[...]

Cuneyt F. Bazlamacci, Oguzhan Erdem

22 Dec 2009

TL;DR: In this article, an SRAM-based pipeline IP lookup architecture is presented, where a multitude of intersecting and different length pipelines are constructed on a two dimensional array of processing elements in a circular fashion.

...read moreread less

Abstract: This invention first presents SRAM based pipeline IP lookup architectures including an SRAM based systolic array architecture that utilizes multi-pipeline parallelism idea and elaborates on it as the base architecture highlighting its advantages. In this base architecture a multitude of intersecting and different length pipelines are constructed on a two dimensional array of processing elements in a circular fashion. The architecture supports the use of any type of prefix tree instead of conventional binary prefix tree. The invention secondly proposes a novel use of an alternative and more advantageous prefix tree based on binomial spanning tree to achieve a substantial performance increase. The new approach, enhanced with other extensions including four-side input and three-pointer implementations, considerably increases the parallelism and search capability of the base architecture and provides a much higher throughput than all existing IP lookup approaches making, for example, a 7 Tbps router IP lookup front end speed possible. Although theoretical worst-case lookup delay in this systolic array structure is high, the average delay is quite low, large delays being observed only rarely. The structure in its new form is scalable in terms of processing elements and is also well suited for the IPv6 addressing scheme.

...read moreread less

Proceedings Article•DOI•

Characterization of Asynchronous Templates for Integration into Clocked CAD Flows

[...]

Kenneth S. Stevens¹, Yang Xu¹, Vikas S. Vij¹•Institutions (1)

University of Utah¹

17 May 2009

TL;DR: A new methodology is proposed, based on formal verification and relative timing, to create and prove correct necessary constraints to support asynchronous design with traditional clocked CAD.

...read moreread less

Abstract: Asynchronous circuit design can result in substantial benefits ofreduced power, improved performance, and high modularity. However,asynchronous design styles are largely incompatible with clocked CAD,which has prevented wide-scale adoption. The key incompatibility istiming. Thus most commercial work relies on custom CAD or untimeddelay-insensitive design methodologies. This paper proposes a newmethodology, based on formal verification and relative timing, tocreate and prove correct necessary constraints to support asynchronousdesign with traditional clocked CAD. These constraints support timingdriving synthesis, place and route, and behavior and timing validationof fully asynchronous designs using traditional clocked CAD flows.This flow is demonstrated through a simple example pipeline in IBM's65nm process showing the ability to retarget the design for improvedpower and performance.

...read moreread less

Proceedings Article•DOI•

Scalable High Throughput and Power Efficient IP-Lookup on FPGA

[...]

Hoang Le¹, Viktor K. Prasanna¹•Institutions (1)

University of Southern California¹

05 Apr 2009

TL;DR: This work proposes a novel scalable high-throughput, low-power SRAM-based linear pipeline architecture for IP lookup that maintains packet input order and supports in-place non-blocking route updates.

...read moreread less

Abstract: Most high-speed Internet Protocol (IP) lookup implementations use tree traversal and pipelining. Due to the available on-chip memory and the number of I/O pins ofField Programmable Gate Arrays (FPGAs), state-of-the-artdesigns cannot support the current largest routing table(consisting of 257K prefixes in backbone routers). We propose a novel scalable high-throughput, low-power SRAM-based linear pipeline architecture for IP lookup. Using asingle FPGA, the proposed architecture can support thecurrent largest routing table, or even larger tables of upto 400K prefixes. Our architecture can also be easily partitioned, so as to use external SRAM to handle even larger routing tables (up to 1.7M prefixes). Our implementation shows a high throughput (340 mega lookups per second or 109 Gbps), even when external SRAM is used. The use of SRAM (instead of TCAM) leads to an order of magnitude reduction in power dissipation. Additionally, the architecture supports power saving by allowing only a portion of the memory to be active on each memory access. Our design also maintains packet input order and supports in-place non-blocking route updates.

...read moreread less

Proceedings Article•DOI•

Architecture Support for Improving Bulk Memory Copying and Initialization Performance

[...]

Xiaowei Jiang¹, Yan Solihin¹, Li Zhao², Ravishankar Iyer²•Institutions (2)

North Carolina State University¹, Intel²

12 Sep 2009

TL;DR: This paper proposed FastBCI, an architecture support that achieves the granularity efficiency of a bulk copying/ initialization instruction, but without its pipeline and cache bottlenecks, which on average achieves anywhere between 23% to 32% speedup ratios.

...read moreread less

Abstract: Bulk memory copying and initialization is one of the most ubiquitous operations performed in current computer systems by both user applications and Operating Systems. While many current systems rely on a loop of loads and stores, there are proposals to introduce a single instruction to perform bulk memory copying. While such an instruction can improve performance due to generating fewer TLB and cache accesses, and requiring fewer pipeline resources, in this paper we show that the key to significantly improving the performance is removing pipeline and cache bottlenecks of the code that follows the instructions. We show that the bottlenecks arise due to (1) the pipeline clogged by the copying instruction, (2) lengthened critical path due to dependent instructions stalling while waiting for the copying to complete, and (3) the inability to specify (separately) the cacheability of the source and destination regions. We propose FastBCI, an architecture support that achieves the granularity efficiency of a bulk copying/ initialization instruction, but without its pipeline and cache bottlenecks. When applied to OS kernel buffer management, we show that on average FastBCI achieves anywhere between 23% to 32% speedup ratios, which is roughly 3x-4x of an alternative scheme, and 1.5x-2x of a highly optimistic DMA with zero setup and interrupt overheads.

...read moreread less

Journal Article•DOI•

Algorithm to create a CCS low-cost pipeline network

[...]

Tomasz Kazmierczak, Ruut Brandsma, Filip Neele, Chris Hendriks

01 Feb 2009-Energy Procedia

TL;DR: In this article, an economic analysis computer tool is developed for the evaluation of carbon capture and storage (CCS) systems comprising of a set of multiple sources of CO2 and storage locations.

...read moreread less

Journal Article•DOI•

Pipeline defect prediction using support vector machines

[...]

Dino Isa¹, Rajprasad Kumar Rajkumar¹•Institutions (1)

University of Nottingham¹

05 Oct 2009-Applied Artificial Intelligence

TL;DR: Preliminary tests show that the sensors can detect the presence of wall thinning in a steel pipe by classifying the attenuation and frequency changes of the propagating lamb waves, and the SVM algorithm was able to classify the signals as abnormal in the absence of wallthinning.

...read moreread less

Abstract: Oil and gas pipeline condition monitoring is a potentially challenging process due to varying temperature conditions, harshness of the flowing commodity and unpredictable terrains. Pipeline breakdown can potentially cost millions of dollars worth of loss, not to mention the serious environmental damage caused by the leaking commodity. The proposed techniques, although implemented on a lab scale experimental rig, ultimately aim at providing a continuous monitoring system using an array of different sensors strategically positioned on the surface of the pipeline. The sensors used are piezoelectric ultrasonic sensors. The raw sensor signal will be first processed using the discrete wavelet transform (DWT) as a feature extractor and then classified using the powerful learning machine called the support vector machine (SVM). Preliminary tests show that the sensors can detect the presence of wall thinning in a steel pipe by classifying the attenuation and frequency changes of the propagating lamb waves. The SVM...

...read moreread less

Journal Article•DOI•

Improving the performance of natural gas pipeline networks fuel consumption minimization problems

[...]

Firooz Tabkhi¹, Luc Pibouleau¹, Guillermo Hernandez-Rodriguez¹, Catherine Azzaro-Pantel¹, Serge Domenech¹ - Show less +1 more•Institutions (1)

Centre national de la recherche scientifique¹

29 Oct 2009-Aiche Journal

TL;DR: In this paper, the authors proposed a methodology for optimizing the operating performance of a pipeline network to minimize the total fuel consumption while maintaining the desired throughput in the line while maintaining a desired throughput.

...read moreread less

Abstract: As the gas industry has developed, gas pipeline networks have evolved over decades into very complex systems A typical network today might consist of thousands of pipes, dozens of stations, and many other devices, such as valves and regulators Inside each station, there can be several groups of compressor units of various vintages that were installed as the capacity of the system expanded The compressor stations typically consume about 3–5% of the transported gas It is estimated that the global optimization of operations can save considerably the fuel consumed by the stations Hence, the problem of minimizing fuel cost is of great importance Consequently, the objective is to operate a given compressor station or a set of compressor stations so that the total fuel consumption is reduced while maintaining the desired throughput in the line Two case studies illustrate the proposed methodology Case 1 was chosen for its simple and small-size design, developed for the sake of illustration The implementation of the methodology is thoroughly presented and typical results are analyzed Case 2 was submitted by the French Company Gaz de France It is a more complex network containing several loops, supply nodes, and delivery points, referred as a multisupply multidelivery transmission network The key points of implementation of an optimization framework are presented The treatment of both case studies provides some guidelines for optimization of the operating performances of pipeline networks, according to the complexity of the involved problems © 2009 American Institute of Chemical Engineers AIChE J, 2010

...read moreread less

Patent•

System and method for deadlock-free pipelining

[...]

Michael J. M. Toksvig¹, Erik Lindholm¹•Institutions (1)

Nvidia¹

08 Apr 2009

TL;DR: In this article, a system and method for facilitating increased graphics processing without deadlock is presented, which provides storage for execution unit pipeline results (e.g., texture pipeline results).

...read moreread less

Abstract: A system and method for facilitating increased graphics processing without deadlock. Embodiments of the present invention provide storage for execution unit pipeline results (e.g., texture pipeline results). The storage allows increased processing of multiple threads as a texture unit may be used to store information while corresponding locations of the register file are available for reallocation to other threads. Embodiments further provide for preventing deadlock by limiting the number of requests and ensuring that a set of requests is not issued unless there are resources available to complete each request of the set of requests. Embodiments of the present invention thus provide for deadlock free increased performance.

...read moreread less

Collapse