
Showing papers on "Pipeline (computing) published in 2012"


Book ChapterDOI
28 Nov 2012
TL;DR: A new class of machine learning algorithms whose predictions, viewed as functions of the input data, can be expressed as polynomials of bounded degree is defined, and confidential algorithms for binary classification based on polynomial approximations to least-squares solutions obtained by a small number of gradient descent steps are proposed.
Abstract: We demonstrate that, by using a recently proposed leveled homomorphic encryption scheme, it is possible to delegate the execution of a machine learning algorithm to a computing service while retaining confidentiality of the training and test data. Since the computational complexity of the homomorphic encryption scheme depends primarily on the number of levels of multiplications to be carried out on the encrypted data, we define a new class of machine learning algorithms in which the algorithm's predictions, viewed as functions of the input data, can be expressed as polynomials of bounded degree. We propose confidential algorithms for binary classification based on polynomial approximations to least-squares solutions obtained by a small number of gradient descent steps. We present experimental validation of the confidential machine learning pipeline and discuss the trade-offs regarding computational complexity, prediction accuracy and cryptographic security.
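The structural requirement in the abstract — predictions expressible as bounded-degree polynomials of the data — can be illustrated in the clear (no encryption) with a sketch like the following, where a fixed, small number of gradient steps on the least-squares objective keeps the degree of the resulting predictor bounded. All names here are illustrative, not taken from the paper:

```python
def gd_least_squares(X, y, steps=3, lr=0.1):
    """A fixed, small number of gradient-descent steps on the objective
    0.5/n * sum((x.w - y)^2). Each step composes only additions and
    multiplications of the inputs, so after k steps the predictor is a
    polynomial of bounded degree in the training data -- the property
    the confidential pipeline relies on. (Illustrative sketch only.)"""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(steps):
        residuals = [sum(xi[j] * w[j] for j in range(d)) - yi
                     for xi, yi in zip(X, y)]
        grad = [sum(X[i][j] * residuals[i] for i in range(n)) / n
                for j in range(d)]
        w = [w[j] - lr * grad[j] for j in range(d)]
    return w

# toy binary classification with labels in {-1, +1}
X = [[1.0, 2.0], [2.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]]
y = [1.0, 1.0, -1.0, -1.0]
w = gd_least_squares(X, y)
pred = [1.0 if sum(a * b for a, b in zip(xi, w)) > 0 else -1.0 for xi in X]
```

Because the degree grows only linearly in the number of steps, the whole predictor fits within the bounded number of multiplication levels that a leveled homomorphic scheme can evaluate.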

440 citations


Journal ArticleDOI
13 Sep 2012-PLOS ONE
TL;DR: An assembly pipeline called A5 (Andrew And Aaron's Awesome Assembly pipeline) is presented that simplifies the entire genome assembly process by automating these stages, by integrating several previously published algorithms with new algorithms for quality control and automated assembly parameter selection.
Abstract: Remarkable advances in DNA sequencing technology have created a need for de novo genome assembly methods tailored to work with the new sequencing data types. Many such methods have been published in recent years, but assembling raw sequence data to obtain a draft genome has remained a complex, multi-step process, involving several stages of sequence data cleaning, error correction, assembly, and quality control. Successful application of these steps usually requires intimate knowledge of a diverse set of algorithms and software. We present an assembly pipeline called A5 (Andrew And Aaron's Awesome Assembly pipeline) that simplifies the entire genome assembly process by automating these stages, by integrating several previously published algorithms with new algorithms for quality control and automated assembly parameter selection. We demonstrate that A5 can produce assemblies of quality comparable to a leading assembly algorithm, SOAPdenovo, without any prior knowledge of the particular genome being assembled and without the extensive parameter tuning required by the other assembly algorithm. In particular, the assemblies produced by A5 exhibit 50% or more reduction in broken protein coding sequences relative to SOAPdenovo assemblies. The A5 pipeline can also assemble Illumina sequence data from libraries constructed by the Nextera (transposon-catalyzed) protocol, which have markedly different characteristics to mechanically sheared libraries. Finally, A5 has modest compute requirements, and can assemble a typical bacterial genome on current desktop or laptop computer hardware in under two hours, depending on depth of coverage.

418 citations


Book ChapterDOI
07 Oct 2012
TL;DR: This work proposes a powerful pipeline for determining the pose of a query image relative to a point cloud reconstruction of a large scene consisting of more than one million 3D points, with run-times comparable or superior to the fastest state-of-the-art methods.
Abstract: We propose a powerful pipeline for determining the pose of a query image relative to a point cloud reconstruction of a large scene consisting of more than one million 3D points. The key component of our approach is an efficient and effective search method to establish matches between image features and scene points needed for pose estimation. Our main contribution is a framework for actively searching for additional matches, based on both 2D-to-3D and 3D-to-2D search. A unified formulation of search in both directions allows us to exploit the distinct advantages of both strategies, while avoiding their weaknesses. Due to active search, the resulting pipeline is able to close the gap in registration performance observed between efficient search methods and approaches that are allowed to run for multiple seconds, without sacrificing run-time efficiency. Our method achieves the best registration performance published so far on three standard benchmark datasets, with run-times comparable or superior to the fastest state-of-the-art methods.

274 citations


Journal ArticleDOI
01 Aug 2012-Energy
TL;DR: In this paper, a mixed-integer linear programming (MILP) super-structure model for the optimal design of distributed energy generation systems that satisfy the heating and power demand at the level of a small neighborhood is presented.

267 citations


Proceedings ArticleDOI
22 Feb 2012
TL;DR: This paper develops CONNECT, an NoC generator that can produce synthesizable RTL designs of FPGA-tuned multi-node NoCs of arbitrary topology, embodying FPGA-motivated design principles that uniquely influence key NoC design decisions, such as topology, link width, router pipeline depth, network buffer sizing, and flow control.
Abstract: An FPGA is a peculiar hardware realization substrate in terms of the relative speed and cost of logic vs. wires vs. memory. In this paper, we present a Network-on-Chip (NoC) design study from the mindset of NoC as a synthesizable infrastructural element to support emerging System-on-Chip (SoC) applications on FPGAs. To support our study, we developed CONNECT, an NoC generator that can produce synthesizable RTL designs of FPGA-tuned multi-node NoCs of arbitrary topology. The CONNECT NoC architecture embodies a set of FPGA-motivated design principles that uniquely influence key NoC design decisions, such as topology, link width, router pipeline depth, network buffer sizing, and flow control. We evaluate CONNECT against a high-quality publicly available synthesizable RTL-level NoC design intended for ASICs. Our evaluation shows a significant gain in specializing NoC design decisions to FPGAs' unique mapping and operating characteristics. For example, in the case of a 4x4 mesh configuration evaluated using a set of synthetic traffic patterns, we obtain comparable or better performance than the state-of-the-art NoC while reducing logic resource cost by 58%, or alternatively, achieve 3-4x better performance for approximately the same logic resource usage. Finally, to demonstrate CONNECT's flexibility and extensive design space coverage, we also report synthesis and network performance results for several router configurations and for entire CONNECT networks.

201 citations


Proceedings ArticleDOI
10 Nov 2012
TL;DR: The lightweight, flexible framework allows scientists dealing with the data deluge at extreme scale to perform analyses at increased temporal resolutions, mitigate I/O costs, and significantly improve the time to insight.
Abstract: With the onset of extreme-scale computing, I/O constraints make it increasingly difficult for scientists to save a sufficient amount of raw simulation data to persistent storage. One potential solution is to change the data analysis pipeline from a post-process centric to a concurrent approach based on either in-situ or in-transit processing. In this context computations are considered in-situ if they utilize the primary compute resources, while in-transit processing refers to offloading computations to a set of secondary resources using asynchronous data transfers. In this paper we explore the design and implementation of three common analysis techniques typically performed on large-scale scientific simulations: topological analysis, descriptive statistics, and visualization. We summarize algorithmic developments, describe a resource scheduling system to coordinate the execution of various analysis workflows, and discuss our implementation using the DataSpaces and ADIOS frameworks that support efficient data movement between in-situ and in-transit computations. We demonstrate the efficiency of our lightweight, flexible framework by deploying it on the Jaguar XK6 to analyze data generated by S3D, a massively parallel turbulent combustion code. Our framework allows scientists dealing with the data deluge at extreme scale to perform analyses at increased temporal resolutions, mitigate I/O costs, and significantly improve the time to insight.

185 citations


Journal ArticleDOI
TL;DR: In this article, the mechanical behavior of buried steel pipes crossing active strike-slip tectonic faults is investigated. The results of the present study can be used for the development of performance-based design methodologies for buried steel pipelines.

176 citations


Journal Article
TL;DR: The design of a high-performance MIPS cryptography processor based on the triple data encryption standard is described, with the pipeline stages organized so that the pipeline can be clocked at a high frequency; the small adjustments and minor improvements made to the MIPS pipelined architecture design are also described.
Abstract: The paper describes the design of a high-performance MIPS cryptography processor based on the triple data encryption standard. The pipeline stages are organized in such a way that the pipeline can be clocked at a high frequency. The encryption and decryption blocks of the triple data encryption standard (T-DES) cryptosystem and the dependencies among them are explained in detail with the help of a block diagram. In order to increase the processor's functionality and performance, especially for security applications, we include three new 32-bit instructions: LKLW, LKUW and CRYPT. The design has been synthesized at 40nm process technology targeting a Xilinx Virtex-6 device. The overall MIPS crypto processor works at 209MHz.

Keywords: ALU, register file, pipeline, memory, T-DES, throughput

1. INTRODUCTION

In today's digital world, cryptography is the art and science that deals with the principles and methods for keeping messages secure. Encryption is emerging as an inseparable part of all communication networks and information processing systems that involve the transmission of data. Encryption is the transformation of plain data (known as plaintext) into unintelligible data (known as ciphertext) through an algorithm referred to as a cipher. The MIPS architecture is employed in a wide range of applications. The architecture remains the same for all MIPS-based processors, while the implementations may differ [1]. The proposed design features a 32-bit asymmetric and symmetric cryptography system as a security application. A 16-bit RSA MIPS cryptosystem has been previously designed [2]. Small adjustments and minor improvements have been made to the MIPS pipelined architecture design to protect data transmission over an insecure medium using authenticating devices such as the data encryption standard (DES), Triple-DES and the advanced encryption standard (AES) [3]. These cryptographic devices use an identical key on the receiver side and the sender side.
Our design integrates the symmetric cryptosystem into the MIPS pipeline stages, which makes it suitable for encrypting large amounts of data at high speed. MIPS (Microprocessor without Interlocked Pipeline Stages) is one of the best-known RISC (Reduced Instruction Set Computer) processors ever designed. High-speed MIPS processors use a pipelined architecture to speed up processing and increase the frequency and performance of the processor. A MIPS-based RISC processor was described in [4]. It consists of the five basic pipeline stages shown in Fig. 1: instruction fetch, instruction decode, instruction execution, memory access, and write back. These five pipeline stages incur a processing delay of five clock cycles and can give rise to several hazards during operation [2]. These pipelining hazards are eliminated by inserting NOP (No Operation Performed) instructions, which introduce the delays needed for the proper execution of instructions [4]. Pipelining hazards are of three types: data, structural and control hazards. These hazards are handled in the MIPS processor by the implementation of a forwarding unit, a pre-fetching or hazard detection unit, and a branch and jump prediction unit [2]. The forwarding unit prevents data hazards: it detects dependencies and forwards the required data from the running instruction to the dependent instructions [5]. Stalls occur in a pipelined architecture when consecutive instructions use the same operand, requiring more clock cycles for execution and reducing performance. To overcome this situation, an instruction pre-fetching unit is used, which reduces stalls and improves performance. Control hazards occur when a branch prediction is mistaken or, in general, when the system has no mechanism for handling them [5]. Control hazards are handled by two mechanisms: a flush mechanism and a delayed jump mechanism.
The branch and jump prediction unit uses these two mechanisms to prevent control hazards. The flush mechanism runs the instructions after a branch and flushes the pipe after a misprediction [5]. Frequent flushing may increase the clock cycles and reduce performance. In the delayed jump mechanism, the control hazard is handled by filling the pipe after the jump instruction with a specific number of NOPs [5]. The placement of the branch and jump prediction unit in the pipelined architecture may affect the critical (longest) path. Detecting the longest path and improving the hardware to achieve the minimum clock period is the standard method of increasing the performance of the processor. To further speed up the processor and minimize the clock period, the design incorporates a high-speed hybrid adder, which employs both carry-skip and carry-select techniques within the ALU to handle additions. This paper is organized as follows. The system architecture, hardware design and implementation are explained in Section II; the instruction set of MIPS, including the new instructions, is shown in detail with corresponding diagrams in its sub-sections. The hardware implementation design methodology is explained in Section III. The experimental results for the pipeline stages are presented in Section IV; the simulation results of the encrypted MIPS pipeline processor and their verification and synthesis reports are described in its sub-sections. The conclusions of the paper are presented in Section V.
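The hazard-handling rules the introduction describes — forwarding for plain RAW dependencies, a stall for load-use cases — can be sketched abstractly. This toy model is illustrative only, not the paper's RTL:

```python
def resolve_hazard(producer, consumer):
    """Decide how the pipeline handles a RAW dependency between two
    adjacent instructions: forward the result, or stall one cycle for a
    load-use case where the data is not ready in time.
    (Toy model of the mechanisms described above, not the paper's RTL.)"""
    if producer["rd"] in consumer["rs"]:
        # A load produces its result only in the MEM stage, one stage
        # too late for the next instruction's EX stage, so forwarding
        # alone cannot cover it.
        return "stall" if producer["op"] == "lw" else "forward"
    return "none"

i1 = {"op": "add", "rd": 3, "rs": [1, 2]}
i2 = {"op": "sub", "rd": 4, "rs": [3, 5]}  # reads $3 right after it is written
i3 = {"op": "lw",  "rd": 6, "rs": [0]}
i4 = {"op": "add", "rd": 7, "rs": [6, 6]}  # load-use: needs one bubble
```

A real forwarding unit compares register indices across the EX/MEM and MEM/WB pipeline latches; the dictionary fields here simply stand in for those latch contents.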

167 citations



Proceedings ArticleDOI
30 Sep 2012
TL;DR: The Precision-Timed ARM (PTARM) is introduced, a precision-timed microarchitecture implementation that exhibits repeatable execution times without sacrificing performance, and shows an improved throughput compared to a single-threaded in-order five-stage pipeline, given sufficient parallelism in the software.
Abstract: We contend that repeatability of execution times is crucial to the validity of testing of real-time systems. However, computer architecture designs fail to deliver repeatable timing, a consequence of aggressive techniques that improve average-case performance. This paper introduces the Precision-Timed ARM (PTARM), a precision-timed (PRET) microarchitecture implementation that exhibits repeatable execution times without sacrificing performance. The PTARM employs a repeatable thread-interleaved pipeline with an exposed memory hierarchy, including a repeatable DRAM controller. Our benchmarks show an improved throughput compared to a single-threaded in-order five-stage pipeline, given sufficient parallelism in the software.
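The repeatability argument can be made concrete with a toy round-robin scheduler: with N interleaved hardware threads, each thread is issued on a fixed cycle slot regardless of what the other threads do. This is a sketch of the scheduling idea only, not the PTARM microarchitecture:

```python
def issue_slots(n_threads, cycles):
    """Round-robin thread interleaving: on cycle c, the pipeline issues
    an instruction from hardware thread c mod N. Each thread therefore
    gets a slot exactly every N cycles, independent of the other
    threads' behaviour -- the source of repeatable timing.
    (Sketch of the scheduling idea only.)"""
    return [c % n_threads for c in range(cycles)]

slots = issue_slots(4, 12)
# thread 0 is issued on cycles 0, 4, 8: a fixed, repeatable cadence
```

The interleaving also hides per-thread pipeline hazards: by the time a thread's next instruction enters the pipeline, its previous one has long since completed, which is why throughput can exceed that of a single-threaded in-order pipeline when enough software parallelism exists.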

101 citations


Proceedings ArticleDOI
16 Apr 2012
TL;DR: This work differs from previous work by modeling the interaction of the shared cache and shared bus with other basic micro-architectural components (e.g. pipeline and branch predictor), and it does not assume a timing-anomaly-free multi-core architecture for computing the WCET.
Abstract: With the advent of multi-core architectures, worst case execution time (WCET) analysis has become an increasingly difficult problem. In this paper, we propose a unified WCET analysis framework for multi-core processors featuring both shared cache and shared bus. Compared to other previous works, our work differs by modeling the interaction of shared cache and shared bus with other basic micro-architectural components (e.g. pipeline and branch predictor). In addition, our framework does not assume a timing anomaly free multi-core architecture for computing the WCET. A detailed experiment methodology suggests that we can obtain reasonably tight WCET estimates in a wide range of benchmark programs.

Proceedings ArticleDOI
13 May 2012
TL;DR: To parallelize the bzip2 compression pipeline, this work utilizes a two-level hierarchical sort for BWT, designs a novel scan-based parallel MTF algorithm, and implements a parallel reduction scheme to build the Huffman tree.
Abstract: We present parallel algorithms and implementations of a bzip2-like lossless data compression scheme for GPU architectures. Our approach parallelizes three main stages in the bzip2 compression pipeline: Burrows-Wheeler transform (BWT), move-to-front transform (MTF), and Huffman coding. In particular, we utilize a two-level hierarchical sort for BWT, design a novel scan-based parallel MTF algorithm, and implement a parallel reduction scheme to build the Huffman tree. For each algorithm, we perform detailed performance analysis, discuss its strengths and weaknesses, and suggest future directions for improvements. Overall, our GPU implementation is dominated by BWT performance and is 2.78× slower than bzip2, with BWT and MTF-Huffman respectively 2.89× and 1.34× slower on average.
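For reference, the move-to-front stage that the paper parallelizes with a scan is defined sequentially as follows; this is only the sequential definition the GPU version must reproduce, not the paper's parallel algorithm:

```python
def mtf_encode(data: bytes) -> list:
    """Sequential definition of the move-to-front transform: each byte
    is replaced by its index in a recency-ordered table, and that byte
    is then moved to the front of the table."""
    table = list(range(256))
    out = []
    for b in data:
        i = table.index(b)
        out.append(i)
        table.pop(i)
        table.insert(0, b)
    return out

# runs of identical symbols (common after the BWT) become runs of zeros,
# which the final Huffman stage compresses well
codes = mtf_encode(b"aaabbb")
```

The inherently serial table update is what makes MTF hard to parallelize, and is exactly the dependency the paper's scan-based formulation restructures.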

Journal ArticleDOI
01 Mar 2012
TL;DR: The main goal of this paper is to show that interval type-2 fuzzy inference systems (IT2 FIS) can be used in applications that require high speed processing and shows that the iterative KM method can be efficient if it is adequately implemented using the appropriate combination of hardware and software.
Abstract: The main goal of this paper is to show that interval type-2 fuzzy inference systems (IT2 FIS) can be used in applications that require high-speed processing. This is an important issue, since the use of IT2 FIS is still controversial for several reasons; one of the most important is the dramatic increase in computational complexity that type reducers, like the Karnik-Mendel (KM) iterative method, can cause even for small systems. Comparing our results against a typical implementation of an IT2 FIS using a high-level language on a computer, we show that with a hardware implementation the whole IT2 FIS (fuzzification, inference engine, type reducer and defuzzification) takes only four clock cycles; a speed-up of nearly 225,000 and 450,000 can be obtained for the Spartan 3 and Virtex 5 Field Programmable Gate Arrays (FPGAs), respectively. This proposal is suitable for a pipelined implementation, so the complete IT2 process can be obtained in just one clock cycle, with a consequent gain in speed of 900,000 and 2,400,000 for the aforementioned FPGAs. This paper also shows that the iterative KM method can be efficient if it is adequately implemented using the appropriate combination of hardware and software. Comparative experiments of control surfaces, and of the time response in the control of a real plant, using the IT2 FIS implemented on a computer against the IT2 FIS on an FPGA are shown.

Proceedings ArticleDOI
29 Apr 2012
TL;DR: This paper introduces a novel FPGA-based methodology for accelerating SQL queries using dynamic partial reconfiguration and shows that it is able to achieve a substantially higher throughput compared to a software-only solution.
Abstract: In this paper, we introduce a novel FPGA-based methodology for accelerating SQL queries using dynamic partial reconfiguration. Query acceleration is of utmost importance in large database systems to achieve a very high throughput. Although common FPGA-based accelerators are suitable to achieve such a high throughput, their design is hard to extend for new operations. Using partial dynamic reconfiguration, we are able to build more flexible architectures which can be extended to new operations or SQL constructs with a very low area overhead on the FPGA. Furthermore, the reconfiguration of a few FPGA frames can be used to switch very fast from one query to the next. In our approach, an SQL query is transformed into a hardware pipeline consisting of partially reconfigurable modules. The assembly of the (FPGA) data path is done at run-time using a static system providing the stream-based communication interfaces to the partial modules and the database management system. More specifically, each incoming SQL query is analyzed and divided into single operations which are subsequently mapped onto library modules and the composed data path loaded on the FPGA. We show that our approach is able to achieve a substantially higher throughput compared to a software-only solution.
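The overall flow — splitting a query into operators and chaining pre-built modules into a data path — can be mimicked in software. Here Python functions stand in for the partially reconfigurable modules, and all names are hypothetical:

```python
# A "library" maps operator names to module factories (stand-ins for
# partial bitstreams); a query plan is assembled into one streaming
# data path at run time, mirroring the run-time assembly on the FPGA.
MODULE_LIBRARY = {
    "filter":  lambda pred: lambda rows: (r for r in rows if pred(r)),
    "project": lambda cols: lambda rows: ({c: r[c] for c in cols} for r in rows),
}

def build_pipeline(plan):
    """Compose the plan's operators into one streaming pipeline."""
    stages = [MODULE_LIBRARY[op](arg) for op, arg in plan]
    def run(rows):
        for stage in stages:
            rows = stage(rows)
        return list(rows)
    return run

rows = [{"id": 1, "price": 5}, {"id": 2, "price": 15}]
query = build_pipeline([("filter", lambda r: r["price"] > 10),
                        ("project", ["id"])])
result = query(rows)  # [{'id': 2}]
```

Swapping one stage for another touches only that stage's entry in the plan, which is the software analogue of reconfiguring a few FPGA frames to switch queries quickly.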

Proceedings ArticleDOI
10 Apr 2012
TL;DR: This work demonstrates LazyBase's tradeoff between query latency and result freshness as well as the benefits of its consistency model, and demonstrates specific cases where Cassandra's consistency model is weaker than LazyBase's.
Abstract: The LazyBase scalable database system is specialized for the growing class of data analysis applications that extract knowledge from large, rapidly changing data sets. It provides the scalability of popular NoSQL systems without the query-time complexity associated with their eventual consistency models, offering a clear consistency model and explicit per-query control over the trade-off between latency and result freshness. With an architecture designed around batching and pipelining of updates, LazyBase simultaneously ingests atomic batches of updates at a very high throughput and offers quick read queries to a stale-but-consistent version of the data. Although slightly stale results are sufficient for many analysis queries, fully up-to-date results can be obtained when necessary by also scanning updates still in the pipeline. Compared to the Cassandra NoSQL system, LazyBase provides 4X--5X faster update throughput and 4X faster read query throughput for range queries while remaining competitive for point queries. We demonstrate LazyBase's tradeoff between query latency and result freshness as well as the benefits of its consistency model. We also demonstrate specific cases where Cassandra's consistency model is weaker than LazyBase's.
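The latency/freshness trade-off can be sketched with a toy store that batches pending updates; this is an illustration of the idea only, not LazyBase's actual interface:

```python
class LazyStore:
    """Toy model of batched ingestion with explicit freshness control:
    fast reads see the last committed (stale-but-consistent) version,
    while a 'fresh' read also scans updates still in the pipeline.
    (Illustration only, not LazyBase's actual interface.)"""
    def __init__(self):
        self.committed = {}
        self.pending = []
    def ingest(self, key, value):
        self.pending.append((key, value))       # high-throughput append
    def commit_batch(self):
        for k, v in self.pending:               # apply the batch atomically
            self.committed[k] = v
        self.pending.clear()
    def read(self, key, fresh=False):
        if fresh:                               # slower, fully up to date
            for k, v in reversed(self.pending):
                if k == key:
                    return v
        return self.committed.get(key)          # stale but consistent

store = LazyStore()
store.ingest("sensor", 42)
stale = store.read("sensor")              # None: batch not committed yet
fresh = store.read("sensor", fresh=True)  # 42: scanned the pipeline
store.commit_batch()
committed = store.read("sensor")          # 42
```

The design choice mirrors the abstract: ingestion stays cheap because updates are only appended, and each query pays for freshness only when it asks for it.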

Journal ArticleDOI
TL;DR: A unique method for generating a candidate network from scratch, from which the optimization model selects the optimal set of arcs to form the pipeline network, can be applied to any network optimization problem, including transmission-line, road, and telecommunication applications.

Journal ArticleDOI
TL;DR: Modifications are made to the lifting scheme, and the intermediate results are recombined and stored to reduce the number of pipelining stages to achieve a critical path with only one multiplier.
Abstract: A high-speed and reduced-area 2-D discrete wavelet transform (2-D DWT) architecture is proposed. Previous DWT architectures are mostly based on the modified lifting scheme or the flipping structure. In order to achieve a critical path with only one multiplier, at least four pipelining stages are required for one lifting step, or a large temporal buffer is needed. In this brief, modifications are made to the lifting scheme, and the intermediate results are recombined and stored to reduce the number of pipelining stages. As a result, the number of registers can be reduced to 18 without extending the critical path. In addition, a two-input/two-output parallel scanning architecture is adopted in our design. For a 2-D DWT, the proposed architecture only requires three registers between the row and column filters as the transposing buffer, and a higher efficiency can be achieved.
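For context, a single lifting step (predict + update) of the Le Gall 5/3 wavelet — the kind of step such architectures pipeline — looks like this in a floating-point software sketch; the naming and boundary handling here are our own, not the brief's:

```python
def lifting_53(x):
    """One predict+update lifting step of the Le Gall 5/3 wavelet
    (floating-point variant, edge samples replicated at the boundary).
    One such step is the unit that hardware DWT architectures pipeline."""
    assert len(x) % 2 == 0
    s = [float(v) for v in x[0::2]]  # even samples -> approximation
    d = [float(v) for v in x[1::2]]  # odd samples  -> detail
    n = len(d)
    # predict: subtract the average of the two neighbouring even samples
    d = [d[i] - 0.5 * (s[i] + s[min(i + 1, n - 1)]) for i in range(n)]
    # update: add a quarter of the two neighbouring details to the evens
    s = [s[i] + 0.25 * (d[max(i - 1, 0)] + d[i]) for i in range(n)]
    return s, d

# a linear ramp is predicted exactly, so interior details vanish
# (only the last detail is nonzero, due to the boundary replication)
s, d = lifting_53([0, 1, 2, 3, 4, 5, 6, 7])
```

Because the update step consumes the predict step's outputs, a naive hardware mapping serializes them; storing recombined intermediate results, as the brief does, is what shortens that dependency chain.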

Journal ArticleDOI
TL;DR: A flagging and calibration pipeline intended for making quick look images from GMRT data that identifies and flags corrupted visibilities, computes calibration solutions and interpolates these onto the target source.
Abstract: We describe a flagging and calibration pipeline intended for making quick-look images from GMRT data. The package identifies and flags corrupted visibilities, computes calibration solutions and interpolates these onto the target source. These flagged, calibrated visibilities can be directly imaged using any standard imaging package. The pipeline is written in "C", with the most compute-intensive algorithms parallelized using OpenMP.

Patent
G. Glenn Henry1, Gerard M. Col1, Colin Eddy1, Rodney E. Hooker1, Terry Parks1 
06 Apr 2012
TL;DR: In this article, a microprocessor instruction translator translates a conditional load instruction into at least two microinstructions, and an out-of-order execution pipeline executes the instructions.
Abstract: A microprocessor instruction translator translates a conditional load instruction into at least two microinstructions. An out-of-order execution pipeline executes the microinstructions. To execute the first microinstruction, an execution unit receives source operands from the source registers of a register file and responsively generates a first result using the source operands. To execute the second microinstruction, an execution unit receives a previous value of the destination register and the first result and responsively reads data from a memory location specified by the first result and provides a second result that is the data if a condition is satisfied and that is the previous destination register value if not. The previous value of the destination register comprises a result produced by execution of a microinstruction that is the most recent in-order previous writer of the destination register with respect to the second microinstruction.
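The two-microinstruction semantics described above can be modeled abstractly; this toy interpreter is one reading of the abstract, not the patented hardware:

```python
def exec_conditional_load(regs, mem, dst, srcs, cond):
    """uop1: compute an address (the 'first result') from the source
    operands; uop2: read memory at that address and write either the
    loaded data (condition true) or the previous destination value
    (condition false). Hypothetical model of the translation described
    in the abstract, not the patented hardware."""
    addr = sum(regs[r] for r in srcs)    # uop1: first result (address)
    old = regs[dst]                      # previous destination value
    loaded = mem.get(addr, 0)            # uop2: memory read
    regs[dst] = loaded if cond else old  # condition selects the result

ra = {1: 8, 2: 4, 5: 99}
exec_conditional_load(ra, {12: 7}, dst=5, srcs=[1, 2], cond=True)
taken = ra[5]       # 7: condition held, loaded value written

rb = {1: 8, 2: 4, 5: 99}
exec_conditional_load(rb, {12: 7}, dst=5, srcs=[1, 2], cond=False)
not_taken = rb[5]   # 99: condition failed, previous value kept
```

Writing the old value back when the condition fails is what lets an out-of-order pipeline rename the destination unconditionally: the second microinstruction always produces a result.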

Proceedings ArticleDOI
01 Jan 2012
TL;DR: The current work presents a computational pipeline to simulate transcranial direct current stimulation from image based models of the head with SCIRun, supported by a complete suite of open source software tools.
Abstract: The current work presents a computational pipeline to simulate transcranial direct current stimulation from image based models of the head with SCIRun [15]. The pipeline contains all the steps necessary to carry out the simulations and is supported by a complete suite of open source software tools: image visualization, segmentation, mesh generation, tDCS electrode generation and efficient tDCS forward simulation.

Journal ArticleDOI
TL;DR: In this article, a fully autonomous data reduction pipeline has been developed for FRODOSpec, an optical fibre-fed integral field spectrograph currently in use at the Liverpool Telescope.
Abstract: A fully autonomous data reduction pipeline has been developed for FRODOSpec, an optical fibre-fed integral field spectrograph currently in use at the Liverpool Telescope. This paper details the process required for the reduction of data taken using an integral field spectrograph and presents an overview of the computational methods implemented to create the pipeline. Analysis of errors and possible future enhancements are also discussed.

Journal ArticleDOI
TL;DR: A novel field-programmable gate array (FPGA) based method for empirical mode decomposition (EMD) in real time is presented, which uses a circular queue to temporarily store values of maxima and minima; FPGA realization reveals its effectiveness in real-time applications.
Abstract: This paper presents a novel field-programmable gate array (FPGA) based method for empirical mode decomposition (EMD) in real time. Traditionally, EMD can be easily implemented and developed using a high-level computer language in a PC or DSP chip. However, it is difficult to implement EMD in a hardware environment. This paper develops EMD for real-time applications using a hardware-based FPGA. The proposed FPGA-based method calculates the upper and lower envelopes in EMD point by point by using a circular queue to temporarily store values of maxima and minima, from which the upper and lower envelopes in the EMD can be determined continuously. Additionally, an attempt is made to increase the efficiency of the computational process by cascading several identical modules as a serial pipeline structure in order to conduct an iterative loop for calculating the intrinsic mode functions in EMD. The fast process from the serial pipeline structure results in real-time computation with a sampling rate of up to 12.5 MHz and mitigation of the end effect. The proposed method is validated by the simulation results obtained by Quartus II and verified by FPGA (Altera Stratix III EP3SL150F1152C2) realization, revealing its effectiveness in real-time applications.
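The streaming extrema detection that feeds the envelope computation can be sketched with bounded circular queues (Python deques here); this models the data flow only, not the paper's FPGA implementation:

```python
from collections import deque
import math

def stream_extrema(samples, maxlen=64):
    """Point-by-point detection of local maxima and minima, pushed into
    bounded circular queues (deques here) from which the upper and lower
    envelopes would be interpolated. Models the data flow described
    above, not the paper's RTL."""
    maxima, minima = deque(maxlen=maxlen), deque(maxlen=maxlen)
    prev2 = prev1 = None
    for i, x in enumerate(samples):
        if prev2 is not None:
            if prev2 < prev1 > x:
                maxima.append((i - 1, prev1))   # local maximum at i-1
            if prev2 > prev1 < x:
                minima.append((i - 1, prev1))   # local minimum at i-1
        prev2, prev1 = prev1, x
    return list(maxima), list(minima)

sig = [math.sin(2 * math.pi * t / 20) for t in range(60)]
mx, mn = stream_extrema(sig)  # peaks at t = 5, 25, 45; troughs at 15, 35, 55
```

Because each sample is examined once against a two-sample window, the detection is purely streaming, which is what allows the hardware version to run point by point at a fixed sampling rate.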

Journal ArticleDOI
TL;DR: In this paper, the vortex shedding flow past a piggyback pipeline close to a plane boundary is investigated numerically; the piggyback pipeline comprises a large pipeline and a small pipeline.
Abstract: The vortex shedding flow past a piggyback pipeline close to a plane boundary is investigated numerically. The piggyback pipeline is comprised of a large pipeline and a small pipeline. The aim of th...

Journal ArticleDOI
TL;DR: The performance in terms of the processing speed of the architecture designed based on the proposed scheme is superior to those of the architectures designed using other existing schemes, and it has similar or lower hardware consumption.
Abstract: In this paper, a scheme for the design of a high-speed pipeline VLSI architecture for the computation of the 2-D discrete wavelet transform (DWT) is proposed. The main focus in the development of the architecture is on providing a high operating frequency and a small number of clock cycles along with an efficient hardware utilization by maximizing the inter-stage and intra-stage computational parallelism for the pipeline. The inter-stage parallelism is enhanced by optimally mapping the computational task of multi decomposition levels to the stages of the pipeline and synchronizing their operations. The intra-stage parallelism is enhanced by dividing the 2-D filtering operation into four subtasks that can be performed independently in parallel and minimizing the delay of the critical path of bit-wise adder networks for performing the filtering operation. To validate the proposed scheme, a circuit is designed, simulated, and implemented in FPGA for the 2-D DWT computation. The results of the implementation show that the circuit is capable of operating with a maximum clock frequency of 134 MHz and processing 1022 frames of size 512 × 512 per second with this operating frequency. It is shown that the performance in terms of the processing speed of the architecture designed based on the proposed scheme is superior to those of the architectures designed using other existing schemes, and it has similar or lower hardware consumption.

Journal ArticleDOI
TL;DR: Improved architectures for a fused floating-point add-subtract unit for digital signal processing applications such as fast Fourier transform (FFT) and discrete cosine transform (DCT) butterfly operations are presented.
Abstract: This paper presents improved architectures for a fused floating-point add-subtract unit. The fused floating-point add-subtract unit is useful for digital signal processing (DSP) applications such as fast Fourier transform (FFT) and discrete cosine transform (DCT) butterfly operations. To improve the performance of the fused floating-point add-subtract unit, a dual-path algorithm and pipelining are employed. The proposed designs are implemented for both single and double precision and synthesized with a 45-nm standard-cell library. The fused floating-point add-subtract unit saves 40% of the area and power consumption compared to a discrete floating-point add-subtract unit. The proposed dual-path design reduces the latency by 30% compared to the discrete design with area and power consumption between that of the discrete and fused designs. Based on a data flow analysis, the proposed fused dual-path floating-point add-subtract unit can be split into two pipeline stages. Since the latencies of two pipeline stages are fairly well balanced, the throughput is increased by 80% compared to the nonpipelined dual-path design.

Proceedings ArticleDOI
25 Feb 2012
TL;DR: An efficient and scalable code generation framework that can map general purpose streaming applications onto a multi-GPU system and is implemented as a back-end of the StreamIt programming language compiler.
Abstract: Graphics processing units leverage a large array of parallel processing cores to boost the performance of a specific streaming computation pattern frequently found in graphics applications. Unfortunately, while many other general-purpose applications do exhibit the required streaming behavior, they also possess unfavorable data layouts and poor computation-to-communication ratios that penalize any straightforward execution on the GPU. In this paper we describe an efficient and scalable code generation framework that can map general-purpose streaming applications onto a multi-GPU system. This framework spans the entire core and memory hierarchy exposed by the multi-GPU system. Several key features of our framework ensure the scalability required by complex streaming applications. First, we propose an efficient stream graph partitioning algorithm that partitions a complex application to achieve the best performance under a given shared-memory constraint. Next, the resulting partitions are mapped to multiple GPUs using an efficient architecture-driven strategy. The mapping balances the workload while accounting for the communication overhead. Finally, a highly effective pipeline execution scheme is employed to execute the partitions on the multi-GPU system. The framework has been implemented as a back-end of the StreamIt programming language compiler. Our comprehensive experiments show its scalability and significant performance speedup compared with a previous state-of-the-art solution.
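As a toy illustration of partitioning under a shared-memory constraint (a simple greedy heuristic for a linear filter chain, not the paper's algorithm):

```python
def partition_stream(filters, mem_limit):
    """Greedily pack consecutive stream filters into partitions.

    `filters` is a list of (name, memory_footprint) pairs; consecutive
    filters are packed into one partition until adding the next filter
    would exceed the shared-memory budget, at which point a new
    partition is opened.
    """
    partitions, current, used = [], [], 0
    for name, mem in filters:
        if current and used + mem > mem_limit:
            partitions.append(current)    # close the full partition
            current, used = [], 0
        current.append(name)
        used += mem
    if current:
        partitions.append(current)        # flush the last partition
    return partitions
```

With a budget of 6, filters of size 3, 3, 5, 2 fall into three partitions: the first two filters fit together, while each of the larger remaining filters starts a fresh partition. A real partitioner would also weigh the communication cost cut by each partition boundary, as the paper's algorithm does.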

Patent
27 Feb 2012
TL;DR: In this article, a computer-implemented method for optimizing a data pipeline system includes processing a configuration manifest to generate a framework of the pipeline system and a data flow logic package of the system.
Abstract: A computer-implemented method for optimizing a data pipeline system includes processing a data pipeline configuration manifest to generate a framework of the data pipeline system and a data flow logic package of the data pipeline system. The data pipeline configuration manifest includes an object-oriented metadata model of the data pipeline system. The computer-implemented method further includes monitoring performance of the data pipeline system during execution of the data flow logic package to obtain a performance metric for the data pipeline system, and modifying, with a processor, the framework of the data pipeline system based on the data pipeline configuration manifest and the performance metric.
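A minimal sketch of the manifest-driven idea, with a manifest layout and operation names of our own invention (the patent does not specify them): a metadata model of the pipeline is interpreted to generate an executable data-flow:

```python
def build_pipeline(manifest):
    """Generate an executable pipeline from a configuration manifest.

    The manifest's metadata model is a list of stage descriptors; each
    descriptor is resolved to a callable stage, and the generated
    pipeline applies the stages in order to each record. Because the
    framework is derived from the manifest, it can be regenerated when
    monitoring shows the current layout performs poorly.
    """
    ops = {"upper": str.upper, "strip": str.strip}  # illustrative op table
    stages = [ops[s["op"]] for s in manifest["stages"]]

    def run(record):
        for stage in stages:
            record = stage(record)
        return record

    return run
```

For example, a manifest declaring a `strip` stage followed by an `upper` stage yields a pipeline mapping `"  hi "` to `"HI"`.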

Patent
05 Jan 2012
TL;DR: In this paper, a configurable processor is coupled to the at least one hardware stage of the packet processing pipeline, which is configured to modify the field in the data structure to generate a modified data structure.
Abstract: In a network device, a plurality of ports is configured to receive and to transmit packets on a network. A packet processing pipeline includes a plurality of hardware stages, wherein at least one hardware stage is configured to output a data structure comprising a field extracted from a received packet based on a first packet processing operation performed on the packet or the data structure, wherein the data structure is associated with the packet. A configurable processor is coupled to the at least one hardware stage of the packet processing pipeline. The configurable processor is configured to modify the field in the data structure to generate a modified data structure and to pass the modified data structure to a subsequent hardware stage that is configured to perform a second packet processing operation on the data structure using the field modified by the configurable processor.
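The claimed structure can be sketched in software as fixed stages and a configurable hook that all operate on one shared data structure (stage names and fields here are illustrative, not from the patent):

```python
def run_pipeline(packet, stages):
    """Pass a shared per-packet data structure through pipeline stages."""
    ctx = {"packet": packet}
    for stage in stages:
        ctx = stage(ctx)
    return ctx

def extract_stage(ctx):
    # "Hardware" stage: extract a field from the packet into the
    # shared data structure.
    ctx["dst"] = ctx["packet"]["dst"]
    return ctx

def rewrite_stage(ctx):
    # Configurable-processor hook: modify the extracted field before
    # the next hardware stage sees it (here, an arbitrary rewrite).
    ctx["dst"] = ctx["dst"] + 100
    return ctx

def forward_stage(ctx):
    # Subsequent "hardware" stage: its operation consumes the field as
    # modified by the configurable processor.
    ctx["out_port"] = ctx["dst"] % 8
    return ctx
```

The key point the patent claims is the interleaving: a programmable element sits between fixed hardware stages and edits the in-flight data structure, rather than processing the packet before or after the whole pipeline.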

Proceedings ArticleDOI
25 Feb 2012
TL;DR: This paper proposes the concept of power balanced pipelines - i.e., processor pipelines in which different delays are assigned to different microarchitectural pipestages to reduce the power disparity between the stages while guaranteeing the same processor frequency/performance.
Abstract: Since the onset of pipelined processors, balancing the delay of the microarchitectural pipeline stages so that each stage has an equal delay has been a primary design objective, as it maximizes instruction throughput. Unfortunately, this causes significant energy inefficiency in processors, as each microarchitectural pipeline stage gets the same amount of time to complete, irrespective of its size or complexity. For power-optimized processors, the inefficiency manifests itself as a significant imbalance in the power consumption of the different microarchitectural pipestages. In this paper, rather than balancing processor pipelines for delay, we propose the concept of power balanced pipelines, i.e., processor pipelines in which different delays are assigned to different microarchitectural pipestages to reduce the power disparity between the stages while guaranteeing the same processor frequency/performance. A specific implementation of the concept uses cycle time stealing [19] to deliberately redistribute cycle time from low-power pipeline stages to power-hungry stages, relaxing their timing constraints and allowing them to operate at reduced voltages or use smaller, less leaky cells. We present several static and dynamic techniques for power balancing and demonstrate that balancing pipeline power rather than delay can reduce processor power by 46% with no loss in throughput for a full FabScalar processor over a power-optimized baseline. Benefits are comparable over a FabScalar baseline where static cycle time stealing is used to optimize the achieved frequency. Power savings increase at lower operating frequencies. To the best of our knowledge, this is the first work on microarchitecture-level power reduction that guarantees the same performance.
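A back-of-the-envelope model of the cycle-time-stealing idea (our simplification, not the paper's technique): allocate each stage a slice of a fixed clock period in proportion to its power draw, so power-hungry stages get relaxed timing while the total period, and hence the processor frequency, is unchanged:

```python
def steal_cycle_time(stages, total_period):
    """Redistribute a fixed clock period among pipeline stages.

    `stages` is a list of (name, power) pairs. Each stage receives a
    share of `total_period` proportional to its power, so the hungriest
    stages get the most slack (and could run at lower voltage), while
    the slices still sum to the original period.
    """
    total_power = sum(p for _, p in stages)
    return {name: total_period * p / total_power for name, p in stages}
```

With a 4 ns period and an execute stage drawing twice the power of fetch and decode, execute receives a 2 ns slice while the other two get 1 ns each; the period, and thus throughput, is preserved by construction.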