
Showing papers on "Pipeline (computing) published in 2012"


Book ChapterDOI
28 Nov 2012
TL;DR: A new class of machine learning algorithms whose predictions, viewed as functions of the input data, can be expressed as polynomials of bounded degree is defined, and confidential algorithms for binary classification based on polynomial approximations to least-squares solutions obtained by a small number of gradient descent steps are proposed.
Abstract: We demonstrate that, by using a recently proposed leveled homomorphic encryption scheme, it is possible to delegate the execution of a machine learning algorithm to a computing service while retaining confidentiality of the training and test data. Since the computational complexity of the homomorphic encryption scheme depends primarily on the number of levels of multiplications to be carried out on the encrypted data, we define a new class of machine learning algorithms in which the algorithm's predictions, viewed as functions of the input data, can be expressed as polynomials of bounded degree. We propose confidential algorithms for binary classification based on polynomial approximations to least-squares solutions obtained by a small number of gradient descent steps. We present experimental validation of the confidential machine learning pipeline and discuss the trade-offs regarding computational complexity, prediction accuracy and cryptographic security.
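The structural requirement in the abstract — predictions expressible as bounded-degree polynomials of the data — can be illustrated in the clear (no encryption) with a sketch like the following, where a fixed, small number of gradient steps on the least-squares objective keeps the degree of the resulting predictor bounded. All names here are illustrative, not taken from the paper:

```python
def gd_least_squares(X, y, steps=3, lr=0.1):
    """A fixed, small number of gradient-descent steps on the objective
    0.5/n * sum((x.w - y)^2). Each step composes only additions and
    multiplications of the inputs, so after k steps the predictor is a
    polynomial of bounded degree in the training data -- the property
    the confidential pipeline relies on. (Illustrative sketch only.)"""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(steps):
        residuals = [sum(xi[j] * w[j] for j in range(d)) - yi
                     for xi, yi in zip(X, y)]
        grad = [sum(X[i][j] * residuals[i] for i in range(n)) / n
                for j in range(d)]
        w = [w[j] - lr * grad[j] for j in range(d)]
    return w

# toy binary classification with labels in {-1, +1}
X = [[1.0, 2.0], [2.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]]
y = [1.0, 1.0, -1.0, -1.0]
w = gd_least_squares(X, y)
pred = [1.0 if sum(a * b for a, b in zip(xi, w)) > 0 else -1.0 for xi in X]
```

Because the degree grows only linearly in the number of steps, the whole predictor fits within the bounded number of multiplication levels that a leveled homomorphic scheme can evaluate.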

440 citations


Journal ArticleDOI
13 Sep 2012-PLOS ONE
TL;DR: An assembly pipeline called A5 (Andrew And Aaron's Awesome Assembly pipeline) is presented that simplifies the entire genome assembly process by automating these stages, by integrating several previously published algorithms with new algorithms for quality control and automated assembly parameter selection.
Abstract: Remarkable advances in DNA sequencing technology have created a need for de novo genome assembly methods tailored to work with the new sequencing data types. Many such methods have been published in recent years, but assembling raw sequence data to obtain a draft genome has remained a complex, multi-step process, involving several stages of sequence data cleaning, error correction, assembly, and quality control. Successful application of these steps usually requires intimate knowledge of a diverse set of algorithms and software. We present an assembly pipeline called A5 (Andrew And Aaron's Awesome Assembly pipeline) that simplifies the entire genome assembly process by automating these stages, by integrating several previously published algorithms with new algorithms for quality control and automated assembly parameter selection. We demonstrate that A5 can produce assemblies of quality comparable to a leading assembly algorithm, SOAPdenovo, without any prior knowledge of the particular genome being assembled and without the extensive parameter tuning required by the other assembly algorithm. In particular, the assemblies produced by A5 exhibit 50% or more reduction in broken protein coding sequences relative to SOAPdenovo assemblies. The A5 pipeline can also assemble Illumina sequence data from libraries constructed by the Nextera (transposon-catalyzed) protocol, which have markedly different characteristics to mechanically sheared libraries. Finally, A5 has modest compute requirements, and can assemble a typical bacterial genome on current desktop or laptop computer hardware in under two hours, depending on depth of coverage.

418 citations


Book ChapterDOI
07 Oct 2012
TL;DR: This work proposes a powerful pipeline for determining the pose of a query image relative to a point cloud reconstruction of a large scene consisting of more than one million 3D points, with run-times comparable or superior to the fastest state-of-the-art methods.
Abstract: We propose a powerful pipeline for determining the pose of a query image relative to a point cloud reconstruction of a large scene consisting of more than one million 3D points. The key component of our approach is an efficient and effective search method to establish matches between image features and scene points needed for pose estimation. Our main contribution is a framework for actively searching for additional matches, based on both 2D-to-3D and 3D-to-2D search. A unified formulation of search in both directions allows us to exploit the distinct advantages of both strategies, while avoiding their weaknesses. Due to active search, the resulting pipeline is able to close the gap in registration performance observed between efficient search methods and approaches that are allowed to run for multiple seconds, without sacrificing run-time efficiency. Our method achieves the best registration performance published so far on three standard benchmark datasets, with run-times comparable or superior to the fastest state-of-the-art methods.

274 citations


Journal ArticleDOI
01 Aug 2012-Energy
TL;DR: In this paper, a mixed-integer linear programming (MILP) super-structure model for the optimal design of distributed energy generation systems that satisfy the heating and power demand at the level of a small neighborhood is presented.

267 citations


Proceedings ArticleDOI
22 Feb 2012
TL;DR: This paper develops CONNECT, an NoC generator that can produce synthesizable RTL designs of FPGA-tuned multi-node NoCs of arbitrary topology, embodying FPGA-motivated design principles that uniquely influence key NoC design decisions, such as topology, link width, router pipeline depth, network buffer sizing, and flow control.
Abstract: An FPGA is a peculiar hardware realization substrate in terms of the relative speed and cost of logic vs. wires vs. memory. In this paper, we present a Network-on-Chip (NoC) design study from the mindset of NoC as a synthesizable infrastructural element to support emerging System-on-Chip (SoC) applications on FPGAs. To support our study, we developed CONNECT, an NoC generator that can produce synthesizable RTL designs of FPGA-tuned multi-node NoCs of arbitrary topology. The CONNECT NoC architecture embodies a set of FPGA-motivated design principles that uniquely influence key NoC design decisions, such as topology, link width, router pipeline depth, network buffer sizing, and flow control. We evaluate CONNECT against a high-quality publicly available synthesizable RTL-level NoC design intended for ASICs. Our evaluation shows a significant gain in specializing NoC design decisions to FPGAs' unique mapping and operating characteristics. For example, in the case of a 4x4 mesh configuration evaluated using a set of synthetic traffic patterns, we obtain comparable or better performance than the state-of-the-art NoC while reducing logic resource cost by 58%, or alternatively, achieve 3-4x better performance for approximately the same logic resource usage. Finally, to demonstrate CONNECT's flexibility and extensive design space coverage, we also report synthesis and network performance results for several router configurations and for entire CONNECT networks.

201 citations


Proceedings ArticleDOI
10 Nov 2012
TL;DR: The lightweight, flexible framework allows scientists dealing with the data deluge at extreme scale to perform analyses at increased temporal resolutions, mitigate I/O costs, and significantly improve the time to insight.
Abstract: With the onset of extreme-scale computing, I/O constraints make it increasingly difficult for scientists to save a sufficient amount of raw simulation data to persistent storage. One potential solution is to change the data analysis pipeline from a post-process centric to a concurrent approach based on either in-situ or in-transit processing. In this context computations are considered in-situ if they utilize the primary compute resources, while in-transit processing refers to offloading computations to a set of secondary resources using asynchronous data transfers. In this paper we explore the design and implementation of three common analysis techniques typically performed on large-scale scientific simulations: topological analysis, descriptive statistics, and visualization. We summarize algorithmic developments, describe a resource scheduling system to coordinate the execution of various analysis workflows, and discuss our implementation using the DataSpaces and ADIOS frameworks that support efficient data movement between in-situ and in-transit computations. We demonstrate the efficiency of our lightweight, flexible framework by deploying it on the Jaguar XK6 to analyze data generated by S3D, a massively parallel turbulent combustion code. Our framework allows scientists dealing with the data deluge at extreme scale to perform analyses at increased temporal resolutions, mitigate I/O costs, and significantly improve the time to insight.

185 citations


Journal ArticleDOI
TL;DR: In this article, the mechanical behavior of buried steel pipes crossing active strike-slip tectonic faults is investigated. The results of the present study can be used for the development of performance-based design methodologies for buried steel pipelines.

176 citations


Journal Article
TL;DR: The design of a high-performance MIPS cryptography processor based on the triple data encryption standard is described, with the pipeline stages organized so that the pipeline can be clocked at a high frequency; the small adjustments and minor improvements made to the MIPS pipelined architecture design are also described.
Abstract: The paper describes the design of a high-performance MIPS cryptography processor based on the triple data encryption standard. The pipeline stages are organized in such a way that the pipeline can be clocked at a high frequency. The encryption and decryption blocks of the triple data encryption standard (T-DES) cryptosystem and the dependencies among them are explained in detail with the help of a block diagram. In order to increase the processor's functionality and performance, especially for security applications, we include three new 32-bit instructions: LKLW, LKUW and CRYPT. The design has been synthesized at 40nm process technology targeting a Xilinx Virtex-6 device. The overall MIPS crypto processor works at 209MHz.

Keywords: ALU, register file, pipeline, memory, T-DES, throughput

1. INTRODUCTION

In today's digital world, cryptography is the art and science that deals with the principles and methods for keeping messages secure. Encryption is emerging as an inseparable part of all communication networks and information processing systems that involve the transmission of data. Encryption is the transformation of plain data (known as plaintext) into unintelligible data (known as ciphertext) through an algorithm referred to as a cipher. The MIPS architecture is employed in a wide range of applications. The architecture remains the same for all MIPS-based processors, while the implementations may differ [1]. The proposed design features a 32-bit asymmetric and symmetric cryptography system as a security application. A 16-bit RSA MIPS cryptosystem has been previously designed [2]. Small adjustments and minor improvements have been made to the MIPS pipelined architecture design to protect data transmission over an insecure medium using authenticating devices such as the data encryption standard (DES), Triple-DES and the advanced encryption standard (AES) [3]. These cryptographic devices use an identical key on the receiver side and the sender side.
Our design integrates the symmetric cryptosystem into the MIPS pipeline stages, which makes it suitable for encrypting large amounts of data at high speed. MIPS (Microprocessor without Interlocked Pipeline Stages) is one of the best-known RISC (Reduced Instruction Set Computer) processors ever designed. High-speed MIPS processors use a pipelined architecture to speed up processing and increase the frequency and performance of the processor. A MIPS-based RISC processor was described in [4]. It consists of the five basic pipeline stages shown in Fig. 1: instruction fetch, instruction decode, instruction execution, memory access, and write back. These five pipeline stages incur a processing delay of five clock cycles and can give rise to several hazards during operation [2]. These pipelining hazards are eliminated by inserting NOP (No Operation Performed) instructions, which introduce the delays needed for the proper execution of instructions [4]. Pipelining hazards are of three types: data, structural and control hazards. These hazards are handled in the MIPS processor by the implementation of a forwarding unit, a pre-fetching or hazard detection unit, and a branch and jump prediction unit [2]. The forwarding unit prevents data hazards: it detects dependencies and forwards the required data from the running instruction to the dependent instructions [5]. Stalls occur in a pipelined architecture when consecutive instructions use the same operand, requiring more clock cycles for execution and reducing performance. To overcome this situation, an instruction pre-fetching unit is used, which reduces stalls and improves performance. Control hazards occur when a branch prediction is mistaken or, in general, when the system has no mechanism for handling them [5]. Control hazards are handled by two mechanisms: a flush mechanism and a delayed jump mechanism.
The branch and jump prediction unit uses these two mechanisms to prevent control hazards. The flush mechanism runs the instructions after a branch and flushes the pipe after a misprediction [5]. Frequent flushing may increase the clock cycles and reduce performance. In the delayed jump mechanism, the control hazard is handled by filling the pipe after the jump instruction with a specific number of NOPs [5]. The placement of the branch and jump prediction unit in the pipelined architecture may affect the critical (longest) path. Detecting the longest path and improving the hardware to achieve the minimum clock period is the standard method of increasing the performance of the processor. To further speed up the processor and minimize the clock period, the design incorporates a high-speed hybrid adder, which employs both carry-skip and carry-select techniques within the ALU to handle additions. This paper is organized as follows. The system architecture, hardware design and implementation are explained in Section II; the instruction set of MIPS, including the new instructions, is shown in detail with corresponding diagrams in its sub-sections. The hardware implementation design methodology is explained in Section III. The experimental results for the pipeline stages are presented in Section IV; the simulation results of the encrypted MIPS pipeline processor and their verification and synthesis reports are described in its sub-sections. The conclusions of the paper are presented in Section V.
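The hazard-handling rules the introduction describes — forwarding for plain RAW dependencies, a stall for load-use cases — can be sketched abstractly. This toy model is illustrative only, not the paper's RTL:

```python
def resolve_hazard(producer, consumer):
    """Decide how the pipeline handles a RAW dependency between two
    adjacent instructions: forward the result, or stall one cycle for a
    load-use case where the data is not ready in time.
    (Toy model of the mechanisms described above, not the paper's RTL.)"""
    if producer["rd"] in consumer["rs"]:
        # A load produces its result only in the MEM stage, one stage
        # too late for the next instruction's EX stage, so forwarding
        # alone cannot cover it.
        return "stall" if producer["op"] == "lw" else "forward"
    return "none"

i1 = {"op": "add", "rd": 3, "rs": [1, 2]}
i2 = {"op": "sub", "rd": 4, "rs": [3, 5]}  # reads $3 right after it is written
i3 = {"op": "lw",  "rd": 6, "rs": [0]}
i4 = {"op": "add", "rd": 7, "rs": [6, 6]}  # load-use: needs one bubble
```

A real forwarding unit compares register indices across the EX/MEM and MEM/WB pipeline latches; the dictionary fields here simply stand in for those latch contents.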

167 citations



Proceedings ArticleDOI
30 Sep 2012
TL;DR: The Precision-Timed ARM (PTARM) is introduced, a precision-timed microarchitecture implementation that exhibits repeatable execution times without sacrificing performance, and shows an improved throughput compared to a single-threaded in-order five-stage pipeline, given sufficient parallelism in the software.
Abstract: We contend that repeatability of execution times is crucial to the validity of testing of real-time systems. However, computer architecture designs fail to deliver repeatable timing, a consequence of aggressive techniques that improve average-case performance. This paper introduces the Precision-Timed ARM (PTARM), a precision-timed (PRET) microarchitecture implementation that exhibits repeatable execution times without sacrificing performance. The PTARM employs a repeatable thread-interleaved pipeline with an exposed memory hierarchy, including a repeatable DRAM controller. Our benchmarks show an improved throughput compared to a single-threaded in-order five-stage pipeline, given sufficient parallelism in the software.
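The repeatability argument can be made concrete with a toy round-robin scheduler: with N interleaved hardware threads, each thread is issued on a fixed cycle slot regardless of what the other threads do. This is a sketch of the scheduling idea only, not the PTARM microarchitecture:

```python
def issue_slots(n_threads, cycles):
    """Round-robin thread interleaving: on cycle c, the pipeline issues
    an instruction from hardware thread c mod N. Each thread therefore
    gets a slot exactly every N cycles, independent of the other
    threads' behaviour -- the source of repeatable timing.
    (Sketch of the scheduling idea only.)"""
    return [c % n_threads for c in range(cycles)]

slots = issue_slots(4, 12)
# thread 0 is issued on cycles 0, 4, 8: a fixed, repeatable cadence
```

The interleaving also hides per-thread pipeline hazards: by the time a thread's next instruction enters the pipeline, its previous one has long since completed, which is why throughput can exceed that of a single-threaded in-order pipeline when enough software parallelism exists.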

101 citations


Proceedings ArticleDOI
16 Apr 2012
TL;DR: This work differs from previous work by modeling the interaction of the shared cache and shared bus with other basic micro-architectural components (e.g. pipeline and branch predictor), and it does not assume a timing-anomaly-free multi-core architecture for computing the WCET.
Abstract: With the advent of multi-core architectures, worst case execution time (WCET) analysis has become an increasingly difficult problem. In this paper, we propose a unified WCET analysis framework for multi-core processors featuring both shared cache and shared bus. Compared to other previous works, our work differs by modeling the interaction of shared cache and shared bus with other basic micro-architectural components (e.g. pipeline and branch predictor). In addition, our framework does not assume a timing anomaly free multi-core architecture for computing the WCET. A detailed experiment methodology suggests that we can obtain reasonably tight WCET estimates in a wide range of benchmark programs.

Proceedings ArticleDOI
13 May 2012
TL;DR: To parallelize the bzip2 compression pipeline, this work utilizes a two-level hierarchical sort for BWT, designs a novel scan-based parallel MTF algorithm, and implements a parallel reduction scheme to build the Huffman tree.
Abstract: We present parallel algorithms and implementations of a bzip2-like lossless data compression scheme for GPU architectures. Our approach parallelizes three main stages in the bzip2 compression pipeline: Burrows-Wheeler transform (BWT), move-to-front transform (MTF), and Huffman coding. In particular, we utilize a two-level hierarchical sort for BWT, design a novel scan-based parallel MTF algorithm, and implement a parallel reduction scheme to build the Huffman tree. For each algorithm, we perform detailed performance analysis, discuss its strengths and weaknesses, and suggest future directions for improvements. Overall, our GPU implementation is dominated by BWT performance and is 2.78× slower than bzip2, with BWT and MTF-Huffman respectively 2.89× and 1.34× slower on average.
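For reference, the move-to-front stage that the paper parallelizes with a scan is defined sequentially as follows; this is only the sequential definition the GPU version must reproduce, not the paper's parallel algorithm:

```python
def mtf_encode(data: bytes) -> list:
    """Sequential definition of the move-to-front transform: each byte
    is replaced by its index in a recency-ordered table, and that byte
    is then moved to the front of the table."""
    table = list(range(256))
    out = []
    for b in data:
        i = table.index(b)
        out.append(i)
        table.pop(i)
        table.insert(0, b)
    return out

# runs of identical symbols (common after the BWT) become runs of zeros,
# which the final Huffman stage compresses well
codes = mtf_encode(b"aaabbb")
```

The inherently serial table update is what makes MTF hard to parallelize, and is exactly the dependency the paper's scan-based formulation restructures.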

Journal ArticleDOI
01 Mar 2012
TL;DR: The main goal of this paper is to show that interval type-2 fuzzy inference systems (IT2 FIS) can be used in applications that require high speed processing and shows that the iterative KM method can be efficient if it is adequately implemented using the appropriate combination of hardware and software.
Abstract: The main goal of this paper is to show that interval type-2 fuzzy inference systems (IT2 FIS) can be used in applications that require high-speed processing. This is an important issue, since the use of IT2 FIS is still controversial for several reasons; one of the most important is the dramatic increase in computational complexity that type reducers, like the Karnik-Mendel (KM) iterative method, can cause even for small systems. Comparing our results against a typical implementation of an IT2 FIS using a high-level language on a computer, we show that with a hardware implementation the whole IT2 FIS (fuzzification, inference engine, type reducer and defuzzification) takes only four clock cycles; a speed-up of nearly 225,000 and 450,000 can be obtained for the Spartan 3 and Virtex 5 Field Programmable Gate Arrays (FPGAs), respectively. This proposal is suitable for a pipelined implementation, so the complete IT2 process can be obtained in just one clock cycle, with a consequent gain in speed of 900,000 and 2,400,000 for the aforementioned FPGAs. This paper also shows that the iterative KM method can be efficient if it is adequately implemented using the appropriate combination of hardware and software. Comparative experiments of control surfaces, and of the time response in the control of a real plant, using the IT2 FIS implemented on a computer against the IT2 FIS on an FPGA are shown.

Proceedings ArticleDOI
29 Apr 2012
TL;DR: This paper introduces a novel FPGA-based methodology for accelerating SQL queries using dynamic partial reconfiguration and shows that it is able to achieve a substantially higher throughput compared to a software-only solution.
Abstract: In this paper, we introduce a novel FPGA-based methodology for accelerating SQL queries using dynamic partial reconfiguration. Query acceleration is of utmost importance in large database systems to achieve a very high throughput. Although common FPGA-based accelerators are suitable to achieve such a high throughput, their design is hard to extend for new operations. Using partial dynamic reconfiguration, we are able to build more flexible architectures which can be extended to new operations or SQL constructs with a very low area overhead on the FPGA. Furthermore, the reconfiguration of a few FPGA frames can be used to switch very fast from one query to the next. In our approach, an SQL query is transformed into a hardware pipeline consisting of partially reconfigurable modules. The assembly of the (FPGA) data path is done at run-time using a static system providing the stream-based communication interfaces to the partial modules and the database management system. More specifically, each incoming SQL query is analyzed and divided into single operations which are subsequently mapped onto library modules and the composed data path loaded on the FPGA. We show that our approach is able to achieve a substantially higher throughput compared to a software-only solution.
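The overall flow — splitting a query into operators and chaining pre-built modules into a data path — can be mimicked in software. Here Python functions stand in for the partially reconfigurable modules, and all names are hypothetical:

```python
# A "library" maps operator names to module factories (stand-ins for
# partial bitstreams); a query plan is assembled into one streaming
# data path at run time, mirroring the run-time assembly on the FPGA.
MODULE_LIBRARY = {
    "filter":  lambda pred: lambda rows: (r for r in rows if pred(r)),
    "project": lambda cols: lambda rows: ({c: r[c] for c in cols} for r in rows),
}

def build_pipeline(plan):
    """Compose the plan's operators into one streaming pipeline."""
    stages = [MODULE_LIBRARY[op](arg) for op, arg in plan]
    def run(rows):
        for stage in stages:
            rows = stage(rows)
        return list(rows)
    return run

rows = [{"id": 1, "price": 5}, {"id": 2, "price": 15}]
query = build_pipeline([("filter", lambda r: r["price"] > 10),
                        ("project", ["id"])])
result = query(rows)  # [{'id': 2}]
```

Swapping one stage for another touches only that stage's entry in the plan, which is the software analogue of reconfiguring a few FPGA frames to switch queries quickly.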

Proceedings ArticleDOI
10 Apr 2012
TL;DR: This work demonstrates LazyBase's tradeoff between query latency and result freshness as well as the benefits of its consistency model, and demonstrates specific cases where Cassandra's consistency model is weaker than LazyBase's.
Abstract: The LazyBase scalable database system is specialized for the growing class of data analysis applications that extract knowledge from large, rapidly changing data sets. It provides the scalability of popular NoSQL systems without the query-time complexity associated with their eventual consistency models, offering a clear consistency model and explicit per-query control over the trade-off between latency and result freshness. With an architecture designed around batching and pipelining of updates, LazyBase simultaneously ingests atomic batches of updates at a very high throughput and offers quick read queries to a stale-but-consistent version of the data. Although slightly stale results are sufficient for many analysis queries, fully up-to-date results can be obtained when necessary by also scanning updates still in the pipeline. Compared to the Cassandra NoSQL system, LazyBase provides 4X--5X faster update throughput and 4X faster read query throughput for range queries while remaining competitive for point queries. We demonstrate LazyBase's tradeoff between query latency and result freshness as well as the benefits of its consistency model. We also demonstrate specific cases where Cassandra's consistency model is weaker than LazyBase's.
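The latency/freshness trade-off can be sketched with a toy store that batches pending updates; this is an illustration of the idea only, not LazyBase's actual interface:

```python
class LazyStore:
    """Toy model of batched ingestion with explicit freshness control:
    fast reads see the last committed (stale-but-consistent) version,
    while a 'fresh' read also scans updates still in the pipeline.
    (Illustration only, not LazyBase's actual interface.)"""
    def __init__(self):
        self.committed = {}
        self.pending = []
    def ingest(self, key, value):
        self.pending.append((key, value))       # high-throughput append
    def commit_batch(self):
        for k, v in self.pending:               # apply the batch atomically
            self.committed[k] = v
        self.pending.clear()
    def read(self, key, fresh=False):
        if fresh:                               # slower, fully up to date
            for k, v in reversed(self.pending):
                if k == key:
                    return v
        return self.committed.get(key)          # stale but consistent

store = LazyStore()
store.ingest("sensor", 42)
stale = store.read("sensor")              # None: batch not committed yet
fresh = store.read("sensor", fresh=True)  # 42: scanned the pipeline
store.commit_batch()
committed = store.read("sensor")          # 42
```

The design choice mirrors the abstract: ingestion stays cheap because updates are only appended, and each query pays for freshness only when it asks for it.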

Journal ArticleDOI
TL;DR: A unique method for generating a candidate network from scratch, from which the optimization model selects the optimal set of arcs to form the pipeline network, can be applied to any network optimization problem, including transmission-line, road, and telecommunication applications.

Journal ArticleDOI
TL;DR: Modifications are made to the lifting scheme, and the intermediate results are recombined and stored to reduce the number of pipelining stages to achieve a critical path with only one multiplier.
Abstract: A high-speed and reduced-area 2-D discrete wavelet transform (2-D DWT) architecture is proposed. Previous DWT architectures are mostly based on the modified lifting scheme or the flipping structure. In order to achieve a critical path with only one multiplier, at least four pipelining stages are required for one lifting step, or a large temporal buffer is needed. In this brief, modifications are made to the lifting scheme, and the intermediate results are recombined and stored to reduce the number of pipelining stages. As a result, the number of registers can be reduced to 18 without extending the critical path. In addition, a two-input/two-output parallel scanning architecture is adopted in our design. For a 2-D DWT, the proposed architecture only requires three registers between the row and column filters as the transposing buffer, and a higher efficiency can be achieved.
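For context, a single lifting step (predict + update) of the Le Gall 5/3 wavelet — the kind of step such architectures pipeline — looks like this in a floating-point software sketch; the naming and boundary handling here are our own, not the brief's:

```python
def lifting_53(x):
    """One predict+update lifting step of the Le Gall 5/3 wavelet
    (floating-point variant, edge samples replicated at the boundary).
    One such step is the unit that hardware DWT architectures pipeline."""
    assert len(x) % 2 == 0
    s = [float(v) for v in x[0::2]]  # even samples -> approximation
    d = [float(v) for v in x[1::2]]  # odd samples  -> detail
    n = len(d)
    # predict: subtract the average of the two neighbouring even samples
    d = [d[i] - 0.5 * (s[i] + s[min(i + 1, n - 1)]) for i in range(n)]
    # update: add a quarter of the two neighbouring details to the evens
    s = [s[i] + 0.25 * (d[max(i - 1, 0)] + d[i]) for i in range(n)]
    return s, d

# a linear ramp is predicted exactly, so interior details vanish
# (only the last detail is nonzero, due to the boundary replication)
s, d = lifting_53([0, 1, 2, 3, 4, 5, 6, 7])
```

Because the update step consumes the predict step's outputs, a naive hardware mapping serializes them; storing recombined intermediate results, as the brief does, is what shortens that dependency chain.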

Journal ArticleDOI
TL;DR: A flagging and calibration pipeline intended for making quick look images from GMRT data that identifies and flags corrupted visibilities, computes calibration solutions and interpolates these onto the target source.
Abstract: We describe a flagging and calibration pipeline intended for making quick-look images from GMRT data. The package identifies and flags corrupted visibilities, computes calibration solutions and interpolates these onto the target source. These flagged, calibrated visibilities can be directly imaged using any standard imaging package. The pipeline is written in "C", with the most compute-intensive algorithms parallelized using OpenMP.

Patent
G. Glenn Henry1, Gerard M. Col1, Colin Eddy1, Rodney E. Hooker1, Terry Parks1 
06 Apr 2012
TL;DR: In this article, a microprocessor instruction translator translates a conditional load instruction into at least two microinstructions, and an out-of-order execution pipeline executes the instructions.
Abstract: A microprocessor instruction translator translates a conditional load instruction into at least two microinstructions. An out-of-order execution pipeline executes the microinstructions. To execute the first microinstruction, an execution unit receives source operands from the source registers of a register file and responsively generates a first result using the source operands. To execute the second microinstruction, an execution unit receives a previous value of the destination register and the first result and responsively reads data from a memory location specified by the first result and provides a second result that is the data if a condition is satisfied and that is the previous destination register value if not. The previous value of the destination register comprises a result produced by execution of a microinstruction that is the most recent in-order previous writer of the destination register with respect to the second microinstruction.
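The two-microinstruction semantics described above can be modeled abstractly; this toy interpreter is one reading of the abstract, not the patented hardware:

```python
def exec_conditional_load(regs, mem, dst, srcs, cond):
    """uop1: compute an address (the 'first result') from the source
    operands; uop2: read memory at that address and write either the
    loaded data (condition true) or the previous destination value
    (condition false). Hypothetical model of the translation described
    in the abstract, not the patented hardware."""
    addr = sum(regs[r] for r in srcs)    # uop1: first result (address)
    old = regs[dst]                      # previous destination value
    loaded = mem.get(addr, 0)            # uop2: memory read
    regs[dst] = loaded if cond else old  # condition selects the result

ra = {1: 8, 2: 4, 5: 99}
exec_conditional_load(ra, {12: 7}, dst=5, srcs=[1, 2], cond=True)
taken = ra[5]       # 7: condition held, loaded value written

rb = {1: 8, 2: 4, 5: 99}
exec_conditional_load(rb, {12: 7}, dst=5, srcs=[1, 2], cond=False)
not_taken = rb[5]   # 99: condition failed, previous value kept
```

Writing the old value back when the condition fails is what lets an out-of-order pipeline rename the destination unconditionally: the second microinstruction always produces a result.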

Proceedings ArticleDOI
01 Jan 2012
TL;DR: The current work presents a computational pipeline to simulate transcranial direct current stimulation from image based models of the head with SCIRun, supported by a complete suite of open source software tools.
Abstract: The current work presents a computational pipeline to simulate transcranial direct current stimulation from image based models of the head with SCIRun [15]. The pipeline contains all the steps necessary to carry out the simulations and is supported by a complete suite of open source software tools: image visualization, segmentation, mesh generation, tDCS electrode generation and efficient tDCS forward simulation.

Journal ArticleDOI
TL;DR: In this article, a fully autonomous data reduction pipeline has been developed for FRODOSpec, an optical fibre-fed integral field spectrograph currently in use at the Liverpool Telescope.
Abstract: A fully autonomous data reduction pipeline has been developed for FRODOSpec, an optical fibre-fed integral field spectrograph currently in use at the Liverpool Telescope. This paper details the process required for the reduction of data taken using an integral field spectrograph and presents an overview of the computational methods implemented to create the pipeline. Analysis of errors and possible future enhancements are also discussed.

Journal ArticleDOI
TL;DR: A novel field-programmable gate array (FPGA) based method for empirical mode decomposition (EMD) in real time is presented, which uses a circular queue to temporarily store values of maxima and minima; FPGA realization reveals its effectiveness in real-time applications.
Abstract: This paper presents a novel field-programmable gate array (FPGA) based method for empirical mode decomposition (EMD) in real time. Traditionally, EMD can be easily implemented and developed using a high-level computer language in a PC or DSP chip. However, it is difficult to implement EMD in a hardware environment. This paper develops EMD for real-time applications using a hardware-based FPGA. The proposed FPGA-based method calculates the upper and lower envelopes in EMD point by point by using a circular queue to temporarily store values of maxima and minima, from which the upper and lower envelopes in the EMD can be determined continuously. Additionally, an attempt is made to increase the efficiency of the computational process by cascading several identical modules as a serial pipeline structure in order to conduct an iterative loop for calculating the intrinsic mode functions in EMD. The fast process from the serial pipeline structure results in real-time computation with a sampling rate of up to 12.5 MHz and mitigation of the end effect. The proposed method is validated by the simulation results obtained by Quartus II and verified by FPGA (Altera Stratix III EP3SL150F1152C2) realization, revealing its effectiveness in real-time applications.
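The streaming extrema detection that feeds the envelope computation can be sketched with bounded circular queues (Python deques here); this models the data flow only, not the paper's FPGA implementation:

```python
from collections import deque
import math

def stream_extrema(samples, maxlen=64):
    """Point-by-point detection of local maxima and minima, pushed into
    bounded circular queues (deques here) from which the upper and lower
    envelopes would be interpolated. Models the data flow described
    above, not the paper's RTL."""
    maxima, minima = deque(maxlen=maxlen), deque(maxlen=maxlen)
    prev2 = prev1 = None
    for i, x in enumerate(samples):
        if prev2 is not None:
            if prev2 < prev1 > x:
                maxima.append((i - 1, prev1))   # local maximum at i-1
            if prev2 > prev1 < x:
                minima.append((i - 1, prev1))   # local minimum at i-1
        prev2, prev1 = prev1, x
    return list(maxima), list(minima)

sig = [math.sin(2 * math.pi * t / 20) for t in range(60)]
mx, mn = stream_extrema(sig)  # peaks at t = 5, 25, 45; troughs at 15, 35, 55
```

Because each sample is examined once against a two-sample window, the detection is purely streaming, which is what allows the hardware version to run point by point at a fixed sampling rate.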

Journal ArticleDOI
TL;DR: In this paper, the vortex shedding flow past a piggyback pipeline close to a plane boundary is investigated numerically; the piggyback pipeline comprises a large pipeline and a small pipeline.
Abstract: The vortex shedding flow past a piggyback pipeline close to a plane boundary is investigated numerically. The piggyback pipeline is comprised of a large pipeline and a small pipeline. The aim of th...

Journal ArticleDOI
TL;DR: The performance in terms of the processing speed of the architecture designed based on the proposed scheme is superior to those of the architectures designed using other existing schemes, and it has similar or lower hardware consumption.
Abstract: In this paper, a scheme for the design of a high-speed pipeline VLSI architecture for the computation of the 2-D discrete wavelet transform (DWT) is proposed. The main focus in the development of the architecture is on providing a high operating frequency and a small number of clock cycles along with an efficient hardware utilization by maximizing the inter-stage and intra-stage computational parallelism for the pipeline. The inter-stage parallelism is enhanced by optimally mapping the computational task of multi decomposition levels to the stages of the pipeline and synchronizing their operations. The intra-stage parallelism is enhanced by dividing the 2-D filtering operation into four subtasks that can be performed independently in parallel and minimizing the delay of the critical path of bit-wise adder networks for performing the filtering operation. To validate the proposed scheme, a circuit is designed, simulated, and implemented in FPGA for the 2-D DWT computation. The results of the implementation show that the circuit is capable of operating with a maximum clock frequency of 134 MHz and processing 1022 frames of size 512 × 512 per second with this operating frequency. It is shown that the performance in terms of the processing speed of the architecture designed based on the proposed scheme is superior to those of the architectures designed using other existing schemes, and it has similar or lower hardware consumption.

Journal ArticleDOI
TL;DR: Improved architectures for a fused floating-point add-subtract unit for digital signal processing applications such as fast Fourier transform (FFT) and discrete cosine transform (DCT) butterfly operations are presented.
Abstract: This paper presents improved architectures for a fused floating-point add-subtract unit. The fused floating-point add-subtract unit is useful for digital signal processing (DSP) applications such as fast Fourier transform (FFT) and discrete cosine transform (DCT) butterfly operations. To improve the performance of the fused floating-point add-subtract unit, a dual-path algorithm and pipelining are employed. The proposed designs are implemented for both single and double precision and synthesized with a 45-nm standard-cell library. The fused floating-point add-subtract unit saves 40% of the area and power consumption compared to a discrete floating-point add-subtract unit. The proposed dual-path design reduces the latency by 30% compared to the discrete design with area and power consumption between that of the discrete and fused designs. Based on a data flow analysis, the proposed fused dual-path floating-point add-subtract unit can be split into two pipeline stages. Since the latencies of two pipeline stages are fairly well balanced, the throughput is increased by 80% compared to the nonpipelined dual-path design.

Proceedings ArticleDOI
25 Feb 2012
TL;DR: An efficient and scalable code generation framework that can map general purpose streaming applications onto a multi-GPU system and is implemented as a back-end of the StreamIt programming language compiler.
Abstract: Graphics processing units leverage a large array of parallel processing cores to boost the performance of a specific streaming computation pattern frequently found in graphics applications. Unfortunately, while many other general-purpose applications do exhibit the required streaming behavior, they also possess unfavorable data layouts and poor computation-to-communication ratios that penalize any straightforward execution on the GPU. In this paper we describe an efficient and scalable code generation framework that can map general-purpose streaming applications onto a multi-GPU system. This framework spans the entire core and memory hierarchy exposed by the multi-GPU system. Several key features of our framework ensure the scalability required by complex streaming applications. First, we propose an efficient stream graph partitioning algorithm that partitions a complex application to achieve the best performance under a given shared-memory constraint. Next, the resulting partitions are mapped to multiple GPUs using an efficient architecture-driven strategy. The mapping balances the workload while accounting for the communication overhead. Finally, a highly effective pipeline execution scheme is employed to execute the partitions on the multi-GPU system. The framework has been implemented as a back-end of the StreamIt programming language compiler. Our comprehensive experiments show its scalability and significant performance speedup compared with a previous state-of-the-art solution.
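As a toy illustration of partitioning under a shared-memory constraint (a simple greedy heuristic for a linear filter chain, not the paper's algorithm):

```python
def partition_stream(filters, mem_limit):
    """Greedily pack consecutive stream filters into partitions.

    `filters` is a list of (name, memory_footprint) pairs; consecutive
    filters are packed into one partition until adding the next filter
    would exceed the shared-memory budget, at which point a new
    partition is opened.
    """
    partitions, current, used = [], [], 0
    for name, mem in filters:
        if current and used + mem > mem_limit:
            partitions.append(current)    # close the full partition
            current, used = [], 0
        current.append(name)
        used += mem
    if current:
        partitions.append(current)        # flush the last partition
    return partitions
```

With a budget of 6, filters of size 3, 3, 5, 2 fall into three partitions: the first two filters fit together, while each of the larger remaining filters starts a fresh partition. A real partitioner would also weigh the communication cost cut by each partition boundary, as the paper's algorithm does.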

Patent
27 Feb 2012
TL;DR: In this article, a computer-implemented method for optimizing a data pipeline system includes processing a configuration manifest to generate a framework of the pipeline system and a data flow logic package of the system.
Abstract: A computer-implemented method for optimizing a data pipeline system includes processing a data pipeline configuration manifest to generate a framework of the data pipeline system and a data flow logic package of the data pipeline system. The data pipeline configuration manifest includes an object-oriented metadata model of the data pipeline system. The computer-implemented method further includes monitoring performance of the data pipeline system during execution of the data flow logic package to obtain a performance metric for the data pipeline system, and modifying, with a processor, the framework of the data pipeline system based on the data pipeline configuration manifest and the performance metric.
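A minimal sketch of the manifest-driven idea, with a manifest layout and operation names of our own invention (the patent does not specify them): a metadata model of the pipeline is interpreted to generate an executable data-flow:

```python
def build_pipeline(manifest):
    """Generate an executable pipeline from a configuration manifest.

    The manifest's metadata model is a list of stage descriptors; each
    descriptor is resolved to a callable stage, and the generated
    pipeline applies the stages in order to each record. Because the
    framework is derived from the manifest, it can be regenerated when
    monitoring shows the current layout performs poorly.
    """
    ops = {"upper": str.upper, "strip": str.strip}  # illustrative op table
    stages = [ops[s["op"]] for s in manifest["stages"]]

    def run(record):
        for stage in stages:
            record = stage(record)
        return record

    return run
```

For example, a manifest declaring a `strip` stage followed by an `upper` stage yields a pipeline mapping `"  hi "` to `"HI"`.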

Patent
05 Jan 2012
TL;DR: In this paper, a configurable processor is coupled to the at least one hardware stage of the packet processing pipeline, which is configured to modify the field in the data structure to generate a modified data structure.
Abstract: In a network device, a plurality of ports is configured to receive and to transmit packets on a network. A packet processing pipeline includes a plurality of hardware stages, wherein at least one hardware stage is configured to output a data structure comprising a field extracted from a received packet based on a first packet processing operation performed on the packet or the data structure, wherein the data structure is associated with the packet. A configurable processor is coupled to the at least one hardware stage of the packet processing pipeline. The configurable processor is configured to modify the field in the data structure to generate a modified data structure and to pass the modified data structure to a subsequent hardware stage that is configured to perform a second packet processing operation on the data structure using the field modified by the configurable processor.
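The claimed structure can be sketched in software as fixed stages and a configurable hook that all operate on one shared data structure (stage names and fields here are illustrative, not from the patent):

```python
def run_pipeline(packet, stages):
    """Pass a shared per-packet data structure through pipeline stages."""
    ctx = {"packet": packet}
    for stage in stages:
        ctx = stage(ctx)
    return ctx

def extract_stage(ctx):
    # "Hardware" stage: extract a field from the packet into the
    # shared data structure.
    ctx["dst"] = ctx["packet"]["dst"]
    return ctx

def rewrite_stage(ctx):
    # Configurable-processor hook: modify the extracted field before
    # the next hardware stage sees it (here, an arbitrary rewrite).
    ctx["dst"] = ctx["dst"] + 100
    return ctx

def forward_stage(ctx):
    # Subsequent "hardware" stage: its operation consumes the field as
    # modified by the configurable processor.
    ctx["out_port"] = ctx["dst"] % 8
    return ctx
```

The key point the patent claims is the interleaving: a programmable element sits between fixed hardware stages and edits the in-flight data structure, rather than processing the packet before or after the whole pipeline.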

Proceedings ArticleDOI
25 Feb 2012
TL;DR: This paper proposes the concept of power balanced pipelines - i.e., processor pipelines in which different delays are assigned to different microarchitectural pipestages to reduce the power disparity between the stages while guaranteeing the same processor frequency/performance.
Abstract: Since the onset of pipelined processors, balancing the delay of the microarchitectural pipeline stages so that each stage has an equal delay has been a primary design objective, as it maximizes instruction throughput. Unfortunately, this causes significant energy inefficiency in processors, as each microarchitectural pipeline stage gets the same amount of time to complete, irrespective of its size or complexity. For power-optimized processors, the inefficiency manifests itself as a significant imbalance in the power consumption of the different microarchitectural pipestages. In this paper, rather than balancing processor pipelines for delay, we propose the concept of power balanced pipelines, i.e., processor pipelines in which different delays are assigned to different microarchitectural pipestages to reduce the power disparity between the stages while guaranteeing the same processor frequency/performance. A specific implementation of the concept uses cycle time stealing [19] to deliberately redistribute cycle time from low-power pipeline stages to power-hungry stages, relaxing their timing constraints and allowing them to operate at reduced voltages or use smaller, less leaky cells. We present several static and dynamic techniques for power balancing and demonstrate that balancing pipeline power rather than delay can reduce processor power by 46% with no loss in throughput for a full FabScalar processor over a power-optimized baseline. Benefits are comparable over a FabScalar baseline where static cycle time stealing is used to optimize the achieved frequency. Power savings increase at lower operating frequencies. To the best of our knowledge, this is the first work on microarchitecture-level power reduction that guarantees the same performance.
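A back-of-the-envelope model of the cycle-time-stealing idea (our simplification, not the paper's technique): allocate each stage a slice of a fixed clock period in proportion to its power draw, so power-hungry stages get relaxed timing while the total period, and hence the processor frequency, is unchanged:

```python
def steal_cycle_time(stages, total_period):
    """Redistribute a fixed clock period among pipeline stages.

    `stages` is a list of (name, power) pairs. Each stage receives a
    share of `total_period` proportional to its power, so the hungriest
    stages get the most slack (and could run at lower voltage), while
    the slices still sum to the original period.
    """
    total_power = sum(p for _, p in stages)
    return {name: total_period * p / total_power for name, p in stages}
```

With a 4 ns period and an execute stage drawing twice the power of fetch and decode, execute receives a 2 ns slice while the other two get 1 ns each; the period, and thus throughput, is preserved by construction.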