
Showing papers on "Pipeline (computing)" published in 2015


Proceedings ArticleDOI
13 Jul 2015
TL;DR: In this article, a preintegration theory is proposed to summarize hundreds of inertial measurements into a single relative motion constraint, and the measurements are integrated in a local frame, which eliminates the need to repeat the integration when the linearization point changes while leaving the opportunity for belated bias corrections.
Abstract: Recent results in monocular visual-inertial navigation (VIN) have shown that optimization-based approaches outperform filtering methods in terms of accuracy due to their capability to relinearize past states. However, the improvement comes at the cost of increased computational complexity. In this paper, we address this issue by preintegrating inertial measurements between selected keyframes. The preintegration allows us to accurately summarize hundreds of inertial measurements into a single relative motion constraint. Our first contribution is a preintegration theory that properly addresses the manifold structure of the rotation group and carefully deals with uncertainty propagation. The measurements are integrated in a local frame, which eliminates the need to repeat the integration when the linearization point changes while leaving the opportunity for belated bias corrections. The second contribution is to show that the preintegrated IMU model can be seamlessly integrated in a visual-inertial pipeline under the unifying framework of factor graphs. This enables the use of a structureless model for visual measurements, further accelerating the computation. The third contribution is an extensive evaluation of our monocular VIN pipeline: experimental results confirm that our system is very fast and demonstrates superior accuracy with respect to competitive state-of-the-art filtering and optimization algorithms, including off-the-shelf systems such as Google Tango.
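The rotation part of the preintegration idea is easy to state in code: gyroscope measurements are compounded on the rotation manifold via the SO(3) exponential map, yielding a single relative rotation between keyframes. Below is a minimal numpy sketch of that step only; the sample rate and measurements are illustrative, and the paper's bias handling and covariance propagation are omitted.

import numpy as np

def skew(w):
    # Map a 3-vector to its skew-symmetric matrix.
    return np.array([[0, -w[2], w[1]],
                     [w[2], 0, -w[0]],
                     [-w[1], w[0], 0]])

def so3_exp(w):
    # Rodrigues' formula: exponential map from so(3) to SO(3).
    theta = np.linalg.norm(w)
    if theta < 1e-9:
        return np.eye(3) + skew(w)
    K = skew(w / theta)
    return np.eye(3) + np.sin(theta) * K + (1 - np.cos(theta)) * K @ K

def preintegrate_rotation(gyro, dt):
    # Compound gyro measurements into one relative rotation constraint,
    # expressed in the local frame of the first keyframe.
    dR = np.eye(3)
    for w in gyro:
        dR = dR @ so3_exp(w * dt)
    return dR

# Illustrative: 200 gyro samples at 200 Hz, slow rotation about z.
gyro = np.tile([0.0, 0.0, 0.1], (200, 1))
print(preintegrate_rotation(gyro, 1.0 / 200))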

395 citations


Journal ArticleDOI
TL;DR: In this paper, the authors present a data reduction pipeline for flat-fielding spectropolarimetric data acquired with telecentric Fabry-Perot instruments and a new approach for accurate camera co-alignment for image restoration.
Abstract: The production of science-ready data from major solar telescopes requires expertise beyond that of the typical observer. This is a consequence of the increasing complexity of instruments and observing sequences, which require calibrations and corrections for instrumental and seeing effects that are not only difficult to measure, but are also coupled in ways that require careful analysis in the design of the correction procedures. Modern space-based telescopes have data-processing pipelines capable of routinely producing well-characterized data products. High resolution imaging spectropolarimeters at ground-based telescopes need similar data pipelines. We present new methods for flat-fielding spectropolarimetric data acquired with telecentric Fabry-Perot instruments and a new approach for accurate camera co-alignment for image restoration. We document a procedure that forms the basis of current state-of-the-art processing of data from the CRISP imaging spectropolarimeter at the Swedish 1 m Solar Telescope (SST). By collecting, implementing, and testing a suite of computer programs, we have defined a data reduction pipeline for this instrument. This pipeline, CRISPRED, streamlines the process of making science-ready data. It is implemented and operated in IDL, with time-consuming steps delegated to C. CRISPRED will also be the basis for the data pipeline of the forthcoming CHROMIS instrument.
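The abstract does not spell out the calibration algebra, but the baseline flat-fielding step such pipelines build on reduces, per pixel, to dark subtraction followed by gain normalization. A minimal sketch of that standard step (array names are illustrative; the paper's telecentric Fabry-Perot corrections go well beyond this):

import numpy as np

def flatfield_correct(raw, dark, flat):
    # Standard CCD calibration: remove the dark signal, then divide by
    # the normalized flat to equalize pixel-to-pixel gain variations.
    gain = flat - dark
    gain = gain / np.mean(gain)
    return (raw - dark) / gain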

260 citations


Journal ArticleDOI
TL;DR: GotCloud is presented, a pipeline for efficiently detecting and genotyping high-quality variants from large-scale sequencing data that automates sequence alignment, sample-level quality control, variant calling, filtering of likely artifacts using machine-learning techniques, and genotype refinement using haplotype information.
Abstract: The analysis of next-generation sequencing data is computationally and statistically challenging because of the massive volume of data and imperfect data quality. We present GotCloud, a pipeline for efficiently detecting and genotyping high-quality variants from large-scale sequencing data. GotCloud automates sequence alignment, sample-level quality control, variant calling, filtering of likely artifacts using machine-learning techniques, and genotype refinement using haplotype information. The pipeline can process thousands of samples in parallel and requires less computational resources than current alternatives. Experiments with whole-genome and exome-targeted sequence data generated by the 1000 Genomes Project show that the pipeline provides effective filtering against false positive variants and high power to detect true variants. Our pipeline has already contributed to variant detection and genotyping in several large-scale sequencing projects, including the 1000 Genomes Project and the NHLBI Exome Sequencing Project. We hope it will now prove useful to many medical sequencing studies.

257 citations


Patent
27 Aug 2015
TL;DR: In this article, a convolution engine configures the parallel processing pipeline to independently generate and process individual image tiles, and the pipeline then performs matrix multiplication operations between the image tile and the filter tile to generate data included in the corresponding output tile.
Abstract: In one embodiment of the present invention a convolution engine configures a parallel processing pipeline to perform multi-convolution operations. More specifically, the convolution engine configures the parallel processing pipeline to independently generate and process individual image tiles. In operation, for each image tile, the pipeline calculates source locations included in an input image batch. Notably, the source locations reflect the contribution of the image tile to an output tile of an output matrix—the result of the multi-convolution operation. Subsequently, the pipeline copies data from the source locations to the image tile. Similarly, the pipeline copies data from a filter stack to a filter tile. The pipeline then performs matrix multiplication operations between the image tile and the filter tile to generate data included in the corresponding output tile. To optimize both on-chip memory usage and execution time, the pipeline creates each image tile in on-chip memory as-needed.
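A tile-free version of the same lowering is the classic im2col transform: each receptive field of the input is gathered into a column so that one matrix multiplication produces all output channels. A small numpy sketch, purely as an illustration of what the patented tiled pipeline computes per image tile:

import numpy as np

def im2col_conv(image, filters):
    # image: (C, H, W); filters: (K, C, R, S). Valid convolution, stride 1.
    C, H, W = image.shape
    K, _, R, S = filters.shape
    OH, OW = H - R + 1, W - S + 1
    # Gather each receptive field into a column (the "image tile" data).
    cols = np.empty((C * R * S, OH * OW))
    for i in range(OH):
        for j in range(OW):
            cols[:, i * OW + j] = image[:, i:i + R, j:j + S].ravel()
    # One matrix multiplication produces all K output channels.
    out = filters.reshape(K, -1) @ cols
    return out.reshape(K, OH, OW)

img = np.random.randn(3, 8, 8)
flt = np.random.randn(4, 3, 3, 3)
print(im2col_conv(img, flt).shape)   # (4, 6, 6)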

150 citations


Journal ArticleDOI
TL;DR: In this article, the performance of pipelines subjected to permanent strike-slip fault movement is investigated by combining detailed numerical simulations and closed-form solutions, and the results show that axial tensile strains are in excellent agreement with results obtained from detailed finite element models that employ beam elements and distributed springs along the pipeline length.

144 citations


Posted Content
TL;DR: In this article, the design of a high-performance MIPS cryptography processor based on the triple data encryption standard is described, with the pipeline stages organized so that the pipeline can be clocked at a high frequency.
Abstract: The paper describes the design of a high-performance MIPS cryptography processor based on the triple data encryption standard. The pipeline stages are organized so that the pipeline can be clocked at a high frequency. The encryption and decryption blocks of the triple data encryption standard (T-DES) cryptosystem and the dependencies among them are explained in detail with the help of a block diagram. To increase the processor's functionality and performance, especially for security applications, we include three new 32-bit instructions: LKLW, LKUW and CRYPT. The design has been synthesized at a 40 nm process technology targeting a Xilinx Virtex-6 device. The overall MIPS crypto processor works at 209 MHz.

142 citations



Journal ArticleDOI
TL;DR: In this article, the authors compare the detection efficiency of the Kepler pipeline with the expectation from the set of simulated planets, and construct a sensitivity curve of signal recovery as a function of the signal-to-noise of the simulated transit signal train.
Abstract: The Kepler planet sample can only be used to reconstruct the underlying planet occurrence rate if the detection efficiency of the Kepler pipeline is known; here we present the results of a second experiment aimed at characterizing this detection efficiency. We inject simulated transiting planet signals into the pixel data of ~10,000 targets, spanning one year of observations, and process the pixels as normal. We compare the set of detections made by the pipeline with the expectation from the set of simulated planets, and construct a sensitivity curve of signal recovery as a function of the signal-to-noise of the simulated transit signal train. The sensitivity curve does not meet the hypothetical maximum detection efficiency; however, it is not as pessimistic as some of the published estimates of the detection efficiency. For the FGK stars in our sample, the sensitivity curve is well fit by a gamma function with the coefficients a = 4.35 and b = 1.05. We also find that the pipeline algorithms recover the depths and periods of the injected signals with very high fidelity, especially for periods longer than 10 days. We perform a simplified occurrence rate calculation using the measured detection efficiency compared to previous assumptions of the detection efficiency found in the literature to demonstrate the systematic error introduced into the resulting occurrence rates. The discrepancies in the calculated occurrence rates may go some way toward reconciling some of the inconsistencies found in the literature.
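Given the quoted fit, the sensitivity curve can be evaluated directly as a gamma CDF of the expected SNR. A short scipy sketch; the argument convention (CDF taken at the raw SNR, with no threshold offset) is an assumption, as the abstract does not state it:

from scipy.stats import gamma

# Sensitivity-curve fit reported for the FGK stars in the sample.
a, b = 4.35, 1.05

def detection_efficiency(snr):
    # Fraction of injected signals recovered at this expected SNR.
    return gamma.cdf(snr, a, scale=b)

for snr in (4.0, 7.1, 10.0, 15.0):
    print(snr, round(detection_efficiency(snr), 3))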

124 citations


Journal ArticleDOI
TL;DR: This work presents the first pipeline for real-time volumetric surface reconstruction and dense 6DoF camera tracking running purely on standard, off-the-shelf mobile phones, and qualitatively compares to a state of the art point-based mobile phone method.
Abstract: We present the first pipeline for real-time volumetric surface reconstruction and dense 6DoF camera tracking running purely on standard, off-the-shelf mobile phones. Using only the embedded RGB camera, our system allows users to scan objects of varying shape, size, and appearance in seconds, with real-time feedback during the capture process. Unlike existing state of the art methods, which produce only point-based 3D models on the phone, or require cloud-based processing, our hybrid GPU/CPU pipeline is unique in that it creates a connected 3D surface model directly on the device at 25Hz. In each frame, we perform dense 6DoF tracking, which continuously registers the RGB input to the incrementally built 3D model, minimizing a noise aware photoconsistency error metric. This is followed by efficient key-frame selection, and dense per-frame stereo matching. These depth maps are fused volumetrically using a method akin to KinectFusion, producing compelling surface models. For each frame, the implicit surface is extracted for live user feedback and pose estimation. We demonstrate scans of a variety of objects, and compare to a Kinect-based baseline, showing on average ~1.5 cm error. We qualitatively compare to a state of the art point-based mobile phone method, demonstrating an order of magnitude faster scanning times, and fully connected surface models.
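The volumetric fusion step ("akin to KinectFusion") is, at its core, a running weighted average of truncated signed distances per voxel. A minimal numpy sketch under that reading; the truncation band and weight cap are illustrative, not values from the paper:

import numpy as np

def fuse_depth(tsdf, weight, sdf_obs, trunc=0.03, max_weight=64):
    # tsdf, weight: per-voxel state arrays; sdf_obs: signed distance of
    # each voxel to the surface observed in the current depth map.
    d = np.clip(sdf_obs / trunc, -1.0, 1.0)   # truncate the distance
    mask = sdf_obs > -trunc                   # skip far-occluded voxels
    w_new = np.minimum(weight + 1, max_weight)
    tsdf = np.where(mask, (tsdf * weight + d) / w_new, tsdf)
    weight = np.where(mask, w_new, weight)
    return tsdf, weight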

110 citations


Journal ArticleDOI
TL;DR: A novel automatic fault detection system using infrared imaging, focussing on bearings of rotating machinery, able to distinguish between all eight different conditions with an accuracy of 88.25%.

105 citations


Journal ArticleDOI
TL;DR: A new pipeline, dubbed Samantha, is presented, that departs from the prevailing sequential paradigm and embraces instead a hierarchical approach, which has several advantages, like a provably lower computational complexity, which is necessary to achieve true scalability, and better error containment, leading to more stability and less drift.

Journal ArticleDOI
TL;DR: An improved auto-zero scheme that eliminates the gain error caused by the parasitic capacitance across the auto-zero switch is introduced, and a comparator-less pipeline ADC structure takes advantage of the characteristics of the ring amplifier to replace the sub-ADC in each pipeline stage.
Abstract: The ring amplifier is an energy efficient and high output swing alternative to an OTA for switched-capacitor circuits. However, the conventional ring amplifier requires external biases, which makes the ring amplifier less practical when we consider process, supply voltage, and temperature (PVT) variation. This paper presents a self-biased ring amplifier scheme that makes the ring amplifier more practical and power efficient while maintaining the benefits of efficient slew-based charging and an almost rail-to-rail output swing. We introduce an improved auto-zero scheme that eliminates the gain error caused by the parasitic capacitance across the auto-zero switch. Furthermore, a comparator-less pipeline ADC structure takes advantage of the characteristics of the ring amplifier to replace the sub-ADC in each pipeline stage. The prototype ADC has measured SNDR, SNR and SFDR of 56.6 dB (9.11 b), 57.5 dB and 64.7 dB, respectively, for a Nyquist frequency input sampled at 100 MS/s, and consumes 2.46 mW.
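The quoted resolution is the standard conversion from SNDR, ENOB = (SNDR - 1.76) / 6.02; a one-line check reproduces the reported 9.11 effective bits:

sndr_db = 56.6
enob = (sndr_db - 1.76) / 6.02
print(round(enob, 2))   # 9.11 effective bits, matching the reported figure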

Journal ArticleDOI
TL;DR: The multi-band template analysis (MBTA) pipeline as mentioned in this paper is a low-latency coincident analysis pipeline for the detection of gravitational waves (GWs) from compact binary coalescences.
Abstract: The multi-band template analysis (MBTA) pipeline is a low-latency coincident analysis pipeline for the detection of gravitational waves (GWs) from compact binary coalescences. MBTA runs with a low computational cost, and can identify candidate GW events online with a sub-minute latency. The low computational running cost of MBTA also makes it useful for data quality studies. Events detected by MBTA online can be used to alert astronomical partners for electromagnetic follow-up. We outline the current status of MBTA and give details of recent pipeline upgrades and validation tests that were performed in preparation for the first advanced detector observing period. The MBTA pipeline is ready for the outset of the advanced detector era and the exciting prospects it will bring.

Posted Content
Heng Li1
TL;DR: FermiKit as mentioned in this paper is a variant calling pipeline for Illumina data that de novo assembles short reads and then maps the assembly against a reference genome to call SNPs, short insertions/deletions (INDELs) and structural variations (SVs).
Abstract: Summary: FermiKit is a variant calling pipeline for Illumina data. It de novo assembles short reads and then maps the assembly against a reference genome to call SNPs, short insertions/deletions (INDELs) and structural variations (SVs). FermiKit takes about one day to assemble 30-fold human whole-genome data on a modern 16-core server with 85GB RAM at the peak, and calls variants in half an hour to an accuracy comparable to the current practice. FermiKit assembly is a reduced representation of raw data while retaining most of the original information. Availability and implementation: this https URL Contact: hengli@broadinstitute.org

Journal ArticleDOI
10 Jul 2015-PLOS ONE
TL;DR: An adaptive resampling framework for evaluating and optimizing preprocessing choices by optimizing data-driven metrics of task prediction and spatial reproducibility is outlined and validated, demonstrating that with pipeline optimization it is possible to obtain reliable results and brain-behaviour correlations in relatively small datasets.
Abstract: BOLD fMRI is sensitive to blood-oxygenation changes correlated with brain function; however, it is limited by relatively weak signal and significant noise confounds. Many preprocessing algorithms have been developed to control noise and improve signal detection in fMRI. Although the chosen set of preprocessing and analysis steps (the “pipeline”) significantly affects signal detection, pipelines are rarely quantitatively validated in the neuroimaging literature, due to complex preprocessing interactions. This paper outlines and validates an adaptive resampling framework for evaluating and optimizing preprocessing choices by optimizing data-driven metrics of task prediction and spatial reproducibility. Compared to standard “fixed” preprocessing pipelines, this optimization approach significantly improves independent validation measures of within-subject test-retest reliability, between-subject activation overlap, and behavioural prediction accuracy. We demonstrate that preprocessing choices function as implicit model regularizers, and that improvements due to pipeline optimization generalize across a range of simple to complex experimental tasks and analysis models. Results are shown for brief scanning sessions (<3 minutes each), demonstrating that with pipeline optimization, it is possible to obtain reliable results and brain-behaviour correlations in relatively small datasets.
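One of the two data-driven metrics being optimized, spatial reproducibility, is commonly estimated by correlating activation maps computed from two resampled splits of the data. A minimal sketch of that estimator (the map-extraction step is assumed to be given):

import numpy as np

def spatial_reproducibility(map_a, map_b):
    # Pearson correlation between voxel-wise activation maps computed
    # from two independent splits of the data.
    a = map_a.ravel() - map_a.mean()
    b = map_b.ravel() - map_b.mean()
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))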

Proceedings ArticleDOI
04 May 2015
TL;DR: This work uses an FPGA to design a deep learning accelerator that focuses on the implementation of the prediction process, data access optimization, and pipeline structure, and achieves promising results.
Abstract: Recently, machine learning has been widely used in applications and cloud services. As an emerging field of machine learning, deep learning shows excellent ability in solving complex learning problems. To give users a better experience, high performance implementations of deep learning applications are very important. As a common means of accelerating algorithms, FPGAs offer high performance, low power consumption, and small size, among other characteristics. We therefore use an FPGA to design a deep learning accelerator; the accelerator focuses on the implementation of the prediction process, data access optimization, and pipeline structure. Compared with a Core 2 CPU at 2.3 GHz, our accelerator achieves promising results.

Journal ArticleDOI
08 Sep 2015
TL;DR: A provably efficient scheduling algorithm, the Piper algorithm, is described, which integrates pipeline parallelism into a work-stealing scheduler, allowing pipeline and fork-join parallelism to be arbitrarily nested and automatically throttles the parallelism, precluding “runaway” pipelines.
Abstract: Pipeline parallelism organizes a parallel program as a linear sequence of stages. Each stage processes elements of a data stream, passing each processed data element to the next stage, and then taking on a new element before the subsequent stages have necessarily completed their processing. Pipeline parallelism is used especially in streaming applications that perform video, audio, and digital signal processing. Three out of 13 benchmarks in PARSEC, a popular software benchmark suite designed for shared-memory multiprocessors, can be expressed as pipeline parallelism. Whereas most concurrency platforms that support pipeline parallelism use a “construct-and-run” approach, this article investigates “on-the-fly” pipeline parallelism, where the structure of the pipeline emerges as the program executes rather than being specified a priori. On-the-fly pipeline parallelism allows the number of stages to vary from iteration to iteration and dependencies to be data dependent. We propose simple linguistics for specifying on-the-fly pipeline parallelism and describe a provably efficient scheduling algorithm, the Piper algorithm, which integrates pipeline parallelism into a work-stealing scheduler, allowing pipeline and fork-join parallelism to be arbitrarily nested. The Piper algorithm automatically throttles the parallelism, precluding “runaway” pipelines. Given a pipeline computation with T1 work and T∞ span (critical-path length), Piper executes the computation on P processors in TP ≤ T1/P+O(T∞+lg P) expected time. Piper also limits stack space, ensuring that it does not grow unboundedly with running time. We have incorporated on-the-fly pipeline parallelism into a Cilk-based work-stealing runtime system. Our prototype Cilk-P implementation exploits optimizations such as “lazy enabling” and “dependency folding.” We have ported the three PARSEC benchmarks that exhibit pipeline parallelism to run on Cilk-P. One of these, x264, cannot readily be executed by systems that support only construct-and-run pipeline parallelism. Benchmark results indicate that Cilk-P has low serial overhead and good scalability. On x264, for example, Cilk-P exhibits a speedup of 13.87 over its respective serial counterpart when running on 16 processors.
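The completion-time bound reads directly as a speedup guarantee. An illustrative evaluation, with work, span, and the constant hidden in the O-term chosen arbitrarily:

import math

T1, Tinf, P = 1e9, 1e6, 16                 # work, span, processors (illustrative)
bound = T1 / P + (Tinf + math.log2(P))     # hidden constant taken as 1
print(f"T_P <= {bound:.3g}, so speedup >= {T1 / bound:.1f}x on {P} processors")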

Journal ArticleDOI
TL;DR: FHAST (FPGA hardware accelerated sequence-matching tool) is presented, a drop-in replacement for Bowtie that uses a hardware design based on field programmable gate arrays (FPGAs) and masks memory latency by executing multiple concurrent hardware threads accessing memory simultaneously.
Abstract: While the sequencing capability of modern instruments continues to increase exponentially, the computational problem of mapping short sequenced reads to a reference genome still constitutes a bottleneck in the analysis pipeline. A variety of mapping tools (e.g., Bowtie, BWA) is available for general-purpose computer architectures. These tools can take many hours or even days to deliver mapping results, depending on the number of input reads, the size of the reference genome and the number of allowed mismatches or insertion/deletions, making the mapping problem an ideal candidate for hardware acceleration. In this paper, we present FHAST (FPGA hardware accelerated sequence-matching tool), a drop-in replacement for Bowtie that uses a hardware design based on field programmable gate arrays (FPGA). Our architecture masks memory latency by executing multiple concurrent hardware threads accessing memory simultaneously. FHAST is composed of multiple parallel engines to exploit the parallelism available on an FPGA. We have implemented and tested FHAST on the Convey HC-1 and later ported it to the Convey HC-2ex, taking advantage of the large memory bandwidth available to these systems and the shared memory image between hardware and software. A preliminary version of FHAST running on the Convey HC-1 achieved up to a 70x speedup compared to Bowtie (single-threaded). An improved version of FHAST running on the Convey HC-2ex FPGAs achieved up to a 12x speed gain compared to Bowtie running eight threads on an eight-core conventional architecture, while maintaining almost identical mapping accuracy. FHAST is a drop-in replacement for Bowtie, so it can be incorporated in any analysis pipeline that uses Bowtie (e.g., TopHat).

Journal ArticleDOI
TL;DR: A hierarchical pipeline for skull-stripping and segmentation of anatomical structures of interest from T1-weighted images of the human brain is proposed, constructed based on a two-level Bayesian parameter estimation algorithm called multi-atlas likelihood fusion.
Abstract: We propose a hierarchical pipeline for skull-stripping and segmentation of anatomical structures of interest from T1-weighted images of the human brain. The pipeline is constructed based on a two-level Bayesian parameter estimation algorithm called multi-atlas likelihood fusion (MALF). In MALF, estimation of the parameter of interest is performed via maximum a posteriori estimation using the expectation-maximization (EM) algorithm. The likelihoods of multiple atlases are fused in the E-step while the optimal estimator, a single maximizer of the fused likelihoods, is then obtained in the M-step. There are two stages in the proposed pipeline; first the input T1-weighted image is automatically skull-stripped via a fast MALF, then internal brain structures of interest are automatically extracted using a regular MALF. We assess the performance of each of the two modules in the pipeline based on two sets of images with markedly different anatomical and photometric contrasts: 3T MPRAGE scans of pediatric subjects with developmental disorders versus 1.5T SPGR scans of elderly subjects with dementia. Evaluation is performed quantitatively using the Dice overlap as well as qualitatively via visual inspections. As a result, we demonstrate subject-level differences in the performance of the proposed pipeline, which may be accounted for by age, diagnosis, or the imaging parameters (particularly the field strength). For the subcortical and ventricular structures of the two datasets, the hierarchical pipeline is capable of producing automated segmentations with Dice overlaps ranging from 0.8 to 0.964 when compared with the gold standard. Comparisons with other representative segmentation algorithms are presented, relative to which the proposed hierarchical pipeline demonstrates comparable or superior accuracy.

Journal ArticleDOI
TL;DR: A novel 128/256/512/1024/1536/2048-point single-path delay feedback (SDF) pipeline FFT processor for long-term evolution and mobile worldwide interoperability for microwave access systems is presented, along with a hardware-sharing mechanism that reduces the memory space requirements of the proposed 1536-point FFT computation scheme.
Abstract: Fast Fourier transform (FFT) is widely used in digital signal processing and telecommunications, particularly in orthogonal frequency division multiplexing systems, to overcome the problems associated with orthogonal subcarriers. This paper presents a novel 128/256/512/1024/1536/2048-point single-path delay feedback (SDF) pipeline FFT processor for long-term evolution and mobile worldwide interoperability for microwave access systems. The proposed design employs a low-cost computation scheme to enable 1536-point FFT, which significantly reduces hardware costs as well as power consumption. In conjunction with the aforementioned 1536-point FFT computation scheme, the proposed design includes an efficient three-stage SDF pipeline architecture on which to implement a radix-3 FFT. The new radix-3 SDF pipeline FFT processor simplifies its data flow and is easy to control, and the complexity of the resulting hardware is lower than that of existing structures. This paper also formulates a hardware-sharing mechanism to reduce the memory space requirements of the proposed 1536-point FFT computation scheme. The proposed design was implemented using 90 nm CMOS technology. Postlayout simulation results revealed a die area of approximately 1.44 × 1.44 mm² with power consumption of only 9.3 mW at 40 MHz.
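The 1536-point mode rests on the factorization 1536 = 3 × 512: a radix-3 stage feeding power-of-two FFTs. A numpy sketch of that decimation-in-time split, checked against a direct FFT (this verifies the arithmetic decomposition only, not the SDF hardware architecture):

import numpy as np

def fft_radix3_split(x):
    # One radix-3 DIT stage: N = 3M, three M-point sub-FFTs plus twiddles.
    N = len(x)
    M = N // 3
    F = [np.fft.fft(x[r::3]) for r in range(3)]
    k = np.arange(N)
    W = np.exp(-2j * np.pi * k / N)
    return sum(W**r * F[r][k % M] for r in range(3))

x = np.random.randn(1536) + 1j * np.random.randn(1536)
print(np.allclose(fft_radix3_split(x), np.fft.fft(x)))  # True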

Journal ArticleDOI
TL;DR: The redesigned system has significantly improved the deployment of the Pipeline stent by enabling the operator to resheath the device, and has the potential to continue revolutionizing the endovascular approach for intracranial aneurysms.
Abstract: Background Flow diverter stents (FDS) have been described as a breakthrough in the treatment of intracranial aneurysms. Of the various flow diverter models, the Pipeline device has been the main approved and used device, with established and good long-term results. Objective To present the first series of patients treated with its new version, the Pipeline Flex device. This has kept the same device design and configuration but redesigned and completely modified the delivery system. Methods In this technical report, we include 10 consecutive patients harboring 12 saccular aneurysms of the anterior circulation. We report the main changes to the system, immediate results, and technical nuances with illustrative cases. Results We implanted 12 devices, including 11 Pipeline Flex and one Pipeline device. We used the old version in one case that required a second layer with a short length not available in the Pipeline Flex size range. All attempts at treatment were successful and no device was discarded or removed. Resheathing was required or used in half of the cases with good or excellent performance, except in one case that presented with multiple proximal loops and tight curves. We had two transitory events without ischemic lesions on MRI that recovered 1 and 4 h after onset. All patients were discharged home asymptomatic. Conclusions Pipeline Flex represents a major advance in FDS technology. The redesigned system has significantly improved the deployment of the Pipeline stent by enabling the operator to resheath the device. It has the potential to continue revolutionizing the endovascular approach for intracranial aneurysms.

Journal ArticleDOI
TL;DR: This paper models the problem of mitigating water hammer during valve closure as an optimal boundary control problem involving a nonlinear hyperbolic PDE system that describes the fluid flow along the pipeline.
Abstract: When fluid flow in a pipeline is suddenly halted, a pressure surge or wave is created within the pipeline. This phenomenon, called water hammer, can cause major damage to pipelines, including pipeline ruptures. In this paper, we model the problem of mitigating water hammer during valve closure by an optimal boundary control problem involving a nonlinear hyperbolic PDE system that describes the fluid flow along the pipeline. The control variable in this system represents the valve boundary actuation implemented at the pipeline terminus. To solve the boundary control problem, we first use the method of lines to obtain a finite-dimensional ODE model based on the original PDE system. Then, for the boundary control design, we apply the control parameterization method to obtain an approximate optimal parameter selection problem that can be solved using nonlinear optimization techniques such as Sequential Quadratic Programming (SQP). We conclude the paper with simulation results demonstrating the capability of optimal boundary control to significantly reduce flow fluctuation.
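A compressed sketch of the method-of-lines step described above, using the classical water hammer equations for head H and flow Q: spatial derivatives are replaced by finite differences so the PDE becomes an ODE system that a standard integrator can advance. All physical constants, the grid, and the valve closure schedule below are invented for illustration, not taken from the paper.

import numpy as np
from scipy.integrate import solve_ivp

# Illustrative pipeline: wave speed a, gravity g, area A, diameter D,
# Darcy friction factor f, length L, n spatial grid nodes.
a, g, A, D, f, L, n = 1200.0, 9.81, 0.05, 0.25, 0.02, 1000.0, 51
dx = L / (n - 1)

def rhs(t, y):
    H, Q = y[:n], y[n:]
    dHdx = np.gradient(H, dx)
    dQdx = np.gradient(Q, dx)
    dH = -(a**2 / (g * A)) * dQdx
    dQ = -g * A * dHdx - f * Q * np.abs(Q) / (2 * D * A)
    dH[0] = 0.0                              # fixed reservoir head upstream
    Q_set = 0.01 * max(0.0, 1.0 - t)         # valve closes linearly over 1 s
    dQ[-1] = (Q_set - Q[-1]) / 0.01          # relax flow toward the valve law
    return np.concatenate([dH, dQ])

y0 = np.concatenate([np.full(n, 100.0), np.full(n, 0.01)])
sol = solve_ivp(rhs, (0.0, 2.0), y0, max_step=1e-3)
print(sol.y[:n, -1].max() - 100.0)   # head deviation from steady state at t = 2 s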

Proceedings ArticleDOI
09 Mar 2015
TL;DR: Predictive instruction-based dynamic clock adjustment is introduced as a technique to trim dynamic timing margins in pipelined microprocessors, exploiting the different timing requirements of individual instructions during the dynamically varying program execution flow without the need for complex circuit-level measures.
Abstract: Static timing analysis provides the basis for setting the clock period of a microprocessor core, based on its worst-case critical path. However, depending on the design, this critical path is not always excited and therefore dynamic timing margins exist that can theoretically be exploited for the benefit of better speed or lower power consumption (through voltage scaling). This paper introduces predictive instruction-based dynamic clock adjustment as a technique to trim dynamic timing margins in pipelined microprocessors. To this end, we exploit the different timing requirements for individual instructions during the dynamically varying program execution flow without the need for complex circuit-level measures to detect and correct timing violations. We provide a design flow to extract the dynamic timing information for the design using post-layout dynamic timing analysis and we integrate the results into a custom cycle-accurate simulator. This simulator allows annotation of individual instructions with their impact on timing (in each pipeline stage) and rapidly derives the overall code execution time for complex benchmarks. The design methodology is illustrated at the microarchitecture level, demonstrating the performance and power gains possible on a 6-stage OpenRISC in-order general purpose processor core in a 28nm CMOS technology. We show that employing instruction-dependent dynamic clock adjustment leads on average to an increase in operating speed by 38% or to a reduction in power consumption by 24%, compared to traditional synchronous clocking, which at all times has to respect the worst-case timing identified through static timing analysis.
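A toy model of the accounting performed by such a cycle-accurate simulator: annotate each instruction class with the worst per-stage delay it can excite, then clock each cycle at the slowest delay among the instructions currently in flight. The delay numbers and instruction mix are invented for illustration.

# Per-class worst-case stage delay in ns (illustrative numbers).
DELAY = {"alu": 0.8, "mul": 1.25, "load": 1.1, "branch": 0.9}
STAGES = 6

def execution_time(program):
    # Sliding window of the instructions occupying the pipeline stages;
    # each cycle is clocked at the slowest delay among them.
    total, window = 0.0, []
    for op in program + ["alu"] * (STAGES - 1):   # drain the pipeline
        window = (window + [op])[-STAGES:]
        total += max(DELAY[o] for o in window)
    return total

prog = (["alu"] * 20 + ["load", "mul"]) * 50
static_clock = max(DELAY.values()) * (len(prog) + STAGES - 1)
print(execution_time(prog) / static_clock)   # < 1: dynamic margin reclaimed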

Journal ArticleDOI
TL;DR: This work proposes a dynamic multi-frame processing schedule that efficiently utilizes layered LDPC decoding with minimum pipeline stages, together with efficient comparison techniques for both column- and row-layered schedules and rejection-based high-speed circuits that compute the two minimum values from multiple inputs required for row-layered processing of the hardware-friendly min-sum decoding algorithm.
Abstract: This paper presents the architecture of a block-level-parallel layered decoder for irregular LDPC codes. It can be reconfigured to support the various block lengths and code rates of the IEEE 802.11n (WiFi) wireless-communication standard. We have proposed efficient comparison techniques for both column- and row-layered schedules and rejection-based high-speed circuits to compute the two minimum values from multiple inputs required for row-layered processing of the hardware-friendly min-sum decoding algorithm. The results show good speed with lower area as compared to state-of-the-art circuits. Additionally, this work proposes a dynamic multi-frame processing schedule which efficiently utilizes layered LDPC decoding with minimum pipeline stages. The suggested LDPC-decoder architecture has been synthesized and post-layout simulated in a 90 nm CMOS process. This decoder occupies 5.19 mm² of area and supports multiple code rates (1/2, 2/3, 3/4 & 5/6) as well as block lengths of 648, 1296 & 1944. At a clock frequency of 336 MHz, the proposed LDPC decoder achieves a throughput of 5.13 Gbps and an energy efficiency of 0.01 nJ/bit/iteration, better than similar state-of-the-art works.
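The rejection-based circuit's job, per check node, is to find the smallest and second-smallest input magnitude (and which edge holds the minimum). A one-pass software reference model of that comparison:

def two_minima(values):
    # One pass, as a tree of comparators would compute it in hardware:
    # track the smallest (m1) and second-smallest (m2) magnitudes.
    m1 = m2 = float("inf")
    idx = -1
    for i, v in enumerate(values):
        if v < m1:
            m1, m2, idx = v, m1, i
        elif v < m2:
            m2 = v
    return m1, m2, idx   # idx marks the edge that receives m2 instead of m1

print(two_minima([3.2, 0.7, 1.9, 0.4, 2.5]))   # (0.4, 0.7, 3)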

Proceedings ArticleDOI
09 Nov 2015
TL;DR: This paper presents a framework that implements an in-memory distributed version of the GATK pipeline using Apache Spark, which reduces execution time by keeping data active in memory between the map and reduce steps and includes a dynamic load balancing algorithm that better utilizes system performance using runtime statistics of the active workload.
Abstract: Fast progress in next generation sequencing has dramatically increased the throughput of DNA sequencing, resulting in the availability of large DNA data sets ready for analysis. However, post-sequencing DNA analysis has become the bottleneck in using these data sets, as it requires powerful and scalable tools to perform the needed analysis. A typical analysis pipeline consists of a number of steps, not all of which can readily scale on a distributed computing infrastructure. Recently, tools like Halvade, a Hadoop MapReduce solution, and Churchill, an HPC cluster-based solution, addressed this issue of scalability in the GATK DNA analysis pipeline. In this paper, we present a framework that implements an in-memory distributed version of the GATK pipeline using Apache Spark. Our framework reduces execution time by keeping data active in memory between the map and reduce steps. In addition, it has a dynamic load balancing algorithm that better utilizes system performance using runtime statistics of the active workload. Experiments on a 4 node cluster with 64 virtual cores show that this approach is 63% faster than a Hadoop MapReduce based solution.
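A minimal PySpark sketch of the pattern described: cache the map-stage output in executor memory so the reduce stage reads it without an intermediate round-trip to disk. The stage functions and input path are placeholders, not the framework's actual API.

from pyspark import SparkContext

sc = SparkContext(appName="dna-pipeline-sketch")

def align(read):                 # hypothetical stand-in for read alignment
    return ("chr1", read)

def call_variants(kv):           # hypothetical stand-in for variant calling
    region, reads = kv
    return (region, len(list(reads)))

reads = sc.textFile("hdfs:///data/sample.fastq")   # illustrative path
aligned = reads.map(align).cache()     # keep map output in memory between stages
variants = aligned.groupByKey().map(call_variants).collect()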

Journal ArticleDOI
TL;DR: In this paper, a magnetic anomaly forward modeling method is proposed, and a series of calculations is performed relating the model parameters of the pipeline, the geomagnetic field, and the measuring trace to the total magnetic anomaly (TMA), to analyze the influence of each factor on the detected magnetic anomalies.

Proceedings ArticleDOI
28 Oct 2015
TL;DR: In this article, the authors derived a reduced control system model for the dynamics of compressible gas flow through a pipeline subject to distributed time-varying injections, withdrawals, and control actions of compressors.
Abstract: We derive a reduced control system model for the dynamics of compressible gas flow through a pipeline subject to distributed time-varying injections, withdrawals, and control actions of compressors. The gas dynamics PDEs are simplified using lumped elements to a nonlinear ODE system with matrix coefficients. We verify that low-order integration of this ODE system with adaptive time-stepping is computationally consistent with solution of the PDE system using a split-step characteristic scheme on a regular space-time grid for a realistic pipeline model. Furthermore, the reduced model is tractable for use as the dynamic constraints of the optimal control problem of minimizing compression costs given transient withdrawals and gas pressure constraints. We discretize this problem as a finite nonlinear program using a pseudospectral collocation scheme, which we solve to obtain a polynomial approximation of the optimal transient compression controls. The method is applied to an example involving the Williams-Transco pipeline.

Journal ArticleDOI
TL;DR: A hardware scheduler architecture integrated into the CPU structure is presented; it uses resource remapping techniques for the pipeline registers and the CPU working registers, together with a method for assigning interrupts to tasks that ensures efficient operation in the context of real-time control.
Abstract: Task switching, synchronization, and communication between processes are major problems for any real-time operating system. Software implementation of the specific mechanisms may lead to significant delays that can affect deadline requirements for some applications. This paper presents a hardware scheduler architecture integrated into the CPU structure that uses resource remapping techniques for the pipeline registers and for the CPU working registers. We present an original implementation of the hardware structure used for static and dynamic scheduling of tasks, unitary management of events, access to architecture-shared resources, and event generation, along with a method for assigning interrupts to tasks that ensures efficient operation in the context of real-time control. One assembler instruction is used for simultaneous task synchronization with multiple event sources. This architecture allows a task switching time of one clock cycle (with a worst case of three clock cycles for special instructions used for external memory accesses) and a response time of only 1.5 clock cycles for events. Some mechanisms for improving program execution speed are also taken into consideration.

Journal ArticleDOI
TL;DR: PeriSCOPE is described, which automatically optimizes a data-parallel program's procedural code in the context of data flow that is reconstructed from the program's pipeline topology, and leverages symbolic execution to enlarge the scope of such optimizations by eliminating dead code.
Abstract: To minimize the amount of data-shuffling I/O that occurs between the pipeline stages of a distributed data-parallel program, its procedural code must be optimized with full awareness of the pipeline that it executes in. Unfortunately, neither pipeline optimizers nor traditional compilers examine both the pipeline and procedural code of a data-parallel program, so programmers must either hand-optimize their program across pipeline stages or live with poor performance. To resolve this tension between performance and programmability, this paper describes PeriSCOPE, which automatically optimizes a data-parallel program's procedural code in the context of data flow that is reconstructed from the program's pipeline topology. Such optimizations eliminate unnecessary code and data, perform early data filtering, and calculate small derived values (e.g., predicates) earlier in the pipeline, so that less data (sometimes much less data) is transferred between pipeline stages. PeriSCOPE further leverages symbolic execution to enlarge the scope of such optimizations by eliminating dead code. We describe how PeriSCOPE is implemented and evaluate its effectiveness on real production jobs.
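The "early data filtering" rewrite is the easiest of these to picture: a predicate that runs after the shuffle is hoisted into the producer stage, so filtered-out rows never cross the stage boundary. A self-contained toy illustration (the shuffle stand-in and row schema are invented, not PeriSCOPE's actual representation):

from collections import namedtuple

Row = namedtuple("Row", "key score")
rows = [Row(i % 4, i / 10) for i in range(10)]

def shuffle(rs):
    # Stand-in for the data-shuffling I/O between pipeline stages.
    return sorted(rs, key=lambda r: r.key)

# Before: the predicate runs downstream, so all 10 rows are shuffled.
naive = [r for r in shuffle(rows) if r.score > 0.5]

# After a PeriSCOPE-style rewrite: early filtering, only 4 rows shuffled.
optimized = shuffle([r for r in rows if r.score > 0.5])
assert sorted(naive) == sorted(optimized)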

Journal ArticleDOI
TL;DR: This article shows that a spatial accelerator using triggered instructions and latency-insensitive channels can achieve 8 × greater area-normalized performance than a traditional general-purpose processor.
Abstract: There has been recent interest in exploring the acceleration of nonvectorizable workloads with spatially programmed architectures that are designed to efficiently exploit pipeline parallelism. Such an architecture faces two main problems: how to efficiently control each processing element (PE) in the system, and how to facilitate inter-PE communication without the overheads of traditional shared-memory coherent memory. In this article, we explore solving these problems using triggered instructions and latency-insensitive channels. Triggered instructions completely eliminate the program counter (PC) and allow programs to transition concisely between states without explicit branch instructions. Latency-insensitive channels allow efficient communication of inter-PE control information while simultaneously enabling flexible code placement and improving tolerance for variable events such as cache accesses. Together, these approaches provide a unified mechanism to avoid overserialized execution, essentially achieving the effect of techniques such as dynamic instruction reordering and multithreading.Our analysis shows that a spatial accelerator using triggered instructions and latency-insensitive channels can achieve 8 × greater area-normalized performance than a traditional general-purpose processor. Further analysis shows that triggered control reduces the number of static and dynamic instructions in the critical paths by 62p and 64p, respectively, over a PC-style baseline, increasing the performance of the spatial programming approach by 2.0 ×.