
Showing papers on "Pipeline (computing)" published in 2016


Journal ArticleDOI
18 Jun 2016
TL;DR: This work explores an in-situ processing approach, where memristor crossbar arrays not only store input weights, but are also used to perform dot-product operations in an analog manner.
Abstract: A number of recent efforts have attempted to design accelerators for popular machine learning algorithms, such as those involving convolutional and deep neural networks (CNNs and DNNs). These algorithms typically involve a large number of multiply-accumulate (dot-product) operations. A recent project, DaDianNao, adopts a near data processing approach, where a specialized neural functional unit performs all the digital arithmetic operations and receives input weights from adjacent eDRAM banks. This work explores an in-situ processing approach, where memristor crossbar arrays not only store input weights, but are also used to perform dot-product operations in an analog manner. While the use of crossbar memory as an analog dot-product engine is well known, no prior work has designed or characterized a full-fledged accelerator based on crossbars. In particular, our work makes the following contributions: (i) We design a pipelined architecture, with some crossbars dedicated for each neural network layer, and eDRAM buffers that aggregate data between pipeline stages. (ii) We define new data encoding techniques that are amenable to analog computations and that can reduce the high overheads of analog-to-digital conversion (ADC). (iii) We define the many supporting digital components required in an analog CNN accelerator and carry out a design space exploration to identify the best balance of memristor storage/compute, ADCs, and eDRAM storage on a chip. On a suite of CNN and DNN workloads, the proposed ISAAC architecture yields improvements of 14.8×, 5.5×, and 7.5× in throughput, energy, and computational density (respectively), relative to the state-of-the-art DaDianNao architecture.
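
To make the in-situ dot-product idea concrete, here is an illustrative numerical sketch (not ISAAC's circuit model; the matrix sizes, bit widths, and ADC resolution are hypothetical) of a crossbar whose column currents sum conductance-weighted inputs, with activations streamed one bit per cycle and each per-cycle analog sum digitized by an ADC before a digital shift-and-add, loosely mirroring the bit-serial encoding described above.

```python
import numpy as np

def crossbar_dot(weights, inputs, adc_bits=8):
    """Emulate an analog crossbar dot product with bit-serial input streaming.

    weights : (rows, cols) array of non-negative conductances
    inputs  : (rows,) array of unsigned integer activations
    """
    n_bits = int(np.max(inputs)).bit_length() or 1
    acc = np.zeros(weights.shape[1])
    for b in range(n_bits):
        bit_plane = (inputs >> b) & 1                # one voltage level per row
        column_current = bit_plane @ weights         # analog summation per column
        scale = (2 ** adc_bits - 1) / max(column_current.max(), 1e-12)
        digitized = np.round(column_current * scale) / scale   # ADC quantization
        acc += digitized * (1 << b)                  # shift-and-add in digital logic
    return acc

rng = np.random.default_rng(0)
W = rng.uniform(0, 1, (128, 8))          # hypothetical conductance matrix
x = rng.integers(0, 2 ** 16, 128)        # hypothetical 16-bit activations
# with a high-resolution ADC the emulated crossbar matches the exact product
print(np.allclose(crossbar_dot(W, x, adc_bits=16), x @ W, rtol=1e-3))
```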

1,558 citations


Book ChapterDOI
08 Oct 2016
TL;DR: This work introduces a novel Deep Network architecture that implements the full feature point handling pipeline, that is, detection, orientation estimation, and feature description, and shows how to learn to do all three in a unified manner while preserving end-to-end differentiability.
Abstract: We introduce a novel Deep Network architecture that implements the full feature point handling pipeline, that is, detection, orientation estimation, and feature description. While previous works have successfully tackled each one of these problems individually, we show how to learn to do all three in a unified manner while preserving end-to-end differentiability. We then demonstrate that our Deep pipeline outperforms state-of-the-art methods on a number of benchmark datasets, without the need of retraining.

878 citations


Posted Content
TL;DR: In this article, a novel deep network architecture is introduced that implements the full feature point handling pipeline, that is, detection, orientation estimation, and feature description, in a unified manner while preserving end-to-end differentiability.
Abstract: We introduce a novel Deep Network architecture that implements the full feature point handling pipeline, that is, detection, orientation estimation, and feature description. While previous works have successfully tackled each one of these problems individually, we show how to learn to do all three in a unified manner while preserving end-to-end differentiability. We then demonstrate that our Deep pipeline outperforms state-of-the-art methods on a number of benchmark datasets, without the need of retraining.

325 citations


Posted Content
TL;DR: An efficient, scalable feature extraction algorithm for time series, which filters the available features in an early stage of the machine learning pipeline with respect to their significance for the classification or regression task, while controlling the expected percentage of selected but irrelevant features.
Abstract: The all-relevant problem of feature selection is the identification of all strongly and weakly relevant attributes. This problem is especially hard to solve for time series classification and regression in industrial applications such as predictive maintenance or production line optimization, for which each label or regression target is associated with several time series and meta-information simultaneously. Here, we are proposing an efficient, scalable feature extraction algorithm for time series, which filters the available features in an early stage of the machine learning pipeline with respect to their significance for the classification or regression task, while controlling the expected percentage of selected but irrelevant features. The proposed algorithm combines established feature extraction methods with a feature importance filter. It has a low computational complexity, allows to start on a problem with only limited domain knowledge available, can be trivially parallelized, is highly scalable, and is based on well-studied non-parametric hypothesis tests. We benchmark our proposed algorithm on all binary classification problems of the UCR time series classification archive as well as time series from a production line optimization project and simulated stochastic processes with underlying qualitative change of dynamics.
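
As a rough illustration of the filtering idea described above, the sketch below extracts a few toy features per labelled time series, tests each feature's association with the binary label, and keeps only features passing a Benjamini-Hochberg threshold. The feature set and the Mann-Whitney U test are illustrative stand-ins and are not claimed to match the authors' exact choices.

```python
import numpy as np
from scipy.stats import mannwhitneyu

def extract_features(series):
    # a few toy features per series (the real algorithm computes many more)
    return {"mean": series.mean(), "std": series.std(),
            "abs_energy": float((series ** 2).sum()), "maximum": series.max()}

def select_relevant(feature_table, labels, fdr=0.05):
    """Keep features whose distributions differ between the two classes,
    controlling the expected fraction of irrelevant selections (BH procedure)."""
    names = list(feature_table)
    pvals = []
    for name in names:
        values = np.asarray(feature_table[name])
        _, p = mannwhitneyu(values[labels == 0], values[labels == 1],
                            alternative="two-sided")
        pvals.append(p)
    order = np.argsort(pvals)
    m = len(pvals)
    keep = set()
    for rank, idx in enumerate(order, start=1):
        if pvals[idx] <= fdr * rank / m:
            keep.update(order[:rank].tolist())    # BH: accept all up to this rank
    return [names[i] for i in sorted(keep)]

# toy data: class-1 series have a shifted mean
rng = np.random.default_rng(1)
series_list = [rng.normal(loc=(i % 2) * 0.8, size=200) for i in range(60)]
labels = np.array([i % 2 for i in range(60)])
feats = [extract_features(s) for s in series_list]
table = {name: [f[name] for f in feats] for name in feats[0]}
print("selected features:", select_relevant(table, labels))
```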

227 citations


Journal ArticleDOI
TL;DR: Wang et al., as discussed by the authors, presented a risk-based accident model to conduct quantitative risk analysis (QRA) for leakage failure of submarine pipelines, which can provide a more case-specific and realistic consequence analysis compared with the bow-tie method.

205 citations


Proceedings ArticleDOI
13 Aug 2016
TL;DR: A highly accurate SMART-based analysis pipeline that can correctly predict the necessity of a disk replacement even 10-15 days in advance and uses statistical techniques to automatically detect which SMART parameters correlate with disk replacement.
Abstract: Disks are among the most frequently failing components in today's IT environments. Despite a set of defense mechanisms such as RAID, the availability and reliability of the system are still often impacted severely. In this paper, we present a highly accurate SMART-based analysis pipeline that can correctly predict the necessity of a disk replacement even 10-15 days in advance. Our method has been built and evaluated on more than 30000 disks from two major manufacturers, monitored over 17 months. Our approach employs statistical techniques to automatically detect which SMART parameters correlate with disk replacement and uses them to predict the replacement of a disk with even 98% accuracy.
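
A hedged sketch of the two stages named above follows: first automatically pick SMART attributes whose values differ between soon-to-be-replaced and healthy disks, then train a classifier on those attributes. The simulated data, the Kolmogorov-Smirnov test, and the random-forest classifier are illustrative choices, not necessarily those of the paper.

```python
import numpy as np
from scipy.stats import ks_2samp
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(7)
names = ["realloc_sectors", "pending_sectors", "seek_error_rate", "temperature"]
n = 4000
y = (rng.uniform(size=n) < 0.05).astype(int)   # ~5% of disks are later replaced
X = rng.normal(size=(n, len(names)))
X[y == 1, 0] += 2.0                            # failing disks reallocate sectors...
X[y == 1, 1] += 1.5                            # ...and accumulate pending sectors

# stage 1: detect which SMART attributes correlate with replacement
useful = [j for j in range(len(names))
          if ks_2samp(X[y == 0, j], X[y == 1, j]).pvalue < 0.01]
print("selected attributes:", [names[j] for j in useful])

# stage 2: predict replacement from the selected attributes only
X_tr, X_te, y_tr, y_te = train_test_split(X[:, useful], y, test_size=0.3,
                                          stratify=y, random_state=0)
clf = RandomForestClassifier(n_estimators=200, class_weight="balanced",
                             random_state=0).fit(X_tr, y_tr)
print("held-out accuracy:", clf.score(X_te, y_te))
```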

134 citations


Patent
05 Jan 2016
TL;DR: A system, method and apparatus for executing a sequence analysis pipeline on genetic sequence data includes a structured ASIC formed of a set of hardwired digital logic circuits that are interconnected by physical electrical interconnects.
Abstract: A system, method and apparatus for executing a sequence analysis pipeline on genetic sequence data includes a structured ASIC formed of a set of hardwired digital logic circuits that are interconnected by physical electrical interconnects. One of the physical electrical interconnects forms an input to the structured ASIC connected with an electronic data source for receiving reads of genomic data. The hardwired digital logic circuits are arranged as a set of processing engines, each processing engine being formed of a subset of the hardwired digital logic circuits to perform one or more steps in the sequence analysis pipeline on the reads of genomic data. Each subset of the hardwired digital logic circuits is formed in a wired configuration to perform the one or more steps in the sequence analysis pipeline.

124 citations


Posted Content
TL;DR: In this paper, a highway network architecture is proposed for computing the matching cost at each possible disparity, based on multilevel weighted residual shortcuts, trained with a hybrid loss that supports multi-level comparison of image patches.
Abstract: We present an improved three-step pipeline for the stereo matching problem and introduce multiple novelties at each stage. We propose a new highway network architecture for computing the matching cost at each possible disparity, based on multilevel weighted residual shortcuts, trained with a hybrid loss that supports multilevel comparison of image patches. A novel post-processing step is then introduced, which employs a second deep convolutional neural network for pooling global information from multiple disparities. This network outputs both the image disparity map, which replaces the conventional "winner takes all" strategy, and a confidence in the prediction. The confidence score is achieved by training the network with a new technique that we call the reflective loss. Lastly, the learned confidence is employed in order to better detect outliers in the refinement step. The proposed pipeline achieves state of the art accuracy on the largest and most competitive stereo benchmarks, and the learned confidence is shown to outperform all existing alternatives.

120 citations


Journal ArticleDOI
TL;DR: The multi-band template analysis (MBTA) pipeline as discussed by the authors is a low-latency coincident analysis pipeline for the detection of gravitational waves (GWs) from compact binary coalescences.
Abstract: The multi-band template analysis (MBTA) pipeline is a low-latency coincident analysis pipeline for the detection of gravitational waves (GWs) from compact binary coalescences. MBTA runs with a low computational cost, and can identify candidate GW events online with a sub-minute latency. The low computational running cost of MBTA also makes it useful for data quality studies. Events detected by MBTA online can be used to alert astronomical partners for electromagnetic follow-up. We outline the current status of MBTA and give details of recent pipeline upgrades and validation tests that were performed in preparation for the first advanced detector observing period. The MBTA pipeline is ready for the outset of the advanced detector era and the exciting prospects it will bring.

100 citations


Journal ArticleDOI
01 May 2016
TL;DR: A pipeline of algorithms is presented that decomposes a given polygon model into parts such that each part can be 3D printed with high (outer) surface quality due to the fact that most 3D printing technologies have an anisotropic resolution.
Abstract: We present a pipeline of algorithms that decomposes a given polygon model into parts such that each part can be 3D printed with high (outer) surface quality. For this we exploit the fact that most 3D printing technologies have an anisotropic resolution and hence the surface smoothness varies significantly with the orientation of the surface. Our pipeline starts by segmenting the input surface into patches such that their normals can be aligned perpendicularly to the printing direction. A 3D Voronoi diagram is computed such that the intersections of the Voronoi cells with the surface approximate these surface patches. The intersections of the Voronoi cells with the input model's volume then provide an initial decomposition. We further present an algorithm to compute an assembly order for the parts and generate connectors between them. A post processing step further optimizes the seams between segments to improve the visual quality. We run our pipeline on a wide range of 3D models and experimentally evaluate the obtained improvements in terms of numerical, visual, and haptic quality.

90 citations


Proceedings ArticleDOI
19 Aug 2016
TL;DR: Two algorithm optimizations for a distributed cloud-based encoding pipeline are described, including per-title complexity analysis for bitrate-resolution selection and per-chunk bitrate control for consistent-quality encoding, which result in more efficient bandwidth usage and more consistent video quality.
Abstract: A cloud-based encoding pipeline which generates streams for video-on-demand distribution typically processes a wide diversity of content that exhibit varying signal characteristics. To produce the best quality video streams, the system needs to adapt the encoding to each piece of content, in an automated and scalable way. In this paper, we describe two algorithm optimizations for a distributed cloud-based encoding pipeline: (i) per-title complexity analysis for bitrate-resolution selection; and (ii) per-chunk bitrate control for consistent-quality encoding. These improvements result in a number of advantages over a simple “one-size-fits-all” encoding system, including more efficient bandwidth usage and more consistent video quality.
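
The per-title selection step can be pictured with a small sketch: probe-encode the title at several (resolution, bitrate) points, then keep, for each rung of the ladder, the resolution that maximizes measured quality at that bitrate. The probe numbers below are made up; a production pipeline would measure PSNR or a perceptual metric per encode.

```python
# hypothetical probe measurements for one title: resolution -> {bitrate_kbps: quality}
candidate_points = {
    "1920x1080": {1750: 88.0, 3000: 93.5, 4500: 95.5},
    "1280x720":  {1050: 84.0, 1750: 90.5, 3000: 92.0},
    "640x480":   {560: 74.0, 1050: 81.0, 1750: 83.5},
}

def build_ladder(points, ladder_bitrates):
    """Pick, for each target bitrate, the resolution with the best probe quality."""
    ladder = []
    for target in ladder_bitrates:
        best = None
        for resolution, curve in points.items():
            if target in curve and (best is None or curve[target] > best[2]):
                best = (target, resolution, curve[target])
        if best:
            ladder.append(best)
    return ladder

for bitrate, resolution, quality in build_ladder(candidate_points,
                                                 [560, 1050, 1750, 3000, 4500]):
    print(f"{bitrate:>5} kbps -> {resolution} (quality {quality})")
```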

Proceedings ArticleDOI
21 Feb 2016
TL;DR: This paper describes architectural enhancements in the Altera Stratix 10 HyperFlex FPGA architecture, which places ubiquitous flip-flops in the routing fabric to enable a high degree of pipelining.
Abstract: This paper describes architectural enhancements in the Altera Stratix 10 HyperFlex FPGA architecture, fabricated in the Intel 14nm FinFET process. Stratix 10 includes ubiquitous flip-flops in the routing to enable a high degree of pipelining. In contrast to the earlier architectural exploration of pipelining in pass-transistor based architectures, the direct drive routing fabric in Stratix-style FPGAs enables an extremely low-cost pipeline register. The presence of ubiquitous flip-flops simplifies circuit retiming and improves performance. The availability of predictable retiming affects all stages of the cluster, place and route flow. Ubiquitous flip-flops require a low-cost clock network with sufficient flexibility to enable pipelining of dozens of clock domains. Different cost/performance tradeoffs in a pipelined fabric and the use of a 14nm process lead to other modifications to the routing fabric and the logic element. User modification of the design enables even higher performance, averaging 2.3X faster in a small set of designs.

Journal ArticleDOI
TL;DR: In this article, an integrated Finite Element Method (FEM) model is proposed to investigate the dynamic seabed response for several specific pipeline layouts and to simulate pipeline stability under wave loading.

Journal ArticleDOI
TL;DR: Measurements of the sensitivity of the pipeline used to generate the Q1-Q17 DR24 planet candidate catalog find a strong period dependence in the measured detection efficiency, with longer (>40 day) periods having a significantly lower detectability than shorter periods.
Abstract: With each new version of the Kepler pipeline and resulting planet candidate catalog, an updated measurement of the underlying planet population can only be recovered with a corresponding measurement of the Kepler pipeline detection efficiency. Here we present measurements of the sensitivity of the pipeline (version 9.2) used to generate the Q1–Q17 DR24 planet candidate catalog. We measure this by injecting simulated transiting planets into the pixel-level data of 159,013 targets across the entire Kepler focal plane, and examining the recovery rate. Unlike previous versions of the Kepler pipeline, we find a strong period dependence in the measured detection efficiency, with longer (>40 day) periods having a significantly lower detectability than shorter periods, introduced in part by an incorrectly implemented veto. Consequently, the sensitivity of the 9.2 pipeline cannot be cast as a simple one-dimensional function of the signal strength of the candidate planet signal, as was possible for previous versions of the pipeline. We report on the implications for occurrence rate calculations based on the Q1–Q17 DR24 planet candidate catalog, and offer important caveats and recommendations for performing such calculations. As before, we make available the entire table of injected planet parameters and whether they were recovered by the pipeline, enabling readers to derive the pipeline detection sensitivity in the planet and/or stellar parameter space of their choice.
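
The injection-and-recovery measurement can be summarized in a few lines: bin the injected signals by orbital period and report the recovered fraction per bin with a binomial uncertainty. The arrays below are simulated with an assumed efficiency drop beyond 40 days purely to mirror the trend reported above; in practice the published table of injections would be used.

```python
import numpy as np

rng = np.random.default_rng(42)
periods = rng.uniform(0.5, 500.0, 20000)              # injected periods [days]
# simulated pipeline whose efficiency drops for periods longer than ~40 days
p_detect = np.where(periods < 40, 0.9, 0.5)
recovered = rng.uniform(size=periods.size) < p_detect

bins = np.array([0.5, 10, 40, 100, 200, 500])
for lo, hi in zip(bins[:-1], bins[1:]):
    mask = (periods >= lo) & (periods < hi)
    n = mask.sum()
    eff = recovered[mask].mean()
    err = np.sqrt(eff * (1 - eff) / n)                # binomial uncertainty
    print(f"P in [{lo:6.1f}, {hi:6.1f}) d: efficiency = {eff:.3f} +/- {err:.3f} (n={n})")
```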

Journal ArticleDOI
TL;DR: This paper presents the realtime image subtraction pipeline in the intermediate Palomar Transient Factory, using high-performance computing, efficient databases, and machine-learning algorithms to reliably deliver transient candidates within ten minutes of images being taken.
Abstract: A fast-turnaround pipeline for realtime data reduction plays an essential role in discovering and permitting follow-up observations to young supernovae and fast-evolving transients in modern time-domain surveys. In this paper, we present the realtime image subtraction pipeline in the intermediate Palomar Transient Factory. By using high-performance computing, efficient databases, and machine-learning algorithms, this pipeline manages to reliably deliver transient candidates within 10 minutes of images being taken. Our experience in using high-performance computing resources to process big data in astronomy serves as a trailblazer to dealing with data from large-scale time-domain facilities in the near future.

Journal ArticleDOI
TL;DR: Harmonica, a framework for heterogeneous computing systems enhanced by memristor-based neuromorphic computing accelerators (NCAs), is presented; its performance and power efficiency are superior to designs with either digital neural processing units (D-NPUs) or MBC arrays cooperating with a digital interconnection network.
Abstract: Following technology scaling, on-chip heterogeneous architecture emerges as a promising solution to combat the power wall of microprocessors. This work presents Harmonica, a framework of heterogeneous computing systems enhanced by memristor-based neuromorphic computing accelerators (NCAs). In Harmonica, a conventional pipeline is augmented with an NCA which is designed to speed up artificial neural network (ANN) relevant executions by leveraging the extremely efficient mixed-signal computation capability of nanoscale memristor-based crossbar (MBC) arrays. With the help of a mixed-signal interconnection network (M-Net), the hierarchically arranged MBC arrays can accelerate the computation of a variety of ANNs. Moreover, an inline calibration scheme is proposed to keep the computation accuracy degradation incurred by memristor resistance shifting within an acceptable range during NCA executions. Compared to a general-purpose processor, Harmonica can achieve on average 27.06× performance speedup and 25.23× energy savings when the NCA is configured with an auto-associative memory (AAM) implementation. If the NCA is configured with a multilayer perceptron (MLP) implementation, the performance speedup and energy savings can be boosted to 178.41× and 184.24×, respectively, with slightly degraded computation accuracy. Moreover, the performance and power efficiency of Harmonica are superior to the designs with either digital neural processing units (D-NPUs) or MBC arrays cooperating with a digital interconnection network. Compared to the baseline general-purpose processor, the classification rate degradation of Harmonica in MLP or AAM is less than 8% or 4%, respectively.

Posted Content
TL;DR: A novel, equalization-based soft-output data-detection algorithm and corresponding reference FPGA designs for wideband massive MU-MIMO systems that use orthogonal frequency-division multiplexing (OFDM).
Abstract: Data detection in massive multi-user (MU) multiple-input multiple-output (MIMO) wireless systems is among the most critical tasks due to the excessively high implementation complexity. In this paper, we propose a novel, equalization-based soft-output data-detection algorithm and corresponding reference FPGA designs for wideband massive MU-MIMO systems that use orthogonal frequency-division multiplexing (OFDM). Our data-detection algorithm performs approximate minimum mean-square error (MMSE) or box-constrained equalization using coordinate descent. We deploy a variety of algorithm-level optimizations that enable near-optimal error-rate performance at low implementation complexity, even for systems with hundreds of base-station (BS) antennas and thousands of subcarriers. We design a parallel VLSI architecture that uses pipeline interleaving and can be parametrized at design time to support various antenna configurations. We develop reference FPGA designs for massive MU-MIMO-OFDM systems and provide an extensive comparison to existing designs in terms of implementation complexity, throughput, and error-rate performance. For a 128 BS antenna, 8 user massive MU-MIMO-OFDM system, our FPGA design outperforms the next-best implementation by more than 2.6x in terms of throughput per FPGA look-up tables.
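
A hedged numerical sketch of the equalization core follows: for one subcarrier, MMSE equalization is performed by coordinate descent on the regularized least-squares objective ||y - Hx||^2 + rho*||x||^2 with rho = N0/Es, and the result is compared against the closed-form MMSE solution. This illustrates the algorithmic idea only, not the paper's exact update schedule or its fixed-point VLSI mapping.

```python
import numpy as np

def mmse_coordinate_descent(H, y, rho, iters=10):
    B, U = H.shape
    x = np.zeros(U, dtype=complex)
    col_norm = np.sum(np.abs(H) ** 2, axis=0) + rho
    r = y - H @ x                              # residual
    for _ in range(iters):
        for k in range(U):
            hk = H[:, k]
            # solve the 1-D problem for coordinate k with the others held fixed
            xk_new = np.vdot(hk, r + hk * x[k]) / col_norm[k]
            r -= hk * (xk_new - x[k])
            x[k] = xk_new
    return x

rng = np.random.default_rng(3)
B, U = 128, 8                                  # 128 BS antennas, 8 users
H = (rng.normal(size=(B, U)) + 1j * rng.normal(size=(B, U))) / np.sqrt(2)
s = rng.choice(np.array([1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j]), U) / np.sqrt(2)  # QPSK
N0 = 0.01
y = H @ s + np.sqrt(N0 / 2) * (rng.normal(size=B) + 1j * rng.normal(size=B))

x_cd = mmse_coordinate_descent(H, y, rho=N0, iters=10)
x_exact = np.linalg.solve(H.conj().T @ H + N0 * np.eye(U), H.conj().T @ y)
print("max |difference| vs closed-form MMSE:", np.max(np.abs(x_cd - x_exact)))
```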

Patent
07 Jul 2016
TL;DR: In this patent, the first binary sequence is defined by least significant bit (LSB) outputs from the ADC together with a second binary sequence of bits from a pseudorandom binary sequence generator, yielding a truly random unbiased sequence of bits with an equal probability of 1 and 0.
Abstract: A radar sensing system for a vehicle includes transmit and receive pipelines. The transmit pipeline includes transmitters able to transmit radio signals. The receive pipeline includes receivers able to receive signals. The received signals are transmitted signals that are reflected from an object. The transmit pipeline phase modulates the signals before transmission, as defined by a first binary sequence. The receive pipeline comprises an analog to digital converter (ADC) for sampling the received signals. The transmit pipeline includes a pseudorandom binary sequence (PRBS) generator for outputting a second binary sequence of bits with an equal probability of 1 and 0. The first binary sequence is defined by least significant bit (LSB) outputs from the ADC and the second binary sequence of bits. The first binary sequence comprises a truly random unbiased sequence of bits with an equal probability of 1 and 0.
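
To illustrate the two ingredients named above, the sketch below generates a pseudorandom binary sequence with a maximal-length 16-bit linear-feedback shift register and combines it with (simulated, biased) ADC least-significant bits using XOR. The XOR combination is an assumption made for illustration; the text above does not spell out the exact combining operation.

```python
import numpy as np

def prbs(taps=(16, 14, 13, 11), length=1 << 16, seed=0xACE1):
    """Fibonacci LFSR with a maximal-length 16-bit polynomial; one output bit per step."""
    state, out = seed, []
    for _ in range(length):
        bit = 0
        for t in taps:
            bit ^= (state >> (t - 1)) & 1
        out.append(state & 1)
        state = (state >> 1) | (bit << 15)
    return np.array(out, dtype=np.uint8)

rng = np.random.default_rng(9)
# simulated ADC LSBs: narrow-band noise around a DC level gives visibly biased parity
adc_samples = np.round(rng.normal(2048.3, 0.6, 1 << 16)).astype(int)
lsb = (adc_samples & 1).astype(np.uint8)

pr = prbs()
combined = lsb ^ pr
for name, bits in [("ADC LSBs", lsb), ("PRBS", pr), ("combined", combined)]:
    print(f"{name:>9}: fraction of ones = {bits.mean():.4f}")
```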

Journal ArticleDOI
TL;DR: In this article, a detailed technical and economic assessment of conditioning and transporting 13.1 MTPA CO2 with impurities in an onshore pipeline over a distance of 500 kilometres is presented.


Proceedings ArticleDOI
15 Oct 2016
TL;DR: This paper describes a 1GHz 32-character-wide HARE design targeting ASIC implementation that processes data at 32 GB/s - matching modern memory bandwidths and demonstrates a scaled-down FPGA proof-of-concept that operates at 100MHz with 4-wide parallelism (400 MB/s).
Abstract: Rapidly processing text data is critical for many technical and business applications. Traditional software-based tools for processing large text corpora use memory bandwidth inefficiently due to software overheads and thus fall far short of peak scan rates possible on modern memory systems. Prior hardware designs generally target I/O rather than memory bandwidth. In this paper, we present HARE, a hardware accelerator for matching regular expressions against large in-memory logs. HARE comprises a stall-free hardware pipeline that scans input data at a fixed rate, examining multiple characters from a single input stream in parallel in a single accelerator clock cycle. We describe a 1GHz 32-character-wide HARE design targeting ASIC implementation that processes data at 32 GB/s — matching modern memory bandwidths. This ASIC design outperforms software solutions by as much as two orders of magnitude. We further demonstrate a scaled-down FPGA proof-of-concept that operates at 100MHz with 4-wide parallelism (400 MB/s). Even at this reduced rate, the prototype outperforms grep by 1.5–20x on commonly used regular expressions.
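
The multi-character-per-cycle idea can be modelled in software by precomposing the DFA's per-character transition functions so each step consumes a whole chunk; HARE does this with parallel hardware transition units, whereas the sketch below only shows that the chunked scan is equivalent to a character-at-a-time scan. The pattern and chunk width are arbitrary examples.

```python
import numpy as np

def literal_dfa(pattern):
    """DFA over bytes for 'pattern occurs somewhere'; the accept state is absorbing."""
    m = len(pattern)
    table = np.zeros((m + 1, 256), dtype=np.int64)
    for state in range(m + 1):
        for c in range(256):
            if state == m:
                table[state, c] = m              # stay in accept once matched
                continue
            s = pattern[:state] + bytes([c])
            k = min(len(s), m)
            while k and s[-k:] != pattern[:k]:   # longest suffix that is a prefix
                k -= 1
            table[state, c] = k
    return table

def scan_chunked(data, table, width=4):
    n_states = table.shape[0]
    state = 0
    for i in range(0, len(data), width):
        # compose the transition functions of every character in this chunk
        f = np.arange(n_states)
        for c in data[i:i + width]:
            f = table[f, c]
        state = f[state]
        if state == n_states - 1:
            return i                             # match completed within this chunk
    return -1

table = literal_dfa(b"ERROR")
log = b"a" * 1000 + b"... ERROR: disk offline ..." + b"b" * 1000
print("byte offset of matching chunk:", scan_chunked(log, table, width=4))
```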

Journal ArticleDOI
TL;DR: In this paper, an equalization-based soft-output data-detection algorithm and corresponding reference FPGA designs for wideband massive MU-MIMO systems that use orthogonal frequency division multiplexing (OFDM) were proposed.
Abstract: Data detection in massive multi-user (MU) multiple-input multiple-output (MIMO) wireless systems is among the most critical tasks due to the excessively high implementation complexity. In this paper, we propose a novel, equalization-based soft-output data-detection algorithm and corresponding reference FPGA designs for wideband massive MU-MIMO systems that use orthogonal frequency-division multiplexing (OFDM). Our data-detection algorithm performs approximate minimum mean-square error (MMSE) or box-constrained equalization using coordinate descent. We deploy a variety of algorithm-level optimizations that enable near-optimal error-rate performance at low implementation complexity, even for systems with hundreds of base-station (BS) antennas and thousands of subcarriers. We design a parallel VLSI architecture that uses pipeline interleaving and can be parametrized at design time to support various antenna configurations. We develop reference FPGA designs for massive MU-MIMO-OFDM systems and provide an extensive comparison to existing designs in terms of implementation complexity, throughput, and error-rate performance. For a 128 BS antenna, 8-user massive MU-MIMO-OFDM system, our FPGA design outperforms the next-best implementation by more than 2.6× in terms of throughput per FPGA look-up tables.

Journal ArticleDOI
TL;DR: A dedicated medical pre-processing pipeline is proposed, aimed at addressing the many problems and opportunities contained within EMR data, such as their temporal, inaccurate, and incomplete nature; it has great potential to enhance disease prediction, and hence early detection and intervention, in medical practice.

Journal ArticleDOI
TL;DR: The objective of this research work is to design, optimize, and model the FPGA implementation of the HIGHT cipher; the analysis shows that the scalar designs have smaller area and power dissipation, whereas the pipeline designs have higher throughput and lower energy.
Abstract: The growth of low-resource devices has increased rapidly in recent years. Communication in such devices presents two challenges: security and resource limitation. Lightweight ciphers, such as the HIGHT cipher, are encryption algorithms targeted for low-resource systems. Designing lightweight ciphers on reconfigurable platforms, e.g., field-programmable gate arrays (FPGAs), provides speedup as well as flexibility. The HIGHT cipher consists of simple operations and provides an adequate security level. The objective of this research work is to design, optimize, and model the FPGA implementation of the HIGHT cipher. Several optimized designs are presented to minimize the required hardware resources and energy, including scalar and pipeline ones. Our analysis shows that the scalar designs have smaller area and power dissipation, whereas the pipeline designs have higher throughput and lower energy. Because obtaining the best performance out of any implemented design mainly requires balancing design area and energy, our experimental results demonstrate that it is possible to obtain such optimal performance using the pipeline design with two and four rounds per stage as well as with the scalar design with one and eight rounds. Comparing the best implementations of the pipeline and scalar designs, the scalar design requires 18% less resources and 10% less power, while the pipeline design has 18 times higher throughput and 60% less energy consumption.
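
A back-of-envelope model of the scalar-versus-pipeline trade-off discussed above is sketched below; all delay, area, and power figures are hypothetical placeholders, not the paper's measurements. It only illustrates why a fully unrolled pipeline trades area for throughput and amortizes fixed power into lower energy per block.

```python
TOTAL_ROUNDS = 32              # HIGHT has 32 rounds and a 64-bit block

def design_metrics(rounds_per_stage, pipelined, round_delay_ns=2.0,
                   slices_per_round=60, power_per_round_mw=1.5, static_power_mw=40.0):
    clock_hz = 1e9 / (round_delay_ns * rounds_per_stage)   # critical path grows with rounds per stage
    cycles_per_block = TOTAL_ROUNDS // rounds_per_stage
    if pipelined:
        area = slices_per_round * TOTAL_ROUNDS              # every stage instantiated
        power = static_power_mw + power_per_round_mw * TOTAL_ROUNDS
        blocks_per_second = clock_hz                         # one block per cycle when full
    else:
        area = slices_per_round * rounds_per_stage           # one stage, reused
        power = static_power_mw + power_per_round_mw * rounds_per_stage
        blocks_per_second = clock_hz / cycles_per_block
    throughput_mbps = blocks_per_second * 64 / 1e6
    energy_nj_per_block = power * 1e-3 / blocks_per_second * 1e9
    return area, power, throughput_mbps, energy_nj_per_block

for label, rps, pipe in [("scalar, 1 round/cycle", 1, False),
                         ("scalar, 8 rounds/cycle", 8, False),
                         ("pipeline, 2 rounds/stage", 2, True),
                         ("pipeline, 4 rounds/stage", 4, True)]:
    a, p, t, e = design_metrics(rps, pipe)
    print(f"{label:26s} area={a:5d} slices  power={p:5.1f} mW  "
          f"throughput={t:8.1f} Mbps  energy={e:6.2f} nJ/block")
```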

Journal ArticleDOI
TL;DR: The three-stage pipelined architecture is shown to have the best performance, which achieves a scalar multiplication over GF(2^163) in 6.1 μs using 7354 slices on Virtex-4.
Abstract: This paper proposes an efficient pipelined architecture of elliptic curve scalar multiplication (ECSM) over GF(2^m). The architecture uses a bit-parallel finite-field (FF) multiplier accumulator (MAC) based on the Karatsuba-Ofman algorithm. The Montgomery ladder algorithm is modified for better sharing of execution paths. The data path in the architecture is well designed, so that the critical path contains few extra logic primitives apart from the FF MAC. In order to find the optimal number of pipeline stages, scheduling schemes with different pipeline stages are proposed and the ideal placement of pipeline registers is thoroughly analyzed. We implement ECSM over the five binary fields recommended by the National Institute of Standards and Technology on Xilinx Virtex-4 and Virtex-5 field-programmable gate arrays. The three-stage pipelined architecture is shown to have the best performance, which achieves a scalar multiplication over GF(2^163) in 6.1 μs using 7354 slices on Virtex-4. Using Virtex-5, the scalar multiplication for m = 163, 233, 283, 409, and 571 can be achieved in 4.6, 7.9, 10.9, 19.4, and 36.5 μs, respectively, which are faster than previous results.
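
For reference, the following is a behavioural software model (an assumed illustration, not the paper's hardware) of the field arithmetic that the FF MAC above computes in one pipeline stage: carry-less multiplication reduced modulo the NIST B-163 polynomial f(x) = x^163 + x^7 + x^6 + x^3 + 1, with polynomials held as Python integers. The paper's implementation uses a bit-parallel Karatsuba-Ofman multiplier rather than this bit-serial loop.

```python
M = 163
F = (1 << 163) | (1 << 7) | (1 << 6) | (1 << 3) | 1     # NIST B-163 reduction polynomial

def gf2m_mul(a, b, m=M, f=F):
    """Multiply two GF(2^m) elements (bits of an int = polynomial coefficients)."""
    result = 0
    while b:
        if b & 1:
            result ^= a          # carry-less "addition" is XOR
        b >>= 1
        a <<= 1
        if a >> m & 1:           # reduce as soon as the degree reaches m
            a ^= f
    return result

def gf2m_mac(acc, a, b):
    """One multiply-accumulate step: acc + a*b over GF(2^m)."""
    return acc ^ gf2m_mul(a, b)

# sanity checks: x * x = x^2, and multiplication by 1 is the identity
x = 0b10
assert gf2m_mul(x, x) == 0b100
a = (1 << 162) ^ (1 << 80) ^ 1
assert gf2m_mul(a, 1) == a and gf2m_mac(0, a, 1) == a
print("GF(2^163) MAC model OK")
```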

Journal ArticleDOI
TL;DR: A novel GPU-based approach to robustly and efficiently simulate high-resolution and complexly layered cloth using a parallelized matrix assembly algorithm that can quickly build a large and sparse matrix in a compressed format and accurately solve linear systems on GPUs.
Abstract: We present a novel GPU-based approach to robustly and efficiently simulate high-resolution and complexly layered cloth. The key component of our formulation is a parallelized matrix assembly algorithm that can quickly build a large and sparse matrix in a compressed format and accurately solve linear systems on GPUs. We also present a fast and integrated solution for parallel collision handling, including collision detection and response computations, which utilizes spatio-temporal coherence. We combine these algorithms as part of a new cloth simulation pipeline that incorporates contact forces into implicit time integration for collision avoidance. The entire pipeline is implemented on GPUs, and we evaluate its performance on complex benchmarks consisting of 100–300K triangles. In practice, our system takes a few seconds to simulate one frame of a complex cloth scene, which represents significant speedups over prior CPU and GPU-based cloth simulation systems.
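
A CPU sketch of the assembly step described above: every element computes its local contribution independently as (row, column, value) triplets, and converting the triplet list to a compressed (CSR) matrix sums duplicates and yields the operator handed to the linear solver. The paper performs assembly and the solve in parallel on the GPU; scipy and the toy mass-spring chain below only illustrate the data flow.

```python
import numpy as np
from scipy.sparse import coo_matrix
from scipy.sparse.linalg import spsolve

n, k, mass, dt = 1000, 50.0, 0.01, 1e-3
springs = [(i, i + 1) for i in range(n - 1)]              # chain connectivity

rows, cols, vals = [], [], []
for a, b in springs:                                      # each element works independently
    for (r, c, v) in [(a, a, k), (b, b, k), (a, b, -k), (b, a, -k)]:
        rows.append(r); cols.append(c); vals.append(dt * dt * v)
for i in range(n):                                        # lumped mass on the diagonal
    rows.append(i); cols.append(i); vals.append(mass)

A = coo_matrix((vals, (rows, cols)), shape=(n, n)).tocsr()  # duplicates are summed here
rhs = np.zeros(n)
rhs[-1] = mass * -9.8 * dt                                  # gravity impulse on the last node
dv = spsolve(A, rhs)                                        # implicit velocity update
print("velocity change at the pulled end:", dv[-1])
```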

Journal ArticleDOI
01 Oct 2016
TL;DR: An automated pipeline for low-resolution structure refinement (LORESTR) has been developed to assist in the hassle-free refinement of difficult cases; it automates the selection of high-resolution homologues for external restraint generation and optimizes the parameters for ProSMART and REFMAC5.
Abstract: Since the ratio of the number of observations to adjustable parameters is small at low resolution, it is necessary to use complementary information for the analysis of such data. ProSMART is a program that can generate restraints for macromolecules using homologous structures, as well as generic restraints for the stabilization of secondary structures. These restraints are used by REFMAC5 to stabilize the refinement of an atomic model. However, the optimal refinement protocol varies from case to case, and it is not always obvious how to select appropriate homologous structure(s), or other sources of prior information, for restraint generation. After running extensive tests on a large data set of low-resolution models, the best-performing refinement protocols and strategies for the selection of homologous structures have been identified. These strategies and protocols have been implemented in the Low-Resolution Structure Refinement (LORESTR) pipeline. The pipeline performs auto-detection of twinning and selects the optimal scaling method and solvent parameters. LORESTR can either use user-supplied homologous structures, or run an automated BLAST search and download homologues from the PDB. The pipeline executes multiple model-refinement instances using different parameters in order to find the best protocol. Tests show that the automated pipeline improves R factors, geometry and Ramachandran statistics for 94% of the low-resolution cases from the PDB included in the test set.

Journal ArticleDOI
22 Aug 2016
TL;DR: An FPGA-based SQL query processing approach exploiting the capabilities of partial dynamic reconfiguration of modern FPGAs and a performance analysis is introduced that is able to estimate the processing time of a query for different processing strategies and different communication and processing architecture configurations.
Abstract: In this article, we propose an FPGA-based SQL query processing approach exploiting the capabilities of partial dynamic reconfiguration of modern FPGAs. After the analysis of an incoming query, a query-specific hardware processing unit is generated on the fly and loaded on the FPGA for immediate query execution. For each query, a specialized hardware accelerator pipeline is composed and configured on the FPGA from a set of presynthesized hardware modules. These partially reconfigurable hardware modules are gathered in a library covering all major SQL operations like restrictions and aggregations, as well as more complex operations such as joins and sorts. Moreover, this holistic query processing approach in hardware supports different data processing strategies, including row- as well as column-wise data processing, in order to optimize data communication and processing. This article gives an overview of the proposed query processing methodology and the corresponding library of modules. Additionally, a performance analysis is introduced that is able to estimate the processing time of a query for different processing strategies and different communication and processing architecture configurations. With the help of this performance analysis, architectural bottlenecks may be exposed and future optimized architectures, besides the two prototypes presented here, may be determined.

Journal ArticleDOI
TL;DR: In this paper, a simple analytical method of burst pressure calculation for a straight pipeline repaired with a composite sleeve was investigated, and the Monte Carlo method was selected to estimate pipeline failure probability and cumulative failure probability due to external corrosion, considering fluid pressure fluctuations and dynamic flow effects with respect to the statistical distributions of the input parameters.

Journal ArticleDOI
TL;DR: A high-throughput, memory-efficient pipelined architecture for the Fast Efficient Set Partitioning in Hierarchical Trees (SPIHT) image compression system is presented, attaining higher PSNR and compression ratio and producing highly accurate images after decompression.
Abstract: In this research paper, a high-throughput, memory-efficient pipelined architecture for the Fast Efficient Set Partitioning in Hierarchical Trees (SPIHT) image compression system is explained. The main aim of this paper is to compress the image and implement the design without any loss of information. A spatial-orientation-tree approach is used in the Fast Efficient SPIHT algorithm for compression, and a Spartan-3 EDK kit is used for hardware implementation analysis. An integer wavelet transform is used for the encoding and decoding process in the SPIHT algorithm. A pipelined architecture is adopted for the FPGA implementation because it is well suited to hardware utilization. Since an image file generally occupies a large amount of memory, this approach reduces the memory requirement without loss during transmission. In this way, higher PSNR values and compression ratios are attained, and highly accurate images are produced after decompression, compared with the results of previous algorithms. The hardware tools used are a dual-core processor and an FPGA Spartan-3 EDK kit; the software environment is the Windows 8 operating system with MATLAB 7.8.
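
For context, SPIHT-style coders are commonly paired with a reversible integer lifting wavelet; the sketch below (an assumed example, not code from the paper) applies one level of a Le Gall 5/3-style lifting transform to a 1-D signal using only integer arithmetic, so the decoder can invert it exactly.

```python
import numpy as np

def lifting_53_forward(x):
    x = np.asarray(x, dtype=np.int64)
    s, d = x[0::2].copy(), x[1::2].copy()
    s_next = np.append(s[1:], s[-1])               # simple replicate-boundary extension
    d -= (s + s_next) // 2                         # predict step: detail coefficients
    d_prev = np.insert(d[:-1], 0, d[0])
    s += (d_prev + d + 2) // 4                     # update step: approximation coefficients
    return s, d

def lifting_53_inverse(s, d):
    s, d = s.copy(), d.copy()
    d_prev = np.insert(d[:-1], 0, d[0])
    s -= (d_prev + d + 2) // 4                     # undo update
    s_next = np.append(s[1:], s[-1])
    d += (s + s_next) // 2                         # undo predict
    x = np.empty(s.size + d.size, dtype=np.int64)
    x[0::2], x[1::2] = s, d
    return x

signal = np.array([3, 7, 1, 8, 2, 9, 4, 6], dtype=np.int64)
approx, detail = lifting_53_forward(signal)
print("round trip exact:", np.array_equal(lifting_53_inverse(approx, detail), signal))
```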