
Showing papers on "Pipeline (computing) published in 2021"


Journal ArticleDOI
TL;DR: This work designs and evaluates a machine learning pipeline for estimation of battery capacity fade—a metric of battery health—on 179 cells cycled under various conditions, and provides insights into the design of scalable data-driven models for battery SOH estimation, emphasizing the value of confidence bounds around the prediction.
Abstract: Lithium-ion batteries are ubiquitous in applications ranging from portable electronics to electric vehicles. Irrespective of the application, reliable real-time estimation of battery state of health (SOH) by on-board computers is crucial to the safe operation of the battery, ultimately safeguarding asset integrity. In this Article, we design and evaluate a machine learning pipeline for estimation of battery capacity fade—a metric of battery health—on 179 cells cycled under various conditions. The pipeline estimates battery SOH with an associated confidence interval by using two parametric and two non-parametric algorithms. Using segments of charge voltage and current curves, the pipeline engineers 30 features, performs automatic feature selection and calibrates the algorithms. When deployed on cells operated under the fast-charging protocol, the best model achieves a root-mean-squared error of 0.45%. This work provides insights into the design of scalable data-driven models for battery SOH estimation, emphasizing the value of confidence bounds around the prediction. The pipeline methodology combines experimental data with machine learning modelling and could be applied to other critical components that require real-time estimation of SOH. Rechargeable lithium-ion batteries play a crucial role in many modern-day applications, including portable electronics and electric vehicles, but they degrade over time. To ensure safe operation, a battery’s ‘state of health’ should be monitored in real time, and this machine learning pipeline, tested on a variety of charging conditions, can provide such an online estimation of battery state of health.
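
As a rough illustration of the kind of pipeline described above (engineered charge-curve features, automatic feature selection, and a model that reports a confidence interval), here is a minimal scikit-learn sketch on synthetic data. The feature set, the SelectKBest step and the Gaussian-process regressor are illustrative stand-ins, not the authors' actual 30 features or calibrated algorithms.

    # Minimal sketch (not the authors' code): charge-curve features -> feature
    # selection -> a regressor that returns a prediction with a confidence bound.
    import numpy as np
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.feature_selection import SelectKBest, f_regression
    from sklearn.gaussian_process import GaussianProcessRegressor

    def charge_curve_features(voltage, current):
        """Toy stand-ins for the engineered features (means, spreads, slope)."""
        return np.array([voltage.mean(), voltage.std(), np.ptp(voltage),
                         current.mean(), current.std(),
                         np.polyfit(np.arange(len(voltage)), voltage, 1)[0]])

    # Synthetic cells: each row is one cycle's feature vector, y is capacity fade.
    rng = np.random.default_rng(0)
    X = np.vstack([charge_curve_features(3.0 + 0.2 * rng.random(100),
                                         1.0 + 0.1 * rng.random(100))
                   for _ in range(200)])
    y = 1.0 - 0.001 * np.arange(200) + 0.005 * rng.standard_normal(200)

    model = Pipeline([
        ("scale", StandardScaler()),
        ("select", SelectKBest(f_regression, k=4)),   # automatic feature selection
        ("gp", GaussianProcessRegressor(normalize_y=True)),
    ])
    model.fit(X, y)
    mean, std = model.predict(X[:5], return_std=True)  # estimate with confidence bound
    print(mean, 1.96 * std)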

173 citations


Proceedings ArticleDOI
14 Nov 2021
TL;DR: In this paper, the authors propose a novel interleaved pipelining schedule that can improve throughput by 10+% with memory footprint comparable to existing approaches, allowing them to perform training iterations on a model with 1 trillion parameters at 502 petaFLOP/s on 3072 GPUs.
Abstract: Large language models have led to state-of-the-art accuracies across several tasks. However, training these models efficiently is challenging because: a) GPU memory capacity is limited, making it impossible to fit large models on even a multi-GPU server, and b) the number of compute operations required can result in unrealistically long training times. Consequently, new methods of model parallelism such as tensor and pipeline parallelism have been proposed. Unfortunately, naive usage of these methods leads to scaling issues at thousands of GPUs. In this paper, we show how tensor, pipeline, and data parallelism can be composed to scale to thousands of GPUs. We propose a novel interleaved pipelining schedule that can improve throughput by 10+% with memory footprint comparable to existing approaches. Our approach allows us to perform training iterations on a model with 1 trillion parameters at 502 petaFLOP/s on 3072 GPUs (per-GPU throughput of 52% of theoretical peak).
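
The throughput argument rests on shrinking the pipeline "bubble". Below is a back-of-the-envelope comparison using the commonly cited bubble-fraction expressions for the standard 1F1B schedule versus an interleaved schedule with v model chunks per device; treat the formulas as an approximation of the paper's analysis rather than a quotation of it.

    # Bubble fraction = idle time relative to ideal compute time, for p pipeline
    # stages, m microbatches and v interleaved model chunks per device.
    def bubble_fraction(p, m, v=1):
        return (p - 1) / (v * m)

    p, m = 8, 64
    print("1F1B        :", bubble_fraction(p, m))        # ~0.109
    print("interleaved :", bubble_fraction(p, m, v=4))   # ~0.027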

135 citations


Journal ArticleDOI
TL;DR: In this paper, the authors proposed a novel complex spectral mapping approach with a two-stage pipeline for monaural speech enhancement in the time-frequency domain, which decouple the primal problem into multiple sub-problems.
Abstract: For challenging acoustic scenarios such as low signal-to-noise ratios, current speech enhancement systems usually suffer from a performance bottleneck when extracting the target speech from the mixture in a single step. To address this issue, we propose a novel complex spectral mapping approach with a two-stage pipeline for monaural speech enhancement in the time-frequency domain. The proposed algorithm decouples the primal problem into multiple sub-problems, following the classic proverb, “two heads are better than one”. More specifically, in the first stage only the magnitude is estimated, which is combined with the noisy phase to obtain a coarse complex spectrum estimate. To refine this estimate, in the second stage an auxiliary network serves as a post-processing module, where residual noise is further suppressed and the phase information is effectively modified. A global residual connection strategy is adopted in the second stage to accelerate training convergence. To alleviate the parameter burden caused by the multi-stage pipeline, we propose a light-weight temporal convolutional module, which substantially decreases the number of trainable parameters and achieves even better objective performance than the original version. We conduct extensive experiments on three standard corpora: WSJ0-SI84, the DNS Challenge dataset, and the Voice Bank + DEMAND dataset. Objective test results demonstrate that our proposed approach achieves state-of-the-art performance over previous advanced systems under various conditions. Meanwhile, subjective listening test results further validate the superiority of our proposed method in terms of subjective quality.
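
A minimal numpy sketch of the two-stage decoupling described above; the placeholder functions merely stand in for the paper's magnitude-estimation and post-processing networks, and the "residual" term is illustrative only.

    import numpy as np

    def stage1_magnitude(noisy_mag):
        # placeholder for the magnitude-estimation network (crude spectral floor)
        return np.maximum(noisy_mag - 0.5 * noisy_mag.mean(axis=-1, keepdims=True), 0.0)

    def stage2_refine(coarse_complex):
        # placeholder for the post-processing network; global residual connection:
        # output = coarse estimate + residual correction (here a tiny damping term)
        residual = -0.05 * coarse_complex
        return coarse_complex + residual

    noisy_spec = np.random.randn(257, 100) + 1j * np.random.randn(257, 100)  # F x T STFT
    mag, phase = np.abs(noisy_spec), np.angle(noisy_spec)

    coarse = stage1_magnitude(mag) * np.exp(1j * phase)  # stage 1: magnitude + noisy phase
    enhanced = stage2_refine(coarse)                     # stage 2: suppress residual noise
    print(enhanced.shape)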

95 citations


Journal ArticleDOI
TL;DR: This paper provides a block-wise network generation pipeline called BlockQNN which automatically builds high-performance networks using the Q-Learning paradigm with epsilon-greedy exploration strategy and proposes a distributed asynchronous framework and an early stop strategy.
Abstract: Convolutional neural networks have gained a remarkable success in computer vision. However, most popular network architectures are hand-crafted and usually require expertise and elaborate design. In this paper, we provide a block-wise network generation pipeline called BlockQNN which automatically builds high-performance networks using the Q-Learning paradigm with epsilon-greedy exploration strategy. The optimal network block is constructed by the learning agent which is trained to choose component layers sequentially. We stack the block to construct the whole auto-generated network. To accelerate the generation process, we also propose a distributed asynchronous framework and an early stop strategy. The block-wise generation brings unique advantages: (1) it yields state-of-the-art results in comparison to the hand-crafted networks on image classification, particularly, the best network generated by BlockQNN achieves 2.35 percent top-1 error rate on CIFAR-10. (2) it offers tremendous reduction of the search space in designing networks, spending only 3 days with 32 GPUs. A faster version can yield a comparable result with only 1 GPU in 20 hours. (3) it has strong generalizability in that the network built on CIFAR also performs well on the larger-scale dataset. The best network achieves very competitive accuracy of 82.0 percent top-1 and 96.0 percent top-5 on ImageNet.
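
To make the generation procedure concrete, here is a toy epsilon-greedy Q-learning loop in the same spirit: an agent picks component layers one step at a time and is rewarded for the assembled block. The layer vocabulary, reward function and hyperparameters are invented for illustration and are not BlockQNN's actual search space.

    import random

    LAYERS = ["conv3x3", "conv5x5", "maxpool", "identity"]
    BLOCK_LEN, EPISODES, EPS, ALPHA, GAMMA = 4, 500, 0.1, 0.2, 0.9
    Q = {(s, a): 0.0 for s in range(BLOCK_LEN) for a in LAYERS}

    def fake_reward(block):
        # stand-in for "train the assembled block and report validation accuracy"
        return block.count("conv3x3") * 0.2 + block.count("maxpool") * 0.1

    for _ in range(EPISODES):
        block = []
        for step in range(BLOCK_LEN):
            if random.random() < EPS:
                a = random.choice(LAYERS)                       # explore
            else:
                a = max(LAYERS, key=lambda x: Q[(step, x)])     # exploit
            block.append(a)
            r = fake_reward(block) if step == BLOCK_LEN - 1 else 0.0
            nxt = 0.0 if step == BLOCK_LEN - 1 else max(Q[(step + 1, x)] for x in LAYERS)
            Q[(step, a)] += ALPHA * (r + GAMMA * nxt - Q[(step, a)])

    print([max(LAYERS, key=lambda x: Q[(s, x)]) for s in range(BLOCK_LEN)])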

91 citations


Journal ArticleDOI
TL;DR: A hybrid intelligent method named PCA-CPSO-SVR, which combines support vector regression, principal component analysis, and chaos particle swarm optimization, is proposed to predict the corrosion rate of multiphase flow pipelines and shows good performance.

84 citations


Journal ArticleDOI
TL;DR: A collection of optimizing transformations for HLS, targeting scalable and efficient architectures for high-performance computing (HPC) applications, is presented, aiming to establish a common toolbox to guide both performance engineers and compiler engineers in tapping into the performance potential offered by spatial computing architectures using HLS.
Abstract: Spatial computing architectures promise a major stride in performance and energy efficiency over the traditional load/store devices currently employed in large scale computing systems. The adoption of high-level synthesis (HLS) from languages such as C++ and OpenCL has greatly increased programmer productivity when designing for such platforms. While this has enabled a wider audience to target spatial computing architectures, the optimization principles known from traditional software design are no longer sufficient to implement high-performance codes, due to fundamentally distinct aspects of hardware design, such as programming for deep pipelines, distributed memory resources, and scalable routing. To alleviate this, we present a collection of optimizing transformations for HLS, targeting scalable and efficient architectures for high-performance computing (HPC) applications. We systematically identify classes of transformations (pipelining, scalability, and memory), the characteristics of their effect on the HLS code and the resulting hardware (e.g., increasing data reuse or resource consumption), and the objectives that each transformation can target (e.g., resolve interface contention, or increase parallelism). We show how these can be used to efficiently exploit pipelining, on-chip distributed fast memory, and on-chip dataflow, allowing for massively parallel architectures. To quantify the effect of various transformations, we cover the optimization process of a sample set of HPC kernels, provided as open source reference codes. We aim to establish a common toolbox to guide both performance engineers and compiler engineers in tapping into the performance potential offered by spatial computing architectures using HLS.

83 citations


Journal ArticleDOI
TL;DR: This article investigates a collaborative computation offloading, computation and communication resource allocation scheme, and develops a collaborative computing framework in which the tasks of mobile devices can be partially processed at the terminals, edge nodes (EN) and cloud center (CC).
Abstract: Mobile edge computing (MEC) is an emerging computing paradigm for enabling low-latency, high-bandwidth and agile mobile services by deploying computing platforms at the edge of the network. In order to improve the cloud-edge-end processing efficiency of tasks within the limited computation and communication capabilities, in this article we investigate collaborative computation offloading together with computation and communication resource allocation, and develop a collaborative computing framework in which the tasks of mobile devices (MDs) can be partially processed at the terminals, edge nodes (EN) and cloud center (CC). We then propose a pipeline-based offloading scheme, where both MDs and ENs can offload computation-intensive tasks to a particular EN and the CC, according to their respective computation and communication capacities. Based on the proposed pipeline offloading strategy, a problem minimizing the sum latency of all MDs is formulated, jointly considering the offloading strategy, computation resources, delivery rates and power allocation; the resulting problem is non-convex and difficult to solve directly. To solve it, we use the classic successive convex approximation (SCA) approach to transform the non-convex optimization problem into a convex one. Finally, simulation results indicate that the proposed collaborative offloading scheme with the pipeline strategy is efficient and outperforms other offloading schemes.
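
A toy latency model (not the paper's SCA formulation) helps illustrate the terminal/EN/CC split: each branch's completion time is bounded by its transmission plus compute time, and the task finishes when the slowest branch does. All rates, CPU frequencies and the cycles-per-bit constant below are made-up example values.

    def completion_time(D, split, f_md, f_en, f_cc, r_md_en, r_en_cc, cycles_per_bit=100):
        a, b, c = split                       # fractions processed at MD, EN, CC (a+b+c == 1)
        t_local = a * D * cycles_per_bit / f_md
        t_en = (b + c) * D / r_md_en + b * D * cycles_per_bit / f_en
        t_cc = (b + c) * D / r_md_en + c * D / r_en_cc + c * D * cycles_per_bit / f_cc
        return max(t_local, t_en, t_cc)       # parallel branches finish when the slowest does

    print(completion_time(D=1e6, split=(0.2, 0.5, 0.3),
                          f_md=1e9, f_en=5e9, f_cc=2e10, r_md_en=2e7, r_en_cc=1e8))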

82 citations


Journal ArticleDOI
TL;DR: A review of general approaches based on passive and active control technologies is presented, including the optimal layout technique for pipelines and clamps, the constrained layer damping technique, the vibration absorber technique, the hydraulic hose technique, the optimal pump structure technique, and the active vibration control technique for pipeline systems.

70 citations


Journal ArticleDOI
TL;DR: A deep fully convolutional neural network, DeepRx, is proposed, which executes the whole receiver pipeline from the frequency-domain signal stream to uncoded bits in a 5G-compliant fashion and outperforms traditional methods.
Abstract: Deep learning has solved many problems that are out of reach of heuristic algorithms. It has also been successfully applied in wireless communications, even though the current radio systems are well-understood and optimal algorithms exist for many tasks. While some gains have been obtained by learning individual parts of a receiver, a better approach is to jointly learn the whole receiver. This, however, often results in a challenging nonlinear problem, for which the optimal solution is infeasible to implement. To this end, we propose a deep fully convolutional neural network, DeepRx, which executes the whole receiver pipeline from frequency domain signal stream to uncoded bits in a 5G-compliant fashion. We facilitate accurate channel estimation by constructing the input of the convolutional neural network in a very specific manner using both the data and pilot symbols. Also, DeepRx outputs soft bits that are compatible with the channel coding used in 5G systems. Using 3GPP-defined channel models, we demonstrate that DeepRx outperforms traditional methods. We also show that the high performance can likely be attributed to DeepRx learning to utilize the known constellation points of the unknown data symbols, together with the local symbol distribution, for improved detection accuracy.

63 citations


Journal ArticleDOI
TL;DR: In this article, a generative adversarial network based on trinetworks form (tnGAN) is proposed to handle leak detection problems with incomplete sensor data; it can handle different incomplete-data recovery situations, such as individually lost and randomly missing measurements.
Abstract: Due to the widely deployed sensors in the pipeline network, the data-driven detection method is a natural choice with multiple sensor measurements. However, the incomplete data problem caused by device failure or network interruption seriously hinders the implementation of pipeline status monitoring. To address this difficulty, this article proposes a generative adversarial network based on trinetworks form (tnGAN) to handle leak detection problems with incomplete sensor data. First, the generative model is proposed to recover incomplete data by fully exploiting the same-level nature similarity of data features. Therein, the same type of sensor data, obtained from the pipeline network, is used as the input. Next, to further exploit the temporal evolvement characteristics and the spatial similarity, a multiview awareness strategy is incorporated in the established model to facilitate the integration of inherent information. Then, a dual-discriminative network architecture is proposed to detect the pipeline status by computing the similarity of the latent features of samples. With the abovementioned structure, the proposed method can handle different incomplete-data recovery situations, such as individually lost and randomly missing measurements. In addition, it can also aggregate the outputs and features of the discriminative networks to obtain the pipeline leak detection result. Finally, the experimental results on a pipeline network demonstrate the capability and effectiveness of the proposed method in both data recovery and leak detection.

61 citations


Journal ArticleDOI
TL;DR: Wang et al. as mentioned in this paper proposed a novel pipeline for fall detection based on wearable accelerometer data; three publicly available datasets were used to validate the proposed method, and more than 7700 cross-disciplinary time-series features were investigated for each of the datasets.
Abstract: Falls cause trauma or critical injury among the geriatric population and are the second leading accidental cause of post-injury mortality around the world. It is crucial to keep elderly people under supervision while ensuring proper privacy and comfort. Thus, elderly fall detection and prediction using wearable/non-wearable sensors has become an active field of research. In this work, a novel pipeline for fall detection based on wearable accelerometer data is proposed. Three publicly available datasets have been used to validate the proposed method, and more than 7700 cross-disciplinary time-series features were investigated for each of the datasets. After applying a series of feature reduction techniques (mutual information, removal of highly correlated features using the Pearson correlation coefficient, and the Boruta algorithm), we obtained the dominant features for each dataset. Different classical machine learning (ML) algorithms were utilized to detect falls based on the obtained features. For individual datasets, the simple ML classifiers achieved very good accuracy. We trained our pipeline on two of the three datasets and tested on the remaining one, rotating until each of the three datasets had served as the test set, to show the generalization capability of our proposed pipeline. A set of 39 high-performing features was selected, and the classifiers were trained with them. In all cases, the proposed pipeline showed excellent efficiency in detecting falls. This architecture performed better than most existing works on all of the publicly available datasets used, demonstrating the strength of the proposed data analysis pipeline.
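
A condensed sketch of the feature-reduction chain described above (mutual information followed by Pearson-correlation pruning, then a classical classifier), on synthetic data; the Boruta step and the ~7700 extracted time-series features are omitted for brevity, and the thresholds are illustrative.

    import numpy as np
    import pandas as pd
    from sklearn.feature_selection import mutual_info_classif
    from sklearn.ensemble import RandomForestClassifier

    rng = np.random.default_rng(1)
    X = pd.DataFrame(rng.standard_normal((300, 50)), columns=[f"f{i}" for i in range(50)])
    y = (X["f0"] + 0.5 * X["f1"] + 0.1 * rng.standard_normal(300) > 0).astype(int)

    mi = mutual_info_classif(X, y, random_state=0)
    keep = X.columns[np.argsort(mi)[-20:]]                    # keep most informative features
    X_mi = X[keep]

    corr = X_mi.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    drop = [c for c in upper.columns if (upper[c] > 0.9).any()]   # prune highly correlated ones
    X_sel = X_mi.drop(columns=drop)

    clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_sel, y)
    print(X_sel.shape, clf.score(X_sel, y))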

Proceedings ArticleDOI
01 Jun 2021
TL;DR: Seesaw Loss as discussed by the authors dynamically re-balances gradients of positive and negative samples for each category, with two complementary factors, i.e., a mitigation factor and a compensation factor.
Abstract: Instance segmentation has witnessed a remarkable progress on class-balanced benchmarks. However, they fail to perform as accurately in real-world scenarios, where the category distribution of objects naturally comes with a long tail. Instances of head classes dominate a long-tailed dataset and they serve as negative samples of tail categories. The overwhelming gradients of negative samples on tail classes lead to a biased learning process for classifiers. Consequently, objects of tail categories are more likely to be misclassified as backgrounds or head categories. To tackle this problem, we propose Seesaw Loss to dynamically re-balance gradients of positive and negative samples for each category, with two complementary factors, i.e., mitigation factor and compensation factor. The mitigation factor reduces punishments to tail categories w.r.t. the ratio of cumulative training instances between different categories. Meanwhile, the compensation factor increases the penalty of misclassified instances to avoid false positives of tail categories. We conduct extensive experiments on Seesaw Loss with mainstream frameworks and different data sampling strategies. With a simple end-to-end training pipeline, Seesaw Loss obtains significant gains over Cross-Entropy Loss, and achieves state-of-the-art performance on LVIS dataset without bells and whistles. Code is available at https://github.com/open-mmlab/mmdetection.
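
A rough numpy sketch of the seesaw re-weighting of negative logits, with a mitigation factor driven by cumulative class counts and a compensation factor driven by the predicted probabilities; verify the exact formulation and the exponents p and q against the official mmdetection implementation linked above.

    import numpy as np

    def seesaw_ce(logits, target, class_counts, p=0.8, q=2.0):
        """logits: (C,), target: int, class_counts: cumulative instances per class."""
        C = logits.shape[0]
        sigma = np.exp(logits - logits.max()); sigma /= sigma.sum()
        i, N = target, np.asarray(class_counts, dtype=float)

        S = np.ones(C)
        for j in range(C):
            if j == i:
                continue
            M = (N[j] / N[i]) ** p if N[j] < N[i] else 1.0                       # mitigation
            Ccomp = (sigma[j] / sigma[i]) ** q if sigma[j] > sigma[i] else 1.0   # compensation
            S[j] = M * Ccomp

        weighted = S * np.exp(logits - logits.max())   # re-weighted negative terms
        prob_i = weighted[i] / weighted.sum()
        return -np.log(prob_i)

    print(seesaw_ce(np.array([2.0, 0.5, -1.0]), target=2, class_counts=[1000, 100, 10]))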

Journal ArticleDOI
TL;DR: Low-power, high-speed hardware architectures for an efficient field programmable gate array (FPGA) implementation of the advanced encryption standard (AES) algorithm are presented to provide data security, and a modified positive polarity Reed-Muller (MPPRM) architecture is inserted to reduce the total hardware requirements.
Abstract: Nowadays, a huge amount of digital data is frequently exchanged among different embedded devices over wireless communication technologies. Data security is considered an important parameter for avoiding information loss and preventing cyber-crimes. This research article details low-power, high-speed hardware architectures for an efficient field programmable gate array (FPGA) implementation of the advanced encryption standard (AES) algorithm to provide data security. This work does not depend on look-up tables (LUTs) for the implementation of the SubBytes and InvSubBytes stages of the AES encryption and decryption transformations; instead, the new architecture uses combinational logic circuits to implement the SubBytes and InvSubBytes transformations. Due to the elimination of LUTs, unwanted delays are removed from this architecture, and a subpipelining structure is introduced to improve the speed of the AES algorithm. Here, a modified positive polarity Reed-Muller (MPPRM) architecture is inserted to reduce the total hardware requirements, and comparisons are made with different implementations. With the MPPRM architecture introduced in the SubBytes stage, an efficient MixColumns and InvMixColumns architecture suited to subpipelined round units is added. The performance of the proposed AES-MPPRM architecture is analyzed in terms of number of slice registers, flip flops, number of slice LUTs, number of logical elements, slices, bonded IOBs, operating frequency and delay. It is compared against five different AES architectures: LAES, AES-CTR, AES-CFA, AES-BSRD, and AES-EMCBE. The LUT count of the AES-MPPRM architecture designed on the Spartan-6 is reduced by up to 15.45% when compared to AES-BSRD.
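
The LUT-free SubBytes idea can be mimicked in software by computing each substitution algebraically, i.e. a GF(2^8) multiplicative inverse followed by the AES affine transform, instead of reading a stored table. The sketch below is only a functional illustration of that substitution, not the proposed hardware.

    def gf_mul(a, b, mod=0x11B):
        # carry-less multiply with reduction by the AES polynomial x^8+x^4+x^3+x+1
        r = 0
        while b:
            if b & 1:
                r ^= a
            a <<= 1
            if a & 0x100:
                a ^= mod
            b >>= 1
        return r

    def gf_inv(a):
        if a == 0:
            return 0
        return next(x for x in range(1, 256) if gf_mul(a, x) == 1)

    def rotl8(x, n):
        return ((x << n) | (x >> (8 - n))) & 0xFF

    def sub_byte(x):
        b = gf_inv(x)                                    # multiplicative inverse in GF(2^8)
        return b ^ rotl8(b, 1) ^ rotl8(b, 2) ^ rotl8(b, 3) ^ rotl8(b, 4) ^ 0x63  # affine map

    assert sub_byte(0x00) == 0x63 and sub_byte(0x53) == 0xED  # spot-check vs. the AES S-box
    print(hex(sub_byte(0x53)))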

Journal ArticleDOI
Lei Xu, Lei Hou, Zhenyu Zhu, Yu Li, Jiaquan Liu, Ting Lei, Xingguang Wu
01 May 2021-Energy
TL;DR: The GA-SVM hybrid model is the most effective at improving predictive accuracy, and its forecast results show the best agreement with the actual data.

Proceedings ArticleDOI
17 Feb 2021
TL;DR: Guo et al. as mentioned in this paper proposed AutoBridge, an automated framework that couples a coarse-grained floorplanning step with pipelining during HLS compilation, allowing HLS to more easily identify and pipeline the long wires, especially those crossing the die boundaries.
Abstract: Despite an increasing adoption of high-level synthesis (HLS) for its design productivity advantages, there remains a significant gap in the achievable clock frequency between an HLS-generated design and a handcrafted RTL one. A key factor that limits the timing quality of the HLS outputs is the difficulty in accurately estimating the interconnect delay at the HLS level. Unfortunately, this problem becomes even worse when large HLS designs are implemented on the latest multi-die FPGAs, where die-crossing interconnects incur a high delay penalty. To tackle this challenge, we propose AutoBridge, an automated framework that couples a coarse-grained floorplanning step with pipelining during HLS compilation. First, our approach provides HLS with a view on the global physical layout of the design, allowing HLS to more easily identify and pipeline the long wires, especially those crossing the die boundaries. Second, by exploiting the flexibility of HLS pipelining, the floorplanner is able to distribute the design logic across multiple dies on the FPGA device without degrading clock frequency. This prevents the placer from aggressively packing the logic on a single die which often results in local routing congestion that eventually degrades timing. Since pipelining may introduce additional latency, we further present analysis and algorithms to ensure the added latency will not compromise the overall throughput. AutoBridge can be integrated into the existing CAD toolflow for Xilinx FPGAs. In our experiments with a total of 43 design configurations, we improve the average frequency from 147 MHz to 297 MHz (a 102% improvement) with no loss of throughput and a negligible change in resource utilization. Notably, in 16 experiments we make the originally unroutable designs achieve 274 MHz on average. The tool is available at https://github.com/Licheng-Guo/AutoBridge.

Journal ArticleDOI
TL;DR: In this paper, an energy-efficient configurable crypto-processor supporting multi-security-level key encapsulation mechanism of Saber, is proposed, where an 8-level hierarchical Karatsuba framework is utilized to reduce degree-256 polynomial multiplication to the coefficient-wise multiplication.
Abstract: Saber, the only module-learning-with-rounding-based algorithm in NIST’s third round of the post-quantum cryptography (PQC) standardization process, is characterized by simplicity and flexibility. However, energy-efficient implementation of Saber is still under investigation since the commonly used number theoretic transform cannot be utilized directly. In this manuscript, an energy-efficient configurable crypto-processor supporting the multi-security-level key encapsulation mechanism of Saber is proposed. First, an 8-level hierarchical Karatsuba framework is utilized to reduce degree-256 polynomial multiplication to coefficient-wise multiplication. Second, a hardware-efficient Karatsuba scheduling strategy and an optimized pre-/post-processing structure are designed to reduce the area overheads of the scheduling strategy. Third, a task-rescheduling-based pipeline strategy and truncated multipliers are proposed to enable fine-grained processing. Moreover, multiple parameter sets are supported in LWRpro to enable configurability among various security scenarios. Enabled by these optimizations, LWRpro requires 1066, 1456 and 1701 clock cycles for key generation, encapsulation, and decapsulation of Saber768. The post-layout version of LWRpro is implemented in a TSMC 40 nm CMOS process within 0.38 mm². The throughput for Saber768 is up to 275k encapsulation operations per second and the energy efficiency is 0.15 μJ/encapsulation while operating at 400 MHz, achieving nearly 50× and 31× improvements, respectively, compared with current PQC hardware solutions.
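
The hierarchical Karatsuba idea reduces a large polynomial product to many small coefficient-wise products. A plain recursive software version of Karatsuba multiplication for coefficient-form polynomials is sketched below; the negacyclic reduction modulo x^256 + 1 and the modulus q of Saber are deliberately omitted.

    def poly_add(a, b):
        n = max(len(a), len(b))
        return [(a[i] if i < len(a) else 0) + (b[i] if i < len(b) else 0) for i in range(n)]

    def karatsuba(a, b):
        # a and b are equal-length coefficient lists (lowest degree first)
        n = len(a)
        if n <= 1:                                # base case: coefficient-wise product
            return [a[0] * b[0]] if a and b else []
        m = n // 2
        a0, a1, b0, b1 = a[:m], a[m:], b[:m], b[m:]
        z0 = karatsuba(a0, b0)
        z2 = karatsuba(a1, b1)
        z1 = karatsuba(poly_add(a0, a1), poly_add(b0, b1))
        z1 = [z1[i] - (z0[i] if i < len(z0) else 0) - (z2[i] if i < len(z2) else 0)
              for i in range(len(z1))]
        out = [0] * (2 * n - 1)
        for i, c in enumerate(z0):
            out[i] += c
        for i, c in enumerate(z1):
            out[i + m] += c
        for i, c in enumerate(z2):
            out[i + 2 * m] += c
        return out

    print(karatsuba([1, 2, 3, 4], [5, 6, 7, 8]))  # matches the schoolbook product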

Journal ArticleDOI
TL;DR: In this paper, the outbound pipeline of the Yongchang pressure station is taken as the research object and a vibration analysis of the station yard pipeline is carried out; three kinds of vibration reduction schemes are then proposed and verified by simulation.
Abstract: The abnormal vibration of natural gas station pipelines seriously threatens the safety of pipeline transportation, and improper handling will cause huge economic losses. For the abnormal vibration of the pipeline, reasonable treatment must be carried out. The Yongchang gas station belongs to the west–east gas pipeline system in China. Since its production, abnormal vibration has often occurred in the west-third outbound pipeline of the Yongchang gas station, and the vibration changes according to the different gas transport volumes. In this paper, the outbound pipeline of the Yongchang pressure station is taken as the research object, and the vibration analysis of the station yard pipeline is carried out. The numerical model of the station yard pipeline is established, and the correctness of the model is verified by the field vibration test. The fluid–solid coupling method is used to analyze pipeline vibration under different working conditions. Then, three kinds of vibration reduction schemes are proposed and verified by simulation. The main conclusions are as follows: (1) The fluid pressure fluctuation in the pipeline is the root cause of abnormal vibration in the station. (2) When the gas transmission volume is large, the vibration of the pipeline system will become more severe. (3) The scheme of increasing pipe diameter and adding appropriate constraints has the best vibration reduction effect.

Journal ArticleDOI
TL;DR: A deep learning framework that incorporates both multi-atlas registration and level-set methods for segmenting the pancreas from CT volume images is proposed and validated; it achieves an average Dice score over 82%, superior or comparable to other existing state-of-the-art pancreas segmentation algorithms.

Journal ArticleDOI
TL;DR: This work presents a new, reproducible pipeline in R that allows for relatively simple fitting of 24 different TPC models using nonlinear least squares (NLLS) regression and demonstrates how this pipeline can be combined with other packages in R to robustly and reproducibly fit multiple mathematical models to multiple TPC datasets at once.
Abstract: 1. The quantification of thermal performance curves (TPCs) for biological rates has many applications to problems such as predicting species’ responses to climate change. There is currently no widely used open-source pipeline to fit mathematical TPC models to data, which limits the transparency and reproducibility of the curve fitting process underlying applications of TPCs. 2. We present a new pipeline in R that currently allows for reproducible fitting of 24 different TPC models using non-linear least squares (NLLS) regression. The pipeline consists of two packages - rTPC and nls.multstart - that allow multiple start values for NLLS fitting and provide helper functions for setting start parameters. This pipeline overcomes previous problems that have made NLLS fitting and estimation of key parameters difficult or unreliable. 3. We demonstrate how rTPC and nls.multstart can be combined with other packages in R to robustly and reproducibly fit multiple models to multiple TPC datasets at once. In addition, we show how model selection or averaging, weighted model fitting, and bootstrapping can easily be implemented within the pipeline. 4. This new pipeline provides a flexible and reproducible approach that makes the challenging task of fitting multiple TPC models to data accessible to a wide range of users.
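
For readers outside R, the multi-start NLLS idea behind rTPC and nls.multstart can be approximated as follows: repeatedly fit from random start values and keep the best fit. The Gaussian-shaped curve used here is just a stand-in for the 24 published TPC models the actual pipeline supports.

    import numpy as np
    from scipy.optimize import curve_fit

    def gaussian_tpc(T, rmax, Topt, width):
        return rmax * np.exp(-0.5 * ((T - Topt) / width) ** 2)

    rng = np.random.default_rng(2)
    T = np.linspace(5, 45, 30)
    rate = gaussian_tpc(T, 1.2, 28.0, 6.0) + 0.05 * rng.standard_normal(T.size)

    best, best_sse = None, np.inf
    for _ in range(50):                                   # many random start values
        p0 = [rng.uniform(0.1, 3), rng.uniform(5, 45), rng.uniform(1, 15)]
        try:
            popt, _ = curve_fit(gaussian_tpc, T, rate, p0=p0, maxfev=2000)
        except RuntimeError:
            continue                                      # this start did not converge
        sse = np.sum((gaussian_tpc(T, *popt) - rate) ** 2)
        if sse < best_sse:
            best, best_sse = popt, sse

    print("rmax, Topt, width:", best, "SSE:", best_sse)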

Proceedings ArticleDOI
22 May 2021
TL;DR: In this paper, the Transformer architecture is used to predict the next token in the list of potential code completions in the IDE at cursor position, and it outperforms previous state-of-the-art next token prediction systems by margins ranging from 14% to 18%.
Abstract: Code prediction, more specifically autocomplete, has become an essential feature in modern IDEs. Autocomplete is more effective when the desired next token is at (or close to) the top of the list of potential completions offered by the IDE at cursor position. This is where the strength of the underlying machine learning system that produces a ranked order of potential completions comes into play. We advance the state-of-the-art in the accuracy of code prediction (next token prediction) used in autocomplete systems. Our work uses Transformers as the base neural architecture. We show that by making the Transformer architecture aware of the syntactic structure of code, we increase the margin by which a Transformer-based system outperforms previous systems. With this, it outperforms the accuracy of several state-of-the-art next token prediction systems by margins ranging from 14% to 18%. We present in the paper several ways of communicating the code structure to the Transformer, which is fundamentally built for processing sequence data. We provide a comprehensive experimental evaluation of our proposal, along with alternative design choices, on a standard Python dataset, as well as on a company internal Python corpus. Our code and data preparation pipeline will be available in open source.

Proceedings ArticleDOI
17 Feb 2021
TL;DR: DAPPLE as mentioned in this paper is a synchronous training framework which combines data parallelism and pipeline parallelism for large DNN models, and it features a novel parallelization strategy planner to solve the partition and placement problems.
Abstract: It is a challenging task to train large DNN models on sophisticated GPU platforms with diversified interconnect capabilities. Recently, pipelined training has been proposed as an effective approach for improving device utilization. However, there are still several tricky issues to address: improving computing efficiency while ensuring convergence, and reducing memory usage without incurring additional computing costs. We propose DAPPLE, a synchronous training framework which combines data parallelism and pipeline parallelism for large DNN models. It features a novel parallelization strategy planner to solve the partition and placement problems, and explores the optimal hybrid strategies of data and pipeline parallelism. We also propose a new runtime scheduling algorithm to reduce device memory usage, which is orthogonal to the re-computation approach and does not come at the expense of training throughput. Experiments show that the DAPPLE planner consistently outperforms strategies generated by PipeDream's planner by up to 3.23× speedup under synchronous training scenarios, and the DAPPLE runtime outperforms GPipe by 1.6× speedup of training throughput and saves 12% of memory consumption at the same time.

Proceedings ArticleDOI
09 May 2021
TL;DR: BoostGCN as discussed by the authors proposes a hardware-aware Partition-Centric Feature Aggregation (PCFA) scheme that leverages 3-D partitioning with the vertex-centric computing paradigm.
Abstract: Graph convolutional networks (GCNs) have revolutionized many big data applications, such as recommendation systems, traffic prediction, etc. However, accelerating GCN inference is challenging due to (1) massive external memory traffic and irregular memory access, (2) workload imbalance due to skewed degree distribution, and (3) intra-stage load imbalance caused by two heterogeneous computation phases of the algorithm. To address the above challenges, we propose a framework named BoostGCN to optimize GCN inference on FPGA. First, we develop a novel hardware-aware Partition-Centric Feature Aggregation (PCFA) scheme that leverages 3-D partitioning with the vertex-centric computing paradigm. This increases on-chip data reuse and reduces the total data communication volume with external memory. Second, we design a novel hardware architecture to enable pipelined execution of the two heterogeneous computation phases. We develop a low-overhead task scheduling strategy to reduce the pipeline stalls caused by the two computation phases. Third, we provide a complete GCN acceleration framework on FPGA with optimized RTL templates. It can generate hardware designs based on the customized configuration and is adaptable to various GCN models. Using our framework, we generate accelerators for various GCN models on a state-of-the-art FPGA platform and evaluate our designs using widely used datasets. Experimental results show that the accelerators produced by our framework achieve significant speedup compared with state-of-the-art implementations on CPU (≈ 100×), GPU (≈ 30×), prior FPGA accelerator (3-45)×.

Journal ArticleDOI
TL;DR: This article proposes a new paradigm of parallel EC algorithm by making the first attempt to parallelize the algorithm at the generation level, inspired by the industrial pipeline technique, and shows that generation-level parallelism is possible in EC algorithms and may have significant potential applications in time-consuming optimization problems.
Abstract: Due to the population-based and iterative characteristics of evolutionary computation (EC) algorithms, parallel techniques have been widely used to speed up EC algorithms. However, the parallelism usually operates at the population level, where multiple populations (or subpopulations) run in parallel, or at the individual level, where the individuals are distributed to multiple resources. That is, different populations or different individuals can be executed simultaneously to reduce running time. However, research into generation-level parallelism for EC algorithms has seldom been reported. In this article, we propose a new paradigm of parallel EC algorithm by making the first attempt to parallelize the algorithm at the generation level. This idea is inspired by the industrial pipeline technique. Specifically, a kind of EC algorithm called local-version particle swarm optimization (PSO) is adopted to implement a pipeline-based parallel PSO (PPPSO, i.e., P3SO). Due to the generation-level parallelism in P3SO, while some particles are still performing their evolutionary operations in the current generation, other particles can simultaneously move on to the next generation to carry out new evolutionary operations, or even advance to generations further ahead. The experimental results show that the problem-solving ability of P3SO is not affected while the evolutionary speed is substantially accelerated. Therefore, generation-level parallelism is possible in EC algorithms and may have significant potential applications in time-consuming optimization problems.
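
For reference, the sequential local-version (ring-topology) PSO that P3SO pipelines is sketched below; in P3SO a particle may begin its next-generation update as soon as its ring neighbours are up to date, rather than waiting for the whole swarm to finish the current generation. The hyperparameters and the sphere test function are illustrative.

    import numpy as np

    def sphere(x):
        return float(np.sum(x ** 2))

    rng = np.random.default_rng(3)
    N, D, GENS, W, C1, C2 = 20, 10, 100, 0.72, 1.49, 1.49
    X = rng.uniform(-5, 5, (N, D))
    V = np.zeros((N, D))
    pbest, pbest_val = X.copy(), np.array([sphere(x) for x in X])

    for _ in range(GENS):
        for i in range(N):                                   # particles processed one by one
            ring = [(i - 1) % N, i, (i + 1) % N]             # local (ring) neighbourhood
            lbest = pbest[min(ring, key=lambda k: pbest_val[k])]
            r1, r2 = rng.random(D), rng.random(D)
            V[i] = W * V[i] + C1 * r1 * (pbest[i] - X[i]) + C2 * r2 * (lbest - X[i])
            X[i] = X[i] + V[i]
            val = sphere(X[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = X[i].copy(), val

    print("best value:", pbest_val.min())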


Proceedings ArticleDOI
Shigang Li1, Torsten Hoefler1
14 Nov 2021
TL;DR: Chimera, as discussed by the authors, is a synchronous pipeline parallelism scheme which combines bidirectional pipelines for efficiently training large-scale models and is more convergence-friendly than asynchronous approaches.
Abstract: Training large deep learning models at scale is very challenging. This paper proposes Chimera, a novel pipeline parallelism scheme which combines bidirectional pipelines for efficiently training large-scale models. Chimera is a synchronous approach and therefore incurs no loss of accuracy, which makes it more convergence-friendly than asynchronous approaches. Compared with the latest synchronous pipeline approach, Chimera reduces the number of bubbles by up to 50%; benefiting from the sophisticated scheduling of bidirectional pipelines, Chimera has a more balanced activation memory consumption. Evaluations are conducted on Transformer-based language models. For a GPT-2 model with 1.3 billion parameters running on 2,048 GPU nodes of the Piz Daint supercomputer, Chimera improves the training throughput by 1.16x-2.34x over the state-of-the-art synchronous and asynchronous pipeline approaches.

Proceedings ArticleDOI
20 Aug 2021
TL;DR: In this paper, the authors introduced the causal method of fairness to reason about the fairness impact of data preprocessing stages in the ML pipeline and leveraged existing metrics to define the fairness measures of the stages.
Abstract: In recent years, many incidents have been reported where machine learning models exhibited discrimination among people based on race, sex, age, etc. Research has been conducted to measure and mitigate unfairness in machine learning models. For a machine learning task, it is a common practice to build a pipeline that includes an ordered set of data preprocessing stages followed by a classifier. However, most of the research on fairness has considered a single classifier based prediction task. What are the fairness impacts of the preprocessing stages in a machine learning pipeline? Furthermore, studies showed that often the root cause of unfairness is ingrained in the data itself, rather than the model. But no research has been conducted to measure the unfairness caused by a specific transformation made in the data preprocessing stage. In this paper, we introduced the causal method of fairness to reason about the fairness impact of data preprocessing stages in the ML pipeline. We leveraged existing metrics to define the fairness measures of the stages. Then we conducted a detailed fairness evaluation of the preprocessing stages in 37 pipelines collected from three different sources. Our results show that certain data transformers are causing the model to exhibit unfairness. We identified a number of fairness patterns in several categories of data transformers. Finally, we showed how the local fairness of a preprocessing stage composes into the global fairness of the pipeline. We used the fairness composition to choose an appropriate downstream transformer that mitigates unfairness in the machine learning pipeline.
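
A small sketch of the measurement idea, not the causal analysis itself: train the same classifier with and without one preprocessing transformer and compare a fairness metric (here, statistical parity difference) of its predictions across a protected group. The data, the chosen transformer and the metric below are hypothetical examples, not the paper's 37 pipelines.

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.preprocessing import StandardScaler

    rng = np.random.default_rng(4)
    n = 2000
    group = rng.integers(0, 2, n)                       # protected attribute
    X = np.column_stack([rng.standard_normal(n) + 0.5 * group, rng.standard_normal(n)])
    y = (X[:, 0] + 0.3 * rng.standard_normal(n) > 0.25).astype(int)

    def parity_diff(preds, group):
        # statistical parity difference: gap in positive-prediction rates
        return abs(preds[group == 1].mean() - preds[group == 0].mean())

    clf_raw = LogisticRegression().fit(X, y)
    Xs = StandardScaler().fit_transform(X)              # the preprocessing stage under test
    clf_pre = LogisticRegression().fit(Xs, y)

    print("SPD without transformer:", parity_diff(clf_raw.predict(X), group))
    print("SPD with transformer   :", parity_diff(clf_pre.predict(Xs), group))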

Journal ArticleDOI
TL;DR: A cooperative dispatching strategy for P2G and pipeline storage capability is presented to exploit the flexibility of the IES, in which the unbalanced wind power is converted into natural gas and stored in pipeline networks.
Abstract: A large number of renewable energy resources are integrated into the integrated energy system (IES), which complicates IES dispatching, especially for accommodating the anti-peak-regulation of wind power. To cope with that, a day-ahead IES optimal dispatching method considering power-to-gas (P2G) units and dynamic pipeline networks is proposed in this article. First, by introducing P2G, an IES structure based on the energy hub is established to implement bidirectional flow between the power and natural gas systems. Second, the dynamic characteristic of gas pipelines is modeled with energy storage capability, which can improve the flexibility of the natural gas system by regulating the pressure level of pipeline networks. Furthermore, a cooperative dispatching strategy for P2G and pipeline storage capability is presented to exploit the flexibility of the IES, in which the unbalanced wind power is converted into natural gas and stored in pipeline networks. Finally, case studies are carried out on the modified IEEE39-NGS20-HS20 and IEEE118-NGS40-HS20 IES systems with different typical wind power scenarios. The proposed cooperative dispatching strategy can effectively increase wind power consumption and reduce the operating cost of the whole system without a high computation burden.

Journal ArticleDOI
TL;DR: By using the EEG Integrated Platform Lossless (EEG-IP-L) pipeline's signal quality annotations, a significant increase in data retention is achieved when applying subsequent post-processing ERP segment rejection procedures, and it is demonstrated that the increase in data retention does not attenuate the ERP signal.

Proceedings ArticleDOI
26 Oct 2021
TL;DR: LineFS as mentioned in this paper decomposes DFS operations into execution stages that can be offloaded to a parallel datapath execution pipeline on the SmartNIC; it offloads CPU-intensive DFS tasks, like replication, compression, data publication, index and consistency management, to a SmartNIC.
Abstract: In multi-tenant systems, the CPU overhead of distributed file systems (DFSes) is increasingly a burden to application performance. CPU and memory interference cause degraded and unstable application and storage performance, in particular for operation latency. Recent client-local DFSes for persistent memory (PM) accelerate this trend. DFS offload to SmartNICs is a promising solution to these problems, but it is challenging to fit the complex demands of a DFS onto simple SmartNIC processors located across PCIe. We present LineFS, a SmartNIC-offloaded, high-performance DFS with support for client-local PM. To fully leverage the SmartNIC architecture, we decompose DFS operations into execution stages that can be offloaded to a parallel datapath execution pipeline on the SmartNIC. LineFS offloads CPU-intensive DFS tasks, like replication, compression, data publication, index and consistency management to a SmartNIC. We implement LineFS on the Mellanox BlueField SmartNIC and compare it to Assise, a state-of-the-art PM DFS. LineFS improves latency in LevelDB up to 80% and throughput in Filebench up to 79%, while providing extended DFS availability during host system failures.
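
A host-side toy of the stage decomposition: each stage runs on its own worker and passes results downstream through queues, so independent requests overlap across stages. On the real system the stages execute on SmartNIC cores, and the stage functions below are placeholders rather than LineFS operations.

    import queue, threading

    def run_stage(fn, q_in, q_out):
        # each stage runs on its own worker, mimicking one offloaded datapath stage
        while True:
            item = q_in.get()
            if item is None:                 # poison pill: shut the stage down
                q_out.put(None)
                break
            q_out.put(fn(item))

    q1, q2, q3 = queue.Queue(), queue.Queue(), queue.Queue()
    workers = [
        threading.Thread(target=run_stage, args=(lambda x: f"compressed({x})", q1, q2)),
        threading.Thread(target=run_stage, args=(lambda x: f"replicated({x})", q2, q3)),
    ]
    for w in workers:
        w.start()
    for i in range(3):
        q1.put(f"log-chunk-{i}")             # requests enter the datapath pipeline
    q1.put(None)

    results = []
    while (item := q3.get()) is not None:
        results.append(item)
    for w in workers:
        w.join()
    print(results)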