
Showing papers on "Pipeline (computing) published in 2021"


Journal ArticleDOI
TL;DR: This work designs and evaluates a machine learning pipeline for estimation of battery capacity fade—a metric of battery health—on 179 cells cycled under various conditions, and provides insights into the design of scalable data-driven models for battery SOH estimation, emphasizing the value of confidence bounds around the prediction.
Abstract: Lithium-ion batteries are ubiquitous in applications ranging from portable electronics to electric vehicles. Irrespective of the application, reliable real-time estimation of battery state of health (SOH) by on-board computers is crucial to the safe operation of the battery, ultimately safeguarding asset integrity. In this Article, we design and evaluate a machine learning pipeline for estimation of battery capacity fade—a metric of battery health—on 179 cells cycled under various conditions. The pipeline estimates battery SOH with an associated confidence interval by using two parametric and two non-parametric algorithms. Using segments of charge voltage and current curves, the pipeline engineers 30 features, performs automatic feature selection and calibrates the algorithms. When deployed on cells operated under the fast-charging protocol, the best model achieves a root-mean-squared error of 0.45%. This work provides insights into the design of scalable data-driven models for battery SOH estimation, emphasizing the value of confidence bounds around the prediction. The pipeline methodology combines experimental data with machine learning modelling and could be applied to other critical components that require real-time estimation of SOH. Rechargeable lithium-ion batteries play a crucial role in many modern-day applications, including portable electronics and electric vehicles, but they degrade over time. To ensure safe operation, a battery’s ‘state of health’ should be monitored in real time, and this machine learning pipeline, tested on a variety of charging conditions, can provide such an online estimation of battery state of health.
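
As a rough illustration of the kind of pipeline described above (engineered charge-curve features, automatic feature selection, and a model that reports a confidence interval), here is a minimal scikit-learn sketch on synthetic data. The feature set, the SelectKBest step and the Gaussian-process regressor are illustrative stand-ins, not the authors' actual 30 features or calibrated algorithms.

    # Minimal sketch (not the authors' code): charge-curve features -> feature
    # selection -> a regressor that returns a prediction with a confidence bound.
    import numpy as np
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.feature_selection import SelectKBest, f_regression
    from sklearn.gaussian_process import GaussianProcessRegressor

    def charge_curve_features(voltage, current):
        """Toy stand-ins for the engineered features (means, spreads, slope)."""
        return np.array([voltage.mean(), voltage.std(), np.ptp(voltage),
                         current.mean(), current.std(),
                         np.polyfit(np.arange(len(voltage)), voltage, 1)[0]])

    # Synthetic cells: each row is one cycle's feature vector, y is capacity fade.
    rng = np.random.default_rng(0)
    X = np.vstack([charge_curve_features(3.0 + 0.2 * rng.random(100),
                                         1.0 + 0.1 * rng.random(100))
                   for _ in range(200)])
    y = 1.0 - 0.001 * np.arange(200) + 0.005 * rng.standard_normal(200)

    model = Pipeline([
        ("scale", StandardScaler()),
        ("select", SelectKBest(f_regression, k=4)),   # automatic feature selection
        ("gp", GaussianProcessRegressor(normalize_y=True)),
    ])
    model.fit(X, y)
    mean, std = model.predict(X[:5], return_std=True)  # estimate with confidence bound
    print(mean, 1.96 * std)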

173 citations


Proceedings ArticleDOI
14 Nov 2021
TL;DR: In this paper, the authors propose a novel interleaved pipelining schedule that can improve throughput by 10+% with memory footprint comparable to existing approaches, allowing them to perform training iterations on a model with 1 trillion parameters at 502 petaFLOP/s on 3072 GPUs.
Abstract: Large language models have led to state-of-the-art accuracies across several tasks. However, training these models efficiently is challenging because: a) GPU memory capacity is limited, making it impossible to fit large models on even a multi-GPU server, and b) the number of compute operations required can result in unrealistically long training times. Consequently, new methods of model parallelism such as tensor and pipeline parallelism have been proposed. Unfortunately, naive usage of these methods leads to scaling issues at thousands of GPUs. In this paper, we show how tensor, pipeline, and data parallelism can be composed to scale to thousands of GPUs. We propose a novel interleaved pipelining schedule that can improve throughput by 10+% with memory footprint comparable to existing approaches. Our approach allows us to perform training iterations on a model with 1 trillion parameters at 502 petaFLOP/s on 3072 GPUs (per-GPU throughput of 52% of theoretical peak).
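
The throughput argument rests on shrinking the pipeline "bubble". Below is a back-of-the-envelope comparison using the commonly cited bubble-fraction expressions for the standard 1F1B schedule versus an interleaved schedule with v model chunks per device; treat the formulas as an approximation of the paper's analysis rather than a quotation of it.

    # Bubble fraction = idle time relative to ideal compute time, for p pipeline
    # stages, m microbatches and v interleaved model chunks per device.
    def bubble_fraction(p, m, v=1):
        return (p - 1) / (v * m)

    p, m = 8, 64
    print("1F1B        :", bubble_fraction(p, m))        # ~0.109
    print("interleaved :", bubble_fraction(p, m, v=4))   # ~0.027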

135 citations


Journal ArticleDOI
TL;DR: In this paper, the authors proposed a novel complex spectral mapping approach with a two-stage pipeline for monaural speech enhancement in the time-frequency domain, which decouple the primal problem into multiple sub-problems.
Abstract: For challenging acoustic scenarios such as low signal-to-noise ratios, current speech enhancement systems usually suffer from a performance bottleneck when extracting the target speech from the mixture in a single step. To address this issue, we propose a novel complex spectral mapping approach with a two-stage pipeline for monaural speech enhancement in the time-frequency domain. The proposed algorithm decouples the primal problem into multiple sub-problems, following the classic proverb, “two heads are better than one”. More specifically, in the first stage only the magnitude is estimated, which is combined with the noisy phase to obtain a coarse complex spectrum estimate. To refine this estimate, in the second stage an auxiliary network serves as a post-processing module, where residual noise is further suppressed and the phase information is effectively modified. A global residual connection strategy is adopted in the second stage to accelerate training convergence. To alleviate the parameter burden caused by the multi-stage pipeline, we propose a light-weight temporal convolutional module, which substantially decreases the number of trainable parameters and achieves even better objective performance than the original version. We conduct extensive experiments on three standard corpora: WSJ0-SI84, the DNS Challenge dataset, and the Voice Bank + DEMAND dataset. Objective test results demonstrate that our proposed approach achieves state-of-the-art performance over previous advanced systems under various conditions. Meanwhile, subjective listening test results further validate the superiority of our proposed method in terms of subjective quality.
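
A minimal numpy sketch of the two-stage decoupling described above; the placeholder functions merely stand in for the paper's magnitude-estimation and post-processing networks, and the "residual" term is illustrative only.

    import numpy as np

    def stage1_magnitude(noisy_mag):
        # placeholder for the magnitude-estimation network (crude spectral floor)
        return np.maximum(noisy_mag - 0.5 * noisy_mag.mean(axis=-1, keepdims=True), 0.0)

    def stage2_refine(coarse_complex):
        # placeholder for the post-processing network; global residual connection:
        # output = coarse estimate + residual correction (here a tiny damping term)
        residual = -0.05 * coarse_complex
        return coarse_complex + residual

    noisy_spec = np.random.randn(257, 100) + 1j * np.random.randn(257, 100)  # F x T STFT
    mag, phase = np.abs(noisy_spec), np.angle(noisy_spec)

    coarse = stage1_magnitude(mag) * np.exp(1j * phase)  # stage 1: magnitude + noisy phase
    enhanced = stage2_refine(coarse)                     # stage 2: suppress residual noise
    print(enhanced.shape)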

95 citations


Journal ArticleDOI
TL;DR: This paper provides a block-wise network generation pipeline called BlockQNN which automatically builds high-performance networks using the Q-Learning paradigm with epsilon-greedy exploration strategy and proposes a distributed asynchronous framework and an early stop strategy.
Abstract: Convolutional neural networks have gained a remarkable success in computer vision. However, most popular network architectures are hand-crafted and usually require expertise and elaborate design. In this paper, we provide a block-wise network generation pipeline called BlockQNN which automatically builds high-performance networks using the Q-Learning paradigm with epsilon-greedy exploration strategy. The optimal network block is constructed by the learning agent which is trained to choose component layers sequentially. We stack the block to construct the whole auto-generated network. To accelerate the generation process, we also propose a distributed asynchronous framework and an early stop strategy. The block-wise generation brings unique advantages: (1) it yields state-of-the-art results in comparison to the hand-crafted networks on image classification, particularly, the best network generated by BlockQNN achieves 2.35 percent top-1 error rate on CIFAR-10. (2) it offers tremendous reduction of the search space in designing networks, spending only 3 days with 32 GPUs. A faster version can yield a comparable result with only 1 GPU in 20 hours. (3) it has strong generalizability in that the network built on CIFAR also performs well on the larger-scale dataset. The best network achieves very competitive accuracy of 82.0 percent top-1 and 96.0 percent top-5 on ImageNet.
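
To make the generation procedure concrete, here is a toy epsilon-greedy Q-learning loop in the same spirit: an agent picks component layers one step at a time and is rewarded for the assembled block. The layer vocabulary, reward function and hyperparameters are invented for illustration and are not BlockQNN's actual search space.

    import random

    LAYERS = ["conv3x3", "conv5x5", "maxpool", "identity"]
    BLOCK_LEN, EPISODES, EPS, ALPHA, GAMMA = 4, 500, 0.1, 0.2, 0.9
    Q = {(s, a): 0.0 for s in range(BLOCK_LEN) for a in LAYERS}

    def fake_reward(block):
        # stand-in for "train the assembled block and report validation accuracy"
        return block.count("conv3x3") * 0.2 + block.count("maxpool") * 0.1

    for _ in range(EPISODES):
        block = []
        for step in range(BLOCK_LEN):
            if random.random() < EPS:
                a = random.choice(LAYERS)                       # explore
            else:
                a = max(LAYERS, key=lambda x: Q[(step, x)])     # exploit
            block.append(a)
            r = fake_reward(block) if step == BLOCK_LEN - 1 else 0.0
            nxt = 0.0 if step == BLOCK_LEN - 1 else max(Q[(step + 1, x)] for x in LAYERS)
            Q[(step, a)] += ALPHA * (r + GAMMA * nxt - Q[(step, a)])

    print([max(LAYERS, key=lambda x: Q[(s, x)]) for s in range(BLOCK_LEN)])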

91 citations


Journal ArticleDOI
TL;DR: A hybrid intelligent method named PCA-CPSO-SVR, which combines support vector regression, principal component analysis, and chaos particle swarm optimization, is proposed to predict the corrosion rate of multiphase flow pipelines and shows good performance.

84 citations


Journal ArticleDOI
TL;DR: A collection of optimizing transformations for HLS, targeting scalable and efficient architectures for high-performance computing (HPC) applications, is presented, aiming to establish a common toolbox to guide both performance engineers and compiler engineers in tapping into the performance potential offered by spatial computing architectures using HLS.
Abstract: Spatial computing architectures promise a major stride in performance and energy efficiency over the traditional load/store devices currently employed in large scale computing systems. The adoption of high-level synthesis (HLS) from languages such as C++ and OpenCL has greatly increased programmer productivity when designing for such platforms. While this has enabled a wider audience to target spatial computing architectures, the optimization principles known from traditional software design are no longer sufficient to implement high-performance codes, due to fundamentally distinct aspects of hardware design, such as programming for deep pipelines, distributed memory resources, and scalable routing. To alleviate this, we present a collection of optimizing transformations for HLS, targeting scalable and efficient architectures for high-performance computing (HPC) applications. We systematically identify classes of transformations (pipelining, scalability, and memory), the characteristics of their effect on the HLS code and the resulting hardware (e.g., increasing data reuse or resource consumption), and the objectives that each transformation can target (e.g., resolve interface contention, or increase parallelism). We show how these can be used to efficiently exploit pipelining, on-chip distributed fast memory, and on-chip dataflow, allowing for massively parallel architectures. To quantify the effect of various transformations, we cover the optimization process of a sample set of HPC kernels, provided as open source reference codes. We aim to establish a common toolbox to guide both performance engineers and compiler engineers in tapping into the performance potential offered by spatial computing architectures using HLS.

83 citations


Journal ArticleDOI
TL;DR: This article investigates a collaborative computation offloading, computation and communication resource allocation scheme, and develops a collaborative computing framework in which the tasks of mobile devices can be partially processed at the terminals, edge nodes (EN) and cloud center (CC).
Abstract: Mobile edge computing (MEC) is an emerging computing paradigm for enabling low-latency, high-bandwidth and agile mobile services by deploying computing platforms at the edge of the network. In order to improve the cloud-edge-end processing efficiency of tasks within the limited computation and communication capabilities, in this article we investigate collaborative computation offloading together with computation and communication resource allocation, and develop a collaborative computing framework in which the tasks of mobile devices (MDs) can be partially processed at the terminals, edge nodes (EN) and cloud center (CC). We then propose a pipeline-based offloading scheme, where both MDs and ENs can offload computation-intensive tasks to a particular EN and the CC, according to their respective computation and communication capacities. Based on the proposed pipeline offloading strategy, a problem minimizing the sum latency of all MDs is formulated, jointly considering the offloading strategy, computation resources, delivery rates and power allocation; the resulting problem is non-convex and difficult to solve directly. To solve it, we use the classic successive convex approximation (SCA) approach to transform the non-convex optimization problem into a convex one. Finally, simulation results indicate that the proposed collaborative offloading scheme with the pipeline strategy is efficient and outperforms other offloading schemes.
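
A toy latency model (not the paper's SCA formulation) helps illustrate the terminal/EN/CC split: each branch's completion time is bounded by its transmission plus compute time, and the task finishes when the slowest branch does. All rates, CPU frequencies and the cycles-per-bit constant below are made-up example values.

    def completion_time(D, split, f_md, f_en, f_cc, r_md_en, r_en_cc, cycles_per_bit=100):
        a, b, c = split                       # fractions processed at MD, EN, CC (a+b+c == 1)
        t_local = a * D * cycles_per_bit / f_md
        t_en = (b + c) * D / r_md_en + b * D * cycles_per_bit / f_en
        t_cc = (b + c) * D / r_md_en + c * D / r_en_cc + c * D * cycles_per_bit / f_cc
        return max(t_local, t_en, t_cc)       # parallel branches finish when the slowest does

    print(completion_time(D=1e6, split=(0.2, 0.5, 0.3),
                          f_md=1e9, f_en=5e9, f_cc=2e10, r_md_en=2e7, r_en_cc=1e8))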

82 citations


Journal ArticleDOI
TL;DR: A review of general approaches based on passive and active control technologies is presented, including the optimal layout technique for pipelines and clamps, the constrained layer damping technique, the vibration absorber technique, the hydraulic hose technique, the optimal pump structure technique, and the active vibration control technique for pipeline systems.

70 citations


Journal ArticleDOI
TL;DR: A deep fully convolutional neural network, DeepRx, is proposed, which executes the whole receiver pipeline from the frequency-domain signal stream to uncoded bits in a 5G-compliant fashion and outperforms traditional methods.
Abstract: Deep learning has solved many problems that are out of reach of heuristic algorithms. It has also been successfully applied in wireless communications, even though the current radio systems are well-understood and optimal algorithms exist for many tasks. While some gains have been obtained by learning individual parts of a receiver, a better approach is to jointly learn the whole receiver. This, however, often results in a challenging nonlinear problem, for which the optimal solution is infeasible to implement. To this end, we propose a deep fully convolutional neural network, DeepRx, which executes the whole receiver pipeline from frequency domain signal stream to uncoded bits in a 5G-compliant fashion. We facilitate accurate channel estimation by constructing the input of the convolutional neural network in a very specific manner using both the data and pilot symbols. Also, DeepRx outputs soft bits that are compatible with the channel coding used in 5G systems. Using 3GPP-defined channel models, we demonstrate that DeepRx outperforms traditional methods. We also show that the high performance can likely be attributed to DeepRx learning to utilize the known constellation points of the unknown data symbols, together with the local symbol distribution, for improved detection accuracy.

63 citations


Journal ArticleDOI
TL;DR: In this article, a generative adversarial network based on trinetworks form (tnGAN) is proposed to handle leak detection problems with incomplete sensor data; it can handle different incomplete-data recovery situations, such as individually lost and randomly missing measurements.
Abstract: Due to the widely deployed sensors in the pipeline network, the data-driven detection method is a natural choice with multiple sensor measurements. However, the incomplete data problem caused by device failure or network interruption seriously hinders the implementation of pipeline status monitoring. To address this difficulty, this article proposes a generative adversarial network based on trinetworks form (tnGAN) to handle leak detection problems with incomplete sensor data. First, the generative model is proposed to recover incomplete data by fully exploiting the same-level nature similarity of data features. Therein, the same type of sensor data, obtained from the pipeline network, is used as the input. Next, to further exploit the temporal evolvement characteristics and the spatial similarity, a multiview awareness strategy is incorporated in the established model to facilitate the integration of inherent information. Then, a dual-discriminative network architecture is proposed to detect the pipeline status by computing the similarity of the latent features of samples. With the abovementioned structure, the proposed method can handle different incomplete-data recovery situations, such as individually lost and randomly missing measurements. In addition, it can also aggregate the outputs and features of the discriminative networks to obtain the pipeline leak detection result. Finally, the experimental results on a pipeline network demonstrate the capability and effectiveness of the proposed method in both data recovery and leak detection.

61 citations


Journal ArticleDOI
TL;DR: Wang et al. as mentioned in this paper proposed a novel pipeline for fall detection based on wearable accelerometer data; three publicly available datasets were used to validate the proposed method, and more than 7700 cross-disciplinary time-series features were investigated for each of the datasets.
Abstract: Falls cause trauma or critical injury among the geriatric population and are the second leading accidental cause of post-injury mortality around the world. It is crucial to keep elderly people under supervision while ensuring proper privacy and comfort. Thus, elderly fall detection and prediction using wearable/non-wearable sensors has become an active field of research. In this work, a novel pipeline for fall detection based on wearable accelerometer data is proposed. Three publicly available datasets have been used to validate the proposed method, and more than 7700 cross-disciplinary time-series features were investigated for each of the datasets. After applying a series of feature reduction techniques (mutual information, removal of highly correlated features using the Pearson correlation coefficient, and the Boruta algorithm), we obtained the dominant features for each dataset. Different classical machine learning (ML) algorithms were utilized to detect falls based on the obtained features. For individual datasets, the simple ML classifiers achieved very good accuracy. We trained our pipeline on two of the three datasets and tested on the remaining one, rotating until each of the three datasets had served as the test set, to show the generalization capability of our proposed pipeline. A set of 39 high-performing features was selected, and the classifiers were trained with them. In all cases, the proposed pipeline showed excellent efficiency in detecting falls. This architecture performed better than most existing works on all of the publicly available datasets used, demonstrating the strength of the proposed data analysis pipeline.
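
A condensed sketch of the feature-reduction chain described above (mutual information followed by Pearson-correlation pruning, then a classical classifier), on synthetic data; the Boruta step and the ~7700 extracted time-series features are omitted for brevity, and the thresholds are illustrative.

    import numpy as np
    import pandas as pd
    from sklearn.feature_selection import mutual_info_classif
    from sklearn.ensemble import RandomForestClassifier

    rng = np.random.default_rng(1)
    X = pd.DataFrame(rng.standard_normal((300, 50)), columns=[f"f{i}" for i in range(50)])
    y = (X["f0"] + 0.5 * X["f1"] + 0.1 * rng.standard_normal(300) > 0).astype(int)

    mi = mutual_info_classif(X, y, random_state=0)
    keep = X.columns[np.argsort(mi)[-20:]]                    # keep most informative features
    X_mi = X[keep]

    corr = X_mi.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    drop = [c for c in upper.columns if (upper[c] > 0.9).any()]   # prune highly correlated ones
    X_sel = X_mi.drop(columns=drop)

    clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_sel, y)
    print(X_sel.shape, clf.score(X_sel, y))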

Proceedings ArticleDOI
01 Jun 2021
TL;DR: Seesaw Loss as discussed by the authors dynamically re-balances gradients of positive and negative samples for each category, with two complementary factors, i.e., a mitigation factor and a compensation factor.
Abstract: Instance segmentation has witnessed a remarkable progress on class-balanced benchmarks. However, they fail to perform as accurately in real-world scenarios, where the category distribution of objects naturally comes with a long tail. Instances of head classes dominate a long-tailed dataset and they serve as negative samples of tail categories. The overwhelming gradients of negative samples on tail classes lead to a biased learning process for classifiers. Consequently, objects of tail categories are more likely to be misclassified as backgrounds or head categories. To tackle this problem, we propose Seesaw Loss to dynamically re-balance gradients of positive and negative samples for each category, with two complementary factors, i.e., mitigation factor and compensation factor. The mitigation factor reduces punishments to tail categories w.r.t. the ratio of cumulative training instances between different categories. Meanwhile, the compensation factor increases the penalty of misclassified instances to avoid false positives of tail categories. We conduct extensive experiments on Seesaw Loss with mainstream frameworks and different data sampling strategies. With a simple end-to-end training pipeline, Seesaw Loss obtains significant gains over Cross-Entropy Loss, and achieves state-of-the-art performance on LVIS dataset without bells and whistles. Code is available at https://github.com/open-mmlab/mmdetection.
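
A rough numpy sketch of the seesaw re-weighting of negative logits, with a mitigation factor driven by cumulative class counts and a compensation factor driven by the predicted probabilities; verify the exact formulation and the exponents p and q against the official mmdetection implementation linked above.

    import numpy as np

    def seesaw_ce(logits, target, class_counts, p=0.8, q=2.0):
        """logits: (C,), target: int, class_counts: cumulative instances per class."""
        C = logits.shape[0]
        sigma = np.exp(logits - logits.max()); sigma /= sigma.sum()
        i, N = target, np.asarray(class_counts, dtype=float)

        S = np.ones(C)
        for j in range(C):
            if j == i:
                continue
            M = (N[j] / N[i]) ** p if N[j] < N[i] else 1.0                       # mitigation
            Ccomp = (sigma[j] / sigma[i]) ** q if sigma[j] > sigma[i] else 1.0   # compensation
            S[j] = M * Ccomp

        weighted = S * np.exp(logits - logits.max())   # re-weighted negative terms
        prob_i = weighted[i] / weighted.sum()
        return -np.log(prob_i)

    print(seesaw_ce(np.array([2.0, 0.5, -1.0]), target=2, class_counts=[1000, 100, 10]))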

Journal ArticleDOI
TL;DR: Low-power, high-speed hardware architectures for an efficient field programmable gate array (FPGA) implementation of the advanced encryption standard (AES) algorithm are presented to provide data security, and a modified positive polarity Reed-Muller (MPPRM) architecture is inserted to reduce the total hardware requirements.
Abstract: Nowadays, a huge amount of digital data is frequently exchanged among different embedded devices over wireless communication technologies. Data security is considered an important parameter for avoiding information loss and preventing cyber-crimes. This research article details low-power, high-speed hardware architectures for an efficient field programmable gate array (FPGA) implementation of the advanced encryption standard (AES) algorithm to provide data security. This work does not depend on look-up tables (LUTs) for the implementation of the SubBytes and InvSubBytes stages of the AES encryption and decryption transformations; instead, the new architecture uses combinational logic circuits to implement the SubBytes and InvSubBytes transformations. Due to the elimination of LUTs, unwanted delays are removed from this architecture, and a subpipelining structure is introduced to improve the speed of the AES algorithm. Here, a modified positive polarity Reed-Muller (MPPRM) architecture is inserted to reduce the total hardware requirements, and comparisons are made with different implementations. With the MPPRM architecture introduced in the SubBytes stage, an efficient MixColumns and InvMixColumns architecture suited to subpipelined round units is added. The performance of the proposed AES-MPPRM architecture is analyzed in terms of number of slice registers, flip flops, number of slice LUTs, number of logical elements, slices, bonded IOBs, operating frequency and delay. It is compared against five different AES architectures: LAES, AES-CTR, AES-CFA, AES-BSRD, and AES-EMCBE. The LUT count of the AES-MPPRM architecture designed on the Spartan-6 is reduced by up to 15.45% when compared to AES-BSRD.
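
The LUT-free SubBytes idea can be mimicked in software by computing each substitution algebraically, i.e. a GF(2^8) multiplicative inverse followed by the AES affine transform, instead of reading a stored table. The sketch below is only a functional illustration of that substitution, not the proposed hardware.

    def gf_mul(a, b, mod=0x11B):
        # carry-less multiply with reduction by the AES polynomial x^8+x^4+x^3+x+1
        r = 0
        while b:
            if b & 1:
                r ^= a
            a <<= 1
            if a & 0x100:
                a ^= mod
            b >>= 1
        return r

    def gf_inv(a):
        if a == 0:
            return 0
        return next(x for x in range(1, 256) if gf_mul(a, x) == 1)

    def rotl8(x, n):
        return ((x << n) | (x >> (8 - n))) & 0xFF

    def sub_byte(x):
        b = gf_inv(x)                                    # multiplicative inverse in GF(2^8)
        return b ^ rotl8(b, 1) ^ rotl8(b, 2) ^ rotl8(b, 3) ^ rotl8(b, 4) ^ 0x63  # affine map

    assert sub_byte(0x00) == 0x63 and sub_byte(0x53) == 0xED  # spot-check vs. the AES S-box
    print(hex(sub_byte(0x53)))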

Journal ArticleDOI
Lei Xu, Lei Hou, Zhenyu Zhu, Yu Li, Jiaquan Liu, Ting Lei, Xingguang Wu
01 May 2021-Energy
TL;DR: The GA-SVM hybrid model is the most effective at improving predictive accuracy, and its forecast results show the best agreement with the actual data.

Proceedings ArticleDOI
17 Feb 2021
TL;DR: Guo et al. as mentioned in this paper proposed AutoBridge, an automated framework that couples a coarse-grained floorplanning step with pipelining during HLS compilation, allowing HLS to more easily identify and pipeline the long wires, especially those crossing the die boundaries.
Abstract: Despite an increasing adoption of high-level synthesis (HLS) for its design productivity advantages, there remains a significant gap in the achievable clock frequency between an HLS-generated design and a handcrafted RTL one. A key factor that limits the timing quality of the HLS outputs is the difficulty in accurately estimating the interconnect delay at the HLS level. Unfortunately, this problem becomes even worse when large HLS designs are implemented on the latest multi-die FPGAs, where die-crossing interconnects incur a high delay penalty. To tackle this challenge, we propose AutoBridge, an automated framework that couples a coarse-grained floorplanning step with pipelining during HLS compilation. First, our approach provides HLS with a view on the global physical layout of the design, allowing HLS to more easily identify and pipeline the long wires, especially those crossing the die boundaries. Second, by exploiting the flexibility of HLS pipelining, the floorplanner is able to distribute the design logic across multiple dies on the FPGA device without degrading clock frequency. This prevents the placer from aggressively packing the logic on a single die which often results in local routing congestion that eventually degrades timing. Since pipelining may introduce additional latency, we further present analysis and algorithms to ensure the added latency will not compromise the overall throughput. AutoBridge can be integrated into the existing CAD toolflow for Xilinx FPGAs. In our experiments with a total of 43 design configurations, we improve the average frequency from 147 MHz to 297 MHz (a 102% improvement) with no loss of throughput and a negligible change in resource utilization. Notably, in 16 experiments we make the originally unroutable designs achieve 274 MHz on average. The tool is available at https://github.com/Licheng-Guo/AutoBridge.

Journal ArticleDOI
TL;DR: In this paper, an energy-efficient configurable crypto-processor supporting multi-security-level key encapsulation mechanism of Saber, is proposed, where an 8-level hierarchical Karatsuba framework is utilized to reduce degree-256 polynomial multiplication to the coefficient-wise multiplication.
Abstract: Saber, the only module-learning-with-rounding-based algorithm in NIST’s third round of the post-quantum cryptography (PQC) standardization process, is characterized by simplicity and flexibility. However, energy-efficient implementation of Saber is still under investigation since the commonly used number theoretic transform cannot be utilized directly. In this manuscript, an energy-efficient configurable crypto-processor supporting the multi-security-level key encapsulation mechanism of Saber is proposed. First, an 8-level hierarchical Karatsuba framework is utilized to reduce degree-256 polynomial multiplication to coefficient-wise multiplication. Second, a hardware-efficient Karatsuba scheduling strategy and an optimized pre-/post-processing structure are designed to reduce the area overheads of the scheduling strategy. Third, a task-rescheduling-based pipeline strategy and truncated multipliers are proposed to enable fine-grained processing. Moreover, multiple parameter sets are supported in LWRpro to enable configurability among various security scenarios. Enabled by these optimizations, LWRpro requires 1066, 1456 and 1701 clock cycles for key generation, encapsulation, and decapsulation of Saber768. The post-layout version of LWRpro is implemented in a TSMC 40 nm CMOS process within 0.38 mm². The throughput for Saber768 is up to 275k encapsulation operations per second and the energy efficiency is 0.15 μJ/encapsulation while operating at 400 MHz, achieving nearly 50× and 31× improvements, respectively, compared with current PQC hardware solutions.
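
The hierarchical Karatsuba idea reduces a large polynomial product to many small coefficient-wise products. A plain recursive software version of Karatsuba multiplication for coefficient-form polynomials is sketched below; the negacyclic reduction modulo x^256 + 1 and the modulus q of Saber are deliberately omitted.

    def poly_add(a, b):
        n = max(len(a), len(b))
        return [(a[i] if i < len(a) else 0) + (b[i] if i < len(b) else 0) for i in range(n)]

    def karatsuba(a, b):
        # a and b are equal-length coefficient lists (lowest degree first)
        n = len(a)
        if n <= 1:                                # base case: coefficient-wise product
            return [a[0] * b[0]] if a and b else []
        m = n // 2
        a0, a1, b0, b1 = a[:m], a[m:], b[:m], b[m:]
        z0 = karatsuba(a0, b0)
        z2 = karatsuba(a1, b1)
        z1 = karatsuba(poly_add(a0, a1), poly_add(b0, b1))
        z1 = [z1[i] - (z0[i] if i < len(z0) else 0) - (z2[i] if i < len(z2) else 0)
              for i in range(len(z1))]
        out = [0] * (2 * n - 1)
        for i, c in enumerate(z0):
            out[i] += c
        for i, c in enumerate(z1):
            out[i + m] += c
        for i, c in enumerate(z2):
            out[i + 2 * m] += c
        return out

    print(karatsuba([1, 2, 3, 4], [5, 6, 7, 8]))  # matches the schoolbook product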

Journal ArticleDOI
TL;DR: In this paper, the outbound pipeline of the Yongchang pressure station is taken as the research object and a vibration analysis of the station yard pipeline is carried out; three kinds of vibration reduction schemes are then proposed and verified by simulation.
Abstract: The abnormal vibration of natural gas station pipelines seriously threatens the safety of pipeline transportation, and improper handling will cause huge economic losses. For the abnormal vibration of the pipeline, reasonable treatment must be carried out. The Yongchang gas station belongs to the west–east gas pipeline system in China. Since its production, abnormal vibration has often occurred in the west-third outbound pipeline of the Yongchang gas station, and the vibration changes according to the different gas transport volumes. In this paper, the outbound pipeline of the Yongchang pressure station is taken as the research object, and the vibration analysis of the station yard pipeline is carried out. The numerical model of the station yard pipeline is established, and the correctness of the model is verified by the field vibration test. The fluid–solid coupling method is used to analyze pipeline vibration under different working conditions. Then, three kinds of vibration reduction schemes are proposed and verified by simulation. The main conclusions are as follows: (1) The fluid pressure fluctuation in the pipeline is the root cause of abnormal vibration in the station. (2) When the gas transmission volume is large, the vibration of the pipeline system will become more severe. (3) The scheme of increasing pipe diameter and adding appropriate constraints has the best vibration reduction effect.

Journal ArticleDOI
TL;DR: A deep learning framework that incorporates both multi-atlas registration and level-set methods for segmenting the pancreas from CT volume images is proposed and validated; it achieves an average Dice score over 82%, superior or comparable to other existing state-of-the-art pancreas segmentation algorithms.

Journal ArticleDOI
TL;DR: This work presents a new, reproducible pipeline in R that allows for relatively simple fitting of 24 different TPC models using nonlinear least squares (NLLS) regression and demonstrates how this pipeline can be combined with other packages in R to robustly and reproducibly fit multiple mathematical models to multiple TPC datasets at once.
Abstract: 1. The quantification of thermal performance curves (TPCs) for biological rates has many applications to problems such as predicting species’ responses to climate change. There is currently no widely used open-source pipeline to fit mathematical TPC models to data, which limits the transparency and reproducibility of the curve fitting process underlying applications of TPCs. 2. We present a new pipeline in R that currently allows for reproducible fitting of 24 different TPC models using non-linear least squares (NLLS) regression. The pipeline consists of two packages - rTPC and nls.multstart - that allow multiple start values for NLLS fitting and provide helper functions for setting start parameters. This pipeline overcomes previous problems that have made NLLS fitting and estimation of key parameters difficult or unreliable. 3. We demonstrate how rTPC and nls.multstart can be combined with other packages in R to robustly and reproducibly fit multiple models to multiple TPC datasets at once. In addition, we show how model selection or averaging, weighted model fitting, and bootstrapping can easily be implemented within the pipeline. 4. This new pipeline provides a flexible and reproducible approach that makes the challenging task of fitting multiple TPC models to data accessible to a wide range of users.
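
For readers outside R, the multi-start NLLS idea behind rTPC and nls.multstart can be approximated as follows: repeatedly fit from random start values and keep the best fit. The Gaussian-shaped curve used here is just a stand-in for the 24 published TPC models the actual pipeline supports.

    import numpy as np
    from scipy.optimize import curve_fit

    def gaussian_tpc(T, rmax, Topt, width):
        return rmax * np.exp(-0.5 * ((T - Topt) / width) ** 2)

    rng = np.random.default_rng(2)
    T = np.linspace(5, 45, 30)
    rate = gaussian_tpc(T, 1.2, 28.0, 6.0) + 0.05 * rng.standard_normal(T.size)

    best, best_sse = None, np.inf
    for _ in range(50):                                   # many random start values
        p0 = [rng.uniform(0.1, 3), rng.uniform(5, 45), rng.uniform(1, 15)]
        try:
            popt, _ = curve_fit(gaussian_tpc, T, rate, p0=p0, maxfev=2000)
        except RuntimeError:
            continue                                      # this start did not converge
        sse = np.sum((gaussian_tpc(T, *popt) - rate) ** 2)
        if sse < best_sse:
            best, best_sse = popt, sse

    print("rmax, Topt, width:", best, "SSE:", best_sse)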

Proceedings ArticleDOI
22 May 2021
TL;DR: In this paper, the Transformer architecture is used to predict the next token in the list of potential code completions in the IDE at cursor position, and it outperforms previous state-of-the-art next token prediction systems by margins ranging from 14% to 18%.
Abstract: Code prediction, more specifically autocomplete, has become an essential feature in modern IDEs. Autocomplete is more effective when the desired next token is at (or close to) the top of the list of potential completions offered by the IDE at cursor position. This is where the strength of the underlying machine learning system that produces a ranked order of potential completions comes into play. We advance the state-of-the-art in the accuracy of code prediction (next token prediction) used in autocomplete systems. Our work uses Transformers as the base neural architecture. We show that by making the Transformer architecture aware of the syntactic structure of code, we increase the margin by which a Transformer-based system outperforms previous systems. With this, it outperforms the accuracy of several state-of-the-art next token prediction systems by margins ranging from 14% to 18%. We present in the paper several ways of communicating the code structure to the Transformer, which is fundamentally built for processing sequence data. We provide a comprehensive experimental evaluation of our proposal, along with alternative design choices, on a standard Python dataset, as well as on a company internal Python corpus. Our code and data preparation pipeline will be available in open source.

Proceedings ArticleDOI
17 Feb 2021
TL;DR: DAPPLE as mentioned in this paper is a synchronous training framework which combines data parallelism and pipeline parallelism for large DNN models, and it features a novel parallelization strategy planner to solve the partition and placement problems.
Abstract: It is a challenging task to train large DNN models on sophisticated GPU platforms with diversified interconnect capabilities. Recently, pipelined training has been proposed as an effective approach for improving device utilization. However, there are still several tricky issues to address: improving computing efficiency while ensuring convergence, and reducing memory usage without incurring additional computing costs. We propose DAPPLE, a synchronous training framework which combines data parallelism and pipeline parallelism for large DNN models. It features a novel parallelization strategy planner to solve the partition and placement problems, and explores the optimal hybrid strategies of data and pipeline parallelism. We also propose a new runtime scheduling algorithm to reduce device memory usage, which is orthogonal to the re-computation approach and does not come at the expense of training throughput. Experiments show that the DAPPLE planner consistently outperforms strategies generated by PipeDream's planner by up to 3.23× speedup under synchronous training scenarios, and the DAPPLE runtime outperforms GPipe by 1.6× speedup of training throughput and saves 12% of memory consumption at the same time.

Proceedings ArticleDOI
09 May 2021
TL;DR: BoostGCN as discussed by the authors proposes a hardware-aware Partition-Centric Feature Aggregation (PCFA) scheme that leverages 3-D partitioning with the vertex-centric computing paradigm.
Abstract: Graph convolutional networks (GCNs) have revolutionized many big data applications, such as recommendation systems, traffic prediction, etc. However, accelerating GCN inference is challenging due to (1) massive external memory traffic and irregular memory access, (2) workload imbalance due to skewed degree distribution, and (3) intra-stage load imbalance caused by two heterogeneous computation phases of the algorithm. To address the above challenges, we propose a framework named BoostGCN to optimize GCN inference on FPGA. First, we develop a novel hardware-aware Partition-Centric Feature Aggregation (PCFA) scheme that leverages 3-D partitioning with the vertex-centric computing paradigm. This increases on-chip data reuse and reduces the total data communication volume with external memory. Second, we design a novel hardware architecture to enable pipelined execution of the two heterogeneous computation phases. We develop a low-overhead task scheduling strategy to reduce the pipeline stalls caused by the two computation phases. Third, we provide a complete GCN acceleration framework on FPGA with optimized RTL templates. It can generate hardware designs based on the customized configuration and is adaptable to various GCN models. Using our framework, we generate accelerators for various GCN models on a state-of-the-art FPGA platform and evaluate our designs using widely used datasets. Experimental results show that the accelerators produced by our framework achieve significant speedup compared with state-of-the-art implementations on CPU (≈ 100×), GPU (≈ 30×), prior FPGA accelerator (3-45)×.

Journal ArticleDOI
TL;DR: This article proposes a new paradigm of parallel EC algorithm by making the first attempt to parallelize the algorithm at the generation level, inspired by the industrial pipeline technique, and shows that generation-level parallelism is possible in EC algorithms and may have significant potential applications in time-consuming optimization problems.
Abstract: Due to the population-based and iterative characteristics of evolutionary computation (EC) algorithms, parallel techniques have been widely used to speed up EC algorithms. However, the parallelism usually operates at the population level, where multiple populations (or subpopulations) run in parallel, or at the individual level, where the individuals are distributed to multiple resources. That is, different populations or different individuals can be executed simultaneously to reduce running time. However, research into generation-level parallelism for EC algorithms has seldom been reported. In this article, we propose a new paradigm of parallel EC algorithm by making the first attempt to parallelize the algorithm at the generation level. This idea is inspired by the industrial pipeline technique. Specifically, a kind of EC algorithm called local-version particle swarm optimization (PSO) is adopted to implement a pipeline-based parallel PSO (PPPSO, i.e., P3SO). Due to the generation-level parallelism in P3SO, while some particles are still performing their evolutionary operations in the current generation, other particles can simultaneously move on to the next generation to carry out new evolutionary operations, or even advance to generations further ahead. The experimental results show that the problem-solving ability of P3SO is not affected while the evolutionary speed is substantially accelerated. Therefore, generation-level parallelism is possible in EC algorithms and may have significant potential applications in time-consuming optimization problems.
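
For reference, the sequential local-version (ring-topology) PSO that P3SO pipelines is sketched below; in P3SO a particle may begin its next-generation update as soon as its ring neighbours are up to date, rather than waiting for the whole swarm to finish the current generation. The hyperparameters and the sphere test function are illustrative.

    import numpy as np

    def sphere(x):
        return float(np.sum(x ** 2))

    rng = np.random.default_rng(3)
    N, D, GENS, W, C1, C2 = 20, 10, 100, 0.72, 1.49, 1.49
    X = rng.uniform(-5, 5, (N, D))
    V = np.zeros((N, D))
    pbest, pbest_val = X.copy(), np.array([sphere(x) for x in X])

    for _ in range(GENS):
        for i in range(N):                                   # particles processed one by one
            ring = [(i - 1) % N, i, (i + 1) % N]             # local (ring) neighbourhood
            lbest = pbest[min(ring, key=lambda k: pbest_val[k])]
            r1, r2 = rng.random(D), rng.random(D)
            V[i] = W * V[i] + C1 * r1 * (pbest[i] - X[i]) + C2 * r2 * (lbest - X[i])
            X[i] = X[i] + V[i]
            val = sphere(X[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = X[i].copy(), val

    print("best value:", pbest_val.min())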


Proceedings ArticleDOI
Shigang Li1, Torsten Hoefler1
14 Nov 2021
TL;DR: Chimera, as discussed by the authors, is a synchronous pipeline parallelism scheme which combines bidirectional pipelines for efficiently training large-scale models and is more convergence-friendly than asynchronous approaches.
Abstract: Training large deep learning models at scale is very challenging. This paper proposes Chimera, a novel pipeline parallelism scheme which combines bidirectional pipelines for efficiently training large-scale models. Chimera is a synchronous approach and therefore incurs no loss of accuracy, which makes it more convergence-friendly than asynchronous approaches. Compared with the latest synchronous pipeline approach, Chimera reduces the number of bubbles by up to 50%; benefiting from the sophisticated scheduling of bidirectional pipelines, Chimera has a more balanced activation memory consumption. Evaluations are conducted on Transformer-based language models. For a GPT-2 model with 1.3 billion parameters running on 2,048 GPU nodes of the Piz Daint supercomputer, Chimera improves the training throughput by 1.16x-2.34x over the state-of-the-art synchronous and asynchronous pipeline approaches.

Proceedings ArticleDOI
20 Aug 2021
TL;DR: In this paper, the authors introduced the causal method of fairness to reason about the fairness impact of data preprocessing stages in the ML pipeline and leveraged existing metrics to define the fairness measures of the stages.
Abstract: In recent years, many incidents have been reported where machine learning models exhibited discrimination among people based on race, sex, age, etc. Research has been conducted to measure and mitigate unfairness in machine learning models. For a machine learning task, it is a common practice to build a pipeline that includes an ordered set of data preprocessing stages followed by a classifier. However, most of the research on fairness has considered a single classifier based prediction task. What are the fairness impacts of the preprocessing stages in a machine learning pipeline? Furthermore, studies showed that often the root cause of unfairness is ingrained in the data itself, rather than the model. But no research has been conducted to measure the unfairness caused by a specific transformation made in the data preprocessing stage. In this paper, we introduced the causal method of fairness to reason about the fairness impact of data preprocessing stages in the ML pipeline. We leveraged existing metrics to define the fairness measures of the stages. Then we conducted a detailed fairness evaluation of the preprocessing stages in 37 pipelines collected from three different sources. Our results show that certain data transformers are causing the model to exhibit unfairness. We identified a number of fairness patterns in several categories of data transformers. Finally, we showed how the local fairness of a preprocessing stage composes into the global fairness of the pipeline. We used the fairness composition to choose an appropriate downstream transformer that mitigates unfairness in the machine learning pipeline.
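
A small sketch of the measurement idea, not the causal analysis itself: train the same classifier with and without one preprocessing transformer and compare a fairness metric (here, statistical parity difference) of its predictions across a protected group. The data, the chosen transformer and the metric below are hypothetical examples, not the paper's 37 pipelines.

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.preprocessing import StandardScaler

    rng = np.random.default_rng(4)
    n = 2000
    group = rng.integers(0, 2, n)                       # protected attribute
    X = np.column_stack([rng.standard_normal(n) + 0.5 * group, rng.standard_normal(n)])
    y = (X[:, 0] + 0.3 * rng.standard_normal(n) > 0.25).astype(int)

    def parity_diff(preds, group):
        # statistical parity difference: gap in positive-prediction rates
        return abs(preds[group == 1].mean() - preds[group == 0].mean())

    clf_raw = LogisticRegression().fit(X, y)
    Xs = StandardScaler().fit_transform(X)              # the preprocessing stage under test
    clf_pre = LogisticRegression().fit(Xs, y)

    print("SPD without transformer:", parity_diff(clf_raw.predict(X), group))
    print("SPD with transformer   :", parity_diff(clf_pre.predict(Xs), group))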

Journal ArticleDOI
TL;DR: A cooperative dispatching strategy for P2G and pipeline storage capability is presented to exploit the flexibility of the IES, in which the unbalanced wind power is converted into natural gas and stored in pipeline networks.
Abstract: A large number of renewable energy resources are integrated into the integrated energy system (IES), which complicates IES dispatching, especially for accommodating the anti-peak-regulation of wind power. To cope with that, a day-ahead IES optimal dispatching method considering power-to-gas (P2G) units and dynamic pipeline networks is proposed in this article. First, by introducing P2G, an IES structure based on the energy hub is established to implement bidirectional flow between the power and natural gas systems. Second, the dynamic characteristic of gas pipelines is modeled with energy storage capability, which can improve the flexibility of the natural gas system by regulating the pressure level of pipeline networks. Furthermore, a cooperative dispatching strategy for P2G and pipeline storage capability is presented to exploit the flexibility of the IES, in which the unbalanced wind power is converted into natural gas and stored in pipeline networks. Finally, case studies are carried out on the modified IEEE39-NGS20-HS20 and IEEE118-NGS40-HS20 IES systems with different typical wind power scenarios. The proposed cooperative dispatching strategy can effectively increase wind power consumption and reduce the operating cost of the whole system without a high computation burden.

Journal ArticleDOI
TL;DR: By using the EEG Integrated Platform Lossless (EEG-IP-L) pipeline's signal quality annotations, a significant increase in data retention is achieved when applying subsequent post-processing ERP segment rejection procedures, and it is demonstrated that the increase in data retention does not attenuate the ERP signal.

Proceedings ArticleDOI
26 Oct 2021
TL;DR: LineFS as mentioned in this paper decomposes DFS operations into execution stages that can be offloaded to a parallel datapath execution pipeline on the SmartNIC; it offloads CPU-intensive DFS tasks, like replication, compression, data publication, index and consistency management, to a SmartNIC.
Abstract: In multi-tenant systems, the CPU overhead of distributed file systems (DFSes) is increasingly a burden to application performance. CPU and memory interference cause degraded and unstable application and storage performance, in particular for operation latency. Recent client-local DFSes for persistent memory (PM) accelerate this trend. DFS offload to SmartNICs is a promising solution to these problems, but it is challenging to fit the complex demands of a DFS onto simple SmartNIC processors located across PCIe. We present LineFS, a SmartNIC-offloaded, high-performance DFS with support for client-local PM. To fully leverage the SmartNIC architecture, we decompose DFS operations into execution stages that can be offloaded to a parallel datapath execution pipeline on the SmartNIC. LineFS offloads CPU-intensive DFS tasks, like replication, compression, data publication, index and consistency management to a SmartNIC. We implement LineFS on the Mellanox BlueField SmartNIC and compare it to Assise, a state-of-the-art PM DFS. LineFS improves latency in LevelDB up to 80% and throughput in Filebench up to 79%, while providing extended DFS availability during host system failures.
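
A host-side toy of the stage decomposition: each stage runs on its own worker and passes results downstream through queues, so independent requests overlap across stages. On the real system the stages execute on SmartNIC cores, and the stage functions below are placeholders rather than LineFS operations.

    import queue, threading

    def run_stage(fn, q_in, q_out):
        # each stage runs on its own worker, mimicking one offloaded datapath stage
        while True:
            item = q_in.get()
            if item is None:                 # poison pill: shut the stage down
                q_out.put(None)
                break
            q_out.put(fn(item))

    q1, q2, q3 = queue.Queue(), queue.Queue(), queue.Queue()
    workers = [
        threading.Thread(target=run_stage, args=(lambda x: f"compressed({x})", q1, q2)),
        threading.Thread(target=run_stage, args=(lambda x: f"replicated({x})", q2, q3)),
    ]
    for w in workers:
        w.start()
    for i in range(3):
        q1.put(f"log-chunk-{i}")             # requests enter the datapath pipeline
    q1.put(None)

    results = []
    while (item := q3.get()) is not None:
        results.append(item)
    for w in workers:
        w.join()
    print(results)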