
Showing papers by Charles E. Leiserson published in 2021



Posted Content
TL;DR: SALIENT performs mini-batch GNN training with neighborhood sampling in a distributed multi-GPU environment, under which the authors identify major performance bottlenecks hitherto under-explored by developers: mini-batch preparation and transfer.
Abstract: Improving the training and inference performance of graph neural networks (GNNs) is faced with a challenge uncommon in general neural networks: creating mini-batches requires a lot of computation and data movement due to the exponential growth of multi-hop graph neighborhoods along network layers. Such a unique challenge gives rise to a diverse set of system design choices. We argue in favor of performing mini-batch training with neighborhood sampling in a distributed multi-GPU environment, under which we identify major performance bottlenecks hitherto under-explored by developers: mini-batch preparation and transfer. We present a sequence of improvements to mitigate these bottlenecks, including a performance-engineered neighborhood sampler, a shared-memory parallelization strategy, and the pipelining of batch transfer with GPU computation. We also conduct an empirical analysis that supports the use of sampling for inference, showing that test accuracies are not materially compromised. Such an observation unifies training and inference, simplifying model implementation. We report comprehensive experimental results with several benchmark data sets and GNN architectures, including a demonstration that, for the ogbn-papers100M data set, our system SALIENT achieves a speedup of 3x over a standard PyTorch-Geometric implementation with a single GPU and a further 8x parallel speedup with 16 GPUs. Therein, training a 3-layer GraphSAGE model with sampling fanout (15, 10, 5) takes 2.0 seconds per epoch and inference with fanout (20, 20, 20) takes 2.4 seconds, attaining test accuracy 64.58%.
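
Below is a minimal sketch, in plain PyTorch, of the batch-transfer pipelining idea described in the abstract: host-to-device copies are issued on a side CUDA stream so that they overlap with GPU computation on the previous mini-batch. The batches iterable, the (features, labels) batch layout, and the training loop are hypothetical stand-ins for illustration, not SALIENT's actual implementation.

    import torch
    import torch.nn.functional as F

    def train_pipelined(batches, model, optimizer, device):
        # Side stream dedicated to host-to-device copies.
        copy_stream = torch.cuda.Stream(device)

        def start_transfer(batch):
            # Pinned host memory makes the non_blocking copies truly asynchronous.
            x, y = batch  # hypothetical (features, labels) layout
            with torch.cuda.stream(copy_stream):
                x = x.pin_memory().to(device, non_blocking=True)
                y = y.pin_memory().to(device, non_blocking=True)
                done = torch.cuda.Event()
                done.record(copy_stream)
            return x, y, done

        it = iter(batches)
        pending = start_transfer(next(it))
        while pending is not None:
            x, y, done = pending
            # Kick off the copy of the next batch before computing on this one,
            # so the host-to-device transfer and the GPU kernels run concurrently.
            nxt = next(it, None)
            pending = start_transfer(nxt) if nxt is not None else None
            # The compute stream waits only for this batch's copy to finish.
            torch.cuda.current_stream(device).wait_event(done)
            optimizer.zero_grad()
            loss = F.cross_entropy(model(x), y)
            loss.backward()
            optimizer.step()

The per-batch CUDA event is what makes the overlap safe: the compute stream never reads tensors whose copy has not completed, yet it never waits on the copy of the batch after the current one.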

Posted Content
TL;DR: This paper proposes the bidirectional box-sum (BDBS) algorithm, which solves the strong included-sums problem in $\Theta(dN)$ time, asymptotically beating the classical summed-area table (SAT) algorithm.
Abstract: This paper presents algorithms for the included-sums and excluded-sums problems used by scientific computing applications such as the fast multipole method. These problems are defined in terms of a $d$-dimensional array of $N$ elements and a binary associative operator $\oplus$ on the elements. The included-sum problem requires that the elements within overlapping boxes cornered at each element within the array be reduced using $\oplus$. The excluded-sum problem reduces the elements outside each box. The weak versions of these problems assume that the operator $\oplus$ has an inverse $\ominus$, whereas the strong versions do not require this assumption. In addition to studying existing algorithms to solve these problems, we introduce three new algorithms. The bidirectional box-sum (BDBS) algorithm solves the strong included-sums problem in $\Theta(d N)$ time, asymptotically beating the classical summed-area table (SAT) algorithm, which runs in $\Theta(2^d N)$ time and solves only the weak version of the problem. Empirically, the BDBS algorithm outperforms the SAT algorithm in higher dimensions by up to $17.1\times$. The box-complement algorithm can solve the strong excluded-sums problem in $\Theta(d N)$ time, asymptotically beating the state-of-the-art corners algorithm by Demaine et al., which runs in $\Omega(2^d N)$ time. In 3 dimensions the box-complement algorithm empirically outperforms the corners algorithm by about $1.4\times$ given similar amounts of space. The weak excluded-sums problem can be solved in $\Theta(d N)$ time by the bidirectional box-sum complement (BDBSC) algorithm, which is a trivial extension of the BDBS algorithm. Given an operator inverse $\ominus$, BDBSC can beat box-complement by up to a factor of $4$.
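
The $\Theta(d N)$ bounds come from performing a constant number of $\oplus$ applications per element per dimension. The sketch below illustrates the bidirectional idea in one dimension (a reading of the technique, not the paper's exact algorithm): split the array into blocks of the box length k, accumulate suffixes within each block in a backward pass and prefixes in a forward pass; every length-k window is then the $\oplus$ of one suffix piece and one prefix piece, with no inverse $\ominus$ required. The function name and signature are made up for illustration.

    def included_sums_1d(a, k, op):
        # Strong (inverse-free) 1-D included sums: out[i] = a[i] op ... op a[i+k-1].
        n = len(a)
        suf = list(a)  # suf[i] = a[i] op ... op (last element of i's block)
        pre = list(a)  # pre[i] = (first element of i's block) op ... op a[i]
        for i in range(n - 2, -1, -1):      # backward pass: within-block suffixes
            if (i + 1) % k != 0:            # stop at block boundaries
                suf[i] = op(a[i], suf[i + 1])
        for i in range(1, n):               # forward pass: within-block prefixes
            if i % k != 0:
                pre[i] = op(pre[i - 1], a[i])
        out = []
        for i in range(n - k + 1):
            j = i + k - 1
            # A window either coincides with one block or straddles exactly two,
            # splitting cleanly into a suffix piece and a prefix piece.
            out.append(pre[j] if i % k == 0 else op(suf[i], pre[j]))
        return out

    # Example with an operator that has no inverse: windowed max.
    print(included_sums_1d([3, 1, 4, 1, 5, 9, 2, 6], 3, max))  # [4, 4, 5, 9, 9, 9]

Each element takes part in at most two op applications, so running this primitive once per dimension yields the $\Theta(d N)$ total work claimed in the abstract.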