Showing papers by "Vladimir Braverman published in 2022"

PDF

Open Access

Proceedings Article•DOI•

Pretrained Models for Multilingual Federated Learning

[...]

Orion Weller, Marc Marone, Vladimir Braverman, Dawn Lawrie, B. Van Durme - Show less +1 more

06 Jun 2022

TL;DR: The results show that using pretrained models reduces the negative effects of FL, helping them to perform near or better than centralized (no privacy) learning, even when using non-IID partitioning.

...read moreread less

Abstract: Since the advent of Federated Learning (FL), research has applied these methods to natural language processing (NLP) tasks. Despite a plethora of papers in FL for NLP, no previous works have studied how multilingual text impacts FL algorithms. Furthermore, multilingual text provides an interesting avenue to examine the impact of non-IID text (e.g. different languages) on FL in naturally occurring data. We explore three multilingual language tasks, language modeling, machine translation, and text classification using differing federated and non-federated learning algorithms. Our results show that using pretrained models reduces the negative effects of FL, helping them to perform near or better than centralized (no privacy) learning, even when using non-IID partitioning.

...read moreread less

13 citations

Proceedings Article•DOI•

The Power of Uniform Sampling for Coresets

[...]

Vladimir Braverman, Vincent Cohen-Addad, Shaofeng H.-C. Jiang, Robert Krauthgamer, Chris Schwiegelshohn, M. Toftrup, Xuan Wu - Show less +3 more

05 Sep 2022

TL;DR: The meta-theorem reduces the task of coreset construction to one on a bounded number of ring instances with a much-relaxed additive error, which enables us to construct coresets using uniform sampling, in contrast to the widely-used importance sampling, and consequently they can easily handle constrained objectives.

...read moreread less

Abstract: Motivated by practical generalizations of the classic k-median and k-means objectives, such as clustering with size constraints, fair clustering, and Wasserstein barycenter, we introduce a meta-theorem for designing coresets for constrained-clustering problems. The meta-theorem reduces the task of coreset construction to one on a bounded number of ring instances with a much-relaxed additive error. This reduction enables us to construct coresets using uniform sampling, in contrast to the widely-used importance sampling, and consequently we can easily handle constrained objectives. Notably and perhaps surprisingly, this simpler sampling scheme can yield coresets whose size is independent of n, the number of input points. Our technique yields smaller coresets, and sometimes the first coresets, for a large number of constrained clustering problems, including capacitated clustering, fair clustering, Euclidean Wasserstein barycenter, clustering in minor-excluded graph, and polygon clustering under Fréchet and Hausdorff distance. Finally, our technique yields also smaller coresets for 1-median in low-dimensional Euclidean spaces, specifically of size $\tilde{O}(\varepsilon^{-15})$ in $\mathbb{R}^{2}$ and $\tilde{O}(\varepsilon^{-16})$ in $\mathbb{R}^{3}$.

...read moreread less

12 citations

Proceedings Article•DOI•

New Coresets for Projective Clustering and Applications

[...]

Murad Tukan, Xuan Wu, Samson Zhou, Vladimir Braverman, Daniel Feldman - Show less +1 more

08 Mar 2022

TL;DR: This paper proposes the first algorithm that returns an L ∞ coreset of size polynomial in R d and gives the first strong coreset construction for general M -estimator regression, and provides experimental results based on real-world datasets, showing the eﬃcacy of the approach.

...read moreread less

Abstract: $(j,k)$-projective clustering is the natural generalization of the family of $k$-clustering and $j$-subspace clustering problems. Given a set of points $P$ in $\mathbb{R}^d$, the goal is to find $k$ flats of dimension $j$, i.e., affine subspaces, that best fit $P$ under a given distance measure. In this paper, we propose the first algorithm that returns an $L_\infty$ coreset of size polynomial in $d$. Moreover, we give the first strong coreset construction for general $M$-estimator regression. Specifically, we show that our construction provides efficient coreset constructions for Cauchy, Welsch, Huber, Geman-McClure, Tukey, $L_1-L_2$, and Fair regression, as well as general concave and power-bounded loss functions. Finally, we provide experimental results based on real-world datasets, showing the efficacy of our approach.

...read moreread less

12 citations

Proceedings Article•DOI•

The White-Box Adversarial Data Stream Model

[...]

Miklós Ajtai, Vladimir Braverman, T. S. Jayram, Sandeep Silwal, Alec Sun, David P. Woodruff, Samson Zhou - Show less +3 more

19 Apr 2022

TL;DR: This work gives a randomized algorithm for the L1-heavy hitters problem that outperforms the optimal deterministic Misra-Gries algorithm on long streams and gives a general technique that translates any two-player deterministic communication lower bound to a lower bound for randomized algorithms robust to a white-box adversary.

...read moreread less

Abstract: There has been a flurry of recent literature studying streaming algorithms for which the input stream is chosen adaptively by a black-box adversary who observes the output of the streaming algorithm at each time step. However, these algorithms fail when the adversary has access to the internal state of the algorithm, rather than just the output of the algorithm. We study streaming algorithms in the white-box adversarial model, where the stream is chosen adaptively by an adversary who observes the entire internal state of the algorithm at each time step. We show that nontrivial algorithms are still possible. We first give a randomized algorithm for the L1-heavy hitters problem that outperforms the optimal deterministic Misra-Gries algorithm on long streams. If the white-box adversary is computationally bounded, we use cryptographic techniques to reduce the memory of our L1-heavy hitters algorithm even further and to design a number of additional algorithms for graph, string, and linear algebra problems. The existence of such algorithms is surprising, as the streaming algorithm does not even have a secret key in this model, i.e., its state is entirely known to the adversary. One algorithm we design is for estimating the number of distinct elements in a stream with insertions and deletions achieving a multiplicative approximation and sublinear space; such an algorithm is impossible for deterministic algorithms. We also give a general technique that translates any two-player deterministic communication lower bound to a lower bound for randomized algorithms robust to a white-box adversary. In particular, our results show that for all p≥0, there exists a constant Cp>1 such that any Cp-approximation algorithm for Fp moment estimation in insertion-only streams with a white-box adversary requires Ω(n) space for a universe of size n. Similarly, there is a constant C>1 such that any C-approximation algorithm in an insertion-only stream for matrix rank requires Ω(n) space with a white-box adversary. These results do not contradict our upper bounds since they assume the adversary has unbounded computational power. Our algorithmic results based on cryptography thus show a separation between computationally bounded and unbounded adversaries. Finally, we prove a lower bound of Ω(log(n)) bits for the fundamental problem of deterministic approximate counting in a stream of 0s and 1s, which holds even if we know how many total stream updates we have seen so far at each point in the stream. Such a lower bound for approximate counting with additional information was previously unknown, and in our context, it shows a separation between multiplayer deterministic maximum communication and the white-box space complexity of a streaming algorithm.

...read moreread less

7 citations

Journal Article•DOI•

Sparsity and Heterogeneous Dropout for Continual Learning in the Null Space of Neural Activations

[...]

Ali Abbasi, Parsa Nooralinejad, Vladimir Braverman, Hamed Pirsiavash, Soheil Kolouri - Show less +1 more

12 Mar 2022

TL;DR: This paper proposes two biologically-inspired mechanisms based on sparsity and heterogeneous dropout that signiﬁcantly increase a continual learner’s performance over a long sequence of tasks.

...read moreread less

Abstract: Continual/lifelong learning from a non-stationary input data stream is a cornerstone of intelligence. Despite their phenomenal performance in a wide variety of applications, deep neural networks are prone to forgetting their previously learned information upon learning new ones. This phenomenon is called"catastrophic forgetting"and is deeply rooted in the stability-plasticity dilemma. Overcoming catastrophic forgetting in deep neural networks has become an active field of research in recent years. In particular, gradient projection-based methods have recently shown exceptional performance at overcoming catastrophic forgetting. This paper proposes two biologically-inspired mechanisms based on sparsity and heterogeneous dropout that significantly increase a continual learner's performance over a long sequence of tasks. Our proposed approach builds on the Gradient Projection Memory (GPM) framework. We leverage k-winner activations in each layer of a neural network to enforce layer-wise sparse activations for each task, together with a between-task heterogeneous dropout that encourages the network to use non-overlapping activation patterns between different tasks. In addition, we introduce two new benchmarks for continual learning under distributional shift, namely Continual Swiss Roll and ImageNet SuperDog-40. Lastly, we provide an in-depth analysis of our proposed method and demonstrate a significant performance boost on various benchmark continual learning problems.

...read moreread less

6 citations

Proceedings Article•DOI•

The Power and Limitation of Pretraining-Finetuning for Linear Regression under Covariate Shift

[...]

Jingfeng Wu, Difan Zou, Vladimir Braverman, Qingsong Gu, Sham M. Kakade - Show less +1 more

03 Aug 2022

TL;DR: It is shown that ﬁnetuning, even with only a small amount of target data, could drastically reduce the amount of source data required by pretraining, and bounds suggest that for a large class of linear regression instances, transfer learning with O ( N 2 ) source data is as eﬀective as supervised learning with N target data.

...read moreread less

Abstract: We study linear regression under covariate shift, where the marginal distribution over the input covariates differs in the source and the target domains, while the conditional distribution of the output given the input covariates is similar across the two domains. We investigate a transfer learning approach with pretraining on the source data and finetuning based on the target data (both conducted by online SGD) for this problem. We establish sharp instance-dependent excess risk upper and lower bounds for this approach. Our bounds suggest that for a large class of linear regression instances, transfer learning with $O(N^2)$ source data (and scarce or no target data) is as effective as supervised learning with $N$ target data. In addition, we show that finetuning, even with only a small amount of target data, could drastically reduce the amount of source data required by pretraining. Our theory sheds light on the effectiveness and limitation of pretraining as well as the benefits of finetuning for tackling covariate shift problems.

...read moreread less

5 citations

Journal Article•DOI•

From Local to Global: Spectral-Inspired Graph Neural Networks

[...]

N. Huang, Soledad Villar, Carey E. Priebe, Da Zheng, Cheng Fu Huang, Lin F. Yang, Vladimir Braverman - Show less +3 more

24 Sep 2022-arXiv.org

TL;DR: It is shown PowerEmbed can provably express the top- k leading eigenvectors of the graph operator, which prevents oversmoothing and is agnostic to the graph topology; meanwhile, it produces a list of representations ranging from local features to global signals, which avoids over-squashing.

...read moreread less

Abstract: Graph Neural Networks (GNNs) are powerful deep learning methods for Non-Euclidean data. Popular GNNs are message-passing algorithms (MPNNs) that aggregate and combine signals in a local graph neighborhood. However, shallow MPNNs tend to miss long-range signals and perform poorly on some heterophilous graphs, while deep MPNNs can suffer from issues like over-smoothing or over-squashing. To mitigate such issues, existing works typically borrow normalization techniques from training neural networks on Euclidean data or modify the graph structures. Yet these approaches are not well-understood theoretically and could increase the overall computational complexity. In this work, we draw inspirations from spectral graph embedding and propose $\texttt{PowerEmbed}$ -- a simple layer-wise normalization technique to boost MPNNs. We show $\texttt{PowerEmbed}$ can provably express the top-$k$ leading eigenvectors of the graph operator, which prevents over-smoothing and is agnostic to the graph topology; meanwhile, it produces a list of representations ranging from local features to global signals, which avoids over-squashing. We apply $\texttt{PowerEmbed}$ in a wide range of simulated and real graphs and demonstrate its competitive performance, particularly for heterophilous graphs.

...read moreread less

2 citations

Proceedings Article•DOI•

Flow-level loss detection with Δ-sketches

[...]

Shir Landau Feibish, Zaoxing Liu, Nikita P. Ivkin, Xiaoqi Chen, Vladimir Braverman, Jennifer Rexford - Show less +2 more

19 Oct 2022

TL;DR: MIDST identifies the flows experiencing loss, as well as the bursty flows responsible, across different burst durations, and uses little memory while providing high accuracy under varying loss rates and Burst durations.

...read moreread less

Abstract: Packet drops caused by congestion are a fundamental problem in network operation. Yet, it is difficult to detect where drops are happening, let alone which flows are most affected. Detecting the small-timescale drops caused by short bursts of traffic is even more challenging, and traditional monitoring techniques can easily miss them. To uncover packet drops as they occur inside a switch, the analysis must be real-time, fine-grained, and efficient. However, modern switches have distributed packet-processing pipelines that see either the arriving or departing traffic, but not the packet drops. Additionally, they do not have enough memory to store per-flow state. Our MIDST system addresses these challenges through a distributed compact data structure with lightweight coordination between ingress and egress pipelines. MIDST identifies the flows experiencing loss, as well as the bursty flows responsible, across different burst durations. Our evaluation with real-world traces and TCP connections shows that MIDST uses little memory (e.g., 320KB) while providing high accuracy (95% to 98%) under varying loss rates and burst durations. We evaluate a low-rate DDoS attack and demonstrate the potential use of our measurement results for attack detection and mitigation.

...read moreread less

2 citations

Proceedings Article•DOI•

Risk Bounds of Multi-Pass SGD for Least Squares in the Interpolation Regime

[...]

Difan Zou, Jingfeng Wu, Vladimir Braverman, Qingsong Gu, Sham M. Kakade - Show less +1 more

07 Mar 2022

TL;DR: The goal of this paper is to sharply characterize the generalization of multi-pass SGD, by developing an instance-dependent excess risk bound for least squares in the interpolation regime, which is expressed as a function of the iteration number, stepsize, and data covariance.

...read moreread less

Abstract: Stochastic gradient descent (SGD) has achieved great success due to its superior performance in both optimization and generalization. Most of existing generalization analyses are made for single-pass SGD, which is a less practical variant compared to the commonly-used multi-pass SGD. Besides, theoretical analyses for multi-pass SGD often concern a worst-case instance in a class of problems, which may be pessimistic to explain the superior generalization ability for some particular problem instance. The goal of this paper is to sharply characterize the generalization of multi-pass SGD, by developing an instance-dependent excess risk bound for least squares in the interpolation regime, which is expressed as a function of the iteration number, stepsize, and data covariance. We show that the excess risk of SGD can be exactly decomposed into the excess risk of GD and a positive fluctuation error, suggesting that SGD always performs worse, instance-wisely, than GD, in generalization. On the other hand, we show that although SGD needs more iterations than GD to achieve the same level of excess risk, it saves the number of stochastic gradient evaluations, and therefore is preferable in terms of computational time.

...read moreread less

1 citations

DOI•

Least-Mean-Squares Coresets for Infinite Streams

[...]

Vladimir Braverman, Daniel Feldman, Harry Lang, Daniela Rus, Adiel Statman - Show less +1 more

IEEE Transactions on Knowledge and Data Engineering

S parsity and h eterogeneous d ropout for c ontinual l earning in the n ull s pace of n eural a ctivations

[...]

Ali Abbasi, Parsa Nooralinejad, Vladimir Braverman, Hamed Pirsiavash, Soheil Kolouri - Show less +1 more

TL;DR: The authors leverage k-winner activations in each layer of a neural network to enforce layer-wise sparse activations for each task, together with a between-task heterogeneous dropout that encourages the network to use non-overlapping activation patterns between different tasks.

...read moreread less

Abstract: Continual/lifelong learning from a non-stationary input data stream is a cornerstone of intelligence. Despite their phenomenal performance in a wide variety of applications, deep neural networks are prone to forgetting their previously learned information upon learning new ones. This phenomenon is called “catastrophic forgetting” and is deeply rooted in the stability-plasticity dilemma. Overcoming catastrophic forgetting in deep neural networks has become an active field of research in recent years. In particular, gradient projection-based methods have recently shown exceptional performance at overcoming catastrophic forgetting. This paper proposes two biologically-inspired mechanisms based on sparsity and heterogeneous dropout that significantly increase a continual learner’s performance over a long sequence of tasks. Our proposed approach builds on the Gradient Projection Memory (GPM) framework. We leverage k-winner activations in each layer of a neural network to enforce layer-wise sparse activations for each task, together with a between-task heterogeneous dropout that encourages the network to use non-overlapping activation patterns between different tasks. In addition, we introduce two new benchmarks for continual learning under distributional shift, namely Continual Swiss Roll and ImageNet SuperDog-40. Lastly, we provide an in-depth analysis of our proposed method and demonstrate a significant performance boost on various benchmark continual learning problems.

...read moreread less

Posted Content•DOI•

A scalable and unbiased discordance metric with H+

[...]

Nathan T. Dyjack, Daniel N. Baker, Vladimir Braverman, Ben Langmead, Stephanie C. Hicks - Show less +1 more

05 Feb 2022-bioRxiv

TL;DR: A modification of G+, referred to as H+, is proposed, and it is demonstrated that H+ does not vary as a function of group balance using a simulation study and with public single-cell RNA-sequencing data.

...read moreread less

Abstract: A standard unsupervised analysis is to cluster observations into discrete groups using a dissimilarity measure, such as Euclidean distance. If there does not exist a ground-truth label for each observation necessary for external validity metrics, then internal validity metrics, such as the tightness or consistency of the cluster, are often used. However, the interpretation of these internal metrics can be problematic when using different dissimilarity measures as they have different magnitudes and ranges of values that they span. To address this problem, previous work introduced the ‘scale-agnostic’ G+ discordance metric, however this internal metric is slow to calculate for large data. Furthermore, we show that G+ varies as a function of the proportion of observations in the predicted cluster labels (group balance), which is an undesirable property. To address this problem, we propose a modification of G+, referred to as H+, and demonstrate that H+ does not vary as a function of group balance using a simulation study and with public single-cell RNA-sequencing data. Finally, we provide scalable approaches to estimate H+, which are available in the fasthplus R package.

...read moreread less

Bidirectional Adaptive Communication for Heterogeneous Distributed Learning

[...]

Dmitrii Avdiukhin, Vladimir Braverman, Nikita P. Ivkin

TL;DR: In this article , a new scheme that adaptively skips communication (broadcast and client uploads) by detecting slow-varying updates is presented. But the convergence rate is the same as for batch gradient descent in the convex and nonconvex smooth cases, even in the case when the data distribution is highly non-IID.

...read moreread less

Abstract: Communication is a key bottleneck in distributed optimization, and, in particular, bandwidth and latency can be limiting factors when devices are connected over commodity networks, such as in Federated Learning. State-of-the-art techniques tackle these challenges by advanced compression techniques or delaying communication rounds according to predefined schedules. We present a new scheme that adaptively skips communication (broadcast and client uploads) by detecting slow-varying updates. The scheme automatically adjusts the communication frequency independently for each worker and the server. By utilizing an error-feedback mechanism – borrowed from the compression literature – we prove that the convergence rate is the same as for batch gradient descent in the convex and nonconvex smooth cases. We show that the total number of communication rounds between server and clients needed to achieve a targeted accuracy is reduced, even in the case when the data distribution is highly non-IID.

...read moreread less