Showing papers in "arXiv: Data Structures and Algorithms in 2019"


Posted Content
TL;DR: The core ideas and algorithmic techniques in the emerging area of algorithmic high-dimensional robust statistics are introduced, with a focus on robust mean estimation, and an overview is provided of the approaches that have led to computationally efficient robust estimators for a range of broader statistical tasks.
Abstract: Learning in the presence of outliers is a fundamental problem in statistics. Until recently, all known efficient unsupervised learning algorithms were very sensitive to outliers in high dimensions. In particular, even for the task of robust mean estimation under natural distributional assumptions, no efficient algorithm was known. Recent work in theoretical computer science gave the first efficient robust estimators for a number of fundamental statistical tasks, including mean and covariance estimation. Since then, there has been a flurry of research activity on algorithmic high-dimensional robust estimation in a range of settings. In this survey article, we introduce the core ideas and algorithmic techniques in the emerging area of algorithmic high-dimensional robust statistics with a focus on robust mean estimation. We also provide an overview of the approaches that have led to computationally efficient robust estimators for a range of broader statistical tasks and discuss new directions and opportunities for future work.

151 citations


Posted Content
TL;DR: A new distributional model called the adversarial stochastic input model is introduced, which generalizes the i.i.d. model with unknown distributions by allowing the distributions to change over time, and a $1-O(\epsilon)$ approximation algorithm is given for the resource allocation problem.
Abstract: We present prior robust algorithms for a large class of resource allocation problems where requests arrive one-by-one (online), drawn independently from an unknown distribution at every step. We design a single algorithm that, for every possible underlying distribution, obtains a $1-\epsilon$ fraction of the profit obtained by an algorithm that knows the entire request sequence ahead of time. The factor $\epsilon$ approaches $0$ when no single request consumes/contributes a significant fraction of the global consumption/contribution by all requests together. We show that the tradeoff we obtain here that determines how fast $\epsilon$ approaches $0$, is near optimal: we give a nearly matching lower bound showing that the tradeoff cannot be improved much beyond what we obtain. Going beyond the model of a static underlying distribution, we introduce the adversarial stochastic input model, where an adversary, possibly in an adaptive manner, controls the distributions from which the requests are drawn at each step. Placing no restriction on the adversary, we design an algorithm that obtains a $1-\epsilon$ fraction of the optimal profit obtainable w.r.t. the worst distribution in the adversarial sequence. In the offline setting we give a fast algorithm to solve very large LPs with both packing and covering constraints. We give algorithms to approximately solve (within a factor of $1+\epsilon$) the mixed packing-covering problem with $O(\frac{\gamma m \log (n/\delta)}{\epsilon^2})$ oracle calls where the constraint matrix of this LP has dimension $n\times m$, the success probability of the algorithm is $1-\delta$, and $\gamma$ quantifies how significant a single request is when compared to the sum total of all requests. We discuss implications of our results to several special cases including online combinatorial auctions, network routing and the adwords problem.

124 citations


Posted Content
TL;DR: A convergence guarantee in Kullback-Leibler (KL) divergence is proved assuming $\nu$ satisfies a log-Sobolev inequality and the Hessian of $f$ is bounded.
Abstract: We study the Unadjusted Langevin Algorithm (ULA) for sampling from a probability distribution $\nu = e^{-f}$ on $\mathbb{R}^n$. We prove a convergence guarantee in Kullback-Leibler (KL) divergence assuming $\nu$ satisfies a log-Sobolev inequality and the Hessian of $f$ is bounded. Notably, we do not assume convexity or bounds on higher derivatives. We also prove convergence guarantees in Renyi divergence of order $q > 1$ assuming the limit of ULA satisfies either the log-Sobolev or Poincare inequality.

95 citations
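
The ULA update named in the abstract above is the standard iteration $x_{k+1} = x_k - h \nabla f(x_k) + \sqrt{2h}\, z_k$ with Gaussian noise $z_k$. The following is a minimal sketch of that iteration only; the quadratic target, the step size, and the names `ula_sample` and `grad_quadratic` are illustrative assumptions, not taken from the paper.

```python
import math
import random

def ula_sample(grad_f, x0, step, n_steps, rng):
    """Run the Unadjusted Langevin Algorithm and return the final iterate."""
    x = list(x0)
    noise_scale = math.sqrt(2.0 * step)
    for _ in range(n_steps):
        g = grad_f(x)
        # Gradient step on f plus Gaussian noise; there is no Metropolis correction.
        x = [xi - step * gi + noise_scale * rng.gauss(0.0, 1.0)
             for xi, gi in zip(x, g)]
    return x

def grad_quadratic(x):
    # Assumed example target: f(x) = |x|^2 / 2 (standard Gaussian), so grad f(x) = x.
    return list(x)

rng = random.Random(0)
# Many independent chains, each started far from the mode.
samples = [ula_sample(grad_quadratic, [3.0], 0.05, 200, rng)[0] for _ in range(2000)]
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
```

For this assumed target the chain's stationary variance is $1/(1 - h/2)$ rather than 1, a small bias that reflects the "unadjusted" nature of the algorithm.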


Posted Content
TL;DR: A practical approximate fairlet decomposition algorithm that runs in nearly linear time and allows for finer control over the balance of resulting clusters than the original work.
Abstract: We study the fair variant of the classic $k$-median problem introduced by Chierichetti et al. [2017]. In the standard $k$-median problem, given an input pointset $P$, the goal is to find $k$ centers $C$ and assign each input point to one of the centers in $C$ such that the average distance of points to their cluster center is minimized. In the fair variant of $k$-median, the points are colored, and the goal is to minimize the same average distance objective while ensuring that all clusters have an "approximately equal" number of points of each color. Chierichetti et al. proposed a two-phase algorithm for fair $k$-clustering. In the first step, the pointset is partitioned into subsets called fairlets that satisfy the fairness requirement and approximately preserve the $k$-median objective. In the second step, fairlets are merged into $k$ clusters by one of the existing $k$-median algorithms. The running time of this algorithm is dominated by the first step, which takes super-quadratic time. In this paper, we present a practical approximate fairlet decomposition algorithm that runs in nearly linear time. Our algorithm additionally allows for finer control over the balance of resulting clusters than the original work. We complement our theoretical bounds with empirical evaluation.

73 citations


Posted Content
TL;DR: In this paper, the authors consider the setting where the service time is not known, but is predicted by a machine learning algorithm, and derive formulae for the performance of several strategies for queueing systems that use predictions for service times in order to schedule jobs.
Abstract: In many traditional job scheduling settings, it is assumed that one knows the time it will take for a job to complete service. In such cases, strategies such as shortest job first can be used to improve performance in terms of measures such as the average time a job waits in the system. We consider the setting where the service time is not known, but is predicted by, for example, a machine learning algorithm. Our main result is the derivation, under natural assumptions, of formulae for the performance of several strategies for queueing systems that use predictions for service times in order to schedule jobs. As part of our analysis, we suggest the framework of the "price of misprediction," which offers a measure of the cost of using predicted information.

63 citations
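
To make the idea of scheduling by predicted service time concrete, here is a small sketch under strong simplifying assumptions: all jobs are present at time zero on a single server, which is not the queueing model analysed in the paper. The job data and the names `mean_completion_time` and `schedule_by` are hypothetical. The final ratio compares ordering by predicted times against ordering by actual times, in the spirit of the "price of misprediction".

```python
def mean_completion_time(service_times, order):
    """Average time at which jobs finish when served one by one in the given order."""
    clock = 0.0
    total = 0.0
    for idx in order:
        clock += service_times[idx]
        total += clock
    return total / len(order)

def schedule_by(keys):
    """Return job indices sorted by increasing key (shortest first)."""
    return sorted(range(len(keys)), key=lambda i: keys[i])

# Hypothetical jobs: true service times and imperfect predictions of them.
actual = [5.0, 1.0, 3.0, 2.0]
predicted = [4.0, 2.5, 2.0, 3.0]

cost_known = mean_completion_time(actual, schedule_by(actual))
cost_predicted = mean_completion_time(actual, schedule_by(predicted))
cost_fifo = mean_completion_time(actual, list(range(len(actual))))
ratio = cost_predicted / cost_known
```

In this toy instance, ordering by predictions is worse than ordering by true service times but still better than first-in-first-out.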


Journal Article
TL;DR: The first learned index that supports predecessor, range queries and updates within provably efficient time and space bounds in the worst case is presented, and its flexible design allows it to introduce three variants which are novel in the context of learned data structures.
Abstract: The recent introduction of learned indexes has shaken the foundations of the decades-old field of indexing data structures. Combining, or even replacing, classic design elements such as B-tree nodes with machine learning models has proven to give outstanding improvements in the space footprint and time efficiency of data systems. However, these novel approaches are based on heuristics, so they lack guarantees on both their time and space requirements. We propose the Piecewise Geometric Model index (PGM-index for short), which achieves guaranteed I/O-optimality in query operations, learns an optimal number of linear models, and whose peculiar recursive construction makes it a purely learned data structure, rather than a hybrid of traditional and learned indexes (such as RMI and FITing-tree). We show that the PGM-index improves the space of the FITing-tree by 63.3% and of the B-tree by more than four orders of magnitude, while achieving the same or even better query time efficiency. We complement this result by proposing three variants of the PGM-index. First, we design a compressed PGM-index that further reduces its space footprint by exploiting the repetitiveness at the level of the learned linear models it is composed of. Second, we design a PGM-index that adapts itself to the distribution of the queries, thus resulting in the first known distribution-aware learned index to date. Finally, given its flexibility in the offered space-time trade-offs, we propose the multicriteria PGM-index that efficiently auto-tunes itself in a few seconds over hundreds of millions of keys to the possibly evolving space-time constraints imposed by the application of use. We remark to the reader that this paper is an extended and improved version of our previous paper titled "Superseding traditional indexes by orchestrating learning and geometry" (arXiv:1903.00507).

63 citations


Posted Content
TL;DR: A simple polylogarithmic-time deterministic distributed algorithm for network decomposition that improves on a celebrated $2^{O(\sqrt{\log n})}$-time algorithm and settles a central and long-standing question in distributed graph algorithms.
Abstract: We present a simple polylogarithmic-time deterministic distributed algorithm for network decomposition. This improves on a celebrated $2^{O(\sqrt{\log n})}$-time algorithm of Panconesi and Srinivasan [STOC'92] and settles a central and long-standing question in distributed graph algorithms. It also leads to the first polylogarithmic-time deterministic distributed algorithms for numerous other problems, hence resolving several well-known and decades-old open problems, including Linial's question about the deterministic complexity of maximal independent set [FOCS'87; SICOMP'92]---which had been called the most outstanding problem in the area. The main implication is a more general distributed derandomization theorem: Put together with the results of Ghaffari, Kuhn, and Maus [STOC'17] and Ghaffari, Harris, and Kuhn [FOCS'18], our network decomposition implies that $$\mathsf{P}\textit{-}\mathsf{RLOCAL} = \mathsf{P}\textit{-}\mathsf{LOCAL}.$$ That is, for any problem whose solution can be checked deterministically in polylogarithmic-time, any polylogarithmic-time randomized algorithm can be derandomized to a polylogarithmic-time deterministic algorithm. Informally, for the standard first-order interpretation of efficiency as polylogarithmic-time, distributed algorithms do not need randomness for efficiency. By known connections, our result leads also to substantially faster randomized distributed algorithms for a number of well-studied problems including $(\Delta+1)$-coloring, maximal independent set, and Lovasz Local Lemma, as well as massively parallel algorithms for $(\Delta+1)$-coloring.

62 citations


Posted Content
TL;DR: An improved algorithm with a competitive ratio of $O(1 + \min((\eta/OPT)/k, 1) \log k)$ and a lower bound of $\Omega(\log \min((\eta/OPT)/(k \log k), k))$ are provided.
Abstract: In the model of online caching with machine learned advice, introduced by Lykouris and Vassilvitskii, the goal is to solve the caching problem with an online algorithm that has access to next-arrival predictions: when each input element arrives, the algorithm is given a prediction of the next time when the element will reappear. The traditional model for online caching suffers from an $\Omega(\log k)$ competitive ratio lower bound (on a cache of size $k$). In contrast, the augmented model admits algorithms which beat this lower bound when the predictions have low error, and asymptotically match the lower bound when the predictions have high error, even if the algorithms are oblivious to the prediction error. In particular, Lykouris and Vassilvitskii showed that there is a prediction-augmented caching algorithm with a competitive ratio of $O(1+\min(\sqrt{\eta/OPT}, \log k))$ when the overall $\ell_1$ prediction error is bounded by $\eta$, and $OPT$ is the cost of the optimal offline algorithm. The dependence on $k$ in the competitive ratio is optimal, but the dependence on $\eta/OPT$ may be far from optimal. In this work, we make progress towards closing this gap. Our contributions are twofold. First, we provide an improved algorithm with a competitive ratio of $O(1 + \min((\eta/OPT)/k, 1) \log k)$. Second, we provide a lower bound of $\Omega(\log \min((\eta/OPT)/(k \log k), k))$.

61 citations
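
The next-arrival prediction model described above can be illustrated with a naive baseline that trusts the predictions completely: on a miss with a full cache, evict the item whose predicted next arrival is farthest away. This is not the algorithm of the paper or of Lykouris and Vassilvitskii, only a sketch of the input model; the function names and the request sequence are hypothetical. With perfect predictions the rule coincides with the classical farthest-in-future (Belady) policy.

```python
def count_misses(requests, predictions, k):
    """Count misses when evicting the item with the farthest predicted next arrival."""
    cache = {}  # item -> most recent prediction of its next arrival time
    misses = 0
    for item, predicted_next in zip(requests, predictions):
        if item not in cache:
            misses += 1
            if len(cache) >= k:
                victim = max(cache, key=cache.get)
                del cache[victim]
        # Each arrival comes with a fresh prediction for that item.
        cache[item] = predicted_next
    return misses

def true_next_arrivals(requests):
    """Perfect predictions: the actual next time each requested item reappears."""
    next_seen = {}
    result = [0] * len(requests)
    for t in range(len(requests) - 1, -1, -1):
        result[t] = next_seen.get(requests[t], float("inf"))
        next_seen[requests[t]] = t
    return result

requests = ["a", "b", "c", "a", "b", "d", "a", "b", "c", "d"]
perfect = true_next_arrivals(requests)
misses_perfect = count_misses(requests, perfect, k=2)
```

Replacing `perfect` with noisy values is one way to explore how the miss count of this baseline degrades as the prediction error grows.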


Posted Content
TL;DR: The first deterministic, almost-linear time approximation algorithm for the classical Minimum Balanced Cut problem, which provides a stronger guarantee: it either returns a balanced cut whose value is close to a given target value, or it certifies that such a cut does not exist by exhibiting a large subgraph of $G$ that has high conductance.
Abstract: We consider the classical Minimum Balanced Cut problem: given a graph $G$, compute a partition of its vertices into two subsets of roughly equal volume, while minimizing the number of edges connecting the subsets. We present the first {\em deterministic, almost-linear time} approximation algorithm for this problem. Specifically, our algorithm, given an $n$-vertex $m$-edge graph $G$ and any parameter $1\leq r\leq O(\log n)$, computes a $(\log m)^{r^2}$-approximation for Minimum Balanced Cut on $G$, in time $O\left ( m^{1+O(1/r)+o(1)}\cdot (\log m)^{O(r^2)}\right )$. In particular, setting $r = 1/\sqrt{\epsilon}$, we obtain a $(\log m)^{1/\epsilon}$-approximation in time $m^{1+O(\sqrt{\epsilon})}$ for any constant $\epsilon$, and a $(\log m)^{f(m)}$-approximation in time $m^{1+o(1)}$, for any slowly growing function $f(m)$. We obtain deterministic algorithms with similar guarantees for the Sparsest Cut and the Lowest-Conductance Cut problems. Our algorithm for the Minimum Balanced Cut problem in fact provides a stronger guarantee: it either returns a balanced cut whose value is close to a given target value, or it certifies that such a cut does not exist by exhibiting a large subgraph of $G$ that has high conductance. We use this algorithm to obtain deterministic algorithms for dynamic connectivity and minimum spanning forest, whose worst-case update time on an $n$-vertex graph is $n^{o(1)}$, thus resolving a major open problem in the area of dynamic graph algorithms. Our work also implies deterministic algorithms for a host of additional problems, whose time complexities match, up to subpolynomial in $n$ factors, those of known randomized algorithms. The implications include almost-linear time deterministic algorithms for solving Laplacian systems and for approximating maximum flows in undirected graphs.

61 citations


Posted Content
TL;DR: This paper shows that the deterministic setting for solving linear programs can be settled by derandomizing Cohen et al.'s $\tilde{O}(n^\omega \log(n/\delta))$ time algorithm, using a new data structure that maintains the solution to a linear system in subquadratic time.
Abstract: Interior point algorithms for solving linear programs have been studied extensively for a long time [e.g. Karmarkar 1984; Lee, Sidford FOCS'14; Cohen, Lee, Song STOC'19]. For linear programs of the form $\min_{Ax=b, x \ge 0} c^\top x$ with $n$ variables and $d$ constraints, the generic case $d = \Omega(n)$ has recently been settled by Cohen, Lee and Song [STOC'19]. Their algorithm can solve linear programs in $\tilde O(n^\omega \log(n/\delta))$ expected time, where $\delta$ is the relative accuracy. This is essentially optimal as all known linear system solvers require up to $O(n^{\omega})$ time for solving $Ax = b$. However, for the case of deterministic solvers, the best upper bound is Vaidya's 30 years old $O(n^{2.5} \log(n/\delta))$ bound [FOCS'89]. In this paper we show that one can also settle the deterministic setting by derandomizing Cohen et al.'s $\tilde{O}(n^\omega \log(n/\delta))$ time algorithm. This allows for a strict $\tilde{O}(n^\omega \log(n/\delta))$ time bound, instead of an expected one, and a simplified analysis, reducing the length of their proof of their central path method by roughly half. Derandomizing this algorithm was also an open question asked in Song's PhD Thesis. The main tool to achieve our result is a new data-structure that can maintain the solution to a linear system in subquadratic time. More accurately we are able to maintain $\sqrt{U}A^\top(AUA^\top)^{-1}A\sqrt{U}\:v$ in subquadratic time under $\ell_2$ multiplicative changes to the diagonal matrix $U$ and the vector $v$. This type of change is common for interior point algorithms. Previous algorithms [e.g. Vaidya STOC'89; Lee, Sidford FOCS'15; Cohen, Lee, Song STOC'19] required $\Omega(n^2)$ time for this task. [...]

56 citations


Posted Content
TL;DR: The experimental study confirms that the local algorithms, both kernel and neural architectures, lead to vastly reduced computation times, and prevent overfitting, and the kernel version establishes a new state-of-the-art for graph classification on a wide range of benchmark datasets.
Abstract: Graph kernels based on the $1$-dimensional Weisfeiler-Leman algorithm and corresponding neural architectures recently emerged as powerful tools for (supervised) learning with graphs. However, due to the purely local nature of the algorithms, they might miss essential patterns in the given data and can only handle binary relations. The $k$-dimensional Weisfeiler-Leman algorithm addresses this by considering $k$-tuples, defined over the set of vertices, and defines a suitable notion of adjacency between these vertex tuples. Hence, it accounts for the higher-order interactions between vertices. However, it does not scale and may suffer from overfitting when used in a machine learning setting. Hence, it remains an important open problem to design WL-based graph learning methods that are simultaneously expressive, scalable, and non-overfitting. Here, we propose local variants and corresponding neural architectures, which consider a subset of the original neighborhood, making them more scalable, and less prone to overfitting. The expressive power of (one of) our algorithms is strictly higher than the original algorithm, in terms of ability to distinguish non-isomorphic graphs. Our experimental study confirms that the local algorithms, both kernel and neural architectures, lead to vastly reduced computation times, and prevent overfitting. The kernel version establishes a new state-of-the-art for graph classification on a wide range of benchmark datasets, while the neural version shows promising performance on large-scale molecular regression tasks.

Proceedings Article
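TL;DR: An algorithmic framework for quantum-inspired classical algorithms on close-to-low-rank matrices is presented, yielding classical algorithms for singular value transformation (SVT) that run in time independent of input dimension under suitable sampling assumptions, and giving evidence that quantum SVT does not yield exponential quantum speedups in the corresponding QRAM data structure input model.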
Abstract: We present an algorithmic framework for quantum-inspired classical algorithms on close-to-low-rank matrices, generalizing the series of results started by Tang's breakthrough quantum-inspired algorithm for recommendation systems [STOC'19]. Motivated by quantum linear algebra algorithms and the quantum singular value transformation (SVT) framework of Gilyen et al. [STOC'19], we develop classical algorithms for SVT that run in time independent of input dimension, under suitable quantum-inspired sampling assumptions. Our results give compelling evidence that in the corresponding QRAM data structure input model, quantum SVT does not yield exponential quantum speedups. Since the quantum SVT framework generalizes essentially all known techniques for quantum linear algebra, our results, combined with sampling lemmas from previous work, suffice to generalize all recent results about dequantizing quantum machine learning algorithms. In particular, our classical SVT framework recovers and often improves the dequantization results on recommendation systems, principal component analysis, supervised clustering, support vector machines, low-rank regression, and semidefinite program solving. We also give additional dequantization results on low-rank Hamiltonian simulation and discriminant analysis. Our improvements come from identifying the key feature of the quantum-inspired input model that is at the core of all prior quantum-inspired results: $\ell^2$-norm sampling can approximate matrix products in time independent of their dimension. We reduce all our main results to this fact, making our exposition concise, self-contained, and intuitive.

Posted Content
TL;DR: In this paper, the authors study the problem of finding low-cost fair clusterings in data where each data point may belong to many protected groups and propose a fair clustering algorithm that allows the user to specify the parameters that define fair representation.
Abstract: We study the problem of finding low-cost Fair Clusterings in data where each data point may belong to many protected groups. Our work significantly generalizes the seminal work of Chierichetti et al. (NIPS 2017) as follows. - We allow the user to specify the parameters that define fair representation. More precisely, these parameters define the maximum over- and minimum under-representation of any group in any cluster. - Our clustering algorithm works on any $\ell_p$-norm objective (e.g. $k$-means, $k$-median, and $k$-center). Indeed, our algorithm transforms any vanilla clustering solution into a fair one incurring only a slight loss in quality. - Our algorithm also allows individuals to lie in multiple protected groups. In other words, we do not need the protected groups to partition the data and we can maintain fairness across different groups simultaneously. Our experiments show that on established data sets, our algorithm performs much better in practice than what our theoretical results suggest.

Posted Content
TL;DR: In this paper, an algorithm is given for convex problems of the form $\min_{x} \sum_{i} f_i(A_i x + b_i)$ that runs in current matrix multiplication time, based on a robust deterministic central path method and a data structure that maintains the central path of interior point methods even when the weights update vector is dense.
Abstract: Many convex problems in machine learning and computer science share the same form: \begin{align*} \min_{x} \sum_{i} f_i( A_i x + b_i), \end{align*} where $f_i$ are convex functions on $\mathbb{R}^{n_i}$ with constant $n_i$, $A_i \in \mathbb{R}^{n_i \times d}$, $b_i \in \mathbb{R}^{n_i}$ and $\sum_i n_i = n$. This problem generalizes linear programming and includes many problems in empirical risk minimization. In this paper, we give an algorithm that runs in time \begin{align*} O^* ( ( n^{\omega} + n^{2.5 - \alpha/2} + n^{2+ 1/6} ) \log (n / \delta) ) \end{align*} where $\omega$ is the exponent of matrix multiplication, $\alpha$ is the dual exponent of matrix multiplication, and $\delta$ is the relative accuracy. Note that the runtime has only a log dependence on the condition numbers or other data dependent parameters and these are captured in $\delta$. For the current bound $\omega \sim 2.38$ [Vassilevska Williams'12, Le Gall'14] and $\alpha \sim 0.31$ [Le Gall, Urrutia'18], our runtime $O^* ( n^{\omega} \log (n / \delta))$ matches the current best for solving a dense least squares regression problem, a special case of the problem we consider. Very recently, [Alman'18] proved that all the current known techniques can not give a better $\omega$ below $2.168$ which is larger than our $2+1/6$. Our result generalizes the very recent result of solving linear programs in the current matrix multiplication time [Cohen, Lee, Song'19] to a more broad class of problems. Our algorithm proposes two concepts which are different from [Cohen, Lee, Song'19] : $\bullet$ We give a robust deterministic central path method, whereas the previous one is a stochastic central path which updates weights by a random sparse vector. $\bullet$ We propose an efficient data-structure to maintain the central path of interior point methods even when the weights update vector is dense.

Posted Content
TL;DR: In this article, the complexity bound for the Greenkhorn algorithm, a greedy variant of the Sinkhorn algorithm, is improved to $\widetilde{\mathcal{O}}(n^2\varepsilon^{-2})$ by using a primal-dual formulation and a novel upper bound for the dual solution.
Abstract: We provide theoretical analyses for two algorithms that solve the regularized optimal transport (OT) problem between two discrete probability measures with at most $n$ atoms. We show that a greedy variant of the classical Sinkhorn algorithm, known as the \emph{Greenkhorn algorithm}, can be improved to $\widetilde{\mathcal{O}}(n^2\varepsilon^{-2})$, improving on the best known complexity bound of $\widetilde{\mathcal{O}}(n^2\varepsilon^{-3})$. Notably, this matches the best known complexity bound for the Sinkhorn algorithm and helps explain why the Greenkhorn algorithm can outperform the Sinkhorn algorithm in practice. Our proof technique, which is based on a primal-dual formulation and a novel upper bound for the dual solution, also leads to a new class of algorithms that we refer to as \emph{adaptive primal-dual accelerated mirror descent} (APDAMD) algorithms. We prove that the complexity of these algorithms is $\widetilde{\mathcal{O}}(n^2\sqrt{\delta}\varepsilon^{-1})$, where $\delta > 0$ refers to the inverse of the strong convexity module of Bregman divergence with respect to $\|\cdot\|_\infty$. This implies that the APDAMD algorithm is faster than the Sinkhorn and Greenkhorn algorithms in terms of $\varepsilon$. Experimental results on synthetic and real datasets demonstrate the favorable performance of the Greenkhorn and APDAMD algorithms in practice.
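
For readers unfamiliar with the baseline that the abstract above compares against, the following is a minimal sketch of the classical Sinkhorn iteration for entropically regularized OT: alternately rescale the rows and columns of the kernel $K_{ij} = e^{-C_{ij}/\text{reg}}$ to match the two marginals. Greenkhorn differs in that it greedily updates a single row or column per step. The cost matrix, the marginals, the regularization value, and the names `sinkhorn` and `reg` are illustrative assumptions; this is not the paper's Greenkhorn or APDAMD implementation.

```python
import math

def sinkhorn(cost, r, c, reg, n_iters):
    """Return an approximate transport plan via alternating row/column scaling."""
    n, m = len(r), len(c)
    kernel = [[math.exp(-cost[i][j] / reg) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Rescale rows to match the marginal r, then columns to match c.
        u = [r[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [c[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]

# Assumed toy instance with two atoms on each side.
cost = [[0.0, 1.0], [1.0, 0.0]]
r = [0.3, 0.7]
c = [0.6, 0.4]
plan = sinkhorn(cost, r, c, reg=0.5, n_iters=200)
row_sums = [sum(row) for row in plan]
col_sums = [sum(plan[i][j] for i in range(2)) for j in range(2)]
```

After enough iterations, `row_sums` and `col_sums` should agree with `r` and `c` up to numerical error.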

Posted Content
TL;DR: An equivalent form of the dual problem that relates the dual LP with a sample average approximation to a stochastic program is identified and a new type of OLP algorithm is proposed, action-history-dependent learning algorithm, which improves the previous algorithm performances by taking into account the past input data and the past decisions/actions.
Abstract: We study an online linear programming (OLP) problem under a random input model in which the columns of the constraint matrix along with the corresponding coefficients in the objective function are generated i.i.d. from an unknown distribution and revealed sequentially over time. Virtually all pre-existing online algorithms were based on learning the dual optimal solutions/prices of the linear programs (LP), and their analyses were focused on the aggregate objective value and solving the packing LP where all coefficients in the constraint matrix and objective are nonnegative. However, two major open questions were: (i) Does the set of LP optimal dual prices learned in the pre-existing algorithms converge to those of the "offline" LP, and (ii) Could the results be extended to general LP problems where the coefficients can be either positive or negative. We resolve these two questions by establishing convergence results for the dual prices under moderate regularity conditions for general LP problems. Specifically, we identify an equivalent form of the dual problem which relates the dual LP with a sample average approximation to a stochastic program. Furthermore, we propose a new type of OLP algorithm, Action-History-Dependent Learning Algorithm, which improves the previous algorithm performances by taking into account the past input data as well as decisions/actions already made. We derive an $O(\log n \log \log n)$ regret bound (under a locally strong convexity and smoothness condition) for the proposed algorithm, against the $O(\sqrt{n})$ bound for typical dual-price learning algorithms, where $n$ is the number of decision variables. Numerical experiments demonstrate the effectiveness of the proposed algorithm and the action-history-dependent design.

Proceedings Article
TL;DR: This paper presents a simple greedy approach that builds a family of itemsets directly from data that allows for complex interactions between the attributes, not just co-occurrences of 1s.
Abstract: The problem of selecting small groups of itemsets that represent the data well has recently gained a lot of attention. We approach the problem by searching for the itemsets that compress the data efficiently. As a compression technique we use decision trees combined with a refined version of MDL. More formally, assuming that the items are ordered, we create a decision tree for each item that may only depend on the previous items. Our approach allows us to find complex interactions between the attributes, not just co-occurrences of 1s. Further, we present a link between the itemsets and the decision trees and use this link to export the itemsets from the decision trees. In this paper we present two algorithms. The first one is a simple greedy approach that builds a family of itemsets directly from data. The second one, given a collection of candidate itemsets, selects a small subset of these itemsets. Our experiments show that these approaches result in compact and high quality descriptions of the data.

Posted Content
TL;DR: This work provides a differentially private algorithm for hypothesis selection, and applies it to give learning algorithms for a number of natural distribution classes, including Gaussians, product distributions, sums of independent random variables, piecewise polynomials, and mixture classes.
Abstract: We provide a differentially private algorithm for hypothesis selection. Given samples from an unknown probability distribution $P$ and a set of $m$ probability distributions $\mathcal{H}$, the goal is to output, in a $\varepsilon$-differentially private manner, a distribution from $\mathcal{H}$ whose total variation distance to $P$ is comparable to that of the best such distribution (which we denote by $\alpha$). The sample complexity of our basic algorithm is $O\left(\frac{\log m}{\alpha^2} + \frac{\log m}{\alpha \varepsilon}\right)$, representing a minimal cost for privacy when compared to the non-private algorithm. We also can handle infinite hypothesis classes $\mathcal{H}$ by relaxing to $(\varepsilon,\delta)$-differential privacy. We apply our hypothesis selection algorithm to give learning algorithms for a number of natural distribution classes, including Gaussians, product distributions, sums of independent random variables, piecewise polynomials, and mixture classes. Our hypothesis selection procedure allows us to generically convert a cover for a class to a learning algorithm, complementing known learning lower bounds which are in terms of the size of the packing number of the class. As the covering and packing numbers are often closely related, for constant $\alpha$, our algorithms achieve the optimal sample complexity for many classes of interest. Finally, we describe an application to private distribution-free PAC learning.

Posted Content
TL;DR: In this article, the authors proposed a new outlier scoring method called QUE-scoring based on quantum entropy regularization, which yields the first algorithm with optimal error rates and nearly linear running time.
Abstract: We study two problems in high-dimensional robust statistics: \emph{robust mean estimation} and \emph{outlier detection}. In robust mean estimation the goal is to estimate the mean $\mu$ of a distribution on $\mathbb{R}^d$ given $n$ independent samples, an $\varepsilon$-fraction of which have been corrupted by a malicious adversary. In outlier detection the goal is to assign an \emph{outlier score} to each element of a data set such that elements more likely to be outliers are assigned higher scores. Our algorithms for both problems are based on a new outlier scoring method we call QUE-scoring based on \emph{quantum entropy regularization}. For robust mean estimation, this yields the first algorithm with optimal error rates and nearly-linear running time $\widetilde{O}(nd)$ in all parameters, improving on the previous fastest running time $\widetilde{O}(\min(nd/\varepsilon^6, nd^2))$. For outlier detection, we evaluate the performance of QUE-scoring via extensive experiments on synthetic and real data, and demonstrate that it often performs better than previously proposed algorithms. Code for these experiments is available at this https URL .

Proceedings ArticleDOI
TL;DR: This paper shows that $r=\mathcal{O}(z\log^{2}n)$ holds for every text, and proves that many results related to BWT automatically apply to methods based on LZ77, and implies the first non-trivial relation between the number of runs in the BWT of the text and its reverse.
Abstract: The Burrows-Wheeler Transform (BWT) is an invertible text transformation that permutes symbols of a text according to the lexicographical order of its suffixes. BWT is the main component of popular lossless compression programs (such as bzip2) as well as recent powerful compressed indexes (such as $r$-index [Gagie et al., J. ACM, 2020]), central in modern bioinformatics. The compression ratio of BWT is quantified by the number $r$ of equal-letter runs. Despite the practical significance of BWT, no non-trivial bound on the value of $r$ is known. This is in contrast to nearly all other known compression methods, whose sizes have been shown to be either always within a ${\rm polylog}\,n$ factor (where $n$ is the length of text) from $z$, the size of Lempel-Ziv (LZ77) parsing of the text, or significantly larger in the worst case (by a $n^{\varepsilon}$ factor for $\varepsilon > 0$). In this paper, we show that $r = \mathcal{O}(z \log^2n)$ holds for every text. This result has numerous implications for text indexing and data compression; for example: (1) it proves that many results related to BWT automatically apply to methods based on LZ77, e.g., it is possible to obtain functionality of the suffix tree in $\mathcal{O}(z\,{\rm polylog}\,n)$ space; (2) it shows that many text processing tasks can be solved in the optimal time assuming the text is compressible using LZ77 by a sufficiently large ${\rm polylog}\,n$ factor; (3) it implies the first non-trivial relation between the number of runs in the BWT of the text and its reverse. In addition, we provide an $\mathcal{O}(z\,{\rm polylog}\,n)$-time algorithm converting the LZ77 parsing into the run-length compressed BWT. To achieve this, we develop a number of new data structures and techniques of independent interest.
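
To make the quantity $r$ concrete, here is a minimal Python sketch that computes the BWT of a short text and counts its equal-letter runs. It is a naive illustration, not the paper's method: it sorts all rotations of the text after appending a sentinel `$` (an assumption of this sketch; with a unique smallest sentinel, rotation order matches suffix order). The function names `bwt` and `count_runs` are hypothetical.

```python
def bwt(text: str, sentinel: str = "$") -> str:
    """Naive BWT: last column of the sorted rotations of text + sentinel.

    The sentinel is assumed to be absent from the text and smaller than
    every text symbol, so sorting rotations equals sorting suffixes.
    """
    if sentinel in text:
        raise ValueError("sentinel must not occur in the text")
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def count_runs(s: str) -> int:
    """Number of maximal equal-letter runs in s (the quantity r for a BWT)."""
    return sum(1 for i, c in enumerate(s) if i == 0 or c != s[i - 1])


if __name__ == "__main__":
    transformed = bwt("banana")
    print(transformed, count_runs(transformed))
```

Sorting rotations takes quadratic space and is only suitable for toy inputs; practical tools build the BWT from a suffix array.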

Posted Content
TL;DR: This survey aims to provide a comprehensive overview of the existing methods for subgraph counting, identifying and describing the main conceptual approaches, giving insight on their advantages and limitations, and providing pointers to existing implementations.
Abstract: Computing subgraph frequencies is a fundamental task that lies at the core of several network analysis methodologies, such as network motifs and graphlet-based metrics, which have been widely used to categorize and compare networks from multiple domains. Counting subgraphs is however computationally very expensive and there has been a large body of work on efficient algorithms and strategies to make subgraph counting feasible for larger subgraphs and networks. This survey aims precisely to provide a comprehensive overview of the existing methods for subgraph counting. Our main contribution is a general and structured review of existing algorithms, classifying them on a set of key characteristics, highlighting their main similarities and differences. We identify and describe the main conceptual approaches, giving insight on their advantages and limitations, and provide pointers to existing implementations. We initially focus on exact sequential algorithms, but we also do a thorough survey on approximate methodologies (with a trade-off between accuracy and execution time) and parallel strategies (that need to deal with an unbalanced search space).
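
As a small illustration of exact sequential subgraph counting, the sketch below counts triangles, the smallest non-trivial subgraph, in an undirected graph. It is a textbook-style example and not one of the algorithms the survey classifies; the function name `count_triangles` and the edge-list input format are assumptions of this sketch, and node labels are assumed to be mutually comparable.

```python
from itertools import combinations


def count_triangles(edges):
    """Count triangles in an undirected simple graph given as (u, v) pairs."""
    adjacency = {}
    for u, v in edges:
        if u == v:
            continue  # ignore self-loops
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)

    triangles = 0
    for u, neighbours in adjacency.items():
        # Only look at neighbours larger than u, so each triangle
        # {a < b < c} is counted exactly once, at its smallest node a.
        larger = sorted(n for n in neighbours if n > u)
        for v, w in combinations(larger, 2):
            if w in adjacency[v]:
                triangles += 1
    return triangles


if __name__ == "__main__":
    complete_graph_4 = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
    print(count_triangles(complete_graph_4))
```

Even this simple case shows why cost grows quickly: the inner loop examines every pair of neighbours, and larger subgraphs multiply the number of candidate node sets.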

Posted Content
TL;DR: This work 'redeems' the statistical physics approach by showing that a new hierarchy of increasingly powerful algorithms gives a polynomial-time algorithm matching the performance of SOS, and suggests a new avenue for systematically obtaining optimal algorithms for Bayesian inference problems.
Abstract: For the tensor PCA (principal component analysis) problem, we propose a new hierarchy of increasingly powerful algorithms with increasing runtime. Our hierarchy is analogous to the sum-of-squares (SOS) hierarchy but is instead inspired by statistical physics and related algorithms such as belief propagation and AMP (approximate message passing). Our level-$\ell$ algorithm can be thought of as a linearized message-passing algorithm that keeps track of $\ell$-wise dependencies among the hidden variables. Specifically, our algorithms are spectral methods based on the Kikuchi Hessian, which generalizes the well-studied Bethe Hessian to the higher-order Kikuchi free energies. It is known that AMP, the flagship algorithm of statistical physics, has substantially worse performance than SOS for tensor PCA. In this work we 'redeem' the statistical physics approach by showing that our hierarchy gives a polynomial-time algorithm matching the performance of SOS. Our hierarchy also yields a continuum of subexponential-time algorithms, and we prove that these achieve the same (conjecturally optimal) tradeoff between runtime and statistical power as SOS. Our proofs are much simpler than prior work, and also apply to the related problem of refuting random $k$-XOR formulas. The results we present here apply to tensor PCA for tensors of all orders, and to $k$-XOR when $k$ is even. Our methods suggest a new avenue for systematically obtaining optimal algorithms for Bayesian inference problems, and our results constitute a step toward unifying the statistical physics and sum-of-squares approaches to algorithm design.

Posted Content
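TL;DR: In this paper, the authors show that, in Mądry's maximum flow framework, the problem of optimizing the perturbation of the central path needed to maximize energy and thereby reduce congestion can be efficiently reduced to a smoothed $\ell_2$-$\ell_p$ flow optimization problem, which yields an $m^{11/8+o(1)}U^{1/4}$-time maximum flow algorithm.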
Abstract: In this paper we provide an algorithm which given any $m$-edge $n$-vertex directed graph with integer capacities at most $U$ computes a maximum $s$-$t$ flow for any vertices $s$ and $t$ in $m^{11/8+o(1)}U^{1/4}$ time with high probability. This running time improves upon the previous best of $\tilde{O}(m^{10/7} U^{1/7})$ (Mądry 2016), $\tilde{O}(m \sqrt{n} \log U)$ (Lee Sidford 2014), and $O(mn)$ (Orlin 2013) when the graph is not too dense or has large capacities. We achieve this result by leveraging recent advances in solving undirected flow problems on graphs. We show that in the maximum flow framework of (Mądry 2016) the problem of optimizing the amount of perturbation of the central path needed to maximize energy and thereby reduce congestion can be efficiently reduced to a smoothed $\ell_2$-$\ell_p$ flow optimization problem, which can be solved approximately via recent work (Kyng, Peng, Sachdeva, Wang 2019). Leveraging this new primitive, we provide a new long-step interior point method for maximum flow with faster convergence and simpler analysis that no longer needs global potential functions involving energy as in previous methods (Mądry 2013, Mądry 2016).

Posted Content
TL;DR: An approach to clustering with fairness constraints that involve multiple, non-disjoint types, that is also scalable and achieves a speed-up over recent fair clustering algorithms by incorporating the first known coreset construction for the fair clustering problem with the $k$-median objective.
Abstract: In a recent work, [19] studied the following "fair" variants of classical clustering problems such as $k$-means and $k$-median: given a set of $n$ data points in $\mathbb{R}^d$ and a binary type associated to each data point, the goal is to cluster the points while ensuring that the proportion of each type in each cluster is roughly the same as its underlying proportion. Subsequent work has focused on either extending this setting to when each data point has multiple, non-disjoint sensitive types such as race and gender [6], or to address the problem that the clustering algorithms in the above work do not scale well. The main contribution of this paper is an approach to clustering with fairness constraints that involve multiple, non-disjoint types, that is also scalable. Our approach is based on novel constructions of coresets: for the $k$-median objective, we construct an $\varepsilon$-coreset of size $O(\Gamma k^2 \varepsilon^{-d})$ where $\Gamma$ is the number of distinct collections of groups that a point may belong to, and for the $k$-means objective, we show how to construct an $\varepsilon$-coreset of size $O(\Gamma k^3\varepsilon^{-d-1})$. The former result is the first known coreset construction for the fair clustering problem with the $k$-median objective, and the latter result removes the dependence on the size of the full dataset as in [39] and generalizes it to multiple, non-disjoint types. Plugging our coresets into existing algorithms for fair clustering such as [5] results in the fastest algorithms for several cases. Empirically, we assess our approach over the \textbf{Adult}, \textbf{Bank}, \textbf{Diabetes} and \textbf{Athlete} dataset, and show that the coreset sizes are much smaller than the full dataset. We also achieve a speed-up to recent fair clustering algorithms [5,6] by incorporating our coreset construction.

Journal ArticleDOI
TL;DR: Xor filters can be faster than Bloom and cuckoo filters while using less memory and it is found that a more compact version of xor filters (xor+) can use even less space than highly compact alternatives (e.g., Golomb-compressed sequences) while providing speeds competitive with Bloom filters.
Abstract: The Bloom filter provides fast approximate set membership while using little memory. Engineers often use these filters to avoid slow operations such as disk or network accesses. As an alternative, a cuckoo filter may need less space than a Bloom filter and it is faster. Chazelle et al. proposed a generalization of the Bloom filter called the Bloomier filter. Dietzfelbinger and Pagh described a variation on the Bloomier filter that can be used effectively for approximate membership queries. It has never been tested empirically, to our knowledge. We review an efficient implementation of their approach, which we call the xor filter. We find that xor filters can be faster than Bloom and cuckoo filters while using less memory. We further show that a more compact version of xor filters (xor+) can use even less space than highly compact alternatives (e.g., Golomb-compressed sequences) while providing speeds competitive with Bloom filters.
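
The abstract does not spell out the mechanism, so the following Python sketch is a hedged reconstruction of the general idea rather than the authors' implementation. Each key is mapped to three table slots and a small fingerprint, and the table is filled so that the XOR of the three slots equals the fingerprint. The use of BLAKE2 for hashing, 8-bit fingerprints, a table of roughly $1.23n + 32$ slots, and the names `build_xor_filter` and `might_contain` are all assumptions of this sketch.

```python
import hashlib


def _slots_and_fingerprint(key: bytes, seed: int, block: int):
    """Map a key to one slot in each of three disjoint blocks, plus an 8-bit fingerprint."""
    digest = hashlib.blake2b(key, digest_size=16, salt=seed.to_bytes(8, "little")).digest()
    slots = tuple(
        int.from_bytes(digest[4 * j:4 * j + 4], "little") % block + j * block
        for j in range(3)
    )
    return slots, digest[12]


def build_xor_filter(keys, max_tries=100):
    """Build (seed, block, table); retries with a new seed if peeling gets stuck."""
    keys = list(set(keys))
    block = (int(1.23 * len(keys)) + 32) // 3 + 1  # assumed sizing, see lead-in
    size = 3 * block
    for seed in range(max_tries):
        hashed = [_slots_and_fingerprint(k, seed, block) for k in keys]
        count = [0] * size        # number of remaining keys touching each slot
        xor_of_keys = [0] * size  # XOR of the indexes of those keys
        for i, (slots, _) in enumerate(hashed):
            for s in slots:
                count[s] += 1
                xor_of_keys[s] ^= i
        # Peel: repeatedly remove a key that is alone in some slot.
        pending = [s for s in range(size) if count[s] == 1]
        order = []
        while pending:
            s = pending.pop()
            if count[s] != 1:
                continue
            i = xor_of_keys[s]
            order.append((i, s))
            for t in hashed[i][0]:
                count[t] -= 1
                xor_of_keys[t] ^= i
                if count[t] == 1:
                    pending.append(t)
        if len(order) == len(keys):
            # Assign in reverse peel order; table[s] is still 0 when it is set.
            table = [0] * size
            for i, s in reversed(order):
                (h0, h1, h2), fingerprint = hashed[i]
                table[s] = fingerprint ^ table[h0] ^ table[h1] ^ table[h2]
            return seed, block, table
    raise RuntimeError("construction failed; try more seeds")


def might_contain(xor_filter, key: bytes) -> bool:
    """True for every inserted key; true for other keys only on a fingerprint collision."""
    seed, block, table = xor_filter
    (h0, h1, h2), fingerprint = _slots_and_fingerprint(key, seed, block)
    return fingerprint == table[h0] ^ table[h1] ^ table[h2]
```

Unlike a Bloom filter, this structure is built once from a fixed key set and does not support later insertions. In this sketch a query reads exactly three table entries.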

Posted Content
TL;DR: A fairness concept is formulated that takes local population densities into account, together with an approximation algorithm that guarantees a factor of at most 2 in all metric spaces and matching lower bounds in some metric spaces; a variant of the algorithm is applied to real-world address data.
Abstract: When selecting locations for a set of facilities, standard clustering algorithms may place unfair burden on some individuals and neighborhoods. We formulate a fairness concept that takes local population densities into account. In particular, given $k$ facilities to locate and a population of size $n$, we define the "neighborhood radius" of an individual $i$ as the minimum radius of a ball centered at $i$ that contains at least $n/k$ individuals. Our objective is to ensure that each individual has a facility within at most a small constant factor of her neighborhood radius. We present several theoretical results: We show that optimizing this factor is NP-hard; we give an approximation algorithm that guarantees a factor of at most 2 in all metric spaces; and we prove matching lower bounds in some metric spaces. We apply a variant of this algorithm to real-world address data, showing that it is quite different from standard clustering algorithms and outperforms them on our objective function and balances the load between facilities more evenly.
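
The neighborhood radius defined above can be computed directly from pairwise distances. The sketch below is a minimal illustration under two assumptions that the abstract does not state: the metric is Euclidean, and an individual counts as a member of the ball centered on them. The function name `neighborhood_radii` is hypothetical.

```python
import math


def neighborhood_radii(points, k):
    """For each point, the smallest radius whose ball holds at least n/k points.

    points: list of equal-length coordinate tuples (Euclidean metric assumed).
    k: number of facilities to locate.
    The point itself is counted as inside its own ball (an assumption).
    """
    n = len(points)
    needed = math.ceil(n / k)
    radii = []
    for p in points:
        distances = sorted(math.dist(p, q) for q in points)
        radii.append(distances[needed - 1])
    return radii


if __name__ == "__main__":
    people = [(0.0,), (1.0,), (2.0,), (10.0,)]
    print(neighborhood_radii(people, k=2))
```

In the example, the isolated individual at 10 has a much larger radius than the three clustered individuals, so the fairness objective tolerates a more distant facility for that individual.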

Posted Content
TL;DR: This survey gives a unified presentation and comparison of the data structures that have been proposed to store and query a k-mer set and hopes it will serve as a resource for researchers in the field as well as make the area more accessible to researchers outside the field.
Abstract: The analysis of biological sequencing data has been one of the biggest applications of string algorithms. The approaches used in many such applications are based on the analysis of k-mers, which are short fixed-length strings present in a dataset. While these approaches are rather diverse, storing and querying a k-mer set has emerged as a shared underlying component. A set of k-mers has unique features and applications that, over the last ten years, have resulted in many specialized approaches for its representation. In this survey, we give a unified presentation and comparison of the data structures that have been proposed to store and query a k-mer set. We hope this survey will serve as a resource for researchers in the field as well as make the area more accessible to researchers outside the field.
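
For orientation, the simplest way to store and query a k-mer set is a plain hash set, as in the Python sketch below. This is a baseline for comparison, not one of the specialized structures the survey covers; the function name `kmer_set` is hypothetical.

```python
def kmer_set(sequences, k):
    """Collect every length-k substring (k-mer) occurring in the given sequences."""
    if k <= 0:
        raise ValueError("k must be positive")
    kmers = set()
    for sequence in sequences:
        for start in range(len(sequence) - k + 1):
            kmers.add(sequence[start:start + k])
    return kmers


if __name__ == "__main__":
    reads = ["ACGTAC", "GTACGT"]
    index = kmer_set(reads, k=3)
    print(sorted(index))
    print("CGT" in index, "AAA" in index)  # membership queries
```

A hash set stores each k-mer in full, so it does not take advantage of the fact that k-mers from the same sequence overlap in k-1 symbols.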

Proceedings ArticleDOI
TL;DR: This work presents a new approach for the 2-respecting min-cut problem which exploits some cute structural properties so that it only needs to compute the values of $\tilde O(n)$ cuts corresponding to removing $\tilde O(n)$ pairs of tree edges, an operation that can be done quickly in many settings.
Abstract: Consider the following 2-respecting min-cut problem. Given a weighted graph $G$ and its spanning tree $T$, find the minimum cut among the cuts that contain at most two edges in $T$. This problem is an important subroutine in Karger's celebrated randomized near-linear-time min-cut algorithm [STOC'96]. We present a new approach for this problem which can be easily implemented in many settings, leading to the following randomized min-cut algorithms for weighted graphs. * An $O(m\frac{\log^2 n}{\log\log n} + n\log^6 n)$-time sequential algorithm: This improves Karger's $O(m \log^3 n)$ and $O(m\frac{(\log^2 n)\log (n^2/m)}{\log\log n} + n\log^6 n)$ bounds when the input graph is not extremely sparse or dense. Improvements over Karger's bounds were previously known only under a rather strong assumption that the input graph is simple [Henzinger et al. SODA'17; Ghaffari et al. SODA'20]. For unweighted graphs with parallel edges, our bound can be improved to $O(m\frac{\log^{1.5} n}{\log\log n} + n\log^6 n)$. * An algorithm requiring $\tilde O(n)$ cut queries to compute the min-cut of a weighted graph: This answers an open problem by Rubinstein et al. ITCS'18, who obtained a similar bound for simple graphs. * A streaming algorithm that requires $\tilde O(n)$ space and $O(\log n)$ passes to compute the min-cut: The only previous non-trivial exact min-cut algorithm in this setting is the 2-pass $\tilde O(n)$-space algorithm on simple graphs [Rubinstein et al., ITCS'18] (observed by Assadi et al. STOC'19). In contrast to Karger's 2-respecting min-cut algorithm which deploys sophisticated dynamic programming techniques, our approach exploits some cute structural properties so that it only needs to compute the values of $\tilde O(n)$ cuts corresponding to removing $\tilde O(n)$ pairs of tree edges, an operation that can be done quickly in many settings.

Posted Content
TL;DR: A smaller measure, $\delta$, is studied, which can be computed in linear time and captures better the concept of compressibility in repetitive strings, and it is proved that, for some string families, it holds $\gamma = \Omega(\delta \log n)$.
Abstract: Unlike in statistical compression, where Shannon's entropy is a definitive lower bound, no such clear measure exists for the compressibility of repetitive sequences other than the uncomputable Kolmogorov's complexity. Since statistical entropy does not capture repetitiveness, ad-hoc measures like the size $z$ of the Lempel-Ziv parse are frequently used to estimate it. Recently, a more principled measure, the size $\gamma$ of the smallest {\em attractor} of a string $S[1..n]$, was introduced. Measure $\gamma$ lower bounds all the previous relevant ones (e.g., $z$), yet $S$ can be represented and indexed within space $O(\gamma\log(n/\gamma))$, which also upper bounds most measures. While $\gamma$ is certainly a better measure of repetitiveness, it is NP-complete to compute, and it is not known if $S$ can always be represented in $O(\gamma)$ space. In this paper we study a smaller measure, $\delta \le \gamma$, which can be computed in linear time. We show that $\delta$ better captures the concept of compressibility in repetitive strings: We prove that, for some string families, $\gamma = \Omega(\delta \log n)$ holds. Still, we can build a representation of $S$ of size $O(\delta\log(n/\delta))$, which supports direct access to any $S[i]$ in time $O(\log(n/\delta))$ and finds the $occ$ occurrences of any pattern $P[1..m]$ in time $O(m\log n + occ\log^\epsilon n)$ for any constant $\epsilon>0$. Further, such a representation is worst-case optimal because, in some families, $S$ can only be represented in $\Omega(\delta\log n)$ space. We complete our characterization of $\delta$ by showing that $\gamma$, $z$ and other measures of repetitiveness are always $O(\delta\log(n/\delta))$, but in some string families, the smallest context-free grammar is of size $g=\Omega(\delta \log^2 n / \log\log n)$. No such lower bound is known to hold for $\gamma$.
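
The abstract does not define $\delta$. The sketch below assumes the substring-complexity definition used in the literature on this measure, $\delta = \max_{1 \le k \le n} d_k(S)/k$, where $d_k(S)$ is the number of distinct length-$k$ substrings of $S$. It is a naive illustration, far from the linear-time computation the abstract mentions, and the function name `delta_measure` is hypothetical.

```python
from fractions import Fraction


def delta_measure(s: str) -> Fraction:
    """Naive delta: max over k of (number of distinct length-k substrings) / k.

    Assumes the substring-complexity definition of delta; this brute-force
    version is only suitable for short strings.
    """
    best = Fraction(0)
    n = len(s)
    for k in range(1, n + 1):
        distinct = len({s[i:i + k] for i in range(n - k + 1)})
        best = max(best, Fraction(distinct, k))
    return best


if __name__ == "__main__":
    for text in ["aaaaaaaa", "abababab", "abcdefgh"]:
        print(text, delta_measure(text))
```

Highly repetitive strings such as runs of one letter keep $\delta$ at a small constant, while a string of all-distinct symbols has $\delta = n$.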

Proceedings ArticleDOI
TL;DR: Nanongkai et al. present the first sublinear-time algorithm for a distributed message-passing network to compute its edge connectivity exactly in the CONGEST model, as long as there are no parallel edges.
Abstract: We present the first sublinear-time algorithm for a distributed message-passing network to compute its edge connectivity $\lambda$ exactly in the CONGEST model, as long as there are no parallel edges. Our algorithm takes $\tilde O(n^{1-1/353}D^{1/353}+n^{1-1/706})$ time to compute $\lambda$ and a cut of cardinality $\lambda$ with high probability, where $n$ and $D$ are the number of nodes and the diameter of the network, respectively, and $\tilde O$ hides polylogarithmic factors. This running time is sublinear in $n$ (i.e. $\tilde O(n^{1-\epsilon})$) whenever $D$ is. Previous sublinear-time distributed algorithms can solve this problem either (i) exactly only when $\lambda=O(n^{1/8-\epsilon})$ [Thurimella PODC'95; Pritchard, Thurimella, ACM Trans. Algorithms'11; Nanongkai, Su, DISC'14] or (ii) approximately [Ghaffari, Kuhn, DISC'13; Nanongkai, Su, DISC'14]. To achieve this we develop and combine several new techniques. First, we design the first distributed algorithm that can compute a $k$-edge connectivity certificate for any $k=O(n^{1-\epsilon})$ in time $\tilde O(\sqrt{nk}+D)$. Second, we show that by combining the recent distributed expander decomposition technique of [Chang, Pettie, Zhang, SODA'19] with techniques from the sequential deterministic edge connectivity algorithm of [Kawarabayashi, Thorup, STOC'15], we can decompose the network into a sublinear number of clusters with small average diameter and without any mincut separating a cluster (except the `trivial' ones). Finally, by extending the tree packing technique from [Karger STOC'96], we can find the minimum cut in time proportional to the number of components. As a byproduct of this technique, we obtain an $\tilde O(n)$-time algorithm for computing exact minimum cut for weighted graphs.