scispace - formally typeset
Search or ask a question

Showing papers in "arXiv: Data Structures and Algorithms in 2015"


Posted Content
TL;DR: The algorithm builds on Luks’s SI framework and attacks the barrier configurations for Luks's algorithm by group theoretic “local certificates” and combinatorial canonical partitioning techniques and shows that in a well-defined sense, Johnson graphs are the only obstructions to effective canonical partitioned.
Abstract: We show that the Graph Isomorphism (GI) problem and the related problems of String Isomorphism (under group action) (SI) and Coset Intersection (CI) can be solved in quasipolynomial (exp (logn) O(1) � ) time. The best previous bound for GI was exp(O( √ nlogn)), where n is the number of vertices (Luks, 1983); for the other two problems, the bound was similar, exp( e O( √ n)), where n is the size of the permutation domain (Babai, 1983). The algorithm builds on Luks’s SI framework and attacks the barrier configurations for Luks’s algorithm by group theoretic “local certificates” and combinatorial canonical partitioning techniques. We show that in a well-defined sense, Johnson graphs are the only obstructions to effective canonical partitioning.

571 citations


Posted Content
TL;DR: Andoni et al. as mentioned in this paper showed the existence of a Locality-Sensitive Hashing (LSH) family for the angular distance that yields an approximate Near Neighbor Search algorithm with the asymptotically optimal running time exponent.
Abstract: We show the existence of a Locality-Sensitive Hashing (LSH) family for the angular distance that yields an approximate Near Neighbor Search algorithm with the asymptotically optimal running time exponent. Unlike earlier algorithms with this property (e.g., Spherical LSH [Andoni, Indyk, Nguyen, Razenshteyn 2014], [Andoni, Razenshteyn 2015]), our algorithm is also practical, improving upon the well-studied hyperplane LSH [Charikar, 2002] in practice. We also introduce a multiprobe version of this algorithm, and conduct experimental evaluation on real and synthetic data sets. We complement the above positive results with a fine-grained lower bound for the quality of any LSH family for angular distance. Our lower bound implies that the above LSH family exhibits a trade-off between evaluation time and quality that is close to optimal for a natural class of LSH functions.

326 citations


Posted Content
TL;DR: In this article, an optimal data-dependent hashing scheme for the approximate near neighbor problem was proposed, which achieves query time O(d n^{\rho+o(1)}) and space O(n 2 + o(1 + dn) for the Euclidean space and approximation factor c>1 for the Hamming space.
Abstract: We show an optimal data-dependent hashing scheme for the approximate near neighbor problem. For an $n$-point data set in a $d$-dimensional space our data structure achieves query time $O(d n^{\rho+o(1)})$ and space $O(n^{1+\rho+o(1)} + dn)$, where $\rho=\tfrac{1}{2c^2-1}$ for the Euclidean space and approximation $c>1$. For the Hamming space, we obtain an exponent of $\rho=\tfrac{1}{2c-1}$. Our result completes the direction set forth in [AINR14] who gave a proof-of-concept that data-dependent hashing can outperform classical Locality Sensitive Hashing (LSH). In contrast to [AINR14], the new bound is not only optimal, but in fact improves over the best (optimal) LSH data structures [IM98,AI06] for all approximation factors $c>1$. From the technical perspective, we proceed by decomposing an arbitrary dataset into several subsets that are, in a certain sense, pseudo-random.

211 citations


Posted Content
TL;DR: In this article, a simple randomized block Krylov subspace method was proposed, which converges quickly for any matrix, independently of singular value gaps, and achieves a low-rank approximation within a factor of 1 + ϵ of the spectral norm error.
Abstract: Since being analyzed by Rokhlin, Szlam, and Tygert and popularized by Halko, Martinsson, and Tropp, randomized Simultaneous Power Iteration has become the method of choice for approximate singular value decomposition. It is more accurate than simpler sketching algorithms, yet still converges quickly for any matrix, independently of singular value gaps. After $\tilde{O}(1/\epsilon)$ iterations, it gives a low-rank approximation within $(1+\epsilon)$ of optimal for spectral norm error. We give the first provable runtime improvement on Simultaneous Iteration: a simple randomized block Krylov method, closely related to the classic Block Lanczos algorithm, gives the same guarantees in just $\tilde{O}(1/\sqrt{\epsilon})$ iterations and performs substantially better experimentally. Despite their long history, our analysis is the first of a Krylov subspace method that does not depend on singular value gaps, which are unreliable in practice. Furthermore, while it is a simple accuracy benchmark, even $(1+\epsilon)$ error for spectral norm low-rank approximation does not imply that an algorithm returns high quality principal components, a major issue for data applications. We address this problem for the first time by showing that both Block Krylov Iteration and a minor modification of Simultaneous Iteration give nearly optimal PCA for any matrix. This result further justifies their strength over non-iterative sketching methods. Finally, we give insight beyond the worst case, justifying why both algorithms can run much faster in practice than predicted. We clarify how simple techniques can take advantage of common matrix properties to significantly improve runtime.

152 citations


Posted Content
TL;DR: In this article, the authors presented a truly subquadratic approximation algorithm for most of the versions of Diameter and radius with \emph{optimal} approximation guarantees, under plausible assumptions, under the assumption that the eccentricity of a vertex is the largest distance from the vertex to another node.
Abstract: The radius and diameter are fundamental graph parameters They are defined as the minimum and maximum of the eccentricities in a graph, respectively, where the eccentricity of a vertex is the largest distance from the vertex to another node In directed graphs, there are several versions of these problems For instance, one may choose to define the eccentricity of a node in terms of the largest distance into the node, out of the node, the sum of the two directions (ie roundtrip) and so on All versions of diameter and radius can be solved via solving all-pairs shortest paths (APSP), followed by a fast postprocessing step Solving APSP, however, on $n$-node graphs requires $\Omega(n^2)$ time even in sparse graphs, as one needs to output $n^2$ distances Motivated by known and new negative results on the impossibility of computing these measures exactly in general graphs in truly subquadratic time, under plausible assumptions, we search for \emph{approximation} and \emph{fixed parameter subquadratic} algorithms, and for reasons why they do not exist Our results include: - Truly subquadratic approximation algorithms for most of the versions of Diameter and Radius with \emph{optimal} approximation guarantees (given truly subquadratic time), under plausible assumptions In particular, there is a $2$-approximation algorithm for directed Radius with one-way distances that runs in $\tilde{O}(m\sqrt{n})$ time, while a $(2-\delta)$-approximation algorithm in $O(n^{2-\epsilon})$ time is unlikely - On graphs with treewidth $k$, we can solve the problems in $2^{O(k\log{k})}n^{1+o(1)}$ time We show that these algorithms are near optimal since even a $(3/2-\delta)$-approximation algorithm that runs in time $2^{o(k)}n^{2-\epsilon}$ would refute the plausible assumptions

138 citations


Posted Content
TL;DR: It is shown that one can compute driving directions in milliseconds or less even at continental scale, and a variety of techniques provide different trade-offs between preprocessing effort, space requirements, and query time.
Abstract: We survey recent advances in algorithms for route planning in transportation networks. For road networks, we show that one can compute driving directions in milliseconds or less even at continental scale. A variety of techniques provide different trade-offs between preprocessing effort, space requirements, and query time. Some algorithms can answer queries in a fraction of a microsecond, while others can deal efficiently with real-time traffic. Journey planning on public transportation systems, although conceptually similar, is a significantly harder problem due to its inherent time-dependent and multicriteria nature. Although exact algorithms are fast enough for interactive queries on metropolitan transit systems, dealing with continent-sized instances requires simplifications or heavy preprocessing. The multimodal route planning problem, which seeks journeys combining schedule-based transportation (buses, trains) with unrestricted modes (walking, driving), is even harder, relying on approximate solutions even for metropolitan inputs.

133 citations


Posted Content
TL;DR: This paper develops an algorithm that is the first streaming algorithm that can maintain the densest subgraph in one pass and can be extended to a (2+ε)-approximation sublinear-time algorithm and a distributed-streaming algorithm.
Abstract: While in many graph mining applications it is crucial to handle a stream of updates efficiently in terms of {\em both} time and space, not much was known about achieving such type of algorithm. In this paper we study this issue for a problem which lies at the core of many graph mining applications called {\em densest subgraph problem}. We develop an algorithm that achieves time- and space-efficiency for this problem simultaneously. It is one of the first of its kind for graph problems to the best of our knowledge. In a graph $G = (V, E)$, the "density" of a subgraph induced by a subset of nodes $S \subseteq V$ is defined as $|E(S)|/|S|$, where $E(S)$ is the set of edges in $E$ with both endpoints in $S$. In the densest subgraph problem, the goal is to find a subset of nodes that maximizes the density of the corresponding induced subgraph. For any $\epsilon>0$, we present a dynamic algorithm that, with high probability, maintains a $(4+\epsilon)$-approximation to the densest subgraph problem under a sequence of edge insertions and deletions in a graph with $n$ nodes. It uses $\tilde O(n)$ space, and has an amortized update time of $\tilde O(1)$ and a query time of $\tilde O(1)$. Here, $\tilde O$ hides a $O(\poly\log_{1+\epsilon} n)$ term. The approximation ratio can be improved to $(2+\epsilon)$ at the cost of increasing the query time to $\tilde O(n)$. It can be extended to a $(2+\epsilon)$-approximation sublinear-time algorithm and a distributed-streaming algorithm. Our algorithm is the first streaming algorithm that can maintain the densest subgraph in {\em one pass}. The previously best algorithm in this setting required $O(\log n)$ passes [Bahmani, Kumar and Vassilvitskii, VLDB'12]. The space required by our algorithm is tight up to a polylogarithmic factor.

101 citations


Proceedings ArticleDOI
TL;DR: In this article, a new algorithm for Personalized PageRank (PPR) estimation and personalized PageRank search is proposed, which identifies the most relevant results by sampling candidates based on their PPR and finds the best results without iterating through the set of all candidate results.
Abstract: We present new algorithms for Personalized PageRank estimation and Personalized PageRank search. First, for the problem of estimating Personalized PageRank (PPR) from a source distribution to a target node, we present a new bidirectional estimator with simple yet strong guarantees on correctness and performance, and 3x to 8x speedup over existing estimators in experiments on a diverse set of networks. Moreover, it has a clean algebraic structure which enables it to be used as a primitive for the Personalized PageRank Search problem: Given a network like Facebook, a query like "people named John", and a searching user, return the top nodes in the network ranked by PPR from the perspective of the searching user. Previous solutions either score all nodes or score candidate nodes one at a time, which is prohibitively slow for large candidate sets. We develop a new algorithm based on our bidirectional PPR estimator which identifies the most relevant results by sampling candidates based on their PPR; this is the first solution to PPR search that can find the best results without iterating through the set of all candidate results. Finally, by combining PPR sampling with sequential PPR estimation and Monte Carlo, we develop practical algorithms for PPR search, and we show via experiments that our algorithms are efficient on networks with billions of edges.

100 citations


Posted Content
TL;DR: A novel combination of classic numerical techniques of low rank update, preconditioning, and fast matrix multiplication as well as recent work on subspace embeddings and spectral sparsification are used that will be of independent interest.
Abstract: In this paper, we consider the following inverse maintenance problem: given $A \in \mathbb{R}^{n\times d}$ and a number of rounds $r$, we receive a $n\times n$ diagonal matrix $D^{(k)}$ at round $k$ and we wish to maintain an efficient linear system solver for $A^{T}D^{(k)}A$ under the assumption $D^{(k)}$ does not change too rapidly. This inverse maintenance problem is the computational bottleneck in solving multiple optimization problems. We show how to solve this problem with $\tilde{O}(nnz(A)+d^{\omega})$ preprocessing time and amortized $\tilde{O}(nnz(A)+d^{2})$ time per round, improving upon previous running times for solving this problem. Consequently, we obtain the fastest known running times for solving multiple problems including, linear programming and computing a rounding of a polytope. In particular given a feasible point in a linear program with $d$ variables, $n$ constraints, and constraint matrix $A\in\mathbb{R}^{n\times d}$, we show how to solve the linear program in time $\tilde{O}(nnz(A)+d^{2})\sqrt{d}\log(\epsilon^{-1}))$. We achieve our results through a novel combination of classic numerical techniques of low rank update, preconditioning, and fast matrix multiplication as well as recent work on subspace embeddings and spectral sparsification that we hope will be of independent interest.

97 citations


Proceedings ArticleDOI
TL;DR: It is shown that a conjecture that there is no truly subcubic (O(n3-ε) time algorithm for this problem can be used to exhibit the underlying polynomial time hardness shared by many dynamic problems.
Abstract: Consider the following Online Boolean Matrix-Vector Multiplication problem: We are given an $n\times n$ matrix $M$ and will receive $n$ column-vectors of size $n$, denoted by $v_1,\ldots,v_n$, one by one. After seeing each vector $v_i$, we have to output the product $Mv_i$ before we can see the next vector. A naive algorithm can solve this problem using $O(n^3)$ time in total, and its running time can be slightly improved to $O(n^3/\log^2 n)$ [Williams SODA'07]. We show that a conjecture that there is no truly subcubic ($O(n^{3-\epsilon})$) time algorithm for this problem can be used to exhibit the underlying polynomial time hardness shared by many dynamic problems. For a number of problems, such as subgraph connectivity, Pagh's problem, $d$-failure connectivity, decremental single-source shortest paths, and decremental transitive closure, this conjecture implies tight hardness results. Thus, proving or disproving this conjecture will be very interesting as it will either imply several tight unconditional lower bounds or break through a common barrier that blocks progress with these problems. This conjecture might also be considered as strong evidence against any further improvement for these problems since refuting it will imply a major breakthrough for combinatorial Boolean matrix multiplication and other long-standing problems if the term "combinatorial algorithms" is interpreted as "non-Strassen-like algorithms" [Ballard et al. SPAA'11]. The conjecture also leads to hardness results for problems that were previously based on diverse problems and conjectures, such as 3SUM, combinatorial Boolean matrix multiplication, triangle detection, and multiphase, thus providing a uniform way to prove polynomial hardness results for dynamic algorithms; some of the new proofs are also simpler or even become trivial. The conjecture also leads to stronger and new, non-trivial, hardness results.

78 citations


Posted Content
TL;DR: In this article, a new rounding technique called online contention resolution schemes (OCRSs) is introduced for online optimization problems, which is related to contention resolution scheme, a technique initially introduced in the context of submodular function maximization.
Abstract: We introduce a new rounding technique designed for online optimization problems, which is related to contention resolution schemes, a technique initially introduced in the context of submodular function maximization. Our rounding technique, which we call online contention resolution schemes (OCRSs), is applicable to many online selection problems, including Bayesian online selection, oblivious posted pricing mechanisms, and stochastic probing models. It allows for handling a wide set of constraints, and shares many strong properties of offline contention resolution schemes. In particular, OCRSs for different constraint families can be combined to obtain an OCRS for their intersection. Moreover, we can approximately maximize submodular functions in the online settings we consider. We, thus, get a broadly applicable framework for several online selection problems, which improves on previous approaches in terms of the types of constraints that can be handled, the objective functions that can be dealt with, and the assumptions on the strength of the adversary. Furthermore, we resolve two open problems from the literature; namely, we present the first constant-factor constrained oblivious posted price mechanism for matroid constraints, and the first constant-factor algorithm for weighted stochastic probing with deadlines.

Posted Content
TL;DR: These new algorithms are used to derive the first nearly linear time algorithms for solving systems of equations in connection Laplacians---a generalization of LaplACian matrices that arise in many problems in image and signal processing.
Abstract: We introduce the sparsified Cholesky and sparsified multigrid algorithms for solving systems of linear equations. These algorithms accelerate Gaussian elimination by sparsifying the nonzero matrix entries created by the elimination process. We use these new algorithms to derive the first nearly linear time algorithms for solving systems of equations in connection Laplacians, a generalization of Laplacian matrices that arise in many problems in image and signal processing. We also prove that every connection Laplacian has a linear sized approximate inverse. This is an LU factorization with a linear number of nonzero entries that is a strong approximation of the original matrix. Using such a factorization one can solve systems of equations in a connection Laplacian in linear time. Such a factorization was unknown even for ordinary graph Laplacians.

Posted Content
Insu Han1, Dmitry Malioutov2, Jinwoo Shin1
TL;DR: This work proposes a linear-time randomized algorithm to approximate log-determinants for very large-scale positive definite and general non-singular matrices using a stochastic trace approximation, called the Hutchinson method, coupled with Chebyshev polynomial expansions that both rely on efficient matrix-vector multiplications.
Abstract: Logarithms of determinants of large positive definite matrices appear ubiquitously in machine learning applications including Gaussian graphical and Gaussian process models, partition functions of discrete graphical models, minimum-volume ellipsoids, metric learning and kernel learning. Log-determinant computation involves the Cholesky decomposition at the cost cubic in the number of variables, i.e., the matrix dimension, which makes it prohibitive for large-scale applications. We propose a linear-time randomized algorithm to approximate log-determinants for very large-scale positive definite and general non-singular matrices using a stochastic trace approximation, called the Hutchinson method, coupled with Chebyshev polynomial expansions that both rely on efficient matrix-vector multiplications. We establish rigorous additive and multiplicative approximation error bounds depending on the condition number of the input matrix. In our experiments, the proposed algorithm can provide very high accuracy solutions at orders of magnitude faster time than the Cholesky decomposition and Schur completion, and enables us to compute log-determinants of matrices involving tens of millions of variables.

Posted Content
TL;DR: A 1/e-competitive algorithm for the unconstrained case in which the algorithm may hold any subset of the elements, and constant competitive ratio algorithms for the case where the algorithms may hold at most k elements in its solution.
Abstract: Submodular function maximization has been studied extensively in recent years under various constraints and models. The problem plays a major role in various disciplines. We study a natural online variant of this problem in which elements arrive one-by-one and the algorithm has to maintain a solution obeying certain constraints at all times. Upon arrival of an element, the algorithm has to decide whether to accept the element into its solution and may preempt previously chosen elements. The goal is to maximize a submodular function over the set of elements in the solution. We study two special cases of this general problem and derive upper and lower bounds on the competitive ratio. Specifically, we design a $1/e$-competitive algorithm for the unconstrained case in which the algorithm may hold any subset of the elements, and constant competitive ratio algorithms for the case where the algorithm may hold at most $k$ elements in its solution.

Posted Content
TL;DR: In this article, a lower bound on the sample complexity of differentially private algorithms for answering statistical queries on high-dimensional databases has been established, which depends optimally on the probability that the algorithm fails to be private, and is the first to smoothly interpolate between approximate differential privacy and pure differential privacy.
Abstract: We show a new lower bound on the sample complexity of $(\varepsilon, \delta)$-differentially private algorithms that accurately answer statistical queries on high-dimensional databases. The novelty of our bound is that it depends optimally on the parameter $\delta$, which loosely corresponds to the probability that the algorithm fails to be private, and is the first to smoothly interpolate between approximate differential privacy ($\delta > 0$) and pure differential privacy ($\delta = 0$). Specifically, we consider a database $D \in \{\pm1\}^{n \times d}$ and its \emph{one-way marginals}, which are the $d$ queries of the form "What fraction of individual records have the $i$-th bit set to $+1$?" We show that in order to answer all of these queries to within error $\pm \alpha$ (on average) while satisfying $(\varepsilon, \delta)$-differential privacy, it is necessary that $$ n \geq \Omega\left( \frac{\sqrt{d \log(1/\delta)}}{\alpha \varepsilon} \right), $$ which is optimal up to constant factors. To prove our lower bound, we build on the connection between \emph{fingerprinting codes} and lower bounds in differential privacy (Bun, Ullman, and Vadhan, STOC'14). In addition to our lower bound, we give new purely and approximately differentially private algorithms for answering arbitrary statistical queries that improve on the sample complexity of the standard Laplace and Gaussian mechanisms for achieving worst-case accuracy guarantees by a logarithmic factor.

Posted Content
TL;DR: A 2n+o(n)-time and space randomized algorithm for solving the exact Closest Vector Problem (CVP) on n-dimensional Euclidean lattices and it is shown that the approximate closest vectors to a target vector t can be grouped into “lower-dimensional clusters,” and the discrete Gaussian sampling algorithm can be used to solve this variant of approximate CVP.
Abstract: We give a $2^{n+o(n)}$-time and space randomized algorithm for solving the exact Closest Vector Problem (CVP) on $n$-dimensional Euclidean lattices. This improves on the previous fastest algorithm, the deterministic $\widetilde{O}(4^{n})$-time and $\widetilde{O}(2^{n})$-space algorithm of Micciancio and Voulgaris. We achieve our main result in three steps. First, we show how to modify the sampling algorithm from [ADRS15] to solve the problem of discrete Gaussian sampling over lattice shifts, $L- t$, with very low parameters. While the actual algorithm is a natural generalization of [ADRS15], the analysis uses substantial new ideas. This yields a $2^{n+o(n)}$-time algorithm for approximate CVP for any approximation factor $\gamma = 1+2^{-o(n/\log n)}$. Second, we show that the approximate closest vectors to a target vector $t$ can be grouped into "lower-dimensional clusters," and we use this to obtain a recursive reduction from exact CVP to a variant of approximate CVP that "behaves well with these clusters." Third, we show that our discrete Gaussian sampling algorithm can be used to solve this variant of approximate CVP. The analysis depends crucially on some new properties of the discrete Gaussian distribution and approximate closest vectors, which might be of independent interest.

Posted Content
TL;DR: In this article, the authors considered the problem of maximizing a nonnegative submodular set function subject to a $p$-matchoid constraint in the single-pass streaming setting.
Abstract: We consider the problem of maximizing a nonnegative submodular set function $f:2^{\mathcal{N}} \rightarrow \mathbb{R}^+$ subject to a $p$-matchoid constraint in the single-pass streaming setting. Previous work in this context has considered streaming algorithms for modular functions and monotone submodular functions. The main result is for submodular functions that are {\em non-monotone}. We describe deterministic and randomized algorithms that obtain a $\Omega(\frac{1}{p})$-approximation using $O(k \log k)$-space, where $k$ is an upper bound on the cardinality of the desired set. The model assumes value oracle access to $f$ and membership oracles for the matroids defining the $p$-matchoid constraint.

Posted Content
TL;DR: In particular, the authors showed that no known algorithm can achieve sublinear regret and constant competitive ratio simultaneously, even in the case of one-dimensional decision spaces, while maintaining a competitive ratio growing arbitrarily slowly.
Abstract: We consider algorithms for "smoothed online convex optimization" problems, a variant of the class of online convex optimization problems that is strongly related to metrical task systems. Prior literature on these problems has focused on two performance metrics: regret and the competitive ratio. There exist known algorithms with sublinear regret and known algorithms with constant competitive ratios; however, no known algorithm achieves both simultaneously. We show that this is due to a fundamental incompatibility between these two metrics - no algorithm (deterministic or randomized) can achieve sublinear regret and a constant competitive ratio, even in the case when the objective functions are linear. However, we also exhibit an algorithm that, for the important special case of one-dimensional decision spaces, provides sublinear regret while maintaining a competitive ratio that grows arbitrarily slowly.

Posted Content
TL;DR: In this article, the spectral norm guarantee for approximate matrix multiplication with a dimensionality-reducing map having O(m = O(tilde{r}/\varepsilon^2) rows was shown.
Abstract: We prove, using the subspace embedding guarantee in a black box way, that one can achieve the spectral norm guarantee for approximate matrix multiplication with a dimensionality-reducing map having $m = O(\tilde{r}/\varepsilon^2)$ rows. Here $\tilde{r}$ is the maximum stable rank, i.e. squared ratio of Frobenius and operator norms, of the two matrices being multiplied. This is a quantitative improvement over previous work of [MZ11, KVZ14], and is also optimal for any oblivious dimensionality-reducing map. Furthermore, due to the black box reliance on the subspace embedding property in our proofs, our theorem can be applied to a much more general class of sketching matrices than what was known before, in addition to achieving better bounds. For example, one can apply our theorem to efficient subspace embeddings such as the Subsampled Randomized Hadamard Transform or sparse subspace embeddings, or even with subspace embedding constructions that may be developed in the future. Our main theorem, via connections with spectral error matrix multiplication shown in prior work, implies quantitative improvements for approximate least squares regression and low rank approximation. Our main result has also already been applied to improve dimensionality reduction guarantees for $k$-means clustering [CEMMP14], and implies new results for nonparametric regression [YPW15]. We also separately point out that the proof of the "BSS" deterministic row-sampling result of [BSS12] can be modified to show that for any matrices $A, B$ of stable rank at most $\tilde{r}$, one can achieve the spectral norm guarantee for approximate matrix multiplication of $A^T B$ by deterministically sampling $O(\tilde{r}/\varepsilon^2)$ rows that can be found in polynomial time. The original result of [BSS12] was for rank instead of stable rank. Our observation leads to a stronger version of a main theorem of [KMST10].

Posted Content
TL;DR: For bipartite graphs, this paper gave a randomized algorithm that maintains a 3/2 + 2-approximation in worst-case update time O(m^{1/4}\eps^{/2.5}) when the arboricity is polylogarithmic.
Abstract: Maximum cardinality matching in bipartite graphs is an important and well-studied problem. The fully dynamic version, in which edges are inserted and deleted over time has also been the subject of much attention. Existing algorithms for dynamic matching (in general graphs) seem to fall into two groups: there are fast (mostly randomized) algorithms that do not achieve a better than 2-approximation, and there slow algorithms with $O(\sqrt{m})$ update time that achieve a better-than-2 approximation. Thus the obvious question is whether we can design an algorithm -- deterministic or randomized -- that achieves a tradeoff between these two: a $o(\sqrt{m})$ approximation and a better-than-2 approximation simultaneously. We answer this question in the affirmative for bipartite graphs. Our main result is a fully dynamic algorithm that maintains a $3/2 + \eps$ approximation in worst-case update time $O(m^{1/4}\eps^{/2.5})$. We also give stronger results for graphs whose arboricity is at most $\al$, achieving a $(1+ \eps)$ approximation in worst-case time $O(\al (\al + \log n))$ for constant $\eps$. When the arboricity is constant, this bound is $O(\log n)$ and when the arboricity is polylogarithmic the update time is also polylogarithmic. The most important technical developement is the use of an intermediate graph we call an edge degree constrained subgraph (EDCS). This graph places constraints on the sum of the degrees of the endpoints of each edge: upper bounds for matched edges and lower bounds for unmatched edges. The main technical content of our paper involves showing both how to maintain an EDCS dynamically and that and EDCS always contains a sufficiently large matching. We also make use of graph orientations to help bound the amount of work done during each update.

Posted Content
TL;DR: It is shown that the viewpoint of semirandom models can help explain why some algorithms are preferred to others in practice, in spite of the gaps in their statistical performance on random models, and that algorithms based on semidefinite programming are robust in ways that any algorithm meeting the information-theoretic threshold cannot be.
Abstract: The stochastic block model is one of the oldest and most ubiquitous models for studying clustering and community detection. In an exciting sequence of developments, motivated by deep but non-rigorous ideas from statistical physics, Decelle et al. conjectured a sharp threshold for when community detection is possible in the sparse regime. Mossel, Neeman and Sly and Massoulie proved the conjecture and gave matching algorithms and lower bounds. Here we revisit the stochastic block model from the perspective of semirandom models where we allow an adversary to make `helpful' changes that strengthen ties within each community and break ties between them. We show a surprising result that these `helpful' changes can shift the information-theoretic threshold, making the community detection problem strictly harder. We complement this by showing that an algorithm based on semidefinite programming (which was known to get close to the threshold) continues to work in the semirandom model (even for partial recovery). This suggests that algorithms based on semidefinite programming are robust in ways that any algorithm meeting the information-theoretic threshold cannot be. These results point to an interesting new direction: Can we find robust, semirandom analogues to some of the classical, average-case thresholds in statistics? We also explore this question in the broadcast tree model, and we show that the viewpoint of semirandom models can help explain why some algorithms are preferred to others in practice, in spite of the gaps in their statistical performance on random models.

Posted Content
TL;DR: This paper considers fully dynamic graph algorithms with both faster worst case update time and sublinear space, and shows that 2-edge connectivity can be maintained using O(n log^2 n) words with an amortized update time of O(log^6 n).
Abstract: This paper considers fully dynamic graph algorithms with both faster worst case update time and sublinear space. The fully dynamic graph connectivity problem is the following: given a graph on a fixed set of n nodes, process an online sequence of edge insertions, edge deletions, and queries of the form "Is there a path between nodes a and b?" In 2013, the first data structure was presented with worst case time per operation which was polylogarithmic in n. In this paper, we shave off a factor of log n from that time, to O(log^4 n) per update. For sequences which are polynomial in length, our algorithm answers queries in O(log n/\log\log n) time correctly with high probability and using O(n \log^2 n) words (of size log n). This matches the amount of space used by the most space-efficient graph connectivity streaming algorithm. We also show that 2-edge connectivity can be maintained using O(n log^2 n) words with an amortized update time of O(log^6 n).

Proceedings ArticleDOI
TL;DR: A new algorithm is presented for generating a uniformly random spanning tree in an undirected graph that establishes a new connection between the effective resistance metric and the cut structure of the underlying graph.
Abstract: We present a new algorithm for generating a uniformly random spanning tree in an undirected graph. Our algorithm samples such a tree in expected $\tilde{O}(m^{4/3})$ time. This improves over the best previously known bound of $\min(\tilde{O}(m\sqrt{n}),O(n^{\omega}))$ -- that follows from the work of Kelner and Mądry [FOCS'09] and of Colbourn et al. [J. Algorithms'96] -- whenever the input graph is sufficiently sparse. At a high level, our result stems from carefully exploiting the interplay of random spanning trees, random walks, and the notion of effective resistance, as well as from devising a way to algorithmically relate these concepts to the combinatorial structure of the graph. This involves, in particular, establishing a new connection between the effective resistance metric and the cut structure of the underlying graph.

Posted Content
TL;DR: KaHyPar as discussed by the authors is a multilevel algorithm for hypergraph partitioning that contracts the vertices one at a time, using several caching and lazy-evaluation techniques during coarsening and refinement.
Abstract: We develop a multilevel algorithm for hypergraph partitioning that contracts the vertices one at a time. Using several caching and lazy-evaluation techniques during coarsening and refinement, we reduce the running time by up to two-orders of magnitude compared to a naive $n$-level algorithm that would be adequate for ordinary graph partitioning. The overall performance is even better than the widely used hMetis hypergraph partitioner that uses a classical multilevel algorithm with few levels. Aided by a portfolio-based approach to initial partitioning and adaptive budgeting of imbalance within recursive bipartitioning, we achieve very high quality. We assembled a large benchmark set with 310 hypergraphs stemming from application areas such VLSI, SAT solving, social networks, and scientific computing. We achieve significantly smaller cuts than hMetis and PaToH, while being faster than hMetis. Considerably larger improvements are observed for some instance classes like social networks, for bipartitioning, and for partitions with an allowed imbalance of 10%. The algorithm presented in this work forms the basis of our hypergraph partitioning framework KaHyPar (Karlsruhe Hypergraph Partitioning).

Posted Content
TL;DR: A carefully-designed hybrid of Monte-Carlo tree search, novel neighbourhood moves, memetic algorithms, and hyper-heuristic methods that significantly outperforms other solvers on a set of "hidden" instances, i.e. instances not available at the algorithm design phase.
Abstract: Multi-mode resource and precedence-constrained project scheduling is a well-known challenging real-world optimisation problem. An important variant of the problem requires scheduling of activities for multiple projects considering availability of local and global resources while respecting a range of constraints. A critical aspect of the benchmarks addressed in this paper is that the primary objective is to minimise the sum of the project completion times, with the usual makespan minimisation as a secondary objective. We observe that this leads to an expected different overall structure of good solutions and discuss the effects this has on the algorithm design. This paper presents a carefully designed hybrid of Monte-Carlo tree search, novel neighbourhood moves, memetic algorithms, and hyper-heuristic methods. The implementation is also engineered to increase the speed with which iterations are performed, and to exploit the computing power of multicore machines. Empirical evaluation shows that the resulting information-sharing multi-component algorithm significantly outperforms other solvers on a set of "hidden" instances, i.e. instances not available at the algorithm design phase.

Posted Content
TL;DR: In this article, the authors studied the complexity of k-mismatch in both the standard offline setting and also as a streaming problem, where the text arrives one symbol at a time and we must give an output before processing any future symbols.
Abstract: We revisit the complexity of one of the most basic problems in pattern matching. In the k-mismatch problem we must compute the Hamming distance between a pattern of length m and every m-length substring of a text of length n, as long as that Hamming distance is at most k. Where the Hamming distance is greater than k at some alignment of the pattern and text, we simply output "No". We study this problem in both the standard offline setting and also as a streaming problem. In the streaming k-mismatch problem the text arrives one symbol at a time and we must give an output before processing any future symbols. Our main results are as follows: 1) Our first result is a deterministic $O(n k^2\log{k} / m+n \text{polylog} m)$ time offline algorithm for k-mismatch on a text of length n. This is a factor of k improvement over the fastest previous result of this form from SODA 2000 by Amihood Amir et al. 2) We then give a randomised and online algorithm which runs in the same time complexity but requires only $O(k^2\text{polylog} {m})$ space in total. 3) Next we give a randomised $(1+\epsilon)$-approximation algorithm for the streaming k-mismatch problem which uses $O(k^2\text{polylog} m / \epsilon^2)$ space and runs in $O(\text{polylog} m / \epsilon^2)$ worst-case time per arriving symbol. 4) Finally we combine our new results to derive a randomised $O(k^2\text{polylog} {m})$ space algorithm for the streaming k-mismatch problem which runs in $O(\sqrt{k}\log{k} + \text{polylog} {m})$ worst-case time per arriving symbol. This improves the best previous space complexity for streaming k-mismatch from FOCS 2009 by Benny Porat and Ely Porat by a factor of k. We also improve the time complexity of this previous result by an even greater factor to match the fastest known offline algorithm (up to logarithmic factors).

Journal ArticleDOI
TL;DR: A survey and discussion of major techniques of AllSAT solvers can be found in this paper, where they faithfully implement them and conduct comprehensive experiments using a large number of instances and various types of solvers including one of the few public softwares.
Abstract: All solutions SAT (AllSAT for short) is a variant of propositional satisfiability problem. Despite its significance, AllSAT has been relatively unexplored compared to other variants. We thus survey and discuss major techniques of AllSAT solvers. We faithfully implement them and conduct comprehensive experiments using a large number of instances and various types of solvers including one of the few public softwares. The experiments reveal solver's characteristics. Our implemented solvers are made publicly available so that other researchers can easily develop their solver by modifying our codes and compare it with existing methods.

Posted Content
TL;DR: There cannot be an algorithm running in time polynomial in k and 1/ε unless P = NP, which implies that the problem of minimizing Σi dist(a, F)p is NP-hard, even to output a (1 + 1/poly(d))-approximation.
Abstract: In the subspace approximation problem, we seek a k-dimensional subspace F of R^d that minimizes the sum of p-th powers of Euclidean distances to a given set of n points a_1, ..., a_n in R^d, for p >= 1. More generally than minimizing sum_i dist(a_i,F)^p,we may wish to minimize sum_i M(dist(a_i,F)) for some loss function M(), for example, M-Estimators, which include the Huber and Tukey loss functions. Such subspaces provide alternatives to the singular value decomposition (SVD), which is the p=2 case, finding such an F that minimizes the sum of squares of distances. For p in [1,2), and for typical M-Estimators, the minimizing $F$ gives a solution that is more robust to outliers than that provided by the SVD. We give several algorithmic and hardness results for these robust subspace approximation problems. We think of the n points as forming an n x d matrix A, and letting nnz(A) denote the number of non-zero entries of A. Our results hold for p in [1,2). We use poly(n) to denote n^{O(1)} as n -> infty. We obtain: (1) For minimizing sum_i dist(a_i,F)^p, we give an algorithm running in O(nnz(A) + (n+d)poly(k/eps) + exp(poly(k/eps))), (2) we show that the problem of minimizing sum_i dist(a_i, F)^p is NP-hard, even to output a (1+1/poly(d))-approximation, answering a question of Kannan and Vempala, and complementing prior results which held for p >2, (3) For loss functions for a wide class of M-Estimators, we give a problem-size reduction: for a parameter K=(log n)^{O(log k)}, our reduction takes O(nnz(A) log n + (n+d) poly(K/eps)) time to reduce the problem to a constrained version involving matrices whose dimensions are poly(K eps^{-1} log n). We also give bicriteria solutions, (4) Our techniques lead to the first O(nnz(A) + poly(d/eps)) time algorithms for (1+eps)-approximate regression for a wide class of convex M-Estimators.

Book ChapterDOI
TL;DR: In this paper, a comprehensive review of algorithmic approaches for the design and management of one-way vehicle-sharing systems is presented, where customers are allowed to pick-up a vehicle at any location and return it to any other station which best suit typical urban journey requirements.
Abstract: Vehicle (bike or car) sharing represents an emerging transportation scheme which may comprise an important link in the green mobility chain of smart city environments. This chapter offers a comprehensive review of algorithmic approaches for the design and management of vehicle-sharing systems. Our focus is on one-way vehicle-sharing systems (wherein customers are allowed to pick-up a vehicle at any location and return it to any other station) which best suit typical urban journey requirements. Along this line, we present methods dealing with the so-called asymmetric demand–offer problem (ie, the unbalanced offer and demand of vehicles) typically experienced in one-way sharing systems which severely affects their economic viability as it implies that considerable human (and financial) resources should be engaged in relocating vehicles to satisfy customer demand. The chapter covers all planning aspects that affect the effectiveness and viability of vehicle-sharing systems: the actual system design (eg, number and location of vehicle station facilities, vehicle fleet size, vehicles distribution among stations); customer incentivisation schemes to motivate customer-based distribution of bicycles/cars (such schemes offer meaningful incentives to users so as to leave their vehicle to a station different to that originally intended and satisfy future user demand); cost-effective solutions to schedule operator-based repositioning of bicycles/cars (by employees explicitly enrolled in vehicle relocation) based on the current and future (predicted) demand patterns (operator-based and customer-based relocation may be thought as complementary methods to achieve the intended distribution of vehicles among stations).

Posted Content
TL;DR: A new framework for deterministic coreset constructions based on a reduction to the problem of counting items in a stream is presented, using coresets to compute a non-trivial approximation to the PCA of very large but sparse databases such as the Wikipedia document-term matrix.
Abstract: In this paper we present a practical solution with performance guarantees to the problem of dimensionality reduction for very large scale sparse matrices. We show applications of our approach to computing the low rank approximation (reduced SVD) of such matrices. Our solution uses coresets, which is a subset of $O(k/\eps^2)$ scaled rows from the $n\times d$ input matrix, that approximates the sub of squared distances from its rows to every $k$-dimensional subspace in $\REAL^d$, up to a factor of $1\pm\eps$. An open theoretical problem has been whether we can compute such a coreset that is independent of the input matrix and also a weighted subset of its rows. %An open practical problem has been whether we can compute a non-trivial approximation to the reduced SVD of very large databases such as the Wikipedia document-term matrix in a reasonable time. We answer this question affirmatively. % and demonstrate an algorithm that efficiently computes a low rank approximation of the entire English Wikipedia. Our main technical result is a novel technique for deterministic coreset construction that is based on a reduction to the problem of $\ell_2$ approximation for item frequencies.