
Praneeth Netrapalli

Researcher at Microsoft

Publications: 117
Citations: 6792

Praneeth Netrapalli is an academic researcher at Microsoft. He has contributed to research topics including stochastic gradient descent and gradient descent. He has an h-index of 38 and has co-authored 117 publications receiving 5387 citations. His previous affiliations include the University of Texas at Austin and Google.

Papers
Proceedings Article

Online Non-Convex Learning: Following the Perturbed Leader is Optimal

TL;DR: This work shows that the classical Follow the Perturbed Leader (FTPL) algorithm achieves the optimal regret rate of $O(T^{-1/2})$ in this setting, improving upon the previous best-known regret rate for FTPL.
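A minimal sketch of the FTPL idea over a finite action set (the paper's setting is more general, with non-convex losses and an offline optimization oracle; here the oracle is simply an exact argmin, and the perturbation scale `eta` is a made-up parameter name):

```python
import numpy as np

rng = np.random.default_rng(0)

def ftpl(loss_matrix, eta=1.0):
    """Follow the Perturbed Leader over K actions.

    loss_matrix: (T, K) array; row t gives each action's loss at round t.
    At each round, play the minimizer of cumulative losses minus a fresh
    random (here exponential) perturbation, then observe the round's losses.
    """
    T, K = loss_matrix.shape
    cum = np.zeros(K)       # cumulative losses observed so far
    total = 0.0
    for t in range(T):
        pert = rng.exponential(scale=eta, size=K)   # fresh perturbation
        a = int(np.argmin(cum - pert))              # perturbed leader
        total += loss_matrix[t, a]
        cum += loss_matrix[t]
    return total
```

On a toy instance where action 0 always incurs loss 0 and action 1 always incurs loss 1, FTPL quickly locks onto action 0, so its total loss stays small.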
Posted Content

Parallelizing Stochastic Gradient Descent for Least Squares Regression: mini-batching, averaging, and model misspecification.

TL;DR: In this paper, the benefits of averaging schemes widely used in conjunction with stochastic gradient descent (SGD) are analyzed for the least squares regression problem, and non-asymptotic excess risk bounds are provided for these schemes.
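A minimal sketch of one such averaging scheme, constant-step-size SGD with tail (suffix) averaging, on a least squares objective. This is an illustrative toy, not the paper's analysis; the parameter names (`lr`, `steps`, `tail_frac`) are made up:

```python
import numpy as np

rng = np.random.default_rng(1)

def averaged_sgd(X, y, lr=0.1, steps=3000, tail_frac=0.5):
    """Constant-step-size SGD for 0.5*(x.w - y)^2 with tail averaging.

    Runs `steps` stochastic gradient updates on randomly sampled rows of
    (X, y) and returns the average of the iterates from the final
    `tail_frac` fraction of steps.
    """
    n, d = X.shape
    w = np.zeros(d)
    start = int(steps * (1 - tail_frac))
    avg = np.zeros(d)
    for t in range(steps):
        i = rng.integers(n)
        g = (X[i] @ w - y[i]) * X[i]   # stochastic gradient at row i
        w -= lr * g
        if t >= start:
            avg += w                   # accumulate tail iterates
    return avg / (steps - start)
```

On noiseless (realizable) data the tail-averaged iterate recovers the true parameter vector accurately, which is the clean setting where the benefit of averaging is easiest to see.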
Journal Article

P-SIF: Document Embeddings Using Partition Averaging

TL;DR: P-SIF is a partitioned word averaging model for representing long documents; it retains the simplicity of simple weighted word averaging while taking a document's topical structure into account, computing topic-specific vectors and concatenating them all to represent the overall document.
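A minimal sketch of the partition-averaging step, assuming per-word weights and topic assignments are already given (the real method derives these, e.g. via SIF-style weights and learned topic partitions; the function and argument names here are made up):

```python
import numpy as np

def psif_embed(doc_word_vecs, word_weights, topic_assign, n_topics):
    """Partition averaging: weighted-average word vectors within each
    topic partition, then concatenate the per-topic averages.

    doc_word_vecs: (n_words, d) word vectors for one document.
    word_weights:  (n_words,) per-word weights.
    topic_assign:  (n_words,) integer topic index per word.
    Returns a (n_topics * d,) document embedding.
    """
    d = doc_word_vecs.shape[1]
    parts = np.zeros((n_topics, d))
    for vec, w, k in zip(doc_word_vecs, word_weights, topic_assign):
        parts[k] += w * vec                      # weighted sum per topic
    counts = np.bincount(topic_assign, minlength=n_topics).reshape(-1, 1)
    parts = parts / np.maximum(counts, 1)        # average within partitions
    return parts.reshape(-1)                     # concatenate topic averages
```

Because averages are taken per topic before concatenation, words from different topics no longer cancel each other out, which is the intuition behind partition averaging over plain whole-document averaging.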
Proceedings Article

Learning Markov graphs up to edit distance

TL;DR: Lower bounds are provided on the number of samples required for any algorithm to learn the Markov graph structure of a probability distribution up to edit distance, showing that substantial gains in sample complexity may not be possible without paying a significant price in edit-distance error.
Proceedings Article

Faster eigenvector computation via shift-and-invert preconditioning

TL;DR: In this paper, the authors give a faster algorithm for estimating the top eigenvector of an explicit matrix A ∈ R^{n×d} in time O([nnz(A) + d·sr(A)/gap²] log(1/ε)), where nnz(A) is the number of nonzeros in A, sr(A) is the stable rank, and gap is the relative eigengap.
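A minimal sketch of the underlying shift-and-invert idea: with a shift λ slightly above the top eigenvalue, power iteration on (λI − M)^{-1} has a much better effective eigengap. Here each linear system is solved exactly with a dense solver; the paper's speedup comes from solving these systems only approximately with fast stochastic solvers (names and parameters below are made up):

```python
import numpy as np

def top_eigvec_shift_invert(M, shift, iters=50):
    """Shift-and-invert power method for the top eigenvector of symmetric M.

    Requires shift > lambda_1(M). Each iteration applies (shift*I - M)^{-1}
    to the current vector and renormalizes; the iterate converges to the
    eigenvector of M's largest eigenvalue.
    """
    d = M.shape[0]
    A = shift * np.eye(d) - M
    v = np.ones(d) / np.sqrt(d)          # any vector with mass on the top eigvec
    for _ in range(iters):
        v = np.linalg.solve(A, v)        # one inverse power iteration step
        v /= np.linalg.norm(v)
    return v
```

On a diagonal test matrix the iterate aligns with the leading coordinate axis, and the convergence factor depends on the shifted gap rather than the raw eigengap, which is the point of the preconditioning.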