
Showing papers by "Pierre Alquier published in 2019"


Journal ArticleDOI
TL;DR: In this article, the authors obtain estimation error rates and sharp oracle inequalities for regularization procedures of the form $\hat{f}\in\mathop{\operatorname{argmin}}_{f\in F}\big(\frac{1}{N}\sum_{i=1}^{N}\ell_{f}(X_{i},Y_{i})+\lambda \Vert f\Vert \big)$, where $\Vert \cdot \Vert$ is any norm, $F$ is a convex class of functions, and $\ell$ is a Lipschitz loss function satisfying a Bernstein condition over $F$.
Abstract: We obtain estimation error rates and sharp oracle inequalities for regularization procedures of the form \begin{equation*}\hat{f}\in\mathop{\operatorname{argmin}}_{f\in F}\Bigg(\frac{1}{N}\sum_{i=1}^{N}\ell_{f}(X_{i},Y_{i})+\lambda \Vert f\Vert \Bigg)\end{equation*} when $\Vert \cdot \Vert $ is any norm, $F$ is a convex class of functions and $\ell$ is a Lipschitz loss function satisfying a Bernstein condition over $F$. We explore both the bounded and sub-Gaussian stochastic frameworks for the distribution of the $f(X_{i})$’s, with no assumption on the distribution of the $Y_{i}$’s. The general results rely on two main objects: a complexity function and a sparsity equation, that depend on the specific setting at hand (loss $\ell$ and norm $\Vert \cdot \Vert $). As a proof of concept, we obtain minimax rates of convergence in the following problems: (1) matrix completion with any Lipschitz loss function, including the hinge and logistic loss for the so-called 1-bit matrix completion instance of the problem, and quantile losses for the general case, which makes it possible to estimate any quantile of the entries of the matrix; (2) logistic LASSO and variants such as the logistic SLOPE, and also shape-constrained logistic regression; (3) kernel methods, where the loss is the hinge loss and the regularization function is the RKHS norm.
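
As a minimal illustration (not the authors' code), the sketch below instantiates the penalized estimator above as a logistic LASSO, i.e. logistic loss plus an l1 penalty, solved by proximal gradient descent (ISTA); the step size, number of iterations and the value of λ are arbitrary choices made for this example.

```python
# Sketch of the penalized estimator
#   f_hat = argmin_f (1/N) sum_i loss(f(X_i), Y_i) + lambda * ||f||
# instantiated as the logistic LASSO and solved by proximal gradient (ISTA).
import numpy as np

def logistic_grad(w, X, y):
    """Gradient of the average logistic loss; labels y are in {-1, +1}."""
    margins = y * (X @ w)
    return -(X.T @ (y / (1.0 + np.exp(margins)))) / len(y)

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def logistic_lasso(X, y, lam, step=0.1, n_iter=500):
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        w = soft_threshold(w - step * logistic_grad(w, X, y), step * lam)
    return w

# Toy usage: sparse ground truth in dimension 50, N = 200 observations.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 50))
w_true = np.zeros(50)
w_true[:3] = [2.0, -1.5, 1.0]
y = np.sign(X @ w_true + 0.5 * rng.standard_normal(200))
w_hat = logistic_lasso(X, y, lam=0.05)
print("selected coordinates:", np.flatnonzero(np.abs(w_hat) > 1e-3))
```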

50 citations


29 Sep 2019
TL;DR: In this paper, a pseudo-likelihood based on the Maximum Mean Discrepancy (MMD) is proposed; the resulting MMD-Bayes posterior is shown to be consistent and robust to model misspecification, and reasonable variational approximations of this posterior enjoy the same properties.
Abstract: In some misspecified settings, the posterior distribution in Bayesian statistics may lead to inconsistent estimates. To fix this issue, it has been suggested to replace the likelihood by a pseudo-likelihood, that is, the exponential of a loss function enjoying suitable robustness properties. In this paper, we build a pseudo-likelihood based on the Maximum Mean Discrepancy, defined via an embedding of probability distributions into a reproducing kernel Hilbert space. We show that this MMD-Bayes posterior is consistent and robust to model misspecification. As the posterior obtained in this way might be intractable, we also prove that reasonable variational approximations of this posterior enjoy the same properties. We provide details on a stochastic gradient algorithm to compute these variational approximations. Numerical simulations indeed suggest that our estimator is more robust to misspecification than the ones based on the likelihood.
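
As a rough illustration of the construction (not the paper's implementation), the sketch below evaluates an MMD-based pseudo-likelihood exp(-β·MMD²) for a simple Gaussian location model, estimating the MMD with a Gaussian kernel from samples drawn from the model; the temperature β, the kernel bandwidth and the sample sizes are arbitrary choices.

```python
# Sketch of an MMD-based pseudo-likelihood exp(-beta * MMD^2(P_theta, data))
# for a N(theta, 1) location model, with a Gaussian kernel.
import numpy as np

def gaussian_kernel(a, b, bandwidth=1.0):
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2.0 * bandwidth ** 2))

def mmd2(x, y, bandwidth=1.0):
    """Biased (V-statistic) estimator of the squared MMD."""
    return (gaussian_kernel(x, x, bandwidth).mean()
            + gaussian_kernel(y, y, bandwidth).mean()
            - 2.0 * gaussian_kernel(x, y, bandwidth).mean())

def log_pseudo_likelihood(theta, data, beta=100.0, n_model=500, seed=1):
    """Log pseudo-likelihood, with MMD^2 estimated from model samples."""
    rng = np.random.default_rng(seed)            # common random numbers
    model_samples = theta + rng.standard_normal(n_model)
    return -beta * mmd2(model_samples, data)

# Toy usage: data contaminated by a few gross outliers; the pseudo-likelihood
# still peaks near the true location 0.
rng = np.random.default_rng(0)
data = np.concatenate([rng.standard_normal(200), np.full(10, 20.0)])
for theta in [-1.0, 0.0, 1.0, 5.0]:
    print(f"theta = {theta:5.1f}  log pseudo-likelihood = "
          f"{log_pseudo_likelihood(theta, data):8.3f}")
```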

38 citations


Posted Content
TL;DR: This paper tackles the problem of universal estimation using a minimum distance estimator presented in Briol et al. (2019) based on the Maximum Mean Discrepancy, and shows that the estimator is robust both to dependence and to the presence of outliers in the dataset.
Abstract: Many works in statistics aim at designing a universal estimation procedure, that is, an estimator that would converge to the best approximation of the (unknown) data generating distribution in a model, without any assumption on this distribution. This question is of major interest, in particular because the universality property leads to the robustness of the estimator. In this paper, we tackle the problem of universal estimation using a minimum distance estimator presented in Briol et al. (2019) based on the Maximum Mean Discrepancy. We show that the estimator is robust to both dependence and to the presence of outliers in the dataset. Finally, we provide a theoretical study of the stochastic gradient descent algorithm used to compute the estimator, and we support our findings with numerical simulations.
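
A hedged sketch of the computational side (not the estimator's exact implementation as studied in the paper): stochastic gradient descent on the squared MMD between a Gaussian location model and the data, using the reparameterization x = θ + ε. For a location model only the cross term of the MMD depends on θ, which gives the closed-form gradient used below; step sizes, batch sizes and the bandwidth are illustrative.

```python
# Sketch of SGD for a minimum-MMD estimator in a N(theta, 1) location model.
import numpy as np

def mmd2_grad_location(theta, data_batch, rng, bandwidth=1.0, n_model=100):
    """Stochastic gradient of MMD^2(N(theta, 1), data) with respect to theta.

    For a location model only the cross term -2 E[k(x, y)] depends on theta,
    giving grad = (2 / h^2) * E[ k(x, y) * (x - y) ] with x = theta + eps.
    """
    x = theta + rng.standard_normal(n_model)          # model samples
    diff = x[:, None] - data_batch[None, :]
    k = np.exp(-diff ** 2 / (2.0 * bandwidth ** 2))
    return 2.0 * np.mean(k * diff) / bandwidth ** 2

def minimum_mmd_estimate(data, n_steps=2000, lr=1.0, batch_size=50, seed=0):
    rng = np.random.default_rng(seed)
    theta = data.mean()                               # crude starting point
    for t in range(n_steps):
        batch = rng.choice(data, size=batch_size, replace=False)
        theta -= lr / np.sqrt(t + 1.0) * mmd2_grad_location(theta, batch, rng)
    return theta

# Toy usage: 5% of the observations are gross outliers; the minimum-MMD
# estimate stays close to the true location 0, unlike the sample mean.
rng = np.random.default_rng(42)
data = np.concatenate([rng.standard_normal(950), np.full(50, 50.0)])
print("minimum-MMD estimate:", minimum_mmd_estimate(data))
print("sample mean:         ", data.mean())
```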

26 citations


Posted Content
TL;DR: A pseudo-likelihood based on the Maximum Mean Discrepancy, defined via an embedding of probability distributions into a reproducing kernel Hilbert space, is built, and it is shown that this MMD-Bayes posterior is consistent and robust to model misspecification.
Abstract: In some misspecified settings, the posterior distribution in Bayesian statistics may lead to inconsistent estimates. To fix this issue, it has been suggested to replace the likelihood by a pseudo-likelihood, that is, the exponential of a loss function enjoying suitable robustness properties. In this paper, we build a pseudo-likelihood based on the Maximum Mean Discrepancy, defined via an embedding of probability distributions into a reproducing kernel Hilbert space. We show that this MMD-Bayes posterior is consistent and robust to model misspecification. As the posterior obtained in this way might be intractable, we also prove that reasonable variational approximations of this posterior enjoy the same properties. We provide details on a stochastic gradient algorithm to compute these variational approximations. Numerical simulations indeed suggest that our estimator is more robust to misspecification than the ones based on the likelihood.

16 citations


Journal ArticleDOI
TL;DR: The basic tools of Dedecker and Fan (2015) are extended to nonstationary Markov chains, a Bernstein-type inequality is provided, and risk bounds for the prediction of periodic autoregressive processes with an unknown period are deduced.
Abstract: Exponential inequalities are main tools in machine learning theory. Proving exponential inequalities for non-i.i.d. random variables makes it possible to extend many learning techniques to these variables. Indeed, much work has been done on both inequalities and learning theory for time series over the past 15 years. However, in the non-independent case, almost all the results concern stationary time series. This excludes many important applications: for example, any series with a periodic behavior is non-stationary. In this paper, we extend the basic tools of Dedecker and Fan (2015) to nonstationary Markov chains. As an application, we provide a Bernstein-type inequality, and we deduce risk bounds for the prediction of periodic autoregressive processes with an unknown period.
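
As an illustration of the kind of prediction problem targeted here (not the paper's procedure or its theoretical guarantees), the sketch below selects the unknown period of a periodic AR(1) process by fitting phase-wise least-squares coefficients on a training segment and choosing the candidate period with the smallest one-step prediction error on a held-out segment; the candidate set and the split are arbitrary.

```python
# Sketch: select the unknown period of a periodic AR(1) process
#   X_t = a_{t mod T} * X_{t-1} + noise
# by one-step prediction error on a held-out segment.
import numpy as np

def fit_periodic_ar1(x, period, t_max):
    """Phase-wise least-squares AR(1) coefficients using times 1, ..., t_max-1."""
    coefs = np.zeros(period)
    t = np.arange(1, t_max)
    for r in range(period):
        idx = t[t % period == r]
        den = np.sum(x[idx - 1] ** 2)
        coefs[r] = np.sum(x[idx] * x[idx - 1]) / den if den > 0 else 0.0
    return coefs

def holdout_error(x, period, split):
    """One-step squared prediction error on x[split:], keeping global time phases."""
    coefs = fit_periodic_ar1(x, period, split)
    t = np.arange(split, len(x))
    preds = coefs[t % period] * x[t - 1]
    return np.mean((x[t] - preds) ** 2)

def select_period(x, candidates=range(1, 13)):
    split = len(x) // 2
    errors = {T: holdout_error(x, T, split) for T in candidates}
    return min(errors, key=errors.get), errors

# Toy usage: simulate a periodic AR(1) with true period 4.
rng = np.random.default_rng(0)
a_true = np.array([0.9, -0.5, 0.2, 0.7])
x = np.zeros(600)
for t in range(1, 600):
    x[t] = a_true[t % 4] * x[t - 1] + rng.standard_normal()
best_T, errors = select_period(x)
print("selected period:", best_T)   # 4, or possibly a multiple of 4
```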

14 citations


Proceedings Article
15 Oct 2019
TL;DR: It is shown that some variational inference (VI) algorithms do preserve the generalization properties of Bayesian inference, and theoretical justifications in favor of online algorithms relying on approximate Bayesian methods are presented.
Abstract: Bayesian inference provides an attractive online-learning framework to analyze sequential data, and offers generalization guarantees which hold even with model mismatch and adversaries. Unfortunately, exact Bayesian inference is rarely feasible in practice and approximation methods are usually employed, but do such methods preserve the generalization properties of Bayesian inference? In this paper, we show that this is indeed the case for some variational inference (VI) algorithms. We consider a few existing online, tempered VI algorithms, as well as a new algorithm, and derive their generalization bounds. Our theoretical result relies on the convexity of the variational objective, but we argue that the result should hold more generally and present empirical evidence in support of this. Our work in this paper presents theoretical justifications in favor of online algorithms relying on approximate Bayesian methods.
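
To make the setting concrete (this is an illustration, not one of the algorithms analyzed in the paper), the sketch below runs online tempered Bayesian updating for linear regression with a Gaussian variational family: at each round the posterior is multiplied by exp(-η · squared loss) and, since the Gaussian family is closed under this update, the "variational" step is available in closed form. The learning rate η and the prior scale are arbitrary.

```python
# Sketch: online tempered Bayesian updating for linear regression with a
# Gaussian family, q_t(w) ∝ q_{t-1}(w) * exp(-eta/2 * (y_t - x_t·w)^2).
import numpy as np

def online_tempered_gaussian(X, y, eta=0.5, prior_var=10.0):
    d = X.shape[1]
    precision = np.eye(d) / prior_var         # inverse covariance of q_t
    shift = np.zeros(d)                       # precision @ mean
    for x_t, y_t in zip(X, y):
        # the tempered update keeps the posterior Gaussian, so it is exact here
        precision += eta * np.outer(x_t, x_t)
        shift += eta * y_t * x_t
    return np.linalg.solve(precision, shift)  # final posterior mean

# Toy usage: streaming linear regression; the online posterior mean ends up
# close to the true coefficient vector.
rng = np.random.default_rng(0)
w_true = np.array([1.0, -2.0, 0.5])
X = rng.standard_normal((500, 3))
y = X @ w_true + 0.3 * rng.standard_normal(500)
print("final posterior mean:", np.round(online_tempered_gaussian(X, y), 2))
```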

12 citations


Journal ArticleDOI
TL;DR: A Markov chain is designed whose transition kernel uses a fixed-size fraction of the available data, randomly refreshed throughout the algorithm, which preserves the simplicity of the Metropolis–Hastings algorithm.
Abstract: This paper introduces a framework for speeding up Bayesian inference conducted in the presence of large datasets. We design a Markov chain whose transition kernel uses an unknown fraction of fixed size of the available data that is randomly refreshed throughout the algorithm. Inspired by the Approximate Bayesian Computation literature, the subsampling process is guided by the fidelity to the observed data, as measured by summary statistics. The resulting algorithm, Informed Sub-Sampling MCMC, is a generic and flexible approach which, contrary to existing scalable methodologies, preserves the simplicity of the Metropolis–Hastings algorithm. Even though exactness is lost, i.e., the chain distribution approximates the posterior, we study and quantify this bias theoretically and show on a diverse set of examples that it yields excellent performance when the computational budget is limited. We also show that setting the summary statistics as the maximum likelihood estimator, if available and cheap to compute, is supported by theoretical arguments.
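
A heavily simplified sketch of the idea (not the exact Informed Sub-Sampling MCMC algorithm, its weighting, or its theory): a random-walk Metropolis–Hastings chain whose likelihood is evaluated on a fixed-size subsample rescaled by N/n, with the subsample periodically refreshed and a refresh kept only when it brings the subsample's summary statistics closer to those of the full dataset. The model, the summaries and all tuning constants below are illustrative.

```python
# Sketch of the idea: Metropolis-Hastings on a fixed-size subsample (likelihood
# rescaled by N/n), with occasional refreshes guided by summary statistics.
import numpy as np

def summary(x):
    return np.array([x.mean(), x.std()])

def subsample_loglik(theta, x_sub, scale):
    # N(theta, 1) log-likelihood on the subsample, rescaled by N/n; flat prior.
    return -0.5 * scale * np.sum((x_sub - theta) ** 2)

def iss_mcmc_sketch(data, n_sub=100, n_iter=5000, step=0.1,
                    refresh_every=20, seed=0):
    rng = np.random.default_rng(seed)
    N = len(data)
    full_stats = summary(data)
    idx = rng.choice(N, n_sub, replace=False)
    theta, chain = data[idx].mean(), []
    for it in range(n_iter):
        if it % refresh_every == 0:
            # propose swapping a few subsample members (duplicates ignored here),
            # keep the proposal only if the summaries get closer to the full data
            new_idx = idx.copy()
            new_idx[rng.integers(n_sub, size=5)] = rng.integers(N, size=5)
            if (np.sum((summary(data[new_idx]) - full_stats) ** 2)
                    < np.sum((summary(data[idx]) - full_stats) ** 2)):
                idx = new_idx
        # random-walk MH step on theta, using the current subsample only
        prop = theta + step * rng.standard_normal()
        log_ratio = (subsample_loglik(prop, data[idx], N / n_sub)
                     - subsample_loglik(theta, data[idx], N / n_sub))
        if np.log(rng.uniform()) < log_ratio:
            theta = prop
        chain.append(theta)
    return np.array(chain)

# Toy usage: the chain concentrates near the full-data posterior mode.
rng = np.random.default_rng(1)
data = 2.0 + rng.standard_normal(10_000)
chain = iss_mcmc_sketch(data)
print("posterior mean estimate:", chain[1000:].mean())
```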

11 citations


Posted Content
TL;DR: In this paper, it is shown that some variational inference (VI) algorithms preserve the generalization properties of Bayesian inference; generalization bounds are derived for several existing online, tempered VI algorithms and a new algorithm, relying on the convexity of the variational objective.
Abstract: Bayesian inference provides an attractive online-learning framework to analyze sequential data, and offers generalization guarantees which hold even with model mismatch and adversaries. Unfortunately, exact Bayesian inference is rarely feasible in practice and approximation methods are usually employed, but do such methods preserve the generalization properties of Bayesian inference? In this paper, we show that this is indeed the case for some variational inference (VI) algorithms. We consider a few existing online, tempered VI algorithms, as well as a new algorithm, and derive their generalization bounds. Our theoretical result relies on the convexity of the variational objective, but we argue that the result should hold more generally and present empirical evidence in support of this. Our work in this paper presents theoretical justifications in favor of online algorithms relying on approximate Bayesian methods.

9 citations


Journal ArticleDOI
TL;DR: A vector auto-regressive model with a low-rank constraint on the transition matrix is proposed; it is well suited to predicting high-dimensional series that are highly correlated, or that are driven by a small number of hidden factors.
Abstract: We propose a vector auto-regressive (VAR) model with a low-rank constraint on the transition matrix. This new model is well suited to predict high-dimensional series that are highly correlated, or that are driven by a small number of hidden factors. We study estimation, prediction, and rank selection for this model in a very general setting. Our method shows excellent performance on a wide variety of simulated datasets. On macro-economic data from Giannone et al. (2015), our method is competitive with state-of-the-art methods in small dimension, and even improves on them in high dimension.
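
As a simple baseline capturing the low-rank idea (not the estimator studied in the paper), the sketch below fits a VAR(1) model by ordinary least squares and then projects the estimated transition matrix onto the set of rank-r matrices by truncating its SVD; the rank is fixed by hand here, whereas the paper studies data-driven rank selection.

```python
# Sketch: rank-r VAR(1) via SVD truncation of the ordinary least-squares fit.
import numpy as np

def lowrank_var1(X, rank):
    """X has shape (T, d): rows are observations of the d-dimensional series."""
    Y, Z = X[1:], X[:-1]                       # regress X_t on X_{t-1}
    W, *_ = np.linalg.lstsq(Z, Y, rcond=None)  # X_t ≈ W.T @ X_{t-1}
    A_ols = W.T
    U, s, Vt = np.linalg.svd(A_ols, full_matrices=False)
    s[rank:] = 0.0                             # keep the top-r singular values
    return U @ np.diag(s) @ Vt

# Toy usage: a 20-dimensional series driven by a rank-2 transition matrix.
rng = np.random.default_rng(0)
d, T, r = 20, 500, 2
A_true = (rng.standard_normal((d, r)) * 0.2) @ (rng.standard_normal((r, d)) * 0.2)
X = np.zeros((T, d))
for t in range(1, T):
    X[t] = A_true @ X[t - 1] + rng.standard_normal(d)
A_hat = lowrank_var1(X, rank=r)
print("relative error:", np.linalg.norm(A_hat - A_true) / np.linalg.norm(A_true))
```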

8 citations


Journal ArticleDOI
TL;DR: In this paper, the authors extend the results known for matrix estimation in the i.i.d. setting to time series and prove that when the series exhibit some additional structure like periodicity or smoothness, it is possible to improve on the classical rates of convergence.
Abstract: Matrix factorization is a powerful data analysis tool. It has been used in multivariate time series analysis, leading to the decomposition of the series into a small set of latent factors. However, little is known about the statistical performance of matrix factorization for time series. In this paper, we extend the results known for matrix estimation in the i.i.d. setting to time series. Moreover, we prove that when the series exhibit some additional structure, like periodicity or smoothness, it is possible to improve on the classical rates of convergence.
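
As a minimal illustration of matrix factorization applied to a multivariate time series (not the estimators or the structured settings analyzed in the paper), the sketch below factorizes a d × T series matrix into k latent factors via truncated SVD; the number of factors is fixed by hand.

```python
# Sketch: factorize a d x T multivariate series into k latent factors via SVD.
import numpy as np

def factorize_series(Y, k):
    """Return loadings W (d x k) and latent factor series F (k x T), Y ≈ W @ F."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    return U[:, :k] * s[:k], Vt[:k]

# Toy usage: 30 noisy series driven by 2 smooth latent factors.
rng = np.random.default_rng(0)
T, d, k = 200, 30, 2
time = np.linspace(0, 4 * np.pi, T)
F_true = np.vstack([np.sin(time), np.cos(time / 2)])   # 2 x T
W_true = rng.standard_normal((d, k))
Y = W_true @ F_true + 0.2 * rng.standard_normal((d, T))
W_hat, F_hat = factorize_series(Y, k)
print("relative reconstruction error:",
      np.linalg.norm(Y - W_hat @ F_hat) / np.linalg.norm(Y))
```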

5 citations


Journal ArticleDOI
TL;DR: In this paper, the authors extend the results known for matrix estimation in the i.i.d. setting to time series and prove that when the series exhibit some additional structure like periodicity or smoothness, it is possible to improve on the classical rates of convergence.
Abstract: Matrix factorization is a powerful data analysis tool. It has been used in multivariate time series analysis, leading to the decomposition of the series into a small set of latent factors. However, little is known about the statistical performance of matrix factorization for time series. In this paper, we extend the results known for matrix estimation in the i.i.d. setting to time series. Moreover, we prove that when the series exhibit some additional structure, like periodicity or smoothness, it is possible to improve on the classical rates of convergence.