
Showing papers by "Jean-Michel Loubes published in 2018"


Journal ArticleDOI
TL;DR: In this paper, the authors proposed a new method to predict the final destination of vehicle trips based on their initial partial trajectories using a mixture of 2-D Gaussian distributions.
Abstract: In this paper, we propose a new method to predict the final destination of vehicle trips based on their initial partial trajectories. We first review how we obtained a clustering of trajectories that describes user behavior. Then, we explain how we model the main traffic flow patterns by a mixture of 2-D Gaussian distributions. This yields a density-based clustering of locations, which produces a data-driven grid of similar points within each pattern. We present how this model can be used to predict the final destination of a new trajectory based on its first locations, using a two-step procedure: we first assign the new trajectory to the clusters it most likely belongs to; second, we use characteristics from trajectories inside these clusters to predict the final destination. Finally, we present experimental results of our methods for classification of trajectories and final destination prediction on data sets of timestamped GPS locations of taxi trips. We test our methods on two different data sets, to assess the capacity of our method to adapt automatically to different subsets.
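The abstract's two-step procedure can be illustrated with a deliberately simplified sketch: a mixture of 2-D Gaussians is fitted to trip end points, a partial trajectory is assigned to its most likely component, and the destination is predicted from the trips in that component. The toy data, the number of components and the use of end points only are illustrative assumptions, not the authors' exact pipeline.

```python
# Hedged sketch of the two-step destination predictor described in the
# abstract. All data and parameters below are illustrative placeholders.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Toy data: 200 trips, each a (T, 2) array of GPS points (here random walks).
trips = [np.cumsum(rng.normal(size=(50, 2)), axis=0) for _ in range(200)]
destinations = np.array([t[-1] for t in trips])

# Step 1: model main traffic patterns with a mixture of 2-D Gaussians,
# fitted here on trip end points as a crude proxy for flow patterns.
gmm = GaussianMixture(n_components=5, random_state=0).fit(destinations)
labels = gmm.predict(destinations)

def predict_destination(partial_trip):
    """Assign the partial trip to its most likely cluster, then return the
    mean destination of the training trips in that cluster."""
    last_point = partial_trip[-1].reshape(1, -1)
    cluster = gmm.predict(last_point)[0]
    return destinations[labels == cluster].mean(axis=0)

print(predict_destination(trips[0][:10]))
```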

79 citations


Posted Content
TL;DR: In this article, the authors focus on two of them, called Disparate Impact (DI) and Balanced Error Rate (BER); both are based on the outcome of the algorithm across the different groups determined by the protected variable.
Abstract: Statistical algorithms increasingly help make decisions in many aspects of our lives. But how do we know whether these algorithms are biased and commit unfair discrimination against a particular group of people, typically a minority? Fairness is generally studied in a probabilistic framework where it is assumed that there exists a protected variable whose use as an input of the algorithm may imply discrimination. There are different definitions of fairness in the literature. In this paper we focus on two of them, called Disparate Impact (DI) and Balanced Error Rate (BER). Both are based on the outcome of the algorithm across the different groups determined by the protected variable. The relationship between these two notions is also studied. The goals of this paper are to detect when a binary classification rule lacks fairness and to try to fight the potential discrimination attributable to it. This can be done by modifying either the classifiers or the data itself. Our work falls into the second category and modifies the input data using optimal transport theory.
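As a minimal illustration of the Disparate Impact criterion mentioned in the abstract, the sketch below computes DI = P(g(X)=1 | S=0) / P(g(X)=1 | S=1) for a binary decision rule and a binary protected attribute. The simulated data, the group coding and the 0.8 threshold (the usual "80% rule") are illustrative conventions, not taken from the paper.

```python
# Hedged sketch: computing the Disparate Impact of a binary decision rule
# with respect to a binary protected attribute (0 = protected/minority group
# by convention here).
import numpy as np

def disparate_impact(y_pred, s):
    """y_pred: 0/1 decisions; s: 0/1 protected attribute (0 = minority)."""
    p_minority = np.mean(y_pred[s == 0])
    p_majority = np.mean(y_pred[s == 1])
    return p_minority / p_majority

rng = np.random.default_rng(1)
s = rng.integers(0, 2, size=1000)
# Simulated decisions: the favourable outcome is more frequent when s == 1.
y_pred = (rng.random(1000) < np.where(s == 1, 0.6, 0.4)).astype(int)

di = disparate_impact(y_pred, s)
print(f"DI = {di:.2f}", "-> potentially unfair" if di < 0.8 else "-> passes the 80% rule")
```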

61 citations


Journal ArticleDOI
TL;DR: In this paper, a family of positive definite kernels built using transportation-based distances is presented, and the Gaussian processes indexed by distributions corresponding to these kernels can be efficiently forecast.
Abstract: Monge-Kantorovich distances, otherwise known as Wasserstein distances, have received growing attention in statistics and machine learning as a powerful discrepancy measure for probability distributions. In this paper, we focus on forecasting a Gaussian process indexed by probability distributions. For this, we provide a family of positive definite kernels built using transportation-based distances. We provide a probabilistic understanding of these kernels and characterize the corresponding stochastic processes. We prove that the Gaussian processes indexed by distributions corresponding to these kernels can be efficiently forecast, opening new perspectives in Gaussian process modeling.
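A hedged sketch of the kind of construction the abstract describes, restricted to one-dimensional empirical distributions: a Gaussian-type kernel built from the squared 2-Wasserstein distance (computed through order statistics), and a basic Gaussian-process predictive mean obtained from the resulting Gram matrix. The bandwidth, noise level and regression target are illustrative choices, not the paper's.

```python
# Hedged sketch: a transportation-based kernel between 1-D empirical
# distributions and a simple GP predictive mean built from it.
import numpy as np

def w2_squared(x, y):
    """Squared 2-Wasserstein distance between two equal-size 1-D samples,
    computed through their order statistics (quantile coupling)."""
    return np.mean((np.sort(x) - np.sort(y)) ** 2)

def kernel(samples_a, samples_b, bandwidth=1.0):
    return np.array([[np.exp(-w2_squared(a, b) / bandwidth**2)
                      for b in samples_b] for a in samples_a])

rng = np.random.default_rng(2)
# Inputs are distributions (represented by samples); the toy output is their mean.
train = [rng.normal(m, 1.0, size=200) for m in np.linspace(-2, 2, 10)]
y_train = np.array([s.mean() for s in train])
test = [rng.normal(0.5, 1.0, size=200)]

K = kernel(train, train) + 1e-3 * np.eye(len(train))   # regularized Gram matrix
k_star = kernel(test, train)
posterior_mean = k_star @ np.linalg.solve(K, y_train)   # GP predictive mean
print(posterior_mean)
```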

49 citations



Book ChapterDOI
28 May 2018
TL;DR: This paper sets up features based on fixed functional bases and data-based bases, and applies distance- or density-based outlier detection methods to those features, testing them on real telemetry data.
Abstract: Satellite monitoring is an important task for preventing satellite failures. For this purpose, a large number of time series are analyzed in order to detect anomalies. In this paper, we provide a review of such analyses, focusing on methods that rely on feature extraction. In particular, we set up features based on fixed functional bases (Fourier, wavelets, kernel bases, ...) and data-based bases (PCA, KPCA). The outlier detection methods we apply to those features can be distance- or density-based. These algorithms are tested on real telemetry data.
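The feature-based pipeline described above can be sketched as follows: extract fixed-basis features (here, low-frequency Fourier magnitudes) from each time series, then run a density-based outlier detector on the feature vectors. The synthetic telemetry, feature dimension and detector hyperparameters are illustrative assumptions.

```python
# Hedged sketch: Fourier features + density-based outlier detection on
# telemetry-like time series.
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(3)
n_series, length = 100, 256
series = np.sin(np.linspace(0, 8 * np.pi, length)) + 0.1 * rng.normal(size=(n_series, length))
series[:3] += rng.normal(0, 2.0, size=(3, length))       # a few anomalous channels

# Fixed functional basis: keep the magnitudes of the first 20 Fourier modes.
features = np.abs(np.fft.rfft(series, axis=1))[:, :20]

# Density-based detection on the feature space.
lof = LocalOutlierFactor(n_neighbors=10)
flags = lof.fit_predict(features)                          # -1 marks outliers
print("flagged series:", np.where(flags == -1)[0])
```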

14 citations


Posted Content
TL;DR: This work characterizes in turn radial positive definite kernels on Hilbert spaces, and shows that the covariance parameters of virtually all parametric families of covariance functions are microergodic in the case of (infinite-dimensional) Hilbert spaces.
Abstract: In this work, we investigate Gaussian Processes indexed by multidimensional distributions. While directly constructing radial positive definite kernels based on the Wasserstein distance has been proven to be possible in the unidimensional case, such constructions do not extend well to the multidimensional case, as we illustrate here. To tackle the problem of defining positive definite kernels between multivariate distributions based on optimal transport, we appeal instead to Hilbert space embeddings relying on optimal transport maps to a reference distribution, that we suggest to take as a Wasserstein barycenter. We characterize in turn radial positive definite kernels on Hilbert spaces, and show that the covariance parameters of virtually all parametric families of covariance functions are microergodic in the case of (infinite-dimensional) Hilbert spaces. We also investigate statistical properties of our suggested positive definite kernels on multidimensional distributions, with a focus on consistency when a population Wasserstein barycenter is replaced by an empirical barycenter, and additional explicit results in the special case of Gaussian distributions. Finally, we study the Gaussian process methodology based on our suggested positive definite kernels in regression problems with multidimensional distribution inputs, on simulation data stemming both from synthetic examples and from a mechanical engineering test case.
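A hedged sketch of the embedding idea the abstract alludes to: each distribution is represented by the optimal transport map sending a reference measure (suggested to be a Wasserstein barycenter) to it, and a radial kernel is then defined on the resulting Hilbert space. The radial function F and bandwidth sigma are generic placeholders; the paper's exact parametrization may differ.

```latex
% Hypothetical notation: \bar\mu is the reference measure (a Wasserstein
% barycenter), T_\mu the optimal transport map from \bar\mu to \mu, and F a
% radial function such as F(t) = e^{-t}.
\[
  \mu \;\longmapsto\; T_\mu \in L^2(\bar\mu;\mathbb{R}^d),
  \qquad
  k(\mu,\nu) \;=\; F\!\left(\frac{\|T_\mu - T_\nu\|_{L^2(\bar\mu)}^2}{\sigma^2}\right).
\]
```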

11 citations


Book ChapterDOI
13 Sep 2018
TL;DR: A decision-making tool based on supervised learning techniques is proposed that detects defects and provides the Surface-Mount Technology (SMT) operator with the probability that a reported defect is a false call; four tree-based learning methods are compared.
Abstract: In this paper, we propose a decision-making tool based on supervised learning techniques that detects defects and provides the Surface-Mount Technology (SMT) operator with the probability that a detected defect is a false call. In this work, we compare four tree-based learning methods. The results of our experiments show that an XGBoost model trained on our real-world dataset can accurately classify most real defects and false calls, with an accuracy of about 99.4% and a recall of about 98.6%. Moreover, we investigated the computing time of our prediction model and concluded that integrating our classification tool based on the XGBoost algorithm into the SMT production line is realistic and feasible. We believe that our tool will significantly improve the daily work of the SMT verification operator.
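A hedged sketch of the kind of classifier the abstract evaluates: an XGBoost model trained to separate real defects from false calls, with accuracy and recall reported on a held-out split. The synthetic features and hyperparameters are placeholders, and the xgboost Python package is assumed to be installed; the 99.4% accuracy and 98.6% recall quoted above come from the authors' real-world dataset, not from this toy run.

```python
# Hedged sketch: tree-based classification of defects vs. false calls.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, recall_score
from xgboost import XGBClassifier  # assumes the xgboost package is installed

# Toy stand-in for inspection features; label 1 = real defect (illustrative).
X, y = make_classification(n_samples=5000, n_features=20, weights=[0.7, 0.3],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

clf = XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.1)
clf.fit(X_tr, y_tr)
pred = clf.predict(X_te)
print(f"accuracy={accuracy_score(y_te, pred):.3f}  recall={recall_score(y_te, pred):.3f}")
```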

8 citations


Posted Content
TL;DR: The asymptotic distribution of the major indexes used in the statistical literature to quantify disparate treatment in machine learning is provided.
Abstract: We provide the asymptotic distribution of the major indexes used in the statistical literature to quantify disparate treatment in machine learning. We aim at promoting the use of confidence intervals when testing the so-called group disparate impact. We illustrate on some examples the importance of using confidence intervals and not a single value.
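As a simple stand-in for the asymptotic confidence intervals advocated in the abstract, the sketch below builds a bootstrap confidence interval for the Disparate Impact index; the paper's own intervals are derived from the asymptotic distribution of the index, which this sketch does not reproduce.

```python
# Hedged sketch: a bootstrap confidence interval for Disparate Impact,
# used here only as a stand-in for the paper's asymptotic intervals.
import numpy as np

def disparate_impact(y_pred, s):
    return y_pred[s == 0].mean() / y_pred[s == 1].mean()

rng = np.random.default_rng(4)
s = rng.integers(0, 2, size=2000)
y_pred = (rng.random(2000) < np.where(s == 1, 0.55, 0.45)).astype(int)

boot = []
for _ in range(2000):
    idx = rng.integers(0, len(s), size=len(s))         # resample with replacement
    boot.append(disparate_impact(y_pred[idx], s[idx]))
lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"DI = {disparate_impact(y_pred, s):.3f}, 95% CI = [{lo:.3f}, {hi:.3f}]")
```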

8 citations


Posted Content
18 Oct 2018
TL;DR: A new explainability formalism designed to explain how the possible values of each input variable in a whole test set impact the predictions given by black-box decision rules; it additionally has a low algorithmic complexity, making it scalable to real-life large datasets.
Abstract: In this paper, we present a new explainability formalism designed to explain how the input variables of the testing set impact the predictions of black-box decision rules. Hence we propose a group explainability framework for machine learning algorithms based on the variability of the distribution of the input variables. Our formalism is based on an information theory framework that quantifies the influence of all input-output observations when emphasizing the impact of each input variable, based on entropic projections. This formalism is thus the first unified and model-agnostic framework enabling us to interpret the dependence between the input variables, their impact on the prediction errors, and their influence on the output predictions. In addition and most importantly, we prove that computing an explanation in our framework has a low algorithmic complexity, making it scalable to real-life large datasets. We illustrate our strategy by explaining complex decision rules learned using XGBoost, Random Forest or Deep Neural Network classifiers. We finally make clear its differences with explainability strategies based on single observations, such as those of LIME or SHAP. A toolbox is proposed at this https URL.

5 citations


Posted Content
TL;DR: A new explainability formalism is presented to make clear the impact of each variable on the predictions given by black-box decision rules, together with a new computationally efficient algorithm to stress the variables, which only reweights the reference observations and predictions.
Abstract: In this paper, we present a new explainability formalism to make clear the impact of each variable on the predictions given by black-box decision rules. Our method consists in evaluating the decision rules on test samples generated in such a way that each variable is stressed incrementally while preserving the original distribution of the machine learning problem. We then propose a new computationally efficient algorithm to stress the variables, which only reweights the reference observations and predictions. This makes our methodology scalable to large datasets. Results obtained on standard machine learning datasets are presented and discussed.
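One simple way to realize the "reweight the reference observations and predictions" idea, offered here only as an illustrative stand-in for the authors' algorithm: exponentially tilt the observation weights so that the stressed variable's mean shifts, and read off the reweighted average prediction. The model, data and tilt strengths are placeholders.

```python
# Hedged sketch: stressing one input variable by exponential tilting of the
# weights of a fixed reference sample, then observing how the model's
# average prediction responds. Predictions are computed only once.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=1000, n_features=5, noise=5.0, random_state=0)
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
preds = model.predict(X)                       # reference predictions

j = 2                                          # index of the stressed variable
for t in (-0.5, 0.0, 0.5):                     # tilt strength (illustrative)
    w = np.exp(t * X[:, j])
    w /= w.sum()                               # reweighted reference sample
    stressed_mean = w @ X[:, j]                # mean of variable j after stress
    stressed_pred = w @ preds                  # reweighted average prediction
    print(f"t={t:+.1f}  E[X_{j}]={stressed_mean:+.2f}  E[f(X)]={stressed_pred:+.2f}")
```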

4 citations


Posted Content
TL;DR: It is proved that the kernel built from the quadratic distance between the transportation maps that transport each distribution to the barycenter of the distributions provides a valid covariance function.
Abstract: In this work, we propose to define Gaussian Processes indexed by multidimensional distributions. In the framework where the distributions can be modeled as i.i.d. realizations of a measure on the set of distributions, we prove that the kernel built from the quadratic distance between the transportation maps, which transport each distribution to the barycenter of the distributions, provides a valid covariance function. In this framework, we study the asymptotic properties of this process, proving microergodicity of the parameters.

Posted Content
TL;DR: This paper proposes and studies a harmonic analysis of the covariance operators that enables the consideration of Gaussian process models and forecasting issues; the theory is motivated by statistical ranking problems.
Abstract: In the framework of the supervised learning of a real function defined on a space X, the so-called Kriging method relies on a real Gaussian field defined on X. The Euclidean case is well known and has been widely studied. In this paper, we explore the less classical case where X is the non-commutative finite group of permutations. In this setting, we propose and study a harmonic analysis of the covariance operators that enables us to consider Gaussian process models and forecasting issues. Our theory is motivated by statistical ranking problems.
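A hedged sketch of the type of spectral representation such a harmonic analysis can rely on: a bi-invariant covariance on the symmetric group can be built from a nonnegative combination of the irreducible characters, each of which is a positive definite class function. The exact parametrization used in the paper may differ.

```latex
% Hypothetical sketch: a bi-invariant covariance on the permutation group S_n
% built from a nonnegative combination of the irreducible characters
% \chi_\lambda (indexed by partitions \lambda of n); each \chi_\lambda is a
% positive definite class function, so K below is a valid covariance.
\[
  K(\sigma,\tau) \;=\; f(\sigma\tau^{-1}),
  \qquad
  f(\pi) \;=\; \sum_{\lambda \vdash n} c_\lambda\, \chi_\lambda(\pi),
  \quad c_\lambda \ge 0 .
\]
```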

Posted Content
TL;DR: This work focuses on the risks of discrimination, the problems of transparency and the quality of algorithmic decisions, and lists several forms of control to be developed: institutional control, an ethical charter, and external audits attached to the issuing of a label.
Abstract: Combining big data and machine learning algorithms, the power of automatic decision tools induces as much hope as fear. Recently enacted European legislation (the GDPR) and French laws attempt to regulate the use of these tools. Leaving aside the well-identified problems of data confidentiality and impediments to competition, we focus on the risks of discrimination, the problems of transparency and the quality of algorithmic decisions. The detailed perspective of the legal texts, faced with the complexity and opacity of the learning algorithms, reveals the need for important technological disruptions for the detection or reduction of the discrimination risk, and for addressing the right to obtain an explanation of the automatic decision. Since the trust of the developers and, above all, of the users (citizens, litigants, customers) is essential, algorithms exploiting personal data must be deployed in a strict ethical framework. In conclusion, to answer this need, we list several forms of control to be developed: institutional control, an ethical charter, and external audits attached to the issuing of a label.

Posted Content
TL;DR: An estimate of the asymptotic variance is provided, which enables the construction of a two-sample test to assess the similarity between two distributions; this test is then used to provide a new criterion to assess the fairness of a classification algorithm.
Abstract: We provide a Central Limit Theorem for the Monge-Kantorovich distance between two empirical distributions with sizes $n$ and $m$, $W_p(P_n,Q_m)$ for $p>1$, for observations on the real line, using a minimal amount of assumptions. We provide an estimate of the asymptotic variance which enables us to build a two-sample test to assess the similarity between two distributions. This test is then used to provide a new criterion to assess the notion of fairness of a classification algorithm.
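The two-sample use described in the abstract can be illustrated with a simple permutation test based on the one-dimensional Wasserstein distance between empirical samples; the paper instead calibrates the test through a Central Limit Theorem and an asymptotic variance estimate for W_p with p > 1, which this sketch does not implement (scipy's wasserstein_distance computes W_1).

```python
# Hedged sketch: a permutation two-sample test on the 1-D Wasserstein
# distance, as a stand-in for the CLT-based test described in the abstract.
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(5)
x = rng.normal(0.0, 1.0, size=300)      # e.g. scores of one protected group
y = rng.normal(0.3, 1.0, size=300)      # scores of the other group

obs = wasserstein_distance(x, y)
pooled = np.concatenate([x, y])
perm_stats = []
for _ in range(1000):
    rng.shuffle(pooled)
    perm_stats.append(wasserstein_distance(pooled[:300], pooled[300:]))
p_value = np.mean(np.array(perm_stats) >= obs)
print(f"W_1 = {obs:.3f}, permutation p-value = {p_value:.3f}")
```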

Posted Content
TL;DR: This paper proposes a group explainability formalism for trained machine learning decision rules, based on their response to the variability of the input variables distribution, and proves that computing an explanation in this framework has a low algorithmic complexity, making it scalable to real-life large datasets.
Abstract: In this paper, we present a new explainability formalism designed to explain how each input variable of a test set impacts the predictions of machine learning models. Hence, we propose a group explainability formalism for trained machine learning decision rules, based on their response to the variability of the input variables distribution. In order to emphasize the impact of each input variable, this formalism uses an information theory framework that quantifies the influence of all input-output observations based on entropic projections. This is thus the first unified and model-agnostic formalism enabling data scientists to interpret the dependence between the input variables, their impact on the prediction errors, and their influence on the output predictions. Convergence rates of the entropic projections are provided in the large sample case. Most importantly, we prove that computing an explanation in our framework has a low algorithmic complexity, making it scalable to real-life large datasets. We illustrate our strategy by explaining complex decision rules learned using XGBoost, Random Forest or Deep Neural Network classifiers on various datasets such as Adult income, MNIST and CelebA. We finally make clear its differences with the explainability strategies LIME and SHAP, which are based on single observations. Results can be reproduced using the freely distributed Python toolbox at this https URL.

Posted Content
TL;DR: In this paper, the authors developed a new method for estimating the Value-at-Risk, the Conditional Value at Risk and the Economic Capital when the underlying risk factors follow a Beta-Kotz distribution.
Abstract: This paper considers the use, for Value-at-Risk computations, of the so-called Beta-Kotz distribution, which is based on a general family of distributions including the classical Gaussian model. This work develops a new method for estimating the Value-at-Risk, the Conditional Value-at-Risk and the Economic Capital when the underlying risk factors follow a Beta-Kotz distribution. After estimating the parameters of the distribution of the loss random variable, both analytical approaches (for some particular values of the parameters) and numerical approaches are provided for computing these measures. The proposed risk measures are finally applied to quantify the potential risk of economic losses in credit risk.
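As context for the risk measures named in the abstract, here is a generic Monte Carlo sketch of Value-at-Risk and Conditional Value-at-Risk estimation. A Student-t sample stands in for the loss distribution because the Beta-Kotz family studied in the paper is not available in standard Python libraries.

```python
# Hedged sketch: Monte Carlo VaR and CVaR of a loss distribution. The
# Student-t losses are a generic heavy-tailed stand-in, not the Beta-Kotz
# model of the paper.
import numpy as np

rng = np.random.default_rng(6)
losses = rng.standard_t(df=5, size=100_000)       # simulated losses (toy)

alpha = 0.99
var = np.quantile(losses, alpha)                  # Value-at-Risk at level 99%
cvar = losses[losses >= var].mean()               # Conditional VaR (expected shortfall)
print(f"VaR_{alpha:.0%} = {var:.3f}, CVaR_{alpha:.0%} = {cvar:.3f}")
```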

Posted Content
TL;DR: A new R package, COREclust, dedicated to the detection of representative variables in high-dimensional spaces with a potentially limited number of observations is presented, together with instructions on how to use it and results obtained on synthetic and real data.
Abstract: In this paper, we present a new R package, COREclust, dedicated to the detection of representative variables in high-dimensional spaces with a potentially limited number of observations. Variable set detection is based on an original graph clustering strategy, the CORE-clustering algorithm, which detects CORE-clusters, i.e. variable sets having a user-defined size range and in which each variable is very similar to at least one other variable. Representative variables are then robustly estimated as the CORE-cluster centers. This strategy is entirely coded in C++ and wrapped in R using the Rcpp package. A particular effort has been dedicated to keeping its algorithmic cost reasonable so that it can be used on large datasets. After motivating our work, we explain the CORE-clustering algorithm as well as a greedy extension of this algorithm. We then present how to use the package and the results obtained on synthetic and real data.