
Showing papers by "Jean-Michel Loubes published in 2018"


Journal ArticleDOI
TL;DR: In this paper, the authors proposed a new method to predict the final destination of vehicle trips based on their initial partial trajectories using a mixture of 2-D Gaussian distributions.
Abstract: In this paper, we propose a new method to predict the final destination of vehicle trips based on their initial partial trajectories. We first review how we obtained a clustering of trajectories that describes user behavior. Then, we explain how we model the main traffic flow patterns by a mixture of 2-D Gaussian distributions. This yields a density-based clustering of locations, which produces a data-driven grid of similar points within each pattern. We present how this model can be used to predict the final destination of a new trajectory based on its first locations, using a two-step procedure: we first assign the new trajectory to the clusters it most likely belongs to; second, we use characteristics from trajectories inside these clusters to predict the final destination. Finally, we present experimental results of our methods for classification of trajectories and final destination prediction on data sets of timestamped GPS locations of taxi trips. We test our methods on two different data sets, to assess the capacity of our method to adapt automatically to different subsets.
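The abstract's two-step procedure can be illustrated with a deliberately simplified sketch: a mixture of 2-D Gaussians is fitted to trip end points, a partial trajectory is assigned to its most likely component, and the destination is predicted from the trips in that component. The toy data, the number of components and the use of end points only are illustrative assumptions, not the authors' exact pipeline.

```python
# Hedged sketch of the two-step destination predictor described in the
# abstract. All data and parameters below are illustrative placeholders.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Toy data: 200 trips, each a (T, 2) array of GPS points (here random walks).
trips = [np.cumsum(rng.normal(size=(50, 2)), axis=0) for _ in range(200)]
destinations = np.array([t[-1] for t in trips])

# Step 1: model main traffic patterns with a mixture of 2-D Gaussians,
# fitted here on trip end points as a crude proxy for flow patterns.
gmm = GaussianMixture(n_components=5, random_state=0).fit(destinations)
labels = gmm.predict(destinations)

def predict_destination(partial_trip):
    """Assign the partial trip to its most likely cluster, then return the
    mean destination of the training trips in that cluster."""
    last_point = partial_trip[-1].reshape(1, -1)
    cluster = gmm.predict(last_point)[0]
    return destinations[labels == cluster].mean(axis=0)

print(predict_destination(trips[0][:10]))
```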

79 citations


Posted Content
TL;DR: In this article, the authors focus on two of them, called Disparate Impact (DI) and Balanced Error Rate (BER); both are based on the outcome of the algorithm across the different groups determined by the protected variable.
Abstract: Statistical algorithms increasingly help make decisions in many aspects of our lives. But how do we know whether these algorithms are biased and commit unfair discrimination against a particular group of people, typically a minority? Fairness is generally studied in a probabilistic framework where it is assumed that there exists a protected variable whose use as an input of the algorithm may imply discrimination. There are different definitions of fairness in the literature. In this paper we focus on two of them, called Disparate Impact (DI) and Balanced Error Rate (BER). Both are based on the outcome of the algorithm across the different groups determined by the protected variable. The relationship between these two notions is also studied. The goals of this paper are to detect when a binary classification rule lacks fairness and to try to fight the potential discrimination attributable to it. This can be done by modifying either the classifiers or the data itself. Our work falls into the second category and modifies the input data using optimal transport theory.
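As a minimal illustration of the Disparate Impact criterion mentioned in the abstract, the sketch below computes DI = P(g(X)=1 | S=0) / P(g(X)=1 | S=1) for a binary decision rule and a binary protected attribute. The simulated data, the group coding and the 0.8 threshold (the usual "80% rule") are illustrative conventions, not taken from the paper.

```python
# Hedged sketch: computing the Disparate Impact of a binary decision rule
# with respect to a binary protected attribute (0 = protected/minority group
# by convention here).
import numpy as np

def disparate_impact(y_pred, s):
    """y_pred: 0/1 decisions; s: 0/1 protected attribute (0 = minority)."""
    p_minority = np.mean(y_pred[s == 0])
    p_majority = np.mean(y_pred[s == 1])
    return p_minority / p_majority

rng = np.random.default_rng(1)
s = rng.integers(0, 2, size=1000)
# Simulated decisions: the favourable outcome is more frequent when s == 1.
y_pred = (rng.random(1000) < np.where(s == 1, 0.6, 0.4)).astype(int)

di = disparate_impact(y_pred, s)
print(f"DI = {di:.2f}", "-> potentially unfair" if di < 0.8 else "-> passes the 80% rule")
```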

61 citations


Journal ArticleDOI
TL;DR: In this paper, a family of positive definite kernels built using transportation-based distances is presented, and the Gaussian processes indexed by distributions corresponding to these kernels can be efficiently forecast.
Abstract: Monge-Kantorovich distances, otherwise known as Wasserstein distances, have received growing attention in statistics and machine learning as a powerful discrepancy measure for probability distributions. In this paper, we focus on forecasting a Gaussian process indexed by probability distributions. For this, we provide a family of positive definite kernels built using transportation-based distances. We provide a probabilistic understanding of these kernels and characterize the corresponding stochastic processes. We prove that the Gaussian processes indexed by distributions corresponding to these kernels can be efficiently forecast, opening new perspectives in Gaussian process modeling.
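A hedged sketch of the kind of construction the abstract describes, restricted to one-dimensional empirical distributions: a Gaussian-type kernel built from the squared 2-Wasserstein distance (computed through order statistics), and a basic Gaussian-process predictive mean obtained from the resulting Gram matrix. The bandwidth, noise level and regression target are illustrative choices, not the paper's.

```python
# Hedged sketch: a transportation-based kernel between 1-D empirical
# distributions and a simple GP predictive mean built from it.
import numpy as np

def w2_squared(x, y):
    """Squared 2-Wasserstein distance between two equal-size 1-D samples,
    computed through their order statistics (quantile coupling)."""
    return np.mean((np.sort(x) - np.sort(y)) ** 2)

def kernel(samples_a, samples_b, bandwidth=1.0):
    return np.array([[np.exp(-w2_squared(a, b) / bandwidth**2)
                      for b in samples_b] for a in samples_a])

rng = np.random.default_rng(2)
# Inputs are distributions (represented by samples); the toy output is their mean.
train = [rng.normal(m, 1.0, size=200) for m in np.linspace(-2, 2, 10)]
y_train = np.array([s.mean() for s in train])
test = [rng.normal(0.5, 1.0, size=200)]

K = kernel(train, train) + 1e-3 * np.eye(len(train))   # regularized Gram matrix
k_star = kernel(test, train)
posterior_mean = k_star @ np.linalg.solve(K, y_train)   # GP predictive mean
print(posterior_mean)
```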

49 citations



Book ChapterDOI
28 May 2018
TL;DR: This paper sets up features based on fixed functional bases and data-based bases, and applies distance- or density-based outlier detection methods to those features, testing them on real telemetry data.
Abstract: Satellite monitoring is an important task for preventing satellite failures. For this purpose, a large number of time series are analyzed in order to detect anomalies. In this paper, we provide a review of such analyses, focusing on methods that rely on feature extraction. In particular, we set up features based on fixed functional bases (Fourier, wavelets, kernel bases, ...) and data-based bases (PCA, KPCA). The outlier detection methods we apply to those features can be distance- or density-based. These algorithms are tested on real telemetry data.
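The feature-based pipeline described above can be sketched as follows: extract fixed-basis features (here, low-frequency Fourier magnitudes) from each time series, then run a density-based outlier detector on the feature vectors. The synthetic telemetry, feature dimension and detector hyperparameters are illustrative assumptions.

```python
# Hedged sketch: Fourier features + density-based outlier detection on
# telemetry-like time series.
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(3)
n_series, length = 100, 256
series = np.sin(np.linspace(0, 8 * np.pi, length)) + 0.1 * rng.normal(size=(n_series, length))
series[:3] += rng.normal(0, 2.0, size=(3, length))       # a few anomalous channels

# Fixed functional basis: keep the magnitudes of the first 20 Fourier modes.
features = np.abs(np.fft.rfft(series, axis=1))[:, :20]

# Density-based detection on the feature space.
lof = LocalOutlierFactor(n_neighbors=10)
flags = lof.fit_predict(features)                          # -1 marks outliers
print("flagged series:", np.where(flags == -1)[0])
```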

14 citations


Posted Content
TL;DR: This work characterizes in turn radial positive definite kernels on Hilbert spaces, and shows that the covariance parameters of virtually all parametric families of covariance functions are microergodic in the case of (infinite-dimensional) Hilbert spaces.
Abstract: In this work, we investigate Gaussian Processes indexed by multidimensional distributions. While directly constructing radial positive definite kernels based on the Wasserstein distance has been proven to be possible in the unidimensional case, such constructions do not extend well to the multidimensional case, as we illustrate here. To tackle the problem of defining positive definite kernels between multivariate distributions based on optimal transport, we appeal instead to Hilbert space embeddings relying on optimal transport maps to a reference distribution, that we suggest to take as a Wasserstein barycenter. We characterize in turn radial positive definite kernels on Hilbert spaces, and show that the covariance parameters of virtually all parametric families of covariance functions are microergodic in the case of (infinite-dimensional) Hilbert spaces. We also investigate statistical properties of our suggested positive definite kernels on multidimensional distributions, with a focus on consistency when a population Wasserstein barycenter is replaced by an empirical barycenter, and additional explicit results in the special case of Gaussian distributions. Finally, we study the Gaussian process methodology based on our suggested positive definite kernels in regression problems with multidimensional distribution inputs, on simulation data stemming both from synthetic examples and from a mechanical engineering test case.
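A hedged sketch of the embedding idea the abstract alludes to: each distribution is represented by the optimal transport map sending a reference measure (suggested to be a Wasserstein barycenter) to it, and a radial kernel is then defined on the resulting Hilbert space. The radial function F and bandwidth sigma are generic placeholders; the paper's exact parametrization may differ.

```latex
% Hypothetical notation: \bar\mu is the reference measure (a Wasserstein
% barycenter), T_\mu the optimal transport map from \bar\mu to \mu, and F a
% radial function such as F(t) = e^{-t}.
\[
  \mu \;\longmapsto\; T_\mu \in L^2(\bar\mu;\mathbb{R}^d),
  \qquad
  k(\mu,\nu) \;=\; F\!\left(\frac{\|T_\mu - T_\nu\|_{L^2(\bar\mu)}^2}{\sigma^2}\right).
\]
```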

11 citations


Book ChapterDOI
13 Sep 2018
TL;DR: A decision-making tool based on supervised learning techniques is proposed that detects defects and provides the Surface-Mount Technology (SMT) operator with the probability that a reported defect is a false call; four tree-based learning methods are compared.
Abstract: In this paper, we propose a decision-making tool based on supervised learning techniques that detects defects and provides the Surface-Mount Technology (SMT) operator with the probability that a detected defect is a false call. In this work, we compare four tree-based learning methods. The results of our experiments show that an XGBoost model trained on our real-world dataset can accurately classify most real defects and false calls, with an accuracy of about 99.4% and a recall of about 98.6%. Moreover, we investigated the computing time of our prediction model and concluded that integrating our classification tool based on the XGBoost algorithm into the SMT production line is realistic and feasible. We believe that our tool will significantly improve the daily work of the SMT verification operator.
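A hedged sketch of the kind of classifier the abstract evaluates: an XGBoost model trained to separate real defects from false calls, with accuracy and recall reported on a held-out split. The synthetic features and hyperparameters are placeholders, and the xgboost Python package is assumed to be installed; the 99.4% accuracy and 98.6% recall quoted above come from the authors' real-world dataset, not from this toy run.

```python
# Hedged sketch: tree-based classification of defects vs. false calls.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, recall_score
from xgboost import XGBClassifier  # assumes the xgboost package is installed

# Toy stand-in for inspection features; label 1 = real defect (illustrative).
X, y = make_classification(n_samples=5000, n_features=20, weights=[0.7, 0.3],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

clf = XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.1)
clf.fit(X_tr, y_tr)
pred = clf.predict(X_te)
print(f"accuracy={accuracy_score(y_te, pred):.3f}  recall={recall_score(y_te, pred):.3f}")
```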

8 citations


Posted Content
TL;DR: The asymptotic distribution of the major indexes used in the statistical literature to quantify disparate treatment in machine learning is provided.
Abstract: We provide the asymptotic distribution of the major indexes used in the statistical literature to quantify disparate treatment in machine learning. We aim at promoting the use of confidence intervals when testing the so-called group disparate impact. We illustrate on some examples the importance of using confidence intervals and not a single value.
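As a simple stand-in for the asymptotic confidence intervals advocated in the abstract, the sketch below builds a bootstrap confidence interval for the Disparate Impact index; the paper's own intervals are derived from the asymptotic distribution of the index, which this sketch does not reproduce.

```python
# Hedged sketch: a bootstrap confidence interval for Disparate Impact,
# used here only as a stand-in for the paper's asymptotic intervals.
import numpy as np

def disparate_impact(y_pred, s):
    return y_pred[s == 0].mean() / y_pred[s == 1].mean()

rng = np.random.default_rng(4)
s = rng.integers(0, 2, size=2000)
y_pred = (rng.random(2000) < np.where(s == 1, 0.55, 0.45)).astype(int)

boot = []
for _ in range(2000):
    idx = rng.integers(0, len(s), size=len(s))         # resample with replacement
    boot.append(disparate_impact(y_pred[idx], s[idx]))
lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"DI = {disparate_impact(y_pred, s):.3f}, 95% CI = [{lo:.3f}, {hi:.3f}]")
```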

8 citations


Posted Content
18 Oct 2018
TL;DR: A new explainability formalism designed to explain how the possible values of each input variable in a whole test set impact the predictions given by black-box decision rules; it additionally has a low algorithmic complexity, making it scalable to real-life large datasets.
Abstract: In this paper, we present a new explainability formalism designed to explain how the input variables of the testing set impact the predictions of black-box decision rules. Hence we propose a group explainability framework for machine learning algorithms based on the variability of the distribution of the input variables. Our formalism is based on an information theory framework that quantifies the influence of all input-output observations when emphasizing the impact of each input variable, based on entropic projections. This formalism is thus the first unified and model-agnostic framework enabling us to interpret the dependence between the input variables, their impact on the prediction errors, and their influence on the output predictions. In addition and most importantly, we prove that computing an explanation in our framework has a low algorithmic complexity, making it scalable to real-life large datasets. We illustrate our strategy by explaining complex decision rules learned using XGBoost, Random Forest or Deep Neural Network classifiers. We finally make clear its differences with explainability strategies based on single observations, such as those of LIME or SHAP. A toolbox is proposed at this https URL.

5 citations


Posted Content
TL;DR: A new explainability formalism is presented to make clear the impact of each variable on the predictions given by black-box decision rules, together with a new computationally efficient algorithm to stress the variables, which only reweights the reference observations and predictions.
Abstract: In this paper, we present a new explainability formalism to make clear the impact of each variable on the predictions given by black-box decision rules. Our method consists in evaluating the decision rules on test samples generated in such a way that each variable is stressed incrementally while preserving the original distribution of the machine learning problem. We then propose a new computationally efficient algorithm to stress the variables, which only reweights the reference observations and predictions. This makes our methodology scalable to large datasets. Results obtained on standard machine learning datasets are presented and discussed.
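One simple way to realize the "reweight the reference observations and predictions" idea, offered here only as an illustrative stand-in for the authors' algorithm: exponentially tilt the observation weights so that the stressed variable's mean shifts, and read off the reweighted average prediction. The model, data and tilt strengths are placeholders.

```python
# Hedged sketch: stressing one input variable by exponential tilting of the
# weights of a fixed reference sample, then observing how the model's
# average prediction responds. Predictions are computed only once.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=1000, n_features=5, noise=5.0, random_state=0)
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
preds = model.predict(X)                       # reference predictions

j = 2                                          # index of the stressed variable
for t in (-0.5, 0.0, 0.5):                     # tilt strength (illustrative)
    w = np.exp(t * X[:, j])
    w /= w.sum()                               # reweighted reference sample
    stressed_mean = w @ X[:, j]                # mean of variable j after stress
    stressed_pred = w @ preds                  # reweighted average prediction
    print(f"t={t:+.1f}  E[X_{j}]={stressed_mean:+.2f}  E[f(X)]={stressed_pred:+.2f}")
```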

4 citations


Posted Content
TL;DR: It is proved that the kernel built from the quadratic distance between the transportation maps that transport each distribution to the barycenter of the distributions provides a valid covariance function.
Abstract: In this work, we propose to define Gaussian Processes indexed by multidimensional distributions. In the framework where the distributions can be modeled as i.i.d. realizations of a measure on the set of distributions, we prove that the kernel built from the quadratic distance between the transportation maps, which transport each distribution to the barycenter of the distributions, provides a valid covariance function. In this framework, we study the asymptotic properties of this process, proving microergodicity of the parameters.

Posted Content
TL;DR: This paper proposes and studies a harmonic analysis of the covariance operators that enables the consideration of Gaussian process models and forecasting issues; the theory is motivated by statistical ranking problems.
Abstract: In the framework of the supervised learning of a real function defined on a space X, the so-called Kriging method relies on a real Gaussian field defined on X. The Euclidean case is well known and has been widely studied. In this paper, we explore the less classical case where X is the non-commutative finite group of permutations. In this setting, we propose and study a harmonic analysis of the covariance operators that enables us to consider Gaussian process models and forecasting issues. Our theory is motivated by statistical ranking problems.
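A hedged sketch of the type of spectral representation such a harmonic analysis can rely on: a bi-invariant covariance on the symmetric group can be built from a nonnegative combination of the irreducible characters, each of which is a positive definite class function. The exact parametrization used in the paper may differ.

```latex
% Hypothetical sketch: a bi-invariant covariance on the permutation group S_n
% built from a nonnegative combination of the irreducible characters
% \chi_\lambda (indexed by partitions \lambda of n); each \chi_\lambda is a
% positive definite class function, so K below is a valid covariance.
\[
  K(\sigma,\tau) \;=\; f(\sigma\tau^{-1}),
  \qquad
  f(\pi) \;=\; \sum_{\lambda \vdash n} c_\lambda\, \chi_\lambda(\pi),
  \quad c_\lambda \ge 0 .
\]
```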

Posted Content
TL;DR: This work focuses on the risks of discrimination, the problems of transparency and the quality of algorithmic decisions, and lists several forms of control to be developed: institutional control, an ethical charter, and external audits attached to the issuing of a label.
Abstract: Combining big data and machine learning algorithms, the power of automatic decision tools induces as much hope as fear. Recently enacted European legislation (the GDPR) and French laws attempt to regulate the use of these tools. Leaving aside the well-identified problems of data confidentiality and impediments to competition, we focus on the risks of discrimination, the problems of transparency and the quality of algorithmic decisions. The detailed perspective of the legal texts, faced with the complexity and opacity of the learning algorithms, reveals the need for important technological disruptions for the detection or reduction of the discrimination risk, and for addressing the right to obtain an explanation of the automatic decision. Since the trust of the developers and, above all, of the users (citizens, litigants, customers) is essential, algorithms exploiting personal data must be deployed in a strict ethical framework. In conclusion, to answer this need, we list several forms of control to be developed: institutional control, an ethical charter, and external audits attached to the issuing of a label.

Posted Content
TL;DR: An estimate of the asymptotic variance is provided, which enables the construction of a two-sample test to assess the similarity between two distributions; this test is then used to provide a new criterion to assess the fairness of a classification algorithm.
Abstract: We provide a Central Limit Theorem for the Monge-Kantorovich distance between two empirical distributions with sizes $n$ and $m$, $W_p(P_n,Q_m)$ for $p>1$, for observations on the real line, using a minimal amount of assumptions. We provide an estimate of the asymptotic variance which enables us to build a two-sample test to assess the similarity between two distributions. This test is then used to provide a new criterion to assess the notion of fairness of a classification algorithm.
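The two-sample use described in the abstract can be illustrated with a simple permutation test based on the one-dimensional Wasserstein distance between empirical samples; the paper instead calibrates the test through a Central Limit Theorem and an asymptotic variance estimate for W_p with p > 1, which this sketch does not implement (scipy's wasserstein_distance computes W_1).

```python
# Hedged sketch: a permutation two-sample test on the 1-D Wasserstein
# distance, as a stand-in for the CLT-based test described in the abstract.
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(5)
x = rng.normal(0.0, 1.0, size=300)      # e.g. scores of one protected group
y = rng.normal(0.3, 1.0, size=300)      # scores of the other group

obs = wasserstein_distance(x, y)
pooled = np.concatenate([x, y])
perm_stats = []
for _ in range(1000):
    rng.shuffle(pooled)
    perm_stats.append(wasserstein_distance(pooled[:300], pooled[300:]))
p_value = np.mean(np.array(perm_stats) >= obs)
print(f"W_1 = {obs:.3f}, permutation p-value = {p_value:.3f}")
```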

Posted Content
TL;DR: This paper proposes a group explainability formalism for trained machine learning decision rules, based on their response to the variability of the input variables distribution, and proves that computing an explanation in this framework has a low algorithmic complexity, making it scalable to real-life large datasets.
Abstract: In this paper, we present a new explainability formalism designed to explain how each input variable of a test set impacts the predictions of machine learning models. Hence, we propose a group explainability formalism for trained machine learning decision rules, based on their response to the variability of the input variables distribution. In order to emphasize the impact of each input variable, this formalism uses an information theory framework that quantifies the influence of all input-output observations based on entropic projections. This is thus the first unified and model-agnostic formalism enabling data scientists to interpret the dependence between the input variables, their impact on the prediction errors, and their influence on the output predictions. Convergence rates of the entropic projections are provided in the large sample case. Most importantly, we prove that computing an explanation in our framework has a low algorithmic complexity, making it scalable to real-life large datasets. We illustrate our strategy by explaining complex decision rules learned using XGBoost, Random Forest or Deep Neural Network classifiers on various datasets such as Adult income, MNIST and CelebA. We finally make clear its differences with the explainability strategies LIME and SHAP, which are based on single observations. Results can be reproduced using the freely distributed Python toolbox at this https URL.

Posted Content
TL;DR: In this paper, the authors developed a new method for estimating the Value-at-Risk, the Conditional Value at Risk and the Economic Capital when the underlying risk factors follow a Beta-Kotz distribution.
Abstract: This paper considers the use, for Value-at-Risk computations, of the so-called Beta-Kotz distribution, which is based on a general family of distributions including the classical Gaussian model. This work develops a new method for estimating the Value-at-Risk, the Conditional Value-at-Risk and the Economic Capital when the underlying risk factors follow a Beta-Kotz distribution. After estimating the parameters of the distribution of the loss random variable, both analytical approaches (for some particular values of the parameters) and numerical approaches are provided for computing these measures. The proposed risk measures are finally applied to quantify the potential risk of economic losses in credit risk.
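As context for the risk measures named in the abstract, here is a generic Monte Carlo sketch of Value-at-Risk and Conditional Value-at-Risk estimation. A Student-t sample stands in for the loss distribution because the Beta-Kotz family studied in the paper is not available in standard Python libraries.

```python
# Hedged sketch: Monte Carlo VaR and CVaR of a loss distribution. The
# Student-t losses are a generic heavy-tailed stand-in, not the Beta-Kotz
# model of the paper.
import numpy as np

rng = np.random.default_rng(6)
losses = rng.standard_t(df=5, size=100_000)       # simulated losses (toy)

alpha = 0.99
var = np.quantile(losses, alpha)                  # Value-at-Risk at level 99%
cvar = losses[losses >= var].mean()               # Conditional VaR (expected shortfall)
print(f"VaR_{alpha:.0%} = {var:.3f}, CVaR_{alpha:.0%} = {cvar:.3f}")
```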

Posted Content
TL;DR: A new R package, COREclust, dedicated to the detection of representative variables in high-dimensional spaces with a potentially limited number of observations is presented, together with instructions on how to use it and results obtained on synthetic and real data.
Abstract: In this paper, we present a new R package, COREclust, dedicated to the detection of representative variables in high-dimensional spaces with a potentially limited number of observations. Variable set detection is based on an original graph clustering strategy, the CORE-clustering algorithm, which detects CORE-clusters, i.e. variable sets having a user-defined size range and in which each variable is very similar to at least one other variable. Representative variables are then robustly estimated as the CORE-cluster centers. This strategy is entirely coded in C++ and wrapped in R using the Rcpp package. A particular effort has been dedicated to keeping its algorithmic cost reasonable so that it can be used on large datasets. After motivating our work, we explain the CORE-clustering algorithm as well as a greedy extension of this algorithm. We then present how to use the package and the results obtained on synthetic and real data.