Author

# Fabian Pedregosa

Other affiliations: École Normale Supérieure, French Institute for Research in Computer Science and Automation, IBM ...read more

Bio: Fabian Pedregosa is an academic researcher from Google. The author has contributed to research in topics: Python (programming language) & Computer science. The author has an hindex of 23, co-authored 75 publications receiving 76129 citations. Previous affiliations of Fabian Pedregosa include École Normale Supérieure & French Institute for Research in Computer Science and Automation.

##### Papers

More filters

•

Kobe University

^{1}, Bauhaus University, Weimar^{2}, Google^{3}, University of Washington^{4}, University of Massachusetts Amherst^{5}, Total S.A.^{6}TL;DR: Scikit-learn is a Python module integrating a wide range of state-of-the-art machine learning algorithms for medium-scale supervised and unsupervised problems, focusing on bringing machine learning to non-specialists using a general-purpose high-level language.

Abstract: Scikit-learn is a Python module integrating a wide range of state-of-the-art machine learning algorithms for medium-scale supervised and unsupervised problems. This package focuses on bringing machine learning to non-specialists using a general-purpose high-level language. Emphasis is put on ease of use, performance, documentation, and API consistency. It has minimal dependencies and is distributed under the simplified BSD license, encouraging its use in both academic and commercial settings. Source code, binaries, and documentation can be downloaded from http://scikit-learn.sourceforge.net.

47,974 citations

•

TL;DR: Scikit-learn as mentioned in this paper is a Python module integrating a wide range of state-of-the-art machine learning algorithms for medium-scale supervised and unsupervised problems.

Abstract: Scikit-learn is a Python module integrating a wide range of state-of-the-art machine learning algorithms for medium-scale supervised and unsupervised problems. This package focuses on bringing machine learning to non-specialists using a general-purpose high-level language. Emphasis is put on ease of use, performance, documentation, and API consistency. It has minimal dependencies and is distributed under the simplified BSD license, encouraging its use in both academic and commercial settings. Source code, binaries, and documentation can be downloaded from this http URL.

28,898 citations

••

University of Jyväskylä

^{1}, California Polytechnic State University^{2}, University of California, Los Angeles^{3}, Los Alamos National Laboratory^{4}, National Research University – Higher School of Economics^{5}, University of California, Berkeley^{6}, University of Birmingham^{7}, Australian Nuclear Science and Technology Organisation^{8}, University of Washington^{9}, University of Massachusetts Amherst^{10}, University of West Bohemia^{11}, Brigham Young University^{12}, University of Texas at Austin^{13}, Universidade Federal de Minas Gerais^{14}, Google^{15}TL;DR: SciPy as discussed by the authors is an open source scientific computing library for the Python programming language, which includes functionality spanning clustering, Fourier transforms, integration, interpolation, file I/O, linear algebra, image processing, orthogonal distance regression, minimization algorithms, signal processing, sparse matrix handling, computational geometry, and statistics.

Abstract: SciPy is an open source scientific computing library for the Python programming language. SciPy 1.0 was released in late 2017, about 16 years after the original version 0.1 release. SciPy has become a de facto standard for leveraging scientific algorithms in the Python programming language, with more than 600 unique code contributors, thousands of dependent packages, over 100,000 dependent repositories, and millions of downloads per year. This includes usage of SciPy in almost half of all machine learning projects on GitHub, and usage by high profile projects including LIGO gravitational wave analysis and creation of the first-ever image of a black hole (M87). The library includes functionality spanning clustering, Fourier transforms, integration, interpolation, file I/O, linear algebra, image processing, orthogonal distance regression, minimization algorithms, signal processing, sparse matrix handling, computational geometry, and statistics. In this work, we provide an overview of the capabilities and development practices of the SciPy library and highlight some recent technical developments.

12,774 citations

••

University of Jyväskylä

^{1}, University of California, Los Angeles^{2}, California Polytechnic State University^{3}, Los Alamos National Laboratory^{4}, National Research University – Higher School of Economics^{5}, University of California, Berkeley^{6}, University of Birmingham^{7}, Australian Nuclear Science and Technology Organisation^{8}, University of Washington^{9}, University of Massachusetts Amherst^{10}, University of West Bohemia^{11}, Brigham Young University^{12}, University of Texas at Austin^{13}, Universidade Federal de Minas Gerais^{14}, Google^{15}TL;DR: SciPy as discussed by the authors is an open-source scientific computing library for the Python programming language, which has become a de facto standard for leveraging scientific algorithms in Python, with over 600 unique code contributors, thousands of dependent packages, over 100,000 dependent repositories and millions of downloads per year.

Abstract: SciPy is an open-source scientific computing library for the Python programming language. Since its initial release in 2001, SciPy has become a de facto standard for leveraging scientific algorithms in Python, with over 600 unique code contributors, thousands of dependent packages, over 100,000 dependent repositories and millions of downloads per year. In this work, we provide an overview of the capabilities and development practices of SciPy 1.0 and highlight some recent technical developments.

6,244 citations

••

École Polytechnique

^{1}, IBM^{2}, University of Bonn^{3}, Imperial College London^{4}, Institut Mines-Télécom^{5}TL;DR: It is illustrated how scikit-learn, a Python machine learning library, can be used to perform some key analysis steps and its application to neuroimaging data provides a versatile tool to study the brain.

Abstract: Statistical machine learning methods are increasingly used for neuroimaging data analysis. Their main virtue is their ability to model high-dimensional datasets, e.g. multivariate analysis of activation images or resting-state time series. Supervised learning is typically used in decoding or encoding settings to relate brain images to behavioral or clinical observations, while unsupervised learning can uncover hidden structures in sets of images (e.g. resting state functional MRI) or find sub-populations in large cohorts. By considering different functional neuroimaging applications, we illustrate how scikit-learn, a Python machine learning library, can be used to perform some key analysis steps. Scikit-learn contains a very large set of statistical learning algorithms, both supervised and unsupervised, and its application to neuroimaging data provides a versatile tool to study the brain.

1,418 citations

##### Cited by

More filters

••

13 Aug 2016TL;DR: XGBoost as discussed by the authors proposes a sparsity-aware algorithm for sparse data and weighted quantile sketch for approximate tree learning to achieve state-of-the-art results on many machine learning challenges.

Abstract: Tree boosting is a highly effective and widely used machine learning method. In this paper, we describe a scalable end-to-end tree boosting system called XGBoost, which is used widely by data scientists to achieve state-of-the-art results on many machine learning challenges. We propose a novel sparsity-aware algorithm for sparse data and weighted quantile sketch for approximate tree learning. More importantly, we provide insights on cache access patterns, data compression and sharding to build a scalable tree boosting system. By combining these insights, XGBoost scales beyond billions of examples using far fewer resources than existing systems.

14,872 citations

••

TL;DR: This paper proposes a novel sparsity-aware algorithm for sparse data and weighted quantile sketch for approximate tree learning and provides insights on cache access patterns, data compression and sharding to build a scalable tree boosting system called XGBoost.

Abstract: Tree boosting is a highly effective and widely used machine learning method. In this paper, we describe a scalable end-to-end tree boosting system called XGBoost, which is used widely by data scientists to achieve state-of-the-art results on many machine learning challenges. We propose a novel sparsity-aware algorithm for sparse data and weighted quantile sketch for approximate tree learning. More importantly, we provide insights on cache access patterns, data compression and sharding to build a scalable tree boosting system. By combining these insights, XGBoost scales beyond billions of examples using far fewer resources than existing systems.

13,333 citations

••

University of Jyväskylä

^{1}, University of California, Los Angeles^{2}, California Polytechnic State University^{3}, Los Alamos National Laboratory^{4}, National Research University – Higher School of Economics^{5}, University of California, Berkeley^{6}, University of Birmingham^{7}, Australian Nuclear Science and Technology Organisation^{8}, University of Washington^{9}, University of Massachusetts Amherst^{10}, University of West Bohemia^{11}, Brigham Young University^{12}, University of Texas at Austin^{13}, Universidade Federal de Minas Gerais^{14}, Google^{15}TL;DR: SciPy as discussed by the authors is an open source scientific computing library for the Python programming language, which includes functionality spanning clustering, Fourier transforms, integration, interpolation, file I/O, linear algebra, image processing, orthogonal distance regression, minimization algorithms, signal processing, sparse matrix handling, computational geometry, and statistics.

Abstract: SciPy is an open source scientific computing library for the Python programming language. SciPy 1.0 was released in late 2017, about 16 years after the original version 0.1 release. SciPy has become a de facto standard for leveraging scientific algorithms in the Python programming language, with more than 600 unique code contributors, thousands of dependent packages, over 100,000 dependent repositories, and millions of downloads per year. This includes usage of SciPy in almost half of all machine learning projects on GitHub, and usage by high profile projects including LIGO gravitational wave analysis and creation of the first-ever image of a black hole (M87). The library includes functionality spanning clustering, Fourier transforms, integration, interpolation, file I/O, linear algebra, image processing, orthogonal distance regression, minimization algorithms, signal processing, sparse matrix handling, computational geometry, and statistics. In this work, we provide an overview of the capabilities and development practices of the SciPy library and highlight some recent technical developments.

12,774 citations

••

13 Aug 2016TL;DR: In this article, the authors propose LIME, a method to explain models by presenting representative individual predictions and their explanations in a non-redundant way, framing the task as a submodular optimization problem.

Abstract: Despite widespread adoption, machine learning models remain mostly black boxes. Understanding the reasons behind predictions is, however, quite important in assessing trust, which is fundamental if one plans to take action based on a prediction, or when choosing whether to deploy a new model. Such understanding also provides insights into the model, which can be used to transform an untrustworthy model or prediction into a trustworthy one. In this work, we propose LIME, a novel explanation technique that explains the predictions of any classifier in an interpretable and faithful manner, by learning an interpretable model locally varound the prediction. We also propose a method to explain models by presenting representative individual predictions and their explanations in a non-redundant way, framing the task as a submodular optimization problem. We demonstrate the flexibility of these methods by explaining different models for text (e.g. random forests) and image classification (e.g. neural networks). We show the utility of explanations via novel experiments, both simulated and with human subjects, on various scenarios that require trust: deciding if one should trust a prediction, choosing between models, improving an untrustworthy classifier, and identifying why a classifier should not be trusted.

11,104 citations

••

Anthony G. A. Brown

^{1}, Antonella Vallenari^{1}, T. Prusti^{1}, J. H. J. de Bruijne^{1}+449 more•Institutions (1)TL;DR: The second Gaia data release, Gaia DR2 as mentioned in this paper, is a major advance with respect to Gaia DR1 in terms of completeness, performance, and richness of the data products.

Abstract: Context. We present the second Gaia data release, Gaia DR2, consisting of astrometry, photometry, radial velocities, and information on astrophysical parameters and variability, for sources brighter than magnitude 21. In addition epoch astrometry and photometry are provided for a modest sample of minor planets in the solar system. Aims: A summary of the contents of Gaia DR2 is presented, accompanied by a discussion on the differences with respect to Gaia DR1 and an overview of the main limitations which are still present in the survey. Recommendations are made on the responsible use of Gaia DR2 results. Methods: The raw data collected with the Gaia instruments during the first 22 months of the mission have been processed by the Gaia Data Processing and Analysis Consortium (DPAC) and turned into this second data release, which represents a major advance with respect to Gaia DR1 in terms of completeness, performance, and richness of the data products. Results: Gaia DR2 contains celestial positions and the apparent brightness in G for approximately 1.7 billion sources. For 1.3 billion of those sources, parallaxes and proper motions are in addition available. The sample of sources for which variability information is provided is expanded to 0.5 million stars. This data release contains four new elements: broad-band colour information in the form of the apparent brightness in the GBP (330-680 nm) and GRP (630-1050 nm) bands is available for 1.4 billion sources; median radial velocities for some 7 million sources are presented; for between 77 and 161 million sources estimates are provided of the stellar effective temperature, extinction, reddening, and radius and luminosity; and for a pre-selected list of 14 000 minor planets in the solar system epoch astrometry and photometry are presented. Finally, Gaia DR2 also represents a new materialisation of the celestial reference frame in the optical, the Gaia-CRF2, which is the first optical reference frame based solely on extragalactic sources. There are notable changes in the photometric system and the catalogue source list with respect to Gaia DR1, and we stress the need to consider the two data releases as independent. Conclusions: Gaia DR2 represents a major achievement for the Gaia mission, delivering on the long standing promise to provide parallaxes and proper motions for over 1 billion stars, and representing a first step in the availability of complementary radial velocity and source astrophysical information for a sample of stars in the Gaia survey which covers a very substantial fraction of the volume of our galaxy.

8,308 citations