Author

Jake Vanderplas

Bio: Jake Vanderplas is an academic researcher from the University of Washington. The author has contributed to research in topics including Python (programming language) and weak gravitational lensing. The author has an h-index of 30 and has co-authored 56 publications that have received 77,174 citations.
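
The h-index quoted in the bio is the largest number h such that h of an author's papers each have at least h citations. A minimal sketch of that calculation follows; the function name `h_index` is illustrative, and the input is only the five citation counts shown in the Papers list on this page, not the author's full record of 56 publications.

```python
def h_index(citations):
    """Return the largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank  # the top `rank` papers all have at least `rank` citations
        else:
            break
    return h


# Citation counts of the five papers listed on this page (a partial record).
listed = [47974, 28898, 12774, 6244, 1244]
print(h_index(listed))
```

With only five papers as input the result cannot exceed 5; the reported value of 30 depends on the complete publication list.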

Papers
Journal Article
TL;DR: Scikit-learn is a Python module integrating a wide range of state-of-the-art machine learning algorithms for medium-scale supervised and unsupervised problems, focusing on bringing machine learning to non-specialists using a general-purpose high-level language.
Abstract: Scikit-learn is a Python module integrating a wide range of state-of-the-art machine learning algorithms for medium-scale supervised and unsupervised problems. This package focuses on bringing machine learning to non-specialists using a general-purpose high-level language. Emphasis is put on ease of use, performance, documentation, and API consistency. It has minimal dependencies and is distributed under the simplified BSD license, encouraging its use in both academic and commercial settings. Source code, binaries, and documentation can be downloaded from http://scikit-learn.sourceforge.net.

47,974 citations

Posted Content
TL;DR: Scikit-learn, as described in this paper, is a Python module integrating a wide range of state-of-the-art machine learning algorithms for medium-scale supervised and unsupervised problems.
Abstract: Scikit-learn is a Python module integrating a wide range of state-of-the-art machine learning algorithms for medium-scale supervised and unsupervised problems. This package focuses on bringing machine learning to non-specialists using a general-purpose high-level language. Emphasis is put on ease of use, performance, documentation, and API consistency. It has minimal dependencies and is distributed under the simplified BSD license, encouraging its use in both academic and commercial settings. Source code, binaries, and documentation can be downloaded from this http URL.

28,898 citations

Journal ArticleDOI
TL;DR: SciPy is an open source scientific computing library for the Python programming language whose functionality spans clustering, Fourier transforms, integration, interpolation, file I/O, linear algebra, image processing, orthogonal distance regression, minimization algorithms, signal processing, sparse matrix handling, computational geometry, and statistics.
Abstract: SciPy is an open source scientific computing library for the Python programming language. SciPy 1.0 was released in late 2017, about 16 years after the original version 0.1 release. SciPy has become a de facto standard for leveraging scientific algorithms in the Python programming language, with more than 600 unique code contributors, thousands of dependent packages, over 100,000 dependent repositories, and millions of downloads per year. This includes usage of SciPy in almost half of all machine learning projects on GitHub, and usage by high profile projects including LIGO gravitational wave analysis and creation of the first-ever image of a black hole (M87). The library includes functionality spanning clustering, Fourier transforms, integration, interpolation, file I/O, linear algebra, image processing, orthogonal distance regression, minimization algorithms, signal processing, sparse matrix handling, computational geometry, and statistics. In this work, we provide an overview of the capabilities and development practices of the SciPy library and highlight some recent technical developments.

12,774 citations

Journal ArticleDOI
TL;DR: SciPy is an open-source scientific computing library for the Python programming language that, since its initial release in 2001, has become a de facto standard for leveraging scientific algorithms in Python, with over 600 unique code contributors, thousands of dependent packages, over 100,000 dependent repositories and millions of downloads per year.
Abstract: SciPy is an open-source scientific computing library for the Python programming language. Since its initial release in 2001, SciPy has become a de facto standard for leveraging scientific algorithms in Python, with over 600 unique code contributors, thousands of dependent packages, over 100,000 dependent repositories and millions of downloads per year. In this work, we provide an overview of the capabilities and development practices of SciPy 1.0 and highlight some recent technical developments.

6,244 citations

01 Dec 2020
TL;DR: Seaborn is a library for making statistical graphics in Python that provides a high-level interface to matplotlib, integrates closely with pandas data structures, and exposes a declarative, dataset-oriented API that makes it easy to translate questions about data into graphics that can answer them.
Abstract: seaborn is a library for making statistical graphics in Python. It provides a high-level interface to matplotlib and integrates closely with pandas data structures. Functions in the seaborn library expose a declarative, dataset-oriented API that makes it easy to translate questions about data into graphics that can answer them. When given a dataset and a specification of the plot to make, seaborn automatically maps the data values to visual attributes such as color, size, or style, internally computes statistical transformations, and decorates the plot with informative axis labels and a legend. Many seaborn functions can generate figures with multiple panels that elicit comparisons between conditional subsets of data or across different pairings of variables in a dataset. seaborn is designed to be useful throughout the lifecycle of a scientific project. By producing complete graphics from a single function call with minimal arguments, seaborn facilitates rapid prototyping and exploratory data analysis. And by offering extensive options for customization, along with exposing the underlying matplotlib objects, it can be used to create polished, publication-quality figures.

1,244 citations


Cited by
Proceedings ArticleDOI
13 Aug 2016
TL;DR: This paper describes XGBoost, a scalable end-to-end tree boosting system, and proposes a sparsity-aware algorithm for sparse data and a weighted quantile sketch for approximate tree learning.
Abstract: Tree boosting is a highly effective and widely used machine learning method. In this paper, we describe a scalable end-to-end tree boosting system called XGBoost, which is used widely by data scientists to achieve state-of-the-art results on many machine learning challenges. We propose a novel sparsity-aware algorithm for sparse data and weighted quantile sketch for approximate tree learning. More importantly, we provide insights on cache access patterns, data compression and sharding to build a scalable tree boosting system. By combining these insights, XGBoost scales beyond billions of examples using far fewer resources than existing systems.

14,872 citations

Proceedings ArticleDOI
TL;DR: This paper proposes a novel sparsity-aware algorithm for sparse data and weighted quantile sketch for approximate tree learning and provides insights on cache access patterns, data compression and sharding to build a scalable tree boosting system called XGBoost.
Abstract: Tree boosting is a highly effective and widely used machine learning method. In this paper, we describe a scalable end-to-end tree boosting system called XGBoost, which is used widely by data scientists to achieve state-of-the-art results on many machine learning challenges. We propose a novel sparsity-aware algorithm for sparse data and weighted quantile sketch for approximate tree learning. More importantly, we provide insights on cache access patterns, data compression and sharding to build a scalable tree boosting system. By combining these insights, XGBoost scales beyond billions of examples using far fewer resources than existing systems.

13,333 citations

Journal ArticleDOI
TL;DR: SciPy is an open source scientific computing library for the Python programming language whose functionality spans clustering, Fourier transforms, integration, interpolation, file I/O, linear algebra, image processing, orthogonal distance regression, minimization algorithms, signal processing, sparse matrix handling, computational geometry, and statistics.
Abstract: SciPy is an open source scientific computing library for the Python programming language. SciPy 1.0 was released in late 2017, about 16 years after the original version 0.1 release. SciPy has become a de facto standard for leveraging scientific algorithms in the Python programming language, with more than 600 unique code contributors, thousands of dependent packages, over 100,000 dependent repositories, and millions of downloads per year. This includes usage of SciPy in almost half of all machine learning projects on GitHub, and usage by high profile projects including LIGO gravitational wave analysis and creation of the first-ever image of a black hole (M87). The library includes functionality spanning clustering, Fourier transforms, integration, interpolation, file I/O, linear algebra, image processing, orthogonal distance regression, minimization algorithms, signal processing, sparse matrix handling, computational geometry, and statistics. In this work, we provide an overview of the capabilities and development practices of the SciPy library and highlight some recent technical developments.

12,774 citations

Journal ArticleDOI
TL;DR: In this article, a combination of seven-year data from WMAP and improved astrophysical data rigorously tests the standard cosmological model and places new constraints on its basic parameters and extensions.
Abstract: The combination of seven-year data from WMAP and improved astrophysical data rigorously tests the standard cosmological model and places new constraints on its basic parameters and extensions. By combining the WMAP data with the latest distance measurements from the baryon acoustic oscillations (BAO) in the distribution of galaxies and the Hubble constant ($H_0$) measurement, we determine the parameters of the simplest six-parameter ΛCDM model. The power-law index of the primordial power spectrum is $n_s = 0.968 \pm 0.012$ (68% CL) for this data combination, a measurement that excludes the Harrison–Zel’dovich–Peebles spectrum by 99.5% CL. The other parameters, including those beyond the minimal set, are also consistent with, and improved from, the five-year results. We find no convincing deviations from the minimal model. The seven-year temperature power spectrum gives a better determination of the third acoustic peak, which results in a better determination of the redshift of the matter-radiation equality epoch. Notable examples of improved parameters are the total mass of neutrinos, $\sum m_\nu < 0.58$ eV (95% CL), and the effective number of neutrino species, $N_{\rm eff} = 4.34^{+0.86}_{-0.88}$ (68% CL), which benefit from better determinations of the third peak and $H_0$. The limit on a constant dark energy equation of state parameter from WMAP+BAO+$H_0$, without high-redshift Type Ia supernovae, is $w = -1.10 \pm 0.14$ (68% CL). We detect the effect of primordial helium on the temperature power spectrum and provide a new test of big bang nucleosynthesis by measuring $Y_p = 0.326 \pm 0.075$ (68% CL). We detect, and show on the map for the first time, the tangential and radial polarization patterns around hot and cold spots of temperature fluctuations, an important test of physical processes at $z = 1090$ and the dominance of adiabatic scalar fluctuations. The seven-year polarization data have significantly improved: we now detect the temperature–E-mode polarization cross power spectrum at 21σ, compared with 13σ from the five-year data. With the seven-year temperature–B-mode cross power spectrum, the limit on a rotation of the polarization plane due to potential parity-violating effects has improved by 38% to $\Delta\alpha = -1.1^\circ \pm 1.4^\circ$ (statistical) $\pm 1.5^\circ$ (systematic) (68% CL). We report significant detections of the Sunyaev–Zel’dovich (SZ) effect at the locations of known clusters of galaxies. The measured SZ signal agrees well with the expected signal from the X-ray data on a cluster-by-cluster basis. However, it is a factor of 0.5–0.7 times the predictions from “universal profile” of Arnaud et al., analytical models, and hydrodynamical simulations. We find, for the first time in the SZ effect, a significant difference between the cooling-flow and non-cooling-flow clusters (or relaxed and non-relaxed clusters), which can explain some of the discrepancy. This lower amplitude is consistent with the lower-than-theoretically expected SZ power spectrum recently measured by the South Pole Telescope Collaboration.

11,309 citations

Proceedings ArticleDOI
13 Aug 2016
TL;DR: In this article, the authors propose LIME, a method to explain models by presenting representative individual predictions and their explanations in a non-redundant way, framing the task as a submodular optimization problem.
Abstract: Despite widespread adoption, machine learning models remain mostly black boxes. Understanding the reasons behind predictions is, however, quite important in assessing trust, which is fundamental if one plans to take action based on a prediction, or when choosing whether to deploy a new model. Such understanding also provides insights into the model, which can be used to transform an untrustworthy model or prediction into a trustworthy one. In this work, we propose LIME, a novel explanation technique that explains the predictions of any classifier in an interpretable and faithful manner, by learning an interpretable model locally around the prediction. We also propose a method to explain models by presenting representative individual predictions and their explanations in a non-redundant way, framing the task as a submodular optimization problem. We demonstrate the flexibility of these methods by explaining different models for text (e.g. random forests) and image classification (e.g. neural networks). We show the utility of explanations via novel experiments, both simulated and with human subjects, on various scenarios that require trust: deciding if one should trust a prediction, choosing between models, improving an untrustworthy classifier, and identifying why a classifier should not be trusted.

11,104 citations
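
The idea in the LIME abstract above of "learning an interpretable model locally around the prediction" can be illustrated with a heavily simplified sketch. This is not the LIME package or the authors' algorithm: it assumes a single numeric input, a deterministic grid of perturbations instead of random sampling, an exponential proximity kernel, and a weighted straight-line fit as the interpretable model. The names `black_box` and `explain_locally` and all parameter values are hypothetical.

```python
import math


def black_box(x):
    # Hypothetical opaque model; stands in for any predictor we want to explain.
    return x * x


def explain_locally(predict, x0, width=0.5, span=1.0, n=21):
    """Fit a proximity-weighted line to `predict` around x0; return (slope, intercept)."""
    # Perturb the input on a symmetric grid around the point being explained.
    xs = [x0 - span + 2 * span * i / (n - 1) for i in range(n)]
    ys = [predict(x) for x in xs]
    # Weight each perturbed sample by its proximity to x0.
    ws = [math.exp(-((x - x0) ** 2) / width ** 2) for x in xs]

    # Weighted least squares for y ~ intercept + slope * x.
    total = sum(ws)
    mean_x = sum(w * x for w, x in zip(ws, xs)) / total
    mean_y = sum(w * y for w, y in zip(ws, ys)) / total
    cov = sum(w * (x - mean_x) * (y - mean_y) for w, x, y in zip(ws, xs, ys))
    var = sum(w * (x - mean_x) ** 2 for w, x in zip(ws, xs))
    slope = cov / var
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = explain_locally(black_box, x0=3.0)
print(f"local slope near x0 = 3: {slope:.3f}")
```

For this toy model the fitted slope approximates the local sensitivity of the prediction to the input near x0, which is the role the interpretable surrogate plays in the sketch; a smaller `width` makes the explanation more local.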