scispace - formally typeset
Author

Gaël Varoquaux

Bio: Gaël Varoquaux is a academic researcher at Université Paris-Saclay who has co-authored 264 publication(s) receiving 87110 citation(s). The author has an hindex of 36. Previous affiliations of Gaël Varoquaux include Centre national de la recherche scientifique & University of Florence. The author has done significant research in the topic(s): Supervised learning & Cognition.

...read more

Topics: Supervised learning, Cognition, Population ...read more
Papers
  More

Open accessJournal Article
Abstract: Scikit-learn is a Python module integrating a wide range of state-of-the-art machine learning algorithms for medium-scale supervised and unsupervised problems. This package focuses on bringing machine learning to non-specialists using a general-purpose high-level language. Emphasis is put on ease of use, performance, documentation, and API consistency. It has minimal dependencies and is distributed under the simplified BSD license, encouraging its use in both academic and commercial settings. Source code, binaries, and documentation can be downloaded from http://scikit-learn.sourceforge.net.

...read more

33,540 Citations


Open accessPosted Content
02 Jan 2012-arXiv: Learning
Abstract: Scikit-learn is a Python module integrating a wide range of state-of-the-art machine learning algorithms for medium-scale supervised and unsupervised problems. This package focuses on bringing machine learning to non-specialists using a general-purpose high-level language. Emphasis is put on ease of use, performance, documentation, and API consistency. It has minimal dependencies and is distributed under the simplified BSD license, encouraging its use in both academic and commercial settings. Source code, binaries, and documentation can be downloaded from this http URL.

...read more

28,898 Citations


Open accessJournal ArticleDOI: 10.1109/MCSE.2011.37
Abstract: In the Python world, NumPy arrays are the standard representation for numerical data and enable efficient implementation of numerical computations in a high-level language. As this effort shows, NumPy performance can be improved through three techniques: vectorizing calculations, avoiding copying data in memory, and minimizing operation counts.

...read more

  • Figure 1: Computing dense grid values without and with broadcasting. Note how, with broadcasting, much less memory is used.
    Figure 1: Computing dense grid values without and with broadcasting. Note how, with broadcasting, much less memory is used.
Topics: NumPy (71%), Python (programming language) (53%)

7,607 Citations


Open accessJournal ArticleDOI: 10.1109/MCSE.2011.37
Abstract: In the Python world, NumPy arrays are the standard representation for numerical data. Here, we show how these arrays enable efficient implementation of numerical computations in a high-level language. Overall, three techniques are applied to improve performance: vectorizing calculations, avoiding copying data in memory, and minimizing operation counts. We first present the NumPy array structure, then show how to use it for efficient computation, and finally how to share array data with other libraries.

...read more

  • Figure 1: Computing dense grid values without and with broadcasting. Note how, with broadcasting, much less memory is used.
    Figure 1: Computing dense grid values without and with broadcasting. Note how, with broadcasting, much less memory is used.

4,799 Citations


Open accessJournal ArticleDOI: 10.3389/FNINF.2014.00014
Abstract: Statistical machine learning methods are increasingly used for neuroimaging data analysis. Their main virtue is their ability to model high-dimensional datasets, e.g. multivariate analysis of activation images or resting-state time series. Supervised learning is typically used in decoding or encoding settings to relate brain images to behavioral or clinical observations, while unsupervised learning can uncover hidden structures in sets of images (e.g. resting state functional MRI) or find sub-populations in large cohorts. By considering different functional neuroimaging applications, we illustrate how scikit-learn, a Python machine learning library, can be used to perform some key analysis steps. Scikit-learn contains a very large set of statistical learning algorithms, both supervised and unsupervised, and its application to neuroimaging data provides a versatile tool to study the brain.

...read more

897 Citations


Cited by
  More

Open accessJournal Article
Abstract: Scikit-learn is a Python module integrating a wide range of state-of-the-art machine learning algorithms for medium-scale supervised and unsupervised problems. This package focuses on bringing machine learning to non-specialists using a general-purpose high-level language. Emphasis is put on ease of use, performance, documentation, and API consistency. It has minimal dependencies and is distributed under the simplified BSD license, encouraging its use in both academic and commercial settings. Source code, binaries, and documentation can be downloaded from http://scikit-learn.sourceforge.net.

...read more

33,540 Citations


Open accessPosted Content
02 Jan 2012-arXiv: Learning
Abstract: Scikit-learn is a Python module integrating a wide range of state-of-the-art machine learning algorithms for medium-scale supervised and unsupervised problems. This package focuses on bringing machine learning to non-specialists using a general-purpose high-level language. Emphasis is put on ease of use, performance, documentation, and API consistency. It has minimal dependencies and is distributed under the simplified BSD license, encouraging its use in both academic and commercial settings. Source code, binaries, and documentation can be downloaded from this http URL.

...read more

28,898 Citations


Journal ArticleDOI: 10.1145/242224.242229
Thomas G. Dietterich1Institutions (1)
Abstract: Machine Learning is the study of methods for programming computers to learn. Computers are applied to a wide range of tasks, and for most of these it is relatively easy for programmers to design and implement the necessary software. However, there are many tasks for which this is difficult or impossible. These can be divided into four general categories. First, there are problems for which there exist no human experts. For example, in modern automated manufacturing facilities, there is a need to predict machine failures before they occur by analyzing sensor readings. Because the machines are new, there are no human experts who can be interviewed by a programmer to provide the knowledge necessary to build a computer system. A machine learning system can study recorded data and subsequent machine failures and learn prediction rules. Second, there are problems where human experts exist, but where they are unable to explain their expertise. This is the case in many perceptual tasks, such as speech recognition, hand-writing recognition, and natural language understanding. Virtually all humans exhibit expert-level abilities on these tasks, but none of them can describe the detailed steps that they follow as they perform them. Fortunately, humans can provide machines with examples of the inputs and correct outputs for these tasks, so machine learning algorithms can learn to map the inputs to the outputs. Third, there are problems where phenomena are changing rapidly. In finance, for example, people would like to predict the future behavior of the stock market, of consumer purchases, or of exchange rates. These behaviors change frequently, so that even if a programmer could construct a good predictive computer program, it would need to be rewritten frequently. A learning program can relieve the programmer of this burden by constantly modifying and tuning a set of learned prediction rules. Fourth, there are applications that need to be customized for each computer user separately. Consider, for example, a program to filter unwanted electronic mail messages. Different users will need different filters. It is unreasonable to expect each user to program his or her own rules, and it is infeasible to provide every user with a software engineer to keep the rules up-to-date. A machine learning system can learn which mail messages the user rejects and maintain the filtering rules automatically. Machine learning addresses many of the same research questions as the fields of statistics, data mining, and psychology, but with differences of emphasis. Statistics focuses on understanding the phenomena that have generated the data, often with the goal of testing different hypotheses about those phenomena. Data mining seeks to find patterns in the data that are understandable by people. Psychological studies of human learning aspire to understand the mechanisms underlying the various learning behaviors exhibited by people (concept learning, skill acquisition, strategy change, etc.).

...read more

Topics: Active learning (machine learning) (81%), Robot learning (81%), Instance-based learning (77%) ...read more

12,323 Citations


Open accessJournal ArticleDOI: 10.1093/BIOINFORMATICS/BTU638
15 Jan 2015-Bioinformatics
Abstract: Motivation: A large choice of tools exists for many standard tasks in the analysis of high-throughput sequencing (HTS) data. However, once a project deviates from standard workflows, custom scripts are needed. Results: We present HTSeq, a Python library to facilitate the rapid development of such scripts. HTSeq offers parsers for many common data formats in HTS projects, as well as classes to represent data, such as genomic coordinates, sequences, sequencing reads, alignments, gene model information and variant calls, and provides data structures that allow for querying via genomic coordinates. We also present htseq-count, a tool developed with HTSeq that preprocesses RNA-Seq data for differential expression analysis by counting the overlap of reads with genes. Availability and implementation: HTSeq is released as an opensource software under the GNU General Public Licence and available from http://www-huber.embl.de/HTSeq or from the Python Package Index at https://pypi.python.org/pypi/HTSeq. Contact: sanders@fs.tum.de

...read more

11,833 Citations


Open accessProceedings ArticleDOI: 10.1145/2939672.2939785
Tianqi Chen1, Carlos Guestrin1Institutions (1)
13 Aug 2016-
Abstract: Tree boosting is a highly effective and widely used machine learning method. In this paper, we describe a scalable end-to-end tree boosting system called XGBoost, which is used widely by data scientists to achieve state-of-the-art results on many machine learning challenges. We propose a novel sparsity-aware algorithm for sparse data and weighted quantile sketch for approximate tree learning. More importantly, we provide insights on cache access patterns, data compression and sharding to build a scalable tree boosting system. By combining these insights, XGBoost scales beyond billions of examples using far fewer resources than existing systems.

...read more

Topics: Incremental decision tree (64%), Gradient boosting (61%), ID3 algorithm (60%) ...read more

10,428 Citations


Performance
Metrics

Author's H-index: 36

No. of papers from the Author in previous years
YearPapers
202120
202029
201915
201822
201725
201629

Top Attributes

Show by:

Author's top 5 most impactful journals

NeuroImage

19 papers, 2.3K citations

arXiv: Machine Learning

19 papers, 390 citations

arXiv: Learning

15 papers, 29.7K citations

bioRxiv

13 papers, 91 citations

GigaScience

5 papers, 223 citations

Network Information
Related Authors (5)
Bertrand Thirion

311 papers, 73.8K citations

92% related
Alexandre Abraham

20 papers, 2K citations

89% related
Andrés Hoyos-Idrobo

10 papers, 535 citations

88% related
Kamalaker Dadi

14 papers, 547 citations

88% related
Vincent Michel

30 papers, 64.3K citations

88% related