Scikit-learn: Machine Learning Without Learning the Machinery

doi:10.1145/2786984.2786995


Journal Article•DOI•

Gaël Varoquaux¹, Lars Buitinck², Gilles Louppe³, Olivier Grisel¹, Fabian Pedregosa¹, A. Mueller⁴

French Institute for Research in Computer Science and Automation¹, University of Amsterdam², University of Liège³, Amazon.com⁴

01 Jun 2015-Vol. 19, Iss: 1, pp 29-33

TL;DR: A quick introduction to scikit-learn and to machine-learning basics is given.

Abstract: Machine learning is a pervasive development at the intersection of statistics and computer science. While it can benefit many data-related applications, the technical nature of the research literature and the corresponding algorithms slows down its adoption. Scikit-learn is an open-source software project that aims at making machine learning accessible to all, whether it be in academia or in industry. It benefits from the general-purpose Python language, which is both broadly adopted in the scientific world, and supported by a thriving ecosystem of contributors. Here we give a quick introduction to scikit-learn as well as to machine-learning basics.

Citations

Journal Article•DOI•

Image-Based malware classification using ensemble of CNN architectures (IMCEC)

Danish Vasan¹,², Mamoun Alazab³, Sobia Wassan⁴,⁵, Babak Safaei⁶, Qin Zheng¹

Tsinghua University¹, Isra University², Charles Darwin University³, University of Sindh⁴, Nanjing University⁵, Eastern Mediterranean University⁶

01 May 2020-Computers & Security

TL;DR: A novel architecture based on an ensemble of convolutional neural networks (CNNs), named Image-based Malware Classification using Ensemble of CNNs (IMCEC), is proposed for effective detection of both packed and unpacked malware.

221 citations

Journal Article•DOI•

A novel stacked generalization ensemble-based hybrid LGBM-XGB-MLP model for Short-Term Load Forecasting

Mohamed Massaoudi¹,², Shady S. Refaat², Ines Chihi³, Mohamed Trabelsi, Fakhreddine S. Oueslati¹, Haitham Abu-Rub²

Carthage College¹, Texas A&M University at Qatar², Tunis El Manar University³

01 Jan 2021-Energy

TL;DR: A novel stacking ensemble-based algorithm is proposed that copes with the stochastic variations of the load demand using a stacked generalization approach and is validated using two datasets from different locations: Malaysia and New England.

145 citations

Posted Content•

sktime: A Unified Interface for Machine Learning with Time Series.

Markus Löning, Anthony J. Bagnall, Sajaysurya Ganesh, Viktor Kazakov, Jason Lines, Franz J. Király

17 Sep 2019-arXiv: Learning

TL;DR: The main rationale for creating a unified interface, including reduction, and the design of sktime's core API are discussed, supported by a clear overview of common time series tasks and reduction approaches.

Abstract: We present sktime -- a new scikit-learn compatible Python library with a unified interface for machine learning with time series. Time series data gives rise to various distinct but closely related learning tasks, such as forecasting and time series classification, many of which can be solved by reducing them to related simpler tasks. We discuss the main rationale for creating a unified interface, including reduction, as well as the design of sktime's core API, supported by a clear overview of common time series tasks and reduction approaches.

111 citations

Cites methods from "Scikit-learn: Machine Learning With..."

...We follow scikit-learn [57, 49] and Weka [32, 31] in adopting a uniform basic API for estimators, consisting of a fit method used for learning a model from training data and a predict method used for making predictions based on the fitted model, as well as a common interface for setting and retrieving hyper-parameters....
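The uniform estimator API described in this excerpt (a fit method, a predict method, and a common interface for hyper-parameters) can be illustrated with a minimal sketch. The class `MeanRegressor` and its `shrinkage` parameter are hypothetical and exist only for illustration; they are not part of scikit-learn or sktime, and the `get_params`/`set_params` methods here are simplified compared with the real libraries.

```python
class MeanRegressor:
    """Toy estimator (hypothetical): always predicts a shrunken training mean."""

    def __init__(self, shrinkage=0.0):
        # Hyper-parameters are plain constructor arguments stored as attributes.
        self.shrinkage = shrinkage

    def get_params(self):
        # Common interface for retrieving hyper-parameters.
        return {"shrinkage": self.shrinkage}

    def set_params(self, **params):
        # Common interface for setting hyper-parameters; returns self for chaining.
        for name, value in params.items():
            if name not in self.get_params():
                raise ValueError(f"Unknown parameter: {name}")
            setattr(self, name, value)
        return self

    def fit(self, X, y):
        # Learn the model from training data; learned state gets a trailing underscore.
        mean = sum(y) / len(y)
        self.mean_ = (1.0 - self.shrinkage) * mean
        return self

    def predict(self, X):
        # Make predictions from the fitted model, one value per input row.
        return [self.mean_ for _ in X]


X_train = [[0.0], [1.0], [2.0], [3.0]]
y_train = [1.0, 3.0, 5.0, 7.0]

model = MeanRegressor().fit(X_train, y_train)
predictions = model.predict([[4.0], [5.0]])
print(predictions)
```

Because every estimator exposes the same four methods, generic tools such as parameter searches or pipelines can be written once and applied to any model that follows the convention.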

Journal Article•DOI•

SMILES-based deep generative scaffold decorator for de-novo drug design

Josep Arús-Pous¹,², Atanas Patronov¹, Esben Jannik Bjerrum¹, Christian Tyrchan¹, Jean-Louis Reymond², Hongming Chen, Ola Engkvist¹

AstraZeneca¹, University of Bern²

29 May 2020-Journal of Cheminformatics

TL;DR: A new SMILES-based molecular generative architecture is presented that generates molecules from scaffolds and can be trained from any arbitrary molecular set; it also serves as a data augmentation technique and is readily coupled with randomized SMILES to obtain even better results with small sets.

Abstract: Molecular generative models trained with small sets of molecules represented as SMILES strings can generate large regions of the chemical space. Unfortunately, due to the sequential nature of SMILES strings, these models are not able to generate molecules given a scaffold (i.e., partially-built molecules with explicit attachment points). Herein we report a new SMILES-based molecular generative architecture that generates molecules from scaffolds and can be trained from any arbitrary molecular set. This approach is possible thanks to a new molecular set pre-processing algorithm that exhaustively slices all possible combinations of acyclic bonds of every molecule, combinatorically obtaining a large number of scaffolds with their respective decorations. Moreover, it serves as a data augmentation technique and can be readily coupled with randomized SMILES to obtain even better results with small sets. Two examples showcasing the potential of the architecture in medicinal and synthetic chemistry are described: First, models were trained with a training set obtained from a small set of Dopamine Receptor D2 (DRD2) active modulators and were able to meaningfully decorate a wide range of scaffolds and obtain molecular series predicted active on DRD2. Second, a larger set of drug-like molecules from ChEMBL was selectively sliced using synthetic chemistry constraints (RECAP rules). In this case, the resulting scaffolds with decorations were filtered only to allow those that included fragment-like decorations. This filtering process allowed models trained with this dataset to selectively decorate diverse scaffolds with fragments that were generally predicted to be synthesizable and attachable to the scaffold using known synthetic approaches. In both cases, the models were already able to decorate molecules using specific knowledge without the need to add it with other techniques, such as reinforcement learning. 
We envision that this architecture will become a useful addition to the already existent architectures for de novo molecular generation.

104 citations

Journal Article•DOI•

Rapid triage for COVID-19 using routine clinical data for patients attending hospital: development and prospective validation of an artificial intelligence screening test.

Andrew Soltan¹,², Samaneh Kouchaki²,³, Tingting Zhu², Dani Kiyasseh², Thomas Taylor², Zaamin B. Hussain⁴, Tim E. A. Peto¹,²,⁵, Andrew Brent¹,², David W Eyre¹,²,⁵, David A. Clifton²

John Radcliffe Hospital¹, University of Oxford², University of Surrey³, Harvard University⁴, Public Health England⁵

01 Feb 2021

TL;DR: Two early-detection models for COVID-19 were developed and validated, screening for the disease among patients attending the emergency department and the subset being admitted to hospital, using routinely collected health-care data (laboratory tests, blood gas measurements, and vital signs).

Abstract: Summary Background The early clinical course of COVID-19 can be difficult to distinguish from other illnesses driving presentation to hospital. However, viral-specific PCR testing has limited sensitivity and results can take up to 72 h for operational reasons. We aimed to develop and validate two early-detection models for COVID-19, screening for the disease among patients attending the emergency department and the subset being admitted to hospital, using routinely collected health-care data (laboratory tests, blood gas measurements, and vital signs). These data are typically available within the first hour of presentation to hospitals in high-income and middle-income countries, within the existing laboratory infrastructure. Methods We trained linear and non-linear machine learning classifiers to distinguish patients with COVID-19 from pre-pandemic controls, using electronic health record data for patients presenting to the emergency department and admitted across a group of four teaching hospitals in Oxfordshire, UK (Oxford University Hospitals). Data extracted included presentation blood tests, blood gas testing, vital signs, and results of PCR testing for respiratory viruses. Adult patients (>18 years) presenting to hospital before Dec 1, 2019 (before the first COVID-19 outbreak), were included in the COVID-19-negative cohort; those presenting to hospital between Dec 1, 2019, and April 19, 2020, with PCR-confirmed severe acute respiratory syndrome coronavirus 2 infection were included in the COVID-19-positive cohort. Patients who were subsequently admitted to hospital were included in their respective COVID-19-negative or COVID-19-positive admissions cohorts. Models were calibrated to sensitivities of 70%, 80%, and 90% during training, and performance was initially assessed on a held-out test set generated by an 80:20 split stratified by patients with COVID-19 and balanced equally with pre-pandemic controls. 
To simulate real-world performance at different stages of an epidemic, we generated test sets with varying prevalences of COVID-19 and assessed predictive values for our models. We prospectively validated our 80% sensitivity models for all patients presenting or admitted to the Oxford University Hospitals between April 20 and May 6, 2020, comparing model predictions with PCR test results. Findings We assessed 155 689 adult patients presenting to hospital between Dec 1, 2017, and April 19, 2020. 114 957 patients were included in the COVID-negative cohort and 437 in the COVID-positive cohort, for a full study population of 115 394 patients, with 72 310 admitted to hospital. With a sensitive configuration of 80%, our emergency department (ED) model achieved 77·4% sensitivity and 95·7% specificity (area under the receiver operating characteristic curve [AUROC] 0·939) for COVID-19 among all patients attending hospital, and the admissions model achieved 77·4% sensitivity and 94·8% specificity (AUROC 0·940) for the subset of patients admitted to hospital. Both models achieved high negative predictive values (NPV; >98·5%) across a range of prevalences (≤5%). We prospectively validated our models for all patients presenting and admitted to Oxford University Hospitals in a 2-week test period. The ED model (3326 patients) achieved 92·3% accuracy (NPV 97·6%, AUROC 0·881), and the admissions model (1715 patients) achieved 92·5% accuracy (97·7%, 0·871) in comparison with PCR results. Sensitivity analyses to account for uncertainty in negative PCR results improved apparent accuracy (ED model 95·1%, admissions model 94·1%) and NPV (ED model 99·0%, admissions model 98·5%). Interpretation Our models performed effectively as a screening test for COVID-19, excluding the illness with high-confidence by use of clinical data routinely available within 1 h of presentation to hospital. 
Our approach is rapidly scalable, fitting within the existing laboratory testing infrastructure and standard of care of hospitals in high-income and middle-income countries. Funding Wellcome Trust, University of Oxford, Engineering and Physical Sciences Research Council, National Institute for Health Research Oxford Biomedical Research Centre.
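The dependence of negative predictive value (NPV) on prevalence mentioned in this abstract can be illustrated with Bayes' rule. This is a sketch under stated assumptions, not the authors' procedure: the abstract says they generated test sets with varying prevalences, whereas the hypothetical function `negative_predictive_value` below computes NPV analytically from a fixed sensitivity and specificity. The inputs 0.774 and 0.957 are the emergency department model figures quoted in the abstract.

```python
def negative_predictive_value(sensitivity, specificity, prevalence):
    """NPV = P(no disease | negative test), computed via Bayes' rule."""
    # Fraction of the population that is disease-free and correctly tests negative.
    true_negatives = specificity * (1.0 - prevalence)
    # Fraction of the population that has the disease but tests negative.
    false_negatives = (1.0 - sensitivity) * prevalence
    return true_negatives / (true_negatives + false_negatives)


# ED model figures from the abstract: 77.4% sensitivity, 95.7% specificity.
for prevalence in (0.01, 0.02, 0.05):
    npv = negative_predictive_value(0.774, 0.957, prevalence)
    print(f"prevalence={prevalence:.0%}  NPV={npv:.4f}")
```

At 5% prevalence this formula gives an NPV of roughly 98.8%, and NPV rises as prevalence falls, which is consistent with the reported NPV above 98·5% for prevalences of at most 5%.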

81 citations

References

Journal Article•

R: A language and environment for statistical computing.

R Core Team

01 Jan 2014-MSOR connections

TL;DR: Copyright (©) 1999–2012 R Foundation for Statistical Computing; permission is granted to make and distribute verbatim copies of this manual provided the copyright notice and permission notice are preserved on all copies.

Abstract: Copyright (©) 1999–2012 R Foundation for Statistical Computing. Permission is granted to make and distribute verbatim copies of this manual provided the copyright notice and this permission notice are preserved on all copies. Permission is granted to copy and distribute modified versions of this manual under the conditions for verbatim copying, provided that the entire resulting derived work is distributed under the terms of a permission notice identical to this one. Permission is granted to copy and distribute translations of this manual into another language, under the above conditions for modified versions, except that this permission notice may be stated in a translation approved by the R Core Team.

272,030 citations

"Scikit-learn: Machine Learning With..." refers background in this paper

...While it can benefit many data-related applications, the technical nature of the research literature and the corresponding algorithms slows down its adoption....

Journal Article•DOI•

LIBSVM: A library for support vector machines

Chih-Chung Chang¹, Chih-Jen Lin¹

National Taiwan University¹

06 May 2011-ACM Transactions on Intelligent Systems and Technology

TL;DR: Issues such as solving SVM optimization problems, theoretical convergence, multiclass classification, probability estimates, and parameter selection are discussed in detail.

Abstract: LIBSVM is a library for Support Vector Machines (SVMs). We have been actively developing this package since the year 2000. The goal is to help users to easily apply SVM to their applications. LIBSVM has gained wide popularity in machine learning and many other areas. In this article, we present all implementation details of LIBSVM. Issues such as solving SVM optimization problems, theoretical convergence, multiclass classification, probability estimates, and parameter selection are discussed in detail.

40,826 citations

Posted Content•

Scikit-learn: Machine Learning in Python

Fabian Pedregosa¹, Gaël Varoquaux¹, Alexandre Gramfort¹, Vincent Michel¹, Bertrand Thirion¹, Olivier Grisel, Mathieu Blondel, Andreas Müller², Joel Nothman, Gilles Louppe², Peter Prettenhofer, Ron Weiss, Vincent Dubourg, Jake Vanderplas, Alexandre Passos, David Cournapeau, Matthieu Brucher, Matthieu Perrot, Edouard Duchesnay

French Institute for Research in Computer Science and Automation¹, University of Liège²

02 Jan 2012-arXiv: Learning

TL;DR: Scikit-learn is a Python module integrating a wide range of state-of-the-art machine learning algorithms for medium-scale supervised and unsupervised problems.

Abstract: Scikit-learn is a Python module integrating a wide range of state-of-the-art machine learning algorithms for medium-scale supervised and unsupervised problems. This package focuses on bringing machine learning to non-specialists using a general-purpose high-level language. Emphasis is put on ease of use, performance, documentation, and API consistency. It has minimal dependencies and is distributed under the simplified BSD license, encouraging its use in both academic and commercial settings. Source code, binaries, and documentation can be downloaded from this http URL.

28,898 citations

Journal Article•DOI•

Matplotlib: A 2D Graphics Environment

J.D. Hunter

01 May 2007-Computing in Science and Engineering

TL;DR: Matplotlib is a 2D graphics package used for Python for application development, interactive scripting, and publication-quality image generation across user interfaces and operating systems.

Abstract: Matplotlib is a 2D graphics package used for Python for application development, interactive scripting, and publication-quality image generation across user interfaces and operating systems.

23,312 citations

Journal Article•DOI•

The WEKA data mining software: an update

Mark Hall, Eibe Frank¹, Geoffrey Holmes¹, Bernhard Pfahringer¹, Peter Reutemann¹, Ian H. Witten¹

University of Waikato¹

16 Nov 2009-Sigkdd Explorations

TL;DR: This paper provides an introduction to the WEKA workbench, reviews the history of the project, and, in light of the recent 3.6 stable release, briefly discusses what has been added since the last stable version (Weka 3.4) released in 2003.

Abstract: More than twelve years have elapsed since the first public release of WEKA. In that time, the software has been rewritten entirely from scratch, evolved substantially and now accompanies a text on data mining [35]. These days, WEKA enjoys widespread acceptance in both academia and business, has an active community, and has been downloaded more than 1.4 million times since being placed on SourceForge in April 2000. This paper provides an introduction to the WEKA workbench, reviews the history of the project, and, in light of the recent 3.6 stable release, briefly discusses what has been added since the last stable version (Weka 3.4) released in 2003.

19,603 citations

"Scikit-learn: Machine Learning With..." refers background in this paper

...While it can benefit many data-related applications, the technical nature of the research literature and the corresponding algorithms slows down its adoption....