Journal ArticleDOI

Building Predictive Models in R Using the caret Package

10 Nov 2008-Journal of Statistical Software (Foundation for Open Access Statistics)-Vol. 28, Iss: 5, pp 1-26
TL;DR: The caret package, short for classification and regression training, contains numerous tools for developing predictive models using the rich set of models available in R, and focuses on simplifying model training and tuning across a wide variety of modeling techniques.
Abstract: The caret package, short for classification and regression training, contains numerous tools for developing predictive models using the rich set of models available in R. The package focuses on simplifying model training and tuning across a wide variety of modeling techniques. It also includes methods for pre-processing training data, calculating variable importance, and model visualizations. An example from computational chemistry is used to illustrate the functionality on a real data set and to benchmark the benefits of parallel processing with several types of models.
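As a concrete illustration of the workflow the abstract describes, here is a minimal sketch using caret's train function (assumptions: the caret package is installed, and the built-in iris data stands in for a real data set):

    library(caret)

    # Resampling used to choose tuning parameter values: 10-fold cross-validation
    ctrl <- trainControl(method = "cv", number = 10)

    set.seed(1)
    # train() pre-processes the predictors, fits the model over a grid of
    # candidate tuning values, and picks the best by resampled accuracy
    fit <- train(Species ~ ., data = iris,
                 method = "knn",
                 preProcess = c("center", "scale"),
                 tuneLength = 5,
                 trControl = ctrl)

    fit           # resampling profile across the tuning grid
    varImp(fit)   # variable importance scores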

Citations
01 Jan 2016
TL;DR: Modern Applied Statistics with S provides both an introduction to the S language and a course in modern statistical methods, emphasizing practical problems and full analyses of real data sets.
Abstract: A guide to using S environments to perform statistical analyses, providing both an introduction to the use of S and a course in modern statistical methods. The emphasis is on presenting practical problems and full analyses of real data sets.

5,249 citations

Book
17 May 2013
TL;DR: This book presents general strategies for building predictive models, with detailed coverage of regression and classification models and other practical considerations.
Abstract: Contents: General Strategies; Regression Models; Classification Models; Other Considerations; Appendix; References; Indices.

3,672 citations


Cites methods from "Building Predictive Models in R Using the caret Package"

  • ...More detail on the caret package can be found in Kuhn (2008) or the four extended manuals (called “vignettes”) on the package web site (Kuhn 2010)....

Journal ArticleDOI
TL;DR: The Boruta package provides a convenient interface to the Boruta algorithm, a novel feature selection method for finding all relevant variables.
Abstract: This article describes the R package Boruta, which implements a novel feature selection algorithm for finding all relevant variables. The algorithm is designed as a wrapper around a random forest classification algorithm: it iteratively removes the features that a statistical test shows to be less relevant than random probes. The Boruta package provides a convenient interface to the algorithm. A short description of the algorithm and examples of its application are presented.
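A minimal sketch of the all-relevant selection loop described above, assuming the Boruta package's standard interface and the built-in iris data:

    library(Boruta)

    set.seed(1)
    # Boruta() compares each predictor's random-forest importance against
    # permuted "shadow" copies of the predictors (the random probes)
    bor <- Boruta(Species ~ ., data = iris)

    print(bor)                   # Confirmed / Tentative / Rejected decisions
    getSelectedAttributes(bor)   # names of the confirmed attributes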

2,832 citations


Cites methods from "Building Predictive Models in R Using the caret Package"

  • …of highly relevant and uncorrelated attributes within the result returned by Boruta may use for example package party (Strobl et al. 2009), caret (Kuhn 2008; Kuhn, Wing, Weston, Williams, Keefer, and Engelhardt 2010), varSelRF (Diaz-Uriarte 2007, 2010) or FSelector (Romanski 2009) for further refinement.

Journal ArticleDOI
TL;DR: The random forest is clearly the best family of classifiers (3 out of the 5 best classifiers are RF), followed by SVM (4 classifiers in the top 10), neural networks and boosting ensembles (5 and 3 members in the top 20, respectively).
Abstract: We evaluate 179 classifiers arising from 17 families (discriminant analysis, Bayesian, neural networks, support vector machines, decision trees, rule-based classifiers, boosting, bagging, stacking, random forests and other ensembles, generalized linear models, nearest-neighbors, partial least squares and principal component regression, logistic and multinomial regression, multiple adaptive regression splines and other methods), implemented in Weka, R (with and without the caret package), C and Matlab, including all the relevant classifiers available today. We use 121 data sets, comprising the whole UCI database (excluding the large-scale problems) plus other real problems of our own, in order to reach significant conclusions about classifier behavior that do not depend on the data set collection. The classifiers most likely to be the best are the random forest (RF) versions, the best of which (implemented in R and accessed via caret) achieves 94.1% of the maximum accuracy, exceeding 90% on 84.3% of the data sets. However, the difference is not statistically significant with the second best, the SVM with Gaussian kernel implemented in C using LibSVM, which achieves 92.3% of the maximum accuracy. A few models are clearly better than the remaining ones: random forest, SVM with Gaussian and polynomial kernels, extreme learning machine with Gaussian kernel, C5.0 and avNNet (a committee of multi-layer perceptrons implemented in R with the caret package). The random forest is clearly the best family of classifiers (3 out of the 5 best classifiers are RF), followed by SVM (4 classifiers in the top 10), neural networks and boosting ensembles (5 and 3 members in the top 20, respectively).
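The caret interface mentioned in the excerpt below makes this kind of head-to-head comparison short to express; a hedged sketch (assuming the caret, randomForest and kernlab packages and the built-in iris data):

    library(caret)

    ctrl <- trainControl(method = "cv", number = 10)

    set.seed(1)
    rf_fit  <- train(Species ~ ., data = iris, method = "rf", trControl = ctrl)
    set.seed(1)
    svm_fit <- train(Species ~ ., data = iris, method = "svmRadial", trControl = ctrl)

    # Collect the cross-validated accuracies on identical resamples
    summary(resamples(list(RF = rf_fit, SVM = svm_fit)))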

2,616 citations


Cites methods from "Building Predictive Models in R Using the caret Package"

  • ...Besides, the R package caret (Kuhn, 2008) provides a very easy interface for the execution of many classifiers, allowing automatic parameter tuning and reducing the requirements on the researcher’s knowledge (about the tunable parameter values, among other issues)....

Journal ArticleDOI
16 Feb 2017-PLOS ONE
TL;DR: Compared with the previous version of SoilGrids at 1 km spatial resolution, improvements in relative accuracy, measured by the amount of variation explained, range from 60 to 230%.
Abstract: This paper describes the technical development and accuracy assessment of the most recent and improved version of the SoilGrids system at 250 m resolution (June 2016 update). SoilGrids provides global predictions for standard numeric soil properties (organic carbon, bulk density, cation exchange capacity (CEC), pH, soil texture fractions and coarse fragments) at seven standard depths (0, 5, 15, 30, 60, 100 and 200 cm), in addition to predictions of depth to bedrock and distribution of soil classes based on the World Reference Base (WRB) and USDA classification systems (ca. 280 raster layers in total). Predictions were based on ca. 150,000 soil profiles used for training and a stack of 158 remote sensing-based soil covariates (primarily derived from MODIS land products, SRTM DEM derivatives, climatic images and global landform and lithology maps), which were used to fit an ensemble of machine learning methods (random forest, gradient boosting and/or multinomial logistic regression) as implemented in the R packages ranger, xgboost, nnet and caret. The results of 10-fold cross-validation show that the ensemble models explain between 56% (coarse fragments) and 83% (pH) of variation, with an overall average of 61%. Improvements in relative accuracy, considering the amount of variation explained, range from 60 to 230% in comparison to the previous version of SoilGrids at 1 km spatial resolution. Improvements can be attributed to: (1) the use of machine learning instead of linear regression, (2) considerable investments in preparing finer-resolution covariate layers and (3) the insertion of additional soil profiles. Further development of SoilGrids could include refinement of methods to incorporate input uncertainties and derivation of posterior probability distributions (per pixel), and further automation of spatial modeling so that soil maps can be generated for potentially hundreds of soil variables. Another area of future research is the development of methods for multiscale merging of SoilGrids predictions with local and/or national gridded soil products (e.g. up to 50 m spatial resolution) so that increasingly more accurate, complete and consistent global soil information can be produced. SoilGrids is available under the Open Database License.
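A minimal sketch of the kind of cross-validated model fitting the abstract describes (assuming the caret and ranger packages; the SoilGrids profile data is not reproduced here, so the built-in iris data stands in):

    library(caret)

    # 10-fold cross-validation, as in the SoilGrids accuracy assessment
    ctrl <- trainControl(method = "cv", number = 10)

    set.seed(1)
    # method = "ranger" fits a random forest via the ranger package;
    # caret tunes ranger's mtry, splitrule and min.node.size automatically
    fit <- train(Sepal.Length ~ ., data = iris,
                 method = "ranger", trControl = ctrl)

    fit$results   # cross-validated RMSE and R-squared per tuning setting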

2,228 citations

References
BookDOI
01 Dec 2010
TL;DR: A guide to using S environments to perform statistical analyses, providing both an introduction to the use of S and a course in modern statistical methods.
Abstract: A guide to using S environments to perform statistical analyses, providing both an introduction to the use of S and a course in modern statistical methods. The emphasis is on presenting practical problems and full analyses of real data sets.

18,346 citations

01 Jan 2007
TL;DR: Random forests add an additional layer of randomness to bagging and are robust against overfitting; the randomForest package provides an R interface to the Fortran programs by Breiman and Cutler.
Abstract: Recently there has been a lot of interest in "ensemble learning": methods that generate many classifiers and aggregate their results. Two well-known methods are boosting (see, e.g., Schapire et al., 1998) and bagging (Breiman, 1996) of classification trees. In boosting, successive trees give extra weight to points incorrectly predicted by earlier predictors; in the end, a weighted vote is taken for prediction. In bagging, successive trees do not depend on earlier trees; each is independently constructed using a bootstrap sample of the data set, and a simple majority vote is taken for prediction. Breiman (2001) proposed random forests, which add an additional layer of randomness to bagging. In addition to constructing each tree using a different bootstrap sample of the data, random forests change how the classification or regression trees are constructed. In standard trees, each node is split using the best split among all variables. In a random forest, each node is split using the best among a subset of predictors randomly chosen at that node. This somewhat counterintuitive strategy turns out to perform very well compared to many other classifiers, including discriminant analysis, support vector machines and neural networks, and is robust against overfitting (Breiman, 2001). In addition, it is very user-friendly in the sense that it has only two parameters (the number of variables in the random subset at each node and the number of trees in the forest), and is usually not very sensitive to their values. The randomForest package provides an R interface to the Fortran programs by Breiman and Cutler (available at http://www.stat.berkeley.edu/users/breiman/). This article provides a brief introduction to the usage and features of the R functions.
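A minimal sketch of the two-parameter interface described above (assuming the randomForest package and the built-in iris data):

    library(randomForest)

    set.seed(1)
    # mtry: size of the random predictor subset at each split;
    # ntree: number of trees in the forest
    rf <- randomForest(Species ~ ., data = iris, mtry = 2, ntree = 500,
                       importance = TRUE)

    rf               # includes the out-of-bag (OOB) error estimate
    importance(rf)   # permutation-based variable importance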

14,830 citations


"Building Predictive Models in R Usi..." refers methods in this paper

  • ...Random forest from Liaw and Wiener (2002): “For each tree, the prediction accuracy on the out-of-bag portion of the data is recorded....

  • ...…and Ripley 1999), nws 1.7.1.0 (Scientific Computing Associates, Inc. 2007), pamr 1.31 (Hastie et al. 2003), party 0.9-96 (Hothorn et al. 2006), pls 2.1-0 (Mevik and Wehrens 2007), randomForest 4.5-25 (Liaw and Wiener 2002), rpart 3.1-39 (Therneau and Atkinson 1997) and SDDA 1.0-3 (Stone 2008)....

01 Jan 2016
TL;DR: Modern Applied Statistics with S provides both an introduction to the S language and a course in modern statistical methods, emphasizing practical problems and full analyses of real data sets.
Abstract: A guide to using S environments to perform statistical analyses, providing both an introduction to the use of S and a course in modern statistical methods. The emphasis is on presenting practical problems and full analyses of real data sets.

5,249 citations


"Building Predictive Models in R Usi..." refers background in this paper

  • ...The knn3 function is a clone of knn from the MASS package (Venables and Ripley 2002) whose predict function can return the vote proportions for each of the classes (instead of just the winning class)....

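A minimal sketch of the vote-proportion behavior quoted above (assuming caret's knn3 function and the built-in iris data):

    library(caret)

    fit <- knn3(Species ~ ., data = iris, k = 5)

    # predict() can return per-class vote proportions, not just the winner
    head(predict(fit, newdata = iris, type = "prob"))
    head(predict(fit, newdata = iris, type = "class"))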

Proceedings ArticleDOI
Gene Myron Amdahl
18 Apr 1967
TL;DR: In this paper, the author argues that the organization of a single computer has reached its limits and that truly significant advances can be made only by interconnecting a multiplicity of computers in such a manner as to permit cooperative solution.
Abstract: For over a decade prophets have voiced the contention that the organization of a single computer has reached its limits and that truly significant advances can be made only by interconnection of a multiplicity of computers in such a manner as to permit cooperative solution. Variously the proper direction has been pointed out as general purpose computers with a generalized interconnection of memories, or as specialized computers with geometrically related memory interconnections and controlled by one or more instruction streams.
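This reference underlies the caret paper's parallel-processing benchmarks. Amdahl's argument is usually summarized as a speedup bound; a small illustrative R sketch (the formula is standard, the chosen values are only examples):

    # Amdahl's law: with a fraction p of the work parallelizable,
    # the speedup on n workers is S(n) = 1 / ((1 - p) + p / n)
    amdahl <- function(n, p) 1 / ((1 - p) + p / n)

    n <- 2^(0:6)                   # 1, 2, 4, ..., 64 workers
    round(amdahl(n, p = 0.9), 2)   # even at p = 0.9, speedup caps near 10x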

3,653 citations