Showing papers on "Random forest published in 2007"


01 Jan 2007
TL;DR: Random forests are proposed, which add an additional layer of randomness to bagging and are robust against overfitting; the randomForest package provides an R interface to the Fortran programs by Breiman and Cutler.
Abstract: Recently there has been a lot of interest in “ensemble learning” — methods that generate many classifiers and aggregate their results. Two well-known methods are boosting (see, e.g., Schapire et al., 1998) and bagging (Breiman, 1996) of classification trees. In boosting, successive trees give extra weight to points incorrectly predicted by earlier predictors. In the end, a weighted vote is taken for prediction. In bagging, successive trees do not depend on earlier trees — each is independently constructed using a bootstrap sample of the data set. In the end, a simple majority vote is taken for prediction. Breiman (2001) proposed random forests, which add an additional layer of randomness to bagging. In addition to constructing each tree using a different bootstrap sample of the data, random forests change how the classification or regression trees are constructed. In standard trees, each node is split using the best split among all variables. In a random forest, each node is split using the best among a subset of predictors randomly chosen at that node. This somewhat counterintuitive strategy turns out to perform very well compared to many other classifiers, including discriminant analysis, support vector machines and neural networks, and is robust against overfitting (Breiman, 2001). In addition, it is very user-friendly in the sense that it has only two parameters (the number of variables in the random subset at each node and the number of trees in the forest), and is usually not very sensitive to their values. The randomForest package provides an R interface to the Fortran programs by Breiman and Cutler (available at http://www.stat.berkeley.edu/users/breiman/). This article provides a brief introduction to the usage and features of the R functions.
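
As a concrete illustration of the two tuning parameters mentioned above (mtry, the size of the random variable subset at each node, and ntree, the number of trees), here is a minimal sketch using the randomForest package on R's built-in iris data; the parameter values are illustrative, not recommendations from the article.

```r
library(randomForest)

set.seed(1)
# ntree and mtry are the forest's two main tuning parameters
rf <- randomForest(Species ~ ., data = iris,
                   ntree = 500,   # number of trees in the forest
                   mtry  = 2)     # variables tried at each split
print(rf)                        # includes the out-of-bag (OOB) error estimate
predict(rf, head(iris))          # class predictions for new data
```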

14,830 citations


Journal ArticleDOI
01 Nov 2007-Ecology
TL;DR: High classification accuracy in all applications as measured by cross-validation and, in the case of the lichen data, by independent test data, when comparing RF to other common classification methods are observed.
Abstract: Classification procedures are some of the most widely used statistical methods in ecology. Random forests (RF) is a new and powerful statistical classifier that is well established in other disciplines but is relatively unknown in ecology. Advantages of RF compared to other statistical classifiers include (1) very high classification accuracy; (2) a novel method of determining variable importance; (3) ability to model complex interactions among predictor variables; (4) flexibility to perform several types of statistical data analysis, including regression, classification, survival analysis, and unsupervised learning; and (5) an algorithm for imputing missing values. We compared the accuracies of RF and four other commonly used statistical classifiers using data on invasive plant species presence in Lava Beds National Monument, California, USA, rare lichen species presence in the Pacific Northwest, USA, and nest sites for cavity nesting birds in the Uinta Mountains, Utah, USA. We observed high classification accuracy in all applications as measured by cross-validation and, in the case of the lichen data, by independent test data, when comparing RF to other common classification methods. We also observed that the variables that RF identified as most important for classifying invasive plant species coincided with expectations based on the literature.
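
A minimal sketch of the variable-importance workflow the authors rely on, using the randomForest package; the species-presence data here is a hypothetical stand-in (a data frame `d` with a factor column `present` and environmental predictors).

```r
library(randomForest)

# d: hypothetical data frame with factor `present` and environmental predictors
rf <- randomForest(present ~ ., data = d, ntree = 1000, importance = TRUE)
importance(rf)   # permutation and Gini importance per predictor
varImpPlot(rf)   # ranked importance plot used to screen predictors
```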

3,368 citations


Journal ArticleDOI
TL;DR: An alternative implementation of random forests is proposed that provides unbiased variable selection in the individual classification trees and can be used reliably for variable selection even in situations where the potential predictor variables vary in their scale of measurement or their number of categories.
Abstract: Variable importance measures for random forests have been receiving increased attention as a means of variable selection in many classification tasks in bioinformatics and related scientific fields, for instance to select a subset of genetic markers relevant for the prediction of a certain disease. We show that random forest variable importance measures are a sensible means for variable selection in many applications, but are not reliable in situations where potential predictor variables vary in their scale of measurement or their number of categories. This is particularly important in genomics and computational biology, where predictors often include variables of different types, for example when predictors include both sequence data and continuous variables such as folding energy, or when amino acid sequence data show different numbers of categories. Simulation studies are presented illustrating that, when random forest variable importance measures are used with data of varying types, the results are misleading because suboptimal predictor variables may be artificially preferred in variable selection. The two mechanisms underlying this deficiency are biased variable selection in the individual classification trees used to build the random forest on one hand, and effects induced by bootstrap sampling with replacement on the other hand. We propose to employ an alternative implementation of random forests that provides unbiased variable selection in the individual classification trees. When this method is applied using subsampling without replacement, the resulting variable importance measures can be used reliably for variable selection even in situations where the potential predictor variables vary in their scale of measurement or their number of categories. The usage of both random forest algorithms and their variable importance measures in the R system for statistical computing is illustrated and documented thoroughly in an application re-analyzing data from a study on RNA editing. Therefore, the suggested method can be applied straightforwardly by scientists in bioinformatics research.
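
The alternative implementation the authors propose corresponds to conditional inference forests in the party package for R; a minimal sketch, assuming a hypothetical data frame `d` with response `y` and mixed-type predictors (cforest_unbiased uses subsampling without replacement, matching the recommendation above):

```r
library(party)

# d: hypothetical data frame with response `y` and mixed-type predictors
cf <- cforest(y ~ ., data = d,
              controls = cforest_unbiased(ntree = 500, mtry = 3))
varimp(cf)   # importance from unbiased trees grown on subsamples
```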

2,697 citations


Proceedings ArticleDOI
26 Dec 2007
TL;DR: It is shown that selecting the ROI adds about 5% to the performance and, together with the other improvements, the result is about a 10% improvement over the state of the art for Caltech-256.
Abstract: We explore the problem of classifying images by the object categories they contain in the case of a large number of object categories. To this end we combine three ingredients: (i) shape and appearance representations that support spatial pyramid matching over a region of interest. This generalizes the representation of Lazebnik et al., (2006) from an image to a region of interest (ROI), and from appearance (visual words) alone to appearance and local shape (edge distributions); (ii) automatic selection of the regions of interest in training. This provides a method of inhibiting background clutter and adding invariance to the object instance's position; and (iii) the use of random forests (and random ferns) as a multi-way classifier. The advantage of such classifiers (over multi-way SVM for example) is the ease of training and testing. Results are reported for classification of the Caltech-101 and Caltech-256 data sets. We compare the performance of the random forest/ferns classifier with a benchmark multi-way SVM classifier. It is shown that selecting the ROI adds about 5% to the performance and, together with the other improvements, the result is about a 10% improvement over the state of the art for Caltech-256.
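
The image pipeline itself (ROI selection, spatial-pyramid shape and appearance descriptors) is beyond a short snippet, but the random-ferns classifier of ingredient (iii) can be sketched on generic binary feature vectors. Everything below (feature count, fern size, Dirichlet smoothing) is an illustrative assumption, not the authors' implementation.

```r
# Random ferns: each fern is a small set of S binary tests; class-conditional
# probabilities of the joint test outcome are learned by counting.
train_ferns <- function(X, y, M = 50, S = 8) {
  # X: 0/1 feature matrix; y: factor of class labels
  classes <- levels(y)
  lapply(seq_len(M), function(m) {
    feat <- sample(ncol(X), S)                          # the fern's S tests
    idx  <- as.vector(X[, feat] %*% 2^(0:(S - 1))) + 1  # joint outcome in 1..2^S
    logp <- log(sapply(classes, function(cl)            # Dirichlet-smoothed counts
      (tabulate(idx[y == cl], nbins = 2^S) + 1) / (sum(y == cl) + 2^S)))
    list(feat = feat, logp = logp)
  })
}

predict_ferns <- function(ferns, X) {
  scores <- Reduce(`+`, lapply(ferns, function(f) {
    S   <- length(f$feat)
    idx <- as.vector(X[, f$feat] %*% 2^(0:(S - 1))) + 1
    f$logp[idx, , drop = FALSE]                         # log P(outcome | class)
  }))
  colnames(scores)[max.col(scores)]                     # semi-naive Bayes vote
}
```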

1,401 citations


Journal ArticleDOI
01 Jan 2007-Ecology
TL;DR: A new form of boosted trees, namely, "aggregated boosted trees" (ABT), is proposed and, in a simulation study, is shown to reduce prediction error relative to boosted trees.
Abstract: Accurate prediction and explanation are fundamental objectives of statistical analysis, yet they seldom coincide. Boosted trees are a statistical learning method that attains both of these objectives for regression and classification analyses. They can deal with many types of response variables (numeric, categorical, and censored), loss functions (Gaussian, binomial, Poisson, and robust), and predictors (numeric, categorical). Interactions between predictors can also be quantified and visualized. The theory underpinning boosted trees is presented, together with interpretive techniques. A new form of boosted trees, namely, "aggregated boosted trees" (ABT), is proposed and, in a simulation study, is shown to reduce prediction error relative to boosted trees. A regression data set is analyzed using ABT to illustrate the technique and to compare it with other methods, including boosted trees, bagged trees, random forests, and generalized additive models. A software package for ABT analysis using the R software environment is included in the Appendices together with worked examples.
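
A rough sketch of the aggregation idea using the gbm package for boosted regression trees: boosted models are fitted to bootstrap resamples and their predictions averaged, which approximates ABT's aggregation (the paper aggregates the models fitted during v-fold cross-validation); all settings below are illustrative.

```r
library(gbm)

abt <- function(form, data, B = 10) {
  fits <- lapply(seq_len(B), function(b) {
    d <- data[sample(nrow(data), replace = TRUE), ]  # resample, then boost
    gbm(form, data = d, distribution = "gaussian",
        n.trees = 1000, interaction.depth = 3, shrinkage = 0.01)
  })
  # aggregated prediction: average over the B boosted models
  function(newdata)
    rowMeans(sapply(fits, predict, newdata = newdata, n.trees = 1000))
}
```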

1,121 citations


Proceedings ArticleDOI
04 Oct 2007
TL;DR: This study compares the predictive accuracy of several machine learning methods including Logistic Regression (LR), Classification and Regression Trees (CART), Bayesian Additive Regression trees (BART), Support Vector Machines (SVM), Random Forests (RF), and Neural Networks (NNet) for predicting phishing emails.
Abstract: There are many applications available for phishing detection. However, unlike predicting spam, there are only a few studies that compare machine learning techniques in predicting phishing. The present study compares the predictive accuracy of several machine learning methods including Logistic Regression (LR), Classification and Regression Trees (CART), Bayesian Additive Regression Trees (BART), Support Vector Machines (SVM), Random Forests (RF), and Neural Networks (NNet) for predicting phishing emails. A data set of 2889 phishing and legitimate emails is used in the comparative study. In addition, 43 features are used to train and test the classifiers.
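
A sketch of this kind of comparison in R, assuming a hypothetical email feature table `d` with a two-level factor label `phishing`; BART is omitted since it needs a dedicated package, and held-out accuracy stands in for the paper's evaluation.

```r
library(rpart); library(randomForest); library(e1071); library(nnet)

set.seed(1)
test <- sample(nrow(d), nrow(d) %/% 3)         # simple train/test split
tr   <- d[-test, ]; te <- d[test, ]
pos  <- levels(d$phishing)[2]

acc <- c(
  LR   = mean((predict(glm(phishing ~ ., tr, family = binomial),
                       te, type = "response") > 0.5) == (te$phishing == pos)),
  CART = mean(predict(rpart(phishing ~ ., tr), te, type = "class") == te$phishing),
  RF   = mean(predict(randomForest(phishing ~ ., tr), te) == te$phishing),
  SVM  = mean(predict(svm(phishing ~ ., tr), te) == te$phishing),
  NNet = mean(predict(nnet(phishing ~ ., tr, size = 5, trace = FALSE),
                      te, type = "class") == te$phishing)
)
sort(acc, decreasing = TRUE)
```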

386 citations



Journal ArticleDOI
TL;DR: In this paper, two statistical techniques are evaluated for predicting vegetation type distributions within the study area: (i) the widely used multiple logistic regression technique in the generalized linear modelling framework, and (ii) a recently developed machine learning technique called "random forests".

323 citations


Journal ArticleDOI
TL;DR: The Random Forest regression model predicted aqueous solubility more accurately than those created by PLS, SVM, and ANN and offered methods for automatic descriptor selection, an assessment of descriptor importance, and an in-parallel measure of predictive ability, all of which serve to recommend its use.
Abstract: Random Forest regression (RF), Partial-Least-Squares (PLS) regression, Support Vector Machines (SVM), and Artificial Neural Networks (ANN) were used to develop QSPR models for the prediction of aqueous solubility, based on experimental data for 988 organic molecules. The Random Forest regression model predicted aqueous solubility more accurately than those created by PLS, SVM, and ANN and offered methods for automatic descriptor selection, an assessment of descriptor importance, and an in-parallel measure of predictive ability, all of which serve to recommend its use. The prediction of log molar solubility for an external test set of 330 molecules that are solid at 25 °C gave r2 = 0.89 and RMSE = 0.69 log S units. For a standard data set selected from the literature, the model performed well with respect to other documented methods. Finally, the diversity of the training and test sets is compared to the chemical space occupied by molecules in the MDL drug data report, on the basis of molecular descriptors...
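
A minimal sketch of the corresponding workflow with the randomForest package; `X_train`/`logS_train` and the test-set objects are hypothetical descriptor matrices and log-solubility vectors, and the metrics mirror the r2 and RMSE reported above.

```r
library(randomForest)

rf   <- randomForest(x = X_train, y = logS_train, ntree = 500, importance = TRUE)
pred <- predict(rf, X_test)

rmse <- sqrt(mean((pred - logS_test)^2))  # in log S units
r2   <- cor(pred, logS_test)^2
importance(rf)                            # automatic descriptor ranking
```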

295 citations


Journal ArticleDOI
Hemant Ishwaran
TL;DR: The authors characterize and study variable importance (VIMP) and pairwise variable associations in binary regression trees, where a key component involves the node mean squared error for a quantity referred to as a maximal subtree.
Abstract: We characterize and study variable importance (VIMP) and pairwise variable associations in binary regression trees. A key component involves the node mean squared error for a quantity we refer to as a maximal subtree. The theory naturally extends from single trees to ensembles of trees and applies to methods like random forests. This is useful because while importance values from random forests are used to screen variables, for example they are used to filter high throughput genomic data in Bioinformatics, very little theory exists about their properties.
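
Maximal-subtree ideas from this line of work were later implemented in the randomForestSRC R package; a hedged sketch of querying them there (package and function names are those of that later implementation, not code from the paper):

```r
library(randomForestSRC)

o  <- rfsrc(Species ~ ., data = iris, ntree = 500)
ms <- max.subtree(o)  # maximal-subtree / minimal-depth statistics per variable
ms$order              # order statistics used for VIMP-style variable screening
```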

294 citations


Proceedings ArticleDOI
29 Oct 2007
TL;DR: A comprehensive suite of experiments that analyze the performance of the random forest (RF) learner implemented in Weka are discussed, providing an extensive empirical evaluation of RF learners built from imbalanced data.
Abstract: This paper discusses a comprehensive suite of experiments that analyze the performance of the random forest (RF) learner implemented in Weka. RF is a relatively new learner, and to the best of our knowledge, only preliminary experimentation on the construction of random forest classifiers in the context of imbalanced data has been reported in previous work. Therefore, the contribution of this study is to provide an extensive empirical evaluation of RF learners built from imbalanced data. What should be the recommended default number of trees in the ensemble? What should the recommended value be for the number of attributes? How does the RF learner perform on imbalanced data when compared with other commonly-used learners? We address these and other related issues in this work.
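
The paper works with Weka's implementation; in R's randomForest the analogous imbalance handling can be sketched via stratified per-tree sampling, with all values below illustrative (a hypothetical data frame `d` with a binary factor `class`):

```r
library(randomForest)

n_min <- min(table(d$class))            # size of the minority class
rf <- randomForest(class ~ ., data = d, ntree = 500,
                   strata   = d$class,
                   sampsize = c(n_min, n_min))  # balanced bootstrap per tree
rf$confusion                                    # OOB confusion matrix
```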

Journal ArticleDOI
TL;DR: It could be demonstrated that even though modern computer-intensive classification algorithms such as random forests, SVM, and neural networks show a slight superiority, more classical classification algorithms performed nearly equally well.

Journal ArticleDOI
TL;DR: An algorithm is introduced that decides when a sufficient number of classifiers has been created for an ensemble, and is shown to result in an accurate ensemble for those methods that incorporate bagging into the construction of the ensemble.
Abstract: We experimentally evaluate bagging and seven other randomization-based approaches to creating an ensemble of decision tree classifiers. Statistical tests were performed on experimental results from 57 publicly available data sets. When cross-validation comparisons were tested for statistical significance, the best method was statistically more accurate than bagging on only eight of the 57 data sets. Alternatively, examining the average ranks of the algorithms across the group of data sets, we find that boosting, random forests, and randomized trees are statistically significantly better than bagging. Because our results suggest that using an appropriate ensemble size is important, we introduce an algorithm that decides when a sufficient number of classifiers has been created for an ensemble. Our algorithm uses the out-of-bag error estimate, and is shown to result in an accurate ensemble for those methods that incorporate bagging into the construction of the ensemble.
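
The out-of-bag idea behind the stopping algorithm can be sketched with randomForest, whose err.rate matrix records the cumulative OOB error after each tree; the tolerance below is an illustrative choice, not the paper's criterion.

```r
library(randomForest)

rf  <- randomForest(Species ~ ., data = iris, ntree = 1000)
oob <- rf$err.rate[, "OOB"]  # cumulative OOB error after tree 1..1000

# smallest ensemble whose OOB error is within a tolerance of the best observed
enough <- which(oob <= min(oob) + 0.005)[1]
enough
```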

Proceedings ArticleDOI
27 Aug 2007
TL;DR: This paper reports on classification results for emotional user states (4 classes, German database of children interacting with a pet robot), where six sites computed acoustic and linguistic features independently from each other, following in part different strategies.
Abstract: In this paper, we report on classification results for emotional user states (4 classes, German database of children interacting with a pet robot). Six sites computed acoustic and linguistic features independently from each other, following in part different strategies. A total of 4244 features were pooled together and grouped into 12 low-level descriptor types and 6 functional types. For each of these groups, classification results using Support Vector Machines and Random Forests are reported for the full set of features, and for the 150 features per group with the highest individual Information Gain Ratio. The performance for the different groups varies mostly between ≈ 50% and ≈ 60%.
Index Terms: emotional user states, automatic classification, feature types, functionals
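
A sketch of the per-feature Information Gain Ratio ranking using the FSelector package; the feature table `d` with factor label `emotion` is a hypothetical stand-in for the pooled 4244 features.

```r
library(FSelector)

gr     <- gain.ratio(emotion ~ ., data = d)    # Information Gain Ratio per feature
top150 <- cutoff.k(gr, 150)                    # names of the 150 best features
form   <- as.simple.formula(top150, "emotion") # reduced formula for SVM / RF training
```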

Book ChapterDOI
17 Sep 2007
TL;DR: This paper considers two ensemble learning techniques, bagging and random forests, and applies them to multi-objective decision trees (MODTs), which are decision trees that predict multiple target attributes at once, and concludes that ensembles of MODTs yield better predictive performance than MODTs and are equally good as, or better than, ensembles of single-objective decision trees.
Abstract: Ensemble methods are able to improve the predictive performance of many base classifiers. Up till now, they have been applied to classifiers that predict a single target attribute. Given the non-trivial interactions that may occur among the different targets in multi-objective prediction tasks, it is unclear whether ensemble methods also improve the performance in this setting. In this paper, we consider two ensemble learning techniques, bagging and random forests, and apply them to multi-objective decision trees (MODTs), which are decision trees that predict multiple target attributes at once. We empirically investigate the performance of ensembles of MODTs. Our most important conclusions are: (1) ensembles of MODTs yield better predictive performance than MODTs, and (2) ensembles of MODTs are equally good as, or better than, ensembles of single-objective decision trees, i.e., a set of ensembles for each target. Moreover, ensembles of MODTs have smaller model size and are faster to learn than ensembles of single-objective decision trees.
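
No multi-objective tree learner ships with base R, so the bagging side of the experiment can only be sketched generically: a wrapper that bootstrap-aggregates any fit/predict pair, into which a MODT learner would be slotted (rpart on a single target stands in below). Multi-target predictions returned as matrices average the same way.

```r
library(rpart)

bag <- function(data, B = 50,
                fit  = function(d) rpart(y ~ ., data = d),
                pred = function(m, d) predict(m, d)) {
  models <- lapply(seq_len(B), function(b)
    fit(data[sample(nrow(data), replace = TRUE), ]))  # bootstrap resamples
  function(newdata)                                   # averaged ensemble prediction
    Reduce(`+`, lapply(models, pred, d = newdata)) / B
}
```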

Book ChapterDOI
23 May 2007
TL;DR: A lesion study on Rotation Forest is carried out to find out which of the parameters and the randomization heuristics are responsible for the good performance of the method.
Abstract: Rotation Forest is a recently proposed method for building classifier ensembles using independently trained decision trees. It was found to be more accurate than bagging, AdaBoost and Random Forest ensembles across a collection of benchmark data sets. This paper carries out a lesion study on Rotation Forest in order to find out which of the parameters and the randomization heuristics are responsible for the good performance. Contrary to common intuition, the features extracted through PCA gave the best results compared to those extracted through non-parametric discriminant analysis (NDA) or random projections. The only ensemble method whose accuracy was statistically indistinguishable from that of Rotation Forest was LogitBoost although it gave slightly inferior results on 20 out of the 32 benchmark data sets. It appeared that the main factor for the success of Rotation Forest is that the transformation matrix employed to calculate the (linear) extracted features is sparse.
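
A compact sketch of one Rotation Forest member, showing the sparse block-diagonal rotation the authors identify as the key factor: features are split into K groups, PCA is fitted per group on a bootstrap sample, and a tree is grown on the rotated data. Group count and base learner are illustrative; the original method has further details (e.g., class subsampling) that are omitted here.

```r
library(rpart)

rotation_member <- function(X, y, K = 3) {
  p      <- ncol(X)
  groups <- split(sample(p), rep(seq_len(K), length.out = p))  # disjoint feature groups
  pcas   <- lapply(groups, function(g)                         # PCA per group, on a bootstrap
    prcomp(X[sample(nrow(X), replace = TRUE), g, drop = FALSE]))
  rotate <- function(Xn) {                                     # sparse block-diagonal rotation
    Z <- do.call(cbind, Map(function(pc, g)
      predict(pc, Xn[, g, drop = FALSE]), pcas, groups))
    colnames(Z) <- paste0("f", seq_len(ncol(Z)))
    as.data.frame(Z)
  }
  tree <- rpart(y ~ ., data = cbind(rotate(X), y = y), method = "class")
  function(Xn) predict(tree, rotate(Xn), type = "class")       # member prediction
}
```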

Journal ArticleDOI
TL;DR: The study shows that a classifier trained on a small set of labelled examples can be successfully boosted using unlabelled examples to an accuracy only 5% lower than that of a classifier trained on all labelled examples, showing that in the domain of spam e-mail filtering it can be as competitive as co-training with two natural feature splits.

Proceedings ArticleDOI
17 Jun 2007
TL;DR: A novel motion representation, "motons", inspired by research in object recognition is introduced, and learning the segmentation likelihood from the spatial context of motion is proposed; this learning is performed efficiently by Random Forests.
Abstract: This paper presents an algorithm for the automatic segmentation of monocular videos into foreground and background layers. Correct segmentations are produced even in the presence of large background motion with nearly stationary foreground. There are three key contributions. The first is the introduction of a novel motion representation, "motons", inspired by research in object recognition. Second, we propose learning the segmentation likelihood from the spatial context of motion. The learning is efficiently performed by Random Forests. The third contribution is a general taxonomy of tree-based classifiers, which facilitates theoretical and experimental comparisons of several known classification algorithms, as well as spawning new ones. Diverse visual cues such as motion, motion context, colour, contrast and spatial priors are fused together by means of a conditional random field (CRF) model. Segmentation is then achieved by binary min-cut. Our algorithm requires no initialization. Experiments on many video-chat type sequences demonstrate the effectiveness of our algorithm in a variety of scenes. The segmentation results are comparable to those obtained by stereo systems.

Book ChapterDOI
23 May 2007
TL;DR: The fusion of two classifiers using both spectral and spatial information is discussed in the frame of hyperspectral remote sensing for the classification of urban areas, along with the complementary use of several algorithms in a decision fusion scheme.
Abstract: In this paper, we present some recent developments of Multiple Classifier Systems (MCS) for remote sensing applications. Some standard MCS methods (boosting, bagging, consensus theory and random forests) are briefly described and applied to multisource data (satellite multispectral images, elevation, slope and aspect data) for landcover classification. In the second part, special attention is given to Support Vector Machines (SVM) based algorithms. In particular, the fusion of two classifiers using both spectral and spatial information is discussed in the frame of hyperspectral remote sensing for the classification of urban areas. In all cases, MCS provide a significant improvement in classification accuracy. In order to address new challenges for the analysis of remote sensing data, MCS provide invaluable tools to handle situations with an ever-growing complexity. Examples include extraction of multiple features from one data set, use of multi-sensor data, and complementary use of several algorithms in a decision fusion scheme.
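
The simplest decision-fusion scheme mentioned, majority voting over the outputs of several classifiers, can be sketched in a few lines (the list of per-classifier factor predictions is a hypothetical input):

```r
# pred_list: list of factor vectors, one per classifier, all of the same length
fuse <- function(pred_list) {
  votes <- sapply(pred_list, as.character)  # classifiers in columns
  factor(apply(votes, 1, function(v) names(which.max(table(v)))))
}
```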

Journal ArticleDOI
TL;DR: A comparative assessment of several state-of-the-art machine learning tools for mining drug data, including support vector machines (SVMs) and the ensemble decision tree methods boosting, bagging, and random forest, demonstrates that these techniques can provide consistent improvements in predictive performance over single decision trees.
Abstract: We present a comparative assessment of several state-of-the-art machine learning tools for mining drug data, including support vector machines (SVMs) and the ensemble decision tree methods boosting, bagging, and random forest, using eight data sets and two sets of descriptors. We demonstrate, by rigorous multiple comparison statistical tests, that these techniques can provide consistent improvements in predictive performance over single decision trees. However, within these methods, there is no clearly best-performing algorithm. This motivates a more in-depth investigation into the properties of random forests. We identify a set of parameters for the random forest that provide optimal performance across all the studied data sets. Additionally, the tree ensemble structure of the forest may provide an interpretable model, a considerable advantage over SVMs. We test this possibility and compare it with standard decision tree models.
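
The parameter search the authors describe can be sketched with tuneRF from the randomForest package, which walks mtry up and down from its default while watching the OOB error; `X` and `y` are hypothetical descriptor and activity objects, and the arguments shown are illustrative.

```r
library(randomForest)

# X: descriptor matrix, y: activity labels (hypothetical objects)
best <- tuneRF(x = X, y = y, ntreeTry = 500,
               stepFactor = 1.5,  # multiplicative step for mtry
               improve    = 0.01) # minimum relative OOB improvement to continue
```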

Book ChapterDOI
01 Jan 2007
TL;DR: A methodology for predicting fault prone modules using a modified random forests algorithm and a thorough and statistically sound comparison of these methods against ten other classifiers frequently used in the literature are presented.
Abstract: Accurate prediction of fault prone modules in the software development process enables effective discovery and identification of defects. Such prediction models are especially valuable for large-scale systems, where verification experts need to focus their attention and resources on problem areas in the system under development. This paper presents a methodology for predicting fault prone modules using a modified random forests algorithm. Random forests improve classification accuracy by growing an ensemble of classification trees and letting them vote on the classification decision. We applied the methodology to five NASA public domain defect data sets. These data sets vary in size, but all typically contain a small number of defect samples in the learning set. For instance, in project PC1, only around 7% of the instances are defects. If overall accuracy maximization is the goal, then learning from such data usually results in a biased classifier, i.e., the majority of samples would be classified into the non-defect class. To obtain better prediction of fault-proneness, two strategies are investigated: a proper sampling technique in constructing the tree classifiers, and threshold adjustment in determining the winning class. Both are found to be effective in accurate prediction of fault prone modules. In addition, the paper presents a thorough and statistically sound comparison of these methods against ten other classifiers frequently used in the literature.
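
Both strategies map naturally onto arguments of R's randomForest, shown here as a hedged sketch (the paper's modified algorithm differs in detail; the data frame `d` with factor `defect`, the test set, and the 0.3 threshold are all illustrative):

```r
library(randomForest)

n_def <- min(table(d$defect))
rf <- randomForest(defect ~ ., data = d, ntree = 500,
                   strata = d$defect, sampsize = c(n_def, n_def))  # sampling strategy

p    <- predict(rf, test, type = "prob")[, "defect"]  # fraction of tree votes
pred <- ifelse(p > 0.3, "defect", "no_defect")        # threshold lowered from 0.5
```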

Journal Article
TL;DR: A machine learning system for traffic flow management and control is described for a traffic flow prediction problem; the algorithm performs relatively well on real data and, according to the Traffic Flow Evaluation model, makes it possible to estimate and predict whether there is congestion at a given time on road intersections.
Abstract: Traffic Management and Information Systems, which rely on a system of sensors, aim to describe traffic in urban areas in real time, using a set of parameters and estimating them. Though the state of the art focuses on data analysis, little is done in the sense of prediction. In this paper, we describe a machine learning system for traffic flow management and control, applied to a traffic flow prediction problem. This new algorithm is obtained by using the Random Forests algorithm as a weak learner within the AdaBoost algorithm. We show that our algorithm performs relatively well on real data and makes it possible, according to the Traffic Flow Evaluation model, to estimate and predict whether there is congestion or not at a given time on road intersections.
Keywords: Machine Learning, Boosting, Classification, Traffic Congestion, Data Collecting, Magnetic Loop Detectors, Signalized Intersections, Traffic Signal Timing Optimization.
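
A hedged sketch of the combination: an AdaBoost.M1-style loop in which each weak learner is a small random forest trained on a weighted bootstrap. Binary labels are assumed, and the paper's exact weighting scheme may differ.

```r
library(randomForest)

adaboost_rf <- function(X, y, M = 10, ntree = 15) {
  # y: two-level factor; weights updated as in AdaBoost.M1
  n <- nrow(X); w <- rep(1 / n, n)
  models <- list(); alpha <- numeric(0)
  for (m in seq_len(M)) {
    idx  <- sample(n, n, replace = TRUE, prob = w)  # weighted bootstrap
    fit  <- randomForest(X[idx, , drop = FALSE], y[idx], ntree = ntree)
    miss <- predict(fit, X) != y
    err  <- sum(w * miss) / sum(w)
    if (err >= 0.5) break                           # weak learner no better than chance
    a <- log((1 - err) / max(err, 1e-10))
    w <- w * exp(a * miss); w <- w / sum(w)         # up-weight the mistakes
    models[[m]] <- fit; alpha <- c(alpha, a)
  }
  function(Xn) {                                    # alpha-weighted vote
    pos <- levels(y)[2]
    s <- Reduce(`+`, Map(function(f, a) a * (predict(f, Xn) == pos), models, alpha))
    factor(ifelse(s > sum(alpha) / 2, pos, levels(y)[1]), levels = levels(y))
  }
}
```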

Book ChapterDOI
06 Sep 2007
TL;DR: The results of the experiments show that the proposed approach has a comparable performance to that of random forests, with the added advantage of being applicable to any base-level algorithm without the need to randomize the latter.
Abstract: Random forests are one of the best performing methods for constructing ensembles. They derive their strength from two aspects: using random subsamples of the training data (as in bagging) and randomizing the algorithm for learning base-level classifiers (decision trees). The base-level algorithm randomly selects a subset of the features at each step of tree construction and chooses the best among these. We propose to use a combination of concepts used in bagging and random subspaces to achieve a similar effect. The latter randomly selects a subset of the features at the start and uses a deterministic version of the base-level algorithm (and is thus somewhat similar to the randomized version of the algorithm). The results of our experiments show that the proposed approach has a comparable performance to that of random forests, with the added advantage of being applicable to any base-level algorithm without the need to randomize the latter.
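
A minimal sketch of the proposed scheme with rpart as the deterministic base learner: each member gets a bootstrap of the rows (bagging) and one fixed random feature subset chosen at the start (random subspaces). Subset size, ensemble size, and the data frame `d` with label column `y` are illustrative assumptions.

```r
library(rpart)

member <- function(d, label, k) {
  feats <- sample(setdiff(names(d), label), k)  # subspace fixed up-front
  boot  <- d[sample(nrow(d), replace = TRUE), ] # bagging
  rpart(reformulate(feats, label), data = boot, method = "class")
}

ens <- lapply(1:100, function(i) member(d, "y", k = 5))

vote <- function(newdata) {                     # majority vote over members
  v <- sapply(ens, function(m) as.character(predict(m, newdata, type = "class")))
  factor(apply(v, 1, function(r) names(which.max(table(r)))))
}
```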

Journal ArticleDOI
TL;DR: In this article, various ensemble learning methods from machine learning and statistics are considered and applied to the customer choice modeling problem, and experimental results for two real-life marketing datasets using decision trees, ensemble versions of decision trees and the logistic regression model are given.

Journal ArticleDOI
TL;DR: Fast-to-calculate empirical physicochemical descriptors were investigated for their ability to predict mutagenicity (positive or negative Ames test) from the molecular structure, and random forests were able to associate meaningful probabilities with the predictions.
Abstract: Fast-to-calculate empirical physicochemical descriptors were investigated for their ability to predict mutagenicity (positive or negative Ames test) from the molecular structure. Fast methods are highly desired for the screening of large libraries of compounds. Global molecular descriptors and MOLMAP descriptors of bond properties were used to train random forests. Error percentages as low as 15% and 16% were achieved for an external test set with 472 compounds and for the training set with 4083 structures, respectively. High sensitivity and specificity were observed. Random forests were able to associate meaningful probabilities to the predictions and to explain the predictions in terms of similarities between query structures and compounds in the training set.
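
The "meaningful probabilities" of a random forest come from the fraction of trees voting for each class; a short sketch of extracting and using them with the randomForest package (the fitted object `rf`, test set, and class name "mutagenic" are hypothetical):

```r
p <- predict(rf, newdata = test, type = "prob")[, "mutagenic"]  # vote fraction
confident <- abs(p - 0.5) > 0.3  # flag predictions far from the decision boundary
```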

Journal ArticleDOI
TL;DR: This work investigates the use of different Machine Learning methods to construct models for aqueous solubility, evaluating all approaches in terms of their prediction accuracy and the extent to which the individual error bars can faithfully represent the actual prediction error.
Abstract: We investigate the use of different Machine Learning methods to construct models for aqueous solubility. Models are based on about 4000 compounds, including an in-house set of 632 drug discovery molecules of Bayer Schering Pharma. For each method, we also consider an appropriate method to obtain error bars, in order to estimate the domain of applicability (DOA) for each model. Here, we investigate error bars from a Bayesian model (Gaussian Process (GP)), an ensemble-based approach (Random Forest), and approaches based on the Mahalanobis distance to training data (for Support Vector Machine and Ridge Regression models). We evaluate all approaches in terms of their prediction accuracy (in cross-validation, and on an external validation set of 536 molecules) and the extent to which the individual error bars can faithfully represent the actual prediction error.
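
Two of the listed error-bar recipes are easy to sketch with base R plus randomForest for a regression model: the ensemble spread across individual trees, and the Mahalanobis distance of a query to the training data (all object names hypothetical).

```r
library(randomForest)

rf  <- randomForest(x = X_train, y = y_train, ntree = 500)
all <- predict(rf, X_test, predict.all = TRUE)
mu  <- all$aggregate                 # ensemble mean prediction
se  <- apply(all$individual, 1, sd)  # ensemble spread as an error bar

# distance-to-training-data as a domain-of-applicability measure
md <- mahalanobis(X_test, colMeans(X_train), cov(X_train))
```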

Proceedings ArticleDOI
23 Sep 2007
TL;DR: This work studies random forest methods from a strictly pragmatic perspective, in order to provide practitioners with rules on parameter settings, and draws some conclusions on the global behavior of random forests according to their parameter tuning.
Abstract: In the pattern recognition field, growing interest has been shown in recent years in multiple classifier systems, and particularly in bagging, boosting and random subspaces. Those methods aim at inducing an ensemble of classifiers by producing diversity at different levels. Following this principle, Breiman introduced in 2001 another family of methods called random forests. Our work aims at studying those methods from a strictly pragmatic perspective, in order to provide rules on parameter settings for practitioners. For that purpose we have experimented with the forest-RI algorithm, considered the reference random forest method, on the MNIST handwritten digits database. In this paper, we describe random forest principles and review some methods proposed in the literature. We then present our experimental protocol and results. We finally draw some conclusions on the global behavior of random forests according to their parameter tuning.

Proceedings ArticleDOI
14 May 2007
TL;DR: A comparative study of supervised probabilistic and predictive machine learning techniques for intrusion detection, using two probabilistic techniques (Naive Bayes and Gaussian) and two predictive techniques (decision trees and random forests) to detect four attack categories.
Abstract: Intrusion detection is an effective approach for dealing with various problems in the area of network security. This paper presents a comparative study of supervised probabilistic and predictive machine learning techniques for intrusion detection. Two probabilistic techniques, Naive Bayes and Gaussian, and two predictive techniques, decision trees and random forests, are employed. Different training datasets constructed from the KDD99 dataset are used for training. The ability of each technique to detect four attack categories (DoS, Probe, R2L, and U2R) has been compared. Statistical results showing the sensitivity of each technique to the population of attacks in a dataset are also reported. We compare the performance of the techniques and also investigate the robustness of each technique by calculating their standard deviations with respect to the detection rate of each attack category.
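
The per-category detection rate reduces to class-wise recall from a confusion matrix; a short sketch, assuming a fitted model `fit` and hypothetical test objects with aligned label levels:

```r
cm <- table(truth = y_test, pred = predict(fit, X_test))
detection_rate <- diag(cm) / rowSums(cm)  # recall per category: DoS, Probe, R2L, U2R
```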

Journal Article
TL;DR: A new model is introduced that addresses feature selection from a large dictionary of variables that can be computed from a signal or an image, using the probability distribution P as a state variable and optimizing a multi-task goodness-of-fit criterion for classifiers based on variables randomly chosen according to P.
Abstract: We introduce a new model addressing feature selection from a large dictionary of variables that can be computed from a signal or an image. Features are extracted according to an efficiency criterion, on the basis of specified classification or recognition tasks. This is done by estimating a probability distribution P on the complete dictionary, which distributes its mass over the more efficient, or informative, components. We implement a stochastic gradient descent algorithm, using the probability as a state variable and optimizing a multi-task goodness-of-fit criterion for classifiers based on variables randomly chosen according to P. We then generate classifiers from the optimal distribution of weights learned on the training set. The method is first tested on several pattern recognition problems including face detection, handwritten digit recognition, spam classification and micro-array analysis. We then compare our approach with other step-wise algorithms like random forests or recursive feature elimination.