Showing papers on "Random forest published in 2008"

PDF

Open Access

Journal Article•DOI•

A working guide to boosted regression trees

[...]

Jane Elith¹, John R. Leathwick², Trevor Hastie³•Institutions (3)

University of Melbourne¹, National Institute of Water and Atmospheric Research², Stanford University³

01 Jul 2008-Journal of Animal Ecology

TL;DR: This study provides a working guide to boosted regression trees (BRT), an ensemble method for fitting statistical models that differs fundamentally from conventional techniques that aim to fit a single parsimonious model.

...read moreread less

Abstract: Summary 1 Ecologists use statistical models for both explanation and prediction, and need techniques that are flexible enough to express typical features of their data, such as nonlinearities and interactions 2 This study provides a working guide to boosted regression trees (BRT), an ensemble method for fitting statistical models that differs fundamentally from conventional techniques that aim to fit a single parsimonious model Boosted regression trees combine the strengths of two algorithms: regression trees (models that relate a response to their predictors by recursive binary splits) and boosting (an adaptive method for combining many simple models to give improved predictive performance) The final BRT model can be understood as an additive regression model in which individual terms are simple trees, fitted in a forward, stagewise fashion 3 Boosted regression trees incorporate important advantages of tree-based methods, handling different types of predictor variables and accommodating missing data They have no need for prior data transformation or elimination of outliers, can fit complex nonlinear relationships, and automatically handle interaction effects between predictors Fitting multiple trees in BRT overcomes the biggest drawback of single tree models: their relatively poor predictive performance Although BRT models are complex, they can be summarized in ways that give powerful ecological insight, and their predictive performance is superior to most traditional modelling methods 4 The unique features of BRT raise a number of practical issues in model fitting We demonstrate the practicalities and advantages of using BRT through a distributional analysis of the short-finned eel ( Anguilla australis Richardson), a native freshwater fish of New Zealand We use a data set of over 13 000 sites to illustrate effects of several settings, and then fit and interpret a model using a subset of the data We provide code and a tutorial to enable the wider use of BRT by ecologists

...read moreread less

4,787 citations

Proceedings Article•DOI•

Isolation Forest

[...]

F.T. Liu¹, Kai Ming Ting¹, Zhi-Hua Zhou¹•Institutions (1)

Monash University, Clayton campus¹

15 Dec 2008

TL;DR: The use of isolation enables the proposed method, iForest, to exploit sub-sampling to an extent that is not feasible in existing methods, creating an algorithm which has a linear time complexity with a low constant and a low memory requirement.

...read moreread less

Abstract: Most existing model-based approaches to anomaly detection construct a profile of normal instances, then identify instances that do not conform to the normal profile as anomalies. This paper proposes a fundamentally different model-based method that explicitly isolates anomalies instead of profiles normal points. To our best knowledge, the concept of isolation has not been explored in current literature. The use of isolation enables the proposed method, iForest, to exploit sub-sampling to an extent that is not feasible in existing methods, creating an algorithm which has a linear time complexity with a low constant and a low memory requirement. Our empirical evaluation shows that iForest performs favourably to ORCA, a near-linear time complexity distance-based method, LOF and random forests in terms of AUC and processing time, and especially in large data sets. iForest also works well in high dimensional problems which have a large number of irrelevant attributes, and in situations where training set does not contain any anomalies.

...read moreread less

3,455 citations

Journal Article•DOI•

Random survival forests

[...]

Hemant Ishwaran¹, Udaya B. Kogalur¹, Eugene H. Blackstone, Michael S. Lauer•Institutions (1)

University of Miami¹

11 Nov 2008-arXiv: Applications

TL;DR: Random Survival Forest (RSF) as discussed by the authors is a random forests method for the analysis of right-censored survival data, which is based on the conservation-of-events principle.

...read moreread less

Abstract: We introduce random survival forests, a random forests method for the analysis of right-censored survival data. New survival splitting rules for growing survival trees are introduced, as is a new missing data algorithm for imputing missing data. A conservation-of-events principle for survival forests is introduced and used to define ensemble mortality, a simple interpretable measure of mortality that can be used as a predicted outcome. Several illustrative examples are given, including a case study of the prognostic implications of body mass for individuals with coronary artery disease. Computations for all examples were implemented using the freely available R-software package, randomSurvivalForest.

...read moreread less

1,562 citations

Journal Article•DOI•

Random survival forests

[...]

Hemant Ishwaran¹, Udaya B. Kogalur¹, Eugene H. Blackstone, Michael S. Lauer•Institutions (1)

University of Miami¹

01 Sep 2008-The Annals of Applied Statistics

TL;DR: This article introduces random survival forests, a random forests method for the analysis of right-censored survival data, and extends Breiman’s random forests (RF) method, showing it to be highly accurate and comparable to state-of-the-art methods.

...read moreread less

1,147 citations

Journal Article•DOI•

Empirical characterization of random forest variable importance measures

[...]

Kellie J. Archer¹, Ryan V. Kimes¹•Institutions (1)

Virginia Commonwealth University¹

01 Jan 2008-Computational Statistics & Data Analysis

TL;DR: The RF methodology is attractive for use in classification problems when the goals of the study are to produce an accurate classifier and to provide insight regarding the discriminative ability of individual predictor variables.

...read moreread less

854 citations

Journal Article•DOI•

Evaluation of random forest and adaboost tree-based ensemble classification and spectral band selection for ecotope mapping using airborne hyperspectral imagery

[...]

Jonathan Cheung-Wai Chan¹, Desiré Paelinckx²•Institutions (2)

Vrije Universiteit Brussel¹, Research Institute for Nature and Forest²

16 Jun 2008-Remote Sensing of Environment

TL;DR: Two tree-based ensemble classification algorithms are assessed: Adaboost and Random Forest, based on standard classification accuracy, training time and classification stability, and both outperform a neural network classifier in dealing with hyperspectral data.

...read moreread less

720 citations

Journal Article•DOI•

A comprehensive comparison of random forests and support vector machines for microarray-based cancer classification

[...]

Alexander Statnikov¹, Lily Wang¹, Constantin F. Aliferis•Institutions (1)

Vanderbilt University¹

22 Jul 2008-BMC Bioinformatics

TL;DR: Both on average and in the majority of microarray datasets, random forests are outperformed by support vector machines both in the settings when no gene selection is performed and when several popular gene selection methods are used.

...read moreread less

Abstract: Cancer diagnosis and clinical outcome prediction are among the most important emerging applications of gene expression microarray technology with several molecular signatures on their way toward clinical deployment. Use of the most accurate classification algorithms available for microarray gene expression data is a critical ingredient in order to develop the best possible molecular signatures for patient care. As suggested by a large body of literature to date, support vector machines can be considered "best of class" algorithms for classification of such data. Recent work, however, suggests that random forest classifiers may outperform support vector machines in this domain. In the present paper we identify methodological biases of prior work comparing random forests and support vector machines and conduct a new rigorous evaluation of the two algorithms that corrects these limitations. Our experiments use 22 diagnostic and prognostic datasets and show that support vector machines outperform random forests, often by a large margin. Our data also underlines the importance of sound research design in benchmarking and comparison of bioinformatics algorithms. We found that both on average and in the majority of microarray datasets, random forests are outperformed by support vector machines both in the settings when no gene selection is performed and when several popular gene selection methods are used.

...read moreread less

616 citations

Proceedings Article•DOI•

An empirical evaluation of supervised learning in high dimensions

[...]

Rich Caruana¹, Nikos Karampatziakis¹, Ainur Yessenalina¹•Institutions (1)

Cornell University¹

05 Jul 2008

TL;DR: To the surprise, the method that performs consistently well across all dimensions is random forests, followed by neural nets, boosted trees, and SVMs, and the effect of increasing dimensionality on the performance of the learning algorithms changes.

...read moreread less

Abstract: In this paper we perform an empirical evaluation of supervised learning on high-dimensional data. We evaluate performance on three metrics: accuracy, AUC, and squared loss and study the effect of increasing dimensionality on the performance of the learning algorithms. Our findings are consistent with previous studies for problems of relatively low dimension, but suggest that as dimensionality increases the relative performance of the learning algorithms changes. To our surprise, the method that performs consistently well across all dimensions is random forests, followed by neural nets, boosted trees, and SVMs.

...read moreread less

555 citations

Journal Article•

Consistency of Random Forests and Other Averaging Classifiers

[...]

Gérard Biau, Luc Devroye, Gábor Lugosi

01 Jun 2008-Journal of Machine Learning Research

TL;DR: A number of theorems are given that establish the universal consistency of averaging rules, and it is shown that some popular classifiers, including one suggested by Breiman, are not universally consistent.

...read moreread less

Abstract: In the last years of his life, Leo Breiman promoted random forests for use in classification. He suggested using averaging as a means of obtaining good discrimination rules. The base classifiers used for averaging are simple and randomized, often based on random samples from the data. He left a few questions unanswered regarding the consistency of such rules. In this paper, we give a number of theorems that establish the universal consistency of averaging rules. We also show that some popular classifiers, including one suggested by Breiman, are not universally consistent.

...read moreread less

521 citations

Journal Article•DOI•

Random-Forests-Based Network Intrusion Detection Systems

[...]

Jiong Zhang, Mohammad Zulkernine¹, Anwar Haque²•Institutions (2)

Queen's University¹, Bell Canada²

01 Sep 2008

TL;DR: The experimental results demonstrate that the performance provided by the proposed misuse approach is better than the best KDDpsila99 result; compared to other reported unsupervised anomaly detection approaches, the anomaly detection approach achieves higher detection rate when the false positive rate is low; and the presented hybrid system can improve the overall performance of the aforementioned IDSs.

...read moreread less

Abstract: Prevention of security breaches completely using the existing security technologies is unrealistic. As a result, intrusion detection is an important component in network security. However, many current intrusion detection systems (IDSs) are rule-based systems, which have limitations to detect novel intrusions. Moreover, encoding rules is time-consuming and highly depends on the knowledge of known intrusions. Therefore, we propose new systematic frameworks that apply a data mining algorithm called random forests in misuse, anomaly, and hybrid-network-based IDSs. In misuse detection, patterns of intrusions are built automatically by the random forests algorithm over training data. After that, intrusions are detected by matching network activities against the patterns. In anomaly detection, novel intrusions are detected by the outlier detection mechanism of the random forests algorithm. After building the patterns of network services by the random forests algorithm, outliers related to the patterns are determined by the outlier detection algorithm. The hybrid detection system improves the detection performance by combining the advantages of the misuse and anomaly detection. We evaluate our approaches over the knowledge discovery and data mining 1999 (KDDpsila99) dataset. The experimental results demonstrate that the performance provided by the proposed misuse approach is better than the best KDDpsila99 result; compared to other reported unsupervised anomaly detection approaches, our anomaly detection approach achieves higher detection rate when the false positive rate is low; and the presented hybrid system can improve the overall performance of the aforementioned IDSs.

...read moreread less

440 citations

Book•

Statistical Learning from a Regression Perspective

[...]

Richard A. Berk

01 Jan 2008

TL;DR: This paper presents a meta-modelling framework called CART, which automates the very labor-intensive and therefore time-heavy and expensive process of Classification and Regression Trees (CART) that is currently used in statistical inference.

...read moreread less

Abstract: Statistical Learning as a Regression Problem.- Regression Splines and Regression Smoothers.- Classification and Regression Trees (CART).- Bagging.- Random Forests.- Boosting.- Support Vector Machines.- Broader Implications and a Bit of Craft Lore.

...read moreread less

Journal Article•DOI•

Comparison of decision tree methods for finding active objects

[...]

Yong-Heng Zhao, Yanxia Zhang

01 Jan 2008-Advances in Space Research

TL;DR: Several kinds of decision trees for finding active objects by multi-wavelength data, such as REPTree, Random Tree, Decision Stump, Random Forest, J48, NBTree, AdTree are described and experimental results show that ADTree is the best only in terms of accuracy.

...read moreread less

Journal Article•DOI•

Explaining Classifications For Individual Instances

[...]

Marko Robnik-Šikonja¹, Igor Kononenko¹•Institutions (1)

University of Ljubljana¹

01 May 2008-IEEE Transactions on Knowledge and Data Engineering

TL;DR: It is demonstrated that the generated explanations closely follow the learned models and a visualization technique is presented that shows the utility of the approach and enables the comparison of different prediction methods.

...read moreread less

Abstract: We present a method for explaining predictions for individual instances. The presented approach is general and can be used with all classification models that output probabilities. It is based on the decomposition of a model's predictions on individual contributions of each attribute. Our method works for the so-called black box models such as support vector machines, neural networks, and nearest neighbor algorithms, as well as for ensemble methods such as boosting and random forests. We demonstrate that the generated explanations closely follow the learned models and present a visualization technique that shows the utility of our approach and enables the comparison of different prediction methods.

...read moreread less

Proceedings Article•DOI•

Object Class Segmentation using Random Forests.

[...]

Florian Schroff¹, Antonio Criminisi¹, Andrew Zisserman²•Institutions (2)

University of Oxford¹, Microsoft²

01 Jan 2008

TL;DR: It is shown that the ability of Random Forests to combine multiple features leads to a further increase in performance when textons, colour, filterbanks, and HOG features are used simultaneously.

...read moreread less

Abstract: This work investigates the use of Random Forests for class based pixel-wise segmentation of images. The contribution of this paper is three-fold. First, we show that apparently quite dissimilar classifiers (such as nearest neighbour matching to texton class histograms) can be mapped onto a Random Forest architecture. Second, based on this insight, we show that the performance of such classifiers can be improved by incorporating the spatial context and discriminative learning that arises naturally in the Random Forest framework. Finally, we show that the ability of Random Forests to combine multiple features leads to a further increase in performance when textons, colour, filterbanks, and HOG features are used simultaneously. The benefit of the multi-feature classifier is demonstrated with extensive experimentation on existing labelled image datasets. The method equals or exceeds the state of the art on these datasets.

...read moreread less

Journal Article•DOI•

A comparative study of different machine learning methods on microarray gene expression data.

[...]

Mehdi Pirooznia¹, Jack Y. Yang², Mary Qu Yang³, Youping Deng¹•Institutions (3)

University of Southern Mississippi¹, Harvard University², United States Department of Health and Human Services³

20 Mar 2008-BMC Genomics

TL;DR: The importance of feature selection in accurately classifying new samples and how an integrated feature selection and classification algorithm is performing and is capable of identifying significant genes are revealed.

...read moreread less

Abstract: Several classification and feature selection methods have been studied for the identification of differentially expressed genes in microarray data. Classification methods such as SVM, RBF Neural Nets, MLP Neural Nets, Bayesian, Decision Tree and Random Forrest methods have been used in recent studies. The accuracy of these methods has been calculated with validation methods such as v-fold validation. However there is lack of comparison between these methods to find a better framework for classification, clustering and analysis of microarray gene expression results. In this study, we compared the efficiency of the classification methods including; SVM, RBF Neural Nets, MLP Neural Nets, Bayesian, Decision Tree and Random Forrest methods. The v-fold cross validation was used to calculate the accuracy of the classifiers. Some of the common clustering methods including K-means, DBC, and EM clustering were applied to the datasets and the efficiency of these methods have been analysed. Further the efficiency of the feature selection methods including support vector machine recursive feature elimination (SVM-RFE), Chi Squared, and CSF were compared. In each case these methods were applied to eight different binary (two class) microarray datasets. We evaluated the class prediction efficiency of each gene list in training and test cross-validation using supervised classifiers. We presented a study in which we compared some of the common used classification, clustering, and feature selection methods. We applied these methods to eight publicly available datasets, and compared how these methods performed in class prediction of test datasets. We reported that the choice of feature selection methods, the number of genes in the gene list, the number of cases (samples) substantially influence classification success. Based on features chosen by these methods, error rates and accuracy of several classification algorithms were obtained. Results revealed the importance of feature selection in accurately classifying new samples and how an integrated feature selection and classification algorithm is performing and is capable of identifying significant genes.

...read moreread less

Book Chapter•DOI•

Implementing Decision Trees and Forests on a GPU

[...]

Toby Sharp¹•Institutions (1)

Microsoft¹

12 Oct 2008

TL;DR: A method for implementing the evaluation and training of decision trees and forests entirely on a GPU is described and how this method can be used in the context of object recognition is shown.

...read moreread less

Abstract: We describe a method for implementing the evaluation and training of decision trees and forests entirely on a GPU, and show how this method can be used in the context of object recognition.

...read moreread less

Journal Article•DOI•

Enriched random forests

[...]

Dhammika Amaratunga¹, Javier Cabrera¹, Yung-Seop Lee¹•Institutions (1)

Johnson & Johnson¹

15 Sep 2008-Bioinformatics

TL;DR: This work proposes a novel, yet simple, adjustment that has demonstrably superior performance: choose the eligible subsets at each node by weighted random sampling instead of simple random sampling, with the weights tilted in favor of the informative features.

...read moreread less

Abstract: Although the random forest classification procedure works well in datasets with many features, when the number of features is huge and the percentage of truly informative features is small, such as with DNA microarray data, its performance tends to decline significantly. In such instances, the procedure can be improved by reducing the contribution of trees whose nodes are populated by non-informative features. To some extent, this can be achieved by prefiltering, but we propose a novel, yet simple, adjustment that has demonstrably superior performance: choose the eligible subsets at each node by weighted random sampling instead of simple random sampling, with the weights tilted in favor of the informative features. This results in an ‘enriched random forest’. We illustrate the superior performance of this procedure in several actual microarray datasets. Contact: damaratu@prdus.jnj.com

...read moreread less

Journal Article•DOI•

Classifying Multilevel Imagery From SAR and Optical Sensors by Decision Fusion

[...]

Björn Waske¹, S. van der Linden²•Institutions (2)

University of Iceland¹, Humboldt University of Berlin²

15 Apr 2008-IEEE Transactions on Geoscience and Remote Sensing

TL;DR: It is shown that the classification of multilevel-multisource data sets with SVM and RF is feasible and does not require a definition of ideal aggregation levels.

...read moreread less

Abstract: A strategy for the joint classification of multiple segmentation levels from multisensor imagery is introduced by using synthetic aperture radar and optical data. At first, the two data sets are separately segmented, creating independent aggregation levels at different scales. Each individual level from the two sensors is then preclassified by a support vector machine (SVM). The original outputs of each SVM, i.e., images showing the distances of the pixels to the hyperplane fitted by the SVM, are used in a decision fusion to determine the final classes. The fusion strategy is based on the application of an additional classifier, which is applied on the preclassification results. Both a second SVM and random forests (RF) were tested for the decision fusion. The results are compared with SVM and RF applied to the full data set without preclassification. Both the integration of multilevel information and the use of multisensor imagery increase the overall accuracy. It is shown that the classification of multilevel-multisource data sets with SVM and RF is feasible and does not require a definition of ideal aggregation levels. The proposed decision fusion approach that applies RF to the preclassification outperforms all other approaches.

...read moreread less

Proceedings Article•DOI•

Randomized trees for human pose detection

[...]

Grégory Rogez¹, Jonathan Rihan², Srikumar Ramalingam², Carlos Orrite¹, Philip H. S. Torr² - Show less +1 more•Institutions (2)

University of Zaragoza¹, Oxford Brookes University²

23 Jun 2008

TL;DR: This paper addresses human pose recognition from video sequences by formulating it as a classification problem and considers Histograms of Orientated Gradient (HOG), a pose detection algorithm based on random forests.

...read moreread less

Abstract: This paper addresses human pose recognition from video sequences by formulating it as a classification problem. Unlike much previous work we do not make any assumptions on the availability of clean segmentation. The first step of this work consists in a novel method of aligning the training images using 3D Mocap data. Next we define classes by discretizing a 2D manifold whose two dimensions are camera viewpoint and actions. Our main contribution is a pose detection algorithm based on random forests. A bottom-up approach is followed to build a decision tree by recursively clustering and merging the classes at each level. For each node of the decision tree we build a list of potentially discriminative features using the alignment of training images; in this paper we consider Histograms of Orientated Gradient (HOG). We finally grow an ensemble of trees by randomly sampling one of the selected HOG blocks at each node. Our proposed approach gives promising results with both fixed and moving cameras.

...read moreread less

Journal Article•DOI•

Random Forests for multiclass classification: Random MultiNomial Logit

[...]

Anita Prinzie¹, Dirk Van den Poel¹•Institutions (1)

Ghent University¹

01 Apr 2008-Expert Systems With Applications

TL;DR: This paper proposes the Random MultiNomial Logit (RMNL), i.e. a random forest of MNLs, and compares its predictive performance to that of (a) MNL with expert feature selection, (b) Random Forests of classification trees, and indicates a substantial increase in model accuracy of the RMNL model.

...read moreread less

Abstract: Several supervised learning algorithms are suited to classify instances into a multiclass value space. MultiNomial Logit (MNL) is recognized as a robust classifier and is commonly applied within the CRM (Customer Relationship Management) domain. Unfortunately, to date, it is unable to handle huge feature spaces typical of CRM applications. Hence, the analyst is forced to immerse himself into feature selection. Surprisingly, in sharp contrast with binary logit, current software packages lack any feature-selection algorithm for MultiNomial Logit. Conversely, Random Forests, another algorithm learning multiclass problems, is just like MNL robust but unlike MNL it easily handles high-dimensional feature spaces. This paper investigates the potential of applying the Random Forests principles to the MNL framework. We propose the Random MultiNomial Logit (RMNL), i.e. a random forest of MNLs, and compare its predictive performance to that of (a) MNL with expert feature selection, (b) Random Forests of classification trees. We illustrate the Random MultiNomial Logit on a cross-sell CRM problem within the home-appliances industry. The results indicate a substantial increase in model accuracy of the RMNL model to that of the MNL model with expert feature selection.

...read moreread less

Proceedings Article•DOI•

Decision Tree Ensemble: Small Heterogeneous Is Better Than Large Homogeneous

[...]

Michael S. Gashler¹, Christophe Giraud-Carrier¹, Tony Martinez¹•Institutions (1)

Brigham Young University¹

01 Dec 2008

TL;DR: The results indicate that random forests do poorly when faced with irrelevant attributes, while the heterogeneous technique handles them robustly, and it is shown that large ensembles of random trees are more susceptible to diminishing returns than the technique.

...read moreread less

Abstract: Using decision trees that split on randomly selected attributes is one way to increase the diversity within an ensemble of decision trees. Another approach increases diversity by combining multiple tree algorithms. The random forest approach has become popular because it is simple and yields good results with common datasets. We present a technique that combines heterogeneous tree algorithms and contrast it with homogeneous forest algorithms. Our results indicate that random forests do poorly when faced with irrelevant attributes, while our heterogeneous technique handles them robustly. Further, we show that large ensembles of random trees are more susceptible to diminishing returns than our technique. We are able to obtain better results across a large number of common datasets with a significantly smaller ensemble.

...read moreread less

Journal Article•DOI•

A Bias Correction Algorithm for the Gini Variable Importance Measure in Classification Trees

[...]

Marco Sandri¹, Paola Zuccolotto•Institutions (1)

University of Brescia¹

01 Sep 2008-Journal of Computational and Graphical Statistics

TL;DR: In this paper, a measure of variable importance is defined as the total heterogeneity reduction produced by a given covariate on the response variable when the sample space is recursively partitioned.

...read moreread less

Abstract: This article considers a measure of variable importance frequently used in variable-selection methods based on decision trees and tree-based ensemble models These models include CART, random forests, and gradient boosting machine The measure of variable importance is defined as the total heterogeneity reduction produced by a given covariate on the response variable when the sample space is recursively partitioned Despite its popularity, some authors have shown that this measure is biased to the extent that, under certain conditions, there may be dangerous effects on variable selection Here we present a simple and effective method for bias correction, focusing on the easily generalizable case of the Gini index as a measure of heterogeneity

...read moreread less

Posted Content•

Random Forests: some methodological insights

[...]

Robin Genuer¹, Jean-Michel Poggi¹, Christine Tuleau•Institutions (1)

Département de Mathématiques¹

21 Nov 2008-arXiv: Machine Learning

TL;DR: This paper aims at confirming, known but sparse, advice for using random forests and at proposing some complementary remarks for both standard problems as well as high dimensional ones for which the number of variables hugely exceeds the sample size.

...read moreread less

Abstract: This paper examines from an experimental perspective random forests, the increasingly used statistical method for classification and regression problems introduced by Leo Breiman in 2001. It first aims at confirming, known but sparse, advice for using random forests and at proposing some complementary remarks for both standard problems as well as high dimensional ones for which the number of variables hugely exceeds the sample size. But the main contribution of this paper is twofold: to provide some insights about the behavior of the variable importance index based on random forests and in addition, to propose to investigate two classical issues of variable selection. The first one is to find important variables for interpretation and the second one is more restrictive and try to design a good prediction model. The strategy involves a ranking of explanatory variables using the random forests score of importance and a stepwise ascending variable introduction strategy.

...read moreread less

Journal Article•DOI•

Random forests classifier for machine fault diagnosis

[...]

Bo-Suk Yang¹, Xiao Di¹, Tian Han²•Institutions (2)

Pukyong National University¹, University of Science and Technology Beijing²

01 Sep 2008-Journal of Mechanical Science and Technology

TL;DR: This paper investigates the possibilities of applying the random forests algorithm (RF) in machine fault diagnosis, and proposes a hybrid method combined with genetic algorithm to improve the classification accuracy.

...read moreread less

Abstract: This paper investigates the possibilities of applying the random forests algorithm (RF) in machine fault diagnosis, and proposes a hybrid method combined with genetic algorithm to improve the classification accuracy. The proposed method is based on RF, a novel ensemble classifier which builds a number of decision trees to improve the single tree classifier. Although there are several existing techniques for faults diagnosis, the application research on RF is meaningful and necessary because of its fast execution speed, the characteristics of tree classifier, and high performance in machine faults diagnosis. The proposed method is demonstrated by a case study on induction motor fault diagnosis. Experimental results indicate the validity and reliability of RF-based diagnosis method.

...read moreread less

Journal Article•DOI•

Predicting credit card customer churn in banks using data mining

[...]

Dudyala Anil Kumar, Vadlamani Ravi

01 Aug 2008-International Journal of Data Analysis Techniques and Strategies

TL;DR: An ensemble system incorporating majority voting and involving Multilayer Perceptron, Logistic Regression, decision trees, Random Forest, Radial Basis Function, and Support Vector Machine as the constituents is developed to solve the customer credit card churn prediction via data mining.

...read moreread less

Abstract: In this paper, we solve the customer credit card churn prediction via data mining. We developed an ensemble system incorporating majority voting and involving Multilayer Perceptron (MLP), Logistic Regression (LR), decision trees (J48), Random Forest (RF), Radial Basis Function (RBF) network and Support Vector Machine (SVM) as the constituents. The dataset was taken from the Business Intelligence Cup organised by the University of Chile in 2004. Since it is a highly unbalanced dataset with 93% loyal and 7% churned customers, we employed (1) undersampling, (2) oversampling, (3) a combination of undersampling and oversampling and (4) the Synthetic Minority Oversampling Technique (SMOTE) for balancing it. Furthermore, tenfold cross-validation was employed. The results indicated that SMOTE achieved good overall accuracy. Also, SMOTE and a combination of undersampling and oversampling improved the sensitivity and overall accuracy in majority voting. In addition, the Classification and Regression Tree (CART) was used for the purpose of feature selection. The reduced feature set was fed to the classifiers mentioned above. Thus, this paper outlines the most important predictor variables in solving the credit card churn prediction problem. Moreover, the rules generated by decision tree J48 act as an early warning expert system.

...read moreread less

DOI•

Danger: High Power! – Exploring the Statistical Properties of a Test for Random Forest Variable Importance

[...]

Carolin Strobl¹, Achim Zeileis²•Institutions (2)

Ludwig Maximilian University of Munich¹, Vienna University of Economics and Business²

30 Jan 2008

TL;DR: In this article, the authors highlight the advantages and limitations of different variable importance scores and associated testing procedures, especially in the context of correlated predictor variables, for the test of Breiman and Cutler (2008).

...read moreread less

Abstract: Random forests have become a widely-used predictive model in many scientific disciplines within the past few years. Additionally, they are increasingly popular for assessing variable importance, e.g., in genetics and bioinformatics. We highlight both advantages and limitations of different variable importance scores and associated testing procedures, especially in the context of correlated predictor variables. For the test of Breiman and Cutler (2008), we investigate the statistical properties and find that the power of the test depends both on the sample size and the number of trees, an arbitrarily chosen tuning parameter, leading to undesired results that nullify any significance judgments. Moreover, the specification of the null hypothesis of this test is discussed in the context of correlated predictor variables.

...read moreread less

Book Chapter•DOI•

Segmentation of SBFSEM Volume Data of Neural Tissue by Hierarchical Classification

[...]

Björn Andres¹, Ullrich Köthe¹, Moritz Helmstaedter², Winfried Denk², Fred A. Hamprecht¹ - Show less +1 more•Institutions (2)

Interdisciplinary Center for Scientific Computing¹, Max Planck Society²

10 Jun 2008

TL;DR: This work proposes a hierarchical segmentation procedure based on statistical learning and topology-preserving grouping that is close to human labelings in three-dimensional electron-microscopic image stacks with almost isotropic resolution.

...read moreread less

Abstract: Three-dimensional electron-microscopic image stacks with almost isotropic resolution allow, for the first time, to determine the complete connection matrix of parts of the brain. In spite of major advances in staining, correct segmentation of these stacks remains challenging, because very few local mistakes can lead to severe global errors. We propose a hierarchical segmentation procedure based on statistical learning and topology-preserving grouping. Edge probability maps are computed by a random forest classifier (trained on hand-labeled data) and partitioned into supervoxels by the watershed transform. Over-segmentation is then resolved by another random forest. Careful validation shows that the results of our algorithm are close to human labelings.

...read moreread less

Journal Article•DOI•

Interaction trees with censored survival data.

[...]

Xiaogang Su¹, Tianni Zhou², Xin Yan³, Juanjuan Fan⁴, Song Yang⁵ - Show less +1 more•Institutions (5)

University of Central Florida¹, University of Southern California², University of Missouri–Kansas City³, San Diego State University⁴, National Institutes of Health⁵

28 Jan 2008-The International Journal of Biostatistics

TL;DR: An interaction tree (IT) procedure to optimize the subgroup analysis in comparative studies that involve censored survival times and follows the standard CART (Breiman, et al., 1984) methodology to develop the interaction tree structure.

...read moreread less

Abstract: We propose an interaction tree (IT) procedure to optimize the subgroup analysis in comparative studies that involve censored survival times. The proposed method recursively partitions the data into two subsets that show the greatest interaction with the treatment, which results in a number of objectively defined subgroups: in some of them the treatment effect is prominent while in others the treatment may have a negligible or even negative effect. The resultant tree structure can be used to explore the overall interaction between treatment and other covariates and help identify and describe possible target populations on which an experimental treatment demonstrates desired efficacy. We follow the standard CART (Breiman, et al., 1984) methodology to develop the interaction tree structure. Variable importance information is extracted via random forests of interaction trees. Both simulated experiments and an analysis of the primary billiary cirrhosis (PBC) data are provided for evaluation and illustration of the proposed procedure.

...read moreread less

Journal Article•DOI•

Factors Associated With Persistence in Science and Engineering Majors: An Exploratory Study Using Classification Trees and Random Forests

[...]

Guillermo Mendez¹, Trent D. Buskirk², Sharon L. Lohr¹, Susans Haag¹•Institutions (2)

Arizona State University¹, Saint Louis University²

01 Jan 2008-Journal of Engineering Education

TL;DR: The authors used the recently developed statistical method of random forests to obtain a new perspective of variables that are associated with persistence to a science or engineering degree and compared the results from classification trees and random forests with results from the more commonly used method of logistic regression.

...read moreread less

Abstract: Many students who start college intending to major in science or engineering do not graduate, or decide to switch to a non-science major. We used the recently developed statistical method of random forests to obtain a new perspective of variables that are associated with persistence to a science or engineering degree. We describe classification trees and random forests and contrast the results from these methods with results from the more commonly used method of logistic regression. Among the variables available in Arizona State University data, high school and freshman year GPAs have highest importance for predicting persistence; other variables such as number of science and engineering courses taken freshman year are important for subgroups of the student population. The method used in this study could be employed in other settings to identify faculty practices, teaching methods, and other factors that are associated with high persistence to a degree.

...read moreread less

Journal Article•DOI•

Microarray-based classification and clinical predictors

[...]

Anne-Laure Boulesteix¹, Christine Porzelius¹, Martin Daumer¹•Institutions (1)

Ludwig Maximilian University of Munich¹

01 Aug 2008-Bioinformatics

TL;DR: This work proposes a novel but simple two-step approach based on random forests and partial least squares (PLS) dimension reduction embedding the idea of pre-validation suggested by Tibshirani and colleagues, which is based on an internal cross- validation for avoiding overfitting.

...read moreread less

Abstract: Motivation: In the context of clinical bioinformatics methods are needed for assessing the additional predictive value of microarray data compared to simple clinical parameters alone. Such methods should also provide an optimal prediction rule making use of all potentialities of both types of data: they should ideally be able to catch subtypes which are not identified by clinical parameters alone. Moreover, they should address the question of the additional predictive value of microarray data in a fair framework. Results: We propose a novel but simple two-step approach based on random forests and partial least squares (PLS) dimension reduction embedding the idea of pre-validation suggested by Tibshirani and colleagues, which is based on an internal cross-validation for avoiding overfitting. Our approach is fast, flexible and can be used both for assessing the overall additional significance of the microarray data and for building optimal hybrid classification rules. Its efficiency is demonstrated through simulations and an application to breast cancer and colorectal cancer data. Availability: Our method is implemented in the freely available R package ‘MAclinical’ which can be downloaded from http://www.stat.uni-muenchen.de/~socher/MAclinical Contact: boulesteix@slcmsr.org

...read moreread less

Collapse