Topic

Random forest

About: Random forest is a research topic. Over its lifetime, 13,345 publications have been published within this topic, receiving 345,395 citations. The topic is also known as random forests or randomized trees.


Papers
Journal Article
TL;DR: Initial results indicate that the screening of MOFs with high drug loading capacity is a well-generalized, straightforward, and cost-effective method that can be applied not only for the prediction of IBU loading capacity, but also in many other biomaterials projects.
Abstract: Metal-organic frameworks (MOFs) have been widely researched as drug delivery systems due to their intrinsic porous structures. Herein, machine learning (ML) technologies were applied for the screening of MOFs with high drug loading capacity. To achieve this, first, a comprehensive dataset was gathered, including 40 data points from more than 100 different publications. The organic linkers, metal ions, and functional groups, as well as the surface area and the pore volume of the investigated MOFs, were chosen as the model's inputs, and the output was the ibuprofen (IBU) loading capacity. Thereafter, several machine learning algorithms, namely support vector regression (SVR), random forest (RF), adaptive boosting (AdaBoost), and categorical boosting (CatBoost), were employed to predict the ibuprofen loading capacity of MOFs. Coefficients of determination (R²) of 0.70, 0.72, 0.66, and 0.76 were obtained for the SVR, RF, AdaBoost, and CatBoost approaches, respectively. Among all the algorithms, CatBoost was the most reliable, exhibiting superior performance on sparse matrices and categorical features. Shapley additive explanations (SHAP) analysis was employed to explore the impact of the feature values on the model's outputs. Our initial results indicate that this methodology is a well-generalized, straightforward, and cost-effective method that can be applied not only for the prediction of IBU loading capacity, but also in many other biomaterials projects.

4 citations
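
The SHAP analysis mentioned in the abstract above attributes a prediction to individual input features using Shapley values. The following is a minimal sketch of that idea, not the SHAP library and not the authors' pipeline: it computes exact Shapley values for one sample by averaging marginal contributions over all feature orderings. The function `shapley_values`, the toy model, its coefficients, and the feature names are all hypothetical and chosen only for illustration.

```python
from itertools import permutations


def shapley_values(predict, x, baseline):
    """Exact Shapley values for one sample.

    Features are switched from their baseline value to their actual value
    one at a time, and each feature's marginal change in the prediction is
    averaged over every possible ordering.
    """
    n = len(x)
    phi = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        current = list(baseline)
        previous = predict(current)
        for j in order:
            current[j] = x[j]
            updated = predict(current)
            phi[j] += updated - previous
            previous = updated
    return [p / len(orderings) for p in phi]


# Hypothetical stand-in for a trained regressor (made-up coefficients).
def toy_model(features):
    surface_area, pore_volume, has_functional_group = features
    return (0.1 * surface_area
            + 2.0 * pore_volume
            + 0.5 * surface_area * has_functional_group)


x = [3.0, 1.5, 1.0]          # sample to explain (illustrative values)
baseline = [1.0, 1.0, 0.0]   # reference input (illustrative values)
phi = shapley_values(toy_model, x, baseline)

for name, value in zip(["surface_area", "pore_volume", "has_functional_group"], phi):
    print(f"{name}: {value:+.3f}")
```

The contributions sum to the difference between the prediction for `x` and the prediction for `baseline`, which is the property that makes Shapley values convenient for explaining a single prediction. Exact enumeration grows factorially with the number of features, so it is only practical for very small inputs.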

Journal Article
TL;DR: In this paper, the importance of various dynamical features in predicting the dynamical state (ds) of galaxy clusters, based on the Random Forest (RF) machine learning approach, was investigated.
Abstract: We investigate the importance of various dynamical features in predicting the dynamical state (ds) of galaxy clusters, based on the Random Forest (RF) machine learning approach. We use a large sample of galaxy clusters from the Three Hundred Project of hydrodynamical zoomed-in simulations, and construct dynamical features from the raw data as well as from the corresponding mock maps in the optical, X-ray, and Sunyaev-Zel'dovich (SZ) channels. Instead of relying on the impurity-based feature importance of the RF algorithm, we directly use the out-of-bag (oob) scores to evaluate the importance of individual features and different feature combinations. Among all the features studied, we find the virial ratio, η, to be the most important single feature. The features calculated directly from the simulations and in three dimensions carry more information on the ds than those constructed from the mock maps. Compared with the features based on X-ray or SZ maps, features related to the centroid positions are more important. Despite the large number of investigated features, a combination of up to three features of different types can already saturate the score of the prediction. Lastly, we show that the most sensitive feature η is strongly correlated with the well-known half-mass bias in dynamical modelling. Without a selection in ds, cluster haloes have an asymmetric distribution in η, corresponding to an overall positive half-mass bias. Our work provides a quantitative reference for selecting the best features to discriminate the ds of galaxy clusters in both simulations and observations.

4 citations
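
The abstract above scores feature subsets by their out-of-bag (oob) score instead of impurity-based importance. As a rough illustration of how an oob score is obtained, the sketch below bags decision stumps (a deliberately simplified stand-in for a full random forest) and evaluates each sample only with the stumps whose bootstrap sample did not contain it. The data are synthetic, and `fit_stump` and `oob_score` are hypothetical helper names, not part of any library or of the paper's code.

```python
import random


def fit_stump(rows, labels, features):
    """Pick the (feature, threshold, direction) split with the fewest training errors."""
    best = None
    for f in features:
        for t in sorted({r[f] for r in rows}):
            for sign in (1, -1):
                errors = sum(
                    (1 if sign * (r[f] - t) > 0 else 0) != y
                    for r, y in zip(rows, labels)
                )
                if best is None or errors < best[0]:
                    best = (errors, f, t, sign)
    _, f, t, sign = best
    return lambda r: 1 if sign * (r[f] - t) > 0 else 0


def oob_score(rows, labels, features, n_trees=25, seed=0):
    """Accuracy of majority votes cast only by stumps that never saw the sample."""
    rng = random.Random(seed)
    n = len(rows)
    votes = [[0, 0] for _ in range(n)]
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap sample
        in_bag = set(idx)
        stump = fit_stump([rows[i] for i in idx], [labels[i] for i in idx], features)
        for i in range(n):
            if i not in in_bag:  # sample i is out-of-bag for this stump
                votes[i][stump(rows[i])] += 1
    scored = [(v, y) for v, y in zip(votes, labels) if sum(v) > 0]
    correct = sum((1 if v[1] > v[0] else 0) == y for v, y in scored)
    return correct / len(scored)


# Synthetic data: feature 0 tracks the label, feature 1 is pure noise.
data_rng = random.Random(42)
rows, labels = [], []
for _ in range(120):
    y = data_rng.randint(0, 1)
    rows.append([y + data_rng.gauss(0, 0.4), data_rng.gauss(0, 1)])
    labels.append(y)

score_informative = oob_score(rows, labels, features=[0])
score_noise = oob_score(rows, labels, features=[1])
score_both = oob_score(rows, labels, features=[0, 1])
print(score_informative, score_noise, score_both)
```

Comparing the score of a feature subset with and without a given feature gives a direct, model-level measure of how much that feature contributes, which is the comparison the abstract describes.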

Journal Article
TL;DR: A feature selection model is established to select the 20 molecular descriptors of compounds with the most significant influence on biological activity; descriptors such as MlogP, XlogP and TopoPSA were found to have a prominent effect on the biological activity.
Abstract: This paper establishes a feature selection model to select the 20 molecular descriptors of compounds with the most significant influence on biological activity. The random forest algorithm was used to calculate the correlation between molecular descriptors and pIC50 values of biological activity. In this way, the top 26 molecular descriptors with high correlation were screened out. The Pearson correlation coefficient was then used to analyze these 26 molecular descriptors and eliminate variables that were highly correlated with other independent variables. A review of the literature showed that descriptors such as MlogP, XlogP and TopoPSA among the selected molecular descriptors have a prominent effect on the biological activity, indicating that the screening methods and results for the 20 molecular descriptors were reasonable.

4 citations
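
The second step described in the abstract above, removing descriptors that are highly correlated with one another, can be sketched as a greedy filter. This is an assumption about how such a filter might look, not the authors' procedure: the 0.9 threshold, the descriptor names `desc_a` to `desc_c`, their values, and the helper `drop_redundant` are all hypothetical.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


def drop_redundant(columns, ranked_names, threshold=0.9):
    """Walk descriptors from most to least important and keep one only if it is
    not strongly correlated with any descriptor that was already kept."""
    kept = []
    for name in ranked_names:
        if all(abs(pearson(columns[name], columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Made-up descriptor values; desc_b is almost a linear function of desc_a.
columns = {
    "desc_a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "desc_b": [2.1, 3.9, 6.2, 8.0, 9.9],
    "desc_c": [5.0, 1.0, 4.0, 2.0, 3.0],
}
ranked = ["desc_a", "desc_b", "desc_c"]  # assumed importance ranking
kept = drop_redundant(columns, ranked)
print(kept)
```

Walking the list in importance order means that when two descriptors are redundant, the one ranked higher by the importance step is the one retained.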

Proceedings Article
01 Aug 2017
TL;DR: Experimental results demonstrate that the CMI-RF method can select a feature subset with stronger correlation, no redundancy and high classification accuracy.
Abstract: Random Forest (RF) has been widely used in the classification of high-dimensional data. However, using all the features of high-dimensional data for classification increases the computation time and reduces the classification accuracy. Therefore, feature selection is critical to high-dimensional data classification. To address this problem, this paper presents a method that combines Conditional Mutual Information (CMI) and Random Forest (CMI-RF). CMI is used to remove irrelevant and redundant information. The optimal subset of features with higher classification accuracy is then obtained by RF. In this paper, high-dimensional near-infrared spectral data are taken as experimental data. The experimental results demonstrate that the CMI-RF method can select a feature subset with stronger correlation, no redundancy and high classification accuracy.

4 citations
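
Conditional mutual information, which the abstract above uses to discard redundant features, measures how much a candidate feature X tells us about the label Y once an already selected feature Z is known: I(X; Y | Z) = Σ p(x, y, z) log [ p(z) p(x, y, z) / (p(x, z) p(y, z)) ]. The sketch below estimates this quantity from counts. It assumes the features have already been discretized (spectral data would need binning first), and the function name and toy sequences are hypothetical.

```python
from collections import Counter
from math import log2


def conditional_mutual_information(x, y, z):
    """Estimate I(X; Y | Z) in bits from three equally long discrete sequences."""
    n = len(x)
    c_xyz = Counter(zip(x, y, z))
    c_xz = Counter(zip(x, z))
    c_yz = Counter(zip(y, z))
    c_z = Counter(z)
    total = 0.0
    for (xi, yi, zi), count in c_xyz.items():
        # The sample size cancels inside the logarithm, so raw counts suffice.
        ratio = (count * c_z[zi]) / (c_xz[(xi, zi)] * c_yz[(yi, zi)])
        total += (count / n) * log2(ratio)
    return total


labels = [0, 0, 0, 0, 1, 1, 1, 1]
selected = [0, 0, 1, 1, 0, 0, 1, 1]        # feature already chosen
duplicate = list(selected)                  # exact copy of the chosen feature
complementary = [a ^ b for a, b in zip(labels, selected)]  # adds new information

cmi_duplicate = conditional_mutual_information(duplicate, labels, selected)
cmi_complementary = conditional_mutual_information(complementary, labels, selected)
print(cmi_duplicate, cmi_complementary)
```

A copy of an already selected feature scores zero because it adds nothing once that feature is known, while the complementary feature scores a full bit because, together with the selected one, it determines the label.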

Journal Article
Xiaobang Liu, Shunlin Liang, Bing Li, Han Ma, Tao He
TL;DR: This study used multiple machine learning algorithms (MLAs) to estimate the fractional forest cover (FFC) in China's Three North Region (TNR) using 30-m Landsat-8 data and aggregated 1-m GaoFen-2 (GF-2) satellite images.
Abstract: The accurate monitoring of forest cover and its changes is essential for environmental change research, but current satellite products for forest coverage carry many uncertainties. This study used 30-m Landsat-8 data and aggregated 1-m GaoFen-2 (GF-2) satellite images to construct the training samples, and used multiple machine learning algorithms (MLAs) to estimate the fractional forest cover (FFC) in China's Three North Region (TNR). Multiple MLAs were also merged to construct stacked generalization (SG) models, and the performances of the MLAs in the FFC estimation were evaluated. The results of the 10-fold cross-validation showed that all non-linear algorithms performed well, with an R² value greater than 0.8 and a root-mean-square error (RMSE) of less than 0.05. In the bagging ensemble, the random forest (RF) model (R² = 0.993, RMSE = 0.020) performed the best, and in the boosting ensemble, the light gradient boosted machine (LGBM) (R² = 0.992, RMSE = 0.022) performed the best. Although the evaluation index of the RF is slightly better than that of the LGBM, the independent validation results show that the two models have similar performances. The model evaluation results on the independent datasets showed that, in the SG model, the performance of the SG(LGBM) (R² = 0.991, RMSE = 0.034) was better than that of the single or non-ensemble model. Comparing the FFC estimates of our model with those of existing datasets showed that our model exhibited more forest spatial distribution details and higher accuracy in complex landscapes. Overall, in this study, the method of using high-resolution remote sensing (RS) images to extract samples for FFC estimation is feasible. Our results demonstrate the potential of the ensemble MLAs to map the FFC. The research results also show that among many MLAs, the RF algorithm is the most suitable algorithm for estimating FFC, which provides a reference for future research.

4 citations
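
The two evaluation metrics reported in the abstract above, R² and RMSE, can be computed as in the following minimal sketch. The fractional-cover values are invented for illustration and do not come from the paper.

```python
from math import sqrt


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - (residual sum of squares / total sum of squares)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def rmse(y_true, y_pred):
    """Root-mean-square error, in the same units as the target."""
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


# Made-up fractional forest cover values (0 = no forest, 1 = fully forested).
observed = [0.1, 0.4, 0.5, 0.9]
predicted = [0.2, 0.4, 0.5, 0.8]

r2 = r2_score(observed, predicted)
error = rmse(observed, predicted)
print(f"R2 = {r2:.3f}, RMSE = {error:.3f}")
```

R² is unitless and depends on the spread of the observed values, while RMSE is expressed in the units of the target, which is why the two are usually reported together.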


Network Information
Related Topics (5)
Deep learning: 79.8K papers, 2.1M citations, 90% related
Convolutional neural network: 74.7K papers, 2M citations, 90% related
Cluster analysis: 146.5K papers, 2.9M citations, 89% related
Feature extraction: 111.8K papers, 2.1M citations, 87% related
Artificial neural network: 207K papers, 4.5M citations, 86% related
Performance
Metrics
No. of papers in the topic in previous years
Year    Papers
2024    1
2023    5,459
2022    10,287
2021    2,325
2020    2,251
2019    1,961