
Showing papers on "Decision tree model published in 2020"


Journal ArticleDOI
28 Aug 2020
TL;DR: In this article, convolutional neural networks are used for binary pneumonia classification by fine-tuning VGG-19, Inception_V2 and a decision tree model on an X-ray and CT scan image dataset containing 360 images.
Abstract: The novel coronavirus infection (COVID-19) that was first identified in China in December 2019 has spread across the globe rapidly, infecting over ten million people. The World Health Organization (WHO) declared it a pandemic on March 11, 2020. What makes it even more critical is the lack of vaccines available to control the disease, although many pharmaceutical companies and research institutions all over the world are working toward developing effective solutions to battle this life-threatening disease. X-ray and computed tomography (CT) image scanning is one of the most promising research areas; it can help in finding and providing early diagnoses of diseases and gives both quick and precise outcomes. In this study, convolutional neural networks are used for binary pneumonia classification by fine-tuning VGG-19, Inception_V2 and a decision tree model on an X-ray and CT scan image dataset containing 360 images. The results infer that the fine-tuned VGG-19 model shows highly satisfactory performance, with increasing training and validation accuracy (91%), outperforming the Inception_V2 (78%) and decision tree (60%) models.

98 citations


Journal ArticleDOI
TL;DR: The susceptibility mapping procedure is performed by testing three extensions of a decision tree model namely, Alternating Decision Tree (ADTree), Naive-Bayes tree (NBTree), and Logistic Model Tree (LMT) by dichotomizing the gully information over space into gully presence/absence conditions, which are further explored in their calibration and validation stages.
Abstract: Gully erosion is a disruptive phenomenon which extensively affects the Iranian territory, especially in the Northern provinces. A number of studies have recently been undertaken to study this process, to predict it over space and, ultimately, in a broader national effort, to limit its negative effects on local communities. We focused on the Bastam watershed, where 9.3% of the surface is currently affected by gullying. Machine learning algorithms are currently under the magnifying glass across the geomorphological community for their high predictive ability. However, unlike bivariate statistical models, their structure does not provide intuitive and quantifiable measures of environmental preconditioning factors. To cope with this weakness, we interpret preconditioning causes on the basis of a bivariate approach, namely the Index of Entropy, and we performed the susceptibility mapping procedure by testing three extensions of a decision tree model, namely the Alternating Decision Tree (ADTree), Naive-Bayes Tree (NBTree), and Logistic Model Tree (LMT). We dichotomized the gully information over space into gully presence/absence conditions, which we further explored in their calibration and validation stages. Since the presence/absence information and associated factors are identical, the resulting differences are due only to the algorithmic structures of the three models we chose. These differences are not significant in terms of performance; in fact, the three models produce outstanding predictive AUC measures (ADTree = 0.922; NBTree = 0.939; LMT = 0.944). However, the associated mapping results depict very different patterns, and only the LMT is associated with reasonable susceptibility patterns. This is a strong indication of which model combines the best performance and mapping for any natural hazard-oriented application.
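The AUC figures quoted above can be computed without a threshold sweep: AUC is the probability that a randomly chosen presence point is scored above a randomly chosen absence point. A minimal sketch with made-up scores (not the paper's data):

```python
def auc(scores, labels):
    # Rank-based AUC: fraction of (presence, absence) pairs where the
    # presence point gets the higher score; ties count half.
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical gully presence (1) / absence (0) labels and model scores
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.1]
print(round(auc(scores, labels), 3))  # 0.889
```

An AUC of 1.0 would mean every presence cell outranks every absence cell, which is the sense in which values above 0.9 are called outstanding.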

92 citations


Journal ArticleDOI
TL;DR: This paper proposes Pivot, a novel solution for privacy preserving vertical decision tree training and prediction, ensuring that no intermediate information is disclosed other than those the clients have agreed to release (i.e., the final tree model and the prediction output).
Abstract: Federated learning (FL) is an emerging paradigm that enables multiple organizations to jointly train a model without revealing their private data to each other. This paper studies vertical federated learning, which tackles the scenarios where (i) collaborating organizations own data of the same set of users but with disjoint features, and (ii) only one organization holds the labels. We propose Pivot, a novel solution for privacy-preserving vertical decision tree training and prediction, ensuring that no intermediate information is disclosed other than that which the clients have agreed to release (i.e., the final tree model and the prediction output). Pivot does not rely on any trusted third party and provides protection against a semi-honest adversary that may compromise $m-1$ out of $m$ clients. We further identify two privacy leakages when the trained decision tree model is released in plaintext and propose an enhanced protocol to mitigate them. The proposed solution can also be extended to tree ensemble models, e.g., random forest (RF) and gradient boosting decision tree (GBDT), by treating single decision trees as building blocks. Theoretical and experimental analyses suggest that Pivot is efficient for the privacy achieved.
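Pivot's threat model, privacy against any m-1 of m colluding clients, is the guarantee that m-out-of-m additive secret sharing provides. The paper's actual protocol combines threshold homomorphic encryption with secure multiparty computation; the sketch below shows only the additive-sharing idea, with all names invented for illustration:

```python
import random

PRIME = 2**61 - 1  # field modulus (illustrative choice)

def share(secret, m):
    # Split `secret` into m additive shares mod PRIME.
    # Any m-1 shares are uniformly random and reveal nothing;
    # all m shares together reconstruct the secret.
    shares = [random.randrange(PRIME) for _ in range(m - 1)]
    shares.append((secret - sum(shares)) % PRIME)
    return shares

def reconstruct(shares):
    return sum(shares) % PRIME

shares = share(1234, 4)        # distribute among 4 clients
assert reconstruct(shares) == 1234
```

In Pivot the values being protected are intermediate statistics of tree training (e.g., split gains), not the raw secret shown here.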

87 citations


Journal ArticleDOI
TL;DR: This study proposed a series of methods to select the optimal feature domain to improve land cover classification in a complex urbanized coastal area and found that compared to the traditional band-only model, the variable selection process can significantly improve the model parsimony and computational efficiency.

79 citations


Journal ArticleDOI
TL;DR: The experimental results show that the proposed BehavDT context-aware model is more effective than traditional machine learning approaches in predicting diverse user behaviors across multi-dimensional contexts.
Abstract: This paper formulates the problem of building a context-aware predictive model based on users' diverse behavioral activities with smartphones. In the area of machine learning and data science, a tree-like model such as the decision tree is considered one of the most popular classification techniques, which can be used to build a data-driven predictive model. The traditional decision tree model typically creates a number of leaf nodes as decision nodes that represent context-specific rigid decisions, and consequently may cause an overfitting problem in behavior modeling. However, in many practical scenarios within the context-aware environment, generalized outcomes could play an important role in effectively capturing user behavior. In this paper, we propose a behavioral decision tree, the “BehavDT” context-aware model, that takes into account user behavior-oriented generalization according to individual preference level. The BehavDT model outputs not only the generalized decisions but also the context-specific decisions in relevant exceptional cases. The effectiveness of our BehavDT model is studied by conducting experiments on real smartphone datasets of individual users. Our experimental results show that the proposed BehavDT context-aware model is more effective than traditional machine learning approaches in predicting diverse user behaviors across multi-dimensional contexts.

75 citations



Journal ArticleDOI
TL;DR: This work demonstrated that HSI coupled with intelligent algorithms could be successfully applied as a rapid and effective strategy to accurately identify the rank quality of black tea.

43 citations


Journal ArticleDOI
TL;DR: One of the main messages of this paper is that far fewer samples are needed than for recovering the underlying tree, which means that accurate predictions are possible using the wrong tree.
Abstract: We study the problem of learning a tree Ising model from samples such that subsequent predictions made using the model are accurate. The prediction task considered in this paper is that of predicting the values of a subset of variables given values of some other subset of variables. Virtually all previous work on graphical model learning has focused on recovering the true underlying graph. We define a distance (“small set TV” or ssTV) between distributions $P$ and $Q$ by taking the maximum, over all subsets $\mathcal{S}$ of a given size, of the total variation between the marginals of $P$ and $Q$ on $\mathcal{S}$; this distance captures the accuracy of the prediction task of interest. We derive nonasymptotic bounds on the number of samples needed to get a distribution (from the same class) with small ssTV relative to the one generating the samples. One of the main messages of this paper is that far fewer samples are needed than for recovering the underlying tree, which means that accurate predictions are possible using the wrong tree.
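The ssTV distance is concrete enough to compute directly for tiny distributions. A sketch (with illustrative distributions, not the paper's data): take the maximum, over all size-k subsets of variables, of the total variation between the marginals. It also shows how two distributions can agree on every small marginal while differing jointly, which is the sense in which "the wrong tree" can still predict well:

```python
from itertools import combinations

def marginal(p, subset):
    # Marginal of distribution p (dict: value tuple -> probability)
    # on the given index subset.
    m = {}
    for x, px in p.items():
        key = tuple(x[i] for i in subset)
        m[key] = m.get(key, 0.0) + px
    return m

def sstv(p, q, n, k):
    # Small-set TV: max over size-k subsets S of TV(p_S, q_S).
    best = 0.0
    for s in combinations(range(n), k):
        mp, mq = marginal(p, s), marginal(q, s)
        keys = set(mp) | set(mq)
        tv = 0.5 * sum(abs(mp.get(x, 0.0) - mq.get(x, 0.0)) for x in keys)
        best = max(best, tv)
    return best

# Two toy distributions on {0,1}^2
p = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
q = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}
print(round(sstv(p, q, n=2, k=1), 3))  # 0.0: all single-variable marginals agree
print(round(sstv(p, q, n=2, k=2), 3))  # 0.3: the joint distributions differ
```

Here every single-variable prediction made under q matches p exactly, even though p and q are different distributions.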

42 citations


Journal ArticleDOI
01 Apr 2020-Symmetry
TL;DR: The experimental results on smartphone apps usage datasets show that “ContextPCA” model effectively predicts context-aware smartphone apps in terms of precision, recall, f-score and ROC values in various test cases.
Abstract: This paper mainly formulates the problem of predicting context-aware smartphone app usage based on machine learning techniques. In the real world, people use various kinds of smartphone apps differently in different contexts, including both the user-centric context and the device-centric context. In the area of artificial intelligence and machine learning, the decision tree model is one of the most popular approaches for predicting context-aware smartphone usage. However, real-life smartphone app usage data may contain higher-dimensional contexts, which may cause several issues such as increased model complexity, overfitting, and consequently decreased prediction accuracy of the context-aware model. In order to address these issues, in this paper we present an effective principal component analysis (PCA) based context-aware smartphone app prediction model, “ContextPCA”, using the decision tree machine learning classification technique. PCA is an unsupervised machine learning technique that can be used to separate symmetric and asymmetric components, and it has been adopted in our “ContextPCA” model in order to reduce the context dimensions of the original data set. The experimental results on smartphone app usage datasets show that the “ContextPCA” model effectively predicts context-aware smartphone apps in terms of precision, recall, F-score and ROC values in various test cases.
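PCA's role here is to project correlated context features onto a few directions of maximal variance before the decision tree is trained. A dependency-free sketch of extracting the first principal component via power iteration (toy data, not the paper's):

```python
def pca_first_component(X, iters=200):
    # First principal component via power iteration on the
    # sample covariance matrix (pure-Python illustration).
    n, d = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(d)]
    Xc = [[row[j] - means[j] for j in range(d)] for row in X]
    C = [[sum(Xc[i][a] * Xc[i][b] for i in range(n)) / (n - 1)
          for b in range(d)] for a in range(d)]
    v = [1.0] * d
    for _ in range(iters):
        w = [sum(C[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v

# Toy "context" data where the two features are strongly correlated,
# so a single component captures most of the variance.
X = [[1.0, 1.1], [2.0, 1.9], [3.0, 3.2], [4.0, 3.9], [5.0, 5.1]]
v = pca_first_component(X)  # roughly (0.707, 0.707)
```

In practice one would use a library implementation (e.g. scikit-learn's `PCA`) rather than hand-rolled power iteration; the point is only that highly correlated context dimensions collapse onto few components.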

34 citations


Journal ArticleDOI
TL;DR: In this paper, an end-to-end trainable unified model is presented that leverages the appealing properties of autoencoders and random forests to evaluate the quality of products by distinguishing spam reviews.

30 citations


Proceedings ArticleDOI
30 Oct 2020
TL;DR: The study of zero knowledge machine learning is initiated and protocols for zero knowledge decision tree predictions and accuracy tests are proposed, which allow the owner of a decision tree model to convince others that the model computes a prediction on a data sample, or achieves a certain accuracy on a public dataset without leaking any information about the model itself.
Abstract: Machine learning has become increasingly prominent and is widely used in various applications in practice. Despite its great success, the integrity of machine learning predictions and accuracy is a rising concern. The reproducibility of machine learning models that are claimed to achieve high accuracy remains challenging, and the correctness and consistency of machine learning predictions in real products lack any security guarantees. In this paper, we initiate the study of zero knowledge machine learning and propose protocols for zero knowledge decision tree predictions and accuracy tests. The protocols allow the owner of a decision tree model to convince others that the model computes a prediction on a data sample, or achieves a certain accuracy on a public dataset, without leaking any information about the model itself. We develop approaches to efficiently turn decision tree predictions and accuracy into statements of zero knowledge proofs. We implement our protocols and demonstrate their efficiency in practice. For a decision tree model with 23 levels and 1,029 nodes, it only takes 250 seconds to generate a zero knowledge proof proving that the model achieves high accuracy on a dataset of 5,000 samples and 54 attributes, and the proof size is around 287 kilobytes.
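The protocols rest on the model owner first committing to the tree and then proving statements about the committed model. The full construction uses succinct zero-knowledge proofs; the fragment below shows only the commit-and-open step with a simple hash commitment (all names and the tree encoding are illustrative):

```python
import hashlib
import json
import os

def commit(model):
    # Hash commitment to a serialized decision tree: the digest can be
    # published without revealing the model; the nonce keeps it hiding.
    nonce = os.urandom(16).hex()
    payload = (nonce + json.dumps(model, sort_keys=True)).encode()
    return hashlib.sha256(payload).hexdigest(), nonce

def verify(digest, nonce, model):
    # Opening the commitment: recompute and compare.
    payload = (nonce + json.dumps(model, sort_keys=True)).encode()
    return digest == hashlib.sha256(payload).hexdigest()

tree = {"feature": 0, "threshold": 2.5,
        "left": {"label": 0}, "right": {"label": 1}}
digest, nonce = commit(tree)
assert verify(digest, nonce, tree)
```

A hash commitment alone is binding and hiding, but proving that a prediction or an accuracy figure is consistent with the committed tree without ever opening it is exactly what the paper's zero-knowledge machinery adds on top.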

Journal ArticleDOI
TL;DR: The study concludes that the two models make modelling of uncertainty in the credit scoring process possible; fuzzy logic is more accurate for modelling the uncertainty, but the decision tree model is more favourable for the presentation of the problem.
Abstract: Among the numerous alternatives used in the world of risk balancing, the provision of guarantees in the formalization of credit agreements stands out. The objective of this paper is to compare the performance of fuzzy sets with that of artificial neural network-based decision trees on credit scoring to predict the recovered value, using a sample of 1890 borrowers. Compared with fuzzy logic, the decision-analytic approach can more easily present the outcomes of the analysis. On the other hand, fuzzy logic makes some implicit assumptions that may make it even harder for credit grantors to follow the logical decision-making process. This paper presents an initial study of collateral as a variable in the calculation of credit scores. The study concludes that the two models make modelling of uncertainty in the credit scoring process possible. Although more difficult to implement, fuzzy logic is more accurate for modelling the uncertainty. However, the decision tree model is more favourable for the presentation of the problem.

Journal ArticleDOI
TL;DR: A new model for nondestructive estimation of tree volume, above-ground biomass (AGB) or carbon stock based on LiDAR data is provided, and its estimates are in good agreement with reference values based on field survey data.
Abstract: Tree-level information can be estimated based on light detection and ranging (LiDAR) point clouds. We propose to develop a quantitative structural model based on terrestrial laser scanning (TLS) point clouds to automatically and accurately estimate tree attributes and to detect real trees for the first time. This model is suitable for forest research where branches are involved in the calculation. First, the AdTree method was used to approximate the geometry of the tree stem and branches by fitting a series of cylinders, so that trees were represented as a broad set of cylinders. Then, the ends of the stem and all branches were closed, turning the model from a set of cylinders into a closed convex-hull polyhedron and thereby reconstructing a 3D model of the tree. Finally, to extract effective tree attributes from the reconstructed 3D model, a convex-hull polyhedron calculation method based on the tree model was defined. This calculation method can be used to extract wood (including tree stem and branch) volume, diameter at breast height (DBH) and tree height. To verify the accuracy of the tree attributes extracted from the model, the tree models of 153 Chinese scholar trees from TLS data were reconstructed and the tree volume, DBH and tree height were extracted from the model. The experimental results show that the DBH and tree height extracted based on this model are in good agreement with the reference values based on field survey data. The bias, root mean square error (RMSE) and determination coefficient (R2) of DBH were 0.38 cm, 1.28 cm and 0.92, respectively. The bias, RMSE and R2 of tree height were −0.76 m, 1.21 m and 0.93, respectively. The tree volume extracted from the model is also in good agreement with the reference values: the bias, RMSE and R2 of tree volume were −0.01236 m3, 0.03498 m3 and 0.96, respectively. This study provides a new model for nondestructive estimation of tree volume, above-ground biomass (AGB) or carbon stock based on LiDAR data.

Journal ArticleDOI
TL;DR: A deep fuzzy tree model is proposed which learns a better tree structure and classifiers for hierarchical classification with theoretical guarantees; experimental results show the effectiveness and efficiency of the proposed model on various visual classification datasets.
Abstract: Deep learning models often use a flat softmax layer to classify samples after feature extraction in visual classification tasks. However, it is hard to make a single decision to find the true label among massive classes. In this scenario, hierarchical classification has proved to be an effective solution and can be used to replace the softmax layer. A key issue in hierarchical classification is constructing a good label structure, which is very significant for classification performance. Several works have been proposed to address this issue, but they have some limitations and are almost all designed heuristically. In this article, inspired by fuzzy rough set theory, we propose a deep fuzzy tree model which learns a better tree structure and classifiers for hierarchical classification with theoretical guarantees. Experimental results show the effectiveness and efficiency of the proposed model on various visual classification datasets.

Journal ArticleDOI
TL;DR: Combining connected-vehicle (CV) data (V2V and V2I) with deep learning networks is promising for determining crash risks at intersections with high time efficiency and at low CV penetration rates, which helps to deploy countermeasures that reduce crash rates and resolve traffic safety problems.

Journal ArticleDOI
TL;DR: The decision tree algorithm can be successfully applied as an alternative for the determination of the potential pathogenicity of VUS, producing consistently relevant forecasts for the sample tests with an accuracy close to the best achieved by supervised ML algorithms.
Abstract: A variant of unknown significance (VUS) is a variant form of a gene that has been identified through genetic testing but whose significance to organism function is not known. A current challenge in precision medicine is to precisely identify which detected mutations from a sequencing process have a suitable role in the treatment or diagnosis of a disease. The average accuracy of pathogenicity predictors is 85%. However, there is significant discordance among them about the identification of mutational impact and pathogenicity. Therefore, manual verification is necessary to confirm the real effect of a mutation in each case. In this work, we use variable categorization and selection to build a decision tree model, and later we measure and compare its accuracy with four known mutation predictors and seventeen supervised machine-learning (ML) algorithms. The results showed that the proposed tree reached the highest precision among all tested alternatives: 91% for true neutrals, 8% for false neutrals, 9% for false pathogenics, and 92% for true pathogenics. The decision tree demonstrated exceptionally high classification precision with cancer data, producing consistently relevant forecasts for the sample tests with an accuracy close to the best achieved by the supervised ML algorithms. Besides, the decision tree algorithm is easier for non-IT experts to apply in clinical practice. From the cancer research community's perspective, this approach can be successfully applied as an alternative for the determination of the potential pathogenicity of VUS.
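The reported rates (92% true pathogenic, 91% true neutral, with 8%/9% errors) translate directly into standard confusion-matrix metrics. A small sketch using those percentages as illustrative counts per 100 variants of each class:

```python
def metrics(tp, fp, fn, tn):
    # Precision, recall and accuracy from a binary confusion matrix.
    precision = tp / (tp + fp)          # of predicted-pathogenic, how many truly are
    recall = tp / (tp + fn)             # of truly pathogenic, how many were caught
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, accuracy

# Illustrative counts per 100 pathogenic / 100 neutral variants,
# mirroring the ~92% true-pathogenic and ~91% true-neutral rates above
p, r, a = metrics(tp=92, fp=9, fn=8, tn=91)
print(round(p, 3), round(r, 3), round(a, 3))  # 0.911 0.92 0.915
```

This also shows why both error columns matter clinically: false neutrals (missed pathogenic variants) lower recall, while false pathogenics lower precision.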

Journal ArticleDOI
30 Nov 2020
TL;DR: This research proposes using two different machine learning algorithms (random forest and decision tree (J48)) to detect fake news, evaluated on the full dataset and a held-out test sample.
Abstract: Fake news is one of the most popular phenomena that has considerable effects on our social life, especially in the political domain. Nowadays, creating fake news has become very easy because of users' widespread use of the internet and social media. Therefore, the detection of elusive news is a crucial problem that needs to be considered, mainly because of challenges like the limited number of benchmark datasets and the amount of news published every second. This research proposes using two different machine learning algorithms (random forest and decision tree (J48)) to detect fake news. In this paper, the full dataset comprises 20,761 samples, while the test sample comprises 4,345 samples. The preprocessing steps start with cleaning the data by removing unnecessary special characters, numbers, English letters and white spaces, and finally stop-word removal is applied. After that, the most popular feature extraction method (TF-IDF) is used before applying the two suggested classification algorithms. The results show that the best accuracy achieved equals 89.11% using the decision tree model, while using the random forest the accuracy achieved equals 84.97%.
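TF-IDF, the feature-extraction step used here before classification, weights a term by how often it appears in a document and how rare it is across the corpus. A minimal textbook-form sketch (library implementations such as scikit-learn's `TfidfVectorizer` add smoothing and normalization; the toy documents are invented):

```python
import math
from collections import Counter

def tfidf(docs):
    # docs: list of token lists. Returns one {term: weight} dict per document,
    # with weight = term frequency * log(inverse document frequency).
    n = len(docs)
    df = Counter(t for doc in docs for t in set(doc))  # document frequency
    out = []
    for doc in docs:
        tf = Counter(doc)
        out.append({t: (tf[t] / len(doc)) * math.log(n / df[t]) for t in tf})
    return out

docs = [["fake", "news", "story"],
        ["real", "news", "report"],
        ["fake", "claim"]]
weights = tfidf(docs)  # "story" outweighs the corpus-wide term "news"
```

Terms that appear in every document get weight log(1) = 0, which is precisely why TF-IDF suppresses uninformative common words before the classifiers are trained.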

Journal ArticleDOI
22 Oct 2020-Sensors
TL;DR: By a combination of spectral analysis and the application of decision trees to a set of spectral features, this paper is able to take advantage of the multidimensionality of diagnostic data and classify/recognize the gearbox condition almost faultlessly even in non-stationary operating conditions.
Abstract: Monitoring the condition of rotating machinery, especially planetary gearboxes, is a challenging problem. In most of the available approaches, diagnostic procedures are related to advanced signal pre-processing/feature extraction methods or advanced data (features) analysis by using artificial intelligence. In this paper, the second approach is explored, so an application of decision trees for the classification of spectral-based 15D vectors of diagnostic data is proposed. The novelty of this paper is that by a combination of spectral analysis and the application of decision trees to a set of spectral features, we are able to take advantage of the multidimensionality of diagnostic data and classify/recognize the gearbox condition almost faultlessly even in non-stationary operating conditions. The diagnostics of time-varying systems are a complicated issue due to time-varying probability densities estimated for features. Using multidimensional data instead of an aggregated 1D feature, it is possible to improve the efficiency of diagnostics. It can be underlined that in comparison to previous work related to the same data, where the aggregated 1D variable was used, the efficiency of the proposed approach is around 99% (ca. 19% better). We tested several algorithms: classification and regression trees with the Gini index and entropy, as well as the random tree. We compare the obtained results with the K-nearest neighbors classification algorithm and meta-classifiers, namely: random forest and AdaBoost. As a result, we created the decision tree model with 99.74% classification accuracy on the test dataset.

Journal ArticleDOI
03 Dec 2020-PLOS ONE
TL;DR: It is shown that the selected algorithm can effectively filter the features, which simplifies the complexity of the model to a certain extent and improves the classification accuracy of machine learning.
Abstract: In recent years, China's e-commerce industry has developed at high speed, and the scale of various industries has continued to expand. Service-oriented enterprises for e-commerce transactions and information technology came into being. This paper analyzes the shortcomings and challenges of traditional online shopping behavior prediction methods and proposes an online shopping behavior analysis and prediction system. The paper chooses the linear logistic regression model and the decision-tree-based XGBoost model. After optimizing the models, it is found that the nonlinear model can make better use of the features and obtain better prediction results. In this paper, we first build the single models and then use a model fusion algorithm to fuse their prediction results, the purpose being to avoid the under-fitting of the linear model and the over-fitting of the decision tree model. The results show that the fused model improves further on any single model. Finally, through two sets of contrast experiments, it is shown that the selected algorithm can effectively filter the features, which simplifies the complexity of the model to a certain extent and improves the classification accuracy of machine learning. The XGBoost hybrid model based on p/n samples is simpler than a single model, and such models are not easily over-fitted and are therefore more robust.

Journal ArticleDOI
Jiawei Li1, Yiming Li1, Xingchun Xiang1, Shu-Tao Xia1, Siyi Dong, Yun Cai 
24 Oct 2020-Entropy
TL;DR: A Tree-Network-Tree (TNT) learning framework for explainable decision-making, where the knowledge is alternately transferred between the tree model and DNNs is proposed, and extensive experiments demonstrated the effectiveness of the proposed method.
Abstract: Deep Neural Networks (DNNs) usually work in an end-to-end manner. This makes the trained DNNs easy to use, but the decision process remains ambiguous for every test case. Unfortunately, the interpretability of decisions is crucial in some scenarios, such as medical or financial data mining and decision-making. In this paper, we propose a Tree-Network-Tree (TNT) learning framework for explainable decision-making, where the knowledge is alternately transferred between the tree model and DNNs. Specifically, the proposed TNT learning framework exerts the advantages of different models at different stages: (1) a novel James–Stein Decision Tree (JSDT) is proposed to generate better knowledge representations for DNNs, especially when the input data are low-frequency or low-quality; (2) the DNNs output high-performing prediction results from the knowledge-embedding inputs and behave as a teacher model for the following tree model; and (3) a novel distillable Gradient Boosted Decision Tree (dGBDT) is proposed to learn interpretable trees from the soft labels and make predictions comparable to those of the DNNs. Extensive experiments on various machine learning tasks demonstrated the effectiveness of the proposed method.

Journal ArticleDOI
TL;DR: An application is presented that uses academic information provided by the university and generates classification models from three different algorithms (artificial neural networks, ID3 and C4.5); it is concluded that the ratio of the credits a student has approved to the credits they should have taken is the most significant variable.
Abstract: Academic performance is a topic studied not only to identify students who could drop out of their studies, but also to classify them according to the type of academic risk in which they could find themselves. An application has been implemented that uses academic information provided by the university and generates classification models from three different algorithms: artificial neural networks, ID3 and C4.5. The models created use a set of variables and criteria for their construction and can be used to classify student desertion and, more specifically, to predict the type of academic risk. The performance of these models was compared to determine which provided the best results and would serve to classify students. The decision tree algorithms, C4.5 and ID3, presented better measures than the artificial neural network. The tree generated using the C4.5 algorithm presented the best performance metrics, with correctness, accuracy and sensitivity equal to 0.83, 0.87 and 0.90, respectively. As a result of the classification to determine student desertion, it was concluded, according to the model generated using the C4.5 algorithm, that the ratio of the credits approved by a student to the credits that they should have taken is the most significant variable. The classification by type of academic risk generated a tree model indicating that the number of abandoned subjects is the most significant variable. The admission modality through which the student entered the university did not turn out to be significant, as it does not appear in the generated decision tree.
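C4.5 chooses splits by gain ratio, i.e., information gain normalized by the entropy of the split itself, which is how a strongly predictive variable like the approved-to-required credit ratio rises to the root of the tree. A sketch with hypothetical dropout labels (not the study's data):

```python
import math
from collections import Counter

def entropy(values):
    # Shannon entropy of a label sequence, in bits.
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def gain_ratio(labels, split):
    # C4.5 criterion: information gain divided by the split's own entropy,
    # which penalizes attributes that fragment the data into many branches.
    n = len(labels)
    parts = {}
    for y, s in zip(labels, split):
        parts.setdefault(s, []).append(y)
    gain = entropy(labels) - sum(len(p) / n * entropy(p) for p in parts.values())
    split_info = entropy(split)
    return gain / split_info if split_info else 0.0

# Hypothetical data: dropout (1) vs. retained (0), split by credit-ratio band
labels = [1, 1, 1, 0, 0, 0]
split  = ["low", "low", "low", "high", "high", "high"]
print(gain_ratio(labels, split))  # 1.0: a perfectly predictive split
```

ID3 uses raw information gain instead; the split-entropy denominator is the practical difference that makes C4.5 less biased toward many-valued attributes.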

Book ChapterDOI
01 Jan 2020
TL;DR: A sensitive pruning-based decision tree is proposed to tackle the privacy issues in this domain; the pruning algorithm is a modification of the C4.8 decision tree (better known as J48 in the Weka package).
Abstract: Machine learning techniques have been extensively adopted in the domain of Network-based Intrusion Detection Systems (NIDS), especially for the task of network traffic classification. A decision tree model, with its parent/child (kinship) node terminology, is very suitable for this application: the merit of its straightforward and simple “if-else” rules makes the interpretation of network traffic easier. Despite its powerful classification and interpretation capacities, the visibility of its tree rules introduces a new privacy risk to NIDS, as it reveals the network posture of the owner. In this paper, we propose a sensitive pruning-based decision tree to tackle the privacy issues in this domain. The proposed pruning algorithm is a modification of the C4.8 decision tree (better known as J48 in the Weka package). The proposed model is tested on the 6 percent version of the GureKDDCup NIDS dataset.

Journal ArticleDOI
01 Jan 2020
TL;DR: This paper proposes a design pattern detection approach based on tree-based machine learning algorithms and software metrics to study the effectiveness of software metrics in distinguishing between similar structural design patterns.
Abstract: Design patterns are general reusable solutions for recurrent problems. Software systems become more complicated when design patterns are not documented, and maintenance and evolution costs become a challenge. Design pattern detection is used to reduce complexity and to increase the understandability of the software design. In this paper, we propose a design pattern detection approach based on tree-based machine learning algorithms and software metrics to study the effectiveness of software metrics in distinguishing between structurally similar design patterns. We built our datasets using the P-MARt repository by extracting the roles of design patterns and calculating the metrics for each role. We used parameter optimization techniques based on the grid search algorithm to define the optimal parameters of each algorithm, and two feature selection methods based on a genetic algorithm to find the features that contribute most to the distinguishing process. Through our experimental study, we showed the effectiveness of machine learning and software metrics in distinguishing structurally similar design patterns. Moreover, we extracted the essential metrics in each dataset that supported the machine learning model in making its decision, and we presented the detection conditions for each role in the design pattern by extracting them from the decision tree model.

Journal ArticleDOI
TL;DR: In this research, a cost-sensitive C5.0 decision tree was used to solve multiclass imbalanced data problems and performed better than the C4.5 and ID3 algorithms.
Abstract: The multiclass imbalanced data problem is currently an interesting topic of study in data mining. The problem affects the classification process in machine learning: in some cases the minority class in a dataset carries more important information than the majority class, and when the minority class is misclassified, the accuracy value and classifier performance suffer. In this research, a cost-sensitive C5.0 decision tree was used to solve multiclass imbalanced data problems. In the first stage, the decision tree model is built using the C5.0 algorithm; cost-sensitive learning then uses the MetaCost method to obtain the minimum-cost model. Test results showed that the C5.0 algorithm performed better than the C4.5 and ID3 algorithms, with performance percentages of 40.91%, 40.24% and 19.23%, respectively.
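The core of MetaCost is a relabeling step: each training example is reassigned to the class that minimizes its expected misclassification cost, given estimated class probabilities and a cost matrix, before the final tree is retrained. A minimal sketch of that step, with an invented cost matrix and probabilities:

```python
# MetaCost-style relabeling: pick argmin_i sum_j P(j|x) * cost[i][j]
# for each example, where i is the predicted class and j the true class.

def metacost_relabel(class_probs, cost):
    """Return the minimum-expected-cost class for each example."""
    relabeled = []
    for probs in class_probs:
        risks = [sum(p * cost[i][j] for j, p in enumerate(probs))
                 for i in range(len(cost))]
        relabeled.append(min(range(len(risks)), key=risks.__getitem__))
    return relabeled

# Cost of predicting the row class when the column class is true:
# missing the minority class (2) is ten times worse than other errors.
cost = [[0, 1, 10],
        [1, 0, 10],
        [1, 1,  0]]
probs = [[0.6, 0.3, 0.1],   # leans toward class 0, but class-2 risk dominates
         [0.1, 0.2, 0.7]]
labels = metacost_relabel(probs, cost)   # both examples relabeled to class 2
```

Even the first example, whose most probable class is 0, is relabeled to the minority class because its expected cost (0.9) is below that of predicting class 0 (1.3); training on the relabeled data biases the tree toward the expensive-to-miss class.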

Journal ArticleDOI
TL;DR: The research results showed that the highest accuracy is obtained using tree-based classifiers, and that the best algorithm of this type for prediction is gradient boosted trees.
Abstract: In this paper, the flight time deviation of Lithuanian airports has been analyzed. A supervised machine learning model has been implemented to predict the time delay deviation interval of new flights. The analysis has been made using seven algorithms: probabilistic neural network, multilayer perceptron, decision trees, random forest, tree ensemble, gradient boosted trees, and support vector machines. To find the parameters that give the highest accuracy for each algorithm, grid search has been used. To evaluate the quality of each algorithm, five measures have been calculated: sensitivity/recall, precision, specificity, F-measure, and accuracy. All experiments were carried out using a newly collected dataset from Lithuanian airports together with weather information at departure/landing time. Departure flights and arrival flights were investigated separately. To balance the dataset, the SMOTE technique was used. The research results showed that the highest accuracy is obtained using tree-based classifiers, and that the best algorithm of this type for prediction is gradient boosted trees.
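The five quality measures named above all follow from binary confusion-matrix counts. A self-contained sketch, with invented counts:

```python
# Sensitivity/recall, precision, specificity, F-measure, and accuracy
# from true/false positive/negative counts.

def measures(tp, fp, tn, fn):
    recall = tp / (tp + fn)                 # sensitivity
    precision = tp / (tp + fp)
    specificity = tn / (tn + fp)
    f_measure = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return recall, precision, specificity, f_measure, accuracy

# e.g. a delayed-flight classifier evaluated on 100 flights
rec, prec, spec, f1, acc = measures(tp=40, fp=10, tn=45, fn=5)
```

Reporting all five matters here because, with SMOTE-balanced training but naturally imbalanced test traffic, accuracy alone can hide poor recall on the delayed class.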

Journal ArticleDOI
01 Mar 2020
TL;DR: Several dimensionality reduction algorithms are used with a Decision Tree as classifier, and all models that implement dimensionality reduction significantly improve the performance of the Decision Tree model.
Abstract: The complexity of software can increase the possibility of defects, and software containing defects can cause large losses. Most software developers do not document their work properly, making it difficult to analyse software development history data. Cross-project software defect prediction uses several datasets from different projects, combined for training and testing. A dataset with high dimensionality can cause bias, contain irrelevant data, and require large resources to process. In this study, several dimensionality reduction algorithms were combined with a Decision Tree classifier. Based on analysis using ANOVA, all models that implement dimensionality reduction significantly improve the performance of the Decision Tree model.
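The significance claim above rests on ANOVA. As a reminder of what that test computes, here is a minimal one-way ANOVA F-statistic over per-model performance samples; the scores below are invented, not the study's results.

```python
# One-way ANOVA: F = between-group mean square / within-group mean square.

def one_way_anova_f(groups):
    n = sum(len(g) for g in groups)
    k = len(groups)
    grand = sum(sum(g) for g in groups) / n
    ss_between = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups)
    ss_within = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# e.g. Decision Tree scores without vs. with dimensionality reduction
f = one_way_anova_f([[0.70, 0.72, 0.71],
                     [0.80, 0.82, 0.81]])
```

A large F (here 150) means the between-model variation dwarfs the within-model variation, which is then compared against the F-distribution's critical value for the chosen significance level.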

Journal ArticleDOI
TL;DR: In this article, the authors presented a methodology to establish incident duration estimation models by utilizing decision tree models of CHAID, CART, C4.5 and LMT.
Abstract: Unexpected events such as crashes, disabled vehicles, flat tires and spilled loads cause traffic congestion or extend its duration on roadways. It is possible to reduce the effects of such incidents by implementing intelligent transportation systems solutions, which require estimation of the incident duration to identify well-fitted strategies. This paper presents a methodology to establish incident duration estimation models by utilizing the decision tree models CHAID, CART, C4.5 and LMT. For this study, data on traffic incidents that occurred on the Istanbul Trans-European Motorway were obtained and separated into three groups according to duration, drawing on previous studies on the classification of traffic incidents. Using the classified data, decision tree models of CHAID, CART, C4.5 and LMT were established and validated to estimate the incident duration. According to the results, although the models used different variables, the CHAID, CART and C4.5 decision tree models have nearly the same prediction accuracy, approximately 74%. The prediction accuracy of the LMT decision tree model is 75.4%, somewhat better than the others, while the C4.5 model required fewer parameters than the others at the same accuracy.
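At the heart of CART (and, with entropy instead of Gini, of C4.5) is a search for the split threshold with the lowest weighted impurity. A minimal sketch over an invented incident table, where lanes blocked predicts a duration class:

```python
# CART-style split search: choose the threshold t on a numeric feature
# that minimizes the weighted Gini impurity of {x <= t} vs. {x > t}.

def gini(labels):
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(xs, ys):
    """Return (threshold, weighted impurity) minimizing Gini."""
    best = (None, float("inf"))
    for t in sorted(set(xs))[:-1]:          # candidate thresholds
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(ys)
        if score < best[1]:
            best = (t, score)
    return best

lanes = [0, 1, 1, 2, 3, 3]                              # lanes blocked
label = ["short", "short", "short", "long", "long", "long"]
threshold, impurity = best_split(lanes, label)          # perfect split at 1
```

Repeating this search recursively on each resulting partition, over all candidate features, yields the full tree.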

Journal ArticleDOI
TL;DR: By applying simple and cost-effective classification rules, the decision tree model estimates the development of diabetes in a high-risk adult Chinese population with strong potential for implementation of diabetes management.
Abstract: Background Predicting and making an early diagnosis of diabetes is a critical approach in a population at high risk of diabetes, one of the most devastating diseases globally. Traditional and conventional blood tests are recommended for screening suspected patients; however, these tests can have health side effects and are expensive. The goal of this study was to establish a simple and reliable predictive model based on the risk factors associated with diabetes using a decision tree algorithm. Methods This was a retrospective cross-sectional study. A total of 10,436 participants who had a health check-up from January 2017 to July 2017 were recruited. With appropriate data mining approaches, 3454 participants remained in the final dataset for further analysis. Seventy percent of these participants (2420 cases) were then randomly allocated to the training dataset for construction of the decision tree, and the remainder (30%, 1034 cases) to the testing dataset for evaluation of its performance. For this purpose, the cost-sensitive J48 algorithm was used to develop the decision tree model. Results Utilizing all the key features of the dataset, consisting of 14 input variables and two output variables, the constructed decision tree model identified several key factors that are closely linked to the development of diabetes and are also modifiable. Furthermore, our model achieved a classification accuracy of 90.3% with a precision of 89.7% and a recall of 90.3%. Conclusion By applying simple and cost-effective classification rules, our decision tree model estimates the development of diabetes in a high-risk adult Chinese population with strong potential for implementation in diabetes management.
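The 70/30 random allocation described above can be sketched with a seeded shuffle; the record fields are placeholders, and a generic cut like this will not reproduce the study's exact 2420/1034 counts, which depend on their own allocation procedure.

```python
# Reproducible 70/30 train/test allocation of a participant list.
import random

def split_70_30(records, seed=42):
    rng = random.Random(seed)       # fixed seed so the split is repeatable
    shuffled = records[:]
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * 0.7)
    return shuffled[:cut], shuffled[cut:]

# 3454 placeholder participant records
records = [{"id": i, "bmi": 20 + i % 15} for i in range(3454)]
train, test = split_70_30(records)
```

Shuffling before cutting matters: check-up records are often ordered by date, and a date-ordered cut would leak seasonal structure into the evaluation.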

Journal ArticleDOI
17 Jan 2020-PLOS ONE
TL;DR: An alternative EEG signal characterization using graph metrics and, based on such features, a classification analysis using a decision tree model is introduced to identify group differences in brain connectivity networks with respect to mathematical skills in elementary school children.
Abstract: Recent studies aiming to facilitate mathematical skill development in primary school children have explored the electrophysiological characteristics associated with different levels of arithmetic achievement. The present work introduces an alternative EEG signal characterization using graph metrics and, based on such features, a classification analysis using a decision tree model. This proposal aims to identify group differences in brain connectivity networks with respect to mathematical skills in elementary school children. The analysis comprised signal processing (EEG artifact removal, Laplacian filtering, and magnitude-squared coherence measurement), followed by characterization (graph metrics) and classification (decision tree) of EEG signals recorded during performance of a numerical comparison task. Our results suggest that the analysis of quantitative EEG frequency-band parameters can be used successfully to discriminate several levels of arithmetic achievement. Specifically, the most significant results showed an accuracy of 80.00% (α band), 78.33% (δ band), and 76.67% (θ band) in differentiating high-skilled participants from low-skilled ones, average-skilled subjects from all others, and average-skilled participants from low-skilled ones, respectively. The use of a decision tree tool during the classification stage allows the identification of several brain areas that seem to be more specialized in numerical processing.
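The graph-metric step above turns a magnitude-squared coherence matrix into a network. A minimal sketch: threshold coherence into an adjacency matrix, then compute node degree and graph density; the 4-electrode coherence values below are invented.

```python
# Coherence matrix -> thresholded adjacency -> simple graph metrics.

def adjacency(coh, thr):
    n = len(coh)
    return [[1 if i != j and coh[i][j] >= thr else 0
             for j in range(n)] for i in range(n)]

def degrees(adj):
    return [sum(row) for row in adj]

def density(adj):
    n = len(adj)
    edges = sum(sum(row) for row in adj) / 2        # undirected graph
    return edges / (n * (n - 1) / 2)

coh = [[1.0, 0.8, 0.2, 0.7],     # symmetric coherence between 4 electrodes
       [0.8, 1.0, 0.6, 0.1],
       [0.2, 0.6, 1.0, 0.3],
       [0.7, 0.1, 0.3, 1.0]]
adj = adjacency(coh, thr=0.5)
deg = degrees(adj)
dens = density(adj)
```

Per-band vectors of such metrics (one per electrode or per network) are what the decision tree then classifies.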

Journal ArticleDOI
TL;DR: This work applies tools from constraint satisfaction to learn optimal decision trees in the form of sparse k-CNF (Conjunctive Normal Form) rules, which are significantly more accurate than those learned by existing heuristic approaches.
Abstract: Decision trees are a popular choice for providing explainable machine learning, since they make explicit how different features contribute towards the prediction. We apply tools from constraint satisfaction to learn optimal decision trees in the form of sparse k-CNF (Conjunctive Normal Form) rules. We develop two methods offering different trade-offs between accuracy and computational complexity: one offline method that learns decision trees using the entire training dataset and one online method that learns decision trees over a local subset of the training dataset. This subset is obtained from training examples near a query point. The developed methods are applied on a number of datasets both in an online and an offline setting. We found that our methods learn decision trees which are significantly more accurate than those learned by existing heuristic approaches. However, the global decision tree model tends to be computationally more expensive compared to heuristic approaches. The online method is faster to train and finds smaller decision trees with an accuracy comparable to that of the k-nearest-neighbour method.
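A sparse k-CNF rule classifies an example as positive iff every clause (a disjunction of at most k literals) is satisfied. Learning the clauses is the constraint-satisfaction part; evaluating a learned rule is straightforward, as this sketch with an invented 2-CNF shows:

```python
# Evaluate a k-CNF rule over boolean features.
# A literal (i, v) asserts that feature i has value v (v=0 encodes negation).

def eval_kcnf(clauses, x):
    """True iff every clause has at least one satisfied literal."""
    return all(any(x[i] == v for i, v in clause) for clause in clauses)

# (x0 or not x1) and (x1 or x2)
rule = [[(0, 1), (1, 0)],
        [(1, 1), (2, 1)]]
pred1 = eval_kcnf(rule, [1, 0, 1])   # satisfies both clauses
pred2 = eval_kcnf(rule, [0, 1, 0])   # first clause fails
```

Keeping k small bounds clause length, which is what makes the learned rules both sparse and readable as explanations.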