
Showing papers on "Decision tree model published in 2019"


Journal ArticleDOI
TL;DR: In this paper, the authors analyzed Twitter signals as a medium for user sentiment to predict the price fluctuations of a small-cap alternative cryptocurrency called "ZClassic" using an Extreme Gradient Boosting Regression Tree Model.
Abstract: In this paper, we analyze Twitter signals as a medium for user sentiment to predict the price fluctuations of a small-cap alternative cryptocurrency called ZClassic. We extracted tweets on an hourly basis for a period of 3.5 weeks, classifying each tweet as positive, neutral, or negative. We then compiled these tweets into an hourly sentiment index, creating an unweighted and a weighted index, with the latter giving larger weight to retweets. These two indices, alongside the raw summations of positive, negative, and neutral sentiment, were juxtaposed with approximately 400 data points of hourly pricing data to train an Extreme Gradient Boosting Regression Tree Model. Price predictions produced from this model were compared to historical price data, with the resulting predictions having a 0.81 correlation with the testing data. Our model's predictions yielded statistical significance at the p < 0.0001 level. Our model is the first academic proof of concept that social media platforms such as Twitter can serve as powerful social signals for predicting price movements in the highly speculative alternative cryptocurrency, or "alt-coin", market.

59 citations
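
For readers who want the shape of such a pipeline, here is a minimal sketch (synthetic data and a hypothetical feature layout; not the authors' code) of fitting an extreme-gradient-boosted regression tree to hourly sentiment features and correlating its predictions with held-out prices:

```python
# Minimal sketch: gradient-boosted regression trees on hourly
# sentiment features, checked against held-out prices.
import numpy as np
from xgboost import XGBRegressor
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
n = 400  # roughly the ~400 hourly data points used in the paper

# Hypothetical feature columns: weighted/unweighted sentiment indices
# plus raw positive/neutral/negative counts per hour (synthetic here).
X = rng.normal(size=(n, 5))
price = X[:, 0] * 0.7 + X[:, 1] * 0.2 + rng.normal(scale=0.3, size=n)

split = int(n * 0.8)
model = XGBRegressor(n_estimators=200, max_depth=4, learning_rate=0.1)
model.fit(X[:split], price[:split])

pred = model.predict(X[split:])
r, p = pearsonr(pred, price[split:])
print(f"correlation with test prices: r={r:.2f}, p={p:.2g}")
```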


Proceedings ArticleDOI
04 Sep 2019
TL;DR: This paper adopts the Shapley additive explanation (SHAP) for interpreting a gradient-boosting decision tree model using hospital data and proposes two novel techniques: a new metric of feature importance using SHAP, and a technique termed feature packing, which packs multiple similar features into one grouped feature to allow an easier understanding of the model without reconstruction of the model.
Abstract: When using machine learning techniques in decision-making processes, the interpretability of the models is important. In the present paper, we adopted the Shapley additive explanation (SHAP), which is based on fair profit allocation among many stakeholders depending on their contribution, for interpreting a gradient-boosting decision tree model using hospital data. For better interpretability, we propose two novel techniques as follows: (1) a new metric of feature importance using SHAP and (2) a technique termed feature packing, which packs multiple similar features into one grouped feature to allow an easier understanding of the model without reconstruction of the model.

32 citations
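
A rough sketch of the two ideas using the public shap package (the paper's exact importance metric is defined in the paper; here the common mean-|SHAP| importance stands in, and "packing" is shown as summing the SHAP values of a hypothetical feature group, which is valid because SHAP attributions are additive):

```python
# SHAP values from a GBDT, plus grouped attribution by summation.
import numpy as np
import shap
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier

X, y = load_breast_cancer(return_X_y=True)
model = GradientBoostingClassifier(n_estimators=100).fit(X, y)

explainer = shap.TreeExplainer(model)
phi = explainer.shap_values(X)   # shape: (n_samples, n_features)

# Conventional global importance: mean absolute SHAP value per feature.
importance = np.abs(phi).mean(axis=0)

# Feature packing: because SHAP values are additive, attributions of a
# group of similar features can be summed into one grouped attribution
# without refitting the model.
group = [0, 2, 3]                # hypothetical "similar" features
packed = phi[:, group].sum(axis=1)
print(importance[:5], packed[:3])
```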


Journal ArticleDOI
TL;DR: It is shown that a certain "product" lower bound method of Impagliazzo and Williams (CCC 2010) fails to capture P^NP communication complexity up to polynomial factors, which answers a question of Papakonstantinou, Scheder, and Song (CCC 2014).
Abstract: We prove that the P^NP-type query complexity (alternatively, decision list width) of any Boolean function f is quadratically related to the P^NP-type communication complexity of a lifted version of f. As an application, we show that a certain "product" lower bound method of Impagliazzo and Williams (CCC 2010) fails to capture P^NP communication complexity up to polynomial factors, which answers a question of Papakonstantinou, Scheder, and Song (CCC 2014).

31 citations


Journal ArticleDOI
01 May 2019-Geoderma
TL;DR: In this article, the DSMART algorithm was used to disaggregate conventional soil maps and to produce high-quality soil maps when point observations are not available, and the results demonstrated that such an approach can provide reliable soil maps at a national extent.

29 citations


Journal ArticleDOI
TL;DR: This study verifies the efficiency and validity of HDTTCA on a large data set from the NHI of Taiwan, measuring average performance to determine which model better addresses the telehealth patient classification problem.
Abstract: Although previous research showed that telehealth services can reduce the misuse of resources and urban–rural disparities, most healthcare insurers do not include telehealth services in their health insurance schemes. Therefore, no target variable exists for classification approaches to learn from or train with. The problem of identifying the potential recipients of telehealth services when introducing telehealth services into health welfare or health insurance schemes thus becomes an unsupervised classification problem without a target variable. We propose HDTTCA, a systematic approach whose main process involves (1) data set preprocessing, (2) decision tree model building, and (3) prediction and explanation of the most important attributes in the data set for patients who qualify for telehealth services. This work uses data from the NHIRD provided by the NHIA in Taiwan in 2012 as its research scope, which consists of 55,389 distinct hospitals and 653,209 distinct patients with 15,882,153 outpatient and 135,775 inpatient records. After HDTTCA produces the final version of the decision tree, the rules can be used to assign the values of the target variables in the entire NHIRD. Our data indicate that 3.56% (23,262 out of 653,209) of the patients were eligible for telehealth services in 2012. This study verifies the efficiency and validity of HDTTCA using this large data set. It conducts a series of experiments 30 times to compare the HDTTCA results with the logistic regression findings, measuring their average performance to determine which model addresses the telehealth patient classification problem better. Four important metrics are used to compare the results. In terms of sensitivity, the decision trees generated by HDTTCA and the logistic regression model are on equal footing. In terms of accuracy, specificity, and precision, the decision tree generated by HDTTCA performs better than the logistic regression model. When HDTTCA is applied, the decision tree model delivers competitive performance and provides clear, easily understandable rules. Therefore, HDTTCA is a suitable choice for solving telehealth service classification problems.

29 citations
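
The four comparison metrics reduce to simple ratios over a binary confusion matrix; a small sketch with dummy labels:

```python
# Sensitivity, specificity, precision, and accuracy from a binary
# confusion matrix (labels and predictions here are dummies).
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)   # recall on eligible patients
specificity = tn / (tn + fp)
precision   = tp / (tp + fp)
accuracy    = (tp + tn) / (tp + tn + fp + fn)
print(sensitivity, specificity, precision, accuracy)
```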


Journal ArticleDOI
TL;DR: Zhang et al. proposed "BehavDT," a behavioral decision tree context-aware model that takes into account user behavior-oriented generalization according to individual preference level.
Abstract: This paper formulates the problem of building a context-aware predictive model based on users' diverse behavioral activities with smartphones. In the area of machine learning and data science, a tree-like model such as the decision tree is considered one of the most popular classification techniques, which can be used to build a data-driven predictive model. The traditional decision tree model typically creates a number of leaf nodes as decision nodes that represent context-specific rigid decisions, and consequently may cause an overfitting problem in behavior modeling. However, in many practical scenarios within a context-aware environment, generalized outcomes can play an important role in effectively capturing user behavior. In this paper, we propose "BehavDT," a behavioral decision tree context-aware model that takes into account user behavior-oriented generalization according to individual preference level. The BehavDT model outputs not only the generalized decisions but also the context-specific decisions in relevant exceptional cases. The effectiveness of our BehavDT model is studied by conducting experiments on real smartphone datasets of individual users. Our experimental results show that the proposed BehavDT context-aware model is more effective than traditional machine learning approaches in predicting users' diverse behaviors over multi-dimensional contexts.

24 citations


Proceedings ArticleDOI
01 Oct 2019
TL;DR: Compared with linear regression and decision tree models, the XGBoost algorithm has better generalization ability and robustness in data prediction and also prevents overfitting, laying a solid foundation for subsequent second-hand house price prediction.
Abstract: In order to study second-hand housing prices better and more accurately, this paper analyzed 35,417 records captured from the Chengdu HOME LINK network. First, the captured data were cleaned and features were selected. Then, multiple linear regression, decision tree, and XGBoost models were used to fit the predicted housing price curve over ten selected factors, and finally the optimal prediction model was selected through parameter tuning. The experimental results show that XGBoost achieves the highest accuracy, with a prediction accuracy score of 0.9251. Compared with linear regression and the decision tree model, the XGBoost algorithm has better generalization ability and robustness in data prediction and also prevents overfitting, laying a solid foundation for subsequent second-hand house price prediction.

22 citations
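
A hedged sketch of that comparison on synthetic data (the paper used roughly 35,000 Chengdu listings and ten selected factors; hyperparameters here are placeholders):

```python
# Fit linear regression, a decision tree, and XGBoost on the same
# features and compare R^2 scores on a held-out split.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor

rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 10))    # stand-in for ten housing factors
y = X @ rng.normal(size=10) + np.sin(X[:, 0]) + rng.normal(scale=0.2, size=2000)

Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.2, random_state=0)
for model in (LinearRegression(),
              DecisionTreeRegressor(max_depth=8),
              XGBRegressor(n_estimators=300, max_depth=5, learning_rate=0.05)):
    score = model.fit(Xtr, ytr).score(Xte, yte)  # R^2 on the test split
    print(type(model).__name__, round(score, 4))
```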


Journal ArticleDOI
Wenchao Ma
TL;DR: This study introduces the so-called two-digit scoring scheme into diagnostic assessments to record both students' partial credits and their strategies, and proposes a diagnostic tree model (DTM) by integrating the cognitive diagnosis models with the tree model to analyse the items scored using the two- digit rubrics.
Abstract: Constructed-response items have been shown to be appropriate for cognitively diagnostic assessments because students' problem-solving procedures can be observed, providing direct evidence for making inferences about their proficiency. However, multiple strategies used by students make item scoring and psychometric analyses challenging. This study introduces the so-called two-digit scoring scheme into diagnostic assessments to record both students' partial credits and their strategies. This study also proposes a diagnostic tree model (DTM) by integrating the cognitive diagnosis models with the tree model to analyse the items scored using the two-digit rubrics. Both convergent and divergent tree structures are considered to accommodate various scoring rules. The MMLE/EM algorithm is used for item parameter estimation of the DTM, and has been shown to provide good parameter recovery under varied conditions in a simulation study. A set of data from TIMSS 2007 mathematics assessment is analysed to illustrate the use of the two-digit scoring scheme and the DTM.

20 citations


Posted Content
TL;DR: The Multi-Label Deep Forest (MLDF) method with two mechanisms, measure-aware feature reuse and measure-aware layer growth, which ensures that MLDF gradually increases model complexity guided by the performance measure.
Abstract: In multi-label learning, each instance is associated with multiple labels, and the crucial task is how to leverage label correlations in building models. Deep neural network methods usually jointly embed the feature and label information into a latent space to exploit label correlations. However, the success of these methods highly depends on the precise choice of model depth. Deep forest is a recent deep learning framework based on tree model ensembles, which does not rely on backpropagation. We consider deep forest models to be well suited to solving multi-label problems, and therefore design the Multi-Label Deep Forest (MLDF) method with two mechanisms: measure-aware feature reuse and measure-aware layer growth. The measure-aware feature reuse mechanism reuses good representations from the previous layer, guided by confidence. The measure-aware layer growth mechanism ensures that MLDF gradually increases model complexity guided by the performance measure. MLDF handles two challenging problems at the same time: restricting model complexity to ease the overfitting issue, and optimizing the performance measure of the user's choice, since there are many different measures in multi-label evaluation. Experiments show that our proposal not only beats the compared methods over six measures on benchmark datasets but also enjoys label correlation discovery and other desired properties in multi-label learning.

20 citations
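
A simplified cascade sketch of measure-aware layer growth (not the authors' MLDF code, which also includes confidence-guided feature reuse): each layer is a multi-label random forest whose label predictions augment the next layer's features, and growth stops when the chosen measure, macro-F1 here, stops improving on validation data:

```python
# Measure-aware layer growth for a deep-forest-style cascade.
import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

X, Y = make_multilabel_classification(n_samples=600, n_features=20,
                                      n_labels=3, random_state=0)
Xtr, Xva, Ytr, Yva = train_test_split(X, Y, test_size=0.3, random_state=0)

best, grown, Ftr, Fva = -np.inf, 0, Xtr, Xva
for seed in range(10):                       # cap on cascade depth
    forest = RandomForestClassifier(n_estimators=100, random_state=seed)
    forest.fit(Ftr, Ytr)
    score = f1_score(Yva, forest.predict(Fva), average="macro")
    if score <= best:
        break                                # measure stopped improving
    best, grown = score, grown + 1
    # cascade: feed this layer's label predictions forward as features
    Ftr = np.hstack([Xtr, forest.predict(Ftr)])
    Fva = np.hstack([Xva, forest.predict(Fva)])
print("layers grown:", grown, "best macro-F1:", round(best, 3))
```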


Journal ArticleDOI
TL;DR: A machine learning method to identify marketing intentions from large-scale We-Media data is proposed: the proposed Latent Semantic Analysis (LSI)-Word2vec model reflects semantic features, and the decision tree model is simplified by pruning to save computing resources and reduce time complexity.
Abstract: Social network services for self-media, such as Weibo, Blog, and WeChat Public, constitute a powerful medium that allows users to publish posts every day. Due to insufficient information transparency, malicious marketing of the Internet from self-media posts imposes potential harm on society. Therefore, it is necessary to identify news with marketing intentions. We follow the idea of text classification to identify marketing intentions. Although some current methods address intention detection, the challenge is how the extracted text features reflect semantic information and how to improve the time and space complexity of the recognition model. To this end, this paper proposes a machine learning method to identify marketing intentions from large-scale We-Media data. First, the proposed Latent Semantic Analysis (LSI)-Word2vec model can reflect semantic features. Second, the decision tree model is simplified by pruning to save computing resources and reduce time complexity. Third, this paper examines the effects of classifier combinations and uses the optimal configuration to help people efficiently identify marketing intentions. Finally, a detailed experimental evaluation on several metrics shows that our approach is effective and efficient: the F1 value increases by about 5% and the running time improves by 20%, demonstrating that the newly proposed method can effectively improve the accuracy of marketing news recognition.

17 citations
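
A sketch of the LSI half of the pipeline with cost-complexity pruning standing in for the paper's pruning step (the Word2vec component and the exact feature fusion are omitted; the data is a toy example):

```python
# TF-IDF -> truncated SVD (LSI) -> pruned decision tree classifier.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

posts = ["limited offer buy now", "city opens new park",
         "huge discount click link", "council approves budget"]
labels = [1, 0, 1, 0]            # 1 = marketing intention (toy data)

clf = make_pipeline(
    TfidfVectorizer(),
    TruncatedSVD(n_components=2),           # LSI: latent semantic space
    DecisionTreeClassifier(ccp_alpha=0.01)  # pruning shrinks the tree
)
clf.fit(posts, labels)
print(clf.predict(["new discount offer"]))
```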


Proceedings ArticleDOI
01 Sep 2019
TL;DR: A novel approach to finding the simplest and most effective decision tree model, called 'manual pruning', is described; implementing the skip criteria reduced the average encoding time by 42.1% for a Bjøntegaard Delta rate detriment of 0.7%.
Abstract: This paper proposes a method for complexity reduction in practical video encoders using multiple decision tree classifiers. The method is demonstrated for the fast implementation of the 'High Efficiency Video Coding' (HEVC) standard, chosen because of its high bit rate reduction capability but large complexity overhead. Optimal partitioning of each video frame into coding units (CUs) is the main source of complexity, as a vast number of combinations are tested. The decision tree models were trained to identify when the CU testing process, a time-consuming Lagrangian optimisation, can be skipped, i.e., when there is a high probability that the CU can remain whole. A novel approach to finding the simplest and most effective decision tree model, called 'manual pruning', is described. Implementing the skip criteria reduced the average encoding time by 42.1% for a Bjontegaard Delta rate detriment of 0.7%, for 17 standard test sequences in a range of resolutions and quantisation parameters.
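
A hedged sketch of the core idea with illustrative features and a toy skip rule (not the paper's trained models): a shallow decision tree predicts when the CU split search can be skipped:

```python
# Shallow skip/no-skip classifier for the CU partitioning search.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(2)
# Hypothetical per-CU features: RD cost so far, texture variance,
# motion magnitude, quantisation parameter (all synthetic here).
X = rng.normal(size=(5000, 4))
skip = (X[:, 1] < 0.0) & (X[:, 2] < 0.5)   # toy "CU stays whole" rule

tree = DecisionTreeClassifier(max_depth=3)  # shallow, cheap to evaluate
tree.fit(X, skip)
# In an encoder loop: if tree.predict(features)[0] is True, skip the
# Lagrangian split search and keep the CU whole.
print(tree.score(X, skip))
```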

Posted Content
TL;DR: The goal is to design and implement a novel client-server protocol that delegates the complete tree evaluation to the server while preserving privacy and reducing overhead, yielding the first non-interactive protocol for this task.
Abstract: Decision trees are a powerful prediction model with many applications in statistics, data mining, and machine learning. In some settings, the model and the data to be classified may contain sensitive information belonging to different parties. In this paper, we therefore address the problem of privately evaluating a decision tree on private data. This scenario consists of a server holding a private decision tree model and a client interested in classifying its private attribute vector using the server's private model. The goal of the computation is to obtain the classification while preserving the privacy of both the decision tree and the client input. After the computation, the classification result is revealed only to the client, and nothing else is revealed to either the client or the server. Existing privacy-preserving protocols that address this problem use or combine different generic secure multiparty computation approaches, resulting in several interactions between the client and the server. Our goal is to design and implement a novel client-server protocol that delegates the complete tree evaluation to the server while preserving privacy and reducing the overhead. The idea is to use fully (somewhat) homomorphic encryption and evaluate the tree on ciphertexts encrypted under the client's public key. However, since current somewhat homomorphic encryption schemes have high overhead, we combine efficient data representations with different algorithmic optimizations to keep the computational overhead and the communication cost low. As a result, we are able to provide the first non-interactive protocol, which allows the client to delegate the evaluation to the server by sending an encrypted input and receiving only the encryption of the result. Our scheme has only one round and can evaluate a complete tree of depth 10 within seconds.
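
A plaintext illustration of the data representation that enables such non-interactive evaluation: the tree is evaluated as a sum over leaves of (leaf label) times (product of comparison bits along the leaf's path). In the real protocol these bits and products are computed under fully/somewhat homomorphic encryption, never in the clear:

```python
# Evaluate a toy depth-2 decision tree as a sum of path products.
def eval_paths(x):
    # node comparisons produce bits b0, b1, b2 in {0, 1}
    b0 = int(x[0] < 5)          # root:        x0 < 5 ?
    b1 = int(x[1] < 3)          # left child:  x1 < 3 ?
    b2 = int(x[2] < 7)          # right child: x2 < 7 ?
    leaves = {                  # leaf name -> product of path bits
        "A": b0 * b1,
        "B": b0 * (1 - b1),
        "C": (1 - b0) * b2,
        "D": (1 - b0) * (1 - b2),
    }
    labels = {"A": 0, "B": 1, "C": 2, "D": 3}
    # exactly one path product equals 1, so the sum selects one label
    return sum(labels[k] * bit for k, bit in leaves.items())

print(eval_paths([4, 9, 1]))    # takes root-left then right: leaf "B"
```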

Proceedings ArticleDOI
25 Jun 2019
TL;DR: This work proposes a sound, general framework for multi-parameter analysis of security, which relies on the attack–defense tree model that security experts from industry are already familiar with; it presents the mathematical foundations of the framework and characterizes the class of parameters it is suitable for.
Abstract: The cheapest attacks are often time-consuming, and those requiring a high level of technical skill might occur rarely but result in disastrous consequences. Therefore, analysis focusing on a single parameter at a time, e.g., only cost or time, is insufficient for the successful selection of the appropriate measures increasing a system's security. In practice, security engineers are thus confronted with the problem of multi-parameter analysis. The objective of this work is to address this problem and propose a sound, general framework for multi-parameter analysis of security. In order to ensure the usability of our solution for real-life applications, our proposal relies on the attack–defense tree model that security experts from industry are already familiar with. We present the mathematical foundations of our framework and characterize the class of parameters it is suitable for. We identify conditions under which the proposed method applies to attack–defense trees where several nodes represent the same action. We discuss the complexity of our approach and implement the underlying algorithms in a proof-of-concept tool. We analyze its performance on a number of trees of varying complexity, and validate our proposal on a case study borrowed from industry.
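
An illustrative bottom-up evaluation on a tiny attack tree for two parameters at once. Since optimising each parameter separately is exactly what the abstract calls insufficient, each node keeps a set of Pareto-optimal (cost, time) options; the OR/AND semantics below are one simple choice, and the paper's framework is far more general (defense nodes, repeated actions):

```python
# Pareto-set propagation through an attack tree with OR/AND nodes.
from itertools import product

def pareto(options):
    # keep options not dominated in both cost and time
    return [o for o in options
            if not any(p != o and p[0] <= o[0] and p[1] <= o[1]
                       for p in options)]

def evaluate(node):
    if "leaf" in node:
        return [node["leaf"]]
    kids = [evaluate(c) for c in node["children"]]
    if node["op"] == "OR":
        opts = [o for k in kids for o in k]       # any child suffices
    else:  # AND: pick one option per child; sum costs, max of times
        opts = [(sum(o[0] for o in combo), max(o[1] for o in combo))
                for combo in product(*kids)]
    return pareto(opts)

tree = {"op": "OR", "children": [
    {"leaf": (100, 2)},                 # bribe an insider: fast
    {"op": "AND", "children": [         # break in: two parallel steps
        {"leaf": (10, 8)}, {"leaf": (30, 3)},
    ]},
]}
print(evaluate(tree))   # Pareto set: [(100, 2), (40, 8)]
```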

Journal ArticleDOI
TL;DR: This work presents an algorithm that performs a point location query with $O(d^2\log n)$ linear comparisons, improving the previous best result by about a factor of d, and currently has the best performance for arbitrary hyperplanes.
Abstract: We consider the point location problem in an arrangement of n arbitrary hyperplanes in any dimension d, in the linear decision tree model, in which we only count linear comparisons involving the query point, and all other operations do not explicitly access the query and are for free. We mainly consider the simpler variant (which arises in many applications) where we only want to determine whether the query point lies on some input hyperplane. We present an algorithm that performs a point location query with $O(d^2\log n)$ linear comparisons, improving the previous best result by about a factor of d. Our approach is a variant of Meiser's technique for point location (Inf Comput 106(2):286–303, 1993) (see also Cardinal et al. in: Proceedings of the 24th European Symposium on Algorithms, 2016), and its improved performance is due to the use of vertical decompositions in an arrangement of hyperplanes in high dimensions, rather than the bottom-vertex triangulation used in earlier approaches. The properties of such a decomposition, both combinatorial and algorithmic (in the standard real RAM model), are developed in a companion paper (Ezra et al., arXiv:1712.02913, 2017), and are adapted here (in simplified form) for the linear decision tree model. Several applications of our algorithm are presented, such as the k-SUM problem and the Knapsack and SubsetSum problems. However, these applications have been superseded by the more recent result of Kane et al. (in: Proceedings of the 50th ACM Symposium on Theory of Computing, 2018), obtained after the original submission (and acceptance) of the conference version of our paper (Ezra and Sharir in: Proceedings of the 33rd International Symposium on Computational Geometry, 2017). This result only applies to 'low-complexity' hyperplanes (for which the $\ell_1$-norm of their coefficient vector is a small integer), which arise in the aforementioned applications. Still, our algorithm currently has the best performance for arbitrary hyperplanes.

Patent
01 Feb 2019
TL;DR: In this paper, a federated learning method for multi-party data is proposed, in which data terminals perform federated training on multi-party training samples based on the gradient boosting decision tree (GBDT) algorithm to construct a gradient tree model.
Abstract: The invention discloses a federated learning method, a system, and a readable storage medium. The federated learning method includes the following steps: data terminals perform federated training on multi-party training samples based on the gradient boosting decision tree (GBDT) algorithm to construct a gradient tree model, wherein there are multiple data terminals, the gradient tree model comprises a plurality of regression trees, the regression trees comprise a plurality of partition points, the training samples comprise a plurality of features, and the features correspond to the partition points one by one; the data terminals then perform joint prediction on a sample to be predicted based on the gradient tree model to determine a predicted value for that sample. The invention carries out federated training on multi-party training samples through the GBDT algorithm, realizes the construction of the gradient tree model, is suitable for scenarios with large data volumes, and can well meet the needs of realistic production environments through the gradient tree model; samples to be predicted are forecast jointly, realizing their prediction.

Journal ArticleDOI
30 May 2019
TL;DR: The methodology presents the mapping of a PROSA-based ontology model onto a decision tree, which was created with the Waikato Environment for Knowledge Analysis (WEKA) application, and demonstrates the formulation of Semantic Web Rule Language (SWRL) rules from the WEKA decision tree with the help of MATLAB programming.
Abstract: This paper aims to create a predictive model, which will assist in the allocation of newly received orders in a manufacturing network. The manufacturing network, which is taken as a case study in t...

Journal ArticleDOI
TL;DR: The proposed model performs better than the previously proposed CR-based Decision Tree (CRDT) model since an efficient discretization module has been added to it; the model is also compared with Information Gain, Gain Ratio, and Gini Index based models.
Abstract: In predictive tasks like classification, the Information Gain (IG) based Decision Tree is very popularly used. However, the IG method has some inherent problems, such as its preference for choosing attributes with a higher number of distinct values as the splitting attribute in the case of nominal attributes, and another problem associated with imbalanced datasets. Most real-world datasets have many nominal attributes, and those nominal attributes may have many distinct values. In this paper, we point out these characteristics of the datasets while discussing the performance of our proposed approach. Our approach is a variant of the traditional Decision Tree model and uses a new technique called Dispersion_Ratio, a modification of the existing Correlation Ratio (CR) method. The whole approach is divided into two phases: first, the dataset is discretized using a discretization module; second, the preprocessed dataset is used to build a Dispersion Ratio based Decision Tree model. The proposed method does not prefer attributes with many unique values and is indifferent to class distribution. It performs better than the previously proposed CR-based Decision Tree (CRDT) model since an efficient discretization module has been added to it. We have evaluated the performance of our approach on benchmark datasets from various domains to demonstrate the effectiveness of the proposed technique, and we have also compared our model with Information Gain, Gain Ratio, and Gini Index based models. Results show that the proposed model outperforms the other models in the majority of the cases considered in our experiment.
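
For orientation, a sketch of the classical correlation ratio that Dispersion_Ratio modifies (the exact modification is defined in the paper): the between-group variance of a target, grouped by a nominal attribute's values, divided by the total variance:

```python
# Correlation ratio (eta^2) between a nominal attribute and a target.
import numpy as np

def correlation_ratio(attribute, target):
    target = np.asarray(target, dtype=float)
    grand_mean = target.mean()
    between = 0.0
    for v in np.unique(attribute):
        group = target[attribute == v]          # rows with this value
        between += group.size * (group.mean() - grand_mean) ** 2
    total = ((target - grand_mean) ** 2).sum()
    return between / total if total else 0.0

attr = np.array(["a", "a", "b", "b", "c", "c"])
y    = np.array([1.0, 1.2, 3.0, 3.1, 1.1, 0.9])
print(round(correlation_ratio(attr, y), 3))
# A tree builder would compute this score for every candidate
# attribute and split on the one with the highest ratio.
```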

Patent
23 Apr 2019
TL;DR: In this paper, a travel time prediction method based on multi-modal data fusion and multi-model integration is proposed, which includes a preprocessing module that extracts taxi travel data from taxi GPS track data according to the passenger-carrying state; a multi-modal data analysis, feature extraction, and feature fusion module that extracts corresponding feature sub-vectors from taxi track data, weather data, driver portrait data, and similar fields and completes feature splicing; and a multi-model integration module that establishes a gradient boosting decision tree model and a deep neural network model and integrates their predictions using a decision tree model.
Abstract: The invention discloses a travel time prediction method based on multi-modal data fusion and multi-model integration. The method comprises a multi-modal data preprocessing module, which extracts taxi travel data from taxi GPS track data according to the passenger-carrying state; a multi-modal data analysis, feature extraction, and feature fusion module, which extracts corresponding feature sub-vectors from taxi track data, weather data, driver portrait data, and similar fields and completes feature splicing; and a multi-model integration module, which establishes a gradient boosting decision tree model and a deep neural network model and integrates the models' predictions using a decision tree model. By fusing multi-modal data such as taxi track data, weather data, and driver portrait data, the method fully extracts and mines the factors influencing travel time, and the integrated decision-tree-based model achieves higher travel time prediction accuracy at lower computational cost.
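
A rough sketch of the integration step on synthetic data (features, models, and hyperparameters are placeholders, not the patent's specification):

```python
# Two base regressors predict travel time; a decision tree combines them.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(3)
X = rng.normal(size=(1000, 8))      # fused trip/weather/driver features
y = X[:, 0] * 3 + np.abs(X[:, 1]) + rng.normal(scale=0.2, size=1000)

gbdt = GradientBoostingRegressor().fit(X[:800], y[:800])
dnn  = MLPRegressor(hidden_layer_sizes=(32, 16), max_iter=1000).fit(X[:800], y[:800])

# Meta-features: the base predictions (ideally out-of-fold in practice).
meta = np.column_stack([gbdt.predict(X), dnn.predict(X)])
combiner = DecisionTreeRegressor(max_depth=4).fit(meta[:800], y[:800])
print(combiner.predict(meta[800:810]))
```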

Posted Content
TL;DR: This work proposes to assign least-squares-based importance scores to each word of an instance by exploiting syntactic constituency structure and establishes an axiomatic characterization of these importance scores by relating them to the Banzhaf value in coalitional game theory.
Abstract: We study the problem of interpreting trained classification models in the setting of linguistic data sets. Leveraging a parse tree, we propose to assign least-squares based importance scores to each word of an instance by exploiting syntactic constituency structure. We establish an axiomatic characterization of these importance scores by relating them to the Banzhaf value in coalitional game theory. Based on these importance scores, we develop a principled method for detecting and quantifying interactions between words in a sentence. We demonstrate that the proposed method can aid in interpretability and diagnostics for several widely-used language models.

Proceedings ArticleDOI
01 Sep 2019
TL;DR: In this paper, the authors provide a formal definition of local forward models for which they propose two basic learning approaches, Hash Set and Decision Tree, to predict future state transitions of both the training and the test set.
Abstract: This paper examines learning approaches for forward models based on local cell transition functions. We provide a formal definition of local forward models, for which we propose two basic learning approaches. Our analysis is based on the game Sokoban, where a wrong action can lead to an unsolvable game state; an accurate prediction of an action's resulting state is therefore necessary to avoid this scenario. In contrast to learning the complete state transition function, local forward models allow extracting multiple training examples from a single state transition. In this way, the Hash Set model, as well as the Decision Tree model, quickly learns to predict upcoming state transitions of both the training and the test set. Applying the models with a statistical forward planner showed that the best models can be used to a satisfying degree even on test levels that have not been seen before. Our evaluation includes an analysis of various local neighbourhood patterns and sizes to test the learners' capabilities when too few or too many attributes are extracted, the latter of which was shown to degrade the performance of the model learner.
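
A minimal sketch of the Hash Set (dictionary) variant of a local forward model on a toy grid: each observed transition maps a (local neighbourhood, action) pair to the cell's next value, and prediction is a lookup with a no-change fallback:

```python
# Local forward model: learn per-cell transitions from neighbourhoods.
import numpy as np

def neighbourhood(grid, r, c, size=1):
    # padded (2*size+1)^2 patch around cell (r, c), flattened to a key
    padded = np.pad(grid, size, constant_values=-1)
    patch = padded[r:r + 2 * size + 1, c:c + 2 * size + 1]
    return tuple(patch.ravel())

model = {}                                  # (pattern, action) -> value

def observe(grid, action, next_grid):
    for r in range(grid.shape[0]):
        for c in range(grid.shape[1]):
            model[(neighbourhood(grid, r, c), action)] = next_grid[r, c]

def predict(grid, action):
    out = np.zeros_like(grid)
    for r in range(grid.shape[0]):
        for c in range(grid.shape[1]):
            # unseen patterns fall back to "no change"
            out[r, c] = model.get((neighbourhood(grid, r, c), action),
                                  grid[r, c])
    return out

before = np.array([[0, 1, 0], [0, 2, 0], [0, 0, 0]])  # 2 = player
after  = np.array([[0, 2, 0], [0, 0, 0], [0, 0, 0]])  # player moved up
observe(before, "up", after)
print(predict(before, "up"))
```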

Posted Content
TL;DR: In this paper, a method for complexity reduction in practical video encoders using multiple decision tree classifiers is proposed, which is demonstrated for the fast implementation of the "High Efficiency Video Coding" (HEVC) standard, chosen because of its high bit rate reduction capability but large complexity overhead.
Abstract: This paper proposes a method for complexity reduction in practical video encoders using multiple decision tree classifiers. The method is demonstrated for the fast implementation of the 'High Efficiency Video Coding' (HEVC) standard, chosen because of its high bit rate reduction capability but large complexity overhead. Optimal partitioning of each video frame into coding units (CUs) is the main source of complexity, as a vast number of combinations are tested. The decision tree models were trained to identify when the CU testing process, a time-consuming Lagrangian optimisation, can be skipped, i.e., when there is a high probability that the CU can remain whole. A novel approach to finding the simplest and most effective decision tree model, called 'manual pruning', is described. Implementing the skip criteria reduced the average encoding time by 42.1% for a Bjontegaard Delta rate detriment of 0.7%, for 17 standard test sequences in a range of resolutions and quantisation parameters.

Journal ArticleDOI
01 Jan 2019
TL;DR: The new GraftedTrees model inherits the advantages of Random Forests and further employs a random mixture of two interchangeable node-splitting rule inductions, with the aim of obtaining higher computational efficiency and better performance in terms of accuracy.
Abstract: Data mining and machine learning are both useful tools in the field of data analysis. Classification is one of the most important techniques in data mining; it is therefore of great significance to select suitable, efficient classification models, demonstrated here on the Iris data. With this goal, a decision tree induction algorithm, namely graftedTree, is proposed to build randomized decision trees. Randomization is explicitly introduced into this algorithm, such that applying the algorithm several times to the same training data results in diversified models. An ensemble classification model is constructed from multiple randomized decision trees via majority voting. In order to compare the classification performance of different models, we use precision, recall, F-measure, the area under the ROC curve (AUC), and the Gini coefficient as evaluation indexes on the Iris dataset. The experimental results show that the Random Forests model generally performs better than the Boosting Tree model and three other popular algorithms: KNN, SMO, and Simple CART. However, the Gini coefficient of the Random Forests model shows that it yields a less pure training set than the other models. The new GraftedTrees model inherits the advantages of Random Forests and further employs a random mixture of two interchangeable node-splitting rule inductions, with the aim of obtaining higher computational efficiency and better accuracy. It is expected that the GraftedTrees model will prove a powerful classification model in the near future.

Journal ArticleDOI
14 Oct 2019
Abstract: Accepting new members onto a team requires clear evaluation criteria and an accurate assessment process. This is what the builders of the Harmoni Nusantara Choir team at Nusantara PGRI Kediri University require. Until now, the selection of team members has been carried out conventionally, with interviews and live voice tests. To decide whether a selection participant is accepted, the coaching team must hold discussions, and disputes often arise between coaches when one participant has the same or balanced results as another. Therefore, a system (application) implementing a suitable algorithm is needed to support the selection of new members. To realize the planned system, Decision Tree modeling with the Classification Error concept is analyzed for its suitability as an assistive algorithm in decision making. The Decision Tree with the Classification Error concept was trained on 60 records with the input attributes gender, interpretation, technique, appearance, commitment, and octave; the target class is accepted or not accepted. The resulting decision tree produces 5 (five) rule bases that cover all records in the training data, i.e., 100% of the training records (60 records) are covered. It is therefore concluded that the Decision Tree with the Classification Error concept can serve as an assistive algorithm to be implemented in the system (application) for selecting new members of the Harmoni Nusantara Choir team at Nusantara PGRI Kediri University.
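
A small sketch of the impurity measure behind this modeling: the classification error of a node is one minus its majority-class proportion, and a candidate split is scored by the weighted impurity decrease:

```python
# Classification-error impurity and split gain for a decision tree.
from collections import Counter

def classification_error(labels):
    if not labels:
        return 0.0
    majority = Counter(labels).most_common(1)[0][1]
    return 1.0 - majority / len(labels)

def split_gain(parent, left, right):
    n = len(parent)
    weighted = (len(left) / n) * classification_error(left) \
             + (len(right) / n) * classification_error(right)
    return classification_error(parent) - weighted

parent = ["accept"] * 6 + ["reject"] * 4
left, right = parent[:6], parent[6:]        # a perfect split
print(split_gain(parent, left, right))      # 0.4: all error removed
```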

Proceedings ArticleDOI
01 Oct 2019
TL;DR: Through extensive empirical data analysis, the predictive accuracy of the S-DT investment strategy is found to be 89.5%, which is 11.6% higher than that of GRA-DT, ANN, DM, and other methods.
Abstract: Based on the decision tree model, we propose an S-DT investment strategy that predicts the closing price trends of international and domestic stocks. We train on historical stock data with a decision tree model. To achieve higher prediction accuracy for stock price trends, we choose indexes that are highly correlated with the closing price as input variables, based on synergy factors, information entropy, and grey association rules. Through extensive empirical data analysis, we found that the predictive accuracy of the S-DT investment strategy is 89.5%, which is 11.6% higher than that of GRA-DT, ANN, DM, and other methods.

Journal ArticleDOI
TL;DR: Through the practical application of the decision tree algorithm to the MOOC teaching evaluation management system of higher vocational colleges, it is found that applying data mining technology to the construction of a digital campus is feasible both theoretically and technically.
Abstract: To better carry out Massive Open Online Courses (MOOC) teaching evaluation and improve teaching effectiveness, firstly, a teaching decision support system with an evaluation function is designed by analyzing the actual situation of the college. Secondly, the decision tree data mining algorithm is introduced in the student score analysis and evaluation subsystem. Finally, the decision tree model for student score analysis and evaluation is constructed according to the decision tree algorithm. Through the practical exploration of applying the decision tree algorithm to the MOOC teaching evaluation management system of higher vocational colleges, it is found that applying data mining technology to the construction of a digital campus is feasible both theoretically and technically.

Journal ArticleDOI
TL;DR: A nonparametric ensemble tree model called gradient boosting survival tree (GBST) is proposed that extends survival tree models with a gradient boosting algorithm and outperforms existing survival models as measured by the concordance index, the Kolmogorov–Smirnov index, and the area under the receiver operating characteristic curve of each time period.
Abstract: Credit scoring plays a vital role in the field of consumer finance. Survival analysis provides an advanced solution to the credit-scoring problem by quantifying the probability of survival time. In order to deal with highly heterogeneous industrial data collected in the Chinese consumer finance market, we propose a nonparametric ensemble tree model called gradient boosting survival tree (GBST), which extends survival tree models with a gradient boosting algorithm. The survival tree ensemble is learned by minimizing the negative log-likelihood in an additive manner. The proposed model optimizes the survival probability simultaneously for each time period, which can reduce the overall error significantly. Finally, as a test of applicability, we apply the GBST model to quantify credit risk with large-scale real market datasets. The results show that the GBST model outperforms the existing survival models as measured by the concordance index (C-index), the Kolmogorov–Smirnov (KS) index, and the area under the receiver operating characteristic curve (AUC) of each time period.
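
A discrete-time analogue of the idea, sketched with scikit-learn only (this is not the authors' GBST implementation): each subject is expanded into one row per period survived, a gradient-boosted classifier is fit to the per-period event indicator, and per-period survival probabilities follow from the predicted hazards:

```python
# Discrete-time survival via gradient boosting on person-period rows.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(4)
n, periods = 500, 6
X = rng.normal(size=(n, 5))                  # borrower features (toy)
time = rng.integers(1, periods + 1, size=n)  # last observed period
event = rng.integers(0, 2, size=n)           # 1 = default occurred

rows, labels = [], []
for i in range(n):
    for t in range(1, time[i] + 1):          # one row per period lived
        rows.append(np.append(X[i], t))      # features + period index
        labels.append(1 if (event[i] and t == time[i]) else 0)

clf = GradientBoostingClassifier().fit(np.array(rows), np.array(labels))

# Survival to period T = product over t <= T of (1 - hazard at t).
hazards = [clf.predict_proba(np.append(X[0], t).reshape(1, -1))[0, 1]
           for t in range(1, periods + 1)]
print(np.cumprod(1 - np.array(hazards)))
```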

Journal ArticleDOI
TL;DR: Multi-scale texture can better describe the texture features of land, more effectively resolves the phenomenon of "same image for different objects" in the classification results, and helps to improve the classification accuracy of high-resolution images.
Abstract: Remote sensing image land type data mining was studied based on a QUEST decision tree, with the Dongting Lake area as the research object. First, the texture features of the gray-level co-occurrence matrix were expounded, and the texture scale was selected to construct the QUEST decision tree model. Second, using the spectral and texture features of remote sensing data at different resolutions, combined with other auxiliary data, Dongting Lake land information was explored and land types were classified. Finally, the following conclusions were reached: multi-scale texture can better describe the texture features of land, more effectively resolves the phenomenon of "same image for different objects" in the classification results, and helps to improve the classification accuracy of high-resolution images.
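
A sketch of the texture step with a stand-in classifier (scikit-image's GLCM functions, spelled graycomatrix/graycoprops in versions 0.19 and later; scikit-learn's CART tree replaces QUEST, which has no sklearn implementation):

```python
# GLCM texture features per image window, fed to a decision tree.
import numpy as np
from skimage.feature import graycomatrix, graycoprops
from sklearn.tree import DecisionTreeClassifier

def texture_features(window):
    glcm = graycomatrix(window, distances=[1], angles=[0],
                        levels=256, symmetric=True, normed=True)
    return [graycoprops(glcm, prop)[0, 0]
            for prop in ("contrast", "homogeneity", "energy", "correlation")]

rng = np.random.default_rng(5)
windows = rng.integers(0, 256, size=(40, 16, 16), dtype=np.uint8)
labels = rng.integers(0, 3, size=40)        # toy land-type labels

X = np.array([texture_features(w) for w in windows])
clf = DecisionTreeClassifier(max_depth=5).fit(X, labels)
print(clf.score(X, labels))
```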

Journal ArticleDOI
TL;DR: This work focuses on the equi-join and proposes a resource-efficient join architecture based on a tree model; the architecture achieves a data throughput of 8–100 million tuples per second, which is compatible with the bus rate, and performs well in balancing resource utilization and data throughput.
Abstract: The offloading and acceleration of database operations on field-programmable gate arrays (FPGAs) have been extensively studied for a long time. Architectures for join, a key database operation, have been proposed and optimized on FPGAs. However, these join architectures are either resource-intensive or have low throughput. In this brief, we focus on the equi-join and propose a resource-efficient join architecture based on a tree model. The architecture has two phases: the build phase, in which a binary tree is built from the first database table, and the probe phase, in which the architecture searches the tree for matching tuples from a second database table. In addition, we propose a parallel implementation of this architecture to improve its performance. The proposed design was implemented on a Xilinx FPGA, and the results were compared with the most recent works on hardware join. The experimental results demonstrate that, for a range of parallelism and dataset sizes, our design achieves a data throughput of 8–100 million tuples per second, which is compatible with the bus rate, and performs well in balancing resource utilization and data throughput.
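
A software analogue of the two-phase join described above (the FPGA architecture itself cannot be reproduced here; a sorted index searched with bisect stands in for the hardware binary tree): build an index over table R's keys, then probe it with each tuple of table S:

```python
# Build-then-probe equi-join over two small tables.
from bisect import bisect_left

R = [(3, "r3"), (1, "r1"), (7, "r7"), (5, "r5")]   # (key, payload)
S = [(5, "s5"), (2, "s2"), (7, "s7")]

# Build phase: index the first table by key.
keys = sorted(k for k, _ in R)
rows = {k: v for k, v in R}

# Probe phase: binary-search the index for each tuple of S.
matches = []
for k, v in S:
    i = bisect_left(keys, k)
    if i < len(keys) and keys[i] == k:
        matches.append((k, rows[k], v))
print(matches)   # [(5, 'r5', 's5'), (7, 'r7', 's7')]
```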

Journal ArticleDOI
TL;DR: TLS is an effective measurement tool that provides highly accurate and precise results for 3D modelling of tree structure parameters without cutting trees, and it has great potential to provide many individual tree attributes with high accuracy.
Abstract: Terrestrial light detection and ranging technology provides accurate measurements of individual tree parameters that are essential for managing forest resources, modeling forest fires, planning forest operations, etc. This study aimed to measure individual tree parameters to model a single tree using terrestrial laser scanner (TLS) data. A high-resolution digital terrain model (DTM) was generated using point cloud data (2,800,430 points) to obtain the tree parameters. Next, the diameters at breast height (DBH), tree heights, tree lengths, tree projection areas, and crown parameters were calculated using 3D Forest 0.42 software. In order to evaluate the capabilities of TLS data, the estimated tree parameters were compared with parameters obtained by field measurements. Regression analysis and a paired-sample t-test were performed to compare the DBH and tree height values estimated by TLS with those obtained from field measurements. We found a strong relationship between the field measurements and TLS estimates for DBH (R2 = 0.99) with a root mean square error (RMSE) of 1.65 cm, and for tree heights (R2 = 0.98) with RMSE = 0.724 m. The paired Wilcoxon signed-rank test for the DBH groups showed no significant difference (P = 0.7285 > 0.05), whereas according to the results of the paired-sample t-test for the height groups, there were significant differences between tree heights (P = 0.015 < 0.05; t = -2.55). The results also indicate that TLS is an effective measurement tool that provides highly accurate and precise results for 3D modelling of tree structure parameters without cutting trees. TLS also has great potential to provide many individual tree attributes with high accuracy, which can be used for further evaluations in many forestry disciplines such as silviculture, nature conservation, forest management, and urban forestry.

Posted Content
TL;DR: In this paper, the authors propose a secure protocol for collaborative evaluation of random forests contributed by multiple owners in a two-party setting, where the feature vector of the client or the decision tree model (such as the threshold values of its nodes) is kept secret from another party.
Abstract: Decision tree and its generalization of random forests are a simple yet powerful machine learning model for many classification and regression problems. Recent works propose how to privately evaluate a decision tree in a two-party setting where the feature vector of the client or the decision tree model (such as the threshold values of its nodes) is kept secret from another party. However, these works cannot be extended trivially to support the outsourcing setting where a third-party who should not have access to the model or the query. Furthermore, their use of an interactive comparison protocol does not support branching program, hence requires interactions with the client to determine the comparison result before resuming the evaluation task. In this paper, we propose the first secure protocol for collaborative evaluation of random forests contributed by multiple owners. They outsource evaluation tasks to a third-party evaluator. Upon receiving the client's encrypted inputs, the cloud evaluates obliviously on individually encrypted random forest models and calculates the aggregated result. The system is based on our new secure comparison protocol, secure counting protocol, and a multi-key somewhat homomorphic encryption on top of symmetric-key encryption. This allows us to reduce communication overheads while achieving round complexity lower than existing work.