
Showing papers on "Decision tree model published in 2013"


Proceedings ArticleDOI
23 Jun 2013
TL;DR: In this paper, the authors propose to use a mixed representation of single and combined parts to approximate their joint distribution in a simple tree model, and then perform inference on the learned latent tree.
Abstract: Simple tree models for articulated objects have prevailed over the last decade. However, it is also believed that these simple tree models are not capable of capturing large variations in many scenarios, such as human pose estimation. This paper attempts to address three questions: 1) are simple tree models sufficient? More specifically, 2) how can tree models be used effectively in human pose estimation? And 3) how shall we use combined parts together with single parts efficiently? Assume we have a set of single parts and combined parts, and the goal is to estimate a joint distribution of their locations. We surprisingly find that no latent variables are introduced on the Leeds Sport Dataset (LSP) when learning latent trees for the deformable model, which aims at approximating the joint distribution of body part locations using a minimal tree structure. This suggests one can straightforwardly use a mixed representation of single and combined parts to approximate their joint distribution in a simple tree model. As such, one only needs to build Visual Categories of the combined parts, and then perform inference on the learned latent tree. Our method outperformed the state of the art on the LSP, both when the training images are from the same dataset and when they are from the PARSE dataset. Experiments on animal images from the VOC challenge further support our findings.
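For context on why a tree structure matters here: exact MAP inference in a tree-structured part model reduces to dynamic programming (max-sum message passing) over the tree, which is what keeps inference with mixed single and combined parts tractable. The sketch below is a generic max-sum routine over discrete candidate locations, with data layout and names of my own choosing; it is not the authors' code and omits their appearance models and Visual Categories.

```python
import numpy as np

def max_sum_tree(children, unary, pairwise, root=0):
    """Exact MAP inference on a tree-structured part model via leaf-to-root
    message passing followed by root-to-leaf backtracking.

    children : dict node -> list of child nodes
    unary    : dict node -> (K,) array of scores over K candidate locations
    pairwise : dict (parent, child) -> (K, K) compatibility scores
    """
    # parent map, needed to direct messages toward the root
    parent_of, stack = {}, [root]
    while stack:
        u = stack.pop()
        for c in children.get(u, []):
            parent_of[c] = u
            stack.append(c)

    msg, back = {}, {}

    def upward(node):
        score = unary[node].astype(float)
        for c in children.get(node, []):
            upward(c)
            score += msg[c]
        if node == root:
            msg[node] = score
        else:
            table = pairwise[(parent_of[node], node)] + score[None, :]
            msg[node] = table.max(axis=1)       # best contribution per parent state
            back[node] = table.argmax(axis=1)   # and which child state achieves it

    upward(root)

    # backtrack the argmax labels from the root down
    labels, stack = {root: int(msg[root].argmax())}, [root]
    while stack:
        u = stack.pop()
        for c in children.get(u, []):
            labels[c] = int(back[c][labels[u]])
            stack.append(c)
    return float(msg[root].max()), labels

# toy 3-part chain: root(0) - child(1) - grandchild(2), K = 2 candidate locations each
K = 2
children = {0: [1], 1: [2]}
unary = {i: np.array([0.0, 1.0]) for i in range(3)}
pairwise = {(0, 1): np.eye(K), (1, 2): np.eye(K)}
print(max_sum_tree(children, unary, pairwise))   # (5.0, all parts at location 1)
```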

138 citations


Posted Content
TL;DR: This paper surprisingly finds that no latent variables are introduced on the Leeds Sport Dataset (LSP) when learning latent trees for the deformable model, which aims at approximating the joint distribution of body part locations using a minimal tree structure.
Abstract: Simple tree models for articulated objects have prevailed over the last decade. However, it is also believed that these simple tree models are not capable of capturing large variations in many scenarios, such as human pose estimation. This paper attempts to address three questions: 1) are simple tree models sufficient? More specifically, 2) how can tree models be used effectively in human pose estimation? And 3) how shall we use combined parts together with single parts efficiently? Assume we have a set of single parts and combined parts, and the goal is to estimate a joint distribution of their locations. We surprisingly find that no latent variables are introduced on the Leeds Sport Dataset (LSP) when learning latent trees for the deformable model, which aims at approximating the joint distribution of body part locations using a minimal tree structure. This suggests one can straightforwardly use a mixed representation of single and combined parts to approximate their joint distribution in a simple tree model. As such, one only needs to build Visual Categories of the combined parts, and then perform inference on the learned latent tree. Our method outperformed the state of the art on the LSP, both when the training images are from the same dataset and when they are from the PARSE dataset. Experiments on animal images from the VOC challenge further support our findings.

109 citations


Journal ArticleDOI
TL;DR: This review covers the latent tree model, a particular type of probabilistic graphical model that deserves attention because its simple structure allows simple and efficient inference, while its latent variables capture complex relationships.
Abstract: In data analysis, latent variables play a central role because they help provide powerful insights into a wide variety of phenomena, ranging from biological to human sciences. The latent tree model, a particular type of probabilistic graphical model, deserves attention. Its simple structure - a tree - allows simple and efficient inference, while its latent variables capture complex relationships. In the past decade, the latent tree model has been subject to significant theoretical and methodological developments. In this review, we propose a comprehensive study of this model. First, we summarize the key ideas underlying the model. Second, we explain how it can be efficiently learned from data. Third, we illustrate its use within three types of applications: latent structure discovery, multidimensional clustering, and probabilistic inference. Finally, we conclude and give promising directions for future research in this field.
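As a quick orientation (notation mine, not taken from the review), a latent tree model over observed variables x and latent variables h factorizes along the edges of a rooted tree, and the observed marginal is obtained by summing out the latent nodes:

```latex
p(x, h) \;=\; \prod_{v \in V} p\bigl(v \mid \mathrm{pa}(v)\bigr),
\qquad
p(x) \;=\; \sum_{h} \prod_{v \in V} p\bigl(v \mid \mathrm{pa}(v)\bigr)
```

Here V contains both the observed leaves and the latent internal nodes, and pa(v) is the parent of v in the tree (the root has none). Because the graph is a tree, the sum over h can be carried out by message passing in time linear in the number of nodes, which is the efficiency the review highlights.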

94 citations


Journal ArticleDOI
TL;DR: Decision tree methods can be used efficiently for GSH analysis, with results higher than those previously reported for decision trees, and might be widely used for the prediction of various spatial events.

90 citations


Proceedings ArticleDOI
23 Jun 2013
TL;DR: This paper shows how the parameters of the label tree can be found using maximum likelihood estimation, and produces a label tree with significantly improved recognition accuracy.
Abstract: Large-scale recognition problems with thousands of classes pose a particular challenge because applying the classifier requires more computation as the number of classes grows. The label tree model integrates classification with the traversal of the tree so that complexity grows logarithmically. In this paper, we show how the parameters of the label tree can be found using maximum likelihood estimation. This new probabilistic learning technique produces a label tree with significantly improved recognition accuracy.
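To make the logarithmic-complexity claim concrete, here is a minimal routing sketch of a label tree at prediction time. The data structures and linear scorers are my own simplification; how the tree parameters are actually estimated by maximum likelihood is the paper's contribution and is not shown.

```python
import numpy as np

class LabelTreeNode:
    """One node of a label tree: a leaf holds a class label, an internal node
    holds one linear scorer per child branch."""
    def __init__(self, label=None, children=None, weights=None):
        self.label = label                 # class id if this is a leaf
        self.children = children or []     # list of LabelTreeNode
        self.weights = weights             # (n_children, n_features) for internal nodes

def predict(node, x):
    """Route x down the tree; cost grows with depth, not with the number of classes."""
    while node.children:
        scores = node.weights @ x          # one score per child branch
        node = node.children[int(np.argmax(scores))]
    return node.label

# toy tree over 4 classes and 2 features
leaves = [LabelTreeNode(label=c) for c in range(4)]
left = LabelTreeNode(children=leaves[:2], weights=np.array([[1.0, 0.0], [0.0, 1.0]]))
right = LabelTreeNode(children=leaves[2:], weights=np.array([[1.0, 0.0], [0.0, 1.0]]))
root = LabelTreeNode(children=[left, right], weights=np.array([[1.0, 1.0], [-1.0, -1.0]]))
print(predict(root, np.array([2.0, 0.5])))   # routes root -> left -> class 0
```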

84 citations


Journal ArticleDOI
TL;DR: The proposed model, called Flexible Beta Basis Function Neural Tree (FBBFNT), can be created and optimized based on predefined Beta operator sets, and its performance is compared with that of related methods.

51 citations


Journal ArticleDOI
TL;DR: A novel tag anti-collision algorithm called M-ary query tree scheme (MQT) is proposed, and theoretical analysis and simulation results verify that MQT outperforms other tree-based protocols in terms of time complexity and communication overhead.
Abstract: An anti-collision scheme in RFID systems is required to identify all the tags in the reader field. Deterministic tree search algorithms are mostly used because they guarantee that all the tags in the field are identified and achieve the best performance. Such tree search algorithms are based on the binary tree, and arbitration is made one bit at a time. In this letter, a novel tag anti-collision algorithm called the M-ary query tree scheme (MQT) is proposed. An analytic model is developed for the response time needed to identify all tags, and the optimal M-ary tree for the minimum average response time is then derived. Our theoretical analysis and simulation results verify that MQT outperforms other tree-based protocols in terms of time complexity and communication overhead.
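The following toy simulation illustrates the generic M-ary query tree idea that MQT builds on: the reader broadcasts a prefix, and a collision causes the prefix to be split into M longer prefixes. It is a sketch of the query-counting logic only, with made-up tag IDs, and does not model MQT's slot timing or communication overhead.

```python
def mary_query_tree(tags, M=4):
    """Count the reader queries needed to single out every tag ID (strings over
    an M-ary alphabet): a colliding prefix is split into M longer prefixes,
    each of which is queried in turn."""
    queries, identified = 0, []
    stack = ['']                          # start from the empty prefix
    while stack:
        prefix = stack.pop()
        queries += 1
        matching = [t for t in tags if t.startswith(prefix)]
        if len(matching) == 1:            # exactly one response: tag identified
            identified.append(matching[0])
        elif len(matching) > 1:           # collision: extend the prefix M ways
            stack.extend(prefix + str(symbol) for symbol in range(M))
        # len(matching) == 0 is an idle query
    return queries, identified

# toy run with distinct quaternary (M = 4) tag IDs
print(mary_query_tree(['0123', '0132', '3210', '2222'], M=4))
```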

46 citations


Journal ArticleDOI
TL;DR: By using the decision tree model, the proposed DTTSVM effectively overcomes the ambiguity that can occur in multi-TWSVM and MBSVM.

46 citations


Journal ArticleDOI
TL;DR: This work gives a review of some works on the complexity of implementation of arithmetic operations in finite fields by Boolean circuits.
Abstract: We give a review of some works on the complexity of implementation of arithmetic operations in finite fields by Boolean circuits.

36 citations


Journal ArticleDOI
TL;DR: In this article, a set of tools for variable selection and sensitivity analysis based on the recently proposed dynamic tree model is developed for the automatic tuning of computer codes, where the response function is nonlinear and noisy and may not be smooth or stationary, and where variable selection, decomposition of influence, and analysis of main and secondary effects are needed for both real-valued and binary inputs and outputs.
Abstract: We investigate an application in the automatic tuning of computer codes, an area of research that has come to prominence alongside the recent rise of distributed scientific processing and heterogeneity in high-performance computing environments. Here, the response function is nonlinear and noisy and may not be smooth or stationary. Clearly needed are variable selection, decomposition of influence, and analysis of main and secondary effects for both real-valued and binary inputs and outputs. Our contribution is a novel set of tools for variable selection and sensitivity analysis based on the recently proposed dynamic tree model. We argue that this approach is uniquely well suited to the demands of our motivating example. In illustrations on benchmark data sets, we show that the new techniques are faster and offer richer feature sets than do similar approaches in the static tree and computer experiment literature. We apply the methods in code-tuning optimization, examination of a cold-cache effect, and detection of transformation errors.

36 citations


Posted Content
TL;DR: In this paper, a Markov random field (MRF) approach based on frequent sets and maximum entropy is proposed to estimate the number of rows in the data satisfying a given predicate.
Abstract: Large sparse sets of binary transaction data with millions of records and thousands of attributes occur in various domains: customers purchasing products, users visiting web pages, and documents containing words are just three typical examples. Real-time query selectivity estimation (the problem of estimating the number of rows in the data satisfying a given predicate) is an important practical problem for such databases. We investigate the application of probabilistic models to this problem. In particular, we study a Markov random field (MRF) approach based on frequent sets and maximum entropy, and compare it to the independence model and the Chow-Liu tree model. We find that the MRF model provides substantially more accurate probability estimates than the other methods but is more expensive from a computational and memory viewpoint. To alleviate the computational requirements we show how one can apply bucket elimination and clique tree approaches to take advantage of structure in the models and in the queries. We provide experimental results on two large real-world transaction datasets.
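For reference, the simplest baseline compared in the paper, the independence model, estimates a conjunctive query's selectivity as the product of single-attribute marginals. A small sketch (synthetic data and names of my own) shows the estimator next to the exact count:

```python
import numpy as np

def independence_selectivity(data, query_cols):
    """Estimate how many rows have a 1 in every attribute of `query_cols`,
    assuming attribute independence: N times the product of marginal frequencies."""
    n_rows = data.shape[0]
    marginals = data[:, query_cols].mean(axis=0)     # P(attr = 1) for each queried attribute
    return n_rows * float(np.prod(marginals))

rng = np.random.default_rng(0)
data = (rng.random((10_000, 5)) < 0.3).astype(int)   # synthetic sparse binary transactions
estimate = independence_selectivity(data, [0, 2, 4])
truth = int(data[:, [0, 2, 4]].all(axis=1).sum())
print(f"independence estimate: {estimate:.0f}, true count: {truth}")
```

The Chow-Liu tree and MRF models studied in the paper refine this baseline by capturing pairwise and higher-order dependencies, at increasing computational and memory cost.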

Journal ArticleDOI
TL;DR: Hockey was chosen as an example to illustrate the potential use of decision tree induction for identifying and communicating characteristics that drive the outcome, and the suitability of decision trees for analysing the features of one-versus-one exchanges is discussed.
Abstract: Decision tree induction is a novel approach to exploring attacker-defender interactions in many sports. In this study hockey was chosen as an example to illustrate the potential use of decision tree inductions for the purpose of identifying and communicating characteristics that drive the outcome. Elite female players performed one-versus-one contests (n = 75) over two sessions. Each contest outcome was classified as either a win or loss. Position data were acquired using radio-tracking devices, and movement-based derivatives were calculated for two time epochs (5 to 2.5 seconds, and 2.5 to zero seconds before the outcome occurred). A decision tree model was trained using these attributes from the first session data, which predicted that when the attacker was moving at ≥ 0.5 m·s⁻¹ faster than the defender during the early epoch, the probability of an attacker's win was 1.00. Conversely, when the speed difference at that time was below this threshold the probability of a loss was 0.78. Secondary...

Journal ArticleDOI
TL;DR: The objective of this paper is to examine the performance of recently developed decision tree modeling algorithms and compare it with that achieved by a radial basis function kernel support vector machine (RBFSVM) on the diagnosis of breast cancer using a cytologically proven tumor dataset.
Abstract: Breast cancer represents the second most important cause of cancer deaths in women today, and it is the most common type of cancer in women. Disease diagnosis is one of the applications where data mining tools are proving successful. Data mining with decision trees is a popular and effective classification approach. Decision trees have the ability to generate understandable classification rules, which are a very efficient tool for transferring knowledge to physicians and medical specialists. In essence, they provide a way to find rules that can be evaluated for separating the input samples into one of several groups without having to state the functional relationship directly. The objective of this paper is to examine the performance of recently developed decision tree modeling algorithms and compare it with that achieved by a radial basis function kernel support vector machine (RBFSVM) on the diagnosis of breast cancer using a cytologically proven tumor dataset. Four decision tree models have been evaluated, including Chi-squared Automatic Interaction Detection (CHAID) and Classification and Regression Tree (C&RT), in terms of classification accuracy, sensitivity, and specificity.
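A rough sense of such a comparison can be reproduced with off-the-shelf tools. The sketch below uses scikit-learn's bundled Wisconsin breast cancer data as a stand-in for the paper's cytological dataset and a CART-style tree as a stand-in for CHAID/C&RT (scikit-learn does not ship CHAID); it illustrates the evaluation setup, not the paper's experiment.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)   # stand-in for the cytological tumor dataset

models = {
    "CART decision tree": DecisionTreeClassifier(max_depth=4, random_state=0),
    "RBF-kernel SVM": make_pipeline(StandardScaler(), SVC(kernel="rbf", gamma="scale")),
}
for name, model in models.items():
    acc = cross_val_score(model, X, y, cv=10, scoring="accuracy").mean()
    print(f"{name}: mean 10-fold accuracy = {acc:.3f}")
```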

Journal ArticleDOI
TL;DR: The results show that the proposed approach outperforms the pure decision tree model because the former has the capability of examining the marginal effects of risk factors.
Abstract: This study presents a tree-based logistic regression approach to assessing work zone casualty risk, which is defined as the probability of a vehicle occupant being killed or injured in a work zone crash. First, a decision tree approach is employed to determine the tree structure and interacting factors. Based on the Michigan M-94\I-94\I-94BL\I-94BR highway work zone crash data, an optimal tree comprising four leaf nodes is first determined and the interacting factors are found to be airbag, occupant identity (i.e., driver, passenger), and gender. The data are then split into four groups according to the tree structure. Finally, the logistic regression analysis is separately conducted for each group. The results show that the proposed approach outperforms the pure decision tree model because the former has the capability of examining the marginal effects of risk factors. Compared with the pure logistic regression method, the proposed approach avoids the variable interaction effects so that it significantly improves the prediction accuracy.
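The two-stage idea described above, a shallow tree to define interacting groups followed by a separate logistic regression within each group, can be sketched as follows. Column meanings, group handling, and hyperparameters here are placeholders of mine, not the study's specification.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

def fit_tree_segmented_logit(X, y, max_leaf_nodes=4):
    """Two-stage sketch: a shallow tree partitions the data into groups,
    then a separate logistic regression is fitted inside each group."""
    tree = DecisionTreeClassifier(max_leaf_nodes=max_leaf_nodes, random_state=0).fit(X, y)
    leaf_ids = tree.apply(X)                       # group index for every row
    group_models = {}
    for leaf in np.unique(leaf_ids):
        mask = leaf_ids == leaf
        if len(np.unique(y[mask])) > 1:            # a logit needs both outcome classes
            group_models[leaf] = LogisticRegression(max_iter=1000).fit(X[mask], y[mask])
    return tree, group_models

def predict_casualty_prob(tree, group_models, X):
    """Probability of the positive outcome; degenerate leaves fall back to the tree."""
    leaf_ids = tree.apply(X)
    probs = np.empty(len(X))
    for i, leaf in enumerate(leaf_ids):
        model = group_models.get(leaf)
        probs[i] = (model.predict_proba(X[i:i + 1])[0, 1] if model is not None
                    else tree.predict_proba(X[i:i + 1])[0, 1])
    return probs
```

Splitting first and then fitting separate logits is what lets each group's coefficients be read as marginal effects free of the interaction terms that a single pooled logit would need.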

Journal ArticleDOI
TL;DR: The different rules classified by the decision tree model in this study should contribute as baseline data for discovering informative knowledge and developing interventions tailored to these individual characteristics.
Abstract: Purpose: The purpose of this study was to develop a prediction model for the characteristics of older adults with depression using the decision tree method. Methods: A large dataset from the 2008 Korean Elderly Survey was used and data of 14,970 elderly people were analyzed. Target variable was depression and 53 input variables were general characteristics, family & social relationship, economic status, health status, health behavior, functional status, leisure & social activity, quality of life, and living environment. Data were analyzed by decision tree analysis, a data mining technique using SPSS Window 19.0 and Clementine 12.0 programs. Results: The decision trees were classified into five different rules to define the characteristics of older adults with depression. Classification & Regression Tree (C&RT) showed the best prediction with an accuracy of 80.81% among data mining models. Factors in the rules were life satisfaction, nutritional status, daily activity difficulty due to pain, functional limitation for basic or instrumental daily activities, number of chronic diseases and daily activity difficulty due to disease. Conclusion: The different rules classified by the decision tree model in this study should contribute as baseline data for discovering informative knowledge and developing interventions tailored to these individual characteristics.

Journal ArticleDOI
TL;DR: An incremental optimization mechanism with an optimized node-splitting control is proposed; it seeks a balance between accuracy and tree size for data stream mining and obtains an optimal tree structure on both numeric and nominal datasets.
Abstract: Imperfect data streams lead to tree size explosion and detrimental accuracy problems. The overfitting problem and imbalanced class distribution reduce the performance of the original decision-tree algorithm for stream mining. In this paper, we propose an incremental optimization mechanism to solve these problems. The mechanism, called Optimized Very Fast Decision Tree (OVFDT), possesses an optimized node-splitting control mechanism. Accuracy, tree size, and learning time are the significant factors influencing the algorithm's performance; naturally, a bigger tree takes longer to compute. OVFDT is a pioneering model equipped with an incremental optimization mechanism that seeks a balance between accuracy and tree size for data stream mining. It operates incrementally by a test-then-train approach. Three types of functional tree leaves improve the accuracy with which the tree model makes a prediction for a new data stream in the testing phase. The optimized node-splitting mechanism controls the tree model's growth in the training phase. The experiments show that OVFDT obtains an optimal tree structure on both numeric and nominal datasets.
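OVFDT builds on the Very Fast Decision Tree family, whose node-splitting test relies on the Hoeffding bound: a leaf is split only once the observed gain of the best attribute exceeds that of the runner-up by more than the bound. The snippet shows only this standard VFDT-style test (the tie-breaking threshold value is illustrative); OVFDT's optimized control adds further conditions on top of it that are not reproduced here.

```python
import math

def hoeffding_bound(value_range, delta, n):
    """Hoeffding bound used by VFDT-style stream learners: with probability
    1 - delta, the observed mean of n samples is within eps of the true mean."""
    return math.sqrt((value_range ** 2) * math.log(1.0 / delta) / (2.0 * n))

def should_split(best_gain, second_best_gain, n_seen, delta=1e-6, tie_threshold=0.05):
    """Split a leaf once the best attribute beats the runner-up by more than the
    Hoeffding bound, or once the bound is small enough to break a tie."""
    eps = hoeffding_bound(value_range=1.0, delta=delta, n=n_seen)  # info gain lies in [0, 1] for binary classes
    return (best_gain - second_best_gain > eps) or (eps < tie_threshold)
```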

Proceedings ArticleDOI
01 Jan 2013
TL;DR: This note gives an example of a sampling problem whose information and communication complexity the authors conjecture to be as much as exponentially far apart.
Abstract: Whether the information complexity of any interactive problem is close to its communication complexity is an important open problem. In this note we give an example of a sampling problem whose information and communication complexity we conjecture to be as much as exponentially far apart.

Book ChapterDOI
TL;DR: The methodology for proving lower bounds on the query complexity of property testing via communication complexity, which was put forward by Blais, Brody, and Matulef, is considered.
Abstract: We consider the methodology for proving lower bounds on the query complexity of property testing via communication complexity, which was put forward by Blais, Brody, and Matulef (Computational Complexity, 2012). They provided a restricted formulation of their methodology (via “simple combining operators”) and also hinted towards a more general formulation, which we spell out in this paper.

Proceedings ArticleDOI
19 Jun 2013
TL;DR: A model-based complexity number that is defined on the decision diagram (DD) representation of the system functionality gives an upper bound on the number of tests that are necessary to achieve Condition/Decision (C/D) coverage (which is required for safety critical systems).
Abstract: The development cost of safety-critical embedded systems is dominated today by the cost of software including verification and validation. This cost is typically related to the complexity of the software functions implementing the desired system behavior in nominal and off-nominal conditions. A widely used measure of complexity is the cyclomatic number, which is computed on the implementation code. However this technique is not effective when model-based development and code generation are used because the complexity of the software also depends on the communication and execution semantics of the models. This paper proposes a model-based complexity number that is defined on the decision diagram (DD) representation of the system functionality. The proposed complexity number gives an upper bound on the number of tests that are necessary to achieve Condition/Decision (C/D) coverage (which is required for safety critical systems). We show that the number of tests is related to the min-flow/max-cut computed on the DD. By comparing the proposed metric with the cyclomatic complexity, we show that the former seems to be better suited for capturing the complexity of the model than the latter. A case study on an aircraft power system shows that the complexity metric has applications in functional partitioning and architecture selection.

Posted Content
TL;DR: This paper develops novel proposal mechanisms for efficient sampling in the Bayesian Additive Regression Tree (BART) model and implements this sampling algorithm in the model and demonstrates its effectiveness on a prediction problem from computer experiments and a test function where structural tree variability is needed to fully explore the posterior.
Abstract: Bayesian regression trees are flexible non-parametric models that are well suited to many modern statistical regression problems. Many such tree models have been proposed, from the simple single-tree model to more complex tree ensembles. Their non-parametric formulation allows for effective and efficient modeling of datasets exhibiting complex non-linear relationships between the model predictors and observations. However, the mixing behavior of the Markov Chain Monte Carlo (MCMC) sampler is sometimes poor. This is because the proposals in the sampler are typically local alterations of the tree structure, such as the birth/death of leaf nodes, which does not allow for efficient traversal of the model space. This poor mixing can lead to inferential problems, such as under-representing uncertainty. In this paper, we develop novel proposal mechanisms for efficient sampling. The first is a rule perturbation proposal while the second we call tree rotation. The perturbation proposal can be seen as an efficient variation of the change proposal found in existing literature. The novel tree rotation proposal is simple to implement as it only requires local changes to the regression tree structure, yet it efficiently traverses disparate regions of the model space along contours of equal probability. When combined with the classical birth/death proposal, the resulting MCMC sampler exhibits good acceptance rates and properly represents model uncertainty in the posterior samples. We implement this sampling algorithm in the Bayesian Additive Regression Tree (BART) model and demonstrate its effectiveness on a prediction problem from computer experiments and a test function where structural tree variability is needed to fully explore the posterior.

Posted Content
TL;DR: In this paper, an online tree-based Bayesian approach for reinforcement learning is proposed, where the tree structure itself is constructed using the cover tree method, which remains efficient in high dimensional spaces.
Abstract: This paper proposes an online tree-based Bayesian approach for reinforcement learning. For inference, we employ a generalised context tree model. This defines a distribution on multivariate Gaussian piecewise-linear models, which can be updated in closed form. The tree structure itself is constructed using the cover tree method, which remains efficient in high dimensional spaces. We combine the model with Thompson sampling and approximate dynamic programming to obtain effective exploration policies in unknown environments. The flexibility and computational simplicity of the model render it suitable for many reinforcement learning problems in continuous state spaces. We demonstrate this in an experimental comparison with least squares policy iteration.

Journal ArticleDOI
TL;DR: Harmful alcohol users in a shared living situation, with high interpersonal sensitivity, have a significantly higher probability of positive treatment outcome, as shown by recursive partitioning classification tree analysis.
Abstract: Internet-based interventions are seen as attractive for harmful users of alcohol and lead to desirable clinical outcomes. Some participants will however not achieve the desired results. In this study, harmful users of alcohol have been partitioned in subgroups with low, intermediate or high probability of positive treatment outcome, using recursive partitioning classification tree analysis. Data were obtained from a randomized controlled trial assessing the effectiveness of two Internet-based alcohol interventions. The main outcome variable was treatment response, a dichotomous outcome measure for treatment success. Candidate predictors for the classification analysis were first selected using univariate regression. Next, a tree decision model to classify participants in categories with a low, medium and high probability of treatment response was constructed using recursive partitioning software. Based on literature review, 46 potentially relevant baseline predictors were identified. Five variables were selected using univariate regression as candidate predictors for the classification analysis. Two variables were found most relevant for classification and selected for the decision tree model: ‘living alone’, and ‘interpersonal sensitivity’. Using sensitivity analysis, the robustness of the decision tree model was supported. Harmful alcohol users in a shared living situation, with high interpersonal sensitivity, have a significantly higher probability of positive treatment outcome. The resulting decision tree model may be used as part of a decision support system but is on its own insufficient as a screening algorithm with satisfactory clinical utility. Netherlands Trial Register (Cochrane Collaboration): NTR-TC1155 .

Journal ArticleDOI
TL;DR: A model of a neural tree architecture with probabilistic neurons, used to classify a large number of computer grid resources into classes, is proposed; improvements were achieved even for middle-sized and small batches of tasks.
Abstract: This paper proposes a model of a neural tree architecture with probabilistic neurons. These trees are used to classify a large number of computer grid resources into classes. The first tree is used to classify the hardware part of the dataset; the second tree classifies patterns of software identifiers. The trees are implemented to successfully separate inputs into nine classes of resources. We also propose a Particle Swarm Optimization model for task scheduling in a computer grid. We compared the schedule creation time and the makespan in six series of experiments with and without neural trees. In the experiments using the neural trees, we obtained a subset of suitable computational resources. The aim is the effective mapping of a large batch of tasks onto particular resources. On the basis of the experiments, we can say that improvements were achieved even for middle-sized and small batches of tasks.

Journal ArticleDOI
TL;DR: In the decision tree analysis, pain and discomfort during the last 2 weeks, age, the longest-held occupation, and thyroid disorders were significantly associated with self-reported voice problems.
Abstract: The purpose of this study was to analyze the risk factors of self-reported voice problems. Data were from the Korea National Health and Nutritional Examination Survey 2008. Subjects were 3,600 persons (1,501 men, 2,099 women) aged 19 years and older. A prediction model was developed by the use of an exhaustive CHAID (Chi-squared Automatic Interaction Detection) algorithm of the decision tree model. In the decision tree analysis, pain and discomfort during the last 2 weeks, age, the longest-held occupation, and thyroid disorders were significantly associated with self-reported voice problems. The findings on associated factors suggest potential ways of targeting counseling and prevention efforts to control self-reported voice problems.

Journal Article
TL;DR: It is shown that the complexity in the decision tree model of computing composite relations of the form h = g ∘ (f^1, ..., f^n), where each relation f^i is boolean-valued, is completely characterised.
Abstract: We completely characterise the complexity in the decision tree model of computing composite relations of the form h = g ∘ (f^1, ..., f^n), where each relation f^i is boolean-valued. Immediate corollaries include a direct sum theorem for decision tree complexity and a tight characterisation of the decision tree complexity of iterated boolean functions.
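For intuition about what such a characterisation must contain, the easy direction is the classical composition upper bound for deterministic query complexity, stated here for total boolean functions rather than the relations the paper handles:

```latex
% Simulate an optimal tree for g; each time it queries its i-th input bit,
% answer the query by running an optimal tree for f^i on that block of variables.
D\bigl(g \circ (f^1, \dots, f^n)\bigr) \;\le\; D(g) \cdot \max_{1 \le i \le n} D(f^i)
```

The paper's contribution is the matching characterisation in the more general relational setting.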

Journal ArticleDOI
TL;DR: A new Tree-based Backtracking Orthogonal Matching Pursuit (TBOMP) algorithm is presented with the idea of the tree model in wavelet domain, which can convert the wavelet tree structure to the corresponding relations of candidate atoms without any prior information of signal sparsity.
Abstract: Compressed sensing (CS) is a theory which exploits the sparsity of the original signal in signal sampling and coding. By solving an optimization problem, the original sparse signal can be reconstructed accurately. In this paper, a new Tree-based Backtracking Orthogonal Matching Pursuit (TBOMP) algorithm is presented based on the tree model in the wavelet domain. The algorithm converts the wavelet tree structure into corresponding relations among candidate atoms without any prior information about signal sparsity. Thus, the atom selection process becomes more structured and the search space can be narrowed. Moreover, through the backtracking process, the reliability of previously chosen atoms can be checked and unreliable atoms can be deleted at each iteration, which ultimately leads to an accurate reconstruction of the signal. Simulation results show the proposed algorithm's superior performance compared with several other OMP-type algorithms.
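For readers unfamiliar with the OMP family that TBOMP extends, here is plain Orthogonal Matching Pursuit with a toy exact-recovery check. It is the standard greedy baseline only; the wavelet-tree-guided atom selection and the backtracking deletion step that define TBOMP are not included.

```python
import numpy as np

def omp(A, y, sparsity):
    """Plain Orthogonal Matching Pursuit: greedily pick the column of A most
    correlated with the residual, then least-squares re-fit on all chosen columns."""
    residual, support = y.copy(), []
    for _ in range(sparsity):
        correlations = np.abs(A.T @ residual)
        correlations[support] = 0                    # never pick an atom twice
        support.append(int(np.argmax(correlations)))
        coef, *_ = np.linalg.lstsq(A[:, support], y, rcond=None)
        residual = y - A[:, support] @ coef
    x = np.zeros(A.shape[1])
    x[support] = coef
    return x

# toy check: recover a 3-sparse vector from 40 random measurements
rng = np.random.default_rng(1)
A = rng.standard_normal((40, 100)); A /= np.linalg.norm(A, axis=0)
x_true = np.zeros(100); x_true[[5, 17, 63]] = [1.0, -2.0, 0.5]
print(np.allclose(omp(A, A @ x_true, 3), x_true, atol=1e-8))
```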

Journal Article
TL;DR: By using a decision tree model and an enhanced ID3 algorithm, it is found that car insurance settlement is mainly influenced by driving experience and the use process, and a further analysis of these influencing factors is then made.
Abstract: Using data mining methods, this paper tries to achieve more effective market segmentation and to find the optimal settlement price in financial companies. Through an assessment of customer value using a BP neural network, customers can be classified scientifically and rationally, and different marketing strategies can be adopted for different customers to improve customer relationship management. By using a decision tree model and an enhanced ID3 algorithm, we find that car insurance settlement is mainly influenced by driving experience and the use process, and we then further analyze these influencing factors.

Posted Content
TL;DR: In this paper, the complexity of the decision tree model of computing composite relations of the form h = g(f^1,...,f^n), where each relation f^i is boolean-valued, is characterized.
Abstract: We completely characterise the complexity in the decision tree model of computing composite relations of the form h = g(f^1,...,f^n), where each relation f^i is boolean-valued. Immediate corollaries include a direct sum theorem for decision tree complexity and a tight characterisation of the decision tree complexity of iterated boolean functions.

Journal ArticleDOI
Yang Wang, Peng Zeng1, Haibin Yu1, Yanyu Zhang1, Xu Wang1 
TL;DR: An informational architecture based on the conceptual energy tree can finally be established using incomplete measurement data and reasoning for large-scale industrial networks.
Abstract: Service-oriented architectures make establishing comprehensive profiles of smart factories feasible. In this paper, an energy tree model is used to describe a profile that shapes energy system dynamics. The energy tree shows an overall and detailed profile that combines information communication technologies and ontology knowledge bases. A 7-level network protocol defines sustainable communication services for accumulating local information to maintain the global energy tree in real time. The communication protocol manages ever-changing temporal and spatial misalignments by aligning groups of energy resources that are temporally or spatially related. Meanwhile, correlated domain information regarding industrial processes is formulated into ontology models. Ontology-based semantic contexts allocate knowledge-supported attributes to energy resources, including systems, resources, and users. The key objective of context awareness is to align attributes and to intensify couplings between different energy resources by decomposing and aggregating internal ontology models. Intertemporal and interspatial correlations of energy resources are made available by the cooperative transmission of ontology-based semantic contexts in the protocol framework. An informational architecture based on the conceptual energy tree can finally be established using incomplete measurement data and reasoning for large-scale industrial networks. A Smart Grid application instance is given to demonstrate the functionalities of energy tree dynamics.

Proceedings ArticleDOI
10 Nov 2013
TL;DR: An improved text feature selection method, UDsIG, is proposed; it uses the equilibrium of feature distribution to reduce interference with feature selection when features are unevenly distributed, and an improved information gain formula based on weighted dispersion to obtain the optimal feature subset.
Abstract: The classification performance of the previous information gain (IG) algorithm may decline noticeably because of the maldistribution of classes and features; to address this, an improved text feature selection method, UDsIG, is proposed. First, we select features class by class to reduce the impact on feature selection when the classes are unevenly distributed. After that, we use the equilibrium of feature distribution to decrease the interference with feature selection when features are unevenly distributed. We then process class features with a feature relation tree model to retain strongly correlated features. Finally, we use the improved information gain formula, which is based on weighted dispersion, to obtain the optimal feature subset. The experimental results show that the proposed method has better classification performance.
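As background for the IG-based selection that UDsIG improves on, the standard information gain of a binary term-presence feature is the reduction in class entropy from conditioning on the term. A small self-contained sketch (toy labels and term occurrences of my own) computes it; UDsIG's class-wise selection, distribution equilibrium, and weighted-dispersion correction are not reproduced here.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(labels, feature_present):
    """IG of a binary term-presence feature for document class labels:
    H(C) - [P(t) H(C | t) + P(not t) H(C | not t)]."""
    with_t = [c for c, p in zip(labels, feature_present) if p]
    without_t = [c for c, p in zip(labels, feature_present) if not p]
    p_t = len(with_t) / len(labels)
    return entropy(labels) - p_t * entropy(with_t) - (1 - p_t) * entropy(without_t)

# toy corpus: class labels and whether the term occurs in each document
labels = ['sports', 'sports', 'politics', 'politics', 'sports', 'politics']
has_t  = [True,      True,     False,      False,      True,     False]
print(information_gain(labels, has_t))   # 1.0: the term perfectly separates the classes
```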