
Showing papers on "Decision tree model published in 2003"


Journal ArticleDOI
TL;DR: A novel approach, named Decision Forest, is suggested that combines multiple Decision Tree models of similar predictive quality; compared to the individual models, the quality of the combined model is consistently and significantly improved in both training and testing steps.
Abstract: The techniques of combining the results of multiple classification models to produce a single prediction have been investigated for many years. In earlier applications, the multiple models to be combined were developed by altering the training set. The use of these so-called resampling techniques, however, poses the risk of reducing the predictivity of the individual models to be combined and/or overfitting the noise in the data, which might result in the composite model predicting more poorly than the individual models. In this paper, we suggest a novel approach, named Decision Forest, that combines multiple Decision Tree models. Each Decision Tree model is developed using a unique set of descriptors. When models of similar predictive quality are combined using the Decision Forest method, the quality compared to the individual models is consistently and significantly improved in both training and testing steps. An example is presented for prediction of the binding affinity of 232 chemicals to the estrogen receptor.
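A minimal sketch of the Decision Forest idea, assuming scikit-learn and numpy are available; the function names and the random equal-width split of descriptors are illustrative, not the authors' implementation:

```python
# Decision Forest sketch: each tree is trained on its own unique descriptor
# (feature) subset, and class-probability estimates are averaged.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def decision_forest_fit(X, y, n_trees=4, random_state=0):
    rng = np.random.default_rng(random_state)
    cols = rng.permutation(X.shape[1])
    subsets = np.array_split(cols, n_trees)  # one unique descriptor set per tree
    return [(s, DecisionTreeClassifier(random_state=random_state).fit(X[:, s], y))
            for s in subsets]

def decision_forest_predict_proba(forest, X):
    # Combine models of similar quality by averaging their predictions.
    return np.mean([tree.predict_proba(X[:, s]) for s, tree in forest], axis=0)
```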

202 citations


Proceedings ArticleDOI
09 Jun 2003
TL;DR: The following new lower bounds in two concrete complexity models are shown: in the two-party communication complexity model, the tribes function on n inputs has two-sided error randomized complexity Ω(n), while its nondeterministic and co-nondeterministic complexities are both Θ(√n).
Abstract: We show the following new lower bounds in two concrete complexity models: (1) In the two-party communication complexity model, we show that the tribes function on n inputs [6] has two-sided error randomized complexity Ω(n), while its nondeterministic complexity and co-nondeterministic complexity are both Θ(√n). This separation between randomized and nondeterministic complexity is the best possible, and it settles an open problem in Kushilevitz and Nisan [17], which was also posed by Beame and Lawry [5]. (2) In the Boolean decision tree model, we show that the recursive majority-of-three function on 3^h inputs has randomized complexity Ω((7/3)^h). The deterministic complexity of this function is Θ(3^h), and the nondeterministic complexity is Θ(2^h). Our lower bound on the randomized complexity is a substantial improvement over any lower bound for this problem that can be obtained via the techniques of Saks and Wigderson [23], Heiman and Wigderson [14], and Heiman, Newman, and Wigderson [13]. Recursive majority is an important function for which a class of natural algorithms known as directional algorithms does not achieve the best randomized decision tree upper bound. These lower bounds are obtained using generalizations of information complexity, which quantifies the minimum amount of information that will have to be revealed about the inputs by every correct algorithm in a given model of computation.
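The directional algorithms mentioned above have a compact recursive form. The sketch below is an illustrative rendering of the standard randomized directional strategy for recursive majority-of-three (evaluate two random subtrees, read the third only on disagreement), not code from the paper:

```python
import random

def rec_majority(leaves, h):
    """Evaluate recursive majority-of-three on a list of 3**h Boolean leaves."""
    if h == 0:
        return leaves[0]
    third = len(leaves) // 3
    kids = [leaves[:third], leaves[third:2 * third], leaves[2 * third:]]
    random.shuffle(kids)                       # pick two subtrees at random
    a = rec_majority(kids[0], h - 1)
    b = rec_majority(kids[1], h - 1)
    if a == b:                                 # they already decide the majority
        return a
    return rec_majority(kids[2], h - 1)        # tie-break with the third subtree
```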

122 citations


Journal Article
TL;DR: In this article, the authors show that the computational overhead of cross-validation can be reduced significantly by integrating the cross-validation with the normal decision tree induction process; they discuss how existing decision tree algorithms can be adapted to this aim and provide an analysis of the speedups these adaptations may yield.
Abstract: Cross-validation is a useful and generally applicable technique often employed in machine learning, including decision tree induction. An important disadvantage of straightforward implementation of the technique is its computational overhead. In this paper we show that, for decision trees, the computational overhead of cross-validation can be reduced significantly by integrating the cross-validation with the normal decision tree induction process. We discuss how existing decision tree algorithms can be adapted to this aim, and provide an analysis of the speedups these adaptations may yield. We identify a number of parameters that influence the obtainable speedups, and validate and refine our analysis with experiments on a variety of data sets with two different implementations. Besides cross-validation, we also briefly explore the usefulness of these techniques for bagging. We conclude with some guidelines concerning when these optimizations should be considered.
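For reference, a sketch of the straightforward implementation whose overhead the paper reduces, assuming scikit-learn; each of the k folds re-induces a tree from scratch rather than sharing work with the other inductions:

```python
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

def naive_cv_accuracy(X, y, k=10):
    # k independent tree inductions; the paper integrates these with the
    # normal induction process instead of repeating them.
    return cross_val_score(DecisionTreeClassifier(), X, y, cv=k).mean()
```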

116 citations


Journal ArticleDOI
01 Nov 2003
TL;DR: This work investigates the problem of generating fast approximate answers to queries posed to large sparse binary data sets and introduces two techniques for building probabilistic models from frequent itemsets: the itemset maximum entropy model and the itemset inclusion-exclusion model.
Abstract: We investigate the problem of generating fast approximate answers to queries posed to large sparse binary data sets. We focus in particular on probabilistic model-based approaches to this problem and develop a number of techniques that are significantly more accurate than a baseline independence model. In particular, we introduce two techniques for building probabilistic models from frequent itemsets: the itemset maximum entropy model and the itemset inclusion-exclusion model. In the maximum entropy model, we treat itemsets as constraints on the distribution of the query variables and use the maximum entropy principle to build a joint probability model for the query attributes online. In the inclusion-exclusion model, itemsets and their frequencies are stored in a data structure, called an ADtree, that supports an efficient implementation of the inclusion-exclusion principle in order to answer the query. We empirically compare these two itemset-based models to direct querying of the original data, to querying of samples of the original data, and to other probabilistic models such as the independence model, the Chow-Liu tree model, and the Bernoulli mixture model. These models are able to handle high dimensionality (hundreds or thousands of attributes), whereas most other work on this topic has focused on relatively low-dimensional OLAP problems. Experimental results on both simulated and real-world transaction data sets illustrate various fundamental trade-offs between approximation error, model complexity, and the online time required to compute a query answer.
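A sketch of the inclusion-exclusion idea: the frequency of a query with positive items and negated items expands into a signed sum of stored itemset frequencies. The plain dictionary below stands in for the paper's ADtree and is illustrative only:

```python
from itertools import combinations

def query_freq(freq, positives, negatives):
    """freq maps a frozenset of items to its relative frequency.
    Returns P(all positives = 1, all negatives = 0) by inclusion-exclusion."""
    total = 0.0
    for k in range(len(negatives) + 1):
        for subset in combinations(negatives, k):
            total += (-1) ** k * freq[frozenset(positives) | frozenset(subset)]
    return total

freq = {frozenset(): 1.0, frozenset("A"): 0.5, frozenset("B"): 0.4,
        frozenset("AB"): 0.2}
print(query_freq(freq, "A", "B"))  # P(A=1, B=0) = 0.5 - 0.2 = 0.3
```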

98 citations


Proceedings ArticleDOI
24 Aug 2003
TL;DR: This paper proposes an alternative scheme for representing the structural information of documents based on the paths contained in the corresponding tree model, which allows for a new family of meaningful (and at the same time computationally simple) structural similarity measures.
Abstract: Structural information (such as layout and look-and-feel) has been extensively used in the literature for extraction of interesting or relevant data, efficient storage, and query optimization. Traditionally, tree models (such as DOM trees) have been used to represent structural information, especially in the case of HTML and XML documents. However, computation of structural similarity between documents based on the tree model is computationally expensive. In this paper, we propose an alternative scheme for representing the structural information of documents based on the paths contained in the corresponding tree model. Since the model includes partial information about parents, children and siblings, it allows us to define a new family of meaningful (and at the same time computationally simple) structural similarity measures. Our experimental results based on the SIGMOD XML data set as well as HTML document collections from ibm.com, dell.com, and amazon.com show that the representation is powerful enough to produce good clusters of structurally similar pages.
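A sketch of the path-based representation, assuming a simple (tag, children) tuple encoding of the document tree; the Jaccard measure shown is one simple member of the family of path-set similarities, not necessarily the paper's exact measure:

```python
def tag_paths(tree, prefix=""):
    """tree = (tag, [children]); yields root-to-node paths like 'html/body/p'."""
    tag, children = tree
    path = f"{prefix}/{tag}" if prefix else tag
    yield path
    for child in children:
        yield from tag_paths(child, path)

def path_similarity(t1, t2):
    p1, p2 = set(tag_paths(t1)), set(tag_paths(t2))
    return len(p1 & p2) / len(p1 | p2)  # Jaccard similarity over path sets

doc = ("html", [("body", [("p", []), ("p", [])])])
print(path_similarity(doc, doc))  # identical structure -> 1.0
```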

87 citations


Proceedings ArticleDOI
14 Jul 2003
TL;DR: It is shown that there are complexity bounds that cannot be lowered even when approximation techniques are applied, and the possible sources of this complexity are studied.
Abstract: In this work, we suggest representing multiagent systems using computational models, choosing, specifically, Multi-Prover Interactive Protocols to represent agent systems and the interactions occurring within them. This approach enables us to analyze complexity issues related to multiagent systems. We focus here on the complexity of coordination and study the possible sources of this complexity. We show that there are complexity bounds that cannot be lowered even when approximation techniques are applied.

82 citations


Patent
24 Oct 2003
TL;DR: A statistical analysis method based on a predictive statistical tree model is provided; the model first screens genes to reduce noise, applies k-means correlation-based clustering, and then uses singular-value decomposition to extract the single dominant factor (principal component) from each cluster.
Abstract: Provided is a statistical analysis method that is a predictive statistical tree model. This model first screens genes to reduce noise, applies k-means correlation-based clustering, and then uses singular-value decomposition to extract the single dominant factor (principal component) from each cluster. This generates a statistically significant number of cluster-derived singular factors, which we refer to as metagenes, that characterize multiple patterns of expression of the genes across samples. The strategy aims to extract multiple such patterns while reducing dimension and smoothing out gene-specific noise through the aggregation within clusters. Formal predictive analysis then uses these metagenes in a Bayesian classification tree analysis. This generates multiple recursive partitions of the sample into subgroups ('leaves' of the tree), and associates Bayesian predictive probabilities of outcomes with each subgroup. Overall predictions for an individual sample are then generated by averaging predictions, with appropriate weights, across many such tree models. The model includes the use of iterative out-of-sample cross-validation predictions to perform refitting of the model, and mirrors the real-world prognostic context where prediction of new cases as they arise is the major goal.
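A minimal sketch of the metagene-extraction step, assuming numpy and scikit-learn; it uses plain k-means where the patent specifies correlation-based clustering, so treat it as illustrative rather than the patented procedure:

```python
import numpy as np
from sklearn.cluster import KMeans

def metagenes(expr, n_clusters=10, random_state=0):
    """expr: genes x samples matrix. Returns an n_clusters x samples array of
    metagenes, the dominant singular factor of each gene cluster."""
    labels = KMeans(n_clusters, n_init=10,
                    random_state=random_state).fit_predict(expr)
    factors = []
    for c in range(n_clusters):
        cluster = expr[labels == c]
        # First right singular vector = dominant expression pattern across samples.
        _, _, vt = np.linalg.svd(cluster - cluster.mean(axis=0),
                                 full_matrices=False)
        factors.append(vt[0])
    return np.vstack(factors)
```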

81 citations


Journal ArticleDOI
TL;DR: Nondeterministic quantum algorithms for Boolean functions f, which have positive acceptance probability on input x iff f(x)=1, are studied; the results imply that the quantum communication complexities of the equality and disjointness functions are n+1 if no error probability is allowed.
Abstract: We study nondeterministic quantum algorithms for Boolean functions f. Such algorithms have positive acceptance probability on input x iff f(x)=1. In the setting of query complexity, we show that the nondeterministic quantum complexity of a Boolean function is equal to its "nondeterministic polynomial" degree. We also prove a quantum-vs.-classical gap of 1 vs. n for nondeterministic query complexity for a total function. In the setting of communication complexity, we show that the nondeterministic quantum complexity of a two-party function is equal to the logarithm of the rank of a nondeterministic version of the communication matrix. This implies that the quantum communication complexities of the equality and disjointness functions are n+1 if we do not allow any error probability. We also exhibit a total function in which the nondeterministic quantum communication complexity is exponentially smaller than its classical counterpart.
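In the query setting, the "nondeterministic polynomial" degree that the result equates with nondeterministic quantum complexity is the following standard quantity, restated here for convenience:

```latex
\[
  \mathrm{ndeg}(f) \;=\; \min\bigl\{\deg(p) \;:\;
    p(x) \neq 0 \iff f(x) = 1 \ \text{ for all } x \in \{0,1\}^n \bigr\}
\]
```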

72 citations


Book ChapterDOI
01 Jan 2003
TL;DR: This chapter introduces a new algorithm, Bagged Lazy Option Trees (B-LOTs), for constructing decision trees and compares it to an alternative, Bagged Probability Estimation Trees (B-PETs), evaluating the ability of the two methods to make good classification decisions when misclassification costs are asymmetric.
Abstract: Decision tree models typically give good classification decisions but poor probability estimates. In many applications, it is important to have good probability estimates as well. This chapter introduces a new algorithm, Bagged Lazy Option Trees (B-LOTs), for constructing decision trees and compares it to an alternative, Bagged Probability Estimation Trees (B-PETs). The quality of the class probability estimates produced by the two methods is evaluated in two ways. First, we compare the ability of the two methods to make good classification decisions when the misclassification costs are asymmetric. Second, we compare the absolute accuracy of the estimates themselves. The experiments show that B-LOTs produce better decisions and more accurate probability estimates than B-PETs.
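The B-PET baseline can be sketched directly with off-the-shelf tools, assuming scikit-learn; B-LOTs themselves (bagged lazy option trees) have no stock implementation and are not shown:

```python
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Bagged trees whose class-probability estimates are averaged across the
# ensemble, in the spirit of bagged probability estimation trees.
bpet = BaggingClassifier(DecisionTreeClassifier(), n_estimators=25)
# bpet.fit(X_train, y_train); bpet.predict_proba(X_test)
```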

53 citations


Proceedings ArticleDOI
03 Nov 2003
TL;DR: This paper presents an automated and simplified genetic programming (GP) based decision tree modeling technique for the software quality classification problem and shows that the GP-based decision tree technique yielded better classification models.
Abstract: The knowledge of the likely problematic areas of a software system is very useful for improving its overall quality. Based on such information, a more focused software testing and inspection plan can be devised. Decision trees are attractive for a software quality classification problem which predicts the quality of program modules in terms of risk-based classes. They provide a comprehensible classification model which can be directly interpreted by observing the tree structure. A simultaneous optimization of the classification accuracy and the size of the decision tree is a difficult problem, and very few studies have addressed the issue. This paper presents an automated and simplified genetic programming (GP) based decision tree modeling technique for the software quality classification problem. Genetic programming is ideally suited for problems that require optimization of multiple criteria. The proposed technique is based on multi-objective optimization using strongly typed GP. In the context of an industrial high-assurance software system, two fitness functions are used for the optimization problem: one for minimizing the average weighted cost of misclassification, and one for controlling the size of the decision tree. The classification performances of the GP-based decision trees are compared with those based on standard GP, i.e., the S-expression tree. It is shown that the GP-based decision tree technique yielded better classification models. As compared to other decision tree-based methods, such as C4.5, GP-based decision trees are more flexible and can allow optimization of performance objectives other than accuracy. Moreover, the technique provides a practical solution for building models in the presence of conflicting objectives, which is commonly observed in software development practice.
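A hedged sketch of the two fitness criteria named above; the cost weights and the (tag, children) tree encoding are illustrative assumptions, and the GP search itself is omitted:

```python
def avg_weighted_cost(y_true, y_pred, c_fp=1.0, c_fn=10.0):
    """Average weighted misclassification cost (fitness 1); weights illustrative."""
    cost = sum(c_fp if (p == 1 and t == 0) else
               c_fn if (p == 0 and t == 1) else 0.0
               for t, p in zip(y_true, y_pred))
    return cost / len(y_true)

def tree_size(node):
    """Number of nodes in a (tag, children) decision tree (fitness 2)."""
    tag, children = node
    return 1 + sum(tree_size(c) for c in children)
```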

52 citations


Journal ArticleDOI
TL;DR: Validation of the decision tree approach indicated that the overall mechanism prediction accuracy was approximately 85%.

Journal ArticleDOI
TL;DR: This paper is concerned with the computational complexity and convergence performance of transform-domain adaptive filtering algorithms; the transform-domain least-mean-square (LMS) algorithm and the generalized subband decomposition LMS algorithm are considered.
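For orientation, a textbook-style sketch of one transform-domain LMS update (DFT transform, per-bin power normalization); this is the generic form, not necessarily the exact variants analyzed in the paper:

```python
import numpy as np

def tdlms_step(w, x_taps, d, power=None, mu=0.1, beta=0.9, eps=1e-8):
    """One update of transform-domain LMS; w and the returned weights are complex."""
    u = np.fft.fft(x_taps) / np.sqrt(len(x_taps))   # orthogonal transform of taps
    p = np.abs(u) ** 2 if power is None else beta * power + (1 - beta) * np.abs(u) ** 2
    y = np.vdot(w, u)                               # filter output, y = w^H u
    e = d - y                                       # estimation error
    w = w + mu * np.conj(e) * u / (p + eps)         # power-normalized update
    return w, e, p
```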

Proceedings Article
09 Aug 2003
TL;DR: A novel, promising approach is presented that allows greedy decision tree induction algorithms to handle problematic functions such as parity functions; it is effective with only modest amounts of data for problematic functions or subfunctions of up to six or seven variables.
Abstract: This paper presents a novel, promising approach that allows greedy decision tree induction algorithms to handle problematic functions such as parity functions. Lookahead is the standard approach to addressing difficult functions for greedy decision tree learners. Nevertheless, this approach is limited to very small problematic functions or subfunctions (2 or 3 variables), because the time complexity grows more than exponentially with the depth of lookahead. In contrast, the approach presented in this paper carries only a constant run-time penalty. Experiments indicate that the approach is effective with only modest amounts of data for problematic functions or subfunctions of up to six or seven variables, where the examples themselves may contain numerous other (irrelevant) variables as well.
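The failure mode the approach targets is easy to demonstrate: on parity, every individual attribute has zero information gain, so a greedy splitter has nothing to prefer. A small self-contained demo follows (the paper's constant-overhead remedy itself is not shown):

```python
import math

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n)
                for c in (labels.count(0), labels.count(1)) if c)

data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]  # 2-bit parity (XOR)
labels = [y for _, y in data]
for i in range(2):
    split0 = [y for x, y in data if x[i] == 0]
    split1 = [y for x, y in data if x[i] == 1]
    gain = entropy(labels) - 0.5 * entropy(split0) - 0.5 * entropy(split1)
    print(f"attribute {i}: information gain = {gain}")  # 0.0 for both attributes
```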

Book ChapterDOI
03 Sep 2003
TL;DR: An RDT model, combining RS tools with classical DT capabilities, is proposed to address the issue of computational overheads, and the performance of RDT is compared with the RS approach and the ID3 algorithm.
Abstract: Decision tree, a commonly used classification model, is constructed recursively following a top-down approach (from general concepts to particular examples) by repeatedly splitting the training data set. ID3 is a greedy algorithm that considers one attribute at a time for splitting at a node. In C4.5, all attributes, barring the nominal attributes used at the parent nodes, are retained for further computation. This leads to extra overheads of memory and computational effort. Rough Set theory (RS) simplifies the search for dominant attributes in information systems. In this paper, a Rough set based Decision Tree (RDT) model combining RS tools with classical DT capabilities is proposed to address the issue of computational overheads. The experiments compare the performance of RDT with the RS approach and the ID3 algorithm. RDT is observed to outperform the RS approach in accuracy and rule complexity, while RDT and ID3 are comparable.
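A sketch of the rough-set ingredient such a model can use to rank splitting attributes: the degree of dependency of the decision on an attribute (the fraction of objects in the positive region). This is an illustrative simplification, not the paper's full RDT procedure:

```python
from collections import defaultdict

def dependency(rows, attr, decision):
    """rows: list of dicts. Returns the fraction of rows whose value of `attr`
    uniquely determines the decision (rough-set degree of dependency)."""
    blocks = defaultdict(set)
    for r in rows:
        blocks[r[attr]].add(r[decision])
    consistent = sum(1 for r in rows if len(blocks[r[attr]]) == 1)
    return consistent / len(rows)
```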

Journal ArticleDOI
TL;DR: The decision tree developmental SAR models exhibited modest prediction accuracy and bagging tended to enhance the accuracy of prediction, and the model ensemble approach reduced the variability of prediction measures compared to the single model approach.
Abstract: Humans are exposed to thousands of environmental chemicals for which no developmental toxicity information is available. Structure-activity relationships (SARs) are models that could be used to efficiently predict the biological activity of potential developmental toxicants. However, at this time, no adequate SAR models of developmental toxicity are available for risk assessment. In the present study, a new developmental database was compiled by combining toxicity information from the Teratogen Information System (TERIS) and the Food and Drug Administration (FDA) guidelines. We implemented a decision tree modeling procedure, using Classification and Regression Tree software and a model ensemble approach termed bagging. We then assessed the empirical distributions of the prediction accuracy measures of the single and ensemble-based models, obtained by repeating our modeling experiment many times with random partitioning of the working database. The decision tree developmental SAR models exhibited modest prediction accuracy; bagging tended to enhance the accuracy of prediction, and the model ensemble approach reduced the variability of prediction measures compared to the single model approach.

Journal ArticleDOI
TL;DR: DTs are seen to offer significant improvement in performance over the fixed-architecture TSBN and in a coding comparison the DT achieves 0.294 bits per pixel (bpp) compression compared to 0.378 bpp for lossless JPEG on images of seven colours.

Book ChapterDOI
03 Sep 2003
TL;DR: This paper has used the multi-objective Genetic Programming method to build decision tree models from Diabetes data in a bid to investigate its capability to trade-off comprehensibility and performance.
Abstract: Although there has been considerable study in the area of trading off accuracy and comprehensibility of decision tree models, the bulk of the methods dwell on sacrificing comprehensibility for the sake of accuracy, or fine-tuning the balance between comprehensibility and accuracy. Invariably, the level of trade-off is decided a priori. It is possible for such decisions to be made a posteriori, which means the induction process does not discriminate against any of the objectives. In this paper, we present such a method that uses multi-objective Genetic Programming to optimize decision tree models. We have used this method to build decision tree models from Diabetes data in a bid to investigate its capability to trade off comprehensibility and performance.
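The a posteriori idea can be sketched as keeping the whole comprehensibility-versus-performance Pareto front and choosing a model afterwards; the dominance test below is illustrative, not the paper's exact GP objective handling:

```python
def pareto_front(models):
    """models: list of (accuracy, size) pairs; higher accuracy and smaller
    size are better. Returns the non-dominated models."""
    return [m for m in models
            if not any(o[0] >= m[0] and o[1] <= m[1] and o != m for o in models)]

# (0.7, 30) is dominated by (0.85, 12): less accurate and larger.
print(pareto_front([(0.9, 40), (0.85, 12), (0.7, 30)]))
```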

Journal ArticleDOI
TL;DR: This paper develops a genetic algorithm for constructing a tree using a new probabilistic measure for assessing the performance of a tree, and investigates the effect of introducing diversity into the population used by the genetic algorithm.
Abstract: When considering a decision tree for the purpose of classification, accuracy is usually the sole performance measure used in the construction process. In this paper, we introduce the idea of combining a decision tree's expected value and variance in a new probabilistic measure for assessing the performance of a tree. We develop a genetic algorithm for constructing a tree using our new measure and conduct computational experiments that show the advantages of our approach. Further, we investigate the effect of introducing diversity into the population used by our genetic algorithm. We allow the genetic algorithm to simultaneously focus on two distinct probabilistic measures: one that is risk averse and one that is risk seeking. Our bivariate genetic algorithm for constructing a decision tree performs very well, scales up quite nicely to handle data sets with hundreds of thousands of points, and requires only a small percent of the data to generate a high-quality decision tree. We demonstrate the effectiveness of our algorithm on three large data sets.
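A hedged sketch of the bivariate scoring idea: combine the mean and spread of a tree's estimated accuracy into one risk-averse and one risk-seeking score; the weight k is an illustrative assumption, not the paper's measure:

```python
import statistics

def risk_scores(accuracy_samples, k=1.0):
    """Returns (risk-averse, risk-seeking) scores from repeated accuracy
    estimates of one decision tree."""
    mu = statistics.mean(accuracy_samples)
    sd = statistics.stdev(accuracy_samples)
    return mu - k * sd, mu + k * sd
```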

Journal ArticleDOI
TL;DR: An integrated model-based approach, in which tree crowns are represented by generalized 3D hemi-ellipsoid geometric models, is developed to reconstruct the canopy surface of conifer stands.
Abstract: Canopy surface data are desirable in forestry, but they are difficult to collect in the field. Existing surface reconstruction techniques cannot adequately extract canopy surfaces, especially for conifer stands. This paper develops an integrated model-based approach to reconstruct canopy surface for conifer stands analytically from the crown level. To deal with dense stands, critical problems are addressed in the process of model-based surface reconstruction. These include the occlusion problem in disparity (parallax) prediction from tree models, the edge effect of tree models on the disparity map, and the foreshortening effect in image matching. The model-based approach was applied to recover the canopy surface of a dense redwood stand using images scanned from 1:2,400-scale aerial photographs. Compared with field measurements, crown radius and tree height derived from the reconstructed canopy surface model have an overall accuracy of 92 percent and 94 percent, respectively. The results demonstrate the approach's ability to reconstruct complicated stands.
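For context, one common parameterization of a generalized hemi-ellipsoid crown profile, with a the crown radius, b the crown depth, and the exponent n controlling crown shape (n = 2 gives an ordinary half-ellipsoid); this is a standard form, not necessarily the paper's exact equation:

```latex
\[
  \left(\frac{\sqrt{x^{2} + y^{2}}}{a}\right)^{\!n}
  + \left(\frac{z}{b}\right)^{\!n} = 1,
  \qquad 0 \le z \le b
\]
```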

Journal ArticleDOI
TL;DR: A new tree representation and growth procedure, PUMP-RP, has been developed that has the potential to leverage copious data from an older, well-studied target while beginning to study a newer target for which only a small amount of data are available.
Abstract: The decision tree method for classification problems has been extended to accommodate multiple dependent properties. When applied to drug discovery efforts this means a separate activity class can be predicted for each of several targets with a single tree model. A new tree representation and growth procedure, PUMP-RP, has been developed. The final architecture of the tree allows for easy interpretation as to which independent variables and split values are important for all targets and which are specific to a given target. It should thus be usefully applied to studies of drug specificity. A side benefit of the new method is that it can make use of data with missing (or even sparse) dependent property values. This has the potential to leverage copious data from an older, well-studied target while beginning to study a newer target for which only a small amount of data are available.

Journal ArticleDOI
TL;DR: A new approach to interactive image retrieval based on an adaptive tree similarity model is presented; it does not require any prior knowledge of the data, supports incremental learning with a fast convergence rate, and achieves better performance than most approaches.
Abstract: Learning-enhanced relevance feedback is one of the most promising and active research directions in content-based image retrieval in recent years. However, the existing approaches either require prior knowledge of the data or converge slowly and are thus not cost-effective. Motivated by the successful history of optimal adaptive filters, we present a new approach to interactive image retrieval based on an adaptive tree similarity model to solve these difficulties. The proposed tree model is a hierarchical nonlinear Boolean representation of a user query concept. Each path of the tree is a clustering pattern of the feedback samples, which is small enough and local in the feature space that it can be approximated by a linear model nicely. Because of the linearity, the parameters of the similarity model are better learned by the optimal adaptive filter, which does not require any prior knowledge of the data and supports incremental learning with a fast convergence rate. The proposed approach is simple to implement and achieves better performance than most approaches. To illustrate the performance of the proposed approach, extensive experiments have been carried out on a large heterogeneous image collection with 17,000 images, which render promising results on a wide variety of queries.

01 Jan 2003
TL;DR: GenMiner is a preprocessing software tool that can receive data from three major protein databases and transform them into a form suitable for input to the WEKA data mining suite; the experiments show that using the decision tree model for mining protein data is an efficient and easy-to-implement solution.
Abstract: We present an integrated tool for preprocessing and analysis of genetic data through data mining. Our goal is the prediction of the functional behavior of proteins, a critical problem in functional genomics. In recent years, many programming approaches have been developed for the identification of short amino-acid chains, which are included in families of related proteins. These chains are called motifs and they are widely used for the prediction of the protein's behavior, since the latter is dependent on them. The idea to use data mining techniques stems from the sheer size of the problem. Since every protein consists of a specific number of motifs, some stronger than others, the identification of the properties of a protein requires the examination of immeasurable combinations. The presence or absence of stronger motifs affects the way in which a protein reacts. GenMiner is a preprocessing software tool that can receive data from three major protein databases and transform them into a form suitable for input to the WEKA data mining suite. A decision tree model was created using the derived training set and an efficiency test was conducted. Finally, the model was applied to unknown proteins. Our experiments have shown that the use of the decision tree model for mining protein data is an efficient and easy-to-implement solution, since it possesses a high degree of parameterization and, therefore, can be used in a plethora of cases.
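A sketch of the preprocessing step described: each protein's motif list becomes a binary presence/absence feature vector suitable for a decision tree learner (the actual tool exports to the WEKA suite). The motif identifiers and function names here are illustrative:

```python
def motif_matrix(proteins, motif_vocab):
    """proteins: dict name -> set of motifs. Returns (names, 0/1 feature rows)."""
    names = sorted(proteins)
    rows = [[int(m in proteins[n]) for m in motif_vocab] for n in names]
    return names, rows

names, X = motif_matrix({"P1": {"PS00028"}, "P2": {"PS00028", "PS50157"}},
                        ["PS00028", "PS50157"])
print(names, X)  # ['P1', 'P2'] [[1, 0], [1, 1]]
```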

Journal Article
TL;DR: It is shown that every regular language L has either constant, logarithmic or linear two-party communication complexity (in a worst-case partition sense) and a similar trichotomy for simultaneous and probabilistic communication complexity is proved.
Abstract: We show that every regular language L has either constant, logarithmic or linear two-party communication complexity (in a worst-case partition sense). We prove a similar trichotomy for simultaneous communication complexity and a quadrichotomy for probabilistic communication complexity.

Journal ArticleDOI
TL;DR: All scenarios with linear complexity in a tree topology are dealt with, and exact tight bounds for the bit and message complexities are proved.

Proceedings ArticleDOI
27 Oct 2003
TL;DR: A supervised pattern recognition model that uses Boolean formulas for non-reducible descriptors is presented, which leads to a computational problem shown to be NP-complete.
Abstract: We present a supervised pattern recognition model that uses Boolean formulas for non-reducible descriptors. This model leads to a computational problem which is shown to be NP-complete. In the paper, we identify two open combinatorial problems in the construction of non-reducible descriptors that can be applied to a large set of applications.

Book ChapterDOI
12 Jul 2003
TL;DR: An automated and simplified genetic programming (GP) based decision tree modeling technique for calibrating software quality classification models using strongly typed GP is presented.
Abstract: Predicting the quality of software modules prior to testing or system operations allows a focused software quality improvement endeavor. Decision trees are very attractive for classification problems, because of their comprehensibility and white-box modeling features. However, optimizing the classification accuracy and the tree size is a difficult problem, and to our knowledge very few studies have addressed the issue. This paper presents an automated and simplified genetic programming (GP) based decision tree modeling technique for calibrating software quality classification models. The proposed technique is based on multi-objective optimization using strongly typed GP. Two fitness functions are used to optimize the classification accuracy and tree size of the classification models calibrated for a real-world high-assurance software system. The performances of the classification models are compared with those obtained by standard GP. It is shown that the GP-based decision tree technique yielded better classification models.

Journal ArticleDOI
TL;DR: A linear system of equations is developed that characterizes expected loads on all processors under the reproduction tree model, which can generate trees of arbitrary size and shape; the analysis implies that the simple randomized tree embedding algorithm generates high-quality load distributions on virtually all static networks commonly employed in parallel and distributed computing.
Abstract: High performance computing requires high quality load distribution of processes of a parallel application over processors in a parallel computer at runtime such that both the maximum load and dilation are minimized. The performance of a simple randomized tree embedding algorithm that dynamically supports tree-structured parallel computations on arbitrary static networks is analyzed in this paper. The algorithm spreads newly created tree nodes to neighboring processors, which actually provides randomized dilation-1 tree embedding in static networks. We develop a linear system of equations that characterizes expected loads on all processors under the reproduction tree model, which can generate trees of arbitrary size and shape. It is shown that as the tree size becomes large, the asymptotic performance ratio of the randomized tree embedding algorithm is the ratio of the maximum processor degree to the average processor degree. This implies that the simple randomized tree embedding algorithm is able to generate high quality load distributions on virtually all static networks commonly employed in parallel and distributed computing.
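A minimal sketch of the randomized dilation-1 embedding analyzed: each newly created tree node is placed on a uniformly random neighbor of its parent's processor. The adjacency-list network and the small tree are illustrative:

```python
import random
from collections import Counter

def embed(tree_children, network, root_proc=0):
    """tree_children: dict node -> list of children. Returns node -> processor."""
    place, stack = {0: root_proc}, [0]
    while stack:
        v = stack.pop()
        for c in tree_children.get(v, []):
            place[c] = random.choice(network[place[v]])  # neighbor of parent's host
            stack.append(c)
    return place

net = {0: [1, 2], 1: [0, 2], 2: [0, 1]}   # triangle (complete) network
tree = {0: [1, 2], 1: [3, 4]}             # small tree to spread over it
print(Counter(embed(tree, net).values())) # resulting load per processor
```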

Journal ArticleDOI
TL;DR: The 2D individual-based modeling approach appears to be an effective tool for analyzing the influence of many "microscopic" peculiarities of tree community arrangement on the macroscopic behavior of the communities.

Proceedings ArticleDOI
07 Jul 2003
TL;DR: This paper focuses on obtaining decision trees of small size, which have widespread applications in complexity theory and in data mining and exploration; the minimization problem for decision trees is known to be NP-hard.
Abstract: Decision trees are representations of discrete functions with widespread applications in, e.g., complexity theory and data mining and exploration. In these areas it is important to obtain decision trees of small size. The minimization problem for decision trees is known to be NP-hard. The problem is even hard to approximate up to any constant factor.

Proceedings Article
01 Oct 2003
TL;DR: The proposed algorithm is able to deal with the changeable order problem in sentence reduction and shows better results than the original methods.
Abstract: This paper addresses a novel sentence reduction algorithm based on a decision tree model, where semantic information is used to enhance the accuracy of sentence reduction. The proposed algorithm is able to deal with the changeable order problem in sentence reduction. Experimental results show better performance than the original methods.