
Showing papers by "Thomas G. Dietterich published in 2004"


Proceedings ArticleDOI
04 Jul 2004
TL;DR: Experiments show that when the training data set is very small, training with auxiliary data can produce large improvements in accuracy, even when the auxiliary data is significantly different from the training (and test) data.
Abstract: The standard model of supervised learning assumes that training and test data are drawn from the same underlying distribution. This paper explores an application in which a second, auxiliary, source of data is available drawn from a different distribution. This auxiliary data is more plentiful, but of significantly lower quality, than the training and test data. In the SVM framework, a training example has two roles: (a) as a data point to constrain the learning process and (b) as a candidate support vector that can form part of the definition of the classifier. The paper considers using the auxiliary data in either (or both) of these roles. This auxiliary data framework is applied to a problem of classifying images of leaves of maple and oak trees using a kernel derived from the shapes of the leaves. Experiments show that when the training data set is very small, training with auxiliary data can produce large improvements in accuracy, even when the auxiliary data is significantly different from the training (and test) data. The paper also introduces techniques for adjusting the kernel scores of the auxiliary data points to make them more comparable to the training data points.
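The paper's leaf-shape kernel and exact weighting scheme are not reproduced here, but the first role of auxiliary data --- extra, down-weighted constraints on the SVM margin --- can be sketched with synthetic data and scikit-learn's per-sample weights. All data, weights, and kernel settings below are illustrative assumptions, not the paper's setup:

```python
# Hedged sketch: auxiliary data as down-weighted extra training points
# for an SVM (role (a) in the paper). Data, kernel, and the 0.1 weight
# are illustrative assumptions.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Small, high-quality training set (e.g. a handful of labeled leaves).
X_train = rng.normal(0.0, 1.0, size=(10, 2))
y_train = (X_train[:, 0] > 0).astype(int)

# Plentiful but lower-quality auxiliary data from a shifted distribution.
X_aux = rng.normal(0.5, 1.5, size=(200, 2))
y_aux = (X_aux[:, 0] > 0).astype(int)

X = np.vstack([X_train, X_aux])
y = np.concatenate([y_train, y_aux])

# Down-weight auxiliary examples so they constrain the margin less.
weights = np.concatenate([np.ones(len(y_train)),
                          0.1 * np.ones(len(y_aux))])

clf = SVC(kernel="rbf", C=1.0)
clf.fit(X, y, sample_weight=weights)

# Evaluate on test data drawn from the training distribution.
X_test = rng.normal(0.0, 1.0, size=(50, 2))
y_test = (X_test[:, 0] > 0).astype(int)
accuracy = clf.score(X_test, y_test)
```

The second role (auxiliary points as candidate support vectors) would additionally require the adjusted kernel-score techniques the paper introduces, which are not sketched here.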

297 citations


Journal ArticleDOI
TL;DR: An extended experimental analysis of bias-variance decomposition of the error in Support Vector Machines (SVMs), considering Gaussian, polynomial and dot product kernels, shows that the expected trade-off between bias and variance is sometimes observed, but more complex relationships can be detected.
Abstract: Bias-variance analysis provides a tool to study learning algorithms and can be used to properly design ensemble methods well tuned to the properties of a specific base learner. Indeed the effectiveness of ensemble methods critically depends on accuracy, diversity and learning characteristics of base learners. We present an extended experimental analysis of bias-variance decomposition of the error in Support Vector Machines (SVMs), considering Gaussian, polynomial and dot product kernels. A characterization of the error decomposition is provided, by means of the analysis of the relationships between bias, variance, kernel type and its parameters, offering insights into the way SVMs learn. The results show that the expected trade-off between bias and variance is sometimes observed, but more complex relationships can be detected, especially in Gaussian and polynomial kernels. We show that the bias-variance decomposition offers a rationale to develop ensemble methods using SVMs as base learners, and we outline two directions for developing SVM ensembles, exploiting the SVM bias characteristics and the bias-variance dependence on the kernel parameters.
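The decomposition itself can be estimated empirically. Below is a minimal sketch, assuming toy data and Domingos-style definitions for 0/1 loss (main prediction as the majority vote over training sets, bias as its disagreement with the true labels, variance as the average disagreement with the main prediction); the paper's actual datasets and protocol are not reproduced:

```python
# Hedged sketch: empirical bias-variance decomposition of 0/1 error for
# SVMs with different kernels, on an assumed toy problem with a circular
# decision boundary.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)

def make_data(n):
    X = rng.normal(size=(n, 2))
    y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0).astype(int)
    return X, y

X_test, y_test = make_data(500)

def bias_variance(kernel, n_train=50, n_trials=30):
    preds = np.empty((n_trials, len(y_test)), dtype=int)
    for t in range(n_trials):
        X_tr, y_tr = make_data(n_train)
        preds[t] = SVC(kernel=kernel).fit(X_tr, y_tr).predict(X_test)
    # Main prediction: majority vote across the independent training sets.
    main = (preds.mean(axis=0) > 0.5).astype(int)
    bias = np.mean(main != y_test)        # systematic disagreement with truth
    variance = np.mean(preds != main)     # average disagreement with the vote
    return bias, variance

results = {k: bias_variance(k) for k in ("linear", "rbf")}
```

On this circular boundary the linear kernel should show high bias while the Gaussian (RBF) kernel does not, illustrating the kind of kernel-dependent behavior the paper analyzes.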

262 citations


Proceedings ArticleDOI
04 Jul 2004
TL;DR: This paper describes a new method for training CRFs by applying Friedman's (1999) gradient tree boosting method, which scales linearly in the order of the Markov model and in the order of the feature interactions, rather than exponentially like previous algorithms based on iterative scaling and gradient descent.
Abstract: Conditional Random Fields (CRFs; Lafferty, McCallum, & Pereira, 2001) provide a flexible and powerful model for learning to assign labels to elements of sequences in such applications as part-of-speech tagging, text-to-speech mapping, protein and DNA sequence analysis, and information extraction from web pages. However, existing learning algorithms are slow, particularly in problems with large numbers of potential input features. This paper describes a new method for training CRFs by applying Friedman's (1999) gradient tree boosting method. In tree boosting, the CRF potential functions are represented as weighted sums of regression trees. Regression trees are learned by stage-wise optimizations similar to Adaboost, but with the objective of maximizing the conditional likelihood P(Y|X) of the CRF model. By growing regression trees, interactions among features are introduced only as needed, so although the parameter space is potentially immense, the search algorithm does not explicitly consider the large space. As a result, gradient tree boosting scales linearly in the order of the Markov model and in the order of the feature interactions, rather than exponentially like previous algorithms based on iterative scaling and gradient descent.
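The paper applies this idea to CRF potential functions over sequences; the stage-wise mechanics can be shown on a simpler single-output analogue. Below is a hedged sketch of Friedman-style gradient tree boosting with logistic loss, where each regression tree is fit to the functional gradient of the log-likelihood (all data and settings are toy assumptions, not the paper's CRF training procedure):

```python
# Hedged sketch: gradient tree boosting on one binary label with
# logistic loss. Regression trees are fit stage-wise to functional
# gradients; depth-2 trees introduce feature interactions only as needed.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(2)
X = rng.uniform(-1, 1, size=(400, 3))
y = (X[:, 0] * X[:, 1] > 0).astype(float)  # label with a feature interaction

F = np.zeros(len(y))            # current potential function F(x), in log-odds
trees, lr = [], 0.3
for _ in range(50):
    p = 1.0 / (1.0 + np.exp(-F))
    residual = y - p            # functional gradient of the log-likelihood
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residual)
    trees.append(tree)
    F += lr * tree.predict(X)   # stage-wise additive update

train_acc = np.mean((F > 0) == (y > 0.5))
```

The x0*x1 interaction is never enumerated explicitly; each depth-2 tree discovers the splits it needs, which is the sense in which the parameter space stays implicit.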

131 citations


01 Jan 2004
TL;DR: Recent work on alternatives to HMMs and PCFGs is described, based on generalizations of binary classification algorithms such as boosting, the perceptron algorithm, and large-margin (SVM) methods.
Abstract: Structured machine learning problems in natural language processing (Michael Collins, MIT CSAIL/EECS). Many problems in natural language processing involve the mapping from strings to structured objects such as parse trees, underlying state sequences, or segmentations. This leads to an interesting class of learning problems: how to induce classification functions where the output "labels" have meaningful internal structure, and where the number of possible labels may grow exponentially with the size of the input strings. Probabilistic grammars---for example, hidden Markov models or probabilistic context-free grammars---are one common approach to this type of problem. In this talk I will describe recent work on alternatives to HMMs and PCFGs, based on generalizations of binary classification algorithms such as boosting, the perceptron algorithm, or large-margin (SVM) methods.

Statistical Models for Social Networks (Mark Handcock, University of Washington). This talk is an overview of social network analysis from the perspective of a statistician. The main focus is on the conceptual and methodological contributions of the social network community going back over eighty years. The field is, and has been, broadly multidisciplinary, with significant contributions from the social, natural, and mathematical sciences. This has led to a plethora of terminology and network conceptualizations commensurate with the varied objectives of network analysis. As a primary focus of the social sciences has been the representation of social relations with the objective of understanding social structure, social scientists have been central to this development. We review statistical exponential family models that recognize the complex dependencies within relational data structures. We consider three issues: the specification of realistic models, the algorithmic difficulties of the inferential methods, and the assessment of the degree to which the graph structure produced by the models matches that of the data. Insight can be gained by considering model degeneracy and inferential degeneracy for commonly used estimators.

Probabilistic Entity-Relationship Models, PRMs, and Plate Models (David Heckerman, Microsoft Research). We introduce a graphical language for relational data called the probabilistic entity-relationship (PER) model. The model is an extension of the entity-relationship model, a common model for the abstract representation of database structure. We concentrate on the directed version of this model---the directed acyclic probabilistic entity-relationship (DAPER) model. The DAPER model is closely related to the plate model and the probabilistic relational model (PRM), existing models for relational data. The DAPER model is more expressive than either existing model, and also helps to demonstrate their similarity. In addition to describing the new language, we discuss important facets of modeling relational data, including the use of restricted relationships, self relationships, and probabilistic relationships. This is joint work with Christopher Meek and Daphne Koller.

Pictorial Structure Models for Visual Recognition (Dan Huttenlocher, Cornell University). There has been considerable recent work in object recognition on representations that combine both local visual appearance and global spatial constraints. Several such approaches are based on statistical characterizations of the spatial relations between local image patches. In this talk I will give an overview of one such approach, called pictorial structures, which uses spatial relations between pairs of parts. I will focus on the recent development of highly efficient techniques both for learning certain forms of pictorial structure models from examples and for detecting objects using these models.

Relations, generalizations and the reference-class problem: A logic programming / Bayesian perspective (David Poole, Dept. of Computer Science, University of British Columbia). Logic programs provide a rich language to specify the interdependence between relations. There has been much success with inductive logic programming finding relationships from data. There has also been considerable success with Bayesian learning. However, there is a large conceptual gap in that inductive logic programming does not have any statistics. This talk will explore how to get statistics from data---a problem known as the reference-class problem---and the combination of logic programming and hierarchical Bayesian models as a solution to it. This is joint work with Michael Chiang.

Feature Definition and Discovery in Probabilistic Relational Models (Eric Altendorf, eric@cleverset.com; Bruce D’Ambrosio, dambrosi@cleverset.com; CleverSet, Inc., 673 Jackson Avenue, Corvallis, OR 97330).

10 citations


01 Jan 2004
TL;DR: The paper describes the chain MDP algorithm, which in many cases captures more of the sensing costs than the even-odd POMDP approximation, and proves that both heuristics compute value functions that are upper bounded by (i.e., better than) the value function of the underlying MDP and, in the case of the even MDP, also lower bounded by the POMDP's optimal value function.
Abstract: A common heuristic for solving Partially Observable Markov Decision Problems (POMDPs) is to first solve the underlying Markov Decision Process (MDP) and then construct a POMDP policy by performing a fixed-depth lookahead search in the POMDP and evaluating the leaf nodes using the MDP value function. A problem with this approximation is that it does not account for the need to choose actions in order to gain information about the state of the world, particularly when those observation actions are needed at some point in the future. This paper proposes two heuristics that are better than the MDP approximation in POMDPs where there is a delayed need to observe. The first approximation, introduced in prior work, is the even-odd POMDP, in which the world is assumed to be fully observable every other time step. The even-odd POMDP can be converted into an equivalent MDP, the even MDP, whose value function captures some of the sensing costs of the original POMDP. An online policy consisting of a fixed-depth lookahead search combined with the value function of the even MDP gives an approximation to the POMDP's value function that is at least as good as the method based on the value function of the underlying MDP. The second POMDP approximation is applicable to a special kind of POMDP, which we call the Cost-Observable Markov Decision Problem (COMDP). In a COMDP, the actions are partitioned into those that change the state of the world and those that are pure observation actions. For such problems, we describe the chain MDP algorithm, which in many cases is able to capture more of the sensing costs than the even-odd POMDP approximation. We prove that both heuristics compute value functions that are upper bounded by (i.e., better than) the value function of the underlying MDP and, in the case of the even MDP, also lower bounded by the POMDP's optimal value function. We show cases where the chain MDP online policy is better than, equal to, or worse than the even MDP online policy.
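The baseline that both heuristics improve on can be made concrete. Below is a hedged sketch of a one-step belief-space lookahead that scores leaves with an underlying MDP value function; the tiny two-state POMDP (transitions, observation models, rewards, and V_mdp values) is an assumed toy, not from the paper:

```python
# Hedged sketch: one-step lookahead in belief space with leaves scored
# by an assumed underlying-MDP value function V_mdp, on a toy two-state
# POMDP with one state-changing action and one pure observation action.
import numpy as np

actions = ["act", "observe"]
T = {"act": np.array([[0.9, 0.1], [0.1, 0.9]]),   # state transitions
     "observe": np.eye(2)}                        # observing leaves state fixed
R = {"act": np.array([1.0, -1.0]),
     "observe": np.array([-0.1, -0.1])}           # small sensing cost
# Observation model O[a][s, o] = p(obs o | state s after action a).
O = {"act": np.array([[0.5, 0.5], [0.5, 0.5]]),       # uninformative
     "observe": np.array([[0.9, 0.1], [0.1, 0.9]])}   # informative
gamma = 0.95
V_mdp = np.array([10.0, 8.0])   # assumed value function of the underlying MDP

def lookahead_action(belief):
    """Pick the action maximizing expected reward plus discounted
    MDP value at the one-step lookahead leaves."""
    best, best_q = None, -np.inf
    for a in actions:
        q = belief @ R[a]
        prior = belief @ T[a]                        # predicted state dist.
        for obs in range(2):
            p_obs = prior @ O[a][:, obs]
            if p_obs <= 0:
                continue
            b_next = prior * O[a][:, obs] / p_obs    # Bayes belief update
            q += gamma * p_obs * (b_next @ V_mdp)    # MDP value at the leaf
        if q > best_q:
            best, best_q = a, q
    return best

choice = lookahead_action(np.array([0.5, 0.5]))
```

Because the leaf score is linear in the belief, the expected leaf value is identical no matter how informative an action's observation model is, so the pure sensing action (which only pays a cost) is never chosen here. This is exactly the failure mode the even-odd POMDP and chain MDP heuristics are designed to address.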

4 citations