Showing papers by "Thomas G. Dietterich" published in 1995


Journal Article
TL;DR: Heuristic algorithms such as gradient descent and greedy search have been applied with great success to NP-hard learning problems, and there has been extensive research in applying second-order methods to fit neural networks and in conducting much more thorough searches in learning decision trees and rule sets.
Abstract: A central problem in machine learning is supervised learning, that is, learning from labeled training data. For example, a learning system for medical diagnosis might be trained with examples of patients whose case records (medical tests, clinical observations) and diagnoses were known. The task of the learning system is to infer a function that predicts the diagnosis of a patient from his or her case records. The function to be learned might be represented as a set of rules, a decision tree, a Bayes network, or a neural network. Learning algorithms essentially operate by searching some space of functions (usually called the hypothesis class) for a function that fits the given data. Because there are usually exponentially many functions, this search cannot actually examine individual hypothesis functions but instead must use some more direct method of constructing the hypothesis functions from the data. This search can usually be formalized by defining an objective function (e.g., number of data points predicted incorrectly) and applying various algorithms to find a function that minimizes this objective function. However, finding a function that minimizes this objective function is often NP-hard. For example, fitting the weights of a neural network or finding the smallest decision tree are both NP-complete problems [Blum and Rivest 1989; Quinlan and Rivest 1989]. Hence, heuristic algorithms such as gradient descent (for neural networks) and greedy search (for decision trees) have been applied with great success. Of course, the suboptimality of such heuristic algorithms immediately suggests a reasonable line of research: find algorithms that can search the hypothesis class better. Hence, there has been extensive research in applying second-order methods to fit neural networks and in conducting much more thorough searches in learning decision trees and rule sets.
Ironically, when these algorithms were tested on real datasets, it was found that their performance was often worse than simple gradient descent or greedy search [Quinlan and Cameron-Jones 1995; Weigend 1994]. In short: it appears to be better not to optimize! One of the other important trends in machine-learning research has been the establishment and nurturing of connections between various previously disparate fields, including computational learning theory, connectionist learning, symbolic learning, and statistics. The connection to statistics was crucial in resolving this paradox. The key problem arises from the structure of the machine-learning task. A learning algorithm is trained on a set of training data, but then it is applied to make predictions on new data points. The goal is to maximize its predictive accuracy on the new data points, not necessarily its accuracy on the training data. Indeed, if we work too hard to find the very best fit to the training data, there is a risk that we will fit the noise in the data by memorizing various peculiarities
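
The gap between training accuracy and predictive accuracy described in this abstract can be illustrated with a toy example. The sketch below is not from the paper: the one-dimensional data, the single noisy label, and the two hypothetical learners (a nearest-neighbor "memorizer" and a one-threshold rule) are all assumptions chosen only to make the effect visible. The objective function is the number of data points predicted incorrectly, as in the abstract.

```python
# Toy illustration (assumed data, not from the paper): a hypothesis that
# fits the training data perfectly can predict new points worse than a
# simpler hypothesis that tolerates one training error.

def true_label(x):
    # Assumed underlying concept: positive when x >= 5.
    return 1 if x >= 5 else 0

# Training points 0..9, with one deliberately noisy label at x = 2.
train = [(x, true_label(x)) for x in range(10)]
train[2] = (2, 1)

# New data points, offset from the training points, with clean labels.
test = [(x + 0.25, true_label(x + 0.25)) for x in range(10)]

def error_count(predict, data):
    # Objective function: number of data points predicted incorrectly.
    return sum(1 for x, y in data if predict(x) != y)

def memorizer(x):
    # Predict the label of the nearest training point (memorizes the noise).
    _, nearest_label = min(train, key=lambda pair: abs(pair[0] - x))
    return nearest_label

def fit_threshold(data):
    # Pick the integer threshold t in 0..10 with the fewest training errors
    # for the rule "predict 1 when x >= t".
    return min(range(11),
               key=lambda t: error_count(lambda x: 1 if x >= t else 0, data))

best_t = fit_threshold(train)

def threshold_rule(x):
    return 1 if x >= best_t else 0

print("memorizer  train/test errors:",
      error_count(memorizer, train), error_count(memorizer, test))
print("threshold  train/test errors:",
      error_count(threshold_rule, train), error_count(threshold_rule, test))
```

In this constructed case the memorizer makes no training errors but misclassifies the new point next to the noisy example, while the threshold rule accepts one training error and classifies every new point correctly.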

535 citations


Book Chapter
09 Jul 1995
TL;DR: An investigation of why the ECOC technique works, particularly when employed with decision-tree learning algorithms, shows that it can reduce the variance of the learning algorithm.
Abstract: Previous research has shown that a technique called error-correcting output coding (ECOC) can dramatically improve the classification accuracy of supervised learning algorithms that learn to classify data points into one of k ≫ 2 classes. This paper presents an investigation of why the ECOC technique works, particularly when employed with decision-tree learning algorithms. It shows that the ECOC method, like any form of voting or committee, can reduce the variance of the learning algorithm. Furthermore, unlike methods that simply combine multiple runs of the same learning algorithm, ECOC can correct for errors caused by the bias of the learning algorithm. Experiments show that this bias correction ability relies on the non-local behavior of C4.5.
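
For readers unfamiliar with ECOC, the decoding step can be sketched as follows. This is a minimal illustration, not the paper's experimental setup: each class is assigned a binary codeword, one binary classifier is assumed to predict each bit, and the predicted bit string is decoded to the class with the nearest codeword in Hamming distance. The 4-class, 7-bit code matrix and the function names are assumptions made for the example.

```python
# Minimal ECOC decoding sketch (assumed code matrix, hypothetical names).
from itertools import combinations

# One 7-bit codeword per class; each bit position would be learned by a
# separate binary classifier (not shown here).
CODEWORDS = {
    0: "1111111",
    1: "0000111",
    2: "0011001",
    3: "0101010",
}

def hamming(a, b):
    # Number of bit positions in which two equal-length strings differ.
    return sum(1 for x, y in zip(a, b) if x != y)

def min_code_distance(codewords):
    # Smallest pairwise Hamming distance d; up to (d - 1) // 2 bit errors
    # are guaranteed to decode to the correct class.
    return min(hamming(a, b) for a, b in combinations(codewords.values(), 2))

def decode(predicted_bits, codewords=CODEWORDS):
    # Return the class whose codeword is nearest to the predicted bits.
    return min(codewords, key=lambda c: hamming(predicted_bits, codewords[c]))

# Class 2 with its first bit mispredicted by one binary classifier.
corrupted = "1011001"
print(min_code_distance(CODEWORDS), decode(corrupted))
```

Because every pair of codewords here differs in 4 positions, a single wrong bit still decodes to the intended class, which is the voting-like error correction the abstract refers to.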

402 citations


Proceedings Article
20 Aug 1995
TL;DR: Reinforcement learning methods are applied to learn domain-specific heuristics for job shop scheduling, and the results suggest that reinforcement learning can provide a new method for constructing high-performance scheduling systems.
Abstract: We apply reinforcement learning methods to learn domain-specific heuristics for job shop scheduling. A repair-based scheduler starts with a critical-path schedule and incrementally repairs constraint violations with the goal of finding a short conflict-free schedule. The temporal difference algorithm TD(λ) is applied to train a neural network to learn a heuristic evaluation function over states. This evaluation function is used by a one-step lookahead search procedure to find good solutions to new scheduling problems. We evaluate this approach on synthetic problems and on problems from a NASA space shuttle payload processing task. The evaluation function is trained on problems involving a small number of jobs and then tested on larger problems. The TD scheduler performs better than the best known existing algorithm for this task, Zweben's iterative repair method based on simulated annealing. The results suggest that reinforcement learning can provide a new method for constructing high-performance scheduling systems.
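
The TD(λ) update itself can be sketched in a few lines. Note the substitution: the paper trains a neural network, whereas the sketch below uses a plain lookup table with accumulating eligibility traces, which is the simplest standard form of the algorithm. The state names, rewards, and parameter values are hypothetical and have nothing to do with the scheduling domain.

```python
# Tabular TD(lambda) with accumulating eligibility traces.
# Assumption: a lookup table stands in for the paper's neural network.

def td_lambda_episode(values, episode, alpha=0.5, gamma=1.0, lam=0.5):
    """Update `values` in place from one episode.

    `episode` is a list of (state, reward, next_state) transitions;
    next_state is None when the episode terminates.
    """
    traces = {s: 0.0 for s in values}
    for state, reward, next_state in episode:
        next_value = 0.0 if next_state is None else values[next_state]
        # One-step temporal-difference error for this transition.
        td_error = reward + gamma * next_value - values[state]
        # Mark the current state as eligible for credit.
        traces[state] += 1.0
        for s in values:
            # Every state is adjusted in proportion to its eligibility,
            # then its eligibility decays by gamma * lambda.
            values[s] += alpha * td_error * traces[s]
            traces[s] *= gamma * lam
    return values

# Hypothetical two-step episode: A -> B -> terminal, reward 1 at the end.
values = td_lambda_episode({"A": 0.0, "B": 0.0},
                           [("A", 0.0, "B"), ("B", 1.0, None)])
print(values)
```

With λ = 0.5 the final reward updates B fully and A at half strength, which shows how the trace spreads credit back along the trajectory.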

396 citations


Journal Article
TL;DR: In domains where the decision boundaries are axis-parallel, the NGE approach can produce excellent generalization with interpretable hypotheses, and in all domains tested, NGE algorithms require much less memory to store generalized exemplars than is required by NN algorithms.
Abstract: Algorithms based on Nested Generalized Exemplar (NGE) theory (Salzberg, 1991) classify new data points by computing their distance to the nearest “generalized exemplar” (i.e., either a point or an axis-parallel rectangle). They combine the distance-based character of nearest neighbor (NN) classifiers with the axis-parallel rectangle representation employed in many rule-learning systems. An implementation of NGE was compared to the k-nearest neighbor (kNN) algorithm in 11 domains and found to be significantly inferior to kNN in 9 of them. Several modifications of NGE were studied to understand the cause of its poor performance. These show that its performance can be substantially improved by preventing NGE from creating overlapping rectangles, while still allowing complete nesting of rectangles. Performance can be further improved by modifying the distance metric to allow weights on each of the features (Salzberg, 1991). Best results were obtained in this study when the weights were computed using mutual information between the features and the output class. The best version of NGE developed is a batch algorithm (BNGE FWMI) that has no user-tunable parameters. BNGE FWMI's performance is comparable to the first-nearest neighbor algorithm (also incorporating feature weights). However, the k-nearest neighbor algorithm is still significantly superior to BNGE FWMI in 7 of the 11 domains, and inferior to it in only 2. We conclude that, even with our improvements, the NGE approach is very sensitive to the shape of the decision boundaries in classification problems. In domains where the decision boundaries are axis-parallel, the NGE approach can produce excellent generalization with interpretable hypotheses. In all domains tested, NGE algorithms require much less memory to store generalized exemplars than is required by NN algorithms.
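
The core NGE operation, measuring the distance from a point to the nearest generalized exemplar, can be sketched as below. This is a simplified illustration rather than the algorithm studied in the paper: it uses a feature-weighted Euclidean distance to an axis-parallel rectangle and omits details such as feature normalization and the handling of nested rectangles. The function names and the example exemplars are hypothetical.

```python
# Sketch: feature-weighted distance from a point to an axis-parallel
# rectangle, and classification by the nearest generalized exemplar.
# Assumption: normalization and nesting rules are left out.
import math

def rectangle_distance(point, lower, upper, weights=None):
    # A single point exemplar is a rectangle with lower == upper.
    if weights is None:
        weights = [1.0] * len(point)
    total = 0.0
    for x, lo, hi, w in zip(point, lower, upper, weights):
        if x < lo:
            gap = lo - x      # point lies below the rectangle on this axis
        elif x > hi:
            gap = x - hi      # point lies above the rectangle on this axis
        else:
            gap = 0.0         # point is inside the rectangle's extent
        total += (w * gap) ** 2
    return math.sqrt(total)

def classify(point, exemplars, weights=None):
    # exemplars: list of (lower_corner, upper_corner, class_label).
    nearest = min(exemplars,
                  key=lambda e: rectangle_distance(point, e[0], e[1], weights))
    return nearest[2]

# Hypothetical exemplars: one rectangle of class "a", one point of class "b".
exemplars = [((0, 0), (1, 1), "a"), ((5, 5), (5, 5), "b")]
print(classify((0.5, 0.5), exemplars), classify((3, 4), exemplars))
```

A point inside a rectangle has distance zero to it, which is why axis-parallel decision boundaries suit this representation; the `weights` argument is where per-feature weights such as those derived from mutual information would enter.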

229 citations


Proceedings Article
27 Nov 1995
TL;DR: Experimental tests show that this TDNN-TD(λ) network can match the performance of the previous hand-engineered system, and both neural network approaches significantly outperform the best previous (non-learning) solution to this problem.
Abstract: Job-shop scheduling is an important task for manufacturing industries. We are interested in the particular task of scheduling payload processing for NASA's space shuttle program. This paper summarizes our previous work on formulating this task for solution by the reinforcement learning algorithm TD(λ). A shortcoming of this previous work was its reliance on hand-engineered input features. This paper shows how to extend the time-delay neural network (TDNN) architecture to apply it to irregular-length schedules. Experimental tests show that this TDNN-TD(λ) network can match the performance of our previous hand-engineered system. The tests also show that both neural network approaches significantly outperform the best previous (non-learning) solution to this problem in terms of the quality of the resulting schedules and the number of search steps required to construct them.

114 citations


Journal Article
TL;DR: The performance of the error backpropagation (BP) and ID3 learning algorithms was compared on the task of mapping English text to phonemes and stresses and it is shown that BP consistently out-performs ID3 on this task by several percentage points.
Abstract: The performance of the error backpropagation (BP) and ID3 learning algorithms was compared on the task of mapping English text to phonemes and stresses. Under the distributed output code developed by Sejnowski and Rosenberg, it is shown that BP consistently out-performs ID3 on this task by several percentage points. Three hypotheses explaining this difference were explored: (a) ID3 is overfitting the training data, (b) BP is able to share hidden units across several output units and hence can learn the output units better, and (c) BP captures statistical information that ID3 does not. We conclude that only hypothesis (c) is correct. By augmenting ID3 with a simple statistical learning procedure, the performance of BP can be closely matched. More complex statistical procedures can improve the performance of both BP and ID3 substantially in this domain.

68 citations


Book Chapter
09 Jul 1995
TL;DR: This paper shows how to develop a dynamic programming version of EBL, which is called Explanation-Based Reinforcement Learning (EBRL), and shows that EBRL combines the strengths of EBL (fast learning and the ability to scale to large state spaces) with the strengths of RL (learning of optimal policies).
Abstract: In speedup-learning problems, where full descriptions of operators are always known, both explanation-based learning (EBL) and reinforcement learning (RL) can be applied. This paper shows that both methods involve fundamentally the same process of propagating information backward from the goal toward the starting state. RL performs this propagation on a state-by-state basis, while EBL computes the weakest preconditions of operators, and hence, performs this propagation on a region-by-region basis. Based on the observation that RL is a form of asynchronous dynamic programming, this paper shows how to develop a dynamic programming version of EBL, which we call Explanation-Based Reinforcement Learning (EBRL). The paper compares batch and online versions of EBRL to batch and online versions of RL and to standard EBL. The results show that EBRL combines the strengths of EBL (fast learning and the ability to scale to large state spaces) with the strengths of RL (learning of optimal policies). Results are shown in chess endgames and in synthetic maze tasks.
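
The state-by-state backward propagation attributed to RL in this abstract can be illustrated with a small dynamic programming sketch. It is not the paper's EBRL algorithm and does not compute weakest preconditions: it only shows cost-to-goal information spreading backward from the goal, one state at a time, with in-place updates. The one-dimensional corridor, the unit step cost, and the function name are assumptions made for the example.

```python
# Sketch: in-place dynamic programming on a hypothetical 1-D corridor.
# Each state can move one cell left or right at a cost of 1 per move.

def propagate_costs(num_states, goal, max_sweeps=100):
    cost = [float("inf")] * num_states
    cost[goal] = 0
    for _ in range(max_sweeps):
        changed = False
        for s in range(num_states):
            if s == goal:
                continue
            neighbors = [n for n in (s - 1, s + 1) if 0 <= n < num_states]
            # Backup: one step plus the best known cost of a neighbor.
            best = 1 + min(cost[n] for n in neighbors)
            if best < cost[s]:
                cost[s] = best      # updated in place, used immediately
                changed = True
        if not changed:
            break                   # no state improved: costs are final
    return cost

print(propagate_costs(5, goal=4))
```

With the goal at the right end and states swept left to right, each sweep extends the known costs by exactly one state, which makes the state-by-state character of the propagation visible; a region-by-region method would instead update whole sets of states that share the same explanation in one step.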

52 citations