
Showing papers in "Machine Learning in 1995"


Journal ArticleDOI
TL;DR: High generalization ability of support-vector networks utilizing polynomial input transformations is demonstrated and the performance of the support-vector network is compared to various classical learning algorithms that all took part in a benchmark study of Optical Character Recognition.
Abstract: The support-vector network is a new learning machine for two-group classification problems. The machine conceptually implements the following idea: input vectors are non-linearly mapped to a very high-dimension feature space. In this feature space a linear decision surface is constructed. Special properties of the decision surface ensure high generalization ability of the learning machine. The idea behind the support-vector network was previously implemented for the restricted case where the training data can be separated without errors. We here extend this result to non-separable training data. High generalization ability of support-vector networks utilizing polynomial input transformations is demonstrated. We also compare the performance of the support-vector network to various classical learning algorithms that all took part in a benchmark study of Optical Character Recognition.
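The core idea of the abstract can be illustrated with a minimal sketch (not the paper's soft-margin algorithm): a kernel perceptron with a polynomial kernel builds a decision surface that is linear in the implicit high-dimensional feature space but non-linear in input space. The XOR data set below is an assumed toy example, chosen because it is not linearly separable in input space.

```python
# Minimal sketch: kernel perceptron with a polynomial kernel, illustrating a
# linear decision surface in an implicit high-dimensional feature space.
# This is NOT the paper's support-vector algorithm, only the kernel idea.

def poly_kernel(x, y, degree=2):
    """K(x, y) = (1 + <x, y>)^degree implicitly maps inputs into a
    high-dimensional polynomial feature space."""
    return (1 + sum(a * b for a, b in zip(x, y))) ** degree

def train_kernel_perceptron(X, y, epochs=20, degree=2):
    """Learn dual coefficients alpha; the hypothesis is a weighted sum of
    kernel evaluations against the training points."""
    alpha = [0.0] * len(X)
    for _ in range(epochs):
        for i, (xi, yi) in enumerate(zip(X, y)):
            s = sum(a * yj * poly_kernel(xj, xi, degree)
                    for a, xj, yj in zip(alpha, X, y))
            if yi * s <= 0:          # misclassified: raise its coefficient
                alpha[i] += 1.0
    return alpha

def predict(alpha, X, y, x, degree=2):
    s = sum(a * yj * poly_kernel(xj, x, degree)
            for a, xj, yj in zip(alpha, X, y))
    return 1 if s > 0 else -1

# XOR: not linearly separable in input space, separable in the degree-2
# polynomial feature space.
X = [(0, 0), (0, 1), (1, 0), (1, 1)]
y = [-1, 1, 1, -1]
alpha = train_kernel_perceptron(X, y)
preds = [predict(alpha, X, y, x) for x in X]
```

The perceptron replaces the explicit non-linear mapping with kernel evaluations, which is the same trick that makes very high-dimension feature spaces tractable in support-vector networks.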

37,861 citations


Journal ArticleDOI
TL;DR: In this article, a Bayesian approach for learning Bayesian networks from a combination of prior knowledge and statistical data is presented, which is derived from a set of assumptions made previously as well as the assumption of likelihood equivalence, which says that data should not help to discriminate network structures that represent the same assertions of conditional independence.
Abstract: We describe a Bayesian approach for learning Bayesian networks from a combination of prior knowledge and statistical data. First and foremost, we develop a methodology for assessing informative priors needed for learning. Our approach is derived from a set of assumptions made previously as well as the assumption of likelihood equivalence, which says that data should not help to discriminate network structures that represent the same assertions of conditional independence. We show that likelihood equivalence when combined with previously made assumptions implies that the user's priors for network parameters can be encoded in a single Bayesian network for the next case to be seen—a prior network—and a single measure of confidence for that network. Second, using these priors, we show how to compute the relative posterior probabilities of network structures given data. Third, we describe search methods for identifying network structures with high posterior probabilities. We describe polynomial algorithms for finding the highest-scoring network structures in the special case where every node has at most k = 1 parent. For the general case (k > 1), which is NP-hard, we review heuristic search algorithms including local search, iterative local search, and simulated annealing. Finally, we describe a methodology for evaluating Bayesian-network learning algorithms, and apply this approach to a comparison of various approaches.
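The k = 1 special case can be sketched concretely. The code below is an assumed illustration, not the paper's scoring function: with at most one parent per node and a decomposable score, a best tree-shaped structure is found in polynomial time. Here pairwise mutual information serves as the edge score and a maximum-weight tree is grown Chow-Liu style.

```python
# Sketch of the polynomial k = 1 case: score edges by mutual information and
# grow a maximum-weight tree (Prim's algorithm). Assumed stand-in scoring,
# not the paper's Bayesian posterior.
import math
from collections import Counter

def mutual_information(xs, ys):
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * math.log((c / n) / ((px[a] / n) * (py[b] / n)))
               for (a, b), c in pxy.items())

def best_tree_structure(data):
    """data: dict var -> list of observed values. Returns a parent map for a
    maximum-weight tree over the variables (root has parent None)."""
    nodes = list(data)
    in_tree, parent = {nodes[0]}, {nodes[0]: None}
    while len(in_tree) < len(nodes):
        u, v = max(((u, v) for u in in_tree for v in nodes if v not in in_tree),
                   key=lambda e: mutual_information(data[e[0]], data[e[1]]))
        parent[v] = u
        in_tree.add(v)
    return parent

# Toy data: B is a copy of A (maximal dependence); C is only weakly related.
data = {
    "A": [0, 0, 1, 1, 0, 1, 0, 1],
    "B": [0, 0, 1, 1, 0, 1, 0, 1],
    "C": [0, 1, 0, 1, 0, 1, 0, 1],
}
parents = best_tree_structure(data)
```

With decomposable scores the same greedy tree construction applies whatever per-edge score is used, which is what makes the at-most-one-parent case polynomial.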

4,124 citations


Journal ArticleDOI
TL;DR: The MEME algorithm extends the expectation maximization (EM) algorithm for identifying motifs in unaligned biopolymer sequences and can discover both the CRP and LexA binding sites from a set of sequences which contain one or both sites.
Abstract: The MEME algorithm extends the expectation maximization (EM) algorithm for identifying motifs in unaligned biopolymer sequences. The aim of MEME is to discover new motifs in a set of biopolymer sequences where little or nothing is known in advance about any motifs that may be present. MEME innovations expand the range of problems which can be solved using EM and increase the chance of finding good solutions. First, subsequences which actually occur in the biopolymer sequences are used as starting points for the EM algorithm to increase the probability of finding globally optimal motifs. Second, the assumption that each sequence contains exactly one occurrence of the shared motif is removed. This allows multiple appearances of a motif to occur in any sequence and permits the algorithm to ignore sequences with no appearance of the shared motif, increasing its resistance to noisy data. Third, a method for probabilistically erasing shared motifs after they are found is incorporated so that several distinct motifs can be found in the same set of sequences, both when different motifs appear in different sequences and when a single sequence may contain multiple motifs. Experiments show that MEME can discover both the CRP and LexA binding sites from a set of sequences which contain one or both sites, and that MEME can discover both the −10 and −35 promoter regions in a set of E. coli sequences.
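MEME's first innovation can be sketched in isolation (this is an assumed simplification, not the EM loop itself): every width-w subsequence that actually occurs in the data is tried as a candidate motif, each candidate is scored by how well its best match in every sequence agrees with it, and the best candidate would then seed EM.

```python
# Sketch of using actually occurring subsequences as motif starting points.
# Agreement counting stands in for MEME's probabilistic model.

def match_score(a, b):
    """Number of agreeing positions between two equal-length words."""
    return sum(x == y for x, y in zip(a, b))

def best_seed(sequences, w):
    """Score every width-w subsequence occurring in the data as a candidate
    starting point; return the candidate with the best total best-match."""
    candidates = {seq[i:i + w]
                  for seq in sequences for i in range(len(seq) - w + 1)}
    def total_score(cand):
        return sum(max(match_score(cand, seq[i:i + w])
                       for i in range(len(seq) - w + 1))
                   for seq in sequences)
    return max(candidates, key=total_score)

# Hypothetical sequences sharing the planted motif "TGACT" (uppercase).
seqs = ["aaTGACTca", "ccTGACTa", "gTGACTggg"]
seed = best_seed(seqs, 5)
```

Restricting starting points to observed subsequences keeps the search polynomial in the total sequence length while making a globally good EM optimum more likely, which is the point of the innovation.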

697 citations


Journal ArticleDOI
TL;DR: It is confirmed that allowing multivariate tests generally improves the accuracy of the resulting decision tree over a univariate tree, and several new methods for forming multivariate decision trees are presented.
Abstract: Unlike a univariate decision tree, a multivariate decision tree is not restricted to splits of the instance space that are orthogonal to the features' axes. This article addresses several issues for constructing multivariate decision trees: representing a multivariate test, including symbolic and numeric features, learning the coefficients of a multivariate test, selecting the features to include in a test, and pruning of multivariate decision trees. We present several new methods for forming multivariate decision trees and compare them with several well-known methods. We compare the different methods across a variety of learning tasks, in order to assess each method's ability to find concise, accurate decision trees. The results demonstrate that some multivariate methods are in general more effective than others (in the context of our experimental assumptions). In addition, the experiments confirm that allowing multivariate tests generally improves the accuracy of the resulting decision tree over a univariate tree.

346 citations


Journal ArticleDOI
TL;DR: It is demonstrated that the existence of a sample compression scheme of fixed-size for a class C is sufficient to ensure that the class C is pac-learnable, and the relationship between sample compression schemes and the VC dimension is explored.
Abstract: Within the framework of pac-learning, we explore the learnability of concepts from samples using the paradigm of sample compression schemes. A sample compression scheme of size k for a concept class C ⊆ 2X consists of a compression function and a reconstruction function. The compression function receives a finite sample set consistent with some concept in C and chooses a subset of k examples as the compression set. The reconstruction function forms a hypothesis on X from a compression set of k examples. For any sample set of a concept in C the compression set produced by the compression function must lead to a hypothesis consistent with the whole original sample set when it is fed to the reconstruction function. We demonstrate that the existence of a sample compression scheme of fixed-size for a class C is sufficient to ensure that the class C is pac-learnable. Previous work has shown that a class is pac-learnable if and only if the Vapnik-Chervonenkis (VC) dimension of the class is finite. In the second half of this paper we explore the relationship between sample compression schemes and the VC dimension. We define maximum and maximal classes of VC dimension d. For every maximum class of VC dimension d, there is a sample compression scheme of size d, and for sufficiently-large maximum classes there is no sample compression scheme of size less than d. We discuss briefly classes of VC dimension d that are maximal but not maximum. It is an open question whether every class of VC dimension d has a sample compression scheme of size O(d).
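The two halves of a compression scheme can be made concrete for one classic class (an assumed illustration, not from the paper): axis-parallel rectangles in the plane. The compression function keeps at most four extreme positive examples; the reconstruction function rebuilds the smallest enclosing rectangle, which is consistent with the whole sample whenever some rectangle in the class is.

```python
# Sketch of a size-4 sample compression scheme for axis-parallel rectangles.

def compress(sample):
    """sample: list of ((x, y), label). Keep the extreme positive points in
    each axis; these determine the smallest consistent rectangle."""
    pos = [p for p, label in sample if label]
    if not pos:
        return []
    keep = {min(pos), max(pos),
            min(pos, key=lambda p: p[1]), max(pos, key=lambda p: p[1])}
    return list(keep)

def reconstruct(compression_set):
    """Hypothesis: smallest axis-parallel rectangle containing the set."""
    if not compression_set:
        return lambda p: False
    xs = [x for x, _ in compression_set]
    ys = [y for _, y in compression_set]
    lo_x, hi_x, lo_y, hi_y = min(xs), max(xs), min(ys), max(ys)
    return lambda p: lo_x <= p[0] <= hi_x and lo_y <= p[1] <= hi_y

sample = [((1, 1), True), ((4, 3), True), ((2, 5), True),
          ((0, 0), False), ((6, 6), False)]
h = reconstruct(compress(sample))
consistent = all(h(p) == label for p, label in sample)
```

The class of axis-parallel rectangles in the plane has VC dimension 4, matching the compression-set size here, in line with the maximum-class result quoted in the abstract.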

296 citations


Journal ArticleDOI
TL;DR: In domains where the decision boundaries are axis-parallel, the NGE approach can produce excellent generalization with interpretable hypotheses, and in all domains tested, NGE algorithms require much less memory to store generalized exemplars than is required by NN algorithms.
Abstract: Algorithms based on Nested Generalized Exemplar (NGE) theory (Salzberg, 1991) classify new data points by computing their distance to the nearest “generalized exemplar” (i.e., either a point or an axis-parallel rectangle). They combine the distance-based character of nearest neighbor (NN) classifiers with the axis-parallel rectangle representation employed in many rule-learning systems. An implementation of NGE was compared to the k-nearest neighbor (kNN) algorithm in 11 domains and found to be significantly inferior to kNN in 9 of them. Several modifications of NGE were studied to understand the cause of its poor performance. These show that its performance can be substantially improved by preventing NGE from creating overlapping rectangles, while still allowing complete nesting of rectangles. Performance can be further improved by modifying the distance metric to allow weights on each of the features (Salzberg, 1991). Best results were obtained in this study when the weights were computed using mutual information between the features and the output class. The best version of NGE developed is a batch algorithm (BNGE FWMI) that has no user-tunable parameters. BNGE FWMI's performance is comparable to the first-nearest neighbor algorithm (also incorporating feature weights). However, the k-nearest neighbor algorithm is still significantly superior to BNGE FWMI in 7 of the 11 domains, and inferior to it in only 2. We conclude that, even with our improvements, the NGE approach is very sensitive to the shape of the decision boundaries in classification problems. In domains where the decision boundaries are axis-parallel, the NGE approach can produce excellent generalization with interpretable hypotheses. In all domains tested, NGE algorithms require much less memory to store generalized exemplars than is required by NN algorithms.
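The NGE prediction rule described above can be sketched directly (a minimal illustration, not the BNGE FWMI system): a query is labeled by its nearest generalized exemplar, where the distance to an axis-parallel rectangle is zero inside it and otherwise the distance to its nearest face.

```python
# Sketch of nearest-generalized-exemplar classification with point-to-
# hyperrectangle distance; stored points are degenerate rectangles.
import math

def rect_distance(point, lower, upper):
    """Euclidean distance from point to the hyperrectangle [lower, upper];
    zero when the point lies inside."""
    return math.sqrt(sum(max(lo - x, 0.0, x - hi) ** 2
                         for x, lo, hi in zip(point, lower, upper)))

def nge_predict(point, exemplars):
    """exemplars: list of (lower, upper, label)."""
    return min(exemplars,
               key=lambda e: rect_distance(point, e[0], e[1]))[2]

exemplars = [((0, 0), (2, 2), "A"),   # generalized exemplar: a rectangle
             ((5, 5), (5, 5), "B")]   # ungeneralized exemplar: a point
```

Storing one rectangle in place of many covered training points is why NGE needs much less memory than a plain nearest-neighbor table.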

229 citations


Journal ArticleDOI
TL;DR: An attribute-selection metric is proposed here that takes both the error as well as monotonicity into account while building decision trees and is empirically shown capable of significantly reducing the degree of non-monotonicity of decision trees without sacrificing their inductive accuracy.
Abstract: Decision trees that are based on information theory are useful paradigms for learning from examples. However, in some real-world applications, known information-theoretic methods frequently generate nonmonotonic decision trees, in which objects with better attribute values are sometimes classified to lower classes than objects with inferior values. This property is undesirable for problem solving in many application domains, such as credit scoring and insurance premium determination, where monotonicity of subsequent classifications is important. An attribute-selection metric is proposed here that takes both the error as well as monotonicity into account while building decision trees. The metric is empirically shown capable of significantly reducing the degree of non-monotonicity of decision trees without sacrificing their inductive accuracy.
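The monotonicity notion at issue can be sketched as follows (an assumed formulation, not the paper's exact metric): a pair of classified objects is non-monotonic when one dominates the other on every attribute yet receives a lower class label, and the fraction of such pairs gives a degree of non-monotonicity that a selection metric could penalize alongside error.

```python
# Sketch: degree of non-monotonicity of a classified example set, counting
# dominance pairs whose class order violates the attribute order.
from itertools import combinations

def dominates(a, b):
    return all(x >= y for x, y in zip(a, b))

def nonmonotonicity_degree(examples):
    """examples: list of (attribute tuple, ordinal class label). Returns the
    fraction of comparable pairs that violate monotonicity."""
    bad = total = 0
    for (xa, ca), (xb, cb) in combinations(examples, 2):
        if dominates(xa, xb) or dominates(xb, xa):
            total += 1
            if (dominates(xa, xb) and ca < cb) or \
               (dominates(xb, xa) and cb < ca):
                bad += 1
    return bad / total if total else 0.0

# Credit-scoring flavour: (income band, years employed) -> rating 0..2.
# The (2, 2) applicant outranks (3, 3) despite inferior attributes: bad pair.
examples = [((1, 1), 0), ((2, 2), 2), ((3, 3), 1)]
degree = nonmonotonicity_degree(examples)
```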

158 citations


Journal ArticleDOI
TL;DR: A system, FORTE (First-Order Revision of Theories from Examples), which refines first-order Horn-clause theories by integrating a variety of different revision techniques into a coherent whole, guided by a global heuristic.
Abstract: Knowledge acquisition is a difficult, error-prone, and time-consuming task. The task of automatically improving an existing knowledge base using learning methods is addressed by the class of systems performing theory refinement. This paper presents a system, FORTE (First-Order Revision of Theories from Examples), which refines first-order Horn-clause theories by integrating a variety of different revision techniques into a coherent whole. FORTE uses these techniques within a hill-climbing framework, guided by a global heuristic. It identifies possible errors in the theory and calls on a library of operators to develop possible revisions. The best revision is implemented, and the process repeats until no further revisions are possible. Operators are drawn from a variety of sources, including propositional theory refinement, first-order induction, and inverse resolution. FORTE is demonstrated in several domains, including logic programming and qualitative modelling.

128 citations


Journal ArticleDOI
TL;DR: This article presents an approach that uses characteristics of the given data set, in the form of feedback from the learning process, to guide a search for a tree-structured hybrid classifier.
Abstract: The results of empirical comparisons of existing learning algorithms illustrate that each algorithm has a selective superiority: each is best for some but not all tasks. Given a data set, it is often not clear beforehand which algorithm will yield the best performance. In this article we present an approach that uses characteristics of the given data set, in the form of feedback from the learning process, to guide a search for a tree-structured hybrid classifier. Heuristic knowledge about the characteristics that indicate one bias is better than another is encoded in the rule base of the Model Class Selection (MCS) system. The approach does not assume that the entire instance space is best learned using a single representation language; for some data sets, choosing to form a hybrid classifier is a better bias, and MCS has the ability to determine these cases. The results of an empirical evaluation illustrate that MCS achieves classification accuracies equal to or higher than the best of its primitive learning components for each data set, demonstrating that the heuristic rules effectively select an appropriate learning bias.

122 citations


Journal ArticleDOI
TL;DR: This introduction motivates the importance of automated methods for evaluating and selecting biases using a framework of bias selection as search in bias and meta-bias spaces.
Abstract: In this introduction, we define the term bias as it is used in machine learning systems. We motivate the importance of automated methods for evaluating and selecting biases using a framework of bias selection as search in bias and meta-bias spaces. Recent research in the field of machine learning bias is summarized.

122 citations


Journal ArticleDOI
TL;DR: In this paper, the authors introduce a method for quantifying stability, based on a measure of the agreement between concepts, and discuss the relationships among stability, predictive accuracy, and bias.
Abstract: Research on bias in machine learning algorithms has generally been concerned with the impact of bias on predictive accuracy. We believe that there are other factors that should also play a role in the evaluation of bias. One such factor is the stability of the algorithm: in other words, the repeatability of the results. If we obtain two sets of data from the same phenomenon, with the same underlying probability distribution, then we would like our learning algorithm to induce approximately the same concepts from both sets of data. This paper introduces a method for quantifying stability, based on a measure of the agreement between concepts. We also discuss the relationships among stability, predictive accuracy, and bias.
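The stability idea can be sketched operationally (an assumed illustration, not the paper's measure): train the same learner on two independent samples from one distribution and measure the agreement of the two induced concepts on fresh points. The learner here is a trivial one-dimensional threshold rule, chosen purely for illustration.

```python
# Sketch: stability as agreement between concepts induced from two
# independent samples of the same phenomenon.
import random

def learn_threshold(sample):
    """Midpoint between the largest negative and the smallest positive
    value; assumes both classes occur in the sample."""
    pos = [x for x, label in sample if label]
    neg = [x for x, label in sample if not label]
    t = (max(neg) + min(pos)) / 2
    return lambda x: x >= t

def agreement(h1, h2, points):
    """Fraction of points on which the two induced concepts agree."""
    return sum(h1(x) == h2(x) for x in points) / len(points)

random.seed(0)
def draw_sample(n):
    xs = [random.uniform(0, 1) for _ in range(n)]
    return [(x, x >= 0.5) for x in xs]   # true concept: threshold at 0.5

h1 = learn_threshold(draw_sample(100))
h2 = learn_threshold(draw_sample(100))
stability = agreement(h1, h2, [i / 1000 for i in range(1000)])
```

A stable algorithm yields agreement near 1 on such repeated draws; an unstable one can be accurate on average yet induce visibly different concepts from sample to sample.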

Journal ArticleDOI
TL;DR: Edge-recombination crossover used in conjunction with several specialized operators is found to perform best in these experiments; these operators solved a 10KB sequence, consisting of 177 fragments, with no manual intervention.
Abstract: We study different genetic algorithm operators for one permutation problem associated with the Human Genome Project—the assembly of DNA sequence fragments from a parent clone whose sequence is unknown into a consensus sequence corresponding to the parent sequence. The sorted-order representation, which does not require specialized operators, is compared with a more traditional permutation representation, which does require specialized operators. The two representations and their associated operators are compared on problems ranging from 2K to 34K base pairs (KB). Edge-recombination crossover used in conjunction with several specialized operators is found to perform best in these experiments; these operators solved a 10KB sequence, consisting of 177 fragments, with no manual intervention. Natural building blocks in the problem are exploited at progressively higher levels through “macro-operators.” This significantly improves performance.

Journal ArticleDOI
TL;DR: A new SVD (singular value decomposition) method, which compresses the long and sparse n-gram input vectors and captures semantics of n-gram words, has improved the generalization capability of the network.
Abstract: A neural network classification method has been developed as an alternative approach to the search/organization problem of protein sequence databases. The neural networks used are three-layered, feed-forward, back-propagation networks. The protein sequences are encoded into neural input vectors by a hashing method that counts occurrences of n-gram words. A new SVD (singular value decomposition) method, which compresses the long and sparse n-gram input vectors and captures semantics of n-gram words, has improved the generalization capability of the network. A full-scale protein classification system has been implemented on a Cray supercomputer to classify unknown sequences into 3311 PIR (Protein Identification Resource) superfamilies/families at a speed of less than 0.05 CPU second per sequence. The sensitivity is close to 90% overall, and approaches 100% for large superfamilies. The system could be used to reduce the database search time and is being used to help organize the PIR protein sequence database.

Journal ArticleDOI
TL;DR: A framework for representing and automatically selecting a wide variety of biases is presented and experiments with an instantiation of the framework addressing various pragmatic tradeoffs of time, space, accuracy, and the cost of errors are described.
Abstract: This paper extends the currently accepted model of inductive bias by identifying six categories of bias and separates inductive bias from the policy for its selection (the inductive policy). We analyze existing “bias selection” systems, examining the similarities and differences in their inductive policies, and identify three techniques useful for building inductive policies. We then present a framework for representing and automatically selecting a wide variety of biases and describe experiments with an instantiation of the framework addressing various pragmatic tradeoffs of time, space, accuracy, and the cost of errors. The experiments show that a common framework can be used to implement policies for a variety of different types of bias selection, such as parameter selection, term selection, and example selection, using similar techniques. The experiments also show that different tradeoffs can be made by the implementation of different policies; for example, from the same data different rule sets can be learned based on different tradeoffs of accuracy versus the cost of erroneous predictions.

Journal ArticleDOI
TL;DR: The feasibility of using learning classifier systems as a tool for building adaptive control systems for real robots is investigated and it is shown that with this approach it is possible to let the AutonoMouse, a small real robot, learn to approach a light source under a number of different noise and lesion conditions.
Abstract: In this article we investigate the feasibility of using learning classifier systems as a tool for building adaptive control systems for real robots. Their use on real robots imposes efficiency constraints which are addressed by three main tools: parallelism, distributed architecture, and training. Parallelism is useful to speed up computation and to increase the flexibility of the learning system design. Distributed architecture helps in making it possible to decompose the overall task into a set of simpler learning tasks. Finally, training provides guidance to the system while learning, shortening the number of cycles required to learn. These tools and the issues they raise are first studied in simulation, and then the experience gained with simulations is used to implement the learning system on the real robot. Results have shown that with this approach it is possible to let the AutonoMouse, a small real robot, learn to approach a light source under a number of different noise and lesion conditions.

Journal ArticleDOI
TL;DR: The performance of the error backpropagation (BP) and ID3 learning algorithms was compared on the task of mapping English text to phonemes and stresses and it is shown that BP consistently out-performs ID3 on this task by several percentage points.
Abstract: The performance of the error backpropagation (BP) and ID3 learning algorithms was compared on the task of mapping English text to phonemes and stresses. Under the distributed output code developed by Sejnowski and Rosenberg, it is shown that BP consistently out-performs ID3 on this task by several percentage points. Three hypotheses explaining this difference were explored: (a) ID3 is overfitting the training data, (b) BP is able to share hidden units across several output units and hence can learn the output units better, and (c) BP captures statistical information that ID3 does not. We conclude that only hypothesis (c) is correct. By augmenting ID3 with a simple statistical learning procedure, the performance of BP can be closely matched. More complex statistical procedures can improve the performance of both BP and ID3 substantially in this domain.

Journal ArticleDOI
TL;DR: This paper presents a learning algorithm that learns any SDA M in the limit from positive data, satisfying the properties that (i) the time for updating a conjecture is at most O(lm), and (ii) the number of implicit prediction errors is at most O(ln), where l is the maximum length of all positive data provided.
Abstract: This paper deals with the polynomial-time learnability of a language class in the limit from positive data, and discusses the learning problem of a subclass of deterministic finite automata (DFAs), called strictly deterministic automata (SDAs), in the framework of learning in the limit from positive data. We first discuss the difficulty of Pitt's definition in the framework of learning in the limit from positive data, by showing that any class of languages with an infinite descending chain property is not polynomial-time learnable in the limit from positive data. We then propose new definitions for polynomial-time learnability in the limit from positive data. We show in our new definitions that the class of SDAs is iteratively, consistently polynomial-time learnable in the limit from positive data. In particular, we present a learning algorithm that learns any SDA M in the limit from positive data, satisfying the properties that (i) the time for updating a conjecture is at most O(lm), (ii) the number of implicit prediction errors is at most O(ln), where l is the maximum length of all positive data provided, m is the alphabet size of M and n is the size of M, (iii) each conjecture is computed from only the previous conjecture and the current example, and (iv) at any stage the conjecture is consistent with the sample set seen so far. This is in marked contrast to the fact that the class of DFAs is neither learnable in the limit from positive data nor polynomial-time learnable in the limit.

Journal ArticleDOI
TL;DR: A comparative study is presented of language biases employed in specific-to-general learning systems within the Inductive Logic Programming (ILP) paradigm, focusing on three well known systems: CLINT, GOLEM and ITOU, and evaluating both conceptually and empirically their strengths and weaknesses.
Abstract: A comparative study is presented of language biases employed in specific-to-general learning systems within the Inductive Logic Programming (ILP) paradigm. More specifically, we focus on the biases employed in three well known systems: CLINT, GOLEM and ITOU, and evaluate both conceptually and empirically their strengths and weaknesses. The evaluation is carried out within the generic framework of the NINA system, in which bias is a parameter. Two different types of biases are considered: syntactic bias, which defines the set of well-formed clauses, and semantic bias, which imposes restrictions on the behaviour of hypotheses or clauses. NINA is also able to shift its bias (within a predefined series of biases), whenever its current bias is insufficient for finding complete and consistent concept definitions. Furthermore, a new formalism for specifying the syntactic bias of inductive logic programming systems is introduced.

Journal ArticleDOI
TL;DR: The CNF learner performs surprisingly well, and results on five natural data sets indicate that it frequently trains faster and produces more accurate and simpler concepts.
Abstract: This paper presents results comparing three simple inductive learning systems using different representations for concepts, namely: CNF formulae, DNF formulae, and decision trees. The CNF learner performs surprisingly well. Results on five natural data sets indicate that it frequently trains faster and produces more accurate and simpler concepts.

Journal ArticleDOI
TL;DR: A key contribution of this work is that a “non-pure” or noisy binary relation is defined and then by exploiting the robustness of weighted majority voting with respect to noise, it is shown that both of the algorithms can learn non-pure relations.
Abstract: In this paper we demonstrate how weighted majority voting with multiplicative weight updating can be applied to obtain robust algorithms for learning binary relations. We first present an algorithm that obtains a nearly optimal mistake bound but at the expense of using exponential computation to make each prediction. However, the time complexity of our algorithm is significantly reduced from that of previously known algorithms that have comparable mistake bounds. The second algorithm we present is a polynomial time algorithm with a non-optimal mistake bound. Again the mistake bound of our second algorithm is significantly better than previous bounds proven for polynomial time algorithms. A key contribution of our work is that we define a “non-pure” or noisy binary relation and then by exploiting the robustness of weighted majority voting with respect to noise, we show that both of our algorithms can learn non-pure relations. These provide the first algorithms that can learn non-pure binary relations.
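The core technique named in the abstract, weighted majority voting with multiplicative weight updating, can be sketched in its generic form (not the paper's relation-learning algorithms): each expert that predicts wrongly has its weight multiplied by beta < 1, so the master's mistake bound degrades gracefully with noise.

```python
# Sketch of weighted majority voting with multiplicative updates.

def weighted_majority(expert_preds, outcomes, beta=0.5):
    """expert_preds: list of per-expert 0/1 prediction lists, one entry per
    trial. Returns (master predictions, final expert weights)."""
    n = len(expert_preds)
    weights = [1.0] * n
    master = []
    for t, outcome in enumerate(outcomes):
        vote1 = sum(w for w, preds in zip(weights, expert_preds)
                    if preds[t] == 1)
        vote0 = sum(weights) - vote1
        master.append(1 if vote1 >= vote0 else 0)
        # multiplicative update: demote every expert that was wrong
        weights = [w * beta if preds[t] != outcome else w
                   for w, preds in zip(weights, expert_preds)]
    return master, weights

# Hypothetical experts: one perfect, one always wrong, one constant.
outcomes = [1, 1, 0, 1, 0, 1]
experts = [outcomes,
           [1 - o for o in outcomes],
           [1, 1, 1, 1, 1, 1]]
master, weights = weighted_majority(experts, outcomes)
```

The robustness the paper exploits is visible here: the always-wrong expert's weight decays geometrically, so a few noisy (non-pure) entries cannot permanently mislead the vote.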

Journal ArticleDOI
TL;DR: It is shown that any algorithm that learns DNF formulas by greedily collecting prime implicants requires exponential time, and this is contrasted with a polynomial-time algorithm for learning “most” (rather than all) DNF formulas.
Abstract: We present two related results about the learnability of disjunctive normal form (DNF) formulas. First we show that a common approach for learning arbitrary DNF formulas requires exponential time. We then contrast this with a polynomial time algorithm for learning “most” (rather than all) DNF formulas. A natural approach for learning boolean functions involves greedily collecting the prime implicants of the hidden function. In a seminal paper of learning theory, Valiant demonstrated the efficacy of this approach for learning monotone DNF, and suggested this approach for learning DNF. Here we show that no algorithm using such an approach can learn DNF in polynomial time. We show this by constructing a counterexample DNF formula which would force such an algorithm to take exponential time. This counterexample seems to capture much of what makes DNF hard to learn, and thus is useful to consider when evaluating the run-time of a proposed DNF learning algorithm. This hardness result, as well as other hardness results for learning DNF, relies on the construction of particular hard-to-learn formulas, formulas that appear to be relatively rare. This raises the question of whether most DNF formulas are learnable. For certain natural definitions of “most DNF formulas,” we answer this question affirmatively.

Journal ArticleDOI
TL;DR: It is proved that for some restricted languages predicate invention does not help when the learning task fails and the languages for which predicate invention is useful are characterized.
Abstract: The task of predicate invention in Inductive Logic Programming is to extend the hypothesis language with new predicates if the vocabulary given initially is insufficient for the learning task. However, whether predicate invention really helps to make learning succeed in the extended language depends on the language bias currently employed. In this paper, we investigate for which commonly employed language biases predicate invention is an appropriate shift operation. We prove that for some restricted languages predicate invention does not help when the learning task fails and we characterize the languages for which predicate invention is useful. We investigate the decidability of the bias shift problem for these languages and discuss the capabilities of predicate invention as a bias shift operation.

Journal ArticleDOI
TL;DR: This work applies a heuristic version of the newly proposed algorithmic significance method to one of the main problems in DNA and protein sequence comparisons: the problem of deciding whether observed similarity between sequences should be explained by their relatedness or by the mere presence of some shared internal structure.
Abstract: Algorithmic mutual information is a central concept in algorithmic information theory and may be measured as the difference between independent and joint minimal encoding lengths of objects; it is also a central concept in Chaitin's fascinating mathematical definition of life. We explore applicability of algorithmic mutual information as a tool for discovering dependencies in biology. In order to determine significance of discovered dependencies, we extend the newly proposed algorithmic significance method. The main theorem of the extended method states that d bits of algorithmic mutual information imply dependency at the significance level 2−d+O(1). We apply a heuristic version of the method to one of the main problems in DNA and protein sequence comparisons: the problem of deciding whether observed similarity between sequences should be explained by their relatedness or by the mere presence of some shared internal structure, e.g., shared internal repetitive patterns. We take advantage of the fact that mutual information factors out sequence similarity that is due to shared internal structure and thus enables discovery of truly related sequences. In addition to providing a general framework for sequence comparisons, we also propose an efficient way to compare sequences based on their subword composition that does not require any a priori assumptions about k-tuple length.
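A heuristic flavour of the measure can be sketched with an off-the-shelf compressor (an assumed analogue: algorithmic information itself is uncomputable, and this is not the paper's method): approximate minimal encoding lengths with zlib and estimate mutual information as C(x) + C(y) − C(xy).

```python
# Sketch: compression-based approximation of mutual information between
# strings, using zlib output size as a proxy for minimal encoding length.
import random
import zlib

def clen(s):
    """Proxy for minimal encoding length: compressed size in bytes."""
    return len(zlib.compress(s.encode(), 9))

def approx_mutual_information(x, y):
    """Heuristic analogue of I(x : y) = C(x) + C(y) - C(xy)."""
    return clen(x) + clen(y) - clen(x + y)

random.seed(1)
x = "".join(random.choice("ACGT") for _ in range(400))
y = "".join(random.choice("ACGT") for _ in range(400))
related = approx_mutual_information(x, x)    # duplicated content: high MI
unrelated = approx_mutual_information(x, y)  # independent strings: low MI
```

Two copies of the same sequence compress jointly to little more than one copy, so their estimated mutual information is large, while independent random sequences share almost nothing; this mirrors the relatedness test described in the abstract.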

Journal ArticleDOI
Darrell C. Conklin1
TL;DR: A novel approach to protein motif representation and discovery is presented, which is based on a spatial description logic and the symbolic machine learning paradigm of structured concept formation, and several interesting and significant protein motifs are discovered.
Abstract: The investigation of relations between protein tertiary structure and amino acid sequence is a topic of tremendous importance in molecular biology. The automated discovery of recurrent patterns of structure and sequence is an essential part of this investigation. These patterns, known as protein motifs, are abstractions of fragments drawn from proteins of known sequence and tertiary structure. This paper has two objectives. The first is to introduce and define protein motifs, and provide a survey of previous research on protein motif discovery. The second is to present and apply a novel approach to protein motif representation and discovery, which is based on a spatial description logic and the symbolic machine learning paradigm of structured concept formation. A large database of protein fragments is processed using this approach, and several interesting and significant protein motifs are discovered.

Journal ArticleDOI
TL;DR: A hybrid concept representation is used that integrates numeric weights and thresholds with rules and combines rules with exemplars; experiments show a statistically meaningful advantage of the proposed method over others, in both classification accuracy and description simplicity, on several problems.
Abstract: This paper presents a method for learning graded concepts. Our method uses a hybrid concept representation that integrates numeric weights and thresholds with rules and combines rules with exemplars. Concepts are learned by constructing general descriptions to represent common cases. These general descriptions are in the form of decision rules with weights on conditions, interpreted by a similarity measure and numeric thresholds. The exceptional cases are represented as exemplars. This method was implemented in the Flexible Concept Learning System (FCLS) and tested on a variety of problems. The testing problems included practical concepts, concepts with graded structures, and concepts that can be defined in the classic view. For comparison, a decision tree learning system, an instance-based learning system, and the basic rule learning variant of FCLS were tested on the same problems. The results have shown a statistically meaningful advantage of the proposed method over others both in terms of classification accuracy and description simplicity on several problems.
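The two-tiered matching the abstract describes — weighted rule conditions interpreted by a similarity measure and a threshold, with exemplars covering the exceptional cases — can be sketched as follows. This is an illustrative reconstruction in the spirit of FCLS; all names, the similarity measure, and the data layout are assumptions, not the system's actual design:

```python
def rule_similarity(example, conditions):
    """Weighted fraction of rule conditions that the example satisfies."""
    total = sum(w for _, _, w in conditions)
    matched = sum(w for attr, val, w in conditions if example.get(attr) == val)
    return matched / total

def classify(example, rules, exemplars):
    """Graded matching: weighted rules with numeric thresholds first,
    nearest stored exemplar as a fallback for exceptional cases.
    rules: list of (label, [(attr, value, weight)], threshold)
    exemplars: list of (label, {attr: value})"""
    best_label, best_sim = None, 0.0
    for label, conditions, threshold in rules:
        sim = rule_similarity(example, conditions)
        if sim >= threshold and sim > best_sim:
            best_label, best_sim = label, sim
    if best_label is not None:
        return best_label
    # no rule fires above its threshold: fall back to exemplar overlap
    def overlap(item):
        _, attrs = item
        return sum(example.get(a) == v for a, v in attrs.items())
    return max(exemplars, key=overlap)[0]
```

An example satisfying most of a rule's weighted conditions is classified by the rule even if one condition fails, which is what makes the learned concepts graded rather than all-or-none.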

Journal ArticleDOI
TL;DR: It is shown that the optimal method for prediction may not be the same as that for data compression, even in the limit of an infinite amount of training data, although the two problems are asymptotically equivalent in the realizable case.
Abstract: The problem of learning from examples in an average case setting is considered. Focusing on the stochastic complexity, an information theoretic quantity measuring the minimal description length of the data given a class of models, we find rigorous upper and lower bounds for this quantity under various conditions. For realizable problems, where the model class used is sufficiently rich to represent the function giving rise to the examples, we find tight upper and lower bounds for the stochastic complexity. In this case, bounds on the prediction error follow immediately using the methods of Haussler et al. (1994a). For unrealizable learning we find a tight upper bound only in the case of learning within a space of finite VC dimension. Moreover, we show in the latter case that the optimal method for prediction may not be the same as that for data compression, even in the limit of an infinite amount of training data, although the two problems (i.e. prediction and compression) are asymptotically equivalent in the realizable case. This result may bear consequences for many of the widely used model selection methods.
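The minimal-description-length quantity the abstract studies can be made concrete for the simplest model class. Below is a toy two-part code length for a Bernoulli sample, using the standard n·H(p̂) + (k/2)·log₂ n approximation; this is a textbook sketch for orientation, not the paper's bounds:

```python
import math

def two_part_mdl(n_heads: int, n: int, k: int = 1) -> float:
    """Two-part description length in bits for n coin flips:
    data cost n*H(p_hat) plus (k/2)*log2(n) bits to state the
    maximum-likelihood parameter to the usual sqrt(n) precision."""
    p = n_heads / n
    if p in (0.0, 1.0):
        entropy = 0.0
    else:
        entropy = -(p * math.log2(p) + (1 - p) * math.log2(1 - p))
    return n * entropy + (k / 2) * math.log2(n)
```

A fair-looking sequence of 100 flips costs about 103 bits (100 bits of data plus parameter overhead), while an all-heads sequence costs only the roughly 3.3-bit parameter overhead; comparing such lengths across model classes is the model-selection use the abstract's last sentence cautions about.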

Journal ArticleDOI
TL;DR: This paper demonstrates that the sample complexity bound for the MDL-based discrimination algorithm is essentially related to Barron and Cover's index of resolvability, and gives a new view of the relationship between the index of resolvability and the MDL principle from the PAD-learning viewpoint.
Abstract: This paper develops a new computational model for learning stochastic rules, called the PAD (Probably Almost Discriminative)-learning model, based on statistical hypothesis testing theory. The model deals with the problem of designing a discrimination algorithm to test whether or not a given test sequence of (instance, label) example pairs has come from a given stochastic rule P*. Here a composite hypothesis P̃ is unknown other than that it belongs to a given class C. In this model, we propose a new discrimination algorithm based on the MDL (Minimum Description Length) principle, and then derive upper bounds on the least test sample size required by the algorithm to guarantee that the two types of error probabilities are respectively less than δ1 and δ2, provided that the distance between the two rules to be discriminated is not less than ε. For the parametric case, where C is a parametric class, this paper shows that an upper bound on test sample size is given by O((1/ε) ln(1/δ1) + (1/ε²) ln(1/δ2) + (k̃/ε) ln(k̃/ε) + ℓ(M̃)/ε), where k̃ is the number of real-valued parameters for the composite hypothesis P̃ and ℓ(M̃) is the description length for the countable model for P̃. Further, this paper shows that the MDL-based discrimination algorithm performs well in the sense of sample-complexity efficiency, comparing it with other kinds of information-criteria-based discrimination algorithms. This paper also shows how to transform any stochastic PAC (Probably Approximately Correct)-learning algorithm into a PAD-learning algorithm.
For the non-parametric case, where C is a non-parametric class but the discrimination algorithm uses a parametric class, this paper demonstrates that the sample complexity bound for the MDL-based discrimination algorithm is essentially related to Barron and Cover's index of resolvability. The sample complexity bound gives a new view of the relationship between the index of resolvability and the MDL principle from the PAD-learning viewpoint.
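Ignoring the constant hidden by the O(·), the parametric-case sample-size bound can be evaluated term by term to see which regime dominates. The function and its scaling constant c are illustrative only:

```python
import math

def pad_sample_bound(eps: float, delta1: float, delta2: float,
                     k: int, model_len: float, c: float = 1.0) -> float:
    """Evaluate the four terms of the test-sample-size bound:
    (1/eps) ln(1/delta1) + (1/eps^2) ln(1/delta2)
    + (k/eps) ln(k/eps) + model_len/eps, scaled by a constant c
    standing in for the hidden O(.) constant."""
    return c * (math.log(1 / delta1) / eps
                + math.log(1 / delta2) / eps ** 2
                + (k / eps) * math.log(k / eps)
                + model_len / eps)
```

For small ε the 1/ε² term for the second error type dominates, which is why halving the separation between rules roughly quadruples the required test sample.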

Journal ArticleDOI
TL;DR: A computer program is described that learns multiple concepts and their structural descriptions from observations of examples; it decomposes this conceptual clustering problem into two modules, the second of which incrementally incorporates generalizations into a hierarchy of concepts.
Abstract: A computer program is described that is capable of learning multiple concepts and their structural descriptions from observations of examples. It decomposes this conceptual clustering problem into two modules. The first module is concerned with forming a generalization from a pair of examples by extracting their common structure and calculating an information measure for each structural description. The second module, which is the subject of this paper, incrementally incorporates these generalizations into a hierarchy of concepts. This second module operates without reference to any underlying representation language and utilizes only the information measure provided by the first module, while employing a branch and bound procedure to search the hierarchy for concepts from which to form new clusters. This ability to search the hierarchy is used as the basis of a hill climbing strategy which has as its goal the avoidance of local peaks so as to reduce the sensitivity of the program to the order in which the observations are encountered.

Journal ArticleDOI
TL;DR: There is evidence that machine learning models can provide better classification accuracy than explicit knowledge acquisition techniques and the main contribution of machine learning to expert systems is not just cost reduction, but rather the provision of tools for the development of better expert systems.
Abstract: This empirical study provides evidence that machine learning models can achieve better classification accuracy than explicit knowledge acquisition techniques. The findings suggest that the main contribution of machine learning to expert systems is not just cost reduction, but rather the provision of tools for the development of better expert systems.

Journal ArticleDOI
TL;DR: The process of machine learning is described in terms of the various events that happen on a given trial, including the crucial association of words with internal representations of their meaning, as well as a comprehension grammar for a superset of a natural language.
Abstract: We are developing a theory of probabilistic language learning in the context of robotic instruction in elementary assembly actions. We describe the process of machine learning in terms of the various events that happen on a given trial, including the crucial association of words with internal representations of their meaning. Of central importance in learning is the generalization from utterances to grammatical forms. Our system derives a comprehension grammar for a superset of a natural language from pairs of verbal stimuli like "Go to the screw!" and corresponding internal representations of coerced actions. For the derivation of a grammar no knowledge of the language to be learned is assumed but only knowledge of an internal language. We present grammars for English, Chinese, and German generated from a finite sample of about 500 commands that are roughly equivalent across the three languages. All three grammars, which are context-free in form, accept an infinite set of commands in the given language.
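The derivation procedure itself is beyond a sketch, but the kind of context-free comprehension grammar the abstract describes — a finite set of rules accepting an infinite command language — can be illustrated with a toy acceptor. The grammar, its rules, and all names below are invented for illustration and are not the learned grammars from the paper:

```python
# Toy command grammar; the recursive OBJ rule makes the language infinite.
# CMD    -> ACTION OBJ
# ACTION -> "go" "to"
# OBJ    -> "the" NOUN | "the" NOUN "near" OBJ
# NOUN   -> "screw" | "nut" | "plate"
GRAMMAR = {
    "CMD": [["ACTION", "OBJ"]],
    "ACTION": [["go", "to"]],
    "OBJ": [["the", "NOUN"], ["the", "NOUN", "near", "OBJ"]],
    "NOUN": [["screw"], ["nut"], ["plate"]],
}

def parses(symbol, words, i):
    """Yield every index j such that symbol derives words[i:j]."""
    if symbol not in GRAMMAR:                 # terminal symbol
        if i < len(words) and words[i] == symbol:
            yield i + 1
        return
    for production in GRAMMAR[symbol]:
        positions = [i]
        for sym in production:                # thread positions through the RHS
            positions = [j for p in positions for j in parses(sym, words, p)]
        yield from positions

def accepts(sentence):
    """True if the whole command is derivable from CMD."""
    words = sentence.lower().rstrip("!").split()
    return any(j == len(words) for j in parses("CMD", words, 0))
```

Because OBJ can embed another OBJ ("the nut near the screw near the plate ..."), these four rules already accept unboundedly many commands, mirroring the abstract's point that a grammar learned from a finite sample of about 500 commands accepts an infinite language.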