
Showing papers in "Machine Learning in 1997"


Journal ArticleDOI
TL;DR: Tree Augmented Naive Bayes (TAN) is singled out: it outperforms naive Bayes, yet at the same time maintains the computational simplicity and robustness that characterize naive Bayes.
Abstract: Recent work in supervised learning has shown that a surprisingly simple Bayesian classifier with strong assumptions of independence among features, called naive Bayes, is competitive with state-of-the-art classifiers such as C4.5. This fact raises the question of whether a classifier with less restrictive assumptions can perform even better. In this paper we evaluate approaches for inducing classifiers from data, based on the theory of learning Bayesian networks. These networks are factored representations of probability distributions that generalize the naive Bayesian classifier and explicitly represent statements about independence. Among these approaches we single out a method we call Tree Augmented Naive Bayes (TAN), which outperforms naive Bayes, yet at the same time maintains the computational simplicity (no search involved) and robustness that characterize naive Bayes. We experimentally tested these approaches, using problems from the University of California at Irvine repository, and compared them to C4.5, naive Bayes, and wrapper methods for feature selection.
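The naive Bayes baseline the paper generalizes fits in a few lines. This is an illustrative sketch, not the authors' code: a categorical naive Bayes classifier with add-one (Laplace) smoothing, with invented feature values and class labels. TAN extends this by giving each attribute at most one extra attribute parent, chosen via a Chow-Liu-style tree over conditional mutual information, which is why no heuristic search is needed.

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Categorical naive Bayes with add-one (Laplace) smoothing -- a sketch."""

    def fit(self, X, y):
        self.n = len(y)
        self.classes = sorted(set(y))
        self.class_counts = Counter(y)
        n_features = len(X[0])
        # counts[(j, c)][v]: number of class-c examples with value v for attribute j
        self.counts = defaultdict(Counter)
        self.values = [set() for _ in range(n_features)]
        for xi, yi in zip(X, y):
            for j, v in enumerate(xi):
                self.counts[(j, yi)][v] += 1
                self.values[j].add(v)
        return self

    def predict(self, x):
        best, best_score = None, float("-inf")
        for c in self.classes:
            # log P(c) + sum_j log P(x_j | c), assuming attribute independence
            score = math.log(self.class_counts[c] / self.n)
            for j, v in enumerate(x):
                num = self.counts[(j, c)][v] + 1                  # Laplace smoothing
                den = self.class_counts[c] + len(self.values[j])
                score += math.log(num / den)
            if score > best_score:
                best, best_score = c, score
        return best
```

Given a handful of labelled tuples, `predict` scores each class by its prior times the smoothed per-attribute likelihoods and returns the argmax.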

4,775 citations


Journal ArticleDOI
TL;DR: The Bayesian classifier is shown to be optimal for learning conjunctions and disjunctions, even though they violate the independence assumption, and will often outperform more powerful classifiers for common training set sizes and numbers of attributes, even if its bias is a priori much less appropriate to the domain.
Abstract: The simple Bayesian classifier is known to be optimal when attributes are independent given the class, but the question of whether other sufficient conditions for its optimality exist has so far not been explored. Empirical results showing that it performs surprisingly well in many domains containing clear attribute dependences suggest that the answer to this question may be positive. This article shows that, although the Bayesian classifier’s probability estimates are only optimal under quadratic loss if the independence assumption holds, the classifier itself can be optimal under zero-one loss (misclassification rate) even when this assumption is violated by a wide margin. The region of quadratic-loss optimality of the Bayesian classifier is in fact a second-order infinitesimal fraction of the region of zero-one optimality. This implies that the Bayesian classifier has a much greater range of applicability than previously thought. For example, in this article it is shown to be optimal for learning conjunctions and disjunctions, even though they violate the independence assumption. Further, studies in artificial domains show that it will often outperform more powerful classifiers for common training set sizes and numbers of attributes, even if its bias is a priori much less appropriate to the domain. This article’s results also imply that detecting attribute dependence is not necessarily the best way to extend the Bayesian classifier, and this is also verified empirically.
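The conjunction result is easy to check numerically. A minimal sketch, assuming a uniform distribution over three boolean attributes and exact (maximum-likelihood, unsmoothed) probability estimates: the naive Bayes decision matches the target conjunction on every instance, even though the attributes are strongly dependent given the class.

```python
from itertools import product

def conjunction(x):
    # target concept: x1 AND x2 AND x3 -- violates independence given the class
    return int(all(x))

instances = list(product([0, 1], repeat=3))   # uniform instance space
labels = [conjunction(x) for x in instances]

def p_class(c):
    return labels.count(c) / len(labels)

def p_attr(j, v, c):
    # exact class-conditional marginal P(x_j = v | class = c)
    members = [x for x, y in zip(instances, labels) if y == c]
    return sum(1 for x in members if x[j] == v) / len(members)

def nb_predict(x):
    # argmax_c P(c) * prod_j P(x_j | c)
    scores = {}
    for c in (0, 1):
        s = p_class(c)
        for j, v in enumerate(x):
            s *= p_attr(j, v, c)
        scores[c] = s
    return max(scores, key=scores.get)

errors = sum(nb_predict(x) != conjunction(x) for x in instances)
```

Note that zero-one optimality holds even though the probability estimates are poor: for the all-ones instance the unnormalized scores are 1/8 for class 1 against roughly 0.069 for class 0, far from the true posterior of 1.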

3,225 citations


Journal ArticleDOI
TL;DR: The use of a naive Bayesian classifier is described, and it is demonstrated that it can incrementally learn profiles from user feedback on the interestingness of Web sites and may easily be extended to revise user provided profiles.
Abstract: We discuss algorithms for learning and revising user profiles that can determine which World Wide Web sites on a given topic would be interesting to a user. We describe the use of a naive Bayesian classifier for this task, and demonstrate that it can incrementally learn profiles from user feedback on the interestingness of Web sites. Furthermore, the Bayesian classifier may easily be extended to revise user provided profiles. In an experimental evaluation we compare the Bayesian classifier to computationally more intensive alternatives, and show that it performs at least as well as these approaches throughout a range of different domains. In addition, we empirically analyze the effects of providing the classifier with background knowledge in form of user defined profiles and examine the use of lexical knowledge for feature selection. We find that both approaches can substantially increase the prediction accuracy.
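Incremental learning falls out of naive Bayes naturally, because the model is just a table of counts. The sketch below is not the paper's system: it assumes binary word-presence features with add-one smoothing, and the class labels and words are invented for illustration.

```python
import math
from collections import Counter, defaultdict

class IncrementalNB:
    """Naive Bayes over binary word features, updated one example at a time."""

    def __init__(self):
        self.class_counts = Counter()             # documents seen per class
        self.word_counts = defaultdict(Counter)   # word_counts[c][w]: class-c docs containing w
        self.vocab = set()

    def update(self, words, label):
        # incorporate a single labelled example (user feedback) incrementally
        self.class_counts[label] += 1
        for w in set(words):
            self.word_counts[label][w] += 1
            self.vocab.add(w)

    def predict(self, words):
        words = set(words)
        n = sum(self.class_counts.values())
        best, best_score = None, float("-inf")
        for c in self.class_counts:
            score = math.log(self.class_counts[c] / n)
            for w in self.vocab:
                # smoothed P(w present | c); use 1 - p for absent words
                p = (self.word_counts[c][w] + 1) / (self.class_counts[c] + 2)
                score += math.log(p if w in words else 1 - p)
            if score > best_score:
                best, best_score = c, score
        return best
```

Each feedback event touches only a few counters, which is what makes profile revision cheap compared to retraining a more complex model.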

1,353 citations


Journal ArticleDOI
TL;DR: It is shown that if the two-member committee algorithm achieves information gain with a positive lower bound, then the prediction error decreases exponentially with the number of queries, and that this exponential decrease holds in particular for query learning of perceptrons.
Abstract: We analyze the “query by committee” algorithm, a method for filtering informative queries from a random stream of inputs. We show that if the two-member committee algorithm achieves information gain with positive lower bound, then the prediction error decreases exponentially with the number of queries. We show that, in particular, this exponential decrease holds for query learning of perceptrons.
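The filtering rule itself is simple; the hard part the paper analyzes is sampling the committee from the version space (via a Gibbs oracle). The sketch below substitutes a cruder stand-in, two perceptrons with different random initializations that are updated only on queried examples, so it illustrates the query-by-committee filter but carries none of the paper's guarantees.

```python
import random

def predict(w, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else -1

def update(w, x, y):
    # standard mistake-driven perceptron update
    if predict(w, x) != y:
        w = [wi + y * xi for wi, xi in zip(w, x)]
    return w

def query_by_committee(stream, oracle, dim, seed=0):
    """Filter a random input stream: query the label only where the two
    committee members disagree (a stand-in for two-member QBC)."""
    rng = random.Random(seed)
    w1 = [rng.uniform(-1, 1) for _ in range(dim)]
    w2 = [rng.uniform(-1, 1) for _ in range(dim)]
    queries = 0
    for x in stream:
        if predict(w1, x) != predict(w2, x):   # informative input: members split
            y = oracle(x)                      # spend a label only here
            queries += 1
            w1, w2 = update(w1, x, y), update(w2, x, y)
    return w1, w2, queries
```

Unlabeled inputs on which the committee already agrees are discarded for free; only disagreements, which carry information about the target, cost a query.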

1,234 citations


Journal ArticleDOI
TL;DR: It is argued that for many common machine learning problems, although in general the authors do not know the true (objective) prior for the problem, they do have some idea of a set of possible priors to which the true prior belongs.
Abstract: A Bayesian model of learning to learn by sampling from multiple tasks is presented. The multiple tasks are themselves generated by sampling from a distribution over an environment of related tasks. Such an environment is shown to be naturally modelled within a Bayesian context by the concept of an objective prior distribution. It is argued that for many common machine learning problems, although in general we do not know the true (objective) prior for the problem, we do have some idea of a set of possible priors to which the true prior belongs. It is shown that under these circumstances a learner can use Bayesian inference to learn the true prior by learning sufficiently many tasks from the environment. In addition, bounds are given on the amount of information required to learn a task when it is simultaneously learnt with several other tasks. The bounds show that if the learner has little knowledge of the true prior, but the dimensionality of the true prior is small, then sampling multiple tasks is highly advantageous. The theory is applied to the problem of learning a common feature set or equivalently a low-dimensional-representation (LDR) for an environment of related tasks.

496 citations


Journal ArticleDOI
TL;DR: This paper presents a gradient-based algorithm and shows that the gradient can be computed locally, using information that is available as a byproduct of standard inference algorithms for probabilistic networks.
Abstract: Probabilistic networks (also known as Bayesian belief networks) allow a compact description of complex stochastic relationships among several random variables. They are used widely for uncertain reasoning in artificial intelligence. In this paper, we investigate the problem of learning probabilistic networks with known structure and hidden variables. This is an important problem, because structure is much easier to elicit from experts than numbers, and the world is rarely fully observable. We present a gradient-based algorithm and show that the gradient can be computed locally, using information that is available as a byproduct of standard inference algorithms for probabilistic networks. Our experimental results demonstrate that using prior knowledge about the structure, even with hidden variables, can significantly improve the learning rate of probabilistic networks. We extend the method to networks in which the conditional probability tables are described using a small number of parameters. Examples include noisy-OR nodes and dynamic probabilistic networks. We show how this additional structure can be exploited by our algorithm to speed up the learning even further. We also outline an extension to hybrid networks, in which some of the nodes take on values in a continuous domain.
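The local-gradient result can be illustrated on the smallest possible case: a two-node network X → Y with X hidden and Y observed. In this sketch (all numbers are invented, only the root's parameters are differentiated, and the entries of θ are treated as unconstrained), the identity ∂ ln P(D)/∂θ_x = Σ_d P(X = x | d)/θ_x is checked against a finite-difference gradient; the posterior P(X = x | d) is exactly the byproduct of inference.

```python
import math

# Tiny network X -> Y with X hidden and Y observed (numbers are arbitrary).
theta = {0: 0.7, 1: 0.3}    # P(X = x): the parameters being learned
q = {0: 0.2, 1: 0.9}        # P(Y = 1 | X = x), held fixed for the demo
data = [1, 0, 1, 1, 0, 1]   # observed values of Y

def p_y(y, th):
    # marginal P(Y = y), summing out the hidden X
    return sum(th[x] * (q[x] if y == 1 else 1 - q[x]) for x in (0, 1))

def log_likelihood(th):
    return sum(math.log(p_y(y, th)) for y in data)

def local_gradient(x):
    # d lnP(D)/d theta_x = sum_d P(X = x | y_d) / theta_x
    g = 0.0
    for y in data:
        like = q[x] if y == 1 else 1 - q[x]
        posterior = theta[x] * like / p_y(y, theta)   # inference byproduct
        g += posterior / theta[x]
    return g

# sanity check against a one-sided finite difference in theta_1
eps = 1e-6
bumped = dict(theta)
bumped[1] += eps
numeric = (log_likelihood(bumped) - log_likelihood(theta)) / eps
```

The point of the locality is that each parameter's gradient needs only the posterior over its own node (and parents), which standard inference already produces.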

432 citations


Journal ArticleDOI
TL;DR: Two approaches to decision tree induction are described, one being incremental tree induction (ITI) and the other being non-incremental tree induction using a measure of tree quality instead of test quality (DMTI), which offer new computational and classifier characteristics that lend themselves to particular applications.
Abstract: The ability to restructure a decision tree efficiently enables a variety of approaches to decision tree induction that would otherwise be prohibitively expensive. Two such approaches are described here, one being incremental tree induction (ITI), and the other being non-incremental tree induction using a measure of tree quality instead of test quality (DMTI). These approaches and several variants offer new computational and classifier characteristics that lend themselves to particular applications.

399 citations


Journal ArticleDOI
TL;DR: Bayesian methods for model averaging and model selection among Bayesian-network models with hidden variables, and large-sample approximations for the marginal likelihood of naive-Bayes models in which the root node is hidden are examined.
Abstract: We discuss Bayesian methods for model averaging and model selection among Bayesian-network models with hidden variables. In particular, we examine large-sample approximations for the marginal likelihood of naive-Bayes models in which the root node is hidden. Such models are useful for clustering or unsupervised learning. We consider a Laplace approximation and the less accurate but more computationally efficient approximation known as the Bayesian Information Criterion (BIC), which is equivalent to Rissanen’s (1987) Minimum Description Length (MDL). Also, we consider approximations that ignore some off-diagonal elements of the observed information matrix and an approximation proposed by Cheeseman and Stutz (1995). We evaluate the accuracy of these approximations using a Monte-Carlo gold standard. In experiments with artificial and real examples, we find that (1) none of the approximations are accurate when used for model averaging, (2) all of the approximations, with the exception of BIC/MDL, are accurate for model selection, (3) among the accurate approximations, the Cheeseman–Stutz and Diagonal approximations are the most computationally efficient, (4) all of the approximations, with the exception of BIC/MDL, can be sensitive to the prior distribution over model parameters, and (5) the Cheeseman–Stutz approximation can be more accurate than the other approximations, including the Laplace approximation, in situations where the parameters in the maximum a posteriori configuration are near a boundary.

344 citations


Journal ArticleDOI
TL;DR: A new variant of the Winnow algorithm is presented that is especially suited to conditions with string-valued classifications, together with an analysis of a policy for discarding predictors in Weighted-Majority that allows it to speed up as it learns.
Abstract: This paper describes experimental results on using Winnow and Weighted-Majority based algorithms on a real-world calendar scheduling domain. These two algorithms have been highly studied in the theoretical machine learning literature. We show here that these algorithms can be quite competitive practically, outperforming the decision-tree approach currently in use in the Calendar Apprentice system in terms of both accuracy and speed. One of the contributions of this paper is a new variant on the Winnow algorithm (used in the experiments) that is especially suited to conditions with string-valued classifications, and we give a theoretical analysis of its performance. In addition we show how Winnow can be applied to achieve a good accuracy/coverage tradeoff and explore issues that arise such as concept drift. We also provide an analysis of a policy for discarding predictors in Weighted-Majority that allows it to speed up as it learns.
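For reference, the basic Winnow update is sketched below. This is the textbook boolean version with multiplicative promotion/demotion factor α and threshold n, not the authors' string-valued variant; the target concept in the demo is an invented two-literal disjunction.

```python
from itertools import product

def winnow(examples, n, alpha=2.0):
    """Textbook Winnow for monotone disjunctions over n boolean attributes."""
    w = [1.0] * n
    theta = n                  # fixed threshold
    mistakes = 0
    for x, y in examples:
        pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) >= theta else 0
        if pred != y:
            mistakes += 1
            if y == 1:         # promotion: multiply weights of active attributes
                w = [wi * alpha if xi else wi for wi, xi in zip(w, x)]
            else:              # demotion: divide weights of active attributes
                w = [wi / alpha if xi else wi for wi, xi in zip(w, x)]
    return w, mistakes

# demo target: x0 OR x1, over 8 boolean attributes
examples = [(list(x), 1 if x[0] or x[1] else 0)
            for x in product((0, 1), repeat=8)]
w, mistakes = winnow(examples * 200, n=8)
```

Winnow's mistake bound grows only logarithmically with the total number of attributes, which is what makes it attractive in domains like calendar scheduling where most features are irrelevant.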

251 citations


Journal ArticleDOI
TL;DR: Task sequences that allow for speeding up the learner's average reward intake through appropriate shifts of inductive bias are studied in inductive transfer case studies, using the “success-story algorithm” (SSA).
Abstract: We study task sequences that allow for speeding up the learner’s average reward intake through appropriate shifts of inductive bias (changes of the learner’s policy). To evaluate long-term effects of bias shifts setting the stage for later bias shifts we use the “success-story algorithm” (SSA). SSA is occasionally called at times that may depend on the policy itself. It uses backtracking to undo those bias shifts that have not been empirically observed to trigger long-term reward accelerations (measured up until the current SSA call). Bias shifts that survive SSA represent a lifelong success history. Until the next SSA call, they are considered useful and build the basis for additional bias shifts. SSA allows for plugging in a wide variety of learning algorithms. We plug in (1) a novel, adaptive extension of Levin search and (2) a method for embedding the learner’s policy modification strategy within the policy itself (incremental self-improvement). Our inductive transfer case studies involve complex, partially observable environments where traditional reinforcement learning fails.

205 citations


Journal ArticleDOI
TL;DR: By adapting their teacher/learner model to grammatical inference, it is proved that languages given by context-free grammars, simple deterministic grammars, linear grammars and nondeterministic finite automata are not identifiable in the limit from polynomial time and data.
Abstract: When concerned about efficient grammatical inference two issues are relevant: the first one is to determine the quality of the result, and the second is to try to use polynomial time and space. A typical idea to deal with the first point is to say that an algorithm performs well if it infers in the limit the correct language. The second point has led to debate about how to define polynomial time: the main definitions of polynomial inference have been proposed by Pitt and Angluin. We return in this paper to a definition proposed by Gold that requires a characteristic set of strings to exist for each grammar, and this set to be polynomial in the size of the grammar or automaton that is to be learned, where the size of the sample is the sum of the lengths of all strings it includes. The learning algorithm must also infer correctly as soon as the characteristic set is included in the data. We first show that this definition corresponds to a notion of teachability as defined by Goldman and Mathias. By adapting their teacher/learner model to grammatical inference we prove that languages given by context-free grammars, simple deterministic grammars, linear grammars and nondeterministic finite automata are not identifiable in the limit from polynomial time and data.

Journal ArticleDOI
TL;DR: The primary goal of this paper is to show how to solve the problems of noise handling in separate-and-conquer rule learning with two new algorithms that combine and integrate pre- and post-pruning.
Abstract: Pre-pruning and Post-pruning are two standard techniques for handling noise in decision tree learning. Pre-pruning deals with noise during learning, while post-pruning addresses this problem after an overfitting theory has been learned. We first review several adaptations of pre- and post-pruning techniques for separate-and-conquer rule learning algorithms and discuss some fundamental problems. The primary goal of this paper is to show how to solve these problems with two new algorithms that combine and integrate pre- and post-pruning.

Journal ArticleDOI
TL;DR: A general two-level learning model is presented that effectively adjusts to changing contexts by trying to detect (via ‘meta-learning’) contextual clues and using this information to focus the learning process.
Abstract: The article deals with the problem of learning incrementally (‘on-line’) in domains where the target concepts are context-dependent, so that changes in context can produce more or less radical changes in the associated concepts. In particular, we concentrate on a class of learning tasks where the domain provides explicit clues as to the current context (e.g., attributes with characteristic values). A general two-level learning model is presented that effectively adjusts to changing contexts by trying to detect (via ‘meta-learning’) contextual clues and using this information to focus the learning process. Context learning and detection occur during regular on-line learning, without separate training phases for context recognition. Two operational systems based on this model are presented that differ in the underlying learning algorithm and in the way they use contextual information: METAL(B) combines meta-learning with a Bayesian classifier, while METAL(IB) is based on an instance-based learning algorithm. Experiments with synthetic domains as well as a number of ‘real-world’ problems show that the algorithms are robust in a variety of dimensions, and that meta-learning can produce substantial increases in accuracy over simple object-level learning in situations with changing contexts.

Journal ArticleDOI
TL;DR: CHILD is described, an agent capable of Continual, Hierarchical, Incremental Learning and Development, which can quickly solve complicated non-Markovian reinforcement-learning tasks and can then transfer its skills to similar but even more complicated tasks, learning these faster still.
Abstract: Continual learning is the constant development of increasingly complex behaviors; the process of building more complicated skills on top of those already developed. A continual-learning agent should therefore learn incrementally and hierarchically. This paper describes CHILD, an agent capable of Continual, Hierarchical, Incremental Learning and Development. CHILD can quickly solve complicated non-Markovian reinforcement-learning tasks and can then transfer its skills to similar but even more complicated tasks, learning these faster still.

Journal ArticleDOI
TL;DR: Empirical results show that hypergeometric pre-pruning should be done in most cases, as trees pruned in this way are simpler and more efficient, and typically no less accurate than unpruned or post-pruned trees.
Abstract: ID3’s information gain heuristic is well-known to be biased towards multi-valued attributes. This bias is only partially compensated for by C4.5’s gain ratio. Several alternatives have been proposed and are examined here (distance, orthogonality, a Beta function, and two chi-squared tests). All of these metrics are biased towards splits with smaller branches, where low-entropy splits are likely to occur by chance. Both classical and Bayesian statistics lead to the multiple hypergeometric distribution as the exact posterior probability of the null hypothesis that the class distribution is independent of the split. Both gain and the chi-squared tests arise in asymptotic approximations to the hypergeometric, with similar criteria for their admissibility. Previous failures of pre-pruning are traced in large part to coupling these biased approximations with one another or with arbitrary thresholds; problems which are overcome by the hypergeometric. The choice of split-selection metric typically has little effect on accuracy, but can profoundly affect complexity and the effectiveness and efficiency of pruning. Empirical results show that hypergeometric pre-pruning should be done in most cases, as trees pruned in this way are simpler and more efficient, and typically no less accurate than unpruned or post-pruned trees.
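The multi-valued bias is easy to reproduce. A sketch on invented toy data: an ID-like attribute that takes a distinct value on every example achieves the maximum possible information gain, tying a perfectly predictive binary attribute; C4.5's gain ratio separates the two by dividing gain by the split information, though as the article argues the correction is only partial.

```python
import math
from collections import Counter

def entropy(seq):
    n = len(seq)
    return -sum((c / n) * math.log2(c / n) for c in Counter(seq).values())

def info_gain(attr, labels):
    n = len(labels)
    remainder = 0.0
    for v in set(attr):
        sub = [y for a, y in zip(attr, labels) if a == v]
        remainder += len(sub) / n * entropy(sub)
    return entropy(labels) - remainder

def gain_ratio(attr, labels):
    split = entropy(attr)   # "split information": entropy of the partition itself
    return info_gain(attr, labels) / split if split > 0 else 0.0

y = [0, 0, 0, 0, 1, 1, 1, 1]
ids = list(range(8))        # ID-like attribute: distinct value per example
perfect = y[:]              # binary attribute identical to the class
```

Here both attributes get information gain 1.0 bit, but the ID attribute's split information of 3 bits drags its gain ratio down to 1/3 while the binary attribute keeps a ratio of 1.0.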

Journal ArticleDOI
TL;DR: This paper shows how to develop dynamic programming versions of EBL, called region-based dynamic programming or Explanation-Based Reinforcement Learning (EBRL), and compares batch and online versions of EBRL to batch and online versions of point-based dynamic programming and to standard EBL.
Abstract: In speedup-learning problems, where full descriptions of operators are known, both explanation-based learning (EBL) and reinforcement learning (RL) methods can be applied. This paper shows that both methods involve fundamentally the same process of propagating information backward from the goal toward the starting state. Most RL methods perform this propagation on a state-by-state basis, while EBL methods compute the weakest preconditions of operators, and hence, perform this propagation on a region-by-region basis. Barto, Bradtke, and Singh (1995) have observed that many algorithms for reinforcement learning can be viewed as asynchronous dynamic programming. Based on this observation, this paper shows how to develop dynamic programming versions of EBL, which we call region-based dynamic programming or Explanation-Based Reinforcement Learning (EBRL). The paper compares batch and online versions of EBRL to batch and online versions of point-based dynamic programming and to standard EBL. The results show that region-based dynamic programming combines the strengths of EBL (fast learning and the ability to scale to large state spaces) with the strengths of reinforcement learning algorithms (learning of optimal policies). Results are shown in chess endgames and in synthetic maze tasks.

Journal ArticleDOI
TL;DR: This work allows the conditional probabilities to be represented in any manner (as tables or specialized functions) and obtains sample complexity bounds for learning nets with and without hidden nodes.
Abstract: We consider the problem of PAC learning probabilistic networks in the case where the structure of the net is specified beforehand. We allow the conditional probabilities to be represented in any manner (as tables or specialized functions) and obtain sample complexity bounds for learning nets with and without hidden nodes.

Journal ArticleDOI
Naoki Abe1, Hiroshi Mamitsuka1
TL;DR: A new method for predicting protein secondary structure of a given amino acid sequence is proposed, based on a training algorithm for the probability parameters of a stochastic tree grammar, which can predict the structure as well as the location of β-sheet regions, which was not possible by conventional methods for secondary structure prediction.
Abstract: We propose a new method for predicting protein secondary structure of a given amino acid sequence, based on a training algorithm for the probability parameters of a stochastic tree grammar. In particular, we concentrate on the problem of predicting β-sheet regions, which has previously been considered difficult because of the unbounded dependencies exhibited by sequences corresponding to β-sheets. To cope with this difficulty, we use a new family of stochastic tree grammars, which we call Stochastic Ranked Node Rewriting Grammars, which are powerful enough to capture the type of dependencies exhibited by the sequences of β-sheet regions, such as the ‘parallel’ and ‘anti-parallel’ dependencies and their combinations. The training algorithm we use is an extension of the ‘inside-outside’ algorithm for stochastic context-free grammars, but with a number of significant modifications. We applied our method on real data obtained from the HSSP database (Homology-derived Secondary Structure of Proteins Ver 1.0) and the results were encouraging: Our method was able to predict roughly 75 percent of the β-strands correctly in a systematic evaluation experiment, in which the test sequences not only have less than 25 percent identity to the training sequences, but are totally unrelated to them. This figure compares favorably to the predictive accuracy of the state-of-the-art prediction methods in the field, even though our experiment was on a restricted type of β-sheet structures and the test was done on a relatively small data size. We also stress that our method can predict the structure as well as the location of β-sheet regions, which was not possible by conventional methods for secondary structure prediction. Extended abstracts of parts of the work presented in this paper have appeared in (Abe & Mamitsuka, 1994) and (Mamitsuka & Abe, 1994).

Journal ArticleDOI
TL;DR: This study indicates that a class of domain models cannot be learned by search procedures that modify a network structure one link at a time, and suggests that prior knowledge about the problem domain together with a multi-link search strategy would provide an effective way to uncover many domain models.
Abstract: Several scoring metrics are used in different search procedures for learning probabilistic networks. We study the properties of cross entropy in learning a decomposable Markov network. Though entropy and related scoring metrics were widely used, its “microscopic” properties and asymptotic behavior in a search have not been analyzed. We present such a “microscopic” study of a minimum entropy search algorithm, and show that it learns an I-map of the domain model when the data size is large. Search procedures that modify a network structure one link at a time have been commonly used for efficiency. Our study indicates that a class of domain models cannot be learned by such procedures. This suggests that prior knowledge about the problem domain together with a multi-link search strategy would provide an effective way to uncover many domain models.

Journal ArticleDOI
TL;DR: An off-line variant of the mistake-bound model of learning is presented, an intermediate model between the on-line learning model and the self-directed learning model, and the combinatorial tool of labeled trees is extended to a unified approach that captures the various mistake bound measures.
Abstract: We present an off-line variant of the mistake-bound model of learning. This is an intermediate model between the on-line learning model (Littlestone, 1988, Littlestone, 1989) and the self-directed learning model (Goldman, Rivest & Schapire, 1993, Goldman & Sloan, 1994). Just like in the other two models, a learner in the off-line model has to learn an unknown concept from a sequence of elements of the instance space on which it makes “guess and test” trials. In all models, the aim of the learner is to make as few mistakes as possible. The difference between the models is that, while in the on-line model only the set of possible elements is known, in the off-line model the sequence of elements (i.e., the identity of the elements as well as the order in which they are to be presented) is known to the learner in advance. On the other hand, the learner is weaker than the self-directed learner, which is allowed to choose adaptively the sequence of elements presented to him. We study some of the fundamental properties of the off-line model. In particular, we compare the number of mistakes made by the off-line learner on certain concept classes to those made by the on-line and self-directed learners. We give bounds on the possible gaps between the various models and show examples that prove that our bounds are tight. Another contribution of this paper is the extension of the combinatorial tool of labeled trees to a unified approach that captures the various mistake bound measures of all the models discussed. We believe that this tool will prove to be useful for further study of models of incremental learning.

Journal ArticleDOI
TL;DR: Two issues in polynomial-time exact learning of concepts using membership and equivalence queries are considered: errors or omissions in answers to membership queries, and learning finite variants of concepts drawn from a learnable class.
Abstract: We consider two issues in polynomial-time exact learning of concepts using membership and equivalence queries: (1) errors or omissions in answers to membership queries, and (2) learning finite variants of concepts drawn from a learnable class. To study (1), we introduce two new kinds of membership queries: limited membership queries and malicious membership queries. Each is allowed to give incorrect responses on a maliciously chosen set of strings in the domain. Instead of answering correctly about a string, a limited membership query may give a special “I don’t know” answer, while a malicious membership query may give the wrong answer. A new parameter L is used to bound the length of an encoding of the set of strings that receive such incorrect answers. Equivalence queries are answered correctly, and learning algorithms are allowed time polynomial in the usual parameters and L. Any class of concepts learnable in polynomial time using equivalence and malicious membership queries is learnable in polynomial time using equivalence and limited membership queries; the converse is an open problem. For the classes of monotone monomials and monotone k-term DNF formulas, we present polynomial-time learning algorithms using limited membership queries alone. We present polynomial-time learning algorithms for the class of monotone DNF formulas using equivalence and limited membership queries, and using equivalence and malicious membership queries. To study (2), we consider classes of concepts that are polynomially closed under finite exceptions and a natural operation to add exception tables to a class of concepts. Applying this operation, we obtain the class of monotone DNF formulas with finite exceptions. We give a polynomial-time algorithm to learn the class of monotone DNF formulas with finite exceptions using equivalence and membership queries.
We also give a general transformation showing that any class of concepts that is polynomially closed under finite exceptions and is learnable in polynomial time using standard membership and equivalence queries is also polynomial-time learnable using malicious membership and equivalence queries. Corollaries include the polynomial-time learnability of the following classes using malicious membership and equivalence queries: deterministic finite acceptors, boolean decision trees, and monotone DNF formulas with finite exceptions.

Journal ArticleDOI
TL;DR: This paper obtains the most general family of prior-posterior distributions which is conjugate to a Dirichlet likelihood, identifies those hyperparameters that are influenced by data values, and describes some methods to assess the prior hyperparameters.
Abstract: In this paper we analyze the problem of learning and updating of uncertainty in Dirichlet models, where updating refers to determining the conditional distribution of a single variable when some evidence is known. We first obtain the most general family of prior-posterior distributions which is conjugate to a Dirichlet likelihood and we identify those hyperparameters that are influenced by data values. Next, we describe some methods to assess the prior hyperparameters and we give a numerical method to estimate the Dirichlet parameters in a Bayesian context, based on the posterior mode. We also give formulas for updating uncertainty by determining the conditional probabilities of single variables when the values of other variables are known. A time series approach is presented for dealing with the cases in which samples are not identically distributed, that is, the Dirichlet parameters change from sample to sample. This typically occurs when the population is observed at different times. Finally, two examples are given that illustrate the learning and updating processes and the time series approach.

Journal ArticleDOI
TL;DR: A representation framework that offers a unifying platform for alternative systems, which learn concepts in First Order Logics, and a novelty, in the hypothesis representation language, is the introduction of the construct of internal disjunction.
Abstract: This paper describes a representation framework that offers a unifying platform for alternative systems, which learn concepts in First Order Logics. The main aspects of this framework are discussed. First of all, the separation between the hypothesis logical language (a version of the VL21 language) and the representation of data by means of a relational database is motivated. Then, the functional layer between data and hypotheses, which makes the data accessible by the logical level through a set of abstract properties, is described. A novelty in the hypothesis representation language is the introduction of the construct of internal disjunction; such a construct, first used by the AQ and Induce systems, is here made operational via a set of algorithms capable of learning it, for both the discrete and the continuous-valued attribute cases. These algorithms are embedded in learning systems (SMART+, REGAL, SNAP, WHY, RTL) using different paradigms (symbolic, genetic or connectionist), thus realizing an effective integration among them; in fact, categorical and numerical attributes can be handled in a uniform way. In order to exemplify the effectiveness of the representation framework and of the multistrategy integration, the results obtained by the above systems in some application domains are summarized.

Journal ArticleDOI
TL;DR: This work has combined inductive logic programming (ILP) directly with a relational database management system and found that the better structured the hypothesis space is, the better learning can prune away uninteresting or losing hypotheses and the faster it becomes.
Abstract: When learning from very large databases, the reduction of complexity is extremely important. Two extremes of making knowledge discovery in databases (KDD) feasible have been put forward. One extreme is to choose a very simple hypothesis language, thereby being capable of very fast learning on real-world databases. The opposite extreme is to select a small data set, thereby being able to learn very expressive (first-order logic) hypotheses. A multistrategy approach allows one to include most of these advantages and exclude most of the disadvantages. Simpler learning algorithms detect hierarchies which are used to structure the hypothesis space for a more complex learning algorithm. The better structured the hypothesis space is, the better learning can prune away uninteresting or losing hypotheses and the faster it becomes. We have combined inductive logic programming (ILP) directly with a relational database management system. The ILP algorithm is controlled in a model-driven way by the user and in a data-driven way by structures that are induced by three simple learning algorithms.
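The hierarchy-driven pruning idea can be sketched as follows (hypothetical data structures, not the authors' ILP implementation): once a general hypothesis covers no positive examples, its entire subtree of specializations in the induced hierarchy can be skipped.

```python
# Sketch: pruning a hypothesis search using an induced hierarchy.
# If a general hypothesis covers no positives, none of its specializations
# (its children in the hierarchy) can, so the whole subtree is pruned.
# (Hypothetical structures; illustrative only.)

def prune_search(node, children, covers_positive):
    """Return hypotheses worth testing, skipping subtrees of losing ones."""
    if not covers_positive(node):
        return []                           # prune the entire subtree
    kept = [node]
    for child in children.get(node, []):
        kept.extend(prune_search(child, children, covers_positive))
    return kept

hierarchy = {"animal": ["bird", "fish"], "bird": ["penguin"]}
# suppose the hypothesis "fish" covers no positive examples:
alive = prune_search("animal", hierarchy, lambda h: h != "fish")
print(alive)  # ['animal', 'bird', 'penguin'] -- the 'fish' subtree is skipped
```

The better the simple learners structure this hierarchy, the larger the subtrees the expensive first-order learner never has to visit.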

Journal ArticleDOI
TL;DR: The parallel complexity of learning formulas from membership and equivalence queries is investigated and it is shown that many restricted classes of boolean functions cannot be efficiently learned in parallel with a polynomial number of processors.
Abstract: We investigate the parallel complexity of learning formulas from membership and equivalence queries. We show that many restricted classes of boolean functions cannot be efficiently learned in parallel with a polynomial number of processors.

Journal ArticleDOI
TL;DR: This paper shows how prior knowledge can be refined or supplemented using data by employing either a Bayesian approach, by a weighted combination of knowledge bases, or by generating artificial training data representing the prior knowledge.
Abstract: There is great interest in understanding the intrinsic knowledge neural networks have acquired during training. Most work in this direction is focused on the multi-layer perceptron architecture. The topic of this paper is networks of Gaussian basis functions, which are used extensively as learning systems in neural computation. We show that networks of Gaussian basis functions can be generated from simple probabilistic rules. Also, if appropriate learning rules are used, probabilistic rules can be extracted from trained networks. We present methods for the reduction of network complexity with the goal of obtaining concise and meaningful rules. We show how prior knowledge can be refined or supplemented using data by employing a Bayesian approach, a weighted combination of knowledge bases, or artificially generated training data representing the prior knowledge. We validate our approach using a standard statistical data set.


Journal ArticleDOI
TL;DR: It is argued that a variety of learning strategies can be embodied in the background knowledge provided to a general purpose learning algorithm.
Abstract: This paper discusses the role that background knowledge can play in building flexible multistrategy learning systems. We contend that a variety of learning strategies can be embodied in the background knowledge provided to a general purpose learning algorithm. To be effective, the general purpose algorithm must have a mechanism for learning new concept descriptions that can refer to knowledge provided by the user or learned during some other task. The method of knowledge representation is a central problem in designing such a system since it should be possible to specify background knowledge in such a way that the learner can apply its knowledge to new information.

Journal ArticleDOI
TL;DR: Only recently have researchers realized that Pearl's ideas have profound implications for learning, but, as the papers in this issue show, belief networks and their associated graphical notation now play a prominent role in work on probabilistic induction.
Abstract: 1. Introduction and motivation

Machine learning cannot occur without some means to represent the learned knowledge. Researchers have long recognized the influence of representational choices, and the major paradigms in machine learning are organized not around induction algorithms or performance elements as much as around representational classes. Major examples include logical representations, which encode knowledge as rule sets or as univariate decision trees; neural networks, which instead use nodes connected by weighted links; and instance-based approaches, which store specific training cases in memory.

In the late 1980s, work on probabilistic representations also started to appear in the machine learning literature. This representational framework had a number of attractions, including a clean probabilistic semantics and the ability to explicitly describe degrees of certainty. This general approach attracted only a moderate amount of attention until recent years, when progress on Bayesian belief networks led to enough activity in the area to justify this special issue on the topic of probabilistic learning.

Representing uncertainty has a long and sometimes chequered history in artificial intelligence. Early work on knowledge-based systems, such as Mycin and Prospector, modeled uncertainty explicitly and incorporated approximations to Bayesian inference. However, subsequent years saw probabilistic approaches largely ignored in AI and machine learning, until Pearl (1988) clearly demonstrated that probabilistic representations are less about numbers than about structure. He showed that a graphical notation, which lets one specify the independence assumptions of a probabilistic model, has clear advantages for probabilistic inference.
Only recently have researchers realized that Pearl's ideas (and related work in statistics) have profound implications for learning, but, as the papers in this issue show, belief networks and their associated graphical notation now play a prominent role in work on probabilistic induction.

A fundamental contribution of such probabilistic independence networks to learning is the notion that the complexity of a probabilistic representation is roughly inversely proportional to the number of independence assumptions it makes. The process of model building corresponds to searching among models by trading off complexity and fit to achieve good