
Showing papers by "Michael J. Pazzani published in 1996"


Proceedings Article
01 Jan 1996
TL;DR: The simple Bayesian classifier (SBC) is commonly thought to assume that attributes are independent given the class, but this is apparently contradicted by the surprisingly good performance it exhibits in many domains that contain clear attribute dependences.
Abstract: The simple Bayesian classifier (SBC) is commonly thought to assume that attributes are independent given the class, but this is apparently contradicted by the surprisingly good performance it exhibits in many domains that contain clear attribute dependences. No explanation for this has been proposed so far. In this paper we show that the SBC does not in fact assume attribute independence, and can be optimal even when this assumption is violated by a wide margin. The key to this finding lies in the distinction between classification and probability estimation: correct classification can be achieved even when the probability estimates used contain large errors. We show that the previously-assumed region of optimality of the SBC is a second-order infinitesimal fraction of the actual one. This is followed by the derivation of several necessary and several sufficient conditions for the optimality of the SBC. For example, the SBC is optimal for learning arbitrary conjunctions and disjunctions, even though they violate the independence assumption. The paper also reports empirical evidence of the SBC's competitive performance in domains containing substantial degrees of attribute dependence.
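The abstract's central claim, correct classification despite inaccurate probability estimates, can be illustrated with a small sketch. This is not code from the paper; the disjunction concept and the counting scheme below are illustrative choices:

```python
from itertools import product

# Target concept: the disjunction y = x1 OR x2 OR x3. The attributes are
# clearly dependent given the class, yet naive Bayes classifies perfectly.
data = [(bits, int(any(bits))) for bits in product([0, 1], repeat=3)]

def nb_score(bits, c):
    # class prior times the product of per-attribute frequencies (no smoothing)
    members = [b for b, y in data if y == c]
    score = len(members) / len(data)
    for j, v in enumerate(bits):
        score *= sum(m[j] == v for m in members) / len(members)
    return score

preds = [max((0, 1), key=lambda c: nb_score(bits, c)) for bits, _ in data]
accuracy = sum(p == y for p, (_, y) in zip(preds, data)) / len(data)
print(accuracy)  # 1.0 -- every example classified correctly

# ...even though the probability estimate for (0,0,0) is far off: the true
# P(class 0 | E) is 1.0, but the normalized naive estimate is only about 0.64.
s0, s1 = nb_score((0, 0, 0), 0), nb_score((0, 0, 0), 1)
print(round(s0 / (s0 + s1), 2))  # 0.64
```

The point matches the paper's distinction: the ranking of the two class scores is correct on every example even though the estimated posteriors are badly miscalibrated.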

798 citations


Proceedings Article
04 Aug 1996
TL;DR: The naive Bayesian classifier offers several advantages over other learning algorithms on this task, and an initial portion of a web page is sufficient for predicting its interestingness, substantially reducing the amount of network transmission required to make predictions.
Abstract: We describe Syskill & Webert, a software agent that learns to rate pages on the World Wide Web (WWW), deciding what pages might interest a user. The user rates explored pages on a three point scale, and Syskill & Webert learns a user profile by analyzing the information on each page. The user profile can be used in two ways. First, it can be used to suggest which links a user would be interested in exploring. Second, it can be used to construct a LYCOS query to find pages that would interest a user. We compare six different algorithms from machine learning and information retrieval on this task. We find that the naive Bayesian classifier offers several advantages over other learning algorithms on this task. Furthermore, we find that an initial portion of a web page is sufficient for making predictions on its interestingness, substantially reducing the amount of network transmission required to make predictions.

756 citations


Journal ArticleDOI
TL;DR: It is empirically shown that it is possible to learn descriptions that make less correlated errors in domains in which many ties in the search evaluation measure (e.g. information gain) are experienced during learning.
Abstract: Learning multiple descriptions for each class in the data has been shown to reduce generalization error but the amount of error reduction varies greatly from domain to domain. This paper presents a novel empirical analysis that helps to understand this variation. Our hypothesis is that the amount of error reduction is linked to the "degree to which the descriptions for a class make errors in a correlated manner." We present a precise and novel definition for this notion and use twenty-nine data sets to show that the amount of observed error reduction is negatively correlated with the degree to which the descriptions make errors in a correlated manner. We empirically show that it is possible to learn descriptions that make less correlated errors in domains in which many ties in the search evaluation measure (e.g. information gain) are experienced during learning. The paper also presents results that help to understand when and why multiple descriptions are a help (irrelevant attributes) and when they are not as much help (large amounts of class noise).
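The link between correlated errors and the benefit of combining descriptions is easy to demonstrate in a toy setting. The sketch below is illustrative only; the measure of overlap and the vote scheme are simple stand-ins, not the paper's precise definition:

```python
# Six test cases, all with true label 1. Each of three models predicts
# wrongly (0) on exactly two cases, so every model is 4/6 accurate alone.
truth = [1] * 6
uncorrelated = [[0, 0, 1, 1, 1, 1],    # the models err on disjoint cases
                [1, 1, 0, 0, 1, 1],
                [1, 1, 1, 1, 0, 0]]
correlated = [[0, 0, 1, 1, 1, 1]] * 3  # all models err on the same cases

def vote_accuracy(models):
    # simple majority vote over the three models' predictions
    votes = [1 if sum(col) >= 2 else 0 for col in zip(*models)]
    return sum(v == t for v, t in zip(votes, truth)) / len(truth)

def error_overlap(m1, m2):
    # fraction of the cases missed by either model that both models miss
    e1 = [p != t for p, t in zip(m1, truth)]
    e2 = [p != t for p, t in zip(m2, truth)]
    both = sum(a and b for a, b in zip(e1, e2))
    either = sum(a or b for a, b in zip(e1, e2))
    return both / either

print(error_overlap(*uncorrelated[:2]), vote_accuracy(uncorrelated))  # 0.0 1.0
print(error_overlap(*correlated[:2]), vote_accuracy(correlated))  # overlap 1.0, vote stays at 4/6
```

With disjoint errors the vote fixes every mistake; with perfectly correlated errors the combination gains nothing, which is the negative relationship the abstract describes.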

260 citations


Book ChapterDOI
01 Jan 1996
TL;DR: It is shown that the backward sequential elimination and joining algorithm provides the most improvement over the naive Bayesian classifier and that the violations of the independence assumption that affect the accuracy of the classifier can be detected from training data.
Abstract: Naive Bayesian classifiers which make independence assumptions perform remarkably well on some data sets but poorly on others. We explore ways to improve the Bayesian classifier by searching for dependencies among attributes. We propose and evaluate two algorithms for detecting dependencies among attributes and show that the backward sequential elimination and joining algorithm provides the most improvement over the naive Bayesian classifier. The domains on which the most improvement occurs are those domains on which the naive Bayesian classifier is significantly less accurate than a decision tree learner. This suggests that the attributes used in some common databases are not independent conditioned on the class and that the violations of the independence assumption that affect the accuracy of the classifier can be detected from training data.
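The effect of joining dependent attributes is easiest to see on a concept where independence fails completely, XOR. This is an illustrative sketch, not the paper's backward sequential elimination and joining algorithm, which searches over candidate joins and deletions using accuracy estimates:

```python
# y = x1 XOR x2: each attribute alone is independent of the class, so a
# naive Bayes model treating x1 and x2 separately cannot beat chance.
train = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

def nb_predict(example, view):
    # `view` maps an example to its tuple of (possibly joined) attributes
    best, best_score = 0, -1.0
    for c in (0, 1):
        members = [view(x) for x, y in train if y == c]
        score = len(members) / len(train)  # class prior
        for j, v in enumerate(view(example)):
            score *= sum(m[j] == v for m in members) / len(members)
        if score > best_score:
            best, best_score = c, score
    return best

def accuracy(view):
    return sum(nb_predict(x, view) == y for x, y in train) / len(train)

separate = lambda x: x    # x1 and x2 as two independent attributes
joined = lambda x: (x,)   # the pair (x1, x2) as a single joined attribute

print(accuracy(separate))  # 0.5
print(accuracy(joined))    # 1.0
```

Joining the pair turns the dependence the independent model cannot represent into a single attribute whose value distribution determines the class exactly.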

212 citations


Proceedings ArticleDOI
18 Apr 1996
TL;DR: This work proposes an innovative World Wide Web agent that uses a model of collaboration that leverages individual users' natural incentives, so that they contribute to collaborative work with little extra effort.
Abstract: Social filtering and collaborative resource discovery mechanisms often fail because of the extra burden, however small, placed on the user. This work proposes an innovative World Wide Web agent that uses a model of collaboration that leverages individual users' natural incentives, so that they contribute to collaborative work with little extra effort.

54 citations


Proceedings Article
03 Dec 1996
TL;DR: An evaluation of the new approach, PCR*, based on principal components regression, reveals that it was the most robust combination method as the redundancy of the learned models increased, and that redundancy could be handled without eliminating any of the learned models.
Abstract: When combining a set of learned models to form an improved estimator, the issue of redundancy or multicollinearity in the set of models must be addressed. A progression of existing approaches and their limitations with respect to the redundancy is discussed. A new approach, PCR*, based on principal components regression is proposed to address these limitations. An evaluation of the new approach on a collection of domains reveals that: 1) PCR* was the most robust combination method as the redundancy of the learned models increased, 2) redundancy could be handled without eliminating any of the learned models, and 3) the principal components of the learned models provided a continuum of "regularized" weights from which PCR* could choose.
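The core idea, regressing the target on principal components of the models' predictions rather than on the nearly collinear predictions themselves, can be sketched as follows. This illustrates principal components regression generally, not the paper's exact PCR* procedure, which also chooses the number of components from the data:

```python
import numpy as np

x = np.linspace(0.0, 1.0, 40)
y = np.sin(2 * np.pi * x)      # target to be estimated
bias = 0.2 * np.cos(7 * x)

# Three learned models: two are nearly identical (redundant), the third
# is biased in the opposite direction, so a good combination exists.
P = np.column_stack([y + bias, y + bias + 1e-8, y - bias])

# Principal components regression: center, project onto the top-k
# components, regress the target on the scores, map weights back.
P_mean, y_mean = P.mean(axis=0), y.mean()
Pc = P - P_mean
_, _, Vt = np.linalg.svd(Pc, full_matrices=False)
k = 2
Z = Pc @ Vt[:k].T
beta, *_ = np.linalg.lstsq(Z, y - y_mean, rcond=None)
w = Vt[:k].T @ beta            # weights on the original models
y_hat = y_mean + Pc @ w

def mse(a, b):
    return float(np.mean((a - b) ** 2))

print(mse(y_hat, y) < min(mse(P[:, j], y) for j in range(3)))  # True
```

Ordinary least squares on the raw prediction matrix would be ill-conditioned because of the duplicated model; dropping the near-zero principal component handles the redundancy without discarding either model, which is the behavior the abstract highlights.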

45 citations


01 Jan 1996
TL;DR: This dissertation presents methods for increasing the accuracy of probabilistic classification rules learned from noisy, relational data and presents the system HYDRA, which implements the one-per-class approach.
Abstract: This dissertation presents methods for increasing the accuracy of probabilistic classification rules learned from noisy, relational data. It addresses the problem of learning probabilistic rules in noisy, "real-world" data sets, the problem of "small disjuncts" in which rules that apply to rare subclasses have high error rates, and the problems that arise in domains in which the learning algorithm is forced to pick from many rules that appear to be equally good. It is shown that learning a class description for each class in the data--the one-per-class approach--and attaching probabilistic estimates to the learned rules allows accurate classifications to be made on real-world data sets. The thesis presents the system HYDRA which implements this approach. It is shown that the resulting classifications are often more accurate than those made by three major methods for learning from noisy, relational data. Furthermore, the learned rules are relational and so are more expressive than the attribute-value rules learned by most induction systems. Several results are also presented in the arena of multiple models. The multiple models approach is relevant to the problem of making accurate classifications in "real-world" domains since it facilitates evidence combination which is needed to accurately learn on such domains. The most important result of the multiple models research is that the amount of error reduction afforded by the multiple models approach is linearly correlated with the degree to which the individual models make errors in an uncorrelated manner. It is shown that it is possible to learn models that make less correlated errors in domains in which there are many gain ties. The third major result of the research on multiple models is the realization that models should be learned that make errors in a negatively-correlated manner rather than those that make errors in an uncorrelated manner. 
Finally, results are presented on the small-disjuncts problem in which rules that apply to rare subclasses have high error rates. It is shown that the one-per-class approach reduces error rates for such rare rules while not sacrificing the error rates of the other rules.

27 citations


01 Jan 1996
TL;DR: This work focuses on an extension to Syskill & Webert that lets a user provide the system with an initial profile of his interests in order to increase classification accuracy without seeing many rated pages, and finds that a user-defined profile can significantly increase the classification accuracy.
Abstract: We describe Syskill & Webert, a software agent that learns to rate pages on the World Wide Web (WWW), deciding what pages might interest a user. The user rates explored pages on a three point scale, and Syskill & Webert learns a user profile by analyzing the information on each page. We focus on an extension to Syskill & Webert that lets a user provide the system with an initial profile of his interests in order to increase the classification accuracy without seeing many rated pages. We represent this user profile in a probabilistic way, which allows us to revise the profile as more training data becomes available using "conjugate priors", a common technique from Bayesian statistics for probability revision. Unseen pages are classified using a simple Bayesian classifier that uses the revised probabilities. We compare our approach to learning algorithms that do not make use of such background knowledge, and find that a user-defined profile can significantly increase the classification accuracy.
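The conjugate-prior revision the abstract mentions can be sketched with a Beta-Bernoulli update. The word, counts, and prior strength below are made-up illustrative numbers, not Syskill & Webert internals:

```python
# Prior belief from the user-defined profile: a given word appears in an
# interesting page with probability 0.75, held with strength a + b = 4.
a, b = 3.0, 1.0   # Beta(a, b); prior mean a / (a + b) = 0.75

# Training data arrives: the word occurred in 2 of 10 pages rated "hot".
present, absent = 2, 8

# Conjugate update: the posterior is again a Beta distribution, so the
# profile is revised by simply adding observed counts to the parameters.
a_post, b_post = a + present, b + absent
posterior_mean = a_post / (a_post + b_post)
print(posterior_mean)  # 5/14, pulled from the prior 0.75 toward the data
```

Because the posterior stays in the Beta family, the revised probabilities can feed straight back into the simple Bayesian classifier after every batch of rated pages.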

25 citations


01 Jan 1996
TL;DR: The Do-I-Care agent uses machine learning to detect "interesting" changes to Web pages previously found to be relevant, and such agents can be used collaboratively by cascading them and by propagating interesting findings to other users' agents.
Abstract: We describe the Do-I-Care agent, which uses machine learning to detect "interesting" changes to Web pages previously found to be relevant. Because this agent focuses on changes to known pages rather than discovering new pages, we increase the likelihood that the information found will be interesting. The agent’s accuracy in finding interesting changes and in learning is improved by exploiting regularities in how pages are changed. Additionally, these agents can be used collaboratively by cascading them and by propagating interesting findings to other users’ agents.

13 citations


01 Jan 1996
TL;DR: Syskill & Webert, a software agent that learns to rate pages on the World Wide Web, deciding what pages might interest a user is described, and the naive Bayesian classifier of this agent has several advantages over other learning algorithms on this task.
Abstract: We describe Syskill & Webert, a software agent that learns to rate pages on the World Wide Web (WWW), deciding what pages might interest a user. The user rates explored pages on a three point scale, and Syskill & Webert learns a user profile by analyzing the information on each page. The user profile can be used in two ways. First, it can be used to suggest which links a user would be interested in exploring. Second, it can be used to construct a LYCOS query to find pages that would interest a user. We compare six different algorithms from machine learning and information retrieval on this task. We find that the naive Bayesian classifier offers several advantages over other learning algorithms on this task. Furthermore, we find that an initial portion of a web page is sufficient for making predictions on its interestingness, substantially reducing the amount of network transmission required to make predictions.

12 citations


Book ChapterDOI
01 Jan 1996
TL;DR: The authors' approximation to the posterior probability of a rule-set model that is comprised of a set of class descriptions yields significant improvements in accuracy as measured on four relational data sets and four attribute-value data sets from the UCI repository.
Abstract: We present a way of approximating the posterior probability of a rule-set model that is comprised of a set of class descriptions. Each class description, in turn, consists of a set of relational rules. The ability to compute this posterior and to learn many models from the same training set allows us to approximate the expectation that an example to be classified belongs to some class. The example is assigned to the class maximizing the expectation. By assuming a uniform prior distribution of models, the posterior of the model does not depend on the structure of the model: it only depends on how the training examples are partitioned by the rules of the rule-set model. This uniform distribution assumption allows us to compute the posterior for models containing relational and recursive rules. Our approximation to the posterior probability yields significant improvements in accuracy as measured on four relational data sets and four attribute-value data sets from the UCI repository. We also provide evidence that learning multiple models helps most in data sets in which there are many, apparently equally good rules to learn.

Book ChapterDOI
01 Jan 1996
TL;DR: Experimental results showing that the semantic hierarchies generated by the method yield learned translation rules with higher average accuracy are reported.
Abstract: This paper addresses the problem of constructing a semantic hierarchy for a machine translation system. We propose two methods of constructing a hierarchy: acquiring a hierarchy from scratch and updating a hierarchy. When acquiring a hierarchy from scratch, translation rules are learned by an inductive learning algorithm in the first step. A new hierarchy is then generated by applying a clustering method to internal disjunctions of the learned rules and new rules are learned under the bias of this hierarchy. When updating an existing manually-constructed hierarchy, we take advantage of its node structure. We report experimental results showing that the semantic hierarchies generated by our method yield learned translation rules with higher average accuracy.

01 Jan 1996
TL;DR: It is shown that human subjects make fewer errors and learn more rapidly when the set of concepts is logically consistent, and the experiments illustrate the importance of learning the relevance of combinations of features, rather than individual features.
Abstract: We investigate learning a set of causally related concepts from examples. We show that human subjects make fewer errors and learn more rapidly when the set of concepts is logically consistent. We compare the results of these subjects to subjects learning equivalent concepts that share sets of relevant features, but are not logically consistent. We present a shared-task neural network model simulation of the psychological experimentation.

Introduction. Researchers have investigated how the relevant background knowledge of the learner influences the speed or accuracy of concept learning (e.g., Murphy & Medin 1985, Nakamura 1985, Pazzani 1991, Wattenmaker et al. 1986). However, the psychological investigation to date has only explored problems where subjects learn a single concept and the relevant background knowledge is either brought to the experiment by the subject or given in written instructions. In contrast, research in machine learning has addressed issues that occur when learning a set of related concepts. For example, relevant background concepts might be learned inductively from examples before learning concepts that depend upon this knowledge (Pazzani 1990). Here, we report on two experiments in which subjects induce the relevant background knowledge from examples and use this background knowledge to facilitate later learning. The experiments illustrate the importance of learning the relevance of combinations of features, rather than individual features. We model this experiment with shared-task neural networks (Caruana, 1993). In both experiments, subjects were divided into two groups. One group, the "feature consistency" group, learned a complex concept that shared relevant features with previously learned related concepts, but was not logically consistent with those concepts. The other group, the "logical consistency" group, learned a complex concept that was logically consistent with previously learned related concepts.

Initial Psychological Experimentation. In the first experiment, subjects were asked to imagine that they work for the US Forest Service and were assigned the task of learning to predict years in which there is a severe risk of forest fire danger in the fall. Four concepts had to be learned in the experiment, one in each of four phases. All subjects learned the same three background concepts in phases 1-3. Then, for phase 4, they were divided into the two groups to learn one of two separate concepts which depended on the background concepts. The first phase was designed to minimize the effects of the subjects' domain-specific preexisting theories by having every subject learn the same concept: subjects had to learn when there is a severe risk of forest fires in the fall, given data on rain in the spring and summer. An example of these data is shown in Figure 1. Subjects were given data that indicated that there is a severe risk of forest fires in the fall only when there is both a wet spring and a dry summer, a rule consistent with the knowledge of most people who live in Southern California. In the remaining phases, when we measure the learning rate and number of errors made by subjects, novel stimuli are used as features to ensure that the knowledge was acquired during the experiment.

Next, the subjects were told that the US Forest Service needs to do advance planning, so it cannot wait until the end of summer to predict when there will be a severe risk of fire in the fall. The subjects again examined data from several years. This time, however, the data came from five simulated scientific instruments that are used each January to detect the presence of factors that may be useful in predicting the amount of rain. When one of the instruments detects the presence of a particular factor, it displays a distinctive graph, as shown in Figure 2; otherwise, a bar is shown to mark the absence of the instrument's graph (see Instrument 3 of Figure 2). Each instrument displays a graph whose shape differs from that of the other instruments.

In the second concept learning problem, subjects had to learn to predict from the instrument readings when there would be a rainy spring. All subjects were given data that indicated there would be a wet spring when one particular instrument showed a distinctive graph, and all learned a rule of the form "There will be a wet spring when Instrument-A displays a graph," with Instrument-A selected randomly. This concept serves as background knowledge for learning the fourth concept. In the third concept learning problem, subjects learned another piece of background knowledge: predicting from the instrument readings when there would be a dry summer. All subjects were shown data derived from the rule "There will be a dry summer when Instrument-B or Instrument-C displays a graph."

Multiple Concept Learning. In the fourth, and final, concept learning problem, subjects had to learn to predict from the instrument readings when there would be a severe risk of fire in the fall. Concepts 1-3 served as background knowledge for this concept. Subjects in the logical consistency group were given data that indicated there would be a severe risk of fire when Instrument-A displayed a graph and when either Instrument-B or Instrument-C (or both) displayed a graph, i.e., A ∧ (B ∨ C). This concept is logically consistent with the first three concepts that were learned. Subjects in the feature consistency group were given data that indicated there would be a severe risk of fire when Instrument-C displayed a graph and when either Instrument-B or Instrument-A (or both) displayed a graph, i.e., C ∧ (B ∨ A). Although not consistent with the concepts that were learned, this concept shares relevant features with the logical consistency concept.

Subjects. The subjects were 18 male and female undergraduates attending the University of California, Irvine, who participated in this experiment to receive extra credit in an introductory psychology course.

Stimuli. The stimuli consisted of data that were displayed on a computer monitor. In the first concept, since there are two two-valued features, 4 distinct stimuli were constructed. In the remaining three concepts, there were 32 distinct stimuli, since there are five two-valued features. The stimuli were presented in a random order for each subject.

Procedures. Each subject was shown data on the computer from a single year and asked to make a prediction (e.g., whether there would be a severe risk of fire in the fall) by clicking on a circle next to the word Yes or a circle next to the word No. Next, the subject clicked on a box labeled Check Answer. While still displaying the data, the computer indicated to the subject whether his answer was correct. If the subject's answer was correct, the subject could click on a box labeled Continue and data from another year was shown; otherwise, he selected a different answer and clicked on Check Answer again. This process was repeated until the subjects performed at a level that ensured they had learned an accurate approximation to the concept (making no more than one error in any sequence of 24 consecutive trials). The subjects were allowed as much time as they wanted to make their prediction and to view the data after the correct answer was shown. This process of learning a concept to criterion was repeated for each of the four concepts learned. We recorded the number of the last trial on which the subject made an error, the total number of errors made by the subject for each concept, and the number o

Results. Subjects in the logical consistency group required an average of 27.6 trials to learn the fourth concept, while subjects in the feature consistency group required an average of 50.4 trials, t(16) = 1.91, p < .05. Subjects in the logical consistency group made an average of 6.8 errors, while subjects in the feature consistency group made an average of 14.0 errors, t(16) = 2.135, p < .05.

Experiment 2. In Experiment 1, subjects accurately induced three relevant background concepts prior to learning a single concept which depended upon those concepts. This order is the ideal order for subjects to first acquire knowledge inductively and then use that knowledge in future learning; however, the natural world does not have a benevolent teacher who orders experiences for the learner. To more closely simulate the natural world, in the second experiment those concepts that had the same stimuli in the first experiment (the last three concepts) were learned at the same time. For each presentation of stimuli, subjects predicted whether there would be a rainy spring, a dry summer, and a severe risk of fire in the fall (see Figure 3). With this exception, Experiment 2 was identical to Experiment 1. For the second learning phase, subjects had to click on all three boxes correctly before proceeding to the next stimuli. We recorded the number of the last trial on which the subject made an error and the total number of errors made by the subject, only for the concept that involved predicting whether there would be a severe risk of fire in the fall from the instrument data. In addition, for this concept, we also recorded the number of errors made by the subject on blocks of 16 trials. If the subject did not obtain the correct answer after 128 trials, we recorded that the last error was made on trial 128.

Journal ArticleDOI
TL;DR: Inductive Logic Programming: Techniques and Applications is appropriate as an introductory graduate text that contains sufficient background material to gently introduce someone to the field, and it provides detailed descriptions of recent research contributions toThe field.
Abstract: Inductive Logic Programming (ILP) is an important and active subfield of machine learning. Unlike most of machine learning, ILP is concerned with learning first-order (or relational) rules, a representation that is more expressive than the attribute-value representation typically used by decision trees and neural networks. Due to the recent emergence of this field, most of the important work is scattered across a number of conference papers, journal papers, and theses. Most papers use slightly different notation, terminology, and definitions, which results in unnecessary confusion to newcomers to this research area. Inductive Logic Programming: Techniques and Applications is the first text that attempts to present an overview of the field, and it performs admirably at this task. While there are other books on ILP, some are edited volumes that do not provide extensive introductory material (e.g., Muggleton, 1992) and others (e.g., De Raedt, 1992; Morik, Wrobel, Kietz, & Emde, 1993) provide detailed reports on a single research project. Inductive Logic Programming: Techniques and Applications is appropriate as an introductory graduate text. It contains sufficient background material to gently introduce someone to the field, and it provides detailed descriptions of recent research contributions to the field. Since many ILP systems have sound theoretical foundations, the book contains a number of definitions and introduces some new notation. The definitions are illustrated with numerous examples that help to make the concepts concrete. The book does not suffer from a flaw I have seen in some theoretical treatments. There is no formalism for its own sake in this book. The definitions and notations introduced are used later in the book and help to clarify important concepts.

Journal ArticleDOI
TL;DR: The Nynex Max expert system analyzes the result of an automated electric test on a telephone line and determines the type of problem; however, tuning the system's parameter values can be difficult.
Abstract: The Nynex Max expert system analyzes the result of an automated electric test on a telephone line and determines the type of problem. However, tuning the system's parameter values can be difficult. The Opti-Max system can automatically set these parameters by analyzing decisions made by experts who troubleshoot problems.

Proceedings Article
04 Aug 1996
TL;DR: Bayes' theorem tells us how to optimally predict the class of a previously unseen example, given a training sample: choose the class that maximizes the conditional probability of the class given the example.
Abstract: Bayes' theorem tells us how to optimally predict the class of a previously unseen example, given a training sample. The chosen class should be the one which maximizes P(C_i | E) = P(C_i) P(E | C_i) / P(E), where C_i is the ith class, E is the test example, P(Y | X) denotes the conditional probability of Y given X, and probabilities are estimated from the training sample. Let an example be a vector of a attributes. If the attributes are independent given the class, P(E | C_i) can be decomposed into the product P(v_1 | C_i) · ... · P(v_a | C_i), where v_j is the value of the jth attribute in the example E.
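The decomposition in the abstract translates directly into code. This is a minimal sketch with made-up weather data; practical implementations add smoothing so that unseen attribute values do not zero out a class score:

```python
# Toy training sample: (outlook, windy) -> play
train = [(("sunny", 0), "yes"), (("sunny", 1), "no"),
         (("rain", 0), "yes"), (("rain", 1), "no"),
         (("sunny", 0), "yes")]

def posterior_score(example, c):
    # P(C_i) * product over j of P(v_j | C_i); P(E) is omitted because it
    # is the same for every class and cannot change the argmax.
    members = [x for x, y in train if y == c]
    score = len(members) / len(train)
    for j, v in enumerate(example):
        score *= sum(m[j] == v for m in members) / len(members)
    return score

classes = ("yes", "no")
prediction = max(classes, key=lambda c: posterior_score(("sunny", 0), c))
print(prediction)  # prints yes
```

Each factor is just a frequency counted from the training sample, which is why the naive Bayesian classifier is so cheap to train: one pass over the data collects every count the formula needs.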