Author

Huajie Zhang

Other affiliations: University of New Brunswick
Bio: Huajie Zhang is an academic researcher from the University of Western Ontario. The author has contributed to research in the topics of Naive Bayes classifiers and Bayesian networks, has an h-index of 8, and has co-authored 9 publications receiving 174 citations. Previous affiliations of Huajie Zhang include the University of New Brunswick.

Papers
Book Chapter
16 Apr 2001
TL;DR: This work extends the Naive Bayes classifier to allow certain dependency relations among attributes; the resulting algorithm is more efficient and produces simpler dependency relations for better comprehensibility, while maintaining very similar predictive accuracy.

Abstract: Data mining applications require learning algorithms to have high predictive accuracy, scale up to large datasets, and produce comprehensible outcomes. The Naive Bayes classifier has received extensive attention due to its efficiency, reasonable predictive accuracy, and simplicity. However, its assumption of attribute independence given the class is often violated, producing incorrect probability estimates that can affect the success of data mining applications. We extend the Naive Bayes classifier to allow certain dependency relations among attributes. Compared to previous extensions of Naive Bayes, our algorithm is more efficient (more so in problems with a large number of attributes), and it produces simpler dependency relations for better comprehensibility, while maintaining very similar predictive accuracy.
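To make "allowing certain dependency relations among attributes" concrete, here is a minimal counting-based sketch of an augmented Naive Bayes in which each attribute may optionally take one other attribute as an extra parent besides the class. This is an illustrative assumption for exposition, not the paper's algorithm; the function and parameter names (fit_augmented_nb, parents, n_values) are hypothetical.

```python
import numpy as np
from collections import defaultdict

def fit_augmented_nb(X, y, parents, alpha=1.0):
    """X: (n, d) integer-coded attributes; y: (n,) class labels.
    parents[i] is the index of attribute i's extra parent, or None (plain Naive Bayes)."""
    n, d = X.shape
    classes = np.unique(y)
    # class priors with Laplace smoothing
    prior = {c: (np.sum(y == c) + alpha) / (n + alpha * len(classes)) for c in classes}
    tables = []  # one raw count table per attribute, keyed by (class, [parent value,] value)
    for i in range(d):
        counts = defaultdict(float)
        for row, c in zip(X, y):
            key = (c, row[parents[i]]) if parents[i] is not None else (c,)
            counts[key + (row[i],)] += 1.0
        tables.append(counts)
    return classes, prior, tables

def predict_one(x, classes, prior, tables, parents, alpha=1.0, n_values=2):
    """Return the class maximizing log p(c) + sum_i log p(x_i | c, x_parent(i))."""
    scores = {}
    for c in classes:
        logp = np.log(prior[c])
        for i, counts in enumerate(tables):
            key = (c, x[parents[i]]) if parents[i] is not None else (c,)
            num = counts[key + (x[i],)] + alpha
            den = sum(counts[key + (v,)] for v in range(n_values)) + alpha * n_values
            logp += np.log(num / den)
        scores[c] = logp
    return max(scores, key=scores.get)
```

Restricting each attribute to at most one extra parent (as in TAN-style models) keeps both training cost and the learned structure small, which is the efficiency/comprehensibility trade-off the abstract refers to.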

38 citations

Journal Article
TL;DR: This paper establishes an association between the structural complexity of Bayesian networks and their representational power, using the maximum number of parents per node as the measure of structural complexity and the maximum XOR contained in a target function as the measure of function complexity.

Abstract: One of the most important fundamental properties of Bayesian networks is their representational power, which reflects what kinds of functions they can or cannot represent. In this paper, we establish an association between the structural complexity of Bayesian networks and their representational power. We use the maximum number of parents per node as the measure of the structural complexity of a Bayesian network, and the maximum XOR contained in a target function as the measure of function complexity. A representational upper bound is established and proved. Roughly speaking, discrete Bayesian networks in which each node has at most k parents cannot represent any function containing a (k+1)-XOR. Our theoretical results help us to gain a deeper understanding of the capabilities and limitations of Bayesian networks.
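As an empirical illustration of the k = 1 case (each attribute's only parent is the class, i.e. Naive Bayes), the sketch below fits scikit-learn's BernoulliNB to a 2-XOR target and shows that accuracy stays near chance. The dataset construction and random seed are assumptions made only for this demo.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(4000, 2))       # two binary attributes, uniformly sampled
y = X[:, 0] ^ X[:, 1]                        # target is their XOR

clf = BernoulliNB().fit(X, y)
print("training accuracy:", clf.score(X, y))  # stays near 0.5: 2-XOR is not representable
```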

34 citations

Book Chapter
06 May 2002
TL;DR: AUC provides a more discriminating evaluation of ranking and probability estimation than accuracy does, and classifiers constructed to maximise the AUC score produce not only higher AUC values but also higher classification accuracies.

Abstract: In most data mining applications, accurate ranking and probability estimation are essential. However, many traditional classifiers aim only at high classification accuracy (or a low error rate), even though they also produce probability estimates. Does high predictive accuracy imply better ranking and probability estimation? Is there a better evaluation method for such classifiers than classification accuracy, for the purpose of data mining applications? We argue that the answer is the area under the ROC (Receiver Operating Characteristics) curve, or simply AUC. We show that AUC provides a more discriminating evaluation of ranking and probability estimation than accuracy does. Further, we show that classifiers constructed to maximise the AUC score produce not only higher AUC values, but also higher classification accuracies. Our results are based on an experimental comparison between error-based and AUC-based learning algorithms for TAN (Tree-Augmented Naive Bayes).
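A small, purely illustrative example (not taken from the paper) of why AUC discriminates where accuracy cannot: two sets of probability estimates that make exactly the same mistakes at the 0.5 threshold, and therefore have identical accuracy, can still differ in how well they rank positives above negatives.

```python
from sklearn.metrics import accuracy_score, roc_auc_score

y_true = [0, 0, 0, 0, 1, 1, 1, 1]
p_a    = [0.1, 0.2, 0.3, 0.6, 0.4, 0.7, 0.8, 0.9]   # only one negative (0.6) outranks a positive (0.4)
p_b    = [0.1, 0.2, 0.3, 0.9, 0.4, 0.7, 0.8, 0.6]   # same thresholded errors, noticeably worse ranking

for name, p in [("A", p_a), ("B", p_b)]:
    preds = [int(pi >= 0.5) for pi in p]
    print(name, "accuracy:", accuracy_score(y_true, preds),
          "AUC:", round(roc_auc_score(y_true, p), 3))
```

Both classifiers score 0.75 accuracy, yet A's AUC is about 0.94 while B's is 0.75, which is the kind of ranking difference the abstract argues accuracy alone cannot see.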

30 citations

Book Chapter
TL;DR: This work gives necessary and sufficient conditions for linearly separable functions in the binary domain to be learnable by Naive Bayes under uniform representation, and shows that the learnability (and error rates) of Naive Bayes can be affected dramatically by sampling distributions.

Abstract: Naive Bayes is an efficient and effective learning algorithm, but previous results show that its representational ability is severely limited, since it can represent only certain linearly separable functions in the binary domain. We give necessary and sufficient conditions for linearly separable functions in the binary domain to be learnable by Naive Bayes under uniform representation. We then show that the learnability (and error rates) of Naive Bayes can be affected dramatically by sampling distributions. Our results help us to gain a much deeper understanding of this seemingly simple, yet powerful, learning algorithm.
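The link between Naive Bayes and linear separability can be seen directly from its log-odds over binary attributes. The short derivation below is a standard identity, included only to make the abstract's premise explicit; the symbols p_i^+ and p_i^- are introduced here for notation.

```latex
% Naive Bayes over binary attributes x_i in {0,1} is a linear threshold function.
\[
\log\frac{p(+\mid x_1,\dots,x_n)}{p(-\mid x_1,\dots,x_n)}
  = \log\frac{p(+)}{p(-)} + \sum_{i=1}^{n}\log\frac{p(x_i\mid +)}{p(x_i\mid -)}
  = b + \sum_{i=1}^{n} w_i x_i,
\]
\[
w_i = \log\frac{p_i^{+}\,(1-p_i^{-})}{p_i^{-}\,(1-p_i^{+})},\qquad
b = \log\frac{p(+)}{p(-)} + \sum_{i=1}^{n}\log\frac{1-p_i^{+}}{1-p_i^{-}},\qquad
p_i^{\pm} = p(x_i = 1 \mid \pm).
\]
```

So Naive Bayes predicts + exactly when b + Σ_i w_i x_i ≥ 0, i.e. it can realize only linearly separable functions of the binary attributes, which is why the learnability question above is posed for that class of functions.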

22 citations

Proceedings Article
03 Jan 2001
TL;DR: A data-mining approach is proposed that produces generalized query patterns, or templates, from the raw user logs of a popular commercial knowledge-based search engine that is currently in use; simulations show that such templates can improve the search engine's speed and precision.

Abstract: User logs of a popular search engine keep track of user activities, including user queries, user click-through on the returned lists, and user browsing behaviors. Knowledge about user queries discovered from user logs can improve the performance of the search engine. We propose a data-mining approach that produces generalized query patterns, or templates, from the raw user logs of a popular commercial knowledge-based search engine that is currently in use. Our simulation shows that such templates can improve the search engine's speed and precision, and can cover queries not asked previously. The templates are also comprehensible, so web editors can easily discover the topics in which most users are interested.
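As a toy illustration of what a generalized query pattern looks like, the sketch below collapses queries that differ in a single token into a template with a slot. The merging rule and the <TERM> placeholder are assumptions made for exposition and are not the paper's mining algorithm.

```python
from itertools import combinations

def generalize(queries):
    """Merge pairs of equal-length queries that differ in exactly one token."""
    templates = set()
    for a, b in combinations(queries, 2):
        ta, tb = a.split(), b.split()
        if len(ta) == len(tb):
            diffs = [i for i, (x, y) in enumerate(zip(ta, tb)) if x != y]
            if len(diffs) == 1:
                merged = list(ta)
                merged[diffs[0]] = "<TERM>"      # replace the varying token with a slot
                templates.add(" ".join(merged))
    return templates

logs = ["cheap flights to paris", "cheap flights to tokyo", "how to reset password"]
print(generalize(logs))   # {'cheap flights to <TERM>'}
```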

15 citations


Cited by
Proceedings Article
01 Jan 2004
TL;DR: A sufficient condition for the optimality of naive Bayes is presented and proved in a setting where dependences between attributes do exist, providing evidence that dependences among attributes may cancel each other out.

Abstract: Naive Bayes is one of the most efficient and effective inductive learning algorithms for machine learning and data mining. Its competitive performance in classification is surprising, because the conditional independence assumption on which it is based is rarely true in real-world applications. An open question is: what is the true reason for the surprisingly good performance of naive Bayes in classification? In this paper, we propose a novel explanation of the superb classification performance of naive Bayes. We show that, essentially, the dependence distribution plays a crucial role: how the local dependence of a node distributes in each class, evenly or unevenly, and how the local dependences of all nodes work together, consistently (supporting a certain classification) or inconsistently (canceling each other out). Therefore, no matter how strong the dependences among attributes are, naive Bayes can still be optimal if the dependences distribute evenly in the classes, or if the dependences cancel each other out. We propose and prove a sufficient and necessary condition for the optimality of naive Bayes. Further, we investigate the optimality of naive Bayes under the Gaussian distribution. We present and prove a sufficient condition for the optimality of naive Bayes in which dependences between attributes do exist. This provides evidence that dependences among attributes may cancel each other out. In addition, we explore when naive Bayes works well.

Naive Bayes and Augmented Naive Bayes. Classification is a fundamental issue in machine learning and data mining. In classification, the goal of a learning algorithm is to construct a classifier given a set of training examples with class labels. Typically, an example E is represented by a tuple of attribute values (x1, x2, ..., xn), where xi is the value of attribute Xi. Let C represent the classification variable, and let c be the value of C. In this paper, we assume that there are only two classes: + (the positive class) or − (the negative class). A classifier is a function that assigns a class label to an example. From the probability perspective, according to Bayes' rule, the probability of an example E = (x1, x2, ..., xn) being of class c is p(c|E) = p(E|c)p(c) / p(E). E is classified as the class C = + if and only if f_b(E) = p(C = +|E) / p(C = −|E) ≥ 1, where f_b(E) is called a Bayesian classifier. Assume that all attributes are independent given the value of the class variable; that is, p(E|c) = p(x1, x2, ..., xn|c) = ∏_{i=1}^{n} p(xi|c).
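The claim that dependences can cancel out or distribute evenly can be checked on a toy case. In the sketch below (an assumption-driven illustration, not the paper's proof), the second attribute simply duplicates the first, so the attribute dependence is as strong as possible yet identical in both classes, and Naive Bayes still agrees with the Bayes-optimal rule.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

rng = np.random.default_rng(1)
y = rng.integers(0, 2, size=20000)                               # balanced binary classes
x1 = (rng.random(20000) < np.where(y == 1, 0.8, 0.2)).astype(int)  # p(x1=1|+) = 0.8, p(x1=1|-) = 0.2
X = np.column_stack([x1, x1])                                    # x2 is a copy of x1: perfect dependence

nb = BernoulliNB().fit(X, y)
bayes_opt = X[:, 0]            # the Bayes-optimal rule here is simply "predict the class x1 points to"
print("agreement with Bayes-optimal rule:", (nb.predict(X) == bayes_opt).mean())  # ~1.0
```

The probability estimates Naive Bayes produces here are badly overconfident (it effectively counts the evidence from x1 twice), but the classification decisions are unchanged, which is exactly the distinction the paper draws between probability estimation and optimality of classification.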

1,536 citations

Journal Article
TL;DR: It is shown, theoretically and empirically, that AUC is a better measure (defined precisely) than accuracy, and well-established claims in machine learning that are based on accuracy are reevaluated using AUC, yielding interesting and surprising new results.
Abstract: The area under the ROC (receiver operating characteristics) curve, or simply AUC, has been traditionally used in medical diagnosis since the 1970s. It has recently been proposed as an alternative single-number measure for evaluating the predictive ability of learning algorithms. However, no formal arguments were given as to why AUC should be preferred over accuracy. We establish formal criteria for comparing two different measures for learning algorithms and we show theoretically and empirically that AUC is a better measure (defined precisely) than accuracy. We then reevaluate well-established claims in machine learning based on accuracy using AUC and obtain interesting and surprising new results. For example, it has been well-established and accepted that Naive Bayes and decision trees are very similar in predictive accuracy. We show, however, that Naive Bayes is significantly better than decision trees in AUC. The conclusions drawn in this paper may make a significant impact on machine learning and data mining applications.

1,528 citations

Journal Article
Xin Yang, Yifei Wang, Ryan Byrne, Gisbert Schneider, Shengyong Yang
TL;DR: The current state-of-the art of AI-assisted pharmaceutical discovery is discussed, including applications in structure- and ligand-based virtual screening, de novo drug design, physicochemical and pharmacokinetic property prediction, drug repurposing, and related aspects.
Abstract: Artificial intelligence (AI), and, in particular, deep learning as a subcategory of AI, provides opportunities for the discovery and development of innovative drugs. Various machine learning approaches have recently (re)emerged, some of which may be considered instances of domain-specific AI which have been successfully employed for drug discovery and design. This review provides a comprehensive portrayal of these machine learning techniques and of their applications in medicinal chemistry. After introducing the basic principles, alongside some application notes, of the various machine learning algorithms, the current state-of-the art of AI-assisted pharmaceutical discovery is discussed, including applications in structure- and ligand-based virtual screening, de novo drug design, physicochemical and pharmacokinetic property prediction, drug repurposing, and related aspects. Finally, several challenges and limitations of the current methods are summarized, with a view to potential future directions for AI-assisted drug discovery and design.

425 citations

Proceedings Article
09 Aug 2003
TL;DR: For the first time, it is formally proved that AUC is a better measure than accuracy for the evaluation of learning algorithms.

Abstract: Predictive accuracy has been used as the main, and often only, evaluation criterion for the predictive performance of classification learning algorithms. In recent years, the area under the ROC (Receiver Operating Characteristics) curve, or simply AUC, has been proposed as an alternative single-number measure for evaluating learning algorithms. In this paper, we prove that AUC is a better measure than accuracy. More specifically, we present rigorous definitions of consistency and discriminancy for comparing two evaluation measures for learning algorithms. We then present empirical evaluations and a formal proof to establish that AUC is indeed statistically consistent and more discriminating than accuracy. Our result is quite significant, since, for the first time, we formally prove that AUC is a better measure than accuracy in the evaluation of learning algorithms.

422 citations