Author

Huajie Zhang

Other affiliations: University of New Brunswick
Bio: Huajie Zhang is an academic researcher from the University of Western Ontario. The author has contributed to research in the topics of Naive Bayes classifiers and Bayesian networks, has an h-index of 8, and has co-authored 9 publications receiving 174 citations. Previous affiliations of Huajie Zhang include the University of New Brunswick.

Papers
Book Chapter
16 Apr 2001
TL;DR: This work extends the Naive Bayes classifier to allow certain dependency relations among attributes; the resulting algorithm is more efficient and produces simpler dependency relations for better comprehensibility, while maintaining very similar predictive accuracy.

Abstract: Data mining applications require learning algorithms to have high predictive accuracy, scale up to large datasets, and produce comprehensible outcomes. The Naive Bayes classifier has received extensive attention due to its efficiency, reasonable predictive accuracy, and simplicity. However, its assumption of attribute independence given the class is often violated, producing incorrect probability estimates that can affect the success of data mining applications. We extend the Naive Bayes classifier to allow certain dependency relations among attributes. Compared to previous extensions of Naive Bayes, our algorithm is more efficient (more so in problems with a large number of attributes), and it produces simpler dependency relations for better comprehensibility, while maintaining very similar predictive accuracy.
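To make "allowing certain dependency relations among attributes" concrete, here is a minimal counting-based sketch of an augmented Naive Bayes in which each attribute may optionally take one other attribute as an extra parent besides the class. This is an illustrative assumption for exposition, not the paper's algorithm; the function and parameter names (fit_augmented_nb, parents, n_values) are hypothetical.

```python
import numpy as np
from collections import defaultdict

def fit_augmented_nb(X, y, parents, alpha=1.0):
    """X: (n, d) integer-coded attributes; y: (n,) class labels.
    parents[i] is the index of attribute i's extra parent, or None (plain Naive Bayes)."""
    n, d = X.shape
    classes = np.unique(y)
    # class priors with Laplace smoothing
    prior = {c: (np.sum(y == c) + alpha) / (n + alpha * len(classes)) for c in classes}
    tables = []  # one raw count table per attribute, keyed by (class, [parent value,] value)
    for i in range(d):
        counts = defaultdict(float)
        for row, c in zip(X, y):
            key = (c, row[parents[i]]) if parents[i] is not None else (c,)
            counts[key + (row[i],)] += 1.0
        tables.append(counts)
    return classes, prior, tables

def predict_one(x, classes, prior, tables, parents, alpha=1.0, n_values=2):
    """Return the class maximizing log p(c) + sum_i log p(x_i | c, x_parent(i))."""
    scores = {}
    for c in classes:
        logp = np.log(prior[c])
        for i, counts in enumerate(tables):
            key = (c, x[parents[i]]) if parents[i] is not None else (c,)
            num = counts[key + (x[i],)] + alpha
            den = sum(counts[key + (v,)] for v in range(n_values)) + alpha * n_values
            logp += np.log(num / den)
        scores[c] = logp
    return max(scores, key=scores.get)
```

Restricting each attribute to at most one extra parent (as in TAN-style models) keeps both training cost and the learned structure small, which is the efficiency/comprehensibility trade-off the abstract refers to.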

38 citations

Journal Article
TL;DR: This paper establishes an association between the structural complexity of Bayesian networks and their representational power, using the maximum number of parents per node as the measure of structural complexity and the maximum XOR contained in a target function as the measure of function complexity.

Abstract: One of the most important fundamental properties of Bayesian networks is their representational power, which reflects what kinds of functions they can or cannot represent. In this paper, we establish an association between the structural complexity of Bayesian networks and their representational power. We use the maximum number of parents per node as the measure of the structural complexity of a Bayesian network, and the maximum XOR contained in a target function as the measure of function complexity. A representational upper bound is established and proved. Roughly speaking, discrete Bayesian networks in which each node has at most k parents cannot represent any function containing a (k+1)-XOR. Our theoretical results help us to gain a deeper understanding of the capabilities and limitations of Bayesian networks.
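As an empirical illustration of the k = 1 case (each attribute's only parent is the class, i.e. Naive Bayes), the sketch below fits scikit-learn's BernoulliNB to a 2-XOR target and shows that accuracy stays near chance. The dataset construction and random seed are assumptions made only for this demo.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(4000, 2))       # two binary attributes, uniformly sampled
y = X[:, 0] ^ X[:, 1]                        # target is their XOR

clf = BernoulliNB().fit(X, y)
print("training accuracy:", clf.score(X, y))  # stays near 0.5: 2-XOR is not representable
```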

34 citations

Book Chapter
06 May 2002
TL;DR: AUC provides a more discriminating evaluation of ranking and probability estimation than accuracy does, and classifiers constructed to maximise the AUC score produce not only higher AUC values but also higher classification accuracies.

Abstract: In most data mining applications, accurate ranking and probability estimation are essential. However, many traditional classifiers aim only at high classification accuracy (or a low error rate), even though they also produce probability estimates. Does high predictive accuracy imply better ranking and probability estimation? Is there a better evaluation method for such classifiers than classification accuracy, for the purpose of data mining applications? We argue that the answer is the area under the ROC (Receiver Operating Characteristics) curve, or simply AUC. We show that AUC provides a more discriminating evaluation of ranking and probability estimation than accuracy does. Further, we show that classifiers constructed to maximise the AUC score produce not only higher AUC values, but also higher classification accuracies. Our results are based on an experimental comparison between error-based and AUC-based learning algorithms for TAN (Tree-Augmented Naive Bayes).
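A small, purely illustrative example (not taken from the paper) of why AUC discriminates where accuracy cannot: two sets of probability estimates that make exactly the same mistakes at the 0.5 threshold, and therefore have identical accuracy, can still differ in how well they rank positives above negatives.

```python
from sklearn.metrics import accuracy_score, roc_auc_score

y_true = [0, 0, 0, 0, 1, 1, 1, 1]
p_a    = [0.1, 0.2, 0.3, 0.6, 0.4, 0.7, 0.8, 0.9]   # only one negative (0.6) outranks a positive (0.4)
p_b    = [0.1, 0.2, 0.3, 0.9, 0.4, 0.7, 0.8, 0.6]   # same thresholded errors, noticeably worse ranking

for name, p in [("A", p_a), ("B", p_b)]:
    preds = [int(pi >= 0.5) for pi in p]
    print(name, "accuracy:", accuracy_score(y_true, preds),
          "AUC:", round(roc_auc_score(y_true, p), 3))
```

Both classifiers score 0.75 accuracy, yet A's AUC is about 0.94 while B's is 0.75, which is the kind of ranking difference the abstract argues accuracy alone cannot see.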

30 citations

Book Chapter
TL;DR: This work gives necessary and sufficient conditions for linearly separable functions in the binary domain to be learnable by Naive Bayes under uniform representation, and shows that the learnability (and error rates) of Naive Bayes can be affected dramatically by sampling distributions.

Abstract: Naive Bayes is an efficient and effective learning algorithm, but previous results show that its representational ability is severely limited, since it can represent only certain linearly separable functions in the binary domain. We give necessary and sufficient conditions for linearly separable functions in the binary domain to be learnable by Naive Bayes under uniform representation. We then show that the learnability (and error rates) of Naive Bayes can be affected dramatically by sampling distributions. Our results help us to gain a much deeper understanding of this seemingly simple, yet powerful, learning algorithm.
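The link between Naive Bayes and linear separability can be seen directly from its log-odds over binary attributes. The short derivation below is a standard identity, included only to make the abstract's premise explicit; the symbols p_i^+ and p_i^- are introduced here for notation.

```latex
% Naive Bayes over binary attributes x_i in {0,1} is a linear threshold function.
\[
\log\frac{p(+\mid x_1,\dots,x_n)}{p(-\mid x_1,\dots,x_n)}
  = \log\frac{p(+)}{p(-)} + \sum_{i=1}^{n}\log\frac{p(x_i\mid +)}{p(x_i\mid -)}
  = b + \sum_{i=1}^{n} w_i x_i,
\]
\[
w_i = \log\frac{p_i^{+}\,(1-p_i^{-})}{p_i^{-}\,(1-p_i^{+})},\qquad
b = \log\frac{p(+)}{p(-)} + \sum_{i=1}^{n}\log\frac{1-p_i^{+}}{1-p_i^{-}},\qquad
p_i^{\pm} = p(x_i = 1 \mid \pm).
\]
```

So Naive Bayes predicts + exactly when b + Σ_i w_i x_i ≥ 0, i.e. it can realize only linearly separable functions of the binary attributes, which is why the learnability question above is posed for that class of functions.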

22 citations

Proceedings Article
03 Jan 2001
TL;DR: A data-mining approach is proposed that produces generalized query patterns, or templates, from the raw user logs of a popular commercial knowledge-based search engine that is currently in use; simulations show that such templates can improve the search engine's speed and precision.

Abstract: User logs of a popular search engine keep track of user activities, including user queries, user click-through on the returned lists, and user browsing behaviors. Knowledge about user queries discovered from user logs can improve the performance of the search engine. We propose a data-mining approach that produces generalized query patterns, or templates, from the raw user logs of a popular commercial knowledge-based search engine that is currently in use. Our simulation shows that such templates can improve the search engine's speed and precision, and can cover queries not asked previously. The templates are also comprehensible, so web editors can easily discover the topics in which most users are interested.
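As a toy illustration of what a generalized query pattern looks like, the sketch below collapses queries that differ in a single token into a template with a slot. The merging rule and the <TERM> placeholder are assumptions made for exposition and are not the paper's mining algorithm.

```python
from itertools import combinations

def generalize(queries):
    """Merge pairs of equal-length queries that differ in exactly one token."""
    templates = set()
    for a, b in combinations(queries, 2):
        ta, tb = a.split(), b.split()
        if len(ta) == len(tb):
            diffs = [i for i, (x, y) in enumerate(zip(ta, tb)) if x != y]
            if len(diffs) == 1:
                merged = list(ta)
                merged[diffs[0]] = "<TERM>"      # replace the varying token with a slot
                templates.add(" ".join(merged))
    return templates

logs = ["cheap flights to paris", "cheap flights to tokyo", "how to reset password"]
print(generalize(logs))   # {'cheap flights to <TERM>'}
```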

15 citations


Cited by
Proceedings Article
01 Jan 2004
TL;DR: A sufficient condition for the optimality of naive Bayes is presented and proved in a setting where dependences between attributes do exist, providing evidence that dependences among attributes may cancel each other out.

Abstract: Naive Bayes is one of the most efficient and effective inductive learning algorithms for machine learning and data mining. Its competitive performance in classification is surprising, because the conditional independence assumption on which it is based is rarely true in real-world applications. An open question is: what is the true reason for the surprisingly good performance of naive Bayes in classification? In this paper, we propose a novel explanation of the superb classification performance of naive Bayes. We show that, essentially, the dependence distribution plays a crucial role: how the local dependence of a node distributes in each class, evenly or unevenly, and how the local dependences of all nodes work together, consistently (supporting a certain classification) or inconsistently (canceling each other out). Therefore, no matter how strong the dependences among attributes are, naive Bayes can still be optimal if the dependences distribute evenly in the classes, or if the dependences cancel each other out. We propose and prove a sufficient and necessary condition for the optimality of naive Bayes. Further, we investigate the optimality of naive Bayes under the Gaussian distribution. We present and prove a sufficient condition for the optimality of naive Bayes in which dependences between attributes do exist. This provides evidence that dependences among attributes may cancel each other out. In addition, we explore when naive Bayes works well.

Naive Bayes and Augmented Naive Bayes. Classification is a fundamental issue in machine learning and data mining. In classification, the goal of a learning algorithm is to construct a classifier given a set of training examples with class labels. Typically, an example E is represented by a tuple of attribute values (x1, x2, ..., xn), where xi is the value of attribute Xi. Let C represent the classification variable, and let c be the value of C. In this paper, we assume that there are only two classes: + (the positive class) or − (the negative class). A classifier is a function that assigns a class label to an example. From the probability perspective, according to Bayes' rule, the probability of an example E = (x1, x2, ..., xn) being of class c is p(c|E) = p(E|c)p(c) / p(E). E is classified as the class C = + if and only if f_b(E) = p(C = +|E) / p(C = −|E) ≥ 1, where f_b(E) is called a Bayesian classifier. Assume that all attributes are independent given the value of the class variable; that is, p(E|c) = p(x1, x2, ..., xn|c) = ∏_{i=1}^{n} p(xi|c).
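The claim that dependences can cancel out or distribute evenly can be checked on a toy case. In the sketch below (an assumption-driven illustration, not the paper's proof), the second attribute simply duplicates the first, so the attribute dependence is as strong as possible yet identical in both classes, and Naive Bayes still agrees with the Bayes-optimal rule.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

rng = np.random.default_rng(1)
y = rng.integers(0, 2, size=20000)                               # balanced binary classes
x1 = (rng.random(20000) < np.where(y == 1, 0.8, 0.2)).astype(int)  # p(x1=1|+) = 0.8, p(x1=1|-) = 0.2
X = np.column_stack([x1, x1])                                    # x2 is a copy of x1: perfect dependence

nb = BernoulliNB().fit(X, y)
bayes_opt = X[:, 0]            # the Bayes-optimal rule here is simply "predict the class x1 points to"
print("agreement with Bayes-optimal rule:", (nb.predict(X) == bayes_opt).mean())  # ~1.0
```

The probability estimates Naive Bayes produces here are badly overconfident (it effectively counts the evidence from x1 twice), but the classification decisions are unchanged, which is exactly the distinction the paper draws between probability estimation and optimality of classification.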

1,536 citations

Journal Article
TL;DR: It is shown, theoretically and empirically, that AUC is a better measure (defined precisely) than accuracy, and well-established claims in machine learning that are based on accuracy are reevaluated using AUC, yielding interesting and surprising new results.
Abstract: The area under the ROC (receiver operating characteristics) curve, or simply AUC, has been traditionally used in medical diagnosis since the 1970s. It has recently been proposed as an alternative single-number measure for evaluating the predictive ability of learning algorithms. However, no formal arguments were given as to why AUC should be preferred over accuracy. We establish formal criteria for comparing two different measures for learning algorithms and we show theoretically and empirically that AUC is a better measure (defined precisely) than accuracy. We then reevaluate well-established claims in machine learning based on accuracy using AUC and obtain interesting and surprising new results. For example, it has been well-established and accepted that Naive Bayes and decision trees are very similar in predictive accuracy. We show, however, that Naive Bayes is significantly better than decision trees in AUC. The conclusions drawn in this paper may make a significant impact on machine learning and data mining applications.

1,528 citations

Journal Article
Xin Yang, Yifei Wang, Ryan Byrne, Gisbert Schneider, Shengyong Yang
TL;DR: The current state-of-the art of AI-assisted pharmaceutical discovery is discussed, including applications in structure- and ligand-based virtual screening, de novo drug design, physicochemical and pharmacokinetic property prediction, drug repurposing, and related aspects.
Abstract: Artificial intelligence (AI), and, in particular, deep learning as a subcategory of AI, provides opportunities for the discovery and development of innovative drugs. Various machine learning approaches have recently (re)emerged, some of which may be considered instances of domain-specific AI which have been successfully employed for drug discovery and design. This review provides a comprehensive portrayal of these machine learning techniques and of their applications in medicinal chemistry. After introducing the basic principles, alongside some application notes, of the various machine learning algorithms, the current state-of-the art of AI-assisted pharmaceutical discovery is discussed, including applications in structure- and ligand-based virtual screening, de novo drug design, physicochemical and pharmacokinetic property prediction, drug repurposing, and related aspects. Finally, several challenges and limitations of the current methods are summarized, with a view to potential future directions for AI-assisted drug discovery and design.

425 citations

Proceedings Article
09 Aug 2003
TL;DR: For the first time, it is formally proved that AUC is a better measure than accuracy for the evaluation of learning algorithms.

Abstract: Predictive accuracy has been used as the main, and often only, evaluation criterion for the predictive performance of classification learning algorithms. In recent years, the area under the ROC (Receiver Operating Characteristics) curve, or simply AUC, has been proposed as an alternative single-number measure for evaluating learning algorithms. In this paper, we prove that AUC is a better measure than accuracy. More specifically, we present rigorous definitions of consistency and discriminancy for comparing two evaluation measures for learning algorithms. We then present empirical evaluations and a formal proof to establish that AUC is indeed statistically consistent and more discriminating than accuracy. Our result is quite significant, since, for the first time, we formally prove that AUC is a better measure than accuracy in the evaluation of learning algorithms.

422 citations