Showing papers by "Qiang Yang" published in 2004


Proceedings Article•DOI•
04 Jul 2004
TL;DR: A simple, novel, and yet effective method for building and testing decision trees that minimizes the sum of the misclassification and test costs, together with several intelligent test strategies that suggest ways of obtaining missing values at a cost in order to minimize the total cost.
Abstract: We propose a simple, novel and yet effective method for building and testing decision trees that minimizes the sum of the misclassification and test costs. More specifically, we first put forward an original and simple splitting criterion for attribute selection in tree building. Our tree-building algorithm has many desirable properties for a cost-sensitive learning system that must account for both types of costs. Then, assuming that the test cases may have a large number of missing values, we design several intelligent test strategies that can suggest ways of obtaining the missing values at a cost in order to minimize the total cost. We experimentally compare these strategies and C4.5, and demonstrate that our new algorithms significantly outperform C4.5 and its variations. In addition, our algorithm's complexity is similar to that of C4.5, and is much lower than that of previous work. Our work is useful for many diagnostic tasks which must factor in the misclassification and test costs for obtaining missing information.

291 citations
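
The abstract does not spell out the splitting criterion, so the sketch below only illustrates the general idea of cost-sensitive attribute selection: choose the attribute whose test yields the largest expected reduction in total (misclassification plus test) cost, and stop splitting when no attribute pays for itself. The function names, toy data, and cost values are illustrative assumptions, not the paper's algorithm.

import numpy as np

def leaf_cost(labels, mc_cost):
    """Minimal expected misclassification cost if we stop and predict one class.

    mc_cost[i][j] = cost of predicting class i when the true class is j.
    """
    counts = np.bincount(labels, minlength=mc_cost.shape[0])
    return min((mc_cost[i] * counts).sum() for i in range(mc_cost.shape[0]))

def split_gain(X, y, attr, test_cost, mc_cost):
    """Reduction in total cost from testing `attr` and splitting on its values."""
    before = leaf_cost(y, mc_cost)
    after = sum(leaf_cost(y[X[:, attr] == v], mc_cost)
                for v in np.unique(X[:, attr]))
    # every case in the node pays the test cost for this attribute
    return before - after - test_cost[attr] * len(y)

def choose_attribute(X, y, test_cost, mc_cost):
    gains = [split_gain(X, y, a, test_cost, mc_cost) for a in range(X.shape[1])]
    best = int(np.argmax(gains))
    return best if gains[best] > 0 else None   # None: stop splitting

# toy usage: 2 binary attributes, binary class, asymmetric misclassification costs
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [1, 1], [0, 0]])
y = np.array([0, 0, 1, 1, 1, 0])
mc_cost = np.array([[0.0, 50.0], [200.0, 0.0]])   # false positives cost more
test_cost = np.array([5.0, 30.0])                 # attribute 1 is expensive to test
print(choose_attribute(X, y, test_cost, mc_cost)) # -> 0: the cheap attribute separates the classes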


Proceedings Article•DOI•
25 Jul 2004
TL;DR: This paper gives empirical evidence that ideal Web-page summaries generated by human editors can indeed improve the performance of Web-page classification algorithms, and proposes a new Web summarization-based classification algorithm that achieves an approximately 8.8% improvement over pure-text-based methods.
Abstract: Web-page classification is much more difficult than pure-text classification due to the large variety of noisy information embedded in Web pages. In this paper, we propose a new Web-page classification algorithm based on Web summarization for improving the accuracy. We first give empirical evidence that ideal Web-page summaries generated by human editors can indeed improve the performance of Web-page classification algorithms. We then propose a new Web summarization-based classification algorithm and evaluate it along with several other state-of-the-art text summarization algorithms on the LookSmart Web directory. Experimental results show that our proposed summarization-based classification algorithm achieves an approximately 8.8% improvement as compared to a pure-text-based classification algorithm. We further introduce an ensemble classifier using the improved summarization algorithm and show that it achieves about a 12.9% improvement over pure-text-based methods.

204 citations
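
The abstract names neither the summarizers nor the base classifier, so the following is only a minimal sketch of the overall pipeline: summarize each page, then train a standard text classifier on the summaries instead of the full text. The crude frequency-based sentence scorer and the scikit-learn components are stand-ins, not the paper's methods.

from collections import Counter
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

def summarize(text, n_sentences=2):
    """Crude extractive summary: keep the sentences with the most frequent words."""
    sentences = [s.strip() for s in text.split('.') if s.strip()]
    freq = Counter(w.lower() for s in sentences for w in s.split())
    scored = sorted(sentences, key=lambda s: -sum(freq[w.lower()] for w in s.split()))
    return '. '.join(scored[:n_sentences])

pages = ["Cheap flights and hotel deals. Book your holiday today. Weather is nice.",
         "New GPU benchmarks released. The processor performs well. Drivers updated."]
labels = ["travel", "computing"]

clf = make_pipeline(TfidfVectorizer(), MultinomialNB())
clf.fit([summarize(p) for p in pages], labels)       # train on summaries, not full text
print(clf.predict([summarize("Hotel prices drop. Flights to Rome are cheap.")]))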


Proceedings Article•DOI•
01 Nov 2004
TL;DR: This paper shows how to obtain a test-cost-sensitive naive Bayes classifier (csNB) by including a test strategy that determines how unknown attributes are selected for testing in order to minimize the sum of the misclassification costs and test costs.
Abstract: Inductive learning techniques such as the naive Bayes and decision tree algorithms have been extended in the past to handle different types of costs, mainly by distinguishing different costs of classification errors. However, it is an equally important issue to consider how to handle the test costs associated with querying the missing values in a test case. When the value of an attribute is missing in a test case, it may or may not be worthwhile to take the effort to obtain its missing value, depending on how much the value contributes to a potential gain in classification accuracy. In this paper, we show how to obtain a test-cost-sensitive naive Bayes classifier (csNB) by including a test strategy that determines how unknown attributes are selected for testing in order to minimize the sum of the misclassification costs and test costs. We propose and evaluate several potential test strategies, including one that allows several tests to be done at once. We empirically evaluate the csNB method and show that it compares favorably with its decision tree counterpart.

152 citations
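
As a rough illustration of a sequential test strategy for a cost-sensitive naive Bayes classifier, the sketch below tests the unknown attribute whose expected reduction in misclassification cost exceeds its test cost by the largest margin, and stops when no test is worthwhile. This is a generic sketch of the idea; the paper's actual strategies (including the batch strategy) are not reproduced, and all names and numbers are illustrative.

import numpy as np

def posterior(prior, cond, evidence):
    """Naive Bayes posterior; cond[a][v] = vector of P(attribute a has value v | class)."""
    p = prior.copy()
    for a, v in evidence.items():
        p = p * cond[a][v]
    return p / p.sum()

def expected_mc_cost(post, mc_cost):
    """Expected misclassification cost of the cost-minimizing prediction."""
    return min(mc_cost[i] @ post for i in range(len(post)))

def choose_test(prior, cond, evidence, unknown, test_cost, mc_cost):
    """Pick the unknown attribute whose test most reduces expected total cost."""
    post = posterior(prior, cond, evidence)
    base = expected_mc_cost(post, mc_cost)
    best, best_saving = None, 0.0
    for a in unknown:
        exp_after = 0.0
        for v in range(cond[a].shape[0]):
            p_v = cond[a][v] @ post          # predictive probability of observing value v
            exp_after += p_v * expected_mc_cost(
                posterior(prior, cond, {**evidence, a: v}), mc_cost)
        saving = base - exp_after - test_cost[a]
        if saving > best_saving:
            best, best_saving = a, saving
    return best                              # None: no test is worth its cost, predict now

# toy usage: 2 classes, 2 binary attributes
prior = np.array([0.7, 0.3])
cond = {0: np.array([[0.8, 0.3], [0.2, 0.7]]),     # weakly informative attribute
        1: np.array([[0.9, 0.05], [0.1, 0.95]])}   # highly informative attribute
mc_cost = np.array([[0.0, 100.0], [300.0, 0.0]])
print(choose_test(prior, cond, {}, unknown=[0, 1],
                  test_cost={0: 10.0, 1: 5.0}, mc_cost=mc_cost))
# -> 1: the cheap, highly informative attribute is worth testing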


Journal Article•DOI•
TL;DR: A promising approach to utilizing the correlative information for improving the peptide identification accuracy by extending the tandem mass spectral dot product to the kernel SDP (KSDP), which outperforms two SDP-based software tools, SEQUEST and Sonar MS/MS, in terms of identification accuracy.
Abstract: Motivation: The correlation among fragment ions in a tandem mass spectrum is crucial in reducing stochastic mismatches for peptide identification by database searching. Until now, an efficient scoring algorithm that considers the correlative information in a tunable and comprehensive manner has been lacking. Results: This paper provides a promising approach to utilizing the correlative information for improving the peptide identification accuracy. The kernel trick, rooted in statistical learning theory, is exploited to address this issue with low computational effort. The common scoring method, the tandem mass spectral dot product (SDP), is extended to the kernel SDP (KSDP). Experiments on a previously reported dataset demonstrate the effectiveness of the KSDP. The implementation on consecutive fragments shows a decrease of 10% in the error rate compared with the SDP. Our software tool, pFind, using a simple scoring function based on the KSDP, outperforms two SDP-based software tools, SEQUEST and Sonar MS/MS, in terms of identification accuracy. Supplementary Information: http://www.jdl.ac.cn/user/yfu/pfind/index.html

101 citations
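
For readers unfamiliar with the baseline, the spectral dot product (SDP) scores a candidate peptide by the inner product of binned, normalized intensity vectors of the observed and predicted spectra; the KSDP replaces that inner product with a kernel evaluation. The sketch below shows the SDP and a generic polynomial kernel as a stand-in; the paper's kernel over consecutive fragments is not reproduced here.

import numpy as np

def binned_spectrum(mz, intensity, bin_width=1.0, max_mz=2000.0):
    """Turn a peak list into a fixed-length, L2-normalized intensity vector."""
    vec = np.zeros(int(max_mz / bin_width))
    for m, i in zip(mz, intensity):
        if m < max_mz:
            vec[int(m / bin_width)] += i
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

def sdp(observed, predicted):
    """Spectral dot product: cosine-style score between two binned spectra."""
    return float(observed @ predicted)

def ksdp_poly(observed, predicted, degree=2, c=0.0):
    """Kernelized score: a polynomial kernel as a simple stand-in for the paper's
    consecutive-fragment kernel (illustrative only)."""
    return float((observed @ predicted + c) ** degree)

obs = binned_spectrum([175.1, 303.2, 402.3], [50.0, 80.0, 30.0])
pred = binned_spectrum([175.1, 303.2, 500.0], [1.0, 1.0, 1.0])
print(sdp(obs, pred), ksdp_poly(obs, pred))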


Proceedings Article•
25 Jul 2004
TL;DR: An integrated plan-recognition model is presented that combines low-level sensory readings with high-level goal inference, using a dynamic Bayesian network to infer a user's actions from raw signals and an N-gram model to infer the user's goals from actions.
Abstract: Plan recognition has traditionally been developed for logically encoded application domains with a focus on logical reasoning. In this paper, we present an integrated plan-recognition model that combines low-level sensory readings with high-level goal inference. A two-level architecture is proposed to infer a user's goals in a complex indoor environment using an RF-based wireless network. The novelty of our work derives from our ability to infer a user's goals from sequences of signal trajectories, and from the ability to trade off model accuracy against inference efficiency. The model relies on a dynamic Bayesian network to infer a user's actions from raw signals, and an N-gram model to infer the user's goals from actions. We present a method for constructing the model from past data and demonstrate the effectiveness of our proposed solution through empirical studies using real data that we have collected.

70 citations
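
The upper level of the architecture can be pictured as a per-goal N-gram model over recognized actions, with the goal chosen by maximum likelihood over the observed action sequence. The sketch below shows only such a bigram layer; the lower-level dynamic Bayesian network that maps raw signals to actions is omitted, and all action and goal names are made up for illustration.

from collections import defaultdict
import math

def train_bigram(goal_traces):
    """goal_traces: {goal: [action sequences]} -> per-goal bigram counts."""
    models = {}
    for goal, traces in goal_traces.items():
        counts, totals = defaultdict(float), defaultdict(float)
        for trace in traces:
            for prev, cur in zip(trace, trace[1:]):
                counts[(prev, cur)] += 1.0
                totals[prev] += 1.0
        models[goal] = (counts, totals)
    return models

def goal_log_likelihood(model, actions, alpha=0.1, vocab=10):
    counts, totals = model
    ll = 0.0
    for prev, cur in zip(actions, actions[1:]):
        # add-alpha smoothing so unseen transitions do not zero out the likelihood
        ll += math.log((counts[(prev, cur)] + alpha) / (totals[prev] + alpha * vocab))
    return ll

def infer_goal(models, actions):
    return max(models, key=lambda g: goal_log_likelihood(models[g], actions))

# toy usage: actions produced by the lower (sensor) level feed the N-gram level
traces = {"go_to_printer": [["leave_office", "walk_corridor", "enter_print_room"]],
          "go_to_seminar": [["leave_office", "walk_corridor", "take_stairs", "enter_hall"]]}
models = train_bigram(traces)
print(infer_goal(models, ["leave_office", "walk_corridor", "take_stairs"]))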


Journal Article•DOI•
TL;DR: A comparative study on different kinds of sequential association rules for web document prediction shows that the existing approaches can be cast under two important dimensions, namely the type of antecedents of rules and the criterion for selecting prediction rules.
Abstract: Web servers keep track of web users' browsing behavior in web logs. From these logs, one can build statistical models that predict the users' next requests based on their current behavior. These data are complex due to their large size and sequential nature. In the past, researchers have proposed different methods for building association-rule based prediction models using the web logs, but there has been no systematic study on the relative merits of these methods. In this paper, we provide a comparative study on different kinds of sequential association rules for web document prediction. We show that the existing approaches can be cast under two important dimensions, namely the type of antecedents of rules and the criterion for selecting prediction rules. From this comparison we propose a best overall method and empirically test the proposed model on real web logs.

70 citations
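
One family of rules compared in such studies uses the last k pages of a session as the antecedent and picks the prediction rule by confidence. The sketch below mines and applies rules of that form; it is a generic illustration, not necessarily the configuration the paper recommends.

from collections import defaultdict

def mine_rules(sessions, k=2, min_support=2):
    """Rules of the form (last k pages) -> next page, kept if the antecedent is frequent."""
    counts = defaultdict(lambda: defaultdict(int))
    for s in sessions:
        for i in range(len(s) - k):
            counts[tuple(s[i:i + k])][s[i + k]] += 1
    rules = {}
    for antecedent, nexts in counts.items():
        total = sum(nexts.values())
        if total >= min_support:
            page, c = max(nexts.items(), key=lambda kv: kv[1])
            rules[antecedent] = (page, c / total)          # prediction, confidence
    return rules

def predict(rules, recent_pages, k=2):
    return rules.get(tuple(recent_pages[-k:]))

sessions = [["/home", "/products", "/cart", "/checkout"],
            ["/home", "/products", "/cart", "/home"],
            ["/blog", "/home", "/products", "/cart", "/checkout"]]
rules = mine_rules(sessions)
print(predict(rules, ["/products", "/cart"]))   # -> ('/checkout', 0.666...)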


Proceedings Article•DOI•
22 Aug 2004
TL;DR: An incremental supervised subspace learning algorithm is proposed to infer an adaptive subspace by optimizing the Maximum Margin Criterion, and experimental results show that IMMC converges to a subspace similar to that of the batch approach.
Abstract: Subspace learning approaches have attracted much attention in academia recently. However, the classical batch algorithms no longer satisfy applications on streaming or large-scale data. To meet this need, the Incremental Principal Component Analysis (IPCA) algorithm has been well established, but it is an unsupervised subspace learning approach and is not optimal for general classification tasks, such as face recognition and Web document categorization. In this paper, we propose an incremental supervised subspace learning algorithm, called Incremental Maximum Margin Criterion (IMMC), to infer an adaptive subspace by optimizing the Maximum Margin Criterion. We also present a proof of convergence for the proposed algorithm. Experimental results on both synthetic and real-world datasets show that IMMC converges to a subspace similar to that of the batch approach.

61 citations
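
The Maximum Margin Criterion seeks a projection W that maximizes trace(W^T (S_b - S_w) W), where S_b and S_w are the between-class and within-class scatter matrices. The batch computation below illustrates the criterion that IMMC optimizes incrementally; the incremental update rule itself is not reproduced, and the toy data are illustrative.

import numpy as np

def mmc_projection(X, y, n_components=2):
    """Batch Maximum Margin Criterion: top eigenvectors of S_b - S_w."""
    mean = X.mean(axis=0)
    d = X.shape[1]
    Sb = np.zeros((d, d))
    Sw = np.zeros((d, d))
    for c in np.unique(y):
        Xc = X[y == c]
        diff = (Xc.mean(axis=0) - mean).reshape(-1, 1)
        Sb += len(Xc) * diff @ diff.T
        Sw += (Xc - Xc.mean(axis=0)).T @ (Xc - Xc.mean(axis=0))
    # S_b - S_w is symmetric, so eigh applies; keep eigenvectors of the largest eigenvalues
    vals, vecs = np.linalg.eigh(Sb - Sw)
    return vecs[:, np.argsort(vals)[::-1][:n_components]]

X = np.vstack([np.random.randn(50, 5) + 2, np.random.randn(50, 5) - 2])
y = np.array([0] * 50 + [1] * 50)
W = mmc_projection(X, y, n_components=1)
print((X @ W).shape)   # (100, 1): data projected onto the learned subspace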


Proceedings Article•DOI•
13 Nov 2004
TL;DR: Experimental results show that the proposed algorithm outperforms the traditional Cosine similarity and is superior to LSI, and a novel iterative algorithm for computing non-orthogonal space similarity measures is proposed.
Abstract: Many machine learning and data mining algorithms crucially rely on similarity metrics. The Cosine similarity, which calculates the inner product of two normalized feature vectors, is one of the most commonly used similarity measures. However, in many practical tasks such as text categorization and document clustering, the Cosine similarity is calculated under the assumption that the input space is orthogonal, an assumption that usually cannot be satisfied due to synonymy and polysemy. Various algorithms such as Latent Semantic Indexing (LSI) have been used to address this problem by projecting the original data into an orthogonal space. However, LSI also suffers from high computational cost and data sparseness, shortcomings that increase computation time and storage requirements for large-scale realistic data. In this paper, we propose a novel and effective similarity metric in the non-orthogonal input space. The basic idea of our proposed metric is that the similarity of features should affect the similarity of objects, and vice versa. A novel iterative algorithm for computing non-orthogonal space similarity measures is then proposed. Experimental results on a synthetic dataset, real MSN search click-through logs, and the 20NG dataset show that our algorithm outperforms the traditional Cosine similarity and is superior to LSI.

45 citations
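
The core idea, that feature similarity and object similarity should reinforce each other, can be pictured as a coupled iterative update over an object-by-feature matrix. The sketch below is one generic way to realize that iteration (a SimRank-style propagation), not the paper's exact update rule; the toy matrix is made up.

import numpy as np

def iterative_similarity(X, n_iter=10, decay=0.8):
    """Coupled object/feature similarity on an (objects x features) matrix X.

    Each round, object similarity is propagated through shared features and
    feature similarity through shared objects; diagonals are pinned to 1.
    """
    # row-normalize both orientations so the propagation stays bounded
    R = X / np.maximum(X.sum(axis=1, keepdims=True), 1e-12)
    C = X.T / np.maximum(X.T.sum(axis=1, keepdims=True), 1e-12)
    S_obj = np.eye(X.shape[0])
    S_feat = np.eye(X.shape[1])
    for _ in range(n_iter):
        S_obj = decay * R @ S_feat @ R.T
        S_feat = decay * C @ S_obj @ C.T
        np.fill_diagonal(S_obj, 1.0)
        np.fill_diagonal(S_feat, 1.0)
    return S_obj, S_feat

# toy term-document style matrix: documents 0 and 1 share no terms directly,
# but their terms co-occur with document 2's terms, so they still gain similarity
X = np.array([[1.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 1.0],
              [0.0, 1.0, 1.0, 0.0]])
S_obj, _ = iterative_similarity(X)
print(np.round(S_obj, 3))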


Journal Article•DOI•
TL;DR: This method enhances the efficiency and the stability of nitrogen removal, and reduces operating costs and construction investment in the process of wastewater treatment.

26 citations


Proceedings Article•DOI•
01 Nov 2004
TL;DR: This paper proposes a method, called principal sparse nonnegative matrix factorization (PSNMF), for learning the associations between itemsets in the form of ratio rules, and provides a support measurement to weigh the importance of each rule for the entire dataset.
Abstract: Association rules are traditionally designed to capture statistical relationships among itemsets in a given database. To additionally capture the quantitative association knowledge, Korn et al. (1998) proposed a paradigm named ratio rules for quantifiable data mining. However, their approach is mainly based on principal component analysis (PCA) and, as a result, it cannot guarantee that the ratio coefficients are nonnegative. This may lead to serious problems in the rules' application. In this paper, we propose a method, called principal sparse nonnegative matrix factorization (PSNMF), for learning the associations between itemsets in the form of ratio rules. In addition, we provide a support measurement to weigh the importance of each rule for the entire dataset.

22 citations
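
As background, plain nonnegative matrix factorization already yields nonnegative basis vectors that can be read as ratio rules over items; PSNMF additionally enforces sparsity and attaches a support measure, neither of which is implemented in the minimal sketch below (standard multiplicative updates on a made-up transaction matrix).

import numpy as np

def nmf(V, rank=2, n_iter=200, eps=1e-9):
    """Multiplicative-update NMF: V (items x transactions) ~ W @ H, with W, H >= 0."""
    rng = np.random.default_rng(0)
    n, m = V.shape
    W = rng.random((n, rank))
    H = rng.random((rank, m))
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# toy transaction data: rows are items, columns are baskets (quantities)
V = np.array([[2.0, 4.0, 0.0, 0.0],
              [1.0, 2.0, 0.0, 0.0],
              [0.0, 0.0, 3.0, 6.0],
              [0.0, 0.0, 1.0, 2.0]])
W, H = nmf(V, rank=2)
# each column of W, rescaled, reads as a nonnegative ratio rule over the items,
# e.g. "item 0 and item 1 are bought in roughly a 2 : 1 ratio"
print(np.round(W / W.max(axis=0), 2))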



Proceedings Article•DOI•
01 Nov 2004
TL;DR: IRC classifies interrelated Web objects by iteratively reinforcing the individual classification results of different object types via their interrelationships, exploiting the full interrelationships between heterogeneous objects on the Web.
Abstract: Most existing categorization algorithms deal with homogeneous Web data objects, and consider interrelated objects merely as additional features when taking the interrelationships with other types of objects into account. However, focusing on any single aspect of these interrelationships and objects does not fully reveal their true categories. In this paper, we propose a categorization algorithm, the iterative reinforcement categorization algorithm (IRC), to exploit the full interrelationships between heterogeneous objects on the Web. IRC attempts to classify the interrelated Web objects by iterative reinforcement between the individual classification results of different types via the interrelationships. Experiments on a clickthrough log dataset from the MSN search engine show that, in terms of the F1 measure, IRC achieves a 26.4% improvement over a pure content-based classification method, a 21% improvement over a query metadata-based method, and a 16.4% improvement over a virtual document-based method. Furthermore, our experiments show that IRC converges rapidly.

Journal Article•DOI•
TL;DR: This paper describes the solution for the protein homology prediction task in the KDD Cup 2004 competition, focusing on making full use of the abundant information within the blocks and on a new technique for reducing and balancing training data to make the support vector machine applicable to this kind of large-scale, imbalanced learning task.
Abstract: This paper describes our solution for the protein homology prediction task in the KDD Cup 2004 competition. This task is modeled as a supervised learning problem with multiple performance metrics. Several key characteristics make the problem both novel and challenging, including the concept of data blocks and the presence of large-scale and imbalanced training data. These features make a naive application of traditional classification algorithms infeasible. Our approach focuses on making full use of the abundant information within the blocks, and on developing a new technique for reducing and balancing the training data to make the support vector machine applicable to this kind of large-scale, imbalanced learning task.
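
The abstract does not describe the reduction technique itself, so the sketch below only illustrates the general recipe of balancing a large, skewed training set before fitting an SVM: random undersampling of the majority class plus class weighting, using scikit-learn. It is an illustrative stand-in, not the authors' method, and the data are synthetic.

import numpy as np
from sklearn.svm import LinearSVC

def undersample(X, y, ratio=3, seed=0):
    """Keep all positives and at most `ratio` times as many randomly chosen negatives."""
    rng = np.random.default_rng(seed)
    pos = np.where(y == 1)[0]
    neg = np.where(y == 0)[0]
    keep_neg = rng.choice(neg, size=min(len(neg), ratio * len(pos)), replace=False)
    idx = np.concatenate([pos, keep_neg])
    return X[idx], y[idx]

# toy imbalanced data: 1000 negatives, 30 positives
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (1000, 10)), rng.normal(1.5, 1, (30, 10))])
y = np.array([0] * 1000 + [1] * 30)

X_bal, y_bal = undersample(X, y)
clf = LinearSVC(class_weight="balanced")   # weight classes inversely to their frequency
clf.fit(X_bal, y_bal)
print(clf.decision_function(X[:5]))        # ranking scores, useful for rank-based metrics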

Journal Article•
TL;DR: In this article, an integrated framework called LEAPS (location estimation and action prediction), jointly developed by Hong Kong University of Science and Technology, and the Institute of Computing, Shanghai, of the Chinese Academy of Sciences, is presented.
Abstract: Location estimation and user behavior recognition are research issues that go hand in hand. In the past, these two issues have been investigated separately. In this paper, we present an integrated framework called LEAPS (location estimation and action prediction), jointly developed by Hong Kong University of Science and Technology, and the Institute of Computing, Shanghai, of the Chinese Academy of Sciences that combines two areas of interest, namely, location estimation and plan recognition, in a coherent whole. Under this framework, we have been carrying out several investigations, including action and plan recognition from low-level signals and location estimation by intelligently selecting access points (AP). Our two-layered model, including a sensor-level model and an action and goal prediction model, allows for future extensions in more advanced features and services.

Book Chapter•DOI•
18 Oct 2004
TL;DR: An integrated framework called LEAPS (location estimation and action prediction), jointly developed by Hong Kong University of Science and Technology and the Institute of Computing, Shanghai, is presented that combines two areas of interest, namely, location estimation and plan recognition, in a coherent whole.
Abstract: Location estimation and user behavior recognition are research issues that go hand in hand. In the past, these two issues have been investigated separately. In this paper, we present an integrated framework called LEAPS (location estimation and action prediction), jointly developed by Hong Kong University of Science and Technology, and the Institute of Computing, Shanghai, of the Chinese Academy of Sciences that combines two areas of interest, namely, location estimation and plan recognition, in a coherent whole. Under this framework, we have been carrying out several investigations, including action and plan recognition from low-level signals and location estimation by intelligently selecting access points (AP). Our two-layered model, including a sensor-level model and an action and goal prediction model, allows for future extensions in more advanced features and services.

Book Chapter•DOI•
26 May 2004
TL;DR: A new prediction model, based on Kolmogorov's backward equations, is presented for predicting when an online customer will leave the current page and which Web page the customer will visit next.
Abstract: This paper presents a new prediction model for predicting when an online customer will leave the current page and which Web page the customer will visit next. The model can also forecast the total number of visits to a given Web page by all incoming users at the same time. The prediction technique can be used as a component in many Web-based applications. The prediction model regards a Web browsing session as a continuous-time Markov process whose transition probability matrix can be computed from Web log data using Kolmogorov's backward equations. The model is tested against real Web-log data, where the scalability and accuracy of our method are analyzed.
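
Concretely, for a continuous-time Markov chain with generator matrix Q, Kolmogorov's backward equation dP(t)/dt = Q P(t) has the solution P(t) = exp(Qt), from which dwell times and next-page probabilities follow. The sketch below uses a small, made-up generator matrix; in the paper the corresponding quantities are estimated from Web log data.

import numpy as np
from scipy.linalg import expm

# illustrative generator matrix Q for pages [home, products, exit]:
# off-diagonal entries are transition rates (per minute), rows sum to zero
Q = np.array([[-1.0,  0.8,  0.2],
              [ 0.5, -1.2,  0.7],
              [ 0.0,  0.0,  0.0]])   # 'exit' is absorbing

# Kolmogorov's backward equation dP/dt = Q P(t) has solution P(t) = expm(Q t)
P_2min = expm(Q * 2.0)
print(np.round(P_2min, 3))           # P_2min[i, j] = P(on page j at t = 2 | on page i now)

# expected time on a page before any transition is -1 / Q[i, i]
print("expected dwell on 'home':", -1.0 / Q[0, 0], "minutes")

# where the visitor goes next from 'home': rates normalized over off-diagonal entries
next_probs = np.array([Q[0, 1], Q[0, 2]]) / -Q[0, 0]
print("next page from 'home':", dict(zip(["products", "exit"], next_probs)))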

Book Chapter•DOI•
TL;DR: This paper explores how to handle case retrieval when the case base is nonlinear in its similarity measurement, a situation in which linear similarity functions result in wrong solutions.
Abstract: Good similarity functions are at the heart of effective case-based reasoning. However, the similarity functions that have been designed so far have been mostly linear, weighted-sum in nature. In this paper, we explore how to handle case retrieval when the case base is nonlinear in its similarity measurement, a situation in which linear similarity functions result in wrong solutions. Our approach is to first transform the case base into a feature space using kernel computation. We perform correlation analysis with the maximum correlation criterion (MCC) in the feature space to find the most important features, from which we construct a feature-space case base. We then solve the new case in the feature space using traditional similarity-based retrieval. We show that for nonlinear case bases, our method results in a performance gain by a large margin. We provide the theoretical foundation and an empirical evaluation to support our observations.

Proceedings Article•DOI•
Yiming Yang, Hui Wang, Lei Li, Tianyi Li, Wen-Min Li, Qiang Yang, Wei Lv, Ping Huang •
26 Aug 2004
TL;DR: An innovative sequential data mining system for mining customers' churning behaviors in the telecommunications industry, which uses a model-based clustering method, extended to handle multi-dimensional data, to automatically and efficiently partition customers according to their behavior.
Abstract: We develop an innovative sequential data mining system for mining customers' churning behaviors in the telecommunications industry. Recently, an increasing number of telecommunications customers have been switching from one service or service provider to another. This phenomenon is called 'churn', and it is a major cause of corporations' loss of profitability. It is important for a telecommunications company to find out the transitional behavior of its customers through data mining. Our approach is to use a model-based clustering method, extended to handle multi-dimensional data, to automatically and efficiently partition customers according to their behavior. We model this problem as a sequential clustering problem, and present an effective solution for the case where the elements in the sequences are of a multi-dimensional nature. We provide theory and algorithms for the task, and empirically demonstrate that the method is effective in mining customer data for the telecommunications industry.

Proceedings Article•DOI•
01 Nov 2004
TL;DR: This work proposes a new approach to clustering high dimensional data based on a novel notion of cluster cores, instead of on nearest neighbors, which outperforms the well-known clustering algorithm, ROCK, with both lower time complexity and higher accuracy.
Abstract: We propose a new approach to clustering high-dimensional data based on a novel notion of cluster cores, instead of on nearest neighbors. A cluster core is a fairly dense group with a maximal number of pairwise similar objects. It represents the core of a cluster, as all objects in the cluster are attracted to it to a great degree. As a result, building clusters from cluster cores achieves high accuracy. Other major characteristics of the approach include: (1) it uses a semantics-based similarity measure; (2) it does not incur the curse of dimensionality and scales linearly with the dimensionality of the data; (3) it outperforms the well-known clustering algorithm ROCK, with both lower time complexity and higher accuracy.
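
To make the notion concrete, a cluster core can be grown greedily by repeatedly adding any object that is similar to every current member above a threshold. The sketch below does exactly that on a toy similarity matrix; the paper's semantics-based similarity measure and its actual core construction are not reproduced.

def grow_core(sim, threshold, seed):
    """Greedily grow a set whose members are all pairwise similar above `threshold`.

    sim: dict-of-dicts of pairwise similarities; seed: the object to start from.
    """
    core = {seed}
    candidates = set(sim.keys()) - core
    changed = True
    while changed:
        changed = False
        for obj in sorted(candidates):
            if all(sim[obj][member] >= threshold for member in core):
                core.add(obj)
                candidates.discard(obj)
                changed = True
    return core

# toy similarity matrix over 4 objects: a, b, c are mutually similar, d is not
sim = {"a": {"a": 1.0, "b": 0.9, "c": 0.8, "d": 0.1},
       "b": {"a": 0.9, "b": 1.0, "c": 0.7, "d": 0.2},
       "c": {"a": 0.8, "b": 0.7, "c": 1.0, "d": 0.3},
       "d": {"a": 0.1, "b": 0.2, "c": 0.3, "d": 1.0}}
print(grow_core(sim, threshold=0.6, seed="a"))   # -> {'a', 'b', 'c'}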

Book Chapter•DOI•
01 Apr 2004
TL;DR: This chapter presents three examples of actionable Web log mining, including an example of applying Web query log knowledge to improve Web search for a search engine application.
Abstract: Every day, popular Websites attract millions of visitors. These visitors leave behind vast amounts of Website traversal information in the form of Web server and query logs. By analyzing these logs, it is possible to discover various kinds of knowledge, which can be applied to improve the performance of Web services. A particularly useful kind of knowledge is knowledge that can be immediately applied to the operation of the Websites; we call this type of knowledge actionable knowledge. In this chapter, we present three examples of actionable Web log mining. The first method is to mine a Web log for Markov models that can be used for improving the caching and prefetching of Web objects. A second method is to use the mined knowledge for building better, adaptive user interfaces; the new user interface can adjust as the user's behavior changes over time. Finally, we present an example of applying Web query log knowledge to improving Web search for a search engine application.
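
The first of the three examples, mining a Web log for Markov models used in caching and prefetching, can be pictured with a first-order model that prefetches any page whose transition probability from the current page clears a threshold. The sketch below is a generic illustration with made-up sessions, not the chapter's exact models.

from collections import defaultdict

def build_markov_model(sessions):
    """First-order Markov model: P(next page | current page), estimated from log sessions."""
    counts = defaultdict(lambda: defaultdict(int))
    for s in sessions:
        for cur, nxt in zip(s, s[1:]):
            counts[cur][nxt] += 1
    return {page: {n: c / sum(nexts.values()) for n, c in nexts.items()}
            for page, nexts in counts.items()}

def pages_to_prefetch(model, current_page, min_prob=0.4):
    """Prefetch every successor whose transition probability clears the threshold."""
    return [p for p, prob in model.get(current_page, {}).items() if prob >= min_prob]

sessions = [["/home", "/news", "/sports"],
            ["/home", "/news", "/weather"],
            ["/home", "/mail"]]
model = build_markov_model(sessions)
print(pages_to_prefetch(model, "/home"))   # ['/news']: fetched into the cache in advance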

Journal Article•DOI•
TL;DR: This special issue of IEEE Intelligent Systems features five articles that address the problem of actionable Web mining.
Abstract: The Web, with its resources and users, offers a wealth of information for data mining and knowledge discovery. Up to now, a great deal of work has been done applying data mining and machine learning methods to discover novel and useful knowledge on the Web. However, many techniques aim only at extracting knowledge for human users to view and use. Recently, more and more work addresses mining the Web for knowledge that computer systems themselves will use. You can apply such actionable knowledge back to the Web for measurable performance improvements. This special issue of IEEE Intelligent Systems features five articles that address the problem of actionable Web mining.

Book Chapter•DOI•
09 Aug 2004
TL;DR: An approach to utilizing the correlative information among features to compute the similarity of cases for case retrieval is provided by extending the dot product-based linear similarity measures to their nonlinear versions with kernel functions.
Abstract: Case retrieval in case-based reasoning relies heavily on the design of a good similarity function. This paper provides an approach to utilizing the correlative information among features to compute the similarity of cases for case retrieval. This is achieved by extending the dot product-based linear similarity measures to their nonlinear versions with kernel functions. An application to the peptide retrieval problem in bioinformatics shows the effectiveness of the approach. In this problem, the objective is to retrieve the peptide corresponding to the input tandem mass spectrum from a large database of known peptides. By using a kernel function to implicitly map the tandem mass spectrum to a high-dimensional space, the correlative information among fragment ions in a tandem mass spectrum can be modeled to dramatically reduce stochastic mismatches. The experiment on a real spectra dataset shows a significant reduction of 10% in the error rate as compared to a common linear similarity function.
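
In general terms, kernelizing a dot-product similarity simply means scoring a query case against stored cases with a kernel function instead of the raw inner product. The sketch below does this with a Gaussian (RBF) kernel on generic feature vectors; the specific kernel the paper designs for tandem mass spectra is not reproduced, and all data are synthetic.

import numpy as np

def rbf_kernel(x, y, gamma=0.5):
    """Gaussian kernel: a nonlinear similarity that implicitly maps to a feature space."""
    return float(np.exp(-gamma * np.sum((x - y) ** 2)))

def retrieve(case_base, query, k=3, kernel=rbf_kernel):
    """Return the indices of the k cases most similar to the query under the kernel."""
    scores = [kernel(case, query) for case in case_base]
    return sorted(range(len(case_base)), key=lambda i: -scores[i])[:k]

rng = np.random.default_rng(0)
case_base = rng.normal(size=(20, 8))        # e.g. binned spectrum vectors of known peptides
query = case_base[7] + 0.05 * rng.normal(size=8)
print(retrieve(case_base, query, k=3))      # case 7 should rank first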

01 Jan 2004
TL;DR: In this paper, the authors used a support vector machine (SVM) to solve the protein homology prediction task in the KDD Cup 2004 competition, which was modeled as a supervised learning problem with multiple performance metrics.
Abstract: This paper describes our solution for the protein homology prediction task in the KDD Cup 2004 competition. This task is modeled as a supervised learning problem with multiple performance metrics. Several key characteristics make the problem both novel and challenging, including the concept of data blocks and the presence of large-scale and imbalanced training data. These features make a naive application of traditional classification algorithms infeasible. Our approach focuses on making full use of the abundant information within the blocks, and on developing a new technique for reducing and balancing the training data to make the support vector machine applicable to this kind of large-scale, imbalanced learning task.

Proceedings Article•DOI•
06 Dec 2004
TL;DR: An adaptive CBR model that can learn continually by detecting feedback from the outside is constructed to enhance the system's ability to solve problems in a dynamic environment.
Abstract: Adaptation is one of the necessary capabilities of any expert system. In a traditional expert system, the evolving environment is often treated from a static viewpoint, and the system responds to change passively. Our focus in this paper is to construct an adaptive CBR model that can learn continually by detecting feedback from the outside, partially alleviating this limitation. The knowledge base is improved gradually so as to enhance the system's ability to solve problems in a dynamic environment.