Author

Yunfa Hu

Bio: Yunfa Hu is an academic researcher from Fudan University. The author has contributed to research in topics: Categorization & Association rule learning. The author has an h-index of 8 and has co-authored 43 publications receiving 224 citations.

Papers
Journal Article
TL;DR: In this article, the authors used the maximum entropy model for text categorization and compared it to Bayes, KNN, and SVM, showing that its performance is higher than Bayes and comparable with KNN and SVM.
Abstract: The maximum entropy model is a probability estimation technique widely used for a variety of natural language tasks. It offers a clean and accommodating framework for combining diverse pieces of contextual information to estimate the probability of a certain linguistic phenomenon. For many NLP tasks this approach performs at near state-of-the-art levels, or outperforms other competing probability methods when trained and tested under similar conditions. In this paper, we use the maximum entropy model for text categorization. We compare and analyze its categorization performance using different approaches to text feature generation, different numbers of features, and smoothing techniques. Moreover, in experiments we compare it to Bayes, KNN, and SVM, and show that its performance is higher than Bayes and comparable with KNN and SVM. We think it is a promising technique for text categorization.

35 citations
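
As a concrete illustration of the approach above, here is a minimal sketch of a maximum-entropy text categorizer, assuming scikit-learn; multinomial logistic regression is the standard formulation of a maxent classifier, L2 regularization plays the role of the smoothing the paper discusses, and the 20 Newsgroups corpus stands in for the paper's data.

```python
# Minimal maximum-entropy text categorizer (multinomial logistic
# regression). Corpus and feature settings are illustrative only.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.pipeline import make_pipeline

train = fetch_20newsgroups(subset="train")
test = fetch_20newsgroups(subset="test")

model = make_pipeline(
    # Binary word-presence features; max_features caps the feature
    # count, one of the knobs the paper varies.
    CountVectorizer(binary=True, max_features=10_000),
    # L2 regularization acts as the smoothing for the maxent model.
    LogisticRegression(max_iter=1000),
)
model.fit(train.data, train.target)
print("macro-F1:", f1_score(test.target, model.predict(test.data), average="macro"))
```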

Book ChapterDOI
Shuigeng Zhou, Aoying Zhou, Jing Cao, Jin Wen, Ye Fan, Yunfa Hu
18 Apr 2000
TL;DR: Two sampling-based DBSCAN (SDBSCAN) algorithms are developed that are effective and efficient in clustering large-scale spatial databases.
Abstract: In this paper, we combine a sampling technique with the DBSCAN algorithm to cluster large spatial databases, and develop two sampling-based DBSCAN (SDBSCAN) algorithms: one introduces the sampling technique inside DBSCAN, and the other uses a sampling procedure outside DBSCAN. Experimental results demonstrate that our algorithms are effective and efficient in clustering large-scale spatial databases.

32 citations
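
A rough sketch of the "sampling outside DBSCAN" variant described above, assuming scikit-learn: cluster a random sample, then assign each remaining point to the cluster of its nearest core sample. The synthetic data and all parameter values are illustrative, not the paper's.

```python
# Cluster a sample, then propagate labels to the full data set.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

X, _ = make_blobs(n_samples=100_000, centers=5, random_state=0)

# Draw a 5% random sample and run DBSCAN on it.
rng = np.random.default_rng(0)
sample = X[rng.choice(len(X), size=5_000, replace=False)]
db = DBSCAN(eps=0.5, min_samples=10).fit(sample)

# Assign every point to its nearest core sample's cluster; points
# farther than eps from any core sample are marked as noise (-1).
cores = sample[db.core_sample_indices_]
core_labels = db.labels_[db.core_sample_indices_]
dist, idx = NearestNeighbors(n_neighbors=1).fit(cores).kneighbors(X)
labels = np.where(dist[:, 0] <= 0.5, core_labels[idx[:, 0]], -1)
print("clusters:", len(set(labels) - {-1}))
```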

Proceedings ArticleDOI
30 Jul 2007
TL;DR: A novel concept, the critical point (CP), is proposed, and traditional kNN is adapted by integrating CP's approximate values (LB or UB) and training numbers with the decision rules; the adapted kNN achieves significant classification performance improvement on biased corpora.
Abstract: Many standard classification algorithms assume that the training examples are evenly distributed among different classes. However, unbalanced data sets often appear in many applications. As a simple, effective categorization method, kNN is widely used, but it too suffers from biased data sets. In developing the Prototype of Internet Information Security for the Shanghai Council of Information and Security, we observed that when the training data set is biased, almost all test documents of some rare categories are classified into common ones. To alleviate this problem, we propose a novel concept, the critical point (CP), and adapt traditional kNN by integrating CP's approximate values (LB or UB) and training numbers with the decision rules. Exhaustive experiments illustrate that the adapted kNN achieves significant classification performance improvement on biased corpora.

18 citations
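
The paper's critical-point computation is not reproduced here; the sketch below only illustrates the family of fixes it belongs to, adjusting kNN's decision rule so rare categories are not swamped by common ones. Inverse class-frequency vote weighting is a simple stand-in for the CP-based rule, and biased_knn_predict is a hypothetical helper name.

```python
# kNN with class-size-corrected voting: each neighbor's vote counts
# more when its class is rare in the training set.
from collections import Counter
from sklearn.neighbors import NearestNeighbors

def biased_knn_predict(X_train, y_train, x, k=15):
    class_counts = Counter(y_train)
    # Refitting per call is wasteful; fine for a sketch.
    _, idx = NearestNeighbors(n_neighbors=k).fit(X_train).kneighbors([x])
    votes = {}
    for i in idx[0]:
        label = y_train[i]
        votes[label] = votes.get(label, 0.0) + 1.0 / class_counts[label]
    return max(votes, key=votes.get)
```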

Proceedings ArticleDOI
Rong-Lu Li, Yunfa Hu
02 Nov 2003
TL;DR: A density-based method for reducing the noise in training data, which addresses the KNN classifier's large computational demands and its loss of classification precision.
Abstract: With the rapid development of the World Wide Web, text classification has become a key technology for organizing and processing large amounts of document data. As a simple and effective classification approach, the KNN method is widely used in text categorization. However, the KNN classifier not only makes large computational demands but may also lose classification precision because of the uneven density of the training data. In this paper, we present a density-based method for reducing the noise in training data, which addresses these problems. Our experimental results illustrate this as well.

18 citations
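
The abstract does not spell out the paper's exact density criterion, so the sketch below uses an edited-nearest-neighbours style filter as a stand-in: a training example is dropped when most of its k nearest neighbors carry a different label. filter_training_set is a hypothetical name; assumes NumPy arrays and scikit-learn.

```python
# Drop training points whose neighborhoods mostly disagree with them.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def filter_training_set(X, y, k=10):
    # k+1 neighbors because each point is its own nearest neighbor.
    _, idx = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
    keep = [i for i in range(len(X))
            if np.mean(y[idx[i, 1:]] == y[i]) >= 0.5]
    return X[keep], y[keep]
```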

Proceedings ArticleDOI
29 Oct 2007
TL;DR: This paper presents an ontology-based deep Web classification approach comprising a category ontology model and a deep Web vector space model (VSM), which achieves good performance with an average precision of 91.6% and an average recall of 92.4%.
Abstract: Research on deep Web classification is an important area in large-scale deep Web integration and is still at an early stage. Many deep Web sources are structured, providing structured query interfaces and results. Classifying such structured sources into domains is one of the critical steps toward the integration of heterogeneous Web sources. In this paper, we present an ontology-based deep Web classification approach, which includes a category ontology model and a deep Web vector space model (VSM). The experimental results show that it achieves good performance, with an average precision of 91.6% and an average recall of 92.4%.

15 citations
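
A toy sketch of classifying a deep Web source from its query-interface text. The category ontology is reduced here to a keyword list per domain, which is an assumption of this sketch; the paper's ontology model and deep Web VSM are richer than this.

```python
# Classify a query interface by cosine similarity of its form labels
# to per-domain keyword vectors. Domains and keywords are made up.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

domain_keywords = {
    "books":   "title author isbn publisher edition",
    "airfare": "departure destination airline passengers date flight",
    "autos":   "make model year mileage price dealer",
}

vec = TfidfVectorizer()
centroids = vec.fit_transform(domain_keywords.values())

def classify_interface(form_labels: str) -> str:
    sims = cosine_similarity(vec.transform([form_labels]), centroids)[0]
    return list(domain_keywords)[sims.argmax()]

print(classify_interface("from city to city date return date adults"))
```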


Cited by
Book
01 Dec 2006
TL;DR: Providing an in-depth examination of core text mining and link detection algorithms and operations, this text examines advanced pre-processing techniques, knowledge representation considerations, and visualization approaches.
Abstract: Table of contents:
1. Introduction to text mining
2. Core text mining operations
3. Text mining preprocessing techniques
4. Categorization
5. Clustering
6. Information extraction
7. Probabilistic models for information extraction
8. Preprocessing applications using probabilistic and hybrid approaches
9. Presentation-layer considerations for browsing and query refinement
10. Visualization approaches
11. Link analysis
12. Text mining applications
Appendix. Bibliography.

1,628 citations

Proceedings ArticleDOI
24 Aug 2004
TL;DR: This paper presents an improved sampling-based DBSCAN which can cluster large-scale spatial databases effectively and outperforms DBSCAN as well as its other counterparts in terms of execution time, without losing clustering quality.
Abstract: Spatial data clustering is one of the important data mining techniques for extracting knowledge from the large amounts of spatial data collected in various applications, such as remote sensing, GIS, computer cartography, environmental assessment and planning, etc. Several useful and popular spatial data clustering algorithms have been proposed in the past decade. DBSCAN is one of them; it can discover clusters of arbitrary shape and handle noise points effectively. However, DBSCAN requires a large volume of memory because it operates on the entire database. This paper presents an improved sampling-based DBSCAN which can cluster large-scale spatial databases effectively. Experimental results are included to establish that the proposed sampling-based DBSCAN outperforms DBSCAN as well as its other counterparts, in terms of execution time, without losing the quality of clustering.

182 citations
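
The improved algorithm itself is not reproduced here; the snippet below only illustrates the cost gap that motivates sampling, timing stock scikit-learn DBSCAN on a full synthetic data set versus a 5% sample. The numbers are machine-dependent and illustrative.

```python
# Compare DBSCAN runtime on full data versus a 5% random sample.
import time
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=100_000, centers=4, random_state=1)
sample = X[np.random.default_rng(1).choice(len(X), 5_000, replace=False)]

for name, data in [("full", X), ("5% sample", sample)]:
    t0 = time.perf_counter()
    DBSCAN(eps=0.4, min_samples=10).fit(data)
    print(f"{name}: {time.perf_counter() - t0:.1f}s")
```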

Journal ArticleDOI
TL;DR: Experimental results indicate that the proposed scheme can significantly improve the original undersampling-based methods in terms of three popular metrics for imbalanced classification, i.e., the area under the curve, F-measure, and G-mean.
Abstract: Under-sampling is a popular data preprocessing method for dealing with class imbalance problems, with the purposes of balancing datasets to achieve a high classification rate and avoiding bias toward majority class examples. It always uses the full minority data in a training dataset. However, some noisy minority examples may reduce the performance of classifiers. In this paper, a new under-sampling scheme is proposed by incorporating a noise filter before executing resampling. In order to verify its efficiency, this scheme is implemented on top of four popular under-sampling methods, i.e., Undersampling + Adaboost, RUSBoost, UnderBagging, and EasyEnsemble, through benchmarks and significance analysis. Furthermore, this paper also summarizes the relationship between algorithm performance and imbalance ratio. Experimental results indicate that the proposed scheme can significantly improve the original undersampling-based methods in terms of three popular metrics for imbalanced classification, i.e., the area under the curve, F-measure, and G-mean.

172 citations
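
A minimal sketch of the scheme described above, filter first and resample second, assuming the imbalanced-learn library. The abstract does not name a specific noise filter, so EditedNearestNeighbours stands in for the filtering step and random under-sampling for the resampling step.

```python
# Noise filtering followed by under-sampling on a synthetic
# imbalanced data set (95% / 5%).
from collections import Counter
from imblearn.under_sampling import EditedNearestNeighbours, RandomUnderSampler
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=0)

# Step 1: remove label-noise from all classes before balancing.
X_f, y_f = EditedNearestNeighbours(sampling_strategy="all").fit_resample(X, y)
# Step 2: randomly under-sample the majority class to balance.
X_b, y_b = RandomUnderSampler(random_state=0).fit_resample(X_f, y_f)
print("before:", Counter(y), "after:", Counter(y_b))
```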

BookDOI
01 Jan 1999
TL;DR: This paper provides a survey of various data mining techniques for advanced database applications, including association rule generation, clustering, and classification, with a focus on high-dimensional data spaces with large volumes of data.

Abstract: This paper provides a survey of various data mining techniques for advanced database applications. These include association rule generation, clustering, and classification. With the recent increase in large online repositories of information, such techniques have great importance. The focus is on high-dimensional data spaces with large volumes of data. The paper discusses past research on the topic and also studies the corresponding algorithms and applications.

131 citations

Journal ArticleDOI
TL;DR: Experimental results show that the models using MBPNN outperform the basic BPNN, and that the application of LSA to this system can lead to dramatic dimensionality reduction while achieving good classification results.
Abstract: New text categorization models using a back-propagation neural network (BPNN) and a modified back-propagation neural network (MBPNN) are proposed. An efficient feature selection method is used to reduce the dimensionality as well as improve performance. The basic BPNN learning algorithm has the drawback of slow training speed, so we modify it to accelerate training; categorization accuracy is also improved as a consequence. Traditional word-matching-based text categorization systems use the vector space model (VSM) to represent documents. However, the VSM needs a high-dimensional space to represent a document and does not take into account the semantic relationships between terms, which can lead to poor classification accuracy. Latent semantic analysis (LSA) can overcome these problems by using statistically derived conceptual indices instead of individual words. It constructs a conceptual vector space in which each term or document is represented as a vector. It not only greatly reduces the dimensionality but also discovers important associative relationships between terms. We test our categorization models on the 20 Newsgroups data set; experimental results show that the models using MBPNN outperform the basic BPNN, and the application of LSA to our system leads to dramatic dimensionality reduction while achieving good classification results.

115 citations
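
A compact sketch of the TF-IDF → LSA → neural-network pipeline described above, assuming scikit-learn. MLPClassifier is a plain back-propagation network standing in for the paper's MBPNN, whose speed-up modification is not reproduced here; the component counts are illustrative.

```python
# TF-IDF features, compressed with truncated SVD (LSA), fed to a
# feed-forward back-propagation network.
from sklearn.datasets import fetch_20newsgroups
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline

train = fetch_20newsgroups(subset="train")
test = fetch_20newsgroups(subset="test")

model = make_pipeline(
    TfidfVectorizer(max_features=20_000),
    TruncatedSVD(n_components=300),   # LSA: ~20,000 dims -> 300 concepts
    MLPClassifier(hidden_layer_sizes=(100,), max_iter=50),
)
model.fit(train.data, train.target)
print("accuracy:", model.score(test.data, test.target))
```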