Author

Yunfa Hu

Bio: Yunfa Hu is an academic researcher from Fudan University. The author has contributed to research in topics: Categorization & Association rule learning. The author has an h-index of 8 and has co-authored 43 publications receiving 224 citations.

Papers
Journal Article
TL;DR: In this article, the authors used the maximum entropy model for text categorization and compared it to Bayes, KNN, and SVM, showing that its performance is higher than Bayes and comparable with KNN and SVM.
Abstract: The maximum entropy model is a probability estimation technique widely used for a variety of natural language tasks. It offers a clean and accommodating framework for combining diverse pieces of contextual information to estimate the probability of a certain linguistic phenomenon. For many NLP tasks this approach performs at near state-of-the-art levels, or outperforms other competing probability methods when trained and tested under similar conditions. In this paper, we use the maximum entropy model for text categorization. We compare and analyze its categorization performance using different approaches to text feature generation, different numbers of features, and smoothing techniques. Moreover, in experiments we compare it to Bayes, KNN, and SVM, and show that its performance is higher than Bayes and comparable with KNN and SVM. We think it is a promising technique for text categorization.

35 citations
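
As a concrete illustration of the approach above, here is a minimal sketch of a maximum-entropy text categorizer, assuming scikit-learn; multinomial logistic regression is the standard formulation of a maxent classifier, L2 regularization plays the role of the smoothing the paper discusses, and the 20 Newsgroups corpus stands in for the paper's data.

```python
# Minimal maximum-entropy text categorizer (multinomial logistic
# regression). Corpus and feature settings are illustrative only.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.pipeline import make_pipeline

train = fetch_20newsgroups(subset="train")
test = fetch_20newsgroups(subset="test")

model = make_pipeline(
    # Binary word-presence features; max_features caps the feature
    # count, one of the knobs the paper varies.
    CountVectorizer(binary=True, max_features=10_000),
    # L2 regularization acts as the smoothing for the maxent model.
    LogisticRegression(max_iter=1000),
)
model.fit(train.data, train.target)
print("macro-F1:", f1_score(test.target, model.predict(test.data), average="macro"))
```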

Book ChapterDOI
Shuigeng Zhou, Aoying Zhou, Jing Cao, Jin Wen, Ye Fan, Yunfa Hu
18 Apr 2000
TL;DR: Two sampling-based DBSCAN (SDBSCAN) algorithms are developed that are effective and efficient in clustering large-scale spatial databases.
Abstract: In this paper, we combine a sampling technique with the DBSCAN algorithm to cluster large spatial databases, and develop two sampling-based DBSCAN (SDBSCAN) algorithms: one introduces the sampling technique inside DBSCAN, and the other uses a sampling procedure outside DBSCAN. Experimental results demonstrate that our algorithms are effective and efficient in clustering large-scale spatial databases.

32 citations
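
A rough sketch of the "sampling outside DBSCAN" variant described above, assuming scikit-learn: cluster a random sample, then assign each remaining point to the cluster of its nearest core sample. The synthetic data and all parameter values are illustrative, not the paper's.

```python
# Cluster a sample, then propagate labels to the full data set.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

X, _ = make_blobs(n_samples=100_000, centers=5, random_state=0)

# Draw a 5% random sample and run DBSCAN on it.
rng = np.random.default_rng(0)
sample = X[rng.choice(len(X), size=5_000, replace=False)]
db = DBSCAN(eps=0.5, min_samples=10).fit(sample)

# Assign every point to its nearest core sample's cluster; points
# farther than eps from any core sample are marked as noise (-1).
cores = sample[db.core_sample_indices_]
core_labels = db.labels_[db.core_sample_indices_]
dist, idx = NearestNeighbors(n_neighbors=1).fit(cores).kneighbors(X)
labels = np.where(dist[:, 0] <= 0.5, core_labels[idx[:, 0]], -1)
print("clusters:", len(set(labels) - {-1}))
```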

Proceedings ArticleDOI
30 Jul 2007
TL;DR: A novel concept, the critical point (CP), is proposed, and traditional kNN is adapted by integrating CP's approximate values (LB or UB) and training numbers with the decision rules; the adapted kNN achieves significant classification performance improvement on biased corpora.
Abstract: Many standard classification algorithms assume that the training examples are evenly distributed among different classes. However, unbalanced data sets often appear in many applications. As a simple, effective categorization method, kNN is widely used, but it too suffers from biased data sets. In developing the Prototype of Internet Information Security for the Shanghai Council of Information and Security, we observed that when the training data set is biased, almost all test documents of some rare categories are classified into common ones. To alleviate this problem, we propose a novel concept, the critical point (CP), and adapt traditional kNN by integrating CP's approximate values (LB or UB) and training numbers with the decision rules. Exhaustive experiments illustrate that the adapted kNN achieves significant classification performance improvement on biased corpora.

18 citations
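
The paper's critical-point computation is not reproduced here; the sketch below only illustrates the family of fixes it belongs to, adjusting kNN's decision rule so rare categories are not swamped by common ones. Inverse class-frequency vote weighting is a simple stand-in for the CP-based rule, and biased_knn_predict is a hypothetical helper name.

```python
# kNN with class-size-corrected voting: each neighbor's vote counts
# more when its class is rare in the training set.
from collections import Counter
from sklearn.neighbors import NearestNeighbors

def biased_knn_predict(X_train, y_train, x, k=15):
    class_counts = Counter(y_train)
    # Refitting per call is wasteful; fine for a sketch.
    _, idx = NearestNeighbors(n_neighbors=k).fit(X_train).kneighbors([x])
    votes = {}
    for i in idx[0]:
        label = y_train[i]
        votes[label] = votes.get(label, 0.0) + 1.0 / class_counts[label]
    return max(votes, key=votes.get)
```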

Proceedings ArticleDOI
Rong-Lu Li, Yunfa Hu
02 Nov 2003
TL;DR: A density-based method for reducing the noise in training data, which addresses the KNN classifier's large computational demands and its loss of classification precision.
Abstract: With the rapid development of the World Wide Web, text classification has become a key technology for organizing and processing large amounts of document data. As a simple and effective classification approach, the KNN method is widely used in text categorization. However, the KNN classifier not only makes large computational demands but may also lose classification precision because of the uneven density of the training data. In this paper, we present a density-based method for reducing the noise in training data, which addresses these problems. Our experimental results illustrate this as well.

18 citations
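
The abstract does not spell out the paper's exact density criterion, so the sketch below uses an edited-nearest-neighbours style filter as a stand-in: a training example is dropped when most of its k nearest neighbors carry a different label. filter_training_set is a hypothetical name; assumes NumPy arrays and scikit-learn.

```python
# Drop training points whose neighborhoods mostly disagree with them.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def filter_training_set(X, y, k=10):
    # k+1 neighbors because each point is its own nearest neighbor.
    _, idx = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
    keep = [i for i in range(len(X))
            if np.mean(y[idx[i, 1:]] == y[i]) >= 0.5]
    return X[keep], y[keep]
```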

Proceedings ArticleDOI
29 Oct 2007
TL;DR: This paper presents an ontology-based deep Web classification approach comprising a category ontology model and a deep Web vector space model (VSM), which achieves good performance with an average precision of 91.6% and an average recall of 92.4%.
Abstract: Research on deep Web classification is an important area in large-scale deep Web integration and is still at an early stage. Many deep Web sources are structured, providing structured query interfaces and results. Classifying such structured sources into domains is one of the critical steps toward the integration of heterogeneous Web sources. In this paper, we present an ontology-based deep Web classification approach, which includes a category ontology model and a deep Web vector space model (VSM). The experimental results show that it achieves good performance, with an average precision of 91.6% and an average recall of 92.4%.

15 citations
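
A toy sketch of classifying a deep Web source from its query-interface text. The category ontology is reduced here to a keyword list per domain, which is an assumption of this sketch; the paper's ontology model and deep Web VSM are richer than this.

```python
# Classify a query interface by cosine similarity of its form labels
# to per-domain keyword vectors. Domains and keywords are made up.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

domain_keywords = {
    "books":   "title author isbn publisher edition",
    "airfare": "departure destination airline passengers date flight",
    "autos":   "make model year mileage price dealer",
}

vec = TfidfVectorizer()
centroids = vec.fit_transform(domain_keywords.values())

def classify_interface(form_labels: str) -> str:
    sims = cosine_similarity(vec.transform([form_labels]), centroids)[0]
    return list(domain_keywords)[sims.argmax()]

print(classify_interface("from city to city date return date adults"))
```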


Cited by
Book
01 Dec 2006
TL;DR: Providing an in-depth examination of core text mining and link detection algorithms and operations, this text examines advanced pre-processing techniques, knowledge representation considerations, and visualization approaches.
Abstract: Table of contents:
1. Introduction to text mining
2. Core text mining operations
3. Text mining preprocessing techniques
4. Categorization
5. Clustering
6. Information extraction
7. Probabilistic models for information extraction
8. Preprocessing applications using probabilistic and hybrid approaches
9. Presentation-layer considerations for browsing and query refinement
10. Visualization approaches
11. Link analysis
12. Text mining applications
Appendix. Bibliography.

1,628 citations

Proceedings ArticleDOI
24 Aug 2004
TL;DR: This paper presents an improved sampling-based DBSCAN which can cluster large-scale spatial databases effectively and outperforms DBSCAN as well as its other counterparts in terms of execution time, without losing clustering quality.
Abstract: Spatial data clustering is one of the important data mining techniques for extracting knowledge from the large amounts of spatial data collected in various applications, such as remote sensing, GIS, computer cartography, environmental assessment and planning, etc. Several useful and popular spatial data clustering algorithms have been proposed in the past decade. DBSCAN is one of them; it can discover clusters of arbitrary shape and handle noise points effectively. However, DBSCAN requires a large volume of memory because it operates on the entire database. This paper presents an improved sampling-based DBSCAN which can cluster large-scale spatial databases effectively. Experimental results are included to establish that the proposed sampling-based DBSCAN outperforms DBSCAN as well as its other counterparts, in terms of execution time, without losing the quality of clustering.

182 citations
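
The improved algorithm itself is not reproduced here; the snippet below only illustrates the cost gap that motivates sampling, timing stock scikit-learn DBSCAN on a full synthetic data set versus a 5% sample. The numbers are machine-dependent and illustrative.

```python
# Compare DBSCAN runtime on full data versus a 5% random sample.
import time
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=100_000, centers=4, random_state=1)
sample = X[np.random.default_rng(1).choice(len(X), 5_000, replace=False)]

for name, data in [("full", X), ("5% sample", sample)]:
    t0 = time.perf_counter()
    DBSCAN(eps=0.4, min_samples=10).fit(data)
    print(f"{name}: {time.perf_counter() - t0:.1f}s")
```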

Journal ArticleDOI
TL;DR: Experimental results indicate that the proposed scheme can significantly improve the original undersampling-based methods in terms of three popular metrics for imbalanced classification, i.e., the area under the curve, F-measure, and G-mean.
Abstract: Under-sampling is a popular data preprocessing method for dealing with class imbalance problems, with the purposes of balancing datasets to achieve a high classification rate and avoiding bias toward majority class examples. It always uses the full minority data in a training dataset. However, some noisy minority examples may reduce the performance of classifiers. In this paper, a new under-sampling scheme is proposed by incorporating a noise filter before executing resampling. In order to verify its efficiency, this scheme is implemented on top of four popular under-sampling methods, i.e., Undersampling + Adaboost, RUSBoost, UnderBagging, and EasyEnsemble, through benchmarks and significance analysis. Furthermore, this paper also summarizes the relationship between algorithm performance and imbalance ratio. Experimental results indicate that the proposed scheme can significantly improve the original undersampling-based methods in terms of three popular metrics for imbalanced classification, i.e., the area under the curve, F-measure, and G-mean.

172 citations
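
A minimal sketch of the scheme described above, filter first and resample second, assuming the imbalanced-learn library. The abstract does not name a specific noise filter, so EditedNearestNeighbours stands in for the filtering step and random under-sampling for the resampling step.

```python
# Noise filtering followed by under-sampling on a synthetic
# imbalanced data set (95% / 5%).
from collections import Counter
from imblearn.under_sampling import EditedNearestNeighbours, RandomUnderSampler
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=0)

# Step 1: remove label-noise from all classes before balancing.
X_f, y_f = EditedNearestNeighbours(sampling_strategy="all").fit_resample(X, y)
# Step 2: randomly under-sample the majority class to balance.
X_b, y_b = RandomUnderSampler(random_state=0).fit_resample(X_f, y_f)
print("before:", Counter(y), "after:", Counter(y_b))
```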

BookDOI
01 Jan 1999
TL;DR: This paper provides a survey of various data mining techniques for advanced database applications, including association rule generation, clustering, and classification, with a focus on high-dimensional data spaces with large volumes of data.

Abstract: This paper provides a survey of various data mining techniques for advanced database applications. These include association rule generation, clustering, and classification. With the recent increase in large online repositories of information, such techniques have great importance. The focus is on high-dimensional data spaces with large volumes of data. The paper discusses past research on the topic and also studies the corresponding algorithms and applications.

131 citations

Journal ArticleDOI
TL;DR: Experimental results show that the models using MBPNN outperform the basic BPNN, and that the application of LSA to this system can lead to dramatic dimensionality reduction while achieving good classification results.
Abstract: New text categorization models using a back-propagation neural network (BPNN) and a modified back-propagation neural network (MBPNN) are proposed. An efficient feature selection method is used to reduce the dimensionality as well as improve performance. The basic BPNN learning algorithm has the drawback of slow training speed, so we modify it to accelerate training; categorization accuracy is also improved as a consequence. Traditional word-matching-based text categorization systems use the vector space model (VSM) to represent documents. However, the VSM needs a high-dimensional space to represent a document and does not take into account the semantic relationships between terms, which can lead to poor classification accuracy. Latent semantic analysis (LSA) can overcome these problems by using statistically derived conceptual indices instead of individual words. It constructs a conceptual vector space in which each term or document is represented as a vector. It not only greatly reduces the dimensionality but also discovers important associative relationships between terms. We test our categorization models on the 20 Newsgroups data set; experimental results show that the models using MBPNN outperform the basic BPNN, and the application of LSA to our system leads to dramatic dimensionality reduction while achieving good classification results.

115 citations
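
A compact sketch of the TF-IDF → LSA → neural-network pipeline described above, assuming scikit-learn. MLPClassifier is a plain back-propagation network standing in for the paper's MBPNN, whose speed-up modification is not reproduced here; the component counts are illustrative.

```python
# TF-IDF features, compressed with truncated SVD (LSA), fed to a
# feed-forward back-propagation network.
from sklearn.datasets import fetch_20newsgroups
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline

train = fetch_20newsgroups(subset="train")
test = fetch_20newsgroups(subset="test")

model = make_pipeline(
    TfidfVectorizer(max_features=20_000),
    TruncatedSVD(n_components=300),   # LSA: ~20,000 dims -> 300 concepts
    MLPClassifier(hidden_layer_sizes=(100,), max_iter=50),
)
model.fit(train.data, train.target)
print("accuracy:", model.score(test.data, test.target))
```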