Showing papers on "Noisy text analytics published in 2014"


Proceedings ArticleDOI
06 Jan 2014
TL;DR: This work developed a prototypical system called the Word Cloud Explorer that relies entirely on word clouds as a visualization method, shows how this approach can be used effectively to solve text analysis tasks, and evaluates it in a qualitative user study.
Abstract: Word clouds have emerged as a straightforward and visually appealing visualization method for text. They are used in various contexts as a means to provide an overview by distilling text down to those words that appear with highest frequency. Typically, this is done in a static way as pure text summarization. We think, however, that there is a larger potential to this simple yet powerful visualization paradigm in text analytics. In this work, we explore the usefulness of word clouds for general text analysis tasks. We developed a prototypical system called the Word Cloud Explorer that relies entirely on word clouds as a visualization method. It equips them with advanced natural language processing, sophisticated interaction techniques, and context information. We show how this approach can be effectively used to solve text analysis tasks and evaluate it in a qualitative user study.
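
The distillation step behind every word cloud is a frequency tally over content words. A minimal sketch of that step (the function name and stopword list are illustrative, not taken from the Word Cloud Explorer itself):

```python
from collections import Counter
import re

# Illustrative stopword list; a real system would use a much fuller one.
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it", "that", "so"}

def top_words(text, k=25):
    """Return the k most frequent content words -- the raw material of a word cloud."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(t for t in tokens if t not in STOPWORDS and len(t) > 2)
    return counts.most_common(k)

print(top_words("Word clouds distill text down to the words that appear most often, "
                "so frequent words dominate the cloud.", k=5))
```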

318 citations


Proceedings ArticleDOI
23 Jun 2014
TL;DR: This paper proposes a novel multi-scale representation for scene text recognition that consists of a set of detectable primitives, termed strokelets, which capture the essential substructures of characters at different granularities.
Abstract: Driven by the wide range of applications, scene text detection and recognition have become active research topics in computer vision. Though extensively studied, localizing and reading text in uncontrolled environments remain extremely challenging, due to various interference factors. In this paper, we propose a novel multi-scale representation for scene text recognition. This representation consists of a set of detectable primitives, termed strokelets, which capture the essential substructures of characters at different granularities. Strokelets possess four distinctive advantages: (1) Usability: automatically learned from bounding box labels; (2) Robustness: insensitive to interference factors; (3) Generality: applicable to different languages; and (4) Expressivity: effective at describing characters. Extensive experiments on standard benchmarks verify the advantages of strokelets and demonstrate the effectiveness of the proposed algorithm for text recognition.

303 citations


Journal ArticleDOI
TL;DR: This paper discusses the characteristics of short text and the difficulty of short text classification, and introduces existing popular work on short text classifiers and models, including short text classification using semantic analysis, semi-supervised short text classification, ensemble short text classification, and real-time classification.
Abstract: With the recent explosive growth of e-commerce and online communication, a new genre of text, short text, has been extensively applied in many areas, and much research consequently focuses on short text mining. Classifying short text is challenging owing to its natural characteristics, such as sparseness, large scale, immediacy, and non-standardization. Traditional methods struggle with short text classification mainly because the limited number of words in a short text cannot adequately represent the feature space or the relationships between words and documents. Several studies and reviews on text classification have appeared in recent years; however, only a few focus on short text classification. This paper discusses the characteristics of short text and the difficulty of short text classification. We then introduce existing popular work on short text classifiers and models, including short text classification using semantic analysis, semi-supervised short text classification, ensemble short text classification, and real-time classification. The evaluation of short text classification is also analyzed. Finally, we summarize existing classification technology and discuss development trends in short text classification.
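
The sparseness the abstract describes is easy to make concrete: in a bag-of-words space, a short text activates only a handful of dimensions, so most document pairs share no features at all. A minimal sketch with an invented four-message corpus (requires scikit-learn):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented micro-corpus of short texts.
docs = [
    "cheap phone deals today",
    "football match results",
    "new phone model released",
    "election results announced",
]
X = TfidfVectorizer().fit_transform(docs)
print(X.shape)             # (4, vocabulary size)
print(X.nnz / X.shape[0])  # average non-zero features per short text
print((X @ X.T).toarray().round(2))  # pairwise cosine similarities: mostly zero off-diagonal
```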

134 citations


Book ChapterDOI
06 Apr 2014
TL;DR: A ConceptNet-based semantic parser is proposed that deconstructs natural language text into concepts based on the dependency relations between clauses; it is domain-independent and able to extract concepts from heterogeneous text.
Abstract: Concept-level text analysis is superior to word-level analysis as it preserves the semantics associated with multi-word expressions. It offers a better understanding of text and helps to significantly increase the accuracy of many text mining tasks. Concept extraction from text is a key step in concept-level text analysis. In this paper, we propose a ConceptNet-based semantic parser that deconstructs natural language text into concepts based on the dependency relation between clauses. Our approach is domain-independent and is able to extract concepts from heterogeneous text. Through this parsing technique, 92.21% accuracy was obtained on a dataset of 3,204 concepts. We also show experimental results on three different text analysis tasks, on which the proposed framework outperformed state-of-the-art parsing techniques.

68 citations


Proceedings ArticleDOI
07 Apr 2014
TL;DR: This paper advocates the thesis that the quality of a summary obtained with combinations of sentence scoring methods depends on the text subject, evaluates this hypothesis, and identifies which techniques are more effective in each of the contexts studied.
Abstract: Text summarization is the process of creating a shorter version of one or more text documents. Automatic text summarization has become an important way of finding relevant information in large text libraries or on the Internet. Extractive text summarization techniques select entire sentences from documents according to some criteria to form a summary. Sentence scoring is currently the most widely used technique for extractive text summarization. Depending on the context, however, some techniques may yield better results than others. This paper advocates the thesis that the quality of the summary obtained with combinations of sentence scoring methods depends on the text subject. This hypothesis is evaluated using three different contexts: news, blogs, and articles. The results support the hypothesis and indicate which techniques are more effective in each of the contexts studied.
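
A hedged sketch of the kind of sentence-scoring pipeline the paper evaluates, using plain word-frequency scoring as the criterion (one simple method among the many scoring techniques the paper combines, not the authors' exact setup):

```python
import re
from collections import Counter

def summarize(text, n=2):
    """Extractive summary: score each sentence by the frequency of its words
    in the whole text, then keep the n best sentences in their original order."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = Counter(re.findall(r"[a-z']+", text.lower()))
    scored = [(sum(freq[w] for w in re.findall(r"[a-z']+", s.lower())), i, s)
              for i, s in enumerate(sentences)]
    best = sorted(sorted(scored, reverse=True)[:n], key=lambda t: t[1])
    return " ".join(s for _, _, s in best)

doc = ("Text summarization shortens documents. Summarization helps readers "
       "find relevant documents quickly. The weather was pleasant yesterday.")
print(summarize(doc, n=2))  # keeps the two on-topic sentences, drops the outlier
```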

52 citations


01 Jan 2014
TL;DR: This paper discusses the general idea of text mining, compares its techniques, and briefly reviews a number of text mining applications in present and future use.
Abstract: Text mining has become an exciting research field as it tries to discover valuable information from unstructured texts. Unstructured texts contain a vast amount of information but cannot be used directly for further processing by computers. Therefore, precise processing methods, algorithms, and techniques are vital for extracting this valuable information, which is accomplished through text mining. In this paper, we discuss the general idea of text mining and compare its techniques. In addition, we briefly discuss a number of text mining applications in present and future use.

43 citations


Journal ArticleDOI
TL;DR: A useful text-to-speech synthesizer is developed in the form of a simple application that converts input text into synthesized speech, reads it out to the user, and can save the output as an .mp3 file.
Abstract: A text-to-speech synthesizer is an application that converts text into spoken words by analyzing and processing the text using Natural Language Processing (NLP) and then using Digital Signal Processing (DSP) technology to convert the processed text into a synthesized speech representation. Here, we developed a useful text-to-speech synthesizer in the form of a simple application that converts input text into synthesized speech, reads it out to the user, and can save the output as an .mp3 file. The development of a text-to-speech synthesizer will be of great help to people with visual impairment and will make working through large volumes of text easier.

40 citations


01 Jan 2014
TL;DR: Different methods of text categorization and cluster analysis for text documents are discussed, and a new text mining technique is proposed for future implementation.
Abstract: Text mining is a technique for finding meaningful patterns in available text documents. Pattern discovery from text and the organization of documents are well-known problems in data mining, and analyzing text content and categorizing documents is a complex data mining task. In the search for efficient and effective techniques for text categorization, various categorization and classification techniques have recently been developed, some arranging documents in a supervised and some in an unsupervised manner. This paper discusses different methods of text categorization and cluster analysis for text documents. In addition, a new text mining technique is proposed for future implementation.

39 citations


Proceedings ArticleDOI
19 May 2014
TL;DR: This paper studies the effect OCR errors have on named entity recognition (NER) and on text mining more broadly, introduces a simple method for estimating the quality of OCRed text, and examines how well human raters can evaluate it.
Abstract: The focus of this paper is on the quality of historical text digitised through optical character recognition (OCR) and how it affects text mining. We study the effect OCR errors have on named entity recognition (NER) and show that in a random sample of documents picked from several historical text collections, 30.6% of false negative commodity and location mentions and 13.3% of all manually annotated commodity and location mentions contain OCR errors. We introduce a simple method for estimating text quality of OCRed text and examine how well human raters can evaluate it. We also illustrate how automatic text quality estimation compares to manual rating with the aim of determining a quality threshold below which documents could potentially be discarded or would require extensive correction first. This work was conducted during the Trading Consequences project which focussed on text mining and visualisation of historical documents for the study of nineteenth century trade.
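
The abstract does not spell out the quality measure, but a plausible stand-in for estimating OCR quality is the fraction of tokens found in a lexicon, a common proxy; the wordlist below is a placeholder assumption:

```python
import re

# Placeholder lexicon; in practice this would be a large dictionary of the
# target language and period.
LEXICON = {"the", "cotton", "was", "shipped", "from", "bombay", "to", "london"}

def ocr_quality(text):
    """Estimate OCR quality as the share of alphabetic tokens found in a lexicon."""
    tokens = re.findall(r"[a-zA-Z]+", text)
    if not tokens:
        return 0.0
    return sum(t.lower() in LEXICON for t in tokens) / len(tokens)

print(ocr_quality("Tlie cotton was sh1pped from Bombay to London"))  # ~0.67
```

Documents scoring below a chosen threshold would then be flagged for discarding or extensive correction, as the paper proposes.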

35 citations


Journal ArticleDOI
TL;DR: The authors offer insight into the process of semantic tagging, the capabilities and specificities of today's semantic taggers, and also indicate some of the criteria to be considered when choosing a tagger.
Abstract: Motivated by a continually increasing demand for applications that depend on machine comprehension of text-based content, researchers in both academia and industry have developed innovative solutions for automated information extraction from text. In this article, the authors focus on a subset of such tools--semantic taggers--that not only extract and disambiguate entities mentioned in the text but also identify topics that unambiguously describe the text's main themes. The authors offer insight into the process of semantic tagging, the capabilities and specificities of today's semantic taggers, and also indicate some of the criteria to be considered when choosing a tagger.

31 citations


Journal ArticleDOI
TL;DR: Two character-level methods for the abbreviation modeling aspect of the noisy channel model are introduced: a statistical classifier using language-based features to decide whether a character is likely to be removed from a word, and a character-level machine translation model.
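
A minimal sketch of the per-character deletion decision the TL;DR describes: simple language-based features per character, with a hand-set rule standing in for the trained statistical classifier (the rule and feature set are illustrative only):

```python
VOWELS = set("aeiou")

def char_features(word, i):
    """Simple language-based features for deciding whether word[i] may be dropped."""
    return {
        "is_vowel": word[i] in VOWELS,
        "is_first": i == 0,
        "is_last": i == len(word) - 1,
    }

def likely_deleted(word, i):
    # Stand-in for a trained classifier: word-internal vowels are the
    # characters most often dropped in texting-style abbreviations.
    f = char_features(word, i)
    return f["is_vowel"] and not f["is_first"] and not f["is_last"]

word = "tomorrow"
print("".join(c for i, c in enumerate(word) if not likely_deleted(word, i)))  # tmrrw
```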

Proceedings ArticleDOI
01 Dec 2014
TL;DR: This paper discusses the text classification process, classifiers, and numerous feature extraction methodologies, all in the context of short texts, i.e., news classification based on headlines.
Abstract: Text mining has been gaining significant importance over the last few years, since knowledge now reaches users through a variety of sources: electronic media, digital media, print media, and many more. Given the huge availability of text in numerous forms, researchers have accumulated a great deal of unstructured data and have devised numerous ways in the literature to convert this scattered text into a defined structure, a task commonly known as text classification. The focus on full-text classification, i.e., full news stories, large documents, long texts, etc., is more prominent than that on short text. In this paper, we discuss the text classification process, classifiers, and numerous feature extraction methodologies, all in the context of short texts, i.e., news classification based on headlines. Existing classifiers and their working methodologies are compared and the results are presented.

Proceedings ArticleDOI
07 Apr 2014
TL;DR: A combined system for text localization and transcription in page images is presented that includes flexible learning-based methods for layout analysis and handwriting recognition, which were developed in the context of the Swiss research project HisDoc.
Abstract: Automated reading of historical handwriting is needed to search and browse ancient manuscripts in digital libraries based on their textual content. In this paper, we present a combined system for text localization and transcription in page images. It includes flexible learning-based methods for layout analysis and handwriting recognition, which were developed in the context of the Swiss research project HisDoc. A comprehensive experimental evaluation is provided for the medieval Parzival database, demonstrating a promising word recognition accuracy of 93.0% with closed vocabulary. In order to harmonize the evaluation of the two document analysis tasks, we introduce a novel evaluation measure for text line extraction that takes substitution, deletion, as well as insertion errors into account.
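
The novel line-extraction measure is not given in the abstract, but counting substitutions, deletions, and insertions is the classic edit-distance recipe; a minimal sketch of an error rate in that style, over extracted versus ground-truth line labels (purely illustrative, not the authors' exact measure):

```python
def edit_ops(ref, hyp):
    """Minimum number of substitutions, deletions, and insertions needed to
    turn ref into hyp (Levenshtein distance via dynamic programming)."""
    m, n = len(ref), len(hyp)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution or match
    return d[m][n]

ref_lines = ["line1", "line2", "line3"]
hyp_lines = ["line1", "lineX", "line3", "extra"]
print(edit_ops(ref_lines, hyp_lines) / len(ref_lines))  # (1 sub + 1 ins) / 3 lines
```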

Proceedings ArticleDOI
01 Dec 2014
TL;DR: A novel approach is proposed to recognize text in natural scenes with complex backgrounds, form words from the recognized text, check spelling, translate words into a user-defined language, and finally overlay the translated words onto the image.
Abstract: In recent years, the availability of economical image-capturing devices in low-cost products like mobile phones has drawn significant attention from researchers to the problem of recognizing text in images. Recognizing scene text is challenging compared to recognizing printed documents. In this work, a novel approach is proposed to recognize text in natural scenes with complex backgrounds, form words from the recognized text, check spelling, translate words into a user-defined language, and finally overlay the translated words onto the image. The proposed approach is robust to different kinds of text appearance, including font size, font style, color, and background. Combining the respective strengths of different complementary techniques and overcoming their shortcomings, the proposed method uses an efficient character detection and localization technique and a multiclass classifier to recognize text accurately. The approach successfully recognizes text in natural scene images, does not depend on a particular alphabet or text background, works with a wide variety of character sizes, and can handle up to 20 degrees of skew efficiently.

Proceedings Article
01 Jan 2014
TL;DR: The authors present various visualizations of the text re-use found between texts of a collection, supporting humanists in answering a broad palette of research questions, such as how different text editions compare to each other.
Abstract: In this paper, we present various visualizations for the Text Re-use found between texts of a collection to support humanists in answering a broad palette of research questions. When juxtaposing all texts of a corpus in the form of tuples, we propose the Text Re-use Grid as a distant reading method that emphasizes text tuples with systematic or repetitive Text Re-use. In contrast, the Text Re-use Browser allows for close reading of the Text Re-use between the two texts of a tuple. Additionally, we present Sentence Alignment Flows to improve the readability for Text Variant Graphs on sentence level that are used to compare various text editions to each other. Finally, we portray findings of the humanists of our project using the proposed visualizations.

Journal ArticleDOI
TL;DR: This paper describes the classification of heterogeneous text data into two classes, in-domain and out-of-domain, mainly for language modeling in task-oriented speech recognition in the judicial domain, and shows a significant improvement in model perplexity and in the performance of the Slovak transcription and dictation system.
Abstract: The robustness of n-gram language models depends on the quality of the text data on which they have been trained. Text corpora collected from various resources such as web pages or electronic documents cover many possible topics. In order to build efficient and robust domain-specific language models, it is necessary to separate domain-oriented segments from the large amount of text data; the remaining out-of-domain data can be used only for updating existing in-domain n-gram probability estimates. In this paper, we describe the process of classifying heterogeneous text data into two classes, in-domain and out-of-domain, mainly for language modeling in task-oriented speech recognition in the judicial domain. The proposed text classification algorithm detects the theme of short text segments based on their most frequent key phrases. Each text segment is then represented in a vector space model as a feature vector with term weighting. To classify these segments as in-domain or out-of-domain, document similarity with automatic thresholding is used. Experimental results on modeling the Slovak language and adapting to the judicial domain show a significant improvement in model perplexity and increased performance of the Slovak transcription and dictation system.
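
A hedged sketch of the final classification step the abstract describes: represent a segment as a term-weighted vector and accept it as in-domain when its cosine similarity to a domain reference vector clears a threshold (the reference vector and threshold here are invented; the paper derives its threshold automatically):

```python
import math
from collections import Counter

def cosine(a, b):
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Invented in-domain (judicial) reference vector and threshold.
DOMAIN_VECTOR = Counter({"court": 3, "verdict": 2, "appeal": 2, "law": 1})
THRESHOLD = 0.2

def is_in_domain(segment):
    return cosine(Counter(segment.lower().split()), DOMAIN_VECTOR) >= THRESHOLD

print(is_in_domain("the court rejected the appeal"))    # True
print(is_in_domain("football results from yesterday"))  # False
```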

Proceedings ArticleDOI
16 Jun 2014
TL;DR: This work proposes a weighted-graph representation of text, called Text Graphs, which captures the grammatical and semantic relations between words and terms in the text.
Abstract: The Web has made possible many advanced text-mining applications, such as news summarization, essay grading, question answering, and semantic search. For many such applications, statistical text-mining techniques are ineffective since they do not utilize the morphological structure of the text. Thus, many approaches use NLP-based techniques that parse the text and use patterns to mine and analyze the parse trees, which are often unnecessarily complex. Therefore, we propose a weighted-graph representation of text, called Text Graphs, which captures the grammatical and semantic relations between words and terms in the text. Text Graphs are generated using a new text mining framework, which is the main focus of this paper. Our framework, SemScape, uses a statistical parser to generate a few of the most probable parse trees for each sentence and employs a novel two-step pattern-based technique to extract candidate terms and their grammatical relations from the parse trees. Moreover, SemScape resolves co-references by a novel technique, generates domain-specific Text Graphs by consulting ontologies, and provides a SPARQL-like query language and an optimized engine for semantically querying and mining Text Graphs.
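
SemScape derives its graphs from parse trees, which the abstract only outlines; the much simpler stand-in below builds a weighted co-occurrence graph over nearby words, just to make the weighted-graph representation concrete (not the authors' method):

```python
import re
from collections import defaultdict

def text_graph(text, window=2):
    """Weighted word graph: edge weight = co-occurrence count within a window.
    A crude stand-in for the parse-tree-derived relations SemScape extracts."""
    tokens = re.findall(r"[a-z']+", text.lower())
    edges = defaultdict(int)
    for i, w in enumerate(tokens):
        for j in range(i + 1, min(i + 1 + window, len(tokens))):
            if w != tokens[j]:
                edges[tuple(sorted((w, tokens[j])))] += 1
    return dict(edges)

print(text_graph("the cat chased the mouse and the mouse escaped"))
```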

Ming-Wei Chang1, Bo-June (Paul) Hsu1, Hao Ma1, Ricky Loynd1, Kuansan Wang1 
01 Jan 2014
TL;DR: E2E, an end-to-end entity linking system designed for the short and noisy text found in microblogs and text messages, processes such text robustly by jointly optimizing entity recognition and disambiguation as a single task.
Abstract: We present E2E, an end-to-end entity linking system that is designed for short and noisy text found in microblogs and text messages. Mining and extracting entities from short text is an essential step for many content analysis applications. By jointly optimizing entity recognition and disambiguation as a single task, our system can process short and noisy text robustly.

Patent
08 Aug 2014
TL;DR: The unstructured text of a post or message generated by a user on a social-networking system is analyzed to determine whether it includes a request for a recommendation, using a machine-learning model that compares the text to one or more predetermined words associated with such requests.
Abstract: In one embodiment, a method includes receiving unstructured text from a user of a social-networking system, determining whether the unstructured text includes a request for a recommendation, identifying one or more first entity names in the unstructured text, generating a structured query based upon the one or more first entity names, identifying, in the social graph, one or more second entity names corresponding to the structured query, and presenting the one or more second entity names and the unstructured text in a social context of the user. The unstructured text may include text of a post or message generated by the user on a social-networking system. A score may be generated based on the unstructured text to determine whether the text includes a request for recommendation using a machine-learning model based on comparison of the unstructured text to the one or more predetermined words associated with requests for recommendation.
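
A hedged sketch of the scoring idea in the last claim: score the text against predetermined trigger words and flag it as a recommendation request above a cutoff. The word list, weights, and cutoff below are all invented, and a simple weighted sum stands in for the patent's machine-learning model:

```python
# Invented trigger words; the patent compares text against "predetermined
# words associated with requests for recommendation".
TRIGGERS = {"recommend": 2.0, "suggest": 2.0, "best": 1.0, "good": 1.0, "any": 0.5}
CUTOFF = 2.5  # invented decision threshold

def recommendation_score(text):
    return sum(TRIGGERS.get(w, 0.0) for w in text.lower().split())

def is_recommendation_request(text):
    return recommendation_score(text) >= CUTOFF

print(is_recommendation_request("can anyone recommend a good dentist in austin"))  # True
print(is_recommendation_request("had a great weekend at the lake"))                # False
```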

Journal Article
TL;DR: Some pre-processing techniques for Gujarati are introduced in this paper; because Gujarati is very rich in morphology, it gives rise to a very large number of word forms and a large feature space.
Abstract: Text mining is the process of obtaining interesting patterns or knowledge from text documents. Text is the most commonly used type of data on the WWW, and text mining is used to extract interesting knowledge from unstructured text data. Pre-processing is a very important phase in the text mining process. A text mining framework includes two components: text refining and knowledge distillation. This paper is about pre-processing for text mining in English and Gujarati. Very little work has been done on text mining in Gujarati. It is a very challenging task, as Gujarati is very rich in morphology, giving rise to a very large number of word forms and a large feature space. Some pre-processing techniques for Gujarati are introduced in this paper.

Journal ArticleDOI
TL;DR: The techniques for text mining are described: information extraction, information retrieval, query processing, natural language processing, categorization, and clustering.
Abstract: Text mining is a technology that can work with unstructured or semi-structured data. It can be used to find meaningful information in natural language text, using existing data in corporate databases, by making unstructured text data available for analysis. There are many techniques for text mining. In this paper we describe the following techniques: information extraction, information retrieval, query processing, natural language processing, categorization, and clustering.

Journal ArticleDOI
TL;DR: As the content of a full text page usually focuses on a specific topic, a topic language model adaptation method is proposed to improve the recognition performance on homologous offline handwritten Chinese text images, and a restricted variant is presented to obtain a tradeoff between recognition performance and computational complexity.
Abstract: As the content of a full text page usually focuses on a specific topic, a topic language model adaptation method is proposed to improve the recognition performance on homologous offline handwritten Chinese text images. First, the text images are recognized with a character-based bi-gram language model. Second, the topic of the text image is matched adaptively. Finally, the text image is recognized again with the best matched topic language model. To obtain a tradeoff between recognition performance and computational complexity, a restricted topic language model adaptation method is further presented. The methods have been evaluated on 100 offline Chinese text images. Compared to the general language model, topic language model adaptation reduced the relative error rate by 11.94%. The restricted topic language model reduced the running time by 19.22% at the cost of 0.35% of the accuracy.
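
A minimal sketch of the adaptive matching step: score the first-pass recognition result under each topic bigram model and re-decode with the best one. The toy counts and vocabulary size below are invented, and add-one smoothing stands in for whatever smoothing the paper uses:

```python
import math
from collections import Counter

def bigram_logprob(tokens, bigrams, unigrams, vocab_size):
    """Add-one-smoothed bigram log-probability of a token sequence."""
    lp = 0.0
    for prev, cur in zip(tokens, tokens[1:]):
        lp += math.log((bigrams.get((prev, cur), 0) + 1) /
                       (unigrams.get(prev, 0) + vocab_size))
    return lp

def best_topic(tokens, topic_models, vocab_size=1000):
    """Pick the topic LM that gives the recognized text the highest
    probability (equivalently, the lowest perplexity)."""
    return max(topic_models,
               key=lambda t: bigram_logprob(tokens, *topic_models[t], vocab_size))

# Toy topic models: (bigram counts, unigram counts) per topic.
models = {
    "sports":  (Counter({("won", "the"): 5, ("the", "match"): 4}),
                Counter({"won": 6, "the": 9})),
    "finance": (Counter({("the", "market"): 5, ("market", "fell"): 3}),
                Counter({"the": 8, "market": 6})),
}
print(best_topic(["won", "the", "match"], models))  # sports
```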

01 Jan 2014
TL;DR: The paper presents an introduction to Optical Character Recognition, surveys major research work and applications in various fields, and stresses the research that has had a great impact on character recognition.
Abstract: Optical character recognition, usually abbreviated to OCR, is the mechanical or electronic conversion of scanned or photographed images of typewritten or printed text into machine-encoded, computer-readable text. It is widely used as a form of data entry from original paper data sources, whether passport documents, invoices, bank statements, receipts, business cards, mail, or any number of printed records. It is a common method of digitizing printed texts so that they can be electronically edited, searched, stored more compactly, displayed online, and used in machine processes such as machine translation, text-to-speech, key data extraction, and text mining. OCR is a field of research in pattern recognition, artificial intelligence, and computer vision, and is widely used to recognize and search text in electronic documents or to publish text on a website [1]. A large number of research papers and reports have already been published on this topic. This paper presents an introduction to OCR, its major research work, and its applications in various fields. First, OCR is introduced; then the major research works that have had a great impact on character recognition are highlighted; finally, the most important applications of OCR are covered, followed by a conclusion.

01 Aug 2014
TL;DR: This paper adds support for the claim that the new self-trained parser improves over the baseline by carrying out a qualitative linguistic analysis of the kinds of differences between the two parsers on non-native text.
Abstract: We apply the well-known parsing technique of self-training to a new type of text: language-learner text. This type of text often contains grammatical and other errors which can cause problems for traditional treebank-based parsers. Evaluation on a small test set of student data shows improvement over the baseline, both when training on native and on non-native text. As the main contribution of this paper, we add support for the claim that the new self-trained parser has improved over the baseline by carrying out a qualitative linguistic analysis of the kinds of differences between the two parsers on non-native text. We show that for a number of linguistically interesting cases, the self-trained parser is able to provide better analyses, despite the sometimes ungrammatical nature of the text.

Patent
Patrick W. Fink1, Philip E. Parker1
18 Mar 2014
TL;DR: In this article, a topic summary application receives the user-selected text and identifies entities in the text using natural language processing, and then generates a summary, presented to the user in a pop-up window, of most frequently correlated related entities along with text phrases that are semantically important.
Abstract: Techniques are disclosed for discovering and presenting topic summaries related to a selection of text in an electronic document. A topic summary application receives the user-selected text and identifies entities in the text using natural language processing. Using natural language processing, the summary application also identifies related entities and associated text phrases in a remaining portion of the electronic document. The remaining portion may be a portion of the document that precedes the user-selected text, so that a summary generated therefrom may be used to refresh the memory of the user while not revealing information that the user has not yet encountered. In addition, the summary application determines semantically important text phrases using text analytics and generates a summary, presented to the user in a pop-up window, of most frequently correlated related entities along with text phrases that are semantically important.

Patent
18 Sep 2014
TL;DR: In this paper, a system, method and computer program is provided for generating customized text representations of audio commands based on a general language grammar, which is used for generating a first text representation of an audio command, the second module including a custom language grammar that may include contacts for a particular user.
Abstract: A system, method and computer program is provided for generating customized text representations of audio commands. A first speech recognition module may be used for generating a first text representation of an audio command based on a general language grammar. A second speech recognition module may be used for generating a second text representation of the audio command, the second module including a custom language grammar that may include contacts for a particular user. Entity extraction is applied to the second text representation and the entities are checked against a file containing personal language. If the entities are found in the user-specific language, the two text representations may be fused into a combined text representation and named entity recognition may be performed again to extract further entities.

Proceedings ArticleDOI
31 May 2014
TL;DR: This paper proposes an approach to classify unstructured data, e.g. development documents, into natural language text and technical information using a mixture of text heuristics and agglomerative hierarchical clustering.
Abstract: Software repository data, for example in issue tracking systems, include natural language text and technical information, which covers anything from log files via code snippets to stack traces. However, data mining is often interested in only one of the two types, e.g. natural language text in the case of text mining. Regardless of which type is being investigated, any techniques used have to deal with noise caused by fragments of the other type, i.e. methods interested in natural language have to deal with technical fragments and vice versa. This paper proposes an approach to classify unstructured data, e.g. development documents, into natural language text and technical information using a mixture of text heuristics and agglomerative hierarchical clustering. The approach was evaluated using 225 manually annotated text passages from developer emails and issue tracker data. Using white space tokenization as a basis, the overall precision of the approach is 0.84 and the recall is 0.85.
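
One of the text heuristics such an approach can combine is easy to sketch: flag a passage as technical when too large a share of its characters is non-alphabetic, as in stack traces or log output. The heuristic and threshold below are illustrative, not the paper's:

```python
def looks_technical(passage, threshold=0.25):
    """Flag a passage as technical when the share of non-alphabetic characters
    (digits, punctuation, path separators) exceeds a threshold."""
    stripped = passage.replace(" ", "")
    if not stripped:
        return False
    non_alpha = sum(not c.isalpha() for c in stripped)
    return non_alpha / len(stripped) > threshold

print(looks_technical("The build fails after I upgrade the plugin."))          # False
print(looks_technical("at org.app.Main.run(Main.java:42) [app-1.3.jar:1.3]"))  # True
```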

Proceedings ArticleDOI
03 Dec 2014
TL;DR: This paper extends the existing text cube model to incorporate TF-IDF (Term Frequency-Inverse Document Frequency) and LM (Language Model) as measurements, and shows through experiments that the proposed text cube outperforms the existing one in both performance and effectiveness.
Abstract: Recently, unstructured data such as texts, documents, and SNS messages has increasingly been used in many applications, rather than structured data consisting of simple numbers or characters. It has thus become more important to analyze unstructured text data to extract valuable information for users' decision making. As with OLAP (On-Line Analytical Processing) over structured data, multi-dimensional analysis is increasingly required for such unstructured data. To meet these analysis requirements, a text cube model over multi-dimensional text databases has been proposed. In this paper, we extend the existing text cube model to incorporate TF-IDF (Term Frequency-Inverse Document Frequency) and LM (Language Model) as measurements. Because the proposed text cube model utilizes measurements that are well established in information retrieval systems, it is more efficient and effective for analyzing text databases. Through experiments, we show that the proposed text cube outperforms the existing one in both performance and effectiveness.
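
As a minimal sketch of TF-IDF as a cell measurement, computed over the documents that aggregate into one text-cube cell (the cell contents and dimensions are invented for illustration):

```python
import math
from collections import Counter

def tf_idf(cell_docs):
    """TF-IDF over the documents aggregated in one text-cube cell: TF is the
    term count within the cell, IDF is log(N / df) across the cell's documents."""
    n = len(cell_docs)
    tokenized = [doc.lower().split() for doc in cell_docs]
    tf = Counter(t for doc in tokenized for t in doc)
    df = Counter(t for doc in tokenized for t in set(doc))
    return {t: tf[t] * math.log(n / df[t]) for t in tf}

# Invented cell: complaint messages for the cube cell (region=EU, month=May).
cell = ["battery drains fast", "battery replaced", "screen cracked"]
print(sorted(tf_idf(cell).items(), key=lambda kv: -kv[1])[:3])
```

Note how "battery", appearing in two of the three documents, is down-weighted by IDF relative to terms unique to a single document.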

Book
20 Dec 2014
TL;DR: In this paper, the authors present a set of articles that specify the methodology of text mining, describe the creation of lexical resources in the framework of textmining and use text mining for various tasks in natural language processing (NLP).
Abstract: This book comprises a set of articles that specify the methodology of text mining, describe the creation of lexical resources in the framework of text mining and use text mining for various tasks in natural language processing (NLP). The analysis of large amounts of textual data is a prerequisite to build lexical resources such as dictionaries and ontologies and also has direct applications in automated text processing in fields such as history, healthcare and mobile applications, just to name a few. This volume gives an update in terms of the recent gains in text mining methods and reflects the most recent achievements with respect to the automatic build-up of large lexical resources. It addresses researchers that already perform text mining, and those who want to enrich their battery of methods. Selected articles can be used to support graduate-level teaching. The book is suitable for all readers that completed undergraduate studies of computational linguistics, quantitative linguistics, computer science and computational humanities. It assumes basic knowledge of computer science and corpus processing as well as of statistics.

Journal Article
TL;DR: Experimental results show that short text reconstruction and concept mapping can improve the effect of clustering, and that the proposed micro-blog user interest model performs better.
Abstract: In this paper, a method for modeling users' interests based on micro-blog short text is presented. To overcome the lack of information in short text, and based on an analysis of the structure and content of micro-blog short text, this paper proposes an approach to micro-blog short text reconstruction: the content is extended according to related texts and the three kinds of special symbols in the text, thereby extending the characteristic information of the original micro-blog. The approach takes advantage of the HowNet2000 concept dictionary to map the feature set of the reconstructed text to a set of concepts. The set of concepts is clustered to divide user interests, and a representation mechanism for the user interest model is presented. Experimental results show that short text reconstruction and concept mapping can improve the effect of clustering. Compared with modeling based on collaborative filtering, the F-measure value is increased by 29.1%. This means the proposed micro-blog user interest model has better performance.