
Showing papers on "Noisy text analytics" published in 2007


Patent
27 Dec 2007
TL;DR: In this paper, a novel predictive feature extraction method is proposed that combines linguistic and statistical information to represent the information embedded in a noisy source language, mitigating the degradation that traditional speech recognition systems suffer under larger domain sizes, scarce training data, and noisy environmental conditions.
Abstract: The performance of traditional speech recognition systems (as applied to information extraction or translation) decreases significantly with larger domain size, scarce training data, and noisy environmental conditions. This invention mitigates these problems through the introduction of a novel predictive feature extraction method which combines linguistic and statistical information to represent the information embedded in a noisy source language. The predictive features are combined with text classifiers to map the noisy text to one of a set of semantically or functionally similar groups. The features used by the classifier can be syntactic, semantic, or statistical.

199 citations
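The core idea above, pairing noise-tolerant features with a text classifier, can be illustrated with a short Python sketch. This is not the patented method: the feature mix (character plus word n-grams), the scikit-learn models, and the toy data are all assumptions chosen for brevity.

```python
# A minimal sketch (not the patent's method): combine statistical
# character n-gram features, which are robust to recognition errors,
# with word-level features, then classify noisy text into groups.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import FeatureUnion, Pipeline

# Character n-grams tolerate misspellings ("recogniton" ~ "recognition");
# word n-grams carry coarser linguistic context.
features = FeatureUnion([
    ("char", TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))),
    ("word", TfidfVectorizer(analyzer="word", ngram_range=(1, 2))),
])
clf = Pipeline([("features", features),
                ("clf", LogisticRegression(max_iter=1000))])

# Toy data: noisy utterances mapped to functional groups.
X = ["i wnt to chnage my flite", "cancel my bookin pls", "what is teh weather"]
y = ["change", "cancel", "weather"]
clf.fit(X, y)
print(clf.predict(["plz chanj my flight"]))
```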


Proceedings ArticleDOI
06 Nov 2007
TL;DR: The system, FeatureLens, visualizes a text collection at several levels of granularity and enables users to explore interesting text patterns; it focuses on frequent itemsets of n-grams, as they capture the repetition of exact or similar expressions in the collection.
Abstract: This paper addresses the problem of making text mining results more comprehensible to humanities scholars, journalists, intelligence analysts, and other researchers, in order to support the analysis of text collections. Our system, FeatureLens, visualizes a text collection at several levels of granularity and enables users to explore interesting text patterns. The current implementation focuses on frequent itemsets of n-grams, as they capture the repetition of exact or similar expressions in the collection. Users can find meaningful co-occurrences of text patterns by visualizing them within and across documents in the collection. This also permits users to identify the temporal evolution of usage, such as the increasing, decreasing, or sudden appearance of text patterns. The interface could be used to explore other text features as well. Initial studies suggest that FeatureLens helped a literary scholar and eight users generate new hypotheses and interesting insights using two text collections.

134 citations
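The repeated-expression mining that FeatureLens builds on can be approximated in a few lines. A rough Python sketch (not the system's code) treats each document as a transaction of word bigrams and counts frequent 1- and 2-itemsets; the support threshold is an assumed parameter.

```python
# Treat each document as a transaction of word n-grams, then count
# which n-grams (and pairs of n-grams) recur across documents.
from collections import Counter
from itertools import combinations

def ngrams(tokens, n):
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

docs = [
    "the old man and the sea",
    "the old man walked to the sea",
    "a young man and the river",
]
transactions = [ngrams(d.split(), 2) for d in docs]

support = Counter(g for t in transactions for g in t)
pair_support = Counter(p for t in transactions for p in combinations(sorted(t), 2))

min_support = 2  # assumed threshold
print([g for g, c in support.items() if c >= min_support])
print([p for p, c in pair_support.items() if c >= min_support])
```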


Proceedings ArticleDOI
28 Oct 2007
TL;DR: The goal of this paper is to bring out and study the effect of different kinds of noise on automatic text classification, and present interesting results on real-life noisy datasets from various CRM domains.
Abstract: Noise is a stark reality in real-life data. Especially in the domain of text analytics, it has a significant impact, as data cleaning forms a very large part of the data processing cycle. Noisy unstructured text is common in informal settings such as on-line chat, SMS, email, newsgroups and blogs, automatically transcribed text from speech, and automatically recognized text from printed or handwritten material. Gigabytes of such data are being generated every day on the Internet, in contact centers, and on mobile phones. Researchers have looked at various text mining issues such as pre-processing and cleaning noisy text, information extraction, rule learning, and classification for noisy text. This paper focuses on the issues faced by automatic text classifiers in analyzing noisy documents coming from various sources. The goal of this paper is to bring out and study the effect of different kinds of noise on automatic text classification. Does the nature of such text warrant moving beyond traditional text classification techniques? We present detailed experimental results with simulated noise on the Reuters-21578 and 20-newsgroups benchmark datasets. We also present interesting results on real-life noisy datasets from various CRM domains.

92 citations
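A hedged illustration of the "simulated noise" setup: the paper's exact noise model is not reproduced here, but randomly substituting characters at a controlled rate, as below, is one simple way to degrade clean benchmark text before re-running a classifier.

```python
# Corrupt characters at a given rate and watch the text degrade; a real
# experiment would train on clean Reuters-21578 text and evaluate the
# classifier on test sets noised at increasing rates.
import random
import string

def add_noise(text, rate, rng=random.Random(0)):
    out = []
    for ch in text:
        if ch.isalpha() and rng.random() < rate:
            out.append(rng.choice(string.ascii_lowercase))  # substitution error
        else:
            out.append(ch)
    return "".join(out)

clean = "interest rates rose sharply in the third quarter"
for rate in (0.0, 0.1, 0.3, 0.5):
    print(f"{rate:.1f}: {add_noise(clean, rate)}")
```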


Proceedings ArticleDOI
11 Mar 2007
TL;DR: This paper proposes a graph-based text representation capable of capturing term order, term frequency, term co-occurrence, and term context in documents, and applies the graph model to a text mining task: discovering unapparent associations between two or more concepts from a large text corpus.
Abstract: For information retrieval and text mining, a robust, scalable framework is required to represent the information extracted from documents and enable visualization and querying of such information. One very widely used model is the vector space model, which is based on the bag-of-words approach. However, it suffers from the fact that it loses important information about the original text, such as information about the order of the terms in the text or about the frontiers between sentences or paragraphs. In this paper, we propose a graph-based text representation, which is capable of capturing (i) term order, (ii) term frequency, (iii) term co-occurrence, and (iv) term context in documents. We also apply the graph model to our text mining task, which is to discover unapparent associations between two or more concepts (e.g. individuals) from a large text corpus. A counterterrorism corpus is used to evaluate the performance of various retrieval models, which demonstrates the feasibility and effectiveness of the graph-based text representation in information retrieval and text mining.

62 citations
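A minimal graph-of-words sketch, one of many possible constructions and not necessarily the paper's exact one: directed edges follow term order within a sliding window, so the graph preserves order, frequency (edge weights), and local co-occurrence.

```python
# Build a directed co-occurrence graph from a token sequence.
import networkx as nx

def text_to_graph(tokens, window=2):
    g = nx.DiGraph()
    for i, t in enumerate(tokens):
        # Connect each term to the next `window` terms that follow it,
        # accumulating a weight for repeated co-occurrences.
        for j in range(i + 1, min(i + 1 + window, len(tokens))):
            w = g.get_edge_data(t, tokens[j], {"weight": 0})["weight"]
            g.add_edge(t, tokens[j], weight=w + 1)
    return g

g = text_to_graph("the cat sat on the mat near the cat".split())
print(g["the"]["cat"])  # {'weight': 2} -- order and frequency preserved
```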


Patent
Takafumi Koshinaka
02 Feb 2007
TL;DR: In this paper, a speech recognition dictionary compilation assisting system is described that can create and update speech recognition dictionaries and language models efficiently, so as to reduce speech recognition errors by utilizing text data available at low cost.
Abstract: A speech recognition dictionary compilation assisting system can create and update speech recognition dictionaries and language models efficiently so as to reduce speech recognition errors by utilizing text data available at low cost. The system includes a speech recognition dictionary storage section 105, a language model storage section 106, and an acoustic model storage section 107. A virtual speech recognition processing section 102 processes analyzed text data generated by the text analyzing section 101, making reference to the recognition dictionary, language models, and acoustic models, so as to generate virtual speech-recognized text data, and compares this virtual text data with the analyzed text data. The update processing section 103 updates the recognition dictionary and language models so as to reduce the difference(s) between the two sets of text data.

35 citations


Journal ArticleDOI
17 Jan 2007
TL;DR: Experimental results show that the proposed techniques present a considerable improvement in classification performance, even when small labeled training sets are available.
Abstract: Text mining, intelligent text analysis, text data mining, and knowledge discovery in text are commonly used aliases for the process of extracting relevant and non-trivial information from text. Some crucial issues arise when trying to solve this problem, such as document representation and the deficit of labeled data. This paper addresses these problems by introducing information from unlabeled documents into the training set, using the support vector machine (SVM) separating margin as the differentiating factor. Besides studying the influence of several pre-processing methods and drawing conclusions about their relative significance, we also evaluate the benefits of introducing background knowledge into an SVM text classifier. We further evaluate the possibility of active learning and propose a method for successfully combining background knowledge and active learning. Experimental results show that the proposed techniques, when used alone or combined, present a considerable improvement in classification performance, even when small labeled training sets are available.

34 citations
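The margin idea lends itself to a compact sketch. Assumptions here: a linear SVM on a binary toy task, and an arbitrary confidence threshold of 0.2; the paper's precise selection rules may differ. Documents far from the margin are pseudo-labeled, while the document nearest the margin is the active-learning query.

```python
# Use the SVM margin to split unlabeled documents into "confident enough
# to pseudo-label" and "informative enough to ask a human about".
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

labeled = ["stock prices fell sharply", "the team won the final match"]
labels = [0, 1]
unlabeled = ["shares dropped again today",
             "a thrilling game last night",
             "markets and players everywhere"]

vec = TfidfVectorizer().fit(labeled + unlabeled)
svm = LinearSVC().fit(vec.transform(labeled), labels)

# Distance from the separating hyperplane acts as classification confidence.
margins = svm.decision_function(vec.transform(unlabeled))

pseudo = [(doc, int(m > 0)) for doc, m in zip(unlabeled, margins) if abs(m) > 0.2]
query = min(zip(unlabeled, margins), key=lambda dm: abs(dm[1]))[0]

print("pseudo-labeled:", pseudo)
print("active-learning query:", query)
```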


Patent
29 Dec 2007
TL;DR: In this article, a hierarchical list of message categories and a database of key terms and sample phrases for each of such categories are used to determine if a text message is associated with at least one message category of interest.
Abstract: Techniques for classifying electronic text messages include creating a hierarchical list of message categories, composing databases of key terms and sample phrases for each of such categories, and, based on the number and features of the key terms detected in an analyzed text message, determining whether the text message is associated with at least one message category of interest. Variants of the key terms can be produced using fuzzy text object generation algorithms. Weight factors for the key terms, and similarity scores of a text message compared to previously identified sample messages for a particular message category, are calculated based on properties of the key terms detected in the text message, such as the frequency of use, location, or appearance in the text message, and the number of words in the respective key terms.

27 citations
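An illustrative sketch of keyword-based category scoring only; the category names, weights, threshold, and scoring rule are invented for the example, and the patent describes richer features such as term location and fuzzy key-term variants.

```python
# Score a message against per-category weighted key-term databases and
# report the best category if its score clears a threshold.
CATEGORIES = {
    "billing":  {"invoice": 2.0, "payment": 1.5, "refund": 2.0},
    "shipping": {"delivery": 2.0, "tracking": 1.5, "package": 1.0},
}

def classify(message, threshold=2.0):
    tokens = message.lower().split()
    scores = {
        cat: sum(w for term, w in terms.items() if term in tokens)
        for cat, terms in CATEGORIES.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else None

print(classify("where is my package and its tracking number"))  # shipping
```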


Journal ArticleDOI
TL;DR: Noisy unstructured text data are ubiquitous in real-world communications, and natural language and its creative usage can cause problems for computational techniques.
Abstract: Noisy unstructured text data are ubiquitous in real-world communications. Text produced by processing signals intended for human interpretation, such as printed and handwritten documents, spontaneous speech, and camera-captured scene images, is a prime example. Applications of Automatic Speech Recognition (ASR) systems to telephonic conversations between call center agents and customers often see 30–40% word error rates. Optical character recognition (OCR) error rates for hardcopy documents can range widely, from 2–3% for clean inputs to 50% or higher, depending on the quality of the page image, the complexity of the layout, and aspects of the typography. Unconstrained handwriting recognition is still considered to be largely an open problem. Recognition errors are not the sole source of noise; natural language and its creative usage can cause problems for computational techniques. Electronic text taken directly from the Internet (emails, message boards, newsgroups, blogs, wikis, chat logs, and web pages), contact centers (customer complaints, emails, call transcriptions, message summaries), and mobile phones (text messages) is often very noisy and challenging to process. Spelling errors, abbreviations, …

26 citations


Patent
30 Nov 2007
TL;DR: In this article, a speech data retrieving Web site system is provided which may improve erroneous indexing through user participation, by allowing users to correct text data obtained by conversion using a speech recognition technique.
Abstract: A speech data retrieving Web site system is provided which may improve erroneous indexing through user participation, by allowing users to correct text data obtained by conversion using a speech recognition technique. Speech data published on the Web is converted into text data by a speech recognition section 5. A text data publishing section 11 publishes the text data obtained by conversion of the speech data in a state searchable by a search engine, downloadable together with related information corresponding to the text data, and correctable. A text data correcting section 9 corrects the text data stored in a text data storage section 7, according to a correction result registration request supplied from a user terminal device 15 through the Internet.

22 citations


Patent
Rakesh Gupta, Quang Xuan Do
17 Oct 2007
TL;DR: In this article, a classifier is trained to identify text data having a specific format, such as situation-response or cause-effect, using a training corpus, and then extracts features from the text data with the specified format.
Abstract: The present invention provides a method for extracting relationships between words in textual data. Initially, a classifier is trained to identify text data having a specific format, such as situation-response or cause-effect, using a training corpus. The classifier receives input identifying components of the text data having the specified format and then extracts features from that text data, such as the part of speech of words in the text data, the semantic role of words within the text data, and sentence structure. These extracted features are then applied to text data to identify components which have the specified format. Rules are then extracted from the text data having the specified format.

19 citations


Patent
30 Mar 2007
TL;DR: In this article, a probabilistic word substitution model is used to determine likelihoods of the represented structured document text corresponding to the text in the input sequence, and then a most likely sequence of structured document texts is generated as an output.
Abstract: An input sequence of unstructured speech recognition text is transformed into output structured document text. A probabilistic word substitution model is provided which establishes association probabilities indicative of target structured document text correlating with source unstructured speech recognition text. The input sequence of unstructured speech recognition text is looked up in the word substitution model to determine likelihoods of the represented structured document text corresponding to the text in the input sequence. Then, a most likely sequence of structured document text is generated as an output.
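A toy rendering of the lookup step. The substitution table and probabilities below are invented; a real model would be estimated from parallel ASR-output/formatted-document data, and decoding would score whole sequences rather than picking the best substitution per token.

```python
# Map spoken-form tokens to their most probable structured written form
# via a (here hand-made) probabilistic word substitution table.
SUBST = {
    "two":     {"2": 0.7, "two": 0.3},
    "percent": {"%": 0.8, "percent": 0.2},
    "period":  {".": 0.9, "period": 0.1},
}

def structure(tokens):
    # Pick the most probable structured form for each token; unknown
    # tokens pass through unchanged.
    return " ".join(max(SUBST.get(t, {t: 1.0}).items(), key=lambda kv: kv[1])[0]
                    for t in tokens)

print(structure("rose two percent period".split()))  # rose 2 % .
```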

Proceedings ArticleDOI
30 Jul 2007
TL;DR: This research proposes concept chains that link semantically related concepts based on the HowNet knowledge database, to improve the performance of text summarization and to suit Chinese text.
Abstract: The rapid growth of the Internet has resulted in enormous amounts of information that has become more difficult to access efficiently. The primary goal of this research is to create an efficient tool that is able to summarize large documents automatically. We propose concept chains, which link semantically related concepts based on the HowNet knowledge database, to improve the performance of text summarization and to suit Chinese text. Lexical chaining is a technique for identifying semantically related terms in text. The resulting concept chains are then used to identify candidate sentences useful for extraction. Moreover, another method, based on structural features, is proposed to make the summary of the text more comprehensive and balanced. The final experimental results demonstrate the effectiveness of our methods.
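A drastically simplified concept-chain sketch. The paper uses the HowNet knowledge base for Chinese; here a hand-made word-to-concept dictionary stands in for it, and sentences are ranked by how many chains they touch.

```python
# Words mapping to the same concept join one chain; sentences touching
# strong chains become extraction candidates.
CONCEPT = {"car": "vehicle", "truck": "vehicle", "bus": "vehicle",
           "driver": "person", "passenger": "person"}

def chain_members(sentences):
    chains = {}
    for i, s in enumerate(sentences):
        for w in s.lower().split():
            c = CONCEPT.get(w)
            if c:
                chains.setdefault(c, []).append(i)
    return chains

sents = ["The bus hit a truck", "The driver was unhurt", "Markets were calm"]
chains = chain_members(sents)
# Rank sentences by how many distinct chains they participate in.
strength = {i: sum(i in idx for idx in chains.values()) for i in range(len(sents))}
print(chains, strength)
```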

Patent
Ick-sang Han, Joong-mi Cho, Yoon-kyung Song, Byung-kwan Kwak, Namhoon Kim, Ji-yeun Kim
21 Dec 2007
TL;DR: In this article, the authors propose a method and apparatus for automatically completing a text input using speech recognition, which includes: receiving a first part of a text from a user through a text input device; recognizing a speech of the user which corresponds to the text; and completing the remaining part of the text based on the first part and the recognized speech.
Abstract: Provided are a method and apparatus for automatically completing a text input using speech recognition. The method includes: receiving a first part of a text from a user through a text input device; recognizing a speech of the user, which corresponds to the text; and completing the remaining part of the text based on the first part of the text and the recognized speech. Thus, accuracy of the text input and convenience of the speech recognition can be ensured, and the non-input part of the text can be input easily and at high speed, based on the input part of the text and the recognized speech.

Proceedings ArticleDOI
28 Aug 2007
TL;DR: This paper proposes a bottom-up strategy to adapt associative classification to text categorization that takes into account the structure information of text; experiments show that the proposed strategy can make use of text structure information and achieve better performance.
Abstract: Associative classification, which originates from numerical data mining, has recently been applied to text data. Text data is first digitized into a database of transactions, and training and prediction are then conducted on the derived numerical dataset. This intuitive strategy has demonstrated quite good performance. However, it does not take the inherent characteristics of text data into consideration as much as possible, although it has to deal with some text-specific problems, such as lemmatizing and stemming, during digitization. In this paper, we propose a bottom-up strategy to adapt associative classification to text categorization, in which we take into account the structure information of text. Experiments on the Reuters-21578 dataset show that the proposed strategy can make use of text structure information and achieve better performance.

Journal Article
TL;DR: The developed approach is rule-based and made up of four phases: text tokenization, word light stemming, word morphological analysis, and text annotation; it produces an annotated text in which each word is tagged with its morphological attributes.
Abstract: A new approach for preprocessing vowelized and unvowelized Arabic texts in order to prepare them for Natural Language Processing (NLP) purposes is described. The developed approach is rule-based and made up of four phases: text tokenization, word light stemming, word morphological analysis, and text annotation. The first phase preprocesses the input text in order to isolate the words and represent them in a formal way. The second phase applies a light stemmer in order to extract the stem of each word by eliminating the prefixes and suffixes. The third phase is a rule-based morphological analyzer that determines the root and the morphological pattern for each extracted stem. The last phase produces an annotated text where each word is tagged with its morphological attributes. The preprocessor presented in this paper is capable of dealing with vowelized and unvowelized words, and provides the input words along with relevant linguistic information needed by different applications. It is designed to be used with different NLP applications such as machine translation, text summarization, text correction, information retrieval, and automatic vowelization of Arabic text.
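The light-stemming phase can be sketched directly. The affix lists below are abbreviated examples, not the paper's full rule set, and the length guard (keep at least three letters) is an assumption.

```python
# Strip one common prefix and one common suffix from an Arabic word,
# keeping a minimum stem length so short words are left intact.
PREFIXES = ["وال", "بال", "كال", "فال", "ال"]  # e.g. definite article "al-"
SUFFIXES = ["ها", "ات", "ون", "ين", "ه", "ة"]

def light_stem(word):
    for p in sorted(PREFIXES, key=len, reverse=True):
        if word.startswith(p) and len(word) - len(p) >= 3:
            word = word[len(p):]
            break
    for s in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(s) and len(word) - len(s) >= 3:
            word = word[:-len(s)]
            break
    return word

print(light_stem("المعلمون"))  # strips "ال" and "ون" -> "معلم"
```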

Proceedings ArticleDOI
22 Oct 2007
TL;DR: This paper presents an algorithm for N-gram extraction from huge datasets; the experiments indicate that the approach achieves outstanding results among available solutions in terms of speed and amount of processed data.
Abstract: In natural language processing (NLP), mainly single words are utilized to represent text documents. Recent studies have shown that this approach can often be improved by employing other, more sophisticated features. Among them, N-grams in particular have been used successfully for this purpose, and many algorithms and procedures for their extraction have been proposed. However, these are usually not primarily intended for large-scale data processing, which has become a critical task. In this paper we present an algorithm for N-gram extraction from huge datasets. The experiments indicate that our approach achieves outstanding results among other available solutions in terms of speed and amount of processed data.
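A toy streaming extractor showing only the core counting idea; the paper's algorithm is engineered for huge datasets, whereas this sketch merely processes the corpus line by line so the whole file is never loaded into memory.

```python
# Count word n-grams over an iterable of lines and keep the frequent ones.
from collections import Counter
from itertools import islice

def line_ngrams(line, n):
    toks = line.split()
    # zip over n staggered views of the token list yields the n-grams.
    return zip(*(islice(toks, i, None) for i in range(n)))

def count_ngrams(lines, n=2, min_count=2):
    counts = Counter()
    for line in lines:
        counts.update(line_ngrams(line, n))
    return {g: c for g, c in counts.items() if c >= min_count}

corpus = ["the quick brown fox", "the quick red fox", "a slow brown fox"]
print(count_ngrams(corpus))  # {('the', 'quick'): 2, ('brown', 'fox'): 2}
```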

Book ChapterDOI
01 Oct 2007
TL;DR: This study constructed Muscorian from MontyLingua, a generic text processor, using a previously proposed two-layered generalization-specialization paradigm in which text is generically processed into a suitable intermediate format before domain-specific data extraction techniques are applied at the specialization layer.
Abstract: The exponential increase in the publication rate of new articles is limiting researchers' access to relevant literature. This has prompted the use of text mining tools to extract key biological information. Previous studies have reported extensive modification of existing generic text processors to process biological text. However, this requirement for modification had not been examined. In this study, we constructed Muscorian using MontyLingua, a generic text processor. It uses a previously proposed two-layered generalization-specialization paradigm, in which text is generically processed into a suitable intermediate format before domain-specific data extraction techniques are applied at the specialization layer. Evaluation using a corpus and experts indicated 86-90% precision and approximately 30% recall in extracting protein-protein interactions, which was comparable to previous studies using either specialized biological text processing tools or modified existing tools. Our study also demonstrated the flexibility of the two-layered generalization-specialization paradigm by using the same generalization layer for two specialized information extraction tasks.

Proceedings ArticleDOI
23 Sep 2007
TL;DR: Two freely available databases are presented, one consisting of annotated screenshot images of 28,080 single characters and another holding 400 words extracted from documents plus 2,400 generated isolated words; both include meta-information such as x-height, font type, style, and rendering conditions.
Abstract: The recognition of screen-rendered text is a novel task. It is performed, e.g., by translation tools which allow users to click on any text on the screen and receive a translation. Some commercial OCR programs have also started to address the problem of reading screenshots. Optical character recognition on screenshot images can be very challenging due to very small and smoothed fonts. In order to build and compare recognition approaches for screen-rendered text, the availability of standard databases is a fundamental prerequisite. In this paper two freely available databases are presented, one that consists of annotated screenshot images of 28,080 single characters and another holding 400 words extracted from documents plus 2,400 generated isolated words. Both databases include meta-information such as x-height, font type, style, and rendering conditions. Using a developed recognition system as an example, it is shown how these databases can serve for training, testing, and optimization.

Proceedings ArticleDOI
23 Sep 2007
TL;DR: This paper proposes two sets of metrics to evaluate the performance of text localization algorithms in different usage conditions; the metrics consider the text distribution characteristics and the difficulties of the underlying task.
Abstract: The localization of text in images or video is the first step in a text processing system, and its quality has a great impact on the subsequent processing steps. Although many studies have been done on text localization algorithms, there is no universally accepted performance evaluation method. In this paper we propose two sets of metrics to evaluate the performance of text localization algorithms in different usage conditions. The metrics also consider the text distribution characteristics and the difficulties of the underlying task. Some experiments on the proposed metrics are also given.

Proceedings ArticleDOI
29 Jan 2007
TL;DR: A distributed system to extract text contained in natural scenes within consumer photographs, designed to process a large volume of photos, achieves very high text retrieval rate and data throughput with very small false detection rates.
Abstract: We present a distributed system to extract text contained in natural scenes within consumer photographs. The objective is to automatically annotate pictures in order to make consumer photo sets searchable based on the image content. The system is designed to process a large volume of photos, by quickly isolating candidate text regions, and successively cascading them through a series of text recognition engines which jointly make a decision on whether or not the region contains text that is readable by OCR. In addition, a dedicated rejection engine is built on top of each text recognizer to adapt its confidence measure to the specifics of the task. The resulting system achieves very high text retrieval rate and data throughput with very small false detection rates.

Proceedings ArticleDOI
30 Oct 2007
TL;DR: The data processing modules of TexPlorer consist of named entity extraction, entity relation extraction, hierarchical clustering, and text summarization tools that allow users to explore vast amounts of text documents efficiently.
Abstract: TexPlorer is an integrated system for exploring and analyzing large amounts of text documents. The data processing modules of TexPlorer consist of named entity extraction, entity relation extraction, hierarchical clustering, and text summarization tools. Using a timeline tool, tree view, table view, and concept maps, TexPlorer provides an analytical interface for exploring a set of text documents from different perspectives and allows users to explore vast amounts of text documents efficiently.

Journal Article
TL;DR: The text similarity computing based on word co-occurrence presented in this paper enables users to delete or retain text collections similar to a given text, in order to improve retrieval efficiency.
Abstract: In text retrieval, insufficient expression of the client's requirements usually leads to large amounts of inappropriate information, which is inconvenient for the user. The text similarity computation based on word co-occurrence presented in this paper enables users to delete or retain text collections similar to a given text, in order to improve retrieval efficiency.
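The abstract does not give the exact formula, so the following is one plausible reading: represent each text by the set of word pairs that co-occur in it and compare the sets with Jaccard similarity.

```python
# Similarity between two texts via their shared co-occurring word pairs.
from itertools import combinations

def cooc_pairs(text):
    return set(combinations(sorted(set(text.lower().split())), 2))

def similarity(a, b):
    pa, pb = cooc_pairs(a), cooc_pairs(b)
    return len(pa & pb) / len(pa | pb) if pa | pb else 0.0  # Jaccard

print(similarity("data mining of noisy text", "noisy text data cleaning"))
```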

Journal ArticleDOI
TL;DR: A new text mining methodology, prototype-matching for text clustering, developed by the authors' research group, is evaluated and discussed in terms of common business applications and possible future research.
Abstract: Text documents are the most common means for exchanging formal knowledge among people. Text is a rich medium that can contain a vast range of information, but text can be difficult to decipher automatically. Many organizations have vast repositories of textual data but with few means of automatically mining that text. Text mining methods seek to use an understanding of natural language text to extract information relevant to user needs. This article evaluates a new text mining methodology, prototype-matching for text clustering, developed by the authors' research group. The methodology was applied to four applications: clustering documents based on their abstracts, analyzing financial data, distinguishing authorship, and evaluating multiple translation similarity. The results are discussed in terms of common business applications and possible future research.

Journal Article
TL;DR: The experiment shows that the extraction of Chinese topic text from Web pages is fast and accurate enough to meet the requirements of constructing a large Chinese text corpus.
Abstract: A simple and efficient method for generally extracting Chinese topic text from Web pages is proposed in this paper, in order to build a large Chinese text corpus. The method utilizes only the length of Chinese text runs and sequences of punctuation marks, along with a few discrimination rules, to extract the needed text from Web pages accurately without analyzing HTML tags. The experiment shows the extraction is fast and accurate enough to meet the requirements of constructing a large Chinese text corpus.

Proceedings ArticleDOI
28 Jan 2007
TL;DR: Using only a small set of application-specific text and combining unsupervised text clustering with text retrieval techniques, the proposed approach can find relevant text in a large unorganized corpus and thereby adapt the training corpus towards the application area of interest.
Abstract: Application-relevant text data are very useful in various natural language applications. Using them can achieve significantly better performance in vocabulary selection and language modeling, which are widely employed in automatic speech recognition, intelligent input methods, etc. In some situations, however, the relevant data is hard to collect, and this scarcity of application-relevant training text makes such natural language processing difficult. In this paper, using only a small set of application-specific text and combining unsupervised text clustering with text retrieval techniques, the proposed approach finds relevant text in a large unorganized corpus and thereby adapts the training corpus towards the application area of interest. We use the performance of an n-gram statistical language model, trained on the retrieved text and tested on the application-specific text, to evaluate the relevance of the acquired text and, accordingly, to validate the effectiveness of our corpus adaptation approach. The language models trained from the ranked text bundles present well-discriminated perplexities on the application-specific text. Preliminary experiments on short message text and a large unorganized corpus demonstrate the performance of the proposed methods.
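The evaluation loop can be sketched in a bare-bones form, with a unigram add-one-smoothed model standing in for the paper's proper n-gram models: train on each retrieved text bundle and keep the bundle giving the lowest perplexity on the application-specific text.

```python
# Rank retrieved text bundles by the perplexity a model trained on them
# assigns to held-out application-specific text (lower = more relevant).
import math
from collections import Counter

def perplexity(train_text, test_text):
    counts = Counter(train_text.split())
    vocab = set(counts) | set(test_text.split())
    total = sum(counts.values())
    logp = sum(math.log((counts[w] + 1) / (total + len(vocab)))
               for w in test_text.split())
    return math.exp(-logp / len(test_text.split()))

app_text = "send short message to my phone"
bundles = {"sms": "send a message to a phone now",
           "news": "markets fell on rate fears"}
print(min(bundles, key=lambda k: perplexity(bundles[k], app_text)))  # sms
```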

Journal Article
Xiong Zhang
TL;DR: An overview of the state of the art in video text detection and recognition; typical feature-based and learning-based techniques are discussed, along with their merits and shortcomings.
Abstract: Text presented in video frames can provide important supplemental information for video indexing and retrieval. To help researchers understand this area more systematically, this paper gives an overview of the state of the art in video text detection and recognition. Typical feature-based and learning-based techniques and methods are discussed, as well as their merits and shortcomings. In view of the remaining problems, the paper also outlines work and issues that can be researched in the future.

Proceedings ArticleDOI
29 Oct 2007
TL;DR: This paper proposes Wiki-Query segmented features for text classification, in hopes of better using the text information; results show a much better F1 value than that of the classical single-word-based text representation.
Abstract: The rapid growth of Internet technology requires better management of Web page contents. Much text mining research has been conducted, on tasks such as text categorization, information retrieval, and text clustering. When machine learning methods or statistical models are applied to such a large scale of data, the first problem to solve is representing a text document in a form that computers can handle. Traditionally, single words are employed as the features in the vector space model, making up the feature space for all text documents. This single-word-based representation assumes word independence and does not consider relations between words, which may cause information loss. This paper proposes Wiki-Query segmented features for text classification, in hopes of better using the text information. The experimental results show that a much better F1 value is achieved than that of the classical single-word-based text representation. This means that Wikipedia- and query-segmented features can better represent a text document.

01 Jan 2007
TL;DR: A novel text representation model is proposed, which uses a lexical network to represent the text and retain its structure; it is applied to text classification to measure the representation ability of the model.
Abstract: Text representation is the basis of text processing. Most current text representation models do not consider relations between words, resulting in the loss of the text's structural information, which is important for understanding the text. This paper proposes a novel text representation model, which uses a lexical network to represent the text and retain its structure. According to the different levels of word relations, co-occurrence networks, syntactic networks, and semantic networks are introduced. The network representation is applied to text classification to measure the representation ability of the model. The experimental results show that our text network representation is superior to the vector space model.

Proceedings ArticleDOI
29 Oct 2007
TL;DR: A novel method for Chinese short text orientation identification that simulates human cognition is proposed; it makes full use of field knowledge, combines a tendency dictionary with semantic rules, and takes into account the sentiment orientation of words as modeled by a Naive Bayes model.
Abstract: With the rapid development of information technology, huge amounts of data are accumulated, much of which appears as short text. Orientation identification for short text is therefore very useful. However, traditional statistics-based text filtering technology is usually ineffective when it deals with orientation, especially for Chinese short text. This paper proposes a novel method for Chinese short text orientation identification which simulates human cognition. The approach makes full use of field knowledge, combines a tendency dictionary with semantic rules, and takes into account the sentiment orientation of words as modeled by a Naive Bayes model. Experiments show that the proposed method works well in terms of orientation identification for Chinese short text.
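A drastically simplified stand-in for the described approach: the tendency dictionary and negation rule below are invented examples, and the paper additionally uses field knowledge, semantic rules, and a Naive Bayes word-orientation model.

```python
# Score short text orientation from a tendency dictionary, with a simple
# negation rule flipping the polarity of the following word.
POSITIVE = {"good", "great", "excellent"}
NEGATIVE = {"bad", "poor", "awful"}
NEGATORS = {"not", "never"}

def orientation(text):
    tokens = text.lower().split()
    score = 0
    for i, t in enumerate(tokens):
        polarity = (t in POSITIVE) - (t in NEGATIVE)
        if polarity and i > 0 and tokens[i - 1] in NEGATORS:
            polarity = -polarity  # negation flips orientation
        score += polarity
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(orientation("the service was not good"))  # negative
```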

Proceedings ArticleDOI
23 Jul 2007
TL;DR: A novel system for evaluating and performing stream-based text categorization that implements character-based language models, specifically models based on the PPM text compression scheme, as well as count-based measures such as R-Measure and C-Measure, demonstrating that all of these techniques outperform SVM, a feature-based classifier, at stream-related classification tasks such as authorship ascription.
Abstract: We describe a novel system for evaluating and performing stream-based text categorization. Stream-based text categorization considers the text being categorized as a stream of symbols, which differs from the traditional feature-based approach that relies on extracting features from the text. The system implements character-based language models, specifically models based on the PPM text compression scheme, as well as count-based measures such as R-Measure and C-Measure. Use of the system demonstrates that all of these techniques outperform SVM, a feature-based classifier, at stream-related classification tasks such as authorship ascription.
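The cross-entropy trick behind compression-based categorization is easy to demonstrate, here with zlib standing in for PPM purely for brevity (PPM generally performs better for this): assign a document to the class whose training text it compresses best against.

```python
# Classify a document by how few extra bytes it costs to compress it
# appended to each class's training text.
import zlib

def extra_bytes(class_text, doc):
    base = len(zlib.compress(class_text.encode()))
    both = len(zlib.compress((class_text + " " + doc).encode()))
    return both - base  # fewer extra bytes = doc "looks like" the class

classes = {
    "austen": "it is a truth universally acknowledged that a single man",
    "melville": "call me ishmael some years ago never mind how long precisely",
}
doc = "a truth universally acknowledged"
print(min(classes, key=lambda c: extra_bytes(classes[c], doc)))  # austen
```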