
Showing papers on "Noisy text analytics published in 2010"


Proceedings ArticleDOI
19 Jul 2010
TL;DR: A small set of domain-specific features extracted from the author's profile and text is proposed for classifying short text messages into a predefined set of generic classes such as News, Events, Opinions, Deals, and Private Messages.
Abstract: In microblogging services such as Twitter, the users may become overwhelmed by the raw data. One solution to this problem is the classification of short text messages. As short texts do not provide sufficient word occurrences, traditional classification methods such as "Bag-Of-Words" have limitations. To address this problem, we propose to use a small set of domain-specific features extracted from the author's profile and text. The proposed approach effectively classifies the text to a predefined set of generic classes such as News, Events, Opinions, Deals, and Private Messages.
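
To make the approach concrete, here is a minimal sketch of feature-based short-text classification in this spirit. The features (organization authorship, links, replies, currency cues, opinion words) are illustrative guesses, not the paper's exact feature set:

```python
import re
from sklearn.tree import DecisionTreeClassifier

def extract_features(author_is_org: bool, text: str) -> list:
    """Map a short message and a coarse author-profile flag to a feature vector."""
    return [
        int(author_is_org),                            # organizations tend to post News/Deals
        int(bool(re.search(r"https?://", text))),      # contains a link
        int(text.startswith("@")),                     # reply -> likely Private Message
        int(bool(re.search(r"[$€£]|\d+%\s*off", text))),  # currency/discount -> Deals
        int(bool(re.search(r"\b(i|my|feel|think)\b", text, re.I))),  # opinion cues
        int("!" in text),                              # emphasis
    ]

# toy training data: (author_is_org, text, class)
train = [
    (True,  "Breaking: markets fall sharply http://t.co/x", "News"),
    (True,  "50% off all shoes today only", "Deals"),
    (False, "@bob are we still on for tonight?", "Private Messages"),
    (False, "I think this movie is terrible!", "Opinions"),
]
X = [extract_features(org, text) for org, text, _ in train]
y = [label for _, _, label in train]
clf = DecisionTreeClassifier(random_state=0).fit(X, y)
print(clf.predict([extract_features(False, "@ann call me later")]))
```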

782 citations


Proceedings Article
23 Aug 2010
TL;DR: An unsupervised method for the translation of noisy text to clean text is proposed, in which a weighted list of possible clean tokens is obtained for each noisy token.
Abstract: In this paper we look at the problem of cleansing noisy text using a statistical machine translation model. Noisy text is produced in informal communications such as Short Message Service (SMS), Twitter and chat. A typical Statistical Machine Translation system is trained on parallel text comprising noisy and clean sentences. In this paper we propose an unsupervised method for the translation of noisy text to clean text. Our method has two steps. For a given noisy sentence, a weighted list of possible clean tokens is obtained for each noisy token. The clean sentence is then obtained by maximizing the product of the weighted lists and the language model scores.
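
As a rough illustration of the decoding step, the sketch below recovers the clean sentence by Viterbi search over per-token candidate lists, scoring each path by candidate weight times a bigram language-model probability. The candidate lists and LM values are toy numbers, and the paper's candidate generation is far richer:

```python
import math

def viterbi_clean(candidates, bigram_lm, start="<s>"):
    """candidates: one {clean_token: weight} dict per noisy token;
    bigram_lm: {(prev_word, word): probability}."""
    best = {start: (0.0, [])}        # token -> (log score of best path, path)
    for cand in candidates:
        new_best = {}
        for tok, w in cand.items():
            new_best[tok] = max(
                (score + math.log(w) + math.log(bigram_lm.get((prev, tok), 1e-6)),
                 path + [tok])
                for prev, (score, path) in best.items()
            )
        best = new_best
    return max(best.values())[1]

# toy candidate lists for the noisy SMS "r u ok"
cands = [{"are": 0.9, "our": 0.1}, {"you": 0.8, "ewe": 0.2}, {"okay": 0.7, "ok": 0.3}]
lm = {("<s>", "are"): 0.2, ("are", "you"): 0.5, ("you", "okay"): 0.3, ("you", "ok"): 0.2}
print(viterbi_clean(cands, lm))      # -> ['are', 'you', 'okay']
```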

76 citations


Proceedings ArticleDOI
26 Apr 2010
TL;DR: This work studies a traditional text mining task, the extraction of meaningful keywords, on this new form of text, proposing several intuitive yet useful features and experimenting with various classification models.
Abstract: Today, a huge amount of text is being generated for social purposes on social networking services on the Web. Unlike traditional documents, such text is usually extremely short and tends to be informal. Analysis of such text benefits many applications such as advertising, search, and content filtering. In this work, we study one traditional text mining task on this new form of text: the extraction of meaningful keywords. We propose several intuitive yet useful features and experiment with various classification models. Evaluation is conducted on Facebook data. The performance of various features and models is reported and compared.
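
A minimal sketch of keyword extraction cast as per-word classification, assuming illustrative features (in-post frequency, relative position, length, capitalization) and toy labels rather than the paper's exact feature set and Facebook data:

```python
from collections import Counter
from sklearn.linear_model import LogisticRegression

def word_features(post: str):
    """Turn each word of a post into a small feature vector."""
    words = post.split()
    tf = Counter(w.lower() for w in words)
    feats, vocab = [], []
    for i, w in enumerate(words):
        feats.append([
            tf[w.lower()],                # frequency inside the post
            i / max(len(words) - 1, 1),   # relative position
            len(w),                       # word length
            int(w[0].isupper()),          # capitalized (named-entity hint)
        ])
        vocab.append(w)
    return vocab, feats

# toy labels: 1 = keyword, 0 = not (normally annotated on real posts)
post = "Amazing concert by Coldplay tonight in Boston tickets still available"
vocab, X = word_features(post)
y = [0, 1, 0, 1, 0, 0, 1, 0, 0, 0]
clf = LogisticRegression().fit(X, y)
print([w for w, p in zip(vocab, clf.predict(X)) if p == 1])  # predicted keywords
```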

63 citations


01 Jan 2010
TL;DR: An in-depth analysis of classification algorithms, including their advantages, disadvantages, and working modes, is presented, along with various text mining and data visualization tools for patent information, covering their working modes, capabilities, data sources, and result outputs.
Abstract: Text Mining has become an important research area, referring to the application of machine learning (or data mining) techniques in the study of Information Retrieval and Natural Language Processing. In essence, it is defined as the discovery of knowledge from the ubiquitous text data that is easily accessible over the Internet or an intranet. A survey of text mining techniques and applications, together with a literature survey of various applications and tools, is presented. Text mining techniques such as document clustering and document classification are covered. Text-mining-based frameworks for applications such as summarization, topic discovery, information extraction, and information retrieval are discussed, along with the terms and techniques used in each method. Various text mining and data visualization tools for patent information are presented, including their working modes, capabilities, data sources, and result outputs. An in-depth analysis of classification algorithms, with their advantages, disadvantages, and working modes, is also presented.

40 citations


01 Jan 2010
TL;DR: This paper studies the impact of text pre-processing and different term weighting schemes on Arabic text classification, and develops new combinations of term weighting schemes to be applied to Arabic text for classification purposes.
Abstract: Text mining has drawn more and more attention recently; it has been applied to different domains including web mining, opinion mining, and sentiment analysis. Text pre-processing is an important stage in text mining. The major obstacles in text mining are the very high dimensionality and the large size of text data. Natural language processing and morphological tools can be employed to reduce the dimensionality and size of text data. In addition, many term weighting schemes available in the literature may be used to enhance text representation as a feature vector. In this paper, we study the impact of text pre-processing and different term weighting schemes on Arabic text classification. In addition, we develop new combinations of term weighting schemes to be applied to Arabic text for classification purposes.
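
For illustration, a minimal sketch of common term weighting schemes of the kind such studies combine (generic formulas; the paper's specific Arabic pre-processing pipeline is not reproduced here):

```python
import math
from collections import Counter

docs = [["قط", "كلب", "قط"], ["كلب", "طير"]]  # pre-tokenized documents

def weights(doc, docs):
    """Compute several classic term weights for one document."""
    tf = Counter(doc)
    n_docs = len(docs)
    out = {}
    for term, f in tf.items():
        df = sum(term in d for d in docs)            # document frequency
        out[term] = {
            "raw_tf": f,
            "log_tf": 1 + math.log(f),
            "tf_idf": f * math.log(n_docs / df),
            "log_tf_idf": (1 + math.log(f)) * math.log(1 + n_docs / df),
        }
    return out

print(weights(docs[0], docs))
```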

39 citations


Journal ArticleDOI
TL;DR: This paper presents a text steganography system for spelling languages that works reliably and is immune to regular operations such as formatting, compressing and, to some extent, manual alteration of text size, font, color and the spacing between words.
Abstract: High transmission efficiency, low resource occupancy and intelligible meaning make text messages the most commonly used type of media in our daily communication. Text steganography, an information hiding technology based on text messages, has become a new line of research in recent years. Due to the restricted redundant information in text, as well as its alterability under manual operation, it is difficult to hide secret information in text messages effectively and reliably. Based on a Markov Chain source model and the DES algorithm, this paper presents a text steganography system for spelling languages. It works reliably and is immune to regular operations such as formatting, compressing and, to some extent, manual alteration of text size, font, color and the spacing between words. By adopting a heuristic composition technique, the system produces cover text close to natural language. It is suitable for hiding short information in online communication, such as e-mail, MSN Messenger, QQ, instant conversation and short messages on mobile phones.
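
A minimal sketch of the Markov-chain side of such a system: a bigram model generates natural-looking cover text while secret bits select among branch choices. The DES encryption of the payload and the paper's heuristic composition are omitted, and the corpus is a toy:

```python
import random
from collections import defaultdict

corpus = "the cat sat on the mat the dog sat on the rug".split()
model = defaultdict(list)
for a, b in zip(corpus, corpus[1:]):
    model[a].append(b)                      # bigram successor lists

def embed(bits, start="the", length=10):
    """Choose among successor words to encode one bit per branching step."""
    out, word, i = [start], start, 0
    while len(out) < length:
        succ = sorted(set(model[word]))
        if len(succ) >= 2 and i < len(bits):
            word = succ[int(bits[i]) % len(succ)]   # secret bit selects the branch
            i += 1
        else:
            word = random.choice(succ) if succ else start
        out.append(word)
    return " ".join(out)

print(embed("1011"))
```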

37 citations


Posted Content
TL;DR: A new algorithm for text classification using data mining that requires fewer documents for training, with a single Genetic Algorithm concept added for the final classification.
Abstract: Text classification is the process of classifying documents into predefined categories based on their content. It is the automated assignment of natural language texts to predefined categories. Text classification is the primary requirement of text retrieval systems, which retrieve texts in response to a user query, and of text understanding systems, which transform text in some way, such as producing summaries, answering questions or extracting data. Existing supervised learning algorithms for automatically classifying text need sufficient documents to learn accurately. This paper presents a new algorithm for text classification using data mining that requires fewer documents for training. Instead of using words, word relations, i.e., association rules derived from these words, are used to build the feature set from pre-classified text documents. The concept of a Naive Bayes classifier is then used on the derived features, and finally a single Genetic Algorithm concept is added for the final classification. A system based on the proposed algorithm has been implemented and tested. The experimental results show that the proposed system works as a successful text classifier.
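
A minimal sketch of the feature idea, assuming co-occurring word pairs as a simple stand-in for mined association rules; the Genetic Algorithm refinement step is omitted:

```python
from itertools import combinations
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

def pair_features(text: str) -> str:
    """Rewrite a document as the set of its sorted word pairs."""
    words = sorted(set(text.lower().split()))
    return " ".join("_".join(p) for p in combinations(words, 2))

docs = ["stock market rises", "market crash fears", "team wins final", "final score tied"]
labels = ["business", "business", "sports", "sports"]

vec = CountVectorizer(token_pattern=r"\S+")
X = vec.fit_transform(pair_features(d) for d in docs)
clf = MultinomialNB().fit(X, labels)

test = vec.transform([pair_features("market rises again")])
print(clf.predict(test))  # -> ['business']
```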

31 citations


Patent
28 Dec 2010
TL;DR: In this paper, a system for identifying and classifying text related to an object was proposed, where a set of training phrases for a classifier was developed and then used to analyze the text in the documents.
Abstract: Text in web pages or other text documents may be classified based on the images or other objects within the webpage. A system for identifying and classifying text related to an object may identify one or more web pages containing the image or similar images, determine topics from the text of the document, and develop a set of training phrases for a classifier. The classifier may be trained and then used to analyze the text in the documents. The training set may include both positive examples and negative examples of text taken from the set of documents. A positive example may include captions or other elements directly associated with the object, while negative examples may include text taken from the documents, but from a large distance from the object. In some cases, the system may iterate on the classification process to refine the results.
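
A rough sketch of the training-set construction the patent describes, with hypothetical element offsets and distance thresholds:

```python
def build_training_set(elements, image_pos, near=1, far=5):
    """elements: list of (position, text) in document order;
    text near the image becomes positive, distant text negative."""
    positives, negatives = [], []
    for pos, text in elements:
        d = abs(pos - image_pos)
        if d <= near:
            positives.append(text)       # caption-like, directly associated
        elif d >= far:
            negatives.append(text)       # far from the object, likely unrelated
    return positives, negatives

doc = [(0, "Site header"),
       (3, "Photo: a red-crested cardinal"),
       (4, "Cardinals are songbirds found in the Americas"),
       (9, "Subscribe to our newsletter")]
pos, neg = build_training_set(doc, image_pos=3)
print(pos, neg)
```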

22 citations


Proceedings ArticleDOI
30 Sep 2010
TL;DR: A model for summarization of large documents using a novel approach is proposed and extended to an Indian regional language (Kannada), and various analyses of the results are discussed.
Abstract: Information Extraction is a method for filtering information from large volumes of text. Information Extraction is a more limited task than full text understanding. In full text understanding, we aspire to represent all the information in a text in an explicit fashion. In contrast, in Information Extraction we delimit in advance, as part of the task specification, the semantic range of the output. In this paper, a model for summarization of large documents using a novel approach is proposed. The work is extended to an Indian regional language (Kannada), and various analyses of the results are discussed.

20 citations


Proceedings Article
31 Mar 2010
TL;DR: Two new discriminative topic segmentation algorithms are given that employ a new measure of text similarity based on word co-occurrence. It is demonstrated that using a lattice of competing hypotheses, rather than just the one-best hypothesis, as input to the segmentation algorithm improves its performance.
Abstract: We explore automated discovery of topically coherent segments in speech or text sequences. We give two new discriminative topic segmentation algorithms which employ a new measure of text similarity based on word co-occurrence. Both algorithms function by finding extrema in the similarity signal over the text, with the latter algorithm using a compact support-vector based description of a window of text or speech observations in word similarity space to overcome noise introduced by speech recognition errors and off-topic content. In experiments over speech and text news streams, we show that these algorithms outperform previous methods. We observe that topic segmentation of speech recognizer output is a more difficult problem than that of text streams; however, we demonstrate that by using a lattice of competing hypotheses rather than just the one-best hypothesis as input to the segmentation algorithm, the performance of the algorithm can be improved.
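
The sketch below shows the general shape of similarity-driven segmentation: score each gap between adjacent text windows and place boundaries at local minima of the similarity signal. It uses plain cosine similarity over bags of words, not the paper's co-occurrence-based measure or support-vector window description:

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def segment(sentences, w=2):
    """Return indices i where a topic boundary falls before sentence i."""
    bags = [Counter(s.lower().split()) for s in sentences]
    sims = []
    for i in range(1, len(bags)):
        left = sum(bags[max(0, i - w):i], Counter())   # window before the gap
        right = sum(bags[i:i + w], Counter())          # window after the gap
        sims.append(cosine(left, right))
    # boundaries at local minima of the similarity signal
    return [i + 1 for i in range(1, len(sims) - 1)
            if sims[i] < sims[i - 1] and sims[i] <= sims[i + 1]]

text = ["the senate passed the budget bill", "senators debated the budget bill",
        "the budget bill now becomes law", "the giants won the baseball game",
        "the giants pitcher threw a shutout", "fans cheered the giants victory"]
print(segment(text))  # -> [3]: boundary before the first sports sentence
```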

16 citations


Proceedings ArticleDOI
16 Nov 2010
TL;DR: A new method for detecting multi-oriented handwritten scene text in video is presented, based on maximum color difference and a boundary growing method that uses a nearest neighbor concept.
Abstract: There are many video images in which handwritten text may appear. Handwritten scene text detection in video is therefore essential and useful for many applications, such as efficient indexing and retrieval. There are also many video frames in which text lines may be multi-oriented in nature. To the best of our knowledge, there is no prior work on detecting multi-oriented handwritten text in video. In this paper, we present a new method based on maximum color difference and a boundary growing method for the detection of multi-oriented handwritten scene text in video. The method computes the maximum color difference for the average of the R, G and B channels of the original frame to enhance the text information. The output of the maximum color difference is fed to a K-means algorithm with K=2 to separate text and non-text clusters. Text candidates are obtained by intersecting the text cluster with the Sobel output of the original frame. To tackle the fundamental problem of different orientations and skews of handwritten text, a boundary growing method based on a nearest neighbor concept is employed. We evaluate the proposed method on our own handwritten text database and on publicly available video data (Hua's data). Experimental results obtained from the proposed method are promising.
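
A rough sketch of this pipeline using standard OpenCV operations; the window size, threshold, and the exact form of the maximum color difference computation are illustrative, and the boundary growing step is omitted:

```python
import cv2
import numpy as np

frame = cv2.imread("frame.png")                     # hypothetical input frame
avg = frame.mean(axis=2).astype(np.uint8)           # average of R, G and B

# max color difference: local max minus local min of the averaged image
k = np.ones((5, 5), np.uint8)
mcd = cv2.dilate(avg, k) - cv2.erode(avg, k)        # enhances text-like contrast

# K-means with K=2 separates text and non-text clusters
data = mcd.reshape(-1, 1).astype(np.float32)
criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 10, 1.0)
_, labels, centers = cv2.kmeans(data, 2, None, criteria, 5, cv2.KMEANS_RANDOM_CENTERS)
text_cluster = (labels.reshape(mcd.shape) == centers.argmax()).astype(np.uint8)

# text candidates: intersection of the text cluster with Sobel edges
sobel = cv2.Sobel(avg, cv2.CV_8U, 1, 0, ksize=3)
edges = cv2.threshold(sobel, 50, 255, cv2.THRESH_BINARY)[1]
candidates = cv2.bitwise_and(text_cluster * 255, edges)
cv2.imwrite("candidates.png", candidates)
```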

Posted Content
TL;DR: A new algorithm for text classification using an artificial intelligence technique that requires fewer documents for training, with a single genetic algorithm concept added for the final classification.
Abstract: Text classification is the process of classifying documents into predefined categories based on their content. It is the automated assignment of natural language texts to predefined categories. Text classification is the primary requirement of text retrieval systems, which retrieve texts in response to a user query, and of text understanding systems, which transform text in some way, such as producing summaries, answering questions or extracting data. Existing supervised learning algorithms for classifying text need sufficient documents to learn accurately. This paper presents a new algorithm for text classification using an artificial intelligence technique that requires fewer documents for training. Instead of using words, word relations, i.e., association rules derived from these words, are used to build the feature set from pre-classified text documents. The concept of a naïve Bayes classifier is then used on the derived features, and finally a single genetic algorithm concept is added for the final classification. A system based on the proposed algorithm has been implemented and tested. The experimental results show that the proposed system works as a successful text classifier.

Proceedings ArticleDOI
21 Jun 2010
TL;DR: The ParaText text analysis engine, a distributed memory software framework for processing, modeling, and analyzing collections of unstructured text documents, is presented, with results on several document collections illustrating the flexibility, extensibility, and scalability of the entire text modeling process.
Abstract: Automated analysis of unstructured text documents (e.g., web pages, newswire articles, research publications, business reports) is a key capability for solving important problems in areas including decision making, risk assessment, social network analysis, intelligence analysis, scholarly research and others. However, as data sizes continue to grow in these areas, scalable processing, modeling, and semantic analysis of text collections becomes essential. In this paper, we present the ParaText text analysis engine, a distributed memory software framework for processing, modeling, and analyzing collections of unstructured text documents. Results on several document collections using hundreds of processors are presented to illustrate the flexibility, extensibility, and scalability of the entire process of text modeling from raw data ingestion to application analysis.

Proceedings ArticleDOI
Ye Tian1, Wendong Wang1, Xueli Wang1, Jinghai Rao2, Canfeng Chen2, Jian Ma2 
26 Oct 2010
TL;DR: This paper first clusters the text messages into candidate conversations based on their temporal attributes, and then performs further analysis using a semantic model based on Latent Dirichlet Allocation (LDA).
Abstract: How to organize and visualize a large amount of text messages stored on one's mobile phone is a challenging problem, since they can hardly be organized by threads as we do for emails, due to the lack of necessary metadata such as "subject" and "reply-to". In this paper, we propose an innovative approach based on clustering algorithms and natural language processing methods. We first cluster the text messages into candidate conversations based on their temporal attributes, and then perform further analysis using a semantic model based on Latent Dirichlet Allocation (LDA). Considering that text messages are usually short and sparse, we trained the model using large-scale external data collected from Twitter-like web sites, and applied the model to the text messages. In the end, the text messages are organized as conversations based on their topics. We evaluated our approach on 122,359 text messages collected from 50 university students over 6 months.
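
A minimal sketch of the two-stage idea: split messages into candidate conversations wherever the time gap exceeds a threshold, then assign topics with LDA. The 30-minute gap, tiny corpus, and use of scikit-learn's LDA are illustrative; the paper trains its topic model on a large external microblog corpus:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# (minute_timestamp, text) pairs, sorted by time
msgs = [(0, "lunch at noon?"), (5, "sure, where?"), (9, "the thai place"),
        (300, "did you finish the report"), (306, "sending it tonight")]

def temporal_clusters(messages, gap=30):
    """Start a new candidate conversation when the time gap exceeds `gap`."""
    clusters, current = [], [messages[0]]
    for prev, cur in zip(messages, messages[1:]):
        if cur[0] - prev[0] > gap:
            clusters.append(current)
            current = []
        current.append(cur)
    clusters.append(current)
    return clusters

convs = temporal_clusters(msgs)
texts = [" ".join(m[1] for m in c) for c in convs]

vec = CountVectorizer()
X = vec.fit_transform(texts)            # stand-in for the external training corpus
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
print(lda.transform(X).argmax(axis=1))  # topic label per candidate conversation
```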

Patent
07 Jul 2010
TL;DR: In this article, a method is performed on a handheld device that involves receiving a search term and searching through stored text message information of multiple text messages for the search term, and in response to a user selection of one of the conversations, a sequence of text messages within the one selected conversation is displayed.
Abstract: A method is performed on a handheld device that involves receiving a search term and searching through stored text message information of multiple text messages for the search term. Text message conversations are listed on the display, where each listed conversation has at least one text message whose text message information was found to include the search term. In response to a user selection of one of the conversations, a sequence of text messages within the selected conversation is displayed.

Patent
07 Apr 2010
TL;DR: In this article, a method for converting speech to text in a speech analytics system is presented, which includes receiving audio data containing speech made up of sounds from an audio source, processing the sounds with a phonetic module resulting in symbols corresponding to the sounds, and processing the symbols with a language module and occurrence table resulting in text.
Abstract: A method for converting speech to text in a speech analytics system is provided. The method includes receiving audio data containing speech made up of sounds from an audio source, processing the sounds with a phonetic module resulting in symbols corresponding to the sounds, and processing the symbols with a language module and occurrence table resulting in text. The method also includes determining a probability of correct translation for each word in the text, comparing the probability of correct translation for each word in the text to the occurrence table, and adjusting the occurrence table based on the probability of correct translation for each word in the text.

Book ChapterDOI
08 Nov 2010
TL;DR: The applicability of self-training, a form of semi-supervised learning, to neural network based handwriting recognition is demonstrated, significantly increasing the performance of a handwriting recognition system.
Abstract: Off-line handwriting recognition deals with the task of automatically recognizing handwritten text from images, for example from scanned sheets of paper. Due to the tremendous variations of writing styles encountered between different individuals, this is a very challenging task. Traditionally, a recognition system is trained by using a large corpus of handwritten text that has to be transcribed manually. This, however, is a laborious and costly process. Recent developments have proposed semi-supervised learning, which reduces the need for manually transcribed text by adding large amounts of handwritten text without transcription to the training set. The current paper is the first one, to the knowledge of the authors, where semi-supervised learning for unconstrained handwritten text line recognition is proposed. We demonstrate the applicability of self-training, a form of semi-supervised learning, to neural network based handwriting recognition. Through a set of experiments we show that text without transcription can successfully be used to significantly increase the performance of a handwriting recognition system.
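
A minimal sketch of the self-training loop described here, with a generic scikit-learn classifier standing in for the neural recognizer and a confidence threshold for promoting unlabeled samples:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def self_train(X_lab, y_lab, X_unlab, threshold=0.9, rounds=5):
    """Iteratively move confidently-labeled unlabeled samples into training."""
    X_lab, y_lab = np.array(X_lab, dtype=float), np.array(y_lab)
    X_unlab = np.array(X_unlab, dtype=float)
    for _ in range(rounds):
        if len(X_unlab) == 0:
            break
        clf = LogisticRegression().fit(X_lab, y_lab)
        proba = clf.predict_proba(X_unlab)
        confident = proba.max(axis=1) >= threshold   # keep only confident labels
        if not confident.any():
            break
        X_lab = np.vstack([X_lab, X_unlab[confident]])
        y_lab = np.concatenate([y_lab, clf.predict(X_unlab[confident])])
        X_unlab = X_unlab[~confident]
    return LogisticRegression().fit(X_lab, y_lab)

clf = self_train([[0, 0], [1, 1]], [0, 1], [[0.1, 0.2], [0.9, 0.8], [0.5, 0.5]])
print(clf.predict([[0.2, 0.1], [0.8, 0.9]]))
```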

Proceedings ArticleDOI
05 Jul 2010
TL;DR: This paper uses two data compression ratios of text data in place of the attribute in the extraction method, namely the compression ratio under Run Length Encoding (RLE) and that under LZ77, and shows that the extraction method using the RLE compression ratio works better than both the LZ77 variant and the previous extraction method.
Abstract: Text-based pictures called text art are often used in Web pages, email text and so on. They enrich expression in text data, but they can be noise when handling the text data; for example, they can be an obstacle for text-to-speech software and natural language processing. Text art extraction methods, which detect the area of text art in a given text, help to solve such problems. Previously proposed text art extraction methods, however, will not work well for text data containing more than one natural language, because they assume that a specific natural language is used in the text data. We proposed a text art extraction method for multiple natural languages in a past paper. That extraction method uses an attribute based on successive occurrences of the same two characters, which captures the characteristic that the same characters often appear successively in text art. In this paper, we use two data compression ratios of the text data in place of this attribute, namely the compression ratio under Run Length Encoding (RLE) and that under LZ77. Our experiments show that the extraction method using the RLE compression ratio works better than both the LZ77 variant and our previous extraction method.
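
A minimal sketch of the compression-ratio idea: text art compresses well under run-length encoding because the same character often repeats, so a low RLE ratio flags a line as likely text art. The encoding format and 0.7 threshold are illustrative:

```python
from itertools import groupby

def rle_ratio(line: str) -> float:
    """Length of the run-length encoding divided by the original length."""
    if not line:
        return 1.0
    encoded = sum(1 + len(str(len(list(g)))) for _, g in groupby(line))
    return encoded / len(line)

def looks_like_text_art(line: str, threshold=0.7) -> bool:
    return rle_ratio(line) < threshold

print(looks_like_text_art("+------------------+"))             # -> True
print(looks_like_text_art("Meet me at the station at noon"))   # -> False
```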

Proceedings ArticleDOI
11 Jul 2010
TL;DR: This paper presents an efficient approach to text extraction based on multiple frame integration and a stroke filter, in which text block filtering and integration are used to obtain a clean background and clear, high-contrast text for effective recognition.
Abstract: Text in video frames contains high-level semantic information and can thus contribute significantly to video content analysis and retrieval. Video text recognition is therefore crucial to research in video indexing and summarization. In video text recognition, the extraction step removes the background in text rows so that only the text pixels are left for recognition. Although many efforts have been made toward video text extraction, much research work remains. Text extraction from images is still a challenging problem for character recognition applications, due to complex backgrounds, unknown text color, text quality degraded by lossy compression, and different language characteristics. In this paper, we present an efficient approach to text extraction based on multiple frame integration and a stroke filter. First, text block filtering and integration are used to obtain a clean background and clear, high-contrast text for effective recognition. Second, the stroke-like structure of characters is considered: a stroke filter is applied to capture the step-like edges while the text parts are enhanced. Finally, text pixels missed in the former steps are recalled by local region growing.

Proceedings ArticleDOI
26 Oct 2010
TL;DR: The Language Pyramid (LaP) model is presented, which casts a document as a probabilistic distribution over the joint semantic-spatial space and motivates a multi-scale 2D local smoothing framework for nonparametric text coding.
Abstract: The classical Bag-of-Word (BOW) model represents a document as a histogram of word occurrence, losing the spatial information that is invaluable for many text analysis tasks. In this paper, we present the Language Pyramid (LaP) model, which casts a document as a probabilistic distribution over the joint semantic-spatial space and motivates a multi-scale 2D local smoothing framework for nonparametric text coding. LaP efficiently encodes both semantic and spatial contents of a document into a pyramid of matrices that are smoothed both semantically and spatially at a sequence of resolutions, providing a convenient multi-scale imagic view for natural language understanding. The LaP representation can be used in text analysis in a variety of ways, among which we investigate two instantiations in the current paper: (1) multi-scale text kernels for document categorization, and (2) multi-scale language models for ad hoc text retrieval. Experimental results illustrate that: for classification, LaP outperforms BOW by (up to) 4% on moderate-length texts (RCV1 text benchmark) and 15% on short texts (Yahoo! queries); and for retrieval, LaP gains 12% MAP improvement over uni-gram language models on the OHSUMED data set.

Proceedings Article
01 Jun 2010
TL;DR: Text produced by processing signals intended for human use is often noisy for automated computer processing, and processing techniques like Automatic Speech Recognition, Optical Character Recognition and Machine Translation introduce processing noise.
Abstract: Text produced by processing signals intended for human use is often noisy for automated computer processing. Digital text produced in informal settings such as online chat, SMS, emails, tweets, message boards, newsgroups, blogs, wikis and web pages contains considerable noise. Processing techniques like Automatic Speech Recognition, Optical Character Recognition and Machine Translation also introduce processing noise. People are adept at pattern recognition tasks involving typeset or handwritten documents or recorded speech; machines are less so.

Journal ArticleDOI
TL;DR: A number of models applied to text prediction are presented; some are oriented to weakly inflected languages while others target highly inflected languages, and their results are compared.

Proceedings ArticleDOI
07 Nov 2010
TL;DR: It is found that real-time text helped users better coordinate turns and led to less self-editing of messages, but had no overall influence on users' typing ability and provided minimal support for collaborative completion of sentences.
Abstract: Real-time, character-by-character transmission of messages in synchronous forms of text-based communication has seen a recent resurgence in CMC. We evaluated the impact of real-time text display on the usability of an instant messaging (IM) client. Participants were randomly assigned to dyads to participate in two discussion tasks using IM with both real-time text and enhanced message-by-message display (i.e., line-by-line display with additional cues to show when the remote party is typing). We found that real-time text helped users better coordinate turns and led to less self-editing of messages, but had no overall influence on users' typing ability and provided minimal support for collaborative completion of sentences. Users who typed less or had less experience with IM tended to prefer real-time text. These findings have significance for several forms of text-based CMC, including IM, chat, text telephony, and collaborative document editing.

Proceedings ArticleDOI
20 Jun 2010
TL;DR: Building on a study of Chinese text analysis programs and an analysis of the characteristics of the Tibetan language, a set of text analysis module implementations for Tibetan speech synthesis is proposed, laying a good foundation for Tibetan speech synthesis.
Abstract: Text analysis is the front-end of a text-to-speech conversion system and an important factor restricting the naturalness of synthesized speech. Building on a study of Chinese text analysis programs and an analysis of the characteristics of the Tibetan language, this paper proposes a set of text analysis module implementations for Tibetan speech synthesis. A maximum matching method over a lexicon of words and sub-words is used for automatic word segmentation, a hierarchical system of rules is established for normalized text processing, and word-to-pronunciation conversion is implemented through the SAMPA_ST standard for Tibetan. This work lays a good foundation for Tibetan speech synthesis.
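
A minimal sketch of maximum (forward) matching segmentation against a lexicon, the general technique named here; the toy lexicon uses Latin letters for readability rather than Tibetan script:

```python
def max_match(text: str, lexicon: set) -> list:
    """Greedily take the longest lexicon entry at each position,
    falling back to a single character when nothing matches."""
    max_len = max(map(len, lexicon))
    out, i = [], 0
    while i < len(text):
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in lexicon or j == i + 1:
                out.append(text[i:j])
                i = j
                break
    return out

lexicon = {"speech", "synthesis", "text", "analysis"}
print(max_match("textanalysisspeechsynthesis", lexicon))
# -> ['text', 'analysis', 'speech', 'synthesis']
```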

Proceedings ArticleDOI
Feng Hu1, Yu-feng Zhang1
07 May 2010
TL;DR: This paper introduces and analyzes text mining based on domain ontology, observing that traditional text mining cannot achieve high accuracy because it cannot effectively exploit the semantic information of the text.
Abstract: Text mining is an effective means of acquiring potentially useful knowledge from text documents. However, traditional text mining cannot achieve high accuracy, because it cannot effectively make use of the semantic information of the text. Ontology provides a theoretical basis and technical support for semantic information representation and organization. This paper introduces and analyzes text mining based on domain ontology.

ReportDOI
30 Sep 2010
TL;DR: The effort required to extract electronic text from text-containing computer files, preserve its integrity, and, for some use cases, preserve its structure is brought to light.
Abstract: Electronic text for use by human language technologies originates from a number of sources: direct keyboard entry, optical character recognition, speech recognition, and text-containing computer files. In particular, text-containing computer files may elude processing by an array of human language technology applications (e.g., search, language ID, machine translation, and text analytics). This paper brings to light the effort required to extract electronic text from these files, preserve its integrity, and, for some use cases, preserve its structure. It explores a series of specific human language technologies, highlighting the following aspects for each: relevant use cases, the impact of text extraction or conversion errors, the criticality of dependable text extraction and reliable electronic text, and the importance of experimentation and/or testing prior to use. Overall, this paper promotes the successful use of human language technology by equipping the reader to be discerning about the use of human language technology applications with text-containing files.

Dissertation
01 Jan 2010
TL;DR: A zero-watermarking approach towards text watermarking using inherent constituents of text such as double letters, prepositions, words, sentences, and text structure to protect text against digital forgery is proposed.
Abstract: With the widespread use of the Internet and other communication technologies, it has become extremely easy to reproduce, communicate, and distribute digital content. As a result, authentication and copyright protection issues have arisen. Text is the most extensively used medium travelling over the Internet, besides image, audio, and video. The major part of books, newspapers, web pages, advertisements, research papers, legal documents, letters, novels, poetry, and many other documents is simply plain text. Copyright protection of plain text is a significant issue that cannot be condoned. Existing solutions for watermarking plain text documents are not robust against random tampering attacks and are inapplicable in numerous domains. In this thesis, we have proposed a zero-watermarking approach to text watermarking. We provide a number of text watermarking solutions using inherent constituents of text, such as double letters, prepositions, words, sentences, and text structure, to protect text against digital forgery. We have designed a corpus of text of variable length and diversity, containing original as well as attacked samples with various volumes and forms of attacks. Instead of using binary watermarks on text, we used alphabetical, image, and hybrid watermarks. Experimental results illustrate the effectiveness of the proposed algorithms on text encountering combined insertion, deletion, and re-ordering attacks, in both dispersed and localized forms. The results are also compared with recent work on text watermarking.
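
A minimal sketch of the zero-watermarking idea: nothing is embedded in the text; instead a signature is derived from inherent features (here, double-letter and preposition counts, following the thesis's list of constituents) and registered with a trusted party, then re-derived later to detect tampering. The exact feature encoding is illustrative:

```python
import hashlib
import re

PREPOSITIONS = {"in", "on", "at", "of", "to", "by", "for", "with"}

def zero_watermark(text: str) -> str:
    """Derive a signature from inherent text features without altering the text."""
    words = re.findall(r"[a-z']+", text.lower())
    double_letters = sum(1 for w in words for a, b in zip(w, w[1:]) if a == b)
    prep_counts = [sum(w == p for w in words) for p in sorted(PREPOSITIONS)]
    signature = f"{double_letters}:{prep_counts}:{len(words)}"
    return hashlib.sha256(signature.encode()).hexdigest()[:16]

original = "The committee will meet at noon to discuss the proposal."
print(zero_watermark(original))
print(zero_watermark(original) == zero_watermark(original + " Extra sentence."))  # False
```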

01 Jan 2010
TL;DR: This work aims to automatically direct the expressiveness in speech by tagging the input text appropriately, using a graph-based approach named the Reduced Associative Relational Network and a Maximum Entropy classifier.
Abstract: In the context of text processing for Text-to-Speech (TTS) synthesis, this work aims to automatically direct the expressiveness in speech by tagging the input text appropriately. Since the nature of text presents different characteristics according to whether it is domain-dependent (related to its topics) or sentiment-dependent, we study how these traits influence the identification of expressiveness in text. To this end, two principal Text Classification (TC) methods are considered: a graph-based approach named the Reduced Associative Relational Network, and the Maximum Entropy classifier. Their effectiveness in domain- and sentiment-dependent environments is evaluated. The results indicate that moving from a domain-dependent environment to a more general sentiment-dependent environment strictly results in poorer effectiveness rates, despite the sensible direct association that sentiment provides for dealing with expressiveness. Additionally, we evaluate how sensitive the classifiers are to a small increase in training data, which yields a slight positive influence. Index Terms: domain classification, sentiment classification, expressive Text-to-Speech synthesis

Book ChapterDOI
22 Jun 2010
TL;DR: This research proposes an algorithm for learning from labeled and unlabeled documents based on the combination of Expectation-Maximization (EM) and a naive Bayes classifier on a grid environment; the combination is based on a mixture of multinomials, which is commonly used in text classification.
Abstract: The enormous amount of information stored in unstructured texts cannot simply be used for further processing by computers, which typically handle text as simple sequences of character strings. Text mining is the process of extracting interesting information and knowledge from unstructured text. One key difficulty with text classification learning algorithms is that they require many hand-labeled documents to learn accurately. In the text mining pattern discovery phase, the text classification step aims at automatically attributing one or more predefined classes to text documents. In this research, we propose an algorithm for learning from labeled and unlabeled documents based on the combination of Expectation-Maximization (EM) and a naive Bayes classifier on a grid environment; the combination is based on a mixture of multinomials, which is commonly used in text classification. Naive Bayes is a probabilistic approach to inductive learning. It estimates the a posteriori probability that a document belongs to a class given the observed feature values of the document, assuming independence of the features; the class with the maximum a posteriori probability is assigned to the document. Expectation-Maximization (EM) is a class of iterative algorithms for maximum likelihood or maximum a posteriori estimation in problems with unlabeled data. The grid environment is a geographically distributed computation infrastructure composed of a set of heterogeneous resources. The semi-supervised learning classifier is available as a grid service, expanding the functionality of the Aiuri Portal, a framework for a cooperative academic environment for education and research. Text classification methods are time-consuming; using the grid infrastructure can bring significant benefits to the learning and classification process.
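
A minimal sketch of the EM-plus-naive-Bayes combination over labeled and unlabeled documents. For brevity the E-step uses hard assignments rather than the full posterior of the mixture-of-multinomials formulation, and the grid-service layer is omitted:

```python
from scipy.sparse import vstack
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

labeled = ["stock market gains today", "election results announced tonight"]
labels = ["business", "politics"]
unlabeled = ["market rally continues", "senate vote delayed", "shares and stock rise"]

vec = CountVectorizer()
X_all = vec.fit_transform(labeled + unlabeled)
X_lab, X_unl = X_all[:len(labeled)], X_all[len(labeled):]

clf = MultinomialNB().fit(X_lab, labels)                # initialize on labeled data
for _ in range(10):                                     # EM iterations
    pseudo = clf.predict(X_unl)                         # E-step: label unlabeled docs
    clf = MultinomialNB().fit(vstack([X_lab, X_unl]),   # M-step: retrain on everything
                              list(labels) + list(pseudo))

print(clf.predict(vec.transform(["stock prices climb"])))  # -> ['business']
```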

Proceedings ArticleDOI
13 Mar 2010
TL;DR: A new greedy algorithm is presented that selects text from a mother text to collect phonetically rich sentences with high coverage of phonetic contextual units but small overall size.
Abstract: This paper presents a method for automatically constructing a text corpus with limited text for a speech synthesis system. The goal is to collect phonetically rich sentences with high coverage of phonetic contextual units but a small text size. We present a new greedy algorithm to select text from the mother text. The mother text is automatically loaded by a web crawler and processed with speech-music discrimination and sentence segmentation; the remainder is used as the mother text, so our text is limited, which differs from the traditional construction of a speech corpus. The assembled mother text contains about 4,612 sentences. The diphone is used as the basic unit, and a modified Okapi formula is used to score sentences. The experimental results show that this method achieves a best diphone coverage of 93.52% and can generate a good speech corpus.
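
A minimal sketch of greedy selection for unit coverage: at each step take the sentence that adds the most unseen units, with character bigrams standing in for diphones and plain marginal gain standing in for the paper's modified Okapi scoring:

```python
def units(sentence: str) -> set:
    """Character bigrams as a stand-in for diphones."""
    s = sentence.lower().replace(" ", "")
    return {s[i:i + 2] for i in range(len(s) - 1)}

def greedy_select(sentences, target_coverage=0.95):
    universe = set().union(*(units(s) for s in sentences))
    covered, chosen = set(), []
    while len(covered) / len(universe) < target_coverage and len(chosen) < len(sentences):
        best = max((s for s in sentences if s not in chosen),
                   key=lambda s: len(units(s) - covered))  # most new units
        if not units(best) - covered:
            break                                          # nothing new to add
        chosen.append(best)
        covered |= units(best)
    return chosen, len(covered) / len(universe)

corpus = ["the cat sat", "a dog barked", "the dog sat on a mat", "cats and dogs"]
sel, cov = greedy_select(corpus)
print(sel, f"coverage={cov:.2f}")
```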