
Showing papers on "Noisy text analytics published in 2006"


Journal Article
Shi Bing
TL;DR: Text categorization, the assignment of natural language texts to one or more predefined categories based on their content, is an important component in many information organization and management tasks.
Abstract: Text categorization, the assignment of natural language texts to one or more predefined categories based on their content, is an important component in many information organization and management tasks. Different automatic learning algorithms for text categorization differ in classification accuracy. Very accurate text classifiers can be learned automatically from training examples.

384 citations


Patent
19 Dec 2006
TL;DR: In this paper, a real-time automated communication method between a text exchange client and a speech application is presented, where a translation table can be identified that includes multiple entries, each entry including a text-exchange item and a corresponding conversational translation item.
Abstract: The present solution includes a real-time automated communication method. In the method, a real-time communication session can be established between a text exchange client and a speech application. A translation table can be identified that includes multiple entries, each entry including a text exchange item and a corresponding conversational translation item. A text exchange message can be received that was entered into a text exchange client. Content in the text exchange message that matches a text exchange item in the translation table can be substituted with a corresponding conversational item. The translated text exchange message can be sent as input to a voice server. Output from the voice server can be used by the speech application, which performs an automatic programmatic action based upon the output.

187 citations
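The substitution step the patent describes (matching text-exchange items against a translation table and replacing them with conversational items before the message reaches the voice server) can be sketched as follows; the table entries here are invented examples, not from the patent.

```python
# Sketch of the translation-table step: text-exchange shorthand is
# replaced with conversational phrases before the message is sent to a
# voice server. The table entries are invented examples.
import re

TRANSLATION_TABLE = {
    "brb": "be right back",
    "ty": "thank you",
    "u": "you",
}

def translate_message(message: str) -> str:
    """Replace each text-exchange item with its conversational item."""
    def substitute(match: re.Match) -> str:
        word = match.group(0)
        return TRANSLATION_TABLE.get(word.lower(), word)
    return re.sub(r"[A-Za-z']+", substitute, message)
```

Words not found in the table pass through unchanged, so the output stays valid input for the speech application.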


Proceedings ArticleDOI
20 Aug 2006
TL;DR: A method capable of recognizing smoothed and non-smoothed screen-rendered text of very small size which also works for colored fonts on inhomogeneous backgrounds is presented.
Abstract: The recognition of screen-rendered text is, to our knowledge, a yet unaddressed task. It has to be performed, e.g., by translation tools which allow users to click on any text on the screen and give a translation. This often requires capturing a screenshot and performing optical character recognition, which is very challenging due to the very small and smoothed fonts. This paper presents a method capable of recognizing smoothed and non-smoothed screen-rendered text of very small size which also works for colored fonts on inhomogeneous backgrounds.

28 citations


Patent
05 Oct 2006
TL;DR: The invention is a method and apparatus for automatic retrieval, organization, correlation and presentation of text, image, audio, or video data in a sequential manner, in which a user searches a database available on a computer network using a text search engine to locate a text document.
Abstract: The Invention is a method and apparatus for automatic retrieval, organization, correlation and presentation of text, image, audio, or video data in a sequential manner. A user searches a database available on a computer network using a text search engine to locate a text document. The text document is automatically read and parsed to identify text portions and key phrases. The key phrases are used to automatically search a multimedia file database available on the computer network using a multimedia search engine, such as an image search engine. Multimedia documents containing multimedia files are retrieved. Text in the multimedia documents is compared to the key terms and to the query terms and the multimedia documents are ranked by relevance using a variety of techniques including ranking, indexing, statistical analysis and natural language processing. Each text portion in the text document is stored in association with the most relevant multimedia file for that text portion. The resulting correlated information is displayed to the user in a sequence of text, audio, image or video data.

24 citations
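The patent's relevance ranking combines several techniques; the simplest of them (scoring each multimedia document's accompanying text by overlap with the key terms) can be made concrete with a toy sketch. Function names and data are illustrative, not from the patent.

```python
# Illustrative sketch (not the patent's full ranking, which also uses
# indexing, statistics and NLP): score each retrieved multimedia
# document by how many key terms appear in its accompanying text, then
# pick the best match for a given text portion.

def score_document(doc_text: str, key_terms: list[str]) -> int:
    """Count how many key terms occur as tokens in the document text."""
    words = set(doc_text.lower().split())
    return sum(1 for term in key_terms if term.lower() in words)

def most_relevant(docs: dict[str, str], key_terms: list[str]) -> str:
    """Return the name of the document with the highest overlap score."""
    return max(docs, key=lambda name: score_document(docs[name], key_terms))
```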


01 Jan 2006
TL;DR: The paper shows how feature variables can be created from unstructured text information and used for prediction and describes the methods used to parse data into vectors of terms for analysis.
Abstract: Motivation. One of the newest areas of data mining is text mining. Text mining is used to extract information from free-form text data such as that in claim description fields. This paper introduces the methods used to do text mining and applies them to a simple example. Method. The paper describes the methods used to parse data into vectors of terms for analysis. It then shows how information extracted from the vectorized data can be used to create new features for use in analysis. Focus is placed on the method of clustering for finding patterns in unstructured text information. Results. The paper shows how feature variables can be created from unstructured text information and used for prediction. Conclusions. Text mining has significant potential to expand the amount of information that is available to insurance analysts for exploring and modeling data. Availability. Free software that can be used to perform some of the analyses described in this paper is described in the appendix.

21 citations
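The "vector of terms" representation the paper builds from free-form fields is essentially a term-frequency vector per text. A minimal sketch, with an invented stopword list standing in for whatever list the paper's software uses:

```python
# Minimal sketch of parsing free-form text (e.g., a claim description
# field) into a vector of terms: tokenize, lowercase, drop stopwords,
# count term frequencies. The stopword list is a toy example.
from collections import Counter
import re

STOPWORDS = {"the", "a", "of", "to", "and"}

def term_vector(text: str) -> Counter:
    """Return a term-frequency vector for one text field."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(t for t in tokens if t not in STOPWORDS)
```

Vectors like these are what a clustering algorithm would then group to find patterns across claim descriptions.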


Patent
Makoto Terao
08 Nov 2006
TL;DR: In this article, a matching unit collates edit result text acquired by a text editor unit with speech recognition result information having time information created by a speech recognition unit, to match the edit result text with the speech data.
Abstract: [Problems] To provide a speech-to-text system and the like capable of matching edit result text, acquired either by editing recognition result text or as newly written text information, with speech data. [Means for Solving Problems] A speech-to-text system (1) includes a matching unit (27) which collates edit result text acquired by a text editor unit (22) with speech recognition result information having time information created by a speech recognition unit (11), to thereby match the edit result text and the speech data.

16 citations
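The collation idea (align the edited text with the recognizer's time-stamped words so edited words inherit timestamps) can be sketched with a standard sequence alignment; `difflib` stands in here for whatever matching the patent's unit actually performs.

```python
# Sketch of collating edited text with timed recognition output: align
# the two word sequences and let each matched edited word inherit the
# recognizer's timestamp. difflib is a stand-in, not the patent's method.
from difflib import SequenceMatcher

def align(edited: list[str], recognized: list[tuple[str, float]]):
    """Return (edited_word, time) pairs for words present in both texts."""
    rec_words = [w for w, _ in recognized]
    matcher = SequenceMatcher(a=edited, b=rec_words)
    pairs = []
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            pairs.append((edited[block.a + k], recognized[block.b + k][1]))
    return pairs
```

Words the editor corrected (so they no longer match the recognizer output) simply fall outside the matching blocks; a fuller system would interpolate their times from the neighbors.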


Proceedings ArticleDOI
01 Sep 2006
TL;DR: A novel hybrid approach composed of a neural network equipped with a decision rule for language recognition from written text is introduced and results showing the improved performance of the novel approach compared with an existing similar method are shown.
Abstract: In this paper, the problem of language identification from written text is addressed. The problem is connected to the text-to-phoneme mapping application, where the letters of a written text must be translated into their corresponding phonemes. When the words of the written text belong only to a single known language, transcription of the written letters into their phonemes can be done directly (without the need for language identification). When the language of the input text is not known, usually the first step is to identify the language and, based on this information, the text-to-phoneme mapping is done. In this paper, we introduce a novel hybrid approach composed of a neural network equipped with a decision rule for language recognition from written text. Comparative results showing the improved performance of our novel approach compared with an existing similar method are also presented.

13 citations
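The two-stage structure (a scorer followed by a decision rule) can be illustrated without the paper's neural network: below, a character-bigram scorer plays the role of the classifier, and the decision rule rejects the top language unless it clearly beats the runner-up. The language profiles are toy data, and the margin value is invented.

```python
# Not the paper's neural network: a character-bigram scorer plus a
# simple decision rule (accept the top language only if it clearly
# beats the runner-up). Profiles and the margin are toy/invented.
from collections import Counter

def bigrams(text: str) -> Counter:
    text = text.lower()
    return Counter(text[i:i + 2] for i in range(len(text) - 1))

PROFILES = {
    "english": bigrams("the quick brown fox jumps over the lazy dog"),
    "german": bigrams("der schnelle braune fuchs springt ueber den hund"),
}

def identify(text: str, margin: float = 1.1) -> str:
    grams = bigrams(text)
    # Score = bigram overlap with each language profile.
    scores = {lang: sum(min(n, profile[g]) for g, n in grams.items())
              for lang, profile in PROFILES.items()}
    best, second = sorted(scores, key=scores.get, reverse=True)[:2]
    # Decision rule: reject when the best score is not clearly ahead.
    if scores[second] and scores[best] < margin * scores[second]:
        return "unknown"
    return best
```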


01 Jan 2006
TL;DR: A novel Arabic text categorization system has been developed based on statistical learning; it uses a new method for feature extraction and proved powerful in grasping the semantics of documents, yielding encouraging results as a question answering system.
Abstract: In this paper, a novel Arabic text categorization system based on statistical learning is developed. The system uses a new method for feature extraction. The system has been implemented and tested using an Arabic text corpus. Results prove the efficiency of the proposed system in the categorization of Arabic documents. Moreover, the system proved powerful in grasping the semantics of documents, yielding encouraging results as a question answering system.
Keywords: Text mining, Text Categorization, Hidden Markov Models, Arabic Stemming
1. Introduction
The recent explosion of online textual information has significantly increased the demand for intelligent agents capable of performing tasks such as personalized information filtering, semantic document indexing, information extraction, and automatic metadata generation. Although a complete answer may require in-depth approaches involving full understanding of natural language, text categorization is a simpler but effective technique that can contribute to the solution of the above problems [5]. Text categorization is the task of automatically sorting a set of documents into categories from a predefined set. This task has several applications, including e-mail filtering, mail routing, spam filtering, news monitoring, selective dissemination of information to information consumers, automated indexing of scientific articles, automated population of hierarchical catalogues of Web resources, identification of document genre, authorship attribution, survey coding and so on. Automated text categorization is attractive because manually organizing text document bases can be too expensive or unfeasible given the time constraints of the application or the number of documents involved [5, 11]. Text categorization can be conveniently formulated as a supervised learning problem [4].
In this setting, a machine-learning algorithm takes as input a set of labeled example documents (where the label indicates which category the example belongs to) and attempts to infer a function that will map new documents into their categories. Several algorithms have been proposed within this framework, including regression models, inductive logic programming, probabilistic classifiers, decision trees, neural networks, and support vector machines. Research on text categorization has been mainly focused on non-structured documents (such as email and news messages) [5]. In the typical approach, inherited from information retrieval, each document is represented by a sequence of words, and the sequence itself is normally flattened down to a simplified representation called

12 citations
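The supervised-learning framing in the introduction (labeled example documents in, a function mapping new documents to categories out) is easy to make concrete. The tiny multinomial Naive Bayes below illustrates that framing only; it is not the paper's HMM-based system, and the training data is invented.

```python
# Illustration of text categorization as supervised learning: train on
# (document, label) pairs, then map a new document to a category.
# A tiny multinomial Naive Bayes with Laplace smoothing; NOT the
# paper's HMM-based method. All data is invented.
import math
from collections import Counter, defaultdict

def train(examples: list[tuple[str, str]]):
    """examples: (document text, category label) pairs."""
    word_counts = defaultdict(Counter)
    cat_counts = Counter()
    for text, cat in examples:
        cat_counts[cat] += 1
        word_counts[cat].update(text.lower().split())
    return word_counts, cat_counts

def classify(text, word_counts, cat_counts):
    """Return the category with the highest posterior log-probability."""
    vocab = {w for c in word_counts.values() for w in c}
    total_docs = sum(cat_counts.values())
    best, best_lp = None, float("-inf")
    for cat in cat_counts:
        lp = math.log(cat_counts[cat] / total_docs)  # prior
        denom = sum(word_counts[cat].values()) + len(vocab)
        for w in text.lower().split():
            lp += math.log((word_counts[cat][w] + 1) / denom)  # smoothed likelihood
        if lp > best_lp:
            best, best_lp = cat, lp
    return best
```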


01 Dec 2006
TL;DR: This paper aims to review the use of and roles played by MontyLingua and its components in research work published in 19 articles between October 2004 and August 2006.
Abstract: MontyLingua, an integral part of ConceptNet (currently the largest commonsense knowledge base), is an English text processor developed in the Python programming language at the MIT Media Lab. The main feature of MontyLingua is its coverage of all aspects of English text processing, from raw input text to semantic meanings and summary generation; yet each component in MontyLingua is loosely coupled to the others at the architectural and code level, which enables individual components to be used independently or substituted. However, there has been no review exploring the role of MontyLingua in recent research work utilizing it. This paper aims to review the use of and roles played by MontyLingua and its components in research work published in 19 articles between October 2004 and August 2006. We observed a diversified use of MontyLingua in many different areas, both generic and domain-specific. Although use of the text summarizing component was not observed, we are optimistic that it will have a crucial role in managing the current trend of information overload in future research.

11 citations


Journal Article
TL;DR: In this article, a hybrid approach for text extraction from images and videos is proposed, which detects both caption text as well as scene text of different font, size, color and intensity.
Abstract: Extraction of text from image and video is an important step in building efficient indexing and retrieval systems for multimedia databases. We adopt a hybrid approach for such text extraction by exploiting a number of characteristics of text blocks in color images and video frames. Our system detects both caption text as well as scene text of different font, size, color and intensity. We have developed an application for on-line extraction and recognition of texts from videos. Such texts are used for retrieval of video clips based on any given keyword. The application is available on the web for the readers to repeat our experiments and also to try text extraction and retrieval from their own videos.

6 citations


01 Jan 2006
TL;DR: A new approach towards word and sentence boundary identification for mixed-lingual sentences, based on parsing of character streams, is presented; it can also be used for word identification in languages without a designated word boundary symbol, such as Chinese or Japanese.
Abstract: In multilingual countries, text-to-speech synthesis systems often have to deal with sentences containing inclusions from multiple other languages in the form of phrases, words or even parts of words. Such sentences can only be correctly processed by a system that incorporates a mixed-lingual morphological and syntactic analyzer. A prerequisite for such an analyzer is the correct identification of word and sentence boundaries. Traditional text analysis applies simple heuristic methods to both problems within a text preprocessing step. These methods, however, are not reliable enough for analyzing mixed-lingual sentences. This paper presents a new approach towards word and sentence boundary identification for mixed-lingual sentences that is based on parsing of character streams. Additionally, this approach can also be used for word identification in languages without a designated word boundary symbol, such as Chinese or Japanese. To date, this mixed-lingual text analysis supports any mixture of English, French, German, Italian and Spanish.
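The "simple heuristic methods" the abstract says are unreliable for mixed-lingual text look roughly like the sketch below: split on sentence-final punctuation unless the preceding token is a known abbreviation. The abbreviation list is a toy example, and a monolingual list is exactly what breaks down when foreign abbreviations appear.

```python
# Sketch of the heuristic sentence-boundary baseline the paper argues
# against: punctuation splits a sentence unless the token before it is
# a known abbreviation. The abbreviation list is a toy example.
import re

ABBREVIATIONS = {"dr.", "e.g.", "etc.", "mr."}

def split_sentences(text: str) -> list[str]:
    sentences, start = [], 0
    for m in re.finditer(r"[.!?]\s+", text):
        last_token = text[start:m.end()].split()[-1].lower()
        if last_token in ABBREVIATIONS:
            continue  # heuristic: abbreviation, not a boundary
        sentences.append(text[start:m.end()].strip())
        start = m.end()
    if start < len(text):
        sentences.append(text[start:].strip())
    return sentences
```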

Journal Article
TL;DR: The importance of text mining in knowledge discovery is shown, and the upcoming challenges of text mining and the opportunities it offers are highlighted.
Abstract: Text mining, also known as intelligent text analysis, text data mining or knowledge discovery in text (KDT), is a rapidly emerging field concerned with the extraction of concepts, relations, and implicit knowledge from texts. As most information (over 80%) is stored as text, text mining is believed to have high commercial potential value. This review paper first discusses the research status of text mining, then lays out the framework of text mining and analyses its techniques, such as feature selection, automatic abstracting, text categorization, text clustering, text association, and data visualization. In the end, it shows the importance of text mining in knowledge discovery and highlights the upcoming challenges of text mining and the opportunities it offers.

Book ChapterDOI
13 Jan 2006
TL;DR: An application for on-line extraction and recognition of texts from videos that detects both caption text as well as scene text of different font, size, color and intensity for retrieval of video clips based on any given keyword.
Abstract: Extraction of text from image and video is an important step in building efficient indexing and retrieval systems for multimedia databases. We adopt a hybrid approach for such text extraction by exploiting a number of characteristics of text blocks in color images and video frames. Our system detects both caption text as well as scene text of different font, size, color and intensity. We have developed an application for on-line extraction and recognition of texts from videos. Such texts are used for retrieval of video clips based on any given keyword. The application is available on the web for the readers to repeat our experiments and also to try text extraction and retrieval from their own videos.

Proceedings ArticleDOI
01 Aug 2006
TL;DR: An automatic text summarization method based on natural language understanding by using RST (Rhetorical Structure Theory) and CIT (Comprehensive Information Theory), an integrated concept of syntactic, semantic and pragmatic information.
Abstract: In this paper, we present an automatic text summarization method based on natural language understanding by using RST (Rhetorical Structure Theory) and CIT (Comprehensive Information Theory). RST is an analytic framework designed to account for text structure at the clause level. CIT is an integrated concept of syntactic, semantic and pragmatic information. The system extracts the rhetorical structure of text and the compound of the rhetorical relations between sentences, and then cuts out less important parts from the extracted structure. Finally it analyses the sentences in the extracted structure to generate text summarization by using CIT.

Patent
24 Feb 2006
TL;DR: It is shown that a user of a broadcast receiver can get text information significantly more quickly if a currently displayed text information object contains a reference to another text information object in the broadcast signal, and the user can switch the display from the current object to the referenced one by simple operation of a user selection means.
Abstract: It is a finding of the present invention that a user of a broadcast receiver gets text information significantly more quickly if a text information object included in the broadcast signal, and currently displayed on the receiver's display, contains a reference to another text information object in the signal, and the user is enabled, by simple operation of a user selection means, to change the displayed text information from that of the current object to that of the referenced object. The additional effort is very limited, since today's broadcast receivers mostly have an "unoccupied" key, which has an assigned function only in special situations of use and may therefore serve as the user selection means. Due to the strong limitation of the available bandwidths of common broadcasting systems for data services, codings as efficient as possible are used in generating the text information objects to be transmitted.

Journal ArticleDOI
TL;DR: A method to exclude the misrecognized words using word‐based confidence score and similarities between keywords and adapts the language model by collecting topic‐related text from World Wide Web is developed.
Abstract: To improve the accuracy of an LVCSR (large vocabulary continuous speech recognition) system, it is effective to gather text data related to the topic of the input speech and adapt the language model using that data. Several systems have been developed that gather text data from the World Wide Web using keywords specified by a user; those systems require the user to be involved in the transcription process. However, it is desirable to automate the entire process. To automate the text collection, we propose a method that creates an adapted language model by collecting topic-related text from the World Wide Web. The proposed method composes the search query from the first recognition result, gathers text data from the WWW, and adapts the language model. The input speech is then decoded again using the adapted language model. As the first recognition result contains recognition errors, we developed a method to exclude the misrecognized words using word-based confidence scores and similarities between keywords. To evaluate the proposed method, we carried out adaptation experiments.
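The keyword-filtering step can be sketched minimally: drop first-pass keywords whose word-level confidence falls below a threshold before building the web search query. The threshold and scores below are invented, and the paper's additional inter-keyword similarity check is omitted.

```python
# Minimal sketch of excluding likely misrecognized words: keep a
# first-pass keyword for the web query only if its recognizer
# confidence clears a threshold. Threshold and scores are invented;
# the paper's inter-keyword similarity check is omitted.

def build_query(keywords: list[tuple[str, float]], threshold: float = 0.7) -> str:
    """keywords: (word, confidence) pairs from the first recognition pass."""
    kept = [w for w, conf in keywords if conf >= threshold]
    return " ".join(kept)
```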

Proceedings ArticleDOI
01 Dec 2006
TL;DR: An algorithm is proposed to detect, classify and segment both static and simple linear moving text in complex noisy background and attain a quality suitable for text recognition by commercial optical character recognition software.
Abstract: Text superimposed on video frames provides supplemental but important information for video indexing and retrieval. The detection and recognition of text from video is thus an important issue in automated content-based indexing of visual information in video archives. Text of interest is not limited to static text; it may scroll in a linear motion, so that only part of the text information is available in different frames of the video. The problem is further complicated if the video is corrupted with noise. An algorithm is proposed to detect, classify and segment both static and simple linear moving text in a complex noisy background. The extracted texts are further processed using averaging to attain a quality suitable for text recognition by commercial Optical Character Recognition (OCR) software.
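The averaging step the abstract mentions is the easy part to sketch: once the moving text has been tracked and the text regions registered across frames, a pixel-wise mean suppresses zero-mean noise before OCR. Frames are represented here as plain row-major grayscale lists, an assumption for illustration.

```python
# Sketch of the averaging step: pixel-wise mean over registered text
# regions from several frames to suppress noise before OCR. Frames are
# plain row-major grayscale pixel lists (an illustrative assumption).

def average_frames(frames: list[list[int]]) -> list[int]:
    """Average the same pixel position across all registered frames."""
    n = len(frames)
    return [round(sum(pixels) / n) for pixels in zip(*frames)]
```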

Proceedings ArticleDOI
17 Sep 2006
TL;DR: The MD-TTS introduces a flexible TTS architecture that includes an automatic domain classification module, which allows MD- TTS systems to be implemented by different synthesis strategies and speech corpus typologies.
Abstract: This paper describes a multi-domain text-to-speech (MD-TTS) synthesis strategy for generating speech among different domains and so increasing the flexibility of high quality TTS systems. To that effect, the MD-TTS introduces a flexible TTS architecture that includes an automatic domain classification module, which allows MD-TTS systems to be implemented by different synthesis strategies and speech corpus typologies. In this work, the performance of a corpus-based MD-TTS system is subjectively validated by means of several perceptual tests.

Proceedings Article
01 Jan 2006
TL;DR: The current metrics used in text input research are described, considering those used for discrete text input as well as those use for spoken input, and some thoughts about different metrics that might allow for a more fine grained evaluation of recognition improvement or input accuracy are provided.
Abstract: This paper describes the current metrics used in text input research, considering those used for discrete text input as well as those used for spoken input. It examines how these metrics might be used for handwritten text input and provides some thoughts about different metrics that might allow for a more fine-grained evaluation of recognition improvement or input accuracy.
Text Input Methods. Users spend a significant amount of time carrying out text input activities. This time comprises thinking time, input time, and correction time, and, more often than not, it is spent at a keyboard, either a QWERTY keyboard or a reduced keyboard (as found on mobile phones). It is therefore unsurprising that most of the work on text input has focussed on these two paradigms. There are several other methods for entering text at a computer; these include gaze typing and spoken input, and, with the advent of the PDA and, more recently, the tablet PC, users can enter handwritten text created with a stylus or pen. This handwritten text can be close to the user's ‘natural’ handwriting, for example cursive writing, or can require the user to construct letters in a constrained way (as in unistroke gestures such as those incorporated in Graffiti) [1], [2]. When the characters that make up a word are entered one by one, the text input method can be described as discrete. Apart from some chord keyboards, in keyboard-based text input the text is always entered discretely; in speech applications, and in handwriting applications that allow ‘natural’ writing, the text is entered in a continuous stream, which makes identification of individual letters problematic.
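Among the standard metrics for discrete text input is the minimum string distance (MSD) error rate, the edit distance between the presented and transcribed strings divided by the longer string's length. A sketch using a standard Levenshtein distance:

```python
# Minimum string distance (MSD) error rate, a standard discrete text
# input metric: Levenshtein distance between the presented and
# transcribed strings, normalized by the longer length.

def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[len(b)]

def msd_error_rate(presented: str, transcribed: str) -> float:
    """MSD error rate = edit distance / max(len(presented), len(transcribed))."""
    return levenshtein(presented, transcribed) / max(len(presented), len(transcribed))
```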

Journal Article
TL;DR: An approach to topic extraction based on statistics, integrated with a semantic method, is proposed and its implementation process is described; experimental results indicate that the method is feasible and useful.
Abstract: Based on a thorough analysis of the currently prevalent techniques and methods for automatic extraction of text topics, an approach to topic extraction based on statistics, integrated with a semantic method, is proposed and its implementation process is described. Automatic generation of the text topic is implemented by using semantic correlation between sentences in a document. First, a text is segmented into words and sentences to complete information segmentation; then, using a clustering algorithm, sentences are clustered to implement information combination. Finally, representative sentences are extracted from every class to generate the text topic. Experimental results indicate that the method is feasible and useful.

Book ChapterDOI
13 Sep 2006
TL;DR: In this article, the Grid computational, storage, and data access capabilities for text mining tasks and text classification in particular are utilized for the purpose of knowledge discovery in life science research.
Abstract: Efficient access to information and integration of information from various sources and leveraging this information to knowledge are currently major challenges in life science research. However, a large fraction of this information is only available from scientific articles that are stored in huge document databases in free text format or from the Web, where it is available in semi-structured format. Text mining provides some methods (e.g., classification, clustering, etc.) able to automatically extract relevant knowledge patterns contained in the free text data. The inclusion of the Grid text-mining services into a Grid-based knowledge discovery system can significantly support problem solving processes based on such a system. Motivation for the research effort presented in this paper is to use the Grid computational, storage, and data access capabilities for text mining tasks and text classification in particular. Text classification mining methods are time-consuming and utilizing the Grid infrastructure can bring significant benefits. Implementation of text mining techniques in distributed environment allows us to access different geographically distributed data collections and perform text mining tasks in parallel/distributed fashion.

01 Jan 2006
TL;DR: At Oki Electric, activities are underway to achieve sophisticated expression through speech synthesis, in order to provide a richer voice communication environment that can surpass the mere transmission of information and convey emotions and sympathy.
Abstract: At Oki Electric we believe it is important to depict an image of our future society as one of an “e-Society®” and to be able to provide “desired information in a preferred format” to all people. Furthermore, we intend to provide a rich communication environment that can surpass the mere transmission of information and convey emotions and sympathy. The voice is one of the most basic modes of communication. We are, therefore, engaged in activities to achieve sophisticated expression through speech synthesis, in order to provide a richer voice communication environment.