
Showing papers on "Noisy text analytics published in 2009"


Journal ArticleDOI
TL;DR: In this paper, a survey of text mining techniques and applications is presented.
Abstract: Text Mining has become an important research area. Text Mining is the discovery by computer of new, previously unknown information, by automatically extracting information from different written resources. In this paper, a survey of text mining techniques and applications is presented.

650 citations


Patent
09 Sep 2009
TL;DR: In this paper, a system and method apply stores of factual information to correct errors in digital text, for example, generated from OCR, speech and/or handwriting recognition devices, and other automatic recognition devices.
Abstract: The disclosed system and method apply stores of factual information to correct errors in digital text, for example, text generated by OCR, speech recognition, handwriting recognition, and other automatic recognition devices. Text produced by such devices may be processed to extract the facts it asserts. Databases of facts are searched based on information in the text. After comparing the facts asserted in the text with the factual data from the databases, suggested corrections to the text are produced.

193 citations
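
A minimal sketch of the idea (not the patented method): facts asserted in recognized text are checked against a trusted fact store, and near-miss mismatches yield suggested corrections. The fact store, the assertion pattern, and the similarity cutoff are illustrative assumptions.

```python
import re
import difflib

FACTS = {"capital": {"France": "Paris", "Italy": "Rome", "Spain": "Madrid"}}

def suggest_corrections(ocr_text: str):
    """Yield (wrong_span, suggestion) pairs for capital-city assertions."""
    for m in re.finditer(r"capital of (\w+) is (\w+)", ocr_text):
        country, asserted = m.group(1), m.group(2)
        truth = FACTS["capital"].get(country)
        # Only flag near-misses: likely OCR noise, not genuinely different claims.
        if truth and asserted != truth and difflib.SequenceMatcher(
                None, asserted.lower(), truth.lower()).ratio() > 0.5:
            yield asserted, truth

print(list(suggest_corrections("The capital of France is Pans.")))
# -> [('Pans', 'Paris')]
```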


Proceedings ArticleDOI
04 Jun 2009
TL;DR: An unsupervised noisy-channel model for text message normalization is constructed on a sample of creative, non-standard text message word forms to determine frequent word formation processes in texting language.
Abstract: Cell phone text messaging users express themselves briefly and colloquially using a variety of creative forms. We analyze a sample of creative, non-standard text message word forms to determine frequent word formation processes in texting language. Drawing on these observations, we construct an unsupervised noisy-channel model for text message normalization. On a test set of 303 text message forms that differ from their standard form, our model achieves 59% accuracy, which is on par with the best supervised results reported on this dataset.

143 citations
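
A toy noisy-channel normalizer in the spirit of the paper: pick the standard word maximizing P(txt|std) * P(std). The real model learns channel probabilities from observed word-formation processes; here the channel is a crude string-similarity proxy and the language model a made-up unigram table.

```python
import difflib

LM = {"see": 0.02, "you": 0.05, "tomorrow": 0.01, "tomato": 0.001}  # P(std)

def channel(txt: str, std: str) -> float:
    """Crude P(txt | std): string similarity as a stand-in."""
    return difflib.SequenceMatcher(None, txt, std).ratio()

def normalize(token: str) -> str:
    """argmax over standard words of P(txt|std) * P(std)."""
    return max(LM, key=lambda std: channel(token, std) * LM[std])

print([normalize(t) for t in ["u", "2mrw"]])  # -> ['you', 'tomorrow']
```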


Patent
09 Mar 2009
TL;DR: In this article, algorithms for synthesizing speech from text strings associated with media assets are presented, where each text string can be associated with a native string language (e.g., the language of the string).
Abstract: Algorithms for synthesizing speech used to identify media assets are provided. Speech may be selectively synthesized from text strings associated with media assets, where each text string can be associated with a native string language (e.g., the language of the string). When several text strings are associated with at least two distinct languages, a series of rules can be applied to the strings to identify a single voice language to use for synthesizing the speech content from the text strings. In some embodiments, a prioritization scheme can be applied to the text strings to identify the more important text strings. The rules can include, for example, selecting a voice language based on the prioritization scheme, a default language associated with an electronic device, the ability of a voice language to speak text in a different language, or any other suitable rule.

128 citations
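
A hedged sketch of a rule cascade of the kind the patent describes; the priority rule, the device default, and the voice capability table below are invented for illustration.

```python
VOICE_CAN_SPEAK = {"en": {"en", "fr"}, "fr": {"fr"}}  # assumed capability table

def pick_voice_language(strings, device_default="en"):
    """strings: list of (text, language, priority); lower priority = more important."""
    langs = {lang for _, lang, _ in strings}
    if len(langs) == 1:                          # single language: no conflict
        return langs.pop()
    if langs <= VOICE_CAN_SPEAK.get(device_default, set()):
        return device_default                    # default voice can speak everything
    return min(strings, key=lambda s: s[2])[1]   # else: language of top-priority string

print(pick_voice_language([("La Vie en Rose", "fr", 1), ("Edith Piaf", "fr", 2)]))  # fr
print(pick_voice_language([("Title", "en", 1), ("Artiste", "fr", 2)]))              # en
```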


Proceedings ArticleDOI
Shixia Liu1, Michelle X. Zhou1, Shimei Pan1, Weihong Qian1, Weijia Cai1, Xiaoxiao Lian1 
02 Nov 2009
TL;DR: This paper presents the design and development of a time-based, visual text summary that effectively conveys complex text summarization results produced by the Latent Dirichlet Allocation (LDA) model, and describes a set of rich interaction tools that allow users to work with a created visual text summary to further interpret the summarization results in context and examine the text collection from multiple perspectives.
Abstract: We are building an interactive, visual text analysis tool that aids users in analyzing a large collection of text. Unlike existing work in text analysis, which focuses either on developing sophisticated text analytic techniques or inventing novel visualization metaphors, ours tightly integrates state-of-the-art text analytics with interactive visualization to maximize the value of both. In this paper, we focus on describing our work from two aspects. First, we present the design and development of a time-based, visual text summary that effectively conveys complex text summarization results produced by the Latent Dirichlet Allocation (LDA) model. Second, we describe a set of rich interaction tools that allow users to work with a created visual text summary to further interpret the summarization results in context and examine the text collection from multiple perspectives. As a result, our work offers two unique contributions. First, we provide an effective visual metaphor that transforms complex and even imperfect text summarization results into a comprehensible visual summary of texts. Second, we offer users a set of flexible visual interaction tools as the alternatives to compensate for the deficiencies of current text summarization techniques. We have applied our work to a number of text corpora and our evaluation shows the promise of the work, especially in support of complex text analyses.

105 citations
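
A minimal sketch of the analytic backbone (not IBM's system): fit LDA with scikit-learn and bucket per-document topic weights by time, producing the matrix a time-based visual summary would render. The corpus and timestamps are toy assumptions.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
import numpy as np

docs = ["stock market trade price", "market price fall",
        "election vote campaign", "vote campaign rally stock"]
times = np.array([0, 0, 1, 1])               # e.g., month index per document

X = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
theta = lda.transform(X)                     # document-topic weights

# Aggregate topic strength per time bucket: rows = time, cols = topic.
summary = np.vstack([theta[times == t].mean(axis=0) for t in np.unique(times)])
print(np.round(summary, 2))                  # input to the visual summary layer
```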


Journal ArticleDOI
TL;DR: A system is developed, which provides the user with a platform to analyze opinion expressions crawled from a set of pre-defined blogs, aimed at extracting and consolidating opinions of customers from blogs and feedbacks, at multiple levels of granularity.
Abstract: The proliferation of the Internet has not only led to the generation of huge volumes of unstructured information in the form of web documents, but a large amount of text is also generated in the form of emails, blogs, feedbacks, etc. The data generated from online communication act as potential gold mines for discovering knowledge, particularly for market researchers. Text analytics has matured and is being successfully employed to mine important information from unstructured text documents. The chief bottleneck in designing text mining systems for handling blogs arises from the fact that online communication text data are often noisy. These texts are informally written. They suffer from spelling mistakes, grammatical errors, improper punctuation and irrational capitalization. This paper focuses on opinion extraction from noisy text data. It is aimed at extracting and consolidating opinions of customers from blogs and feedbacks, at multiple levels of granularity. We have proposed a framework in which these texts are first cleaned using domain knowledge and then subjected to mining. Ours is a semi-automated approach, in which the system aids in the process of knowledge assimilation for knowledge-base building and also performs the analytics. Domain experts ratify the knowledge base and also provide training samples for the system to automatically gather more instances for ratification. The system identifies opinion expressions as phrases containing opinion words, opinionated features and opinion modifiers. These expressions are categorized as positive or negative with membership values varying from zero to one. Opinion expressions are identified and categorized using localized linguistic techniques. Opinions can be aggregated at any desired level of specificity, i.e. feature level or product level, user level or site level, etc. We have developed a system based on this approach, which provides the user with a platform to analyze opinion expressions crawled from a set of pre-defined blogs.

97 citations
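
A minimal sketch of lexicon-driven opinion scoring with modifiers, loosely following the paper's graded (zero-to-one) polarity idea; the lexicon, modifier weights, and clipping scheme are assumptions.

```python
OPINION = {"good": 0.7, "great": 0.9, "bad": -0.7, "poor": -0.6}
MODIFIER = {"very": 1.3, "slightly": 0.6, "not": -1.0}

def score_phrase(tokens):
    """Return polarity in [-1, 1] for an opinion phrase."""
    score, weight = 0.0, 1.0
    for tok in tokens:
        if tok in MODIFIER:
            weight *= MODIFIER[tok]          # modifiers scale or flip polarity
        elif tok in OPINION:
            score = max(-1.0, min(1.0, weight * OPINION[tok]))
            weight = 1.0
    return score

print(score_phrase("very good".split()))      # 0.91
print(score_phrase("not great".split()))      # -0.9
print(score_phrase("slightly poor".split()))  # -0.36
```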


Proceedings ArticleDOI
23 Jul 2009
TL;DR: A survey of the existing measures for noise in text is presented and application areas that ingest this noisy text for various tasks like Information Retrieval and Information Extraction are covered.
Abstract: Noise is ubiquitous in real-world text communications. Text produced by processing signals intended for human use is often noisy for automated computer processing: automatic speech recognition, optical character recognition and machine translation all introduce processing noise. Digital text produced in informal settings such as online chat, SMS, emails, message boards, newsgroups, blogs, wikis and web pages also contains considerable noise. In this paper, we present a survey of the existing measures for noise in text. We also cover application areas that ingest this noisy text for various tasks like Information Retrieval and Information Extraction.

88 citations
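
One simple noise measure of the kind such a survey covers, sketched below: the out-of-vocabulary rate against a reference lexicon (the lexicon is a toy assumption; word error rate against a reference transcript is another common choice).

```python
import re

LEXICON = {"are", "you", "coming", "to", "the", "meeting", "tomorrow"}

def oov_rate(text: str) -> float:
    """Fraction of tokens not found in the reference lexicon."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    return sum(t not in LEXICON for t in tokens) / len(tokens)

print(oov_rate("are you coming to the meeting tomorrow"))  # 0.0 (clean)
print(oov_rate("r u cmng 2 da meetin tmrw"))               # 1.0 (noisy SMS)
```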


Proceedings ArticleDOI
26 Jul 2009
TL;DR: A prototype system equipped with a head-mounted video camera is developed, in which a particle filter is employed for fast and robust text tracking; automatic text image selection is also performed for better character recognition and timely text message presentation.
Abstract: The inability to read visual text has a huge impact on the quality of life of visually disabled people. One of the most anticipated devices is a wearable camera capable of finding text regions in natural scenes and translating the text into another representation such as speech or braille. In order to develop such a device, text tracking in video sequences is required as well as text detection. The device needs to group homogeneous text regions to avoid multiple and redundant speech syntheses or braille conversions. Automatic text image selection is also required for better character recognition and timely text message presentation. We have developed a prototype system equipped with a head-mounted video camera. A particle filter is employed for fast and robust text tracking. We have tested the performance of our system using 1,730 video frames of hallways with 27 signboards. The number of text candidate regions is reduced to 1.47%.

49 citations
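
A minimal one-dimensional particle filter of the kind used for such tracking; the real system tracks 2-D text regions in video, and the motion and observation models below are toy assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 500
particles = rng.normal(100.0, 10.0, N)       # initial horizontal positions

def step(particles, observation, motion_sigma=3.0, obs_sigma=5.0):
    particles = particles + rng.normal(0.0, motion_sigma, N)         # predict
    w = np.exp(-0.5 * ((observation - particles) / obs_sigma) ** 2)  # weight
    w /= w.sum()
    return particles[rng.choice(N, N, p=w)]                          # resample

for obs in [102, 104, 107, 111]:             # detected text-region centers
    particles = step(particles, obs)
print(round(particles.mean(), 1))            # tracked position estimate, near 111
```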


Proceedings ArticleDOI
23 Jul 2009
TL;DR: Analysis is presented on how noise introduced due to incorrect English affects the performance of some of the NLP tools and thereafter the text mining applications.
Abstract: Text mining aims at deriving high quality information from text in an automated way. Text mining applications rely on Natural Language Processing (NLP) tools like taggers, parsers etc. to locate and retrieve relevant information in an application specific manner. Most of these NLP tools, however, have been designed to work on clean and grammatically correct text. Presently, many organizations are interested in deriving information from informally written text that is generated as a result of human communication through emails, blog posts, web-based reviews etc. These texts are highly noisy and often found to contain a mixture of languages. In this study we present some analysis of how noise introduced by incorrect English affects the performance of some of the NLP tools and thereafter the text mining applications. The text mining application that we focus on is opinion mining, the most significant text mining application that has to deal with noisy text generated in an unregulated fashion by users.

39 citations
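
A quick way to reproduce the kind of degradation the paper measures: run the same sentence, clean and noisy, through an off-the-shelf POS tagger and compare (assumes the NLTK data packages noted in the comment are installed).

```python
import nltk
# One-time setup: nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")

clean = "I am really disappointed with this phone"
noisy = "i m realy dissapointed wid dis fone"

for sent in (clean, noisy):
    print(nltk.pos_tag(nltk.word_tokenize(sent)))
# Misspelled tokens typically fall back to noun-like guesses, which then
# misleads chunkers, parsers, and opinion-mining stages downstream.
```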


Patent
Chun-Yang Chen1, Russell Gulli2, Christopher Passaretti2, Chingfa Wu2, Stephen Buckley2 
11 Dec 2009
TL;DR: In this paper, automated text-based messaging interaction using natural language understanding technologies enables text-based messages to be received from users and interpreted by a self-service application platform, so that the platform can respond to the text-based messages in an automated manner.
Abstract: Automated text-based messaging interaction using natural language understanding technologies enables text-based messages to be received from users and interpreted by a self-service application platform so that the self-service application platform can respond to the text-based messages in an automated manner. The characters and strings of characters contained within the text message are interpreted to extract words, which are then processed using a natural language understanding engine to determine the content of the text-based message. The content is used to generate a response message from static and/or dynamic grammars to automate the process of interacting with a user via text-based messages. Multiple text-based message formats are supported, including text messages transmitted using Short Messaging Service (SMS), instant messaging, chat, and e-mail.

33 citations
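
A toy sketch of the flow the patent automates: tokenize an inbound message, map it to an intent, and answer from a canned response table. The intents and responses are invented for illustration.

```python
import re

INTENTS = {
    "balance": {"balance", "bal", "amount"},
    "hours":   {"hours", "open", "close", "closing"},
}
RESPONSES = {
    "balance": "Your balance is $42.00.",
    "hours":   "We are open 9am-5pm, Mon-Fri.",
    None:      "Sorry, I didn't understand. Try 'balance' or 'hours'.",
}

def respond(sms: str) -> str:
    """Extract words from the noisy message and match them to an intent."""
    words = set(re.findall(r"[a-z]+", sms.lower()))
    intent = next((name for name, keys in INTENTS.items() if words & keys), None)
    return RESPONSES[intent]

print(respond("wat time do u close?"))   # -> hours response
print(respond("chk my bal plz"))         # -> balance response
```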


Proceedings ArticleDOI
08 Mar 2009
TL;DR: The problem of recognizing the language of text content is addressed; since it is perhaps impossible to design a single recognizer that can identify a large number of scripts/languages, the work targets the prioritized requirements of a particular region.
Abstract: In order to reach a larger cross section of people, it is necessary that a document be composed of text content in different languages. On the other hand, this causes practical difficulty in OCRing such a document, because the language of the text must be pre-determined before employing a particular OCR. This research work addresses the problem of recognizing the language of the text content; however, it is perhaps impossible to design a single recognizer which can identify a large number of scripts/languages. As a via media, we have proposed to work on the prioritized requirements of a particular region. For instance, in Karnataka state in India, generally any document, including official ones, would contain text in three languages: English, the language of general importance; Hindi, the language of national importance; and Kannada, the language of state/regional importance. We propose to identify the language of the text by thoroughly understanding the nature of the top and bottom profiles of the printed text lines in these three languages. Experimentation involved 800 text lines for learning and 600 text lines for testing. The performance has turned out to be 95.4%.
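
A minimal sketch of the profile features this approach builds on: for a binarized text-line image, record the first and last ink row in each column; statistics of these profiles then feed a three-language classifier. The image below is a toy array, not real data.

```python
import numpy as np

line = np.array([[0, 1, 1, 1, 0],            # 1 = ink, rows top-to-bottom
                 [0, 0, 1, 0, 0],
                 [1, 1, 1, 1, 1],
                 [0, 0, 1, 0, 0]])

def profiles(img):
    """Top and bottom ink profiles, one value per column."""
    top, bottom = [], []
    for col in img.T:
        rows = np.flatnonzero(col)
        top.append(rows[0] if rows.size else -1)      # -1 marks a blank column
        bottom.append(rows[-1] if rows.size else -1)
    return np.array(top), np.array(bottom)

top, bottom = profiles(line)
print(top, bottom)   # profile statistics would feed the 3-class classifier
```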

Patent
11 Aug 2009
TL;DR: In this article, an apparatus for retrieving a character string includes: storage for storing first text data obtained by recognizing a voice in a presentation, second text data extracted from document data used in the presentation, and associated information of the first text data and the second text data.
Abstract: An apparatus for retrieving a character string includes: storage for storing first text data obtained by recognizing a voice in a presentation, second text data extracted from document data used in the presentation, and associated information of the first text data and the second text data. The apparatus also includes a retrieval unit for retrieving, by use of the associated information, the character string from text data composed from the first text data and the second text data.

Proceedings ArticleDOI
Weihui Dai1, Yue Yu1, Bin Deng1
24 Nov 2009
TL;DR: Text steganography, an information hiding technology based on text messages, is introduced, and the application of Markov state transition probabilities over natural language sequences is explored to achieve text steganography in online communication.
Abstract: In recent years, information hiding technology has developed from a focus on verifying authenticity into an effective means of secretly enhancing the safety of Internet communication. High transmission efficiency, low resource occupancy and intelligible meaning make text messages the most commonly used type of media in our daily communication. However, it is difficult to hide secret information in text messages effectively and reliably because of their limited redundancy and their alterability in manual operation. This paper introduces text steganography, an information hiding technology based on text messages, and explores the application of Markov state transition probabilities over natural language sequences to achieve text steganography in online communication. The above approach can realize information hiding in text messages with immunity to regular operations, such as formatting, compressing and sometimes manual alteration of text attributes.
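
A minimal sketch of Markov-chain text steganography: at each state the successor words are ordered and the next secret bit selects among them. The bigram table is a toy, and real systems use large models (states with a single successor carry no bits).

```python
CHAIN = {
    "the": ["weather", "market"],
    "weather": ["is"], "market": ["is"],
    "is": ["nice", "calm"],
}

def embed(bits, state="the", length=4):
    """Walk the chain, letting secret bits pick among multiple successors."""
    out = [state]
    for _ in range(length - 1):
        succ = sorted(CHAIN.get(state, []))
        if not succ:
            break
        if len(succ) > 1 and bits:
            state = succ[bits.pop(0) % len(succ)]   # bit selects the branch
        else:
            state = succ[0]                          # forced move: no bit used
        out.append(state)
    return " ".join(out)

print(embed([1, 0]))   # 'the weather is calm' encodes bits 1 (weather), 0 (calm)
```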

Proceedings ArticleDOI
23 Jul 2009
TL;DR: Three techniques are described, each exploring a different approach to the noisy text retrieval task; the first uses a novel bootstrapping mechanism to refine the OCR'ed text and uses the cleaned text for retrieval.
Abstract: With the continuous growth of the World Wide Web, there is an urgent need for an efficient information retrieval system which can search and retrieve handwritten documents when presented with user queries. However, unconstrained handwriting recognition remains a challenging task with inadequate performance (around 30% accuracy), thus proving to be a major hurdle in providing a robust search experience in the domain of handwritten documents. In this paper, we describe our recent research with focus on information retrieval from noisy text output by imperfect recognizers applied to handwritten document images. We describe three techniques, each exploring a different approach for solving the noisy text retrieval task. The first method uses a novel bootstrapping mechanism to refine the OCR'ed text and uses the cleaned text for retrieval. The second method uses the uncorrected or raw OCR'ed text but modifies the standard vector space model to handle noisy text issues. The third method employs robust image features to index the documents instead of using noisy OCR'ed text. We describe these approaches in detail and also present their performance using standard IR evaluation metrics.
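
A sketch of one way to make vector-space retrieval robust to OCR noise, in the spirit of the paper's second method (the paper's actual model modification differs): index character n-grams so that 'information' still partially matches the corrupted 'inforrnation'.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

ocr_docs = ["retrieval of inforrnation from noisy docurnents",
            "particle physics lecture notes"]
query = ["information retrieval"]

# char_wb n-grams tolerate character-level OCR errors inside words.
vec = TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 4))
D = vec.fit_transform(ocr_docs)
q = vec.transform(query)
print(cosine_similarity(q, D))   # doc 0 scores far higher despite the noise
```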

Patent
13 Feb 2009
TL;DR: In this paper, the authors proposed a method of communication that integrates both speech to text technology and text to speech technology, in which one user employs a communication device having means for converting vocal signals into text; this converted text is then sent to the other user.
Abstract: The disclosed invention comprises a method of communication that integrates both speech to text technology and text to speech technology. In its simplest form, one user employs a communication device having means for converting vocal signals into text; this converted text is then sent to the other user. This recipient is presented with the sender's text and to respond, he can enter text which is then output to the first user as speech sounds. This system creates an opportunity for two users to carry on a conversation, one using his voice (and hearing a synthesized voice in response) and the other using text (and receiving speech rendered as text): the first user has a voice conversation; the second user has a text based conversation. This system allows a user to select his preferred method of communication, regardless of the selection of his communication partner.

Proceedings Article
23 Jul 2009
TL;DR: The Third Workshop on Analytics for Noisy Unstructured Text Data (AND 2009) follows two successful previous editions and is being organized in conjunction with the Tenth International Conference on Document Analysis and Recognition (ICDAR'2009). Once again, the authors expect to see a range of thought-provoking discussions on methods for handling noise in text and related topics.
Abstract: Noise in text comes, broadly speaking, from two types of sources: (i) text produced by processing signals intended for human use such as printed/handwritten documents, spontaneous speech, and camera-captured scene images, and (ii) human generated natural language text such as electronic text from the Internet (emails, message boards, newsgroups, blogs, wikis, chat logs and web pages), contact centers (customer complaints, emails, call transcriptions, message summaries), and mobile phones (text messages). The pervasiveness of such noisy data is evident, the importance of analyzing it is obvious, and analyzing it requires moving beyond traditional text analytics techniques. The Third Workshop on Analytics for Noisy Unstructured Text Data (AND 2009) follows two successful previous editions: AND 2007 (in conjunction with the 20th International Joint Conference on Artificial Intelligence [IJCAI]) and AND 2008 (in conjunction with the 31st Annual International ACM SIGIR Conference). AND 2007 was successful in creating awareness and emphasizing the importance of noisy text analytics. It brought together industrial and academic researchers from various areas, leading to high quality papers that were featured in a special issue of the International Journal on Document Analysis and Recognition (IJDAR). Researchers also shared several synthetic and real datasets from various domains for the benefit of the community. AND 2008 built on the success of the first edition and brought together a larger community of people from different parts of the world, with a specific focus on Information Retrieval as the application area. Another special issue of IJDAR is currently under production. Following this trend, the third edition of the AND workshop is being organized in conjunction with the Tenth International Conference on Document Analysis and Recognition (ICDAR 2009). Once again, we expect to see a range of thought-provoking discussions on methods for handling noise in text and related topics. The workshop Call for Papers had a good response, like previous editions. We received 22 submissions spanning a diverse set of issues relevant to noisy text analytics. Each submission was reviewed by at least three members of the program committee. In the workshop, there will be 15 contributed presentations, invited talks by Dr. Hildelies Balk (Programme Manager for the EU project IMPACT) and Dr. Venu Govindaraju (Distinguished Professor of CSE at SUNY, Buffalo), and several working group discussion sessions spread throughout the day. Through these opportunities for interaction, we hope AND 2009 will continue to foster the international research community, as was the case with the first two AND workshops.

Journal ArticleDOI
TL;DR: A review of Evaluation of Text and Speech Systems, edited by Dybkjaer, Hemsen and Minker (Springer, 2007).
Abstract: Dybkjaer, Laila, Hemsen, Holmer and Minker, Wolfgang (Eds.), Evaluation of Text and Speech Systems. Dordrecht: Springer Verlag, 2007. ISBN 9781402058158, 290 pp. The articles collected in this volume...

Proceedings ArticleDOI
06 Dec 2009
TL;DR: TagLearner is described, a P2P classifier learning system for extracting patterns from text data where the end users can participate both in the task of labeling the data and building a distributed classifier on it.
Abstract: The amount of text data on the Internet is growing at a very fast rate. Online text repositories for news agencies, digital libraries and other organizations currently store gigabytes and terabytes of data. Large amounts of unstructured text pose a serious challenge for data mining and knowledge extraction. End user participation coupled with distributed computation can play a crucial role in meeting these challenges. In many applications involving classification of text documents, web users often participate in the tagging process. This collaborative tagging results in the formation of large scale Peer-to-Peer (P2P) systems which can function, scale and self-organize in the presence of a highly transient population of nodes and do not need a central server for coordination. In this paper, we describe TagLearner, a P2P classifier learning system for extracting patterns from text data where the end users can participate both in the task of labeling the data and building a distributed classifier on it. We present a novel distributed linear programming based classification algorithm which is asynchronous in nature. The paper also provides extensive empirical results on text data obtained from an online repository - the NSF Abstracts Data.
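
A much-simplified stand-in for the paper's learner: each peer fits a linear classifier on its locally tagged documents and peers average parameters pairwise, instead of solving the paper's asynchronous linear program. Data, labels, and peer count are toy assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

def train_local(X, y, epochs=20, lr=0.1):
    """Plain perceptron: a stand-in for each peer's local learner."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * (xi @ w) <= 0:
                w += lr * yi * xi
    return w

# Two peers, each holding a private, locally tagged slice of the data.
X1 = rng.normal(0, 1, (30, 5)); y1 = np.where(X1[:, 0] > 0, 1, -1)
X2 = rng.normal(0, 1, (30, 5)); y2 = np.where(X2[:, 0] > 0, 1, -1)

w1, w2 = train_local(X1, y1), train_local(X2, y2)
w_shared = (w1 + w2) / 2        # one asynchronous pairwise exchange
print(np.round(w_shared, 2))    # weight on feature 0 dominates, as expected
```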

Proceedings ArticleDOI
31 Mar 2009
TL;DR: An algorithm to normalize noisy text, which focuses only on the Arabic language, is introduced, and the new normalization technique is evaluated by the under-stemming error reduction technique introduced by Paice.
Abstract: In this paper, an algorithm to normalize noisy text, focusing only on the Arabic language, is introduced. Although there have been many theories that discuss Arabic text processing, there has not been, so far, one that focuses on noisy Arabic texts. Additionally, this paper introduces a new similarity measure to stem noisy Arabic documents. The need for such a new measure stems from the fact that the common rules applied in stemming cannot be applied to noisy texts, which do not conform to the known grammatical rules and have various spelling mistakes. Thus, the proposed normalization algorithm automatically groups words after applying the similarity measure. To validate the algorithm, the new normalization technique is evaluated by the under-stemming error reduction technique introduced by Paice.
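
A minimal sketch of character-level Arabic normalization plus similarity-based grouping of the general kind described; the paper's exact rule set and similarity measure differ, and the grouping threshold here is an assumption.

```python
import difflib

RULES = str.maketrans({"أ": "ا", "إ": "ا", "آ": "ا",   # unify alef variants
                       "ة": "ه",                        # ta marbuta -> ha
                       "ى": "ي"})                       # alef maqsura -> ya

def normalize(word: str) -> str:
    return word.translate(RULES)

def group(words, threshold=0.8):
    """Greedily cluster normalized word forms by string similarity."""
    clusters = []
    for w in map(normalize, words):
        for c in clusters:
            if difflib.SequenceMatcher(None, w, c[0]).ratio() >= threshold:
                c.append(w)
                break
        else:
            clusters.append([w])
    return clusters

print(group(["مكتبة", "مكتبه", "كتاب"]))   # the first two variants fall together
```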

Book ChapterDOI
01 Jan 2009
TL;DR: Text mining is the process of deriving novel information from a collection of texts, which includes identification of sets of related words in documents, identification of clusters of similar reports, exploratory analysis using structured and unstructured data (textual information) to discover hidden patterns.
Abstract: Pattern recognition is the most basic description of what is done in the process of data mining. Text mining is the process of deriving novel information from a collection of texts. It can be applied in a variety of fields, namely marketing, national security, medical and biomedical research, and public relations. The process of counting the number of matches to a text pattern occurs repeatedly in text mining; for example, one can compare two different documents by counting how many times different words occur in each document. Analysts choose the best way to analyze the text further, by either combining groups of words that appear to mean the same thing or directing the computer to do it automatically in a second iteration of the process, and then analyze the results. The goals of text mining include identification of sets of related words in documents, identification of clusters of similar reports, exploratory analysis using structured (fields in a record) and unstructured data (textual information) to discover hidden patterns that could provide some useful insights related to the causes of fatal accidents, and identification of frequent item sets. The Feature Selection and Variable Screening tool is extremely useful for reducing the dimensionality of analytic problems.
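
The "count pattern matches" step the chapter describes, in a few lines: compare two documents by their word-frequency profiles.

```python
from collections import Counter
import re

def word_counts(text):
    """Document -> multiset of lowercase word occurrences."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

a = word_counts("The crash was caused by engine failure.")
b = word_counts("Engine failure caused the fatal crash.")
print(a & b)              # shared vocabulary with minimum counts
print(sum((a & b).values()) / max(sum(a.values()), sum(b.values())))  # overlap
```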

Proceedings ArticleDOI
23 Jul 2009
TL;DR: An algorithm for ontology-guided entity disambiguation that uses existing knowledge sources such as domain-specific ontologies and other structured data to develop a robust and dynamic reasoning system to be used as a repair adviser by service technicians is presented.
Abstract: Domain-specific knowledge is often recorded by experts in the form of unstructured text. For example, in the medical domain, clinical notes from electronic health records contain a wealth of information. Similar practices are found in other domains. The challenge we discuss in this paper is how to identify and extract part names from technicians' repair notes, a noisy unstructured text data source from General Motors' archives of solved vehicle repair problems, with the goal of developing a robust and dynamic reasoning system to be used as a repair adviser by service technicians. In the present work, we discuss two approaches to this problem. We present an algorithm for ontology-guided entity disambiguation that uses existing knowledge sources such as domain-specific ontologies and other structured data. We illustrate its use in the automotive domain, using the GM parts ontology and the unit structure of repair manuals text to build context models, which are then used to disambiguate mentions of part-related entities in the text. We also describe extraction of part names with a small amount of annotated data using Hidden Markov Models (HMM) with shrinkage, achieving an f-score of approximately 80%. Next, we used linear-chain Conditional Random Fields (CRF) in order to model observation dependencies present in the repair notes. Using CRF did not lead to improved performance, but a slight improvement over the HMM results was obtained by using a weighted combination of the HMM and CRF models.
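
A toy illustration of the weighted model combination the paper reports: two stand-in models emit per-token label probabilities, which are mixed with a tuned weight. All probabilities and the weight are invented; the real work combines an HMM with shrinkage and a linear-chain CRF.

```python
import numpy as np

LABELS = ["O", "PART"]
tokens = ["replace", "fuel", "pump", "asap"]

# Per-token P(label | token, context) from each stand-in model.
p_hmm = np.array([[0.9, 0.1], [0.4, 0.6], [0.3, 0.7], [0.8, 0.2]])
p_crf = np.array([[0.8, 0.2], [0.6, 0.4], [0.2, 0.8], [0.9, 0.1]])

alpha = 0.6                                   # HMM weight, tuned on held-out data
p_mix = alpha * p_hmm + (1 - alpha) * p_crf
for tok, row in zip(tokens, p_mix):
    print(tok, LABELS[int(np.argmax(row))])   # 'fuel pump' tagged as PART
```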

Book ChapterDOI
Shourya Roy1, L. Venkata Subramaniam1
01 Jan 2009
TL;DR: This chapter first reviews some work in the area of noisy text correction, then considers how to devise extraction techniques which are robust with respect to noise, and examines how the task of information extraction is affected by noise.
Abstract: The importance of text mining applications is growing proportionally with the exponential growth of electronic text. Along with the growth of the internet, many other sources of electronic text have become popular. With increasing penetration of the internet, many forms of communication and interaction such as email, chat, newsgroups, blogs, discussion groups, scraps etc. have become increasingly popular. These generate huge amounts of noisy text data every day. Apart from these, the other big contributors to the pool of electronic text documents are call centres and customer relationship management organizations in the form of call logs, call transcriptions, problem tickets, complaint emails etc., electronic text generated by the Optical Character Recognition (OCR) process from handwritten and printed documents, and mobile text such as Short Message Service (SMS). Though the nature of each of these documents is different, there is a common thread between all of them: the presence of noise. An example of information extraction is the extraction of instances of corporate mergers, more formally MergerBetween(company1,company2,date), from an online news sentence such as: "Yesterday, New-York based Foo Inc. announced their acquisition of Bar Corp." Another is the extraction of opinions, more formally Opinion(product1,good), from a blog post such as: "I absolutely liked the texture of SheetK quilts." At a superficial level, there are two ways to extract information from noisy text. The first is to clean the text by removing noise and then apply existing state-of-the-art techniques for information extraction; therein lies the importance of techniques for automatically correcting noisy text. In this chapter, we first review some work in the area of noisy text correction. The second approach is to devise extraction techniques which are robust with respect to noise. Later in this chapter, we will see how the task of information extraction is affected by noise.
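
The chapter's merger example as a literal pattern: a deliberately brittle sketch showing why noise (misspellings, dropped capitalization) defeats surface-level extraction.

```python
import re

PATTERN = re.compile(
    r"(?P<buyer>[A-Z][\w-]*(?: [A-Z][\w.]*)*) announced their acquisition "
    r"of (?P<target>[A-Z][\w.]*(?: [A-Z][\w.]*)*)")

clean = "Yesterday, New-York based Foo Inc. announced their acquisition of Bar Corp."
noisy = "yesterday foo inc anounced their acquisition of bar corp"

for text in (clean, noisy):
    m = PATTERN.search(text)
    print(m.groupdict() if m else "no match")   # noise defeats the pattern
```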

01 Dec 2009
TL;DR: A text chat system called "GaChat" is proposed, which simultaneously appends related information to the dialogue text between its users; this settlement of ambiguity helps users reduce redundant interactions like searching for and asking about the details of a phrase.
Abstract: Text chat systems are popular and widely used. However, there are sometimes redundant interactions between users because of the limited awareness such systems provide. In this paper, we propose a text chat system called "GaChat", which simultaneously appends related information to the dialogue text between its users. First, proper nouns are extracted from the dialogue text by morphological analysis. Then online images and articles related to the nouns are displayed alongside the dialogue text. Such settlement of ambiguity helps users reduce redundant interactions like searching for and asking about the details of a phrase. This paper describes the prototype implementation and its first evaluation experiment. Author Keywords: Text chat communication, Instant messaging service, Web based communication

Book ChapterDOI
24 Jun 2009
TL;DR: Today, academic researchers face a flood of information, and full text search provides an important way of finding useful information from mountains of publications, but it generally suffers from low precision, or low quality of document retrieval.
Abstract: Today, academic researchers face a flood of information. Full text search provides an important way of finding useful information from mountains of publications, but it generally suffers from low precision, or low quality of document retrieval. A full text search algorithm typically examines every word in a given text, trying to find the query words. Unfortunately, many words in natural language are polysemous, and thus many documents retrieved using this approach are irrelevant to actual search queries.

Proceedings ArticleDOI
02 Oct 2009
TL;DR: This paper proposes a text art extraction method for multi natural languages which works well for text data with other natural languages well and attributes of a given text data which the method uses include how the text data looks like text art while attributes of it which the previous works use include how to look like a natural language text.
Abstract: Text based pictures called text art are often used in Web pages, email text and so on. They delight us, but they can be noise for handling the text data. For example, they can be obstacle for text-to-speech software and natural language processing. Such problems will be solved by text art extraction methods which detects the area of text art in a given text data, and text art extraction methods can be constructed by text art recognition methods which tell if a given fragment of text data is a text art or not. Previous works for text art recognition methods assume that a specific natural language is used in text data, and do not work for text data with other natural languages well. In this paper, we propose a text art extraction method for multi natural languages. Attributes of a given text data which our method uses include how the text data looks like text art while attributes of it which the previous works use include how the text data looks like a natural language text. Our experiment shows that our method works well for both English text and Japanese text data.
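
A sketch of language-independent "looks like text art" attributes in the spirit of the paper: symbol density and repeated-character runs per line. The feature set and the example inputs are assumptions.

```python
import re

def text_art_score(block: str) -> float:
    """Higher score = more text-art-like, independent of language."""
    lines = [l for l in block.splitlines() if l.strip()]
    if not lines:
        return 0.0
    sym = sum(len(re.findall(r"[^\w\s]", l)) / max(len(l), 1) for l in lines)
    runs = sum(bool(re.search(r"(.)\1{3,}", l)) for l in lines)
    return (sym + runs) / len(lines)

art = "  /\\_/\\ \n ( o.o )\n  > ^ < "
prose = "Text based pictures called text art are often\nused in Web pages."
print(text_art_score(art) > text_art_score(prose))   # True
```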

Proceedings ArticleDOI
24 May 2009
TL;DR: The paper suggests a software architecture to improve the accessibility of information from texts that combines techniques such as Image Processing, Optical Character Recognition, Machine Translation, Text Analysis and Text to Speech.
Abstract: The paper suggests a software architecture to improve the accessibility of information from texts. This software combines techniques such as Image Processing, Optical Character Recognition, Machine Translation, Text Analysis and Text to Speech. The application uses a scanner or a web cam as an image input device, recognizes the text by using OCR, enables text translation by using Google's machine translation implementation, will interpret the text (a future development), and reads the text by using TTS technology. In this way the user can put the text information source into a scanner or under a web cam and can hear the text translated and, if required, interpreted. A functional prototype is presented and conclusions are drawn.

01 Jan 2009
TL;DR: This chapter will introduce the most fundamental uses of language processing methods in biology and present the basic resources openly available in the field.
Abstract: Text Mining is the process of extracting (novel) interesting and non-trivial information and knowledge from unstructured text (Google™ search result for "define: text mining"). Information retrieval, natural language processing, information extraction, and text mining provide methodologies to shift the burden of tracing and relating data contained in text from the human user to the computer. The emergence of high-throughput techniques has allowed the biosciences to switch their research focus to Systems Biology, increasing the demands on text mining and the extraction of information from heterogeneous sources. This chapter will introduce the most fundamental uses of language processing methods in biology and present the basic resources openly available in the field. The search for information about a common disease, chronic myeloid leukemia, is used to exemplify the capabilities. Tools such as PubMed, eTBLAST, METIS, EBIMed, MEDIE, MarkerInfoFinder, HCAD, iHOP, Chilibot, and G2D - selected from a comprehensive list of currently available systems - provide users with a basic platform for performing complex operations on information accumulated in text.

Book ChapterDOI
Taeho Jo1
16 Sep 2009
TL;DR: This research proposes an alternative to machine learning based approaches for categorizing online news articles in Reuter21578, by encoding a document or documents into a table instead of numerical vectors.
Abstract: This research proposes an alternative to machine learning based approaches for categorizing online news articles in Reuter21578. To use machine learning based approaches for any task of text mining or information retrieval, documents must be encoded into numerical vectors; two problems, huge dimensionality and sparse distribution, are caused by such encoding. Although there are various tasks of text mining such as text categorization, text clustering, and text summarization, the scope of this research is restricted to text categorization. The idea of this research is to avoid the two problems by encoding a document or documents into a table instead of numerical vectors. Therefore, the goal of this research is to improve the performance of text categorization by avoiding the two problems.
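
A sketch of the core idea: represent a document as a small word-frequency table rather than a fixed sparse vector, and compare tables directly (the similarity function below is an assumption).

```python
from collections import Counter
import re

def to_table(text):
    """Document -> {word: frequency} table; no global vocabulary needed."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def table_similarity(t1, t2):
    """Overlap of two tables, normalized by the larger document length."""
    shared = set(t1) & set(t2)
    return sum(min(t1[w], t2[w]) for w in shared) / max(sum(t1.values()),
                                                        sum(t2.values()))

doc = to_table("oil prices rise as markets react to supply fears")
category = to_table("markets fall as oil supply tightens and prices rise")
print(round(table_similarity(doc, category), 2))
```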

Book ChapterDOI
01 Jan 2009
TL;DR: ThisSTATISTICA Text Mining and Document Retrieval tutorial is to provide powerful tools to process unstructured (textual) information, extract meaningful numeric indices from the text, and, thus, make the information contained in the text accessible to the various data mining (statistical and machine learning) algorithms available in the STATISTICA system.
Abstract: Data mining has played a promising role in the extraction of implicit and potentially useful information from databases. Text mining is all about analyzing text for extracting information from unstructured data. The purpose of this STATISTICA Text Mining and Document Retrieval tutorial is to provide powerful tools to process unstructured (textual) information, extract meaningful numeric indices from the text, and, thus, make the information contained in the text accessible to the various data mining (statistical and machine learning) algorithms available in the STATISTICA system. STATISTICA Text Mining and Document Retrieval is a text-mining tool for indexing text in various languages, i.e., for meaningfully representing the number of times that terms occur in the input documents. The program includes numerous options for stemming words (terms), for handling synonym lists and phrases, and for summarizing the results of the indexing using various indices and statistical techniques. The selection of tools or techniques available with STATISTICA, along with the Text Mining module, can help organizations to solve a variety of problems. Extraction of useful insights or information from unstructured data could be used as input for decision-making purposes.

Journal Article
TL;DR: The Internet has become a giant repository of web text documents, but the large scale, heterogeneity and lack of structure of web text pose new challenges to data mining.
Abstract: With the large-scale popularization of the WWW and the growth of corporate information, how to obtain useful information from these large volumes of user data has become an important research subject. Text mining techniques can quickly and effectively extract useful information from large amounts of data. The Internet has become a giant repository of web text documents, but the large scale, heterogeneity and lack of structure of web text pose new challenges to data mining. This paper introduces the process of web text mining and analyzes the related technologies in detail.