Showing papers on "Noisy text analytics" published in 2005

Open Access

Text Classification Using Machine Learning Techniques

M. Ikonomakis, Sotiris Kotsiantis, Vassilis Tampakas

01 Jan 2005

TL;DR: This paper illustrates the text classification process using machine learning techniques to manage and process a vast amount of documents in digital forms that are widespread and continuously increasing.

Abstract: Automated text classification has been considered as a vital method to manage and process a vast amount of documents in digital forms that are widespread and continuously increasing. In general, text classification plays an important role in information extraction and summarization, text retrieval, and question-answering. This paper illustrates the text classification process using machine learning techniques. The references cited cover the major theoretical issues and guide the researcher to interesting research directions.

447 citations

Journal Article•DOI•

Mining knowledge from text using information extraction

Raymond J. Mooney¹, Razvan Bunescu¹•Institutions (1)

University of Texas at Austin¹

01 Jun 2005-Sigkdd Explorations

TL;DR: Methods and implemented systems for information extraction are discussed, and results on mining real text corpora of biomedical abstracts, job announcements, and product descriptions are summarized.

Abstract: An important approach to text mining involves the use of natural-language information extraction. Information extraction (IE) distills structured data or knowledge from unstructured text by identifying references to named entities as well as stated relationships between such entities. IE systems can be used to directly extricate abstract knowledge from a text corpus, or to extract concrete data from a set of documents which can then be further analyzed with traditional data-mining techniques to discover more general patterns. We discuss methods and implemented systems for both of these approaches and summarize results on mining real text corpora of biomedical abstracts, job announcements, and product descriptions. We also discuss challenges that arise when employing current information extraction technology to discover knowledge in text.

256 citations

Patent•

Key usage and text marking in the context of a combined predictive text and speech recognition system

Juha Purho¹•Institutions (1)

Nokia¹

02 Nov 2005

TL;DR: In this article, a combined predictive speech and text recognition system was proposed, which combines the functionality of text input programs with speech input and recognition systems, and a user can both manually enter text and speak desired letters, words or phrases.

Abstract: A combined predictive speech and text recognition system. The present invention combines the functionality of text input programs with speech input and recognition systems. With the present invention, a user can both manually enter text and speak desired letters, words or phrases. The system receives and analyzes the provided information and provides one or more proposals for the completion of words or phrases. This process can be repeated until an adequate match is found.

107 citations

Proceedings Article•DOI•

From text to speech summarization

Kathleen R. McKeown¹, Julia Hirschberg¹, Michel Galley¹, Sameer Maskey¹•Institutions (1)

Columbia University¹

18 Mar 2005

TL;DR: It is illustrated how features derived from speech can help determine summary content within two ongoing summarization projects at Columbia University.

Abstract: In this paper, we present approaches used in text summarization, showing how they can be adapted for speech summarization and where they fall short. Informal style and apparent lack of structure in speech mean that the typical approaches used for text summarization must be extended for use with speech. We illustrate how features derived from speech can help determine summary content within two ongoing summarization projects at Columbia University.

71 citations

Proceedings Article•DOI•

Narrative text classification for automatic key phrase extraction in web document corpora

Yongzheng Zhang¹, Nur Zincir-Heywood¹, Evangelos E. Milios¹•Institutions (1)

Dalhousie University¹

04 Nov 2005

TL;DR: The evaluation shows that key phrases extracted from the narrative text only are significantly better than those obtained from all plain text of Web pages, demonstrating that narrative text classification is indispensable for effective key phrase extraction in Web document corpora.

Abstract: Automatic key phrase extraction is a useful tool in many text related applications such as clustering and summarization. State-of-the-art methods are aimed towards extracting key phrases from traditional text such as technical papers. Application of these methods on Web documents, which often contain diverse and heterogeneous contents, is of particular interest and challenge in the information age. In this work, we investigate the significance of narrative text classification in the task of automatic key phrase extraction in Web document corpora. We benchmark three methods, TFIDF, KEA, and Keyterm, used to extract key phrases from all the plain text and from only the narrative text of Web pages. ANOVA tests are used to analyze the ranking data collected in a user study using quantitative measures of acceptable percentage and quality value. The evaluation shows that key phrases extracted from the narrative text only are significantly better than those obtained from all plain text of Web pages. This demonstrates that narrative text classification is indispensable for effective key phrase extraction in Web document corpora.
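
As a rough illustration of the TFIDF baseline named in this abstract, the following is a minimal sketch of ranking a document's terms by TF-IDF against a small corpus. The function names, the tokenizer, and the smoothed IDF formula are assumptions made for this example; the paper's exact weighting and candidate-phrase selection are not given in the abstract.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Hypothetical tokenizer: lowercase alphabetic runs only.
    return re.findall(r"[a-z]+", text.lower())


def tfidf_key_terms(doc, corpus, top_k=3):
    """Rank the terms of `doc` by TF-IDF against `corpus` (a list of strings)."""
    doc_term_sets = [set(tokenize(d)) for d in corpus]
    n_docs = len(corpus)
    tf = Counter(tokenize(doc))
    total = sum(tf.values())
    scores = {}
    for term, count in tf.items():
        # Document frequency: number of corpus documents containing the term.
        df = sum(1 for terms in doc_term_sets if term in terms)
        # Smoothed IDF (an assumption) so that df == 0 cannot divide by zero.
        idf = math.log((1 + n_docs) / (1 + df)) + 1.0
        scores[term] = (count / total) * idf
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda t: (-scores[t], t))[:top_k]
```

Under this sketch, the comparison the abstract describes would amount to calling `tfidf_key_terms` once on all plain text of a page and once on only its narrative text, then comparing the two rankings.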

70 citations

Journal Article•DOI•

Taxonomy generation for text segments: A practical web-based approach

Shui-Lung Chuang¹, Lee-Feng Chien²•Institutions (2)

Academia Sinica¹, National Taiwan University²

01 Oct 2005-ACM Transactions on Information Systems

TL;DR: This work investigates the possibilities of using highly ranked search-result snippets to enrich the representation of text segments and proposes a hierarchical clustering algorithm, which tries to produce a more natural and comprehensive tree hierarchy.

Abstract: It is crucial in many information systems to organize short text segments, such as keywords in documents and queries from users, into a well-formed taxonomy. In this article, we address the problem of taxonomy generation for diverse text segments with a general and practical approach that uses the Web as an additional knowledge source. Unlike long documents, short text segments typically do not contain enough information to extract reliable features. This work investigates the possibilities of using highly ranked search-result snippets to enrich the representation of text segments. A hierarchical clustering algorithm is then designed for creating the hierarchical topic structure of text segments. Text segments with close concepts can be grouped together in a cluster, and relevant clusters linked at the same or near levels. Different from traditional clustering algorithms, which tend to produce cluster hierarchies with a very unnatural shape, the algorithm tries to produce a more natural and comprehensive tree hierarchy. Extensive experiments were conducted on different domains of text segments, including subject terms, people names, paper titles, and natural language questions. The obtained experimental results have shown the potential of the proposed approach, which provides a basis for the in-depth analysis of text segments on a larger scale and is believed able to benefit many information systems.

57 citations

Patent•

Systems and methods for automatically categorizing unstructured text

Eric D. Scott, Katrina A. Rhoads

23 Nov 2005

TL;DR: In this paper, a sample set of messages, representative of messages from the message stream, is analyzed to determine interesting or useful categories; text categorization engines are then trained using the sample set, and text classifiers are published.

Abstract: Systems, methods and software products analyze messages of a message stream based upon human generated concept recognizers. A sample set of messages, representative of messages from the message stream, is analyzed to determine interesting or useful categories. Text categorization engines are then trained using the sample set, and text classifiers are published. These text classifiers are then used to categorize further text messages from the message stream.

51 citations

Patent•

Speech recognition system for providing voice recognition services using a conversational language model

Cary Lee Bates¹, Brian Paul Wallenfelt²•Institutions (2)

IBM¹, Nuance Communications²

22 Dec 2005

TL;DR: In this article, the authors present a method, system and article of manufacture for adjusting a language model within a voice recognition system, based on text received from an external application, where the external application may supply text representing the words of one participant to a text-based conversation.

Abstract: Embodiments of the present invention provide a method, system and article of manufacture for adjusting a language model within a voice recognition system, based on text received from an external application. The external application may supply text representing the words of one participant to a text-based conversation. In such a case, changes may be made to a language model by analyzing the external text received from the external application.

47 citations

Patent•

Text stitching from multiple images

Raymond C. Kurzweil, Paul Albrecht¹, Lucy Gibson, Lev Lvovsky¹•Institutions (1)

Wellesley College¹

01 Apr 2005

TL;DR: In this paper, a reading machine has been proposed for detecting common text between a pair of individual images, combining the text from the pair of images into a file or data structure if common text is detected, and determining if incomplete text phrases are present in the common text.

Abstract: A reading machine has processing for detecting common text between a pair of individual images. The reading machine combines the text from the pair of images into a file or data structure if common text is detected, and determines if incomplete text phrases are present in the common text. If incomplete text phrases are present, the machine signals a user to move an image input device in a direction to capture more of the text.
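
The idea of combining text from two images when common text is detected can be sketched as follows. This is only an illustrative assumption: the abstract does not say how common text is matched, so the example uses exact word overlap between the end of one OCR result and the start of the next, and the name `stitch_text` and the `min_overlap` parameter are hypothetical.

```python
def stitch_text(left, right, min_overlap=2):
    """Merge two OCR word sequences when the end of `left` repeats at the start of `right`.

    Returns the merged string, or None when no overlap of at least
    `min_overlap` words is found (i.e. no common text was detected).
    """
    a, b = left.split(), right.split()
    smallest = max(min_overlap, 1)  # an overlap of zero words is meaningless
    # Try the longest possible overlap first so the merge drops all repeated words.
    for size in range(min(len(a), len(b)), smallest - 1, -1):
        if a[-size:] == b[:size]:
            return " ".join(a + b[size:])
    return None
```

A `None` result corresponds to the case where no common text is found between the pair of images. Real OCR output would likely need fuzzy matching rather than the exact comparison used here.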

47 citations

Patent•

Indexing and searching speech with text meta-data

Alejandro Acero¹, Ciprian Chelba¹, Jorge Silva F. Sanchez¹•Institutions (1)

Microsoft¹

08 Nov 2005

TL;DR: In this article, an index for searching spoken documents having speech data and text meta-data is created by obtaining probabilities of occurrence of words and positional information of the words of the speech data.

Abstract: An index for searching spoken documents having speech data and text meta-data is created by obtaining probabilities of occurrence of words and positional information of the words of the speech data and combining it with at least positional information of the words in the text meta-data. A single index can be created because the speech data and the text meta-data are treated the same and considered only different categories.

38 citations

Journal Article•DOI•

An embedded application for degraded text recognition

Céline Thillou¹, Silvio Ferreira¹, Bernard Gosselin¹•Institutions (1)

Faculté polytechnique de Mons¹

01 Jan 2005-EURASIP Journal on Advances in Signal Processing

TL;DR: This paper describes a mobile device which tries to give the blind or visually impaired access to text information and presents the overall description of the system from text detection to OCR error correction.

Abstract: This paper describes a mobile device which tries to give the blind or visually impaired access to text information. Three key technologies are required for this system: text detection, optical character recognition, and speech synthesis. Blind users and the mobile environment imply two strong constraints. First, pictures will be taken without control on camera settings and a priori information on text (font or size) and background. The second issue is to link several techniques together with an optimal compromise between computational constraints and recognition efficiency. We will present the overall description of the system from text detection to OCR error correction.

Proceedings Article•DOI•

Associating text and graphics for scientific chart understanding

Weihua Huang¹, Chew Lim Tan¹, Wee Kheng Leow¹•Institutions (1)

National University of Singapore¹

31 Aug 2005

TL;DR: The association of text and graphics allows us to capture the semantic meaning carried by scientific chart images in a more complete way.

Abstract: This paper presents our recent work that aims at associating the recognition results of textual and graphical information contained in the scientific chart images. Text components are first located in the input image and then recognized using OCR. On the other hand, the graphical objects are segmented and form high level symbols. Both logical and semantic correspondence between text and graphical symbols are identified. The association of text and graphics allows us to capture the semantic meaning carried by scientific chart images in a more complete way. The result of scientific chart image understanding is presented using XML documents.

Patent•

Information retrieval system, method, and program

Toshihiko Manabe¹, Hideki Tsutsui¹, Koji Urata¹, Mika Fukui¹, Hiroko Hayama¹•Institutions (1)

Toshiba¹

21 Sep 2005

TL;DR: An information retrieval system including speech recognition means for making speech recognition for a spoken question to generate first text information, and generation means for modifying the first text information to generate second text information as an interrogative to make a search for an answer to the question.

Abstract: An information retrieval system includes speech recognition means for making speech recognition for a spoken question to generate first text information, generation means for modifying the first text information to generate second text information as an interrogative to make a search for an answer to the question, and search means for searching the answer from a document database by using the second text information.

Proceedings Article•DOI•

A Fuzzy-Based Approach for Text Representation in Text Categorization

Son Doan¹•Institutions (1)

Japan Advanced Institute of Science and Technology¹

25 May 2005

TL;DR: A novel scheme for text representation based on fuzzy set theory is proposed and a new algorithm for choosing a term set that characterizes a document in the corpus is given under the view of fuzzy set.

Abstract: Document representation is one of the most important tasks in text processing, especially in text categorization. This task has many applications that include document management, information retrieval, text routing, etc. In this paper, the author proposes a novel scheme for text representation based on fuzzy set theory. A new algorithm for choosing a term set that characterizes a document in the corpus is given under the view of fuzzy set. Experimental results applied to the text categorization problem using the relevance feedback technique show that the proposed method reduces the number of dimensions and achieves higher performance compared to other baseline methods. In addition, it also produces results that compare favorably to the result achieved with the all vocabulary method.

Journal Article•

Multi-Level Document Visualization.

Stan Ruecker, Eric Homich, Stéfan Sinclair

01 Jan 2005-Visible Language

TL;DR: A prototype system is described that allows readers to view an electronic text in multiple simultaneous views, providing insight at several different levels of granularity, including a reading view, combined with a number of tools for manipulating the text.

Abstract: This paper describes a prototype system that allows readers to view an electronic text in multiple simultaneous views, providing insight at several different levels of granularity, including a reading view. This prospect display is combined with a number of tools for manipulating the text, for example by highlighting sections of interest for a particular task. The result is a powerful approach to working with electronic text for various purposes: sample scenarios are outlined involving directors reading scripts, students studying novels, and second-language learners familiarizing themselves with grammatical constructions.

Introduction

Digital text offers software developers and designers the opportunity to provide readers with a variety of new perceptual experiences and possibilities for action that have simply not been available through printed texts (Bork, 1983). An obvious example is the widespread adoption of digital texts connected by hyperlinks and identified by many theorists as a significant change in the way people are able to interact with the written word (Bolter, 1991; Landow, 1994, etc.). However, many other new affordances of digital text remain to be identified, developed and studied. One of these possible new affordances is the ability to have text or layout features change over time (Chang et al., 1988; Ford et al., 1997). In kinetic text research, traditionally static design elements such as font, size, leading, color and placement can all be used dynamically to achieve layout effects that were previously available only in non-interactive media such as film (Lee et al., 2002). This project extends research in hypertext and kinetic text theory to provide readers with a text document display that combines simultaneous prospect - an overview of the entire text - and detail views, with related tools.
Much as architectural blueprints allow the person reading them to get a sense of an entire building or some key feature, such as the wiring or the ventilation, allowing readers to see an entire text at once (that is, providing text prospect) has perceptual advantages. These advantages, which we will explore in this paper, are not available in cases where the text can only be accessed sequentially. The system also includes related tools that allow the reader to carry out new kinds of actions that would not otherwise be available. From hypertext theory comes the concept of associated text elements, where interaction with one text moves the reader into a related text. However, zooming through prospect views differs from a hypertextual implementation in that there are no predefined links between views. Hypertext is also predicated on the concept of connecting lexia or individual documents, so that following a link has the effect of visually replacing the source text with the destination text. In this project the text is treated as a stable whole and presented so as to minimize interruptions to the reader's literary engagement with the text (Miall, 1999). Kinetic text theory contributes the notion of a system where text characteristics change as a way of responding to reader interests. In this case, the reader has the ability to identify the portion of the whole text that will display in the reading view. There is also the capacity to highlight specific passages in the entire text, by selecting the features from a set of choices that derive from the tagging available in the document. 
Finally, in cases where this system has been integrated with related digital reading tools, additional kinetic features may be possible, as in the Watching the Script prototype (Ruecker et al., 2004), where the reader views the script by watching it scroll at various character positions on stage.

[Figure 1]

The Multi-level Document Visualization Prototype

In the Multi-level Document Visualization prototype that we have developed, the prospect view indexes a fisheye reading view, where a segment of text of about a dozen lines is shown at full size, while adjacent text is displayed as increasingly smaller lines of microtext (Small, 1996; Furnas, 1986; Bederson, 2000). …

Book Chapter•DOI•

Text Processing And Information Retrieval

N. Milić-Frayling

23 May 2005-WIT Transactions on State-of-the-art in Science and Engineering

Proceedings Article•DOI•

Annotating Text Segments in Documents for Search

Pu-Jen Cheng, Hsin-Chen Chiao, Yi-Cheng Pan, Lee-Feng Chien

19 Sep 2005

TL;DR: An approach to extract global evidences from documents for improved named entity recognition and an unsupervised, generalized classification approach that collects training data from the Web automatically and classifies text patterns into more refined categories are presented.

Abstract: It has been shown that annotating prominent text patterns contained in documents with appropriate types may benefit many applications. Most conventional tools for automatic text annotation extract named entities from texts and annotate them with information about persons, locations, dates and so on. However, this kind of entity type information is often short in length and is mostly limited to a small set of broader categories. In this paper, we try to remedy this problem by presenting an approach to extract global evidences from documents for improved named entity recognition. We also propose an unsupervised, generalized classification approach that collects training data from the Web automatically and classifies text patterns into more refined categories. Experimental results show the feasibility of the proposed approaches for search on the data of the NTCIR-2 information retrieval task.

Proceedings Article•DOI•

A generic method for determining the up/down orientation of text in Roman and non-Roman scripts

H.B. Aradhye

31 Aug 2005

TL;DR: This paper presents a method for determining the up/down orientation of text in a scanned document of unknown orientation, and analyzes the "open" portions of text blobs to determine the direction in which the open portions face.

Abstract: This paper presents a method for determining the up/down orientation of text in a scanned document of unknown orientation. The method analyzes the "open" portions of text blobs to determine the direction in which the open portions face. By determining the respective densities of blobs opening in a pair of opposite directions (e.g., right or left), the method can establish the direction in which the text as a whole is oriented. We first discuss the orientation of Roman text based on the asymmetry in the openness of Roman letters in the horizontal direction. For non-Roman text such as Pashto and Hebrew, we determine a direction that is the most asymmetric, and therefore the most useful for orientation, given a training dataset. This direction is then used for orientation. This work can be used for automated orientation of mail, checks in ATM envelopes, and scanned, copied, or faxed documents.

Proceedings Article•DOI•

Profile Extraction from Mean Profile for Automatic Text Categorization

K. Lakshmi¹, Saswati Mukherjee¹•Institutions (1)

Anna University¹

28 Nov 2005

TL;DR: This work proposed an automatic text categorization approach that uses profiles for categorization of text documents and concluded that the increase in distance between profiles improves the classifier performance.

Abstract: With the overwhelming growth of information technology, better organization of documents is required for easy access to information. Hence the need for text categorization becomes critical. Many researchers have turned their attention towards text categorization. Text categorization is the automated assignment of predefined categories to text documents based on document contents. We propose an automatic text categorization approach that uses profiles for categorization of text documents. A new similarity method has been used for measuring similarity between profiles and documents. We have explored different ways of profile creation and concluded that the increase in distance between profiles improves the classifier performance. Our classifier shows improved performance when compared with similar text categorization methods.

Proceedings Article•

Relational Recognition for Information Extraction in Free Text Documents.

Erik Larson¹, Todd Hughes²•Institutions (2)

University of Texas at Austin¹, Lockheed Martin Corporation²

01 Jan 2005

TL;DR: An information extraction system and experiment is described that demonstrates accurate tuple extraction in a selected domain and helps to bridge the gap between free text and knowledge-based applications.

Abstract: Information extraction tools provide an important means for distilling content from free text documents, and knowledge-based tools provide an important means for automatically reasoning over statements expressed as well-formed tuples. A number of techniques deliver reliable extraction of entities, less reliable extraction of relations, and poor extraction of entity-entity-relation tuples. However, tuple extraction is needed to bridge the gap between free text and knowledge-based applications. We describe an information extraction system and experiment that demonstrates accurate tuple extraction in a selected domain.

Proceedings Article•DOI•

Knowledge discovery method to accomplish English document classification

Elmarhomy Ghada¹, El-Sayed Atlam¹, Hiro Hanafusa¹, Masao Fuketa¹, Kazuhiro Morita¹, Jun-ichi Aoe¹•Institutions (1)

University of Tokushima¹

19 May 2005

TL;DR: This paper describes what a field-associated term is and how to discover field-associated terms, which exist in any text; such words are called field association (FA) words and can be directly related to field classification.

Abstract: Although there is much research on text classification based on vector spaces using word information in the whole text, humans can generally recognize the field by finding specific words. This paper describes what a field-associated term is and how to discover field-associated terms, which exist in any text. In this paper, such words are called field association (FA) words, which can be directly related to field classification. Five criteria of FA terms are defined for hierarchical fields. All of them are stored in a field tree to support extraction of field-coherent passages for document classification. The presented approach is evaluated by simulation results on 140 text files in the sports field and extended by 197 text files in the civil engineering field.

Book Chapter•DOI•

Summarization and visualization

D. Mladenić, Marko Grobelnik

23 May 2005-WIT Transactions on State-of-the-art in Science and Engineering

TL;DR: Some basic methods for text summarization and text visualization are presented and an example on real-world data describing the research projects in information technology supported by European Commission is given.

Abstract: Both text summarization and visualization aim at providing some sort of general view of the text either giving a text summary in the required natural language or giving some visual representation of the text. In both cases the text can be either a single document or a set of documents written in some natural language(s). Here we present some basic methods for text summarization and text visualization and give an example on real-world data describing the research projects in information technology supported by European Commission.

Book Chapter•DOI•

Spontaneous handwriting text recognition and classification using finite-state models

Alejandro Héctor Toselli¹, Moisés Pastor¹, Alfons Juan¹, Enrique Vidal¹•Institutions (1)

Polytechnic University of Valencia¹

07 Jun 2005

TL;DR: Two different types of statistical framework for phrase recognition-classification are considered, based on finite-state models, and experimental results are reported which, given the extreme difficulty of the task, are encouraging.

Abstract: Finite-state models are used to implement a handwritten text recognition and classification system for a real application entailing casual, spontaneous writing with large vocabulary. Handwritten short phrases, which involve a wide variety of writing styles and contain many non-textual artifacts, are to be classified into a small number of predefined classes. To this end, two different types of statistical framework for phrase recognition-classification are considered, based on finite-state models. HMMs are used for the text recognition process. Depending on the considered architecture, N-grams are used for performing text recognition and then text classification (serial approach) or for performing both simultaneously (integrated approach). The multinomial text classifier is also employed in the classification phase of the serial approach. Experimental results are reported which, given the extreme difficulty of the task, are encouraging.

Book Chapter•DOI•

Visual Representation of Text in Web Documents and Its Interpretation

Dimosthenis Karatzas¹, Apostolos Antonacopoulos¹•Institutions (1)

University of Liverpool¹

01 Jan 2005

TL;DR: The uses of text and its representation on Web documents are examined in terms of the challenges in its interpretation, with particular attention paid to the significant problem of non-uniform representation of text.

Abstract: This paper examines the uses of text and its representation on Web documents in terms of the challenges in its interpretation. Particular attention is paid to the significant problem of non-uniform representation of text. This non-uniformity is mainly due to the presence of semantically important text in image form as opposed to the standard encoded text. The issues surrounding text representation in Web documents are discussed in the context of colour perception and spatial representation. The characteristics of the representation of text in image form are examined and research towards interpreting these images of text is briefly described.