
Showing papers on "Noisy text analytics" published in 2005


01 Jan 2005
TL;DR: This paper illustrates the text classification process using machine learning techniques to manage and process a vast amount of documents in digital forms that are widespread and continuously increasing.
Abstract: Automated text classification has been considered as a vital method to manage and process a vast amount of documents in digital forms that are widespread and continuously increasing. In general, text classification plays an important role in information extraction and summarization, text retrieval, and question-answering. This paper illustrates the text classification process using machine learning techniques. The references cited cover the major theoretical issues and guide the researcher to interesting research directions.

447 citations


Journal ArticleDOI
TL;DR: Methods and implemented systems for information extraction systems used to extract concrete data from a set of documents are discussed and results on mining real text corpora of biomedical abstracts, job announcements, and product descriptions are summarized.
Abstract: An important approach to text mining involves the use of natural-language information extraction. Information extraction (IE) distills structured data or knowledge from unstructured text by identifying references to named entities as well as stated relationships between such entities. IE systems can be used to directly extricate abstract knowledge from a text corpus, or to extract concrete data from a set of documents which can then be further analyzed with traditional data-mining techniques to discover more general patterns. We discuss methods and implemented systems for both of these approaches and summarize results on mining real text corpora of biomedical abstracts, job announcements, and product descriptions. We also discuss challenges that arise when employing current information extraction technology to discover knowledge in text.

256 citations


Patent
Juha Purho
02 Nov 2005
TL;DR: In this article, a combined predictive speech and text recognition system was proposed, which combines the functionality of text input programs with speech input and recognition systems, and a user can both manually enter text and speak desired letters, words or phrases.
Abstract: A combined predictive speech and text recognition system. The present invention combines the functionality of text input programs with speech input and recognition systems. With the present invention, a user can both manually enter text and speak desired letters, words or phrases. The system receives and analyzes the provided information and provides one or more proposals for the completion of words or phrases. This process can be repeated until an adequate match is found.

107 citations


Proceedings ArticleDOI
18 Mar 2005
TL;DR: It is illustrated how features derived from speech can help determine summary content within two ongoing summarization projects at Columbia University.
Abstract: In this paper, we present approaches used in text summarization, showing how they can be adapted for speech summarization and where they fall short. Informal style and apparent lack of structure in speech mean that the typical approaches used for text summarization must be extended for use with speech. We illustrate how features derived from speech can help determine summary content within two ongoing summarization projects at Columbia University.

71 citations


Proceedings ArticleDOI
04 Nov 2005
TL;DR: The evaluation shows that key phrases extracted from the narrative text only are significantly better than those obtained from all plain text of Web pages, demonstrating that narrative text classification is indispensable for effective key phrase extraction in Web document corpora.
Abstract: Automatic key phrase extraction is a useful tool in many text related applications such as clustering and summarization. State-of-the-art methods are aimed towards extracting key phrases from traditional text such as technical papers. Application of these methods on Web documents, which often contain diverse and heterogeneous contents, is of particular interest and challenge in the information age. In this work, we investigate the significance of narrative text classification in the task of automatic key phrase extraction in Web document corpora. We benchmark three methods, TFIDF, KEA, and Keyterm, used to extract key phrases from all the plain text and from only the narrative text of Web pages. ANOVA tests are used to analyze the ranking data collected in a user study using quantitative measures of acceptable percentage and quality value. The evaluation shows that key phrases extracted from the narrative text only are significantly better than those obtained from all plain text of Web pages. This demonstrates that narrative text classification is indispensable for effective key phrase extraction in Web document corpora.
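Of the three benchmarked methods, TFIDF is the simplest to illustrate. The following is a minimal sketch of TF-IDF term ranking, not the implementation used in the study: the function name `tfidf_key_terms`, whitespace tokenization, and single-word candidates are all assumptions made for illustration.

```python
import math
from collections import Counter


def tfidf_key_terms(docs, top_k=3):
    """Rank the terms of each document by TF-IDF and return the top_k per document.

    Hypothetical helper: tokenization is a plain lowercase whitespace split.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    ranked = []
    for tokens in tokenized:
        tf = Counter(tokens)
        scores = {
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        }
        # Sort by descending score, then alphabetically so ties are stable.
        ordered = sorted(scores, key=lambda t: (-scores[t], t))
        ranked.append(ordered[:top_k])
    return ranked


if __name__ == "__main__":
    pages = ["apple banana apple", "banana cherry", "banana date"]
    print(tfidf_key_terms(pages, top_k=1))
```

A term that occurs in every document gets an IDF of zero, which is why running such a ranking over only the narrative text of a page, rather than all of its plain text, can change which terms surface.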

70 citations


Journal ArticleDOI
TL;DR: This work investigates the possibilities of using highly ranked search-result snippets to enrich the representation of text segments and proposes a hierarchical clustering algorithm, which tries to produce a more natural and comprehensive tree hierarchy.
Abstract: It is crucial in many information systems to organize short text segments, such as keywords in documents and queries from users, into a well-formed taxonomy. In this article, we address the problem of taxonomy generation for diverse text segments with a general and practical approach that uses the Web as an additional knowledge source. Unlike long documents, short text segments typically do not contain enough information to extract reliable features. This work investigates the possibilities of using highly ranked search-result snippets to enrich the representation of text segments. A hierarchical clustering algorithm is then designed for creating the hierarchical topic structure of text segments. Text segments with close concepts can be grouped together in a cluster, and relevant clusters linked at the same or near levels. Different from traditional clustering algorithms, which tend to produce cluster hierarchies with a very unnatural shape, the algorithm tries to produce a more natural and comprehensive tree hierarchy. Extensive experiments were conducted on different domains of text segments, including subject terms, people names, paper titles, and natural language questions. The obtained experimental results have shown the potential of the proposed approach, which provides a basis for the in-depth analysis of text segments on a larger scale and is believed able to benefit many information systems.

57 citations


Patent
23 Nov 2005
TL;DR: In this paper, a sample set of messages, representative of messages from the message stream, is analyzed to determine interesting or useful categories; text categorization engines are then trained using the sample set, and text classifiers are published.
Abstract: Systems, methods and software products analyze messages of a message stream based upon human generated concept recognizers. A sample set of messages, representative of messages from the message stream, is analyzed to determine interesting or useful categories. Text categorization engines are then trained using the sample set, and text classifiers are published. These text classifiers are then used to categorize further text messages from the message stream.

51 citations


Patent
22 Dec 2005
TL;DR: In this article, the authors present a method, system and article of manufacture for adjusting a language model within a voice recognition system, based on text received from an external application, where the external application may supply text representing the words of one participant to a text-based conversation.
Abstract: Embodiments of the present invention provide a method, system and article of manufacture for adjusting a language model within a voice recognition system, based on text received from an external application. The external application may supply text representing the words of one participant to a text-based conversation. In such a case, changes may be made to a language model by analyzing the external text received from the external application.

47 citations


Patent
01 Apr 2005
TL;DR: In this paper, a reading machine has been proposed for detecting common text between a pair of individual images, combining the text from the pair of images into a file or data structure if common text is detected, and determining if incomplete text phrases are present in the common text.
Abstract: A reading machine has processing for detecting common text between a pair of individual images. The reading machine combines the text from the pair of images into a file or data structure if common text is detected, and determines if incomplete text phrases are present in the common text. If incomplete text phrases are present, the machine signals a user to move an image input device in a direction to capture more of the text.

47 citations


Patent
08 Nov 2005
TL;DR: In this article, an index for searching spoken documents having speech data and text meta-data is created by obtaining probabilities of occurrence of words and positional information of the words of the speech data.
Abstract: An index for searching spoken documents having speech data and text meta-data is created by obtaining probabilities of occurrence of words and positional information of the words of the speech data and combining it with at least positional information of the words in the text meta-data. A single index can be created because the speech data and the text meta-data are treated the same and considered only different categories.
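The idea of a single index over both sources can be sketched as follows. This is an illustrative reading of the abstract, not the patented method: the posting layout, the category labels `"speech"` and `"meta"`, the probability of 1.0 assigned to meta-data words, and the helper `expected_count` are all assumptions.

```python
from collections import defaultdict


def build_index(documents):
    """Build one inverted index over speech data and text meta-data.

    documents maps doc_id -> {"speech": [(word, position, probability), ...],
                              "meta": [word, ...]}.
    Each posting is (doc_id, category, position, probability).
    """
    index = defaultdict(list)
    for doc_id, doc in documents.items():
        # Recognizer hypotheses carry a probability of occurrence per position.
        for word, position, prob in doc["speech"]:
            index[word].append((doc_id, "speech", position, prob))
        # Assumption: meta-data text is known exactly, so probability is 1.0.
        for position, word in enumerate(doc["meta"]):
            index[word].append((doc_id, "meta", position, 1.0))
    return index


def expected_count(index, word, doc_id):
    """Sum the occurrence probabilities of a word in one document (hypothetical scoring)."""
    return sum(p for d, _, _, p in index.get(word, []) if d == doc_id)


if __name__ == "__main__":
    docs = {
        "d1": {
            "speech": [("hello", 0, 0.9), ("world", 1, 0.5), ("word", 1, 0.4)],
            "meta": ["hello", "talk"],
        }
    }
    idx = build_index(docs)
    print(idx["hello"])
    print(expected_count(idx, "hello", "d1"))
```

Because both kinds of posting share one structure and differ only in the category field, a query can be scored against speech and meta-data in a single lookup.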

38 citations


Journal ArticleDOI
TL;DR: This paper describes a mobile device which tries to give the blind or visually impaired access to text information and presents the overall description of the system from text detection to OCR error correction.
Abstract: This paper describes a mobile device which tries to give the blind or visually impaired access to text information. Three key technologies are required for this system: text detection, optical character recognition, and speech synthesis. Blind users and the mobile environment imply two strong constraints. First, pictures will be taken without control on camera settings and a priori information on text (font or size) and background. The second issue is to link several techniques together with an optimal compromise between computational constraints and recognition efficiency. We will present the overall description of the system from text detection to OCR error correction.

Proceedings ArticleDOI
31 Aug 2005
TL;DR: The association of text and graphics allows us to capture the semantic meaning carried by scientific chart images in a more complete way.
Abstract: This paper presents our recent work that aims at associating the recognition results of textual and graphical information contained in the scientific chart images. Text components are first located in the input image and then recognized using OCR. On the other hand, the graphical objects are segmented and form high level symbols. Both logical and semantic correspondence between text and graphical symbols are identified. The association of text and graphics allows us to capture the semantic meaning carried by scientific chart images in a more complete way. The result of scientific chart image understanding is presented using XML documents.

Patent
Toshihiko Manabe, Hideki Tsutsui, Koji Urata, Mika Fukui, Hiroko Hayama
21 Sep 2005
TL;DR: An information retrieval system including speech recognition means for performing speech recognition on a spoken question to generate first text information, and generation means for modifying the first text information to generate second text information as an interrogative used to search for an answer to the question.
Abstract: An information retrieval system includes speech recognition means for performing speech recognition on a spoken question to generate first text information, generation means for modifying the first text information to generate second text information as an interrogative to search for an answer to the question, and search means for searching for the answer in a document database by using the second text information.

Proceedings ArticleDOI
25 May 2005
TL;DR: A novel scheme for text representation based on fuzzy set theory is proposed and a new algorithm for choosing a term set that characterizes a document in the corpus is given under the view of fuzzy set.
Abstract: Document representation is one of the most important tasks in text processing, especially in text categorization. This task has many applications that include document management, information retrieval, text routing, etc. In this paper, the author proposes a novel scheme for text representation based on fuzzy set theory. A new algorithm for choosing a term set that characterizes a document in the corpus is given under the view of fuzzy set. Experimental results on a text categorization problem using the relevance feedback technique show that the proposed method reduces the number of dimensions and achieves higher performance than other baseline methods. In addition, it produces results that compare favorably to the result achieved with the all vocabulary method.

Journal Article
TL;DR: A prototype system that allows readers to view an electronic text in multiple simultaneous views, providing insight at several different levels of granularity, including a reading view is described, combined with a number of tools for manipulating the text.
Abstract: This paper describes a prototype system that allows readers to view an electronic text in multiple simultaneous views, providing insight at several different levels of granularity, including a reading view. This prospect display is combined with a number of tools for manipulating the text, for example by highlighting sections of interest for a particular task. The result is a powerful approach to working with electronic text for various purposes: sample scenarios are outlined involving directors reading scripts, students studying novels, and second-language learners familiarizing themselves with grammatical constructions.
Introduction
Digital text offers software developers and designers the opportunity to provide readers with a variety of new perceptual experiences and possibilities for action that have simply not been available through printed texts (Bork, 1983). An obvious example is the widespread adoption of digital texts connected by hyperlinks and identified by many theorists as a significant change in the way people are able to interact with the written word (Bolter, 1991; Landow, 1994, etc.). However, many other new affordances of digital text remain to be identified, developed and studied. One of these possible new affordances is the ability to have text or layout features change over time (Chang et al, 1988; Ford et al, 1997). In kinetic text research, traditionally static design elements such as font, size, leading, color and placement can all be used dynamically to achieve layout effects that were previously available only in non-interactive media such as film (Lee et al, 2002). This project extends research in hypertext and kinetic text theory to provide readers with a text document display that combines simultaneous prospect - an overview of the entire text - and detail views, with related tools.
Much as architectural blueprints allow the person reading them to get a sense of an entire building or some key feature, such as the wiring or the ventilation, allowing readers to see an entire text at once (that is, providing text prospect) has perceptual advantages. These advantages, which we will explore in this paper, are not available in cases where the text can only be accessed sequentially. The system also includes related tools that allow the reader to carry out new kinds of actions that would not otherwise be available. From hypertext theory comes the concept of associated text elements, where interaction with one text moves the reader into a related text. However, zooming through prospect views differs from a hypertextual implementation in that there are no predefined links between views. Hypertext is also predicated on the concept of connecting lexia or individual documents, so that following a link has the effect of visually replacing the source text with the destination text. In this project the text is treated as a stable whole and presented so as to minimize interruptions to the reader's literary engagement with the text (Miall, 1999). Kinetic text theory contributes the notion of a system where text characteristics change as a way of responding to reader interests. In this case, the reader has the ability to identify the portion of the whole text that will display in the reading view. There is also the capacity to highlight specific passages in the entire text, by selecting the features from a set of choices that derive from the tagging available in the document. 
Finally, in cases where this system has been integrated with related digital reading tools, additional kinetic features may be possible, as in the Watching the Script prototype (Ruecker et al., 2004), where the reader views the script by watching it scroll at various character positions on stage.
[Figure 1]
The Multi-level Document Visualization Prototype
In the Multi-level Document Visualization prototype that we have developed, the prospect view indexes a fisheye reading view, where a segment of text of about a dozen lines is shown at full size, while adjacent text is displayed as increasingly smaller lines of microtext (Small, 1996; Furnas, 1986; Bederson, 2000). …


Proceedings ArticleDOI
19 Sep 2005
TL;DR: An approach to extract global evidences from documents for improved named entity recognition and an unsupervised, generalized classification approach that collects training data from the Web automatically and classifies text patterns into more refined categories are presented.
Abstract: It has been shown that annotating prominent text patterns contained in documents with appropriate types may benefit many applications. Most conventional tools for automatic text annotation extract named entities from texts and annotate them with information about persons, locations, dates and so on. However, this kind of entity type information is often short in length and is mostly limited to a small set of broader categories. In this paper, we try to remedy this problem by presenting an approach to extract global evidences from documents for improved named entity recognition. We also propose an unsupervised, generalized classification approach that collects training data from the Web automatically and classifies text patterns into more refined categories. Experimental results show the feasibility of the proposed approaches for search on the data of the NTCIR-2 information retrieval task.

Proceedings ArticleDOI
31 Aug 2005
TL;DR: This paper presents a method for determining the up/down orientation of text in a scanned document of unknown orientation, and analyzes the "open" portions of text blobs to determine the direction in which the open portions face.
Abstract: This paper presents a method for determining the up/down orientation of text in a scanned document of unknown orientation. The method analyzes the "open" portions of text blobs to determine the direction in which the open portions face. By determining the respective densities of blobs opening in a pair of opposite directions (e.g., right or left), the method can establish the direction in which the text as a whole is oriented. We first discuss the orientation of Roman text based on the asymmetry in the openness of Roman letters in the horizontal direction. For non-Roman text such as Pashto and Hebrew, we determine a direction that is the most asymmetric, and therefore the most useful for orientation, given a training dataset. This direction is then used for orientation. This work can be used for automated orientation of mail, checks in ATM envelopes, and scanned, copied, or faxed documents.

Proceedings ArticleDOI
28 Nov 2005
TL;DR: This work proposed an automatic text categorization approach that uses profiles for categorization of text documents and concluded that the increase in distance between profiles improves the classifier performance.
Abstract: With the overwhelming growth of information technology, better organization of documents is required for easy access to information. Hence the need for text categorization becomes critical. Many researchers have turned their attention towards text categorization. Text categorization is the automated assignment of predefined categories to text documents based on document contents. We proposed an automatic text categorization approach that uses profiles for categorization of text documents. A new similarity method has been used for measuring similarity between profiles and documents. We have explored different ways of profile creation and concluded that the increase in distance between profiles improves the classifier performance. Our classifier shows improved performance when compared with similar kinds of text categorization methods.
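Profile-based categorization can be sketched roughly as below. The abstract does not specify how profiles are built or what the new similarity method is, so this sketch substitutes summed term counts for the profiles and plain cosine similarity for the measure; the names `build_profiles`, `cosine`, and `classify` are hypothetical.

```python
import math
from collections import Counter


def build_profiles(labeled_docs):
    """Build one term-count profile per category from (label, text) pairs."""
    profiles = {}
    for label, text in labeled_docs:
        profiles.setdefault(label, Counter()).update(text.lower().split())
    return profiles


def cosine(a, b):
    """Cosine similarity between two term-count vectors (stand-in measure)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def classify(text, profiles):
    """Assign the category whose profile is most similar to the document."""
    doc = Counter(text.lower().split())
    # Iterate labels in sorted order so ties resolve deterministically.
    return max(sorted(profiles), key=lambda label: cosine(doc, profiles[label]))


if __name__ == "__main__":
    training = [
        ("sports", "goal match team"),
        ("sports", "team score goal"),
        ("finance", "stock market price"),
        ("finance", "market bank price"),
    ]
    profiles = build_profiles(training)
    print(classify("the team won the match", profiles))
```

In this toy setup the two profiles share no terms, which is the extreme case of the well-separated profiles that the abstract associates with better classifier performance.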

Proceedings Article
01 Jan 2005
TL;DR: An information extraction system and experiment is described that demonstrates accurate tuple extraction in a selected domain and helps to bridge the gap between free text and knowledge-based applications.
Abstract: Information extraction tools provide an important means for distilling content from free text documents, and knowledge-based tools provide an important means for automatically reasoning over statements expressed as well-formed tuples. A number of techniques deliver reliable extraction of entities, less reliable extraction of relations, and poor extraction on entity-entity-relation tuples. However, tuple extraction is needed to bridge the gap between free text and knowledge-based applications. We describe an information extraction system and experiment that demonstrates accurate tuple extraction in a selected domain.

Proceedings ArticleDOI
19 May 2005
TL;DR: This paper describes what a field-associated term is and how to discover field-associated terms, which exist in any text; such a word is called a field association (FA) word and can be directly related to the field classification.
Abstract: Although there is much research on text classification based on vector spaces using word information in the whole text, humans can generally recognize the field by finding specific words. This paper describes what a field-associated term is and how to discover field-associated terms, which exist in any text. In this paper, such a word is called a field association (FA) word, which can be directly related to the field classification. Five criteria of FA terms are defined for hierarchical fields. All of them are stored in a field tree for use in extracting field-coherent passages for document classification. The presented approach is evaluated by simulation results on 140 fields text files of sports field and extended by 197 text field of civil engineering.

Book ChapterDOI
TL;DR: Some basic methods for text summarization and text visualization are presented and an example on real-world data describing the research projects in information technology supported by the European Commission is given.
Abstract: Both text summarization and visualization aim at providing some sort of general view of the text, either giving a text summary in the required natural language or giving some visual representation of the text. In both cases the text can be either a single document or a set of documents written in some natural language(s). Here we present some basic methods for text summarization and text visualization and give an example on real-world data describing the research projects in information technology supported by the European Commission.

Book ChapterDOI
07 Jun 2005
TL;DR: Two different types of statistical framework for phrase recognition-classification are considered, based on finite-state models, and experimental results are reported which, given the extreme difficulty of the task, are encouraging.
Abstract: Finite-state models are used to implement a handwritten text recognition and classification system for a real application entailing casual, spontaneous writing with large vocabulary. Handwritten short phrases, which involve a wide variety of writing styles and contain many non-textual artifacts, are to be classified into a small number of predefined classes. To this end, two different types of statistical framework for phrase recognition-classification are considered, based on finite-state models. HMMs are used for the text recognition process. Depending on the considered architecture, N-grams are used for performing text recognition and then text classification (serial approach) or for performing both simultaneously (integrated approach). The multinomial text classifier is also employed in the classification phase of the serial approach. Experimental results are reported which, given the extreme difficulty of the task, are encouraging.
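The classification phase of the serial approach can be illustrated with a small sketch. It assumes that the multinomial text classifier mentioned above is the usual multinomial naive Bayes model with add-one smoothing, which the abstract does not confirm; the function names and the toy phrases are hypothetical, and the input is taken to be already-recognized text.

```python
import math
from collections import Counter, defaultdict


def train(labeled_docs):
    """Collect class and per-class word counts from (label, text) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for label, text in labeled_docs:
        tokens = text.lower().split()
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab


def predict(text, model):
    """Return the class with the highest smoothed log-probability."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_counts):
        total_words = sum(word_counts[label].values())
        score = math.log(class_counts[label] / total_docs)
        for token in text.lower().split():
            if token not in vocab:
                continue  # Words unseen in training are ignored.
            # Add-one (Laplace) smoothing over the vocabulary.
            score += math.log(
                (word_counts[label][token] + 1) / (total_words + len(vocab))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


if __name__ == "__main__":
    phrases = [
        ("address", "change my address please"),
        ("billing", "invoice amount is wrong"),
        ("address", "new address for delivery"),
        ("billing", "wrong invoice sent"),
    ]
    model = train(phrases)
    print(predict("my invoice is wrong", model))
```

In a serial pipeline, recognition errors from the earlier stage reach this classifier as wrong or unknown words, which is one motivation for the integrated alternative described in the abstract.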

Book ChapterDOI
01 Jan 2005
TL;DR: The uses of text and its representation on Web documents are examined in terms of the challenges in its interpretation, with particular attention paid to the significant problem of non-uniform representation of text.
Abstract: This paper examines the uses of text and its representation on Web documents in terms of the challenges in its interpretation. Particular attention is paid to the significant problem of non-uniform representation of text. This non-uniformity is mainly due to the presence of semantically important text in image form as opposed to the standard encoded text. The issues surrounding text representation in Web documents are discussed in the context of colour perception and spatial representation. The characteristics of the representation of text in image form are examined and research towards interpreting these images of text is briefly described.