Showing papers on "Noisy text analytics published in 2014"


Proceedings ArticleDOI
06 Jan 2014
TL;DR: This work developed a prototypical system called the Word Cloud Explorer that relies entirely on word clouds as a visualization method, shows how this approach can be used effectively to solve text analysis tasks, and evaluates it in a qualitative user study.
Abstract: Word clouds have emerged as a straightforward and visually appealing visualization method for text. They are used in various contexts as a means to provide an overview by distilling text down to those words that appear with highest frequency. Typically, this is done in a static way as pure text summarization. We think, however, that there is a larger potential to this simple yet powerful visualization paradigm in text analytics. In this work, we explore the usefulness of word clouds for general text analysis tasks. We developed a prototypical system called the Word Cloud Explorer that relies entirely on word clouds as a visualization method. It equips them with advanced natural language processing, sophisticated interaction techniques, and context information. We show how this approach can be effectively used to solve text analysis tasks and evaluate it in a qualitative user study.
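
The distillation step behind every word cloud is a frequency tally over content words. A minimal sketch of that step (the function name and stopword list are illustrative, not taken from the Word Cloud Explorer itself):

```python
from collections import Counter
import re

# Illustrative stopword list; a real system would use a much fuller one.
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it", "that", "so"}

def top_words(text, k=25):
    """Return the k most frequent content words -- the raw material of a word cloud."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(t for t in tokens if t not in STOPWORDS and len(t) > 2)
    return counts.most_common(k)

print(top_words("Word clouds distill text down to the words that appear most often, "
                "so frequent words dominate the cloud.", k=5))
```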

318 citations


Proceedings ArticleDOI
23 Jun 2014
TL;DR: This paper proposes a novel multi-scale representation for scene text recognition that consists of a set of detectable primitives, termed strokelets, which capture the essential substructures of characters at different granularities.
Abstract: Driven by the wide range of applications, scene text detection and recognition have become active research topics in computer vision. Though extensively studied, localizing and reading text in uncontrolled environments remain extremely challenging, due to various interference factors. In this paper, we propose a novel multi-scale representation for scene text recognition. This representation consists of a set of detectable primitives, termed strokelets, which capture the essential substructures of characters at different granularities. Strokelets possess four distinctive advantages: (1) Usability: automatically learned from bounding box labels; (2) Robustness: insensitive to interference factors; (3) Generality: applicable to different languages; and (4) Expressivity: effective at describing characters. Extensive experiments on standard benchmarks verify the advantages of strokelets and demonstrate the effectiveness of the proposed algorithm for text recognition.

303 citations


Journal ArticleDOI
TL;DR: This paper discusses the characteristics of short text and the difficulty of short text classification, and introduces existing popular work on short text classifiers and models, including short text classification using semantic analysis, semi-supervised short text classification, ensemble short text classification, and real-time classification.
Abstract: With the recent explosive growth of e-commerce and online communication, a new genre of text, short text, has been extensively applied in many areas, and much research consequently focuses on short text mining. Classifying short text is challenging owing to its natural characteristics, such as sparseness, large scale, immediacy, and non-standardization. Traditional methods struggle with short text classification mainly because the limited number of words in a short text cannot adequately represent the feature space or the relationships between words and documents. Several studies and reviews on text classification have appeared in recent years; however, only a few focus on short text classification. This paper discusses the characteristics of short text and the difficulty of short text classification. We then introduce existing popular work on short text classifiers and models, including short text classification using semantic analysis, semi-supervised short text classification, ensemble short text classification, and real-time classification. The evaluation of short text classification is also analyzed. Finally, we summarize existing classification technology and discuss development trends in short text classification.
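
The sparseness the abstract describes is easy to make concrete: in a bag-of-words space, a short text activates only a handful of dimensions, so most document pairs share no features at all. A minimal sketch with an invented four-message corpus (requires scikit-learn):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented micro-corpus of short texts.
docs = [
    "cheap phone deals today",
    "football match results",
    "new phone model released",
    "election results announced",
]
X = TfidfVectorizer().fit_transform(docs)
print(X.shape)             # (4, vocabulary size)
print(X.nnz / X.shape[0])  # average non-zero features per short text
print((X @ X.T).toarray().round(2))  # pairwise cosine similarities: mostly zero off-diagonal
```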

134 citations


Book ChapterDOI
06 Apr 2014
TL;DR: A ConceptNet-based semantic parser is proposed that deconstructs natural language text into concepts based on the dependency relations between clauses; it is domain-independent and able to extract concepts from heterogeneous text.
Abstract: Concept-level text analysis is superior to word-level analysis as it preserves the semantics associated with multi-word expressions. It offers a better understanding of text and helps to significantly increase the accuracy of many text mining tasks. Concept extraction from text is a key step in concept-level text analysis. In this paper, we propose a ConceptNet-based semantic parser that deconstructs natural language text into concepts based on the dependency relation between clauses. Our approach is domain-independent and is able to extract concepts from heterogeneous text. Through this parsing technique, 92.21% accuracy was obtained on a dataset of 3,204 concepts. We also show experimental results on three different text analysis tasks, on which the proposed framework outperformed state-of-the-art parsing techniques.

68 citations


Proceedings ArticleDOI
07 Apr 2014
TL;DR: This paper advocates the thesis that the quality of a summary obtained with combinations of sentence scoring methods depends on the text subject, evaluates this hypothesis, and identifies which techniques are more effective in each of the contexts studied.
Abstract: Text summarization is the process of creating a shorter version of one or more text documents. Automatic text summarization has become an important way of finding relevant information in large text libraries or on the Internet. Extractive text summarization techniques select entire sentences from documents according to some criteria to form a summary. Sentence scoring is currently the most widely used technique for extractive text summarization. Depending on the context, however, some techniques may yield better results than others. This paper advocates the thesis that the quality of the summary obtained with combinations of sentence scoring methods depends on the text subject. This hypothesis is evaluated using three different contexts: news, blogs, and articles. The results support the hypothesis and indicate which techniques are more effective in each of the contexts studied.
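
A hedged sketch of the kind of sentence-scoring pipeline the paper evaluates, using plain word-frequency scoring as the criterion (one simple method among the many scoring techniques the paper combines, not the authors' exact setup):

```python
import re
from collections import Counter

def summarize(text, n=2):
    """Extractive summary: score each sentence by the frequency of its words
    in the whole text, then keep the n best sentences in their original order."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = Counter(re.findall(r"[a-z']+", text.lower()))
    scored = [(sum(freq[w] for w in re.findall(r"[a-z']+", s.lower())), i, s)
              for i, s in enumerate(sentences)]
    best = sorted(sorted(scored, reverse=True)[:n], key=lambda t: t[1])
    return " ".join(s for _, _, s in best)

doc = ("Text summarization shortens documents. Summarization helps readers "
       "find relevant documents quickly. The weather was pleasant yesterday.")
print(summarize(doc, n=2))  # keeps the two on-topic sentences, drops the outlier
```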

52 citations


01 Jan 2014
TL;DR: This paper discusses the general idea of text mining, compares its techniques, and briefly reviews a number of text mining applications in present and future use.
Abstract: Text mining has become an exciting research field as it tries to discover valuable information from unstructured texts. Unstructured texts contain a vast amount of information but cannot be used directly for further processing by computers. Therefore, precise processing methods, algorithms, and techniques are vital for extracting this valuable information, which is accomplished through text mining. In this paper, we discuss the general idea of text mining and compare its techniques. In addition, we briefly discuss a number of text mining applications in present and future use.

43 citations


Journal ArticleDOI
TL;DR: A useful text-to-speech synthesizer is developed in the form of a simple application that converts input text into synthesized speech, reads it out to the user, and can save the output as an .mp3 file.
Abstract: A text-to-speech synthesizer is an application that converts text into spoken words by analyzing and processing the text using Natural Language Processing (NLP) and then using Digital Signal Processing (DSP) technology to convert the processed text into a synthesized speech representation. Here, we developed a useful text-to-speech synthesizer in the form of a simple application that converts input text into synthesized speech, reads it out to the user, and can save the output as an .mp3 file. The development of a text-to-speech synthesizer will be of great help to people with visual impairment and will make working through large volumes of text easier.

40 citations


01 Jan 2014
TL;DR: Different methods of text categorization and cluster analysis for text documents are discussed, and a new text mining technique is proposed for future implementation.
Abstract: Text mining is a technique for finding meaningful patterns in available text documents. Pattern discovery from text and the organization of documents are well-known problems in data mining, and analyzing text content and categorizing documents is a complex data mining task. In the search for efficient and effective techniques for text categorization, various categorization and classification techniques have recently been developed, some arranging documents in a supervised and some in an unsupervised manner. This paper discusses different methods of text categorization and cluster analysis for text documents. In addition, a new text mining technique is proposed for future implementation.

39 citations


Proceedings ArticleDOI
19 May 2014
TL;DR: This paper studies the effect OCR errors have on named entity recognition (NER) and on text mining more broadly, introduces a simple method for estimating the quality of OCRed text, and examines how well human raters can evaluate it.
Abstract: The focus of this paper is on the quality of historical text digitised through optical character recognition (OCR) and how it affects text mining. We study the effect OCR errors have on named entity recognition (NER) and show that in a random sample of documents picked from several historical text collections, 30.6% of false negative commodity and location mentions and 13.3% of all manually annotated commodity and location mentions contain OCR errors. We introduce a simple method for estimating text quality of OCRed text and examine how well human raters can evaluate it. We also illustrate how automatic text quality estimation compares to manual rating with the aim of determining a quality threshold below which documents could potentially be discarded or would require extensive correction first. This work was conducted during the Trading Consequences project which focussed on text mining and visualisation of historical documents for the study of nineteenth century trade.
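
The abstract does not spell out the quality measure, but a plausible stand-in for estimating OCR quality is the fraction of tokens found in a lexicon, a common proxy; the wordlist below is a placeholder assumption:

```python
import re

# Placeholder lexicon; in practice this would be a large dictionary of the
# target language and period.
LEXICON = {"the", "cotton", "was", "shipped", "from", "bombay", "to", "london"}

def ocr_quality(text):
    """Estimate OCR quality as the share of alphabetic tokens found in a lexicon."""
    tokens = re.findall(r"[a-zA-Z]+", text)
    if not tokens:
        return 0.0
    return sum(t.lower() in LEXICON for t in tokens) / len(tokens)

print(ocr_quality("Tlie cotton was sh1pped from Bombay to London"))  # ~0.67
```

Documents scoring below a chosen threshold would then be flagged for discarding or extensive correction, as the paper proposes.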

35 citations


Journal ArticleDOI
TL;DR: The authors offer insight into the process of semantic tagging, the capabilities and specificities of today's semantic taggers, and also indicate some of the criteria to be considered when choosing a tagger.
Abstract: Motivated by a continually increasing demand for applications that depend on machine comprehension of text-based content, researchers in both academia and industry have developed innovative solutions for automated information extraction from text. In this article, the authors focus on a subset of such tools--semantic taggers--that not only extract and disambiguate entities mentioned in the text but also identify topics that unambiguously describe the text's main themes. The authors offer insight into the process of semantic tagging, the capabilities and specificities of today's semantic taggers, and also indicate some of the criteria to be considered when choosing a tagger.

31 citations


Journal ArticleDOI
TL;DR: Two character-level methods for the abbreviation modeling aspect of the noisy channel model are introduced: a statistical classifier using language-based features to decide whether a character is likely to be removed from a word, and a character-level machine translation model.
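
A minimal sketch of the per-character deletion decision the TL;DR describes: simple language-based features per character, with a hand-set rule standing in for the trained statistical classifier (the rule and feature set are illustrative only):

```python
VOWELS = set("aeiou")

def char_features(word, i):
    """Simple language-based features for deciding whether word[i] may be dropped."""
    return {
        "is_vowel": word[i] in VOWELS,
        "is_first": i == 0,
        "is_last": i == len(word) - 1,
    }

def likely_deleted(word, i):
    # Stand-in for a trained classifier: word-internal vowels are the
    # characters most often dropped in texting-style abbreviations.
    f = char_features(word, i)
    return f["is_vowel"] and not f["is_first"] and not f["is_last"]

word = "tomorrow"
print("".join(c for i, c in enumerate(word) if not likely_deleted(word, i)))  # tmrrw
```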

Proceedings ArticleDOI
01 Dec 2014
TL;DR: This paper discusses the text classification process, classifiers, and numerous feature extraction methodologies, all in the context of short texts, i.e., news classification based on headlines.
Abstract: Text mining has been gaining significant importance over the last few years, since knowledge now reaches users through a variety of sources: electronic media, digital media, print media, and many more. Given the huge availability of text in numerous forms, researchers have accumulated a great deal of unstructured data and have devised numerous ways in the literature to convert this scattered text into a defined structure, a task commonly known as text classification. The focus on full-text classification, i.e., full news stories, large documents, long texts, etc., is more prominent than that on short text. In this paper, we discuss the text classification process, classifiers, and numerous feature extraction methodologies, all in the context of short texts, i.e., news classification based on headlines. Existing classifiers and their working methodologies are compared and the results are presented.

Proceedings ArticleDOI
07 Apr 2014
TL;DR: A combined system for text localization and transcription in page images is presented that includes flexible learning-based methods for layout analysis and handwriting recognition, which were developed in the context of the Swiss research project HisDoc.
Abstract: Automated reading of historical handwriting is needed to search and browse ancient manuscripts in digital libraries based on their textual content. In this paper, we present a combined system for text localization and transcription in page images. It includes flexible learning-based methods for layout analysis and handwriting recognition, which were developed in the context of the Swiss research project HisDoc. A comprehensive experimental evaluation is provided for the medieval Parzival database, demonstrating a promising word recognition accuracy of 93.0% with closed vocabulary. In order to harmonize the evaluation of the two document analysis tasks, we introduce a novel evaluation measure for text line extraction that takes substitution, deletion, as well as insertion errors into account.
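
The novel line-extraction measure is not given in the abstract, but counting substitutions, deletions, and insertions is the classic edit-distance recipe; a minimal sketch of an error rate in that style, over extracted versus ground-truth line labels (purely illustrative, not the authors' exact measure):

```python
def edit_ops(ref, hyp):
    """Minimum number of substitutions, deletions, and insertions needed to
    turn ref into hyp (Levenshtein distance via dynamic programming)."""
    m, n = len(ref), len(hyp)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution or match
    return d[m][n]

ref_lines = ["line1", "line2", "line3"]
hyp_lines = ["line1", "lineX", "line3", "extra"]
print(edit_ops(ref_lines, hyp_lines) / len(ref_lines))  # (1 sub + 1 ins) / 3 lines
```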

Proceedings ArticleDOI
01 Dec 2014
TL;DR: A novel approach is proposed to recognize text in natural scenes with complex backgrounds, form words from the recognized text, check spelling, translate words into a user-defined language, and finally overlay the translated words onto the image.
Abstract: In recent years, the availability of economical image-capturing devices in low-cost products like mobile phones has drawn significant attention from researchers to the problem of recognizing text in images. Recognizing scene text is challenging compared to recognizing printed documents. In this work, a novel approach is proposed to recognize text in natural scenes with complex backgrounds, form words from the recognized text, check spelling, translate words into a user-defined language, and finally overlay the translated words onto the image. The proposed approach is robust to different kinds of text appearance, including font size, font style, color, and background. Combining the respective strengths of different complementary techniques and overcoming their shortcomings, the proposed method uses an efficient character detection and localization technique and a multiclass classifier to recognize text accurately. The approach successfully recognizes text in natural scene images, does not depend on a particular alphabet or text background, works with a wide variety of character sizes, and can handle up to 20 degrees of skew efficiently.

Proceedings Article
01 Jan 2014
TL;DR: The authors present various visualizations of the text re-use found between texts of a collection, supporting humanists in answering a broad palette of research questions, such as how different text editions compare to each other.
Abstract: In this paper, we present various visualizations for the Text Re-use found between texts of a collection to support humanists in answering a broad palette of research questions. When juxtaposing all texts of a corpus in the form of tuples, we propose the Text Re-use Grid as a distant reading method that emphasizes text tuples with systematic or repetitive Text Re-use. In contrast, the Text Re-use Browser allows for close reading of the Text Re-use between the two texts of a tuple. Additionally, we present Sentence Alignment Flows to improve the readability for Text Variant Graphs on sentence level that are used to compare various text editions to each other. Finally, we portray findings of the humanists of our project using the proposed visualizations.

Journal ArticleDOI
TL;DR: This paper describes the classification of heterogeneous text data into two classes, in-domain and out-of-domain, mainly for language modeling in task-oriented speech recognition in the judicial domain, and shows a significant improvement in model perplexity and in the performance of the Slovak transcription and dictation system.
Abstract: The robustness of n-gram language models depends on the quality of the text data on which they have been trained. Text corpora collected from various resources such as web pages or electronic documents cover many possible topics. In order to build efficient and robust domain-specific language models, it is necessary to separate domain-oriented segments from the large amount of text data; the remaining out-of-domain data can be used only for updating existing in-domain n-gram probability estimates. In this paper, we describe the process of classifying heterogeneous text data into two classes, in-domain and out-of-domain, mainly for language modeling in task-oriented speech recognition in the judicial domain. The proposed text classification algorithm detects the theme of short text segments based on their most frequent key phrases. Each text segment is then represented in a vector space model as a feature vector with term weighting. To classify these segments as in-domain or out-of-domain, document similarity with automatic thresholding is used. Experimental results on modeling the Slovak language and adapting to the judicial domain show a significant improvement in model perplexity and increased performance of the Slovak transcription and dictation system.
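
A hedged sketch of the final classification step the abstract describes: represent a segment as a term-weighted vector and accept it as in-domain when its cosine similarity to a domain reference vector clears a threshold (the reference vector and threshold here are invented; the paper derives its threshold automatically):

```python
import math
from collections import Counter

def cosine(a, b):
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Invented in-domain (judicial) reference vector and threshold.
DOMAIN_VECTOR = Counter({"court": 3, "verdict": 2, "appeal": 2, "law": 1})
THRESHOLD = 0.2

def is_in_domain(segment):
    return cosine(Counter(segment.lower().split()), DOMAIN_VECTOR) >= THRESHOLD

print(is_in_domain("the court rejected the appeal"))    # True
print(is_in_domain("football results from yesterday"))  # False
```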

Proceedings ArticleDOI
16 Jun 2014
TL;DR: This work proposes a weighted-graph representation of text, called Text Graphs, which captures the grammatical and semantic relations between words and terms in the text.
Abstract: The Web has made possible many advanced text-mining applications, such as news summarization, essay grading, question answering, and semantic search. For many such applications, statistical text-mining techniques are ineffective since they do not utilize the morphological structure of the text. Thus, many approaches use NLP-based techniques that parse the text and use patterns to mine and analyze the parse trees, which are often unnecessarily complex. Therefore, we propose a weighted-graph representation of text, called Text Graphs, which captures the grammatical and semantic relations between words and terms in the text. Text Graphs are generated using a new text mining framework, which is the main focus of this paper. Our framework, SemScape, uses a statistical parser to generate a few of the most probable parse trees for each sentence and employs a novel two-step pattern-based technique to extract candidate terms and their grammatical relations from the parse trees. Moreover, SemScape resolves co-references by a novel technique, generates domain-specific Text Graphs by consulting ontologies, and provides a SPARQL-like query language and an optimized engine for semantically querying and mining Text Graphs.
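
SemScape derives its graphs from parse trees, which the abstract only outlines; the much simpler stand-in below builds a weighted co-occurrence graph over nearby words, just to make the weighted-graph representation concrete (not the authors' method):

```python
import re
from collections import defaultdict

def text_graph(text, window=2):
    """Weighted word graph: edge weight = co-occurrence count within a window.
    A crude stand-in for the parse-tree-derived relations SemScape extracts."""
    tokens = re.findall(r"[a-z']+", text.lower())
    edges = defaultdict(int)
    for i, w in enumerate(tokens):
        for j in range(i + 1, min(i + 1 + window, len(tokens))):
            if w != tokens[j]:
                edges[tuple(sorted((w, tokens[j])))] += 1
    return dict(edges)

print(text_graph("the cat chased the mouse and the mouse escaped"))
```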

Ming-Wei Chang1, Bo-June (Paul) Hsu1, Hao Ma1, Ricky Loynd1, Kuansan Wang1 
01 Jan 2014
TL;DR: E2E, an end-to-end entity linking system designed for the short and noisy text found in microblogs and text messages, processes such text robustly by jointly optimizing entity recognition and disambiguation as a single task.
Abstract: We present E2E, an end-to-end entity linking system that is designed for short and noisy text found in microblogs and text messages. Mining and extracting entities from short text is an essential step for many content analysis applications. By jointly optimizing entity recognition and disambiguation as a single task, our system can process short and noisy text robustly.

Patent
08 Aug 2014
TL;DR: The unstructured text of a post or message generated by a user on a social-networking system is analyzed to determine whether it includes a request for a recommendation, using a machine-learning model that compares the text to one or more predetermined words associated with such requests.
Abstract: In one embodiment, a method includes receiving unstructured text from a user of a social-networking system, determining whether the unstructured text includes a request for a recommendation, identifying one or more first entity names in the unstructured text, generating a structured query based upon the one or more first entity names, identifying, in the social graph, one or more second entity names corresponding to the structured query, and presenting the one or more second entity names and the unstructured text in a social context of the user. The unstructured text may include text of a post or message generated by the user on a social-networking system. A score may be generated based on the unstructured text to determine whether the text includes a request for recommendation using a machine-learning model based on comparison of the unstructured text to the one or more predetermined words associated with requests for recommendation.
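
A hedged sketch of the scoring idea in the last claim: score the text against predetermined trigger words and flag it as a recommendation request above a cutoff. The word list, weights, and cutoff below are all invented, and a simple weighted sum stands in for the patent's machine-learning model:

```python
# Invented trigger words; the patent compares text against "predetermined
# words associated with requests for recommendation".
TRIGGERS = {"recommend": 2.0, "suggest": 2.0, "best": 1.0, "good": 1.0, "any": 0.5}
CUTOFF = 2.5  # invented decision threshold

def recommendation_score(text):
    return sum(TRIGGERS.get(w, 0.0) for w in text.lower().split())

def is_recommendation_request(text):
    return recommendation_score(text) >= CUTOFF

print(is_recommendation_request("can anyone recommend a good dentist in austin"))  # True
print(is_recommendation_request("had a great weekend at the lake"))                # False
```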

Journal Article
TL;DR: Some pre-processing techniques for Gujarati are introduced in this paper; because Gujarati is very rich in morphology, it gives rise to a very large number of word forms and a large feature space.
Abstract: Text mining is the process of obtaining interesting patterns or knowledge from text documents. Text is the most commonly used type of data on the WWW, and text mining is used to extract interesting knowledge from unstructured text data. Pre-processing is a very important phase in the text mining process. A text mining framework includes two components: text refining and knowledge distillation. This paper is about pre-processing for text mining in English and Gujarati. Very little work has been done on text mining in Gujarati. It is a very challenging task, as Gujarati is very rich in morphology, giving rise to a very large number of word forms and a large feature space. Some pre-processing techniques for Gujarati are introduced in this paper.

Journal ArticleDOI
TL;DR: The techniques for text mining are described: information extraction, information retrieval, query processing, natural language processing, categorization, and clustering.
Abstract: Text mining is a technology that can work with unstructured or semi-structured data. It can be used to find meaningful information in natural language text, using existing data in corporate databases, by making unstructured text data available for analysis. There are many techniques for text mining. In this paper we describe the following techniques: information extraction, information retrieval, query processing, natural language processing, categorization, and clustering.

Journal ArticleDOI
TL;DR: As the content of a full text page usually focuses on a specific topic, a topic language model adaptation method is proposed to improve the recognition performance on homologous offline handwritten Chinese text images, and a restricted variant is presented to obtain a tradeoff between recognition performance and computational complexity.
Abstract: As the content of a full text page usually focuses on a specific topic, a topic language model adaptation method is proposed to improve the recognition performance on homologous offline handwritten Chinese text images. First, the text images are recognized with a character-based bi-gram language model. Second, the topic of the text image is matched adaptively. Finally, the text image is recognized again with the best matched topic language model. To obtain a tradeoff between recognition performance and computational complexity, a restricted topic language model adaptation method is further presented. The methods have been evaluated on 100 offline Chinese text images. Compared to the general language model, topic language model adaptation reduced the relative error rate by 11.94%. The restricted topic language model reduced the running time by 19.22% at the cost of 0.35% of the accuracy.
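
A minimal sketch of the adaptive matching step: score the first-pass recognition result under each topic bigram model and re-decode with the best one. The toy counts and vocabulary size below are invented, and add-one smoothing stands in for whatever smoothing the paper uses:

```python
import math
from collections import Counter

def bigram_logprob(tokens, bigrams, unigrams, vocab_size):
    """Add-one-smoothed bigram log-probability of a token sequence."""
    lp = 0.0
    for prev, cur in zip(tokens, tokens[1:]):
        lp += math.log((bigrams.get((prev, cur), 0) + 1) /
                       (unigrams.get(prev, 0) + vocab_size))
    return lp

def best_topic(tokens, topic_models, vocab_size=1000):
    """Pick the topic LM that gives the recognized text the highest
    probability (equivalently, the lowest perplexity)."""
    return max(topic_models,
               key=lambda t: bigram_logprob(tokens, *topic_models[t], vocab_size))

# Toy topic models: (bigram counts, unigram counts) per topic.
models = {
    "sports":  (Counter({("won", "the"): 5, ("the", "match"): 4}),
                Counter({"won": 6, "the": 9})),
    "finance": (Counter({("the", "market"): 5, ("market", "fell"): 3}),
                Counter({"the": 8, "market": 6})),
}
print(best_topic(["won", "the", "match"], models))  # sports
```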

01 Jan 2014
TL;DR: The paper presents an introduction to Optical Character Recognition, surveys major research work and applications in various fields, and stresses the research that has had a great impact on character recognition.
Abstract: Optical character recognition, usually abbreviated to OCR, is the mechanical or electronic conversion of scanned or photographed images of typewritten or printed text into machine-encoded, computer-readable text. It is widely used as a form of data entry from original paper data sources, whether passport documents, invoices, bank statements, receipts, business cards, mail, or any number of printed records. It is a common method of digitizing printed texts so that they can be electronically edited, searched, stored more compactly, displayed online, and used in machine processes such as machine translation, text-to-speech, key data extraction, and text mining. OCR is a field of research in pattern recognition, artificial intelligence, and computer vision, and is widely used to recognize and search text in electronic documents or to publish text on a website [1]. A large number of research papers and reports have already been published on this topic. This paper presents an introduction to OCR, its major research work, and its applications in various fields. First, OCR is introduced; then the major research works that have had a great impact on character recognition are highlighted; finally, the most important applications of OCR are covered, followed by a conclusion.

01 Aug 2014
TL;DR: This paper adds support for the claim that the new self-trained parser improves over the baseline by carrying out a qualitative linguistic analysis of the kinds of differences between the two parsers on non-native text.
Abstract: We apply the well-known parsing technique of self-training to a new type of text: language-learner text. This type of text often contains grammatical and other errors which can cause problems for traditional treebank-based parsers. Evaluation on a small test set of student data shows improvement over the baseline, both when training on native and on non-native text. As the main contribution of this paper, we add support for the claim that the new self-trained parser has improved over the baseline by carrying out a qualitative linguistic analysis of the kinds of differences between the two parsers on non-native text. We show that for a number of linguistically interesting cases, the self-trained parser is able to provide better analyses, despite the sometimes ungrammatical nature of the text.

Patent
Patrick W. Fink1, Philip E. Parker1
18 Mar 2014
TL;DR: In this article, a topic summary application receives the user-selected text and identifies entities in the text using natural language processing, and then generates a summary, presented to the user in a pop-up window, of most frequently correlated related entities along with text phrases that are semantically important.
Abstract: Techniques are disclosed for discovering and presenting topic summaries related to a selection of text in an electronic document. A topic summary application receives the user-selected text and identifies entities in the text using natural language processing. Using natural language processing, the summary application also identifies related entities and associated text phrases in a remaining portion of the electronic document. The remaining portion may be a portion of the document that precedes the user-selected text, so that a summary generated therefrom may be used to refresh the memory of the user while not revealing information that the user has not yet encountered. In addition, the summary application determines semantically important text phrases using text analytics and generates a summary, presented to the user in a pop-up window, of most frequently correlated related entities along with text phrases that are semantically important.

Patent
18 Sep 2014
TL;DR: In this paper, a system, method and computer program is provided for generating customized text representations of audio commands based on a general language grammar, which is used for generating a first text representation of an audio command, the second module including a custom language grammar that may include contacts for a particular user.
Abstract: A system, method and computer program is provided for generating customized text representations of audio commands. A first speech recognition module may be used for generating a first text representation of an audio command based on a general language grammar. A second speech recognition module may be used for generating a second text representation of the audio command, the second module including a custom language grammar that may include contacts for a particular user. Entity extraction is applied to the second text representation and the entities are checked against a file containing personal language. If the entities are found in the user-specific language, the two text representations may be fused into a combined text representation and named entity recognition may be performed again to extract further entities.

Proceedings ArticleDOI
31 May 2014
TL;DR: This paper proposes an approach to classify unstructured data, e.g. development documents, into natural language text and technical information using a mixture of text heuristics and agglomerative hierarchical clustering.
Abstract: Software repository data, for example in issue tracking systems, include natural language text and technical information, which covers anything from log files via code snippets to stack traces. However, data mining is often interested in only one of the two types, e.g. natural language text in the case of text mining. Regardless of which type is being investigated, any techniques used have to deal with noise caused by fragments of the other type, i.e. methods interested in natural language have to deal with technical fragments and vice versa. This paper proposes an approach to classify unstructured data, e.g. development documents, into natural language text and technical information using a mixture of text heuristics and agglomerative hierarchical clustering. The approach was evaluated using 225 manually annotated text passages from developer emails and issue tracker data. Using white space tokenization as a basis, the overall precision of the approach is 0.84 and the recall is 0.85.
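
One of the text heuristics such an approach can combine is easy to sketch: flag a passage as technical when too large a share of its characters is non-alphabetic, as in stack traces or log output. The heuristic and threshold below are illustrative, not the paper's:

```python
def looks_technical(passage, threshold=0.25):
    """Flag a passage as technical when the share of non-alphabetic characters
    (digits, punctuation, path separators) exceeds a threshold."""
    stripped = passage.replace(" ", "")
    if not stripped:
        return False
    non_alpha = sum(not c.isalpha() for c in stripped)
    return non_alpha / len(stripped) > threshold

print(looks_technical("The build fails after I upgrade the plugin."))          # False
print(looks_technical("at org.app.Main.run(Main.java:42) [app-1.3.jar:1.3]"))  # True
```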

Proceedings ArticleDOI
03 Dec 2014
TL;DR: This paper extends the existing text cube model to incorporate TF-IDF (Term Frequency-Inverse Document Frequency) and LM (Language Model) as measurements, and shows through experiments that the proposed text cube outperforms the existing one in both performance and effectiveness.
Abstract: Recently, unstructured data such as texts, documents, and SNS messages has increasingly been used in many applications, rather than structured data consisting of simple numbers or characters. It has thus become more important to analyze unstructured text data to extract valuable information for users' decision making. As with OLAP (On-Line Analytical Processing) over structured data, multi-dimensional analysis is increasingly required for such unstructured data. To meet these analysis requirements, a text cube model over multi-dimensional text databases has been proposed. In this paper, we extend the existing text cube model to incorporate TF-IDF (Term Frequency-Inverse Document Frequency) and LM (Language Model) as measurements. Because the proposed text cube model utilizes measurements that are well established in information retrieval systems, it is more efficient and effective for analyzing text databases. Through experiments, we show that the proposed text cube outperforms the existing one in both performance and effectiveness.
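
As a minimal sketch of TF-IDF as a cell measurement, computed over the documents that aggregate into one text-cube cell (the cell contents and dimensions are invented for illustration):

```python
import math
from collections import Counter

def tf_idf(cell_docs):
    """TF-IDF over the documents aggregated in one text-cube cell: TF is the
    term count within the cell, IDF is log(N / df) across the cell's documents."""
    n = len(cell_docs)
    tokenized = [doc.lower().split() for doc in cell_docs]
    tf = Counter(t for doc in tokenized for t in doc)
    df = Counter(t for doc in tokenized for t in set(doc))
    return {t: tf[t] * math.log(n / df[t]) for t in tf}

# Invented cell: complaint messages for the cube cell (region=EU, month=May).
cell = ["battery drains fast", "battery replaced", "screen cracked"]
print(sorted(tf_idf(cell).items(), key=lambda kv: -kv[1])[:3])
```

Note how "battery", appearing in two of the three documents, is down-weighted by IDF relative to terms unique to a single document.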

Book
20 Dec 2014
TL;DR: In this paper, the authors present a set of articles that specify the methodology of text mining, describe the creation of lexical resources in the framework of textmining and use text mining for various tasks in natural language processing (NLP).
Abstract: This book comprises a set of articles that specify the methodology of text mining, describe the creation of lexical resources in the framework of text mining and use text mining for various tasks in natural language processing (NLP). The analysis of large amounts of textual data is a prerequisite to build lexical resources such as dictionaries and ontologies and also has direct applications in automated text processing in fields such as history, healthcare and mobile applications, just to name a few. This volume gives an update in terms of the recent gains in text mining methods and reflects the most recent achievements with respect to the automatic build-up of large lexical resources. It addresses researchers that already perform text mining, and those who want to enrich their battery of methods. Selected articles can be used to support graduate-level teaching. The book is suitable for all readers that completed undergraduate studies of computational linguistics, quantitative linguistics, computer science and computational humanities. It assumes basic knowledge of computer science and corpus processing as well as of statistics.

Journal Article
TL;DR: Experimental results show that short text reconstruction and concept mapping can improve the effect of clustering, and that the proposed micro-blog user interest model performs better.
Abstract: In this paper, a method for modeling users' interests based on micro-blog short text is presented. To overcome the lack of information in short text, and based on an analysis of the structure and content of micro-blog short text, this paper proposes an approach to micro-blog short text reconstruction: the content is extended according to related texts and the three kinds of special symbols in the text, thereby extending the characteristic information of the original micro-blog. The approach takes advantage of the HowNet2000 concept dictionary to map the feature set of the reconstructed text to a set of concepts. The set of concepts is clustered to divide user interests, and a representation mechanism for the user interest model is presented. Experimental results show that short text reconstruction and concept mapping can improve the effect of clustering. Compared with modeling based on collaborative filtering, the F-measure value is increased by 29.1%. This means the proposed micro-blog user interest model has better performance.