
Showing papers on "Noisy text analytics published in 2017"


Posted Content
TL;DR: Several of the most fundamental text mining tasks and techniques, including text pre-processing, classification and clustering, are described, and text mining in the biomedical and health care domains is briefly explained.
Abstract: The amount of text that is generated every day is increasing dramatically. This tremendous volume of mostly unstructured text cannot be simply processed and perceived by computers. Therefore, efficient and effective techniques and algorithms are required to discover useful patterns. Text mining is the task of extracting meaningful information from text, which has gained significant attention in recent years. In this paper, we describe several of the most fundamental text mining tasks and techniques including text pre-processing, classification and clustering. Additionally, we briefly explain text mining in biomedical and health care domains.

422 citations
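As a concrete illustration of the pre-processing and clustering steps this survey covers, here is a minimal scikit-learn sketch (the example documents, stop-word handling and cluster count are our own illustrative choices, not from the paper): TF-IDF performs the tokenization, lowercasing and stop-word removal, and k-means groups the resulting vectors.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = [
    "Patients with diabetes require regular glucose monitoring.",
    "Glucose levels are tracked for diabetic patients.",
    "The new smartphone camera recognizes printed text.",
    "Scene text recognition works on camera images.",
]

# Pre-processing: tokenization, lowercasing and stop-word removal happen inside TF-IDF.
vectors = TfidfVectorizer(stop_words="english").fit_transform(docs)

# Clustering: group the documents into two topical clusters.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)
print(labels)  # e.g. [0 0 1 1]
```

The same TF-IDF vectors could equally be fed to a classifier for the supervised tasks the survey also describes.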


Posted Content
TL;DR: Experimental results on public benchmark datasets show the ability of the STN-OCR model to handle a variety of different tasks, without substantial changes in its overall network structure.
Abstract: Detecting and recognizing text in natural scene images is a challenging, yet not completely solved task. In recent years several new systems that try to solve at least one of the two sub-tasks (text detection and text recognition) have been proposed. In this paper we present STN-OCR, a step towards semi-supervised neural networks for scene text recognition, that can be optimized end-to-end. In contrast to most existing works that consist of multiple deep neural networks and several pre-processing steps we propose to use a single deep neural network that learns to detect and recognize text from natural images in a semi-supervised way. STN-OCR is a network that integrates and jointly learns a spatial transformer network, that can learn to detect text regions in an image, and a text recognition network that takes the identified text regions and recognizes their textual content. We investigate how our model behaves on a range of different tasks (detection and recognition of characters, and lines of text). Experimental results on public benchmark datasets show the ability of our model to handle a variety of different tasks, without substantial changes in its overall network structure.

70 citations
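The toy PyTorch sketch below illustrates the core mechanism the abstract describes, a localization network predicting affine parameters and a differentiable spatial transformer feeding a recognition network. It is not the authors' STN-OCR architecture; the layer sizes and the single-character classification head are simplifying assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinySTNOCR(nn.Module):
    def __init__(self, num_classes=36):
        super().__init__()
        # Localization network: predicts 6 affine parameters from the image.
        self.loc = nn.Sequential(
            nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(),
            nn.Linear(8 * 4 * 4, 6),
        )
        # Initialize to the identity transform so training starts stably.
        self.loc[-1].weight.data.zero_()
        self.loc[-1].bias.data.copy_(torch.tensor([1, 0, 0, 0, 1, 0], dtype=torch.float))
        # Recognition network: classifies the attended region.
        self.rec = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(),
            nn.Linear(16 * 4 * 4, num_classes),
        )

    def forward(self, x):
        theta = self.loc(x).view(-1, 2, 3)                     # affine parameters per image
        grid = F.affine_grid(theta, x.size(), align_corners=False)
        region = F.grid_sample(x, grid, align_corners=False)   # differentiable "crop" of the text region
        return self.rec(region)

model = TinySTNOCR()
logits = model(torch.randn(2, 1, 32, 100))   # batch of grayscale text images
print(logits.shape)                          # torch.Size([2, 36])
```

Because the grid sampling is differentiable, the detection and recognition parts can be trained jointly end-to-end, which is the central point of the paper.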


Proceedings ArticleDOI
01 Jul 2017
TL;DR: A novel recurrent Dense Text Localization Network (DTLN) is proposed to sequentially decode the intermediate convolutional representations of a cluttered scene image into a set of distinct text instance detections, and a Context Reasoning Text Retrieval model is proposed, which jointly encodes text instances and their context information through a recurrent network, and ranks localized text bounding boxes by a scoring function of context compatibility.
Abstract: Text instance as one category of self-described objects provides valuable information for understanding and describing cluttered scenes. In this paper, we explore the task of unambiguous text localization and retrieval, to accurately localize a specific targeted text instance in a cluttered image given a natural language description that refers to it. To address this issue, first a novel recurrent Dense Text Localization Network (DTLN) is proposed to sequentially decode the intermediate convolutional representations of a cluttered scene image into a set of distinct text instance detections. Our approach avoids repeated detections at multiple scales of the same text instance by recurrently memorizing previous detections, and effectively tackles crowded text instances in close proximity. Second, we propose a Context Reasoning Text Retrieval (CRTR) model, which jointly encodes text instances and their context information through a recurrent network, and ranks localized text bounding boxes by a scoring function of context compatibility. Quantitative evaluations on standard scene text localization benchmarks and a newly collected scene text retrieval dataset demonstrate the effectiveness and advantages of our models for both scene text localization and retrieval.

31 citations


Journal ArticleDOI
TL;DR: This paper attempts to analyze and classify the various text extraction schemes for scene-text and document images, compares the different approaches based on common problems, and discusses their merits and demerits.
Abstract: One of the major applications of text retrieval from images is to extract the text information and then recognize its characters. This is helpful for indexing the images within storage media. When we want to search a particular image or document, there is no need to go through a large bunch of images. We go only through the group of indexed images, so that the task of finding the particular image becomes easy. Extracting text lines from scanned document images presents a major problem in the optical character recognition process, as skewed text lines raise the complexity. The problem gets even worse with text lines of different orientations. Such lines are called multi-skewed lines. These multi-skewed lines are easily observed in both printed and handwritten documents. It is a challenging task to design a real-time system which can maintain a high recognition rate with good accuracy and is independent of the type of documents and character fonts. In this paper, we attempt to analyze and classify the various text extraction schemes for scene-text and document images, compare the different approaches based on common problems, and discuss their merits and demerits.

27 citations


Journal ArticleDOI
TL;DR: A new hybrid ensemble approach is proposed that combines the predictions obtained by the classifiers using the original text samples along with their variations created by applying text normalization and semantic indexing techniques, which can improve the text content quality and enhance the performance of the expert systems for spamming detection.
Highlights: A new classifier is presented to detect undesired short text comments. The proposed approach is light, fast, multinomial and offers incremental learning. The impact of applying text normalization and semantic indexing is studied. The results indicate the proposed techniques outperformed most of the approaches. Text normalization and semantic indexing enhanced the classifiers' performance.
Abstract: The popularity and reach of short text messages commonly used in electronic communication have led spammers to use them to propagate undesired content. This is often composed of misleading information, advertisements, viruses, and malware that can be harmful and annoying to users. The dynamic nature of spam messages demands knowledge-based systems with online learning and, therefore, the most traditional text categorization techniques cannot be used. In this study, we introduce MDLText, a text classifier based on the minimum description length principle, in the context of filtering undesired short text messages. The proposed approach supports incremental learning and, therefore, its predictive model is scalable and can adapt to continuously evolving spamming techniques. It is also fast, with computational cost increasing linearly with the number of samples and features, which is very desirable for expert systems applied to real-time electronic communication. In addition to the dynamic nature of these messages, they are also short and usually poorly written, rife with slang, symbols, and abbreviations that make text representation, learning, and filtering difficult. In this scenario, we also investigated the benefits of using text normalization and semantic indexing techniques. We show that these techniques can improve the text content quality and, consequently, enhance the performance of expert systems for spam detection. Based on these findings, we propose a new hybrid ensemble approach that combines the predictions obtained by the classifiers using the original text samples along with their variations created by applying text normalization and semantic indexing techniques. It has the advantage of being independent of the classification method, and the results indicate it is efficient for filtering undesired short text messages.

23 citations
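A simplified sketch of the hybrid-ensemble idea described above (not the MDLText classifier itself; the toy slang dictionary, the naive Bayes base learners and the example messages are illustrative assumptions): one classifier is trained on the raw messages, another on their normalized variants, and their predicted probabilities are averaged.

```python
import re
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

SLANG = {"u": "you", "2": "to", "gr8": "great", "txt": "text"}   # toy normalization table

def normalize(msg):
    tokens = re.findall(r"\w+", msg.lower())
    return " ".join(SLANG.get(t, t) for t in tokens)

raw = [
    "WIN a FREE prize, txt 80082 now!!",
    "are u coming 2 dinner",
    "gr8, see you soon",
    "claim your free ringtone now",
]
labels = [1, 0, 0, 1]   # 1 = spam, 0 = ham

clf_raw = make_pipeline(CountVectorizer(), MultinomialNB()).fit(raw, labels)
clf_norm = make_pipeline(CountVectorizer(), MultinomialNB()).fit([normalize(m) for m in raw], labels)

msg = "txt now to win a free prize"
# Hybrid ensemble: average the probabilities from the raw-text and normalized-text classifiers.
probs = (clf_raw.predict_proba([msg]) + clf_norm.predict_proba([normalize(msg)])) / 2
print(probs.argmax(axis=1))   # ensemble decision: [1] (spam)
```

Averaging probabilities keeps the ensemble independent of the underlying classification method, which is the property highlighted in the abstract.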


Journal ArticleDOI
TL;DR: This paper combines the Text Proposals algorithm with Fully Convolutional Networks to efficiently reduce the number of proposals while maintaining the same recall level, thus gaining a significant speed-up.

20 citations
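A toy numpy sketch of the filtering step suggested by the TL;DR: each text proposal is scored by the mean text probability inside its box and only the best-scoring boxes are kept. Here the heatmap is random; in the paper it would come from the Fully Convolutional Network, and the actual scoring and suppression rules may differ.

```python
import numpy as np

heatmap = np.random.rand(480, 640)   # stand-in for an FCN text-probability map
proposals = [(10, 20, 110, 60), (300, 200, 420, 240), (50, 400, 90, 430)]  # x1, y1, x2, y2

def score(box):
    x1, y1, x2, y2 = box
    return float(heatmap[y1:y2, x1:x2].mean())   # average text confidence inside the box

kept = sorted(proposals, key=score, reverse=True)[:2]   # keep only the top proposals
print(kept)
```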


Proceedings ArticleDOI
13 Aug 2017
TL;DR: In this paper, a deep memory network is proposed to automatically find relevant information from a collection of longer documents and reformulate the short text through a gating mechanism, which significantly outperforms classical text expansion methods.
Abstract: Effectively making sense of short texts is a critical task for many real world applications such as search engines, social media services, and recommender systems. The task is particularly challenging as a short text contains very sparse information, often too sparse for a machine learning algorithm to pick up useful signals. A common practice for analyzing short text is to first expand it with external information, which is usually harvested from a large collection of longer texts. In literature, short text expansion has been done with all kinds of heuristics. We propose an end-to-end solution that automatically learns how to expand short text to optimize a given learning task. A novel deep memory network is proposed to automatically find relevant information from a collection of longer documents and reformulate the short text through a gating mechanism. Using short text classification as a demonstrating task, we show that the deep memory network significantly outperforms classical text expansion methods with comprehensive experiments on real world data sets.

13 citations
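The sketch below illustrates the attend-and-gate mechanism in a single memory hop. The dimensions, the single-hop setup and the exact gating formula are simplifying assumptions rather than the paper's architecture: the short-text vector attends over a memory of long-document vectors, and a learned gate controls how much of the retrieved information reformulates it.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ShortTextExpander(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)   # decides how much retrieved memory to accept

    def forward(self, short_vec, memory):
        # short_vec: (batch, dim) encoded short texts
        # memory:    (num_docs, dim) encoded long documents
        attn = F.softmax(short_vec @ memory.t(), dim=-1)   # relevance of each long document
        retrieved = attn @ memory                          # weighted summary of relevant documents
        g = torch.sigmoid(self.gate(torch.cat([short_vec, retrieved], dim=-1)))
        return g * short_vec + (1 - g) * retrieved         # gated reformulation of the short text

expander = ShortTextExpander()
expanded = expander(torch.randn(8, 64), torch.randn(100, 64))
print(expanded.shape)   # torch.Size([8, 64])
```

In the paper this reformulated representation is trained end-to-end with the downstream classifier, so the expansion is optimized for the given learning task rather than being a fixed heuristic.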


Journal ArticleDOI
TL;DR: The results suggest that spoken text is a cause of the transient information effect, which can be best avoided by substituting written text for spoken text on tasks that require integration of information.
Abstract: Two experiments involving 125 grade-10 students learning about commerce investigated strategies to overcome the transient information effect caused by explanatory spoken text. The transient information effect occurs when learning is reduced as a result of information disappearing before the learner has time to adequately process it, or link it with new information. Spoken text, unless recorded or repeated in some fashion, is fleeting in nature and can be a major cause of transiency. The three strategies investigated, all theoretically expected to enhance learning, were: (a) replacing lengthy spoken text with written text (Experiments 1 and 2), (b) replacing lengthy continuous text with segmented text (Experiment 1), and (c) adding a diagram to lengthy spoken text (Experiment 2). In both experiments on tasks that required information to be integrated across segments, written text was found to be superior to spoken text. In Experiment 1 the expected advantage of segmented text in reducing transitory effects was not found. Compared with written continuous text the segmented spoken text strategy was inferior. Experiment 2 found that adding a diagram to spoken text was an advantage compared to spoken text alone consistent with a multimedia effect. Overall, the results suggest that spoken text is a cause of the transient information effect, which can be best avoided by substituting written text for spoken text on tasks that require integration of information.

8 citations


Journal Article
TL;DR: This paper focuses on using regular-expression pattern matching to find relevant data in text/word files and to extract patterns from text mining documents.
Abstract: Text mining is also known as knowledge discovery from textual databases; its job is to derive high-level knowledge from text. The process of obtaining useful information from records is also known as text mining. The system uses many data mining approaches to extract patterns from text documents. Using those updated patterns and implementing an algorithm for pattern discovery is still an open research issue. The paper focuses on using regular-expression pattern matching to find relevant data in a text/word file. A text file containing a large amount of free text is used to fetch all the discovered words or characters from the documents. The system helps users to search for the relevant document, and converts the unstructured data into structured form.

6 citations
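An illustrative example of the regular-expression matching the paper relies on (the patterns and sample text are made up for demonstration): e-mail addresses and dates are pulled out of unstructured text so they can be stored in structured form.

```python
import re

text = "Contact alice@example.com before 12/05/2017 or bob@example.org after 01/06/2017."

emails = re.findall(r"[\w.+-]+@[\w-]+\.[\w.]+", text)   # e-mail address pattern
dates = re.findall(r"\d{2}/\d{2}/\d{4}", text)          # dd/mm/yyyy date pattern

print({"emails": emails, "dates": dates})
# {'emails': ['alice@example.com', 'bob@example.org'], 'dates': ['12/05/2017', '01/06/2017']}
```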


Proceedings ArticleDOI
03 Apr 2017
TL;DR: This tutorial introduces data-driven methods to construct structured information networks for text corpora of different kinds to represent their factual information and demonstrates how these constructed networks aid in text analytics and knowledge discovery at a large scale.
Abstract: In today's computerized and information-based society, text data is rich but messy. People are soaked with vast amounts of natural-language text data, ranging from news articles, social media posts, and advertisements to a wide range of textual information from various domains (medical records, corporate reports). To turn such massive unstructured text data into actionable knowledge, one of the grand challenges is to gain an understanding of the factual information (e.g., entities, attributes, relations, events) in the text. In this tutorial, we introduce data-driven methods to construct structured information networks (where nodes are different types of entities attached with attributes, and edges are different relations between entities) for text corpora of different kinds (especially for massive, domain-specific text corpora) to represent their factual information. We focus on methods that are minimally-supervised, domain-independent, and language-independent for fast network construction across various application domains (news, web, biomedical, reviews). We demonstrate on real datasets including news articles, scientific publications, tweets and reviews how these constructed networks aid in text analytics and knowledge discovery at a large scale.

5 citations
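A toy sketch of the network-construction idea using networkx. Capitalized tokens stand in for the tutorial's entity extraction and sentence-level co-occurrence stands in for relation extraction; both are crude assumptions made only to keep the example self-contained.

```python
import itertools
import re
import networkx as nx

corpus = [
    "Google acquired DeepMind in London.",
    "DeepMind collaborates with Oxford on health research.",
    "Google opened a new office in London.",
]

G = nx.Graph()
for sentence in corpus:
    entities = set(re.findall(r"[A-Z][A-Za-z]+", sentence))       # crude entity spotting
    for a, b in itertools.combinations(sorted(entities), 2):
        weight = G.get_edge_data(a, b, default={"weight": 0})["weight"]
        G.add_edge(a, b, weight=weight + 1)                        # weighted co-occurrence edge

print(G.edges(data=True))   # e.g. Google -- London carries weight 2
```

In the tutorial's setting, the nodes would be typed entities with attributes and the edges typed relations, but the resulting structure is queried and analyzed in much the same graph form.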


Proceedings ArticleDOI
17 Mar 2017
TL;DR: This paper presents a framework for text detection and recognition from natural images for mobile devices, focusing particularly on the text contained in images.
Abstract: In light of the remarkable audio-visual impact on modern life and the massive use of new technologies (smartphones, tablets, ...), the image has been given great importance in the field of communication. It has become the most effective, attractive and suitable means of communication for transmitting information between different people. Of all the various pieces of information that can be extracted from an image, our focus is particularly on the text. Since its detection and recognition in a natural image is a major problem in many applications, text has drawn the attention of a great number of researchers in recent years. In this paper, we present a framework for text detection and recognition from natural images for mobile devices.

Proceedings ArticleDOI
01 Feb 2017
TL;DR: This paper presents work on identifying text from an image containing two languages and translating it into a single language using OCR and an SVM classifier.
Abstract: Nowadays it has become a trend to write movie names, company names, name plates, vehicle number plates and brand names in mixed languages. Because of a lack of language knowledge, it becomes difficult for people to identify bi-lingual words. The text written in the image contains different fonts and background images. This paper presents work on identifying text from an image containing two languages and translating it into a single language. OCR is implemented in order to convert the electronic form of the image into machine-editable text. An SVM classifier is used to differentiate scripts based on character density features and to map new examples to the predicted category. Bi-lingual or cross-language word recognition from an image solves the problem of reading text from an image that contains both Hindi and English.
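A simplified sketch of script identification with an SVM on density features, as mentioned above. The synthetic patches, the two hand-picked features and the labels are illustrative assumptions; a real system would use actual character images and richer density descriptors.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

def density_features(img):
    # img: binary character patch (1 = ink). Features: overall ink density plus the
    # density of the top band, where a Devanagari head-line would show up strongly.
    top = img[: img.shape[0] // 4]
    return [img.mean(), top.mean()]

# Synthetic stand-ins: "Devanagari-like" patches get a solid top band, "Latin-like" do not.
devanagari = [np.vstack([np.ones((4, 16)), rng.integers(0, 2, (12, 16))]) for _ in range(20)]
latin = [rng.integers(0, 2, (16, 16)) for _ in range(20)]

X = [density_features(img) for img in devanagari + latin]
y = [0] * 20 + [1] * 20                       # 0 = Devanagari script, 1 = Latin script

clf = SVC(kernel="linear").fit(X, y)
print(clf.predict([density_features(devanagari[0]), density_features(latin[0])]))  # expected: [0 1]
```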

Posted Content
TL;DR: A new approach for full page text recognition based on regressions with Fully Convolutional Neural Networks and Multidimensional Long Short-Term Memory as contextual layers, in which only the position of the left side of the text lines is predicted.
Abstract: Text line detection and localization is a crucial step for full page document analysis, but still suffers from the heterogeneity of real life documents. In this paper, we present a new approach for full page text recognition. Localization of the text lines is based on regressions with Fully Convolutional Neural Networks and Multidimensional Long Short-Term Memory as contextual layers. In order to increase the efficiency of this localization method, only the position of the left side of the text lines is predicted. The text recognizer is then in charge of predicting the end of the text to recognize. This method has shown good results for full page text recognition on the highly heterogeneous Maurdor dataset.

Posted Content
23 Mar 2017
TL;DR: Experimental results show that the proposed learning-based approach to text image retrieval, whose purpose is to find the original or similar text from a query text image, has good ability to retrieve the original text content.
Abstract: The rapid increase of digitized documents gives rise to a high demand for document image retrieval. While conventional document image retrieval approaches depend on complex OCR-based text recognition and text similarity detection, this paper proposes a new content-based approach, in which more attention is paid to feature extraction and fusion. In the proposed approach, multiple features of document images are extracted by different CNN models. After that, the extracted CNN features are reduced and fused into a weighted average feature. Finally, the document images are ranked based on feature similarity to a provided query image. The experimental procedure is performed on a group of document images converted from academic papers, which contain both English and Chinese documents. The results show that the proposed approach has good ability to retrieve document images with similar text content, and that the fusion of CNN features can effectively improve the retrieval accuracy.
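A schematic sketch of the reduction, weighted fusion and cosine ranking described above. The CNN feature extraction itself is omitted: the random arrays stand in for features produced by two different CNN models, and the fusion weights and PCA size are arbitrary assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import normalize

rng = np.random.default_rng(0)
feats_a = rng.random((50, 512))     # placeholder for features from CNN model A
feats_b = rng.random((50, 2048))    # placeholder for features from CNN model B

# Reduce both feature sets to a common size, then fuse them by a weighted average.
fused = normalize(0.6 * PCA(n_components=32).fit_transform(feats_a)
                  + 0.4 * PCA(n_components=32).fit_transform(feats_b))

# Rank the document images by cosine similarity to the query image's fused feature.
query = fused[0]
ranking = np.argsort(-(fused @ query))
print(ranking[:5])                  # indices of the most similar document images
```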

Proceedings ArticleDOI
17 Oct 2017
TL;DR: An architecture that efficiently integrates localization, extraction and recognition algorithms applied to text recognition in web images is presented, making the system flexible, scalable and robust in detecting texts from complex web images with different orientations, dimensions and colors.
Abstract: Web images play an important role in delivering multimedia content on the Web. The text embedded in web images carries semantic information related to the layout and content of the pages. Statistics show that there is a significant need to detect and recognize text from web images. This paper presents an architecture that efficiently integrates localization, extraction and recognition algorithms applied to text recognition in web images. In the recognition step, a procedure based on super-resolution and an iterative method is proposed for improving the performance. The approach is implemented and evaluated using Matlab and cloud computing, making the system flexible, scalable and robust in detecting texts from complex web images with different orientations, dimensions and colors. Competitive results are presented, both in precision and recognition rate, when compared with other systems in the existing literature.
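A minimal Python stand-in for the "enlarge before recognizing" step (the paper's system is implemented in Matlab): bicubic upscaling substitutes for the super-resolution procedure, and pytesseract, which requires the Tesseract engine to be installed, performs the recognition. The file name is a placeholder.

```python
import cv2
import pytesseract

img = cv2.imread("web_image.png", cv2.IMREAD_GRAYSCALE)   # placeholder: small web image with embedded text
upscaled = cv2.resize(img, None, fx=3, fy=3, interpolation=cv2.INTER_CUBIC)  # stand-in for super-resolution
_, binarized = cv2.threshold(upscaled, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

print(pytesseract.image_to_string(binarized))              # recognized text
```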

Journal Article
TL;DR: A technology for text recognition using Optical Character Recognition (OCR) for a mobile phone application is proposed, which can easily recognize text written on a document in various languages.
Abstract: In this paper, we propose a technology for text recognition using Optical Character Recognition (OCR) for a mobile phone application. The camera within an Android mobile phone scans the text written on a document, and OCR is then applied, which can easily recognize the text written on that document in various languages. OCR is the machine replication of human reading and has been the subject of intensive research for more than three decades. OCR can be described as the mechanical or electronic conversion of scanned text, where the text can be in handwritten, typewritten or printed form. It is a method of digitizing printed texts so that they can be electronically searched and used in various machine processes. It converts the images into machine-encoded text that can be used in machine translation, text-to-speech and text mining.

Journal ArticleDOI
TL;DR: The experimental results demonstrate the better performance of the proposed text information representation model in terms of its time and space complexity.
Abstract: Text representation is the essential step for the tasks of text mining. To represent textual information more expressively, a Text Mining based Semantic Graph approach is proposed, in which more semantic and ordering information among terms, as well as the structural information of the text, is incorporated. Such a model can be constructed by extracting representative terms from texts and their mutual semantic relationships. The proposed work is implemented in the Java and Python environments. Moreover, WordNet is used to provide the relationships among word nodes, and the Gephi tool is used to construct the semantic graph more effectively. The comparative performance is also measured against traditional approaches, with memory consumption and time consumption taken as the standard parameters. The experimental results have demonstrated the better performance of the proposed text information representation model in terms of its time and space complexity.
Keywords: SVM, Semantic Graphs, POS Tagging, WordNet, Text Representation, Text Mining, Graph Model, Semantic Networks
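A small sketch of the semantic-graph construction idea: representative terms become nodes, WordNet hypernym links supply semantic edges, and the graph is exported as GEXF so it can be inspected in Gephi. The term list is illustrative, and NLTK's WordNet corpus must be downloaded beforehand.

```python
import networkx as nx
from nltk.corpus import wordnet as wn      # requires: nltk.download("wordnet")

terms = ["dog", "cat", "animal", "computer", "machine", "device"]

G = nx.Graph()
G.add_nodes_from(terms)
for term in terms:
    for syn in wn.synsets(term, pos=wn.NOUN):
        for hyper in syn.hypernyms():
            for lemma in hyper.lemma_names():
                if lemma in terms and lemma != term:
                    G.add_edge(term, lemma, relation="hypernym")   # semantic edge from WordNet

print(G.edges(data=True))
nx.write_gexf(G, "semantic_graph.gexf")    # open this file in Gephi for visual inspection
```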

13 Mar 2017
TL;DR: The paper discusses some of the developments in text mining applications, primarily reviewing techniques in the classification, summarization and analysis of text, as advocated by academia.
Abstract: Due to the ever-increasing rate at which information is generated, text mining and its automated analysis have become the need of the hour. The paper discusses some of the developments in text mining applications, primarily reviewing techniques in the classification, summarization and analysis of text, as advocated by academia. The goal is, in essence, to ultimately turn unstructured text into useful data and information for analysis using critical methods. We begin by introducing the concept of "textual analysis", akin to text mining performed through the analysis of natural-language texts, the techniques in use and the open source tools available for it. We survey varied topics that use NLP and expand the horizons of this domain by considering new techniques for improving efficiency even with limited amounts of data, improved accuracy, novel approaches, and new application areas relating to text summarization and text classification. Various text mining techniques used in text classification and summarization are reviewed, followed by the application areas of text mining being worked on by businesses. Finally, the paper concludes by introducing "organizational text mining" and emphasizing the need for it.

Journal ArticleDOI
TL;DR: Two novel frameworks for the classification and extraction of the association rules and the visualisation of financial Arabic text are presented in order to realize both the general structure and the sentiment within an accumulated corpus.
Abstract: Text mining methods involve various techniques, such as text categorization, summarisation, information retrieval, document clustering, topic detection, and concept extraction. In addition, because of the difficulties involved in text mining, visualisation techniques can play a paramount role in the analysis and pre-processing of textual data. This paper will present two novel frameworks for the classification and extraction of the association rules and the visualisation of financial Arabic text in order to realize both the general structure and the sentiment within an accumulated corpus. However, mining unstructured data with natural language processing (NLP) and machine learning techniques can be arduous, especially where the Arabic language is concerned, because of limited research in this area. The results show that our frameworks can readily classify Arabic tweets. Furthermore, they can handle many antecedent text association rules for the positive class and the negative class.
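A toy sketch of the association-rule side of such a framework using mlxtend, with English-glossed term sets standing in for pre-processed Arabic tweets; the Arabic-specific NLP, the sentiment tags inside the transactions and the support/confidence thresholds are all illustrative assumptions.

```python
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

# Each transaction is the set of terms (plus a sentiment tag) from one tweet.
tweets = [
    ["profit", "rise", "positive"],
    ["profit", "growth", "positive"],
    ["loss", "decline", "negative"],
    ["loss", "debt", "negative"],
    ["profit", "rise", "positive"],
]

te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(tweets).transform(tweets), columns=te.columns_)

frequent = apriori(onehot, min_support=0.4, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.8)
print(rules[["antecedents", "consequents", "support", "confidence"]])   # e.g. {profit} -> {positive}
```

Rules whose consequent is the positive or negative tag correspond to the antecedent text association rules for each sentiment class mentioned in the abstract.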

Proceedings ArticleDOI
01 May 2017
TL;DR: A Turkish scene text database is collected for the first time in the literature, and the contents of this database, called STRIT (Scene Text Recognition In Turkish) for short, are detailed.
Abstract: Scene text localization and recognition keep attracting increasing interest from researchers due to their valuable advantage in extracting content from real world images and in image retrieval via text search. Nevertheless, due to the fact that the majority of the image datasets commonly used in this field consist of text in English, the related studies have mostly been limited to a single language. On that account, in order to apply the technologies developed for scene text detection and recognition to Turkish scene text, analyze their performance and develop Turkish-language-specific algorithms, a Turkish scene text database is collected for the first time in the literature. In this paper, the contents of this database, called STRIT (Scene Text Recognition In Turkish) for short, are detailed. Additionally, two baseline methods are tested to detect and recognize scene text in Turkish, and the preliminary results are presented.

Book ChapterDOI
12 Dec 2017
TL;DR: The proposed system is the integration of an efficiently trained text classifier model with an open source speech-to-text conversion platform, which will eliminate the need for agents to manually process the conversation and initiate the required action.
Abstract: In the services industry, chat helplines were seen as more effective than a voice-based service because more users could be serviced at the same time with the help of standard text message templates. By training text classifier models and integrating them with speech-to-text conversion systems, we can further reduce human effort and thereby deliver efficient solutions with minimal participation and increase user convenience many times over. Our proposed system is the integration of an efficiently trained text classifier model with an open source speech-to-text conversion platform. Our trained model can receive input in text format from the conversion tool and can accurately classify its category (i.e., label it). Based on the classification, the consequent action is initiated. Our trained model will eliminate the need for agents manually processing the conversation and initiating the required action. The system can save a lot of energy, time and other resources.
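A small sketch of the classifier half of the proposed system. In the full system the utterance would arrive from the speech-to-text component; here it is a hard-coded string, and the categories, training utterances and the TF-IDF/linear-SVM choice are our own assumptions rather than the chapter's model.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

training_utterances = [
    "I want to reset my password",
    "my password is not working",
    "I was charged twice on my bill",
    "there is a wrong amount on my invoice",
]
labels = ["account", "account", "billing", "billing"]     # hypothetical helpline categories

clf = make_pipeline(TfidfVectorizer(), LinearSVC()).fit(training_utterances, labels)

transcribed = "please help me reset the password for my account"   # would come from speech-to-text
print(clf.predict([transcribed]))   # -> ['account'], which would trigger the corresponding action
```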

11 Jul 2017
TL;DR: A camera-based assistive text reading framework is proposed to help blind persons read text labels from hand-held objects in their day-to-day lives.
Abstract: We propose a camera-based assistive text reading framework to help blind persons read text labels from hand-held objects in their day-to-day lives. In this framework, the camera acts as the main vision device, capturing images of product packaging and hand-held objects. To isolate the object from complicated backgrounds, we first propose an efficient motion-based technique to define a region of interest (ROI) within the image. Within the extracted ROI, text localization and recognition are conducted to acquire text data. Text characters are then recognized by off-the-shelf optical character recognition (OCR) software. Using a text-to-speech converter, the extracted text is output as audio.
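A minimal sketch of the final "read the label aloud" step described above (the motion-based ROI detection is omitted): pytesseract extracts text from a cropped region and pyttsx3 speaks it. The image path and ROI coordinates are placeholders, and both the Tesseract engine and a local TTS voice must be available.

```python
import cv2
import pytesseract
import pyttsx3

frame = cv2.imread("held_object.jpg")      # placeholder: camera frame of a hand-held product
roi = frame[100:300, 150:450]              # placeholder ROI (normally found by motion analysis)

text = pytesseract.image_to_string(roi).strip()
if text:
    engine = pyttsx3.init()
    engine.say(text)                       # read the recognized label aloud
    engine.runAndWait()
```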