Topic

Noisy text analytics

About: Noisy text analytics is a research topic. Over its lifetime, 700 publications have been published within this topic, receiving 28,759 citations.


Papers
Proceedings Article
01 Dec 2013
TL;DR: This work describes Photo OCR, a system for text extraction from images that is capable of recognizing text in a variety of challenging imaging conditions where traditional OCR systems fail, notably in the presence of substantial blur, low resolution, low contrast, high image noise and other distortions.
Abstract: We describe Photo OCR, a system for text extraction from images. Our particular focus is reliable text extraction from smartphone imagery, with the goal of text recognition as a user input modality similar to speech recognition. Commercially available OCR performs poorly on this task. Recent progress in machine learning has substantially improved isolated character classification; we build on this progress by demonstrating a complete OCR system using these techniques. We also incorporate modern data center-scale distributed language modelling. Our approach is capable of recognizing text in a variety of challenging imaging conditions where traditional OCR systems fail, notably in the presence of substantial blur, low resolution, low contrast, high image noise and other distortions. It also operates with low latency: mean processing time is 600 ms per image. We evaluate our system on public benchmark datasets for text extraction and outperform all previously reported results, more than halving the error rate on multiple benchmarks. The system is currently in use in many applications at Google, and is available as a user input modality in Google Translate for Android.

499 citations

Book
25 Jan 2012
TL;DR: This comprehensive professional reference brings together all the information, tools and methods a professional will need to efficiently use text mining applications and statistical analysis, and presents a comprehensive how-to reference that shows the user how to conduct text mining and statistically analyze results.
Abstract: The world contains an unimaginably vast amount of digital information which is getting ever vaster ever more rapidly. This makes it possible to do many things that previously could not be done: spot business trends, prevent diseases, combat crime and so on. Managed well, the textual data can be used to unlock new sources of economic value, provide fresh insights into science and hold governments to account. As the Internet expands and our natural capacity to process the unstructured text that it contains diminishes, the value of text mining for information retrieval and search will increase dramatically. This comprehensive professional reference brings together all the information, tools and methods a professional will need to efficiently use text mining applications and statistical analysis. The Handbook of Practical Text Mining and Statistical Analysis for Non-structured Text Data Applications presents a comprehensive how-to reference that shows the user how to conduct text mining and statistically analyze results. In addition to providing an in-depth examination of core text mining and link detection tools, methods and operations, the book examines advanced preprocessing techniques, knowledge representation considerations, and visualization approaches. Finally, the book explores current real-world, mission-critical applications of text mining and link detection using real-world example tutorials in such varied fields as corporate, finance, business intelligence, genomics research, and counterterrorism activities.
- Extensive case studies, most in a tutorial format, allow the reader to 'click through' the example using a software program, thus learning to conduct text mining analyses in the most rapid manner of learning possible
- Numerous examples, tutorials, power points and datasets available via companion website on Elsevierdirect.com
- Glossary of text mining terms provided in the appendix

450 citations

01 Jan 2005
TL;DR: This paper illustrates the text classification process using machine learning techniques to manage and process a vast amount of documents in digital forms that are widespread and continuously increasing.
Abstract: Automated text classification has been considered a vital method to manage and process a vast amount of documents in digital forms that are widespread and continuously increasing. In general, text classification plays an important role in information extraction and summarization, text retrieval, and question-answering. This paper illustrates the text classification process using machine learning techniques. The references cited cover the major theoretical issues and guide the researcher to interesting research directions.

447 citations

Posted Content
TL;DR: Several of the most fundamental text mining tasks and techniques, including text pre-processing, classification and clustering, are described, and text mining in biomedical and health care domains is briefly explained.
Abstract: The amount of text that is generated every day is increasing dramatically. This tremendous volume of mostly unstructured text cannot be simply processed and perceived by computers. Therefore, efficient and effective techniques and algorithms are required to discover useful patterns. Text mining is the task of extracting meaningful information from text, which has gained significant attention in recent years. In this paper, we describe several of the most fundamental text mining tasks and techniques including text pre-processing, classification and clustering. Additionally, we briefly explain text mining in biomedical and health care domains.

422 citations
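
The text pre-processing step named in the abstract above can be illustrated with a minimal sketch. This is not taken from the paper: the tokenization rule, the tiny stop-word list, and the function names `preprocess` and `bag_of_words` are all assumptions chosen for illustration.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not from the paper).
STOP_WORDS = {"the", "is", "of", "a", "and", "in", "to"}

def preprocess(doc):
    """Lowercase, split into ASCII alphabetic tokens, and drop stop words."""
    tokens = re.findall(r"[a-z]+", doc.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def bag_of_words(doc):
    """Count the remaining tokens, giving a simple bag-of-words vector."""
    return Counter(preprocess(doc))

bow = bag_of_words(
    "Text mining is the task of extracting meaningful information from text."
)
```

A representation like `bow` is the kind of input that classification or clustering methods typically consume; real pipelines usually add steps such as stemming and a fuller stop-word list.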

Journal Article
TL;DR: To take full advantage of existing natural language processing (NLP) tools, this paper proposes a set of style markers, including analysis-level measures that represent the way in which the input text has been analyzed and capture useful stylistic information without additional cost.
Abstract: The two main factors that characterize a text are its content and its style, and both can be used as a means of categorization. In this paper we present an approach to text categorization in terms of genre and author for Modern Greek. In contrast to previous stylometric approaches, we attempt to take full advantage of existing natural language processing (NLP) tools. To this end, we propose a set of style markers including analysis-level measures that represent the way in which the input text has been analyzed and capture useful stylistic information without additional cost. We present a set of small-scale but reasonable experiments in text genre detection, author identification, and author verification tasks and show that the proposed method performs better than the most popular distributional lexical measures, i.e., functions of vocabulary richness and frequencies of occurrence of the most frequent words. All the presented experiments are based on unrestricted text downloaded from the World Wide Web without any manual text preprocessing or text sampling. Various performance issues regarding the training set size and the significance of the proposed style markers are discussed. Our system can be used in any application that requires fast and easily adaptable text categorization in terms of stylistically homogeneous categories. Moreover, the procedure of defining analysis-level markers can be followed in order to extract useful stylistic information using existing text processing tools.

416 citations
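
The baseline that the abstract above compares against, distributional lexical measures such as vocabulary richness and the frequencies of the most frequent words, can be sketched as follows. This is a hedged illustration only: the type-token ratio is one common vocabulary richness function and is assumed here, not confirmed as the one used in the paper, and `lexical_measures` and `top_k` are hypothetical names.

```python
import re
from collections import Counter

def lexical_measures(text, top_k=3):
    """Return (type-token ratio, relative frequencies of the top_k words)."""
    # Lowercase and keep runs of letters; this pattern also matches
    # non-Latin alphabets such as Greek.
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    if not tokens:
        return 0.0, []
    counts = Counter(tokens)
    # Type-token ratio: distinct words divided by total words
    # (assumed here as a simple vocabulary richness function).
    ttr = len(counts) / len(tokens)
    # Relative frequency of each of the top_k most frequent words.
    top = [(word, n / len(tokens)) for word, n in counts.most_common(top_k)]
    return ttr, top

ttr, top = lexical_measures("the cat sat on the mat and the dog sat too")
```

Values like `ttr` and the entries of `top` would serve as features for a genre or author classifier; the paper's proposed analysis-level style markers are a different feature set and are not reproduced here.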


Network Information
Related Topics (5)
Feature extraction: 111.8K papers, 2.1M citations, 75% related
Server: 79.5K papers, 1.4M citations, 74% related
Cluster analysis: 146.5K papers, 2.9M citations, 74% related
Feature (computer vision): 128.2K papers, 1.7M citations, 73% related
Wireless sensor network: 142K papers, 2.4M citations, 73% related
Performance
Metrics
No. of papers in the topic in previous years
Year	Papers
2023	6
2022	8
2020	1
2019	1
2018	4
2017	23