Topic

Noisy text analytics

About: Noisy text analytics is a research topic. Over its lifetime, 700 publications have been published within this topic, receiving 28,759 citations.
Papers


Journal ArticleDOI
01 Feb 1999-Machine Learning
TL;DR: WHISK learns text extraction rules automatically and handles text styles ranging from highly structured to free text, including text that is neither rigidly formatted nor composed of grammatical sentences; combined with a syntactic analyzer and semantic tagging, it can also extract from free text such as news stories.
Abstract: A wealth of on-line text information can be made available to automatic processing by information extraction (IE) systems. Each IE application needs a separate set of rules tuned to the domain and writing style. WHISK helps to overcome this knowledge-engineering bottleneck by learning text extraction rules automatically. WHISK is designed to handle text styles ranging from highly structured to free text, including text that is neither rigidly formatted nor composed of grammatical sentences. Such semi-structured text has largely been beyond the scope of previous systems. When used in conjunction with a syntactic analyzer and semantic tagging, WHISK can also handle extraction from free text such as news stories.

1,065 citations
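
To make the WHISK entry above more concrete, here is a minimal sketch of what a single extraction rule over semi-structured text can look like. It is not WHISK's rule language or its learning algorithm: the rule is hand-written as a Python regular expression, and the ad text, the `RULE` pattern, and the `extract` helper are all hypothetical and chosen only for illustration. WHISK's contribution, per the abstract, is learning rules of this kind automatically instead of hand-tuning them per domain.

```python
import re

# Hypothetical semi-structured snippet (an assumption, not from the paper):
# terse, ungrammatical, and without a rigid column layout.
AD = "Downtown 2 br apt, w/d, pkg incl $850. Eastside 3 BR house, gar, nr park $1200."

# Hand-written stand-in for one extraction rule: capture a bedroom count,
# skip ahead lazily, then capture the next dollar amount.
RULE = re.compile(r"(\d+)\s*br\b.*?\$\s*(\d+)", re.IGNORECASE)


def extract(text):
    """Return one record per rule match, with both fields parsed as integers."""
    return [
        {"bedrooms": int(beds), "price": int(price)}
        for beds, price in RULE.findall(text)
    ]


if __name__ == "__main__":
    for record in extract(AD):
        print(record)
```

A rule like this is brittle: it depends on the abbreviation "br" and on the price following the bedroom count, which is why each domain and writing style needs its own tuned set of rules.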


Journal ArticleDOI
TL;DR: Presents an end-to-end system for text spotting (localising and recognising text in natural scene images) and text-based image retrieval, and demonstrates a real-world application that makes thousands of hours of news footage instantly searchable via a text query.
Abstract: In this work we present an end-to-end system for text spotting--localising and recognising text in natural scene images--and text based image retrieval. This system is based on a region proposal mechanism for detection and deep convolutional neural networks for recognition. Our pipeline uses a novel combination of complementary proposal generation techniques to ensure high recall, and a fast subsequent filtering stage for improving precision. For the recognition and ranking of proposals, we train very large convolutional neural networks to perform word recognition on the whole proposal region at the same time, departing from the character classifier based systems of the past. These networks are trained solely on data produced by a synthetic text generation engine, requiring no human labelled data. Analysing the stages of our pipeline, we show state-of-the-art performance throughout. We perform rigorous experiments across a number of standard end-to-end text spotting benchmarks and text-based image retrieval datasets, showing a large improvement over all previous methods. Finally, we demonstrate a real-world application of our text spotting system to allow thousands of hours of news footage to be instantly searchable via a text query.

926 citations


Proceedings ArticleDOI
25 Aug 2013-
TL;DR: The datasets and ground truth specification are described, the performance evaluation protocols used are detailed, and the final results are presented along with a brief summary of the participating methods.
Abstract: This report presents the final results of the ICDAR 2013 Robust Reading Competition. The competition is structured in three Challenges addressing text extraction in different application domains, namely born-digital images, real scene images and real-scene videos. The Challenges are organised around specific tasks covering text localisation, text segmentation and word recognition. The competition took place in the first quarter of 2013, and received a total of 42 submissions over the different tasks offered. This report describes the datasets and ground truth specification, details the performance evaluation protocols used and presents the final results along with a brief summary of the participating methods.

921 citations


Journal ArticleDOI
TL;DR: A large number of techniques to address the problem of text information extraction are classified and reviewed, benchmark data and performance evaluation are discussed, and promising directions for future research are pointed out.
Abstract: Text data present in images and video contain useful information for automatic annotation, indexing, and structuring of images. Extraction of this information involves detection, localization, tracking, extraction, enhancement, and recognition of the text from a given image. However, variations of text due to differences in size, style, orientation, and alignment, as well as low image contrast and complex background make the problem of automatic text extraction extremely challenging. While comprehensive surveys of related problems such as face detection, document analysis, and image & video indexing can be found, the problem of text information extraction is not well surveyed. A large number of techniques have been proposed to address this problem, and the purpose of this paper is to classify and review these algorithms, discuss benchmark data and performance evaluation, and to point out promising directions for future research.

894 citations


Network Information
Related Topics (5)
Semantic computing

11.1K papers, 241.3K citations

81% related
Conceptual clustering

3K papers, 118.5K citations

80% related
tf–idf

2K papers, 46.3K citations

80% related
Speaker recognition

14.9K papers, 310K citations

80% related
Collaborative filtering

14.7K papers, 470.4K citations

80% related
Performance
Metrics
No. of papers in the topic in previous years
Year  Papers
2019  1
2018  4
2017  23
2016  59
2015  73
2014  69

Top Attributes

Topic's top 5 most impactful authors

Gerard Salton

7 papers, 3.6K citations

Umapada Pal

4 papers, 98 citations

Alessandro Vinciarelli

3 papers, 19 citations

Marie-Francine Moens

3 papers, 25 citations

Dimosthenis Karatzas

3 papers, 1K citations