Proceedings ArticleDOI

Detecting Oriented Text in Natural Images by Linking Segments

TL;DR: SegLink, an oriented text detection method that decomposes text into two locally detectable elements, namely segments and links, achieves an f-measure of 75.0% on the standard ICDAR 2015 Incidental (Challenge 4) benchmark, outperforming the previous best by a large margin.
Abstract: Most state-of-the-art text detection methods are specific to horizontal Latin text and are not fast enough for real-time applications. We introduce Segment Linking (SegLink), an oriented text detection method. The main idea is to decompose text into two locally detectable elements, namely segments and links. A segment is an oriented box covering a part of a word or text line. A link connects two adjacent segments, indicating that they belong to the same word or text line. Both elements are detected densely at multiple scales by an end-to-end trained, fully-convolutional neural network. Final detections are produced by combining segments connected by links. Compared with previous methods, SegLink improves along the dimensions of accuracy, speed, and ease of training. It achieves an f-measure of 75.0% on the standard ICDAR 2015 Incidental (Challenge 4) benchmark, outperforming the previous best by a large margin. It runs at over 20 FPS on 512x512 images. Moreover, without modification, SegLink is able to detect long lines of non-Latin text, such as Chinese.
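
The combining step can be pictured as connected-components grouping over the detected segments, with links as edges. Below is a minimal Python sketch of that grouping; the function names and the final merge heuristic (averaging orientation and projecting segment centers onto the text direction) are illustrative simplifications, not the paper's exact combining algorithm.

```python
# Minimal sketch: group segments via links (union-find), then crudely merge
# each group into one oriented box. Merge heuristic is illustrative only.
import math
import numpy as np

def find(parent, i):
    while parent[i] != i:
        parent[i] = parent[parent[i]]  # path compression
        i = parent[i]
    return i

def combine_segments(segments, links):
    """segments: list of (cx, cy, w, h, theta); links: list of (i, j) index pairs."""
    parent = list(range(len(segments)))
    for i, j in links:                       # union segments joined by a link
        ri, rj = find(parent, i), find(parent, j)
        parent[ri] = rj

    groups = {}
    for idx in range(len(segments)):
        groups.setdefault(find(parent, idx), []).append(idx)

    boxes = []
    for members in groups.values():
        segs = np.array([segments[i] for i in members])
        theta = segs[:, 4].mean()            # average orientation of the group
        cx, cy = segs[:, 0].mean(), segs[:, 1].mean()
        # project segment centers onto the text direction to estimate line length
        d = (segs[:, 0] - cx) * math.cos(theta) + (segs[:, 1] - cy) * math.sin(theta)
        w = (d.max() - d.min()) + segs[:, 2].mean()
        h = segs[:, 3].mean()
        boxes.append((cx, cy, w, h, theta))
    return boxes
```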


Citations
Proceedings ArticleDOI
Young Min Baek, Bado Lee, Dongyoon Han, Sangdoo Yun, Hwalsuk Lee
15 Jun 2019
TL;DR: Baek et al. propose a new scene text detection method that effectively detects text areas by exploring each character and the affinity between characters, significantly outperforming state-of-the-art detectors.
Abstract: Scene text detection methods based on neural networks have emerged recently and have shown promising results. Previous methods trained with rigid word-level bounding boxes exhibit limitations in representing the text region in an arbitrary shape. In this paper, we propose a new scene text detection method to effectively detect text area by exploring each character and affinity between characters. To overcome the lack of individual character level annotations, our proposed framework exploits both the given character-level annotations for synthetic images and the estimated character-level ground-truths for real images acquired by the learned interim model. In order to estimate affinity between characters, the network is trained with the newly proposed representation for affinity. Extensive experiments on six benchmarks, including the TotalText and CTW-1500 datasets which contain highly curved texts in natural images, demonstrate that our character-level text detection significantly outperforms the state-of-the-art detectors. According to the results, our proposed method guarantees high flexibility in detecting complicated scene text images, such as arbitrarily-oriented, curved, or deformed texts.
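
One way to picture the inference stage implied above is to threshold the character-region and affinity score maps, take connected components of their union, and fit an oriented box per component. The rough Python sketch below assumes that reading; thresholds and function names are illustrative, and the actual post-processing in the paper is more involved.

```python
# Rough sketch: character-region and affinity score maps -> word-level rotated boxes.
import numpy as np
import cv2
from scipy import ndimage

def score_maps_to_boxes(region, affinity, region_thr=0.7, affinity_thr=0.4):
    """region, affinity: HxW float score maps in [0, 1]."""
    text_mask = (region > region_thr) | (affinity > affinity_thr)
    labels, num = ndimage.label(text_mask)         # connected components
    boxes = []
    for k in range(1, num + 1):
        ys, xs = np.where(labels == k)
        if xs.size < 10:                           # drop tiny components
            continue
        pts = np.stack([xs, ys], axis=1).astype(np.float32)
        rect = cv2.minAreaRect(pts)                # ((cx, cy), (w, h), angle)
        boxes.append(cv2.boxPoints(rect))          # 4x2 corner points
    return boxes
```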

635 citations

Proceedings ArticleDOI
01 Oct 2019
TL;DR: A sampling fusion network is devised which fuses multi-layer features with effective anchor sampling to improve the sensitivity to small objects, and an IoU constant factor is added to the smooth L1 loss to address the boundary problem for the rotating bounding box.
Abstract: Object detection has been a building block in computer vision. Though considerable progress has been made, there still exist challenges for objects with small size, arbitrary direction, and dense distribution. Apart from natural images, such issues are especially pronounced for aerial images of great importance. This paper presents a novel multi-category rotation detector for small, cluttered and rotated objects, namely SCRDet. Specifically, a sampling fusion network is devised which fuses multi-layer features with effective anchor sampling to improve the sensitivity to small objects. Meanwhile, the supervised pixel attention network and the channel attention network are jointly explored for small and cluttered object detection by suppressing the noise and highlighting the object features. For more accurate rotation estimation, an IoU constant factor is added to the smooth L1 loss to address the boundary problem for the rotating bounding box. Extensive experiments on two remote sensing public datasets, DOTA and NWPU VHR-10, as well as the natural image datasets COCO and VOC2007 and the scene text dataset ICDAR2015, show the state-of-the-art performance of our detector. The code and models will be available at https://github.com/DetectionTeamUCAS.
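
The "IoU constant factor" idea can be sketched as a smooth L1 regression loss on the five rotated-box parameters whose magnitude is taken from the (continuous) IoU term rather than the (discontinuous) parameter differences near the angular boundary. The PyTorch snippet below is a hedged approximation; the exact combination used by SCRDet may differ, and all names are illustrative.

```python
# Hedged sketch of an IoU-modulated smooth L1 loss for rotated boxes (cx, cy, w, h, theta).
import torch

def smooth_l1(x, beta=1.0):
    ax = x.abs()
    return torch.where(ax < beta, 0.5 * ax * ax / beta, ax - 0.5 * beta)

def iou_smooth_l1(pred, target, iou, eps=1e-6):
    """pred, target: (N, 5) rotated boxes; iou: (N,) precomputed rotated IoU."""
    reg = smooth_l1(pred - target).sum(dim=-1)
    # gradient direction comes from reg; magnitude comes from -log(IoU), which is
    # continuous even when the angle parameterisation jumps at the boundary
    factor = (-torch.log(iou.clamp(min=eps))) / reg.detach().clamp(min=eps)
    return (reg * factor).mean()
```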

552 citations


Cites background from "Detecting Oriented Text in Natural ..."

  • ...While such methods still have difficulty in dealing with aerial image based object detection: one reason is that most text detection methods are restricted to single-category object detection [44, 34, 7], while there are often many different categories to discern for remote sensing....

    [...]

Proceedings ArticleDOI
01 Jun 2019
TL;DR: A novel Progressive Scale Expansion Network (PSENet) is proposed, which can precisely detect text instances with arbitrary shapes and effectively splits close text instances, making it easier for segmentation-based methods to detect arbitrary-shaped text.
Abstract: Scene text detection has witnessed rapid progress, especially with the recent development of convolutional neural networks. However, there still exist two challenges that prevent these algorithms from reaching industrial application. On the one hand, most state-of-the-art algorithms require quadrilateral bounding boxes, which are inaccurate for locating texts with arbitrary shapes. On the other hand, two text instances that are close to each other may lead to a single false detection covering both instances. Traditionally, segmentation-based approaches can relieve the first problem but usually fail to solve the second. To address these two challenges, in this paper we propose a novel Progressive Scale Expansion Network (PSENet), which can precisely detect text instances with arbitrary shapes. More specifically, PSENet generates kernels of different scales for each text instance and gradually expands the minimal-scale kernel to the text instance with its complete shape. Because there are large geometrical margins among the minimal-scale kernels, our method is effective at splitting close text instances, making it easier for segmentation-based methods to detect arbitrary-shaped text instances. Extensive experiments on CTW1500, Total-Text, ICDAR 2015 and ICDAR 2017 MLT validate the effectiveness of PSENet. Notably, on CTW1500, a dataset full of long curved texts, PSENet achieves an F-measure of 74.3% at 27 FPS, and our best F-measure (82.2%) outperforms state-of-the-art algorithms by 6.6%. The code will be released in the future.
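
The progressive scale expansion step described above can be written as a breadth-first growth from the smallest kernel outward, with conflicting pixels kept by whichever label reaches them first. The Python sketch below follows that reading; the function name and kernel ordering are illustrative assumptions.

```python
# Illustrative sketch of progressive scale expansion over binary kernel maps.
from collections import deque
import numpy as np
from scipy import ndimage

def progressive_scale_expansion(kernels):
    """kernels: list of HxW binary maps, ordered from smallest to largest scale."""
    labels, num = ndimage.label(kernels[0])        # seed labels from the minimal kernel
    h, w = labels.shape
    for kernel in kernels[1:]:
        queue = deque(zip(*np.nonzero(labels)))    # start BFS from all labeled pixels
        while queue:
            y, x = queue.popleft()
            lab = labels[y, x]
            for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                ny, nx = y + dy, x + dx
                if 0 <= ny < h and 0 <= nx < w and labels[ny, nx] == 0 and kernel[ny, nx]:
                    labels[ny, nx] = lab           # first-come, first-served
                    queue.append((ny, nx))
    return labels                                   # one label per text instance
```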

501 citations


Cites methods from "Detecting Oriented Text in Natural ..."

  • ...For the regression-based approaches [36, 43, 32, 16, 42, 23, 11, 13, 27], the text targets are usually represented in the forms of rectangles or quadrangles with certain orientations....

    [...]

  • ...In addition, SegLink [32] is listed in the paper's Total-Text comparison table (columns: Method, Ext, P, R, F, FPS)....

    [...]

Posted Content
TL;DR: A novel method called Rotational Region CNN (R2CNN) for detecting arbitrary-oriented texts in natural scene images using the Region Proposal Network to generate axis-aligned bounding boxes that enclose the texts with different orientations.
Abstract: In this paper, we propose a novel method called Rotational Region CNN (R2CNN) for detecting arbitrary-oriented texts in natural scene images. The framework is based on Faster R-CNN [1] architecture. First, we use the Region Proposal Network (RPN) to generate axis-aligned bounding boxes that enclose the texts with different orientations. Second, for each axis-aligned text box proposed by RPN, we extract its pooled features with different pooled sizes and the concatenated features are used to simultaneously predict the text/non-text score, axis-aligned box and inclined minimum area box. At last, we use an inclined non-maximum suppression to get the detection results. Our approach achieves competitive results on text detection benchmarks: ICDAR 2015 and ICDAR 2013.
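
The inclined non-maximum suppression mentioned at the end of the abstract needs an overlap measure for rotated quadrilaterals. A hedged sketch using shapely polygons is shown below; R2CNN's own implementation differs in detail, and the threshold and names are illustrative.

```python
# Hedged sketch of inclined (rotated) NMS using polygon intersection for overlap.
import numpy as np
from shapely.geometry import Polygon

def rotated_iou(quad_a, quad_b):
    """quad_*: 4x2 arrays of corner points of inclined boxes."""
    pa, pb = Polygon(quad_a), Polygon(quad_b)
    inter = pa.intersection(pb).area
    union = pa.area + pb.area - inter
    return inter / union if union > 0 else 0.0

def inclined_nms(quads, scores, iou_thr=0.3):
    order = np.argsort(scores)[::-1]               # highest score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        rest = [j for j in order[1:] if rotated_iou(quads[i], quads[j]) <= iou_thr]
        order = np.array(rest, dtype=int)
    return keep
```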

435 citations

Proceedings ArticleDOI
Xuebo Liu1, Ding Liang1, Shi Yan1, Dagui Chen1, Yu Qiao1, Junjie Yan1 
18 Jun 2018
TL;DR: In this article, a unified end-to-end trainable Fast Oriented Text Spotting (FOTS) network is proposed for simultaneous detection and recognition, sharing computation and visual information among the two complementary tasks.
Abstract: Incidental scene text spotting is considered one of the most difficult and valuable challenges in the document analysis community. Most existing methods treat text detection and recognition as separate tasks. In this work, we propose a unified end-to-end trainable Fast Oriented Text Spotting (FOTS) network for simultaneous detection and recognition, sharing computation and visual information between the two complementary tasks. Specifically, RoIRotate is introduced to share convolutional features between detection and recognition. Benefiting from the convolution-sharing strategy, FOTS has little computation overhead compared to the baseline text detection network, and joint training makes our method perform better than two-stage methods. Experiments on the ICDAR 2015, ICDAR 2017 MLT, and ICDAR 2013 datasets demonstrate that the proposed method significantly outperforms state-of-the-art methods, which further allows us to develop the first real-time oriented text spotting system, surpassing all previous state-of-the-art results by more than 5% on the ICDAR 2015 text spotting task while keeping 22.6 fps.
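
RoIRotate-style feature sharing can be sketched as sampling a fixed-size, axis-aligned patch from a rotated region of the shared feature map with an affine sampling grid. The PyTorch sketch below assumes a particular parameterisation (angle in radians, align_corners=True) purely for illustration and is not FOTS's exact implementation.

```python
# Sketch: crop a rotated region of a feature map to a fixed-size patch.
import math
import torch
import torch.nn.functional as F

def roi_rotate(feature, cx, cy, w, h, theta, out_h=8, out_w=64):
    """feature: (1, C, H, W); (cx, cy, w, h, theta) in feature-map pixels / radians."""
    _, c, H, W = feature.shape
    cos, sin = math.cos(theta), math.sin(theta)
    # maps output grid coordinates in [-1, 1] to normalized input coordinates
    affine = torch.tensor([
        [w * cos / (W - 1), -h * sin / (W - 1), 2 * cx / (W - 1) - 1],
        [w * sin / (H - 1),  h * cos / (H - 1), 2 * cy / (H - 1) - 1],
    ], dtype=feature.dtype).unsqueeze(0)
    grid = F.affine_grid(affine, size=(1, c, out_h, out_w), align_corners=True)
    return F.grid_sample(feature, grid, align_corners=True)
```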

434 citations

References
Proceedings Article
01 Jan 2015
TL;DR: In this paper, the authors investigated the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting and showed that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 layers.
Abstract: In this work we investigate the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting. Our main contribution is a thorough evaluation of networks of increasing depth using an architecture with very small (3x3) convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers. These findings were the basis of our ImageNet Challenge 2014 submission, where our team secured the first and the second places in the localisation and classification tracks respectively. We also show that our representations generalise well to other datasets, where they achieve state-of-the-art results. We have made our two best-performing ConvNet models publicly available to facilitate further research on the use of deep visual representations in computer vision.
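
The 16-layer configuration of small 3x3 filters described here can be summarized in a few lines of PyTorch. The sketch below covers only the 13 convolutional layers of the well-known "D" configuration (the three fully-connected layers are omitted), and is an illustrative reconstruction rather than the authors' released model.

```python
# Minimal sketch of a VGG-16-style convolutional feature extractor (config "D").
import torch.nn as nn

def vgg16_features(cfg=(64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
                        512, 512, 512, "M", 512, 512, 512, "M")):
    layers, in_ch = [], 3
    for v in cfg:
        if v == "M":
            layers.append(nn.MaxPool2d(kernel_size=2, stride=2))   # halve resolution
        else:
            layers += [nn.Conv2d(in_ch, v, kernel_size=3, padding=1),
                       nn.ReLU(inplace=True)]
            in_ch = v
    return nn.Sequential(*layers)
```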

49,914 citations

Proceedings ArticleDOI
07 Jun 2015
TL;DR: The key insight is to build “fully convolutional” networks that take input of arbitrary size and produce correspondingly-sized output with efficient inference and learning.
Abstract: Convolutional networks are powerful visual models that yield hierarchies of features. We show that convolutional networks by themselves, trained end-to-end, pixels-to-pixels, exceed the state-of-the-art in semantic segmentation. Our key insight is to build “fully convolutional” networks that take input of arbitrary size and produce correspondingly-sized output with efficient inference and learning. We define and detail the space of fully convolutional networks, explain their application to spatially dense prediction tasks, and draw connections to prior models. We adapt contemporary classification networks (AlexNet [20], the VGG net [31], and GoogLeNet [32]) into fully convolutional networks and transfer their learned representations by fine-tuning [3] to the segmentation task. We then define a skip architecture that combines semantic information from a deep, coarse layer with appearance information from a shallow, fine layer to produce accurate and detailed segmentations. Our fully convolutional network achieves state-of-the-art segmentation of PASCAL VOC (20% relative improvement to 62.2% mean IU on 2012), NYUDv2, and SIFT Flow, while inference takes less than one fifth of a second for a typical image.
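
The skip architecture described at the end of the abstract can be pictured as scoring a deep, coarse layer and a shallow, fine layer separately, upsampling the coarse scores, and adding the two. The PyTorch sketch below uses bilinear interpolation instead of the paper's learned deconvolution, and the layer names and channel sizes are illustrative.

```python
# Toy sketch of an FCN-style skip fusion head.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SkipHead(nn.Module):
    def __init__(self, deep_ch=512, shallow_ch=256, num_classes=21):
        super().__init__()
        self.score_deep = nn.Conv2d(deep_ch, num_classes, kernel_size=1)
        self.score_shallow = nn.Conv2d(shallow_ch, num_classes, kernel_size=1)

    def forward(self, deep_feat, shallow_feat, out_size):
        coarse = self.score_deep(deep_feat)
        coarse = F.interpolate(coarse, size=shallow_feat.shape[-2:],
                               mode="bilinear", align_corners=False)
        fused = coarse + self.score_shallow(shallow_feat)   # skip fusion
        return F.interpolate(fused, size=out_size, mode="bilinear",
                             align_corners=False)            # per-pixel class scores
```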

28,225 citations

Proceedings ArticleDOI
27 Jun 2016
TL;DR: Compared to state-of-the-art detection systems, YOLO makes more localization errors but is less likely to predict false positives on background, and outperforms other detection methods, including DPM and R-CNN, when generalizing from natural images to other domains like artwork.
Abstract: We present YOLO, a new approach to object detection. Prior work on object detection repurposes classifiers to perform detection. Instead, we frame object detection as a regression problem to spatially separated bounding boxes and associated class probabilities. A single neural network predicts bounding boxes and class probabilities directly from full images in one evaluation. Since the whole detection pipeline is a single network, it can be optimized end-to-end directly on detection performance. Our unified architecture is extremely fast. Our base YOLO model processes images in real-time at 45 frames per second. A smaller version of the network, Fast YOLO, processes an astounding 155 frames per second while still achieving double the mAP of other real-time detectors. Compared to state-of-the-art detection systems, YOLO makes more localization errors but is less likely to predict false positives on background. Finally, YOLO learns very general representations of objects. It outperforms other detection methods, including DPM and R-CNN, when generalizing from natural images to other domains like artwork.
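
The "detection as regression" framing means the network's final tensor is read as an S x S grid in which each cell predicts B boxes plus class probabilities. The decoder below is a toy illustration of that layout; the tensor ordering and sizes are assumptions, not YOLO's exact output format.

```python
# Toy decoder for a grid-regression output tensor of shape (S, S, B*5 + C).
import numpy as np

def decode_yolo_grid(pred, S=7, B=2, C=20, img_size=448):
    boxes = []
    for i in range(S):          # grid row
        for j in range(S):      # grid column
            cell = pred[i, j]
            class_probs = cell[B * 5:]
            for b in range(B):
                x, y, w, h, conf = cell[b * 5:(b + 1) * 5]
                cx = (j + x) / S * img_size            # x, y are cell-relative offsets
                cy = (i + y) / S * img_size
                bw, bh = w * img_size, h * img_size    # w, h are image-relative
                boxes.append((cx, cy, bw, bh, conf * class_probs.max()))
    return boxes
```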

27,256 citations


"Detecting Oriented Text in Natural ..." refers methods in this paper

  • ...Data Augmentation We adopt an online augmentation pipeline that is similar to that of SSD [14] and YOLO [18]....

    [...]

Proceedings ArticleDOI
23 Jun 2014
TL;DR: R-CNN, as discussed by the authors, combines CNNs with bottom-up region proposals to localize and segment objects; when labeled training data is scarce, supervised pre-training for an auxiliary task, followed by domain-specific fine-tuning, yields a significant performance boost.
Abstract: Object detection performance, as measured on the canonical PASCAL VOC dataset, has plateaued in the last few years. The best-performing methods are complex ensemble systems that typically combine multiple low-level image features with high-level context. In this paper, we propose a simple and scalable detection algorithm that improves mean average precision (mAP) by more than 30% relative to the previous best result on VOC 2012 -- achieving a mAP of 53.3%. Our approach combines two key insights: (1) one can apply high-capacity convolutional neural networks (CNNs) to bottom-up region proposals in order to localize and segment objects and (2) when labeled training data is scarce, supervised pre-training for an auxiliary task, followed by domain-specific fine-tuning, yields a significant performance boost. Since we combine region proposals with CNNs, we call our method R-CNN: Regions with CNN features. We also present experiments that provide insight into what the network learns, revealing a rich hierarchy of image features. Source code for the complete system is available at http://www.cs.berkeley.edu/~rbg/rcnn.
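
The core pipeline step, applying a CNN to each bottom-up region proposal, amounts to cropping every proposal, warping it to a fixed size, and scoring it. The sketch below is schematic: the classifier is a placeholder, whereas R-CNN uses a fine-tuned network followed by per-class SVMs.

```python
# Schematic sketch of scoring warped region proposals with a CNN.
import torch
import torch.nn.functional as F

def classify_proposals(image, proposals, cnn, warp_size=224):
    """image: (1, 3, H, W); proposals: list of integer (x1, y1, x2, y2); cnn: any scorer."""
    scores = []
    for x1, y1, x2, y2 in proposals:
        crop = image[:, :, y1:y2, x1:x2]                       # crop the proposal
        warped = F.interpolate(crop, size=(warp_size, warp_size),
                               mode="bilinear", align_corners=False)
        scores.append(cnn(warped))                             # per-proposal class scores
    return torch.cat(scores, dim=0)
```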

21,729 citations

Book ChapterDOI
08 Oct 2016
TL;DR: The approach, named SSD, discretizes the output space of bounding boxes into a set of default boxes over different aspect ratios and scales per feature map location, which makes SSD easy to train and straightforward to integrate into systems that require a detection component.
Abstract: We present a method for detecting objects in images using a single deep neural network. Our approach, named SSD, discretizes the output space of bounding boxes into a set of default boxes over different aspect ratios and scales per feature map location. At prediction time, the network generates scores for the presence of each object category in each default box and produces adjustments to the box to better match the object shape. Additionally, the network combines predictions from multiple feature maps with different resolutions to naturally handle objects of various sizes. SSD is simple relative to methods that require object proposals because it completely eliminates proposal generation and subsequent pixel or feature resampling stages and encapsulates all computation in a single network. This makes SSD easy to train and straightforward to integrate into systems that require a detection component. Experimental results on the PASCAL VOC, COCO, and ILSVRC datasets confirm that SSD has competitive accuracy to methods that utilize an additional object proposal step and is much faster, while providing a unified framework for both training and inference. For 300×300 input, SSD achieves 74.3% mAP on VOC2007 test at 59 FPS on a Nvidia Titan X and for 512×512 input, SSD achieves 76.9% mAP, outperforming a comparable state-of-the-art Faster R-CNN model. Compared to other single stage methods, SSD has much better accuracy even with a smaller input image size. Code is available at https://github.com/weiliu89/caffe/tree/ssd.
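
The default boxes described here are laid out per feature-map cell, at a given scale and across several aspect ratios. A hedged sketch for one feature map is shown below; the scale and aspect-ratio values are illustrative, not SSD's exact configuration.

```python
# Hedged sketch of generating default boxes for one feature map.
import itertools
import math

def default_boxes(fmap_size, scale, aspect_ratios=(1.0, 2.0, 0.5)):
    """Returns (cx, cy, w, h) boxes in relative [0, 1] image coordinates."""
    boxes = []
    for i, j in itertools.product(range(fmap_size), repeat=2):
        cx = (j + 0.5) / fmap_size              # cell center, x
        cy = (i + 0.5) / fmap_size              # cell center, y
        for ar in aspect_ratios:
            boxes.append((cx, cy, scale * math.sqrt(ar), scale / math.sqrt(ar)))
    return boxes
```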

19,543 citations


"Detecting Oriented Text in Natural ..." refers background or methods in this paper

  • ...We detect segments by estimating the confidence scores and geometric offsets to a set of default boxes [14] on the input image....

    [...]

  • ...The architecture of our network inherits that of SSD [14], a recent object detection model....

    [...]

  • ...Following [14], the fully-connected layers of VGG-16 are converted into convolutional layers (fc6 to conv6; fc7 to conv7)....

    [...]

  • ...An (fast/faster) R-CNN [5, 4, 19] or SSD [14]-style detector may suffer from the difficulty of producing such boxes, owing to its proposal or anchor box design....

    [...]

  • ...Data Augmentation We adopt an online augmentation pipeline that is similar to that of SSD [14] and YOLO [18]....

    [...]