Whole is Greater than Sum of Parts: Recognizing Scene Text Words
Summary (2 min read)
I. INTRODUCTION
- The document image analysis community has shown a huge interest in the problem of scene text understanding in recent years [6], [15], [19].
- In [18] , each word in the lexicon is matched to the detected set of character windows, and the one with the highest score is reported as the predicted word.
- This strongly top-down approach is prone to errors when characters are missed or detected with low confidence.
- The authors' approach, however, differs from that of [18].
- The main contribution of their work: the authors show that holistic word recognition for scene text images is possible with high accuracy, achieving a significant improvement over prior art.
II. WORD REPRESENTATION AND MATCHING
- The authors extract features from the image, and match them with those computed for each word in the lexicon.
- To this end, the authors present a gradient based feature set, and then a weighted Dynamic Time Warping scheme in the remainder of this section.
- The gradient orientations are accumulated into histograms over vertical strips extracted from the image.
- The problem is how to match the scene text and the synthetic lexicon-based images.
- To this end, the authors cluster all the feature vectors computed over vertical strips of the synthetic images, and compute the entropy of each cluster over the word classes: H(cluster_j) = −Σ_k p_k log p_k, where p_k is the fraction of features in cluster j that come from word class k.
- High entropy of a cluster indicates that the features corresponding to that cluster are almost equally distributed in all the word classes.
- In other words, such features are less informative, and thus are assigned a low weight during matching.
- To heavily penalize warping paths that deviate from the near-diagonal path, the authors multiply them by a penalty function log10(wp − wpo), where wp and wpo are the warping path of the DTW matching and the diagonal warping path, respectively.
- Given a scene text and a ranked list of matched synthetic words (each corresponding to one of the lexicon words), their goal is to find the text label.
- Randomness is maximum when all the top k retrievals are different words, and minimum (i.e., zero) when all the top k retrievals are the same.
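The randomness of the top-k retrievals can be read as the entropy of their label distribution. A minimal Python sketch of this dynamic k-NN idea; the function names, the k schedule, and the threshold are illustrative assumptions rather than the paper's exact rule:

```python
from collections import Counter
import math


def label_randomness(labels):
    """Shannon entropy of the label distribution among the top-k retrievals.

    Zero when all retrieved synthetic words share one label (a confident
    match); maximal when every retrieved label is distinct.
    """
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def dynamic_knn(ranked_labels, k_min=3, k_max=11, threshold=1.0):
    """Grow k until the top-k labels are consistent enough, then vote.

    `k_min`, `k_max`, and `threshold` are assumed parameters, not values
    from the paper.
    """
    for k in range(k_min, min(k_max, len(ranked_labels)) + 1):
        top = ranked_labels[:k]
        if label_randomness(top) <= threshold:
            return Counter(top).most_common(1)[0][0]
    # Fall back to a plain majority vote over the smallest neighborhood.
    return Counter(ranked_labels[:k_min]).most_common(1)[0][0]
```

For example, `dynamic_knn(["cafe", "cafe", "cafe", "bar"])` stops at k = 3, where all labels agree, and returns `"cafe"`.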
B. Implementation Details
- For every lexicon word, the authors generated synthetic words in 20 different styles and fonts using ImageMagick.
- The authors' observations suggest that font selection is not a crucial step for the overall performance of their method.
- A five-pixel padding was applied to all the images.
- The authors used the binarization method in [10] prior to computing the profile features.
- Given a scene text image to recognize, the authors retrieve word images from the database of synthetic words.
C. Comparison with Previous Work
- The authors retrieve synthetic word images corresponding to lexicon words and use dynamic k-NN to assign text label to a given scene text image.
- A specific preprocessing or more variations in the synthetic dataset may be needed to deal with such fonts.
- Fig. 4 shows the qualitative performance of the proposed method on sample images.
- In addition to being simple, their method significantly improves on the prior art.
- This gain in accuracy can be attributed to the robustness of their method, which (i) does not rely on character segmentation, performing holistic word recognition instead; and (ii) learns the discriminativeness of features in a principled way and uses this information for robust matching with wDTW.
IV. CONCLUSION
- The authors' method neither requires character segmentation nor relies on binarization; instead, it performs holistic word recognition (project page: cvit.iiit.ac.in/projects/SceneTextUnderstanding/).
- The authors show significantly improved performance over the most recent works from 2011 and 2012.
- The authors thus establish a new state-of-the-art on lexicon-driven scene text recognition.
- The robustness of their word matching approach suggests that a natural extension of this work lies in the direction of "text to scene image" retrieval.
Frequently Asked Questions (17)
Q2. What are the future works in "Whole is greater than sum of parts: recognizing scene text words" ?
As a part of future work, the authors would explore the benefits of introducing hidden Markov models for this problem.
Q3. What is the way to find a feature sequence?
In a maximum likelihood framework, the problem of finding an optimal feature sequence Y for a given feature sequence X is equivalent to maximizing ∏i P(xi, yi | ωk) over all possible Y.
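This equivalence can be checked numerically: maximizing the product of per-strip probabilities selects the same candidate as minimizing the sum of their negative logs (which is what a DTW-style cost accumulates). A small sketch with made-up probability values:

```python
import math

# Per-strip joint probabilities P(x_i, y_i | w_k) for two hypothetical
# candidate sequences; the values are illustrative, not from the paper.
probs_a = [0.9, 0.8, 0.7]
probs_b = [0.6, 0.9, 0.5]


def likelihood(probs):
    """Product of per-strip probabilities."""
    out = 1.0
    for p in probs:
        out *= p
    return out


def neg_log_cost(probs):
    """Sum of negative log probabilities (an additive matching cost)."""
    return sum(-math.log(p) for p in probs)


candidates = {"a": probs_a, "b": probs_b}
best_by_likelihood = max(candidates, key=lambda s: likelihood(candidates[s]))
best_by_cost = min(candidates, key=lambda s: neg_log_cost(candidates[s]))

# Maximizing the product and minimizing the summed -log cost agree.
assert best_by_likelihood == best_by_cost
```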
Q4. How did the authors extract the histogram of gradient orientation features?
The authors used vertical strips of width 4 pixels and a 2-pixel horizontal shift to extract the histogram of gradient orientation features.
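A sketch of strip-wise gradient-orientation histograms under those settings (strip width 4, horizontal shift 2); the number of orientation bins and the per-strip normalization are assumptions, not values from the paper:

```python
import numpy as np


def strip_hog_features(gray, strip_width=4, shift=2, n_bins=8):
    """Histograms of gradient orientations over sliding vertical strips.

    `gray` is a 2-D float array (H x W). Strips are `strip_width` pixels
    wide and slide horizontally by `shift` pixels, matching the settings
    reported in the summary; the binning and normalization are assumed.
    """
    gy, gx = np.gradient(gray)
    mag = np.hypot(gx, gy)
    ang = np.mod(np.arctan2(gy, gx), np.pi)  # unsigned orientation in [0, pi)
    feats = []
    for x in range(0, gray.shape[1] - strip_width + 1, shift):
        m = mag[:, x:x + strip_width].ravel()
        a = ang[:, x:x + strip_width].ravel()
        # Magnitude-weighted orientation histogram for this strip.
        hist, _ = np.histogram(a, bins=n_bins, range=(0.0, np.pi), weights=m)
        total = hist.sum()
        feats.append(hist / total if total > 0 else hist)
    return np.array(feats)  # one n_bins-dim descriptor per strip
```

For a 16 x 32 image this yields 15 strip descriptors, one per 2-pixel shift, which then form the sequence matched by DTW.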
Q5. What datasets were used for the experimental analysis?
For the experimental analysis the authors used two datasets, namely Street View Text (SVT) [1] and ICDAR 2003 robust word recognition [2].
Q6. What is the problem of finding the optimal matching sequence?
In other words, given a feature sequence X and a set of candidate sequences Y, the problem of finding the optimal matching sequence becomes minimizing f over all candidate sequences Y.
Q7. What is the way to write a histogram of a word?
Since the authors assume the features at each vertical strip are independent, the joint probability that the feature sequences X and Y originate from the same word ωk, i.e. P(X, Y | ωk), can be written as the product of the joint probabilities of features originating from the same strip: P(X, Y | ωk) = ∏i P(xi, yi | ωk).
Q8. What is the DTW distance between two time series?
To heavily penalize warping paths that deviate from the near-diagonal path, the authors multiply them by a penalty function log10(wp − wpo), where wp and wpo are the warping path of the DTW matching and the diagonal warping path, respectively.
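The penalty as stated is hard to implement literally, so the sketch below uses an assumed variant of weighted DTW: a per-cell cost for deviating from the diagonal, scaled by `alpha`, added on top of a feature-weighted squared distance. The weight vector stands in for the entropy-based feature weights described above:

```python
import numpy as np


def weighted_dtw(X, Y, weights=None, alpha=0.1):
    """Weighted DTW distance between feature sequences X (m x d) and Y (n x d).

    `weights` down-weights uninformative feature dimensions (e.g. the
    entropy-based weights from the clustering step); `alpha` scales an
    extra per-cell penalty for straying from the diagonal path. The paper's
    exact log10 penalty differs; this is an illustrative variant.
    """
    X, Y = np.asarray(X, float), np.asarray(Y, float)
    m, n = len(X), len(Y)
    w = np.ones(X.shape[1]) if weights is None else np.asarray(weights, float)
    D = np.full((m + 1, n + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = np.sum(w * (X[i - 1] - Y[j - 1]) ** 2)
            # Penalize cells far from the diagonal path i/m == j/n.
            dev = abs((i - 1) / max(m - 1, 1) - (j - 1) / max(n - 1, 1))
            cost += alpha * dev
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[m, n]
```

Identical sequences match along the diagonal at zero cost; any feature difference or off-diagonal warping increases the distance.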
Q9. What is the method for calculating the profile features?
Profile features have shown noteworthy performance on tasks such as handwritten and printed word spotting, but fail to cope with the additional complexities in scene text (e.g., low contrast, noise, blur, large intra-class variations).
Q10. What is the goal of dynamic k-NN?
Given a scene text and a ranked list of matched synthetic words (each corresponding to one of the lexicon words), their goal is to find the text label.
Q11. What is the significance of the proposed method?
This gain in accuracy can be attributed to the robustness of their method, which (i) does not rely on character segmentation, performing holistic word recognition instead; and (ii) learns the discriminativeness of features in a principled way and uses this information for robust matching with wDTW.
Q12. How did the authors adapt the HOG features to the word recognition problem?
Inspired by the success of Histogram of Oriented Gradient (HOG) features [7] in many vision tasks, the authors adapted them to the word recognition problem.
Q13. What is the entropy of a cluster?
High entropy of a cluster indicates that the features corresponding to that cluster are almost equally distributed in all the word classes.
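A small sketch of this entropy computation; the count-matrix representation and the `1 / (1 + H)` entropy-to-weight mapping are illustrative assumptions, chosen only to be monotone decreasing in entropy:

```python
import numpy as np


def cluster_entropy(class_counts):
    """Entropy of a cluster's feature distribution over word classes.

    `class_counts[k]` counts how many feature vectors assigned to this
    cluster come from word class k. High entropy means the cluster's
    features occur in nearly all classes, so they are less informative.
    """
    counts = np.asarray(class_counts, float)
    p = counts / counts.sum()
    p = p[p > 0]  # ignore empty classes; 0 * log 0 is taken as 0
    return float(-np.sum(p * np.log2(p)))


def entropy_weights(count_matrix):
    """One weight per cluster: low-entropy (informative) clusters weigh more.

    The exact entropy-to-weight mapping is not given in this summary;
    1 / (1 + H) is an assumed choice.
    """
    H = np.array([cluster_entropy(row) for row in count_matrix])
    return 1.0 / (1.0 + H)
```

A cluster whose features all come from one word class gets entropy 0 and weight 1; a cluster spread uniformly over classes gets maximal entropy and the smallest weight.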
Q14. How many words were generated using the ICDAR protocol?
Following the protocol of [18], the authors ignore words with less than two characters or with non-alphanumeric characters, which results in 863 words overall.
Q15. What is the way to describe the word matching problem?
Let X = {x1, x2, . . . , xm} and Y = {y1, y2, . . . , ym} be the feature sequences from a given word and its candidate match respectively.
Q16. What is the way to find the match?
In summary, given a scene text word and a set of lexicon words, the authors transform each lexicon word into a collection of synthetic images, and then represent each image as a sequence of features.
Q17. What is the purpose of this article?
To this end, the authors present a gradient based feature set, and then a weighted Dynamic Time Warping scheme in the remainder of this section.