Enhancing Word Image Retrieval in Presence of Font Variations
Summary (3 min read)
Introduction
- Font and style variations make the problem of recognition and retrieval challenging while working with large and diverse document image databases.
- Commonly, a classifier is trained with a certain set of fonts available a priori, and generalization across fonts is hoped for, relying either on the quality of the features or on the power of the classifier.
- A natural extension of the query expansion in cross document word image retrieval could be to automatically reformulate the query word in multiple fonts.
- Euclidean distance is often preferred for scalability in retrieval [7].
- Transfer learning may involve (i) feature transformations, e.g. updating the regression matrix [11] or the LDA transformation matrix [12], or (ii) classifier adaptation, e.g. retraining strategies for neural networks [13], SVMs [14], etc.
II. DIRECT APPROACHES
- A common approach to deal with font variations is to heuristically define and extract features.
- Then one empirically validates the insensitivity of these features to font variations across multiple fonts.
- Profile based representation [5], [17] is one such popular feature.
- Use of a DTW-based sequence alignment further improves the robustness of retrieval, as DTW can absorb local variations in the sequences.
- Another possible approach for handling font variations is to reformulate the query word image in the target document font.
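The DTW-based alignment mentioned above can be sketched as follows. This is a minimal illustration assuming 1-D column-profile sequences and an absolute-difference local cost; the function name and the cost choice are illustrative, not the paper's exact formulation:

```python
import numpy as np

def dtw_distance(a, b):
    """DTW distance between two 1-D feature sequences (e.g. column profiles).

    Dynamic programming over a cost matrix; each cell takes the cheapest of
    match, insertion, or deletion, which lets the alignment stretch or
    compress locally to absorb small font-induced variations.
    """
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j - 1],  # match
                                 D[i - 1, j],      # deletion
                                 D[i, j - 1])      # insertion
    return D[n, m]
```

Note how a locally stretched sequence still aligns at zero cost, which rigid Euclidean matching would penalize.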
A. Style Transfer
- A style transfer strategy has been used in the past for handwriting recognition.
- This results in a specific model for each user.
- A straightforward method to do style transfer of the query is to decompose it into style and content factors using a bilinear model [10].
- The authors show such style transfer examples in Figure 2.
- In addition, a serious limitation of using this style transfer approach in large multi-font databases is the need for some labeled examples of all the distinct words in the database for each of the fonts.
III. QUERY EXPANSION USING SEMI-SUPERVISED STYLE TRANSFER
- In the retrieval setting, the authors have a single example to transfer the style.
- An initial seed image is reformulated into multiple versions, all of which share the underlying word label.
- The style basis A_r is obtained by solving the following optimization problem: min_{A_r} ||Y_r - A_r B_r||^2_F + lambda ||A_r - A_s||^2_F. (3) Here, the columns of B_r are a subset of the columns of B_c.
- Using the original pixel-based representation of word images for performing style transfer has a few shortcomings.
- Using a low dimensional profile feature representation reduces the computation required for model learning as well as retrieval.
- The authors represent each word image by its profile feature representation (Section V) and stack the mean vector for each word label along the column of matrix Y t.
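The regularized problem in Eq. (3) has a closed-form solution: setting the gradient with respect to A_r to zero gives A_r = (Y_r B_r^T + lambda A_s)(B_r B_r^T + lambda I)^{-1}. A minimal sketch of this update (NumPy assumed; `update_style_basis` is a hypothetical name, not the paper's code):

```python
import numpy as np

def update_style_basis(Yr, Br, As, lam):
    """Closed-form minimizer of ||Yr - Ar @ Br||_F^2 + lam * ||Ar - As||_F^2.

    Setting the gradient w.r.t. Ar to zero yields
    Ar = (Yr @ Br.T + lam * As) @ inv(Br @ Br.T + lam * I),
    i.e. a ridge-regularized least squares pulled toward the source style As.
    """
    k = Br.shape[0]
    return (Yr @ Br.T + lam * As) @ np.linalg.inv(Br @ Br.T + lam * np.eye(k))
```

The lambda term keeps the target style basis close to the source style when only a few target examples are available, which is the semi-supervised aspect of the transfer.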
IV. KERNALIZED STYLE-CONTENT SEPARATION
- To make linear models more robust, it is a common practice to first map the feature vectors in the original space to a high dimensional space and then learn the linear model over the high dimensional space.
- The authors call their nonlinear version of the bilinear model the asymmetric kernel bilinear model (AKBM).
- Since style basis vectors lie in the same feature space as the observation vectors, each basis vector (each column of At) can be expressed as a linear combination of the mapped observation vectors, hence At can be represented as: At = φ(Y t)α.
- The authors solve this optimization problem by alternately keeping one of the two factors as constant and optimizing for the other factor.
- Now, to use these nonlinear basis vectors to perform retrieval on the target dataset, the authors represent all the word images from the target dataset by solving min_{b_r^i} ||phi(y_r^i) - phi(Y_t) alpha b_r^i||^2, where y_r^i is the profile feature representation of the ith image from the target dataset.
V. EXPERIMENTS, RESULTS AND DISCUSSIONS
- The authors compare the retrieval performance for the following three cases: 1) Query word images from the training dataset are used directly to perform retrieval on the target dataset (i.e., font independent feature definitions).
- 2) Semi-supervised style transfer as discussed in Sec. III.
- 3) Asymmetric kernel bilinear model as discussed in Sec. IV.
A. Data Sets, Implementation and Evaluation Protocol
- These datasets, detailed in Table I, comprise scanned English books from a digital library collection.
- The authors manually created the ground truth at word level for the quantitative evaluation of their proposed retrieval approaches.
- Each of the datasets D1 - D5 is subdivided into training, testing and validation sets, with each set containing one-third of the word images for each word label.
- Bilinear models are learned from the examples in training set.
- 2) Upper and lower word profiles, which encode, for each column, the distance from the top (bottom) boundary to the top-most (bottom-most) ink pixel.
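These profile features can be computed column by column from a binarized word image. A minimal sketch (True = ink; assigning the full image height to ink-free columns is an assumption, not the paper's stated convention):

```python
import numpy as np

def word_profiles(img):
    """Upper and lower profiles of a binary word image (True = ink).

    upper[j]: distance from the top boundary to the top-most ink pixel in
    column j; lower[j]: distance from the bottom boundary to the bottom-most
    ink pixel. Columns with no ink get the full image height.
    """
    h, w = img.shape
    rows = np.arange(h)[:, None]
    has_ink = img.any(axis=0)
    top = np.where(img, rows, h).min(axis=0)      # first ink row per column
    bottom = np.where(img, rows, -1).max(axis=0)  # last ink row per column
    upper = np.where(has_ink, top, h)
    lower = np.where(has_ink, h - 1 - bottom, h)
    return upper, lower
```

Stacking such low-dimensional per-column profiles gives the compact representation used in place of raw pixels.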
B. Retrieval Experiments
- In Table II, the authors compare the retrieval performance of font independent feature definitions (no transfer), semi-supervised style transfer (SSST) and asymmetric kernel bilinear model (AKBM).
- Using this kernel bilinear model, the authors obtain content vector representation for all of the target dataset word images and use them to perform nearest neighbor based retrieval on the basis of their distance with the content vectors corresponding to query labels from the training dataset.
- In Figure 5, the authors show a few query images and the corresponding retrieval results, on D1 - D4, obtained using AKBM.
- Since the training and target fonts are too dissimilar, the retrieval performance of all three approaches goes down; however, AKBM still performs much better than the other two approaches.
- SSST performs comparably to the supervised style transfer in this case.
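The nearest neighbor retrieval over content vectors described above reduces to ranking target images by Euclidean distance to the query's content vector. A minimal sketch (function and variable names are illustrative):

```python
import numpy as np

def retrieve(query_content, target_contents, k=5):
    """Rank target word images by Euclidean distance between content vectors.

    query_content: (d,) content vector of the query label.
    target_contents: (d, n) content vectors of the n target word images,
    stored as columns. Returns indices of the k nearest target images.
    """
    dists = np.linalg.norm(target_contents - query_content[:, None], axis=0)
    return np.argsort(dists)[:k]
```

Since content vectors are meant to be style-free, ranking by plain Euclidean distance keeps retrieval scalable, in line with the preference noted in the introduction [7].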
VI. CONCLUSION
- The authors have proposed strategies for doing word image retrieval in a multi-font database.
- To deal with the style variations between different documents, the authors have proposed a semi-supervised style transfer strategy.
- The authors have also suggested a font independent retrieval strategy by representing words from all the documents using the same set of high dimensional basis vectors.
- The authors have shown results on various datasets varying in font.
Frequently Asked Questions (13)
Q2. What are the future works mentioned in the paper "Enhancing word image retrieval in presence of font variations" ?
Their future work will be to learn the font/style independent features from a large collection of document images.
Q3. What are the parameters used to perform retrieval on the test set?
Optimal value for kernel parameters and the regularization factors β and λ are found by performing retrieval on the validation set and these optimal parameters are then used while performing retrieval on the test set.
Q4. What is the way to make a linear model more robust?
To make linear models more robust, it is a common practice to first map the feature vectors in the original space to a high dimensional space and then learn the linear model over the high dimensional space.
Q5. What is the hypothesis of style transfer?
Their hypothesis is that a style-transformed query would be closer to the correct matches and would lead to better performance of the nearest neighbor classifier.
Q6. What is the common approach to addressing font style variations in word image retrieval?
For addressing font style variations in word image retrieval, a common strategy is to use some font independent feature representation.
Q7. What is the easiest method to do style transfer of a query?
A straightforward method to do style transfer of the query is to decompose it into style and content factors using a bilinear model [10].
Q8. How can the authors represent ith word labels using asymmetric bilinear model?
The ith column of Y_t, corresponding to the mean vector of the ith word label, can be represented using the asymmetric bilinear model as y_t^i = A_t b_c^i, where b_c^i is the content vector of the ith word label.
Q9. How is the retrieval performed on target dataset?
Now the retrieval is performed on target dataset on the basis of distance between the content vector of query word images and content vector of target dataset word images.
Q10. How do you find fonts in documents?
The authors have also suggested a font independent retrieval strategy by representing words from all the documents using the same set of high dimensional basis vectors.
Q11. How do the authors obtain content vector representation for all of the target dataset word images?
Using this kernel bilinear model, the authors obtain content vector representation for all of the target dataset word images and use them to perform nearest neighbor based retrieval on the basis of their distance with the content vectors corresponding to query labels from the training dataset.
Q12. What is the way to transfer word images?
Style basis A_s and content vectors B_c (each column is a content vector corresponding to a word label) can be obtained by solving the following optimization problem: min_{A_s, B_c} ||Y_s - A_s B_c||^2_F. (2) If the same number of word images are available for all the word labels, this problem can be solved with the help of the SVD of the matrix Y_s. Consider the task of rendering word images in a new font using the asymmetric bilinear model.
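The SVD route mentioned in this answer follows from the Eckart-Young theorem: the truncated SVD gives the rank-k minimizer of ||Y_s - A_s B_c||^2_F. A minimal sketch (the split of the singular values between the two factors is a conventional choice, not necessarily the paper's):

```python
import numpy as np

def fit_asymmetric_bilinear(Ys, k):
    """Rank-k factorization Ys ~ As @ Bc via truncated SVD.

    By Eckart-Young, this is the minimizer of ||Ys - As @ Bc||_F^2
    over all rank-k factor pairs. Singular values are absorbed into
    the style basis As; Bc holds one content vector per column of Ys.
    """
    U, s, Vt = np.linalg.svd(Ys, full_matrices=False)
    As = U[:, :k] * s[:k]   # style basis, columns scaled by singular values
    Bc = Vt[:k, :]          # content vectors
    return As, Bc
```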
Q13. What is the way to achieve font independence?
The kernelized version of the bilinear model is able to achieve font independence and improves mAP scores by up to 0.30 for word image retrieval.