Document Retrieval with Unlimited Vocabulary
Summary
1. Introduction
- Retrieving relevant documents (pages, paragraphs or words) is a critical component in information retrieval solutions associated with digital libraries.
- Though OCR has become the de facto preprocessing step for retrieval, it proves insufficient for degraded books [8], incompatible with older print styles [5], unavailable for specialized scripts [14], and very hard to apply to handwritten documents [1].
- There are two fundamental challenges in using a classifier based solution for word retrieval: (i) a classifier needs a good amount of annotated training data (both positive and negative) for training; (ii) classifiers cannot be built in advance for every possible query word.
- The authors address this without any access to annotated training data for rare words: classifiers are trained for a set of frequent queries, and seamlessly extended to rare and arbitrary queries as and when required.
2. Accurate Classifiers for Frequent and Rare Queries
- The authors' word-level retrieval scheme is a direct application of the SVM classifier.
- The authors train a linear SVM classifier with a few positive examples and a set of randomly sampled negative examples.
- During retrieval, this classifier is evaluated over the dataset images, and a ranked list of word images is predicted.
- For representing word images, the authors prefer a fixed length sequence representation of the visual content, i.e., each word image is represented as a fixed length sequence of vertical strips (of varying width according to the aspect ratio of the word image).
- The authors exploit the sequential nature of the feature representation for on-the-fly synthesis of novel classifiers in Section 2.2.1.
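As an illustration of the fixed-length strip representation, the sketch below cuts a binarized word image into vertical strips and summarizes each with simple profile features. The specific features chosen here (upper/lower profile, ink density, transition count) are assumptions for the example; the paper's exact profile features may differ in detail.

```python
import numpy as np

def profile_features(word_img, n_strips=40):
    """Represent a binarized word image (2-D array, ink = 1) as a fixed-length
    sequence of vertical strips; strip widths follow the image width, so wider
    words (larger aspect ratio) get wider strips."""
    h, w = word_img.shape
    bounds = np.linspace(0, w, n_strips + 1).astype(int)
    feats = []
    for i in range(n_strips):
        strip = word_img[:, bounds[i]:max(bounds[i] + 1, bounds[i + 1])]
        col = strip.mean(axis=1)                  # average ink per row
        ys = np.nonzero(col > 0)[0]
        upper = ys[0] / h if ys.size else 1.0     # upper profile
        lower = ys[-1] / h if ys.size else 0.0    # lower profile
        density = col.mean()                      # projection-profile ink density
        transitions = np.count_nonzero(np.diff(col > 0))  # ink/background flips
        feats.append([upper, lower, density, transitions])
    return np.asarray(feats)                      # fixed shape: (n_strips, 4)
```

Because every word image maps to the same number of strips, the resulting sequences can be compared strip-by-strip, which is what the later classifier synthesis relies on.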
2.1. Efficient Classifier based Retrieval
- SVM gives maximum margin hyperplane separating the positive and negative instances.
- Another demerit of the exemplar SVM (ESVM) is its large overall training time, since a separate SVM needs to be trained for each exemplar.
- Gharbi et al. [6] provide another alternative for fast training of exemplar SVM.
- The authors also assume a Gaussian distribution over the feature space and hence use the normal vector to the Gaussian at the query point as an approximation of the SVM weight, using this weight vector for retrieval.
- Designing a query-specific classifier then requires only a few d² multiplications.
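A minimal sketch of this closed-form, LDA-style classifier design, assuming the dataset covariance Σ (inverted once, offline) and mean µ0 are precomputed; the per-query cost is then a single d×d matrix–vector product:

```python
import numpy as np

def lda_weight(x_q, mu0, sigma_inv):
    """Closed-form approximation of the SVM weight for query vector x_q:
    w_q = Sigma^{-1} (x_q - mu0). Since sigma_inv is precomputed, each new
    query costs only about d^2 multiplications."""
    return sigma_inv @ (x_q - mu0)

def rank_dataset(w_q, F):
    """Rank dataset feature vectors (rows of F) by classifier score w_q^T F_i,
    highest score first."""
    return np.argsort(-(F @ w_q))
```

In practice the covariance would be regularized (e.g. Σ + εI) before inversion; that detail is omitted here for brevity.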
2.2. Classifier design for rare queries
- It is not practical to build classifiers for all the possible words.
- The authors show that the SVM classifiers corresponding to the ngrams can be effectively composed to generate novel classifiers on the fly.
- If separate classifiers had to be built for all possible words, the overall performance could be poor.
- The authors consider the problem of finding µq for the query class as the classifier synthesis problem outlined above.
- The authors select the 10 most similar mean vectors and use them in the subsequence DTW.
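The selection of the closest word classes by normalized dot product (cosine similarity) can be sketched as follows; the flattened-vector comparison is an assumption for the example:

```python
import numpy as np

def closest_classes(x_q, means, k=10):
    """Return indices of the k word-class mean vectors most similar to the
    query, by normalized dot product. `means` holds one flattened mean
    feature vector per known word class (one row each)."""
    M = means / np.linalg.norm(means, axis=1, keepdims=True)
    q = x_q / np.linalg.norm(x_q)
    sims = M @ q                      # cosine similarity to every class mean
    return np.argsort(-sims)[:k]      # top-k, most similar first
```

Only these top-k mean vectors are then passed to the (comparatively expensive) subsequence DTW step, which keeps the synthesis cheap.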
3. Efficient and Accurate Retrieval
- When a direct classifier is used for the frequent words, retrieval is efficient since this requires only the evaluation of the classifiers.
- For the rare words, the authors use the DQC classifier which requires a DP based selection from multiple composite classifiers.
- This DP based strategy affects the efficiency and accuracy of the solution to some extent.
- An index is built over all the database vectors and those vectors similar to query vector xq are identified by performing approximate nearest neighbor search over the index.
- This is much smaller compared to the time complexity O(NRd) for subsequence DTW matching.
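A brute-force stand-in for the portion index can illustrate the idea: cut every class mean vector into fixed-length portions of length R, index them, and match the query's portions against the index (the paper uses FLANN for the approximate nearest neighbor search; the exhaustive search and the simple voting scheme below are assumptions for illustration).

```python
import numpy as np

def build_portion_index(means, R):
    """Cut each class mean vector into consecutive portions of length R and
    stack them into one matrix, remembering which class each portion came
    from. A real system would index these portions with FLANN."""
    portions, owners = [], []
    for cls, m in enumerate(means):
        for start in range(0, len(m) - R + 1, R):
            portions.append(m[start:start + R])
            owners.append((cls, start))
    return np.asarray(portions), owners

def match_query(x_q, portions, owners, R):
    """Match each fixed-length portion of the query against the index and let
    the owning word classes vote; return the class with most votes."""
    votes = {}
    for start in range(0, len(x_q) - R + 1, R):
        d = np.linalg.norm(portions - x_q[start:start + R], axis=1)
        cls, _ = owners[int(np.argmin(d))]
        votes[cls] = votes.get(cls, 0) + 1
    return max(votes, key=votes.get)
```

With an ANN index in place of the exhaustive `argmin`, each portion lookup costs roughly O(RBD), which is where the improvement over O(NRd) subsequence DTW matching comes from.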
4. Experiments, Results and Discussions
- The authors validate the DQC classifier synthesis method on multiple word image collections and also demonstrate its quantitative and qualitative performance.
- Figure 3 gives more qualitative examples of the retrieval.
- During the DQC evaluation, the authors discard the trained classifiers and mean vectors for the chosen query word classes.
- The authors also compare the average retrieval time for frequent and rare queries.
- Page retrieval is performed based on the score given by query weight to different word images present in the page.
5. Conclusion
- The authors have described a classifier based retrieval scheme for effectively retrieving word and document images from a collection.
- The authors argue that the classifier based method is superior to the OCR in practice.
- The authors introduce a novel classifier synthesis scheme which enables the design of classifiers without any explicit training data.
- For this, the authors exploit the fact that words in a language can be formed from a much smaller set of character sequences (ngrams).
Frequently Asked Questions (21)
Q2. What have the authors stated for future works in "Document retrieval with unlimited vocabulary" ?
One of their future works is to design an efficient and scalable retrieval system which uses linear SVM classifiers in the back end.
Q3. What is the way to find the alignment of the cut-portions of the query?
Subsequence DTW is used to find the best alignment of the cut-portions of the query feature vector with the concatenated mean vectors of the closest 10 word classes.
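A minimal subsequence DTW can be sketched as below: the query must be matched in full, but it may start and end anywhere along the longer concatenated sequence of mean vectors (the "free" first row of the cost matrix encodes the free start point).

```python
import numpy as np

def subsequence_dtw(query, sequence):
    """Subsequence DTW: align the whole query against the best-matching
    subsequence of `sequence` (both are arrays of per-strip feature vectors,
    one row per strip). Returns the minimal accumulated alignment cost."""
    n, m = len(query), len(sequence)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, :] = 0.0                           # free start anywhere in `sequence`
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(query[i - 1] - sequence[j - 1])
            D[i, j] = cost + min(D[i - 1, j - 1], D[i - 1, j], D[i, j - 1])
    return D[n].min()                       # free end anywhere in `sequence`
```

Keeping the argmin path (not shown) would additionally recover which portions of which class means the query aligned to, which is what the classifier synthesis needs.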
Q4. How do the authors reduce the time complexity of the classifier?
To reduce the time complexity, the authors compute the normalized dot product between the query vector and all the mean vectors of the known classes.
Q5. How do the authors ensure monotonicity of the sequence ai?
The authors ensure monotonicity of the sequence {ai} by using a fixed sequence for {ai}, thus avoiding optimization over the set of indices {ai}.
Q6. What is the normal vector to the maximum margin hyperplane?
For a query word xq, an SVM classifier wq (wq is the normal vector to the maximum-margin hyperplane) is learned during training, and for retrieval, database images are sorted based on the score wqᵀFi.
Q7. What is the disadvantage of the classifier based scheme?
A major disadvantage of the classifier based scheme is the difficulty in indexing, which is important if the method needs to scale to millions of document images.
Q8. What is the way to use a direct classifier?
When a direct classifier is used for the frequent words, retrieval is efficient since this requires only the evaluation of the classifiers.
Q9. What are the challenges in using a classifier based solution for word retrieval?
There are two fundamental challenges in using a classifier based solution for word retrieval: (i) a classifier needs a good amount of annotated training data (both positive and negative) for training; (ii) classifiers cannot be built in advance for every possible query word.
Q10. What is the normal vector to the Gaussian at query point xq?
Assuming a Gaussian distribution over feature space, the authors give closed form expression for the normal vector to the Gaussian at query point xq .
Q11. What is the generalized expression for LDA weights?
The generalized expression for LDA weights is given as w = Σ⁻¹(µ+ − µ−), where µ+ and µ− are the means of the positive and negative examples respectively.
Q12. What is the definition of a ngram?
An ngram is a contiguous sequence of n characters. In many practical applications related to text processing, a finite set of ngrams is used to cover the vocabulary, extending small-vocabulary solutions to unlimited-vocabulary settings.
Q13. What is the way to represent word images?
For representing word images, the authors prefer a fixed length sequence representation of the visual content, i.e., each word image is represented as a fixed length sequence of vertical strips (of varying width according to the aspect ratio of the word image).
Q14. Why is DTW not used in the SVM classifier?
Note that DTW cannot be directly used in the SVM classifier, since the corresponding kernel would not be positive semidefinite, and moreover because of its computational complexity.
Q15. What is the purpose of this paper?
In this paper, the authors have described a classifier based retrieval scheme for effectively retrieving word and document images from a collection.
Q16. How high is the mAP of a SVM based word retrieval?
The authors show, later in this paper, that an SVM based word retrieval can give a mean average precision (mAP) as high as 1.0, even when the OCR based solution is limited to a mAP of 0.89.
Q17. What is the LDA weight for the dq problem?
The LDA weight wq is given as wq = Σ⁻¹(µq − µ0), where Σ and µ0 are the covariance and mean computed over the entire dataset of word images.
Q18. What is the way to improve the efficiency of the DP DQC synthesis?
The authors discuss two refinements that improve the efficiency and accuracy of retrieval: NN DQC, an approximate nearest neighbor based implementation of DQC, and query expansion, for adapting DQC to a previously unseen word collection without any new training data.
Q19. What is the way to reduce training time for a negative example?
One approach to reduce training time is to make the negative example mining step offline and to select a common set of negative examples [17].
Q20. What are the salient observations out of this experiment?
Some of the salient observations out of this experiment are (i) OCR performance is inferior to the SVM based retrieval in all the cases.
Q21. What is the way to build a novel classifier?
2.2.1 DP DQC: DQC Design using Dynamic Programming. Given a set of linear classifiers Ww = {w1, w2, ..., wN} for the N most frequent queries and a query feature vector xq, the authors synthesize a novel classifier wq as a piecewise fusion of parts from the available classifiers in Ww (see Figure 2).
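A greatly simplified, hypothetical sketch of the piecewise-fusion idea follows. The paper's DP formulation additionally enforces a monotonic alignment via subsequence DTW; the greedy per-portion choice below is an assumption made purely to illustrate how segments of existing classifiers can be stitched into a new one.

```python
import numpy as np

def synthesize_classifier(x_q, means, W, R):
    """Greedy sketch of piecewise classifier fusion: for each fixed-length
    portion of the query, borrow the matching segment of the classifier whose
    class mean is closest on that portion. `means` and `W` have one row per
    known word class (mean feature vector and trained linear classifier)."""
    d = len(x_q)
    w_q = np.empty(d)
    for start in range(0, d, R):
        seg = slice(start, min(start + R, d))
        dists = np.linalg.norm(means[:, seg] - x_q[seg], axis=1)
        w_q[seg] = W[int(np.argmin(dists)), seg]   # copy best classifier's segment
    return w_q
```

Even this toy version shows the key property of DQC: the synthesized wq is assembled entirely from already-trained classifiers, so no new training data is needed for a rare query.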