
Document Retrieval with Unlimited Vocabulary

TL;DR: This paper uses SVM classifiers for word retrieval, argues that classifier based solutions can be superior to OCR based solutions in many practical situations, and designs a one-shot learning scheme for dynamically synthesizing classifiers.
Abstract: In this paper, we describe a classifier based retrieval scheme for efficiently and accurately retrieving relevant documents. We use SVM classifiers for word retrieval, and argue that the classifier based solutions can be superior to the OCR based solutions in many practical situations. We overcome the practical limitations of the classifier based solution in terms of limited vocabulary support, and availability of training data. In order to overcome these limitations, we design a one-shot learning scheme for dynamically synthesizing classifiers. Given a set of SVM classifiers, we appropriately join them to create novel classifiers. This extends the classifier based retrieval paradigm to an unlimited number of classes (words) present in a language. We validate our method on multiple datasets, and compare it with popular alternatives like OCR and word spotting. Even on a language like English, where OCRs have been fairly advanced, our method yields comparable or even superior results. Our results are significant since we do not use any language specific post-processing for obtaining this performance. For better accuracy of the retrieved list, we use query expansion. This also allows us to seamlessly adapt our solution to new fonts, styles and collections.

Summary (2 min read)

1. Introduction

  • Retrieving relevant documents (pages, paragraphs or words) is a critical component in information retrieval solutions associated with digital libraries.
  • Though OCRs have become the de facto preprocessing for retrieval, they are insufficient for degraded books [8], incompatible with older print styles [5], unavailable for specialized scripts [14], and very hard for handwritten documents [1].
  • There are two fundamental challenges in using a classifier based solution for word retrieval: (i) a classifier needs a good amount of annotated training data (both positive and negative) for training, and (ii) classifiers trained for a given set of frequent queries are not applicable to rare queries.
  • The authors introduce a one-shot classifier learning scheme that enables the direct design of a classifier for novel queries without any access to annotated training data, i.e., classifiers are trained for a set of frequent queries and seamlessly extended to rare and arbitrary queries as and when required.

2. Accurate Classifiers for Frequent and Rare Queries

  • The authors word-level retrieval scheme is a direct application of the SVM classifier.
  • The authors train a linear SVM classifier with a few positive examples and a set of randomly sampled negative examples.
  • During retrieval, this classifier is evaluated over the dataset images, and a ranked list of word images is predicted.
  • For representing word images, the authors prefer a fixed length sequence representation of the visual content, i.e., each word image is represented as a fixed length sequence of vertical strips (of varying width according to the aspect ratio of the word image).
  • The authors exploit the sequential nature of the feature representation for on-the-fly synthesis of novel classifiers in Section 2.2.1.

2.1. Efficient Classifier based Retrieval

  • The SVM gives the maximum margin hyperplane separating the positive and negative instances.
  • Another demerit of ESVM is the large overall training time since a separate SVM needs to be trained for each exemplar.
  • Gharbi et al. [6] provide another alternative for fast training of exemplar SVM.
  • The authors also assume a Gaussian distribution over the feature space and hence use this normal vector as an approximation of the SVM weight for retrieval.
  • Designing a query-specific classifier then requires only a few d^2 multiplications.

2.2. Classifier design for rare queries

  • It is not practical to build classifiers for all the possible words.
  • The authors show that the SVM classifiers corresponding to ngrams can be effectively composed to generate novel classifiers on the fly.
  • Classifiers for smaller ngrams could be noisy, and since classifiers for all words would have to be composed from them, the overall performance could be poor.
  • The authors consider the problem of finding µq for the query class as the classifier synthesis problem outlined above.
  • The authors select the 10 most similar mean vectors and use them in the subsequence DTW.

3. Efficient and Accurate Retrieval

  • When a direct classifier is used for the frequent words, retrieval is efficient since this requires only the evaluation of the classifiers.
  • For the rare words, the authors use the DQC classifier which requires a DP based selection from multiple composite classifiers.
  • This DP based strategy affects the efficiency and accuracy of the solution to some extent.
  • An index is built over all the database vectors and those vectors similar to query vector xq are identified by performing approximate nearest neighbor search over the index.
  • This is much smaller than the O(NRd) time complexity of subsequence DTW matching.

4. Experiments, Results and Discussions

  • The authors validate the DQC classifier synthesis method on multiple word image collections and also demonstrate its quantitative and qualitative performance.
  • Figure 3 gives more qualitative examples of the retrieval.
  • During the DQC evaluation, the authors discard the trained classifiers and mean vectors for the chosen query word classes.
  • The authors also compare the average retrieval time for frequent and rare queries.
  • Page retrieval is performed based on the scores the query weight vector assigns to the word images present in each page.

5. Conclusion

  • The authors have described a classifier based retrieval scheme for effectively retrieving word and document images from a collection.
  • The authors argue that the classifier based method is superior to the OCR in practice.
  • The authors introduce a novel classifier synthesis scheme which enables the design of classifiers without any explicit training data.
  • For this, the authors exploit the fact that the words in a language are formed from a much smaller set of character sequences (ngrams).


Document Retrieval with Unlimited Vocabulary
Viresh Ranjan¹    Gaurav Harit²    C. V. Jawahar¹
¹CVIT, IIIT Hyderabad, India    ²IIT Jodhpur, India
Abstract
In this paper, we describe a classifier based retrieval
scheme for efficiently and accurately retrieving relevant
documents. We use SVM classifiers for word retrieval, and
argue that the classifier based solutions can be superior
to the OCR based solutions in many practical situations.
We overcome the practical limitations of the classifier
based solution in terms of limited vocabulary support,
and availability of training data. In order to overcome
these limitations, we design a one-shot learning scheme
for dynamically synthesizing classifiers. Given a set of
SVM classifiers, we appropriately join them to create
novel classifiers. This extends the classifier based retrieval
paradigm to an unlimited number of classes (words)
present in a language. We validate our method on multiple
datasets, and compare it with popular alternatives like
OCR and word spotting. Even on a language like English,
where OCRs have been fairly advanced, our method yields
comparable or even superior results. Our results are
significant since we do not use any language specific
post-processing for obtaining this performance. For better
accuracy of the retrieved list, we use query expansion. This
also allows us to seamlessly adapt our solution to new
fonts, styles and collections.
1. Introduction
Retrieving relevant documents (pages, paragraphs or words) is a critical component of information retrieval solutions associated with digital libraries. Most present day digital libraries use Optical Character Recognizers (OCRs) for the recognition of digitized documents, and thereafter employ a text-based solution for information retrieval. Though OCRs have become the de facto preprocessing for retrieval, they are realized as insufficient for degraded books [8], incompatible with older print styles [5], unavailable for specialized scripts [14], and very hard to apply to handwritten documents [1]. Even for printed books, commercial OCRs may provide highly unacceptable results in practice. The best commercial OCRs can only give a word accuracy of around 90% on printed books [18] in modern digital libraries. The recall of retrieval systems built on such text is thus limited.
In this paper, we hypothesize that word images, even if degraded, can be matched and retrieved effectively with a classifier based solution. A properly trained classifier can yield an accurate ranked list of words, since the classifier looks at the word as a whole and uses a larger context (say, multiple examples) for the matching. We show, later in this paper, that SVM based word retrieval can give a mean average precision (mAP) as high as 1.0 even when the OCR based solution is limited to a mAP of 0.89 (see Figure 1 and Table 1). Our results are significant since (i) we do not use any language specific post-processing for improving the accuracy, and (ii) even for a language like English, where OCRs are fairly advanced and engineering solutions have been perfected, our simple classifier based solution is as good as, if not superior to, the best available commercial OCRs.
However, there are two fundamental challenges in using a classifier based solution for word retrieval: (i) a classifier needs a good amount of annotated training data (both positive and negative) for training, and obtaining annotated data for every word in every style is practically impossible; (ii) one could train a set of classifiers for a given set of frequent queries, but they are not applicable to rare queries. In this paper we introduce a one-shot classifier learning scheme which enables the direct design of a classifier for novel queries, without any access to annotated training data, i.e., classifiers are trained for a set of frequent queries, and seamlessly extended to rare and arbitrary queries as and when required. We refer to this as one-shot learning of query classifiers.
Recognition free retrieval has been attempted in the past for printed as well as handwritten document collections [3, 5, 13]. The primary focus has been on designing appropriate features (e.g., profiles, SIFT-BoW), distance functions (e.g., Euclidean, Earth Mover's), and matching schemes (e.g., dynamic programming). Since most of these methods were designed for smaller collections (a few handwritten documents, as in [13]), computational time was not a major concern. Methods that extended this to larger collections [14, 15, 9] mostly used (approximate) nearest neighbor retrieval.

Figure 1. (A) A typical page from a document image in our dataset. (B) OCRs make many errors: examples of word images and the corresponding errors from a commercial OCR on one page. (C) Examples of classifier based retrieval compared with OCR retrieval. The OCR did not correctly recognize the images whose OCR output is marked in red, and failed to recall them. Note that the classifier based solution recalled all relevant images correctly, whereas the OCR correctly recalled only 2 out of 4 images.
For searching complex concepts in large databases, SVMs have emerged as the most popular and accurate solution in the recent past [10]. For linear SVMs, both training and testing have become very fast with the introduction of efficient algorithms and excellent implementations. Training has become efficient with methods like Pegasos [16] and whitening [7]. These methods make offline training, as well as incremental/online training, really fast. Training a classifier on the fly [4] is now considered quite feasible for reasonably large image collections. (These are still not comparable to the indexing schemes popularly used in text IR tasks, especially for huge data.)
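To make the speed argument concrete, the sketch below shows a Pegasos-style stochastic sub-gradient step for a linear SVM in plain numpy. It is a minimal illustration of why such training is fast, not the paper's implementation; it omits the optional projection step of Pegasos, and all names and shapes are assumptions.

```python
import numpy as np

def pegasos_train(X, y, lam=1e-4, epochs=10, seed=0):
    """Pegasos-style SGD for a linear SVM (hinge loss, no bias term).

    X: (n, d) feature matrix; y: (n,) labels in {-1, +1}.
    Each step costs O(d), so training scales to large collections.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    t = 0
    for _ in range(epochs):
        for i in rng.permutation(n):
            t += 1
            eta = 1.0 / (lam * t)           # decaying step size
            margin = y[i] * X[i].dot(w)     # margin under current w
            w *= (1.0 - eta * lam)          # shrink (regularization)
            if margin < 1.0:                # hinge loss is active
                w += eta * y[i] * X[i]
    return w
```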
The highlights of our current work are:
1. We demonstrate that SVM based word retrieval performance is superior to that of OCRs, and also to the popular nearest neighbor based word spotting solutions.
2. We design a one-shot learning scheme which can generate a novel classifier for rare/novel query words without any training data.
3. We demonstrate that, with simple retraining (and no extra supervision), the solution can adapt effectively to a specific book or collection.
4. We validate the performance on multiple books and demonstrate the qualitative and quantitative performance of the solution.
2. Accurate Classifiers for Frequent and Rare Queries

Our word-level retrieval scheme is a direct application of the SVM classifier. We train a linear SVM classifier with a few positive examples and a set of randomly sampled negative examples. During retrieval, this classifier is evaluated over the dataset images, and a ranked list of word images is predicted.
For representing word images, we prefer a fixed length sequence representation of the visual content, i.e., each word image is represented as a fixed length sequence of vertical strips (of varying width according to the aspect ratio of the word image). A set of features $f_1, \ldots, f_L$ is extracted, where $f_i \in \mathbb{R}^M$ is the feature representation of the $i$-th vertical strip and $L$ is the number of vertical strips. For an SVM classifier, this can be considered as a single feature vector $F \in \mathbb{R}^d$ of size $d = LM$. However, we exploit the sequential nature of the feature representation for on-the-fly synthesis of novel classifiers in Section 2.2.1.
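As a toy illustration of the shapes involved (random numbers stand in for real strip features; the 100 strips and 4 features per strip match the setup described later in Section 4):

```python
import numpy as np

L, M = 100, 4                           # L strips, M features per strip
strip_features = np.random.rand(L, M)   # stand-in for f_1, ..., f_L
F = strip_features.reshape(-1)          # single vector F with d = L * M
assert F.shape == (L * M,)              # d = 400 here
```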
2.1. Efficient Classifier based Retrieval
Our classifier is basically a margin-maximizing SVM classifier trained in a one-vs-rest setting. The SVM gives the maximum margin hyperplane separating the positive and negative instances. For a query word $x_q$, an SVM classifier $w_q$ ($w_q$ is the normal vector to the maximum margin hyperplane) is learned during training, and for retrieval, database images are sorted based on the score $w_q^T F_i$. This evaluation is very efficient, since it requires only $d$ multiplications and $d-1$ additions. For frequent queries, this can in fact be computed offline. However, traditional SVM implementations require many positive and negative examples to learn the weight vector $w_q$.

Figure 2. Synthesis of a classifier for rare queries. (a) Portions of the classifiers corresponding to "ground" and "leather" are joined to form a classifier for "great". Note that the appropriate segments are found automatically. (b) In a general setting, a novel classifier is formed from multiple constituent classifiers.
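A minimal sketch of this scoring step is given below; it assumes the database features are pre-stacked in a matrix, and the function and variable names are illustrative:

```python
import numpy as np

def rank_word_images(w_q, database):
    """Rank database word images by the classifier score w_q^T F_i.

    w_q:      (d,) weight vector of the query classifier.
    database: (n, d) matrix whose i-th row is the feature vector F_i.
    Returns indices of the word images, best match first.
    """
    scores = database @ w_q        # one dot product per word image
    return np.argsort(-scores)     # sort by descending score
```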
In [10], Malisiewicz et al. proposed the idea of the exemplar SVM (ESVM), where a separate SVM is learnt for each example. Almazan et al. [2] use ESVMs for retrieving word images. ESVMs are inherently highly tuned to their corresponding examples: given a query, an ESVM can retrieve highly similar word images, which constrains the recall unless large variations of the query word are available. Another demerit of the ESVM is the large overall training time, since a separate SVM needs to be trained for each exemplar. One approach to reducing the training time is to make the negative example mining step offline and select a common set of negative examples [17]. Gharbi et al. [6] provide another alternative for fast training of exemplar SVMs. As discussed in [6], an SVM between a single positive point and a set of negative points can be seen as finding the tangent to the manifold of images at the positive point. Assuming a Gaussian distribution over the feature space, they give a closed-form expression for the normal vector to the Gaussian at the query point $x_q$: the normal is $\Sigma^{-1}(x_q - \mu_0)$, where $\Sigma$ is the global covariance matrix and $\mu_0$ is the global mean vector. This expression can also be interpreted in terms of linear discriminant analysis (LDA), as done by Hariharan et al. in [7]. We also assume a Gaussian distribution over the feature space, and hence use this normal vector as an approximation of the SVM weight, using it as the weight vector for retrieval. The generalized expression for the LDA weight is $w = \Sigma^{-1}(\mu_+ - \mu_-)$, where $\mu_+$ and $\mu_-$ are the means of the positive and negative examples respectively. This simple computation makes the training extremely efficient: it requires only a few $d^2$ multiplications to design a specific query classifier. Moreover, the same method can be used regardless of whether we have one example or multiple examples from the query class.
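The closed-form weight is cheap to compute. Below is a minimal numpy sketch, assuming the global covariance and mean are precomputed offline; the small ridge term for numerical stability is our assumption, not a detail from the paper.

```python
import numpy as np

def lda_weight(mu_q, mu_0, cov, eps=1e-5):
    """Closed-form classifier weight w = Sigma^{-1} (mu_q - mu_0).

    cov, mu_0: covariance and mean of the whole word-image collection.
    mu_q:      mean of the query example(s); for a single exemplar this
               is just its feature vector.
    """
    d = cov.shape[0]
    # solve() avoids forming an explicit inverse of the covariance.
    return np.linalg.solve(cov + eps * np.eye(d), mu_q - mu_0)
```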
2.2. Classifier design for rare queries

The number of possible words in a language can be infinite, and it is not practical to build classifiers for all of them. On a closer look, however, we realize that all these words are composed from a very small number of characters and a reasonably small set of ngrams. In many practical applications related to text processing, a finite set of ngrams has been used to cover the vocabulary, and small vocabulary solutions have been extended to unlimited vocabulary settings. In this work, we show that the SVM classifiers corresponding to ngrams can be effectively composed to generate novel classifiers on the fly. Figure 2 shows an overview of our classifier synthesis strategy. Such synthesized classifiers, built by simply concatenating ngram classifiers, could be inferior to directly built classifiers: (i) due to the nature of scripts and writing styles, the joining will not be ideal, so one should prefer larger ngrams for better synthesis; (ii) classifiers for smaller ngrams could be noisy, and since the classifiers for all words would need to be built from them, the overall performance could be poor. We address these limitations as follows. It is well known that the queries in any search engine follow an exponentially decaying distribution (Zipf's law), like the frequency of occurrence of words in a language. We build SVM classifiers (Section 3) for the most frequent queries and use classifier synthesis only for rare queries. This improves the overall performance. When the synthesized classifiers are not as good as the original ones, we further use query expansion (QE) to refine the query classifier (see Section 3). Our method does not build artificial ngram classifiers; we use complete word classifiers and dynamically decide which portions of which words should be cut and pasted to create the novel classifier. We refer to this solution as the Direct Query Classifier (DQC) design scheme. In Section 2.2.1, we describe DP DQC, a dynamic programming based approach for DQC synthesis.

2.2.1 DP DQC: DQC Design using Dynamic Programming

Given a set of linear classifiers $W_w = \{w_1, w_2, \ldots, w_N\}$ for the $N$ most frequent queries and a query feature vector $x_q$, we would like to synthesize a novel classifier $w_q$ as a piecewise fusion of parts of the available classifiers in $W_w$ (see Figure 2). Let us assume that there are $p$ portions that we need to select to form a novel classifier. The intervals (portions) are characterized by the sequence of indices $a_1, \ldots, a_{p+1}$, where $a_1 = 1$ and $a_{p+1} = L$. We formulate the classifier synthesis problem, given a set of already available linear classifiers, as that of finding the optimal solution to

$$\max_{\{a_i\},\{c_i\}} \; \sum_{i=1}^{p} \sum_{k=a_i}^{a_{i+1}} \left( w_{c_i}^{k} \right)^{\top} x_q^{k} \qquad (1)$$

where $w_{c_i}$ corresponds to the $c_i$-th classifier that we choose, and the inner summation applies the index range $(a_i, a_{i+1})$ to use a portion of the classifier $c_i$. The index $i$ in the outer summation runs over the portions, and $p$ is the total number of portions we need to consider. Solving the problem requires picking the optimal set $\{c_i\}$ and the set of segment indices $\{a_i\}$ such that the $\{a_i\}$ form a monotonically increasing sequence.

We use LDA as the linear classifier in our DQC solution. The LDA weight $w_q$ is given as

$$w_q = \Sigma^{-1}(\mu_q - \mu_0) \qquad (2)$$

where $\Sigma$ and $\mu_0$ are the covariance and mean computed over the entire dataset of word images. Since $\Sigma$ and $\mu_0$ are common to all classes, synthesizing $w_q$ reduces to finding the mean vector $\mu_q$ of the unknown query class. We therefore cast the problem of finding $\mu_q$ for the (unknown) query class as the classifier synthesis problem outlined above. Let the set of mean vectors of the frequent words be $W_\mu = \{\mu_1, \mu_2, \ldots, \mu_N\}$. We divide the query vector $x_q$ into $p$ fixed-length portions and match each cut portion $x_q^k$ to the most similar feature portion (say $\mu_q^k$) of equal length from the set $W_\mu$. We solve the problem of selecting the optimal set $\{c_i\}$ by solving for the optimal alignment of each query feature portion $x_q^k$ with the best matching portion $\mu_{c_i}^k$ of the mean vector $\mu_{c_i}$ picked from $W_\mu$. We then combine the obtained portions $\mu_q^k$ to compose the mean vector $\mu_q$ for the query. We ensure monotonicity of the sequence $\{a_i\}$ by using a fixed sequence for $\{a_i\}$, thus avoiding optimization over the set of indices $\{a_i\}$.

The alignment of feature portions is done using subsequence Dynamic Time Warping (DTW) [12], which is a dynamic programming (DP) algorithm. The DTW takes care of the variability across different instances (word images) of the same class. The time complexity of subsequence DTW is $O(l_1 l_2)$, where $l_1$ and $l_2$ are the lengths of the two input sequences. In our case, $l_1$ is the length of each small segment and $l_2 = N \cdot d$ is the length of the concatenated sequence of all the mean vectors in the set $W_\mu$. If we used all the known mean vectors in $W_\mu$, synthesizing the classifier for a single query could take a long time. To reduce the time complexity, we compute the normalized dot product between the query vector and all the mean vectors of the known classes, select the 10 most similar mean vectors, and use only those in the subsequence DTW.
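For reference, a minimal subsequence DTW over 1-D sequences is sketched below, with an absolute-difference local cost; the free start and end points along the longer sequence are what distinguish it from plain DTW. This is a generic textbook version, not the paper's exact matching code.

```python
import numpy as np

def subsequence_dtw(segment, long_seq):
    """Align a short segment anywhere inside a longer 1-D sequence.

    Returns (best alignment cost, end index within long_seq).
    Complexity is O(l1 * l2), as discussed in the text.
    """
    l1, l2 = len(segment), len(long_seq)
    D = np.full((l1 + 1, l2 + 1), np.inf)
    D[0, :] = 0.0                          # free start anywhere in long_seq
    for i in range(1, l1 + 1):
        for j in range(1, l2 + 1):
            cost = abs(segment[i - 1] - long_seq[j - 1])
            D[i, j] = cost + min(D[i - 1, j],       # step in segment
                                 D[i, j - 1],       # step in long_seq
                                 D[i - 1, j - 1])   # step in both
    j_end = int(np.argmin(D[l1, 1:]))      # free end anywhere in long_seq
    return D[l1, j_end + 1], j_end
```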
3. Efficient and Accurate Retrieval

When a direct classifier is used for the frequent words, retrieval is efficient, since it requires only the evaluation of the classifiers; in practice, these scores are also pre-computed. For the rare words, we use the DQC, which requires a DP based selection of portions from multiple existing classifiers. We call this DP based classifier synthesis strategy DP DQC. The DP based strategy affects the efficiency and accuracy of the solution to some extent. We now discuss two refinements that can improve the efficiency and accuracy of the retrieval: NN DQC, an approximate nearest neighbor based implementation of DQC, and query expansion, used for adapting DQC to a previously unseen word collection without any new training data.
NN DQC: DQC with Nearest Neighbor indexing. DP DQC synthesis is slow because it uses DTW based alignment of the query vector portions with the mean vectors of the known classes. We can obtain a speed-up by using approximate nearest neighbor search instead of DTW based alignment. This, in principle, compromises the optimality of the synthesis; in practice, however, it does not affect the quality of the classifier. We consider fixed portions (of length $R$) of the mean vectors of all the known word classes and build an index over all such portions using FLANN [11]. The query $x_q$ is also divided into portions of fixed length $R$, and the approximate nearest neighbor match of each portion among the indexed portions is found using FLANN. The nearest neighbors so obtained are concatenated to give the mean vector $\mu_q$, which is then used in Equation (2) to compute the LDA weight $w_q$. FLANN has a time complexity of $O(RBD)$ when using hierarchical k-means for indexing, where $B$ is the branching factor and $D$ is the depth of the tree. This is typically much smaller than the $O(NRd)$ complexity of subsequence DTW. Hence, DQC using NN search is much faster than DQC using subsequence DTW.
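The sketch below mirrors this synthesis with a k-d tree from SciPy standing in for FLANN, and with aligned, non-overlapping portions of length R (how the portions are cut is our simplifying assumption):

```python
import numpy as np
from scipy.spatial import cKDTree

def synthesize_query_mean(x_q, class_means, R):
    """NN DQC-style synthesis of the query mean vector mu_q.

    x_q:         (d,) query feature vector, with R dividing d.
    class_means: list of (d,) mean vectors of the known word classes.
    Each length-R portion of x_q is replaced by its nearest indexed
    portion, and the matches are concatenated into mu_q.
    """
    # Index every aligned length-R portion of every known class mean.
    portions = np.vstack([mu.reshape(-1, R) for mu in class_means])
    tree = cKDTree(portions)
    # Match each query portion to its nearest indexed portion.
    _, idx = tree.query(x_q.reshape(-1, R))
    return portions[idx].reshape(-1)       # synthesized mean vector mu_q
```

The synthesized mean can then be plugged into Equation (2) to obtain the LDA weight $w_q$.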
Dataset   Source    Type    #images   #queries   OCR    NN     LDA    SVM    ESVM   NN DTW
D1        1 Book    Clean   26555     100        0.97   0.95   0.98   1.00   0.96   0.83
D2        2 Books   Clean   35730     100        0.95   0.72   0.92   0.98   0.72   0.62
D3        1 Book    Noisy   4373      100        0.89   0.73   0.98   1.00   0.86   0.71

Table 1. Details of the datasets, and a comparison (reported as mAP) of word retrieval schemes, nearest neighbor (NN), LDA, SVM, Exemplar SVM (ESVM), and DTW based NN, against OCR based retrieval. Classifier based methods (especially the SVM based ones) are much superior to the OCR based solution.

Query expansion for DQC. Classifiers trained on one dataset need not perform well on another, due to print and style variations. For adapting the query to a new collection, query expansion (QE) is used, and we implement QE very efficiently. Query expansion is a concept from information retrieval in which the seed or primary query is reformulated to improve retrieval performance. We use QE to further improve the performance of DQC. An index is built over all the database vectors, and the vectors similar to the query vector $x_q$ are identified by approximate nearest neighbor search over the index. The top 5 vectors closest to the query vector are averaged to give the reformulated query vector, which is then used in the DQC. The reformulated query better captures the variations of the query class. We use FLANN to obtain the approximate nearest neighbors, thus incurring an additional cost of $O(MBD)$, where $M$ is the number of vertical strips in the word image. This is much smaller than the $O(NRd)$ time complexity of subsequence DTW matching. Hence, adding the QE step before DQC does not significantly increase the computation time, while it improves the accuracy.
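A compact sketch of this reformulation step (function and variable names are illustrative; exact nearest neighbor search stands in for the FLANN index used in the paper):

```python
import numpy as np

def expand_query(x_q, database, k=5):
    """Average the query's top-k nearest database vectors to obtain a
    reformulated query that better covers the query class."""
    dists = np.linalg.norm(database - x_q, axis=1)
    top_k = np.argsort(dists)[:k]          # the 5 closest vectors
    return database[top_k].mean(axis=0)    # reformulated query vector
```

The reformulated vector is then used in place of $x_q$ when synthesizing the DQC.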
4. Experiments, Results and Discussions

In this section, we validate the DQC classifier synthesis method on multiple word image collections and demonstrate its quantitative and qualitative performance.

Data Sets, Implementation and Evaluation Protocol. Our datasets, detailed in Table 1, comprise scanned English books from a digital library collection. We manually created ground truth at the word level for the quantitative evaluation of the methods. Note that the words are segmented using the ground truth information. The first collection (D1) of words is from a book which is reasonably clean; on this collection, a commercial OCR (ABBYY Fine Reader 9.0) provides very high word accuracy.
We use this collection to demonstrate that our method performs satisfactorily on collections where the OCR is satisfactory. The second dataset (D2) is larger and is used to demonstrate the performance in the case of heterogeneous print styles. The third dataset (D3) is a noisy book, and is used to demonstrate the performance of our method on degraded collections. For the experiments, we extract profile features [13] for each of the word images. Profile features comprise the following: (i) the vertical projection profile, which counts the number of ink pixels in each column; (ii) the upper and lower word profiles, which encode the distance between the top (bottom) boundary and the top-most (bottom-most) ink pixels in each column; and (iii) background/ink transitions, which counts the number of background-to-ink transitions in each column. All these features are extracted for 100 vertical strips, which results in a 400 dimensional representation for every word image. The features are normalized to [0, 1] so as to avoid the dominance of any specific feature.
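The sketch below extracts these four profiles from a binarized word image and resamples them to a fixed number of strips; details such as the resampling and the per-feature normalization are our assumptions about one reasonable realization, not the paper's exact code.

```python
import numpy as np

def profile_features(word_img, n_strips=100):
    """Profile features for a binary word image (ink = 1, background = 0).

    Per column: projection profile, upper profile, lower profile, and
    background-to-ink transition count; resampled to n_strips columns
    and normalized to [0, 1]. Returns a vector of length 4 * n_strips.
    """
    H, W = word_img.shape
    ink = word_img > 0
    proj = ink.sum(axis=0)                               # ink pixels per column
    rows = np.arange(H)[:, None]
    upper = np.where(ink, rows, H).min(axis=0)           # top boundary to top-most ink
    lower = H - 1 - np.where(ink, rows, -1).max(axis=0)  # bottom boundary to bottom-most ink
    trans = (np.diff(ink.astype(int), axis=0) == 1).sum(axis=0)
    feats = np.stack([proj, upper, lower, trans]).astype(float)   # (4, W)
    # Resample each profile to a fixed number of vertical strips.
    xs = np.linspace(0, W - 1, n_strips)
    feats = np.stack([np.interp(xs, np.arange(W), f) for f in feats])
    # Normalize each feature to [0, 1] to avoid dominance of any one.
    peak = feats.max(axis=1, keepdims=True)
    peak[peak == 0] = 1.0
    return (feats / peak).T.reshape(-1)    # per-strip order f_1, ..., f_L
```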
Instead of the query log of a search engine, which would list the frequent and rare queries, we consider the frequency of occurrence of the words in the collection. We report the mAP score for 100 frequent queries as the retrieval performance measure in Table 1, where we compare the performance of the various methods. The OCR scores correspond to the average over all the queries in the corresponding dataset, and are obtained using ABBYY Fine Reader 9.0; we observed that ABBYY 9.0 performs comparably with its more recent versions. Some of the salient observations from this experiment are: (i) OCR performance is inferior to SVM based retrieval in all cases. (ii) The faster approximation in the form of LDA with multiple positive examples comes quite close to the performance of the SVM. (iii) ESVM performs worse due to the use of only one example. (iv) Nearest neighbor methods (with DTW, Euclidean, etc.) are, in general, inferior to the SVM. (v) Nearest neighbor with DTW may be considered equivalent to word spotting; one can therefore observe that classifier based retrieval is superior to both OCR and DTW based retrieval.
Note that DTW cannot be directly used in the SVM classifier, since the corresponding kernel is not positive semidefinite, and also because of the computational complexity. Figure 1 depicts some qualitative examples of the retrieval. As can be seen, the images retrieved by the classifier based method are more relevant to the query. Classifiers are less sensitive to the degradations (e.g., cuts, merges) present in the word images. Figure 3 gives more qualitative examples of the retrieval. We show examples of classifier based retrieval and DQC based retrieval, and we also show that the quality of the retrieval improves with QE.

Performance of the DQC. In the initial experiment, we built classifiers for the 1000 most frequent word categories. For

Citations
Proceedings ArticleDOI
23 Aug 2015
TL;DR: A fast approximation to the DTW distance is used, which makes word retrieval efficient and shows the speed up of proposed approximation on George Washington collection and multi-language datasets containing words from English and two Indian languages.
Abstract: Dynamic time warping (DTW) is a popular distance measure used for recognition free document image retrieval. However, it has quadratic complexity and hence is computationally expensive for large scale word image retrieval. In this paper, we use a fast approximation to the DTW distance, which makes word retrieval efficient. For a pair of sequences, to compute their DTW distance, we need to find the optimal alignment from all the possible alignments. This is a computationally expensive operation. In this work, we learn a small set of global principal alignments from the training data and avoid the computation of alignments for query images. Thus, our proposed approximation is significantly faster compared to DTW distance, and gives 40 times speed up. We approximate the DTW distance as a sum of multiple weighted Euclidean distances which are known to be amenable to indexing and efficient retrieval. We show the speed up of proposed approximation on George Washington collection and multi-language datasets containing words from English and two Indian languages.

10 citations


Cites methods from "Document Retrieval with Unlimited V..."

  • ...DTW distance has been successfully applied in many areas like, bioinformatics [1] and word recognition [4, 8]....


Proceedings ArticleDOI
24 Jul 2016
TL;DR: An overview of the methods which have been applied for document image retrieval over recent years is provided and it is found that from a textual perspective, more attention has been paid to the feature extraction methods without using OCR.
Abstract: Due to the rapid increase of different digitized documents, the development of a system to automatically retrieve document images from a large collection of structured and unstructured document images is in high demand. Many techniques have been developed to provide an efficient and effective way for retrieving and organizing these document images in the literature. This paper provides an overview of the methods which have been applied for document image retrieval over recent years. It has been found that from a textual perspective, more attention has been paid to the feature extraction methods without using OCR.

9 citations


Cites background or methods from "Document Retrieval with Unlimited V..."

  • ...The nearest neighbour method has been commonly used to measure the similarity in some recent studies [40, 46, 48, 52, 53]....


  • ...For the most frequent queries, SVM classifiers have been used and a classifier synthesis strategy has been built for rare queries [40]....


  • ...In [40], each word image has been represented by a fixed length sequence of vertical strips using word profile features....


  • ...SVMs have been applied for the retrieval process in [11, 35, 40]....


  • ...Printed books Word level features 3 Scanned English books D1, D2, D3 Up to 98% accuracy [40]...


Proceedings ArticleDOI
18 Mar 2015
TL;DR: This paper introduces FastDTW kernel, which is a linear approximation of the DTW kernel and can be used with linear SVM, and learns the principal global alignments for the given data by using the hidden structure of the alignments from the training data.
Abstract: The dynamic time warping (DTW) distance is a popular similarity measure for comparing time series data. It has been successfully applied in many fields like speech recognition, data mining and information retrieval to automatically cope with time deformations and variations in the length of the time dependent data. There have been attempts in the past to define kernels on DTW distance. These kernels try to approximate the DTW distance. However, these have quadratic complexity and these are computationally expensive for large time series. In this paper, we introduce FastDTW kernel, which is a linear approximation of the DTW kernel and can be used with linear SVM. To compute the DTW distance for any given sequences, we need to find the optimal warping path from all the possible alignments, which is a computationally expensive operation. Instead of finding the optimal warping path for every pair of sequences, we learn a small set of global alignments from a given dataset and use these alignments for comparing the given sequences. In this work, we learn the principal global alignments for the given data by using the hidden structure of the alignments from the training data. Since we use only a small number of global alignments for comparing the given test sequences, our proposed approximation kernel is computationally efficient compared to previous kernels on DTW distance. Further, we also propose an approximate explicit feature map for our proposed kernel. Our results show the efficiency of the proposed approximation kernel.

5 citations


Cites background from "Document Retrieval with Unlimited V..."

  • ...In addition to speech recognition, DTW has also been found useful in many other disciplines [14], including word recognition [4, 21], bioinformatics [1], data mining and gesture recognition....


Journal Article
TL;DR: This work presents a language independent keyword based document indexing and retrieval system using SVM as classifier and realized promising precision and recall rates on the IAM database of handwritten documents.
Abstract: This work presents a language independent keyword based document indexing and retrieval system using SVM as classifier. Word spotting presents an attractive alternative to the traditional Optical Character Recognition (OCR) systems where instead of converting the image into text, retrieval is based on matching the images of words using pattern classification techniques. The proposed technique relies on extracting words from images of handwritten documents and converting each word image into a shape represented by its contour. A set of multiple features is then extracted from each word image and instances of same words are grouped into clusters. These clusters are used to train a multi-class SVM which learns different word classes. The documents to be indexed are segmented into words and the closest cluster for each word is determined using the SVM. An index file is maintained for each word containing the word locations within each document. A query word presented to the system is matched with the clusters in the database and the documents containing occurrences of the query word are retrieved. The system realized promising precision and recall rates on the IAM database of handwritten documents.

3 citations


Cites background from "Document Retrieval with Unlimited V..."

  • ...Ranjan et al [47] Word level segmentation, Profile features, Support Vector Machine, Custom English documents, Average Precision = 81%...


Proceedings ArticleDOI
01 Aug 2017
TL;DR: A classifier based automatic document retrieval system for accurately retrieving relevant documents and assigning them to proper workflow in the application and achieves an efficiency of up to 94% for character recognition and 97% for document classification.
Abstract: This paper describes a classifier based automatic document retrieval system for accurately retrieving relevant documents and assigning them to proper workflow in the application. Documents present in archives oftentimes become illegible or distorted with time. Hence the system will first be used to recognize those distorted characters and then document text classification will be done using supervised machine learning. Language specific post-processing is not used by the system for obtaining this performance. It can also adapt to new fonts, styles and collections. The given system achieves an efficiency of up to 94% for character recognition and 97% for document classification.

3 citations


Cites background from "Document Retrieval with Unlimited V..."

  • ...In [1] we see that SVM based classification techniques’ performance is superior to that of optical character recognition (OCR)....


References
Proceedings Article
01 Jan 2009
TL;DR: A system that answers the question, “What is the fastest approximate nearest-neighbor algorithm for my data?” and a new algorithm that applies priority search on hierarchical k-means trees, which is found to provide the best known performance on many datasets.
Abstract: For many computer vision problems, the most time consuming component consists of nearest neighbor matching in high-dimensional spaces. There are no known exact algorithms for solving these high-dimensional problems that are faster than linear search. Approximate algorithms are known to provide large speedups with only minor loss in accuracy, but many such algorithms have been published with only minimal guidance on selecting an algorithm and its parameters for any given problem. In this paper, we describe a system that answers the question, “What is the fastest approximate nearest-neighbor algorithm for my data?” Our system will take any given dataset and desired degree of precision and use these to automatically determine the best algorithm and parameter values. We also describe a new algorithm that applies priority search on hierarchical k-means trees, which we have found to provide the best known performance on many datasets. After testing a range of alternatives, we have found that multiple randomized k-d trees provide the best performance for other datasets. We are releasing public domain code that implements these approaches. This library provides about one order of magnitude improvement in query time over the best previously available software and provides fully automated parameter selection.

2,934 citations


"Document Retrieval with Unlimited V..." refers methods in this paper

  • ...The query xq is also divided into portions of fixed length R and the approximate nearest neighbor match of each of these portions with the indexed portions is found using FLANN....


  • ...We consider fixed portions (length R) of the mean vectors of all the known word classes and build an index over all such portions using FLANN [11]....


  • ...We use FLANN for getting the approximate nearest neighbors, thus incurring an additional cost of O(MBD) where M is the number of vertical strips in the word image....


  • ...FLANN has a time complexity O(RBD) when using hierarchical k-means for indexing, where B is the branching factor and D is the depth of the tree....


Journal ArticleDOI
TL;DR: A simple and effective stochastic sub-gradient descent algorithm for solving the optimization problem cast by Support Vector Machines, which is particularly well suited for large text classification problems, and demonstrates an order-of-magnitude speedup over previous SVM learning methods.
Abstract: We describe and analyze a simple and effective stochastic sub-gradient descent algorithm for solving the optimization problem cast by Support Vector Machines (SVM). We prove that the number of iterations required to obtain a solution of accuracy $${\epsilon}$$ is $${\tilde{O}(1 / \epsilon)}$$, where each iteration operates on a single training example. In contrast, previous analyses of stochastic gradient descent methods for SVMs require $${\Omega(1 / \epsilon^2)}$$ iterations. As in previously devised SVM solvers, the number of iterations also scales linearly with 1/λ, where λ is the regularization parameter of SVM. For a linear kernel, the total run-time of our method is $${\tilde{O}(d/(\lambda \epsilon))}$$, where d is a bound on the number of non-zero features in each example. Since the run-time does not depend directly on the size of the training set, the resulting algorithm is especially suited for learning from large datasets. Our approach also extends to non-linear kernels while working solely on the primal objective function, though in this case the runtime does depend linearly on the training set size. Our algorithm is particularly well suited for large text classification problems, where we demonstrate an order-of-magnitude speedup over previous SVM learning methods.

2,037 citations


"Document Retrieval with Unlimited V..." refers methods in this paper

  • ...Training has become efficient with methods like Pegasos [16] and whitening [7]....


Book
26 Sep 2007
TL;DR: Analysis and Retrieval Techniques for Music Data, SyncPlayer: An Advanced Audio Player, and Relational Features and Adaptive Segmentation.
Abstract: Analysis and Retrieval Techniques for Music Data.- Fundamentals on Music and Audio Data.- Pitch- and Chroma-Based Audio Features.- Dynamic Time Warping.- Music Synchronization.- Audio Matching.- Audio Structure Analysis.- SyncPlayer: An Advanced Audio Player.- Analysis and Retrieval Techniques for Motion Data.- Fundamentals on Motion Capture Data.- DTW-Based Motion Comparison and Retrieval.- Relational Features and Adaptive Segmentation.- Index-Based Motion Retrieval.- Motion Templates.- MT-Based Motion Annotation and Retrieval.

1,576 citations


"Document Retrieval with Unlimited V..." refers methods in this paper

  • ...The alignment of feature portions is done using subsequence Dynamic Time Warping(DTW) [12], which is a dynamic programming (DP) algorithm....


Proceedings ArticleDOI
06 Nov 2011
TL;DR: This paper proposes a conceptually simple but surprisingly powerful method which combines the effectiveness of a discriminative object detector with the explicit correspondence offered by a nearest-neighbor approach.
Abstract: This paper proposes a conceptually simple but surprisingly powerful method which combines the effectiveness of a discriminative object detector with the explicit correspondence offered by a nearest-neighbor approach. The method is based on training a separate linear SVM classifier for every exemplar in the training set. Each of these Exemplar-SVMs is thus defined by a single positive instance and millions of negatives. While each detector is quite specific to its exemplar, we empirically observe that an ensemble of such Exemplar-SVMs offers surprisingly good generalization. Our performance on the PASCAL VOC detection task is on par with the much more complex latent part-based model of Felzenszwalb et al., at only a modest computational cost increase. But the central benefit of our approach is that it creates an explicit association between each detection and a single training exemplar. Because most detections show good alignment to their associated exemplar, it is possible to transfer any available exemplar meta-data (segmentation, geometric structure, 3D model, etc.) directly onto the detections, which can then be used as part of overall scene understanding.

999 citations


"Document Retrieval with Unlimited V..." refers background in this paper

  • ...For searching complex concepts in large databases, SVMs have emerged as the most popular and accurrate solution in the recent past [10]....


Proceedings ArticleDOI
20 Jun 2007
TL;DR: A simple and effective iterative algorithm for solving the optimization problem cast by Support Vector Machines that alternates between stochastic gradient descent steps and projection steps that can seamlessly be adapted to employ non-linear kernels while working solely on the primal objective function.
Abstract: We describe and analyze a simple and effective iterative algorithm for solving the optimization problem cast by Support Vector Machines (SVM). Our method alternates between stochastic gradient descent steps and projection steps. We prove that the number of iterations required to obtain a solution of accuracy e is O(1/e). In contrast, previous analyses of stochastic gradient descent methods require Ω (1/e2) iterations. As in previously devised SVM solvers, the number of iterations also scales linearly with 1/λ, where λ is the regularization parameter of SVM. For a linear kernel, the total run-time of our method is O (d/(λe)), where d is a bound on the number of non-zero features in each example. Since the run-time does not depend directly on the size of the training set, the resulting algorithm is especially suited for learning from large datasets. Our approach can seamlessly be adapted to employ non-linear kernels while working solely on the primal objective function. We demonstrate the efficiency and applicability of our approach by conducting experiments on large text classification problems, comparing our solver to existing state-of-the-art SVM solvers. For example, it takes less than 5 seconds for our solver to converge when solving a text classification problem from Reuters Corpus Volume 1 (RCV1) with 800,000 training examples.

985 citations

Frequently Asked Questions

Q1. What contributions have the authors mentioned in the paper "Document Retrieval with Unlimited Vocabulary"?

In this paper, the authors describe a classifier based retrieval scheme for efficiently and accurately retrieving relevant documents. They use SVM classifiers for word retrieval, and argue that classifier based solutions can be superior to OCR based solutions in many practical situations. Given a set of SVM classifiers, the authors appropriately join them to create novel classifiers; the results are significant since no language specific post-processing is used to obtain this performance. A noted disadvantage of the classifier based scheme is the difficulty of indexing, which matters if the method is to scale to millions of document images, and one direction of future work is an efficient and scalable retrieval system that uses linear SVM classifiers in the back end.