
Document Retrieval with Unlimited Vocabulary

TL;DR: This paper uses SVM classifiers for word retrieval, argues that classifier based solutions can be superior to OCR based solutions in many practical situations, and designs a one-shot learning scheme for dynamically synthesizing classifiers.
Abstract: In this paper, we describe a classifier based retrieval scheme for efficiently and accurately retrieving relevant documents. We use SVM classifiers for word retrieval, and argue that the classifier based solutions can be superior to the OCR based solutions in many practical situations. We overcome the practical limitations of the classifier based solution in terms of limited vocabulary support, and availability of training data. In order to overcome these limitations, we design a one-shot learning scheme for dynamically synthesizing classifiers. Given a set of SVM classifiers, we appropriately join them to create novel classifiers. This extends the classifier based retrieval paradigm to an unlimited number of classes (words) present in a language. We validate our method on multiple datasets, and compare it with popular alternatives like OCR and word spotting. Even on a language like English, where OCRs have been fairly advanced, our method yields comparable or even superior results. Our results are significant since we do not use any language specific post-processing for obtaining this performance. For better accuracy of the retrieved list, we use query expansion. This also allows us to seamlessly adapt our solution to new fonts, styles and collections.

Summary (2 min read)

1. Introduction

  • Retrieving relevant documents (pages, paragraphs or words) is a critical component in information retrieval solutions associated with digital libraries.
  • Though OCRs have become the de facto preprocessing for retrieval, they are insufficient for degraded books [8], incompatible with older print styles [5], unavailable for specialized scripts [14], and very hard for handwritten documents [1].
  • There are two fundamental challenges in using a classifier based solution for word retrieval: (i) a classifier needs a good amount of annotated training data (both positive and negative) for training, and (ii) classifiers trained for a given set of frequent queries are not applicable to rare queries.
  • The authors introduce a one-shot classifier learning scheme that enables the direct design of a classifier for novel queries without any access to annotated training data, i.e., classifiers are trained for a set of frequent queries and seamlessly extended to rare and arbitrary queries as and when required.

2. Accurate Classifiers for Frequent and Rare Queries

  • The authors word-level retrieval scheme is a direct application of the SVM classifier.
  • The authors train a linear SVM classifier with a few positive examples and a set of randomly sampled negative examples.
  • During retrieval, this classifier is evaluated over the dataset images, and a ranked list of word images is predicted.
  • For representing word images, the authors prefer a fixed length sequence representation of the visual content, i.e., each word image is represented as a fixed length sequence of vertical strips (of varying width according to the aspect ratio of the word image).
  • The authors exploit the sequential nature of the feature representation for on-the-fly synthesis of novel classifiers in Section 2.2.1.

2.1. Efficient Classifier based Retrieval

  • The SVM gives the maximum margin hyperplane separating the positive and negative instances.
  • Another demerit of ESVM is the large overall training time since a separate SVM needs to be trained for each exemplar.
  • Gharbi et al. [6] provide another alternative for fast training of exemplar SVM.
  • The authors also assume a Gaussian distribution over the feature space and hence use this normal vector as an approximation of the SVM weight for retrieval.
  • Designing a query-specific classifier then requires only a few d^2 multiplications.

2.2. Classifier design for rare queries

  • It is not practical to build classifiers for all the possible words.
  • The authors show that the SVM classifiers corresponding to ngrams can be effectively composed to generate novel classifiers on the fly.
  • Classifiers for smaller ngrams could be noisy, and since classifiers for all words would have to be composed from them, the overall performance could be poor.
  • The authors consider the problem of finding µq for the query class as the classifier synthesis problem outlined above.
  • The authors select the 10 most similar mean vectors and use them in the subsequence DTW.

3. Efficient and Accurate Retrieval

  • When a direct classifier is used for the frequent words, retrieval is efficient since this requires only the evaluation of the classifiers.
  • For the rare words, the authors use the DQC classifier which requires a DP based selection from multiple composite classifiers.
  • This DP based strategy affects the efficiency and accuracy of the solution to some extent.
  • An index is built over all the database vectors and those vectors similar to query vector xq are identified by performing approximate nearest neighbor search over the index.
  • This is much smaller than the O(NRd) time complexity of subsequence DTW matching.

4. Experiments, Results and Discussions

  • The authors validate the DQC classifier synthesis method on multiple word image collections and also demonstrate its quantitative and qualitative performance.
  • Figure 3 gives more qualitative examples of the retrieval.
  • During the DQC evaluation, the authors discard the trained classifiers and mean vectors for the chosen query word classes.
  • The authors also compare the average retrieval time for frequent and rare queries.
  • Page retrieval is performed based on the scores the query weight vector assigns to the word images present in each page.

5. Conclusion

  • The authors have described a classifier based retrieval scheme for effectively retrieving word and document images from a collection.
  • The authors argue that the classifier based method is superior to the OCR in practice.
  • The authors introduce a novel classifier synthesis scheme which enables the design of classifiers without any explicit training data.
  • For this, the authors exploit the fact that the words in a language are formed from a much smaller set of character sequences (ngrams).


Document Retrieval with Unlimited Vocabulary
Viresh Ranjan¹    Gaurav Harit²    C. V. Jawahar¹
¹CVIT, IIIT Hyderabad, India    ²IIT Jodhpur, India
Abstract
In this paper, we describe a classifier based retrieval
scheme for efficiently and accurately retrieving relevant
documents. We use SVM classifiers for word retrieval, and
argue that the classifier based solutions can be superior
to the OCR based solutions in many practical situations.
We overcome the practical limitations of the classifier
based solution in terms of limited vocabulary support,
and availability of training data. In order to overcome
these limitations, we design a one-shot learning scheme
for dynamically synthesizing classifiers. Given a set of
SVM classifiers, we appropriately join them to create
novel classifiers. This extends the classifier based retrieval
paradigm to an unlimited number of classes (words)
present in a language. We validate our method on multiple
datasets, and compare it with popular alternatives like
OCR and word spotting. Even on a language like English,
where OCRs have been fairly advanced, our method yields
comparable or even superior results. Our results are
significant since we do not use any language specific
post-processing for obtaining this performance. For better
accuracy of the retrieved list, we use query expansion. This
also allows us to seamlessly adapt our solution to new
fonts, styles and collections.
1. Introduction
Retrieving relevant documents (pages, paragraphs or words) is a critical component of information retrieval solutions associated with digital libraries. Most present day digital libraries use Optical Character Recognizers (OCRs) for the recognition of digitized documents, and thereafter employ a text-based solution for information retrieval. Though OCRs have become the de facto preprocessing for retrieval, they are realized as insufficient for degraded books [8], incompatible with older print styles [5], unavailable for specialized scripts [14], and very hard to apply to handwritten documents [1]. Even for printed books, commercial OCRs may provide highly unacceptable results in practice. The best commercial OCRs can only give a word accuracy of around 90% on printed books [18] in modern digital libraries. The recall of retrieval systems built on such text is thus limited.
In this paper, we hypothesize that word images, even if degraded, can be matched and retrieved effectively with a classifier based solution. A properly trained classifier can yield an accurate ranked list of words, since the classifier looks at the word as a whole and uses a larger context (say, multiple examples) for the matching. We show, later in this paper, that SVM based word retrieval can give a mean average precision (mAP) as high as 1.0 even when the OCR based solution is limited to a mAP of 0.89 (see Figure 1 and Table 1). Our results are significant since (i) we do not use any language specific post-processing for improving the accuracy, and (ii) even for a language like English, where OCRs are fairly advanced and engineering solutions have been perfected, our simple classifier based solution is as good as, if not superior to, the best available commercial OCRs.
However, there are two fundamental challenges in using a classifier based solution for word retrieval: (i) a classifier needs a good amount of annotated training data (both positive and negative) for training, and obtaining annotated data for every word in every style is practically impossible; (ii) one could train a set of classifiers for a given set of frequent queries, but they are not applicable to rare queries. In this paper we introduce a one-shot classifier learning scheme which enables the direct design of a classifier for novel queries, without any access to annotated training data, i.e., classifiers are trained for a set of frequent queries, and seamlessly extended to rare and arbitrary queries as and when required. We refer to this as one-shot learning of query classifiers.
Recognition free retrieval has been attempted in the past for printed as well as handwritten document collections [3, 5, 13]. The primary focus has been on designing appropriate features (e.g., profiles, SIFT-BoW), distance functions (e.g., Euclidean, Earth Mover's), and matching schemes (e.g., dynamic programming). Since most of these methods were designed for smaller collections (a few handwritten documents, as in [13]), computational time was not a major concern. Methods that extended this to larger collections [14, 15, 9] mostly used (approximate) nearest neighbor retrieval.

Figure 1. (A) A typical page from a document image in our dataset. (B) OCRs make many errors: examples of word images and the corresponding errors from a commercial OCR on one page. (C) Examples of classifier based retrieval compared with OCR retrieval. The OCR did not correctly recognize the images whose OCR output is marked in red, and failed to recall them. Note that the classifier based solution recalled all relevant images correctly, whereas the OCR correctly recalled only 2 out of 4 images.
For searching complex concepts in large databases, SVMs have emerged as the most popular and accurate solution in the recent past [10]. For linear SVMs, both training and testing have become very fast with the introduction of efficient algorithms and excellent implementations. Training has become efficient with methods like Pegasos [16] and whitening [7]. These methods make offline training, as well as incremental/online training, really fast. Training a classifier on the fly [4] is now considered quite feasible for reasonably large image collections. (These are still not comparable to the indexing schemes popularly used in text IR tasks, especially for huge data.)
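To make the speed argument concrete, the sketch below shows a Pegasos-style stochastic sub-gradient step for a linear SVM in plain numpy. It is a minimal illustration of why such training is fast, not the paper's implementation; it omits the optional projection step of Pegasos, and all names and shapes are assumptions.

```python
import numpy as np

def pegasos_train(X, y, lam=1e-4, epochs=10, seed=0):
    """Pegasos-style SGD for a linear SVM (hinge loss, no bias term).

    X: (n, d) feature matrix; y: (n,) labels in {-1, +1}.
    Each step costs O(d), so training scales to large collections.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    t = 0
    for _ in range(epochs):
        for i in rng.permutation(n):
            t += 1
            eta = 1.0 / (lam * t)           # decaying step size
            margin = y[i] * X[i].dot(w)     # margin under current w
            w *= (1.0 - eta * lam)          # shrink (regularization)
            if margin < 1.0:                # hinge loss is active
                w += eta * y[i] * X[i]
    return w
```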
The highlights of our current work are:
1. We demonstrate that SVM based word retrieval performance is superior to that of OCRs, and also to the popular nearest neighbor based word spotting solutions.
2. We design a one-shot learning scheme which can generate a novel classifier for rare/novel query words without any training data.
3. We demonstrate that, with simple retraining (and no extra supervision), the solution can adapt effectively to a specific book or collection.
4. We validate the performance on multiple books and demonstrate the qualitative and quantitative performance of the solution.
2. Accurate Classifiers for Frequent and Rare Queries

Our word-level retrieval scheme is a direct application of the SVM classifier. We train a linear SVM classifier with a few positive examples and a set of randomly sampled negative examples. During retrieval, this classifier is evaluated over the dataset images, and a ranked list of word images is predicted.
For representing word images, we prefer a fixed length sequence representation of the visual content, i.e., each word image is represented as a fixed length sequence of vertical strips (of varying width according to the aspect ratio of the word image). A set of features $f_1, \ldots, f_L$ is extracted, where $f_i \in \mathbb{R}^M$ is the feature representation of the $i$-th vertical strip and $L$ is the number of vertical strips. For an SVM classifier, this can be considered as a single feature vector $F \in \mathbb{R}^d$ of size $d = LM$. However, we exploit the sequential nature of the feature representation for on-the-fly synthesis of novel classifiers in Section 2.2.1.
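As a toy illustration of the shapes involved (random numbers stand in for real strip features; the 100 strips and 4 features per strip match the setup described later in Section 4):

```python
import numpy as np

L, M = 100, 4                           # L strips, M features per strip
strip_features = np.random.rand(L, M)   # stand-in for f_1, ..., f_L
F = strip_features.reshape(-1)          # single vector F with d = L * M
assert F.shape == (L * M,)              # d = 400 here
```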
2.1. Efficient Classifier based Retrieval
Our classifier is basically a margin-maximizing SVM classifier trained in a one-vs-rest setting. The SVM gives the maximum margin hyperplane separating the positive and negative instances. For a query word $x_q$, an SVM classifier $w_q$ ($w_q$ is the normal vector to the maximum margin hyperplane) is learned during training, and for retrieval, database images are sorted based on the score $w_q^T F_i$. This evaluation is very efficient, since it requires only $d$ multiplications and $d-1$ additions. For frequent queries, this can in fact be computed offline. However, traditional SVM implementations require many positive and negative examples to learn the weight vector $w_q$.

Figure 2. Synthesis of a classifier for rare queries. (a) Portions of the classifiers corresponding to "ground" and "leather" are joined to form a classifier for "great". Note that the appropriate segments are found automatically. (b) In a general setting, a novel classifier is formed from multiple constituent classifiers.
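A minimal sketch of this scoring step is given below; it assumes the database features are pre-stacked in a matrix, and the function and variable names are illustrative:

```python
import numpy as np

def rank_word_images(w_q, database):
    """Rank database word images by the classifier score w_q^T F_i.

    w_q:      (d,) weight vector of the query classifier.
    database: (n, d) matrix whose i-th row is the feature vector F_i.
    Returns indices of the word images, best match first.
    """
    scores = database @ w_q        # one dot product per word image
    return np.argsort(-scores)     # sort by descending score
```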
In [10], Malisiewicz et al. proposed the idea of the exemplar SVM (ESVM), where a separate SVM is learnt for each example. Almazan et al. [2] use ESVMs for retrieving word images. ESVMs are inherently highly tuned to their corresponding examples: given a query, an ESVM can retrieve highly similar word images, which constrains the recall unless large variations of the query word are available. Another demerit of the ESVM is the large overall training time, since a separate SVM needs to be trained for each exemplar. One approach to reducing the training time is to make the negative example mining step offline and select a common set of negative examples [17]. Gharbi et al. [6] provide another alternative for fast training of exemplar SVMs. As discussed in [6], an SVM between a single positive point and a set of negative points can be seen as finding the tangent to the manifold of images at the positive point. Assuming a Gaussian distribution over the feature space, they give a closed-form expression for the normal vector to the Gaussian at the query point $x_q$: the normal is $\Sigma^{-1}(x_q - \mu_0)$, where $\Sigma$ is the global covariance matrix and $\mu_0$ is the global mean vector. This expression can also be interpreted in terms of linear discriminant analysis (LDA), as done by Hariharan et al. in [7]. We also assume a Gaussian distribution over the feature space, and hence use this normal vector as an approximation of the SVM weight, using it as the weight vector for retrieval. The generalized expression for the LDA weight is $w = \Sigma^{-1}(\mu_+ - \mu_-)$, where $\mu_+$ and $\mu_-$ are the means of the positive and negative examples respectively. This simple computation makes the training extremely efficient: it requires only a few $d^2$ multiplications to design a specific query classifier. Moreover, the same method can be used regardless of whether we have one example or multiple examples from the query class.
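The closed-form weight is cheap to compute. Below is a minimal numpy sketch, assuming the global covariance and mean are precomputed offline; the small ridge term for numerical stability is our assumption, not a detail from the paper.

```python
import numpy as np

def lda_weight(mu_q, mu_0, cov, eps=1e-5):
    """Closed-form classifier weight w = Sigma^{-1} (mu_q - mu_0).

    cov, mu_0: covariance and mean of the whole word-image collection.
    mu_q:      mean of the query example(s); for a single exemplar this
               is just its feature vector.
    """
    d = cov.shape[0]
    # solve() avoids forming an explicit inverse of the covariance.
    return np.linalg.solve(cov + eps * np.eye(d), mu_q - mu_0)
```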
2.2. Classifier design for rare queries

The number of possible words in a language can be infinite, and it is not practical to build classifiers for all of them. On a closer look, however, we realize that all these words are composed from a very small number of characters and a reasonably small set of ngrams. In many practical applications related to text processing, a finite set of ngrams has been used to cover the vocabulary, and small vocabulary solutions have been extended to unlimited vocabulary settings. In this work, we show that the SVM classifiers corresponding to ngrams can be effectively composed to generate novel classifiers on the fly. Figure 2 shows an overview of our classifier synthesis strategy. Such synthesized classifiers, built by simply concatenating ngram classifiers, could be inferior to directly built classifiers: (i) due to the nature of scripts and writing styles, the joining will not be ideal, so one should prefer larger ngrams for better synthesis; (ii) classifiers for smaller ngrams could be noisy, and since the classifiers for all words would need to be built from them, the overall performance could be poor. We address these limitations as follows. It is well known that the queries in any search engine follow an exponentially decaying distribution (Zipf's law), like the frequency of occurrence of words in a language. We build SVM classifiers (Section 3) for the most frequent queries and use classifier synthesis only for rare queries. This improves the overall performance. When the synthesized classifiers are not as good as the original ones, we further use query expansion (QE) to refine the query classifier (see Section 3). Our method does not build artificial ngram classifiers; we use complete word classifiers and dynamically decide which portions of which words should be cut and pasted to create the novel classifier. We refer to this solution as the Direct Query Classifier (DQC) design scheme. In Section 2.2.1, we describe DP DQC, a dynamic programming based approach for DQC synthesis.

2.2.1 DP DQC: DQC Design using Dynamic Programming

Given a set of linear classifiers $W_w = \{w_1, w_2, \ldots, w_N\}$ for the $N$ most frequent queries and a query feature vector $x_q$, we would like to synthesize a novel classifier $w_q$ as a piecewise fusion of parts of the available classifiers in $W_w$ (see Figure 2). Let us assume that there are $p$ portions that we need to select to form a novel classifier. The intervals (portions) are characterized by the sequence of indices $a_1, \ldots, a_{p+1}$, where $a_1 = 1$ and $a_{p+1} = L$. We formulate the classifier synthesis problem, given a set of already available linear classifiers, as that of finding the optimal solution to

$$\max_{\{a_i\},\{c_i\}} \; \sum_{i=1}^{p} \sum_{k=a_i}^{a_{i+1}} \left( w_{c_i}^{k} \right)^{\top} x_q^{k} \qquad (1)$$

where $w_{c_i}$ corresponds to the $c_i$-th classifier that we choose, and the inner summation applies the index range $(a_i, a_{i+1})$ to use a portion of the classifier $c_i$. The index $i$ in the outer summation runs over the portions, and $p$ is the total number of portions we need to consider. Solving the problem requires picking the optimal set $\{c_i\}$ and the set of segment indices $\{a_i\}$ such that the $\{a_i\}$ form a monotonically increasing sequence.

We use LDA as the linear classifier in our DQC solution. The LDA weight $w_q$ is given as

$$w_q = \Sigma^{-1}(\mu_q - \mu_0) \qquad (2)$$

where $\Sigma$ and $\mu_0$ are the covariance and mean computed over the entire dataset of word images. Since $\Sigma$ and $\mu_0$ are common to all classes, synthesizing $w_q$ reduces to finding the mean vector $\mu_q$ of the unknown query class. We therefore cast the problem of finding $\mu_q$ for the (unknown) query class as the classifier synthesis problem outlined above. Let the set of mean vectors of the frequent words be $W_\mu = \{\mu_1, \mu_2, \ldots, \mu_N\}$. We divide the query vector $x_q$ into $p$ fixed-length portions and match each cut portion $x_q^k$ to the most similar feature portion (say $\mu_q^k$) of equal length from the set $W_\mu$. We solve the problem of selecting the optimal set $\{c_i\}$ by solving for the optimal alignment of each query feature portion $x_q^k$ with the best matching portion $\mu_{c_i}^k$ of the mean vector $\mu_{c_i}$ picked from $W_\mu$. We then combine the obtained portions $\mu_q^k$ to compose the mean vector $\mu_q$ for the query. We ensure monotonicity of the sequence $\{a_i\}$ by using a fixed sequence for $\{a_i\}$, thus avoiding optimization over the set of indices $\{a_i\}$.

The alignment of feature portions is done using subsequence Dynamic Time Warping (DTW) [12], which is a dynamic programming (DP) algorithm. The DTW takes care of the variability across different instances (word images) of the same class. The time complexity of subsequence DTW is $O(l_1 l_2)$, where $l_1$ and $l_2$ are the lengths of the two input sequences. In our case, $l_1$ is the length of each small segment and $l_2 = N \cdot d$ is the length of the concatenated sequence of all the mean vectors in the set $W_\mu$. If we used all the known mean vectors in $W_\mu$, synthesizing the classifier for a single query could take a long time. To reduce the time complexity, we compute the normalized dot product between the query vector and all the mean vectors of the known classes, select the 10 most similar mean vectors, and use only those in the subsequence DTW.
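For reference, a minimal subsequence DTW over 1-D sequences is sketched below, with an absolute-difference local cost; the free start and end points along the longer sequence are what distinguish it from plain DTW. This is a generic textbook version, not the paper's exact matching code.

```python
import numpy as np

def subsequence_dtw(segment, long_seq):
    """Align a short segment anywhere inside a longer 1-D sequence.

    Returns (best alignment cost, end index within long_seq).
    Complexity is O(l1 * l2), as discussed in the text.
    """
    l1, l2 = len(segment), len(long_seq)
    D = np.full((l1 + 1, l2 + 1), np.inf)
    D[0, :] = 0.0                          # free start anywhere in long_seq
    for i in range(1, l1 + 1):
        for j in range(1, l2 + 1):
            cost = abs(segment[i - 1] - long_seq[j - 1])
            D[i, j] = cost + min(D[i - 1, j],       # step in segment
                                 D[i, j - 1],       # step in long_seq
                                 D[i - 1, j - 1])   # step in both
    j_end = int(np.argmin(D[l1, 1:]))      # free end anywhere in long_seq
    return D[l1, j_end + 1], j_end
```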
3. Efficient and Accurate Retrieval

When a direct classifier is used for the frequent words, retrieval is efficient, since it requires only the evaluation of the classifiers; in practice, these scores are also pre-computed. For the rare words, we use the DQC, which requires a DP based selection of portions from multiple existing classifiers. We call this DP based classifier synthesis strategy DP DQC. The DP based strategy affects the efficiency and accuracy of the solution to some extent. We now discuss two refinements that can improve the efficiency and accuracy of the retrieval: NN DQC, an approximate nearest neighbor based implementation of DQC, and query expansion, used for adapting DQC to a previously unseen word collection without any new training data.
NN DQC: DQC with Nearest Neighbor indexing. DP DQC synthesis is slow because it uses DTW based alignment of the query vector portions with the mean vectors of the known classes. We can obtain a speed-up by using approximate nearest neighbor search instead of DTW based alignment. This, in principle, compromises the optimality of the synthesis; in practice, however, it does not affect the quality of the classifier. We consider fixed portions (of length $R$) of the mean vectors of all the known word classes and build an index over all such portions using FLANN [11]. The query $x_q$ is also divided into portions of fixed length $R$, and the approximate nearest neighbor match of each portion among the indexed portions is found using FLANN. The nearest neighbors so obtained are concatenated to give the mean vector $\mu_q$, which is then used in Equation (2) to compute the LDA weight $w_q$. FLANN has a time complexity of $O(RBD)$ when using hierarchical k-means for indexing, where $B$ is the branching factor and $D$ is the depth of the tree. This is typically much smaller than the $O(NRd)$ complexity of subsequence DTW. Hence, DQC using NN search is much faster than DQC using subsequence DTW.
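The sketch below mirrors this synthesis with a k-d tree from SciPy standing in for FLANN, and with aligned, non-overlapping portions of length R (how the portions are cut is our simplifying assumption):

```python
import numpy as np
from scipy.spatial import cKDTree

def synthesize_query_mean(x_q, class_means, R):
    """NN DQC-style synthesis of the query mean vector mu_q.

    x_q:         (d,) query feature vector, with R dividing d.
    class_means: list of (d,) mean vectors of the known word classes.
    Each length-R portion of x_q is replaced by its nearest indexed
    portion, and the matches are concatenated into mu_q.
    """
    # Index every aligned length-R portion of every known class mean.
    portions = np.vstack([mu.reshape(-1, R) for mu in class_means])
    tree = cKDTree(portions)
    # Match each query portion to its nearest indexed portion.
    _, idx = tree.query(x_q.reshape(-1, R))
    return portions[idx].reshape(-1)       # synthesized mean vector mu_q
```

The synthesized mean can then be plugged into Equation (2) to obtain the LDA weight $w_q$.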
Dataset   Source    Type    #images   #queries   OCR    NN     LDA    SVM    ESVM   NN DTW
D1        1 Book    Clean   26555     100        0.97   0.95   0.98   1.00   0.96   0.83
D2        2 Books   Clean   35730     100        0.95   0.72   0.92   0.98   0.72   0.62
D3        1 Book    Noisy   4373      100        0.89   0.73   0.98   1.00   0.86   0.71

Table 1. Details of the datasets, and a comparison (reported as mAP) of word retrieval schemes, nearest neighbor (NN), LDA, SVM, Exemplar SVM (ESVM), and DTW based NN, against OCR based retrieval. Classifier based methods (especially the SVM based ones) are much superior to the OCR based solution.

Query expansion for DQC. Classifiers trained on one dataset need not perform well on another, due to print and style variations. For adapting the query to a new collection, query expansion (QE) is used, and we implement QE very efficiently. Query expansion is a concept from information retrieval in which the seed or primary query is reformulated to improve retrieval performance. We use QE to further improve the performance of DQC. An index is built over all the database vectors, and the vectors similar to the query vector $x_q$ are identified by approximate nearest neighbor search over the index. The top 5 vectors closest to the query vector are averaged to give the reformulated query vector, which is then used in the DQC. The reformulated query better captures the variations of the query class. We use FLANN to obtain the approximate nearest neighbors, thus incurring an additional cost of $O(MBD)$, where $M$ is the number of vertical strips in the word image. This is much smaller than the $O(NRd)$ time complexity of subsequence DTW matching. Hence, adding the QE step before DQC does not significantly increase the computation time, while it improves the accuracy.
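A compact sketch of this reformulation step (function and variable names are illustrative; exact nearest neighbor search stands in for the FLANN index used in the paper):

```python
import numpy as np

def expand_query(x_q, database, k=5):
    """Average the query's top-k nearest database vectors to obtain a
    reformulated query that better covers the query class."""
    dists = np.linalg.norm(database - x_q, axis=1)
    top_k = np.argsort(dists)[:k]          # the 5 closest vectors
    return database[top_k].mean(axis=0)    # reformulated query vector
```

The reformulated vector is then used in place of $x_q$ when synthesizing the DQC.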
4. Experiments, Results and Discussions

In this section, we validate the DQC classifier synthesis method on multiple word image collections and demonstrate its quantitative and qualitative performance.

Data Sets, Implementation and Evaluation Protocol. Our datasets, detailed in Table 1, comprise scanned English books from a digital library collection. We manually created ground truth at the word level for the quantitative evaluation of the methods. Note that the words are segmented using the ground truth information. The first collection (D1) of words is from a book which is reasonably clean; on this collection, a commercial OCR (ABBYY Fine Reader 9.0) provides very high word accuracy.
We use this collection to demonstrate that our method performs satisfactorily on collections where the OCR is satisfactory. The second dataset (D2) is larger and is used to demonstrate the performance in the case of heterogeneous print styles. The third dataset (D3) is a noisy book, and is used to demonstrate the performance of our method on degraded collections. For the experiments, we extract profile features [13] for each of the word images. Profile features comprise the following: (i) the vertical projection profile, which counts the number of ink pixels in each column; (ii) the upper and lower word profiles, which encode the distance between the top (bottom) boundary and the top-most (bottom-most) ink pixels in each column; and (iii) background/ink transitions, which counts the number of background-to-ink transitions in each column. All these features are extracted for 100 vertical strips, which results in a 400 dimensional representation for every word image. The features are normalized to [0, 1] so as to avoid the dominance of any specific feature.
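The sketch below extracts these four profiles from a binarized word image and resamples them to a fixed number of strips; details such as the resampling and the per-feature normalization are our assumptions about one reasonable realization, not the paper's exact code.

```python
import numpy as np

def profile_features(word_img, n_strips=100):
    """Profile features for a binary word image (ink = 1, background = 0).

    Per column: projection profile, upper profile, lower profile, and
    background-to-ink transition count; resampled to n_strips columns
    and normalized to [0, 1]. Returns a vector of length 4 * n_strips.
    """
    H, W = word_img.shape
    ink = word_img > 0
    proj = ink.sum(axis=0)                               # ink pixels per column
    rows = np.arange(H)[:, None]
    upper = np.where(ink, rows, H).min(axis=0)           # top boundary to top-most ink
    lower = H - 1 - np.where(ink, rows, -1).max(axis=0)  # bottom boundary to bottom-most ink
    trans = (np.diff(ink.astype(int), axis=0) == 1).sum(axis=0)
    feats = np.stack([proj, upper, lower, trans]).astype(float)   # (4, W)
    # Resample each profile to a fixed number of vertical strips.
    xs = np.linspace(0, W - 1, n_strips)
    feats = np.stack([np.interp(xs, np.arange(W), f) for f in feats])
    # Normalize each feature to [0, 1] to avoid dominance of any one.
    peak = feats.max(axis=1, keepdims=True)
    peak[peak == 0] = 1.0
    return (feats / peak).T.reshape(-1)    # per-strip order f_1, ..., f_L
```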
Instead of the query log of a search engine, which would list the frequent and rare queries, we consider the frequency of occurrence of the words in the collection. We report the mAP score for 100 frequent queries as the retrieval performance measure in Table 1, where we compare the performance of the various methods. The OCR scores correspond to the average over all the queries in the corresponding dataset, and are obtained using ABBYY Fine Reader 9.0; we observed that ABBYY 9.0 performs comparably with its more recent versions. Some of the salient observations from this experiment are: (i) OCR performance is inferior to SVM based retrieval in all cases. (ii) The faster approximation in the form of LDA with multiple positive examples comes quite close to the performance of the SVM. (iii) ESVM performs worse due to the use of only one example. (iv) Nearest neighbor methods (with DTW, Euclidean, etc.) are, in general, inferior to the SVM. (v) Nearest neighbor with DTW may be considered equivalent to word spotting; one can therefore observe that classifier based retrieval is superior to both OCR and DTW based retrieval.
Note that DTW cannot be directly used in the SVM classifier, since the corresponding kernel is not positive semidefinite, and also because of the computational complexity. Figure 1 depicts some qualitative examples of the retrieval. As can be seen, the images retrieved by the classifier based method are more relevant to the query. Classifiers are less sensitive to the degradations (e.g., cuts, merges) present in the word images. Figure 3 gives more qualitative examples of the retrieval. We show examples of classifier based retrieval and DQC based retrieval, and we also show that the quality of the retrieval improves with QE.

Performance of the DQC. In the initial experiment, we built classifiers for the 1000 most frequent word categories. For

Citations
Proceedings ArticleDOI
23 Aug 2015
TL;DR: A fast approximation to the DTW distance is used, which makes word retrieval efficient and shows the speed up of proposed approximation on George Washington collection and multi-language datasets containing words from English and two Indian languages.
Abstract: Dynamic time warping (DTW) is a popular distance measure used for recognition free document image retrieval. However, it has quadratic complexity and hence is computationally expensive for large scale word image retrieval. In this paper, we use a fast approximation to the DTW distance, which makes word retrieval efficient. For a pair of sequences, to compute their DTW distance, we need to find the optimal alignment from all the possible alignments. This is a computationally expensive operation. In this work, we learn a small set of global principal alignments from the training data and avoid the computation of alignments for query images. Thus, our proposed approximation is significantly faster compared to DTW distance, and gives 40 times speed up. We approximate the DTW distance as a sum of multiple weighted Euclidean distances which are known to be amenable to indexing and efficient retrieval. We show the speed up of proposed approximation on George Washington collection and multi-language datasets containing words from English and two Indian languages.

10 citations


Cites methods from "Document Retrieval with Unlimited V..."

  • ...DTW distance has been successfully applied in many areas like, bioinformatics [1] and word recognition [4, 8]....


Proceedings ArticleDOI
24 Jul 2016
TL;DR: An overview of the methods which have been applied for document image retrieval over recent years is provided and it is found that from a textual perspective, more attention has been paid to the feature extraction methods without using OCR.
Abstract: Due to the rapid increase of different digitized documents, the development of a system to automatically retrieve document images from a large collection of structured and unstructured document images is in high demand. Many techniques have been developed to provide an efficient and effective way for retrieving and organizing these document images in the literature. This paper provides an overview of the methods which have been applied for document image retrieval over recent years. It has been found that from a textual perspective, more attention has been paid to the feature extraction methods without using OCR.

9 citations


Cites background or methods from "Document Retrieval with Unlimited V..."

  • ...The nearest neighbour method has been commonly used to measure the similarity in some recent studies [40, 46, 48, 52, 53]....


  • ...For the most frequent queries, SVM classifiers have been used and a classifier synthesis strategy has been built for rare queries [40]....


  • ...In [40], each word image has been represented by a fixed length sequence of vertical strips using word profile features....


  • ...SVMs have been applied for the retrieval process in [11, 35, 40]....


  • ...Printed books Word level features 3 Scanned English books D1, D2, D3 Up to 98% accuracy [40]...


Proceedings ArticleDOI
18 Mar 2015
TL;DR: This paper introduces FastDTW kernel, which is a linear approximation of the DTW kernel and can be used with linear SVM, and learns the principal global alignments for the given data by using the hidden structure of the alignments from the training data.
Abstract: The dynamic time warping (DTW) distance is a popular similarity measure for comparing time series data. It has been successfully applied in many fields like speech recognition, data mining and information retrieval to automatically cope with time deformations and variations in the length of the time dependent data. There have been attempts in the past to define kernels on DTW distance. These kernels try to approximate the DTW distance. However, these have quadratic complexity and these are computationally expensive for large time series. In this paper, we introduce FastDTW kernel, which is a linear approximation of the DTW kernel and can be used with linear SVM. To compute the DTW distance for any given sequences, we need to find the optimal warping path from all the possible alignments, which is a computationally expensive operation. Instead of finding the optimal warping path for every pair of sequences, we learn a small set of global alignments from a given dataset and use these alignments for comparing the given sequences. In this work, we learn the principal global alignments for the given data by using the hidden structure of the alignments from the training data. Since we use only a small number of global alignments for comparing the given test sequences, our proposed approximation kernel is computationally efficient compared to previous kernels on DTW distance. Further, we also propose an approximate explicit feature map for our proposed kernel. Our results show the efficiency of the proposed approximation kernel.

5 citations


Cites background from "Document Retrieval with Unlimited V..."

  • ...In addition to speech recognition, DTW has also been found useful in many other disciplines [14], including word recognition [4, 21], bioinformatics [1], data mining and gesture recognition....


Journal Article
TL;DR: This work presents a language independent keyword based document indexing and retrieval system using SVM as classifier and realized promising precision and recall rates on the IAM database of handwritten documents.
Abstract: This work presents a language independent keyword based document indexing and retrieval system using SVM as classifier. Word spotting presents an attractive alternative to the traditional Optical Character Recognition (OCR) systems where instead of converting the image into text, retrieval is based on matching the images of words using pattern classification techniques. The proposed technique relies on extracting words from images of handwritten documents and converting each word image into a shape represented by its contour. A set of multiple features is then extracted from each word image and instances of same words are grouped into clusters. These clusters are used to train a multi-class SVM which learns different word classes. The documents to be indexed are segmented into words and the closest cluster for each word is determined using the SVM. An index file is maintained for each word containing the word locations within each document. A query word presented to the system is matched with the clusters in the database and the documents containing occurrences of the query word are retrieved. The system realized promising precision and recall rates on the IAM database of handwritten documents.

3 citations


Cites background from "Document Retrieval with Unlimited V..."

  • ...Ranjan et al [47] Word level segmentation, Profile features, Support Vector Machine, Custom English documents, Average Precision = 81%...


Proceedings ArticleDOI
01 Aug 2017
TL;DR: A classifier based automatic document retrieval system for accurately retrieving relevant documents and assigning them to proper workflow in the application and achieves an efficiency of up to 94% for character recognition and 97% for document classification.
Abstract: This paper describes a classifier based automatic document retrieval system for accurately retrieving relevant documents and assigning them to proper workflow in the application. Documents present in archives oftentimes become illegible or distorted with time. Hence the system will first be used to recognize those distorted characters and then document text classification will be done using supervised machine learning. Language specific post-processing is not used by the system for obtaining this performance. It can also adapt to new fonts, styles and collections. The given system achieves an efficiency of up to 94% for character recognition and 97% for document classification.

3 citations


Cites background from "Document Retrieval with Unlimited V..."

  • ...In [1] we see that SVM based classification techniques’ performance is superior to that of optical character recognition (OCR)....


References
Proceedings Article
01 Jan 2009
TL;DR: A system that answers the question, “What is the fastest approximate nearest-neighbor algorithm for my data?” and a new algorithm that applies priority search on hierarchical k-means trees, which is found to provide the best known performance on many datasets.
Abstract: For many computer vision problems, the most time consuming component consists of nearest neighbor matching in high-dimensional spaces. There are no known exact algorithms for solving these high-dimensional problems that are faster than linear search. Approximate algorithms are known to provide large speedups with only minor loss in accuracy, but many such algorithms have been published with only minimal guidance on selecting an algorithm and its parameters for any given problem. In this paper, we describe a system that answers the question, “What is the fastest approximate nearest-neighbor algorithm for my data?” Our system will take any given dataset and desired degree of precision and use these to automatically determine the best algorithm and parameter values. We also describe a new algorithm that applies priority search on hierarchical k-means trees, which we have found to provide the best known performance on many datasets. After testing a range of alternatives, we have found that multiple randomized k-d trees provide the best performance for other datasets. We are releasing public domain code that implements these approaches. This library provides about one order of magnitude improvement in query time over the best previously available software and provides fully automated parameter selection.

2,934 citations


"Document Retrieval with Unlimited V..." refers methods in this paper

  • ...The query xq is also divided into portions of fixed length R and the approximate nearest neighbor match of each of these portions with the indexed portions is found using FLANN....


  • ...We consider fixed portions (length R) of the mean vectors of all the known word classes and build an index over all such portions using FLANN [11]....


  • ...We use FLANN for getting the approximate nearest neighbors, thus incurring an additional cost of O(MBD) where M is the number of vertical strips in the word image....


  • ...FLANN has a time complexity O(RBD) when using hierarchical k-means for indexing, where B is the branching factor and D is the depth of the tree....


Journal ArticleDOI
TL;DR: A simple and effective stochastic sub-gradient descent algorithm for solving the optimization problem cast by Support Vector Machines, which is particularly well suited for large text classification problems, and demonstrates an order-of-magnitude speedup over previous SVM learning methods.
Abstract: We describe and analyze a simple and effective stochastic sub-gradient descent algorithm for solving the optimization problem cast by Support Vector Machines (SVM). We prove that the number of iterations required to obtain a solution of accuracy $${\epsilon}$$ is $${\tilde{O}(1 / \epsilon)}$$, where each iteration operates on a single training example. In contrast, previous analyses of stochastic gradient descent methods for SVMs require $${\Omega(1 / \epsilon^2)}$$ iterations. As in previously devised SVM solvers, the number of iterations also scales linearly with 1/λ, where λ is the regularization parameter of SVM. For a linear kernel, the total run-time of our method is $${\tilde{O}(d/(\lambda \epsilon))}$$, where d is a bound on the number of non-zero features in each example. Since the run-time does not depend directly on the size of the training set, the resulting algorithm is especially suited for learning from large datasets. Our approach also extends to non-linear kernels while working solely on the primal objective function, though in this case the runtime does depend linearly on the training set size. Our algorithm is particularly well suited for large text classification problems, where we demonstrate an order-of-magnitude speedup over previous SVM learning methods.

2,037 citations


"Document Retrieval with Unlimited V..." refers methods in this paper

  • ...Training has become efficient with methods like Pegasos [16] and whitening [7]....


Book
26 Sep 2007
TL;DR: Analysis and Retrieval Techniques for Music Data, SyncPlayer: An Advanced Audio Player, and Relational Features and Adaptive Segmentation.
Abstract: Analysis and Retrieval Techniques for Music Data.- Fundamentals on Music and Audio Data.- Pitch- and Chroma-Based Audio Features.- Dynamic Time Warping.- Music Synchronization.- Audio Matching.- Audio Structure Analysis.- SyncPlayer: An Advanced Audio Player.- Analysis and Retrieval Techniques for Motion Data.- Fundamentals on Motion Capture Data.- DTW-Based Motion Comparison and Retrieval.- Relational Features and Adaptive Segmentation.- Index-Based Motion Retrieval.- Motion Templates.- MT-Based Motion Annotation and Retrieval.

1,576 citations


"Document Retrieval with Unlimited V..." refers methods in this paper

  • ...The alignment of feature portions is done using subsequence Dynamic Time Warping(DTW) [12], which is a dynamic programming (DP) algorithm....


Proceedings ArticleDOI
06 Nov 2011
TL;DR: This paper proposes a conceptually simple but surprisingly powerful method which combines the effectiveness of a discriminative object detector with the explicit correspondence offered by a nearest-neighbor approach.
Abstract: This paper proposes a conceptually simple but surprisingly powerful method which combines the effectiveness of a discriminative object detector with the explicit correspondence offered by a nearest-neighbor approach. The method is based on training a separate linear SVM classifier for every exemplar in the training set. Each of these Exemplar-SVMs is thus defined by a single positive instance and millions of negatives. While each detector is quite specific to its exemplar, we empirically observe that an ensemble of such Exemplar-SVMs offers surprisingly good generalization. Our performance on the PASCAL VOC detection task is on par with the much more complex latent part-based model of Felzenszwalb et al., at only a modest computational cost increase. But the central benefit of our approach is that it creates an explicit association between each detection and a single training exemplar. Because most detections show good alignment to their associated exemplar, it is possible to transfer any available exemplar meta-data (segmentation, geometric structure, 3D model, etc.) directly onto the detections, which can then be used as part of overall scene understanding.

999 citations


"Document Retrieval with Unlimited V..." refers background in this paper

  • ...For searching complex concepts in large databases, SVMs have emerged as the most popular and accurrate solution in the recent past [10]....


Proceedings ArticleDOI
20 Jun 2007
TL;DR: A simple and effective iterative algorithm for solving the optimization problem cast by Support Vector Machines that alternates between stochastic gradient descent steps and projection steps that can seamlessly be adapted to employ non-linear kernels while working solely on the primal objective function.
Abstract: We describe and analyze a simple and effective iterative algorithm for solving the optimization problem cast by Support Vector Machines (SVM). Our method alternates between stochastic gradient descent steps and projection steps. We prove that the number of iterations required to obtain a solution of accuracy e is O(1/e). In contrast, previous analyses of stochastic gradient descent methods require Ω (1/e2) iterations. As in previously devised SVM solvers, the number of iterations also scales linearly with 1/λ, where λ is the regularization parameter of SVM. For a linear kernel, the total run-time of our method is O (d/(λe)), where d is a bound on the number of non-zero features in each example. Since the run-time does not depend directly on the size of the training set, the resulting algorithm is especially suited for learning from large datasets. Our approach can seamlessly be adapted to employ non-linear kernels while working solely on the primal objective function. We demonstrate the efficiency and applicability of our approach by conducting experiments on large text classification problems, comparing our solver to existing state-of-the-art SVM solvers. For example, it takes less than 5 seconds for our solver to converge when solving a text classification problem from Reuters Corpus Volume 1 (RCV1) with 800,000 training examples.

985 citations

Frequently Asked Questions

Q1. What contributions have the authors mentioned in the paper "Document Retrieval with Unlimited Vocabulary"?

In this paper, the authors describe a classifier based retrieval scheme for efficiently and accurately retrieving relevant documents. They use SVM classifiers for word retrieval, and argue that classifier based solutions can be superior to OCR based solutions in many practical situations. Given a set of SVM classifiers, the authors appropriately join them to create novel classifiers; the results are significant since no language specific post-processing is used to obtain this performance. A noted disadvantage of the classifier based scheme is the difficulty of indexing, which matters if the method is to scale to millions of document images, and one direction of future work is an efficient and scalable retrieval system that uses linear SVM classifiers in the back end.