Enhancing Word Image Retrieval in Presence of Font Variations

TL;DR: This paper proposes an effective style independent retrieval scheme using a nonlinear style-content separation model, and a semi-supervised style transfer strategy that expands the query into multiple styles.
Abstract: This paper investigates the problem of cross document image retrieval, i.e. use of query images from one style (say font) to perform retrieval from a collection which is in a different style (say a different set of books). We present two approaches to tackle this problem. We propose an effective style independent retrieval scheme using a nonlinear style-content separation model. We also propose a semi-supervised style transfer strategy to expand the query into multiple styles. We validate both these approaches on a collection of word images which vary in fonts/styles.

Summary

Introduction

  • Font and style variations make the problem of recognition and retrieval challenging while working with large and diverse document image databases.
  • Commonly, a classifier is trained with a certain set of fonts available a priori, and generalization across fonts is hoped for, owing either to the quality of the features or to the power of the classifier.
  • A natural extension of the query expansion in cross document word image retrieval could be to automatically reformulate the query word in multiple fonts.
  • Euclidean distance is often preferred for scalability in retrieval [7].
  • Transfer learning may involve (i) feature transformations, e.g. updating the regression matrix [11] or the LDA transformation matrix [12], and (ii) classifier adaptation, e.g. a retraining strategy for a neural network [13], SVM [14], etc.

II. DIRECT APPROACHES

  • A common approach to deal with font variations is to heuristically define and extract features.
  • Then one empirically validates the insensitivity to feature variations on multiple fonts.
  • Profile based representation [5], [17] is one such popular feature.
  • Use of a DTW based sequence alignment further improves the robustness of retrieval as DTW is able to take care of local variations in sequences.
  • Another possible approach for handling font variations is to reformulate the query word image in the target document font.

A. Style Transfer

  • Style transfer strategy has been used in the past for handwriting recognition.
  • This results in a specific model for each user.
  • A straightforward method to do style transfer of the query is to decompose it into style and content factors using a bilinear model [10].
  • The authors show such style transfer examples in Figure 2.
  • In addition, a serious limitation of using this style transfer approach in large multi-font databases is the need for some labeled examples of all the distinct words in the database for each of the fonts.

III. QUERY EXPANSION USING SEMI-SUPERVISED STYLE TRANSFER

  • In the retrieval setting, the authors have a single example to transfer the style.
  • An initial seed image is reformulated into multiple versions and all versions have in common the underlying word label.
  • A few labeled examples $Y_r$ from the target dataset can be used to adapt $A_s$ to obtain $A_r$ by solving the following optimization problem: $\min_{A_r} \|Y_r - A_r B_r\|_F^2 + \lambda \|A_r - A_s\|_F^2$ (3). Here, the columns of $B_r$ are a subset of the columns of $B_c$. Using the original pixel based representation of word images for performing style transfer has a few shortcomings.
  • Using a low dimensional profile feature representation reduces the computation required for model learning as well as retrieval.
  • The authors represent each word image by its profile feature representation (Section V) and stack the mean vector for each word label along the columns of the matrix $Y_t$.

IV. KERNELIZED STYLE-CONTENT SEPARATION

  • To make linear models more robust, it is a common practice to first map the feature vectors in the original space to a high dimensional space and then learn the linear model over the high dimensional space.
  • The authors call their nonlinear version of the bilinear model the asymmetric kernel bilinear model (AKBM).
  • Since style basis vectors lie in the same feature space as the observation vectors, each basis vector (each column of $A_t$) can be expressed as a linear combination of the mapped observation vectors, hence $A_t$ can be represented as $A_t = \phi(Y_t)\alpha$.
  • The authors solve this optimization problem by alternately keeping one of the two factors as constant and optimizing for the other factor.
  • Now, to use these nonlinear basis vectors to perform retrieval on the target dataset, the authors represent all the word images from the target dataset by solving $\min_{b_{ir}} \|\phi(y_{ir}) - \phi(Y_t)\alpha b_{ir}\|^2$, where $y_{ir}$ is the profile feature representation of the $i$-th image from the target dataset.

V. EXPERIMENTS, RESULTS AND DISCUSSIONS

  • The authors compare the retrieval performance for the following three cases: 1) Query word images from training dataset are used directly to perform retrieval on target dataset (i.e. font independent feature definitions).
  • 2) Semi-supervised style transfer as discussed in Sec. III.
  • 3) Asymmetric kernel bilinear model as discussed in Sec. IV.

A. Data Sets, Implementation and Evaluation Protocol

  • These datasets, detailed in Table I, comprise scanned English books from a digital library collection.
  • The authors manually created the ground truth at word level for the quantitative evaluation of their proposed retrieval approaches.
  • Each of the datasets D1 - D5 is subdivided into training, testing and validation sets, with each set containing one-third of the word images for each word label.
  • Bilinear models are learned from the examples in training set.
  • 2) Upper and lower word profile, which encode the distance between the top (lower) boundary and the top-most (lower-most) ink pixels in each column.

B. Retrieval Experiments

  • In Table II, the authors compare the retrieval performance of font independent feature definitions (no transfer), semi-supervised style transfer (SSST) and asymmetric kernel bilinear model (AKBM).
  • Using this kernel bilinear model, the authors obtain content vector representation for all of the target dataset word images and use them to perform nearest neighbor based retrieval on the basis of their distance with the content vectors corresponding to query labels from the training dataset.
  • In Figure 5, the authors show a few query images and the corresponding retrieval results, on D1 - D4, obtained using AKBM.
  • Since the training and target fonts are too dissimilar, retrieval performance of all three approaches goes down, however, the performance of AKBM is still much better than the other two approaches.
  • SSST performs comparably to the supervised style transfer in this case.

VI. CONCLUSION

  • The authors have proposed strategies for doing word image retrieval in a multi-font database.
  • To deal with the style variations between different documents, the authors have proposed a semi-supervised style transfer strategy.
  • The authors have also suggested a font independent retrieval strategy by representing words from all the documents using the same set of high dimensional basis vectors.
  • The authors have shown results on various datasets varying in font.

Enhancing Word Image Retrieval in Presence of Font Variations
Viresh Ranjan (1), Gaurav Harit (2), C. V. Jawahar (1)
(1) CVIT, IIIT Hyderabad, India
(2) IIT Jodhpur, India
Abstract: This paper investigates the problem of cross document image retrieval, i.e. use of query images from one style (say font) to perform retrieval from a collection which is in a different style (say a different set of books). We present two approaches to tackle this problem. We propose an effective style independent retrieval scheme using a nonlinear style-content separation model. We also propose a semi-supervised style transfer strategy to expand the query into multiple styles. We validate both these approaches on a collection of word images which vary in fonts/styles.
I. INTRODUCTION
Font and style variations make the problem of recognition
and retrieval challenging while working with large and diverse
document image databases. Commonly, a classifier is trained
with a certain set of fonts available a priori, and generalization
across fonts is hoped for, owing either to the quality of the features or
to the power of the classifier. However, in practice, these solutions
give degraded performance when used on target documents
with a new font. If the entire target dataset is available at the
time of training, then it is possible to learn a classifier [1]
which could work on several fonts. If the details of the fonts
in the database are known, one could render the textual queries
in each of these fonts and retrieve from the database [1]. In
some cases, a style clustering [2], [3] is done and then separate
classifiers are learnt for each of the style clusters. In this work,
we are interested in an effective retrieval solution, where the
query is a word image, and the database has an unknown
set of fonts. We formulate the retrieval problem in a nearest
neighbor setting. In this setting, the distance for finding nearest
neighbors can be Euclidean [4] or the cost of aligning
two feature vector sequences with Dynamic Time Warping
(DTW) [5].
If the query is a word image, then we need to transfer or
expand the query into multiple fonts. Query expansion, which
is a technique for reformulating a seed query, is a common
practice in information retrieval. In query expansion, a seed
query is reformulated by also taking into account semantically
and morphologically related words. A natural extension of the
query expansion in cross document word image retrieval could
be to automatically reformulate the query word in multiple
fonts. In this paper, we propose a query reformulation strategy
which builds upon this very idea. To motivate the challenges
in cross document retrieval, we conduct an experiment on
words rendered in two different fonts. We argue that the
distance between two feature vector representations could
become ineffective in the presence of font variations. In Figure 1,
we present the Euclidean distance between profile feature
representations of different words in the same font, as well as
the same word in different fonts. Smaller inter-class distance
and larger intra-class distance lead to many false positives and
poorer retrieval. This shows that font variation could be a
crucial factor while performing cross document word image
retrieval (see more in Sec. II).
Fig. 1. Euclidean distance between profile feature representation of pairs of word images. Euclidean distance could be affected more by font variation than a difference in underlying word labels, for example, distance between “battle” in the two fonts is more than “battle” and “cattle” in the same font.
Many efficient approaches for word image retrieval have
been proposed in the recent past. Rath and Manmatha [5], as
well as Meshesha and Jawahar [6], use a profile based representation
along with DTW based retrieval. In many of the recent
works, either DTW or Euclidean distance is used. Euclidean
distance is often preferred for scalability in retrieval [7]. These
approaches primarily depend upon training data in order to
handle font variations and may not generalize well in case of
previously unseen fonts.
If the target style is not known a priori but certain samples
(labeled or unlabeled) of the target dataset are known, then
it is possible to transfer (adapt) the classifiers learned on the
training data so that they are able to handle the new style of the
target dataset. This technique is known as transfer learning [8],
and it has been widely used in applications like handwriting
recognition [2], [9], face pose classification [10], etc. Transfer
learning may involve (i) feature transformations, e.g. updating
the regression matrix [11] or the LDA transformation
matrix [12], and (ii) classifier adaptation, e.g. a retraining strategy
for a neural network [13], SVM [14], etc. The adaptation process
needs to be unsupervised if labeled data from the target dataset
is not available. The classifier would then need to use some
suitable self-learning strategy [15], [16] to learn the style
context in a group of patterns.
Fig. 2. Application of bilinear model for transferring image from one font to another is shown. Content vectors corresponding to word images in the first font are transferred to the second font using style vectors of the second font.
The objective of this work is to perform word image retrieval from a collection of books/documents, where the query word image could be in a different style from those in the database. Our primary contributions are the following:
1) Effective retrieval from a multi-font database is formulated as an automatic query expansion with no human intervention or labeled examples.
2) A nonlinear style-content factorization scheme is proposed. The method is compatible with the popular document retrieval schemes (e.g. those which use some appearance features with a distance based retrieval) and can improve their performance at minimal computational overhead.
3) We validate the method on real data sets with font variations and report qualitative and quantitative results. To analyze the solution better, we also build a dataset in a laboratory setting.
II. DIRECT APPROACHES
A common approach to deal with font variations is to heuristically define and extract features. One then empirically validates that these features are insensitive to variations across multiple fonts. For addressing font style variations in word image retrieval, a common strategy is to use some font independent feature representation. Profile based representation [5], [17] is one such popular feature. Profile features are considered to be reasonably robust to font variations (however, see Figure 1). They work well in the presence of a single font or a limited set of fonts. Use of a DTW based sequence alignment further improves the robustness of retrieval, as DTW is able to take care of local variations in sequences. Rath and Manmatha [5] use a profile based representation and DTW based alignment for retrieval on a dataset with some amount of variation in writing styles. However, such an approach may not scale up to large multi-font databases because of large font variations and high computational cost. Another possible approach for handling font variations is to reformulate the query word image in the target document font. This strategy is discussed in Sec. II-A.
A. Style Transfer
Style transfer strategies have been used in the past for handwriting recognition. Connell and Jain [2] do a general to specific adaptation of their model using a few examples of handwritten words from each user. This results in a specific model for each user. Zhang and Liu [9] address writer adaptation by learning a style transfer matrix for each user, which projects word samples of each user to a style free space where a style independent classifier is used for classification. A straightforward method to do style transfer of the query is to decompose it into style and content factors using a bilinear model [10]. The style factor can then be modulated separately to make it similar to that of the target document.
Our hypothesis is that a style-transformed query would be closer to the correct matches and would lead to a better performance of the nearest neighbor classifier. Following the asymmetric bilinear model in [10], we represent the query observation $y_{sc}$, in style $s$ and content $c$, as
$$y_{sc} = A_s b_c, \tag{1}$$
where $A_s$ is the set of style dependent basis vectors and $b_c$ is the content vector depicting the underlying word label. If the sets of style basis vectors $A_s$ and $A_t$ pertaining to styles $s$ and $t$ respectively are known, a word image $y_{sc}$ can be transferred from style $s$ to the new style $t$ by first finding the content vector $b_c$ corresponding to the word image and then using the style basis vectors $A_t$ as $y_{tc} = A_t b_c$. We show such style transfer examples in Figure 2. The transferred images are not visually impressive because of the binary nature of the images. In addition, a serious limitation of using this style transfer approach in large multi-font databases is the need for some labeled examples of all the distinct words in the database for each of the fonts. In other words, this approach cannot effectively generalize to previously unseen fonts.
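The transfer in Equation 1 can be illustrated with a minimal NumPy sketch. It assumes that both style bases are already known and that the content vector is recovered by ordinary least squares; the function name `transfer_style` and the random toy matrices are illustrative only and do not come from the paper.

```python
import numpy as np

def transfer_style(y_sc, A_s, A_t):
    """Move an observation from style s to style t (Equation 1).

    The content vector b_c is estimated by least squares from y_sc = A_s b_c
    (an assumed choice) and then re-rendered with the target basis A_t.
    """
    b_c, *_ = np.linalg.lstsq(A_s, y_sc, rcond=None)
    return A_t @ b_c, b_c

# Toy data: random bases stand in for learned style basis vectors.
rng = np.random.default_rng(0)
d, J = 10, 3                          # feature dimension, number of basis vectors
A_s = rng.standard_normal((d, J))
A_t = rng.standard_normal((d, J))
b_true = rng.standard_normal(J)       # hypothetical content vector
y_tc, b_c = transfer_style(A_s @ b_true, A_s, A_t)
```

In this noise-free toy case the recovered `b_c` matches `b_true`, so `y_tc` equals `A_t @ b_true`.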
III. QUERY EXPANSION USING SEMI-SUPERVISED STYLE
TRANSFER
In the retrieval setting, we have a single example (query)
to transfer the style. We modify the reformulation strategy
discussed in Sec. II-A so that a minimal amount of labeled data
is required for the style transfer. We propose a semi-supervised
style transfer strategy for reformulating the query word image
into target fonts without using any target labels. This strategy
uses labeled data only from a single font, learns a bilinear
model over it and adapts the bilinear model to any target
dataset in an unsupervised manner. This strategy saves us
from the costly practice of obtaining labeled word images
corresponding to every different font in the database. The
reformulation strategy used here is akin to the query expansion
strategy used in information retrieval. An initial seed image is
reformulated into multiple versions and all versions have in
common the underlying word label.
Given a set of word image observations for different word labels arranged as column vectors in the matrix $Y_s$ (each column corresponds to the average of all the images of a particular word label), the basis vectors $A_s$ and content vectors $B_c$ (each column is a content vector corresponding to a word label) can be obtained by solving the following optimization problem
$$\min_{A_s, B_c} \|Y_s - A_s B_c\|_F^2. \tag{2}$$
If the same number of word images are available for all the word labels, this problem can be solved with the help of the SVD of the matrix $Y_s$.
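As a rough sketch of the SVD route for Equation 2, the snippet below keeps the leading `J` singular triplets, which gives the best rank-`J` approximation in the Frobenius norm. The split of the singular values into the style basis, the name `fit_asymmetric_bilinear` and the toy matrix are assumptions made for illustration.

```python
import numpy as np

def fit_asymmetric_bilinear(Y, J):
    """Rank-J factorization Y ~ A @ B through a truncated SVD.

    Y : (d, N) matrix with one mean feature vector per word label.
    Returns A (d, J) style basis vectors and B (J, N) content vectors.
    """
    U, S, Vt = np.linalg.svd(Y, full_matrices=False)
    A = U[:, :J] * S[:J]   # singular values are absorbed into the style basis
    B = Vt[:J, :]          # rows are orthonormal
    return A, B

# Toy rank-3 matrix standing in for Y_s.
rng = np.random.default_rng(0)
Y_s = rng.standard_normal((20, 3)) @ rng.standard_normal((3, 8))
A_s, B_c = fit_asymmetric_bilinear(Y_s, J=3)
```

Because the toy matrix has rank 3, the product `A_s @ B_c` reproduces `Y_s` up to floating point error; on real data `J` trades reconstruction error against model size.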
Consider the task of rendering word images in a new font using the asymmetric bilinear model. We learn the model parameters $(A_s, B_c)$ from the training dataset of word images. To transfer the content vectors in $B_c$ to any desired style $r$, a few labeled examples $Y_r$ from the target dataset in style $r$ can be used to adapt $A_s$ to obtain $A_r$ by solving the following optimization problem
$$\min_{A_r} \|Y_r - A_r B_r\|_F^2 + \lambda \|A_r - A_s\|_F^2. \tag{3}$$
Here, the columns of $B_r$ are a subset of the columns of $B_c$.
Using the original pixel based representation of word images
for performing style transfer has a few shortcomings. We
believe that image transfer is a difficult task because of the
high dimensionality of the image space. The bilinear model
may overfit the training images, and may not generalize well
to the word images and fonts which are not there in the
training dataset. Also, there is a high computational cost
associated with the SVD of a large matrix. Therefore we
prefer a low dimensional feature space. In this work we use a
profile feature [5] based representation of word images and
perform transfer and retrieval in the feature space. Using
a low dimensional profile feature representation reduces the
computation required for model learning as well as retrieval.
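Equation 3 is a ridge-type least squares problem in $A_r$, so setting its gradient to zero gives $A_r (B_r B_r^T + \lambda I) = Y_r B_r^T + \lambda A_s$. The sketch below solves this linear system; the paper does not state how Equation 3 is solved, so this closed form, the name `adapt_style_basis` and the toy data are assumptions for illustration.

```python
import numpy as np

def adapt_style_basis(Y_r, B_r, A_s, lam):
    """Minimize ||Y_r - A B_r||_F^2 + lam * ||A - A_s||_F^2 over A.

    Solves A (B_r B_r^T + lam I) = Y_r B_r^T + lam A_s for A.
    """
    J = A_s.shape[1]
    lhs = B_r @ B_r.T + lam * np.eye(J)   # (J, J), symmetric positive definite for lam > 0
    rhs = Y_r @ B_r.T + lam * A_s         # (d, J)
    return np.linalg.solve(lhs, rhs.T).T  # lhs is symmetric, so transposing is safe

# Toy data with hypothetical sizes.
rng = np.random.default_rng(0)
d, J, M = 12, 4, 6
A_s = rng.standard_normal((d, J))
B_r = rng.standard_normal((J, M))
Y_r = rng.standard_normal((d, M))
lam = 0.5
A_r = adapt_style_basis(Y_r, B_r, A_s, lam)

# Gradient of the objective at the solution; it should vanish numerically.
grad = -2.0 * (Y_r - A_r @ B_r) @ B_r.T + 2.0 * lam * (A_r - A_s)
```

A large `lam` keeps the adapted basis close to `A_s`, while a small `lam` lets the few target examples dominate.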
Consider the same number of word images for each of the $N$ word classes, where each class corresponds to a different underlying word label. We represent each word image by its profile feature representation (Section V) and stack the mean vector for each word label along the columns of the matrix $Y_t$. We obtain the font dependent basis vectors $A_t$ and a matrix of content vectors $B_t$ by doing SVD of $Y_t$. The $i$-th column of $Y_t$, corresponding to the mean vector of the $i$-th word label, can be represented using the asymmetric bilinear model as $y_{it} = A_t b_{it}$, where $b_{it}$ is the $i$-th column of $B_t$ and is the content vector for the $i$-th word label. Since a content vector $b_{it}$ is independent of the style, it is possible to transfer $b_{it}$ to the target dataset font if we have the style dependent basis vectors ($A_r$) for the target dataset font. The mean vector for the $i$-th word label can be obtained in the target dataset font using Equation 1.
Our method, outlined below, does not require labeled data from the target dataset.
1) Learn the bilinear model $A_t$, $B_t$ from the labeled training dataset.
2) Propagate the labels corresponding to the word images in the training dataset to the word images in the target dataset by doing a nearest neighbor search over it. Say we propagate the labels for $M$ word labels.
3) We assign labels to only the top few results of the nearest neighbor search. Therefore we get labeled examples corresponding to $M$ word labels such that these $M$ labels are a subset of the $N$ training dataset labels.
4) We then form the content vector matrix $B_r$ using the content vectors from $B_t$ which correspond to the labels assigned in the previous step.
5) We use Equation 3 to obtain $A_r$.
6) Once we have obtained $A_r$, we use Equation 1 to obtain a feature vector representation of the word images in the target dataset font. These vectors can now be used to perform nearest neighbor based retrieval over the target dataset.
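The six steps above can be sketched end to end as follows. This is only one possible reading: the rule for choosing the "top few" results (the `k` nearest target images per label, keeping the `M` labels with the smallest best distance), the closed-form solution of Equation 3, and all function names are assumptions and are not specified in the paper.

```python
import numpy as np

def fit_bilinear(Y, J):
    # Step 1: truncated SVD, Y ~ A @ B.
    U, S, Vt = np.linalg.svd(Y, full_matrices=False)
    return U[:, :J] * S[:J], Vt[:J, :]

def propagate_labels(Y_t, X_r, M, k=1):
    # Steps 2-3: nearest neighbor search from each training mean vector
    # (columns of Y_t) over the unlabeled target features (columns of X_r).
    d2 = ((Y_t[:, :, None] - X_r[:, None, :]) ** 2).sum(axis=0)  # (N, n)
    nearest = np.argsort(d2, axis=1)[:, :k]                      # (N, k)
    best = np.take_along_axis(d2, nearest[:, :1], axis=1).ravel()
    kept = np.argsort(best)[:M]          # the M most confident labels (assumed rule)
    return np.repeat(kept, k), nearest[kept].ravel()

def adapt_style(Y_r, B_r, A_s, lam):
    # Step 5: assumed closed-form solution of Equation 3.
    J = A_s.shape[1]
    lhs = B_r @ B_r.T + lam * np.eye(J)
    rhs = Y_r @ B_r.T + lam * A_s
    return np.linalg.solve(lhs, rhs.T).T

def ssst_expand(Y_t, X_r, J, M, lam, k=1):
    A_t, B_t = fit_bilinear(Y_t, J)
    labels, images = propagate_labels(Y_t, X_r, M, k)
    B_r = B_t[:, labels]                 # Step 4: content vectors of propagated labels
    A_r = adapt_style(X_r[:, images], B_r, A_t, lam)
    return A_r @ B_t                     # Step 6: queries rendered in the target style
```

The returned matrix has one column per training word label; each column can be used as an expanded query for nearest neighbor retrieval over the target features.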
The asymmetric bilinear model, which we use here for style
transfer, is a linear model and hence it cannot capture the
nonlinearities in the data. Also, this strategy requires retraining
for each new target font. In the next section, we introduce our
nonlinear style-content factorization model which takes care
of these issues.
IV. KERNELIZED STYLE-CONTENT SEPARATION
To make linear models more robust, it is a common
practice to first map the feature vectors in the original space
to a high dimensional space and then learn the linear model
over the high dimensional space. If a feature vector in this
high dimensional space is some nonlinear function of the
corresponding vector in original space, then a linear model
in this space will correspond to a nonlinear model in original
space.
Let $\phi$ be a mapping such that $\phi : \mathbb{R}^n \to \mathcal{H}$, where $\mathbb{R}^n$ is the original observation space and $\mathcal{H}$ is a Reproducing Kernel Hilbert Space (RKHS) which could have a very high dimensionality in comparison to $\mathbb{R}^n$. The feature map $\phi$ could be a nonlinear mapping. If any algorithm can be expressed solely in terms of dot products of feature points in $\mathcal{H}$, then we do not need to know the exact mapping $\phi$ and a kernel function $\kappa$ can be defined such that $\kappa(x, y) = \langle \phi(x), \phi(y) \rangle$, where $x, y \in \mathbb{R}^n$ and $\kappa$ corresponds to some mapping $\phi$ [18]. This technique is known as the kernel trick and has been widely used for obtaining nonlinear versions of PCA [18], LDA [19] and many other algorithms.
We call our nonlinear version of the bilinear model the asymmetric kernel bilinear model (AKBM). In order to obtain a nonlinear version of the bilinear model, we first define the following terms. Let $Y_t$ be the matrix containing mean vectors of different word classes along its columns, $\phi$ be the feature map, $B_t$ be the content vectors corresponding to different word labels and $A_t$ be the set of style dependent basis vectors in the high dimensional feature space. Any observation $y_{tc}$ corresponding to style $t$ and label $c$ can be represented in the feature space as
$$\phi(y_{tc}) = A_t b_c. \tag{4}$$
To obtain style basis vectors $A_t$ and content vectors $B_t$, we solve the following optimization problem
$$\min_{A_t, B_t} \|\phi(Y_t) - A_t B_t\|_F^2 + \beta\,\mathrm{Trace}(A_t^T A_t). \tag{5}$$
Here the first term is the data fitting term and the second term is the regularizer which controls overfitting. Since style basis vectors lie in the same feature space as the observation vectors, each basis vector (each column of $A_t$) can be expressed as a linear combination of the mapped observation vectors, hence $A_t$ can be represented as $A_t = \phi(Y_t)\alpha$.
Using these, the above optimization problem can be rewritten as
$$\min_{\alpha, B_t} \mathrm{Trace}\left(K - B_t^T \alpha^T K - K \alpha B_t + B_t^T \alpha^T K \alpha B_t\right) + \beta\,\mathrm{Trace}(\alpha^T K \alpha), \tag{6}$$
where $K = \phi(Y_t)^T \phi(Y_t)$ is the kernel matrix of the training observations. This problem is convex in $\alpha$ if $B_t$ is kept constant and vice-versa. We solve this optimization problem by alternately keeping one of the two factors constant and optimizing for the other factor. Any standard QP solver [20], [21] can be used for solving this optimization problem.
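The alternating scheme can be sketched without a QP solver, because each subproblem of Equation 6 is an unconstrained quadratic. This swaps in closed-form updates in place of the QP solvers mentioned above: with $\alpha$ fixed, $B_t = (\alpha^T K \alpha)^{-1} \alpha^T K$; with $B_t$ fixed, $\alpha = B_t^T (B_t B_t^T + \beta I)^{-1}$. The function names, the random initialization and the fixed iteration count are assumptions, and the sketch applies no normalization to $\alpha$ or $B_t$, which a practical implementation may need.

```python
import numpy as np

def akbm_objective(K, alpha, B, beta):
    """Value of Equation 6 for a kernel matrix K (N x N)."""
    G = alpha.T @ K @ alpha
    fit = np.trace(K) - 2.0 * np.trace(B.T @ alpha.T @ K) + np.trace(B.T @ G @ B)
    return fit + beta * np.trace(G)

def fit_akbm(K, J, beta, n_iter=20, seed=0):
    """Alternating exact minimization over B (J x N) and alpha (N x J)."""
    rng = np.random.default_rng(seed)
    N = K.shape[0]
    alpha = rng.standard_normal((N, J))   # assumed random initialization
    history = []
    for _ in range(n_iter):
        # Update B with alpha fixed: (alpha^T K alpha) B = alpha^T K.
        G = alpha.T @ K @ alpha
        B = np.linalg.solve(G, alpha.T @ K)
        # Update alpha with B fixed: alpha (B B^T + beta I) = B^T.
        alpha = np.linalg.solve(B @ B.T + beta * np.eye(J), B).T
        history.append(akbm_objective(K, alpha, B, beta))
    return alpha, B, history
```

Since each update minimizes the objective exactly over one factor, the recorded objective values should not increase from one iteration to the next.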
To learn the nonlinear model from the available profile feature representation of training dataset word images, we solve the optimization problem given in Equation 6. This gives us the coefficient matrix $\alpha$ and the content matrix $B_t$. Any observation in the feature space can now be represented as $\phi(y_{tc}) = \phi(Y_t)\alpha b_c$.
Dataset   # Distinct words   # Images
D1        200                19472
D2        200                4923
D3        200                8463
D4        200                13557
D5        200                2868
Dlab      500                5000
TABLE I. Datasets used in our experiments. D5 has a very different font in comparison to D1 - D4. Dlab consists of word images rendered in 10 different fonts.
Now, to use these nonlinear basis vectors to perform retrieval on the target dataset, we represent all the word images from the target dataset by solving $\min_{b_{ir}} \|\phi(y_{ir}) - \phi(Y_t)\alpha b_{ir}\|^2$, where $y_{ir}$ is the profile feature representation of the $i$-th image from the target dataset. We use the closed form solution of this problem and obtain the content vectors corresponding to all the images from the target dataset. Now the retrieval is performed on the target dataset on the basis of the distance between the content vectors of the query word images and the content vectors of the target dataset word images.
Since the nonlinear model is more robust, the basis vectors
computed from the training dataset can represent word image
features from the target dataset also. Hence, we need not adapt
the nonlinear model using word images from the target dataset.
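The closed form is not written out in the text. Under the least squares problem above, setting the gradient with respect to $b_{ir}$ to zero gives $b_{ir} = (\alpha^T K \alpha)^{-1} \alpha^T k_{ir}$, where $k_{ir}$ holds the kernel values between the training mean vectors and $y_{ir}$. The sketch below uses this derived expression with an RBF kernel (the kernel used in Sec. V); the function names and the ranking helper are illustrative assumptions.

```python
import numpy as np

def rbf_kernel(X, Z, sigma):
    """RBF kernel values between the columns of X (d, n) and Z (d, m)."""
    sq = (X**2).sum(axis=0)[:, None] + (Z**2).sum(axis=0)[None, :] - 2.0 * X.T @ Z
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))

def content_vectors(Y_t, alpha, X_r, sigma):
    """Content vectors (J, n) for target features X_r under a learned alpha (N, J)."""
    K = rbf_kernel(Y_t, Y_t, sigma)       # (N, N) training kernel matrix
    k_r = rbf_kernel(Y_t, X_r, sigma)     # (N, n) training-to-target kernel values
    G = alpha.T @ K @ alpha               # (J, J)
    return np.linalg.solve(G, alpha.T @ k_r)

def rank_targets(b_query, B_target):
    """Indices of target images sorted by Euclidean distance to the query content vector."""
    dist = np.linalg.norm(B_target - b_query[:, None], axis=0)
    return np.argsort(dist)
```

Only kernel evaluations against the training mean vectors are needed, so the target dataset never has to be mapped explicitly into the high dimensional space.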
V. EXPERIMENTS, RESULTS AND DISCUSSIONS
In this section, we compare the retrieval performance for
the following three cases:
1) Query word images from training dataset are used directly
to perform retrieval on target dataset (i.e. font independent
feature definitions).
2) Semi-supervised style transfer as discussed in Sec. III.
3) Asymmetric kernel bilinear model as discussed in Sec.
IV.
A. Data Sets, Implementation and Evaluation Protocol
To validate the performance of our approaches, we create datasets D1 - D5 comprising five books varying in font. These datasets, detailed in Table I, comprise scanned English books from a digital library collection. We manually created the ground truth at word level for the quantitative evaluation of our proposed retrieval approaches. Each of the datasets D1 - D5 is subdivided into training, testing and validation sets, with each set containing one-third of the word images for each word label. Apart from these datasets obtained from scanned books, we also create a multi-font dataset Dlab by rendering 500 words in 10 different fonts. A few example images from this dataset are shown in Fig. 3. Bilinear models are learned from the examples in the training set. Optimal values for the kernel parameters and the regularization factors $\beta$ and $\lambda$ are found by performing retrieval on the validation set, and these optimal parameters are then used while performing retrieval on the test set. We use the RBF kernel for our experiments. The kernel function $\kappa$ is defined as
$$\kappa(x_i, x_j) = \exp\left(-\frac{\|x_i - x_j\|^2}{2\sigma^2}\right),$$
where $\sigma$ is the bandwidth of the RBF kernel.
Fig. 3. Examples from each of the 10 fonts used in Dlab.
For each word image in the dataset we extract the profile features [5] comprising:
1) Vertical projection profile, which counts the number of
ink pixels in each column.
2) Upper and lower word profile, which encode the distance
between the top (lower) boundary and the top-most
(lower-most) ink pixels in each column.
3) Background/Ink transition which counts the number of
background to ink transitions in each column.
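The profile features listed above can be computed column by column from a binarized word image. The sketch below is one possible implementation: it assumes nonzero pixels are ink, assigns the image height to columns that contain no ink, and applies no width or height normalization, none of which is specified in the paper. The name `profile_features` is illustrative.

```python
import numpy as np

def profile_features(img):
    """Return a (4, width) array of per-column profile features.

    Rows: vertical projection, upper profile, lower profile,
    background-to-ink transitions (scanning each column top to bottom).
    """
    ink = np.asarray(img) != 0
    h, w = ink.shape
    has_ink = ink.any(axis=0)
    projection = ink.sum(axis=0)                             # ink pixels per column
    upper = np.where(has_ink, ink.argmax(axis=0), h)         # distance from top boundary
    lower = np.where(has_ink, ink[::-1].argmax(axis=0), h)   # distance from bottom boundary
    # Pad one background row on top so ink in the first row counts as a transition.
    padded = np.vstack([np.zeros((1, w), dtype=bool), ink])
    transitions = (~padded[:-1] & padded[1:]).sum(axis=0)
    return np.vstack([projection, upper, lower, transitions]).astype(float)

# Tiny 4 x 3 example image (1 = ink).
example = np.array([[0, 1, 0],
                    [1, 1, 0],
                    [0, 0, 0],
                    [1, 0, 0]])
features = profile_features(example)
```

To compare word images with a Euclidean distance, the per-column sequences would still need to be brought to a common length, for instance by resizing images to a fixed width beforehand.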
B. Retrieval Experiments
In Table II, we compare the retrieval performance of font
independent feature definitions (no transfer), semi-supervised
style transfer (SSST) and asymmetric kernel bilinear model
(AKBM). D1 - D4 are used for this set of experiments. 100
query word images are picked from the training dataset and
retrieval is performed on the target dataset. Results are reported
as the mAP values for these 100 queries. For SSST, we use
asymmetric bilinear model for font transfer of query words
from training dataset font to target dataset font. We learn
asymmetric bilinear model using word images corresponding
to 100 different word labels from training dataset. Then we
do a nearest neighbor based search over the target dataset
to find images similar to query words from training dataset.
We assign the label of corresponding query word to the top
retrieved results and use them to adapt the model. Using this
updated bilinear model, we obtain feature vectors for the 100
word labels and use it for performing nearest neighbor based
retrieval on the target dataset. For AKBM, we learn asymmetric
kernel bilinear model using word images corresponding to 100
different word labels from training dataset. Using this kernel
bilinear model, we obtain content vector representation for all
of the target dataset word images and use them to perform
nearest neighbor based retrieval on the basis of their distance
with the content vectors corresponding to query labels from
the training dataset. We observe that in majority of the cases,
kernel based retrieval shows much better retrieval performance
than the other two cases. It is able to achieve mAP gain of up
to 0.33 over the no transfer case. In Figure 4, we show the
Precision-Recall (PR) curves corresponding to 100 queries. For
this experiment, two datasets are picked from D1 - D4 and used
as training and target datasets. No transfer, AKBM and SSST
cases are compared in the figure. Out of the three methods,
AKBM has the maximum area under the PR curve, followed
by SSST and no transfer case. In Figure 5, we show a few query images and the corresponding retrieval results, on D1 - D4, obtained using AKBM. The experiment is done in a multi-font scenario, i.e. one of the datasets is chosen for training, and retrieval is performed on the dataset obtained by combining multiple datasets (D1 - D4). We also show retrieved results corresponding to a failure case in the last row, which shows that visually similar words may sometimes create confusion during retrieval. We conduct another set of retrieval experiments where we test our proposed approach in the case of large font variations between the training dataset and the target dataset. We perform retrieval on D5 while training on one of the datasets D1 to D4 every time. We report the results in Table III. In this experiment, since the training and target fonts are too dissimilar, the retrieval performance of all three approaches goes down; however, the performance of AKBM is still much better than the other two approaches. Thus, the kernelized version of the bilinear model is able to achieve font independence and improved mAP scores by up to 0.30 for word image retrieval.
Training-Target dataset
Method       D1,D1  D2,D1  D3,D1  D4,D1  D1,D2  D2,D2  D3,D2  D4,D2  D1,D3  D2,D3  D3,D3  D4,D3  D1,D4  D2,D4  D3,D4  D4,D4
No Transfer  0.97   0.69   0.78   0.55   0.63   0.81   0.83   0.63   0.55   0.68   0.99   0.85   0.68   0.76   0.92   0.82
SSST         0.99   0.71   0.64   0.74   0.67   0.91   0.75   0.81   0.59   0.76   0.95   0.84   0.70   0.83   0.89   0.91
AKBM         0.99   0.85   0.69   0.88   0.88   0.94   0.79   0.92   0.72   0.83   0.97   0.95   0.84   0.91   0.96   0.99
TABLE II. mAP values for 100 queries when using no transfer, SSST and AKBM. In the training-target pair (D1, D2), D1 is the training dataset and D2 is the target dataset.
Fig. 4. Precision-Recall (PR) curves corresponding to 100 queries. For training and target dataset, two datasets are picked from D1 - D4.
Training dataset   mAP values over 100 queries
                   No Transfer   SSST   AKBM
D1                 0.52          0.57   0.84
D2                 0.43          0.47   0.66
D3                 0.32          0.38   0.52
D4                 0.44          0.52   0.68
TABLE III. Retrieval performance on D5.
Training dataset   Test dataset   mAP values over 100 queries
                                  Semi-supervised Transfer   Supervised Transfer
D1                 D2             0.67                       0.68
D1                 D3             0.59                       0.59
D1                 D4             0.70                       0.70
D2                 D1             0.71                       0.69
D2                 D3             0.76                       0.74
D2                 D4             0.83                       0.82
D3                 D1             0.64                       0.63
D3                 D2             0.75                       0.76
D3                 D4             0.89                       0.89
D4                 D1             0.74                       0.74
D4                 D2             0.81                       0.81
D4                 D3             0.84                       0.85
TABLE IV. COMPARISON BETWEEN SEMI-SUPERVISED STYLE TRANSFER (SSST) AND SUPERVISED TRANSFER.
In Table IV we compare the semi-supervised style transfer strategy (SSST) with supervised style transfer. For supervised style transfer using Equation 3, we use a single labeled example per word class from the target domain instead of nearest neighbor based label propagation. SSST performs comparably to supervised style transfer in this case. However, further increasing the number of labeled examples from the target dataset will result in improvement for the supervised case.
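The nearest neighbor based label propagation mentioned here can be pictured with a small sketch. It is a generic illustration under stated assumptions, not the paper's SSST procedure: the function name is hypothetical, features are plain numeric vectors, and Euclidean distance is assumed as the similarity measure.

```python
import math


def propagate_labels(labeled, unlabeled):
    """Give each unlabeled vector the label of its nearest labeled vector.

    labeled:   list of (feature_vector, label) pairs, e.g. training-font words
    unlabeled: list of feature vectors, e.g. target-font words
    Assumption: Euclidean distance in the feature space.
    """
    propagated = []
    for x in unlabeled:
        nearest = min(labeled, key=lambda pair: math.dist(pair[0], x))
        propagated.append(nearest[1])
    return propagated
```

In the supervised variant described above, this step is skipped and one true label per word class from the target domain is used instead.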
We also conduct an experiment on the dataset Dlab to observe the retrieval performance of AKBM in the presence of multiple widely varying fonts in the target dataset. The results of the experiment are given in Fig. 6. For the retrieval experiment, a query image is picked from one of the fonts and retrieval is performed on all the remaining fonts. As the baseline in this experiment, we directly use the query image for retrieval on the target fonts. Results are reported as average mAP values along with the corresponding standard deviation over 10 runs, taking each of the fonts as the source font once. As the number of target fonts increases, the retrieval performance of both AKBM and the baseline decreases; however, AKBM outperforms the baseline in all cases. The large values of the standard deviations can be attributed to the large font variations.
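The reporting protocol described above (each font serves as the source once, and the mean and standard deviation of mAP are taken over the runs) can be sketched as follows. The `evaluate` callable is hypothetical and stands in for whatever retrieval pipeline produces an mAP value; using the sample standard deviation is also an assumption, since the text does not say which variant is reported.

```python
from statistics import mean, stdev


def evaluate_over_source_fonts(fonts, evaluate):
    """Run one retrieval experiment per source font and summarize the mAP values.

    fonts:    list of font identifiers
    evaluate: hypothetical callable (source_font, target_fonts) -> mAP
    Returns (mean mAP, sample standard deviation) over all runs.
    """
    scores = []
    for source in fonts:
        # Every font except the source font forms the target collection.
        targets = [f for f in fonts if f != source]
        scores.append(evaluate(source, targets))
    return mean(scores), stdev(scores)
```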
Results show that, among the different approaches considered for handling cross-font and multi-font retrieval, our kernel based AKBM gives the best retrieval performance in the majority of cases. The superiority of this approach over the style-transfer approach could be attributed to the fact that style-content separation of word images is a complex task, and using a linear model for this task may be rather restrictive.
VI. CONCLUSION
In this work, we have proposed strategies for word image retrieval in a multi-font database. To deal with the style variations between different documents, we have proposed a semi-supervised style transfer strategy. We have also suggested a font independent retrieval strategy by representing words from all the documents using the same set of high dimensional basis vectors.

Citations
Journal ArticleDOI
TL;DR: The nature of texts and inherent challenges addressed by word spotting methods are thoroughly examined and the use of retrieval enhancement techniques based on relevance feedback which improve the retrieved results are investigated.

134 citations

Proceedings ArticleDOI
24 Jul 2016
TL;DR: An overview of the methods which have been applied for document image retrieval over recent years is provided and it is found that from a textual perspective, more attention has been paid to the feature extraction methods without using OCR.
Abstract: Due to the rapid increase of different digitized documents, the development of a system to automatically retrieve document images from a large collection of structured and unstructured document images is in high demand. Many techniques have been developed to provide an efficient and effective way for retrieving and organizing these document images in the literature. This paper provides an overview of the methods which have been applied for document image retrieval over recent years. It has been found that from a textual perspective, more attention has been paid to the feature extraction methods without using OCR.

9 citations


Cites background from "Enhancing Word Image Retrieval in P..."

  • ...5 Scanned books varying in font D1-D5 Up to 85% accuracy [44]...

    [...]

  • ...In [44], the problems of font and style variation, where the query word image has a different style to the dataset, have been considered....

    [...]

DissertationDOI
01 Jan 2019
TL;DR: A fast and non-parametric texture feature extraction method based on summarising the local grey-level structure of the image is further proposed in this research work and provided promising results, with lower computing time as well as smaller memory space consumption compared to other variations of local binary pattern-based methods.
Abstract: Storing and manipulating documents in digital form to contribute to a paperless society has been the propensity of emerging technology. There has been notable growth in the variety and quantity of digitised documents, which have often been scanned/photographed and archived as images without any labelling or sufficient index information. The growth of these kinds of document images will undoubtedly continue with new technology. To provide an effective way for retrieving and organizing these document images, many techniques have been implemented in the literature. However, designing automation systems to accurately retrieve document images from archives remains a challenging problem. Finding discriminative and effective features is the fundamental task for developing an efficient retrieval system. An overview of the literature reveals that research on document image retrieval using texture-based features has not yet been broadly investigated. Texture features are suitable for large volume data and are generally fast to compute. In this study, the effectiveness of more than 50 different texture-based feature extraction methods from four categories of texture features - statistical, transform-based, model-based, and structural approaches - are investigated in order to propose a more accurate method for document image retrieval. Moreover, the influence of resolution and similarity metrics on document image retrieval are examined. The MTDB, ITESOFT, and CLEF_IP datasets, which are heterogeneous datasets providing a great variety of page layouts and contents, are considered for experimentation, and the results are computed in terms of retrieval precision, recall, and F-score. By considering the performance, time complexity, and memory usage of different texture features on three datasets, the best category of texture features for obtaining the best retrieval results is discussed. 
The effectiveness of the transform-based category over other categories in regard to obtaining higher retrieval result is proven. Many new feature extraction and document image retrieval methods are proposed in this research. To attain fast document image retrieval, the number of extracted features and time complexity play a significant role in the retrieval process. Thus, a fast and non-parametric texture feature extraction method based on summarising the local grey-level structure of the image is further proposed in this research work. The proposed fast local binary pattern provided promising results, with lower computing time as well as smaller memory space consumption compared to other variations of local binary pattern-based methods. There is a challenge in DIR systems when document images in queries are of different resolutions from the document images considered for training the system. In addition, a small number of document image samples with a particular resolution may only be available for training a DIR system. To investigate these two issues, an under-sampling concept is considered to generate under-sampled images and to improve the retrieval results. In order to use more than one characteristic of document images for document image retrieval, two different texture-based features are used for feature extraction. The fast-local binary method as a statistical approach, and a wavelet analysis technique as a transform-based approach, are used for feature extraction, and two feature vectors are obtained for every document image. The classifier fusion method using the weighted average fusion of distance measures obtained in relation to each feature vector is then proposed to improve document image retrieval results. To extract features similar to human visual system perception, an appearance-based feature extraction method for document images is also proposed. 
In the proposed method, the Gist operator is employed on the sub-images obtained from the wavelet transform. Thereby, a set of global features from the original image as well as sub-images are extracted. Wavelet-based features are also considered as the second feature set. The classifier fusion technique is finally employed to find similarity distances between the extracted features using the Gist and wavelet transform from a given query and the knowledge-base. Higher document image retrieval results have been obtained from this proposed system compared to the other systems in the literature. The other appearance-based document image retrieval system proposed in this research is based on the use of a saliency map obtained from human visual attention. The saliency map obtained from the input document image is used to form a weighted document image. Features are then extracted from the weighted document images using the Gist operator. The proposed retrieval system provided the best document image retrieval results compared to the results reported from other systems. Further research could be undertaken to combine the properties of other approaches to improve retrieval result. Since in the conducted experiments, a priori knowledge regarding document image layout and content has not been considered, the use of prior knowledge about the document classes may also be integrated into the feature set to further improve the retrieval performance
References
Journal ArticleDOI
TL;DR: The relationship between transfer learning and other related machine learning techniques such as domain adaptation, multitask learning and sample selection bias, as well as covariate shift are discussed.
Abstract: A major assumption in many machine learning and data mining algorithms is that the training and future data must be in the same feature space and have the same distribution. However, in many real-world applications, this assumption may not hold. For example, we sometimes have a classification task in one domain of interest, but we only have sufficient training data in another domain of interest, where the latter data may be in a different feature space or follow a different data distribution. In such cases, knowledge transfer, if done successfully, would greatly improve the performance of learning by avoiding much expensive data-labeling efforts. In recent years, transfer learning has emerged as a new learning framework to address this problem. This survey focuses on categorizing and reviewing the current progress on transfer learning for classification, regression, and clustering problems. In this survey, we discuss the relationship between transfer learning and other related machine learning techniques such as domain adaptation, multitask learning and sample selection bias, as well as covariate shift. We also explore some potential future issues in transfer learning research.

18,616 citations


"Enhancing Word Image Retrieval in P..." refers methods in this paper

  • ...This technique is known as transfer learning [8], and it has been widely used in applications like handwriting recognition [2], [9], face pose classification [10] etc....

    [...]

Journal ArticleDOI
TL;DR: A new method for performing a nonlinear form of principal component analysis by the use of integral operator kernel functions is proposed and experimental results on polynomial feature extraction for pattern recognition are presented.
Abstract: A new method for performing a nonlinear form of principal component analysis is proposed. By the use of integral operator kernel functions, one can efficiently compute principal components in high-dimensional feature spaces, related to input space by some nonlinear map—for instance, the space of all possible five-pixel products in 16 × 16 images. We give the derivation of the method and present experimental results on polynomial feature extraction for pattern recognition.

8,175 citations


"Enhancing Word Image Retrieval in P..." refers background or methods in this paper

  • ...This technique is known as the kernel trick and has been widely used for obtaining nonlinear versions of PCA [18], LDA [19] and many other algorithms....

    [...]

  • ...If any algorithm can be expressed solely in terms of dot products of feature points in H , then we do not need to know the exact mapping φ and a kernel function κ can be defined such that κ(x, y) =< φ(x), φ(y) >, where x, y ∈ R and κ corresponds to some mapping φ [18]....

    [...]

Book ChapterDOI
TL;DR: Graph implementations as mentioned in this paper is a generic method for representing a convex function via its epigraph, described in a disciplined convex programming framework, which allows a very wide variety of smooth and nonsmooth convex programs to be easily specified and efficiently solved.
Abstract: We describe graph implementations, a generic method for representing a convex function via its epigraph, described in a disciplined convex programming framework. This simple and natural idea allows a very wide variety of smooth and nonsmooth convex programs to be easily specified and efficiently solved, using interiorpoint methods for smooth or cone convex programs.

2,991 citations


"Enhancing Word Image Retrieval in P..." refers methods in this paper

  • ...Any standard QP solver [20], [21] can be used for solving this optimization problem....

    [...]

Proceedings ArticleDOI
23 Aug 1999
TL;DR: In this article, a non-linear classification technique based on Fisher's discriminant is proposed and the main ingredient is the kernel trick which allows the efficient computation of Fisher discriminant in feature space.
Abstract: A non-linear classification technique based on Fisher's discriminant is proposed. The main ingredient is the kernel trick which allows the efficient computation of Fisher discriminant in feature space. The linear classification in feature space corresponds to a (powerful) non-linear decision function in input space. Large scale simulations demonstrate the competitiveness of our approach.

2,896 citations

Journal ArticleDOI
TL;DR: An important feature of the method is that arbitrary adaptation data can be used—no special enrolment sentences are needed and that as more data is used the adaptation performance improves.

2,504 citations


"Enhancing Word Image Retrieval in P..." refers methods in this paper

  • ...updating the regression matrix [11], updating the LDA transformation matrix [12] (ii) Classifier adaptation, e....

    [...]

Frequently Asked Questions (13)
Q1. What contributions have the authors mentioned in the paper "Enhancing word image retrieval in presence of font variations" ?

This paper investigates the problem of cross document image retrieval, i.e. use of query images from one style (say font) to perform retrieval from a collection which is in a different style (say a different set of books). The authors present two approaches to tackle this problem. The authors propose an effective style independent retrieval scheme using a nonlinear style-content separation model. The authors also propose a semi-supervised style transfer strategy to expand the query into multiple styles.

Their future work will be to learn the font/style independent features from a large collection of document images. 

Optimal values for the kernel parameters and the regularization factors β and λ are found by performing retrieval on the validation set, and these optimal parameters are then used while performing retrieval on the test set.
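A validation-based parameter search of this kind can be sketched as a plain grid search. This is only an illustration: `validation_map` is a hypothetical callable that returns the validation mAP for a given pair of regularization factors, the candidate grids are the caller's choice, and the kernel parameters would be searched the same way.

```python
from itertools import product


def select_parameters(betas, lambdas, validation_map):
    """Return the (beta, lambda) pair with the highest validation mAP.

    validation_map: hypothetical callable (beta, lam) -> mAP on the validation set.
    """
    return max(product(betas, lambdas), key=lambda pair: validation_map(*pair))
```

The selected pair is then fixed and reused unchanged for retrieval on the test set.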

To make linear models more robust, it is a common practice to first map the feature vectors in the original space to a high dimensional space and then learn the linear model over the high dimensional space. 
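The idea of working in a higher dimensional space without building it explicitly can be checked on a toy case. The sketch below uses a degree-2 polynomial kernel on 2-D inputs, chosen here purely for illustration (the kernel actually used for retrieval is not specified in this passage), and shows that the kernel value equals the dot product of the explicitly mapped vectors.

```python
import math


def poly2_kernel(x, y):
    """Degree-2 homogeneous polynomial kernel for 2-D inputs: (x . y)^2."""
    return (x[0] * y[0] + x[1] * y[1]) ** 2


def explicit_map(x):
    """Explicit feature map phi with <phi(x), phi(y)> == poly2_kernel(x, y)."""
    return (x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2)


def dot(u, v):
    """Plain dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))
```

With `x = (1, 2)` and `y = (3, 4)`, both `poly2_kernel(x, y)` and `dot(explicit_map(x), explicit_map(y))` evaluate to 121, up to floating-point rounding in the second case.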

Their hypothesis is that a style-transformed query would be closer to the correct matches and would lead to a better performance of the nearest neighbor classifier. 

For addressing font style variations in word image retrieval, a common strategy is to use some font independent feature representation. 

A straightforward method to do style transfer of the query is to decompose it into style and content factors using a bilinear model [10]. 

The ith column of Y t corresponding to the mean vector of ith word label can be represented using asymmetric bilinear model as yit = 

Now the retrieval is performed on target dataset on the basis of distance between the content vector of query word images and content vector of target dataset word images. 

The authors have also suggested a font independent retrieval strategy by representing words from all the documents using the same set of high dimensional basis vectors. 

Using this kernel bilinear model, the authors obtain content vector representation for all of the target dataset word images and use them to perform nearest neighbor based retrieval on the basis of their distance with the content vectors corresponding to query labels from the training dataset. 
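Once content vectors are available, the nearest neighbor retrieval step amounts to sorting the target collection by distance to the query's content vector. A minimal sketch, assuming Euclidean distance and plain numeric vectors (the function name is hypothetical, and computing the content vectors themselves is outside this sketch):

```python
import math


def rank_by_content_distance(query_content, target_contents):
    """Return target indices ordered from nearest to farthest content vector.

    Assumption: Euclidean distance between content vectors.
    """
    return sorted(
        range(len(target_contents)),
        key=lambda i: math.dist(query_content, target_contents[i]),
    )
```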

$A^s$ and content vectors $B^c$ (each column is a content vector corresponding to a word label) can be obtained by solving the following optimization problem

$$\min_{A^s, B^c} \| Y^s - A^s B^c \|_F^2. \qquad (2)$$

If the same number of word images are available for all the word labels, this problem can be solved with the help of SVD of the matrix $Y^s$. Consider the task of rendering word images in a new font using the asymmetric bilinear model. 
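For the SVD route mentioned for the optimization problem (2), the minimizer can be written in closed form. The following is a sketch under an assumption: the content dimension $J$ is a symbol introduced here for illustration, and $U_J$, $V_J$ and $\Sigma_J$ denote the first $J$ columns of $U$ and $V$ and the leading $J \times J$ block of $\Sigma$.

```latex
Y^s = U \Sigma V^\top, \qquad
A^s = U_J \Sigma_J, \qquad
B^c = V_J^\top
```

By the Eckart-Young theorem, the product $A^s B^c$ is then the best rank-$J$ approximation of $Y^s$ in the Frobenius norm. The factorization is unique only up to an invertible $J \times J$ transformation applied between the two factors.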

The kernelized version of the bilinear model is able to achieve font independence and improved mAP scores by up to 0.30 for word image retrieval.