
Tag Completion for Image Retrieval

01 Mar 2013-IEEE Transactions on Pattern Analysis and Machine Intelligence (IEEE Computer Society)-Vol. 35, Iss: 3, pp 716-727
Abstract: Many social image search engines are based on keyword/tag matching. This is because tag-based image retrieval (TBIR) is not only efficient but also effective. The performance of TBIR is highly dependent on the availability and quality of manual tags. Recent studies have shown that manual tags are often unreliable and inconsistent. In addition, since many users tend to choose general and ambiguous tags in order to minimize their efforts in choosing appropriate words, tags that are specific to the visual content of images tend to be missing or noisy, leading to a limited performance of TBIR. To address this challenge, we study the problem of tag completion, where the goal is to automatically fill in the missing tags as well as correct noisy tags for given images. We represent the image-tag relation by a tag matrix, and search for the optimal tag matrix consistent with both the observed tags and the visual similarity. We propose a new algorithm for solving this optimization problem. Extensive empirical studies show that the proposed algorithm is significantly more effective than the state-of-the-art algorithms. Our studies also verify that the proposed algorithm is computationally efficient and scales well to large databases.

Summary (3 min read)

1 INTRODUCTION

  • With the remarkable growth in the popularity of social media websites, there has been a proliferation of digital images on the Internet, which has posed a great challenge for large-scale image search.
  • To overcome the limitations of CBIR, TBIR represents the visual content of images by manually assigned keywords/tags.
  • Like the classification based approaches for image annotation, these approaches require a large number of well annotated images to achieve good performance, and are therefore not suitable for the tag completion problem.
  • The limitation of current automatic image annotation approaches motivates us to develop a new computational framework for tag completion.

3.1 A Framework for Tag Completion

  • Figure 1 illustrates the tag completion task.
  • Let T̂ ∈ R^{n×m} be the partially observed tag matrix derived from user annotations, where T̂_{i,j} is set to one if tag j is assigned to image i and zero otherwise.
  • To narrow down the solution for the complete tag matrix T, the authors consider the following three criteria for reconstructing T.
  • To address this challenge, the authors propose to exploit this criterion by comparing image similarities based on visual content with image similarities based on the overlap in annotated tags.
  • There are, however, two problems with the formulation in (1).

3.2 Optimization

  • To solve the optimization problem in (2), the authors develop a subgradient descent based approach (Algorithm 1).
  • The subgradient descent approach is an iterative method.
  • At each iteration t, given the current solution.
  • At each iteration t, the authors compute the subgradients ∇TA(Tt,wt) and ∇wA(Tt,wt), and update the solutions for T and w according to the theory of composite function optimization [3].
  • The authors' final question is how to decide the step size ηt.

3.3 Discussion

  • This however is not a serious issue from the viewpoint of learning theory [5].
  • To alleviate the problem of local optima, the authors run the algorithm 20 times and choose the run with the lowest objective function.
  • The convergence rate for the adopted subgradient descent method is O(1/√t), where t is the number of iterations.
  • The authors finally note that since the objective of this work is to complete the tag matrix for all the images, it belongs to the category of transductive learning.
  • A similar approach can be used with the proposed method to make predictions for out-of-sample images.

3.4 Tag Based Image Retrieval

  • Given the complete tag matrix T obtained by solving the optimization problem in (2), the authors briefly describe how to utilize the matrix T for tag based image retrieval.
  • The authors first consider the simplest scenario when the query consists of a single-tag.
  • A straightforward approach is to compute the tag based similarity between the query and the images by Tq.
  • A shortcoming of this similarity measure is that it does not take into account the correlation between tags.

4 EXPERIMENTS

  • The authors evaluate the quality of the completed tag matrix on two tasks: automatic image annotation and tag based image retrieval.
  • The maximum number of annotated tags per image is 82.
  • The authors then cluster the projected low dimensional SIFT features into 100,000 visual words, and represent the visual content of images by the histogram of the visual words.
  • To make a fair comparison with other state-of-the-art methods, the authors adopt average precision and average recall [6] as the evaluation metrics.

4.1 Experiment (I): Automatic Image Annotation

  • The authors first evaluate the proposed algorithm for tag completion by automatic image annotation.
  • To run the proposed algorithm for automatic image annotation, the authors simply view test images as special cases of partially tagged images, i.e., no tag is observed for test images.
  • The authors will discuss the parameter setting in more detail later in this section.
  • The authors also observed that as the number of returned tags increases from five to ten, the precision usually declines while the recall usually improves.

4.2 Experiment (II): Tag based Image Retrieval

  • Unlike the experiments for image annotation where each dataset is divided into a training set and a testing set, for the experiment of tag-based image retrieval, the authors include all the images from the dataset except the queries as the gallery images for retrieval.
  • Similar to the previous experiments, the authors only compare the proposed algorithm to TagProp and TagRel because the other approaches were unable to handle the partially tagged images.
  • Below, the authors first present the results for queries with single-tag, and then the results for queries consisting of multiple tags.

4.2.1 Results for Single-tag Queries

  • Since every tag can be used as a query, the authors have in total 260 queries for the Corel5k dataset, 495 queries for the Labelme dataset, and 1,000 queries for the Flickr and TinyImage datasets.
  • The authors adopt a simple rule for determining relevance: an image is relevant if its annotation contains the query.
  • Besides the TagProp and TagRel methods, the authors also introduce a reference method that returns a gallery image if its observed tags include the query word.
  • By comparing to the reference method, the authors will be able to determine the improvement made by the proposed matrix completion method.
  • Table 4 shows the MAP results for the four datasets.

4.2.2 Experimental results for queries with multiple tags

  • To generate queries with multiple tags, the authors randomly select 200 images from the Flickr dataset, and use the annotated tags of the randomly selected images as the queries.
  • For all the methods in comparison, the authors follow the method presented in Section 3.3 for calculating the tag-based similarity between the textual query and the completed tags of gallery images.
  • According to Table 5, the authors first observe a significant difference in MAP scores between CBIR and TBIR (i.e. the reference method, TagProp and TMC), which is consistent with the observations reported in the previous study [23].
  • Second, the authors observe that the proposed method TMC outperforms all the baseline methods significantly.
  • Figure 5 shows examples of queries and images returned by the proposed method and the baselines.

4.3 Convergence and Computational Efficiency

  • The authors evaluate the computational efficiency by the running time for image annotation.
  • Table 6 summarizes the running times of both the proposed method and the baseline methods.
  • Note that for the Flickr and TinyImg datasets, the authors only report the running time for three methods, because the other methods either have memory issues or take more than several days to finish.
  • Figure 3 shows how the objective function value is reduced over the iterations.

5 CONCLUSIONS

  • The authors have proposed a tag matrix completion method for image tagging and image retrieval.
  • The authors consider the image-tag relation as a tag matrix, and aim to optimize the tag matrix by minimizing the difference between tag based similarity and visual content based similarity.
  • The proposed method falls into the category of semi-supervised learning in that both tagged images and untagged images are exploited to find the optimal tag matrix.
  • Extensive experimental results on four open benchmark datasets show that the proposed method significantly outperforms several state-of-the-art methods for automatic image annotation.


IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. XX, NO. XX, JANUARY 2011 1
Tag Completion for Image Retrieval
Lei Wu Member, IEEE, Rong Jin, Anil K. Jain, Fellow, IEEE
Abstract—Many social image search engines are based on keyword/tag matching. This is because tag based image retrieval
(TBIR) is not only efficient but also effective. The performance of TBIR is highly dependent on the availability and quality of
manual tags. Recent studies have shown that manual tags are often unreliable and inconsistent. In addition, since many users
tend to choose general and ambiguous tags in order to minimize their efforts in choosing appropriate words, tags that are specific
to the visual content of images tend to be missing or noisy, leading to a limited performance of TBIR. To address this challenge,
we study the problem of tag completion where the goal is to automatically fill in the missing tags as well as correct noisy tags for
given images. We represent the image-tag relation by a tag matrix, and search for the optimal tag matrix consistent with both the
observed tags and the visual similarity. We propose a new algorithm for solving this optimization problem. Extensive empirical
studies show that the proposed algorithm is significantly more effective than the state-of-the-art algorithms. Our studies also
verify that the proposed algorithm is computationally efficient and scales well to large databases.
Index Terms—tag completion, matrix completion, tag-based image retrieval, image annotation, image retrieval, metric learning.
1 INTRODUCTION
With the remarkable growth in the popularity of social media websites, there has been a proliferation of digital images on the Internet, which has posed a great challenge for large-scale image search. Most image retrieval methods can be classified into two categories: content based image retrieval [41], [36] (CBIR) and keyword/tag based image retrieval [32], [58] (TBIR).
CBIR takes an image as a query, and identifies the matched images based on the visual similarity between the query image and gallery images. Various visual features, including both global features [33] (e.g., color, texture, and shape) and local features [16] (e.g., SIFT keypoints), have been studied for CBIR. Despite these significant efforts, the performance of available CBIR systems is usually limited [38], due to the semantic gap between the low-level visual features used to represent images and the high-level semantic meaning behind images.
To overcome the limitations of CBIR, TBIR represents the visual content of images by manually assigned keywords/tags. It allows a user to present his/her information need as a textual query, and find the relevant images based on the match between the textual query and the manual annotations of images. Compared to CBIR, TBIR is usually more accurate in identifying relevant images [24] by alleviating the challenge arising from the semantic gap. TBIR is also more efficient in retrieving relevant images than CBIR because it can be formulated as a document retrieval problem and therefore can be efficiently implemented using the inverted index technique [29].
L. Wu, R. Jin, and A.K. Jain are with the Department of Computer Science and Engineering, Michigan State University, East Lansing, MI 48824 USA. A. K. Jain is also with the Dept. of Brain & Cognitive Engineering, Korea University, Anamdong, Seongbukgu, Seoul 136-713, Republic of Korea. E-mail: leiwu@live.com, rongjin@cse.msu.edu, jain@cse.msu.edu. Manuscript received August 22, 2011.
However, the performance of TBIR is highly dependent on the availability and quality of manual tags. In most cases, the tags are provided by the users who upload their images to the social media sites (e.g., Flickr), and are therefore often inconsistent and unreliable in describing the visual content of images, as indicated in a recent study on Flickr data [47]. In particular, according to [37], in order to minimize the effort in selecting appropriate words for given images, many users tend to describe the visual content of images by general, ambiguous, and sometimes inappropriate tags, as explained by the principle of least effort [25]. As a result, the manually annotated tags tend to be noisy and incomplete, leading to a limited performance of TBIR. This was observed in [44], where, on average, less than 10% of query words were used as image tags, implying that many useful tags were missing in the database. In this work, we address this challenge by automatically filling in the missing tags and correcting the noisy ones. We refer to this problem as the tag completion problem.
One way to complete the missing tags is to directly apply automatic image annotation techniques [52], [17], [20], [42] to predict additional keywords/tags based on the visual content of images. Most automatic image annotation algorithms cast the problem of keyword/tag prediction into a set of binary classification problems, one for each keyword/tag. The main shortcoming of this approach is that in order to train a reliable model for keyword/tag prediction, it requires a large set of training images with clean and complete manual annotations. Any missing or noisy tag could potentially lead to a biased estimation of prediction models, and consequently suboptimal performance. Unfortunately, the annotated tags for most Web images are incomplete and noisy, making it difficult to directly apply the method of automatic image annotation.
Besides the classification approaches, several advanced machine learning approaches have been applied to image annotation, including annotation by search [54], tag propagation [38], probabilistic relevant component analysis (pRCA) [33], distance metric learning [19], [46], [49], [28], tag transfer [15], and reranking [61]. Similar to the classification based approaches for image annotation, these approaches require a large number of well annotated images to achieve good performance, and therefore are not suitable for the tag completion problem.
The limitation of current automatic image annotation approaches motivates us to develop a new computational framework for tag completion. In particular, we cast tag completion into a problem of matrix completion: we represent the relation between tags and images by a tag matrix, where each row corresponds to an image and each column corresponds to a tag. Each entry in the tag matrix is a real number that represents the relevance of a tag to an image. Similarly, we represent the partially and noisily tagged images by an observed tag matrix, where an entry (i, j) is marked as 1 if and only if image i is annotated by keyword/tag j. Besides the tag information, we also compute the visual similarity between images based on the extracted visual features. We search for the optimal tag matrix that is consistent with both the observed tag matrix and the pairwise visual similarity between images. We present an efficient learning algorithm for tag completion that scales well to large databases with millions of images. Our extensive empirical studies verify both the efficiency and effectiveness of the proposed algorithm in comparison to the state-of-the-art algorithms for automatic image annotation.
The rest of this paper is organized as follows. In Section 2, we overview the related work on automatic image annotation. Section 3 defines the problem of tag completion and provides a detailed description of the proposed framework and algorithm. Section 4 summarizes the experimental results on automatic image annotation and tag based search. Section 5 concludes this study with suggestions for future work.
2 RELATED WORK
Numerous algorithms have been proposed for automatic image annotation (see [18] and references therein). They can roughly be grouped into two major categories, depending on the type of image representation used. The first group of approaches is based upon global image features [31], such as color moments and texture histograms. The second group of approaches adopts local visual features: [30], [43], [48] segment images into multiple regions, and represent each region by a vector of visual features. Other approaches [22], [56], [45] extend the bag-of-features or bag-of-words representation, which was originally developed for object recognition, to automatic image annotation. More recent work [34], [27] improves the performance of automatic image annotation by taking into account the spatial dependence among visual features. Other than predicting annotated keywords for the entire image, several algorithms [11] have been developed to predict annotations for individual regions within an image. Despite these developments, the performance of automatic image annotation is far from satisfactory. A recent report [38] shows that the state-of-the-art methods for automatic image annotation, including Conditional Random Fields (CRM) [52], the inference network approach (infNet) [17], Nonparametric Density Estimation (NPDE) [7], and supervised multi-class labeling (SML) [20], only achieve 16%-28% average precision and 19%-33% average recall on the key benchmark datasets Corel5k and ESP Game. Another limitation of most automatic image annotation algorithms is that they require fully annotated images for training, making them unsuitable for the tag completion problem.
Several recent works explore multi-label learning techniques for image annotation that aim to exploit the dependence among keywords/tags. Ramanan et al. [13] proposed a discriminative model for multi-label learning. Zhang et al. [40] proposed a lazy learning algorithm for multi-label prediction. Hariharan et al. [8] proposed a max-margin classifier for large scale multi-label learning. Guo et al. [55] applied conditional dependency networks to structured multi-label learning. An approach for batch-mode image re-tagging is proposed in [32]. Zha et al. [60] proposed a graph based multi-label learning approach for image annotation. Wang et al. [26] proposed a multi-label learning approach via maximum consistency. Chen et al. [21] proposed an efficient multi-label learning method based on hypergraph regularization. Bao et al. [9] proposed a scalable multi-label propagation approach for image annotation. Liu et al. [59] proposed a constrained non-negative matrix factorization method for multi-label learning. Unlike the existing approaches for multi-label learning that assume complete and perfect class assignments, the proposed approach is able to deal with noisy and incorrect tags assigned to the images. Although a matrix completion approach was proposed in [1] for transductive classification, it differs from the proposed work in that it applies the Euclidean distance to measure the difference between two training instances, while the proposed approach introduces a distance metric to better capture the similarity between two instances.
Besides the classification approaches, several recent works on image annotation are based on distance metric learning. Monay et al. [19] proposed to annotate images in a latent semantic space. Wu and Hoi et al. [46], [49], [33] proposed to learn a metric to better capture the image similarity. Zhuang et al. [28] proposed a two-view learning algorithm for tag re-ranking. Li et al. [53] proposed a neighbor voting method for social tagging. Similar to the classification based approaches, these methods require clean and complete image tags, making them unsuitable for the tag completion problem.
Fig. 1. The framework for tag matrix completion and its application to image search. Given a database of images with some initially assigned tags, the proposed algorithm first generates a tag matrix denoting the relation between the images and the initially assigned tags. It then automatically completes the tag matrix by updating the relevance scores of tags for all the images. The completed tag matrix is then used for tag based image search or image similarity search.
Finally, our work is closely related to tag refinement [24]. Unlike the proposed work, which tries to complete the missing tags and correct the noisy tags, tag refinement is only designed to remove noisy tags that do not reflect the visual content of images.
3 TAG COMPLETION
We first present a framework for tag completion, and then describe an efficient algorithm for solving the optimization problem related to the proposed framework.
3.1 A Framework for Tag Completion
Figure 1 illustrates the tag completion task. Given a binary image-tag matrix (tag matrix for brevity), our goal is to automatically complete the tag matrix with real numbers that indicate the probability of assigning the tags to the images. Given the completed tag matrix, we can run TBIR to efficiently and accurately identify the relevant images for a textual query.
Let n and m be the number of images and unique tags, respectively. Let T̂ ∈ R^{n×m} be the partially observed tag matrix derived from user annotations, where T̂_{i,j} is set to one if tag j is assigned to image i and zero otherwise. We denote by T ∈ R^{n×m} the completed tag matrix that needs to be computed. In order to complete the partially observed tag matrix T̂, we further represent the visual content of the images by a matrix V ∈ R^{n×d}, where d is the number of visual features and each row of V corresponds to the vector of visual features for an image. Finally, to exploit the dependence among different tags, we introduce the tag correlation matrix R ∈ R^{m×m}, where R_{i,j} represents the correlation between tags i and j. Following [10], we compute the correlation score between two tags i and j as

    R_{i,j} = f_{i,j} / (f_i + f_j - f_{i,j})

where f_i and f_j are the occurrences of tags i and j, and f_{i,j} is the co-occurrence of tags i and j. Note that f_i, f_j, and f_{i,j} are statistics collected from the partially observed tag matrix T̂. Our goal is to reconstruct the tag matrix T based on the partially observed tag matrix T̂, the visual representation of the image data V, and the tag correlation matrix R. To narrow down the solution for the complete tag matrix T, we consider the following three criteria for reconstructing T; these three constraints allow the matrix completion algorithm to avoid trivial solutions.
First, the complete tag matrix T should be similar to the partially observed matrix T̂. We enforce this constraint by penalizing the difference between T and T̂ with a Frobenius norm, preferring solutions T with small ||T - T̂||_F^2.
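The correlation score above can be sketched directly in NumPy. This is an illustrative implementation, not the authors' code; the function and variable names are mine:

```python
import numpy as np

def tag_correlation(T_obs):
    """R[i, j] = f_ij / (f_i + f_j - f_ij), computed from a binary
    observed tag matrix T_obs of shape (n images, m tags)."""
    f = T_obs.sum(axis=0)            # f_i: occurrence count of each tag
    f_co = T_obs.T @ T_obs           # f_ij: co-occurrence counts (m x m)
    denom = f[:, None] + f[None, :] - f_co
    with np.errstate(divide="ignore", invalid="ignore"):
        # Guard against tags that never occur (denom == 0).
        R = np.where(denom > 0, f_co / denom, 0.0)
    return R

# Toy example: 3 images, 2 tags; the two tags co-occur in one image.
T_obs = np.array([[1, 1],
                  [1, 0],
                  [0, 1]], dtype=float)
R = tag_correlation(T_obs)
```

Note that R is symmetric with ones on the diagonal, and off-diagonal entries grow with how often the two tags appear together relative to how often they appear at all.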
Second, the complete tag matrix T should reflect the visual content of the images represented by the matrix V, where each image is represented as a row vector (visual feature vector) in V. However, since the relationship between the tag matrix T and the visual feature matrix V is unknown, it is difficult to implement this criterion directly. To address this challenge, we propose to exploit this criterion by comparing image similarities based on visual content with image similarities based on the overlap in annotated tags. More specifically, we compute the visual similarity between images i and j as v_i^T v_j, where v_i and v_j are the ith and jth rows of matrix V. Given the complete tag matrix T, we can also compute the similarity between images i and j based on the overlap between their tags, i.e., t_i^T t_j, where t_i and t_j are the ith and jth rows of matrix T. If the complete tag matrix T reflects the visual content of the images, we expect |v_i^T v_j - t_i^T t_j|^2 to be small for any two images i and j. As a result, we expect a small value for

    sum_{i,j=1}^n |v_i^T v_j - t_i^T t_j|^2 = ||T T^T - V V^T||_F^2.

Finally, we expect the complete matrix T to be consistent with the correlation matrix R, and therefore a small value for ||T^T T - R||_F^2. Combining the three criteria, we have the following optimization problem for finding the complete tag matrix T:

    min_{T ∈ R^{n×m}}  ||T T^T - V V^T||_F^2 + λ||T^T T - R||_F^2 + η||T - T̂||_F^2    (1)

where λ > 0 and η > 0 are parameters whose values will be decided by cross validation.
There are, however, two problems with the formulation in (1). First, the visual similarity between images i and j is computed by v_i^T v_j, which assumes that all visual features are equally important in determining the visual similarity. Since some visual features may be more important than others in deciding the tags for images, we introduce a vector w = (w_1, ..., w_d) ∈ R_+^d, where w_i represents the importance of the ith visual feature. Using the weight vector w, we modify the visual similarity measure to v_i^T A v_j, where A = diag(w) is a diagonal matrix with A_{i,i} = w_i. Second, the complete tag matrix T computed by (1) may be dense, with most of the entries in T being non-zero. On the other hand, we generally expect only a small number of tags to be assigned to each image, and as a result a sparse matrix T. To address this issue, we introduce into the objective function an L_1 regularizer for T, i.e., ||T||_1 = sum_{i=1}^n sum_{j=1}^m |T_{i,j}|. Incorporating these two modifications into (1), we have the final optimization problem for tag completion:

    min_{T ∈ R^{n×m}, w ∈ R_+^d}  L(T, w)    (2)

where

    L(T, w) = ||T T^T - V diag(w) V^T||_F^2 + λ||T^T T - R||_F^2 + η||T - T̂||_F^2 + μ||T||_1 + γ||w||_1.

Note that in (2) we further introduce an L_1 regularizer for w to generate a sparse solution for w.
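The objective L(T, w) in (2) translates line by line into NumPy. The sketch below is my own rendering under the definitions above (names and the toy data are illustrative), useful for checking each term against the formula:

```python
import numpy as np

def objective(T, w, T_obs, V, R, lam, eta, mu, gamma):
    """L(T, w) from Eq. (2): visual-consistency term, tag-correlation
    term, fidelity to observed tags, and L1 sparsity terms."""
    G = T @ T.T - V @ np.diag(w) @ V.T   # tag-sim vs. weighted visual-sim
    H = T.T @ T - R                      # tag-correlation mismatch
    return (np.sum(G ** 2)
            + lam * np.sum(H ** 2)
            + eta * np.sum((T - T_obs) ** 2)
            + mu * np.abs(T).sum()
            + gamma * np.abs(w).sum())

# Toy problem: 5 images, 4 tags, 3 visual features.
rng = np.random.default_rng(0)
n, m, d = 5, 4, 3
T_obs = (rng.random((n, m)) < 0.3).astype(float)
V = rng.random((n, d))
R = T_obs.T @ T_obs
val = objective(T_obs, np.ones(d), T_obs, V, R,
                lam=0.1, eta=1.0, mu=0.01, gamma=0.01)
```

Evaluating at T = T̂ zeroes the fidelity term, so the remaining value is driven by how far the observed tags are from the visual and correlation structure.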
3.2 Optimization
To solve the optimization problem in (2), we develop a subgradient descent based approach (Algorithm 1). Compared to other optimization approaches such as Newton's method and interior point methods [39], the subgradient descent approach has a significantly lower computational complexity per iteration, making it suitable for large image datasets.
The subgradient descent approach is an iterative method. At each iteration t, given the current solution T_t and w_t, we first compute the subgradients of the objective function L(T, w). Define

    G = T_t T_t^T - V diag(w_t) V^T,    H = T_t^T T_t - R.

We compute the subgradients as

    ∇_T L(T_t, w_t) = 2 G T_t + 2λ T_t H + 2η (T_t - T̂) + μΔ    (3)
    ∇_w L(T_t, w_t) = 2 diag(V^T G V) + γδ    (4)

where Δ ∈ R^{n×m} and δ ∈ R^d are defined as Δ_{i,j} = sgn(T_{i,j}) and δ_i = sgn(w_i). Here, sgn(z) outputs +1 when z > 0, -1 when z < 0, and a random number uniformly distributed between -1 and +1 when z = 0. Given the subgradients, we update the solutions for T and w as follows:

    T_{t+1} = T_t - η_t ∇_T L(T_t, w_t)
    w_{t+1} = π_Ω(w_t - η_t ∇_w L(T_t, w_t))

where η_t is the step size of iteration t, Ω = {w ∈ R_+^d}, and π_Ω(w) projects a vector w onto the domain Ω to ensure that the learned weights are non-negative.
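The subgradients (3) and (4) can be sketched as follows. This is an illustrative NumPy translation (names are mine, not the authors'); the sgn convention at zero, drawing a uniform value in (-1, 1), follows the text:

```python
import numpy as np

def subgradients(T, w, T_obs, V, R, lam, eta, mu, gamma, rng):
    """Subgradients of L at (T, w), following Eqs. (3) and (4)."""
    G = T @ T.T - V @ np.diag(w) @ V.T
    H = T.T @ T - R

    def sgn(x):
        # sgn with a random value in (-1, 1) at zero: a valid
        # subgradient of the absolute value.
        s = np.sign(x)
        return s + (s == 0) * rng.uniform(-1.0, 1.0, size=x.shape)

    grad_T = 2 * G @ T + 2 * lam * T @ H + 2 * eta * (T - T_obs) + mu * sgn(T)
    grad_w = 2 * np.diag(V.T @ G @ V) + gamma * sgn(w)
    return grad_T, grad_w

rng = np.random.default_rng(0)
n, m, d = 4, 3, 2
T_obs = (rng.random((n, m)) < 0.5).astype(float)
V = rng.random((n, d))
R = T_obs.T @ T_obs
gT, gw = subgradients(T_obs, np.ones(d), T_obs, V, R,
                      0.1, 1.0, 0.01, 0.01, rng)
```

Note how the shapes line up: G is n×n and H is m×m, so the gradient with respect to T is n×m and the gradient with respect to w is a length-d vector.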
One problem with the above implementation of the subgradient descent approach is that the intermediate solutions T_1, T_2, ... may be dense, leading to a high computational cost in matrix multiplication. We address this difficulty by exploiting the method developed for composite function optimization [12]. In particular, we rewrite L(T, w) as L(T, w) = A(T, w) + γ||w||_1 + μ||T||_1, where

    A(T, w) = ||T T^T - V diag(w) V^T||_F^2 + λ||T^T T - R||_F^2 + η||T - T̂||_F^2.

At each iteration t, we compute the subgradients ∇_T A(T_t, w_t) and ∇_w A(T_t, w_t), and update the solutions for T and w according to the theory of composite function optimization [3]:

    T_{t+1} = arg min_T (1/2)||T - T̂_{t+1}||_F^2 + μ η_t ||T||_1    (5)
    w_{t+1} = arg min_w (1/2)||w - ŵ_{t+1}||^2 + γ η_t ||w||_1    (6)

where η_t is the step size for the t-th iteration, and T̂_{t+1} and ŵ_{t+1} are given by

    T̂_{t+1} = T_t - η_t ∇_T A(T_t, w_t),    (7)
    ŵ_{t+1} = w_t - η_t ∇_w A(T_t, w_t).    (8)

Using the result in [3], the solutions to (5) and (6) are given by

    T_{t+1} = max(0, T̂_{t+1} - μ η_t 1_n 1_m^T)    (9)
    w_{t+1} = max(0, ŵ_{t+1} - γ η_t 1_d)    (10)

where 1_d is a vector of d dimensions with all its elements being one. As indicated in (9) and (10), any entry in T̂_{t+1} (respectively ŵ_{t+1}) that is less than μ η_t (respectively γ η_t) becomes zero, leading to sparse solutions for T and w by the theory of composite function optimization.
Our final question is how to decide the step size η_t. Two common choices are η_t = 1/√t and η_t = 1/t. We set η_t = 1/t, which appears to yield faster convergence than η_t = 1/√t. Algorithm 1 summarizes the key steps of the subgradient descent approach.
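The composite updates (7)-(10) amount to a gradient step on the smooth part followed by a one-sided soft threshold. A minimal sketch (helper name and toy values are mine):

```python
import numpy as np

def composite_step(T, w, grad_T, grad_w, step, mu, gamma):
    """One iteration of Eqs. (7)-(10): gradient step on the smooth part
    A, then thresholding that zeroes entries below mu*step (for T) or
    gamma*step (for w), which is what produces sparse iterates."""
    T_hat = T - step * grad_T                        # Eq. (7)
    w_hat = w - step * grad_w                        # Eq. (8)
    T_next = np.maximum(0.0, T_hat - mu * step)      # Eq. (9)
    w_next = np.maximum(0.0, w_hat - gamma * step)   # Eq. (10)
    return T_next, w_next

# With zero gradients, the step only shrinks and truncates:
T = np.array([[0.5, 0.001],
              [0.2, 0.3]])
w = np.array([0.4, 0.002])
T1, w1 = composite_step(T, w, np.zeros_like(T), np.zeros_like(w),
                        step=1.0, mu=0.01, gamma=0.01)
```

Entries of T below μη_t (here 0.01) are set exactly to zero rather than merely made small, so the iterates stay sparse and the matrix products in the next iteration stay cheap.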
3.3 Discussion
Although the proposed formulation is non-convex
and therefore cannot guarantee to find the global
optimal, this however is not a serious issue from the
viewpoint of learning theory [5]. This is because as

IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. XX, NO. XX, JANUARY 2011 5
Algorithm 1 Tag Completion Algorithm ( TMC)
1: INPUT:
Observed tag matrix:
b
T R
n×m
Parameters: γ, η, λ, and µ
Convergence threshold: ε
2: OUTPUT: the complete tag matrix T
3: Compute the tag c orrelation matrix R =
b
T
b
T
4: Initialize w
1
= 1
d
, T
1
=
b
T , and t = 0
5: repeat
6: Set t = t + 1 and stepsize η
t
= 1/t
7: Compute
b
T
t+1
and
b
w
t+1
according to (8)
8: Update the solutions T
t+1
and w
t+1
according
to (9) and (10)
9: until convergence: kL(T
t
, w
t
) L(T
t+1
, w
t+1
)k
εkL(T
t
, w
t
)k
the empirical error goes down during the process
of optimization, the generalization error will become
the le ading term in the prediction error. As a result,
finding the global optima will not have a signifi-
cant impact on the final prediction result. In fact,
[51] shows that only an approximately good solution
would be enough to achieve similar performance as
the exact optimal one. To alleviate the p roblem of local
optima, we run the algorithm 20 times and choose the
run with the lowest objective function.
The convergence rate for the adopted subgradient
descent method is O(1/
t), where t is the number of
iterations. The space requirement for the algorithm is
O(n ×m), where n is the number of images and m is
the number of unique tags.
We finally note that since the objective of this work is to complete the tag matrix for all the images, the method belongs to the category of transductive learning. A common way to turn a transductive method into an inductive one is to retrain a prediction model on the outputs of the transductive method [2]; the same strategy can be applied to the proposed approach to make predictions for out-of-sample images.
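As a minimal instance of such retraining (the nearest-neighbor rule here is our illustrative choice, not a method specified in the paper), one can score tags for an out-of-sample image by averaging the completed tag rows of its visually nearest training images:

```python
import numpy as np

def predict_tags_out_of_sample(x_new, X_train, T_completed, k=5):
    """Score tags for an unseen image from the completed tag matrix.

    x_new: (d,) visual feature vector of the new image.
    X_train: (n, d) features of the n training images.
    T_completed: (n, m) tag matrix output by the transductive method.
    Returns an (m,) vector of tag relevance scores.
    """
    dists = np.linalg.norm(X_train - x_new, axis=1)  # visual distances
    nearest = np.argsort(dists)[:k]                  # k nearest images
    return T_completed[nearest].mean(axis=0)         # average their tags
```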
3.4 Tag Based Image Retrieval
Given the complete tag matrix T obtained by solving the optimization problem in (2), we briefly describe how to utilize T for tag-based image retrieval.
We first consider the simplest scenario, in which the query consists of a single tag. Given a query tag j, we simply rank all the gallery images in descending order of their relevance scores for tag j, i.e., the jth column of matrix T. Now consider the general case, in which a textual query is comprised of multiple tags. Let q = (q_1, …, q_m)^⊤ ∈ {0, 1}^m be a query vector, where q_i = 1 if the ith tag appears in the query and q_i = 0 otherwise. A straightforward approach is to compute the tag-based similarity between the query and the images as T q. A shortcoming of this similarity measure is that it does not take into account the correlation between tags. To address this limitation, we refine the similarity between the query and the images to T W q, where W = π_{[0,1]}(T^⊤ T) is the tag correlation matrix estimated from the complete tag matrix T. Here, π_{[0,1]}(A) projects every entry of A into the range [0, 1].
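The two ranking rules above can be sketched as a minimal NumPy illustration of the scoring described in the text:

```python
import numpy as np

def rank_images(T, q, use_tag_correlation=True):
    """Rank gallery images for a binary query vector q of length m.

    T: (n, m) completed tag matrix. With use_tag_correlation=False the
    score is T q; otherwise it is T W q with W = pi_[0,1](T^T T), the
    tag correlation matrix clamped entrywise into [0, 1].
    Returns image indices in descending order of relevance.
    """
    if use_tag_correlation:
        W = np.clip(T.T @ T, 0.0, 1.0)  # pi_[0,1](T^T T)
        scores = T @ (W @ q)
    else:
        scores = T @ q                   # ignores tag correlation
    return np.argsort(-scores)
```

A single-tag query is the special case where q has exactly one nonzero entry; without the correlation term, T q then reduces to the jth column of T.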
4 EXPERIMENTS
We evaluate the quality of the completed tag matrix
on two tasks: automatic image annotation and tag
based image retrieval.
Four benchmark datasets are used in this study:
  • Corel dataset [43]. It consists of 4,993 images, each annotated with at most five tags; a total of 260 unique keywords are used in this dataset.
  • Labelme photo collection. It consists of 2,900 online photos annotated with 495 non-abstract noun tags; the maximum number of annotated tags per image is 48.
  • Flickr photo collection. It consists of one million images annotated with more than 10,000 tags; the maximum number of annotated tags per image is 76. Since most tags are used by only a small number of images, we reduce the vocabulary to the 1,000 most popular tags in this dataset, which reduces the database to 897,500 images.
  • TinyImg image collection. It consists of 79,302,017 images collected from the web, annotated with 75,062 non-abstract noun tags; the maximum number of annotated tags per image is 82. As with the Flickr photo collection, we reduce the vocabulary to the 1,000 most popular tags in the dataset, which reduces the database size to 997,420 images.
Table 1 summarizes the statistics of the four datasets used in our study.
For the Corel data, we use the same set of features as [38], including SIFT local features and a robust hue descriptor, both extracted densely on multi-scale grids of interest points. Each local feature descriptor is quantized to one of 100,000 visual words identified by a k-means clustering algorithm. Given the quantized local features, we represent each image by a bag-of-words histogram. For the Flickr and Labelme photo collections, we adopt the compact SIFT feature representation [57]: SIFT features are first extracted from an image and then projected to an 8-dimensional space using Principal Component Analysis (PCA). We then cluster the projected low-dimensional SIFT features into 100,000 visual words and represent the visual content of each image by a histogram of the visual words. For the TinyImg dataset, since the images are of low resolution, we adopt a
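The quantize-and-histogram step shared by these pipelines can be sketched as follows (an illustrative NumPy version with a tiny codebook; the PCA projection, the 100,000-word vocabularies, and the authors' actual implementation are not reproduced here):

```python
import numpy as np

def bow_histogram(descriptors, codebook):
    """Map local descriptors to their nearest visual words and return an
    L1-normalized bag-of-words histogram.

    descriptors: (p, d) local features (e.g. PCA-reduced SIFT, d = 8).
    codebook: (k, d) cluster centers learned offline by k-means.
    """
    # Squared Euclidean distance from every descriptor to every word.
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    words = d2.argmin(axis=1)  # index of the nearest visual word
    hist = np.bincount(words, minlength=len(codebook)).astype(float)
    return hist / hist.sum()
```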

References

  • D. G. Lowe, "Object Recognition from Local Scale-Invariant Features," Proc. IEEE Int'l Conf. Computer Vision, 1999.
  • G. Csurka, C. Dance, L. Fan, J. Willamowski, and C. Bray, "Visual Categorization with Bags of Keypoints," ECCV Workshop on Statistical Learning in Computer Vision, 2004.
  • B. C. Russell, A. Torralba, K. P. Murphy, and W. T. Freeman, "LabelMe: A Database and Web-Based Tool for Image Annotation," Int'l J. Computer Vision, 2008.
  • Y. Ke and R. Sukthankar, "PCA-SIFT: A More Distinctive Representation for Local Image Descriptors," Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2004.
  • M.-L. Zhang and Z.-H. Zhou, "ML-KNN: A Lazy Learning Approach to Multi-Label Learning," Pattern Recognition, 2007.