Probabilistic Web Image Gathering
Keiji Yanai
Department of Computer Science,
The University of Electro-Communications
1-5-1 Chofugaoka, Chofu-shi,
Tokyo, 182-8585 JAPAN
yanai@cs.uec.ac.jp
Kobus Barnard
Computer Science Department,
University of Arizona
Tucson, AZ, 85721 USA
kobus@cs.arizona.edu
ABSTRACT
We propose a new method for automated large scale gathering of Web images relevant to specified concepts. Our main goal is to build a knowledge base associated with as many concepts as possible for large scale object recognition studies. A second goal is supporting the building of more accurate text-based indexes for Web images. In our method, good quality candidate sets of images for each keyword are gathered as a function of analysis of the surrounding HTML text. The gathered images are then segmented into regions, and a model for the probability distribution of regions for the concept is computed using an iterative algorithm based on the previous work on statistical image annotation. The learned model is then applied to identify which images are visually relevant to the concept implied by the keyword. Implicitly, which regions of the images are relevant is also determined. Our experiments reveal that the new method performs much better than Google Image Search and a simple method based on more standard content based image retrieval methods.
Categories and Subject Descriptors
H.4 [Information Systems Applications]: Miscellaneous
General Terms
Algorithms, Design, Experimentation
Keywords
Web image mining, Web image search, image selection, probabilistic method
1. INTRODUCTION
Because of the recent growth of the World Wide Web, we
can easily gather substantive quantities of image data. Our
goal is to mine such data for visual content. In particular,
we wish to build a large scale data set consisting of many
highly relevant images for each of thousands of concepts. We
present below a method for achieving substantively better
relevance than using either text, or the combination of text
and standard content-based image retrieval (CBIR) methods
based on simple image descriptors.
In the case of text data, there are many studies about
how to gather data and use it as “knowledge” effectively.
While such text “Web mining” is a difficult endeavor and
is an active research area, mining image data poses addi-
tional challenges and has seen less research activity. The
problem with mining images for knowledge is that it is not
known how to reliably automatically determine semantics
from image data. This has been referred to as the seman-
tic gap. For this reason, commercial image search engines
such as Google Image Search and Altavista Image Search
rely on text associated with the images as determined from
surrounding HTML data by a variety of heuristics. The ap-
proach supports fast retrieval, and is somewhat successful,
partly because the user can select from a number of choices.
We contrast the search activity with mining for “visual
knowledge” which is the topic of this paper. To mine the
data we are willing to expend more resources to achieve a
more precise result. The results can then be used to im-
prove indexes for image search, but our main focus here is
to build a training set for generic image recognition. This
data set will go beyond what is available from commercial
image collections such as the Corel data set, which represents
only a subset of the kinds of images that we need to study.
Web images are as diverse as real world scenes, since Web
images are taken by a large number of people for various
kinds of purposes. It can be expected that diverse training
images enable us to classify/recognize diverse real world im-
ages. We believe that use of image data on the Web, namely
visual knowledge on the Web, is promising and important
for resolving real world image recognition.
The key improvement in our gathering approach over earlier
work [16, 17] is the construction of a probabilistic model for
the relevant part of the images. A model for regions relevant
to the keyword is learned simultaneously with one for irrel-
evant (background) regions. This irrelevant region model
is also meant to absorb all regions from irrelevant images.
This process substantively improves the accuracy of the set
of relevant images. A large data set of images accurately
classified into thousands of categories will provide training
data for computer vision methods to further automatically
determine content based on visual information.
To obtain candidate labeled images we could use relevance
feedback, or a semi-supervised approach as was done by H. Feng et al. [6]. In their experiments, they selected 50 relevant images by hand for each search, and then used them for bootstrapping further gathering. However, to achieve the scale that we hope for, a fully automated method is desirable. Thus we propose starting with images evaluated as highly relevant by analysis of the associated HTML text as training images. In our previous work [16], we showed that images whose file name, ALT tag or link word includes a certain keyword "X" are relevant to the keyword "X" with around 75% precision on average. Although 25% of the images are irrelevant, and many of the remaining 75% are not desired canonical examples, they provide an adequate starting point for our approach. We then build a model of the visual concept associated with the keyword "X". We use a generative model based on the Gaussian mixture model (GMM) to represent the "X" model, and estimate the model with the EM algorithm. Next, with this model we classify the images evaluated as highly or moderately relevant by the HTML analysis, and select "X" images from them. By repeating this image selection and model estimation several times, we refine the "X" model and finally obtain "X" images with high accuracy. In the experiments, we show that the new method performs much better than Google Image Search and the simple CBIR-based (Content-Based Image Retrieval) method we employed in the previous system [16].
The rest of this paper is organized as follows: In Section 2 we review related work. In Section 3 we give an overview of our framework, and in Section 4 we describe the probabilistic framework for selecting images gathered from the Web. In Section 5 we present the experimental results and evaluations, and in Section 6 we conclude the paper.
2. RELATED WORK
At present, several commercial image search engines on the Web, such as Google Image Search, Ditto and AltaVista Image Search, are available. The accuracy of their search results is, however, not always sufficient, since they employ only keyword-based search. To overcome this drawback, integrated Web image search engines employing both keyword-based search and content-based image retrieval have been proposed. WebSeer [9], WebSEEk [14] and Image Rover [13] are representative systems employing both visual and textual information. These systems search for images based on query keywords, and then a user selects query images from the search results. After this selection by the user, the systems search for images that are similar to the query images based on image features. These three systems carry out their search in an interactive manner. They can be regarded as a combination of text-based Web image search and CBIR. An interactive approach is suitable for "search" but not for "gathering", since it requires human intervention during the process.
Furthermore, the three research systems quoted above require crawling the Web in advance to gather Web images and build large indices of images on the Web. Hence, they require a large-scale web-crawling mechanism for the whole Web and continuous web-crawling to keep their indices up-to-date for practical use. However, they limited the crawled Web sites in their experiments, and did not build large indices covering the whole Web. This illustrates the difficulty of making such systems as practical as Google Image Search. Therefore, using a commercial Web search engine that indexes the whole World Wide Web, such as Google, as the index of the Web is a much more practical approach. In [16], we proposed using Web text search engines to gather images from the Web. In contrast to the existing systems, by exploiting existing keyword-based search engines and on-demand image-gathering, our system, the Image Collector, required neither a large-scale web-crawling mechanism nor a large index built in advance, so that it could be used practically, unlike the existing Web image search systems quoted above. In fact, at that time, Google Image Search was not yet in service, so this was a very good way to search the whole World Wide Web for images. Now we can use Google Image Search or other commercial image search engines. Hence, most recent studies related to Web image search focus on how to refine the results of Web image search engines, which use only textual information as clues to index Web images but cover most of the extremely large World Wide Web.
As one such study, H. Feng et al. proposed a new method to refine the results of Web image search engines [6], which employed a co-training method [2], a kind of semi-supervised learning. They used both image features and word vectors extracted from the associated HTML documents. In their paper, they claimed that only 50 relevant images needed to be labeled by hand when using co-training with both visual and textual features. They called their method a bootstrapping approach. In their experiments, they used 5418 images associated with 15 concepts, and obtained an F-measure of 54%.
X. Song et al. [15] proposed building visual models automatically from the results of Google Image Search without human intervention. They employed multiple instance learning [10] as the image learning method. Their work is similar to our "Web image mining" framework in terms of learning from Web images with no supervision. However, they focused only on frontal face images and used a face detector which had been trained in advance, although they claimed that they could apply their proposed method to generic Web images by using the supervised framework they had proposed before. Since the face detector was trained on many labeled face images, we consider that their method is not truly "automatic" but rather a kind of supervised method. Their method can build face models of famous individuals after detecting face regions with the face detector. In their paper, they did not mention how many images they obtained from Google Image Search for each person.
The two systems mentioned above also aim to "search" the Web for images. The objective of our image gathering, on the other hand, is entirely different from that of other Web image search systems, including commercial Web image search engines. Their objective is to find highly relevant but relatively few images. That is why they have adopted interactive or supervision-dependent approaches. Unlike these systems, we aim to gather a large number of relevant images; in fact, we plan to gather images of various kinds associated with more than one thousand concepts. In that case, even a very small amount of interaction must be avoided, since we would have to repeat the same kind of interaction more than one thousand times, which would be a troublesome job. Therefore, we adopt non-interactive search without user intervention during the gathering process. This enables the gathering process to be run as a batch job. To gather many kinds of images from the Web, all we have to do is provide keywords related to the concepts of the images we want to gather, and wait.
In work possibly most similar to ours, R. Fergus et al. applied their probabilistic method [7] to filtering the results of Google Image Search [8]. Their method can model object categories in an unsupervised way. Unlike the two studies above, they used imperfect training data which includes outliers, and in addition they used negative training data consisting only of irrelevant images. Although their approach is similar to our method in terms of not requiring supervision and using negative training images, there are some important differences: (1) They used all images obtained from Google Image Search as training images, so HTML analysis never affected the image analysis process, while we use only images judged as highly relevant by HTML analysis, yielding better training data. (2) They used image patches and image curves as the units to be modeled, while we use image regions generated by a region segmentation algorithm, since the original modeling algorithms that the respective methods are based on [7, 1] are different.
3. OVERVIEW OF IMAGE GATHERING
Our proposed method consists of two stages: a collection stage and a selection stage (Figure 1). Most recent studies related to Web image search focus only on how to select relevant images from the results of Web image search engines. In contrast, we attach importance to the collection stage as well as the selection stage. We have shown that we can collect images related to given keywords from the Web with only a Web text search engine, without needing Web image search engines [16]. Therefore, in this paper we also use a Web text search engine to fetch images from the Web, although Web image search engines could be used instead within our framework.
First of all, we provide keywords which represent the visual concept of the images we would like to obtain, for example, "lion", "dog" or "cat". If we use polysemous words as keywords, we can add subsidiary keywords which restrict the meaning of the main keyword. For example, to obtain images of the "bank" of a river, we should use "river" as a subsidiary keyword in addition to the main keyword "bank".
In the collection stage, we use the method we proposed before [16, 17] to collect images from the Web. Since an image on the Web is usually embedded in an HTML document that explains its content, we exploit existing commercial text Web search engines and gather the URLs of HTML documents related to the keywords. In the next step, using those gathered URLs, we fetch HTML documents from the Web, and evaluate the relevancy of the images only by analyzing the associated HTML documents. If an image is judged to be related to the keywords, the image file is downloaded from the Web. According to the relevancy of the images to the given keywords, we divide the fetched images into two groups: images in group A are highly relevant to the keywords, and the others are classified into group B. For all gathered images, we perform region segmentation with JSEG [4] and extract image features from each region. Moreover, we extract word vectors from the HTML documents associated with all the downloaded images. The details are described in [16, 17].
In the selection stage, we employ a probabilistic method to select relevant images from all the downloaded images. In general, to use a probabilistic method or another machine learning method to select true images, we need labeled training images. However, we do not want to pick out good images by hand. Instead, we regard the images classified into group A as training images, although they always include some irrelevant images. Our probabilistic framework allows the training data to include some irrelevant data, which we can remove by repeating both the estimation of a model and the selection of relevant regions from all the regions of the images in groups A and B. We use a generative model based on the Gaussian mixture model to represent the models associated with keywords, and estimate the models with the EM algorithm. After estimating the model, we "recognize" relevant regions out of all the regions in groups A and B with the model. We repeat this model estimation and region selection. After the second iteration, we use the regions selected in the previous iteration as training data for estimating a model.
4. SELECTION STAGE
In the selection stage, the system selects, out of all the images downloaded in the collection stage, the images relevant to the concept that the keywords represent.
4.1 Overview of the Probabilistic Approach for Image Selection
As the method to select images, we adopt a probabilistic method with a Gaussian mixture model. This approach is based on our method for learning to label image regions from images with associated text, without the correspondence between words and image regions being given [5, 1]. That method uses a mixture of multi-modal components, each combining a multinomial for words and a Gaussian over image features. Here, we simplify things a bit, and build models of the distribution of image features for a given concept over regions obtained by a region segmentation algorithm.
To get a model of the regions associated with a certain concept, we need training images. As mentioned before, our basic policy is no human intervention, so we propose using the images in group A as training images. Most of the images in group A are relevant, but, lacking supervision, they always include outliers. Moreover, images usually include backgrounds as well as the objects associated with the given concept. Therefore, we need to eliminate outlier images and regions unrelated to the concept, such as backgrounds, and pick out only the regions strongly associated with the concept in order to estimate a model correctly. We use only the regions expected to be highly related to the concept to estimate a model. In our new method, we need negative training images in addition to the group A and B images. We prepare about one thousand images in advance by fetching them from the Web randomly as negative training images.
Our method to find regions related to a certain concept is an iterative algorithm similar to the expectation maximization (EM) algorithm applied to missing value problems. Initially, we do not know which region is associated with a concept "X", since an image with an "X" label just means that the image contains "X" regions. In fact, with images gathered from the Web, even an image with an "X" label sometimes contains no "X" regions at all. So at first we have to find the regions which are likely to be associated with "X". To find "X" regions, we also need a model for "X" regions. Here we adopt a probabilistic generative model, namely a mixture of Gaussians, fitted using the EM algorithm.
In short, we need to know the model for "X" regions and which regions are associated with "X" simultaneously. However, each depends on the other, so we proceed iteratively. Once we know which regions correspond to "X", we can regard images containing "X" regions as "X" images, and therefore we can compute the probability of being an "X" image for each image. Finally, we select the images with high probability as the final results.
In addition to image selection with only image features, we also propose an extended method which uses not only image features but also textual features extracted from the associated HTML documents. Indeed, some existing studies have integrated both visual and textual features [13, 16, 6]. To realize this, we compute the probability of "X" and "non-X" in terms of word vectors in the same way as for image features, and integrate the two.

[Figure 1 (flow diagram): gathering HTML files from text-based Web search engines for a keyword "X"; extracting and evaluating URLs; gathering A-rank and B-rank images (collection stage, HTML analysis as in [16, 17]); then randomly selecting "X" regions, estimating the "X" model, and selecting "X" regions with the "X" model in a repeating loop, finally selecting the output "X" images (selection stage, estimating a GMM and selecting images with it).]
Figure 1: Processing flow of image-gathering from the Web employing a probabilistic method, which consists of the collection stage and the selection stage.
4.2 Segmentation and Image Feature Extraction
For the images gathered from the Web as "X" images, we carry out region segmentation. In the experiments, we use JSEG [4]. After segmentation, we extract image features from each region whose size is larger than a certain threshold. As image features, we prepare three kinds of features: color, texture and shape, which include the average RGB value and its variance, the average response to the differences of 4 different combinations of 2 Gaussian filters, region size, location, the first moment, and the area divided by the square of the outer boundary length. The image feature vector we use in this paper is 24-dimensional in total.
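To illustrate the kind of vector this produces, here is a minimal Python sketch computing a simplified subset of these features (mean RGB, its variance, region size and location) for one segmented region; the texture responses and shape moments of the full 24-dimensional vector are omitted, and the function signature is our own illustration, not the authors' code:

```python
import numpy as np

def region_features(image_rgb, mask):
    """Simplified per-region features in the spirit of Section 4.2:
    mean RGB, RGB variance, relative size and normalized centroid.
    The paper's full 24-dim vector also includes Gaussian-filter
    texture responses and shape moments, omitted here for brevity."""
    h, w = mask.shape
    ys, xs = np.nonzero(mask)              # pixel coordinates of the region
    rgb = image_rgb[ys, xs].astype(float)  # (num_pixels, 3)
    return np.concatenate([
        rgb.mean(axis=0),                  # average RGB value
        rgb.var(axis=0),                   # its variance
        [len(ys) / (h * w)],               # region size (relative)
        [ys.mean() / h, xs.mean() / w],    # location (normalized centroid)
    ])
```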
4.3 Textual Feature Extraction
In addition to image selection based only on image features, we also carry out experiments using both image features and textual features, namely word vectors extracted from the associated HTML documents.
To make a word vector for each HTML document, we eliminate HTML tags and extract the ten surrounding words (only nouns, adjectives, and verbs) before and after the link tag to the image file, the link words, and the words in the ALT tag from the HTML documents associated with the downloaded images. We count the frequency of all the extracted words, select the top 300 words in terms of frequency, and make a 300-dimensional word vector, whose elements are word frequencies weighted by TF-IDF (Term Frequency and Inverse Document Frequency) [12], for each of the images.
Moreover, to shorten the vectors and to equate words having similar meanings, we apply LSI (Latent Semantic Indexing) [3] to the 300-dimensional word vectors. LSI compresses word vectors with a singular value decomposition similar to principal component analysis. We compress each 300-dimensional word vector into a 100-dimensional vector, and treat this 100-dimensional vector as the word vector in this paper.
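A rough sketch of this pipeline using scikit-learn, assuming `docs` holds the extracted text for each image; the paper's part-of-speech filtering is omitted, and `TfidfVectorizer`'s top-`max_features` selection approximates the paper's top-300-by-frequency rule:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

# docs: one string of extracted surrounding/link/ALT words per image
# (an assumed input). Keep the 300 most frequent words, weight them by
# TF-IDF, then compress to 100 dimensions with LSI (truncated SVD).
vectorizer = TfidfVectorizer(max_features=300)
tfidf = vectorizer.fit_transform(docs)       # (num_images, 300)

lsi = TruncatedSVD(n_components=100)
word_vectors = lsi.fit_transform(tfidf)      # (num_images, 100)
```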
4.4 Detecting Regions and Estimating a Model Associated with "X" and "non-X"
We propose an iterative algorithm to obtain, at the same time, P(X|r_i), which represents the probability that region r_i is associated with the concept "X", and the parameters of the Gaussian mixture model, which represents a generative model of "X" regions. Note that "X" corresponds to a certain concept associated with the given keywords.
At the initial iteration, we regard the images in group A, which are expected to be highly relevant to the concept "X" by the HTML analysis, as positive training images, and prepare negative training images by gathering images from the Web in advance. To gather negative training images, we provided Google Image Search with 200 randomly selected adjective keywords which have no relation to noun concepts, and collected 4000 negative training images.
Next, we select n "X" regions randomly from the group A images, and n "non-X" regions randomly from the regions of the negative training images. In the experiments, we set n to 1000.
Taking the positive and negative regions together, we apply the EM algorithm, a probabilistic clustering algorithm, to the 2n image feature vectors of the regions selected from the positive and negative initial training images, and obtain a Gaussian mixture model.
To select positive components and negative components from all the components of the mixture model, we compute P'(c_j|X), which represents how much the j-th component of the mixture model, c_j, contributes to the concept "X" within the obtained GMM, according to the following formula:

$$P'(c_j \mid X) = \frac{1}{n_X} \sum_{i=1}^{n_X} P(c_j \mid r^X_i, X) \qquad (1)$$

$$\qquad\qquad = \frac{\alpha}{n_X} \sum_{i=1}^{n_X} P(X \mid c_j, r^X_i)\, P(c_j) \qquad (2)$$

where n_X is the number of positive regions, r^X_i is the i-th "X" region, and α is a normalization constant. In the same way, we also compute P'(c_j|nonX). Here we regard all the regions selected from positive images as "X" regions and all the regions selected from negative images as "non-X" regions, and substitute their feature vectors into the above formula.
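Since P(c_j | r^X_i, X) in Eq. (1) is simply the component responsibility of a fitted mixture, this quantity can be sketched with an off-the-shelf GMM; in the following scikit-learn sketch, `x_pos`, `x_neg` (positive/negative region feature arrays) and the number of mixture components are assumptions, as this excerpt does not state the component count:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# x_pos, x_neg: (n, 24) arrays of positive / negative region features
# (assumed inputs); one GMM is fit to the mixed sample, as in the paper.
gmm = GaussianMixture(n_components=20).fit(np.vstack([x_pos, x_neg]))

# predict_proba returns the responsibilities P(c_j | r_i); averaging them
# over the positive (negative) regions gives Eq. (1)'s P'(c_j|X)
# (respectively P'(c_j|nonX)).
p_cj_X = gmm.predict_proba(x_pos).mean(axis=0)
p_cj_nonX = gmm.predict_proba(x_neg).mean(axis=0)
```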
Next, we compute p^X_j and p^nonX_j for all components j as follows:

$$p^X_j = \frac{P'(c_j \mid X)}{P'(c_j \mid X) + P'(c_j \mid nonX)} \qquad (3)$$

$$p^{nonX}_j = \frac{P'(c_j \mid nonX)}{P'(c_j \mid X) + P'(c_j \mid nonX)} \qquad (4)$$
We select components with p^X_j > th_1 as positive components and components with p^nonX_j > th_1 as negative components. Positive components and negative components correspond to Gaussian components associated with the concept "X" and Gaussian components strictly not associated with "X", respectively. The key point in this component selection process is that we mix positive samples and negative samples together before applying the EM algorithm, and throw away the neutral components which belong to neither the positive nor the negative components, since neutral components are expected to correspond to image features included in both positive and negative samples and to be useless for discriminating between "X" and "non-X". This differs from other work (e.g. [11]) which estimates two GMMs separately with EM to model positive and negative image concepts.
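Continuing the sketch above, Eqs. (3)-(4) and the thresholding step look as follows; the value of th_1 is not given in this excerpt, so 0.6 is purely illustrative:

```python
import numpy as np

# Eqs. (3)-(4): relative contribution of each component; components above
# th_1 on either side are kept, the neutral remainder is discarded.
ratio_X = p_cj_X / (p_cj_X + p_cj_nonX)          # p^X_j
ratio_nonX = p_cj_nonX / (p_cj_X + p_cj_nonX)    # p^nonX_j

th_1 = 0.6                                       # assumed value, for illustration
positive_components = np.nonzero(ratio_X > th_1)[0]
negative_components = np.nonzero(ratio_nonX > th_1)[0]
```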
We regard the mixture of only the positive components as the "X" model and the mixture of only the negative components as the "non-X" model. With these models of "X" and "non-X", we can compute P(X|r_i) and P(nonX|r_i) for all the regions extracted from the images in both group A and group B. First, we compute p_1(X|r_i), the output of the "X" model, and p_2(nonX|r_i), the output of the "non-X" model, for each region r_i:

$$p_1(X \mid r_i) = \sum_{k=1}^{m_1} w_{1,k} \frac{1}{\sqrt{(2\pi)^N |\Sigma_{1,k}|}} \exp\left(-\frac{1}{2}(x_i - \mu_{1,k})^T \Sigma_{1,k}^{-1} (x_i - \mu_{1,k})\right) \qquad (5)$$

$$p_2(nonX \mid r_i) = \sum_{k=1}^{m_2} w_{2,k} \frac{1}{\sqrt{(2\pi)^N |\Sigma_{2,k}|}} \exp\left(-\frac{1}{2}(x_i - \mu_{2,k})^T \Sigma_{2,k}^{-1} (x_i - \mu_{2,k})\right) \qquad (6)$$

where x_i is the image feature vector of region r_i, N is the dimension of the image features, m_1 is the number of positive components, and w_{1,k}, μ_{1,k} and Σ_{1,k} represent the weight, the mean vector and the covariance matrix of the k-th positive component, respectively, under the condition $\sum_{k=1}^{m_1} w_{1,k} = 1$.
Finally, we obtain P(X|r_i) and P(nonX|r_i) as follows:

$$P(X \mid r_i) = \frac{p_1(X \mid r_i)}{p_1(X \mid r_i) + p_2(nonX \mid r_i)} \qquad (7)$$

$$P(nonX \mid r_i) = \frac{p_2(nonX \mid r_i)}{p_1(X \mid r_i) + p_2(nonX \mid r_i)} \qquad (8)$$
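A sketch of Eqs. (5)-(8) continuing the running example, using scipy to evaluate the restricted mixtures; `regions` (the feature vectors of all group A and B regions) is an assumed input, and the component weights are renormalized over the selected components as the condition under Eq. (6) requires:

```python
import numpy as np
from scipy.stats import multivariate_normal

def restricted_mixture(x, gmm, components):
    """Weighted Gaussian density of Eq. (5)/(6) over the selected
    components, with weights renormalized so that they sum to 1."""
    w = gmm.weights_[components]
    w = w / w.sum()
    return sum(wk * multivariate_normal.pdf(x, mean=gmm.means_[k],
                                            cov=gmm.covariances_[k])
               for wk, k in zip(w, components))

# regions: (m, 24) features of all group A and B regions (assumed input).
p1 = np.array([restricted_mixture(x, gmm, positive_components) for x in regions])
p2 = np.array([restricted_mixture(x, gmm, negative_components) for x in regions])
P_X = p1 / (p1 + p2)       # Eq. (7)
P_nonX = p2 / (p1 + p2)    # Eq. (8)
```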
For the next iteration, we select the top n regions in terms of P(X|r_i) as "X" regions and the top (2/3)n regions in terms of P(nonX|r_i) as "non-X" regions. In addition, we add (1/3)n regions randomly selected from the negative images gathered from the Web in advance to the "non-X" regions. We repeat the processing described above with n positive regions and n negative regions several times.
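Continuing the running sketch, this re-selection step can be written as follows, with `negative_pool` (features of the pre-fetched random Web images) assumed as an input:

```python
import numpy as np

n = 1000                                   # as in the experiments (Section 4.4)
rng = np.random.default_rng()

# Top n regions by P(X|r_i) become the next positive training set; the
# next negative set mixes the top 2/3*n regions by P(nonX|r_i) with
# 1/3*n regions drawn from the pre-fetched negative pool.
next_pos = regions[np.argsort(-P_X)[:n]]
next_neg = np.vstack([regions[np.argsort(-P_nonX)[:2 * n // 3]],
                      rng.choice(negative_pool, size=n // 3, replace=False)])
```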
4.5 Computing the Probability of "X"
After iterating the detection of regions and the estimation of models several times, in order to decide which images are "X", we finally compute the probability of "X", P(X|I_j), where I_j represents the j-th image.
To estimate P(X|I_j), we select the top T regions of I_j in terms of P(X|r_i) and average them as follows:

$$P(X \mid I_j) = \frac{1}{T} \sum_{k=1}^{T} P(X \mid r_i^{top_k}) \qquad (9)$$

where r_i^{top_k} is the region with the k-th largest P(X|r_i) within image I_j. This estimation of P(X|I_j) is based on the heuristic that an image having regions whose P(X|r_i) is high is likely to be an "X" image. Since images usually include backgrounds as well as target objects, background regions and unrelated regions should be ignored when estimating P(X|I_j). Therefore, we use not all regions but only the few regions with the highest probability. Finally, we select the images whose P(X|I_j) is greater than a threshold th_2 as the final output images. In the experiments, we set T to 2.
4.6 Selection by Textual Features
To use textual features in addition to image features, we compute the probability of "X" and "non-X" in terms of word vectors in the same way as for image features, and integrate them in two ways. Note that a word vector corresponds to an image, while an image feature vector corresponds to a region.
After each iteration of image selection by image features, we need to select "X" regions and "non-X" regions, and to compute the probability of "X" and "non-X" for each image. Since, unlike the case of image features, we have no negative samples of word vectors at the beginning, we make use of the results of selection by image features. We regard the word vectors of the images labeled as "X" by image-feature-based selection as positive training samples, and the word vectors of the images labeled as "non-X" as negative training samples. In the next step, using both positive and negative training vectors, we build an "X" model and a "non-X" model in a similar way as for image features: we apply the EM algorithm to the word vectors of the "X" and "non-X" images, select the Gaussian components corresponding to "X" and "non-X", and compute the probability of "X" and "non-X" in terms of word vectors.
We prepare two methods for using the probability P_word(X|I_j). One is computing the weighted sum P_total(X|r_i) of the word-vector-based probability P_word(X|I_j) and the image-feature-based probability P(X|r_i); the other is a two-step selection in which word-vector-based selection is carried out after image-feature-based selection.
In the weighted-sum selection, we compute P_total(X|r_i) for all r_i as follows:

$$P_{total}(X \mid r_i) = w\, P_{word}(X \mid I_j) + (1 - w)\, P(X \mid r_i) \qquad (10)$$

where r_i ∈ I_j, and we select positive and negative training regions for the next iteration using P_total(X|r_i) instead of just P(X|r_i). In the experiments, we set w to 0.25.
In the two-step selection, using a threshold th_3, we eliminate the regions r_i whose P_word(X|I_j) is low, where r_i ∈ I_j, from the positive regions obtained after image-feature-based region selection.
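A sketch of both fusion schemes, continuing the running example; `P_word` (the word-based probability of each region's parent image, broadcast onto the regions) is an assumed input, and the value of th_3 is not given in this excerpt:

```python
import numpy as np

# Weighted-sum selection, Eq. (10): blend the word-based image score,
# copied to each of its regions, with the region's image-based score.
w = 0.25                                   # value used in the experiments
P_total = w * P_word + (1 - w) * P_X

# Two-step selection: image-feature-based selection first, then drop
# positives whose word-based score falls below th_3.
th_3 = 0.5                                 # assumed value, for illustration
top_idx = np.argsort(-P_X)[:n]
two_step_pos = regions[top_idx][P_word[top_idx] > th_3]
```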
4.7 Proposed Algorithm
To summarize the method described above, the algorithm is as follows:
(1) Carry out region segmentation for all the images and extract image features from each region of each image. In this paper, we use JSEG [4] to perform region segmentation. In case word vectors are used, we also generate word vectors for all the images in both group A and group B from the associated HTML documents in this step.
(2) At the first iteration, regard the images in group A as positive training images associated with the concept "X", and the images gathered from the Web in advance with non-noun keywords as negative training images.
(3) Select n "X" regions randomly from the positive images, and n "non-X" regions randomly from the negative images (Figure 1 (4)).
(4) Applying the EM algorithm to the image features of the selected positive and negative regions, compute the Gaussian mixture model for the distribution of both "X" and "non-X" (Figure 1 (5)).
(5) Find the components of the Gaussian mixture which contribute greatly to "X" regions or to "non-X" regions. These are regarded as "X" components or "non-X" components, and the rest are ignored. The mixture of only the positive components is regarded as the "X" model and the mixture of only the negative components as the "non-X" model.

References
[1] P. Duygulu, K. Barnard, J. F. G. de Freitas, and D. Forsyth. Object recognition as machine translation: Learning a lexicon for a fixed image vocabulary. In Proc. of European Conference on Computer Vision (ECCV), 2002.
[2] A. Blum and T. Mitchell. Combining labeled and unlabeled data with co-training. In Proc. of the Conference on Computational Learning Theory (COLT), 1998.
[3] S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 1990.
[7] R. Fergus, P. Perona, and A. Zisserman. Object class recognition by unsupervised scale-invariant learning. In Proc. of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2003.
[12] G. Salton and C. Buckley. Term-weighting approaches in automatic text retrieval. Information Processing & Management, 1988.