Probabilistic Web Image Gathering
Keiji Yanai
Department of Computer Science,
The University of Electro-Communications
1-5-1 Chofugaoka, Chofu-shi,
Tokyo, 182-8585 JAPAN
yanai@cs.uec.ac.jp
Kobus Barnard
Computer Science Department,
University of Arizona
Tucson, AZ, 85721 USA
kobus@cs.arizona.edu
ABSTRACT
We propose a new method for automated large scale gathering of Web images relevant to specified concepts. Our main goal is to build a knowledge base associated with as many concepts as possible for large scale object recognition studies. A second goal is supporting the building of more accurate text-based indexes for Web images. In our method, good quality candidate sets of images for each keyword are gathered as a function of analysis of the surrounding HTML text. The gathered images are then segmented into regions, and a model for the probability distribution of regions for the concept is computed using an iterative algorithm based on the previous work on statistical image annotation. The learned model is then applied to identify which images are visually relevant to the concept implied by the keyword. Implicitly, which regions of the images are relevant is also determined. Our experiments reveal that the new method performs much better than Google Image Search and a simple method based on more standard content based image retrieval methods.
Categories and Subject Descriptors
H.4 [Information Systems Applications]: Miscellaneous
General Terms
Algorithms, Design, Experimentation
Keywords
Web image mining, Web image search, image selection, probabilistic method
1. INTRODUCTION
Because of the recent growth of the World Wide Web, we
can easily gather substantive quantities of image data. Our
goal is to mine such data for visual content. In particular,
we wish to build a large scale data set consisting of many
highly relevant images for each of thousands of concepts. We
present below a method for achieving substantively better
relevance than using either text, or the combination of text
and standard content-based image retrieval (CBIR) methods
based on simple image descriptors.
In the case of text data, there are many studies about
how to gather data and use it as “knowledge” effectively.
While such text “Web mining” is a difficult endeavor and
is an active research area, mining image data poses addi-
tional challenges and has seen less research activity. The
problem with mining images for knowledge is that it is not
known how to reliably automatically determine semantics
from image data. This has been referred to as the seman-
tic gap. For this reason, commercial image search engines
such as Google Image Search and Altavista Image Search
rely on text associated with the images as determined from
surrounding HTML data by a variety of heuristics. The ap-
proach supports fast retrieval, and is somewhat successful,
partly because the user can select from a number of choices.
We contrast the search activity with mining for “visual
knowledge” which is the topic of this paper. To mine the
data we are willing to expend more resources to achieve a
more precise result. The results can then be used to im-
prove indexes for image search, but our main focus here is
to build a training set for generic image recognition. This
data set will go beyond what is available from commercial
image collections such as the Corel data set, which represents
only a subset of the kinds of images that we need to study.
Web images are as diverse as real world scenes, since Web
images are taken by a large number of people for various
kinds of purposes. It can be expected that diverse training
images enable us to classify/recognize diverse real world im-
ages. We believe that use of image data on the Web, namely
visual knowledge on the Web, is promising and important
for resolving real world image recognition.
The key improvement in our gathering approach over earlier
work [16, 17] is the construction of a probabilistic model for
the relevant part of the images. A model for regions relevant
to the keyword is learned simultaneously with one for irrel-
evant (background) regions. This irrelevant region model
is also meant to absorb all regions from irrelevant images.
This process substantively improves the accuracy of the set
of relevant images. A large data set of images accurately
classified into thousands of categories will provide training
data for computer vision methods to further automatically
determine content based on visual information.
To obtain candidate labeled images we could use relevance
feedback, or a semi-supervised approach as was done by H. Feng et al. [6]. In their experiments, they selected 50 relevant images by hand for each search, and then used them for bootstrapping further gathering. However, to achieve the scale that we hope for, a fully automated method is desirable. Thus we propose starting with images evaluated as highly relevant by analysis of the associated HTML text as training images. In our previous work [16], we showed that images whose file name, ALT tag or link word includes a certain keyword "X" are relevant to the keyword "X" with around 75% precision on average. Although 25% of the images are irrelevant, and many of the remaining 75% are not desired canonical examples, they provide an adequate starting point for our approach. We then build a model of the visual concept associated with the keyword "X". We use a generative model based on the Gaussian mixture model (GMM) to represent the "X" model, and estimate the model with the EM algorithm. Next, with this model we classify the images evaluated as highly or moderately relevant by the HTML analysis, and select "X" images from them. By repeating this image selection and model estimation several times, we refine the "X" model and finally obtain "X" images with high accuracy. In the experiments, we show that the new method performs much better than Google Image Search and the simple CBIR-based (Content-Based Image Retrieval) method we employed in the previous system [16].
The rest of this paper is organized as follows: In Section 2 we review related work. In Section 3 we give an overview of our framework, and in Section 4 we describe the probabilistic framework for selecting images gathered from the Web. In Section 5 we present the experimental results and evaluations, and in Section 6 we conclude the paper.
2. RELATED WORK
At present, several commercial image search engines on the Web, such as Google Image Search, Ditto and AltaVista Image Search, are available. The accuracy of their search results is, however, not always sufficient, since they employ only keyword-based search. To overcome this drawback, integrated Web image search engines employing both keyword-based search and content-based image retrieval have been proposed. WebSeer [9], WebSEEk [14] and Image Rover [13] are representative systems employing both visual and textual information. These systems search for images based on query keywords, and then a user selects query images from the search results. After this selection by the user, the systems search for images that are similar to the query images based on image features. These three systems carry out their search in an interactive manner. They can be regarded as a combination of text-based Web image search and CBIR. An interactive approach is suitable for "search" but not for "gathering", since it requires human intervention during the process.
Furthermore, the three research systems quoted above require crawling the Web in advance to gather Web images and build large indices of images on the Web. Hence, they require a large-scale web-crawling mechanism for the whole Web and continuous web-crawling to keep their indices up-to-date for practical use. However, they limited the crawled Web sites in their experiments, and did not build large indices covering the whole Web. This illustrates the difficulty of making such systems as practical as Google Image Search. Therefore, using a commercial Web search engine that indexes the whole World Wide Web, such as Google, as the index of the Web is a much more practical approach. In [16], we proposed using Web text search engines to gather images from the Web. In contrast to the existing systems, by exploiting existing keyword-based search engines and on-demand image-gathering, our system, the Image Collector, required neither a large-scale web-crawling mechanism nor a large index built in advance, so that it could be used practically, unlike the existing Web image search systems quoted above. In fact, at that time, Google Image Search was not yet in service, so this was a very good way to search the whole World Wide Web for images. Now we can use Google Image Search or other commercial image search engines. Hence, most recent studies related to Web image search focus on how to refine the results of Web image search engines, which use only textual information as clues to index Web images but cover most of the extremely large World Wide Web.
As one such study, H. Feng et al. proposed a new method to refine the results of Web image search engines [6], which employed a co-training method [2], a kind of semi-supervised learning. They used both image features and word vectors extracted from the associated HTML documents. In their paper, they claimed that only 50 relevant images needed to be labeled by hand when using co-training with both visual and textual features. They called their method a bootstrapping approach. In their experiments, they used 5418 images associated with 15 concepts, and obtained an F-measure of 54%.
X. Song et al. [15] proposed building visual models automatically from the results of Google Image Search without human intervention. They employed multiple instance learning [10] as the image learning method. Their work is similar to our "Web image mining" framework in terms of learning from Web images with no supervision. However, they focused only on frontal face images and used a face detector which had been trained in advance, although they claimed that they could apply their proposed method to generic Web images by using the supervised framework they had proposed before. Since the face detector was trained on many labeled face images, we consider that their method is not truly "automatic" but rather a kind of supervised method. Their method can build face models of famous individuals after detecting face regions with the face detector. In their paper, they did not mention how many images they obtained from Google Image Search for each person.
The two systems mentioned above also aim to "search" the Web for images. The objective of our image gathering, on the other hand, is entirely different from that of other Web image search systems, including commercial Web image search engines. Their objective is to find highly relevant but relatively few images. That is why they have adopted interactive or supervision-dependent approaches. Unlike these systems, we aim to gather a large number of relevant images; in fact, we plan to gather images of various kinds associated with more than one thousand concepts. In that case, even a very small amount of interaction must be avoided, since we would have to repeat the same kind of interaction more than one thousand times, which would be a troublesome job. Therefore, we adopt non-interactive search without user intervention during the gathering process. This enables the gathering process to be run as a batch job. To gather many kinds of images from the Web, all we have to do is provide keywords related to the concepts of the images we want to gather, and wait.
In work possibly most similar to ours, R. Fergus et al. applied their probabilistic method [7] to filtering the results of Google Image Search [8]. Their method can model object categories in an unsupervised way. Unlike the two studies above, they used imperfect training data which includes outliers, and in addition they used negative training data consisting only of irrelevant images. Although their approach is similar to our method in terms of not requiring supervision and using negative training images, there are some important differences: (1) They used all images obtained from Google Image Search as training images, so HTML analysis never affected the image analysis process, while we use only images judged as highly relevant by HTML analysis, yielding better training data. (2) They used image patches and image curves as the units to be modeled, while we use image regions generated by a region segmentation algorithm, since the original modeling algorithms that the respective methods are based on [7, 1] are different.
3. OVERVIEW OF IMAGE GATHERING
Our proposed method consists of two stages: a collection stage and a selection stage (Figure 1). Most recent studies related to Web image search focus only on how to select relevant images from the results of Web image search engines. In contrast, we attach importance to the collection stage as well as the selection stage. We have shown that we can collect images related to given keywords from the Web with only a Web text search engine, without needing Web image search engines [16]. Therefore, in this paper we also use a Web text search engine to fetch images from the Web, although Web image search engines could be used instead within our framework.
First of all, we provide keywords which represent the visual concept of the images we would like to obtain, for example, "lion", "dog" or "cat". If we use polysemous words as keywords, we can add subsidiary keywords which restrict the meaning of the main keyword. For example, to obtain images of the "bank" of a river, we should use "river" as a subsidiary keyword in addition to the main keyword "bank".
In the collection stage, we use the method we proposed before [16, 17] to collect images from the Web. Since an image on the Web is usually embedded in an HTML document that explains its content, we exploit existing commercial text Web search engines and gather the URLs of HTML documents related to the keywords. In the next step, using those gathered URLs, we fetch HTML documents from the Web, and evaluate the relevancy of the images only by analyzing the associated HTML documents. If an image is judged to be related to the keywords, the image file is downloaded from the Web. According to the relevancy of the images to the given keywords, we divide the fetched images into two groups: images in group A are highly relevant to the keywords, and the others are classified into group B. For all gathered images, we perform region segmentation with JSEG [4] and extract image features from each region. Moreover, we extract word vectors from the HTML documents associated with all the downloaded images. The details are described in [16, 17].
In the selection stage, we employ a probabilistic method to select relevant images from all the downloaded images. In general, to use a probabilistic method or another machine learning method to select true images, we need labeled training images. However, we do not want to pick out good images by hand. Instead, we regard the images classified into group A as training images, although they always include some irrelevant images. Our probabilistic framework allows the training data to include some irrelevant data, which we can remove by repeating both the estimation of a model and the selection of relevant regions from all the regions of the images in groups A and B. We use a generative model based on the Gaussian mixture model to represent the models associated with keywords, and estimate the models with the EM algorithm. After estimating the model, we "recognize" relevant regions out of all the regions in groups A and B with the model. We repeat this model estimation and region selection. After the second iteration, we use the regions selected in the previous iteration as training data for estimating a model.
4. SELECTION STAGE
In the selection stage, the system selects, out of all the images downloaded in the collection stage, the images relevant to the concept that the keywords represent.
4.1 Overview of the Probabilistic Approach for Image Selection
As the method to select images, we adopt a probabilistic method with a Gaussian mixture model. This approach is based on our method for learning to label image regions from images with associated text, without the correspondence between words and image regions being given [5, 1]. That method uses a mixture of multi-modal components, each combining a multinomial for words and a Gaussian over image features. Here, we simplify things a bit, and build models of the distribution of image features for a given concept over regions obtained by a region segmentation algorithm.
To get a model of the regions associated with a certain concept, we need training images. As mentioned before, our basic policy is no human intervention, so we propose using the images in group A as training images. Most of the images in group A are relevant, but, lacking supervision, they always include outliers. Moreover, images usually include backgrounds as well as the objects associated with the given concept. Therefore, we need to eliminate outlier images and regions unrelated to the concept, such as backgrounds, and pick out only the regions strongly associated with the concept in order to estimate a model correctly. We use only the regions expected to be highly related to the concept to estimate a model. In our new method, we need negative training images in addition to the group A and B images. We prepare about one thousand images in advance by fetching them from the Web randomly as negative training images.
Our method to find regions related to a certain concept is an iterative algorithm similar to the expectation maximization (EM) algorithm applied to missing value problems. Initially, we do not know which region is associated with a concept "X", since an image with an "X" label just means that the image contains "X" regions. In fact, with images gathered from the Web, even an image with an "X" label sometimes contains no "X" regions at all. So at first we have to find the regions which are likely to be associated with "X". To find "X" regions, we also need a model for "X" regions. Here we adopt a probabilistic generative model, namely a mixture of Gaussians, fitted using the EM algorithm.
In short, we need to know the model for "X" regions and which regions are associated with "X" simultaneously. However, each depends on the other, so we proceed iteratively. Once we know which regions correspond to "X", we can regard images containing "X" regions as "X" images, and therefore we can compute the probability of being an "X" image for each image. Finally, we select the images with high probability as the final results.
In addition to image selection with only image features, we also propose an extended method which uses not only image features but also textual features extracted from the associated HTML documents. Indeed, some existing studies have integrated both visual and textual features [13, 16, 6]. To realize this, we compute the probability of "X" and "non-X" in terms of word vectors in the same way as for image features, and integrate the two.

[Figure 1 (flow diagram): gathering HTML files from text-based Web search engines for a keyword "X"; extracting and evaluating URLs; gathering A-rank and B-rank images (collection stage, HTML analysis as in [16, 17]); then randomly selecting "X" regions, estimating the "X" model, and selecting "X" regions with the "X" model in a repeating loop, finally selecting the output "X" images (selection stage, estimating a GMM and selecting images with it).]
Figure 1: Processing flow of image-gathering from the Web employing a probabilistic method, which consists of the collection stage and the selection stage.
4.2 Segmentation and Image Feature Extraction
For the images gathered from the Web as "X" images, we carry out region segmentation. In the experiments, we use JSEG [4]. After segmentation, we extract image features from each region whose size is larger than a certain threshold. As image features, we prepare three kinds of features: color, texture and shape, which include the average RGB value and its variance, the average response to the differences of 4 different combinations of 2 Gaussian filters, region size, location, the first moment, and the area divided by the square of the outer boundary length. The image feature vector we use in this paper is 24-dimensional in total.
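To illustrate the kind of vector this produces, here is a minimal Python sketch computing a simplified subset of these features (mean RGB, its variance, region size and location) for one segmented region; the texture responses and shape moments of the full 24-dimensional vector are omitted, and the function signature is our own illustration, not the authors' code:

```python
import numpy as np

def region_features(image_rgb, mask):
    """Simplified per-region features in the spirit of Section 4.2:
    mean RGB, RGB variance, relative size and normalized centroid.
    The paper's full 24-dim vector also includes Gaussian-filter
    texture responses and shape moments, omitted here for brevity."""
    h, w = mask.shape
    ys, xs = np.nonzero(mask)              # pixel coordinates of the region
    rgb = image_rgb[ys, xs].astype(float)  # (num_pixels, 3)
    return np.concatenate([
        rgb.mean(axis=0),                  # average RGB value
        rgb.var(axis=0),                   # its variance
        [len(ys) / (h * w)],               # region size (relative)
        [ys.mean() / h, xs.mean() / w],    # location (normalized centroid)
    ])
```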
4.3 Textual Feature Extraction
In addition to image selection based only on image features, we also carry out experiments using both image features and textual features, namely word vectors extracted from the associated HTML documents.
To make a word vector for each HTML document, we eliminate HTML tags and extract the ten surrounding words (only nouns, adjectives, and verbs) before and after the link tag to the image file, the link words, and the words in the ALT tag from the HTML documents associated with the downloaded images. We count the frequency of all the extracted words, select the top 300 words in terms of frequency, and make a 300-dimensional word vector, whose elements are word frequencies weighted by TF-IDF (Term Frequency and Inverse Document Frequency) [12], for each of the images.
Moreover, to shorten the vectors and to equate words having similar meanings, we apply LSI (Latent Semantic Indexing) [3] to the 300-dimensional word vectors. LSI compresses word vectors with a singular value decomposition similar to principal component analysis. We compress each 300-dimensional word vector into a 100-dimensional vector, and treat this 100-dimensional vector as the word vector in this paper.
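A rough sketch of this pipeline using scikit-learn, assuming `docs` holds the extracted text for each image; the paper's part-of-speech filtering is omitted, and `TfidfVectorizer`'s top-`max_features` selection approximates the paper's top-300-by-frequency rule:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

# docs: one string of extracted surrounding/link/ALT words per image
# (an assumed input). Keep the 300 most frequent words, weight them by
# TF-IDF, then compress to 100 dimensions with LSI (truncated SVD).
vectorizer = TfidfVectorizer(max_features=300)
tfidf = vectorizer.fit_transform(docs)       # (num_images, 300)

lsi = TruncatedSVD(n_components=100)
word_vectors = lsi.fit_transform(tfidf)      # (num_images, 100)
```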
4.4 Detecting Regions and Estimating a Model Associated with "X" and "non-X"
We propose an iterative algorithm to obtain, at the same time, P(X|r_i), which represents the probability that region r_i is associated with the concept "X", and the parameters of the Gaussian mixture model, which represents a generative model of "X" regions. Note that "X" corresponds to a certain concept associated with the given keywords.
At the initial iteration, we regard the images in group A, which are expected to be highly relevant to the concept "X" by the HTML analysis, as positive training images, and prepare negative training images by gathering images from the Web in advance. To gather negative training images, we provided Google Image Search with 200 randomly selected adjective keywords which have no relation to noun concepts, and collected 4000 negative training images.
Next, we select n "X" regions randomly from the group A images, and n "non-X" regions randomly from the regions of the negative training images. In the experiments, we set n to 1000.
Taking the positive and negative regions together, we apply the EM algorithm, a probabilistic clustering algorithm, to the 2n image feature vectors of the regions selected from the positive and negative initial training images, and obtain a Gaussian mixture model.
To select positive components and negative components from all the components of the mixture model, we compute P'(c_j|X), which represents how much the j-th component of the mixture model, c_j, contributes to the concept "X" within the obtained GMM, according to the following formula:

$$P'(c_j \mid X) = \frac{1}{n_X} \sum_{i=1}^{n_X} P(c_j \mid r^X_i, X) \qquad (1)$$

$$\qquad\qquad = \frac{\alpha}{n_X} \sum_{i=1}^{n_X} P(X \mid c_j, r^X_i)\, P(c_j) \qquad (2)$$

where n_X is the number of positive regions, r^X_i is the i-th "X" region, and α is a normalization constant. In the same way, we also compute P'(c_j|nonX). Here we regard all the regions selected from positive images as "X" regions and all the regions selected from negative images as "non-X" regions, and substitute their feature vectors into the above formula.
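Since P(c_j | r^X_i, X) in Eq. (1) is simply the component responsibility of a fitted mixture, this quantity can be sketched with an off-the-shelf GMM; in the following scikit-learn sketch, `x_pos`, `x_neg` (positive/negative region feature arrays) and the number of mixture components are assumptions, as this excerpt does not state the component count:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# x_pos, x_neg: (n, 24) arrays of positive / negative region features
# (assumed inputs); one GMM is fit to the mixed sample, as in the paper.
gmm = GaussianMixture(n_components=20).fit(np.vstack([x_pos, x_neg]))

# predict_proba returns the responsibilities P(c_j | r_i); averaging them
# over the positive (negative) regions gives Eq. (1)'s P'(c_j|X)
# (respectively P'(c_j|nonX)).
p_cj_X = gmm.predict_proba(x_pos).mean(axis=0)
p_cj_nonX = gmm.predict_proba(x_neg).mean(axis=0)
```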
Next, we compute p^X_j and p^nonX_j for all components j as follows:

$$p^X_j = \frac{P'(c_j \mid X)}{P'(c_j \mid X) + P'(c_j \mid nonX)} \qquad (3)$$

$$p^{nonX}_j = \frac{P'(c_j \mid nonX)}{P'(c_j \mid X) + P'(c_j \mid nonX)} \qquad (4)$$
We select components with p^X_j > th_1 as positive components and components with p^nonX_j > th_1 as negative components. Positive components and negative components correspond to Gaussian components associated with the concept "X" and Gaussian components strictly not associated with "X", respectively. The key point in this component selection process is that we mix positive samples and negative samples together before applying the EM algorithm, and throw away the neutral components which belong to neither the positive nor the negative components, since neutral components are expected to correspond to image features included in both positive and negative samples and to be useless for discriminating between "X" and "non-X". This differs from other work (e.g. [11]) which estimates two GMMs separately with EM to model positive and negative image concepts.
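Continuing the sketch above, Eqs. (3)-(4) and the thresholding step look as follows; the value of th_1 is not given in this excerpt, so 0.6 is purely illustrative:

```python
import numpy as np

# Eqs. (3)-(4): relative contribution of each component; components above
# th_1 on either side are kept, the neutral remainder is discarded.
ratio_X = p_cj_X / (p_cj_X + p_cj_nonX)          # p^X_j
ratio_nonX = p_cj_nonX / (p_cj_X + p_cj_nonX)    # p^nonX_j

th_1 = 0.6                                       # assumed value, for illustration
positive_components = np.nonzero(ratio_X > th_1)[0]
negative_components = np.nonzero(ratio_nonX > th_1)[0]
```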
We regard the mixture of only the positive components as the "X" model and the mixture of only the negative components as the "non-X" model. With these models of "X" and "non-X", we can compute P(X|r_i) and P(nonX|r_i) for all the regions extracted from the images in both group A and group B. First, we compute p_1(X|r_i), the output of the "X" model, and p_2(nonX|r_i), the output of the "non-X" model, for each region r_i:

$$p_1(X \mid r_i) = \sum_{k=1}^{m_1} w_{1,k} \frac{1}{\sqrt{(2\pi)^N |\Sigma_{1,k}|}} \exp\left(-\frac{1}{2}(x_i - \mu_{1,k})^T \Sigma_{1,k}^{-1} (x_i - \mu_{1,k})\right) \qquad (5)$$

$$p_2(nonX \mid r_i) = \sum_{k=1}^{m_2} w_{2,k} \frac{1}{\sqrt{(2\pi)^N |\Sigma_{2,k}|}} \exp\left(-\frac{1}{2}(x_i - \mu_{2,k})^T \Sigma_{2,k}^{-1} (x_i - \mu_{2,k})\right) \qquad (6)$$

where x_i is the image feature vector of region r_i, N is the dimension of the image features, m_1 is the number of positive components, and w_{1,k}, μ_{1,k} and Σ_{1,k} represent the weight, the mean vector and the covariance matrix of the k-th positive component, respectively, under the condition $\sum_{k=1}^{m_1} w_{1,k} = 1$.
Finally, we obtain P(X|r_i) and P(nonX|r_i) as follows:

$$P(X \mid r_i) = \frac{p_1(X \mid r_i)}{p_1(X \mid r_i) + p_2(nonX \mid r_i)} \qquad (7)$$

$$P(nonX \mid r_i) = \frac{p_2(nonX \mid r_i)}{p_1(X \mid r_i) + p_2(nonX \mid r_i)} \qquad (8)$$
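A sketch of Eqs. (5)-(8) continuing the running example, using scipy to evaluate the restricted mixtures; `regions` (the feature vectors of all group A and B regions) is an assumed input, and the component weights are renormalized over the selected components as the condition under Eq. (6) requires:

```python
import numpy as np
from scipy.stats import multivariate_normal

def restricted_mixture(x, gmm, components):
    """Weighted Gaussian density of Eq. (5)/(6) over the selected
    components, with weights renormalized so that they sum to 1."""
    w = gmm.weights_[components]
    w = w / w.sum()
    return sum(wk * multivariate_normal.pdf(x, mean=gmm.means_[k],
                                            cov=gmm.covariances_[k])
               for wk, k in zip(w, components))

# regions: (m, 24) features of all group A and B regions (assumed input).
p1 = np.array([restricted_mixture(x, gmm, positive_components) for x in regions])
p2 = np.array([restricted_mixture(x, gmm, negative_components) for x in regions])
P_X = p1 / (p1 + p2)       # Eq. (7)
P_nonX = p2 / (p1 + p2)    # Eq. (8)
```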
For the next iteration, we select the top n regions in terms of P(X|r_i) as "X" regions and the top (2/3)n regions in terms of P(nonX|r_i) as "non-X" regions. In addition, we add (1/3)n regions randomly selected from the negative images gathered from the Web in advance to the "non-X" regions. We repeat the processing described above with n positive regions and n negative regions several times.
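Continuing the running sketch, this re-selection step can be written as follows, with `negative_pool` (features of the pre-fetched random Web images) assumed as an input:

```python
import numpy as np

n = 1000                                   # as in the experiments (Section 4.4)
rng = np.random.default_rng()

# Top n regions by P(X|r_i) become the next positive training set; the
# next negative set mixes the top 2/3*n regions by P(nonX|r_i) with
# 1/3*n regions drawn from the pre-fetched negative pool.
next_pos = regions[np.argsort(-P_X)[:n]]
next_neg = np.vstack([regions[np.argsort(-P_nonX)[:2 * n // 3]],
                      rng.choice(negative_pool, size=n // 3, replace=False)])
```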
4.5 Computing the Probability of "X"
After iterating the detection of regions and the estimation of models several times, in order to decide which images are "X", we finally compute the probability of "X", P(X|I_j), where I_j represents the j-th image.
To estimate P(X|I_j), we select the top T regions of I_j in terms of P(X|r_i) and average them as follows:

$$P(X \mid I_j) = \frac{1}{T} \sum_{k=1}^{T} P(X \mid r_i^{top_k}) \qquad (9)$$

where r_i^{top_k} is the region with the k-th largest P(X|r_i) within image I_j. This estimation of P(X|I_j) is based on the heuristic that an image having regions whose P(X|r_i) is high is likely to be an "X" image. Since images usually include backgrounds as well as target objects, background regions and unrelated regions should be ignored when estimating P(X|I_j). Therefore, we use not all regions but only the few regions with the highest probability. Finally, we select the images whose P(X|I_j) is greater than a threshold th_2 as the final output images. In the experiments, we set T to 2.
4.6 Selection by Textual Features
To use textual features in addition to image features, we compute the probability of "X" and "non-X" in terms of word vectors in the same way as for image features, and integrate them in two ways. Note that a word vector corresponds to an image, while an image feature vector corresponds to a region.
After each iteration of image selection by image features, we need to select "X" regions and "non-X" regions, and to compute the probability of "X" and "non-X" for each image. Since, unlike the case of image features, we have no negative samples of word vectors at the beginning, we make use of the results of selection by image features. We regard the word vectors of the images labeled as "X" by image-feature-based selection as positive training samples, and the word vectors of the images labeled as "non-X" as negative training samples. In the next step, using both positive and negative training vectors, we build an "X" model and a "non-X" model in a similar way as for image features: we apply the EM algorithm to the word vectors of the "X" and "non-X" images, select the Gaussian components corresponding to "X" and "non-X", and compute the probability of "X" and "non-X" in terms of word vectors.
We prepare two methods for using the probability P_word(X|I_j). One is computing the weighted sum P_total(X|r_i) of the word-vector-based probability P_word(X|I_j) and the image-feature-based probability P(X|r_i); the other is a two-step selection in which word-vector-based selection is carried out after image-feature-based selection.
In the weighted-sum selection, we compute P_total(X|r_i) for all r_i as follows:

$$P_{total}(X \mid r_i) = w\, P_{word}(X \mid I_j) + (1 - w)\, P(X \mid r_i) \qquad (10)$$

where r_i ∈ I_j, and we select positive and negative training regions for the next iteration using P_total(X|r_i) instead of just P(X|r_i). In the experiments, we set w to 0.25.
In the two-step selection, using a threshold th_3, we eliminate the regions r_i whose P_word(X|I_j) is low, where r_i ∈ I_j, from the positive regions obtained after image-feature-based region selection.
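A sketch of both fusion schemes, continuing the running example; `P_word` (the word-based probability of each region's parent image, broadcast onto the regions) is an assumed input, and the value of th_3 is not given in this excerpt:

```python
import numpy as np

# Weighted-sum selection, Eq. (10): blend the word-based image score,
# copied to each of its regions, with the region's image-based score.
w = 0.25                                   # value used in the experiments
P_total = w * P_word + (1 - w) * P_X

# Two-step selection: image-feature-based selection first, then drop
# positives whose word-based score falls below th_3.
th_3 = 0.5                                 # assumed value, for illustration
top_idx = np.argsort(-P_X)[:n]
two_step_pos = regions[top_idx][P_word[top_idx] > th_3]
```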
4.7 Proposed Algorithm
To summarize the method described above, the algorithm is as follows:
(1) Carry out region segmentation for all the images and extract image features from each region of each image. In this paper, we use JSEG [4] to perform region segmentation. In case word vectors are used, we also generate word vectors for all the images in both group A and group B from the associated HTML documents in this step.
(2) At the first iteration, regard the images in group A as positive training images associated with the concept "X", and the images gathered from the Web in advance with non-noun keywords as negative training images.
(3) Select n "X" regions randomly from the positive images, and n "non-X" regions randomly from the negative images (Figure 1 (4)).
(4) Applying the EM algorithm to the image features of the selected positive and negative regions, compute the Gaussian mixture model for the distribution of both "X" and "non-X" (Figure 1 (5)).
(5) Find the components of the Gaussian mixture which contribute greatly to "X" regions or to "non-X" regions. These are regarded as "X" components or "non-X" components, and the rest are ignored. The mixture of only the positive components is regarded as the "X" model and the mixture of only the negative components as the "non-X" model.

References
[1] P. Duygulu, K. Barnard, J. F. G. de Freitas, and D. Forsyth. Object recognition as machine translation: Learning a lexicon for a fixed image vocabulary. In Proc. of European Conference on Computer Vision (ECCV), 2002.
[2] A. Blum and T. Mitchell. Combining labeled and unlabeled data with co-training. In Proc. of the Conference on Computational Learning Theory (COLT), 1998.
[3] S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 1990.
[7] R. Fergus, P. Perona, and A. Zisserman. Object class recognition by unsupervised scale-invariant learning. In Proc. of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2003.
[12] G. Salton and C. Buckley. Term-weighting approaches in automatic text retrieval. Information Processing & Management, 1988.