What have the authors contributed in "Modeling and analysis of dynamic behaviors of web image collections" ?

Can the authors model the temporal evolution of topics in Web image collections? If so, can the authors exploit the understanding of dynamics to solve novel visual problems or improve recognition performance? These two challenging questions are the motivation for this work. The authors propose a nonparametric approach to modeling and analysis of topical evolution in image sets. A scalable and parallelizable sequential Monte Carlo based method is developed to construct the similarity network of a large-scale dataset, which provides a base representation for a wide range of dynamics analyses. In this paper, the authors provide several experimental results to support the usefulness of image dynamics with datasets of 47 topics gathered from Flickr. First, the authors produce some interesting observations, such as tracking of subtopic evolution and outbreak detection, which cannot be achieved with conventional image sets. Second, the authors present the complementary benefits that the images can introduce over the associated text analysis. Finally, the authors show that training using the temporal association significantly improves the recognition performance.

What is the importance sampling method for the transition model?

The importance sampling is particularly useful for the transition model since there is no closed form of the product of Gaussian and Gamma distributions and its normalization is not straightforward.

What is the advantage of image-based temporal analysis?

Another important advantage of image-based temporal analysis is that it conveys more delicate information that is hardly captured by text descriptions.

What is the way to capture the subtopic evolution of a web image?

A sequential Monte Carlo based tracker is proposed to capture the subtopic evolution in the form of the similarity network of the image set.

Why are the text tags highly fluctuated?

The text tags fluctuate strongly, mainly because tags are subjectively assigned by different users with little consensus.

How can the authors observe the affinity changes of each topic in the apple image set?

As Google trends reveal the popularity variation of query terms in the search volumes, the authors can easily observe the affinity changes of each subtopic in the apple image set.

How do the authors compare the dynamic behaviors detected from images and texts?

In order to compare the dynamic behaviors detected from images and texts, the authors apply the outbreak detection method from the previous section to both the images and their associated tags.

What is the purpose of this study?

In order to show the usefulness of the image-based temporal topic modeling, the authors examined subtopic evolution tracking, subtopic outbreak detection, the comparison with the analysis on the associated texts, and the use of temporal association for recognition improvement.

What is the speed of the analysis of the network?

The analysis of the network is also fast since most network analysis algorithms depend on the number of nonzero elements, which is O(N log N).

What is the probability of occurrence of a subtopic?

In their interpretation, given an image stream, the authors assume that the occurrence of images of each subtopic follows the Poisson process with β.

What is the significance of the stationary probability?

The stationary probability is a popular ranking measure, and thus the images with high stationary probabilities can be thought of as temporally and visually strengthened images.
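As a rough illustration of how a stationary probability can be used for ranking, the sketch below runs power iteration on a small weighted network. It is not the authors' implementation: the function name `stationary_distribution`, the row normalization of the weight matrix, and the uniform jump for vertices without edges are all assumptions made for this example.

```python
def stationary_distribution(weights, n_iter=200, tol=1e-12):
    """Approximate the stationary distribution of a random walk.

    weights[i][j] is the (nonnegative) edge weight from vertex i to j.
    Row normalization and the uniform jump for isolated vertices are
    assumptions of this sketch, not details taken from the paper.
    """
    n = len(weights)
    transition = []
    for row in weights:
        total = sum(row)
        if total == 0:
            # Isolated vertex: jump uniformly (hypothetical choice).
            transition.append([1.0 / n] * n)
        else:
            transition.append([w / total for w in row])

    pi = [1.0 / n] * n  # start from the uniform distribution
    for _ in range(n_iter):
        # One step of the walk: new_pi[j] = sum_i pi[i] * P[i][j]
        new_pi = [sum(pi[i] * transition[i][j] for i in range(n))
                  for j in range(n)]
        if max(abs(a - b) for a, b in zip(new_pi, pi)) < tol:
            return new_pi
        pi = new_pi
    return pi


# Toy symmetric similarity network with three images.
toy_weights = [[0, 2, 1],
               [2, 0, 1],
               [1, 1, 0]]
ranking = stationary_distribution(toy_weights)
```

For a symmetric network such as the toy example, the result is proportional to each vertex's total edge weight, so strongly connected images rank highest.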

How do the authors generate the top-ranked images for each topic?

For the test sets, the authors downloaded 256 top-ranked images for each topic from Google Image Search by querying the same word in Table 1.

(Open Access) Modeling and analysis of dynamic behaviors of web image collections (2010) | Gunhee Kim

Modeling and Analysis of Dynamic Behaviors of Web Image Collections

Gunhee Kim, Eric P. Xing, and Antonio Torralba

Carnegie Mellon University, Pittsburgh, PA 15213, USA

Massachusetts Institute of Technology, Cambridge, MA 02139, USA

{gunhee,epxing}@cs.cmu.edu, torralba@csail.mit.edu

Abstract. Can we model the temporal evolution of topics in Web image collections? If so, can we exploit the understanding of dynamics to solve novel visual problems or improve recognition performance? These two challenging questions are the motivation for this work. We propose a nonparametric approach to modeling and analysis of topical evolution in image sets. A scalable and parallelizable sequential Monte Carlo based method is developed to construct the similarity network of a large-scale dataset that provides a base representation for wide ranges of dynamics analysis. In this paper, we provide several experimental results to support the usefulness of image dynamics with the datasets of 47 topics gathered from Flickr. First, we produce some interesting observations such as tracking of subtopic evolution and outbreak detection, which cannot be achieved with conventional image sets. Second, we also present the complementary benefits that the images can introduce over the associated text analysis. Finally, we show that the training using the temporal association significantly improves the recognition performance.

1 Introduction

This paper investigates the discovery and use of topical evolution in Web image collections. The number of images on the Web is growing rapidly, and it is natural to assume that their topical patterns evolve over time. Topics may rise and fall in popularity; sometimes they split or merge into a new one; some of them are synchronized or mutually exclusive on the timeline. In Fig. 1, we download apple images and their associated timestamps from Flickr, and measure how their similarity to some canonical images of apple's subtopics changes over time. Just as Google Trends reveals the popularity variation of query terms in search volumes, we can easily observe the affinity changes of each subtopic in the apple image set.

The main objectives of this work are as follows. First, we propose a nonparametric approach to modeling and analysis of the temporal evolution of topics in Web image collections. Second, we show that understanding image dynamics is useful for solving novel problems such as subtopic outbreak detection and for improving classification performance using the temporal association that is inspired by studies in human vision [2,19,21]. Third, we show that images can be a more reliable and delicate source of information for detecting topical evolution than the associated texts.

K. Daniilidis, P. Maragos, N. Paragios (Eds.): ECCV 2010, Part V, LNCS 6315, pp. 85–98, 2010. © Springer-Verlag Berlin Heidelberg 2010

Fig. 1. A Google Trends-like visualization of the subtopic evolution in the apple images from Flickr (fruit: blue, logo: red, laptop: orange, tree: green, iphone: purple). We choose the cluster center image of each subtopic, and measure the average similarity with the posterior (i.e. a set of weighted image samples) at each time step. The fruit subtopic is stable along the timeline whereas the iphone subtopic fluctuates strongly.

Our approach is motivated by the recent success of nonparametric methods [13,20] that are powered by large databases. Instead of using sophisticated parametric topic models [3,22], we represent the images with timestamps in the form of a similarity network [11], in which vertices are images and edges connect temporally related and visually similar images. Thus, our approach is able to perform diverse dynamics analyses without solving complex inference problems. For example, a simple information-theoretic measure of the network can be used to detect subtopic outbreaks, which indicate when the evolution speed changes abruptly. The temporal context is also easily integrated with classifier training in the framework of the Metropolis-Hastings algorithm.

The network generation is based on sequential Monte Carlo (i.e. particle filtering) [1,9]. In sequential Monte Carlo, the posterior (i.e. subtopic distribution) at a particular time step is represented by a set of weighted image samples. We track similar subtopics (i.e. clusters of images) in consecutive posteriors along the timeline, and create edges between them. The sampling-based representation is quite powerful in our context. Since we deal with unordered natural images on the Web, no Gaussian or linearity assumption holds and multimodal distributions are unavoidable. Another practical advantage is that we can easily control the tradeoff between accuracy and speed by adjusting the number of samples and the parameters of the transition model. The proposed algorithm is easily parallelizable by running multiple sequential Monte Carlo trackers with different initializations and parameters. Our approach is also scalable and fast: the computation time is linear in the number of images.

For evaluation, we download more than 9M images of 47 topics from Flickr. Most standard datasets in computer vision research [7,18] have not yet considered the importance of temporal context. Recently, several datasets have introduced spatial contexts as fundamental cues to recognition [18], but support for temporal context has still been largely ignored. Our experiments clearly show that our modeling and analysis is practically useful and can be used to understand and simulate human-like visual experience from Web images.

1.1 Related Work

Temporal information is one of the most obvious features in video or auditory applications. Hence, here we review only the use of temporal cues for image analysis. The importance of temporal context has long been recognized in neuroscience research [2,19,21]. A wide range of research supports the view that temporal association (i.e. linking temporally close images) is an important mechanism for recognizing objects and generalizing visual representations. [21] conducted several interesting experiments to show that temporally correlated multiple views can be easily linked to a single representation. [2] proposed a learning model for 3D object recognition by using the temporal continuity in image sequences.

In computer vision, [16] is one of the early studies that use temporal context in active object recognition. They used a POMDP framework for the modeling of temporal context to disambiguate the object hypotheses. [5] proposed an HMM-based temporal context model to solve scene classification problems. For indoor-outdoor classification and sunset detection, they showed that the temporal model outperformed the baseline content-based classifiers.

As Internet vision emerges as an active research area in computer vision, timing information has started to be used to assist visual tasks. Surprisingly, however, the dynamics or temporal context of Web images has not yet been studied a great deal, even though the study of the dynamic behaviors of texts on the Web has been an active research area in the data mining and machine learning communities [3,22]. We briefly review some notable examples that use timestamp meta-data for visual tasks. [6] developed an annotation method for personal photo collections, in which the timestamps associated with the images were used for better correlation discovery between the images. [12] proposed a landmark classification method for an extremely large dataset, in which the temporal information was used as a constraint to remove misclassifications. [17] also used the timestamp as an additional feature to develop an object and event retrieval system for online image communities. [10] presented a method to geolocate a sequence of images taken by a single individual. Temporal constraints from the sequence of images were used as a strong prior to improve the geolocation accuracy.

The main difference between their work and ours is that they considered the temporal information as additional meta-data or constraints to achieve their original goals (i.e. annotations in [6], classification and detection in [12,17], and the geolocation of images in [10]). However, our work considers the timestamps associated with images as a main research subject to uncover dynamic behaviors of Web images. To the best of our knowledge, there have been very few previous attempts to tackle this issue in computer vision research.


2 Network Construction by Sequential Monte Carlo

2.1 Image Description and Similarity Measure

Each image is represented by two types of descriptors: spatial pyramids of visual words [14] and HOG [4]. We use the code provided by the authors of these papers. A dictionary of 200 visual words is formed by applying K-means to randomly selected SIFT descriptors [14]. A visual word is densely assigned to every pixel of an image by finding the nearest cluster center in the dictionary. The visual words are then binned using a two-level spatial pyramid. The oriented gradients are computed by Canny edge detection and a Sobel mask [4]. The HOG descriptor is then discretized into 20 orientation bins in the range $[0^\circ, 180^\circ]$, and the HOG descriptors are binned using a three-level spatial pyramid. The similarity measure between a pair of images is the cosine similarity, which is calculated as the dot product of a pair of $L_2$-normalized descriptors.
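The similarity measure described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function names `l2_normalize` and `cosine_similarity` are hypothetical, and descriptors are assumed to be plain lists of floats (for example, flattened spatial pyramid histograms).

```python
import math


def l2_normalize(descriptor):
    """Scale a descriptor to unit Euclidean length.

    An all-zero descriptor is returned unchanged (an assumption of
    this sketch, to avoid dividing by zero).
    """
    norm = math.sqrt(sum(v * v for v in descriptor))
    if norm == 0.0:
        return list(descriptor)
    return [v / norm for v in descriptor]


def cosine_similarity(descriptor_a, descriptor_b):
    """Cosine similarity as the dot product of L2-normalized vectors."""
    a = l2_normalize(descriptor_a)
    b = l2_normalize(descriptor_b)
    return sum(x * y for x, y in zip(a, b))


# Two toy histograms; nonnegative entries give a similarity in [0, 1].
similarity = cosine_similarity([3.0, 1.0, 0.0], [2.0, 2.0, 1.0])
```

Normalizing the descriptors once and caching them would reduce each pairwise comparison to a single dot product, which matters when many candidate pairs are compared.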

2.2 Problem Statement

The input of our algorithm is a set of images $I = \{I_1, \ldots, I_N\}$ and the associated taken-time tags $T = \{T_1, \ldots, T_N\}$. The main goal is to generate an $N \times N$ sparse similarity network $G = (V, E, W)$ by using the sequential Monte Carlo (SMC) method. Each vertex in $V$ is an image in the dataset. The edge set $E$ connects images that are visually similar and temporally separated by an interval that is assigned by the transition model of the SMC tracker (Section 2.3). The weight set $W$ is given by the similarity between the descriptors of the images (Section 2.1). For sparsity, each image is connected to its $k$ nearest neighbors with $k = a \log N$, where $a$ is a constant (e.g. $a = 10$).
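To make the sparsity rule concrete, the sketch below keeps only the $k = a \log N$ strongest edges per image. It is a hedged illustration under stated assumptions: the natural logarithm, rounding up, and the names `neighbors_per_image` and `sparsify` are choices made here, and the candidate edges are assumed to be already available as a dictionary (in the paper they come from the SMC tracker).

```python
import heapq
import math


def neighbors_per_image(n_images, a=10.0):
    """Number of neighbors k = a * log(N).

    The natural logarithm and rounding up are assumptions of this
    sketch; k is capped at N - 1 because an image cannot have more
    distinct neighbors than that.
    """
    if n_images < 2:
        return 0
    k = math.ceil(a * math.log(n_images))
    return min(k, n_images - 1)


def sparsify(candidate_edges, a=10.0):
    """Keep the k most similar neighbors of every image.

    candidate_edges maps an image id to {neighbor id: similarity}.
    """
    k = neighbors_per_image(len(candidate_edges), a)
    network = {}
    for image, neighbors in candidate_edges.items():
        strongest = heapq.nlargest(
            k, neighbors.items(), key=lambda item: item[1])
        network[image] = dict(strongest)
    return network


# Toy example with four images and a small constant a.
edges = {
    0: {1: 0.9, 2: 0.1, 3: 0.5},
    1: {0: 0.9},
    2: {0: 0.1, 3: 0.4},
    3: {0: 0.5, 2: 0.4},
}
sparse_network = sparsify(edges, a=0.5)
```

With at most $k$ edges stored per image, the network holds on the order of $N \log N$ nonzero weights.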

2.3 Network Construction Using Sequential Monte Carlo

Algorithm 1 summarizes the proposed SMC-based network construction. For better readability, we follow the notation of the condensation algorithm [9]. The output of each iteration of the SMC is the conditional subtopic distribution (i.e. posterior) at every step, which is approximated by a set of images with relative importance denoted by $\{s_t, \pi_t\} = \{s_t^{(i)}, \pi_t^{(i)}, i = 1, \ldots, M\}$. Note that our SMC does not explicitly solve the data association during the tracking. In other words, we do not assign a subtopic membership to each image in $s_t$. However, it can be easily obtained later by applying clustering to the subgraph of $s_t$.

Fig. 2 shows a downsampled example of a single iteration of the posterior estimation. At every iteration, the SMC generates a new posterior $\{s_t, \pi_t\}$ by running transition, observation, and resampling.

The image data are severely unbalanced on the timeline (e.g. there are only a few images within a month in 2005 but a large number of images within even a week in 2008). Thus, in our experiments, we bin the timeline by the number of images instead of a fixed time interval (e.g. the timeline may be binned every 3000 images instead of every month). The function $\tau(T_i, m)$ indicates the timestamp of the $m$-th image after the image at $T_i$.
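The count-based binning and the function $\tau$ can be sketched as follows. This is an illustrative reading of the description above, not the authors' code: the names `tau` and `bin_by_count`, the use of a sorted list of numeric timestamps, and the clamping at the end of the timeline are assumptions.

```python
import bisect


def tau(sorted_timestamps, t_i, m):
    """Timestamp of the m-th image after the image taken at t_i.

    sorted_timestamps must be in ascending order. If fewer than m
    images follow, the last timestamp is returned (an assumption of
    this sketch).
    """
    if not sorted_timestamps:
        raise ValueError("the timeline is empty")
    index = bisect.bisect_left(sorted_timestamps, t_i)
    target = min(index + m, len(sorted_timestamps) - 1)
    return sorted_timestamps[target]


def bin_by_count(sorted_timestamps, images_per_bin):
    """Split the timeline into bins holding a fixed number of images."""
    return [sorted_timestamps[start:start + images_per_bin]
            for start in range(0, len(sorted_timestamps), images_per_bin)]


# Sparse early period, dense later period (toy timestamps in days).
timeline = [1, 2, 3, 10, 11, 12, 13, 50]
bins = bin_by_count(timeline, 3)
later = tau(timeline, 2, 3)
```

Because each bin holds the same number of images, sparse periods span long time intervals while dense periods span short ones.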


Fig. 2. An overview of the SMC-based network construction for the jaguar topic. The subtopic distribution at each time step is represented by a set of weighted image samples (i.e. posterior) $\{s_t, \pi_t\}$. In this example, a posterior of the jaguar topic consists of image samples of animal, cars, and football subtopics. (a) The transition model generates new posterior candidates $s'_t$ from $s_{t-1}$. (b) The observation model discovers $\pi'_t$ of $s'_t$ and the resampling step computes $\{s_t, \pi_t\}$ from $\{s'_t, \pi'_t\}$. Finally, the network is constructed by similarity matching between two consecutive posteriors $s_{t-1}$ and $s_t$.

Initialization. The initialization samples the initial posterior $s_0$ from the prior $p(x_0)$ at $T_0$. The prior $p(x_0)$ is set to a Gaussian distribution on the timeline centered at $T_0$, with its spread determined by $\tau(T_0, 2M/3)$, which means that about $2M$ images around $T_0$ have non-negligible probability of being selected as one of $s_0$. The initial $\pi_0$ is uniformly set to $1/M$.

Transition Model. The transition model generates posterior candidates $s'_t$ rightward on the timeline from the previous $\{s_{t-1}, \pi_{t-1}\}$ (see Fig. 2(a) for an example). Each image $s^{(i)}_{t-1}$ in $s_{t-1}$ recommends $m_i$ images that are similar to itself as the candidate set $s'_t$ for the next posterior. A more heavily weighted image $s^{(i)}_{t-1}$ is able to recommend more images for $s'_t$ ($\sum_i m_i = 2M$ and $m_i \propto \pi^{(i)}_{t-1}$). At this stage, we generate $2M$ candidates (i.e. $|s'_t| = 2M$), and the observation and resampling steps reduce them to $|s_t| = M$ while computing the weights $\pi_t$.

Similarly to the condensation algorithm [9], the transition consists of deterministic drift and stochastic diffusion. The drift describes the transition tendency of the overall $s'_t$ (i.e. how far $s'_t$ is located from $s_{t-1}$ on the timeline). The diffusion assigns a random transition to an individual image. The drift and the diffusion are modeled by a Gaussian distribution $\mathcal{N}(\mu_t, \sigma_t^2)$ and a Gamma distribution $\Gamma(\alpha, \beta)$, respectively. The final transition model is the product of these two distributions [8], as given in Eq. 1. The asterisk of $P^{(i)*}(x)$ in Eq. 1 means that it is not normalized. Renormalization is not required since we will use importance sampling to sample images on the timeline with the target distribution (see the next subsection with Fig. 3 for the details).

$$P^{(i)*}(x) = \mathcal{N}(x; \mu_t, \sigma_t^2) \times \Gamma\left(x; \alpha^{(i)}_{t-1}, \beta^{(i)}_{t-1}\right) \qquad (1)$$
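The following sketch illustrates why an unnormalized product such as Eq. 1 is enough for importance sampling. It is not the procedure from the paper, whose details are in the next subsection: here the Gaussian factor is used as the proposal (an assumption), so each importance weight reduces to the Gamma factor and any normalizing constant cancels when the weights are rescaled. The function names, the rate parameterization of the Gamma density, and all numeric values are hypothetical.

```python
import math
import random


def gaussian_pdf(x, mu, sigma):
    """Gaussian density with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def gamma_pdf(x, alpha, beta):
    """Gamma density with shape alpha and rate beta (assumed
    parameterization); zero for non-positive x."""
    if x <= 0.0:
        return 0.0
    log_density = (alpha * math.log(beta) + (alpha - 1.0) * math.log(x)
                   - beta * x - math.lgamma(alpha))
    return math.exp(log_density)


def unnormalized_target(x, mu, sigma, alpha, beta):
    """Product of the drift and diffusion factors, as in Eq. 1."""
    return gaussian_pdf(x, mu, sigma) * gamma_pdf(x, alpha, beta)


def importance_sample(mu, sigma, alpha, beta, n_draws, n_keep, rng):
    """Draw timeline offsets approximately following the target.

    Proposal: the Gaussian factor. Weight: target / proposal, which
    is just the Gamma factor. Weights are rescaled to sum to one, so
    the missing normalization of the target never has to be computed.
    """
    draws = [rng.gauss(mu, sigma) for _ in range(n_draws)]
    weights = [gamma_pdf(x, alpha, beta) for x in draws]
    total = sum(weights)
    if total == 0.0:
        raise ValueError("all proposal draws have zero target density")
    probabilities = [w / total for w in weights]
    return rng.choices(draws, weights=probabilities, k=n_keep)


rng = random.Random(0)
offsets = importance_sample(mu=5.0, sigma=2.0, alpha=2.0, beta=0.5,
                            n_draws=500, n_keep=50, rng=rng)
```

Draws with non-positive offsets receive zero weight and are never selected, which matches the rightward movement of candidates on the timeline.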


References

The Pascal Visual Object Classes (VOC) Challenge

A tutorial on particle filters for online nonlinear/non-Gaussian Bayesian tracking

Information Theory, Inference, and Learning Algorithms

CONDENSATION—Conditional Density Propagation for Visual Tracking

