Modeling and Analysis of Dynamic Behaviors of
Web Image Collections
Gunhee Kim¹, Eric P. Xing¹, and Antonio Torralba²
¹ Carnegie Mellon University, Pittsburgh, PA 15213, USA
² Massachusetts Institute of Technology, Cambridge, MA 02139, USA
{gunhee,epxing}@cs.cmu.edu, torralba@csail.mit.edu
Abstract. Can we model the temporal evolution of topics in Web image collections? If so, can we exploit the understanding of dynamics to solve novel visual problems or improve recognition performance? These two challenging questions are the motivation for this work. We propose a nonparametric approach to modeling and analysis of topical evolution in image sets. A scalable and parallelizable sequential Monte Carlo based method is developed to construct the similarity network of a large-scale dataset that provides a base representation for wide ranges of dynamics analysis. In this paper, we provide several experimental results to support the usefulness of image dynamics with the datasets of 47 topics gathered from Flickr. First, we produce some interesting observations such as tracking of subtopic evolution and outbreak detection, which cannot be achieved with conventional image sets. Second, we also present the complementary benefits that the images can introduce over the associated text analysis. Finally, we show that the training using the temporal association significantly improves the recognition performance.
1 Introduction
This paper investigates the discovery and use of topical evolution in Web image collections. The number of images on the Web is growing rapidly, and it is natural to assume that their topical patterns evolve over time. Topics may rise and fall in popularity; sometimes they split or merge into a new one; some of them are synchronized or mutually exclusive on the timeline. In Fig. 1, we download apple images and their associated timestamps from Flickr, and measure how their similarity to canonical images of the apple subtopics changes over time. Just as Google Trends reveals the popularity variation of query terms in search volumes, we can easily observe the affinity changes of each subtopic in the apple image set.
The main objectives of this work are as follows. First, we propose a nonparametric approach to modeling and analysis of the temporal evolution of topics in Web image collections. Second, we show that understanding image dynamics is useful for solving novel problems such as subtopic outbreak detection and for improving classification performance using the temporal association that is inspired by studies in human vision [2,19,21]. Third, we show that images can be a more reliable and delicate source of information for detecting topical evolution than texts.
K. Daniilidis, P. Maragos, N. Paragios (Eds.): ECCV 2010, Part V, LNCS 6315, pp. 85–98, 2010. © Springer-Verlag Berlin Heidelberg 2010
Fig. 1. The Google Trends-like visualization of the subtopic evolution in the apple images from Flickr (fruit: blue, logo: red, laptop: orange, tree: green, iphone: purple). We choose the cluster center image of each subtopic, and measure the average similarity with the posterior (i.e. a set of weighted image samples) at each time step. The fruit subtopic is stable along the timeline whereas the iphone subtopic fluctuates strongly.
Our approach is motivated by the recent success of nonparametric methods [13,20] that are powered by large databases. Instead of using sophisticated parametric topic models [3,22], we represent the images with timestamps in the form of a similarity network [11], in which vertices are images and edges connect temporally related and visually similar images. Thus, our approach is able to perform diverse dynamics analysis without solving complex inference problems. For example, a simple information-theoretic measure of the network can be used to detect subtopic outbreaks, which indicate when the evolution speed changes abruptly. The temporal context is also easily integrated with classifier training in the framework of the Metropolis-Hastings algorithm.
The network generation is based on the sequential Monte Carlo method (i.e. particle filtering) [1,9]. In the sequential Monte Carlo, the posterior (i.e. subtopic distribution) at a particular time step is represented by a set of weighted image samples. We track similar subtopics (i.e. clusters of images) in consecutive posteriors along the timeline, and create edges between them. The sampling based representation is quite powerful in our context. Since we deal with unordered natural images on the Web, no Gaussian or linearity assumption holds and multiple peaks in the distributions are unavoidable. Another practical advantage is that we can easily control the tradeoff between accuracy and speed by adjusting the number of samples and the parameters in the transition model. The proposed algorithm is easily parallelizable by running multiple sequential Monte Carlo trackers with different initializations and parameters. Our approach is also scalable and fast: the computation time is linear in the number of images.
For evaluation, we download more than 9M images of 47 topics from Flickr. Most standard datasets in computer vision research [7,18] have not yet considered the importance of temporal context. Recently, several datasets have introduced spatial contexts as fundamental cues to recognition [18], but the support for temporal context has still been largely ignored. Our experiments clearly show that our modeling and analysis is practically useful and can be used to understand and simulate human-like visual experience from Web images.
1.1 Related Work
Temporal information is one of the most obvious features in video or auditory applications. Hence, here we review only the use of temporal cues for image analysis. The importance of temporal context has long been recognized in neuroscience research [2,19,21]. A wide range of research supports the view that the temporal association (i.e. linking temporally close images) is an important mechanism for recognizing objects and generalizing visual representation. [21] conducted several interesting experiments to show that temporally correlated multiple views can be easily linked to a single representation. [2] proposed a learning model for 3D object recognition by using the temporal continuity in image sequences.
In computer vision, [16] is one of the early studies that use temporal context in active object recognition. They used a POMDP framework for the modeling of temporal context to disambiguate the object hypotheses. [5] proposed an HMM-based temporal context model to solve scene classification problems. For indoor-outdoor classification and sunset detection, they showed that the temporal model outperformed the baseline content-based classifiers.
As Internet vision emerges as an active research area in computer vision, timing information has begun to be used to assist visual tasks. Surprisingly, however, the dynamics or temporal context of Web images has not yet been studied a great deal, even though the study of the dynamic behaviors of texts on the Web has been an active research area in the data mining and machine learning communities [3,22]. We briefly review some notable examples using timestamp meta-data for visual tasks. [6] developed an annotation method for personal photo collections, and the timestamps associated with the images were used for better correlation discovery between the images. [12] proposed a landmark classification for an extremely large dataset, and the temporal information was used as a constraint to remove misclassifications. [17] also used the timestamp as an additional feature to develop an object and event retrieval system for online image communities. [10] presented a method to geolocate a sequence of images taken by a single individual. Temporal constraints from the sequence of images were used as a strong prior to improve the geolocation accuracy.
The main difference between their work and ours is that they considered the temporal information as additional meta-data or constraints to achieve their original goals (i.e. annotations in [6], classification and detection in [12,17], and the geolocation of images in [10]). In contrast, our work considers the timestamps associated with images as a main research subject to uncover dynamic behaviors of Web images. To the best of our knowledge, there have been very few previous attempts to tackle this issue in computer vision research.
2 Network Construction by Sequential Monte Carlo
2.1 Image Description and Similarity Measure
Each image is represented by two types of descriptors, which are spatial pyramids of visual words [14] and HOG [4]. We use the code provided by the authors of the papers. A dictionary of 200 visual words is formed by applying K-means to randomly selected SIFT descriptors [14]. A visual word is densely assigned to every pixel of an image by finding the nearest cluster center in the dictionary. Then visual words are binned using a two-level spatial pyramid. The oriented gradients are computed by Canny edge detection and Sobel mask [4]. The HOG descriptor is then discretized into 20 orientation bins in the range of [0°, 180°]. Then the HOG descriptors are binned using a three-level spatial pyramid. The similarity measure between a pair of images is the cosine similarity, which is calculated by the dot product of a pair of L2-normalized descriptors.
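As a minimal sketch of the similarity measure described above, the following Python snippet computes the cosine similarity as the dot product of two L2-normalized vectors. The function names and the toy 4-bin histograms are hypothetical stand-ins for the real pyramid descriptors, and how the two descriptor types are combined is not specified here, so the sketch handles a single descriptor vector per image.

```python
import math

def l2_normalize(vec):
    """Scale a descriptor to unit Euclidean length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(v * v for v in vec))
    if norm == 0.0:
        return list(vec)
    return [v / norm for v in vec]

def cosine_similarity(desc_a, desc_b):
    """Cosine similarity as the dot product of two L2-normalized descriptors."""
    a = l2_normalize(desc_a)
    b = l2_normalize(desc_b)
    return sum(x * y for x, y in zip(a, b))

# Toy 4-bin histograms (hypothetical data, not real image descriptors).
hist_a = [3.0, 0.0, 1.0, 0.0]
hist_b = [6.0, 0.0, 2.0, 0.0]   # same direction as hist_a, different scale
hist_c = [0.0, 5.0, 0.0, 2.0]   # shares no nonzero bin with hist_a

print(cosine_similarity(hist_a, hist_b))  # close to 1: identical up to scale
print(cosine_similarity(hist_a, hist_c))  # 0: no overlapping bins
```

Because histogram descriptors are non-negative, the similarity stays within [0, 1], which makes it convenient to use directly as an edge weight.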
2.2 Problem Statement
The input of our algorithm is a set of images $I = \{I_1, I_2, \ldots, I_N\}$ and the associated tags of taken time $T = \{T_1, T_2, \ldots, T_N\}$. The main goal is to generate an $N \times N$ sparse similarity network $G = (V, E, W)$ by using the Sequential Monte Carlo (SMC) method. Each vertex in $V$ is an image in the dataset. The edge set $E$ is created between the images that are visually similar and separated in time by an interval that is assigned by the transition model of the SMC tracker (Section 2.3). The weight set $W$ is given by the similarity between descriptors of images (Section 2.1). For sparsity, each image is connected to its $k$-nearest neighbors with $k = a \log N$, where $a$ is a constant (e.g. $a = 10$).
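The sparsity rule $k = a \log N$ can be illustrated with a short Python sketch. It shows only the neighbor-count rule on a small dense similarity matrix; the actual network is generated by the SMC tracker rather than by exhaustive comparison. The natural logarithm, the rounding up, and the helper names `neighbor_count` and `sparsify` are assumptions made for illustration.

```python
import math

def neighbor_count(num_images, a=10.0):
    """k = a * log(N), rounded up and capped at N - 1 (natural log is an assumption)."""
    if num_images < 2:
        return 0
    return min(num_images - 1, math.ceil(a * math.log(num_images)))

def sparsify(similarity, a=10.0):
    """Keep, for each vertex, only its k most similar neighbors.

    `similarity` is a dense N x N list of lists; the result maps each
    vertex index to a dict {neighbor_index: weight}. The result is
    directed: i keeping j does not imply that j keeps i.
    """
    n = len(similarity)
    k = neighbor_count(n, a)
    graph = {}
    for i in range(n):
        ranked = sorted((j for j in range(n) if j != i),
                        key=lambda j: similarity[i][j], reverse=True)
        graph[i] = {j: similarity[i][j] for j in ranked[:k]}
    return graph

# Toy 5-image similarity matrix (hypothetical values).
sim = [
    [1.0, 0.9, 0.1, 0.4, 0.2],
    [0.9, 1.0, 0.3, 0.2, 0.1],
    [0.1, 0.3, 1.0, 0.8, 0.7],
    [0.4, 0.2, 0.8, 1.0, 0.6],
    [0.2, 0.1, 0.7, 0.6, 1.0],
]
graph = sparsify(sim, a=1.0)   # a = 1 keeps the toy example small
print(graph[0])
```

With this rule the number of stored edges grows as N log N rather than N², which is what keeps the network sparse for large collections.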
2.3 Network Construction Using Sequential Monte Carlo
Algorithm 1 summarizes the proposed SMC based network construction. For better readability, we follow the notation of the condensation algorithm [9]. The output of each iteration of the SMC is the conditional subtopic distribution (i.e. posterior) at every step, which is approximated by a set of images with relative importance denoted by $\{s_t, \pi_t\} = \{s_t^{(i)}, \pi_t^{(i)}, i = 1, \ldots, M\}$. Note that our SMC does not explicitly solve the data association during the tracking. In other words, we do not assign a subtopic membership to each image in $s_t$. However, it can be easily obtained later by applying clustering to the subgraph of $s_t$.
Fig. 2 shows a downsampled example of a single iteration of the posterior estimation. At every iteration, the SMC generates a new posterior $\{s_t, \pi_t\}$ by running transition, observation, and resampling.
The image data are severely unbalanced on the timeline (e.g. there are only a few images within a month in 2005 but a large number of images within even a week in 2008). Thus, in our experiments, we bin the timeline by the number of images instead of a fixed time interval (e.g. the timeline may be binned by every 3000 images instead of by month). The function $\tau(T_i, m)$ is used to indicate the timestamp of the $m$-th image after the image taken at $T_i$.
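A minimal Python sketch of count-based binning and of a function in the spirit of $\tau(T_i, m)$ is given here. It assumes that timestamps are stored in a sorted list; the names `tau` and `bin_by_count`, the clamping at the end of the timeline, and the toy timeline are assumptions made for illustration.

```python
from bisect import bisect_left

def tau(sorted_times, t_i, m):
    """Timestamp of the m-th image after the image taken at t_i.

    `sorted_times` must be sorted in ascending order. If t_i is not in the
    list, the first image at or after t_i is used as the starting point.
    The index is clamped to the last image (an assumed boundary rule).
    """
    idx = bisect_left(sorted_times, t_i)
    return sorted_times[min(idx + m, len(sorted_times) - 1)]

def bin_by_count(sorted_times, images_per_bin):
    """Split the timeline into consecutive bins holding equal numbers of images."""
    return [sorted_times[start:start + images_per_bin]
            for start in range(0, len(sorted_times), images_per_bin)]

# Toy timeline in days since the first upload: sparse early, dense late.
times = [0, 40, 95, 200, 201, 202, 203, 204, 205, 206]

print(tau(times, 40, 2))       # the second image after the one taken on day 40
print(bin_by_count(times, 5))  # two bins of five images each
```

In the toy timeline the first bin spans 201 days and the second only 4 days, yet both hold the same number of images, which is the effect that count-based binning is meant to achieve on unbalanced data.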

Fig. 2. An overview of the SMC based network construction for the jaguar topic. The subtopic distribution at each time step is represented by a set of weighted image samples (i.e. posterior) $\{s_t, \pi_t\}$. In this example, a posterior of the jaguar topic consists of image samples of animal, cars, and football subtopics. (a) The transition model generates new posterior candidates $s'_t$ from $s_{t-1}$. (b) The observation model discovers $\pi'_t$ of $s'_t$ and the resampling step computes $\{s_t, \pi_t\}$ from $\{s'_t, \pi'_t\}$. Finally, the network is constructed by similarity matching between two consecutive posteriors $s_{t-1}$ and $s_t$.
Initialization. The initialization samples the initial posterior $s_0$ from the prior $p(x_0)$ at $T_0$. $p(x_0)$ is set to a Gaussian distribution $\mathcal{N}(T_0, \tau^2(T_0, 2M/3))$ on the timeline, which means that $2M$ images around $T_0$ have nonzero probabilities of being selected as one of $s_0$. The initial $\pi_0$ is uniformly set to $1/M$.
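The initialization step can be sketched in Python as follows. This is an illustration under stated assumptions rather than the authors' implementation: the prior is evaluated in image-rank space (positions in the time-sorted image list) with a standard deviation of 2M/3 ranks, it is truncated at three standard deviations, and sampling is done with replacement. The function name `init_posterior` and all parameter values are hypothetical.

```python
import math
import random

def init_posterior(num_images, start_idx, num_samples, seed=0):
    """Draw the initial sample set s_0 and the uniform weights pi_0.

    Image j receives prior mass from a Gaussian centered at start_idx with
    a standard deviation of 2M/3 ranks, truncated at three standard
    deviations (rank space, truncation and replacement are assumptions).
    """
    sigma = 2.0 * num_samples / 3.0
    lo = max(0, math.ceil(start_idx - 3 * sigma))
    hi = min(num_images - 1, math.floor(start_idx + 3 * sigma))
    candidates = list(range(lo, hi + 1))
    prior = [math.exp(-0.5 * ((j - start_idx) / sigma) ** 2) for j in candidates]
    rng = random.Random(seed)
    samples = rng.choices(candidates, weights=prior, k=num_samples)
    weights = [1.0 / num_samples] * num_samples
    return samples, weights

# Hypothetical setting: 10,000 images, tracker started at the first image, M = 300.
s0, pi0 = init_posterior(num_images=10_000, start_idx=0, num_samples=300)
print(len(s0), max(s0), pi0[0])
```

With M = 300 the standard deviation is 200 ranks, so only the first 2M = 600 images after the starting point can be drawn, which mirrors the statement that 2M images around the starting time have nonzero probability.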
Transition Model. The transition model generates posterior candidates $s'_t$ rightward on the timeline from the previous $\{s_{t-1}, \pi_{t-1}\}$ (see Fig. 2(a) for an example). Each image $s_{t-1}^{(i)}$ in $s_{t-1}$ recommends $m_i$ images that are similar to itself as the candidate set $s'_t$ for the next posterior. A more heavily weighted image $s_{t-1}^{(i)}$ is able to recommend more images for $s'_t$ ($\sum_i m_i = 2M$ and $m_i \propto \pi_{t-1}^{(i)}$). At this stage, we generate $2M$ candidates (i.e. $|s'_t| = 2M$), and the observation and resampling steps reduce them to $|s_t| = M$ while computing the weights $\pi_t$. Similarly to the condensation algorithm [9], the transition consists of deterministic drift and stochastic diffusion. The drift describes the transition tendency of the overall $s'_t$ (i.e. how far $s'_t$ is located from $s_{t-1}$ on the timeline). The diffusion assigns a random transition to an individual image. The drift and the diffusion are modeled by a Gaussian distribution $\mathcal{N}(\mu_t, \sigma^2)$ and a Gamma distribution $\Gamma(\alpha, \beta)$, respectively. The final transition model is the product of these two distributions [8] in Eq. 1. The asterisk of $P_t^{*(i)}(x)$ in Eq. 1 means that it is not normalized. Renormalization is not required since we will use importance sampling to sample images on the timeline with the target distribution (see the next subsection with Fig. 3 for details).
$$P_t^{*(i)}(x) = \mathcal{N}(x; \mu_t, \sigma^2) \times \Gamma(x; \alpha_{t-1}^{(i)}, \beta_{t-1}^{(i)}) \qquad (1)$$
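To make the transition model concrete, the Python sketch below evaluates the unnormalized product in Eq. 1 and allocates the recommendation counts $m_i$ in proportion to the previous weights. Several choices are assumptions made for illustration: $x$ is treated as a positive offset on the timeline, the Gamma distribution uses the shape and rate convention, and the integer counts are obtained by largest-remainder rounding. The function names are hypothetical.

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def gamma_pdf(x, alpha, beta):
    """Gamma density with shape alpha and rate beta (the rate convention is an assumption)."""
    if x <= 0.0:
        return 0.0
    log_p = (alpha * math.log(beta) - math.lgamma(alpha)
             + (alpha - 1.0) * math.log(x) - beta * x)
    return math.exp(log_p)

def transition_unnormalized(x, mu_t, sigma, alpha_i, beta_i):
    """Product of Eq. 1: Gaussian drift times Gamma diffusion, left unnormalized."""
    return gaussian_pdf(x, mu_t, sigma) * gamma_pdf(x, alpha_i, beta_i)

def allocate_recommendations(weights, total):
    """Integer counts m_i proportional to the weights, with sum(m_i) == total."""
    scale = total / sum(weights)
    ideal = [w * scale for w in weights]
    counts = [math.floor(v) for v in ideal]
    leftover = total - sum(counts)
    # Hand the remaining slots to the largest fractional parts.
    order = sorted(range(len(weights)),
                   key=lambda i: ideal[i] - counts[i], reverse=True)
    for i in order[:leftover]:
        counts[i] += 1
    return counts

# Hypothetical posterior with M = 3 samples; 2M = 6 candidates are requested.
prev_weights = [0.5, 0.3, 0.2]
m = allocate_recommendations(prev_weights, total=2 * len(prev_weights))
print(m)
print(transition_unnormalized(2.0, mu_t=2.0, sigma=1.0, alpha_i=2.0, beta_i=1.0))
```

Since the product is only needed up to a constant factor for importance sampling, the sketch never computes its normalizing integral.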

References
The Pascal Visual Object Classes (VOC) Challenge. Journal article.
A tutorial on particle filters for online nonlinear/non-Gaussian Bayesian tracking. Journal article.
D.J.C. MacKay: Information Theory, Inference, and Learning Algorithms. Book.
CONDENSATION—Conditional Density Propagation for Visual Tracking. Journal article.