
HAL Id: inria-00548580
https://hal.inria.fr/inria-00548580
Submitted on 20 Dec 2010
To cite this version: Till Quack, Vittorio Ferrari, Luc Van Gool. Video mining with frequent itemset configurations. International Conference on Image and Video Retrieval (CIVR '06), 2006, Tempe, United States. pp. 360–369. doi:10.1007/11788034_37. inria-00548580.

Video Mining with Frequent Itemset Configurations

Till Quack¹, Vittorio Ferrari², and Luc Van Gool¹

¹ ETH Zurich
² LEAR - INRIA Grenoble
Abstract. We present a method for mining frequently occurring objects and scenes from videos. Object candidates are detected by finding recurring spatial arrangements of affine covariant regions. Our mining method is based on the class of frequent itemset mining algorithms, which have proven their efficiency in other domains, but have not been applied to video mining before. In this work we show how to express vector-quantized features and their spatial relations as itemsets. Furthermore, a fast motion segmentation method is introduced as an attention filter for the mining algorithm. Results are shown on real world data consisting of music video clips.
1 Introduction

The goal of this work is to mine interesting objects and scenes from video data, i.e., to detect frequently occurring objects automatically. Mining such representative objects, actors, and scenes in video data is useful for many applications. For instance, they can serve as entry points for retrieval and browsing, or they can provide a basis for video summarization. Our approach to video data mining is based on the detection of recurring spatial arrangements of local features. These features are represented in quantized codebooks, a technique that has recently been popular and successful in object recognition, retrieval [14] and classification [10]. On top of this representation, we introduce a method to detect frequently re-occurring spatial configurations of codebook entries. Whereas isolated features still show weak links with semantic content, their co-occurrence has far greater meaning. Our approach relies on frequent itemset mining algorithms, which have been successfully applied to several other, large-scale data mining problems such as market basket analysis or query log analysis [1, 2]. In our context, the concept of an item corresponds to a codebook entry. The input to the mining algorithm consists of subsets of feature-codebook entries for each video frame, encoded into "transactions", as they are known in the data mining literature [1]. We demonstrate how to incorporate spatial arrangement information in transactions and how to select the neighborhood defining the subset of image features included in a transaction. For scenes with significant motion, we define this neighborhood via motion segmentation. To this end, we also introduce a simple and very fast technique for motion segmentation on feature codebooks.

Few works have dealt with the problem of mining objects composed of local features from video data. In this respect, the closest work to ours is by Sivic and Zisserman [5]. However, there are considerable differences. [5] starts by selecting subsets of quantized features. The neighborhoods for mining are always of fixed size (e.g. the 20 nearest neighbors). Each such neighborhood is expressed as a simple, orderless bag-of-words, represented as a sparse binary indicator vector. The actual mining proceeds by computing the dot-product between all pairs of neighborhoods and setting a threshold on the resulting number of codebook terms they have in common. While this definition of a neighborhood is similar in spirit to our transactions, we also include information about the localization of the feature within its neighborhood. Furthermore, the neighborhood itself is not of fixed size. For scenes containing significant motion, we can exploit our fast motion segmentation to restrict the neighborhood to features with similar motions, which are hence more likely to belong to a single object. As another important difference, unlike [5] our approach does not require pairwise matching of bag-of-words indicator vectors; it relies instead on a frequent itemset mining algorithm, which is a well studied technique in data mining. This brings the additional benefit of knowing which regions are common between neighborhoods, whereas the dot-product technique only reports how many there are. It also opens the door to a large body of research on the efficient detection of frequent itemsets and the many mining methods derived from it.
To the best of our knowledge, no work has been published on frequent itemset mining of video data, and very little is reported on static image data. In [3] an extended association rule mining algorithm was used to mine spatial associations between five classes of texture-tiles in aerial images (forest, urban, etc.). In [4] association rules were used to create a classifier for breast cancer detection from mammogram images. Each mammogram was first cropped to contain the same fraction of the breast, and then described by photometric moments. Compared to our method, both works were only applied to static image data containing rather small variations.
The remainder of this paper is organized as follows. First, the pre-processing steps (i.e. video shot detection, local feature extraction and clustering into appearance codebooks) are described in section 2. Section 3 introduces the concepts of our mining method. Section 4 describes the application of the mining method to video sequences. Finally, results are shown in section 5.
2 Shot detection, features and appearance codebooks

The main processing stages of our system (next sections) rely on the prior subdivision of the video into shots. We apply the shot partitioning algorithm [6], and pick four "keyframes" per second within each shot. As in [5], this results in a denser and more uniform sampling than when using the keyframes selected by [6]. In each keyframe we extract two types of affine covariant features (regions): Hessian-Affine [7] and MSER [8]. Affine covariant features are preferred over simpler scale-invariant ones, as they provide robustness against viewpoint changes. Each normalized region is described with a SIFT descriptor [9]. Next, a quantized codebook [10] (also called "visual vocabulary" [5]) is constructed by clustering the SIFT descriptors with the hierarchical-agglomerative technique described in [10]. In a typical video, this results in about 8000 appearance clusters for each feature type.

We apply the 'stop-list' method known from text retrieval and [5] as a final polishing step: very frequent and very rare visual words are removed from the codebook (the 5% most and the 5% least frequent). Note that the following processing stages use only the spatial locations of features and their assigned appearance-codebook ids. The appearance descriptors are no longer needed.
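For concreteness, here is a minimal sketch of the codebook construction and stop-list polishing described above. It assumes SIFT descriptors are already extracted; scipy's average-linkage clustering stands in for the exact hierarchical-agglomerative technique of [10], and the function names are illustrative, not from the authors' code.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

def build_codebook(descriptors, cut_distance):
    """Quantize SIFT descriptors (n x 128 array) into appearance clusters.

    Average-linkage agglomerative clustering is a stand-in for the
    technique of [10]; note it is O(n^2) and therefore only practical
    on a subsample of the descriptors of a full video.
    """
    Z = linkage(descriptors, method="average")
    # Cut the dendrogram at `cut_distance`; returns a codebook id per descriptor.
    return fcluster(Z, t=cut_distance, criterion="distance")

def apply_stop_list(word_ids, drop_frac=0.05):
    """Stop-list polishing: drop the 5% most and 5% least frequent words."""
    ids, counts = np.unique(word_ids, return_counts=True)
    n_drop = int(drop_frac * len(ids))
    if n_drop == 0:
        return list(word_ids)
    order = np.argsort(counts)
    stopped = set(ids[order[:n_drop]]) | set(ids[order[-n_drop:]])
    return [w for w in word_ids if w not in stopped]
```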
3 Our mining approach

Our goal is to find frequent spatial configurations of visual words in video scenes. For the time being, let us consider a configuration to be just an unordered set of visual words. We add spatial relationships later, in section 3.2. For a codebook of size d there are 2^d possible subsets of visual words. For each of our two feature types we have a codebook with about 8000 words, which means d is typically > 10000, resulting in an immense search space. Hence we need a mining method capable of dealing with such a large dataset and of returning frequently occurring word combinations. Frequent itemset mining methods are a good choice, as they have solved analogous problems on market-basket-like data [1, 2]. Here we briefly summarize the terminology and methodology of frequent itemset mining.
3.1 Frequent itemset mining

Let I = {i_1, ..., i_p} be a set of p items. Let A be a subset of I with l items, i.e. A ⊆ I, |A| = l. Then we call A an l-itemset. A transaction is an itemset T ⊆ I with a transaction identifier tid(T). A transaction database D is a set of transactions with unique identifiers, D = {T_1, ..., T_n}. We say that a transaction T supports an itemset A if A ⊆ T. We can now define the support of an itemset in the transaction database D as follows:

Definition 1. The support of an itemset A in D is

    support(A) = |{T ∈ D | A ⊆ T}| / |D|  ∈ [0, 1]    (1)

An itemset A is called frequent in D if support(A) ≥ s, where s is a threshold for the minimal support defined by the user.

Frequent itemsets are subject to the monotonicity property: all l-subsets of frequent (l+1)-sets are also frequent. The well known APriori algorithm [1] takes advantage of the monotonicity property and allows us to find frequent itemsets very quickly.
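To make these definitions concrete, the following is a minimal APriori-style miner in the spirit of [1]. It is a sketch for small inputs, not the optimized implementation used in the paper; production miners add hashing and candidate-pruning tricks beyond the basic join/prune steps shown here.

```python
from itertools import combinations

def support(itemset, transactions):
    """Fraction of transactions containing `itemset` (Definition 1)."""
    itemset = frozenset(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def apriori(transactions, min_support):
    """Level-wise frequent itemset mining, exploiting monotonicity:
    candidates at level l+1 are built only from frequent l-itemsets."""
    transactions = [frozenset(t) for t in transactions]
    items = {i for t in transactions for i in t}
    level = {frozenset([i]) for i in items
             if support([i], transactions) >= min_support}
    frequent, l = set(level), 1
    while level:
        # Join: merge pairs of frequent l-itemsets into (l+1)-candidates.
        candidates = {a | b for a in level for b in level if len(a | b) == l + 1}
        # Prune: a candidate survives only if all its l-subsets are frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in level for s in combinations(c, l))}
        level = {c for c in candidates
                 if support(c, transactions) >= min_support}
        frequent |= level
        l += 1
    return frequent

# Example: {"tl55", "tr923"} is frequent at s = 0.6 in these 3 transactions.
ts = [{"tl55", "tr923", "br79"}, {"tl55", "tr923"}, {"tl55", "br23"}]
print(apriori(ts, min_support=0.6))
```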

3.2 Incorporating spatial information

In our context, the items correspond to visual words. In the simplest case, a transaction would be created for each visual word, and would consist of an orderless bag of all other words within some image neighborhood. In order to also include spatial information (i.e. the locations of visual words) in the mining process, we further adapt the concept of an item to our problem. The key idea is to encode spatial information directly in the items. In each image we create transactions from the neighborhood around a limited subset of selected words {v_c}. These words must appear in at least f_min and at most f_max frames. This is motivated by the notion that neighbourhoods containing a very infrequent word would create infrequent transactions, while neighbourhoods around extremely frequent words have a high probability of being part of clutter. Each v_c must also have a matching word in the previous frame, if both frames are from the same shot. Typically, with these restrictions, about 1/4 of the regions in a frame are selected.
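A minimal sketch of this frequency-based selection rule follows; the data structure and names are hypothetical, and the previous-frame matching test is omitted for brevity.

```python
def select_central_words(word_occurrences, f_min, f_max):
    """Keep candidate central words {v_c} whose frame frequency lies in
    [f_min, f_max]. `word_occurrences` maps a visual word id to the set
    of frame ids it appears in (an assumed structure, not the paper's).
    Words below f_min would yield infrequent transactions; words above
    f_max are likely clutter."""
    return {word for word, frames in word_occurrences.items()
            if f_min <= len(frames) <= f_max}
```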
For each v_c we create a transaction which contains the surrounding k nearest words together with their rough spatial arrangement. The neighbourhood around v_c is divided into B sections. In all experiments we use B = 4 sections. Each section covers 90° plus an overlap o = 5° with its neighboring sections, to be robust against small rotations. We label the sections {tl, tr, bl, br} (for "top-left", "top-right", etc.), and append to each visual word the label of the section it lies in. In the example in figure 1, the transaction created for v_c is T = {tl55, tl9, tr923, br79, br23, bl23, bl9}. In the following, we refer to the selected words {v_c} as central words. Although the approach only accommodates small rotations, in most videos objects rarely appear in substantially different orientations. Rotations of the neighborhood stemming from perspective transformations are safely accommodated by the overlap o.

Fig. 1. Creating a transaction from a neighborhood. The area around a central visual word v_c is divided into sections. Each section is labeled (tl, tr, bl, br) and the label is appended to the visual word ids.

Although augmenting the items in this fashion increases their total number by a factor B, no changes to the frequent itemset mining algorithm itself are necessary. Besides, thanks to the careful selection of the central visual words v_c, we reduce the number of transactions and thus the runtime of the algorithm.
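The sketch below shows how such a transaction could be assembled. The section boundaries are assumed to be quadrants measured counter-clockwise around v_c with a y-up convention; the paper's figure fixes the actual geometry, so treat this as illustrative rather than the authors' implementation.

```python
import math

# Quadrant labels by angle from the positive x-axis (assumed convention):
# [0,90) -> tr, [90,180) -> tl, [180,270) -> bl, [270,360) -> br.
LABELS = ["tr", "tl", "bl", "br"]

def encode_neighborhood(center, neighbors, overlap_deg=5.0):
    """Encode the k nearest words around a central word v_c as a transaction.

    `center` is (x, y); `neighbors` is a list of ((x, y), word_id).
    Each word id is prefixed with the label of the 90-degree section it
    lies in; words within `overlap_deg` of a section border get both
    labels, which absorbs small rotations.
    """
    transaction = set()
    for (x, y), word_id in neighbors:
        angle = math.degrees(math.atan2(y - center[1], x - center[0])) % 360.0
        q = int(angle // 90) % 4
        transaction.add(f"{LABELS[q]}{word_id}")
        # Border overlap: also tag the word with the adjacent section label.
        within = angle % 90.0
        if within < overlap_deg:
            transaction.add(f"{LABELS[(q - 1) % 4]}{word_id}")
        elif within > 90.0 - overlap_deg:
            transaction.add(f"{LABELS[(q + 1) % 4]}{word_id}")
    return transaction

# e.g. encode_neighborhood((100, 100), [((90, 110), 55)]) -> {"tl55"}
```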

References

R. Agrawal, T. Imieliński, and A. Swami. Mining association rules between sets of items in large databases. In Proc. ACM SIGMOD, 1993.

L. Kaufman and P. J. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis. Wiley, 1990.

D. G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91–110, 2004.

J. Sivic and A. Zisserman. Video Google: a text retrieval approach to object matching in videos. In Proc. ICCV, 2003.
Frequently Asked Questions (16)
Q1. What are the contributions in "Video mining with frequent itemset configurations"?

The authors present a method for mining frequently occurring objects and scenes from videos. In this work the authors show how to express vector-quantized features and their spatial relations as itemsets. Furthermore, a fast motion segmentation method is introduced as an attention filter for the mining algorithm.

Future works include testing on larger datasets (e.g. TRECVID), defining more interestingness measures, and stronger customization of itemset mining algorithms to video data.

For shots with considerable motion, the authors use as central words the two words closest to the spatial center of the motion group, and create two transactions covering only visual words within it. 

The support of an itemset A in D is support(A) = |{T ∈ D | A ⊆ T}| / |D| ∈ [0, 1] (1). An itemset A is called frequent in D if support(A) ≥ s, where s is a threshold for the minimal support defined by the user.

The mean time for performing motion segmentation (matching + k-means clustering) was typically about 0.4s per frame, but obviously depends on the number of features detected per frame.

While the runtime is very short for both cases, the method is faster for the 40-neighborhood case, because transactions are shorter and only shorter itemsets were frequent. 

For each motion group, the authors estimate a series of bounding-boxes, containing from 80% progressively up to all regions closest to the spatial median of the group. 
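A sketch of how such a bounding-box series could be computed; the fraction steps and the coordinate-wise median are assumptions, since the excerpt does not specify them.

```python
import numpy as np

def bounding_box_series(points, fractions=(0.8, 0.85, 0.9, 0.95, 1.0)):
    """For a motion group's region centers `points` (N x 2 array), estimate
    boxes containing from 80% progressively up to all of the regions
    closest to the group's spatial median (approximated here by the
    coordinate-wise median)."""
    pts = np.asarray(points, dtype=float)
    median = np.median(pts, axis=0)
    order = np.argsort(np.linalg.norm(pts - median, axis=1))
    boxes = []
    for f in fractions:
        subset = pts[order[: max(1, int(round(f * len(pts))))]]
        (x0, y0), (x1, y1) = subset.min(axis=0), subset.max(axis=0)
        boxes.append((x0, y0, x1, y1))
    return boxes
```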

Restricting the neighborhood by motion grouping has proven to be useful for detecting objects of different sizes at the same time. 

Thanks to the careful selection of the central visual words v_c, the authors reduce the number of transactions and thus the runtime of the algorithm.

The well known APriori algorithm [1] takes advantage of the monotonicity property and allows us to find frequent itemsets very quickly. 

This is motivated by the notion that neighbourhoods containing a very infrequent word would create infrequent transactions, while neighbourhoods around extremely frequent words have a high probability of being part of clutter.

In the 40-NN case, the support threshold to mine even a small set of only 285 frequent itemsets has to be set more than a factor of 10 lower.

For each remaining timestep, the authors run k-means three times with different values for k, specifically k(t) ∈ {k(t−1) − 1, k(t−1), k(t−1) + 1} (2), where k(t−1) is the number of motion groups in the previous timestep.
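A sketch of this adaptive-k re-clustering step; how the best of the three runs is chosen is not stated in the excerpt, so the per-cluster inertia used below is an assumed stand-in criterion.

```python
import numpy as np
from sklearn.cluster import KMeans

def regroup_motion(features, prev_k):
    """Re-cluster motion features with k in {k(t-1)-1, k(t-1), k(t-1)+1}
    (Eq. 2) and keep the best run by an assumed model-selection score."""
    best_k, best_labels, best_score = None, None, np.inf
    for k in (prev_k - 1, prev_k, prev_k + 1):
        if k < 1 or k > len(features):
            continue  # skip degenerate cluster counts
        km = KMeans(n_clusters=k, n_init=10).fit(features)
        score = km.inertia_ / k  # assumed heuristic, not from the paper
        if score < best_score:
            best_k, best_labels, best_score = k, km.labels_, score
    return best_k, best_labels
```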

In conclusion, the authors showed that their mining approach based on frequent itemsets is a suitable and efficient tool for video mining.

Since the frequent itemset mining typically returns spatially and temporally overlapping itemsets, the authors merge them with a final clustering stage.