
HAL Id: inria-00548580
https://hal.inria.fr/inria-00548580
Submitted on 20 Dec 2010
To cite this version: Till Quack, Vittorio Ferrari, Luc Van Gool. Video mining with frequent itemset configurations. International Conference on Image and Video Retrieval (CIVR '06), 2006, Tempe, United States. pp. 360–369. doi:10.1007/11788034_37. inria-00548580.

Video Mining with Frequent Itemset Configurations

Till Quack¹, Vittorio Ferrari², and Luc Van Gool¹

¹ ETH Zurich
² LEAR - INRIA Grenoble
Abstract. We present a method for mining frequently occurring objects and scenes from videos. Object candidates are detected by finding recurring spatial arrangements of affine covariant regions. Our mining method is based on the class of frequent itemset mining algorithms, which have proven their efficiency in other domains, but have not been applied to video mining before. In this work we show how to express vector-quantized features and their spatial relations as itemsets. Furthermore, a fast motion segmentation method is introduced as an attention filter for the mining algorithm. Results are shown on real world data consisting of music video clips.
1 Introduction

The goal of this work is to mine interesting objects and scenes from video data, i.e., to detect frequently occurring objects automatically. Mining such representative objects, actors, and scenes in video data is useful for many applications. For instance, they can serve as entry points for retrieval and browsing, or they can provide a basis for video summarization. Our approach to video data mining is based on the detection of recurring spatial arrangements of local features. These features are represented in quantized codebooks, a technique that has recently been popular and successful in object recognition, retrieval [14] and classification [10]. On top of this representation, we introduce a method to detect frequently re-occurring spatial configurations of codebook entries. Whereas isolated features still show weak links with semantic content, their co-occurrence has far greater meaning. Our approach relies on frequent itemset mining algorithms, which have been successfully applied to several other, large-scale data mining problems such as market basket analysis or query log analysis [1, 2]. In our context, the concept of an item corresponds to a codebook entry. The input to the mining algorithm consists of subsets of feature-codebook entries for each video frame, encoded into "transactions", as they are known in the data mining literature [1]. We demonstrate how to incorporate spatial arrangement information in transactions and how to select the neighborhood defining the subset of image features included in a transaction. For scenes with significant motion, we define this neighborhood via motion segmentation. To this end, we also introduce a simple and very fast technique for motion segmentation on feature codebooks.

Few works have dealt with the problem of mining objects composed of local features from video data. In this respect, the closest work to ours is by Sivic and Zisserman [5]. However, there are considerable differences. [5] starts by selecting subsets of quantized features. The neighborhoods for mining are always of fixed size (e.g. the 20 nearest neighbors). Each such neighborhood is expressed as a simple, orderless bag-of-words, represented as a sparse binary indicator vector. The actual mining proceeds by computing the dot-product between all pairs of neighborhoods and setting a threshold on the resulting number of codebook terms they have in common. While this definition of a neighborhood is similar in spirit to our transactions, we also include information about the localization of the feature within its neighborhood. Furthermore, the neighborhood itself is not of fixed size. For scenes containing significant motion, we can exploit our fast motion segmentation to restrict the neighborhood to features with similar motions, which are hence more likely to belong to a single object. As another important difference, unlike [5] our approach does not require pairwise matching of bag-of-words indicator vectors; it relies instead on a frequent itemset mining algorithm, which is a well studied technique in data mining. This brings the additional benefit of knowing which regions are common between neighborhoods, whereas the dot-product technique only reports how many there are. It also opens the door to a large body of research on the efficient detection of frequent itemsets and the many mining methods derived from it.
To the best of our knowledge, no work has been published on frequent itemset mining of video data, and very little is reported on static image data. In [3] an extended association rule mining algorithm was used to mine spatial associations between five classes of texture-tiles in aerial images (forest, urban, etc.). In [4] association rules were used to create a classifier for breast cancer detection from mammogram images. Each mammogram was first cropped to contain the same fraction of the breast, and then described by photometric moments. Compared to our method, both works were only applied to static image data containing rather small variations.
The remainder of this paper is organized as follows. First, the pre-processing steps (i.e. video shot detection, local feature extraction and clustering into appearance codebooks) are described in section 2. Section 3 introduces the concepts of our mining method. Section 4 describes the application of the mining method to video sequences. Finally, results are shown in section 5.
2 Shot detection, features and appearance codebooks

The main processing stages of our system (next sections) rely on the prior subdivision of the video into shots. We apply the shot partitioning algorithm [6], and pick four "keyframes" per second within each shot. As in [5], this results in a denser and more uniform sampling than when using the keyframes selected by [6]. In each keyframe we extract two types of affine covariant features (regions): Hessian-Affine [7] and MSER [8]. Affine covariant features are preferred over simpler scale-invariant ones, as they provide robustness against viewpoint changes. Each normalized region is described with a SIFT descriptor [9]. Next, a quantized codebook [10] (also called "visual vocabulary" [5]) is constructed by clustering the SIFT descriptors with the hierarchical-agglomerative technique described in [10]. In a typical video, this results in about 8000 appearance clusters for each feature type.

We apply the 'stop-list' method known from text retrieval and [5] as a final polishing step: very frequent and very rare visual words are removed from the codebook (the 5% most and the 5% least frequent). Note that the following processing stages use only the spatial locations of features and their assigned appearance-codebook ids. The appearance descriptors are no longer needed.
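For concreteness, here is a minimal sketch of the codebook construction and stop-list polishing described above. It assumes SIFT descriptors are already extracted; scipy's average-linkage clustering stands in for the exact hierarchical-agglomerative technique of [10], and the function names are illustrative, not from the authors' code.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

def build_codebook(descriptors, cut_distance):
    """Quantize SIFT descriptors (n x 128 array) into appearance clusters.

    Average-linkage agglomerative clustering is a stand-in for the
    technique of [10]; note it is O(n^2) and therefore only practical
    on a subsample of the descriptors of a full video.
    """
    Z = linkage(descriptors, method="average")
    # Cut the dendrogram at `cut_distance`; returns a codebook id per descriptor.
    return fcluster(Z, t=cut_distance, criterion="distance")

def apply_stop_list(word_ids, drop_frac=0.05):
    """Stop-list polishing: drop the 5% most and 5% least frequent words."""
    ids, counts = np.unique(word_ids, return_counts=True)
    n_drop = int(drop_frac * len(ids))
    if n_drop == 0:
        return list(word_ids)
    order = np.argsort(counts)
    stopped = set(ids[order[:n_drop]]) | set(ids[order[-n_drop:]])
    return [w for w in word_ids if w not in stopped]
```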
3 Our mining approach

Our goal is to find frequent spatial configurations of visual words in video scenes. For the time being, let us consider a configuration to be just an unordered set of visual words. We add spatial relationships later, in section 3.2. For a codebook of size d there are 2^d possible subsets of visual words. For each of our two feature types we have a codebook with about 8000 words, which means d is typically > 10000, resulting in an immense search space. Hence we need a mining method capable of dealing with such a large dataset and of returning frequently occurring word combinations. Frequent itemset mining methods are a good choice, as they have solved analogous problems on market-basket-like data [1, 2]. Here we briefly summarize the terminology and methodology of frequent itemset mining.
3.1 Frequent itemset mining

Let I = {i_1, ..., i_p} be a set of p items. Let A be a subset of I with l items, i.e. A ⊆ I, |A| = l. Then we call A an l-itemset. A transaction is an itemset T ⊆ I with a transaction identifier tid(T). A transaction database D is a set of transactions with unique identifiers, D = {T_1, ..., T_n}. We say that a transaction T supports an itemset A if A ⊆ T. We can now define the support of an itemset in the transaction database D as follows:

Definition 1. The support of an itemset A in D is

    support(A) = |{T ∈ D | A ⊆ T}| / |D|  ∈ [0, 1]    (1)

An itemset A is called frequent in D if support(A) ≥ s, where s is a threshold for the minimal support defined by the user.

Frequent itemsets are subject to the monotonicity property: all l-subsets of frequent (l+1)-sets are also frequent. The well known APriori algorithm [1] takes advantage of the monotonicity property and allows us to find frequent itemsets very quickly.
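To make these definitions concrete, the following is a minimal APriori-style miner in the spirit of [1]. It is a sketch for small inputs, not the optimized implementation used in the paper; production miners add hashing and candidate-pruning tricks beyond the basic join/prune steps shown here.

```python
from itertools import combinations

def support(itemset, transactions):
    """Fraction of transactions containing `itemset` (Definition 1)."""
    itemset = frozenset(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def apriori(transactions, min_support):
    """Level-wise frequent itemset mining, exploiting monotonicity:
    candidates at level l+1 are built only from frequent l-itemsets."""
    transactions = [frozenset(t) for t in transactions]
    items = {i for t in transactions for i in t}
    level = {frozenset([i]) for i in items
             if support([i], transactions) >= min_support}
    frequent, l = set(level), 1
    while level:
        # Join: merge pairs of frequent l-itemsets into (l+1)-candidates.
        candidates = {a | b for a in level for b in level if len(a | b) == l + 1}
        # Prune: a candidate survives only if all its l-subsets are frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in level for s in combinations(c, l))}
        level = {c for c in candidates
                 if support(c, transactions) >= min_support}
        frequent |= level
        l += 1
    return frequent

# Example: {"tl55", "tr923"} is frequent at s = 0.6 in these 3 transactions.
ts = [{"tl55", "tr923", "br79"}, {"tl55", "tr923"}, {"tl55", "br23"}]
print(apriori(ts, min_support=0.6))
```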

3.2 Incorporating spatial information

In our context, the items correspond to visual words. In the simplest case, a transaction would be created for each visual word, and would consist of an orderless bag of all other words within some image neighborhood. In order to also include spatial information (i.e. the locations of visual words) in the mining process, we further adapt the concept of an item to our problem. The key idea is to encode spatial information directly in the items. In each image we create transactions from the neighborhood around a limited subset of selected words {v_c}. These words must appear in at least f_min and at most f_max frames. This is motivated by the notion that neighbourhoods containing a very infrequent word would create infrequent transactions, while neighbourhoods around extremely frequent words have a high probability of being part of clutter. Each v_c must also have a matching word in the previous frame, if both frames are from the same shot. Typically, with these restrictions, about 1/4 of the regions in a frame are selected.
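A minimal sketch of this frequency-based selection rule follows; the data structure and names are hypothetical, and the previous-frame matching test is omitted for brevity.

```python
def select_central_words(word_occurrences, f_min, f_max):
    """Keep candidate central words {v_c} whose frame frequency lies in
    [f_min, f_max]. `word_occurrences` maps a visual word id to the set
    of frame ids it appears in (an assumed structure, not the paper's).
    Words below f_min would yield infrequent transactions; words above
    f_max are likely clutter."""
    return {word for word, frames in word_occurrences.items()
            if f_min <= len(frames) <= f_max}
```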
For each v_c we create a transaction which contains the surrounding k nearest words together with their rough spatial arrangement. The neighbourhood around v_c is divided into B sections. In all experiments we use B = 4 sections. Each section covers 90° plus an overlap o = 5° with its neighboring sections, to be robust against small rotations. We label the sections {tl, tr, bl, br} (for "top-left", "top-right", etc.), and append to each visual word the label of the section it lies in. In the example in figure 1, the transaction created for v_c is T = {tl55, tl9, tr923, br79, br23, bl23, bl9}. In the following, we refer to the selected words {v_c} as central words. Although the approach only accommodates small rotations, in most videos objects rarely appear in substantially different orientations. Rotations of the neighborhood stemming from perspective transformations are safely accommodated by the overlap o.

Fig. 1. Creating a transaction from a neighborhood. The area around a central visual word v_c is divided into sections. Each section is labeled (tl, tr, bl, br) and the label is appended to the visual word ids.

Although augmenting the items in this fashion increases their total number by a factor B, no changes to the frequent itemset mining algorithm itself are necessary. Besides, thanks to the careful selection of the central visual words v_c, we reduce the number of transactions and thus the runtime of the algorithm.
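The sketch below shows how such a transaction could be assembled. The section boundaries are assumed to be quadrants measured counter-clockwise around v_c with a y-up convention; the paper's figure fixes the actual geometry, so treat this as illustrative rather than the authors' implementation.

```python
import math

# Quadrant labels by angle from the positive x-axis (assumed convention):
# [0,90) -> tr, [90,180) -> tl, [180,270) -> bl, [270,360) -> br.
LABELS = ["tr", "tl", "bl", "br"]

def encode_neighborhood(center, neighbors, overlap_deg=5.0):
    """Encode the k nearest words around a central word v_c as a transaction.

    `center` is (x, y); `neighbors` is a list of ((x, y), word_id).
    Each word id is prefixed with the label of the 90-degree section it
    lies in; words within `overlap_deg` of a section border get both
    labels, which absorbs small rotations.
    """
    transaction = set()
    for (x, y), word_id in neighbors:
        angle = math.degrees(math.atan2(y - center[1], x - center[0])) % 360.0
        q = int(angle // 90) % 4
        transaction.add(f"{LABELS[q]}{word_id}")
        # Border overlap: also tag the word with the adjacent section label.
        within = angle % 90.0
        if within < overlap_deg:
            transaction.add(f"{LABELS[(q - 1) % 4]}{word_id}")
        elif within > 90.0 - overlap_deg:
            transaction.add(f"{LABELS[(q + 1) % 4]}{word_id}")
    return transaction

# e.g. encode_neighborhood((100, 100), [((90, 110), 55)]) -> {"tl55"}
```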

References

R. Agrawal, T. Imieliński, and A. Swami. Mining association rules between sets of items in large databases. In Proc. ACM SIGMOD, 1993.

L. Kaufman and P. J. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis. Wiley, 1990.

D. G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91–110, 2004.

J. Sivic and A. Zisserman. Video Google: a text retrieval approach to object matching in videos. In Proc. ICCV, 2003.
Frequently Asked Questions (16)
Q1. What are the contributions in "Video mining with frequent itemset configurations"?

The authors present a method for mining frequently occurring objects and scenes from videos. In this work the authors show how to express vector-quantized features and their spatial relations as itemsets. Furthermore, a fast motion segmentation method is introduced as an attention filter for the mining algorithm.

Future works include testing on larger datasets (e.g. TRECVID), defining more interestingness measures, and stronger customization of itemset mining algorithms to video data.

For shots with considerable motion, the authors use as central words the two words closest to the spatial center of the motion group, and create two transactions covering only visual words within it. 

The support of an itemset A in D is support(A) = |{T ∈ D | A ⊆ T}| / |D| ∈ [0, 1] (1). An itemset A is called frequent in D if support(A) ≥ s, where s is a threshold for the minimal support defined by the user.

The mean time for performing motion segmentation (matching + k-means clustering) was typically about 0.4s per frame, but obviously depends on the number of features detected per frame.

While the runtime is very short for both cases, the method is faster for the 40-neighborhood case, because transactions are shorter and only shorter itemsets were frequent. 

For each motion group, the authors estimate a series of bounding-boxes, containing from 80% progressively up to all regions closest to the spatial median of the group. 
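A sketch of how such a bounding-box series could be computed; the fraction steps and the coordinate-wise median are assumptions, since the excerpt does not specify them.

```python
import numpy as np

def bounding_box_series(points, fractions=(0.8, 0.85, 0.9, 0.95, 1.0)):
    """For a motion group's region centers `points` (N x 2 array), estimate
    boxes containing from 80% progressively up to all of the regions
    closest to the group's spatial median (approximated here by the
    coordinate-wise median)."""
    pts = np.asarray(points, dtype=float)
    median = np.median(pts, axis=0)
    order = np.argsort(np.linalg.norm(pts - median, axis=1))
    boxes = []
    for f in fractions:
        subset = pts[order[: max(1, int(round(f * len(pts))))]]
        (x0, y0), (x1, y1) = subset.min(axis=0), subset.max(axis=0)
        boxes.append((x0, y0, x1, y1))
    return boxes
```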

Restricting the neighborhood by motion grouping has proven to be useful for detecting objects of different sizes at the same time. 

Thanks to the careful selection of the central visual words v_c, the authors reduce the number of transactions and thus the runtime of the algorithm.

The well known APriori algorithm [1] takes advantage of the monotonicity property and allows us to find frequent itemsets very quickly. 

This is motivated by the notion that neighbourhoods containing a very infrequent word would create infrequent transactions, while neighbourhoods around extremely frequent words have a high probability of being part of clutter.

In the 40-NN case, the support threshold to mine even a small set of only 285 frequent itemsets has to be set more than a factor of 10 lower.

For each remaining timestep, the authors run k-means three times with different values for k, specifically k(t) ∈ {k(t−1) − 1, k(t−1), k(t−1) + 1} (2), where k(t−1) is the number of motion groups in the previous timestep.
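A sketch of this adaptive-k re-clustering step; how the best of the three runs is chosen is not stated in the excerpt, so the per-cluster inertia used below is an assumed stand-in criterion.

```python
import numpy as np
from sklearn.cluster import KMeans

def regroup_motion(features, prev_k):
    """Re-cluster motion features with k in {k(t-1)-1, k(t-1), k(t-1)+1}
    (Eq. 2) and keep the best run by an assumed model-selection score."""
    best_k, best_labels, best_score = None, None, np.inf
    for k in (prev_k - 1, prev_k, prev_k + 1):
        if k < 1 or k > len(features):
            continue  # skip degenerate cluster counts
        km = KMeans(n_clusters=k, n_init=10).fit(features)
        score = km.inertia_ / k  # assumed heuristic, not from the paper
        if score < best_score:
            best_k, best_labels, best_score = k, km.labels_, score
    return best_k, best_labels
```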

In conclusion, the authors showed that their mining approach based on frequent itemsets is a suitable and efficient tool for video mining.

Since the frequent itemset mining typically returns spatially and temporally overlapping itemsets, the authors merge them with a final clustering stage.