A Novel Learning-based Framework for Detecting Interesting Events in Soccer Videos

Nisha Jain¹, Santanu Chaudhury¹, Sumantra Dutta Roy¹, Prasenjit Mukherjee², Krishanu Seal², Kumar Talluri²

¹ Electrical Engineering Department, IIT Delhi
{nisha.iitd, schaudhury, sumantra.dutta.roy}@gmail.com
² AOL
{P.Mukherjee, Krishanu.Seal, Kumar.Talluri}@corp.aol.com
Abstract

We present a novel learning-based framework for detecting interesting events in soccer videos. The input to the system is a raw soccer video. Learning takes place at three levels: learning to detect interesting low-level features from image and video data using Support Vector Machines (hereafter, SVMs), and a hierarchical Conditional Random Field (hereafter, CRF) based methodology to learn the dependencies of mid-level features and their relation with the low-level features, and of high-level decisions ('interesting events') and their relation with the mid-level features: all on the basis of training video data. Descriptors are spatio-temporal in nature; they can be associated with a region in an image or a set of frames. Temporal patterns of descriptors characterise an event. We apply this framework to parse soccer videos into Interesting (a goal or a goal miss) and Non-Interesting videos. We present results of numerous experiments in support of the proposed strategy.
1. Introduction
Automatic extraction of highlights and summarisation is an important task in video analysis, especially in the sports domain [12], [4], [1], [5], [2], [14]. In general, sports video analysis involves analysing all feeds associated with it: ticker-tape text, audio, still images, graphics, video, and so on. Hakeem and Shah [7] present an unsupervised methodology for learning, detection and representation of events performed by multiple agents in videos. Fleischman and Roy [6] present an unsupervised learning framework based on learning grounded language models. The authors associate low-level patterns from videos with words extracted from the closed-captioning text using a generalisation of Latent Dirichlet Allocation. This paper presents an alternative point of view, using CRF-based hierarchical modelling and supervised learning. Often we have enough ground truth to label representative examples; supervised approaches are very commonly used for such a task [9]. While unsupervised learning is more applicable for mining and discovering events, our work is geared to a different problem: that of automatically parsing video streams. We have specific definitions for events, and their description through a set of training examples. Further, the problem involves probabilistic mappings between features (at numerous levels) and their labels (at many levels, again). Such a framework is naturally amenable to analysis using CRFs; our proposed method builds a learning-based methodology to learn this structure automatically from training videos. Further, our system works towards automated generation of highlights, by segmenting out the sequence of frames where interesting events occur. Xu and Chua [14] have a multi-level system, where the probabilistic uncertainty handling and inference process comes from a hierarchical Hidden Markov Model (HHMM). A CRF-based formulation is more general than any other Markovian or Bayesian formulation [11], more so in a multi-level version [8], [10]. We have a probabilistic hierarchical modelling of dependencies between events and entities at different levels, unlike the heuristic combination of rule-based and model-based structure in the work of Chu and Wu [4], or in the work of Ariki et al. [1]. This is also a limitation in the work of Duan et al. [5], who propose a multi-level hierarchy but have a rule-based system for event detection. The work of Chang et al. [3] reports the only major use of CRFs for video analysis; its basic structure is quite different from our work, and so is its aim: semantic concept detection in consumer video. To the best of our knowledge, there has been no other related work combining different kinds of learning at various levels, right from feature detection to a hierarchical CRF-based formulation, for parsing videos and detecting interesting events. The organisation of the rest of the paper is as follows: Sec. 2 gives an overview of the complete system. Sec. 3 describes the various feature extraction schemes. Sec. 4 explains the hierarchical CRF design and the parsing process in detail. Sec. 5 presents results of the proposed approach. Sec. 6 concludes.
2. Overall System Description
Our framework for detecting interesting events in videos operates in three phases (see Fig. 1): first, the raw video data is abstracted into multiple streams of discrete features. Next, CRF models classify the low-level features to generate mid-level features. Finally, the mid-level features are fed to the hierarchical CRF-based parsing model that parses out interesting events from the video.

Figure 1. Three phases of the proposed framework, to detect interesting events in videos (Sec. 1).

Fig. 2 shows an example of a hierarchical CRF in our system. This links low-level features to the overall decision, e.g., Goal. Sec. 3 explains our trainable feature extraction process. In Fig. 2, each circle in the centre indicates an intermediate-level feature such as zoom in, or zoom in on a goal post. The edges indicate probabilistic dependencies between these objects. The input to our model is the video, which is broken into a sequence of frames. We segment each frame in order to generate a sequence of feature nodes at the next level of the model. Our model labels different features in each frame.
3. Trainable Feature Extraction Scheme
The first step in detecting interesting events is to abstract the raw video data into more semantically meaningful streams of information. We divide the video into frames and then find features in each frame of the video sequence. Here, we emphasise that a feature need not be specific to data collected from a single frame. One may use any number of low-level features: audio-based features, image-based features, graphics detection and analysis, and of course, spatio-temporal video features. For our experiments with soccer videos, we choose the following low-level features, and outline their detection process in detail in the subsequent sections: Zoom In, Goal Post Detection, Crowd Detection, Field View Detection, and Face Detection.

Figure 2. An example of a hierarchical CRF model. This links low-level features at the bottom to mid-level features, and the overall interesting events at the topmost level. Each link is associated with the corresponding conditional probability. This is learnt in the CRF learning phase. Sec. 1 outlines the overall system, and Sec. 4 gives the details of the hierarchical CRF-based scheme.
3.1. Zoom-In Detection
We propose the use of divergence for detecting a zoom-in across a sequence of frames of a video. First we convert each coloured frame to a grey-scale image, then calculate the optic flow across frames of the video sequence. We obtain flow components in both directions (x and y), say u and v. We calculate the divergence of the optical flow, i.e., the sum of partial derivatives u_x + v_y, for each pixel in the frame. Noise at individual pixels is removed by thresholding: within a range of divergence values (say 50 to 150), we consider only those values that satisfy the threshold (say values greater than 75 and less than 125). We make different bins for each range and count the valid divergence values in each bin; this gives a histogram of the divergence values. In the final step, we feed the histogram values to a Support Vector Machine (SVM) for classification of zoom and non-zoom sequences of frames. One can use any method for this classification; we choose SVMs for their versatility and universal appeal as generalised linear classifiers. We train the SVM's decision boundary using a large number of representative sequences.
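As an illustration, the following is a minimal sketch of this detector in Python with OpenCV and NumPy. The choice of Farneback flow, the bin count, and the exact thresholds are our assumptions; the paper gives only the illustrative ranges quoted above.

```python
import cv2
import numpy as np

# A sketch of the zoom-in detector, under the assumptions noted above.
def divergence_histogram(prev_gray, curr_gray, bins=10, lo=50.0, hi=150.0,
                         t_lo=75.0, t_hi=125.0):
    # Dense optical flow (Farneback, an assumed choice) gives per-pixel (u, v).
    flow = cv2.calcOpticalFlowFarneback(prev_gray, curr_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    u, v = flow[..., 0], flow[..., 1]
    # Divergence = du/dx + dv/dy; a zoom-in yields consistently positive values.
    div = np.gradient(u, axis=1) + np.gradient(v, axis=0)
    # Thresholding removes per-pixel noise before binning.
    valid = div[(div > t_lo) & (div < t_hi)]
    hist, _ = np.histogram(valid, bins=bins, range=(lo, hi))
    return hist.astype(np.float32)

# The per-sequence histograms are then fed to an SVM, e.g.:
#   from sklearn.svm import SVC
#   clf = SVC(kernel="linear").fit(train_histograms, train_labels)
```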
In our experiments, we trained this using 100 representative sequences. For 50 new non-training examples given to the system, it marked a Zoom In correctly in 47 of them, an accuracy of 94%.
3.2. Face Detection
We take faces as the basis for detecting human beings in video frames. We use a cascade of Haar filters as in the Viola-Jones algorithm [13], and train it with a large number of faces.
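For reference, a minimal sketch using OpenCV's implementation of the Viola-Jones cascade. The pre-trained frontal-face model bundled with OpenCV is a stand-in here; the paper trains its own cascade.

```python
import cv2

# Viola-Jones face detection via OpenCV's Haar cascade. The bundled
# pre-trained model is a stand-in for the cascade trained in the paper.
cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def has_face(frame_bgr):
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    # scaleFactor and minNeighbors are typical defaults, not paper values.
    faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    return len(faces) > 0
```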
3.3. Goal Post Detection
We use a model-based approach for goal post detection. The shape of the goal post is our model: essentially two near-vertical lines with a near-horizontal line between them, or a vertical line joined with a nearly horizontal line. (Often, perspective distortion does not affect the vertical posts too much, since cameras are usually positioned on the field such that the vertical distortion is minimal.) Morphological operations are applied on each frame to detect the goal post. We convert the coloured frame to a binary image with a high threshold value, so that the prominent white areas of the image are segmented. We then apply the closing operation (dilation followed by erosion), taking as structuring elements lines that satisfy our model.
Figure 3. Goal post detection results using morphological operations: (a), (b).
Fig. 3 shows two examples in which the goal post is detected in the frames after applying the morphological operations. The grey thresholding keeps only the whiter portions; we then apply the closing operation with line structuring elements, first vertical, then horizontal. In some cases the penalty box is also detected, but because of perspective distortion such cases are few.
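The sketch below shows this morphological pipeline with OpenCV. The binary threshold and the structuring-element lengths are assumed values; the paper does not report its exact parameters.

```python
import cv2

# Morphological goal-post detection sketch; the threshold and kernel
# sizes are assumptions, not the paper's reported parameters.
def goal_post_mask(frame_bgr, thresh=200):
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    # A high threshold segments the prominent white regions (the posts).
    _, binary = cv2.threshold(gray, thresh, 255, cv2.THRESH_BINARY)
    # Closing (dilation then erosion) with line-shaped structuring
    # elements matching the model: first vertical, then horizontal.
    vert = cv2.getStructuringElement(cv2.MORPH_RECT, (1, 25))
    horiz = cv2.getStructuringElement(cv2.MORPH_RECT, (25, 1))
    mask = cv2.morphologyEx(binary, cv2.MORPH_CLOSE, vert)
    mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, horiz)
    return mask
```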
We tested this algorithm on a data set of 120 randomly chosen frames; 57 out of 60 frames with a goal post were correctly classified, and 50 out of 60 frames without the goal post were correctly classified: accuracies of 95% and 83.33%, respectively.
3.4. Field View Features
We convert the RGB image to an HSV image. We calculate a histogram of the hue values of the frames of the video over the bin range 0 to 255, with each bin of unit size. These histogram values are fed to an SVM. In this case again, we use SVMs to train the classifier to obtain an optimal decision boundary.
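A sketch of the hue-histogram feature follows; the crowd detector of Sec. 3.5 uses the identical feature, only the training labels differ. Note that OpenCV stores 8-bit hue in [0, 179], so with the paper's 0-255 bin range the upper bins simply stay empty; the normalisation is our addition.

```python
import cv2
import numpy as np

# Hue-histogram feature for the field-view (and crowd) SVM classifiers.
def hue_histogram(frame_bgr):
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    hue = hsv[..., 0]
    # 256 unit-width bins, following the paper's stated bin range.
    hist, _ = np.histogram(hue, bins=256, range=(0, 256))
    return hist.astype(np.float32) / max(hist.sum(), 1)
```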
The training data set had 150 frames (75 positive examples and 75 negative examples). The testing data set had 132 frames, with no overlap with the training data set. Out of the 132 frames, our SVM-based classifier correctly classified the frame as a field view or a non-field view in 127 cases, giving an accuracy of 96.21%.
3.5. Crowd Detection
We convert the RGB image to an HSV image. The input to the SVM in this case is again a histogram of hue values for each frame of the video; here too, bins of unit length over the range 0 to 255 are used.
The training data set had 150 frames (75 frames with a crowd view, and 75 frames without a crowd). The testing data set had 100 frames (with nothing in common with the training set), out of which 89 were correctly classified. This gives an accuracy of 89%.
4. Hierarchical Classifier Design using CRFs
We propose a hierarchical approach because the different features extracted from the video sequence are interdependent; for example, the significance of a zoom-in depends on where exactly the zoom-in occurs, on the goal post or on the field. A CRF model efficiently exploits the probabilistic conditional dependencies between different features. Unlike an HMM, whose output depends only on the current state and not on the past, a CRF produces outputs based on the current as well as past states. Bayesian methods use directed dependencies; our model uses undirected dependencies between different features. The hierarchical CRF model is a discriminative model that gives priority to observations with more weight: it discounts nodes with less weight, so the output is an aggregation of the important observations.
4.1. Conditional Random Fields
Our goal is to develop a probabilistic temporal model that can extract high-level activities from a sequence of frames. Discriminative models such as conditional random fields (CRFs) have recently been shown to outperform generative techniques in areas such as natural language processing, web page classification, and computer vision. CRFs are undirected graphical models used for relational learning. CRFs directly represent the conditional distribution over hidden states given the observations, and are thus especially suitable for classification tasks with complex and overlapped observations. Similar to hidden Markov models (HMMs) and Markov random fields, the nodes in CRFs
represent a sequence of observations, denoted as $x = \langle x_1, x_2, \ldots, x_t \rangle$, and corresponding hidden states (e.g., mid-level features), denoted as $y = \langle y_1, y_2, \ldots, y_t \rangle$. These then define the conditional distribution $p(y|x)$ over the hidden states. The conditional distribution over the hidden states is written as
$$p(y|x) = \frac{1}{Z(x)} \prod_{c \in C} \phi_c(x_c, y_c)$$
where $Z(x) = \sum_y \prod_{c \in C} \phi_c(x_c, y_c)$ is the normalising partition function, and $C$ is a collection of subsets (cliques). Without loss of generality, the potentials $\phi_c(x_c, y_c)$ are described by log-linear combinations of feature functions $f_c(\cdot)$, i.e.,
$$\phi_c(x_c, y_c) = \exp\left(w_c^T f_c(x_c, y_c)\right)$$
where $w_c^T$ is the transpose of a weight vector $w_c$, and $f_c(x_c, y_c)$ is a function that extracts a vector of features from the variable values. The log-linear feature representation is very compact and guarantees non-negative potential values. We can write the conditional distribution as
$$p(y|x) = \frac{1}{Z(x)} \prod_{c \in C} \exp\{w_c^T f_c(x_c, y_c)\} = \frac{1}{Z(x)} \exp\left\{\sum_{c \in C} w_c^T f_c(x_c, y_c)\right\}$$
In this step, mid-level features are mined from the low-level features abstracted by the low-level feature modules of Step 1. We find temporal combinations of different low-level features to better identify the events in the frames of the video sequence. We use linear CRFs to classify the mid-level features.
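To make the distribution above concrete, here is a toy, brute-force evaluation of $p(y|x)$ for a short chain. The observation and transition weight matrices are invented for illustration; a real system would use a CRF library with learned weights and forward-backward inference rather than enumeration.

```python
import itertools
import numpy as np

# Toy brute-force evaluation of p(y|x) = exp(sum_c w_c^T f_c) / Z(x)
# for a short linear chain; all weights and features are illustrative.
def chain_score(x, y, w_obs, w_trans):
    # Node potentials (label-observation weights) plus transition potentials.
    s = sum(w_obs[y_t, x_t] for x_t, y_t in zip(x, y))
    s += sum(w_trans[y[t - 1], y[t]] for t in range(1, len(y)))
    return s

def conditional(x, w_obs, w_trans, n_labels=2):
    # Enumerate every label sequence y (feasible only for tiny chains).
    ys = list(itertools.product(range(n_labels), repeat=len(x)))
    scores = np.array([chain_score(x, y, w_obs, w_trans) for y in ys])
    p = np.exp(scores - scores.max())   # subtract max for numerical stability
    return dict(zip(ys, p / p.sum()))   # normalisation plays the role of Z(x)
```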
The mid-level features that are classified are: zoom in on the goal post, zoom in on the field, goal post with simultaneous crowd detection, goal post detection with field view, and goal post with simultaneous face detection. These mid-level features build a probabilistic temporal model that can extract high-level interesting events from the sequence of frames of the video, and they form the intermediate nodes of the hierarchical CRF model (see Fig. 2). A combination of field view with goal post and zoom in on the goal post gives a higher probability that the sequence of frames is a goal than a field view and goal post alone. This conditional probabilistic model is generated using conditional random fields.
4.2. CRF-based Parsing Engine
The CRF-based parsing engine parses out interesting and non-interesting sequences from the complete set of frames of the video. In our case an interesting event is a goal or a goal miss; non-interesting events include field play in general. We feed the mid-level features extracted in the step above to another set of linear CRFs that classify whether a sequence of frames in the video is interesting or not. The complete set of frames is thus classified into interesting and non-interesting events, each event comprising a sequence of frames, and each sequence is output as an interesting or non-interesting event. This can be used to create summaries of soccer videos containing all the interesting events of the game; it also has applications on mobile devices, for generating highlights for consumers.
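Schematically, the two-stage pipeline might look like the sketch below. The crf_mid and crf_event objects stand in for trained linear-chain CRFs exposing a predict() method; these names and the segment-grouping step are our assumptions, not the paper's code.

```python
# Schematic two-stage parsing pipeline; crf_mid and crf_event are
# assumed trained linear-chain CRF models with a predict() method.
def parse_video(frames, extract_low, crf_mid, crf_event):
    low = [extract_low(f) for f in frames]    # Sec. 3 low-level detectors
    mid = crf_mid.predict(low)                # mid-level feature labels
    events = crf_event.predict(mid)           # Interesting / Non-Interesting
    # Group consecutive frames sharing an event label into segments.
    segments, start = [], 0
    for i in range(1, len(events) + 1):
        if i == len(events) or events[i] != events[start]:
            segments.append((start, i - 1, events[start]))
            start = i
    return segments
```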
5. Video Parsing Results
We trained our system to detect Interesting Events (a goal or a goal miss) and Non-Interesting events. We trained our hierarchical system on a set of 30 videos, and kept a set of 170 other videos for testing. The first three examples explained below describe how a result is deduced for the type of event on the basis of low-level and mid-level features.
In Fig. 4, we show an example of a video correctly detected as an Interesting Event (in this case a goal). The figure shows some frames from the video, and the corresponding low-level features, mid-level features, and of course, the final decision, with the relations between the entities clearly outlined. The low-level features (zoom, goal post, crowd and face detection) are combined according to their respective dependencies to deduce mid-level features: zoom in on the goal post, crowd detection along with goal post detection, and goal post detection followed by face detection. These features result in the final decision of an interesting event (a goal).

Figure 4. Example 1, Sec. 5: an example of a video correctly detected as an Interesting Event (in this case a goal). The figure shows the corresponding parsing using the hierarchical CRF model.
Fig. 5 shows a different scenario corresponding to a successful classification of an interesting event. Fig. 6 shows an example of a successful detection of a Non-Interesting event. Again, the figure shows some frames from the video, and the corresponding low-level features, mid-level features, and the final decision, with the relations between the entities. In this example, the only low-level feature detected across the sequence of frames is the field view, so the deduced mid-level features give no indication of a feature that indicates a goal. Thus the result from the mid-level features is a non-interesting event.
The next two examples show the working of the parsing engine. The features of the frames of the video are extracted, and within the sequence of frames the portions corresponding to interesting and non-interesting events are parsed out. The boundaries of the parsed regions are based on the mid-level features, and the size or duration of an event varies according to the values of the features. The CRF-based parsing engine segments out the events on the basis of the mid-level features; the conditional dependency between mid-level features is exploited by the CRFs, which results in the parsing of interesting and non-interesting events.
Fig. 7 and Fig. 8 show frames of videos parsed into interesting and non-interesting events based on the mid-level features. In the first example the interesting event is parsed accurately; in the second case the first interesting event is parsed with one additional frame, and all remaining portions are parsed accurately by the CRF-based parsing engine. The numbers below the frames correspond to the frame numbers in the video sequence. Table 1 summarises the global results of the CRF-based methodology.
Figure 5. Example 2, Sec. 5: another example of a video correctly detected as an Interesting Event (in this case a goal). The figure shows the corresponding parsing using the hierarchical CRF model.
Table 1. Compiled results using the hierarchical CRF (details in Sec. 5).

Label                   True    Marked   Actual   Precision (%)
Interesting Event       8023    8100     8238     99.049
Non-Interesting Event   8017    8232     8094     97.388
Overall                 16040   16332    16332    98.212
In the columns of Table 1, the actual values give the actual number of frames of each type, the marked values give the number of frames the model labelled with that type, and the true values give the number of marked frames that are correct. Precision is the fraction of marked frames that are correct. In the case of non-interesting events, more frames are marked than actually occur because some frames are marked non-interesting instead of interesting, as explained in the parsing engine example above.
The next table, Table 2, gives experimental results comparing the overall hierarchical CRF-based model with a non-hierarchical model. It gives results for the 50 videos which were tested, showing the global results of the complete computation.
Table 2 clearly shows the former to outperform the latter by a large margin. The actual values are the total frames in the 50 videos, including interesting and non-interesting events. The marked values are the number of frames correctly parsed as interesting or non-interesting events. The accuracy gives the fraction of frames marked correctly.

References

P. Viola and M. J. Jones. Robust Real-Time Face Detection. International Journal of Computer Vision, 57(2):137-154, 2004.
C. Sutton and A. McCallum. An Introduction to Conditional Random Fields for Relational Learning. In L. Getoor and B. Taskar, editors, Introduction to Statistical Relational Learning. MIT Press.
B. T. Truong and S. Venkatesh. Video Abstraction: A Systematic Review and Classification. ACM Transactions on Multimedia Computing, Communications, and Applications, 3(1), 2007.
X. He, R. S. Zemel, and M. A. Carreira-Perpinan. Multiscale Conditional Random Fields for Image Labeling. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2004.
For their experiments with soccer videos, the authors choose the following low-level features, and outline their detection process in detail in the subsequent sections: Zoom In, Goal Post Detection, Crowd Detection, Field View Detection, and Face Detection.