
High-Level Feature Detection from Video in TRECVid: a 5-Year Retrospective of Achievements

Alan F. Smeaton (1), Paul Over (2) and Wessel Kraaij (3)

(1) CLARITY: Centre for Sensor Web Technologies, Dublin City University, Ireland. Alan.Smeaton@DCU.ie
(2) National Institute of Standards and Technology, USA. over@nist.gov
(3) TNO, The Netherlands. Wessel.Kraaij@tno.nl
Summary. Successful and effective content-based access to digital video requires fast, accurate and scalable methods to determine the video content automatically. A variety of contemporary approaches to this rely on text taken from speech within the video, or on matching one video frame against others using low-level characteristics like colour, texture, or shapes, or on determining and matching objects appearing within the video. Possibly the most important technique, however, is one which determines the presence or absence of a high-level or semantic feature within a video clip or shot. By utilizing dozens, hundreds or even thousands of such semantic features we can support many kinds of content-based video navigation. Critically, however, this depends on being able to determine whether each feature is or is not present in a video clip. The last 5 years have seen much progress in the development of techniques to determine the presence of semantic features within video. This progress can be tracked in the annual TRECVid benchmarking activity where dozens of research groups measure the effectiveness of their techniques on common data and using an open, metrics-based approach. In this chapter we summarise the work done on the TRECVid high-level feature task, showing the progress made year-on-year. This provides a fairly comprehensive statement on where the state-of-the-art is regarding this important task, not just for one research group or for one approach, but across the spectrum. We then use this past and on-going work as a basis for highlighting the trends that are emerging in this area, and the questions which remain to be addressed before we can achieve large-scale, fast and reliable high-level feature detection on video.
Published in A. Divakaran (ed.), Multimedia Content Analysis, Signals and Communication Technology, pages 151–174, DOI 10.1007/978-0-387-76569-3_6. © Springer Science+Business Media, LLC 2009.

Disclaimer: Certain commercial entities, equipment, or materials may be identified in this document in order to describe an experimental procedure or concept adequately. Such identification is not intended to imply recommendation or endorsement by the National Institute of Standards and Technology, nor is it intended to imply that the entities, materials, or equipment are necessarily the best available for the purpose.

1 Introduction
Searching for relevant video fragments in a large collection of video clips is a much harder task than searching textual collections. A user's information need is more easily represented as a textual description in natural language, using high-level concepts that relate directly to the user's ontology, which maps terminology to real-world objects and events. Raw video clips lack textual descriptions, although low-level signal processing techniques can describe them in terms of colour histograms, textures, etc. The mismatch between this low-level interpretation of video frames and the representation of an information need as expressed by a user is called the "semantic gap" [20].
To date, video archives have overcome the semantic gap and facilitated search through manual indexing of video productions, which is a very costly approach. The metadata produced this way often lacks descriptions at the shot level, making retrieval of relevant fragments a time-consuming effort. Even when relevant video productions have been found, they have to be watched completely in order to narrow the search down to the relevant shots.
A promising approach to making search in video archives more efficient and effective is to develop automatic indexing techniques that produce descriptions at a higher semantic level, better attuned to matching information needs. Such indexing techniques produce descriptions using a fixed vocabulary of so-called high-level features, also referred to as semantic concepts. Typical examples of high-level features are objects such as 'car', persons such as 'Madeleine Albright', scenes such as 'sky' or events like 'airplane takeoff'. These descriptors are called high-level features to distinguish them clearly from low-level features such as colour, texture and shape. Low-level features are used as inputs for the detection of high-level features. In turn (and this is the main reason why they are called features), the high-level features can be used as features by a higher-level interpretation module which combines different high-level features in a compositional fashion, e.g. 'car AND fire'.
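As a minimal illustration of such compositional use (the shot identifiers, concept names and confidence scores below are invented, and the combination rule is just one simple choice among several), per-shot detector confidences can be combined to score a conjunctive query:

```python
# Minimal sketch (hypothetical scores): combining per-shot concept detector
# outputs compositionally, e.g. for the conjunctive query 'car AND fire'.
# Each detector is assumed to return a probability-like confidence in [0, 1].

shot_scores = {
    "shot_1021": {"car": 0.91, "fire": 0.78, "sky": 0.12},
    "shot_1022": {"car": 0.40, "fire": 0.05, "sky": 0.88},
}

def conjunction(scores, concepts):
    """Score a conjunctive query by the minimum of the member confidences
    (a product, or a trained combiner, would be an equally reasonable choice)."""
    return min(scores.get(c, 0.0) for c in concepts)

# Rank shots by their combined score for 'car AND fire'.
ranked = sorted(shot_scores.items(),
                key=lambda kv: conjunction(kv[1], ["car", "fire"]),
                reverse=True)
for shot_id, scores in ranked:
    print(shot_id, round(conjunction(scores, ["car", "fire"]), 2))
```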
Semantic concept indexing has been one of the objects of study in the TRECVid benchmarking evaluation campaign. More background on TRECVid is presented in Sections 2 and 3 of this chapter. Section 4 then discusses the principal results and trends in the five iterations of the high-level feature detection task organized each year during the period 2002–2006.
High-level feature detectors are usually built by training a classifier (often a support vector machine) on labeled training data. However, developing detectors with high accuracy is challenging: the number of positive training examples is usually rather small, so the classifier has to deal with class imbalance; there is a large variation in example frames; and the human labeling contains errors. From a development point of view, it is a challenge to find scalable methods that exploit multiple layers of rich representations and to develop fusion configurations that are automatically optimized for individual concepts. If the accuracy of such a detector is sufficiently high, it can be of tremendous help for a search task, especially if relevant concepts exist for the particular search query. For example, the performance of the query "Find two visible tennis players" benefits from using the high-level feature "tennis game". Of course the size of the concept lexicon and the granularity of the ontology it represents are central to the applicability of concept indexing for search. Over the last few years, the lexicon size of state-of-the-art systems for content-based video access has grown from several tens to several hundreds of concepts, and there is evidence that high-level features indeed improve search effectiveness and thus help to bridge the semantic gap.
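As a rough sketch of how such a detector might be built (the colour-histogram feature, the scikit-learn calls and the parameter choices are our own illustrative assumptions, not a description of any particular TRECVid system), a class-weighted support vector machine can be trained on low-level keyframe features:

```python
# Illustrative sketch only: a class-weighted SVM trained on low-level keyframe
# features to detect one high-level feature (e.g. 'tennis game').
# Assumes scikit-learn and NumPy; keyframes are given as H x W x 3 RGB arrays.
import numpy as np
from sklearn.svm import SVC

def colour_histogram(frame, bins=8):
    """A simple low-level descriptor: a normalised per-channel colour histogram."""
    hist = [np.histogram(frame[..., c], bins=bins, range=(0, 255))[0]
            for c in range(3)]
    hist = np.concatenate(hist).astype(float)
    return hist / (hist.sum() + 1e-9)

def train_concept_detector(keyframes, labels):
    """Train a detector for one semantic concept from labelled keyframes.
    class_weight='balanced' compensates for the scarcity of positive examples;
    probability=True yields confidence scores usable for ranking and fusion."""
    X = np.stack([colour_histogram(f) for f in keyframes])
    y = np.asarray(labels)  # 1 = concept present, 0 = absent (labels may be noisy)
    clf = SVC(kernel="rbf", class_weight="balanced", probability=True)
    clf.fit(X, y)
    return clf

# Usage: detector.predict_proba(X_test)[:, 1] ranks test shots by the
# confidence that the concept is present; per-concept fusion of several such
# low-level representations would be layered on top of this.
```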

However, there are several open research problems linked to using automatic semantic concept annotation for video search. Experience from five years of benchmarking high-level feature detectors at TRECVid has raised several issues. We mention a few here:
• The choice of a proper lexicon depends on the video collection and the envisaged queries, and no automatic strategy exists to assist in constructing such a lexicon.
• The accuracy of a substantial number of concepts is too poor to be helpful.
• The stability of the accuracy of concept detectors when moving from one collection to another has not yet been established.
Section 5 will discuss these and other open issues in more detail and formulate an outlook on how to benchmark concept indexing techniques in the coming years.
2 Benchmarking Evaluation Campaigns, TREC, and TRECVid
The Text Retrieval Conference (TREC) initiative began in 1991 as a reaction to the small collection sizes used in experimental information retrieval (IR) at that time, and to the need for a more co-ordinated evaluation among researchers. TREC is run by the National Institute of Standards and Technology (NIST). It set out initially to benchmark the ad hoc search and retrieval operation on text documents, and over the intervening decade and a half it spawned over a dozen IR-related tasks including cross-language IR, filtering, IR from web data, interactive IR, high-accuracy IR, IR from blog data, novelty detection in IR, IR from video data, IR from enterprise data, IR from genomic data, from legal data, from spam data, question-answering and others. 2007 saw the 16th TREC evaluation, in which over a hundred research groups participated. One of the evaluation campaigns which started as a track within TREC but was spun off as an independent activity after 2 years is the video data track, known as TRECVid and the subject of this chapter.
The operation of TREC and all its tracks was established from the start and has followed the same basic formula:
• Acquire data and distribute it to participants;
• Formulate a set of search topics and release these to participants simultaneously and en bloc;
• Allow up to 4 weeks of query processing by participants and accept submissions of the top-1000 ranked documents per search topic from each participant;
• Pool submissions to eliminate duplicates and use manual assessors to make binary relevance judgments;
• Calculate Precision, Recall and other derived measures for submitted runs and distribute results;
• Host a workshop to compare results.
The approach in TREC has always been metrics-based, focusing on the evaluation of search performance, with measurement typically being some variant of Precision and Recall.
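As a concrete illustration (a simplified sketch of the standard measures, not the official trec_eval implementation), precision, recall and average precision for one topic can be computed from a ranked submission and the pooled binary relevance judgments:

```python
# Simplified illustration of TREC-style measures for a single topic.
# 'ranking' is one system's ranked list of document/shot IDs; 'relevant' is the
# set of IDs judged relevant by the assessors (unjudged items are treated as
# non-relevant, as is conventional in pooled evaluation).

def precision_recall_ap(ranking, relevant):
    hits = 0
    precisions_at_hits = []
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            precisions_at_hits.append(hits / rank)
    precision = hits / len(ranking) if ranking else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    # Average precision: mean of the precision values at each relevant item
    # retrieved, normalised by the total number of relevant items.
    ap = sum(precisions_at_hits) / len(relevant) if relevant else 0.0
    return precision, recall, ap

# Example with made-up IDs and judgments:
print(precision_recall_ap(["d3", "d7", "d1", "d9"], {"d3", "d1", "d5"}))
```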
Following the success of TREC and its many tracks, many similar evaluation campaigns have been launched in the information retrieval domain. In particular, in the video/image area there are evaluation campaigns for basic video/image analysis as well as for retrieval. In all cases these are not competitions with "winners" and "losers"; they are more correctly titled "evaluation campaigns" in which interested parties can benchmark their techniques against others, and they normally culminate in a workshop where results are presented and discussed.

TRECVid is one such evaluation campaign and we shall see details of it in Section 3, but first we shall look briefly at evaluations related to video processing.
ETISEO (Evaluation du Traitement et de l'Interprétation de Séquences Vidéo) [3] was an evaluation campaign that ran in 2005 and 2006. Its aim was to evaluate vision techniques for video surveillance applications, and it focussed on the treatment and interpretation of videos involving pedestrians and/or vehicles, indoors or outdoors, obtained from fixed cameras. The video data used was single- and multi-view surveillance of areas like airports, car parks, corridors and subways. The ground truth consisted of manual annotations and classifications of persons, vehicles and groups, and the tasks were detection, localization, classification and tracking of physical objects, and event recognition.
The PETS campaign (Performance Evaluation of Tracking & Surveillance) [6] is in its 10th year in 2007 and is funded by the European Union through the FP6 project ISCAPS (Integrated Surveillance of Crowded Areas for Public Security). PETS evaluates object detection and tracking for video surveillance, and its evaluation is also metrics-based. Data in PETS is multi-view/multi-camera surveillance video using up to 4 cameras, and the task is event detection for events such as luggage being left in public places.
The AMI (Augmented Multi-Party Interaction) project [2], funded by the European Union, targets computer-enhanced multi-modal interaction, including the analysis of video recordings taken from multiple cameras in the context of meetings. The project coordinates an evaluation campaign whose tasks include 2D multi-person tracking, head tracking, head pose estimation and estimation of the focus-of-attention (FoA) in meetings, classified as either a table, documents, a screen, or other people in the meeting. This is based on video analysis of the people in the meeting and the focus of their gaze.
ARGOS [9] is another evaluation campaign for video content analysis tools. The set of tasks under evaluation has a large overlap with the TRECVid tasks and includes shot boundary detection, camera motion detection, person identification, video OCR and story boundary detection. The corpus of video used by ARGOS includes broadcast TV news, scientific documentaries and surveillance video.
Although these evaluation campaigns in the video domain span multiple domains and genres as well as multiple applications, some of which are information retrieval, they have several things in common, including the following:
• they are all very metrics-based, with agreed evaluation procedures and data formats;
• they are all primarily system evaluations rather than user evaluations;
• they are all open in terms of participation and make their results, and in some cases also their data, available to others;
• they all have manual self-annotation of ground truth or centralized assessment of pooled results;
• they all coordinate large volunteer efforts, many with little sponsorship funding;
• they all have growing participation;
• they have all contributed to raising the profile of their application and of evaluation campaigns in general.
What we can conclude from the level of activity in evaluation campaigns such as the above,
and the TRECVid campaign which we will cover in the next section, is that they are established
within their research communities as the means to carry out comparative evaluations.

3 The TRECVid Benchmarking Evaluation Campaign
The TREC Video Retrieval Evaluations began on a small scale in 2001 as one of the many variations on standard text IR evaluations hatched within the larger TREC effort. The motivation was an interest in expanding the notion of "information" in IR beyond text, and the observation that it was difficult to compare research results in video retrieval because there was no common basis (data, tasks, or measures) for scientific comparison. TRECVid's two goals reflected the relatively young nature of the field at the time it started, namely the promotion of research and progress in video retrieval and in how to usefully benchmark performance. In both areas TRECVid has often opted to give participants freedom in the search for effective approaches, rather than imposing control aimed at finality of results. This is believed appropriate given the difficulty of the research problems addressed and the current maturity of systems.
TRECVid can be compared with more constrained evaluations using larger-scale testing, such as the Face Recognition Grand Challenge (FRGC) [1], and in the context of benchmarking evaluation campaigns it is interesting to compare those in IR and image/video processing mentioned above with such a "grand challenge". The FRGC is built on the conclusion that there exist "three main contenders for improvements in face recognition" and on the definition of 5 specific conjectures to be tested. FRGC shares with TRECVid an emphasis on large data sets, shared tasks (experiments) so that results are comparable, and shared input/output formats. But FRGC differs from TRECVid in that FRGC works with much more data and tests (complete ground truth is given by the process of capturing the data), more controlled data, a focus on a single task, only non-interactive systems, and evaluation only in terms of verification and false accept rates. This makes it quite different from TRECVid.
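For comparison with TRECVid's ranked-retrieval measures, the sketch below (invented scores, and our own simplified formulation rather than FRGC protocol code) shows how a verification rate and a false accept rate follow from similarity scores and a decision threshold:

```python
# Illustrative computation of verification rate (accepts among genuine,
# same-person comparisons) and false accept rate (accepts among impostor,
# different-person comparisons) at a fixed similarity threshold.

def verification_and_far(genuine_scores, impostor_scores, threshold):
    verif_rate = sum(s >= threshold for s in genuine_scores) / len(genuine_scores)
    far = sum(s >= threshold for s in impostor_scores) / len(impostor_scores)
    return verif_rate, far

genuine = [0.92, 0.81, 0.77, 0.64, 0.58]    # hypothetical same-person scores
impostor = [0.40, 0.35, 0.22, 0.61, 0.15]   # hypothetical different-person scores
print(verification_and_far(genuine, impostor, threshold=0.6))
```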
The annual TRECVid cycle begins more than a year before the target workshop, as NIST works with the sponsors to secure the video to be used and outlines associated tasks and measures. These are presented for discussion at the November workshop a year before they are to be used. They need to reflect the interests of the sponsors as well as of enough researchers to attract a critical mass of participants. With input from participants and sponsors, a set of guidelines is created and a call for participation is sent out by early February. The various sorts of data required are prepared for distribution in the spring and early summer. Researchers develop their systems, run them on the test data, and submit the output for manual and automatic evaluation at NIST starting in August. Results of the evaluations are returned to the participants in September and October. Participants then write up their work and discuss it at the workshop in mid-November: what worked, what didn't work, and why. The emphasis in this is on learning by exploring. Final analysis and description of the work is completed in the months following the workshop and often includes results of new or corrected experiments and of discussion at the workshop. Many of the workshop papers are starting points for peer-reviewed publications, with a noticeable effect on the scientific programme of multimedia conferences. Over the last few years, about 50 publications per year have reported the use of a TRECVid test collection.
The TRECVid tasks which have been evaluated are shot boundary detection, detection of concepts or high-level features within shots, automatic detection of story boundaries in broadcast TV news, three kinds of search (automatic, manual and interactive) and automatic video summarisation. In this chapter we gather together the work done and the contributions of the TRECVid high-level feature detection task since it started in 2002. We analyse its impact and we list what we believe to be the outstanding challenges and likely developments.

References (selected)
• Content-based image retrieval at the end of the early years (journal article). Discusses the working conditions of content-based retrieval: patterns of use, types of pictures, the role of semantics, and the sensory gap, as well as aspects of system engineering: databases, system architecture, and evaluation.
• The challenge problem for automated detection of 101 semantic concepts in multimedia (conference paper). Introduces the challenge problem for generic video indexing to gain insight into the intermediate steps that affect the performance of multimedia analysis methods, while fostering repeatability of experiments.
• Large-scale concept ontology for multimedia (journal article). The large-scale concept ontology for multimedia (LSCOM) is the first of its kind designed to simultaneously optimize utility to facilitate end-user access, cover a large semantic space, make automated extraction feasible, and increase observability in diverse broadcast news video data sets.
• Estimating average precision with incomplete and imperfect judgments (conference paper). Proposes three evaluation measures that approximate average precision even when relevance judgments are incomplete and that are more robust to incomplete or imperfect judgments than bpref, together with simple and accurate estimates of average precision.
• Semantic concept-based query expansion and re-ranking for multimedia retrieval (conference paper). Proposes several approaches for query expansion, in which textual keywords, visual examples, or initial retrieval results are analyzed to identify the most relevant visual concepts for a given query, using both lexical and statistical techniques.