
High-Level Feature Detection from Video in TRECVid: a 5-Year Retrospective of Achievements

Alan F. Smeaton (1), Paul Over (2) and Wessel Kraaij (3)

(1) CLARITY: Centre for Sensor Web Technologies, Dublin City University, Ireland. Alan.Smeaton@DCU.ie
(2) National Institute of Standards and Technology, USA. over@nist.gov
(3) TNO, The Netherlands. Wessel.Kraaij@tno.nl
Summary. Successful and effective content-based access to digital video requires fast, accurate and scalable methods to determine the video content automatically. A variety of contemporary approaches to this rely on text taken from speech within the video, or on matching one video frame against others using low-level characteristics like colour, texture, or shapes, or on determining and matching objects appearing within the video. Possibly the most important technique, however, is one which determines the presence or absence of a high-level or semantic feature within a video clip or shot. By utilizing dozens, hundreds or even thousands of such semantic features we can support many kinds of content-based video navigation. Critically, however, this depends on being able to determine whether each feature is or is not present in a video clip. The last 5 years have seen much progress in the development of techniques to determine the presence of semantic features within video. This progress can be tracked in the annual TRECVid benchmarking activity where dozens of research groups measure the effectiveness of their techniques on common data and using an open, metrics-based approach. In this chapter we summarise the work done on the TRECVid high-level feature task, showing the progress made year-on-year. This provides a fairly comprehensive statement on where the state-of-the-art is regarding this important task, not just for one research group or for one approach, but across the spectrum. We then use this past and on-going work as a basis for highlighting the trends that are emerging in this area, and the questions which remain to be addressed before we can achieve large-scale, fast and reliable high-level feature detection on video.
Published in A. Divakaran (ed.), Multimedia Content Analysis, Signals and Communication Technology, pages 151–174, DOI 10.1007/978-0-387-76569-3_6. © Springer Science+Business Media, LLC 2009.

Disclaimer: Certain commercial entities, equipment, or materials may be identified in this document in order to describe an experimental procedure or concept adequately. Such identification is not intended to imply recommendation or endorsement by the National Institute of Standards and Technology, nor is it intended to imply that the entities, materials, or equipment are necessarily the best available for the purpose.

1 Introduction
Searching for relevant video fragments in a large collection of video clips is a much harder task than searching textual collections. A user's information need is more easily represented as a textual description in natural language, using high-level concepts that relate directly to the user's ontology, which maps terminology to real-world objects and events. Raw video clips lack textual descriptions, although low-level signal processing techniques can describe them in terms of colour histograms, textures, etc. The mismatch between this low-level interpretation of video frames and the representation of an information need as expressed by a user is called the "semantic gap" [20].
To date, video archives have overcome the semantic gap and facilitated search through manual indexing of video productions, which is a very costly approach. The metadata produced this way often lacks descriptions at the shot level, making retrieval of relevant fragments a time-consuming effort. Even when relevant video productions have been found, they have to be watched completely in order to narrow the search down to the relevant shots.
A promising approach to making search in video archives more efficient and effective is to develop automatic indexing techniques that produce descriptions at a higher semantic level, better attuned to matching information needs. Such indexing techniques produce descriptions using a fixed vocabulary of so-called high-level features, also referred to as semantic concepts. Typical examples of high-level features are objects such as 'car', persons such as 'Madeleine Albright', scenes such as 'sky' or events like 'airplane takeoff'. These descriptors are called high-level features to distinguish them clearly from low-level features such as colour, texture and shape. Low-level features are used as inputs for the detection of high-level features. In turn (and this is the main reason why they are called features), the high-level features can be used as features by a higher-level interpretation module which combines different high-level features in a compositional fashion, e.g. 'car AND fire'.
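As a minimal illustration of such compositional use (the shot identifiers, concept names and confidence scores below are invented, and the combination rule is just one simple choice among several), per-shot detector confidences can be combined to score a conjunctive query:

```python
# Minimal sketch (hypothetical scores): combining per-shot concept detector
# outputs compositionally, e.g. for the conjunctive query 'car AND fire'.
# Each detector is assumed to return a probability-like confidence in [0, 1].

shot_scores = {
    "shot_1021": {"car": 0.91, "fire": 0.78, "sky": 0.12},
    "shot_1022": {"car": 0.40, "fire": 0.05, "sky": 0.88},
}

def conjunction(scores, concepts):
    """Score a conjunctive query by the minimum of the member confidences
    (a product, or a trained combiner, would be an equally reasonable choice)."""
    return min(scores.get(c, 0.0) for c in concepts)

# Rank shots by their combined score for 'car AND fire'.
ranked = sorted(shot_scores.items(),
                key=lambda kv: conjunction(kv[1], ["car", "fire"]),
                reverse=True)
for shot_id, scores in ranked:
    print(shot_id, round(conjunction(scores, ["car", "fire"]), 2))
```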
Semantic concept indexing has been one of the objects of study in the TRECVid benchmarking evaluation campaign. More background on TRECVid is presented in Sections 2 and 3 of this chapter. Section 4 then discusses the principal results and trends in the five iterations of the high-level feature detection task organized each year during the period 2002–2006.
High-level feature detectors are usually built by training a classifier (often a support vector machine) on labeled training data. However, developing detectors with high accuracy is challenging: the number of positive training examples is usually rather small, so the classifier has to deal with class imbalance; there is a large variation in example frames; and the human labeling contains errors. From a development point of view, it is a challenge to find scalable methods that exploit multiple layers of rich representations and to develop fusion configurations that are automatically optimized for individual concepts. If the accuracy of such a detector is sufficiently high, it can be of tremendous help for a search task, especially if relevant concepts exist for the particular search query. For example, the performance of the query "Find two visible tennis players" benefits from using the high-level feature "tennis game". Of course the size of the concept lexicon and the granularity of the ontology it represents are central to the applicability of concept indexing for search. Over the last few years, the lexicon size of state-of-the-art systems for content-based video access has grown from several tens to several hundreds of concepts, and there is evidence that high-level features indeed improve search effectiveness and thus help to bridge the semantic gap.
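As a rough sketch of how such a detector might be built (the colour-histogram feature, the scikit-learn calls and the parameter choices are our own illustrative assumptions, not a description of any particular TRECVid system), a class-weighted support vector machine can be trained on low-level keyframe features:

```python
# Illustrative sketch only: a class-weighted SVM trained on low-level keyframe
# features to detect one high-level feature (e.g. 'tennis game').
# Assumes scikit-learn and NumPy; keyframes are given as H x W x 3 RGB arrays.
import numpy as np
from sklearn.svm import SVC

def colour_histogram(frame, bins=8):
    """A simple low-level descriptor: a normalised per-channel colour histogram."""
    hist = [np.histogram(frame[..., c], bins=bins, range=(0, 255))[0]
            for c in range(3)]
    hist = np.concatenate(hist).astype(float)
    return hist / (hist.sum() + 1e-9)

def train_concept_detector(keyframes, labels):
    """Train a detector for one semantic concept from labelled keyframes.
    class_weight='balanced' compensates for the scarcity of positive examples;
    probability=True yields confidence scores usable for ranking and fusion."""
    X = np.stack([colour_histogram(f) for f in keyframes])
    y = np.asarray(labels)  # 1 = concept present, 0 = absent (labels may be noisy)
    clf = SVC(kernel="rbf", class_weight="balanced", probability=True)
    clf.fit(X, y)
    return clf

# Usage: detector.predict_proba(X_test)[:, 1] ranks test shots by the
# confidence that the concept is present; per-concept fusion of several such
# low-level representations would be layered on top of this.
```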

However, there are several open research problems linked to using automatic semantic concept annotation for video search. Experience from five years of benchmarking high-level feature detectors at TRECVid has raised several issues. We mention a few here:
• The choice of a proper lexicon depends on the video collection and the envisaged queries, and no automatic strategy exists to assist in constructing such a lexicon.
• The accuracy of a substantial number of concepts is too poor to be helpful.
• The stability of the accuracy of concept detectors when moving from one collection to another has not yet been established.
Section 5 will discuss these and other open issues in more detail and formulate an outlook on how to benchmark concept indexing techniques in the coming years.
2 Benchmarking Evaluation Campaigns, TREC, and TRECVid
The Text Retrieval Conference (TREC) initiative began in 1991 as a reaction to the small collection sizes used in experimental information retrieval (IR) at that time, and to the need for a more co-ordinated evaluation among researchers. TREC is run by the National Institute of Standards and Technology (NIST). It set out initially to benchmark the ad hoc search and retrieval operation on text documents, and over the intervening decade and a half it spawned over a dozen IR-related tasks including cross-language IR, filtering, IR from web data, interactive IR, high-accuracy IR, IR from blog data, novelty detection in IR, IR from video data, IR from enterprise data, IR from genomic data, from legal data, from spam data, question-answering and others. 2007 saw the 16th TREC evaluation, in which over a hundred research groups participated. One of the evaluation campaigns which started as a track within TREC but was spun off as an independent activity after 2 years is the video data track, known as TRECVid and the subject of this chapter.
The operation of TREC and all its tracks was established from the start and has followed the same basic formula:
• Acquire data and distribute it to participants;
• Formulate a set of search topics and release these to participants simultaneously and en bloc;
• Allow up to 4 weeks of query processing by participants and accept submissions of the top-1000 ranked documents per search topic from each participant;
• Pool submissions to eliminate duplicates and use manual assessors to make binary relevance judgments;
• Calculate Precision, Recall and other derived measures for submitted runs and distribute results;
• Host a workshop to compare results.
The approach in TREC has always been metrics-based, focusing on the evaluation of search performance, with measurement typically being some variant of Precision and Recall.
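As a concrete illustration (a simplified sketch of the standard measures, not the official trec_eval implementation), precision, recall and average precision for one topic can be computed from a ranked submission and the pooled binary relevance judgments:

```python
# Simplified illustration of TREC-style measures for a single topic.
# 'ranking' is one system's ranked list of document/shot IDs; 'relevant' is the
# set of IDs judged relevant by the assessors (unjudged items are treated as
# non-relevant, as is conventional in pooled evaluation).

def precision_recall_ap(ranking, relevant):
    hits = 0
    precisions_at_hits = []
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            precisions_at_hits.append(hits / rank)
    precision = hits / len(ranking) if ranking else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    # Average precision: mean of the precision values at each relevant item
    # retrieved, normalised by the total number of relevant items.
    ap = sum(precisions_at_hits) / len(relevant) if relevant else 0.0
    return precision, recall, ap

# Example with made-up IDs and judgments:
print(precision_recall_ap(["d3", "d7", "d1", "d9"], {"d3", "d1", "d5"}))
```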
Following the success of TREC and its many tracks, many similar evaluation campaigns have been launched in the information retrieval domain. In particular, in the video/image area there are evaluation campaigns for basic video/image analysis as well as for retrieval. In all cases these are not competitions with "winners" and "losers"; they are more correctly titled "evaluation campaigns" in which interested parties can benchmark their techniques against others, and they normally culminate in a workshop where results are presented and discussed.

TRECVid is one such evaluation campaign and we shall see details of it in Section 3, but first we shall look briefly at evaluations related to video processing.
ETISEO (Evaluation du Traitement et de l'Interprétation de Séquences Vidéo) [3] was an evaluation campaign that ran in 2005 and 2006. Its aim was to evaluate vision techniques for video surveillance applications, and it focussed on the treatment and interpretation of videos involving pedestrians and/or vehicles, indoors or outdoors, obtained from fixed cameras. The video data used was single- and multi-view surveillance of areas like airports, car parks, corridors and subways. The ground truth consisted of manual annotations and classifications of persons, vehicles and groups, and the tasks were detection, localization, classification and tracking of physical objects, and event recognition.
The PETS campaign (Performance Evaluation of Tracking & Surveillance) [6] is in its 10th year in 2007 and is funded by the European Union through the FP6 project ISCAPS (Integrated Surveillance of Crowded Areas for Public Security). PETS evaluates object detection and tracking for video surveillance, and its evaluation is also metrics-based. Data in PETS is multi-view/multi-camera surveillance video using up to 4 cameras, and the task is event detection for events such as luggage being left in public places.
The AMI (Augmented Multi-Party Interaction) project [2], funded by the European Union, targets computer-enhanced multi-modal interaction, including the analysis of video recordings taken from multiple cameras in the context of meetings. The project coordinates an evaluation campaign whose tasks include 2D multi-person tracking, head tracking, head pose estimation and estimation of the focus-of-attention (FoA) in meetings, classified as either a table, documents, a screen, or other people in the meeting. This is based on video analysis of the people in the meeting and the focus of their gaze.
ARGOS [9] is another evaluation campaign for video content analysis tools. The set of tasks under evaluation has a large overlap with the TRECVid tasks and includes shot boundary detection, camera motion detection, person identification, video OCR and story boundary detection. The corpus of video used by ARGOS includes broadcast TV news, scientific documentaries and surveillance video.
Although these evaluation campaigns in the video domain span multiple domains and genres as well as multiple applications, some of which are information retrieval, they have several things in common, including the following:
• they are all very metrics-based, with agreed evaluation procedures and data formats;
• they are all primarily system evaluations rather than user evaluations;
• they are all open in terms of participation and make their results, and in some cases also their data, available to others;
• they all have manual self-annotation of ground truth or centralized assessment of pooled results;
• they all coordinate large volunteer efforts, many with little sponsorship funding;
• they all have growing participation;
• they have all contributed to raising the profile of their application and of evaluation campaigns in general.
What we can conclude from the level of activity in evaluation campaigns such as the above,
and the TRECVid campaign which we will cover in the next section, is that they are established
within their research communities as the means to carry out comparative evaluations.

3 The TRECVid Benchmarking Evaluation Campaign
The TREC Video Retrieval Evaluations began on a small scale in 2001 as one of the many variations on standard text IR evaluations hatched within the larger TREC effort. The motivation was an interest in expanding the notion of "information" in IR beyond text, and the observation that it was difficult to compare research results in video retrieval because there was no common basis (data, tasks, or measures) for scientific comparison. TRECVid's two goals reflected the relatively young nature of the field at the time it started, namely the promotion of research and progress in video retrieval and in how to usefully benchmark performance. In both areas TRECVid has often opted to give participants freedom in the search for effective approaches, rather than imposing control aimed at finality of results. This is believed appropriate given the difficulty of the research problems addressed and the current maturity of systems.
TRECVid can be compared with more constrained evaluations using larger-scale testing, such as the Face Recognition Grand Challenge (FRGC) [1], and in the context of benchmarking evaluation campaigns it is interesting to compare those in IR and image/video processing mentioned above with such a "grand challenge". The FRGC is built on the conclusion that there exist "three main contenders for improvements in face recognition" and on the definition of 5 specific conjectures to be tested. FRGC shares with TRECVid an emphasis on large data sets, shared tasks (experiments) so that results are comparable, and shared input/output formats. But FRGC differs from TRECVid in that FRGC works with much more data and tests (complete ground truth is given by the process of capturing the data), more controlled data, a focus on a single task, only non-interactive systems, and evaluation only in terms of verification and false accept rates. This makes it quite different from TRECVid.
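For comparison with TRECVid's ranked-retrieval measures, the sketch below (invented scores, and our own simplified formulation rather than FRGC protocol code) shows how a verification rate and a false accept rate follow from similarity scores and a decision threshold:

```python
# Illustrative computation of verification rate (accepts among genuine,
# same-person comparisons) and false accept rate (accepts among impostor,
# different-person comparisons) at a fixed similarity threshold.

def verification_and_far(genuine_scores, impostor_scores, threshold):
    verif_rate = sum(s >= threshold for s in genuine_scores) / len(genuine_scores)
    far = sum(s >= threshold for s in impostor_scores) / len(impostor_scores)
    return verif_rate, far

genuine = [0.92, 0.81, 0.77, 0.64, 0.58]    # hypothetical same-person scores
impostor = [0.40, 0.35, 0.22, 0.61, 0.15]   # hypothetical different-person scores
print(verification_and_far(genuine, impostor, threshold=0.6))
```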
The annual TRECVid cycle begins more than a year before the target workshop, as NIST works with the sponsors to secure the video to be used and outlines associated tasks and measures. These are presented for discussion at the November workshop a year before they are to be used. They need to reflect the interests of the sponsors as well as of enough researchers to attract a critical mass of participants. With input from participants and sponsors, a set of guidelines is created and a call for participation is sent out by early February. The various sorts of data required are prepared for distribution in the spring and early summer. Researchers develop their systems, run them on the test data, and submit the output for manual and automatic evaluation at NIST starting in August. Results of the evaluations are returned to the participants in September and October. Participants then write up their work and discuss it at the workshop in mid-November: what worked, what didn't work, and why. The emphasis in this is on learning by exploring. Final analysis and description of the work is completed in the months following the workshop and often includes results of new or corrected experiments and of discussion at the workshop. Many of the workshop papers are starting points for peer-reviewed publications, with a noticeable effect on the scientific programme of multimedia conferences. Over the last few years, about 50 publications per year have reported the use of a TRECVid test collection.
The TRECVid tasks which have been evaluated are shot boundary detection, detection of concepts or high-level features within shots, automatic detection of story boundaries in broadcast TV news, three kinds of search (automatic, manual and interactive) and automatic video summarisation. In this chapter we gather together the work done and the contributions of the TRECVid high-level feature detection task since it started in 2002. We analyse its impact and we list what we believe to be the outstanding challenges and likely developments.

References (selected)
• Content-based image retrieval at the end of the early years (journal article). Discusses the working conditions of content-based retrieval: patterns of use, types of pictures, the role of semantics, and the sensory gap, as well as aspects of system engineering: databases, system architecture, and evaluation.
• The challenge problem for automated detection of 101 semantic concepts in multimedia (conference paper). Introduces the challenge problem for generic video indexing to gain insight into the intermediate steps that affect the performance of multimedia analysis methods, while fostering repeatability of experiments.
• Large-scale concept ontology for multimedia (journal article). The large-scale concept ontology for multimedia (LSCOM) is the first of its kind designed to simultaneously optimize utility to facilitate end-user access, cover a large semantic space, make automated extraction feasible, and increase observability in diverse broadcast news video data sets.
• Estimating average precision with incomplete and imperfect judgments (conference paper). Proposes three evaluation measures that approximate average precision even when relevance judgments are incomplete and that are more robust to incomplete or imperfect judgments than bpref, together with simple and accurate estimates of average precision.
• Semantic concept-based query expansion and re-ranking for multimedia retrieval (conference paper). Proposes several approaches for query expansion, in which textual keywords, visual examples, or initial retrieval results are analyzed to identify the most relevant visual concepts for a given query, using both lexical and statistical techniques.