A survey of semantic image and video annotation tools

doi:10.1007/978-3-642-20795-2_8

Book Chapter•DOI•

A survey of semantic image and video annotation tools

Stamatia Dasiopoulou, Eirini Giannakidou, Georgios Litos, Polyxeni Malasioti, Yiannis Kompatsiaris - Show less +1 more

01 Jan 2011-pp 196-239

TL;DR: This chapter presents an overview of the state of the art in image and video annotation tools to provide a common framework of reference and to highlight open issues, especially with respect to the coverage and the interoperability of the produced metadata.

read less

Abstract: The availability of semantically annotated image and video assets constitutes a critical prerequisite for the realisation of intelligent knowledge management services pertaining to realistic user needs. Given the extend of the challenges involved in the automatic extraction of such descriptions, manually created metadata play a significant role, further strengthened by their deployment in training and evaluation tasks related to the automatic extraction of content descriptions. The different views taken by the two main approaches towards semantic content description, namely the Semantic Web and MPEG-7, as well as the traits particular to multimedia content due to the multiplicity of information levels involved, have resulted in a variety of image and video annotation tools, adopting varying description aspects. Aiming to provide a common framework of reference and furthermore to highlight open issues, especially with respect to the coverage and the interoperability of the produced metadata, in this chapter we present an overview of the state of the art in image and video annotation tools.

...read moreread less

Summary (6 min read)

Jump to: [1 Introduction] – [2 Semantic Image and Video Annotation] – [2.1 Input & Output] – [2.2 Annotation Level] – [2.3 Miscellaneous] – [3 Tools for Semantic Image Annotation] – [3.1 KAT] – [3.2 PhotoStuff] – [3.3 AktiveMedia] – [3.5 Caliph] – [3.6 SWAD] – [3.7 LabelMe] – [3.8 Application-specific Image Annotation Tools] – [3.9 Discussion] – [4 Tools for Semantic Video Annotation] – [4.1 VIA] – [4.2 VideoAnnEx] – [4.6 Anvil] – [4.7 Semantic Video Annotation Suite] – [4.8 Application-specific Video Annotation Tools] – [4.9 Discussion] and [5 Conclusions]

1 Introduction

Accessing multimedia content in correspondence with the meaning pertained to a user, constitutes the core challenge in multimedia research, commonly referred to as the semantic gap [1].
This significance is further strengthened by the need for manually constructed descriptions in automatic content analysis both for evaluation as well as for training purposes, when learning based on preannotated examples is used.
Fundamental to information sharing, exchange and reuse, is the interoperability of the descriptions at both syntactic and semantic levels, i.e. regarding the valid structuring of the descriptions and the endowed meaning respectively.
The strong relation of structural and low-level feature information to the tasks involved in the automatic analysis of visual content, as well as to retrieval services, such as transcoding, content-based search, etc., brings these two dimensions to the foreground, along with the subject matter descriptions.
A number of so called multimedia ontologies [9–13] issued in an attempt to add formal semantics to MPEG-7 descriptions and thereby enable linking with existing ontologies and the semantic management of existing MPEG-7 metadata repositories.

2 Semantic Image and Video Annotation

Image and video assets constitute extremely rich information sources, ubiquitous in a wide variety of diverse applications and tasks related to information management, both for personal and professional purposes.
Inevitably, the value of the endowed information amounts to the effectiveness and efficiency at which it can be accessed and managed.
The former encompasses the capacity to share and reuse annotations, and by consequence determines the level of seamless content utilisation and the benefits issued from the annotations made available; the latter is vital to the realisation of intelligent content management services.
Towards their accomplishment, the existence of commonly agreed vocabularies and syntax, and respectively of commonly agreed semantics and interpretation mechanisms, are essential elements.
The aforementioned considerations intertwine, establishing a number of dimensions and corresponding criteria along which image and video annotation can be characterised.

2.1 Input & Output

This category includes criteria regarding the way the tool interacts in terms of requested / supported input and the output produced.
The authors note that annotation vocabularies may refer not only to subject matter descriptions, but as well to media and structural descriptions.
As will be shown in the sequel though, where the individual tools are described, there is not necessarily a strict correspondence (e.g. a tool may use an RDFS6 or OWL7 ontology as the subject matter vocabulary, and yet output annotations in RDF8).
The format is equally significant to the annotation vocabulary as with respect to the annotations interoperability and sharing.
Refers to the supported image/video formats, e.g. jpg, png, mpeg, etc.

2.2 Annotation Level

This category addresses attributes of the annotations per se.
Such retrieval may address concept-based queries or queries involving relations between concepts, entailing respective annotation specifications.
To capture the aforementioned considerations, the following criteria have been used.
For video assets, annotation may refer to the entire video, temporal segments , frames (temporal segments with zero duration), regions within frames, or even to moving regions, i.e. a region followed for a sequence of frames.
Refers to the level of expressivity supported with respect to the annotation vocabulary.

2.3 Miscellaneous

This category summarises additional criteria that do not fall under the previous dimensions.
The considered aspects relate mostly to attributes of the tool itself rather than of the annotation process.
As such, and given the scope of this chapter, in the description of the individual tools that follows in the two subsequent Sections, these criteria are treated very briefly.
Specifies whether the tool constitutes a web-based or a stand-alone application, also known as – Application Type.
– Licence: Specifies the kind of licence condition under which the tool operates, e.g. open source, etc. – Collaboration: Specifies whether the tool supports concurrent annotations (referring to the same media object) by multiple users or not.

3 Tools for Semantic Image Annotation

In this Section the authors describe prominent semantic image annotation tools with respect to the dimensions and criteria outlined in Section 2.
As will be illustrated in the following, Semantic Web technologies have permeated to a considerable degree the representation of metadata, with the majority of tools supporting ontology-based subject matter descriptions, while a considerable share of them adopts ontological representation for structural annotations as well.
In order to provide a relative ranking with respect to SW compatibility, the authors order the tools according to the extend to which the produced annotations bear formal semantics.

3.1 KAT

The K-Space Annotation Tool9 (KAT), developed within the K-Space10 project, implements an ontology-based framework for the semantic annotation of images.
COMM extends the Descriptions & Situations (D&S) and Ontology of Information Objects (OIO) design patterns of DOLCE [17, 18], while incorporating re-engineered definitions of MPEG-7 description tools[19, 20].
The latter are strictly concept based, i.e. considering the aforementioned annotation example it is not possible to annotate the pole as being next to the pole vaulter, and may refer to the entire image or to specific regions of it.
The localisation of image regions is performed manually, using either of the rectangle and polygon drawing tools.
Furthermore, the COMM based annotation scheme renders quite straightforward the extension of the annotation dimensions supported by KAT.

3.2 PhotoStuff

PhotoStuff11, developed by the Mindswap group12, is an ontology-based image annotation tool that supports the generation of semantic image descriptions with respect to the employed ontologies.
PhotoStuff [21] addresses primarily two types of metadata, namely descriptive and structural.
Regarding descriptive annotations, the user may load one or multiple domain-specific ontologies from the web or from the local hard drive, while with respect to structural annotations, two internal, hidden to the user, ontologies are used: the Digital-Media13 ontology and the Technical14 one.
Nor the representation neither the extraction of such descriptors is addressed.
Notably, annotations may refer not only to concept instantiations, but also to relations between concept instances already identified in an image.

3.3 AktiveMedia

AktiveMedia20, developed within AKT21 and X-Media22 projects, is an ontologybased cross-media annotation system addressing text and image assets.
In image annotation mode, AktiveMedia supports descriptive metadata with respect to user selected ontologies, stored in the local hard drive [22].
Annotations can refer to image or region level.
Contrary to Photostuff which uses 17 http://dublincore.org/documents/dces/.
As such, the semantics of generated RDF metadata, i.e. the annotation semantics as it entails from the respective ontology definitions, are not direct but require additional processing to retrieve and to reason over.

3.5 Caliph

Caliph27 is an MPEG-7 based image annotation tool that supports all types of MPEG-7 metadata among which descriptive, structural, authoring and low-level visual descriptor annotations.
In combination with Emir, they support contentbased retrieval of images using MPEG-7 descriptions.
Figure 6 illustrates two screenshots corresponding to the generic image information and the semantic annotation tabs.
The descriptions may be either in the form of free text or structured, in accordance to the SemanticBase description tools provided by MPEG-7 (i.e. Agents, Events, Time, Place and Object annotations [26]).
The so called semantic tab allows for the latter, offering a graph based interface.

3.6 SWAD

SWAD28 is an RDF-based image annotation tool that was developed within the SWAD-Europe project29.
The latter ran from May 2002 to October 2004 and aimed to support the Semantic Web initiative in Europe through targeted research, demonstrations and outreach activities.
The authors chose to provide a very brief description here for the purpose of illustrating image annotation in the Semantic Web as envisaged and realised by that time, as a reference and comparison point for the various image annotation tools that have been developed afterwards.
Licensing information as described in the respective SWAD deliverable30.
When entering a keyword description, the respective Wordnet31 hierarchy is shown to the user, assisting her in determining the appropriateness of the keyword and in selecting descriptions of further accuracy.

3.7 LabelMe

LabelMe33 is a database and web-based image annotation tool, aiming to contribute in the creation of large annotated image databases for evaluation and training purposes [28].
It contains all images from the MIT CSAIL34 database, in addition to a large number of user uploaded images.
LabelMe [28] supports descriptive metadata addressing in principle regionbased annotation.
Specifically, the user defines a polygon enclosing the annotated object through a set of control points.
Its focus on requirements related to object recognition research, rather than image search and retrieval, entails different notions regarding the utilisation, sharing and purpose of annotation.

3.8 Application-specific Image Annotation Tools

Apart from the afore described semantic image annotation tools, a variety of application-specific tools are available.
Some of them relate to Web 2.0 applications addressing tagging and sharing of content among social groups, while others focus on particular application domains, such as medical imaging, that impose additional specifications pertaining to the individual application context.
Utilising radiology specific ontologies, iPad enhances the annotation procedure by suggesting more specific terms and by identifying incomplete descriptions and subsequently prompting for missing parts in the description (e.g. “enlarged” is flagged as incomplete while “enlarged liver” is acceptable).
The produced descriptions are in RDF/XML following a proprietary schema39 that models the label constituting the tag, its position (the label constitutes a rectangle region in itself), and the position of the rectangle that encloses the annotated region in the form of the top left point coordinates and width and height information.
Furthermore, general information about the image is included such as image size, number of regions annotated, etc. Oriented towards Web 2.0, FotoTagger places significant focus on social aspects pertaining to content management, allowing among others to publish tagged images to 35 http://www.rsna.org/Technology/DICOM/ blogs and to upload/download tagged images to/from Flickr, while maintaining both FotoTagger’s and Flickr’s descriptions.

3.9 Discussion

The aforementioned overview reveals that the utilisation of Semantic Web languages for the representation, interchange and processing of image metadata has permeated semantic image annotation.
The choice of a standard representation shows the importance placed on creating content descriptions that can be easily exchanged and reused across heterogenous applications, and works like [10, 11, 30] provide bridges between MPEG-7 metadata and the Semantic Web and existing ontologies.
Thus unlike subject matter descriptions, where a user can choose which vocabulary to use (in the form of a domain ontology, a lexicon or user provided keywords), structural descriptions are tool specific.
Summing up, the choice of a tool depends primarily on the intended context of usage, which provides the specifications regarding the annotation dimensions supported, and subsequently on the desired formality of annotations, again related to a large extend to the application context.
Thus for semantic retrieval purposes, where semantic refers to the SW perspective, KAT, PhotoStuff, SWAD T o o l and AkiveMedia would be the more appropriate choices.

4 Tools for Semantic Video Annotation

The increase in the amount of video data deployed and used in today’s applications not only caused video to draw increased attention as a content type, but also introduced new challenges in terms of effective content management.
In the following the authors survey typical video annotation tools, highlighting their features with respect to the criteria delineated in Section 2.
In the latter category fall tools such as VIDETO41, Ricoh Movie Tool42, or LogCreator43.
It is interesting to note that the majority of these tools followed MPEG-7 for the representation of annotations.
As described in the sequel, this favourable disposition is still evident, differentiating video annotation tools from image ones, where the Semantic Web technologies have been more pervasive.

4.1 VIA

The Video and Image Annotation44 (VIA) tool has been developed by the MKLab45 within the BOEMIE46 project.
The shot records a pole vaulter holding a pole and sprinting at the jump point.
VIA supports descriptive, structural and media metadata of image and video assets.
Descriptive annotation is performed with respect to a user loaded OWL ontology, while free text descriptions can also be added.
The first one is concerned with region annotation, in which the user selects rectangular areas of the video content and subsequently adds corresponding annotations.

4.2 VideoAnnEx

The IBM VideoAnnEx47 annotation tool addresses video annotation with MPEG7 metadata.
VideoAnnex supports descriptive, structural and administrative annotations according to the respective MPEG-7 Description Schemes.
The tool supports default subject matter lexicons in XML format, and additionally allows the user to create and load her own XML lexicon, design a concept hierarchy through the interface menu commands, or insert free text descriptions.
As illustrated in Figure 10, the VideoAnnEx annotation interface consists of four components.
On the bottom part of the tool, two views are available of the annotation preview: one contains the I-frames of a shot and the keyframes of each shot in the video, respectively.

4.6 Anvil

Anvil69 is a tool that supports audiovisual content annotation, but which was primarily designed for linguistic purposes, in the same vein as the previously described tool.
User-defined XML schema specification files provide the definition of the vocabulary used in the annotation procedure.
Its interface consists of the media player window, the annotation board and the metadata window.
As in most described tools, also in Anvil, the user has to manually define the temporal segments that wants to annotate.
Anvil can import data from the phonetic tools PRAAT72 and XWaves which perform speech transcriptions.

4.7 Semantic Video Annotation Suite

The Semantic Video Annotation Suite75 (SVAS), developed by Joanneum research Institute of Information Systems & Information Management76, targets the creation of MPEG-7 video annotations.
SVAS [36] encompasses two tools: the Media Analyzer, which extracts automatically structural information regarding shots and key-frames, and the Semantic Video Annotation Tool (SVAT), which allows to edit the structural metadata obtained through the Media Analyzer and to add administrative and descriptive metadata, in accordance with MPEG-7.
The detection results are displayed in a separate key-frame view, where for each of the computed key frames the detected object is highlighted.
The user can partially enhance the results of this matching service by removing irrelevant key-frames; however more elaborate enhancement such as editing of the detected region’s boundaries or of its location is not supported.
All views, including the shot view tree structure, can be exported to a CSV file and the metadata is saved in an MPEG-7 XML file.

4.8 Application-specific Video Annotation Tools

Apart from the afore described semantic video annotation tools, a number of additional annotation systems have been proposed that aspiring to specific application contexts induce different perspectives on the annotation process.
To keep the survey comprehensive, in the following the authors examine briefly some representative examples.
Advocating W3C standards, Annotea adopts RDF based annotation schemes and XPointer78 for locating the annotations within the annotated resource.
Object level descriptions can be also propagated through dragging while the video is playing.

4.9 Discussion

As illustrated in the aforementioned descriptions, video annotation tools make a rather poor utilisation of Semantic Web technologies and formal meaning, XML being the most common choice for the capturing and representation of the produced annotations.
The use of MPEG-7 based descriptions, may constitute a solution towards standardised video descriptions, yet raises serious issues with respect to the automatic processing of annotations, especially the descriptive ones, at a semantic level.
Furthermore, VideoAnnex, VIA and SVAT are the only ones that offer selection and annotation of spatial regions on frames of the video, as well.
Anvil has recently presented a new annotation mechanisms called spatiotemporal coding aiming to support point and region annotation, yet currently only points are supported.
It worths noticing that most annotation tools offer a variety of additional functionalities, in order to satisfy varying user needs.

5 Conclusions

As to provide a common framework of reference for assessing the suitability and interoperability of annotations under different context of usages.
Domain specific ontologies are supported by the majority of tools for the representation of subject matter descriptions.
The level of correspondence between research outcomes and implemented annotation tools is not the sole subject for further investigation.
Research in multimedia annotation, and by consequence into multimedia ontologies, is not restricted to the representation of the different annotation dimensions involved.
As a continuation of the efforts initiated within MMSEM, further manifesting the strong emphasis placed upon achieving cross community multimedia data integration, two new 84 http://www.w3.org/2005/Incubator/mmsem/.

Did you find this useful? Give us your feedback

Figures (15)

Fig. 9. Example video annotation using VIA.

Fig. 10. Example video annotation using VideoAnnEx.

Fig. 3. Example image annotation using PhotoStuff.

Fig. 5. Example image annotation using M-Ontomat-Annotizer.

Fig. 2. Example image annotation using KAT.

Fig. 13. Example video annotation using Elan.

Fig. 14. Example video annotation using Anvil.

Fig. 11. Example video annotation using Ontolog.

Fig. 8. Example image annotation using LabelMe.

Fig. 7. Example image annotation using the SWAD annotation tool.

Fig. 4. Example image annotation using AktiveMedia.

Fig. 15. Example video annotation using SVAT.

Fig. 12. Example video annotation using Advene.

Fig. 6. Example image annotation using Caliph; generic (image information) and (semantic) descriptive annotation tabs.

Content maybe subject to copyright Report

A Survey of Semantic Image and Video

Annotation Tools

S. Dasiopoulou, E. Giannakidou, G. Litos, P. Malasioti, and I. Kompatsiaris

Multimedia Knowledge Lab oratory, Informatics and Telematics Institute,

Centre for Research and Technology Hellas

{dasiop,igiannak,litos,xenia,ikom}@iti.gr

Abstract. The availability of semantically annotated image and video

assets constitutes a critical prerequisite for the realisation of intelligent

knowledge management services pertaining to realistic user needs. Given

the extend of the challenges involved in the automatic extraction of such

descriptions, manually created metadata play a signiﬁcant role, further

strengthened by their deployment in training and evaluation tasks re-

lated to the automatic extraction of content descriptions. The diﬀerent

views taken by the two main approaches towards semantic content de-

scription, namely the Semantic Web and MPEG-7, as well as the traits

particular to multimedia content due to the multiplicity of information

levels involved, have resulted in a variety of image and video annotation

to ols, adopting varying description aspects. Aiming to provide a com-

mon framework of reference and furthermore to highlight open issues,

esp ecially with respect to the coverage and the interoperability of the

pro duced metadata, in this chapter we present an overview of the state

of the art in image and video annotation tools.

1 Introduction

Accessing multimedia content in correspondence with the meaning pertained to a

user, constitutes the core challenge in multimedia research, commonly referred to

as the semantic gap [1]. The current state of the art in automatic content analysis

and understanding supports in many cases the successful detection of semantic

concepts, such as persons, buildings, natural scenes vs manmade scenes, etc. at a

satisfactory level of accuracy; however, the attained performance remains highly

variable when considering general domains, or when increasing, even slightly, the

number of supported concepts [2–4]. As a consequence, the manual generation of

content descriptions holds an important role towards the realisation of intelligent

content management services. This signiﬁcance is further strengthened by the

need for manually constructed descriptions in automatic content analysis both

for evaluation as well as for training purposes, when learning based on pre-

annotated examples is used.

The availability of semantic descriptions though is not adequate per se for

the eﬀective management of multimedia content. Fundamental to information

sharing, exchange and reuse, is the interoperability of the descriptions at both

syntactic and semantic levels, i.e. regarding the valid structuring of the descrip-

tions and the endowed meaning respectively. Besides the general prerequisite for

interoperability, additional requirements arise from the multiple levels at which

multimedia content can be represented including structural and low-level fea-

tures information. Further description levels induce from more generic aspects

such as authoring & access control, navigation, and user history & preferences.

The strong relation of structural and low-level feature information to the tasks in-

volved in the automatic analysis of visual content, as well as to retrieval services,

such as transcoding, content-based search, etc., brings these two dimensions to

the foreground, along with the subject matter descriptions.

Two initiatives prevail the eﬀorts towards machine processable semantic con-

tent metadata, the Semantic Web activity

of the W3C and ISO’s Multimedia

Content Description Interface

(MPEG-7) [5, 6], delineating corresponding ap-

proaches with respect to multimedia semantic annotation [7, 8]. Through a lay-

ered architecture of successively increased expressivity, the Semantic Web (SW)

advocates formal semantics and reasoning through logically grounded meaning.

The respective rule and ontology languages embody the general mechanisms

for capturing, representing and reasoning with semantics. They do not capture

application speciﬁc knowledge. In contrast, MPEG-7 addresses speciﬁcally the

description of audiovisual content and comprises not only the representation

language, in the form of the Description Deﬁnition Language (DDL), but also

speciﬁc, media and domain, deﬁnitions; thus from a SW perspective, MPEG-

7 serves the twofold role of a representation language and a domain speciﬁc

ontology.

Overcoming the syntactic and semantic interoperability issues between MPEG-

7 and the SW has been the subject of very active research in the current decade,

highly motivated by the complementary aspects characterising the two afore-

mentioned metadata initiatives: media sp eciﬁc, yet not formal, semantics on one

hand, and general mechanisms for logically grounded semantics on the other

hand. A number of so called multimedia ontologies [9–13] issued in an attempt

to add formal semantics to MPEG-7 descriptions and thereby enable linking with

existing ontologies and the semantic management of existing MPEG-7 metadata

repositories. Furthermore, initiatives such the W3C Multimedia Annotation on

the Semantic Web Taskforce

, the W3C Multimedia Semantics Incubator Group

and the Common Multimedia Ontology Framework

, have been established to

address the technologies, advantages and open issues related to the creation,

storage, manipulation and processing of multimedia semantic metadata.

In this chapter, bearing in mind the signiﬁcance of manual image and video

annotation in combination with the diﬀerent possibilities aﬀorded by the SW

and MPEG-7 initiatives, we present a detailed overview of the most well known

http://www.w3.org/2001/sw/

http://www.chiariglione.org/mpeg/

http://www.w3.org/2001/sw/BestPractices/MM/

http://www.w3.org/2005/Incubator/mmsem/

http://www.acemedia.org/aceMedia/reference/multimedia ontology/index.html

manual annotation tools, addressing both functionality aspects, such as coverage

& granularity of annotations, as well as interoperability concerns with respect to

the supported annotation vocabularies and representation languages. Interoper-

ability though does not address solely the harmonisation between the SW and

MPEG-7 initiatives; a signiﬁcant number of tools, specially regarding video an-

notation, follow customised approaches, aggravating the challenges. As such, this

survey serves a twofold role; it provides a common framework for reference and

comparison purposes, while highlighting issues pertaining to the communication,

sharing and reuse of the produced metadata.

The rest of the chapter is organised as follows. Section 2 describes the criteria

along which the assessment and comparison of the examined annotation tools

is performed. Sections 3 and 4 discuss the individual image and video tools

respectively, while Section 5 concludes the paper, summarising the resulting

observations and open issues.

2 Semantic Image and Video Annotation

Image and video assets constitute extremely rich information sources, ubiqui-

tous in a wide variety of diverse applications and tasks related to information

management, both for personal and professional purposes. Inevitably, the value

of the endowed information amounts to the eﬀectiveness and eﬃciency at which

it can be accessed and managed. This is where semantic annotation comes in, as

it designates the schemes for capturing the information related to the content.

As already indicated, two crucial requirements featuring content annotation

are the interoperability of the created metadata and the ability to automatically

process them. The former encompasses the capacity to share and reuse anno-

tations, and by consequence determines the level of seamless content utilisation

and the beneﬁts issued from the annotations made available; the latter is vital

to the realisation of intelligent content management services. Towards their ac-

complishment, the existence of commonly agreed vocabularies and syntax, and

respectively of commonly agreed semantics and interpretation mechanisms, are

essential elements.

Within the context of visual content, these general prerequisites incur more

speciﬁc conditions issuing from the particular traits of image and video assets.

Visual content semantics, as multimedia semantics in general, comes into a mul-

tilayered, intertwined fashion [14, 15]. It encompasses, amongst others, thematic

descriptions addressing the subject matter depicted (scene categorisation, ob-

jects, events, etc.), media descriptions referring to low-level features and related

information such as the algorithms used for their extraction, respective param-

eters, etc., as well as structural descriptions addressing the decomposition of

content into constituent segments and the spatiotemporal conﬁguration of these

segments. As in this chapter semantic annotation is investigated mostly with re-

spect to content retrieval and analysis tasks, aspects addressing concerns related

to authoring, access and privacy, and so forth, are only shallowly treated.

Fig. 1. Multi-layer image semantics.

Figure 1 shows such an example, illustrating subject matter descriptions such

as “Sky” and “Pole Vaulter, Athlete”, structural descriptions such as the three

identiﬁed regions, the spatial conﬁguration between two of them (i.e. region2

above region3), and the ScalableColour and RegionsShape descriptor values ex-

tracted for two regions. The diﬀerent layers correspond to diﬀerent annotation

dimensions and serve diﬀerent purposes, further diﬀerentiated by the individual

application context. For example, for a search and retrieval service regarding

a device of limited resources (e.g. PDA, mobile phone), content management

becomes more eﬀective if speciﬁc temporal parts of video can be returned to a

query rather than the whole video asset, leaving the user with the cumbersome

task of browsing through it, till reaching the relative parts and assessing if they

satisfy her query.

The aforementioned considerations intertwine, establishing a number of di-

mensions and corresponding criteria along which image and video annotation

can be characterised. As such, interoperability, explicit semantics in terms of lia-

bility to automated processing, and reuse, apply both to all types of description

dimensions and to their interlinking, and not only to subject matter descriptions,

as is the common case for textual content resources.

In the following, we describe the criteria along which we overview the diﬀerent

annotation tools in order to assess them with respect to the aforementioned

considerations. Criteria addressing concerns of similar nature have been grouped

together, resulting in three categories.

2.1 Input & Output

This category includes criteria regarding the way the tool interacts in terms of

requested / supported input and the output produced.

– Annotation Vocabulary. Refers to whether the annotation is performed ac-

cording to a predeﬁned set of terms (e.g. lexicon / thesaurus, taxonomy,

ontology) or if it is provided by the user in the form of keywords and free

text. In the case of controlled vocabulary, we diﬀerentiate the case where the

user has to explicitly provide it (e.g. as when uploading a sp eciﬁc ontology)

or whether it is provided by the tool as a built-in; the formalisms supported

for the representation of the vocabulary constitute a further attribute. We

note that annotation vocabularies may refer not only to subject matter de-

scriptions, but as well to media and structural descriptions. Naturally, the

more formal and well-deﬁned the semantics of the annotation vocabulary,

the more opportunities for achieving interoperable and machine understand-

able annotations.

– Metadata Format. Considers the representation format in which the pro-

duced annotations are expressed. Naturally, the output format is strongly

related to the supported annotation vocabularies. As will be shown in the

sequel though, where the individual tools are described, there is not nec-

essarily a strict correspondence (e.g. a tool may use an RDFS

or OWL

ontology as the subject matter vocabulary, and yet output annotations in

RDF

). The format is equally signiﬁcant to the annotation vocabulary as

with respect to the annotations interoperability and sharing.

– Content Type. Refers to the supported image/video formats, e.g. jpg, png,

mpeg, etc.

2.2 Annotation Level

This category addresses attributes of the annotations per se. Naturally, the types

of information addressed by the descriptions issue from the intended context of

usage. Subject matter annotations, i.e. thematic descriptions with respect to the

depicted objects and events, are indispensable for any application scenario ad-

dressing content-based retrieval at the level of meaning conveyed. Such retrieval

may address concept-based queries or queries involving relations between con-

cepts, entailing respective annotation speciﬁcations. Structural information is

crucial for services where it is important to know the exact content parts associ-

ated with speciﬁc thematic descriptions, as for example in the case of semantic

transcoding or enhanced retrieval and presentation, where the parts of interest

can be indicated in an elaborated manner. Analogously, annotations intended for

http://www.w3.org/TR/rdf-schema/

http://www.w3.org/TR/owl-features/

http://www.w3.org/RDF/

HTML Viewer

A survey of semantic image and video annotation tools

Summary (6 min read)

1 Introduction

2 Semantic Image and Video Annotation

2.1 Input & Output

2.2 Annotation Level

2.3 Miscellaneous

3 Tools for Semantic Image Annotation

3.1 KAT

3.2 PhotoStuff

3.3 AktiveMedia

3.5 Caliph

3.6 SWAD

3.7 LabelMe

3.8 Application-specific Image Annotation Tools

3.9 Discussion

4 Tools for Semantic Video Annotation

4.1 VIA

4.2 VideoAnnEx

4.6 Anvil

4.7 Semantic Video Annotation Suite

4.8 Application-specific Video Annotation Tools

4.9 Discussion

5 Conclusions

Figures (15)

Citations

Additional excerpts

Cites background from "A survey of semantic image and vide..."

Cites background or methods from "A survey of semantic image and vide..."

References

Related Papers (5)