
A survey of semantic image and video annotation tools

TL;DR: This chapter presents an overview of the state of the art in image and video annotation tools to provide a common framework of reference and to highlight open issues, especially with respect to the coverage and the interoperability of the produced metadata.
Abstract: The availability of semantically annotated image and video assets constitutes a critical prerequisite for the realisation of intelligent knowledge management services pertaining to realistic user needs. Given the extent of the challenges involved in the automatic extraction of such descriptions, manually created metadata play a significant role, further strengthened by their deployment in training and evaluation tasks related to the automatic extraction of content descriptions. The different views taken by the two main approaches towards semantic content description, namely the Semantic Web and MPEG-7, as well as the traits particular to multimedia content due to the multiplicity of information levels involved, have resulted in a variety of image and video annotation tools adopting varying description aspects. Aiming to provide a common framework of reference and furthermore to highlight open issues, especially with respect to the coverage and the interoperability of the produced metadata, in this chapter we present an overview of the state of the art in image and video annotation tools.

Summary (6 min read)

1 Introduction

  • Accessing multimedia content in correspondence with the meaning it conveys to a user constitutes the core challenge in multimedia research, commonly referred to as the semantic gap [1].
  • This significance is further strengthened by the need for manually constructed descriptions in automatic content analysis, both for evaluation and for training purposes, when learning based on pre-annotated examples is used.
  • Fundamental to information sharing, exchange and reuse, is the interoperability of the descriptions at both syntactic and semantic levels, i.e. regarding the valid structuring of the descriptions and the endowed meaning respectively.
  • The strong relation of structural and low-level feature information to the tasks involved in the automatic analysis of visual content, as well as to retrieval services, such as transcoding, content-based search, etc., brings these two dimensions to the foreground, along with the subject matter descriptions.
  • A number of so-called multimedia ontologies [9–13] have been put forward in an attempt to add formal semantics to MPEG-7 descriptions and thereby enable linking with existing ontologies and the semantic management of existing MPEG-7 metadata repositories; a minimal sketch of the idea follows.
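
To make the bridging idea concrete, the following minimal sketch (Python with rdflib; all namespace URIs and class/property names are hypothetical, not those of any specific ontology from [9–13]) shows how an MPEG-7 style decomposition could be re-expressed as formal RDF triples that link to a domain ontology:

    from rdflib import Graph, Namespace, URIRef
    from rdflib.namespace import RDF

    MM = Namespace("http://example.org/multimedia-ontology#")   # hypothetical
    DOM = Namespace("http://example.org/athletics#")            # hypothetical

    g = Graph()
    img = URIRef("http://example.org/media/pole_vault.jpg")
    region = URIRef("http://example.org/media/pole_vault.jpg#region2")

    g.add((img, RDF.type, MM.StillImage))
    g.add((img, MM.hasSegment, region))           # structural decomposition
    g.add((region, RDF.type, MM.StillRegion))
    g.add((region, MM.depicts, DOM.PoleVaulter))  # link to domain semantics
    print(g.serialize(format="turtle"))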

2 Semantic Image and Video Annotation

  • Image and video assets constitute extremely rich information sources, ubiquitous in a wide variety of diverse applications and tasks related to information management, both for personal and professional purposes.
  • Inevitably, the value of the endowed information amounts to the effectiveness and efficiency at which it can be accessed and managed.
  • The former encompasses the capacity to share and reuse annotations, and by consequence determines the level of seamless content utilisation and the benefits issued from the annotations made available; the latter is vital to the realisation of intelligent content management services.
  • Towards their accomplishment, the existence of commonly agreed vocabularies and syntax, and respectively of commonly agreed semantics and interpretation mechanisms, are essential elements.
  • The aforementioned considerations intertwine, establishing a number of dimensions and corresponding criteria along which image and video annotation can be characterised.

2.1 Input & Output

  • This category includes criteria regarding the way the tool interacts in terms of requested / supported input and the output produced.
  • The authors note that annotation vocabularies may refer not only to subject matter descriptions, but also to media and structural descriptions.
  • As will be shown in the sequel though, where the individual tools are described, there is not necessarily a strict correspondence (e.g. a tool may use an RDFS or OWL ontology as the subject matter vocabulary, and yet output annotations in RDF).
  • The format is as significant as the annotation vocabulary with respect to the interoperability and sharing of the annotations (see the sketch after this list).
  • Content Type: Refers to the supported image/video formats, e.g. jpg, png, mpeg, etc.
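
A hedged sketch of the vocabulary/format distinction noted above (Python with rdflib; the OWL vocabulary URI is hypothetical): the annotation term comes from an ontology, while the very same triples can be serialised in different metadata formats.

    from rdflib import Graph, Namespace, URIRef
    from rdflib.namespace import RDF

    DOM = Namespace("http://example.org/athletics#")  # hypothetical OWL vocabulary
    g = Graph()
    g.add((URIRef("http://example.org/img1.jpg"), RDF.type, DOM.PoleVaultScene))

    print(g.serialize(format="xml"))     # annotations output as RDF/XML
    print(g.serialize(format="turtle"))  # the same annotations as Turtle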

2.2 Annotation Level

  • This category addresses attributes of the annotations per se.
  • Such retrieval may address concept-based queries or queries involving relations between concepts, entailing respective annotation specifications.
  • To capture the aforementioned considerations, the following criteria have been used.
  • For video assets, annotation may refer to the entire video, temporal segments, frames (temporal segments with zero duration), regions within frames, or even moving regions, i.e. a region followed over a sequence of frames (see the sketch after this list).
  • Refers to the level of expressivity supported with respect to the annotation vocabulary.
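
The granularity options listed above can be made concrete with a few illustrative data structures (Python; the names are ours, not those of any surveyed tool):

    from dataclasses import dataclass
    from typing import List, Tuple

    @dataclass
    class TemporalSegment:
        start_frame: int
        end_frame: int  # start_frame == end_frame models a single frame

    @dataclass
    class FrameRegion:
        frame: int
        polygon: List[Tuple[float, float]]  # region boundary within the frame

    @dataclass
    class MovingRegion:
        trajectory: List[FrameRegion]  # one region per frame in a sequence

    @dataclass
    class Annotation:
        concept: str    # e.g. "PoleVaulter"
        target: object  # whole video (None), a segment, or a (moving) region

    ann = Annotation("PoleVaulter",
                     MovingRegion([FrameRegion(10, [(0, 0), (40, 0), (40, 80)])]))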

2.3 Miscellaneous

  • This category summarises additional criteria that do not fall under the previous dimensions.
  • The considered aspects relate mostly to attributes of the tool itself rather than of the annotation process.
  • As such, and given the scope of this chapter, in the description of the individual tools that follows in the two subsequent Sections, these criteria are treated very briefly.
  • Application Type: Specifies whether the tool constitutes a web-based or a stand-alone application.
  • Licence: Specifies the kind of licence condition under which the tool operates, e.g. open source, etc.
  • Collaboration: Specifies whether the tool supports concurrent annotations (referring to the same media object) by multiple users or not.

3 Tools for Semantic Image Annotation

  • In this Section the authors describe prominent semantic image annotation tools with respect to the dimensions and criteria outlined in Section 2.
  • As will be illustrated in the following, Semantic Web technologies have permeated to a considerable degree the representation of metadata, with the majority of tools supporting ontology-based subject matter descriptions, while a considerable share of them adopts ontological representation for structural annotations as well.
  • In order to provide a relative ranking with respect to SW compatibility, the authors order the tools according to the extent to which the produced annotations bear formal semantics.

3.1 KAT

  • The K-Space Annotation Tool (KAT), developed within the K-Space project, implements an ontology-based framework for the semantic annotation of images.
  • COMM extends the Descriptions & Situations (D&S) and Ontology of Information Objects (OIO) design patterns of DOLCE [17, 18], while incorporating re-engineered definitions of MPEG-7 description tools [19, 20].
  • The latter are strictly concept-based, i.e. considering the aforementioned annotation example it is not possible to annotate the pole as being next to the pole vaulter; such annotations may refer to the entire image or to specific regions of it.
  • The localisation of image regions is performed manually, using either the rectangle or the polygon drawing tool (see the sketch after this list).
  • Furthermore, the COMM based annotation scheme renders quite straightforward the extension of the annotation dimensions supported by KAT.
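
The following fragment sketches such manual localisation (Python; an illustrative structure, not KAT's internals), treating a drawn rectangle as the special case of a four-point polygon:

    from dataclasses import dataclass
    from typing import List, Tuple

    Point = Tuple[int, int]

    @dataclass
    class RegionAnnotation:
        polygon: List[Point]  # clicked control points, in image coordinates
        concept_uri: str      # term from the loaded ontology

    def rectangle(x: int, y: int, w: int, h: int) -> List[Point]:
        """Express a drawn rectangle as a closed polygon."""
        return [(x, y), (x + w, y), (x + w, y + h), (x, y + h)]

    ann = RegionAnnotation(rectangle(120, 40, 60, 200),
                           "http://example.org/athletics#Pole")  # hypothetical URI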

3.2 PhotoStuff

  • PhotoStuff, developed by the Mindswap group, is an ontology-based image annotation tool that supports the generation of semantic image descriptions with respect to the employed ontologies.
  • PhotoStuff [21] addresses primarily two types of metadata, namely descriptive and structural.
  • Regarding descriptive annotations, the user may load one or multiple domain-specific ontologies from the web or from the local hard drive, while with respect to structural annotations, two internal ontologies, hidden from the user, are used: the Digital-Media ontology and the Technical one.
  • Neither the representation nor the extraction of such descriptors is addressed.
  • Notably, annotations may refer not only to concept instantiations, but also to relations between concept instances already identified in an image.
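
A hedged sketch of such relational annotation (Python with rdflib; the namespace and the nextTo property are hypothetical), reusing the pole / pole vaulter example from the KAT discussion:

    from rdflib import Graph, Namespace, URIRef
    from rdflib.namespace import RDF

    DOM = Namespace("http://example.org/athletics#")  # hypothetical domain ontology
    g = Graph()
    vaulter = URIRef("http://example.org/img1.jpg#region1")
    pole = URIRef("http://example.org/img1.jpg#region2")

    g.add((vaulter, RDF.type, DOM.PoleVaulter))
    g.add((pole, RDF.type, DOM.Pole))
    g.add((pole, DOM.nextTo, vaulter))  # relation between identified instances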

3.3 AktiveMedia

  • AktiveMedia, developed within the AKT and X-Media projects, is an ontology-based cross-media annotation system addressing text and image assets.
  • In image annotation mode, AktiveMedia supports descriptive metadata with respect to user selected ontologies, stored in the local hard drive [22].
  • Annotations can refer to image or region level.
  • Contrary to PhotoStuff, which uses Dublin Core (http://dublincore.org/documents/dces/), the semantics of the generated RDF metadata, i.e. the annotation semantics as entailed by the respective ontology definitions, are not direct but require additional processing to retrieve and reason over.
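
To illustrate what "not direct" means in practice, the sketch below (Python with rdflib; all URIs hypothetical) contrasts a region typed directly with an ontology class against one that merely carries the class name as a literal, which a consumer must first map back to the ontology before any reasoning can take place:

    from rdflib import Graph, Namespace, URIRef, Literal
    from rdflib.namespace import RDF, RDFS

    DOM = Namespace("http://example.org/athletics#")
    region = URIRef("http://example.org/img1.jpg#region1")

    direct = Graph()
    direct.add((region, RDF.type, DOM.PoleVaulter))  # machine-interpretable typing

    indirect = Graph()
    indirect.add((region, RDFS.label, Literal("PoleVaulter")))  # just a string

    # Recovering semantics from the indirect form needs an extra lookup step:
    label = next(indirect.objects(region, RDFS.label))
    recovered_class = DOM[str(label)]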

3.5 Caliph

  • Caliph is an MPEG-7 based image annotation tool that supports all types of MPEG-7 metadata, among which descriptive, structural, authoring and low-level visual descriptor annotations.
  • In combination with Emir, it supports content-based retrieval of images using MPEG-7 descriptions.
  • Figure 6 illustrates two screenshots corresponding to the generic image information and the semantic annotation tabs.
  • The descriptions may be either in the form of free text or structured, in accordance with the SemanticBase description tools provided by MPEG-7 (i.e. Agents, Events, Time, Place and Object annotations [26]); a simplified sketch follows this list.
  • The so-called semantic tab allows for the latter, offering a graph-based interface.
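
A heavily simplified, non-validating fragment in the spirit of the SemanticBase tools (Python; real MPEG-7 documents carry schema namespaces, xsi:type attributes and many mandatory elements omitted here):

    import xml.etree.ElementTree as ET

    mpeg7 = ET.Element("Mpeg7")
    semantics = ET.SubElement(mpeg7, "Semantics")

    event = ET.SubElement(semantics, "SemanticBase", {"type": "EventType"})
    ET.SubElement(ET.SubElement(event, "Label"), "Name").text = "PoleVault"

    agent = ET.SubElement(semantics, "SemanticBase", {"type": "AgentObjectType"})
    ET.SubElement(ET.SubElement(agent, "Label"), "Name").text = "Athlete"

    print(ET.tostring(mpeg7, encoding="unicode"))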

3.6 SWAD

  • SWAD is an RDF-based image annotation tool that was developed within the SWAD-Europe project.
  • The latter ran from May 2002 to October 2004 and aimed to support the Semantic Web initiative in Europe through targeted research, demonstrations and outreach activities.
  • The authors chose to provide a very brief description here for the purpose of illustrating image annotation in the Semantic Web as envisaged and realised by that time, as a reference and comparison point for the various image annotation tools that have been developed afterwards.
  • Licensing information is provided in the respective SWAD deliverable.
  • When entering a keyword description, the respective WordNet hierarchy is shown to the user, assisting her in determining the appropriateness of the keyword and in selecting more accurate descriptions.
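
The flavour of this WordNet assistance can be sketched with NLTK's WordNet interface (Python; requires the wordnet corpus to be downloaded; SWAD's actual implementation predates NLTK and differs):

    from nltk.corpus import wordnet as wn

    for synset in wn.synsets("vault"):
        # Show the hypernym hierarchy so the user can judge which sense fits.
        path = synset.hypernym_paths()[0]
        print(" > ".join(s.name() for s in path))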

3.7 LabelMe

  • LabelMe is a database and web-based image annotation tool, aiming to contribute to the creation of large annotated image databases for evaluation and training purposes [28].
  • It contains all images from the MIT CSAIL database, in addition to a large number of user-uploaded images.
  • LabelMe [28] supports descriptive metadata, addressing in principle region-based annotation.
  • Specifically, the user defines a polygon enclosing the annotated object through a set of control points (see the sketch after this list).
  • Its focus on requirements related to object recognition research, rather than image search and retrieval, entails different notions regarding the utilisation, sharing and purpose of annotation.
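
A sketch of consuming such polygon annotations (Python; the XML layout is an assumption based on the commonly documented <object>/<polygon>/<pt> structure, and the area computation is the standard shoelace formula):

    import xml.etree.ElementTree as ET

    xml_doc = """<annotation><object><name>pole vaulter</name>
    <polygon><pt><x>10</x><y>10</y></pt><pt><x>60</x><y>12</y></pt>
    <pt><x>55</x><y>90</y></pt></polygon></object></annotation>"""

    obj = ET.fromstring(xml_doc).find("object")
    pts = [(float(pt.find("x").text), float(pt.find("y").text))
           for pt in obj.find("polygon").iter("pt")]

    # Shoelace formula for the area enclosed by the control points.
    area = abs(sum(x1 * y2 - x2 * y1
                   for (x1, y1), (x2, y2) in zip(pts, pts[1:] + pts[:1]))) / 2
    print(obj.find("name").text, area)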

3.8 Application-specific Image Annotation Tools

  • Apart from the previously described semantic image annotation tools, a variety of application-specific tools are available.
  • Some of them relate to Web 2.0 applications addressing tagging and sharing of content among social groups, while others focus on particular application domains, such as medical imaging, that impose additional specifications pertaining to the individual application context.
  • Utilising radiology specific ontologies, iPad enhances the annotation procedure by suggesting more specific terms and by identifying incomplete descriptions and subsequently prompting for missing parts in the description (e.g. “enlarged” is flagged as incomplete while “enlarged liver” is acceptable).
  • The produced descriptions are in RDF/XML, following a proprietary schema that models the label constituting the tag, its position (the label itself constitutes a rectangle region), and the position of the rectangle that encloses the annotated region, in the form of the top-left point coordinates plus width and height information.
  • Furthermore, general information about the image is included, such as image size, number of regions annotated, etc. Oriented towards Web 2.0, FotoTagger places significant focus on social aspects of content management, allowing among others to publish tagged images to blogs and to upload/download tagged images to/from Flickr, while maintaining both FotoTagger's and Flickr's descriptions (see the sketch after this list).
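
The rectangle model just described can be mirrored in a few lines (Python; the field names are ours, not those of the proprietary schema):

    from dataclasses import dataclass

    @dataclass
    class TaggedRegion:
        label: str
        x: int       # top-left corner of the enclosing rectangle
        y: int
        width: int
        height: int

        def corners(self):
            """Derive the remaining corners from top-left + width/height."""
            return [(self.x, self.y), (self.x + self.width, self.y),
                    (self.x + self.width, self.y + self.height),
                    (self.x, self.y + self.height)]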

3.9 Discussion

  • The aforementioned overview reveals that the utilisation of Semantic Web languages for the representation, interchange and processing of image metadata has permeated semantic image annotation.
  • The choice of a standard representation shows the importance placed on creating content descriptions that can be easily exchanged and reused across heterogeneous applications, and works like [10, 11, 30] provide bridges between MPEG-7 metadata and the Semantic Web and existing ontologies.
  • Thus unlike subject matter descriptions, where a user can choose which vocabulary to use (in the form of a domain ontology, a lexicon or user provided keywords), structural descriptions are tool specific.
  • Summing up, the choice of a tool depends primarily on the intended context of usage, which provides the specifications regarding the annotation dimensions supported, and subsequently on the desired formality of annotations, again related to a large extent to the application context.
  • Thus for semantic retrieval purposes, where semantic refers to the SW perspective, KAT, PhotoStuff, SWAD and AktiveMedia would be the more appropriate choices.

4 Tools for Semantic Video Annotation

  • The increase in the amount of video data deployed and used in today’s applications not only caused video to draw increased attention as a content type, but also introduced new challenges in terms of effective content management.
  • In the following the authors survey typical video annotation tools, highlighting their features with respect to the criteria delineated in Section 2.
  • In the latter category fall tools such as VIDETO, Ricoh Movie Tool, or LogCreator.
  • It is interesting to note that the majority of these tools followed MPEG-7 for the representation of annotations.
  • As described in the sequel, this favourable disposition is still evident, differentiating video annotation tools from image ones, where the Semantic Web technologies have been more pervasive.

4.1 VIA

  • The Video and Image Annotation (VIA) tool has been developed by MKLab within the BOEMIE project.
  • The shot records a pole vaulter holding a pole and sprinting towards the jump point.
  • VIA supports descriptive, structural and media metadata of image and video assets.
  • Descriptive annotation is performed with respect to a user loaded OWL ontology, while free text descriptions can also be added.
  • The first one is concerned with region annotation, in which the user selects rectangular areas of the video content and subsequently adds corresponding annotations.

4.2 VideoAnnEx

  • The IBM VideoAnnEx annotation tool addresses video annotation with MPEG-7 metadata.
  • VideoAnnEx supports descriptive, structural and administrative annotations according to the respective MPEG-7 Description Schemes.
  • The tool supports default subject matter lexicons in XML format, and additionally allows the user to create and load her own XML lexicon, design a concept hierarchy through the interface menu commands, or insert free text descriptions (an illustrative lexicon sketch follows this list).
  • As illustrated in Figure 10, the VideoAnnEx annotation interface consists of four components.
  • On the bottom part of the tool, two views of the annotation preview are available, containing the I-frames of the current shot and the key-frames of each shot in the video, respectively.
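
A hedged illustration of such an XML lexicon and of flattening it into a concept hierarchy (Python; the element names are hypothetical, not VideoAnnEx's actual schema):

    import xml.etree.ElementTree as ET

    lexicon_xml = """<lexicon>
      <term name="indoors"/>
      <term name="outdoors">
        <term name="stadium"/>
      </term>
    </lexicon>"""

    def flatten(element, prefix=""):
        for term in element.findall("term"):
            label = prefix + term.get("name")
            print(label)                       # indoors, outdoors, outdoors/stadium
            flatten(term, prefix=label + "/")

    flatten(ET.fromstring(lexicon_xml))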

4.6 Anvil

  • Anvil is a tool that supports audiovisual content annotation, but was primarily designed for linguistic purposes, in the same vein as the previously described tool.
  • User-defined XML schema specification files provide the definition of the vocabulary used in the annotation procedure (see the sketch after this list).
  • Its interface consists of the media player window, the annotation board and the metadata window.
  • As in most of the described tools, in Anvil too the user has to manually define the temporal segments that she wants to annotate.
  • Anvil can import data from the phonetic tools PRAAT and XWaves, which perform speech transcriptions.
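
The track/time-interval model behind such tools can be sketched as follows (Python; structure and names are illustrative, not Anvil's specification format):

    from dataclasses import dataclass, field
    from typing import Dict, List

    @dataclass
    class TrackElement:
        start: float                # seconds, set manually by the user
        end: float
        attributes: Dict[str, str]  # values constrained by the XML spec file

    @dataclass
    class Track:
        name: str
        elements: List[TrackElement] = field(default_factory=list)

    gesture = Track("gesture")
    gesture.elements.append(
        TrackElement(12.4, 13.1, {"hand": "left", "type": "deictic"}))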

4.7 Semantic Video Annotation Suite

  • The Semantic Video Annotation Suite (SVAS), developed by the Institute of Information Systems & Information Management at Joanneum Research, targets the creation of MPEG-7 video annotations.
  • SVAS [36] encompasses two tools: the Media Analyzer, which automatically extracts structural information regarding shots and key-frames, and the Semantic Video Annotation Tool (SVAT), which allows the user to edit the structural metadata obtained through the Media Analyzer and to add administrative and descriptive metadata, in accordance with MPEG-7.
  • The detection results are displayed in a separate key-frame view, where for each of the computed key frames the detected object is highlighted.
  • The user can partially enhance the results of this matching service by removing irrelevant key-frames; however, more elaborate enhancement, such as editing the detected region's boundaries or location, is not supported.
  • All views, including the shot view tree structure, can be exported to a CSV file, and the metadata is saved as an MPEG-7 XML file; the sketch below illustrates such an export.
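
A minimal sketch of a shot-structure CSV export of this kind (Python; the column names are assumptions, not SVAT's actual layout):

    import csv

    shots = [(1, 0, 240, "pole vault attempt"), (2, 241, 390, "crowd")]
    with open("shots.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["shot_id", "start_frame", "end_frame", "description"])
        writer.writerows(shots)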

4.8 Application-specific Video Annotation Tools

  • Apart from the previously described semantic video annotation tools, a number of additional annotation systems have been proposed that, targeting specific application contexts, induce different perspectives on the annotation process.
  • To keep the survey comprehensive, in the following the authors examine briefly some representative examples.
  • Advocating W3C standards, Annotea adopts RDF-based annotation schemes and XPointer for locating the annotations within the annotated resource (sketched after this list).
  • Object level descriptions can be also propagated through dragging while the video is playing.
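
An Annotea-style annotation can be sketched as follows (Python with rdflib; the annotation-ns namespace is W3C's published one, while the XPointer expression and URIs are invented for illustration):

    from rdflib import Graph, Namespace, URIRef, Literal

    A = Namespace("http://www.w3.org/2000/10/annotation-ns#")
    g = Graph()
    ann = URIRef("http://example.org/annotations/1")

    g.add((ann, A.annotates, URIRef("http://example.org/video1")))
    g.add((ann, A.context, Literal("xpointer(id('scene3'))")))  # locates the target
    g.add((ann, A.body, URIRef("http://example.org/annotations/1/body")))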

4.9 Discussion

  • As illustrated by the preceding descriptions, video annotation tools make rather poor use of Semantic Web technologies and formal meaning, with XML being the most common choice for capturing and representing the produced annotations.
  • The use of MPEG-7 based descriptions may constitute a solution towards standardised video descriptions, yet it raises serious issues with respect to the automatic processing of annotations, especially the descriptive ones, at a semantic level.
  • Furthermore, VideoAnnEx, VIA and SVAT are the only ones that also offer selection and annotation of spatial regions on frames of the video.
  • Anvil has recently presented a new annotation mechanism called spatiotemporal coding, aiming to support point and region annotation, yet currently only points are supported.
  • It is worth noticing that most annotation tools offer a variety of additional functionalities in order to satisfy varying user needs.

5 Conclusions

  • The survey aims to provide a common framework of reference for assessing the suitability and interoperability of annotations under different contexts of usage.
  • Domain specific ontologies are supported by the majority of tools for the representation of subject matter descriptions.
  • The level of correspondence between research outcomes and implemented annotation tools is not the sole subject for further investigation.
  • Research in multimedia annotation, and by consequence into multimedia ontologies, is not restricted to the representation of the different annotation dimensions involved.
  • As a continuation of the efforts initiated within MMSEM (http://www.w3.org/2005/Incubator/mmsem/), further manifesting the strong emphasis placed upon achieving cross-community multimedia data integration, two new groups have been established.


A Survey of Semantic Image and Video Annotation Tools

S. Dasiopoulou, E. Giannakidou, G. Litos, P. Malasioti, and I. Kompatsiaris
Multimedia Knowledge Laboratory, Informatics and Telematics Institute,
Centre for Research and Technology Hellas
{dasiop,igiannak,litos,xenia,ikom}@iti.gr

1 Introduction

Accessing multimedia content in correspondence with the meaning it conveys to a user constitutes the core challenge in multimedia research, commonly referred to as the semantic gap [1]. The current state of the art in automatic content analysis and understanding supports in many cases the successful detection of semantic concepts, such as persons, buildings, natural scenes vs manmade scenes, etc., at a satisfactory level of accuracy; however, the attained performance remains highly variable when considering general domains, or when increasing, even slightly, the number of supported concepts [2–4]. As a consequence, the manual generation of content descriptions holds an important role towards the realisation of intelligent content management services. This significance is further strengthened by the need for manually constructed descriptions in automatic content analysis, both for evaluation and for training purposes, when learning based on pre-annotated examples is used.
The availability of semantic descriptions though is not adequate per se for the effective management of multimedia content. Fundamental to information sharing, exchange and reuse is the interoperability of the descriptions at both syntactic and semantic levels, i.e. regarding the valid structuring of the descriptions and the endowed meaning respectively. Besides the general prerequisite for interoperability, additional requirements arise from the multiple levels at which multimedia content can be represented, including structural and low-level feature information. Further description levels arise from more generic aspects such as authoring & access control, navigation, and user history & preferences. The strong relation of structural and low-level feature information to the tasks involved in the automatic analysis of visual content, as well as to retrieval services such as transcoding, content-based search, etc., brings these two dimensions to the foreground, along with the subject matter descriptions.
Two initiatives prevail in the efforts towards machine-processable semantic content metadata: the Semantic Web activity of the W3C (http://www.w3.org/2001/sw/) and ISO's Multimedia Content Description Interface, MPEG-7 (http://www.chiariglione.org/mpeg/) [5, 6], delineating corresponding approaches with respect to multimedia semantic annotation [7, 8]. Through a layered architecture of successively increased expressivity, the Semantic Web (SW) advocates formal semantics and reasoning through logically grounded meaning. The respective rule and ontology languages embody the general mechanisms for capturing, representing and reasoning with semantics; they do not capture application-specific knowledge. In contrast, MPEG-7 addresses specifically the description of audiovisual content and comprises not only the representation language, in the form of the Description Definition Language (DDL), but also specific media and domain definitions; thus from a SW perspective, MPEG-7 serves the twofold role of a representation language and a domain-specific ontology.

Overcoming the syntactic and semantic interoperability issues between MPEG-7 and the SW has been the subject of very active research in the current decade, highly motivated by the complementary aspects characterising the two aforementioned metadata initiatives: media-specific, yet not formal, semantics on the one hand, and general mechanisms for logically grounded semantics on the other. A number of so-called multimedia ontologies [9–13] have been put forward in an attempt to add formal semantics to MPEG-7 descriptions and thereby enable linking with existing ontologies and the semantic management of existing MPEG-7 metadata repositories. Furthermore, initiatives such as the W3C Multimedia Annotation on the Semantic Web Taskforce (http://www.w3.org/2001/sw/BestPractices/MM/), the W3C Multimedia Semantics Incubator Group (http://www.w3.org/2005/Incubator/mmsem/) and the Common Multimedia Ontology Framework (http://www.acemedia.org/aceMedia/reference/multimedia_ontology/index.html) have been established to address the technologies, advantages and open issues related to the creation, storage, manipulation and processing of multimedia semantic metadata.

In this chapter, bearing in mind the significance of manual image and video annotation in combination with the different possibilities afforded by the SW and MPEG-7 initiatives, we present a detailed overview of the most well known manual annotation tools, addressing both functionality aspects, such as coverage & granularity of annotations, as well as interoperability concerns with respect to the supported annotation vocabularies and representation languages. Interoperability though does not concern solely the harmonisation between the SW and MPEG-7 initiatives; a significant number of tools, especially regarding video annotation, follow customised approaches, aggravating the challenges. As such, this survey serves a twofold role: it provides a common framework for reference and comparison purposes, while highlighting issues pertaining to the communication, sharing and reuse of the produced metadata.

The rest of the chapter is organised as follows. Section 2 describes the criteria along which the assessment and comparison of the examined annotation tools is performed. Sections 3 and 4 discuss the individual image and video tools respectively, while Section 5 concludes the chapter, summarising the resulting observations and open issues.
2 Semantic Image and Video Annotation

Image and video assets constitute extremely rich information sources, ubiquitous in a wide variety of diverse applications and tasks related to information management, both for personal and professional purposes. Inevitably, the value of the endowed information amounts to the effectiveness and efficiency at which it can be accessed and managed. This is where semantic annotation comes in, as it designates the schemes for capturing the information related to the content.

As already indicated, two crucial requirements featuring content annotation are the interoperability of the created metadata and the ability to automatically process them. The former encompasses the capacity to share and reuse annotations, and by consequence determines the level of seamless content utilisation and the benefits issued from the annotations made available; the latter is vital to the realisation of intelligent content management services. Towards their accomplishment, the existence of commonly agreed vocabularies and syntax, and respectively of commonly agreed semantics and interpretation mechanisms, are essential elements.

Within the context of visual content, these general prerequisites incur more specific conditions issuing from the particular traits of image and video assets. Visual content semantics, as multimedia semantics in general, comes in a multilayered, intertwined fashion [14, 15]. It encompasses, amongst others, thematic descriptions addressing the subject matter depicted (scene categorisation, objects, events, etc.), media descriptions referring to low-level features and related information such as the algorithms used for their extraction, respective parameters, etc., as well as structural descriptions addressing the decomposition of content into constituent segments and the spatiotemporal configuration of these segments. As in this chapter semantic annotation is investigated mostly with respect to content retrieval and analysis tasks, aspects addressing concerns related to authoring, access and privacy, and so forth, are only shallowly treated.

Fig. 1. Multi-layer image semantics.

Figure 1 shows such an example, illustrating subject matter descriptions such as "Sky" and "Pole Vaulter, Athlete", structural descriptions such as the three identified regions and the spatial configuration between two of them (i.e. region2 above region3), and the ScalableColour and RegionShape descriptor values extracted for two regions. The different layers correspond to different annotation dimensions and serve different purposes, further differentiated by the individual application context. For example, for a search and retrieval service targeting a device of limited resources (e.g. PDA, mobile phone), content management becomes more effective if specific temporal parts of a video can be returned for a query, rather than the whole video asset, which would leave the user with the cumbersome task of browsing through it until reaching the relevant parts and assessing whether they satisfy her query.
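
The layering of the example can be re-expressed as data to make the dimensions explicit (Python; property names and descriptor value placeholders are illustrative):

    annotation = {
        "subject_matter": {"region2": "Sky", "region3": "Pole Vaulter, Athlete"},
        "structure": {
            "regions": ["region1", "region2", "region3"],
            "spatial_relations": [("region2", "above", "region3")],
        },
        "media": {  # low-level MPEG-7 visual descriptor values per region
            "region2": {"ScalableColour": "..."},
            "region3": {"RegionShape": "..."},
        },
    }
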
The aforementioned considerations intertwine, establishing a number of dimensions and corresponding criteria along which image and video annotation can be characterised. As such, interoperability, explicit semantics in terms of amenability to automated processing, and reuse apply both to all types of description dimensions and to their interlinking, and not only to subject matter descriptions, as is the common case for textual content resources.

In the following, we describe the criteria along which we overview the different annotation tools in order to assess them with respect to the aforementioned considerations. Criteria addressing concerns of similar nature have been grouped together, resulting in three categories.

2.1 Input & Output

This category includes criteria regarding the way the tool interacts in terms of requested / supported input and the output produced.

Annotation Vocabulary. Refers to whether the annotation is performed according to a predefined set of terms (e.g. lexicon / thesaurus, taxonomy, ontology) or if it is provided by the user in the form of keywords and free text. In the case of a controlled vocabulary, we differentiate the case where the user has to explicitly provide it (e.g. as when uploading a specific ontology) from the case where it is provided by the tool as a built-in; the formalisms supported for the representation of the vocabulary constitute a further attribute. We note that annotation vocabularies may refer not only to subject matter descriptions, but also to media and structural descriptions. Naturally, the more formal and well-defined the semantics of the annotation vocabulary, the more opportunities for achieving interoperable and machine-understandable annotations.

Metadata Format. Considers the representation format in which the produced annotations are expressed. Naturally, the output format is strongly related to the supported annotation vocabularies. As will be shown in the sequel though, where the individual tools are described, there is not necessarily a strict correspondence (e.g. a tool may use an RDFS, http://www.w3.org/TR/rdf-schema/, or OWL, http://www.w3.org/TR/owl-features/, ontology as the subject matter vocabulary, and yet output annotations in RDF, http://www.w3.org/RDF/). The format is as significant as the annotation vocabulary with respect to the interoperability and sharing of the annotations.

Content Type. Refers to the supported image/video formats, e.g. jpg, png, mpeg, etc.

2.2 Annotation Level

This category addresses attributes of the annotations per se. Naturally, the types of information addressed by the descriptions issue from the intended context of usage. Subject matter annotations, i.e. thematic descriptions with respect to the depicted objects and events, are indispensable for any application scenario addressing content-based retrieval at the level of meaning conveyed. Such retrieval may address concept-based queries or queries involving relations between concepts, entailing respective annotation specifications. Structural information is crucial for services where it is important to know the exact content parts associated with specific thematic descriptions, as for example in the case of semantic transcoding or enhanced retrieval and presentation, where the parts of interest can be indicated in an elaborated manner. Analogously, annotations intended for …

Citations
Journal ArticleDOI
TL;DR: This article surveys literature at the intersection of Human-Computer Interaction and Multimedia, integrating literature from video browsing and navigation, direct video manipulation, video content visualization, as well as interactive video summarization and interactive video retrieval.
Abstract: Digital video enables manifold ways of multimedia content interaction. Over the last decade, many proposals for improving and enhancing video content interaction were published. More recent work particularly leverages on highly capable devices such as smartphones and tablets that embrace novel interaction paradigms, for example, touch, gesture-based or physical content interaction. In this article, we survey literature at the intersection of Human-Computer Interaction and Multimedia. We integrate literature from video browsing and navigation, direct video manipulation, video content visualization, as well as interactive video summarization and interactive video retrieval. We classify the reviewed works by the underlying interaction method and discuss the achieved improvements so far. We also depict a set of open problems that the video interaction community should address in future.

87 citations


Additional excerpts

  • ...…readers interested in older work we refer to previous survey/review papers focusing on video browsing [Schoeffmann et al. 2010a], video annotation [Dasiopoulou et al. 2011], video retrieval [Geetha and Narayanan 2008; Hu et al. 2011], and video summarization/abstraction [Money and Agius 2008;…...


Proceedings ArticleDOI
01 Oct 2017
TL;DR: In this article, a path supervision framework is proposed to annotate trajectories and use it to produce a MOT dataset of unprecedented size, with more than 15,000 person trajectories in 720 sequences.
Abstract: Progress in Multiple Object Tracking (MOT) has been historically limited by the size of the available datasets. We present an efficient framework to annotate trajectories and use it to produce a MOT dataset of unprecedented size. In our novel path supervision the annotator loosely follows the object with the cursor while watching the video, providing a path annotation for each object in the sequence. Our approach is able to turn such weak annotations into dense box trajectories. Our experiments on existing datasets prove that our framework produces more accurate annotations than the state of the art, in a fraction of the time. We further validate our approach by crowdsourcing the PathTrack dataset, with more than 15,000 person trajectories in 720 sequences. Tracking approaches can benefit training on such large-scale datasets, as did object recognition. We prove this by re-training an off-the-shelf person matching network, originally trained on the MOT15 dataset, almost halving the misclassification rate. Additionally, training on our data consistently improves tracking results, both on our dataset and on MOT15. On the latter, we improve the top-performing tracker (NOMT) dropping the number of ID Switches by 18% and fragments by 5%.

68 citations

Journal ArticleDOI
TL;DR: Ratsnake is presented, a publicly available generic image annotation tool providing annotation efficiency, semantic awareness, versatility, and extensibility, features that can be exploited to transform it into an effective CAD system.
Abstract: Image segmentation and annotation are key components of image-based medical computer-aided diagnosis (CAD) systems. In this paper we present Ratsnake, a publicly available generic image annotation tool providing annotation efficiency, semantic awareness, versatility, and extensibility, features that can be exploited to transform it into an effective CAD system. In order to demonstrate this unique capability, we present its novel application for the evaluation and quantification of salient objects and structures of interest in kidney biopsy images. Accurate annotation identifying and quantifying such structures in microscopy images can provide an estimation of pathogenesis in obstructive nephropathy, which is a rather common disease with severe implication in children and infants. However a tool for detecting and quantifying the disease is not yet available. A machine learning-based approach, which utilizes prior domain knowledge and textural image features, is considered for the generation of an image force field customizing the presented tool for automatic evaluation of kidney biopsy images. The experimental evaluation of the proposed application of Ratsnake demonstrates its efficiency and effectiveness and promises its wide applicability across a variety of medical imaging domains.

51 citations


Cites background from "A survey of semantic image and vide..."

  • ...A concise review study of other annotation tools can be found in [31]....


Journal ArticleDOI
TL;DR: This work collects a number of studies that applied semantic annotations to different objects, classifies them according to the subject being described in an enterprise architecture framework, and identifies the existing drawbacks.

46 citations


Cites background or methods from "A survey of semantic image and vide..."

  • ...We can find that the surveys [13], [14], [15], [19] and [20] were mainly focusing on documents, as well as the surveys [17] and [18] paid major attention to images or videos....


  • ...They analysed some existing annotation tools from both the functionality perspective ([14], [15], [17], [18] and [19]) and from the efficiency perspective ([13] and...


01 Jan 2006
TL;DR: This paper proposes an approach for expressing semantics explicitly by formalizing the semantic constraints of a profile using ontologies and rules, thus enabling interoperability and automatic use for MPEG-7 based applications.
Abstract: MPEG-7 can be used to create complex and comprehensive metadata descriptions of multimedia content. Since MPEG-7 is defined in terms of an XML schema, the semantics of its elements have no formal grounding. In addition, certain features can be described in multiple ways. MPEG-7 profiles are subsets of the standard that apply to specific application areas, which can be used to reduce this syntactic variability, but they still lack formal semantics. In this paper, we propose an approach for expressing semantics explicitly by formalizing the semantic constraints of a profile using ontologies and rules, thus enabling interoperability and automatic use for MPEG-7 based applications. We demonstrate the feasibility of the approach by implementing a validation service for a subset of the semantic constraints of the Detailed Audiovisual Profile (DAVP)

39 citations

References
Proceedings Article
01 May 2008
TL;DR: A new coding mechanism, spatiotemporal coding, is presented that allows coders to annotate points and regions in the video frame by drawing directly on the screen, which opens up the spatial dimension for multi-track video coding.
Abstract: We present a new coding mechanism, spatiotemporal coding, that allows coders to annotate points and regions in the video frame by drawing directly on the screen. Coders can not only attach labels to time intervals in the video but can specify a possibly moving region on the video screen. This opens up the spatial dimension for multi-track video coding and is an essential asset in almost every area of video coding, e.g. gesture coding, facial expression coding, encoding semantics for information retrieval etc. We discuss conceptual variants, design decisions and the relation to the MPEG-7 standard and tools.

77 citations


Journal ArticleDOI
14 Mar 2008
TL;DR: The papers in this issue cover the main aspects of multimedia information retrieval research, assess the applicability of the obtained results in real-life scenarios, and address the many future challenges in this field.
Abstract: The papers in this issue cover the main aspects of multimedia information retrieval research, assess the applicability of the obtained results in real-life scenarios, and address the many future challenges in this field.

72 citations

Proceedings Article
06 Nov 2008
TL;DR: The iPad tool as discussed by the authors enables researchers and clinicians to create semantic annotations on radiological images, enabling them to describe images and image regions using a graphical interface that maps their descriptions to structured ontologies semi-automatically.
Abstract: Radiological images contain a wealth of information, such as anatomy and pathology, which is often not explicit and computationally accessible. Information schemes are being developed to describe the semantic content of images, but such schemes can be unwieldy to operationalize because there are few tools to enable users to capture structured information easily as part of the routine research workflow. We have created iPad, an open source tool enabling researchers and clinicians to create semantic annotations on radiological images. iPad hides the complexity of the underlying image annotation information model from users, permitting them to describe images and image regions using a graphical interface that maps their descriptions to structured ontologies semi-automatically. Image annotations are saved in a variety of formats, enabling interoperability among medical records systems, image archives in hospitals, and the Semantic Web. Tools such as iPad can help reduce the burden of collecting structured information from images, and it could ultimately enable researchers and physicians to exploit images on a very large scale and glean the biological and physiological significance of image content.

69 citations

Journal ArticleDOI
TL;DR: The DS-MIRF framework is described, a software engineering framework that facilitates the development of knowledge-based multimedia applications such as multimedia information retrieval, filtering, browsing, interaction, knowledge extraction, segmentation, and content description that supports interoperability of OWL with the MPEG-7/21.
Abstract: In this paper, we focus on interoperable semantic multimedia services that are offered in open environments such as the Internet. The use of well-accepted standards is of paramount importance for interoperability support in open environments. In addition, the semantic description of multimedia content utilizing domain ontologies is very useful for indexing, query specification, retrieval, filtering, user Interfaces, and knowledge extraction from audiovisual material. With the MPEG-7 and MPEG-21 standards dominating the multimedia content and service description domain and OWL dominating the ontology description languages, it is important to establish a framework that allows these standards to interoperate. We describe here the DS-MIRF framework, a software engineering framework that facilitates the development of knowledge-based multimedia applications such as multimedia information retrieval, filtering, browsing, interaction, knowledge extraction, segmentation, and content description. DS-MIRF supports interoperability of OWL with the MPEG-7/21 so that domain and application ontologies expressed in OWL can be transparently integrated with MPEG-7/21 metadata. This allows applications that recognize and use the constructs provided by MPEG-7/21 to make use of domain and application ontologies, resulting in more effective retrieval and user interaction with the audiovisual material. We also present a retrieval evaluation methodology and comparative retrieval results

62 citations