
A survey of semantic image and video annotation tools

TL;DR: This chapter presents an overview of the state of the art in image and video annotation tools to provide a common framework of reference and to highlight open issues, especially with respect to the coverage and the interoperability of the produced metadata.
Abstract: The availability of semantically annotated image and video assets constitutes a critical prerequisite for the realisation of intelligent knowledge management services pertaining to realistic user needs. Given the extent of the challenges involved in the automatic extraction of such descriptions, manually created metadata play a significant role, further strengthened by their deployment in training and evaluation tasks related to the automatic extraction of content descriptions. The different views taken by the two main approaches towards semantic content description, namely the Semantic Web and MPEG-7, as well as the traits particular to multimedia content due to the multiplicity of information levels involved, have resulted in a variety of image and video annotation tools, adopting varying description aspects. Aiming to provide a common framework of reference and furthermore to highlight open issues, especially with respect to the coverage and the interoperability of the produced metadata, in this chapter we present an overview of the state of the art in image and video annotation tools.

Summary (6 min read)

1 Introduction

  • Accessing multimedia content in correspondence with the meaning it conveys to a user constitutes the core challenge in multimedia research, commonly referred to as the semantic gap [1].
  • This significance is further strengthened by the need for manually constructed descriptions in automatic content analysis, both for evaluation as well as for training purposes, when learning based on pre-annotated examples is used.
  • Fundamental to information sharing, exchange and reuse, is the interoperability of the descriptions at both syntactic and semantic levels, i.e. regarding the valid structuring of the descriptions and the endowed meaning respectively.
  • The strong relation of structural and low-level feature information to the tasks involved in the automatic analysis of visual content, as well as to retrieval services, such as transcoding, content-based search, etc., brings these two dimensions to the foreground, along with the subject matter descriptions.
  • A number of so-called multimedia ontologies [9–13] have been proposed in an attempt to add formal semantics to MPEG-7 descriptions and thereby enable linking with existing ontologies and the semantic management of existing MPEG-7 metadata repositories.

2 Semantic Image and Video Annotation

  • Image and video assets constitute extremely rich information sources, ubiquitous in a wide variety of diverse applications and tasks related to information management, both for personal and professional purposes.
  • Inevitably, the value of the endowed information depends on the effectiveness and efficiency with which it can be accessed and managed.
  • The former encompasses the capacity to share and reuse annotations, and by consequence determines the level of seamless content utilisation and the benefits issued from the annotations made available; the latter is vital to the realisation of intelligent content management services.
  • Towards their accomplishment, the existence of commonly agreed vocabularies and syntax, and respectively of commonly agreed semantics and interpretation mechanisms, are essential elements.
  • The aforementioned considerations intertwine, establishing a number of dimensions and corresponding criteria along which image and video annotation can be characterised.

2.1 Input & Output

  • This category includes criteria regarding the way the tool interacts in terms of requested / supported input and the output produced.
  • The authors note that annotation vocabularies may refer not only to subject matter descriptions, but also to media and structural descriptions.
  • As will be shown in the sequel though, where the individual tools are described, there is not necessarily a strict correspondence (e.g. a tool may use an RDFS or OWL ontology as the subject matter vocabulary, and yet output annotations in RDF).
  • The format is as significant as the annotation vocabulary with respect to the interoperability and sharing of the annotations; a minimal sketch of this vocabulary/output split follows this list.
  • Content Type: Refers to the supported image/video formats, e.g. jpg, png, mpeg, etc.
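
To make the vocabulary/format distinction concrete, the sketch below uses Python's rdflib to load a hypothetical domain ontology (sports.owl) as the subject matter vocabulary and to emit a simple image-level annotation as RDF/Turtle. The file name, the example namespace and the depicts property are illustrative assumptions, not part of any surveyed tool.

```python
# Minimal sketch: an OWL/RDFS ontology serves as the annotation vocabulary,
# while the produced annotations are serialised as plain RDF (Turtle here).
# "sports.owl", the ex: namespace and ex:depicts are hypothetical.
from rdflib import Graph, Namespace, URIRef, RDF

EX = Namespace("http://example.org/sports#")

vocabulary = Graph()
vocabulary.parse("sports.owl")  # domain ontology supplying the concepts

# Check that the concept we want to use is actually defined in the vocabulary.
concept = EX.PoleVaulter
assert (concept, RDF.type, None) in vocabulary, "concept missing from ontology"

annotations = Graph()
annotations.bind("ex", EX)
image = URIRef("http://example.org/images/img001.jpg")
annotations.add((image, EX.depicts, concept))  # subject matter annotation

# The output format (RDF/Turtle) is independent of the vocabulary formalism (OWL).
annotations.serialize(destination="img001_annotations.ttl", format="turtle")
```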

2.2 Annotation Level

  • This category addresses attributes of the annotations per se.
  • Such retrieval may address concept-based queries or queries involving relations between concepts, entailing respective annotation specifications.
  • To capture the aforementioned considerations, the following criteria have been used.
  • For video assets, annotation may refer to the entire video, temporal segments, frames (temporal segments with zero duration), regions within frames, or even to moving regions, i.e. a region followed for a sequence of frames (see the sketch after this list).
  • Refers to the level of expressivity supported with respect to the annotation vocabulary.
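
As a purely illustrative data model of these annotation granularities (not the schema of any surveyed tool), the following Python sketch distinguishes whole-asset, temporal-segment, frame, still-region and moving-region targets.

```python
# Hypothetical data model for annotation granularity: whole asset, temporal
# segment, frame, still region, and moving region (a region tracked over frames).
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class TemporalSegment:
    start_ms: int
    end_ms: int  # start_ms == end_ms denotes a single frame

@dataclass
class Region:
    polygon: List[Tuple[int, int]]  # control points in pixel coordinates

@dataclass
class MovingRegion:
    trajectory: List[Tuple[int, Region]]  # (timestamp in ms, region at that frame)

@dataclass
class Annotation:
    asset_uri: str
    label: str  # subject matter term, e.g. a concept from an ontology
    segment: Optional[TemporalSegment] = None
    region: Optional[Region] = None
    moving_region: Optional[MovingRegion] = None

# Example: "PoleVaulter" attached to a moving region across three frames.
ann = Annotation(
    asset_uri="video001.mpg",
    label="PoleVaulter",
    moving_region=MovingRegion(trajectory=[
        (1000, Region([(10, 20), (60, 20), (60, 90), (10, 90)])),
        (1040, Region([(12, 20), (62, 20), (62, 90), (12, 90)])),
        (1080, Region([(15, 21), (65, 21), (65, 91), (15, 91)])),
    ]),
)
print(ann.label, len(ann.moving_region.trajectory), "tracked frames")
```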

2.3 Miscellaneous

  • This category summarises additional criteria that do not fall under the previous dimensions.
  • The considered aspects relate mostly to attributes of the tool itself rather than of the annotation process.
  • As such, and given the scope of this chapter, in the description of the individual tools that follows in the two subsequent Sections, these criteria are treated very briefly.
  • – Application Type: Specifies whether the tool constitutes a web-based or a stand-alone application.
  • – Licence: Specifies the kind of licence condition under which the tool operates, e.g. open source, etc.
  • – Collaboration: Specifies whether the tool supports concurrent annotations (referring to the same media object) by multiple users or not.

3 Tools for Semantic Image Annotation

  • In this Section the authors describe prominent semantic image annotation tools with respect to the dimensions and criteria outlined in Section 2.
  • As will be illustrated in the following, Semantic Web technologies have permeated to a considerable degree the representation of metadata, with the majority of tools supporting ontology-based subject matter descriptions, while a considerable share of them adopts ontological representation for structural annotations as well.
  • In order to provide a relative ranking with respect to SW compatibility, the authors order the tools according to the extent to which the produced annotations bear formal semantics.

3.1 KAT

  • The K-Space Annotation Tool (KAT), developed within the K-Space project, implements an ontology-based framework for the semantic annotation of images.
  • COMM extends the Descriptions & Situations (D&S) and Ontology of Information Objects (OIO) design patterns of DOLCE [17, 18], while incorporating re-engineered definitions of MPEG-7 description tools [19, 20].
  • The latter are strictly concept based, i.e. considering the aforementioned annotation example it is not possible to annotate the pole as being next to the pole vaulter, and may refer to the entire image or to specific regions of it.
  • The localisation of image regions is performed manually, using either the rectangle or the polygon drawing tool.
  • Furthermore, the COMM based annotation scheme renders quite straightforward the extension of the annotation dimensions supported by KAT.

3.2 PhotoStuff

  • PhotoStuff, developed by the Mindswap group, is an ontology-based image annotation tool that supports the generation of semantic image descriptions with respect to the employed ontologies.
  • PhotoStuff [21] addresses primarily two types of metadata, namely descriptive and structural.
  • Regarding descriptive annotations, the user may load one or multiple domain-specific ontologies from the web or from the local hard drive, while with respect to structural annotations, two internal ontologies, hidden from the user, are used: the Digital-Media ontology and the Technical one.
  • Neither the representation nor the extraction of such descriptors is addressed.
  • Notably, annotations may refer not only to concept instantiations, but also to relations between concept instances already identified in an image.

3.3 AktiveMedia

  • AktiveMedia, developed within the AKT and X-Media projects, is an ontology-based cross-media annotation system addressing text and image assets.
  • In image annotation mode, AktiveMedia supports descriptive metadata with respect to user selected ontologies, stored in the local hard drive [22].
  • Annotations can refer to image or region level.
  • Contrary to PhotoStuff, which uses the Dublin Core element set (http://dublincore.org/documents/dces/).
  • As such, the semantics of the generated RDF metadata, i.e. the annotation semantics as entailed by the respective ontology definitions, are not direct but require additional processing to retrieve and to reason over.

3.5 Caliph

  • Caliph is an MPEG-7 based image annotation tool that supports all types of MPEG-7 metadata, among which descriptive, structural, authoring and low-level visual descriptor annotations.
  • In combination with Emir, it supports content-based retrieval of images using MPEG-7 descriptions.
  • Figure 6 illustrates two screenshots corresponding to the generic image information and the semantic annotation tabs.
  • The descriptions may be either in the form of free text or structured, in accordance with the SemanticBase description tools provided by MPEG-7 (i.e. Agents, Events, Time, Place and Object annotations [26]); a sketch of such a structured description follows this list.
  • The so called semantic tab allows for the latter, offering a graph based interface.
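
The following Python sketch assembles a small, SemanticBase-style structured description (an event linked to an agent and a place) with xml.etree. The element and relation names are simplified stand-ins for illustration; they are not the normative MPEG-7 schema nor Caliph's exact output.

```python
# Rough illustration of a structured, SemanticBase-style description
# (event + agent + place) serialised as XML. Element names are simplified
# stand-ins, not the normative MPEG-7 schema.
import xml.etree.ElementTree as ET

semantic = ET.Element("Semantic")

event = ET.SubElement(semantic, "SemanticBase", {"type": "Event", "id": "event-1"})
ET.SubElement(event, "Label").text = "Pole vault attempt"

agent = ET.SubElement(semantic, "SemanticBase", {"type": "AgentObject", "id": "agent-1"})
ET.SubElement(agent, "Label").text = "Pole vaulter"

place = ET.SubElement(semantic, "SemanticBase", {"type": "Place", "id": "place-1"})
ET.SubElement(place, "Label").text = "Olympic stadium"

# Graph-like links between the semantic entities (agent of / location of the event).
relations = ET.SubElement(semantic, "Graph")
ET.SubElement(relations, "Relation", {"type": "agentOf", "source": "agent-1", "target": "event-1"})
ET.SubElement(relations, "Relation", {"type": "locationOf", "source": "place-1", "target": "event-1"})

ET.indent(semantic)  # pretty-printing, Python 3.9+
print(ET.tostring(semantic, encoding="unicode"))
```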

3.6 SWAD

  • SWAD is an RDF-based image annotation tool that was developed within the SWAD-Europe project.
  • The latter ran from May 2002 to October 2004 and aimed to support the Semantic Web initiative in Europe through targeted research, demonstrations and outreach activities.
  • The authors chose to provide a very brief description here for the purpose of illustrating image annotation in the Semantic Web as envisaged and realised by that time, as a reference and comparison point for the various image annotation tools that have been developed afterwards.
  • Licensing information as described in the respective SWAD deliverable.
  • When entering a keyword description, the respective WordNet hierarchy is shown to the user, assisting her in determining the appropriateness of the keyword and in selecting descriptions of further accuracy (see the sketch below).
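
As a rough illustration of such WordNet-assisted keyword selection (not SWAD's actual implementation), the sketch below uses NLTK's WordNet interface to display the hypernym hierarchy of a candidate keyword, so that a more precise term can be chosen.

```python
# Show the WordNet hypernym hierarchy for a candidate keyword, so the user can
# judge its appropriateness or pick a more specific/general term.
# Assumes the NLTK WordNet corpus is installed: nltk.download("wordnet").
from nltk.corpus import wordnet as wn

def show_hierarchy(keyword: str) -> None:
    for synset in wn.synsets(keyword, pos=wn.NOUN):
        print(f"{synset.name()}: {synset.definition()}")
        # Walk one hypernym path, printing the keyword's sense first and
        # indenting progressively towards the WordNet root.
        path = synset.hypernym_paths()[0]
        for depth, ancestor in enumerate(reversed(path)):
            print("  " * depth + ancestor.lemma_names()[0])
        print()

show_hierarchy("athlete")
```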

3.7 LabelMe

  • LabelMe is a database and web-based image annotation tool, aiming to contribute to the creation of large annotated image databases for evaluation and training purposes [28].
  • It contains all images from the MIT CSAIL database, in addition to a large number of user uploaded images.
  • LabelMe [28] supports descriptive metadata addressing in principle region-based annotation.
  • Specifically, the user defines a polygon enclosing the annotated object through a set of control points (see the sketch after this list).
  • Its focus on requirements related to object recognition research, rather than image search and retrieval, entails different notions regarding the utilisation, sharing and purpose of annotation.
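
The sketch below is a hypothetical, minimal model of such a polygon annotation (object name plus control points), with a shoelace-formula area computation as a simple derived property; it does not reproduce LabelMe's actual annotation format.

```python
# Hypothetical polygon annotation: an object label plus the control points the
# annotator clicked. The area (shoelace formula) is one simple derived property.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class PolygonAnnotation:
    object_name: str
    points: List[Tuple[float, float]]  # control points in pixel coordinates

    def area(self) -> float:
        pts = self.points
        n = len(pts)
        s = 0.0
        for i in range(n):
            x1, y1 = pts[i]
            x2, y2 = pts[(i + 1) % n]
            s += x1 * y2 - x2 * y1
        return abs(s) / 2.0

ann = PolygonAnnotation("car", [(10, 10), (120, 10), (120, 60), (10, 60)])
print(ann.object_name, ann.area())  # -> car 5500.0
```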

3.8 Application-specific Image Annotation Tools

  • Apart from the previously described semantic image annotation tools, a variety of application-specific tools are available.
  • Some of them relate to Web 2.0 applications addressing tagging and sharing of content among social groups, while others focus on particular application domains, such as medical imaging, that impose additional specifications pertaining to the individual application context.
  • Utilising radiology specific ontologies, iPad enhances the annotation procedure by suggesting more specific terms and by identifying incomplete descriptions and subsequently prompting for missing parts in the description (e.g. “enlarged” is flagged as incomplete while “enlarged liver” is acceptable).
  • The produced descriptions are in RDF/XML following a proprietary schema that models the label constituting the tag, its position (the label constitutes a rectangle region in itself), and the position of the rectangle that encloses the annotated region, in the form of the top left point coordinates and width and height information (a hypothetical sketch of such a record follows this list).
  • Furthermore, general information about the image is included, such as image size, number of regions annotated, etc. Oriented towards Web 2.0, FotoTagger places significant focus on social aspects pertaining to content management, allowing among others to publish tagged images to blogs and to upload/download tagged images to/from Flickr, while maintaining both FotoTagger's and Flickr's descriptions.
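
As a purely hypothetical illustration of the kind of record such a schema might correspond to (the actual FotoTagger schema is proprietary and not reproduced here), the following Python sketch models a tag label, the rectangle occupied by the label itself, and the annotated region's rectangle as top-left coordinates plus width and height.

```python
# Hypothetical model of a FotoTagger-style tag: the textual label, the rectangle
# occupied by the label itself, and the rectangle enclosing the annotated region,
# both given as top-left corner plus width/height. Not the actual schema.
from dataclasses import dataclass, asdict
import json

@dataclass
class Rect:
    x: int       # top-left x
    y: int       # top-left y
    width: int
    height: int

@dataclass
class ImageTag:
    label: str
    label_box: Rect    # where the label is drawn on the image
    region_box: Rect   # the annotated region the label refers to

tag = ImageTag(
    label="lighthouse",
    label_box=Rect(x=40, y=10, width=90, height=18),
    region_box=Rect(x=35, y=30, width=120, height=210),
)

# Serialised to JSON here for brevity; the surveyed tool emits RDF/XML instead.
print(json.dumps(asdict(tag), indent=2))
```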

3.9 Discussion

  • The aforementioned overview reveals that the utilisation of Semantic Web languages for the representation, interchange and processing of image metadata has permeated semantic image annotation.
  • The choice of a standard representation shows the importance placed on creating content descriptions that can be easily exchanged and reused across heterogeneous applications, and works like [10, 11, 30] provide bridges between MPEG-7 metadata and the Semantic Web and existing ontologies.
  • Thus unlike subject matter descriptions, where a user can choose which vocabulary to use (in the form of a domain ontology, a lexicon or user provided keywords), structural descriptions are tool specific.
  • Summing up, the choice of a tool depends primarily on the intended context of usage, which provides the specifications regarding the annotation dimensions supported, and subsequently on the desired formality of annotations, again related to a large extent to the application context.
  • Thus, for semantic retrieval purposes, where semantic refers to the SW perspective, KAT, PhotoStuff, SWAD and AktiveMedia would be the more appropriate choices.

4 Tools for Semantic Video Annotation

  • The increase in the amount of video data deployed and used in today’s applications not only caused video to draw increased attention as a content type, but also introduced new challenges in terms of effective content management.
  • In the following the authors survey typical video annotation tools, highlighting their features with respect to the criteria delineated in Section 2.
  • In the latter category fall tools such as VIDETO, Ricoh Movie Tool, or LogCreator.
  • It is interesting to note that the majority of these tools followed MPEG-7 for the representation of annotations.
  • As described in the sequel, this favourable disposition is still evident, differentiating video annotation tools from image ones, where the Semantic Web technologies have been more pervasive.

4.1 VIA

  • The Video and Image Annotation (VIA) tool has been developed by the MKLab within the BOEMIE project.
  • The example shot shows a pole vaulter holding a pole and sprinting towards the jump point.
  • VIA supports descriptive, structural and media metadata of image and video assets.
  • Descriptive annotation is performed with respect to a user loaded OWL ontology, while free text descriptions can also be added.
  • The first annotation mode is concerned with region annotation, in which the user selects rectangular areas of the video content and subsequently adds corresponding annotations.

4.2 VideoAnnEx

  • The IBM VideoAnnEx annotation tool addresses video annotation with MPEG-7 metadata.
  • VideoAnnEx supports descriptive, structural and administrative annotations according to the respective MPEG-7 Description Schemes.
  • The tool supports default subject matter lexicons in XML format, and additionally allows the user to create and load her own XML lexicon, design a concept hierarchy through the interface menu commands, or insert free text descriptions (a hypothetical lexicon sketch follows this list).
  • As illustrated in Figure 10, the VideoAnnEx annotation interface consists of four components.
  • On the bottom part of the tool, two views of the annotation preview are available: one contains the I-frames of the current shot, and the other the keyframes of each shot in the video.
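
To make the lexicon mechanism concrete, the following is a hypothetical example of a small XML subject-matter lexicon with a concept hierarchy, flattened into labelled paths with a few lines of Python; the element names are illustrative and do not follow VideoAnnEx's actual lexicon format.

```python
# Hypothetical XML lexicon with a small concept hierarchy, flattened into
# "Parent / Child" labels. Element names are illustrative only.
import xml.etree.ElementTree as ET

LEXICON_XML = """
<lexicon name="sports">
  <concept name="Athletics">
    <concept name="PoleVault"/>
    <concept name="HighJump"/>
  </concept>
  <concept name="Stadium"/>
</lexicon>
"""

def flatten(node: ET.Element, prefix: str = "") -> list:
    labels = []
    for child in node.findall("concept"):
        label = f"{prefix}{child.get('name')}"
        labels.append(label)
        labels.extend(flatten(child, prefix=label + " / "))
    return labels

root = ET.fromstring(LEXICON_XML)
for label in flatten(root):
    print(label)
# Athletics, Athletics / PoleVault, Athletics / HighJump, Stadium
```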

4.6 Anvil

  • Anvil is a tool that supports audiovisual content annotation, but which was primarily designed for linguistic purposes, in the same vein as the previously described tool.
  • User-defined XML schema specification files provide the definition of the vocabulary used in the annotation procedure.
  • Its interface consists of the media player window, the annotation board and the metadata window.
  • As in most of the described tools, in Anvil the user has to manually define the temporal segments that she wants to annotate.
  • Anvil can import data from the phonetic tools PRAAT and XWaves, which perform speech transcriptions.

4.7 Semantic Video Annotation Suite

  • The Semantic Video Annotation Suite (SVAS), developed by the Joanneum Research Institute of Information Systems & Information Management, targets the creation of MPEG-7 video annotations.
  • SVAS [36] encompasses two tools: the Media Analyzer, which automatically extracts structural information regarding shots and key-frames, and the Semantic Video Annotation Tool (SVAT), which allows the user to edit the structural metadata obtained through the Media Analyzer and to add administrative and descriptive metadata, in accordance with MPEG-7.
  • The detection results are displayed in a separate key-frame view, where for each of the computed key frames the detected object is highlighted.
  • The user can partially enhance the results of this matching service by removing irrelevant key-frames; however more elaborate enhancement such as editing of the detected region’s boundaries or of its location is not supported.
  • All views, including the shot view tree structure, can be exported to a CSV file and the metadata is saved in an MPEG-7 XML file.

4.8 Application-specific Video Annotation Tools

  • Apart from the previously described semantic video annotation tools, a number of additional annotation systems have been proposed that, targeting specific application contexts, introduce different perspectives on the annotation process.
  • To keep the survey comprehensive, in the following the authors examine briefly some representative examples.
  • Advocating W3C standards, Annotea adopts RDF based annotation schemes and XPointer78 for locating the annotations within the annotated resource.
  • Object level descriptions can be also propagated through dragging while the video is playing.

4.9 Discussion

  • As illustrated in the aforementioned descriptions, video annotation tools make rather poor use of Semantic Web technologies and formal meaning, XML being the most common choice for the capturing and representation of the produced annotations.
  • The use of MPEG-7 based descriptions may constitute a solution towards standardised video descriptions, yet it raises serious issues with respect to the automatic processing of annotations, especially the descriptive ones, at a semantic level.
  • Furthermore, VideoAnnEx, VIA and SVAT are the only ones that also offer selection and annotation of spatial regions on frames of the video.
  • Anvil has recently presented a new annotation mechanism called spatiotemporal coding, aiming to support point and region annotation, yet currently only points are supported.
  • It is worth noting that most annotation tools offer a variety of additional functionalities, in order to satisfy varying user needs.

5 Conclusions

  • This survey has been carried out so as to provide a common framework of reference for assessing the suitability and interoperability of annotations under different contexts of usage.
  • Domain specific ontologies are supported by the majority of tools for the representation of subject matter descriptions.
  • The level of correspondence between research outcomes and implemented annotation tools is not the sole subject for further investigation.
  • Research in multimedia annotation, and by consequence into multimedia ontologies, is not restricted to the representation of the different annotation dimensions involved.
  • As a continuation of the efforts initiated within MMSEM (http://www.w3.org/2005/Incubator/mmsem/), further manifesting the strong emphasis placed upon achieving cross-community multimedia data integration, two new initiatives have been established.


A Survey of Semantic Image and Video Annotation Tools

S. Dasiopoulou, E. Giannakidou, G. Litos, P. Malasioti, and I. Kompatsiaris
Multimedia Knowledge Laboratory, Informatics and Telematics Institute,
Centre for Research and Technology Hellas
{dasiop,igiannak,litos,xenia,ikom}@iti.gr

Abstract. The availability of semantically annotated image and video assets constitutes a critical prerequisite for the realisation of intelligent knowledge management services pertaining to realistic user needs. Given the extent of the challenges involved in the automatic extraction of such descriptions, manually created metadata play a significant role, further strengthened by their deployment in training and evaluation tasks related to the automatic extraction of content descriptions. The different views taken by the two main approaches towards semantic content description, namely the Semantic Web and MPEG-7, as well as the traits particular to multimedia content due to the multiplicity of information levels involved, have resulted in a variety of image and video annotation tools, adopting varying description aspects. Aiming to provide a common framework of reference and furthermore to highlight open issues, especially with respect to the coverage and the interoperability of the produced metadata, in this chapter we present an overview of the state of the art in image and video annotation tools.

1 Introduction

Accessing multimedia content in correspondence with the meaning it conveys to a user constitutes the core challenge in multimedia research, commonly referred to as the semantic gap [1]. The current state of the art in automatic content analysis and understanding supports in many cases the successful detection of semantic concepts, such as persons, buildings, natural scenes vs manmade scenes, etc., at a satisfactory level of accuracy; however, the attained performance remains highly variable when considering general domains, or when increasing, even slightly, the number of supported concepts [2–4]. As a consequence, the manual generation of content descriptions holds an important role towards the realisation of intelligent content management services. This significance is further strengthened by the need for manually constructed descriptions in automatic content analysis, both for evaluation as well as for training purposes, when learning based on pre-annotated examples is used.

The availability of semantic descriptions though is not adequate per se for the effective management of multimedia content. Fundamental to information sharing, exchange and reuse is the interoperability of the descriptions at both syntactic and semantic levels, i.e. regarding the valid structuring of the descriptions and the endowed meaning respectively. Besides the general prerequisite for interoperability, additional requirements arise from the multiple levels at which multimedia content can be represented, including structural and low-level feature information. Further description levels issue from more generic aspects such as authoring & access control, navigation, and user history & preferences. The strong relation of structural and low-level feature information to the tasks involved in the automatic analysis of visual content, as well as to retrieval services, such as transcoding, content-based search, etc., brings these two dimensions to the foreground, along with the subject matter descriptions.

Two initiatives prevail in the efforts towards machine processable semantic content metadata: the Semantic Web activity of the W3C (http://www.w3.org/2001/sw/) and ISO's Multimedia Content Description Interface (MPEG-7, http://www.chiariglione.org/mpeg/) [5, 6], delineating corresponding approaches with respect to multimedia semantic annotation [7, 8]. Through a layered architecture of successively increased expressivity, the Semantic Web (SW) advocates formal semantics and reasoning through logically grounded meaning. The respective rule and ontology languages embody the general mechanisms for capturing, representing and reasoning with semantics; they do not capture application specific knowledge. In contrast, MPEG-7 addresses specifically the description of audiovisual content and comprises not only the representation language, in the form of the Description Definition Language (DDL), but also specific, media and domain, definitions; thus from a SW perspective, MPEG-7 serves the twofold role of a representation language and a domain specific ontology.

Overcoming the syntactic and semantic interoperability issues between MPEG-7 and the SW has been the subject of very active research in the current decade, highly motivated by the complementary aspects characterising the two aforementioned metadata initiatives: media specific, yet not formal, semantics on one hand, and general mechanisms for logically grounded semantics on the other hand. A number of so-called multimedia ontologies [9–13] have been proposed in an attempt to add formal semantics to MPEG-7 descriptions and thereby enable linking with existing ontologies and the semantic management of existing MPEG-7 metadata repositories. Furthermore, initiatives such as the W3C Multimedia Annotation on the Semantic Web Taskforce (http://www.w3.org/2001/sw/BestPractices/MM/), the W3C Multimedia Semantics Incubator Group (http://www.w3.org/2005/Incubator/mmsem/) and the Common Multimedia Ontology Framework (http://www.acemedia.org/aceMedia/reference/multimedia ontology/index.html) have been established to address the technologies, advantages and open issues related to the creation, storage, manipulation and processing of multimedia semantic metadata.

In this chapter, bearing in mind the significance of manual image and video annotation in combination with the different possibilities afforded by the SW and MPEG-7 initiatives, we present a detailed overview of the most well known manual annotation tools, addressing both functionality aspects, such as coverage & granularity of annotations, as well as interoperability concerns with respect to the supported annotation vocabularies and representation languages. Interoperability though does not address solely the harmonisation between the SW and MPEG-7 initiatives; a significant number of tools, especially regarding video annotation, follow customised approaches, aggravating the challenges. As such, this survey serves a twofold role; it provides a common framework for reference and comparison purposes, while highlighting issues pertaining to the communication, sharing and reuse of the produced metadata.

The rest of the chapter is organised as follows. Section 2 describes the criteria along which the assessment and comparison of the examined annotation tools is performed. Sections 3 and 4 discuss the individual image and video tools respectively, while Section 5 concludes the paper, summarising the resulting observations and open issues.

2 Semantic Image and Video Annotation

Image and video assets constitute extremely rich information sources, ubiquitous in a wide variety of diverse applications and tasks related to information management, both for personal and professional purposes. Inevitably, the value of the endowed information depends on the effectiveness and efficiency with which it can be accessed and managed. This is where semantic annotation comes in, as it designates the schemes for capturing the information related to the content.

As already indicated, two crucial requirements featuring content annotation are the interoperability of the created metadata and the ability to automatically process them. The former encompasses the capacity to share and reuse annotations, and by consequence determines the level of seamless content utilisation and the benefits issued from the annotations made available; the latter is vital to the realisation of intelligent content management services. Towards their accomplishment, the existence of commonly agreed vocabularies and syntax, and respectively of commonly agreed semantics and interpretation mechanisms, are essential elements.

Within the context of visual content, these general prerequisites incur more specific conditions issuing from the particular traits of image and video assets. Visual content semantics, as multimedia semantics in general, comes in a multilayered, intertwined fashion [14, 15]. It encompasses, amongst others, thematic descriptions addressing the subject matter depicted (scene categorisation, objects, events, etc.), media descriptions referring to low-level features and related information such as the algorithms used for their extraction, respective parameters, etc., as well as structural descriptions addressing the decomposition of content into constituent segments and the spatiotemporal configuration of these segments. As in this chapter semantic annotation is investigated mostly with respect to content retrieval and analysis tasks, aspects addressing concerns related to authoring, access and privacy, and so forth, are only shallowly treated.

Fig. 1. Multi-layer image semantics.

Figure 1 shows such an example, illustrating subject matter descriptions such as “Sky” and “Pole Vaulter, Athlete”, structural descriptions such as the three identified regions and the spatial configuration between two of them (i.e. region2 above region3), and the ScalableColour and RegionShape descriptor values extracted for two regions. The different layers correspond to different annotation dimensions and serve different purposes, further differentiated by the individual application context. For example, for a search and retrieval service regarding a device of limited resources (e.g. PDA, mobile phone), content management becomes more effective if specific temporal parts of a video can be returned to a query rather than the whole video asset, leaving the user with the cumbersome task of browsing through it, till reaching the relative parts and assessing if they satisfy her query.
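
The sketch below shows, in Python with rdflib, one hypothetical way the layers of Figure 1 could be expressed in RDF: structural regions, the spatial relation between them, subject matter labels, and a low-level descriptor value. The ex: namespace and all property names are illustrative assumptions rather than the paper's or any tool's actual schema.

```python
# A minimal, hypothetical sketch of how the multi-layered description of
# Figure 1 could be captured in RDF. The namespace, class and property names
# (ex:Region, ex:above, ex:depicts, ex:scalableColour) are illustrative only.
from rdflib import Graph, Namespace, Literal, RDF, URIRef

EX = Namespace("http://example.org/annotation#")

g = Graph()
g.bind("ex", EX)

image = URIRef("http://example.org/images/polevault.jpg")
region2 = EX["region2"]
region3 = EX["region3"]

# Structural layer: the image is decomposed into regions.
for region in (region2, region3):
    g.add((region, RDF.type, EX.Region))
    g.add((image, EX.hasRegion, region))
# Spatial configuration between regions (region2 above region3).
g.add((region2, EX.above, region3))

# Subject matter layer: thematic labels attached to regions.
g.add((region2, EX.depicts, EX.Sky))
g.add((region3, EX.depicts, EX.PoleVaulter))

# Media layer: an illustrative low-level descriptor value for a region.
g.add((region3, EX.scalableColour, Literal("12 4 33 0")))

print(g.serialize(format="turtle"))
```
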
The aforementioned considerations intertwine, establishing a number of dimensions and corresponding criteria along which image and video annotation can be characterised. As such, interoperability, explicit semantics in terms of liability to automated processing, and reuse apply both to all types of description dimensions and to their interlinking, and not only to subject matter descriptions, as is the common case for textual content resources.

In the following, we describe the criteria along which we overview the different annotation tools in order to assess them with respect to the aforementioned considerations. Criteria addressing concerns of similar nature have been grouped together, resulting in three categories.

2.1 Input & Output

This category includes criteria regarding the way the tool interacts in terms of requested / supported input and the output produced.

Annotation Vocabulary. Refers to whether the annotation is performed according to a predefined set of terms (e.g. lexicon / thesaurus, taxonomy, ontology) or if it is provided by the user in the form of keywords and free text. In the case of a controlled vocabulary, we differentiate the case where the user has to explicitly provide it (e.g. as when uploading a specific ontology) from the case where it is provided by the tool as a built-in; the formalisms supported for the representation of the vocabulary constitute a further attribute. We note that annotation vocabularies may refer not only to subject matter descriptions, but also to media and structural descriptions. Naturally, the more formal and well-defined the semantics of the annotation vocabulary, the more opportunities for achieving interoperable and machine understandable annotations.

Metadata Format. Considers the representation format in which the produced annotations are expressed. Naturally, the output format is strongly related to the supported annotation vocabularies. As will be shown in the sequel though, where the individual tools are described, there is not necessarily a strict correspondence (e.g. a tool may use an RDFS (http://www.w3.org/TR/rdf-schema/) or OWL (http://www.w3.org/TR/owl-features/) ontology as the subject matter vocabulary, and yet output annotations in RDF (http://www.w3.org/RDF/)). The format is as significant as the annotation vocabulary with respect to the interoperability and sharing of the annotations.

Content Type. Refers to the supported image/video formats, e.g. jpg, png, mpeg, etc.
2.2 Annotation Level

This category addresses attributes of the annotations per se. Naturally, the types of information addressed by the descriptions issue from the intended context of usage. Subject matter annotations, i.e. thematic descriptions with respect to the depicted objects and events, are indispensable for any application scenario addressing content-based retrieval at the level of meaning conveyed. Such retrieval may address concept-based queries or queries involving relations between concepts, entailing respective annotation specifications. Structural information is crucial for services where it is important to know the exact content parts associated with specific thematic descriptions, as for example in the case of semantic transcoding or enhanced retrieval and presentation, where the parts of interest can be indicated in an elaborated manner. Analogously, annotations intended for

Citations
Proceedings ArticleDOI
20 Nov 2014
TL;DR: A variety of possible shot type characterizations that can be assigned in a single video frame or still image are studied and ways to propagate these characterizations to a video segment or to an entire shot are discussed.
Abstract: Due to the enormous increase of video and image content on the web in the last decades, automatic video annotation became a necessity. The successful annotation of video and image content facilitate a successful indexing and retrieval in search databases. In this work we study a variety of possible shot type characterizations that can be assigned in a single video frame or still image. Possible ways to propagate these characterizations to a video segment (or to an entire shot) are also discussed. A method for the detection of Over-the-Shoulder shots in 3D (stereo) video is also proposed.

9 citations


Cites background from "A survey of semantic image and vide..."

  • ...For this reason a variety of annotation tools has been developed [1]....


Journal ArticleDOI
TL;DR: A literature review was conducted with the aim of investigating the most popular methods employed to produce photo annotations and identified that People, Location, and Event are the most important features of photo annotation.
Abstract: Due to the large number of photos that are currently being generated, it is very important to have techniques to organize, search for, and retrieve such images. Photo annotation plays a key role in these mechanisms because it can link raw data (photos) to specific information that is essential for human beings to handle large amounts of content. However, the generation of photo annotation is still a difficult problem to solve as part of a well-known challenge called the semantic gap. In this paper, a literature review was conducted with the aim of investigating the most popular methods employed to produce photo annotations. Based on the papers surveyed, we identified that People (“Who?”), Location (“Where?”), and Event (“Where? When?”) are the most important features of photo annotation. We also established comparisons between similar photo annotation methods, highlighting key aspects of the most commonly used approaches. Moreover, we provide an overview of a general photo annotation process and present the main aspects of photo annotation representation comprising formats, context of usage, advantages and disadvantages. Finally, we discuss ways to improve photo annotation methods and present some future research guidelines.

9 citations

Dissertation
10 Dec 2014
TL;DR: An hybrid technique using content-based and collaborative filtering paradigms is used in order to attain an accurate model for recommendation, under the strain of mechanisms designed to keep user privacy, particularly designed to reduce the user exposure risk.
Abstract: The main objective of this thesis is to propose a recommendation method that keeps in mind the privacy of users as well as the scalability of the system. To achieve this goal, an hybrid technique using content-based and collaborative filtering paradigms is used in order to attain an accurate model for recommendation, under the strain of mechanisms designed to keep user privacy, particularly designed to reduce the user exposure risk. The thesis contributions are threefold : First, a Collaborative Filtering model is defined by using client-side agent that interacts with public information about items kept on the recommender system side. Later, this model is extended into an hybrid approach for recommendation that includes a content-based strategy for content recommendation. Using a knowledge model based on keywords that describe the item domain, the hybrid approach increases the predictive performance of the models without much computational effort on the cold-start setting. Finally, some strategies to improve the recommender system's provided privacy are introduced: Random noise generation is used to limit the possible inferences an attacker can make when continually observing the interaction between the client-side agent and the server, and a blacklisted strategy is used to refrain the server from learning interactions that the user considers violate her privacy. The use of the hybrid model mitigates the negative impact these strategies cause on the predictive performance of the recommendations.

9 citations

Proceedings ArticleDOI
03 Dec 2012
TL;DR: The novelty of SVCAT lies in its automatic propagation of the object localization and description metadata realized by tracking their contour through the video, thus drastically alleviating the task of the annotator.
Abstract: A vital prerequisite for fine-grained video content processing (indexing, querying, retrieval, adaptation, etc.) is the production of accurate metadata describing its structure and semantics. Several annotation tools were presented in the literature generating metadata at different granularities (i.e. scenes, shots, frames, objects). These tools have a number of limitations with respect to the annotation of objects. Though they provide functionalities to localize and annotate an object in a frame, the propagation of this information in the next frames still requires human intervention. Furthermore, they are based on video models that lack expressiveness along the spatial and semantic dimensions. To address these shortcomings, we propose the Semantic Video Content Annotation Tool (SVCAT) for structural and high-level semantic annotation. SVCAT is a semi-automatic annotation tool compliant with the MPEG-7 standard, which produces metadata according to an object-based video content model described in this paper. In particular, the novelty of SVCAT lies in its automatic propagation of the object localization and description metadata realized by tracking their contour through the video, thus drastically alleviating the task of the annotator. Experimental results show that SVCAT provides accurate metadata to object-based applications, particularly exact contours of multiple deformable objects.

8 citations


Additional excerpts

  • ...Many video annotation tools have been proposed in the literature [5]....


Journal ArticleDOI
TL;DR: It transpires that the abstract representations provide a basic description that enables the user to perform a subset of the desired queries, however, a more complex depiction is required for this use case.
Abstract: Biomedical images and models contain vast amounts of information. Regrettably, much of this information is only accessible by domain experts. This paper describes a biological use case in which this situation occurs. Motivation is given for describing images, from this use case, semantically. Furthermore, links are provided to the medical domain, demonstrating the transferability of this work. Subsequently, it is shown that a semantic representation in which every pixel is featured is needlessly expensive. This motivates the discussion of more abstract renditions, which are dealt with next. As part of this, the paper discusses the suitability of existing technologies. In particular, Region Connection Calculus and one implementation of the W3C Geospatial Vocabulary are considered. It transpires that the abstract representations provide a basic description that enables the user to perform a subset of the desired queries. However, a more complex depiction is required for this use case.

8 citations

References
Journal ArticleDOI
TL;DR: The working conditions of content-based retrieval: patterns of use, types of pictures, the role of semantics, and the sensory gap are discussed, as well as aspects of system engineering: databases, system architecture, and evaluation.
Abstract: Presents a review of 200 references in content-based image retrieval. The paper starts with discussing the working conditions of content-based retrieval: patterns of use, types of pictures, the role of semantics, and the sensory gap. Subsequent sections discuss computational steps for image retrieval systems. Step one of the review is image processing for retrieval sorted by color, texture, and local geometry. Features for retrieval are discussed next, sorted by: accumulative and global features, salient points, object and shape features, signs, and structural combinations thereof. Similarity of pictures and objects in pictures is reviewed for each of the feature types, in close connection to the types and means of feedback the user of the systems is capable of giving by interaction. We briefly discuss aspects of system engineering: databases, system architecture, and evaluation. In the concluding section, we present our view on: the driving force of the field, the heritage from computer vision, the influence on computer vision, the role of similarity and of interaction, the need for databases, the problem of evaluation, and the role of the semantic gap.

6,447 citations

Journal ArticleDOI
TL;DR: In this article, a large collection of images with ground truth labels is built to be used for object detection and recognition research, such data is useful for supervised learning and quantitative evaluation.
Abstract: We seek to build a large collection of images with ground truth labels to be used for object detection and recognition research. Such data is useful for supervised learning and quantitative evaluation. To achieve this, we developed a web-based tool that allows easy image annotation and instant sharing of such annotations. Using this annotation tool, we have collected a large dataset that spans many object categories, often containing multiple instances over a wide variety of images. We quantify the contents of the dataset and compare against existing state of the art datasets used for object recognition and detection. Also, we show how to extend the dataset to automatically enhance object labels with WordNet, discover object parts, recover a depth ordering of objects in a scene, and increase the number of labels using minimal user supervision and images from the web.

3,501 citations

Book
17 Sep 2004
TL;DR: Adaptive Resonance Theory (ART) neural networks model real-time prediction, search, learning, and recognition, and design principles derived from scientific analyses and design constraints imposed by targeted applications have jointly guided the development of many variants of the basic networks.
Abstract: Adaptive Resonance Theory (ART) neural networks model real-time prediction, search, learning, and recognition. ART networks function both as models of human cognitive information processing [1,2,3] and as neural systems for technology transfer [4]. A neural computation central to both the scientific and the technological analyses is the ART matching rule [5], which models the interaction between topdown expectation and bottom-up input, thereby creating a focus of attention which, in turn, determines the nature of coded memories. Sites of early and ongoing transfer of ART-based technologies include industrial venues such as the Boeing Corporation [6] and government venues such as MIT Lincoln Laboratory [7]. A recent report on industrial uses of neural networks [8] states: “[The] Boeing ... Neural Information Retrieval System is probably still the largest-scale manufacturing application of neural networks. It uses [ART] to cluster binary templates of aeroplane parts in a complex hierarchical network that covers over 100,000 items, grouped into thousands of self-organised clusters. Claimed savings in manufacturing costs are in millions of dollars per annum.” At Lincoln Lab, a team led by Waxman developed an image mining system which incorporates several models of vision and recognition developed in the Boston University Department of Cognitive and Neural Systems (BU/CNS). Over the years a dozen CNS graduates (Aguilar, Baloch, Baxter, Bomberger, Cunningham, Fay, Gove, Ivey, Mehanian, Ross, Rubin, Streilein) have contributed to this effort, which is now located at Alphatech, Inc. Customers for BU/CNS neural network technologies have attributed their selection of ART over alternative systems to the model's defining design principles. In listing the advantages of its THOT technology, for example, American Heuristics Corporation (AHC) cites several characteristic computational capabilities of this family of neural models, including fast on-line (one-pass) learning, “vigilant” detection of novel patterns, retention of rare patterns, improvement with experience, “weights [which] are understandable in real world terms,” and scalability (www.heuristics.com). Design principles derived from scientific analyses and design constraints imposed by targeted applications have jointly guided the development of many variants of the basic networks, including fuzzy ARTMAP [9], ART-EMAP [10], ARTMAP-IC [11],

1,745 citations

Book
01 Jun 2002
TL;DR: This book has been designed as a unique tutorial in the new MPEG 7 standard covering content creation, content distribution and content consumption, and presents a comprehensive overview of the principles and concepts involved in the complete range of Audio Visual material indexing, metadata description, information retrieval and browsing.
Abstract: From the Publisher: The MPEG standards are an evolving set of standards for video and audio compression. MPEG 7 technology covers the most recent developments in multimedia search and retreival, designed to standardise the description of multimedia content supporting a wide range of applications including DVD, CD and HDTV. Multimedia content description, search and retrieval is a rapidly expanding research area due to the increasing amount of audiovisual (AV) data available. The wealth of practical applications available and currently under development (for example, large scale multimedia search engines and AV broadcast servers) has lead to the development of processing tools to create the description of AV material or to support the identification or retrieval of AV documents. Written by experts in the field, this book has been designed as a unique tutorial in the new MPEG 7 standard covering content creation, content distribution and content consumption. At present there are no books documenting the available technologies in such a comprehensive way. Presents a comprehensive overview of the principles and concepts involved in the complete range of Audio Visual material indexing, metadata description, information retrieval and browsingDetails the major processing tools used for indexing and retrieval of images and video sequencesIndividual chapters, written by experts who have contributed to the development of MPEG 7, provide clear explanations of the underlying tools and technologies contributing to the standardDemostration software offering step-by-step guidance to the multi-media system components and eXperimentation model (XM) MPEG reference softwareCoincides with the release of the ISO standard in late 2001. A valuable reference resource for practising electronic and communications engineers designing and implementing MPEG 7 compliant systems, as well as for researchers and students working with multimedia database technology.

1,301 citations

Book ChapterDOI
01 Oct 2002
TL;DR: This paper introduces the DOLCE upper level ontology, the first module of a Foundational Ontologies Library being developed within the WonderWeb project, and suggests that such analysis could hopefully lead to an ?
Abstract: In this paper we introduce the DOLCE upper level ontology, the first module of a Foundational Ontologies Library being developed within the WonderWeb project. DOLCE is presented here in an intuitive way; the reader should refer to the project deliverable for a detailed axiomatization. A comparison with WordNet's top-level taxonomy of nouns is also provided, which shows how DOLCE, used in addition to the OntoClean methodology, helps isolating and understanding some major WordNet?s semantic limitations. We suggest that such analysis could hopefully lead to an ?ontologically sweetened? WordNet, meant to be conceptually more rigorous, cognitively transparent, and efficiently exploitable in several applications.

1,100 citations