Book ChapterDOI

A survey of semantic image and video annotation tools

TL;DR: This chapter presents an overview of the state of the art in image and video annotation tools to provide a common framework of reference and to highlight open issues, especially with respect to the coverage and the interoperability of the produced metadata.
Abstract: The availability of semantically annotated image and video assets constitutes a critical prerequisite for the realisation of intelligent knowledge management services pertaining to realistic user needs. Given the extent of the challenges involved in the automatic extraction of such descriptions, manually created metadata play a significant role, further strengthened by their deployment in training and evaluation tasks related to the automatic extraction of content descriptions. The different views taken by the two main approaches towards semantic content description, namely the Semantic Web and MPEG-7, as well as the traits particular to multimedia content due to the multiplicity of information levels involved, have resulted in a variety of image and video annotation tools, adopting varying description aspects. Aiming to provide a common framework of reference and furthermore to highlight open issues, especially with respect to the coverage and the interoperability of the produced metadata, in this chapter we present an overview of the state of the art in image and video annotation tools.

Summary (6 min read)

1 Introduction

  • Accessing multimedia content in correspondence with the meaning it conveys to a user constitutes the core challenge in multimedia research, commonly referred to as the semantic gap [1].
  • This significance is further strengthened by the need for manually constructed descriptions in automatic content analysis, both for evaluation and for training purposes, when learning based on pre-annotated examples is used.
  • Fundamental to information sharing, exchange and reuse, is the interoperability of the descriptions at both syntactic and semantic levels, i.e. regarding the valid structuring of the descriptions and the endowed meaning respectively.
  • The strong relation of structural and low-level feature information to the tasks involved in the automatic analysis of visual content, as well as to retrieval services, such as transcoding, content-based search, etc., brings these two dimensions to the foreground, along with the subject matter descriptions.
  • A number of so-called multimedia ontologies [9–13] have been proposed in an attempt to add formal semantics to MPEG-7 descriptions and thereby enable linking with existing ontologies and the semantic management of existing MPEG-7 metadata repositories.

2 Semantic Image and Video Annotation

  • Image and video assets constitute extremely rich information sources, ubiquitous in a wide variety of diverse applications and tasks related to information management, both for personal and professional purposes.
  • Inevitably, the value of the endowed information depends on the effectiveness and efficiency with which it can be accessed and managed.
  • The former encompasses the capacity to share and reuse annotations, and by consequence determines the level of seamless content utilisation and the benefits issued from the annotations made available; the latter is vital to the realisation of intelligent content management services.
  • Towards their accomplishment, the existence of commonly agreed vocabularies and syntax, and respectively of commonly agreed semantics and interpretation mechanisms, is an essential element.
  • The aforementioned considerations intertwine, establishing a number of dimensions and corresponding criteria along which image and video annotation can be characterised.

2.1 Input & Output

  • This category includes criteria regarding the way the tool interacts in terms of requested / supported input and the output produced.
  • The authors note that annotation vocabularies may refer not only to subject matter descriptions, but also to media and structural descriptions.
  • As will be shown in the sequel though, where the individual tools are described, there is not necessarily a strict correspondence (e.g. a tool may use an RDFS or OWL ontology as the subject matter vocabulary, and yet output annotations in plain RDF; see the sketch after this list).
  • The format is as significant as the annotation vocabulary with respect to the interoperability and sharing of the annotations.
  • Content Type: Refers to the supported image/video formats, e.g. jpg, png, mpeg, etc.
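To make the distinction between vocabulary and output format concrete, the following minimal sketch (Python with rdflib) builds a single subject matter annotation: the concept is drawn from a hypothetical OWL athletics ontology, yet the produced metadata is plain RDF. The namespaces, image URI and concept name are assumptions for illustration, not taken from any of the surveyed tools.

    from rdflib import Graph, Namespace, RDF

    # Hypothetical namespaces: a domain (OWL) ontology and a photo collection.
    DOMAIN = Namespace("http://example.org/athletics-ontology#")
    PHOTOS = Namespace("http://example.org/photos/")
    FOAF = Namespace("http://xmlns.com/foaf/0.1/")

    g = Graph()
    image = PHOTOS["img001.jpg"]

    # The subject matter term (DOMAIN.PoleVaulter) comes from the ontology,
    # while the annotation itself is an ordinary RDF triple.
    g.add((image, RDF.type, FOAF.Image))
    g.add((image, FOAF.depicts, DOMAIN.PoleVaulter))

    print(g.serialize(format="turtle"))

Serialising the same graph as RDF/XML instead of Turtle would change only the metadata format, not the vocabulary, which is precisely the distinction the above criteria aim to capture.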

2.2 Annotation Level

  • This category addresses attributes of the annotations per se.
  • Such retrieval may address concept-based queries or queries involving relations between concepts, entailing respective annotation specifications.
  • To capture the aforementioned considerations, the following criteria have been used.
  • For video assets, annotation may refer to the entire video, temporal segments, frames (temporal segments with zero duration), regions within frames, or even to moving regions, i.e. a region followed for a sequence of frames (see the sketch after this list).
  • Refers to the level of expressivity supported with respect to the annotation vocabulary.
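The granularity levels listed above can be captured by a very small data structure; the sketch below is plain Python with invented class and field names, intended only to make the distinctions explicit rather than to mirror the internal model of any surveyed tool.

    from dataclasses import dataclass, field
    from typing import List, Optional, Tuple

    @dataclass
    class Region:
        """A polygonal region on a single frame, given as (x, y) control points."""
        points: List[Tuple[int, int]]

    @dataclass
    class VideoAnnotation:
        """Illustrative container for one annotation at a chosen granularity."""
        label: str                                  # e.g. an ontology concept name
        start: Optional[float] = None               # seconds; None means whole video
        end: Optional[float] = None                 # equal to start means a single frame
        regions: List[Tuple[float, Region]] = field(default_factory=list)
        # (timestamp, region) pairs: one entry is a still region on a frame,
        # several entries form a moving region tracked over consecutive frames.

    whole_video = VideoAnnotation(label="PoleVaultEvent")
    segment = VideoAnnotation(label="Jump", start=12.0, end=14.5)
    moving_region = VideoAnnotation(
        label="PoleVaulter", start=12.0, end=12.2,
        regions=[(12.0, Region([(10, 40), (60, 40), (60, 120), (10, 120)])),
                 (12.1, Region([(14, 40), (64, 40), (64, 120), (14, 120)])),
                 (12.2, Region([(18, 41), (68, 41), (68, 121), (18, 121)]))])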

2.3 Miscellaneous

  • This category summarises additional criteria that do not fall under the previous dimensions.
  • The considered aspects relate mostly to attributes of the tool itself rather than of the annotation process.
  • As such, and given the scope of this chapter, in the description of the individual tools that follows in the two subsequent Sections, these criteria are treated very briefly.
  • – Application Type: Specifies whether the tool constitutes a web-based or a stand-alone application.
  • – Licence: Specifies the kind of licence under which the tool is made available, e.g. open source, etc.
  • – Collaboration: Specifies whether the tool supports concurrent annotations (referring to the same media object) by multiple users or not.

3 Tools for Semantic Image Annotation

  • In this Section the authors describe prominent semantic image annotation tools with respect to the dimensions and criteria outlined in Section 2.
  • As will be illustrated in the following, Semantic Web technologies have permeated to a considerable degree the representation of metadata, with the majority of tools supporting ontology-based subject matter descriptions, while a considerable share of them adopts ontological representation for structural annotations as well.
  • In order to provide a relative ranking with respect to SW compatibility, the authors order the tools according to the extent to which the produced annotations bear formal semantics.

3.1 KAT

  • The K-Space Annotation Tool (KAT), developed within the K-Space project, implements an ontology-based framework for the semantic annotation of images.
  • COMM extends the Descriptions & Situations (D&S) and Ontology of Information Objects (OIO) design patterns of DOLCE [17, 18], while incorporating re-engineered definitions of MPEG-7 description tools [19, 20].
  • The latter are strictly concept-based, i.e. considering the aforementioned annotation example it is not possible to annotate the pole as being next to the pole vaulter, and may refer to the entire image or to specific regions of it.
  • The localisation of image regions is performed manually, using either the rectangle or the polygon drawing tool.
  • Furthermore, the COMM-based annotation scheme makes it quite straightforward to extend the annotation dimensions supported by KAT.

3.2 PhotoStuff

  • PhotoStuff, developed by the Mindswap group, is an ontology-based image annotation tool that supports the generation of semantic image descriptions with respect to the employed ontologies.
  • PhotoStuff [21] addresses primarily two types of metadata, namely descriptive and structural.
  • Regarding descriptive annotations, the user may load one or multiple domain-specific ontologies from the web or from the local hard drive, while with respect to structural annotations, two internal ontologies, hidden from the user, are used: the Digital-Media ontology and the Technical one.
  • Neither the representation nor the extraction of such descriptors is addressed.
  • Notably, annotations may refer not only to concept instantiations, but also to relations between concept instances already identified in an image.

3.3 AktiveMedia

  • AktiveMedia, developed within the AKT and X-Media projects, is an ontology-based cross-media annotation system addressing text and image assets.
  • In image annotation mode, AktiveMedia supports descriptive metadata with respect to user selected ontologies, stored in the local hard drive [22].
  • Annotations can refer to image or region level.
  • This contrasts with PhotoStuff, which uses the Dublin Core element set (http://dublincore.org/documents/dces/).
  • As such, the semantics of the generated RDF metadata, i.e. the annotation semantics as entailed by the respective ontology definitions, are not direct but require additional processing to retrieve and to reason over.

3.5 Caliph

  • Caliph is an MPEG-7 based image annotation tool that supports all types of MPEG-7 metadata, including descriptive, structural, authoring and low-level visual descriptor annotations.
  • In combination with Emir, Caliph supports content-based retrieval of images using MPEG-7 descriptions.
  • Figure 6 illustrates two screenshots corresponding to the generic image information and the semantic annotation tabs.
  • The descriptions may be either in the form of free text or structured, in accordance with the SemanticBase description tools provided by MPEG-7 (i.e. Agents, Events, Time, Place and Object annotations [26]); a simplified sketch of such a description follows this list.
  • The so-called semantic tab allows for the latter, offering a graph-based interface.
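As a rough illustration of what a structured description along these lines looks like, the snippet below hand-builds a heavily simplified fragment with Python's xml.etree. The element names only loosely follow the MPEG-7 SemanticBase description tools and the output is not claimed to validate against the MPEG-7 schema.

    import xml.etree.ElementTree as ET

    # Simplified, MPEG-7-flavoured semantic description: an agent, an event
    # and a place, each carrying a label. Indicative only, not schema-valid.
    semantic = ET.Element("Semantic")

    agent = ET.SubElement(semantic, "SemanticBase", {"type": "AgentObjectType", "id": "agent-1"})
    ET.SubElement(ET.SubElement(agent, "Label"), "Name").text = "Pole vaulter"

    event = ET.SubElement(semantic, "SemanticBase", {"type": "EventType", "id": "event-1"})
    ET.SubElement(ET.SubElement(event, "Label"), "Name").text = "Jump"

    place = ET.SubElement(semantic, "SemanticBase", {"type": "SemanticPlaceType", "id": "place-1"})
    ET.SubElement(ET.SubElement(place, "Label"), "Name").text = "Stadium"

    print(ET.tostring(semantic, encoding="unicode"))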

3.6 SWAD

  • SWAD is an RDF-based image annotation tool that was developed within the SWAD-Europe project.
  • The latter ran from May 2002 to October 2004 and aimed to support the Semantic Web initiative in Europe through targeted research, demonstrations and outreach activities.
  • The authors chose to provide a very brief description here for the purpose of illustrating image annotation in the Semantic Web as envisaged and realised by that time, as a reference and comparison point for the various image annotation tools that have been developed afterwards.
  • Licensing information is provided in the respective SWAD deliverable.
  • When entering a keyword description, the respective WordNet hierarchy is shown to the user, assisting her in determining the appropriateness of the keyword and in selecting more accurate descriptions.

3.7 LabelMe

  • LabelMe is a database and web-based image annotation tool, aiming to contribute to the creation of large annotated image databases for evaluation and training purposes [28].
  • It contains all images from the MIT CSAIL database, in addition to a large number of user uploaded images.
  • LabelMe [28] supports descriptive metadata addressing in principle region-based annotation.
  • Specifically, the user defines a polygon enclosing the annotated object through a set of control points (see the sketch after this list).
  • Its focus on requirements related to object recognition research, rather than image search and retrieval, entails different notions regarding the utilisation, sharing and purpose of annotation.
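The polygon-through-control-points idea maps onto a compact XML record; the sketch below (Python, xml.etree) approximates the general shape of a LabelMe annotation, i.e. an object name plus a list of points, though the exact element names of the LabelMe schema should be treated as an assumption here.

    import xml.etree.ElementTree as ET

    # LabelMe-style record: one object delimited by polygon control points.
    annotation = ET.Element("annotation")
    ET.SubElement(annotation, "filename").text = "img001.jpg"

    obj = ET.SubElement(annotation, "object")
    ET.SubElement(obj, "name").text = "pole vaulter"
    polygon = ET.SubElement(obj, "polygon")
    for x, y in [(120, 80), (180, 80), (185, 240), (115, 240)]:  # control points
        pt = ET.SubElement(polygon, "pt")
        ET.SubElement(pt, "x").text = str(x)
        ET.SubElement(pt, "y").text = str(y)

    print(ET.tostring(annotation, encoding="unicode"))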

3.8 Application-specific Image Annotation Tools

  • Apart from the previously described semantic image annotation tools, a variety of application-specific tools are available.
  • Some of them relate to Web 2.0 applications addressing tagging and sharing of content among social groups, while others focus on particular application domains, such as medical imaging, that impose additional specifications pertaining to the individual application context.
  • Utilising radiology specific ontologies, iPad enhances the annotation procedure by suggesting more specific terms and by identifying incomplete descriptions and subsequently prompting for missing parts in the description (e.g. “enlarged” is flagged as incomplete while “enlarged liver” is acceptable).
  • The produced descriptions are in RDF/XML following a proprietary schema that models the label constituting the tag, its position (the label constitutes a rectangular region in itself), and the position of the rectangle that encloses the annotated region in the form of the top-left point coordinates and width and height information (a mock-up of such a record follows this list).
  • Furthermore, general information about the image is included, such as image size, number of regions annotated, etc. Oriented towards Web 2.0, FotoTagger places significant focus on social aspects pertaining to content management, allowing among others to publish tagged images to blogs and to upload/download tagged images to/from Flickr, while maintaining both FotoTagger's and Flickr's descriptions.
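Since the FotoTagger schema is proprietary, the mock-up below only mirrors the fields described above: the label (itself a small rectangle), the enclosing rectangle given by its top-left point plus width and height, and general image information. Every field name is invented for illustration.

    # Plain-Python mock-up of a FotoTagger-like tagged image record;
    # all field names are hypothetical.
    tagged_image = {
        "image": {"file": "img001.jpg", "width": 1024, "height": 768,
                  "regions_annotated": 1},
        "tags": [
            {
                "label": "pole vaulter",
                # the label occupies a small rectangle of its own on the image
                "label_box": {"x": 130, "y": 60, "width": 90, "height": 18},
                # rectangle enclosing the annotated region: top-left point + size
                "region_box": {"x": 115, "y": 80, "width": 70, "height": 160},
            }
        ],
    }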

3.9 Discussion

  • The aforementioned overview reveals that the utilisation of Semantic Web languages for the representation, interchange and processing of image metadata has permeated semantic image annotation.
  • The choice of a standard representation shows the importance placed on creating content descriptions that can be easily exchanged and reused across heterogeneous applications, and works like [10, 11, 30] provide bridges between MPEG-7 metadata on the one hand and the Semantic Web and existing ontologies on the other.
  • Thus, unlike subject matter descriptions, where a user can choose which vocabulary to use (in the form of a domain ontology, a lexicon or user-provided keywords), structural descriptions are tool-specific.
  • Summing up, the choice of a tool depends primarily on the intended context of usage, which provides the specifications regarding the annotation dimensions supported, and subsequently on the desired formality of annotations, again related to a large extent to the application context.
  • Thus, for semantic retrieval purposes, where semantic refers to the SW perspective, KAT, PhotoStuff, SWAD and AktiveMedia would be the more appropriate choices.

4 Tools for Semantic Video Annotation

  • The increase in the amount of video data deployed and used in today’s applications not only caused video to draw increased attention as a content type, but also introduced new challenges in terms of effective content management.
  • In the following the authors survey typical video annotation tools, highlighting their features with respect to the criteria delineated in Section 2.
  • In the latter category fall tools such as VIDETO, Ricoh Movie Tool, or LogCreator.
  • It is interesting to note that the majority of these tools followed MPEG-7 for the representation of annotations.
  • As described in the sequel, this favourable disposition is still evident, differentiating video annotation tools from image ones, where the Semantic Web technologies have been more pervasive.

4.1 VIA

  • The Video and Image Annotation (VIA) tool has been developed by the MKLab within the BOEMIE project.
  • The shot records a pole vaulter holding a pole and sprinting towards the jump point.
  • VIA supports descriptive, structural and media metadata of image and video assets.
  • Descriptive annotation is performed with respect to a user loaded OWL ontology, while free text descriptions can also be added.
  • The first of its annotation modes is concerned with region annotation, in which the user selects rectangular areas of the video content and subsequently adds corresponding annotations.

4.2 VideoAnnEx

  • The IBM VideoAnnEx annotation tool addresses video annotation with MPEG-7 metadata.
  • VideoAnnEx supports descriptive, structural and administrative annotations according to the respective MPEG-7 Description Schemes.
  • The tool supports default subject matter lexicons in XML format, and additionally allows the user to create and load her own XML lexicon (an illustrative sketch follows this list), design a concept hierarchy through the interface menu commands, or insert free text descriptions.
  • As illustrated in Figure 10, the VideoAnnEx annotation interface consists of four components.
  • On the bottom part of the tool, two views of the annotation preview are available: one contains the I-frames of the current shot and the other the keyframes of each shot in the video.
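A user-created lexicon of the kind mentioned above is essentially a flat list of concept names in XML; the sketch below writes such a file with Python's xml.etree, using element names that are assumptions rather than the actual VideoAnnEx lexicon schema.

    import xml.etree.ElementTree as ET

    # Illustrative flat lexicon of subject matter concepts (hypothetical schema).
    lexicon = ET.Element("Lexicon", {"name": "athletics"})
    for concept in ["PoleVault", "PoleVaulter", "Stadium", "Crowd"]:
        ET.SubElement(lexicon, "Concept", {"name": concept})

    ET.ElementTree(lexicon).write("athletics_lexicon.xml", encoding="utf-8",
                                  xml_declaration=True)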

4.6 Anvil

  • Anvil is a tool that supports audiovisual content annotation, but was primarily designed for linguistic purposes, in the same vein as the previously described tool.
  • User-defined XML schema specification files provide the definition of the vocabulary used in the annotation procedure.
  • Its interface consists of the media player window, the annotation board and the metadata window.
  • As in most of the described tools, in Anvil the user has to manually define the temporal segments she wants to annotate.
  • Anvil can import data from the phonetic tools PRAAT and XWaves, which perform speech transcription.

4.7 Semantic Video Annotation Suite

  • The Semantic Video Annotation Suite (SVAS), developed by the Joanneum Research Institute of Information Systems & Information Management, targets the creation of MPEG-7 video annotations.
  • SVAS [36] encompasses two tools: the Media Analyzer, which automatically extracts structural information regarding shots and key-frames, and the Semantic Video Annotation Tool (SVAT), which allows the user to edit the structural metadata obtained through the Media Analyzer and to add administrative and descriptive metadata in accordance with MPEG-7.
  • The detection results are displayed in a separate key-frame view, where for each of the computed key frames the detected object is highlighted.
  • The user can partially enhance the results of this matching service by removing irrelevant key-frames; however, more elaborate enhancements, such as editing the detected region's boundaries or location, are not supported.
  • All views, including the shot view tree structure, can be exported to a CSV file and the metadata is saved in an MPEG-7 XML file.
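The exported shot structure is naturally tabular; the snippet below shows how such a shot list could be written as CSV with Python's csv module. The column names are purely illustrative and do not reproduce the actual SVAT export layout.

    import csv

    # Hypothetical shot-structure export; column names are illustrative only.
    shots = [
        {"shot_id": 1, "start_frame": 0,   "end_frame": 149, "keyframe": 75,  "label": "warm-up"},
        {"shot_id": 2, "start_frame": 150, "end_frame": 420, "keyframe": 290, "label": "jump"},
    ]

    with open("shots.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(shots[0].keys()))
        writer.writeheader()
        writer.writerows(shots)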

4.8 Application-specific Video Annotation Tools

  • Apart from the previously described semantic video annotation tools, a number of additional annotation systems have been proposed that, targeting specific application contexts, introduce different perspectives on the annotation process.
  • To keep the survey comprehensive, in the following the authors examine briefly some representative examples.
  • Advocating W3C standards, Annotea adopts RDF-based annotation schemes and XPointer for locating the annotations within the annotated resource (see the sketch after this list).
  • Object-level descriptions can also be propagated through dragging while the video is playing.
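Annotea annotations are themselves RDF resources whose target, and the location within it, are linked via an XPointer expression; the rdflib sketch below follows that general pattern using the W3C annotation namespace. The property names and URIs should be read as assumptions rather than the exact output of any particular Annotea server.

    from rdflib import Graph, Literal, Namespace, RDF, URIRef

    A = Namespace("http://www.w3.org/2000/10/annotation-ns#")
    g = Graph()

    ann = URIRef("http://example.org/annotations/1")        # hypothetical annotation URI
    video = URIRef("http://example.org/videos/match.mpg")   # hypothetical annotated video

    g.add((ann, RDF.type, A.Annotation))
    g.add((ann, A.annotates, video))
    # XPointer expression locating the annotated part within the resource
    g.add((ann, A.context, Literal("xpointer(id('scene-3'))")))
    g.add((ann, A.body, Literal("Pole vaulter clears the bar")))

    print(g.serialize(format="turtle"))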

4.9 Discussion

  • As illustrated in the aforementioned descriptions, video annotation tools make rather poor use of Semantic Web technologies and formal semantics, with XML being the most common choice for capturing and representing the produced annotations.
  • The use of MPEG-7 based descriptions may constitute a solution towards standardised video descriptions, yet it raises serious issues with respect to the automatic processing of annotations, especially the descriptive ones, at a semantic level.
  • Furthermore, VideoAnnEx, VIA and SVAT are the only ones that also offer selection and annotation of spatial regions on individual frames of the video.
  • Anvil has recently presented a new annotation mechanism called spatiotemporal coding, aiming to support point and region annotation, yet currently only points are supported.
  • It is worth noticing that most annotation tools offer a variety of additional functionalities in order to satisfy varying user needs.

5 Conclusions

  • The aim of this survey has been to provide a common framework of reference for assessing the suitability and interoperability of annotations under different contexts of usage.
  • Domain specific ontologies are supported by the majority of tools for the representation of subject matter descriptions.
  • The level of correspondence between research outcomes and implemented annotation tools is not the sole subject for further investigation.
  • Research in multimedia annotation, and by consequence into multimedia ontologies, is not restricted to the representation of the different annotation dimensions involved.
  • As a continuation of the efforts initiated within MMSEM (http://www.w3.org/2005/Incubator/mmsem/), further manifesting the strong emphasis placed upon achieving cross-community multimedia data integration, two new groups have been established.


A Survey of Semantic Image and Video Annotation Tools
S. Dasiopoulou, E. Giannakidou, G. Litos, P. Malasioti, and I. Kompatsiaris
Multimedia Knowledge Laboratory, Informatics and Telematics Institute,
Centre for Research and Technology Hellas
{dasiop,igiannak,litos,xenia,ikom}@iti.gr

Abstract. The availability of semantically annotated image and video assets constitutes a critical prerequisite for the realisation of intelligent knowledge management services pertaining to realistic user needs. Given the extent of the challenges involved in the automatic extraction of such descriptions, manually created metadata play a significant role, further strengthened by their deployment in training and evaluation tasks related to the automatic extraction of content descriptions. The different views taken by the two main approaches towards semantic content description, namely the Semantic Web and MPEG-7, as well as the traits particular to multimedia content due to the multiplicity of information levels involved, have resulted in a variety of image and video annotation tools, adopting varying description aspects. Aiming to provide a common framework of reference and furthermore to highlight open issues, especially with respect to the coverage and the interoperability of the produced metadata, in this chapter we present an overview of the state of the art in image and video annotation tools.
1 Introduction
Accessing multimedia content in correspondence with the meaning it conveys to a user constitutes the core challenge in multimedia research, commonly referred to as the semantic gap [1]. The current state of the art in automatic content analysis and understanding supports in many cases the successful detection of semantic concepts, such as persons, buildings, natural scenes vs manmade scenes, etc., at a satisfactory level of accuracy; however, the attained performance remains highly variable when considering general domains, or when increasing, even slightly, the number of supported concepts [2–4]. As a consequence, the manual generation of content descriptions holds an important role towards the realisation of intelligent content management services. This significance is further strengthened by the need for manually constructed descriptions in automatic content analysis, both for evaluation and for training purposes, when learning based on pre-annotated examples is used.
The availability of semantic descriptions though is not adequate per se for the effective management of multimedia content. Fundamental to information sharing, exchange and reuse is the interoperability of the descriptions at both syntactic and semantic levels, i.e. regarding the valid structuring of the descriptions and the endowed meaning respectively. Besides the general prerequisite for interoperability, additional requirements arise from the multiple levels at which multimedia content can be represented, including structural and low-level feature information. Further description levels issue from more generic aspects such as authoring & access control, navigation, and user history & preferences. The strong relation of structural and low-level feature information to the tasks involved in the automatic analysis of visual content, as well as to retrieval services, such as transcoding, content-based search, etc., brings these two dimensions to the foreground, along with the subject matter descriptions.
Two initiatives prevail in the efforts towards machine processable semantic content metadata: the Semantic Web activity of the W3C (http://www.w3.org/2001/sw/) and ISO's Multimedia Content Description Interface, MPEG-7 (http://www.chiariglione.org/mpeg/) [5, 6], delineating corresponding approaches with respect to multimedia semantic annotation [7, 8]. Through a layered architecture of successively increased expressivity, the Semantic Web (SW) advocates formal semantics and reasoning through logically grounded meaning. The respective rule and ontology languages embody the general mechanisms for capturing, representing and reasoning with semantics; they do not capture application specific knowledge. In contrast, MPEG-7 addresses specifically the description of audiovisual content and comprises not only the representation language, in the form of the Description Definition Language (DDL), but also specific media and domain definitions; thus, from a SW perspective, MPEG-7 serves the twofold role of a representation language and a domain specific ontology.
Overcoming the syntactic and semantic interoperability issues between MPEG-7 and the SW has been the subject of very active research in the current decade, highly motivated by the complementary aspects characterising the two aforementioned metadata initiatives: media specific, yet not formal, semantics on the one hand, and general mechanisms for logically grounded semantics on the other hand. A number of so-called multimedia ontologies [9–13] have been proposed in an attempt to add formal semantics to MPEG-7 descriptions and thereby enable linking with existing ontologies and the semantic management of existing MPEG-7 metadata repositories. Furthermore, initiatives such as the W3C Multimedia Annotation on the Semantic Web Taskforce (http://www.w3.org/2001/sw/BestPractices/MM/), the W3C Multimedia Semantics Incubator Group (http://www.w3.org/2005/Incubator/mmsem/) and the Common Multimedia Ontology Framework (http://www.acemedia.org/aceMedia/reference/multimedia ontology/index.html) have been established to address the technologies, advantages and open issues related to the creation, storage, manipulation and processing of multimedia semantic metadata.
In this chapter, bearing in mind the significance of manual image and video annotation in combination with the different possibilities afforded by the SW and MPEG-7 initiatives, we present a detailed overview of the most well known manual annotation tools, addressing both functionality aspects, such as coverage & granularity of annotations, as well as interoperability concerns with respect to the supported annotation vocabularies and representation languages. Interoperability though does not address solely the harmonisation between the SW and MPEG-7 initiatives; a significant number of tools, especially regarding video annotation, follow customised approaches, aggravating the challenges. As such, this survey serves a twofold role; it provides a common framework for reference and comparison purposes, while highlighting issues pertaining to the communication, sharing and reuse of the produced metadata.
The rest of the chapter is organised as follows. Section 2 describes the criteria along which the assessment and comparison of the examined annotation tools is performed. Sections 3 and 4 discuss the individual image and video tools respectively, while Section 5 concludes the chapter, summarising the resulting observations and open issues.
2 Semantic Image and Video Annotation
Image and video assets constitute extremely rich information sources, ubiquitous in a wide variety of diverse applications and tasks related to information management, both for personal and professional purposes. Inevitably, the value of the endowed information depends on the effectiveness and efficiency with which it can be accessed and managed. This is where semantic annotation comes in, as it designates the schemes for capturing the information related to the content.
As already indicated, two crucial requirements featuring content annotation are the interoperability of the created metadata and the ability to automatically process them. The former encompasses the capacity to share and reuse annotations, and by consequence determines the level of seamless content utilisation and the benefits issued from the annotations made available; the latter is vital to the realisation of intelligent content management services. Towards their accomplishment, the existence of commonly agreed vocabularies and syntax, and respectively of commonly agreed semantics and interpretation mechanisms, is an essential element.
Within the context of visual content, these general prerequisites incur more specific conditions issuing from the particular traits of image and video assets. Visual content semantics, as multimedia semantics in general, comes in a multi-layered, intertwined fashion [14, 15]. It encompasses, amongst others, thematic descriptions addressing the subject matter depicted (scene categorisation, objects, events, etc.), media descriptions referring to low-level features and related information such as the algorithms used for their extraction, respective parameters, etc., as well as structural descriptions addressing the decomposition of content into constituent segments and the spatiotemporal configuration of these segments. As in this chapter semantic annotation is investigated mostly with respect to content retrieval and analysis tasks, aspects addressing concerns related to authoring, access and privacy, and so forth, are only shallowly treated.

Fig. 1. Multi-layer image semantics.
Figure 1 shows such an example, illustrating subject matter descriptions such as "Sky" and "Pole Vaulter, Athlete", structural descriptions such as the three identified regions, the spatial configuration between two of them (i.e. region2 above region3), and the ScalableColour and RegionShape descriptor values extracted for two regions. The different layers correspond to different annotation dimensions and serve different purposes, further differentiated by the individual application context. For example, for a search and retrieval service regarding a device of limited resources (e.g. PDA, mobile phone), content management becomes more effective if specific temporal parts of a video can be returned to a query rather than the whole video asset, leaving the user with the cumbersome task of browsing through it, till reaching the relevant parts and assessing if they satisfy her query.
The aforementioned considerations intertwine, establishing a number of dimensions and corresponding criteria along which image and video annotation can be characterised. As such, interoperability, explicit semantics in terms of liability to automated processing, and reuse, apply both to all types of description dimensions and to their interlinking, and not only to subject matter descriptions, as is the common case for textual content resources.
In the following, we describe the criteria along which we overview the different annotation tools in order to assess them with respect to the aforementioned considerations. Criteria addressing concerns of similar nature have been grouped together, resulting in three categories.

2.1 Input & Output
This category includes criteria regarding the way the tool interacts in terms of
requested / supported input and the output produced.
Annotation Vocabulary. Refers to whether the annotation is performed ac-
cording to a predefined set of terms (e.g. lexicon / thesaurus, taxonomy,
ontology) or if it is provided by the user in the form of keywords and free
text. In the case of controlled vocabulary, we differentiate the case where the
user has to explicitly provide it (e.g. as when uploading a sp ecific ontology)
or whether it is provided by the tool as a built-in; the formalisms supported
for the representation of the vocabulary constitute a further attribute. We
note that annotation vocabularies may refer not only to subject matter de-
scriptions, but as well to media and structural descriptions. Naturally, the
more formal and well-defined the semantics of the annotation vocabulary,
the more opportunities for achieving interoperable and machine understand-
able annotations.
Metadata Format. Considers the representation format in which the pro-
duced annotations are expressed. Naturally, the output format is strongly
related to the supported annotation vocabularies. As will be shown in the
sequel though, where the individual tools are described, there is not nec-
essarily a strict correspondence (e.g. a tool may use an RDFS
6
or OWL
7
ontology as the subject matter vocabulary, and yet output annotations in
RDF
8
). The format is equally significant to the annotation vocabulary as
with respect to the annotations interoperability and sharing.
Content Type. Refers to the supported image/video formats, e.g. jpg, png,
mpeg, etc.
2.2 Annotation Level
This category addresses attributes of the annotations per se. Naturally, the types of information addressed by the descriptions issue from the intended context of usage. Subject matter annotations, i.e. thematic descriptions with respect to the depicted objects and events, are indispensable for any application scenario addressing content-based retrieval at the level of meaning conveyed. Such retrieval may address concept-based queries or queries involving relations between concepts, entailing respective annotation specifications. Structural information is crucial for services where it is important to know the exact content parts associated with specific thematic descriptions, as for example in the case of semantic transcoding or enhanced retrieval and presentation, where the parts of interest can be indicated in an elaborated manner. Analogously, annotations intended for

Citations
Proceedings ArticleDOI
24 Feb 2014
TL;DR: OCTAB (Online Crowdsourcing Tool for Annotations of Behaviors), a web-based annotation tool that allows precise and convenient behavior annotations in videos, directly portable to popular crowdsourcing platforms, and a training module with specialized visualizations.
Abstract: Research that involves human behavior analysis usually requires laborious and costly efforts for obtaining micro-level behavior annotations on a large video corpus. With the emerging paradigm of crowdsourcing however, these efforts can be considerably reduced. We first present OCTAB (Online Crowdsourcing Tool for Annotations of Behaviors), a web-based annotation tool that allows precise and convenient behavior annotations in videos, directly portable to popular crowdsourcing platforms. As part of OCTAB, we introduce a training module with specialized visualizations. The training module's design was inspired by an observational study of local experienced coders, and it enables an iterative procedure for effectively training crowd workers online. Finally, we present an extensive set of experiments that evaluates the feasibility of our crowdsourcing approach for obtaining micro-level behavior annotations in videos, showing the reliability improvement in annotation accuracy when properly training online crowd workers. We also show the generalization of our training approach to a new independent video corpus.

25 citations

Journal ArticleDOI
18 Apr 2021
TL;DR: In this article, the authors describe the recent rise of deep learning algorithms for recognising images and their applications in image analysis. But they do not discuss the use of labeled training data to solve computer vision problems.
Abstract: Supervised machine learning methods for image analysis require large amounts of labelled training data to solve computer vision problems. The recent rise of deep learning algorithms for recognising...

23 citations

Journal ArticleDOI
TL;DR: In this paper, a detailed literature review focusing on object detection is provided and the object detection techniques are discussed; a systematic review has been followed to summarize the current research work's findings and discuss seven research questions related to object detection.
Abstract: Object detection is one of the most fundamental and challenging tasks to locate objects in images and videos. Over the past, it has gained much attention to do more research on computer vision tasks such as object classification, counting of objects, and object monitoring. This study provides a detailed literature review focusing on object detection and discusses the object detection techniques. A systematic review has been followed to summarize the current research work’s findings and discuss seven research questions related to object detection. Our contribution to the current research work is (i) analysis of traditional, two-stage, one-stage object detection techniques, (ii) Dataset preparation and available standard dataset, (iii) Annotation tools, and (iv) performance evaluation metrics. In addition, a comparative analysis has been performed and analyzed that the proposed techniques are different in their architecture, optimization function, and training strategies. With the remarkable success of deep neural networks in object detection, the performance of the detectors has improved. Various research challenges and future directions for object detection also has been discussed in this research paper.

21 citations

Journal ArticleDOI
TL;DR: The contentus approach towards an automated media processing chain for cultural heritage organizations and content holders is presented, which allows for unattended processing from media ingest to availability thorough the search and retrieval interface.
Abstract: An ever-growing amount of digitized content urges libraries and archives to integrate new media types from a large number of origins such as publishers, record labels and film archives, into their existing collections. This is a challenging task, since the multimedia content itself as well as the associated metadata is inherently heterogeneous--the different sources lead to different data structures, data quality and trustworthiness. This paper presents the contentus approach towards an automated media processing chain for cultural heritage organizations and content holders. Our workflow allows for unattended processing from media ingest to availability thorough our search and retrieval interface. We aim to provide a set of tools for the processing of digitized print media, audio/visual, speech and musical recordings. Media specific functionalities include quality control for digitization of still image and audio/visual media and restoration of the most common quality issues encountered with these media. Furthermore, the contentus tools include modules for content analysis like segmentation of printed, audio and audio/visual media, optical character recognition (OCR), speech-to-text transcription, speaker recognition and the extraction of musical features from audio recordings, all aimed at a textual representation of information inherent within the media assets. Once the information is extracted and transcribed in textual form, media independent processing modules offer extraction and disambiguation of named entities and text classification. All contentus modules are designed to be flexibly recombined within a scalable workflow environment using cloud computing techniques. In the next step analyzed media assets can be retrieved and consumed through a search interface using all available metadata. The search engine combines Semantic Web technologies for representing relations between the media and entities such as persons, locations and organizations with a full-text approach for searching within transcribed information gathered through the preceding processing steps. The contentus unified search interface integrates text, images, audio and audio/visual content. Queries can be narrowed and expanded in an exploratory manner, search results can be refined by disambiguating entities and topics. Further, semantic relationships become not only apparent, but can also be navigated.

20 citations

Proceedings ArticleDOI
19 Nov 2018
TL;DR: EagleView is designed, which provides analysts with real-time visualisations during playback of videos and an accompanying data-stream of tracked interactions that allow analysts to gain insights into collaborative activities.
Abstract: To study and understand group collaborations involving multiple handheld devices and large interactive displays, researchers frequently analyse video recordings of interaction studies to interpret people's interactions with each other and/or devices. Advances in ubicomp technologies allow researchers to record spatial information through sensors in addition to video material. However, the volume of video data and high number of coding parameters involved in such an interaction analysis makes this a time-consuming and labour-intensive process. We designed EagleView, which provides analysts with real-time visualisations during playback of videos and an accompanying data-stream of tracked interactions. Real-time visualisations take into account key proxemic dimensions, such as distance and orientation. Overview visualisations show people's position and movement over longer periods of time. EagleView also allows the user to query people's interactions with an easy-to-use visual interface. Results are highlighted on the video player's timeline, enabling quick review of relevant instances. Our evaluation with expert users showed that EagleView is easy to learn and use, and the visualisations allow analysts to gain insights into collaborative activities.

18 citations

References
Journal ArticleDOI
TL;DR: The working conditions of content-based retrieval: patterns of use, types of pictures, the role of semantics, and the sensory gap are discussed, as well as aspects of system engineering: databases, system architecture, and evaluation.
Abstract: Presents a review of 200 references in content-based image retrieval. The paper starts with discussing the working conditions of content-based retrieval: patterns of use, types of pictures, the role of semantics, and the sensory gap. Subsequent sections discuss computational steps for image retrieval systems. Step one of the review is image processing for retrieval sorted by color, texture, and local geometry. Features for retrieval are discussed next, sorted by: accumulative and global features, salient points, object and shape features, signs, and structural combinations thereof. Similarity of pictures and objects in pictures is reviewed for each of the feature types, in close connection to the types and means of feedback the user of the systems is capable of giving by interaction. We briefly discuss aspects of system engineering: databases, system architecture, and evaluation. In the concluding section, we present our view on: the driving force of the field, the heritage from computer vision, the influence on computer vision, the role of similarity and of interaction, the need for databases, the problem of evaluation, and the role of the semantic gap.

6,447 citations

Journal ArticleDOI
TL;DR: In this article, a large collection of images with ground truth labels is built to be used for object detection and recognition research, such data is useful for supervised learning and quantitative evaluation.
Abstract: We seek to build a large collection of images with ground truth labels to be used for object detection and recognition research. Such data is useful for supervised learning and quantitative evaluation. To achieve this, we developed a web-based tool that allows easy image annotation and instant sharing of such annotations. Using this annotation tool, we have collected a large dataset that spans many object categories, often containing multiple instances over a wide variety of images. We quantify the contents of the dataset and compare against existing state of the art datasets used for object recognition and detection. Also, we show how to extend the dataset to automatically enhance object labels with WordNet, discover object parts, recover a depth ordering of objects in a scene, and increase the number of labels using minimal user supervision and images from the web.

3,501 citations

Book
17 Sep 2004
TL;DR: Adaptive Resonance Theory (ART) neural networks model real-time prediction, search, learning, and recognition, and design principles derived from scientific analyses and design constraints imposed by targeted applications have jointly guided the development of many variants of the basic networks.
Abstract: Adaptive Resonance Theory (ART) neural networks model real-time prediction, search, learning, and recognition. ART networks function both as models of human cognitive information processing [1,2,3] and as neural systems for technology transfer [4]. A neural computation central to both the scientific and the technological analyses is the ART matching rule [5], which models the interaction between topdown expectation and bottom-up input, thereby creating a focus of attention which, in turn, determines the nature of coded memories. Sites of early and ongoing transfer of ART-based technologies include industrial venues such as the Boeing Corporation [6] and government venues such as MIT Lincoln Laboratory [7]. A recent report on industrial uses of neural networks [8] states: “[The] Boeing ... Neural Information Retrieval System is probably still the largest-scale manufacturing application of neural networks. It uses [ART] to cluster binary templates of aeroplane parts in a complex hierarchical network that covers over 100,000 items, grouped into thousands of self-organised clusters. Claimed savings in manufacturing costs are in millions of dollars per annum.” At Lincoln Lab, a team led by Waxman developed an image mining system which incorporates several models of vision and recognition developed in the Boston University Department of Cognitive and Neural Systems (BU/CNS). Over the years a dozen CNS graduates (Aguilar, Baloch, Baxter, Bomberger, Cunningham, Fay, Gove, Ivey, Mehanian, Ross, Rubin, Streilein) have contributed to this effort, which is now located at Alphatech, Inc. Customers for BU/CNS neural network technologies have attributed their selection of ART over alternative systems to the model's defining design principles. In listing the advantages of its THOT technology, for example, American Heuristics Corporation (AHC) cites several characteristic computational capabilities of this family of neural models, including fast on-line (one-pass) learning, “vigilant” detection of novel patterns, retention of rare patterns, improvement with experience, “weights [which] are understandable in real world terms,” and scalability (www.heuristics.com). Design principles derived from scientific analyses and design constraints imposed by targeted applications have jointly guided the development of many variants of the basic networks, including fuzzy ARTMAP [9], ART-EMAP [10], ARTMAP-IC [11],

1,745 citations

Book
01 Jun 2002
TL;DR: This book has been designed as a unique tutorial in the new MPEG 7 standard covering content creation, content distribution and content consumption, and presents a comprehensive overview of the principles and concepts involved in the complete range of Audio Visual material indexing, metadata description, information retrieval and browsing.
Abstract: From the Publisher: The MPEG standards are an evolving set of standards for video and audio compression. MPEG 7 technology covers the most recent developments in multimedia search and retreival, designed to standardise the description of multimedia content supporting a wide range of applications including DVD, CD and HDTV. Multimedia content description, search and retrieval is a rapidly expanding research area due to the increasing amount of audiovisual (AV) data available. The wealth of practical applications available and currently under development (for example, large scale multimedia search engines and AV broadcast servers) has lead to the development of processing tools to create the description of AV material or to support the identification or retrieval of AV documents. Written by experts in the field, this book has been designed as a unique tutorial in the new MPEG 7 standard covering content creation, content distribution and content consumption. At present there are no books documenting the available technologies in such a comprehensive way. Presents a comprehensive overview of the principles and concepts involved in the complete range of Audio Visual material indexing, metadata description, information retrieval and browsingDetails the major processing tools used for indexing and retrieval of images and video sequencesIndividual chapters, written by experts who have contributed to the development of MPEG 7, provide clear explanations of the underlying tools and technologies contributing to the standardDemostration software offering step-by-step guidance to the multi-media system components and eXperimentation model (XM) MPEG reference softwareCoincides with the release of the ISO standard in late 2001. A valuable reference resource for practising electronic and communications engineers designing and implementing MPEG 7 compliant systems, as well as for researchers and students working with multimedia database technology.

1,301 citations

Book ChapterDOI
01 Oct 2002
TL;DR: This paper introduces the DOLCE upper level ontology, the first module of a Foundational Ontologies Library being developed within the WonderWeb project, and suggests that such analysis could hopefully lead to an "ontologically sweetened" WordNet.
Abstract: In this paper we introduce the DOLCE upper level ontology, the first module of a Foundational Ontologies Library being developed within the WonderWeb project. DOLCE is presented here in an intuitive way; the reader should refer to the project deliverable for a detailed axiomatization. A comparison with WordNet's top-level taxonomy of nouns is also provided, which shows how DOLCE, used in addition to the OntoClean methodology, helps isolating and understanding some major semantic limitations of WordNet. We suggest that such analysis could hopefully lead to an "ontologically sweetened" WordNet, meant to be conceptually more rigorous, cognitively transparent, and efficiently exploitable in several applications.

1,100 citations