A survey of semantic image and video annotation tools
Summary
1 Introduction
- Accessing multimedia content according to the meaning it conveys to a user constitutes the core challenge in multimedia research, commonly referred to as the semantic gap [1].
- This significance is further strengthened by the need for manually constructed descriptions in automatic content analysis, both for evaluation and for training purposes when learning from pre-annotated examples is used.
- Fundamental to information sharing, exchange and reuse is the interoperability of the descriptions at both the syntactic and the semantic level, i.e. regarding the valid structuring of the descriptions and the meaning they carry, respectively.
- The strong relation of structural and low-level feature information to the tasks involved in the automatic analysis of visual content, as well as to retrieval services such as transcoding and content-based search, brings these two dimensions to the foreground, along with the subject matter descriptions.
- A number of so-called multimedia ontologies [9–13] have been proposed in an attempt to add formal semantics to MPEG-7 descriptions and thereby enable linking with existing ontologies and the semantic management of existing MPEG-7 metadata repositories.
2 Semantic Image and Video Annotation
- Image and video assets constitute extremely rich information sources, ubiquitous in a wide variety of diverse applications and tasks related to information management, both for personal and professional purposes.
- Inevitably, the value of the embodied information depends on the effectiveness and efficiency with which it can be accessed and managed.
- The former encompasses the capacity to share and reuse annotations, and consequently determines the level of seamless content utilisation and the benefits derived from the annotations made available; the latter is vital to the realisation of intelligent content management services.
- Towards their accomplishment, commonly agreed vocabularies and syntax, and respectively commonly agreed semantics and interpretation mechanisms, are essential elements.
- The aforementioned considerations intertwine, establishing a number of dimensions and corresponding criteria along which image and video annotation can be characterised.
2.1 Input & Output
- This category includes criteria regarding the way the tool interacts in terms of requested / supported input and the output produced.
- The authors note that annotation vocabularies may refer not only to subject matter descriptions, but also to media and structural descriptions.
- As will be shown in the sequel, where the individual tools are described, there is not necessarily a strict correspondence (e.g. a tool may use an RDFS or OWL ontology as the subject matter vocabulary, and yet output annotations in RDF; see the sketch after this list).
- The annotation format is as significant as the annotation vocabulary with respect to the interoperability and sharing of annotations.
- A further criterion refers to the supported image/video formats, e.g. JPEG, PNG, MPEG, etc.
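To make the vocabulary/format distinction concrete, the following minimal sketch (Python with rdflib; the namespaces, the image URI and the PoleVaulter concept are hypothetical) produces a plain RDF annotation whose subject matter term is drawn from an OWL domain ontology.

```python
from rdflib import Graph, Namespace, URIRef, RDF

# Hypothetical namespaces: an OWL domain ontology (the subject matter
# vocabulary) and a namespace for the annotation instances themselves.
DOMAIN = Namespace("http://example.org/athletics.owl#")
EX = Namespace("http://example.org/annotations#")

g = Graph()
g.bind("dom", DOMAIN)
g.bind("ex", EX)

image = URIRef("http://example.org/images/img001.jpg")
region = EX["img001-region1"]

# The annotation itself is plain RDF, even though the vocabulary is OWL.
g.add((region, RDF.type, DOMAIN.PoleVaulter))   # subject matter description
g.add((image, EX.hasRegion, region))            # structural description

print(g.serialize(format="xml"))                # RDF/XML output
```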
2.2 Annotation Level
- This category addresses attributes of the annotations per se.
- Such retrieval may address concept-based queries or queries involving relations between concepts, entailing respective annotation specifications.
- To capture the aforementioned considerations, the following criteria have been used.
- For video assets, annotation may refer to the entire video, temporal segments, frames (temporal segments with zero duration), regions within frames, or even to moving regions, i.e. a region followed for a sequence of frames.
- A further criterion refers to the level of expressivity supported with respect to the annotation vocabulary.
2.3 Miscellaneous
- This category summarises additional criteria that do not fall under the previous dimensions.
- The considered aspects relate mostly to attributes of the tool itself rather than of the annotation process.
- As such, and given the scope of this chapter, in the description of the individual tools that follows in the two subsequent Sections, these criteria are treated very briefly.
- – Application Type: Specifies whether the tool constitutes a web-based or a stand-alone application.
- – Licence: Specifies the kind of licence condition under which the tool operates, e.g. open source, etc.
- – Collaboration: Specifies whether the tool supports concurrent annotations (referring to the same media object) by multiple users or not.
3 Tools for Semantic Image Annotation
- In this Section the authors describe prominent semantic image annotation tools with respect to the dimensions and criteria outlined in Section 2.
- As will be illustrated in the following, Semantic Web technologies have permeated to a considerable degree the representation of metadata, with the majority of tools supporting ontology-based subject matter descriptions, while a considerable share of them adopts ontological representation for structural annotations as well.
- In order to provide a relative ranking with respect to SW compatibility, the authors order the tools according to the extent to which the produced annotations bear formal semantics.
3.1 KAT
- The K-Space Annotation Tool (KAT), developed within the K-Space project, implements an ontology-based framework for the semantic annotation of images.
- COMM extends the Descriptions & Situations (D&S) and Ontology of Information Objects (OIO) design patterns of DOLCE [17, 18], while incorporating re-engineered definitions of MPEG-7 description tools [19, 20].
- The latter are strictly concept-based (i.e., considering the aforementioned annotation example, it is not possible to annotate the pole as being next to the pole vaulter) and may refer to the entire image or to specific regions of it.
- The localisation of image regions is performed manually, using either the rectangle or the polygon drawing tool.
- Furthermore, the COMM-based annotation scheme makes it quite straightforward to extend the annotation dimensions supported by KAT.
3.2 PhotoStuff
- PhotoStuff, developed by the Mindswap group, is an ontology-based image annotation tool that supports the generation of semantic image descriptions with respect to the employed ontologies.
- PhotoStuff [21] addresses primarily two types of metadata, namely descriptive and structural.
- Regarding descriptive annotations, the user may load one or multiple domain-specific ontologies from the web or from the local hard drive, while with respect to structural annotations, two internal ontologies, hidden from the user, are used: the Digital-Media ontology and the Technical one.
- Neither the representation nor the extraction of such descriptors is addressed.
- Notably, annotations may refer not only to concept instantiations, but also to relations between concept instances already identified in an image (see the sketch after this list).
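The following hedged sketch (Python with rdflib; the domain ontology, the nextTo property and all URIs are illustrative, not PhotoStuff's actual output schema) shows what such a relation-level annotation amounts to in RDF.

```python
from rdflib import Graph, Namespace, RDF

DOMAIN = Namespace("http://example.org/athletics.owl#")  # hypothetical domain ontology
EX = Namespace("http://example.org/annotations#")        # hypothetical instance namespace

g = Graph()
g.bind("dom", DOMAIN)
g.bind("ex", EX)

pole = EX["img001-pole"]
vaulter = EX["img001-vaulter"]

# Concept instantiations for two regions identified in the image ...
g.add((pole, RDF.type, DOMAIN.Pole))
g.add((vaulter, RDF.type, DOMAIN.PoleVaulter))

# ... plus a relation between the two instances, which purely concept-based
# schemes (such as KAT's, as noted above) cannot express.
g.add((pole, DOMAIN.nextTo, vaulter))

print(g.serialize(format="turtle"))
```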
3.3 AktiveMedia
- AktiveMedia, developed within the AKT and X-Media projects, is an ontology-based cross-media annotation system addressing text and image assets.
- In image annotation mode, AktiveMedia supports descriptive metadata with respect to user-selected ontologies stored on the local hard drive [22].
- Annotations can refer to image or region level.
- Contrary to PhotoStuff, the produced annotations follow a flat structure (cf. the Dublin Core element set, http://dublincore.org/documents/dces/).
- As such, the semantics of the generated RDF metadata, i.e. the annotation semantics as they follow from the respective ontology definitions, are not direct but require additional processing to retrieve and to reason over (see the sketch after this list).
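A minimal sketch of the difference (Python with rdflib; hypothetical output, not AktiveMedia's exact schema): a flat description records the concept name only as a literal, so an extra, tool-specific mapping step is needed before the ontology semantics can be exploited.

```python
from rdflib import Graph, Namespace, URIRef, Literal

DC = Namespace("http://purl.org/dc/elements/1.1/")
DOMAIN = Namespace("http://example.org/athletics.owl#")  # hypothetical ontology

g = Graph()
g.bind("dc", DC)
image = URIRef("http://example.org/images/img001.jpg")

# Flat, keyword-style annotation: the concept appears only as a string literal.
g.add((image, DC.subject, Literal("PoleVaulter")))

# To reason over such metadata, the literal must first be mapped back to the
# ontology term it was taken from -- an additional processing step.
label_to_concept = {"PoleVaulter": DOMAIN.PoleVaulter}
for _, _, keyword in g.triples((image, DC.subject, None)):
    print(label_to_concept.get(str(keyword)))
```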
3.5 Caliph
- Caliph is an MPEG-7 based image annotation tool that supports all types of MPEG-7 metadata, including descriptive, structural, authoring and low-level visual descriptor annotations.
- Combined with Emir, it supports content-based retrieval of images using MPEG-7 descriptions.
- Figure 6 illustrates two screenshots corresponding to the generic image information and the semantic annotation tabs.
- The descriptions may be either in the form of free text or structured, in accordance with the SemanticBase description tools provided by MPEG-7 (i.e. Agents, Events, Time, Place and Object annotations [26]).
- The so-called semantic tab allows for the latter, offering a graph-based interface; a hedged sketch of such a structured description follows this list.
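The following simplified sketch (Python's xml.etree; element names follow the spirit of the MPEG-7 Semantic DS, namespace declarations are omitted, and the exact schema should be checked against the MPEG-7 MDS specification) indicates how such a graph of semantic entities might be serialised.

```python
import xml.etree.ElementTree as ET

# Hedged, simplified MPEG-7-style semantic description: an agent and an event
# connected by a semantic relation, as drawn in a graph-based interface.
mpeg7 = ET.Element("Mpeg7")
semantics = ET.SubElement(ET.SubElement(mpeg7, "Description"), "Semantics")

agent = ET.SubElement(semantics, "SemanticBase",
                      {"xsi:type": "AgentObjectType", "id": "vaulter"})
ET.SubElement(ET.SubElement(agent, "Label"), "Name").text = "Pole vaulter"

event = ET.SubElement(semantics, "SemanticBase",
                      {"xsi:type": "EventType", "id": "jump"})
ET.SubElement(ET.SubElement(event, "Label"), "Name").text = "Jump"
ET.SubElement(event, "Relation",
              {"type": "urn:mpeg:mpeg7:cs:SemanticRelationCS:2001:agent",
               "target": "#vaulter"})

ET.indent(mpeg7)                                   # Python 3.9+
print(ET.tostring(mpeg7, encoding="unicode"))
```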
3.6 SWAD
- SWAD is an RDF-based image annotation tool that was developed within the SWAD-Europe project.
- The latter ran from May 2002 to October 2004 and aimed to support the Semantic Web initiative in Europe through targeted research, demonstrations and outreach activities.
- The authors provide a very brief description here to illustrate image annotation in the Semantic Web as envisaged and realised at that time, as a reference and comparison point for the various image annotation tools developed afterwards.
- Licensing information as described in the respective SWAD deliverable.
- When entering a keyword description, the respective WordNet hierarchy is shown to the user, assisting her in determining the appropriateness of the keyword and in selecting more accurate descriptions.
3.7 LabelMe
- LabelMe is a database and web-based image annotation tool, aiming to contribute to the creation of large annotated image databases for evaluation and training purposes [28].
- It contains all images from the MIT CSAIL database, in addition to a large number of user-uploaded images.
- LabelMe [28] supports descriptive metadata, addressing in principle region-based annotation.
- Specifically, the user defines a polygon enclosing the annotated object through a set of control points; a hedged sketch of the resulting annotation record follows this list.
- Its focus on requirements related to object recognition research, rather than image search and retrieval, entails different notions regarding the utilisation, sharing and purpose of annotation.
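A hedged sketch of such an annotation record (Python's xml.etree; the tag names follow commonly distributed LabelMe XML files and may vary between releases) is given below.

```python
import xml.etree.ElementTree as ET

# LabelMe-style annotation: one object outlined by a polygon given as a
# list of control points; the label is free text chosen by the annotator.
annotation = ET.Element("annotation")
ET.SubElement(annotation, "filename").text = "img001.jpg"

obj = ET.SubElement(annotation, "object")
ET.SubElement(obj, "name").text = "pole vaulter"
polygon = ET.SubElement(obj, "polygon")
for x, y in [(120, 40), (180, 40), (180, 220), (120, 220)]:   # control points
    pt = ET.SubElement(polygon, "pt")
    ET.SubElement(pt, "x").text = str(x)
    ET.SubElement(pt, "y").text = str(y)

ET.indent(annotation)                                  # Python 3.9+
print(ET.tostring(annotation, encoding="unicode"))
```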
3.8 Application-specific Image Annotation Tools
- Apart from the previously described semantic image annotation tools, a variety of application-specific tools are available.
- Some of them relate to Web 2.0 applications addressing tagging and sharing of content among social groups, while others focus on particular application domains, such as medical imaging, that impose additional specifications pertaining to the individual application context.
- Utilising radiology specific ontologies, iPad enhances the annotation procedure by suggesting more specific terms and by identifying incomplete descriptions and subsequently prompting for missing parts in the description (e.g. “enlarged” is flagged as incomplete while “enlarged liver” is acceptable).
- The produced descriptions are in RDF/XML following a proprietary schema that models the label constituting the tag, its position (the label constitutes a rectangular region in itself), and the position of the rectangle enclosing the annotated region, given as top-left corner coordinates plus width and height (see the sketch after this list).
- Furthermore, general information about the image is included, such as image size, number of annotated regions, etc.
- Oriented towards Web 2.0, FotoTagger places significant focus on social aspects of content management, allowing, among others, publishing tagged images to blogs and uploading/downloading tagged images to/from Flickr, while maintaining both FotoTagger's and Flickr's descriptions.
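The following sketch (Python with rdflib) mimics such a description; the namespace and all property names are purely illustrative stand-ins, not FotoTagger's actual proprietary vocabulary.

```python
from rdflib import Graph, Namespace, URIRef, Literal

# Hypothetical namespace standing in for the tool's proprietary schema.
FT = Namespace("http://example.org/fototagger#")

g = Graph()
g.bind("ft", FT)

image = URIRef("http://example.org/photos/img001.jpg")
tag = URIRef("http://example.org/photos/img001.jpg#tag1")

g.add((image, FT.hasTag, tag))
g.add((tag, FT.label, Literal("pole vaulter")))
# Enclosing rectangle of the annotated region: top-left corner plus extent.
g.add((tag, FT.x, Literal(120)))
g.add((tag, FT.y, Literal(40)))
g.add((tag, FT.width, Literal(60)))
g.add((tag, FT.height, Literal(180)))

print(g.serialize(format="xml"))   # RDF/XML output
```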
3.9 Discussion
- The aforementioned overview reveals that the utilisation of Semantic Web languages for the representation, interchange and processing of image metadata has permeated semantic image annotation.
- The choice of a standard representation shows the importance placed on creating content descriptions that can be easily exchanged and reused across heterogeneous applications, and works like [10, 11, 30] provide bridges between MPEG-7 metadata and the Semantic Web and existing ontologies.
- Thus unlike subject matter descriptions, where a user can choose which vocabulary to use (in the form of a domain ontology, a lexicon or user provided keywords), structural descriptions are tool specific.
- Summing up, the choice of a tool depends primarily on the intended context of usage, which provides the specifications regarding the annotation dimensions supported, and subsequently on the desired formality of annotations, again related to a large extent to the application context.
- Thus, for semantic retrieval purposes, where semantic refers to the SW perspective, KAT, PhotoStuff, SWAD and AktiveMedia would be the more appropriate choices.
4 Tools for Semantic Video Annotation
- The increase in the amount of video data deployed and used in today’s applications not only caused video to draw increased attention as a content type, but also introduced new challenges in terms of effective content management.
- In the following the authors survey typical video annotation tools, highlighting their features with respect to the criteria delineated in Section 2.
- In the latter category fall tools such as VIDETO, Ricoh Movie Tool, or LogCreator.
- It is interesting to note that the majority of these tools followed MPEG-7 for the representation of annotations.
- As described in the sequel, this favourable disposition is still evident, differentiating video annotation tools from image ones, where the Semantic Web technologies have been more pervasive.
4.1 VIA
- The Video and Image Annotation (VIA) tool has been developed by the MKLab within the BOEMIE project.
- The shot shows a pole vaulter holding a pole and sprinting towards the jump point.
- VIA supports descriptive, structural and media metadata of image and video assets.
- Descriptive annotation is performed with respect to a user loaded OWL ontology, while free text descriptions can also be added.
- The first one is concerned with region annotation, in which the user selects rectangular areas of the video content and subsequently adds corresponding annotations.
4.2 VideoAnnEx
- The IBM VideoAnnEx annotation tool addresses video annotation with MPEG-7 metadata.
- VideoAnnEx supports descriptive, structural and administrative annotations according to the respective MPEG-7 Description Schemes.
- The tool supports default subject matter lexicons in XML format, and additionally allows the user to create and load her own XML lexicon, design a concept hierarchy through the interface menu commands, or insert free text descriptions (a sketch of such a lexicon follows this list).
- As illustrated in Figure 10, the VideoAnnEx annotation interface consists of four components.
- On the bottom part of the tool, two views of the annotation preview are available: one contains the I-frames of the current shot, the other the key-frames of each shot in the video.
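A minimal sketch of a user-defined lexicon and of traversing its concept hierarchy is given below (Python's xml.etree; the tag names are hypothetical and do not reproduce VideoAnnEx's actual lexicon schema).

```python
import xml.etree.ElementTree as ET

# Illustrative user-defined lexicon: a small concept hierarchy that could be
# offered for shot-level annotation (hypothetical element names).
LEXICON = """
<lexicon>
  <concept name="Sports">
    <concept name="Athletics">
      <concept name="Pole vault"/>
    </concept>
  </concept>
</lexicon>
"""

def walk(concept, depth=0):
    # Print the hierarchy with indentation reflecting the nesting level.
    print("  " * depth + concept.get("name"))
    for child in concept.findall("concept"):
        walk(child, depth + 1)

root = ET.fromstring(LEXICON)
for top in root.findall("concept"):
    walk(top)
```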
4.6 Anvil
- Anvil is a tool that supports audiovisual content annotation, but which was primarily designed for linguistic purposes, in the same vein as the previously described tool.
- User-defined XML schema specification files provide the definition of the vocabulary used in the annotation procedure.
- Its interface consists of the media player window, the annotation board and the metadata window.
- As in most of the described tools, in Anvil too the user has to manually define the temporal segments that she wants to annotate.
- Anvil can import data from the phonetic tools PRAAT and XWaves, which perform speech transcriptions.
4.7 Semantic Video Annotation Suite
- The Semantic Video Annotation Suite (SVAS), developed by the Joanneum Research Institute of Information Systems & Information Management, targets the creation of MPEG-7 video annotations.
- SVAS [36] encompasses two tools: the Media Analyzer, which automatically extracts structural information regarding shots and key-frames, and the Semantic Video Annotation Tool (SVAT), which allows editing the structural metadata obtained through the Media Analyzer and adding administrative and descriptive metadata, in accordance with MPEG-7.
- The detection results are displayed in a separate key-frame view, where for each of the computed key frames the detected object is highlighted.
- The user can partially enhance the results of this matching service by removing irrelevant key-frames; however more elaborate enhancement such as editing of the detected region’s boundaries or of its location is not supported.
- All views, including the shot view tree structure, can be exported to a CSV file and the metadata is saved in an MPEG-7 XML file.
4.8 Application-specific Video Annotation Tools
- Apart from the previously described semantic video annotation tools, a number of additional annotation systems have been proposed that, targeting specific application contexts, introduce different perspectives on the annotation process.
- To keep the survey comprehensive, in the following the authors examine briefly some representative examples.
- Advocating W3C standards, Annotea adopts RDF-based annotation schemes and XPointer for locating the annotations within the annotated resource (see the sketch after this list).
- Object level descriptions can be also propagated through dragging while the video is playing.
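A minimal sketch of an Annotea-style annotation (Python with rdflib; the annotated resource, the XPointer expression and the body URI are illustrative values) is shown below.

```python
from rdflib import Graph, Namespace, URIRef, Literal, RDF

# Annotea annotation namespace (W3C); all concrete URIs below are examples.
ANN = Namespace("http://www.w3.org/2000/10/annotation-ns#")

g = Graph()
g.bind("a", ANN)

annotation = URIRef("http://example.org/annotations/ann1")
resource = URIRef("http://example.org/pages/report.html")

g.add((annotation, RDF.type, ANN.Annotation))
g.add((annotation, ANN.annotates, resource))
# XPointer expression locating the annotated span within the resource.
g.add((annotation, ANN.context, Literal("#xpointer(/html/body/p[2])")))
g.add((annotation, ANN.body, URIRef("http://example.org/annotations/ann1/body")))

print(g.serialize(format="turtle"))
```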
4.9 Discussion
- As illustrated in the aforementioned descriptions, video annotation tools make rather poor use of Semantic Web technologies and formal meaning, XML being the most common choice for capturing and representing the produced annotations.
- The use of MPEG-7 based descriptions may constitute a solution towards standardised video descriptions, yet it raises serious issues with respect to the automatic processing of annotations, especially the descriptive ones, at a semantic level.
- Furthermore, VideoAnnEx, VIA and SVAT are the only ones that also offer selection and annotation of spatial regions on frames of the video.
- Anvil has recently introduced a new annotation mechanism called spatiotemporal coding, aiming to support point and region annotation, yet currently only points are supported.
- It is worth noting that most annotation tools offer a variety of additional functionalities in order to satisfy varying user needs.
5 Conclusions
- The presented dimensions and criteria aim to provide a common framework of reference for assessing the suitability and interoperability of annotations under different contexts of usage.
- Domain specific ontologies are supported by the majority of tools for the representation of subject matter descriptions.
- The level of correspondence between research outcomes and implemented annotation tools is not the sole subject for further investigation.
- Research in multimedia annotation, and by consequence into multimedia ontologies, is not restricted to the representation of the different annotation dimensions involved.
- As a continuation of the efforts initiated within MMSEM (http://www.w3.org/2005/Incubator/mmsem/), further manifesting the strong emphasis placed upon achieving cross-community multimedia data integration, two new W3C working groups were subsequently launched.