Semantic Multimedia
Steffen Staab¹, Ansgar Scherp¹, Richard Arndt¹, Raphael Troncy², Marcin Grzegorzek¹, Carsten Saathoff¹, Simon Schenk¹, and Lynda Hardman²

¹ ISWeb Research Group, University of Koblenz-Landau, http://isweb.uni-koblenz.de
² Semantic Media Interfaces, CWI Amsterdam, http://www.cwi.nl
Abstract. Multimedia constitutes an interesting field of application for the Semantic Web and Semantic Web reasoning, as the access and management of multimedia content and context depend strongly on the semantic descriptions of both. At the same time, multimedia resources are complex objects whose descriptions are involved and require a foundation in sound modeling practice in order to represent the findings of low- and high-level multimedia analysis and to make them accessible via Semantic Web querying of resources. This tutorial aims to provide a common thread through these different issues and to outline where Semantic Web modeling and reasoning needs to contribute further to the area of semantic multimedia for a fruitful interaction between these two fields of computer science.
1 Semantics for Multimedia
Multimedia objects are ubiquitous, whether found via web search (e.g., Google¹ or Yahoo!² images), via dedicated sites (e.g., Flickr³ or YouTube⁴), or in the repositories of private users or commercial organizations (film archives, broadcasters, photo agencies, etc.). The media objects are produced and consumed by professionals and amateurs alike. Unlike textual assets, whose content can be searched for using text strings, media search is dependent on (i) complex analysis processes, (ii) manual descriptions of multimedia resources, (iii) the representation of these results and contributions in a widely understandable format for (iv) later retrieval and/or querying by the consumer of this data.

In the past, this process has not been supported by an interoperable and easily extensible machinery of processing tools, applications, and data formats, but only by idiosyncratic combinations of system components into sealed-off applications, such that effective sharing of their semantic metadata remained impossible and the linkage to semantic data and ontologies found on the Semantic Web remained far off.

¹ http://images.google.com/
² http://images.search.yahoo.com/
³ http://www.flickr.com/
⁴ http://www.youtube.com/

MPEG-7 [52, 57] is an international standard defined by the Moving Picture Experts Group (MPEG) that specifies how to connect descriptions to parts of a media asset. The standard includes descriptors representing low-level media-specific features that can often be automatically extracted from media types. Unfortunately, MPEG-7 is not fully suitable for describing multimedia content, because (i) it is not open to standards that represent knowledge and make use of existing controlled vocabularies for describing the subject matter, and (ii) its XML Schema⁵ based nature has led to design decisions that leave the annotations conceptually ambiguous and therefore prevent direct machine processing of semantic content descriptions.

In order to avoid such problems, we advocate the use of Semantic Web languages and a core ontology for multimedia annotations throughout the manual and automatic processing of multimedia content and its retrieval. For this purpose, we build on the rich ontological foundations provided by an ontology such as the Descriptive Ontology for Linguistic and Cognitive Engineering⁶ (DOLCE) and on sound ontology engineering principles. The result presented in this tutorial is COMM, a core ontology for multimedia, which is able to accommodate results from the manual annotation of data (cf. Section 6) as well as from automated processing (cf. Section 4).

⁵ http://www.w3.org/XML/Schema
⁶ http://wonderweb.semanticweb.org/deliverables/documents/D18.pdf
The remainder of this document is organized as follows: In Section 2, we use an example scenario to illustrate the main problems that arise when MPEG-7 is used for describing multimedia resources. Subsequently, in Section 3, we define the requirements that a multimedia ontology should meet. We review work in image and video processing in Section 4, before we present COMM, an MPEG-7 based ontology, in Section 5 and discuss our design decisions with respect to these requirements. In Section 6, we illustrate how to use COMM in a manual annotation tool. In Section 7, we demonstrate the use of the ontology with the scenario from Section 2, and in Section 8 we indicate challenges and solutions for querying metadata based on COMM. Further and future issues of semantic multimedia are considered in Section 9, before we summarize and conclude the paper.
2 Annotating Multimedia Assets
For annotating multimedia assets, let us imagine Nathalie, a student in history, who wants to create a multimedia presentation of the major international conferences and summits held in the last 60 years. Her starting point is the famous “Big Three” picture, taken at the Yalta (Crimea) Conference, showing the heads of government of the United States, the United Kingdom, and the Soviet Union during World War II. Nathalie uses an MPEG-7 compliant authoring tool for detecting and labeling relevant multimedia objects automatically. On the Internet, she finds three different face recognition web services that provide very good results for detecting Winston Churchill, Franklin D. Roosevelt, and Josef Stalin, respectively. Having these tools, she would like to run the face recognition web services on images and import the extraction results into the authoring tool in order to automatically generate links from the detected face regions to detailed textual information about Churchill, Roosevelt, and Stalin (image in Fig. 1-A).

Fig. 1. MPEG-7 annotation example of an image adapted from Wikipedia, http://en.wikipedia.org/wiki/Yalta_Conference
Nathalie would then like to describe a recent video from a G8 summit, such as the retrospective A history of G8 violence made by Reuters⁷. She again uses an MPEG-7 compliant segmentation tool for detecting the seven main sequences of this 2'26" report: the various anti-capitalist protests during the Seattle (1999), Melbourne (2000), Prague (2000), Gothenburg (2001), Genoa (2001), St Petersburg (2006), and Heiligendamm (2007) World Economic Forums, EU and G8 Summits. Finally, Nathalie plans to deliver her multimedia presentation as an Open Document Format (ODF) document embedding the previously annotated image and video. However, this scenario raises several problems with existing solutions. These problems concern fragment identification, semantic annotation, web interoperability, embedding semantic annotations into compound documents, and querying.

⁷ http://www.reuters.com/news/video/summitVideo?videoId=56114
Fragment identification. Particular regions of the image need to be localized (anchor value in [29]). However, the current web architecture does not provide a means for uniquely identifying sub-parts of media assets in the same way that the fragment identifier in a URI can refer to a part of an HTML or XML document. Indeed, for almost any other media type, such as audio, video, and image, the semantics of the fragment identifier has not been defined or is not commonly accepted. Providing an agreed-upon way to localize sub-parts of multimedia objects (e.g., sub-regions of images, temporal sequences of videos, or moving objects tracked in space and time) is fundamental⁸ [25]. For images, one can use either MPEG-7 or SVG snippet code to define the bounding box coordinates of specific regions. For temporal locations, one can use MPEG-7 code or the TemporalURI RFC⁹. MPEG-21 specifies a normative syntax to be used in URIs for addressing parts of any resource, but the media type is restricted to MPEG [51]. The MPEG-7 approach requires an indirection: an annotation is about a fragment of an XML document that refers to a multimedia document, whereas the MPEG-21 approach does not have this limitation [90].
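
To make the localization problem more concrete, the following SVG snippet sketches how the bounding box of one detected face region could be expressed; the coordinates, the id, and the way such a snippet would be attached to an annotation are invented here purely for illustration.

  <!-- Hypothetical SVG snippet delineating one face region of the Yalta image.
       Coordinates and the id are invented; in practice the snippet would be
       referenced from, or embedded in, the surrounding annotation. -->
  <svg xmlns="http://www.w3.org/2000/svg" width="800" height="600">
    <rect id="region-SR2" x="510" y="120" width="180" height="220"
          fill="none" stroke="red"/>
  </svg>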
Semantic annotation. MPEG-7 is a natural candidate for representing the extraction results of multimedia analysis software such as a face recognition web service. The language, standardized in 2001, specifies a rich vocabulary of multimedia descriptors, which can be represented in either XML or a binary format. While it is possible to specify very detailed annotations using these descriptors, it is not possible to guarantee that MPEG-7 metadata generated by different agents will be mutually understood, due to the lack of formal semantics of this language [32, 87]. The XML code of Fig. 1-B illustrates the inherent interoperability problems of MPEG-7: several descriptors that are semantically equivalent and represent the same information, while using different syntax, can coexist [88]. As Nathalie used three different face recognition web services, the extraction results for the regions SR1, SR2, and SR3 differ from each other even though they are all syntactically correct. While the first service uses the MPEG-7 SemanticType for assigning the <Label> Roosevelt to still region SR1, the second one makes use of a <KeywordAnnotation> for attaching the keyword Churchill to still region SR2. Finally, the third service uses a <StructuredAnnotation> (which can be used within the SemanticType) in order to label still region SR3 with Stalin. Consequently, these alternative ways of annotating the still regions make it almost impossible to retrieve the face recognition results within the authoring tool, since the corresponding XPath¹⁰ query has to deal with all these syntactic variations. As a result, the authoring tool will not link occurrences of Churchill in the images with, e.g., his biography, as it does not expect semantic labels of still regions as part of the <KeywordAnnotation> element.

⁸ See also the forthcoming W3C Media Fragments Working Group: http://www.w3.org/2008/01/media-fragments-wg.html
⁹ http://www.annodex.net/TR/URI_fragments.html
¹⁰ http://www.w3.org/TR/xpath20/
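
The following simplified sketch, which abbreviates element nesting and is not validated against the MPEG-7 schema, gives a flavor of the syntactic variation produced by the three services:

  <!-- Approximate MPEG-7 sketch (not schema-validated): three still regions
       carrying the same kind of information via different constructs. -->
  <StillRegion id="SR1">
    <Semantic>                         <!-- service 1: semantic label -->
      <Label><Name>Roosevelt</Name></Label>
    </Semantic>
  </StillRegion>
  <StillRegion id="SR2">
    <TextAnnotation>                   <!-- service 2: keyword annotation -->
      <KeywordAnnotation><Keyword>Churchill</Keyword></KeywordAnnotation>
    </TextAnnotation>
  </StillRegion>
  <StillRegion id="SR3">
    <TextAnnotation>                   <!-- service 3: structured annotation -->
      <StructuredAnnotation><Who><Name>Stalin</Name></Who></StructuredAnnotation>
    </TextAnnotation>
  </StillRegion>

An XPath query written against one of these patterns, e.g., one selecting KeywordAnnotation/Keyword elements, would miss the results of the other two services, which is exactly the retrieval problem described above.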
Web interoperability. Nathalie would like to link the multimedia presentation to historical information about the key figures of the Yalta Conference or the various G8 summits that is already available. She has also found semantic metadata about the relationships between these figures that could improve the automatic generation of the multimedia presentation. However, she realizes that MPEG-7 cannot be combined with the concepts defined in these domain-specific ontologies because the standard is closed to the web. As this example demonstrates, although MPEG-7 provides ways of associating semantics with (parts of) non-textual media assets, it is incompatible with (semantic) web technologies and has no formal description of the semantics encapsulated implicitly in the standard.
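
The kind of web-wide linkage Nathalie needs could, in principle, be expressed along the lines of the following RDF/XML sketch, in which an identified image region is connected to an externally described person; the example.org URIs, the ex:depicts property, and the DBpedia link are invented for illustration and are neither MPEG-7 constructs nor part of the COMM ontology presented later.

  <!-- Hypothetical RDF/XML sketch: an image region linked to a domain resource. -->
  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
           xmlns:ex="http://example.org/vocab#">
    <rdf:Description rdf:about="http://example.org/yalta.jpg#region-SR2">
      <ex:depicts rdf:resource="http://dbpedia.org/resource/Winston_Churchill"/>
    </rdf:Description>
  </rdf:RDF>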
Embedding into compound documents. Nathalie needs to compile the semantic annotations of the images, videos, and textual stories into a semantically annotated compound document. However, the current state of the art does not provide a framework that allows the semantic annotation of compound documents. MPEG-7 solves the problem only partially, as it is restricted to the description of audiovisual compound documents. Bearing the growing number of multimedia office documents in mind, this limitation is a serious drawback.
Querying. Eventually, Nathalie and other consumers of her compound document may want to pick out specific events related to specific persons or locations. Depending on such a condition and on what they want to pick out, e.g., a two-minute video stream or a keyframe out of a video, they need to formulate a query and receive the corresponding results. The query language and the corresponding engine receiving such a request must be able to drill down into the compound document at an arbitrary level of granularity. For instance, if a person like Churchill appears in a keyframe that is part of a video scene that is part of a video shot, Churchill also appears in the video shot as a whole. The engine must also return results at the desired level of granularity, e.g., the video scene.
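
The granularity issue can be pictured with a small RDF/XML sketch of a shot that contains a scene that contains a keyframe; the URIs and the ex: properties below are hypothetical. A query engine has to propagate the depiction statement attached to the keyframe along the part-of chain, e.g., by applying rules or computing a transitive closure, in order to also return the enclosing scene or shot as an answer.

  <!-- Hypothetical RDF/XML sketch of a nested video decomposition.
       ex:hasPart and ex:depicts are illustrative properties, not COMM terms. -->
  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
           xmlns:ex="http://example.org/vocab#">
    <rdf:Description rdf:about="http://example.org/g8video#shot7">
      <ex:hasPart rdf:resource="http://example.org/g8video#scene2"/>
    </rdf:Description>
    <rdf:Description rdf:about="http://example.org/g8video#scene2">
      <ex:hasPart rdf:resource="http://example.org/g8video#keyframe13"/>
    </rdf:Description>
    <rdf:Description rdf:about="http://example.org/g8video#keyframe13">
      <ex:depicts rdf:resource="http://dbpedia.org/resource/Winston_Churchill"/>
    </rdf:Description>
  </rdf:RDF>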
3 Requirements for Designing a Multimedia Ontology
Requirements for designing a multimedia ontology have been gathered and reported in the literature, e.g., in [35]. Here, we compile these and use our scenario from the previous section to present a list of requirements for a web-compliant multimedia ontology.

MPEG-7 compliance. As an international standard, MPEG-7 is used both in the signal processing and the broadcasting communities. It contains a wealth of accumulated experience that needs to be included in a web-based multimedia ontology. In addition, existing annotations in MPEG-7 should be easily expressible in this multimedia ontology.
