Semantic Multimedia
Steffen Staab¹, Ansgar Scherp¹, Richard Arndt¹, Raphael Troncy², Marcin Grzegorzek¹, Carsten Saathoff¹, Simon Schenk¹, and Lynda Hardman²

¹ ISWeb Research Group, University of Koblenz-Landau, http://isweb.uni-koblenz.de
² Semantic Media Interfaces, CWI Amsterdam, http://www.cwi.nl
Abstract. Multimedia constitutes an interesting field of application for the Semantic Web and Semantic Web reasoning, as the access and management of multimedia content and context depend strongly on the semantic descriptions of both. At the same time, multimedia resources are complex objects whose descriptions are involved and require a foundation in sound modeling practice in order to represent the findings of low- and high-level multimedia analysis and to make them accessible via Semantic Web querying of resources. This tutorial aims to provide a common thread through these different issues and to outline where Semantic Web modeling and reasoning needs to contribute further to the area of semantic multimedia for a fruitful interaction between these two fields of computer science.
1 Semantics for Multimedia
Multimedia objects are ubiquitous, whether found via web search (e.g., Google¹ or Yahoo!² images), via dedicated sites (e.g., Flickr³ or YouTube⁴), or in the repositories of private users or commercial organizations (film archives, broadcasters, photo agencies, etc.). The media objects are produced and consumed by professionals and amateurs alike. Unlike textual assets, whose content can be searched for using text strings, media search is dependent on (i) complex analysis processes, (ii) manual descriptions of multimedia resources, (iii) the representation of these results and contributions in a widely understandable format for (iv) later retrieval and/or querying by the consumer of this data.

In the past, this process has not been supported by an interoperable and easily extensible machinery of processing tools, applications, and data formats, but only by idiosyncratic combinations of system components into sealed-off applications, such that effective sharing of their semantic metadata remained impossible and the linkage to semantic data and ontologies found on the Semantic Web remained far off.

¹ http://images.google.com/
² http://images.search.yahoo.com/
³ http://www.flickr.com/
⁴ http://www.youtube.com/

MPEG-7 [52, 57] is an international standard defined by the Moving Picture Experts Group (MPEG) that specifies how to connect descriptions to parts of a media asset. The standard includes descriptors representing low-level media-specific features that can often be automatically extracted from media types. Unfortunately, MPEG-7 is not fully suitable for describing multimedia content, because (i) it is not open to standards that represent knowledge and make use of existing controlled vocabularies for describing the subject matter, and (ii) its XML Schema⁵ based nature has led to design decisions that leave the annotations conceptually ambiguous and therefore prevent direct machine processing of semantic content descriptions.

In order to avoid such problems, we advocate the use of Semantic Web languages and a core ontology for multimedia annotations throughout the manual and automatic processing of multimedia content and its retrieval. For this purpose, we build on the rich ontological foundations provided by an ontology such as the Descriptive Ontology for Linguistic and Cognitive Engineering⁶ (DOLCE) and on sound ontology engineering principles. The result presented in this tutorial is COMM, a core ontology for multimedia, which is able to accommodate results from the manual annotation of data (cf. Section 6) as well as from automated processing (cf. Section 4).

⁵ http://www.w3.org/XML/Schema
⁶ http://wonderweb.semanticweb.org/deliverables/documents/D18.pdf
The remainder of this document is organized as follows: In Section 2, we use an example scenario to illustrate the main problems that arise when MPEG-7 is used for describing multimedia resources. Subsequently, in Section 3, we define the requirements that a multimedia ontology should meet. We review work in image and video processing in Section 4, before we present COMM, an MPEG-7 based ontology, in Section 5 and discuss our design decisions with respect to these requirements. In Section 6, we illustrate how to use COMM in a manual annotation tool. In Section 7, we demonstrate the use of the ontology with the scenario from Section 2, and in Section 8 we indicate challenges and solutions for querying metadata based on COMM. Further and future issues of semantic multimedia are considered in Section 9, before we summarize and conclude the paper.
2 Annotating Multimedia Assets
For annotating multimedia assets, let us imagine Nathalie, a student in history, who wants to create a multimedia presentation of the major international conferences and summits held in the last 60 years. Her starting point is the famous “Big Three” picture, taken at the Yalta (Crimea) Conference, showing the heads of government of the United States, the United Kingdom, and the Soviet Union during World War II. Nathalie uses an MPEG-7 compliant authoring tool for detecting and labeling relevant multimedia objects automatically. On the Internet, she finds three different face recognition web services that provide very good results for detecting Winston Churchill, Franklin D. Roosevelt, and Josef Stalin, respectively. Having these tools, she would like to run the face recognition web services on images and import the extraction results into the authoring tool in order to automatically generate links from the detected face regions to detailed textual information about Churchill, Roosevelt, and Stalin (image in Fig. 1-A).

Fig. 1. MPEG-7 annotation example of an image adapted from Wikipedia, http://en.wikipedia.org/wiki/Yalta_Conference
Nathalie would then like to describe a recent video from a G8 summit, such as the retrospective A history of G8 violence made by Reuters⁷. She again uses an MPEG-7 compliant segmentation tool for detecting the seven main sequences of this 2'26" report: the various anti-capitalist protests during the Seattle (1999), Melbourne (2000), Prague (2000), Gothenburg (2001), Genoa (2001), St Petersburg (2006), and Heiligendamm (2007) World Economic Forums, EU and G8 Summits. Finally, Nathalie plans to deliver her multimedia presentation as an Open Document Format (ODF) document embedding the previously annotated image and video. However, this scenario raises several problems with existing solutions. These problems concern fragment identification, semantic annotation, web interoperability, embedding semantic annotations into compound documents, and querying.

⁷ http://www.reuters.com/news/video/summitVideo?videoId=56114
Fragment identification. Particular regions of the image need to be localized (anchor value in [29]). However, the current web architecture does not provide a means for uniquely identifying sub-parts of media assets in the same way that the fragment identifier in a URI can refer to a part of an HTML or XML document. Indeed, for almost any other media type, such as audio, video, and image, the semantics of the fragment identifier has not been defined or is not commonly accepted. Providing an agreed-upon way to localize sub-parts of multimedia objects (e.g., sub-regions of images, temporal sequences of videos, or moving objects tracked in space and time) is fundamental⁸ [25]. For images, one can use either MPEG-7 or SVG snippet code to define the bounding box coordinates of specific regions. For temporal locations, one can use MPEG-7 code or the TemporalURI RFC⁹. MPEG-21 specifies a normative syntax to be used in URIs for addressing parts of any resource, but the media type is restricted to MPEG [51]. The MPEG-7 approach requires an indirection: an annotation is about a fragment of an XML document that refers to a multimedia document, whereas the MPEG-21 approach does not have this limitation [90].
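
To make the localization problem more concrete, the following SVG snippet sketches how the bounding box of one detected face region could be expressed; the coordinates, the id, and the way such a snippet would be attached to an annotation are invented here purely for illustration.

  <!-- Hypothetical SVG snippet delineating one face region of the Yalta image.
       Coordinates and the id are invented; in practice the snippet would be
       referenced from, or embedded in, the surrounding annotation. -->
  <svg xmlns="http://www.w3.org/2000/svg" width="800" height="600">
    <rect id="region-SR2" x="510" y="120" width="180" height="220"
          fill="none" stroke="red"/>
  </svg>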
Semantic annotation. MPEG-7 is a natural candidate for representing the extraction results of multimedia analysis software such as a face recognition web service. The language, standardized in 2001, specifies a rich vocabulary of multimedia descriptors, which can be represented in either XML or a binary format. While it is possible to specify very detailed annotations using these descriptors, it is not possible to guarantee that MPEG-7 metadata generated by different agents will be mutually understood, due to the lack of formal semantics of this language [32, 87]. The XML code of Fig. 1-B illustrates the inherent interoperability problems of MPEG-7: several descriptors that are semantically equivalent and represent the same information, while using different syntax, can coexist [88]. As Nathalie used three different face recognition web services, the extraction results for the regions SR1, SR2, and SR3 differ from each other even though they are all syntactically correct. While the first service uses the MPEG-7 SemanticType for assigning the <Label> Roosevelt to still region SR1, the second one makes use of a <KeywordAnnotation> for attaching the keyword Churchill to still region SR2. Finally, the third service uses a <StructuredAnnotation> (which can be used within the SemanticType) in order to label still region SR3 with Stalin. Consequently, these alternative ways of annotating the still regions make it almost impossible to retrieve the face recognition results within the authoring tool, since the corresponding XPath¹⁰ query has to deal with all these syntactic variations. As a result, the authoring tool will not link occurrences of Churchill in the images with, e.g., his biography, as it does not expect semantic labels of still regions as part of the <KeywordAnnotation> element.

⁸ See also the forthcoming W3C Media Fragments Working Group: http://www.w3.org/2008/01/media-fragments-wg.html
⁹ http://www.annodex.net/TR/URI_fragments.html
¹⁰ http://www.w3.org/TR/xpath20/
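
The following simplified sketch, which abbreviates element nesting and is not validated against the MPEG-7 schema, gives a flavor of the syntactic variation produced by the three services:

  <!-- Approximate MPEG-7 sketch (not schema-validated): three still regions
       carrying the same kind of information via different constructs. -->
  <StillRegion id="SR1">
    <Semantic>                         <!-- service 1: semantic label -->
      <Label><Name>Roosevelt</Name></Label>
    </Semantic>
  </StillRegion>
  <StillRegion id="SR2">
    <TextAnnotation>                   <!-- service 2: keyword annotation -->
      <KeywordAnnotation><Keyword>Churchill</Keyword></KeywordAnnotation>
    </TextAnnotation>
  </StillRegion>
  <StillRegion id="SR3">
    <TextAnnotation>                   <!-- service 3: structured annotation -->
      <StructuredAnnotation><Who><Name>Stalin</Name></Who></StructuredAnnotation>
    </TextAnnotation>
  </StillRegion>

An XPath query written against one of these patterns, e.g., one selecting KeywordAnnotation/Keyword elements, would miss the results of the other two services, which is exactly the retrieval problem described above.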
Web interoperability. Nathalie would like to link the multimedia presentation to historical information about the key figures of the Yalta Conference or the various G8 summits that is already available. She has also found semantic metadata about the relationships between these figures that could improve the automatic generation of the multimedia presentation. However, she realizes that MPEG-7 cannot be combined with the concepts defined in these domain-specific ontologies because the standard is closed to the web. As this example demonstrates, although MPEG-7 provides ways of associating semantics with (parts of) non-textual media assets, it is incompatible with (semantic) web technologies and has no formal description of the semantics encapsulated implicitly in the standard.
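
The kind of web-wide linkage Nathalie needs could, in principle, be expressed along the lines of the following RDF/XML sketch, in which an identified image region is connected to an externally described person; the example.org URIs, the ex:depicts property, and the DBpedia link are invented for illustration and are neither MPEG-7 constructs nor part of the COMM ontology presented later.

  <!-- Hypothetical RDF/XML sketch: an image region linked to a domain resource. -->
  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
           xmlns:ex="http://example.org/vocab#">
    <rdf:Description rdf:about="http://example.org/yalta.jpg#region-SR2">
      <ex:depicts rdf:resource="http://dbpedia.org/resource/Winston_Churchill"/>
    </rdf:Description>
  </rdf:RDF>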
Embedding into compound documents. Nathalie needs to compile the semantic annotations of the images, videos, and textual stories into a semantically annotated compound document. However, the current state of the art does not provide a framework that allows the semantic annotation of compound documents. MPEG-7 solves the problem only partially, as it is restricted to the description of audiovisual compound documents. Bearing the growing number of multimedia office documents in mind, this limitation is a serious drawback.
Querying. Eventually, Nathalie and other consumers of her compound document may want to pick out specific events related to specific persons or locations. Depending on such a condition and on what they want to pick out, e.g., a two-minute video stream or a keyframe out of a video, they need to formulate a query and receive the corresponding results. The query language and the corresponding engine receiving such a request must be able to drill down into the compound document at an arbitrary level of granularity. For instance, if a person like Churchill appears in a keyframe that is part of a video scene that is part of a video shot, Churchill also appears in the video shot as a whole. The engine must also return results at the desired level of granularity, e.g., the video scene.
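
The granularity issue can be pictured with a small RDF/XML sketch of a shot that contains a scene that contains a keyframe; the URIs and the ex: properties below are hypothetical. A query engine has to propagate the depiction statement attached to the keyframe along the part-of chain, e.g., by applying rules or computing a transitive closure, in order to also return the enclosing scene or shot as an answer.

  <!-- Hypothetical RDF/XML sketch of a nested video decomposition.
       ex:hasPart and ex:depicts are illustrative properties, not COMM terms. -->
  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
           xmlns:ex="http://example.org/vocab#">
    <rdf:Description rdf:about="http://example.org/g8video#shot7">
      <ex:hasPart rdf:resource="http://example.org/g8video#scene2"/>
    </rdf:Description>
    <rdf:Description rdf:about="http://example.org/g8video#scene2">
      <ex:hasPart rdf:resource="http://example.org/g8video#keyframe13"/>
    </rdf:Description>
    <rdf:Description rdf:about="http://example.org/g8video#keyframe13">
      <ex:depicts rdf:resource="http://dbpedia.org/resource/Winston_Churchill"/>
    </rdf:Description>
  </rdf:RDF>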
3 Requirements for Designing a Multimedia Ontology
Requirements for designing a multimedia ontology have been gathered and reported in the literature, e.g., in [35]. Here, we compile these and use our scenario from the previous section to present a list of requirements for a web-compliant multimedia ontology.

MPEG-7 compliance. As an international standard, MPEG-7 is used both in the signal processing and the broadcasting communities. It contains a wealth of accumulated experience that needs to be included in a web-based multimedia ontology. In addition, existing annotations in MPEG-7 should be easily expressible in this multimedia ontology.
