
Incremental Natural Language Description of Dynamic Imagery

Gerd Herzog,* Chen-Ko Sung,+ Elisabeth André,** Wilfried Enkelmann,+ Hans-Hellmut Nagel,+ ++ Thomas Rist,** Wolfgang Wahlster,* ** Georg Zimmermann+

SFB 314, Project VITRA, Universität des Saarlandes, D-66041 Saarbrücken, Germany
* German Research Center for Artificial Intelligence (DFKI), D-66123 Saarbrücken, Germany
+ Fraunhofer-Institut für Informations- und Datenverarbeitung (IITB), D-76131 Karlsruhe, Germany
++ Fakultät für Informatik, Universität Karlsruhe (TH), D-76128 Karlsruhe, Germany
Abstract

Although image understanding and natural language processing constitute two major areas of AI, they have mostly been studied independently of each other. Only a few attempts have been concerned with the integration of computer vision and the generation of natural language expressions for the description of image sequences.

The aim of our joint efforts at combining a vision system and a natural language access system is the automatic simultaneous description of dynamic imagery, i.e., we are interested in image interpretation and language processing on an incremental basis. In this contribution¹ we sketch an approach towards the integration of the Karlsruhe vision system called Actions and the natural language component Vitra developed in Saarbrücken. The steps toward realization, based on available components, are outlined and the capabilities of the current system are demonstrated.

¹ The work described here was partly supported by the Sonderforschungsbereich 314 der Deutschen Forschungsgemeinschaft, "Künstliche Intelligenz und wissensbasierte Systeme", projects V1 (IITB, Karlsruhe) and N2: VITRA (Universität des Saarlandes).

In: C. Freksa and W. Brauer (eds.), Wissensbasierte Systeme, pp. 153-162, Berlin, Heidelberg: Springer.
Zusammenfassung (German abstract, translated)

Although image understanding and natural language processing constitute two of the core areas of AI, they have so far been studied almost independently of each other. Only very few approaches have dealt with the integration of machine vision and the generation of natural language utterances for the description of image sequences.

The goal of our cooperation in coupling an image understanding system and a natural language access system is the automatic simultaneous description of time-varying scenes, i.e., we are interested in image sequence interpretation and language processing on an incremental basis. In this contribution we describe an approach towards the integration of the Karlsruhe image sequence analysis system Actions and the natural language component Vitra, which is being developed in Saarbrücken. The steps toward realization, based on already available components, are outlined, and the capabilities of the currently existing system are demonstrated.
This paper appeared in: C. Freksa and W. Brauer (eds.), Wissensbasierte Systeme. 3. Int. GI-Kongreß, pp. 153–162. Berlin, Heidelberg: Springer, 1989.

1 Introduction
Image understanding and natural language processing are two major areas of research
within AI that have generally been studied independently of one another. Advances
in both technical fields during the last 10 years form a promising basis for the de-
sign and construction of integrated knowledge-based systems capable of translating
visual information into natural language descriptions. From the point of view of cog-
nitive science, anchoring meaning in a referential semantics is of theoretical as well as
practical interest. From the engineering perspective, the systems envisaged here could
serve such practical purposes as handling the vast amount of visual data accumulating,
for example, in medical technology, remote sensing, and traffic control.
The goal of our joint efforts at combining a vision system and a natural language
access system is the automatic simultaneous description of dynamic imagery, i.e., we
are interested in image interpretation and language processing on an incremental basis.
The conversational setting is this: the system provides a running report of the scene it
is watching for a listener who cannot see the scene her/himself, but who is assumed to
have prior knowledge about its static properties. In this paper we describe the integration of the Karlsruhe vision system Actions and the natural language component Vitra developed in Saarbrücken.² The steps toward realization, based on available components, are outlined, and results already obtained in the investigation of traffic scenes and short sequences from soccer matches will be discussed.
2 Relations to Previous Research
Following Kanade (see Kanade [1980]), it is advantageous for a discussion of machine
vision to distinguish between the 2-D picture domain and the 3-D scene domain. So
far, most machine vision approaches have been concerned (i) with the detection and
localization of significant grey value variations (corners, edges, regions) in the picture
domain, and in the scene domain (ii) with the estimation of 3-D shape descriptions,
as well as—more recently—(iii) with the evaluation of image sequences for object
tracking and automatic navigation. Among the latter approaches, the estimation of
relative motion between camera(s) and scene components as well as the estimation
of spatial structures, i.e., surfaces and objects, are focal points of activity (see Ay-
ache and Faugeras [1987], Faugeras [1988], Nagel [1988b]). Few research results
have been published about attempts to associate picture domain cues extracted from
image sequences with conceptual descriptions that could be linked directly to efforts
at algorithmic processing of natural language expressions and sentences. In this con-
text, computer-based generic descriptions for complex movements become important.
Those accessible in the image understanding literature have been surveyed in Nagel
[1988a]. Two even more recent investigations in this direction have been published
² The acronyms stand for `Automatic Cueing and Trajectory estimation in Imagery of Objects in Natural Scenes' and `VIsual TRAnslator'.

in Witkin et al. [1988] (in particular Section D) and Goddard [1988]. A few selected
approaches from the literature are outlined in the remainder of this section to provide
a background for the ideas presented here.
In Badler [1975], Badler studied the interpretation of simulated image sequences
with object motions in terms of natural language oriented concepts. His approach has
been improved by Tsotsos, who proposed a largely domain-independent hierarchy of
conceptual motion frames which is specialized further within the system Alven to ana-
lyze X-ray image sequences showing left ventricular wall motion (see Tsotsos [1985]).
Later, a similar system for the analysis of scintigraphic image sequences of the human
heart was developed by Niemann et al. (see Niemann et al. [1985]). Based on a study
of Japanese verbs, Okada developed a set of 20 semantic features to be used within the
system Supp to match those verb patterns that are applicable to simple line drawings
(see Okada [1979]). Traffic scenes constitute one of the diverse domains of the dialog
system Ham-Ans (see Wahlster et al. [1983]). Based on a procedural referential se-
mantics for certain verbs of locomotion, the system answers questions concerning the
motions of vehicles and pedestrians. The system Naos (see Neumann [1984], Novak
[1986]) also allows for a retrospective natural language description. In Naos, event
recognition is based on a hierarchy of event models, i.e., declarative descriptions of
classes of events organized around verbs of locomotion. The more recent Epex system
(see Walter et al. [1988]) studies the handling of conceptual units of higher semantic
complexity, but still in an a posteriori way.
The natural language interfaces mentioned so far have not been connected to real vision components; they use only simulated data. Apart from our previous results (see André et al. [1986], Schirra et al. [1987]), the LandScan system (see Bajcsy et al. [1985]) constitutes the only approach in which processing spans the entire distance between raw images and natural language utterances, but it deals only with static scenes.
3 Simultaneous Evaluation and Natural Language Description of Image Sequences
The main goal of our cooperation is the design and implementation of an integrated
system that performs a kind of simultaneous reporting, that is, evaluating an image
sequence and immediately generating a natural language description of the salient ac-
tivities corresponding to the most recent image subsequence. It is not (yet) real-time
evaluation, but our approach emphasizes concurrency of image sequence evaluation
and natural language generation.
In order to gain a realistic insight into the problems associated with such an endeavor, we decided to evaluate real-world image sequences with multiple mobile agents or objects, based on system components which are already partially available due to previous research efforts in the laboratories involved. Since the analysis of complex articulated movements still exceeds our capabilities given the computational resources available today, we concentrate initially on the picture domain in order to detect and
track projected object candidates, which are considered to be essentially rigid. The
crucial links between the picture domain results and the natural language process-
ing steps are provided by complex events, i.e., higher conceptual units capturing the
spatio-temporal aspects of object motions. A complex event should be understood as an
`event' in its broadest sense, comprising also notions like `episode' and `history' (see
Nagel [1988a]). The recognition of intentions and plans (see Retz-Schmidt [1988]) is,
however, outside the scope of this paper. In what follows, the term `event' will be used
to refer to complex events.
3.1 Overall Structure of the Approach
The task of generating natural language descriptions based on visual data can roughly
be subdivided into three parts: (1) constructing an abstract propositional description of
the scene, the so-called Geometrical Scene Description (GSD, see Neumann [1984]),
(2) further interpretation of this intermediate geometrical representation by recogniz-
ing complex events, and (3) selection and verbalization of appropriate propositions
derived in step 2 to describe the scene under discussion. Because of the simultaneity
of the description in our case, the three steps have to be carried out incrementally.
Figure 1: The architecture of the integrated system
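The three incremental steps can be sketched as follows. This is a minimal illustration, not the original system's interface: the class and function names (GeometricalSceneDescription, recognize_events, verbalize) and the trivial motion test are invented stand-ins; only the pipeline structure and the per-frame, incremental invocation of all three steps reflect the approach described above.

```python
from dataclasses import dataclass, field

@dataclass
class GeometricalSceneDescription:
    """Step 1: abstract propositional description of the scene (GSD)."""
    trajectories: dict = field(default_factory=dict)  # object id -> [(frame, position)]

    def update(self, frame_number, detections):
        # Append the newest position of each tracked object candidate.
        for obj_id, position in detections.items():
            self.trajectories.setdefault(obj_id, []).append((frame_number, position))

def recognize_events(gsd):
    """Step 2: interpret the GSD by recognizing complex events (toy stub)."""
    events = []
    for obj_id, track in gsd.trajectories.items():
        if len(track) >= 2 and track[-1][1] != track[-2][1]:
            events.append(("move", obj_id))
    return events

def verbalize(events):
    """Step 3: select and verbalize appropriate propositions (toy stub)."""
    return [f"Object {obj} is moving." for kind, obj in events if kind == "move"]

# Incremental main loop: all three steps run after every new frame,
# rather than once after the whole image sequence has been analyzed.
gsd = GeometricalSceneDescription()
for frame, detections in enumerate([{"a": (0, 0)}, {"a": (1, 0)}]):
    gsd.update(frame, detections)
    utterances = verbalize(recognize_events(gsd))
```

The point of the sketch is the control flow: the GSD, event recognition, and verbalization are updated in lockstep with the incoming frames, which is what makes simultaneous reporting possible.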

Frequently Asked Questions (15)
Q1. What contributions have the authors mentioned in the paper "Incremental natural language description of dynamic imagery"?

Although image understanding and natural language processing constitute two major areas of AI, they have mostly been studied independently of each other. The aim of their joint efforts at combining a vision system and a natural language access system is the automatic simultaneous description of dynamic imagery, i.e., the authors are interested in image interpretation and language processing on an incremental basis. The steps toward realization, based on available components, are outlined. The work described was partly supported by the Sonderforschungsbereich 314 der Deutschen Forschungsgemeinschaft, "Künstliche Intelligenz und wissensbasierte Systeme", projects V1 (IITB, Karlsruhe) and N2: VITRA (Universität des Saarlandes).

Their approach emphasizes concurrent image sequence evaluation and natural language processing, an important prerequisite for real-time performance, which is the long-term goal of this work. 

Using course diagrams guarantees that primitive motion concepts as well as complex activities can be defined in a uniform and declarative way.

In order to model durative events like `move', a further predicate called succeed was introduced to express the continuation of an event. 

In addition to a specification of roles denoting participating objects, which must be members of specified object classes, an event model includes a course diagram, used to model the prototypical progression of an event. 

Because of the strong temporal restrictions the system cannot talk about all recognized events, thus it has to decide which events should be verbalized in order to enable the listener to follow the scene. 

In the process of transforming symbolic event descriptions into natural language utterances, first a verb is selected by accessing the concept lexicon, and the case-roles associated with the verb are instantiated.

The relevance of an event depends on factors like: (i) salience, which is determined by the frequency of occurrence and the complexity of the generic event model, (ii) topicality, and (iii) current state, i.e., events with state succeed or stop are preferred. 
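A relevance measure combining the three factors named above might look like the following sketch. The weights, scales, and field names are invented for illustration and are not taken from the original system; only the three factors (salience, topicality, current state) come from the text.

```python
def relevance(event, topic_objects, frequency, model_complexity):
    """Toy relevance score for deciding which recognized events to verbalize."""
    # (i) salience: rarer events with more complex generic models score higher
    salience = model_complexity / (1.0 + frequency)
    # (ii) topicality: events involving objects already under discussion
    topicality = 1.0 if event["object"] in topic_objects else 0.0
    # (iii) current state: events with state `succeed' or `stop' are preferred
    state_bonus = 1.0 if event["state"] in ("succeed", "stop") else 0.0
    return salience + topicality + state_bonus

ev = {"type": "move", "object": "player3", "state": "succeed"}
score = relevance(ev, topic_objects={"player3"}, frequency=4, model_complexity=2.0)
```

Under the temporal pressure of simultaneous reporting, only the highest-scoring events would be passed on to the generation component.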

Each recognition cycle starts at the lowest level of the event hierarchy: first, the traversal of course diagrams corresponding to basic events is attempted; later, more complex event instances can look at those lower levels to verify the existence of their necessary subevents. 

The language generation component selects relevant propositions from this buffer, orders them and finally transforms the non-verbal information into an ordered sequence of either written or spoken German words. 

Image understanding and natural language processing are two major areas of research within AI that have generally been studied independently of one another. 

The as yet partial trajectories delivered by Actions are currently used to synthesize interactively a realistic GSD, with object candidates assigned to previously known players and the ball.

The recognition of an occurrence can be thought of as traversing the course diagram, where the edge types (:trigger, :proceed, etc.) are used for the definition of their basic event predicates (see Section 3.3). 
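The idea of recognizing a durative event like `move' by traversing its course diagram can be sketched as follows. The edge labels in the comments mirror the types mentioned above (:trigger, :proceed); the state names and the simplistic motion predicate are illustrative assumptions, not the original event models.

```python
def moved(prev_pos, pos):
    """Toy basic event predicate: did the object change position?"""
    return prev_pos is not None and pos != prev_pos

class CourseDiagram:
    """Traverses a minimal course diagram for a durative `move' event."""

    def __init__(self):
        self.state = "inactive"
        self.prev_pos = None

    def step(self, pos):
        if self.state == "inactive" and moved(self.prev_pos, pos):
            self.state = "start"        # :trigger edge fires on first motion
        elif self.state in ("start", "succeed"):
            if moved(self.prev_pos, pos):
                self.state = "succeed"  # :proceed edge: the event continues
            else:
                self.state = "stop"     # motion has ended
        self.prev_pos = pos
        return self.state

diagram = CourseDiagram()
states = [diagram.step(p) for p in [(0, 0), (1, 0), (2, 0), (2, 0)]]
```

Each new frame advances the diagram by at most one edge, so recognition proceeds incrementally, and more complex event models could query the resulting states of such basic diagrams as their subevents.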

The authors have shown that the various processing steps from raw images to natural language utterances, i.e., picture domain analysis of the image sequence, event recognition, and natural language generation, must be carried out on an incremental basis. 

Since the first results described in Schirra et al. [1987], more than 3000 frames (120 seconds) of image sequences recorded from a major traffic intersection in Karlsruhe have been evaluated by the Actions system.