
Incremental Natural Language Description of Dynamic Imagery

Gerd Herzog,* Chen-Ko Sung,+ Elisabeth André,** Wilfried Enkelmann,+ Hans-Hellmut Nagel,+ ++ Thomas Rist,** Wolfgang Wahlster,* ** Georg Zimmermann+

SFB 314, Project VITRA, Universität des Saarlandes, D-66041 Saarbrücken, Germany
* German Research Center for Artificial Intelligence (DFKI), D-66123 Saarbrücken, Germany
+ Fraunhofer-Institut für Informations- und Datenverarbeitung (IITB), D-76131 Karlsruhe, Germany
++ Fakultät für Informatik, Universität Karlsruhe (TH), D-76128 Karlsruhe, Germany
Abstract

Although image understanding and natural language processing constitute two major areas of AI, they have mostly been studied independently of each other. Only a few attempts have been concerned with the integration of computer vision and the generation of natural language expressions for the description of image sequences.

The aim of our joint efforts at combining a vision system and a natural language access system is the automatic simultaneous description of dynamic imagery, i.e., we are interested in image interpretation and language processing on an incremental basis. In this contribution¹ we sketch an approach towards the integration of the Karlsruhe vision system called Actions and the natural language component Vitra developed in Saarbrücken. The steps toward realization, based on available components, are outlined and the capabilities of the current system are demonstrated.

¹ The work described here was partly supported by the Sonderforschungsbereich 314 der Deutschen Forschungsgemeinschaft, "Künstliche Intelligenz und wissensbasierte Systeme", projects V1 (IITB, Karlsruhe) and N2: VITRA (Universität des Saarlandes).

In: C. Freksa and W. Brauer (eds.), Wissensbasierte Systeme, pp. 153-162, Berlin, Heidelberg: Springer.
Zusammenfassung (German abstract, translated)

Although image understanding and natural language processing constitute two of the core areas of AI, they have so far been studied almost independently of each other. Only very few approaches have dealt with the integration of machine vision and the generation of natural language utterances for the description of image sequences.

The goal of our cooperation in coupling an image understanding system and a natural language access system is the automatic simultaneous description of time-varying scenes, i.e., we are interested in image sequence interpretation and language processing on an incremental basis. In this contribution we describe an approach towards the integration of the Karlsruhe image sequence analysis system Actions and the natural language component Vitra, which is being developed in Saarbrücken. The steps toward realization, based on already available components, are outlined, and the capabilities of the currently existing system are demonstrated.
This paper appeared in: C. Freksa and W. Brauer (eds.), Wissensbasierte Systeme. 3. Int. GI-Kongreß, pp. 153–162. Berlin, Heidelberg: Springer, 1989.

1 Introduction
Image understanding and natural language processing are two major areas of research
within AI that have generally been studied independently of one another. Advances
in both technical fields during the last 10 years form a promising basis for the de-
sign and construction of integrated knowledge-based systems capable of translating
visual information into natural language descriptions. From the point of view of cog-
nitive science, anchoring meaning in a referential semantics is of theoretical as well as
practical interest. From the engineering perspective, the systems envisaged here could
serve such practical purposes as handling the vast amount of visual data accumulating,
for example, in medical technology, remote sensing, and traffic control.
The goal of our joint efforts at combining a vision system and a natural language
access system is the automatic simultaneous description of dynamic imagery, i.e., we
are interested in image interpretation and language processing on an incremental basis.
The conversational setting is this: the system provides a running report of the scene it
is watching for a listener who cannot see the scene her/himself, but who is assumed to
have prior knowledge about its static properties. In this paper we describe the integration of the Karlsruhe vision system Actions and the natural language component Vitra developed in Saarbrücken.² The steps toward realization, based on available components, are outlined, and results already obtained in the investigation of traffic scenes and short sequences from soccer matches will be discussed.
2 Relations to Previous Research
Following Kanade (see Kanade [1980]), it is advantageous for a discussion of machine
vision to distinguish between the 2-D picture domain and the 3-D scene domain. So
far, most machine vision approaches have been concerned (i) with the detection and
localization of significant grey value variations (corners, edges, regions) in the picture
domain, and in the scene domain (ii) with the estimation of 3-D shape descriptions,
as well as—more recently—(iii) with the evaluation of image sequences for object
tracking and automatic navigation. Among the latter approaches, the estimation of
relative motion between camera(s) and scene components as well as the estimation
of spatial structures, i.e., surfaces and objects, are focal points of activity (see Ay-
ache and Faugeras [1987], Faugeras [1988], Nagel [1988b]). Few research results
have been published about attempts to associate picture domain cues extracted from
image sequences with conceptual descriptions that could be linked directly to efforts
at algorithmic processing of natural language expressions and sentences. In this con-
text, computer-based generic descriptions for complex movements become important.
Those accessible in the image understanding literature have been surveyed in Nagel
[1988a]. Two even more recent investigations in this direction have been published
² The acronyms stand for `Automatic Cueing and Trajectory estimation in Imagery of Objects in Natural Scenes' and `VIsual TRAnslator'.

in Witkin et al. [1988] (in particular Section D) and Goddard [1988]. A few selected
approaches from the literature are outlined in the remainder of this section to provide
a background for the ideas presented here.
In Badler [1975], Badler studied the interpretation of simulated image sequences
with object motions in terms of natural language oriented concepts. His approach has
been improved by Tsotsos, who proposed a largely domain-independent hierarchy of
conceptual motion frames which is specialized further within the system Alven to ana-
lyze X-ray image sequences showing left ventricular wall motion (see Tsotsos [1985]).
Later, a similar system for the analysis of scintigraphic image sequences of the human
heart was developed by Niemann et al. (see Niemann et al. [1985]). Based on a study
of Japanese verbs, Okada developed a set of 20 semantic features to be used within the
system Supp to match those verb patterns that are applicable to simple line drawings
(see Okada [1979]). Traffic scenes constitute one of the diverse domains of the dialog
system Ham-Ans (see Wahlster et al. [1983]). Based on a procedural referential se-
mantics for certain verbs of locomotion, the system answers questions concerning the
motions of vehicles and pedestrians. The system Naos (see Neumann [1984], Novak
[1986]) also allows for a retrospective natural language description. In Naos, event
recognition is based on a hierarchy of event models, i.e., declarative descriptions of
classes of events organized around verbs of locomotion. The more recent Epex system
(see Walter et al. [1988]) studies the handling of conceptual units of higher semantic
complexity, but still in an a posteriori way.
The natural language interfaces mentioned so far have not been connected to real vision components; they use only simulated data. Apart from our previous results (see André et al. [1986], Schirra et al. [1987]), the LandScan system (see Bajcsy et al. [1985]) constitutes the only approach in which processing spans the entire distance between raw images and natural language utterances, but it deals only with static scenes.
3 Simultaneous Evaluation and Natural Language Description of Image Sequences
The main goal of our cooperation is the design and implementation of an integrated
system that performs a kind of simultaneous reporting, that is, evaluating an image
sequence and immediately generating a natural language description of the salient ac-
tivities corresponding to the most recent image subsequence. It is not (yet) real-time
evaluation, but our approach emphasizes concurrency of image sequence evaluation
and natural language generation.
In order to gain a realistic insight into the problems associated with such an endeavor, we decided to evaluate real-world image sequences with multiple mobile agents or objects, based on system components which are already partially available due to previous research efforts in the laboratories involved. Since the analysis of complex articulated movements still exceeds our capabilities given the computational resources available today, we concentrate initially on the picture domain in order to detect and
track projected object candidates, which are considered to be essentially rigid. The
crucial links between the picture domain results and the natural language process-
ing steps are provided by complex events, i.e., higher conceptual units capturing the
spatio-temporal aspects of object motions. A complex event should be understood as an
`event' in its broadest sense, comprising also notions like `episode' and `history' (see
Nagel [1988a]). The recognition of intentions and plans (see Retz-Schmidt [1988]) is,
however, outside the scope of this paper. In what follows, the term `event' will be used
to refer to complex events.
3.1 Overall Structure of the Approach
The task of generating natural language descriptions based on visual data can roughly
be subdivided into three parts: (1) constructing an abstract propositional description of
the scene, the so-called Geometrical Scene Description (GSD, see Neumann [1984]),
(2) further interpretation of this intermediate geometrical representation by recogniz-
ing complex events, and (3) selection and verbalization of appropriate propositions
derived in step 2 to describe the scene under discussion. Because of the simultaneity
of the description in our case, the three steps have to be carried out incrementally.
Figure 1: The architecture of the integrated system
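The three incremental steps can be sketched as follows. This is a minimal illustration, not the original system's interface: the class and function names (GeometricalSceneDescription, recognize_events, verbalize) and the trivial motion test are invented stand-ins; only the pipeline structure and the per-frame, incremental invocation of all three steps reflect the approach described above.

```python
from dataclasses import dataclass, field

@dataclass
class GeometricalSceneDescription:
    """Step 1: abstract propositional description of the scene (GSD)."""
    trajectories: dict = field(default_factory=dict)  # object id -> [(frame, position)]

    def update(self, frame_number, detections):
        # Append the newest position of each tracked object candidate.
        for obj_id, position in detections.items():
            self.trajectories.setdefault(obj_id, []).append((frame_number, position))

def recognize_events(gsd):
    """Step 2: interpret the GSD by recognizing complex events (toy stub)."""
    events = []
    for obj_id, track in gsd.trajectories.items():
        if len(track) >= 2 and track[-1][1] != track[-2][1]:
            events.append(("move", obj_id))
    return events

def verbalize(events):
    """Step 3: select and verbalize appropriate propositions (toy stub)."""
    return [f"Object {obj} is moving." for kind, obj in events if kind == "move"]

# Incremental main loop: all three steps run after every new frame,
# rather than once after the whole image sequence has been analyzed.
gsd = GeometricalSceneDescription()
for frame, detections in enumerate([{"a": (0, 0)}, {"a": (1, 0)}]):
    gsd.update(frame, detections)
    utterances = verbalize(recognize_events(gsd))
```

The point of the sketch is the control flow: the GSD, event recognition, and verbalization are updated in lockstep with the incoming frames, which is what makes simultaneous reporting possible.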

Frequently Asked Questions (15)
Q1. What contributions have the authors mentioned in the paper "Incremental natural language description of dynamic imagery"?

Although image understanding and natural language processing constitute two major areas of AI, they have mostly been studied independently of each other. The aim of their joint efforts at combining a vision system and a natural language access system is the automatic simultaneous description of dynamic imagery, i.e., the authors are interested in image interpretation and language processing on an incremental basis. The steps toward realization, based on available components, are outlined. The work described was partly supported by the Sonderforschungsbereich 314 der Deutschen Forschungsgemeinschaft, "Künstliche Intelligenz und wissensbasierte Systeme", projects V1 (IITB, Karlsruhe) and N2: VITRA (Universität des Saarlandes).

Their approach emphasizes concurrent image sequence evaluation and natural language processing, an important prerequisite for real-time performance, which is the long-term goal of this work. 

Using course diagrams guarantees that primitive motion concepts as well as complex activities can be defined in a uniform and declarative way.

In order to model durative events like `move', a further predicate called succeed was introduced to express the continuation of an event. 

In addition to a specification of roles denoting participating objects, which must be members of specified object classes, an event model includes a course diagram, used to model the prototypical progression of an event. 

Because of the strong temporal restrictions the system cannot talk about all recognized events, thus it has to decide which events should be verbalized in order to enable the listener to follow the scene. 

In the process of transforming symbolic event descriptions into natural language utterances, first a verb is selected by accessing the concept lexicon, and the case-roles associated with the verb are instantiated.

The relevance of an event depends on factors like: (i) salience, which is determined by the frequency of occurrence and the complexity of the generic event model, (ii) topicality, and (iii) current state, i.e., events with state succeed or stop are preferred. 
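A relevance measure combining the three factors named above might look like the following sketch. The weights, scales, and field names are invented for illustration and are not taken from the original system; only the three factors (salience, topicality, current state) come from the text.

```python
def relevance(event, topic_objects, frequency, model_complexity):
    """Toy relevance score for deciding which recognized events to verbalize."""
    # (i) salience: rarer events with more complex generic models score higher
    salience = model_complexity / (1.0 + frequency)
    # (ii) topicality: events involving objects already under discussion
    topicality = 1.0 if event["object"] in topic_objects else 0.0
    # (iii) current state: events with state `succeed' or `stop' are preferred
    state_bonus = 1.0 if event["state"] in ("succeed", "stop") else 0.0
    return salience + topicality + state_bonus

ev = {"type": "move", "object": "player3", "state": "succeed"}
score = relevance(ev, topic_objects={"player3"}, frequency=4, model_complexity=2.0)
```

Under the temporal pressure of simultaneous reporting, only the highest-scoring events would be passed on to the generation component.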

Each recognition cycle starts at the lowest level of the event hierarchy: first, the traversal of course diagrams corresponding to basic events is attempted; later, more complex event instances can look at those lower levels to verify the existence of their necessary subevents. 

The language generation component selects relevant propositions from this buffer, orders them and finally transforms the non-verbal information into an ordered sequence of either written or spoken German words. 

Image understanding and natural language processing are two major areas of research within AI that have generally been studied independently of one another. 

The as yet partial trajectories delivered by Actions are currently used to synthesize interactively a realistic GSD, with object candidates assigned to previously known players and the ball.

The recognition of an occurrence can be thought of as traversing the course diagram, where the edge types (:trigger, :proceed, etc.) are used for the definition of their basic event predicates (see Section 3.3). 
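The idea of recognizing a durative event like `move' by traversing its course diagram can be sketched as follows. The edge labels in the comments mirror the types mentioned above (:trigger, :proceed); the state names and the simplistic motion predicate are illustrative assumptions, not the original event models.

```python
def moved(prev_pos, pos):
    """Toy basic event predicate: did the object change position?"""
    return prev_pos is not None and pos != prev_pos

class CourseDiagram:
    """Traverses a minimal course diagram for a durative `move' event."""

    def __init__(self):
        self.state = "inactive"
        self.prev_pos = None

    def step(self, pos):
        if self.state == "inactive" and moved(self.prev_pos, pos):
            self.state = "start"        # :trigger edge fires on first motion
        elif self.state in ("start", "succeed"):
            if moved(self.prev_pos, pos):
                self.state = "succeed"  # :proceed edge: the event continues
            else:
                self.state = "stop"     # motion has ended
        self.prev_pos = pos
        return self.state

diagram = CourseDiagram()
states = [diagram.step(p) for p in [(0, 0), (1, 0), (2, 0), (2, 0)]]
```

Each new frame advances the diagram by at most one edge, so recognition proceeds incrementally, and more complex event models could query the resulting states of such basic diagrams as their subevents.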

The authors have shown that the various processing steps from raw images to natural language utterances, i.e., picture domain analysis of the image sequence, event recognition, and natural language generation, must be carried out on an incremental basis. 

Since the first results described in Schirra et al. [1987], more than 3000 frames (120 seconds) of image sequences recorded from a major traffic intersection in Karlsruhe have been evaluated by the Actions system.