Book Chapter

Computational Scene Analysis

DeLiang Wang

01 Jan 2007, pp. 163-191

TL;DR: It is pointed out that the time dimension and David Marr's framework for understanding perception are essential for computational scene analysis, particularly visual and auditory scene analysis.


Abstract: A remarkable achievement of the perceptual system is its scene analysis capability, which involves two basic perceptual processes: the segmentation of a scene into a set of coherent patterns (objects) and the recognition of memorized ones. Although the perceptual system performs scene analysis with apparent ease, computational scene analysis remains a tremendous challenge as foreseen by Frank Rosenblatt. This chapter discusses scene analysis in the field of computational intelligence, particularly visual and auditory scene analysis. The chapter first addresses the question of the goal of computational scene analysis. A main reason why scene analysis is difficult in computational intelligence is the binding problem, which refers to how a collection of features comprising an object in a scene is represented in a neural network. In this context, temporal correlation theory is introduced as a biologically plausible representation for addressing the binding problem. The LEGION network lays a computational foundation for oscillatory correlation, which is a special form of temporal correlation. Recent results on visual and auditory scene analysis are described in the oscillatory correlation framework, with emphasis on real-world scenes. Also discussed are the issues of attention, feature-based versus model-based analysis, and representation versus learning. Finally, the chapter points out that the time dimension and David Marr's framework for understanding perception are essential for computational scene analysis.


Summary


1 Introduction

Human intelligence can be broadly divided into three aspects: Perception, reasoning, and action.
Section 3 is devoted to a key problem in scene analysis - the binding problem, which concerns how sensory elements are organized into percepts in the brain.
Section 4 describes oscillatory correlation theory as a biologically plausible representation to address the binding problem.
In Section 7, I discuss a number of challenging issues facing computational scene analysis.

2 What is the Goal of Computational Scene Analysis?

In his monumental book on computational vision, Marr makes a compelling case that understanding perceptual information processing requires three different levels of description.
The second level, called representation and algorithm, is concerned with the representation of the input and the output, and the algorithm that transforms from the input representation to the output representation.
Before addressing this question, let us ask the question of what purpose perception serves.
The above goal of computational scene analysis is strongly related to the goal of human scene analysis.
In particular, we assume the input format to be similar in both cases.

3 Binding Problem and Temporal Correlation Theory

The ability to group sensory elements of a scene into coherent objects, often known as perceptual organization or perceptual grouping [40], is a fundamental part of perception.
How perceptual organization is achieved in the brain remains a mystery.
We should note that object-level attributes, such as shape and size, are undefined before the more fundamental problem of figure-ground separation is solved.
The correlation theory asserts that the temporal structure of a neuronal signal provides the neural basis for correlation, which in turn serves to bind neuronal responses.
Eventually, individual objects are coded by individual neurons, and for this reason hierarchical coding is also known as the cardinal cell (or grandmother cell) representation [3].

4 Oscillatory Correlation Theory

A special form of temporal correlation - oscillatory correlation [52] - has been studied extensively.
Second, it can desynchronize different assemblies of oscillators that are activated by multiple, simultaneously present objects.
Within each of the two phases the oscillator exhibits slow-varying behavior.
Rosenblatt’s perceptrons [46, 47] are classification networks.
As shown in the figure, the connectedness predicate is correctly computed beyond a beginning period that corresponds to the process of assembly formation.
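The two-phase behavior of a relaxation oscillator can be sketched numerically. The summary above does not give the oscillator equations, so the form below (a cubic fast variable `x` coupled to a slow sigmoid-driven variable `y`) and all parameter values (`eps`, `gamma`, `beta`, the stimulus levels, the initial state) are assumptions chosen for illustration. With a positive stimulus the sketch alternates between a silent and an active phase (the enabled state); with a negative stimulus it settles to a fixed point (the excitable state).

```python
import math

def simulate_oscillator(stimulus, t_end=600.0, dt=0.01,
                        eps=0.02, gamma=6.0, beta=0.1,
                        x0=-1.5, y0=0.5):
    """Forward-Euler integration of an assumed relaxation oscillator.

    dx/dt = 3x - x^3 + 2 - y + stimulus                (fast variable)
    dy/dt = eps * (gamma * (1 + tanh(x / beta)) - y)   (slow variable)
    All parameter values are illustrative assumptions.
    """
    x, y = x0, y0
    xs = []
    for _ in range(round(t_end / dt)):
        dx = 3.0 * x - x ** 3 + 2.0 - y + stimulus
        dy = eps * (gamma * (1.0 + math.tanh(x / beta)) - y)
        x += dt * dx
        y += dt * dy
        xs.append(x)
    return xs

def count_upward_jumps(xs):
    """Count transitions from the silent phase (x < 0) to the active phase (x >= 0)."""
    return sum(1 for a, b in zip(xs, xs[1:]) if a < 0.0 <= b)

enabled = simulate_oscillator(stimulus=0.8)     # positive input: limit cycle
excitable = simulate_oscillator(stimulus=-0.5)  # negative input: stable fixed point

print("enabled jumps:", count_upward_jumps(enabled))
print("excitable jumps:", count_upward_jumps(excitable))
```

Within each phase `x` changes slowly because it tracks the slow variable `y`; the jumps between phases happen on the fast time scale, which is what gives the trace its relaxation character.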

Fig. 4. Oscillatory correlation solution to the connectedness problem (from [64]): LEGION network snapshots and temporal activity traces for a connected cup figure and for a disconnected image of three connected patterns forming the word ‘CUP’.
The oscillatory correlation theory provides a general framework to address the computational scene analysis problem.
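For reference, the connectedness predicate itself is easy to state as a conventional serial computation. The sketch below is not the oscillator network: it swaps in an ordinary breadth-first flood fill, and the function names, the 4-neighbor connectivity, and the tiny test images are assumptions made for illustration. It counts connected patterns in a binary image, so a figure is connected exactly when the count is one.

```python
from collections import deque

def count_connected_patterns(image):
    """Count 4-connected groups of 1-pixels in a binary image (list of lists)."""
    rows, cols = len(image), len(image[0])
    seen = [[False] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            if image[r][c] == 1 and not seen[r][c]:
                count += 1                      # found a new pattern
                seen[r][c] = True
                queue = deque([(r, c)])
                while queue:                    # flood-fill the whole pattern
                    i, j = queue.popleft()
                    for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ni, nj = i + di, j + dj
                        if (0 <= ni < rows and 0 <= nj < cols
                                and image[ni][nj] == 1 and not seen[ni][nj]):
                            seen[ni][nj] = True
                            queue.append((ni, nj))
    return count

def is_connected(image):
    """Connectedness predicate: true when all 1-pixels form a single pattern."""
    return count_connected_patterns(image) == 1

# Hypothetical miniature stand-ins for the connected and disconnected figures.
cup = [
    [1, 0, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
]
letters = [
    [1, 1, 0, 1, 0, 1, 0, 1, 1],
    [1, 0, 0, 1, 0, 1, 0, 1, 1],
    [1, 1, 0, 1, 1, 1, 0, 1, 0],
]
print(is_connected(cup), count_connected_patterns(letters))
```

In the oscillatory account the same answer is read from the network dynamics, where each connected pattern corresponds to one synchronized assembly.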

5 Visual Scene Analysis

For computational scene analysis, some measure of similarity between features is necessary.
Elements with similar attributes, such as color, depth, or texture, tend to group.
As a result, such segmentation gives rise to the notion of a segmentation capacity [69] - at least for networks of relaxation oscillators with a non-instantaneous active phase - that refers to a limited number of oscillator assemblies that may be formed.
Cesmeli and Wang [8] applied LEGION to motion-based segmentation that considers motion as well as intensity for analyzing image sequences (see also [75]).
A frame of a motion sequence is shown in Fig. 6A, where a motorcycle rider jumps to a dry canal with his motorcycle while the camera is tracking him.
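The role of a similarity measure in grouping can be illustrated with a small sketch. It is a conventional region-merging procedure, not the LEGION dynamics: neighboring pixels are linked when their intensities differ by at most a threshold, and linked pixels receive the same label. The function name, the threshold, and the sample image are assumptions for illustration only.

```python
def group_by_similarity(image, threshold):
    """Label pixels so that 4-neighbors with similar intensity share a label."""
    rows, cols = len(image), len(image[0])
    parent = list(range(rows * cols))           # union-find forest over pixels

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]       # path halving
            a = parent[a]
        return a

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    # Link each pixel to its right and lower neighbor when they are similar.
    for r in range(rows):
        for c in range(cols):
            for dr, dc in ((0, 1), (1, 0)):
                nr, nc = r + dr, c + dc
                if nr < rows and nc < cols:
                    if abs(image[r][c] - image[nr][nc]) <= threshold:
                        union(r * cols + c, nr * cols + nc)

    # Convert union-find roots into compact labels in raster order.
    labels, result = {}, []
    for r in range(rows):
        row = []
        for c in range(cols):
            root = find(r * cols + c)
            row.append(labels.setdefault(root, len(labels)))
        result.append(row)
    return result

image = [
    [10, 12, 11, 90],
    [11, 10, 13, 92],
    [50, 52, 91, 93],
]
segments = group_by_similarity(image, threshold=5)
print(segments)
```

The threshold plays the part of the similarity criterion: raising it merges more regions, lowering it fragments the image.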

6 Auditory Scene Analysis

Frequency components that have common temporal modulation tend to be grouped together.
Their model relies on global connectivity to achieve synchronization among the oscillators that are stimulated at the same time.
The second layer groups the segments that emerge from the first layer.
Their model first performs peripheral processing and then auditory segmentation.

Fig. 7. Segregation of voiced speech from telephone ringing (from [66]): peripheral response of the filterbank, a snapshot of the grouping layer showing the segregated speech stream, and another snapshot showing the segregated background.
At a conceptual level, a major difference between this model and Wang’s model [63] concerns whether attention can be directed to more than one stream.
In the Wrigley and Brown model only one stream may be attended to at a time, whereas in Wang’s model attention may be divided among more than one stream.
This issue will be revisited in Sect. 7.1.
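The peripheral stage described in the caption of Fig. 7 uses a bank of 128 filters with center frequencies from 80 Hz to 5 kHz. The text here does not say how those frequencies are spaced, so the sketch below assumes equal spacing on the ERB-rate scale, a common choice for auditory filterbanks; the function names and the ERB-rate formula constants are part of that assumption, not something the chapter summary states.

```python
import math

def hz_to_erb_rate(f_hz):
    """Map frequency in Hz to the ERB-rate scale (assumed warping)."""
    return 21.4 * math.log10(0.00437 * f_hz + 1.0)

def erb_rate_to_hz(erb):
    """Inverse of hz_to_erb_rate."""
    return (10.0 ** (erb / 21.4) - 1.0) / 0.00437

def center_frequencies(n_channels=128, f_low=80.0, f_high=5000.0):
    """Center frequencies equally spaced on the ERB-rate scale."""
    lo, hi = hz_to_erb_rate(f_low), hz_to_erb_rate(f_high)
    step = (hi - lo) / (n_channels - 1)
    return [erb_rate_to_hz(lo + k * step) for k in range(n_channels)]

cfs = center_frequencies()
print(len(cfs), round(cfs[0], 1), round(cfs[-1], 1))
```

Under this assumption the channels are dense at low frequencies and sparse at high frequencies, which is why neighboring center frequencies grow farther apart toward 5 kHz.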

7.1 Attention

The importance of attention for scene analysis can hardly be overstated.
The difficulty is illustrated by the finding of Field et al. [18] that a path of curvilinearly aligned (snake-like) orientation elements embedded in a background of randomly oriented elements can be readily detected by observers, whereas other paths cannot.
Capacity limitation is a fundamental property of attention (see, e.g., [70, 76]).
Attention can be either goal-driven or stimulus-driven [73].
Visual feature dimensions include luminance, color, orientation, motion, and depth.

7.2 Feature-based Analysis versus Model-based Analysis

Scene analysis can be performed on the basis of the features of the objects in the input scene or the models of the objects in the memory.
What’s at issue is how much model-based analysis contributes to scene analysis, or whether binding should be part of a recognition process.
The forward path performs pattern recognition that is robust to a range of variations in position and size, and the last layer stores learned patterns.
A later model along a similar line was proposed by Riesenhuber and Poggio [44], and it uses a hierarchical architecture similar to the neocognitron.
This point is illustrated in Figure 9, which shows two frogs in a pond.

7.3 Learning versus Representation

Learning - both supervised and unsupervised - is central to neural networks (and computational intelligence in general).
The failure of perceptrons to solve this problem is rooted in the lack of a proper representation, not the lack of a powerful learning method.
The emphasis on representations contrasts with that on learning.
The cepstral representation separates voice excitation from vocal tract filtering [22], and the discovery of this representation pays a huge dividend to speech processing tasks including automatic speech recognition where cepstral features are an indispensable part of any state-of-the-art system.
The above discussion makes it plain that the investigation of computational scene analysis can be characterized in large part as the pursuit of appropriate representations.

8 Concluding Remarks

In this chapter I have made an effort to define the goal of computational scene analysis explicitly.
Advances in understanding oscillatory dynamics have led to the development of the oscillatory correlation approach to computational scene analysis with promising results.
Natural intelligence ranges from sensation, perceptual organization, language, motor control, to decision making and long-term planning.
Temporal structure is shared by neuronal responses in all parts of the brain, and the time dimension is flexible and infinitely extensible.
The bewildering complexity of perception makes it necessary to adopt a compass to guide the way forward and avoid many pitfalls along the way.


Figures (10)

Fig. 4. Oscillatory correlation solution to the connectedness problem (from [64]). A. An input image with 30x30 binary pixels showing a connected cup figure. B. A snapshot from corresponding LEGION network showing the initial conditions of the network. C. A subsequent snapshot of the network activity. D. Another input image depicting three connected patterns forming the word ‘CUP’. E.-G. Snapshots of the LEGION network at three different times. H. The upper trace shows the temporal activity of the oscillator assembly representing the connected cup image, the middle trace the activity of the global inhibitor, and the bottom trace the ratio of the global inhibitor’s frequency to that of enabled oscillators. The threshold is indicated by the dash line. I. The upper three traces show the temporal activities for the three assemblies representing the three connected patterns in the disconnected ‘CUP’ image, the next-to-bottom trace the activity of the global inhibitor, and the bottom one the ratio of the global inhibitor’s frequency to that of enabled oscillators along with.

Fig. 2. Behavior of a relaxation oscillator. A. Enabled state of the oscillator. This state produces a limit cycle shown as the bold curve. The direction of the trajectory is indicated by the arrows, and jumps are indicated by double arrows. B. Excitable state of the oscillator. This state produces a stable fixed point. C. Temporal activity of the oscillator in the enabled state. The curve shows the x activity.

Fig. 9. A natural image that contains two frogs in their natural habitat.

Fig. 6. Motion segmentation (from [8]). A. A frame of a motion sequence. B. Estimated optic flow. C. Result of segmentation.

Fig. 5. Extraction of hydrographic regions (from [10]). A. Input satellite image consisting of 640x606 pixels. B. Extraction result, where segmented waterbodies are indicated by white. C. Corresponding 1:24,000 topographic map.

Fig. 8. Detection of snake-like patterns (from [18] with permission). Human observers can easily detect the occurrence of a snake-like pattern - shown on the left - that is embedded in a background of random orientation elements shown on the right. The snake pattern consists of 12 aligned orientation elements.

Fig. 7. Segregation of voiced speech from telephone ringing (from [66]). A. Peripheral response to an auditory stimulus consisting of a male utterance mixed with telephone ringing. A bank of 128 filters having center frequencies ranging from 80 Hz to 5 kHz is employed in peripheral processing. B. A snapshot of the grouping layer. Here, white pixels denote active oscillators that represent the segregated speech stream. C. Another snapshot showing the segregated background.

Fig. 10. Necker cube. This figure can be seen as a cube that is viewed either from above or from below.

Fig. 3. Diagram of a perceptron. R denotes the input layer, which projects to a layer of feature detectors. The response unit takes a weighted sum of the responses of all the detectors, and outputs 1 if the sum passes a certain threshold and 0 otherwise.

Fig. 1. Illustration of the binding problem. The input consists of a triangle and a square. There are four feature detectors for triangle, square, top, and bottom. The binding problem concerns whether the triangle is on top (and the square at bottom) or the square is on top (and the triangle at bottom).


Computational Scene Analysis

DeLiang Wang

Department of Computer Science & Engineering and Center for Cognitive Science

The Ohio State University

Columbus, OH 43210-1277, U.S.A.

dwang@cse.ohio-state.edu


1 Introduction

Human intelligence can be broadly divided into three aspects: Perception, reasoning, and action. The first is mainly concerned with analyzing the information in the environment gathered by the five senses, and the last is primarily concerned with acting on the environment. In other words, perception and action are about input and output, respectively, from the viewpoint of the intelligent agent (i.e. a human being). Reasoning involves higher cognitive functions such as memory, planning, language understanding, and decision making, and is at the core of traditional artificial intelligence [49]. Reasoning also serves to connect perception and action, and the three aspects interact with one another to form the whole of intelligence.

From Challenges for Computational Intelligence, W. Duch and J. Mandziuk (Eds.), Springer, Berlin, 2007, pp. 163–191.

This chapter is about perception - we are concerned with how to analyze the perceptual input, particularly in the visual and auditory domains. Because perception seeks to describe the physical world, or scenes with objects located in physical space, perceptual analysis is also known as scene analysis. To differentiate scene analysis by humans and by machines, we term the latter computational scene analysis. In this chapter I focus on the analysis of a scene into its constituent objects and their spatial positions, not the recognition of memorized objects. Pattern recognition has been much studied in computational intelligence, and is treated extensively elsewhere in this collection.

Although humans, and nonhuman animals, perform scene analysis with apparent ease, computational scene analysis remains an extremely challenging problem despite decades of research in fields such as computer vision and speech processing. The difficulty was recognized by Frank Rosenblatt in his 1962 classic book, “Principles of neurodynamics” [47]. In the last chapter, he summarized a list of challenges facing perceptrons at the time, and two problems in the list “represent the most baffling impediments to the advance of perceptron theory” (p. 580). The two problems are figure-ground separation and the recognition of topological relations. The field of neural networks has since made great strides, particularly in understanding supervised learning procedures for training multilayer and recurrent networks [2, 48]. However, progress has been slow in addressing Rosenblatt’s two chief problems, largely validating his foresight.

Rosenblatt’s first problem concerns how to separate a figure from its background in a scene, and is closely related to the problem of scene segregation: To decompose a scene into its comprising objects. The second problem concerns how to compute spatial relations between objects in a scene. Since the second problem presupposes a solution to the first, figure-ground separation is a more fundamental issue. Both are central problems of computational scene analysis.

In the next section I discuss the goal of computational scene analysis. Section 3 is devoted to a key problem in scene analysis - the binding problem, which concerns how sensory elements are organized into percepts in the brain. Section 4 describes oscillatory correlation theory as a biologically plausible representation to address the binding problem. The section also reviews the LEGION network that achieves rapid synchronization and desynchronization, hence providing a computational foundation for the oscillatory correlation theory. The following two sections describe visual and auditory scene analysis separately. In Section 7, I discuss a number of challenging issues facing computational scene analysis. Finally, Section 8 concludes the chapter.

Footnote on the term "computational scene analysis": This is consistent with the use of the term Computational Intelligence.

Footnote on LEGION: LEGION stands for Locally Excitatory Globally Inhibitory Oscillator Network [68].

Note that this chapter does not attempt to survey the large body of literature on computational scene analysis. Rather, it highlights a few topics that I consider to be most relevant to this book.

2 What is the Goal of Computational Scene Analysis?

In his monumental book on computational vision, Marr makes a compelling case that understanding perceptual information processing requires three different levels of description. The first level of description, called computational theory, is mainly concerned with the goal of computation. The second level, called representation and algorithm, is concerned with the representation of the input and the output, and the algorithm that transforms from the input representation to the output representation. The third level, called hardware implementation, is concerned with how to physically realize the representation and the algorithm.

So, what is the goal of computational scene analysis? Before addressing this question, let us ask the question of what purpose perception serves. Answers to this question have been attempted by philosophers and psychologists for ages. From the information processing perspective, Gibson [21] considers perception as the way of seeking and gathering information about the environment from the sensory input. On visual perception, Marr [30] considers that its purpose is to produce a visual description of the environment for the viewer. On auditory scene analysis, Bregman states that its goal is to produce separate streams from the auditory input, where each stream represents a sound source in the acoustic environment [6]. It is worth emphasizing that the above views suggest that perception is a private process of the perceiver even though the physical environment may be common to different perceivers.

In this context, we may state that the goal of computational scene analysis is to produce a computational description of the objects and their spatial locations in a physical scene from sensory input. The term ‘object’ here is used in a modality-neutral way: An object may refer to an image, a sound, a smell, and so on. In the visual domain, sensory input comprises two retinal images, and in the auditory domain it comprises two eardrum vibrations. Thus, the goal of visual scene analysis is to extract visual objects and their locations from one or two images. Likewise, the goal of auditory scene analysis is to extract streams from one or two audio recordings.

The above goal of computational scene analysis is strongly related to the goal of human scene analysis. In particular, we assume the input format to be similar in both cases. This assumption makes the problem well defined and has an important consequence: It makes the research in computational scene analysis perceptually relevant. In other words, progress in computational scene analysis may shed light on perceptual and neural mechanisms. This restricted scope also differentiates computational scene analysis from engineering problem solving, where a variety and a number of sensors may be used.

With common sensory input, we further propose that computational scene analysis should aim to achieve human level performance. Moreover, we do not consider the problem solved until a machine system achieves human level performance in all perceptual environments. That is, computational scene analysis should aim for the versatile functions of human perception, rather than its utilities in restricted domains.

3 Binding Problem and Temporal Correlation Theory

The ability to group sensory elements of a scene into coherent objects, often known as perceptual organization or perceptual grouping [40], is a fundamental part of perception. Perceptual organization takes place so rapidly and effortlessly that it is often taken for granted by us the perceivers. The difficulty of this task was not fully appreciated until effort in computational scene analysis started in earnest. How perceptual organization is achieved in the brain remains a mystery.

Early processing in the perceptual system clearly involves detection of local features, such as color, orientation, and motion in the visual system, and frequency and onset in the auditory system. Hence, a closely related question to perceptual organization is how the responses of feature-detecting neurons are bound together in the brain to form a perceived scene. This is the well-known binding problem. At the core of the binding problem is that sensory input contains multiple objects simultaneously and, as a result, the issue of which features should bind with which others must be resolved in object formation. I illustrate the situation with two objects - a triangle and a square - at two different locations: The triangle is at the top and the square is at the bottom. This layout, shown in Figure 1, was discussed by Rosenblatt [47] and used as an instance of the binding problem by von der Malsburg [60]. Given feature detectors that respond to triangle, square, top, and bottom, how can the nervous system bind the locations and the shapes so as to perceive that the triangle is at the top and the square is at the bottom (correctly), rather than the square is on top and the triangle is on bottom (incorrectly)? We should note that object-level attributes, such as shape and size, are undefined before the more fundamental problem of figure-ground separation is solved. Hence, I will refer to the binding of local features to form a perceived object, or a percept, when discussing the binding problem.
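The ambiguity in the triangle and square example can be made concrete with a few lines of code. This is a minimal sketch under stated assumptions: the scene encoding as (shape, position) pairs and the function name `detector_responses` are hypothetical, and each detector simply reports whether its feature is present anywhere in the input.

```python
def detector_responses(scene):
    """Return the set of active feature detectors for a scene.

    scene is a list of (shape, position) pairs; each detector only signals
    that its feature is present, not which object it belongs to.
    """
    active = set()
    for shape, position in scene:
        active.add(shape)
        active.add(position)
    return active

scene_a = [("triangle", "top"), ("square", "bottom")]   # correct layout
scene_b = [("square", "top"), ("triangle", "bottom")]   # swapped layout

print(sorted(detector_responses(scene_a)))
print(detector_responses(scene_a) == detector_responses(scene_b))
```

Both layouts activate the same four detectors, so the pattern of activation alone cannot say which shape goes with which location. Some additional representation is needed to carry the binding.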

Fig. 1. Illustration of the binding problem. The input consists of a triangle and a square. There are four feature detectors for triangle, square, top, and bottom. The binding problem concerns whether the triangle is on top (and the square at bottom) or the square is on top (and the triangle at bottom).

How does the brain solve the binding problem? Concerned with shape recognition in the context of multiple objects, Milner [32] suggested that different objects could be separated in time, leading to synchronization of firing activity within the neurons activated by the same object. Later von der Malsburg [59] proposed a correlation theory to address the binding problem. The correlation theory asserts that the temporal structure of a neuronal signal provides the neural basis for correlation, which in turn serves to bind neuronal responses. In a subsequent paper, von der Malsburg and Schneider [61] demonstrated the temporal correlation theory in a neural model for segregating two auditory stimuli based on their distinct onset times - an example of auditory scene analysis that I will come back to in Section 6. This paper proposed, for the first time, to use neural oscillators to solve a figure-ground separation task, whereby correlation is realized by synchrony and desynchrony among neural oscillations. Note that the temporal correlation theory is a theory of representation, concerned with how different objects are represented in a neural network, not a computational algorithm; that is, the theory does not address how multiple objects in the input scene are transformed into multiple cell assemblies with different time structures. This is a key computational issue I will address in the next section.
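The representational idea, that detectors responding to the same object share a time structure, can be sketched as follows. This is only an illustration of the representation, in keeping with the remark that the theory is not an algorithm: the phases are assigned by hand from a known scene, and the names `tagged_responses` and `read_out_bindings`, the phase values, and the tolerance are all hypothetical.

```python
def tagged_responses(scene, period=1.0):
    """Give each object its own firing phase; its detectors fire at that phase.

    scene is a list of (shape, position) pairs. Phases are assigned by hand
    here, which is exactly the step the representation theory leaves open.
    """
    responses = {}
    for k, (shape, position) in enumerate(scene):
        phase = k * period / len(scene)
        responses[shape] = phase
        responses[position] = phase
    return responses

def read_out_bindings(responses, tolerance=1e-6):
    """Group detectors whose firing phases coincide within a tolerance."""
    assemblies = []                      # list of (phase, set of detectors)
    for detector, phase in responses.items():
        for assembly_phase, members in assemblies:
            if abs(phase - assembly_phase) <= tolerance:
                members.add(detector)    # synchronous: same assembly
                break
        else:
            assemblies.append((phase, {detector}))
    return {frozenset(members) for _, members in assemblies}

scene_a = [("triangle", "top"), ("square", "bottom")]
scene_b = [("square", "top"), ("triangle", "bottom")]

bindings_a = read_out_bindings(tagged_responses(scene_a))
bindings_b = read_out_bindings(tagged_responses(scene_b))
print(bindings_a != bindings_b)
```

The same four detectors are active in both scenes, but their timing now distinguishes the two layouts: synchrony marks features of one object, and desynchrony separates different objects.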

The main alternative to the temporal correlation theory is the hierarchical coding hypothesis, which asserts that binding occurs through individual neurons that are arranged in some cortical hierarchy so that neurons higher in the hierarchy respond to larger and more specialized parts of an object. Eventually, individual objects are coded by individual neurons, and for this reason hierarchical coding is also known as the cardinal cell (or grandmother cell) representation [3]. Gray [23] presented biological evidence for and against the hierarchical representation. From the computational standpoint, the hier-


Frequently Asked Questions (8)

Q1. What has the author contributed in "Computational Scene Analysis"?

In this context, temporal correlation theory is introduced as a biologically plausible representation for addressing the binding problem.

Q2. What is the promising roadmap for understanding scene analysis?

In my view, the Marrian framework for computational perception provides the most promising roadmap for understanding scene analysis.

Q3. What is the common use of the oscillatory correlation approach?

A large number of studies have applied the oscillatory correlation approach to visual scene analysis tasks, including segmentation of range and texture images, extraction of object contours, and selection of salient objects.

Q4. What is the failure of perceptrons to solve this problem?

The failure of perceptrons to solve this problem is rooted in the lack of a proper representation, not the lack of a powerful learning method.

Q5. What is the cause of the connectedness problem?

The cause, as discussed in Sect. 4, is computational complexity - learning the connectedness predicate would require far too many training samples and too much learning time.

Q6. What are the major grouping principles for auditory scene analysis?

Displaying the acoustic input in a 2-D time-frequency (T-F) representation such as a spectrogram, major grouping principles for auditory scene analysis (ASA) are given below [6, 13]:
• Proximity in frequency and time.

Q7. What is the significance of the speed of human scene analysis?

For those who are concerned with biological plausibility, the speed of human scene analysis has strong implications on the kind of processing employed.

Q8. What is the pattern of connectivity within the grouping layer?

This pattern of connectivity within the grouping layer promotes synchronization among a group of segments that have common periodicity.
