284    J. Opt. Soc. Am. A/Vol. 2, No. 2/February 1985

Spatiotemporal energy models for the perception of motion

Edward H. Adelson and James R. Bergen

David Sarnoff Research Center, RCA, Princeton, New Jersey 08540

Received July 9, 1984; accepted October 12, 1984
A motion sequence may be represented as a single pattern in x-y-t space; a velocity of motion corresponds to a three-dimensional orientation in this space. Motion information can be extracted by a system that responds to the oriented spatiotemporal energy. We discuss a class of models for human motion mechanisms in which the first stage consists of linear filters that are oriented in space-time and tuned in spatial frequency. The outputs of quadrature pairs of such filters are squared and summed to give a measure of motion energy. These responses are then fed into an opponent stage. Energy models can be built from elements that are consistent with known physiology and psychophysics, and they permit a qualitative understanding of a variety of motion phenomena.
1. INTRODUCTION
When we watch a movie, we see a sequence of images in which objects appear at a sequence of positions. Although each frame represents a frozen instant of time, the movie gives us a convincing impression of motion. Somehow the visual system interprets the succession of still images so as to arrive at a perception of a continuously moving scene. This phenomenon represents one form of apparent motion.

How is it that we see apparent motion? One possibility is that our visual system matches up corresponding points in succeeding frames and calculates an inferred velocity based on the distance traveled over the frame interval. Much research on apparent motion has taken the establishment of this correspondence to be the fundamental problem to be solved.1-3 We argue that this correspondence problem can often be bypassed altogether; we take up this argument after discussing various approaches to the problem of motion analysis.
Figure 1a shows a vertical bar, which is presented at a sequence of discrete positions at a sequence of discrete times. In a typical feature-matching model, the visual system is said to (1) find salient features in successive frames; (2) establish a correspondence between them; (3) determine Δx, the distance traveled, and Δt, the time between frames; and, finally, (4) compute the velocity as Δx/Δt. In this example, the features to be matched might be the edges of the bar.

In a typical global matching model, the visual system would perform a match over some large region of the image, in essence performing a template match by sliding the image from one frame to match the image optimally in the next frame. Most cross-correlation models (see, e.g., Lappin and Bell4) are examples of the global matching approach. Once again, Δx and Δt can be determined, and the velocity can be inferred.
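The global matching computation can be sketched in a few lines. The toy example below is illustrative only: the one-dimensional frames, the 0.1-s frame interval, and the helper name estimate_velocity are all assumptions, not taken from the paper. It slides one frame across the next, scores each shift by correlation, and reports the best shift divided by the frame interval, i.e., Δx/Δt.

```python
import numpy as np

def estimate_velocity(frame1, frame2, dt):
    """Toy global matching: slide frame1 over frame2, score each shift by
    correlation, and convert the best-matching shift into a velocity."""
    n = len(frame1)
    best_shift, best_score = 0, -np.inf
    for shift in range(-n // 2, n // 2 + 1):
        score = np.dot(np.roll(frame1, shift), frame2)
        if score > best_score:
            best_shift, best_score = shift, score
    return best_shift / dt  # velocity = delta_x / delta_t

# A bright bar that moves 3 pixels between frames taken 0.1 s apart.
f1 = np.zeros(64)
f1[20:24] = 1.0
f2 = np.roll(f1, 3)
v = estimate_velocity(f1, f2, dt=0.1)  # 3 pixels / 0.1 s
```

Real cross-correlation models operate on two-dimensional images, and a single global match is exactly what fails when many motions are present at once, as discussed below.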
Matching models are designed to make predictions about stimuli presented as sequences of frames (e.g., movies). Not all stimuli fall naturally into such a description. In an ordinary television, for example, the electron beam illuminates adjacent points in a rapid sequence, sweeping out the even lines of the raster pattern on one field and then returning to fill in the odd lines on the next field (two fields constitute a frame). Should the matching be taken between frames or between fields? For that matter, why should it not be taken between the successively illuminated points themselves? (Note that the motion of the raster itself, which is normally invisible, will become visible if the raster is quite slow.)

Although the answer is not immediately obvious, it is clear that we need to consider the well-known persistence of visual responses (i.e., the temporal filtering imposed by early visual mechanisms) in order to make sense of even the simplest phenomena of apparent motion. The rapidly illuminated points on a television screen are blended together in time, effectively making all the lines of a frame (including both fields) visually present at one time.
One approach to motion modeling, therefore, is to build in a temporal-filtering stage that preprocesses the visual input before it is passed along to the matching system. The resulting model treats the stimulus in both a continuous and a discrete fashion. Filtering is a continuous operation and leads to a continuously varying output, whereas matching is discrete, taking place between images sampled at two particular moments in time. Having been forced to introduce filtering into the model, we would like to make full use of its properties. In fact, filtering can be used to extract the motion information itself, thus rendering the discrete matching stage superfluous.
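The blending produced by visual persistence can be sketched numerically. In the sketch below, the exponential filter shape, the 20-ms time constant, and the 60-Hz field rate are all assumed values chosen for illustration; they are not taken from the paper. A pixel lit only on alternate fields flickers fully on and off, but after temporal low-pass filtering its response barely modulates: the fields are blended together.

```python
import numpy as np

dt = 0.001                               # 1-ms time step
t = np.arange(0.0, 0.1, dt)              # 100 ms of signal
field_rate = 60.0
# Pixel lit only on even fields: a 30-Hz on/off square wave.
stimulus = ((t * field_rate).astype(int) % 2 == 0).astype(float)

tau = 0.02                               # 20-ms persistence (assumed)
kernel = np.exp(-np.arange(0.0, 5 * tau, dt) / tau)
kernel /= kernel.sum()                   # unit-gain low-pass filter
blended = np.convolve(stimulus, kernel)[: len(stimulus)]

depth_in = stimulus.max() - stimulus.min()            # full on/off flicker
depth_out = blended[50:].max() - blended[50:].min()   # after the transient
```

The filtered flicker depth is well under half the input's, which is the sense in which both fields are "visually present at one time."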
There are other reasons for shying away from matching models as they are commonly presented. They can usually make predictions about simple stimuli such as a moving bar, but they may run into trouble when presented with a sequence such as is shown in Fig. 1b. Here, a sequence of vertical random noise patterns is presented. When this sequence is viewed, complex motions are seen, varying from point to point in the image. Different velocities are seen at different positions, and these velocities change rapidly. A feature-matching model has difficulty making predictions because of the familiar problems: What constitutes a feature? What should be matched to what? Most feature-based models are not well enough defined to offer predictions about a stimulus such as that of Fig. 1b. Yet motion is seen, and we would like to believe that this motion percept is generated by the same lawful processes that generate the percept of the moving bar.
Can a global matching model, such as a cross-correlation model, do better? Again, it is hard to know what such a model will predict. Most global matching models have been formulated only to deal with the visibility of single global motions and thus cannot be easily applied to the situation in which many motions are seen at different points in the field.
0740-3232/85/020284-16$02.00 © 1985 Optical Society of America
Fig. 1. a, A sequence of images presented at times t1, t2, and t3 showing a bar moving to the right. b, A sequence of vertical random noise patterns, also shown at three successive instants of time. Motion is seen in each case. The motion percept is simple in a and complex in b, but a motion model should be able to handle both cases.
A number of approaches have recently been developed that can be used with complex inputs such as the dynamic noise of Fig. 1b. Marr and Ullman5 describe a method for extracting the motion of zero crossings in the outputs of linear filters by comparing the sign of the filter output to the sign of its temporal derivative at the zero crossing. A rather different approach has been described by van Santen and Sperling6 in an elaboration of Reichardt's7 model, in which a local correlation (i.e., multiplication) is performed across space and time. In van Santen and Sperling's model, filters tuned for spatial frequency serve as the inputs to the correlator stages. Van Santen and Sperling provide a formal analysis of the model's properties, describe a set of linking assumptions, and show that the model makes correct predictions about a large variety of simple motion displays.
A third approach has been described by Watson and Ahumada8: Motion information is extracted with simple linear filters without a multiplicative stage; the filters are tuned for spatial and temporal frequency as well as velocity, and directional selectivity is achieved by setting up the appropriate phase relationships between an underlying pair of filters. It is notable that this approach achieves directional selectivity without any nonlinearities (although some sort of nonlinearity must, of course, be present at some point for motion detection to occur). Ross and Burr9 have also proposed that the visual system extracts motion information with directionally tuned linear filters. Morgan10 has applied linear-filtering concepts to stroboscopic displays, and Adelson11 has discussed how a number of motion illusions can be understood in terms of mechanisms that respond to the motion energy within particular spatiotemporal-frequency bands.
Although it is not immediately apparent, there are significant formal connections between the linear-filtering approach and the correlational approach of a Reichardt-style model, as has been previously noted.6,12 The topic is taken up in Appendix A; at this point, we simply comment that both types of model can be considered to respond to motion energy within a given spatiotemporal-frequency band (a property that will be discussed at greater length below).
Our interest in this paper is not so much to discuss a particular model as to discuss a general class of models, and not so much to discuss this class as to discuss a general approach to the problem of motion detection. We will consider models closely related to the ones just mentioned: models that are based on a simple low-level analysis of visual information, starting with the outputs of linear filters. This kind of processing is well understood and can be readily applied to any stimulus input. Moreover, it is just the kind of processing that is considered to occur early in the visual pathway, based on a large variety of psychophysical and physiological experiments.13-16
Low-Level Processing in Motion Perception
A low-level approach seems particularly appropriate when one is dealing with motion phenomena that occur with a rapid sequence of presentations. Many investigators have found that these rapid presentations lead to motion percepts that are determined by rather simple low-level properties of the stimuli.
Braddick17 provided evidence for two distinct kinds of motion mechanisms in apparent motion. He called them long-range and short-range mechanisms. The short-range process operates over rather short spatial distances and short time intervals and involves low-level kinds of visual information. The long-range mechanism can operate over large spatial separations and longish time intervals and may involve somewhat higher-level forms of visual information.
Hochberg and Brooks18 also found evidence for two processes in motion perception. They presented a sequence of images containing collections of simple shapes, such as circles, triangles, and squares. Each shape could take one of two motion paths: It could take a short path but change identity (e.g., a triangle could take a short path by turning into a square), or it could take a longer path and retain its identity. At lower presentation rates, the identity of the objects became important, and a triangle would remain a triangle even if it meant taking a longer path. But with rapid presentations, the shorter path length won out, even though it meant abandoning stable object identity.
Sperling19 found that rapid, multiple-presentation motion stimuli gave much more compelling motion than did the slower two-view stimuli of classic apparent-motion experiments. Evidence for a fast, low-level process in motion perception has also been presented by various others.20,21
The models that we develop below are designed to deal with the rapid-presentation situation and are based on the simplest, lowest-level processes that we can use. We will try to avoid the concept of matching altogether.
2. REPRESENTING MOTION IN X-Y-T SPACE
Moving stimuli may be pictured as occupying a three-dimensional space, in which x and y are the two spatial dimensions and t is the temporal dimension. Consider a vertical bar moving continuously to the right, as shown in Fig. 2a. The three-dimensional spatiotemporal diagram is shown in Fig. 2b; the moving bar becomes a slanted slab in this space. If the continuous motion is sampled at discrete times, the result is Fig. 2c, which shows a movie of a moving bar.
In Fig. 3, only the x-t slice of the space is shown (we can ignore the y dimension since a vertical bar is unchanging along the y direction). The moving bar in Fig. 3a becomes a slanted strip. The slant reflects the velocity of the motion. Figure 3b shows the result of sampling the continuous motion. In practice, when one presented the movie corresponding to Fig. 3b, one would leave each frame on for a period of time before replacing it with the next one. Figure 3c shows the spatiotemporal plot of a movie in which each frame lasts almost through the full interval between frames. (In most actual movie projection, a single frame is broken up into several shorter flashes in order to minimize the perception of 24-Hz flicker; for simplicity, we do not consider the case of multiple shuttering here.)
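The x-t representation is easy to construct directly. In the small sketch below (the array sizes, bar width, and speed are arbitrary illustrative choices), drawing a bar at a shifted position in each successive row of an array produces the slanted strip described above, and the strip's slope recovers the velocity.

```python
import numpy as np

nx, nt, width, v = 80, 32, 4, 2        # hypothetical sizes and speed
xt = np.zeros((nt, nx))                # rows are instants of time
for ti in range(nt):
    left = 5 + v * ti                  # the bar's left edge at time ti
    xt[ti, left : left + width] = 1.0  # draw the bar in this row

# The bar traces a slanted strip: its leading edge is a straight line in
# (x, t), and the line's slope in pixels per time step is the velocity.
edges = np.array([np.argmax(row) for row in xt])
slope = edges[1] - edges[0]
```

Printing `xt` (or imaging it) shows the slanted strip directly; a faster bar produces a shallower slant in this row-per-instant convention.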
We know that the sampled motion of Fig. 3c will look similar to the continuous motion of Fig. 3a. Indeed, if the sampling is sufficiently frequent in time, the two stimuli will look identical. Pearson22 has discussed how this may be understood by applying the standard notions of sampling and aliasing to the case of three-dimensional sampling in space and time and considering the spatiotemporal-filtering properties of the human visual system.
The argument, in brief, is this: A continuously moving image has a three-dimensional Fourier spectrum in fx-fy-ft. A sampled version of the display has a different spectrum. The differences between the spectra of the continuous and sampled scenes may be called sampling artifacts (when these artifacts intrude on the spectrum of the original signal, they are known as aliasing components). It is these components that allow an observer to distinguish between a continuous and a sampled display.
The task of a display engineer is therefore to ensure that the artifactual components that are due to sampling are of such low contrast that they are invisible to the human observer. To achieve this goal, it is necessary not to remove the artifactual components altogether but merely to prevent them from reaching threshold visibility. This can be done by appropriately prefiltering, sampling, and postfiltering the moving images.
It is not always easy to assess the visibility of sampling artifacts; one must take into account subthreshold summation between the artifactual components as well as masking by true-image components. However, Watson et al.23 have described a set of conditions under which one may be confident that the artifacts will not be visible. For sufficiently high spatial and temporal frequencies, human contrast sensitivity is zero; that is, components lying outside a certain spatiotemporal-frequency limit (which Watson et al.23 call the window of visibility) cannot be seen regardless of their contrast. If the sampling is sufficiently fine to keep all the spectral energy of the sampling artifacts outside this window, then the artifacts must be invisible.
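A one-dimensional (purely temporal) version of this argument can be checked numerically. In the sketch below, the frame rate, the signal frequency, and the 30-Hz cutoff standing in for the window of visibility are all assumed values. A sample-and-hold "movie" of a slow sinusoid is compared with the continuous signal (delayed by the hold's intrinsic half-frame lag); nearly all the artifact energy then sits near the frame rate and its harmonics, far outside the assumed window.

```python
import numpy as np

fs = 960.0                     # fine time base, Hz (assumed)
t = np.arange(0.0, 1.0, 1 / fs)
f_signal = 4.0                 # temporal frequency of the smooth motion
frame_rate = 60                # movie frame rate
spf = int(fs) // frame_rate    # samples per held frame (16)

continuous = np.sin(2 * np.pi * f_signal * t)
sampled = np.repeat(continuous[::spf], spf)   # sample-and-hold staircase

# Remove the hold's half-frame delay before differencing, so that what
# remains is the sampling artifact proper.
aligned = np.sin(2 * np.pi * f_signal * (t - 0.5 / frame_rate))
artifact = sampled - aligned

power = np.abs(np.fft.rfft(artifact)) ** 2
freqs = np.fft.rfftfreq(len(t), 1 / fs)

# Fraction of artifact energy below the assumed 30-Hz visibility limit:
low_fraction = power[freqs < 30].sum() / power.sum()
```

With these numbers the low-frequency fraction is well under 5%, illustrating why sufficiently fine sampling pushes the artifacts outside the window of visibility.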
Morgan10 has applied frequency-based analyses to the problem of motion interpolation and has described two different approaches. In the first approach, the analysis begins with the extraction of a position signal, i.e., a single number that varies over time. Low-pass filtering is then applied to this signal. Thus the first stage of motion analysis is highly nonlinear (position extraction), and linear filtering follows it. In Morgan's second approach, the filtering is applied directly to the stimulus itself; position is extracted after the filtering has occurred. The present discussion (like that of Pearson and that of Watson et al.) is more closely connected to the second approach than to the first. But one should note that position as such need not be extracted in the computation of motion, as will become clear in what follows.
When temporal sampling is too coarse (as in an old movie), motion tends to look jerky. But motion is still seen. That is, to convey the impression of motion, it is not necessary that a sampled stimulus be indistinguishable from a continuous one. A spatiotemporal-frequency analysis helps one to understand this as well, because a continuous and a sampled stimulus share a great deal of spatiotemporal energy, even if they do not share it all. We can expect the two stimuli to look similar insofar as there are visual mechanisms that respond to the shared energy.
It is sometimes helpful to perform the analysis in the original space-time domain, rather than in the frequency domain.

Fig. 2. a, A picture of a vertical bar moving to the right. b, A spatiotemporal picture of the same stimulus. Time forms the third dimension. c, A spatiotemporal picture of a moving bar sampled in time (i.e., a movie).

Fig. 3. a, An (x, t) plot of a bar moving to the right over time. Time proceeds downward. The vertical dimension is not shown. b, An (x, t) plot of the same bar, sampled in time. c, The sampled motion as displayed in a movie in which each frame remains on until the next one appears. d, Continuous motion after spatiotemporal blurring. e, Sampled motion after spatiotemporal blurring. The middle- and low-frequency information is almost the same for the two stimuli.

Fig. 4. (x, t) plots of moving bars. a, A movie of a bar moving to the right. b, A bar moving to the right continuously. c, The difference (sampling artifacts) between the sampled and continuous motions. d, A movie sampled at a high frame rate. e, Continuous motion. f, The difference between the finely sampled and continuous motion. When the sampling rate is high, the sampling artifacts become difficult or impossible to see.
Figure 4 makes explicit the difference between the sampled and continuous versions of the moving bar. If we simply subtract the continuous pattern (Fig. 4b) from the sampled one (Fig. 4a), we can derive a new spatiotemporal plot of the sampling artifacts, as illustrated in Fig. 4c. Since the difference can be positive or negative, we have displayed it on a gray pedestal, so that gray corresponds to zero, white to positive, and black to negative. Observe that the sampled-motion stimulus of Fig. 4a can be considered to be the sum of the real motion of Fig. 4b and the artifacts of Fig. 4c. That is, we can think of the sampled motion as being continuous motion with sampling noise added to it.
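This decomposition is exact and easy to verify in a toy x-t diagram (the sizes, bar width, and hold interval below are assumed for illustration): the sampled pattern equals the continuous pattern plus a signed artifact image, and the artifact image is zero on the rows where a new frame appears.

```python
import numpy as np

nx, nt, v, hold = 48, 24, 1, 4   # toy sizes; each frame held 4 time steps
x = np.arange(nx)

def bar(left, width=3):
    """One row of an (x, t) diagram: a bright bar with left edge at `left`."""
    return ((x >= left) & (x < left + width)).astype(float)

# Continuous motion: the bar advances v pixels on every fine time step.
continuous = np.stack([bar(2 + v * ti) for ti in range(nt)])

# Movie version: the bar's position is updated only every `hold` steps.
sampled = np.stack([bar(2 + v * (ti - ti % hold)) for ti in range(nt)])

# Signed difference = the sampling artifacts (the gray-pedestal image of
# Fig. 4c: positive where the movie bar is, negative where it should be).
artifacts = sampled - continuous
```

Adding `artifacts` back to `continuous` reproduces `sampled` exactly, which is the sense in which sampled motion is continuous motion plus sampling noise.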
If the motion is sampled more frequently in time, the approximation to continuous motion is improved, as shown in Fig. 4d. In this case, the artifacts (Fig. 4f) have rather little energy in the range of frequencies that we can see. If sampling is made frequent enough, there will plainly come a point at which the artifactual components have so little energy in the visible spatial- and temporal-frequency range that they will become invisible, since the fine spatiotemporal structure of the artifacts will be blurred to invisibility by the spatial and temporal response of the eye. At this point, the continuous and the sampled stimuli will be perfectly indistinguishable.
Again, it is not necessary that the sampled stimulus look identical to the continuous one in order for the motion to look similar. A motion mechanism that responds to low spatial and temporal frequencies will give the same responses to the two stimuli, even if mechanisms sensitive to higher frequencies give different responses.
So far, we have discussed the conditions under which different moving stimuli may be expected to give similar impressions of motion. But we have not discussed how motion information, in itself, might be extracted; this constitutes our next problem.
3. MOTION AS ORIENTATION
Motion can be perceived in continuous or sampled displays, when there is energy of the appropriate spatiotemporal orientation. This is illustrated in Fig. 5, which shows spatiotemporal diagrams of a bar: a, moving quickly to the left; b, moving slowly to the left; c, stationary; d, moving slowly to the right; and e, moving quickly to the right. The velocity varies inversely with the slope. The problem of detecting motion, then, is the problem of detecting spatiotemporal orientation. How can this be done?
We already know a way of detecting orientation in ordinary spatial displays, namely, through the use of oriented receptive fields like those described by Hubel and Wiesel24 and sometimes referred to as bar detectors and edge detectors. Simple cells in visual cortex are now known to act more or less as linear filters: Their receptive-field profiles represent a weighting function, with both positive and negative weights, which may be taken as the spatial impulse response of a linear system.14
If we could construct a cell with a spatiotemporal impulse response that was analogous to a simple cell's spatial impulse response, we would have the situation shown at the bottom of Fig. 5 (cf. Ross and Burr9). The cell's spatiotemporal impulse response is oriented in space and time. In Fig. 5f, it responds well to an edge moving continuously to the right. In Fig. 5g, it responds well to a sampled version of the same stimulus. As far as this hypothetical cell is concerned, both stimuli have substantial rightward-motion energy.

Fig. 5. a-e, (x, t) plots of bars moving to the left or to the right at various speeds (V = -2, -1, 0, 1, 2). f, Motion is like orientation in (x, t), and a spatiotemporally oriented receptive field can be used to detect it. g, The same oriented receptive field can respond to sampled motion just as it responds to continuous motion.
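A hypothetical oriented receptive field of this kind can be written down directly as a Gabor-like weighting function slanted in (x, t); every parameter below (sizes, envelope widths, carrier frequency, preferred velocity) is an illustrative assumption, not a fit to any cell. Correlating it with x-t patterns of bars moving in the two directions shows the direction selectivity described above.

```python
import numpy as np

nx = nt = 21
x = np.arange(nx) - nx // 2              # space, centered
t = (np.arange(nt) - nt // 2)[:, None]   # time runs down the rows
v_pref = 1.0                             # preferred velocity, px per step

# Gabor-like weighting function whose ridges lie along x = v_pref * t,
# i.e., an impulse response oriented in space-time.
envelope = np.exp(-((x - v_pref * t) ** 2) / 18.0 - t ** 2 / 50.0)
rf = envelope * np.cos(0.8 * (x - v_pref * t))

def xt_bar(v):
    """x-t pattern of a thin bar moving at velocity v (px per step)."""
    pattern = np.zeros((nt, nx))
    for ti in range(nt):
        xi = int(round(nx // 2 + v * (ti - nt // 2)))
        if 0 <= xi < nx:
            pattern[ti, xi] = 1.0
    return pattern

# Correlation of the receptive field with rightward vs. leftward motion:
right = float((rf * xt_bar(+1.0)).sum())
left = float((rf * xt_bar(-1.0)).sum())
```

The rightward bar lies along the field's oriented ridge and yields a large response; the leftward bar cuts across the ridges and its contributions largely cancel.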
The models that we will develop will be based on idealized mechanisms; in discussing these mechanisms we will use the terms "unit" and "channel." A unit corresponds roughly to a cell or to a small set of cells working in concert to extract a simple property at one position in the visual field. A channel consists of an array of similar units distributed across the visual field.
In principle, there is no reason why an oriented unit could not be constructed directly. The unit would gather inputs from an array of photoreceptors covering the spatial extent of its receptive field, and it would sum their outputs over time with the appropriate temporal impulse responses. In practice, however, such a unit would be difficult to construct because it would require a different temporal impulse response correctly tailored to each spatial position in the receptive field.
The problem, then, is to construct a unit that responds to spatiotemporal orientation (i.e., motion) and yet that is built out of simple neural mechanisms. In Section 4, we will discuss how such a unit can be built by combining impulse responses that are space-time separable, by using an approach similar to that of Watson and Ahumada.8 For those readers who are not entirely comfortable with these notions, we begin by reviewing space-time separability as well as spatiotemporal impulse responses.
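The space-time-separable construction just mentioned can be previewed in a few lines. In this sketch (Gabor-style profiles with assumed widths and frequencies), the sum of two separable products, an even spatial profile times an even temporal one plus their odd (quadrature) partners, collapses via the identity cos(a)cos(b) + sin(a)sin(b) = cos(a - b) into a single weighting function oriented along x = t.

```python
import numpy as np

x = np.linspace(-3, 3, 61)          # space
t = np.linspace(-3, 3, 61)[:, None]  # time, down the rows

env_x = np.exp(-x ** 2)              # Gaussian envelopes (assumed widths)
env_t = np.exp(-t ** 2)
f1, f2 = env_x * np.cos(2 * x), env_x * np.sin(2 * x)  # even, odd in x
g1, g2 = env_t * np.cos(2 * t), env_t * np.sin(2 * t)  # even, odd in t

# Each term f_i(x) * g_i(t) is space-time separable, yet their sum is an
# oriented receptive field: the cosine terms combine into cos(2x - 2t),
# whose ridges lie along the line x = t in space-time.
oriented = f1 * g1 + f2 * g2
target = np.exp(-x ** 2 - t ** 2) * np.cos(2 * (x - t))
```

Each separable ingredient needs only a single temporal impulse response applied uniformly across space, which is what makes this construction neurally plausible where the directly tailored unit above is not.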
4. SPATIOTEMPORAL IMPULSE RESPONSES
Many cells in the visual system respond (to a good approximation) by performing a weighted integration of the effect of light falling on their receptive field; the receptive-field profile, with its positive and negative lobes, defines the weighting function, or spatial impulse response. Across the top of Fig. 6 is an idealized spatial impulse response from such a cell. Since any spatial pattern can be thought of as a sum of points of light of various intensities packed together side by side, one can easily predict the response of a linear unit to an arbitrary
References

Handbook of Sensory Physiology
Receptive fields of single neurones in the cat's striate cortex
Application of Fourier analysis to the visibility of gratings
The contrast sensitivity of retinal ganglion cells of the cat
The Interpretation of Visual Motion