
Book ChapterDOI

A Biologically Inspired Model for the Detection of External and Internal Head Motions

10 Sep 2013, pp. 232-239

TL;DR: A biologically inspired model is proposed that builds upon the known functional organization of cortical motion processing at low and intermediate stages to decompose the composite motion signal.

Abstract: Non-verbal communication signals are to a large part conveyed by visual motion information of the user's facial components (intrinsic motion) and head (extrinsic motion). An observer perceives the visual flow as a superposition of both types of motions. However, when visual signals are used for training of classifiers for non-articulated communication signals, a decomposition is advantageous. We propose a biologically inspired model that builds upon the known functional organization of cortical motion processing at low and intermediate stages to decompose the composite motion signal. The approach extends previous models to incorporate mechanisms that represent motion gradients and direction sensitivity. The neural models operate on larger spatial scales to capture properties in flow patterns elicited by turning head movements. Center-surround mechanisms build contrast-sensitive cells and detect local facial motion. The model is probed with video samples and outputs occurrences and magnitudes of extrinsic and intrinsic motion patterns.


Summary (2 min read)

1 Introduction

  • The recent development of computers shows a clear trend towards companion properties [4, 14].
  • To achieve this, those systems must be able to interpret non-verbal communication patterns that are signalled by the user [2, 5].
  • Subsequent classification stages profit from a segregation of these patterns.
  • The method is studied using the example of head movements and dynamic facial expressions, which both cause optic flow at the observer position.
  • The authors model mechanisms of signal processing in early and intermediate stages of visual cortex to provide robust automatic decomposition of extrinsic and intrinsic facial motions.

2 Visual Representations of Head Movements

  • The instantaneous motion of a three-dimensional surface that is sensed by a stationary observer can be represented by the linear combination of its translational and rotational motion components [10], as well as non-rigid motion caused by object deformations.
  • The authors aim at spatial processing of the resulting patterns to individually detect the extrinsic and intrinsic components.
  • The apparent motion on the positive half circle, where the facial surface is oriented towards the observer, leads to a generic speed pattern.
  • This pattern corresponds to the speed gradients as investigated by [12] and is also depicted in Fig. 2.
  • In order to analyze such intrinsic facial motions, the authors reasoned that a center-surround mechanism for the filtering of motion patterns within the facial region will indicate occurrences of intrinsic motions, see Sec. 2.2.

2.1 A Model of Cortical Motion Gradient Detection

  • In the following the authors describe the implementation details of their model cells for detecting motion patterns that are characteristic for extrinsic motions corresponding to rotations around the X- or Y -axis, respectively, namely patterns containing speed gradients.
  • All presented detectors need a visual motion field u which is transformed into a log-polar representation, the velocity space V.
  • This representation allows selecting image locations containing certain motion directions, speeds, or both, which will be fundamental features for the upcoming design of gradient cells.
  • First, conjunctive input configurations need to match their tunings for speed and direction, and second, an increase or decrease in speed must exist along an axis corresponding to their direction preference (see Fig. 3).
  • Each cell response incorporates a divisive normalisation component in order to keep responses within bounds.

2.2 Model Mechanisms for Motion Contrast Detection

  • Local facial motions can be accounted for by mechanisms that operate on a smaller scale within the facial projection into the image plane.
  • To detect intrinsic motion, the authors propose cells that are sensitive to local changes in speed and direction.
  • These motion patterns are produced in the facial plane while the person is talking or during other facial actions.
  • The input integration of velocity responses is defined by weighted kernels Ω with different spatial scale dimensions operating on responses of motion and speed selective filters.
  • Integration over N directions yields the activation for a direction-insensitive motion contrast cell.

3 Results

  • The authors' model was probed with short video sequences containing extrinsic or intrinsic motion.
  • The optic flow was then transformed into velocity space representation and presented to proposed model cells.
  • The middle section of Fig. 4 shows results for sequences of unconstrained motion, where persons could move the head ad libitum.
  • Plots of the activations are also shown in the figure.
  • Also, detectors for intrinsic motion show activations that correlate well with eye blink labels.

4 Conclusion and Discussion

  • These movements are perceived as composite motion signals by the observer.
  • Correct interpretation regarding user dispositions and non-verbal communication patterns from visual signals requires segregated processing of both sources.
  • The authors propose networks of cortical motion processing to detect the individual occurrences of intrinsic and extrinsic motion.
  • Both proposed detectors work well on facial images and segregate composite facial motion into its extrinsic and intrinsic components.
  • In contrast to other approaches, their proposed model is independent of highly specialised models, tracking, learning from examples, or large optimisation stages to derive the presented results.


A Biologically Inspired Model for the Detection
of External and Internal Head Motions
Stephan Tschechne, Georg Layher, and Heiko Neumann
Institute of Neural Information Processing, University of Ulm,
89069 Ulm, Germany
http://www.uni-ulm.de/in/neuroinformatik.html
Abstract. Non-verbal communication signals are to a large part con-
veyed by visual motion information of the user’s facial components (in-
trinsic motion) and head (extrinsic motion). An observer perceives the
visual flow as a superposition of both types of motions. However, when
visual signals are used for training of classifiers for non-articulated com-
munication signals, a decomposition is advantageous. We propose a bi-
ologically inspired model that builds upon the known functional organi-
zation of cortical motion processing at low and intermediate stages to
decompose the composite motion signal. The approach extends previous
models to incorporate mechanisms that represent motion gradients and
direction sensitivity. The neural models operate on larger spatial scales to
capture properties in flow patterns elicited by turning head movements.
Center-surround mechanisms build contrast-sensitive cells and detect lo-
cal facial motion. The model is probed with video samples and outputs
occurrences and magnitudes of extrinsic and intrinsic motion patterns.
Keywords: Social Interaction, HCI, Motion Patterns, Cortical Processing
1 Introduction
The recent development of computers shows a clear trend towards companion
properties [4, 14]. Systems are expected to improve the efficiency of human-computer
interaction (HCI) by adapting to the user's needs, experience and moods.
To achieve this, those systems must be able to interpret non-verbal communica-
tion patterns that are signalled by the user [2, 5]. These patterns are to a large
part contained in visual signals that can be captured by an interaction system,
from which they can automatically be derived [9]. Among these visual patterns
bodily and facial expressions contain important cues to emotionally relevant
information; both types can be either of static or dynamic nature. Optic
flow has been investigated for emotion and facial analysis earlier [13, 6, 11, 8].
However, when it comes to analysis of facial motion, extrinsic (head) and in-
trinsic (facial) movements are superpositioned and elicit a composite flow field
of respective patterns from the observer’s perspective (see Fig. 1). Subsequent
classification stages profit from a segregation of these patterns. In [13], an at-
tempt to decompose the facial optical flow into affine as well as higher-order

flow patterns in order to segregate the facial motion has been proposed. In [3]
head-pose and facial expressions are estimated for graphics and animation. Here, an
affine deformation with parallax is modelled to fit active contours using singular
value decomposition. In [1] the authors propose a multi-stage landmark fitting
and tracking method to derive face and head gestures.
In this paper we propose a novel mechanism to detect occurrences of basic
components of motion from optic flow. The method is studied using the example
of head movements and dynamic facial expressions, which both cause optic flow
at the observer position. We model mechanisms of signal processing in early and
intermediate stages of visual cortex to provide robust automatic decomposition
of extrinsic and intrinsic facial motions. We demonstrate how occurrences of
extrinsic as well as intrinsic components are robustly derived from an optic flow
field. Our approach contrasts with others by requiring neither face detection nor
a previous learning phase. Additionally, processing of multiple persons comes at
no extra cost.
Fig. 1. Visual flow at the observer position is a superposition (right) of extrinsic (head,
left) and intrinsic (facial, middle) motion. Subsequent processing benefits from sepa-
rated processing of both sources.
2 Visual Representations of Head Movements
The instantaneous motion of a three-dimensional surface that is sensed by a sta-
tionary observer can be represented by the linear combination of its translational
and rotational motion components [10], as well as non-rigid motion caused by
object deformations. Any of these cause visual motion in the image plane. In
this paper we focus on the analysis of facial and head motion of an agent in
a communicative act by means of optic flow. This motion is composed of the
extrinsic (head) motion caused by translations and rotations and the superim-
posed internal motion of facial components (intrinsic motion). We assume that
any translational extrinsic motion of the head has been compensated to fixate
the head in the middle of the image so that the world coordinate system is cen-
tered in the moving head. We aim at spatial processing of the resulting patterns

to individually detect the extrinsic and intrinsic components. For a rotation of a
simple head model around the $Y$-axis, an arbitrary surface point $P = (x, y, z)^T$
appears on a rotational path in space (see Fig. 2) with frequency $\omega$ and periodic
length $T$. $P$ is uniquely defined by its radius $r$ and vertical component $y$ at time
$t$, when a distance $d$ to the observer is assumed:

$$P(t, r, y) = r \cdot \begin{pmatrix} \cos \omega t \\ y \\ \sin \omega t \end{pmatrix} + \begin{pmatrix} 0 \\ 0 \\ d \end{pmatrix}, \qquad \omega = \frac{2\pi}{T} \tag{1}$$
In the following we assume that $y = 0$ and $r = 1$. Application of the projection
model with $x = f \cdot X/Z$ and $y = f \cdot Y/Z$, given the focal length $f$ of the
camera, yields the projected coordinates $P_{proj}$ in the image plane for the observer:

$$P_{proj}(t, r) = f \cdot \begin{pmatrix} \dfrac{r \cos \omega t}{r \sin \omega t + d} \\ 0 \end{pmatrix} \tag{2}$$
If we assume constant angular speed, the apparent image speed of the projection
of $P$ is

$$\frac{\partial P_{proj}(t, r)}{\partial t} = \begin{pmatrix} \dfrac{-f r \omega \, (d \sin \omega t + r)}{r^2 \sin^2 \omega t + 2 d r \sin \omega t + d^2} \\ 0 \end{pmatrix} \tag{3}$$
which yields a characteristic motion pattern, see Fig. 2. The apparent mo-
tion on the positive half circle, where the facial surface is oriented towards the
observer, leads to a generic speed pattern. For a frontal motion from right to
left the pattern is composed by an acceleration, followed by maximum frontal
motion, and a symmetric deceleration patch. This pattern corresponds to the
speed gradients as investigated by [12] and is also depicted in Fig. 2. For sym-
metric and bounded objects image patches of increasing and decreasing speeds
are juxtaposed reflecting appearance and disappearance of the surface markings
on a convex rotating body surface. In Sec. 2.1 we suggest a filtering mechanism
for these arrangements of speed gradients which is tuned to such symmetric
arrangements of image motions with symmetric speed changes.
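The analysis above (Eqs. 1-3) can be probed numerically. The following sketch, with illustrative values r = 1, d = 3, f = 1 and ω = 2π assumed here (the paper does not fix them), evaluates the projected position and its temporal derivative over one rotation period; the apparent speed is largest when the point passes closest to the observer and decays symmetrically on both sides, producing the juxtaposed acceleration and deceleration patches described above:

```python
import math

def projected_x(t, r=1.0, d=3.0, f=1.0, omega=2.0 * math.pi):
    """Projected image x-coordinate of a surface point rotating about the Y-axis (Eq. 2)."""
    return f * r * math.cos(omega * t) / (r * math.sin(omega * t) + d)

def projected_speed(t, r=1.0, d=3.0, f=1.0, omega=2.0 * math.pi):
    """Apparent image speed, the temporal derivative of the projection (Eq. 3)."""
    num = -f * r * omega * (d * math.sin(omega * t) + r)
    den = (r * math.sin(omega * t) + d) ** 2
    return num / den

# Sample one rotation period T = 1; the speed magnitude peaks at t = 0.75,
# where sin(omega*t) = -1 and the point is closest to the observer.
ts = [i / 100.0 for i in range(100)]
speeds = [abs(projected_speed(t)) for t in ts]
```

A finite-difference check of `projected_speed` against `projected_x` confirms that the derivative in Eq. (3) is consistent with the projection in Eq. (2).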
Facial motion, on the other hand, is caused by actions of facial muscles, e.g.
during verbal communication, eye blinks, or while forming facial expressions.
These spatio-temporal changes occur as deformations on a smaller temporal and
spatial scale compared to the size of the face and are characterized by changes
in motion direction and/or speed relative to its coherent surrounding motion.
In order to analyze such intrinsic facial motions, we reasoned that a center-
surround mechanism for the filtering of motion patterns within the facial region
will indicate occurrences of intrinsic motions, see Sec. 2.2.
2.1 A Model of Cortical Motion Gradient Detection
In the following we describe the implementation details of our model cells for
detecting motion patterns that are characteristic for extrinsic motions corre-
sponding to rotations around the X- or Y -axis, respectively, namely patterns

Fig. 2. Left: Projection model of one point on a head’s surface and its trajectory in the
projection when the head is rotating. Middle: X-position over rotation angle of point
P . Right: One-periodical plot of speed over rotation angle (dashed ) and speed gradient
(solid) of a point when rotating on a circular path around Y axis. Point P is closest at
position 1.0. Due to foreshortening effects, characteristic speed gradients occur where
the point approaches or retreats.
containing speed gradients. All presented detectors need a visual motion field
$u$ which is transformed into a log-polar representation, the velocity space $V$.
In velocity space, motion is separably represented by speed $s_p = \log \|u\|$ and
direction $\tau_p = \tan^{-1}(u_y / u_x)$. This representation allows selecting image
locations containing certain motion directions, speeds, or both, which will be
fundamental features for the upcoming design of gradient cells. Filters for speed
$\psi$ and direction $\theta$ with tuning widths $\sigma$ are defined by
$$F_S(\mu, \nu) = \exp\!\left(-\frac{(\mu - \psi)^2}{2\sigma_\psi^2}\right) \tag{4}$$

$$F_\theta(\mu, \nu) = \exp\!\left(-\frac{(\nu - \log \theta)^2}{2\sigma_\theta^2}\right), \qquad (\mu, \nu) \in V \tag{5}$$
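A minimal sketch of the velocity-space representation and the tuning filters of Eqs. (4)-(5): speed is encoded as the log of the flow magnitude, direction as the flow angle, and each model cell applies a Gaussian tuning around its preferred speed ψ and direction θ. The function names and tuning widths are assumptions for illustration, and the direction tuning is implemented here as a wrapped angular difference rather than the log form printed in Eq. (5):

```python
import math

def to_velocity_space(u):
    """Map a 2-D flow vector u = (ux, uy) to log-polar velocity space:
    speed s = log ||u|| and direction tau = atan2(uy, ux)."""
    ux, uy = u
    return math.log(math.hypot(ux, uy)), math.atan2(uy, ux)

def speed_filter(mu, psi, sigma_psi=0.5):
    """Gaussian speed tuning (cf. Eq. 4): maximal when mu matches the preferred speed psi."""
    return math.exp(-((mu - psi) ** 2) / (2.0 * sigma_psi ** 2))

def direction_filter(nu, theta, sigma_theta=0.3):
    """Gaussian direction tuning (cf. Eq. 5): maximal when nu matches the preferred
    direction theta; the wrapped difference keeps the tuning periodic in [-pi, pi]."""
    diff = math.atan2(math.sin(nu - theta), math.cos(nu - theta))
    return math.exp(-(diff ** 2) / (2.0 * sigma_theta ** 2))

# A rightward flow vector strongly activates a rightward-tuned cell ...
s, tau = to_velocity_space((2.0, 0.0))
r_pref = speed_filter(s, psi=math.log(2.0)) * direction_filter(tau, theta=0.0)
# ... and only marginally activates an upward-tuned one.
r_anti = speed_filter(s, psi=math.log(2.0)) * direction_filter(tau, theta=math.pi / 2)
```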
Gradient cells $M^{+/-}_p$ at image position $p$ respond when two conditions are
observed, see Fig. 3: First, conjunctive input configurations need to match their
tunings for speed and direction, and second, an increase or decrease in speed
must exist along an axis corresponding to their direction preference. This increase
or decrease is reflected in a simultaneous stimulation of two speed- and
direction-tuned cells that are spatially arranged to build the desired speed gradient.
The speed- and direction-tuned subcells $M_p$ are derived from the given motion
field by applying a filter $F$ in velocity space representation. Each cell response
incorporates a divisive normalisation component in order to keep responses within bounds.
$$M_p = V_p \cdot F \tag{6}$$

$$M^+_p = M_{p-\Delta p} \cdot M_{p+\Delta p} \cdot \left(\epsilon + M_{p-\Delta p} + M_{p+\Delta p}\right)^{-1}, \qquad 0 < \epsilon \ll 1 \tag{7}$$
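The AND-gating with divisive normalisation of Eq. (7) can be sketched as follows; `gradient_cell` is a hypothetical name, and the subcell responses are assumed to be non-negative scalars obtained via Eq. (6):

```python
def gradient_cell(m_minus, m_plus, eps=1e-3):
    """Speed-gradient cell (cf. Eq. 7): multiplies the responses of two spatially
    offset, speed- and direction-tuned subcells (AND-gating), then divides by
    their sum so the response is normalised by the total input."""
    return (m_minus * m_plus) / (eps + m_minus + m_plus)

# Both subcells active -> substantial response; one subcell silent -> no response,
# so a gradient is only signalled when speed actually changes along the cell's axis.
both_active = gradient_cell(0.8, 0.9)
one_silent = gradient_cell(0.8, 0.0)
```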

Responses to flow gradients of opposite polarity are subsequently combined
by AND-gating to build a curvature detector. These cells operate at spatially
juxtaposed locations with component offsets along their directional preference
depending on the spatial scale of the filter kernels.
$$C_p = M^+_{p+\Delta p} \cdot M^-_{p-\Delta p} \cdot \left(\epsilon + M^+_{p+\Delta p} + M^-_{p-\Delta p}\right)^{-1} \tag{8}$$
In order to make the response more robust and selective to motion direction,
this curvature response is combined with the output of a motion integration cell
$\bar{M}_p$. The final response is thus defined by

$$R_p = M^+_{p+\Delta p} \cdot \bar{M}_p \cdot M^-_{p-\Delta p} \cdot \left(\epsilon + M^+_{p+\Delta p} + \bar{M}_p + M^-_{p-\Delta p}\right)^{-1} \tag{9}$$
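Combining Eqs. (8) and (9), the full circuit can be sketched as an AND-combination of an increasing-speed gradient cell, a central motion integration cell, and a decreasing-speed gradient cell. Function and variable names are assumptions; inputs are taken to be non-negative scalar responses:

```python
def rotation_pattern_cell(m_grad_plus, m_center, m_grad_minus, eps=1e-3):
    """Final detector response (cf. Eq. 9): the cell fires only when an acceleration
    patch, uniform frontal motion, and a deceleration patch co-occur in the spatial
    arrangement characteristic of a rotating head; divisive normalisation keeps the
    response bounded relative to the total input."""
    num = m_grad_plus * m_center * m_grad_minus
    den = eps + m_grad_plus + m_center + m_grad_minus
    return num / den

# The accelerate / frontal / decelerate pattern of a head rotation drives the cell;
# a uniform translation provides no gradient responses and is rejected.
rotation = rotation_pattern_cell(0.7, 0.9, 0.8)
translation = rotation_pattern_cell(0.0, 0.9, 0.0)
```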
Fig. 3. Left: Gradient cell $M^+_p$. Middle: The full model cell circuit for detecting
rotational motion patterns. Oppositely tuned gradient cells are combined with cells
sensitive to uniform motion patterns. Right: Center-surround cell for motion disconti-
nuity detection with two centered subcells with different spatial tunings and integration
weights for center and surround cell.
2.2 Model Mechanisms for Motion Contrast Detection
Local facial motions can be accounted for by mechanisms that operate on a smaller
scale within the facial projection into the image plane. To detect intrinsic motion,
we propose cells that are sensitive to local changes in speed and direction. These
motion patterns are produced in the facial plane while the person is talking
or during other facial actions. Here, we employ a center-surround interaction
mechanism of motion sensitive cells that are able to detect local variations in
visual flow, but won’t be sensitive to large uniform flow fields. Such a sensitivity
can be generated by cells with antagonistic center-surround motion sensitivity.
The input integration of velocity responses is defined by weighted kernels $\Omega$
with different spatial scale dimensions operating on responses of motion and
speed selective filters. Integration over $N$ directions yields the activation for a
direction-insensitive motion contrast cell.

$$D^{sub}_{p,\theta} = V_p \ast F_{\bar{\mu}} \tag{10}$$
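The centre-surround opponency behind Eq. (10) can be sketched as follows. The combination rule is an assumption (per-direction centre-minus-surround opponency, rectified and summed over N direction channels); the text specifies only that weighted kernels Ω of different spatial scales feed the centre and surround subcells:

```python
def motion_contrast_cell(center_responses, surround_responses):
    """Direction-insensitive motion contrast cell: for each of N direction channels,
    subtract the surround response from the centre response (antagonistic
    centre-surround), half-wave rectify, and integrate over all directions."""
    return sum(max(c - s, 0.0) for c, s in zip(center_responses, surround_responses))

# Local facial motion (e.g. an eye blink): the centre deviates from its surround
# in one direction channel -> strong response.
local_motion = motion_contrast_cell([0.9, 0.1], [0.1, 0.1])
# Large uniform flow (e.g. a translating head): centre equals surround in every
# channel -> no response.
uniform_flow = motion_contrast_cell([0.9, 0.1], [0.9, 0.1])
```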

Citations

Proceedings ArticleDOI
07 Nov 2014
TL;DR: This paper outlines the contribution to the 2014 edition of the AVEC competition and proposes an approach based on abstract meta information about individual subjects and also prototypical task and label dependent templates to infer the respective emotional states.
Abstract: This paper outlines our contribution to the 2014 edition of the AVEC competition. It comprises classification results and considerations for both the continuous affect recognition sub-challenge and also the depression recognition sub-challenge. Rather than relying on statistical features that are normally extracted from the raw audio-visual data, we propose an approach based on abstract meta information about individual subjects and also prototypical task and label dependent templates to infer the respective emotional states. The results of the approach that were submitted to both parts of the challenge significantly outperformed the baseline approaches. Further, we elaborate on several issues about the labeling of affective corpora and the choice of appropriate performance measures.

51 citations


Cites background from "A Biologically Inspired Model for t..."

  • ...Additionally to the analysis of audio signals, many different approaches have been followed on emotion recognition from visual input for example in the form of facial expressions, movement cues [57] and body gestures [5]....



Journal ArticleDOI
TL;DR: This paper proposes CNN-based binary classifiers for detecting each of the functions from the angular velocity of the head pose and the presence or absence of utterances, and notes a tendency toward overdetection that added more functions to those originally in the corpus.
Abstract: A functional head-movement corpus and convolutional neural networks (CNNs) for detecting head-movement functions are presented for analyzing the multiple communicative functions of head movements in multiparty face-to-face conversations. First, focusing on the multifunctionality of head movements, i.e., that a single head movement can simultaneously perform multiple functions, this paper defines 32 non-mutually-exclusive function categories, whose genres are speech production, eliciting and giving feedback, turn management, and cognitive and affect display. To represent and capture arbitrary multifunctional structures, our corpus employs multiple binary codes and logical-sum-based aggregations of multiple coders’ judgments. A corpus analysis targeting four-party Japanese conversations revealed multifunctional patterns in which the speaker modulates multiple functions, such as emphasis and eliciting listeners’ responses, through rhythmic head movements, and listeners express various attitudes and responses through continuous back-channel head movements. This paper proposes CNN-based binary classifiers for detecting each of the functions from the angular velocity of the head pose and the presence or absence of utterances. The experimental results showed that the recognition performance varies greatly, from approximately 30% to 90% in terms of the F-score, depending on the function category, and the performance was positively correlated with the amount of data and inter-coder agreement. In addition, we noted a tendency toward overdetection that added more functions to those originally in the corpus. The analyses and experiments confirm that our approach is promising for studying the multifunctionality of head movements.

7 citations


Cites methods from "A Biologically Inspired Model for t..."

  • ...A number of techniques have been used for feature extraction, such as facial landmark detection [101]–[103], optical flow calculation [104]–[106], appearance modeling [107], and head pose tracking [108]–[111]....



Proceedings ArticleDOI
15 Jul 2015
TL;DR: This work proposes the first steps towards an optical flow based separation of rigid head motions from non-rigid motions caused by facial expressions and suggests that after their separation, both head movements and facial expressions can be used as a basis for the recognition of a user's emotions and dispositions and thus allow a technical system to effectively adapt to the user's state.
Abstract: In intelligent environments, computer systems not solely serve as passive input devices waiting for user interaction but actively analyze their environment and adapt their behaviour according to changes in environmental parameters. One essential ability to achieve this goal is to analyze the mood, emotions and dispositions a user experiences while interacting with such intelligent systems. Features allowing to infer such parameters can be extracted from auditive, as well as visual sensory input streams. For the visual feature domain, in particular facial expressions are known to contain rich information about a user's emotional state and can be detected by using either static and/or dynamic image features. During interaction facial expressions are rarely performed in isolation, but most of the time co-occur with movements of the head. Thus, optical flow based facial features are often compromised by additional motions. Parts of the optical flow may be caused by rigid head motions, while other parts reflect deformations resulting from facial expressivity (non-rigid motions). In this work, we propose the first steps towards an optical flow based separation of rigid head motions from non-rigid motions caused by facial expressions. We suggest that after their separation, both, head movements and facial expressions can be used as a basis for the recognition of a user's emotions and dispositions and thus allow a technical system to effectively adapt to the user's state.

Cites methods from "A Biologically Inspired Model for t..."

  • ...These results were generated using a restricted set of biologically plausible motion detectors sensitive to motion gradients and local contrasts [18] read out and combined from a multiscale filter bank....


  • ...4B we finally show the results of a similar approach [18], where a set of biologically inspired filter combinations was successfully used for the detection of head rotations and eye blinks in real image sequences....



References

Book ChapterDOI
11 May 2004
TL;DR: By proving that this scheme implements a coarse-to-fine warping strategy, this work gives a theoretical foundation for warping which has been used on a mainly experimental basis so far and demonstrates its excellent robustness under noise.
Abstract: We study an energy functional for computing optical flow that combines three assumptions: a brightness constancy assumption, a gradient constancy assumption, and a discontinuity-preserving spatio-temporal smoothness constraint. In order to allow for large displacements, linearisations in the two data terms are strictly avoided. We present a consistent numerical scheme based on two nested fixed point iterations. By proving that this scheme implements a coarse-to-fine warping strategy, we give a theoretical foundation for warping which has been used on a mainly experimental basis so far. Our evaluation demonstrates that the novel method gives significantly smaller angular errors than previous techniques for optical flow estimation. We show that it is fairly insensitive to parameter variations, and we demonstrate its excellent robustness under noise.

2,701 citations


Journal ArticleDOI
01 Jan 1987-Nature
TL;DR: A simple algorithm for computing the three-dimensional structure of a scene from a correlated pair of perspective projections is described here, when the spatial relationship between the two projections is unknown.
Abstract: A simple algorithm for computing the three-dimensional structure of a scene from a correlated pair of perspective projections is described here, when the spatial relationship between the two projections is unknown. This problem is relevant not only to photographic surveying1 but also to binocular vision2, where the non-visual information available to the observer about the orientation and focal length of each eye is much less accurate than the optical information supplied by the retinal images themselves. The problem also arises in monocular perception of motion3, where the two projections represent views which are separated in time as well as space. As Marr and Poggio4 have noted, the fusing of two images to produce a three-dimensional percept involves two distinct processes: the establishment of a 1:1 correspondence between image points in the two views—the ‘correspondence problem’—and the use of the associated disparities for determining the distances of visible elements in the scene. I shall assume that the correspondence problem has been solved; the problem of reconstructing the scene then reduces to that of finding the relative orientation of the two viewpoints.

2,563 citations


Journal ArticleDOI
TL;DR: This paper explores and compares techniques for automatically recognizing facial actions in sequences of images and provides converging evidence for the importance of using local filters, high spatial frequencies, and statistical independence for classifying facial actions.
Abstract: The facial action coding system (FAGS) is an objective method for quantifying facial movement in terms of component actions. This paper explores and compares techniques for automatically recognizing facial actions in sequences of images. These techniques include: analysis of facial motion through estimation of optical flow; holistic spatial analysis, such as principal component analysis, independent component analysis, local feature analysis, and linear discriminant analysis; and methods based on the outputs of local filters, such as Gabor wavelet representations and local principal components. Performance of these systems is compared to naive and expert human subjects. Best performances were obtained using the Gabor wavelet representation and the independent component representation, both of which achieved 96 percent accuracy for classifying 12 facial actions of the upper and lower face. The results provide converging evidence for the importance of using local filters, high spatial frequencies, and statistical independence for classifying facial actions.

1,039 citations


BookDOI
01 Jan 2004
TL;DR: This work presents an analytic solution to the problem of estimating multiple 2-D and 3-D motion models from two-view correspondences or optical flow and proposes a novel motion segmentation algorithm that outperforms existing algebraic methods in terms of efficiency and robustness.
Abstract: We present an analytic solution to the problem of estimating multiple 2-D and 3-D motion models from two-view correspondences or optical flow. The key to our approach is to view the estimation of multiple motion models as the estimation of a single multibody motion model. This is possible thanks to two important algebraic facts. First, we show that all the image measurements, regardless of their associated motion model, can be fit with a real or complex polynomial. Second, we show that the parameters of the motion model associated with an image measurement can be obtained from the derivatives of the polynomial at the measurement. This leads to a novel motion segmentation algorithm that applies to most of the two-view motion models adopted in computer vision. Our experiments show that the proposed algorithm outperforms existing algebraic methods in terms of efficiency and robustness, and provides a good initialization for iterative techniques, such as EM, which is strongly dependent on correct initialization.

909 citations


Journal ArticleDOI
Abstract: This paper explores the use of local parametrized models of image motion for recovering and recognizing the non-rigid and articulated motion of human faces. Parametric flow models (for example affine) are popular for estimating motion in rigid scenes. We observe that within local regions in space and time, such models not only accurately model non-rigid facial motions but also provide a concise description of the motion in terms of a small number of parameters. These parameters are intuitively related to the motion of facial features during facial expressions and we show how expressions such as anger, happiness, surprise, fear, disgust, and sadness can be recognized from the local parametric motions in the presence of significant head motion. The motion tracking and expression recognition approach performed with high accuracy in extensive laboratory experiments involving 40 subjects as well as in television and movie sequences.

524 citations


Frequently Asked Questions (2)
Q1. What are the contributions mentioned in the paper "A biologically inspired model for the detection of external and internal head motions" ?

The authors propose a biologically inspired model that builds upon the known functional organization of cortical motion processing at low and intermediate stages to decompose the composite motion signal. 

Future work will include a validation of the approach for more generic shape-from-motion tasks.