scispace - formally typeset
Open Access · Proceedings Article · DOI

Robot recognizes three simultaneous speech by active audition

Kazuhiro Nakadai, +2 more
- Vol. 1, pp 398-405
TLDR
In this article, an active direction-pass filter (ADPF) is used to separate sounds originating from the specified direction obtained by the real-time human tracking system, and the separated speech is recognized by speech recognition using multiple acoustic models, which integrates the multiple results to output the result with the maximum likelihood.
Abstract
Robots should listen to and recognize speech with their own ears under noisy environments and simultaneous speech to attain smooth communication with people in the real world. This paper presents three-simultaneous-speech recognition based on active audition, which integrates audition with motion. Our robot audition system consists of three modules: a real-time human tracking system, an active direction-pass filter (ADPF) and a speech recognition system using multiple acoustic models. The real-time human tracking system realizes robust and accurate sound source localization and tracking by audio-visual integration. The performance of localization shows that the resolution in front of the robot is much higher than that in the periphery. We call this phenomenon the "auditory fovea" because it is similar to the visual fovea (high resolution in the center of the human eye). Active motions such as being directed at the sound source improve localization by making the best use of the auditory fovea. The ADPF realizes accurate and fast sound separation by using a pair of microphones. The ADPF separates sounds originating from the specified direction obtained by the real-time human tracking system. Because the performance of separation depends on the accuracy of localization, the extraction of sound from the front direction is more accurate than that of sound from the periphery. This means that the pass range of the ADPF should be narrower in the front direction than in the periphery; in other words, such active pass range control improves sound separation. The separated speech is recognized by speech recognition using multiple acoustic models, which integrates the multiple results to output the result with the maximum likelihood. Active motions such as being directed at a sound source improve speech recognition because they realize not only improved sound extraction but also easier integration of the results using the face ID from face recognition. The robot audition system improved by active audition is implemented on an upper-torso humanoid. The system attains localization, separation and recognition of three simultaneous speeches, and the results prove the efficiency of active audition.



Proceedings of the 2003 IEEE International Conference on Robotics & Automation
Taipei, Taiwan, September 14-19, 2003

Robot Recognizes Three Simultaneous Speech By Active Audition

Kazuhiro Nakadai*, Hiroshi G. Okuno*,†, Hiroaki Kitano*,‡
* Kitano Symbiotic Systems Project, ERATO, Japan Science and Tech. Corp., Tokyo, Japan
† Graduate School of Informatics, Kyoto University, Kyoto, Japan
‡ Sony Computer Science Laboratories, Inc., Tokyo, Japan
nakadai@nakadai.com, okuno@nue.org, kitano@csl.sony.co.jp
I. INTRODUCTION

Robots that interact with humans should separate and recognize various kinds of sounds. This means that robot audition is important for social interaction as well as for triggering events. To realize such robots, four issues should be considered: 1) noise cancellation while in motion, 2) integration of auditory, visual and other sensory information, 3) sound source separation under noisy environments, and 4) speech recognition of each sound source if it is speech. Because most robots address these issues only partially, robust and accurate auditory processing in robots has been difficult so far.
The difficulties in robot audition lie in sound source separation under real-world environments. For example, Kismet of the MIT AI Lab [1] and ROBITA of Waseda University [2] can interact with people by automatic speech recognition and gestures, but they use a microphone attached near the mouth of each speaker to avoid motor noise in motion. Therefore, they do not have a sound source separation function. WA-2 of Waseda University [3] can localize a sound source by using a pair of microphones in the robot, but it does not take motor noise in motion into account. Therefore it adopts the "stop-hear-act" principle; that is, the robot stops to hear. It also assumes a single sound source, so the robot does not have a sound source separation function. SmartHead can localize and track multiple sound sources by using four microphones and stereo cameras [4]. However, it uses only low-level information, and it is difficult to resolve ambiguities that are solved by higher-level information such as face IDs. Since it does not assume sound distortion by a robot's head shape, it is difficult to apply the method to a robot head with sound distortion, such as a human-like head. In addition, the maximum number of sound sources is theoretically limited.
To solve these problems, we proposed active audition, which controls microphone parameters to perceive auditory information better, with cancellation of self motor noise [5]. Active audition has been integrated with face localization, face recognition and stereo vision by using streams, and a real-time multiple human tracking system has been reported [6]. Furthermore, an active direction-pass filter (hereafter, ADPF) that separates sound sources by using the accurate sound directions obtained from the real-time multiple human tracking system has also been reported [7]. The ADPF uses a pair of microphones to separate sound sources. It calculates the interaural phase difference (IPD) and interaural intensity difference (IID) for each sub-band and then determines the sound source direction by performing hypothetical reasoning with a set of IPDs and IIDs. Finally, the ADPF collects the sub-bands whose IPD and IID match those of the specified direction. The performance evaluation of the ADPF reveals that the sensitivity of localization depends on the direction of the sound source.

Fig. 1. The Robot Audition System for Simultaneous Speech Recognition
In other words, the ADPF separates sound streams very precisely when the sound source is just in front of the robot, while it separates them very poorly when the sound source is to the side. Although this phenomenon arises from a pair of microphones, it is quite similar to the fovea, which refers to the difference of resolution in vision, that is, higher resolution in the center of the eye and lower resolution in the periphery. We call the auditory equivalent of the fovea the "auditory fovea".
As an application of the ADPF, we also reported automatic recognition of simultaneous speech by two specific persons [8]. However, the integration method for recognition results from multiple acoustic models was based on a simple majority rule.
In this paper, we propose an integration method based on the recognition rate of each acoustic model. The method also provides audio-visual integration in speech recognition. We present a robot audition system that can localize, separate and recognize three simultaneous speeches by using the real-time multiple human tracking system, the auditory-fovea-based ADPF, and speech recognition using multiple acoustic models.
The rest of this paper is organized as follows: Section 2 describes the robot audition system for simultaneous speech recognition. Sections 3, 4 and 5 describe the real-time multiple human tracking system, the active direction-pass filter, and the speech recognition using multiple acoustic models, respectively. Section 6 evaluates the performance of the robot audition system. The last section provides discussion and conclusion.
II. ROBOT AUDITION SYSTEM FOR SIMULTANEOUS SPEECH RECOGNITION

The architecture of the robot audition system for simultaneous speech recognition is shown in Fig. 1. It consists of three modules: the real-time human tracking system, the active direction-pass filter, and speech recognition using multiple acoustic models. Sounds captured by the robot's microphones and images captured by the robot's cameras are sent to the real-time human tracking system described in a later section. The sound source directions are obtained from the auditory and visual streams generated in the real-time human tracking system. The sound source directions are sent to the ADPF. The ADPF extracts sound sources from those directions by hypothesis matching of the interaural intensity difference (IID) and interaural phase difference (IPD), which are calculated from the input spectra of the left and right channels. The speech recognition module recognizes the extracted speeches by using multiple acoustic models.
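As a rough sketch of this front-end computation, the following fragment derives per-sub-band IPD and IID from one frame of the left and right channel signals. The frame handling, window, FFT size and function name are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def ipd_iid_per_subband(left, right, fs=16000, n_fft=512):
    """Estimate per-sub-band IPD (rad) and IID (dB) for one analysis frame.

    left, right : 1-D arrays holding the same frame from each channel.
    The window and FFT size are illustrative choices, not the paper's values.
    """
    window = np.hanning(len(left))
    spec_l = np.fft.rfft(left * window, n_fft)
    spec_r = np.fft.rfft(right * window, n_fft)

    # Interaural phase difference: phase of the cross-spectrum.
    ipd = np.angle(spec_l * np.conj(spec_r))

    # Interaural intensity difference in dB (epsilon avoids log(0)).
    eps = 1e-12
    iid = 20.0 * np.log10((np.abs(spec_l) + eps) / (np.abs(spec_r) + eps))

    freqs = np.fft.rfftfreq(n_fft, 1.0 / fs)
    return freqs, ipd, iid
```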
We use the upper-torso humanoid SIG as a testbed for this research. SIG has a cover made of FRP (fiber reinforced plastic). It is designed to separate the SIG inner world from the external world acoustically. A pair of CCD cameras (Sony EVI-G20) is used for stereo vision. Two pairs of microphones are used for auditory processing. One pair is located at the left and right ear positions for sound source localization. The other is installed inside the cover, mainly for canceling self-motor noise in motion. SIG has 4 DC motors (4 DOFs) with position and velocity control using potentiometers.
The following sections describe the three modules in detail.
III. REAL-TIME HUMAN TRACKING SYSTEM

The real-time human tracking system extracts accurate sound source directions by integrating audition and vision, and gives them to the ADPF.

Fig. 2. Hierarchical Architecture of the Real-Time Tracking System
The architecture of the real-time human tracking system using SIG, shown in Figure 2, consists of seven modules: Sound, Face, Stereo Vision, Association, Focus-of-Attention, Motor Control and Viewer.
Sound localizes sound sources. Face detects multiple faces by combining skin-color extraction, correlation-based matching and multiple-scale image generation [9]. It identifies each face by Linear Discriminant Analysis (LDA), which creates an optimal subspace to distinguish classes and continuously updates the subspace on demand with a small amount of computation [10]. In addition, the faces are localized in 3-D world coordinates by assuming an average face size. Finally, the 10 best face IDs with their probabilities and locations are sent to Association. Stereo Vision localizes lengthwise objects such as people precisely by using fast disparity map generation [11]. It improves the robustness of the system in tracking a person who looks away and does not talk. Association forms streams and associates them into a higher-level representation, that is, an association stream, according to proximity in location. The directions of the streams are sent to the ADPF with the captured sounds. Focus-of-Attention plans SIG's movement based on the status of the streams. Motor Control is activated by the Focus-of-Attention module and generates PWM (Pulse Width Modulation) signals for the DC motors. Viewer shows the status of the auditory, visual and association streams in radar and scrolling windows. The whole system works in real time with a small latency of 500 ms by distributed processing on 5 PCs connected by a combination of Gigabit and Fast Ethernet.
Stream Formation and Association: Streams are formed in Association by connecting events from Sound, Face and Stereo Vision along a time course. First, since the location information in sound, face and stereo vision events is observed in the SIG coordinate system, the coordinates are converted into world coordinates by referring to a motor event observed at the same time. The converted events are connected to a stream by using a Kalman filter based algorithm described in detail in [12]. The Kalman filter is efficient for reducing the influence of process and measurement noise in localization, especially in auditory processing, which has larger ambiguities. In sound stream formation, a sound stream and an event are connected when they have a harmonic relationship and the difference in azimuth between the stream direction predicted by the Kalman filter and the sound event is less than ±10°. In face and stereo vision stream formation, a face or stereo vision event is connected to a face or stereo vision stream when the distance between the predicted location of the stream and the location of the event is within 40 cm and they have the same event ID. An event ID is a face name or an object ID generated in the face or stereo vision module.
When the system judges that multiple streams originate from the same person, they are associated into an association stream, a higher-level stream representation [6]. When one of the streams forming an association stream is terminated, the terminated stream is removed from the association stream, and the association stream is de-associated into one or more separate streams.
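As a hedged illustration of this event-to-stream connection logic, the sketch below encodes the two stated rules (azimuth within roughly ±10° plus a harmonic relationship for sound events; distance within 40 cm plus a matching event ID for face and stereo vision events). The data structures, field names and helper signatures are assumptions made for the example, not the paper's implementation.

```python
import math
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Stream:
    kind: str                                    # "sound", "face" or "stereo"
    predicted_azimuth: float = 0.0               # Kalman-predicted azimuth (deg)
    predicted_position: tuple = (0.0, 0.0, 0.0)  # Kalman-predicted 3-D location (m)
    event_id: Optional[str] = None               # face name or object ID, if any
    events: list = field(default_factory=list)

def connect_sound_event(streams, azimuth, is_harmonic_with):
    """Attach a sound event to a sound stream when they are harmonically
    related and the azimuth difference is within about +/-10 degrees."""
    for s in streams:
        if (s.kind == "sound" and is_harmonic_with(s)
                and abs(azimuth - s.predicted_azimuth) <= 10.0):
            s.events.append(("sound", azimuth))
            return s
    return None

def connect_visual_event(streams, kind, position, event_id):
    """Attach a face or stereo vision event to a stream of the same kind
    when it lies within 40 cm of the predicted location and carries the
    same event ID (face name or object ID)."""
    for s in streams:
        if (s.kind == kind and s.event_id == event_id
                and math.dist(position, s.predicted_position) <= 0.4):
            s.events.append((kind, position))
            return s
    return None
```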
Control of Tracking: Tracking is controlled by Focus-of-Attention so as to keep facing the direction of the stream with attention, and motor events are sent to Motor Control. By selecting a stream with attention and tracking it, the ADPF can continue to make the best use of foveal processing. The selection of streams, that is, focus-of-attention control, is programmable according to the surrounding situations.

Fig. 3. Distribution of Sound Localization (localization results per input sound source direction, in degrees)

Fig. 4. Extraction of a Single Sound Source (signal-to-noise ratio versus pass range, in degrees)
In this paper, the following precedence of focus-of-attention control is used for the ADPF: an association stream including a sound stream has the highest priority, a sound stream has the second priority, and other visual streams have the third priority.
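A minimal sketch of that precedence rule follows; the stream attributes (kind, has_sound) are illustrative assumptions rather than the system's actual interfaces.

```python
def select_attended_stream(streams):
    """Choose the stream to attend to, following the stated precedence:
    an association stream containing a sound stream first, then a plain
    sound stream, then other (visual) streams."""
    def priority(stream):
        if stream.kind == "association" and stream.has_sound:
            return 0
        if stream.kind == "sound":
            return 1
        return 2  # face / stereo vision streams
    return min(streams, key=priority) if streams else None
```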
IV. ACTIVE DIRECTION-PASS FILTER

The architecture of the ADPF is shown in the dark area of Fig. 1. The ADPF uses two key techniques, auditory epipolar geometry and the auditory fovea. Auditory epipolar geometry is a localization method based on IPD and IID that does not require HRTFs; it is described in detail later in this section. In this paper, the ADPF is implemented so that it can use both HRTFs and auditory epipolar geometry for evaluation. The auditory fovea is used to control the pass range of the ADPF, that is, the pass range is a narrow angle in the front direction and a wider angle in the periphery. The detailed algorithm of the ADPF is as follows:
1. The IPD Δφ' and IID Δρ' in each sub-band are obtained from the difference between the left and right channels.
2. Let θ_s be the azimuth of the stream with current attention in the robot coordinate system of the real-time human tracking system. θ_s is sent to the ADPF through the Gigabit Ethernet network, taking the processing latency into account.
3. The pass range δ(θ_s) of the ADPF is selected according to θ_s. The pass range function δ has its minimum value in the SIG front direction, where sensitivity is maximum, and larger values toward the periphery because of the lower sensitivity there. Let θ_l = θ_s - δ(θ_s) and θ_h = θ_s + δ(θ_s).
4. For a stream direction θ, the IPD Δφ_E(θ) is estimated for each sub-band by auditory epipolar geometry, and the IID Δρ_H(θ) is obtained from HRTFs.
5. A sub-band is collected if its IPD and IID satisfy the following condition:
   f < f_th:  Δφ_E(θ_l) ≤ Δφ' ≤ Δφ_E(θ_h),  and
   f ≥ f_th:  Δρ_H(θ_l) ≤ Δρ' ≤ Δρ_H(θ_h),
   where f_th is the upper bound of the frequencies for which the IPD is effective for localization. It depends on the baseline between the ears; in SIG's case, f_th is 1500 Hz.
6. A wave consisting of the collected sub-bands is constructed.

Fig. 5. Pass Range Function
Note that the direction of an association stream is specified by visual information rather than by auditory information, to obtain a more accurate direction.
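The sketch below illustrates step 5 of the algorithm above: collecting the sub-bands whose observed IPD or IID falls between the values predicted for θ_l and θ_h, using the IPD test below f_th and the IID test at or above it. The helper functions ipd_epipolar and iid_hrtf, and the single-frame resynthesis, are assumptions for illustration.

```python
import numpy as np

def adpf_collect(spec_l, spec_r, freqs, theta_s, delta,
                 ipd_epipolar, iid_hrtf, f_th=1500.0):
    """Collect sub-bands around the attended direction theta_s (deg).

    spec_l, spec_r : complex spectra of the left and right channels.
    delta(theta)   : pass range function (deg), cf. Fig. 5.
    ipd_epipolar(theta, freqs), iid_hrtf(theta, freqs) : assumed helpers
        returning the expected IPD / IID per sub-band for direction theta.
    """
    theta_l = theta_s - delta(theta_s)
    theta_h = theta_s + delta(theta_s)

    # Observed IPD (rad) and IID (dB) per sub-band.
    ipd = np.angle(spec_l * np.conj(spec_r))
    iid = 20.0 * np.log10((np.abs(spec_l) + 1e-12) / (np.abs(spec_r) + 1e-12))

    lo_ipd, hi_ipd = ipd_epipolar(theta_l, freqs), ipd_epipolar(theta_h, freqs)
    lo_iid, hi_iid = iid_hrtf(theta_l, freqs), iid_hrtf(theta_h, freqs)

    # Below f_th apply the IPD test, at or above f_th apply the IID test.
    low = (freqs < f_th) & (np.minimum(lo_ipd, hi_ipd) <= ipd) & (ipd <= np.maximum(lo_ipd, hi_ipd))
    high = (freqs >= f_th) & (np.minimum(lo_iid, hi_iid) <= iid) & (iid <= np.maximum(lo_iid, hi_iid))
    mask = low | high

    # Keep the passed sub-bands (left channel here) and resynthesize one frame.
    return np.fft.irfft(np.where(mask, spec_l, 0.0))
```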
A. Auditory Fovea and Pass Range Control

In the retina, the resolution of images is high in the fovea, which is located at the center of the retina, and much poorer towards the periphery, which serves to capture information from a much larger area. Because the visual fovea gives a good compromise between resolution and field of view without the cost of processing a large amount of data, it is useful for robots [13], [14]. The visual fovea must face the target object to obtain good resolution, so it is a kind of active vision [15]. It is well known that human sound localization is most accurate in the front direction and gets worse toward the periphery [16]. The same is true of a robot with two microphones. Fig. 3 shows the distribution map of sound source localization in the auditory module of the real-time human tracking system.

The x axis shows input sound directions from 0° to 90° at intervals of 10°. 200 trials of localization were performed for each input sound direction. The result of localization is represented as a histogram for each input direction; a darker square means a higher concentration of localization results. Figure 3 shows that localization of sounds from the front direction is concentrated on the correct direction, whereas it is widely distributed in the periphery. Thus, sound localization in a robot is accurate in the front direction. We call this phenomenon the auditory fovea.
Akin to the visual fovea, the auditory fovea also needs to be directed at the target object, such as a speaker. Therefore, it too relies on active motion. Such integration of sound and active motion, termed active audition [5], can be used to attain improved auditory perception. Active motion is essential in audition and vision not only for friendly humanoid-human interaction, but also for better perception.
The accuracy of sound source localization affects the performance of sound source separation by the ADPF. Because the accuracy of sound source localization depends on the sound direction, the pass range of the ADPF should be controlled according to the sound direction. Figure 4 shows results of single sound source extraction by the ADPF. The x and y axes are the pass range of the ADPF and the signal-to-noise ratio, respectively. When the signal-to-noise ratio is 0 dB, the sound source is regarded as completely extracted. Each line in Fig. 4 corresponds to a different speaker direction, varied from 0° to 90° in 10° steps.

For sound from the front direction of the robot, a pass range of ±10° is necessary to extract the sound properly. However, for sound at 90° from the front direction of the robot, a pass range of at least ±35° is necessary. For a single sound source, a wider pass range gives a higher signal-to-noise ratio of the extracted sound. However, background noise and other sound sources must be considered in a real environment, so a narrower pass range is better in the sense of noise cancellation. In Fig. 4, we select the narrowest pass range that extracts a sound source properly, and define the pass range function shown in Fig. 5.
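Only two points of the pass range function are stated in the text (about ±10° for a frontal source and at least ±35° at 90°); the sketch below simply interpolates between them as a stand-in for Fig. 5, which is an assumption rather than the paper's actual function.

```python
import numpy as np

def pass_range_deg(theta_deg):
    """Pass range delta(theta), in degrees, for a source at azimuth theta.

    Only the endpoints are stated in the text (about +/-10 deg for a frontal
    source, at least +/-35 deg at 90 deg); the linear interpolation between
    them is an illustrative assumption, not the curve of Fig. 5.
    """
    theta = min(abs(theta_deg), 90.0)
    return float(np.interp(theta, [0.0, 90.0], [10.0, 35.0]))
```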
B. Auditory Epipolar Geometry

Auditory epipolar geometry was proposed to extract directional information of sound sources without using HRTFs [17]. Epipolar geometry is the fundamental geometric relationship between two perspective cameras in stereo vision research [18]. Auditory epipolar geometry is an extension of the epipolar geometry in vision (hereafter, visual epipolar geometry) to audition. Since auditory epipolar geometry extracts directional information by using this geometric relation, it can dispense with HRTFs. When the distance between a sound source and the robot is more than 50 cm, the influence of the distance can be ignored [12]. Then, when the influence of the head shape is considered, auditory epipolar geometry defines the estimated IPD Δφ corresponding to the sound direction θ in terms of f, v and r (a reconstruction of this relation is sketched below), where f, v, r and θ are the frequency of the sound, the velocity of sound, the radius of the robot head and the sound direction, respectively.
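The equation itself did not survive in this scan. Given the listed variables, a plausible reconstruction based on the common spherical-head approximation for the interaural path difference is shown below; treat it as an assumption rather than the paper's verbatim formula.

```latex
% Assumed spherical-head (Woodworth-style) approximation of the IPD
\Delta\varphi(\theta) \;=\; \frac{2\pi f}{v}\, r\,\bigl(\theta + \sin\theta\bigr)
```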
V. SPEECH RECOGNITION FOR SEPARATED SOUND

Robust speech recognition against noise is one of the hottest topics in the speech community. Some approaches such as multi-condition training and missing data [19], [20] show some efficiency for speech recognition in noise. However, these methods are of less use when the signal-to-noise ratio is as low as 0 dB. In this case, speech enhancement by front-end processing is necessary. This kind of speech enhancement is efficient for speech recognition at higher signal-to-noise ratios, though such approaches have not been studied much. We therefore propose speech recognition using multiple acoustic models, with the sound source separation by the ADPF as front-end processing.
A. Acoustic Model

The Japanese automatic speech recognition software "Julian" is used for automatic speech recognition (ASR). For speech data, 150 words such as numbers, colors and fruits spoken by two men (Mr. A and Mr. C) and one woman (Ms. B) are used.

For the acoustic models, the words played by B&W Nautilus 805 loudspeakers are recorded by the pair of SIG microphones. They are installed in a 3 m x 3 m room, and the distance between SIG and a loudspeaker is 1 m. The training datasets are created as follows:
1. The 150 words spoken by the three persons are recorded using the robot microphones. The sound direction is -60°, 0°, or 60°. Two and three simultaneous speeches combining 0° and ±60° are also recorded. All combinations of sound source directions and persons are recorded.
2. The speech from each direction is extracted from the recorded data by the ADPF.
3. The separated speeches are clustered by person and direction to form a training dataset.
As a result, 9 training datasets are obtained as follows:
(a) a dataset of Mr. A from 0°
(b) a dataset of Mr. A from 60°
(c) a dataset of Mr. A from -60°
(d) a dataset of Ms. B from 0°
(e) a dataset of Ms. B from 60°
(f) a dataset of Ms. B from -60°
(g) a dataset of Mr. C from 0°
(h) a dataset of Mr. C from 60°
(i) a dataset of Mr. C from -60°
Nine acoustic models based on Hidden Markov Models (HMMs) are trained on the above training sets. Each HMM is a triphone acoustic model and is trained 10 times using the Hidden Markov Model Toolkit (HTK).
B. Speech Recognition using Multiple Acoustic Models

Fig. 7. Speech Recognition using Multiple Acoustic Models
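The body of this subsection is not preserved in this scan. As a hedged illustration of the integration idea described in the abstract and introduction (each separated speech is recognized with every acoustic model, the results are weighted by each model's known recognition rate and, when available, by the face ID probability, and the maximum-likelihood result is output), a sketch might look like the following. The data layout and the multiplicative weighting are assumptions for illustration only.

```python
def integrate_results(results, face_id_prob=None):
    """Pick the final word from per-acoustic-model recognition results.

    results : list of dicts such as
        {"word": str, "score": float,       # recognizer likelihood / score
         "model_person": str,               # person the acoustic model models
         "recognition_rate": float}         # known accuracy of that model
    face_id_prob : optional dict mapping a person name to the probability
        reported by face recognition, for audio-visual integration.
    The multiplicative weighting below is an illustrative assumption.
    """
    def combined(r):
        weight = r["recognition_rate"]
        if face_id_prob is not None:
            weight *= face_id_prob.get(r["model_person"], 0.0)
        return weight * r["score"]
    best = max(results, key=combined)
    return best["word"]
```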

Citations

Biologically Inspired Robots.

TL;DR: This chapter successively describes bio-inspired morphologies, sensors, and actuators that, beyond mere reflexes, implement cognitive abilities like memory or planning, or adaptive processes like learning, evolution and development.
PatentDOI

Speech recognition system

TL;DR: A speech recognition system prompts a user to provide a first utterance, which is recorded, processed and compared to a second utterance to detect at least one acoustic difference.
Journal ArticleDOI

The parameterless self-organizing map algorithm

TL;DR: The relative performance of the PLSOM and the SOM is discussed, some tasks in which the SOM fails but the PLSOM performs satisfactorily are demonstrated, and a proof of ordering under certain limited conditions is presented.
Posted Content

The Parameter-Less Self-Organizing Map algorithm

TL;DR: The Parameter-Less Self-Organizing Map (PLSOM) as mentioned in this paper is a new neural network algorithm based on the self-organizing map (SOM), which eliminates the need for a learning rate and annealing schemes for learning rate.
PatentDOI

Robotics visual and auditory system

TL;DR: In this article, a robotic visual and auditory system is provided which can accurately localize a target sound source by associating visual and auditory information about the target.
References
Proceedings Article

Three-dimensional computer vision: a geometric viewpoint

TL;DR: Results from constrained optimization, algebraic geometry and differential geometry are presented.
Book

Active vision

Proceedings Article

A Context-Dependent Attention System for a Social Robot

TL;DR: The design of a visual attention system based on a model of human visual search behavior from Wolfe (1994) is presented, which integrates perceptions with habituation effects and influences from the robot's motivational and behavioral state to create a context-dependent attention activation map.
Proceedings Article

Active Audition for Humanoid

TL;DR: The experimental result demonstrates that active audition, by integration of audition, vision and motor control, enables sound source tracking in a variety of conditions.
Proceedings Article

Robust ASR Based On Clean Speech Models: An Evaluation of Missing Data Techniques For Connected Digit Recognition in Noise

TL;DR: Using models trained on clean speech, techniques for classification with missing or unreliable data are applied to the problem of noise robustness in Automatic Speech Recognition, obtaining a 65% relative improvement over the Aurora clean training baseline system.