scispace - formally typeset
Open Access · Proceedings Article · DOI

Robot recognizes three simultaneous speech by active audition

Kazuhiro Nakadai, +2 more
- Vol. 1, pp 398-405
TLDR
In this article, an active direction-pass filter (ADPF) is used to separate sounds originating from the specified direction obtained by the real-time human tracking system, and the separated speech is recognized by speech recognition using multiple acoustic models, which integrates the multiple results to output the result with the maximum likelihood.
Abstract
Robots should listen to and recognize speech with their own ears under noisy environments and simultaneous speech to attain smooth communication with people in the real world. This paper presents three-simultaneous-speech recognition based on active audition, which integrates audition with motion. Our robot audition system consists of three modules: a real-time human tracking system, an active direction-pass filter (ADPF) and a speech recognition system using multiple acoustic models. The real-time human tracking system realizes robust and accurate sound source localization and tracking by audio-visual integration. The performance of localization shows that the resolution in front of the robot is much higher than that in the periphery. We call this phenomenon the "auditory fovea" because it is similar to the visual fovea (high resolution in the center of the human eye). Active motions such as being directed at the sound source improve localization by making the best use of the auditory fovea. The ADPF realizes accurate and fast sound separation by using a pair of microphones. The ADPF separates sounds originating from the specified direction obtained by the real-time human tracking system. Because the performance of separation depends on the accuracy of localization, the extraction of sound from the front direction is more accurate than that of sound from the periphery. This means that the pass range of the ADPF should be narrower in the front direction than in the periphery; in other words, such active pass range control improves sound separation. The separated speech is recognized by speech recognition using multiple acoustic models, which integrates the multiple results to output the result with the maximum likelihood. Active motions such as being directed at a sound source improve speech recognition because they realize not only improved sound extraction but also easier integration of the results using the face ID from face recognition. The robot audition system improved by active audition is implemented on an upper-torso humanoid. The system attains localization, separation and recognition of three simultaneous speeches, and the results prove the efficiency of active audition.



Proceedings of the 2003 IEEE International Conference on Robotics & Automation
Taipei, Taiwan, September 14-19, 2003

Robot Recognizes Three Simultaneous Speech By Active Audition

Kazuhiro Nakadai*, Hiroshi G. Okuno*,†, Hiroaki Kitano*,‡
* Kitano Symbiotic Systems Project, ERATO, Japan Science and Tech. Corp., Tokyo, Japan
† Graduate School of Informatics, Kyoto University, Kyoto, Japan
‡ Sony Computer Science Laboratories, Inc., Tokyo, Japan
nakadai@nakadai.com, okuno@nue.org, kitano@csl.sony.co.jp
I. INTRODUCTION

Robots that interact with humans should separate and recognize various kinds of sounds. This means that robot audition is important for social interaction as well as for triggering events. To realize such robots, four issues should be considered: 1) noise cancellation while in motion, 2) integration of auditory, visual and other sensory information, 3) sound source separation under noisy environments, and 4) speech recognition of each sound source if it is speech. Because most robots address these issues only partially, robust and accurate auditory processing in robots has been difficult so far.
The difficulties in robot audition lie in sound source separation under real-world environments. For example, Kismet of the MIT AI Lab [1] and ROBITA of Waseda University [2] can interact with people by automatic speech recognition and gestures, but they use a microphone attached near the mouth of each speaker to avoid motor noise in motion. Therefore, they do not have a sound source separation function. WA-2 of Waseda University [3] can localize a sound source by using a pair of microphones in the robot, but it does not take motor noise in motion into account. Therefore it adopts the "stop-hear-act" principle; that is, the robot stops to hear. It also assumes a single sound source, so the robot does not have a sound source separation function. SmartHead can localize and track multiple sound sources by using four microphones and stereo cameras [4]. However, it uses only low-level information, and it is difficult to resolve ambiguities that are solved by higher-level information such as face IDs. Since it does not assume sound distortion by a robot's head shape, it is difficult to apply the method to a robot head with sound distortion, such as a human-like head. In addition, the maximum number of sound sources is theoretically limited.
To solve these problems, we proposed active audition, which controls microphone parameters to perceive auditory information better, with cancellation of self motor noise [5]. Active audition has been integrated with face localization, face recognition and stereo vision by using streams, and a real-time multiple human tracking system has been reported [6]. Furthermore, an active direction-pass filter (hereafter, ADPF) that separates sound sources by using the accurate sound directions obtained from the real-time multiple human tracking system has also been reported [7]. The ADPF uses a pair of microphones to separate sound sources. It calculates the interaural phase difference (IPD) and interaural intensity difference (IID) for each sub-band and then determines the sound source direction by performing hypothetical reasoning with a set of IPDs and IIDs. Finally, the ADPF collects the sub-bands whose IPD and IID match those of the specified direction. The performance evaluation of the ADPF reveals that the sensitivity of localization depends on the direction of the sound source.

Fig. 1. The Robot Audition System for Simultaneous Speech Recognition
In other words, the ADPF separates sound streams very precisely when the sound source is just in front of the robot, while it separates them very poorly when the sound source is to the side. Although this phenomenon arises from a pair of microphones, it is quite similar to the fovea, which refers to the difference of resolution in vision, that is, higher resolution in the center of the eye and lower resolution in the periphery. We call the auditory equivalent of the fovea the "auditory fovea".
As an application of the ADPF, we also reported automatic recognition of simultaneous speech by two specific persons [8]. However, the integration method for recognition results from multiple acoustic models was based on a simple majority rule.
In this paper, we propose an integration method based on the recognition rate of each acoustic model. The method also provides audio-visual integration in speech recognition. We present a robot audition system that can localize, separate and recognize three simultaneous speeches by using the real-time multiple human tracking system, the auditory-fovea-based ADPF, and speech recognition using multiple acoustic models.
The rest of this paper is organized as follows: Section 2 describes the robot audition system for simultaneous speech recognition. Sections 3, 4 and 5 describe the real-time multiple human tracking system, the active direction-pass filter, and the speech recognition using multiple acoustic models, respectively. Section 6 evaluates the performance of the robot audition system. The last section provides discussion and conclusion.
II. ROBOT AUDITION SYSTEM FOR SIMULTANEOUS SPEECH RECOGNITION

The architecture of the robot audition system for simultaneous speech recognition is shown in Fig. 1. It consists of three modules: the real-time human tracking system, the active direction-pass filter, and speech recognition using multiple acoustic models. Sounds captured by the robot's microphones and images captured by the robot's cameras are sent to the real-time human tracking system described in a later section. The sound source directions are obtained from the auditory and visual streams generated in the real-time human tracking system. The sound source directions are sent to the ADPF. The ADPF extracts sound sources from those directions by hypothesis matching of the interaural intensity difference (IID) and interaural phase difference (IPD), which are calculated from the input spectra of the left and right channels. The speech recognition module recognizes the extracted speeches by using multiple acoustic models.
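As a rough sketch of this front-end computation, the following fragment derives per-sub-band IPD and IID from one frame of the left and right channel signals. The frame handling, window, FFT size and function name are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def ipd_iid_per_subband(left, right, fs=16000, n_fft=512):
    """Estimate per-sub-band IPD (rad) and IID (dB) for one analysis frame.

    left, right : 1-D arrays holding the same frame from each channel.
    The window and FFT size are illustrative choices, not the paper's values.
    """
    window = np.hanning(len(left))
    spec_l = np.fft.rfft(left * window, n_fft)
    spec_r = np.fft.rfft(right * window, n_fft)

    # Interaural phase difference: phase of the cross-spectrum.
    ipd = np.angle(spec_l * np.conj(spec_r))

    # Interaural intensity difference in dB (epsilon avoids log(0)).
    eps = 1e-12
    iid = 20.0 * np.log10((np.abs(spec_l) + eps) / (np.abs(spec_r) + eps))

    freqs = np.fft.rfftfreq(n_fft, 1.0 / fs)
    return freqs, ipd, iid
```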
We use the upper-torso humanoid SIG as a testbed for this research. SIG has a cover made of FRP (fiber reinforced plastic). It is designed to separate the SIG inner world from the external world acoustically. A pair of CCD cameras (Sony EVI-G20) is used for stereo vision. Two pairs of microphones are used for auditory processing. One pair is located at the left and right ear positions for sound source localization. The other is installed inside the cover, mainly for canceling self-motor noise in motion. SIG has 4 DC motors (4 DOFs) with position and velocity control using potentiometers.
The following sections describe the three modules in detail.
III. REAL-TIME HUMAN TRACKING SYSTEM

The real-time human tracking system extracts accurate sound source directions by integrating audition and vision, and gives them to the ADPF.

Fig. 2. Hierarchical Architecture of the Real-Time Tracking System
The architecture of the real-time human tracking system using SIG, shown in Figure 2, consists of seven modules: Sound, Face, Stereo Vision, Association, Focus-of-Attention, Motor Control and Viewer.
Sound localizes sound sources. Face detects multiple faces by combining skin-color extraction, correlation-based matching and multiple-scale image generation [9]. It identifies each face by Linear Discriminant Analysis (LDA), which creates an optimal subspace to distinguish classes and continuously updates the subspace on demand with a small amount of computation [10]. In addition, the faces are localized in 3-D world coordinates by assuming an average face size. Finally, the 10 best face IDs with their probabilities and locations are sent to Association. Stereo Vision localizes lengthwise objects such as people precisely by using fast disparity map generation [11]. It improves the robustness of the system in tracking a person who looks away and does not talk. Association forms streams and associates them into a higher-level representation, that is, an association stream, according to proximity in location. The directions of the streams are sent to the ADPF with the captured sounds. Focus-of-Attention plans SIG's movement based on the status of the streams. Motor Control is activated by the Focus-of-Attention module and generates PWM (Pulse Width Modulation) signals for the DC motors. Viewer shows the status of the auditory, visual and association streams in radar and scrolling windows. The whole system works in real time with a small latency of 500 ms by distributed processing on 5 PCs connected by a combination of Gigabit and Fast Ethernet.
Stream Formation and Association: Streams are formed in Association by connecting events from Sound, Face and Stereo Vision along a time course. First, since the location information in sound, face and stereo vision events is observed in the SIG coordinate system, the coordinates are converted into world coordinates by referring to a motor event observed at the same time. The converted events are connected to a stream by using a Kalman filter based algorithm described in detail in [12]. The Kalman filter is efficient for reducing the influence of process and measurement noise in localization, especially in auditory processing, which has larger ambiguities. In sound stream formation, a sound stream and an event are connected when they have a harmonic relationship and the difference in azimuth between the stream direction predicted by the Kalman filter and the sound event is less than ±10°. In face and stereo vision stream formation, a face or stereo vision event is connected to a face or stereo vision stream when the distance between the predicted location of the stream and the location of the event is within 40 cm and they have the same event ID. An event ID is a face name or an object ID generated in the face or stereo vision module.
When the system judges that multiple streams originate from the same person, they are associated into an association stream, a higher-level stream representation [6]. When one of the streams forming an association stream is terminated, the terminated stream is removed from the association stream, and the association stream is de-associated into one or more separate streams.
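As a hedged illustration of this event-to-stream connection logic, the sketch below encodes the two stated rules (azimuth within roughly ±10° plus a harmonic relationship for sound events; distance within 40 cm plus a matching event ID for face and stereo vision events). The data structures, field names and helper signatures are assumptions made for the example, not the paper's implementation.

```python
import math
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Stream:
    kind: str                                    # "sound", "face" or "stereo"
    predicted_azimuth: float = 0.0               # Kalman-predicted azimuth (deg)
    predicted_position: tuple = (0.0, 0.0, 0.0)  # Kalman-predicted 3-D location (m)
    event_id: Optional[str] = None               # face name or object ID, if any
    events: list = field(default_factory=list)

def connect_sound_event(streams, azimuth, is_harmonic_with):
    """Attach a sound event to a sound stream when they are harmonically
    related and the azimuth difference is within about +/-10 degrees."""
    for s in streams:
        if (s.kind == "sound" and is_harmonic_with(s)
                and abs(azimuth - s.predicted_azimuth) <= 10.0):
            s.events.append(("sound", azimuth))
            return s
    return None

def connect_visual_event(streams, kind, position, event_id):
    """Attach a face or stereo vision event to a stream of the same kind
    when it lies within 40 cm of the predicted location and carries the
    same event ID (face name or object ID)."""
    for s in streams:
        if (s.kind == kind and s.event_id == event_id
                and math.dist(position, s.predicted_position) <= 0.4):
            s.events.append((kind, position))
            return s
    return None
```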
Control of Tracking: Tracking is controlled by Focus-of-Attention so as to keep facing the direction of the stream with attention, and motor events are sent to Motor Control. By selecting a stream with attention and tracking it, the ADPF can continue to make the best use of foveal processing. The selection of streams, that is, focus-of-attention control, is programmable according to the surrounding situations.

Fig. 3. Distribution of Sound Localization (localization results per input sound source direction, in degrees)

Fig. 4. Extraction of a Single Sound Source (signal-to-noise ratio versus pass range, in degrees)
In this paper, the following precedence of focus-of-attention control is used for the ADPF: an association stream including a sound stream has the highest priority, a sound stream has the second priority, and other visual streams have the third priority.
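A minimal sketch of that precedence rule follows; the stream attributes (kind, has_sound) are illustrative assumptions rather than the system's actual interfaces.

```python
def select_attended_stream(streams):
    """Choose the stream to attend to, following the stated precedence:
    an association stream containing a sound stream first, then a plain
    sound stream, then other (visual) streams."""
    def priority(stream):
        if stream.kind == "association" and stream.has_sound:
            return 0
        if stream.kind == "sound":
            return 1
        return 2  # face / stereo vision streams
    return min(streams, key=priority) if streams else None
```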
IV. ACTIVE DIRECTION-PASS FILTER

The architecture of the ADPF is shown in the dark area of Fig. 1. The ADPF uses two key techniques, auditory epipolar geometry and the auditory fovea. Auditory epipolar geometry is a localization method based on IPD and IID that does not require HRTFs; it is described in detail later in this section. In this paper, the ADPF is implemented so that it can use both HRTFs and auditory epipolar geometry for evaluation. The auditory fovea is used to control the pass range of the ADPF, that is, the pass range is a narrow angle in the front direction and a wider angle in the periphery. The detailed algorithm of the ADPF is as follows:
1. The IPD Δφ' and IID Δρ' in each sub-band are obtained from the difference between the left and right channels.
2. Let θ_s be the azimuth of the stream with current attention in the robot coordinate system of the real-time human tracking system. θ_s is sent to the ADPF through the Gigabit Ethernet network, taking the processing latency into account.
3. The pass range δ(θ_s) of the ADPF is selected according to θ_s. The pass range function δ has its minimum value in the SIG front direction, where sensitivity is maximum, and larger values toward the periphery because of the lower sensitivity there. Let θ_l = θ_s - δ(θ_s) and θ_h = θ_s + δ(θ_s).
4. For a stream direction θ, the IPD Δφ_E(θ) is estimated for each sub-band by auditory epipolar geometry, and the IID Δρ_H(θ) is obtained from HRTFs.
5. A sub-band is collected if its IPD and IID satisfy the following condition:
   f < f_th:  Δφ_E(θ_l) ≤ Δφ' ≤ Δφ_E(θ_h),  and
   f ≥ f_th:  Δρ_H(θ_l) ≤ Δρ' ≤ Δρ_H(θ_h),
   where f_th is the upper bound of the frequencies for which the IPD is effective for localization. It depends on the baseline between the ears; in SIG's case, f_th is 1500 Hz.
6. A wave consisting of the collected sub-bands is constructed.

Fig. 5. Pass Range Function
Note that the direction of an association stream is specified by visual information rather than by auditory information, to obtain a more accurate direction.
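The sketch below illustrates step 5 of the algorithm above: collecting the sub-bands whose observed IPD or IID falls between the values predicted for θ_l and θ_h, using the IPD test below f_th and the IID test at or above it. The helper functions ipd_epipolar and iid_hrtf, and the single-frame resynthesis, are assumptions for illustration.

```python
import numpy as np

def adpf_collect(spec_l, spec_r, freqs, theta_s, delta,
                 ipd_epipolar, iid_hrtf, f_th=1500.0):
    """Collect sub-bands around the attended direction theta_s (deg).

    spec_l, spec_r : complex spectra of the left and right channels.
    delta(theta)   : pass range function (deg), cf. Fig. 5.
    ipd_epipolar(theta, freqs), iid_hrtf(theta, freqs) : assumed helpers
        returning the expected IPD / IID per sub-band for direction theta.
    """
    theta_l = theta_s - delta(theta_s)
    theta_h = theta_s + delta(theta_s)

    # Observed IPD (rad) and IID (dB) per sub-band.
    ipd = np.angle(spec_l * np.conj(spec_r))
    iid = 20.0 * np.log10((np.abs(spec_l) + 1e-12) / (np.abs(spec_r) + 1e-12))

    lo_ipd, hi_ipd = ipd_epipolar(theta_l, freqs), ipd_epipolar(theta_h, freqs)
    lo_iid, hi_iid = iid_hrtf(theta_l, freqs), iid_hrtf(theta_h, freqs)

    # Below f_th apply the IPD test, at or above f_th apply the IID test.
    low = (freqs < f_th) & (np.minimum(lo_ipd, hi_ipd) <= ipd) & (ipd <= np.maximum(lo_ipd, hi_ipd))
    high = (freqs >= f_th) & (np.minimum(lo_iid, hi_iid) <= iid) & (iid <= np.maximum(lo_iid, hi_iid))
    mask = low | high

    # Keep the passed sub-bands (left channel here) and resynthesize one frame.
    return np.fft.irfft(np.where(mask, spec_l, 0.0))
```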
A. Auditory Fovea and Pass Range Control

In the retina, the resolution of images is high in the fovea, which is located at the center of the retina, and much poorer towards the periphery, which serves to capture information from a much larger area. Because the visual fovea gives a good compromise between resolution and field of view without the cost of processing a large amount of data, it is useful for robots [13], [14]. The visual fovea must face the target object to obtain good resolution, so it is a kind of active vision [15]. It is well known that human sound localization is most accurate in the front direction and gets worse toward the periphery [16]. The same is true of a robot with two microphones. Fig. 3 shows the distribution map of sound source localization in the auditory module of the real-time human tracking system.

The x axis shows input sound directions from 0° to 90° at intervals of 10°. 200 trials of localization were performed for each input sound direction. The result of localization is represented as a histogram for each input direction; a darker square means a higher concentration of localization results. Figure 3 shows that localization of sounds from the front direction is concentrated on the correct direction, whereas it is widely distributed in the periphery. Thus, sound localization in a robot is accurate in the front direction. We call this phenomenon the auditory fovea.
Akin to the visual fovea, the auditory fovea also needs to be directed at the target object, such as a speaker. Therefore, it too relies on active motion. Such integration of sound and active motion, termed active audition [5], can be used to attain improved auditory perception. Active motion is essential in audition and vision not only for friendly humanoid-human interaction, but also for better perception.
The accuracy of sound source localization affects the performance of sound source separation by the ADPF. Because the accuracy of sound source localization depends on the sound direction, the pass range of the ADPF should be controlled according to the sound direction. Figure 4 shows results of single sound source extraction by the ADPF. The x and y axes are the pass range of the ADPF and the signal-to-noise ratio, respectively. When the signal-to-noise ratio is 0 dB, the sound source is regarded as completely extracted. Each line in Fig. 4 corresponds to a different speaker direction, varied from 0° to 90° in 10° steps.

For sound from the front direction of the robot, a pass range of ±10° is necessary to extract the sound properly. However, for sound at 90° from the front direction of the robot, a pass range of at least ±35° is necessary. For a single sound source, a wider pass range gives a higher signal-to-noise ratio of the extracted sound. However, background noise and other sound sources must be considered in a real environment, so a narrower pass range is better in the sense of noise cancellation. In Fig. 4, we select the narrowest pass range that extracts a sound source properly, and define the pass range function shown in Fig. 5.
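Only two points of the pass range function are stated in the text (about ±10° for a frontal source and at least ±35° at 90°); the sketch below simply interpolates between them as a stand-in for Fig. 5, which is an assumption rather than the paper's actual function.

```python
import numpy as np

def pass_range_deg(theta_deg):
    """Pass range delta(theta), in degrees, for a source at azimuth theta.

    Only the endpoints are stated in the text (about +/-10 deg for a frontal
    source, at least +/-35 deg at 90 deg); the linear interpolation between
    them is an illustrative assumption, not the curve of Fig. 5.
    """
    theta = min(abs(theta_deg), 90.0)
    return float(np.interp(theta, [0.0, 90.0], [10.0, 35.0]))
```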
B. Auditory Epipolar Geometry

Auditory epipolar geometry was proposed to extract directional information of sound sources without using HRTFs [17]. Epipolar geometry is the fundamental geometric relationship between two perspective cameras in stereo vision research [18]. Auditory epipolar geometry is an extension of the epipolar geometry in vision (hereafter, visual epipolar geometry) to audition. Since auditory epipolar geometry extracts directional information by using this geometric relation, it can dispense with HRTFs. When the distance between a sound source and the robot is more than 50 cm, the influence of the distance can be ignored [12]. Then, when the influence of the head shape is considered, auditory epipolar geometry defines the estimated IPD Δφ corresponding to the sound direction θ in terms of f, v and r (a reconstruction of this relation is sketched below), where f, v, r and θ are the frequency of the sound, the velocity of sound, the radius of the robot head and the sound direction, respectively.
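The equation itself did not survive in this scan. Given the listed variables, a plausible reconstruction based on the common spherical-head approximation for the interaural path difference is shown below; treat it as an assumption rather than the paper's verbatim formula.

```latex
% Assumed spherical-head (Woodworth-style) approximation of the IPD
\Delta\varphi(\theta) \;=\; \frac{2\pi f}{v}\, r\,\bigl(\theta + \sin\theta\bigr)
```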
V. SPEECH RECOGNITION FOR SEPARATED SOUND

Robust speech recognition against noise is one of the hottest topics in the speech community. Some approaches such as multi-condition training and missing data [19], [20] show some efficiency for speech recognition in noise. However, these methods are of less use when the signal-to-noise ratio is as low as 0 dB. In this case, speech enhancement by front-end processing is necessary. This kind of speech enhancement is efficient for speech recognition at higher signal-to-noise ratios, though such approaches have not been studied much. We therefore propose speech recognition using multiple acoustic models, with the sound source separation by the ADPF as front-end processing.
A. Acoustic Model

The Japanese automatic speech recognition software "Julian" is used for automatic speech recognition (ASR). For speech data, 150 words such as numbers, colors and fruits spoken by two men (Mr. A and Mr. C) and one woman (Ms. B) are used.

For the acoustic models, the words played by B&W Nautilus 805 loudspeakers are recorded by the pair of SIG microphones. They are installed in a 3 m x 3 m room, and the distance between SIG and a loudspeaker is 1 m. The training datasets are created as follows:
1. The 150 words spoken by the three persons are recorded using the robot microphones. The sound direction is -60°, 0°, or 60°. Two and three simultaneous speeches combining 0° and ±60° are also recorded. All combinations of sound source directions and persons are recorded.
2. The speech from each direction is extracted from the recorded data by the ADPF.
3. The separated speeches are clustered by person and direction to form a training dataset.
As a result, 9 training datasets are obtained as follows:
(a) a dataset of Mr. A from 0°
(b) a dataset of Mr. A from 60°
(c) a dataset of Mr. A from -60°
(d) a dataset of Ms. B from 0°
(e) a dataset of Ms. B from 60°
(f) a dataset of Ms. B from -60°
(g) a dataset of Mr. C from 0°
(h) a dataset of Mr. C from 60°
(i) a dataset of Mr. C from -60°
Nine acoustic models based on Hidden Markov Models (HMMs) are trained on the above training sets. Each HMM is a triphone acoustic model and is trained 10 times using the Hidden Markov Model Toolkit (HTK).
B. Speech Recognition using Multiple Acoustic Models

Fig. 7. Speech Recognition using Multiple Acoustic Models
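The body of this subsection is not preserved in this scan. As a hedged illustration of the integration idea described in the abstract and introduction (each separated speech is recognized with every acoustic model, the results are weighted by each model's known recognition rate and, when available, by the face ID probability, and the maximum-likelihood result is output), a sketch might look like the following. The data layout and the multiplicative weighting are assumptions for illustration only.

```python
def integrate_results(results, face_id_prob=None):
    """Pick the final word from per-acoustic-model recognition results.

    results : list of dicts such as
        {"word": str, "score": float,       # recognizer likelihood / score
         "model_person": str,               # person the acoustic model models
         "recognition_rate": float}         # known accuracy of that model
    face_id_prob : optional dict mapping a person name to the probability
        reported by face recognition, for audio-visual integration.
    The multiplicative weighting below is an illustrative assumption.
    """
    def combined(r):
        weight = r["recognition_rate"]
        if face_id_prob is not None:
            weight *= face_id_prob.get(r["model_person"], 0.0)
        return weight * r["score"]
    best = max(results, key=combined)
    return best["word"]
```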

Citations

Biologically Inspired Robots.

TL;DR: This chapter successively describes bio-inspired morphologies, sensors, and actuators that, beyond mere reflexes, implement cognitive abilities like memory or planning, or adaptive processes like learning, evolution and development.
PatentDOI

Speech recognition system

TL;DR: A speech recognition system prompts a user to provide a first utterance, which is recorded, processed and compared to a second utterance to detect at least one acoustic difference.
Journal ArticleDOI

The parameterless self-organizing map algorithm

TL;DR: The relative performance of the PLSOM and the SOM is discussed, some tasks in which the SOM fails but the PLSOM performs satisfactorily are demonstrated, and a proof of ordering under certain limited conditions is presented.
Posted Content

The Parameter-Less Self-Organizing Map algorithm

TL;DR: The Parameter-Less Self-Organizing Map (PLSOM) as mentioned in this paper is a new neural network algorithm based on the self-organizing map (SOM), which eliminates the need for a learning rate and annealing schemes for learning rate.
PatentDOI

Robotics visual and auditory system

TL;DR: In this article, a robotic visual and auditory system is provided which can accurately localize a target sound source by associating visual and auditory information about the target.
References
Proceedings Article

Three-dimensional computer vision: a geometric viewpoint

TL;DR: Results from constrained optimization, algebraic geometry and differential geometry are presented.
Book

Active vision

Proceedings Article

A Context-Dependent Attention System for a Social Robot

TL;DR: The design of a visual attention system based on a model of human visual search behavior from Wolfe (1994) is presented, which integrates perceptions with habituation effects and influences from the robot's motivational and behavioral state to create a context-dependent attention activation map.
Proceedings Article

Active Audition for Humanoid

TL;DR: The experimental result demonstrates that active audition, by integration of audition, vision and motor control, enables sound source tracking in a variety of conditions.
Proceedings Article

Robust ASR Based On Clean Speech Models: An Evaluation of Missing Data Techniques For Connected Digit Recognition in Noise

TL;DR: Using models trained on clean speech, techniques for classification with missing or unreliable data are applied to the problem of noise robustness in Automatic Speech Recognition, obtaining a 65% relative improvement over the Aurora clean training baseline system.