Multi-modal person verification system based on face profiles and speech

Conrad Sanderson and Kuldip K. Paliwal
Vol. 2, pp. 947-950

MULTI-MODAL PERSON VERIFICATION SYSTEM
BASED ON FACE PROFILES AND SPEECH
Conrad Sanderson and Kuldip K. Paliwal
School of Microelectronic Engineering
Griffith University
Brisbane, QLD 4111, Australia
e-mail: C.Sanderson@me.gu.edu.au, K.Paliwal@me.gu.edu.au
ABSTRACT
This paper describes a person verification system based on facial profile views and features extracted from speech. The system is comprised of two non-homogeneous classifiers whose outputs are fused after a normalization step. Experiments are reported which show that integration of the face profile and speech information results in superior performance to that of its subsystems. Additionally, the performance of the combined system in noisy conditions is shown to be more robust than the speech-based subsystem alone.
1. INTRODUCTION
A person verification system attempts to verify the claimed identity of an individual. This can be useful in situations where security considerations preclude obtaining access by simpler means such as a key. Many person verification systems described in the literature rely on features derived from speech [1]. However, these systems can easily fail in the presence of background noise. In this paper a multi-modal person verification system is presented which relies on the shape of the profile of a person's head as well as the speech uttered by that person. The system is made up of a Profile Verification System (PVS), a Speaker Verification System (SVS) and a Fusing and Classification Module (FCM). The voice and visual cues are combined by the FCM, allowing the resulting system to achieve superior performance, as shown in the experimental section, to either of its subsystems alone. The performance and robustness of the SVS and the combined system are compared in noisy conditions, to simulate real-life conditions.

The paper is organized as follows: Section 2 describes the system architecture, Section 3 shows the setup for experiments, and Section 4 presents the results.
2. SYSTEM ARCHITECTURE
As stated before, the system is made up of 3 modules:

Speaker Verification System
Profile Verification System
Fusing and Classification Module

The SVS used is based on the Gaussian Mixture Model (GMM) approach [1]. The speech signal, sampled at 16 kHz and quantized over 16 bits, is analyzed every 10 msec using a 20 msec Hamming window. For each window (also referred to as a frame), the energy is measured, and if it is above a set threshold (corresponding to voiced sounds), 12th order cepstral parameters are derived from Linear Prediction Coding (LPC) parameters [2]. Each set of extracted parameters can be treated as a 12-dimensional vector. During the training phase of the system, a 12-dimensional, 4-mixture GMM is computed for each speaker using parameters extracted from the speech signal.
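The framing and energy-gating step above can be sketched as follows. This is a minimal illustration, not the authors' code: the energy threshold value is an assumption (the paper does not give one), and the subsequent LPC-cepstrum computation is omitted.

```python
import numpy as np

def voiced_frames(signal, fs=16000, frame_ms=20, shift_ms=10, energy_thresh=1e-4):
    """Slice the signal into Hamming-windowed frames and keep the high-energy ones."""
    flen = int(fs * frame_ms / 1000)    # 20 ms -> 320 samples at 16 kHz
    shift = int(fs * shift_ms / 1000)   # 10 ms -> 160 samples at 16 kHz
    window = np.hamming(flen)
    frames = []
    for start in range(0, len(signal) - flen + 1, shift):
        frame = signal[start:start + flen] * window
        if np.mean(frame ** 2) > energy_thresh:   # crude voiced/unvoiced gate
            frames.append(frame)
    return np.array(frames)
```

Each retained frame would then be converted to a 12-dimensional cepstral vector before GMM training.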
For testing of the SVS, the same process of feature extraction is performed. Using the GMM belonging to the person whose identity is being claimed, a similarity measure is computed by averaging the log-likelihood of individual frames. If the average log-likelihood is above a certain threshold, then the identity of the speaker is verified.
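The GMM scoring rule just described can be sketched as below: per-frame log-likelihoods under a diagonal-covariance mixture are averaged and compared to a threshold. The diagonal-covariance assumption and the parameter layout are illustrative; the paper only specifies a 4-mixture, 12-dimensional GMM.

```python
import numpy as np

def gmm_log_likelihoods(frames, weights, means, variances):
    """Per-frame log-likelihood under a diagonal-covariance GMM.
    frames: (T, D); weights: (M,); means, variances: (M, D)."""
    diff = frames[:, None, :] - means[None, :, :]                       # (T, M, D)
    log_gauss = -0.5 * (np.sum(np.log(2 * np.pi * variances), axis=1)   # (M,)
                        + np.sum(diff ** 2 / variances, axis=2))        # (T, M)
    log_weighted = np.log(weights) + log_gauss                          # (T, M)
    m = log_weighted.max(axis=1, keepdims=True)                         # log-sum-exp
    return (m + np.log(np.sum(np.exp(log_weighted - m), axis=1, keepdims=True))).ravel()

def verify_speaker(frames, weights, means, variances, threshold):
    """Accept the identity claim when the average log-likelihood exceeds the threshold."""
    return gmm_log_likelihoods(frames, weights, means, variances).mean() > threshold
```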
The PVS used is very similar to the one described in [3]. Given a head shot of a person who is facing sideways (see Figure 1), the head is extracted from the background, and then the profile is extracted from the head. The profile is refined by searching for the nose, and then, depending on the hair style and the amount of facial hair present, an unoccluded portion of the profile is used. Using this refined profile, a distance map [4] (see Figure 2) is calculated and stored with the profile.

For testing of the PVS, the profile is extracted as previously. To compare one profile against another, it is necessary to account for possible tilt, translation and scale of the profile. Initially the profile is superimposed over the distance map belonging to the profile of the person whose identity is being claimed, with the noses aligned and scales roughly adjusted. The distance is computed by summing up all distance values found where the profile's pixels are present within the distance map. The downhill simplex algorithm [5] is employed to minimize this distance by automatically adjusting the parameters of an affine transform of the profile, i.e. scale, translation and rotation (within preset limits). The residual distance between the compensated profile and the distance map can be used to decide whether the profile belongs to the person whose identity is being claimed. If the distance is below a certain threshold, the person is deemed to be verified. The process of comparing profiles is referred to as matching.

Figure 1: Example of a profile shot (mu 1) extracted from the M2VTS database (left), and head segmentation (right).

Figure 2: Profile extracted from Figure 1 (left), its distance map (center), and the profile superimposed on the distance map (right).
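The distance-map matching step can be sketched as follows. This is a brute-force illustration: the chamfer transform of [4] approximates this Euclidean distance map much more cheaply, and the paper additionally minimizes the resulting distance over scale, translation and rotation with the downhill simplex algorithm, which is not shown here.

```python
import numpy as np

def distance_map(profile_points, shape):
    """Brute-force distance map: each pixel holds the Euclidean distance to the
    nearest point of the reference profile (points given as (y, x) pairs)."""
    ys, xs = np.mgrid[0:shape[0], 0:shape[1]]
    pts = np.asarray(profile_points, dtype=float)                  # (N, 2)
    d = np.sqrt((ys[:, :, None] - pts[:, 0]) ** 2 +
                (xs[:, :, None] - pts[:, 1]) ** 2)                 # (H, W, N)
    return d.min(axis=2)

def matching_distance(test_points, ref_map):
    """Sum of distance-map values under the test profile's pixels; lower is a better match."""
    return float(sum(ref_map[y, x] for y, x in test_points))
```

An identical profile scores zero; any misalignment raises the sum, which is what the simplex search drives down.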
The FCM uses raw scores from the subsystems rather than relying on them for classification; this method is often referred to as soft fusion. The FCM's first job is to reverse the sign of the value coming from the SVS in order to make it compatible with the PVS. To prevent the PVS from dominating, the value coming from it is limited to a preset maximum. The FCM then normalizes the values from each of the subsystems by making them zero mean and unity variance, and then placing them in the [0,1] interval. The mean and variance values used during this process must be estimated by first running the subsystems on training data and analyzing their probability density functions (PDFs).
Finally, the normalized values can be combined:

    f = w * p_n + (1 - w) * s_n

where w is a weight factor between 0 and 1, p_n is the normalized distance value from the PVS, and s_n is the normalized negative log-likelihood value from the SVS. If f is below a predefined threshold, then the person requesting access is accepted.
3. EVALUATION OF PERFORMANCE
3.1. Multi-modal Database
The M2VTS database [6] has been used for evaluating the combined system. It is comprised of 37 people counting from zero to nine (mostly in French) and facing the camera. The database is made up of 5 sections, each with video sequences for each person. From section to section, the video sequences often differ in hair styles, clothes, lighting conditions and zoom factors. For each video sequence, a synchronized speech signal sampled at 44 kHz with 16 bit resolution is available. There are additional video sequences where each person rotates their head from one side to the other. If the person is wearing glasses, another head moving sequence is available without them.

Profile shots were obtained by manually finding the frames in head rotating sequences where the person is facing left and not wearing glasses. Each frame has a resolution of 350x286 pixels. Figure 1 presents an example frame.
3.2. Experiment Setup
For each person, speech files and video sequences from the first four sections are used for experiments. Sections 1 to 3 are used for training, while section 4 is used for testing. Profiles extracted from the first three sections are used to select the best representative profile during the training session. The database allows for 37 correct verification trials and 37 × 36 = 1332 impostor trials.
3.3. Training Setup
For the SVS, the speech files are downsampled to 16 kHz at 16 bit resolution. The training session is the same as described in Section 2.

There are three matching operations for the training of the PVS. For each person, the profile from section 1 (P1) is matched with P3, P2 with P1, and P3 with P2. The profile that appears in the 2 best matchings is selected as the reference profile.
Figures 3 and 4 show the PDFs of the SVS and PVS scores. In order to fuse these scores in the FCM, we need the mean (μ) and standard deviation (σ) values of these PDFs. These are estimated with the following procedure: both of the subsystems are trained and tested on the training sections of the database. Outliers must first be removed, since they reduce the reliability of the estimation of μ and σ. For the SVS, an adequate method of outlier removal is to find the median (m) and the deviation from the median, σ_m (the same as the standard deviation, except that the median is substituted for the mean). Any value outside the interval m ± 2σ_m is ignored. For the PVS, ignoring values greater than a predefined maximum proved to be a sufficient method for removing outliers.

Figure 3: PDF of the PVS score (probability vs. distance).

Figure 4: PDF of the SVS score (probability vs. log-likelihood).
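The two outlier-removal rules can be sketched as below. The factor of 2 on σ_m is a reading of the garbled original and should be treated as an assumption; the PVS maximum is a tunable constant the paper does not specify.

```python
import numpy as np

def remove_outliers_svs(scores, k=2.0):
    """Keep values inside m +/- k*sigma_m, with m the median and sigma_m the
    deviation from the median (std. deviation with the median in place of the mean)."""
    scores = np.asarray(scores, dtype=float)
    m = np.median(scores)
    sigma_m = np.sqrt(np.mean((scores - m) ** 2))
    return scores[np.abs(scores - m) <= k * sigma_m]

def remove_outliers_pvs(scores, max_value):
    """For the PVS, simply ignore values above a predefined maximum."""
    scores = np.asarray(scores, dtype=float)
    return scores[scores <= max_value]
```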
After outlier removal, the values from the SVS are changed in polarity in order to make them compatible with the PVS, as this is required by the FCM. The μ and σ for the PDF of the SVS were set to the median and the deviation from the median, respectively, as this was found to improve the performance of the system.
4. RESULTS
Four experiments were performed. For a given decision threshold, False Acceptance (FA) and False Rejection (FR) rates were calculated. For each experiment, a Receiver Operating Characteristics (ROC) curve was generated by varying the decision threshold continuously. Figure 5 shows the ROC curve with w = 1.
A good way to evaluate the performance of a verification system is by computing the equal error rate (EER), where FA = FR, the success rate (SR), where 1 - FA - FR reaches a maximum, and the FR for an FA of 1%.

Figure 5: ROC curve of the PVS subsystem, i.e. w = 1 (FA % vs. FR %).

Figure 6: Success Rate of the SVS compared to the combined system (w = 0.33) with decreasing SNR.

Figure 7: Success Rate of the SVS compared to the combined system (w = 0.5) with decreasing SNR.
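The EER and SR definitions above can be computed from genuine and impostor score lists by sweeping the threshold, roughly as follows (lower fused score = accept, matching the FCM's decision rule; the exact sweep used by the authors is not stated):

```python
import numpy as np

def evaluate(genuine, impostor):
    """Sweep the decision threshold over all observed scores and return
    (EER, best SR = max(100 - FA - FR)), both in percent."""
    genuine, impostor = np.asarray(genuine), np.asarray(impostor)
    thresholds = np.sort(np.concatenate([genuine, impostor]))
    fa = np.array([np.mean(impostor <= t) for t in thresholds]) * 100  # false acceptance
    fr = np.array([np.mean(genuine > t) for t in thresholds]) * 100    # false rejection
    i = np.argmin(np.abs(fa - fr))            # operating point closest to FA = FR
    return (fa[i] + fr[i]) / 2, np.max(100 - fa - fr)
```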
In the first experiment, w was varied from 0 to 1. The results are shown in Table 1. For w = 0, only the SVS was used, while for w = 1 only the PVS was used; hence it can be seen that the SVS has better performance than the PVS. For w = 0.33, the combined system outperforms both of the subsystems.
In the second experiment, with w = 0, the speech was progressively corrupted by lowering the Signal to Noise Ratio (SNR) from 40 dB to 5 dB. The results are shown in Table 2 and Figure 6. The third experiment is a repeat of the second experiment, but with w = 0.33; the results are shown in Table 3 and Figure 6. The fourth experiment is also a repeat of the second experiment, this time with w = 0.5; results are shown in Table 4 and Figure 7.
As can be seen, when w = 0.33, the combined system outperforms the SVS for all SNRs. For w = 0.5, the SVS initially outperforms the combined system; however, its performance drops rapidly with decreasing SNR. This is in contrast to the combined system, whose performance curve has a much more graceful dropoff. The SR at 10 dB and lower of the combined system with w = 0.5 is better than with w = 0.33; hence there is a trade-off between lower performance at high SNRs and more robust performance at low SNRs.
  w      SR      FR @ FA=1%    EER
 1.0    84.08      29.73       8.11
 0.66   88.74      19.92       8.15
 0.5    90.47      16.22       5.41
 0.33   95.50       8.11       2.70
 0.0    92.49      16.22       5.52

Table 1: Performance of the combined system, for varying weight factors.
 SNR (dB)    SR      FR @ FA=1%    EER
    40      92.04      18.92       5.40
    35      91.37      21.62       5.37
    30      89.87      21.62       5.52
    25      88.06      37.84       8.15
    20      75.00      64.87      13.55
    15      43.32      91.89      29.69
    10      19.82     100.00      45.38
     5      11.64     100.00      50.75

Table 2: Performance of the SVS, quoted in %, with decreasing SNR (see also Figure 6).
 SNR (dB)    SR      FR @ FA=1%    EER
    40      95.57       8.11       2.74
    35      95.57       8.11       2.74
    30      94.44       8.11       2.78
    25      92.57      13.51       5.41
    20      90.32      16.22       5.44
    15      84.91      32.43       8.63
    10      79.20      67.57      13.51
     5      72.82      75.68      16.22

Table 3: Performance of the combined system with w = 0.33, quoted in %, with decreasing SNR (see also Figure 6).

 SNR (dB)    SR      FR @ FA=1%    EER
    40      90.31      16.22       5.40
    35      90.24      16.22       5.44
    30      90.09      16.22       5.37
    25      89.94      18.92       5.71
    20      88.81      18.92       8.11
    15      85.89      24.32       8.15
    10      82.81      43.24      10.81
     5      78.75      59.46      10.81

Table 4: Performance of the combined system with w = 0.5, quoted in %, with decreasing SNR (see also Figure 7).

5. CONCLUSION

The results presented support the use of multi-modal person verification systems based on profile views and speech. It was demonstrated that the combined system outperforms a speaker verification system, and is much more robust in noisy conditions.
6. REFERENCES
[1] Douglas A. Reynolds, "Speaker identification and verification using Gaussian mixture speaker models", Speech Communication, Vol. 17, 1995, pages 91-108.
[2] K. K. Paliwal, "Speech processing techniques", Advances in Speech, Hearing and Language Processing, Vol. 1, 1990, pages 1-78.
[3] Stephane Pigeon and Luc Vandendorpe, "Profile Authentication Using a Chamfer Matching Algorithm", Audio- and Video-based Biometric Person Authentication - proceedings of AVBPA'97, Crans-Montana, Switzerland, March 12-14, Josef Bigun et al. (eds), Springer, 1997, pages 185-192.
[4] Gunilla Borgefors, "Hierarchical Chamfer Matching: A Parametric Edge Matching Algorithm", IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 10, No. 6, Nov. 1988, pages 849-865.
[5] William H. Press et al., Numerical Recipes in C, 2nd ed., Cambridge University Press, Cambridge, 1992, pages 408-412.
[6] M2VTS Database: http://www.tele.ucl.ac.be/M2VTS/