Multi-modal person verification system based on face profiles and speech

Conrad Sanderson and Kuldip K. Paliwal
Vol. 2, pp. 947-950

MULTI-MODAL PERSON VERIFICATION SYSTEM
BASED ON FACE PROFILES AND SPEECH
Conrad Sanderson and Kuldip K. Paliwal
School of Microelectronic Engineering
Griffith University
Brisbane, QLD 4111, Australia
e-mail: C.Sanderson@me.gu.edu.au, K.Paliwal@me.gu.edu.au
ABSTRACT
This paper describes a person verification system based on facial profile views and features extracted from speech. The system is comprised of two non-homogeneous classifiers whose outputs are fused after a normalization step. Experiments are reported which show that integration of the face profile and speech information results in superior performance to that of its subsystems. Additionally, the performance of the combined system in noisy conditions is shown to be more robust than the speech-based subsystem alone.
1. INTRODUCTION
A person verification system attempts to verify the claimed identity of an individual. This can be useful in situations where security considerations preclude obtaining access by simpler means such as a key. Many person verification systems described in the literature rely on features derived from speech [1]. However, these systems can easily fail in the presence of background noise. In this paper a multi-modal person verification system is presented which relies on the shape of the profile of a person's head as well as the speech uttered by that person. The system is made up of a Profile Verification System (PVS), a Speaker Verification System (SVS) and a Fusing and Classification Module (FCM). The voice and visual cues are combined by the FCM, allowing the resulting system to achieve superior performance, as shown in the experimental section, to either of its subsystems alone. The performance and robustness of the SVS and the combined system are compared in noisy conditions, to simulate real-life conditions.

The paper is organized as follows: Section 2 describes the system architecture, Section 3 shows the setup for experiments, and Section 4 presents the results.
2. SYSTEM ARCHITECTURE
As stated before, the system is made up of 3 modules:

Speaker Verification System
Profile Verification System
Fusing and Classification Module

The SVS used is based on the Gaussian Mixture Model (GMM) approach [1]. The speech signal, sampled at 16 kHz and quantized over 16 bits, is analyzed every 10 msec using a 20 msec Hamming window. For each window (also referred to as a frame), the energy is measured, and if it is above a set threshold (corresponding to voiced sounds), 12th order cepstral parameters are derived from Linear Prediction Coding (LPC) parameters [2]. Each set of extracted parameters can be treated as a 12-dimensional vector. During the training phase of the system, a 12-dimensional, 4-mixture GMM is computed for each speaker using parameters extracted from the speech signal.
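The framing and energy-gating step above can be sketched as follows. This is a minimal illustration, not the authors' code: the energy threshold value is an assumption (the paper does not give one), and the subsequent LPC-cepstrum computation is omitted.

```python
import numpy as np

def voiced_frames(signal, fs=16000, frame_ms=20, shift_ms=10, energy_thresh=1e-4):
    """Slice the signal into Hamming-windowed frames and keep the high-energy ones."""
    flen = int(fs * frame_ms / 1000)    # 20 ms -> 320 samples at 16 kHz
    shift = int(fs * shift_ms / 1000)   # 10 ms -> 160 samples at 16 kHz
    window = np.hamming(flen)
    frames = []
    for start in range(0, len(signal) - flen + 1, shift):
        frame = signal[start:start + flen] * window
        if np.mean(frame ** 2) > energy_thresh:   # crude voiced/unvoiced gate
            frames.append(frame)
    return np.array(frames)
```

Each retained frame would then be converted to a 12-dimensional cepstral vector before GMM training.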
For testing of the SVS, the same process of feature extraction is performed. Using the GMM belonging to the person whose identity is being claimed, a similarity measure is computed by averaging the log-likelihood of individual frames. If the average log-likelihood is above a certain threshold, then the identity of the speaker is verified.
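The GMM scoring rule just described can be sketched as below: per-frame log-likelihoods under a diagonal-covariance mixture are averaged and compared to a threshold. The diagonal-covariance assumption and the parameter layout are illustrative; the paper only specifies a 4-mixture, 12-dimensional GMM.

```python
import numpy as np

def gmm_log_likelihoods(frames, weights, means, variances):
    """Per-frame log-likelihood under a diagonal-covariance GMM.
    frames: (T, D); weights: (M,); means, variances: (M, D)."""
    diff = frames[:, None, :] - means[None, :, :]                       # (T, M, D)
    log_gauss = -0.5 * (np.sum(np.log(2 * np.pi * variances), axis=1)   # (M,)
                        + np.sum(diff ** 2 / variances, axis=2))        # (T, M)
    log_weighted = np.log(weights) + log_gauss                          # (T, M)
    m = log_weighted.max(axis=1, keepdims=True)                         # log-sum-exp
    return (m + np.log(np.sum(np.exp(log_weighted - m), axis=1, keepdims=True))).ravel()

def verify_speaker(frames, weights, means, variances, threshold):
    """Accept the identity claim when the average log-likelihood exceeds the threshold."""
    return gmm_log_likelihoods(frames, weights, means, variances).mean() > threshold
```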
The PVS used is very similar to the one described in [3]. Given a head shot of a person who is facing sideways (see Figure 1), the head is extracted from the background, and then the profile is extracted from the head. The profile is refined by searching for the nose, and then, depending on the hair style and the amount of facial hair present, an unoccluded portion of the profile is used. Using this refined profile, a distance map [4] (see Figure 2) is calculated and stored with the profile.

For testing of the PVS, the profile is extracted as previously. To compare one profile against another, it is necessary to account for possible tilt, translation and scale of the profile. Initially the profile is superimposed over the distance map belonging to the profile of the person whose identity is being claimed, with the noses aligned and scales roughly adjusted. The distance is computed by summing up all distance values found where the profile's pixels are present within the distance map. The downhill simplex algorithm [5] is employed to minimize this distance by automatically adjusting the parameters of an affine transform of the profile, i.e. scale, translation and rotation (within preset limits). The residual distance between the compensated profile and the distance map can be used to decide whether the profile belongs to the person whose identity is being claimed. If the distance is below a certain threshold, the person is deemed to be verified. The process of comparing profiles is referred to as matching.

Figure 1: Example of a profile shot (mu 1) extracted from the M2VTS database (left), and head segmentation (right).

Figure 2: Profile extracted from Figure 1 (left), its distance map (center), and the profile superimposed on the distance map (right).
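The distance-map matching step can be sketched as follows. This is a brute-force illustration: the chamfer transform of [4] approximates this Euclidean distance map much more cheaply, and the paper additionally minimizes the resulting distance over scale, translation and rotation with the downhill simplex algorithm, which is not shown here.

```python
import numpy as np

def distance_map(profile_points, shape):
    """Brute-force distance map: each pixel holds the Euclidean distance to the
    nearest point of the reference profile (points given as (y, x) pairs)."""
    ys, xs = np.mgrid[0:shape[0], 0:shape[1]]
    pts = np.asarray(profile_points, dtype=float)                  # (N, 2)
    d = np.sqrt((ys[:, :, None] - pts[:, 0]) ** 2 +
                (xs[:, :, None] - pts[:, 1]) ** 2)                 # (H, W, N)
    return d.min(axis=2)

def matching_distance(test_points, ref_map):
    """Sum of distance-map values under the test profile's pixels; lower is a better match."""
    return float(sum(ref_map[y, x] for y, x in test_points))
```

An identical profile scores zero; any misalignment raises the sum, which is what the simplex search drives down.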
The FCM uses raw scores from the subsystems rather than relying on them for classification; this method is often referred to as soft fusion. The FCM's first job is to reverse the sign of the value coming from the SVS in order to make it compatible with the PVS. To prevent the PVS from dominating, the value coming from it is limited to a preset maximum. The FCM then normalizes the values from each of the subsystems by making them zero mean and unity variance, and then placing them in the [0,1] interval. The mean and variance values used during this process must be estimated by first running the subsystems on training data and analyzing their probability density functions (PDFs).
Finally, the normalized values can be combined:

    f = w * p_n + (1 - w) * s_n

where w is a weight factor between 0 and 1, p_n is the normalized distance value from the PVS, and s_n is the normalized negative log-likelihood value from the SVS. If f is below a predefined threshold, then the person requesting access is accepted.
3. EVALUATION OF PERFORMANCE
3.1. Multi-modal Database
The M2VTS database [6] has been used for evaluating the combined system. It is comprised of 37 people counting from zero to nine (mostly in French) and facing the camera. The database is made up of 5 sections, each with video sequences for each person. From section to section, the video sequences often differ in hair styles, clothes, lighting conditions and zoom factors. For each video sequence, a synchronized speech signal sampled at 44 kHz with 16 bit resolution is available. There are additional video sequences where each person rotates their head from one side to the other. If the person is wearing glasses, another head moving sequence is available without them.

Profile shots were obtained by manually finding the frames in head rotating sequences where the person is facing left and not wearing glasses. Each frame has a resolution of 350x286 pixels. Figure 1 presents an example frame.
3.2. Experiment Setup
For each person, speech files and video sequences from the first four sections are used for experiments. Sections 1 to 3 are used for training, while section 4 is used for testing. Profiles extracted from the first three sections are used to select the best representative profile during the training session. The database allows for 37 correct verification trials and 37 × 36 = 1332 impostor trials.
3.3. Training Setup
For the SVS, the speech files are downsampled to 16 kHz at 16 bit resolution. The training session is the same as described in Section 2.

There are three matching operations for the training of the PVS. For each person, the profile from section 1 (P1) is matched with P3, P2 with P1, and P3 with P2. The profile that appears in the 2 best matchings is selected as the reference profile.
Figures 3 and 4 show the PDFs of the SVS and PVS scores. In order to fuse these scores in the FCM, we need the mean (μ) and standard deviation (σ) values of these PDFs. These are estimated with the following procedure: both of the subsystems are trained and tested on the training sections of the database. Outliers must first be removed, since they reduce the reliability of the estimation of μ and σ. For the SVS, an adequate method of outlier removal is to find the median (m) and the deviation from the median, σ_m (the same as the standard deviation, except that the median is substituted for the mean). Any value outside the interval m ± 2σ_m is ignored. For the PVS, ignoring values greater than a predefined maximum proved to be a sufficient method for removing outliers.

Figure 3: PDF of the PVS score (probability vs. distance).

Figure 4: PDF of the SVS score (probability vs. log-likelihood).
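The two outlier-removal rules can be sketched as below. The factor of 2 on σ_m is a reading of the garbled original and should be treated as an assumption; the PVS maximum is a tunable constant the paper does not specify.

```python
import numpy as np

def remove_outliers_svs(scores, k=2.0):
    """Keep values inside m +/- k*sigma_m, with m the median and sigma_m the
    deviation from the median (std. deviation with the median in place of the mean)."""
    scores = np.asarray(scores, dtype=float)
    m = np.median(scores)
    sigma_m = np.sqrt(np.mean((scores - m) ** 2))
    return scores[np.abs(scores - m) <= k * sigma_m]

def remove_outliers_pvs(scores, max_value):
    """For the PVS, simply ignore values above a predefined maximum."""
    scores = np.asarray(scores, dtype=float)
    return scores[scores <= max_value]
```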
After outlier removal, the values from the SVS are changed in polarity in order to make them compatible with the PVS, as this is required by the FCM. The μ and σ for the PDF of the SVS were set to the median and the deviation from the median, respectively, as this was found to improve the performance of the system.
4. RESULTS
Four experiments were performed. For a given decision threshold, False Acceptance (FA) and False Rejection (FR) rates were calculated. For each experiment, a Receiver Operating Characteristics (ROC) curve was generated by varying the decision threshold continuously. Figure 5 shows the ROC curve with w = 1.
A good way to evaluate the performance of a verification system is by computing the equal error rate (EER), where FA = FR, the success rate (SR), where 1 - FA - FR reaches a maximum, and the FR for an FA of 1%.

Figure 5: ROC curve of the PVS subsystem, i.e. w = 1 (FA % vs. FR %).

Figure 6: Success Rate of the SVS compared to the combined system (w = 0.33) with decreasing SNR.

Figure 7: Success Rate of the SVS compared to the combined system (w = 0.5) with decreasing SNR.
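The EER and SR definitions above can be computed from genuine and impostor score lists by sweeping the threshold, roughly as follows (lower fused score = accept, matching the FCM's decision rule; the exact sweep used by the authors is not stated):

```python
import numpy as np

def evaluate(genuine, impostor):
    """Sweep the decision threshold over all observed scores and return
    (EER, best SR = max(100 - FA - FR)), both in percent."""
    genuine, impostor = np.asarray(genuine), np.asarray(impostor)
    thresholds = np.sort(np.concatenate([genuine, impostor]))
    fa = np.array([np.mean(impostor <= t) for t in thresholds]) * 100  # false acceptance
    fr = np.array([np.mean(genuine > t) for t in thresholds]) * 100    # false rejection
    i = np.argmin(np.abs(fa - fr))            # operating point closest to FA = FR
    return (fa[i] + fr[i]) / 2, np.max(100 - fa - fr)
```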
In the first experiment, w was varied from 0 to 1. The results are shown in Table 1. For w = 0, only the SVS was used, while for w = 1 only the PVS was used; hence it can be seen that the SVS has better performance than the PVS. For w = 0.33, the combined system outperforms both of the subsystems.
In the second experiment, with w = 0, the speech was progressively corrupted by lowering the Signal to Noise Ratio (SNR) from 40 dB to 5 dB. The results are shown in Table 2 and Figure 6. The third experiment is a repeat of the second experiment, but with w = 0.33; the results are shown in Table 3 and Figure 6. The fourth experiment is also a repeat of the second experiment, this time with w = 0.5; results are shown in Table 4 and Figure 7.
As can be seen, when w = 0.33, the combined system outperforms the SVS for all SNRs. For w = 0.5, the SVS initially outperforms the combined system; however, its performance drops rapidly with decreasing SNR. This is in contrast to the combined system, whose performance curve has a much more graceful dropoff. The SR at 10 dB and lower of the combined system with w = 0.5 is better than with w = 0.33; hence there is a trade-off between lower performance at high SNRs and more robust performance at low SNRs.
  w      SR      FR @ FA=1%    EER
 1.0    84.08      29.73       8.11
 0.66   88.74      19.92       8.15
 0.5    90.47      16.22       5.41
 0.33   95.50       8.11       2.70
 0.0    92.49      16.22       5.52

Table 1: Performance of the combined system, for varying weight factors.
 SNR (dB)    SR      FR @ FA=1%    EER
    40      92.04      18.92       5.40
    35      91.37      21.62       5.37
    30      89.87      21.62       5.52
    25      88.06      37.84       8.15
    20      75.00      64.87      13.55
    15      43.32      91.89      29.69
    10      19.82     100.00      45.38
     5      11.64     100.00      50.75

Table 2: Performance of the SVS, quoted in %, with decreasing SNR (see also Figure 6).
 SNR (dB)    SR      FR @ FA=1%    EER
    40      95.57       8.11       2.74
    35      95.57       8.11       2.74
    30      94.44       8.11       2.78
    25      92.57      13.51       5.41
    20      90.32      16.22       5.44
    15      84.91      32.43       8.63
    10      79.20      67.57      13.51
     5      72.82      75.68      16.22

Table 3: Performance of the combined system with w = 0.33, quoted in %, with decreasing SNR (see also Figure 6).

 SNR (dB)    SR      FR @ FA=1%    EER
    40      90.31      16.22       5.40
    35      90.24      16.22       5.44
    30      90.09      16.22       5.37
    25      89.94      18.92       5.71
    20      88.81      18.92       8.11
    15      85.89      24.32       8.15
    10      82.81      43.24      10.81
     5      78.75      59.46      10.81

Table 4: Performance of the combined system with w = 0.5, quoted in %, with decreasing SNR (see also Figure 7).

5. CONCLUSION

The results presented support the use of multi-modal person verification systems based on profile views and speech. It was demonstrated that the combined system outperforms a speaker verification system, and is much more robust in noisy conditions.
6. REFERENCES
[1] Douglas A. Reynolds, "Speaker identification and verification using Gaussian mixture speaker models", Speech Communication, Vol. 17, 1995, pages 91-108.
[2] K. K. Paliwal, "Speech processing techniques", Advances in Speech, Hearing and Language Processing, Vol. 1, 1990, pages 1-78.
[3] Stephane Pigeon and Luc Vandendorpe, "Profile Authentication Using a Chamfer Matching Algorithm", Audio- and Video-based Biometric Person Authentication - proceedings of AVBPA'97, Crans-Montana, Switzerland, March 12-14, Josef Bigun et al. (eds), Springer, 1997, pages 185-192.
[4] Gunilla Borgefors, "Hierarchical Chamfer Matching: A Parametric Edge Matching Algorithm", IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 10, No. 6, Nov. 1988, pages 849-865.
[5] William H. Press et al., Numerical Recipes in C, 2nd ed., Cambridge University Press, Cambridge, 1992, pages 408-412.
[6] M2VTS Database: http://www.tele.ucl.ac.be/M2VTS/