AGE AND GENDER RECOGNITION FOR TELEPHONE APPLICATIONS BASED ON GMM
SUPERVECTORS AND SUPPORT VECTOR MACHINES
Tobias Bocklet¹, Andreas Maier¹, Josef G. Bauer², Felix Burkhardt³, Elmar Nöth¹
¹ Institute of Pattern Recognition, University of Erlangen-Nuremberg, Germany
² Siemens AG, CT IC5, München, Germany
³ T-Systems Enterprise Service GmbH, SSC ENPS, Berlin, Germany
tobias.bocklet@informatik.stud.uni-erlangen.de, noeth@informatik.uni-erlangen.de
ABSTRACT
This paper compares two approaches to automatic age and gender classification with 7 classes. The first approach uses Gaussian Mixture Models (GMMs) with Universal Background Models (UBMs), which are well known from the task of speaker identification/verification. Training is performed by the EM algorithm or by MAP adaptation, respectively. For the second approach, a GMM is trained for each speaker of the training and test sets. The means of each model are extracted and concatenated, which results in a GMM supervector for each speaker. These supervectors are then used in a support vector machine (SVM). Three different kernels were employed for the SVM approach: a polynomial kernel (with different polynomial orders), an RBF kernel, and a linear GMM distance kernel based on the KL divergence. With the SVM approach we improved the recognition rate to 74% (p < 0.001), which is in the same range as human performance.
Index Terms: Acoustic signal analysis, speaker classification, age, gender, Gaussian mixture models (GMM), support vector machine (SVM)
1. INTRODUCTION
The human voice not only conveys the semantics of spoken words; it also contains speaker-dependent characteristics. Examples of such non-verbal information are the identity, the gender, the emotional state, and the age of a speaker. In everyday telephone calls we extract these speaker-specific characteristics and adapt our speaking style to the person we are talking to. Apart from gender, information about speaker characteristics is rarely used in automatic speech recognition (ASR). There are some approaches to identify dialogues with angry or unsatisfied users/callers [1], but only a few approaches use the age of speakers in ASR systems [2, 3], although many useful applications are associated with this task. The age information (combined with the gender) can be used to adapt the ASR system to a certain customer. Other examples are the adaptation of the waiting-queue music, the offer of age-dependent advertisements to callers in the waiting queue, or changing the speaking habits of the text-to-speech module of the ASR system. Statistical information on the age distribution of a caller group might also be an application.
In 2007, T-Systems, Siemens AG, Deutsches Forschungszentrum für Künstliche Intelligenz and Sympalog Voice Solutions compared four different age recognition systems on two corpora [4]. The most successful systems used Mel Frequency Cepstral Coefficients (MFCCs) and either performed multiple phoneme recognition or modeled the different age classes with Gaussian Mixture Models (GMMs).
In this paper we also use MFCCs as features and compare two different approaches: on the one hand a GMM-UBM (Universal Background Model) system, which has been shown to be very effective for the task of speaker identification [5, 6]; on the other hand Support Vector Machines (SVMs) with GMM supervectors to identify the speaker's age. The latter approach was also published in the context of speaker identification/verification [7].
This article is organized as follows: Section 2 describes the evaluation corpora on which the two systems are trained and tested. The baseline GMM-UBM system is described in Section 3. The basic framework for SVMs and the kernel functions used are summarized in Section 4. Section 4.2 presents the idea of GMM supervectors and describes the SVM-based classification system. In Section 5 we show the results of the SVM-based approach and compare them to the baseline GMM system developed in our group and to the 128-density GMM and parallel phone recognition (PPR) systems of [4]. We also compare our results to the human baseline experiment mentioned in [4]. The paper finishes with a conclusion and a short outlook in Section 6.
2. CORPORA
The data was taken from the German SpeechDat II corpus, which is annotated with gender and age labels as given by the callers at the time of recording. The scenario of the corpus is telephone speech: the speakers called an automatic recording system and read a set of words, sentences and digits. The data used was an age-balanced subset of the 4000 native German speakers. The training and test sets are identical to [4]. For each class about 80 speakers were used for training, with 44 utterances per speaker.
In order to simulate a mismatched condition between training and test data, we also evaluated the system on a 23-speaker subset of the VoiceClass corpus. This dataset was collected by Deutsche Telekom and consists of 660 native German speakers. These speakers also called an automatic recording system and talked about their favorite dish. For each speaker between 5 and 30 seconds of speech data was available. The age structure is not balanced, i.e., children and young speakers are represented far more strongly than senior speakers. Gender and age labels were also available for each speaker of this corpus.
The labels of the training and test sets were used to build up the
1-4244-1484-9/08/$25.00 ©2008 IEEE, ICASSP 2008

following 7 gender-dependent age classes:
- Children (C): ≤ 13 years
- Young male (YM) and female (YF) speakers: 14-19 years
- Adult male (AM) and female (AF) speakers: 20-64 years
- Senior male (SM) and female (SF) speakers: ≥ 65 years
3. BASELINE GMM SYSTEM DESCRIPTION
For the baseline system we use a GMM-UBM system. Each of the 7 classes is modeled by a Gaussian Mixture Model (GMM), composed of M unimodal Gaussian densities:

p(x | μ, Σ) = \sum_{i=1}^{M} ω_i p_i(x | μ_i, Σ_i),    (1)
where ω_i denotes the weight, Σ_i the covariance matrix and μ_i the mean vector of the i-th Gaussian density. We varied the number of mixtures M from 16 to 512 in powers of two. For classification a standard Gaussian mixture classifier is used. For each feature vector of a specific test speaker, the classifier calculates the allocation probability for each GMM age model. This is done for each frame of one utterance. The probabilities for each age model are then accumulated, and the model which achieves the highest value is taken as the correct one.
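The frame-wise scoring described above can be sketched as follows. This is a minimal pure-Python illustration with diagonal covariances; the 1-D toy models and all parameter values are invented for the example and are not taken from the paper:

```python
import math

def gmm_log_likelihood(frame, weights, means, variances):
    # log p(x) = logsumexp_i [ log w_i + sum_d log N(x_d | mu_id, var_id) ]
    comps = []
    for w, mu, var in zip(weights, means, variances):
        ll = math.log(w)
        for x, m, v in zip(frame, mu, var):
            ll += -0.5 * (math.log(2 * math.pi * v) + (x - m) ** 2 / v)
        comps.append(ll)
    mx = max(comps)
    return mx + math.log(sum(math.exp(c - mx) for c in comps))

def classify(frames, class_models):
    # Accumulate per-frame log-likelihoods for each class model; argmax wins.
    scores = {}
    for name, (w, mu, var) in class_models.items():
        scores[name] = sum(gmm_log_likelihood(f, w, mu, var) for f in frames)
    return max(scores, key=scores.get)

# Toy 1-D, 2-component models (hypothetical parameters)
models = {
    "YF": ([0.5, 0.5], [[-1.0], [1.0]], [[0.5], [0.5]]),
    "AM": ([0.5, 0.5], [[4.0], [6.0]], [[0.5], [0.5]]),
}
print(classify([[4.2], [5.1], [5.8]], models))  # frames near 5 → AM
```

In practice the log-domain accumulation shown here is preferred over multiplying probabilities, since an utterance has hundreds of frames and the product would underflow.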
3.1. Feature Extraction
As features the commonly used Mel Frequency Cepstrum Coefficients (MFCCs) are used. A filter bank with 18 Mel bands is applied to time windows of 16 ms with a shift of 10 ms. The static features are the log energy and MFCC(1)-(11); together with the first-order derivatives, computed by a regression line over 5 consecutive frames, this yields a feature vector with 24 components.
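The regression-line derivatives can be sketched with the standard delta formula; this is an illustrative implementation, not the authors' code, and the boundary handling (clamping to the first/last frame) is an assumption:

```python
def deltas(features, K=2):
    # First-order derivatives via a least-squares regression line over
    # 2*K + 1 consecutive frames (K = 2 -> 5 frames, as in the paper):
    # d_t = sum_k k * (c_{t+k} - c_{t-k}) / (2 * sum_k k^2)
    denom = 2 * sum(k * k for k in range(1, K + 1))
    T = len(features)
    out = []
    for t in range(T):
        d = [0.0] * len(features[0])
        for k in range(1, K + 1):
            prev = features[max(t - k, 0)]      # clamp at the edges (assumption)
            nxt = features[min(t + k, T - 1)]
            for i in range(len(d)):
                d[i] += k * (nxt[i] - prev[i]) / denom
        out.append(d)
    return out

# A 1-D feature rising by 1 per frame has slope 1 in the interior
feats = [[float(t)] for t in range(10)]
print(deltas(feats)[5])  # [1.0]
```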
3.2. Training
The training process is shown in Figure 1. After extraction of the MFCCs, a Universal Background Model (UBM) is created from all the available training data using the Expectation-Maximization (EM) algorithm [8]. The UBM then serves as the initial model either for a standard EM training with the age-dependent training data or for Maximum A Posteriori (MAP) adaptation [9]. Both algorithms take the UBM as an initial model and create one GMM for each age class. MAP adaptation calculates the age-dependent Gaussian mixture components in a single iteration step and combines them with the UBM parameters. The number of iterations in the EM training was set to 10.
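The single-iteration MAP step for the means can be sketched as follows. The relevance factor r = 16 is an assumption (the paper does not state its value), and the toy 1-D UBM is invented for the example:

```python
import math

def map_adapt_means(ubm_weights, ubm_means, ubm_vars, frames, r=16.0):
    # Single-iteration MAP adaptation of the UBM mean vectors only.
    # Diagonal covariances; relevance factor r is a hypothetical choice.
    M, D = len(ubm_means), len(ubm_means[0])
    n = [0.0] * M                       # soft occupation counts
    ex = [[0.0] * D for _ in range(M)]  # first-order statistics
    for x in frames:
        logs = []
        for i in range(M):  # posterior of each component given the frame
            ll = math.log(ubm_weights[i])
            for d in range(D):
                v = ubm_vars[i][d]
                ll += -0.5 * (math.log(2 * math.pi * v)
                              + (x[d] - ubm_means[i][d]) ** 2 / v)
            logs.append(ll)
        mx = max(logs)
        post = [math.exp(l - mx) for l in logs]
        s = sum(post)
        for i in range(M):
            g = post[i] / s
            n[i] += g
            for d in range(D):
                ex[i][d] += g * x[d]
    adapted = []
    for i in range(M):
        a = n[i] / (n[i] + r)  # data-vs-prior interpolation weight
        mean_i = [ex[i][d] / n[i] if n[i] > 0 else ubm_means[i][d]
                  for d in range(D)]
        adapted.append([a * mean_i[d] + (1 - a) * ubm_means[i][d]
                        for d in range(D)])
    return adapted

# Toy 1-D UBM with components at 0 and 5; frames at 6 pull the second mean up
ubm_w, ubm_m, ubm_v = [0.5, 0.5], [[0.0], [5.0]], [[1.0], [1.0]]
new_means = map_adapt_means(ubm_w, ubm_m, ubm_v, [[6.0]] * 50)
print(new_means[1][0])  # pulled from 5.0 toward 6.0
```

Components with little adapting data keep their UBM means (a ≈ 0), which is what makes MAP-adapted models comparable across speakers.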
4. SUPPORT VECTOR MACHINES
4.1. SVM Classification
The Support Vector Machine (SVM) [10] performs a binary classification y ∈ {−1, +1} based on hyperplane separation. The separator is chosen in order to maximize the distances between the hyperplane and the closest training vectors, which are called support vectors. By the use of kernel functions K(x_i, x_j), which satisfy the Mercer condition, the SVM can be extended to non-linear boundaries:

f(x) = \sum_{i=1}^{L} λ_i y_i K(x, x_i) + d    (2)
[Fig. 1. Training of the GMM baseline system: the EM algorithm builds the universal background model (UBM) from the training set; from the UBM, a further EM training yields the speaker model, and MAP adaptation yields the adapted model.]
where y_i are the target values and x_i are the support vectors. The λ_i have to be determined in the training process, L denotes the number of support vectors, and d is a (learned) constant. The task of this paper is a 7-class age identification, so the binary SVM has to be extended. The simplest way is to train one classifier for each pair of classes; therefore N × (N − 1)/2 classifiers are created, each of them separating two classes.
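The pairwise (one-vs-one) combination scheme can be sketched with majority voting; the nearest-prototype "classifier" below is a toy stand-in for the trained SVMs, with invented class prototypes:

```python
from itertools import combinations
from collections import Counter

def one_vs_one_predict(x, classes, pairwise_clf):
    # N*(N-1)/2 binary classifiers; each votes for one of its two classes,
    # and the class with the most votes wins (a common multi-class scheme).
    votes = Counter()
    for a, b in combinations(classes, 2):
        votes[pairwise_clf(a, b, x)] += 1
    return votes.most_common(1)[0][0]

# Toy stand-in for trained SVMs: each class is a 1-D prototype, and the
# "classifier" picks whichever prototype is closer (hypothetical values).
protos = {"C": 0.0, "YF": 2.0, "AF": 4.0, "SF": 6.0}
clf = lambda a, b, x: a if abs(x - protos[a]) < abs(x - protos[b]) else b
print(one_vs_one_predict(3.9, list(protos), clf))  # → AF
```

With 7 classes this yields 7 × 6 / 2 = 21 binary classifiers.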
4.2. GMM Supervector Classification
A GMM supervector is created by concatenating the M 24-dimensional mean vectors of a speaker model (Eq. 1). The supervectors are built for every speaker, and a label for one of the seven classes is assigned to each vector. In the baseline system we derive a GMM from the UBM for each age class. For the supervector classification approach we use the same UBM and adapt a GMM for every speaker of the training and test sets by EM training or MAP adaptation. We examined several variants of adaptation: full covariance matrices, diagonal covariance matrices, and adapting only the mean values. The GMM supervector can be regarded as a mapping from the utterance of a speaker (in our case the MFCCs) to a high-dimensional feature vector. The supervectors are then used as support vectors and are taken as input for the SVM training.
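The mapping from a speaker's adapted model to a fixed-length vector is just a concatenation, sketched here with toy dimensions (M = 3 components of dimension D = 2 instead of the paper's M × 24):

```python
def gmm_supervector(mean_vectors):
    # Concatenate the M mean vectors of a speaker's adapted GMM into a
    # single M*D-dimensional supervector.
    sv = []
    for mu in mean_vectors:
        sv.extend(mu)
    return sv

# Hypothetical adapted means of a 3-component, 2-D speaker model
means = [[0.1, 0.2], [1.0, 1.1], [2.0, 2.2]]
print(gmm_supervector(means))  # [0.1, 0.2, 1.0, 1.1, 2.0, 2.2]
```

Because every speaker's GMM is adapted from the same UBM, component i always describes the same region of feature space, which is what makes the concatenated vectors comparable across speakers.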
4.3. Employed Kernels
In this paper we applied three different kernel types: the polynomial kernel (Eq. 3), the radial basis function (RBF) kernel (Eq. 4) and a GMM-based distance kernel (Eq. 6), which is derived from the KL divergence. This kernel is also very similar to the Mahalanobis distance.

K(x_i, x_j) = (x_i^T x_j + 1)^n    (3)

K(x_i, x_j) = exp[ −(1/2) (‖x_i − x_j‖ / ψ)^2 ]    (4)

n in Eq. (3) defines the polynomial order and ψ in Eq. (4) denotes the width of the radial basis function. These kernels are commonly used for SVM-based classification.
For Gaussian densities (created with mean-adapted MAP) an adequate kernel exists [7]. It is an approximation of the KL divergence

Densities   EM-f (prec/rec)   EM-d (prec/rec)
 32         35% / 35%         19% / 25%
 64         46% / 43%         18% / 24%
128         41% / 42%         43% / 32%
256         37% / 42%         43% / 34%
512         44% / 45%         48% / 40%

Densities   MAP-f (prec/rec)  MAP-d (prec/rec)  MAP-dM (prec/rec)
 32         29% / 26%         29% / 26%         44% / 38%
 64         43% / 41%         30% / 28%         33% / 30%
128         45% / 40%         40% / 36%         49% / 41%
256         45% / 40%         39% / 37%         44% / 39%
512         44% / 41%         43% / 42%         46% / 43%

Table 1. Precision and recall on the SpeechDat II corpus with different training algorithms (EM-f: EM with full covariance matrices; EM-d: EM with diagonal covariance matrices; MAP-f: MAP with full covariance matrices; MAP-d: MAP with diagonal covariance matrices; MAP-dM: MAP with diagonal covariance matrices, only means adapted).
[11], which can be rewritten in closed form as

K(μ^a, μ^b) = \sum_{i=1}^{N} ω_i (μ_i^a)^T Σ_i^{-1} μ_i^b    (5)
            = \sum_{i=1}^{N} ( \sqrt{ω_i} Σ_i^{-1/2} μ_i^a )^T ( \sqrt{ω_i} Σ_i^{-1/2} μ_i^b ).    (6)
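The factored form of Eq. (6) amounts to scaling each mean by the component weight and inverse standard deviation before a plain dot product; a minimal sketch with diagonal covariances and invented toy values:

```python
import math

def kl_linear_kernel(sv_a, sv_b, weights, inv_std):
    # Linear GMM distance kernel (Eq. 6): sum over components i of
    # (sqrt(w_i) * Sigma_i^{-1/2} mu_i^a) . (sqrt(w_i) * Sigma_i^{-1/2} mu_i^b)
    # Diagonal covariances assumed: inv_std[i][d] = 1 / sigma_{i,d}.
    M, D = len(weights), len(inv_std[0])
    k = 0.0
    for i in range(M):
        for d in range(D):
            a = math.sqrt(weights[i]) * inv_std[i][d] * sv_a[i * D + d]
            b = math.sqrt(weights[i]) * inv_std[i][d] * sv_b[i * D + d]
            k += a * b
    return k

# Toy 2-component, 1-D supervectors (hypothetical values)
w, istd = [0.5, 0.5], [[1.0], [2.0]]
print(kl_linear_kernel([1.0, 2.0], [1.0, 2.0], w, istd))  # 0.5*1 + 0.5*4*4 = 8.5
```

Since the kernel is linear in the scaled supervectors, the scaling can be applied once per speaker and any off-the-shelf linear SVM can then be used directly.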
5. EXPERIMENTAL RESULTS
In this work we performed age recognition experiments on two dif-
ferent corpora: the SpeechDat II corpus and the VoiceClass corpus
provided by Deutsche Telekom.
First we performed preliminary experiments (Section 5.1) in order
to determine the best parameters for the GMM-UBM system. A sec-
ond set of preliminary experiments selected the SVM kernel with the
best performance. Section 5.2 compares the recognition results of
the GMM-UBM system and the supervector-based SVM approach
of our lab with the best results achieved in [4].
5.1. Preliminary Experiments
5.1.1. GMM-UBM system
We examined the influence of the number of Gaussian densities, the training algorithm (EM, MAP) and the form of the covariance matrices (full or diagonal) on the recognition results. In the case of MAP adaptation we adapted either all GMM components (ω, μ, Σ) or only the means. The results are shown in Table 1. For our baseline systems the best results were achieved by a MAP-trained GMM with 128 Gaussian densities in which only the mean vectors of the model were adapted.
5.1.2. SVM system
Table 2 summarizes the overall precision and recall of the supervector-based SVM system with the different kernels described in Section 4.3. It can be seen that the adjustment of the kernel parameters is very important (especially for the RBF kernel). The best
Kernel      full (prec/rec)   dia (prec/rec)    diaMean (prec/rec)

EM training, 64 densities
poly e=1    63% / 61%         49% / 47%         –
poly e=3    62% / 60%         49% / 48%         –
RBF 0.01    29% / 50%         23% / 38%         –
RBF 0.1     65% / 41%         25% / 38%         –
KL-based    41% / 43%         47% / 48%         –

EM training, 512 densities
poly e=1    –                 64% / 61%         –
poly e=3    –                 66% / 64%         –
RBF 0.01    –                 9% / 15%          –
RBF 0.1     –                 26% / 43%         –
KL-based    –                 53% / 52%         –

MAP adaptation, 64 densities
poly e=1    66% / 65%         59% / 56%         58% / 55%
poly e=3    66% / 66%         59% / 55%         56% / 53%
RBF 0.01    44% / 49%         25% / 42%         21% / 36%
RBF 0.1     53% / 51%         56% / 46%         52% / 45%
KL-based    47% / 48%         58% / 57%         57% / 57%

MAP adaptation, 512 densities
poly e=1    77% / 74%         66% / 63%         66% / 64%
poly e=3    75% / 74%         67% / 63%         68% / 66%
RBF 0.01    21% / 24%         26% / 19%         26% / 19%
RBF 0.1     59% / 57%         61% / 56%         66% / 60%
KL-based    57% / 60%         55% / 53%         56% / 54%

Table 2. Precision and recall on the SpeechDat II corpus with different kernels and training (full: full covariance matrices; dia: diagonal covariance matrices; diaMean: diagonal covariance matrices with only the mean vectors adapted; "–": not evaluated).
results were achieved with MAP adaptation. The results reached with full covariance matrices and 64 Gaussian densities are comparable to those with diagonal covariances and 512 Gaussian densities. With 512 Gaussian densities, MAP adaptation, full covariance matrices and a linear kernel we achieved a recall of 74% and a precision of 77%.
5.2. SVM vs GMM system
Table 3 shows the ev aluation results on the two different corpora.
For the SpeechDat II corpus, the accuracy can be improv ed
compared to our GMM-UBM system by the supervector-based
SVM system by 57% from 49% to 77%. The recall of this approach
was 74%, and the recall of the best GMM-UBM system was 41%.
This is a relative improvement of 80% (significant with p<0.001).
Compared to the PPR system of [4] the precision of our SVM
system is 43% higher and the recall 35% respectively. This is
significant with p<0.001. The confusion matrices of the two
systems on the SpeechDat II corpus are tabulated in Table 4 and
Table 5. The confusions of the SVM-system (Table 5) are more
balanced and way more intuitive than those of the GMM-UBM
system Table 4.
If we compare the performance of the human listeners to the SVM
approach, both the recall and the precision of the SVM approach are
higher. The differences in precision between human and machine
are significant with p<0.001. The differences in recall are

System      SpeechDat II (prec/rec)   VoiceClass (prec/rec)
GMM ([4])   42% / 46%                 64% / 65%
PPR ([4])   54% / 55%                 60% / 58%
GMM-UBM     49% / 41%                 65% / 63%
SVM         77% / 74%                 61% / 60%
HUMAN       55% / 69%                 –

Table 3. Overall precision and recall for the best two systems of [4] (GMM and parallel phone recognizer [PPR]) and of our two systems, tested on the two different corpora; the last row shows the performance of human listeners.
ac\cl:  C, YF, AF, SF, YM, AM, SM (cells left empty in the original layout are zero)
C:  83  8  8
YF: 55 20 15  5  5
AF: 10 30 35  5 20
SF: 25  4  8 33  8 21
YM:  5  5 10 30  5 45
AM: 16  5  5 47 26
SM: 28  6  6 17 44

Table 4. Relative confusion matrix of the best GMM-UBM system (see text) on the SpeechDat II corpus; the columns contain the actual age (ac) and the rows contain the classified age (cl) (overall precision 49%).
not significant (p > 0.1). Note that the F-measure [12] of the SVM system leads to higher values than the F-measure calculated on the results of the human listeners (with weights of 0.5, 1 and 2).
To compare the robustness of the two approaches against data from different domains and channels, we used the already trained GMMs (and SVMs, respectively) and tested on the VoiceClass database. The robustness of both of our systems seems to be good; the differences between the 4 approaches are negligible.
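The weighted F-measure comparison against the human listeners can be reproduced from the Table 3 values with the standard F_β definition:

```python
def f_measure(precision, recall, beta=1.0):
    # F_beta = (1 + beta^2) * P * R / (beta^2 * P + R);
    # beta < 1 favours precision, beta > 1 favours recall.
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)

# SVM vs. human listeners on SpeechDat II (precision/recall from Table 3)
for beta in (0.5, 1.0, 2.0):
    svm = f_measure(0.77, 0.74, beta)
    human = f_measure(0.55, 0.69, beta)
    print(f"beta={beta}: SVM {svm:.3f} vs human {human:.3f}")
```

For all three weights the SVM value exceeds the human one, consistent with the statement above.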
6. CONCLUSION
We applied the GMM supervector-based SVM approach to the field
of automatic age recognition in combination with gender recogni-
tion. We compared this approach to the GMM-UBM approach,
ac\cl:  C, YF, AF, SF, YM, AM, SM (cells left empty in the original layout are zero)
C:  66 33
YF:  5 75 20
AF: 75 25
SF:  4 20 75
YM: 85 15
AM: 15 78  5
SM:  5  5 27 61

Table 5. Relative confusion matrix of the best GMM supervector-based SVM system (see text) on the SpeechDat II corpus; the columns contain the actual age (ac) and the rows contain the classified age (cl) (overall precision 77%).
which is state-of-the-art for the task of text-independent speaker identification, and to the PPR system of [4]. We investigated only spectral features. The SVM system outperformed all of these approaches on the same-domain corpus. Compared to the best system of [4] (PPR), we improved the precision by 43% and the recall by 35% (significance: p < 0.001).
7. REFERENCES
[1] R. Cowie, E. Douglas-Cowie, N. Tsapatsoulis, G. Votsis, S. Kollias, W. Fellenz, and J. Taylor, "Emotion Recognition in Human-Computer Interaction," IEEE Signal Processing Magazine, vol. 18, no. 1, pp. 32-80, 2001.
[2] N. Minematsu, K. Yamauchi, and K. Hirose, "Automatic estimation of perceptual age using speaker modeling techniques," in Proceedings Interspeech 2003, Geneva, Switzerland, 2003, pp. 3005-3008.
[3] C. Müller, F. Wittig, and J. Baus, "Exploiting Speech for Recognizing Elderly Users to Respond to their Special Needs," in Proceedings Interspeech 2003, Geneva, Switzerland, 2003, pp. 1305-1308.
[4] F. Metze, J. Ajmera, R. Englert, U. Bub, F. Burkhardt, J. Stegmann, C. Müller, R. Huber, B. Andrassy, J. G. Bauer, and B. Littel, "Comparison of Four Approaches to Age and Gender Recognition for Telephone Applications," in ICASSP 2007 Proceedings, IEEE International Conference on Acoustics, Speech and Signal Processing, Honolulu, Hawai'i, USA, 2007, vol. 4, pp. 1089-1092.
[5] D. A. Reynolds, T. F. Quatieri, and R. B. Dunn, "Speaker Verification using Adapted Gaussian Mixture Models," Digital Signal Processing, pp. 19-41, 2000.
[6] D. A. Reynolds, "An Overview of Automatic Speaker Recognition Technology," in ICASSP 2002 Proceedings, IEEE International Conference on Acoustics, Speech, and Signal Processing, Orlando, Florida, USA, 2002, vol. 4, pp. 4072-4075.
[7] W. M. Campbell, D. E. Sturim, and D. A. Reynolds, "Support Vector Machines Using GMM Supervectors for Speaker Verification," IEEE Signal Processing Letters, vol. 13, pp. 308-311, 2006.
[8] A. Dempster, N. Laird, and D. Rubin, "Maximum Likelihood from Incomplete Data via the EM Algorithm," Journal of the Royal Statistical Society, Series B (Methodological), vol. 39, no. 1, pp. 1-38, 1977.
[9] J.-L. Gauvain and C.-H. Lee, "Maximum A-Posteriori Estimation for Multivariate Gaussian Mixture Observations of Markov Chains," IEEE Transactions on Speech and Audio Processing, vol. 2, pp. 291-298, 1994.
[10] C. J. C. Burges, "A Tutorial on Support Vector Machines for Pattern Recognition," Data Mining and Knowledge Discovery, vol. 2, no. 2, pp. 121-167, 1998.
[11] R. Dehak, N. Dehak, P. Kenny, and P. Dumouchel, "Linear and Non Linear Kernel GMM Support Vector Machines for Speaker Verification," in Proceedings Interspeech 2007, Antwerp, Belgium, 2007.
[12] C. J. van Rijsbergen, Information Retrieval, Butterworths, London, 2nd edition, 1979.