
Age and gender recognition for telephone applications based on GMM supervectors and support vector machines

Proceedings article, 12 May 2008, pp. 1605-1608
TL;DR: This paper compares two approaches to automatic age and gender classification with 7 classes: Gaussian mixture models with universal background models, well known from speaker identification/verification, and support vector machines operating on GMM supervectors, which raise the recognition rate to 74%.
Abstract: This paper compares two approaches to automatic age and gender classification with 7 classes. The first approach uses Gaussian mixture models (GMMs) with universal background models (UBMs), well known from the task of speaker identification/verification; training is performed by the EM algorithm or MAP adaptation, respectively. For the second approach, a GMM model is trained for each speaker of the test and training sets. The means of each model are extracted and concatenated, which results in a GMM supervector for each speaker. These supervectors are then used in a support vector machine (SVM). Three different kernels were employed for the SVM approach: a polynomial kernel (with different polynomial orders), an RBF kernel and a linear GMM distance kernel based on the KL divergence. With the SVM approach we improved the recognition rate to 74% (p < 0.001) and are in the same range as humans.


AGE AND GENDER RECOGNITION FOR TELEPHONE APPLICATIONS BASED ON GMM SUPERVECTORS AND SUPPORT VECTOR MACHINES

Tobias Bocklet¹, Andreas Maier¹, Josef G. Bauer², Felix Burkhardt³, Elmar Nöth¹

¹ Institute of Pattern Recognition, University of Erlangen-Nuremberg, Germany
² Siemens AG, CT IC5, München, Germany
³ T-Systems Enterprise Service GmbH, SSC ENPS, Berlin, Germany

tobias.bocklet@informatik.stud.uni-erlangen.de, noeth@informatik.uni-erlangen.de
ABSTRACT
This paper compares two approaches to automatic age and gender classification with 7 classes. The first approach uses Gaussian Mixture Models (GMMs) with Universal Background Models (UBMs), well known from the task of speaker identification/verification; the training is performed by the EM algorithm or MAP adaptation, respectively. For the second approach, a GMM model is trained for each speaker of the test and training sets. The means of each model are extracted and concatenated, which results in a GMM supervector for each speaker. These supervectors are then used in a support vector machine (SVM). Three different kernels were employed for the SVM approach: a polynomial kernel (with different polynomial orders), an RBF kernel and a linear GMM distance kernel based on the KL divergence. With the SVM approach we improved the recognition rate to 74% (p < 0.001) and are in the same range as humans.

Index Terms: Acoustic signal analysis, speaker classification, age, gender, Gaussian mixture models (GMM), support vector machine (SVM)
1. INTRODUCTION
The human voice not only conveys the semantics of spoken words; it also contains speaker-dependent characteristics. Examples of such non-verbal information are the identity, the gender, the emotional state or the age of a speaker. In everyday telephone calls we extract these speaker-specific characteristics and adapt our speaking style to the person we are talking to. Apart from gender, information about speaker characteristics is rarely used in automatic speech recognition (ASR). There are some approaches to identify dialogues with angry or unsatisfied users/callers [1], but only a few approaches use the age of speakers in ASR systems [2, 3], although many useful applications are associated with this task. The age (combined with the gender) information can be used to adapt the ASR system to a certain customer. Other examples are the adaptation of the waiting-queue music, the offer of age-dependent advertisements to callers in the waiting queue, or changing the speaking habits of the text-to-speech module of the ASR system. Statistical information on the age distribution of a caller group might also be an application.
In 2007 T-Systems, Siemens AG, Deutsches Forschungszentrum für Künstliche Intelligenz and Sympalog Voice Solutions compared four different age recognition systems on two corpora [4]. The most successful systems used Mel Frequency Cepstral Coefficients (MFCCs) and either performed multiple phoneme recognition or modeled the different age classes with Gaussian Mixture Models (GMMs).
In this paper we also use MFCCs as features and compare two different approaches: on the one hand a GMM-UBM (Universal Background Model) system, which has been shown to be very effective for the task of speaker identification [5, 6]; on the other hand Support Vector Machines (SVMs) with GMM supervectors to identify the speaker's age. The latter approach has also been published for speaker identification/verification [7].
This article is organized as follows: Section 2 describes the evaluation corpora on which the two systems are trained and tested. The baseline GMM-UBM system is described in Section 3. The basic framework for SVMs and the kernel functions used are summarized in Section 4. Section 4.2 presents the idea of GMM supervectors and describes the SVM-based classification system. In Section 5 we show the results of the SVM-based approach and compare them to the baseline GMM system developed in our group and to the 128-dimensional GMM and parallel phone recognition (PPR) systems of [4]. We also compare our results to the human baseline experiment mentioned in [4]. The paper finishes with a conclusion and a short outlook in Section 6.
2. CORPORA
The data was taken from the German SpeechDat II corpus, which is annotated with gender and age labels as given by the callers at the time of recording. The scenario of the corpus is telephone speech: the speakers called an automatic recording system and read a set of words, sentences and digits. The data used was an age-balanced subset of the 4000 native German speakers. The training and test sets are identical to [4]. For each class about 80 speakers were used for training, with 44 utterances per speaker.
In order to simulate a mismatched condition between training and test data we also evaluated the system on a 23-speaker subset of the VoiceClass corpus. This dataset was collected by Deutsche Telekom and consists of 660 native German speakers, who also called an automatic recording system and talked about their favorite dish. For each speaker between 5 and 30 seconds of speech data was available. The age structure is not balanced, i.e. children and young speakers are represented considerably more strongly than senior speakers. Gender and age labels were also available for each speaker of this corpus.
1-4244-1484-9/08/$25.00 ©2008 IEEE    ICASSP 2008

The labels of the training and test sets were used to build up the following 7 gender-dependent age classes:
Children (C): up to 13 years
Young male (YM) and female (YF) speakers: 14-19 years
Adult male (AM) and female (AF) speakers: 20-64 years
Senior male (SM) and female (SF) speakers: 65 years and older
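The mapping from a speaker's age and gender labels to these seven classes can be sketched as follows; the function name and the exact boundary handling are illustrative, not taken from the paper:

```python
def age_gender_class(age, gender):
    """Map age in years and gender ('m' or 'f') to one of the seven
    classes; children are not split by gender."""
    if age <= 13:
        return "C"
    if age <= 19:
        return "YM" if gender == "m" else "YF"
    if age <= 64:
        return "AM" if gender == "m" else "AF"
    return "SM" if gender == "m" else "SF"
```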
3. BASELINE GMM SYSTEM DESCRIPTION
For the baseline system we use a GMM-UBM system. Each of the 7 classes is modeled by a Gaussian Mixture Model (GMM) composed of M unimodal Gaussian densities:

    p(x | μ, Σ) = ∑_{i=1}^{M} ω_i p_i(x | μ_i, Σ_i),    (1)
where ω_i denotes the weight, Σ_i the covariance matrix and μ_i the mean vector of the i-th Gaussian density. We varied the number of mixtures M from 16 to 512 in powers of two. For classification a standard Gaussian mixture classifier is used: for each frame (feature vector) of a test speaker's utterance the classifier calculates the likelihood under each GMM age model, the per-frame scores of each age model are accumulated over the utterance, and the model that achieves the highest value is taken as the correct one.
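The frame-wise accumulation described above can be sketched as follows (numpy only): a minimal diagonal-covariance GMM scorer, not the authors' implementation. Scores are summed in the log domain for numerical stability, which leaves the argmax decision unchanged:

```python
import numpy as np

def gmm_log_likelihood(X, weights, means, covs):
    """Per-frame log-likelihood under a diagonal-covariance GMM.
    X: (T, D) frames; weights: (M,); means: (M, D); covs: (M, D) variances."""
    T, D = X.shape
    diff = X[:, None, :] - means[None, :, :]                 # (T, M, D)
    log_det = np.sum(np.log(covs), axis=1)                   # (M,)
    mahal = np.sum(diff ** 2 / covs[None, :, :], axis=2)     # (T, M)
    log_comp = -0.5 * (D * np.log(2 * np.pi) + log_det[None, :] + mahal)
    a = log_comp + np.log(weights)[None, :]
    m = a.max(axis=1, keepdims=True)                         # log-sum-exp
    return m[:, 0] + np.log(np.exp(a - m).sum(axis=1))       # (T,)

def classify(X, class_models):
    """Accumulate the frame scores per age model and pick the best model.
    class_models: {label: (weights, means, covs)}."""
    scores = {c: gmm_log_likelihood(X, *m).sum() for c, m in class_models.items()}
    return max(scores, key=scores.get)
```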
3.1. Feature Extraction
As features the commonly used Mel Frequency Cepstrum Coefficients (MFCCs) are used. They examine each of the 18 Mel bands, but consider only a time window of 16 ms with a time shift of 10 ms. This gives a feature vector with 24 components: log energy and MFCC(1)-(11), plus their first-order derivatives, which are computed by a regression line over 5 consecutive frames.
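The regression-based derivative (a delta over 5 consecutive frames) can be sketched as follows; the implementation details (edge padding, helper names) are assumptions, not taken from the paper:

```python
import numpy as np

def delta_features(C, width=2):
    """First-order derivatives of a (T, D) feature matrix via the slope of
    a regression line over 2*width+1 consecutive frames (width=2 gives the
    5-frame window described in the text). Border frames are replicated."""
    T, _ = C.shape
    padded = np.concatenate([np.repeat(C[:1], width, axis=0), C,
                             np.repeat(C[-1:], width, axis=0)], axis=0)
    norm = 2.0 * sum(k * k for k in range(1, width + 1))
    out = np.zeros_like(C, dtype=float)
    for k in range(1, width + 1):
        out += k * (padded[width + k:width + k + T] - padded[width - k:width - k + T])
    return out / norm

def stack_with_deltas(static):
    """12 static components (log energy + MFCC(1)-(11)) plus their deltas
    give the 24-dimensional frame vector described in the text."""
    return np.hstack([static, delta_features(static)])
```

On a perfectly linear feature trajectory the delta recovers the per-frame slope exactly, which is the defining property of the regression-line formulation.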
3.2. Training
The training process is shown in Figure 1. After extraction of the MFCCs a Universal Background Model (UBM) is created from all the available training data using the Expectation-Maximization (EM) algorithm [8]. The UBM then serves as the initial model either for a standard EM training with the age-dependent training data or for Maximum A Posteriori (MAP) adaptation [9]. Both algorithms take the UBM as an initial model and create one GMM for each age class. MAP adaptation calculates the age-dependent Gaussian mixture components in a single iteration step and combines them with the UBM parameters. The number of iterations in the EM training was set to 10.
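A mean-only relevance-MAP step of this kind can be sketched as follows (numpy only, diagonal covariances). The relevance factor r = 16 is a common default and an assumption here; the paper does not state its value:

```python
import numpy as np

def map_adapt_means(X, weights, means, covs, r=16.0):
    """Mean-only relevance-MAP adaptation of a diagonal-covariance UBM.
    X: (T, D) frames of one class/speaker; weights: (M,); means, covs: (M, D)."""
    # E-step: posterior probability of each UBM component per frame
    diff = X[:, None, :] - means[None, :, :]
    log_comp = -0.5 * (np.sum(np.log(2.0 * np.pi * covs), axis=1)[None, :]
                       + np.sum(diff ** 2 / covs[None, :, :], axis=2))
    a = log_comp + np.log(weights)[None, :]
    a -= a.max(axis=1, keepdims=True)
    post = np.exp(a)
    post /= post.sum(axis=1, keepdims=True)                  # (T, M)
    # zeroth- and first-order sufficient statistics
    n = post.sum(axis=0)                                     # soft counts (M,)
    Ex = (post.T @ X) / np.maximum(n, 1e-10)[:, None]        # posterior means (M, D)
    # single step: interpolate the UBM means toward the data statistics
    alpha = (n / (n + r))[:, None]
    return alpha * Ex + (1.0 - alpha) * means
```

Components that see many frames (large soft count n) move close to the data mean; rarely occupied components stay near the UBM, which is the behavior that makes MAP adaptation robust on little data.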
4. SUPPORT VECTOR MACHINES
4.1. SVM Classification
The Support Vector Machine (SVM) [10] performs a binary classification y ∈ {−1, +1} based on hyperplane separation. The separator is chosen so as to maximize the distance between the hyperplane and the closest training vectors, which are called support vectors. By the use of kernel functions K(x_i, x_j) which satisfy the Mercer condition, the SVM can be extended to non-linear boundaries:

    f(x) = ∑_{i=1}^{L} λ_i y_i K(x, x_i) + d    (2)
[Fig. 1. Training of the GMM baseline system: a Universal Background Model (UBM) is trained on the full training set with the EM algorithm; the speaker/age models are then derived from the UBM either by further EM training or by MAP adaptation.]
where the y_i are the target values and the x_i are the support vectors; the weights λ_i have to be determined in the training process. L denotes the number of support vectors and d is a (learned) constant. The task of this paper is a 7-class age identification, so the binary SVM has to be extended. The simplest way is to train one classifier for each pair of classes; therefore N × (N − 1)/2 classifiers are created, each of them separating two classes.
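The pairwise decomposition can be illustrated as follows; the binary learner here is a nearest-centroid stand-in for the SVM, purely to show the N × (N − 1)/2 pairing and the majority vote (for N = 7 classes this yields 21 classifiers):

```python
import itertools
import numpy as np

class OneVsOne:
    """Pairwise multiclass wrapper: one binary decision per class pair,
    combined by majority vote. The binary learner is a nearest-centroid
    stand-in for an SVM, purely for illustration."""

    def fit(self, X, y):
        y = np.asarray(y)
        self.classes_ = sorted(set(y.tolist()))
        self.centroids_ = {c: X[y == c].mean(axis=0) for c in self.classes_}
        self.pairs_ = list(itertools.combinations(self.classes_, 2))
        return self

    def predict_one(self, x):
        votes = {c: 0 for c in self.classes_}
        for a, b in self.pairs_:                 # N * (N - 1) / 2 decisions
            da = np.linalg.norm(x - self.centroids_[a])
            db = np.linalg.norm(x - self.centroids_[b])
            votes[a if da <= db else b] += 1
        return max(votes, key=votes.get)
```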
4.2. GMM Supervector Classification
A GMM supervector is created by concatenating the M 24-dimensional mean vectors of a speaker model (Eq. 1). Supervectors are built for every speaker, and a label for one of the seven classes is assigned to each vector. In the baseline system we derive one GMM per age class from the UBM. For the supervector classification approach we use the same UBM and adapt, for every speaker of the training and test sets, a GMM by EM training or MAP adaptation. We considered several variants of adaptation: full covariance matrices, diagonal covariance matrices, and adapting only the mean values. The GMM supervector can be regarded as a mapping from the utterance of a speaker (in our case the MFCCs) to a high-dimensional feature vector. The supervectors are then taken as input vectors for SVM training.
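Building the supervector itself is a simple concatenation; a minimal sketch with illustrative helper names (the M components and 24-dimensional means follow the paper's setup):

```python
import numpy as np

def gmm_supervector(means):
    """Concatenate the M mean vectors of a speaker-adapted GMM into one
    M*D-dimensional supervector (e.g. 512 components x 24 dims = 12288)."""
    return np.asarray(means).reshape(-1)

def build_training_matrix(speaker_means):
    """One supervector per speaker, stacked row-wise for SVM training."""
    return np.vstack([gmm_supervector(m) for m in speaker_means])
```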
4.3. Employed Kernels
In this paper we applied three different kernel types: the polynomial kernel (Eq. 3), the radial basis function (RBF) kernel (Eq. 4) and a GMM-based distance kernel (Eq. 6), which is derived from the KL divergence and is also very similar to the Mahalanobis distance.

    K(x_i, x_j) = (x_i^T x_j + 1)^n    (3)

    K(x_i, x_j) = exp[ −(1/2) (‖x_i − x_j‖ / ψ)² ]    (4)

n in Eq. 3 defines the polynomial order and ψ in Eq. 4 denotes the width of the radial basis function. These kernels are commonly used in SVM-based classification.
For Gaussian densities (created with mean-adapted MAP) an adequate kernel exists [7]. It is an approximation of the KL divergence [11], which can be rewritten in closed form as

    K(μ^a, μ^b) = ∑_{i=1}^{N} ω_i (μ_i^a)^T Σ_i^{−1} μ_i^b    (5)

                = ∑_{i=1}^{N} (√ω_i Σ_i^{−1/2} μ_i^a)^T (√ω_i Σ_i^{−1/2} μ_i^b).    (6)

Densities   EM-f        EM-d
32          35% / 35%   19% / 25%
64          46% / 43%   18% / 24%
128         41% / 42%   43% / 32%
256         37% / 42%   43% / 34%
512         44% / 45%   48% / 40%

Densities   MAP-f       MAP-d       MAP-dM
32          29% / 26%   29% / 26%   44% / 38%
64          43% / 41%   30% / 28%   33% / 30%
128         45% / 40%   40% / 36%   49% / 41%
256         45% / 40%   39% / 37%   44% / 39%
512         44% / 41%   43% / 42%   46% / 43%

Table 1. Precision / recall on the SpeechDat II corpus with different training algorithms (EM-f: EM with full covariance matrices; EM-d: EM with diagonal covariance matrices; MAP-f: MAP with full covariance matrices; MAP-d: MAP with diagonal covariance matrices; MAP-dM: MAP with diagonal covariance matrices, only means adapted).
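The three kernels of Eqs. (3), (4) and (6) can be sketched in numpy as follows. The KL-based kernel assumes diagonal covariances and mean supervectors; the helper names are illustrative, not from the paper:

```python
import numpy as np

def poly_kernel(xi, xj, n=3):
    """Polynomial kernel of order n, Eq. (3)."""
    return (xi @ xj + 1.0) ** n

def rbf_kernel(xi, xj, psi=0.1):
    """RBF kernel with width psi, Eq. (4)."""
    return np.exp(-0.5 * (np.linalg.norm(xi - xj) / psi) ** 2)

def kl_linear_kernel(mu_a, mu_b, weights, covs):
    """Linear GMM-distance kernel of Eq. (6): an inner product between
    component means scaled by sqrt(w_i) * Sigma_i^(-1/2); diagonal
    covariances assumed. mu_a, mu_b: (M, D); weights: (M,); covs: (M, D)."""
    scale = np.sqrt(weights)[:, None] / np.sqrt(covs)        # (M, D)
    return float(np.sum((scale * mu_a) * (scale * mu_b)))
```

Because Eq. (6) is linear in the scaled means, the KL-based kernel amounts to an ordinary dot product between suitably normalized supervectors.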
5. EXPERIMENTAL RESULTS
In this work we performed age recognition experiments on two different corpora: the SpeechDat II corpus and the VoiceClass corpus provided by Deutsche Telekom.
First we performed preliminary experiments (Section 5.1) to determine the best parameters for the GMM-UBM system. A second set of preliminary experiments selected the SVM kernel with the best performance. Section 5.2 compares the recognition results of the GMM-UBM system and the supervector-based SVM approach of our lab with the best results achieved in [4].
5.1. Preliminary Experiments
5.1.1. GMM-UBM system
We examined the influence of the number of Gaussian densities, the training algorithm (EM, MAP) and the form of the covariance matrices (full or diagonal) on the recognition results. In the case of MAP adaptation we adapted either all GMM components (ω, μ, Σ) or only the means. The results are shown in Table 1. For our baseline system the best results were achieved by a MAP-trained GMM with 128 Gaussian densities in which only the mean vectors were adapted.
5.1.2. SVM system
Table 2 summarizes the overall precision and recall of the supervector-based SVM system with the different kernels described in Section 4.3. It can be seen that the adjustment of the kernel parameters is very important (especially for the RBF kernel). The best results were achieved with MAP adaptation. The results reached with full covariance matrices and 64 Gaussian densities are comparable to those with diagonal covariances and 512 Gaussian densities. With 512 Gaussian densities, MAP adaptation, full covariance matrices and a linear kernel we achieved a recall of 74% and a precision of 77%.

Kernel      full        dia
EM training, 64 densities
poly e=1    63% / 61%   49% / 47%
poly e=3    62% / 60%   49% / 48%
RBF 0.01    29% / 50%   23% / 38%
RBF 0.1     65% / 41%   25% / 38%
KL-based    41% / 43%   47% / 48%
EM training, 512 densities
poly e=1    -           64% / 61%
poly e=3    -           66% / 64%
RBF 0.01    -            9% / 15%
RBF 0.1     -           26% / 43%
KL-based    -           53% / 52%

Kernel      full        dia         diaMean
MAP adaptation, 64 densities
poly e=1    66% / 65%   59% / 56%   58% / 55%
poly e=3    66% / 66%   59% / 55%   56% / 53%
RBF 0.01    44% / 49%   25% / 42%   21% / 36%
RBF 0.1     53% / 51%   56% / 46%   52% / 45%
KL-based    47% / 48%   58% / 57%   57% / 57%
MAP adaptation, 512 densities
poly e=1    77% / 74%   66% / 63%   66% / 64%
poly e=3    75% / 74%   67% / 63%   68% / 66%
RBF 0.01    21% / 24%   26% / 19%   26% / 19%
RBF 0.1     59% / 57%   61% / 56%   66% / 60%
KL-based    57% / 60%   55% / 53%   56% / 54%

Table 2. Precision / recall on the SpeechDat II corpus with different kernels and training (full: full covariance matrices; dia: diagonal covariance matrices; diaMean: diagonal covariance matrices with only the mean vectors adapted).
5.2. SVM vs GMM system
Table 3 shows the evaluation results on the two different corpora. For the SpeechDat II corpus, the supervector-based SVM system improved the precision over our GMM-UBM system by 57% relative, from 49% to 77%. The recall of this approach was 74%, while the recall of the best GMM-UBM system was 41%, a relative improvement of 80% (significant with p < 0.001). Compared to the PPR system of [4], the precision of our SVM system is 43% higher and the recall 35% higher (significant with p < 0.001). The confusion matrices of the two systems on the SpeechDat II corpus are tabulated in Table 4 and Table 5. The confusions of the SVM system (Table 5) are more balanced and far more intuitive than those of the GMM-UBM system (Table 4).
If we compare the performance of the human listeners to the SVM approach, both the recall and the precision of the SVM approach are higher. The differences in precision between human and machine are significant with p < 0.001; the differences in recall are not significant (p > 0.1). Note that the F-measure [12] of the SVM system leads to higher values than the F-measure calculated on the results of the human listeners (with weights of 0.5, 1 and 2).
To compare the robustness of the two approaches against data from different domains and channels, we used the already trained GMMs (and SVMs, respectively) and tested on the VoiceClass database. The robustness of both of our systems seems to be good; the differences between the 4 approaches are negligible.

            SpeechDat II          VoiceClass
System      precision   recall    precision   recall
GMM ([4])   42%         46%       64%         65%
PPR ([4])   54%         55%       60%         58%
GMM-UBM     49%         41%       65%         63%
SVM         77%         74%       61%         60%
HUMAN       55%         69%       -           -

Table 3. Overall precision and recall for the best two systems of [4] (GMM and parallel phone recognizer [PPR]) and for our two systems, tested on the two different corpora; the last row shows the performance of human listeners.

cl = C:  83, 8, 8
cl = YF: 55, 20, 15, 5, 5
cl = AF: 10, 30, 35, 5, 20
cl = SF: 25, 4, 8, 33, 8, 21
cl = YM: 5, 5, 10, 30, 5, 45
cl = AM: 16, 5, 5, 47, 26
cl = SM: 28, 6, 6, 17, 44

Table 4. Relative confusion matrix of the best GMM-UBM system (see text) on the SpeechDat II corpus; the rows contain the classified age (cl), and each row lists its non-zero percentages over the actual-age columns (ac), in the order C, YF, AF, SF, YM, AM, SM; zero cells are omitted (overall precision 49%).
6. CONCLUSION
We applied the GMM supervector-based SVM approach to the field of automatic age recognition in combination with gender recognition. We compared this approach to the GMM-UBM approach, which is state-of-the-art for the task of text-independent speaker identification, and to the PPR system of [4]. We only investigated spectral features. The SVM systems outperformed all of these approaches on the same-domain corpus. Compared to the best system of [4] (PPR) we improved the precision by 43% and the recall by 35% (significance: p < 0.001).

cl = C:  66, 33
cl = YF: 5, 75, 20
cl = AF: 75, 25
cl = SF: 4, 20, 75
cl = YM: 85, 15
cl = AM: 15, 78, 5
cl = SM: 5, 5, 27, 61

Table 5. Relative confusion matrix of the best GMM supervector-based SVM system (see text) on the SpeechDat II corpus; the rows contain the classified age (cl), and each row lists its non-zero percentages over the actual-age columns (ac), in the order C, YF, AF, SF, YM, AM, SM; zero cells are omitted (overall precision 77%).
7. REFERENCES
[1] R. Cowie, E. Douglas-Cowie, N. Tsapatsoulis, G. Votsis, S. Kollias, W. Fellenz, and J. Taylor, "Emotion Recognition in Human-Computer Interaction," IEEE Signal Processing Magazine, vol. 18, no. 1, pp. 32-80, 2001.
[2] N. Minematsu, K. Yamauchi, and K. Hirose, "Automatic estimation of perceptual age using speaker modeling techniques," in Proceedings Interspeech 2003, Geneva, Switzerland, 2003, pp. 3005-3008.
[3] C. Müller, F. Wittig, and J. Baus, "Exploiting Speech for Recognizing Elderly Users to Respond to their Special Needs," in Proceedings Interspeech 2003, Geneva, Switzerland, 2003, pp. 1305-1308.
[4] F. Metze, J. Ajmera, R. Englert, U. Bub, F. Burkhardt, J. Stegmann, C. Müller, R. Huber, B. Andrassy, J. G. Bauer, and B. Littel, "Comparison of Four Approaches to Age and Gender Recognition for Telephone Applications," in ICASSP 2007 Proceedings, IEEE International Conference on Acoustics, Speech and Signal Processing, Honolulu, Hawai'i, USA, 2007, vol. 4, pp. 1089-1092.
[5] D. A. Reynolds, T. F. Quatieri, and R. B. Dunn, "Speaker Verification using Adapted Gaussian Mixture Models," Digital Signal Processing, pp. 19-41, 2000.
[6] D. A. Reynolds, "An Overview of Automatic Speaker Recognition Technology," in ICASSP 2002 Proceedings, IEEE International Conference on Acoustics, Speech, and Signal Processing, Orlando, Florida, USA, 2002, vol. 4, pp. 4072-4075.
[7] W. M. Campbell, D. E. Sturim, and D. A. Reynolds, "Support Vector Machines Using GMM Supervectors for Speaker Verification," IEEE Signal Processing Letters, vol. 13, pp. 308-311, 2006.
[8] A. Dempster, N. Laird, and D. Rubin, "Maximum Likelihood from Incomplete Data via the EM Algorithm," Journal of the Royal Statistical Society, Series B (Methodological), vol. 39, no. 1, pp. 1-38, 1977.
[9] J. L. Gauvain and C. H. Lee, "Maximum A-Posteriori Estimation for Multivariate Gaussian Mixture Observations of Markov Chains," IEEE Transactions on Speech and Audio Processing, vol. 2, pp. 291-298, 1994.
[10] C. J. C. Burges, "A Tutorial on Support Vector Machines for Pattern Recognition," Data Mining and Knowledge Discovery, vol. 2, no. 2, pp. 121-167, 1998.
[11] R. Dehak, N. Dehak, P. Kenny, and P. Dumouchel, "Linear and Non Linear Kernel GMM Support Vector Machines for Speaker Verification," in Proceedings Interspeech 2007, Antwerp, Belgium, 2007.
[12] C. J. van Rijsbergen, Information Retrieval, Butterworths, London, 2nd edition, 1979.