Text and Language-Independent Speaker Recognition Using Suprasegmental Features and Support Vector Machines

doi:10.1007/978-3-642-03547-0_29

Book Chapter

Anvita Bajpai, Vinod Pathangay

17 Aug 2009-Vol. 40, pp 307-317

TL;DR: The presence of speaker-specific suprasegmental information in the Linear Prediction (LP) residual signal is demonstrated, and a Support Vector Machine is used to classify patterns in the variance of the autocorrelation sequence for the speaker recognition task.

Abstract: In this paper, the presence of speaker-specific suprasegmental information in the Linear Prediction (LP) residual signal is demonstrated. The LP residual signal is obtained after removing the predictable part of the speech signal. This information, if added to existing speaker recognition systems based on segmental and subsegmental features, can result in a better-performing combined system. The speaker-specific suprasegmental information can not only be perceived by listening to the residual, but can also be seen in the form of excitation peaks in the residual waveform. However, the challenge lies in capturing this information from the residual signal. Higher-order correlations among samples of the residual are not known to be captured using standard signal processing and statistical techniques. The Hilbert envelope of the residual is shown to further enhance the excitation peaks present in the residual signal. A speaker-specific pattern is also observed in the autocorrelation sequence of the Hilbert envelope, and further in the statistics of this autocorrelation sequence. This indicates the presence of speaker-specific suprasegmental information in the residual signal. In this work, no distinction is made between voiced and unvoiced sounds when extracting these features. A Support Vector Machine (SVM) is used to classify the patterns in the variance of the autocorrelation sequence for the speaker recognition task.
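The feature chain described in the abstract (LP residual, Hilbert envelope, autocorrelation sequence, variance statistic) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the function names, the frame length, hop, LP order, and maximum lag are all assumed values, and taking the per-lag variance across frames is one plausible reading of "variance of the autocorrelation sequence". The SVM classification step is omitted.

```python
import numpy as np


def lp_residual(frame, order=10):
    """LP residual of one frame (autocorrelation method). `order` is an assumed value."""
    frame = np.asarray(frame, dtype=float)
    n = len(frame)
    # Autocorrelation at lags 0..order.
    r = np.array([np.dot(frame[:n - k], frame[k:]) for k in range(order + 1)])
    # Toeplitz normal equations R a = r[1:], with a tiny ridge for stability.
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    R = R + (1e-9 * r[0] + 1e-12) * np.eye(order)
    a = np.linalg.solve(R, r[1:])
    # Predicted sample: sum_k a[k] * s[m - k - 1]; the residual is what remains.
    pred = np.zeros(n)
    for k in range(order):
        pred[k + 1:] += a[k] * frame[:n - k - 1]
    return frame - pred


def hilbert_envelope(x):
    """Magnitude of the analytic signal, computed with the FFT."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    h = np.zeros(n)
    h[0] = 1.0
    if n % 2 == 0:
        h[n // 2] = 1.0
        h[1:n // 2] = 2.0
    else:
        h[1:(n + 1) // 2] = 2.0
    return np.abs(np.fft.ifft(np.fft.fft(x) * h))


def autocorr(x, max_lag):
    """Mean-removed autocorrelation at lags 0..max_lag, normalized by lag 0."""
    x = x - x.mean()
    full = np.correlate(x, x, mode="full")
    mid = len(x) - 1  # index of zero lag
    ac = full[mid:mid + max_lag + 1]
    return ac / ac[0] if ac[0] > 0 else ac


def autocorr_variance_features(signal, frame_len=400, hop=200, order=10, max_lag=100):
    """Per-lag variance, across frames, of the envelope autocorrelation (assumed definition)."""
    signal = np.asarray(signal, dtype=float)
    acs = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        env = hilbert_envelope(lp_residual(frame, order))
        acs.append(autocorr(env, max_lag))
    return np.var(np.array(acs), axis=0)
```

Under these assumptions, each utterance yields one fixed-length vector with `max_lag + 1` entries, which is a convenient input shape for a classifier such as an SVM.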

Citations

Posted Content

iQIYI-VID: A Large Dataset for Multi-modal Person Identification.

Yuanliu Liu, Peipei Shi, Bo Peng, He Yan, Yong Zhou, Bing Han, Yi Zheng, Chao Lin, Jianbin Jiang, Yin Fan, Tingwei Gao, Ganwen Wang, Jian Liu, Xiangju Lu, Danming Xie

19 Nov 2018-arXiv: Computer Vision and Pattern Recognition

TL;DR: This paper introduces iQIYI-VID, the largest video dataset for multi-modal person identification, and proposes a Multi-modal Attention module that fuses multi-modal features and can improve person identification considerably.

Abstract: Person identification in the wild is very challenging due to great variation in poses, face quality, clothes, makeup and so on. Traditional research, such as face recognition, person re-identification, and speaker recognition, often focuses on a single modality of information, which is inadequate to handle all the situations in practice. Multi-modal person identification is a more promising approach in which we can jointly utilize face, head, body, audio features, and so on. In this paper, we introduce iQIYI-VID, the largest video dataset for multi-modal person identification. It is composed of 600K video clips of 5,000 celebrities. These video clips are extracted from 400K hours of online videos of various types, ranging from movies, variety shows, and TV series to news broadcasting. All video clips pass through a careful human annotation process, and the error rate of labels is lower than 0.2%. We evaluated the state-of-the-art models of face recognition, person re-identification, and speaker recognition on the iQIYI-VID dataset. Experimental results show that these models are still far from being perfect for the task of person identification in the wild. We proposed a Multi-modal Attention module to fuse multi-modal features that can improve person identification considerably. We have released the dataset online to promote multi-modal person identification research.

30 citations

Cites methods from "Text and Language-Independent Speak..."

...Speaker recognition has been approached by applying a variety of machine learning models [4, 5, 12], either standard or specifically designed, to speech features such as MFCC [34]....
[...]
...The CNN model is trained as a classification model using the Dev part of Voxceleb2 dataset [5] with 5994 speakers, 14% of the data is used as evaluation while the rest as training data....
[...]
