Topic

Speaker recognition

About: Speaker recognition is a research topic. Over its lifetime, 14,990 publications have been published within this topic, receiving 310,061 citations.


Papers
Proceedings ArticleDOI
01 Nov 2020
TL;DR: This paper presents a magnitude estimation network, combined with a modified ResNet x-vector system, that generates embeddings whose inner product produces calibrated scores with increased discrimination, yielding discrimination and calibration gains at multiple operating points.
Abstract: We present a magnitude estimation network that is combined with a modified ResNet x-vector system to generate embeddings whose inner product is able to produce calibrated scores with increased discrimination. A three-step training procedure is used. First, the network is trained using short segments and a multi-class cross-entropy loss with angular margin softmax. During the second step, only a reduced subset of the DNN parameters is refined using full-length recordings. Finally, the magnitude estimation network is trained using a binary cross-entropy loss over pairs of target and non-target trials. The resulting system is evaluated on four widely used benchmarks and provides significant discrimination and calibration gains at multiple operating points.
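The core idea is that each embedding is a unit-norm direction scaled by a learned magnitude, so a plain inner product between two embeddings doubles as the verification score. A minimal sketch of that scoring path, assuming a PyTorch setup; the backbone module, layer sizes, and Softplus head are illustrative assumptions, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

class MagnitudeEmbedder(nn.Module):
    """Scale a unit-norm speaker embedding by a learned magnitude."""

    def __init__(self, backbone: nn.Module, emb_dim: int = 256):
        super().__init__()
        self.backbone = backbone  # hypothetical ResNet x-vector extractor
        # Small head predicting a positive scalar magnitude per embedding
        # (layer sizes are illustrative, not the paper's).
        self.mag_net = nn.Sequential(
            nn.Linear(emb_dim, 64), nn.ReLU(),
            nn.Linear(64, 1), nn.Softplus(),
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        e = self.backbone(feats)                        # (batch, emb_dim)
        direction = nn.functional.normalize(e, dim=-1)  # unit-norm direction
        magnitude = self.mag_net(e)                     # (batch, 1), positive
        return magnitude * direction                    # scaled embedding

def trial_score(emb_a: torch.Tensor, emb_b: torch.Tensor) -> torch.Tensor:
    # The plain inner product of two scaled embeddings serves directly
    # as the (ideally calibrated) verification score for a trial.
    return (emb_a * emb_b).sum(dim=-1)
```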

66 citations

Proceedings ArticleDOI
14 Apr 1983
TL;DR: A new technique for text-independent speaker recognition is proposed that uses a statistical model of the speaker's vector-quantized speech; it retains text-independent properties while allowing considerably shorter test utterances than comparable speaker recognition systems.
Abstract: A new technique for text-independent speaker recognition is proposed which uses a statistical model of the speaker's vector-quantized speech. The technique retains text-independent properties while allowing considerably shorter test utterances than comparable speaker recognition systems. The frequently occurring vectors, or characters, form a model of multiple points in the n-dimensional speech space instead of the usual single-point models. Speaker recognition depends on the statistical distribution of the distances between the speech frames from the unknown speaker and the closest points in the model. Models were generated with 100 seconds of conversational training speech for each of 11 male speakers. The system was able to identify the 11 speakers with 96%, 87%, and 79% accuracy from sections of unknown speech of durations of 10, 5, and 3 seconds, respectively. Accurate recognition was also obtained even when there were variations in the channels over which the training and testing data were obtained. A real-time demonstration system has been implemented, including both training and recognition processes.
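The speaker model here is a codebook of multiple points rather than a single mean, and recognition rests on the distances from unknown frames to their nearest codewords. A minimal sketch of that scheme, assuming numpy feature frames and using k-means as an illustrative stand-in for the paper's vector quantizer:

```python
import numpy as np
from sklearn.cluster import KMeans

def train_codebook(frames: np.ndarray, size: int = 64) -> np.ndarray:
    """Model a speaker as multiple points (codewords) in feature space."""
    km = KMeans(n_clusters=size, n_init=4, random_state=0).fit(frames)
    return km.cluster_centers_

def mean_quantization_distance(frames: np.ndarray, codebook: np.ndarray) -> float:
    """Mean distance from each test frame to its closest codeword."""
    dists = np.linalg.norm(frames[:, None, :] - codebook[None, :, :], axis=-1)
    return float(dists.min(axis=1).mean())

def identify(test_frames: np.ndarray, codebooks: dict) -> str:
    """Pick the speaker whose codebook lies closest to the unknown speech."""
    return min(codebooks,
               key=lambda spk: mean_quantization_distance(test_frames, codebooks[spk]))
```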

66 citations

Proceedings ArticleDOI
01 Dec 2007
TL;DR: In contrast to the common belief that "there is no data like more data", it was found possible to select a highly informative subset of data that produces recognition performance comparable to that of a system using a much larger amount of data.
Abstract: This paper presents a strategy for efficiently selecting informative data from large corpora of transcribed speech. We propose to choose data uniformly according to the distribution of some target speech unit (phoneme, word, character, etc.). In our experiments, in contrast to the common belief that "there is no data like more data", we found it possible to select a highly informative subset of data that produces recognition performance comparable to a system that makes use of a much larger amount of data. At the same time, our selection process is efficient and fast.
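One way to read the selection criterion is as a greedy balancing act: repeatedly add the utterance whose target units are currently least represented, so the selected subset's unit distribution approaches uniform. A minimal sketch of that reading; the function name and the exact greedy score are assumptions, not the paper's algorithm:

```python
from collections import Counter

def select_uniform(utterances, budget):
    """utterances: list of (utt_id, unit_sequence); returns chosen utt_ids."""
    counts = Counter()            # running counts of each target unit
    remaining = dict(utterances)
    chosen = []
    for _ in range(min(budget, len(remaining))):
        # Greedily pick the utterance whose units are least covered so far,
        # pushing the selected set's unit distribution toward uniform.
        uid = min(remaining,
                  key=lambda u: sum(counts[p] for p in remaining[u])
                                / max(len(remaining[u]), 1))
        counts.update(remaining.pop(uid))
        chosen.append(uid)
    return chosen

# Example: select_uniform([("u1", ["ah", "t"]), ("u2", ["ah", "s", "t"])], 1)
```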

66 citations

Journal ArticleDOI
TL;DR: A multiple-expert biometric person identification system combines information from three experts (audio, visual speech, and face) in an automatic, unsupervised manner, adapting to the local performance and output reliability of each of the three experts.
Abstract: Information about person identity is multimodal. Yet most person-recognition systems limit themselves to only a single modality, such as facial appearance. With a view to exploiting the complementary nature of different modes of information and increasing pattern-recognition robustness to test-signal degradation, we developed a multiple-expert biometric person identification system that combines information from three experts: audio, visual speech, and face. The system uses multimodal fusion in an automatic, unsupervised manner, adapting to the local performance (at the transaction level) and output reliability of each of the three experts. The expert weightings are chosen automatically such that the reliability measure of the combined scores is maximized. To test system robustness to train/test mismatch, we used a broad range of acoustic babble noise and JPEG compression to degrade the audio and visual signals, respectively. Identification experiments were carried out on a 248-subject subset of the XM2VTS database. The multimodal expert system outperformed each of the single experts in all comparisons. At the severe audio and visual mismatch levels tested, the audio, mouth, face, and tri-expert fusion accuracies were 16.1%, 48%, 75%, and 89.9%, respectively, representing a relative improvement of 19.9% over the best-performing expert.
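Since the weightings are chosen so that a reliability measure of the combined scores is maximized, the mechanics can be sketched as a search over the weight simplex. The margin between the top two identity scores below is an illustrative stand-in for the paper's reliability measure:

```python
import itertools
import numpy as np

def reliability(scores: np.ndarray) -> float:
    """Higher when one identity clearly dominates the ranked scores."""
    top2 = np.sort(scores)[-2:]
    return float(top2[1] - top2[0])

def fuse(expert_scores, steps=10):
    """Search the weight simplex for the combination whose fused
    per-identity scores have maximal reliability."""
    # Z-normalize each expert's scores so they are comparable before fusion.
    normed = [(s - s.mean()) / (s.std() + 1e-9) for s in expert_scores]
    grid = np.linspace(0.0, 1.0, steps + 1)
    best, best_rel = None, -np.inf
    for w in itertools.product(grid, repeat=len(normed)):
        if not np.isclose(sum(w), 1.0):
            continue  # keep only weightings on the simplex
        combined = sum(wi * si for wi, si in zip(w, normed))
        r = reliability(combined)
        if r > best_rel:
            best_rel, best = r, combined
    return best

# identity = int(np.argmax(fuse([audio_scores, lip_scores, face_scores])))
```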

66 citations

Proceedings Article
01 Jan 2002
TL;DR: This paper describes alternative methods for performing speaker identification that utilize domain-dependent automatic speech recognition (ASR) to provide a phonetic segmentation of the test utterance.
Abstract: Traditional text-independent speaker recognition systems are based on Gaussian Mixture Models (GMMs) trained globally over all speech from a given speaker. In this paper, we describe alternative methods for performing speaker identification that utilize domain-dependent automatic speech recognition (ASR) to provide a phonetic segmentation of the test utterance. When evaluated on YOHO, several of these approaches were able to outperform previously published results on the speaker ID task. On a more difficult conversational speech task, we were able to use a combination of classifiers to reduce identification error rates on single test utterances. Over multiple utterances, the ASR-dependent approaches performed significantly better than the ASR-independent methods. Using an approach we call speaker-adaptive modeling for speaker identification, we were able to reduce speaker identification error rates by 39% over a baseline GMM approach when observing five test utterances from a speaker.
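The baseline being improved on trains one global GMM per speaker and identifies by likelihood. A minimal sketch of that baseline; the component count and diagonal covariances are illustrative choices, and the paper's ASR-dependent variants would further condition the models on the phonetic segmentation:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_speaker_gmms(train_frames: dict, n_components: int = 32) -> dict:
    """train_frames: speaker -> (num_frames, feat_dim) array of features."""
    return {spk: GaussianMixture(n_components=n_components,
                                 covariance_type="diag",
                                 random_state=0).fit(x)
            for spk, x in train_frames.items()}

def identify(test_frames: np.ndarray, gmms: dict) -> str:
    """Pick the speaker whose GMM gives the highest average per-frame
    log-likelihood on the test utterance."""
    return max(gmms, key=lambda spk: gmms[spk].score(test_frames))
```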

66 citations


Network Information
Related Topics (5)
Feature vector: 48.8K papers, 954.4K citations (83% related)
Recurrent neural network: 29.2K papers, 890K citations (82% related)
Feature extraction: 111.8K papers, 2.1M citations (81% related)
Signal processing: 73.4K papers, 983.5K citations (81% related)
Decoding methods: 65.7K papers, 900K citations (79% related)
Performance
Metrics
No. of papers in the topic in previous years
Year    Papers
2023    165
2022    468
2021    283
2020    475
2019    484
2018    420