scispace - formally typeset
Topic

Speaker recognition

About: Speaker recognition is a research topic. Over its lifetime, 14,990 publications have been published within this topic, receiving 310,061 citations.


Papers
01 Jan 1984
TL;DR: Describes an automatic lipreading system in which combining the acoustic and visual recognition candidates yields a final recognition accuracy that greatly exceeds the acoustic recognition accuracy alone.
Abstract: Automatic recognition of the acoustic speech signal alone is inaccurate and computationally expensive. Additional sources of speech information, such as lipreading (or speechreading), should enhance automatic speech recognition, just as lipreading is used by humans to enhance speech recognition when the acoustic signal is degraded. This paper describes an automatic lipreading system which has been developed. A commercial device performs the acoustic speech recognition independently of the lipreading system. The recognition domain is restricted to isolated utterances and speaker dependent recognition. The speaker faces a solid state camera which sends digitized video to a minicomputer system with custom video processing hardware. The video data is sampled during an utterance and then reduced to a template consisting of visual speech parameter time sequences. The distances between the incoming template and all of the trained templates for each utterance in the vocabulary are computed and a visual recognition candidate is obtained. The combination of the acoustic and visual recognition candidates is shown to yield a final recognition accuracy which greatly exceeds the acoustic recognition accuracy alone. Practical considerations and the possible enhancement of speaker independent and continuous speech recognition systems are also discussed.

389 citations
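The visual recognition step above reduces an utterance to a template of parameter time sequences and picks the trained template at minimum distance. A minimal sketch of that nearest-template matching, assuming equal-length templates and a plain Euclidean distance (the paper does not specify its exact metric here):

```python
import numpy as np

def template_distance(incoming, stored):
    """Euclidean distance between two visual-parameter time sequences of
    equal shape (time steps x parameters). A simplifying assumption; the
    original system's distance measure may differ."""
    return float(np.linalg.norm(incoming - stored))

def visual_candidate(incoming, trained_templates):
    """Return the vocabulary word whose trained template is closest to the
    incoming template, plus all distances for later fusion with the
    acoustic candidate."""
    distances = {word: template_distance(incoming, t)
                 for word, t in trained_templates.items()}
    best = min(distances, key=distances.get)
    return best, distances
```

The returned distances could then be combined with the acoustic recognizer's candidate ranking, which is where the paper reports the accuracy gain.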

Journal ArticleDOI
TL;DR: This article details the BioID system's functions, explaining the data acquisition and preprocessing techniques for voice, facial, and lip imagery data, the classification principles used for optical features, and the sensor fusion options.
Abstract: Biometric identification systems, which use physical features to check a person's identity, ensure much greater security than password and number systems. Biometric features such as the face or a fingerprint can be stored on a microchip in a credit card, for example. A single feature, however, sometimes fails to be exact enough for identification. Another disadvantage of using only one feature is that the chosen feature is not always readable. Dialog Communication Systems (DCS AG) developed BioID, a multimodal identification system that uses three different features-face, voice, and lip movement-to identify people. With its three modalities, BioID achieves much greater accuracy than single-feature systems. Even if one modality is somehow disturbed-for example, if a noisy environment drowns out the voice-the other two modalities still lead to an accurate identification. This article goes into detail about the system functions, explaining the data acquisition and preprocessing techniques for voice, facial, and lip imagery data. The authors also explain the classification principles used for optical features and the sensor fusion options (the combinations of the three results-face, voice, lip movement-to obtain varying levels of security).

386 citations
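The sensor fusion the abstract describes combines three per-modality results into one decision. A minimal sketch of one common fusion option, a weighted sum of match scores with an accept threshold (the weights and threshold here are illustrative, not BioID's published settings):

```python
def fuse_scores(face, voice, lip,
                weights=(1/3, 1/3, 1/3), threshold=0.5):
    """Weighted-sum fusion of three per-modality match scores in [0, 1].
    Returns the fused score and an accept/reject decision. If one modality
    is disturbed (low score), the other two can still carry the decision,
    mirroring the robustness claim in the abstract."""
    fused = (weights[0] * face + weights[1] * voice + weights[2] * lip)
    return fused, fused >= threshold
```

Raising the threshold (or requiring each modality individually to pass) corresponds to the "varying levels of security" the article mentions.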

Proceedings Article
01 Jan 1997
TL;DR: This paper compares two approaches to background model representation for a text-independent speaker verification task using Gaussian mixture models and describes how Bayesian adaptation can be used to derive claimant speaker models, providing a structure leading to significant computational savings during recognition.
Abstract: This paper compares two approaches to background model representation for a text-independent speaker verification task using Gaussian mixture models. We compare speaker-dependent background speaker sets to the use of a universal, speaker-independent background model (UBM). For the UBM, we describe how Bayesian adaptation can be used to derive claimant speaker models, providing a structure leading to significant computational savings during recognition. Experiments are conducted on the 1996 NIST Speaker Recognition Evaluation corpus and it is clearly shown that a system using a UBM and Bayesian adaptation of claimant models produces superior performance compared to speaker-dependent background sets or the UBM with independent claimant models. In addition, the creation and use of a telephone handset-type detector and a procedure called hnorm is also described which shows further, large improvements in verification performance, especially under the difficult mismatched handset conditions. This is believed to be the first use of applying a handset-type detector and explicit handset-type normalization for the speaker verification task.

383 citations
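The Bayesian adaptation of claimant models from a UBM that this paper describes is, in the standard GMM-UBM recipe, a MAP update that interpolates each mixture's statistics with the UBM prior. A mean-only sketch with diagonal covariances (the relevance factor of 16 is a common choice in the literature, not taken from this paper):

```python
import numpy as np

def map_adapt_means(ubm_means, ubm_weights, ubm_covars, data,
                    relevance=16.0):
    """Mean-only MAP (Bayesian) adaptation of a diagonal-covariance GMM
    toward claimant enrollment data.
    ubm_means: (M, D), ubm_weights: (M,), ubm_covars: (M, D) variances,
    data: (N, D) enrollment frames. Returns adapted means (M, D)."""
    # E-step: posterior responsibility of each mixture for each frame
    diff = data[:, None, :] - ubm_means[None, :, :]            # (N, M, D)
    log_prob = -0.5 * np.sum(diff**2 / ubm_covars[None], axis=2)
    log_prob -= 0.5 * np.sum(np.log(ubm_covars), axis=1)[None]
    log_prob += np.log(ubm_weights)[None]
    log_prob -= log_prob.max(axis=1, keepdims=True)
    resp = np.exp(log_prob)
    resp /= resp.sum(axis=1, keepdims=True)                    # (N, M)
    # Sufficient statistics per mixture
    n = resp.sum(axis=0)                                       # (M,)
    ex = resp.T @ data / np.maximum(n[:, None], 1e-10)         # (M, D)
    # MAP interpolation: data-rich mixtures move toward the data mean,
    # data-poor mixtures stay at the UBM prior
    alpha = n / (n + relevance)
    return alpha[:, None] * ex + (1 - alpha[:, None]) * ubm_means
```

Because only mixtures with appreciable responsibility move, scoring can also be restricted to the top-scoring UBM mixtures per frame, which is the source of the computational savings the abstract mentions.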

Journal ArticleDOI
TL;DR: In this paper, the authors proposed a supervised learning approach to speech segregation, in which a target speech signal is separated from interfering sounds using spatial localization cues: interaural time differences (ITD) and interaural intensity differences (IID).
Abstract: At a cocktail party, one can selectively attend to a single voice and filter out all the other acoustical interferences. How to simulate this perceptual ability remains a great challenge. This paper describes a novel, supervised learning approach to speech segregation, in which a target speech signal is separated from interfering sounds using spatial localization cues: interaural time differences (ITD) and interaural intensity differences (IID). Motivated by the auditory masking effect, the notion of an "ideal" time-frequency binary mask is suggested, which selects the target if it is stronger than the interference in a local time-frequency (T-F) unit. It is observed that within a narrow frequency band, modifications to the relative strength of the target source with respect to the interference trigger systematic changes for estimated ITD and IID. For a given spatial configuration, this interaction produces characteristic clustering in the binaural feature space. Consequently, pattern classification is performed in order to estimate ideal binary masks. A systematic evaluation in terms of signal-to-noise ratio as well as automatic speech recognition performance shows that the resulting system produces masks very close to ideal binary ones. A quantitative comparison shows that the model yields significant improvement in performance over an existing approach. Furthermore, under certain conditions the model produces large speech intelligibility improvements with normal listeners.

382 citations
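The "ideal" time-frequency binary mask the abstract defines selects a T-F unit when the target is stronger than the interference there. A minimal sketch on magnitude spectrograms, assuming the common 0 dB local criterion (the threshold parameter is an assumption, not the paper's exact formulation):

```python
import numpy as np

def ideal_binary_mask(target_tf, interference_tf, lc_db=0.0):
    """Ideal binary mask over time-frequency units: 1 where the target's
    local energy exceeds the interference's by more than lc_db decibels,
    else 0. Inputs are magnitude spectrograms of shape (freq, time)."""
    eps = 1e-12  # avoid log(0) in silent units
    local_snr_db = 10.0 * np.log10(
        (np.abs(target_tf)**2 + eps) / (np.abs(interference_tf)**2 + eps))
    return (local_snr_db > lc_db).astype(np.uint8)
```

In the paper's setting the mask is not computed from the clean signals (which are unknown at test time) but estimated by classifying ITD/IID features; the function above is the oracle target that the classifier is trained to approximate.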

Proceedings ArticleDOI
20 Mar 2016
TL;DR: A data-driven, integrated approach to speaker verification, which maps a test utterance and a few reference utterances directly to a single score for verification and jointly optimizes the system's components using the same evaluation protocol and metric as at test time.
Abstract: In this paper we present a data-driven, integrated approach to speaker verification, which maps a test utterance and a few reference utterances directly to a single score for verification and jointly optimizes the system's components using the same evaluation protocol and metric as at test time. Such an approach will result in simple and efficient systems, requiring little domain-specific knowledge and making few model assumptions. We implement the idea by formulating the problem as a single neural network architecture, including the estimation of a speaker model on only a few utterances, and evaluate it on our internal "Ok Google" benchmark for text-dependent speaker verification. The proposed approach appears to be very effective for big data applications like ours that require highly accurate, easy-to-maintain systems with a small footprint.

378 citations
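The mapping described above scores a test utterance against a speaker model built from a few reference utterances. A minimal sketch of the scoring stage, assuming utterances have already been embedded by some network and using averaged embeddings with cosine scoring (the paper learns the whole pipeline end to end; the threshold here is illustrative):

```python
import numpy as np

def verify(test_emb, reference_embs, threshold=0.7):
    """Score a test-utterance embedding against a speaker model formed by
    averaging a few reference-utterance embeddings; accept if the cosine
    similarity clears the threshold. A sketch of the scoring step only,
    not the paper's jointly optimized architecture."""
    model = np.mean(reference_embs, axis=0)
    score = float(np.dot(test_emb, model) /
                  (np.linalg.norm(test_emb) * np.linalg.norm(model) + 1e-12))
    return score, score >= threshold
```

Training the embedding network through this exact scoring function and decision metric is what the abstract means by optimizing with the same protocol as at test time.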


Network Information
Related Topics (5)
Feature vector
48.8K papers, 954.4K citations
83% related
Recurrent neural network
29.2K papers, 890K citations
82% related
Feature extraction
111.8K papers, 2.1M citations
81% related
Signal processing
73.4K papers, 983.5K citations
81% related
Decoding methods
65.7K papers, 900K citations
79% related
Performance
Metrics
No. of papers in the topic in previous years
Year    Papers
2023    165
2022    468
2021    283
2020    475
2019    484
2018    420