scispace - formally typeset
Topic

Speaker recognition

About: Speaker recognition is a research topic. Over its lifetime, 14,990 publications have been published within this topic, receiving 310,061 citations.


Papers
01 Jan 1984
TL;DR: Describes an automatic lipreading system in which combining the acoustic and visual recognition candidates yields a final recognition accuracy that greatly exceeds the acoustic recognition accuracy alone.
Abstract: Automatic recognition of the acoustic speech signal alone is inaccurate and computationally expensive. Additional sources of speech information, such as lipreading (or speechreading), should enhance automatic speech recognition, just as lipreading is used by humans to enhance speech recognition when the acoustic signal is degraded. This paper describes an automatic lipreading system which has been developed. A commercial device performs the acoustic speech recognition independently of the lipreading system. The recognition domain is restricted to isolated utterances and speaker dependent recognition. The speaker faces a solid state camera which sends digitized video to a minicomputer system with custom video processing hardware. The video data is sampled during an utterance and then reduced to a template consisting of visual speech parameter time sequences. The distances between the incoming template and all of the trained templates for each utterance in the vocabulary are computed and a visual recognition candidate is obtained. The combination of the acoustic and visual recognition candidates is shown to yield a final recognition accuracy which greatly exceeds the acoustic recognition accuracy alone. Practical considerations and the possible enhancement of speaker independent and continuous speech recognition systems are also discussed.

389 citations
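The visual recognition step above reduces an utterance to a template of parameter time sequences and picks the trained template at minimum distance. A minimal sketch of that nearest-template matching, assuming equal-length templates and a plain Euclidean distance (the paper does not specify its exact metric here):

```python
import numpy as np

def template_distance(incoming, stored):
    """Euclidean distance between two visual-parameter time sequences of
    equal shape (time steps x parameters). A simplifying assumption; the
    original system's distance measure may differ."""
    return float(np.linalg.norm(incoming - stored))

def visual_candidate(incoming, trained_templates):
    """Return the vocabulary word whose trained template is closest to the
    incoming template, plus all distances for later fusion with the
    acoustic candidate."""
    distances = {word: template_distance(incoming, t)
                 for word, t in trained_templates.items()}
    best = min(distances, key=distances.get)
    return best, distances
```

The returned distances could then be combined with the acoustic recognizer's candidate ranking, which is where the paper reports the accuracy gain.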

Journal ArticleDOI
TL;DR: This article details the BioID system's functions, explaining the data acquisition and preprocessing techniques for voice, facial, and lip imagery data, the classification principles used for optical features, and the sensor fusion options.
Abstract: Biometric identification systems, which use physical features to check a person's identity, ensure much greater security than password and number systems. Biometric features such as the face or a fingerprint can be stored on a microchip in a credit card, for example. A single feature, however, sometimes fails to be exact enough for identification. Another disadvantage of using only one feature is that the chosen feature is not always readable. Dialog Communication Systems (DCS AG) developed BioID, a multimodal identification system that uses three different features-face, voice, and lip movement-to identify people. With its three modalities, BioID achieves much greater accuracy than single-feature systems. Even if one modality is somehow disturbed-for example, if a noisy environment drowns out the voice-the other two modalities still lead to an accurate identification. This article goes into detail about the system functions, explaining the data acquisition and preprocessing techniques for voice, facial, and lip imagery data. The authors also explain the classification principles used for optical features and the sensor fusion options (the combinations of the three results-face, voice, lip movement-to obtain varying levels of security).

386 citations
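The sensor fusion the abstract describes combines three per-modality results into one decision. A minimal sketch of one common fusion option, a weighted sum of match scores with an accept threshold (the weights and threshold here are illustrative, not BioID's published settings):

```python
def fuse_scores(face, voice, lip,
                weights=(1/3, 1/3, 1/3), threshold=0.5):
    """Weighted-sum fusion of three per-modality match scores in [0, 1].
    Returns the fused score and an accept/reject decision. If one modality
    is disturbed (low score), the other two can still carry the decision,
    mirroring the robustness claim in the abstract."""
    fused = (weights[0] * face + weights[1] * voice + weights[2] * lip)
    return fused, fused >= threshold
```

Raising the threshold (or requiring each modality individually to pass) corresponds to the "varying levels of security" the article mentions.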

Proceedings Article
01 Jan 1997
TL;DR: This paper compares two approaches to background model representation for a text-independent speaker verification task using Gaussian mixture models and describes how Bayesian adaptation can be used to derive claimant speaker models, providing a structure leading to significant computational savings during recognition.
Abstract: This paper compares two approaches to background model representation for a text-independent speaker verification task using Gaussian mixture models. We compare speaker-dependent background speaker sets to the use of a universal, speaker-independent background model (UBM). For the UBM, we describe how Bayesian adaptation can be used to derive claimant speaker models, providing a structure leading to significant computational savings during recognition. Experiments are conducted on the 1996 NIST Speaker Recognition Evaluation corpus and it is clearly shown that a system using a UBM and Bayesian adaptation of claimant models produces superior performance compared to speaker-dependent background sets or the UBM with independent claimant models. In addition, the creation and use of a telephone handset-type detector and a procedure called hnorm is also described which shows further, large improvements in verification performance, especially under the difficult mismatched handset conditions. This is believed to be the first use of applying a handset-type detector and explicit handset-type normalization for the speaker verification task.

383 citations
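The Bayesian adaptation of claimant models from a UBM that this paper describes is, in the standard GMM-UBM recipe, a MAP update that interpolates each mixture's statistics with the UBM prior. A mean-only sketch with diagonal covariances (the relevance factor of 16 is a common choice in the literature, not taken from this paper):

```python
import numpy as np

def map_adapt_means(ubm_means, ubm_weights, ubm_covars, data,
                    relevance=16.0):
    """Mean-only MAP (Bayesian) adaptation of a diagonal-covariance GMM
    toward claimant enrollment data.
    ubm_means: (M, D), ubm_weights: (M,), ubm_covars: (M, D) variances,
    data: (N, D) enrollment frames. Returns adapted means (M, D)."""
    # E-step: posterior responsibility of each mixture for each frame
    diff = data[:, None, :] - ubm_means[None, :, :]            # (N, M, D)
    log_prob = -0.5 * np.sum(diff**2 / ubm_covars[None], axis=2)
    log_prob -= 0.5 * np.sum(np.log(ubm_covars), axis=1)[None]
    log_prob += np.log(ubm_weights)[None]
    log_prob -= log_prob.max(axis=1, keepdims=True)
    resp = np.exp(log_prob)
    resp /= resp.sum(axis=1, keepdims=True)                    # (N, M)
    # Sufficient statistics per mixture
    n = resp.sum(axis=0)                                       # (M,)
    ex = resp.T @ data / np.maximum(n[:, None], 1e-10)         # (M, D)
    # MAP interpolation: data-rich mixtures move toward the data mean,
    # data-poor mixtures stay at the UBM prior
    alpha = n / (n + relevance)
    return alpha[:, None] * ex + (1 - alpha[:, None]) * ubm_means
```

Because only mixtures with appreciable responsibility move, scoring can also be restricted to the top-scoring UBM mixtures per frame, which is the source of the computational savings the abstract mentions.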

Journal ArticleDOI
TL;DR: In this paper, the authors proposed a supervised learning approach to speech segregation, in which a target speech signal is separated from interfering sounds using spatial localization cues: interaural time differences (ITD) and interaural intensity differences (IID).
Abstract: At a cocktail party, one can selectively attend to a single voice and filter out all the other acoustical interferences. How to simulate this perceptual ability remains a great challenge. This paper describes a novel, supervised learning approach to speech segregation, in which a target speech signal is separated from interfering sounds using spatial localization cues: interaural time differences (ITD) and interaural intensity differences (IID). Motivated by the auditory masking effect, the notion of an "ideal" time-frequency binary mask is suggested, which selects the target if it is stronger than the interference in a local time-frequency (T-F) unit. It is observed that within a narrow frequency band, modifications to the relative strength of the target source with respect to the interference trigger systematic changes for estimated ITD and IID. For a given spatial configuration, this interaction produces characteristic clustering in the binaural feature space. Consequently, pattern classification is performed in order to estimate ideal binary masks. A systematic evaluation in terms of signal-to-noise ratio as well as automatic speech recognition performance shows that the resulting system produces masks very close to ideal binary ones. A quantitative comparison shows that the model yields significant improvement in performance over an existing approach. Furthermore, under certain conditions the model produces large speech intelligibility improvements with normal listeners.

382 citations
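The "ideal" time-frequency binary mask the abstract defines selects a T-F unit when the target is stronger than the interference there. A minimal sketch on magnitude spectrograms, assuming the common 0 dB local criterion (the threshold parameter is an assumption, not the paper's exact formulation):

```python
import numpy as np

def ideal_binary_mask(target_tf, interference_tf, lc_db=0.0):
    """Ideal binary mask over time-frequency units: 1 where the target's
    local energy exceeds the interference's by more than lc_db decibels,
    else 0. Inputs are magnitude spectrograms of shape (freq, time)."""
    eps = 1e-12  # avoid log(0) in silent units
    local_snr_db = 10.0 * np.log10(
        (np.abs(target_tf)**2 + eps) / (np.abs(interference_tf)**2 + eps))
    return (local_snr_db > lc_db).astype(np.uint8)
```

In the paper's setting the mask is not computed from the clean signals (which are unknown at test time) but estimated by classifying ITD/IID features; the function above is the oracle target that the classifier is trained to approximate.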

Proceedings ArticleDOI
20 Mar 2016
TL;DR: A data-driven, integrated approach to speaker verification, which maps a test utterance and a few reference utterances directly to a single score for verification and jointly optimizes the system's components using the same evaluation protocol and metric as at test time.
Abstract: In this paper we present a data-driven, integrated approach to speaker verification, which maps a test utterance and a few reference utterances directly to a single score for verification and jointly optimizes the system's components using the same evaluation protocol and metric as at test time. Such an approach will result in simple and efficient systems, requiring little domain-specific knowledge and making few model assumptions. We implement the idea by formulating the problem as a single neural network architecture, including the estimation of a speaker model on only a few utterances, and evaluate it on our internal "Ok Google" benchmark for text-dependent speaker verification. The proposed approach appears to be very effective for big data applications like ours that require highly accurate, easy-to-maintain systems with a small footprint.

378 citations
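The mapping described above scores a test utterance against a speaker model built from a few reference utterances. A minimal sketch of the scoring stage, assuming utterances have already been embedded by some network and using averaged embeddings with cosine scoring (the paper learns the whole pipeline end to end; the threshold here is illustrative):

```python
import numpy as np

def verify(test_emb, reference_embs, threshold=0.7):
    """Score a test-utterance embedding against a speaker model formed by
    averaging a few reference-utterance embeddings; accept if the cosine
    similarity clears the threshold. A sketch of the scoring step only,
    not the paper's jointly optimized architecture."""
    model = np.mean(reference_embs, axis=0)
    score = float(np.dot(test_emb, model) /
                  (np.linalg.norm(test_emb) * np.linalg.norm(model) + 1e-12))
    return score, score >= threshold
```

Training the embedding network through this exact scoring function and decision metric is what the abstract means by optimizing with the same protocol as at test time.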


Network Information
Related Topics (5)
Feature vector
48.8K papers, 954.4K citations
83% related
Recurrent neural network
29.2K papers, 890K citations
82% related
Feature extraction
111.8K papers, 2.1M citations
81% related
Signal processing
73.4K papers, 983.5K citations
81% related
Decoding methods
65.7K papers, 900K citations
79% related
Performance
Metrics
No. of papers in the topic in previous years
Year    Papers
2023    165
2022    468
2021    283
2020    475
2019    484
2018    420