Topic

Speaker recognition

About: Speaker recognition is a research topic. Over the lifetime, 14990 publications have been published within this topic receiving 310061 citations.

...read moreread less

Papers published on a yearly basis

1 / 2

Papers

PDF

Open Access

More filters

Journal Article•DOI•

Robust Speaker Recognition in Noisy Conditions

[...]

Ji Ming¹, Timothy J. Hazen², James Glass², Douglas A. Reynolds²•Institutions (2)

Queen's University Belfast¹, Massachusetts Institute of Technology²

01 Jul 2007-IEEE Transactions on Audio, Speech, and Language Processing

TL;DR: This paper describes a method that combines multicondition model training and missing-feature theory to model noise with unknown temporal-spectral characteristics, and is found to achieve lower error rates.

...read moreread less

Abstract: This paper investigates the problem of speaker identification and verification in noisy conditions, assuming that speech signals are corrupted by environmental noise, but knowledge about the noise characteristics is not available. This research is motivated in part by the potential application of speaker recognition technologies on handheld devices or the Internet. While the technologies promise an additional biometric layer of security to protect the user, the practical implementation of such systems faces many challenges. One of these is environmental noise. Due to the mobile nature of such systems, the noise sources can be highly time-varying and potentially unknown. This raises the requirement for noise robustness in the absence of information about the noise. This paper describes a method that combines multicondition model training and missing-feature theory to model noise with unknown temporal-spectral characteristics. Multicondition training is conducted using simulated noisy data with limited noise variation, providing a ldquocoarserdquo compensation for the noise, and missing-feature theory is applied to refine the compensation by ignoring noise variation outside the given training conditions, thereby reducing the training and testing mismatch. This paper is focused on several issues relating to the implementation of the new model for real-world applications. These include the generation of multicondition training data to model noisy speech, the combination of different training data to optimize the recognition performance, and the reduction of the model's complexity. The new algorithm was tested using two databases with simulated and realistic noisy speech data. The first database is a redevelopment of the TIMIT database by rerecording the data in the presence of various noise types, used to test the model for speaker identification with a focus on the varieties of noise. The second database is a handheld-device database collected in realistic noisy conditions, used to further validate the model for real-world speaker verification. The new model is compared to baseline systems and is found to achieve lower error rates.

...read moreread less

277 citations

Journal Article•DOI•

The limits of speech recognition

[...]

Ben Shneiderman¹•Institutions (1)

University of Maryland, College Park¹

01 Sep 2000-Communications of The ACM

TL;DR: By understanding the cognitive processes surrounding human “acoustic memory” and processing, interface designers may be able to integrate speech more effectively and guide users more successfully.

...read moreread less

Abstract: Continued research and development should be able to improve certain speech input, output, and dialogue applications. Speech recognition and generation is sometimes helpful for environments that are hands-busy, eyes-busy, mobility-required, or hostile and shows promise for telephone-based services. Dictation input is increasingly accurate, but adoption outside the disabled-user community has been slow compared to visual interfaces. Obvious physical problems include fatigue from speaking continuously and the disruption in an office filled with people speaking. By understanding the cognitive processes surrounding human “acoustic memory” and processing, interface designers may be able to integrate speech more effectively and guide users more successfully. By appreciating the differences between human-human interaction and human-computer interaction, designers may then be able to choose appropriate applications for human use of speech with computers. The key distinction may be the rich emotional content conveyed by prosody, or the pacing, intonation, and amplitude in spoken language. The emotive aspects of prosody are potent for human-human interaction but may be disruptive for human-computer interaction. The syntactic aspects of prosody, such as rising tone for questions, are important for a system’s recognition and generation of sentences. Now consider human acoustic memory and processing. Short-term and working memory are sometimes called acoustic or verbal memory. The part of Ben Shneiderman

...read moreread less

277 citations

Journal Article•DOI•

Text-dependent speaker verification: Classifiers, databases and RSR2015

[...]

Anthony Larcher¹, Kong Aik Lee¹, Bin Ma¹, Haizhou Li¹•Institutions (1)

Institute for Infocomm Research Singapore¹

01 May 2014-Speech Communication

TL;DR: The HiLAM system, based on a three layer acoustic architecture, and an i-vector/PLDA system, outperforms the state-of-the-art i- vector system in most of the scenarios and provides a reference evaluation scheme and a reference performance on RSR2015 database to the research community.

...read moreread less

274 citations

Proceedings Article•DOI•

A Comparison of Sequence-to-Sequence Models for Speech Recognition

[...]

Rohit Prabhavalkar¹, Kanishka Rao¹, Tara N. Sainath², Bo Li¹, Leif Johnson³, Navdeep Jaitly⁴ - Show less +2 more•Institutions (4)

Google¹, IBM², University of Texas at Austin³, University of Toronto⁴

20 Aug 2017

TL;DR: It is found that the sequence-to-sequence models are competitive with traditional state-of-the-art approaches on dictation test sets, although the baseline, which uses a separate pronunciation and language model, outperforms these models on voice-search test sets.

...read moreread less

Abstract: In this work, we conduct a detailed evaluation of various allneural, end-to-end trained, sequence-to-sequence models applied to the task of speech recognition. Notably, each of these systems directly predicts graphemes in the written domain, without using an external pronunciation lexicon, or a separate language model. We examine several sequence-to-sequence models including connectionist temporal classification (CTC), the recurrent neural network (RNN) transducer, an attentionbased model, and a model which augments the RNN transducer with an attention mechanism. We find that the sequence-to-sequence models are competitive with traditional state-of-the-art approaches on dictation test sets, although the baseline, which uses a separate pronunciation and language model, outperforms these models on voice-search test sets.

...read moreread less

271 citations

Journal Article•DOI•

Fusion of Heterogeneous Speaker Recognition Systems in the STBU Submission for the NIST Speaker Recognition Evaluation 2006

[...]

Niko Brümmer¹, Lukas Burget, Jan Cernocky, Ondrej Glembek, Frantisek Grezl, Martin Karafiat, D.A. van Leeuwen, Pavel Matejka, Petr Schwarz, Albert Strasheim - Show less +6 more•Institutions (1)

Stellenbosch University¹

01 Sep 2007-IEEE Transactions on Audio, Speech, and Language Processing

TL;DR: The STBU speaker recognition system was a combination of three main kinds of subsystems, which performed well in the NIST Speaker Recognition Evaluation 2006 (SRE).

...read moreread less

Abstract: This paper describes and discusses the "STBU" speaker recognition system, which performed well in the NIST Speaker Recognition Evaluation 2006 (SRE). STBU is a consortium of four partners: Spescom DataVoice (Stellenbosch, South Africa), TNO (Soesterberg, The Netherlands), BUT (Brno, Czech Republic), and the University of Stellenbosch (Stellenbosch, South Africa). The STBU system was a combination of three main kinds of subsystems: 1) GMM, with short-time Mel frequency cepstral coefficient (MFCC) or perceptual linear prediction (PLP) features, 2) Gaussian mixture model-support vector machine (GMM-SVM), using GMM mean supervectors as input to an SVM, and 3) maximum-likelihood linear regression-support vector machine (MLLR-SVM), using MLLR speaker adaptation coefficients derived from an English large vocabulary continuous speech recognition (LVCSR) system. All subsystems made use of supervector subspace channel compensation methods-either eigenchannel adaptation or nuisance attribute projection. We document the design and performance of all subsystems, as well as their fusion and calibration via logistic regression. Finally, we also present a cross-site fusion that was done with several additional systems from other NIST SRE-2006 participants.

...read moreread less

271 citations

Collapse

Network Information

Performance

Metrics

15,632

Papers

337,766

Citations

No. of papers in the topic in previous years
Year	Papers
2023	165
2022	468
2021	283
2020	475
2019	484
2018	420

Speaker recognition

Papers published on a yearly basis

Papers

Trending Questions (10)

Network Information

Related Topics (5)

Performance

Metrics