Speaker recognition

About: Speaker recognition, the task of identifying or verifying a person from characteristics of their voice, is a research topic. Over its lifetime, 14,990 publications have been published within this topic, receiving 310,061 citations.


Papers
Proceedings ArticleDOI
06 Apr 2003
TL;DR: Using non-native data from German speakers, the authors investigate how bilingual models, speaker adaptation, acoustic model interpolation, and polyphone decision tree specialization can help improve recognizer performance.
Abstract: The performance of speech recognition systems is consistently poor on non-native speech. The challenge for non-native speech recognition is to maximize the recognition performance with a small amount of available non-native data. We report on acoustic modeling adaptation for the recognition of non-native speech. Using non-native data from German speakers, we investigate how bilingual models, speaker adaptation, acoustic model interpolation and polyphone decision tree specialization methods can help to improve the recognizer performance. Results obtained from the experiments demonstrate the feasibility of these methods.

169 citations
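To make the "acoustic model interpolation" named above concrete: a minimal sketch, assuming each acoustic model is summarized by per-state Gaussian mean vectors and a hypothetical interpolation weight alpha (the paper does not give these specifics, and a real system would also interpolate covariances and mixture weights):

import numpy as np

def interpolate_means(native_means, nonnative_means, alpha=0.7):
    """Linearly interpolate Gaussian mean vectors of two acoustic models.

    native_means, nonnative_means: arrays of shape (n_states, feat_dim)
    alpha: weight on the native model (hypothetical value, tuned on dev data)
    """
    native_means = np.asarray(native_means)
    nonnative_means = np.asarray(nonnative_means)
    return alpha * native_means + (1.0 - alpha) * nonnative_means

# Toy usage: two 3-state models with 2-dimensional features.
native = np.array([[0.0, 1.0], [1.0, 2.0], [2.0, 3.0]])
nonnative = np.array([[0.5, 1.5], [1.5, 2.5], [2.5, 3.5]])
print(interpolate_means(native, nonnative, alpha=0.7))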

Proceedings Article
01 Jan 1994
TL;DR: The results indicate that simple hidden Markov models may be used to successfully recognize relatively unprocessed image sequences, and the system achieved performance levels equivalent to untrained humans when asked to recognize the first four English digits.
Abstract: This paper presents ongoing work on a speaker-independent visual speech recognition system. The work presented here builds on previous research efforts in this area and explores the potential use of simple hidden Markov models for limited-vocabulary, speaker-independent visual speech recognition. The task at hand is recognition of the first four English digits, with possible applications in car-phone dialing. The images were modeled as mixtures of independent Gaussian distributions, and the temporal dependencies were captured with standard left-to-right hidden Markov models. The results indicate that simple hidden Markov models may be used to successfully recognize relatively unprocessed image sequences. The system achieved performance levels equivalent to untrained humans when asked to recognize the first four English digits.

168 citations
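A minimal sketch of the modeling recipe above, mixtures of independent Gaussians (diagonal covariance) under a left-to-right HMM, using the hmmlearn library; the state count, feature dimension, and random data are illustrative assumptions, not values from the paper:

import numpy as np
from hmmlearn import hmm  # pip install hmmlearn

n_states, feat_dim = 5, 16

# Diagonal covariances correspond to "mixtures of independent Gaussians".
# init_params/params exclude 's' and 't' so the fixed topology is kept.
model = hmm.GaussianHMM(n_components=n_states, covariance_type="diag",
                        n_iter=50, init_params="mc", params="mc")

# Left-to-right topology: start in state 0; each state either self-loops
# or advances to the next state.
model.startprob_ = np.eye(n_states)[0]
transmat = np.zeros((n_states, n_states))
for i in range(n_states):
    transmat[i, i] = 0.5
    transmat[i, min(i + 1, n_states - 1)] += 0.5
model.transmat_ = transmat

# Toy training data: one "video" of 40 frames of feat_dim features.
X = np.random.randn(40, feat_dim)
model.fit(X, lengths=[40])

# Classification would train one HMM per digit and pick the model
# with the highest log-likelihood for a test sequence.
print(model.score(X))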

Posted Content
TL;DR: The submission of the Brno University of Technology (BUT) team to the VoxCeleb Speaker Recognition Challenge (VoxSRC) 2019 is described: a fusion of 4 Convolutional Neural Network (CNN) topologies whose best systems for the Fixed and Open conditions achieved 1.42% and 1.26% EER on the challenge evaluation set, respectively.
Abstract: In this report, we describe the submission of the Brno University of Technology (BUT) team to the VoxCeleb Speaker Recognition Challenge (VoxSRC) 2019. We also provide a brief analysis of different systems on the VoxCeleb-1 test sets. Submitted systems for both the Fixed and Open conditions are a fusion of 4 Convolutional Neural Network (CNN) topologies. The first and second networks have the ResNet34 topology and use two-dimensional CNNs. The last two networks are one-dimensional CNNs based on the x-vector extraction topology. Some of the networks are fine-tuned using additive margin angular softmax. Kaldi FBanks and Kaldi PLPs were used as features. The difference between the Fixed and Open systems lies in the training data used and the fusion strategy. The best systems for the Fixed and Open conditions achieved 1.42% and 1.26% EER on the challenge evaluation set, respectively.

167 citations
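The additive margin angular softmax mentioned above (AM-Softmax) is a standard loss for fine-tuning embedding extractors; a minimal PyTorch sketch follows, with the scale s and margin m set to commonly used illustrative values rather than values from the BUT report:

import torch
import torch.nn as nn
import torch.nn.functional as F

class AMSoftmaxLoss(nn.Module):
    def __init__(self, embed_dim, n_classes, s=30.0, m=0.2):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(n_classes, embed_dim))
        nn.init.xavier_uniform_(self.weight)
        self.s, self.m = s, m

    def forward(self, embeddings, labels):
        # Cosine similarity between L2-normalized embeddings and class weights.
        cosine = F.linear(F.normalize(embeddings), F.normalize(self.weight))
        # Subtract the margin m from the target-class cosine only.
        margin = torch.zeros_like(cosine)
        margin.scatter_(1, labels.unsqueeze(1), self.m)
        logits = self.s * (cosine - margin)
        return F.cross_entropy(logits, labels)

# Toy usage: 8 speaker embeddings of dimension 256, 1000 speaker classes.
loss_fn = AMSoftmaxLoss(embed_dim=256, n_classes=1000)
emb = torch.randn(8, 256)
labels = torch.randint(0, 1000, (8,))
print(loss_fn(emb, labels))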

Proceedings ArticleDOI
25 Mar 2012
TL;DR: This paper presents first steps toward a large vocabulary continuous speech recognition (LVCSR) system for conversational Mandarin-English code-switching (CS) speech and investigates statistical machine translation (SMT)-based text generation approaches for building code-switching language models.
Abstract: This paper presents first steps toward a large vocabulary continuous speech recognition (LVCSR) system for conversational Mandarin-English code-switching (CS) speech. We applied state-of-the-art techniques such as speaker-adaptive and discriminative training to build the first baseline system on the SEAME corpus [1] (South East Asia Mandarin-English). For acoustic modeling, we applied different phone merging approaches based on the International Phonetic Alphabet (IPA) and the Bhattacharyya distance, in combination with discriminative training, to improve accuracy. At the language model level, we investigated statistical machine translation (SMT)-based text generation approaches for building code-switching language models. Furthermore, we integrated the information provided by a language identification (LID) system into the decoding process using a multi-stream approach. Our best 2-pass system achieves a Mixed Error Rate (MER) of 36.6% on the SEAME development set.

167 citations
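The Bhattacharyya distance used for phone merging above has a closed form for Gaussians; a minimal NumPy sketch, under the simplifying assumption that each phone is summarized by a single Gaussian (mean and covariance) rather than a full mixture model:

import numpy as np

def bhattacharyya_gaussians(mu1, cov1, mu2, cov2):
    """Closed-form Bhattacharyya distance between two Gaussians."""
    cov = 0.5 * (cov1 + cov2)
    diff = mu1 - mu2
    # Mahalanobis-like term between the means under the averaged covariance.
    term1 = 0.125 * diff @ np.linalg.solve(cov, diff)
    # Log-determinant term comparing averaged and individual covariances.
    _, logdet = np.linalg.slogdet(cov)
    _, logdet1 = np.linalg.slogdet(cov1)
    _, logdet2 = np.linalg.slogdet(cov2)
    term2 = 0.5 * (logdet - 0.5 * (logdet1 + logdet2))
    return term1 + term2

# Toy usage: two phone models; a merging scheme could merge the pair
# whose distance falls below some threshold.
mu_a, cov_a = np.zeros(3), np.eye(3)
mu_b, cov_b = 0.1 * np.ones(3), 1.2 * np.eye(3)
print(bhattacharyya_gaussians(mu_a, cov_a, mu_b, cov_b))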

Patent
02 Feb 2004
TL;DR: In this patent, the semantic overlap between the support information and the salient non-function words in the recognized text, together with collaborative user feedback information, is used to determine relevancy scores for the recognized text.
Abstract: Techniques are provided for determining collaborative notes and automatically recognizing speech, handwriting, and other types of information. Domain and optional actor/speaker information associated with the support information is determined. An initial automatic speech recognition model is determined based on the domain and/or actor information. The domain and/or actor/speaker language model is used to recognize text in the speech information associated with the support information. Presentation support information such as slides, speaker notes, and the like is determined. The semantic overlap between the support information and the salient non-function words in the recognized text, together with collaborative user feedback information, is used to determine relevancy scores for the recognized text. Grammaticality, well-formedness, self-referential integrity, and other features are used to determine correctness scores. Suggested collaborative notes are displayed in the user interface based on the salient non-function words. User actions in the user interface determine feedback signals. Recognition models, such as automatic speech recognition and handwriting recognition models, are determined based on the feedback signals and the correctness and relevance scores.

167 citations
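The patent does not spell out how semantic overlap is computed; purely as a hypothetical illustration, a relevancy score could be a Jaccard overlap between the non-function words of the support text and of the recognized text (the stopword list and scoring choice below are assumptions, not the patented method):

# Hypothetical illustration only: the patent does not specify this computation.
STOPWORDS = {"the", "a", "an", "of", "and", "or", "to", "in",
             "is", "are", "was", "this"}

def salient_words(text):
    """Lowercase alphabetic tokens with function words (stopwords) removed."""
    return {w for w in text.lower().split() if w not in STOPWORDS and w.isalpha()}

def relevancy_score(support_text, recognized_text):
    """Jaccard overlap between salient words of slides and recognized speech."""
    support, recognized = salient_words(support_text), salient_words(recognized_text)
    if not support or not recognized:
        return 0.0
    return len(support & recognized) / len(support | recognized)

# Toy usage: slide text vs. a recognized utterance.
print(relevancy_score("Quarterly revenue growth in Europe",
                      "revenue growth was strong in europe this quarter"))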


Network Information
Related Topics (5)
Feature vector: 48.8K papers, 954.4K citations (83% related)
Recurrent neural network: 29.2K papers, 890K citations (82% related)
Feature extraction: 111.8K papers, 2.1M citations (81% related)
Signal processing: 73.4K papers, 983.5K citations (81% related)
Decoding methods: 65.7K papers, 900K citations (79% related)
Performance Metrics
No. of papers in the topic in previous years:

Year    Papers
2023    165
2022    468
2021    283
2020    475
2019    484
2018    420