Topic
Speaker recognition
About: Speaker recognition is a research topic. Over its lifetime, 14,990 publications have appeared on this topic, receiving 310,061 citations.
Papers published on a yearly basis
Papers
06 Apr 2003
TL;DR: Using non-native data from German speakers, this work investigates how bilingual models, speaker adaptation, acoustic model interpolation, and polyphone decision tree specialization can improve recognizer performance.
Abstract: The performance of speech recognition systems is consistently poor on non-native speech. The challenge for non-native speech recognition is to maximize the recognition performance with a small amount of available non-native data. We report on acoustic modeling adaptation for the recognition of non-native speech. Using non-native data from German speakers, we investigate how bilingual models, speaker adaptation, acoustic model interpolation and polyphone decision tree specialization methods can help to improve the recognizer performance. Results obtained from the experiments demonstrate the feasibility of these methods.
169 citations
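The acoustic model interpolation that the abstract above mentions can be illustrated with a minimal sketch: a linear interpolation between the emission likelihoods of a native and a non-native acoustic model. The 1-D Gaussians, function names, and the single interpolation weight are illustrative assumptions, not the paper's actual models.

```python
import math

def gaussian_pdf(x, mean, var):
    # Density of a 1-D Gaussian; stands in for a full acoustic model state.
    return math.exp(-0.5 * ((x - mean) ** 2 / var
                            + math.log(2 * math.pi * var)))

def interpolated_likelihood(x, native, nonnative, lam):
    """Linear interpolation of emission likelihoods from a native and a
    non-native acoustic model, weighted by lam (the non-native weight).
    native and nonnative are (mean, variance) pairs."""
    return (lam * gaussian_pdf(x, *nonnative)
            + (1 - lam) * gaussian_pdf(x, *native))
```

With lam tuned on a small non-native development set, the interpolated model trades off the well-trained native model against the sparse non-native data, which is the spirit of the adaptation the paper evaluates.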
01 Jan 1994
TL;DR: The results indicate that simple hidden Markov models may be used to successfully recognize relatively unprocessed image sequences, and the system achieved performance levels equivalent to untrained humans when asked to recognize the first four English digits.
Abstract: This paper presents ongoing work on a speaker independent visual speech recognition system. The work presented here builds on previous research efforts in this area and explores the potential use of simple hidden Markov models for limited vocabulary, speaker independent visual speech recognition. The task at hand is recognition of the first four English digits, a task with possible applications in car-phone dialing. The images were modeled as mixtures of independent Gaussian distributions, and the temporal dependencies were captured with standard left-to-right hidden Markov models. The results indicate that simple hidden Markov models may be used to successfully recognize relatively unprocessed image sequences. The system achieved performance levels equivalent to untrained humans when asked to recognize the first four English digits.
168 citations
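The standard left-to-right HMMs in the abstract above can be sketched with the forward algorithm. This toy version uses single 1-D Gaussian emissions rather than the paper's mixtures of independent Gaussians over image features; all names and shapes are illustrative assumptions.

```python
import math

def forward_log_likelihood(obs, trans, means, vars_):
    """Forward algorithm for a left-to-right HMM with 1-D Gaussian
    emissions. trans[i][j] is P(state j | state i); decoding starts
    in state 0."""
    neg_inf = float("-inf")

    def emit(s, x):
        # Log-density of observation x under state s's Gaussian.
        return -0.5 * (math.log(2 * math.pi * vars_[s])
                       + (x - means[s]) ** 2 / vars_[s])

    n = len(means)
    log_alpha = [neg_inf] * n
    log_alpha[0] = emit(0, obs[0])
    for x in obs[1:]:
        new = [neg_inf] * n
        for j in range(n):
            # Log-sum-exp over predecessor states that can reach j.
            terms = [log_alpha[i] + math.log(trans[i][j])
                     for i in range(n)
                     if trans[i][j] > 0 and log_alpha[i] != neg_inf]
            if terms:
                m = max(terms)
                new[j] = (m + math.log(sum(math.exp(t - m) for t in terms))
                          + emit(j, x))
        log_alpha = new
    finite = [a for a in log_alpha if a != neg_inf]
    m = max(finite)
    return m + math.log(sum(math.exp(a - m) for a in finite))
```

With one such HMM trained per digit, classification would pick the digit whose model assigns the image sequence the highest forward log-likelihood.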
TL;DR: Describes the submission of the Brno University of Technology (BUT) team to the VoxCeleb Speaker Recognition Challenge (VoxSRC) 2019: a fusion of 4 Convolutional Neural Network (CNN) topologies whose best systems for the Fixed and Open conditions achieved 1.42% and 1.26% EER on the challenge evaluation set, respectively.
Abstract: In this report, we describe the submission of the Brno University of Technology (BUT) team to the VoxCeleb Speaker Recognition Challenge (VoxSRC) 2019. We also provide a brief analysis of different systems on the VoxCeleb-1 test sets. Submitted systems for both the Fixed and Open conditions are a fusion of 4 Convolutional Neural Network (CNN) topologies. The first and second networks have the ResNet34 topology and use two-dimensional CNNs. The last two networks are one-dimensional CNNs based on the x-vector extraction topology. Some of the networks are fine-tuned using additive margin angular softmax. Kaldi FBanks and Kaldi PLPs were used as features. The difference between the Fixed and Open systems lies in the training data used and the fusion strategy. The best systems for the Fixed and Open conditions achieved 1.42% and 1.26% EER on the challenge evaluation set, respectively.
167 citations
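The additive margin angular softmax used for fine-tuning in the abstract above can be sketched as follows. The margin and scale values are illustrative assumptions, not the BUT system's actual hyper-parameters, and the sketch scores a single example rather than a batch.

```python
import math

def am_softmax_loss(cos_sims, target, margin=0.2, scale=30.0):
    """Additive margin softmax loss for one training example.

    cos_sims holds cosine similarities between an L2-normalised speaker
    embedding and each class weight vector; the margin is subtracted
    from the target-class cosine before the scaled softmax."""
    logits = [scale * (c - margin) if j == target else scale * c
              for j, c in enumerate(cos_sims)]
    m = max(logits)  # numerically stable log-sum-exp
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[target]  # cross-entropy on the target class
```

Subtracting the margin makes the correct speaker look harder to classify during training, which pushes embeddings of different speakers further apart in angle.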
25 Mar 2012
TL;DR: This paper presents first steps toward a large vocabulary continuous speech recognition (LVCSR) system for conversational Mandarin-English code-switching (CS) speech and investigates statistical machine translation (SMT)-based text generation approaches for building code-switching language models.
Abstract: This paper presents first steps toward a large vocabulary continuous speech recognition (LVCSR) system for conversational Mandarin-English code-switching (CS) speech. We applied state-of-the-art techniques such as speaker adaptive and discriminative training to build the first baseline system on the SEAME corpus [1] (South East Asia Mandarin-English). For acoustic modeling, we applied different phone merging approaches based on the International Phonetic Alphabet (IPA) and the Bhattacharyya distance in combination with discriminative training to improve accuracy. At the language model level, we investigated statistical machine translation (SMT)-based text generation approaches for building code-switching language models. Furthermore, we integrated the information provided by a language identification (LID) system into the decoding process by using a multi-stream approach. Our best 2-pass system achieves a Mixed Error Rate (MER) of 36.6% on the SEAME development set.
167 citations
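The Bhattacharyya-distance-based phone merging that the abstract above mentions can be illustrated with a minimal sketch. It computes the closed-form Bhattacharyya distance between two 1-D Gaussians; the paper's phone models are richer, and the merge threshold here is a hypothetical value, not one from the paper.

```python
import math

def bhattacharyya_gauss(m1, v1, m2, v2):
    """Bhattacharyya distance between two 1-D Gaussians with means
    m1, m2 and variances v1, v2; a toy version of the distance used to
    decide which Mandarin and English phone models are similar."""
    return (0.25 * (m1 - m2) ** 2 / (v1 + v2)
            + 0.5 * math.log((v1 + v2) / (2.0 * math.sqrt(v1 * v2))))

def mergeable(phone_a, phone_b, threshold=0.1):
    # Merge two phone models (mean, variance pairs) when their distance
    # falls below a hypothetical threshold.
    return bhattacharyya_gauss(*phone_a, *phone_b) < threshold
```

Merging acoustically close phones across the two languages shrinks the phone inventory, so each shared model sees more of the scarce code-switching training data.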
02 Feb 2004
TL;DR: The semantic overlap between the support information and the salient non-function words in the recognized text, together with collaborative user feedback information, is used to determine relevancy scores for the recognized text.
Abstract: Techniques are provided for determining collaborative notes and automatically recognizing speech, handwriting, and other types of information. Domain and optional actor/speaker information associated with the support information is determined. An initial automatic speech recognition model is determined based on the domain and/or actor information. The domain and/or actor/speaker language model is used to recognize text in the speech information associated with the support information. Presentation support information such as slides, speaker notes, and the like is determined. The semantic overlap between the support information and the salient non-function words in the recognized text, together with collaborative user feedback information, is used to determine relevancy scores for the recognized text. Grammaticality, well-formedness, self-referential integrity, and other features are used to determine correctness scores. Suggested collaborative notes are displayed in the user interface based on the salient non-function words. User actions in the user interface determine feedback signals. Recognition models such as automatic speech recognition and handwriting recognition are determined based on the feedback signals and the correctness and relevance scores.
167 citations
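The semantic-overlap relevancy scoring described in the patent abstract above can be sketched as the fraction of salient (non-function) words in the recognized text that also appear in the support information. The function-word list and the scoring formula are illustrative assumptions, not the patent's actual method.

```python
def relevancy_score(recognized, support, function_words=frozenset({
        "the", "a", "an", "of", "to", "and", "in", "is", "are"})):
    """Fraction of salient words in the recognized text that also occur
    in the presentation support information (slides, speaker notes)."""
    salient = {w.lower() for w in recognized.split()} - function_words
    if not salient:
        return 0.0  # nothing salient to score
    overlap = salient & {w.lower() for w in support.split()}
    return len(overlap) / len(salient)
```

A score like this would let the system rank recognized utterances by how well they match the current slide before suggesting them as collaborative notes.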