Topic

Speaker recognition

About: Speaker recognition is a research topic. Over its lifetime, 14,990 publications have been published within this topic, receiving 310,061 citations.


Papers
Proceedings ArticleDOI
01 Dec 2015
TL;DR: NTT's CHiME-3 system is described, which integrates advanced speech enhancement and recognition techniques and achieves a 3.45% development error rate and a 5.83% evaluation error rate.
Abstract: CHiME-3 is a research community challenge organised in 2015 to evaluate speech recognition systems for mobile multi-microphone devices used in noisy daily environments. This paper describes NTT's CHiME-3 system, which integrates advanced speech enhancement and recognition techniques. Newly developed techniques include the use of spectral masks for acoustic beam-steering vector estimation and acoustic modelling with deep convolutional neural networks based on the "network in network" concept. In addition to these improvements, our system has several key differences from the official baseline system. The differences include multi-microphone training, dereverberation, and cross adaptation of neural networks with different architectures. The impacts that these techniques have on recognition performance are investigated. By combining these advanced techniques, our system achieves a 3.45% development error rate and a 5.83% evaluation error rate. Three simpler systems are also developed to perform evaluations with constrained set-ups.

259 citations
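The spectral-mask beamforming idea in the paper above generalizes beyond this system: a neural network estimates a time-frequency mask for speech, and the steering vector at each frequency is taken as the principal eigenvector of the mask-weighted spatial covariance matrix. A minimal numpy sketch of that general recipe follows; it is not NTT's implementation, and the function name and array shapes are assumptions for illustration.

```python
import numpy as np

def estimate_steering_vectors(stft, speech_mask):
    """Illustrative mask-based steering vector estimation.

    stft:        complex STFT, shape (channels, frames, freqs)
    speech_mask: values in [0, 1], shape (frames, freqs), e.g. from a
                 neural mask estimator (not shown)
    Returns (freqs, channels): per-frequency principal eigenvector of
    the mask-weighted spatial covariance matrix.
    """
    n_ch, n_frames, n_freq = stft.shape
    steering = np.zeros((n_freq, n_ch), dtype=complex)
    for f in range(n_freq):
        X = stft[:, :, f]                      # (channels, frames)
        w = speech_mask[:, f]                  # (frames,)
        # Mask-weighted spatial covariance of the speech component.
        R = (X * w) @ X.conj().T / max(w.sum(), 1e-8)
        # Principal eigenvector (eigh sorts eigenvalues ascending).
        _, vecs = np.linalg.eigh(R)
        steering[f] = vecs[:, -1]
    return steering
```

The resulting vectors would feed a delay-and-sum or MVDR beamformer; the mask estimator and the beamformer itself are out of scope here.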

Proceedings ArticleDOI
06 Apr 2003
TL;DR: The SuperSID project used pronunciation models, prosodic dynamics, pitch and duration features, phone streams, and conversational interactions to improve the accuracy of automatic speaker recognition on a defined NIST evaluation corpus and task.
Abstract: The area of automatic speaker recognition has been dominated by systems using only short-term, low-level acoustic information, such as cepstral features. While these systems have indeed produced very low error rates, they ignore other levels of information beyond low-level acoustics that convey speaker information. Recently published work has shown examples that such high-level information can be used successfully in automatic speaker recognition systems and has the potential to improve accuracy and add robustness. For the 2002 JHU CLSP summer workshop, the SuperSID project (http://www.clsp.jhu.edu/ws2002/groups/supersid/) was undertaken to exploit these high-level information sources and dramatically increase speaker recognition accuracy on a defined NIST evaluation corpus and task. The paper provides an overview of the structure, data, task, tools, and accomplishments of this project. Wide ranging approaches using pronunciation models, prosodic dynamics, pitch and duration features, phone streams, and conversational interactions were explored and developed. We show how these novel features and classifiers indeed provide complementary information and can be fused together to drive down the equal error rate on the 2001 NIST extended data task to 0.2% - a 71% relative reduction in error over the previous state of the art.

256 citations
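Two generic ingredients behind the 0.2% figure in the abstract above are score-level fusion and equal error rate (EER) measurement. The sketch below shows only the simplest weighted-sum fusion and a brute-force EER search; the workshop itself explored trained combiners, so treat the names and method as illustrative assumptions.

```python
import numpy as np

def fuse_scores(system_scores, weights):
    """Weighted-sum fusion of per-trial scores from several subsystems
    (e.g. cepstral, prosodic, and phone-stream classifiers).

    system_scores: shape (systems, trials); weights: shape (systems,)
    """
    return (np.asarray(weights)[:, None] * np.asarray(system_scores)).sum(axis=0)

def equal_error_rate(target_scores, nontarget_scores):
    """Brute-force search for the threshold where the false-accept
    rate and the false-reject rate are closest, returning their mean."""
    best_gap, eer = 2.0, 1.0
    for t in np.concatenate([target_scores, nontarget_scores]):
        far = np.mean(nontarget_scores >= t)   # impostors accepted
        frr = np.mean(target_scores < t)       # targets rejected
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer
```

Fusion pays off exactly when the subsystem errors are weakly correlated, which is the paper's point about complementary information.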

Proceedings ArticleDOI
06 Apr 2003
TL;DR: A new feature mapping technique is presented that maps feature vectors into a channel-independent space, learning its mapping parameters from a set of channel-dependent models derived from a channel-independent model via MAP adaptation.
Abstract: In speaker recognition applications, channel variability is a major cause of errors. Techniques in the feature, model and score domains have been applied to mitigate channel effects. In this paper we present a new feature mapping technique that maps feature vectors into a channel independent space. The feature mapping learns mapping parameters from a set of channel-dependent models derived from a channel-independent model via MAP adaptation. The technique is developed primarily for speaker verification, but can be applied for feature normalization in speech recognition applications. Results are presented on NIST landline and cellular telephone speech corpora where it is shown that feature mapping provides significant performance improvements over baseline systems and similar performance to Hnorm and speaker-model-synthesis (SMS).

255 citations
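The mapping described above has a compact core: score a frame against the detected channel-dependent GMM, pick the top-scoring Gaussian, and shift and scale the frame toward the matching Gaussian of the channel-independent root model. A sketch assuming diagonal covariances follows; channel detection is assumed to have already selected the channel-dependent model, and all array names are illustrative.

```python
import numpy as np

def map_features(features, ci_means, ci_stds, cd_means, cd_stds, weights):
    """Map frames from a channel-dependent space toward a
    channel-independent one, Gaussian by Gaussian.

    features: (frames, dims); all model arrays: (mixtures, dims);
    weights: (mixtures,). Diagonal covariances assumed.
    """
    mapped = np.empty_like(features)
    for n, x in enumerate(features):
        # Log-likelihood of the frame under each channel-dependent mixture.
        ll = (np.log(weights)
              - 0.5 * np.sum(np.log(2.0 * np.pi * cd_stds ** 2), axis=1)
              - 0.5 * np.sum(((x - cd_means) / cd_stds) ** 2, axis=1))
        i = np.argmax(ll)
        # Shift/scale using the paired channel-independent Gaussian.
        mapped[n] = (x - cd_means[i]) * (ci_stds[i] / cd_stds[i]) + ci_means[i]
    return mapped
```

Because the channel-dependent models come from MAP adaptation of the one root model, mixture i in each model corresponds to the same underlying acoustic class, which is what makes the per-Gaussian pairing meaningful.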

PatentDOI
TL;DR: A speech recognition interface system is presented that can handle a plurality of application programs simultaneously and realize convenient speech input and output modes suitable for applications such as window systems and speech mail systems.
Abstract: A speech recognition interface system capable of handling a plurality of application programs simultaneously, and realizing convenient speech input and output modes which are suitable for the applications in the window systems and the speech mail systems. The system includes a speech recognition unit for carrying out a speech recognition processing for a speech input made by a user to obtain a recognition result; a program management table for managing program management data indicating a speech recognition interface function required by each application program; and a message processing unit for exchanging messages with the plurality of application programs in order to specify an appropriate recognition vocabulary to be used in the speech recognition processing of the speech input to the speech recognition unit, and to transmit the recognition result for the speech input obtained by the speech recognition unit by using the appropriate recognition vocabulary to appropriate ones of the plurality of application programs, according to the program management data managed by the program management table.

255 citations
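The moving parts of the patent above, a program management table plus a message processing unit that routes recognition results, reduce to a small dispatch structure. The Python sketch below invents all names (ProgramEntry, SpeechInterface) and replaces message passing with direct callbacks; it illustrates only the table-driven vocabulary selection and result delivery, not the patent's actual protocol.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet

@dataclass
class ProgramEntry:
    """One row of the program management table: the vocabulary and
    result callback an application registered with the interface."""
    app_id: str
    vocabulary: FrozenSet[str]
    on_result: Callable[[str], None]

class SpeechInterface:
    def __init__(self, recognizer: Callable[[bytes, FrozenSet[str]], str]):
        self.table = {}                # app_id -> ProgramEntry
        self.active_app = None         # e.g. the focused window's app
        self.recognizer = recognizer

    def register(self, entry: ProgramEntry) -> None:
        self.table[entry.app_id] = entry

    def on_speech(self, audio: bytes) -> None:
        """Recognize against the active application's vocabulary and
        deliver the result to it, per the management table."""
        entry = self.table.get(self.active_app)
        if entry is not None:
            entry.on_result(self.recognizer(audio, entry.vocabulary))
```

Restricting recognition to the active application's registered vocabulary is the part that lets one recognizer serve many programs without cross-talk.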

Proceedings ArticleDOI
Ara V. Nefian, Luhong Liang, Xiaobo Pi, Liu Xiaoxiang, Crusoe Mao, Kevin Murphy
13 May 2002
TL;DR: This paper introduces a novel audio-visual fusion technique that uses a coupled hidden Markov model (HMM) to model the state asynchrony of the audio and visual observation sequences while still preserving their natural correlation over time.
Abstract: In recent years, several speech recognition systems that use visual information together with audio have shown a significant increase in performance over standard speech recognition systems. The use of visual features is justified both by the bimodality of speech generation and by the need for features that are invariant to acoustic noise perturbation. The audio-visual speech recognition system presented in this paper introduces a novel audio-visual fusion technique that uses a coupled hidden Markov model (HMM). The statistical properties of the coupled HMM allow us to model the state asynchrony of the audio and visual observation sequences while still preserving their natural correlation over time. The experimental results show that the coupled HMM outperforms the multistream HMM in audio-visual speech recognition.

252 citations
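The distinguishing property of the coupled HMM above is that each stream's next state is conditioned on the previous states of both streams, so the audio and video chains can drift out of sync while remaining statistically tied. An illustrative log-domain forward pass for a two-stream coupled HMM is below; the factored transition parameterization and all shapes are assumptions for the sketch, not the paper's exact formulation.

```python
import numpy as np
from scipy.special import logsumexp

def coupled_hmm_forward(log_b_a, log_b_v, log_A_a, log_A_v, log_pi):
    """Log-likelihood of an audio-visual observation sequence under a
    two-stream coupled HMM.

    log_b_a: (T, Na) audio frame log-likelihoods per audio state
    log_b_v: (T, Nv) video frame log-likelihoods per video state
    log_A_a: (Na, Nv, Na)  log P(a_t | a_{t-1}, v_{t-1})
    log_A_v: (Na, Nv, Nv)  log P(v_t | a_{t-1}, v_{t-1})
    log_pi:  (Na, Nv) joint initial log-probabilities
    """
    T = log_b_a.shape[0]
    # trans[i, j, k, l] = log P(a_t = k, v_t = l | a_{t-1} = i, v_{t-1} = j)
    trans = log_A_a[:, :, :, None] + log_A_v[:, :, None, :]
    # alpha[i, j] = log P(observations so far, a_t = i, v_t = j)
    alpha = log_pi + log_b_a[0][:, None] + log_b_v[0][None, :]
    for t in range(1, T):
        alpha = logsumexp(alpha[:, :, None, None] + trans, axis=(0, 1))
        alpha = alpha + log_b_a[t][:, None] + log_b_v[t][None, :]
    return logsumexp(alpha)
```

A multistream HMM would instead force both streams through a single shared state sequence, which is exactly the synchrony constraint the coupled model relaxes.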


Network Information
Related Topics (5)
Feature vector: 48.8K papers, 954.4K citations (83% related)
Recurrent neural network: 29.2K papers, 890K citations (82% related)
Feature extraction: 111.8K papers, 2.1M citations (81% related)
Signal processing: 73.4K papers, 983.5K citations (81% related)
Decoding methods: 65.7K papers, 900K citations (79% related)
Performance Metrics
No. of papers in the topic in previous years:
Year: Papers
2023: 165
2022: 468
2021: 283
2020: 475
2019: 484
2018: 420