Topic

Speaker recognition

About: Speaker recognition is a research topic. Over its lifetime, 14,990 publications have been published within this topic, receiving 310,061 citations.


Papers
Proceedings ArticleDOI
01 Dec 2015
TL;DR: NTT's CHiME-3 system is described, which integrates advanced speech enhancement and recognition techniques and achieves a 3.45% development error rate and a 5.83% evaluation error rate.
Abstract: CHiME-3 is a research community challenge organised in 2015 to evaluate speech recognition systems for mobile multi-microphone devices used in noisy daily environments. This paper describes NTT's CHiME-3 system, which integrates advanced speech enhancement and recognition techniques. Newly developed techniques include the use of spectral masks for acoustic beam-steering vector estimation and acoustic modelling with deep convolutional neural networks based on the "network in network" concept. In addition to these improvements, our system has several key differences from the official baseline system. The differences include multi-microphone training, dereverberation, and cross adaptation of neural networks with different architectures. The impacts that these techniques have on recognition performance are investigated. By combining these advanced techniques, our system achieves a 3.45% development error rate and a 5.83% evaluation error rate. Three simpler systems are also developed to perform evaluations with constrained set-ups.

259 citations
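The spectral-mask beamforming idea in the paper above generalizes beyond this system: a neural network estimates a time-frequency mask for speech, and the steering vector at each frequency is taken as the principal eigenvector of the mask-weighted spatial covariance matrix. A minimal numpy sketch of that general recipe follows; it is not NTT's implementation, and the function name and array shapes are assumptions for illustration.

```python
import numpy as np

def estimate_steering_vectors(stft, speech_mask):
    """Illustrative mask-based steering vector estimation.

    stft:        complex STFT, shape (channels, frames, freqs)
    speech_mask: values in [0, 1], shape (frames, freqs), e.g. from a
                 neural mask estimator (not shown)
    Returns (freqs, channels): per-frequency principal eigenvector of
    the mask-weighted spatial covariance matrix.
    """
    n_ch, n_frames, n_freq = stft.shape
    steering = np.zeros((n_freq, n_ch), dtype=complex)
    for f in range(n_freq):
        X = stft[:, :, f]                      # (channels, frames)
        w = speech_mask[:, f]                  # (frames,)
        # Mask-weighted spatial covariance of the speech component.
        R = (X * w) @ X.conj().T / max(w.sum(), 1e-8)
        # Principal eigenvector (eigh sorts eigenvalues ascending).
        _, vecs = np.linalg.eigh(R)
        steering[f] = vecs[:, -1]
    return steering
```

The resulting vectors would feed a delay-and-sum or MVDR beamformer; the mask estimator and the beamformer itself are out of scope here.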

Proceedings ArticleDOI
06 Apr 2003
TL;DR: The SuperSID project used pronunciation models, prosodic dynamics, pitch and duration features, phone streams, and conversational interactions to improve the accuracy of automatic speaker recognition on a defined NIST evaluation corpus and task.
Abstract: The area of automatic speaker recognition has been dominated by systems using only short-term, low-level acoustic information, such as cepstral features. While these systems have indeed produced very low error rates, they ignore other levels of information beyond low-level acoustics that convey speaker information. Recently published work has shown examples that such high-level information can be used successfully in automatic speaker recognition systems and has the potential to improve accuracy and add robustness. For the 2002 JHU CLSP summer workshop, the SuperSID project (http://www.clsp.jhu.edu/ws2002/groups/supersid/) was undertaken to exploit these high-level information sources and dramatically increase speaker recognition accuracy on a defined NIST evaluation corpus and task. The paper provides an overview of the structure, data, task, tools, and accomplishments of this project. Wide ranging approaches using pronunciation models, prosodic dynamics, pitch and duration features, phone streams, and conversational interactions were explored and developed. We show how these novel features and classifiers indeed provide complementary information and can be fused together to drive down the equal error rate on the 2001 NIST extended data task to 0.2% - a 71% relative reduction in error over the previous state of the art.

256 citations
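Two generic ingredients behind the 0.2% figure in the abstract above are score-level fusion and equal error rate (EER) measurement. The sketch below shows only the simplest weighted-sum fusion and a brute-force EER search; the workshop itself explored trained combiners, so treat the names and method as illustrative assumptions.

```python
import numpy as np

def fuse_scores(system_scores, weights):
    """Weighted-sum fusion of per-trial scores from several subsystems
    (e.g. cepstral, prosodic, and phone-stream classifiers).

    system_scores: shape (systems, trials); weights: shape (systems,)
    """
    return (np.asarray(weights)[:, None] * np.asarray(system_scores)).sum(axis=0)

def equal_error_rate(target_scores, nontarget_scores):
    """Brute-force search for the threshold where the false-accept
    rate and the false-reject rate are closest, returning their mean."""
    best_gap, eer = 2.0, 1.0
    for t in np.concatenate([target_scores, nontarget_scores]):
        far = np.mean(nontarget_scores >= t)   # impostors accepted
        frr = np.mean(target_scores < t)       # targets rejected
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer
```

Fusion pays off exactly when the subsystem errors are weakly correlated, which is the paper's point about complementary information.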

Proceedings ArticleDOI
06 Apr 2003
TL;DR: A new feature mapping technique is presented that maps feature vectors into a channel-independent space, learning its mapping parameters from a set of channel-dependent models derived from a channel-independent model via MAP adaptation.
Abstract: In speaker recognition applications, channel variability is a major cause of errors. Techniques in the feature, model and score domains have been applied to mitigate channel effects. In this paper we present a new feature mapping technique that maps feature vectors into a channel independent space. The feature mapping learns mapping parameters from a set of channel-dependent models derived from a channel-independent model via MAP adaptation. The technique is developed primarily for speaker verification, but can be applied for feature normalization in speech recognition applications. Results are presented on NIST landline and cellular telephone speech corpora where it is shown that feature mapping provides significant performance improvements over baseline systems and similar performance to Hnorm and speaker-model-synthesis (SMS).

255 citations
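The mapping described above has a compact core: score a frame against the detected channel-dependent GMM, pick the top-scoring Gaussian, and shift and scale the frame toward the matching Gaussian of the channel-independent root model. A sketch assuming diagonal covariances follows; channel detection is assumed to have already selected the channel-dependent model, and all array names are illustrative.

```python
import numpy as np

def map_features(features, ci_means, ci_stds, cd_means, cd_stds, weights):
    """Map frames from a channel-dependent space toward a
    channel-independent one, Gaussian by Gaussian.

    features: (frames, dims); all model arrays: (mixtures, dims);
    weights: (mixtures,). Diagonal covariances assumed.
    """
    mapped = np.empty_like(features)
    for n, x in enumerate(features):
        # Log-likelihood of the frame under each channel-dependent mixture.
        ll = (np.log(weights)
              - 0.5 * np.sum(np.log(2.0 * np.pi * cd_stds ** 2), axis=1)
              - 0.5 * np.sum(((x - cd_means) / cd_stds) ** 2, axis=1))
        i = np.argmax(ll)
        # Shift/scale using the paired channel-independent Gaussian.
        mapped[n] = (x - cd_means[i]) * (ci_stds[i] / cd_stds[i]) + ci_means[i]
    return mapped
```

Because the channel-dependent models come from MAP adaptation of the one root model, mixture i in each model corresponds to the same underlying acoustic class, which is what makes the per-Gaussian pairing meaningful.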

PatentDOI
TL;DR: A speech recognition interface system is presented that can handle a plurality of application programs simultaneously and realize convenient speech input and output modes suitable for applications such as window systems and speech mail systems.
Abstract: A speech recognition interface system capable of handling a plurality of application programs simultaneously, and realizing convenient speech input and output modes which are suitable for the applications in the window systems and the speech mail systems. The system includes a speech recognition unit for carrying out a speech recognition processing for a speech input made by a user to obtain a recognition result; a program management table for managing program management data indicating a speech recognition interface function required by each application program; and a message processing unit for exchanging messages with the plurality of application programs in order to specify an appropriate recognition vocabulary to be used in the speech recognition processing of the speech input to the speech recognition unit, and to transmit the recognition result for the speech input obtained by the speech recognition unit by using the appropriate recognition vocabulary to appropriate ones of the plurality of application programs, according to the program management data managed by the program management table.

255 citations
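The moving parts of the patent above, a program management table plus a message processing unit that routes recognition results, reduce to a small dispatch structure. The Python sketch below invents all names (ProgramEntry, SpeechInterface) and replaces message passing with direct callbacks; it illustrates only the table-driven vocabulary selection and result delivery, not the patent's actual protocol.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet

@dataclass
class ProgramEntry:
    """One row of the program management table: the vocabulary and
    result callback an application registered with the interface."""
    app_id: str
    vocabulary: FrozenSet[str]
    on_result: Callable[[str], None]

class SpeechInterface:
    def __init__(self, recognizer: Callable[[bytes, FrozenSet[str]], str]):
        self.table = {}                # app_id -> ProgramEntry
        self.active_app = None         # e.g. the focused window's app
        self.recognizer = recognizer

    def register(self, entry: ProgramEntry) -> None:
        self.table[entry.app_id] = entry

    def on_speech(self, audio: bytes) -> None:
        """Recognize against the active application's vocabulary and
        deliver the result to it, per the management table."""
        entry = self.table.get(self.active_app)
        if entry is not None:
            entry.on_result(self.recognizer(audio, entry.vocabulary))
```

Restricting recognition to the active application's registered vocabulary is the part that lets one recognizer serve many programs without cross-talk.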

Proceedings ArticleDOI
Ara V. Nefian, Luhong Liang, Xiaobo Pi, Liu Xiaoxiang, Crusoe Mao, Kevin Murphy
13 May 2002
TL;DR: This paper introduces a novel audio-visual fusion technique that uses a coupled hidden Markov model (HMM) to model the state asynchrony of the audio and visual observation sequences while still preserving their natural correlation over time.
Abstract: In recent years, several speech recognition systems that use visual information together with audio have shown a significant increase in performance over standard speech recognition systems. The use of visual features is justified both by the bimodality of speech generation and by the need for features that are invariant to acoustic noise perturbation. The audio-visual speech recognition system presented in this paper introduces a novel audio-visual fusion technique that uses a coupled hidden Markov model (HMM). The statistical properties of the coupled HMM allow us to model the state asynchrony of the audio and visual observation sequences while still preserving their natural correlation over time. The experimental results show that the coupled HMM outperforms the multistream HMM in audio-visual speech recognition.

252 citations
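The distinguishing property of the coupled HMM above is that each stream's next state is conditioned on the previous states of both streams, so the audio and video chains can drift out of sync while remaining statistically tied. An illustrative log-domain forward pass for a two-stream coupled HMM is below; the factored transition parameterization and all shapes are assumptions for the sketch, not the paper's exact formulation.

```python
import numpy as np
from scipy.special import logsumexp

def coupled_hmm_forward(log_b_a, log_b_v, log_A_a, log_A_v, log_pi):
    """Log-likelihood of an audio-visual observation sequence under a
    two-stream coupled HMM.

    log_b_a: (T, Na) audio frame log-likelihoods per audio state
    log_b_v: (T, Nv) video frame log-likelihoods per video state
    log_A_a: (Na, Nv, Na)  log P(a_t | a_{t-1}, v_{t-1})
    log_A_v: (Na, Nv, Nv)  log P(v_t | a_{t-1}, v_{t-1})
    log_pi:  (Na, Nv) joint initial log-probabilities
    """
    T = log_b_a.shape[0]
    # trans[i, j, k, l] = log P(a_t = k, v_t = l | a_{t-1} = i, v_{t-1} = j)
    trans = log_A_a[:, :, :, None] + log_A_v[:, :, None, :]
    # alpha[i, j] = log P(observations so far, a_t = i, v_t = j)
    alpha = log_pi + log_b_a[0][:, None] + log_b_v[0][None, :]
    for t in range(1, T):
        alpha = logsumexp(alpha[:, :, None, None] + trans, axis=(0, 1))
        alpha = alpha + log_b_a[t][:, None] + log_b_v[t][None, :]
    return logsumexp(alpha)
```

A multistream HMM would instead force both streams through a single shared state sequence, which is exactly the synchrony constraint the coupled model relaxes.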


Network Information
Related Topics (5)
Feature vector: 48.8K papers, 954.4K citations (83% related)
Recurrent neural network: 29.2K papers, 890K citations (82% related)
Feature extraction: 111.8K papers, 2.1M citations (81% related)
Signal processing: 73.4K papers, 983.5K citations (81% related)
Decoding methods: 65.7K papers, 900K citations (79% related)
Performance Metrics
No. of papers in the topic in previous years:
Year: Papers
2023: 165
2022: 468
2021: 283
2020: 475
2019: 484
2018: 420