scispace - formally typeset
Topic

Speaker recognition

About: Speaker recognition is a research topic. Over its lifetime, 14990 publications have been published within this topic, receiving 310061 citations.


Papers
Proceedings ArticleDOI
23 May 1989
TL;DR: A shift-tolerant neural network architecture for phoneme recognition based on LVQ2, an algorithm that pays close attention to approximating the optimal Bayes decision line in a discrimination task; the results suggest that LVQ2 could be the basis for a successful speech recognition system.
Abstract: The authors describe a shift-tolerant neural network architecture for phoneme recognition. The system is based on LVQ2, an algorithm which pays close attention to approximating the optimal Bayes decision line in a discrimination task. Recognition performances in the 98-99% correct range were obtained for LVQ2 networks aimed at speaker-dependent recognition of phonemes in small but ambiguous Japanese phonemic classes. A correct recognition rate of 97.7% was achieved by a single, larger LVQ2 network covering all Japanese consonants. These recognition results are at least as high as those obtained in the time delay neural network system and suggest that LVQ2 could be the basis for a successful speech recognition system.
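The LVQ2-style update at the heart of this approach is compact enough to sketch. The following is a minimal NumPy illustration of the LVQ2.1 rule, not the authors' implementation: the function name, learning rate, and window width are assumptions, and a real phoneme recognizer would add the shift-tolerant time-windowing described above.

```python
import numpy as np

def lvq2_update(prototypes, labels, x, y, lr=0.1, window=0.3):
    """One LVQ2.1 step: when sample x falls in a window around the
    decision boundary between its two nearest prototypes, and exactly
    one of them carries the correct label y, pull the correct prototype
    toward x and push the wrong one away. `prototypes` (k, d) is
    modified in place."""
    d = np.linalg.norm(prototypes - x, axis=1)
    i, j = np.argsort(d)[:2]                    # two closest prototypes
    s = (1 - window) / (1 + window)             # Kohonen's window test
    in_window = min(d[i] / d[j], d[j] / d[i]) > s
    if in_window and (labels[i] == y) != (labels[j] == y):
        c, w = (i, j) if labels[i] == y else (j, i)
        prototypes[c] += lr * (x - prototypes[c])   # attract correct
        prototypes[w] -= lr * (x - prototypes[w])   # repel incorrect
    return prototypes
```

Because the update only fires near the class boundary, training concentrates on exactly the ambiguous cases the paper targets (e.g. confusable Japanese phonemic classes).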

66 citations

Journal ArticleDOI
TL;DR: A novel age estimation system based on LSTM-RNNs that handles short utterances, can be easily deployed in a real-time architecture, and compares favorably with a state-of-the-art i-vector approach.
Abstract: Age estimation from speech has recently received increased interest as it is useful for many applications such as user-profiling, targeted marketing, or personalized call-routing. Such applications need to quickly estimate the age of the speaker and might greatly benefit from real-time capabilities. Long short-term memory (LSTM) recurrent neural networks (RNN) have been shown to outperform state-of-the-art approaches in related speech-based tasks, such as language identification or voice activity detection, especially when an accurate real-time response is required. In this paper, we propose a novel age estimation system based on LSTM-RNNs. This system is able to deal with short utterances (from 3 to 10 s) and it can be easily deployed in a real-time architecture. The proposed system has been tested and compared with a state-of-the-art i-vector approach using data from the NIST speaker recognition evaluation 2008 and 2010 data sets. Experiments on short duration utterances show a relative improvement of up to 28% in terms of mean absolute error of this new approach over the baseline system.
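The sequence-to-one shape of such a system (LSTM steps over acoustic frames, with the final hidden state regressed to a scalar age) can be sketched in plain NumPy. This is an illustrative forward pass under assumed parameter shapes, not the paper's model, which would be trained end-to-end in a deep-learning framework:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_age_estimate(frames, W, b, w_out, b_out):
    """Run a single LSTM layer over the acoustic frame sequence and map
    the final hidden state to one scalar (sequence-to-one regression).
    W has shape (4*hdim, d + hdim); b has shape (4*hdim,)."""
    hdim = len(b) // 4
    h = np.zeros(hdim)
    c = np.zeros(hdim)
    for x in frames:                       # one MFCC-like frame per step
        z = W @ np.concatenate([x, h]) + b
        i, f, g, o = np.split(z, 4)        # input, forget, cell, output gates
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
    return float(w_out @ h + b_out)        # scalar age estimate
```

Because the state is carried frame by frame, the same network handles a 3 s or a 10 s utterance without architectural changes, which is what makes the short-utterance, real-time setting attractive for recurrent models.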

65 citations

Patent
27 Mar 1981
TL;DR: Acoustic feature templates are stored for predetermined reference words; signals representing the correspondence of identified speakers' features with those templates are generated, and an unknown speaker is identified by comparing the correspondence of the unknown utterance's features with the stored templates for the recognized words.
Abstract: In a speaker recognition and verification arrangement, acoustic feature templates are stored for predetermined reference words. Each template is a standardized set of acoustic features for one word, formed for example by averaging the values of acoustic features from a plurality of speakers. Responsive to the utterances of identified speakers, a set of signals representative of the correspondence of the identified speaker's features with said feature templates of said reference words is generated. An utterance of an unknown speaker is analyzed and the reference word sequence of the utterance is identified. A set of signals representative of the correspondence of the unknown speaker's utterance features and the stored templates for the recognized words is generated. The unknown speaker is identified jointly responsive to the correspondence signals of the identified speakers and unknown speaker.
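The joint-correspondence idea can be illustrated with a small sketch. The names, feature vectors, and Euclidean distances below are hypothetical simplifications; the patent's templates and correspondence signals would be built from real acoustic features:

```python
import numpy as np

def correspondence(word_feats, templates):
    """Correspondence signal: distance of each per-word feature vector
    from the shared reference-word template for that word."""
    return np.array([np.linalg.norm(f - t)
                     for f, t in zip(word_feats, templates)])

def identify_speaker(unknown_feats, templates, enrolled):
    """Identify the enrolled speaker whose stored correspondence
    pattern is closest to that of the unknown utterance."""
    u = correspondence(unknown_feats, templates)
    scores = {spk: np.linalg.norm(correspondence(f, templates) - u)
              for spk, f in enrolled.items()}
    return min(scores, key=scores.get)
```

The key point the sketch preserves is that speakers are compared through their *pattern of deviation* from shared, speaker-averaged templates, not through raw features, so the templates themselves need be stored only once.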

65 citations

Proceedings ArticleDOI
01 Jan 2005
TL;DR: This paper uses delay-and-sum beamforming to fuse the signals from multiple distant microphones into a single enhanced signal; tests on the 2004 and 2005 NIST meetings evaluation databases show that the technique performs very well.
Abstract: One of the sub-tasks of the Spring 2004 and Spring 2005 NIST Meetings evaluations requires segmenting multi-party meetings into speaker-homogeneous regions using data from multiple distant microphones (the "MDM" sub-task). One approach to this task is to run a speaker segmentation system on each of the microphone channels separately, and then merge the results. This can be thought of as a many-to-one post-processing approach. In this paper we propose an alternative approach in which we use delay-and-sum beamforming techniques to fuse the signals from each of the multiple distant microphones into a single enhanced signal. This approach can be thought of as a many-to-one pre-processing approach. In the pre-processing approach we propose, the time delay of arrival (TDOA) between each of the multiple distant channels and a reference channel is computed incrementally using a window that steps through the signals from each of the multiple microphones. No information about the locations or setup of the microphones is required. Using the TDOA information, the channels are first aligned and then summed, and the resulting "enhanced" signal is clustered using our standard speaker diarization system. We test our approach on the 2004 and 2005 NIST meetings evaluation databases and show that the technique performs very well.
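The core of the pre-processing approach (estimate a TDOA per channel against a reference, align, and sum) can be sketched with a plain cross-correlation peak pick. This is an illustrative simplification: real meeting systems typically apply GCC-PHAT weighting and step the estimate through the signal in windows as described above, and the function names and lag limit here are assumptions.

```python
import numpy as np

def tdoa(ref, ch, max_lag):
    """Estimate the delay (in samples) of `ch` relative to `ref` by
    locating the cross-correlation peak within +/- max_lag samples."""
    n = len(ref) + len(ch) - 1
    X = np.fft.rfft(ref, n) * np.conj(np.fft.rfft(ch, n))
    cc = np.fft.irfft(X, n)
    cc = np.concatenate((cc[-max_lag:], cc[:max_lag + 1]))
    return int(np.argmax(cc)) - max_lag      # signed lag in samples

def delay_and_sum(channels, ref_idx=0, max_lag=160):
    """Align every distant channel to the reference using its TDOA,
    then average to form the single enhanced signal."""
    ref = channels[ref_idx]
    out = np.zeros_like(ref, dtype=float)
    for ch in channels:
        out += np.roll(ch, tdoa(ref, ch, max_lag))   # shift into alignment
    return out / len(channels)
```

As in the paper, no microphone geometry is needed: the delays are recovered from the signals themselves, and summing the aligned channels reinforces the dominant speaker while averaging down uncorrelated noise.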

65 citations


Network Information
Related Topics (5)
Feature vector: 48.8K papers, 954.4K citations (83% related)
Recurrent neural network: 29.2K papers, 890K citations (82% related)
Feature extraction: 111.8K papers, 2.1M citations (81% related)
Signal processing: 73.4K papers, 983.5K citations (81% related)
Decoding methods: 65.7K papers, 900K citations (79% related)
Performance Metrics
No. of papers in the topic in previous years:

Year    Papers
2023    165
2022    468
2021    283
2020    475
2019    484
2018    420