Topic
Speaker recognition
About: Speaker recognition is a research topic. Over its lifetime, 14,990 publications have been published within this topic, receiving 310,061 citations.
Papers published on a yearly basis
Papers
TL;DR: A novel parametrization of speech based on the AM-FM representation of the speech signal is presented, and the utility of these features is assessed in the context of speaker identification.
Abstract: This paper presents an experimental evaluation of different features for use in speaker identification. The features are tested using speech data provided by the CHAINS corpus, in a closed-set speaker identification task. The main objective of the paper is to present a novel parametrization of speech that is based on the AM-FM representation of the speech signal and to assess the utility of these features in the context of speaker identification. In order to explore the extent to which different instantaneous frequencies due to the presence of formants and harmonics in the speech signal may predict a speaker's identity, this work evaluates three different decompositions of the speech signal within the same AM-FM framework: a first setup has been used previously for formant tracking, a second setup is designed to enhance familiar resonances below 4000 Hz, and a third setup is designed to approximate the bandwidth scaling of the filters conventionally used in the extraction of Mel-frequency cepstral coefficients (MFCCs). From each of the proposed setups, parameters are extracted and used in a closed-set text-independent speaker identification task. The performance of the new featural representation is compared with results obtained adopting MFCC and RASTA-PLP features in the context of a generic Gaussian mixture model (GMM) classification system. In evaluating the novel features, we look selectively at information for speaker identification contained in the frequency ranges 0-4000 Hz and 4000-8000 Hz, as the instantaneous frequencies revealed by the AM-FM approach suggest the presence of structures not well known from conventional spectrographic analyses. The new parametrization performs as well as conventional MFCC parameters within the same reference system when tested and trained on modally voiced speech that is mismatched in both channel and style. When the testing material is whispered speech, the new parameters provide better results than any of the other features tested, although they remain far from ideal in this limiting case.
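The abstract does not specify the paper's exact AM-FM estimator, but the instantaneous amplitude and frequency it refers to can be illustrated with the standard analytic-signal (discrete Hilbert transform) construction. The function names below are hypothetical; this is a minimal sketch, not the paper's implementation:

```python
import numpy as np

def analytic_signal(x):
    """FFT-based analytic signal (discrete Hilbert transform)."""
    N = len(x)
    X = np.fft.fft(x)
    h = np.zeros(N)
    h[0] = 1.0
    if N % 2 == 0:
        h[N // 2] = 1.0
        h[1:N // 2] = 2.0
    else:
        h[1:(N + 1) // 2] = 2.0
    return np.fft.ifft(X * h)

def am_fm(x, sr):
    """Instantaneous amplitude (AM) and frequency in Hz (FM)
    of a narrowband (band-passed) signal."""
    z = analytic_signal(x)
    am = np.abs(z)
    phase = np.unwrap(np.angle(z))
    fm = np.diff(phase) * sr / (2 * np.pi)
    return am, fm

# Example: a pure 440 Hz tone should yield instantaneous frequency ~440 Hz.
sr = 8000
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 440 * t)
am, fm = am_fm(x, sr)
print(round(float(np.median(fm)), 1))  # → 440.0
```

In an actual AM-FM front end, this demodulation is applied per channel of a filterbank (the three "setups" above differ in how those filters are placed), and statistics of the per-channel AM/FM trajectories become the speaker features.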
144 citations
01 Dec 2012
TL;DR: This paper presents a strategy of using mixed-bandwidth training data to improve wideband speech recognition accuracy in the CD-DNN-HMM framework, and shows that DNNs provide the flexibility of using arbitrary features.
Abstract: Context-dependent deep neural network hidden Markov model (CD-DNN-HMM) is a recently proposed acoustic model that significantly outperformed Gaussian mixture model (GMM)-HMM systems in many large vocabulary speech recognition (LVSR) tasks. In this paper we present our strategy of using mixed-bandwidth training data to improve wideband speech recognition accuracy in the CD-DNN-HMM framework. We show that DNNs provide the flexibility of using arbitrary features. By using the Mel-scale log-filter bank features we not only achieve higher recognition accuracy than using MFCCs, but also can formulate the mixed-bandwidth training problem as a missing feature problem, in which several feature dimensions have no value when narrowband speech is presented. This treatment makes training CD-DNN-HMMs with mixed-bandwidth data an easy task since no bandwidth extension is needed. Our experiments on voice search data indicate that the proposed solution not only provides higher recognition accuracy for the wideband speech but also allows the same CD-DNN-HMM to recognize mixed-bandwidth speech. By exploiting mixed-bandwidth training data CD-DNN-HMM outperforms fMPE+BMMI trained GMM-HMM, which cannot benefit from using narrowband data, by 18.4%.
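The "missing feature" treatment described above can be realized very simply: narrowband utterances lack the high-frequency filterbank channels, so their feature vectors are expanded to the wideband dimensionality before being fed to the shared DNN input layer. The channel counts and zero-filling below are illustrative assumptions, not values from the paper:

```python
import numpy as np

# Hypothetical dimensions: wideband (16 kHz) features use 29 mel filterbank
# channels; narrowband (8 kHz) speech covers only the first 22 (below 4 kHz).
N_WB, N_NB = 29, 22

def pad_narrowband(feats_nb):
    """Fill the missing high-frequency channels of narrowband features with
    zeros so a single DNN input layer serves both bandwidths (one simple
    realization of the missing-feature treatment)."""
    frames = feats_nb.shape[0]
    padded = np.zeros((frames, N_WB), dtype=feats_nb.dtype)
    padded[:, :N_NB] = feats_nb
    return padded

nb = np.random.randn(100, N_NB)  # 100 frames of narrowband log-filterbank features
wb_shaped = pad_narrowband(nb)
print(wb_shaped.shape)  # → (100, 29)
```

Because the DNN learns directly from whichever dimensions carry information, no bandwidth extension of the narrowband audio is required, which is what makes mixed-bandwidth training "an easy task" in this framework.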
143 citations
12 Feb 1992
TL;DR: In this article, a speaker voice verification system uses a temporal decorrelation linear transformation and includes a collector for receiving speech inputs from an unknown speaker claiming a specific identity, a word-level speech features calculator operable to use a temporal decorrelation linear transformation for generating word-level speech feature vectors from such speech inputs, and word-level speech feature storage for storing word-level feature vectors known to belong to a speaker with the specific identity.
Abstract: A speaker voice verification system uses a temporal decorrelation linear transformation and includes a collector for receiving speech inputs from an unknown speaker claiming a specific identity, a word-level speech features calculator operable to use a temporal decorrelation linear transformation for generating word-level speech feature vectors from such speech inputs, word-level speech feature storage for storing word-level speech feature vectors known to belong to a speaker with the specific identity, a word-level vector scorer for generating a similarity score by comparing word-level speech feature vectors received from the unknown speaker with those retrieved from the word-level speech feature storage, and speaker verification decision circuitry for determining, based on the similarity score, whether the unknown speaker's identity is the same as that claimed. The word-level vector scorer further includes concatenation circuitry as well as a word-specific orthogonalizing linear transformer. Other systems and methods are also disclosed.
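The patent's pipeline (transform word-level features, score against stored vectors for the claimed identity, then threshold) can be sketched as follows. The transform matrix, pooling, cosine scoring, and threshold here are all illustrative stand-ins, not the patented temporal-decorrelation transform itself:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shapes: raw per-frame features of dimension 40,
# a linear transform projecting word-level features to 20 dimensions.
T = rng.standard_normal((20, 40))  # stands in for the learned decorrelating transform

def word_vector(raw_frames):
    """Pool the frames of one word and apply the linear transform
    (mean pooling is a crude stand-in for the patent's concatenation)."""
    return T @ raw_frames.mean(axis=0)

def similarity(v, ref):
    """Cosine similarity between the test vector and the stored
    reference vector for the claimed identity."""
    return float(v @ ref / (np.linalg.norm(v) * np.linalg.norm(ref)))

enrolled = word_vector(rng.standard_normal((30, 40)))       # stored for the claimed identity
test_same = enrolled + 0.01 * rng.standard_normal(20)       # a closely matching test vector
accept = similarity(test_same, enrolled) > 0.5              # hypothetical decision threshold
print(accept)  # → True
```

The decision step mirrors the patent's "speaker verification decision circuitry": accept the claim only when the similarity score clears the threshold.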
143 citations
TL;DR: This work presents a novel approach to on-line emotion recognition from speech using Long Short-Term Memory Recurrent Neural Networks, in which recognition is performed on low-level signal frames similar to those used for speech recognition.
Abstract: For many applications of emotion recognition, such as virtual agents, the system must select responses while the user is speaking. This requires reliable on-line recognition of the user’s affect. However, most emotion recognition systems are based on turn-wise processing. We present a novel approach to on-line emotion recognition from speech using Long Short-Term Memory Recurrent Neural Networks. Emotion is recognised frame-wise in a two-dimensional valence-activation continuum. In contrast to current state-of-the-art approaches, recognition is performed on low-level signal frames, similar to those used for speech recognition. No statistical functionals are applied to low-level feature contours. Framing at a higher level is therefore unnecessary and regression outputs can be produced in real-time for every low-level input frame. We also investigate the benefits of including linguistic features on the signal frame level obtained by a keyword spotter.
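What makes the approach on-line is that the recurrent network emits a regression output at every input frame, with no turn-level pooling. A minimal sketch of that behaviour, using a single untrained LSTM cell in plain numpy (all sizes and weights are illustrative assumptions, not the paper's architecture):

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class TinyLSTMRegressor:
    """One LSTM cell plus a linear readout that emits a (valence, activation)
    pair for every input frame -- weights are random and untrained."""
    def __init__(self, n_in, n_hid, n_out=2):
        s = 0.1
        self.W = s * rng.standard_normal((4 * n_hid, n_in + n_hid))  # gates: i, f, o, g
        self.b = np.zeros(4 * n_hid)
        self.Wy = s * rng.standard_normal((n_out, n_hid))
        self.n_hid = n_hid

    def run(self, frames):
        h = np.zeros(self.n_hid)
        c = np.zeros(self.n_hid)
        outputs = []
        for x in frames:                  # one low-level feature frame at a time
            z = self.W @ np.concatenate([x, h]) + self.b
            i, f, o = (sigmoid(z[k * self.n_hid:(k + 1) * self.n_hid]) for k in range(3))
            g = np.tanh(z[3 * self.n_hid:])
            c = f * c + i * g
            h = o * np.tanh(c)
            outputs.append(self.Wy @ h)   # regression output per frame, no pooling
        return np.array(outputs)

net = TinyLSTMRegressor(n_in=13, n_hid=16)
preds = net.run(rng.standard_normal((50, 13)))  # 50 frames of MFCC-like features
print(preds.shape)  # → (50, 2)
```

Because an output is available after each frame, a dialogue system can react to the user's affect mid-utterance instead of waiting for the turn to end.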
143 citations
01 Nov 2006
TL;DR: The main components of audio-visual biometric systems are described, existing systems and their performance are reviewed, and future research and development directions in this area are discussed.
Abstract: Biometric characteristics can be utilized in order to enable reliable and robust-to-impostor-attacks person recognition. Speaker recognition technology is commonly utilized in various systems enabling natural human computer interaction. The majority of speaker recognition systems rely only on acoustic information, ignoring the visual modality. However, visual information conveys correlated and complementary information to the audio information, and its integration into a recognition system can potentially increase the system's performance, especially in the presence of adverse acoustic conditions. Acoustic and visual biometric signals, such as the person's voice and face, can be obtained using unobtrusive and user-friendly procedures and low-cost sensors. Developing unobtrusive biometric systems makes biometric technology more socially acceptable and accelerates its integration into everyday life. In this paper, we describe the main components of audio-visual biometric systems, review existing systems and their performance, and discuss future research and development directions in this area.
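One common way such systems integrate the two modalities is score-level fusion: each modality produces a per-speaker score, and a weighted combination is thresholded. The weights, scores, and threshold below are purely illustrative, a minimal sketch of why the visual channel helps under acoustic noise:

```python
# Hypothetical score-level fusion: each modality yields a similarity or
# log-likelihood score for the claimed speaker, combined with a fixed weight.
def fuse(audio_score, visual_score, w_audio=0.7):
    return w_audio * audio_score + (1.0 - w_audio) * visual_score

# Under noisy acoustics the audio score for the true speaker drops, but the
# unaffected visual score keeps the fused score well above zero.
clean = fuse(2.1, 1.8)
noisy = fuse(0.4, 1.8)
print(round(clean, 2), round(noisy, 2))  # → 2.01 0.82
```

More elaborate systems make the weight adaptive, down-weighting whichever modality is currently less reliable (e.g. audio in noise, video in poor lighting).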
142 citations