Topic

Speaker recognition

About: Speaker recognition is a research topic. Over its lifetime, 14,990 publications have been published within this topic, receiving 310,061 citations.


Papers
Journal Article
TL;DR: The use of continuous prosodic features for speaker recognition is introduced, and it is shown how they can be modeled with a standard Gaussian mixture model and joint factor analysis.
Abstract: In this paper, we introduce the use of continuous prosodic features for speaker recognition, and we show how they can be modeled using joint factor analysis. Similar features have been successfully used in language identification. These prosodic features are pitch and energy contours spanning a syllable-like unit. They are extracted using a basis consisting of Legendre polynomials. Since the feature vectors are continuous (rather than discrete), they can be modeled using a standard Gaussian mixture model (GMM). Furthermore, speaker and session variability effects can be modeled in the same way as in conventional joint factor analysis. We find that the best results are obtained when we use the information about the pitch, energy, and the duration of the unit all together. Testing on the core condition of NIST 2006 speaker recognition evaluation data gives an equal error rate of 16.6% and 14.6%, with prosodic features alone, for all trials and English-only trials, respectively. When the prosodic system is fused with a state-of-the-art cepstral joint factor analysis system, we obtain a relative improvement of 8% (all trials) and 12% (English only) compared to the cepstral system alone.

141 citations
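A minimal sketch of the contour parameterization described in the abstract above: a pitch (or energy) contour spanning a syllable-like unit is projected onto a low-order Legendre polynomial basis, and the resulting coefficients form one continuous feature vector that a GMM can model. The function name and polynomial order are illustrative assumptions, not the authors' exact configuration.

```python
import numpy as np
from numpy.polynomial import legendre

def legendre_contour_features(contour, order=5):
    """Fit a Legendre polynomial basis to a pitch or energy contour
    spanning one syllable-like unit and return the coefficients.

    The time axis is mapped onto [-1, 1], the natural domain of the
    Legendre polynomials; `order` is an assumed value, not the
    setting used in the paper."""
    t = np.linspace(-1.0, 1.0, num=len(contour))
    # legfit returns order + 1 coefficients (degrees 0..order).
    return legendre.legfit(t, contour, deg=order)

# Example: a synthetic rising-falling pitch contour over 30 frames.
frames = np.linspace(0.0, 1.0, 30)
pitch_hz = 120 + 40 * np.sin(np.pi * frames)
features = legendre_contour_features(pitch_hz, order=5)
print(features.shape)  # (6,) -> one continuous feature vector per unit
```

Vectors like this (optionally concatenated with the corresponding energy coefficients and the unit duration) are what the paper models with a GMM and joint factor analysis.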

Proceedings Article
15 Apr 2007
TL;DR: The major aspects of emotion recognition are addressed in view of potential applications in the field, to benchmark today's emotion recognition systems and bridge the gap between commercial interest and current performance: acted vs. spontaneous speech, realistic emotions, noise and microphone conditions, and speaker independence.
Abstract: As automatic emotion recognition based on speech matures, new challenges can be faced. We therefore address the major aspects in view of potential applications in the field, to benchmark today's emotion recognition systems and bridge the gap between commercial interest and current performance: acted vs. spontaneous speech, realistic emotions, noise and microphone conditions, and speaker independence. Three different data sets are used: the Berlin Emotional Speech Database, the Danish Emotional Speech Database, and the spontaneous AIBO Emotion Corpus. By using different feature types, such as word- or turn-based statistics, manual versus forced alignment, and optimization techniques, we show how best to cope with this demanding task and how noise addition or different microphone positions affect emotion recognition.

141 citations
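As an illustration of the turn-based statistics mentioned above, the sketch below collapses frame-level pitch and energy tracks into a fixed-length vector of per-turn functionals. The particular functionals and feature names are assumptions chosen for illustration, not the exact feature set evaluated in the paper.

```python
import numpy as np

def turn_statistics(frame_features):
    """Collapse a (num_frames, num_features) array of frame-level
    features (e.g. pitch, energy) into one fixed-length vector of
    per-turn functionals."""
    funcs = [np.mean, np.std, np.min, np.max]
    stats = [f(frame_features, axis=0) for f in funcs]
    # Range = max - min, a common additional functional.
    stats.append(frame_features.max(axis=0) - frame_features.min(axis=0))
    return np.concatenate(stats)

# Example: 120 frames of [pitch_hz, log_energy] for one speaker turn.
turn = np.column_stack([
    150 + 20 * np.random.randn(120),   # pitch track
    -5 + 0.5 * np.random.randn(120),   # log energy track
])
vector = turn_statistics(turn)
print(vector.shape)  # (10,) -> 5 functionals x 2 features
```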

Proceedings Article
21 Apr 1997
TL;DR: A new approach to automatic speech recognition based on independent class-conditional probability estimates in several frequency sub-bands is presented, shown to be especially applicable to environments which cause partial corruption of the frequency spectrum of the signal.
Abstract: A new approach to automatic speech recognition based on independent class-conditional probability estimates in several frequency sub-bands is presented. The approach is shown to be especially applicable to environments which cause partial corruption of the frequency spectrum of the signal. Some of the issues involved in the implementation of the approach are also addressed.

140 citations
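A minimal sketch of the recombination step implied by the abstract above: class-conditional likelihoods are estimated independently in each frequency sub-band and merged under an independence assumption by combining log-likelihoods, optionally with per-band reliability weights when part of the spectrum is corrupted. The weighting scheme shown here is an assumption for illustration, not the paper's exact recombination rule.

```python
import numpy as np

def combine_subband_scores(log_likelihoods, band_weights=None):
    """Combine per-band class-conditional log-likelihoods.

    log_likelihoods: array of shape (num_bands, num_classes), the
        independent estimates from each frequency sub-band.
    band_weights: optional per-band reliability weights, useful when
        some bands are corrupted by narrow-band noise."""
    log_likelihoods = np.asarray(log_likelihoods)
    if band_weights is None:
        band_weights = np.ones(log_likelihoods.shape[0])
    # Independence assumption: a product of band likelihoods becomes a
    # (weighted) sum of their logs.
    combined = np.average(log_likelihoods, axis=0, weights=band_weights)
    return combined.argmax(), combined

# Example: 4 sub-bands scoring 3 classes; band 3 is down-weighted
# because it is presumed corrupted.
scores = np.log(np.random.dirichlet(np.ones(3), size=4))
best_class, fused = combine_subband_scores(scores, band_weights=[1, 1, 0.2, 1])
```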

Journal Article
TL;DR: An approach to voice-characteristics conversion for an HMM-based text-to-speech synthesis system using speaker interpolation, which can synthesize speech with various voice qualities without requiring a large database in the synthesis phase.
Abstract: This paper describes an approach to voice-characteristics conversion for an HMM-based text-to-speech synthesis system using speaker interpolation. Although most text-to-speech synthesis systems that synthesize speech by concatenating speech units can produce speech of acceptable quality, they still cannot synthesize speech with varied voice qualities such as speaker individualities and emotions; to control speaker individualities and emotions, they therefore need a large database recording speech units with various voice characteristics for the synthesis phase. Our system, on the other hand, synthesizes speech with an untrained speaker's voice quality by interpolating HMM parameters among the HMM sets of several representative speakers. Accordingly, it can synthesize speech with various voice qualities without a large database in the synthesis phase. The HMM interpolation technique is derived from a probabilistic similarity measure for HMMs. The results of subjective experiments show that the voice quality of the synthesized speech can be changed gradually from one speaker's to another's by changing the interpolation ratio.

140 citations
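A minimal sketch of the interpolation idea described above: the output-distribution parameters of several representative speakers' HMM sets (here simplified to single Gaussians per state) are mixed with an interpolation ratio to obtain an intermediate, "untrained" voice. Treating this as a plain weighted combination of means and variances is an illustrative simplification of the paper's derivation from a probabilistic similarity measure.

```python
import numpy as np

def interpolate_hmm_gaussians(means, variances, ratios):
    """Interpolate single-Gaussian HMM output distributions.

    means, variances: arrays of shape (num_speakers, num_states, dim)
        taken from the representative speakers' HMM sets.
    ratios: interpolation weights over speakers; normalized to sum to 1."""
    ratios = np.asarray(ratios, dtype=float)
    ratios = ratios / ratios.sum()
    w = ratios[:, None, None]
    new_mean = (w * means).sum(axis=0)
    # Simplified variance combination (weighted average); the paper
    # derives its combination from the similarity measure instead.
    new_var = (w * variances).sum(axis=0)
    return new_mean, new_var

# Example: blend two speakers' 5-state, 3-dimensional HMM sets 70/30;
# sweeping the ratio morphs the voice gradually from one to the other.
m = np.random.randn(2, 5, 3)
v = np.abs(np.random.randn(2, 5, 3)) + 0.1
mean_mix, var_mix = interpolate_hmm_gaussians(m, v, ratios=[0.7, 0.3])
```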

Journal Article
TL;DR: The results show that observing a specific person talking for 2 min improves subsequent auditory-only speech and speaker recognition for this person, and suggest that this optimization is based on speaker-specific audiovisual internal models, which are used to simulate a talking face.
Abstract: Human face-to-face communication is essentially audiovisual. Typically, people talk to us face-to-face, providing concurrent auditory and visual input. Understanding someone is easier when there is visual input, because visual cues like mouth and tongue movements provide complementary information about speech content. Here, we hypothesized that, even in the absence of visual input, the brain optimizes both auditory-only speech and speaker recognition by harvesting speaker-specific predictions and constraints from distinct visual face-processing areas. To test this hypothesis, we performed behavioral and neuroimaging experiments in two groups: subjects with a face recognition deficit (prosopagnosia) and matched controls. The results show that observing a specific person talking for 2 min improves subsequent auditory-only speech and speaker recognition for this person. In both prosopagnosics and controls, behavioral improvement in auditory-only speech recognition was based on an area typically involved in face-movement processing. Improvement in speaker recognition was only present in controls and was based on an area involved in face-identity processing. These findings challenge current unisensory models of speech processing, because they show that, in auditory-only speech, the brain exploits previously encoded audiovisual correlations to optimize communication. We suggest that this optimization is based on speaker-specific audiovisual internal models, which are used to simulate a talking face.

140 citations


Network Information
Related Topics (5)
Feature vector: 48.8K papers, 954.4K citations, 83% related
Recurrent neural network: 29.2K papers, 890K citations, 82% related
Feature extraction: 111.8K papers, 2.1M citations, 81% related
Signal processing: 73.4K papers, 983.5K citations, 81% related
Decoding methods: 65.7K papers, 900K citations, 79% related
Performance Metrics
No. of papers in the topic in previous years:

Year    Papers
2023    165
2022    468
2021    283
2020    475
2019    484
2018    420