Topic

Speaker recognition

About: Speaker recognition is a research topic. Over its lifetime, 14,990 publications have been published within this topic, receiving 310,061 citations.


Papers
Journal ArticleDOI
TL;DR: This paper introduces a new algorithm that automatically locates the mouth region using color and motion information, segments the lip region using both color and edge information with Markov random fields, and presents comparisons of various visual features to explore their impact on recognition accuracy.
Abstract: There has been growing interest in introducing speech as a new modality into the human-computer interface (HCI). Motivated by the multimodal nature of speech, the visual component is considered to yield information that is not always present in the acoustic signal and enables improved system performance over acoustic-only methods, especially in noisy environments. In this paper, we investigate the usefulness of visual speech information in HCI related applications. We first introduce a new algorithm for automatically locating the mouth region by using color and motion information and segmenting the lip region by making use of both color and edge information based on Markov random fields. We then derive a relevant set of visual speech parameters and incorporate them into a recognition engine. We present various visual feature performance comparisons to explore their impact on the recognition accuracy, including the lip inner contour and the visibility of the tongue and teeth. By using a common visual feature set, we demonstrate two applications that exploit speechreading in a joint audio-visual speech signal processing task: speech recognition and speaker verification. The experimental results based on two databases demonstrate that the visual information is highly effective for improving recognition performance over a variety of acoustic noise levels.
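As a rough, hedged illustration of the front-end idea of fusing color and motion cues to localize the mouth region, the Python sketch below uses an assumed OpenCV pipeline. It is not the authors' method (the paper segments the lip region with Markov random fields over color and edge information); the HSV lip-color bounds and motion threshold are guesses for illustration only.

```python
# Hedged sketch: localize a candidate mouth region by fusing color and motion
# cues. NOT the paper's MRF-based method; the HSV lip-color bounds, motion
# threshold, and OpenCV pipeline are illustrative assumptions.
import cv2
import numpy as np

def locate_mouth_roi(prev_frame, curr_frame):
    """Return a bounding box (x, y, w, h) for a candidate mouth region, or None."""
    # Color cue: roughly reddish, lip-like hues in HSV space (bounds are guesses).
    hsv = cv2.cvtColor(curr_frame, cv2.COLOR_BGR2HSV)
    lip_mask = cv2.inRange(hsv, np.array([0, 60, 60]), np.array([15, 255, 255]))

    # Motion cue: frame differencing highlights the articulating mouth.
    gray_prev = cv2.cvtColor(prev_frame, cv2.COLOR_BGR2GRAY)
    gray_curr = cv2.cvtColor(curr_frame, cv2.COLOR_BGR2GRAY)
    _, motion_mask = cv2.threshold(cv2.absdiff(gray_curr, gray_prev),
                                   20, 255, cv2.THRESH_BINARY)

    # Fuse both cues and keep the largest connected region as the mouth ROI.
    fused = cv2.bitwise_and(lip_mask, motion_mask)
    contours, _ = cv2.findContours(fused, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return None
    return cv2.boundingRect(max(contours, key=cv2.contourArea))
```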

88 citations

Posted Content
TL;DR: Experimental results suggest that simple repetition and random time-reversion of utterances can reduce prediction errors by up to 18%, and that the proposed logistic margin loss function leads to unified embeddings with state-of-the-art identification and competitive verification accuracies.
Abstract: Incremental improvements in accuracy of Convolutional Neural Networks are usually achieved through use of deeper and more complex models trained on larger datasets. However, enlarging dataset and models increases the computation and storage costs and cannot be done indefinitely. In this work, we seek to improve the identification and verification accuracy of a text-independent speaker recognition system without use of extra data or deeper and more complex models by augmenting the training and testing data, finding the optimal dimensionality of embedding space and use of more discriminative loss functions. Results of experiments on VoxCeleb dataset suggest that: (i) Simple repetition and random time-reversion of utterances can reduce prediction errors by up to 18%. (ii) Lower dimensional embeddings are more suitable for verification. (iii) Use of proposed logistic margin loss function leads to unified embeddings with state-of-the-art identification and competitive verification accuracies.
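As a hedged illustration only (not the authors' code), the sketch below shows what the two augmentations described above, simple repetition and random time-reversion of an utterance, might look like on a raw 1-D waveform; the array representation and the application probabilities are assumptions.

```python
# Hedged sketch of the two augmentations described in the abstract:
# repeating an utterance and randomly reversing it in time. The 1-D numpy
# waveform representation and the probabilities are assumptions.
import numpy as np

def augment_utterance(waveform, rng, repeat_prob=0.5, reverse_prob=0.5):
    """Return an augmented copy of a 1-D waveform array."""
    out = waveform.copy()
    if rng.random() < repeat_prob:
        # Simple repetition: concatenate the utterance with itself.
        out = np.concatenate([out, out])
    if rng.random() < reverse_prob:
        # Random time-reversion: play the utterance backwards.
        out = out[::-1].copy()
    return out

# Usage (illustrative):
# rng = np.random.default_rng(0)
# augmented = augment_utterance(np.asarray(signal, dtype=np.float32), rng)
```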

88 citations

Proceedings Article
01 May 2008
TL;DR: A new, linguistically annotated video database for automatic sign language recognition is presented: the RWTH-BOSTON-400 corpus, which consists of 843 sentences, several speakers, and separate subsets for training, development, and testing.
Abstract: A new, linguistically annotated, video database for automatic sign language recognition is presented. The new RWTH-BOSTON-400 corpus, which consists of 843 sentences, several speakers and separate subsets for training, development, and testing, is described in detail. For evaluation and benchmarking of automatic sign language recognition, large corpora are needed. Recent research has focused mainly on isolated sign language recognition methods using video sequences that have been recorded under lab conditions using special hardware like data gloves. Such databases have generally consisted of only one speaker and thus have been speaker-dependent, and have had only small vocabularies. A new database access interface, which was designed and created to provide fast access to the database statistics and content, makes it possible to easily browse and retrieve particular subsets of the video database. Preliminary baseline results on the new corpora are presented. In contradistinction to other research in this area, all databases presented in this paper will be publicly available.

88 citations

Proceedings ArticleDOI
Hagai Aronowitz
04 May 2014
TL;DR: This work analyzes the sources of degradation for a particular setup in the context of an i-vector PLDA system and concludes that the main source of degradation is an i-vector dataset shift, which is compensated with inter-dataset variability compensation (IDVC) based on the nuisance attribute projection (NAP) method.
Abstract: Recently satisfactory results have been obtained in NIST speaker recognition evaluations. These results are mainly due to accurate modeling of a very large development dataset provided by LDC. However, for many realistic scenarios the use of this development dataset is limited due to a dataset mismatch. In such cases, collection of a large enough dataset is infeasible. In this work we analyze the sources of degradation for a particular setup in the context of an i-vector PLDA system and conclude that the main source for degradation is an i-vector dataset shift. As a remedy, we introduce inter dataset variability compensation (IDVC) to explicitly compensate for dataset shift in the i-vector space. This is done using the nuisance attribute projection (NAP) method. Using IDVC we managed to reduce error dramatically by more than 50% for the domain mismatch setup.
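As a hedged sketch of the IDVC idea described above (not the authors' implementation), the code below estimates a dataset-shift subspace from per-dataset i-vector means and removes it with a NAP-style projection; the interface and the number of removed directions are assumptions.

```python
# Hedged sketch of inter-dataset variability compensation (IDVC) via a
# nuisance attribute projection (NAP). The function interface and the choice
# of k are assumptions; the paper's exact recipe may differ.
import numpy as np

def idvc_projection(ivectors_by_dataset, k=2):
    """Build a projector that removes the subspace spanned by dataset means.

    ivectors_by_dataset: list of (n_i, d) arrays, one per development dataset.
    k: number of nuisance directions to remove (k <= number of datasets).
    """
    means = np.stack([x.mean(axis=0) for x in ivectors_by_dataset])  # (m, d)
    centered = means - means.mean(axis=0, keepdims=True)
    # Top right-singular vectors of the centered dataset means span the
    # dataset-shift (nuisance) subspace.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    v = vt[:k].T                                                      # (d, k)
    return np.eye(means.shape[1]) - v @ v.T                           # (d, d)

def apply_idvc(projection, ivectors):
    """Project i-vectors onto the complement of the nuisance subspace."""
    return ivectors @ projection.T
```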

88 citations

Journal Article
TL;DR: This paper introduces speaker recognition in general and discusses its relevant parameters in relation to system performance.
Abstract: The explosive growth of information technology in the last decade has made a considerable impact on the design and construction of systems for human-machine communication, which is becoming increasingly important in many aspects of life. Amongst other speech processing tasks, a great deal of attention has been devoted to developing procedures that identify people from their voices, and the design and construction of speaker recognition systems has been a fascinating enterprise pursued over many decades. This paper introduces speaker recognition in general and discusses its relevant parameters in relation to system performance.

88 citations


Network Information
Related Topics (5)
Feature vector: 48.8K papers, 954.4K citations, 83% related
Recurrent neural network: 29.2K papers, 890K citations, 82% related
Feature extraction: 111.8K papers, 2.1M citations, 81% related
Signal processing: 73.4K papers, 983.5K citations, 81% related
Decoding methods: 65.7K papers, 900K citations, 79% related
Performance Metrics
No. of papers in the topic in previous years:
Year    Papers
2023    165
2022    468
2021    283
2020    475
2019    484
2018    420