
Showing papers on "Speaker diarisation published in 2016"


Posted Content
TL;DR: This work presents persona-based models for handling the issue of speaker consistency in neural response generation that yield qualitative performance improvements in both perplexity and BLEU scores over baseline sequence-to-sequence models.
Abstract: We present persona-based models for handling the issue of speaker consistency in neural response generation. A speaker model encodes personas in distributed embeddings that capture individual characteristics such as background information and speaking style. A dyadic speaker-addressee model captures properties of interactions between two interlocutors. Our models yield qualitative performance improvements in both perplexity and BLEU scores over baseline sequence-to-sequence models, with similar gains in speaker consistency as measured by human judges.
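The core mechanism of the speaker model can be illustrated with a short sketch; this assumes a PyTorch LSTM decoder with illustrative layer sizes and names, not the authors' implementation:

```python
# Minimal sketch (not the authors' code): a response decoder that conditions every
# generation step on a learned persona embedding for the responding speaker.
import torch
import torch.nn as nn

class PersonaDecoder(nn.Module):
    def __init__(self, vocab_size, n_speakers, emb_dim=128, spk_dim=64, hid_dim=256):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, emb_dim)
        self.spk_emb = nn.Embedding(n_speakers, spk_dim)   # distributed persona vector
        self.rnn = nn.LSTM(emb_dim + spk_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, tokens, speaker_ids, state=None):
        # tokens: (batch, time); speaker_ids: (batch,)
        t = self.tok_emb(tokens)
        s = self.spk_emb(speaker_ids).unsqueeze(1).expand(-1, t.size(1), -1)
        h, state = self.rnn(torch.cat([t, s], dim=-1), state)   # persona fed at every step
        return self.out(h), state

# Next-token logits for two response prefixes generated as speaker 3 and speaker 7.
dec = PersonaDecoder(vocab_size=1000, n_speakers=10)
logits, _ = dec(torch.randint(0, 1000, (2, 5)), torch.tensor([3, 7]))
```

A speaker-addressee variant would condition on a combination of two such embeddings rather than one.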

647 citations


Journal ArticleDOI
TL;DR: This paper presents and discusses various metrics proposed for evaluation of polyphonic sound event detection systems used in realistic situations where there are typically multiple sound sources active simultaneously.
Abstract: This paper presents and discusses various metrics proposed for evaluation of polyphonic sound event detection systems used in realistic situations where there are typically multiple sound sources active simultaneously. The system output in this case contains overlapping events, marked as multiple sounds detected as being active at the same time. The polyphonic system output requires a suitable procedure for evaluation against a reference. Metrics from neighboring fields such as speech recognition and speaker diarization can be used, but they need to be partially redefined to deal with the overlapping events. We present a review of the most common metrics in the field and the way they are adapted and interpreted in the polyphonic case. We discuss segment-based and event-based definitions of each metric and explain the consequences of instance-based and class-based averaging using a case study. In parallel, we provide a toolbox containing implementations of the presented metrics.
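The segment-based error rate discussed in the paper can be made concrete with a small sketch; the function below is an independent illustration of the standard definition (substitutions, deletions and insertions counted per segment), not the authors' toolbox:

```python
# Minimal sketch: segment-based error rate for polyphonic sound event detection.
# reference/estimated: one set of active event labels per fixed-length segment.
def segment_error_rate(reference, estimated):
    S = D = I = N = 0
    for ref, est in zip(reference, estimated):
        fn = len(ref - est)        # events missed in this segment
        fp = len(est - ref)        # events falsely detected in this segment
        S += min(fn, fp)           # a miss paired with a false alarm counts as a substitution
        D += max(0, fn - fp)       # remaining misses are deletions
        I += max(0, fp - fn)       # remaining false alarms are insertions
        N += len(ref)              # number of reference events
    return (S + D + I) / max(N, 1)

# Two segments: {'speech','dog'} vs {'speech'}, then {'car'} vs {'car','dog'}.
print(segment_error_rate([{'speech', 'dog'}, {'car'}], [{'speech'}, {'car', 'dog'}]))  # ~0.67
```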

493 citations


Proceedings ArticleDOI
20 Mar 2016
TL;DR: A data-driven, integrated approach to speaker verification, which maps a test utterance and a few reference utterances directly to a single score for verification and jointly optimizes the system's components using the same evaluation protocol and metric as at test time.
Abstract: In this paper we present a data-driven, integrated approach to speaker verification, which maps a test utterance and a few reference utterances directly to a single score for verification and jointly optimizes the system's components using the same evaluation protocol and metric as at test time. Such an approach will result in simple and efficient systems, requiring little domain-specific knowledge and making few model assumptions. We implement the idea by formulating the problem as a single neural network architecture, including the estimation of a speaker model on only a few utterances, and evaluate it on our internal "Ok Google" benchmark for text-dependent speaker verification. The proposed approach appears to be very effective for big data applications like ours that require highly accurate, easy-to-maintain systems with a small footprint.
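A sketch of the verification step only (averaging a few enrollment embeddings into a speaker model and cosine-scoring the test embedding); the paper's contribution is training this pipeline end to end, which the sketch does not show, and the embedding dimension and threshold below are illustrative:

```python
# Minimal sketch (not the paper's network): score a test utterance against a speaker
# model estimated from only a few reference (enrollment) utterance embeddings.
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def verify(test_embedding, enrollment_embeddings, threshold=0.7):
    speaker_model = np.mean(enrollment_embeddings, axis=0)   # few-utterance speaker model
    score = cosine(test_embedding, speaker_model)
    return score, score >= threshold

# Hypothetical 256-dimensional utterance embeddings from some encoder network.
enroll = [np.random.randn(256) for _ in range(3)]
score, accepted = verify(np.random.randn(256), enroll)
```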

378 citations


Proceedings ArticleDOI
Shi-Xiong Zhang1, Zhuo Chen2, Yong Zhao1, Jinyu Li1, Yifan Gong1 
01 Dec 2016
TL;DR: In this article, a speaker-discriminative CNN is used to extract noise-robust frame-level features, which are combined into an utterance-level speaker vector through an attention mechanism.
Abstract: A new type of end-to-end system for text-dependent speaker verification is presented in this paper. Previously, using a phonetically discriminative / speaker-discriminative DNN as a feature extractor for speaker verification has shown promising results. The extracted frame-level (bottleneck, posterior or d-vector) features are equally weighted and aggregated to compute an utterance-level speaker representation (d-vector or i-vector). In this work we use a speaker-discriminative CNN to extract noise-robust frame-level features. These features are combined to form an utterance-level speaker vector through an attention mechanism. The proposed attention model uses the speaker-discriminative information and the phonetic information to learn the weights. The whole system, including the CNN and the attention model, is jointly optimized using an end-to-end criterion. The training algorithm imitates the evaluation process exactly, directly mapping a test utterance and a few target speaker utterances into a single verification score. The algorithm can also select the most similar impostor for each target speaker to train the network. We demonstrate the effectiveness of the proposed end-to-end system on the Windows 10 "Hey Cortana" speaker verification task.
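The attention pooling that turns frame-level features into an utterance-level speaker vector can be sketched as follows; the shapes and the scoring network are assumptions, and the real model also feeds phonetic information into the weight computation:

```python
# Minimal sketch (illustrative, not the authors' model): learned attention weights
# aggregate frame-level speaker features into a single utterance-level vector.
import torch
import torch.nn as nn

class AttentionPool(nn.Module):
    def __init__(self, feat_dim, attn_dim=64):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(feat_dim, attn_dim), nn.Tanh(),
                                   nn.Linear(attn_dim, 1))

    def forward(self, frames):                          # frames: (batch, time, feat_dim)
        w = torch.softmax(self.score(frames), dim=1)    # per-frame weights summing to 1
        return (w * frames).sum(dim=1)                  # (batch, feat_dim) speaker vector

pool = AttentionPool(feat_dim=128)
utt_vec = pool(torch.randn(4, 200, 128))                # 4 utterances, 200 frames each
```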

137 citations



Proceedings ArticleDOI
01 Sep 2016
TL;DR: This paper uses simple spectrograms as input to a CNN and studies the optimal design of such networks for speaker identification and clustering, demonstrating the approach on the well-known TIMIT dataset and achieving results comparable with the state of the art, without the need for handcrafted features.
Abstract: Deep learning, especially in the form of convolutional neural networks (CNNs), has triggered substantial improvements in computer vision and related fields in recent years. This progress is attributed to the shift from designing features and subsequent individual sub-systems towards learning features and recognition systems end to end from nearly unprocessed data. For speaker clustering, however, it is still common to use handcrafted processing chains such as MFCC features and GMM-based models. In this paper, we use simple spectrograms as input to a CNN and study the optimal design of those networks for speaker identification and clustering. Furthermore, we elaborate on the question of how to transfer a network trained for speaker identification to speaker clustering. We demonstrate our approach on the well-known TIMIT dataset, achieving results comparable with the state of the art, without the need for handcrafted features.
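As a rough illustration of the setup, a small CNN over fixed-size spectrogram excerpts might look like the sketch below; the architecture and sizes are assumptions, not the network studied in the paper:

```python
# Minimal sketch: classify spectrogram excerpts by speaker identity with a small CNN.
import torch
import torch.nn as nn

class SpectrogramCNN(nn.Module):
    def __init__(self, n_speakers):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2))
        self.classifier = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, n_speakers))

    def forward(self, spec):                     # spec: (batch, 1, freq_bins, frames)
        return self.classifier(self.features(spec))

model = SpectrogramCNN(n_speakers=630)           # TIMIT contains 630 speakers in total
logits = model(torch.randn(8, 1, 128, 100))      # 8 spectrogram excerpts
```

For clustering, an internal layer of such a network would be used as an embedding rather than the classification output.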

73 citations


Proceedings ArticleDOI
20 Mar 2016
TL;DR: It is shown that error-rates for speaker-independent lip-reading can be very significantly reduced and that there is no need to map phonemes to visemes for context-dependent visual speech transcription.
Abstract: Recent improvements in tracking and feature extraction mean that speaker-dependent lip-reading of continuous speech using a medium size vocabulary (around 1000 words) is realistic. However, the recognition of previously unseen speakers has been found to be a very challenging task, because of the large variation in lip-shapes across speakers and the lack of large, tracked databases of visual features, which are very expensive to produce. By adapting a technique that is established in speech recognition but has not previously been used in lip-reading, we show that error-rates for speaker-independent lip-reading can be very significantly reduced. Furthermore, we show that error-rates can be even further reduced by the additional use of Deep Neural Networks (DNN). We also find that there is no need to map phonemes to visemes for context-dependent visual speech transcription.

63 citations


Proceedings ArticleDOI
20 Mar 2016
TL;DR: This paper presents a DNN-based autoencoder for speech enhancement and its use in speaker recognition systems for distant microphones and noisy data, with a detailed analysis on various conditions of NIST SRE 2010 and PRISM suggesting that the proposed preprocessing is a promising and efficient way to build a robust speaker recognition system.
Abstract: In this paper we present the design of a DNN-based autoencoder for speech enhancement and its use in speaker recognition systems for distant microphones and noisy data. We started by augmenting the Fisher database with artificially noised and reverberated data and trained the autoencoder to map noisy and reverberated speech to its clean version. We use the autoencoder as a preprocessing step in the later stage of modelling in state-of-the-art text-dependent and text-independent speaker recognition systems. We report relative improvements of up to 50% for the text-dependent system and up to 48% for the text-independent one. For the text-independent system, we present a more detailed analysis on various conditions of NIST SRE 2010 and PRISM, suggesting that the proposed preprocessing is a promising and efficient way to build a robust speaker recognition system for distant-microphone and noisy data.
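The training objective of such an enhancement autoencoder reduces to regressing clean features from noisy ones; the sketch below uses assumed 257-bin spectral features and random tensors as stand-ins for real (noisy, clean) pairs, and is not the paper's system:

```python
# Minimal sketch: feed-forward autoencoder trained to map noisy/reverberated feature
# frames to their clean versions, later used as a preprocessing step.
import torch
import torch.nn as nn

enhancer = nn.Sequential(
    nn.Linear(257, 1024), nn.ReLU(),
    nn.Linear(1024, 1024), nn.ReLU(),
    nn.Linear(1024, 257))                        # predict the clean frame

opt = torch.optim.Adam(enhancer.parameters(), lr=1e-4)
loss_fn = nn.MSELoss()

noisy, clean = torch.randn(64, 257), torch.randn(64, 257)   # stand-in parallel data
for _ in range(10):                              # in practice: many epochs of augmented data
    opt.zero_grad()
    loss = loss_fn(enhancer(noisy), clean)
    loss.backward()
    opt.step()

# At test time, enhanced = enhancer(noisy_features) feeds the speaker recognition front-end.
```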

63 citations


Journal ArticleDOI
TL;DR: Transfer learning is an enabling technology for speaker adaptation, since it outperforms both the transformation-based adaptation algorithms usually adopted in the speech community and multi-condition training schemes, the data combination methods often adopted to cover more acoustic variability in speech when data from the source and target domains are both available at training time.

58 citations


Book ChapterDOI
08 Oct 2016
TL;DR: This work is the first to present an active speaker detection system that learns on one audio-visual dataset and automatically adapts to speakers in a new dataset, and is seen as an example of how the availability of multi-modal data allows us to learn a model without the need for supervision.
Abstract: In this paper, we show how to use audio to supervise the learning of active speaker detection in video. Voice Activity Detection (VAD) guides the learning of the vision-based classifier in a weakly supervised manner. The classifier uses spatio-temporal features to encode upper body motion - facial expressions and gesticulations associated with speaking. We further improve a generic model for active speaker detection by learning person specific models. Finally, we demonstrate the online adaptation of generic models learnt on one dataset, to previously unseen people in a new dataset, again using audio (VAD) for weak supervision. The use of temporal continuity overcomes the lack of clean training data. We are the first to present an active speaker detection system that learns on one audio-visual dataset and automatically adapts to speakers in a new dataset. This work can be seen as an example of how the availability of multi-modal data allows us to learn a model without the need for supervision, by transferring knowledge from one modality to another.

47 citations


Patent
10 Jun 2016
TL;DR: In this article, machine learning-based technologies are used to analyze an audio input and provide speaker state predictions in response to the audio input, which can be selected and customized for each of a variety of different applications.
Abstract: Disclosed are machine learning-based technologies that analyze an audio input and provide speaker state predictions in response to the audio input. The speaker state predictions can be selected and customized for each of a variety of different applications.

Proceedings ArticleDOI
Weizhong Zhu1, Jason W. Pelecanos1
20 Mar 2016
TL;DR: This paper proposes a novel Maximum a Posteriori (MAP) adapted transform within an i-vector speaker diarization framework, that operates in a strict left-to-right fashion.
Abstract: Many speaker diarization systems operate in an off-line mode. Such systems typically find homogeneous segments and then cluster these segments according to speaker. Such algorithms, like bottom-up clustering, k-means or spectral clustering, generally require the registration of all segments before clustering can begin. However, for real-time applications such as with multi-person voice interactive systems, there is a need to perform online speaker assignment in a strict left-to-right fashion. In this paper we propose a novel Maximum a Posteriori (MAP) adapted transform within an i-vector speaker diarization framework, that operates in a strict left-to-right fashion. Previous work by the community has shown that the principal components of variation of fixed dimensional i-vectors learned across segments tend to indicate a strong basis by which to separate speakers. However, determining this basis can be problematic when there are few segments or when operating in an online manner. The proposed method blends the prior with the estimated subspace as more i-vectors are observed. Given oracle SAD segments, with adaptation we achieve 3.2% speaker diarization error for a strict left-to-right constraint on the LDC Callhome English Corpus compared to 4.8% without adaptation.
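The strict left-to-right constraint can be illustrated independently of the proposed MAP-adapted transform: each incoming segment i-vector is either assigned to the closest existing speaker or opens a new one. The threshold and cosine scoring below are assumptions for illustration only:

```python
# Minimal sketch (not the MAP-adapted transform itself): online, left-to-right speaker
# assignment of per-segment i-vectors using running centroids and a cosine threshold.
import numpy as np

def online_assign(ivectors, threshold=0.5):
    centroids, counts, labels = [], [], []
    for v in ivectors:
        v = v / np.linalg.norm(v)
        sims = [float((v @ c) / np.linalg.norm(c)) for c in centroids]
        if sims and max(sims) >= threshold:
            k = int(np.argmax(sims))
            centroids[k] = (centroids[k] * counts[k] + v) / (counts[k] + 1)  # update mean
            counts[k] += 1
        else:                                    # no close speaker yet: start a new one
            centroids.append(v.copy())
            counts.append(1)
            k = len(centroids) - 1
        labels.append(k)
    return labels

print(online_assign([np.random.randn(100) for _ in range(6)]))
```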

Proceedings ArticleDOI
20 Mar 2016
TL;DR: The role of bottleneck features in language and speaker identification tasks is analyzed by varying the DNN layer from which they are extracted, under the hypothesis that speaker information is traded for dense phonetic information as the layer moves toward the DNN output layer.
Abstract: Using bottleneck features extracted from a deep neural network (DNN) trained to predict senone posteriors has resulted in new, state-of-the-art technology for language and speaker identification. For language identification, the features' dense phonetic information is believed to enable improved performance by better representing language-dependent phone distributions. For speaker recognition, the role of these features is less clear, given that a bottleneck layer near the DNN output layer is thought to contain limited speaker information. In this article, we analyze the role of bottleneck features in these identification tasks by varying the DNN layer from which they are extracted, under the hypothesis that speaker information is traded for dense phonetic information as the layer moves toward the DNN output layer. Experiments support this hypothesis under certain conditions, and highlight the benefit of using a bottleneck layer close to the DNN output layer when DNN training data are matched to the evaluation conditions, and a layer more central to the DNN otherwise.
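Varying the extraction layer amounts to cutting a trained network at different depths and keeping the activations; the sketch below uses a stand-in feed-forward network with illustrative layer sizes, not the senone DNN from the paper:

```python
# Minimal sketch: extract "bottleneck" features from a chosen hidden layer of a DNN.
import torch
import torch.nn as nn

dnn = nn.Sequential(                       # stand-in for a senone-classification DNN
    nn.Linear(440, 1024), nn.Sigmoid(),    # indices 0-1
    nn.Linear(1024, 80), nn.Sigmoid(),     # indices 2-3: narrow bottleneck layer
    nn.Linear(80, 1024), nn.Sigmoid(),     # indices 4-5
    nn.Linear(1024, 3000))                 # index 6: senone posterior layer

def extract_layer(net, x, layer_index):
    """Run the network up to and including layer_index; return those activations."""
    for i, layer in enumerate(net):
        x = layer(x)
        if i == layer_index:
            return x
    return x

frames = torch.randn(100, 440)                # 100 spliced acoustic feature frames
bnf_central = extract_layer(dnn, frames, 3)   # features from a central layer
bnf_late = extract_layer(dnn, frames, 5)      # features closer to the output layer
```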

Journal ArticleDOI
TL;DR: A novel solution that distributes speech signals into a multitude of acoustic subregions defined by speech units and models speakers within those subregions; with a model synthesis approach, performance can be greatly improved in scenarios where no enrollment data are available for some speech unit classes.
Abstract: Short utterance speaker recognition (SUSR) is highly challenging due to the limited enrollment and/or test data. We argue that the difficulty can be largely attributed to the mismatched prior distributions of the speech data used to train the universal background model (UBM) and those for enrollment and test. This paper presents a novel solution that distributes speech signals into a multitude of acoustic subregions that are defined by speech units, and models speakers within the subregions. To avoid data sparsity, a data-driven approach is proposed to cluster speech units into speech unit classes, based on which robust subregion models can be constructed. Furthermore, we propose a model synthesis approach based on maximum likelihood linear regression (MLLR) to deal with no-data speech unit classes. The experiments were conducted on the publicly available database SUD12. The results demonstrated that on a text-independent speaker recognition task where the test utterances are no longer than 2 seconds and mostly shorter than 0.5 seconds, the proposed subregion modeling offered a 21.51% relative reduction in equal error rate (EER), compared with the standard GMM-UBM baseline. In addition, with the model synthesis approach, the performance can be greatly improved in scenarios where no enrollment data are available for some speech unit classes.

Journal ArticleDOI
01 Apr 2016-PeerJ
TL;DR: A speech technology-based system is proposed to automate the assessment of therapist empathy, a key therapy quality index, from audio recordings of psychotherapy interactions, and provides useful information that can contribute to automatic quality assurance and therapist training.
Abstract: Scaling up psychotherapy services such as addiction counseling is a critical societal need. One challenge is ensuring the quality of therapy, due to the heavy cost of manual observational assessment. This work proposes a speech technology-based system to automate the assessment of therapist empathy, a key therapy quality index, from audio recordings of the psychotherapy interactions. We designed a speech processing system that includes voice activity detection and diarization modules, and an automatic speech recognizer plus a speaker role matching module to extract the therapist's language cues. We employed Maximum Entropy models, Maximum Likelihood language models, and a Lattice Rescoring method to characterize high vs. low empathic language. We estimated therapy-session-level empathy codes using utterance-level evidence obtained from these models. Our experiments showed that the fully automated system achieved a correlation of 0.643 between expert-annotated empathy codes and machine-derived estimations, and an accuracy of 81% in classifying high vs. low empathy, in comparison to a 0.721 correlation and 86% accuracy in the oracle setting using manual transcripts. The results show that the system provides useful information that can contribute to automatic quality assurance and therapist training.

Patent
10 Jun 2016
TL;DR: In this paper, an audio diarization system segments the audio input into speech and non-speech segments, and these segments are convolved with one or more head-related transfer functions (HRTFs) so the sounds localize to different sound localization points (SLPs) for the user.
Abstract: Speech and/or non-speech in an audio input are convolved to localize sounds to different locations for a user. An audio diarization system segments the audio input into speech and non-speech segments. These segments are convolved with one or more head-related transfer functions (HRTFs) so the sounds localize to different sound localization points (SLPs) for the user.
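The localization step reduces to convolving each diarized segment with a left/right impulse-response pair; in the sketch below, toy two-tap filters stand in for measured HRTF data, and speech and non-speech segments would simply use different pairs:

```python
# Minimal sketch: convolve a mono segment with a left/right head-related impulse
# response pair so it is perceived at one sound localization point (SLP).
import numpy as np

def localize(segment, hrir_left, hrir_right):
    left = np.convolve(segment, hrir_left)
    right = np.convolve(segment, hrir_right)
    n = max(len(left), len(right))
    out = np.zeros((n, 2))                 # stereo output: column 0 = left, column 1 = right
    out[:len(left), 0] = left
    out[:len(right), 1] = right
    return out

speech_segment = np.random.randn(16000)    # stand-in for 1 s of a diarized segment at 16 kHz
binaural = localize(speech_segment, np.array([1.0, 0.5]), np.array([0.3, 0.1]))
```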

Journal ArticleDOI
TL;DR: A prediction model that features three processing steps to predict whether turn-changing or turn-keeping will occur, who will be the next speaker in turn-changing, and the timing of the start of the next speaker's utterance is proposed.
Abstract: In multiparty meetings, participants need to predict the end of the speaker’s utterance and who will start speaking next, as well as consider a strategy for good timing to speak next. Gaze behavior plays an important role in smooth turn-changing. This article proposes a prediction model that features three processing steps to predict (I) whether turn-changing or turn-keeping will occur, (II) who will be the next speaker in turn-changing, and (III) the timing of the start of the next speaker’s utterance. For the feature values of the model, we focused on gaze transition patterns and the timing structure of eye contact between a speaker and a listener near the end of the speaker’s utterance. Gaze transition patterns provide information about the order in which gaze behavior changes. The timing structure of eye contact is defined as who looks at whom and who looks away first, the speaker or listener, when eye contact between the speaker and a listener occurs. We collected corpus data of multiparty meetings, using the data to demonstrate relationships between gaze transition patterns and timing structure and situations (I), (II), and (III). The results of our analyses indicate that the gaze transition pattern of the speaker and listener and the timing structure of eye contact have a strong association with turn-changing, the next speaker in turn-changing, and the start time of the next utterance. On the basis of the results, we constructed prediction models using the gaze transition patterns and timing structure. The gaze transition patterns were found to be useful in predicting turn-changing, the next speaker in turn-changing, and the start time of the next utterance. Contrary to expectations, we did not find that the timing structure is useful for predicting the next speaker and the start time. This study opens up new possibilities for predicting the next speaker and the timing of the next utterance using gaze transition patterns in multiparty meetings.

Proceedings ArticleDOI
20 Mar 2016
TL;DR: Experimental results on a speech corpus of multiple speakers in both Mandarin and English show that the proposed factorized DNN can not only achieve a similar voice quality as that of a multi-speaker DNN, but also perform polyglot synthesis with a monolingual speaker's voice.
Abstract: We previously proposed multi-speaker modelling in DNN-based TTS synthesis to improve voice quality when only limited data is available from a speaker. In this paper, we propose a new speaker- and language-factorized DNN, where speaker-specific layers are used for multi-speaker modelling, and shared layers and language-specific layers are employed for multi-language, linguistic feature transformation. Experimental results on a speech corpus of multiple speakers in both Mandarin and English show that the proposed factorized DNN can not only achieve a similar voice quality to that of a multi-speaker DNN, but can also perform polyglot synthesis with a monolingual speaker's voice.

Proceedings ArticleDOI
08 Sep 2016
TL;DR: This paper analyses systems built for the RedDots challenge, the effort to collect and compare initial results on this new evaluation data set obtained at different sites, using the recently introduced HMM-based i-vector approach, where a set of phone-specific HMMs is used to collect the sufficient statistics for i-vector extraction.
Abstract: Recently, a new data collection was initiated within the RedDots project in order to evaluate text-dependent and text-prompted speaker recognition technology on data from a wider speaker population and with more realistic noise, channel and phonetic variability. This paper analyses our systems built for the RedDots challenge, the effort to collect and compare the initial results on this new evaluation data set obtained at different sites. We use our recently introduced HMM-based i-vector approach, where, instead of the traditional GMM, a set of phone-specific HMMs is used to collect the sufficient statistics for i-vector extraction. Our systems are trained in a completely phrase-independent way on data from the RSR2015 and LibriSpeech databases. We compare systems making use of standard cepstral features and their combination with neural-network-based bottleneck features. The best results are obtained with a score-level fusion of such systems.

Proceedings ArticleDOI
01 Oct 2016
TL;DR: This paper proposes to rely on state-of-the-art face clustering techniques to guide acoustic speaker diarization, and two approaches are tested and evaluated on the first season of Game Of Thrones TV series.
Abstract: While successful on broadcast news, meetings or telephone conversation, state-of-the-art speaker diarization techniques tend to perform poorly on TV series or movies. In this paper, we propose to rely on state-of-the-art face clustering techniques to guide acoustic speaker diarization. Two approaches are tested and evaluated on the first season of Game Of Thrones TV series. The second (better) approach relies on a novel talking-face detection module based on bi-directional long short-term memory recurrent neural network. Both audio-visual approaches outperform the audio-only baseline. A detailed study of the behavior of these approaches is also provided and paves the way to future improvements.

Proceedings ArticleDOI
08 Sep 2016
TL;DR: Experimental results showed that the proposed model outperformed the conventional speaker-dependent DNN when the model architecture was set at optimal for the amount of training data of the selected target speaker.
Abstract: Recent studies have shown that DNN-based speech synthesis can produce more natural synthesized speech than conventional HMM-based speech synthesis. However, it remains an open problem whether the synthesized speech quality can be improved by utilizing a multi-speaker speech corpus. To address this problem, this paper proposes DNN-based speech synthesis using speaker codes as a simple method to improve on the conventional speaker-dependent DNN-based method. In order to model speaker variation in the DNN, the augmented feature (speaker code) is fed to the hidden layer(s) of the conventional DNN. The proposed method trains the connection weights of the whole DNN using a multi-speaker speech corpus. When synthesizing a speech parameter sequence, a target speaker is chosen from the corpus and the speaker code corresponding to the selected target speaker is fed to the DNN to generate that speaker's voice. We investigated the relationship between prediction performance and the architecture of the DNN by changing the hidden layer to which the speaker codes are input. Experimental results showed that the proposed model outperformed the conventional speaker-dependent DNN when the model architecture was set optimally for the amount of training data of the selected target speaker.
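Feeding a speaker code into a hidden layer can be sketched as concatenating a one-hot code to that layer's input; the dimensions below are assumptions, not the paper's configuration:

```python
# Minimal sketch: TTS-style DNN whose first hidden layer also receives a one-hot
# speaker code, so the chosen code selects the synthesized voice.
import torch
import torch.nn as nn

class SpeakerCodeDNN(nn.Module):
    def __init__(self, ling_dim=300, n_speakers=50, hid=512, out_dim=187):
        super().__init__()
        self.n_speakers = n_speakers
        self.h1 = nn.Linear(ling_dim + n_speakers, hid)   # speaker code enters here
        self.h2 = nn.Linear(hid, hid)
        self.out = nn.Linear(hid, out_dim)                # acoustic parameter frame

    def forward(self, linguistic, speaker_id):
        code = torch.nn.functional.one_hot(speaker_id, self.n_speakers).float()
        x = torch.relu(self.h1(torch.cat([linguistic, code], dim=-1)))
        x = torch.relu(self.h2(x))
        return self.out(x)

net = SpeakerCodeDNN()
params = net(torch.randn(4, 300), torch.tensor([0, 3, 3, 12]))   # voices chosen via the code
```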

Proceedings ArticleDOI
20 Mar 2016
TL;DR: A Deep Neural Network/Hidden Markov Model Automatic Speech Recognition (DNN/HMM ASR) system is used to extract content-related posterior probabilities and outperforms systems using Gaussian mixture model posteriors by at least 50% Equal Error Rate (EER) on the RSR2015 in content mismatch trials.
Abstract: The i-vector and Joint Factor Analysis (JFA) systems for text-dependent speaker verification use sufficient statistics computed from a speech utterance to estimate speaker models. These statistics average the acoustic information over the utterance thereby losing all the sequence information. In this paper, we study explicit content matching using Dynamic Time Warping (DTW) and present the best achievable error rates for speaker-dependent and speaker-independent content matching. For this purpose, a Deep Neural Network/Hidden Markov Model Automatic Speech Recognition (DNN/HMM ASR) system is used to extract content-related posterior probabilities. This approach outperforms systems using Gaussian mixture model posteriors by at least 50% Equal Error Rate (EER) on the RSR2015 in content mismatch trials. DNN posteriors are also used in i-vector and JFA systems, obtaining EERs as low as 0.02%.
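Explicit content matching with DTW can be illustrated by aligning two sequences of per-frame posterior vectors; the cosine-style frame distance and the random stand-in posteriors below are assumptions, not the exact setup of the paper:

```python
# Minimal sketch: DTW alignment cost between two sequences of frame posteriors,
# usable as a content-match score (lower cost = better match).
import numpy as np

def dtw_cost(post_a, post_b):
    # post_a: (Ta, D), post_b: (Tb, D)
    dist = 1.0 - post_a @ post_b.T / (
        np.linalg.norm(post_a, axis=1)[:, None] * np.linalg.norm(post_b, axis=1)[None, :])
    Ta, Tb = len(post_a), len(post_b)
    acc = np.full((Ta + 1, Tb + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, Ta + 1):
        for j in range(1, Tb + 1):
            acc[i, j] = dist[i - 1, j - 1] + min(acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
    return acc[Ta, Tb] / (Ta + Tb)          # length-normalized alignment cost

a = np.abs(np.random.randn(50, 40))         # stand-ins for DNN/HMM posterior sequences
b = np.abs(np.random.randn(60, 40))
print(dtw_cost(a, b))
```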

Proceedings ArticleDOI
20 Mar 2016
TL;DR: The experimental results show that appropriate sound can efficiently degrade the speaker verification performance without degrading speech intelligibility.
Abstract: In this paper, a privacy protection method to prevent speaker identification from recorded speech is proposed and evaluated. Although many techniques for preserving various private information included in speech have been proposed, their impacts on human speech communication in physical space are not taken into account. To overcome this problem, this paper proposes privacy-preserving sound as a privacy protection method. The privacy-preserving sound can degrade speaker verification performance without interfering with human speech communication in physical space. To make a first step toward solving this problem, suitable sound characteristics for preserving privacy are evaluated in terms of the speaker verification performance and speech intelligibility. The experimental results show that appropriate sound can efficiently degrade the speaker verification performance without degrading speech intelligibility.

Proceedings ArticleDOI
01 Aug 2016
TL;DR: This paper considers speaker classification as an auxiliary task in order to improve the generalization abilities of the acoustic model, by training the model to recognize the speaker, or find the closest one inside the training set.
Abstract: In order to address the commonly encountered issue of overfitting in speech recognition, this article investigates Multi-Task Learning with speaker classification as the auxiliary task. Overfitting occurs when the amount of training data is limited, leading to an acoustic model that is overly sensitive to the training data. Multi-Task Learning is one of many regularization methods; it decreases the impact of overfitting by forcing the acoustic model to train jointly on multiple different, but related, tasks. In this paper, we consider speaker classification as an auxiliary task in order to improve the generalization abilities of the acoustic model, by training the model to recognize the speaker, or to find the closest one inside the training set. We investigate this Multi-Task Learning setup on the TIMIT database, with acoustic modelling performed using a Recurrent Neural Network with Long Short-Term Memory cells.
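The multi-task setup amounts to one shared recurrent body with two output heads and a weighted sum of losses; the sizes and the 0.3 auxiliary weight below are assumptions (TIMIT has 61 phone labels and 462 training speakers), not the paper's exact configuration:

```python
# Minimal sketch: LSTM acoustic model with a main phone head and an auxiliary
# speaker-classification head used as a regularizer.
import torch
import torch.nn as nn

class MultiTaskAM(nn.Module):
    def __init__(self, feat_dim=40, hid=256, n_phones=61, n_speakers=462):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hid, batch_first=True)
        self.phone_head = nn.Linear(hid, n_phones)        # main task: framewise phones
        self.speaker_head = nn.Linear(hid, n_speakers)    # auxiliary task: utterance speaker

    def forward(self, x):
        h, _ = self.lstm(x)
        return self.phone_head(h), self.speaker_head(h.mean(dim=1))

model = MultiTaskAM()
feats = torch.randn(8, 300, 40)                           # 8 utterances, 300 frames each
phone_logits, spk_logits = model(feats)
loss = (nn.functional.cross_entropy(phone_logits.reshape(-1, 61),
                                    torch.randint(0, 61, (8 * 300,)))
        + 0.3 * nn.functional.cross_entropy(spk_logits, torch.randint(0, 462, (8,))))
```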

Patent
26 Jan 2016
TL;DR: In this paper, the authors present methods of diarizing audio data using first-pass blind diarization and second-pass blind diarization that generate speaker statistical models.
Abstract: Disclosed herein are methods of diarizing audio data using first-pass blind diarization and second-pass blind diarization that generate speaker statistical models, wherein the first-pass blind diarization is on a per-frame basis and the second-pass blind diarization is on a per-word basis, and methods of creating acoustic signatures for a common speaker based only on the statistical models of the speakers in each audio session.

Journal ArticleDOI
TL;DR: Statistics of pauses appearing in Polish were described as a potential source of biometric information for automatic speaker recognition; the quantity and duration of filled pauses, audible breaths, and the correlation between the temporal structure of speech and the syntax of the spoken language were the features that characterized speakers most.
Abstract: Statistics of pauses appearing in Polish were described as a potential source of biometric information for automatic speaker recognition. The usage of three main types of acoustic pauses (silent, filled and breath pauses) and syntactic pauses (punctuation marks in speech transcripts) was investigated quantitatively in three types of spontaneous speech (presentations, simultaneous interpretation and radio interviews) and in read speech (audio books). Selected parameters of pauses, extracted for each speaker separately or for speaker groups, were examined statistically to verify the usefulness of information on pauses for speaker recognition and speaker profile estimation. The quantity and duration of filled pauses, audible breaths, and the correlation between the temporal structure of speech and the syntax of the spoken language were the features that characterized speakers most. An experiment using pauses in a speaker biometry system (based on a Universal Background Model and i-vectors) resulted in a 30% equal error rate. Adding pause-related features to the baseline Mel-frequency cepstral coefficient system did not significantly improve its performance. In an experiment on automatic recognition of three types of spontaneous speech, we achieved 78% accuracy using a GMM classifier. Silent-pause-related features allowed distinguishing between read and spontaneous speech by extreme gradient boosting with 75% accuracy.

Journal ArticleDOI
01 Mar 2016
TL;DR: A scalable algorithm for real-time text-independent speaker identification based on vowel recognition that requires less than 100 bytes of data to be saved for each speaker to be identified, and can identify the speaker within a second.
Abstract: Automatic speaker identification has become a challenging research problem due to its wide variety of applications. Neural networks and audio-visual identification systems can be very powerful, but they have limitations related to the number of speakers. The performance drops gradually as more and more users are registered with the system. This paper proposes a scalable algorithm for real-time text-independent speaker identification based on vowel recognition. Vowel formants are unique across different speakers and reflect the vocal tract information of a particular speaker. The contribution of this paper is the design of a scalable system based on vowel formant filters and a scoring scheme for classification of an unseen instance. Mel-Frequency Cepstral Coefficients (MFCC) and Linear Predictive Coding (LPC) have both been analysed for comparison to extract vowel formants by windowing the given signal. All formants are filtered by known formant frequencies to separate the vowel formants for further processing. The formant frequencies of each speaker are collected during the training phase. A test signal is also processed in the same way to find vowel formants and compare them with the saved vowel formants to identify the speaker for the current signal. A score-based scheme allows the speaker with the highest matching formants to own the current signal. This model requires less than 100 bytes of data to be saved for each speaker to be identified, and can identify the speaker within a second. Tests conducted on multiple databases show that this score-based scheme outperforms the back propagation neural network and Gaussian mixture models. Usually, the longer the speech files, the more significant were the improvements in accuracy.
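The score-based matching can be illustrated with a toy version; the tolerance, the stored formant lists, and the scoring rule below are assumptions for illustration, not the paper's exact scheme:

```python
# Minimal sketch: identify the speaker whose stored vowel formants best match the
# formants extracted from the test signal.
def formant_score(test_formants, speaker_formants, tolerance_hz=50.0):
    """Count test formants that fall within tolerance of any stored formant."""
    return sum(any(abs(f - g) <= tolerance_hz for g in speaker_formants)
               for f in test_formants)

def identify(test_formants, enrolled):
    """enrolled: dict speaker_id -> formant frequencies collected during training."""
    return max(enrolled, key=lambda spk: formant_score(test_formants, enrolled[spk]))

enrolled = {"spk1": [310.0, 870.0, 2250.0], "spk2": [360.0, 920.0, 2400.0]}
print(identify([305.0, 880.0, 2260.0], enrolled))     # -> "spk1"
```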

Journal ArticleDOI
TL;DR: While error rates increase considerably under degraded speech conditions, large relative equal error rate (EER) reductions were observed when using a PLDA model trained with a large number of degraded sessions per speaker.
Abstract: The state-of-the-art speaker-recognition systems suffer from significant performance loss on degraded speech conditions and acoustic mismatch between enrolment and test phases. Past international evaluation campaigns, such as the NIST speaker recognition evaluation (SRE), have partly addressed these challenges in some evaluation conditions. This work aims at further assessing and compensating for the effect of a wide variety of speech-degradation processes on speaker-recognition performance. We present an open-source simulator generating degraded telephone, VoIP, and interview-speech recordings using a comprehensive list of narrow-band, wide-band, and audio codecs, together with a database of over 60 h of environmental noise recordings and over 100 impulse responses collected from publicly available data. We provide speaker-verification results obtained with an i-vector-based system using either a clean or degraded PLDA back-end on a NIST SRE subset of data corrupted by the proposed simulator. While error rates increase considerably under degraded speech conditions, large relative equal error rate (EER) reductions were observed when using a PLDA model trained with a large number of degraded sessions per speaker.
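The two core degradation operations, additive noise at a target SNR and convolutive reverberation, can be sketched as below; this covers neither the codec processing nor the released simulator's interface, and the signals are random stand-ins:

```python
# Minimal sketch: degrade clean speech by adding noise at a chosen SNR and
# convolving with a room impulse response.
import numpy as np

def add_noise(speech, noise, snr_db):
    noise = np.resize(noise, speech.shape)            # repeat/trim noise to match length
    p_s, p_n = np.mean(speech ** 2), np.mean(noise ** 2)
    gain = np.sqrt(p_s / (p_n * 10 ** (snr_db / 10.0)))
    return speech + gain * noise

def reverberate(speech, impulse_response):
    return np.convolve(speech, impulse_response)[: len(speech)]

clean = np.random.randn(16000)                        # stand-ins for real recordings
rir = np.array([1.0, 0.0, 0.4])                       # toy impulse response
degraded = reverberate(add_noise(clean, np.random.randn(16000), snr_db=10), rir)
```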

Proceedings ArticleDOI
31 Oct 2016
TL;DR: This work shows how to co-train a classifier for active speaker detection using audio-visual data and uses the training of a personalized audio voice classifier to detect active speakers.
Abstract: In this work, we show how to co-train a classifier for active speaker detection using audio-visual data. First, audio Voice Activity Detection (VAD) is used to train a personalized video-based active speaker classifier in a weakly supervised fashion. The video classifier is in turn used to train a voice model for each person. The individual voice models are then used to detect active speakers. There is no manual supervision - audio weakly supervises video classification, and the co-training loop is completed by using the trained video classifier to supervise the training of a personalized audio voice classifier.

Proceedings ArticleDOI
01 Aug 2016
TL;DR: An architecture for speech synthesis using multiple speakers that addresses the problem of speaker adaptation by adding a new output branch to the model and successfully training it without the need to modify the base optimized model.
Abstract: Deep Learning has been applied successfully to speech processing. In this paper we propose an architecture for speech synthesis using multiple speakers. Some hidden layers are shared by all the speakers, while there is a specific output layer for each speaker. Objective and perceptual experiments prove that this scheme produces much better results in comparison with a single-speaker model. Moreover, we also tackle the problem of speaker adaptation by adding a new output branch to the model and successfully training it without the need to modify the base optimized model. This fine-tuning method achieves better results than training the new speaker from scratch with its own model.
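The shared-plus-branch layout and the adaptation step can be sketched as follows; the layer sizes are illustrative and this is not the authors' implementation:

```python
# Minimal sketch: hidden layers shared by all speakers, one output branch per speaker;
# a new speaker is added by training only a freshly appended branch.
import torch
import torch.nn as nn

class SharedMultiSpeakerTTS(nn.Module):
    def __init__(self, ling_dim=300, hid=512, out_dim=187, n_speakers=4):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(ling_dim, hid), nn.ReLU(),
                                    nn.Linear(hid, hid), nn.ReLU())
        self.branches = nn.ModuleList([nn.Linear(hid, out_dim) for _ in range(n_speakers)])

    def forward(self, x, speaker_index):
        return self.branches[speaker_index](self.shared(x))

    def add_speaker(self, out_dim=187):
        self.branches.append(nn.Linear(self.branches[0].in_features, out_dim))
        return len(self.branches) - 1            # index of the new output branch

tts = SharedMultiSpeakerTTS()
new_id = tts.add_speaker()
# Adaptation: optimize only the new branch while the shared layers stay frozen.
opt = torch.optim.Adam(tts.branches[new_id].parameters(), lr=1e-3)
```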