
Showing papers on "Speaker diarisation published in 2017"


Posted Content
TL;DR: Results that suggest adapting from a model trained with Mandarin can improve accuracy for English speaker recognition are presented, and it is suggested that Deep Speaker outperforms a DNN-based i-vector baseline.
Abstract: We present Deep Speaker, a neural speaker embedding system that maps utterances to a hypersphere where speaker similarity is measured by cosine similarity. The embeddings generated by Deep Speaker can be used for many tasks, including speaker identification, verification, and clustering. We experiment with ResCNN and GRU architectures to extract the acoustic features, then mean pool to produce utterance-level speaker embeddings, and train using triplet loss based on cosine similarity. Experiments on three distinct datasets suggest that Deep Speaker outperforms a DNN-based i-vector baseline. For example, Deep Speaker reduces the verification equal error rate by 50% (relatively) and improves the identification accuracy by 60% (relatively) on a text-independent dataset. We also present results that suggest adapting from a model trained with Mandarin can improve accuracy for English speaker recognition.

446 citations
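
To make the training objective above concrete, here is a minimal NumPy sketch of a triplet loss on cosine similarity between L2-normalized utterance embeddings; the embedding dimension, margin, and random vectors are illustrative placeholders, not values from the paper.

import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Project embeddings onto the unit hypersphere."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def cosine_triplet_loss(anchor, positive, negative, margin=0.1):
    """Triplet loss on cosine similarity: push same-speaker pairs to be
    more similar than different-speaker pairs by at least a margin."""
    a, p, n = (l2_normalize(v) for v in (anchor, positive, negative))
    sim_ap = np.sum(a * p, axis=-1)   # cosine similarity, same speaker
    sim_an = np.sum(a * n, axis=-1)   # cosine similarity, different speaker
    return np.maximum(0.0, margin + sim_an - sim_ap).mean()

# Toy batch of 512-dimensional utterance embeddings (random, for illustration).
rng = np.random.default_rng(0)
anchor, positive, negative = rng.normal(size=(3, 4, 512))
print(cosine_triplet_loss(anchor, positive, negative))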


Proceedings ArticleDOI
05 Mar 2017
TL;DR: This work proposes an alternative approach for learning representations via deep neural networks to remove the i-vector extraction process from the pipeline entirely and shows that, though this approach does not respond as well to unsupervised calibration strategies as previous systems, the incorporation of well-founded speaker priors sufficiently mitigates this shortcoming.
Abstract: Speaker diarization is an important front-end for many speech technologies in the presence of multiple speakers, but current methods that employ i-vector clustering for short segments of speech are potentially too cumbersome and costly for the front-end role. In this work, we propose an alternative approach for learning representations via deep neural networks to remove the i-vector extraction process from the pipeline entirely. The proposed architecture simultaneously learns a fixed-dimensional embedding for acoustic segments of variable length and a scoring function for measuring the likelihood that the segments originated from the same or different speakers. Through tests on the CALLHOME conversational telephone speech corpus, we demonstrate that, in addition to streamlining the diarization architecture, the proposed system matches or exceeds the performance of state-of-the-art baselines. We also show that, though this approach does not respond as well to unsupervised calibration strategies as previous systems, the incorporation of well-founded speaker priors sufficiently mitigates this shortcoming.

248 citations


Posted Content
TL;DR: This work combines LSTM-based d-vector audio embeddings with recent work in nonparametric clustering to obtain a state-of-the-art speaker diarization system that achieves a 12.0% diarization error rate on NIST SRE 2000 CALLHOME, while the model is trained with out- of-domain data from voice search logs.
Abstract: For many years, i-vector based audio embedding techniques were the dominant approach for speaker verification and speaker diarization applications. However, mirroring the rise of deep learning in various domains, neural network based audio embeddings, also known as d-vectors, have consistently demonstrated superior speaker verification performance. In this paper, we build on the success of d-vector based speaker verification systems to develop a new d-vector based approach to speaker diarization. Specifically, we combine LSTM-based d-vector audio embeddings with recent work in non-parametric clustering to obtain a state-of-the-art speaker diarization system. Our system is evaluated on three standard public datasets, suggesting that d-vector based diarization systems offer significant advantages over traditional i-vector based systems. We achieved a 12.0% diarization error rate on NIST SRE 2000 CALLHOME, while our model is trained with out-of-domain data from voice search logs.

170 citations
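
The clustering stage described above can be sketched with scikit-learn: spectral clustering applied to a cosine-similarity affinity matrix over per-segment d-vectors. The embeddings below are random placeholders and the speaker count is assumed known, which a real system would have to estimate.

import numpy as np
from sklearn.cluster import SpectralClustering

def diarize_embeddings(dvectors, n_speakers):
    """Cluster per-segment d-vectors into speakers using spectral
    clustering on a cosine-similarity affinity matrix."""
    dvectors = dvectors / np.linalg.norm(dvectors, axis=1, keepdims=True)
    affinity = np.clip(dvectors @ dvectors.T, 0.0, 1.0)  # cosine similarity in [0, 1]
    labels = SpectralClustering(n_clusters=n_speakers,
                                affinity="precomputed").fit_predict(affinity)
    return labels  # one speaker label per segment

# Placeholder d-vectors for 10 segments (a real system would extract
# these with an LSTM speaker-verification network).
rng = np.random.default_rng(1)
segments = rng.normal(size=(10, 256))
print(diarize_embeddings(segments, n_speakers=2))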


Proceedings ArticleDOI
20 Aug 2017
TL;DR: This work proposes to use deep neural networks to learn short-duration speaker embeddings based on a deep convolutional architecture wherein recordings are treated as images, and advocates treating utterances as images or ‘speaker snapshots’, much like in face recognition.
Abstract: The performance of a state-of-the-art speaker verification system is severely degraded when it is presented with trial recordings of short duration. In this work we propose to use deep neural networks to learn short-duration speaker embeddings. We focus on the 5s-5s condition, wherein both sides of a verification trial are 5 seconds long. In our previous work we established that learning a non-linear mapping from i-vectors to speaker labels is beneficial for speaker verification [1]. In this work we take the idea of learning a speaker classifier one step further: we apply deep neural networks directly to time-frequency speech representations. We propose two feedforward network architectures for this task. Our best model is based on a deep convolutional architecture wherein recordings are treated as images. From our experimental findings we advocate treating utterances as images or ‘speaker snapshots’, much like in face recognition. Our convolutional speaker embeddings perform significantly better than i-vectors when scoring is done using cosine distance, where the relative improvement is 23.5%. The proposed deep embeddings combined with cosine distance also outperform a state-of-the-art i-vector verification system by 1%, providing further empirical evidence in favor of our learned speaker features.

155 citations


Proceedings ArticleDOI
Dimitrios Dimitriadis, Petr Fousek
20 Aug 2017
TL;DR: This paper presents novel ideas for speeding up the system by using Automatic Speech Recognition (ASR) transcripts as an input to diarization, and introduces a concept of active window to keep the computational complexity linear.
Abstract: In this paper we describe the process of converting a research prototype system for Speaker Diarization into a fully deployed product running in real time and with low latency. The deployment is a part of the IBM Cloud Speech-to-Text (STT) Service. First, the prototype system is described and the requirements for the on-line, deployable system are introduced. Then we describe the technical approaches we took to satisfy these requirements and discuss some of the challenges we have faced. In particular, we present novel ideas for speeding up the system by using Automatic Speech Recognition (ASR) transcripts as an input to diarization, we introduce a concept of active window to keep the computational complexity linear, we improve the speaker model using a new speaker-clustering algorithm, we automatically keep track of the number of active speakers and we enable the users to set an operating point on a continuous scale between low latency and optimal accuracy. The deployed system has been tuned on real-life data reaching average Speaker Error Rates around 3% and improving over the prototype system by about 10% relative.

80 citations


Proceedings ArticleDOI
19 Jun 2017
TL;DR: Experimental results show that high-performance multi-speaker models can be constructed using the proposed code vectors with a variety of encoding schemes, and that adaptation and manipulation can be performed effectively using the codes.
Abstract: Methods for adapting and controlling the characteristics of output speech are important topics in speech synthesis. In this work, we investigated the performance of DNN-based text-to-speech systems that, in parallel to conventional text input, also take speaker, gender, and age codes as inputs, in order to 1) perform multi-speaker synthesis, 2) perform speaker adaptation using small amounts of target-speaker adaptation data, and 3) modify synthetic speech characteristics based on the input codes. Using a large-scale, studio-quality speech corpus with 135 speakers of both genders and ages ranging from the tens to the eighties, we performed three experiments: 1) First, we used a subset of speakers to construct a DNN-based, multi-speaker acoustic model with speaker codes. 2) Next, we performed speaker adaptation by estimating code vectors for new speakers via backpropagation from a small amount of adaptation material. 3) Finally, we experimented with manually manipulating input code vectors to alter the gender and/or age characteristics of the synthesised speech. Experimental results show that high-performance multi-speaker models can be constructed using the proposed code vectors with a variety of encoding schemes, and that adaptation and manipulation can be performed effectively using the codes.

71 citations
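
A minimal sketch of the input-conditioning idea above: appending one-hot speaker, gender, and age codes to every frame of linguistic features before the acoustic model. All dimensions and values here are illustrative, not the paper's actual configuration.

import numpy as np

def build_input(linguistic_feats, speaker_id, gender_id, age_bin,
                n_speakers=135, n_genders=2, n_age_bins=8):
    """Append one-hot speaker / gender / age codes to every frame of
    linguistic features, as in code-conditioned multi-speaker TTS."""
    def one_hot(idx, size):
        v = np.zeros(size, dtype=np.float32)
        v[idx] = 1.0
        return v

    codes = np.concatenate([one_hot(speaker_id, n_speakers),
                            one_hot(gender_id, n_genders),
                            one_hot(age_bin, n_age_bins)])
    frames = linguistic_feats.shape[0]
    return np.hstack([linguistic_feats, np.tile(codes, (frames, 1))])

# 20 frames of 300-dimensional linguistic features (values are placeholders).
x = build_input(np.zeros((20, 300), dtype=np.float32),
                speaker_id=3, gender_id=1, age_bin=4)
print(x.shape)  # (20, 445)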


Proceedings ArticleDOI
20 Aug 2017
TL;DR: This work provides a command line interface (CLI) to improve reproducibility and comparison of speaker diarization research results, and shows that it can also be used for detailed error analysis purposes.
Abstract: pyannote.metrics is an open-source Python library aimed at researchers working in the wide area of speaker diarization. It provides a command line interface (CLI) to improve reproducibility and comparison of speaker diarization research results. Through its application programming interface (API), a large set of evaluation metrics is available for diagnostic purposes of all modules of typical speaker diarization pipelines (speech activity detection, speaker change detection, clustering, and identification). Finally, thanks to visualization capabilities, we show that it can also be used for detailed error analysis purposes. pyannote.metrics can be downloaded from http://pyannote.github.io.

70 citations
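
A short usage sketch based on the library's documented API for the diarization error rate metric; the labels and segment boundaries are made up, and exact import paths may vary slightly across versions.

from pyannote.core import Annotation, Segment
from pyannote.metrics.diarization import DiarizationErrorRate

# Reference (ground-truth) speaker segments for one file.
reference = Annotation()
reference[Segment(0, 10)] = 'spk_A'
reference[Segment(10, 20)] = 'spk_B'

# Hypothesis produced by a diarization system.
hypothesis = Annotation()
hypothesis[Segment(0, 12)] = 'spk_1'
hypothesis[Segment(12, 20)] = 'spk_2'

metric = DiarizationErrorRate()
print(metric(reference, hypothesis))  # DER for this file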


Proceedings ArticleDOI
05 Mar 2017
TL;DR: The final results on the speaker diarization system indicate that the use of speaker change detection based on CNN is beneficial, with a relative improvement in diarization error rate of 28%.
Abstract: The aim of this paper is to propose a speaker change detection technique based on a Convolutional Neural Network (CNN) and evaluate its contribution to the performance of a speaker diarization system for telephone conversations. For the comparison we used an i-vector based speaker diarization system. The baseline speaker change detection uses the Generalized Likelihood Ratio (GLR) metric. Experiments were conducted on the English part of the CallHome corpus. Our proposed CNN speaker change detection outperformed the GLR approach, reducing the Equal Error Rate by 46% relative. The final results on the speaker diarization system indicate that the use of speaker change detection based on CNN is beneficial, with a relative improvement in diarization error rate of 28%.

68 citations
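
The CNN itself is not reproduced here, but a typical post-processing step, turning per-frame change probabilities into change points by thresholded peak picking, can be sketched as follows. The threshold, frame shift, and minimum gap are illustrative choices, not the paper's settings.

import numpy as np
from scipy.signal import find_peaks

def change_points(change_probs, frame_shift=0.01, threshold=0.5, min_gap=1.0):
    """Convert per-frame speaker-change probabilities (e.g. CNN outputs)
    into change-point times via thresholded peak picking."""
    peaks, _ = find_peaks(change_probs,
                          height=threshold,
                          distance=int(min_gap / frame_shift))
    return peaks * frame_shift  # times in seconds

# Synthetic probability curve with two clear peaks, for illustration.
t = np.arange(0, 30, 0.01)
probs = 0.9 * np.exp(-((t - 8) ** 2) / 0.1) + 0.8 * np.exp(-((t - 21) ** 2) / 0.1)
print(change_points(probs))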


Proceedings ArticleDOI
20 Aug 2017
TL;DR: The result shows that the proposed model brings good improvement over conventional methods based on BIC and Gaussian Divergence.
Abstract: Speaker change detection is an important step in a speaker diarization system. It aims at finding speaker change points in the audio stream. In this paper, it is treated as a sequence labeling task and addressed by bidirectional long short-term memory networks (Bi-LSTM). The system is trained and evaluated on the Broadcast TV subset from the ETAPE database. The results show that the proposed model brings good improvement over conventional methods based on BIC and Gaussian divergence. For instance, in comparison to Gaussian divergence, it produces speech turns that are 19.5% longer on average, with the same level of purity.

67 citations


Journal ArticleDOI
TL;DR: This work proposes a straightforward hidden Markov model (HMM) based extension of the i-vector approach, which allows i-vectors to be successfully applied to text-dependent speaker verification and presents the best published results obtained with a single system on both the RSR2015 and RedDots datasets.
Abstract: The low-dimensional i-vector representation of speech segments is used in state-of-the-art text-independent speaker verification systems. However, i-vectors were deemed unsuitable for the text-dependent task, where simpler and older speaker recognition approaches were found more effective. In this work, we propose a straightforward hidden Markov model (HMM) based extension of the i-vector approach, which allows i-vectors to be successfully applied to text-dependent speaker verification. In our approach, the Universal Background Model (UBM) used for training the phrase-independent i-vector extractor is based on a set of monophone HMMs instead of the standard Gaussian Mixture Model (GMM). To compensate for the channel variability, we propose to precondition i-vectors using a regularized variant of within-class covariance normalization, which can be robustly estimated in a phrase-dependent fashion on the small datasets available for the text-dependent task. The verification scores are cosine similarities between the i-vectors normalized using phrase-dependent s-norm. The experimental results on the RSR2015 and RedDots databases confirm the effectiveness of the proposed approach, especially in rejecting test utterances with a wrong phrase. A simple MFCC based i-vector/HMM system performs competitively when compared to very computationally expensive DNN-based approaches or the conventional relevance MAP GMM-UBM, which does not allow for compact speaker representations. To our knowledge, this paper presents the best published results obtained with a single system on both the RSR2015 and RedDots datasets.

65 citations
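
One component of this recipe that translates directly into code is symmetric score normalization (s-norm) of cosine scores. Below is a minimal NumPy sketch with random placeholder i-vectors and cohort; in the paper the cohort and normalization are phrase-dependent, which is not modeled here.

import numpy as np

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def s_norm(enroll, test, cohort):
    """Symmetric score normalization: average of the raw cosine score
    z-normalized against the enrolment-side and test-side cohort scores."""
    raw = cosine(enroll, test)
    enroll_scores = np.array([cosine(enroll, c) for c in cohort])
    test_scores = np.array([cosine(test, c) for c in cohort])
    z_e = (raw - enroll_scores.mean()) / enroll_scores.std()
    z_t = (raw - test_scores.mean()) / test_scores.std()
    return 0.5 * (z_e + z_t)

# Placeholder 400-dimensional i-vectors and a small cohort.
rng = np.random.default_rng(2)
enroll, test = rng.normal(size=(2, 400))
cohort = rng.normal(size=(50, 400))
print(s_norm(enroll, test, cohort))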


Proceedings ArticleDOI
18 Jun 2017
TL;DR: An analysis of how a text-independent voice identification system can be built is presented and the ability for such systems to be utilized in both speaker identification and speaker verification tasks is shown.
Abstract: Speaker identification systems are becoming more important in today's world. This is especially true as devices rely on the user to speak commands. In this article, an analysis of how a text-independent voice identification system can be built is presented. Mel-frequency cepstral coefficients are extracted, and a support vector machine is trained and tested on two different data sets, one from LibriSpeech and one from in-house recorded audio files. The results show the ability of such systems to be utilized in both speaker identification and speaker verification tasks.
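
A condensed sketch of the described pipeline, mean MFCC vectors per utterance fed to an SVM classifier, using librosa and scikit-learn; the file names and labels are hypothetical placeholders.

import numpy as np
import librosa
from sklearn.svm import SVC

def utterance_mfcc(path, n_mfcc=13):
    """Load audio and summarize it as the mean MFCC vector
    (a simple utterance-level representation)."""
    y, sr = librosa.load(path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return mfcc.mean(axis=1)

# Hypothetical training lists: one speaker label per audio file.
train_files = ["spk1_a.wav", "spk1_b.wav", "spk2_a.wav", "spk2_b.wav"]
train_labels = ["spk1", "spk1", "spk2", "spk2"]

X = np.vstack([utterance_mfcc(f) for f in train_files])
clf = SVC(kernel="rbf").fit(X, train_labels)
print(clf.predict([utterance_mfcc("unknown.wav")]))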

Posted Content
TL;DR: In this article, a convolutional time-delay deep neural network structure (CT-DNN) was proposed for speaker feature learning, which can produce high quality speaker features.
Abstract: Recently deep neural networks (DNNs) have been used to learn speaker features. However, the quality of the learned features is not sufficiently good, so a complex back-end model, either neural or probabilistic, has to be used to address the residual uncertainty when applied to speaker verification, just as with raw features. This paper presents a convolutional time-delay deep neural network structure (CT-DNN) for speaker feature learning. Our experimental results on the Fisher database demonstrated that this CT-DNN can produce high-quality speaker features: even with a single feature (0.3 seconds including the context), the EER can be as low as 7.68%. This effectively confirmed that the speaker trait is largely a deterministic short-time property rather than a long-time distributional pattern, and therefore can be extracted from just dozens of frames.

Journal ArticleDOI
TL;DR: The multi-level framework combining the three modules achieves better performance than the individual modules, which shows its potential for practical deployment.
Abstract: This work presents the development of a multi-level speech based person authentication system with attendance as an application. The multi-level system consists of three different modules of speaker verification, namely voice-password, text-dependent and text-independent speaker verification. The three speaker verification modules are combined in a sequential manner to develop a multi-level framework which is ported over a telephone network through an interactive voice response (IVR) system for aiding remote authentication. The users call from a fixed set of mobile handsets to verify their claim against their respective models, which is then authenticated in a multi-level mode using the above stated three modules. An analysis over a period of two months is shown on the performance of the multi-level system in attendance marking. The multi-level framework combining the three modules achieves better performance than the individual modules, which shows its potential for practical deployment.

Proceedings ArticleDOI
01 Dec 2017
TL;DR: This paper describes the JHU team's Kaldi system submission to the Arabic MGB-3: The Arabic speech recognition in the Wild Challenge for ASRU-2017 and describes its own approach for speaker diarization and audio-transcript alignment.
Abstract: This paper describes the JHU team's Kaldi system submission to the Arabic MGB-3: The Arabic Speech Recognition in the Wild Challenge for ASRU-2017. We use a weights transfer approach to adapt a neural network trained on the out-of-domain MGB-2 multi-dialect Arabic TV broadcast corpus to the MGB-3 Egyptian YouTube video corpus. The neural network has a TDNN-LSTM architecture and is trained using the lattice-free maximum mutual information (LF-MMI) objective followed by sMBR discriminative training. For supervision, we fuse transcripts from 4 independent transcribers into confusion network training graphs. We also describe our own approach for speaker diarization and audio-transcript alignment. We use this to prepare lightly supervised transcriptions for training the seed system used for adaptation to MGB-3. Our primary submission to the challenge gives a multi-reference WER of 32.78% on the MGB-3 test set.

Journal ArticleDOI
TL;DR: The experimental results indicate that the proposed DNN-based approach achieves large performance gains over the state-of-the-art unsupervised techniques without using any specific knowledge about the mixed target and interfering speakers being segregated.
Abstract: We propose an unsupervised speech separation framework for mixtures of two unseen speakers in a single-channel setting based on deep neural networks (DNNs). We rely on a key assumption that two speakers could be well segregated if they are not too similar to each other. A dissimilarity measure between two speakers is first proposed to characterize the separation ability between competing speakers. We then show that speakers with the same or different genders can often be separated if two speaker clusters, with large enough distances between them, for each gender group could be established, resulting in four speaker clusters. Next, a DNN-based gender mixture detection algorithm is proposed to determine whether the two speakers in the mixture are females, males, or from different genders. This detector is based on a newly proposed DNN architecture with four outputs, two of them representing the female speaker clusters and the other two characterizing the male groups. Finally, we propose to construct three independent speech separation DNN systems, one for each of the female–female, male–male, and female–male mixture situations. Each DNN gives dual outputs, one representing the target speaker group and the other characterizing the interfering speaker cluster. Trained and tested on the speech separation challenge corpus, our experimental results indicate that the proposed DNN-based approach achieves large performance gains over the state-of-the-art unsupervised techniques without using any specific knowledge about the mixed target and interfering speakers being segregated.

Proceedings ArticleDOI
20 Aug 2017
TL;DR: The proposed method of unsupervised adaptation using the d-vector is compared with the commonly used i-vector based approach for speaker adaptation, and listening tests show that: (1) for speech quality, the d-vector based approach is significantly preferred over the i-vector based approach; and (2) for speaker similarity, both d-vector and i-vector based adaptation perform similarly.
Abstract: The paper presents a mechanism to perform speaker adaptation in speech synthesis based on deep neural networks (DNNs). The mechanism extracts speaker identification vectors, so-called d-vectors, from the training speakers and uses them jointly with the linguistic features to train a multi-speaker DNN-based text-to-speech synthesizer (DNN-TTS). The d-vectors are derived by applying principal component analysis (PCA) on the bottleneck features of a speaker classifier network. At the adaptation stage, three variants are explored: (1) d-vectors calculated using data from the target speaker, (2) d-vectors calculated as a weighted sum of d-vectors from training speakers, or (3) d-vectors calculated as an average of the above two approaches. The proposed method of unsupervised adaptation using the d-vector is compared with the commonly used i-vector based approach for speaker adaptation. Listening tests show that: (1) for speech quality, the d-vector based approach is significantly preferred over the i-vector based approach, and all the d-vector variants perform similarly; (2) for speaker similarity, both d-vector and i-vector based adaptation were found to perform similarly, except for a small significant preference for the d-vector calculated as an average over the i-vector.
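
The d-vector extraction step described above can be sketched as PCA over utterance-averaged bottleneck features of a speaker classifier. Random arrays stand in for real bottleneck activations, and the target dimensionality is an arbitrary choice.

import numpy as np
from sklearn.decomposition import PCA

def extract_dvectors(bottleneck_feats, dim=32):
    """Derive low-dimensional speaker codes (d-vectors) by applying PCA
    to utterance-averaged bottleneck features of a speaker classifier."""
    utterance_means = np.stack([u.mean(axis=0) for u in bottleneck_feats])
    pca = PCA(n_components=dim)
    return pca.fit_transform(utterance_means), pca

# Placeholder bottleneck activations: 40 utterances, variable frame counts,
# 256-dimensional bottleneck layer.
rng = np.random.default_rng(3)
utts = [rng.normal(size=(rng.integers(100, 300), 256)) for _ in range(40)]
dvecs, _ = extract_dvectors(utts)
print(dvecs.shape)  # (40, 32)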

Journal ArticleDOI
TL;DR: A technique is proposed in this paper in which a pool of pre-trained transformations between a set of speakers is used, making it possible to produce de-identified speech in real time with a high level of naturalness.

Proceedings ArticleDOI
01 Mar 2017
TL;DR: Online adaptive beamforming for automatic speech recognition (ASR) in meetings in noisy, reverberant environments is proposed, based on recently developed mask-based beamforming, which reduced the word error rate (WER) on real meeting data by 54.8% relative to the previous beamforming method.
Abstract: Here we propose online adaptive beamforming for automatic speech recognition (ASR) in meetings in noisy, reverberant environments. The proposed method is based on recently developed mask-based beamforming, in which accurate mask estimation and diarization are paramount. Real-world experiments have shown that mask-based beamforming enables accurate ASR in meetings with modest noise and reverberation, i.e., a signal-to-noise ratio (SNR) of 15–25 dB and a reverberation time (RT) of 120–350 ms. In this paper, we deal with a more adverse condition: meetings with strong noise and reverberation, with an SNR of 3–15 dB and an RT of 500 ms. To this end, we exploit a probabilistic spatial dictionary, a dictionary that consists of a pre-trained probability distribution of source location features for each potential speaker location. This dictionary enables us to perform mask estimation and diarization for beamforming accurately, even in the above adverse condition. The proposed method reduced the word error rate (WER) on real meeting data by 54.8% relative to our previous beamforming method.

Proceedings ArticleDOI
01 Mar 2017
TL;DR: This study investigates a phonetically-aware i-vector system in noisy conditions and proposes a front-end to tackle the noise problem by performing speech separation and examines its performance for both verification and identification tasks.
Abstract: Recent research shows that the i-vector framework for speaker recognition can significantly benefit from phonetic information. A common approach is to use a deep neural network (DNN) trained for automatic speech recognition to generate a universal background model (UBM). Studies in this area have been done in relatively clean conditions. However, strong background noise is known to severely reduce speaker recognition performance. This study investigates a phonetically-aware i-vector system in noisy conditions. We propose a front-end to tackle the noise problem by performing speech separation and examine its performance for both verification and identification tasks. The proposed separation system trains a DNN to estimate the ideal ratio mask of the noisy speech. The separated speech is then used to extract enhanced features for the i-vector framework. We compare the proposed system against a multi-condition trained baseline and a traditional GMM-UBM i-vector system. Our proposed system provides an absolute average improvement of 8% in identification accuracy and 1.2% in equal error rate.
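
The DNN mask estimator is not shown here, but the training target it regresses and its application can be sketched; this uses one common ideal-ratio-mask definition with random placeholder spectrograms, not necessarily the exact variant used in the paper.

import numpy as np

def ideal_ratio_mask(clean_power, noise_power):
    """One common ideal-ratio-mask definition: the square root of the
    speech-to-(speech+noise) power ratio in each time-frequency bin."""
    return np.sqrt(clean_power / (clean_power + noise_power + 1e-10))

def apply_mask(noisy_magnitude, mask):
    """Masking the noisy magnitude spectrogram yields the separated-speech
    estimate whose features then feed the i-vector front end."""
    return mask * noisy_magnitude

# Placeholder power spectrograms (frequency bins x frames).
rng = np.random.default_rng(4)
clean, noise = rng.random((2, 257, 100))
mask = ideal_ratio_mask(clean, noise)
enhanced = apply_mask(np.sqrt(clean + noise), mask)
print(mask.shape, enhanced.shape)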

Proceedings ArticleDOI
01 Mar 2017
TL;DR: A speaker identification technique is studied which takes a person's original voice signals, e.g., Bob's, normalizes their audio energies, converts them to the frequency domain via the Fourier transform, and models MFCC-based characteristics with a Gaussian mixture model.
Abstract: Nowadays, many speech recognition applications are used by people around the world. Typical examples are the iPhone's Siri, the Google speech recognition system, and voice-operated mobile phones. By contrast, speaker identification in its current stage is relatively immature. Therefore, in this paper, we study a speaker identification technique which first takes the original voice signals of a person, e.g., Bob, and then normalizes the audio energies of the signals. After that, the audio signals are converted from the time domain to the frequency domain using the Fourier transform. Next, an MFCC-based human auditory filtering model is utilized to identify the energy levels of different frequencies as the quantified characteristics of Bob's voice. Further, the probability density function of a Gaussian mixture model is utilized to model the distribution of the quantified characteristics as Bob's specific acoustic model. When the system receives an unknown person's voice, e.g., x's, it processes the voice with the same procedure and compares the result, x's acoustic model, against known speakers' acoustic models collected beforehand in an acoustic-model database to identify the most likely speaker.
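
A condensed sketch of this enrolment-and-identification scheme: one MFCC-based Gaussian mixture model per known speaker, scored by average log-likelihood. File paths, sampling rate, and mixture size are hypothetical placeholders.

import librosa
from sklearn.mixture import GaussianMixture

def mfcc_frames(path):
    y, sr = librosa.load(path, sr=8000)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).T  # frames x coefficients

def enroll(speaker_files, n_components=16):
    """Fit one GMM per known speaker on that speaker's MFCC frames."""
    return {name: GaussianMixture(n_components=n_components).fit(mfcc_frames(path))
            for name, path in speaker_files.items()}

def identify(models, path):
    """Pick the enrolled speaker whose GMM gives the highest average
    log-likelihood for the unknown recording."""
    feats = mfcc_frames(path)
    return max(models, key=lambda name: models[name].score(feats))

# Hypothetical enrolment data and test file.
models = enroll({"bob": "bob.wav", "alice": "alice.wav"})
print(identify(models, "unknown.wav"))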

Proceedings ArticleDOI
20 Aug 2017
TL;DR: The experiments on the English part of the CallHome corpus show that the proposed refinement of the statistics accumulation is beneficial, with a relative improvement in Diarization Error Rate of almost 16% when compared to the speaker diarization system without statistics refinement.
Abstract: The aim of this paper is to investigate the benefit of information from a speaker change detection system based on a Convolutional Neural Network (CNN) when applied to the process of accumulating statistics for i-vector generation. The investigation is carried out on the problem of diarization. In our system, the output of the CNN is the probability of a speaker change in a conversation for a given time segment. According to this probability, we cut the conversation into short segments that are then each represented by an i-vector (to describe the speaker in it). We propose a technique to utilize the information from the CNN for weighting the acoustic data in a segment to refine the statistics accumulation process. This technique enables us to represent the speaker better in the final i-vector. The experiments on the English part of the CallHome corpus show that our proposed refinement of the statistics accumulation is beneficial, with a relative improvement in Diarization Error Rate of almost 16% when compared to the speaker diarization system without statistics refinement.
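
The core operation, accumulating sufficient statistics for i-vector extraction with per-frame weights, can be sketched directly. The weight values below are a plain illustration rather than the paper's exact scheme derived from CNN change probabilities.

import numpy as np

def weighted_stats(posteriors, features, weights):
    """Accumulate zeroth-order (N) and first-order (F) sufficient statistics
    for i-vector extraction, with each frame scaled by a weight (e.g. derived
    from a CNN speaker-change probability)."""
    w_post = posteriors * weights[:, None]   # frames x components
    zeroth = w_post.sum(axis=0)              # N_c per mixture component
    first = w_post.T @ features              # F_c, components x feature dim
    return zeroth, first

# Placeholder UBM posteriors (frames x 64 components), 39-dim features, and
# frame weights that down-weight frames near a detected change point.
rng = np.random.default_rng(5)
frames, comps, dim = 200, 64, 39
post = rng.random((frames, comps))
post /= post.sum(axis=1, keepdims=True)
feats = rng.normal(size=(frames, dim))
weights = np.ones(frames)
weights[90:110] = 0.2
N, F = weighted_stats(post, feats, weights)
print(N.shape, F.shape)  # (64,) (64, 39)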

Journal ArticleDOI
TL;DR: Two methods are proposed to find the MFCC feature vectors with the highest similarity for a text-independent speaker identification system, and experimental results indicate that the system's performance improves in terms of both accuracy and time consumption.
Abstract: Speaker recognition has been one of the interesting issues in signal and speech processing over the last few decades. Feature selection is one of the main parts of a speaker recognition system and can improve its performance. In this paper, we propose two methods to find the MFCC feature vectors with the highest similarity, applied to a text-independent speaker identification system. These feature vectors capture individual properties of each person's vocal tract that are mostly repeated. They are used to build the speaker's model and to specify the decision boundary. We apply the MFCCs of each window over the main signal as feature vectors and use clustering to obtain the feature vectors with the highest similarity. The speaker identification experiments are performed on the ELSDSR database, which consists of 22 speakers (12 male and 10 female), with a neural network as the classifier. The effect of three main parameters is considered in the two proposed methods. Experimental results indicate that the performance of the speaker identification system improves in terms of both accuracy and time consumption.

Journal ArticleDOI
TL;DR: New advances are described with the state-of-the-art i-vector based approach to text-dependent speaker verification, which also makes use of different DNN techniques.

Book ChapterDOI
17 Sep 2017
TL;DR: A new method of speaker diarization that employs a deep learning architecture to learn speaker embeddings using a recurrent convolutional neural network applied directly on magnitude spectrograms is proposed.
Abstract: In this paper we propose a new method of speaker diarization that employs a deep learning architecture to learn speaker embeddings. In contrast to the traditional approaches that build their speaker embeddings using manually hand-crafted spectral features, we propose to train for this purpose a recurrent convolutional neural network applied directly on magnitude spectrograms. To compare our approach with the state of the art, we collect and release for the public an additional dataset of over 6 h of fully annotated broadcast material. The results of our evaluation on the new dataset and three other benchmark datasets show that our proposed method significantly outperforms the competitors and reduces diarization error rate by a large margin of over 30% with respect to the baseline.

Proceedings ArticleDOI
01 Dec 2017
TL;DR: The results demonstrate that applying language diarization to the raw speech data to enable using the monolingual resources improves the automatic transcription quality compared to a baseline system using a bilingual ASR system.
Abstract: In this paper, we investigate several automatic transcription schemes for using raw bilingual broadcast news data in semi-supervised bilingual acoustic model training. Specifically, we compare the transcription quality provided by a bilingual ASR system with another system performing language diarization at the front-end followed by two monolingual ASR systems chosen based on the assigned language label. Our research focuses on the Frisian-Dutch code-switching (CS) speech that is extracted from the archives of a local radio broadcaster. Using 11 hours of manually transcribed Frisian speech as a reference, we aim to increase the amount of available training data by using these automatic transcription techniques. By merging the manually and automatically transcribed data, we learn bilingual acoustic models and run ASR experiments on the development and test data of the FAME! speech corpus to quantify the quality of the automatic transcriptions. Using these acoustic models, we present speech recognition and CS detection accuracies. The results demonstrate that applying language diarization to the raw speech data to enable using the monolingual resources improves the automatic transcription quality compared to a baseline system using a bilingual ASR system.

Journal ArticleDOI
TL;DR: A literature review on speaker-specific information extraction from speech is presented, covering the latest studies that offer solutions to the aforementioned problem and categorizing them by their robustness against channel mismatch, additive noise, and other degradations such as vocal effort and emotion mismatch.
Abstract: Speech is a signal that includes the speaker's emotion, characteristic specifications, phoneme information, etc. Various methods have been proposed for speaker recognition by extracting specifications of a given utterance. Among them, short-term cepstral features are used extensively in the speech and speaker recognition areas because of their low complexity and high performance in controlled environments. On the other hand, their performance decreases dramatically under degraded conditions such as channel mismatch, additive noise, emotional variability, etc. In this paper, a literature review on speaker-specific information extraction from speech is presented by considering the latest studies offering solutions to the aforementioned problem. The studies are categorized in three groups considering their robustness against channel mismatch, additive noise, and other degradations such as vocal effort, emotion mismatch, etc. For a more understandable representation, they are also classified into two tables b...

Patent
30 Jan 2017
TL;DR: In this paper, features are disclosed for automatically identifying a speaker, and scores are determined based on individual components of Gaussian mixture models (GMMs) that score best for frames of audio data of an utterance.
Abstract: Features are disclosed for automatically identifying a speaker. Artifacts of automatic speech recognition (“ASR”) and/or other automatically determined information may be processed against individual user profiles or models. Scores may be determined reflecting the likelihood that individual users made an utterance. The scores can be based on, e.g., individual components of Gaussian mixture models (“GMMs”) that score best for frames of audio data of an utterance. A user associated with the highest likelihood score for a particular utterance can be identified as the speaker of the utterance. Information regarding the identified user can be provided to components of a spoken language processing system, separate applications, etc.

Journal ArticleDOI
TL;DR: The proposed DTW approach obtained at least 74% relative improvement in equal error rate on the RSR corpus over other state-of-the-art approaches, including i-vector and JFA.

Journal ArticleDOI
TL;DR: The results indicate that the proposed active learning algorithms are able to reduce diarization error rate significantly with a relatively small amount of human supervision.
Abstract: Most speaker diarization research has focused on unsupervised scenarios, where no human supervision is available. However, in many real-world applications, a certain amount of human input could be expected, especially when minimal human supervision brings significant performance improvement. In this study, we propose an active learning based bottom-up speaker clustering algorithm to effectively improve speaker diarization performance with limited human input. Specifically, the proposed active learning based speaker clustering has two different stages: explore and constrained clustering. The explore stage aims to quickly discover at least one sample for each speaker, boosting the speaker clustering process with reliable initial speaker clusters. After discovering all, or a majority, of the involved speakers during the explore stage, constrained clustering is performed. Constrained clustering is similar to the traditional bottom-up clustering process, with the important difference that the clusters created during the explore stage are restricted from merging with each other. Constrained clustering continues until only the clusters generated from the explore stage are left. Since the objective of the active learning based speaker clustering algorithm is to provide good initial speaker models, performance saturates as soon as sufficient examples are ensured for each cluster. To further improve diarization performance with increasing human input, we propose a second method which actively selects speech segments that account for the largest expected speaker error from existing cluster assignments for human evaluation and reassignment. The algorithms are evaluated on our recently created Apollo Mission Control Center dataset as well as the Augmented Multi-party Interaction (AMI) meeting corpus. The results indicate that the proposed active learning algorithms are able to reduce diarization error rate significantly with a relatively small amount of human supervision.
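
A toy reading of the constrained-clustering stage described above: clusters seeded during the explore stage may absorb unlabeled segments but never merge with each other. This is a simplified sketch with random embeddings and centroid-based cosine merging, not the authors' implementation.

import numpy as np

def constrained_bottom_up(embeddings, explore_ids):
    """Bottom-up merging with a cannot-link constraint between clusters
    seeded by human-labelled 'explore' segments; stops when only the
    seeded clusters remain."""
    clusters = {i: [i] for i in range(len(embeddings))}
    seeded = set(explore_ids)

    def centroid(members):
        return np.mean([embeddings[m] for m in members], axis=0)

    while len(clusters) > len(seeded):
        keys = list(clusters)
        best, best_sim = None, -np.inf
        for i, a in enumerate(keys):
            for b in keys[i + 1:]:
                if a in seeded and b in seeded:
                    continue  # cannot-link constraint between seeded clusters
                ca, cb = centroid(clusters[a]), centroid(clusters[b])
                sim = ca @ cb / (np.linalg.norm(ca) * np.linalg.norm(cb))
                if sim > best_sim:
                    best, best_sim = (a, b), sim
        a, b = best
        # Keep the seeded cluster (if any) so its label survives the merge.
        keep, drop = (a, b) if a in seeded or b not in seeded else (b, a)
        clusters[keep].extend(clusters.pop(drop))
    return clusters

# Placeholder segment embeddings; segments 0 and 5 were labelled during 'explore'.
rng = np.random.default_rng(6)
embs = np.vstack([rng.normal(0, 1, (5, 16)), rng.normal(3, 1, (5, 16))])
print(constrained_bottom_up(embs, explore_ids=[0, 5]))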

Proceedings ArticleDOI
20 Aug 2017
TL;DR: It is found that by tandeming MFCC with bottleneck features, EERs can be further reduced.
Abstract: We investigate how to improve the performance of DNN i-vector based speaker verification for short, text-constrained test utterances, e.g. connected digit strings. A text-constrained verification, due to its smaller, limited vocabulary, can deliver better performance than a text-independent one for a short utterance. We study the problem with a “phonetically aware” Deep Neural Net (DNN) in its capability on “stochastic phonetic-alignment” in constructing supervectors and estimating the corresponding i-vectors with two speech databases: a large vocabulary, conversational, speaker independent database (Fisher) and a small vocabulary, continuous digit database (RSR2015 Part III). The phonetic alignment efficiency and resultant speaker verification performance are compared with differently sized senone sets which can characterize the phonetic pronunciations of utterances in the two databases. Performance on the RSR2015 Part III evaluation shows a relative improvement in EER of 7.89% for male speakers and 3.54% for female speakers with only digit-related senones. The DNN bottleneck features were also studied to investigate their capability of extracting phonetically sensitive information which is useful for text-independent or text-constrained speaker verification. We found that by tandeming MFCC with bottleneck features, EERs can be further reduced.