
Showing papers on "Speaker diarisation" published in 2015


Journal ArticleDOI
TL;DR: This review concludes with a comparative study of human versus machine speaker recognition, with an emphasis on prominent speaker-modeling techniques that have emerged in the last decade for automatic systems.
Abstract: Identifying a person by his or her voice is an important human trait most take for granted in natural human-to-human interaction/communication. Speaking to someone over the telephone usually begins by identifying who is speaking and, at least in cases of familiar speakers, a subjective verification by the listener that the identity is correct and the conversation can proceed. Automatic speaker-recognition systems have emerged as an important means of verifying identity in many e-commerce applications as well as in general business interactions, forensics, and law enforcement. Human experts trained in forensic speaker recognition can perform this task even better by examining a set of acoustic, prosodic, and linguistic characteristics of speech in a general approach referred to as structured listening. Techniques in forensic speaker recognition have been developed for many years by forensic speech scientists and linguists to help reduce any potential bias or preconceived understanding as to the validity of an unknown audio sample and a reference template from a potential suspect. Experienced researchers in signal processing and machine learning continue to develop automatic algorithms to effectively perform speaker recognition, with ever-improving performance, to the point where automatic systems start to perform on par with human listeners. In this article, we review the literature on speaker recognition by machines and humans, with an emphasis on prominent speaker-modeling techniques that have emerged in the last decade for automatic systems. We discuss different aspects of automatic systems, including voice-activity detection (VAD), features, speaker models, standard evaluation data sets, and performance metrics. Human speaker recognition is discussed in two parts: the first part involves forensic speaker-recognition methods, and the second illustrates how a naïve listener performs this task from a neuroscience perspective. We conclude this review with a comparative study of human versus machine speaker recognition and attempt to point out strengths and weaknesses of each.
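As a concrete illustration of the performance metrics surveyed above, the following is a minimal sketch (not taken from the article) of estimating the equal error rate (EER) of an automatic speaker-verification system from two hypothetical arrays of genuine and impostor trial scores.

```python
import numpy as np

def equal_error_rate(genuine_scores, impostor_scores):
    """Estimate the EER: the operating point where the false-acceptance
    rate (FAR) and false-rejection rate (FRR) are equal."""
    thresholds = np.sort(np.concatenate([genuine_scores, impostor_scores]))
    far = np.array([(impostor_scores >= t).mean() for t in thresholds])
    frr = np.array([(genuine_scores < t).mean() for t in thresholds])
    idx = np.argmin(np.abs(far - frr))
    return (far[idx] + frr[idx]) / 2.0

# Hypothetical, synthetic scores: genuine trials score higher on average.
rng = np.random.default_rng(0)
genuine = rng.normal(2.0, 1.0, 1000)
impostor = rng.normal(0.0, 1.0, 1000)
print(f"EER ~ {equal_error_rate(genuine, impostor):.3f}")
```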

554 citations


Journal ArticleDOI
TL;DR: This work presents the application of a single DNN for both SR and LR using the 2013 Domain Adaptation Challenge speaker recognition (DAC13) and the NIST 2011 language recognition evaluation (LRE11) benchmarks, and demonstrates large gains in performance.
Abstract: The impressive gains in performance obtained using deep neural networks (DNNs) for automatic speech recognition (ASR) have motivated the application of DNNs to other speech technologies such as speaker recognition (SR) and language recognition (LR). Prior work has shown performance gains for separate SR and LR tasks using DNNs for direct classification or for feature extraction. In this work we present the application of a single DNN for both SR and LR using the 2013 Domain Adaptation Challenge speaker recognition (DAC13) and the NIST 2011 language recognition evaluation (LRE11) benchmarks. Using a single DNN trained for ASR on Switchboard data we demonstrate large gains in performance on both benchmarks: a 55% reduction in EER for the DAC13 out-of-domain condition and a 48% reduction in ${C_{avg}}$ on the LRE11 30 s test condition. It is also shown that further gains are possible using score or feature fusion, leading to the possibility of a single i-vector extractor producing state-of-the-art SR and LR performance.

429 citations


Proceedings ArticleDOI
06 Sep 2015
TL;DR: This paper describes data collection efforts conducted as part of the RedDots project which is dedicated to the study of speaker recognition under conditions where test utterances are of short duration and of variable phonetic content.
Abstract: This paper describes data collection efforts conducted as part of the RedDots project which is dedicated to the study of speaker recognition under conditions where test utterances are of short duration and of variable phonetic content. At the current stage, we focus on English speakers, both native and non-native, recruited worldwide. This is made possible through the use of a recording front-end consisting of an application running on mobile devices communicating with a centralized web server at the back-end. Speech recordings are collected by having speakers read text prompts displayed on the screen of the mobile devices. We aim to collect a large number of sessions from each speaker over a long time span, typically one session per week over a one year period. The corpus is expected to include rich inter-speaker and intra-speaker variations, both intrinsic and extrinsic (that is, due to recording channel and acoustic environment).

151 citations


Patent
27 Aug 2015
TL;DR: Systems and processes for generating a speaker profile for use in speaker identification for a virtual assistant are presented, in which user speech is compared against a speaker profile for a predetermined user, matching speech is added to that profile, and contextual information is used to verify results produced by the speaker identification process.
Abstract: Systems and processes for generating a speaker profile for use in performing speaker identification for a virtual assistant are provided. One example process can include receiving an audio input including user speech and determining whether a speaker of the user speech is a predetermined user based on a speaker profile for the predetermined user. In response to determining that the speaker of the user speech is the predetermined user, the user speech can be added to the speaker profile and operation of the virtual assistant can be triggered. In response to determining that the speaker of the user speech is not the predetermined user, the user speech can be added to an alternate speaker profile and operation of the virtual assistant may not be triggered. In some examples, contextual information can be used to verify results produced by the speaker identification process.
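The patent describes the enrollment/trigger flow only at a high level; the sketch below is a hypothetical illustration of that flow, assuming a cosine-similarity score between an utterance embedding and the stored profile centroid. The embedding, the threshold value, and the trigger decision are placeholders, not details taken from the patent.

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def handle_utterance(utt_embedding, profile, alt_profile, threshold=0.7):
    """Illustrative decision flow: match the utterance against the
    predetermined user's profile, enroll it into the matching profile,
    and trigger the assistant only for the predetermined user."""
    centroid = np.mean(profile, axis=0)
    if cosine(utt_embedding, centroid) >= threshold:
        profile.append(utt_embedding)   # grow the predetermined user's profile
        return True                     # trigger the virtual assistant
    alt_profile.append(utt_embedding)   # collect speech of other speakers
    return False                        # do not trigger the virtual assistant
```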

142 citations


Proceedings ArticleDOI
01 Dec 2015
TL;DR: An evaluation focused on speech recognition, speaker diarization, and "lightly supervised" alignment of BBC TV recordings at ASRU 2015 is described, and the results obtained are summarized.
Abstract: This paper describes the Multi-Genre Broadcast (MGB) Challenge at ASRU 2015, an evaluation focused on speech recognition, speaker diarization, and "lightly supervised" alignment of BBC TV recordings. The challenge training data covered the full range of seven weeks of BBC TV output across four channels, resulting in about 1,600 hours of broadcast audio. In addition, several hundred million words of BBC subtitle text were provided for language modelling. A novel aspect of the evaluation was the exploration of speech recognition and speaker diarization in a longitudinal setting, i.e., recognition of several episodes of the same show, and speaker diarization across these episodes, linking speakers. The longitudinal tasks also offered the opportunity for systems to make use of supplied metadata including show title, genre tag, and date/time of transmission. This paper describes the task data and evaluation process used in the MGB challenge, and summarises the results obtained.

135 citations


Proceedings ArticleDOI
19 Apr 2015
TL;DR: This work considers two approaches to DNN-based SID: one that uses the DNN to extract features, and another that uses a DNN during feature modeling, and several methods of DNN feature processing are applied to bring significantly greater robustness to microphone speech.
Abstract: The recent application of deep neural networks (DNN) to speaker identification (SID) has resulted in significant improvements over current state-of-the-art on telephone speech. In this work, we report a similar achievement in DNN-based SID performance on microphone speech. We consider two approaches to DNN-based SID: one that uses the DNN to extract features, and another that uses the DNN during feature modeling. Modeling is conducted using the DNN/i-vector framework, in which the traditional universal background model is replaced with a DNN. The recently proposed use of bottleneck features extracted from a DNN is also evaluated. Systems are first compared with a conventional universal background model (UBM) Gaussian mixture model (GMM) i-vector system on the clean conditions of the NIST 2012 speaker recognition evaluation corpus, where a lack of robustness to microphone speech is found. Several methods of DNN feature processing are then applied to bring significantly greater robustness to microphone speech. To direct future research, the DNN-based systems are also evaluated in the context of audio degradations including noise and reverberation.
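In the DNN/i-vector framework mentioned above, the UBM's frame-level component posteriors are replaced by senone posteriors from the ASR DNN when accumulating Baum-Welch statistics. A minimal numpy sketch of that accumulation step follows; the shapes and names are illustrative assumptions, not the authors' code.

```python
import numpy as np

def sufficient_stats(features, posteriors):
    """Accumulate zero- and first-order statistics for i-vector training.
    features:   (T, D) acoustic features of one utterance
    posteriors: (T, C) per-frame senone posteriors from the ASR DNN
                (a classic system would use UBM component posteriors here)
    """
    N = posteriors.sum(axis=0)   # (C,)   zero-order statistics
    F = posteriors.T @ features  # (C, D) first-order statistics
    return N, F

T, D, C = 300, 40, 2048          # frames, feature dim, number of senones
feats = np.random.randn(T, D)
post = np.random.dirichlet(np.ones(C), size=T)
N, F = sufficient_stats(feats, post)
print(N.shape, F.shape)          # (2048,) (2048, 40)
```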

132 citations


Proceedings ArticleDOI
19 Apr 2015
TL;DR: This paper proposes an approach to multi-speaker TTS modeling with a general DNN, where the same hidden layers are shared among different speakers while the output layers are composed of speaker-dependent nodes regressing to the targets of each speaker.
Abstract: In DNN-based TTS synthesis, the hidden layers of a DNN can be viewed as a deep transformation of the linguistic features, and the output layers as a representation of the acoustic space that regresses the transformed linguistic features to acoustic parameters. Such deep-layered architectures can not only represent highly complex transformations compactly, but also take advantage of huge amounts of training data. In this paper, we propose an approach to multi-speaker TTS modeling with a general DNN, where the same hidden layers are shared among different speakers while the output layers are composed of speaker-dependent nodes regressing to the targets of each speaker. The experimental results show that our approach can significantly improve the quality of the synthesized speech, both objectively and subjectively, compared with speech synthesized from individual, speaker-dependent DNN-based TTS systems. We further transfer the hidden layers to a new speaker with limited training data, and the resultant synthesized speech of the new speaker also achieves good quality in terms of naturalness and speaker similarity.
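A minimal PyTorch sketch of the shared-hidden-layer, speaker-dependent-output-layer architecture described above; the layer sizes, activation functions, and feature dimensions are placeholders rather than the authors' configuration.

```python
import torch
import torch.nn as nn

class MultiSpeakerTTS(nn.Module):
    """Shared hidden layers transform linguistic features; each speaker
    owns a separate output (regression) layer onto acoustic parameters."""
    def __init__(self, ling_dim=300, hidden=1024, acoustic_dim=187,
                 speakers=("spk_a", "spk_b")):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Linear(ling_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
        )
        self.heads = nn.ModuleDict(
            {s: nn.Linear(hidden, acoustic_dim) for s in speakers})

    def forward(self, ling_feats, speaker):
        return self.heads[speaker](self.shared(ling_feats))

model = MultiSpeakerTTS()
y = model(torch.randn(8, 300), speaker="spk_a")  # acoustic params for one speaker
print(y.shape)                                   # torch.Size([8, 187])
```

Adapting to a new speaker with limited data then amounts to adding and training only a new output head while reusing (or lightly fine-tuning) the shared hidden layers.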

122 citations


Journal ArticleDOI
TL;DR: This work investigates techniques based on deep neural networks for attacking the single-channel multi-talker speech recognition problem and demonstrates that the proposed DNN-based system has remarkable noise robustness to the interference of a competing speaker.
Abstract: We investigate techniques based on deep neural networks (DNNs) for attacking the single-channel multi-talker speech recognition problem. Our proposed approach contains five key ingredients: a multi-style training strategy on artificially mixed speech data, a separate DNN to estimate senone posterior probabilities of the louder and softer speakers at each frame, a weighted finite-state transducer (WFST)-based two-talker decoder to jointly estimate and correlate the speaker and speech, a speaker switching penalty estimated from the energy pattern change in the mixed-speech, and a confidence based system combination strategy. Experiments on the 2006 speech separation and recognition challenge task demonstrate that our proposed DNN-based system has remarkable noise robustness to the interference of a competing speaker. The best setup of our proposed systems achieves an average word error rate (WER) of 18.8% across different SNRs and outperforms the state-of-the-art IBM superhuman system by 2.8% absolute with fewer assumptions.
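The multi-style training ingredient relies on artificially mixed two-talker speech. Below is a small numpy sketch of mixing a target and an interfering utterance at a prescribed SNR; the function name and the SNR grid are assumptions, not the authors' exact recipe.

```python
import numpy as np

def mix_at_snr(target, interferer, snr_db):
    """Scale the interfering talker so the mixture has the requested
    target-to-interferer SNR, then sum the two signals."""
    n = min(len(target), len(interferer))
    target, interferer = target[:n], interferer[:n]
    p_t = np.mean(target ** 2)
    p_i = np.mean(interferer ** 2) + 1e-12
    scale = np.sqrt(p_t / (p_i * 10.0 ** (snr_db / 10.0)))
    return target + scale * interferer

# Multi-style training would sweep a range of SNRs (e.g. +6, 0, -6 dB)
# and pair many different interfering talkers with each target utterance.
```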

115 citations


Patent
13 May 2015
TL;DR: In this article, a method and a system for identifying users in a multi-user environment are disclosed, which includes receiving audio data corresponding to an utterance of a voice command captured by a user device, the user device having a plurality of different users.
Abstract: A method and a system for identifying users in a multi-user environment are disclosed. The method includes receiving audio data corresponding to an utterance of a voice command captured by a user device, the user device having a plurality of different users; for each of a plurality of different users of the user device, obtaining corresponding speaker verification data; generating a corresponding speaker verification score using the corresponding speaker verification data and the audio data, the corresponding speaker verification score indicating the possibility that the utterance of the voice command is spoken by a corresponding one of the plurality of different users of the user device; identifying the speaker of the utterance of the voice command as the user of the plurality of different users of the user device associated with the highest corresponding speaker verification score; and using a voice recognition module to process the voice command to identify a particular action for the user device to perform, and, when the particular action is executed by the user device, initiating an application on the user device to access the application based on the user permission associated with the identified speaker.
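The core of the claimed flow (score the utterance against every enrolled user and attribute it to the highest-scoring one) can be illustrated with a short sketch; the scoring function and the data structures are hypothetical placeholders, not the patent's implementation.

```python
def identify_speaker(audio_data, users, verification_score):
    """users: dict mapping user name -> enrolled speaker verification data.
    Returns the enrolled user with the highest verification score."""
    scores = {name: verification_score(data, audio_data)
              for name, data in users.items()}
    return max(scores, key=scores.get)

# The identified user's permissions would then gate which action the
# recognized voice command is allowed to perform on the device.
```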

102 citations


Patent
01 May 2015
TL;DR: In this article, a dynamic threshold for speaker verification is proposed, which includes the actions of receiving, for each of multiple utterances of a hot word, a data set including at least a speaker verification confidence score and environmental context data, and selecting a particular data set from among the subset of data sets based on one or more selection criteria.
Abstract: The invention relates to the dynamic threshold for speaker verification. Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for a dynamic threshold for speaker verification are disclosed. In one aspect, a method includes the actions of receiving, for each of multiple utterances of a hot word, a data set including at least a speaker verification confidence score, and environmental context data. The actions further include selecting, from among the data sets, a subset of the data sets that are associated with a particular environmental context. The actions further include selecting a particular data set from among the subset of data sets based on one or more selection criteria. The actions further include selecting, as a speaker verification threshold for the particular environmental context, the speaker verification confidence score. The actions further include providing the speaker verification threshold for use in performing speaker verification of utterances that are associated with the particular environmental context.
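A hedged sketch of the per-context threshold selection described above; the concrete selection criterion (here, a fixed quantile of the observed confidence scores) is an assumption made for illustration, since the patent leaves the criteria open.

```python
from collections import defaultdict
import numpy as np

def thresholds_by_context(data_sets, quantile=0.1):
    """data_sets: iterable of (confidence_score, context_label) pairs from
    accepted hotword utterances. Returns one speaker verification threshold
    per environmental context (illustrative selection criterion only)."""
    by_context = defaultdict(list)
    for score, context in data_sets:
        by_context[context].append(score)
    return {ctx: float(np.quantile(scores, quantile))
            for ctx, scores in by_context.items()}

observations = [(0.92, "quiet"), (0.88, "quiet"), (0.61, "car"), (0.70, "car")]
print(thresholds_by_context(observations))
```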

95 citations


Proceedings ArticleDOI
06 Sep 2015
TL;DR: A novel multi-task deep learning framework is proposed for text-dependent speaker verification, and it is shown that the j-vector approach leads to good results on the evaluation data.
Abstract: Text-dependent speaker verification uses short utterances and verifies both speaker identity and text contents. Due to this nature, traditional state-of-the-art speaker verification approaches, such as i-vector, may not work well. Recently, there has been interest in applying deep learning to speaker verification; however, in previous works, standalone deep learning systems have not achieved state-of-the-art performance and they have to be used in system combination or as tandem features to obtain gains. In this paper, a novel multi-task deep learning framework is proposed for text-dependent speaker verification. First, multi-task deep learning is employed to learn both speaker identity and text information. With the learned network, the utterance-level average of the outputs of the last hidden layer, referred to as the j-vector (joint vector), is extracted. A discriminant function, with classes defined as the multi-task labels on both speaker and text, is then applied to the j-vectors as the decision function for closed-set recognition, and Probabilistic Linear Discriminant Analysis (PLDA), with classes defined on the multi-task labels, is applied to the j-vectors for verification. Experiments on the RSR2015 corpus showed that the j-vector approach leads to good results on the evaluation data. The proposed multi-task deep learning system achieved 0.54% EER and 0.14% EER for the closed-set condition.
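The j-vector itself is simply the utterance-level average of the last hidden layer's activations of the multi-task network, as a short sketch makes clear; the network forward function and the dimensions are placeholders.

```python
import numpy as np

def extract_jvector(frame_features, hidden_forward):
    """hidden_forward maps a (T, D) utterance to its last-hidden-layer
    activations of shape (T, H); the j-vector is their mean over time."""
    activations = hidden_forward(frame_features)  # (T, H)
    return activations.mean(axis=0)               # (H,) j-vector

# PLDA (or a discriminant function) with classes defined on the multi-task
# speaker x text labels is then applied to these j-vectors for scoring.
```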

Patent
06 May 2015
TL;DR: Systems and processes for robust end-pointing of speech signals using speaker recognition are presented, in which a stream of audio having a spoken user request is received, a first likelihood that the audio includes user speech and a second likelihood that it includes speech from an authorized user are determined, and a start-point or end-point of the request is located from these likelihoods.
Abstract: Systems and processes for robust end-pointing of speech signals using speaker recognition are provided. In one example process, a stream of audio having a spoken user request can be received. A first likelihood that the stream of audio includes user speech can be determined. A second likelihood that the stream of audio includes user speech spoken by an authorized user can be determined. A start-point or an end-point of the spoken user request can be determined based at least in part on the first likelihood and the second likelihood.


Proceedings ArticleDOI
19 Apr 2015
TL;DR: This paper examines an algorithm for resegmentation that operates instead in factor analysis subspace and yields a diarization error rate of 11.5% on the CALLHOME conversational telephone speech corpus.
Abstract: Resegmentation is an important post-processing step to refine the rough boundaries of diarization systems that rely on segment clustering of an initial uniform segmentation. Past work has primarily used a Viterbi resegmentation with MFCC features for this purpose. In this paper, we examine an algorithm for resegmentation that operates instead in factor analysis subspace. By combining this system with a speaker clustering front-end, we yield a diarization error rate of 11.5% on the CALLHOME conversational telephone speech corpus.
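For reference, the diarization error rate (DER) quoted above aggregates missed speech, false-alarm speech, and speaker-confusion time over the scored duration. A small sketch of the ratio follows; the split into the three error terms in the example is invented purely for illustration.

```python
def diarization_error_rate(missed, false_alarm, speaker_error, total_scored):
    """All arguments are durations in seconds; DER is the error fraction."""
    return (missed + false_alarm + speaker_error) / total_scored

# e.g. 30 s missed + 20 s false alarm + 65 s confusion over 1000 s scored:
print(f"{diarization_error_rate(30, 20, 65, 1000):.1%}")  # 11.5%
```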

Journal ArticleDOI
01 Feb 2015
TL;DR: The proposed method, based on Formants, Wavelet Entropy and Neural Networks and denoted FWENN, succeeds in the speaker verification and identification tasks with a high classification rate, using only 12 coefficient features and only one vowel signal.
Abstract: This paper proposes a new method for speaker feature extraction based on Formants, Wavelet Entropy and Neural Networks, denoted as FWENN. In the first stage, five formants and seven Shannon entropy wavelet packets are extracted from the speakers' signals as the speaker feature vector. In the second stage, these 12 feature extraction coefficients are used as inputs to feed-forward neural networks. A probabilistic neural network is also proposed for comparison. In contrast to conventional speaker recognition methods that extract features from sentences (or words), the proposed method extracts the features from vowels. Advantages of using vowels include the ability to recognize speakers when only partially-recorded words are available. This may be useful for deaf-mute persons or when the recordings are damaged. Experimental results show that the proposed method succeeds in the speaker verification and identification tasks with a high classification rate. This is accomplished with a minimum amount of information, using only 12 coefficient features (i.e. vector length) and only one vowel signal, which is the major contribution of this work. The results are further compared to well-known classical algorithms for speaker recognition and are found to be superior.
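A hedged sketch of assembling the 12-dimensional FWENN feature vector (five formants plus seven Shannon wavelet-packet entropies) for one vowel segment; the formant values and wavelet-packet coefficients are assumed to be computed elsewhere, and the exact extraction settings of the paper are not reproduced here.

```python
import numpy as np

def shannon_entropy(coeffs):
    """Shannon entropy of one wavelet-packet node's coefficients."""
    p = np.abs(coeffs) ** 2
    p = p / (p.sum() + 1e-12)
    return float(-np.sum(p * np.log2(p + 1e-12)))

def fwenn_features(formants, wavelet_packets):
    """formants: 5 formant frequencies of the vowel (Hz);
    wavelet_packets: 7 coefficient arrays from a wavelet-packet decomposition.
    Returns the 12-dimensional vector fed to the feed-forward neural network."""
    entropies = [shannon_entropy(c) for c in wavelet_packets]
    return np.concatenate([np.asarray(formants, dtype=float), entropies])

vec = fwenn_features([700, 1200, 2500, 3400, 4200],
                     [np.random.randn(64) for _ in range(7)])
print(vec.shape)  # (12,)
```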

Proceedings ArticleDOI
19 Apr 2015
TL;DR: The experimental results on the WSJ0 and WSJ1 datasets show that the proposed speaker representations are useful in normalising the speaker effects for robust DNN-based automatic speech recognition.
Abstract: The conventional short-term interval features used by Deep Neural Networks (DNNs) lack the ability to learn longer-term information. This poses a challenge for training a speaker-independent (SI) DNN, since the short-term features do not provide sufficient information for the DNN to estimate the real robust factors of speaker-level variations. The key to this problem is to obtain a sufficiently robust and informative speaker representation. This paper compares several speaker representations. Firstly, a DNN speaker classifier is used to extract bottleneck features as the speaker representation, called the Bottleneck Speaker Vector (BSV). To further improve the robustness of this representation, a first-order Bottleneck Speaker Super Vector (BSSV) is also proposed, where the BSV is expanded into a super-vector space by incorporating the phoneme posterior probabilities. Finally, a more fine-grained speaker representation based on FMLLR-shifted features is examined. The experimental results on the WSJ0 and WSJ1 datasets show that the proposed speaker representations are useful in normalising the speaker effects for robust DNN-based automatic speech recognition. The best performance is achieved by augmenting both the BSSV and the FMLLR-shifted representations, yielding 10.0% - 15.3% relative performance gains over the SI DNN baseline.

Journal ArticleDOI
TL;DR: An impostor selection algorithm and a universal model adaptation process are proposed in a hybrid system based on deep belief networks and deep neural networks to discriminatively model each target speaker and to fill the performance gap between cosine and PLDA scoring baseline techniques for speaker recognition.
Abstract: The promising performance of Deep Learning (DL) in speech recognition has motivated the use of DL in other speech technology applications such as speaker recognition. Given i-vectors as inputs, the authors proposed an impostor selection algorithm and a universal model adaptation process in a hybrid system based on Deep Belief Networks (DBN) and Deep Neural Networks (DNN) to discriminatively model each target speaker. In order to have more insight into the behavior of DL techniques in both single and multi-session speaker enrollment tasks, some experiments have been carried out in this paper in both scenarios. Additionally, the parameters of the global model, referred to as the universal DBN (UDBN), are normalized before adaptation. UDBN normalization facilitates training DNNs, specifically those with more than one hidden layer. Experiments are performed on the NIST SRE 2006 corpus. It is shown that the proposed impostor selection algorithm and UDBN adaptation process enhance the performance of conventional DNNs by 8-20% and 16-20% in terms of EER for the single and multi-session tasks, respectively. In both scenarios, the proposed architectures outperform the baseline systems, obtaining up to a 17% reduction in EER.

Proceedings ArticleDOI
04 May 2015
TL;DR: The comparison of both speech synthesis modules integrated in the proposed DROPSY-based approach reveals that both can efficiently de-identify the input speakers while still producing intelligible speech.
Abstract: The paper addresses the problem of speaker (or voice) de-identification by presenting a novel approach for concealing the identity of speakers in their speech. The proposed technique first recognizes the input speech with a diphone recognition system and then transforms the obtained phonetic transcription into the speech of another speaker with a speech synthesis system. Due to the fact that a Diphone RecOgnition step and a sPeech SYnthesis step are used during the de-identification, we refer to the developed technique as DROPSY. With this approach the acoustical models of the recognition and synthesis modules are completely independent from each other, which ensures the highest level of input speaker de-identification. The proposed DROPSY-based de-identification approach is language dependent, text independent and capable of running in real-time due to the relatively simple computing methods used. When designing speaker de-identification technology, two requirements are typically imposed on the de-identification techniques: i) it should not be possible to establish the identity of the speakers based on the de-identified speech, and ii) the processed speech should still sound natural and be intelligible. This paper, therefore, implements the proposed DROPSY-based approach with two different speech synthesis techniques (i.e., the HMM-based and the diphone TD-PSOLA-based technique). The resulting de-identified speech is evaluated for intelligibility and assessed in speaker verification experiments with a state-of-the-art (i-vector/PLDA) speaker recognition system. The comparison of the two speech synthesis modules integrated in the proposed method reveals that both can efficiently de-identify the input speakers while still producing intelligible speech.

Patent
30 Nov 2015
TL;DR: An audio system is described that adjusts the direction of its audio output based on the location and number of users, identified with a camera and a tracking application, in order to optimize the performance of the system.
Abstract: Embodiments herein describe an audio system that adjusts based on the location of a person. That is, instead of relying on fixed speakers, the audio system adjusts the direction of audio output for one or more speakers to optimize the performance of the audio system based on the location of a user or based on the number of users. To do so, the audio system may include a camera and a tracking application which identifies the location of a user and/or the number of users in front of the camera. Using this information, the audio system adjusts one or more actuators coupled to a speaker to change the direction of the audio output of the speaker. As the user continues to move or shift, the audio system can continually adjust the speaker to optimize the performance of the system.

Proceedings ArticleDOI
03 May 2015
TL;DR: AM-FM based features were shown to be more robust to varying training/testing conditions and to improve speaker verification performance for both normal and whispered speech by using the GMM based system alone.
Abstract: In this paper, automatic speaker verification using whispered speech is explored. In the past, whispered speech has been shown to convey relevant speaker identity and gender information; nevertheless, it is not clear how to efficiently use this information in speech-based biometric systems. This study compares the performance of three different speaker verification systems trained and tested under different scenarios and with two different feature representations. First, we show the benefits of using AM-FM based features as well as their effectiveness for i-vector extraction. Second, for the classical mel-frequency cepstral coefficient (MFCC) features, we show that gains of up to 40% could be achieved with the fusion of traditional Gaussian mixture model (GMM) based systems and more recent i-vector based ones, relative to using either system alone for normal speech. Additionally, for MFCC, fusion schemes were shown to be more robust to the addition of whispered speech data during training or enrollment. Overall, AM-FM based features were shown to be more robust to varying training/testing conditions and to improve speaker verification performance for both normal and whispered speech by using the GMM based system alone.
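The fusion referred to above can be as simple as a weighted combination of the two systems' normalized trial scores; the sketch below assumes a fixed linear fusion with z-normalized scores, which may differ from the calibration actually used in the paper.

```python
import numpy as np

def znorm(scores):
    """Normalize one system's trial scores to zero mean, unit variance."""
    s = np.asarray(scores, dtype=float)
    return (s - s.mean()) / (s.std() + 1e-9)

def fuse(gmm_scores, ivector_scores, w=0.5):
    """Linear score-level fusion of the GMM and i-vector systems."""
    return w * znorm(gmm_scores) + (1.0 - w) * znorm(ivector_scores)
```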

Journal ArticleDOI
TL;DR: This study shows that it is more interesting to use written names for their high precision for identifying the current speaker, and proposes two approaches for finding speaker identity based only on names written in the image track.
Abstract: Identifying speakers in TV broadcasts in an unsupervised way (i.e., without biometric models) is a solution for avoiding costly annotations. Existing methods usually use pronounced names as a source of names for identifying speech clusters provided by a diarization step, but this source is too imprecise to provide sufficient confidence. To overcome this issue, another source of names can be used: the names written in a title block in the image track. We first compared these two sources of names on their ability to provide the names of the speakers in TV broadcasts. This study shows that it is more interesting to use written names for their high precision for identifying the current speaker. We also propose two approaches for finding speaker identity based only on names written in the image track. With the "late naming" approach, we propose different propagations of written names onto clusters. Our second proposition, "early naming", modifies the speaker diarization module (agglomerative clustering) by adding constraints that prevent two clusters with different associated written names from being merged together. These methods were tested on the REPERE corpus phase 1, containing 3 hours of annotated videos. Our best "late naming" system reaches an F-measure of 73.1%. "Early naming" improves on this result both in terms of identification error rate and in terms of stability of the clustering stopping criterion. By comparison, a mono-modal, supervised speaker identification system with 535 speaker models trained on matching development data and additional TV and radio data only provided a 57.2% F-measure.
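The "early naming" idea amounts to a cannot-link constraint inside the agglomerative clustering: two clusters already carrying different written names must never be merged. A simplified sketch of that constraint follows; the distance function, its update after a merge, and the stopping threshold are placeholders, not the paper's implementation.

```python
def allowed_to_merge(names_a, names_b):
    """Cannot-link constraint: block the merge when both clusters already
    carry written names and the two name sets share no identity."""
    return not (names_a and names_b and names_a.isdisjoint(names_b))

def constrained_agglomerative(clusters, names, distance, stop):
    """clusters: list of cluster ids; names: dict id -> set of written names;
    distance(a, b): dissimilarity between two clusters (placeholder)."""
    while len(clusters) > 1:
        pairs = [(distance(a, b), a, b)
                 for i, a in enumerate(clusters) for b in clusters[i + 1:]
                 if allowed_to_merge(names[a], names[b])]
        if not pairs:
            break                      # only incompatible clusters remain
        d, a, b = min(pairs)
        if d > stop:
            break                      # usual distance-based stopping criterion
        clusters.remove(b)             # merge cluster b into cluster a
        names[a] |= names[b]           # the merged cluster inherits all names
    return clusters
```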

Journal ArticleDOI
TL;DR: This paper presents an on-line multimodal SD algorithm designed to work in a realistic environment with multiple, overlapping speakers, which achieves an average diarization error rate of 11.48% and is able to run at 3.2x real-time.
Abstract: Speaker diarization (SD) is the process of assigning speech segments of an audio stream to their corresponding speakers, thus comprising the problems of voice activity detection (VAD), speaker labeling/identification, and often sound source localization (SSL). Most research activities in the past aimed at applications such as broadcast news, meetings, conversational telephony, and automatic multimodal data annotation, where SD may be performed off-line. However, a recent research focus is human-computer interaction (HCI) systems where SD must be performed on-line, and in real-time, as in modern gaming devices and interaction with large displays. Often, such applications further suffer from noise, reverberation, and overlapping speech, making them increasingly challenging. In such situations, multimodal/multisensory approaches can provide more accurate results than unimodal ones, given that one data stream may compensate for occasional instabilities of the other modalities. Accordingly, this paper presents an on-line multimodal SD algorithm designed to work in a realistic environment with multiple, overlapping speakers. Our work employs a microphone array, a color camera, and a depth sensor as input streams, from which speech-related features are extracted to be later merged through a support vector machine approach consisting of VAD and SSL modules. Speaker identification is incorporated through a hybrid technique of face positioning history and face recognition. Our final SD approach experimentally achieves an average diarization error rate of 11.48% in scenarios with up to three simultaneous speakers, and is able to run at 3.2x real-time.

Proceedings ArticleDOI
01 Dec 2015
TL;DR: Improved performance can be obtained by appending speaker discriminative features to the more widely used mel-frequency cepstrum coefficients, as measured by the minimum detection cost function (minDCF).
Abstract: This paper describes the application of deep neural networks (DNNs), trained to discriminate among speakers, to improving performance in text-independent speaker verification. Activations from the bottleneck layer of these DNNs are used as features in an i-vector based speaker verification system. The features derived from this network are thought to be more robust with respect to phonetic variability, which is generally considered to have a negative impact on speaker verification performance. The verification performance using these features is evaluated on the 2012 NIST SRE core-core condition with models trained from a subset of the Fisher and Switchboard conversational speech corpora. It is found that improved performance, as measured by the minimum detection cost function (minDCF), can be obtained by appending speaker discriminative features to the more widely used mel-frequency cepstrum coefficients.
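The minDCF metric used above weights the miss and false-alarm rates by application-dependent costs and a target prior; a minimal sketch of the sweep over decision thresholds follows. The costs and prior shown are common NIST-style defaults used here only as an assumption, and the usual normalization by the best trivial system is omitted.

```python
import numpy as np

def min_dcf(genuine, impostor, p_target=0.01, c_miss=1.0, c_fa=1.0):
    """Minimum (unnormalized) detection cost over all decision thresholds."""
    thresholds = np.sort(np.concatenate([genuine, impostor]))
    best = np.inf
    for t in thresholds:
        p_miss = (genuine < t).mean()
        p_fa = (impostor >= t).mean()
        cost = c_miss * p_miss * p_target + c_fa * p_fa * (1.0 - p_target)
        best = min(best, cost)
    return best
```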

Patent
27 Mar 2015
TL;DR: An illustrative audio file analyzer computing system uses a pre-selected seed segment, chosen for example by a person through an interactive graphical user interface, to perform a semi-supervised diarization of an audio file.
Abstract: An audio file analyzer computing system includes technologies to, among other things, localize audio events of interest (such as speakers of interest) within an audio file that includes multiple different classes (e.g., different speakers) of audio. The illustrative audio file analyzer computing system uses a seed segment to perform a semi-supervised diarization of the audio file. The seed segment is pre-selected, such as by a person using an interactive graphical user interface.

Journal ArticleDOI
TL;DR: This paper addresses the adaptation of an acoustic-articulatory model of a reference speaker to the voice of another speaker, using a limited amount of audio-only data, with a new framework called cascaded Gaussian mixture regression (C-GMR), and derives two implementations.
Abstract: This paper addresses the adaptation of an acoustic-articulatory model of a reference speaker to the voice of another speaker, using a limited amount of audio-only data. In the context of pronunciation training, a virtual talking head displaying the internal speech articulators (e.g., the tongue) could be automatically animated by means of such a model using only the speaker's voice. In this study, the articulatory-acoustic relationship of the reference speaker is modeled by a Gaussian mixture model (GMM). To address the speaker adaptation problem, we propose a new framework called cascaded Gaussian mixture regression (C-GMR), and derive two implementations. The first one, referred to as Split-C-GMR, is a straightforward chaining of two distinct GMRs: one mapping the acoustic features of the source speaker into the acoustic space of the reference speaker, and the other estimating the articulatory trajectories with the reference model. In the second implementation, referred to as Integrated-C-GMR, the two mapping steps are tied together in a single probabilistic model. For this latter model, we present the full derivation of the exact EM training algorithm, which explicitly exploits the missing-data methodology of machine learning. Other adaptation schemes based on maximum a posteriori (MAP) estimation, maximum likelihood linear regression (MLLR) and direct cross-speaker acoustic-to-articulatory GMR are also investigated. Experiments conducted on two speakers for different amounts of adaptation data show the benefit of the proposed C-GMR techniques.
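At the core of both C-GMR variants is Gaussian mixture regression: predicting articulatory features y from acoustic features x as the conditional expectation under a GMM fitted on joint [x; y] vectors. A compact numpy sketch of that single mapping step is given below; the C-GMR chaining and the EM training themselves are not reproduced.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmr_predict(x, weights, means, covs, dx):
    """E[y | x] under a GMM over joint vectors [x; y].
    means[k]: joint mean, covs[k]: joint covariance, dx: dimension of x."""
    resp, cond_means = [], []
    for w, mu, S in zip(weights, means, covs):
        mu_x, mu_y = mu[:dx], mu[dx:]
        S_xx, S_yx = S[:dx, :dx], S[dx:, :dx]
        resp.append(w * multivariate_normal.pdf(x, mean=mu_x, cov=S_xx))
        cond_means.append(mu_y + S_yx @ np.linalg.solve(S_xx, x - mu_x))
    resp = np.asarray(resp) / (np.sum(resp) + 1e-300)
    # Responsibility-weighted sum of the per-component conditional means.
    return sum(r * m for r, m in zip(resp, cond_means))
```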

Patent
20 Mar 2015
TL;DR: A computer system generates an audio fingerprint of the current speaker from received audio data, performs automated speaker recognition against a speaker fingerprint repository, obtains tagging information from an observer's client device when the speaker is unrecognized, and communicates metadata that identifies the current speaker to the client device of that observer or of a different observer.
Abstract: A computer system may communicate metadata that identifies a current speaker. The computer system may receive audio data that represents speech of the current speaker, generate an audio fingerprint of the current speaker based on the audio data, and perform automated speaker recognition by comparing the audio fingerprint of the current speaker against stored audio fingerprints contained in a speaker fingerprint repository. The computer system may communicate data indicating that the current speaker is unrecognized to a client device of an observer and receive tagging information that identifies the current speaker from the client device of the observer. The computer system may store the audio fingerprint of the current speaker and metadata that identifies the current speaker in the speaker fingerprint repository and communicate the metadata that identifies the current speaker to at least one of the client device of the observer or a client device of a different observer.

Proceedings ArticleDOI
19 Apr 2015
TL;DR: The use of deep neural nets to provide initial speaker change points in a speaker diarization system is investigated and it is shown that this DNN-based change point detector reduces the number of missed change points for both an English test set and a French dev set.
Abstract: We investigate the use of deep neural nets (DNNs) to provide initial speaker change points in a speaker diarization system. The DNN is trained on states that correspond to the location of the speaker change point (SCP) within the speech segment input to the DNN. We model these different speaker change point locations in the DNN input by 10 to 20 states. The confidence in the SCP is measured by the number of frame-synchronous states that correspond to the hypothesized speaker change point. We only keep the speaker change points with the highest confidence. We show that this DNN-based change point detector reduces the number of missed change points for both an English test set and a French dev set. We also show that the DNN-based change points reduce the diarization error rate for both an English and a French diarization system. These results show the feasibility of DNNs for providing initial speaker change points.


Patent
17 Apr 2015
TL;DR: A method of augmenting training data is presented that converts a feature sequence of a source speaker, determined from a plurality of utterances within a transcript, to a feature sequence of a target speaker under the same transcript, trains a speaker-dependent acoustic model for the target speaker for the corresponding speaker-specific acoustic characteristics, and maps each utterance from each speaker in a training set to multiple selected target speakers using the estimated mapping function.
Abstract: A method of augmenting training data includes converting a feature sequence of a source speaker determined from a plurality of utterances within a transcript to a feature sequence of a target speaker under the same transcript, training a speaker-dependent acoustic model for the target speaker for corresponding speaker-specific acoustic characteristics, estimating a mapping function between the feature sequence of the source speaker and the speaker-dependent acoustic model of the target speaker, and mapping each utterance from each speaker in a training set using the mapping function to multiple selected target speakers in the training set.

Proceedings ArticleDOI
06 Sep 2015
TL;DR: The original speaker comparison netwo rk can be improved by adding a nonlinear transform layer, and that further gains are possible by training the network to perform speaker classification rather than comparison.
Abstract: Speaker diarization finds contiguous speaker segments in an audio stream and clusters them by speaker identity, without using a-priori knowledge about the number of speakers or enrollment data. Diarization typically clusters speech segments based on short-term spectral features. In prior work, we showed that neural networks can serve as discriminative feature transformers for diarization by training them to perform same/different speaker comparisons on speech segments, yielding improved diarization accuracy when combined with standard MFCC-based models. In this work, we explore a wider range of neural network architectures for feature transformation, by adding additional layers and nonlinearities, and by varying the objective function during training. We find that the original speaker comparison netwo rk can be improved by adding a nonlinear transform layer, and that further gains are possible by training the network to perform speaker classification rather than comparison. Overal l we achieve relative reductions in speaker error between 18% and 34% on a variety of test data from the AMI, ICSI, and NIST-RT corpora.