
Showing papers on "Speaker diarisation published in 2012"


Journal ArticleDOI
TL;DR: An analysis of speaker diarization performance as reported through the NIST Rich Transcription evaluations on meeting data is presented, and important areas for future research are identified.
Abstract: Speaker diarization is the task of determining “who spoke when?” in an audio or video recording that contains an unknown amount of speech and also an unknown number of speakers. Initially, it was proposed as a research topic related to automatic speech recognition, where speaker diarization serves as an upstream processing step. Over recent years, however, speaker diarization has become a key technology for many tasks, such as navigation, retrieval, or higher-level inference on audio data. Accordingly, many important improvements in accuracy and robustness have been reported in journals and conferences in the area. The application domains, from broadcast news, to lectures and meetings, vary greatly and pose different problems, such as having access to multiple microphones and multimodal information or overlapping speech. The most recent review of existing technology dates back to 2006 and focuses on the broadcast news domain. In this paper, we review the current state-of-the-art, focusing on research developed since 2006 that relates predominantly to speaker diarization for conference meetings. Finally, we present an analysis of speaker diarization performance as reported through the NIST Rich Transcription evaluations on meeting data and identify important areas for future research.

706 citations


Journal ArticleDOI
TL;DR: A new feature based on relative phase shift (RPS) is proposed, reliable detection of synthetic speech is demonstrated, and it is shown how this classifier can be used to improve the security of SV systems.
Abstract: In this paper, we evaluate the vulnerability of speaker verification (SV) systems to synthetic speech. The SV systems are based on either the Gaussian mixture model–universal background model (GMM-UBM) or support vector machine (SVM) using GMM supervectors. We use a hidden Markov model (HMM)-based text-to-speech (TTS) synthesizer, which can synthesize speech for a target speaker using small amounts of training data through model adaptation of an average voice or background model. Although the SV systems have a very low equal error rate (EER), when tested with synthetic speech generated from speaker models derived from the Wall Street Journal (WSJ) speech corpus, over 81% of the matched claims are accepted. This result suggests vulnerability in SV systems and thus a need to accurately detect synthetic speech. We propose a new feature based on relative phase shift (RPS), demonstrate reliable detection of synthetic speech, and show how this classifier can be used to improve security of SV systems.
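
A brief sketch may help make the vulnerability concrete. The snippet below illustrates generic GMM-UBM verification scoring, in which a trial is accepted when the average log-likelihood ratio between a target model and a universal background model exceeds a threshold; in practice the target model is MAP-adapted from the UBM, whereas here a separately fitted GMM stands in for brevity. This is an illustrative assumption, not the paper's exact system.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_gmm(features, n_components=64):
    # features: (n_frames, n_dims) acoustic features (e.g., MFCCs)
    gmm = GaussianMixture(n_components=n_components, covariance_type="diag")
    gmm.fit(features)
    return gmm

def verification_score(test_features, target_gmm, ubm):
    # Average frame-level log-likelihood ratio between target model and UBM.
    return np.mean(target_gmm.score_samples(test_features)
                   - ubm.score_samples(test_features))

# A claim is accepted when the score exceeds a decision threshold, e.g.:
# accept = verification_score(test_feats, target_gmm, ubm) > threshold
```

Synthetic speech adapted toward the same target model can push this score above the threshold, which is why the paper pairs verification with a dedicated synthetic-speech detector.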

229 citations


Journal ArticleDOI
TL;DR: A phase information extraction method is proposed that normalizes the change variation in the phase according to the frame position of the input speech and combines the phase information with MFCCs in text-independent speaker identification and verification methods.
Abstract: In conventional speaker recognition methods based on Mel-frequency cepstral coefficients (MFCCs), phase information has hitherto been ignored. In this paper, we propose a phase information extraction method that normalizes the change variation in the phase according to the frame position of the input speech and combines the phase information with MFCCs in text-independent speaker identification and verification methods. There is a problem with the original phase information extraction method when comparing two phase values. For example, the difference between the two values θ1 = π - θ̃1 and θ2 = -π + θ̃1 is 2π - 2θ̃1; if θ̃1 ≈ 0, then the difference ≈ 2π, despite the two phases being very similar to one another. To address this problem, we map the phase into coordinates on a unit circle. Speaker identification and verification experiments are performed using the NTT database, which consists of sentences uttered by 35 (22 male and 13 female) Japanese speakers in normal, fast and slow speaking modes during five sessions. Although the phase information-based method performs worse than the MFCC-based method, it augments the MFCCs, and the combination is useful for speaker recognition. The proposed modified phase information is more robust than the original phase information for all speaking modes. By integrating the modified phase information with the MFCCs, the speaker identification rate was improved from 97.4% (MFCC) to 98.8%, and the equal error rate for speaker verification was reduced from 0.72% (MFCC) to 0.45%. We also conducted speaker identification and verification experiments on the large-scale Japanese Newspaper Article Sentences (JNAS) database, and a similar trend to that on the NTT database was obtained.
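
The phase-wrapping problem described above is easy to demonstrate numerically. The toy illustration below (not the authors' implementation) shows how two nearly identical phases near ±π have a misleadingly large raw difference, while mapping each phase to unit-circle coordinates keeps them close.

```python
import numpy as np

theta1 = np.pi - 0.01      # a phase just below +pi
theta2 = -np.pi + 0.01     # the (almost) same angle, wrapped to just above -pi

raw_difference = abs(theta1 - theta2)       # ~ 2*pi, suggests the phases differ greatly

# Map each phase to coordinates on the unit circle and compare there instead.
p1 = np.array([np.cos(theta1), np.sin(theta1)])
p2 = np.array([np.cos(theta2), np.sin(theta2)])
circle_distance = np.linalg.norm(p1 - p2)   # ~ 0.02, reflecting the true similarity
```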

196 citations


Journal ArticleDOI
TL;DR: The results demonstrate that the P600 is modulated by speaker identity, extending knowledge about the role of speaker characteristics in the neural correlates of speech processing.
Abstract: How do native listeners process grammatical errors that are frequent in non-native speech? We investigated whether the neural correlates of syntactic processing are modulated by speaker identity. ERPs to gender agreement errors in sentences spoken by a native speaker were compared with the same errors spoken by a non-native speaker. In line with previous research, gender violations in native speech resulted in a P600 effect (larger P600 for violations in comparison with correct sentences), but when the same violations were produced by the non-native speaker with a foreign accent, no P600 effect was observed. Control sentences with semantic violations elicited comparable N400 effects for both the native and the non-native speaker, confirming no general integration problem in foreign-accented speech. The results demonstrate that the P600 is modulated by speaker identity, extending our knowledge about the role of speaker characteristics in the neural correlates of speech processing.

191 citations


Proceedings ArticleDOI
25 Mar 2012
TL;DR: Experiments on a subset of the NIST 2006 SRE corpus indicate that the JFA method is the most resilient against conversion attacks, but even it experiences a more than fivefold increase in the false acceptance rate.
Abstract: Voice conversion - the methodology of automatically converting one's utterances to sound as if spoken by another speaker - presents a threat for applications relying on speaker verification. We study the vulnerability of text-independent speaker verification systems against voice conversion attacks using telephone speech. We implemented a voice conversion system with two types of features and nonparallel frame alignment methods, and five speaker verification systems ranging from simple Gaussian mixture models (GMMs) to a state-of-the-art joint factor analysis (JFA) recognizer. Experiments on a subset of the NIST 2006 SRE corpus indicate that the JFA method is the most resilient against conversion attacks, but even it experiences a more than fivefold increase in the false acceptance rate, from 3.24% to 17.33%.

187 citations


Proceedings Article
01 Jan 2012
TL;DR: Experiments show that features derived from the phase spectrum outperform mel-frequency cepstral coefficients (MFCCs) by a large margin: even without converted speech for training, the equal error rate (EER) is reduced from 20.20% with MFCCs to 2.35%.
Abstract: Voice conversion techniques present a threat to speaker verification systems. To enhance the security of speaker verification systems, we study how to automatically distinguish natural speech from synthetic/converted speech. Motivated by research on the phase spectrum in speech perception, in this study we propose to use features derived from the phase spectrum to detect converted speech. The features are tested under three different training situations of the converted speech detector: a) only Gaussian mixture model (GMM) based converted speech data are available; b) only unit-selection based converted speech data are available; c) no converted speech data are available for training the converted speech model. Experiments conducted on the National Institute of Standards and Technology (NIST) 2006 speaker recognition evaluation (SRE) corpus show that the features derived from the phase spectrum outperform mel-frequency cepstral coefficients (MFCCs) by a large margin: even without converted speech for training, the equal error rate (EER) is reduced from 20.20% with MFCCs to 2.35%.
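
Several of the papers above report results as an equal error rate. As a reference, the sketch below shows one common way to compute the EER from detector scores via the ROC curve; it is a generic utility, not tied to the specific features or corpus used in the paper.

```python
import numpy as np
from sklearn.metrics import roc_curve

def equal_error_rate(labels, scores):
    """labels: 1 for natural speech, 0 for converted; scores: detector outputs."""
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1.0 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))   # operating point where FAR ~= FRR
    return (fpr[idx] + fnr[idx]) / 2.0
```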

170 citations


Journal ArticleDOI
TL;DR: This work investigates CASA for robust speaker identification and introduces a novel speaker feature, gammatone frequency cepstral coefficient (GFCC), based on an auditory periphery model, and shows that this feature captures speaker characteristics and performs substantially better than conventional speaker features under noisy conditions.
Abstract: Conventional speaker recognition systems perform poorly under noisy conditions. Inspired by auditory perception, computational auditory scene analysis (CASA) typically segregates speech by producing a binary time-frequency mask. We investigate CASA for robust speaker identification. We first introduce a novel speaker feature, gammatone frequency cepstral coefficient (GFCC), based on an auditory periphery model, and show that this feature captures speaker characteristics and performs substantially better than conventional speaker features under noisy conditions. To deal with noisy speech, we apply CASA separation and then either reconstruct or marginalize corrupted components indicated by a CASA mask. We find that both reconstruction and marginalization are effective. We further combine the two methods into a single system based on their complementary advantages, and this system achieves significant performance improvements over related systems under a wide range of signal-to-noise ratios.
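
As a rough sketch of the GFCC idea, the snippet below applies cubic-root compression to gammatone filterbank energies and then a DCT, mirroring the usual cepstral recipe. The helper that produces the filterbank energies (gammatone_energies) is assumed to come from any ERB-spaced gammatone implementation; it is a hypothetical input here, and the parameter values are illustrative rather than the paper's.

```python
import numpy as np
from scipy.fftpack import dct

def gfcc(gammatone_energies, n_coeffs=22):
    """gammatone_energies: (n_frames, n_channels) gammatone filterbank energies."""
    compressed = np.cbrt(gammatone_energies)            # cubic-root loudness compression
    return dct(compressed, type=2, norm="ortho", axis=1)[:, :n_coeffs]
```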

154 citations


Patent
Phillip Rutschman
01 Mar 2012
TL;DR: In this article, a wireless system including a first speaker and a second speaker, where the first and second speakers communicate with each other over a wireless link, is described; the first speaker includes both a primary wireless interface for receiving audio from an audio source and a secondary wireless interface for transmitting a portion of the audio to the second speaker.
Abstract: A wireless system including a first speaker and a second speaker, where the first and second speakers communicate with each other over a wireless link. In some configurations, the first speaker includes both a primary wireless interface for receiving audio from an audio source and a secondary wireless interface for transmitting a portion of the audio to the second speaker. The speakers can incorporate Near Field Communication (NFC) technology to provide the wireless link between them. The wireless system can be configured to synchronize audio output at the speakers, and can also include a second-speaker detection mechanism that permits the first speaker to be used in either a stand-alone mode, with audio output at only the first speaker, or full-headset mode, with audio output at both speakers when the second speaker is detected within wireless range of the first speaker.

130 citations


Patent
Elizabeth V. Woodward, Shunguo Yan
26 Sep 2012
TL;DR: In this article, a speaker providing speech in an audio track of the at least one segment is identified using information retrieved from a social network service source, an acoustic profile for the segment is generated based on the generated speech profile, and an automatic speech recognition engine is dynamically configured for operation on the audio track corresponding to the speaker.
Abstract: Mechanisms for performing dynamic automatic speech recognition on a portion of multimedia content are provided. Multimedia content is segmented into homogeneous segments of content with regard to speakers and background sounds. For the at least one segment, a speaker providing speech in an audio track of the at least one segment is identified using information retrieved from a social network service source. A speech profile for the speaker is generated using information retrieved from the social network service source, an acoustic profile for the segment is generated based on the generated speech profile, and an automatic speech recognition engine is dynamically configured for operation on the at least one segment based on the acoustic profile. Automatic speech recognition operations are performed on the audio track of the at least one segment to generate a textual representation of speech content in the audio track corresponding to the speaker.

104 citations


Journal ArticleDOI
TL;DR: The goal is to automatically recognize “who is speaking what” in an online manner for meeting assistance; the techniques used to achieve low-latency monitoring of meetings are described, and experimental results for real-time meeting transcription are shown.
Abstract: This paper presents our real-time meeting analyzer for monitoring conversations in an ongoing group meeting. The goal of the system is to recognize automatically “who is speaking what” in an online manner for meeting assistance. Our system continuously captures the utterances and face poses of each speaker using a microphone array and an omni-directional camera positioned at the center of the meeting table. Through a series of advanced audio processing operations, an overlapping speech signal is enhanced and the components are separated into individual speaker's channels. Then the utterances are sequentially transcribed by our speech recognizer with low latency. In parallel with speech recognition, the activity of each participant (e.g., speaking, laughing, watching someone) and the circumstances of the meeting (e.g., topic, activeness, casualness) are detected and displayed on a browser together with the transcripts. In this paper, we describe our techniques and our attempt to achieve the low-latency monitoring of meetings, and we show our experimental results for real-time meeting transcription.

101 citations


Journal ArticleDOI
TL;DR: A complete framework for speaker indexing is proposed, which aims to be domain-independent, parameter-free, and applicable to both online and offline applications.

Proceedings Article
01 Dec 2012
TL;DR: To reduce the false acceptance rate caused by spoofing attacks, a general anti-spoofing framework is proposed for speaker verification systems, in which a converted speech detector is adopted as a post-processing module for the speaker verification system's acceptance decision.
Abstract: Voice conversion techniques, which modify one speaker's (source) voice to sound like another speaker's (target), present a threat to automatic speaker verification. In this paper, we first present new results of evaluating the vulnerability of current state-of-the-art speaker verification systems, Gaussian mixture model with joint factor analysis (GMM-JFA) and probabilistic linear discriminant analysis (PLDA), against spoofing attacks. The spoofing attacks are simulated by two voice conversion techniques: Gaussian mixture model based conversion and unit selection based conversion. To reduce the false acceptance rate caused by spoofing attacks, we propose a general anti-spoofing framework for speaker verification systems, in which a converted speech detector is adopted as a post-processing module for the speaker verification system's acceptance decision. The detector decides whether the accepted claim is human speech or converted speech. A subset of the core task in the NIST SRE 2006 corpus is used to evaluate the vulnerability of the speaker verification systems and the performance of the converted speech detector. The results indicate that both conversion techniques can increase the false acceptance rate of the GMM-JFA and PLDA systems, while the converted speech detector can reduce the false acceptance rate from 31.54% and 41.25% to 1.64% and 1.71% for the GMM-JFA and PLDA systems on unit-selection based converted speech, respectively.
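
The post-processing arrangement described above amounts to a simple two-stage decision cascade. The sketch below is a minimal illustration of that flow under the assumption that the verification and spoofing scores come from hypothetical scoring functions supplied by the caller; it is not the authors' implementation.

```python
def authenticate(trial, sv_score, spoof_score, sv_threshold, spoof_threshold):
    """sv_score and spoof_score are callables returning higher values for
    target speakers and for natural (non-converted) speech, respectively."""
    if sv_score(trial) < sv_threshold:
        return "reject"          # stage 1: speaker verification rejects the claim
    if spoof_score(trial) < spoof_threshold:
        return "reject"          # stage 2: accepted claim flagged as converted speech
    return "accept"              # claim passes both the verifier and the detector
```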

Proceedings ArticleDOI
02 Sep 2012
TL;DR: A protocol for the case of user-dependent pass-phrases in text-dependent speaker recognition is proposed, and speaker recognition experiments on the RSR2015 database are reported.
Abstract: This paper describes a new speech corpus, the RSR2015 database, designed for text-dependent speaker recognition with a scenario based on fixed pass-phrases. This database consists of over 71 hours of speech recorded from English speakers covering the diversity of accents spoken in Singapore. Acquisition has been done using a set of six portable devices including smart phones and tablets. The pool of speakers consists of 300 participants (143 female and 157 male speakers) from 17 to 42 years old. We propose a protocol for the case of user-dependent pass-phrases in text-dependent speaker recognition, and we also report speaker recognition experiments on the RSR2015 database. Index Terms: database, speaker recognition, text-dependent

Patent
11 May 2012
TL;DR: An example method is provided that includes receiving a media file containing video data and audio data, determining an initial scene sequence in the media file, determining an initial speaker sequence, and updating a selected one of the initial scene sequence and the initial speaker sequence in order to generate an updated scene sequence and an updated speaker sequence, respectively.
Abstract: An example method is provided and includes receiving a media file that includes video data and audio data; determining an initial scene sequence in the media file; determining an initial speaker sequence in the media file; and updating a selected one of the initial scene sequence and the initial speaker sequence in order to generate an updated scene sequence and an updated speaker sequence, respectively. The initial scene sequence is updated based on the initial speaker sequence, and the initial speaker sequence is updated based on the initial scene sequence.

Patent
26 Jul 2012
TL;DR: In this article, a universal language translator automatically translates a spoken word or phrase between two speakers by locking onto a speaker and filtering out ambient noise so as to be used in noisy environments, and to ensure accuracy of the translation when multiple speakers are present.
Abstract: A universal language translator automatically translates a spoken word or phrase between two speakers. The translator can lock onto a speaker and filter out ambient noise so as to be used in noisy environments, and to ensure accuracy of the translation when multiple speakers are present. The translator can also synthesize a speaker's voice into the dialect of the other speaker such that each speaker sounds like they're speaking the language of the other. A dialect detector could automatically select target dialects either by auto-sensing the dialect by listening to aspects of each speaker's phrases, or based upon the location of the device.

01 Jan 2012
TL;DR: The proposed bottleneck feature extraction paradigm performs slightly worse than MFCCs but provides complementary information in combination, and the proposed combination strategy with re-training improved the EER by 14% and 18% relative over the baseline MFCC system in the same- and different-microphone tasks, respectively.
Abstract: Bottleneck neural networks have recently found success in a variety of speech recognition tasks. This paper presents an approach in which they are utilized in the front-end of a speaker recognition system. The network inputs are mel-frequency cepstral coefficients (MFCCs) from multiple consecutive frames and the outputs are speaker labels. We propose using a recording-level criterion that is optimized via an online learning algorithm. We furthermore propose retraining a network to focus on its errors when leveraging scores from an independently trained system. We ran experiments on the same- and different-microphone tasks of the 2010 NIST Speaker Recognition Evaluation. We found that the proposed bottleneck feature extraction paradigm performs slightly worse than MFCCs but provides complementary information in combination. We also found that the proposed combination strategy with re-training improved the EER by 14% and 18% relative over the baseline MFCC system in the same- and different-microphone tasks, respectively.
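
The general structure of such a front-end can be sketched as a small feed-forward network whose narrow hidden layer supplies the features once training on speaker labels is done. The PyTorch sketch below is only an illustration of that structure; layer sizes, the context width, and the number of speakers are assumptions, and it does not implement the paper's recording-level criterion or retraining scheme.

```python
import torch
import torch.nn as nn

class BottleneckNet(nn.Module):
    def __init__(self, n_mfcc=20, context=9, bottleneck_dim=40, n_speakers=500):
        super().__init__()
        # Encoder maps stacked MFCC context frames down to the bottleneck layer.
        self.encoder = nn.Sequential(
            nn.Linear(n_mfcc * context, 1024), nn.ReLU(),
            nn.Linear(1024, bottleneck_dim),          # bottleneck layer
        )
        # Classifier predicts the speaker label during training only.
        self.classifier = nn.Sequential(
            nn.ReLU(), nn.Linear(bottleneck_dim, n_speakers),
        )

    def forward(self, stacked_mfcc):
        z = self.encoder(stacked_mfcc)                 # bottleneck features
        return self.classifier(z), z

# At test time the classifier head is discarded and z serves as the frame feature.
```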

Journal ArticleDOI
TL;DR: A novel probabilistic framework is presented that fuses information from the audio and video modalities to perform speaker diarization; it is a Dynamic Bayesian Network (DBN) that extends a factorial Hidden Markov Model (fHMM) and models the people appearing in an audiovisual recording as multimodal entities that generate observations in the audio stream, the video stream, and the joint audiovisual space.
Abstract: We present a novel probabilistic framework that fuses information coming from the audio and video modality to perform speaker diarization. The proposed framework is a Dynamic Bayesian Network (DBN) that is an extension of a factorial Hidden Markov Model (fHMM) and models the people appearing in an audiovisual recording as multimodal entities that generate observations in the audio stream, the video stream, and the joint audiovisual space. The framework is very robust to different contexts, makes no assumptions about the location of the recording equipment, and does not require labeled training data as it acquires the model parameters using the Expectation Maximization (EM) algorithm. We apply the proposed model to two meeting videos and a news broadcast video, all of which come from publicly available data sets. The results acquired in speaker diarization are in favor of the proposed multimodal framework, which outperforms the single modality analysis results and improves over the state-of-the-art audio-based speaker diarization.

Patent
06 Jun 2012
TL;DR: In this article, a method for improving speech recognition by a speech recognition system includes obtaining a voice sample from a speaker, storing the voice sample of the speaker as a voice model, identifying an area from which sound matching the voice model for the speaker is coming, and providing one or more audio signals corresponding to sound received from the identified area to the system for processing.
Abstract: A method for improving speech recognition by a speech recognition system includes obtaining a voice sample from a speaker; storing the voice sample of the speaker as a voice model in a voice model database; identifying an area from which sound matching the voice model for the speaker is coming; providing one or more audio signals corresponding to sound received from the identified area to the speech recognition system for processing.

Patent
Hagai Aronowitz
11 Sep 2012
TL;DR: In this paper, a method and system for speaker diarization is provided, where pre-trained acoustic models of individual speakers and/or groups of speakers are obtained and speech data with multiple speakers is received and divided into frames.
Abstract: A method and system for speaker diarization are provided. Pre-trained acoustic models of individual speakers and/or groups of speakers are obtained. Speech data with multiple speakers is received and divided into frames. For each frame, an acoustic feature vector is determined and extended to include log-likelihood ratios of the pre-trained models in relation to a background population model. The extended acoustic feature vector is used in segmentation and clustering algorithms.
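
A minimal sketch of that feature extension follows, assuming the pre-trained speaker models and the background model expose sklearn-style score_samples() returning per-frame log-likelihoods; the exact model form used in the patent is not specified here.

```python
import numpy as np

def extend_features(frames, speaker_models, background_model):
    """frames: (n_frames, n_dims) acoustic features; returns frames with
    per-model log-likelihood ratios appended to each frame vector."""
    background_ll = background_model.score_samples(frames)          # (n_frames,)
    llrs = np.stack([m.score_samples(frames) - background_ll
                     for m in speaker_models], axis=1)              # (n_frames, n_models)
    return np.hstack([frames, llrs])
```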

Proceedings ArticleDOI
09 Sep 2012
TL;DR: The effectiveness of spectral clustering is examined as an alternative to the previous approach using K-means clustering, a previously-used heuristic to estimate the number of speakers is adapted, and an iterative optimization scheme is considered.
Abstract: This paper extends upon our previous work using i-vectors for speaker diarization. We examine the effectiveness of spectral clustering as an alternative to our previous approach using K-means clustering and adapt a previously-used heuristic to estimate the number of speakers. Additionally, we consider an iterative optimization scheme and experiment with its ability to improve both cluster assignments and segmentation boundaries in an unsupervised manner. Our proposed methods attain results similar to those of a state-of-the-art benchmark set on the multi-speaker CallHome telephone corpus.
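
For orientation, the sketch below clusters segment i-vectors with spectral clustering over a cosine-similarity affinity matrix and picks the number of speakers from the largest eigengap. The eigengap rule is a common heuristic and not necessarily the one adapted in the paper, and the i-vector extraction itself is assumed to have been done elsewhere.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def diarize_segments(ivectors, max_speakers=8):
    """ivectors: (n_segments, dim) array; requires n_segments >= max_speakers."""
    ivecs = ivectors / np.linalg.norm(ivectors, axis=1, keepdims=True)
    affinity = np.clip(ivecs @ ivecs.T, 0.0, None)      # non-negative cosine similarity
    eigvals = np.sort(np.linalg.eigvalsh(affinity))[::-1]
    gaps = eigvals[:max_speakers - 1] - eigvals[1:max_speakers]
    n_speakers = int(np.argmax(gaps)) + 1                # largest eigengap heuristic
    labels = SpectralClustering(n_clusters=n_speakers,
                                affinity="precomputed").fit_predict(affinity)
    return labels
```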

Journal ArticleDOI
TL;DR: Three main areas where speaker recognition techniques can be used are authentication, surveillance, and forensic speaker recognition.

01 Jan 2012
TL;DR: A new clustering model for speaker diarization is proposed by redefining clustering as a problem of Integer Linear Programming (ILP), so that an ILP solver can be used which searches for the speaker clustering solution over the whole problem.
Abstract: In this paper, we propose a new clustering model for speaker diarization. A major problem with greedy agglomerative hierarchical clustering for speaker diarization is that it does not guarantee an optimal solution. We propose a new clustering model by redefining clustering as a problem of Integer Linear Programming (ILP). Thus an ILP solver can be used, which searches for the speaker clustering solution over the whole problem. The experiments were conducted on the corpus of French broadcast news ESTER-2. With this new clustering, the DER decreases by 2.43 points.
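
To make the ILP reformulation concrete, the PuLP sketch below casts clustering as choosing cluster centres and binary segment-to-centre assignments under a distance ceiling. The objective and constraints are illustrative of this style of formulation, not a reproduction of the paper's exact model; dist is a precomputed segment-to-segment distance matrix and delta a distance threshold, both assumptions of this sketch.

```python
import pulp

def ilp_cluster(dist, delta):
    n = len(dist)
    prob = pulp.LpProblem("speaker_clustering", pulp.LpMinimize)
    # x[j] = 1 if segment j is selected as a cluster centre
    x = [pulp.LpVariable(f"x_{j}", cat="Binary") for j in range(n)]
    # y[i][j] = 1 if segment i is assigned to centre j
    y = [[pulp.LpVariable(f"y_{i}_{j}", cat="Binary") for j in range(n)]
         for i in range(n)]
    # Minimise the number of centres plus the total assignment distance.
    prob += pulp.lpSum(x) + pulp.lpSum(dist[i][j] * y[i][j]
                                       for i in range(n) for j in range(n))
    for i in range(n):
        prob += pulp.lpSum(y[i][j] for j in range(n)) == 1   # one centre per segment
        for j in range(n):
            prob += y[i][j] <= x[j]                          # assign only to open centres
            if dist[i][j] > delta:
                prob += y[i][j] == 0                         # forbid distant assignments
    prob.solve()
    # The chosen centre index acts as the cluster label of each segment.
    return [next(j for j in range(n) if pulp.value(y[i][j]) > 0.5) for i in range(n)]
```

Unlike greedy agglomerative merging, the solver considers all assignments jointly, which is the global-optimality argument made above.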

Proceedings ArticleDOI
09 Sep 2012
TL;DR: An open-source toolkit released under the GPL license is described, aimed at facilitating research in multistream speaker diarization and at reproducing state-of-the-art results; it is explicitly designed to handle an arbitrary number of features with very different statistics while limiting the computational complexity.
Abstract: The speaker diarization task consists of inferring “who spoke when” in an audio stream without any prior knowledge and has been the object of several NIST international evaluation campaigns in recent years. A common trend for improving performance has been the use of several different feature streams, as diverse as speaker location features, visual features or noise-robust acoustic features. This paper describes an open-source toolkit released under the GPL license aimed at facilitating research in multistream speaker diarization and at reproducing state-of-the-art results. In contrast to other related diarization toolkits, it is explicitly designed to handle an arbitrary number of features with very different statistics while limiting the computational complexity. The release includes a set of scripts to replicate benchmark results on previous NIST evaluations and is intended to provide easy-to-use software to study and include novel features in diarization systems.

Journal ArticleDOI
TL;DR: The accuracy and time results of a text-independent automatic speaker recognition (ASR) system, based on Mel-Frequency Cepstrum Coefficients (MFCC) and Gaussian Mixture Models (GMM), are shown in order to develop a security control access gate.
Abstract: The aim of this paper is to show the accuracy and time results of a text-independent automatic speaker recognition (ASR) system, based on Mel-Frequency Cepstrum Coefficients (MFCC) and Gaussian Mixture Models (GMM), in order to develop a security control access gate. 450 speakers were randomly extracted from the Voxforge.org audio database; their utterances were enhanced using spectral subtraction, MFCCs were then extracted, and these coefficients were statistically analyzed by GMM in order to build each speaker profile. For each speaker two different speech files were used: the first one to build the profile database, the second one to test the system performance. The accuracy achieved by the proposed approach is greater than 96% and the time spent for a single test run, implemented in Matlab, is about 2 seconds on a common PC.
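
The enrollment/test pipeline described here follows a standard pattern. The sketch below shows such an MFCC+GMM identification pipeline in Python rather than Matlab; the parameter values (13 MFCCs, 32 mixture components) are illustrative assumptions, not the paper's settings, and the spectral-subtraction step is omitted.

```python
import librosa
import numpy as np
from sklearn.mixture import GaussianMixture

def extract_mfcc(wav_path, n_mfcc=13):
    signal, sr = librosa.load(wav_path, sr=None)
    return librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc).T   # (frames, coeffs)

def enroll(wav_path, n_components=32):
    # Fit a per-speaker GMM on the enrollment utterance to build the profile.
    gmm = GaussianMixture(n_components=n_components, covariance_type="diag")
    gmm.fit(extract_mfcc(wav_path))
    return gmm

def identify(test_wav, profiles):
    # profiles: dict mapping speaker_id -> enrolled GMM; pick the best-scoring one.
    feats = extract_mfcc(test_wav)
    return max(profiles, key=lambda spk: profiles[spk].score(feats))
```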

Journal ArticleDOI
TL;DR: The relative merits of the two most general, dominant approaches to speaker diarization, bottom-up and top-down hierarchical clustering, are analyzed, and a new combination strategy is proposed to exploit the merits of the two approaches.
Abstract: This paper presents a theoretical framework to analyze the relative merits of the two most general, dominant approaches to speaker diarization involving bottom-up and top-down hierarchical clustering. We present an original qualitative comparison which argues how the two approaches are likely to exhibit different behavior in speaker inventory optimization and model training: bottom-up approaches will capture comparatively purer models and will thus be more sensitive to nuisance variation such as that related to the speech content; top-down approaches, in contrast, will produce less discriminative speaker models but, importantly, models which are potentially better normalized against nuisance variation. We report experiments conducted on two standard, single-channel NIST RT evaluation datasets which validate our hypotheses. Results show that competitive performance can be achieved with both bottom-up and top-down approaches (average DERs of 21% and 22%), and that neither approach is superior. Speaker purification, which aims to improve speaker discrimination, gives more consistent improvements with the top-down system than with the bottom-up system (average DERs of 19% and 25%), thereby confirming that the top-down system is less discriminative and that the bottom-up system is less stable. Finally, we report a new combination strategy that exploits the merits of the two approaches. Combination delivers an average DER of 17% and confirms the intrinsic complementarity of the two approaches.
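
The DER figures quoted in this and several other entries follow the NIST definition: the fraction of scored speech time that is missed, falsely detected as speech, or attributed to the wrong speaker. A minimal reference implementation of that formula, assuming the three error durations have already been computed by a scoring tool:

```python
def diarization_error_rate(missed, false_alarm, confusion, total_speech):
    """All arguments are durations in seconds (or any consistent time unit)."""
    return (missed + false_alarm + confusion) / total_speech

# Example: 30 s missed + 20 s false alarm + 55 s confusion over 500 s of speech
# gives diarization_error_rate(30, 20, 55, 500) == 0.21, i.e. a 21% DER.
```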

Journal ArticleDOI
TL;DR: The first full conceptual description of the ICSI speaker diarization system as presented to the National Institute of Standards and Technology Rich Transcription 2009 (NIST RT-09) evaluation is given; the system consists of online and offline subsystems, multi-stream and single-stream implementations, and audio and audio-visual approaches.
Abstract: The speaker diarization system developed at the International Computer Science Institute (ICSI) has played a prominent role in the speaker diarization community, and many researchers in the rich transcription community have adopted methods and techniques developed for the ICSI speaker diarization engine. Although there have been many related publications over the years, previous articles only presented changes and improvements rather than a description of the full system. Attempting to replicate the ICSI speaker diarization system as a complete entity would require an extensive literature review, and might ultimately fail due to component description version mismatches. This paper therefore presents the first full conceptual description of the ICSI speaker diarization system as presented to the National Institute of Standards and Technology Rich Transcription 2009 (NIST RT-09) evaluation, which consists of online and offline subsystems, multi-stream and single-stream implementations, and audio and audio-visual approaches. Some of the components, such as the online system, have not been previously described. The paper also includes all necessary preprocessing steps, such as Wiener filtering, speech activity detection and beamforming.
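
Of the preprocessing steps named above, beamforming is perhaps the easiest to illustrate. The toy delay-and-sum sketch below aligns microphone channels by per-channel delays and averages them; it is only a didactic illustration and not the beamforming toolkit actually used in the ICSI system.

```python
import numpy as np

def delay_and_sum(channels, delays):
    """channels: (n_mics, n_samples) array; delays: per-mic delay in samples."""
    n_mics, n_samples = channels.shape
    out = np.zeros(n_samples)
    for mic, d in zip(channels, delays):
        shifted = np.zeros(n_samples)
        if d >= 0:
            shifted[:n_samples - d] = mic[d:]        # advance a lagging channel
        else:
            shifted[-d:] = mic[:n_samples + d]       # delay a leading channel
        out += shifted
    return out / n_mics                              # average the aligned channels
```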

01 Jan 2012
TL;DR: This work investigates the ability to build an accurate user authentication system under the limitation of having only a small development set.
Abstract: Voice biometrics for user authentication is a task in which the objective is to perform convenient, robust and secure authentication of speakers. Recently we have investigated the use of state-of-the-art text-independent and text-dependent speaker verification technology for user authentication and obtained satisfactory results within the framework of a proof of technology. However, our use of a quite large development set limits the practical potential of our system. In this work we investigate the ability to build an accurate user authentication system with the limitation of having a small development set.

Journal ArticleDOI
TL;DR: This paper addresses the problem of speaker clustering, the task of assigning every speech utterance in an audio stream to its speaker, and proposes a novel distance metric learning algorithm, linear spherical discriminant analysis (LSDA), whose formulation can be systematically solved within the elegant graph embedding general dimensionality reduction framework.
Abstract: Content-based multimedia indexing, retrieval, and processing as well as multimedia databases demand the structuring of the media content (image, audio, video, text, etc.), one significant goal being to associate the identity of the content to the individual segments of the signals. In this paper, we specifically address the problem of speaker clustering, the task of assigning every speech utterance in an audio stream to its speaker. We offer a complete treatment of the idea of partially supervised speaker clustering, which refers to the use of our prior knowledge of speakers in general to assist the unsupervised speaker clustering process. By means of an independent training data set, we encode the prior knowledge at the various stages of the speaker clustering pipeline via 1) learning a speaker-discriminative acoustic feature transformation, 2) learning a universal speaker prior model, and 3) learning a discriminative speaker subspace, or equivalently, a speaker-discriminative distance metric. We study the directional scattering property of the Gaussian mixture model (GMM) mean supervector representation of utterances in the high-dimensional space, and advocate exploiting this property by using the cosine distance metric instead of the Euclidean distance metric for speaker clustering in the GMM mean supervector space. We propose to perform discriminant analysis based on the cosine distance metric, which leads to a novel distance metric learning algorithm, linear spherical discriminant analysis (LSDA). We show that the proposed LSDA formulation can be systematically solved within the elegant graph embedding general dimensionality reduction framework. Our speaker clustering experiments on the GALE database clearly indicate that 1) our speaker clustering methods based on the GMM mean supervector representation and vector-based distance metrics outperform traditional speaker clustering methods based on the “bag of acoustic features” representation and statistical model-based distance metrics, 2) our advocated use of the cosine distance metric yields consistent increases in the speaker clustering performance as compared to the commonly used Euclidean distance metric, 3) our partially supervised speaker clustering concept and strategies significantly improve the speaker clustering performance over the baselines, and 4) our proposed LSDA algorithm further leads to state-of-the-art speaker clustering performance.
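
For reference, the two distance metrics contrasted above differ in whether the magnitude of the supervector matters. A minimal illustration, applicable to GMM mean supervectors or any other fixed-length utterance representation:

```python
import numpy as np

def euclidean_distance(u, v):
    return np.linalg.norm(u - v)

def cosine_distance(u, v):
    # Depends only on the angle between the vectors, not on their magnitudes,
    # which is the property exploited by the directional-scattering argument.
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
```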

Journal ArticleDOI
TL;DR: A single-channel speaker identification algorithm, which provides an estimate of signal-to-signal ratio (SSR) as a by-product, and a sinusoidal model-based algorithm for speech separation are proposed.
Abstract: In this paper, we present a novel system for joint speaker identification and speech separation. For speaker identification, a single-channel speaker identification algorithm is proposed which provides an estimate of signal-to-signal ratio (SSR) as a by-product. For speech separation, we propose a sinusoidal model-based algorithm. The speech separation algorithm consists of a double-talk/single-talk detector followed by a minimum mean square error estimator of sinusoidal parameters for finding optimal codevectors from pre-trained speaker codebooks. In evaluating the proposed system, we start from a situation where we have prior information about codebook indices, speaker identities and SSR level, and then, by relaxing these assumptions one by one, we demonstrate the efficiency of the proposed fully blind system. In contrast to previous studies that mostly focus on automatic speech recognition (ASR) accuracy, here we report the objective and subjective results as well. The results show that the proposed system performs as well as the best of the state-of-the-art in terms of perceived quality, while its performance in terms of speaker identification and automatic speech recognition is generally lower. It outperforms the state-of-the-art in terms of intelligibility, showing that the ASR results are not conclusive. The proposed method achieves, on average, 52.3% ASR accuracy, 41.2 points in MUSHRA and 85.9% in speech intelligibility.

Journal ArticleDOI
TL;DR: The hypothesis is that articulatory features provide a better metric for linguistic similarity across speakers than acoustic features; listening tests show that the approach can achieve a 20% reduction in perceived accent, but also reveal a strong coupling between accent and speaker identity.
Abstract: We propose a concatenative synthesis approach to the problem of foreign accent conversion. The approach consists of replacing the most accented portions of nonnative speech with alternative segments from a corpus of the speaker's own speech based on their similarity to those from a reference native speaker. We propose and compare two approaches for selecting units, one based on acoustic similarity [e.g., mel frequency cepstral coefficients (MFCCs)] and a second one based on articulatory similarity, as measured through electromagnetic articulography (EMA). Our hypothesis is that articulatory features provide a better metric for linguistic similarity across speakers than acoustic features. To test this hypothesis, we recorded an articulatory-acoustic corpus from a native and a nonnative speaker, and evaluated the two speech representations (acoustic versus articulatory) through a series of perceptual experiments. Formal listening tests indicate that the approach can achieve a 20% reduction in perceived accent, but also reveal a strong coupling between accent and speaker identity. To address this issue, we disguised original and resynthesized utterances by altering their average pitch and normalizing vocal tract length. An additional listening experiment supports the hypothesis that articulatory features are less speaker dependent than acoustic features.