
Showing papers on "Speaker diarisation" published in 2011


Journal ArticleDOI
TL;DR: In this article, a Bayesian nonparametric approach to speaker diarization is proposed, which builds on the hierarchical Dirichlet process hidden Markov model (HDP-HMM) of Teh et al.
Abstract: We consider the problem of speaker diarization, the problem of segmenting an audio recording of a meeting into temporal segments corresponding to individual speakers. The problem is rendered particularly difficult by the fact that we are not allowed to assume knowledge of the number of people participating in the meeting. To address this problem, we take a Bayesian nonparametric approach to speaker diarization that builds on the hierarchical Dirichlet process hidden Markov model (HDP-HMM) of Teh et al. [J. Amer. Statist. Assoc. 101 (2006) 1566–1581]. Although the basic HDP-HMM tends to over-segment the audio data—creating redundant states and rapidly switching among them—we describe an augmented HDP-HMM that provides effective control over the switching rate. We also show that this augmentation makes it possible to treat emission distributions nonparametrically. To scale the resulting architecture to realistic diarization problems, we develop a sampling algorithm that employs a truncated approximation of the Dirichlet process to jointly resample the full state sequence, greatly improving mixing rates. Working with a benchmark NIST data set, we show that our Bayesian nonparametric architecture yields state-of-the-art speaker diarization results.
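As an illustration of the "sticky" augmentation the abstract describes, the sketch below draws HMM transition rows from a truncated Dirichlet-process prior with extra mass kappa on the self-transition, which is the mechanism that controls the switching rate. This is a minimal numpy sketch, not the authors' sampler; the truncation level and hyperparameter values are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Truncated approximation: cap the number of states at K.
K, alpha, gamma, kappa = 10, 1.0, 1.0, 50.0

# Global state weights beta from a truncated stick-breaking (GEM) prior.
v = rng.beta(1.0, gamma, size=K)
beta = v * np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
beta /= beta.sum()

# Each transition row j is Dirichlet-distributed around beta, with extra
# mass kappa on the self-transition entry j.  Large kappa discourages the
# rapid state switching of the vanilla HDP-HMM.
pi = np.empty((K, K))
for j in range(K):
    concentration = alpha * beta + kappa * np.eye(K)[j]
    pi[j] = rng.dirichlet(concentration)

print("mean self-transition prob:", pi.diagonal().mean())
```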

289 citations


Book
08 Dec 2011
TL;DR: The "Fundamentals of Speaker Recognition" as mentioned in this paper is a textbook for advanced level students in computer science and engineering, concentrating on biometrics, speech recognition, pattern recognition, signal processing and specifically, speaker recognition.
Abstract: An emerging technology, Speaker Recognition is becoming well-known for providing voice authentication over the telephone for helpdesks, call centres and other enterprise businesses for business process automation. "Fundamentals of Speaker Recognition" introduces Speaker Identification, Speaker Verification, Speaker (Audio Event) Classification, Speaker Detection, Speaker Tracking and more. The technical problems are rigorously defined, and a complete picture is made of the relevance of the discussed algorithms and their usage in building a comprehensive Speaker Recognition System. Designed as a textbook with examples and exercises at the end of each chapter, "Fundamentals of Speaker Recognition" is suitable for advanced-level students in computer science and engineering, concentrating on biometrics, speech recognition, pattern recognition, signal processing and, specifically, speaker recognition. It is also a valuable reference for developers of commercial technology and for speech scientists.

243 citations


Proceedings ArticleDOI
27 Aug 2011
TL;DR: In this paper, a comparison of Joint Factor Analysis (JFA) and i-vector based systems including various compensation techniques; Within-Class Covariance Normalization (WCCN), LDA, Scatter Difference Nuisance Attribute Projection (SDNAP) and Gaussian Probabilistic Linear Discriminant Analysis (GPLDA) is presented.
Abstract: Robust speaker verification on short utterances remains a key consideration when deploying automatic speaker recognition, as many real-world applications often have access to only limited-duration speech data. This paper explores how the recent technologies focused around total variability modeling behave when training and testing utterance lengths are reduced. Results are presented which provide a comparison of Joint Factor Analysis (JFA) and i-vector based systems including various compensation techniques; Within-Class Covariance Normalization (WCCN), LDA, Scatter Difference Nuisance Attribute Projection (SDNAP) and Gaussian Probabilistic Linear Discriminant Analysis (GPLDA). Speaker verification performance for utterances with as little as 2 sec of data taken from the NIST Speaker Recognition Evaluations is presented to provide a clearer picture of the current performance characteristics of these techniques in short utterance conditions.

239 citations


Proceedings Article
01 Jan 2011
TL;DR: Using the proposed methods, the ability to achieve state-of-the-art performance in the diarization of summed-channel telephone data from the NIST 2008 SRE is demonstrated.
Abstract: In this paper, we propose a new approach to speaker diarization based on the Total Variability approach to speaker verification. Drawing on previous work done in applying factor analysis priors to the diarization problem, we arrive at a simplified approach that exploits intra-conversation variability in the Total Variability space through the use of Principal Component Analysis (PCA). Using our proposed methods, we demonstrate the ability to achieve state-of-the-art performance (0.9% DER) in the diarization of summed-channel telephone data from the NIST 2008 SRE. Index Terms: speaker diarization, factor analysis, Total Variability, principal component analysis
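The conversation-side processing the abstract describes can be sketched in a few lines, assuming per-segment i-vectors have already been extracted with a total-variability model; the function name, dimensions and cluster count below are illustrative, not the authors' code.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

def diarize_segments(ivectors, n_speakers=2, n_dims=3):
    """Cluster per-segment i-vectors from one conversation.

    `ivectors` is an (n_segments, d) array assumed to be extracted
    beforehand with a total-variability model.
    """
    # Length-normalise so Euclidean distance behaves like cosine distance.
    X = ivectors / np.linalg.norm(ivectors, axis=1, keepdims=True)

    # Conversation-dependent PCA: keep the directions carrying the
    # intra-conversation (largely speaker) variability.
    X = PCA(n_components=n_dims).fit_transform(X)

    # Summed-channel telephone data: two speakers assumed a priori.
    return KMeans(n_clusters=n_speakers, n_init=10).fit_predict(X)
```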

123 citations


Proceedings Article
01 Jan 2011
TL;DR: A novel approach to flexible control of speaker characteristics using tensor representation of speaker space is described, which can solve an inherent problem of supervector representation, and it improves the performance of voice conversion.
Abstract: This paper describes a novel approach to flexible control of speaker characteristics using tensor representation of speaker space. In voice conversion studies, realization of conversion from/to an arbitrary speaker's voice is one of the important objectives. For this purpose, eigenvoice conversion (EVC) based on an eigenvoice Gaussian mixture model (EV-GMM) was proposed. In the EVC, similarly to speaker recognition approaches, a speaker space is constructed based on GMM supervectors, which are high-dimensional vectors derived by concatenating the mean vectors of each of the speaker GMMs. In the speaker space, each speaker is represented by a small number of weight parameters of eigen-supervectors. In this paper, we revisit construction of the speaker space by introducing tensor analysis of the training data set. In our approach, each speaker is represented as a matrix whose rows and columns correspond to the Gaussian components and the dimensions of the mean vectors, respectively, and the speaker space is derived by tensor analysis of the set of matrices. Our approach can solve an inherent problem of supervector representation, and it improves the performance of voice conversion. Experimental results of one-to-many voice conversion demonstrate the effectiveness of the proposed approach. Index Terms: voice conversion, Gaussian mixture model, eigenvoice, tensor analysis, Tucker decomposition

97 citations


Proceedings Article
01 Jan 2011
TL;DR: This paper investigates the feasibility of using the results of a number of recent efforts to automatically discover repeated spoken terms without a recognizer as constraints for unsupervised acoustic model training, and starts with a relatively small set of word types.
Abstract: Can we automatically discover speaker-independent phoneme-like subword units with zero resources in a surprise language? There have been a number of recent efforts to automatically discover repeated spoken terms without a recognizer. This paper investigates the feasibility of using these results as constraints for unsupervised acoustic model training. We start with a relatively small set of word types, as well as their locations in the speech. The training process assumes that repetitions of the same (unknown) word share the same (unknown) sequence of subword units. For each word type, we train a whole-word hidden Markov model with Gaussian mixture observation densities and collapse correlated states across the word types using spectral clustering. We find that the resulting state clusters align reasonably well along phonetic lines. In evaluating cross-speaker word similarity, the proposed techniques outperform both raw acoustic features and language-mismatched acoustic models.

81 citations


Journal ArticleDOI
TL;DR: A seamless neutral/whisper mismatched closed-set speaker recognition system based on a Mel-frequency cepstral coefficient-Gaussian mixture model (MFCC-GMM) framework, in which an alternative feature extraction algorithm based on linear and exponential frequency scales is applied.
Abstract: Whisper is an alternative speech production mode used by subjects in natural conversation to protect privacy. Due to the profound differences between whisper and neutral speech in both excitation and vocal tract function, the performance of speaker identification systems trained with neutral speech degrades significantly. In this paper, a seamless neutral/whisper mismatched closed-set speaker recognition system is developed. First, performance characteristics of a neutral-trained closed-set speaker ID system based on a Mel-frequency cepstral coefficient-Gaussian mixture model (MFCC-GMM) framework are considered. It is observed that for whisper speaker recognition, performance degradation is concentrated in only a subset of speakers. Next, it is shown that the performance loss for speaker identification in neutral/whisper mismatched conditions is focused on phonemes other than low-energy unvoiced consonants. In order to increase system performance for unvoiced consonants, an alternative feature extraction algorithm based on linear and exponential frequency scales is applied. The acoustic properties of misrecognized and correctly recognized whisper are analyzed in order to develop more effective processing schemes. A two-dimensional feature space is proposed in order to predict on which whispered utterances the system will perform poorly, with evaluations conducted to measure the quality of whispered speech. Finally, a system for seamless neutral/whisper speaker identification is proposed, resulting in an absolute improvement of 8.85%-10.30% for speaker recognition, with the best closed-set speaker ID performance of 88.35% obtained for a total of 961 read whisper test utterances, and 83.84% using a total of 495 spontaneous whisper test utterances.
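For reference, the MFCC-GMM baseline that the paper starts from is the classic closed-set speaker ID recipe. A minimal sketch, assuming librosa and scikit-learn are available and using illustrative model sizes; the paper's exact front end and whisper-specific features are not reproduced here.

```python
import numpy as np
import librosa
from sklearn.mixture import GaussianMixture

def mfcc_frames(wav_path, sr=16000, n_mfcc=13):
    y, sr = librosa.load(wav_path, sr=sr)
    # (n_frames, n_mfcc) MFCC matrix
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T

def train_speaker_models(train_files):
    """train_files: dict speaker_id -> list of wav paths (neutral speech)."""
    models = {}
    for spk, paths in train_files.items():
        X = np.vstack([mfcc_frames(p) for p in paths])
        models[spk] = GaussianMixture(
            n_components=32, covariance_type="diag").fit(X)
    return models

def identify(models, wav_path):
    X = mfcc_frames(wav_path)
    # Closed-set decision: the speaker whose GMM gives the highest
    # average frame log-likelihood.
    return max(models, key=lambda spk: models[spk].score(X))
```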

76 citations


Journal ArticleDOI
TL;DR: This work investigates the task of automatically measuring dominance in small group meetings when only a single audio source is available, and shows that the dominance estimation is robust to increasing diarization noise.
Abstract: With the increase in cheap, commercially available sensors, recording meetings is becoming an increasingly practical option. With this trend comes the need to summarize the recorded data in semantically meaningful ways. Here, we investigate the task of automatically measuring dominance in small group meetings when only a single audio source is available. Past research has found that speaking length, as a single feature, provides a very good estimate of dominance. For these tasks we use speaker segmentations generated by our automated faster-than-real-time speaker diarization algorithm, where the number of speakers is not known beforehand. From user-annotated data, we analyze how the inherent variability of the annotations affects the performance of our dominance estimation method. We primarily focus on examining how the performance of the speaker diarization and our dominance tasks vary under different experimental conditions and computationally efficient strategies, and how this would affect a practical implementation of such a system. Despite the use of a state-of-the-art speaker diarization algorithm, speaker segments can be noisy. In experiments on almost 5 hours of audio-visual meeting data, our results show that the dominance estimation is robust to increasing diarization noise.
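The core estimator the abstract refers to, speaking length as a single feature, is easy to state concretely. A minimal sketch over diarization output, where `segments` is a hypothetical list of (label, start, end) tuples with anonymous cluster labels:

```python
from collections import defaultdict

def dominance_ranking(segments):
    """Rank speakers by total speaking time.

    `segments`: list of (speaker_label, start_sec, end_sec) tuples,
    e.g. the output of an automatic speaker diarization system.
    """
    total = defaultdict(float)
    for label, start, end in segments:
        total[label] += end - start
    # The first label in the ranking is the estimated most dominant
    # participant (the one who spoke longest).
    return sorted(total, key=total.get, reverse=True)
```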

72 citations


Patent
17 Jun 2011
TL;DR: In this paper, a profile for each audience member who listens to an audio conference is obtained, and different visual representations of the speaker content are presented to different audience members based on the analyzing and identifying.
Abstract: Speaker content generated in an audio conference is selectively visually represented. A profile for each audience member who listens to an audio conference is obtained. Speaker content from audio conference participants who speak in the audio conference is monitored. The speaker content from each of the audio conference participants is analyzed. Based on the analyzing and on the profiles for each of the plurality of audience members, visual representations of the speaker content to present to the audience members are identified. Visual representations of the speaker content are generated based on the analyzing. Different visual representations of the speaker content are presented to different audience members based on the analyzing and identifying.

72 citations


Patent
05 May 2011
TL;DR: In this article, a system is configured to verify a speaker, generates a text challenge that is unique to the request, and prompts the speaker to utter the text challenge, and then the system records a dynamic image feature of the speaker as the speaker utters the challenge.
Abstract: Disclosed herein are systems, methods, and non-transitory computer-readable storage media for performing speaker verification. A system configured to practice the method receives a request to verify a speaker, generates a text challenge that is unique to the request, and, in response to the request, prompts the speaker to utter the text challenge. Then the system records a dynamic image feature of the speaker as the speaker utters the text challenge, and performs speaker verification based on the dynamic image feature and the text challenge. Recording the dynamic image feature of the speaker can include recording video of the speaker while speaking the text challenge. The dynamic feature can include a movement pattern of head, lips, mouth, eyes, and/or eyebrows of the speaker. The dynamic image feature can relate to phonetic content of the speaker speaking the challenge, speech prosody, and the speaker's facial expression responding to content of the challenge.

71 citations


Proceedings ArticleDOI
22 May 2011
TL;DR: An audio/video database, especially built for the speaker diarization task, based on different video genres, is described, which highlights the difficulties encountered in this context, mainly linked to the database heterogeneity.
Abstract: In the last ten years, the internet as well as its applications has changed significantly, mainly thanks to the growth of available personal resources. Concerning multimedia, the most impressive evolution is the continuously growing success of video sharing websites. But with this success come the difficulties of efficiently searching, indexing and accessing relevant information about these documents. Speaker diarization is an important task in the overall information retrieval process. This paper describes an audio/video database, especially built for the speaker diarization task, based on different video genres. Through some preliminary experiments, it highlights the difficulties encountered in this context, mainly linked to the heterogeneity of the database.

Proceedings ArticleDOI
22 May 2011
TL;DR: A method for automatically generating acoustic sub-word units that can substitute conventional phone models in a query-by-example spoken term detection system and shows that the proposed system performs well on both broadcast and non-broadcast recordings, unlike a conventional phone-based system trained solely on broadcast data.
Abstract: In this paper we present a method for automatically generating acoustic sub-word units that can substitute conventional phone models in a query-by-example spoken term detection system. We generate the sub-word units with a modified version of our speaker diarization system. Given a speech recording, the original diarization system generates a set of speaker models in an unsupervised manner without the need for training or development data. Modifying the diarization system to process the speech of a single speaker and decreasing the minimum segment duration constraint allows us to detect speaker-dependent sub-word units. For the task of query-by-example spoken term detection, we show that the proposed system performs well on both broadcast and non-broadcast recordings, unlike a conventional phone-based system trained solely on broadcast data. A mean average precision of 0.28 and 0.38 was obtained for experiments on broadcast news and on a set of war veteran interviews, respectively.
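Query-by-example spoken term detection of this kind typically scores candidate regions by dynamic time warping the query against them over frame-level distances. The paper's exact search procedure is not reproduced here; this is a generic DTW sketch, assuming query and candidate segments are already represented as frame-by-unit posterior (or feature) matrices.

```python
import numpy as np

def dtw_cost(query, segment):
    """Normalised DTW alignment cost between two feature sequences.

    `query` and `segment` are (n_frames, d) arrays, e.g. frame-level
    posteriors over the discovered sub-word units.
    """
    n, m = len(query), len(segment)
    # Pairwise cosine distances between frames.
    q = query / np.linalg.norm(query, axis=1, keepdims=True)
    s = segment / np.linalg.norm(segment, axis=1, keepdims=True)
    dist = 1.0 - q @ s.T

    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = dist[i - 1, j - 1] + min(
                acc[i - 1, j],      # query frame stretched
                acc[i, j - 1],      # segment frame stretched
                acc[i - 1, j - 1])  # one-to-one step
    # Normalise by path length so scores of different-length
    # candidate segments are comparable.
    return acc[n, m] / (n + m)
```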

Proceedings ArticleDOI
Yao Qian, Ji Xu, Frank K. Soong
22 May 2011
TL;DR: The frame mapping-based approach is capable of generating highly intelligible, good-quality speech data in L1 (Mandarin) which sounds rather close to the target speaker, and this is confirmed subjectively with speaker similarity, naturalness and intelligibility evaluations.
Abstract: Cross-lingual voice transformation is challenging when source language (L1) and target language (L2) are very different in corresponding phonetics and prosodies. We propose a frame mapping based HMM approach to this problem. The source speaker's speech data is first warped in frequency toward the target speaker by mapping corresponding formants of selected vowels. The parameter trajectories of the warped data are then “tiled” with the frames in target speaker's L2 data. The tiled new trajectories then form a simulated training set of target speaker in L1 and it is used to train an HMM TTS. With a bilingual (Mandarin and English) source speaker and a monolingual (English) target speaker, the frame mapping-based approach is capable of generating highly intelligible, good quality speech data in L1 (Mandarin), which sounds rather close to the target speaker. The good performance of the cross-lingual voice transformation is confirmed with speaker similarity, naturalness and intelligibility evaluations subjectively.
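The first step described above, warping the source spectra in frequency by mapping corresponding formants, can be sketched as a piecewise-linear warp of each magnitude spectrum. The formant lists and sampling rate below are illustrative assumptions, not values from the paper.

```python
import numpy as np

def warp_spectrum(spec, src_formants, tgt_formants, sr=16000):
    """Piecewise-linear frequency warping of one magnitude spectrum.

    `spec` is a magnitude spectrum on a linear frequency axis;
    `src_formants`/`tgt_formants` are matched formant frequencies (Hz)
    of corresponding vowels, both sorted in ascending order.
    """
    n = len(spec)
    freqs = np.linspace(0.0, sr / 2.0, n)
    # Anchor the warp at 0 Hz and Nyquist, piecewise-linear in between.
    src = np.concatenate(([0.0], src_formants, [sr / 2.0]))
    tgt = np.concatenate(([0.0], tgt_formants, [sr / 2.0]))
    # For each output (target-axis) frequency, find the source frequency
    # it maps from, then resample the spectrum there.
    source_freqs = np.interp(freqs, tgt, src)
    return np.interp(source_freqs, freqs, spec)
```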

Proceedings ArticleDOI
01 Nov 2011
TL;DR: Evaluation results on a corpus of read and spontaneous speech in Dutch confirm the effectiveness of the proposed scheme for speaker gender detection and age estimation, based on a hybrid architecture of Weighted Supervised Non-Negative Matrix Factorization and General Regression Neural Network.
Abstract: In many criminal cases, evidence might be in the form of telephone conversations or tape recordings. Therefore, law enforcement agencies have been concerned with accurate methods to profile different characteristics of a speaker from recorded voice patterns, which facilitate the identification of a criminal. This paper proposes a new approach for speaker gender detection and age estimation, based on a hybrid architecture of Weighted Supervised Non-Negative Matrix Factorization (WSNMF) and General Regression Neural Network (GRNN). Evaluation results on a corpus of read and spontaneous speech in Dutch confirm the effectiveness of the proposed scheme.
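A GRNN is essentially Nadaraya-Watson kernel regression, so the age-estimation stage can be sketched compactly, assuming feature vectors (e.g. the WSNMF weights) have already been computed for each utterance; the bandwidth and variable names are illustrative.

```python
import numpy as np

def grnn_predict(X_train, y_train, X_test, sigma=1.0):
    """General Regression Neural Network (Nadaraya-Watson kernel regression).

    Predicts e.g. speaker age from per-utterance feature vectors;
    `sigma` is the Gaussian kernel bandwidth.
    """
    # Squared Euclidean distances between every test and training vector.
    d2 = ((X_test[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    w = np.exp(-d2 / (2.0 * sigma ** 2))
    # Prediction is the kernel-weighted average of training targets.
    return (w @ y_train) / w.sum(axis=1)
```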

Patent
10 Aug 2011
TL;DR: In this article, an acoustic model was used for performing speech recognition on an input signal which comprises a sequence of feature vectors, with a plurality of model parameters relating to the probability distribution of a word or part thereof being related to a feature vector.
Abstract: A speech processing method, comprising: receiving a speech input which comprises a sequence of feature vectors; determining the likelihood of a sequence of words arising from the sequence of feature vectors using an acoustic model and a language model, comprising: providing an acoustic model for performing speech recognition on an input signal which comprises a sequence of feature vectors, said model having a plurality of model parameters relating to the probability distribution of a word or part thereof being related to a feature vector, wherein said speech input is a mismatched speech input which is received from a speaker in an environment which is not matched to the speaker or environment under which the acoustic model was trained; and adapting the acoustic model to the mismatched speech input, the speech processing method further comprising determining the likelihood of a sequence of features occurring in a given language using a language model; and combining the likelihoods determined by the acoustic model and the language model and outputting a sequence of words identified from said speech input signal, wherein adapting the acoustic model to the mismatched speaker input comprises: relating speech from the mismatched speaker input to the speech used to train the acoustic model using: a mismatch function f for primarily modelling differences between the environment of the speaker and the environment under which the acoustic model was trained; and a speaker transform F for primarily modelling differences between the speaker of the mismatched speaker input and the speaker or speakers used to train the acoustic model, such that: y=f(F(x,v),u) where y represents the speech from the mismatched speaker input, x is the speech used to train the acoustic model, u represents at least one parameter for modelling changes in the environment and v represents at least one parameter used for mapping differences between speakers; and jointly estimating u and v.
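The claimed composition y = f(F(x,v),u) can be made concrete with a toy sketch in which F is an affine (CMLLR-like) speaker transform and f adds noise in the log-spectral domain. The specific choices of F and f here are assumptions for illustration, not the patent's definitions.

```python
import numpy as np

def speaker_transform(x, v):
    """F(x, v): affine (CMLLR-like) speaker mapping; v = (A, b)."""
    A, b = v
    return x @ A.T + b

def environment_mismatch(z, u):
    """f(z, u): additive noise u in the log-spectral domain,
    log(exp(z) + exp(u)) applied element-wise."""
    return np.logaddexp(z, u)

def mismatched_speech(x, v, u):
    # y = f(F(x, v), u): clean training-condition speech x mapped to the
    # observed mismatched-speaker, mismatched-environment speech y.
    return environment_mismatch(speaker_transform(x, v), u)
```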

Proceedings ArticleDOI
22 May 2011
TL;DR: This work proposes a novel source-normalised-and-weighted LDA algorithm developed to improve the robustness of i-vector-based speaker recognition under both mis-matched evaluation conditions and conditions for which insufficient speech resources are available for adequate system development.
Abstract: The recently developed i-vector framework for speaker recognition has set a new performance standard in the research field. An i-vector is a compact representation of a speaker utterance extracted from a low-dimensional total variability subspace. Prior to classification using a cosine kernel, i-vectors are projected into an LDA space in order to reduce inter-session variability and enhance speaker discrimination. The accurate estimation of this LDA space from a training dataset is crucial to classification performance. A typical training dataset, however, does not consist of utterances acquired from all sources of interest (i.e., telephone, microphone and interview speech sources) for each speaker. This has the effect of introducing source-related variation in the between-speaker covariance matrix and results in an incomplete representation of the within-speaker scatter matrix used for LDA. Proposed is a novel source-normalised-and-weighted LDA algorithm developed to improve the robustness of i-vector-based speaker recognition under both mis-matched evaluation conditions and conditions for which insufficient speech resources are available for adequate system development. Evaluated on the recent NIST 2008 and 2010 Speaker Recognition Evaluations (SRE), the proposed technique demonstrated improvements of up to 31% in minimum DCF and EER under mis-matched and sparsely-resourced conditions.
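The source-normalisation idea, keeping source-related offsets out of the between-speaker scatter, can be sketched by accumulating scatter statistics separately within each source before solving the usual generalised eigenproblem. A simplified sketch, not the authors' exact weighting scheme:

```python
import numpy as np
from scipy.linalg import eigh

def source_normalised_lda(ivecs, speakers, sources, n_dims=200):
    """`ivecs`: (N, d) i-vectors; `speakers`, `sources`: length-N label arrays."""
    d = ivecs.shape[1]
    Sb = np.zeros((d, d))
    Sw = np.zeros((d, d))
    for src in np.unique(sources):
        X = ivecs[sources == src]
        spk = speakers[sources == src]
        mu_src = X.mean(axis=0)
        for s in np.unique(spk):
            Xs = X[spk == s]
            mu_s = Xs.mean(axis=0)
            # Between-speaker scatter around the *source* mean, so the
            # telephone/microphone/interview offset is normalised away.
            diff = (mu_s - mu_src)[:, None]
            Sb += len(Xs) * (diff @ diff.T)
            Sw += (Xs - mu_s).T @ (Xs - mu_s)
    # Generalised eigenproblem: directions maximising between-speaker
    # scatter relative to within-speaker scatter (small ridge for stability).
    vals, vecs = eigh(Sb, Sw + 1e-6 * np.eye(d))
    return vecs[:, np.argsort(vals)[::-1][:n_dims]]
```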

Proceedings ArticleDOI
17 Mar 2011
TL;DR: The initial study exploring the impact of mismatch in training and test conditions with the collected data finds that mismatch in sensor, speaking style, and environment results in significant degradation in performance compared to the matched case, whereas for the language mismatch case the degradation is found to be relatively smaller.
Abstract: In this paper, we present our initial study with the recently collected speech database for developing robust speaker recognition systems in the Indian context. The database contains speech data collected across different sensors, languages, speaking styles, and environments, from 200 speakers. The speech data is collected across five different sensors in parallel, in English and multiple Indian languages, in reading and conversational speaking styles, and in office and uncontrolled environments such as laboratories, hostel rooms and corridors. The collected database is evaluated using an adapted Gaussian mixture model based speaker verification system following the NIST 2003 speaker recognition evaluation protocol, and gives performance comparable to that obtained using NIST data sets. Our initial study exploring the impact of mismatch in training and test conditions with the collected data finds that mismatch in sensor, speaking style, and environment results in significant degradation in performance compared to the matched case, whereas for the language mismatch case the degradation is found to be relatively smaller.

Proceedings ArticleDOI
22 May 2011
TL;DR: A new speaker verification system, named the blind system, that can be applied to both telephone and microphone data is proposed, based on an extension of the total variability framework; its results are comparable to state-of-the-art systems that require conditioning on the channel type.
Abstract: The majority of speaker verification systems proposed in the NIST speaker recognition evaluation are conditioned on the type of data to be processed: telephone or microphone. In this paper, we propose a new speaker verification system that can be applied to both types of data. This system, named blind system, is based on an extension of the total variability framework. Recognition results with the proposed channel-independent system are comparable to state of the art systems that require conditioning on the channel type. Another advantage of our proposed system is that it allows for combining data from multiple channels in the same visualization in order to explore the effects of different microphones and collection environments.

Journal ArticleDOI
TL;DR: Experiments reveal that the proposed approach outperforms the GMM-based system when the recording is done with varying numbers of microphones, and the proposed method avoids performing feature combination by averaging log-likelihood scores.
Abstract: This correspondence describes a novel system for speaker diarization of meeting recordings based on the combination of acoustic features (MFCC) and time delays of arrival (TDOA). The first part of the paper analyzes differences between MFCC and TDOA features, which possess completely different statistical properties. When Gaussian mixture models are used, experiments reveal that the diarization system is sensitive to the different recording scenarios (i.e., meeting rooms with varying numbers of microphones). In the second part, a new multistream diarization system is proposed, extending previous work on information-theoretic diarization. Both the speaker clustering and speaker realignment steps are discussed; in contrast to current systems, the proposed method avoids performing feature combination by averaging log-likelihood scores. Experiments on meeting data reveal that the proposed approach outperforms the GMM-based system when the recording is done with varying numbers of microphones.

Proceedings Article
01 Jan 2011
TL;DR: This work presents the ongoing work in addressing the issue of overlapped speech in speaker diarization through the use of overlap segmentation, overlapped speech exclusion, and overlap segment labeling, and shows that the performance improvement now rivals that of an oracle system using reference overlap segments.
Abstract: We present our ongoing work in addressing the issue of overlapped speech in speaker diarization through the use of overlap segmentation, overlapped speech exclusion, and overlap segment labeling. Using feature analysis, we identify the most salient features from a candidate list including those from our previous system and a set of newly proposed features. In addition, through independent optimization of overlap exclusion and labeling, we obtain a relative diarization error rate improvement of 15.1% on a sampled subset of the AMI Meeting Corpus, more than double our previous result. When analyzed independently, we show that the performance improvement due to overlapped speech exclusion now rivals that of an oracle system using reference overlap segments.

Journal ArticleDOI
TL;DR: The evaluation of broadcast news audio segmentation systems carried out in the context of the Albayzín-2010 evaluation campaign is presented, with the aim of gaining insight into the proposed solutions and identifying promising directions.
Abstract: Recently, audio segmentation has attracted research interest because of its usefulness in several applications like audio indexing and retrieval, subtitling, monitoring of acoustic scenes, etc. Moreover, a previous audio segmentation stage may be useful to improve the robustness of speech technologies like automatic speech recognition and speaker diarization. In this article, we present the evaluation of broadcast news audio segmentation systems carried out in the context of the Albayzín-2010 evaluation campaign. That evaluation consisted of segmenting audio from the 3/24 Catalan TV channel into five acoustic classes: music, speech, speech over music, speech over noise, and other. The evaluation results revealed the difficulty of this segmentation task. In this article, after presenting the database and metric, as well as the feature extraction methods and segmentation techniques used by the submitted systems, the experimental results are analyzed and compared, with the aim of gaining insight into the proposed solutions and identifying promising directions.

01 Jan 2011
TL;DR: Experimental results show that the extent of phonetic convergence depends on the speaker's disposition towards an interlocutor, but not on more “macro” social variables, such as the speaker’s gender.
Abstract: Numerous studies have documented the phenomenon of phonetic convergence: the process by which speakers alter their productions to become more similar on some phonetic or acoustic dimension to those of their interlocutor. Though social factors have been suggested as a motivator for imitation, a relatively smaller body of studies has established a tight connection between extralinguistic factors and a speaker’s likelihood to imitate. The present study explores the effects of a speaker’s attitude toward an interlocutor on the likelihood of imitation for extended VOT. Experimental results show that the extent of phonetic convergence (and divergence) depends on the speaker’s disposition towards an interlocutor, but not on more “macro” social variables, such as the speaker’s gender.

Proceedings ArticleDOI
27 Aug 2011
TL;DR: Different architectures for cross-show speaker diarization are compared: the obvious concatenation of all shows, a hybrid system combining first a local clustering stage followed by a global clusteringStage, and an incremental system which processes the shows in a predefined order and updates the speaker models accordingly.
Abstract: Acoustic speaker diarization is investigated for situations where a collection of shows from the same source needs to be processed. In this case, the same speaker should receive the same label across all shows. We compare different architectures for cross-show speaker diarization: the obvious concatenation of all shows, a hybrid system combining a first local clustering stage followed by a global clustering stage, and an incremental system which processes the shows in a predefined order and updates the speaker models accordingly. The latter system is best suited to real application scenarios. These three strategies were compared to a baseline single-show system on a set of 46 ten-minute samples of British English scientific podcasts.
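The incremental architecture can be sketched as follows: each new show's local clusters are matched against a growing pool of global speaker models and either merged into an existing speaker or added as a new one. This toy version uses mean embeddings and cosine similarity, which stand in for whatever speaker models and matching criterion the paper actually uses.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def incremental_cross_show(shows, threshold=0.6):
    """`shows`: list of shows, each a list of (local_label, embedding)
    cluster summaries from single-show diarization, in processing order.
    Returns (local_label, global_id) assignments."""
    global_models = {}          # global_id -> (mean_embedding, count)
    assignments, next_id = [], 0
    for show in shows:
        for local_label, emb in show:
            # Match against every speaker seen in earlier shows.
            best = max(global_models, default=None,
                       key=lambda g: cosine(global_models[g][0], emb))
            if best is not None and cosine(global_models[best][0], emb) >= threshold:
                mean, n = global_models[best]
                # Update the running mean of the matched speaker model.
                global_models[best] = ((mean * n + emb) / (n + 1), n + 1)
                assignments.append((local_label, best))
            else:
                # Unseen speaker: create a new global model.
                global_models[next_id] = (emb, 1)
                assignments.append((local_label, next_id))
                next_id += 1
    return assignments
```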


Proceedings Article
01 Jan 2011
TL;DR: It is shown that the addition of prosodic features decreases overlap detection error; detected overlap segments are used in speaker diarization to recover missed speech by assigning multiple speaker labels and to increase the purity of speaker clusters.
Abstract: Overlapping speech is responsible for a certain amount of errors produced by standard speaker diarization systems in the meeting environment. We are investigating a set of prosody-based long-term features as a potential complement to our overlap detection system relying on short-term spectral parameters. The most relevant features are selected in a two-step process. They are firstly evaluated and sorted according to the mRMR criterion, and then the optimal number is determined by an iterative wrapper approach. We show that the addition of prosodic features decreased overlap detection error. Detected overlap segments are used in speaker diarization to recover missed speech by assigning multiple speaker labels and to increase the purity of speaker clusters.

Index Terms: overlapping speech detection, prosody, feature selection, speaker diarization

1. Introduction. Human conversation often includes a certain amount of overlapping speech. Several works identified these specific conversation events as a challenge for many automatic human language technologies [1, 2]. One of these technologies is speaker diarization, which, given a recording, strives to answer the question "Who spoke when?" without any prior knowledge about the speakers. The problem is that conventional diarization systems assign only one speaker label per segment and, consequently, miss speech from overlapping speakers. Furthermore, it is reasonable to assume that overlapping speech included in the training data of a single-speaker model can lead to some level of corruption of the models. Prosody describes the rhythm, intonation and stress of speech. It can reflect various things about the speaker or the utterance, e.g., the emotional state. There has been significant effort to use this kind of higher-level speech information for various tasks like speaker verification and identification. Recently, prosodic features were also successfully applied to speaker diarization [3, 4]. A few studies were published which researched the relationship between prosodic cues and the interaction of conversation participants, e.g., one speaker jumping into the talk of another. The work by Ward and Tsukahara [5] suggests that stretches of low pitch can trigger back-channel feedback from the listener (yeah, uh-huh, right). Shriberg et al. [6] showed that speakers raise their voices when starting their utterance during somebody else's talk, compared to starting in silence. Somewhat related work was presented in [7], where a specific feature based on pitch prediction was used for speaker count labeling.
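The two-step selection described above, mRMR ranking followed by a wrapper, can be sketched as follows. This version estimates relevance with scikit-learn's mutual information and uses absolute correlation as a redundancy proxy, a common simplification rather than the paper's exact criterion.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

def mrmr_rank(X, y, n_select):
    """Greedy max-Relevance, min-Redundancy ranking of feature columns.

    Relevance: mutual information with the overlap/non-overlap label y.
    Redundancy: mean absolute correlation with already-selected features.
    A wrapper would then sweep n_select and keep the best-performing set.
    """
    relevance = mutual_info_classif(X, y)
    corr = np.abs(np.corrcoef(X, rowvar=False))
    selected = [int(np.argmax(relevance))]
    while len(selected) < n_select:
        remaining = [j for j in range(X.shape[1]) if j not in selected]
        scores = [relevance[j] - corr[j, selected].mean() for j in remaining]
        selected.append(remaining[int(np.argmax(scores))])
    return selected
```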

Patent
27 Sep 2011
TL;DR: In this article, a system, method, and computer readable medium that facilitate verbal control of conference call features are provided, where hot words are configured in the conference platform that may be identified in speech supplied to a conference call.
Abstract: A system, method, and computer-readable medium that facilitate verbal control of conference call features are provided. Automatic speech recognition functionality is deployed in a conferencing platform. Hot words that may be identified in speech supplied to a conference call are configured in the conference platform. Upon recognition of a hot word, a corresponding feature may be invoked. A speaker may be identified using speaker identification technologies. Identification of the speaker may be utilized to fulfill the speaker's request in response to recognition of a hot word and the speaker. Particular participants may be provided with conference control privileges that are not provided to other participants. Upon recognition of a hot word, the speaker may be identified to determine if the speaker is authorized to invoke the conference feature associated with the hot word.

Journal ArticleDOI
TL;DR: The aim was to produce a system that can perform simultaneous identification of large numbers of voice streams in real time and has important potential applications in security and in automated call centre applications.
Abstract: In today's society, highly accurate personal identification systems are required. Passwords or PINs can be forgotten or forged and are no longer considered to offer a high level of security. The use of biological features, biometrics, is becoming widely accepted as the next level for security systems. Biometric-based speaker identification is a method of identifying persons from their voice. Speaker-specific characteristics exist in speech signals due to different speakers having different resonances of the vocal tract. These differences can be exploited by extracting feature vectors such as Mel-Frequency Cepstral Coefficients (MFCCs) from the speech signal. A well-known statistical modelling process, the Gaussian Mixture Model (GMM), then models the distribution of each speaker's MFCCs in a multidimensional acoustic space. The GMM-based speaker identification system has features that make it promising for hardware acceleration. This paper describes the hardware implementation for classification of a text-independent GMM-based speaker identification system. The aim was to produce a system that can perform simultaneous identification of large numbers of voice streams in real time. This has important potential applications in security and in automated call centre applications. A speedup factor of ninety was achieved compared to a software implementation on a standard PC.

Proceedings ArticleDOI
22 May 2011
TL;DR: The results show an important influence of the emotional state upon text-independent speaker identification, and a possible solution to this problem is suggested.
Abstract: In this paper we evaluate the effect of the emotional state of a speaker when text-independent speaker identification is performed. The spectral features used for speaker recognition are the Mel-frequency cepstral coefficients, while Gaussian Mixture Models are employed for training the speaker models and testing the system. The tests are performed on the Berlin emotional speech database, which contains 10 different speakers recorded in different emotional states: happiness, anger, fear, boredom, sadness and neutral. The results show an important influence of the emotional state upon text-independent speaker identification. Finally, we suggest a possible solution to this problem.

Proceedings ArticleDOI
27 Aug 2011
TL;DR: A method for compensating for speaker and environmental mismatch using a cascade of CMLLR transforms that enables speaker transforms estimated in one environment to be effectively applied to speech from the same user in a different environment.
Abstract: Two primary sources of variability that degrade accuracy in speech recognition systems are the speaker and the environment. While many algorithms for speaker or environment adaptation have been proposed to improve performance, far less attention has been paid to approaches which address both factors. In this paper, we present a method for compensating for speaker and environmental mismatch using a cascade of CMLLR transforms. The proposed approach enables speaker transforms estimated in one environment to be effectively applied to speech from the same user in a different environment. This approach can be further improved using a new training method called speaker and environment adaptive training. When applying speaker transforms to new environments, the proposed approach results in a 13% relative improvement over conventional CMLLR.
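A cascade of CMLLR transforms is just a composition of affine feature transforms, which is what makes the speaker transform reusable across environments. A toy numpy sketch; the ordering and estimation of the transforms here are illustrative assumptions, not the paper's recipe.

```python
import numpy as np

def cmllr(features, A, b):
    """One CMLLR feature transform: x -> A x + b, applied per frame."""
    return features @ A.T + b

def compensate(features, speaker_tf, env_tf):
    """Cascade of CMLLR transforms: environment first, then speaker.

    Because the speaker transform is estimated on top of an environment
    transform, it can be reused for the same user in a new environment
    by swapping in that environment's transform.
    """
    A_env, b_env = env_tf
    A_spk, b_spk = speaker_tf
    return cmllr(cmllr(features, A_env, b_env), A_spk, b_spk)
```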

Proceedings Article
01 Jan 2011
TL;DR: This work builds upon a base set of various static acoustic features by proposing the combination of several different methods for intoxication detection, and obtains an optimal unweighted recall for intoxication recognition using score-level fusion of these subsystems.
Abstract: Speaker state recognition is a challenging problem due to speaker and context variability. Intoxication detection is an important area of paralinguistic speech research with potential real-world applications. In this work, we build upon a base set of various static acoustic features by proposing the combination of several different methods for this learning task. The methods include extracting hierarchical acoustic features, performing iterative speaker normalization, and using a set of GMM supervectors. We obtain an optimal unweighted recall for intoxication recognition using score-level fusion of these subsystems. Unweighted average recall performance is 70.54% on the test set, an improvement of 4.64% absolute (7.04% relative) over the baseline model accuracy of 65.9%. Index Terms: intoxication detection, speaker state, hierarchical features, speaker normalization, GMM supervectors
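Score-level fusion of the subsystems amounts to a weighted combination of their per-class scores before the final decision. A minimal sketch, with uniform weights as a placeholder for whatever weighting the authors tuned on development data.

```python
import numpy as np

def fuse_scores(subsystem_scores, weights=None):
    """Score-level fusion of per-class scores from several subsystems.

    `subsystem_scores`: list of (n_classes,) arrays, e.g. from the
    hierarchical-feature, speaker-normalised, and GMM-supervector systems.
    Returns the index of the predicted class (e.g. intoxicated or not).
    """
    S = np.vstack(subsystem_scores)
    w = np.ones(len(S)) / len(S) if weights is None else np.asarray(weights)
    fused = w @ S
    return int(np.argmax(fused))
```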