
Showing papers on "Speaker diarisation published in 2005"


Proceedings ArticleDOI
18 Mar 2005
TL;DR: An overview of current audio diarization approaches is provided, performance and potential applications are discussed, and the performance of current systems is presented as measured in the DARPA EARS Rich Transcription Fall 2004 (RT-04F) speaker diarization evaluation.
Abstract: Audio diarization is the process of annotating an input audio channel with information that attributes (possibly overlapping) temporal regions of signal energy to their specific sources. These sources can include particular speakers, music, background noise sources, and other signal source/channel characteristics. Diarization has utility in making automatic transcripts more readable and in searching and indexing audio archives. In this paper, we provide an overview of current audio diarization approaches and discuss performance and potential applications. We outline the general framework of diarization systems and present the performance of current systems as measured in the DARPA EARS Rich Transcription Fall 2004 (RT-04F) speaker diarization evaluation. Lastly, we look at future challenges and directions for diarization research.
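
The general framework outlined above can be made concrete as a toy pipeline: crude speech activity detection, fixed-length segmentation, and agglomerative clustering of segment representations. This is a minimal illustrative sketch in Python, assuming per-frame features are already extracted; the energy threshold, 100-frame segments and Ward clustering are stand-ins, not the RT-04F systems surveyed in the paper.

```python
# Toy diarization pipeline: SAD -> segmentation -> clustering.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

def toy_diarize(features, energy, n_speakers=2, seg_len=100):
    """features: (T, D) per-frame features; energy: (T,) per-frame energy."""
    speech = energy >= energy.mean()               # crude speech activity detection
    idx = np.flatnonzero(speech)                   # indices of speech frames
    segs = [idx[i:i + seg_len] for i in range(0, len(idx), seg_len)]
    segs = [s for s in segs if len(s) == seg_len]  # drop the short tail segment
    reps = np.stack([features[s].mean(axis=0) for s in segs])
    labels = fcluster(linkage(reps, method="ward"), n_speakers, criterion="maxclust")
    return [(int(s[0]), int(s[-1]), int(lab)) for s, lab in zip(segs, labels)]

# usage: two synthetic "speakers" with shifted feature means
rng = np.random.default_rng(0)
feats = np.vstack([rng.normal(0, 1, (500, 13)),    # frames of speaker A
                   rng.normal(3, 1, (500, 13))])   # frames of speaker B
print(toy_diarize(feats, energy=np.ones(1000)))
```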

191 citations


Patent
Ilya Skuratovsky
22 Nov 2005
TL;DR: A method of speech synthesis can include automatically identifying spoken passages and non-spoken passages within a text source and converting the text source to speech by applying different voice configurations according to whether each portion of text was identified as a spoken passage or not.
Abstract: A method of speech synthesis can include automatically identifying spoken passages and non-spoken passages within a text source and converting the text source to speech by applying different voice configurations to different portions of text within the text source according to whether each portion of text was identified as a spoken passage or a non-spoken passage. The method further can include identifying the speaker and/or the gender of the speaker and applying different voice configurations according to the speaker identity and/or speaker gender.

160 citations


Proceedings ArticleDOI
04 Sep 2005
TL;DR: The use of adaptation transforms employed in speech recognition systems as features for speaker recognition is explored, and the resulting speaker verification system is competitive with, and in some cases significantly more accurate than, state-of-the-art cepstral Gaussian mixture and SVM systems.
Abstract: We explore the use of adaptation transforms employed in speech recognition systems as features for speaker recognition. This approach is attractive because, unlike standard frame-based cepstral speaker recognition models, it normalizes for the choice of spoken words in text-independent speaker verification. Affine transforms are computed for the Gaussian means of the acoustic models used in a recognizer, using maximum likelihood linear regression (MLLR). The high-dimensional vectors formed by the transform coefficients are then modeled as speaker features using support vector machines (SVMs). The resulting speaker verification system is competitive with, and in some cases significantly more accurate than, state-of-the-art cepstral Gaussian mixture and SVM systems. Further improvements are obtained by combining baseline and MLLR-based systems.
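
A sketch of the modelling step described above: the MLLR transform coefficients of each utterance are flattened into one high-dimensional vector and a linear SVM is trained per target speaker. Estimating real transforms requires a full recognizer, so random matrices stand in for them here; the dimensions and the linear kernel are assumptions, not the paper's exact configuration.

```python
# MLLR transform coefficients as SVM features (synthetic stand-in data).
import numpy as np
from sklearn.svm import LinearSVC

D = 39                                    # acoustic feature dimension (assumed)
def mllr_supervector(A, b):
    """Flatten an affine mean transform (A, b) into a single feature vector."""
    return np.concatenate([A.ravel(), b])

rng = np.random.default_rng(1)
target = [mllr_supervector(np.eye(D) + rng.normal(0, .01, (D, D)),
                           rng.normal(.5, .1, D)) for _ in range(10)]
backgr = [mllr_supervector(np.eye(D) + rng.normal(0, .01, (D, D)),
                           rng.normal(0, .1, D)) for _ in range(50)]
X = np.array(target + backgr)
y = np.array([1] * 10 + [0] * 50)
svm = LinearSVC(C=1.0).fit(X, y)          # one SVM per target speaker
print(float(svm.decision_function(X[:1])[0]))   # verification score for one trial
```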

152 citations


Journal ArticleDOI
TL;DR: A method based on the vowel onset point (VOP) is proposed for locating the end-points of an utterance, and combining the evidence from suprasegmental, source, and spectral features seems to improve the performance of the system significantly.
Abstract: This paper proposes a text-dependent (fixed-text) speaker verification system which uses different types of information for making a decision regarding the identity claim of a speaker. The baseline system uses the dynamic time warping (DTW) technique for matching. Detection of the end-points of an utterance is crucial for the performance of the DTW-based template matching. A method based on the vowel onset point (VOP) is proposed for locating the end-points of an utterance. The proposed method for speaker verification uses the suprasegmental and source features, besides spectral features. The suprasegmental features such as pitch and duration are extracted using the warping path information in the DTW algorithm. Features of the excitation source, extracted using the neural network models, are also used in the text-dependent speaker verification system. Although the suprasegmental and source features individually may not yield good performance, combining the evidence from these features seems to improve the performance of the system significantly. Neural network models are used to combine the evidence from multiple sources of information.
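
The matching step of the baseline system is dynamic time warping. Below is a minimal DTW sketch, assuming cepstral feature matrices as input; the VOP-based end-pointing and the suprasegmental and source feature streams of the paper are not reproduced.

```python
# Path-normalised DTW distance between two feature sequences.
import numpy as np

def dtw_distance(X, Y):
    """Accumulated DTW distance between sequences X (n, d) and Y (m, d)."""
    n, m = len(X), len(Y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(X[i - 1] - Y[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m] / (n + m)              # normalise by a path-length bound

# usage: a claimant utterance scored against a stored template
rng = np.random.default_rng(2)
template = rng.normal(0, 1, (80, 13))
claimant = template + rng.normal(0, .1, (80, 13))
print(dtw_distance(template, claimant))   # small distance -> accept the claim
```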

117 citations


Proceedings ArticleDOI
04 Sep 2005
TL;DR: Improved multi-stage speaker diarization by incorporating a speaker identification step is described, which provides between 40% and 50% reduction of the speaker error, relative to a standard BIC clustering system.
Abstract: This paper describes recent advances in speaker diarization by incorporating a speaker identification step. This system builds upon the LIMSI baseline data partitioner used in the broadcast news transcription system. This partitioner provides a high cluster purity but has a tendency to split the data from a speaker into several clusters when there is a large quantity of data for the speaker. Several improvements to the baseline system have been made. Firstly, a standard Bayesian information criterion (BIC) agglomerative clustering has been integrated, replacing the iterative Gaussian mixture model (GMM) clustering. Then a second clustering stage has been added, using a speaker identification method with MAP-adapted GMMs. A final post-processing stage refines the segment boundaries using the output of the transcription system. On the RT-04F and ESTER evaluation data, the improved multi-stage system provides between 40% and 50% reduction of the speaker error, relative to a standard BIC clustering system.
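
The BIC clustering stage relies on a delta-BIC merge test: two clusters are merged when modelling them with one full-covariance Gaussian is cheaper, penalty included, than with two. A standard sketch follows; the penalty weight lambda and the exact LIMSI configuration are assumptions.

```python
# Standard full-covariance delta-BIC merge criterion.
import numpy as np

def delta_bic(X, Y, lam=1.0):
    """delta BIC < 0 suggests segments X and Y come from the same speaker."""
    Z = np.vstack([X, Y])
    d = X.shape[1]
    def n_logdet(A):
        return len(A) * np.linalg.slogdet(np.cov(A, rowvar=False))[1]
    gain = 0.5 * (n_logdet(Z) - n_logdet(X) - n_logdet(Y))
    penalty = 0.5 * lam * (d + 0.5 * d * (d + 1)) * np.log(len(Z))
    return gain - penalty

rng = np.random.default_rng(3)
a, b = rng.normal(0, 1, (200, 12)), rng.normal(0, 1, (200, 12))
c = rng.normal(4, 1, (200, 12))
print(delta_bic(a, b) < 0, delta_bic(a, c) < 0)   # expect: True False
```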

107 citations


Book ChapterDOI
11 Jul 2005
TL;DR: This paper describes the ICSI-SRI entry in the Rich Transcription 2005 Spring Meeting Recognition Evaluation, and adds several features to the baseline clustering system, including a “purification” module that tries to keep the clusters acoustically homogeneous throughout the clustering process, and a delay&sum beamforming algorithm which enhances signal quality for the multiple distant microphones sub-task.
Abstract: In this paper we describe the ICSI-SRI entry in the Rich Transcription 2005 Spring Meeting Recognition Evaluation. The current system is based on the ICSI-SRI clustering system for Broadcast News (BN), with extra modules to process the different meetings tasks in which we participated. Our base system uses agglomerative clustering with a modified Bayesian Information Criterion (BIC) measure to determine when to stop merging clusters and to decide which pairs of clusters to merge. This approach does not require any pre-trained models, thus increasing robustness and simplifying the port from BN to the meetings domain. For the meetings domain, we have added several features to our baseline clustering system, including a “purification” module that tries to keep the clusters acoustically homogeneous throughout the clustering process, and a delay&sum beamforming algorithm which enhances signal quality for the multiple distant microphones (MDM) sub-task. In post-evaluation work we further improved the delay&sum algorithm, experimented with a new speech/non-speech detector and proposed a new system for the lecture room environment.

99 citations


Proceedings ArticleDOI
04 Sep 2005
TL;DR: This system combines techniques used successfully in previous speaker diarisation systems with an additional second clustering stage based on state-of-the-art speaker identification methods to give a diarisation error rate of 6.9% on the RT-04 Fall diarisation evaluation data.
Abstract: This paper describes the speaker diarisation system developed at Cambridge University in March 2005. This system combines techniques used successfully in our previous speaker diarisation systems with an additional second clustering stage based on state-of-the-art speaker identification methods. Several strategies for using the new system are investigated, and the final system gives a diarisation error rate of 6.9% on the RT-04 Fall diarisation evaluation data when processing all the test data together, or 8.6% when processing the test shows independently.

94 citations


Proceedings ArticleDOI
04 Sep 2005
TL;DR: In this article, an approach to modelling session variability for text-independent speaker verification incorporating a constrained session variability component in both the training and testing procedures is presented, which reduces the data labelling requirements and removes discrete categorisation needed by techniques such as feature mapping and H-Norm.
Abstract: Presented is an approach to modelling session variability for GMM-based text-independent speaker verification incorporating a constrained session variability component in both the training and testing procedures. The proposed technique reduces the data labelling requirements and removes discrete categorisation needed by techniques such as feature mapping and H-Norm, while providing superior performance. Experiments on Switchboard-II conversational telephony data show improvements of as much as 48% in detection cost with a single training utterance and 68% with multiple training utterances over a baseline system.

89 citations


Proceedings ArticleDOI
P.Z. Patrick, G. Aversano, Raphaël Blouet, M. Charbit, Gérard Chollet
18 Mar 2005
TL;DR: It is demonstrated that an automatic speaker recognition system could be seriously threatened by a transformation of this kind, using a speaker verification system to calculate the likelihood that the forged voice belongs to the genuine client.
Abstract: The article deals with a technique of voice forgery using the ALISP (automatic language independent speech processing) approach. Such a technique allows the voice of an arbitrary person (the impostor) to be transformed, forging the identity of another person (the client). Our goal is to demonstrate that an automatic speaker recognition system could be seriously threatened by a transformation of this kind. For this purpose, we use a speaker verification system to calculate the likelihood that the forged voice belongs to the genuine client. Experiments on NIST 2004 evaluation data show that the equal error rate for the verification task is significantly increased by our voice transformation.

68 citations


Proceedings ArticleDOI
17 Oct 2005-Scopus
TL;DR: This paper distinguishes between standard American English and Indian Accented English using the second and third formant frequencies of specific accent markers to achieve a suitable classification for these two accent groups.
Abstract: Apart from the word content and identity of a speaker, speech also conveys information about several soft biometric traits such as accent and gender. Accurate classification of these traits can have a direct impact on present speech systems. An accent-specific dictionary or word models can be used to improve the accuracy of speech recognition systems. Gender and accent information can also be used to improve the performance of speaker recognition systems. In this paper, we distinguish between standard American English and Indian Accented English using the second and third formant frequencies of specific accent markers. A GMM classifier is used on the feature set for each accent group. The results show that using just the formant frequencies of these accent markers is sufficient to achieve a suitable classification for these two accent groups.
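
A sketch of the described classifier: one GMM per accent group over (F2, F3) features of the accent markers, with the decision taken by average log-likelihood. The formant values below are synthetic stand-ins for the output of a formant tracker.

```python
# GMM accent classification on second/third formant features.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(4)
f23_us = rng.normal([1400, 2500], [150, 200], (300, 2))  # synthetic US F2/F3 (Hz)
f23_in = rng.normal([1700, 2700], [150, 200], (300, 2))  # synthetic Indian F2/F3

gmm_us = GaussianMixture(n_components=4, random_state=0).fit(f23_us)
gmm_in = GaussianMixture(n_components=4, random_state=0).fit(f23_in)

test = rng.normal([1700, 2700], [150, 200], (20, 2))     # marker tokens of one utterance
accent = "Indian" if gmm_in.score(test) > gmm_us.score(test) else "American"
print(accent)
```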

68 citations


Proceedings ArticleDOI
01 Jan 2005
TL;DR: This paper uses delay-and-sum beamforming techniques to fuse the signals from each of the multiple distant microphones into a single enhanced signal; tests of the approach on the 2004 and 2005 NIST meetings evaluation databases show that the technique performs very well.
Abstract: One of the sub-tasks of the Spring 2004 and Spring 2005 NIST Meetings evaluations requires segmenting multi-party meetings into speaker-homogeneous regions using data from multiple distant microphones (the "MDM" sub-task). One approach to this task is to run a speaker segmentation system on each of the microphone channels separately, and then merge the results. This can be thought of as a many-to-one post-processing approach. In this paper we propose an alternative approach in which we use delay-and-sum beamforming techniques to fuse the signals from each of the multiple distant microphones into a single enhanced signal. This approach can be thought of as a many-to-one pre-processing approach. In the pre-processing approach we propose, the time delay of arrival (TDOA) between each of the multiple distant channels and a reference channel is computed incrementally using a window that steps through the signals from each of the multiple microphones. No information about the locations or setup of the microphones is required. Using the TDOA information, the channels are first aligned and then summed, and the resulting "enhanced" signal is clustered using our standard speaker diarization system. We test our approach on the 2004 and 2005 NIST meetings evaluation databases and show that the technique performs very well.
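
The pre-processing idea can be sketched as delay estimation with GCC-PHAT against a reference channel, followed by alignment and summing. The paper estimates the TDOA incrementally over a sliding window; for brevity this sketch uses one global delay per channel.

```python
# Delay-and-sum beamforming with GCC-PHAT delay estimation.
import numpy as np

def gcc_phat_delay(sig, ref, max_delay):
    """Delay of sig relative to ref, in samples, via GCC-PHAT."""
    n = len(sig) + len(ref)
    S = np.fft.rfft(sig, n) * np.conj(np.fft.rfft(ref, n))
    cc = np.fft.irfft(S / (np.abs(S) + 1e-12), n)
    cc = np.concatenate([cc[-max_delay:], cc[:max_delay + 1]])
    return int(np.argmax(cc)) - max_delay

def delay_and_sum(channels, max_delay=800):
    ref = channels[0]
    out = ref.astype(float).copy()
    for ch in channels[1:]:
        d = gcc_phat_delay(ch, ref, max_delay)
        out += np.roll(ch, -d)                 # align the channel to the reference
    return out / len(channels)

# usage: the second microphone hears the source 40 samples later
rng = np.random.default_rng(5)
src = rng.normal(0, 1, 16000)
mics = [src + 0.1 * rng.normal(0, 1, 16000),
        np.roll(src, 40) + 0.1 * rng.normal(0, 1, 16000)]
enhanced = delay_and_sum(mics)
print(gcc_phat_delay(mics[1], mics[0], 800))   # approximately 40
```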

Book ChapterDOI
11 Jul 2005
TL;DR: The TNO speaker diarization system is based on a standard BIC segmentation and clustering algorithm, and a speech activity detector (SAD) is developed based on decoding the speech signal using two Gaussian Mixture Models trained on silence and speech.
Abstract: The TNO speaker diarization system is based on a standard BIC segmentation and clustering algorithm. Since correct speech detection appears to be essential for the NIST Rich Transcription speaker diarization evaluation measure, we have developed a speech activity detector (SAD) as well. This is based on decoding the speech signal using two Gaussian Mixture Models trained on silence and speech. The SAD was trained on only AMI development test data, and performed quite well in the evaluation on all 5 meeting locations, with a SAD error rate of 5.0%. For the speaker clustering algorithm we optimized the BIC penalty parameter λ to 14, which is quite high with respect to the theoretical value of 1. The final speaker diarization error rate was evaluated at 35.1%.
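
The two-model SAD described above can be sketched as follows: one GMM trained on speech frames, one on silence, with each test frame assigned by log-likelihood. A real system decodes with minimum-duration constraints; plain frame-wise comparison and synthetic stand-in features are used here for brevity.

```python
# Speech activity detection with speech and silence GMMs.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(6)
speech_train = rng.normal(2, 1, (2000, 13))    # stand-in for speech MFCCs
silence_train = rng.normal(0, .5, (2000, 13))  # stand-in for silence MFCCs

gmm_speech = GaussianMixture(n_components=8, random_state=0).fit(speech_train)
gmm_sil = GaussianMixture(n_components=8, random_state=0).fit(silence_train)

test = np.vstack([rng.normal(0, .5, (50, 13)), rng.normal(2, 1, (50, 13))])
is_speech = gmm_speech.score_samples(test) > gmm_sil.score_samples(test)
print(is_speech[:5], is_speech[-5:])           # mostly False, then mostly True
```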

PatentDOI
TL;DR: In this article, a method and apparatus for spotting a target speaker within a call interaction by generating speaker models based on one or more speaker's speech; and by searching for speaker models associated with one or multiple target speaker speech files.
Abstract: A method and apparatus for spotting a target speaker within a call interaction by generating speaker models (98) based on one or more speaker's speech; and by searching for speaker models (110) associated with one or more target speaker speech files.

Patent
24 May 2005
TL;DR: In this paper, a dual-step, text-independent, language-independent speaker voice-print creation and speaker recognition method is presented, where a neural network-based technique is used in the first step and a Markov model-based approach is used for the second step.
Abstract: Disclosed herein is an automatic dual-step, text-independent, language-independent speaker voice-print creation and speaker recognition method, wherein a neural network-based technique is used in a first step and a Markov model-based technique is used in the second step. In particular, the first step uses a neural network-based technique for decoding the content of what is uttered by the speaker in terms of language-independent acoustic-phonetic classes, wherein the second step uses the sequence of language-independent acoustic-phonetic classes from the first step and employs a Markov model-based technique for creating the speaker voice-print and for recognizing the speaker. The combination of the two steps enables improvement in the accuracy and efficiency of the speaker voice-print creation and of the speaker recognition, without setting any constraints on the lexical content of the speaker utterance and on the language thereof.

Journal ArticleDOI
TL;DR: A predetermined generic speaker-independent model set, called the sample speaker models (SSM), is proposed, which can be useful for more accurate speaker modeling and clustering without requiring training models on target speaker data.
Abstract: Unsupervised speaker indexing sequentially detects points where a speaker identity changes in a multispeaker audio stream, and categorizes each speaker segment, without any prior knowledge about the speakers. This paper addresses two challenges: the first relates to sequential speaker change detection; the second relates to speaker modeling in light of the fact that the number/identity of the speakers is unknown. To address this issue, a predetermined generic speaker-independent model set, called the sample speaker models (SSM), is proposed. This set can be useful for more accurate speaker modeling and clustering without requiring training models on target speaker data. Once a speaker-independent model is selected from the generic sample models, it is progressively adapted into a specific speaker-dependent model. Experiments were performed with data from the Speaker Recognition Benchmark NIST Speech corpus (1999) and the HUB-4 Broadcast News Evaluation English Test material (1999). Results showed that our new technique, sampled using the Markov chain Monte Carlo method, gave 92.5% indexing accuracy on two-speaker telephone conversations, 89.6% on four-speaker conversations with telephone speech quality, and 87.2% on broadcast news. The SSMs outperformed the universal background model by up to 29.4% and the universal gender models by up to 22.5% in indexing accuracy in the experiments of this paper.

Journal ArticleDOI
TL;DR: This letter studies feature selection in speaker recognition from an information-theoretic view and closely ties the performance, in terms of the expected classification error probability, to the mutual information between speaker identity and features.
Abstract: This letter studies feature selection in speaker recognition from an information-theoretic view. We closely tie the performance, in terms of the expected classification error probability, to the mutual information between speaker identity and features. Information theory can then help us to make qualitative statements about feature selection and performance. We study various common features used for speaker recognition, such as mel-warped cepstrum coefficients and various parameterizations of linear prediction coefficients. The theory and experiments give valuable insights in feature selection and performance of speaker-recognition applications.
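
The tie between classification error and mutual information invoked here is usually established through Fano's inequality. A standard statement (not necessarily the letter's exact derivation), for a speaker identity S drawn from N speakers and observed features X:

```latex
% Fano's inequality: the binary entropy of the error probability P_e plus
% a scaled error term bounds the conditional entropy of the speaker identity,
% so larger mutual information I(S;X) lowers the achievable error floor.
H_b(P_e) + P_e \log(N - 1) \;\ge\; H(S \mid X) \;=\; H(S) - I(S;X)
```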

Proceedings ArticleDOI
18 Mar 2005
TL;DR: The MIT Lincoln Laboratory submission for the 2004 NIST speaker recognition evaluation (SRE) was built upon seven core systems using speaker information from short-term acoustics, pitch and duration prosodic behavior, and phoneme and word usage.
Abstract: The MIT Lincoln Laboratory submission for the 2004 NIST speaker recognition evaluation (SRE) was built upon seven core systems using speaker information from short-term acoustics, pitch and duration prosodic behavior, and phoneme and word usage. These different levels of information were modeled and classified using Gaussian mixture models, support vector machines and N-gram language models and were combined using a single layer perceptron fuser. The 2004 SRE used a new multi-lingual, multi-channel speech corpus that provided a challenging speaker detection task for the above systems. We describe the core systems used and provide an overview of their performance on the 2004 SRE detection tasks.
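
The final fusion step, a single-layer perceptron over the per-system scores, amounts to a trained linear combination. In the sketch below a logistic-regression fuser on synthetic score vectors stands in for it; the score statistics are invented.

```python
# Linear score fusion over seven subsystem outputs (synthetic scores).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(7)
n_sys = 7                                      # seven core systems
target = rng.normal(1.0, 1.0, (500, n_sys))    # scores on target trials
nontgt = rng.normal(-1.0, 1.0, (500, n_sys))   # scores on non-target trials
X = np.vstack([target, nontgt])
y = np.r_[np.ones(500), np.zeros(500)]

fuser = LogisticRegression().fit(X, y)
fused = fuser.decision_function(X)             # one combined score per trial
print(fuser.coef_.round(2))                    # learned per-system weights
```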

Proceedings ArticleDOI
01 Jan 2005
TL;DR: Recent studies on speaker diarization from automatic broadcast news transcripts are described, and the performance using automatic transcripts generated with an LVCSR system is compared with that obtained using manual transcriptions.
Abstract: This paper describes recent studies on speaker diarization from automatic broadcast news transcripts. Linguistic information revealing the true names of who speaks during a broadcast (the next, the previous and the current speaker) is detected by means of linguistic patterns. In order to associate the true speaker names with the speech segments, a set of rules is defined for each pattern. Since the effectiveness of linguistic patterns for diarization depends on the quality of the transcription, the performance using automatic transcripts generated with an LVCSR system is compared with that obtained using manual transcriptions. On about 150 hours of broadcast news data (295 shows), the global ratio of false identity association is about 13% for both the automatic and the manual transcripts.
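
The linguistic-pattern idea can be sketched with regular expressions that capture phrases naming the next or previous speaker, each paired with an association rule. The two English patterns below are invented illustrations; the paper works on French broadcast news with a hand-built pattern set.

```python
# Toy linguistic patterns for detecting speaker names in transcripts.
import re

PATTERNS = [
    (re.compile(r"over to you,? (?P<name>[A-Z]\w+ [A-Z]\w+)", re.I), "next"),
    (re.compile(r"thank you,? (?P<name>[A-Z]\w+ [A-Z]\w+)", re.I), "previous"),
]

def find_speaker_names(transcript):
    """Return (name, association rule, character offset) triples."""
    hits = []
    for pat, rule in PATTERNS:
        for m in pat.finditer(transcript):
            hits.append((m.group("name"), rule, m.start()))
    return hits

text = "... thank you, Jane Smith. Over to you, John Doe, for the weather."
print(find_speaker_names(text))
```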

Book ChapterDOI
11 Jul 2005
TL;DR: Different pre-processing techniques, coupled with three speaker diarization systems in the framework of the NIST 2005 Spring Rich Transcription campaign (RT'05S), aim at providing a signal quality index in order to build a unique “virtual” signal obtained from all the microphone recordings available for a meeting.
Abstract: This paper presents different pre-processing techniques, coupled with three speaker diarization systems, in the framework of the NIST 2005 Spring Rich Transcription campaign (RT'05S). The pre-processing techniques aim at providing a signal quality index in order to build a unique "virtual" signal obtained from all the microphone recordings available for a meeting. This unique virtual signal relies on a weighted sum of the different microphone signals, while the signal quality index is given according to a signal-to-noise ratio. Two methods are used in this paper to compute the instantaneous signal-to-noise ratio: a speech activity detection based approach and a noise spectrum estimate. The speaker diarization task is performed using systems developed by different labs: the LIA, LIUM and CLIPS. Among the different system submissions made by these three labs, the best system obtained 24.5% speaker diarization error for the conference subdomain and 18.4% for the lecture subdomain.
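
The weighting scheme can be sketched as one SNR-derived weight per channel followed by a weighted sum. The noise-floor quantile used below is a crude stand-in for the paper's two estimators (SAD-based and noise-spectrum-based).

```python
# Building a "virtual" signal as an SNR-weighted sum of microphone channels.
import numpy as np

def snr_weighted_sum(channels, noise_frac=0.1, frame=160):
    weights = []
    for ch in channels:
        frame_pow = ch[:len(ch) // frame * frame].reshape(-1, frame).var(axis=1)
        noise = np.quantile(frame_pow, noise_frac)   # crude noise-floor estimate
        weights.append(max(frame_pow.mean() / (noise + 1e-12), 1e-3))
    w = np.array(weights) / sum(weights)
    return sum(wi * ch for wi, ch in zip(w, channels)), w

# usage: an on/off "speech" source picked up by a good and a noisy microphone
rng = np.random.default_rng(8)
clean = rng.normal(0, 1, 16000) * np.repeat(np.tile([0.0, 1.0], 50), 160)
chans = [clean + 0.05 * rng.normal(0, 1, 16000),     # good microphone
         clean + 1.0 * rng.normal(0, 1, 16000)]      # noisy microphone
virtual, w = snr_weighted_sum(chans)
print(w.round(3))                                    # most weight on channel 0
```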

01 Jan 2005
TL;DR: The paper presents a new database called ELSDSR dedicated to speaker recognition applications, and its main characteristics are: English spoken by non-native speakers, a single session of sentence reading, and relatively extensive speech samples suitable for learning person-specific speech characteristics.
In this paper we discuss properties of speech databases used for speaker recognition research and evaluation, and we characterize some popular standard databases. The paper presents a new database called ELSDSR dedicated to speaker recognition applications. The main characteristics of this database are: English spoken by non-native speakers, a single session of sentence reading, and relatively extensive speech samples suitable for learning person-specific speech characteristics.

Book ChapterDOI
14 Sep 2005
TL;DR: How vulnerable a speaker verification system is to conscious effort by impostors to mimic a client of the system is determined and how much closer an impostor can get to another speaker's voice by repeated attempts is explored.
Abstract: The aim of this paper is to determine how vulnerable a speaker verification system is to conscious efforts by impostors to mimic a client of the system. The paper explores systematically how much closer an impostor can get to another speaker's voice through repeated attempts. Experiments on 138 speakers in the YOHO database and six people who played the role of imitators showed that professional linguists could successfully attack the system, and that non-professional imitators could have a good chance if they know which speaker in the database is closest to their own voice.

Patent
27 Sep 2005
TL;DR: In this article, a speech recognition method comprising the steps of storing multiple recognition models for a vocabulary set, each model distinguished from the other models in response to a Lombard characteristic, detecting at least one speaker utterance in a motor vehicle, selecting one of the multiple recognition model and utilizing the selected recognition model to recognize the at least speaker utterances.
Abstract: A speech recognition method comprising the steps of: storing multiple recognition models for a vocabulary set, each model distinguished from the other models in response to a Lombard characteristic, detecting at least one speaker utterance in a motor vehicle, selecting one of the multiple recognition models in response to a Lombard characteristic of the at least one speaker utterance, utilizing the selected recognition model to recognize the at least one speaker utterance; and providing a signal in response to the recognition.

Journal ArticleDOI
Lie Lu, Hong-Jiang Zhang
TL;DR: In this approach, incremental speaker model updating and segmental clustering are proposed, which make unsupervised speaker segmentation and tracking feasible in real-time processing.
Abstract: This paper addresses the problem of real-time speaker segmentation and speaker tracking in audio content analysis, in which no prior knowledge of the number of speakers and the identities of speakers is available. Speaker segmentation detects the speaker change boundaries in a speech stream. It is performed by a two-step algorithm, which includes potential change detection and refinement. Speaker tracking is then performed based on the results of speaker segmentation by identifying the speaker of each segment. In our approach, incremental speaker model updating and segmental clustering are proposed, which make unsupervised speaker segmentation and tracking feasible in real-time processing. A Bayesian fusion method is also proposed to fuse multiple audio features to obtain a more reliable result, and different noise levels are utilized to compensate for background mismatch. Experiments show that the proposed algorithm can recall 89% of speaker change boundaries with 15% false alarms, and that 76% of speakers can be identified without supervision with 20% false alarms. Compared with previous works, the algorithm also has low computational complexity and can run in 15% of real time with a very limited delay in analysis.

Book ChapterDOI
22 Oct 2005
TL;DR: The ongoing work proposes a speech emotion-state conversion approach to improve the performance of a speaker identification system on various affective speech.
Abstract: The performance of a speaker recognition system is easily disturbed by changes in the internal states of humans. The ongoing work proposes a speech emotion-state conversion approach to improve the performance of a speaker identification system on various affective speech. The features of neutral speech are modified according to statistical prosodic parameters of emotional utterances. Speaker models are generated based on the converted speech. Experiments conducted on an emotion corpus with 14 emotion states show promising results, with performance improved by 7.2%.

Journal ArticleDOI
TL;DR: This work proposes a flexible framework in which an optimal speaker model (GMM or VQ) is automatically selected based on the Bayesian Information Criterion according to the amount of training data available, and demonstrates that speaker indexing with this framework is sufficiently accurate for adaptation of the acoustic model.
Abstract: In conventional speaker recognition tasks, the amount of training data is almost the same for each speaker, and the speaker model structure is uniform and specified manually according to the nature of the task and the available size of the training data. In real-world speech data such as telephone conversations and meetings, however, serious problems arise in applying a uniform model because variations in the utterance durations of speakers are large, with numerous short utterances. We therefore propose a flexible framework in which an optimal speaker model (GMM or VQ) is automatically selected based on the Bayesian Information Criterion (BIC) according to the amount of training data available. The framework makes it possible to use a discrete model when the data is sparse, and to seamlessly switch to a continuous model after a large amount of data is obtained. The proposed framework was implemented in unsupervised speaker indexing of a discussion audio. For a real discussion archive with a total duration of 10 hours, we demonstrate that the proposed method has higher indexing performance than that of conventional methods. The speaker index is also used to adapt a speaker-independent acoustic model to each participant for automatic transcription of the discussion. We demonstrate that speaker indexing with our method is sufficiently accurate for adaptation of the acoustic model.
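
The model selection at the heart of the framework can be sketched as fitting both a continuous model (GMM) and a discrete one (VQ codebook) and keeping the lower BIC. The spherical-Gaussian likelihood proxy for VQ below is an assumption for illustration, not the paper's exact formulation.

```python
# BIC-based choice between a GMM and a VQ speaker model.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

def pick_model(X, k=8):
    d = X.shape[1]
    gmm = GaussianMixture(n_components=k, covariance_type="diag",
                          random_state=0).fit(X)
    bic_gmm = gmm.bic(X)                       # lower BIC is better
    vq = KMeans(n_clusters=k, n_init=3, random_state=0).fit(X)
    var = vq.inertia_ / (len(X) * d)           # spherical variance around codewords
    loglik = -0.5 * len(X) * d * (np.log(2 * np.pi * var) + 1)
    bic_vq = -2 * loglik + k * d * np.log(len(X))
    return ("GMM", bic_gmm) if bic_gmm < bic_vq else ("VQ", bic_vq)

rng = np.random.default_rng(9)
print(pick_model(rng.normal(0, 1, (2000, 12)))[0])   # ample training data
print(pick_model(rng.normal(0, 1, (60, 12)))[0])     # sparse training data
```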

Proceedings ArticleDOI
23 Mar 2005
TL;DR: A cluster-voting scheme is described which takes the output from two speaker diarisation systems and produces a new output which aims to have a lower speaker diarisation error rate (DER) than either input.
Abstract: Speaker diarisation is the task of automatically segmenting audio data and providing speaker labels for the resulting regions of audio. A cluster-voting scheme is described which takes the output from two speaker diarisation systems and produces a new output which aims to have a lower speaker diarisation error rate (DER) than either input. The scheme works in two stages: the first produces a set of possible outputs which minimise a distance metric based on the DER; the second votes between these alternatives to give the final output. Decisions where the inputs agree are always passed to the output and those where the inputs differ are re-evaluated in the final voting stage. Results are presented on the 6-show RT-03 broadcast news evaluation data; they show that the DER can be reduced by 1.64% and 2.56% absolute using this method when combining the best two Cambridge University and the best two MIT Lincoln Laboratory diarisation systems respectively.

Proceedings ArticleDOI
09 May 2005
TL;DR: Results show that the hybrid approach, with two-level clustering based on a Bayesian information criterion and HMM model scores, significantly outperforms direct metric-based segmentation.
Abstract: We present a hybrid speaker-based segmentation, which combines metric-based and model-based techniques. Without a priori information about the number of speakers and speaker identities, the speech stream is segmented in three stages: (1) the most likely speaker changes are detected; (2) to group segments of identical speakers, a two-level clustering algorithm is performed using a Bayesian information criterion (BIC) and HMM model scores - every cluster is assumed to contain only one speaker; (3) the speaker models are reestimated from each cluster by HMM. Finally a resegmentation step performs a more refined segmentation using these speaker models. To measure the performance, we compare the segmentation results of the proposed hybrid method versus metric-based segmentation. Results show that the hybrid approach using two-level clustering significantly outperforms direct metric-based segmentation.

Proceedings Article
01 Jan 2005
TL;DR: The syllable-based modelling technique is shown to outperform a state-of-the-art baseline GMM system, and a simple selective reduction of the syllable set is also shown to give further improvement in performance.
Abstract: This paper examines the usefulness of a multilingual broad syllable-based framework for text-independent speaker verification. Syllabic segmentation is used in order to obtain a convenient unit for constrained and more detailed model generation. Gaussian mixture models are chosen as a suitable modelling paradigm for initial testing of the framework. Promising results are presented for the NIST 2003 speaker recognition evaluation corpus. The syllable-based modelling technique is shown to outperform a state-of-the-art baseline GMM system. A simple selective reduction of the syllable set is also shown to give a further improvement in performance. Overall, the syllable-based framework presents itself as a valid alternative to text-constrained speaker verification systems, with the advantage of being multilingual. The framework allows for future testing of alternative modelling paradigms, feature sets and qualitative analysis.

Patent
18 Jan 2005
TL;DR: In this article, a method for improving spoken language includes accepting a speech input from by a speaker using a language, identifying the speaker with a predetermined speaker category and correcting an error in the speech input using an error model specific to the speaker category.
Abstract: A method for improving spoken language includes accepting a speech input from by a speaker using a language, identifying the speaker with a predetermined speaker category and correcting an error in the speech input using an error model that is specific to the speaker category.

Proceedings ArticleDOI
18 Mar 2005
TL;DR: The technique of anchor modelling is presented and a new metric to compare speech segments based on the correlation coefficient is introduced, which appears to be more efficient than the classical metrics for the task of speaker detection.
Abstract: This paper presents an approach for speaker tracking in a large audio database. The system described is based on a speaker segmentation procedure consisting of a detection of statistical ruptures in the speech signal followed by a speaker detection procedure using anchor models. The technique of anchor modelling is presented and a new metric to compare speech segments based on the correlation coefficient is introduced. This novel metric is evaluated and compared to the classical Euclidean and angular metrics for the speaker detection task. Evaluation is carried out on the audio database of the ESTER evaluation campaign for the rich transcription of French broadcast news. The new metric appears to be more efficient than the classical metrics for the task of speaker detection.