
Showing papers on "Speaker diarisation" published in 2001


Proceedings Article
01 Jan 2001
TL;DR: This paper introduces a first approach to emotion recognition using RAMSES, the UPC’s speech recognition system, based on standard speech recognition technology using hidden semi-continuous Markov models.
Abstract: This paper introduces a first approach to emotion recognition using RAMSES, the UPC's speech recognition system. The approach is based on standard speech recognition technology using hidden semi-continuous Markov models. Both the selection of low-level features and the design of the recognition system are addressed. Results are given on speaker-dependent emotion recognition using the Spanish corpus of the INTERFACE Emotional Speech Synthesis Database. The accuracy in recognising seven different emotions (the six defined in MPEG-4 plus neutral style) exceeds 80% using the best combination of low-level features and HMM structure. This result is very similar to that obtained with the same database in subjective evaluation by human judges. Dealing with the speaker's emotion is one of the latest challenges in speech technologies. Three different aspects can be easily identified: speech recognition in the presence of emotional speech, synthesis of emotional speech, and emotion recognition. In the last case, the objective is to determine the emotional state of the speaker from the speech samples. Possible applications range from aid in psychiatric diagnosis to intelligent toys, and the topic is a subject of recent but rapidly growing interest [1]. This paper describes the TALP researchers' first approach to emotion recognition. The work falls within the scope of the INTERFACE project [2]. The objective of this European Commission sponsored project is "to define new models and implement advanced tools for audio-video analysis, synthesis and representation in order to provide essential technologies for the implementation of large-scale virtual and augmented environments. The work is oriented to make man-machine interaction as natural as possible, based on everyday human communication by speech, facial expressions and body gestures". In the field of emotion recognition from speech, the main goal of the INTERFACE project is the construction of a real-time, multilingual, speaker-independent emotion recogniser. For this purpose, large speech databases with recordings from many speakers and languages are needed. As these resources are not yet available, a reduced problem is addressed first: emotion recognition in multi-speaker, language-dependent conditions. Specifically, this paper deals with the recognition of emotion for two Spanish speakers using standard hidden Markov model technology.
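
A minimal sketch of the kind of classifier the abstract describes: one HMM trained per emotion, with the emotion chosen by maximum log-likelihood. It uses the hmmlearn library and synthetic stand-in features; the emotion labels, model sizes and feature dimensions are illustrative assumptions, not the paper's RAMSES configuration (which used semi-continuous HMMs).

```python
# Minimal sketch: one Gaussian HMM per emotion, classification by maximum
# log-likelihood. Feature dimensions, model sizes and labels are illustrative
# assumptions; the paper used semi-continuous HMMs inside the RAMSES system.
import numpy as np
from hmmlearn import hmm

EMOTIONS = ["anger", "joy", "sadness", "neutral"]  # subset, for illustration

def train_emotion_models(train_data, n_states=5):
    """train_data: dict emotion -> list of (T_i, n_dim) feature arrays."""
    models = {}
    for emo in EMOTIONS:
        seqs = train_data[emo]
        X = np.vstack(seqs)                      # stack all frames
        lengths = [len(s) for s in seqs]         # per-utterance lengths
        m = hmm.GaussianHMM(n_components=n_states, covariance_type="diag",
                            n_iter=20, random_state=0)
        m.fit(X, lengths)
        models[emo] = m
    return models

def classify(models, utterance):
    """Return the emotion whose HMM gives the highest log-likelihood."""
    scores = {emo: m.score(utterance) for emo, m in models.items()}
    return max(scores, key=scores.get)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy data standing in for low-level acoustic features (e.g. MFCCs).
    train = {e: [rng.normal(i, 1.0, size=(100, 13)) for _ in range(5)]
             for i, e in enumerate(EMOTIONS)}
    models = train_emotion_models(train)
    test = rng.normal(2, 1.0, size=(80, 13))     # should look like "sadness"
    print(classify(models, test))
```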

641 citations


Proceedings Article
01 Jan 2001
TL;DR: These initial experiments strongly suggest that further exploration of “familiar” speaker characteristics will likely be an extremely interesting and valuable research direction for recognition of speakers in conversational speech.
Abstract: “Familiar” speaker information is explored using non-acoustic features in NIST’s new “extended data” speaker detection task.[1] Word unigrams and bigrams, used in a traditional target/background likelihood ratio framework, are shown to give surprisingly good performance. Performance continues to improve with additional training and/or test data. Bigram performance is also found to be a function of target/model sex and age difference. These initial experiments strongly suggest that further exploration of “familiar” speaker characteristics will likely be an extremely interesting and valuable research direction for recognition of speakers in conversational speech.
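
A minimal sketch of the target/background likelihood-ratio framework over word bigrams described above, assuming add-one smoothing and toy transcripts; the actual NIST extended-data setup and smoothing choices are not reproduced here.

```python
# Sketch of a word-bigram likelihood-ratio detector: score a test transcript
# against a target speaker's bigram model relative to a background model.
# Add-one smoothing and the toy transcripts are illustrative assumptions.
from collections import Counter
import math

def bigram_counts(words):
    return Counter(zip(words[:-1], words[1:]))

def bigram_logprob(words, bigrams, unigrams, vocab_size, alpha=1.0):
    """Smoothed log P(words | model)."""
    lp = 0.0
    for w1, w2 in zip(words[:-1], words[1:]):
        num = bigrams.get((w1, w2), 0) + alpha
        den = unigrams.get(w1, 0) + alpha * vocab_size
        lp += math.log(num / den)
    return lp

def llr_score(test_words, target_words, background_words):
    vocab = set(target_words) | set(background_words) | set(test_words)
    t_big, b_big = bigram_counts(target_words), bigram_counts(background_words)
    t_uni, b_uni = Counter(target_words), Counter(background_words)
    return (bigram_logprob(test_words, t_big, t_uni, len(vocab))
            - bigram_logprob(test_words, b_big, b_uni, len(vocab)))

# Toy usage: a positive score favours the target-speaker hypothesis.
target = "you know i mean you know it was like".split()
background = "the meeting is at noon please confirm the agenda".split()
test = "you know it was like you know".split()
print(llr_score(test, target, background))
```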

285 citations


PatentDOI
TL;DR: In this article, a new speaker provides speech from which comparison snippets are extracted, and a greedy selection algorithm is performed with the required sound units for identifying the smallest subset of the input text that contains all the text for the new speaker to read.
Abstract: A new speaker provides speech from which comparison snippets are extracted. The comparison snippets are compared with initial snippets stored in a recorded snippet database that is associated with a concatenative synthesizer. The comparison of the snippets to the initial snippets produces required sound units. A greedy selection algorithm is performed with the required sound units for identifying the smallest subset of the input text that contains all of the text for the new speaker to read. The new speaker then reads the optimally selected text and sound units are extracted from the human speech such that the recorded snippet database is modified and the speech synthesized adopts the voice quality and characteristics of the new speaker.
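
A small sketch of the greedy selection step described above: repeatedly pick the sentence that covers the most still-missing sound units until everything required is covered. Character bigrams stand in for the snippet/sound-unit inventory here, which is an illustrative simplification.

```python
# Sketch of the greedy selection step: pick a small set of sentences whose
# sound units together cover all required units. The unit inventory here is
# just per-sentence character bigrams, an illustrative stand-in for the
# snippet/diphone units a concatenative synthesizer would use.
def sound_units(sentence):
    s = sentence.lower().replace(" ", "")
    return {s[i:i + 2] for i in range(len(s) - 1)}

def greedy_select(sentences, required_units):
    remaining = set(required_units)
    chosen = []
    while remaining:
        # Pick the sentence covering the most still-uncovered units.
        best = max(sentences, key=lambda s: len(sound_units(s) & remaining))
        gain = sound_units(best) & remaining
        if not gain:          # nothing left can be covered
            break
        chosen.append(best)
        remaining -= gain
    return chosen, remaining

corpus = ["the cat sat on the mat",
          "a quick brown fox",
          "she sells sea shells"]
required = set().union(*(sound_units(s) for s in corpus))
subset, uncovered = greedy_select(corpus, required)
print(len(subset), "sentences selected,", len(uncovered), "units uncovered")
```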

164 citations


Proceedings ArticleDOI
07 May 2001
TL;DR: It is demonstrated that a few sentences uttered by a target speaker are sufficient to adapt not only voice characteristics but also prosodic features, and synthetic speech generated from adapted models using only four sentences is very close to that from speaker dependent models trained using 450 sentences.
Abstract: Describes a technique for synthesizing speech with arbitrary speaker characteristics using speaker independent speech units, which we call "average voice" units. The technique is based on an HMM-based text-to-speech (TTS) system and maximum likelihood linear regression (MLLR) adaptation algorithm. In the HMM-based TTS system, speech synthesis units are modeled by multi-space probability distribution (MSD) HMMs which can model spectrum and pitch simultaneously in a unified framework. We derive an extension of the MLLR algorithm to apply it to MSD-HMMs. We demonstrate that a few sentences uttered by a target speaker are sufficient to adapt not only voice characteristics but also prosodic features. Synthetic speech generated from adapted models using only four sentences is very close to that from speaker dependent models trained using 450 sentences.
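
A simplified sketch of an MLLR-style mean adaptation step, under stated assumptions: a single global affine transform of the average-voice Gaussian means is estimated from a small amount of adaptation data by weighted least squares. The paper's actual algorithm extends MLLR to MSD-HMMs covering spectrum and pitch; covariances, regression classes and the MSD streams are omitted here.

```python
# Simplified sketch of a global MLLR-style mean transform: estimate a single
# affine transform W = [b, A] mapping average-voice Gaussian means towards a
# target speaker, from frames soft-assigned to those Gaussians. Plain
# weighted least squares is used as an illustrative approximation.
import numpy as np

def estimate_mllr_mean_transform(frames, gammas, means):
    """frames: (T, d) adaptation data; gammas: (T, M) occupancies;
    means: (M, d) average-voice means. Returns W of shape (d, d+1)."""
    d = frames.shape[1]
    ext = np.hstack([np.ones((len(means), 1)), means])   # extended means [1, mu]
    G = np.zeros((d + 1, d + 1))
    K = np.zeros((d + 1, d))
    occ = gammas.sum(axis=0)                              # per-Gaussian occupancy
    for m in range(len(means)):
        G += occ[m] * np.outer(ext[m], ext[m])
        weighted_obs = gammas[:, m] @ frames              # occupancy-weighted data
        K += np.outer(ext[m], weighted_obs)
    return np.linalg.solve(G, K).T                        # (d, d+1)

def adapt_means(means, W):
    ext = np.hstack([np.ones((len(means), 1)), means])
    return ext @ W.T

rng = np.random.default_rng(1)
means = rng.normal(size=(8, 4))                           # average-voice means
frames = np.repeat(means + 0.5, 20, axis=0) + 0.05 * rng.normal(size=(160, 4))
gammas = np.repeat(np.eye(8), 20, axis=0)                 # hard assignments here
W = estimate_mllr_mean_transform(frames, gammas, means)
print(np.round(adapt_means(means, W) - means, 2))         # roughly the 0.5 shift
```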

158 citations


Proceedings ArticleDOI
07 May 2001
TL;DR: Results show that the speaker identity of speech whose LPC spectrum has been converted can be recognized as the target speaker with the same level of performance as discriminating between LPC coded speech, however, the level of discrimination of converted utterances produced by the full VC system is significantly below that of speaker discrimination of natural speech.
Abstract: The purpose of a voice conversion (VC) system is to change the perceived speaker identity of a speech signal. We propose an algorithm based on converting the LPC spectrum and predicting the residual as a function of the target envelope parameters. We conduct listening tests based on speaker discrimination of same/difference pairs to measure the accuracy by which the converted voices match the desired target voices. To establish the level of human performance as a baseline, we first measure the ability of listeners to discriminate between original speech utterances under three conditions: normal, fundamental frequency and duration normalized, and LPC coded. Additionally, the spectral parameter conversion function is tested in isolation by listening to source, target, and converted speakers as LPC coded speech. The results show that the speaker identity of speech whose LPC spectrum has been converted can be recognized as the target speaker with the same level of performance as discriminating between LPC coded speech. However, the level of discrimination of converted utterances produced by the full VC system is significantly below that of speaker discrimination of natural speech.
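
An illustrative sketch of the general "convert the LPC spectrum" idea: fit an affine mapping from source-speaker envelope parameters to target-speaker parameters on time-aligned parallel frames. The paper's actual conversion function and its residual-prediction model are more elaborate; the features and data below are synthetic stand-ins.

```python
# Illustrative sketch of a spectral-envelope conversion function: fit a
# linear mapping from source-speaker envelope parameters (e.g. LSFs derived
# from LPC) to target-speaker parameters on time-aligned parallel frames.
# This shows only the "convert the LPC spectrum" idea with least squares.
import numpy as np

def fit_conversion(source_frames, target_frames):
    """Least-squares affine map: target ~= [source, 1] @ coef."""
    X = np.hstack([source_frames, np.ones((len(source_frames), 1))])
    coef, *_ = np.linalg.lstsq(X, target_frames, rcond=None)
    return coef                     # shape (d + 1, d)

def convert(source_frames, coef):
    X = np.hstack([source_frames, np.ones((len(source_frames), 1))])
    return X @ coef

rng = np.random.default_rng(0)
src = rng.normal(size=(500, 10))                 # parallel, aligned frames
tgt = src @ (0.3 * rng.normal(size=(10, 10))) + 0.1 * rng.normal(size=(500, 10))
coef = fit_conversion(src, tgt)
print(np.mean((convert(src, coef) - tgt) ** 2))  # small residual error
```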

135 citations


Proceedings ArticleDOI
07 May 2001
TL;DR: The anchor modeling algorithm is refined by pruning the number of models needed and it is shown that its computational efficiency lends itself to speaker indexing for searching large audio databases for desired speakers.
Abstract: Introduces the technique of anchor modeling in the applications of speaker detection and speaker indexing. The anchor modeling algorithm is refined by pruning the number of models needed. The system is applied to the speaker detection problem where its performance is shown to fall short of the state-of-the-art Gaussian mixture model with universal background model (GMM-UBM) system. However, it is further shown that its computational efficiency lends itself to speaker indexing for searching large audio databases for desired speakers. Here, excessive computation may prohibit the use of the GMM-UBM recognition system. Finally, the paper presents a method for cascading anchor model and GMM-UBM detectors for speaker indexing. This approach benefits from the efficiency of anchor modeling and high accuracy of GMM-UBM recognition.
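
A rough sketch of anchor modeling for indexing, under stated assumptions: each utterance is characterized by its vector of scores against a fixed set of anchor GMMs, and a large database is then searched by distance in that score space. The anchor training, normalization and similarity metric are illustrative choices, not the paper's exact recipe.

```python
# Sketch of anchor modeling for speaker indexing: characterize every
# utterance by its vector of log-likelihood scores against a fixed set of
# anchor GMMs, then search a large database by similarity in that space.
import numpy as np
from sklearn.mixture import GaussianMixture

def train_anchors(anchor_data, n_components=4):
    """anchor_data: list of (N_i, d) feature arrays, one per anchor speaker."""
    return [GaussianMixture(n_components, covariance_type="diag",
                            random_state=0).fit(X) for X in anchor_data]

def characterize(anchors, utterance):
    """Anchor characterization vector: per-anchor average log-likelihood."""
    v = np.array([a.score(utterance) for a in anchors])
    return v / np.linalg.norm(v)

rng = np.random.default_rng(0)
anchors = train_anchors([rng.normal(i, 1.0, size=(400, 12)) for i in range(5)])
database = {f"utt{i}": rng.normal(rng.integers(0, 5), 1.0, size=(200, 12))
            for i in range(10)}
index = {name: characterize(anchors, x) for name, x in database.items()}
query = characterize(anchors, rng.normal(2, 1.0, size=(200, 12)))
# Rank database utterances by cosine similarity to the query vector.
ranked = sorted(index, key=lambda n: -float(query @ index[n]))
print(ranked[:3])
```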

103 citations


Proceedings ArticleDOI
I. Viikki, Imre Kiss, Jilei Tian
07 May 2001
TL;DR: This work proposes an architecture for embedded multilingual speech recognition systems and investigates the technical challenges that are faced when making a transition from the speaker-dependent to speaker-independent speech recognition technology in mobile communication devices.
Abstract: We investigate the technical challenges that are faced when making a transition from the speaker-dependent to speaker-independent speech recognition technology in mobile communication devices. Due to globalization as well as the international nature of the markets and the future applications, speaker independence implies the development and use of language-independent automatic speech recognition (ASR) to avoid logistic difficulties. We propose an architecture for embedded multilingual speech recognition systems. Multilingual acoustic modeling, automatic language identification, and on-line pronunciation modeling are the key features which enable the creation of truly language- and speaker-independent ASR applications with dynamic vocabularies and sparse implementation resources. Our experimental results confirm the viability of the proposed architecture. While the use of multilingual acoustic models degrades the recognition rates only marginally, a recognition accuracy decrease of approximately 4% is observed due to sub-optimal on-line text-to-phoneme mapping and automatic language identification. This performance loss can nevertheless be compensated by applying acoustic model adaptation techniques.

101 citations


Patent
TL;DR: In this paper, a method of improving the recognition accuracy of an in-vehicle speech recognition system is presented. The method selectively adapts the system's speech engine to a speaker's voice characteristics using an N-best matching technique.
Abstract: Disclosed herein is a method of improving the recognition accuracy of an in-vehicle speech recognition system. The method of the present invention selectively adapts the system's speech engine to a speaker's voice characteristics using an N-best matching technique. In this method, the speech recognition system receives and processes a spoken utterance relating to a car command and having particular speaker-dependent speech characteristics so as to select a set of N-best voice commands matching the spoken utterance. Upon receiving a training mode input from the speaker, the system outputs the N-best command set to the speaker who selects the correct car command. The system then adapts the speech engine to recognize a spoken utterance having the received speech characteristics as the user-selected car command.

86 citations


Proceedings Article
01 Jan 2001
TL;DR: Experimental results show that the false acceptance rate for synthetic speech was reduced drastically without a significant increase in the false acceptance and rejection rates for natural speech.
Abstract: This paper describes a text-prompted speaker verification system which is robust to imposture using synthetic speech generated by an HMM-based speech synthesis system. In the verification system, text and speaker are verified separately. Text verification is based on phoneme recognition using HMM, and speaker verification is based on GMM. To discriminate synthetic speech from natural speech, an average of inter-frame difference of the log likelihood is calculated, and input speech is judged to be synthetic when this value is smaller than a decision threshold. Experimental results show that the false acceptance rate for synthetic speech was reduced drastically without significant increase of the false acceptance and rejection rates for natural speech.
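
A small sketch of the discrimination statistic described above: the average inter-frame difference of per-frame log-likelihoods, which tends to be smaller for over-smooth HMM-generated speech. The GMM, threshold value and synthetic features below are illustrative assumptions.

```python
# Sketch of the synthetic-speech check: compute the average inter-frame
# difference of per-frame log-likelihoods under the claimed speaker's GMM and
# reject as synthetic when it falls below a threshold (HMM-generated speech
# tends to vary less frame to frame). Threshold and data are illustrative.
import numpy as np
from sklearn.mixture import GaussianMixture

def interframe_loglik_diff(gmm, frames):
    ll = gmm.score_samples(frames)            # per-frame log-likelihoods
    return float(np.mean(np.abs(np.diff(ll))))

def is_synthetic(gmm, frames, threshold=0.5):
    return interframe_loglik_diff(gmm, frames) < threshold

rng = np.random.default_rng(0)
train = rng.normal(size=(1000, 13))
gmm = GaussianMixture(8, covariance_type="diag", random_state=0).fit(train)
natural = rng.normal(size=(300, 13))                           # frame-to-frame variation
synthetic = np.repeat(rng.normal(size=(15, 13)), 20, axis=0)   # over-smooth trajectory
print(interframe_loglik_diff(gmm, natural),
      interframe_loglik_diff(gmm, synthetic))
```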

72 citations


Proceedings ArticleDOI
07 May 2001
TL;DR: The aim is to apply speaker grouping information to speaker adaptation for speech recognition by using vector quantization (VQ) distortion as the criterion and showing the superiority of the proposed method.
Abstract: Addresses the problem of detecting speaker changes and clustering speakers when no information is available regarding speaker classes or even the total number of classes. We assume that no prior information on the speakers is available (no speaker model, no training phase) and that people do not speak simultaneously. The aim is to apply speaker grouping information to speaker adaptation for speech recognition. We use vector quantization (VQ) distortion as the criterion: a speaker model is created from successive utterances as a codebook by a VQ algorithm, and the VQ distortion is calculated between the model and an utterance. Results were obtained in experiments on speaker change detection and speaker clustering. The speaker change detection results were compared with those obtained using the generalized likelihood ratio and the Bayesian information criterion, and show the superiority of our proposed method.
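
A compact sketch of the VQ-distortion criterion described above: build a codebook from preceding speech with k-means and flag a speaker change when a new utterance's quantization distortion exceeds a threshold. Codebook size, threshold and data are illustrative assumptions.

```python
# Sketch of VQ-distortion-based speaker change detection: train a codebook
# from preceding utterances with k-means, then compare the mean nearest-
# codeword distance of a new utterance against a threshold.
import numpy as np
from sklearn.cluster import KMeans

def train_codebook(frames, codebook_size=16):
    return KMeans(n_clusters=codebook_size, n_init=5, random_state=0).fit(frames)

def vq_distortion(codebook, frames):
    d = codebook.transform(frames)            # distances to every codeword
    return float(np.mean(d.min(axis=1)))      # mean nearest-codeword distance

def is_speaker_change(codebook, new_utt, threshold):
    return vq_distortion(codebook, new_utt) > threshold

rng = np.random.default_rng(0)
speaker_a = rng.normal(0.0, 1.0, size=(600, 12))
same_spk = rng.normal(0.0, 1.0, size=(150, 12))
new_spk = rng.normal(2.5, 1.0, size=(150, 12))
cb = train_codebook(speaker_a)
print(vq_distortion(cb, same_spk), vq_distortion(cb, new_spk))
```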

68 citations


01 Jan 2001
TL;DR: This paper presents an iterative process for blind speaker indexing based on a HMM that reduces the miss detection of short utterances by exploiting all the information (detected speakers) as soon as it is available.
Abstract: This paper presents an iterative process for blind speaker indexing based on an HMM. The process detects and adds speakers one after the other to an evolutive HMM (E-HMM). This HMM approach takes advantage of the different components of the AMIRAL automatic speaker recognition system from LIA (front-end processing, learning, log-likelihood ratio computation). The proposed solution reduces missed detections of short utterances by exploiting all available information (the speakers detected so far) as soon as it becomes available. The system was tested on the N-speaker segmentation task of the NIST 2001 evaluation campaign. Experiments were carried out to validate the speaker detection and to measure the influence of the parameters used for learning the speaker models.

Proceedings Article
01 Sep 2001
TL;DR: This work discusses the multi-speaker tasks of detection, tracking, and segmentation of speakers as included in recent NIST Speaker Recognition Evaluations, and examines the effects of target speaker speech duration and the gender mix within test segments on results.
Abstract: We discuss the multi-speaker tasks of detection, tracking, and segmentation of speakers as included in recent NIST Speaker Recognition Evaluations. We consider how performance for the two-speaker detection task is related to that for the corresponding one-speaker task. We examine the effects of target speaker speech duration and the gender mix within test segments on results for these tasks. We also relate performance results for the tracking and segmentation tasks, and look at factors affecting segmentation performance.

Proceedings Article
01 Jan 2001
TL;DR: It is demonstrated that a few sentences uttered by a target speaker are sufficient to adapt not only voice characteristics but also prosodic features and synthetic speech generated from adapted models using only four sentences is very close to that from speaker dependent models trained using a large amount of speech data.
Abstract: This paper describes a technique for synthesizing speech with any desired voice. The technique is based on an HMM-based text-to-speech (TTS) system and the MLLR adaptation algorithm. To generate speech for an arbitrarily given target speaker, speaker-independent speech units, i.e., average voice models, are adapted to the target speaker using the MLLR framework. In addition to spectrum and pitch adaptation, we derive an algorithm for adaptation of state duration. We demonstrate that a few sentences uttered by a target speaker are sufficient to adapt not only voice characteristics but also prosodic features. Synthetic speech generated from adapted models using only four sentences is very close to that from speaker-dependent models trained using a large amount of speech data.

Journal ArticleDOI
TL;DR: A study of speaker perception and identification by psychoacoustic experiments was carried out and statistical analysis of the results suggests that the prototype model is appropriate for explaining the process of speaker identification.
Abstract: Little is known about the perceptual processes of speaker identification and their relationship to the acoustic features of the speaker's voice. A study of speaker perception and identification by psychoacoustic experiments was carried out. Twenty male speakers were recorded and thirty listeners participated in the experiments. Statistical analysis of the results suggests that the prototype model is appropriate for explaining the process of speaker identification. The most important features for speaker identification were the fundamental frequency, the third and fourth formants, and the closing phase of the glottal wave. For different listeners, different sets of features were found to be significant for coding speaker identity.

Proceedings ArticleDOI
07 May 2001
TL;DR: This paper investigates the effectiveness of a frame-based usable speech extraction technique for speaker identification by examining the ability of the SAR method to determine usable frames within co-channel speech.
Abstract: A "usable speech" extraction system was proposed (Yanatorno, 1998) to separate co-channel speech into "usable" frames that are minimally corrupted by interfering speech. Studies indicate that a significant amount of co-channel speech can be considered "usable" for speaker identification (SID); it is therefore necessary to establish criteria for usable speech frames for SID. Voiced speech, of which usable speech is entirely composed, is shown to be information-rich for SID. In addition, SID accuracy increases as the frame-based target-to-interferer ratio (TIR) increases, when evaluated independently of the amount of available segments. Krishnamachari et al. (2000) developed a frame-based spectral autocorrelation ratio (SAR) technique for determining usable frames within co-channel speech. The ability of the SAR method to determine usable frames at various thresholds is examined, and the effectiveness of this frame-based usable speech extraction technique for speaker identification is investigated.

Proceedings ArticleDOI
01 Jan 2001
TL;DR: This paper introduces a speaker recognition system based on differences in the dynamic realization of phonetic features (i.e., pronunciation) between speakers rather than spectral differences in voice quality, and uses phonetic information from six languages to perform text-independent speaker recognition.
Abstract: This paper introduces a novel language-independent speaker-recognition system based on differences in dynamic realization of phonetic features (i.e., pronunciation) between speakers rather than spectral differences in voice quality. The system exploits phonetic information from six languages to perform text independent speaker recognition. All experiments were performed on the NIST 2001 Speaker Recognition Evaluation Extended Data Task. Recognition results are provided for unigram, bigram, and trigram models. Performance for each of the three models is examined for phones from each individual language and the final multilanguage fused system. Additional fusion experiments demonstrate that speaker recognition capability is maintained even without phonetic information in the language of the speaker.

Patent
13 Mar 2001
TL;DR: In this article, a system and method for unsupervised, on-line, adaptation in speaker verification is presented, which consists of detecting a channel of a verification utterance, learning vocal characteristics of the speaker on the detected channel, and transforming the learned vocal characteristics from the speaker model of a second channel.
Abstract: The present invention introduces a system and method for unsupervised, on-line, adaptation in speaker verification. In one embodiment, a method for adapting a speaker model to improve the verification of a speaker's voice, comprises detecting a channel of a verification utterance; learning vocal characteristics of the speaker on the detected channel; and transforming the learned vocal characteristics of the speaker from the detected channel to the speaker model of a second channel.

Proceedings ArticleDOI
07 May 2001
TL;DR: The proposed method can obtain a better speaker cluster because the clustering result is determined on-line according to the test speaker's data, and it attains a larger improvement than MLLR applied to the speaker-independent model.
Abstract: Describes an efficient method for unsupervised speaker adaptation. This method is based on (1) selecting a subset of speakers who are acoustically close to a test speaker, and (2) calculating adapted model parameters from the previously stored sufficient HMM statistics of the selected speakers' data. Only a small amount of unsupervised data from the test speaker is required for the adaptation, and because the sufficient HMM statistics of the selected speakers' data are precomputed, the adaptation can be done quickly. Compared with a pre-clustering method, the proposed method can obtain a better speaker cluster because the clustering result is determined on-line according to the test speaker's data. Experimental results show that the proposed method attains a larger improvement than MLLR applied to the speaker-independent model. Moreover, the proposed method uses only one unsupervised sentence utterance, while MLLR usually requires more than ten supervised sentence utterances.
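
A simplified sketch of the idea of adapting from stored sufficient statistics: select the training speakers closest to the test utterance and rebuild model means from their precomputed zero- and first-order statistics rather than from raw data. A single Gaussian stands in for the full set of HMM state statistics; the selection metric and cohort size are illustrative assumptions.

```python
# Simplified sketch of adaptation from stored sufficient statistics: pick the
# training speakers acoustically closest to the test utterance, then rebuild
# model means from their precomputed statistics instead of from raw data.
import numpy as np

def speaker_stats(frames):
    """Precomputed sufficient statistics: frame count and feature sum."""
    return {"n": len(frames), "sum": frames.sum(axis=0)}

def select_cohort(test_frames, stats, k=3):
    test_mean = test_frames.mean(axis=0)
    dist = {spk: np.linalg.norm(s["sum"] / s["n"] - test_mean)
            for spk, s in stats.items()}
    return sorted(dist, key=dist.get)[:k]

def adapted_mean(stats, cohort):
    n = sum(stats[s]["n"] for s in cohort)
    total = sum(stats[s]["sum"] for s in cohort)
    return total / n

rng = np.random.default_rng(0)
train_speakers = {f"spk{i}": rng.normal(i * 0.5, 1.0, size=(500, 13))
                  for i in range(10)}
stats = {spk: speaker_stats(x) for spk, x in train_speakers.items()}  # offline
test = rng.normal(1.6, 1.0, size=(60, 13))                            # one utterance
cohort = select_cohort(test, stats)
print(cohort, adapted_mean(stats, cohort)[:3])
```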

01 Jan 2001
TL;DR: A text-independent speaker recognition system that achieves an equal error rate of less than 1% by combining phonetic, idiolect, and acoustic features.
Abstract: This paper describes a text-independent speaker recognition system that achieves an equal error rate of less than 1% by combining phonetic, idiolect, and acoustic features. The phonetic system is a novel language-independent speakerrecognition system based on differences among speakers in dynamic realization of phonetic features (i.e., pronunciation), rather than spectral differences in voice quality. The system exploits phonetic information from six languages to perform text-independent speaker recognition. The idiolectal system models speaker idiosyncrasies with word n-gram frequency counts computed from the output of an automatic speech recognition system. The acoustic system is a Gaussian Mixture Model-Universal Background Model that exploits the spectral differences in voice quality. All experiments were performed on the NIST 2001 Speaker Recognition Evaluation Extended Data Task.
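
A minimal sketch of score-level fusion of the three subsystems: z-normalize each subsystem's per-trial scores and combine them with fixed weights. The weights, the normalization and the toy scores are illustrative assumptions; the paper's actual combination method is not reproduced here.

```python
# Sketch of score-level fusion of phonetic, idiolect and acoustic subsystems:
# z-normalize each subsystem's scores and combine with fixed weights.
import numpy as np

def znorm(scores):
    return (scores - scores.mean()) / scores.std()

def fuse(score_matrix, weights):
    """score_matrix: (n_trials, n_systems); returns fused per-trial scores."""
    normed = np.column_stack([znorm(score_matrix[:, j])
                              for j in range(score_matrix.shape[1])])
    return normed @ np.asarray(weights)

rng = np.random.default_rng(0)
n = 200
labels = rng.integers(0, 2, size=n)                  # 1 = target trial
# Toy subsystem scores, each weakly informative about the label.
scores = np.column_stack([labels * 1.0 + rng.normal(0, s, size=n)
                          for s in (1.5, 2.0, 1.0)])
fused = fuse(scores, weights=[0.3, 0.2, 0.5])
decisions = fused > np.median(fused)
print("agreement with labels:", np.mean(decisions == labels))
```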

Proceedings ArticleDOI
14 Aug 2001
TL;DR: The main techniques followed in each of the above steps are reviewed, and the importance of feature vector extraction, selection and normalization is also discussed.
Abstract: Speaker recognition (SR) is the process of automatically recognizing the person speaking on the basis of information obtained from speech features. The SR process involves speaker verification (SV) and speaker identification (SI). Automatic speaker verification (ASV) is the process of authenticating the true identity of the speaker. ASV is generally accomplished in four steps. The first step is digital speech data acquisition. In the second step, feature extraction and feature selection are performed. The third step involves clustering the feature vectors and storing them in a database. Decision-making through pattern matching is the last step. In this paper, the main techniques followed in each of the above steps are reviewed. The importance of feature vector extraction, selection and normalization is also discussed.

01 Jan 2001
TL;DR: A model of joint probability functions of the pitch and the feature vectors is proposed and three strategies are designed and compared for all female speakers taken from the SPIDRE corpus, observing an increase of the identification rates.
Abstract: Usually, speaker recognition systems do not take into account the dependence between the vocal source and the vocal tract. A feasibility study that retains this dependence is presented here. A model of the joint probability functions of the pitch and the feature vectors is proposed. Three strategies are designed and compared for all female speakers taken from the SPIDRE corpus. The first operates on all voiced and unvoiced speech segments (baseline strategy), the second considers only the voiced speech segments, and the last includes the pitch information along with the standard MFCCs. We use two pattern recognizers: LVQ-SLP and GMM. In all cases, we observe an increase in the identification rates, most notably (6% higher) when using a time duration of 500 ms.
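
A small sketch of the third strategy described above: on voiced frames, append a pitch feature (log F0) to the MFCC vector and model the joint distribution with one GMM per speaker. The features are synthetic stand-ins and the GMM sizes and voicing decision are illustrative assumptions; the LVQ-SLP recognizer also evaluated in the paper is not shown.

```python
# Sketch of joint pitch + spectral modelling: keep voiced frames, append
# log F0 to the MFCC vector, train one GMM per speaker, identify by score.
import numpy as np
from sklearn.mixture import GaussianMixture

def joint_features(mfcc, f0, voiced):
    """Keep voiced frames only and append log F0 as an extra dimension."""
    return np.hstack([mfcc[voiced], np.log(f0[voiced])[:, None]])

def train_speaker_models(per_speaker_features, n_components=8):
    return {spk: GaussianMixture(n_components, covariance_type="diag",
                                 random_state=0).fit(X)
            for spk, X in per_speaker_features.items()}

def identify(models, features):
    return max(models, key=lambda spk: models[spk].score(features))

rng = np.random.default_rng(0)
def fake_utterance(pitch_hz, offset, n=400):
    mfcc = rng.normal(offset, 1.0, size=(n, 12))
    f0 = rng.normal(pitch_hz, 10.0, size=n)
    voiced = rng.random(n) > 0.3           # ~70% of frames treated as voiced
    return joint_features(mfcc, f0, voiced)

train = {"spkA": fake_utterance(180, 0.0), "spkB": fake_utterance(220, 0.5)}
models = train_speaker_models(train)
print(identify(models, fake_utterance(220, 0.5, n=50)))   # expect "spkB"
```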

Proceedings ArticleDOI
07 May 2001
TL;DR: In this paper, the authors apply independent component analysis for extracting an optimal basis to the problem of finding efficient features for a speaker, which are oriented and localized in both space and frequency, bearing a resemblance to Gabor functions.
Abstract: We apply independent component analysis, as a means of extracting an optimal basis, to the problem of finding efficient features for a speaker. The basis functions learned by the algorithm are oriented and localized in both space and frequency, bearing a resemblance to Gabor functions. The speech segments are assumed to be generated by a linear combination of the basis functions; thus the distribution of a speaker's speech segments is modeled by a basis calculated so that each component is independent of the others on the given training data. To assess the efficiency of the basis functions, we performed speaker classification experiments and compared our results with a conventional Fourier basis. The results show that the proposed method is more efficient than conventional Fourier-based features, in that it achieves a higher classification rate.
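
A brief sketch of the basis-learning step under stated assumptions: treat short speech segments as vectors, learn independent components from a speaker's training segments with FastICA, and use projections onto that basis as features. Segment length, component count and the toy signal are illustrative.

```python
# Sketch of learning a speaker-specific basis with ICA: short speech segments
# are treated as vectors, an independent basis is learned from a speaker's
# training segments, and projections onto that basis serve as features.
import numpy as np
from sklearn.decomposition import FastICA

def segments_from_signal(signal, seg_len=64):
    n = len(signal) // seg_len
    return signal[:n * seg_len].reshape(n, seg_len)

def learn_basis(segments, n_components=16):
    ica = FastICA(n_components=n_components, random_state=0, max_iter=500)
    ica.fit(segments)
    return ica

def project(ica, segments):
    return ica.transform(segments)          # independent-component activations

rng = np.random.default_rng(0)
speech = np.sin(0.05 * np.arange(20000)) + 0.3 * rng.normal(size=20000)
segs = segments_from_signal(speech)
ica = learn_basis(segs)
features = project(ica, segs[:5])
print(features.shape)                       # (5, 16)
```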

PatentDOI
TL;DR: In this paper, a speaker identification system is provided that constructs speaker models using a discriminant analysis technique where the data in each class is modeled by Gaussian mixtures, and the likelihood scores of the second set of feature vectors are computed using speaker models trained using mixture discriminant analyses.
Abstract: A speaker identification system is provided that constructs speaker models using a discriminant analysis technique where the data in each class is modeled by Gaussian mixtures. The speaker identification method and apparatus determines the identity of a speaker, as one of a small group, based on a sentence-length password utterance. A speaker's utterance is received and a sequence of a first set of feature vectors are computed based on the received utterance. The first set of feature vectors are then transformed into a second set of feature vectors using transformations specific to a particular segmentation unit, and likelihood scores of the second set of feature vectors are computed using speaker models trained using mixture discriminant analysis. The likelihood scores are then combined to determine an utterance score and the speaker's identity is validated based on the utterance score. The speaker identification method and apparatus also includes training and enrollment phases. In the enrollment phase the speaker's password utterance is received multiple times. A transcription of the password utterance as a sequence of phones is obtained, and the phone string is stored in a database containing phone strings of other speakers in the group. In the training phase, the first set of feature vectors are extracted from each password utterance and the phone boundaries for each phone in the password transcription are obtained using a speaker independent phone recognizer. A mixture model is developed for each phone of a given speaker's password. Then, using the feature vectors from the password utterances of all of the speakers in the group, transformation parameters and transformed models are generated for each phone and speaker, using mixture discriminant analysis.

Proceedings ArticleDOI
01 Jan 2001
TL;DR: Both types of score normalization significantly improve performance, and can eliminate the performance loss that occurs when there is a mismatch between training and testing conditions.
Abstract: We investigate the effect of speech coding on automatic speaker recognition when training and testing conditions are matched and mismatched. Experiments used standard speech coding algorithms (GSM, G.729, G.723, MELP) and a speaker recognition system based on Gaussian mixture models adapted from a universal background model. There is little loss in recognition performance for toll quality speech coders and slightly more loss when lower quality speech coders are used. Speaker recognition from coded speech using handset-dependent score normalization and test score normalization are examined. Both types of score normalization significantly improve performance, and can eliminate the performance loss that occurs when there is a mismatch between training and testing conditions.
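
A minimal sketch of test score normalization (T-norm) as used above: the raw target score is normalized by the mean and standard deviation of the same segment's scores against a cohort of impostor models. The cohort and scores are illustrative; handset-dependent normalization works analogously with handset-specific statistics.

```python
# Sketch of test score normalization (T-norm): normalize a trial score by the
# mean and standard deviation of cohort (impostor-model) scores on the same
# test segment. Cohort values here are toy numbers for illustration.
import numpy as np

def tnorm(target_score, cohort_scores):
    cohort = np.asarray(cohort_scores, dtype=float)
    return (target_score - cohort.mean()) / cohort.std()

# Toy usage: the same raw target score looks much stronger once we know the
# impostor cohort scores low on this (possibly coded or mismatched) segment.
raw_target = -12.0
cohort = [-18.2, -17.5, -19.0, -16.8, -18.9]
print(round(tnorm(raw_target, cohort), 2))
```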

Book ChapterDOI
TL;DR: Experiments show that the new method provides significantly higher identification accuracy and can detect the correct speaker from shorter speech samples more reliably than the unweighted matching method.
Abstract: We consider the matching function in a vector quantization based speaker identification system. The model of a speaker is a codebook generated from the set of feature vectors extracted from the speaker's voice sample. Matching is performed by evaluating the similarity between the unknown speaker and the models in the database. In this paper, we propose a weighted matching method that takes into account the correlations between the known models in the database. Larger weights are assigned to vectors that have high discriminating power between the speakers, and vice versa. Experiments show that the new method provides significantly higher identification accuracy and can detect the correct speaker from shorter speech samples more reliably than the unweighted matching method.
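
A rough sketch of the weighted-matching idea: give each codeword a weight reflecting how well it discriminates the speaker from the other models in the database, then score an unknown sample by a weighted quantization distortion. The specific weighting below (distance to other speakers' codebooks) is an illustrative choice, not the paper's exact formula.

```python
# Sketch of weighted codebook matching: codewords that other speakers'
# codebooks could also explain get low weight; identification uses a
# weighted average of nearest-codeword distances.
import numpy as np

def nearest_dist(vectors, codebook):
    d = np.linalg.norm(vectors[:, None, :] - codebook[None, :, :], axis=2)
    return d.min(axis=1)

def codeword_weights(codebook, other_codebooks):
    # Distance from each codeword to the closest codeword of any other speaker.
    dists = np.min([nearest_dist(codebook, cb) for cb in other_codebooks], axis=0)
    return dists / dists.sum()

def weighted_distortion(frames, codebook, weights):
    d = np.linalg.norm(frames[:, None, :] - codebook[None, :, :], axis=2)
    idx = d.argmin(axis=1)
    nearest = d[np.arange(len(frames)), idx]
    w = weights[idx]                              # weight of the matched codeword
    return float((w * nearest).sum() / (w.sum() + 1e-12))

rng = np.random.default_rng(0)
codebooks = {f"spk{i}": rng.normal(i, 1.0, size=(16, 8)) for i in range(3)}
weights = {s: codeword_weights(cb, [c for t, c in codebooks.items() if t != s])
           for s, cb in codebooks.items()}
test = rng.normal(1, 1.0, size=(100, 8))          # generated like spk1
best = min(codebooks,
           key=lambda s: weighted_distortion(test, codebooks[s], weights[s]))
print(best)
```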

Journal ArticleDOI
01 Feb 2001
TL;DR: Good classification performance has been achieved using a probabilistic RAM (pRAM) neuron and no speech recognition stage was used in obtaining these results, so the performance relates only to identifying a speaker's voice and is therefore independent of the spoken phrase.
Abstract: Speaker identification may be employed as part of a security system requiring user authentication. In this case, the claimed identity of the user is known from a magnetic card and PIN number, for example, and an utterance is requested to confirm the identity of the user. A fast response is necessary in the confirmation phase and a fast registration process for new users is desirable. The time encoded signal processing and recognition (TESPAR) digital language is used to preprocess the speech signal. A speaker cannot be identified directly from the single TESPAR vector since there is a highly nonlinear relationship between the vector's components such that vectors are not linearly separable. Therefore the vector and its characteristics suggest that classification using a neural network will provide an effective solution. Good classification performance has been achieved using a probabilistic RAM (pRAM) neuron. Four probabilistic pRAM neural network architectures are presented. A performance of approximately 97% correct classifications has been obtained, which is similar to results obtained elsewhere (M. Sharma and R.J. Mammone, 1996), and slightly better than a MLP network. No speech recognition stage was used in obtaining these results, so the performance relates only to identifying a speaker's voice and is therefore independent of the spoken phrase. This has been achieved in a hardware-realizable system which may be incorporated into a smart-card or similar application.

01 Jan 2001
TL;DR: There are sufficient parallels between the two for speaker identification by earwitnesses to benefit greatly from a close study of the guidelines that have been proposed for the administration of line-ups in the visual domain, and the need to put alternative validation procedures in place is becoming more widely accepted.
Abstract: Although the development of state-of-the-art speaker recognition systems has shown considerable progress in the last decade, performance levels of these systems do not as yet seem to warrant large-scale introduction in anything other than relatively low-risk applications. Conditions typical of the forensic context such as differences in recording equipment and transmission channels, the presence of background noise and of variation due to differences in communicative context continue to pose a major challenge. Consequently, the impact of automatic speaker recognition technology on the forensic scene has been relatively modest and forensic speaker identification practice remains heavily dominated by the use of a wide variety of largely subjective procedures. While recent developments in the interpretation of the evidential value of forensic evidence clearly favour methods that make it possible for results to be expressed in terms of a likelihood ratio, unlike automatic procedures, traditional methods in the field of speaker identification do not generally meet this requirement. However, conclusions in the form of a binary yes/no-decision or a qualified statement of the probability of the hypothesis rather than the evidence are increasingly criticised for being logically flawed. Against this background, the need to put alternative validation procedures in place is becoming more widely accepted. Although speaker identification by earwitnesses differs in some important respects from the much more widely studied field of eyewitness identification, there are sufficient parallels between the two for speaker identification by earwitnesses to benefit greatly from a close study of the guidelines that have been proposed for the administration of line-ups in the visual domain. Some of the central notions are briefly discussed. Rapid technical developments in the world of telecommunications in which speech and data are increasingly transmitted through the same communication channels may soon blunt the efficacy of traditional telephone interception as an investigative and evidential tool. The gradual shift from analogue to digital recording media and the increasingly widespread availability of digital sound processing equipment as well as its ease of operation make certain types of manipulation of audio recordings comparatively easy to perform. If done competently, such manipulation may leave no traces and may therefore well be impossible to detect. Authorship attribution is another forensic area that has had a relatively chequered history. The rapid increase in the use of electronic writing media including e-mail, sms, and the use of ink jet printers at the expense of typewritten and to a lesser extent hand-written texts reduces the opportunities of authorship attribution by means of traditional document examination techniques and may create a greater demand for linguistic expertise in this area. A survey is provided of ongoing work in the area, based on reactions to a questionnaire sent out earlier this year.

Patent
30 Nov 2001
TL;DR: In this article, a method and apparatus for processing a continuous audio stream containing human speech in order to locate a particular speech-based transaction in the audio stream, applying both known speaker recognition and speech recognition techniques.
Abstract: Disclosed are a method and apparatus for processing a continuous audio stream containing human speech in order to locate a particular speech-based transaction in the audio stream, applying both known speaker recognition and speech recognition techniques. Only the utterances of a particular predetermined speaker are transcribed thus providing an index and a summary of the underlying dialogue(s). In a first scenario, an incoming audio stream, e.g. a speech call from outside, is scanned in order to detect audio segments of the predetermined speaker. These audio segments are then indexed and only the indexed segments are transcribed into spoken or written language. In a second scenario, two or more speakers located in one room are using a multi-user speech recognition system (SRS). For each user there exists a different speaker model and optionally a different dictionary or vocabulary of words already known or trained by the speech or voice recognition system.

Proceedings ArticleDOI
07 May 2001
TL;DR: This work investigates language mismatches between the target speaker and the world model in a GMM speaker verification system and shows major degradations for Mandarin and Vietnamese when the target speakers spoke American English.
Abstract: Applying speech technology in appliances available around the world cannot restrict the functionality to a certain language. However, most of today's text-independent verification systems based on Gaussian mixture models, GMMs, use an adaptive approach for training the speaker model. This assumes that the world model incorporates the same language as that of the target speaker. We investigate language mismatches between the target speaker and the world model in a GMM speaker verification system. Experiments performed with different world model languages showed major degradations, in particular for Mandarin and Vietnamese when the target speakers spoke American English. Experiments with world models trained on data pooled from different languages revealed only minor performance degradations.