
Showing papers on "Speaker diarisation published in 2004"


Journal ArticleDOI
TL;DR: An introduction proposes a modular scheme of the training and test phases of a speaker verification system, and the most commonly used speech parameterization in speaker verification, namely cepstral analysis, is detailed.
Abstract: This paper presents an overview of a state-of-the-art text-independent speaker verification system. First, an introduction proposes a modular scheme of the training and test phases of a speaker verification system. Then, the most commonly used speech parameterization in speaker verification, namely cepstral analysis, is detailed. Gaussian mixture modeling, which is the speaker modeling technique used in most systems, is then explained. A few speaker modeling alternatives, namely neural networks and support vector machines, are mentioned. Normalization of scores is then explained, as this is a very important step for dealing with real-world data. The evaluation of a speaker verification system is then detailed, and the detection error trade-off (DET) curve is explained. Several extensions of speaker verification are then enumerated, including speaker tracking and segmentation by speakers. Then, some applications of speaker verification are proposed, including on-site applications, remote applications, applications relative to structuring audio information, and games. Issues concerning the forensic area are then recalled, as we believe it is very important to inform people about the actual performance and limitations of speaker verification systems. This paper concludes by giving a few research trends in speaker verification for the next couple of years.
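
As a rough illustration of the DET-curve evaluation mentioned above (not taken from the paper), the sketch below sweeps a decision threshold over synthetic target and impostor scores and records miss and false-alarm probabilities; the score distributions are purely hypothetical stand-ins.

import numpy as np

def det_points(target_scores, impostor_scores):
    # Return (false-alarm rate, miss rate) pairs for every candidate threshold.
    thresholds = np.sort(np.concatenate([target_scores, impostor_scores]))
    points = []
    for t in thresholds:
        p_miss = np.mean(target_scores < t)    # true speakers rejected
        p_fa = np.mean(impostor_scores >= t)   # impostors accepted
        points.append((p_fa, p_miss))
    return np.array(points)

# Toy usage with synthetic scores; for plotting, both axes are usually warped
# with the inverse normal CDF (probit) to obtain the familiar straight DET lines.
rng = np.random.default_rng(0)
curve = det_points(rng.normal(1.0, 1.0, 500), rng.normal(-1.0, 1.0, 500))
eer_idx = np.argmin(np.abs(curve[:, 0] - curve[:, 1]))
print("approximate equal error rate:", curve[eer_idx].mean())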

874 citations


01 Jan 2004
TL;DR: This paper explores techniques that are specific to the SVM framework in order to derive fully non-linear channel compensations, resulting in a system that is less sensitive to specific kinds of labeled channel variations observed in training.
Abstract: One of the major remaining challenges to improving accuracy in state-of-the-art speaker recognition algorithms is reducing the impact of channel and handset variations on system performance. For Gaussian Mixture Model based speaker recognition systems, a variety of channel-adaptation techniques are known and available for adapting models between different channel conditions, but for the much more recent Support Vector Machine (SVM) based approaches to this problem, much less is known about the best way to handle this issue. In this paper we explore techniques that are specific to the SVM framework in order to derive fully non-linear channel compensations. The result is a system that is less sensitive to specific kinds of labeled channel variations observed in training.

156 citations


01 Jun 2004
TL;DR: The variants of the speaker detection task that have been included in the evaluations and the history of the best performance results for this task are documented here.
Abstract: NIST has coordinated annual evaluations of text-independent speaker recognition since 1996. During the course of this series of evaluations there have been notable milestones related to the development of the evaluation paradigm and the performance achievements of state-of-the-art systems. We document here the variants of the speaker detection task that have been included in the evaluations and the history of the best performance results for this task. Finally, we discuss the data collection and protocols for the 2004 evaluation and beyond.

133 citations


Proceedings ArticleDOI
20 Oct 2004
TL;DR: Experiments on 138 speakers in the YOHO database and two people who played the role of imitators have shown that an impostor can attack the system if that impostor knows a registered speaker in the database whose voice is very similar to the impostor's own.
Abstract: We consider mimicry, a simple, low-technology form of attack requiring little expertise, to investigate whether a speaker recognition system is vulnerable to mimicry by an impostor without the assistance of any other technologies. Experiments on 138 speakers in the YOHO database and two people who played the role of imitators have shown that an impostor can attack the system if that impostor knows a registered speaker in the database whose voice is very similar to the impostor's own.

133 citations


Patent
28 Apr 2004
TL;DR: In this article, a speech feature vector for a voice associated with a source of a text message was determined and compared to speaker models, and a speaker model was selected as a preferred match for the voice based on the comparison.
Abstract: A method of generating speech from text messages includes determining a speech feature vector for a voice associated with a source of a text message, and comparing the speech feature vector to speaker models. The method also includes selecting one of the speaker models as a preferred match for the voice based on the comparison, and generating speech from the text message based on the selected speaker model.
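
A minimal sketch of the kind of model selection the patent describes, under the assumption that each stored speaker model can be summarised by a single feature vector; the cosine-similarity choice below is illustrative, not the patent's actual comparison.

import numpy as np

def pick_speaker_model(voice_vector, model_vectors):
    # Return the index of the stored speaker model closest (cosine similarity)
    # to the voice associated with the text message source.
    v = voice_vector / (np.linalg.norm(voice_vector) + 1e-12)
    m = model_vectors / (np.linalg.norm(model_vectors, axis=1, keepdims=True) + 1e-12)
    return int(np.argmax(m @ v))

# Toy usage: three hypothetical speaker models, one query voice vector.
models = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
print(pick_speaker_model(np.array([0.6, 0.8]), models))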

118 citations


Book ChapterDOI
21 Jun 2004
TL;DR: In this paper, the authors present a corpus of audio-visual data called "AV16.3" along with a method for 3-D location annotation based on calibrated cameras; the desired location annotation can be either 2-dimensional (image plane) or 3-dimensional (physical space).
Abstract: Assessing the quality of a speaker localization or tracking algorithm on a few short examples is difficult, especially when the ground-truth is absent or not well defined. One step towards systematic performance evaluation of such algorithms is to provide time-continuous speaker location annotation over a series of real recordings, covering various test cases. Areas of interest include audio, video and audio-visual speaker localization and tracking. The desired location annotation can be either 2-dimensional (image plane) or 3-dimensional (physical space). This paper motivates and describes a corpus of audio-visual data called “AV16.3”, along with a method for 3-D location annotation based on calibrated cameras. “16.3” stands for 16 microphones and 3 cameras, recorded in a fully synchronized manner, in a meeting room. Part of this corpus has already been successfully used to report research results.
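
As an illustration of 3-D annotation from calibrated cameras, the sketch below recovers a 3-D point from its pixel coordinates in two cameras with known 3x4 projection matrices; this is standard linear (DLT) triangulation, not necessarily the exact procedure used for AV16.3.

import numpy as np

def triangulate(P1, P2, xy1, xy2):
    # Each image observation contributes two rows to a homogeneous system
    # whose least-squares solution (SVD null vector) is the 3-D point.
    A = np.vstack([
        xy1[0] * P1[2] - P1[0],
        xy1[1] * P1[2] - P1[1],
        xy2[0] * P2[2] - P2[0],
        xy2[1] * P2[2] - P2[1],
    ])
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]

# Toy usage with two hypothetical projection matrices.
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])
point = np.array([0.2, 0.1, 2.0, 1.0])
xy1 = (P1 @ point)[:2] / (P1 @ point)[2]
xy2 = (P2 @ point)[:2] / (P2 @ point)[2]
print(triangulate(P1, P2, xy1, xy2))   # recovers approximately [0.2, 0.1, 2.0]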

105 citations


Journal ArticleDOI
TL;DR: The majority of these technological changes, along with many other important techniques not noted here, have been directed toward increasing the robustness of recognition.
Abstract: Speech and speaker recognition technology has made very significant progress in the past 50 years. The progress can be summarized by the following changes: (1) from template matching to corpus-based statistical modeling, e.g., HMM and n-grams, (2) from filter bank/spectral resonance to cepstral features (cepstrum + Δcepstrum + ΔΔcepstrum), (3) from heuristic time-normalization to DTW/DP matching, (4) from "distance"-based to likelihood-based methods, (5) from maximum likelihood to discriminative approaches, e.g., MCE/GPD and MMI, (6) from isolated-word to continuous speech recognition, (7) from small-vocabulary to large-vocabulary recognition, (8) from context-independent units to context-dependent units for recognition, (9) from clean speech to noisy/telephone speech recognition, (10) from single-speaker to speaker-independent/adaptive recognition, (11) from monologue to dialogue/conversation recognition, (12) from read speech to spontaneous speech recognition, (13) from recognition to understanding, (14) from single-modality (audio signal only) to multi-modal (audio/visual) speech recognition, (15) from hardware recognizers to software recognizers, and (16) from no commercial applications to many practical commercial applications. Most of these advances have taken place in both the fields of speech recognition and speaker recognition. The majority of these technological changes, along with many other important techniques not noted above, have been directed toward increasing the robustness of recognition.

101 citations


01 Jan 2004
TL;DR: The ICSI-SRI system is an agglomerative clustering system that uses a BIC-like measure to determine when to stop merging clusters and to decide which pairs of clusters to merge, providing robustness and portability.
Abstract: We describe the ICSI-SRI entry in the Fall 2004 DARPA EARS Metadata Evaluation. The current system was derived from ICSI’s Fall 2003 Speaker-attributed STT system. Our system is an agglomerative clustering system that uses a BIC-like measure to determine when to stop merging clusters and to decide which pairs of clusters to merge. The main advantage of this approach is that it does not require pre-trained acoustic models, providing robustness and portability. Changes for this year’s system include: different front-end features, the addition of SRI’s Broadcast News speech/non-speech detector, and modifications to the segmentation routine. In post-evaluation work, we found further improvement by changing the stopping criterion from the BIC-like measure to a Viterbi measure. Additionally, we have explored issues related to pruning and improved initialization.
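
To make the merge decision concrete, here is a generic, textbook-style BIC merge test between two clusters, each modelled by a single full-covariance Gaussian. It is not the ICSI-SRI code (whose BIC-like measure is tuned so that no penalty threshold is needed), and the penalty weight below is a free parameter.

import numpy as np

def gauss_loglik(x):
    # Log-likelihood of data x under its own maximum-likelihood
    # full-covariance Gaussian.
    n, d = x.shape
    cov = np.cov(x, rowvar=False) + 1e-6 * np.eye(d)
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * n * (logdet + d * np.log(2 * np.pi) + d)

def delta_bic(x, y, penalty_weight=1.0):
    # Positive values favour keeping the two clusters separate; merge otherwise.
    n, d = x.shape[0] + y.shape[0], x.shape[1]
    gain = gauss_loglik(x) + gauss_loglik(y) - gauss_loglik(np.vstack([x, y]))
    n_extra_params = d + d * (d + 1) / 2       # one additional Gaussian
    return gain - 0.5 * penalty_weight * n_extra_params * np.log(n)

# Toy usage: two clusters drawn from clearly different distributions.
rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, size=(200, 13))
b = rng.normal(3.0, 1.0, size=(200, 13))
print(delta_bic(a, b) > 0)   # expected: True (do not merge)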

91 citations


01 Nov 2004
TL;DR: The improved LIMSI speaker diarization system used in the RT-04F evaluation reduces the speaker error time by over 75% on the development data, compared to the best configuration baseline system for this task.
Abstract: This paper describes the LIMSI speaker diarization system used in the RT-04F evaluation. The RT-04F system builds upon the LIMSI baseline data partitioner, which is used in the broadcast news transcription system. This partitioner provides a high cluster purity but has a tendency to split the data from a speaker into several clusters when there is a large quantity of data for the speaker. In the RT-03S evaluation the baseline partitioner had a 24.5% diarization error rate. Several improvements to the baseline diarization system have been made. A standard Bayesian information criterion (BIC) agglomerative clustering has been integrated, replacing the iterative Gaussian mixture model (GMM) clustering; a local BIC criterion is used for comparing single Gaussians with full covariance matrices. A second clustering stage has been added, making use of a speaker identification method: maximum a posteriori adaptation of a reference GMM with 128 Gaussians. A final post-processing stage refines the segment boundaries using the output of the transcription system. Compared to the best configuration baseline system for this task, the improved system reduces the speaker error time by over 75% on the development data. On evaluation data, an 8.5% overall diarization error rate was obtained, a 60% reduction in error compared to the baseline.

85 citations


Proceedings ArticleDOI
04 Oct 2004
TL;DR: Automatic speaker segmentation and clustering for natural, multi-speaker meeting conversations is addressed; the IHM system described requires no prior training and executes in one-fifth real time on modern architectures.
Abstract: This paper describes the issue of automatic speaker segmentation and clustering for natural, multi-speaker meeting conversations. Two systems were developed and evaluated in the NIST RT-04S Meeting Recognition Evaluation, the Multiple Distant Microphone (MDM) system and the Individual Headset Microphone (IHM) system. The MDM system achieved a speaker diarization performance of 28.17%. This system also aims to provide automatic speech segments and speaker grouping information for speech recognition, a necessary prerequisite for subsequent audio processing. A 44.5% word error rate was achieved for speech recognition. The IHM system is based on the short-time crosscorrelation of all personal channel pairs. It requires no prior training and executes in one fifth real time on modern architectures. A 35.7% word error rate was achieved for speech recognition when segmentation was provided by this system.
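
A rough, illustrative per-frame decision in the spirit of the IHM description (not the evaluated system): compare channel energies and use the zero-lag normalized cross-correlation with the loudest channel to discard channels that look like cross-talk; the thresholds and frame size below are arbitrary placeholders.

import numpy as np

def active_channels(frame, energy_thresh=1e-3, corr_thresh=0.5):
    # frame: array of shape (n_channels, frame_len) holding one analysis frame
    # from each personal headset microphone. Returns the channels judged to
    # carry local speech in this frame.
    energy = np.sum(frame ** 2, axis=1)
    loudest = int(np.argmax(energy))
    active = []
    for ch in range(frame.shape[0]):
        if energy[ch] < energy_thresh:
            continue                       # too quiet: silence on this headset
        if ch == loudest:
            active.append(ch)
            continue
        # High zero-lag correlation with the loudest channel suggests the
        # energy on this channel is cross-talk rather than its wearer speaking.
        ncc = frame[ch] @ frame[loudest] / (np.sqrt(energy[ch] * energy[loudest]) + 1e-12)
        if abs(ncc) < corr_thresh:
            active.append(ch)
    return active

# Toy usage: channel 0 speaks, channel 1 picks up attenuated cross-talk.
rng = np.random.default_rng(0)
speech = rng.normal(size=400)
frame = np.vstack([speech, 0.3 * speech + 0.01 * rng.normal(size=400)])
print(active_channels(frame))   # expected: [0]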

81 citations


Proceedings ArticleDOI
17 May 2004
TL;DR: A new kernel based upon standard log likelihood ratio scoring to address limitations of text classification methods is derived and it is shown that the methods achieve significant gains over standard methods for processing high-level features.
Abstract: Recently, high-level features such as word idiolect, pronunciation, phone usage, prosody, etc., have been successfully used in speaker verification. The benefit of these features was demonstrated in the NIST extended data task for speaker verification; with enough conversational data, a recognition system can become "familiar" with a speaker and achieve excellent accuracy. Typically, high-level-feature recognition systems produce a sequence of symbols from the acoustic signal and then perform recognition using the frequency and co-occurrence of symbols. We propose the use of support vector machines for performing the speaker verification task from these symbol frequencies. Support vector machines have been applied to text classification problems with much success. A potential difficulty in applying these methods is that standard text classification methods tend to "smooth" frequencies which could potentially degrade speaker verification. We derive a new kernel based upon standard log likelihood ratio scoring to address limitations of text classification methods. We show that our methods achieve significant gains over standard methods for processing high-level features.
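
A small sketch of the general idea of scoring symbol frequencies with an SVM using a likelihood-ratio-motivated weighting; the 1/sqrt(background frequency) scaling below is one common choice, and the data, sizes and scikit-learn classifier are stand-ins rather than the paper's setup.

import numpy as np
from sklearn.svm import LinearSVC

def llr_weighted_features(counts, background_probs):
    # counts: (n_conversations, n_symbols) symbol counts per conversation.
    # Relative frequencies are scaled by 1/sqrt(background probability), so a
    # linear kernel approximates a log-likelihood-ratio style comparison.
    probs = counts / np.maximum(counts.sum(axis=1, keepdims=True), 1)
    return probs / np.sqrt(background_probs + 1e-12)

# Toy usage with synthetic symbol streams for target and impostor conversations.
rng = np.random.default_rng(1)
target = rng.poisson(lam=np.linspace(1, 5, 50), size=(20, 50))
impostor = rng.poisson(lam=np.linspace(5, 1, 50), size=(20, 50))
counts = np.vstack([target, impostor])
labels = np.repeat([1, 0], 20)
background = counts.sum(axis=0) / counts.sum()
X = llr_weighted_features(counts, background)
clf = LinearSVC(C=1.0, max_iter=5000).fit(X, labels)
print("training accuracy:", clf.score(X, labels))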

Proceedings ArticleDOI
17 May 2004
TL;DR: It is shown how a joint factor analysis of inter-speaker and intra-speakers variability in a training database which contains multiple recordings for each speaker can be used to construct likelihood ratio statistics for speaker verification which take account of intra-Speaker variation and channel variation in a principled way.
Abstract: We show how a joint factor analysis of inter-speaker and intra-speaker variability in a training database which contains multiple recordings for each speaker can be used to construct likelihood ratio statistics for speaker verification which take account of intra-speaker variation and channel variation in a principled way. We report the results of experiments on the NIST 2001 cellular one speaker detection task carried out by applying this type of factor analysis to Switchboard Cellular Part I. The evaluation data for this task is contained in Switchboard Cellular Part I so these results cannot be taken at face value but they indicate that the factor analysis model can perform extremely well if it is perfectly estimated.
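
A toy illustration of the generative decomposition that joint factor analysis rests on, often written s = m + Vy + Ux: an observed speaker- and channel-dependent supervector is the UBM supervector plus a speaker-factor term and a channel-factor term. The sizes below are arbitrary, and the estimation procedure (the hard part of the paper) is not shown.

import numpy as np

rng = np.random.default_rng(2)
sv_dim, n_speaker_factors, n_channel_factors = 1024, 10, 5   # toy dimensions

m = rng.normal(size=sv_dim)                       # UBM mean supervector
V = rng.normal(size=(sv_dim, n_speaker_factors))  # speaker (eigenvoice) loadings
U = rng.normal(size=(sv_dim, n_channel_factors))  # channel (eigenchannel) loadings

y = rng.normal(size=n_speaker_factors)   # speaker factors: fixed per speaker
x1 = rng.normal(size=n_channel_factors)  # channel factors: vary per recording
x2 = rng.normal(size=n_channel_factors)

recording_1 = m + V @ y + U @ x1   # same speaker, two different channels
recording_2 = m + V @ y + U @ x2
print(np.linalg.norm(recording_1 - recording_2))  # differs only via the channel term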

Patent
24 Jun 2004
TL;DR: In this paper, a system and method enrolls a speaker with an enrollment utterance and authenticates a user with a biometric analysis of an authentication utterance, without the need for a PIN (Personal Identification Number).
Abstract: A system and method enrolls a speaker with an enrollment utterance and authenticates a user with a biometric analysis of an authentication utterance, without the need for a PIN (Personal Identification Number). During authentication, the system uses the same authentication utterance to identify who a speaker claims to be with speaker recognition, and to verify whether the speaker is actually the claimed person. Thus, it is not necessary for the speaker to identify biometric data using a PIN. The biometric analysis includes a neural tree network to determine unique aspects of the authentication utterance for comparison to the enrollment utterance. The biometric analysis also leverages a statistical analysis using Hidden Markov Models before authorizing the speaker.


01 Nov 2004
TL;DR: This paper describes the systems developed by MITLL and used in DARPA EARS Rich Transcription Fall 2004 (RT-04F) speaker diarization evaluation and presents experiments analyzing performance of the systems and a cross-cluster recombination approach that significantly improves performance.
Abstract: Audio diarization is the process of annotating an input audio channel with information that attributes (possibly overlapping) temporal regions of signal energy to their specific sources. These sources can include particular speakers, music, background noise sources, and other signal source/channel characteristics. Diarization has utility in making automatic transcripts more readable and in searching and indexing audio archives. In this paper we describe the systems developed by MITLL and used in DARPA EARS Rich Transcription Fall 2004 (RT-04F) speaker diarization evaluation. The primary system is based on a new proxy speaker model approach and the secondary system follows a more standard BIC based clustering approach. We present experiments analyzing performance of the systems and present a cross-cluster recombination approach that significantly improves performance. In addition, we also present results applying our system to a telephone speech, summed channel speaker detection task.

Patent
10 Jun 2004
TL;DR: In this article, a graphical interface is presented that allows a user to select a number of listeners, a number of speakers, and a listening space, and that graphically displays on a display device the one or more listeners and one or more speakers within the listening space.
Abstract: A system for determining speaker spatialization parameters. The system includes a graphical interface allowing a user to select a number of listeners, a number of speakers, and a listening space, and for graphically displaying on a display device the one or more listeners and one or more speakers within the listening space. The system further includes a speaker spatialization module for determining one or more speaker spatialization parameters based upon the relative positions on the display device of the one or more listeners and the one or more speakers within the listening space. The user may reposition the speakers, and for each speaker that is repositioned the speaker spatialization module recalculates the speaker spatialization parameters for that speaker. The user may also graphically reposition one of the listeners, and the speaker spatialization module then recalculates the speaker spatialization parameters for all of the speakers.

Proceedings ArticleDOI
17 May 2004
TL;DR: A new approach toward automatic annotation of meetings in terms of speaker identities and their locations is presented: the audio recordings are segmented using two independent sources of information, magnitude spectrum analysis and sound source localization, combined in an appropriate HMM framework.
Abstract: The paper presents a new approach toward automatic annotation of meetings in terms of speaker identities and their locations. This is achieved by segmenting the audio recordings using two independent sources of information: magnitude spectrum analysis and sound source localization. We combine the two in an appropriate HMM framework. There are three main advantages of this approach. First, it is completely unsupervised, i.e. speaker identities and number of speakers and locations are automatically inferred. Second, it is threshold-free, i.e. the decisions are made without the need of a threshold value which generally requires an additional development dataset. The third advantage is that the joint segmentation improves over the speaker segmentation derived using only acoustic features. Experiments on a series of meetings recorded in the IDIAP smart meeting room demonstrate the effectiveness of this approach.

Proceedings ArticleDOI
17 May 2004
TL;DR: The paper presents the ELISA consortium activities in automatic speaker segmentation, also known as speaker diarization, during the NIST Rich Transcription (RT) 2003 evaluation; two different approaches from the CLIPS and LIA laboratories are described and different possibilities of combining them are investigated.
Abstract: The paper presents the ELISA consortium activities in automatic speaker segmentation, also known as speaker diarization, during the NIST Rich Transcription (RT) 2003 evaluation. The experiments were conducted on real broadcast news data (HUB4). Two different approaches from the CLIPS and LIA laboratories are presented and different possibilities of combining them are investigated, in the framework of the ELISA consortium. The system submitted as the ELISA primary system obtained the second-lowest segmentation error rate compared to the other RT03-participant primary systems. Another ELISA system submitted as a secondary system outperformed the best primary system and obtained the lowest speaker segmentation error rate.

Patent
28 Sep 2004
TL;DR: In this paper, a system and method for synthesizing audio-visual content in a video image processor is presented, in which audio features and video features are extracted from audio-visual input signals representing a speaker who is speaking and used to create a computer-generated animated version of the face of the speaker.
Abstract: A system and method is provided for synthesizing audio-visual content in a video image processor. A content synthesis application processor extracts audio features and video features from audio-visual input signals that represent a speaker who is speaking. The processor uses the extracted visual features to create a computer generated animated version of the face of the speaker. The processor synchronizes facial movements of the animated version of the face of the speaker with a plurality of audio logical units such as phonemes that represent the speaker's speech. In this manner the processor synthesizes an audio-visual representation of the speaker's face that is properly synchronized with the speaker's speech.

Proceedings ArticleDOI
04 Oct 2004
TL;DR: This work investigates the use of the linguistic information present in the audio signal to structure broadcast news data, and in particular to associate speaker identities with audio segments, by developing patterns which can be used to identify the current, previous or next speaker.

Patent
08 Sep 2004
TL;DR: In this article, an audio device and an audio processing method are provided for adjusting the position of a virtual speaker; the device comprises a decoder which separates, from the provided audio data, an audio component for a center speaker and a plurality of audio components corresponding to other speakers disposed with the center speaker interposed between them, a center delay processor for delaying the center-speaker audio component received from the decoder, and a downmixing processor for distributing the delayed center-speaker audio component between the other speakers and merging it with the original audio component of each other speaker.
Abstract: An audio device and an audio processing method are provided for adjusting the position of a virtual speaker. The audio device comprises a decoder which has audio data provided thereto, the audio data including an audio component for a center speaker and a plurality of audio components corresponding to other speakers disposed with the center speaker interposed therewith, and which decodes these audio components to separate them from the audio data, a center delay processor for delaying the audio component for the center speaker received from the decoder, and a downmixing processor for distributing the delayed center speaker audio component between the other speakers and for merging the audio component distributed to each of the other speakers and the original audio component for each other speaker. Audio sounds corresponding to the downmixed audio components are produced from the other speakers.
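
A highly simplified sketch of the downmix described above (illustrative only, not the patent's processing chain): delay the centre-channel component, then distribute it equally to the left and right channels with a fixed gain before merging with their original components.

import numpy as np

def downmix_center(left, right, center, delay_samples=8, gain=0.7071):
    # Delay the centre channel, then fold it into the left and right channels.
    delayed = np.concatenate([np.zeros(delay_samples), center])[:len(center)]
    return left + gain * delayed, right + gain * delayed

# Toy usage with short synthetic signals at 48 kHz.
t = np.arange(0, 480) / 48000.0
left, right = np.sin(2 * np.pi * 440 * t), np.sin(2 * np.pi * 554 * t)
center = np.sin(2 * np.pi * 330 * t)
new_left, new_right = downmix_center(left, right, center)
print(new_left.shape, new_right.shape)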

Proceedings ArticleDOI
17 May 2004
TL;DR: This work proposes a voice conversion method that does not require a parallel corpus for training, and shows that adaptation reduces the error obtained when simply applying the conversion parameters of one pair of speakers to another by a factor that can reach 30% in many cases.
Abstract: The objective of voice conversion methods is to modify the speech characteristics of a particular speaker in such a manner as to sound like speech by a different target speaker. Current voice conversion algorithms are based on deriving a conversion function by estimating its parameters through a corpus that contains the same utterances spoken by both speakers. Such a corpus, usually referred to as a parallel corpus, has the disadvantage that it is often difficult or even impossible to collect. Here, we propose a voice conversion method that does not require a parallel corpus for training, i.e. the spoken utterances by the two speakers need not be the same: speaker adaptation techniques are employed to adapt the conversion parameters derived from a different pair of speakers to a particular pair of source and target speakers. We show that adaptation reduces the error obtained when simply applying the conversion parameters of one pair of speakers to another by a factor that can reach 30% in many cases, with performance comparable to the ideal case when a parallel corpus is available.

DOI
01 Jan 2004
TL;DR: A novel distance metric is proposed in this thesis for the purpose of finding speaker segment boundaries (speaker change detection) and it is shown that the proposed criterion outperforms the use of LLR, when LLR is used with an optimal threshold value.
Abstract: Audio segmentation, in general, is the task of segmenting a continuous audio stream in terms of acoustically homogeneous regions, where the rule of homogeneity depends on the task. This thesis aims at developing and investigating efficient, robust and unsupervised techniques for three important tasks related to audio segmentation, namely speech/music segmentation, speaker change detection and speaker clustering. The speech/music segmentation technique proposed in this thesis is based on the functioning of an HMM/ANN hybrid ASR system where an MLP estimates the posterior probabilities of different phonemes. These probabilities exhibit a particular pattern when the input is a speech signal. This pattern is captured in the form of feature vectors, which are then integrated in an HMM framework. The technique thus segments the audio data in terms of recognizable and non-recognizable segments. The efficiency of the proposed technique is demonstrated by a number of experiments conducted on broadcast news data exhibiting real-life scenarios (different speech and music styles, overlapping speech and music, non-speech sounds other than music, etc.). A novel distance metric is proposed in this thesis for the purpose of finding speaker segment boundaries (speaker change detection). The proposed metric can be seen as a special case of the Log Likelihood Ratio (LLR) or the Bayesian Information Criterion (BIC), where the number of parameters in the two models (or hypotheses) is forced to be equal. However, the advantage of the proposed metric over LLR, BIC and other metric-based approaches is that it achieves comparable performance without requiring an adjustable threshold/penalty term, hence also eliminating the need for a development dataset. Speaker clustering is the task of unsupervised classification of the audio data in terms of speakers. For this purpose, a novel HMM-based agglomerative clustering algorithm is proposed where, starting from a large number of clusters, the closest clusters are merged in an iterative process. A novel merging criterion is proposed for this purpose, which does not require an adjustable threshold value, and hence the stopping criterion is also automatically met when there are no more clusters left for merging. The efficiency of the proposed algorithm is demonstrated with various experiments on broadcast news data, and it is shown that the proposed criterion outperforms the use of LLR, when LLR is used with an optimal threshold value. These tasks obviously play an important role in the pre-processing stages of ASR. For example, correctly identifying non-recognizable segments in the audio stream and excluding them from recognition saves computation time in ASR and results in more meaningful transcriptions. Moreover, researchers have clearly shown the positive impact of further clustering of identified speech segments in terms of speakers (speaker clustering) on the transcription accuracy. However, we note that this processing has various other interesting and practical applications. For example, it provides characteristic information about the data (metadata), which is useful for the indexing of audio documents. One such application is investigated in this thesis, which extracts this metadata and combines it with the ASR output, resulting in Rich Transcription (RT) which is much easier to understand for an end-user.
In a further application, speaker clustering was combined with precise location information available in scenarios like smart meeting rooms to segment the meeting recordings jointly in terms of speakers and their locations in a meeting room. This is useful for automatic meeting summarization as it enables answering questions like "who is speaking and where". This could be used to access, for example, a specific presentation made by a particular speaker or all the speech segments belonging to a particular speaker.
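
A sketch of the parameter-matched merge criterion described above, under simplifying assumptions (diagonal-covariance GMMs fitted with scikit-learn rather than the thesis' HMM framework): the pooled data is modelled with as many Gaussians as the two cluster models combined, so the BIC penalty terms cancel and the comparison needs no tunable threshold.

import numpy as np
from sklearn.mixture import GaussianMixture

def merge_gain(x, y, n_comp=2):
    # Fit a GMM to each cluster and a GMM with the combined number of
    # components to the pooled data; with equal total parameters the
    # likelihoods can be compared directly.
    gx = GaussianMixture(n_comp, covariance_type='diag', random_state=0).fit(x)
    gy = GaussianMixture(n_comp, covariance_type='diag', random_state=0).fit(y)
    pooled = np.vstack([x, y])
    gxy = GaussianMixture(2 * n_comp, covariance_type='diag', random_state=0).fit(pooled)
    ll_separate = gx.score(x) * len(x) + gy.score(y) * len(y)
    ll_merged = gxy.score(pooled) * len(pooled)
    return ll_merged - ll_separate   # larger values favour merging the clusters

# Toy usage: same-distribution clusters versus clearly different clusters.
rng = np.random.default_rng(0)
a, b = rng.normal(0, 1, (300, 12)), rng.normal(0, 1, (300, 12))
c = rng.normal(4, 1, (300, 12))
print(merge_gain(a, b) > merge_gain(a, c))   # expected: True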

01 Jan 2004
TL;DR: This paper describes two systems for audio segmentation developed at CUED and MIT-LL and evaluates their performance using the speaker diarisation score defined in the 2003 Rich Transcription Evaluation.
Abstract: It is often important to be able to automatically label ‘who spoke when’ during some audio data. This paper describes two systems for audio segmentation developed at CUED and MIT-LL and evaluates their performance using the speaker diarisation score defined in the 2003 Rich Transcription Evaluation. A new clustering procedure and BIC-based stopping criterion for the CUED system is introduced which improves both performance and robustness to changes in segmentation. Finally a hybrid ‘Plug and Play’ system is built which combines different parts of the CUED and MIT-LL systems to produce a single system which outperforms both the individual systems.

Patent
27 Dec 2004
TL;DR: In this paper, the authors proposed a speaker clustering and speaker adaptation method using average model variation information over speakers while analyzing the quantity variation amount and the directional variation amount, which can be applied to any speaker adaptation algorithm of MLLR and MAP.
Abstract: A speech recognition method and apparatus perform speaker clustering and speaker adaptation using average model variation information over speakers while analyzing the quantity variation amount and the directional variation amount. In the speaker clustering method, a speaker group model variation is generated based on the model variation between a speaker-independent model and a training speaker ML model. In the speaker adaptation method, the model in which the model variation between a test speaker ML model and a speaker group ML model to which the test speaker belongs which is most similar to a training speaker group model variation is found, and speaker adaptation is performed on the found model. Herein, the model variation in the speaker clustering and the speaker adaptation are calculated while analyzing both the quantity variation amount and the directional variation amount. The present invention may be applied to any speaker adaptation algorithm of MLLR and MAP.

Proceedings ArticleDOI
17 May 2004
TL;DR: This paper presents a new text-independent speaker recognition method that combines a speaker-specific Gaussian mixture model (GMM) with a syllable-based HMM adapted by MLLR or MAP, and shows that the strong result stems from the complementary effect between the speaker-specific GMM and the speaker-adapted syllable-based HMM.
Abstract: We previously presented a new text-independent speaker recognition method combining a speaker-specific Gaussian mixture model (GMM) with a syllable-based HMM adapted by MLLR or MAP (S. Nakagawa et al., Proc. Eurospeech, p.3017-3020, 2003). The robustness of this speaker recognition method to speaking-style changes is evaluated in this paper. A speaker identification experiment was conducted using an NTT database, which consists of sentences uttered at three speed modes (normal, fast and slow) by 35 Japanese speakers (22 males and 13 females) over five sessions spanning ten months. Each speaker uttered only 5 training utterances (about 20 seconds in total). We obtained an accuracy of 98.8% for text-independent speaker identification across the three speaking-style modes (normal, fast, slow) using a short test utterance (about 4 seconds). This result is superior to conventional methods on the same database. We show that this result stems from the complementary effect between the speaker-specific GMM and the speaker-adapted syllable-based HMM.

Proceedings ArticleDOI
R. Turetsky1, Nevenka Dimitrova1
27 Jun 2004
TL;DR: This work investigates the use of screenplay as a source of information for speaker/character identification and finds that the screenplay alignment is able to identify the speaker correctly in 30% of lines of dialogue on average, but with additional automatic statistical labeling for audio speaker ID on the soundtrack, the recognition rate improves significantly.
Abstract: Existing methods for audiovisual and text analysis of videos perform "blind" recovery of metadata from the audiovisual signal. The film production process however, is based on the original screenplay and its versions. Using this information is like using the recipe book for the movie. High-level semantic information that is otherwise very difficult to derive from the audiovisual content can be extracted automatically by enhancing feature extraction with screenplay processing and analysis. As a test-bed of our approach, we investigated the use of screenplay as a source of information for speaker/character identification. Our speaker identification method consists of screenplay parsing, extraction of time-stamped transcript, alignment of the screenplay with the time-stamped transcript, audio segmentation and audio speaker identification. As the screenplay alignment cannot identify all dialogue sections within any film, we use the segments found by alignment as labels to train a statistical model in order to identify unaligned pieces of dialogue. We find that the screenplay alignment is able to identify the speaker correctly in 30% of lines of dialogue on average. However, with additional automatic statistical labeling for audio speaker ID on the soundtrack, our recognition rate improves significantly.

Patent
Ross Cutler1
30 Apr 2004
TL;DR: In this paper, a system and process for highlighting the current speaker on an on-going basis in each frame of a low frame-rate video of an event having multiple people in attendance, such as a video teleconference, is presented.
Abstract: A system and process for highlighting the current speaker on an on-going basis in each frame of a low frame-rate video of an event having multiple people in attendance, such as a video teleconference, is presented. In general, this is accomplished by periodically identifying an attendee that is currently speaking at a rate substantially faster than the video frame rate, and for each frame of the video updating the frame to highlight the attendee currently speaking. More particularly, an A/V source provides video and audio data streams to the client computing device, with current speaker data embedded into the audio stream via audio watermarking techniques. The client device extracts the current speaker data from the audio stream, and then renders and displays the video while using the current speaker data to periodically update the frame being displayed to highlight the current speaker.

Proceedings ArticleDOI
15 Dec 2004
TL;DR: The hypothesized speaker model is derived by adapting the parameters of the UBM using the speaker's training speech and a form of Bayesian adaptation, and the UBM technique is incorporated into the GMM speaker identification system to significantly reduce the time required for recognition.
Abstract: In this paper, we describe a Gaussian mixture model-universal background model (GMM-UBM) speaker identification system. In this GMM-UBM system, we derive the hypothesized speaker model by adapting the parameters of the UBM using the speaker's training speech and a form of Bayesian adaptation. The UBM technique is incorporated into the GMM speaker identification system to significantly reduce the time required for recognition. The paper also presents a new frame-level likelihood score normalization for adjusting the scores of different speaker models to obtain more robust scores in the final decision. Experiments on the 2000 NIST speaker recognition evaluation corpus show that GMM-UBM and frame-level likelihood score normalization yield better performance. Compared to the baseline system, around a 31.2% relative error reduction is obtained from the combination of both techniques.
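
A minimal sketch of mean-only relevance-MAP adaptation of a UBM in the spirit of the GMM-UBM description above, under stated assumptions: diagonal covariances, scikit-learn's GaussianMixture as the UBM, and synthetic features standing in for real speech frames.

import numpy as np
from sklearn.mixture import GaussianMixture

def map_adapt_means(ubm, features, relevance=16.0):
    # Move each UBM mean toward the speaker's data in proportion to the
    # amount of data (soft counts) that component is responsible for.
    post = ubm.predict_proba(features)              # (n_frames, n_components)
    n_k = post.sum(axis=0)                          # soft counts per component
    e_x = (post.T @ features) / np.maximum(n_k[:, None], 1e-10)
    alpha = (n_k / (n_k + relevance))[:, None]      # data-dependent weights
    return alpha * e_x + (1.0 - alpha) * ubm.means_

# Toy usage with synthetic "frames" standing in for cepstral features.
rng = np.random.default_rng(3)
ubm = GaussianMixture(8, covariance_type='diag', random_state=0)
ubm.fit(rng.normal(size=(2000, 12)))                         # background data
speaker_means = map_adapt_means(ubm, rng.normal(0.3, 1.0, (300, 12)))
# A verification or identification score for a test segment would then compare
# the average log-likelihood under the adapted model against that under the UBM.
print(speaker_means.shape)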

PatentDOI
TL;DR: In this article, a matrix of optimum classifiers for the detection of SID usable and SID unusable speech segments is presented, and a decision tree based on fixed thresholds indicates the presence of a speech feature in a given speech segment.
Abstract: A method for improving speaker identification by determining usable speech. Degraded speech is preprocessed in a speaker identification (SID) process to produce SID-usable and SID-unusable segments. Features are extracted and analyzed so as to produce a matrix of optimum classifiers for the detection of SID-usable and SID-unusable speech segments. Optimum classifiers possess a minimum distance from a speaker model. A decision tree based upon fixed thresholds indicates the presence of a speech feature in a given speech segment. Following preprocessing, degraded speech is measured in one or more time, frequency, cepstral or SID usable/unusable domains. The results of the measurements are multiplied by a weighting factor whose value is proportional to the reliability of the corresponding time, frequency, or cepstral measurements performed. The measurements are fused as information, and usable speech segments are extracted for further processing. Such further processing of co-channel speech may include speaker identification, where a segment-by-segment decision is made on each usable speech segment to determine whether it corresponds to speaker #1 or speaker #2. Further processing of co-channel speech may also include constructing the complete utterance of speaker #1 or speaker #2. Speech features such as pitch and formants may be extended back into the unusable segments to form a complete utterance from each speaker.