
Showing papers on "Speaker recognition" published in 1998



Proceedings Article
01 Jan 1998
TL;DR: This paper proposes statistical tests for the existence of sheep, goats, lambs and wolves and applies these tests to hunt for such animals using results from the 1998 NIST speaker recognition evaluation.
Abstract: Performance variability in speech and speaker recognition systems can be attributed to many factors. One major factor, which is often acknowledged but seldom analyzed, is inherent differences in the recognizability of different speakers. In speaker recognition systems such differences are characterized by the use of animal names for different types of speakers, including sheep, goats, lambs and wolves, depending on their behavior with respect to automatic recognition systems. In this paper we propose statistical tests for the existence of these animals and apply these tests to hunt for such animals using results from the 1998 NIST speaker recognition evaluation.

444 citations
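
As a rough illustration of the idea, the sketch below tests whether per-speaker error rates differ more than chance would allow, using a Kruskal-Wallis test over simulated trial outcomes. This is a hypothetical reconstruction, not necessarily the statistical test used in the paper; all names and data are illustrative.

```python
# Hypothetical sketch: do "goats" (speakers who are unusually hard to
# recognize) exist, i.e. do per-speaker miss rates differ significantly?
# Not the paper's exact procedure; data and threshold are illustrative.
import numpy as np
from scipy.stats import kruskal

rng = np.random.default_rng(0)

# trials[s] = 0/1 outcomes (1 = miss) on genuine trials of speaker s.
trials = {s: rng.binomial(1, p, size=50)
          for s, p in [("spk1", 0.05), ("spk2", 0.06), ("spk3", 0.30)]}

# Kruskal-Wallis: are the per-speaker outcome distributions the same?
stat, pval = kruskal(*trials.values())
print(f"H = {stat:.1f}, p = {pval:.2g}:",
      "goats likely exist" if pval < 0.01 else "no evidence of goats")
```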


Patent
21 Dec 1998
TL;DR: A programmable automatic call and data transfer processing system which automatically processes incoming telephone calls, facsimiles and e-mails based on the identity of the caller or author, the subject matter of the message or request, and/or the time of day is presented in this article.
Abstract: A programmable automatic call and data transfer processing system which automatically processes incoming telephone calls, facsimiles and e-mails based on the identity of the caller or author, the subject matter of the message or request, and/or the time of day, which includes: a central server for automatically answering an incoming call and collecting voice data of a caller; a speaker recognition module connected to the server for identifying the caller or author; a switching module responsive to the speaker recognition module for processing the call or message in accordance with a pre-programmed procedure based on the identification of the caller or author; and a programming interface for programming the server, speaker recognizer module and the switching module. The system is programmed by the user so as to process incoming telephone calls or e-mail and facsimile messages based on the identity of the caller or author, subject matter and content of the message and the time of day. Such processing includes, but is not limited to, switching the call to another system, forwarding the call to another telephone terminal, placing the call on hold, or disconnecting the call. In another aspect of the present invention, the system may be employed to process information retrieved from other telecommunication devices such as voice mail, facsimile/modem or e-mail. The system is capable of tagging the identity of a caller or participants to a teleconference, and transcribing the teleconferences, phone conversations and messages of such callers and participants. The system can automatically index or prioritize the received calls, messages, e-mails and facsimiles according to the caller identification or subject matter of the conversation or message, and allow the user to retrieve messages that either originated from a specific source or caller or retrieve calls which deal with similar or specific subject matter.

224 citations


01 Jan 1998
TL;DR: It is argued and demonstrated empirically that the articulatory feature approach can improve the robustness of speech recognition in adverse acoustic environments by enhancing the accuracy of the bottom-up acoustic modeling component.
Abstract: Current automatic speech recognition systems make use of a single source of information about their input, viz. a preprocessed form of the acoustic speech signal, which encodes the time-frequency distribution of signal energy. The goal of this thesis is to investigate the benefits of integrating articulatory information into state-of-the-art speech recognizers, either as a genuine alternative to standard acoustic representations, or as an additional source of information. Articulatory information is represented in terms of abstract articulatory classes or "features", which are extracted from the speech signal by means of statistical classifiers. A higher-level classifier then combines the scores for these features and maps them to standard subword unit probabilities. The main motivation for this approach is to improve the robustness of speech recognition systems in adverse acoustic environments, such as background noise. Typically, recognition systems show a sharp decline of performance under these conditions. We argue and demonstrate empirically that the articulatory feature approach can lead to greater robustness by enhancing the accuracy of the bottom-up acoustic modeling component in a speech recognition system. The second focus of this thesis is to provide detailed analyses of the different types of information provided by the acoustic and articulatory representations, respectively, and to develop strategies to combine them optimally. To this effect we investigate combination methods at the levels of feature extraction, subword unit probability estimation, and word recognition. The feasibility of this approach is demonstrated with respect to two different speech recognition tasks. The first of these is an American English corpus of telephone-bandwidth speech; the recognition domain is continuous numbers. The second is a German database of studio-quality speech consisting of spontaneous dialogues. In both cases recognition performance will be tested not only under clean acoustic conditions but also under deteriorated conditions.

221 citations
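
The two-stage structure described above, with per-feature classifiers feeding a higher-level classifier that outputs subword-unit probabilities, can be sketched as follows. Layer sizes, classifier choices, and all names are assumptions, not details from the thesis.

```python
# Sketch of the articulatory-feature front end: one classifier per
# articulatory feature group, whose posteriors a second-stage classifier
# maps to subword-unit (phone) probabilities. Architecture is assumed.
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.linear_model import LogisticRegression

def train_frontend(X, feat_labels):
    """X: (T, D) frames; feat_labels: dict name -> (T,) articulatory labels
    (e.g. voicing, place, manner)."""
    return {name: MLPClassifier(hidden_layer_sizes=(64,), max_iter=300).fit(X, y)
            for name, y in feat_labels.items()}

def feature_scores(frontend, X):
    # Concatenate the per-feature posteriors into one score vector per frame.
    return np.hstack([clf.predict_proba(X) for clf in frontend.values()])

# Second stage (assumed): map concatenated feature scores to phone posteriors.
# frontend = train_frontend(X, feat_labels)
# merger = LogisticRegression(max_iter=1000).fit(feature_scores(frontend, X), phones)
```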


Proceedings Article
01 Jan 1998
TL;DR: This paper presents an entropy-based algorithm for accurate and robust endpoint detection for speech recognition under noisy environments that uses the spectral entropy to identify the speech segments accurately.
Abstract: This paper presents an entropy-based algorithm for accurate and robust endpoint detection for speech recognition under noisy environments. Instead of using the conventional energy-based features, the spectral entropy is developed to identify the speech segments accurately. Experimental results show that this algorithm outperforms the energy-based algorithms in both detection accuracy and recognition performance under noisy environments, with an average error rate reduction of more than 16%.

221 citations
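
The core of the method is easy to state: treat each frame's normalized power spectrum as a probability mass function and compute its Shannon entropy; speech frames have a more organized, lower-entropy spectrum than broadband noise. The sketch below is a minimal version; the frame sizes and adaptive threshold are assumptions, not the paper's tuned values.

```python
# Minimal spectral-entropy endpoint detector (illustrative parameters).
import numpy as np

def spectral_entropy(signal, frame_len=400, hop=160, eps=1e-10):
    win = np.hamming(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    ent = np.empty(n_frames)
    for i in range(n_frames):
        spec = np.abs(np.fft.rfft(signal[i * hop:i * hop + frame_len] * win)) ** 2
        pmf = spec / (spec.sum() + eps)            # spectrum as a PMF
        ent[i] = -np.sum(pmf * np.log(pmf + eps))  # Shannon entropy per frame
    return ent

def detect_endpoints(signal):
    ent = spectral_entropy(signal)
    threshold = ent.mean() - 0.5 * ent.std()       # assumed adaptive threshold
    idx = np.flatnonzero(ent < threshold)          # low entropy -> speech-like
    return (idx[0], idx[-1]) if idx.size else None
```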


Proceedings ArticleDOI
04 Oct 1998
TL;DR: In this paper, two approaches for extracting features relevant to lipreading, given image sequences of the speaker's mouth region, are considered: a lip contour based feature approach which first obtains estimates of speaker's lip contours and subsequently extracts features from them; and an image transform based approach, which obtains a compressed representation of the image pixel values that contain the speaker mouth.
Abstract: This paper concentrates on the visual front end for hidden Markov model based automatic lipreading. Two approaches for extracting features relevant to lipreading, given image sequences of the speaker's mouth region, are considered: a lip contour based feature approach which first obtains estimates of the speaker's lip contours and subsequently extracts features from them; and an image transform based approach, which obtains a compressed representation of the image pixel values that contain the speaker's mouth. Various possible features are considered in each approach, and experimental results on a number of visual-only recognition tasks are reported. It is shown that the image transform based approach results in superior lipreading performance. In addition, feature mean subtraction is demonstrated to improve the performance in multi-speaker and speaker-independent recognition tasks. Finally, the effects of video degradations on image transform based automatic lipreading are studied. It is shown that lipreading performance dramatically deteriorates below a 10 Hz field rate, and that image transform features are robust to noise and compression artifacts.

201 citations
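
A minimal sketch of the image-transform approach with feature mean subtraction follows. The paper evaluates several transforms; a 2-D DCT of the grayscale mouth region is used here as a representative stand-in, and the ROI size and number of retained coefficients are assumptions.

```python
# Illustrative image-transform visual front end (DCT assumed).
import numpy as np
from scipy.fft import dctn

def mouth_features(frames, keep=6):
    """frames: (T, H, W) grayscale mouth-region images -> (T, keep*keep)."""
    return np.array([dctn(f.astype(float), norm="ortho")[:keep, :keep].ravel()
                     for f in frames])             # low-order DCT coefficients

def mean_subtract(feats):
    # Per-utterance feature mean subtraction (analogous to cepstral mean
    # normalization); this is the step reported to help across speakers.
    return feats - feats.mean(axis=0, keepdims=True)
```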


Proceedings ArticleDOI
12 May 1998
TL;DR: The problem of clustering speakers by their voices is addressed, metrics based on purity and completeness of clusters are introduced, and experimental results on a subset of the Switchboard corpus are presented.
Abstract: The problem of clustering speakers by their voices is addressed. With the mushrooming of available speech data, from television broadcasts to voice mail, automatic systems for archive retrieval, organizing and labeling by speaker are necessary. Clustering conversations by speaker is a solution to all three of the above tasks. Another application for speaker clustering is to group utterances together for speaker adaptation in speech recognition. Metrics based on purity and completeness of clusters are introduced. Next, our approach to speaker clustering is described, and finally experimental results on a subset of the Switchboard corpus are presented.

173 citations
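
Purity and completeness can be computed directly from the contingency between cluster labels and true speaker labels; the sketch below uses the standard majority-count definitions, which may differ in detail from the paper's exact formulation.

```python
# Contingency-table purity/completeness (standard definitions assumed).
import numpy as np

def purity(clusters, speakers):
    """Fraction of utterances that match the majority speaker of their cluster."""
    clusters, speakers = np.asarray(clusters), np.asarray(speakers)
    hits = sum(np.unique(speakers[clusters == c], return_counts=True)[1].max()
               for c in np.unique(clusters))
    return hits / len(speakers)

def completeness(clusters, speakers):
    """Fraction of each speaker's utterances gathered into a single cluster."""
    return purity(speakers, clusters)   # symmetric: swap the two roles

# Example: 6 utterances, 2 speakers, an imperfect 2-way clustering.
print(purity([0, 0, 0, 1, 1, 1], ["a", "a", "b", "b", "b", "a"]))  # 0.666...
```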


Proceedings Article
01 Nov 1998
TL;DR: This work models the speaker's f0 movements by fitting a piecewise linear model to the f0 track to obtain a stylized f0 contour, and improves the verification performance of a cepstrum-based Gaussian mixture model system by 10%.
Abstract: Statistics of frame-level pitch have recently been used in speaker recognition systems with good results [1, 2, 3]. Although they convey useful long-term information about a speaker's distribution of f0 values, such statistics fail to capture information about local dynamics in intonation that characterize an individual's speaking style. In this work, we take a first step toward capturing such suprasegmental patterns for automatic speaker verification. Specifically, we model the speaker's f0 movements by fitting a piecewise linear model to the f0 track to obtain a stylized f0 contour. Parameters of the model are then used as statistical features for speaker verification. We report results on the 1998 NIST speaker verification evaluation. Prosody modeling improves the verification performance of a cepstrum-based Gaussian mixture model system (as measured by a task-specific Bayes risk) by 10%.

159 citations
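
A simple way to obtain a stylized contour is top-down segmentation: fit a line to the voiced f0 track, split at the worst-fit point, and recurse until every piece fits within tolerance. The splitting criterion and the derived features below are assumptions; the paper's exact fitting procedure may differ.

```python
# Illustrative piecewise-linear f0 stylization by recursive splitting.
import numpy as np

def stylize(t, f0, tol=5.0, min_len=5):
    """Return a list of (t_start, t_end, slope) linear pieces."""
    slope, intercept = np.polyfit(t, f0, 1)
    resid = np.abs(f0 - (slope * t + intercept))
    if resid.max() <= tol or len(t) < 2 * min_len:
        return [(t[0], t[-1], slope)]
    split = int(np.clip(resid.argmax(), min_len, len(t) - min_len))
    return (stylize(t[:split], f0[:split], tol, min_len)
            + stylize(t[split:], f0[split:], tol, min_len))

# Segment slopes and durations can then serve as statistical features
# for the verification back end.
```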


Patent
Hiroaki Hattori
TL;DR: A speaker recognition device for judging whether or not an unknown speaker is an authentic registered speaker himself/herself executes text verification using speaker independent speech recognition and speaker verification by comparison with a reference pattern of a password of a registered speaker as mentioned in this paper.
Abstract: A speaker recognition device for judging whether or not an unknown speaker is an authentic registered speaker himself/herself executes "text verification using speaker independent speech recognition" and "speaker verification by comparison with a reference pattern of a password of a registered speaker". A presentation section instructs the unknown speaker to input an ID and utter a specified text designated by a text generation section and a password. The "text verification" of the specified text is executed by a text verification section, and the "speaker verification" of the password is executed by a similarity calculation section. The judgment section judges that the unknown speaker is the authentic registered speaker himself/herself if both the results of the "text verification" and the "speaker verification" are affirmative. According to the device, the "text verification" is executed using a set of speaker independent reference patterns, and the "speaker verification" is executed using speaker reference patterns of passwords of registered speakers, so that the storage capacity needed for storing reference patterns for verification can be considerably reduced. Preferably, "speaker identity verification" between the specified text and the password is executed.

150 citations


Patent
Jennifer Lai, John Vergo
23 Nov 1998
TL;DR: In this article, a speech recognition computer system and method indicate the level of confidence that a speech recognizer has in it recognition of one or more displayed words, and a plurality of confidence levels of individual recognized words may be visually indicated.
Abstract: A speech recognition computer system and method indicates the level of confidence that a speech recognizer has in it recognition of one or more displayed words. The system and method allow for the rapid identification of speech recognition errors. A plurality of confidence levels of individual recognized words may be visually indicated. Additionally, the system and method allow the user of the system to select threshold levels to determine when the visual indication occurs.

126 citations


Proceedings Article
01 Jan 1998
TL;DR: Approaches to blind message clustering are presented based on conventional hierarchical clustering techniques and an integrated cluster generation and selection method called the d* algorithm.
Abstract: Classical speaker and language recognition techniques can be applied to the classification of unknown utterances by computing the likelihoods of the utterances given a set of well trained target models. This paper addresses the problem of grouping unknown utterances when no information is available regarding the speaker or language classes or even the total number of classes. Approaches to blind message clustering are presented based on conventional hierarchical clustering techniques and an integrated cluster generation and selection method called the d* algorithm. Results are presented using message sets derived from the Switchboard and Callfriend corpora. Potential applications include automatic indexing of recorded speech corpora by speaker/language tags and automatic or semiautomatic selection of speaker specific speech utterances for speaker recognition adaptation.
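
For the conventional baseline, one plausible realization is agglomerative clustering over a symmetrized cross-likelihood distance between per-message models, as sketched below. The d* algorithm itself is not reproduced here; the model sizes and distance definition are assumptions.

```python
# Hierarchical message clustering over a cross-likelihood distance (assumed).
import numpy as np
from sklearn.mixture import GaussianMixture
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def cross_likelihood_distance(feats):
    """feats: list of (T_i, D) feature arrays, one per message."""
    models = [GaussianMixture(8, covariance_type="diag").fit(f) for f in feats]
    n = len(feats)
    d = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            # How much likelihood is lost when the models are swapped.
            d[i, j] = d[j, i] = (models[i].score(feats[i]) - models[j].score(feats[i])
                                 + models[j].score(feats[j]) - models[i].score(feats[j]))
    return d

# feats = [...]  # per-message cepstral features
# labels = fcluster(linkage(squareform(cross_likelihood_distance(feats)),
#                           method="average"), t=10, criterion="maxclust")
```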

Patent
31 Mar 1998
TL;DR: In this article, a speech sample is received and speech recognition is performed on the speech sample to produce recognition results, and the recognition results are evaluated in view of the training data and the identification of the speech elements to which the portions of training data are related.
Abstract: A speech sample is evaluated using a computer. Training data that include samples of speech are received and stored along with identification of speech elements to which portions of the training data are related. A speech sample is received and speech recognition is performed on the speech sample to produce recognition results. Finally, the recognition results are evaluated in view of the training data and the identification of the speech elements to which the portions of the training data are related. The technique may be used to perform tasks such as speech recognition, speaker identification, and language identification.

01 Jan 1998
TL;DR: From the results of objective and subjective tests, it is shown that the characteristics of the synthetic speech are close to the target speaker's voice, and the speech generated from the adapted model set using 5 sentences has almost the same DMOS score as that from the speaker dependent model set.
Abstract: This paper describes a voice characteristics conversion technique for an HMM-based text-to-speech synthesis system. The system uses phoneme HMMs as the speech synthesis units, and voice characteristics conversion is achieved by changing HMM parameters appropriately. To transform the voice characteristics of synthetic speech to the target speaker, we apply an MLLR (Maximum Likelihood Linear Regression) technique, one of the speaker adaptation techniques, to the system. From the results of objective and subjective tests, it is shown that the characteristics of the synthetic speech are close to the target speaker's voice, and the speech generated from the adapted model set using 5 sentences has almost the same DMOS score as that from the speaker dependent model set.
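
At its core, MLLR rewrites each Gaussian mean as mu' = A mu + b, with the transform estimated by maximum likelihood from the target speaker's adaptation data. The sketch below assumes identity covariances and a single global regression class, in which case the ML estimate reduces to weighted least squares; real systems use per-class transforms and fold the covariances into the estimation.

```python
# Simplified global MLLR mean transform (identity covariances assumed).
import numpy as np

def estimate_mllr(means, obs_means, gamma):
    """means: (G, D) average-voice means; obs_means: (G, D) per-Gaussian
    averages of the target speaker's aligned frames; gamma: (G,) occupancies."""
    G, _ = means.shape
    xi = np.hstack([np.ones((G, 1)), means])       # extended means [1, mu]
    # W minimizes sum_g gamma_g * ||obs_g - xi_g W||^2.
    A = (xi * gamma[:, None]).T @ xi
    B = (xi * gamma[:, None]).T @ obs_means
    return np.linalg.solve(A, B)                   # W: (D+1, D)

def adapt_means(means, W):
    xi = np.hstack([np.ones((means.shape[0], 1)), means])
    return xi @ W                                  # mu' = A mu + b for all means
```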

Journal ArticleDOI
TL;DR: A speaker adaptation strategy is described that is based on finding a subset of speakers, from the training set, who are acoustically close to the test speaker, and using only the data from these speakers (rather than the complete training corpus) to reestimate the system parameters.
Abstract: A speaker adaptation strategy is described that is based on finding a subset of speakers, from the training set, who are acoustically close to the test speaker, and using only the data from these speakers (rather than the complete training corpus) to reestimate the system parameters. Further, a linear transformation is computed for every one of the selected training speakers to better map the training speaker's data to the test speaker's acoustic space. Finally, the system parameters (Gaussian means) are reestimated specifically for the test speaker using the transformed data from the selected training speakers. Experiments showed that this scheme is capable of providing an 18% relative improvement in the error rate on a large-vocabulary task with the use of as little as three sentences of adaptation data.
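
The selection step can be sketched as scoring the test speaker's adaptation data against a model of each training speaker and keeping the top-ranked subset; the model type and subset size below are assumptions, and the per-speaker transforms and mean re-estimation are omitted.

```python
# Illustrative "acoustically close speakers" selection (GMM scoring assumed).
from sklearn.mixture import GaussianMixture

def closest_speakers(adapt_feats, speaker_feats, n_keep=20):
    """speaker_feats: dict speaker_id -> (T, D) training frames."""
    scores = {spk: GaussianMixture(16, covariance_type="diag")
                     .fit(f).score(adapt_feats)   # avg log-likelihood of test data
              for spk, f in speaker_feats.items()}
    return sorted(scores, key=scores.get, reverse=True)[:n_keep]
```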

DOI
01 Jan 1998
TL;DR: The design of a multilingual speech recognizer is described, using an LVCSR dictation database collected under the project GlobalPhone and a global phoneme set, yielding a system which can handle five different languages.
Abstract: This paper describes the design of a multilingual speech recognizer using an LVCSR dictation database which has been collected under the project GlobalPhone. This project at the University of Karlsruhe investigates LVCSR systems in 15 languages of the world, namely Arabic, Chinese, Croatian, English, French, German, Italian, Japanese, Korean, Portuguese, Russian, Spanish, Swedish, Tamil, and Turkish. For our experiments we used six of these languages to train and test several recognition engines in monolingual, multilingual and crosslingual setups. Based on a global phoneme set we built a multilingual speech recognition system which can handle five different languages. The acoustic models of the five languages are combined into a monolithic system and context dependent phoneme models are created using language questions.

Patent
TL;DR: In this paper, a speaker recognition system for selectively permitting access by a requesting speaker to a service or facility includes an acoustic front-end for computing at least one feature vector from a speech utterance provided by the requesting speaker; a speaker dependent codebook store for pre-storing sets of acoustic features, in the form of codebooks, respectively corresponding to a pool of previously enrolled speakers; and a speaker identifier/verifier module operatively coupled to the acoustic front-end, which identifies, from identifying indicia provided by the requesting speaker, a previously enrolled speaker as the claimed speaker and verifies the requesting speaker against the codebooks of acoustically similar enrolled speakers.
Abstract: A speaker recognition system for selectively permitting access by a requesting speaker to one of a service and facility includes an acoustic front-end for computing at least one feature vector from a speech utterance provided by the requesting speaker; a speaker dependent codebook store for pre-storing sets of acoustic features, in the form of codebooks, respectively corresponding to a pool of previously enrolled speakers; a speaker identifier/verifier module operatively coupled to the acoustic front-end, wherein: the speaker identifier/verifier module identifies, from identifying indicia provided by the requesting speaker, a previously enrolled speaker as a claimed speaker; further, the speaker identifier/verifier module associates, with the claimed speaker, first and second groups of previously enrolled speakers, the first group being defined as speakers whose codebooks are respectively acoustically similar to the claimed speaker (i.e., cohort set) and the second group being defined as speakers whose codebooks are acoustically similar to the claimed speaker but not as acoustically similar as the codebooks of the speakers in the first group (i.e., legion set); and still further, the speaker identifier/verifier module verifies the requesting speaker by comparing the at least one feature vector of the requesting speaker to the codebooks of the previously enrolled speakers in the second group and, in response to such comparison, generates an indicator indicating that the requesting speaker is one of verified and not verified for access to one of the service and facility.

Proceedings ArticleDOI
Subhro Das, D. Nix, M. Picheny
12 May 1998
TL;DR: Comparative studies demonstrating the performance gain realized by adapting to children's acoustic and language model data to construct a children's speech recognition system are described.
Abstract: There are several reasons why conventional speech recognition systems modeled on adult data fail to perform satisfactorily on children's speech input. For instance, children's vocal characteristics differ significantly from those of adults. In addition, their choices of vocabulary and sentence construction modalities usually do not conform to adult patterns. We describe comparative studies demonstrating the performance gain realized by adapting to children's acoustic and language model data to construct a children's speech recognition system.

Patent
TL;DR: In this paper, speaker-dependent and speaker-independent speech recognition in a voice-controlled multi-station network is discussed; a fallback procedure is maintained for any particular station in order to cater for failure of the speaker-dependent recognition, whilst allowing reverting to the improvement procedure.
Abstract: A voice-controlled multi-station network has both speaker-dependent and speaker-independent speech recognition. Conditional on recognizing items of an applicable vocabulary, the network executes a particular function. The method receives a call from a particular origin and executes speaker-independent speech recognition on the call. In an improvement procedure, in case of successful determination of what has been said, a template associated with the recognized speech items is stored and assigned to the origin. Next, speaker-dependent recognition is applied, if feasible, for speech received from the same origin, using one or more templates associated with that station. Further, a fallback procedure to speaker-independent recognition is maintained for any particular station in order to cater for failure of the speaker-dependent recognition, whilst allowing reverting to the improvement procedure.

Journal ArticleDOI
TL;DR: A novel algorithm for reducing the computational complexity of identifying a speaker within a Gaussian mixture speaker model framework is presented and it is illustrated that rapid pruning of unlikely speaker model candidates can be achieved by reordering the time-sequence of observation vectors used to update the accumulated probability of each speaker model.
Abstract: This article presents a novel algorithm for reducing the computational complexity of identifying a speaker within a Gaussian mixture speaker model framework. For applications in which the entire observation sequence is known, we illustrate that rapid pruning of unlikely speaker model candidates can be achieved by reordering the time-sequence of observation vectors used to update the accumulated probability of each speaker model. The overall approach is integrated into a beam-search strategy and shown to reduce the time to identify a speaker by a factor of 140 over the standard full-search method, and by a factor of six over the standard beam-search method when identifying speakers from the 138 speaker YOHO corpus.
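
The pruning logic is straightforward to sketch: accumulate per-frame log-likelihoods for every candidate model and drop any candidate that falls more than a beam width below the current leader. The ordering proxy below (frame-wise score variance across models) is an assumption, as the paper's reordering criterion may differ, and all scores are precomputed here for clarity even though a real system would score only surviving models.

```python
# Illustrative reordered beam search over GMM speaker models.
import numpy as np

def identify(obs, gmms, beam=50.0):
    """obs: (T, D) frames; gmms: models with sklearn-style .score_samples()."""
    frame_ll = np.stack([g.score_samples(obs) for g in gmms])   # (S, T)
    # Present the most model-discriminative frames first so that weak
    # candidates can be pruned early (ordering criterion assumed).
    frame_ll = frame_ll[:, np.argsort(-frame_ll.var(axis=0))]
    total = np.zeros(len(gmms))
    alive = np.ones(len(gmms), dtype=bool)
    for t in range(frame_ll.shape[1]):
        total[alive] += frame_ll[alive, t]
        alive &= total >= total[alive].max() - beam             # beam pruning
    return int(np.argmax(np.where(alive, total, -np.inf)))
```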

Proceedings ArticleDOI
12 May 1998
TL;DR: In this article, a new approach for robust automatic speaker verification in adverse conditions is proposed based on the combination of speech enhancement using traditional spectral subtraction, and missing feature compensation to dynamically modify the probability computations performed in GMM recognizers.
Abstract: In the framework of Gaussian mixture models (GMMs), we present a new approach towards robust automatic speaker verification (SV) in adverse conditions. This new and simple approach is based on the combination of speech enhancement using traditional spectral subtraction, and missing feature compensation to dynamically modify the probability computations performed in GMM recognizers. The identity of the spectral features missing due to noise masking is provided by the spectral subtraction algorithm. Previous works have demonstrated that the missing feature modeling method succeeds in speech recognition with some artificially generated interruptions, filtering and noise. We show that this method also improves noise compensation techniques used for speaker verification in more realistic conditions.
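
With diagonal-covariance GMMs, the missing-feature computation is particularly simple: dimensions flagged as masked by noise are marginalized out, which for independent dimensions just means omitting them from each component's Gaussian. A minimal sketch, assuming the reliability mask is supplied by the spectral subtraction stage:

```python
# GMM frame log-likelihood with missing-feature marginalization
# (diagonal covariances assumed; mask comes from spectral subtraction).
import numpy as np

def gmm_loglik_masked(x, mask, weights, means, variances):
    """x: (D,) frame; mask: (D,) bool, True where the feature is reliable;
    weights: (K,); means, variances: (K, D)."""
    xr, mu, var = x[mask], means[:, mask], variances[:, mask]
    comp = -0.5 * (np.log(2 * np.pi * var) + (xr - mu) ** 2 / var).sum(axis=1)
    m = comp.max()
    return m + np.log(np.sum(weights * np.exp(comp - m)))   # log-sum-exp
```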

01 Jan 1998
TL;DR: Conference contribution on non-linear signal processing for speech processing; no abstract is available for this record.
Abstract: Keywords: Non-Linear Signal Processing; Speech Processing. Reference: LANOS-CONF-1998-004.

Proceedings Article
01 Jan 1998
TL;DR: This paper describes the design of a multilingual speech recognizer using an LVCSR dictation database which has been collected under the project GlobalPhone and presents several recognition results in language independent and language adaptive setups.
Abstract: This paper describes the design of a multilingual speech recognizer using an LVCSR dictation database which has been collected under the project GlobalPhone. This project at the University of Karlsruhe investigates LVCSR systems in 15 languages of the world, namely Arabic, Chinese, Croatian, English, French, German, Italian, Japanese, Korean, Portuguese, Russian, Spanish, Swedish, Tamil, and Turkish. Based on a global phoneme set we built different multilingual speech recognition systems for five of the 15 languages. Context dependent phoneme models are created data-driven by introducing questions about language and language groups to our polyphone clustering procedure. We apply the resulting multilingual models to unseen languages and present several recognition results in language independent and language adaptive setups.

Proceedings ArticleDOI
07 Dec 1998
TL;DR: Preliminary results on emotion recognition by machine from joint audiovisual input of facial video and speech show potential advantages in using both modalities over either modality alone.
Abstract: We report preliminary results on emotion recognition by machine from joint audiovisual input of facial video and speech. The results show potential advantages in using both modalities over either modality alone. The recognition rate for audio alone is about 75% and video alone about 70%. Using audiovisual data we achieved 97% without increasing the number of features. The improvement in performance is accredited to the complementary property between the two modalities. A possible application is in natural human-computer interfaces.

Patent
04 Feb 1998
TL;DR: In this article, a method of speech recognition, in accordance with the present invention, includes the steps of grouping acoustics to form classes based on acoustic features, clustering training speakers by the classes to provide class-specific cluster systems, selecting, from the cluster systems, a subset of cluster systems closest to adaptation data from a test speaker, transforming the subset of cluster systems to bring them closer to the test speaker based on the adaptation data to form adapted cluster systems, and combining the adapted cluster systems to create a speaker adapted system for decoding speech from the test speaker.
Abstract: A method of speech recognition, in accordance with the present invention, includes the steps of grouping acoustics to form classes based on acoustic features, clustering training speakers by the classes to provide class-specific cluster systems, selecting, from the cluster systems, a subset of cluster systems closest to adaptation data from a test speaker, transforming the subset of cluster systems to bring the subset of cluster systems closer to the test speaker based on the adaptation data to form adapted cluster systems, and combining the adapted cluster systems to create a speaker adapted system for decoding speech from the test speaker. Systems and methods for building speech recognition systems as well as adapting speaker systems for class-specific speaker clusters are included.

Proceedings ArticleDOI
12 May 1998
TL;DR: This paper presents a novel technique for the tracking and extraction of features from lips for the purpose of speaker identification, where syntactic information is derived from chromatic information in the lip region.
Abstract: This paper presents a novel technique for the tracking and extraction of features from lips for the purpose of speaker identification. In noisy or other adverse conditions, identification performance via the speech signal can degrade significantly, hence additional information which can complement the speech signal is of particular interest. In our system, syntactic information is derived from chromatic information in the lip region. A model of the lip contour is formed directly from the syntactic information, with no minimization procedure required to refine estimates. Colour features are then extracted from the lips via profiles taken around the lip contour. Further improvement in lip features is obtained via linear discriminant analysis (LDA). Speaker models are built from the lip features based on the Gaussian mixture model (GMM). Identification experiments are performed on the M2VTS database, with encouraging results.

Book
01 Jan 1998
TL;DR: This work presents perceptually inspired signal-processing strategies for robust speech recognition in reverberant environments.
Abstract: Perceptually Inspired Signal-processing Strategies for Robust Speech Recognition in Reverberant Environments

Patent
31 Jul 1998
TL;DR: In this paper, the authors used therapidly available speech recognition results to provide intelligent barge-in for voice-response systems and, to count words to output sub-sequences to provide paralleling and/or pipelining of tasks related to the entire word sequence to increase processing throughput.
Abstract: Speech recognition technology has attained maturity such that the most likely speech recognition result has been reached and is available before an energy based termination of speech has been made. The present invention innovatively uses therapidly available speech recognition results to provide intelligent barge-in forvoice-response systems and, to count words to output sub-sequences to provide paralleling and/or pipelining of tasks related to the entire word sequence to increase processing throughput.

Patent
23 Oct 1998
TL;DR: In this paper, superwords are used to refer to those word combinations which are so often spoken that they are recognized as units or should have models to reflect them in the language model.
Abstract: This invention is directed to the selection of superwords based on a criterion relevant to speech recognition and understanding. Superwords are used to refer to those word combinations which are so often spoken that they are recognized as units or should have models to reflect them in the language model. The selected superwords are placed in a lexicon along with selected meaningful phrases. The lexicon is then used by a speech recognizer to improve recognition of input speech utterances for the proper routing of a user's task objectives.

01 Jan 1998
TL;DR: In this paper, a nonlinear discriminant analysis (NLDA) technique was used to extract a speaker-discriminant feature set, which is optimized to discriminate between speakers and to be robust to mismatched training and testing conditions.
Abstract: We study a nonlinear discriminant analysis (NLDA) technique that extracts a speaker-discriminant feature set. Our approach is to train a multilayer perceptron (MLP) to maximize the separation between speakers by nonlinearly projecting a large set of acoustic features (e.g., several frames) to a lower-dimensional feature set. The extracted features are optimized to discriminate between speakers and to be robust to mismatched training and testing conditions. We train the MLP on a development set and apply it to the training and testing utterances. Our results show that by combining the NLDA-based system with a state-of-the-art cepstrum-based system we improve the speaker verification performance on the 1997 NIST Speaker Recognition Evaluation set by 15% on average compared with our cepstrum-only system.
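
The extractor can be sketched as an MLP with a narrow projection layer trained to classify development speakers; after training, the output layer is discarded and the projection serves as the feature transform. Layer sizes and training details below are assumptions.

```python
# Illustrative NLDA feature extractor (layer sizes assumed).
import torch
import torch.nn as nn

class NLDANet(nn.Module):
    def __init__(self, in_dim, bottleneck=24, n_speakers=100):
        super().__init__()
        self.project = nn.Sequential(                 # kept after training
            nn.Linear(in_dim, 256), nn.Tanh(),
            nn.Linear(256, bottleneck), nn.Tanh())
        self.classify = nn.Linear(bottleneck, n_speakers)  # discarded later

    def forward(self, x):
        return self.classify(self.project(x))

# Training (assumed): cross-entropy over development speakers, with x being
# several stacked acoustic frames, e.g. 9 x 13 cepstra:
# net = NLDANet(in_dim=9 * 13)
# loss = nn.CrossEntropyLoss()(net(frames), speaker_ids)
# Afterwards net.project(frames) yields features for the GMM back end.
```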

Proceedings Article
16 Sep 1998
TL;DR: This paper examines an adaptation scheme requiring very few parameters to adapt the models, cluster adaptive training (CAT), and finds that on a speaker-independent task CAT reduced the word error rate using very little adaptation data.
Abstract: When performing speaker adaptation there are two conflicting requirements. First, the transform must be powerful enough to represent the speaker. Second, the transform must be quickly and easily estimated for any particular speaker. Recently the most popular adaptation schemes have used many parameters to adapt the models. This limits how rapidly the models may be adapted. This paper examines an adaptation scheme requiring very few parameters to adapt the models: cluster adaptive training (CAT). CAT may be viewed as a simple extension to speaker clustering. Rather than selecting one cluster, a linear interpolation of all the cluster means is used as the mean of the particular speaker. This scheme naturally falls into an adaptive training framework. Maximum likelihood estimates of the interpolation weights are given. Furthermore, simple re-estimation formulae for cluster means, represented both explicitly and by sets of transforms of some canonical mean, are given. On a speaker-independent task CAT reduced the word error rate using very little adaptation data. In addition, when combined with other adaptation schemes it gave a 5% reduction in word error rate over adapting a speaker-independent model set.
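
The key CAT computation is the weight estimate itself. Reduced to a single Gaussian with identity covariance, the ML interpolation weights solve a small least-squares problem over the cluster means; the sketch below shows that reduction, whereas real CAT accumulates the corresponding statistics over all states and mixture components.

```python
# Minimal CAT weight estimate for one speaker (single Gaussian,
# identity covariance assumed; real CAT sums stats over all mixtures).
import numpy as np

def cat_weights(cluster_means, frames):
    """cluster_means: (C, D); frames: (T, D) adaptation data.
    Models the speaker mean as an interpolation M w of the cluster means."""
    M = cluster_means.T                        # (D, C)
    target = frames.mean(axis=0)               # sufficient statistic
    w, *_ = np.linalg.lstsq(M, target, rcond=None)
    return w

def speaker_mean(cluster_means, w):
    return cluster_means.T @ w                 # interpolated speaker mean
```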