
Showing papers on "Speaker recognition" published in 1998



Proceedings Article
01 Jan 1998
TL;DR: This paper proposes statistical tests for the existence of sheep, goats, lambs and wolves and applies these tests to hunt for such animals using results from the 1998 NIST speaker recognition evaluation.
Abstract: Performance variability in speech and speaker recognition systems can be attributed to many factors. One major factor, which is often acknowledged but seldom analyzed, is inherent differences in the recognizability of different speakers. In speaker recognition systems such differences are characterized by the use of animal names for different types of speakers, including sheep, goats, lambs and wolves, depending on their behavior with respect to automatic recognition systems. In this paper we propose statistical tests for the existence of these animals and apply these tests to hunt for such animals using results from the 1998 NIST speaker recognition evaluation.

444 citations
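
As a rough illustration of the idea, the sketch below tests whether per-speaker error rates differ more than chance would allow, using a Kruskal-Wallis test over simulated trial outcomes. This is a hypothetical reconstruction, not necessarily the statistical test used in the paper; all names and data are illustrative.

```python
# Hypothetical sketch: do "goats" (speakers who are unusually hard to
# recognize) exist, i.e. do per-speaker miss rates differ significantly?
# Not the paper's exact procedure; data and threshold are illustrative.
import numpy as np
from scipy.stats import kruskal

rng = np.random.default_rng(0)

# trials[s] = 0/1 outcomes (1 = miss) on genuine trials of speaker s.
trials = {s: rng.binomial(1, p, size=50)
          for s, p in [("spk1", 0.05), ("spk2", 0.06), ("spk3", 0.30)]}

# Kruskal-Wallis: are the per-speaker outcome distributions the same?
stat, pval = kruskal(*trials.values())
print(f"H = {stat:.1f}, p = {pval:.2g}:",
      "goats likely exist" if pval < 0.01 else "no evidence of goats")
```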


Patent
21 Dec 1998
TL;DR: A programmable automatic call and data transfer processing system which automatically processes incoming telephone calls, facsimiles and e-mails based on the identity of the caller or author, the subject matter of the message or request, and/or the time of day is presented in this article.
Abstract: A programmable automatic call and data transfer processing system which automatically processes incoming telephone calls, facsimiles and e-mails based on the identity of the caller or author, the subject matter of the message or request, and/or the time of day, which includes: a central server for automatically answering an incoming call and collecting voice data of a caller; a speaker recognition module connected to the server for identifying the caller or author; a switching module responsive to the speaker recognition module for processing the call or message in accordance with a pre-programmed procedure based on the identification of the caller or author; and a programming interface for programming the server, speaker recognizer module and the switching module. The system is programmed by the user so as to process incoming telephone calls or e-mail and facsimile messages based on the identity of the caller or author, subject matter and content of the message and the time of day. Such processing includes, but is not limited to, switching the call to another system, forwarding the call to another telephone terminal, placing the call on hold, or disconnecting the call. In another aspect of the present invention, the system may be employed to process information retrieved from other telecommunication devices such as voice mail, facsimile/modem or e-mail. The system is capable of tagging the identity of a caller or participants to a teleconference, and transcribing the teleconferences, phone conversations and messages of such callers and participants. The system can automatically index or prioritize the received calls, messages, e-mails and facsimiles according to the caller identification or subject matter of the conversation or message, and allow the user to retrieve messages that either originated from a specific source or caller or retrieve calls which deal with similar or specific subject matter.

224 citations


01 Jan 1998
TL;DR: It is argued and demonstrated empirically that the articulatory feature approach can improve the robustness of speech recognition in adverse acoustic environments by enhancing the accuracy of the bottom-up acoustic modeling component.
Abstract: Current automatic speech recognition systems make use of a single source of information about their input, viz. a preprocessed form of the acoustic speech signal, which encodes the time-frequency distribution of signal energy. The goal of this thesis is to investigate the benefits of integrating articulatory information into state-of-the-art speech recognizers, either as a genuine alternative to standard acoustic representations, or as an additional source of information. Articulatory information is represented in terms of abstract articulatory classes or "features", which are extracted from the speech signal by means of statistical classifiers. A higher-level classifier then combines the scores for these features and maps them to standard subword unit probabilities. The main motivation for this approach is to improve the robustness of speech recognition systems in adverse acoustic environments, such as background noise. Typically, recognition systems show a sharp decline of performance under these conditions. We argue and demonstrate empirically that the articulatory feature approach can lead to greater robustness by enhancing the accuracy of the bottom-up acoustic modeling component in a speech recognition system. The second focus of this thesis is to provide detailed analyses of the different types of information provided by the acoustic and articulatory representations, respectively, and to develop strategies to combine them optimally. To this effect we investigate combination methods at the levels of feature extraction, subword unit probability estimation, and word recognition. The feasibility of this approach is demonstrated with respect to two different speech recognition tasks. The first of these is an American English corpus of telephone-bandwidth speech; the recognition domain is continuous numbers. The second is a German database of studio-quality speech consisting of spontaneous dialogues. In both cases recognition performance will be tested not only under clean acoustic conditions but also under deteriorated conditions.

221 citations
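
The two-stage structure described above, with per-feature classifiers feeding a higher-level classifier that outputs subword-unit probabilities, can be sketched as follows. Layer sizes, classifier choices, and all names are assumptions, not details from the thesis.

```python
# Sketch of the articulatory-feature front end: one classifier per
# articulatory feature group, whose posteriors a second-stage classifier
# maps to subword-unit (phone) probabilities. Architecture is assumed.
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.linear_model import LogisticRegression

def train_frontend(X, feat_labels):
    """X: (T, D) frames; feat_labels: dict name -> (T,) articulatory labels
    (e.g. voicing, place, manner)."""
    return {name: MLPClassifier(hidden_layer_sizes=(64,), max_iter=300).fit(X, y)
            for name, y in feat_labels.items()}

def feature_scores(frontend, X):
    # Concatenate the per-feature posteriors into one score vector per frame.
    return np.hstack([clf.predict_proba(X) for clf in frontend.values()])

# Second stage (assumed): map concatenated feature scores to phone posteriors.
# frontend = train_frontend(X, feat_labels)
# merger = LogisticRegression(max_iter=1000).fit(feature_scores(frontend, X), phones)
```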


Proceedings Article
01 Jan 1998
TL;DR: This paper presents an entropy-based algorithm for accurate and robust endpoint detection for speech recognition under noisy environments that uses the spectral entropy to identify the speech segments accurately.
Abstract: This paper presents an entropy-based algorithm for accurate and robust endpoint detection for speech recognition under noisy environments. Instead of using the conventional energy-based features, the spectral entropy is developed to identify the speech segments accurately. Experimental results show that this algorithm outperforms the energy-based algorithms in both detection accuracy and recognition performance under noisy environments, with an average error rate reduction of more than 16%.

221 citations
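
The core of the method is easy to state: treat each frame's normalized power spectrum as a probability mass function and compute its Shannon entropy; speech frames have a more organized, lower-entropy spectrum than broadband noise. The sketch below is a minimal version; the frame sizes and adaptive threshold are assumptions, not the paper's tuned values.

```python
# Minimal spectral-entropy endpoint detector (illustrative parameters).
import numpy as np

def spectral_entropy(signal, frame_len=400, hop=160, eps=1e-10):
    win = np.hamming(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    ent = np.empty(n_frames)
    for i in range(n_frames):
        spec = np.abs(np.fft.rfft(signal[i * hop:i * hop + frame_len] * win)) ** 2
        pmf = spec / (spec.sum() + eps)            # spectrum as a PMF
        ent[i] = -np.sum(pmf * np.log(pmf + eps))  # Shannon entropy per frame
    return ent

def detect_endpoints(signal):
    ent = spectral_entropy(signal)
    threshold = ent.mean() - 0.5 * ent.std()       # assumed adaptive threshold
    idx = np.flatnonzero(ent < threshold)          # low entropy -> speech-like
    return (idx[0], idx[-1]) if idx.size else None
```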


Proceedings ArticleDOI
04 Oct 1998
TL;DR: In this paper, two approaches for extracting features relevant to lipreading, given image sequences of the speaker's mouth region, are considered: a lip contour based feature approach which first obtains estimates of speaker's lip contours and subsequently extracts features from them; and an image transform based approach, which obtains a compressed representation of the image pixel values that contain the speaker mouth.
Abstract: This paper concentrates on the visual front end for hidden Markov model based automatic lipreading. Two approaches for extracting features relevant to lipreading, given image sequences of the speaker's mouth region, are considered: a lip contour based feature approach which first obtains estimates of the speaker's lip contours and subsequently extracts features from them; and an image transform based approach, which obtains a compressed representation of the image pixel values that contain the speaker's mouth. Various possible features are considered in each approach, and experimental results on a number of visual-only recognition tasks are reported. It is shown that the image transform based approach results in superior lipreading performance. In addition, feature mean subtraction is demonstrated to improve the performance in multi-speaker and speaker-independent recognition tasks. Finally, the effects of video degradations on image transform based automatic lipreading are studied. It is shown that lipreading performance dramatically deteriorates below a 10 Hz field rate, and that image transform features are robust to noise and compression artifacts.

201 citations
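
A minimal sketch of the image-transform approach with feature mean subtraction follows. The paper evaluates several transforms; a 2-D DCT of the grayscale mouth region is used here as a representative stand-in, and the ROI size and number of retained coefficients are assumptions.

```python
# Illustrative image-transform visual front end (DCT assumed).
import numpy as np
from scipy.fft import dctn

def mouth_features(frames, keep=6):
    """frames: (T, H, W) grayscale mouth-region images -> (T, keep*keep)."""
    return np.array([dctn(f.astype(float), norm="ortho")[:keep, :keep].ravel()
                     for f in frames])             # low-order DCT coefficients

def mean_subtract(feats):
    # Per-utterance feature mean subtraction (analogous to cepstral mean
    # normalization); this is the step reported to help across speakers.
    return feats - feats.mean(axis=0, keepdims=True)
```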


Proceedings ArticleDOI
12 May 1998
TL;DR: The problem of clustering speakers by their voices is addressed, metrics based on purity and completeness of clusters are introduced, and experimental results on a subset of the Switchboard corpus are presented.
Abstract: The problem of clustering speakers by their voices is addressed. With the mushrooming of available speech data, from television broadcasts to voice mail, automatic systems for archive retrieval, organizing and labeling by speaker are necessary. Clustering conversations by speaker is a solution to all three of the above tasks. Another application for speaker clustering is to group utterances together for speaker adaptation in speech recognition. Metrics based on purity and completeness of clusters are introduced. Next, our approach to speaker clustering is described, and finally experimental results on a subset of the Switchboard corpus are presented.

173 citations
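
Purity and completeness can be computed directly from the contingency between cluster labels and true speaker labels; the sketch below uses the standard majority-count definitions, which may differ in detail from the paper's exact formulation.

```python
# Contingency-table purity/completeness (standard definitions assumed).
import numpy as np

def purity(clusters, speakers):
    """Fraction of utterances that match the majority speaker of their cluster."""
    clusters, speakers = np.asarray(clusters), np.asarray(speakers)
    hits = sum(np.unique(speakers[clusters == c], return_counts=True)[1].max()
               for c in np.unique(clusters))
    return hits / len(speakers)

def completeness(clusters, speakers):
    """Fraction of each speaker's utterances gathered into a single cluster."""
    return purity(speakers, clusters)   # symmetric: swap the two roles

# Example: 6 utterances, 2 speakers, an imperfect 2-way clustering.
print(purity([0, 0, 0, 1, 1, 1], ["a", "a", "b", "b", "b", "a"]))  # 0.666...
```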


Proceedings Article
01 Nov 1998
TL;DR: This work models the speaker's f0 movements by fitting a piecewise linear model to the f0 track to obtain a stylized f0 contour, and improves the verification performance of a cepstrum-based Gaussian mixture model system by 10%.
Abstract: Statistics of frame-level pitch have recently been used in speaker recognition systems with good results [1, 2, 3]. Although they convey useful long-term information about a speaker's distribution of f0 values, such statistics fail to capture information about local dynamics in intonation that characterize an individual's speaking style. In this work, we take a first step toward capturing such suprasegmental patterns for automatic speaker verification. Specifically, we model the speaker's f0 movements by fitting a piecewise linear model to the f0 track to obtain a stylized f0 contour. Parameters of the model are then used as statistical features for speaker verification. We report results on the 1998 NIST speaker verification evaluation. Prosody modeling improves the verification performance of a cepstrum-based Gaussian mixture model system (as measured by a task-specific Bayes risk) by 10%.

159 citations
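
A simple way to obtain a stylized contour is top-down segmentation: fit a line to the voiced f0 track, split at the worst-fit point, and recurse until every piece fits within tolerance. The splitting criterion and the derived features below are assumptions; the paper's exact fitting procedure may differ.

```python
# Illustrative piecewise-linear f0 stylization by recursive splitting.
import numpy as np

def stylize(t, f0, tol=5.0, min_len=5):
    """Return a list of (t_start, t_end, slope) linear pieces."""
    slope, intercept = np.polyfit(t, f0, 1)
    resid = np.abs(f0 - (slope * t + intercept))
    if resid.max() <= tol or len(t) < 2 * min_len:
        return [(t[0], t[-1], slope)]
    split = int(np.clip(resid.argmax(), min_len, len(t) - min_len))
    return (stylize(t[:split], f0[:split], tol, min_len)
            + stylize(t[split:], f0[split:], tol, min_len))

# Segment slopes and durations can then serve as statistical features
# for the verification back end.
```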


Patent
Hiroaki Hattori
TL;DR: A speaker recognition device for judging whether or not an unknown speaker is an authentic registered speaker himself/herself executes text verification using speaker independent speech recognition and speaker verification by comparison with a reference pattern of a password of a registered speaker as mentioned in this paper.
Abstract: A speaker recognition device for judging whether or not an unknown speaker is an authentic registered speaker himself/herself executes "text verification using speaker independent speech recognition" and "speaker verification by comparison with a reference pattern of a password of a registered speaker". A presentation section instructs the unknown speaker to input an ID and utter a specified text designated by a text generation section and a password. The "text verification" of the specified text is executed by a text verification section, and the "speaker verification" of the password is executed by a similarity calculation section. The judgment section judges that the unknown speaker is the authentic registered speaker himself/herself if both the results of the "text verification" and the "speaker verification" are affirmative. According to the device, the "text verification" is executed using a set of speaker independent reference patterns, and the "speaker verification" is executed using speaker reference patterns of passwords of registered speakers, so that the storage capacity needed for storing reference patterns for verification can be considerably reduced. Preferably, "speaker identity verification" between the specified text and the password is executed.

150 citations


Patent
Jennifer Lai, John Vergo
23 Nov 1998
TL;DR: In this article, a speech recognition computer system and method indicate the level of confidence that a speech recognizer has in it recognition of one or more displayed words, and a plurality of confidence levels of individual recognized words may be visually indicated.
Abstract: A speech recognition computer system and method indicates the level of confidence that a speech recognizer has in it recognition of one or more displayed words. The system and method allow for the rapid identification of speech recognition errors. A plurality of confidence levels of individual recognized words may be visually indicated. Additionally, the system and method allow the user of the system to select threshold levels to determine when the visual indication occurs.

126 citations


Proceedings Article
01 Jan 1998
TL;DR: Approaches to blind message clustering are presented based on conventional hierarchical clustering techniques and an integrated cluster generation and selection method called the d* algorithm.
Abstract: Classical speaker and language recognition techniques can be applied to the classification of unknown utterances by computing the likelihoods of the utterances given a set of well trained target models. This paper addresses the problem of grouping unknown utterances when no information is available regarding the speaker or language classes or even the total number of classes. Approaches to blind message clustering are presented based on conventional hierarchical clustering techniques and an integrated cluster generation and selection method called the d* algorithm. Results are presented using message sets derived from the Switchboard and Callfriend corpora. Potential applications include automatic indexing of recorded speech corpora by speaker/language tags and automatic or semiautomatic selection of speaker specific speech utterances for speaker recognition adaptation.
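
For the conventional baseline, one plausible realization is agglomerative clustering over a symmetrized cross-likelihood distance between per-message models, as sketched below. The d* algorithm itself is not reproduced here; the model sizes and distance definition are assumptions.

```python
# Hierarchical message clustering over a cross-likelihood distance (assumed).
import numpy as np
from sklearn.mixture import GaussianMixture
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def cross_likelihood_distance(feats):
    """feats: list of (T_i, D) feature arrays, one per message."""
    models = [GaussianMixture(8, covariance_type="diag").fit(f) for f in feats]
    n = len(feats)
    d = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            # How much likelihood is lost when the models are swapped.
            d[i, j] = d[j, i] = (models[i].score(feats[i]) - models[j].score(feats[i])
                                 + models[j].score(feats[j]) - models[i].score(feats[j]))
    return d

# feats = [...]  # per-message cepstral features
# labels = fcluster(linkage(squareform(cross_likelihood_distance(feats)),
#                           method="average"), t=10, criterion="maxclust")
```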

Patent
31 Mar 1998
TL;DR: In this article, a speech sample is received and speech recognition is performed on the speech sample to produce recognition results, and the recognition results are evaluated in view of the training data and the identification of the speech elements to which the portions of training data are related.
Abstract: A speech sample is evaluated using a computer. Training data that include samples of speech are received and stored along with identification of speech elements to which portions of the training data are related. A speech sample is received and speech recognition is performed on the speech sample to produce recognition results. Finally, the recognition results are evaluated in view of the training data and the identification of the speech elements to which the portions of the training data are related. The technique may be used to perform tasks such as speech recognition, speaker identification, and language identification.

01 Jan 1998
TL;DR: From the results of objective and subjective tests, it is shown that the characteristics of the synthetic speech are close to the target speaker's voice, and the speech generated from the adapted model set using 5 sentences has almost the same DMOS score as that from the speaker dependent model set.
Abstract: This paper describes a voice characteristics conversion technique for an HMM-based text-to-speech synthesis system. The system uses phoneme HMMs as the speech synthesis units, and voice characteristics conversion is achieved by changing HMM parameters appropriately. To transform the voice characteristics of synthetic speech to the target speaker, we apply an MLLR (Maximum Likelihood Linear Regression) technique, one of the speaker adaptation techniques, to the system. From the results of objective and subjective tests, it is shown that the characteristics of the synthetic speech are close to the target speaker's voice, and the speech generated from the adapted model set using 5 sentences has almost the same DMOS score as that from the speaker dependent model set.
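
At its core, MLLR rewrites each Gaussian mean as mu' = A mu + b, with the transform estimated by maximum likelihood from the target speaker's adaptation data. The sketch below assumes identity covariances and a single global regression class, in which case the ML estimate reduces to weighted least squares; real systems use per-class transforms and fold the covariances into the estimation.

```python
# Simplified global MLLR mean transform (identity covariances assumed).
import numpy as np

def estimate_mllr(means, obs_means, gamma):
    """means: (G, D) average-voice means; obs_means: (G, D) per-Gaussian
    averages of the target speaker's aligned frames; gamma: (G,) occupancies."""
    G, _ = means.shape
    xi = np.hstack([np.ones((G, 1)), means])       # extended means [1, mu]
    # W minimizes sum_g gamma_g * ||obs_g - xi_g W||^2.
    A = (xi * gamma[:, None]).T @ xi
    B = (xi * gamma[:, None]).T @ obs_means
    return np.linalg.solve(A, B)                   # W: (D+1, D)

def adapt_means(means, W):
    xi = np.hstack([np.ones((means.shape[0], 1)), means])
    return xi @ W                                  # mu' = A mu + b for all means
```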

Journal ArticleDOI
TL;DR: A speaker adaptation strategy is described that is based on finding a subset of speakers, from the training set, who are acoustically close to the test speaker, and using only the data from these speakers (rather than the complete training corpus) to reestimate the system parameters.
Abstract: A speaker adaptation strategy is described that is based on finding a subset of speakers, from the training set, who are acoustically close to the test speaker, and using only the data from these speakers (rather than the complete training corpus) to reestimate the system parameters. Further, a linear transformation is computed for every one of the selected training speakers to better map the training speaker's data to the test speaker's acoustic space. Finally, the system parameters (Gaussian means) are reestimated specifically for the test speaker using the transformed data from the selected training speakers. Experiments showed that this scheme is capable of providing an 18% relative improvement in the error rate on a large-vocabulary task with the use of as little as three sentences of adaptation data.
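
The selection step can be sketched as scoring the test speaker's adaptation data against a model of each training speaker and keeping the top-ranked subset; the model type and subset size below are assumptions, and the per-speaker transforms and mean re-estimation are omitted.

```python
# Illustrative "acoustically close speakers" selection (GMM scoring assumed).
from sklearn.mixture import GaussianMixture

def closest_speakers(adapt_feats, speaker_feats, n_keep=20):
    """speaker_feats: dict speaker_id -> (T, D) training frames."""
    scores = {spk: GaussianMixture(16, covariance_type="diag")
                     .fit(f).score(adapt_feats)   # avg log-likelihood of test data
              for spk, f in speaker_feats.items()}
    return sorted(scores, key=scores.get, reverse=True)[:n_keep]
```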

DOI
01 Jan 1998
TL;DR: The design of a multilingual speech recognizer is described, using an LVCSR dictation database collected under the project GlobalPhone and a global phoneme set, yielding a system which can handle five different languages.
Abstract: This paper describes the design of a multilingual speech recognizer using an LVCSR dictation database which has been collected under the project GlobalPhone. This project at the University of Karlsruhe investigates LVCSR systems in 15 languages of the world, namely Arabic, Chinese, Croatian, English, French, German, Italian, Japanese, Korean, Portuguese, Russian, Spanish, Swedish, Tamil, and Turkish. For our experiments we used six of these languages to train and test several recognition engines in monolingual, multilingual and crosslingual setups. Based on a global phoneme set we built a multilingual speech recognition system which can handle five different languages. The acoustic models of the five languages are combined into a monolithic system and context dependent phoneme models are created using language questions.

Patent
TL;DR: In this paper, a speaker recognition system for selectively permitting access by a requesting speaker to a service or facility includes an acoustic front-end for computing at least one feature vector from a speech utterance provided by the requesting speaker; a speaker dependent codebook store for pre-storing sets of acoustic features, in the form of codebooks, respectively corresponding to a pool of previously enrolled speakers; and a speaker identifier/verifier module operatively coupled to the acoustic front-end, which identifies, from identifying indicia provided by the requesting speaker, a previously enrolled speaker as the claimed speaker and verifies the requesting speaker against the codebooks of acoustically similar enrolled speakers.
Abstract: A speaker recognition system for selectively permitting access by a requesting speaker to one of a service and facility includes an acoustic front-end for computing at least one feature vector from a speech utterance provided by the requesting speaker; a speaker dependent codebook store for pre-storing sets of acoustic features, in the form of codebooks, respectively corresponding to a pool of previously enrolled speakers; a speaker identifier/verifier module operatively coupled to the acoustic front-end, wherein: the speaker identifier/verifier module identifies, from identifying indicia provided by the requesting speaker, a previously enrolled speaker as a claimed speaker; further, the speaker identifier/verifier module associates, with the claimed speaker, first and second groups of previously enrolled speakers, the first group being defined as speakers whose codebooks are respectively acoustically similar to the claimed speaker (i.e., cohort set) and the second group being defined as speakers whose codebooks are acoustically similar to the claimed speaker but not as acoustically similar as the codebooks of the speakers in the first group (i.e., legion set); and still further, the speaker identifier/verifier module verifies the requesting speaker by comparing the at least one feature vector of the requesting speaker to the codebooks of the previously enrolled speakers in the second group and, in response to such comparison, generates an indicator indicating that the requesting speaker is one of verified and not verified for access to one of the service and facility.

Proceedings ArticleDOI
Subhro Das, D. Nix, M. Picheny
12 May 1998
TL;DR: Comparative studies demonstrating the performance gain realized by adapting to children's acoustic and language model data to construct a children's speech recognition system are described.
Abstract: There are several reasons why conventional speech recognition systems modeled on adult data fail to perform satisfactorily on children's speech input. For instance, children's vocal characteristics differ significantly from those of adults. In addition, their choices of vocabulary and sentence construction modalities usually do not conform to adult patterns. We describe comparative studies demonstrating the performance gain realized by adapting to children's acoustic and language model data to construct a children's speech recognition system.

Patent
TL;DR: In this paper, speaker-dependent and speaker-independent speech recognition in a voice-controlled multi-station network is discussed; a fallback procedure is maintained for any particular station in order to cater for failure of the speaker-dependent recognition, whilst allowing reverting to the improvement procedure.
Abstract: A voice-controlled multi-station network has both speaker-dependent and speaker-independent speech recognition. Conditional on recognizing items of an applicable vocabulary, the network executes a particular function. The method receives a call from a particular origin and executes speaker-independent speech recognition on the call. In an improvement procedure, in case of successful determination of what has been said, a template associated with the recognized speech items is stored and assigned to the origin. Next, speaker-dependent recognition is applied, if feasible, for speech received from the same origin, using one or more templates associated with that station. Further, a fallback procedure to speaker-independent recognition is maintained for any particular station in order to cater for failure of the speaker-dependent recognition, whilst allowing reverting to the improvement procedure.

Journal ArticleDOI
TL;DR: A novel algorithm for reducing the computational complexity of identifying a speaker within a Gaussian mixture speaker model framework is presented and it is illustrated that rapid pruning of unlikely speaker model candidates can be achieved by reordering the time-sequence of observation vectors used to update the accumulated probability of each speaker model.
Abstract: This article presents a novel algorithm for reducing the computational complexity of identifying a speaker within a Gaussian mixture speaker model framework. For applications in which the entire observation sequence is known, we illustrate that rapid pruning of unlikely speaker model candidates can be achieved by reordering the time-sequence of observation vectors used to update the accumulated probability of each speaker model. The overall approach is integrated into a beam-search strategy and shown to reduce the time to identify a speaker by a factor of 140 over the standard full-search method, and by a factor of six over the standard beam-search method when identifying speakers from the 138 speaker YOHO corpus.
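
The pruning logic is straightforward to sketch: accumulate per-frame log-likelihoods for every candidate model and drop any candidate that falls more than a beam width below the current leader. The ordering proxy below (frame-wise score variance across models) is an assumption, as the paper's reordering criterion may differ, and all scores are precomputed here for clarity even though a real system would score only surviving models.

```python
# Illustrative reordered beam search over GMM speaker models.
import numpy as np

def identify(obs, gmms, beam=50.0):
    """obs: (T, D) frames; gmms: models with sklearn-style .score_samples()."""
    frame_ll = np.stack([g.score_samples(obs) for g in gmms])   # (S, T)
    # Present the most model-discriminative frames first so that weak
    # candidates can be pruned early (ordering criterion assumed).
    frame_ll = frame_ll[:, np.argsort(-frame_ll.var(axis=0))]
    total = np.zeros(len(gmms))
    alive = np.ones(len(gmms), dtype=bool)
    for t in range(frame_ll.shape[1]):
        total[alive] += frame_ll[alive, t]
        alive &= total >= total[alive].max() - beam             # beam pruning
    return int(np.argmax(np.where(alive, total, -np.inf)))
```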

Proceedings ArticleDOI
12 May 1998
TL;DR: In this article, a new approach for robust automatic speaker verification in adverse conditions is proposed based on the combination of speech enhancement using traditional spectral subtraction, and missing feature compensation to dynamically modify the probability computations performed in GMM recognizers.
Abstract: In the framework of Gaussian mixture models (GMMs), we present a new approach towards robust automatic speaker verification (SV) in adverse conditions. This new and simple approach is based on the combination of speech enhancement using traditional spectral subtraction, and missing feature compensation to dynamically modify the probability computations performed in GMM recognizers. The identity of the spectral features missing due to noise masking is provided by the spectral subtraction algorithm. Previous works have demonstrated that the missing feature modeling method succeeds in speech recognition with some artificially generated interruptions, filtering and noise. We show that this method also improves noise compensation techniques used for speaker verification in more realistic conditions.
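
With diagonal-covariance GMMs, the missing-feature computation is particularly simple: dimensions flagged as masked by noise are marginalized out, which for independent dimensions just means omitting them from each component's Gaussian. A minimal sketch, assuming the reliability mask is supplied by the spectral subtraction stage:

```python
# GMM frame log-likelihood with missing-feature marginalization
# (diagonal covariances assumed; mask comes from spectral subtraction).
import numpy as np

def gmm_loglik_masked(x, mask, weights, means, variances):
    """x: (D,) frame; mask: (D,) bool, True where the feature is reliable;
    weights: (K,); means, variances: (K, D)."""
    xr, mu, var = x[mask], means[:, mask], variances[:, mask]
    comp = -0.5 * (np.log(2 * np.pi * var) + (xr - mu) ** 2 / var).sum(axis=1)
    m = comp.max()
    return m + np.log(np.sum(weights * np.exp(comp - m)))   # log-sum-exp
```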

01 Jan 1998
TL;DR: Conference contribution on non-linear signal processing for speech processing; no abstract is available for this record.
Abstract: Keywords: Non-Linear Signal Processing; Speech Processing. Reference: LANOS-CONF-1998-004.

Proceedings Article
01 Jan 1998
TL;DR: This paper describes the design of a multilingual speech recognizer using an LVCSR dictation database which has been collected under the project GlobalPhone and presents several recognition results in language independent and language adaptive setups.
Abstract: This paper describes the design of a multilingual speech recognizer using an LVCSR dictation database which has been collected under the project GlobalPhone. This project at the University of Karlsruhe investigates LVCSR systems in 15 languages of the world, namely Arabic, Chinese, Croatian, English, French, German, Italian, Japanese, Korean, Portuguese, Russian, Spanish, Swedish, Tamil, and Turkish. Based on a global phoneme set we built different multilingual speech recognition systems for five of the 15 languages. Context dependent phoneme models are created data-driven by introducing questions about language and language groups to our polyphone clustering procedure. We apply the resulting multilingual models to unseen languages and present several recognition results in language independent and language adaptive setups.

Proceedings ArticleDOI
07 Dec 1998
TL;DR: Preliminary results on emotion recognition by machine from joint audiovisual input of facial video and speech show potential advantages in using both modalities over either modality alone.
Abstract: We report preliminary results on emotion recognition by machine from joint audiovisual input of facial video and speech. The results show potential advantages in using both modalities over either modality alone. The recognition rate for audio alone is about 75% and video alone about 70%. Using audiovisual data we achieved 97% without increasing the number of features. The improvement in performance is accredited to the complementary property between the two modalities. A possible application is in natural human-computer interfaces.

Patent
04 Feb 1998
TL;DR: In this article, a method of speech recognition, in accordance with the present invention, includes the steps of grouping acoustics to form classes based on acoustic features, clustering training speakers by the classes to provide class-specific cluster systems, selecting, from the cluster systems, a subset of cluster systems closest to adaptation data from a test speaker, transforming the subset of cluster systems to bring them closer to the test speaker based on the adaptation data to form adapted cluster systems, and combining the adapted cluster systems to create a speaker adapted system for decoding speech from the test speaker.
Abstract: A method of speech recognition, in accordance with the present invention, includes the steps of grouping acoustics to form classes based on acoustic features, clustering training speakers by the classes to provide class-specific cluster systems, selecting, from the cluster systems, a subset of cluster systems closest to adaptation data from a test speaker, transforming the subset of cluster systems to bring the subset of cluster systems closer to the test speaker based on the adaptation data to form adapted cluster systems, and combining the adapted cluster systems to create a speaker adapted system for decoding speech from the test speaker. Systems and methods for building speech recognition systems as well as adapting speaker systems for class-specific speaker clusters are included.

Proceedings ArticleDOI
12 May 1998
TL;DR: This paper presents a novel technique for the tracking and extraction of features from lips for the purpose of speaker identification, where syntactic information is derived from chromatic information in the lip region.
Abstract: This paper presents a novel technique for the tracking and extraction of features from lips for the purpose of speaker identification. In noisy or other adverse conditions, identification performance via the speech signal can degrade significantly, hence additional information which can complement the speech signal is of particular interest. In our system, syntactic information is derived from chromatic information in the lip region. A model of the lip contour is formed directly from the syntactic information, with no minimization procedure required to refine estimates. Colour features are then extracted from the lips via profiles taken around the lip contour. Further improvement in lip features is obtained via linear discriminant analysis (LDA). Speaker models are built from the lip features based on the Gaussian mixture model (GMM). Identification experiments are performed on the M2VTS database, with encouraging results.

Book
01 Jan 1998
TL;DR: This work presents perceptually inspired signal-processing strategies for robust speech recognition in reverberant environments.
Abstract: Perceptually Inspired Signal-processing Strategies for Robust Speech Recognition in Reverberant Environments

Patent
31 Jul 1998
TL;DR: In this paper, the authors used therapidly available speech recognition results to provide intelligent barge-in for voice-response systems and, to count words to output sub-sequences to provide paralleling and/or pipelining of tasks related to the entire word sequence to increase processing throughput.
Abstract: Speech recognition technology has attained maturity such that the most likely speech recognition result has been reached and is available before an energy based termination of speech has been made. The present invention innovatively uses therapidly available speech recognition results to provide intelligent barge-in forvoice-response systems and, to count words to output sub-sequences to provide paralleling and/or pipelining of tasks related to the entire word sequence to increase processing throughput.

Patent
23 Oct 1998
TL;DR: In this paper, superwords are used to refer to those word combinations which are so often spoken that they are recognized as units or should have models to reflect them in the language model.
Abstract: This invention is directed to the selection of superwords based on a criterion relevant to speech recognition and understanding. Superwords are used to refer to those word combinations which are so often spoken that they are recognized as units or should have models to reflect them in the language model. The selected superwords are placed in a lexicon along with selected meaningful phrases. The lexicon is then used by a speech recognizer to improve recognition of input speech utterances for the proper routing of a user's task objectives.

01 Jan 1998
TL;DR: In this paper, a nonlinear discriminant analysis (NLDA) technique was used to extract a speaker-discriminant feature set, which is optimized to discriminate between speakers and to be robust to mismatched training and testing conditions.
Abstract: We study a nonlinear discriminant analysis (NLDA) technique that extracts a speaker-discriminant feature set. Our approach is to train a multilayer perceptron (MLP) to maximize the separation between speakers by nonlinearly projecting a large set of acoustic features (e.g., several frames) to a lower-dimensional feature set. The extracted features are optimized to discriminate between speakers and to be robust to mismatched training and testing conditions. We train the MLP on a development set and apply it to the training and testing utterances. Our results show that by combining the NLDA-based system with a state-of-the-art cepstrum-based system we improve the speaker verification performance on the 1997 NIST Speaker Recognition Evaluation set by 15% on average compared with our cepstrum-only system.
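
The extractor can be sketched as an MLP with a narrow projection layer trained to classify development speakers; after training, the output layer is discarded and the projection serves as the feature transform. Layer sizes and training details below are assumptions.

```python
# Illustrative NLDA feature extractor (layer sizes assumed).
import torch
import torch.nn as nn

class NLDANet(nn.Module):
    def __init__(self, in_dim, bottleneck=24, n_speakers=100):
        super().__init__()
        self.project = nn.Sequential(                 # kept after training
            nn.Linear(in_dim, 256), nn.Tanh(),
            nn.Linear(256, bottleneck), nn.Tanh())
        self.classify = nn.Linear(bottleneck, n_speakers)  # discarded later

    def forward(self, x):
        return self.classify(self.project(x))

# Training (assumed): cross-entropy over development speakers, with x being
# several stacked acoustic frames, e.g. 9 x 13 cepstra:
# net = NLDANet(in_dim=9 * 13)
# loss = nn.CrossEntropyLoss()(net(frames), speaker_ids)
# Afterwards net.project(frames) yields features for the GMM back end.
```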

Proceedings Article
16 Sep 1998
TL;DR: This paper examines an adaptation scheme requiring very few parameters to adapt the models, cluster adaptive training (CAT), and finds that on a speaker-independent task CAT reduced the word error rate using very little adaptation data.
Abstract: When performing speaker adaptation there are two conflicting requirements. First, the transform must be powerful enough to represent the speaker. Second, the transform must be quickly and easily estimated for any particular speaker. Recently the most popular adaptation schemes have used many parameters to adapt the models. This limits how rapidly the models may be adapted. This paper examines an adaptation scheme requiring very few parameters to adapt the models: cluster adaptive training (CAT). CAT may be viewed as a simple extension to speaker clustering. Rather than selecting one cluster, a linear interpolation of all the cluster means is used as the mean of the particular speaker. This scheme naturally falls into an adaptive training framework. Maximum likelihood estimates of the interpolation weights are given. Furthermore, simple re-estimation formulae for cluster means, represented both explicitly and by sets of transforms of some canonical mean, are given. On a speaker-independent task CAT reduced the word error rate using very little adaptation data. In addition, when combined with other adaptation schemes it gave a 5% reduction in word error rate over adapting a speaker-independent model set.
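
The key CAT computation is the weight estimate itself. Reduced to a single Gaussian with identity covariance, the ML interpolation weights solve a small least-squares problem over the cluster means; the sketch below shows that reduction, whereas real CAT accumulates the corresponding statistics over all states and mixture components.

```python
# Minimal CAT weight estimate for one speaker (single Gaussian,
# identity covariance assumed; real CAT sums stats over all mixtures).
import numpy as np

def cat_weights(cluster_means, frames):
    """cluster_means: (C, D); frames: (T, D) adaptation data.
    Models the speaker mean as an interpolation M w of the cluster means."""
    M = cluster_means.T                        # (D, C)
    target = frames.mean(axis=0)               # sufficient statistic
    w, *_ = np.linalg.lstsq(M, target, rcond=None)
    return w

def speaker_mean(cluster_means, w):
    return cluster_means.T @ w                 # interpolated speaker mean
```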