
Showing papers on "Voice activity detection published in 1991"


Journal ArticleDOI
TL;DR: Equilibrium point analysis is used to evaluate system behavior in a packet reservation multiple access (PRMA) protocol based network and the probability of packet dropping given the number of simultaneous conversations is derived.
Abstract: Equilibrium point analysis is used to evaluate system behavior in a packet reservation multiple access (PRMA) protocol based network. The authors derive the probability of packet dropping given the number of simultaneous conversations. The authors establish conditions for system stability and efficiency. Numerical calculations based on the theory show close agreement with computer simulations. They also provide valuable guides to system design. Because PRMA is a statistical multiplexer, the channel becomes congested when too many terminals are active. For a particular example it is shown that speech activity detection permits 37 speech terminals to share a PRMA channel with 20 slots per frame, with a packet dropping probability of less than 1%.

483 citations


BookDOI
01 May 1991
TL;DR: This dissertation describes a number of algorithms developed to increase the robustness of automatic speech recognition systems with respect to changes in the environment, including SNR-Dependent Cepstral Normalization (SDCN) and Codeword-Dependent Cepstral Normalization (CDCN).
Abstract: This dissertation describes a number of algorithms developed to increase the robustness of automatic speech recognition systems with respect to changes in the environment. These algorithms attempt to improve the recognition accuracy of speech recognition systems when they are trained and tested in different acoustical environments, and when a desk-top microphone (rather than a close-talking microphone) is used for speech input. Without such processing, mismatches between training and testing conditions produce an unacceptable degradation in recognition accuracy. Two kinds of environmental variability are introduced by the use of desk-top microphones and different training and testing conditions: additive noise and spectral tilt introduced by linear filtering. An important attribute of the novel compensation algorithms described in this thesis is that they provide joint rather than independent compensation for these two types of degradation. Acoustical compensation is applied in our algorithms as an additive correction in the cepstral domain. This allows a higher degree of integration within SPHINX, the Carnegie Mellon speech recognition system, that uses the cepstrum as its feature vector. Therefore, these algorithms can be implemented very efficiently. Processing in many of these algorithms is based on instantaneous signal-to-noise ratio (SNR), as the appropriate compensation represents a form of noise suppression at low SNRs and spectral equalization at high SNRs. The compensation vectors for additive noise and spectral transformations are estimated by minimizing the differences between speech feature vectors obtained from a "standard" training corpus of speech and feature vectors that represent the current acoustical environment. In our work this is accomplished by minimizing the distortion of vector-quantized cepstra that are produced by the feature extraction module in SPHINX. 
In this dissertation we describe several algorithms including the SNR-Dependent Cepstral Normalization (SDCN) and the Codeword-Dependent Cepstral Normalization (CDCN). With CDCN, the accuracy of SPHINX when trained on speech recorded with a close-talking microphone and tested on speech recorded with a desk-top microphone is essentially the same as that obtained when the system is trained and tested on speech from the desk-top microphone. An algorithm for frequency normalization has also been proposed, in which the parameter of the bilinear transformation that is used by the signal-processing stage to produce frequency warping is adjusted for each new speaker and acoustical environment. The optimum value of this parameter is again chosen to minimize the vector-quantization distortion between the standard environment and the current one. In preliminary studies, use of this frequency normalization produced a moderate additional decrease in the observed error rate.
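The SNR-dependent additive correction in the cepstral domain can be sketched as a table lookup keyed by each frame's instantaneous SNR. This is only an illustrative reduction of the SDCN idea, with hypothetical names and shapes; in the thesis the correction vectors are learned by minimizing VQ distortion between environments:

```python
import numpy as np

def sdcn_compensate(cepstra, frame_snr_db, correction_table, snr_bins):
    """Apply an SNR-dependent additive correction in the cepstral domain.

    cepstra:          (T, D) array of per-frame cepstral vectors
    frame_snr_db:     length-T sequence of instantaneous SNR estimates
    correction_table: (len(snr_bins)+1, D) correction vectors, one per SNR bin
                      (hypothetical layout; learned offline in the real system)
    snr_bins:         sorted bin edges in dB
    """
    out = np.empty_like(cepstra)
    for t, (c, snr) in enumerate(zip(cepstra, frame_snr_db)):
        # pick the correction vector for this frame's SNR bin
        i = int(np.clip(np.searchsorted(snr_bins, snr),
                        0, len(correction_table) - 1))
        out[t] = c + correction_table[i]
    return out
```

Because the compensation is a simple per-frame addition, it integrates cheaply into any cepstrum-based front end, which is the efficiency point the abstract makes about SPHINX.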

474 citations


Journal ArticleDOI
TL;DR: The influence of several variables on PRMA efficiency, defined as the number of conversations per channel, is examined and it is found that with 32-kb/s speech coding and 720-kb/s transmission (22.5 channels), PRMA supports up to 37 simultaneous conversations, or 1.64 conversations per channel.
Abstract: Packet-reservation multiple access (PRMA) is viewed as a merger of slotted ALOHA and time-division multiple access (TDMA). Dispersed terminals transmit packets of speech information to a central base station. When its speech activity detector indicates the beginning of a talkspurt, a terminal contends with other terminals for access to an available time slot. After the base station detects the first packet in the talkspurt, the terminal reserves future time slots for transmission of subsequent speech packets. The influence of several variables on PRMA efficiency, defined as the number of conversations per channel, is examined. The number of channels is the ratio of transmission rate to speech coding rate. It is found that with 32-kb/s speech coding and 720-kb/s transmission (22.5 channels), PRMA supports up to 37 simultaneous conversations, or 1.64 conversations per channel. The number of conversations per channel is at least 1.5 over a wide range of packet sizes (8 ms of speech per packet to 34 ms) and for all systems with 16 or more channels (transmission rate >or=512 kb/s, with 32-kb/s speech coding). Other factors studied are the sensitivity of the speech activity detector, the retransmission probability of the contention scheme, and the maximum time delay for the transmission of speech packets.
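The efficiency figure quoted in the abstract follows directly from its definitions: the number of channels is the transmission rate divided by the speech coding rate, and efficiency is simultaneous conversations per channel. A minimal sketch (the function name is illustrative):

```python
def prma_efficiency(transmission_rate_kbps, coding_rate_kbps, conversations):
    """Channels = transmission rate / speech coding rate;
    efficiency = simultaneous conversations per channel."""
    channels = transmission_rate_kbps / coding_rate_kbps
    return channels, conversations / channels

# The paper's example: 720 kb/s transmission, 32 kb/s coding, 37 conversations
channels, eff = prma_efficiency(720, 32, 37)
# channels = 22.5, eff ~ 1.64
```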

433 citations


PatentDOI
TL;DR: A speech recognition apparatus having reference pattern adaptation stores a plurality of reference patterns representing speech to be recognized, each stored reference pattern having associated therewith a quality value representing the effectiveness of that pattern for recognizing an incoming speech utterance.
Abstract: A speech recognition apparatus having reference pattern adaptation stores a plurality of reference patterns representing speech to be recognized, each stored reference pattern having associated therewith a quality value representing the effectiveness of that pattern for recognizing an incoming speech utterance. The method and apparatus provide user correction actions representing the accuracy of a speech recognition, dynamically, during the recognition of unknown incoming speech utterances and after training of the system. The quality values are updated, during the speech recognition process, for at least a portion of those reference patterns used during the speech recognition process. Reference patterns having low quality values, indicative of either inaccurate representation of the unknown speech or non-use, can be deleted so long as the reference pattern is not needed, for example, where the reference pattern is the last instance of a known word or phrase. Various methods and apparatus are provided for determining when reference patterns can be deleted or added, to the reference memory, and when the scores or values associated with a reference pattern should be increased or decreased to represent the "goodness" of the reference pattern in recognizing speech.

263 citations


Patent
28 Oct 1991
TL;DR: In this paper, a CELP speech processor utilizes an organized, non-overlapping, algebraic codebook containing a predetermined number of vectors, uniformly distributed over a multi-dimensional sphere to generate a remaining speech residual.
Abstract: Apparatus and method for encoding speech using a codebook excited linear predictive (CELP) speech processor and an algebraic codebook for use therewith. The CELP speech processor receives a digital speech input representative of human speech and performs linear predictive code analysis and perceptual weighting filtering to produce a short term speech information and a long term speech information. The CELP speech processor utilizes an organized, non-overlapping, algebraic codebook containing a predetermined number of vectors, uniformly distributed over a multi-dimensional sphere, to generate a remaining speech residual. The short term speech information, long term speech information and remaining speech residual are combinable to form a quality reproduction of the digital speech input.

230 citations


PatentDOI
Juin-Hwey Chen1
TL;DR: In this paper, a low-bitrate (typically 8 kbit/s or less), low-delay digital coder and decoder based on Code Excited Linear Prediction for speech and similar signals features backward adaptive adjustment for codebook gain and short-term synthesis filter parameters and forward adaptive adjustment of long-term (pitch) synthesis filter parameters.
Abstract: A low-bitrate (typically 8 kbit/s or less), low-delay digital coder and decoder based on Code Excited Linear Prediction for speech and similar signals features backward adaptive adjustment for codebook gain and short-term synthesis filter parameters and forward adaptive adjustment of long-term (pitch) synthesis filter parameters. A highly efficient, low delay pitch parameter derivation and quantization permits overall delay which is a fraction of prior coding delays for equivalent speech quality at low bitrates.

166 citations


Journal ArticleDOI
05 Jul 1991-Science
TL;DR: When speech signals were modulated into the ultrasonic range, listening to words resulted in the clear perception of the speech stimuli and not a sense of high-frequency vibration.
Abstract: Bone-conducted ultrasonic hearing has been found capable of supporting frequency discrimination and speech detection in normal, older hearing-impaired, and profoundly deaf human subjects. When speech signals were modulated into the ultrasonic range, listening to words resulted in the clear perception of the speech stimuli and not a sense of high-frequency vibration. These data suggest that ultrasonic bone conduction hearing has potential as an alternative communication channel in the rehabilitation of hearing disorders.

145 citations


Proceedings ArticleDOI
14 Apr 1991
TL;DR: An efficient procedure for searching such a large codebook deploying a focused search strategy, where less than 0.1% of the codebook is searched with performance very close to that of a full search is described.
Abstract: The application of algebraic code excited linear prediction (ACELP) coding to wideband speech is presented. An algebraic codebook with a 20 bit address can be used without any storage requirements and, more importantly, with a very efficient search procedure which allows for real-time implementation. The authors describe an efficient procedure for searching such a large codebook deploying a focused search strategy, where less than 0.1% of the codebook is searched with performance very close to that of a full search. High-quality speech at a bit rate of 13 kb/s was obtained.

114 citations


Journal ArticleDOI
TL;DR: Recent advances in and perspectives of research on speaker-dependent-feature extraction from speech waves, automatic speaker identification and verification, speaker adaptation in speech recognition, and voice conversion techniques are discussed.

108 citations


Journal ArticleDOI
TL;DR: Some methods of supporting voice in broadband ISDN, (B-ISDN) asynchronous transfer mode (ATM), including voice compression, are examined and possible approaches for packetization and implementation of variable-bit-rate voice coding schemes are described.
Abstract: Some methods of supporting voice in broadband ISDN (B-ISDN) asynchronous transfer mode (ATM), including voice compression, are examined. Techniques for voice compression with variable-length packet format at DS1 transmission rate, e.g., wideband packet technology (WPT), have been successfully implemented utilizing embedded adaptive differential pulse code modulation (ADPCM) coding, digital speech interpolation (DSI), and block-dropping schemes. For supporting voice in B-ISDN, voice compression techniques are considered that are similar to those used in WPT but with different packetization and congestion control methods designed for the fixed-length ATM protocol at high speeds. Possible approaches for packetization and implementation of variable-bit-rate voice coding schemes are described. ADPCM and DSI for voice coding and compression and cell discarding (CD) for congestion control are considered. The advantages of voice compression and CD in broadband ATM networks are demonstrated in terms of transmission bandwidth savings and resiliency of the network during congestion.

96 citations


PatentDOI
TL;DR: A speech coder apparatus operates to compress speech signals to a low bit rate and includes a continuous speech recognizer (CSR) which has a memory for storing templates.
Abstract: A speech coder apparatus operates to compress speech signals to a low bit rate. The apparatus includes a continuous speech recognizer (CSR) which has a memory for storing templates. Input speech is processed by the CSR where information in the speech is compared against the templates to provide an output digital signal indicative of recognized words, which signal is transmitted along a first path. There is further included a front end processor which is also responsive to the input speech signal for providing output digitized speech samples during a given frame interval. A side information encoder circuit responds to the output from the front end processor to provide at the output of the encoder a parameter signal indicative of the value of the pitch and word duration for each word as recognized by the CSR unit. The output of the encoder is transmitted as a second signal. There is a receiver which includes a synthesizer responsive to the first and second transmitted signals for providing an output synthesized signal for each recognized word where the pitch, duration and amplitude of the synthesized signal is changed according to the parameter signal to preserve the quality of the synthesized speech.

PatentDOI
TL;DR: In this article, an adaptive filtering technique is applied to sequences of energy estimates in each of two signal channels, one channel containing speech and environmental noise and the other channel containing primarily the same environmental noise.
Abstract: A digital signal processing system applies an adaptive filtering technique to sequences of energy estimates in each of two signal channels, one channel containing speech and environmental noise and the other channel containing primarily the same environmental noise. From the channel containing primarily environmental noise, a prediction is made of the energy of that noise in the channel containing both the speech and that noise, so that the noise can be extracted from the mixture of speech and noise. The result is that the speech will be more easily recognizable by either human listeners or speech recognition systems.
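The two-channel idea in this abstract, predicting the noise energy in the speech-plus-noise channel from the noise-only channel, can be approximated with an LMS-adapted predictor on the energy sequences. A minimal sketch under stated assumptions, not the patented system; the function name, LMS choice, filter order, and step size are all illustrative:

```python
import numpy as np

def lms_noise_subtract(primary_energy, reference_energy, mu=0.01, order=4):
    """Predict the noise energy in the speech+noise channel from the
    noise-only reference channel with an LMS-adapted FIR filter, then
    subtract the prediction to leave an estimate of the speech energy."""
    w = np.zeros(order)                            # adaptive filter weights
    cleaned = np.zeros_like(primary_energy)
    for n in range(order, len(primary_energy)):
        x = reference_energy[n - order:n][::-1]    # most recent reference samples
        noise_hat = w @ x                          # predicted noise energy
        e = primary_energy[n] - noise_hat          # residual ~ speech energy
        cleaned[n] = max(e, 0.0)                   # energies cannot go negative
        w += mu * e * x                            # LMS weight update
    return cleaned
```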

Journal ArticleDOI
TL;DR: A new method is presented based on the assumption that, for voiced speech, a perceptually accurate speech signal can be reconstructed from a description of the waveform of a single, representative pitch cycle per 20-30 ms interval; the method retains the natural quality of coders that encode the entire waveform, but requires a bit rate close to that of parametric coders.

Proceedings ArticleDOI
14 Apr 1991
TL;DR: The techniques and experiments described are the first demonstration of a complete system that accepts speech messages as input and produces an estimated message class as output; they demonstrate the feasibility of the technology and illustrate the need for further work.
Abstract: The components of a speech message information retrieval system include an acoustic front end which provides an incomplete transcription of a spoken message, and a message classifier that interprets the incomplete transcription and classifies the message according to message category. The techniques and experiments described are concerned with the integration of these components and represent the first demonstration of a complete system that accepts speech messages as input and produces an estimated message class as output. The complete system has been implemented on special-purpose digital signal processing hardware and demonstrated using live speech input. The results obtained on a conversational speech task have demonstrated the feasibility of the technology and also illustrate the need for further work. Even with a perfect acoustic front end, a message classification accuracy of only 78% was obtained with a 126 keyword vocabulary.

Patent
23 Dec 1991
TL;DR: In this article, variable hangover time is provided for a speech coder: a voice activity detector (VAD) detects voice activity within a speech message, and a variable hangover time is calculated and appended to the detection period.
Abstract: Variable hangover time is provided for a speech coder (105). Voice activity within a speech message is detected (209) using a voice activity detector (VAD) (107), and a signal-to-noise ratio is calculated. A variable hangover time is calculated (215) and appended to the time in which voice activity is detected, producing an extended voice detection period. The speech coder (105) is enabled only during the extended voice detection period, thus saving power.
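The variable-hangover scheme can be sketched as an SNR-to-hangover mapping applied to per-frame VAD decisions. The linear mapping and parameter values below are illustrative assumptions; the patent's exact formula is not given in the abstract:

```python
def extended_voice_period(vad_flags, snr_db, frame_ms=20,
                          min_hangover_ms=40, max_hangover_ms=200):
    """Append an SNR-dependent hangover to each detected voice run.

    Lower SNR -> longer hangover (harder to detect trailing speech),
    higher SNR -> shorter hangover. Returns the extended detection flags,
    i.e. the period during which the speech coder stays enabled.
    """
    # Illustrative mapping: 0 dB -> max hangover, >= 30 dB -> min hangover
    snr = max(0.0, min(30.0, snr_db))
    hang_ms = max_hangover_ms - (max_hangover_ms - min_hangover_ms) * snr / 30.0
    hang_frames = round(hang_ms / frame_ms)

    out = list(vad_flags)
    countdown = 0
    for i, voiced in enumerate(vad_flags):
        if voiced:
            countdown = hang_frames        # re-arm the hangover timer
        elif countdown > 0:
            out[i] = True                  # keep the coder enabled
            countdown -= 1
    return out
```

Since the coder is enabled only during the extended detection period, shrinking the hangover at high SNR is what yields the power saving the abstract mentions.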

Proceedings ArticleDOI
14 Apr 1991
TL;DR: The test results show that the IMBE system is a viable alternative to CELP based speech coders and has the best performance of the systems tested.
Abstract: A 6.4 kb/s improved multiband excitation (IMBE) speech coder is presented. This speech coder combines high speech quality with a robustness to channel impairments which is necessary for successful operation in a mobile communication environment. MOS (mean opinion score) results for the IMBE speech coder are compared against those of four 6.4-kb/s CELP (code excited linear prediction) based speech coders which were tested as part of the INMARSAT-M voice codec evaluation. The IMBE system yielded the best performance of the systems tested. It received an MOS score of 3.4 at both 0% and 1% bit error rate. The test results show that the IMBE system is a viable alternative to CELP based speech coders.


Proceedings ArticleDOI
14 Apr 1991
TL;DR: The proposed voice conversion algorithm was used with two male speakers and, in terms of speaker identification accuracy, the speech converted by segment-sized units gave a score 20% higher than thespeech converted frame-by-frame.
Abstract: A voice conversion algorithm that uses speech segments as conversion units is proposed. Input speech is decomposed into speech segments by a speech recognition module, and the segments are replaced by speech segments uttered by another speaker. This algorithm makes it possible to convert not only the static characteristics but also the dynamic characteristics of speaker individuality. The proposed voice conversion algorithm was used with two male speakers. Spectrum distortion between target speech and the converted speech was reduced to one-third the natural spectrum distortion between the two speakers. A listening experiment showed that, in terms of speaker identification accuracy, the speech converted by segment-sized units gave a score 20% higher than the speech converted frame-by-frame.

Proceedings ArticleDOI
14 Apr 1991
TL;DR: The authors already have a state-of-the-art speaker-independent speech recognition system, SPHINX, and extended it to speaker-dependent speech recognition, which demonstrated a substantial difference between speaker-dependent and -independent systems.
Abstract: The DARPA Resource Management task is used as the domain to investigate the performance of speaker-independent, speaker-dependent, and speaker-adaptive speech recognition. The authors already have a state-of-the-art speaker-independent speech recognition system, SPHINX. The error rate for the RM2 test set is 4.3%. They extended SPHINX to speaker-dependent speech recognition. The error rate is reduced to 1.4-2.6% with 600-2400 training sentences for each speaker, which demonstrated a substantial difference between speaker-dependent and -independent systems. Based on speaker-independent models, a study was made of speaker-adaptive speech recognition. With 40 adaptation sentences for each speaker, the error rate can be reduced from 4.3% to 3.1%.

PatentDOI
Ira A. Gerson1, Mark A. Jasiuk1
TL;DR: In a speech coder, excitation source gain information is transmitted along with a coding mode indicator that indicates how the gain information has been interpreted and which of a plurality of excitation sources are utilized when synthesizing the speech.
Abstract: In a speech coder (100), excitation source gain information (802) is transmitted along with a coding mode indicator. The coding mode indicator indicates how the gain information is to be interpreted. In one embodiment, the coding mode indicator can also be utilized to control which of a plurality of excitation sources (202, 206-208) are utilized when synthesizing the speech. The coding mode itself is selected as a function of the periodicity of an input speech signal.

PatentDOI
Hideki Satoh1, Tsuneo Nitta1
TL;DR: In this article, a speech detection apparatus capable of reliably detecting speech segments in audio signals regardless of the levels of input audio signals and background noises is presented.
Abstract: A speech detection apparatus capable of reliably detecting speech segments in audio signals regardless of the levels of input audio signals and background noises. In the apparatus, a parameter of input audio signals is calculated frame by frame, and then compared with a threshold in order to judge each input frame as either a speech segment or a noise segment, while the parameters of the input frames judged as noise segments are stored in the buffer and the threshold is updated according to the parameters stored in the buffer. The apparatus may utilize a transformed parameter obtained from the parameter, in which the difference between speech and noise is emphasized, and noise standard patterns are constructed from the parameters of the input frames pre-estimated as noise segments.
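The threshold-update loop this abstract describes can be sketched as follows. The margin factor and buffer length are illustrative assumptions; the point is that the threshold tracks the recent noise level, making the detector insensitive to the absolute input level:

```python
from collections import deque

def detect_segments(frame_energies, init_threshold, margin=2.0, buf_len=50):
    """Classify each frame as speech or noise against a threshold that is
    continually re-estimated from a buffer of recent noise-frame energies."""
    noise_buf = deque(maxlen=buf_len)   # energies of frames judged as noise
    threshold = init_threshold
    labels = []
    for e in frame_energies:
        if e > threshold:
            labels.append("speech")
        else:
            labels.append("noise")
            noise_buf.append(e)
            # re-estimate the threshold from the running noise level
            threshold = margin * sum(noise_buf) / len(noise_buf)
    return labels
```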

Proceedings ArticleDOI
14 Apr 1991
TL;DR: A novel synthesizer structure for an LPC (linear predictive coding) vocoder is introduced which increases the clarity and naturalness of the output speech and replaces the traditional binary voicing decision with more robust periodicity, peakiness, and power level detectors.
Abstract: The authors introduce a novel synthesizer structure for an LPC (linear predictive coding) vocoder which increases the clarity and naturalness of the output speech. This synthesizer enhances the usual excitations of either periodic pulses or white noise by allowing pulse/noise mixtures and aperiodic pulses, and thus can generate a wider range of possible speech signals. The control algorithms for this new model replace the traditional binary voicing decision with more robust periodicity, peakiness, and power level detectors, without a significant increase in bit rate. As a result, the vocoder produces synthetic speech which is free of the usual LPC synthesis artifacts, even at bit rates below 2400 bps.
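The mixed pulse/noise excitation with aperiodic (jittered) pulses can be sketched as below. The voicing-ratio and jitter parameterization is an illustrative assumption, not the paper's control algorithm, which derives its decisions from periodicity, peakiness, and power detectors:

```python
import numpy as np

def mixed_excitation(n_samples, pitch_period, voicing_ratio, jitter=0.0, seed=0):
    """Build an LPC excitation that mixes a periodic pulse train with white
    noise; jitter > 0 perturbs pulse spacing to give aperiodic pulses."""
    rng = np.random.default_rng(seed)
    pulses = np.zeros(n_samples)
    pos = 0.0
    while pos < n_samples:
        pulses[int(pos)] = 1.0
        # perturb the next pulse position for aperiodic voicing
        step = pitch_period * (1.0 + jitter * rng.uniform(-1, 1))
        pos += max(1.0, step)
    noise = rng.standard_normal(n_samples)
    # voicing_ratio = 1 -> purely periodic, 0 -> purely noise-like
    return voicing_ratio * pulses + (1.0 - voicing_ratio) * noise
```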

Journal ArticleDOI
TL;DR: A neural approach to improving the performance of an automatic speech recognition system for unrestricted speakers by using not only voice sound features but also image features of the mouth shape; the approach can be applied not only to improving voice recognition but also to aiding communication by hearing-impaired people.

Journal ArticleDOI
TL;DR: The methods and motivation for VAA data collection and validation procedures, the current contents of the database, and the results of exploratory research on a 1088-speaker subset of the database are described.

Proceedings ArticleDOI
S. Nanda1, On-Ching Yue1
02 Dec 1991
TL;DR: A scheme for almost doubling the capacity of wireless communication systems by speech activity detection by dynamically varying the bandwidth assigned to the two parties in a TDMA (time division multiaccess) system.
Abstract: A scheme for almost doubling the capacity of wireless communication systems by speech activity detection is proposed. This scheme is called variable partition duplexing (VPD). The key observation is that in conversational speech, except for small overlaps, only one of the two parties is talking at any given time. VPD attempts to use this observation to advantage by dynamically varying the bandwidth assigned to the two parties in a TDMA (time division multiaccess) system.

Proceedings ArticleDOI
19 Feb 1991
TL;DR: The results of several field trials suggest that real user compliance with instructions is dramatically affected by the particular details of the prompts supplied to the user.
Abstract: Performance estimates given for speech recognition/understanding systems are typically based on the assumption that users will behave in ways similar to the observed behavior of laboratory volunteers. This includes the acoustic/phonetic characteristics of the speech they produce as well as their willingness and ability to constrain their input to the device according to instructions. Since speech recognition devices often do not perform as well in the field as they do in the laboratory, analyses of real user behavior have been undertaken. The results of several field trials suggest that real user compliance with instructions is dramatically affected by the particular details of the prompts supplied to the user. A significant amount of real user speech data has been collected during these trials (34,000 utterances, 29 hours of data). These speech databases are described along with the results of an experiment comparing the performance of a speech recognition system on real user vs. laboratory speech.

Proceedings ArticleDOI
14 Apr 1991
TL;DR: A speech recognition system using word-spotting with noise immunity learning has been developed to achieve robust performance under noisy environments and employs an accelerator for reducing processing time.
Abstract: A speech recognition system using word-spotting with noise immunity learning has been developed to achieve robust performance under noisy environments. The system employs word-spotting based on the multiple similarity (MS) method for eliminating word boundary detection errors, noise immunity learning for improving noise robustness, and an accelerator for reducing processing time. Noise immunity learning is performed using noisy speech data and noise data. Data from 39 male speakers were used to evaluate the recognition performance; the remaining data were used for the learning. Recognition scores obtained by word-spotting alone and with noise immunity learning were 88.5% and 98.4%, respectively, for an SNR of 10 dB.


Proceedings ArticleDOI
14 Apr 1991
TL;DR: Efficiency in the adaptive incremental training using a small number of training tokens extracted from continuous speech was confirmed in the TDNN-LR system and provides large-vocabulary and continuous speech recognition.
Abstract: An investigation of speech recognition and language processing is described. The speech recognition part consists of the large phonemic time-delay neural networks (TDNNs) which can automatically spot all 24 Japanese phonemes by simply scanning input speech. The language processing part is made up of a predictive LR parser which predicts subsequent phonemes based on the currently proposed phonemes. This TDNN-LR recognition system provides large-vocabulary and continuous speech recognition. Recognition experiments for ATR's conference registration task were performed using the TDNN-LR method. Speaker-dependent phrase recognition rates of 65.1% for the first choice and 88.8% within the top five choices were attained. Also, efficiency in the adaptive incremental training using a small number of training tokens extracted from continuous speech was confirmed in the TDNN-LR system.

Proceedings ArticleDOI
14 Apr 1991
TL;DR: It is shown that for recognition based on the combination of the first two regression features with the static cepstral coefficients, increasing the time length to more than 200 ms, using all of the frames in this time interval, resulted in the highest recognition rates for noisy-Lombard test speech.
Abstract: It is proposed that the number of speech analysis frames used in calculating regression features should be controlled separately from the time length over which the features are calculated. Regression features are used to represent the first two time derivatives of the speech cepstrum in a speaker-independent, isolated-word recognition task. The recognition system is trained on normal (noise-free, non-Lombard) speech, but tested on normal, noisy, Lombard, or noisy-Lombard speech. It is shown that for recognition based on the combination of the first two regression features with the static cepstral coefficients, increasing the time length to more than 200 ms, using all of the frames in this time interval, resulted in the highest recognition rates for noisy-Lombard test speech. >