
Showing papers on "Voice activity detection published in 1991"


Journal ArticleDOI
TL;DR: Equilibrium point analysis is used to evaluate system behavior in a packet reservation multiple access (PRMA) protocol based network and the probability of packet dropping given the number of simultaneous conversations is derived.
Abstract: Equilibrium point analysis is used to evaluate system behavior in a packet reservation multiple access (PRMA) protocol based network. The authors derive the probability of packet dropping given the number of simultaneous conversations. The authors establish conditions for system stability and efficiency. Numerical calculations based on the theory show close agreement with computer simulations. They also provide valuable guides to system design. Because PRMA is a statistical multiplexer, the channel becomes congested when too many terminals are active. For a particular example it is shown that speech activity detection permits 37 speech terminals to share a PRMA channel with 20 slots per frame, with a packet dropping probability of less than 1%.

483 citations


BookDOI
01 May 1991
TL;DR: This dissertation describes a number of algorithms developed to increase the robustness of automatic speech recognition systems with respect to changes in the environment, including SNR-Dependent Cepstral Normalization (SDCN) and Codeword-Dependent Cepstral Normalization (CDCN).
Abstract: This dissertation describes a number of algorithms developed to increase the robustness of automatic speech recognition systems with respect to changes in the environment. These algorithms attempt to improve the recognition accuracy of speech recognition systems when they are trained and tested in different acoustical environments, and when a desk-top microphone (rather than a close-talking microphone) is used for speech input. Without such processing, mismatches between training and testing conditions produce an unacceptable degradation in recognition accuracy. Two kinds of environmental variability are introduced by the use of desk-top microphones and different training and testing conditions: additive noise and spectral tilt introduced by linear filtering. An important attribute of the novel compensation algorithms described in this thesis is that they provide joint rather than independent compensation for these two types of degradation. Acoustical compensation is applied in our algorithms as an additive correction in the cepstral domain. This allows a higher degree of integration within SPHINX, the Carnegie Mellon speech recognition system, that uses the cepstrum as its feature vector. Therefore, these algorithms can be implemented very efficiently. Processing in many of these algorithms is based on instantaneous signal-to-noise ratio (SNR), as the appropriate compensation represents a form of noise suppression at low SNRs and spectral equalization at high SNRs. The compensation vectors for additive noise and spectral transformations are estimated by minimizing the differences between speech feature vectors obtained from a "standard" training corpus of speech and feature vectors that represent the current acoustical environment. In our work this is accomplished by minimizing the distortion of vector-quantized cepstra that are produced by the feature extraction module in SPHINX. 
In this dissertation we describe several algorithms including the SNR-Dependent Cepstral Normalization (SDCN) and the Codeword-Dependent Cepstral Normalization (CDCN). With CDCN, the accuracy of SPHINX when trained on speech recorded with a close-talking microphone and tested on speech recorded with a desk-top microphone is essentially the same as that obtained when the system is trained and tested on speech from the desk-top microphone. An algorithm for frequency normalization has also been proposed, in which the parameter of the bilinear transformation that is used by the signal-processing stage to produce frequency warping is adjusted for each new speaker and acoustical environment. The optimum value of this parameter is again chosen to minimize the vector-quantization distortion between the standard environment and the current one. In preliminary studies, use of this frequency normalization produced a moderate additional decrease in the observed error rate.
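The SNR-dependent additive correction in the cepstral domain can be sketched as a table lookup keyed by each frame's instantaneous SNR. This is only an illustrative reduction of the SDCN idea, with hypothetical names and shapes; in the thesis the correction vectors are learned by minimizing VQ distortion between environments:

```python
import numpy as np

def sdcn_compensate(cepstra, frame_snr_db, correction_table, snr_bins):
    """Apply an SNR-dependent additive correction in the cepstral domain.

    cepstra:          (T, D) array of per-frame cepstral vectors
    frame_snr_db:     length-T sequence of instantaneous SNR estimates
    correction_table: (len(snr_bins)+1, D) correction vectors, one per SNR bin
                      (hypothetical layout; learned offline in the real system)
    snr_bins:         sorted bin edges in dB
    """
    out = np.empty_like(cepstra)
    for t, (c, snr) in enumerate(zip(cepstra, frame_snr_db)):
        # pick the correction vector for this frame's SNR bin
        i = int(np.clip(np.searchsorted(snr_bins, snr),
                        0, len(correction_table) - 1))
        out[t] = c + correction_table[i]
    return out
```

Because the compensation is a simple per-frame addition, it integrates cheaply into any cepstrum-based front end, which is the efficiency point the abstract makes about SPHINX.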

474 citations


Journal ArticleDOI
TL;DR: The influence of several variables on PRMA efficiency, defined as the number of conversations per channel, is examined and it is found that with 32-kb/s speech coding and 720-kb/s transmission (22.5 channels), PRMA supports up to 37 simultaneous conversations, or 1.64 conversations per channel.
Abstract: Packet-reservation multiple access (PRMA) is viewed as a merger of slotted ALOHA and time-division multiple access (TDMA). Dispersed terminals transmit packets of speech information to a central base station. When its speech activity detector indicates the beginning of a talkspurt, a terminal contends with other terminals for access to an available time slot. After the base station detects the first packet in the talkspurt, the terminal reserves future time slots for transmission of subsequent speech packets. The influence of several variables on PRMA efficiency, defined as the number of conversations per channel, is examined. The number of channels is the ratio of transmission rate to speech coding rate. It is found that with 32-kb/s speech coding and 720-kb/s transmission (22.5 channels), PRMA supports up to 37 simultaneous conversations, or 1.64 conversations per channel. The number of conversations per channel is at least 1.5 over a wide range of packet sizes (8 ms of speech per packet to 34 ms) and for all systems with 16 or more channels (transmission rate >or=512 kb/s, with 32-kb/s speech coding). Other factors studied are the sensitivity of the speech activity detector, the retransmission probability of the contention scheme, and the maximum time delay for the transmission of speech packets.
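The efficiency figure quoted in the abstract follows directly from its definitions: the number of channels is the transmission rate divided by the speech coding rate, and efficiency is simultaneous conversations per channel. A minimal sketch (the function name is illustrative):

```python
def prma_efficiency(transmission_rate_kbps, coding_rate_kbps, conversations):
    """Channels = transmission rate / speech coding rate;
    efficiency = simultaneous conversations per channel."""
    channels = transmission_rate_kbps / coding_rate_kbps
    return channels, conversations / channels

# The paper's example: 720 kb/s transmission, 32 kb/s coding, 37 conversations
channels, eff = prma_efficiency(720, 32, 37)
# channels = 22.5, eff ~ 1.64
```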

433 citations


PatentDOI
TL;DR: A speech recognition apparatus having reference pattern adaptation stores a plurality of reference patterns representing speech to be recognized, each stored reference pattern having associated therewith a quality value representing the effectiveness of that pattern for recognizing an incoming speech utterance.
Abstract: A speech recognition apparatus having reference pattern adaptation stores a plurality of reference patterns representing speech to be recognized, each stored reference pattern having associated therewith a quality value representing the effectiveness of that pattern for recognizing an incoming speech utterance. The method and apparatus provide user correction actions representing the accuracy of a speech recognition, dynamically, during the recognition of unknown incoming speech utterances and after training of the system. The quality values are updated, during the speech recognition process, for at least a portion of those reference patterns used during the speech recognition process. Reference patterns having low quality values, indicative of either inaccurate representation of the unknown speech or non-use, can be deleted so long as the reference pattern is not needed, for example, where the reference pattern is the last instance of a known word or phrase. Various methods and apparatus are provided for determining when reference patterns can be deleted or added, to the reference memory, and when the scores or values associated with a reference pattern should be increased or decreased to represent the "goodness" of the reference pattern in recognizing speech.

263 citations


Patent
28 Oct 1991
TL;DR: In this paper, a CELP speech processor utilizes an organized, non-overlapping, algebraic codebook containing a predetermined number of vectors, uniformly distributed over a multi-dimensional sphere to generate a remaining speech residual.
Abstract: Apparatus and method for encoding speech using a codebook excited linear predictive (CELP) speech processor and an algebraic codebook for use therewith. The CELP speech processor receives a digital speech input representative of human speech and performs linear predictive code analysis and perceptual weighting filtering to produce a short term speech information and a long term speech information. The CELP speech processor utilizes an organized, non-overlapping, algebraic codebook containing a predetermined number of vectors, uniformly distributed over a multi-dimensional sphere, to generate a remaining speech residual. The short term speech information, long term speech information and remaining speech residual are combinable to form a quality reproduction of the digital speech input.

230 citations


PatentDOI
Juin-Hwey Chen1
TL;DR: In this paper, a low-bitrate (typically 8 kbit/s or less), low-delay digital coder and decoder based on Code Excited Linear Prediction for speech and similar signals features backward adaptive adjustment for codebook gain and short-term synthesis filter parameters and forward adaptive adjustment of long-term (pitch) synthesis filter parameters.
Abstract: A low-bitrate (typically 8 kbit/s or less), low-delay digital coder and decoder based on Code Excited Linear Prediction for speech and similar signals features backward adaptive adjustment for codebook gain and short-term synthesis filter parameters and forward adaptive adjustment of long-term (pitch) synthesis filter parameters. A highly efficient, low delay pitch parameter derivation and quantization permits overall delay which is a fraction of prior coding delays for equivalent speech quality at low bitrates.

166 citations


Journal ArticleDOI
05 Jul 1991-Science
TL;DR: When speech signals were modulated into the ultrasonic range, listening to words resulted in the clear perception of the speech stimuli and not a sense of high-frequency vibration.
Abstract: Bone-conducted ultrasonic hearing has been found capable of supporting frequency discrimination and speech detection in normal, older hearing-impaired, and profoundly deaf human subjects. When speech signals were modulated into the ultrasonic range, listening to words resulted in the clear perception of the speech stimuli and not a sense of high-frequency vibration. These data suggest that ultrasonic bone conduction hearing has potential as an alternative communication channel in the rehabilitation of hearing disorders.

145 citations


Proceedings ArticleDOI
14 Apr 1991
TL;DR: An efficient procedure for searching such a large codebook deploying a focused search strategy, where less than 0.1% of the codebook is searched with performance very close to that of a full search is described.
Abstract: The application of algebraic code excited linear prediction (ACELP) coding to wideband speech is presented. An algebraic codebook with a 20 bit address can be used without any storage requirements and, more importantly, with a very efficient search procedure which allows for real-time implementation. The authors describe an efficient procedure for searching such a large codebook deploying a focused search strategy, where less than 0.1% of the codebook is searched with performance very close to that of a full search. High-quality speech at a bit rate of 13 kb/s was obtained.

114 citations


Journal ArticleDOI
TL;DR: Recent advances in and perspectives of research on speaker-dependent-feature extraction from speech waves, automatic speaker identification and verification, speaker adaptation in speech recognition, and voice conversion techniques are discussed.

108 citations


Journal ArticleDOI
TL;DR: Some methods of supporting voice in broadband ISDN, (B-ISDN) asynchronous transfer mode (ATM), including voice compression, are examined and possible approaches for packetization and implementation of variable-bit-rate voice coding schemes are described.
Abstract: Some methods of supporting voice in broadband ISDN (B-ISDN) asynchronous transfer mode (ATM), including voice compression, are examined. Techniques for voice compression with variable-length packet format at DS1 transmission rate, e.g., wideband packet technology (WPT), have been successfully implemented utilizing embedded adaptive differential pulse code modulation (ADPCM) coding, digital speech interpolation (DSI), and block-dropping schemes. For supporting voice in B-ISDN, voice compression techniques are considered that are similar to those used in WPT but with different packetization and congestion control methods designed for the fixed-length ATM protocol at high speeds. Possible approaches for packetization and implementation of variable-bit-rate voice coding schemes are described. ADPCM and DSI for voice coding and compression and cell discarding (CD) for congestion control are considered. The advantages of voice compression and CD in broadband ATM networks are demonstrated in terms of transmission bandwidth savings and resiliency of the network during congestion.

96 citations


PatentDOI
TL;DR: A speech coder apparatus operates to compress speech signals to a low bit rate and includes a continuous speech recognizer (CSR) which has a memory for storing templates.
Abstract: A speech coder apparatus operates to compress speech signals to a low bit rate. The apparatus includes a continuous speech recognizer (CSR) which has a memory for storing templates. Input speech is processed by the CSR where information in the speech is compared against the templates to provide an output digital signal indicative of recognized words, which signal is transmitted along a first path. There is further included a front end processor which is also responsive to the input speech signal for providing output digitized speech samples during a given frame interval. A side information encoder circuit responds to the output from the front end processor to provide at the output of the encoder a parameter signal indicative of the value of the pitch and word duration for each word as recognized by the CSR unit. The output of the encoder is transmitted as a second signal. There is a receiver which includes a synthesizer responsive to the first and second transmitted signals for providing an output synthesized signal for each recognized word where the pitch, duration and amplitude of the synthesized signal is changed according to the parameter signal to preserve the quality of the synthesized speech.

PatentDOI
TL;DR: In this article, an adaptive filtering technique is applied to sequences of energy estimates in each of two signal channels, one channel containing speech and environmental noise and the other channel containing primarily the same environmental noise.
Abstract: A digital signal processing system applies an adaptive filtering technique to sequences of energy estimates in each of two signal channels, one channel containing speech and environmental noise and the other channel containing primarily the same environmental noise. From the channel containing primarily environmental noise, a prediction is made of the energy of that noise in the channel containing both the speech and that noise, so that the noise can be extracted from the mixture of speech and noise. The result is that the speech will be more easily recognizable by either human listeners or speech recognition systems.
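The two-channel idea in this abstract, predicting the noise energy in the speech-plus-noise channel from the noise-only channel, can be approximated with an LMS-adapted predictor on the energy sequences. A minimal sketch under stated assumptions, not the patented system; the function name, LMS choice, filter order, and step size are all illustrative:

```python
import numpy as np

def lms_noise_subtract(primary_energy, reference_energy, mu=0.01, order=4):
    """Predict the noise energy in the speech+noise channel from the
    noise-only reference channel with an LMS-adapted FIR filter, then
    subtract the prediction to leave an estimate of the speech energy."""
    w = np.zeros(order)                            # adaptive filter weights
    cleaned = np.zeros_like(primary_energy)
    for n in range(order, len(primary_energy)):
        x = reference_energy[n - order:n][::-1]    # most recent reference samples
        noise_hat = w @ x                          # predicted noise energy
        e = primary_energy[n] - noise_hat          # residual ~ speech energy
        cleaned[n] = max(e, 0.0)                   # energies cannot go negative
        w += mu * e * x                            # LMS weight update
    return cleaned
```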

Journal ArticleDOI
TL;DR: A new method is presented based on the assumption that, for voiced speech, a perceptually accurate speech signal can be reconstructed from a description of the waveform of a single, representative pitch cycle per 20-30 ms interval; the method retains the natural quality of coders that encode the entire waveform, but requires a bit rate close to that of parametric coders.

Proceedings ArticleDOI
14 Apr 1991
TL;DR: The techniques and experiments described are the first demonstration of a complete system that accepts speech messages as input and produces an estimated message class as output; they demonstrate the feasibility of the technology and illustrate the need for further work.
Abstract: The components of a speech message information retrieval system include an acoustic front end which provides an incomplete transcription of a spoken message, and a message classifier that interprets the incomplete transcription and classifies the message according to message category. The techniques and experiments described are concerned with the integration of these components and represent the first demonstration of a complete system that accepts speech messages as input and produces an estimated message class as output. The complete system has been implemented on special-purpose digital signal processing hardware and demonstrated using live speech input. The results obtained on a conversational speech task have demonstrated the feasibility of the technology and also illustrate the need for further work. Even with a perfect acoustic front end, a message classification accuracy of only 78% was obtained with a 126 keyword vocabulary.

Patent
23 Dec 1991
TL;DR: In this article, variable hangover time is provided for a speech coder: a voice activity detector (VAD) detects voice activity within a speech message, and a variable hangover time is calculated and appended to the detection period.
Abstract: Variable hangover time is provided for a speech coder (105). Voice activity within a speech message is detected (209) using a voice activity detector (VAD) (107), and a signal-to-noise ratio is calculated. A variable hangover time is calculated (215) and appended to the time in which voice activity is detected, producing an extended voice detection period. The speech coder (105) is enabled only during the extended voice detection period, thus saving power.
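The variable-hangover scheme can be sketched as an SNR-to-hangover mapping applied to per-frame VAD decisions. The linear mapping and parameter values below are illustrative assumptions; the patent's exact formula is not given in the abstract:

```python
def extended_voice_period(vad_flags, snr_db, frame_ms=20,
                          min_hangover_ms=40, max_hangover_ms=200):
    """Append an SNR-dependent hangover to each detected voice run.

    Lower SNR -> longer hangover (harder to detect trailing speech),
    higher SNR -> shorter hangover. Returns the extended detection flags,
    i.e. the period during which the speech coder stays enabled.
    """
    # Illustrative mapping: 0 dB -> max hangover, >= 30 dB -> min hangover
    snr = max(0.0, min(30.0, snr_db))
    hang_ms = max_hangover_ms - (max_hangover_ms - min_hangover_ms) * snr / 30.0
    hang_frames = round(hang_ms / frame_ms)

    out = list(vad_flags)
    countdown = 0
    for i, voiced in enumerate(vad_flags):
        if voiced:
            countdown = hang_frames        # re-arm the hangover timer
        elif countdown > 0:
            out[i] = True                  # keep the coder enabled
            countdown -= 1
    return out
```

Since the coder is enabled only during the extended detection period, shrinking the hangover at high SNR is what yields the power saving the abstract mentions.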

Proceedings ArticleDOI
14 Apr 1991
TL;DR: The test results show that the IMBE system is a viable alternative to CELP based speech coders and has the best performance of the systems tested.
Abstract: A 6.4 kb/s improved multiband excitation (IMBE) speech coder is presented. This speech coder combines high speech quality with a robustness to channel impairments which is necessary for successful operation in a mobile communication environment. MOS (mean opinion score) results for the IMBE speech coder are compared against those of four 6.4-kb/s CELP (code excited linear prediction) based speech coders which were tested as part of the INMARSAT-M voice codec evaluation. The IMBE system yielded the best performance of the systems tested. It received an MOS score of 3.4 at both 0% and 1% bit error rate. The test results show that the IMBE system is a viable alternative to CELP based speech coders.


Proceedings ArticleDOI
14 Apr 1991
TL;DR: The proposed voice conversion algorithm was used with two male speakers and, in terms of speaker identification accuracy, the speech converted by segment-sized units gave a score 20% higher than thespeech converted frame-by-frame.
Abstract: A voice conversion algorithm that uses speech segments as conversion units is proposed. Input speech is decomposed into speech segments by a speech recognition module, and the segments are replaced by speech segments uttered by another speaker. This algorithm makes it possible to convert not only the static characteristics but also the dynamic characteristics of speaker individuality. The proposed voice conversion algorithm was used with two male speakers. Spectrum distortion between target speech and the converted speech was reduced to one-third the natural spectrum distortion between the two speakers. A listening experiment showed that, in terms of speaker identification accuracy, the speech converted by segment-sized units gave a score 20% higher than the speech converted frame-by-frame.

Proceedings ArticleDOI
14 Apr 1991
TL;DR: The authors already have a state-of-the-art speaker-independent speech recognition system, SPHINX, and extended it to speaker-dependent speech recognition, which demonstrated a substantial difference between speaker-dependent and -independent systems.
Abstract: The DARPA Resource Management task is used as the domain to investigate the performance of speaker-independent, speaker-dependent, and speaker-adaptive speech recognition. The authors already have a state-of-the-art speaker-independent speech recognition system, SPHINX. The error rate for the RM2 test set is 4.3%. They extended SPHINX to speaker-dependent speech recognition. The error rate is reduced to 1.4-2.6% with 600-2400 training sentences for each speaker, which demonstrated a substantial difference between speaker-dependent and -independent systems. Based on speaker-independent models, a study was made of speaker-adaptive speech recognition. With 40 adaptation sentences for each speaker, the error rate can be reduced from 4.3% to 3.1%.

PatentDOI
Ira A. Gerson1, Mark A. Jasiuk1
TL;DR: In a speech coder, excitation source gain information is transmitted along with a coding mode indicator that indicates how the gain information has been interpreted and which of a plurality of excitation sources are utilized when synthesizing the speech.
Abstract: In a speech coder (100), excitation source gain information (802) is transmitted along with a coding mode indicator. The coding mode indicator indicates how the gain information is to be interpreted. In one embodiment, the coding mode indicator can also be utilized to control which of a plurality of excitation sources (202, 206-208) are utilized when synthesizing the speech. The coding mode itself is selected as a function of the periodicity of an input speech signal.

PatentDOI
Hideki Satoh1, Tsuneo Nitta1
TL;DR: In this article, a speech detection apparatus capable of reliably detecting speech segments in audio signals regardless of the levels of input audio signals and background noises is presented.
Abstract: A speech detection apparatus capable of reliably detecting speech segments in audio signals regardless of the levels of input audio signals and background noises. In the apparatus, a parameter of input audio signals is calculated frame by frame, and then compared with a threshold in order to judge each input frame as either a speech segment or a noise segment, while the parameters of the input frames judged as noise segments are stored in the buffer and the threshold is updated according to the parameters stored in the buffer. The apparatus may utilize a transformed parameter obtained from the parameter, in which the difference between speech and noise is emphasized, and noise standard patterns are constructed from the parameters of the input frames pre-estimated as noise segments.
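The threshold-update loop this abstract describes can be sketched as follows. The margin factor and buffer length are illustrative assumptions; the point is that the threshold tracks the recent noise level, making the detector insensitive to the absolute input level:

```python
from collections import deque

def detect_segments(frame_energies, init_threshold, margin=2.0, buf_len=50):
    """Classify each frame as speech or noise against a threshold that is
    continually re-estimated from a buffer of recent noise-frame energies."""
    noise_buf = deque(maxlen=buf_len)   # energies of frames judged as noise
    threshold = init_threshold
    labels = []
    for e in frame_energies:
        if e > threshold:
            labels.append("speech")
        else:
            labels.append("noise")
            noise_buf.append(e)
            # re-estimate the threshold from the running noise level
            threshold = margin * sum(noise_buf) / len(noise_buf)
    return labels
```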

Proceedings ArticleDOI
14 Apr 1991
TL;DR: A novel synthesizer structure for an LPC (linear predictive coding) vocoder is introduced which increases the clarity and naturalness of the output speech and replaces the traditional binary voicing decision with more robust periodicity, peakiness, and power level detectors.
Abstract: The authors introduce a novel synthesizer structure for an LPC (linear predictive coding) vocoder which increases the clarity and naturalness of the output speech. This synthesizer enhances the usual excitations of either periodic pulses or white noise by allowing pulse/noise mixtures and aperiodic pulses, and thus can generate a wider range of possible speech signals. The control algorithms for this new model replace the traditional binary voicing decision with more robust periodicity, peakiness, and power level detectors, without a significant increase in bit rate. As a result, the vocoder produces synthetic speech which is free of the usual LPC synthesis artifacts, even at bit rates below 2400 bps.
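The mixed pulse/noise excitation with aperiodic (jittered) pulses can be sketched as below. The voicing-ratio and jitter parameterization is an illustrative assumption, not the paper's control algorithm, which derives its decisions from periodicity, peakiness, and power detectors:

```python
import numpy as np

def mixed_excitation(n_samples, pitch_period, voicing_ratio, jitter=0.0, seed=0):
    """Build an LPC excitation that mixes a periodic pulse train with white
    noise; jitter > 0 perturbs pulse spacing to give aperiodic pulses."""
    rng = np.random.default_rng(seed)
    pulses = np.zeros(n_samples)
    pos = 0.0
    while pos < n_samples:
        pulses[int(pos)] = 1.0
        # perturb the next pulse position for aperiodic voicing
        step = pitch_period * (1.0 + jitter * rng.uniform(-1, 1))
        pos += max(1.0, step)
    noise = rng.standard_normal(n_samples)
    # voicing_ratio = 1 -> purely periodic, 0 -> purely noise-like
    return voicing_ratio * pulses + (1.0 - voicing_ratio) * noise
```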

Journal ArticleDOI
TL;DR: A neural approach to improving the performance of an automatic speech recognition system for unrestricted speakers by using not only voice sound features but also image features of the mouth shape; the approach can be applied not only to improving voice recognition but also to aiding communication by hearing-impaired people.

Journal ArticleDOI
TL;DR: The methods and motivation for VAA data collection and validation procedures, the current contents of the database, and the results of exploratory research on a 1088-speaker subset of the database are described.

Proceedings ArticleDOI
S. Nanda1, On-Ching Yue1
02 Dec 1991
TL;DR: A scheme for almost doubling the capacity of wireless communication systems by speech activity detection by dynamically varying the bandwidth assigned to the two parties in a TDMA (time division multiaccess) system.
Abstract: A scheme for almost doubling the capacity of wireless communication systems by speech activity detection is proposed. This scheme is called variable partition duplexing (VPD). The key observation is that in conversational speech, except for small overlaps, only one of the two parties is talking at any given time. VPD attempts to use this observation to advantage by dynamically varying the bandwidth assigned to the two parties in a TDMA (time division multiaccess) system.

Proceedings ArticleDOI
19 Feb 1991
TL;DR: The results of several field trials suggest that real user compliance with instructions is dramatically affected by the particular details of the prompts supplied to the user.
Abstract: Performance estimates given for speech recognition/understanding systems are typically based on the assumption that users will behave in ways similar to the observed behavior of laboratory volunteers. This includes the acoustic/phonetic characteristics of the speech they produce as well as their willingness and ability to constrain their input to the device according to instructions. Since speech recognition devices often do not perform as well in the field as they do in the laboratory, analyses of real user behavior have been undertaken. The results of several field trials suggest that real user compliance with instructions is dramatically affected by the particular details of the prompts supplied to the user. A significant amount of real user speech data has been collected during these trials (34,000 utterances, 29 hours of data). These speech databases are described along with the results of an experiment comparing the performance of a speech recognition system on real user vs. laboratory speech.

Proceedings ArticleDOI
14 Apr 1991
TL;DR: A speech recognition system using word-spotting with noise immunity learning has been developed to achieve robust performance under noisy environments and employs an accelerator for reducing processing time.
Abstract: A speech recognition system using word-spotting with noise immunity learning has been developed to achieve robust performance under noisy environments. The system employs word-spotting based on the multiple similarity (MS) method for eliminating word boundary detection errors, noise immunity learning for improving noise robustness, and an accelerator for reducing processing time. Noise immunity learning is performed using noisy speech data and noise data. Data from 39 male speakers were used to evaluate the recognition performance; the remaining data were used for the learning. Recognition scores obtained by word-spotting alone and with noise immunity learning were 88.5% and 98.4%, respectively, for an SNR of 10 dB.


Proceedings ArticleDOI
14 Apr 1991
TL;DR: Efficiency in the adaptive incremental training using a small number of training tokens extracted from continuous speech was confirmed in the TDNN-LR system and provides large-vocabulary and continuous speech recognition.
Abstract: An investigation of speech recognition and language processing is described. The speech recognition part consists of the large phonemic time-delay neural networks (TDNNs) which can automatically spot all 24 Japanese phonemes by simply scanning input speech. The language processing part is made up of a predictive LR parser which predicts subsequent phonemes based on the currently proposed phonemes. This TDNN-LR recognition system provides large-vocabulary and continuous speech recognition. Recognition experiments for ATR's conference registration task were performed using the TDNN-LR method. Speaker-dependent phrase recognition rates of 65.1% for the first choice and 88.8% within the top five choices were attained. Also, efficiency in the adaptive incremental training using a small number of training tokens extracted from continuous speech was confirmed in the TDNN-LR system.

Proceedings ArticleDOI
14 Apr 1991
TL;DR: It is shown that for recognition based on the combination of the first two regression features with the static cepstral coefficients, increasing the time length to more than 200 ms, using all of the frames in this time interval, resulted in the highest recognition rates for noisy-Lombard test speech.
Abstract: It is proposed that the number of speech analysis frames used in calculating regression features should be controlled separately from the time length over which the features are calculated. Regression features are used to represent the first two time derivatives of the speech cepstrum in a speaker-independent, isolated-word recognition task. The recognition system is trained on normal (noise-free, non-Lombard) speech, but tested on normal, noisy, Lombard, or noisy-Lombard speech. It is shown that for recognition based on the combination of the first two regression features with the static cepstral coefficients, increasing the time length to more than 200 ms, using all of the frames in this time interval, resulted in the highest recognition rates for noisy-Lombard test speech. >