
Showing papers on "Voice activity detection published in 1992"


Journal ArticleDOI
Yariv Ephraim
01 Oct 1992
TL;DR: A unified statistical approach for the three basic problems of speech enhancement is developed, using composite source models for the signal and noise and a fairly large set of distortion measures.
Abstract: Since the statistics of the speech signal as well as of the noise are not explicitly available, and the most perceptually meaningful distortion measure is not known, model-based approaches have recently been extensively studied and applied to the three basic problems of speech enhancement: signal estimation from a given sample function of noisy speech, signal coding when only noisy speech is available, and recognition of noisy speech signals in man-machine communication. Research on the model-based approach is integrated and put into perspective with other more traditional approaches for speech enhancement. A unified statistical approach for the three basic problems of speech enhancement is developed, using composite source models for the signal and noise and a fairly large set of distortion measures.

383 citations


PatentDOI
TL;DR: This paper discloses a plurality of linearly arrayed sensors to detect spoken input and to output signals in response thereto, a beamformer connected to the sensors to cancel a preselected noise portion of the signals and thereby produce a processed signal, and a speech recognition system to recognize the processed signal and to respond thereto.
Abstract: Systems and methods for improved speech acquisition are disclosed including a plurality of linearly arrayed sensors to detect spoken input and to output signals in response thereto, a beamformer connected to the sensors to cancel a preselected noise portion of the signals to thereby produce a processed signal, and a speech recognition system to recognize the processed signal and to respond thereto. The beamformer may also include an adaptive filter with enable/disable circuitry for selectively training the adaptive filter for a predetermined period of time. A highpass filter may also be used to filter a preselected noise portion of the sensed signals before the signals are forwarded to the beamformer. The speech recognition system may include a speaker-independent base which is able to be adapted by a predetermined amount of training by a speaker, and which system includes a voice dialer or a speech coder for telecommunication.

214 citations


Journal ArticleDOI
01 Aug 1992
TL;DR: A voice activity detector (VAD) that can operate reliably in SNRs down to 0 dB and detect most speech at −5 dB is described, and how robustness to these signals can be achieved with suitable preprocessing and postprocessing is shown.
Abstract: The paper describes a voice activity detector (VAD) that can operate reliably in SNRs down to 0 dB and detect most speech at −5 dB. The detector applies a least-squares periodicity estimator to the input signal, and triggers when a significant amount of periodicity is found. It does not aim to find the exact talkspurt boundaries and, consequently, is most suited to speech-logging applications where it is easy to include a small margin to allow for any missed speech. The paper discusses the problem of false triggering on nonspeech periodic signals and shows how robustness to these signals can be achieved with suitable preprocessing and postprocessing.

205 citations
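The periodicity-triggered VAD described above can be sketched with a normalized-autocorrelation periodicity score. This is an illustrative stand-in, not the paper's least-squares estimator, and the 0.5 trigger threshold and lag range are assumptions:

```python
import math
import random

def periodicity_score(frame, min_lag=20, max_lag=160):
    """Peak normalized autocorrelation over candidate pitch lags
    (at 8 kHz, lags 20-160 cover roughly 50-400 Hz)."""
    energy = sum(x * x for x in frame)
    if energy == 0.0:
        return 0.0
    best = 0.0
    for lag in range(min_lag, max_lag + 1):
        r = sum(frame[n] * frame[n - lag] for n in range(lag, len(frame)))
        best = max(best, r / energy)
    return best

def is_speech(frame, threshold=0.5):
    """Trigger when a significant amount of periodicity is found."""
    return periodicity_score(frame) >= threshold

# Voiced-speech surrogate (100 Hz tone) vs. aperiodic noise.
random.seed(0)
fs = 8000
voiced = [math.sin(2 * math.pi * 100 * n / fs) for n in range(400)]
noise = [random.uniform(-1.0, 1.0) for _ in range(400)]
print(is_speech(voiced), is_speech(noise))  # periodic frame triggers, noise does not
```

A real detector would, as the paper notes, still need preprocessing and postprocessing to avoid false triggers on non-speech periodic signals such as tones or music.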



BookDOI
01 Jan 1992

152 citations


Patent
12 Feb 1992
TL;DR: In this article, a speaker voice verification system uses temporal decorrelation linear transformation and includes a collector for receiving speech inputs from an unknown speaker claiming a specific identity, a word-level speech features calculator operable to use a temporal decorrelation linear transformation for generating word-level speech feature vectors from such speech inputs, and word-level speech feature storage for storing word-level feature vectors known to belong to a speaker with the specific identity.
Abstract: A speaker voice verification system uses temporal decorrelation linear transformation and includes a collector for receiving speech inputs from an unknown speaker claiming a specific identity, a word-level speech features calculator operable to use a temporal decorrelation linear transformation for generating word-level speech feature vectors from such speech inputs, word-level speech feature storage for storing word-level speech feature vectors known to belong to a speaker with the specific identity, a word-level vector scorer for comparing the word-level speech feature vectors received from the unknown speaker with those retrieved from the word-level speech feature storage to produce a similarity score, and speaker verification decision circuitry for determining, based on the similarity score, whether the unknown speaker's identity is the same as that claimed. The word-level vector scorer further includes concatenation circuitry as well as a word-specific orthogonalizing linear transformer. Other systems and methods are also disclosed.

143 citations


Proceedings ArticleDOI
07 Jun 1992
TL;DR: A modified time-delay neural network (TDNN) has been designed to perform automatic lipreading (speechreading) in conjunction with acoustic speech recognition in order to improve recognition both in silent environments and in the presence of acoustic noise.
Abstract: A modified time-delay neural network (TDNN) has been designed to perform automatic lipreading (speechreading) in conjunction with acoustic speech recognition in order to improve recognition both in silent environments and in the presence of acoustic noise. The system is far more robust to acoustic noise and verbal distractors than is a system not incorporating visual information. Specifically, in the presence of high-amplitude pink noise, the low recognition rate of the acoustic-only system (43%) is raised to 75% by the incorporation of visual information. The system responds to (artificial) conflicting cross-modal patterns in a way closely analogous to the McGurk effect in humans. The power of neural techniques is demonstrated in several difficult domains: pattern recognition; sensory integration; and distributed approaches toward 'rule-based' (linguistic-phonological) processing.

129 citations


PatentDOI
TL;DR: In this paper, a method and an apparatus for hearing assistance, capable of compensating for the lowering of speech recognition ability related to deterioration of the auditory sense center, is presented.
Abstract: A method and an apparatus for hearing assistance, capable of compensating for the lowering of speech recognition ability related to deterioration of the auditory sense center. The input speech is divided into voiced speech sections, unvoiced speech sections, and silent sections, of which the voiced speech sections and the silent sections are appropriately extended or contracted while the unvoiced speech sections are left unchanged, and then these sections are combined in the same order as in the input speech, so as to obtain output speech that is easier to listen to for a listener with impaired hearing. Also, only the silent sections other than the punctuation silent sections (pauses due to punctuation between sentences) can be contracted and the speech speed for each of the voiced speech sections can be adjusted; the adjusted voiced speech sections, the unvoiced speech sections, the punctuation silent sections, and the contracted silent sections are then combined in the same order as in the input speech, in order to realize real-time hearing assistance without extending the speech utterance period.

128 citations


Patent
13 Aug 1992
TL;DR: In this paper, a speech input uttered by a human is received by a microphone which outputs microphone output signals, and the speech input received by the microphone is then recognized by a speech recognition unit, and a synthetic speech response appropriate for the input recognized by the speech recognizer is generated and outputted from a loudspeaker to the human.
Abstract: In the system, a speech input uttered by a human is received by a microphone which outputs microphone output signals. The speech input received by the microphone is then recognized by a speech recognition unit, and a synthetic speech response appropriate for the speech input recognized by the speech recognition unit is generated and outputted from a loudspeaker to the human. In recognizing the speech input, the speech recognition unit receives input signals in which the synthetic speech response, outputted from the loudspeaker and then received by the microphone, is cancelled from the microphone output signals.

89 citations
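The cancellation step described above, removing the loudspeaker's synthetic response from the microphone output, is commonly done with an adaptive filter. A minimal normalized-LMS (NLMS) sketch; the tap count, step size, and simulated echo path below are illustrative assumptions, not details from the patent:

```python
import random

def nlms_echo_cancel(mic, ref, taps=8, mu=0.5):
    """Adaptively estimate the loudspeaker-to-microphone echo path from the
    reference (synthesized) signal and subtract the predicted echo."""
    w = [0.0] * taps          # adaptive echo-path estimate
    residual = []
    for n in range(len(mic)):
        x = [ref[n - k] if n - k >= 0 else 0.0 for k in range(taps)]
        echo_est = sum(wi * xi for wi, xi in zip(w, x))
        e = mic[n] - echo_est  # echo-cancelled output sample
        residual.append(e)
        norm = sum(xi * xi for xi in x) + 1e-8
        w = [wi + mu * e * xi / norm for wi, xi in zip(w, x)]
    return residual

# Simulation: the mic picks up only the loudspeaker output (no near-end talk).
random.seed(1)
ref = [random.uniform(-1.0, 1.0) for _ in range(2000)]
h = [0.5, -0.3, 0.1]  # hypothetical room/echo impulse response
mic = [sum(h[k] * ref[n - k] for k in range(len(h)) if n - k >= 0)
       for n in range(len(ref))]
residual = nlms_echo_cancel(mic, ref)
```

After adaptation the residual is close to zero, so the recognizer sees input from which the system's own synthetic response has been cancelled.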


Proceedings ArticleDOI
23 Mar 1992
TL;DR: The authors present a method for segmenting speech waveforms containing several speakers into utterances, each from one individual, and then identifying each utterance as coming from a specific individual or group of individuals.
Abstract: The authors present a method for segmenting speech waveforms containing several speakers into utterances, each from one individual, and then identifying each utterance as coming from a specific individual or group of individuals. The procedure is unsupervised in that there is no training set, and sequential in that information obtained in early stages of the process is utilized in later stages.

77 citations


Journal ArticleDOI
I.A. Gerson, M.A. Jasiuk
TL;DR: Techniques for improving the performance of CELP (code excited linear prediction)-type speech coders while maintaining reasonable computational complexity are explored and a harmonic noise weighting function is introduced.
Abstract: Techniques for improving the performance of CELP (code excited linear prediction)-type speech coders while maintaining reasonable computational complexity are explored. A harmonic noise weighting function, which enhances the perceptual quality of the processed speech, is introduced. The combination of harmonic noise weighting and subsample pitch lag resolution significantly improves the coder performance for voiced speech. Strategies for reducing the speech coder's data rate, while maintaining speech quality, are presented. These include a method for efficient encoding of the long-term predictor lags, utilization of multiple gain vector quantizers, and a multimode definition of the speech coder frame. A 5.9-kb/s VSELP speech coder that incorporates these features is described. Complexity reduction techniques which allow the coder to be implemented using a single fixed-point DSP (digital signal processor) are discussed.

Proceedings ArticleDOI
23 Mar 1992
TL;DR: The authors present the result of their research on developing a hands-free voice communication system with a microphone array for use in an automobile environment, showing that the microphone array is superior to a single microphone.
Abstract: The authors present the results of their research on developing a hands-free voice communication system with a microphone array for use in an automobile environment. The goal of this research is to develop a speech acquisition and enhancement system so that a speech recognizer can reliably be used inside a noisy automobile environment, for digital cellular phone applications. Speech data have been collected using a microphone array and a digital audio tape (DAT) recorder inside a real car for several idling and driving conditions, and processed using delay-and-sum and adaptive beamforming algorithms. Performance criteria including signal-to-noise ratio and speech recognition error rate have been evaluated for the processed data. Detailed performance results presented show that the microphone array is superior to a single microphone.
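The delay-and-sum processing named above can be sketched under simplifying assumptions (integer-sample steering delays, known array geometry): aligned channels add the speech coherently while independent noise averages down by roughly the number of sensors.

```python
import math
import random

def delay_and_sum(channels, delays):
    """Advance each channel by its steering delay and average across sensors."""
    n = len(channels[0])
    out = []
    for i in range(n):
        acc, cnt = 0.0, 0
        for ch, d in zip(channels, delays):
            if 0 <= i + d < n:
                acc += ch[i + d]
                cnt += 1
        out.append(acc / cnt if cnt else 0.0)
    return out

# Four sensors hear the same 200 Hz source at staggered delays, plus independent noise.
random.seed(2)
fs, n = 8000, 512
source = [math.sin(2 * math.pi * 200 * i / fs) for i in range(n)]
delays = [0, 2, 4, 6]
channels = [[(source[i - d] if i >= d else 0.0) + random.gauss(0.0, 0.5)
             for i in range(n)] for d in delays]
beam = delay_and_sum(channels, delays)

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

print(mse(beam, source) < mse(channels[0], source))  # array beats a single sensor
```

With four sensors the residual noise power drops to roughly a quarter of a single channel's, which is the kind of SNR gain the paper measures against a single microphone.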

Proceedings ArticleDOI
23 Mar 1992
TL;DR: The authors discuss the application of generalized analysis-by-synthesis coding to the pitch predictor of a code excited linear predictor (CELP) coder, which makes it possible to transmit the pitch prediction parameters at a much lower rate than conventional approaches, without compromising speech quality.
Abstract: Many modifications can be applied to a speech signal without changing its perceptual quality. For a particular speech coder, the coding efficiency will differ for distinct modifications. To exploit this, the authors introduced a generalized analysis-by-synthesis procedure. In this procedure, a search is performed over a multitude of modified original signals (on a blockwise basis), and the signal which can be encoded with the least distortion is selected for transmission. At the receiver, a quantized version of this modified original signal is constructed. The authors discuss the application of generalized analysis-by-synthesis coding to the pitch predictor of a code excited linear predictor (CELP) coder. The use of this technique makes it possible to transmit the pitch predictor parameters at a much lower rate than conventional approaches, without compromising speech quality.

Proceedings ArticleDOI
23 Feb 1992
TL;DR: It is concluded that word accuracy can be improved by explicitly modeling spontaneous effects in the recognizer, and by using as much spontaneous speech training data as possible.
Abstract: We describe three analyses on the effects of spontaneous speech on continuous speech recognition performance. We have found that: (1) spontaneous speech effects significantly degrade recognition performance, (2) fluent spontaneous speech yields word accuracies equivalent to read speech, and (3) using spontaneous speech training data can significantly improve performance for recognizing spontaneous speech. We conclude that word accuracy can be improved by explicitly modeling spontaneous effects in the recognizer, and by using as much spontaneous speech training data as possible. Inclusion of read speech training data, even within the task domain, does not significantly improve performance.

Proceedings ArticleDOI
14 Jun 1992
TL;DR: An overview of speech recognition systems and design strategies for their use in portable communications are given and types of speech recognizers are discussed.
Abstract: The authors give an overview of speech recognition systems and discuss design strategies for their use in portable communications. State-of-the-art speech recognition systems can recognize continuously spoken speech from a large vocabulary in real time. In the future, portable speech recognition systems will be made possible by advances in integrated circuit technology, by optimizing system architectures, and by exploiting the special features of personal communications systems. Types of speech recognizers are discussed. Current speech recognition systems are outlined. Personal communication systems with speech recognition are discussed.

PatentDOI
TL;DR: In this paper, a method and system are provided for alleviating the harmful effects of convolutional distortions of speech, such as the effect of a telecommunication channel, on the performance of an automatic speech recognizer (ASR).
Abstract: A method and system are provided for alleviating the harmful effects of convolutional distortions of speech, such as the effect of a telecommunication channel, on the performance of an automatic speech recognizer (ASR). The technique is based on the filtering of time trajectories of an auditory-like spectrum derived from the Perceptual Linear Predictive (PLP) method of speech parameter estimation.
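The key idea above is that a convolutional channel becomes a nearly constant additive offset in the log-spectral domain, so filtering each band's time trajectory removes it. A toy sketch; the one-pole high-pass filter and its coefficient are illustrative, not the patented filter:

```python
import math

def highpass_trajectory(traj, alpha=0.9):
    """y[t] = x[t] - x[t-1] + alpha*y[t-1]: suppresses the constant (slowly
    varying) component of one spectral band's log-energy trajectory."""
    out, prev_x, prev_y = [], 0.0, 0.0
    for x in traj:
        y = x - prev_x + alpha * prev_y
        out.append(y)
        prev_x, prev_y = x, y
    return out

# Same log-spectral trajectory, with and without a fixed channel offset.
clean = [math.sin(0.1 * t) for t in range(100)]
channel_offset = 3.0   # log-domain effect of a hypothetical telephone channel
distorted = [x + channel_offset for x in clean]
y_clean = highpass_trajectory(clean)
y_dist = highpass_trajectory(distorted)
print(abs(y_dist[-1] - y_clean[-1]))  # offset is filtered out over time
```

The filtered trajectories of the clean and channel-distorted signals converge, which is why features processed this way are less sensitive to the telecommunication channel.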

ReportDOI
22 May 1992
TL;DR: Standard tests reliably measure speech intelligibility and subjective acceptability tests can evaluate voice quality, but when the speech signal is severely degraded or highly processed, a more complete evaluation is needed: one that takes into account the many different sources of information that contribute to how we understand speech.
Abstract: : The evaluation of speech intelligibility and acceptability is an important aspect of the use, development, and selection of voice communication devices-telephone systems, digital voice systems, speech synthesis by rule, speech in noise, and the effects of noise stripping. Standard test procedures can provide highly reliable measures of speech intelligibility, and subjective acceptability tests can be used to evaluate voice quality. These tests are often highly correlated with other measures of communication performance and can be used to predict performance in many situations. However, when the speech signal is severely degraded or highly processed. a more complete evaluation of speech quality is needed-one that takes into account the many different sources of information that contribute to how we understand speech.

Proceedings ArticleDOI
11 Oct 1992
TL;DR: A novel voice compression method which provides significant improvement in transmission efficiency and flexibility for communications systems is described, and the basic scheme involves the use of split vector quantized transform coding in conjunction with pitch prediction to achieve excellent voice quality.
Abstract: The authors discuss a variable bit rate voice coding system in digital communications networks. A novel voice compression method which provides significant improvement in transmission efficiency and flexibility for communications systems is described. The basic scheme used for the investigations involves the use of split vector quantized (SVQ) transform coding (TC) in conjunction with pitch prediction (PP) to achieve excellent voice quality at rates of 4.8 kb/s and below. The authors describe the algorithm and its implementation for a variable bit rate voice coding system operating from 4.8 kb/s down to 2.4 kb/s.


Proceedings ArticleDOI
23 Mar 1992
TL;DR: Using the original method developed by Laforia, a series of text-independent speaker recognition experiments, characterized by long-term multivariate auto-regressive modeling, gives first-rate results without using more than one sentence.
Abstract: Two models, the temporal decomposition and the multivariate linear prediction, of the spectral evolution of speech signals capable of processing some aspects of the speech variability are presented. A series of acoustic-phonetic decoding experiments, characterized by the use of spectral targets of the temporal decomposition techniques and a speaker-dependent mode, gives good results compared to a reference system (i.e., 70% vs. 60% for the first choice). Using the original method developed by Laforia, a series of text-independent speaker recognition experiments, characterized by long-term multivariate auto-regressive modeling, gives first-rate results (i.e., 98.4% recognition rate for 420 speakers) without using more than one sentence. Taking into account the interpretation of the models, these results show how interesting the cinematic models are for obtaining a reduced variability of the speech signal representation.

PatentDOI
Motoaki Koyama
TL;DR: In this paper, a speech segment detector is used to detect speech segments and a reference pattern memory for storing reference patterns, and a speech recognition section for comparing the detected speech segment detected by the detector with the reference patterns stored in the Reference Pattern Memory and selecting the reference pattern most similar to that of the speech segment.
Abstract: A speech recognition LSI system comprises a speech segment detector for detecting a speech segment from an input speech signal, a reference pattern memory for storing reference patterns, and a speech recognition section for comparing the speech segment detected by the detector with the reference patterns stored in the reference pattern memory and selecting the reference pattern most similar to the speech segment. The system further comprises a recording/reproduction device for recording the speech signal and for reproducing only the speech segment the speech segment detector has detected, so that an operator can hear the speech segment.

Proceedings ArticleDOI
25 Jun 1992
TL;DR: Variable rate speech coding is a critical system component for achieving very high capacity in future generation multiple access systems for cellular networks and TDMA can also be designed to benefit from voice activity patterns.
Abstract: Variable rate speech coding is a critical system component for achieving very high capacity in future generation multiple access systems for cellular networks. A significant capacity gain comes from exploitation of the large fraction of the time during which a speaker is idle in a two-way conversation. Additional capacity gain can also be achieved by exploiting the time-varying entropy of active speech. While CDMA and packet-based multiple access systems, e.g. PRMA, are naturally suited for variable rate coding, TDMA can also be designed to benefit from voice activity patterns.
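The capacity argument above can be made concrete with illustrative numbers; the 40% activity factor and the bit rates below are assumptions for the sketch, not figures from the paper:

```python
# Average rate when a variable-rate coder tracks voice activity.
active_fraction = 0.4        # assumed fraction of a two-way call a speaker talks
active_rate_kbps = 8.0       # assumed full rate during talkspurts
idle_rate_kbps = 1.0         # assumed background/comfort-noise rate when idle

average_rate = (active_fraction * active_rate_kbps
                + (1.0 - active_fraction) * idle_rate_kbps)
capacity_gain = active_rate_kbps / average_rate
print(average_rate, round(capacity_gain, 2))  # 3.8 kb/s average, ~2.11x capacity
```

Under these assumptions the average rate falls from 8.0 to 3.8 kb/s, so a system dimensioned on average rate carries roughly twice as many conversations.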

PatentDOI
Benjamin K. Reaves
TL;DR: The device detects the beginning and ending portions of speech contained within an input signal based on the variance of frequency band limited energy within the signal.
Abstract: The device detects the beginning and ending portions of speech contained within an input signal based on the variance of frequency-band-limited energy within the signal. The use of the variance allows detection which is relatively independent of the absolute signal-to-noise ratio of the signal, and allows accurate detection within a wide variety of backgrounds such as music, motor noise, and other speakers. The device can be easily implemented using off-the-shelf hardware along with a high-speed special-purpose digital signal processor integrated circuit.
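The variance idea above can be sketched by tracking short-time energy and using its normalized variance (variance over squared mean), which is unchanged when the signal is scaled and hence independent of absolute level. The frame size and signals here are illustrative assumptions, and full-band energy stands in for the patent's band-limited energies:

```python
import math
import random

def energy_track(signal, frame=80):
    """Short-time energy per non-overlapping frame."""
    return [sum(x * x for x in signal[i:i + frame]) / frame
            for i in range(0, len(signal) - frame + 1, frame)]

def normalized_variance(xs):
    """Variance / mean^2: invariant to scaling the input signal."""
    m = sum(xs) / len(xs)
    var = sum((x - m) ** 2 for x in xs) / len(xs)
    return var / (m * m) if m else 0.0

random.seed(3)
fs = 8000
# Speech-like signal: a tone that switches on and off in bursts.
speech = [math.sin(2 * math.pi * 150 * i / fs) * (1.0 if (i // 800) % 2 == 0 else 0.05)
          for i in range(4000)]
steady_noise = [random.gauss(0.0, 1.0) for _ in range(4000)]

v_speech = normalized_variance(energy_track(speech))
v_noise = normalized_variance(energy_track(steady_noise))
print(v_speech > v_noise)  # bursty speech energy fluctuates far more
```

Because the statistic is scale-invariant, a single threshold on it separates bursty speech from steady backgrounds without knowing the absolute SNR.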

Journal ArticleDOI
TL;DR: These results, which resemble earlier findings obtained with orthographic visual input, indicate that the mapping from sight to sound is lexically mediated even when, as in the case of the articulatory-phonetic correspondence, the cross-modal relationship is non-arbitrary.
Abstract: In two experiments, we investigated whether simultaneous speech reading can influence the detection of speech in envelope-matched noise. Subjects attempted to detect the presence of a disyllabic utterance in noise while watching a speaker articulate a matching or a non-matching utterance. Speech detection was not facilitated by an audio-visual match, which suggests that listeners relied on low-level auditory cues whose perception was immune to cross-modal top-down influences. However, when the stimuli were words (Experiment 1), there was a (predicted) relative shift in bias, suggesting that the masking noise itself was perceived as more speechlike when its envelope corresponded to the visual information. This bias shift was absent, however, with non-word materials (Experiment 2). These results, which resemble earlier findings obtained with orthographic visual input, indicate that the mapping from sight to sound is lexically mediated even when, as in the case of the articulatory-phonetic correspondence, the cross-modal relationship is non-arbitrary.

Proceedings ArticleDOI
Brian Mak, J.-C. Junqua, B. Reaves
23 Mar 1992
TL;DR: A new algorithm is proposed that identifies islands of reliability (essentially the portion of speech contained between the first and last vowel) using time- and frequency-based features and then applies a noise adaptive procedure to refine the endpoints.
Abstract: The authors address the problem of automatic endpoint detection in normal and adverse conditions. Attention has been given to automatic endpoint detection for both additive noise and noise-induced changes in the talker's speech production (Lombard reflex). After a comparison of several automatic endpoint detection algorithms in different noisy-Lombard conditions, the authors propose a new algorithm. This algorithm identifies islands of reliability (essentially the portion of speech contained between the first and last vowel) using time- and frequency-based features and then applies a noise adaptive procedure to refine the endpoints. It is shown that this algorithm outperforms the commonly used algorithm developed by Lamel et al. (1981), and several other recently developed methods.
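The island-of-reliability idea above can be sketched on a frame-energy track: first find a high-confidence core, then grow the endpoints outward against a lower, noise-adaptive threshold. The thresholds and the toy energy profile are illustrative assumptions, not the paper's time- and frequency-based features:

```python
def detect_endpoints(energies, noise_frames=10, core_factor=8.0, edge_factor=2.0):
    """Return (start, end) frame indices, or None if no speech core is found."""
    noise_level = sum(energies[:noise_frames]) / noise_frames  # leading frames = noise
    core = [i for i, e in enumerate(energies) if e > core_factor * noise_level]
    if not core:
        return None
    start, end = min(core), max(core)
    # Grow outward with a lower threshold to recover weak onsets and offsets.
    while start > 0 and energies[start - 1] > edge_factor * noise_level:
        start -= 1
    while end < len(energies) - 1 and energies[end + 1] > edge_factor * noise_level:
        end += 1
    return start, end

# Toy track: noise floor, weak consonant onset, vowel core, weak offset, noise.
energies = [0.01] * 10 + [0.03] + [0.2] * 5 + [0.03] * 2 + [0.01] * 5
print(detect_endpoints(energies))  # → (10, 17): weak edges are recovered
```

Anchoring on the loud vowel core first is what keeps the weak low-threshold pass from triggering on isolated noise frames far from any speech.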


PatentDOI
TL;DR: In this article, a speech coding circuit with a speech coder and a power comparator is described, consisting of a PCM encoder for converting an analog input into a digital output and a speech coder with a voice activity detector that detects whether the analog input is voice active or non-active.
Abstract: A speech coding circuit is disclosed, which comprises a PCM encoder for converting an analog input into a digital output, and a speech coder with voice activity detector which encodes the digital output from the PCM encoder into speech coding data and detects whether the analog input is voice active or non-active, for each period, and then outputs a speech detection flag indicating whether the analog input is voice active or non-active. A power comparator compares the power of the analog input with a predetermined power threshold value and outputs a level detection flag indicating voice activity or non-activity, depending on whether the power of the analog input is greater or smaller than the power threshold value. A mode switch receives the level detection flag indicating voice activity or non-activity and applies to the PCM encoder and the speech coder a mode control signal which puts them into an activated mode or a sleep mode.

Proceedings ArticleDOI
23 Mar 1992
TL;DR: Four voice packet reconstruction methods used for speech coded by code excited linear prediction (CELP)-type speech coders are described and their performance is discussed.
Abstract: Four voice packet reconstruction methods used for speech coded by code excited linear prediction (CELP)-type speech coders are described. In the first method, the authors generalize the waveform substitution technique originally developed for the PCM coded speech to the CELP speech coding. In the second method, a priority level is assigned to each speech frame to protect against those perceptually important and hard-to-reconstruct speech frames being lost. The third and fourth methods both split the information bits in a frame into two groups of different levels of importance. In method three, the bits for representing the filter parameters are given high priority and bits for representing the excitation signals are given low priority. Method four is an embedded coding technique based on two-stage CELP. The four methods were tested in combination with a simulated voice activity and queuing model and their performance is discussed.

PatentDOI
Heidi Hackbarth
TL;DR: Recognition of speech with successive expansion of a reference vocabulary can be used for automatic telephone dialing by voice input.
Abstract: Recognition of speech with successive expansion of a reference vocabulary can be used for automatic telephone dialing by voice input. Neural and conventional recognition methods are performed in parallel so that during training and configuration of the neural network, a conventional recognizer operating according to the dynamic programming principle has newly added word patterns available as references for immediate use in recognition. Upon completion of the training and configuration, the neural network takes over recognition of the now-expanded vocabulary.

Book
01 Jan 1992
TL;DR: The book surveys voice communications and speech processing, including quality evaluation of speech processing systems and the application of audio/speech recognition for military requirements.
Abstract: 1: Overview of Voice Communications and Speech Processing.- 2: The Speech Signal.- 3: Speech Coding.- 4: Voice Interactive Information Systems.- 5: Speech Recognition Based on Pattern Recognition Approaches.- 6: Quality Evaluation of Speech Processing Systems.- 7: Speech Processing Standards.- 8: Application of Audio/Speech Recognition for Military Requirements.- Selective Bibliography with Abstract.