
Showing papers on "Voice activity detection published in 1985"


Proceedings ArticleDOI
S. Roucos1, A. Wilgus1
26 Apr 1985
TL;DR: A new and simple method for speech rate modification that yields high-quality rate-modified speech is presented, along with both objective and informal subjective results for the new and previous TSM methods.
Abstract: We present a new and simple method for speech rate modification that yields high quality rate-modified speech. Earlier algorithms either required a significant amount of computation for good quality output speech or resulted in poor quality rate-modified speech. The algorithm we describe allows arbitrary linear or nonlinear scaling of the time axis. The algorithm operates in the time domain using a modified overlap-and-add (OLA) procedure on the waveform. It requires moderate computation and could be easily implemented in real time on currently available hardware. The algorithm works equally well on single voice speech, multiple-voice speech, and speech in noise. In this paper, we discuss an earlier algorithm for time-scale modification (TSM), and present both objective and informal subjective results for the new and previous TSM methods.
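
For readers who want to experiment, the modified overlap-and-add idea can be sketched roughly as follows. This is a generic synchronized-OLA illustration in Python/NumPy, not the authors' exact algorithm; the frame length, overlap, and search range are arbitrary assumptions.

```python
import numpy as np

def ola_time_scale(x, rate, frame=800, overlap=200, search=100):
    """Rough synchronized overlap-and-add time-scale modification.

    rate > 1 shortens (speeds up) the speech, rate < 1 lengthens it.
    Frame/overlap/search sizes are illustrative, not taken from the paper.
    """
    x = np.asarray(x, dtype=float)
    hop_in = int((frame - overlap) * rate)      # analysis hop along the input
    y = x[:frame].copy()
    pos = hop_in
    while pos + frame + search <= len(x):
        tail = y[-overlap:]
        # Search a small lag range for the best waveform alignment with the
        # tail of the already-synthesized output (cross-correlation).
        best_lag = max(range(search),
                       key=lambda l: np.dot(tail, x[pos + l:pos + l + overlap]))
        seg = x[pos + best_lag:pos + best_lag + frame]
        # Cross-fade the overlapping region, then append the rest of the frame.
        fade = np.linspace(0.0, 1.0, overlap)
        y[-overlap:] = (1.0 - fade) * tail + fade * seg[:overlap]
        y = np.concatenate([y, seg[overlap:]])
        pos += hop_in
    return y
```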

420 citations


PatentDOI
TL;DR: A system is disclosed for recognizing a pattern in a collection of data given a context of one or more other patterns previously identified, which enables an operator to confirm the system's best guess as to the spoken word merely by speaking another word.
Abstract: A system is disclosed for recognizing a pattern in a collection of data given a context of one or more other patterns previously identified. Preferably the system is a speech recognition system, the patterns are words and the collection of data is a sequence of acoustic frames. During the processing of each of a plurality of frames, for each word in an active vocabulary, the system updates a likelihood score representing a probability of a match between the word and the frame, combines a language model score based on one or more previously recognized words with that likelihood score, and prunes the word from the active vocabulary if the combined score is below a threshold. A rapid match is made between the frames and each word of an initial vocabulary to determine which words should originally be placed in the active vocabulary. Preferably the system enables an operator to confirm the system's best guess as to the spoken word merely by speaking another word, to indicate that an alternate guess by the system is correct by typing a key associated with that guess, and to indicate that neither the best guess nor the alternate guesses was correct by typing yet another key. The system includes other features, including ones for determining where among the frames to look for the start of speech, and a special hardware processor for computing likelihood scores.
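
To make the per-frame pruning step concrete, here is a toy sketch in Python: an acoustic likelihood and a language-model score (both in the log domain) are combined, and words falling too far below the best score are dropped from the active vocabulary. The relative beam, the score values, and all names are hypothetical illustrations, not the patent's actual rapid-match or likelihood computations.

```python
def prune_active_vocabulary(active, acoustic_score, lm_score, beam=8.0):
    """Keep only words whose combined log-domain score is within `beam`
    of the best-scoring word for the current frame (a relative beam is
    used here for illustration; the patent describes a threshold)."""
    combined = {w: acoustic_score[w] + lm_score[w] for w in active}
    best = max(combined.values())
    return {w for w, s in combined.items() if s >= best - beam}

# Hypothetical one-frame example with made-up log scores.
active = {"dictate", "delete", "insert", "stop"}
acoustic = {"dictate": -12.0, "delete": -15.5, "insert": -30.2, "stop": -14.1}
language = {"dictate": -2.0, "delete": -3.5, "insert": -4.0, "stop": -1.2}
print(prune_active_vocabulary(active, acoustic, language))
# -> the words that survive this frame and stay in the active vocabulary
```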

208 citations


Journal ArticleDOI
01 Nov 1985
TL;DR: This paper discusses the nature of variabilities, describes the kinds of speech knowledge that may help us understand them, and advocates specific procedures for the increased utilization of speech knowledge in automatic speech recognition.
Abstract: In automatic speech recognition, the acoustic signal is the only tangible connection between the talker and the machine. While the signal conveys linguistic information, this information is often encoded in such a complex manner that the signal exhibits a great deal of variability. In addition, variations in environment and speaker can introduce further distortions that are linguistically irrelevant. This paper has three aims: 1) to discuss the nature of variabilities; 2) to describe the kinds of speech knowledge that may help us understand variabilities; and 3) to advocate and suggest specific procedures for the increased utilization of speech knowledge in automatic speech recognition.

186 citations


Patent
03 Sep 1985
TL;DR: An improved hands-free user-interactive control and dialing system is disclosed for use with a speech communications device; the control system includes a dynamic noise suppressor (410), a speech recognizer (420) for implementing voice control, a device controller (430) responsive to the speech recognizer, and a speech synthesizer (440) for providing reply information to the user about the communication device's operating status.
Abstract: PCT No. PCT/US85/01672 Sec. 371 Date Sep. 3, 1985 Sec. 102(e) Date Sep. 3, 1985 PCT Filed Sep. 3, 1985 PCT Pub. No. WO87/01546 PCT Pub. Date Mar. 12, 1987. An improved hands-free user-interactive control and dialing system is disclosed for use with a speech communications device. The control system (400) includes a dynamic noise suppressor (410), a speech recognizer (420) for implementing voice-control, a device controller (430) responsive to the speech recognizer for controlling operating parameters of the speech communications device (450) and for producing status information representing the operating status of the device, and a speech synthesizer (440) for providing reply information to the user as to the speech communications device operating status. In a mobile radiotelephone application, the spectral subtraction noise suppressor (414) is configured to improve the performance of the speech recognizer (424), the voice quality of the transmitted audio (417), and the audio switching operation of the vehicular speakerphone (460). The combination of noise processing, speech recognition, and speech synthesis provides a substantial improvement to prior art control systems.
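
The spectral subtraction noise suppressor referred to above works by estimating a noise magnitude spectrum during speech pauses and subtracting it from each incoming frame. The single-frame sketch below is a generic textbook version; the oversubtraction factor and spectral floor are common illustrative choices, not values from the patent.

```python
import numpy as np

def spectral_subtract(frame, noise_mag, alpha=2.0, floor=0.02):
    """One frame of magnitude-domain spectral subtraction (illustrative).

    frame     -- windowed time-domain samples (1-D array)
    noise_mag -- magnitude spectrum estimated during a speech pause
    alpha     -- oversubtraction factor
    floor     -- fraction of the noise magnitude kept as a spectral floor
    """
    spec = np.fft.rfft(frame)
    mag, phase = np.abs(spec), np.angle(spec)
    clean = np.maximum(mag - alpha * noise_mag, floor * noise_mag)
    # Resynthesize with the noisy phase, which spectral subtraction keeps.
    return np.fft.irfft(clean * np.exp(1j * phase), n=len(frame))
```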

166 citations


Journal ArticleDOI
TL;DR: An examination of speech impairments likely to arise in dynamically managed voice (DMV) systems, which use speech activity detection to exploit speech idle time and variable bit rate coding to exploit nonstationary speech statistics, identifies two impairments not commonly found in traditional communication systems: variable speech burst delay and speech clipping.
Abstract: The purpose of this paper is to examine speech impairments likely to arise in dynamically managed voice (DMV) systems. DMV systems utilize speech activity detection to exploit speech idle time and variable bit rate coding to exploit nonstationary speech statistics. The emphasis here is on systems using speech detection. This processing introduces two impairments not commonly found in traditional communication systems: variable speech burst delay and speech clipping. Simulations of these impairments were implemented, and formal subjective testing was performed to assess subjects' reactions to a range of impairment levels. Emphasis was on formal subjective listening tests and customer opinion of speech quality as defined by a rating scale. The test conditions are applicable to general telephony, where relatively high speech quality is required. Results on variable speech burst delay and front-end and midspeech burst clipping are presented. These results serve as input to the design process and to the establishment of performance guidelines for DMV systems.
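
For intuition, front-end clipping (one of the impairments studied) can be simulated by discarding the first few tens of milliseconds of each detected speech burst. The crude frame-energy detector and the clipping amount below are assumptions for illustration only, not the paper's test conditions.

```python
import numpy as np

def clip_burst_onsets(x, fs=8000, clip_ms=50, frame_ms=10, thresh=1e-3):
    """Zero out the first `clip_ms` of every speech burst found by a simple
    frame-energy detector (all parameter values are illustrative)."""
    x = np.asarray(x, dtype=float)
    n = int(fs * frame_ms / 1000)
    clip_n = int(fs * clip_ms / 1000)
    y = x.copy()
    in_burst = False
    for start in range(0, len(x) - n + 1, n):
        energy = np.mean(x[start:start + n] ** 2)   # detect on the original signal
        if not in_burst and energy > thresh:
            y[start:start + clip_n] = 0.0            # simulate the lost burst onset
            in_burst = True
        elif in_burst and energy <= thresh:
            in_burst = False
    return y
```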

119 citations


Proceedings ArticleDOI
01 Apr 1985
TL;DR: In vector quantization schemes, speech- and speaker-dependent codebooks are usually applied in order to achieve good speech quality at medium bit rates, but this paper deals with another approach: the speech waveforms are transformed into signals which ideally no longer contain speech- and speaker-specific features.
Abstract: In vector quantization schemes, speech- and speaker-dependent codebooks are usually applied in order to achieve good speech quality at medium bit rates. This paper deals with another approach: the speech waveforms are transformed into signals which ideally no longer contain speech- and speaker-specific features. Thus these signals can be encoded by a universal vector quantizer. This concept is realized by a system called RELP-VQ. The performance of this RELP-VQ scheme was evaluated by SNR measurements as well as by informal listening tests including female and male English and German speakers.
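
For readers unfamiliar with vector quantization, the encoder side reduces to a nearest-codeword search, sketched below with a random codebook purely for illustration; in RELP-VQ the vectors would first be stripped of speech- and speaker-specific structure so that one universal codebook suffices.

```python
import numpy as np

def vq_encode(vectors, codebook):
    """Index of the nearest codeword for each vector (squared Euclidean).
    Shapes: vectors (N, d), codebook (K, d); returns (N,) indices."""
    d2 = ((vectors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

# Illustrative use with a random codebook; a real universal codebook would
# be trained (e.g. with the LBG algorithm) on the transformed signals.
rng = np.random.default_rng(0)
codebook = rng.standard_normal((64, 8))    # 64 codewords of dimension 8
frames = rng.standard_normal((100, 8))     # 100 vectors to quantize
indices = vq_encode(frames, codebook)      # what the encoder transmits
reconstructed = codebook[indices]          # decoder's table lookup
```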

112 citations


PatentDOI
TL;DR: In this article, a method and apparatus for noise suppression for speech recognition systems is proposed which employs the principle of least-mean-square estimation implemented with conditional expected values.
Abstract: A method and apparatus for noise suppression for speech recognition systems which employs the principle of least-mean-square estimation implemented with conditional expected values. Essentially, according to this method, one computes a series of optimal estimators; these estimators and their variances are then employed to implement a noise-immune metric. This noise-immune metric enables the system to replace a noisy distance with an expected value, which is calculated from combined speech and noise data occurring in the bandpass filter domain. Thus the system can be used with any set of speech parameters and is relatively independent of a specific speech recognition apparatus structure.

79 citations


PatentDOI
Takehiro Yoshida1
TL;DR: A communication apparatus with an auto-communication mechanism for automatically communicating data, a speech communication mechanism for allowing an operator to communicate speech, and a speech output mechanism for outputting speech in accordance with an output of the speech detection mechanism when the communication apparatus is in the auto-communication mode.
Abstract: A communication apparatus with an auto-communication mechanism for automatically communicating data, a speech communication mechanism for allowing an operator to communicate speech, a selection mechanism for selecting the auto-communication mechanism or the speech communication mechanism to set the communication apparatus in an auto-communication mode or a speech communication mode, a speech presence identification mechanism for detecting speech sent from a destination station, and a speech output mechanism for outputting speech in accordance with an output of the speech detection mechanism when the communication apparatus is in the auto-communication mode.

73 citations


Proceedings ArticleDOI
26 Apr 1985
TL;DR: A flexible analysis-synthesis system with signal dependent features is described and used to realize some desired voice characteristics in synthesized speech.
Abstract: A flexible analysis-synthesis system with signal dependent features is described and used to realize some desired voice characteristics in synthesized speech. The intelligibility of synthetic speech appears to depend on the ability to reproduce dynamic sounds such as stops, whereas the quality of voice is mainly determined by the true reproduction of voiced segments. We describe our work in converting the speech of one speaker to sound like that of another. A number of factors are important for maintaining the quality of the voice during this conversion process. These factors are derived from both the speech and electroglottograph signals.

72 citations


Proceedings ArticleDOI
26 Apr 1985
TL;DR: In this paper a sinusoidal model for the speech waveform is used to develop a new analysis/synthesis technique that is characterized by the amplitudes, frequencies, and phases of the component sine waves.
Abstract: In this paper a sinusoidal model for the speech waveform is used to develop a new analysis/synthesis technique that is characterized by the amplitudes, frequencies, and phases of the component sine waves. The resulting synthetic waveform preserves the waveform shape and is essentially perceptually indistinguishable from the original speech. Furthermore, in the presence of noise the perceptual characteristics of the speech and the noise are maintained. Based on this system, a coder operating at 8 kbps is developed that codes the amplitudes and phases of each of the sine wave components and uses a harmonic model to code all of the frequencies. Since not all of the phases can be coded, a high frequency regeneration technique is developed that exploits the properties of the sinusoidal representation of the coded baseband signal. Based on a relatively limited data base, computer simulation has demonstrated that coded speech of good quality can be achieved. A real-time simulation is being developed to provide a more thorough evaluation of the algorithm.
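
As a rough, single-frame illustration of the sinusoidal representation (not the paper's full analysis/synthesis system with frame-to-frame track matching), the sketch below picks spectral peaks in a windowed frame and resynthesizes it as a sum of sine waves using the measured amplitudes, frequencies, and phases.

```python
import numpy as np

def sinusoidal_frame(frame, fs, max_peaks=40):
    """Analyze one frame into (amplitude, frequency, phase) triples at local
    spectral peaks and resynthesize it as a sum of sinusoids (toy version)."""
    n = len(frame)
    win = np.hanning(n)
    spec = np.fft.rfft(frame * win)
    mag = np.abs(spec)
    # Naive peak picking: local maxima, strongest first.
    peaks = [k for k in range(1, len(mag) - 1) if mag[k - 1] < mag[k] >= mag[k + 1]]
    peaks = sorted(peaks, key=lambda k: mag[k], reverse=True)[:max_peaks]
    t = np.arange(n) / fs
    y = np.zeros(n)
    for k in peaks:
        amp = 2.0 * mag[k] / win.sum()        # undo window gain (approximate)
        freq = k * fs / n
        phase = np.angle(spec[k])
        y += amp * np.cos(2 * np.pi * freq * t + phase)
    return y
```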

70 citations


Patent
25 Mar 1985
TL;DR: In this article, a weighing scale with calculating functions is provided with an automatic control system that accepts words of speech spoken into a microphone as inputs for controlling the scale.
Abstract: A weighing scale involving calculating functions is provided with an automatic control system that can accept words of speech spoken into a microphone as inputs to the automatic control system.


Journal ArticleDOI
TL;DR: In this paper, the authors review human factors research on the design of systems that use speech recognition for human control of the system or use speech generation for the display of information, and present a survey of the literature.
Abstract: This article reviews human factors research on the design of systems that use speech recognition for human control of the system or that use speech generation for the display of information. Speech...

Proceedings ArticleDOI
26 Apr 1985
TL;DR: The results replicated previous studies demonstrating reliable increases in amplitude, duration and vocal pitch while talking in noise and found reliable differences in the tilt of the short-term spectrum of consonants and vowels.
Abstract: Acoustical analyses were carried out on the digits 0-9 spoken by two male talkers in the quiet and in 90 dB SPL of masking noise in their headphones. The results replicated previous studies demonstrating reliable increases in amplitude, duration and vocal pitch while talking in noise. We also found reliable differences in the tilt of the short-term spectrum of consonants and vowels. The results are discussed in terms of: (1) the development of algorithms for recognition of speech in noise; (2) the nature of the acoustic changes that take place when talkers produce speech under adverse conditions such as noise, stress or high cognitive load; and, (3) the role of training and feedback in controlling and modifying a talker's speech to improve performance of current speech recognizers.

PatentDOI
TL;DR: It becomes possible to prepare highly reliable reference pattern vectors in an easy manner from a small number of speech patterns, which makes it possible to achieve an improvement in the speech recognition factor.
Abstract: In the learning method of reference pattern vectors for speech recognition in accordance with the present invention, a plurality of speech feature vectors are generated from the time series of speech feature parameters for the input speech pattern, taking into account knowledge concerning the variation tendencies of the speech patterns, and the learning (preparation) of reference pattern vectors for speech recognition is carried out using these speech feature vectors. Therefore, it becomes possible to prepare highly reliable reference pattern vectors in an easy manner from a small number of speech patterns, which makes it possible to achieve an improvement in the speech recognition factor. In particular, it becomes possible to easily improve the reference pattern vectors by effective use of a relatively small number of input speech patterns.

Journal ArticleDOI
TL;DR: Modifications which improve the quality of the synthesized speech without requiring the transmission of additional data are presented and diagnostic acceptability measure tests show an increase of up to five points in overall speech quality with the implementation of these improvements.
Abstract: The major weakness of the current narrow-band LPC synthesizer lies in the use of a "canned" invariant excitation signal. The use of such an excitation signal is based on three primary assumptions, namely, 1) that the amplitude spectrum of the excitation signal is flat and time invariant, 2) that the phase spectrum of the voiced excitation signal is a time-invariant function of frequency, and 3) that the probability density function of the phase spectrum of the unvoiced excitation signal is also time invariant. This paper critically examines these assumptions and presents modifications which improve the quality of the synthesized speech without requiring the transmission of additional data. Diagnostic acceptability measure (DAM) tests show an increase of up to five points in overall speech quality with the implementation of each of these improvements. These modifications can also improve the speech quality of LPC-based speech synthesizers.
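
For context, the "canned" excitation baseline that this paper improves on is the classical LPC synthesizer driven by a pulse train for voiced frames and white noise for unvoiced frames; the generic sketch below shows that baseline, not the paper's modified excitation.

```python
import numpy as np

def lpc_synthesize(a, gain, voiced, pitch_period, n, seed=0):
    """Classical LPC frame synthesis with pulse/noise excitation.

    a            -- analysis filter coefficients [1, a1, ..., ap]
    gain         -- frame gain
    voiced       -- True for an impulse-train excitation, False for noise
    pitch_period -- pitch period in samples (used when voiced)
    n            -- frame length in samples
    """
    if voiced:
        exc = np.zeros(n)
        exc[::pitch_period] = 1.0            # impulse train at the pitch period
    else:
        exc = np.random.default_rng(seed).standard_normal(n)
    # All-pole synthesis filter 1/A(z): y[i] = G*exc[i] - sum_j a[j]*y[i-j].
    y = np.zeros(n)
    for i in range(n):
        acc = gain * exc[i]
        for j in range(1, len(a)):
            if i - j >= 0:
                acc -= a[j] * y[i - j]
        y[i] = acc
    return y
```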

Proceedings ArticleDOI
Frederick Jelinek1
01 Apr 1985
TL;DR: The architecture of an experimental, real-time, isolated-word, speech recognition system with a 5,000-word vocabulary which can be used for dictating office correspondence is described and some recent experimental results obtained are given.
Abstract: The Speech Recognition Group at IBM, Yorktown Heights, has recently completed the implementation of an experimental, real-time, isolated-word, speech recognition system with a 5,000-word vocabulary which can be used for dictating office correspondence. Typical recognition accuracy is greater than 94% correct word recognition for words within the vocabulary. We first describe the architecture of this system, and then give some recent experimental results obtained with it for read and spontaneously dictated speech from five speakers.

Patent
Takahashi Tsutomu1
26 Mar 1985
TL;DR: In this article, a mobile radio data communication system where a fixed unit transmits an operation command code sequence for an instruction to an operator having a portable unit is presented, where a speech synthesizer of the portable unit generates a speech signal which is audible to the operator, in response to the audible instruction, the operator replies by voice, and the speech signal is supplied to a speech recognition circuit which sends a code associated with the signal to the fixed unit.
Abstract: A mobile radio data communication system wherein a fixed unit transmits an operation command code sequence for an instruction to an operator having a portable unit. A speech synthesizer of the portable unit is responsive to the command code sequence to generate a speech signal which is audible to the operator. In response to the audible instruction, the operator replies by voice, and the speech signal is supplied to a speech recognition circuit which sends a code associated with the speech signal to the fixed unit.

Patent
04 Jun 1985
TL;DR: In this article, a voice activated echo generator employs digital voice recording technology to digitize several spoken words of audio using low-cost encoding techniques, which is played back by reconverting the data back from its digitized form to audio, which then drives a loudspeaker.
Abstract: A voice activated echo generator employs digital voice recording technology to digitize several spoken words of audio using low-cost encoding techniques. The audio information is stored in dynamic memory and is played back by reconverting the data back from its digitized form to audio, which then drives a loudspeaker. The echo generator has many design parameters which can be varied, such as, duration of recorded speech, voice actuation sensitivity, number of playback repetitions, speech quality, playback speed, playback pitch, and playback volume. For a toy, an input speech threshold initiates a distinct record interval followed by a substantially immediate and automatic playback interval during which the echo is generated without the problem of acoustic feedback.
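
The record-then-echo behaviour can be caricatured in a few lines: wait for the input to cross a level threshold, capture a fixed-length buffer, and return it (optionally repeated) for playback. The threshold, record length, and repetition count are placeholders, and none of the patent's encoding, pitch, or speed controls are modelled.

```python
import numpy as np

def echo_toy(samples, fs=8000, thresh=0.05, record_s=2.0, repeats=1):
    """Return the buffer to be echoed after the first threshold crossing
    (parameter values are placeholders, not taken from the patent)."""
    samples = np.asarray(samples, dtype=float)
    loud = np.abs(samples) > thresh
    if not loud.any():
        return np.array([])                   # nothing crossed the threshold
    onset = int(np.argmax(loud))              # first sample above threshold
    record_n = int(fs * record_s)
    buffer = samples[onset:onset + record_n]  # the distinct record interval
    return np.tile(buffer, repeats)           # played back to the loudspeaker
```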

Proceedings ArticleDOI
S. Roucos1, A. Wilgus1
26 Apr 1985
TL;DR: Methods for high-quality modification of the pitch and duration of a segment of a speech waveform are presented, and it is shown how these methods can be applied to improve the quality of the segment vocoder's output speech.
Abstract: We propose a new method of synthesis to be used for the segment vocoder, which transmits intelligible speech at rates below 300 b/s. The earlier segment vocoder applies LPC analysis to input speech, divides it into segments of variable duration, matches each segment with the nearest template from a codebook, concatenates at the receiver the set of nearest templates, and finally synthesizes the resultant sequence of speech frames using LPC synthesis. The quality of such a segment vocoder cannot exceed that of a standard unquantized LPC vocoder, which sounds buzzy due to the pulse/noise excitation used. Alternatively, by beginning with the waveforms (not the spectral representation) corresponding to the set of nearest templates, we can independently modify the pitch, energy, and duration of each template to match those of the input segment. These modified segments are then concatenated to produce the output waveform. We present here methods for high-quality modification of the pitch and duration of a segment of a speech waveform and show how these methods can be applied to improve the quality of the segment vocoder's output speech.

Proceedings ArticleDOI
01 Apr 1985
TL;DR: The speech recognition accuracy of this method in recognizing non-training voice data was 95.8% with automatic segmentation, and the category of the nearest reference pattern is taken as the result.
Abstract: This paper describes a recognition method, a reference pattern generation method, and an evaluation of speaker-independent recognition for telephone speech response systems. The input utterance is analyzed by 19-channel BPFs. The power and vocal cord source characteristics are normalized. Time normalization is realized by linearly compressing or expanding to 32 frames. The speech pattern undergoes pattern matching with male and female reference patterns, and the category of the nearest reference pattern is taken as the result. It is necessary to optimize the reference patterns so that the speech can be correctly recognized in spite of differences in formant frequencies and slight segmentation errors. To optimize the reference patterns, recognition of the training patterns and updating of the reference patterns are repeated. A total of 256 male and female reference patterns were generated. The speech recognition accuracy of this method in recognizing non-training voice data was 95.8% with automatic segmentation.
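
The linear time normalization to 32 frames and the nearest-reference-pattern decision described above can be sketched as follows; the Euclidean distance and per-channel interpolation are illustrative choices, and the 19-channel BPF analysis and level normalizations are assumed to have already produced the feature frames.

```python
import numpy as np

def normalize_to_32(frames):
    """Linearly compress or expand a (T, 19) feature pattern to 32 frames."""
    t_in = np.linspace(0.0, 1.0, len(frames))
    t_out = np.linspace(0.0, 1.0, 32)
    channels = [np.interp(t_out, t_in, frames[:, c]) for c in range(frames.shape[1])]
    return np.stack(channels, axis=1)

def recognize(pattern, references):
    """Return the category of the nearest reference pattern.

    references -- dict mapping category name to a (32, 19) reference pattern.
    """
    pattern = normalize_to_32(np.asarray(pattern, dtype=float))
    distances = {cat: np.linalg.norm(pattern - ref) for cat, ref in references.items()}
    return min(distances, key=distances.get)
```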

Proceedings ArticleDOI
M. Copperi1, D. Sereno
26 Apr 1985
TL;DR: The main objective is improving the excitation representation in a linear predictive coding scheme and, hence, the subjective quality of synthesized speech signals.
Abstract: Considerable effort has been and is currently being concentrated on improving the speech quality at low and very low bit rates. Recently new models of LPC excitation have been devised, which are able to yield good quality speech by exploiting our knowledge of the human speech production and perception processes. Unfortunately, these models generally require too much computational load to be easily implemented on currently available hardware. This paper describes an efficient speech coder, capable of providing acceptable quality speech, within the limitations of both low bit rate (approximately 2.4 kbit/s) and real-time implementation. The coder is based upon pattern classification and cluster analysis with perceptually-meaningful error minimization criteria. Our main objective is improving the excitation representation in a linear predictive coding scheme and, hence, the subjective quality of synthesized speech signals.

PatentDOI
Tetsu Taguchi1
TL;DR: Speech analysis and synthesis involve analysis for sinusoidal components and pitch frequency, and synthesis by first phase-resetting to zero, at the pitch period, all sine oscillator components, whether periodic for voiced speech or at a random period in accordance with a random code for unvoiced speech, as discussed by the authors.
Abstract: Speech analysis and synthesis involve analysis for sinusoidal components and pitch frequency, and synthesis by first phase-resetting to zero, at the pitch period, all sine oscillator components, whether periodic for voiced speech or at a random period in accordance with a random code for unvoiced speech. As a result, the synthesized speech signal has the initial line spectrum spread due to pitch structure, for better speech quality. Frequency modulation may also be used.


Journal ArticleDOI
TL;DR: An analytical derivation of a simple noniterative technique for extracting a multiple impulse excitation model for synthesized speech directly from the LPC residual sequence, which is very applicable for speech enhancement where processor capability is limited.
Abstract: This paper provides an analytical derivation of a simple noniterative technique for extracting a multiple impulse excitation model for synthesized speech directly from the LPC residual sequence. While suboptimal with respect to "multipulse" techniques, this method is very applicable for speech enhancement where processor capability is limited. The results suggest an additional "orthogonality" requirement between the excitation sequence and the resulting prediction error, which aids in the intuitive understanding of the method.

Book ChapterDOI
01 Apr 1985
TL;DR: In this article, a system for speech synthesis by rule is described, which uses demisyllables (DSs) as phonetic units and concatenation is discussed in detail; the pertinent stage converts a string of phonetic symbols into a stream of speech parameter frames.
Abstract: A system for speech synthesis by rule is described which uses demisyllables (DSs) as phonetic units. The problem of concatenation is discussed in detail; the pertinent stage converts a string of phonetic symbols into a stream of speech parameter frames. For German, about 1650 DSs are required to permit synthesizing a very large vocabulary. Synthesis is controlled by 18 rules which are used for splitting up the phonetic string into DSs, for selecting the DSs in such a way that the inventory size is minimized, and, last but not least, for concatenation. The quality and intelligibility of the synthetic signal are very good; in a subjective test the median word intelligibility dropped from 96.6% for an LPC vocoder to 92.1% for the DS synthesis, and the quality difference between the DS synthesis and ordinary vocoded speech was judged to be very small.

Proceedings ArticleDOI
26 Apr 1985
TL;DR: A hybrid approach to computer speech recognition is proposed, suited both for machine implementation and for perceiving subtle differences in phonetic structure.
Abstract: Computer speech recognition is a discipline that has been viewed from two diametrically opposed perspectives. One perspective perceives recognition as a purely mathematical process; the other perceives it as an extensive linguistical "knowledge" base. Because each perspective has its own set of limitations, neither approach has been able to achieve a viable machine realization of human auditory capabilities. Mathematical approaches do not perform fine phonetic distinctions well; linguistical approaches are not suitably machine oriented. We, therefore, propose in this paper a hybrid approach, suited both for machine implementation and for perceiving subtle differences in phonetic structure.

Journal ArticleDOI
TL;DR: Quality evaluation tests are reported which show that this type of coder, operating at 7.2 kbps, allows the transmission of telephone speech with communications quality and is a good candidate for telephony applications such as digital trunk transmissions, satellite speech communications, secure voice communications, and audio distribution systems.
Abstract: In this paper, we discuss the implementation of a medium-bit-rate linear prediction baseband coder on an IBM bipolar signal processor prototype having a high processing capacity. We show that the implementation of our algorithm requires a processing load of 5 MIPS, with a program size of 5K instructions. We then discuss the application of our coder in a normal telephone environment, which requires mu-law to linear PCM conversion and other signal processing functions such as voice activity detection, automatic gain control, echo control, and error recovery. Quality evaluation tests are also reported which show that this type of coder, operating at 7.2 kbps, allows the transmission of telephone speech with communications quality. Moreover, obtained intelligibility scores and speaker recognition levels are high enough to demonstrate that this coder is a good candidate for telephony applications such as digital trunk transmissions, satellite speech communications, secure voice communications, and audio distribution systems.
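
One of the ancillary functions mentioned above, mu-law to linear PCM conversion, follows the standard companding law; the sketch below uses the continuous mu-law characteristic with mu = 255 rather than the exact G.711 segment tables.

```python
import numpy as np

def mulaw_expand(y, mu=255.0):
    """Expand mu-law-companded samples in [-1, 1] back to linear PCM."""
    y = np.asarray(y, dtype=float)
    return np.sign(y) * ((1.0 + mu) ** np.abs(y) - 1.0) / mu

def mulaw_compress(x, mu=255.0):
    """Companding direction, for reference: linear [-1, 1] -> mu-law."""
    x = np.asarray(x, dtype=float)
    return np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
```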


Patent
28 Jun 1985
TL;DR: A detection statistic T(iT_0), used to estimate the short-term speech energy, is developed from energy estimates made in each subband, and a speech presence energy threshold λ_ON and a speech silence energy threshold λ_OFF are computed which adapt to the long-term speaker level.
Abstract: Speech detection is accomplished in conjunction with two-band subband encoding. A detection statistic T(iT_0), used to estimate the short-term speech energy, is developed from energy estimates made in each subband. A speech presence energy threshold λ_ON and a speech silence energy threshold λ_OFF are computed which adapt to the long-term speech level. The detection statistic is compared to the thresholds to make a decision concerning the presence or absence of speech. Also disclosed are considerations for extending the detection to arrangements with more than two subbands.
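
A simplified single-band sketch of the dual-threshold logic: a short-term energy statistic is compared against a speech-presence threshold λ_ON and a lower speech-silence threshold λ_OFF, both derived from a slowly adapting long-term level estimate. The frame size, threshold margins, and adaptation constant below are illustrative assumptions; the patent forms its statistic from the two subband energies.

```python
import numpy as np

def detect_speech(x, frame=160, on_margin=9.0, off_margin=3.0, adapt=0.995):
    """Per-frame speech/silence decisions with adaptive ON/OFF thresholds.

    Margins are in dB above a long-term background-level estimate; all
    values here are illustrative, not the patent's.
    """
    x = np.asarray(x, dtype=float)
    decisions = []
    background_db = None
    speaking = False
    for start in range(0, len(x) - frame + 1, frame):
        e_db = 10.0 * np.log10(np.mean(x[start:start + frame] ** 2) + 1e-12)
        if background_db is None:
            background_db = e_db              # initialize from the first frame
        lam_on = background_db + on_margin    # speech presence threshold
        lam_off = background_db + off_margin  # speech silence threshold
        if not speaking and e_db > lam_on:
            speaking = True
        elif speaking and e_db < lam_off:
            speaking = False
        if not speaking:
            # Let the long-term level estimate adapt only during silence.
            background_db = adapt * background_db + (1.0 - adapt) * e_db
        decisions.append(speaking)
    return np.array(decisions)
```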