
Showing papers on "Voice activity detection published in 1990"



Journal ArticleDOI
TL;DR: With Xspeak, window navigation tasks usually performed with a mouse can be controlled by voice, and an improved version, Xspeak II, which incorporates a language for translating spoken commands, is introduced.
Abstract: Some necessary background in speech recognition and window systems is given, with an analysis of how they might be combined. Xspeak, a navigation application, and its operation and a field study of its use are described. With Xspeak, window navigation tasks usually performed with a mouse can be controlled by voice. An improved version, Xspeak II, which incorporates a language for translating spoken commands, is introduced.

249 citations


Proceedings ArticleDOI
03 Apr 1990
TL;DR: Switching adaptive filters, suitable for speech beamforming, with no prior knowledge about the speech source are presented, and the most robust solution, i.e. a delay and sum beamformer that cues in on the direct path only and neglects all multipath contributions is given.
Abstract: Switching adaptive filters, suitable for speech beamforming, with no prior knowledge about the speech source are presented. The filters have two sections, of which only one section at any given time is allowed to adapt its coefficients. The switch between both is controlled by a speech detection function. The first section implements an adaptive look direction and cues in on the desired speech. This section only adapts when speech is present. The second section acts as a multichannel adaptive noise canceller. The obtained noise references are typically very bad; hence, adaptation must be restricted to silence-only periods. Several ideas were explored for the first section. The most robust solution, and the one with the best sound quality, was given by the simplest solution, i.e. a delay and sum beamformer that cues in on the direct path only and neglects all multipath contributions. Tests were performed with a four-microphone array in a highly reverberant room with both music and fan-type noise as jammers; SNR improvements of 10 dB were typical, with no audible distortion.
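The delay-and-sum stage the abstract singles out as the most robust choice can be sketched as follows. This is a minimal illustration assuming known integer sample delays per microphone; the function name and signature are invented for this sketch:

```python
import numpy as np

def delay_and_sum(channels, delays):
    """Align each microphone channel on the direct-path arrival and average.

    channels: list of 1-D arrays, one per microphone.
    delays:   per-channel integer sample delay of the direct path.
    Multipath contributions are deliberately ignored, as in the paper's
    most robust configuration.
    """
    # Longest segment available once every channel is shifted by its delay.
    n = min(len(ch) - d for ch, d in zip(channels, delays))
    aligned = [ch[d:d + n] for ch, d in zip(channels, delays)]
    return np.mean(aligned, axis=0)
```

In practice the delays would come from a direction-of-arrival estimate; fractional delays require interpolation, which this sketch omits.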

141 citations


PatentDOI
TL;DR: In this article, an enrollment process creates a set of speaker-specific parameters for normalizing analysis parameters, including the speaker's pitch, the frequency spectrum of the speech as a function of time, and certain time-domain measurements of the speech signal.
Abstract: The present invention processes an independent body of speech during an enrollment process and creates a set of speaker specific enrollment parameters for normalizing analysis parameters including the speaker's pitch, the frequency spectrum of the speech as a function of time, and certain measurements of the speech signal in the time-domain. A particular objective of the invention is to make these analysis parameters have the same meaning from speaker to speaker. Thus after the pre-processing performed by this invention, the parameters would look much the same for the same word independent of speaker. In this manner, variations in the speech signal caused by the physical makeup of a speaker's throat, mouth, lips, teeth, and nasal cavity would be, at least in part, reduced by the pre-processing.
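The idea of deriving per-speaker normalization constants during enrollment and applying them to later analysis frames can be sketched minimally. The patent's actual parameters (pitch, spectral, and time-domain measures) are richer; plain per-dimension mean/scale normalization here is an illustrative stand-in:

```python
import numpy as np

def enroll(frames):
    """Derive per-speaker normalization constants from enrollment speech.

    frames: 2-D array, one analysis frame (feature vector) per row.
    Returns per-dimension mean and scale over the enrollment material.
    """
    mean = np.mean(frames, axis=0)
    scale = np.std(frames, axis=0) + 1e-8  # guard against zero variance
    return mean, scale

def normalize(frame, mean, scale):
    """Map a new analysis frame into the speaker-normalized space, so the
    same word looks much the same independent of speaker."""
    return (frame - mean) / scale
```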

92 citations


Journal ArticleDOI
TL;DR: In this article, five approaches that can be used to control and simplify the speech recognition task are examined: isolated words, speaker-dependent systems, limited vocabulary size, a tightly constrained grammar, and quiet and controlled environmental conditions.
Abstract: Five approaches that can be used to control and simplify the speech recognition task are examined. They entail the use of isolated words, speaker-dependent systems, limited vocabulary size, a tightly constrained grammar, and quiet and controlled environmental conditions. The five components of a speech recognition system are described: a speech capture device, a digital signal processing module, preprocessed signal storage, reference speech patterns, and a pattern-matching algorithm. Current speech recognition systems are reviewed and categorized. Speaker recognition approaches and systems are also discussed.

87 citations


Journal ArticleDOI
TL;DR: In this paper, the authors present a microphone-array adaptive beamformer with a dual function: it is suited both to transmission and to use as input to speech recognition systems. The performance of the beamformer was, however, limited.

84 citations


PatentDOI
TL;DR: A voice response unit for transmitting voice prompt messages to customers and for receiving messages generated by customers in response to the voice prompt messages has a speech recognizer for recognizing customer commands used to control operation of the unit.
Abstract: A voice response unit for transmitting voice prompt messages to customers and for receiving messages generated by customers in response to the voice prompt messages. The unit has a speech recognizer for recognizing customer commands used to control operation of the voice response unit. Apparatus interconnects a voice decoder and voice recorder with a telephone line to transmit a generated voice prompt message and to receive a customer message in response thereto from a calling customer coupled with the telephone line. The apparatus is also coupled with the speech recognizer and responds to receipt of a customer command combined with a portion of the transmitted voice prompt message reflected from the telephone line by cancelling the reflected voice prompt message with the transmitted voice prompt message, thereby enabling the speech recognizer to respond to the customer command during transmission of the voice prompt message and interrupt the voice prompt message.
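The cancellation of the reflected prompt can be illustrated with an adaptive echo canceller: subtract an adaptively filtered copy of the known prompt from the line signal so the recognizer hears only the caller. LMS adaptation is one standard way to realize such cancellation; the patent does not specify this algorithm, and all names and constants below are illustrative:

```python
import numpy as np

def lms_echo_cancel(far_end, mic, n_taps=8, mu=0.05):
    """Estimate the echo of the known prompt (far_end) present in the
    line signal (mic) with an adaptive FIR filter and subtract it,
    leaving the caller's barge-in command as the residual."""
    w = np.zeros(n_taps)
    out = np.zeros(len(mic))
    for n in range(n_taps - 1, len(mic)):
        x = far_end[n - n_taps + 1:n + 1][::-1]  # most recent sample first
        e = mic[n] - w @ x                        # residual after echo estimate
        w += mu * e * x                           # LMS coefficient update
        out[n] = e
    return out
```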

68 citations


PatentDOI
TL;DR: Protection of a digital multi-pulse speech coder from fading pattern bit errors common in a digital mobile radio channel is accomplished with error detection techniques which are simple to implement and require no error correcting codes.
Abstract: Protection of a digital multi-pulse speech coder from fading pattern bit errors common in a digital mobile radio channel is accomplished with error detection techniques which are simple to implement and require no error correcting codes. A synthetic regeneration algorithm is employed which uses only the perceptually significant bits in the transmitted frame. Separate parity checksums for line spectrum pair frequency data, pitch lag data and pulse amplitude data are added to each frame of speech coder bits in the transmitter. The bits are then transmitted through a mobile environment susceptible to fading that induces bursty error patterns in the stream. At the receiving station, the parity checksum bits and speech coder bits are used to determine if an error has occurred in a particular section of the bit stream. Detected errors are flagged and supplied to the speech decoder. The speech decoder uses the error flags to modify its output signal so as to minimize perceptual artifacts in the output speech. Separate checksums are developed for subsets of line spectrum pair (LSP) coefficients and related speech data, whereby a single subset may be error-detected and replaced, rather than an entire frame.
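The per-subset parity scheme can be sketched in a few lines. The grouping below (LSP, pitch lag, pulse amplitudes) follows the abstract, but the actual bit allocation in the coder frame is not reproduced here:

```python
def group_parities(groups):
    """One even-parity bit per perceptually significant bit group
    (e.g. LSP frequencies, pitch lag, pulse amplitudes).

    groups: list of lists of 0/1 bits, one list per parameter group.
    """
    return [sum(g) % 2 for g in groups]

def flag_errors(received_groups, sent_parities):
    """Recompute parities at the receiver; a mismatch flags only the
    affected group, so the decoder can replace that subset rather
    than discard the entire frame."""
    return [sum(g) % 2 != p for g, p in zip(received_groups, sent_parities)]
```

The error flags would then drive the decoder's concealment logic (e.g. repeating the previous frame's values for the flagged group only).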

59 citations


PatentDOI
TL;DR: In this article, a harmonic signal is created from a limited spectral representation of a voice signal and combined with at least a portion of the delayed limited spectral signal to provide a reconstructed speech signal having perceptually improved audio quality.
Abstract: A harmonic signal is created from a limited spectral representation of a voice signal. The harmonic signal is combined with at least a portion of the delayed limited spectral signal to provide a reconstructed speech signal having perceptually improved audio quality.

53 citations


Journal ArticleDOI
TL;DR: The historical and theoretical bases of contemporary high-performance text-to-speech (TTS) systems and their current design are discussed, with particular reference to vocal tract models.
Abstract: The historical and theoretical bases of contemporary high-performance text-to-speech (TTS) systems and their current design are discussed. The major elements of a TTS system are described, with particular reference to vocal tract models. The stages involved in the process of converting text into speech parameters are examined, covering text normalization, word pronunciation, prosody, phonetic rules, voice tables, and hardware implementation. Examples are drawn mainly from Berkeley Speech Technologies' proprietary text-to-speech system, T-T-S, but other approaches are indicated briefly.

51 citations


Proceedings ArticleDOI
03 Apr 1990
TL;DR: A PDA is presented which outperforms the other methods regarding correct voicing decision and pitch estimation in quasi-periodic as well as in aperiodic speech signals.
Abstract: The problem of pitch determination in aperiodic speech signals and its relevance for practical computer speech applications is discussed. Four patterns of aperiodic voice excitation are distinguished systematically with respect to their acoustical characteristics and their distributional properties between different speakers (female and male) and different kinds of text boundaries. Several pitch determination algorithms (PDAs), including both time-domain and short-term analysis approaches such as the Gold-Rabiner algorithm, the SIFT algorithm, and the cepstrum method, are evaluated for their capacity to detect and identify these patterns correctly in continuous human speech. A PDA is presented which outperforms the other methods regarding correct voicing decision and pitch estimation in quasi-periodic as well as in aperiodic speech signals.
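For orientation, a classical time-domain PDA baseline of the kind the paper evaluates, a short-term autocorrelation estimator with a crude voicing decision, can be sketched as follows. This is not the paper's own algorithm, and the voicing threshold is illustrative:

```python
import numpy as np

def autocorr_pitch(frame, fs, fmin=50.0, fmax=400.0):
    """Autocorrelation pitch estimate over one analysis frame.

    Returns the pitch in Hz, or 0.0 when the frame is judged unvoiced.
    The voicing decision compares the autocorrelation peak against the
    zero-lag energy (0.3 is an illustrative threshold).
    """
    frame = frame - np.mean(frame)
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(fs / fmax), int(fs / fmin)     # candidate lag range
    lag = lo + int(np.argmax(ac[lo:hi]))        # strongest periodicity
    voiced = ac[lag] > 0.3 * ac[0]
    return fs / lag if voiced else 0.0
```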

Journal ArticleDOI
TL;DR: Digital speech technology is reviewed, with the emphasis on applications demanding high-quality reproduction of the speech signal, which include the important subclass of wideband speech.
Abstract: Digital speech technology is reviewed, with the emphasis on applications demanding high-quality reproduction of the speech signal. Examples of such applications are network telephony, ISDN terminals for audio teleconferencing, and systems for the storage of audio signals, which include the important subclass of wideband speech. Depending on the application, the bandwidth of input speech can vary from about 3 kHz to nearly 20 kHz. Coding for digital telephony at 4 and 8 kb/s, network quality coding at 16 kb/s, and coding for audio at 7 and 20 kHz are examined. Future directions in the field are discussed with respect to anticipated technology applications and the algorithms needed to support these technologies.

Journal ArticleDOI
TL;DR: The use of speaker-independent speech recognition in the development of Northern Telecom's automated alternate billing service (AABS) for collect calls, third-number-billed calls, and calling-card-billed calls is discussed.
Abstract: The use of speaker-independent speech recognition in the development of Northern Telecom's automated alternate billing service (AABS) for collect calls, third-number-billed calls, and calling-card-billed calls is discussed. The AABS system automates a collect call by recording the calling party's name, placing a call to the called party, playing back the calling party's name to the called party, informing the called party that he or she has a collect call from that person, and asking, 'Will you pay for the call?' The operation of AABS, the architecture of the voice interface, and the speech recognition algorithm are described, and the accuracy of the recognizer is discussed. AABS relies on isolated-word recognition, although more advanced techniques that can recognize continuous speech are being pursued.

Journal ArticleDOI
TL;DR: This work uses cross-validation to increase the effective training size and introduces a near-miss sentence hypothesization algorithm for continuous speech training that resulted in over 20% error reductions both with and without grammar.

PatentDOI
Masanobu Shimanuki1
TL;DR: It is desirable for this device to be provided with a circuit that prevents generation of ringing tones when an incoming call arrives and a circuit to reduce the level of signals sent from a telephone network to the receiver when the speech recognition unit receives speech signals from the transmitter microphone.
Abstract: A telephone terminal device equipped with a transmitter microphone, a receiver, a speech recognition unit that receives and recognizes speech signals from the transmitter microphone, and a circuit to reduce the level of signals sent from a telephone network to the receiver when the speech recognition unit receives speech signals from the transmitter microphone. Further, this device is preferably equipped with a speech reproduction unit that reproduces the speech information stored in a memory, in response to the information of recognition result from the speech recognition unit, and a circuit that prevents transmission of signals from the telephone network to the receiver when the regenerated speech information is sent to the receiver. Furthermore, it is desirable for this device to be provided with a circuit that prevents generation of ringing tones when an incoming call arrives.

Proceedings ArticleDOI
05 Nov 1990
TL;DR: Different excitation signals are discussed, as well as procedures for determining the various coder parameters, which are based on analysis-by-synthesis techniques.
Abstract: This paper presents an overview of analysis-by-synthesis techniques used for low bit rate coding of speech signals. Analysis-by-synthesis procedures use linear predictors to remove the redundancies in the speech signal. The remaining difference signal is not quantized directly, but is replaced by an excitation signal that can be represented with a low number of bits. The selection of this signal is typically based on an exhaustive search procedure, in which for each prototype excitation the corresponding speech signal is constructed. The average mean-squared error between the original and the reconstructed signal is used as a criterion to determine the best choice of the excitation signal. In this paper, different excitation signals are discussed, as well as procedures for determining the various coder parameters. In addition, the paper discusses some recently proposed speech coding standards, which are based on analysis-by-synthesis techniques.
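The exhaustive search loop at the heart of analysis-by-synthesis can be sketched as follows. The synthesis filter is passed in as a stand-in for the LP synthesis filter, and real coders also search an associated gain; names are illustrative:

```python
import numpy as np

def select_excitation(target, codebook, synth):
    """Exhaustive analysis-by-synthesis search: run every candidate
    excitation through the synthesis filter and keep the one whose
    reconstruction minimizes mean-squared error against the target.

    target:   the speech frame to be matched.
    codebook: iterable of prototype excitation vectors.
    synth:    callable mapping an excitation to a synthesized frame.
    """
    errs = [np.mean((target - synth(exc)) ** 2) for exc in codebook]
    best = int(np.argmin(errs))
    return best, errs[best]
```

Only the winning codebook index (plus gain and predictor parameters) needs to be transmitted, which is where the low bit rate comes from.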

Proceedings ArticleDOI
03 Apr 1990
TL;DR: A model for cross-language voice conversion is described and the converted speech from male to female is as understandable as the unconverted speech and, moreover, it is recognized as female speech.
Abstract: First, the part of spectral difference that is due to the difference in language is assessed. This is investigated using a bilingual speaker's speech data. It is found that the interlanguage (between English and Japanese) difference is smaller than the interspeaker difference. Listening tests indicate that the difference between English and Japanese is very small. Second, a model for cross-language voice conversion is described. In this approach, voice conversion is considered a mapping problem between two speakers' spectrum spaces. The spectrum spaces are represented by codebooks. From this point of view, a cross-language voice conversion model and measures for the model are proposed. The converted speech from male to female is as understandable as the unconverted speech and, moreover, it is recognized as female speech.
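The codebook-mapping step can be sketched per frame as follows. The pairing between source and target codebook entries is assumed to have been learned from aligned training utterances; names are illustrative:

```python
import numpy as np

def convert_frame(frame, src_codebook, tgt_codebook):
    """Codebook-mapping voice conversion for one spectral frame:
    vector-quantize the frame in the source speaker's spectrum space,
    then emit the paired entry from the target speaker's codebook.

    src_codebook, tgt_codebook: 2-D arrays with corresponding rows,
    i.e. row i of each describes the "same" sound for each speaker.
    """
    i = int(np.argmin(np.linalg.norm(src_codebook - frame, axis=1)))
    return tgt_codebook[i]
```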

Proceedings ArticleDOI
24 Jun 1990
TL;DR: An empirical study in which users were asked to enter digit strings into the computer by voice and by keyboard shows that speech is preferable for strings that require more than a few keystrokes.
Abstract: Meaningful evaluation of spoken language interfaces must be based on detailed comparisons with an alternate, well-understood input modality, such as the keyboard. This paper presents an empirical study in which users were asked to enter digit strings into the computer by voice and by keyboard. Two different ways of verifying and correcting the spoken input were also examined using either voice or keyboard. Timing analyses were performed to determine which aspects of the interface were critical to speedy completion of the task. The results show that speech is preferable for strings that require more than a few keystrokes. The results emphasize the need for fast and accurate speech recognition, but also demonstrate how error correction and input validation are crucial components of a speech interface.

Journal ArticleDOI
TL;DR: In this article, the authors discuss various design issues related to developing an integrated voice/data mobile radio system, including high speed digital radio frequency modulation in a mobile environment, statistics for the talkspurt/silence gap composition of speech, switching schemes for voice/Data integration, encoding techniques, and voice and data traffic statistics.
Abstract: The various design issues related to developing an integrated voice/data mobile radio system, including high speed digital radio frequency modulation in a mobile environment, statistics for the talkspurt/silence gap composition of speech, switching schemes for voice/data integration, encoding techniques, and voice and data traffic statistics are discussed. A performance analysis is conducted for a typical design, showing that a voice-only mobile radio system can be upgraded to an integrated voice/data system capable of carrying the full voice and data loads without requiring additional radio channels and without compromising voice performance. Data traffic is only minimally delayed (46.2 ms mean delay) for a fully loaded system.

Proceedings ArticleDOI
02 Dec 1990
TL;DR: Techniques for combating two types of distortion that degrade the quality of vector excitation coded (VXC) speech are presented and it is shown that the first technique can benefit any VXC coder, whereas the second is applicable specifically when phonetic segmentation is used as a front end to V XC coders.
Abstract: Techniques for combating two types of distortion that degrade the quality of vector excitation coded (VXC) speech are presented. One degradation, the presence of noiselike components between the intended harmonics in voiced speech segments, is reduced by adaptive comb filtering, controlled by a smoothed pitch estimate. The other degradation arises with front vowel sounds, whose second and third formants tend to be attenuated in VXC coders; this is improved by adding high-frequency emphasis to the perceptual weighting when computing the distortion between the original and reconstructed speech. It is shown that the first technique can benefit any VXC coder, whereas the second is applicable specifically when phonetic segmentation is used as a front end to VXC coders.
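A plain comb filter illustrates the principle: reinforcing samples one pitch period apart boosts the harmonics and attenuates the noise-like components between them. The paper's version adapts the lag from a smoothed pitch estimate; the fixed-lag sketch below is a simplification with an illustrative mixing weight:

```python
import numpy as np

def comb_filter(x, pitch_lag, alpha=0.3):
    """Pitch-synchronous comb filter: mix each sample with the sample
    one pitch period earlier. Periodic (harmonic) content adds
    coherently; inter-harmonic noise is attenuated.

    pitch_lag: pitch period in samples (fixed here; adaptive in the paper).
    alpha:     weight of the delayed tap (illustrative value).
    """
    y = np.copy(x).astype(float)
    y[pitch_lag:] = (x[pitch_lag:] + alpha * x[:-pitch_lag]) / (1 + alpha)
    return y
```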

Journal ArticleDOI
TL;DR: Speech perception testing and speech discrimination results indicate that, given sufficient training, children can utilize speech feature information provided through the Tickle Talker to improve discrimination of words and sentences.
Abstract: Fourteen prelinguistically profoundly hearing-impaired children were fitted with the multichannel electrotactile speech processor (Tickle Talker) developed by Cochlear Pty. Ltd. and the University of Melbourne. Each child participated in an ongoing training and evaluation program, which included measures of speech perception and production. Results of speech perception testing demonstrate clear benefits for children fitted with the device. Thresholds for detection of pure tones were lower for the Tickle Talker than for hearing aids across the frequency range 250-4000 Hz, with the greatest tactual advantage in the high-frequency consonant range (above 2000 Hz). Individual and mean speech detection thresholds for the Ling 5-sound test confirmed that speech sounds were detected by the electrotactile device at levels consistent with normal conversational speech. Results for three speech feature tests showed significant improvement when the Tickle Talker was used in combination with hearing aids (TA) as compared with hearing aids alone (A). Mean scores in the TA condition increased by 11% for vowel duration, 20% for vowel formant, and 25% for consonant manner as compared with hearing aids alone. Mean TA score on a closed-set word test (WIPI) was 48%, as compared with 32% for hearing aids alone. Similarly, mean WIPI score for the combination of Tickle Talker, lipreading, and hearing aids (TLA) increased by 6% as compared with combined lipreading and hearing aid (LA) scores. Mean scores on open-set sentences (BKB) showed a significant increase of 21% for the tactually aided condition (TLA) as compared with unaided (LA).
These results indicate that, given sufficient training, children can utilize speech feature information provided through the Tickle Talker to improve discrimination of words and sentences. These results are consistent with improvement in speech discrimination previously reported for normally hearing and hearing-impaired adults using the device. Anecdotal evidence also indicates some improvements in speech production for children fitted with the Tickle Talker.

Journal ArticleDOI
TL;DR: Advances in coding algorithms and digital signal processing have led to sophisticated technologies for speech communication for a variety of applications, as well as to greater flexibilities in the design of ISDN terminals, which implies stereo teleconferencing or dual-language programming over a 64-kb/s channel.
Abstract: Advances in coding algorithms and digital signal processing have led to sophisticated technologies for speech communication for a variety of applications, as well as to greater flexibilities in the design of ISDN terminals for integrated communication of speech, images, and data. For traditional telephony with a signal bandwidth of 3.2 kHz, the transmission rate for network-quality speech is now down to 16 kb/s. Robust communications-quality speech appropriate for cellular radio has been realized at 8 kb/s. Research attention is shifting toward 4 kb/s, focused on improving speaker identification and the naturalness of coded speech. For wideband audio with a signal bandwidth of 7 kHz, high-quality coding is now possible at 32 kb/s, which implies stereo teleconferencing or dual-language programming over a 64-kb/s channel. Transparent coding of 20-kHz audio has been demonstrated at 128 kb/s, with near-transparent performance at rates as low as 64 kb/s for some classes of signals.

Patent
21 Jun 1990
TL;DR: In this paper, a speech detector has an intensity detector that indicates whether the intensity of a PCM signal exceeds a first threshold, and a normal-zero-crossing-count detector, whose outputs are combined by AND logic to produce the output of the speech detector.
Abstract: A speech detector has an intensity detector that indicates whether the intensity of a PCM signal exceeds a first threshold, and a normal-zero-crossing-count detector that indicates whether the zero-crossing count of the PCM signal exceeds a second threshold. The outputs of the intensity detector and normal-zero-crossing-count detector are combined by AND logic to produce the output of the speech detector. The second threshold is set well below the minimum zero-crossing count occurring in normal speech, the function of the normal-zero-crossing-count detector being to disable speech detection during line faults.
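The detector's AND logic can be sketched directly from this description. Thresholds below are illustrative, not from the patent:

```python
import numpy as np

def speech_detect(frame, energy_thresh=1e4, zcr_thresh=5):
    """AND-combine an intensity detector and a zero-crossing detector.

    As in the patent, the zero-crossing threshold is set well below
    counts seen in normal speech: its job is to veto line faults
    (e.g. a stuck DC level), not to classify speech by itself.
    """
    frame = np.asarray(frame, dtype=float)
    energy = np.mean(frame ** 2)                        # intensity detector
    zcr = np.count_nonzero(np.diff(np.signbit(frame)))  # zero-crossing count
    return bool(energy > energy_thresh) and zcr > zcr_thresh
```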

PatentDOI
TL;DR: A portable voice or speech aid enabling a deaf or voice impaired user to make sounds into a microphone to output intelligible speech through a built-in speaker or to a text display screen is described in this article.
Abstract: A portable voice or speech aid enabling a deaf or voice impaired user to make sounds into a microphone to output intelligible speech through a built-in speaker or to a text display screen.

Journal ArticleDOI
TL;DR: The performance levels for increasing cell loss are compared for various speech coding methods, in combination with methods for dividing coded speech signals into cells and discarding cells.
Abstract: A type of speech coding for asynchronous transfer mode (ATM) is described. Cell processing, which improves service quality, is taken into account. Missing-cell recovery methods are discussed, and the distinctive features of missing-cell recovery methods used with low-bit-rate coding are examined. An example of the speech quality obtained using speech coding techniques in the ATM networks is described. The performance levels for increasing cell loss are compared for various speech coding methods, in combination with methods for dividing coded speech signals into cells and discarding cells. Representative feasible network applications of coding technologies are considered.

PatentDOI
John W. Jackson1
TL;DR: In this article, a method and apparatus for speech analysis and speech recognition is described, where each speech utterance under examination in accordance with the method of the present invention is digitally sampled and represented as a temporal sequence of data frames.
Abstract: A method and apparatus are disclosed for speech analysis and speech recognition. Each speech utterance under examination in accordance with the method of the present invention is digitally sampled and represented as a temporal sequence of data frames. Each data frame is then analyzed by the application of a Fast Fourier Transform (FFT) to obtain an indication of the energy content of each data frame in a plurality of frequency bands or bins. An indication of each of the most significant frequency bands, in terms of energy content, is then plotted by bin number for all data frames, and these indications are graphically combined to create a power content signature for the speech utterance which is indicative of the movement of audio power through the audio spectrum over time for that utterance. By comparing the power content signature of an unknown speech utterance to a number of previously stored power content signatures, each associated with a known utterance, it is possible to identify an unknown speech utterance with a high degree of accuracy. In one preferred embodiment of the present invention, comparisons of power content signatures from unknown speech utterances are made with stored power content signatures utilizing a least squares fit or other suitable technique.
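The per-frame signature construction can be sketched as follows. The frame length and the single-strongest-bin choice are illustrative simplifications of the patent's "most significant bands":

```python
import numpy as np

def power_signature(samples, frame_len=256):
    """Per-frame index of the strongest FFT bin, forming a signature of
    how audio power moves through the spectrum over time.

    A real matcher would then compare signatures of unknown utterances
    against stored reference signatures (e.g. with a least-squares fit).
    """
    n_frames = len(samples) // frame_len
    sig = []
    for i in range(n_frames):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        spectrum = np.abs(np.fft.rfft(frame)) ** 2  # per-bin energy
        sig.append(int(np.argmax(spectrum)))        # dominant bin this frame
    return sig
```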

PatentDOI
TL;DR: It is proposed to encode speech signals by means of a residual signal speech encoder to reduce the quantity of data to be stored without noticeably affecting the acoustic quality of the speech.
Abstract: Devices for the digital recording and reproduction of speech signals are used, for example, in answering apparatus. In order to reduce the quantity of data to be stored without noticeably affecting the acoustic quality of the speech, it is proposed to encode speech signals by means of a residual signal speech encoder.

PatentDOI
TL;DR: In speech decoding, a transmission code is received and the presence of a code error is detected on the basis of the error correcting code; when an error cannot be corrected, artificial background sound corresponding to the decoded speech is generated from characteristic parameters indicating unvoiced sound in the decoded speech.
Abstract: In speech decoding, a transmission code, which includes an error correcting code added to a speech code, is received, and whether or not there is a code error is detected on the basis of the error correcting code. When there is no code error, or when the detected code error has been corrected, normal speech decoding is executed. On the other hand, when there is a code error that cannot be corrected, artificial background sound corresponding to the decoded speech is generated from characteristic parameters indicating unvoiced sound in the decoded speech. The parameters are continuously extracted from the decoded speech, stored in a memory, and used to replace the erroneous portion of the speech code.


Journal ArticleDOI
01 Jul 1990
TL;DR: Predictive coding of speech, multipulse and code-excited coders, and frequency-domain coders are examined for the coding of speech signals, and intraframe and still-image coding and interframe coding are examined for the coding of image and video signals.
Abstract: Some digital source coding techniques for speech and video are reviewed. Predictive coding of speech, multipulse and code-excited coders, and frequency-domain coders are discussed and compared for the coding of speech signals, and intraframe and still-image coding and interframe coding are examined for the coding of image and video signals. The emphasis is on algorithms that offer high compression while maintaining the perceptual quality of the source signals. Some algorithms that are general waveform coding algorithms and do not strictly depend on the input source are included.