
Showing papers on "Voice activity detection published in 1976"


Journal ArticleDOI
Frederick Jelinek1
01 Apr 1976
TL;DR: Experimental results are presented that indicate the power of the methods and concern modeling of a speaker and of an acoustic processor, extraction of the models' statistical parameters and hypothesis search procedures and likelihood computations of linguistic decoding.
Abstract: Statistical methods useful in automatic recognition of continuous speech are described. They concern modeling of a speaker and of an acoustic processor, extraction of the models' statistical parameters and hypothesis search procedures and likelihood computations of linguistic decoding. Experimental results are presented that indicate the power of the methods.

1,024 citations


Journal ArticleDOI
B.S. Atal1
01 Apr 1976
TL;DR: The paper includes a discussion of the speaker-dependent properties of the speech signal, methods for selecting an efficient set of speech measurements, results of experimental studies illustrating the performance of various methods of speaker recognition, and a comparison of the performance of automatic methods with that of human listeners.
Abstract: This paper presents a survey of automatic speaker recognition techniques. The paper includes a discussion of the speaker-dependent properties of the speech signal, methods for selecting an efficient set of speech measurements, results of experimental studies illustrating the performance of various methods of speaker recognition, and a comparison of the performance of automatic methods with that of human listeners. Both text-dependent and text-independent speaker-recognition techniques are discussed.

420 citations


Journal ArticleDOI
TL;DR: The harmonics of the desired voice are selected in the Fourier transform of the input to separate two competing voices; the authors focus on the principal subproblem, the separation of vocalic speech.
Abstract: A common type of interference in speech transmission is that caused by the speech of a competing talker. Although the brain is adept at clarifying such speech, it relies heavily on binaural data. When voices interfere over a single channel, separation is much more difficult and intelligibility suffers. Clarifying such speech is a complex and varied problem whose nature changes with the moment‐to‐moment variation in the types of sound which interfere. This paper describes an attack on the principal subproblem, the separation of vocalic speech. Separation is done by selecting the harmonics of the desired voice in the Fourier transform of the input. In implementing this process, techniques have been developed for resolving overlapping spectrum components, for determining pitches of both talkers, and for assuring consistent separation. These techniques are described, their performance on test utterances is summarized, and the possibility of using this process as a basis for the solution of the general two-talker problem is considered.
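The core separation step, keeping only the spectrum bins near the harmonics of the desired voice, can be sketched as follows. This is a minimal illustration, not the paper's system: the function name, the fixed tolerance band, and the assumption that the desired pitch f0 is already known are all mine.

```python
import numpy as np

def select_harmonics(frame, f0, sr, tol=0.04):
    """Keep spectrum bins within tol*f0 Hz of integer multiples of f0
    and zero everything else (illustrative parameters)."""
    n = len(frame)
    spec = np.fft.rfft(frame)
    freqs = np.fft.rfftfreq(n, 1.0 / sr)
    mask = np.zeros_like(spec, dtype=bool)
    k = 1
    while k * f0 <= freqs[-1]:
        mask |= np.abs(freqs - k * f0) <= tol * f0  # band around k-th harmonic
        k += 1
    return np.fft.irfft(spec * mask, n)
```

A real system, as the abstract notes, must also resolve overlapping components and track both talkers' pitches; this sketch assumes non-overlapping harmonics.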

294 citations


Journal ArticleDOI
TL;DR: When trained to the voice of a particular speaker, the decoder recognized seven‐digit telephone numbers correctly 96% of the time, with a better than 99% per‐digit accuracy.
Abstract: Continuous speech was treated as if produced by a finite‐state machine making a transition every centisecond. The observable output from state transitions was considered to be a power spectrum—a probabilistic function of the target state of each transition. Using this model, observed sequences of power spectra from real speech were decoded as sequences of acoustic states by means of the Viterbi trellis algorithm. The finite‐state machine used as a representation of the speech source was composed of machines representing words, combined according to a “language model.” When trained to the voice of a particular speaker, the decoder recognized seven‐digit telephone numbers correctly 96% of the time, with a better than 99% per‐digit accuracy. Results for other tests of the system, including syllable and phoneme recognition, will also be given.
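The decoding step described above, finding the most likely state sequence through a finite-state machine given per-transition observation probabilities, is the classic Viterbi trellis search. A generic sketch in log-probability form (my own minimal formulation, not the paper's implementation):

```python
import numpy as np

def viterbi(log_trans, log_emit, log_init):
    """Most likely state sequence through a trellis.
    log_trans[i, j]: log P(state j | state i)
    log_emit[t, j]:  log P(observation at time t | state j)
    log_init[j]:     log P(initial state j)"""
    T, S = log_emit.shape
    delta = log_init + log_emit[0]          # best log-score ending in each state
    back = np.zeros((T, S), dtype=int)      # backpointers
    for t in range(1, T):
        scores = delta[:, None] + log_trans    # S x S: predecessor x successor
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_emit[t]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```

In the paper's setting, the states would be centisecond acoustic states of word machines composed according to the language model, and the observations would be power spectra.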

208 citations


Journal ArticleDOI
01 Apr 1976
TL;DR: Future developments in both new applications and increased capability voice input systems can be expected to considerably expand the usage of this form of man-machine communications.
Abstract: Voice input to machine is the most natural form of man-machine communications. In this type of system the machine responds to the mode of communications preferred by the user, rather than vice versa. Many practical applications exist today for limited capability voice input systems. The first operational voice input systems have taken place with limited vocabulary, isolated word voice input systems. Most of these initial systems were for industrial applications in which the users' hands or eyes were already busy with their normal work requirements. Future developments in both new applications and increased capability voice input systems can be expected to considerably expand the usage of this form of man-machine communications.

133 citations


Journal ArticleDOI
TL;DR: It is shown that this new method results in a substantial improvement in the intelligibility of speech in white noise over normal speech and over previously implemented methods.
Abstract: This paper presents the results of an examination of rapid amplitude compression following high-pass filtering as a method for processing speech, prior to reception by the listener, as a means of enhancing the intelligibility of speech in high noise levels. Arguments supporting this particular signal processing method are based on the results of previous perceptual studies of speech in noise. In these previous studies, it has been shown that high-pass filtered/clipped speech offers a significant gain in the intelligibility of speech in white noise over that for unprocessed speech at the same signal-to-noise ratios. Similar results have also been obtained for speech processed by high-pass filtering alone. The present paper explores these effects and it proposes the use of high-pass filtering followed by rapid amplitude compression as a signal processing method for enhancing the intelligibility of speech in noise. It is shown that this new method results in a substantial improvement in the intelligibility of speech in white noise over normal speech and over previously implemented methods.
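The processing chain, high-pass filtering followed by rapid amplitude compression, can be sketched roughly as below. All coefficients (the pre-emphasis factor, the attack constant) are illustrative assumptions of mine, not values from the paper, and a one-pole pre-emphasis stands in for whatever filter the authors used.

```python
import numpy as np

def hp_compress(x, alpha=0.95, attack=0.99, eps=1e-6):
    """High-pass pre-emphasis, then fast peak-tracking amplitude
    compression that drives the output toward constant level."""
    y = np.empty_like(x, dtype=float)
    hp = np.append(x[0], x[1:] - alpha * x[:-1])  # crude one-pole high-pass
    env = eps
    for i, s in enumerate(hp):
        env = max(abs(s), attack * env)  # fast attack, slow release envelope
        y[i] = s / (env + eps)           # normalize: rapid compression
    return y
```

The effect is that weak and strong stretches of the filtered signal come out at comparable amplitude, which is the property the perceptual studies above exploit.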

131 citations


Journal ArticleDOI
01 Apr 1976
TL;DR: The resulting system serves as a model for the cognitive process of reading aloud, and also as a stable practical means for providing speech output in a broad class of computer-based systems.
Abstract: For many applications, it is desirable to be able to convert arbitrary English text to natural and intelligible sounding speech. This transformation between two surface forms is facilitated by first obtaining the common underlying abstract linguistic representation which relates to both text and speech surface representations. Calculation of these abstract bases then permits proper selection of phonetic segments, lexical stress, juncture, and sentence-level stress and intonation. The resulting system serves as a model for the cognitive process of reading aloud, and also as a stable practical means for providing speech output in a broad class of computer-based systems.

116 citations


Patent
17 Aug 1976
TL;DR: In this paper, the authors improved the detection sensitivity and noise rejection of an arrangement for detecting speech in the presence of noise by accumulating the weighted differences between input signal samples and their short-term running average.
Abstract: The detection sensitivity and noise rejection of an arrangement for detecting speech in the presence of noise is improved by accumulating the weighted differences between input signal samples and their short-term running average. The detector thus tracks ambient noise, providing an adaptive detection threshold such that detection sensitivity is increased in low noise environments without excessive false operation on high level noise. The peak average attained during an interval of speech is used to provide variable hangover upon cessation of speech, yielding greater hangover for weak talkers than for loud talkers. In an illustrative embodiment of the speech detector used in a speech interpolation system, protection is afforded also against false transmission path operation due to detection of speech echo.
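The behavior described in this patent, a noise-tracking adaptive threshold plus a hangover that is longer for weak talkers than for loud ones, can be sketched frame-wise as follows. The frame size, threshold multiplier, and hangover constants are illustrative assumptions, not the patent's values.

```python
import numpy as np

def detect_speech(x, win=80, k=3.0, hang_base=5, hang_per_peak=20):
    """Frame-wise speech/noise decision with an adaptive threshold that
    tracks ambient noise, and peak-dependent hangover at speech offset."""
    frames = x[: len(x) // win * win].reshape(-1, win)
    level = np.abs(frames).mean(axis=1)      # short-term running average
    noise = level[0]                          # ambient noise estimate
    active = np.zeros(len(level), dtype=bool)
    hang, peak = 0, 0.0
    for i, lv in enumerate(level):
        if lv > k * noise:                    # above adaptive threshold
            active[i] = True
            peak = max(peak, lv)
            # weak talkers (peak near noise) get a longer hangover
            hang = hang_base + int(hang_per_peak * noise / (peak + 1e-9))
        elif hang > 0:
            active[i] = True                  # hangover after speech ends
            hang -= 1
        else:
            noise = 0.99 * noise + 0.01 * lv  # track ambient noise level
            peak = 0.0
    return active
```

Freezing the noise estimate during active frames, as here, is what lets sensitivity rise in quiet environments without false triggering on loud noise.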

67 citations


Journal ArticleDOI
TL;DR: The system described in this paper is subdivided into three main steps: pitch extraction, segmentation, and formant analysis. The pitch extractor uses an adaptive time-domain digital filter that transforms the speech signal into a signal similar to the glottal waveform.
Abstract: The system described in this paper is subdivided into three main steps: pitch extraction, segmentation, and formant analysis. The pitch extractor uses an adaptive time-domain digital filter that transforms the speech signal into a signal similar to the glottal waveform. Using the levels of the speech signal and the differenced signal as parameters in the time domain, the subsequent segmentation algorithm derives a signal parameter which describes the speed of articulatory movement. From this, the signal is divided into "stationary" and "transitional" segments; one stationary segment is associated with one phoneme. For the formant tracking procedure, a subset of the pitch periods is selected by the segmentation algorithm and is transformed into the frequency domain. The formant tracking algorithm uses a maximum detection strategy and continuity criteria for adjacent spectra. After this step, the total parameter set is offered to an adaptive universal pattern classifier which is trained on selected material before operation. For stationary phonemes, the recognition rate is about 85 percent when training material and test material are uttered by the same speaker. The recognition rate is increased to about 90 percent when segmentation results are used.

47 citations


Proceedings ArticleDOI
01 Apr 1976
TL;DR: This report presents results obtained in some experiments on the computer recognition of continuous speech with two simple languages having vocabularies of 11 and 250 words.
Abstract: This report presents results obtained in some experiments on the computer recognition of continuous speech. The experiments deal with two simple languages having vocabularies of 11 and 250 words.

36 citations


Proceedings ArticleDOI
01 Apr 1976
TL;DR: A speech processing system named SPAC (SPlicing of AutoCorrelation function) is proposed in order to compress or expand the speech spectrum, to prolong or shorten the duration of an utterance, and to reduce the noise level in the speech signal.
Abstract: A speech processing system named SPAC (SPlicing of AutoCorrelation function) is proposed in order to compress or expand the speech spectrum, to prolong or shorten the duration of an utterance, and to reduce the noise level in the speech signal. A period of the short-time autocorrelation function is sampled and spliced after a change of the time scale. Transformed speech is quite natural and free from distortion. Applications of SPAC are expected in many fields, such as improvement of speech quality, narrow-band transmission, communication aids for the hard of hearing, information services for the blind, unscrambling of helium speech, stenography, and so on.
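One of SPAC's stated goals, prolonging an utterance without changing its pitch, can be loosely sketched by estimating the pitch period from the autocorrelation peak and splicing whole periods. This is a rough analogue of the idea, not the published system; the search range and integer repeat factor are my simplifications.

```python
import numpy as np

def pitch_period(x, sr, fmin=60, fmax=400):
    """Pitch period in samples from the short-time autocorrelation peak."""
    ac = np.correlate(x, x, mode="full")[len(x) - 1:]  # non-negative lags
    lo, hi = int(sr / fmax), int(sr / fmin)            # plausible lag range
    return lo + int(np.argmax(ac[lo:hi]))

def stretch(x, sr, factor=2.0):
    """Prolong voiced speech by repeating (splicing) whole pitch periods,
    preserving the pitch while lengthening the duration."""
    p = pitch_period(x, sr)
    periods = [x[i:i + p] for i in range(0, len(x) - p + 1, p)]
    out = []
    for seg in periods:
        out.extend([seg] * int(round(factor)))  # splice repeated periods
    return np.concatenate(out)
```

Because each spliced segment is one full period, the output stays periodic at the original pitch; compressing duration would drop periods instead of repeating them.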

Journal ArticleDOI
TL;DR: How a speech synthesizer can be controlled by a small computer in real time and the properties of the synthesizer and the control program are described along with an example of the speech synthesis.
Abstract: This paper describes how a speech synthesizer can be controlled by a small computer in real time. The synthesizer allows the precise control of the speech output that is necessary for experimental purposes. The control information is computed in real time during synthesis in order to reduce data storage. The properties of the synthesizer and the control program are presented along with an example of the speech synthesis.

Patent
24 Jun 1976
TL;DR: In this article, a system and method for detecting the presence of useful speech information in telephone voice channels capable of containing noise as well as such useful information for optimizing the telephone transmission of such speech information is presented.
Abstract: A system and method for detecting the presence of useful speech information in telephone voice channels capable of containing noise as well as such useful speech information for optimizing the telephone transmission of such speech information. Two segments of the envelope of a given voice channel are compared against each other over two different time domains in order to determine if a predetermined magnitude of difference exists between these envelopes. The presence of such magnitude of difference is indicative of the presence of such useful speech information in the voice channel thereby enabling transmission thereof by the system, whereas the absence of such magnitude of difference is indicative of the presence of solely noise thereby preventing the transmission thereof by the system.
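The envelope comparison this patent describes, two envelope estimates of the same channel taken over different time domains, diverging when speech is present and agreeing on steady noise, can be sketched with two exponential averagers. The time constants and the divergence ratio are illustrative assumptions, not the patent's values.

```python
import numpy as np

def speech_present(x, fast=0.9, slow=0.999, ratio=2.0):
    """Track a fast and a slow envelope of the channel signal; a large
    divergence between them indicates bursty speech rather than
    stationary noise (illustrative constants)."""
    e_fast = e_slow = abs(x[0])
    diverged = False
    for s in np.abs(x[1:]):
        e_fast = fast * e_fast + (1 - fast) * s      # short time constant
        e_slow = slow * e_slow + (1 - slow) * s      # long time constant
        if e_fast > ratio * e_slow + 1e-9:           # envelopes diverge
            diverged = True
    return diverged
```

On stationary noise both envelopes settle to the same value, so no divergence is detected and transmission would be suppressed.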

Proceedings ArticleDOI
01 Apr 1976
TL;DR: This paper describes a connected speech understanding system being implemented in Nancy, made up of an acoustic recognizer which gives a string of phoneme-like segments from a spoken sentence, a syntactic parser which controls the recognition process, a word recognizer working on words predicted by the parser, and a dialog procedure which takes into account semantic constraints in order to avoid some of the errors and ambiguities.
Abstract: This paper describes a connected speech understanding system being implemented in Nancy, thanks to the work done in automatic speech recognition since 1968. This system is made up of four parts: an acoustic recognizer which gives a string of phoneme-like segments from a spoken sentence, a syntactic parser which controls the recognition process, a word recognizer working on words predicted by the parser, and a dialog procedure which takes into account semantic constraints in order to avoid some of the errors and ambiguities. Some original features of the system are pointed out: modularity (e.g. the language used is considered as a parameter), the possibility of processing slightly syntactically incorrect sentences, ... The application both in data management and in oral control of a telephone center has given very promising results. Work is in progress on generalizing our model: extension of the vocabulary and of the grammar, multi-speaker operation, etc.

Patent
10 May 1976
TL;DR: In this paper, a method and a system for speech detection on PCM multiplexed voice channels is described, where a decision is reached every M samples regarding the channel activity.
Abstract: The disclosure herein describes a method and a system for speech detection on PCM multiplexed voice channels; for each channel, a decision is reached every M samples regarding the channel activity; in addition, the nature of speech is detected as: voiced (compact or non-compact) or unvoiced (fricative or non-fricative) when the channel is active; pure silence, white noise or echo when the channel is inactive. The decision is based on the joint value of the amplitude, zero crossing of the signal and zero crossing of the signal derivative.
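The per-channel decision described here rests on three time-domain measurements: amplitude, zero crossings of the signal, and zero crossings of its derivative. A coarse sketch of such a classifier is below; the thresholds and the reduced label set (voiced/unvoiced/inactive, without the patent's finer subclasses) are my simplifications.

```python
import numpy as np

def classify_frame(x, amp_thresh=0.05, zc_thresh=0.25):
    """Classify one frame of samples as 'voiced', 'unvoiced', or
    'inactive' from mean amplitude and zero-crossing rates of the
    signal and its derivative (illustrative thresholds)."""
    amp = np.abs(x).mean()
    zc = np.mean(np.sign(x[1:]) != np.sign(x[:-1]))   # signal ZCR
    d = np.diff(x)
    zcd = np.mean(np.sign(d[1:]) != np.sign(d[:-1]))  # derivative ZCR
    if amp < amp_thresh:
        return "inactive"                  # silence / low-level noise
    return "unvoiced" if max(zc, zcd) > zc_thresh else "voiced"
```

The intuition matches the abstract: voiced speech is high-amplitude and low in zero crossings, fricative-like sounds cross zero often, and inactive channels are low-amplitude.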

Patent
20 May 1976
TL;DR: In this paper, a method and an installation for masked or scrambled speech transmission utilize a time-scrambling unit for dividing the speech band into at least two sub-bands, for delaying the one sub-band with respect to the other, and for forming an aggregate signal, and a frequency-scambling unit is used to divide the aggregate signal into two second subbands of variable bandwidth, for their cyclic interchanging, for forming a transmission signal capable of being transmitted over a transmission channel, in order to mask not only the sound character of the speech signals but
Abstract: A method and an installation for masked or scrambled speech transmission utilize a time-scrambling unit for dividing the speech band into at least two sub-bands, for delaying the one sub-band with respect to the other, and for forming an aggregate signal, and a frequency-scrambling unit for dividing the aggregate signal into at least two second sub-bands of variable band-width, for their cyclic interchanging, and for forming a transmission signal capable of being transmitted over a transmission channel, in order to mask not only the sound character of the speech signals but also the speech rhythm, thus ensuring increased privacy of transmission with high code-changing speed and low sensitivity to distortion.

Proceedings ArticleDOI
C. Cook1
01 Apr 1976
TL;DR: Verification offers an alternative strategy by doing a top-down parametric word match independent of segmentation and labeling, which results in a distance measure between the reference parameterization of a hypothesized word and the computed parameterization of the real speech.
Abstract: If, in a speech understanding system, word matching is performed at the phonetic level, then the accurate determination of the locations and identities of words present in an unknown utterance is necessarily limited by the phonetic segmentation and labeling. Verification offers an alternative strategy by doing a top-down parametric word match independent of segmentation and labeling. The result is a distance measure between the reference parameterization of a hypothesized word and the computed parameterization of the real speech. This distance is interpreted as the likelihood of that word having actually occurred over a given portion of the utterance.

27 Jan 1976
TL;DR: Relatively little effort has been expended toward designing low data rate speech processing devices which can operate in difficult environments; the problems addressed include that of good behavior for a wide variety of speakers.
Abstract: Relatively little effort has been expended toward designing low data rate speech processing devices which can operate in difficult environments. The particular problems addressed include that of good behavior for a wide variety of speakers, with tandeming and conferencing configurations, in the presence of jamming and/or background noise, and with telephone speech as input.

Proceedings ArticleDOI
12 Apr 1976
TL;DR: The voice-operated question-answering system for seat reservation is constructed by a computer simulation technique, and promising results are obtained.
Abstract: The speech recognition system composing a part of the question-answering system operated by conversational speech is described. The recognition system consists of two processing stages: an acoustic processing stage and a linguistic processing stage. In the acoustic processing stage, input speech is analyzed and transformed into a phoneme sequence which usually contains ambiguities and errors caused in the segmentation and phoneme recognition. In the linguistic processing stage, the phoneme sequence containing ambiguities and errors is converted into the correct word sequence by the use of linguistic knowledge such as phoneme rewriting rules, lexicon, syntax, semantics, and pragmatics. The voice-operated question-answering system for seat reservation is constructed by a computer simulation technique, and promising results are obtained.

Proceedings ArticleDOI
12 Apr 1976
TL;DR: An analysis/synthesis method whereby speech may be transmitted at 600 bps, a data rate which is less than 1 percent of the PCM transmission rate for original speech sounds, which is enough to permit the use of the system in certain specialized military applications.
Abstract: This paper presents an analysis/synthesis method whereby speech may be transmitted at 600 bps, a data rate which is less than 1 percent of the PCM transmission rate for original speech sounds. This R&D effort was motivated by the pressing need for very-low-data rate (VLDR) voice digitizers to meet some of the current military voice communication requirements. The use of a VLDR voice digitizer makes it possible to transmit speech signals over adverse channels which support data rates of only a few hundred bps, or to transmit speech signals over more favorable channels with redundancies for error protection and other useful applications. The 600 bps synthesized speech loses some of its original speech quality, but the intelligibility is sufficiently high to permit the use of the system in certain specialized military applications. One of the most attractive features of the 600 bps voice digitizer is that it is a simple extension of the 2400 bps linear predictive encoder (LPE) which has been under intensive investigation by various government agencies, including the Navy, and is presently entering advanced development. In essence, the 600 bps voice digitizer is a combination of an LPE and a formant vocoder, which is realized by adding a processor to the existing 2400 bps LPE. This add-on processor converts the 2400 bps speech data to 600 bps speech data at the transmitter, and reconverts the data to 2400 bps at the receiver.

Journal ArticleDOI
TL;DR: A speech processing system has been developed in which the unvoiced portion of speech is bandwidth compressed from an original bandwidth of 4000 Hz into a low-frequency band not exceeding 1000 Hz, in which hearing impaired subjects with severe high-frequency hearing losses still possess some residual speech perception.
Abstract: A speech processing system has been developed in which the unvoiced portion of speech is bandwidth compressed from an original bandwidth of 4000 Hz into a low-frequency band not exceeding 1000 Hz, in which hearing impaired subjects with severe high-frequency hearing losses still possess some residual speech perception. The basic compression operation is based upon a time-domain time expansion technique, and the resulting reduction in bandwidth is accomplished without relinquishing the essential information contained in unvoiced speech. Thus, subjects are able again to perceive unvoiced speech of fair intelligibility where conventional hearing aids normally fail to be of any assistance. The imposition of stringent operating requirements such as portability, real-time operation, and functionality in a real-listening environment composed of many competing speech and noise sources, eliminated numerous elegant speech processing approaches.

30 Sep 1976
TL;DR: Test results indicate that packet-system speech quality varies from essentially perfect to unusable, and guidelines are provided for an acceptable packetized speech communication system.
Abstract: This paper reports on the effects of speech packetization and its transmission through a packet-switched network on overall voice quality, acceptability, and communicability, examined in parametric fashion. Speech processed through a number of real-time simulation programs, developed to create anticipated anomalies (glitches) in packet speech systems, was evaluated by informal acceptability testing. Depending on system design parameters, test results indicate that packet-system speech quality varies from essentially perfect (no packet-related anomalies) to unusable. Guidelines are provided for an acceptable packetized speech communication system.

Journal ArticleDOI
TL;DR: A simple algorithm for locating the beginning and end of a speech utterance has been developed that has been tested in computer simulations and has been constructed with standard integrated circuit technology.
Abstract: When speech is coded using a differential pulse-code modulation system with an adaptive quantizer, the digital code words exhibit considerable variation among all quantization levels during both voiced and unvoiced speech intervals. However, because of limits on the range of step sizes, during silent intervals the code words vary only slightly among the smallest quantization steps. Based on this principle, a simple algorithm for locating the beginning and end of a speech utterance has been developed. This algorithm has been tested in computer simulations and has been constructed with standard integrated circuit technology.
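The principle stated above, that an adaptive-quantizer DPCM coder emits code words spread across all levels during speech but hovers among the smallest steps during silence, suggests a simple endpoint detector over the code-word stream. A sketch follows; the frame size, magnitude cutoff, and fraction threshold are illustrative assumptions, and the input is taken to be signed quantizer indices.

```python
import numpy as np

def find_endpoints(codes, win=100, big=2, frac=0.2):
    """Locate the start and end of an utterance from DPCM code words:
    frames where many code words have large magnitude are speech;
    frames stuck on the smallest steps are silence."""
    codes = np.asarray(codes)
    n = len(codes) // win
    speech = [bool(np.mean(np.abs(codes[i*win:(i+1)*win]) >= big) > frac)
              for i in range(n)]
    if not any(speech):
        return None                       # no utterance found
    first = speech.index(True)
    last = n - 1 - speech[::-1].index(True)
    return first * win, (last + 1) * win  # sample range of the utterance
```

The appeal of the original algorithm is that this decision needs only the coder's digital output, no separate energy measurement, which is why it mapped cleanly onto standard integrated circuits.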

Proceedings ArticleDOI
Harvey F. Silverman1, N. Dixon
01 Apr 1976
TL;DR: The problems concerning the diadic segment classification and final string estimation are discussed, and the current solutions given.
Abstract: The Modular Acoustic Processor (MAP), a complex experimental system for automatic derivation of phonemic string output for continuous speech, was first described in April 1974. Many of the new concepts currently in MAP are described. In particular, the problems concerning the diadic segment classification and final string estimation are discussed, and the current solutions given. Results on a large body of continuous speech data, prepared by an automatic evaluation system, will also be presented.

Proceedings ArticleDOI
01 Apr 1976
TL;DR: This research has resulted in the development of a new pitch-synchronous analysis technique for the extraction of accurate formant information from speech signals that is an improvement over current methods of analysis in terms of accuracy and temporal resolution.
Abstract: This research has resulted in the development of a new pitch-synchronous analysis technique for the extraction of accurate formant information from speech signals. The method is an improvement over current methods of analysis in terms of accuracy and temporal resolution. This is achieved by extension of the signal from one pitch period into the next, using a speech production model based on linear prediction. The result is higher accuracy in the determination of formant frequencies, bandwidths and amplitudes, and the ability to follow rapid formant transitions. The method performs equally well with nasal and high pitched sounds. The method is applied to the speech recognition and the speaker identification problems.

Proceedings ArticleDOI
12 Apr 1976
TL;DR: Semantic and syntactic information is used to resolve ambiguities and to yield higher-order decisions in automatic speech recognition and understanding systems.
Abstract: Automatic speech recognition and understanding are currently receiving considerable attention. Most approaches to problems in these areas involve rather complicated systems. Typically, the acoustic waveform is first segmented into units such as phonemes or syllables. Semantic and syntactic information is then used to resolve ambiguities and to yield higher-order decisions. This complexity is probably necessary if the most general speech-recognition problems are to be solved.

Proceedings ArticleDOI
12 Apr 1976
TL;DR: The quantitative rules obtained for generating the SSRU's are expected to be useful, at least as a preliminary investigation tool, for synthesis-by-rule.
Abstract: Summary form only given, as follows. The paper deals with the application of the linear prediction technique to the speech synthesis of both the Italian and German languages by Standard Speech Reproducing Units (SSRU), that is, by combining elementary speech segments of standardized characteristics extracted from utterances of native speakers. The main feature of the method presented is the possibility of synthesizing in a highly intelligible form any message of such languages with a very limited amount of data. So far, the use of linear predictive coding of the previously realized SSRU sets allowed a memory occupation of less than 16 kbytes for the synthesis of Italian and less than 32 kbytes for the combined synthesis of Italian and German. The data flow rate is about 1 kbit/s. A key property of the method with respect to methods previously used (i.e., simple concatenation of original segments) lies in the possibility of greatly enhancing the naturalness of the synthesized speech by varying the pitch, amplitude, and duration of the synthetic segments. Further, the quantitative rules obtained for generating the SSRUs are expected to be useful, at least as a preliminary investigation tool, for synthesis-by-rule.


Proceedings ArticleDOI
12 Apr 1976
TL;DR: Algorithms for segmenting speech sounds into vowel-like and nonvowel-like segments, and then for identifying vowels and detecting nasal segments, turbulence noise segments, etc., are described, together with an algorithm for feature normalization.
Abstract: This paper presents a new approach to automatic segmentation and feature normalization of connected speech based on area functions. Algorithms for segmenting speech sounds into vowel-like and nonvowel-like segments, and then for identifying vowels and detecting nasal segments, turbulence noise segments, etc., are described, together with an algorithm for feature normalization. Fairly reasonable results were obtained with seven sentences spoken by two male speakers and a female speaker.

Journal ArticleDOI
TL;DR: Results are presented of experiments with a recognition scheme intended for continuous speech that utilizes information about interphoneme contextual effects contained in formant transitions and employs internal trial synthesis and feedback comparison as a means for recognition.
Abstract: Preliminary results are presented of experiments with a recognition scheme intended for continuous speech. The scheme utilizes information about interphoneme contextual effects contained in formant transitions and employs internal trial synthesis and feedback comparison as a means for recognition. The aim is to achieve minimal sensitivity to the appreciable variability which occurs in the speech signal, even for utterances of a single speaker. While the approach outlined here is quite general, it has initially been tried out on vowel-stop-vowel utterances. Recognition scores obtained are encouraging and demonstrate the viability of the approach.