Showing papers on "Voice activity detection" published in 1978


Book
05 Sep 1978
TL;DR: This book covers digital speech processing, from fundamentals and digital models of the speech signal through time-domain methods, waveform representation, short-time Fourier analysis, homomorphic processing, and linear predictive coding, to man-machine communication by voice.
Abstract: 1. Introduction. 2. Fundamentals of Digital Speech Processing. 3. Digital Models for the Speech Signal. 4. Time-Domain Models for Speech Processing. 5. Digital Representation of the Speech Waveform. 6. Short-Time Fourier Analysis. 7. Homomorphic Speech Processing. 8. Linear Predictive Coding of Speech. 9. Digital Speech Processing for Man-Machine Communication by Voice.

3,103 citations


Journal ArticleDOI
TL;DR: New results of masking and loudness reduction of noise are reported and the design principles of speech coding systems exploiting auditory masking are described.
Abstract: In any speech coding system that adds noise to the speech signal, the primary goal should not be to reduce the noise power as much as possible, but to make the noise inaudible or to minimize its subjective loudness. "Hiding" the noise under the signal spectrum is feasible because of human auditory masking: sounds whose spectrum falls near the masking threshold of another sound are either completely masked by the other sound or reduced in loudness. In speech coding applications, the "other sound" is, of course, the speech signal itself. In this paper we report new results of masking and loudness reduction of noise and describe the design principles of speech coding systems exploiting auditory masking.

434 citations


PatentDOI
TL;DR: In this article, a speech recognition method for detecting and recognizing one or more keywords in a continuous audio signal is disclosed, where each keyword is represented by a keyword template representing one or more target patterns, and each target pattern comprises statistics of each of at least one spectrum selected from plural short-term spectra of the incoming audio.
Abstract: A speech recognition method for detecting and recognizing one or more keywords in a continuous audio signal is disclosed. Each keyword is represented by a keyword template representing one or more target patterns, and each target pattern comprises statistics of each of at least one spectrum selected from plural short-term spectra generated according to a predetermined system for processing of the incoming audio. The spectra are processed by a frequency equalization and normalizing method to enhance the separation between the spectral pattern classes during later analysis. The processed audio spectra are grouped into spectral patterns, are transformed to reduce dimensionality of the patterns, and are compared by means of likelihood statistics with the target patterns of the keyword templates. A concatenation technique employing a loosely set detection threshold makes it very unlikely that a correct pattern will be rejected.

78 citations


PatentDOI
TL;DR: In this article, a speech recognition system adaptable to noisy environments is described, which includes a recognition unit for recognizing input speech signals and a noise measuring unit for measuring the intensity of ambient noises.
Abstract: A speech recognition system adaptable to noisy environments is disclosed. The system includes a recognition unit for recognizing input speech signals and a noise measuring unit for measuring the intensity of ambient noises. The system also includes a rejection unit responsive to a rejection standard controlled by the intensity of the measured noise for rejecting the recognition results given from the recognition unit when the rejection standard is exceeded.

54 citations


Journal ArticleDOI
TL;DR: This paper proposes two dynamic-type speech detectors based on the same operational principle: the presence of the speech signal is detected by analyzing the dynamic variations of the short-time-power of the channel signal.
Abstract: This paper proposes two dynamic-type speech detectors; their performances are described also by means of in-field experimental results. The two detectors are based on the same operational principle: the presence of the speech signal is detected by analyzing the dynamic variations of the short-time-power of the channel signal.

35 citations
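
The operating principle named in the abstract above, detecting speech from the dynamic variations of the short-time power of the channel signal, can be illustrated with a minimal sketch. This is not either of the detectors proposed in the paper: the frame length, the noise-floor tracking rule, and the 6 dB margin are all assumptions chosen for illustration.

```python
import math


def short_time_power(samples, frame_len):
    """Mean squared amplitude of each non-overlapping frame."""
    return [
        sum(x * x for x in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def detect_speech(samples, frame_len=80, margin_db=6.0, floor_rise=1.05):
    """Flag frames whose power stands well above a tracked noise floor.

    All parameter values are illustrative assumptions, not taken from the paper.
    """
    powers = short_time_power(samples, frame_len)
    if not powers:
        return []
    eps = 1e-12                       # keeps the floor positive for all-zero input
    floor = powers[0] + eps           # assume the first frame is background noise
    margin = 10 ** (margin_db / 10)   # dB margin converted to a power ratio
    flags = []
    for p in powers:
        # The floor drops to a quieter frame at once but rises only slowly,
        # so a sudden jump in power shows up as a large power-to-floor ratio.
        floor = min(p + eps, floor * floor_rise)
        flags.append(p > floor * margin)
    return flags
```

Because the floor creeps upward during sustained loud input, this sketch would eventually stop flagging a very long talk spurt; a practical detector would need additional hangover and hold logic.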


Journal ArticleDOI
TL;DR: A new method of digitising speech waveforms is described, based on the comparison of successive segments of the waveform with a suitably stored catalogue of possible distinct shapes.
Abstract: A new method of digitising speech waveforms is described, based on the comparison of successive segments of the waveform with a suitably stored catalogue of possible distinct shapes.

34 citations


Proceedings ArticleDOI
01 Apr 1978
TL;DR: Overall subjective quality of speech processed by adaptive differential PCM is well predicted by segmental signal-to-noise ratio and even better by a linear combination of measures of granular distortion and overload distortion.
Abstract: An experiment has been performed to study the perceptual characteristics of speech processed by ADPCM. We created 18 three-bit and four-bit coders spanning a wide range of quantizer adaptation parameters. Subjects judged the difference between each pair of coders and rated the quality of each coder individually. The difference data reveal three important perceptual dimensions (overall clarity, signal vs. background distortion, muffled vs. hoarse) which are related to various objective measures of coder performance. Overall subjective quality is well predicted by segmental SNR and even better by a linear combination of measures of granular distortion and overload distortion.

30 citations
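
The segmental signal-to-noise ratio mentioned in the abstract above is commonly computed as the average of per-frame SNR values in dB. The sketch below follows that common definition; the frame length and the clamping limits are assumptions, and the paper's exact measure may differ.

```python
import math


def segmental_snr_db(reference, processed, frame_len=128,
                     floor_db=-10.0, ceil_db=35.0):
    """Average of per-frame SNRs in dB over non-overlapping frames.

    frame_len, floor_db and ceil_db are illustrative assumptions.
    """
    n = min(len(reference), len(processed))
    snrs = []
    for start in range(0, n - frame_len + 1, frame_len):
        ref = reference[start:start + frame_len]
        out = processed[start:start + frame_len]
        signal = sum(r * r for r in ref)
        noise = sum((r - o) ** 2 for r, o in zip(ref, out))
        if signal == 0.0:
            continue  # a silent reference frame has no defined SNR
        if noise == 0.0:
            snr = ceil_db  # error-free frame: report the upper limit
        else:
            snr = 10.0 * math.log10(signal / noise)
        # Clamp so a few extreme frames do not dominate the average.
        snrs.append(max(floor_db, min(ceil_db, snr)))
    if not snrs:
        raise ValueError("no non-silent frames to evaluate")
    return sum(snrs) / len(snrs)
```

Averaging in the dB domain gives quiet frames the same weight as loud ones, which is the usual reason this measure is preferred to a single SNR computed over the whole utterance.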


Proceedings ArticleDOI
01 Apr 1978
TL;DR: This coding scheme, in addition to the baseband excitation concepts, takes advantage of the association of recently published digital speech processing techniques such as transversal predictive coding, splitband coding by signal decimation/interpolation and adaptive block quantization.
Abstract: This paper describes a common voice coding architecture based on a Voice Excited Predictive Coding (VEPC) scheme allowing operation at different bit rates (9600 bps, 7200 bps, or below) by simply modifying the bandwidth allocated to the coding of the baseband excitation signal. This coding scheme, in addition to the baseband excitation concepts, takes advantage of the association of recently published digital speech processing techniques such as transversal predictive coding, splitband coding by signal decimation/interpolation and adaptive block quantization. Simulations have shown that the proposed architecture achieves 'standard telephone quality', assuming a 300-3400 Hz telephone bandwidth, at transmission rates below 9600 bps.

28 citations


Proceedings ArticleDOI
10 Apr 1978
TL;DR: It is demonstrated that it is possible to achieve pattern recognition classification with much less computational effort by adopting a scheme based on the concept of variable decision space, using only three features and by avoiding the time consuming linear prediction analysis.
Abstract: A pattern recognition approach for deciding whether a given segment of speech should be classified as voiced speech, unvoiced speech or silence based on a set of five measurements of the signal is given by Atal and Rabiner [1]. In this paper, we demonstrate that it is possible to achieve this classification with much less computational effort. These computational savings are mainly achieved by adopting a scheme based on the concept of variable decision space, using only three features and by avoiding the time consuming linear prediction analysis.

22 citations
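
A voiced/unvoiced/silence decision that avoids linear prediction analysis, as the abstract above describes, can be illustrated with three inexpensive time-domain features. The features used here (log energy, zero-crossing rate, lag-1 normalized autocorrelation), the fixed thresholds, and the simple rule are all assumptions for illustration; the abstract does not name the paper's three features, and this sketch does not implement its variable decision space scheme.

```python
import math


def frame_features(frame):
    """Return (log energy in dB, zero-crossing rate, lag-1 autocorrelation).

    The frame must contain at least two samples.
    """
    n = len(frame)
    energy = sum(x * x for x in frame)
    log_energy = 10.0 * math.log10(energy / n + 1e-12)
    # Fraction of adjacent sample pairs whose signs differ.
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    zcr = crossings / (n - 1)
    # Correlation between neighbouring samples, normalized by frame energy.
    r1 = sum(a * b for a, b in zip(frame, frame[1:])) / energy if energy > 0 else 0.0
    return log_energy, zcr, r1


def classify_frame(frame, silence_db=-50.0, zcr_split=0.25, r1_split=0.5):
    """Label a frame 'silence', 'voiced' or 'unvoiced' (hypothetical thresholds)."""
    log_energy, zcr, r1 = frame_features(frame)
    if log_energy < silence_db:
        return "silence"
    # Voiced speech is dominated by low frequencies: few zero crossings
    # and strongly correlated neighbouring samples.
    if zcr < zcr_split and r1 > r1_split:
        return "voiced"
    return "unvoiced"
```

In practice the thresholds would have to be trained on labeled speech rather than fixed by hand as they are here.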


15 Dec 1978
TL;DR: The usefulness of the new approach for speech modeling has been successfully established after several parameter quantization methods were considered to achieve the desired low bit rates.
Abstract: This constitutes our final report on a research program aimed at the development of a high quality low data rate speech transmission system based on new types of speech modeling algorithms. Several such algorithms were developed and tested on simulated and real speech data. These algorithms have many desirable features including the capability of rapidly tracking time-varying model parameters. The best algorithm was used as the basis of a speech transmission system in order to test the quality of the speech models. The model parameters (reflection coefficients) together with pitch information and speech energy form a speech parameter vector to be transmitted and used to reconstruct the original speech. Several parameter quantization methods were considered to achieve the desired low bit rates. The various algorithms as well as the complete transmission system were coded and tested. Simulation results are very promising and the usefulness of our new approach for speech modeling has been successfully established. (Author)

19 citations


Journal ArticleDOI
M. Orceyre, R. Heller
TL;DR: The matter of secure voice communication (enabling speakers to converse naturally over telephone media without fear that their conversation can be usefully intercepted) poses special problems and is receiving close attention within both the commercial and the Government sectors.
Abstract: Telephone communications have been understood from their beginnings to be vulnerable to interception (unauthorized reception). In recent years, with increasing public and private sector reliance upon electronic media for communicating sensitive technical, financial, military, political, economic, and personal information, and with the rapidly increasing use of microwave and satellite telephone carrier media, concern about these vulnerabilities has mounted dramatically. Starting in mid-1977 there has been considerable attention given in the news media to the matter of wholesale interception by foreign governments of American private and commercial voice and data communications. Publicly available documents note the ease with which such common carrier transmissions can be "captured" for subsequent analysis and use by unauthorized listeners. Fig. 1 illustrates the many vulnerabilities of a typical public switched telephone network. Within this broad framework, the matter of secure voice communication (enabling speakers to converse naturally over telephone media without fear that their conversation can be usefully intercepted) poses special problems and is receiving close attention within both the commercial and the Government sectors.

PatentDOI
Osamu Fujimura
TL;DR: In this article, a speech transmission system is improved in intelligibility and naturalness by separating voiced from unvoiced speech segments prior to application to a transmission channel of restricted bandwidth.
Abstract: A speech transmission system is improved in intelligibility and naturalness by separating voiced from unvoiced speech segments prior to application to a transmission channel of restricted bandwidth. Voiced segments are combined without processing with discrete-frequency coded unvoiced segments processed in analog or digital fashion conformably with the limited channel bandwidth at the transmitter. Voiced segments are reproduced conventionally while unvoiced segments are simulated by noise sources triggered by decoded discrete frequencies at the receiver. The reconstructed speech signal can thus occupy substantially more than the limited channel bandwidth.

PatentDOI
TL;DR: In this article, a method of communicating digital speech data to a speech synthesis circuit is described, in which the data is stored in a memory coupled to the speech synthesis circuit.
Abstract: A method of communicating digital speech data to a speech synthesis circuit. The data is compressed to on the order of 1000-1200 bits per second for normal human speech. The speech synthesis circuit utilizes linear predictive coding techniques for producing high quality speech or other sounds. The data is preferably stored in a memory which is coupled to the speech synthesis circuit. The data has variable frame lengths; in the disclosed embodiment, four different frame lengths are described, ranging from four bits to forty-nine bits. The memory stores the variable frame length data and communicates the same to the speech synthesis circuit in response to certain control signals.

Proceedings ArticleDOI
10 Apr 1978
TL;DR: This paper describes a method of speech coding in a high ambient noise environment and shows that the spectral envelope of the speech signal is highly reliable information when the noise reduction method proposed in this paper is used.
Abstract: Preservation of both the spectral distribution and the periodicity of speech signals is essential in speech processing. This paper describes a method of speech coding in a high ambient noise environment and shows that the spectral envelope of the speech signal is highly reliable information when the noise reduction method proposed in this paper is used. Also reported are comparisons of several pitch extraction methods with extensive experimental data, on the basis of which a pitch extraction method suited to noisy speech signals is proposed.

Journal Article
TL;DR: Although the data support the use of MLV testing, verification with a standardized recording should be considered when unusually poor SDS's are obtained, and half-list testing can be an effective screening procedure to determine if full-list testing is advisable.
Abstract: Several speech audiometric measurements were made on 212 ears with mild sensorineural hearing loss. An 8-dB difference between speech detection and spondee thresholds was observed, which is the same relationship that has been found in normal ears. No significant differences in speech discrimination scores (SDS's) were observed when NU-6 was administered via monitored live voice (MLV) and the Auditec recordings. Although our data support the use of MLV testing, verification with a standardized recording should be considered when unusually poor SDS's are obtained. Half-list and full-list SDS's were analyzed for both taped and MLV presentation modes. This analysis showed that both the MLV and taped stimuli exhibited very similar variability and that about 96% of the half-list scores were within 6% of the full-list scores. The clinician should be cautious, however, because 4% of the ears had half-list/full-list discrepancies ranging from 8 to 14% and differences as large as 28% have been reported by Raffin and Thornton (1977). Furthermore, variability between half-list and full-list SDS's varies as a function of intelligibility impairment, being least for scores approaching the extremes of 0 and 100% and greatest for scores in the 30 to 70% range. Finally, our data suggest that half-list testing can be an effective screening procedure to determine if full-list testing is advisable.

Proceedings ArticleDOI
01 Apr 1978
TL;DR: Several techniques for reducing the effect of channel bit errors on the synthesized speech are described, which cause no measurable degradation of the LPC speech transmitted over an error-free channel and they require less than a one percent increase in computer execution time.
Abstract: The U.S. Government has developed a real-time 2400 bps Linear Predictive Coded (LPC) voice algorithm which was designed to provide maximum intelligibility and quality within the time and accuracy limitations imposed by modern high-speed minicomputers. The algorithm which resulted provides excellent intelligibility and quality when transmitted over an ideal channel. However, the speech is significantly degraded in an error environment. This paper describes several techniques for reducing the effect of channel bit errors on the synthesized speech. These techniques cause no measurable degradation of the LPC speech transmitted over an error-free channel and they require less than a one percent increase in computer execution time.

Proceedings ArticleDOI
01 Apr 1978
TL;DR: Sperry Univac is developing a linguistically oriented system for locating important words in conversational speech that uses acoustic, prosodic, and phonetic analyses to produce a phonetic description of the incoming speech.
Abstract: Sperry Univac is developing a linguistically oriented system for locating important words in conversational speech. The system uses acoustic, prosodic, and phonetic analyses to produce a phonetic description of the incoming speech. Next, phonetic dictionary representations of the keywords to be found are compared to all portions of the phonetic analysis. High scoring matches are then verified by aligning prestored spectral patterns with the spectral information found during analysis, and resulting good matches are announced as likely keyword occurrences. Current results are presented for this system, which is being developed and tested on bandlimited, conversational speech from a large, diverse speaker population.

Proceedings ArticleDOI
01 Apr 1978
TL;DR: The pitch predictor is not useful on balance and should be eliminated, and the residual should be quantized with no clipping and encoded using a variable-length code, which seems to be adequate for all speech and all conditions.
Abstract: We report on the results of research to code speech at 16 kbps under the condition that the quality of the transmitted speech be equal to that of the original. Some of the original speech had been corrupted by noise and distortions typical of long distance telephone lines. The rigorous requirements of this work led to a new outlook on adaptive predictive coding. We have found that the pitch predictor is not useful on balance and should be eliminated, and that the residual should be quantized with no clipping and encoded using a variable-length code. A single coding scheme seems to be adequate for all speech and all conditions. In addition, the adaptive predictive coding system has been modified to include a noise spectral shaping filter that effectively eliminates the perception of background granular noise.

Journal ArticleDOI
TL;DR: An all-digital system, labeled PCM.RR, is presented, which enables the doubling of the traffic capacity of PCM links by properly using "Adaptive Quantization" and "Speech Interpolation" performed by means of a "Speech Detector" that works directly on the A-law compressed digital signal.
Abstract: An all-digital system, labeled PCM.RR, is presented, which enables the doubling of the traffic capacity of PCM links. This is obtained, while keeping the transmission quality impairment very close to the normal PCM standards, by properly using "Adaptive Quantization" and "Speech Interpolation" performed by means of a "Speech Detector" that works directly on the A-law compressed digital signal.
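
For readers unfamiliar with the A-law compression mentioned in the abstract above, the continuous A-law characteristic and its inverse can be sketched as follows. This uses the standard formula with A = 87.6 on samples normalized to [-1, 1]; it is not the segmented 8-bit codeword format used on PCM links, and it does not reproduce the paper's speech detector, whose details the abstract does not give.

```python
import math

A = 87.6  # standard A-law compression parameter


def alaw_compress(x):
    """Map a linear sample in [-1, 1] to its A-law compressed value."""
    ax = min(abs(x), 1.0)
    if ax < 1.0 / A:
        y = A * ax / (1.0 + math.log(A))                    # linear segment near zero
    else:
        y = (1.0 + math.log(A * ax)) / (1.0 + math.log(A))  # logarithmic segment
    return math.copysign(y, x)


def alaw_expand(y):
    """Inverse of alaw_compress: recover the linear sample."""
    ay = min(abs(y), 1.0)
    k = 1.0 + math.log(A)
    if ay < 1.0 / k:
        x = ay * k / A
    else:
        x = math.exp(ay * k - 1.0) / A
    return math.copysign(x, y)
```

The characteristic is strictly increasing in magnitude, so a level comparison made on compressed values orders samples the same way as one made on linear values; this is one plausible reason a detector could operate on the compressed signal without expanding it first.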

Patent
14 Mar 1978
TL;DR: In this paper, the frequency range of each speech channel is broken into sub-channels and each of these is considered separately for operational activity, and composite speech signals are then formed from the active frequency subchannels of individual speech channels and these are transmitted with coding signals indicative of their composition.
Abstract: To transmit a number of individual speech channels over a smaller number of transmission channels, the frequency range of each speech channel is broken into sub-channels and each of these is considered separately for operational activity. Composite speech signals are then formed from the active frequency sub-channels of the individual speech channels and these are transmitted with coding signals indicative of their composition.

Journal ArticleDOI
TL;DR: The Parcor analysis-synthesis method is being applied to a wide range of speech coding from 1200 bps variable frame-rate coding to high quality 16 kbps adaptive predictive coding.
Abstract: Since the introduction in 1966 of speech analysis-synthesis based on maximum likelihood spectrum estimation, we have been conducting research activities on low bit rate speech coding techniques and their application to audio response and low bit rate digital speech transmission. Parcor analysis-synthesis, demonstrated in 1969, was one of the most fundamental methods, and it has formed the basis of the present development of linear predictive coding. Recently, various kinds of techniques have been proposed to improve speech quality, such as interpolation and nonlinear quantization of parameters, spectral smoothing, etc. They have been applied in the hardware realization of a 4 CH multiplexed 2400 bps Vocoder. At present, the Parcor method is being applied to a wide range of speech coding from 1200 bps variable frame-rate coding to high quality 16 kbps adaptive predictive coding.

Journal ArticleDOI
J. Dubnowski
TL;DR: A microprocessor has been used to translate between Log PCM and ADPCM (Adaptive Differential PCM) code forms; the system, in bridging the gap between simulation and prototyping, provides realtime speech processing with user interaction.
Abstract: A microprocessor has been used to translate between Log PCM and ADPCM (Adaptive Differential PCM) code forms. This system, in bridging the gap between simulation and prototyping, provides realtime speech processing with user interaction. Continuously coded speech can be subjectively evaluated while switching the values of code word length, step size, or predictor coefficients. Translations of additional code forms such as Δ-Mod, NIC, or Tree Codes could easily be implemented with the micro-codable system. The processor is configured as a stand-alone device competitive with special purpose hardware in size, speed, and cost.

Journal ArticleDOI
01 Feb 1978
TL;DR: A system is described which provides for the output of information from a real engineering database in spoken form, using its own predefined knowledge of the information domain and a knowledge of simple English.
Abstract: A system is described which provides for the output of information from a real engineering database in spoken form. Data extracted from the database is converted by the system, using its own predefined knowledge of the information domain and a knowledge of simple English, into a sequence of words and an associated pitch contour. The spoken output is then generated by the concatenation and resynthesis of previously analysed stored isolated words using a hardware digital speech synthesiser.

ReportDOI
01 Nov 1978
TL;DR: In this paper, the authors describe the design, principles of operation, and performance characteristics of an Advanced Development Model of a speech enhancement unit, which improves the quality and intelligibility of speech signals by the removal of frequently encountered interference or noise from received or recorded speech signals.
Abstract: This report describes the design, principles of operation, and performance characteristics of an Advanced Development Model of a speech enhancement unit. This unit improves the quality and intelligibility of speech signals by the removal of frequently encountered interference or noise from received or recorded speech signals. A high speed digital array processor and various time and frequency domain algorithms permit the detection and attenuation of narrowband noise (such as tones, hums, whistles, etc.) and impulse noise (such as ignition pulses, static, etc.) with minimum degradation to the speech signals. The enhancement unit provides automatic tracking and attenuation of interfering signals in real time and with a maximum lag of 0.15 second. The heart of the speech enhancement unit is a powerful computer known as a macro-array processor, or MAP, that performs all of the measurement, analysis, and processing of the input signal. It is supported by a digital magnetic tape unit used to program the MAP and a minicomputer which reads the program into the MAP. Tests on the unit showed attenuation of 30 to 50 dB on both narrowband and impulse noise. Operational tests performed by trained Air Force personnel showed the unit to be highly effective in providing improved intelligibility and listenability which significantly reduced listener fatigue. Provision has been made in the design and fabrication of the speech enhancement unit to implement a technique for attenuating wideband random noise. This technique known as INTEL is one of the few known methods of suppressing this commonly encountered noise without severely distorting co-existing speech.

Proceedings ArticleDOI
01 Apr 1978
TL;DR: A modified version of BASIC which the authors call 'SPOKEN-BASIC-1' is selected as a task for speech recognition, and two space search strategies, the depth-first method and the best-first method, have been compared quantitatively in their effectiveness.
Abstract: This paper describes a speech recognition system developed as a voice-input programming system. A modified version of BASIC which we call 'SPOKEN-BASIC-1' is selected as a task for speech recognition. The system consists of four major components: acoustic, lexical matching, syntactic and semantic processors. 71 sentences spoken by each of four speakers have been applied to the system for gathering several statistics on system performance. The system has achieved a sentence recognition rate of 85.5%. The average time required to recognize an utterance is from 1/4 to 1/5 times real time on the large scale computer. The two space search strategies, the depth-first method and the best-first method, have been compared quantitatively in their effectiveness. Further, various types of knowledge sources have been investigated in their contribution to the system performance.

Proceedings ArticleDOI
01 Apr 1978
TL;DR: A very large data base consisting of over thirty-six hours of linguistically unconstrained extemporaneous speech, from seventeen speakers, recorded over a period of more than three months, was analyzed to determine the effectiveness of long-term average features for speaker identification.
Abstract: A very large data base consisting of over thirty-six hours of linguistically unconstrained extemporaneous speech, from seventeen speakers, recorded over a period of more than three months, was analyzed to determine the effectiveness of long-term average features for speaker identification. The results were strongly dependent on the voiced speech averaging interval, or L_v. Monotonic increases in the probability of correct identification were obtained as L_v increased, even with substantial time periods between successive sessions. Speaker identification performance in open tests improved if features with small between-class to within-class variance ratios were eliminated. For L_v corresponding to approximately thirty-nine seconds of speech, true text-independent results (no linguistic constraints embedded into the data base) of 98.05% for speaker identification were obtained.

01 Apr 1978
TL;DR: The development of a speech processing computer facility with the ultimate goal of transmitting narrowband speech in real time over the ARPA Network and a reliable method for measuring subjective speech quality are described.
Abstract: This report describes our work in the past three years on data compression and quality evaluation of digital speech. We developed and implemented linear predictive coding (LPC) techniques with the overall objective of digitally transmitting high quality speech at the lowest possible average data rates over packet-switched communication media. Major techniques reported include: covariance lattice method of linear prediction analysis, adaptive lattice methods, linear predictive spectral warping, improved quantization of LPC parameters, variable frame rate transmission of LPC parameters based on a functional perceptual model of speech, and a mixed-source model for LPC synthesizer to produce more natural-sounding speech. Also, we developed a reliable method for measuring subjective speech quality. This method was employed to formally demonstrate the quality improvements provided by our speech analysis/synthesis techniques as well as for studying speech quality as a function of LPC parameters. As subjective procedures are generally expensive and time-consuming, we developed and tested several objective procedures for speech quality evaluation. The results from these objective procedures were found to be highly correlated to the corresponding subjective quality judgments. Another highlight of our work is the development of a speech processing computer facility with the ultimate goal of transmitting narrowband speech in real time over the ARPA Network.


01 May 1978
TL;DR: A technique is described for suppressing interference from the speech of a competing talker by examining the Fourier transform of the input and selecting the harmonics of the desired voice, with improvements that extend the process to natural speech.
Abstract: One of the most common types of interference in speech communication is that caused by the speech of a competing talker. A technique has been developed for suppressing such interference by examining the Fourier transform of the input and selecting the harmonics of the desired voice. The initial version of this process was applicable only to vocalic speech (i.e., speech consisting only of vowels and vowel-like sounds), but in subsequent research steps have been taken to extend the process to natural (i.e., unrestricted) speech. This report describes the improvements which have been made in this research, first, to ruggedize the process so that it can perform in a natural-speech environment, second, to improve the intelligibility and naturalness of the recovered speech, and third, to enable the process to handle the non-vocalic speech sounds (such as plosives and fricatives) which occur in natural speech. (Author)

Journal ArticleDOI
TL;DR: A complete algorithm of a 1200-bits/s digital formant vocoder system is described, which draws heavily on the results of recent research in linear predictive coding.
Abstract: A complete algorithm of a 1200-bits/s digital formant vocoder system is described. This vocoder algorithm draws heavily on the results of recent research in linear predictive coding. The transmitting parameters are frequencies and amplitudes of the first three formants, the pitch period, voiced/unvoiced decision, and the gain. Formant bandwidths are estimated at the synthesizer by using the amplitude information. The synthesizer structure is in the parallel form. The synthetic speech quality at 1200 bits/s is reasonably good; most of the speech is intelligible and speaker-recognizable.