
Showing papers on "Voice activity detection published in 1979"


Journal ArticleDOI
S. Boll1
TL;DR: A stand-alone noise suppression algorithm that resynthesizes a speech waveform and can be used as a pre-processor to narrow-band voice communications systems, speech recognition systems, or speaker authentication systems.
Abstract: A stand-alone noise suppression algorithm is presented for reducing the spectral effects of acoustically added noise in speech. Effective performance of digital speech processors operating in practical environments may require suppression of noise from the digital waveform. Spectral subtraction offers a computationally efficient, processor-independent approach to effective digital speech analysis. The method, requiring about the same computation as high-speed convolution, suppresses stationary noise from speech by subtracting the spectral noise bias calculated during nonspeech activity. Secondary procedures are then applied to attenuate the residual noise left after subtraction. Since the algorithm resynthesizes a speech waveform, it can be used as a pre-processor to narrow-band voice communications systems, speech recognition systems, or speaker authentication systems.

4,862 citations
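The core subtraction step can be sketched in a few lines. The following is a minimal illustration (frame-wise magnitude subtraction with half-wave rectification, resynthesis using the noisy phase), not Boll's complete algorithm, which adds secondary residual-noise attenuation; the function name and parameter values are illustrative.

```python
import numpy as np

def spectral_subtraction(noisy, noise_only, frame_len=256):
    """Minimal spectral subtraction: subtract the average noise
    magnitude spectrum (estimated during non-speech activity) from
    each frame of the noisy signal, then resynthesize a waveform."""
    # Spectral noise bias, calculated from a noise-only segment.
    n = noise_only[: len(noise_only) // frame_len * frame_len]
    noise_mag = np.abs(np.fft.rfft(n.reshape(-1, frame_len), axis=1)).mean(axis=0)

    out = np.zeros(len(noisy) // frame_len * frame_len)
    for i in range(0, len(out), frame_len):
        spec = np.fft.rfft(noisy[i:i + frame_len])
        # Subtract the bias; half-wave rectify so magnitudes stay non-negative.
        mag = np.maximum(np.abs(spec) - noise_mag, 0.0)
        # Resynthesize using the noisy signal's phase.
        out[i:i + frame_len] = np.fft.irfft(mag * np.exp(1j * np.angle(spec)), frame_len)
    return out
```

Because the output is again a waveform, such a routine can sit in front of a recognizer or vocoder, as the abstract notes.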


Journal ArticleDOI
TL;DR: Improved speech quality is obtained by efficient removal of formant and pitch-related redundant structure of speech before quantizing, and by effective masking of the quantizer noise by the speech signal.
Abstract: Predictive coding methods attempt to minimize the rms error in the coded signal. However, the human ear does not perceive signal distortion on the basis of rms error alone; perception also depends on the spectral shape of the error relative to the signal spectrum. In designing a coder for speech signals, it is necessary to consider the spectrum of the quantization noise and its relation to the speech spectrum. The theory of auditory masking suggests that noise in the formant regions would be partially or totally masked by the speech signal. Thus, a large part of the perceived noise in a coder comes from frequency regions where the signal level is low. In this paper, methods for reducing the subjective distortion in predictive coders for speech signals are described and evaluated. Improved speech quality is obtained: 1) by efficient removal of formant and pitch-related redundant structure of speech before quantizing, and 2) by effective masking of the quantizer noise by the speech signal.

376 citations


Journal ArticleDOI
TL;DR: Research to code speech at 16 kbit/s with the goal of having the quality of the coded speech be equal to that of the original is reported, finding that the pitch predictor is not cost-effective on balance and may be eliminated.
Abstract: We report on research to code speech at 16 kbit/s with the goal of having the quality of the coded speech be equal to that of the original. Some of the original speech had been corrupted by noise and distortions typical of long-distance telephone lines. The basic structure chosen for our system was adaptive predictive coding. However, the rigorous requirements of this work led to a new outlook on the different aspects of adaptive predictive coding. We have found that the pitch predictor is not cost-effective on balance and may be eliminated. Solutions are presented to deal with the two types of quantization noise: clipping and granular noise. The clipping problem is completely eliminated by allowing the number of quantizer levels to increase indefinitely. An appropriate self-synchronizing variable-length code is proposed to minimize the average data rate; the coding scheme seems to be adequate for all speech and all conditions tested. The granular noise problem is treated by modifying the predictive coding system in a novel manner to include an adaptive noise spectral shaping filter. A design for such a filter is proposed that effectively eliminates the perception of granular noise.

99 citations


Journal ArticleDOI
TL;DR: The speech synthesis from concept system converts an input concept into speech by using a transformational grammar to generate a well‐formed English sentence and a word concatenation synthesizer to generate the actual speech output.
Abstract: A synthesis method, called speech synthesis from concept, is described which has been designed specifically for providing speech output from information systems. It differs from conventional techniques in that data is passed from the information system to the speech synthesis system, not in the form of text or phonetic transcription, but in the form of an abstract structure called an input concept. The speech synthesis from concept system converts an input concept into speech by using a transformational grammar to generate a well-formed English sentence and a word concatenation synthesizer to generate the actual speech output. The "top down" nature of this process reduces the computation required within the information system and enables high-quality speech to be produced.

69 citations


PatentDOI
Bishnu S. Atal1
TL;DR: In this paper, a speech signal is partitioned into intervals, and a set of coded prediction parameter signals, pitch period and voicing signals, and signals corresponding to the spectrum of the prediction error signal are produced.
Abstract: In a speech processing arrangement for synthesizing more natural sounding speech, a speech signal is partitioned into intervals. For each interval, a set of coded prediction parameter signals, pitch period and voicing signals, and a set of signals corresponding to the spectrum of the prediction error signal are produced. A replica of the speech signal is generated responsive to the coded pitch period and voicing signals as modified by the coded prediction parameter signals. The pitch period and voicing signals are shaped responsive to the prediction error spectral signals to compensate for errors in the predictive parameter signals whereby the speech replica is natural sounding.

48 citations


Proceedings ArticleDOI
B. Atal1, N. David
01 Apr 1979
TL;DR: A modified analysis-synthesis procedure which, although relying on the basic LPC technique for analysis and synthesis, avoids spectral amplitude and phase distortions introduced by these techniques.
Abstract: In speech analysis and synthesis based on linear prediction, it is a common assumption that predictor coefficients contain all the necessary spectral and phase information for accurate synthesis of the speech signal. However, even under the best circumstances, the synthetic speech sounds unnatural to the critical listener. Subjective tests reveal that spectral errors introduced by the linear prediction analysis techniques are a major source of unnatural sound quality in synthetic speech. This paper describes a modified analysis-synthesis procedure which, although relying on the basic LPC technique for analysis and synthesis, avoids the spectral amplitude and phase distortions introduced by these techniques. In the new method, proper reproduction of the speech spectrum at the receiver is ensured by transmitting the short-time spectrum of the prediction residual to the receiver.

39 citations


Journal ArticleDOI
TL;DR: A mathematical analysis for the steady state performance of a system where voice calls and data packets are transmitted over the same channel is developed.
Abstract: We develop a mathematical analysis for the steady state performance of a system where voice calls and data packets are transmitted over the same channel. The voice calls have priority over the data packets, in that the data packets are transmitted only when there are no voice calls present in the system or the voice conversation is in a long silent period.

30 citations


Proceedings ArticleDOI
01 Apr 1979
TL;DR: This paper describes a unique design that attacks two problem areas of LPC: noise suppression/input level control and real-time simulation/test.
Abstract: This paper describes a unique design that attacks two problem areas of LPC: noise suppression/input level control and real-time simulation/test. The noise level design uses algorithms to digitally process speech data before input to the LPC algorithm processor. The LPC processor described in the paper is based on a microprocessor design conceived specifically for speech. The noise suppression and level control algorithms are performed in a separate front-end processor that detects noise patterns and deletes them from the normal voice input. The operational hardware system is shown to the block diagram level, as is the particular simulation/test scheme. Test results are also described in this paper.

30 citations


PatentDOI
TL;DR: In this article, the upper sideband of a sampled-speech signal with the original baseband signal for further signal processing is enhanced by shifting both bands to form a continuum from 0 Hz to the sampling frequency.
Abstract: Signal-to-noise ratio is enhanced by including the upper sideband of a sampled-speech signal with the original baseband signal for further signal processing. The invention features shifting both bands to form a continuum from 0 Hz to the sampling frequency. Application in a speech recognition system is shown.

25 citations


Journal ArticleDOI
Daniel Minoli1
01 Aug 1979
TL;DR: A queuing model for a link carrying packetised voice is introduced and solved, and results on optimal packet length, transient behaviour, and buffer length are presented.
Abstract: Because of perceived economic and technical benefits, digital voice techniques and corresponding packet network architectures are receiving considerable attention. In this paper we summarise speech traffic models, followed by a discussion of performance criteria. A queuing model for a link carrying packetised voice is introduced and solved. The results and network implications of this link model are addressed; results on optimal packet length, transient behaviour, and buffer length are presented.

25 citations


PatentDOI
TL;DR: In this paper, human voice sounds are modified to produce the effect of a different person speaking, where a signal representative of the original voice sounds is separated into a plurality of voice signal components each having a different frequency band.
Abstract: Human voice sounds are modified to produce the effect of a different person speaking. A signal representative of the original voice sounds is separated into a plurality of voice signal components each having a different frequency band. The frequency of at least one voice signal component is shifted and the voice signal components are recombined to produce a modified voice signal representative of the modified but intelligible voice sounds.

Journal ArticleDOI
TL;DR: A speech processing system has been developed which is capable of providing an accurate indication of whether a given speech segment is voiced or unvoiced, and offers a reliable voiced/unvoiced (V/UV) decision even in the presence of some competing speech and noise sources.
Abstract: A speech processing system has been developed which is capable of providing an accurate indication of whether a given speech segment is voiced or unvoiced. In comparison to other existing techniques, this one lends itself to easy implementation, and it offers a reliable voiced/unvoiced (V/UV) decision even in the presence of some competing speech and noise sources. In addition, the V/UV decision is achieved in real time, with time delays of ≤ 4 ms going from voiced to unvoiced speech and ≤ 2 ms going from unvoiced to voiced speech.
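The abstract does not spell out the decision features, but V/UV detectors of this era typically combined short-time energy with zero-crossing rate (voiced speech: high energy, few zero crossings; unvoiced speech: the reverse). The sketch below is a hypothetical illustration of that style of rule, with made-up thresholds, not the paper's actual system.

```python
import numpy as np

def classify_vuv(frame, energy_thresh=0.01, zcr_thresh=0.25):
    """Toy voiced/unvoiced decision for one short speech frame.
    Voiced speech tends to be high-energy with a low zero-crossing
    rate; anything else is labeled unvoiced."""
    energy = np.mean(frame ** 2)
    # Fraction of adjacent sample pairs whose signs differ.
    zcr = np.mean(np.abs(np.diff(np.signbit(frame).astype(int))))
    return "voiced" if energy > energy_thresh and zcr < zcr_thresh else "unvoiced"
```

Such a per-frame rule is cheap enough to run sample-synchronously, which is how detectors of this kind achieved millisecond-scale decision delays.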

Patent
30 Aug 1979
TL;DR: In this paper, a digital speech interpolation system is combined with an adaptive differential PCM (ADPCM), employing a speech detector for detecting speech signals and for discriminating voiced and unvoiced sounds.
Abstract: A digital speech interpolation system is combined with adaptive differential PCM (ADPCM), employing a speech detector for detecting speech signals and for discriminating voiced and unvoiced sounds. Adaptive quantization bit assignment to the speech is adopted to cope with any freeze-out condition. Further, PCM speech signals sampled at 8 kHz are shifted down by 250 Hz and converted to a 6 kHz sampling frequency before being applied to the ADPCM, thereby attaining a total gain of about 7 without degrading speech quality.

Journal ArticleDOI
TL;DR: The author expects such methods to be available to the business world before too long, including rapid speech synthesis from printed text inputs able to accommodate an unlimited vocabulary.
Abstract: Reviews the techniques for producing synthetic speech from a computer. The author expects such methods to be available to the business world before too long, including rapid speech synthesis from printed text inputs able to accommodate an unlimited vocabulary. Topics include: analogue recording; human speech; compressed digital speech; speech synthesis from text; software; conversion to sound; synthetic speech for business.

Proceedings ArticleDOI
01 Apr 1979
TL;DR: A speaker dependent system for recognizing carefully articulated continuous speech that accepts English sentences composed from a 127 word vocabulary appropriate to an airline information reservation task and achieves 75% sentence recognition.
Abstract: A speaker dependent system for recognizing carefully articulated continuous speech is described. The system accepts English sentences composed from a 127 word vocabulary appropriate to an airline information reservation task. The system is controlled by a finite state parser which generates word candidates and establishes their temporal locations in hypothetical sentences. The word candidates are evaluated by an LPC distance measure and a dynamic programming algorithm which nonlinearly time aligns isolated word reference templates with the input speech stream. The input is recognized as the hypothetical sentence having the lowest distance according to a well-defined criterion. In a preliminary test based on 100 sentences spoken over dialed-up telephone lines by two male talkers, 90% word accuracy, resulting in 75% sentence recognition, was achieved.

Proceedings ArticleDOI
01 Apr 1979
TL;DR: It is shown that by careful design the algorithm can be made to be as robust to channel errors as that of a fixed rate ADPCM coder.
Abstract: In this paper we examine a number of concepts and issues concerning variable rate coding of speech. We formulate the problem as a multistate coder (i.e. a coder that can operate at several bit rates) coupled with a time buffer. We first analyze the theoretical aspects of the problem by examining it in the context of a block processing formulation. We also allude to a multiple user configuration of variable rate coding for TASI type applications. A practical example of a variable rate ADPCM coder is presented and applied to speech coding. It is shown that by careful design the algorithm can be made to be as robust to channel errors as that of a fixed rate ADPCM coder.

Proceedings ArticleDOI
L. Nebbia1, P. Lucchini1
02 Apr 1979
TL;DR: An automatic vocal response system for the Italian language has been implemented at CSELT, consisting of a hardware speech synthesizer controlled by a programmed device (mini or micro computer) and two excitation generators for voiced and unvoiced sounds.
Abstract: An automatic vocal response system for the Italian language has been implemented at CSELT, consisting of a hardware speech synthesizer controlled by a programmed device (mini or micro computer). The synthesizer exploits a speech production model composed of a 10th order digital lattice filter and two excitation generators for voiced and unvoiced sounds. The hardware also includes a module which controls the updating and transfer of the parameters, and an output module which provides the analog speech signal. The synthesizer configuration is modular and expandable up to 8 channels. For each channel, the minicomputer supplies the synthesizer with the start-stop command plus 13 parameters: 10 filter coefficients, a gain factor, the pitch period and voiced-unvoiced information, and the updating interval. For each channel, every 125 µs, 20 multiplications, 9 additions and 10 subtractions are executed. The filter and the source generator are time-shared among the 8 channels. The complete digital equipment is implemented with TTL-LS integrated circuits.

Journal ArticleDOI
TL;DR: This paper describes a speech digitizer that is capable of transmitting and receiving at 2400 bits/s and typical applications of such digitizers are described.
Abstract: This paper describes a speech digitizer that is capable of transmitting and receiving at 2400 bits/s. Comparisons are made between this implementation and past approaches. Typical applications of such digitizers are also described.

Proceedings ArticleDOI
01 Apr 1979
TL;DR: This paper describes continuing efforts which have concentrated on minimizing loss of synchronization between the receiver and the transmitter, and applies constraints which guarantee synchronization at a cost of some freedom in the selection of data for transmission.
Abstract: Recently we described a variable-frame-rate LPC vocoder designed to transmit good quality speech over 2400 bps fixed-rate noisy channels with bit-error probabilities ranging up to 5% [3]. The basic idea was to lower the data rate by transmitting LPC parameters only when speech characteristics have changed sufficiently since the last transmission, and to employ the resulting bit-rate savings for protecting important transmission data against channel noise. This paper describes our continuing efforts, which have concentrated on minimizing loss of synchronization between the receiver and the transmitter. In one approach, we emphasize heavy protection of the header and rapid resynchronization. Alternatively, we apply constraints which guarantee synchronization at a cost of some freedom in the selection of data for transmission. Results from the first approach are presented; results from both methods will be compared at the conference.

Proceedings ArticleDOI
01 Apr 1979
TL;DR: The SIRENE system is an interactive computer-based system of speech-training aids for the deaf that features the use of automatic speech recognition algorithms in the training of sounds and words.
Abstract: This paper describes the SIRENE system which is being developed in our laboratory. SIRENE is an interactive computer-based system of speech-training aids for the deaf. It also includes a variety of procedures for analysis and classification of pathological voices. The basic idea of speech-training aids consists of compensating for the lack of auditory feedback in deaf children by use of visual displays. The system is intended to be used by speech teachers; several acoustic and phonetic parameters of speech can be displayed and trained: pitch, voicing, intensity, etc. SIRENE also features the use of automatic speech recognition algorithms in the training of sounds and words.

Proceedings ArticleDOI
E. Vivalda1, S. Sandri, C. Miotti
01 Apr 1979
TL;DR: The paper describes the software architecture of an Italian text-to-speech synthesis system based on the joining of LPC coded diphones, which is designed according to multichannel and real time criteria.
Abstract: The paper describes the software architecture of an Italian text-to-speech synthesis system based on the joining of LPC coded diphones. The automatic voice response system is designed according to multichannel and real time criteria. For each output channel, the following operations are performed: pre-processing of the input string of characters, translation into the proper sequence of diphones, generation of prosodic contours and real-time control of a hardware speech synthesizer.

Proceedings ArticleDOI
01 Apr 1979
TL;DR: Under the present restriction to vowel spectra, adaptation methods based on spectral amplitude weighting and on spectral shifting are investigated; a special method makes it possible to adapt test spectra class-specifically.
Abstract: An automatic speech recognition system based on the reference set of a single speaker can be extended for use by several speakers by applying appropriate preprocessing transformations. These transformations adapt the incoming patterns of a new speaker to the patterns of the reference set. Under the present restriction to vowel spectra, adaptation methods based on spectral amplitude weighting and on spectral shifting are investigated. A special method makes it possible to adapt test spectra class-specifically.

Proceedings ArticleDOI
01 Apr 1979
TL;DR: The quality of reproduction and storage requirements for the Microprogrammed Intoned Speech Synthesizer utilizing linear predictive coding techniques is evaluated in comparison with other systems.
Abstract: To provide speech output for a major demonstration project of sophisticated computer based instruction, the Microprogrammed Intoned Speech Synthesizer (MISS) utilizing linear predictive coding techniques was developed. In addition to simple resynthesis of preanalyzed recorded speech, the MISS system can apply procedures to modify the pitch, duration and amplitude of individual words so that they can be concatenated into natural sounding utterances. Text analysis programs provide the linguistic parameters that direct MISS to apply the appropriate manipulations. We evaluate the quality of reproduction and storage requirements for this system in comparison with other systems.

Proceedings ArticleDOI
01 Apr 1979
TL;DR: It is proposed to characterize the speech short-term spectrum with a reduced number of parameters (4 to 7) computed from a rough spectral analysis that permits a correct classification of the steady-state French speech sounds pronounced by different speakers.
Abstract: Tracking and identifying the formants in order to perform speech recognition is a time-consuming, error-prone and speaker-dependent operation. It is proposed to characterize the speech short-term spectrum with a reduced number of parameters (4 to 7) computed from a rough spectral analysis. These parameters permit a correct classification of the steady-state French speech sounds (vowels, including nasals, and unvoiced fricatives) pronounced by different speakers. A word recognition experiment based on the same parameters gives good results with words differing from each other by one phoneme only (single speaker, one learning pass).

Proceedings ArticleDOI
02 Apr 1979
TL;DR: A prototype for the acoustic-phonetic processing level, which makes it possible to test various parameters and strategies for phonemic transcription of continuous speech, is realized in the framework of MYRTILLE II Speech Understanding System.
Abstract: In the framework of MYRTILLE II Speech Understanding System under development in our Laboratory we have realized a prototype for the acoustic-phonetic processing level. This prototype makes it possible to test various parameters and strategies for phonemic transcription of continuous speech. It can be considered as a metasystem in the sense that, given a hierarchy of recognition algorithms and a strategy, it can generate the optimal system for phoneme recognition. The system directly works on the digitized speech wave, which makes it possible to get the best accuracy on the parameters. The speech signal is segmented into phoneme-like units by a decision function which incorporates voicing, energy, zero-crossing rate and curve length. The segments thus obtained are then processed by the recognition system which can be viewed as a tree structure the nodes of which are algorithms. These algorithms take into account one or several features and their answers can be phoneme classes and/or other algorithms. Problems involved in the design of such a system are also presented in this paper together with a particular implementation.
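The segmentation step described here (a decision function over voicing, energy, zero-crossing rate and curve length) can be illustrated with a toy segmenter. The feature set follows the abstract, but the change-detection rule and every parameter value below are hypothetical stand-ins, not the MYRTILLE II implementation.

```python
import numpy as np

def frame_features(frame):
    """Per-frame cues of the kind the segmenter uses: short-time energy,
    zero-crossing rate, and curve length (sum of absolute sample
    differences, which grows with high-frequency content)."""
    energy = float(np.sum(frame ** 2))
    zcr = float(np.mean(np.abs(np.diff(np.signbit(frame).astype(int)))))
    curve_length = float(np.sum(np.abs(np.diff(frame))))
    return energy, zcr, curve_length

def segment_boundaries(signal, frame_len=160, jump=2.0):
    """Mark a boundary wherever the feature vector changes sharply
    between consecutive frames (hypothetical threshold rule)."""
    feats = [frame_features(signal[i:i + frame_len])
             for i in range(0, len(signal) - frame_len + 1, frame_len)]
    bounds = []
    for k in range(1, len(feats)):
        prev, cur = np.array(feats[k - 1]), np.array(feats[k])
        # Relative change in the feature vector between adjacent frames.
        change = np.linalg.norm(cur - prev) / (np.linalg.norm(prev) + 1e-9)
        if change > jump:
            bounds.append(k * frame_len)
    return bounds
```

A sharp jump in the per-frame feature vector is taken as a phoneme-like boundary; the resulting segments would then be passed to the tree of recognition algorithms.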


Proceedings ArticleDOI
02 Apr 1979
TL;DR: A system is being developed which permits a fully synchronized presentation of recorded speech with its corresponding printed text, and any synchronization errors are corrected through operator intervention prior to the creation of the synchronized speech/text material for the classroom.
Abstract: A communications problem encountered by most hearing impaired people is their inability to understand spoken English. Since present technology appears unable to eliminate this speech perception problem, it is hoped that a better understanding of the relationship between printed and spoken English will permit the hearing impaired person to better use their residual hearing. To aid in such instruction, a system is being developed which permits a fully synchronized presentation of recorded speech with its corresponding printed text. A series of computer algorithms are employed to segment the speech signal into syllable-like units, and to separate the corresponding printed text into syllables. The resulting data are then combined on a syllable-by-syllable basis, and any synchronization errors are corrected through operator intervention prior to the creation of the synchronized speech/text material for the classroom.

Proceedings ArticleDOI
01 Apr 1979
TL;DR: A speech coding algorithm for digital transmission of speech at a rate of 9600 bits per second which can be implemented on a speech processing system is described and yielded a signal-to-noise ratio which is indicative of very high quality speech.
Abstract: A speech coding algorithm for digital transmission of speech at a rate of 9600 bits per second which can be implemented on a speech processing system is described. The algorithm combines the following: a pitch extraction loop, a pitch compensating adaptive quantizer, a sequentially adaptive linear predictor, adaptive source coding, and multipath tree searching to generate very high quality speech output. Although each of these elements has been previously applied to speech coding, the combination of all five of these elements has not been discussed before. Preliminary simulation studies of the algorithm have yielded a signal-to-noise ratio which is indicative of very high quality speech.

Journal ArticleDOI
TL;DR: The linguistic and contextual knowledge that must be supplied or programmed into a computer to accomplish speech interpretation is the subject of several research activities which are described.
Abstract: A major motivation is to achieve in man-machine interactions the efficiency of speech communication among humans. Continuous speech is more difficult to understand than are isolated words. Commercially available speech recognition systems of the latter type are highly successful despite their limited capability. To recognize continuous speech, more information is needed than is contained in acoustic waves alone. The linguistic and contextual knowledge that must be supplied or programmed into a computer to accomplish speech interpretation is the subject of several research activities which are described. Speech synthesis systems face similar problems but are further advanced.

Proceedings ArticleDOI
04 Sep 1979