
Showing papers on "Voice activity detection published in 1984"


01 Jan 1984
TL;DR: An automatic lipreading system is described, and the combination of the acoustic and visual recognition candidates is shown to yield a final recognition accuracy which greatly exceeds the acoustic recognition accuracy alone.
Abstract: Automatic recognition of the acoustic speech signal alone is inaccurate and computationally expensive. Additional sources of speech information, such as lipreading (or speechreading), should enhance automatic speech recognition, just as lipreading is used by humans to enhance speech recognition when the acoustic signal is degraded. This paper describes an automatic lipreading system which has been developed. A commercial device performs the acoustic speech recognition independently of the lipreading system. The recognition domain is restricted to isolated utterances and speaker dependent recognition. The speaker faces a solid state camera which sends digitized video to a minicomputer system with custom video processing hardware. The video data is sampled during an utterance and then reduced to a template consisting of visual speech parameter time sequences. The distances between the incoming template and all of the trained templates for each utterance in the vocabulary are computed and a visual recognition candidate is obtained. The combination of the acoustic and visual recognition candidates is shown to yield a final recognition accuracy which greatly exceeds the acoustic recognition accuracy alone. Practical considerations and the possible enhancement of speaker independent and continuous speech recognition systems are also discussed.
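
A minimal sketch of the kind of decision fusion the abstract describes, combining per-word template distances from the two modalities. The linear blend and min-max normalization are assumptions for illustration; the paper does not publish its combination rule.

```python
import numpy as np

def fuse_candidates(acoustic_dists, visual_dists, alpha=0.5):
    """Combine acoustic and visual template distances (one per
    vocabulary word, lower is better) into a single score and return
    the index of the winning word. The linear blend is an assumption."""
    a = np.asarray(acoustic_dists, dtype=float)
    v = np.asarray(visual_dists, dtype=float)
    # Min-max normalize each modality so the distances are comparable.
    a = (a - a.min()) / (np.ptp(a) + 1e-9)
    v = (v - v.min()) / (np.ptp(v) + 1e-9)
    return int(np.argmin(alpha * a + (1.0 - alpha) * v))

# Acoustically ambiguous words 0 and 2 are separated by the visual evidence.
print(fuse_candidates([0.42, 0.80, 0.40], [0.90, 0.70, 0.10]))  # -> 2
```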

389 citations


Proceedings ArticleDOI
Sharad Singhal1, B. S. Atal2
19 Mar 1984
TL;DR: This paper focuses on problems encountered in attempting to maintain speech quality while synthesizing speech using multi-pulse excitation at lower bit rates.
Abstract: The multi-pulse excitation model provides a method for producing natural-sounding speech at medium to low bit rates. Multi-pulse analysis obtains the all-pole filter excitation by minimizing a spectrally-weighted mean-squared error between the original and synthetic speech signals. Although the method provides high quality speech around 10 kbits/sec, speech quality suffers if the bit rate is lowered. In this paper, we focus on problems encountered in attempting to maintain speech quality while synthesizing speech using multi-pulse excitation at lower bit rates.

163 citations


Proceedings ArticleDOI
01 Mar 1984
TL;DR: In Harmonic Coding, the signal is synthesized in the time domain as a superimposition of "harmonics" whose instantaneous frequency varies continuously along an interpolation curve within each frame, so that fast pitch variations can be tracked with no difficulty.
Abstract: The Harmonic Coding concept has already shown its potential for efficiently coding speech. Previous implementations have used a frame rate of one frame every 16 ms. This was mainly due to the fact that, with longer frames, even a nonstationary spectral model (of low order) cannot reproduce the zones of fast-varying pitch with the desirable quality. However, the high framing rate is a limitation, since it implies that fewer bits will be available for encoding each frame. A solution for this problem has been devised: the signal is synthesized in the time domain, as a superimposition of "harmonics" whose instantaneous frequency varies continuously along an interpolation curve, within each frame. In this way, fast pitch variations can be tracked with no difficulty. Experimental results are presented, confirming these facts. The integration of this synthesis scheme in a speech coder is discussed.
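
A sketch of the synthesis idea, assuming a simple linear interpolation curve for the fundamental between frame edges: the phase of each harmonic is obtained by integrating its continuously varying instantaneous frequency, so pitch glides stay smooth within the frame.

```python
import numpy as np

def synth_frame(f0_start, f0_end, amps, dur=0.032, fs=8000):
    """Synthesize one frame as a superimposition of harmonics whose
    instantaneous frequency follows a linear glide from f0_start to
    f0_end (a stand-in for the paper's interpolation curve)."""
    n = int(dur * fs)
    t = np.arange(n) / fs
    f0 = f0_start + (f0_end - f0_start) * t / dur   # instantaneous F0
    phase = 2 * np.pi * np.cumsum(f0) / fs          # integrate frequency
    frame = np.zeros(n)
    for k, a in enumerate(amps, start=1):
        frame += a * np.sin(k * phase)              # harmonic k tracks k*f0(t)
    return frame

# 32 ms frame with the pitch gliding from 100 Hz up to 140 Hz.
x = synth_frame(100.0, 140.0, amps=[1.0, 0.5, 0.25])
```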

98 citations


PatentDOI
Mary Anne Garrett1, Gideon Shichman1
TL;DR: Voice signals amplified in a programmable gain, programmable bandwidth subsystem are digitized and buffered and the digitized signals are transferred over a dedicated high speed bus to a speech processor, thus relieving the speech processor and its resident host of the overhead of high data rate transfers.
Abstract: Voice signals amplified in a programmable gain, programmable bandwidth subsystem are digitized and buffered and the digitized signals are transferred over a dedicated high speed bus to a speech processor, thus relieving the speech processor and its resident host of the overhead of high data rate transfers and permitting a relatively low capacity computer to accomplish voice recognition on a real time basis.

95 citations


Patent
18 May 1984
TL;DR: Speech signal presence is detected by a VAD (Voice Activity Detector) in two steps: (1) signal energy above a threshold decides presence, below threshold decides ambiguity; (2) ambiguity is resolved by testing the rate of change of spectral parameters.
Abstract: Speech signal presence is detected in a VAD (Voice Activity Detector) in two steps: (1) Signal energy above a threshold decides presence, below threshold decides ambiguity; (2) ambiguity is resolved by testing the rate of change of spectral parameters.
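
A minimal sketch of the two-step decision described above; the energy and spectral-rate thresholds are tuning constants assumed for illustration, not values from the patent.

```python
import numpy as np

def vad_two_step(frames, energy_thresh=1e-3, rate_thresh=0.05):
    """Two-step VAD: energy above the threshold decides speech
    outright; below it the frame is ambiguous, and the decision falls
    back on the rate of change of the (magnitude) spectrum."""
    decisions, prev_spec = [], None
    for frame in frames:
        energy = float(np.mean(frame ** 2))
        spec = np.abs(np.fft.rfft(frame))
        if energy > energy_thresh:
            speech = True                          # step 1: energy decides
        elif prev_spec is None:
            speech = False
        else:
            rate = float(np.mean(np.abs(spec - prev_spec)))
            speech = rate > rate_thresh            # step 2: spectral change
        prev_spec = spec
        decisions.append(speech)
    return decisions
```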

79 citations


Proceedings ArticleDOI
01 Mar 1984
TL;DR: A system for automatic alignment of phonetic transcriptions with continuous speech has been developed; 93% of the segments are mapped into only one phoneme, and 70% of the time the offset between the boundary found by the automatic alignment system and a hand transcriber is less than 10 ms.
Abstract: A system for automatic alignment of phonetic transcriptions with continuous speech has been developed. The speech signal is first segmented into broad classes using a non-parametric pattern classifier. A knowledge-based dynamic programming algorithm then aligns the broad classes with the phonetic transcriptions. These broad classes provide "islands of reliability" for more detailed segmentation and refinement of boundaries. By doing alignment at the phonetic level, the system can often tolerate inter- and intra-speaker variability. The system was evaluated on sixty sentences spoken by three speakers, two male and one female. 93% of the segments are mapped into only one phoneme, and 70% of the time the offset between the boundary found by the automatic alignment system and a hand transcriber is less than 10 ms. The performance can be improved by applying more heuristic rules.

64 citations


Journal ArticleDOI
TL;DR: This paper presents an improved word-detection algorithm, which can incorporate both vocabulary (syntactic) and task (semantic) information, leading to word-detection accuracies close to 100 percent for isolated digit detection over a wide range of telephone transmission conditions.
Abstract: Accurate location of the endpoints of spoken words and phrases is important for reliable and robust speech recognition. The endpoint detection problem is fairly straightforward for high-level speech signals in low-level stationary noise environments (e.g., signal-to-noise ratios greater than 30-dB rms). However, this problem becomes considerably more difficult when either the speech signals are too low in level (relative to the background noise), or when the background noise becomes highly nonstationary. Such conditions are often encountered in the switched telephone network when the limitation on using local dialed-up lines is removed. In such cases the background noise is often highly variable in both level and spectral content because of transmission line characteristics, transients and tones from the line and/or from signal generators, etc. Conventional speech endpoint detectors have been shown to perform very poorly (on the order of 50-percent word detection) under these conditions. In this paper we present an improved word-detection algorithm, which can incorporate both vocabulary (syntactic) and task (semantic) information, leading to word-detection accuracies close to 100 percent for isolated digit detection over a wide range of telephone transmission conditions.
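
For contrast with the improved algorithm, here is a sketch of the conventional energy-threshold endpoint detector the paper reports failing under nonstationary telephone noise. The noise-floor estimate from leading frames and the 10 dB margin are assumptions.

```python
import numpy as np

def energy_endpoints(x, fs, frame_ms=10, margin_db=10.0):
    """Conventional endpoint detection: the threshold sits a fixed
    margin above a noise floor estimated from the first frames
    (assumed silence). Returns (start, end) sample indices or None."""
    n = int(fs * frame_ms / 1000)
    frames = [x[i:i + n] for i in range(0, len(x) - n + 1, n)]
    log_e = np.array([10 * np.log10(np.mean(f ** 2) + 1e-12) for f in frames])
    noise_floor = log_e[:10].mean()
    active = np.where(log_e > noise_floor + margin_db)[0]
    if active.size == 0:
        return None
    return active[0] * n, (active[-1] + 1) * n
```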

62 citations


Patent
25 May 1984
TL;DR: In this article, a video display of stored text is accompanied by associated speech from a speech synthesizer using coded sounds and intonation, and a central processor controls selection of text and speech.
Abstract: Video display of stored text is accompanied by associated speech from a speech synthesizer using coded sounds and intonation. A central processor controls selection of text and speech. Speech is selectable in one of a plurality of prestored languages coded in frequency and duration data.

52 citations



Proceedings ArticleDOI
01 Mar 1984
TL;DR: A computationally efficient formulation is derived for both covariance and correlation type analyses for multipulse coding of speech, and several methods for pulse amplitude and position determination are given, ranging from a purely sequential one to one which reoptimizes pulse amplitudes at each step.
Abstract: This paper discusses the analysis techniques used to derive the excitation waveform for multipulse coding of speech. A computationally efficient formulation is derived for both covariance and correlation type analyses. These methods differ in the way block edges are treated. Several methods for pulse amplitude and position determination are given, ranging from a purely sequential one to one which reoptimizes pulse amplitudes at each step. It is shown that the reoptimization scheme has a nested structure that allows a reduction in the computations. An efficient method for pulse position coding is given. This method can essentially achieve the entropy limit for randomly placed pulses. Experimental results are given for typical configurations including computational requirements and speech quality assessments.
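
A sketch of the purely sequential strategy named in the abstract, under the usual multipulse setup: given the target (weighted) signal and the synthesis filter's impulse response, each step places the pulse that best cancels the current residual. Frame-edge truncation of the impulse response is ignored for brevity, and no reoptimization of earlier amplitudes is attempted.

```python
import numpy as np

def multipulse_sequential(target, h, n_pulses):
    """Place n_pulses one at a time: the best position maximizes the
    correlation with the filter impulse response h, and the amplitude
    is the least-squares optimum for that position."""
    residual = np.asarray(target, dtype=float).copy()
    h = np.asarray(h, dtype=float)
    hh = float(np.dot(h, h))
    pulses = []
    for _ in range(n_pulses):
        corr = np.correlate(residual, h, mode="full")[len(h) - 1:]
        pos = int(np.argmax(np.abs(corr)))
        amp = corr[pos] / hh
        pulses.append((pos, amp))
        # Subtract this pulse's contribution from the residual.
        end = min(len(residual), pos + len(h))
        residual[pos:end] -= amp * h[:end - pos]
    return pulses
```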

48 citations


PatentDOI
Peter F. Brown1
TL;DR: In this article, a speech recognition method and apparatus employ a speech processing circuitry for repetitively deriving from a speech input, at a frame repetition rate, a plurality of acoustic parameters.
Abstract: A speech recognition method and apparatus employ a speech processing circuitry for repetitively deriving from a speech input, at a frame repetition rate, a plurality of acoustic parameters. The acoustic parameters represent the speech input signal for a frame time. A plurality of template matching and cost processing circuitries are connected to a system bus, along with the speech processing circuitry, for determining, or identifying, the speech units in the input speech, by comparing the acoustic parameters with stored template patterns. The apparatus can be expanded by adding more template matching and cost processing circuitry to the bus thereby increasing the speech recognition capacity of the apparatus. The speech processing circuitry establishes overlapping time durations for generating the acoustic parameters and further employs a sinc-Kaiser smoothing function in combination with a folding technique for providing a discrete Fourier transform. The Fourier spectra are transformed using a biased principal component analysis which optimizes the across class variance. The template matching and cost processing circuitries provide distributed processing, on demand, of the acoustic parameters for generating through a dynamic programming technique the recognition decision.

PatentDOI
TL;DR: In a speech recognition system, the beginning of speech versus non-speech (a cough or noise) is distinguished by reverting to a non-speech decision process whenever the likelihood cost of template (vocabulary) patterns, including silence, is worse than a predetermined threshold, established by a Joker Word which represents a non-vocabulary word score and path in the grammar graph.
Abstract: In a speech recognition system, the beginning of speech versus non-speech (a cough or noise) is distinguished by reverting to a non-speech decision process whenever the likelihood cost of template (vocabulary) patterns, including silence, is worse than a predetermined threshold, established by a Joker Word which represents a non-vocabulary word score and path in the grammar graph.
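
A toy sketch of the rejection rule, assuming costs are negative log likelihoods (lower is better) and that the Joker Word contributes a fixed threshold cost; the actual patent embeds the joker as a path in the grammar graph.

```python
def decode_or_reject(template_costs, joker_cost):
    """Return the best vocabulary word, or None (non-speech) when even
    the best template, including silence, scores worse than the joker."""
    best_word = min(template_costs, key=template_costs.get)
    if template_costs[best_word] > joker_cost:
        return None                       # revert to non-speech decision
    return best_word

print(decode_or_reject({"yes": 4.1, "no": 3.7, "<silence>": 5.0}, 3.0))  # None
print(decode_or_reject({"yes": 2.1, "no": 3.7, "<silence>": 5.0}, 3.0))  # yes
```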

PatentDOI
TL;DR: In this article, a speech recognition method and apparatus employ a speech processing circuitry for repetitively deriving from a speech input, at a frame repetition rate, a plurality of acoustic parameters.
Abstract: A speech recognition method and apparatus employ a speech processing circuitry for repetitively deriving from a speech input, at a frame repetition rate, a plurality of acoustic parameters. The acoustic parameters represent the speech input signal for a frame time. A plurality of template matching and cost processing circuitries are connected to a system bus, along with the speech processing circuitry, for determining, or identifying, the speech units in the input speech, by comparing the acoustic parameters with stored template patterns. The apparatus can be expanded by adding more template matching and cost processing circuitry to the bus thereby increasing the speech recognition capacity of the apparatus. Template pattern generation is advantageously aided by using a "joker" word to specify the time boundaries of utterances spoken in isolation, by finding the beginning and ending of an utterance surrounded by silence.

Proceedings ArticleDOI
01 Mar 1984
TL;DR: A system for speech analysis and enhancement which combines signal processing and symbolic processing in a closely coupled manner and attempts to reconstruct the original speech waveform using symbolic processing to help model the signal and to guide reconstruction.
Abstract: This paper describes a system for speech analysis and enhancement which combines signal processing and symbolic processing in a closely coupled manner. The system takes as input both a noisy speech signal and a symbolic description of the speech signal. The system attempts to reconstruct the original speech waveform using symbolic processing to help model the signal and to guide reconstruction. The system uses various signal processing algorithms for parameter estimation and reconstruction.

Proceedings ArticleDOI
01 Mar 1984
TL;DR: A further application for time-alignment algorithms is described, in which replacement dialogue for a film soundtrack may be automatically synchronized to reference dialogue recorded during filming, in a digital signal processing system that uses a DP algorithm.
Abstract: A number of applications exist in basic speech research for Dynamic Programming (DP) algorithms that can produce accurate time registration data for aligning one speech signal with a similar speech signal. In this paper, a further application for time-alignment algorithms is described, in which replacement dialogue for a film soundtrack may be automatically synchronized to reference dialogue recorded during filming. This is being carried out in a digital signal processing system that uses a DP algorithm capable of aligning utterances of indeterminate length accurately and efficiently in real-time. The main features of this system and the DP algorithm will be described.
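
A textbook DP (DTW) alignment sketch of the kind such a system builds on; the real-time, indeterminate-length algorithm in the paper is more elaborate.

```python
import numpy as np

def dtw_align(ref, rep):
    """Align two feature sequences (e.g. reference dialogue frames vs.
    replacement dialogue frames) and return the warping path as a list
    of (ref_index, rep_index) pairs."""
    n, m = len(ref), len(rep)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(np.asarray(ref[i - 1]) - np.asarray(rep[j - 1]))
            D[i, j] = d + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    path, i, j = [], n, m                  # backtrack the best path
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = int(np.argmin([D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]
```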

Proceedings ArticleDOI
01 Mar 1984
TL;DR: The paper describes an automatic method, called Automatic Diphone Bootstrapping (or A.D.B.), for template extraction for Speaker-Adaptive Continuous Speech Recognition using "diphones" as speech units, which operates without any manual intervention and performed very well for all the speakers on which it was tested.
Abstract: The paper describes an automatic method, called Automatic Diphone Bootstrapping (or A.D.B.), for template extraction for Speaker-Adaptive Continuous Speech Recognition using "diphones" as speech units. Diphones have proved to be very suitable for C.S.R. as they meet the main requirements of phonetic units: invariance with the context and economy. Furthermore the performance of diphone-based speaker dependent C.S.R. systems is very high. For a long time manual extraction has been presented in the literature as the only completely reliable method for sub-word template creation for any speaker (see [1] as an example). Recently some automatic techniques for reference pattern extraction were developed [2,3], but they also require some manual corrections. The A.D.B. procedure operates without any manual intervention and performed very well for all the speakers on which it was tested. In a connected digit recognition task, a W.R.R. of 98.79% was achieved by using the speaker-adaptive templates created by the A.D.B. procedure.

Patent
01 Feb 1984
TL;DR: In this paper, a digital speech processor operates in parallel with a programmable digital computer to generate sequences of variable-length speech phrases and pauses at the request of the computer, using a separate command memory region, accessible to and loadable by the computer.
Abstract: A digital speech processor operates in parallel with a programmable digital computer to generate sequences of variable-length speech phrases and pauses at the request of the computer. A speech memory within the speech processor contains digitally-encoded speech data segments of varying length. A separate command memory region, accessible to and loadable by the computer, can be loaded with a plurality of commands. When sequentially executed by the speech processor, these commands cause the processor to generate an arbitrary sequence of spoken phrases and pauses without intervention by the computer. Each two-byte command causes the speech processor to retrieve from the speech memory a particular speech data segment and convert it into speech, or to pause for a time interval specified by a number within the command.
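
A sketch of the command interpreter the patent describes. The two-byte encoding below (high bit selects pause, remaining 15 bits carry a segment index or a pause duration) is invented for illustration; the patent does not fix a bit layout here.

```python
def run_speech_commands(commands, speech_memory, play, pause):
    """Execute a loaded command sequence without host intervention:
    each 16-bit command either plays a stored speech segment or
    pauses for a specified interval (encoding assumed)."""
    for cmd in commands:
        if cmd & 0x8000:
            pause(cmd & 0x7FFF)                   # pause, duration in ms
        else:
            play(speech_memory[cmd & 0x7FFF])     # play segment by index

run_speech_commands(
    [0x0001, 0x8064, 0x0002],                     # phrase 1, 100 ms pause, phrase 2
    {1: b"\x00" * 240, 2: b"\x00" * 320},
    play=lambda seg: print("play", len(seg), "bytes"),
    pause=lambda ms: print("pause", ms, "ms"),
)
```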

Proceedings ArticleDOI
19 Mar 1984
TL;DR: An artificial speech recognition experiment is introduced as a convenient means of assessing alignment accuracy, and alignment accuracy is found to be improved considerably by applying certain speaker adaptation transformations to the synthetic speech.
Abstract: A capacity to carry out reliable automatic time alignment of synthetic speech to naturally produced speech offers potential benefits in speech recognition and speaker recognition as well as in synthesis itself. Phrase alignment experiments are described that indicate that alignment to synthetic speech is more difficult than alignment of speech from two natural speakers. An artificial speech recognition experiment is introduced as a convenient means of assessing alignment accuracy. By this measure, alignment accuracy is found to be improved considerably by applying certain speaker adaptation transformations to the synthetic speech, by modifying the spectrum similarity metric, and by generating the synthetic spectra directly from the control parameters using simplified excitation spectra. The improvements seem to limit, however, at a level below that found between natural speakers. It is conjectured that further improvement requires modifications to the synthesis rules themselves.

Proceedings ArticleDOI
01 Mar 1984
TL;DR: It is shown that prosodic information, e.g., the rhythmic structure of an input word, its syllabic structure, voiced/unvoiced regions in the word and the temporal distribution of back/front vowels, nasals and liquids and glides, can be used effectively to select a substantially reduced subvocabulary of candidates, before any fine phonetic analysis is attempted to recognize the word.
Abstract: Prosodic information is believed to be valuable information in human speech perception, but speech recognition systems to date have largely been based on segmental spectral analysis. In this paper I describe parts of a front end to a very-large-vocabulary isolated word recognition system using prosodic information. The present front end is template independent (speaker training for large vocabulary systems (> 20,000 words) is undesirable) and makes use of robust cues in the incoming speech to obtain a presorted vocabulary of candidates. It is shown that prosodic information, e.g., the rhythmic structure of an input word, its syllabic structure, voiced/unvoiced regions in the word and the temporal distribution of back/front vowels, nasals and liquids and glides, can be used effectively to select a substantially reduced subvocabulary of candidates, before any fine phonetic analysis is attempted to recognize the word.
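
A toy version of the presorting step: coarse, template-independent features index into the vocabulary, and fine phonetic analysis is only attempted on the survivors. The lexicon format and features are simplified assumptions.

```python
def presort(lexicon, n_syllables, voicing_pattern):
    """Keep only words whose stored prosodic description matches the
    coarse cues measured on the input word."""
    return [w for w, f in lexicon.items()
            if f["syl"] == n_syllables and f["voicing"] == voicing_pattern]

lexicon = {
    "seven": {"syl": 2, "voicing": "UV"},   # unvoiced onset, voiced rest
    "three": {"syl": 1, "voicing": "UV"},
    "nine":  {"syl": 1, "voicing": "V"},
}
print(presort(lexicon, 1, "UV"))            # -> ['three']
```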

Proceedings ArticleDOI
01 Mar 1984
TL;DR: A concatenation of small sound segments is used to assemble a sentence along with duration and pitch contour extracted from a natural utterance of the same sentence to develop a speech coding system working at a very low bit rate.
Abstract: The purpose of this work was to develop a speech coding system working at a very low bit rate (100 bits/s) and capable of reproducing natural-sounding speech. The approach chosen was to use a concatenation of small sound segments to assemble a sentence, along with duration and pitch contour extracted (by Dynamic Time Warping) from a natural utterance of the same sentence. The segmental aspect is coded at the phonological level, thus leading to a bit rate around 60 bits/s. The decoding of the phonemes is obtained by a set of rules operating on phonetic features. The prosodic aspect is coded using stored pitch contours and duration patterns, thus leading to a bit rate around 40 bits/s.
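
A back-of-the-envelope check of the quoted budget; the per-phoneme figures below are assumptions, not from the paper.

```python
phoneme_rate = 12        # phonemes per second, typical of read speech (assumed)
bits_per_phoneme = 5     # enough for a ~32-symbol phoneme inventory (assumed)
segmental = phoneme_rate * bits_per_phoneme      # ~60 bits/s, as quoted
prosody = 40             # stored pitch-contour and duration indices, as quoted
print(segmental, "+", prosody, "=", segmental + prosody, "bits/s")  # 100 bits/s
```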

Journal ArticleDOI
TL;DR: A recognition scheme which adapts itself to mild degradations in speech and improves the reliability of recognition significantly is proposed, and techniques which adaptively discriminate between noisy and noise-free parameters by using a selective weighting procedure in the final distance calculation are suggested.

Journal ArticleDOI
TL;DR: An algorithm is proposed which will obtain, from an input speech signal, formant parameter data to control a parallel formant speech synthesiser by allowing some delay and employing variable-frame-rate techniques.
Abstract: An algorithm is proposed which will obtain, from an input speech signal, formant parameter data to control a parallel formant speech synthesiser. By allowing some delay and employing variable-frame-rate techniques, the parameter data can be obtained at a low frame rate (typically 20 frames per second) suitable for transmission or storage.
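
A sketch of the variable-frame-rate idea: a frame of formant parameters is transmitted only when it has moved far enough from the last transmitted frame. The distance measure and threshold are assumptions.

```python
import numpy as np

def select_frames(frames, thresh):
    """Return indices of frames to transmit or store: the first frame,
    then any frame whose parameters differ from the last kept frame
    by more than thresh (max absolute difference)."""
    kept = [0]
    for i in range(1, len(frames)):
        diff = np.abs(np.asarray(frames[i]) - np.asarray(frames[kept[-1]]))
        if np.max(diff) > thresh:
            kept.append(i)
    return kept
```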

Proceedings ArticleDOI
01 Mar 1984
TL;DR: A Speaker Recognizability Test (SRT) is presented, which tries to establish how well a given communications system preserves a speaker's identity.
Abstract: Speech intelligibility and quality are the two most often tested features of speech coding systems. However, another feature of interest in store-and-forward applications is the preservation of a speaker's identity. Here, a Speaker Recognizability Test (SRT) is presented, which tries to establish how well a given communications system preserves a speaker's identity. Contrary to previous efforts, no attempt is made to identify the cues used by listeners for speaker recognition. Instead, listeners are asked directly to identify a speaker who says an utterance by comparing the uttered sentence with reference sentences, one from each speaker. Among the issues considered in the design of the test is the choice of speakers, the use of reference sentences from the same or different sessions of data collection, and the use of processed or unprocessed speech for reference.

Journal ArticleDOI
TL;DR: A microprocessor-based speech acquisition and processing system which uses waveform analysis techniques to extract measurements from the acoustic signal and operates in "real time" and employs noninvasive data-capturing techniques.
Abstract: Durational measurements of frication, aspiration, prevoicing, and voice onset are often difficult to perform from the spectrogram, and the resolution is limited to about 5 ms. In many instances, a ...

Journal ArticleDOI
TL;DR: This paper deals with the problems associated with full-duplex scrambled speech communications over analog two-wire telephone networks by analyzing the situation in which the two users speak at the same time and each should hear the other.
Abstract: This paper deals with the problems associated with full-duplex scrambled speech communications over analog two-wire telephone networks. The goal is not to describe particular scrambling systems or methods, but to analyze the situation in which the two users speak at the same time and each should hear the other. Current telephone scrambling devices preclude this feature. A general half-duplex scrambling model is described as a base to the discussion of pseudo- and true full-duplex communications. Finally, a novel true full-duplex scrambling architecture based on a paper by Cox and Tribolet [11] is presented.

Proceedings ArticleDOI
01 Mar 1984
TL;DR: A speaker adaptation method that follows two steps -- selection of "persons" who have voices similar to the user's and generation of a speaker-adapted dictionary from their dictionaries is studied.
Abstract: A speaker-trained voice recognition system with a large vocabulary has a serious weak point, that is, the user must register a large number of words prior to its use. To be freed from this problem, the authors have studied a speaker adaptation method. This method follows two steps -- 1) selection of "persons" who have voices similar to the user's and 2) generation of a speaker-adapted dictionary from their dictionaries. Results of simulation using 1000-word speech samples by 40 male speakers (20 for standard dictionaries and 20 for performance evaluation) are reported. The results indicated the advantage of this method. The speaker-trained dictionary gave 90.1% recognition accuracy, the speaker-independent dictionary gave 83.6%, and the speaker-adapted dictionary which required only 10% of the vocabulary for training gave 85.7%.
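
A sketch of the two steps under simplifying assumptions: every dictionary maps the same words to fixed-length feature vectors, speaker similarity is summed Euclidean distance over the user's few training words, and the adapted dictionary is a plain average of the k closest speakers' entries.

```python
import numpy as np

def adapt_dictionary(user_words, speaker_dicts, k=3):
    """Step 1: rank stored speakers by distance between the user's
    training words and the same words in each speaker's dictionary.
    Step 2: average the k closest dictionaries into an adapted one."""
    def dist(d):
        return sum(np.linalg.norm(np.asarray(d[w]) - np.asarray(v))
                   for w, v in user_words.items())
    closest = sorted(speaker_dicts, key=dist)[:k]
    return {w: np.mean([np.asarray(d[w]) for d in closest], axis=0)
            for w in closest[0]}
```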

Journal ArticleDOI
TL;DR: The voice characteristics conversion technique described herein is best suited for speech systems, either text-to-speech or analysis-synthesis (meaning record-playback), that use LPC synthesizers.
Abstract: Text-to-speech systems available today generate virtually unlimited speech from a prestored library of component sounds, which is frequently created from a male voice. A major reason is the low pitch profile of the male voice; for example, speech analysis software and commercial synthesizers seem to work more accurately with the male voice than with the female voice. To overcome the analysis and synthesis problems with the female voice, and to provide a choice of more than one sex of voice output from a text-to-speech system, some means of voice characteristics conversion is needed. To do so, it is important to first understand which parameters in speech define the perception of sex. An attempt is made herein to first study these parameters and then to learn to adjust them so that the voice of one sex can be converted to another, or from one voice character to another. The voice characteristics conversion technique described herein is best suited for speech systems, either text-to-speech or analysis-synthesis (meaning record-playback), that use LPC synthesizers.

Proceedings ArticleDOI
01 Mar 1984
TL;DR: Modifications are presented which improve the quality of the synthesized speech without requiring the transmission of any additional data and show an increase of up to 5 points in overall speech quality with the implementation of each of these improvements.
Abstract: The major weakness of the current narrowband LPC synthesizer lies in the use of a "canned" invariant excitation signal. The use of such an excitation signal is based on three primary assumptions, namely, (1) that the amplitude spectrum of the excitation signal is flat and time-invariant, (2) that the phase spectrum of the voiced excitation signal is a time-invariant function of frequency, and (3) that the probability density function (PDF) of the phase spectrum of the unvoiced excitation signal is also time-invariant. This paper critically examines these assumptions and presents modifications which improve the quality of the synthesized speech without requiring the transmission of any additional data. Diagnostic Acceptability Measure (DAM) tests show an increase of up to 5 points in overall speech quality with the implementation of each of these improvements.

Proceedings ArticleDOI
01 Mar 1984
TL;DR: This paper proposes a recognition scheme which adapts itself to mild degradations in speech, and suggests techniques which adaptively discriminate between noisy and noise-free parameters by using a selective weighting procedure in the final distance calculations.
Abstract: The performance of an isolated word speech recognition (IWSR) system is known to drop rapidly with increasing degradation of the input speech. In this paper we propose a recognition scheme which adapts itself to mild degradations in speech. The scheme does not need a priori information regarding the nature and extent of the noise. We suggest techniques which adaptively discriminate between noisy and noise-free parameters by using a selective weighting procedure in the final distance calculations. A suitable index is used to study the performance of the recognition system for small data sets. Our scheme lends itself to greater flexibility in handling degradations in the speech input than do the existing recognition schemes. We illustrate our scheme by simulating adaptive differential pulse code modulated (ADPCM) speech, where the main distortion is contributed by the quantization noise.
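
One plausible realization of the selective weighting, assuming a per-parameter noise-variance estimate is available: noisy dimensions are down-weighted in the final distance. Inverse-variance weighting is an assumption, not the paper's exact rule.

```python
import numpy as np

def weighted_distance(test, ref, noise_var):
    """Weighted squared distance in which parameters judged noisy
    (large estimated noise variance) contribute less."""
    w = 1.0 / (1.0 + np.asarray(noise_var, dtype=float))
    diff = np.asarray(test, dtype=float) - np.asarray(ref, dtype=float)
    return float(np.sum(w * diff ** 2) / np.sum(w))

# A dimension known to be noisy barely affects the distance.
print(weighted_distance([1.0, 5.0], [1.1, 0.0], noise_var=[0.0, 100.0]))
```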
