
Showing papers on "Voice activity detection published in 1977"


Journal ArticleDOI
01 Dec 1977
TL;DR: This paper focuses on the voice problem and the possibilities offered by complete digitization of the voice signal immediately following the microphone.
Abstract: Digital techniques, already widely used for transmission of data, are now being introduced in the field of voice communications. By appreciating some of the long-range implications of this trend we can help point the way towards appropriate usage of this developing technology for improved customer service. This paper focuses on the voice problem and the possibilities offered by complete digitization of the voice signal immediately following the microphone. Included in the discussion are a summary of the properties of the speech signal and its potentialities for efficient transmission, a survey of the existing voice digitization algorithms, some examples of voice digitization implementations, and a brief treatment of voice packetization. There are some comments, near the end of the paper, on the possibility of digitized-voice inputting to, and outputting from, computers in an integrated telephone-computer network.

107 citations


PatentDOI
TL;DR: In a speech recognition system of the type including a recognition unit responsive to a voice input and a conditioning input for recognizing the voice input to produce a recognition output, a start signal is produced whenever a voice input exceeds a threshold level and a pause interval detection signal is produced whenever a voice input falls below that level.
Abstract: In a speech recognition system of the type including a recognition unit responsive to a voice input and a conditioning input for recognizing the voice input to produce a recognition output, a start signal is produced whenever a voice input exceeds a threshold level and a pause interval detection signal is produced whenever a voice input falls below a threshold level. An output timing signal is produced when the detection signal lasts a preselected interval of time that may be either about 250 milliseconds or about 250 milliseconds plus a delay. The recognition output from the recognition unit produced in response to the detection signal is displayed in response also to the detection signal. The result is delivered to a utilization device in response to the output timing signal. The delay may be given either by a predetermined duration or an interval between those instants at which the above-mentioned 250 milliseconds have just elapsed after production of the detection signal and after production of another pause interval detection signal for a next following voice input. During the delay, it is possible either by a manually operable switch or a cancel voice input to cancel delivery of the recognition result displayed to be incorrect.

94 citations
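The start/stop logic in the patent above can be sketched as a small state machine over frame energies. This is an illustrative reading, not the patent's circuit: assuming a 10 ms frame step, the roughly 250 ms pause interval becomes 25 frames, and the threshold value is made up.

```python
def detect_utterances(energies, threshold=0.1, pause_frames=25):
    """Return (start, end) frame indices of detected utterances."""
    utterances = []
    start = None   # frame where the current utterance began
    quiet = 0      # consecutive below-threshold frames seen so far
    for i, e in enumerate(energies):
        if e > threshold:
            if start is None:
                start = i          # "start signal": energy crossed the threshold
            quiet = 0
        elif start is not None:
            quiet += 1             # pause-interval detection in progress
            if quiet >= pause_frames:
                # the pause lasted the preselected interval: emit the result
                utterances.append((start, i - pause_frames + 1))
                start, quiet = None, 0
    return utterances

frames = [0.0] * 5 + [0.5] * 10 + [0.0] * 30 + [0.6] * 8 + [0.0] * 30
print(detect_utterances(frames))   # → [(5, 15), (45, 53)]
```

The hangover counter is what distinguishes a between-word pause from the true end of an utterance, which is the core of the patent's timing scheme.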


Journal ArticleDOI
TL;DR: A novel approach to the voiced-unvoiced-silence detection problem is proposed in which a spectral characterization of each of the three classes of signal is obtained during a training session, and an LPC distance measure and an energy distance are nonlinearly combined to make the final discrimination.
Abstract: One of the most difficult problems in speech analysis is reliable discrimination among silence, unvoiced speech, and voiced speech which has been transmitted over a telephone line. Although several methods have been proposed for making this three-level decision, these schemes have met with only modest success. In this paper, a novel approach to the voiced-unvoiced-silence detection problem is proposed in which a spectral characterization of each of the three classes of signal is obtained during a training session, and an LPC distance measure and an energy distance are nonlinearly combined to make the final discrimination. This algorithm has been tested over conventional switched telephone lines, across a variety of speakers, and has been found to have an error rate of about 5 percent, with the majority of the errors (about 2/3) occurring at the boundaries between signal classes. The algorithm is currently being used in a speaker-independent word recognition system.

73 citations
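A minimal sketch of the three-way decision described above, assuming per-class templates learned in a training session. The feature vectors, the Euclidean stand-in for the paper's LPC distance measure, and the combination rule are all illustrative.

```python
import math

# Illustrative class templates: a mean feature vector ("spec") and a mean
# log energy per class, standing in for the paper's trained spectral
# characterization. Values are invented for the example.
TEMPLATES = {
    "silence":  {"spec": [0.1, 0.1, 0.1], "energy": -40.0},
    "unvoiced": {"spec": [0.2, 0.6, 0.9], "energy": -20.0},
    "voiced":   {"spec": [0.9, 0.5, 0.2], "energy": -5.0},
}

def classify(spec, energy):
    """Assign a frame to the class with the smallest combined distance."""
    best, best_d = None, float("inf")
    for name, t in TEMPLATES.items():
        d_spec = math.dist(spec, t["spec"])          # stand-in for the LPC distance
        d_energy = abs(energy - t["energy"]) / 10.0  # energy distance, rescaled
        d = math.hypot(d_spec, d_energy)             # nonlinear combination
        if d < best_d:
            best, best_d = name, d
    return best

print(classify([0.85, 0.5, 0.25], -6.0))   # a voiced-looking frame
```

The key idea preserved here is that neither distance alone decides the class; they are fused into one score before the minimum is taken.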


Journal ArticleDOI
Frederick Jelinek
TL;DR: This group works towards automatic transcription of continuous speech with a vocabulary and syntax as unrestricted as possible and an experimental system is operational.
Abstract: This group works towards automatic transcription of continuous speech with a vocabulary and syntax as unrestricted as possible. It is a long-term effort; however, an experimental system is operational. The acoustic processor contains a spectrum analyzer based on the Fast Fourier Transform and a phone segmenter/recognizer which makes use of transitional and steady-state information in its classification. The linguistic processor accepts an imperfect string of phones and produces an estimated transcription of the speech input.

72 citations


Journal ArticleDOI
TL;DR: A summary of the state-of-the-art of automatic speech recognition (ASR) and its relevance to military applications, together with a number of unsolved problems and techniques which need to be perfected before solutions to a number of military applications of ASR are possible.
Abstract: The objective of this paper is to provide a summary of the state-of-the-art of automatic speech recognition (ASR) and its relevance to military applications. Until recently, speech recognition had its widest application in the development of vocoders for narrow-band speech communications. Presently, research in ASR has been accelerated for military tasks such as command and control, secure voice systems, surveillance of communication channels, and others. Research in voice control technology and digital narrow-band systems are of special interest. Much of the emphasis of today's military-supported research is to reduce to practice the current state of knowledge of ASR, as well as directing research in such a way as to have future military relevance. In coordination with the above-mentioned emphasis in military-supported research, this paper is divided into two major sections. The first section presents discussion of the state-of-the-art and problems in the various subareas of the ASR field. The second section presents a number of unsolved problems and techniques which need to be perfected before the solutions to a number of military applications of the ASR field are possible.

51 citations


Patent
31 May 1977
TL;DR: In this paper, a system for voice signal processing including recognizing input voice messages and generating output voice messages is described, including an audionic clock, a calculator, and an annunciator.
Abstract: A system is provided for voice signal processing including recognizing input voice messages and generating output voice messages. Digital processing of voice signals yields particular advantages. Voice responsive operations provide operator interaction flexibility. Communication of digital information in response to voice information provides data compression for communication applications. Other applications of the voice signal processing system include an audionic clock, an audionic calculator, and an audionic annunciator.

38 citations


Journal ArticleDOI
N. Dixon, Harvey F. Silverman
TL;DR: The modular acoustic processor (MAP), a complex experimental system for automatic derivation of phonemic string output for continuous speech, has stages dedicated to signal analysis, spectral classification, phonemic segmentation, phonemic (steady state) classification, phoneme boundary placement, dyadic (transitional) classification, and final phoneme string consolidation.
Abstract: The modular acoustic processor (MAP), a complex experimental system for automatic derivation of phonemic string output for continuous speech, has stages dedicated to signal analysis, spectral classification, phonemic segmentation, phonemic (steady state) classification, phoneme boundary placement, dyadic (transitional) classification, and final phoneme string consolidation. This paper presents the concepts of and some details concerning these stages. Results on a large body of continuous speech data, prepared by an automatic evaluation system, will also be presented.

28 citations


Journal ArticleDOI
TL;DR: The Telephone Enquiry Service is a computer system which allows interactive information retrieval from an ordinary touch-tone telephone, and an unusual feature of the system is that the speech is generated by rule from a phonetic representation.
Abstract: The Telephone Enquiry Service is a computer system which allows interactive information retrieval from an ordinary touch-tone telephone. For input, the caller employs the touch-tone keypad, and the computer replies with a synthetic voice response. The service has been in fairly continuous operation for around one year, using a small time-shared computer in conjunction with an internal 200-line telephone exchange, and has been widely used by people with no special interest in synthetic speech. An unusual feature of the system is that the speech is generated by rule from a phonetic representation. A satellite computer, acting as a peripheral to the main machine, performs this task in real time, and controls the parameters of an analogue speech synthesizer. This constitutes an extremely economical and flexible method of speech storage, whose only real disadvantage is the low quality of articulation of the output. A major conclusion of the work is that even low-quality speech is acceptable to casual users, if the service is sufficiently interesting and useful to them.

23 citations


Journal ArticleDOI
TL;DR: This paper describes a speech analysis-synthesis system based on stationary linear prediction formulation that uses a variable analysis frame size concept and the k-parameters are used to represent the spectral information in the speech.
Abstract: This paper describes a speech analysis-synthesis system based on stationary linear prediction formulation. This system uses a variable analysis frame size concept. The k-parameters are used to represent the spectral information in the speech. The statistical and quantization properties of k-parameters are studied in detail. A method for calculating the analysis frame size based on energy and pitch period variations within a speech waveform has been developed. The speech analysis-synthesis system has been implemented on the computing facility of the Signal Processing Laboratory at Case Western Reserve University. Average data rates of 4800, 3600, and 2400 bits/s have been achieved on a limited speech data base of male speakers.

22 citations
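The k-parameters used above to represent spectral information are the reflection coefficients of the linear predictor. A textbook Levinson-Durbin recursion (not the paper's implementation; frame sizing and quantization are omitted) computes them from a frame's autocorrelation sequence:

```python
def reflection_coefficients(r, order):
    """Levinson-Durbin: autocorrelation r[0..order] -> k-parameters k[1..order]."""
    a = [0.0] * (order + 1)   # predictor coefficients (a[0] implicitly 1)
    e = r[0]                  # prediction error energy
    ks = []
    for i in range(1, order + 1):
        # partial correlation of the new lag against the current predictor
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / e
        ks.append(k)
        # symmetric update of the predictor coefficients
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        e *= (1.0 - k * k)    # shrink the residual energy
    return ks

# AR(1)-like autocorrelation [1, 0.9, 0.81]: expect k1 ≈ 0.9, k2 ≈ 0
print(reflection_coefficients([1.0, 0.9, 0.81], 2))
```

Reflection coefficients are attractive for transmission because stability of the synthesis filter reduces to the simple check |k| < 1, which is easy to preserve under quantization.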


Journal ArticleDOI
TL;DR: A novel digital computer algorithm for pitch period analysis of connected speech is presented, to find the periodic-like portions of the speech, from its zero-crossing interval (ZCI) sequence, by an adaptive search.
Abstract: A novel digital computer algorithm for pitch period analysis of connected speech is presented. The principal idea employed is to find the periodic-like portions of the speech, from its zero-crossing interval (ZCI) sequence, by an adaptive search. The algorithm requires only additions and comparisons, no multiplication. An implementation of the algorithm on a 1.5 μs memory-cycle computer (PDP-8) performs the analysis in real time, as accurately as manually performed pitch measurements from a plot of the speech waveform low-pass filtered to a bandwidth of 900 Hz in such a way that each pitch boundary is marked on a zero-crossing. Although 12 threshold values are used for decision making, only two of them, the upper and lower limits of the speaker's one octave pitch range, are speaker dependent. However, these two threshold values of a speaker are easily extracted and set by the algorithm from a sample of only 4-5 s duration of his (or her) normal speech.

20 citations
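A much-simplified sketch of the zero-crossing-interval idea above, keeping only the two speaker-dependent thresholds (the one-octave pitch range) and omitting the paper's full adaptive search; names and parameter values are illustrative.

```python
import math

def pitch_periods(x, t_min, t_max):
    """Accept an interval between positive-going zero crossings as a pitch
    period only if it lies inside the speaker's one-octave range, measured
    in samples. Uses only comparisons and additions on sample indices."""
    crossings = [i for i in range(1, len(x)) if x[i - 1] < 0 <= x[i]]
    return [b - a for a, b in zip(crossings, crossings[1:])
            if t_min <= b - a <= t_max]

# a clean tone with a 100-sample period: expect periods of about 100 samples
wave = [math.sin(2 * math.pi * n / 100) for n in range(400)]
print(pitch_periods(wave, 80, 160))
```

On real speech the ZCI sequence is far noisier than this tone, which is why the paper low-pass filters to about 900 Hz and searches the interval sequence adaptively rather than trusting single crossings.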


Journal ArticleDOI
TL;DR: Various manipulative procedures yield appreciable improvements in the confidence limits of 20 M.R.C. word lists re-recorded by a professional announcer.
Abstract: A necessary feature of good recorded speech material is that each list should be identical in respect of graded order of difficulty. Extensive validatory tests have been carried out upon 20 M.R.C. word lists re-recorded by a professional announcer. It can be shown that by various manipulative procedures appreciable improvements may be obtained in the confidence limits of the material. In addition, studies have been carried out upon the relationship between Speech Detection Threshold (S.D.T.), Speech Reception Threshold (S.R.T.), and Hearing Level (H.L.) in normal-hearing subjects. The surprising finding is that both S.D.T. and S.R.T. appear to be relatively independent of H.L. An analysis has been made of the words comprising the material in respect of their ease of recognition and is presented as an appendix.

Proceedings ArticleDOI
09 May 1977
TL;DR: In this paper, an adaptive noise-stripping Wiener filter is used to prefilter the noisy speech in order to adapt to quasi-stationary noise, and a speech classifier is developed that detects the presence of silence (noise alone), unvoiced speech or voiced speech.
Abstract: In an attempt to develop a more robust vocoder, an adaptive noise-stripping Wiener filter is used to prefilter the noisy speech. In order to adapt to quasi-stationary noise, a speech classifier is developed that detects the presence of silence (noise alone), unvoiced speech, or voiced speech. During the silent intervals the noise statistics and the corresponding Wiener filter are updated, resulting in a decision-directed adaptive structure.
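The decision-directed structure described above can be sketched per frame: a crude energy test flags silence, silence frames refresh the noise estimate, and every frame is filtered with a Wiener-style gain. The smoothing constant, threshold, and gain formula are illustrative stand-ins, not the paper's design.

```python
def process_frame(power_spectrum, noise_est, alpha=0.9, silence_thresh=1.0):
    """One step of a decision-directed noise stripper (illustrative)."""
    frame_energy = sum(power_spectrum)
    if frame_energy < silence_thresh * sum(noise_est):
        # silence detected: update the noise statistics by exponential smoothing
        noise_est = [alpha * n + (1 - alpha) * p
                     for n, p in zip(noise_est, power_spectrum)]
    # Wiener-style gain per frequency bin from the current noise estimate
    gains = [max(0.0, 1.0 - n / p) if p > 0 else 0.0
             for n, p in zip(noise_est, power_spectrum)]
    filtered = [g * p for g, p in zip(gains, power_spectrum)]
    return filtered, noise_est

speech, noise = process_frame([10.0, 1.0, 5.0], [1.0, 1.0, 1.0])
print(speech)   # → [9.0, 0.0, 4.0]
```

The circularity the paper exploits is visible here: the classifier's silence decisions feed the noise model, and the noise model in turn shapes the filter applied to later frames.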

Proceedings ArticleDOI
01 May 1977
TL;DR: Variable data rate LPC speech compression schemes are employed to transmit LPC parameters only when speech characteristics have changed sufficiently since the last transmission, yielding improved speech quality relative to fixed-rate schemes for a given average transmission rate.
Abstract: Variable data rate LPC speech compression schemes are employed to transmit LPC parameters only when speech characteristics have changed sufficiently since the last transmission, yielding improved speech quality relative to fixed-rate schemes for a given average transmission rate. Transmission of variable-rate LPC speech over fixed-rate channels is accomplished using transmit and receive buffers, with resulting transmission delays. Development of proper buffer control strategy is essential to minimize losses caused by exhausting either buffer, or by corrective actions, namely, forced or suppressed transmission. Certain aspects of such strategy and their impact on speech quality and data rate are discussed for a narrowband (2400 bps) speech transmission system.
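The "transmit only when speech characteristics have changed sufficiently" rule above can be sketched as a simple frame selector; the distance measure, threshold, and the buffering strategy the paper analyzes are all omitted or invented here.

```python
def select_frames(frames, threshold=0.5):
    """Return indices of LPC parameter sets that would actually be sent:
    a frame is transmitted only when it differs enough from the last
    transmitted frame (max absolute parameter change, illustrative)."""
    sent = []
    last = None
    for i, f in enumerate(frames):
        if last is None or max(abs(a - b) for a, b in zip(f, last)) > threshold:
            sent.append(i)   # transmit this parameter set
            last = f
    return sent

params = [[0.0, 0.0], [0.1, 0.0], [0.9, 0.2], [0.95, 0.2], [0.1, 0.8]]
print(select_frames(params))   # → [0, 2, 4]
```

Because steady vowels change slowly, long runs of frames are skipped, which is exactly what makes the average rate variable and motivates the transmit/receive buffers discussed in the paper.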

Journal ArticleDOI
TL;DR: A combination of variable-quality coding and time-interval modification is proposed that can efficiently load a transmission facility and accommodate fluctuating demands on it in packet-switched digital speech transmission.
Abstract: Speech transmission by switched digital packets offers several opportunities for increasing the utilization of transmission capacity. We comment here upon a combination of variable-quality coding and time-interval modification that can efficiently load a transmission facility and accommodate fluctuating demands on it.

Proceedings ArticleDOI
01 May 1977
TL;DR: Acoustic-phonetic conversion is probably the most critical step in continuous speech recognition; transitional (diphone) information can be used to improve the results.
Abstract: Acoustic-phonetic conversion is probably the most critical step in continuous speech recognition. The transitional information can be used as follows, in order to improve the results. First we constitute a lexicon of the phoneme steady-state spectra and a lexicon of all the transitions (diphones), each one being characterized by a "differential spectrum". The unknown continuous speech wave is segmented into quasi steady-state and transitional segments; the labelling of the quasi steady-state segments admits several candidates. The transitional segment between two quasi steady-state spectra is then compared to the diphones of the lexicon selected from the combination of the surrounding possible phoneme labels. Actually, only the comparisons which are compatible with the recent past of the message are made. When working as a phoneme-vocoder, the whole procedure needs about 3x real-time, without any optimization.
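The diphone-matching step above can be sketched as follows: each steady-state segment keeps several candidate labels, and the transitional segment is compared only against the diphones formed from those candidate pairs. The lexicon contents and the distance measure are invented for illustration.

```python
import itertools
import math

# Illustrative diphone lexicon: a "differential spectrum" per transition.
DIPHONES = {
    ("a", "s"): [0.8, -0.2],
    ("a", "t"): [0.1, 0.9],
    ("o", "s"): [-0.5, 0.3],
}

def best_transition(left_candidates, right_candidates, diff_spectrum):
    """Compare the observed differential spectrum only against diphones
    formed from the surrounding candidate phoneme labels; return the
    best-matching (left, right) pair, or None if no diphone is known."""
    best, best_d = None, float("inf")
    for pair in itertools.product(left_candidates, right_candidates):
        if pair not in DIPHONES:   # not a lexicon diphone: skip the comparison
            continue
        d = math.dist(diff_spectrum, DIPHONES[pair])
        if d < best_d:
            best, best_d = pair, d
    return best

print(best_transition(["a", "o"], ["s", "t"], [0.7, -0.1]))
```

Restricting comparisons to candidate-compatible diphones is what keeps the search tractable, mirroring the paper's remark that only comparisons compatible with the recent past of the message are made.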


Proceedings ArticleDOI
01 May 1977
TL;DR: Preliminary results indicate that a factor of 3 to 5 further reduction in bandwidth might be possible using segmentation and labeling in conjunction with LPC vocoders.
Abstract: We have been attempting to produce further bandwidth reduction in LPC-based analysis-synthesis techniques by using the segmentation and labeling algorithms used in the Harpy and Hearsay-II systems. Preliminary results indicate that a factor of 3 to 5 further reduction in bandwidth might be possible using segmentation and labeling in conjunction with LPC vocoders.

Proceedings ArticleDOI
27 Sep 1977
TL;DR: The paper shows that TI-NET has various features which counter the delay problems of conventional packet-switched networks, making it appear very suitable for encoded voice transmission; it represents an important step in the development of a transparent and intelligent public network capable of transmitting encoded voice and data.
Abstract: The paper describes the work done thus far in the development of a means of statistically multiplexing data and encoded voice in a transparent and intelligent network called TI-NET. A review of previous work in packetized voice transmission in a conventional packet-switched network (ARPANET) has revealed problems related to large fixed and variable transmission delays. These problems result in degradations in speech quality in the form of time-scale distortion and gaps due to very late or lost packets. The paper shows that TI-NET has various features which counter the above problems, and thereby make it appear very suitable for encoded voice transmission. The paper describes a software implementation of a protocol for encoded voice transmission which takes advantage of the 60 to 65% idle time in conversations (one way), so that only active periods in speech need be transmitted; this is not possible with present frame-synchronous vocoders, which transmit continuously. The paper describes the present experimental TI-NET, which consists of two nodes (PDP-11's) joined by a 9.6 kb/s link. For convenience the protocol for encoded voice transmission has been implemented in the TI-NET nodes, and the transparent transmission of data and encoded voice has been demonstrated. With regard to packetizing synchronous messages within TI-NET, a new protocol is described (and has been implemented) for padding partially filled "minipackets" that are likely to occur at the end of most synchronous messages. In addition, the paper discusses the features of another aspect of TI-NET, i.e. satellite extension nodes, which are used to enable both local and remote regions of high data concentration to access the subnet through 12-14 GHz satellite links. It is shown that the advantages of accessing TI-NET by satellite extension nodes (as compared to accessing a conventional packet-switched network by terrestrial facilities) include lower communication costs, greater accuracy and security, and smaller entrance delay. Finally, the paper describes an experiment in which data and encoded voice were transmitted from a TI-NET node at Carleton University, Ottawa, at 9.6 kb/s in "multiuser" packets, over the Hermes (CTS) satellite to the NASA Ames Research Center in California, where they were looped back and returned to the same TI-NET node. The significance of the work described in the paper is that it represents an important step in the development of a transparent and intelligent public network capable of transmitting encoded voice (at 9.6 kb/s and lower rates) and data. Such a network can serve as an integrated system for data and voice and provide cost benefits to its users through savings in bandwidth of the order of 50 to 1, by statistically multiplexing data with vocoded voice.

01 Oct 1977
TL;DR: Results of an examination of four methods for processing speech so as to enhance its intelligibility in the presence of wideband random noise at the source are described.
Abstract: This report describes results of an examination of four methods for processing speech so as to enhance its intelligibility in the presence of wideband random noise at the source. The four methods were: (1) INTEL, a method which involves processing in both the first and second order spectral domains; (2) Spectral subtraction, which involves a simple subtraction of the average noise spectrum from the first-order spectrum; (3) Minimum mean square error filtering, which involves filtering speech in such a way as to minimize the mean square error between a signal and its expected value in noise; and (4) Methods based upon suppressing the frequency content of a speech plus noise signal between pitch harmonics of the speech signal. To carry out a study of methods to enhance speech intelligibility in noise, two general-purpose computer processing systems were implemented. The first, a terminal interactive system for generation, analysis, and graphic display of synthetic voiced speech sounds, provided considerable insight into the effect of various processing algorithms upon speech and upon speech in noise. The second computer processing system has been developed for the processing of real speech. It involves use of a DDP-116 data converter and a Honeywell 6000 Computer.
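Method (2) above, spectral subtraction, reduces to one line per frame: subtract the average noise magnitude spectrum from the noisy magnitude spectrum and floor the result at zero. Real implementations work on short-time FFT frames and reuse the noisy phase; here the "spectra" are plain lists for illustration.

```python
def spectral_subtract(noisy_mag, noise_avg_mag):
    """Per-bin spectral subtraction with half-wave rectification:
    negative differences are clamped to zero rather than kept."""
    return [max(0.0, s - n) for s, n in zip(noisy_mag, noise_avg_mag)]

clean = spectral_subtract([3.0, 1.0, 0.2], [0.5, 0.5, 0.5])
print(clean)   # → [2.5, 0.5, 0.0]
```

The zero floor is the source of the method's well-known "musical noise" artifact: bins that dip below the noise average are silenced entirely rather than attenuated smoothly.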

01 Jan 1977
TL;DR: A trainable acoustic pattern recognizer manufactured by Scope Electronics is presented; its built-in recognition algorithm was disabled, and the VCS was modified to transmit 120-bit encodings to an external computer for recognition.
Abstract: A trainable acoustic pattern recognizer manufactured by Scope Electronics is presented. The voice command system VCS encodes speech by sampling 16 bandpass filters with center frequencies in the range from 200 to 5000 Hz. Variations in speaking rate are compensated for by a compression algorithm that subdivides each utterance into eight subintervals in such a way that the amount of spectral change within each subinterval is the same. The recorded filter values within each subinterval are then reduced to a 15-bit representation, giving a 120-bit encoding for each utterance. The VCS incorporates a simple recognition algorithm that utilizes five training samples of each word in a vocabulary of up to 24 words. The recognition rate of approximately 85 percent correct for untrained speakers and 94 percent correct for trained speakers was not considered adequate for flight systems use. Therefore, the built-in recognition algorithm was disabled, and the VCS was modified to transmit 120-bit encodings to an external computer for recognition.
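The VCS compression step above can be sketched by accumulating frame-to-frame spectral change over the utterance and cutting it into subintervals that each contain the same amount of change, so fast and slow renditions of a word yield comparable encodings. The city-block distance and list-of-lists frame format are illustrative, not Scope Electronics' actual algorithm.

```python
def equal_change_boundaries(frames, n_segments=8):
    """Return frame indices that split the utterance into n_segments
    subintervals of equal accumulated spectral change."""
    # per-step spectral change between consecutive frames
    deltas = [sum(abs(a - b) for a, b in zip(f1, f2))
              for f1, f2 in zip(frames, frames[1:])]
    total = sum(deltas)
    boundaries, acc = [], 0.0
    target = total / n_segments        # change per subinterval
    for i, d in enumerate(deltas):
        acc += d
        # place a boundary each time another target's worth of change accrues
        while (len(boundaries) < n_segments - 1
               and acc >= target * (len(boundaries) + 1)):
            boundaries.append(i + 1)   # cut after frame i
    return boundaries

# uniform change: the eight subintervals fall evenly
print(equal_change_boundaries([[float(i)] for i in range(9)], 8))   # → [1, 2, 3, 4, 5, 6, 7]
```

When spectral change is concentrated (e.g. a plosive release), several boundaries land close together, giving the time-warping effect that compensates for speaking-rate variation.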

Patent
23 Dec 1977
TL;DR: In this article, a fixed delay between the input speech channel and the transmission facility provides time to generate and transmit a symbol representing the speech channel to which the transmission device has been assigned.
Abstract: In a time assignment speech interpolation system, a fixed delay between the input speech channel and the transmission facility provides time to generate and transmit a symbol representing the speech channel to which the transmission facility has been assigned. At the remote location, a fixed delay between the transmission facility and the output speech channel provides time in which to detect the symbol and perform the necessary switching.

Journal ArticleDOI
TL;DR: A novel amplitude-insensitive speech processor, built around a bank of high-pass filters and a circuit for measuring the ratio of two voltages each of which can vary over a wide dynamic range, is incorporated in a vibrotactile speech training aid for the deaf.
Abstract: The paper is about the design of a novel speech processor and its successful incorporation in a vibrotactile speech training aid for the deaf. The introduction makes clear the particular difficulties and needs of the deaf child who is learning to speak. The discussion shows that to help with the deaf child's difficulty with mimicry the technologist must enable the child to sense and compare patterns of articulator configurations. It is suggested that the vocal gestures will be monitored most easily if the speech processor operates in such a way as to ignore variations in speech intensity. An explanation is given of a novel amplitude insensitive filtering system which employs a bank of high-pass filters. A particular requirement of the system is for the measurement of the ratio of two voltages each of which can vary over a wide dynamic range; a circuit for this is presented. Although the discussion focuses on the application of the frequency-analysis system to speech processing the theoretical treatment facilitates the evaluation of the system's potential for processing other signals.


Proceedings ArticleDOI
01 May 1977
TL;DR: A voice recognition experiment for speech understanding, based on the observation that a voice recognition system can be greatly improved by exploiting the intrinsic redundancy of spoken natural language, that is, by delaying every decision to the highest available information level.
Abstract: This paper presents a voice recognition experiment for speech understanding. The approach is based on the fact that a voice recognition system can be greatly improved by exploiting the intrinsic redundancy of the spoken natural language, that is, by delaying every decision to the highest available information level. Namely, any decision taken at the phoneme (acoustic) level carries the loss of a certain amount of information. The linguistic recognition system we have so far developed is based on a linguistic model where decisions are taken only at the full message level. This approach follows the same basic idea of a system now successfully working for Mail Address Optical Recognition (1). Such a system has been successfully improved via EMMA, a special network of associative minicomputers consisting, for that application, of about 60 processors.

Proceedings ArticleDOI
Subhro Das, Charles C. Tappert
01 May 1977
TL;DR: A real-time Adaptive Differential Pulse Code Modulation system is described which employs an inexpensive (currently about $25) stack-architecture microprocessor that performs all processing between successive input speech samples.
Abstract: A real-time Adaptive Differential Pulse Code Modulation (ADPCM) system is described which employs an inexpensive, currently about $25, stack-architecture microprocessor. The coder operates at 3 bits/sample and a 10 kHz sampling rate and performs all processing between successive input speech samples. The ADPCM tables are stored in the stack for rapid manipulation of table column and row pointers, allowing real-time operation.
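The tables held in the stack above drive a generic ADPCM loop: each sample is coded as a quantized difference from a prediction, and the quantizer step size adapts by a multiplier looked up from the code just emitted. The multipliers, ranges, and clamps below are invented for illustration, not the paper's tables.

```python
# step-size multiplier indexed by |code| magnitude (illustrative values)
MULT = [0.9, 0.9, 1.25, 1.75]

def adpcm(samples):
    """Encode and decode in one loop; returns (3-bit codes, reconstruction)."""
    pred, step = 0.0, 1.0
    codes, decoded = [], []
    for s in samples:
        diff = s - pred
        code = max(-4, min(3, int(round(diff / step))))   # 3-bit code: -4..3
        pred += code * step        # the decoder's reconstruction of the sample
        codes.append(code)
        decoded.append(pred)
        step *= MULT[min(abs(code), 3)]      # table-driven step adaptation
        step = max(0.1, min(100.0, step))    # keep the step size bounded
    return codes, decoded
```

Because the decoder applies the same table lookups to the same codes, encoder and decoder step sizes stay in lockstep with no side information, which is what makes 3 bits/sample workable.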

Proceedings ArticleDOI
N. Dixon
01 May 1977
TL;DR: The rationale for and some examples from an application hierarchy and a recognition-then-segmentation approach will be presented; this approach has been used fairly successfully in phonemic segmentation of continuous speech.
Abstract: In Automatic Recognition of Continuous Speech (ARCS), one approach is to segment the speech continuum approximately at the phoneme level as an initial step in abstracting lexical and/or semantic content. If heuristic rules are used for this segmentation, the order of rule application and the character of the data to be used by the rules become important considerations. The rationale for and some examples from an application hierarchy and a recognition-then-segmentation approach will be presented; this approach has been used fairly successfully in phonemic segmentation of continuous speech.


Proceedings ArticleDOI
01 May 1977
TL;DR: A system for the recognition of spoken phrases has been used successfully to retrieve messages from a remote computer by voice and the phrase recognition accuracy was 92% for each group of talkers.
Abstract: A system for the recognition of spoken phrases is described. The input speech is filtered to 3.3 kHz (approximating telephone bandwidth), and parameters are computed while the utterance is spoken. The remaining time necessary to produce a result is generally less than the duration of the input utterance. The recognizer assumes that the input utterance contains one of a known set of key phrases; the phrase is allowed to be included within a more or less arbitrary carrier sentence. Analysis is performed on a syllable-by-syllable basis with only the strong syllables considered in the recognition process. An interactive training facility allows flexible composition of key phrase sets. Testing has been performed for a few phrase sets, each containing less than twenty phrases. Talkers used in the training and talkers not used in the training tested the system. The phrase recognition accuracy was 92% for each group of talkers. The recognizer has been used successfully to retrieve messages from a remote computer by voice.

Proceedings ArticleDOI
01 May 1977
TL;DR: The redundancy analysis of symbolic systems is proposed as a guideline through the labyrinth of speech recognition, distinguishing the spectral and the symbolic information capacity, the negative entropy, the acceptance rhythm, and the redundancies and anti-redundancies which characterize the relevant constraints, rules or knowledge sources.
Abstract: The telephonic wave carries 50,000 bits/sec, but the brain accepts only 10 linguistic bits/sec. How can we optimize such a 5000:1 transformation? The redundancy analysis of symbolic systems is proposed as a guideline through the labyrinth of speech recognition. We distinguish the spectral and the symbolic information capacity (bit/sec), the negative entropy (bit/symbol), the acceptance rhythm (symbol/sec), and the redundancies and anti-redundancies (bit/bit), which characterize the relevant constraints, rules or knowledge sources. Redundancy analysis can detail prosodic components, such as dynamics, duration, pitch or voicing degree. It can be applied to various forms of connected, restricted or coded speech, with vocabularies ranging from small to large numbers of words.

Journal ArticleDOI
TL;DR: The objective of this study is to evaluate the performance of the Harpy continuous speech recognition system when the speech input to the system is corrupted by the quantization noise of an ADPCM system.
Abstract: One of the major problems of a speech processing system is the degradation it suffers due to distortions in the speech input. One such distortion is caused by the quantization noise of waveform encoding schemes, which have several attractive features for speech transmission. The objective of this study is to evaluate the performance of the Harpy continuous speech recognition system when the speech input to the system is corrupted by the quantization noise of an ADPCM system. The Harpy system uses segmentation of continuous speech based on the Itakura metric and LPC-based parameters for template generation. The effect of quantization noise on the segmentation and the estimation of LPC-based parameters is studied for different bit rates in the range 16-48 kbit/s of the ADPCM system, and the overall word and sentence recognition accuracies are evaluated.