Showing papers on "Voice activity detection published in 1980"
••
TL;DR: In this paper, a spectral decomposition of a frame of noisy speech is used to attenuate a particular spectral line depending on how much the measured speech plus noise power exceeds an estimate of the background noise.
Abstract: One way of enhancing speech in an additive acoustic noise environment is to perform a spectral decomposition of a frame of noisy speech and to attenuate a particular spectral line depending on how much the measured speech plus noise power exceeds an estimate of the background noise. Using a two-state model for the speech event (speech absent or speech present) and using the maximum likelihood estimator of the magnitude of the speech spectrum results in a new class of suppression curves which permits a tradeoff of noise suppression against speech distortion. The algorithm has been implemented in real time in the time domain, exploiting the structure of the channel vocoder. Extensive testing has shown that the noise can be made imperceptible by proper choice of the suppression factor.
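The suppression rule described above can be sketched as follows. This is a minimal illustration, assuming a per-bin maximum-likelihood gain of the form 1/2 + 1/2·sqrt(max(0, 1 − N/|Y|²)) and an externally supplied noise-power estimate; the paper's two-state speech model, adaptive noise tracking, and channel-vocoder time-domain implementation are omitted.

```python
import numpy as np

def ml_suppress(noisy_frame, noise_power):
    """Attenuate each spectral line of a noisy speech frame according to
    how much its measured power exceeds the background-noise estimate.
    noise_power: per-bin (or scalar) background-noise power estimate,
    assumed to be supplied externally."""
    Y = np.fft.rfft(noisy_frame)
    power = np.abs(Y) ** 2
    # ML-style gain: 1 when noise is negligible, 0.5 when speech is absent
    gain = 0.5 + 0.5 * np.sqrt(
        np.maximum(0.0, 1.0 - noise_power / np.maximum(power, 1e-12)))
    return np.fft.irfft(gain * Y, n=len(noisy_frame))
```

In the full algorithm the residual 0.5 gain floor is further weighted by the speech-presence probability, which is what permits the noise-suppression/speech-distortion tradeoff mentioned in the abstract.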
854 citations
••
TL;DR: Several control methodologies are described, leading to an end-to-end feedback approach that achieves stable operation and efficient utilization of network resources by adaptively matching transmitted voice bit rates to prevailing network conditions.
Abstract: Integrated packet-switched networks have potential for providing improved performance by dynamically sharing transmission bandwidths between various users and user types, but new flow control methods are needed to deal with packetized voice traffic. This paper describes a packet voice flow control concept based on embedded speech coding. Results are presented from a computer simulation study of the technique in the context of a multilink wideband packet speech network. Several control methodologies are described, leading to an end-to-end feedback approach that achieves stable operation and efficient utilization of network resources by adaptively matching transmitted voice bit rates to prevailing network conditions. Issues in the design of embedded speech coding algorithms are reviewed and a candidate structure based on channel vocoding principles is presented, along with the subjective results of some preliminary listening tests.
125 citations
••
01 Apr 1980
TL;DR: The development of a digital encoding system designed to exploit the limited detection ability of the auditory system is described: by dynamically shaping the encoding error spectrum as a function of the input speech signal, the error is masked by the speech.
Abstract: The development of a digital encoding system designed to exploit the limited detection ability of the auditory system is described. By dynamically shaping the encoding error spectrum as a function of the input speech signal, the error is masked by the speech. Psychoacoustic experiments and results from the literature provide a basis for determining the system parameters that ensure that the error is inaudible. The encoder is a multi-channel system, each channel approximately of critical bandwidth. The input signal is filtered into 17 frequency channels via the quadrature mirror filter technique. Each channel is then coded using block-companding adaptive PCM. For 4.1 kHz bandwidth speech, the differential threshold of the encoding degradation occurs at a bit rate of 34.4 kbps. At 16 kbps, the encoder produces toll quality speech output.
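The block-companding adaptive PCM stage applied to each channel can be sketched as follows. This illustrates only the principle: the 17-channel QMF filter bank and the paper's exact scale-factor coding are omitted, and the block size and word length are assumptions.

```python
import numpy as np

def block_apcm(x, block=32, bits=4):
    """Block-companding adaptive PCM: each block is normalized by its
    peak magnitude (the transmitted scale factor), and the samples are
    uniformly quantized within that normalized range."""
    levels = 2 ** (bits - 1)
    out = np.zeros(len(x))
    for start in range(0, len(x), block):
        seg = np.asarray(x[start:start + block], dtype=float)
        scale = max(np.max(np.abs(seg)), 1e-12)   # per-block companding scale
        codes = np.clip(np.round(seg / scale * levels), -levels, levels - 1)
        out[start:start + block] = codes / levels * scale
    return out
```

Because the scale adapts per block, the quantization error tracks the local signal level, which is what lets the encoder keep the error spectrum below the masking threshold in each critical band.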
103 citations
••
TL;DR: A signal model based more directly upon the physics of speech generation is proposed and implemented, and parametric control of the synthesis model is implemented by an adaptive procedure that minimizes the spectral difference between a human speech input and the synthetic output of the model.
Abstract: A traditional model of the speech signal has provided the underpinning of vocoder technology since the inception of analysis/synthesis telephony. The model is a first‐order approximation to human speech generation in which the source of vocal sound and the resonant acoustic system are treated as linear, separable elements. This source‐system model cannot properly account for a number of acoustic factors now known to exist in speech generation. We propose and implement here a signal model based more directly upon the physics of speech generation. We also implement parametric control of the synthesis model by an adaptive procedure that minimizes the spectral difference between a human speech input and the synthetic output of the model. The adapted parameters constitute a low bit‐rate representation of the input human speech. We test a preliminary form of the system by computer simulation and demonstrate that in simple initial trials the signal model is able to adapt in a realistic manner.
62 citations
••
TL;DR: In this paper, a helium-speech unscrambler was proposed to reduce the bandwidth of the helium speech before transmitting the speech signals to a distant location on a carrier wave selected for optimum transmission through the water.
Abstract: The invention relates to a novel helium-speech unscrambler which can be located at a diver's location and enables the helium speech voiced by the diver to be subjected to waveform time expansion to reduce its bandwidth (e.g. to 2 to 3 kHz) prior to transmitting the speech signals to a distant location on a carrier wave selected for optimum transmission through the water.
46 citations
•
15 Dec 1980
TL;DR: Speech compaction/replay apparatus for real-time monitoring of speech and filtering out periods of relative silence from a recording of the speech is described in this article.
Abstract: Speech compaction/replay apparatus for real-time monitoring of speech and filtering out periods of relative silence from a recording of the speech. The recording also contains synchronization and time code information to ensure that, on replay and in terms of real time, the audio output essentially replicates the analog speech input. The apparatus and technique minimize the amount of storage media required to store the speech.
44 citations
••
TL;DR: This paper presents a new method of voiced/unvoiced/ silence discrimination of speech based on the results of counting bit alternations of the bit stream from linear delta modulation of the speech signal and zero crossings of a band-pass filtered output of the decoded LDM signal.
Abstract: This paper presents a new method of voiced/unvoiced/ silence discrimination of speech. The decision algorithm is based on the results of counting bit alternations of the bit stream from linear delta modulation (LDM) of the speech signal and zero crossings of a band-pass filtered output of the decoded LDM signal. Computer simulation of the system with real speech has yielded accurate results. Economical realization of the discriminator hardware using standard integrated circuits is also considered.
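A toy version of the idea can be sketched as follows. The band-pass filtering stage is omitted and the thresholds are illustrative assumptions, not the paper's tuned values: silence drives the delta modulator into its alternating idle pattern, while the zero-crossing count separates unvoiced from voiced frames.

```python
import numpy as np

def ldm_bits(x, delta=0.05):
    """Linear delta modulation: emit one sign bit per sample as the
    staircase estimate tracks the input."""
    bits, est = [], 0.0
    for s in x:
        b = 1 if s >= est else -1
        est += delta * b
        bits.append(b)
    return np.array(bits)

def vus_classify(frame, delta=0.05):
    """Voiced/unvoiced/silence decision from LDM bit alternations and
    zero crossings. The thresholds 0.9 and 0.25 are assumptions for
    illustration only."""
    bits = ldm_bits(frame, delta)
    alternation = np.mean(bits[1:] != bits[:-1])
    zcr = np.mean(np.sign(frame[1:]) != np.sign(frame[:-1]))
    if alternation > 0.9:            # idle (alternating) LDM pattern
        return "silence"
    return "unvoiced" if zcr > 0.25 else "voiced"
```

Counting bit alternations is attractive for hardware because it operates directly on the one-bit LDM stream with a simple counter, which is why the abstract notes an economical realization in standard integrated circuits.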
26 citations
••
TL;DR: The performances of the SVADM and the CVSD in terms of dynamic range, sampling rate and the channel errors are compared and the parameters employed for subjective evaluation of the packet voice system are packet size, silence detection algorithm, bit rate and packet loss rate.
Abstract: In this paper, the performances of the Song Voice Adaptive Delta Modulator (SVADM) and the Continuously Variable Slope Delta Modulator (CVSD) in terms of dynamic range, sampling rate and the channel errors are compared. The use of the SVADM and the CVSD in a packet voice system, the algorithms for digital detection of silent periods and the performance of a packet voice system using the SVADM and the CVSD as source encoders are presented. The parameters employed for subjective evaluation of the packet voice system are packet size, silence detection algorithm, bit rate and packet loss rate.
24 citations
••
09 Apr 1980
TL;DR: A conversational-mode speech-understanding system which enables its user to make airline reservations and obtain timetable information through a spoken dialog, structured as a three-level hierarchy consisting of an acoustic word recognizer, a syntax analyzer, and a semantic processor.
Abstract: We describe a conversational mode speech understanding system which enables its user to make airline reservations and obtain timetable information through a spoken dialog. The system is structured as a three level hierarchy consisting of an acoustic word recognizer, a syntax analyzer and a semantic processor. The semantic level controls an audio response system making two way speech communication possible. The system is highly robust and operates on-line in a few times real time on a laboratory minicomputer. The speech communication channel is a standard telephone set connected to the computer by an ordinary dialed-up line.
22 citations
••
TL;DR: A microprocessor based speech recognition system for the voice control of wheelchair, touch-tone phone, typewriter and environmental control unit, which exhibits less than one percent substitutions and eleven percent rejections with the ten digit set.
••
TL;DR: Discusses the concept of automatic speech recognition, presenting the latest system developments achieved in research programmes undertaken by various researchers, and discussing difficulties such as speech modelling, an unclear understanding of human speech recognition patterns, and software and hardware that are inadequate to perform the required functions.
Abstract: Discusses the concept of automatic speech recognition, presenting the latest system developments achieved in research programmes undertaken by various researchers. Three research systems are briefly mentioned, and difficulties such as speech modelling, an unclear understanding of human speech recognition patterns, and software and hardware inadequate to perform the required functions are discussed.
••
09 Apr 1980
TL;DR: Improvements on the classical model of speech are presented which produce speech that is significantly better than that of currently available systems, including an efficient encoding of the prediction residuals of the two components.
Abstract: This paper presents improvements on the classical model which produce speech that is significantly better than that of currently available systems. The first major improvement results from treating speech as a two source phenomenon that can be separated for parallel but independent analysis/synthesis. This two component decomposition is accomplished by making use of the quasi-periodic nature of 'voiced' speech. The second major improvement in bit compression and robustness of operation results from an efficient encoding of the prediction residuals of the two components. The key step is to encode the residual of the periodic component by picking out and transmitting the essential information for only one cycle (pitch period) of the residual.
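The key residual-encoding step, reduced to its simplest form, can be sketched as follows, under the assumptions of a perfectly periodic residual and a known pitch period (the paper's "essential information" extraction is necessarily more elaborate).

```python
import numpy as np

def encode_periodic_residual(residual, pitch_period):
    """Transmit only one cycle (pitch period) of the periodic
    component's prediction residual."""
    return np.asarray(residual[:pitch_period], dtype=float)

def decode_periodic_residual(cycle, n_samples):
    """Regenerate the excitation by repeating the transmitted cycle."""
    reps = -(-n_samples // len(cycle))        # ceiling division
    return np.tile(cycle, reps)[:n_samples]
```

The bit savings come directly from the ratio of frame length to pitch period: only one period's worth of residual samples is sent per frame instead of the whole residual.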
••
09 Apr 1980
TL;DR: The speaker-trained, voice-controlled repertory dialer system was tested extensively by 6 talkers; there were no recognition errors, and a request for a repeat of a spoken word occurred about 2% of the time.
Abstract: In this paper we describe a speaker-trained, voice-controlled, repertory dialer system. The main elements of the system include: 1. A real-time speech analyzer that detects the presence of speech on the input line, and analyzes the speech to give features appropriate for a word recognizer. 2. An isolated word recognizer that decides which of a set of words was spoken. 3. A voice response system to provide spoken commands to the user to guide the use of the repertory dialer system. 4. A dialer (simulated) to outpulse the desired telephone number. The repertory dialer system is implemented on a minicomputer with a high speed array processor performing the real-time operations. The vocabulary for the system consists of 7 command words, 10 digits, and any number of names up to some specified maximum. Recognition is performed on one or more subsets of the vocabulary, depending on the state of the system. To train the system the user is requested to speak each of the vocabulary words twice to provide reference templates for the system. Following training, the system can dial the telephone number corresponding to any name in the repertory, or it can dial a 4 digit telephone extension spoken as an isolated string of digits. The system was tested extensively by 6 talkers (3 male, 3 female - 3 of whom were naive and 3 experienced users) over a three week period. A total of 4620 words were spoken and during the course of the test there were no recognition errors. A request for a repeat of a spoken word occurred about 2% of the time. These tests demonstrate the reliability and robustness of this voice repertory dialer system.
••
01 Apr 1980
TL;DR: This paper presents the results of the investigation of the various aspects of baseband LPC coders with the goal of maximizing the speech quality at a transmission bit-rate of 9.6 kb/s and for channel bit-error rates of up to 1%.
Abstract: This paper presents the results of our investigation of the various aspects of baseband LPC coders with the goal of maximizing the speech quality at a transmission bit-rate of 9.6 kb/s and for channel bit-error rates of up to 1%. Important among these aspects are: baseband width, coding of baseband, high-frequency regeneration, and error protection of important transmission parameters. The paper discusses these and other issues, presents the results of speech-quality tests conducted during the various stages of optimization, and describes the details of the optimized speech coder.
••
TL;DR: Advances in analogue-to-digital speech conversion providing acceptable quality at bit rates below 9.6 k bit/s should eventually enable a simple digital output coder to be as effective as a cryptovocoder for audio band high security applications.
Abstract: This paper surveys some past and present principles of speech coding. The advent of digital logic systems and integrated circuit components is seen as a major turning point in the implementation of effective coding principles that could not previously be used satisfactorily.Various coding methods are discussed, covering frequency domain, time domain and analogue and digital output systems. The problems of key generator integrity, synchronization, received voice quality and recognizability, and channel bandwidth restrictions, and the security offered by different systems are considered.The author concludes that in the immediate future high security speech encipherment systems will be of the digital output crypto-vocoder type, with audio bandwidth analogue output coders offering an increasingly secure alternative for the majority of applications. Digital output coders will continue to be used effectively in those applications where the communication equipment and transmission path have the necessary characteristics to carry low-error-rate digital signals. Advances in analogue-to-digital speech conversion providing acceptable quality at bit rates below 9.6 k bit/s should eventually enable a simple digital output coder to be as effective as a cryptovocoder for audio band high security applications.
••
TL;DR: A high quality speech synthesizer system which consists of 3 LSI chips - a speech synthesizer, a 128k bit ROM and a general purpose microprocessor - has been developed, based on the recently developed PARCOR voice compression technique.
Abstract: A high quality speech synthesizer system which consists of 3 LSI chips - a speech synthesizer, a 128k bit ROM and a general purpose microprocessor - has been developed. This system is based on the recently developed Partial Autocorrelation (PARCOR) voice compression technique. The system can generate high quality speech from a data rate of less than 2400 bits per second. Several new techniques are applied in this system to improve the quality of the generated speech, especially of the female voice. The system has many advantageous features, such as speech speed control and external pitch excitation.
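The PARCOR synthesis filter itself is a standard all-pole lattice driven by reflection coefficients, and can be sketched as follows (one common sign convention is assumed; frame interpolation, gain handling, and the chips' additional refinements are omitted).

```python
import numpy as np

def parcor_synthesize(excitation, k):
    """All-pole lattice synthesis from PARCOR (reflection)
    coefficients k[0..p-1] applied to an excitation signal."""
    p = len(k)
    b = np.zeros(p + 1)          # delayed backward prediction errors
    out = []
    for u in excitation:
        f = u                    # forward error at the top of the lattice
        for m in range(p - 1, -1, -1):
            f = f + k[m] * b[m]
            b[m + 1] = b[m] - k[m] * f
        b[0] = f                 # order-0 backward error = output sample
        out.append(f)
    return np.array(out)
```

The lattice form is what makes PARCOR attractive for LSI: stability is guaranteed whenever every |k[m]| < 1, so the coefficients can be quantized coarsely without risking an unstable synthesis filter.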
••
01 Apr 1980
TL;DR: This detector is an approximation to a maximum-a-posteriori sequence estimator and employs the Viterbi algorithm; its performance with real speech is dealt with.
Abstract: A composite-Gaussian source model for speech was suggested at ICASSP-79. Based upon this model a voiced/unvoiced detector is derived. This detector is an approximation to a maximum-a-posteriori sequence estimator and employs the Viterbi algorithm. This paper deals with the performance of this detector with real speech.
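The decoding step can be sketched as follows, assuming per-frame log-likelihoods are already available from some front-end (the composite-Gaussian likelihood model itself is not given in the abstract) together with illustrative sticky transition probabilities.

```python
import numpy as np

def viterbi_vuv(log_lik, log_trans, log_prior):
    """Maximum-a-posteriori sequence estimate of a two-state
    voiced/unvoiced chain via the Viterbi algorithm.
    log_lik: (T, 2) per-frame state log-likelihoods
    log_trans: (2, 2) log transition probabilities, log_trans[i, j] = i -> j
    log_prior: (2,) initial state log-probabilities"""
    T = len(log_lik)
    score = log_prior + log_lik[0]
    back = np.zeros((T, 2), dtype=int)
    for t in range(1, T):
        cand = score[:, None] + log_trans   # cand[i, j]: best into i, then i -> j
        back[t] = np.argmax(cand, axis=0)
        score = np.max(cand, axis=0) + log_lik[t]
    state = int(np.argmax(score))
    path = [state]
    for t in range(T - 1, 0, -1):
        state = int(back[t, state])
        path.append(state)
    return path[::-1]
```

Sticky transitions make the sequence estimate smooth out isolated frames whose likelihood weakly contradicts their neighbors, which is the practical advantage of sequence decoding over frame-by-frame decisions.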
••
01 Apr 1980
TL;DR: In this machine, a new method for connected word recognition, namely inverse dynamic programming (DP) matching, is adopted, and a recognition rate of 99.3% is obtained.
Abstract: Construction and performance of a machine for recognizing spoken connected words are described. In this machine, a new method for connected word recognition, namely inverse dynamic programming (DP) matching, is adopted. Two kinds of DP matching techniques are used in the inverse DP matching: one is the usual DP matching, and the other is matching performed in a time-reversed mode, starting from the end of speech. Combining the similarities obtained by these two kinds of matching, the similarities between input speech and word sequences are computed. A technique for rejecting candidates is also used in the machine to reduce the amount of computation. The machine performance is tested on 1400 samples of connected digits. A recognition rate of 99.3% is obtained.
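The two matching passes can be sketched as follows. This is an illustration only: features are 1-D scalars, the free-start endpoint handling is an assumption, and the abstract does not give the machine's exact rule for combining the forward and time-reversed similarities.

```python
import numpy as np

def partial_dtw(utterance, template):
    """DP matching of a whole word template against a connected
    utterance with a free start point: returns end_cost, where
    end_cost[t] is the best alignment cost of the template to some
    segment of the utterance ending at frame t."""
    n, m = len(template), len(utterance)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, :] = 0.0                        # the match may start at any frame
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(template[i - 1] - utterance[j - 1])
            D[i, j] = d + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, 1:]

def forward_backward_costs(utterance, template):
    """Forward pass plus a pass run in time-reversed mode from the end
    of speech, giving per-frame word-end and word-start costs."""
    end_cost = partial_dtw(utterance, template)
    start_cost = partial_dtw(utterance[::-1], template[::-1])[::-1]
    return end_cost, start_cost
```

Low values of `end_cost[t]` and `start_cost[t]` mark plausible word boundaries from either direction; combining the two passes is what lets the machine segment connected digits without explicit endpoint detection.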
••
IBM
TL;DR: Current efforts to recognize continuous (or “connected”) speech are aimed at constructing a voice-excited “typewriter” that automatically transcribes natural speech into ordinary (e.g. English) written form.
Abstract: Current efforts to recognize continuous (or “connected”) speech are aimed at constructing a voice-excited “typewriter” that automatically transcribes natural speech into ordinary (e.g. English) written form. So far, however, only very restricted speech has been recognized. The sentences that are spoken must either be prescribed a priori by an artificial grammar which the experimenter has designed, or else limited by a vocabulary and a restricted area of discourse such as that used in business letters, book reviews, or airline reservation systems. These latter so-called natural tasks are generally much more difficult than the artificial ones (given a fixed vocabulary).
••
TL;DR: Results of the analysis of speech signals and sinusoids using the periodogram algorithm are presented and comparisons are made with the average magnitude difference function (a.m.d.f.) which is an alternative method of estimating the pitch period of the voiced speech.
Abstract: In speech processing an estimation of the speech pitch period is important. Real time pitch detection is only possible by the selection of an efficient algorithm suitable for implementation on a programmable processor or in special-purpose hardware. The use of the periodogram algorithm (p.a.) is proposed to detect the pitch period of voiced speech. This algorithm is attractive for the following reasons: (a) it has no multiply operation; (b) when implemented on a 16-bit computer (e.g. microprocessor) the computation can be done in integer arithmetic without exceeding the microprocessor's dynamic range; (c) it is a simple technique for estimating the pitch period with reasonable accuracy. Results of the analysis of speech signals and sinusoids using the periodogram algorithm are presented and comparisons are made with the average magnitude difference function (a.m.d.f.) which is an alternative method of estimating the pitch period of the voiced speech.
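For comparison, the a.m.d.f. baseline mentioned above is easy to sketch; like the periodogram algorithm, its inner loop needs no multiplications, only subtractions and absolute values. The 60-400 Hz search range is an illustrative assumption.

```python
import numpy as np

def amdf_pitch(frame, fs, fmin=60.0, fmax=400.0):
    """Pitch estimate via the average magnitude difference function:
    the a.m.d.f. dips toward zero at lags equal to the pitch period."""
    x = np.asarray(frame, dtype=float)
    lag_min, lag_max = int(fs / fmax), int(fs / fmin)
    amdf = [np.mean(np.abs(x[lag:] - x[:-lag]))
            for lag in range(lag_min, lag_max + 1)]
    best_lag = lag_min + int(np.argmin(amdf))
    return fs / best_lag
```

On a 16-bit processor the running sum of absolute differences stays within integer range for typical frame lengths, which is the same fixed-point argument the paper makes for the periodogram algorithm.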
••
TL;DR: The DP-100 design overcomes two serious handicaps which cause inaccuracies in automatic speech recognition systems, namely the variation in the rate at which words are spoken and the general problem of continuous speech recognition.
Abstract: Considers the Nippon Electric Co.'s DP-100 automatic continuous speech recognition system having an identification capability of approximately 100 words and aimed at application such as routing and inventory control in warehouses. The DP-100 design overcomes two serious handicaps which cause inaccuracies in automatic speech recognition systems, namely the variation in the rate at which words are spoken and the general problem of continuous speech recognition. The author gives details of the design and how these problems are overcome.
••
TL;DR: In this system both coded and uncoded signals are stored with an associated control signal which controls selection of corresponding direct or decoder circuit coupling of memory output to a speech synthesizer.
Abstract: Speech parameter signals may be coded to consist of fewer bits for data reduction in memory. In this system both coded and uncoded signals are stored with an associated control signal which controls selection of corresponding direct or decoder circuit coupling of memory output to a speech synthesizer.
••
01 Apr 1980
TL;DR: The realization of a speech analyzer plus an LPC synthesizer in a single chip signal processing microprocessor that is able to process both algorithms in real time to create an interactive voice analyzer/response system operating under the control of a microprocessor and with the LPC speech data stored in a ROM.
Abstract: The realization of a speech analyzer plus an LPC synthesizer in a single chip signal processing microprocessor is described. The chip is able to process both algorithms in real time to create an interactive voice analyzer/response system operating under the control of a microprocessor and with the LPC speech data stored in a ROM. The chip is a 16 bit microprocessor with an architecture specialized for signal processing. It features all single-cycle instructions with a 300 ns cycle time, and a 12 × 12 bit parallel multiplier pipelined to operate in a single cycle. It can be programmed to perform a wide variety of signal processing functions including speech processing.
••
TL;DR: The system delay involved in time-encoded speech can be made tolerable by a modest increase in transmission rate over average source generation rate.
Abstract: The system delay involved in time-encoded speech can be made tolerable by a modest increase in transmission rate over average source generation rate.
•
01 Jan 1980
TL;DR: This thesis addresses the problem of further reducing the speech bit rate by manipulation of the LPC parameters without lowering the quality of the resynthesized speech signal and shows that the dynamic model is very effective for speech compression in voice response applications.
Abstract: In analysis-synthesis systems like Linear Predictive Coding (LPC), the speech signal is modeled as the output of a time-varying digital linear filter representing the vocal tract, where the filter input is random noise or a sequence of impulses. In LPC, this filter is described by a set of parameters, called PARCOR coefficients, which are then coded to reduce (compress) the bit rate necessary to code the speech signal. An equivalent set of parameters are the area functions, through which the vocal tract is modeled as a concatenation of cylindrical tubes.
This thesis addresses the problem of further reducing the speech bit rate by manipulation of the LPC parameters without lowering the quality of the resynthesized speech signal. In this study, the vocal tract is viewed both statically and dynamically. In the static description, certain sections of the vocal tract model are unified so that the vocal tract is represented by fewer tubes. The selection of these sections is based on the minimization of an objective speech distortion measure. The new reduced model is tested for both its effectiveness in modeling the vocal tract and its speech compression capabilities.
The dynamic description is connected with the time varying nature of the vocal tract filter. The speech signal is divided into quasi-stationary frames where the filter parameters remain constant within each frame. For certain phonemes (e.g. vowels), the filter parameters often will not change greatly from one frame to the next and when successive frames are encoded, some or all of the parameters can retain their previous values. Hence, it is possible to transmit some, all or none of the new LPC parameters. This choice is not made immediately; instead, a number of consecutive frames is considered and all the possible decision sequences are examined. The final choice of the optimum decision sequence is based on the minimization of a cost function which penalizes both for the number of bits transmitted per frame and for the distortion introduced in the speech signal. The examination of the possible decision sequences is equivalent to a decision tree search which is most efficiently done through dynamic programming.
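The decision-sequence search in the paragraph above can be sketched as a dynamic program whose state is the index of the last transmitted frame. Scalar "parameters", the distortion measure |p_t − p_j|, and the weight `lam` are illustrative assumptions standing in for the thesis's vector parameters and objective distortion measure.

```python
def plan_updates(params, bits_per_update, lam):
    """Choose, per frame, whether to retransmit the LPC parameters
    (bit cost) or repeat the last transmitted set (distortion cost),
    minimizing total cost = bits + lam * distortion by DP."""
    n = len(params)
    INF = float("inf")
    costs = [{0: float(bits_per_update)}]   # frame 0 is always transmitted
    parents = [{0: 0}]
    for t in range(1, n):
        cur, par = {}, {}
        for j, c in costs[t - 1].items():
            hold = c + lam * abs(params[t] - params[j])   # repeat frame j
            if hold < cur.get(j, INF):
                cur[j], par[j] = hold, j
            send = c + bits_per_update                    # transmit frame t
            if send < cur.get(t, INF):
                cur[t], par[t] = send, j
        costs.append(cur)
        parents.append(par)
    j = min(costs[-1], key=costs[-1].get)
    best_cost = costs[-1][j]
    transmit = [False] * n
    for t in range(n - 1, 0, -1):
        transmit[t] = (j == t)
        j = parents[t][j]
    transmit[0] = True
    return transmit, best_cost
```

Because the per-state costs are merged at every frame, this DP examines the same decision tree as an exhaustive search but in polynomial rather than exponential time, which is the efficiency argument made above.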
The quality of the resynthesized speech was judged by listeners, who ranked the different systems in a user preference test. The reduced static model was judged to be a very effective representation of the vocal tract, but its speech compression capabilities were limited because of the necessary additional overhead information.
On the other hand, the dynamic model achieved a high compression of the data rate: 1200 bits per second speech at the output of the present system was judged to be of equal quality to 2400 bits per second speech generated by only quantizing the LPC parameters. These results show that the dynamic model is very effective for speech compression in voice response applications. It is less suitable for real-time applications because of the long analysis delay required.