Showing papers on "Voice activity detection published in 1987"


Book
01 Jan 1987
TL;DR: The toe or heel holder of a safety binding is pivotally mounted on a stub shaft, and held in its angular operating position by a spring-loaded, spherical detent guided in a bore of the holder radially relative to the shaft axis toward one of four equiangularly offset notches in the shaft which differ in their depth.
Abstract: The toe or heel holder of a safety binding is pivotally mounted on a stub shaft, and held in its angular operating position by a spring-loaded, spherical detent guided in a bore of the holder radially relative to the shaft axis toward one of four equiangularly offset notches in the shaft which differ in their depth. The shaft is attached to the top surface of the ski by a square mounting plate and four screws at the corners of the plate so that the detent engages different notches in the shaft, and therefore resists deflection of the holder from its operating position with different force depending on the orientation of the mounting plate on the ski surface.

627 citations


Journal ArticleDOI
TL;DR: An efficient computer program is developed that will serve as a tool for investigating whether articulatory speech synthesis may achieve this low bit rate.
Abstract: High quality speech at low bit rates (e.g., 2400 bits/s) is one of the important objectives of current speech research. As part of long range activity on this problem, we have developed an efficient computer program that will serve as a tool for investigating whether articulatory speech synthesis may achieve this low bit rate. At a sampling frequency of 8 kHz, the most comprehensive version of the program, including nasality and frication, runs at about twice real time on a Cray-1 computer.

243 citations


Proceedings ArticleDOI
06 Apr 1987
TL;DR: Three different approaches for automatically segmenting speech into phonetic units are described: one based on template matching, one based on detecting the spectral changes that occur at the boundaries between phonetic units, and one based on a constrained-clustering vector quantization approach.
Abstract: For large vocabulary and continuous speech recognition, the sub-word-unit-based approach is a viable alternative to the whole-word-unit-based approach. For preparing a large inventory of subword units, automatic segmentation is preferable to manual segmentation, as it substantially reduces the work associated with the generation of templates and gives more consistent results. In this paper we discuss some methods for automatically segmenting speech into phonetic units. Three different approaches are described: one based on template matching, one based on detecting the spectral changes that occur at the boundaries between phonetic units, and one based on a constrained-clustering vector quantization approach. An evaluation of the performance of the automatic segmentation methods is given.
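As a rough illustration of the spectral-change approach, the Python sketch below marks candidate boundaries where the short-time log spectrum jumps between adjacent frames; the frame length, windowing, and peak threshold are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def spectral_change_boundaries(signal, fs, frame_ms=10, k_sigma=2.0):
    """Mark candidate phonetic boundaries where the short-time
    log-magnitude spectrum changes sharply between adjacent frames."""
    frame_len = int(fs * frame_ms / 1000)
    n_frames = len(signal) // frame_len
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
    window = np.hamming(frame_len)
    spectra = np.log(np.abs(np.fft.rfft(frames * window, axis=1)) + 1e-8)
    # Euclidean spectral distance between consecutive frames
    change = np.linalg.norm(np.diff(spectra, axis=0), axis=1)
    # A boundary is declared where the change exceeds mean + k_sigma * std
    peaks = np.where(change > change.mean() + k_sigma * change.std())[0]
    return peaks * frame_len / fs  # candidate boundary times in seconds
```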

156 citations


Book
01 Jan 1987

139 citations



Journal ArticleDOI
TL;DR: The stability and performance of pitch filters in speech coding when pitch prediction is combined with formant prediction are analyzed, and it is observed that the quality of decoded speech improves significantly when stable synthesis filters are employed.
Abstract: This paper analyzes the stability and performance of pitch filters in speech coding when pitch prediction is combined with formant prediction. A computationally simple stability test based on a sufficient condition is formulated for pitch synthesis filters. For typical orders of pitch filters, this sufficient test is very tight. Based on the test, a simple stabilization technique that minimizes the loss in prediction gain of the pitch predictor is employed to generate stable synthesis filters. Finally, it is observed that the quality of decoded speech improves significantly when stable synthesis filters are employed.
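For a pitch synthesis filter of the form 1/(1 - sum_k b_k z^-(M+k)), one simple sufficient stability condition bounds the absolute sum of the predictor taps. The sketch below illustrates that style of test together with a matching scale-down stabilization; it is a generic stand-in under that assumption, not a reproduction of the paper's exact test.

```python
def pitch_filter_is_stable(b):
    """Sufficient (not necessary) test for 1/(1 - sum_k b_k z^-(M+k)):
    if sum(|b_k|) < 1, the denominator cannot vanish for |z| >= 1,
    so all poles lie strictly inside the unit circle."""
    return sum(abs(bk) for bk in b) < 1.0

def stabilize(b, slack=0.999):
    """Scale the taps down just enough to satisfy the sufficient test,
    keeping the change to the predictor (and its gain) small."""
    s = sum(abs(bk) for bk in b)
    return [bk * slack / s for bk in b] if s >= 1.0 else list(b)
```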

84 citations


PatentDOI
TL;DR: In this article, an apparatus is described that identifies the speech signal of an unknown speaker as one of a finite number of speakers; each speaker is modeled and can be recognized from any example of their speech, and the output is a list of scores that measure how similar the input speaker is to each of the speakers whose models are stored in the system.
Abstract: An apparatus operates to identify the speech signal of an unknown speaker as one of a finite number of speakers. Each speaker is modeled and recognized with any example of their speech. The input to the system is analog speech and the output is a list of scores that measure how similar the input speaker is to each of the speakers whose models are stored in the system. The system includes front end processing means which is responsive to the speech signal to provide digitized samples of the speech signal at an output which are stored in a memory. The stored digitized samples are then retrieved and divided into frames. The frames are processed to provide a series of speech parameters indicative of the nature of the speech content in each of the frames. The processor for producing the speech parameters is coupled to either a speaker modeling means, whereby a model for each speaker is provided and consequently stored, or a speaker recognition mode, whereby the speech parameters are again processed with current parameters and compared with the stored parameters during each speech frame. The comparison is accomplished over a predetermined number of frames whereby a favorable comparison is indicative of a known speaker for which a model is stored.

65 citations


Patent
06 Apr 1987
TL;DR: In this paper, a speech recognition method is implemented with an image pickup apparatus such as a TV camera which picks up a lip image during speech and with a small computer which has a small memory capacity and is connected to the TV camera.
Abstract: A speech recognition method is implemented with an image pickup apparatus such as a TV camera which picks up a lip image during speech and with a small computer which has a small memory capacity and is connected to the TV camera. The computer receives and processes as lip data an image signal from the TV camera which represents the lip image. The lip data is collated with language data stored in the memory of the computer so as to select the language corresponding to the lip data, thereby recognizing the speech. A microphone may also be provided to output to the system a voice waveform signal serving as voice data. This voice data is collated with the language data stored in the memory of the computer to select the language corresponding to the voice data, thereby recognizing the speech on the basis of the language selected using the lip data and using the voice data. Image pattern data and voice pattern data may be extracted and processed for every word, or for every unit sound. With the inventive method, the speech recognition ratio and processing speed are improved, particularly with respect to use of a computer with a small memory capacity.

51 citations


PatentDOI
TL;DR: A speech encoder is disclosed that quantizes speech information with respect to energy, voicing and pitch parameters to provide a fixed number of bits per block of frames, irrespective of phonemic boundaries.
Abstract: A speech encoder is disclosed that quantizes speech information with respect to energy, voicing and pitch parameters to provide a fixed number of bits per block of frames. Coding of the parameters takes place for each N frames, which comprise a block, irrespective of phonemic boundaries. Certain frames of speech information are discarded during transmission if such information is substantially duplicated in an adjacent frame. A very low data rate transmission system is thus provided which exhibits a high degree of fidelity and throughput.

49 citations


PatentDOI
TL;DR: In this article, the quality of speech in a voice communication system is evaluated using a Mahalanobis D2 matrix, yielding D2 data which represents an estimation of the quality in the sample file.
Abstract: A method of evaluating the quality of speech in a voice communication system is used in a speech processor. A digital file of undistorted speech representative of a speech standard for a voice communication system is recorded. A sample file of possibly distorted speech carried by said voice communication system is also recorded. The file of standard speech and the file of possibly distorted speech are passed through a set of critical band filters to provide power spectra which include distorted-standard speech pairs. A variance-covariance matrix is calculated from said pairs, and a Mahalanobis D2 calculation is performed on said matrix, yielding D2 data which represents an estimation of the quality of speech in the sample file.
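A hedged sketch of that pipeline follows: band powers for the standard and possibly distorted recordings, a variance-covariance matrix over the pooled distorted-standard pairs, and a Mahalanobis D2 between the mean band-power vectors. The band edges and array shapes are illustrative assumptions, not the patent's critical-band filters.

```python
import numpy as np

def band_powers(power_spectra, band_edges):
    """Sum per-frame FFT power into critical-band-like channels.
    band_edges: list of (lo, hi) bin-index pairs (assumed values)."""
    return np.stack([power_spectra[:, lo:hi].sum(axis=1)
                     for lo, hi in band_edges], axis=1)

def mahalanobis_d2(standard, distorted):
    """D2 between mean band-power vectors, using a variance-covariance
    matrix computed over the pooled distorted-standard pairs."""
    cov = np.cov(np.vstack([standard, distorted]).T)
    diff = standard.mean(axis=0) - distorted.mean(axis=0)
    return float(diff @ np.linalg.solve(cov, diff))
```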

46 citations


Proceedings ArticleDOI
J. Lynch, J. Josenhans, R. Crochiere
06 Apr 1987
TL;DR: A new algorithmic technique is presented for efficiently implementing the end-point decisions necessary to separate and segment speech from noisy background environments and for silence compression of speech in which speech segments are encoded with a low bit-rate encoding scheme and silence information is characterized by a set of parameters.
Abstract: A new algorithmic technique is presented for efficiently implementing the end-point decisions necessary to separate and segment speech from noisy background environments. The algorithm utilizes a set of computationally efficient production rules that are used to generate speech and noise metrics continuously from the input speech waveform. These production rules are based on statistical assumptions about the characteristics of the speech and noise waveform and are generated via time-domain processing to achieve a zero delay decision. An end-pointer compares the speech and silence metrics using an adaptive thresholding scheme with a hysteresis characteristic to control the switching speed of the speech/silence decision. The paper further describes the application of this algorithm to silence compression of speech in which speech segments are encoded with a low bit-rate encoding scheme and silence information is characterized by a set of parameters. In the receiver the resulting packetized speech is reconstructed by decoding the speech segments and reconstructing the silence intervals through a noise substitution process in which the amplitude and duration of background noise is defined by the silence parameters. A noise generation technique is described which utilizes an 18th order polynomial to generate a spectrally flat pseudo-random sequence that is filtered to match the mean coloration of acoustical background noise. A technique is further described in which the speech/silence transitions are merged rather than switched to achieve maximum subjective performance of the compression technique. The above silence compression algorithm has been implemented in a single DSP-20 signal processing chip using sub-band coding for speech encoding. Using this system, experiments were conducted to evaluate the performance of the technique and to verify the robustness of the endpoint and silence compression over a wide range of background noise conditions.
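The following Python sketch captures only the skeleton of such an endpointer: a running energy metric, a noise estimate adapted during silence, and hysteresis between a high entry threshold and a lower exit threshold. All constants are assumptions; the production rules, metrics, and transition merging of the actual algorithm are not reproduced.

```python
import numpy as np

def endpoint(signal, fs, frame_ms=10, enter=3.0, stay=1.5):
    """Per-frame speech/silence labels via an adaptive energy threshold
    with hysteresis (enter > stay). Assumes the signal starts in silence
    so the first frames can seed the noise estimate."""
    frame_len = int(fs * frame_ms / 1000)
    n = len(signal) // frame_len
    energy = np.array([np.mean(signal[i*frame_len:(i+1)*frame_len] ** 2)
                       for i in range(n)])
    noise = energy[:5].mean()  # crude initial noise metric
    in_speech, labels = False, []
    for e in energy:
        if not in_speech:
            noise = 0.95 * noise + 0.05 * e  # track noise during silence
        ratio = enter if not in_speech else stay  # hysteresis
        in_speech = e > ratio * noise
        labels.append(in_speech)
    return labels
```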

Journal ArticleDOI
TL;DR: An endpoint detection algorithm is presented which is based on hidden Markov model (HMM) technology and explicitly determines a set of speech endpoints based on the output of a Viterbi decoding algorithm.

Proceedings ArticleDOI
01 Apr 1987
TL;DR: The problem addressed by this study is the suppression of an undesired talker when two talkers are communicating simultaneously on the same monophonic channel (co-channel speech).
Abstract: The problem addressed by this study is the suppression of an undesired talker when two talkers are communicating simultaneously on the same monophonic channel (co-channel speech). Two different applications are considered: improved intelligibility for human listeners, and improved performance for automatic speech and speaker recognition (ASR) systems. For the human intelligibility problem, the desired talker is the weaker of the two signals, with voice-to-voice power ratios (desired power / interference power), or VVRs, as low as -18 dB. For ASR applications, the desired talker is the stronger of the two signals, with VVRs as low as 5 dB. Signal analysis algorithms have been developed which attempt to separate the co-channel spectrum into components due to the two different (stronger and weaker) talkers.


PatentDOI
TL;DR: In this paper, a process is disclosed for digitizing speech to reduce information rate and bandwidth relative to prior art means of digitizing speech, while enjoying a high signal-to-noise ratio.
Abstract: A process is disclosed for digitizing (more precisely, encoding) speech for the purpose of reducing information rate and bandwidth relative to that of prior art means of digitizing speech, while enjoying a high signal-to-noise ratio. Using the same encoding techniques, the process can be used for storage of speech and machine recognition of speech. The process depends on detecting audio waveform zero crossings and generating uniform pulses at the time of the zero crossings. The uniform pulses are created through a regenerative process and are independent of the actual waveform shape save the time of zero crossing. Transmission (or storage) of these uniform pulses permits reconstruction of highly intelligible speech. Ratios of times between zero crossings are used in a new technique for machine word recognition, said ratios allowing recognition regardless of the speaker's actual speech rate.
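A minimal sketch of the two core ideas, assuming a sampled waveform: emit a uniform pulse at each zero crossing, and describe an utterance by the ratios of successive inter-crossing intervals, which are unchanged under a uniform change of speaking rate.

```python
import numpy as np

def zero_crossing_times(signal, fs):
    """Times of audio waveform zero crossings; each would trigger one
    uniform, regenerated pulse in the scheme described above."""
    sign = np.signbit(signal).astype(np.int8)
    idx = np.where(np.diff(sign) != 0)[0]
    return idx / fs

def interval_ratios(crossing_times, eps=1e-9):
    """Rate-independent features: ratios of successive inter-crossing
    intervals, invariant to a uniform scaling of speech rate."""
    intervals = np.diff(crossing_times)
    return intervals[1:] / np.maximum(intervals[:-1], eps)
```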

Proceedings ArticleDOI
01 Apr 1987
TL;DR: A new technique is described for coding the sine-wave amplitudes based on the idea of a pitch-adaptive channel vocoder and operating at a total bit rate of 4.8 kbps, it was possible to code and transmit enough phase information so that very intelligible, natural sounding speech could be synthesized.
Abstract: It has been shown [1] that an analysis/synthesis system based on a sinusoidal representation leads to synthetic speech that is essentially indistinguishable from the original. By exploiting the peak-to-peak correlation of the sine-wave amplitudes [2], a harmonic model for the sine-wave frequencies, and a predictive model for the sine-wave phases [3], it has also been shown that the sine-wave parameters can be coded at 8 kbps. In this paper a new technique is described for coding the sine-wave amplitudes based on the idea of a pitch-adaptive channel vocoder. Using this amplitude-coding strategy and operating at a total bit rate of 4.8 kbps, it was possible to code and transmit enough phase information so that very intelligible, natural sounding speech could be synthesized. This 4.8 kbps system has been implemented in real-time and has achieved a Diagnostic Rhyme Test (DRT) score of 95. At 2.4 kbps no explicit phase information could be coded, but by phase-locking all of the sine waves to the fundamental, by adding a pitch-adaptive quadratic phase, and by adding a voicing dependent random phase to each sine wave, natural sounding synthetic speech could be obtained. This new system is currently being implemented in real-time so that intelligibility tests can be performed.
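As a toy illustration of the phase-locked synthesis used at 2.4 kbps, the sketch below sums harmonics of the fundamental with decoded amplitudes; the pitch-adaptive quadratic phase, the voicing-dependent random phase, and the pitch-adaptive amplitude coder itself are all omitted.

```python
import numpy as np

def synth_frame(amps, f0, fs, n_samples):
    """Sum of harmonics phase-locked to the fundamental; amps and f0
    stand in for decoded sine-wave parameters."""
    t = np.arange(n_samples) / fs
    frame = np.zeros(n_samples)
    for k, a in enumerate(amps, start=1):
        frame += a * np.cos(2 * np.pi * k * f0 * t)  # k-th harmonic
    return frame
```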

Proceedings ArticleDOI
01 Apr 1987
TL;DR: This work investigates the performance of a recent algorithm for linear predictive (LP) modeling of speech signals, which have been degraded by uncorrelated additive noise, as a front-end processor in a speech recognition system.
Abstract: We investigate the performance of a recent algorithm for linear predictive (LP) modeling of speech signals, which have been degraded by uncorrelated additive noise, as a front-end processor in a speech recognition system. The system is speaker dependent, and recognizes isolated words, based on dynamic time warping principles. The LP model for the clean speech is estimated through appropriate composite modeling of the noisy speech. This is done by minimizing the Itakura-Saito distortion measure between the sample spectrum of the noisy speech and the power spectral density of the composite model. This approach results in a "filtering-modeling" scheme in which the filter for the noisy speech, and the LP model for the clean speech, are alternatively optimized. The proposed system was tested using the 26 word English alphabet, the ten English digits, and the three command words, "stop," "error," and "repeat," which were contaminated by additive white noise at 5-20 dB signal to noise ratios (SNR's). By replacing the standard LP analysis with the proposed algorithm, during training on the clean speech and testing on the noisy speech, we achieve an improvement in recognition accuracy equivalent to an increase in input SNR of approximately 10 dB.
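The distortion being minimized is the Itakura-Saito measure between the noisy-speech sample spectrum and the composite-model spectrum; a discrete-frequency version is easy to state. This is only the measure itself, as a sketch; the alternating filtering-modeling optimization loop is not shown.

```python
import numpy as np

def itakura_saito(P, P_model, eps=1e-12):
    """Mean Itakura-Saito distortion between a sample power spectrum P
    and a model power spectrum P_model on a discrete frequency grid:
    mean(P/P_model - log(P/P_model) - 1), non-negative and zero only
    when the two spectra coincide."""
    r = (P + eps) / (P_model + eps)
    return float(np.mean(r - np.log(r) - 1.0))
```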

Proceedings ArticleDOI
01 Apr 1987
TL;DR: This paper presents an approach to applying the analysis-by-synthesis technique to sinusoidal speech modelling in an attempt to increase the ability of the model to accurately represent the speech waveform.
Abstract: In recent years the concept of analysis-by-synthesis has been applied very successfully to improving the performance of LPC-based models. At the same time, new speech models have been introduced based on representing speech by a sum of amplitude- and frequency-modulated sinusoids, which have been shown to successfully represent the non-linear, time-varying and quasi-periodic nature of speech. In this paper we present an approach to applying the analysis-by-synthesis technique to sinusoidal speech modelling in an attempt to increase the ability of the model to accurately represent the speech waveform.

Journal ArticleDOI
TL;DR: The design of a custom MOS-LSI chip capable of performing the pattern matching portion of a 1000-word speech recognition algorithm in real time is reported, and the resulting special-purpose architecture is sufficiently general that connected speech can be recognized without a speed penalty.
Abstract: The design of a custom MOS-LSI chip capable of performing the pattern matching portion of a 1000-word speech recognition algorithm in real time is reported. The chip implements a dynamic-time-warp algorithm. The chip is part of a single-board speech recognition system that performs spectral analysis, dictionary storage and management, and speech recognition for both isolated and connected word applications of up to 1000 words. Speech recognition algorithms are normally refined to work well on general-purpose machines without the influence of future special-purpose hardware implementation. With general-purpose machines, chip implementation issues such as bit widths and parallelism cannot be exploited, so they are ignored in favor of increasing algorithmic complexity by techniques such as pruning. If developed together, the chip architecture and algorithm can be refined to fully use parallelism and increase throughput, while retaining efficient silicon area utilization. The resulting special-purpose architecture is sufficiently general that connected speech can be recognized without a speed penalty.
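A compact software rendering of the dynamic-time-warp matching the chip performs in hardware, using Euclidean local distances between feature frames; the chip's actual path constraints, pruning, and word-level bookkeeping are not modeled.

```python
import numpy as np

def dtw_distance(test, template):
    """Minimum cumulative frame-distance between two sequences of
    feature vectors, with the classic diagonal/up/left step pattern."""
    n, m = len(test), len(template)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(test[i - 1] - template[j - 1])
            D[i, j] = d + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]
```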

Proceedings ArticleDOI
01 Apr 1987
TL;DR: This paper presents a study of talker- stress-induced intraword variability, and an algorithm that compensates for the systematic changes observed, based on Hidden Markov Models trained by speech tokens in various talking styles.
Abstract: Automatic speech recognition algorithms generally rely on the assumption that, for the distance measure used, intraword variabilities are smaller than interword variabilities so that appropriate separation in the measurement space is possible. As evidenced by degradation of recognition performance, the validity of such an assumption decreases from simple tasks to complex tasks, from cooperative talkers to casual talkers, and from laboratory talking environments to practical talking environments. This paper presents a study of talker-stress-induced intraword variability, and an algorithm that compensates for the systematic changes observed. The study is based on Hidden Markov Models trained by speech tokens in various talking styles. The talking styles include normal speech, fast speech, loud speech, soft speech, and talking with noise injected through earphones; the styles are designed to simulate speech produced under real stressful conditions. Cepstral coefficients are used as the parameters in the Hidden Markov Models. The stress compensation algorithm compensates for the variations in the cepstral coefficients in a hypothesis-driven manner. The functional form of the compensation is shown to correspond to the equalization of spectral tilts. Preliminary experiments indicate that a substantial reduction in recognition error rate can be achieved with relatively little increase in computation and storage requirements.
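Since spectral tilt lives mostly in the low-order cepstral coefficients, tilt equalization can be pictured as subtracting a style-dependent offset from those coefficients. The sketch below is only that picture, with hypothetical style/neutral mean vectors; it is not the paper's hypothesis-driven algorithm.

```python
import numpy as np

def tilt_compensate(cepstra, style_mean, neutral_mean, n_low=3):
    """Shift the low-order cepstra of stressed speech toward the neutral
    style; style_mean/neutral_mean are hypothetical per-coefficient
    means estimated from training tokens of each talking style."""
    offset = np.zeros(cepstra.shape[1])
    offset[:n_low] = (style_mean - neutral_mean)[:n_low]
    return cepstra - offset
```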

PatentDOI
TL;DR: A speech coding system includes apparatus for generating a variable threshold dependent upon the power of an input speech signal, and a comparator to generate a discriminating signal for discriminating between a period when speech continues and a period when speech pauses.
Abstract: A speech coding system includes apparatus for generating a variable threshold dependent upon the power of an input speech signal, and a comparator for comparing the power of the input speech signal with the variable threshold value to generate a discriminating signal for discriminating between a period when speech continues and a period when speech pauses, so as to change the coding operation for the input speech signal in accordance with the level of the discriminating signal, thereby forming voiced and unvoiced frames independently of each other.
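A minimal sketch of such a comparator, assuming per-frame input powers: the threshold tracks a slowly rising noise floor, and the boolean output plays the role of the discriminating signal that switches the coder between speech and pause frames. The tracking constants are assumptions.

```python
def discriminate(frame_powers, rise=1.02, margin=4.0):
    """Variable-threshold comparator: the floor falls immediately with
    the input power but rises only slowly, and a frame is marked as
    continuing speech when its power exceeds margin * floor."""
    floor, flags = None, []
    for p in frame_powers:
        floor = p if floor is None else min(p, floor * rise)
        flags.append(p > margin * floor)  # True: speech, False: pause
    return flags
```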

Proceedings ArticleDOI
D. Van Compernolle
01 Jan 1987
TL;DR: This paper presents several ways of making the signal processing in the IBM speech recognition system more robust with respect to variations in the background noise level by reintroducing a semi-natural background by adding noise after applying spectral subtraction.
Abstract: This paper presents several ways of making the signal processing in the IBM speech recognition system more robust with respect to variations in the background noise level. The underlying problem is that the speech recognition system trains on the specific noise circumstances of the training session. A simple solution lies in the controlled addition of noise. The level of noise that has to be added to effectively mask all background noise is rather high and causes a significant reduction in accuracy. Spectral subtraction does a better job in a limited number of cases, but the thresholding in spectral subtraction often leads to training problems in the hidden Markov model based recognition system. The best results were obtained by reintroducing a semi-natural background, adding noise after applying spectral subtraction.
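A sketch of the best-performing variant in magnitude-spectrum terms: subtract the noise estimate with a spectral floor, then add back a controlled amount of noise so the background seen by the recognizer is uniform. The overall shape follows the description above; the flooring and comfort-noise levels are assumptions.

```python
import numpy as np

def subtract_then_refill(noisy_mag, noise_mag, floor=0.05, comfort=0.1):
    """Magnitude-domain spectral subtraction with a spectral floor,
    followed by re-addition of scaled noise so the background level
    seen by the recognizer is uniform across sessions."""
    clean = np.maximum(noisy_mag - noise_mag, floor * noise_mag)
    return clean + comfort * noise_mag
```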

Patent
12 Mar 1987
TL;DR: In this article, a comparison is carried out between the true meaning provided from a transmission memory and the meaning of a speech sample or a test signal recognized by the speech recogniser or speaker recogniser.
Abstract: In the measuring method according to the invention, speech samples and/or test signals, from which a speech recogniser or speaker recogniser has previously formed a reference pattern during a learning phase, are presented to this speech recogniser or speaker recogniser via a speech coder to be assessed or a transmission route to be tested. Using an evaluation computer, a comparison is carried out between the true meaning provided from a transmission memory and the meaning of a speech sample or a test signal recognised by the speech recogniser or speaker recogniser. In this process, an error rate or a measure of the reliability of recognition is simultaneously calculated over a measurement cycle.

Proceedings ArticleDOI
01 Apr 1987
TL;DR: The experiments in Finnish and Polish show that it is feasible to develop simple and inexpensive synthesizers with natural and high-quality human-like characteristics without expensive signal processing hardware.
Abstract: Waveform concatenation has been the method used in many low-quality speech synthesis experiments. The objective of this study was to find new ways to overcome the inherent difficulties in concatenating speech sample waveforms. Our experiments in Finnish and Polish show that it is feasible to develop simple and inexpensive synthesizers with natural and high-quality human-like characteristics. The implementation can be based on standard microprocessors and D/A-converters without expensive signal processing hardware. This paper describes the results of the experiments and conclusions to the design of speech synthesis that we call the microphonemic method.

Journal ArticleDOI
TL;DR: In this paper, a new multiaccess protocol is proposed for an integrated voice/data application, which is a variation of virtual time CSMA (VT-CSMA), taking advantage of the periodicity of voice packets and possesses a number of important features.
Abstract: A new multiaccess protocol is proposed for an integrated voice/data application. The protocol, which is a variation of virtual time CSMA (VT-CSMA), takes advantage of the periodicity of voice packets and possesses a number of important features. With this protocol, voice stations appear to have a dedicated time-division multiplexed (TDM) slot, and the delay of a voice packet is bounded by the length of a frame (defined to be the period between two consecutive voice packets from a voice station). Also, the amount of data added to the channel has little effect on the voice traffic. When silence detection is used, many more voice conversations can be supported without losing the dedicated-slot characteristic. This is in contrast to a moving-boundary TDM system, where the excess bandwidth saved by silence detection can only be used for data. The protocol requires no global synchronization and is easy to implement. Simulation results are presented to evaluate its performance.

Proceedings ArticleDOI
01 Apr 1987
TL;DR: A two-channel, speech and electroglottograph (EGG) approach to speech analysis is suggested to aid the automatic processing of speech.
Abstract: Attempts to measure the synthetic quality of speech usually consider the two factors intelligibility and naturalness, each involving subjective and objective characteristics. To generate high quality synthetic speech, spectral distortion should be avoided, and spectral continuity and formant tracking should be handled well. Glottal-related factors, including proper modeling of 1) the glottal excitation waveforms and 2) the effects of source-tract interaction, are discussed for synthesizers. Accurate detection of voiced/unvoiced/silent segments in the speech waveform and of the fundamental frequency of voicing are also major concerns. We present both formal and informal listener evaluations of three synthesizers: LPC, formant and articulatory. Finally, we suggest a two-channel, speech and electroglottograph (EGG), approach to speech analysis to aid the automatic processing of speech.

Book ChapterDOI
01 May 1987
TL;DR: A Speech Recognition Methodology is proposed which is based on the general assumption of ‘fuzzyness’ of both speech-data and knowledge-sources and on other fundamental assumptions which are also the bases of the proposed methodology.
Abstract: In this paper a Speech Recognition Methodology is proposed which is based on the general assumption of ‘fuzzyness’ of both speech-data and knowledge-sources. Besides this general principle, there are other fundamental assumptions which are also the bases of the proposed methodology: ‘Modularity’ in the knowledge organization, ‘Homogeneity’ in the representation of data and knowledge, ‘Passiveness’ of the ‘understanding flow’ (no backtracking or feedback), and ‘Parallelism’ in the recognition activity.


PatentDOI
Kouichi Shibagaki, Akira Fukui
TL;DR: In this paper, a multi-pulse speech coder uses synthetic filters for generating cross-correlated signals without pitch prediction and autocorrelated signals with pitch prediction, which are used as a basis for calculations to detect the correlations between the signals with pitch prediction and input speech signals.
Abstract: A multi-pulse speech coder uses synthetic filters for generating cross-correlated signals without pitch prediction and autocorrelated signals with pitch prediction. These signals are used as a basis for calculations to detect the correlations between the signals with pitch prediction and input speech signals.

PatentDOI
Yoshiaki Asakawa, Takanori Miyamoto, Kazuhiro Kondo, Akira Ichikawa, Toshiro Suzuki
TL;DR: In this article, a speech coding method and system in which a speech signal is analyzed in each frame so as to be separated into spectral envelope information and excitation information and both of the information are coded, each frame is divided into a plurality of sub-frames and a pulse of the maximum amplitude is extracted from pulses within each sub-frame in order to provide large amplitude pulses from each frame.
Abstract: In a speech coding method and system in which a speech signal is analyzed in each frame so as to be separated into spectral envelope information and excitation information and both of the information are coded, each frame is divided into a plurality of sub-frames and a pulse of the maximum-amplitude is extracted from pulses within each sub-frame in order to provide large-amplitude pulses from each frame, thereby greatly reducing the number of pulse extracting processing steps.
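A small sketch of the pulse-thinning step, assuming a frame of candidate excitation pulses is already available: split the frame into sub-frames and keep only the largest-magnitude pulse in each, which guarantees one large-amplitude pulse per sub-frame.

```python
import numpy as np

def max_pulse_per_subframe(excitation, n_subframes):
    """Keep only the maximum-amplitude pulse in each sub-frame of a
    frame's candidate excitation, zeroing everything else."""
    frame = np.asarray(excitation, dtype=float)
    out = np.zeros_like(frame)
    for chunk in np.array_split(np.arange(len(frame)), n_subframes):
        k = chunk[np.argmax(np.abs(frame[chunk]))]
        out[k] = frame[k]
    return out
```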