
Showing papers on "Voice activity detection published in 1995"


Journal ArticleDOI
TL;DR: The survey indicates that the essential points in noisy speech recognition consist of incorporating time and frequency correlations, giving more importance to high SNR portions of speech in decision making, exploiting task-specific a priori knowledge both of speech and of noise, using class-dependent processing, and including auditory models in speech processing.

712 citations


Patent
07 Nov 1995
TL;DR: In this article, a knowledge-based speech recognition apparatus and methods are provided for translating an input speech signal to text, which employ a largely speaker independent dictionary based upon the application of phonological and phonetic/acoustic rules to generate acoustic event transcriptions against which the series of hypothesized acoustic feature vectors are compared to select word choices.
Abstract: Knowledge based speech recognition apparatus and methods are provided for translating an input speech signal to text. The speech recognition apparatus captures an input speech signal, segments it based on the detection of pitch period, and generates a series of hypothesized acoustic feature vectors for the input speech signal that characterizes the signal in terms of primary acoustic events, detectable vowel sounds and other acoustic features. The apparatus and methods employ a largely speaker-independent dictionary based upon the application of phonological and phonetic/acoustic rules to generate acoustic event transcriptions against which the series of hypothesized acoustic feature vectors are compared to select word choices. Local and global syntactic analysis of the word choices is provided to enhance the recognition capability of the methods and apparatus.

483 citations


Book
01 Feb 1995
TL;DR: A detailed account of the most recently developed digital speech coders designed specifically for use in the evolving communications systems, including an in-depth examination of the important topic of code excited linear prediction (CELP).
Abstract: From the Publisher: A detailed account of the most recently developed digital speech coders designed specifically for use in the evolving communications systems. Discusses the variety of speech coders utilized with such new systems as MBE IMMARSAT-M. Includes an in-depth examination of the important topic of code excited linear prediction (CELP).

453 citations


Journal ArticleDOI
TL;DR: A new mixed excitation LPC vocoder model is presented that preserves the low bit rate of a fully parametric model but adds more free parameters to the excitation signal so that the synthesizer can mimic more characteristics of natural human speech.
Abstract: Traditional pitch-excited linear predictive coding (LPC) vocoders use a fully parametric model to efficiently encode the important information in human speech. These vocoders can produce intelligible speech at low data rates (800-2400 b/s), but they often sound synthetic and generate annoying artifacts such as buzzes, thumps, and tonal noises. These problems increase dramatically if acoustic background noise is present at the speech input. This paper presents a new mixed excitation LPC vocoder model that preserves the low bit rate of a fully parametric model but adds more free parameters to the excitation signal so that the synthesizer can mimic more characteristics of natural human speech. The new model also eliminates the traditional requirement for a binary voicing decision so that the vocoder performs well even in the presence of acoustic background noise. A 2400-b/s LPC vocoder based on this model has been developed and implemented in simulations and in a real-time system. Formal subjective testing of this coder confirms that it produces natural sounding speech even in a difficult noise environment. In fact, diagnostic acceptability measure (DAM) test scores show that the performance of the 2400-b/s mixed excitation LPC vocoder is close to that of the government standard 4800-b/s CELP coder.

352 citations
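The key departure from a binary voicing decision can be sketched in a few lines of Python. The mixing here is a single broadband weight, whereas the actual vocoder mixes pulse and noise contributions per frequency band; all names and constants below are illustrative, not from the paper:

```python
import random

def mixed_excitation(n_samples, pitch_period, voicing_strength):
    """Blend a periodic pulse train with white noise.

    voicing_strength in [0, 1] replaces the traditional binary
    voiced/unvoiced decision: 1.0 gives pure pulses, 0.0 pure noise,
    and intermediate values mix the two.
    """
    excitation = []
    for n in range(n_samples):
        pulse = 1.0 if n % pitch_period == 0 else 0.0
        noise = random.gauss(0.0, 0.3)
        excitation.append(voicing_strength * pulse
                          + (1.0 - voicing_strength) * noise)
    return excitation

# A partially voiced 20 ms frame at 8 kHz: pulses and noise are mixed
# rather than switched, avoiding the buzzy artifacts of a hard decision.
frame = mixed_excitation(160, pitch_period=80, voicing_strength=0.7)
```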


Proceedings ArticleDOI
09 May 1995
TL;DR: Two new techniques are presented to estimate the noise spectra or the noise characteristics for noisy speech signals and can be combined with a nonlinear spectral subtraction scheme to enhance noisy speech and to improve the performance of speech recognition systems.
Abstract: Two new techniques are presented to estimate the noise spectra or the noise characteristics for noisy speech signals. No explicit speech pause detection is required. Past noisy segments of only about 400 ms duration are needed for the estimation, so the algorithm is able to quickly adapt to slowly varying noise levels or slowly changing noise spectra. These techniques can be combined with a nonlinear spectral subtraction scheme; they can be shown to enhance noisy speech and to improve the performance of speech recognition systems. Another application is the realization of a robust voice activity detector.

273 citations
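The pause-free noise estimation can be sketched as minimum tracking of the smoothed frame power over a short sliding window (roughly the 400 ms of past segments mentioned in the abstract). The smoothing constant, window length, oversubtraction factor, and spectral floor below are illustrative, and a real implementation works per frequency bin rather than on a single power value:

```python
from collections import deque

def track_noise_floor(power_frames, window=40, alpha=0.9):
    """Estimate the noise power per frame as the minimum of the
    exponentially smoothed power over the last `window` frames
    (~400 ms at 10 ms frames), so no speech-pause detector is needed:
    even during speech, the window usually contains a speech gap."""
    history = deque(maxlen=window)
    smoothed = 0.0
    estimates = []
    for p in power_frames:
        smoothed = alpha * smoothed + (1 - alpha) * p
        history.append(smoothed)
        estimates.append(min(history))
    return estimates

def spectral_subtract(power, noise, oversubtract=2.0, floor=0.01):
    """Simple nonlinear spectral subtraction with a spectral floor."""
    return max(power - oversubtract * noise, floor * power)
```

A speech burst raises the smoothed power only briefly, so the windowed minimum (and hence the noise estimate) stays near the true noise level.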


Patent
13 Nov 1995
TL;DR: In this article, a word tagging and editing system for speech recognition receives recognized speech text from a speech recognition engine, and creates tagging information that follows the speech text as it is received by a word processing program or other program.
Abstract: A word tagging and editing system for speech recognition receives recognized speech text from a speech recognition engine, and creates tagging information that follows the speech text as it is received by a word processing program or other program. The body of text to be edited in connection with the word processing program may be selected and cut and pasted and otherwise manipulated, and the tags follow the speech text. A word may be selected by a user, and the tag information used to point to a sound bite within the audio data file created initially by the speech recognition engine. The sound bite may be replayed to the user through a speaker. A practical result is that the user may confirm the correctness of a particular recognized word in real time while editing text in the word processor. If the recognition is manually corrected, the correction information may be supplied to the engine for use in updating a user profile for the user who dictated the audio that was recognized. Particular tagging approaches are employed depending on the particular word processor being used.

188 citations


Patent
20 Sep 1995
TL;DR: In this paper, an adaptive speech recognition and control system and method for controlling various mechanisms and systems in response to spoken instructions is presented, in which spoken commands are effective to direct the system into appropriate memory nodes, and to respective appropriate memory templates corresponding to the voiced command.
Abstract: An adaptive speech recognition and control system and method for controlling various mechanisms and systems in response to spoken instructions, in which spoken commands are effective to direct the system into appropriate memory nodes, and to respective appropriate memory templates corresponding to the voiced command. Spoken commands from any of a group of operators for which the system is trained may be identified, and voice templates are updated as required in response to changes in pronunciation and voice characteristics over time of any of the operators for which the system is trained. Provisions are made both for near-real-time retraining of the system with respect to individual terms which are determined not to be positively identified, and for an overall system training and updating process in which recognition of each command and vocabulary term is checked, and in which the memory templates are retrained if necessary for respective commands or vocabulary terms with respect to an operator currently using the system. In one embodiment, the system includes input circuitry connected to a microphone and including signal processing and control sections for sensing the level of vocabulary recognition over a given period and, if recognition performance falls below a given level, processing audio-derived signals to enhance recognition performance of the system.

188 citations


Book
31 Oct 1995
TL;DR: This work analyzes the nature and perception of speech sounds, application domain, human factors, and dialogue, and the current technology and its limits: an overview of automatic speech recognition (ASR).
Abstract: About the authors. Foreword. Preface. Part A: Speech communication by humans and machines. 1. Nature and perception of speech sounds. 2. Background on speech analysis. 3. Fundamentals of automatic speech recognition. Part B: Robustness in ASR: Problems and issues. 4. Speaker variability and specificity. 5. Dealing with noisy speech and channel distortions. Part C: Possible solutions and some perspectives. 6. The current technology and its limits: an overview. 7. Towards robust speech analysis. 8. On the use of a robust speech representation. 9. ASR of noisy, stressed, and channel distorted speech. 10. Word-spotting and rejection. 11. Spontaneous speech. 12. On the use of knowledge in ASR. 13. Application domain, human factors, and dialogue. Appendix. Index.

178 citations


Patent
31 Jul 1995
TL;DR: In this article, a speech recognition system provides a user with graphical and textual feedback, which is displayed in windows but occupies little of the available display space and is displayed only for a short period of time.
Abstract: A speech recognition system provides a user with graphical and textual feedback. The textual feedback is displayed in windows that occupy little of the available display space and are displayed only for a short period of time. The graphical feedback is displayed in a designated notification area and does not obscure any other displayed items. The feedback provided by the speech recognition system may indicate a current mode of operation of the speech recognition system as well as a state of processing of audio input by the speech recognition system.

166 citations


Proceedings ArticleDOI
27 Sep 1995
TL;DR: Analysis of the transmission of voice and data over an 802.11 WLAN shows that a larger superframe length provides the opportunity for more voice conversations or a higher data throughput, but requires increasing the time to live for the speech bits to retain an acceptable quality.
Abstract: This paper analyzes the transmission of voice and data over an 802.11 wireless local area network (WLAN). The data is transmitted in a contention based access period, while the voice samples are transmitted during a contention free period, based on a polling scheme. Because statistical multiplexing can be utilized, speech may be outdated when a poll arrives. The portion of outdated speech is then clipped to decrease the load on the channel. We analyze the quality of the voice conversations in terms of the percentage of bits clipped as well as the throughput of the data for various parameters. We show the boundary conditions involved in the transmission of voice over the WLAN and demonstrate the impact of a time-bounded service on the throughput during the contention period. The results show that the high overhead introduced by the 802.11 WLAN standard results in a low number of possible voice conversations. It can also be concluded that the cooperation of the contention based and contention free periods results in a poor performance. Further, variation of the maximum payload size reveals that the largest possible maximum payload size must be selected to minimize the percentage of clipped bits and maximize the throughput. Finally, we show that a larger superframe length provides the opportunity for more voice conversations or a higher data throughput, but requires increasing the time to live for the speech bits to retain an acceptable quality.

164 citations


Patent
19 Jun 1995
TL;DR: In this paper, a personal communications system enables the operator to simultaneously transmit voice and data communication to a remote site, using a modified supervisory packet for negotiating communication parameters, including speech compression algorithm, the speech compression ratio, the communication multiplex scheme, and other operations needed for control of remote hardware interfaces.
Abstract: A personal communications system enables the operator to simultaneously transmit voice and data communication to a remote site. The personal communications system is equipped with two telephone line interfaces to allow connection between two remote sites. The connection between the first remote site and the local site may operate in a voice over data communications mode to simultaneously send compressed voice and data. A digital transmission protocol which is consistent with current packet standards is used to create an independent channel through use of a modified supervisory packet for negotiating communication parameters, including the speech compression algorithm, the speech compression ratio, the communication multiplex scheme, and other operations needed for control of remote hardware interfaces.

Proceedings ArticleDOI
09 May 1995
TL;DR: It is suggested that phone rate is a more meaningful measure of speech rate than the more common word rate, and it is found that when data sets are clustered according to the phone rate metric, recognition errors increase when the phone rate is more than 1 standard deviation greater than the mean.
Abstract: It is well known that a higher-than-normal speech rate will cause the rate of recognition errors in large vocabulary automatic speech recognition (ASR) systems to increase. In this paper we attempt to identify and correct for errors due to fast speech. We first suggest that phone rate is a more meaningful measure of speech rate than the more common word rate. We find that when data sets are clustered according to the phone rate metric, recognition errors increase when the phone rate is more than 1 standard deviation greater than the mean. We propose three methods to improve the recognition accuracy of fast speech, each addressing different aspects of performance degradation. The first method is an implementation of Baum-Welch codebook adaptation. The second method is based on the adaptation of HMM state-transition probabilities. In the third method, the pronunciation dictionaries are modified using rule-based techniques and compound words are added. We compare improvements in recognition accuracy for each method using data sets clustered according to the phone rate metric. Adaptation of the HMM state-transition probabilities to fast speech improves recognition of fast speech by a relative amount of 4 to 6 percent.
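The clustering criterion can be sketched directly; the utterance representation below is hypothetical:

```python
import statistics

def phone_rate(n_phones, duration_s):
    """Phones per second -- a finer-grained speech-rate measure
    than the more common words per second."""
    return n_phones / duration_s

def flag_fast_utterances(utterances):
    """Flag utterances whose phone rate exceeds the corpus mean by
    more than one standard deviation, the region where recognition
    errors were observed to increase.

    Each utterance is a (n_phones, duration_s) pair.
    """
    rates = [phone_rate(n, d) for n, d in utterances]
    mu = statistics.mean(rates)
    sigma = statistics.stdev(rates)
    return [r > mu + sigma for r in rates]

# Three normal-rate utterances and one fast one (18 phones/s).
utts = [(50, 5.0), (52, 5.0), (48, 5.0), (90, 5.0)]
fast = flag_fast_utterances(utts)
```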

PatentDOI
TL;DR: In this article, a voice activity detector uses an energy estimate to detect the presence of speech in a received speech signal in a noise environment, and a set of high pass filters are used to filter the signal based upon the background noise level.
Abstract: A method and apparatus for improving sound quality in a digital cellular radio system receiver. A voice activity detector uses an energy estimate to detect the presence of speech in a received speech signal in a noise environment. When no speech is present the system attenuates the signal and inserts low pass filtered white noise. In addition, a set of high pass filters are used to filter the signal based upon the background noise level. This high pass filtering is applied to the signal regardless of whether speech is present. Thus, a combination of signal attenuation with insertion of low pass filtered white noise during periods of non-speech, along with high pass filtering of the signal, improves sound quality when decoding speech which has been encoded in a noisy environment.
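The energy-based detection step can be sketched as follows; the fixed margin is illustrative, since a practical detector adapts its threshold to the estimated background noise level:

```python
def frame_energy(frame):
    """Average short-term energy of one frame of samples."""
    return sum(x * x for x in frame) / len(frame)

def energy_vad(frames, noise_energy, margin=4.0):
    """Declare speech when frame energy exceeds the estimated noise
    energy by a fixed margin. In the patent's scheme, frames judged
    non-speech would then be attenuated and replaced with low-pass
    filtered comfort noise downstream."""
    return [frame_energy(f) > margin * noise_energy for f in frames]

silence = [0.01, -0.01] * 40   # low-energy frame
speech = [0.5, -0.5] * 40      # high-energy frame
decisions = energy_vad([silence, speech], noise_energy=0.0001)
```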

PatentDOI
TL;DR: An instantaneous context switching speech recognition system is disclosed which enables a speech recognition application to be changed without loading new pattern matching data into the system.
Abstract: An instantaneous context switching speech recognition system is disclosed which enables a speech recognition application to be changed without loading new pattern matching data into the system. Selectable pointer maps are included in the memory of the system which selectively change the relationship between words and phonemes between a first application context and the pattern matching logic to a second application context and the pattern matching logic.

Proceedings ArticleDOI
09 May 1995
TL;DR: In this article, a modular system for flexible human-computer interaction via speech is presented, which integrates acoustic and visual information (automatic lip-reading) improving overall recognition, especially in noisy environments.
Abstract: We present the development of a modular system for flexible human-computer interaction via speech. The speech recognition component integrates acoustic and visual information (automatic lip-reading) improving overall recognition, especially in noisy environments. The image of the lips, constituting the visual input, is automatically extracted from the camera picture of the speaker's face by the lip locator module. Finally, the speaker's face is automatically acquired and followed by the face tracker sub-system. Integration of the three functions results in the first bi-modal speech recognizer allowing the speaker reasonable freedom of movement within a possibly noisy room while continuing to communicate with the computer via voice. Compared to audio-alone recognition, the combined system achieves a 20 to 50 percent error rate reduction for various signal/noise conditions.

Proceedings ArticleDOI
Xuedong Huang1, Alejandro Acero1, F. Alleva1, Mei-Yuh Hwang1, Li Jiang1, Milind Mahajan1 
09 May 1995
TL;DR: The Whisper (Windows Highly Intelligent Speech Recognizer) represents significantly improved recognition efficiency, usability, and accuracy, when compared with the Sphinx-II system.
Abstract: Since January 1993, the authors have been working to refine and extend Sphinx-II technologies in order to develop practical speech recognition at Microsoft. The result of that work has been the Whisper (Windows Highly Intelligent Speech Recognizer). Whisper represents significantly improved recognition efficiency, usability, and accuracy, when compared with the Sphinx-II system. In addition Whisper offers speech input capabilities for Microsoft Windows and can be scaled to meet different PC platform configurations. It provides features such as continuous speech recognition, speaker-independence, on-line adaptation, noise robustness, dynamic vocabularies and grammars. For typical Windows Command-and-Control applications (less than 1000 words), Whisper provides a software-only solution on PCs equipped with a 486DX, 4 MB of memory, a standard sound card, and a desktop microphone.

PatentDOI
TL;DR: A speech circuit is disclosed which solves the serious problem of the degradation of the articulation of received speech voice in conventional circuits and permits pleasant communications at places where the background noise level is high.
Abstract: A speech circuit is disclosed which solves the serious problem of the degradation of the articulation of received speech voice in conventional circuits and permits pleasant communications at places where the background noise level is high. The circuit has a construction in which an input signal from a microphone is attenuated in correspondence to the background noise level to form a sidetone signal and a received speech signal from a speech channel is amplified in correspondence to the background noise level to form a new received speech signal.

Patent
05 Sep 1995
TL;DR: In this article, the identification of a caller is determined upon connection to the network via standard caller identification circuitry and upon detection of a spoken utterance, that utterance is processed against the core library, if the caller's identity cannot be determined, or against a particular caller-specific library.
Abstract: A method and system are disclosed for reducing perplexity in a speech recognition system within a telephonic network based upon determined caller identity. In a speech recognition system which processes input frames of speech against stored templates representing speech, a core library of speech templates is created and stored representing a basic vocabulary of speech. Multiple caller-specific libraries of speech templates are also created and stored, each library containing speech templates which represent a specialized vocabulary and pronunciations for a specific geographic location and a particular individual. Additionally, the caller-specific libraries of speech templates are preferably processed to reflect the reduced bandwidth, transmission channel variations and other signal variations introduced into the system via a telephonic network. The identification of a caller is determined upon connection to the network via standard caller identification circuitry and upon detection of a spoken utterance, that utterance is processed against the core library, if the caller's identity cannot be determined, or against a particular caller-specific library, if the caller's identity can be determined, thereby greatly enhancing the efficiency and accuracy of speech recognition by the system.
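The library-selection logic reduces to a lookup with a fallback; the data structures below are hypothetical:

```python
def select_library(caller_id, caller_libraries, core_library):
    """Pick the caller-specific template library when the caller is
    identified via caller ID; otherwise fall back to the core
    vocabulary. Restricting the active vocabulary this way is what
    reduces the recognizer's perplexity."""
    return caller_libraries.get(caller_id, core_library)

core = ["yes", "no", "operator"]
libs = {"+1-555-0100": ["route", "transfer", "voicemail"]}

known = select_library("+1-555-0100", libs, core)
unknown = select_library("unlisted", libs, core)
```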

Journal ArticleDOI
01 Jun 1995
TL;DR: Basic approaches to speech, wideband speech, and audio bit rate compression in audiovisual communications are explained, and it will become obvious that the use of knowledge of auditory perception helps minimize the perception of coding artifacts and leads to efficient low bit rate coding algorithms which can achieve substantially more compression than was thought possible only a few years ago.
Abstract: Current and future visual communications for applications such as broadcasting, videotelephony, video- and audiographic-conferencing, and interactive multimedia services assume a substantial audio component. Even text, graphics, fax, still images, email documents, etc. will gain from voice annotation and audio clips. A wide range of speech, wideband speech, and wideband audio coders is available for such applications. In the context of audiovisual communications, the quality of telephone-bandwidth speech is acceptable for some videotelephony and videoconferencing services. Higher bandwidths (wideband speech) may be necessary to improve the intelligibility and naturalness of speech. High quality audio coding including multichannel audio will be necessary in advanced digital TV and multimedia services. This paper explains basic approaches to speech, wideband speech, and audio bit rate compression in audiovisual communications. These signal classes differ in bandwidth, dynamic range, and in listener expectation of offered quality. It will become obvious that the use of our knowledge of auditory perception helps minimize the perception of coding artifacts and leads to efficient low bit rate coding algorithms which can achieve substantially more compression than was thought possible only a few years ago. The paper concentrates on worldwide source coding standards beneficial for consumers, service providers, and manufacturers.

Proceedings ArticleDOI
09 May 1995
TL;DR: Informal listening indicates that finite impulse response (FIR) Wiener-like filters applied to time trajectories of the cubic-root compressed short-term power spectrum of noisy speech recorded over cellular telephone communications bring a noticeable improvement to the quality of processed noisy speech.
Abstract: Finite impulse response (FIR) Wiener-like filters are applied to time trajectories of the cubic-root compressed short-term power spectrum of noisy speech recorded over cellular telephone communications. Informal listenings indicate that the technique brings a noticeable improvement to the quality of processed noisy speech while not causing any significant degradation to clean speech. Alternative filter structures are being investigated as well as other potential applications in cellular channel compensation and narrowband to wideband speech mapping.

PatentDOI
TL;DR: In this article, the authors propose a conference bridge that receives speech data in the form of data packets, and transmits data without transforming it in the conference bridge, based on the loudest speaker.
Abstract: A conference bridge that receives speech data in the form of data packets, and transmits data in the same form, without transforming the data in the conference bridge. The conference bridge according to this invention includes a plurality of inputs that have speech detectors that detect the presence of speech data. The speech detectors report the presence of speech to a controller. The controller causes data packets from one of the inputs detecting speech to be replicated for all outputs. If there is speech at more than one input at a time, then a decision is made as to which input to replicate. Advantageously, the decision is based on who is the loudest speaker. Further, the data that is replicated is not sent to the output for the originator in order to prevent echo.
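The bridge's selection and replication logic can be sketched as: pick the loudest input that currently carries speech, then copy its packets to every output except the originator's to prevent echo. The structures below are hypothetical:

```python
def select_active_input(inputs):
    """inputs: dict of input_id -> (speech_present, level).
    Return the loudest input that currently carries speech,
    or None if every speech detector reports silence."""
    active = [(level, iid)
              for iid, (speech, level) in inputs.items() if speech]
    if not active:
        return None
    return max(active)[1]

def replicate(packet, source, outputs):
    """Copy the selected packet to every output except its
    originator, so the talker does not hear their own echo."""
    return {out: packet for out in outputs if out != source}

# "c" is loud but its speech detector reports silence, so "b" wins.
inputs = {"a": (True, 0.3), "b": (True, 0.8), "c": (False, 0.9)}
chosen = select_active_input(inputs)
fanout = replicate(b"pkt", chosen, ["a", "b", "c"])
```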

Journal ArticleDOI
Richard Rose1
TL;DR: Decision tree based allophone clustering resulted in a significant increase in keyword detection performance over that obtained using tri-phone based subword units while at the same time reducing the size of the inventory of subword acoustic models by 40%.

Journal ArticleDOI
TL;DR: The effects of additive background noise on speech quality and recognition parameters are discussed, and a source generator based framework to address stress and noise is proposed.


Patent
04 Apr 1995
TL;DR: In this paper, the authors exploit the synergy between operations performed by a speech rate modification system and those operations performed in a speech coding system to provide a speech-rate modification system with reduced hardware requirements.
Abstract: Synergy between operations performed by a speech-rate modification system and those operations performed in a speech coding system is exploited to provide a speech-rate modification system with reduced hardware requirements. The speech rate of an input signal is modified based on a signal representing a predetermined change in speech rate. The modified speech-rate signal is then filtered to generate a speech signal having increased short-term correlation. Modification of the input speech signal may be performed by inserting in the input speech signal a previous sequence of samples corresponding substantially to a pitch cycle. Alternatively, the input speech signal may be modified by removing from the input speech signal a sequence of samples corresponding substantially to a pitch cycle.
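The pitch-cycle insertion/removal idea can be sketched on raw samples; a real system operates pitch-synchronously and smooths the splice points, so the function below is only illustrative:

```python
def change_rate(samples, pitch_period, slow_down=True):
    """Modify speech rate by one pitch cycle: repeat a cycle to slow
    the speech down, or delete a cycle to speed it up. Working in
    whole pitch cycles keeps the waveform locally periodic, so the
    splice is far less audible than an arbitrary cut."""
    cycle = samples[:pitch_period]
    if slow_down:
        return cycle + samples        # repeat one pitch cycle
    return samples[pitch_period:]     # drop one pitch cycle

signal = list(range(10))              # stand-in for speech samples
slower = change_rate(signal, 3, slow_down=True)
faster = change_rate(signal, 3, slow_down=False)
```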

Journal ArticleDOI
T. Chen1, H.P. Graf1, Kuansan Wang2
TL;DR: The marriage of speech analysis and image processing can solve problems related to lip synchronization and speech information is utilized to improve the quality of audio-visual communications such as videotelephony and videoconferencing.
Abstract: We utilize speech information to improve the quality of audio-visual communications such as videotelephony and videoconferencing. In particular, the marriage of speech analysis and image processing can solve problems related to lip synchronization. We present a technique called speech-assisted frame-rate conversion. Demonstration sequences are presented. Other applications, including speech-assisted video coding, are outlined.


Patent
17 Aug 1995
TL;DR: In this paper, an apparatus for monitoring signal quality in a communications link is provided which recognizes speech elements in signals received over the communications link and generates therefrom an estimate of the original speech signal, and compares the estimated signal with the actual received signal to provide an output based on the comparison.
Abstract: An apparatus for monitoring signal quality in a communications link is provided which recognizes speech elements in signals received over the communications link and generates therefrom an estimate of the original speech signal, and compares the estimated signal with the actual received signal to provide an output based on the comparison.

Patent
27 Nov 1995
TL;DR: A speech recognition test system comprising a single host processing system having a host processor and a memory device, wherein the memory device contains a plurality of audio files accessible by the host processor, is described in this paper.
Abstract: A speech recognition test system comprising a single host processing system having a host processor and a memory device, wherein the memory device contains a plurality of audio files accessible by the host processor. The test system also includes a speech recognition application having a vocabulary, an independent test application, a means for concurrently executing the speech recognition application and the independent test application on the host processor, a means for queuing the audio files as input to the speech recognition application by way of the test application, a means for programming the test application at configuration time to expand the vocabulary of the application being tested and/or to set other playback parameters such as voice gender, volume, and speed, and a means for capturing and evaluating test results from the speech recognition application by way of the test application. In an alternative embodiment the speech recognition test system includes an audio input/output device operatively connected to the host processor, and a means for redirecting output from the audio input/output system as input to itself.

Proceedings ArticleDOI
05 Nov 1995
TL;DR: The paper describes the Informedia Digital Video Library project and discusses how speech recognition is used for transcript creation from video, alignment with hand-generated transcripts, query interface and audio paragraph segmentation.
Abstract: In principle, speech recognition technology can make any spoken data useful for library indexing and retrieval. The paper describes the Informedia Digital Video Library project and discusses how speech recognition is used for transcript creation from video, alignment with hand-generated transcripts, query interface and audio paragraph segmentation. The results show that speech recognition accuracy varies dramatically depending on the quality and type of data used. Our information retrieval experiments also show that reasonable recall and precision can be obtained with moderate speech recognition accuracy. Finally we discuss some active areas of speech research relevant to the digital video library problem.