
Showing papers on "Voice activity detection" published in 2003


Journal ArticleDOI
TL;DR: In this article, an improved minima controlled recursive averaging (IMCRA) approach is proposed for noise estimation in adverse environments involving nonstationary noise, weak speech components, and low input signal-to-noise ratio (SNR).
Abstract: Noise spectrum estimation is a fundamental component of speech enhancement and speech recognition systems. We present an improved minima controlled recursive averaging (IMCRA) approach, for noise estimation in adverse environments involving nonstationary noise, weak speech components, and low input signal-to-noise ratio (SNR). The noise estimate is obtained by averaging past spectral power values, using a time-varying frequency-dependent smoothing parameter that is adjusted by the signal presence probability. The speech presence probability is controlled by the minima values of a smoothed periodogram. The proposed procedure comprises two iterations of smoothing and minimum tracking. The first iteration provides a rough voice activity detection in each frequency band. Then, smoothing in the second iteration excludes relatively strong speech components, which makes the minimum tracking during speech activity robust. We show that in nonstationary noise environments and under low SNR conditions, the IMCRA approach is very effective. In particular, compared to a competitive method, it obtains a lower estimation error, and when integrated into a speech enhancement system achieves improved speech quality and lower residual noise.

902 citations
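The core of the recursive-averaging step described in the abstract above can be sketched as follows. This is a minimal illustration assuming the per-bin speech presence probability has already been estimated (the full IMCRA procedure derives it from two iterations of smoothing and minimum tracking); the base smoothing constant is an illustrative value, not one taken from the paper.

```python
import numpy as np

def update_noise_estimate(noise_psd, frame_psd, speech_prob, alpha_d=0.85):
    """One recursive-averaging update of the noise spectrum estimate.

    noise_psd   : previous noise power estimate per frequency bin
    frame_psd   : |Y(k,l)|^2 of the current frame
    speech_prob : estimated speech presence probability per bin (0..1)
    alpha_d     : base smoothing constant (illustrative value)
    """
    # Time-varying, frequency-dependent smoothing parameter: update slowly
    # (alpha near 1) where speech is likely present, faster where it is absent.
    alpha = alpha_d + (1.0 - alpha_d) * speech_prob
    return alpha * noise_psd + (1.0 - alpha) * frame_psd
```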


Patent
10 Dec 2003
TL;DR: In this paper, a technique for disambiguating speech input for multimodal systems by using a combination of speech and visual I/O interfaces is presented, where the user is presented with a set of possible matches using a visual display and/or speech output.
Abstract: A technique is disclosed for disambiguating speech input (202) for multimodal systems by using a combination of speech and visual I/O interfaces. When the user's speech input is not recognized with sufficiently high confidence, the user is presented with a set of possible matches (210) using a visual display and/or speech output. The user then selects (212) the intended input from the list of matches via one or more available input mechanisms (e.g., stylus, buttons, keyboard, mouse, or speech input). These techniques involve the combined use of speech and visual interfaces to correctly identify the user's speech input. The techniques disclosed herein may be utilized in computer devices such as PDAs, cellphones, desktop and laptop computers, tablet PCs, etc.

292 citations


MonographDOI
18 Apr 2003

257 citations


Journal ArticleDOI
TL;DR: A statistical approach based on a hidden Markov model (HMM) is used, which takes into account several features of the band-limited speech, and enhanced speech exhibits a significantly improved quality without objectionable artifacts.

197 citations


PatentDOI
TL;DR: In this paper, an apparatus and method for selective distributed speech recognition include a dialog manager that is capable of receiving a grammar type indicator (170) and that can be coupled to an external speech recognition engine (108), which may be disposed on a communication network.
Abstract: An apparatus and method for selective distributed speech recognition include a dialog manager (104) that is capable of receiving a grammar type indicator (170). The dialog manager (104) is capable of being coupled to an external speech recognition engine (108), which may be disposed on a communication network (142). The apparatus and method further include an audio receiver (102) coupled to the dialog manager (104), wherein the audio receiver (102) receives a speech input (110) and provides an encoded audio input (112) to the dialog manager (104). The method and apparatus also include an embedded speech recognition engine (106) coupled to the dialog manager (104), such that the dialog manager (104) selects to distribute the encoded audio input (112) to either the embedded speech recognition engine (106) or the external speech recognition engine (108) based on the corresponding grammar type indicator (170).

187 citations


Patent
02 Apr 2003
TL;DR: In a voice synthesis apparatus, by bounding a desired range of the input text to be output with, e.g., a start tag and an end tag, a feature of the synthetic voice is changed continuously, for example gradually shifting from a happy voice to an angry voice, as the synthetic voice is output.
Abstract: In a voice synthesis apparatus, by bounding a desired range of the input text to be output with, e.g., a start tag and an end tag, a feature of the synthetic voice is changed continuously, for example gradually shifting from a happy voice to an angry voice, as the synthetic voice is output.

173 citations


Journal ArticleDOI
TL;DR: The simulation results show that the proposed soft VAD that uses a Laplacian distribution model for speech signals outperforms the previous VAD that uses a Gaussian model.
Abstract: A new voice activity detector (VAD) is developed in this paper. The VAD is derived by applying a Bayesian hypothesis test on decorrelated speech samples. The signal is first decorrelated using an orthogonal transformation, e.g., discrete cosine transform (DCT) or the adaptive Karhunen-Loeve transform (KLT). The distributions of clean speech and noise signals are assumed to be Laplacian and Gaussian, respectively, as investigated recently. In addition, a hidden Markov model (HMM) is employed with two states representing silence and speech. The proposed soft VAD estimates the probability of voice being active (VBA), recursively. To this end, first the a priori probability of VBA is estimated/predicted based on feedback information from the previous time instance. Then the predicted probability is combined/updated with the new observed signal to calculate the probability of VBA at the current time instance. The required parameters of both speech and noise signals are estimated, adaptively, by the maximum likelihood (ML) approach. The simulation results show that the proposed soft VAD that uses a Laplacian distribution model for speech signals outperforms the previous VAD that uses a Gaussian model.

161 citations
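The per-frame decision described above can be sketched roughly as follows. The DCT decorrelation, the Laplacian-versus-Gaussian likelihoods, and the two-state Markov prediction of the prior follow the abstract, but the speech-present likelihood is simplified here to a pure Laplacian (the paper models speech plus noise), and the transition probabilities and scale parameters are illustrative assumptions.

```python
import numpy as np
from scipy.fft import dct

def soft_vad_frame(frame, p_prev, noise_var, speech_scale, a01=0.2, a11=0.9):
    """Posterior probability of voice being active for one frame (sketch)."""
    c = dct(frame, norm='ortho')  # decorrelate the frame

    # Log-likelihoods of the decorrelated coefficients under each hypothesis:
    # H0 (speech absent): zero-mean Gaussian noise; H1 (speech present): Laplacian.
    ll_h0 = np.sum(-0.5 * np.log(2 * np.pi * noise_var) - c**2 / (2 * noise_var))
    ll_h1 = np.sum(-np.log(2 * speech_scale) - np.abs(c) / speech_scale)

    # Predict the a priori probability of voice being active from the HMM,
    # then update it with the observed frame (log domain for numerical safety).
    p_prior = np.clip(a11 * p_prev + a01 * (1.0 - p_prev), 1e-6, 1.0 - 1e-6)
    log_num = np.log(p_prior) + ll_h1
    log_den = np.logaddexp(log_num, np.log(1.0 - p_prior) + ll_h0)
    return np.exp(log_num - log_den)
```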


Patent
Oscar J. Blass1
25 Mar 2003
TL;DR: A method for digitally generating speech with improved prosodic characteristics can include receiving a speech input, determining at least one prosodic characteristic contained within the speech input and generating a speech output including the prosodic feature within the output.
Abstract: A method for digitally generating speech with improved prosodic characteristics can include receiving a speech input, determining at least one prosodic characteristic contained within the speech input, and generating a speech output including the prosodic characteristic within the speech output. Consequently, the method can adjust speech output based on prosodic features within the speech input.

160 citations


Journal ArticleDOI
08 Sep 2003
TL;DR: This paper examines how people communicate with computers using speech, and the popular mathematical model called the hidden Markov model (HMM) is examined; first-order HMMs are efficient but ignore long-range correlations in actual speech.
Abstract: This paper examines how people communicate with computers using speech. Automatic speech recognition (ASR) transforms speech into text, while automatic speech synthesis [or text-to-speech (TTS)] performs the reverse task. ASR has been largely developed based on speech coding theory, while simulating certain spectral analyses performed by the ear. Typically, a Fourier transform is employed, but following the auditory Bark scale and simplifying the spectral representation with a decorrelation into cepstral coefficients. Current ASR provides good accuracy and performance on limited practical tasks, but exploits only the most rudimentary knowledge about human production and perception phenomena. The popular mathematical model called the hidden Markov model (HMM) is examined; first-order HMMs are efficient but ignore long-range correlations in actual speech. Common language models use a time window of three successive words in their syntactic-semantic analysis. Speech synthesis is the automatic generation of a speech waveform, typically from an input text. As with ASR, TTS starts from a database of information previously established by analysis of much training data, both speech and text. Previously analyzed speech is stored in small units in the database, for concatenation in the proper sequence at runtime. TTS systems first perform text processing, including "letter-to-sound" conversion, to generate the phonetic transcription. Intonation must be properly specified to approximate the naturalness of human speech. Modern synthesizers using large databases of stored spectral patterns or waveforms output highly intelligible synthetic speech, but naturalness remains to be improved.

156 citations
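As a toy illustration of the front end mentioned above (spectral analysis followed by decorrelation into cepstral coefficients), the sketch below computes cepstra from a log-magnitude spectrum; it omits the Bark/mel frequency warping that practical recognizers apply before the decorrelating transform.

```python
import numpy as np
from scipy.fft import rfft, dct

def cepstral_coefficients(frame, n_coeffs=13):
    """Windowed frame -> log-magnitude spectrum -> cepstral coefficients."""
    windowed = frame * np.hamming(len(frame))
    log_spectrum = np.log(np.abs(rfft(windowed)) + 1e-10)
    # The DCT decorrelates the log spectrum; keep the low-order coefficients.
    return dct(log_spectrum, norm='ortho')[:n_coeffs]
```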


Patent
27 Feb 2003
TL;DR: A communication architecture for delivery of grammar and speech related information such as text-to-speech (TTS) data to a speech recognition server operating with a wireless telecommunication system for use with automatic speech recognition and interactive voice-based applications is presented in this paper.
Abstract: A communication architecture for delivery of grammar and speech related information such as text-to-speech (TTS) data to a speech recognition server operating with a wireless telecommunication system for use with automatic speech recognition and interactive voice-based applications. In the invention, a mobile client retrieves a Web page containing multi-modal content hosted on an origin server via a WAP gateway. The content may include a grammar file and/or TTS strings embedded in the content or reference URL(s) pointing to their storage locations. The client then sends the grammar and/or TTS strings to a speech recognition server via a wireless packet streaming protocol channel. When URL(s) are received by the client and sent to the SRS, the grammar file and/or TTS strings are obtained via a high-speed HTTP connection. The speech processing results and the synthesized speech are returned to the client over the established wireless UDP connection.

132 citations


Patent
05 Mar 2003
TL;DR: In this paper, the authors describe Voice Activity Detection (VAD) devices, systems and methods for use with signal processing systems to denoise acoustic signals.
Abstract: Voice Activity Detection (VAD) devices, systems and methods are described for use with signal processing systems to denoise acoustic signals. Components of a signal processing system and/or VAD system receive acoustic signals and voice activity signals. Control signals are automatically generated from data of the voice activity signals. Components of the signal processing system and/or VAD system use the control signals to automatically select a denoising method appropriate to data of frequency subbands of the acoustic signals. The selected denoising method is applied to the acoustic signals to generate denoised acoustic signals.

Patent
27 Mar 2003
TL;DR: In this article, the authors describe communication systems, including both portable handset and headset devices, that use a number of microphone configurations to receive acoustic signals of an environment, such as a two-microphone array of two unidirectional microphones, or of one unidirectional and one omnidirectional microphone.
Abstract: Communication systems are described, including both portable handset and headset devices, which use a number of microphone configurations to receive acoustic signals of an environment. The microphone configurations include, for example, a two-microphone array including two unidirectional microphones, and a two-microphone array including one unidirectional microphone and one omnidirectional microphone. The communication systems also include Voice Activity Detection (VAD) devices to provide information of human voicing activity. Components of the communications systems receive the acoustic signals and voice activity signals and, in response, automatically generate control signals from data of the voice activity signals. Components of the communication systems use the control signals to automatically select a denoising method appropriate to data of frequency subbands of the acoustic signals. The selected denoising method is applied to the acoustic signals to generate denoised acoustic signals when the acoustic signal includes speech (101) and noise (102).

Proceedings Article
01 Jan 2003
TL;DR: Two approaches for extracting speaker traits are investigated: the first focuses on general acoustic and prosodic features, the second on the choice of words used by the speaker, showing that voice signatures are of practical interest in real-world applications.
Abstract: Most current spoken-dialog systems only extract sequences of words from a speaker's voice. This largely ignores other useful information that can be inferred from speech such as gender, age, dialect, or emotion. These characteristics of a speaker's voice, voice signatures, whether static or dynamic, can be useful for speech mining applications or for the design of a natural spoken-dialog system. This paper explores the problem of automatically and accurately extracting voice signatures from a speaker's voice. We investigate two approaches for extracting speaker traits: the first focuses on general acoustic and prosodic features, the second on the choice of words used by the speaker. In the first approach, we show that standard speech/nonspeech HMMs, conditioned on speaker traits and evaluated on cepstral and pitch features, achieve accuracies well above chance for all examined traits. The second approach, using support vector machines with rational kernels applied to speech recognition lattices, attains an accuracy of about 8.1% in the task of binary classification of emotion. Our results are based on a corpus of speech data collected from a deployed customer-care application (HMIHY 0300). While still preliminary, our results are significant and show that voice signatures are of practical interest in real-world applications.

Patent
29 May 2003
TL;DR: In this article, a computer-based automatic speech recognition (ASR) system generates a sequence of text material used to train the ASR system, where at least some of the text material is based on the evaluation of previous user utterances.
Abstract: A computer-based automatic speech recognition (ASR) system generates a sequence of text material used to train the ASR system. The system compares the sequence of text material to inputs corresponding to a user's speech utterances of that text material in order to update the speech models (e.g., phoneme templates) used during normal ASR processing. The ASR system is able to generate a user-dependent sequence of text material for adapting the speech models, where at least some of the text material is based on the evaluation of previous user utterances. In this way, the system can be trained more efficiently by concentrating on particular speech models that are more problematic than others for the particular user (or group of users).

Patent
15 Aug 2003
TL;DR: In this article, an indication of the loudness of an audio signal containing speech and other types of audio material is obtained by classifying segments of audio information as either speech or non-speech.
Abstract: An indication of the loudness of an audio signal containing speech and other types of audio material is obtained by classifying segments of audio information as either speech or non-speech. The loudness of the speech segments is estimated and this estimate is used to derive the indication of loudness. The indication of loudness may be used to control audio signal levels so that variations in loudness of speech between different programs are reduced. A preferred method for classifying speech segments is described.
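A minimal sketch of the speech-gated idea, assuming a speech/non-speech classifier has already labeled each frame: loudness is estimated over speech frames only, so music and effects do not bias the level used to align dialogue loudness across programs. The simple mean-square power used here is a stand-in, not the loudness measure specified in the patent.

```python
import numpy as np

def speech_gated_level(frames, is_speech):
    """Average level (dB re full scale) over frames classified as speech.

    frames    : iterable of audio frames (numpy arrays scaled to -1..1)
    is_speech : matching iterable of booleans from a speech/non-speech classifier
    """
    speech_power = [np.mean(f**2) for f, s in zip(frames, is_speech) if s]
    if not speech_power:
        return None  # no speech found; no level indication available
    return 10.0 * np.log10(np.mean(speech_power) + 1e-12)
```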

Proceedings ArticleDOI
30 Nov 2003
TL;DR: A novel hardware device that combines a regular microphone with a bone-conductive microphone that is able to detect very robustly whether the speaker is talking and remove background speech significantly, even when the background speaker speaks at the same time as the speaker wearing the headset.
Abstract: We present a novel hardware device that combines a regular microphone with a bone-conductive microphone. The device looks like a regular headset and it can be plugged into any machine with a USB port. The bone-conductive microphone has an interesting property: it is insensitive to ambient noise and captures the low frequency portion of the speech signals. Thanks to the signals from the bone-conductive microphone, we are able to detect very robustly whether the speaker is talking, eliminating more than 90% of background speech. Furthermore, by combining both channels, we are able to remove background speech significantly, even when the background speaker speaks at the same time as the speaker wearing the headset.
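A rough sketch of how the bone-conducted channel could drive talker detection, using the property described above (largely insensitive to ambient noise, carries the low-frequency portion of the wearer's own speech). The cutoff frequency and threshold are placeholder values, not figures from the paper.

```python
import numpy as np
from scipy.signal import butter, sosfilt

def wearer_is_talking(bone_frame, fs, cutoff_hz=800.0, threshold_db=-45.0):
    """Detect the headset wearer's speech from low-band energy in the bone channel."""
    sos = butter(4, cutoff_hz, btype='low', fs=fs, output='sos')
    low_band = sosfilt(sos, bone_frame)
    energy_db = 10.0 * np.log10(np.mean(low_band**2) + 1e-12)
    return energy_db > threshold_db
```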

Patent
Yasunaga Miyazawa1
31 Oct 2003
TL;DR: In this paper, the authors present an acoustical model creating method that obtains high recognition performance under various noise environments such as the inside of a car, which can include a noise data determination unit, which receives data representing the traveling state of a vehicle, the surrounding environments of the vehicle and the operational states of apparatuses mounted in the vehicle, and according to the data, determines which noise data of the previously classified n types of noise data corresponds to the current noise.
Abstract: The invention provides an acoustical model creating method that obtains high recognition performance under various noise environments such as the inside of a car. The present invention can include a noise data determination unit, which receives data representing the traveling state of the vehicle, the surrounding environments of the vehicle, and the operational states of apparatuses mounted in the vehicle, and, according to the data, determines which of the previously classified n types of noise data corresponds to the current noise. The invention can also include a noise removal processing unit: the n types of noise data are superposed on standard speech data to create n types of noise-superposed speech data, and n types of acoustic models M1 to Mn are created from the corresponding noise-removed speech data obtained after noise removal; at recognition time, noise-superposed speech from a microphone is input together with the result of the noise type determination, and noise removal is performed on the noise-superposed speech. The invention can also include a speech recognition processing unit in which speech recognition is performed on the noise-removed speech using, among the n types of acoustic models, the acoustic model corresponding to the noise type determined by the noise data determination unit.

Patent
20 May 2003
TL;DR: In this paper, a method for enhancing voice interactions within a portable multimodal computing device using visual messages is presented, where the message is a prompt for the speech input and/or a confirmation of the input.
Abstract: A method for enhancing voice interactions within a portable multimodal computing device using visual messages. A multimodal interface can be provided that includes an audio interface and a visual interface. A speech input can then be received and a voice recognition task can be performed upon at least a portion of the speech input. At least one message within the multimodal interface can be visually presented, wherein the message is a prompt for the speech input and/or a confirmation of the speech input.

Patent
02 Oct 2003
TL;DR: In this paper, a media capture device has an audio input receptive of user speech relating to media capture activities in close temporal relation to the media capture activity, and a speech recognizer recognizes the user speech based on a selected one of the focused speech recognition lexica.
Abstract: A media capture device has an audio input receptive of user speech relating to a media capture activity in close temporal relation to the media capture activity. A plurality of focused speech recognition lexica respectively relating to media capture activities are stored on the device, and a speech recognizer recognizes the user speech based on a selected one of the focused speech recognition lexica. A media tagger tags captured media with generated speech recognition text, and a media annotator annotates the captured media with a sample of the user speech suitable for input to a speech recognizer. Tagging and annotating are based on close temporal relation between receipt of the user speech and capture of the captured media. Annotations may be converted to tags during post processing, employed to edit a lexicon using letter-to-sound rules and spelled word input, or matched directly to speech to retrieve captured media.

Patent
09 Jul 2003
TL;DR: In this article, a speech data mining system for use in generating a rich transcription having utility in call center management includes a speech differentiation module differentiating between speech of interacting speakers, and a speech recognition module improving automatic recognition of speech of one speaker based on interaction with another speaker employed as a reference speaker.
Abstract: A speech data mining system for use in generating a rich transcription having utility in call center management includes a speech differentiation module differentiating between speech of interacting speakers, and a speech recognition module improving automatic recognition of speech of one speaker based on interaction with another speaker employed as a reference speaker. A transcript generation module generates a rich transcript based on recognized speech of the speakers. Focused, interactive language models improve recognition of a customer on a low quality channel using context extracted from speech of a call center operator on a high quality channel with a speech model adapted to the operator. Mined speech data includes the number of interaction turns, customer frustration phrases, operator politeness, interruptions, and/or contexts extracted from speech recognition results, such as topics, complaints, solutions, and resolutions. Mined speech data is useful in call center and/or product or service quality management.

01 Jan 2003
TL;DR: A new approach to microphone-array processing is proposed in which the goal of the array processing is not to generate an enhanced output waveform but rather to generate a sequence of features which maximizes the likelihood of the correct hypothesis.
Abstract: Speech recognition performance degrades significantly in distant-talking environments, where the speech signals can be severely distorted by additive noise and reverberation. In such environments, the use of microphone arrays has been proposed as a means of improving the quality of captured speech signals. Currently, microphone-array-based speech recognition is performed in two independent stages: array processing and then recognition. Array processing algorithms designed for signal enhancement are applied in order to reduce the distortion in the speech waveform prior to feature extraction and recognition. This approach assumes that improving the quality of the speech waveform will necessarily result in improved recognition performance. However, speech recognition systems are statistical pattern classifiers that process features derived from the speech waveform, not the waveform itself. An array processing algorithm can therefore only be expected to improve recognition if it maximizes or at least increases the likelihood of the correct hypothesis, relative to other competing hypotheses. In this thesis a new approach to microphone-array processing is proposed in which the goal of the array processing is not to generate an enhanced output waveform but rather to generate a sequence of features which maximizes the likelihood of the correct hypothesis. In this approach, called Likelihood Maximizing Beamforming (LIMABEAM), information from the speech recognition system itself is used to optimize a filter-and-sum beamformer. Using LIMABEAM, significant improvements in recognition accuracy over conventional array processing approaches are obtained in moderately reverberant environments over a wide range of signal-to-noise ratios. However, only limited improvements are obtained in environments with more severe reverberation. To address this issue, a subband filtering approach to LIMABEAM is proposed, called Subband-Likelihood Maximizing Beamforming (S-LIMABEAM). S-LIMABEAM employs a new subband filter-and-sum architecture which explicitly considers how the features used for recognition are computed. This enables S-LIMABEAM to achieve dramatically improved performance over the original LIMABEAM algorithm in highly reverberant environments. Because the algorithms in this thesis are data-driven, they do not require a priori knowledge of the room impulse response, nor any particular number of microphones or array geometry. To demonstrate this, LIMABEAM and S-LIMABEAM are evaluated using multiple array configurations and environments including an array-equipped personal digital assistant (PDA) and a meeting room with a few tabletop microphones. In all cases, the proposed algorithms significantly outperform conventional array processing approaches.
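The filter-and-sum architecture referred to above can be sketched as follows. In LIMABEAM the filter taps are not fixed in advance; they are optimized so that the recognition features computed from the summed output maximize the recognizer's likelihood of the hypothesized transcription. That optimization loop is omitted here.

```python
import numpy as np
from scipy.signal import lfilter

def filter_and_sum(mic_signals, filters):
    """Apply one FIR filter per microphone channel and sum the filtered outputs.

    mic_signals : list of 1-D numpy arrays, one per microphone
    filters     : list of FIR coefficient arrays, one per microphone
    """
    return sum(lfilter(h, [1.0], x) for h, x in zip(filters, mic_signals))
```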

Proceedings ArticleDOI
05 Apr 2003
TL;DR: Unvoiced speech recognition, "Mime Speech Recognition", is proposed; it is based not on voice signals but on electromyography (EMG), and it will realize unvoiced communication, a new communication style.
Abstract: We propose unvoiced speech recognition, "Mime Speech Recognition". It recognizes speech by observing the muscles associated with speech. It is not based on voice signals but on electromyography (EMG). It will realize unvoiced communication, which is a new communication style. Because voice signals are not used, it can be applied in noisy environments; it also supports people without vocal cords and aphasics. In preliminary experiments, we try to recognize the 5 Japanese vowels. EMG signals from the 3 muscles that contribute greatly to the utterance of Japanese vowels are input to a neural network. The recognition accuracy is over 90% for the three subjects tested.
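A hypothetical sketch of the setup the abstract describes (3 EMG channels, the 5 Japanese vowels, a neural-network classifier). The RMS features and the network size are assumptions made for illustration, not the authors' actual configuration.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

def rms_features(emg_window):
    """Per-channel RMS of one window of EMG samples, shape (n_samples, 3 channels)."""
    return np.sqrt(np.mean(np.square(emg_window), axis=0))

def train_vowel_classifier(features, vowel_labels):
    """features: (n_windows, 3) RMS values; vowel_labels: one of 'a i u e o'."""
    clf = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0)
    clf.fit(features, vowel_labels)
    return clf
```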

Proceedings ArticleDOI
10 Nov 2003
TL;DR: This paper investigates the effects of packet loss on speech quality in Voice over Internet Protocol (VoIP) applications by using ITU-T G.107, the E-model, whose parameters currently only cover limited VoIP scenarios.
Abstract: This paper investigates the effects of packet loss on speech quality in Voice over Internet Protocol (VoIP) applications by using ITU-T G.107, the E-model, whose parameters currently only cover limited VoIP scenarios. Several packet loss rates, packet sizes and error concealment techniques for codec G.729 are examined. Mean Opinion Score (MOS) is used as an index for speech quality and is measured by the Perceptual Evaluation of Speech Quality (PESQ) algorithm. These effects on speech quality are assessed in the equipment impairment factor domain and then formulated into the E-model. The validation test shows good accuracy of the proposed formula; the prediction errors lie within ∓0.10 MOS for most cases, with an absolute maximum of 0.14 MOS.
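For reference, the E-model machinery the paper extends can be sketched with the standard ITU-T G.107 relations: packet loss raises the effective equipment impairment Ie,eff, which lowers the rating factor R, which in turn maps to an estimated MOS. The codec figures in the example are illustrative, not the parameters fitted in the paper.

```python
def effective_equipment_impairment(ie, bpl, ppl, burst_r=1.0):
    """Ie,eff per G.107: the codec impairment Ie increased by packet loss.

    ppl     : packet-loss percentage
    bpl     : packet-loss robustness factor of the codec
    burst_r : burstiness ratio (1.0 for random loss)
    """
    return ie + (95.0 - ie) * ppl / (ppl / burst_r + bpl)

def r_to_mos(r):
    """Map an E-model rating factor R to an estimated MOS (G.107 mapping)."""
    if r < 0:
        return 1.0
    if r > 100:
        return 4.5
    return 1.0 + 0.035 * r + 7e-6 * r * (r - 60.0) * (100.0 - r)

# Example with illustrative values: default transmission rating of about 93.2,
# no delay impairment, and 2% random packet loss for a G.729-like codec.
r = 93.2 - 0.0 - effective_equipment_impairment(ie=11.0, bpl=19.0, ppl=2.0)
print(round(r_to_mos(r), 2))
```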

Journal ArticleDOI
TL;DR: Two versions of an SND algorithm, based on statistical criteria, are proposed and compared, and a post-detection technique is introduced in order to reject wrongly detected noise segments.

Patent
Stefan Gustavsson1
17 Jun 2003
TL;DR: In this paper, the authors proposed a method for voice activity detection in a mobile telephone using the directional sensitivity of a microphone system and exploiting the knowledge about the voice source's orientation in space.
Abstract: The invention relates to a device, a mobile apparatus incorporating the device, and accessory therefor and a method for voice activity detection, particularly in a mobile telephone, using the directional sensitivity of a microphone system and exploiting the knowledge about the voice source's orientation in space. The device comprises a sound signal analyser arranged to determine whether a sound signal comprises speech. According to the invention, the device further comprises a microphone system (2a, 2b, 2c, 2d, 2e) arranged to discriminate sounds emanating from sources located in different directions from the microphone system, so that sounds only emanating from a range of directions are included as signals possibly containing speech.

Patent
Milan Jelinek1
09 Oct 2003
TL;DR: Speech signal classification and encoding systems and methods are disclosed in this paper, where the signal classification is done in three steps, each discriminating a specific signal class; for example, once a frame is classified as unvoiced, the classification chain ends and the frame is encoded using a coding method optimized for unvoiced signals.
Abstract: Speech signal classification and encoding systems and methods are disclosed herein. The signal classification is done in three steps, each of them discriminating a specific signal class. First, a voice activity detector (VAD) discriminates between active and inactive speech frames. If an inactive speech frame is detected (background noise signal), then the classification chain ends and the frame is encoded with comfort noise generation (CNG). If an active speech frame is detected, the frame is subjected to a second classifier dedicated to discriminating unvoiced frames. If the classifier classifies the frame as an unvoiced speech signal, the classification chain ends, and the frame is encoded using a coding method optimized for unvoiced signals. Otherwise, the speech frame is passed through to the 'stable voiced' classification module. If the frame is classified as a stable voiced frame, then the frame is encoded using a coding method optimized for stable voiced signals. Otherwise, the frame is likely to contain a non-stationary speech segment such as a voiced onset or rapidly evolving voiced speech signal. In this case, a general-purpose speech coder is used at a high bit rate to sustain good subjective quality.
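The three-step classification chain described above reads naturally as a cascade of early exits. The sketch below shows only that control flow; the VAD and the two classifiers are left as placeholder objects.

```python
def select_coding_mode(frame, vad, unvoiced_clf, stable_voiced_clf):
    """Return the coding mode for one frame following the three-step chain.

    vad, unvoiced_clf and stable_voiced_clf are placeholder objects; each
    step ends the chain as soon as it recognizes its own class, and the
    remaining frames fall through to the general-purpose coder.
    """
    if not vad.is_active(frame):
        return "CNG"                   # comfort noise generation
    if unvoiced_clf.is_unvoiced(frame):
        return "UNVOICED_CODER"        # coding optimized for unvoiced signals
    if stable_voiced_clf.is_stable_voiced(frame):
        return "STABLE_VOICED_CODER"   # coding optimized for stable voiced signals
    return "GENERIC_HIGH_RATE_CODER"   # onsets / rapidly evolving voiced speech
```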

Patent
05 Feb 2003
TL;DR: In this paper, a speech processing unit assigns priority either to voice guidance processing or to speech recognition processing to be carried out previously, when a speech input requesting for the speech input is accepted while the voice guidance process is being carried out.
Abstract: A speech processing unit assigns priority either to voice guidance processing or to speech recognition processing to be carried out previously, when a speech input requesting for the speech recognition processing is accepted while the voice guidance processing is being carried out. It can solve a problem of a conventional speech processing unit in that when a user operates a speech input button requesting for the speech recognition processing, the currently output voice guidance is interrupted, or the voice guidance scheduled to be output is not produced, thereby hindering the user from obtaining truly necessary information.

Journal ArticleDOI
01 Sep 2003
TL;DR: The paper presents several methods of analyzing stuttered speech and describes attempts to establish those parameters that represent a stuttering event, and reports results of some experiments on automatic detection of speech disorder events that were based on both rough sets and artificial neural networks.
Abstract: The process of counting stuttering events could be carried out more objectively through the automatic detection of stop-gaps, syllable repetitions and vowel prolongations. The alternative would be based on the subjective evaluations of speech fluency and may be dependent on a subjective evaluation method. Meanwhile, the automatic detection of intervocalic intervals, stop-gaps, voice onset time and vowel durations may depend on the speaker, and the rules derived for a single speaker might be unreliable when trying to consider them as universal ones. This implies that learning algorithms having strong generalization capabilities could be applied to solve the problem. Nevertheless, such a system requires vectors of parameters, which characterize the distinctive features in a subject's speech patterns. In addition, an appropriate selection of the parameters and feature vectors while learning may augment the performance of an automatic detection system. The paper reports on automatic recognition of stuttered speech in normal and frequency altered feedback speech. It presents several methods of analyzing stuttered speech and describes attempts to establish those parameters that represent a stuttering event. It also reports results of some experiments on automatic detection of speech disorder events that were based on both rough sets and artificial neural networks.

Proceedings ArticleDOI
06 Apr 2003
TL;DR: A prosodic end-of-utterance detector using only speech/nonspeech detection output is still considerably more accurate and has lower latency than a baseline system based on pause-length thresholding.
Abstract: In previous work we showed that state-of-the-art end-of-utterance detection (as used, for example, in dialog systems) can be improved significantly by making use of prosodic and/or language models that predict utterance endpoints, based on word and alignment output from a speech recognizer. However, using a recognizer in endpointing might not be practical in certain applications. We demonstrate that the improvements due to the prosodic knowledge can be realized largely without alignment information, i.e., without requiring a speech recognizer. A prosodic end-of-utterance detector using only speech/nonspeech detection output is still considerably more accurate and has lower latency than a baseline system based on pause-length thresholding.
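The pause-length-thresholding baseline mentioned above amounts to the following; the frame size and pause threshold are illustrative values.

```python
def endpoint_by_pause(speech_flags, frame_ms=10, pause_threshold_ms=500):
    """Declare an end of utterance once the speech/nonspeech detector has
    reported a pause longer than a fixed threshold.  Returns the frame index
    at which the endpoint is declared, or None if no endpoint is found."""
    needed = pause_threshold_ms // frame_ms
    run = 0
    for i, is_speech in enumerate(speech_flags):
        run = 0 if is_speech else run + 1
        if run >= needed:
            return i
    return None
```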

Proceedings ArticleDOI
Sumit Basu1
06 Apr 2003
TL;DR: This work presents a novel method for simultaneous voicing and speech detection based on a linked-HMM architecture, with robust features that are independent of the signal energy, and demonstrates the performance of this method in a variety of testing conditions.
Abstract: We present a novel method for simultaneous voicing and speech detection based on a linked-HMM architecture, with robust features that are independent of the signal energy. Because this approach models the change in dynamics between speech and nonspeech regions, it is robust to low sampling rates, significant levels of additive noise, and large distances from the microphone. We demonstrate the performance of our method in a variety of testing conditions and also compare it to other methods reported in the literature.
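As a much-simplified stand-in for the linked-HMM decoding described above, the sketch below forward-filters per-frame log-likelihoods (computed from energy-independent features by some frame-level model) through a single two-state speech/nonspeech HMM; the paper's actual architecture couples separate voicing and speech chains.

```python
import numpy as np

def forward_speech_probability(loglik_nonspeech, loglik_speech,
                               p_stay=0.95, p_init=0.5):
    """Forward-filtered P(speech) per frame for a two-state HMM.

    The transition probabilities encode the expectation that speech and
    nonspeech states persist over many frames, which is what makes the
    smoothed decision robust to noisy frame-level evidence.
    """
    trans = np.array([[p_stay, 1.0 - p_stay],
                      [1.0 - p_stay, p_stay]])
    belief = np.array([1.0 - p_init, p_init])
    posteriors = []
    for l0, l1 in zip(loglik_nonspeech, loglik_speech):
        predicted = trans.T @ belief           # predict the next state
        belief = predicted * np.exp([l0, l1])  # weight by frame likelihoods
        belief /= belief.sum()
        posteriors.append(belief[1])
    return np.array(posteriors)
```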