
Showing papers on "Voice activity detection published in 2004"


Journal ArticleDOI
TL;DR: A new VAD algorithm for improving speech detection robustness in noisy environments and the performance of speech recognition systems is presented, which formulates the speech/non-speech decision rule by comparing the long-term spectral envelope to the average noise spectrum, thus yielding a highly discriminating decision rule and minimizing the average number of decision errors.
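
A minimal sketch of this style of decision rule, assuming a long-term spectral divergence (LTSD) formulation; the window length, threshold, and exact divergence form below are illustrative, not the paper's constants:

```python
import numpy as np

def ltsd_vad(frames_spec, noise_spec, order=6, threshold_db=6.0):
    """Compare the long-term spectral envelope (per-bin maximum over a
    window of frames) to the average noise spectrum; a frame is declared
    speech when the divergence exceeds the threshold."""
    n_frames, _ = frames_spec.shape
    noise_power = np.maximum(noise_spec, 1e-12) ** 2
    decisions = np.zeros(n_frames, dtype=bool)
    for l in range(n_frames):
        lo, hi = max(0, l - order), min(n_frames, l + order + 1)
        ltse = frames_spec[lo:hi].max(axis=0)          # long-term spectral envelope
        ltsd = 10 * np.log10(np.mean(ltse ** 2 / noise_power))
        decisions[l] = ltsd > threshold_db
    return decisions

# Usage: frames_spec = |STFT| magnitudes (frames x bins); noise_spec is
# estimated from leading non-speech frames.
```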

412 citations


Journal ArticleDOI
TL;DR: An original paradigm is used to show that seeing the speaker's lips enables the listener to hear better and hence to understand better, and this early contribution to audio-visual speech identification is discussed in relation to recent neurophysiological data on audio-visual perception.

330 citations


Patent
05 Aug 2004
TL;DR: In this article, a method and apparatus for performing speech recognition using a dynamic vocabulary is described, where results from a preliminary speech recognition pass can be used to update or refine a language model in order to improve the accuracy of search results and to simplify subsequent recognition passes.
Abstract: A method and apparatus are provided for performing speech recognition using a dynamic vocabulary. Results from a preliminary speech recognition pass can be used to update or refine a language model in order to improve the accuracy of search results and to simplify subsequent recognition passes. This iterative process greatly reduces the number of alternative hypotheses produced during each speech recognition pass, as well as the time required to process subsequent passes, making the speech recognition process faster, more efficient and more accurate. The iterative process is characterized by the use of results from one or more data set queries, where the keys used to query the data set, as well as the queries themselves, are constructed in a manner that produces more effective language models for use in subsequent attempts at decoding a given speech signal.
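
The iterative refinement loop can be illustrated with a toy example; the catalog, the unigram language model, and the `decode` stand-in below are hypothetical simplifications of the patent's recognizer and data-set queries:

```python
from collections import Counter

def build_unigram_lm(phrases):
    """Toy unigram language model: word -> relative frequency."""
    counts = Counter(w for p in phrases for w in p.split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def decode(observed_words, lm):
    """Stand-in for a recognition pass: keep only words the LM supports."""
    return [w for w in observed_words if lm.get(w, 0) > 0]

# Pass 1: broad LM over the whole catalog. Pass 2: LM rebuilt from only
# the catalog entries matching the first-pass keys, shrinking the
# hypothesis space for the next decoding attempt.
catalog = ["play jazz piano", "play jazz guitar", "stop playback"]
observed = ["play", "jazz", "gutar"]               # noisy first-pass words
keys = decode(observed, build_unigram_lm(catalog)) # -> ['play', 'jazz']
matches = [p for p in catalog if all(k in p.split() for k in keys)]
refined_lm = build_unigram_lm(matches)             # narrower model for pass 2
print(sorted(refined_lm))
```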

191 citations


Patent
24 Dec 2004
TL;DR: In this paper, a speech recognition unit performs speech recognition on speech of an utterer input by a speech input unit, specifies possible words which are represented by the speech, and the scores thereof, and a natural language analyzer (3) specifies parts of speech of the words and supplies word data representing the words to an agent processing unit.
Abstract: A speech recognition unit (2) performs speech recognition on speech of an utterer input by a speech input unit (1), specifies possible words which are represented by the speech, and the scores thereof, and a natural language analyzer (3) specifies parts of speech of the words and supplies word data representing the words to an agent processing unit (7). The agent processing unit (7) stores process item data which defines a data acquisition process to acquire word data or the like, a discrimination process, and an input/output process, and wires, i.e., data defining transitions from one process to another and giving a weighting factor to each transition, and executes a flow represented generally by the process item data and the wires to thereby control devices belonging to an input/output target device group (6) in such a way as to adequately grasp a demand of the utterer and meet the demand.
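
A toy rendition of the "process items plus weighted wires" flow might look as follows; the item functions, result labels, and weights are invented for illustration:

```python
# Each process item is a function returning a result label; wires map
# (item, result) to candidate next items with weighting factors, and the
# flow follows the most heavily weighted transition.
def run_flow(items, wires, start, context):
    current = start
    while current is not None:
        result = items[current](context)
        candidates = wires.get((current, result), [])
        current = max(candidates, key=lambda w: w[1])[0] if candidates else None
    return context

items = {
    "acquire": lambda ctx: "ok" if ctx.get("word") else "missing",
    "discriminate": lambda ctx: "navigate" if ctx["word"] == "map" else "other",
    "output": lambda ctx: ctx.setdefault("action", "show_map") and "done",
}
wires = {
    ("acquire", "ok"): [("discriminate", 0.9)],
    ("discriminate", "navigate"): [("output", 0.8)],
}
print(run_flow(items, wires, "acquire", {"word": "map"}))
```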

181 citations


Journal ArticleDOI
TL;DR: This study investigated whether this enhancement effect of acoustic speech in noise is specific to visual speech stimuli or can rely on more generic non-speech visual stimulus properties.

175 citations


Proceedings ArticleDOI
24 Oct 2004
TL;DR: Three applications that utilize dual-purpose speech to assist a user in conversational tasks: the Calendar Navigator Agent, DialogTabs, and Speech Courier are presented.
Abstract: In this paper, we explore the concept of dual-purpose speech: speech that is socially appropriate in the context of a human-to-human conversation which also provides meaningful input to a computer. We motivate the use of dual-purpose speech and explore issues of privacy and technological challenges related to mobile speech recognition. We present three applications that utilize dual-purpose speech to assist a user in conversational tasks: the Calendar Navigator Agent, DialogTabs, and Speech Courier. The Calendar Navigator Agent navigates a user's calendar based on socially appropriate speech used while scheduling appointments. DialogTabs allows a user to postpone cognitive processing of conversational material by providing short-term capture of transient information. Finally, Speech Courier allows asynchronous delivery of relevant conversational information to a third party.

156 citations


Book
01 Jan 2004
TL;DR: This work focuses on Speech Segregation using an Event-synchronous Auditory Image and STRAIGHT, and on Ideal Binary Mask As the Computational Goal of Auditory Scene Analysis.
Abstract: Speech Segregation: Problems and Perspectives.- Auditory Scene Analysis.- Speech separation.- Recurrent Timing Nets for F0-based Speaker Separation.- Blind Source Separation Using Graphical Models.- Speech Recognizer Based Maximum Likelihood Beamforming.- Exploiting Redundancy to Construct Listening Systems.- Automatic Speech Processing by Inference in Generative Models.- Signal Separation Motivated by Human Auditory Perception: Applications to Automatic Speech Recognition.- Speech Segregation Using an Event-synchronous Auditory Image and STRAIGHT.- Underlying Principles of a High-quality Speech Manipulation System STRAIGHT and Its Application to Speech Segregation.- On Ideal Binary Mask As the Computational Goal of Auditory Scene Analysis.- The History and Future of CASA.- Techniques for Robust Speech Recognition in Noisy and Reverberant Conditions.- Source Separation, Localization, and Comprehension in Humans, Machines, and Human-machine Systems.- The Cancellation Principle in Acoustic Scene Analysis.- Informational and Energetic Masking Effects in Multitalker Speech Perception.- Masking the Feature Information In Multi-stream Speech-analogue Displays.- Interplay Between Visual and Audio Scene Analysis.- Evaluating Speech Separation Systems.- Making Sense of Everyday Speech: a Glimpsing Account.

150 citations


Journal ArticleDOI
TL;DR: A new approach to microphone-array processing is proposed in which the goal of the array processing is not to generate an enhanced output waveform but rather to generate a sequence of features which maximizes the likelihood of generating the correct hypothesis.
Abstract: Speech recognition performance degrades significantly in distant-talking environments, where the speech signals can be severely distorted by additive noise and reverberation. In such environments, the use of microphone arrays has been proposed as a means of improving the quality of captured speech signals. Currently, microphone-array-based speech recognition is performed in two independent stages: array processing and then recognition. Array processing algorithms, designed for signal enhancement, are applied in order to reduce the distortion in the speech waveform prior to feature extraction and recognition. This approach assumes that improving the quality of the speech waveform will necessarily result in improved recognition performance and ignores the manner in which speech recognition systems operate. In this paper a new approach to microphone-array processing is proposed in which the goal of the array processing is not to generate an enhanced output waveform but rather to generate a sequence of features which maximizes the likelihood of generating the correct hypothesis. In this approach, called likelihood-maximizing beamforming, information from the speech recognition system itself is used to optimize a filter-and-sum beamformer. Speech recognition experiments performed in a real distant-talking environment confirm the efficacy of the proposed approach.
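
In spirit, likelihood-maximizing beamforming replaces a waveform-quality objective with the recognizer's own score. Below is a sketch under strong simplifications: single-tap weights per channel, and a toy stand-in for the recognizer's hypothesis likelihood (a real system would score HMM state sequences):

```python
import numpy as np
from scipy.optimize import minimize

def filter_and_sum(mics, weights):
    """Filter-and-sum beamformer (one weight per channel for brevity)."""
    return np.tensordot(weights, mics, axes=1)   # weighted sum of channels

def neg_hypothesis_likelihood(weights, mics, score_fn):
    """Objective: likelihood of the correct transcription evaluated on
    beamformed output. `score_fn` stands in for the recognizer."""
    return -score_fn(filter_and_sum(weights, mics))

# Toy stand-in score: prefers output close to a known clean signal.
clean = np.sin(np.linspace(0, 20, 400))
mics = np.stack([clean + 0.3 * np.random.randn(400) for _ in range(4)])
score = lambda y: -np.sum((y - clean) ** 2)

w0 = np.full(4, 0.25)                             # delay-and-sum initialization
res = minimize(neg_hypothesis_likelihood, w0, args=(mics, score))
print(res.x)                                      # likelihood-optimized weights
```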

147 citations


Patent
24 Sep 2004
TL;DR: Text-to-speech (TTS) generation is used in conjunction with large vocabulary speech recognition to automatically repeat text recognized by the speech recognition after each of a succession of end-of-utterance detections as discussed by the authors.
Abstract: Text-to-speech (TTS) generation is used in conjunction with large vocabulary speech recognition to say words selected by the speech recognition. The software for performing the large vocabulary speech recognition can share speech modeling data with the TTS software. TTS or recorded audio can be used to automatically say both recognized text and the names of recognized commands after their recognition. The TTS can automatically repeat text recognized by the speech recognition after each of a succession of end-of-utterance detections. A user can move a cursor back or forward in recognized text, and the TTS can speak one or more words at the cursor location after each such move. The speech recognition can be used to produce a choice list of possible recognition candidates, and the TTS can be used to provide spoken output of one or more of the candidates on the choice list.

142 citations


PatentDOI
TL;DR: In this article, the system implements high-accuracy speech recognition while suppressing the amount of data transfer between the client and the server: the client compression-encodes speech parameters and sends them to the server, where a speech processing unit performs speech recognition on the compression-encoded speech parameters and sends information corresponding to the speech recognition result back to the client.
Abstract: The system implements high-accuracy speech recognition while suppressing the amount of data transfer between the client and server. For this purpose, the client compression-encodes speech parameters by a speech processing unit and sends the compression-encoded speech parameters to the server. The server receives the compression-encoded speech parameters, a speech processing unit performs speech recognition on them, and information corresponding to the speech recognition result is sent to the client.
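
The client/server split can be sketched as a simple quantize-and-ship round trip; the int8 quantization and scale factor here are illustrative, not the patent's codec:

```python
import numpy as np

def encode_parameters(features, scale=25):
    """Client side: quantize speech parameters (e.g. cepstra) to int8 so
    only a compact encoding crosses the network."""
    return np.clip(np.round(features * scale), -128, 127).astype(np.int8).tobytes()

def decode_parameters(payload, dim, scale=25):
    """Server side: recover approximate parameters before recognition."""
    return np.frombuffer(payload, dtype=np.int8).reshape(-1, dim) / scale

features = np.random.randn(50, 13)        # 50 frames of 13 hypothetical cepstra
payload = encode_parameters(features)     # 650 bytes instead of 5200 (float64)
restored = decode_parameters(payload, 13)
print(len(payload), float(np.abs(restored - features).max()))
```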

136 citations


Proceedings ArticleDOI
13 Oct 2004
TL;DR: The development and evaluation of a speaker-independent audio-visual speech recognition (AVSR) system that utilizes a segment-based modeling strategy using segment-constrained Hidden Markov Models (HMMs).
Abstract: This paper presents the development and evaluation of a speaker-independent audio-visual speech recognition (AVSR) system that utilizes a segment-based modeling strategy. To support this research, we have collected a new video corpus, called Audio-Visual TIMIT (AV-TIMIT), which consists of a total of 4 hours of read speech collected from 223 different speakers. This new corpus was used to evaluate our new AVSR system, which incorporates a novel audio-visual integration scheme using segment-constrained Hidden Markov Models (HMMs). Preliminary experiments have demonstrated improvements in phonetic recognition performance when incorporating visual information into the speech recognition process.

Journal ArticleDOI
01 Aug 2004
TL;DR: It is shown that by masking the TF representation of the speech signals, the noise components are distorted beyond recognition while the speech source of interest maintains its perceptual quality.
Abstract: A dual-microphone speech-signal enhancement algorithm, utilizing phase-error based filters that depend only on the phase of the signals, is proposed. This algorithm involves obtaining time-varying, or alternatively, time-frequency (TF), phase-error filters based on prior knowledge regarding the time difference of arrival (TDOA) of the speech source of interest and the phases of the signals recorded by the microphones. It is shown that by masking the TF representation of the speech signals, the noise components are distorted beyond recognition while the speech source of interest maintains its perceptual quality. This is supported by digit recognition experiments which show a substantial recognition accuracy rate improvement over prior multimicrophone speech enhancement algorithms. For example, for a case with two speakers with a 0.1 s reverberation time, the phase-error based technique results in a 28.9% recognition rate gain over the single channel noisy signal, a gain of 22.0% over superdirective beamforming, and a gain of 8.5% over postfiltering.
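
A sketch of a phase-error based mask, assuming STFTs from the two microphones and a known target TDOA; the mask shape and the constant `gamma` are illustrative rather than the paper's exact filter:

```python
import numpy as np

def phase_error_mask(X1, X2, tdoa, fs, gamma=10.0):
    """Dual-microphone time-frequency mask built from phase alone.
    X1, X2: STFTs (freq bins x frames); tdoa: expected arrival-time
    difference (seconds) for the target source."""
    n_freq = X1.shape[0]
    freqs = np.arange(n_freq) * fs / (2 * (n_freq - 1))  # rfft bin frequencies
    expected = np.exp(1j * 2 * np.pi * freqs * tdoa)[:, None]
    # phase error: observed inter-channel phase minus the phase the
    # target TDOA predicts, per time-frequency cell
    err = np.angle(X1 * np.conj(X2) * np.conj(expected))
    return 1.0 / (1.0 + gamma * err ** 2)  # ~1 on target cells, small elsewhere

# Demo: identical channels and zero TDOA give zero phase error everywhere.
rng = np.random.default_rng(0)
X1 = rng.standard_normal((257, 10)) + 1j * rng.standard_normal((257, 10))
X2 = X1.copy()
print(phase_error_mask(X1, X2, tdoa=0.0, fs=16000).min())  # 1.0
# Masked output would be Y = mask * X1: target cells keep their energy,
# off-target cells are attenuated beyond recognition.
```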

Proceedings ArticleDOI
28 Sep 2004
TL;DR: In this method, audio information and video information are fused by a Bayesian network to enable the detection of speech events, and the information of detected speech events is utilized in sound separation using adaptive beamforming.
Abstract: For cooperative work of robots and humans in the real world, a communicative function based on speech is indispensable for robots. To realize such a function in a noisy real environment, it is essential that robots be able to extract target speech spoken by humans from a mixture of sounds by their own resources. We have developed a method of detecting and extracting speech events based on the fusion of audio and video information. In this method, audio information (sound localization using a microphone array) and video information (human tracking using a camera) are fused by a Bayesian network to enable the detection of speech events. The information of detected speech events is then utilized in sound separation using adaptive beamforming. In this paper, some basic investigations for applying the above system to the humanoid robot HRP-2 are reported. Input devices, namely a microphone array and a camera, were mounted on the head of HRP-2, and acoustic characteristics for sound localization/separation performance were investigated. Also, the human tracking system was improved so that it can be used in a dynamic situation. Finally, overall performance of the system was tested via off-line experiments.
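
In the simplest case, the fusion step reduces to Bayesian combination of an audio cue (sound localized at the person) and a video cue (lips moving). A naive-Bayes sketch with made-up probabilities, much simpler than the paper's full network:

```python
# Minimal naive-Bayes fusion of audio and video cues for speech-event
# detection; all probabilities here are invented for illustration.
def speech_event_posterior(sound_from_person, lips_moving,
                           prior=0.3,
                           p_audio=(0.9, 0.2),   # P(cue | speech), P(cue | no speech)
                           p_video=(0.8, 0.1)):
    def likelihood(cue, probs):
        p1, p0 = probs
        return (p1 if cue else 1 - p1), (p0 if cue else 1 - p0)
    a1, a0 = likelihood(sound_from_person, p_audio)
    v1, v0 = likelihood(lips_moving, p_video)
    joint1 = prior * a1 * v1                      # speech event
    joint0 = (1 - prior) * a0 * v0                # no speech event
    return joint1 / (joint1 + joint0)

print(speech_event_posterior(True, True))   # cues agree: high posterior (~0.94)
print(speech_event_posterior(True, False))  # cues conflict: lower posterior (~0.30)
```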

Patent
05 Dec 2004
TL;DR: A handheld device with both large-vocabulary speech recognition and audio recording allows users to switch between at least two of the following three modes: (1) recording audio without corresponding speech recognition; (2) recording with speech recognition; and (3) speech recognition without audio recording.
Abstract: A handheld device with both large-vocabulary speech recognition and audio recording allows users to switch between at least two of the following three modes: (1) recording audio without corresponding speech recognition; (2) recording with speech recognition; and (3) speech recognition without audio recording. A handheld device with both large-vocabulary speech recognition and audio recording enables a user to select a portion of previously recorded sound and have speech recognition performed upon it. A system enables a user to search for a text label associated with portions of unrecognized recorded sound by uttering the label's words. A large-vocabulary system allows users to switch between playing back recorded audio and speech recognition with a single input, with successive audio playbacks automatically starting slightly before the end of the prior playback. Also described is a cell phone that allows both large-vocabulary speech recognition and audio recording and playback.

Patent
28 Apr 2004
TL;DR: In this article, a speech feature vector for a voice associated with a source of a text message was determined and compared to speaker models, and a speaker model was selected as a preferred match for the voice based on the comparison.
Abstract: A method of generating speech from text messages includes determining a speech feature vector for a voice associated with a source of a text message, and comparing the speech feature vector to speaker models. The method also includes selecting one of the speaker models as a preferred match for the voice based on the comparison, and generating speech from the text message based on the selected speaker model.
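
A minimal sketch of the selection step, assuming mean feature vectors per speaker model and cosine similarity as the comparison (the patent does not commit to a specific distance):

```python
import numpy as np

def select_speaker_model(voice_vector, models):
    """Pick the stored speaker model closest to the sender's voice;
    `models` maps model ids to mean feature vectors."""
    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return max(models, key=lambda m: cosine(voice_vector, models[m]))

models = {"low_male": np.array([1.0, 0.2, 0.1]),
          "high_female": np.array([0.1, 1.0, 0.8])}
voice = np.array([0.9, 0.3, 0.2])            # feature vector for the sender
print(select_speaker_model(voice, models))   # -> "low_male"
# TTS would then render the text message with the selected model's voice.
```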

Patent
12 Jul 2004
TL;DR: In this paper, a system, method, computer-readable medium, and computer-implemented system for optimizing allocation of speech recognition tasks among multiple speech recognizers and combining recognizer results is described.
Abstract: A system, method, computer-readable medium, and computer-implemented system for optimizing allocation of speech recognition tasks among multiple speech recognizers and combining recognizer results is described. An allocation determination is performed to allocate speech recognition among multiple speech recognizers using at least one of an accuracy-based allocation mechanism, a complexity-based allocation mechanism, and an availability-based allocation mechanism. The speech recognition is allocated among the speech recognizers based on the determined allocation. Recognizer results received from multiple speech recognizers in accordance with the speech recognition task allocation are combined.
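
The three allocation mechanisms and the combination step might be sketched as follows; the scoring fields and the majority vote are illustrative stand-ins for the patent's mechanisms:

```python
from collections import Counter

def allocate(task, recognizers):
    """Filter recognizers by the three criteria the patent names:
    availability, accuracy, and ability to handle the task complexity."""
    return [r for r in recognizers
            if r["available"]
            and r["accuracy"] >= 0.7
            and r["max_complexity"] >= task["complexity"]]

def combine(results):
    """Combine recognizer outputs by simple majority vote."""
    return Counter(results).most_common(1)[0][0]

recognizers = [
    {"name": "embedded", "available": True, "accuracy": 0.75, "max_complexity": 1},
    {"name": "server", "available": True, "accuracy": 0.92, "max_complexity": 3},
    {"name": "backup", "available": False, "accuracy": 0.90, "max_complexity": 3},
]
chosen = allocate({"complexity": 1}, recognizers)
print([r["name"] for r in chosen])                       # embedded + server
print(combine(["call home", "call home", "all foam"]))   # -> "call home"
```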

Patent
27 Jul 2004
TL;DR: In this paper, a method for providing personalized services through voice and speaker recognition is disclosed, comprising the steps of inputting, by a user, his/her voice through a wireless microphone of a remote control; if the voice is input, recognizing the input voice and the speaker that has input the voice; determining a command based on the input voice; and providing a service according to the determination results.
Abstract: Disclosed is an audio/video apparatus for providing personalized services to a user through voice and speaker recognition, wherein when the user inputs his/her voice through a wireless microphone of a remote control, the voice recognition and speaker recognition for the input voice are performed and determination on a command corresponding to the input voice is made, thereby providing the user's personalized services to the user. Further, disclosed is a method for providing personalized services through voice and speaker recognition, comprising the steps of inputting, by a user, his/her voice through a wireless microphone of a remote control; if the voice is input, recognizing the input voice and the speaker that has input the voice; determining a command based on the input voice; and providing a service according to the determination results.

Patent
29 Mar 2004
TL;DR: In this paper, a mobile telephone set with a speech recognition function is presented, which can be prevented from being used by others without complicated operation by using an input deciding function to decide whether the speech signal recognized by the speech recognition part is matched with a previously registered reset code or not.
Abstract: PROBLEM TO BE SOLVED: To provide a mobile telephone provided with a speech recognition function which can be prevent from being used by others without complicated operation. SOLUTION: This mobile telephone set 100 provided with the speech recognition function comprises: a speech recognition part 110 which recognizes a speech signal; and a control part 111 provided with an input deciding function of deciding whether the speech signal recognized by the speech recognition part 110 is matched with a previously registered reset code or not and a resetting function of resetting a locked function when the decision part decides that the speech signal matches the reset code. In this configuration, the mobile telephone set can be prevented from being used by others without complicated operation. COPYRIGHT: (C)2006,JPO&NCIPI

Patent
09 Jan 2004
TL;DR: In this article, an audio-visual speech activity recognition system (200b/c) of a video-enabled telecommunication device is presented, which runs a real-time lip tracking application that can advantageously be used for a near-speaker detection algorithm in an environment where a speaker's voice is interfered with by a statistically distributed background noise (n'(t)) including both environmental noise and surrounding persons' voices.
Abstract: The present invention generally relates to the field of noise reduction systems which are equipped with an audio-visual user interface, in particular to an audio-visual speech activity recognition system (200b/c) of a video-enabled telecommunication device which runs a real-time lip tracking application that can advantageously be used for a near-speaker detection algorithm in an environment where a speaker's voice is interfered with by a statistically distributed background noise (n'(t)) including both environmental noise (n(t)) and surrounding persons' voices.

PatentDOI
TL;DR: In this paper, a voice level indicator is presented to the operator to help the operator keep his or her voice in the ideal range of the speech engine and a database of common words attempting to complete the word before all the letters have been input.
Abstract: A voice transcription system employing a speech engine to transcribe spoken words, detects the spelled entry of words via keyboard or voice to invoke a database of common words attempting to complete the word before all the letters have been input. This database is separate from the database of words used by the speech engine. A voice level indicator is presented to the operator to help the operator keep his or her voice in the ideal range of the speech engine.
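
Prefix completion against a sorted common-word list is one plausible way to implement the early-completion behavior; a minimal sketch:

```python
import bisect

def complete_word(prefix, common_words):
    """Propose completions from a sorted common-word list as letters
    arrive, before the word is fully spelled out."""
    i = bisect.bisect_left(common_words, prefix)
    out = []
    while i < len(common_words) and common_words[i].startswith(prefix):
        out.append(common_words[i])
        i += 1
    return out

words = sorted(["transcribe", "transcription", "transfer", "voice"])
print(complete_word("transc", words))  # -> ['transcribe', 'transcription']
```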

Journal ArticleDOI
TL;DR: Two techniques for handling convolutional distortion with ‘missing data’ speech recognition using spectral features and a method for handling reverberated speech which attempts to identify time-frequency regions that are not badly contaminated by reverberation and have strong speech energy are proposed.

Patent
Uma Arun1
22 Sep 2004
TL;DR: In this paper, a method of configuring a speech recognition unit in a vehicle is described, which includes receiving a noise error from the speech recognition system responsive to a user voice command and reducing a confidence threshold for an appropriate grammar set responsive to the received noise error.
Abstract: A method of configuring a speech recognition unit in a vehicle. The method includes receiving a noise error from the speech recognition unit responsive to a user voice command and reducing a confidence threshold for an appropriate grammar set responsive to the received noise error.
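
The method reduces to a small state update on each noise error; the step size and floor below are invented for illustration:

```python
def adjust_threshold(thresholds, grammar_set, noise_error,
                     step=0.05, floor=0.3):
    """On a noise error from the recognizer, lower the confidence
    threshold for the active grammar set so valid commands are less
    likely to be rejected in the noisy cabin."""
    if noise_error:
        thresholds[grammar_set] = max(floor, thresholds[grammar_set] - step)
    return thresholds[grammar_set]

thresholds = {"navigation": 0.6, "radio": 0.6}
print(adjust_threshold(thresholds, "navigation", noise_error=True))  # 0.55
```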

Proceedings ArticleDOI
17 May 2004
TL;DR: New hardware prototypes that integrate several heterogeneous sensors into a single headset are presented and the underlying DSP techniques for robust speech detection, enhancement and recognition in highly non-stationary noisy environments are described.
Abstract: In this paper, we present new hardware prototypes that integrate several heterogeneous sensors into a single headset and describe the underlying DSP techniques for robust speech detection, enhancement and recognition in highly non-stationary noisy environments. We also speculate other business uses with this type of device.

PatentDOI
TL;DR: In this article, the authors describe methods and systems for reducing the audible gap in concatenated recorded speech, resulting in more natural sounding speech in voice applications, including phone-based and non-phone-based applications.
Abstract: Described are methods and systems for reducing the audible gap in concatenated recorded speech, resulting in more natural sounding speech in voice applications. The sound of concatenated, recorded speech is improved by also coarticulating the recorded speech. The resulting message is smooth, natural sounding and lifelike. Existing libraries of regularly recorded bulk prompts can be used by coarticulating the user interface prompt occurring just before the bulk prompt. Applications include phone-based applications as well as non-phone-based applications.

Proceedings ArticleDOI
17 May 2004
TL;DR: The results suggest that combining missing data techniques with RNN enhancement is an effective enhancement scheme, resulting in a 16 dB background noise reduction for all input signal-to-noise ratio (SNR) conditions from -5 to 20 dB, improved spectral quality, and robust automatic speech recognition performance.
Abstract: This paper presents an application of missing data techniques in speech enhancement. The enhancement system consists of two stages: the first stage uses a recurrent neural network (RNN), which is supplied with noisy speech and produces enhanced speech; the second stage uses missing data techniques to further improve the quality of the enhanced speech. The results suggest that combining missing data techniques with RNN enhancement is an effective enhancement scheme, resulting in a 16 dB background noise reduction for all input signal-to-noise ratio (SNR) conditions from -5 to 20 dB, improved spectral quality, and robust automatic speech recognition performance.
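
The second stage can be sketched as local-SNR masking of the first stage's output; the RNN itself is not shown, and the 3 dB threshold is illustrative:

```python
import numpy as np

def missing_data_mask(power_spec, noise_floor, snr_db=3.0):
    """Mark time-frequency cells whose local SNR after first-stage (RNN)
    enhancement is still poor as 'missing', so a missing-data recognizer
    can ignore or impute them."""
    snr = 10 * np.log10(power_spec / np.maximum(noise_floor, 1e-12))
    return snr > snr_db                     # True = reliable, False = missing

enhanced = np.random.rand(64, 100) ** 2     # stand-in for RNN-enhanced |spectrum|^2
noise = np.full((64, 1), 0.05)              # stand-in noise-floor estimate
mask = missing_data_mask(enhanced, noise)
print(mask.mean())                          # fraction of cells kept as reliable
```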

Journal ArticleDOI
TL;DR: This paper addresses the problem of quantitatively evaluating the quality of a speech stream transported over the Internet, as perceived by the end user, by using G-networks as neural networks to learn how humans react vis-à-vis a speech signal that has been distorted by encoding and transmission impairments.

Patent
26 May 2004
TL;DR: In this paper, a speech synthesis system receives a text string from either a telephony network or a data network and determines whether a rendered audio file of the text string is stored in a database.
Abstract: An approach providing the efficient use of speech synthesis in rendering text content as audio in a communications network. The communications network can include a telephony network and a data network in support of, for example, Voice over Internet Protocol (VoIP) services. A speech synthesis system receives a text string from either a telephony network or a data network. The speech synthesis system determines whether a rendered audio file of the text string is stored in a database and, if not, renders the text string to produce the audio file. The rendered audio file is stored in the database for re-use, keyed by a hash value generated by the speech synthesis system from the text string.
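
The caching logic is straightforward; a minimal sketch using SHA-256 as the hash (the patent does not specify a hash function) and an in-memory dict standing in for the database:

```python
import hashlib

audio_cache = {}   # hash of text -> rendered audio bytes (a database in practice)

def render_text(text, synthesize):
    """Render text to audio once, then re-use the stored file; the cache
    key is a hash value generated from the text string."""
    key = hashlib.sha256(text.encode("utf-8")).hexdigest()
    if key not in audio_cache:
        audio_cache[key] = synthesize(text)   # expensive TTS call
    return audio_cache[key]

fake_tts = lambda t: b"PCM:" + t.encode()     # stand-in synthesizer
render_text("Your call is important to us.", fake_tts)
render_text("Your call is important to us.", fake_tts)  # served from cache
print(len(audio_cache))                       # 1: second call hit the cache
```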

Patent
Xuedong Huang1
22 Mar 2004
TL;DR: In this article, a method of performing speech recognition, and a mobile computing device (10) implementing the same, is disclosed, which includes receiving audible speech at a microphone (17) of the mobile computing devices (10).
Abstract: A method of performing speech recognition, and a mobile computing device (10) implementing the same, are disclosed. The method includes receiving audible speech at a microphone (17) of the mobile computing device (10). The audible speech is converted into speech signals at the mobile computing device (10). Also at the mobile computing device (10), preliminary and secondary speech recognition functions are performed on the speech signals to obtain requests for results from modules. Then, the requests for results are transmitted from the mobile computing device (10) to a second computing device (12) located remotely from the mobile computing device (10) to obtain the results which are then transmitted back to the mobile computing device (10) for completion of the speech recognition process.

Journal ArticleDOI
01 Jul 2004
TL;DR: Experimental results show that among the geometric visual features analyzed, lip vertical aperture is the most relevant; and the visual feature vector formed by vertical and horizontal lip apertures and the first-order derivative of the lip corner angle leads to the best recognition results.
Abstract: Audio-visual speech recognition employing both acoustic and visual speech information is a novel extension of acoustic speech recognition, and it significantly improves the recognition accuracy in noisy environments. Although various audio-visual speech-recognition systems have been developed, a rigorous and detailed comparison of the potential geometric visual features from speakers' faces is essential. Thus, in this paper the geometric visual features are compared and analyzed rigorously for their importance in audio-visual speech recognition. Experimental results show that among the geometric visual features analyzed, lip vertical aperture is the most relevant, and the visual feature vector formed by vertical and horizontal lip apertures and the first-order derivative of the lip corner angle leads to the best recognition results. Speech signals are modeled by hidden Markov models (HMMs), and using the optimized HMMs and geometric visual features the accuracy of acoustic-only, visual-only, and audio-visual speech recognition methods is compared. The audio-visual speech recognition scheme has a much improved recognition accuracy compared to acoustic-only and visual-only speech recognition, especially at high noise levels. The experimental results showed that a set of as few as three labial geometric features is sufficient to improve the recognition rate by as much as 20% (from 62%, with acoustic-only information, to 82%, with audio-visual information at a signal-to-noise ratio of 0 dB).
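
A sketch of the winning three-feature vector, assuming four tracked lip landmarks (top, bottom, left, right); the landmark layout and the angle definition are assumptions of this illustration, not the paper's exact tracker:

```python
import numpy as np

def visual_feature_vector(lip_points, prev_corner_angle, dt=1 / 30):
    """Build the three-feature vector the study found best: vertical
    aperture, horizontal aperture, and the first-order derivative of
    the lip corner angle. `lip_points` holds (x, y) for the top,
    bottom, left, and right lip landmarks of one video frame."""
    top, bottom, left, right = (np.asarray(p, float) for p in lip_points)
    vertical = np.linalg.norm(top - bottom)       # lip vertical aperture
    horizontal = np.linalg.norm(right - left)     # lip horizontal aperture
    corner = np.arctan2(top[1] - left[1], top[0] - left[0])  # lip corner angle
    d_corner = (corner - prev_corner_angle) / dt  # first-order derivative
    return np.array([vertical, horizontal, d_corner]), corner

feats, angle = visual_feature_vector([(50, 40), (50, 60), (30, 50), (70, 50)], 0.4)
print(feats)   # fed to the HMMs alongside the acoustic features
```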

Patent
27 Aug 2004
TL;DR: In this paper, a new type of speech recognition system is described which accomplishes these tasks by having an optimization unit regulate the reception characteristic of a directionally selective microphone array with regard to the respective direction from which speech signals are received.
Abstract: Voice command operated systems are being installed in modern motor vehicles with increasing frequency. Such systems should be operable by various vehicle occupants and from various seating positions. A new type of speech recognition system is described which is good at accomplishing these tasks. Therein, for regulating the speech recognition system, the reception characteristic of a directionally selective microphone array (12) is controlled by an optimization unit (10). These speech signals are then processed in a speech recognizer (11) at least parallel in time. Then, on the basis of the results provided by the speech recognizer (11), the reception characteristic of the microphone array (12) is so controlled by the optimization unit (10) that the recognition performance of the speech recognizer (11) downstream of the optimization unit (10) is optimized. Herein, the speech recognizer is supplied with the received speech signals from different spatial directions in parallel, or at least quasi-parallel, via the speech channels (14), so that it selects and further processes those speech signals that have the potential for the best possible recognition performance. On the basis of the recognition results, the optimization unit (10) obtains via the speech recognizer the necessary regulatory signals (18) in order to optimize the reception characteristic of the microphone array (12) with regard to the respective direction from which the speech signals with the potential for the best possible recognition performance are received.
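
The closed loop can be sketched as scoring parallel beam directions with the recognizer and steering toward the best one; the confidence proxy below is a toy stand-in for the recognizer's actual score:

```python
import numpy as np

def steer_for_recognition(beams, recognize_confidence):
    """Decode the signals of several beam directions in parallel and
    steer the array toward the direction whose hypothesis scores best;
    `recognize_confidence` stands in for the speech recognizer."""
    scores = {angle: recognize_confidence(sig) for angle, sig in beams.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]

# Stand-in: beams toward three seats; the speaker's seat yields the
# strongest signal and hence the best recognition confidence.
beams = {-40: np.random.randn(100),
         0: np.random.randn(100) + 1.0,
         40: np.random.randn(100)}
confidence = lambda sig: float(np.mean(sig))      # toy proxy for ASR confidence
print(steer_for_recognition(beams, confidence))   # likely (0, ~1.0)
```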