Showing papers on "Voice activity detection published in 2004"

PDF

Open Access

Journal Article•DOI•

Efficient voice activity detection algorithms using long-term speech information

[...]

Javier Ramírez¹, José C. Segura¹, M. Carmen Benítez¹, Angel de la Torre¹, Antonio J. Rubio¹ - Show less +1 more•Institutions (1)

University of Granada¹

01 Apr 2004-Speech Communication

TL;DR: A new VAD algorithm for improving speech detection robustness in noisy environments and the performance of speech recognition systems is presented, which formsulates the speech/non-speech decision rule by comparing the long-term spectral envelope to the average noise spectrum, thus yielding a high discriminating decision rule and minimizing the average number of decision errors.

...read moreread less

412 citations

Journal Article•DOI•

Seeing to hear better: evidence for early audio-visual interactions in speech identification.

[...]

Jean-Luc Schwartz¹, Frédéric Berthommier¹, Christophe Savariaux¹•Institutions (1)

Stendhal University¹

01 Sep 2004-Cognition

TL;DR: An original paradigm is used to show that seeing the speaker's lips enables the listener to hear better and hence to understand better, and this early contribution to audio-visual speech identification is discussed in relationships with recent neurophysiological data on audio- visual perception.

...read moreread less

330 citations

Patent•

Method and apparatus for speech recognition using a dynamic vocabulary

[...]

Anand Venkataraman¹, Horacio Franco², Douglas A. Bercow•Institutions (2)

SRI International¹, Nuance Communications²

05 Aug 2004

TL;DR: In this article, a method and apparatus for performing speech recognition using a dynamic vocabulary is described, where results from a preliminary speech recognition pass can be used to update or refine a language model in order to improve the accuracy of search results and to simplify subsequent recognition passes.

...read moreread less

Abstract: A method and apparatus are provided for performing speech recognition using a dynamic vocabulary. Results from a preliminary speech recognition pass can be used to update or refine a language model in order to improve the accuracy of search results and to simplify subsequent recognition passes. This iterative process greatly reduces the number of alternative hypotheses produced during each speech recognition pass, as well as the time required to process subsequent passes, making the speech recognition process faster, more efficient and more accurate. The iterative process is characterized by the use of results from one or more data set queries, where the keys used to query the data set, as well as the queries themselves, are constructed in a manner that produces more effective language models for use in subsequent attempts at decoding a given speech signal.

...read moreread less

191 citations

Patent•

Device control device, speech recognition device, agent device, on-vehicle device control device, navigation device, audio device, device control method, speech recognition method, agent processing method, on-vehicle device control method, navigation method, and audio device control method, and prog

[...]

Yasushi Sato

24 Dec 2004

TL;DR: In this paper, a speech recognition unit performs speech recognition on a speech of an utter input by a speech input unit, specifies possible words which are represented by the speech, and the scores thereof, and a natural language analyzer (3) specifies parts of speech of the words and supplies word data representing the words to an agent processing unit.

...read moreread less

Abstract: A speech recognition unit (2) performs speech recognition on a speech of an utterer input by a speech input unit (1), specifies possible words which are represented by the speech, and the scores thereof, and a natural language analyzer (3) specifies parts of speech of the words and supplies word data representing the words to an agent processing unit (7). The agent processing unit (7) stores process item data which defines a data acquisition process to acquire word data or the like, a discrimination process, and an input/output process, and wires or data defining transition from one process to another and giving a weighting factor to the transition, and executes a flow represented generally by the process item data and the wires to thereby control devices belonging to an input/output target device group (6) in such a way as to adequately grasp a demand of the utterer and meet the demand.

...read moreread less

181 citations

Journal Article•DOI•

Auditory speech detection in noise enhanced by lipreading

[...]

Lynne E. Bernstein¹, Lynne E. Bernstein², Edward T. Auer¹, Sumiko Takayanagi¹•Institutions (2)

House Ear Institute¹, National Science Foundation²

01 Oct 2004-Speech Communication

TL;DR: This study investigated whether this enhancement effect of acoustic speech in noise is specific to visual speech stimuli or can rely on more generic non-speech visual stimulus properties.

...read moreread less

175 citations

Proceedings Article•DOI•

Augmenting conversations using dual-purpose speech

[...]

Kent Lyons¹, Christopher L. Skeels¹, Thad Starner¹, Cornelis M. Snoeck¹, Benjamin A. Wong¹, Daniel Ashbrook¹ - Show less +2 more•Institutions (1)

Georgia Institute of Technology¹

24 Oct 2004

TL;DR: Three applications that utilize dual-purpose speech to assist a user in conversational tasks: the Calendar Navigator Agent, DialogTabs, and Speech Courier are presented.

...read moreread less

Abstract: In this paper, we explore the concept of dual-purpose speech: speech that is socially appropriate in the context of a human-to-human conversation which also provides meaningful input to a computer. We motivate the use of dual-purpose speech and explore issues of privacy and technological challenges related to mobile speech recognition. We present three applications that utilize dual-purpose speech to assist a user in conversational tasks: the Calendar Navigator Agent, DialogTabs, and Speech Courier. The Calendar Navigator Agent navigates a user's calendar based on socially appropriate speech used while scheduling appointments. DialogTabs allows a user to postpone cognitive processing of conversational material by proving short-term capture of transient information. Finally, Speech Courier allows asynchronous delivery of relevant conversational information to a third party.

...read moreread less

156 citations

Book•

Speech Separation by Humans and Machines

[...]

Pierre Divenyi

01 Jan 2004

TL;DR: This work focuses on Speech Segregation using an Event-synchronous Auditory Image and STRAIGHT, and on Ideal Binary Mask As the Computational Goal of Auditory Scene Analysis.

...read moreread less

Abstract: Speech Segregation: Problems and Perspectives.- Auditory Scene Analysis.- Speech separation.- Recurrent Timing Nets for F0-based Speaker Separation.- Blind Source Separation Using Graphical Models.- Speech Recognizer Based Maximum Likelihood Beamforming.- Exploiting Redundancy to Construct Listening Systems.- Automatic Speech Processing by Inference in Generative Models.- Signal Separation Motivated by Human Auditory Perception: Applications to Automatic Speech Recognition.- Speech Segregation Using an Event-synchronous Auditory Image and STRAIGHT.- Underlying Principles of a High-quality Speech Manipulation System STRAIGHT and Its Application to Speech Segregation.- On Ideal Binary Mask As the Computational Goal of Auditory Scene Analysis.- The History and Future of CASA.- Techniques for Robust Speech Recognition in Noisy and Reverberant Conditions.- Source Separation, Localization, and Comprehension in Humans, Machines, and Human-machine Systems.- The Cancellation Principle in Acoustic Scene Analysis.- Informational and Energetic Masking Effects in Multitalker Speech Perception.- Masking the Feature Information In Multi-stream Speech-analogue Displays.- Interplay Between Visual and Audio Scene Analysis.- Evaluating Speech Separation Systems.- Making Sense of Everyday Speech: a Glimpsing Account.

...read moreread less

150 citations

Journal Article•DOI•

Likelihood-maximizing beamforming for robust hands-free speech recognition

[...]

Michael L. Seltzer, Bhiksha Raj¹, Richard M. Stern²•Institutions (2)

Mitsubishi Electric¹, Carnegie Mellon University²

16 Aug 2004-IEEE Transactions on Speech and Audio Processing

TL;DR: A new approach to microphone-array processing is proposed in which the goal of the array processing is not to generate an enhanced output waveform but rather to generate a sequence of features which maximizes the likelihood of generating the correct hypothesis.

...read moreread less

Abstract: Speech recognition performance degrades significantly in distant-talking environments, where the speech signals can be severely distorted by additive noise and reverberation. In such environments, the use of microphone arrays has been proposed as a means of improving the quality of captured speech signals. Currently, microphone-array-based speech recognition is performed in two independent stages: array processing and then recognition. Array processing algorithms, designed for signal enhancement, are applied in order to reduce the distortion in the speech waveform prior to feature extraction and recognition. This approach assumes that improving the quality of the speech waveform will necessarily result in improved recognition performance and ignores the manner in which speech recognition systems operate. In this paper a new approach to microphone-array processing is proposed in which the goal of the array processing is not to generate an enhanced output waveform but rather to generate a sequence of features which maximizes the likelihood of generating the correct hypothesis. In this approach, called likelihood-maximizing beamforming, information from the speech recognition system itself is used to optimize a filter-and-sum beamformer. Speech recognition experiments performed in a real distant-talking environment confirm the efficacy of the proposed approach.

...read moreread less

147 citations

Patent•

Combined speech recongnition and text-to-speech generation

[...]

Daniel L. Roth, Jordan Cohen, David F. Johnston, Manfred G. Grabherr, Edward W. Porter - Show less +1 more

24 Sep 2004

TL;DR: Text-to-speech (TTS) generation is used in conjunction with large vocabulary speech recognition to automatically repeat text recognized by the speech recognition after each of a succession of end-of- utterance detections as discussed by the authors.

...read moreread less

Abstract: Text-to-speech (TTS) generation is used in conjunction with large vocabulary speech recognition to say words selected by the speech recognition. The software for performing the large vocabulary speech recognition can share speech modeling data with the TTS software. TTS or recorded audio can be used to automatically say both recognized text and the names of recognized commands after their recognition. The TTS can automatically repeats text recognized by the speech recognition after each of a succession of end of utterance detections. A user can move a cursor back or forward in recognized text, and the TTS can speak one or more words at the cursor location after each such move. The speech recognition can be used to produces a choice list of possible recognition candidates and the TTS can be used to provide spoken output of one or more of the candidates on the choice list.

...read moreread less

142 citations

Patent•DOI•

Client-server speech processing system, apparatus, method, and storage medium

[...]

Teruhiko Ueyama¹, Yasuhiro Komori¹, Tetsuo Kosaka¹, Masayuki Yamada¹, Akihiro Kushida¹ - Show less +1 more•Institutions (1)

Canon Inc.¹

04 Oct 2004-Journal of the Acoustical Society of America

TL;DR: In this article, the system implements high-accuracy speech recognition while suppressing the amount of data transfer between the client and the server, where the client receives the compression-encoded speech parameters, a speech processing unit makes speech recognition of the compressed speech parameters and sends information corresponding to the speech recognition result to the client.

...read moreread less

Abstract: The system implements high-accuracy speech recognition while suppressing the amount of data transfer between the client and server For this purpose, the client compression-encodes speech parameters by a speech processing unit, and sends the compression-encoded speech parameters to the server The server receives the compression-encoded speech parameters, a speech processing unit makes speech recognition of the compression-encoded speech parameters, and sends information corresponding to the speech recognition result to the client

...read moreread less

136 citations

Proceedings Article•DOI•

A segment-based audio-visual speech recognizer: data collection, development, and initial experiments

[...]

Timothy J. Hazen¹, Kate Saenko¹, Chia-Hao La¹, James Glass¹•Institutions (1)

Massachusetts Institute of Technology¹

13 Oct 2004

TL;DR: The development and evaluation of a speaker-independent audio-visual speech recognition (AVSR) system that utilizes a segment-based modeling strategy using segment-constrained Hidden Markov Models (HMMs).

...read moreread less

Abstract: This paper presents the development and evaluation of a speaker-independent audio-visual speech recognition (AVSR) system that utilizes a segment-based modeling strategy. To support this research, we have collected a new video corpus, called Audio-Visual TIMIT (AV-TIMIT), which consists of 4 total hours of read speech collected from 223 different speakers. This new corpus was used to evaluate our new AVSR system which incorporates a novel audio-visual integration scheme using segment-constrained Hidden Markov Models (HMMs). Preliminary experiments have demonstrated improvements in phonetic recognition performance when incorporating visual information into the speech recognition process.

...read moreread less

Journal Article•DOI•

Phase-based dual-microphone robust speech enhancement

[...]

Parham Aarabi¹, Guangji Shi¹•Institutions (1)

University of Toronto¹

01 Aug 2004

TL;DR: It is shown that by masking the TF representation of the speech signals, the noise components are distorted beyond recognition while the speech source of interest maintains its perceptual quality.

...read moreread less

Abstract: A dual-microphone speech-signal enhancement algorithm, utilizing phase-error based filters that depend only on the phase of the signals, is proposed. This algorithm involves obtaining time-varying, or alternatively, time-frequency (TF), phase-error filters based on prior knowledge regarding the time difference of arrival (TDOA) of the speech source of interest and the phases of the signals recorded by the microphones. It is shown that by masking the TF representation of the speech signals, the noise components are distorted beyond recognition while the speech source of interest maintains its perceptual quality. This is supported by digit recognition experiments which show a substantial recognition accuracy rate improvement over prior multimicrophone speech enhancement algorithms. For example, for a case with two speakers with a 0.1 s reverberation time, the phase-error based technique results in a 28.9% recognition rate gain over the single channel noisy signal, a gain of 22.0% over superdirective beamforming, and a gain of 8.5% over postfiltering.

...read moreread less

Proceedings Article•DOI•

Robust speech interface based on audio and video information fusion for humanoid HRP-2

[...]

I. Hara¹, Futoshi Asano¹, Hideki Asoh¹, Jun Ogata¹, Naoyuki Ichimura¹, Yoshihiro Kawai, Fumio Kanehiro, Hirohisa Hirukawa, K. Yamamoto - Show less +5 more•Institutions (1)

National Institute of Advanced Industrial Science and Technology¹

28 Sep 2004

TL;DR: In this method, audio information and video information are fused by a Bayesian network to enable the detection of speech events and the information of detected speech events is utilized in sound separation using adaptive beam forming.

...read moreread less

Abstract: For cooperative work of robots and humans in the real world, a communicative function based on speech is indispensable for robots. To realize such a function in a noisy real environment, it is essential that robots be able to extract target speech spoken by humans from a mixture of sounds by their own resources. We have developed a method of detecting and extracting speech events based on the fusion of audio and video information. In this method, audio information (sound localization using a microphone array) and video information (human tracking using a camera) are fused by a Bayesian network to enable the detection of speech events. The information of detected speech events is then utilized in sound separation using adaptive beam forming. In this paper, some basic investigations for applying the above system to the humanoid robot HRP-2 are reported. Input devices, namely a microphone array and a camera, were mounted on the head of HRP-2, and acoustic characteristics for sound localization/separation performance were investigated. Also, the human tracking system was improved so that it can be used in a dynamic situation. Finally, overall performance of the system was tested via off-line experiments.

...read moreread less

Patent•

Combined speech recognition and sound recording

[...]

Daniel L. Roth¹, Jordan Cohen, David F. Johnston, Edward W. Porter•Institutions (1)

Nuance Communications¹

05 Dec 2004

TL;DR: A handheld device with both large-vocabulary speech recognition and audio recoding allows users to switch between at least two of the following three modes: (1) recording audio without corresponding speech recognition, (2) recording with speech recognition; and (3) speech recognition without audio recording.

...read moreread less

Abstract: A handheld device with both large-vocabulary speech recognition and audio recoding allows users to switch between at least two of the following three modes: (1) recording audio without corresponding speech recognition; (2) recording with speech recognition; and (3) speech recognition without audio recording. A handheld device with both large-vocabulary speech recognition and audio recoding enables a user to select a portion of previously recorded sound and have speech recognition performed upon it. A system enables a user to search for a text label associated with portions of unrecognized recorded sound by uttering the label's words. A large-vocabulary system allows users to switch between playing back recorded audio and speech recognition with a single input, with successive audio playbacks automatically starting slightly before the end of prior playback. And a cell phone that allows both large-vocabulary speech recognition and audio recording and playback.

...read moreread less

Patent•

Source-dependent text-to-speech system

[...]

Nicholas J. Cutaia¹•Institutions (1)

Cisco Systems, Inc.¹

28 Apr 2004

TL;DR: In this article, a speech feature vector for a voice associated with a source of a text message was determined and compared to speaker models, and a speaker model was selected as a preferred match for the voice based on the comparison.

...read moreread less

Abstract: A method of generating speech from textmessages includes determining a speech feature vector for a voice associated with a source of a text message, and comparing the speech feature vector to speaker models. The method also includes selecting one of the speaker models as a preferred match for the voice based on the comparison, and generating speech from the text message based on the selected speaker model.

...read moreread less

Patent•

Allocation of speech recognition tasks and combination of results thereof

[...]

Paul Michael Burke¹, Sherif Yacoub¹•Institutions (1)

Hewlett-Packard¹

12 Jul 2004

TL;DR: In this paper, a system, method, computer-readable medium, and computer-implemented system for optimizing allocation of speech recognition tasks among multiple speech recognizers and combining recognizer results is described.

...read moreread less

Abstract: A system, method, computer-readable medium, and computer-implemented system for optimizing allocation of speech recognition tasks among multiple speech recognizers and combining recognizer results is described. An allocation determination is performed to allocate speech recognition among multiple speech recognizers using at least one of an accuracy-based allocation mechanism, a complexity-based allocation mechanism, and an availability-based allocation mechanism. The speech recognition is allocated among the speech recognizers based on the determined allocation. Recognizer results received from multiple speech recognizers in accordance with the speech recognition task allocation are combined.

...read moreread less

Patent•

Audio/video apparatus and method for providing personalized services through voice and speaker recognition

[...]

Choi Seung Eok¹, Sun-wha Chung¹, In-sik Myung¹, Jung-Bong Lee¹•Institutions (1)

Samsung¹

27 Jul 2004

TL;DR: In this paper, a method for providing personalized services through voice and speaker recognition, comprising the steps of inputting, by a user, his/her voice through a wireless microphone of a remote control; if the voice is input, recognizing the input voice and the speaker that has input the voice; determining a command based on input voice; and providing a service according to the determination results.

...read moreread less

Abstract: Disclosed is an audio/video apparatus for providing personalized services to a user through voice and speaker recognition, wherein when the user inputs his/her voice through a wireless microphone of a remote control, the voice recognition and speaker recognition for the input voice are performed and determination on a command corresponding to the input voice is made, thereby providing the user's personalized services to the user. Further, disclosed is a method for providing personalized services through voice and speaker recognition, comprising the steps of inputting, by a user, his/her voice through a wireless microphone of a remote control; if the voice is input, recognizing the input voice and the speaker that has input the voice; determining a command based on the input voice; and providing a service according to the determination results.

...read moreread less

Patent•

Electronic apparatus, lock function release method therefor and program therefor

[...]

Yoshiaki Yamada, 義昭山田

29 Mar 2004

TL;DR: In this paper, a mobile telephone set with a speech recognition function is presented, which can be prevented from being used by others without complicated operation by using an input deciding function to decide whether the speech signal recognized by the speech recognition part is matched with a previously registered reset code or not.

...read moreread less

Abstract: PROBLEM TO BE SOLVED: To provide a mobile telephone provided with a speech recognition function which can be prevent from being used by others without complicated operation. SOLUTION: This mobile telephone set 100 provided with the speech recognition function comprises: a speech recognition part 110 which recognizes a speech signal; and a control part 111 provided with an input deciding function of deciding whether the speech signal recognized by the speech recognition part 110 is matched with a previously registered reset code or not and a resetting function of resetting a locked function when the decision part decides that the speech signal matches the reset code. In this configuration, the mobile telephone set can be prevented from being used by others without complicated operation. COPYRIGHT: (C)2006,JPO&NCIPI

...read moreread less

Patent•

Noise reduction and audio-visual speech activity detection

[...]

Morio c o Sony Ericsson Mobile Taneda¹•Institutions (1)

Ericsson Mobile Communications¹

09 Jan 2004

TL;DR: In this article, an audio-visual speech activience recognition system (200b/c) of a video-enabled telecommunication device which runs a real-time lip tracking application that can advantageously be used for a near-speaker detection algorithm in an environment where a speaker's voice is interfered by a statistically distributed background noise (n'(t)) including both environmental noise and surrounding persons' voices.

...read moreread less

Abstract: The present invention generally relates to the field of noise reduction systems which are equipped with an audio-visual user interface, in particular to an audio-visual speech activity recognition system (200b/c) of a video-enabled telecommunication device which runs a real-time lip tracking application that can advantageously be used for a near-speaker detection algorithm in an environment where a speaker's voice is interfered by a statistically distributed background noise (n'(t)) including both environmental noise (n(t)) and surrounding persons' voices

...read moreread less

Patent•DOI•

Real-time transcription correction system

[...]

Robert M. Engelke, Kevin R. Colwell, Troy D. Vitek, Kurt M. Grittner, Jayne M. Turner, Pamela A. Frazier - Show less +2 more

13 May 2004-Journal of the Acoustical Society of America

TL;DR: In this paper, a voice level indicator is presented to the operator to help the operator keep his or her voice in the ideal range of the speech engine and a database of common words attempting to complete the word before all the letters have been input.

...read moreread less

Abstract: A voice transcription system employing a speech engine to transcribe spoken words, detects the spelled entry of words via keyboard or voice to invoke a database of common words attempting to complete the word before all the letters have been input. This database is separate from the database of words used by the speech engine. A voice level indicator is presented to the operator to help the operator keep his or her voice in the ideal range of the speech engine.

...read moreread less

Journal Article•DOI•

Techniques for handling convolutional distortion with `missing data' automatic speech recognition

[...]

Kalle J. Palomäki¹, Kalle J. Palomäki², Kalle J. Palomäki³, Guy J. Brown², Jon Barker² - Show less +1 more•Institutions (3)

Helsinki University of Technology¹, University of Sheffield², University of Helsinki³

01 Jun 2004-Speech Communication

TL;DR: Two techniques for handling convolutional distortion with ‘missing data’ speech recognition using spectral features and a method for handling reverberated speech which attempts to identify time-frequency regions that are not badly contaminated by reverberation and have strong speech energy are proposed.

...read moreread less

Patent•

Adaptive confidence thresholds in telematics system speech recognition

[...]

Uma Arun¹•Institutions (1)

General Motors¹

22 Sep 2004

TL;DR: In this paper, a method of configuring a speech recognition unit in a vehicle is described, which includes receiving a noise error from the speech recognition system responsive to a user voice command and reducing a confidence threshold for an appropriate grammar set responsive to the received noise error.

...read moreread less

Abstract: A method of configuring a speech recognition unit in a vehicle. The method includes receiving a noise error from the speech recognition unit responsive to a user voice command and reducing a confidence threshold for an appropriate grammar set responsive to the received noise error.

...read moreread less

Proceedings Article•DOI•

Multi-sensory microphones for robust speech detection, enhancement and recognition

[...]

Zhengyou Zhang¹, Zicheng Liu¹, Mike Sinclair¹, Alejandro Acero¹, Li Deng¹, Jasha Droppo¹, Xuedong Huang¹, Yanli Zheng¹ - Show less +4 more•Institutions (1)

Microsoft¹

17 May 2004

TL;DR: New hardware prototypes that integrate several heterogeneous sensors into a single headset are presented and the underlying DSP techniques for robust speech detection, enhancement and recognition in highly non-stationary noisy environments are described.

...read moreread less

Abstract: In this paper, we present new hardware prototypes that integrate several heterogeneous sensors into a single headset and describe the underlying DSP techniques for robust speech detection, enhancement and recognition in highly non-stationary noisy environments. We also speculate other business uses with this type of device.

...read moreread less

Patent•DOI•

Coarticulated concatenated speech

[...]

Scott J. Bailey, Nikko Strom

19 Nov 2004-Journal of the Acoustical Society of America

TL;DR: In this article, the authors describe methods and systems for reducing the audible gap in concatenated recorded speech, resulting in more natural sounding speech in voice applications, including phone-based and non-phone-based applications.

...read moreread less

Abstract: Described are methods and systems for reducing the audible gap in concatenated recorded speech, resulting in more natural sounding speech in voice applications. The sound of concatenated, recorded speech is improved by also coarticulating the recorded speech. The resulting message is smooth, natural sounding and lifelike. Existing libraries of regularly recorded bulk prompts can be used by coarticulating the user interface prompt occurring just before the bulk prompt. Applications include phone-based applications as well as non-phone-based applications.

...read moreread less

Proceedings Article•DOI•

Speech enhancement with missing data techniques using recurrent neural networks

[...]

Shahla Parveen¹, Phil D. Green¹•Institutions (1)

University of Sheffield¹

17 May 2004

TL;DR: The results suggest that combining missing data technique with RNN enhancement is an effective enhancement scheme resulting in a 16 dB background noise reduction for all input signal to noise ratio (SNR) conditions from -5 to 20 dB, improved spectral quality and robust automatic speech recognition performance.

...read moreread less

Abstract: This paper presents an application of missing data techniques in speech enhancement. The enhancement system consists of two stages: the first stage uses a recurrent neural network, which is supplied with noisy speech and produces enhanced speech; whereas the second stage uses missing data techniques to further improve the quality of enhanced speech. The results suggest that combining missing data technique with RNN enhancement is an effective enhancement scheme resulting in a 16 dB background noise reduction for all input signal to noise ratio (SNR) conditions from -5 to 20 dB, improved spectral quality and robust automatic speech recognition performance.

...read moreread less

Journal Article•DOI•

Performance evaluation of real-time speech through a packet network: a random neural networks-based approach

[...]

Samir A. Elsagheer Mohamed¹, Gerardo Rubino¹, Martin Varela¹•Institutions (1)

French Institute for Research in Computer Science and Automation¹

01 Jun 2004-Performance Evaluation

TL;DR: This paper addresses the problem of quantitatively evaluating the quality of a speech stream transported over the Internet as perceived by the end-user by using G-networks as Neural Networks to learn how humans react vis-a-vis a speech signal that has been distorted by encoding and transmission impairments.

...read moreread less

Patent•

Method and system for providing synthesized speech

[...]

Paul T. Schultz¹, Robert A. Sartini¹•Institutions (1)

Verizon Communications¹

26 May 2004

TL;DR: In this paper, a speech synthesis system receives a text string from either a telephony network, or a data network, and determines whether a rendered audio file of the text string is stored in a database.

...read moreread less

Abstract: An approach providing the efficient use of speech synthesis in rendering text content as audio in a communications network. The communications network can include a telephony network and a data network in support of, for example, Voice over Internet Protocol (VoIP) services. A speech synthesis system receives a text string from either a telephony network, or a data network. The speech synthesis system determines whether a rendered audio file of the text string is stored in a database and to render the text string to output the rendered audio file, if the rendered audio is determined not to exist. The rendered audio file is stored in the database for re-use according to a hash value generated by the speech synthesis system based on the text string.

...read moreread less

Patent•

Distributed speech recognition for mobile communication devices

[...]

Xuedong Huang¹•Institutions (1)

Microsoft¹

22 Mar 2004

TL;DR: In this article, a method of performing speech recognition, and a mobile computing device (10) implementing the same, is disclosed, which includes receiving audible speech at a microphone (17) of the mobile computing devices (10).

...read moreread less

Abstract: A method of performing speech recognition, and a mobile computing device (10) implementing the same, are disclosed. The method includes receiving audible speech at a microphone (17) of the mobile computing device (10). The audible speech is converted into speech signals at the mobile computing device (10). Also at the mobile computing device (10), preliminary and secondary speech recognition functions are performed on the speech signals to obtain requests for results from modules. Then, the requests for results are transmitted from the mobile computing device (10) to a second computing device (12) located remotely from the mobile computing device (10) to obtain the results which are then transmitted back to the mobile computing device (10) for completion of the speech recognition process.

...read moreread less

Journal Article•DOI•

Analysis of lip geometric features for audio-visual speech recognition

[...]

Kaynak Mustafa N¹, Qi Zhi², Adrian David Cheok², Kuntal Sengupta², Zhang Jian², Ko Chi Chung² - Show less +2 more•Institutions (2)

Arizona State University¹, National University of Singapore²

01 Jul 2004

TL;DR: Experimental results show that among the geometric visual features analyzed, lip vertical aperture is the most relevant; and the visual feature vector formed by vertical and horizontal lip apertures and the first-order derivative of the lip corner angle leads to the best recognition results.

...read moreread less

Abstract: Audio-visual speech recognition employing both acoustic and visual speech information is a novel extension of acoustic speech recognition and it significantly improves the recognition accuracy in noisy environments. Although various audio-visual speech-recognition systems have been developed, a rigorous and detailed comparison of the potential geometric visual features from speakers' faces is essential. Thus, in this paper the geometric visual features are compared and analyzed rigorously for their importance in audio-visual speech recognition. Experimental results show that among the geometric visual features analyzed, lip vertical aperture is the most relevant; and the visual feature vector formed by vertical and horizontal lip apertures and the first-order derivative of the lip corner angle leads to the best recognition results. Speech signals are modeled by hidden Markov models (HMMs) and using the optimized HMMs and geometric visual features the accuracy of acoustic-only, visual-only, and audio-visual speech recognition methods are compared. The audio-visual speech recognition scheme has a much improved recognition accuracy compared to acoustic-only and visual-only speech recognition especially at high noise levels. The experimental results showed that a set of as few as three labial geometric features are sufficient to improve the recognition rate by as much as 20% (from 62%, with acoustic-only information, to 82%, with audio-visual information at a signal-to-noise ratio of 0 dB).

...read moreread less

Patent•

Intelligent acoustic microphone fronted with speech recognizing feedback

[...]

Alfred Kaltenmeier, Klaus Linhard

27 Aug 2004

TL;DR: In this paper, a new type of speech recognition system is described, which is good at accomplishing these tasks by regulating the reception characteristic of a directionally selective microphone array, by means of the optimization unit with regard to the respective direction from which speech signals are received.

...read moreread less

Abstract: Voice command operated systems are being installed in modern motor vehicles with increasing frequency Such systems should be operable by various vehicle occupants and from various seating positions A new type of speech recognition system is described, which is good at accomplishing these tasks Therein, for regulating the speech recognition system, the reception characteristic of a a directionally selective microphone array (12) is controlled by an optimization unit (10) These speech signals are then processed in a speech recognizer (11) at least parallel in time Then, on the basis of the results provided by the speech recognizer (11), the reception characteristic of the microphone array (12) is so controlled by an optimization unit (10), that the recognition performance of the speech recognizer (11) downstream of the optimization unit (10) is optimized Herein, the speech recognizer is supplied with the received speech signals from different spatial directions parallel or at least quasi-parallel via the speech channels (14), so that this selects and further processes those speech signals that have the potential for the best possible recognition performance On the basis of the recognition results, the optimization unit (10) obtains via the speech recognizer the necessary regulatory signals (18), in order to optimize the reception characteristics of the microphone array (12) by means of the optimization unit (10) with regard to the respective direction from which speech signals are received, which have the potential for the best possible recognition performance

...read moreread less

Collapse