
Showing papers on "Voice activity detection published in 2006"


Patent
02 Feb 2006
TL;DR: In this paper, a speech recognition system receives and analyzes speech input from a user in order to recognize and accept a response from the user; under certain conditions, information about the response expected from the user may be available and is used to modify the recognizer's behavior.
Abstract: A speech recognition system receives and analyzes speech input from a user in order to recognize and accept a response from the user. Under certain conditions, information about the response expected from the user may be available. In these situations, the available information about the expected response is used to modify the behavior of the speech recognition system by taking this information into account. The modified behavior of the speech recognition system comprises adjusting the rejection threshold when speech input matches the predetermined expected response.

517 citations
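
To make the idea concrete, here is a minimal Python sketch of the expected-response adjustment: the rejection threshold is relaxed only when the hypothesis matches the predetermined expected response. The constants and function names are illustrative assumptions, not taken from the patent.

```python
# Hypothetical sketch of the patent's idea: relax the rejection threshold
# when the recognizer's hypothesis matches a known expected response.
# BASE_THRESHOLD and RELAXED_THRESHOLD are illustrative values.

BASE_THRESHOLD = 0.70     # assumed default acceptance confidence
RELAXED_THRESHOLD = 0.50  # assumed lower bar when the hypothesis was expected

def accept_hypothesis(hypothesis: str, confidence: float,
                      expected_response: str | None) -> bool:
    """Accept a recognition result, lowering the rejection threshold
    when the hypothesis matches the response we already expected."""
    threshold = BASE_THRESHOLD
    if expected_response is not None and hypothesis == expected_response:
        threshold = RELAXED_THRESHOLD  # expected match: be more permissive
    return confidence >= threshold

# Example: a warehouse picker is expected to read back check digits "42".
print(accept_hypothesis("42", 0.55, expected_response="42"))  # True
print(accept_hypothesis("42", 0.55, expected_response=None))  # False
```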


PatentDOI
TL;DR: In this paper, a real-time speech recognition system includes distributed processing across a client and server for recognizing a spoken query by a user, where the partitioning of responsibility for speech recognition operations can be done on a client-by-client or connection-by-connection basis.
Abstract: A real-time speech recognition system includes distributed processing across a client and server for recognizing a spoken query by a user. Both the client and server can dedicate a variable number of processing resources to performing speech recognition functions. The partitioning of responsibility for speech recognition operations can be done on a client-by-client or connection-by-connection basis.

279 citations


Journal ArticleDOI
TL;DR: This paper proposes a class of VAD algorithms based on several statistical models; in addition to the Gaussian model, it incorporates the complex Laplacian and Gamma probability density functions into the analysis of statistical properties.
Abstract: One of the key issues in practical speech processing is to achieve robust voice activity detection (VAD) against the background noise. Most statistical model-based approaches have employed the Gaussian assumption in the discrete Fourier transform (DFT) domain, which, however, deviates from real observations. In this paper, we propose a class of VAD algorithms based on several statistical models. In addition to the Gaussian model, we also incorporate the complex Laplacian and Gamma probability density functions into our analysis of statistical properties. With goodness-of-fit tests, we analyze the statistical properties of the DFT spectra of noisy speech under various noise conditions. Based on this statistical analysis, the likelihood ratio test under the given statistical models is established for the purpose of VAD. Since the statistical characteristics of the speech signal are affected differently by noise types and levels, our approach aims to adaptively find an appropriate statistical model in an online fashion to cope with time-varying environments. The performance of the proposed VAD approaches in both stationary and nonstationary noise environments is evaluated with the aid of an objective measure.

241 citations
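
As a concrete illustration of the likelihood ratio test under the complex Gaussian model (one of the several models the paper analyzes; the Laplacian and Gamma variants replace the per-bin likelihoods), here is a minimal Python sketch. The mean-LLR decision rule and the threshold value are assumptions for illustration.

```python
# A minimal sketch of a likelihood-ratio-test VAD under the complex
# Gaussian model. The noise variance estimate, a priori SNR estimate,
# and the threshold eta are assumed inputs.
import numpy as np

def gaussian_lrt_vad(frame: np.ndarray, noise_psd: np.ndarray,
                     snr_prior: np.ndarray, eta: float = 0.2) -> bool:
    """Return True if the frame is voice-active.

    frame:     complex DFT coefficients X_k of the noisy frame
    noise_psd: per-bin noise variance lambda_N(k)
    snr_prior: per-bin a priori SNR xi_k (e.g. decision-directed estimate)
    """
    gamma = np.abs(frame) ** 2 / noise_psd          # a posteriori SNR
    # Per-bin log likelihood ratio log p(X|H1)/p(X|H0) for complex Gaussians
    llr = gamma * snr_prior / (1.0 + snr_prior) - np.log1p(snr_prior)
    return float(np.mean(llr)) > eta                # geometric-mean decision rule

# Toy usage with white-noise statistics (illustrative values only)
rng = np.random.default_rng(0)
x = rng.normal(size=256) + 1j * rng.normal(size=256)
print(gaussian_lrt_vad(x, noise_psd=np.full(256, 2.0),
                       snr_prior=np.full(256, 0.5)))  # False: noise only
```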


Journal ArticleDOI
TL;DR: This paper introduces a method for nonintrusive assessment of speech quality for narrow-band telephony, which was approved by the International Telecommunication Union (ITU-T) in May 2004, based on models of voice production and perception.
Abstract: Objective voice quality assessment has been the subject of research for many years. Until very recently, objective models required a copy of the unprocessed signal for estimating the quality of a signal transmitted across a telecommunication network, making live call monitoring impossible. This paper introduces a method for nonintrusive assessment of speech quality for narrow-band telephony, which was approved by the International Telecommunication Union (ITU-T) in May 2004. Essentially based on models of voice production and perception, the algorithm demonstrates good performance on more than 48 subjective experiments representing most distortions that occur on voice networks.

218 citations


Patent
21 Jul 2006
TL;DR: In this paper, a method for improving the quality of a speech signal extracted from a noisy acoustic environment is provided, where a signal separation process (180) is associated with a voice activity detector (185).
Abstract: A method for improving the quality of a speech signal extracted from a noisy acoustic environment is provided. In one approach, a signal separation process (180) is associated with a voice activity detector (185). The voice activity detector (185) is a two-channel (178,182) detector, which enables particularly robust and accurate detection of voice activity. When speech is detected, the voice activity detector generates a control signal (411). The control signal (411) is used to activate, adjust, or control signal separation processes or post-processing operations (195) to improve the quality of the resulting speech signal. In another approach, a signal separation process (180) is provided as a learning stage (752) and an output stage (756). The learning stage (752) aggressively adjusts to current acoustic conditions and passes coefficients to the output stage (756). The output stage (756) adapts more slowly and generates a speech-content signal (181,770) and a noise-dominant signal (407,773). When the learning stage (752) becomes unstable, only the learning stage (752) is reset, allowing the output stage (756) to continue outputting a high-quality speech signal.

217 citations
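
The two-stage structure lends itself to a compact sketch: a fast, possibly unstable learning stage hands its coefficients to a slowly adapting output stage, so resetting the learning stage never interrupts the output. The Python below is a hedged approximation; the adaptation rule and stability test are stand-ins, not the patent's.

```python
# Hedged sketch of the two-stage idea: fast learning weights leak slowly
# into the output weights, and only the learning stage is reset on
# instability. All class and attribute names are illustrative.
import numpy as np

class TwoStageSeparator:
    def __init__(self, taps: int = 64, smooth: float = 0.02):
        self.learn_w = np.zeros(taps)   # fast, possibly unstable weights
        self.out_w = np.zeros(taps)     # slow, stable weights used for output
        self.smooth = smooth            # leakage of learned weights into output

    def update(self, learned_step: np.ndarray) -> None:
        self.learn_w += learned_step    # aggressive adaptation (stand-in rule)
        if not np.all(np.isfinite(self.learn_w)) or np.max(np.abs(self.learn_w)) > 1e3:
            self.learn_w[:] = 0.0       # reset only the learning stage
        # output stage drifts slowly toward the learned coefficients
        self.out_w += self.smooth * (self.learn_w - self.out_w)

    def filter(self, x: np.ndarray) -> np.ndarray:
        return np.convolve(x, self.out_w, mode="same")  # stable output path
```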


Book
01 Jan 2006
TL;DR: This book examines the quality elements and quality features of VoIP speech transmission and develops extensions of the E-model covering packet loss, wideband speech, and linear and non-linear distortion.
Abstract: Preface. List of Abbreviations. Introduction. 1 Speech Quality in Telephony. 1.1 Speech. 1.2 Speech Quality. 2 Speech Quality Measurement Methods. 2.1 Auditory Methods. 2.2 Instrumental Methods. 2.3 Speech Quality Measurement Methods: Summary. 3 Quality Elements and Quality Features of VoIP. 3.1 Speech Transmission Using Internet Protocol. 3.2 Overview of Quality Elements. 3.3 Quality Elements and Related Features. 3.4 Quality Dimensions. 3.5 Combined Elements and Combined Features. 3.6 Listening and Conversational Features. 3.7 Desired Nature. 3.8 Open Questions. 3.9 From Elements to Features: Modeling VoIP Speech Quality. 3.10 Quality Elements and Quality Features of VoIP: Summary. 4 Time-Varying Distortion: Quality Features and Modeling. 4.1 Microscopic Loss Behavior. 4.2 Macroscopic Loss Behavior. 4.3 Interactivity. 4.4 Packet Loss and Combined Impairments. 4.5 Time-Varying Distortion: Summary. 5 Wideband Speech, Linear and Non Linear Distortion: Quality Features and Modeling. 5.1 Wideband Speech: Improvement Over Narrowband. 5.2 Bandpass-Filtered Speech. 5.3 Wideband Codecs. 5.4 Desired Nature. 6 From Elements to Features: Extensions of the E-model. 6.1 E-model: Packet Loss. 6.2 E-model: Additivity. 6.3 E-model: Wideband, Linear and Non-Linear Distortion. 7 Summary and Conclusions. 8 Outlook. A Aspects of a Parametric Description of Time-Varying Distortion. B Simulation of Quality Elements. C Frequency Responses. D Test Data Normalization and Transformation. E E-model Algorithm. F Interactive Short Conversation Test Scenarios (iSCTs). G Auditory Test Settings and Results. H Modeling Details. I Glossary. Bibliography. Index.

211 citations


Patent
23 Aug 2006
TL;DR: In this article, the authors proposed a speech enhancement system that is able to suppress highly non-stationary noise, which can be adapted to a hearing aid or a headset, using a speech model and a noise model having at least one shape and gain.
Abstract: A central aspect of the invention relates to a method of enhancing speech, the method comprising the steps of receiving noisy speech comprising a clean speech component and a non-stationary noise component, providing a speech model, providing a noise model having at least one shape and a gain, dynamically modifying the noise model based on the speech model and the received noisy speech, and enhancing the noisy speech at least based on the modified noise model. Hereby a method of speech enhancement is achieved that is able to suppress highly non-stationary noise. Another aspect of the invention relates to a speech enhancement system that may be adapted to be used in a hearing system, such as a hearing aid or a headset.

211 citations
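
A rough Python sketch of the shape-and-gain noise model follows: the shape is a fixed spectral template, only the gain is re-fit on frames judged unlikely to contain speech, and a Wiener-style gain rule (an assumption, not the patent's rule) performs the enhancement.

```python
# Sketch of enhancement with a noise model that has a shape and a gain.
# speech_prob stands in for the speech model's output; the 0.3 cutoff
# and the least-squares gain fit are illustrative assumptions.
import numpy as np

def enhance_frame(noisy_psd, noise_shape, noise_gain, speech_prob):
    """noisy_psd: |X_k|^2; noise_shape: spectral template; speech_prob in [0,1]."""
    if speech_prob < 0.3:
        # dynamically refit the noise gain on a (likely) non-speech frame
        noise_gain = np.sum(noisy_psd * noise_shape) / np.sum(noise_shape ** 2)
    noise_psd = noise_gain * noise_shape
    # Wiener-style suppression with a floor, applied in the power domain
    gain = np.maximum(1.0 - noise_psd / np.maximum(noisy_psd, 1e-12), 0.05)
    return gain * noisy_psd, noise_gain    # enhanced PSD and updated gain
```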


Patent
28 Sep 2006
TL;DR: In this paper, a speech recognition engine that utilizes portable voice profiles for converting recorded speech to text is presented, where each portable voice profile includes speaker-dependent data, and is configured to be accessible to a plurality of speech recognition engines through a common interface.
Abstract: An embodiment of the present invention provides a speech recognition engine that utilizes portable voice profiles for converting recorded speech to text. Each portable voice profile includes speaker-dependent data, and is configured to be accessible to a plurality of speech recognition engines through a common interface. A voice profile manager receives the portable voice profiles from other users who have agreed to share their voice profiles. The speech recognition engine includes speaker identification logic to dynamically select a particular portable voice profile, in real-time, from a group of portable voice profiles. The speaker-dependent data included with the portable voice profile enhances the accuracy with which speech recognition engines recognize spoken words in recorded speech from a speaker associated with a portable voice profile.

201 citations
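
A hypothetical sketch of the portable-profile structure: profiles carry speaker-dependent data behind a common interface, and speaker identification selects the matching profile at run time. The API below is illustrative; the patent does not specify one.

```python
# Illustrative data structures for portable voice profiles; the field
# names and manager API are assumptions, not the patent's definitions.
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class VoiceProfile:
    speaker_id: str
    adaptation_data: Dict[str, float] = field(default_factory=dict)  # speaker-dependent data

class ProfileManager:
    def __init__(self):
        self.shared: Dict[str, VoiceProfile] = {}   # profiles shared by other users

    def register(self, profile: VoiceProfile) -> None:
        self.shared[profile.speaker_id] = profile

    def select(self, identified_speaker: str) -> VoiceProfile:
        """Dynamically pick the profile for the speaker the identifier reports."""
        return self.shared[identified_speaker]
```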


Patent
Jung-Eun Kim1, Jeong-Su Kim1
16 Feb 2006
TL;DR: In this paper, a user adaptive speech recognition method and apparatus is disclosed that controls user confirmation of a recognition candidate using a new threshold value adapted to the user; the method includes calculating a confidence score of a recognition candidate according to the result of speech recognition.
Abstract: A user adaptive speech recognition method and apparatus is disclosed that controls user confirmation of a recognition candidate using a new threshold value adapted to a user. The user adaptive speech recognition method includes calculating a confidence score of a recognition candidate according to the result of speech recognition, setting a new threshold value adapted to the user based on a result of user confirmation of the recognition candidate and the confidence score of the recognition candidate, and outputting a corresponding recognition candidate as a result of the speech recognition if the calculated confidence score is higher than the new threshold value. Thus, the need for user confirmation of the result of speech recognition is reduced and the probability of speech recognition success is increased.

181 citations
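
The adaptation loop can be sketched in a few lines: confirmations at a given confidence nudge the threshold down, rejections nudge it up, so this user is asked to confirm less often over time. The update rule and constants below are assumptions, not taken from the patent.

```python
# Hypothetical sketch of the user-adaptive confidence threshold.

class AdaptiveThreshold:
    def __init__(self, initial: float = 0.8, rate: float = 0.1):
        self.threshold = initial
        self.rate = rate

    def needs_confirmation(self, confidence: float) -> bool:
        return confidence < self.threshold

    def observe(self, confidence: float, user_confirmed: bool) -> None:
        """Adapt after the user confirms or corrects a candidate."""
        if user_confirmed:
            # correct results at this confidence: safe to lower the bar
            self.threshold -= self.rate * max(self.threshold - confidence, 0.0)
        else:
            # a wrong result slipped through: raise the bar toward it
            self.threshold += self.rate * max(confidence - self.threshold + 0.1, 0.0)

t = AdaptiveThreshold()
t.observe(0.6, user_confirmed=True)
print(round(t.threshold, 3))  # 0.78: the threshold drifts below the initial 0.8
```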


Journal ArticleDOI
TL;DR: A novel statistical method for voice activity detection using a signal-to-noise ratio measure that employs a low-variance spectrum estimate and determines an optimal threshold based on the estimated noise statistics.
Abstract: Traditionally, voice activity detection algorithms are based on some combination of general speech properties such as temporal energy variations, periodicity, and spectrum. This paper describes a novel statistical method for voice activity detection using a signal-to-noise ratio measure. The method employs a low-variance spectrum estimate and determines an optimal threshold based on the estimated noise statistics. A possible implementation is presented, evaluated over a large test set, and compared to current standardized algorithms. The evaluations indicate promising results, with the proposed scheme being comparable or favorable over the whole test set.

173 citations
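
A minimal sketch along the paper's outline: a low-variance (Welch) spectrum estimate, an SNR measure against tracked noise statistics, and a threshold derived from those statistics. The specific threshold rule (mean plus three standard deviations of the noise-only scores) is an assumption.

```python
# Sketch of an SNR-measure VAD with a low-variance spectrum estimate.
# noise_psd must be a Welch PSD computed at the same resolution (nperseg).
import numpy as np
from scipy.signal import welch

def snr_measure(frame, noise_psd, fs=8000):
    f, pxx = welch(frame, fs=fs, nperseg=128)   # low-variance PSD estimate
    return float(np.mean(10 * np.log10(pxx / noise_psd + 1e-12)))

def make_detector(noise_frames, noise_psd):
    scores = [snr_measure(fr, noise_psd) for fr in noise_frames]
    thr = np.mean(scores) + 3 * np.std(scores)  # stand-in for the optimal threshold
    return lambda frame: snr_measure(frame, noise_psd) > thr

rng = np.random.default_rng(0)
noise_frames = [rng.normal(size=512) for _ in range(20)]
_, noise_psd = welch(np.concatenate(noise_frames), fs=8000, nperseg=128)
is_speech = make_detector(noise_frames, noise_psd)
print(is_speech(rng.normal(size=512) * 4.0))    # louder frame: likely True
```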


Patent
27 Oct 2006
TL;DR: In this article, a privacy apparatus adds a privacy sound that may be based on the speaker's own voice or on another voice; a characteristic of the speaker may be used to access a database of the speaker's or another's voice and to form one or more voice streams that make up the privacy sound.
Abstract: A privacy apparatus adds a privacy sound into the environment, thereby confusing listeners as to which of the sounds is the real source. The privacy sound may be based on the speaker's own voice or may be based on another voice. At least one characteristic of the speaker (such as a characteristic of the speaker's speech) may be identified. The characteristic may then be used to access a database of the speaker's own voice or another's voice, and to form one or more voice streams to form the privacy sound. The privacy sound may thus permit disruption of the ability to understand the source speech of the user by eliminating segregation cues that the auditory system uses to interpret speech.

Patent
12 Jul 2006
TL;DR: In this article, a simple and efficient method is described for producing an obfuscated speech signal that may be used to mask a stream of speech; a speech signal representing the speech stream to be masked is obtained.
Abstract: A simple and efficient method for producing an obfuscated speech signal which may be used to mask a stream of speech, is disclosed. A speech signal representing the speech stream to be masked is obtained. The speech signal is then temporally partitioned into segments, preferably corresponding to phonemes within the speech stream. The segments are then stored in a memory, and some or all of the segments are subsequently selected, retrieved, and assembled into an obfuscated speech signal representing an unintelligible speech stream that, when combined with the speech signal or reproduced and combined with the speech stream, provides a masking effect. While the presently preferred embodiment finds application most readily in an open plan office, embodiments suitable for use in restaurants, classrooms, and in telecommunications systems are also disclosed.
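
The segment-shuffling pipeline reduces to a few array operations. In the sketch below, fixed-length segments stand in for the phoneme-aligned partitioning the patent prefers, and the segment length is an illustrative choice.

```python
# Sketch of the masking idea: partition the speech signal into short
# segments, then reassemble a random selection into an unintelligible
# babble-like masker of the same voice.
import numpy as np

def obfuscate(speech: np.ndarray, fs: int, seg_ms: float = 80.0,
              seed: int = 0) -> np.ndarray:
    seg = int(fs * seg_ms / 1000)
    n = len(speech) // seg
    segments = speech[: n * seg].reshape(n, seg)      # temporal partition
    rng = np.random.default_rng(seed)
    order = rng.permutation(n)                        # select/retrieve segments
    return segments[order].reshape(-1)                # assemble the masker

fs = 8000
masker = obfuscate(np.sin(2 * np.pi * 220 * np.arange(fs) / fs), fs)
print(masker.shape)  # same duration, truncated to whole segments
```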

Patent
21 Aug 2006
TL;DR: In this paper, a phonetic transcription is generated in the language system from the given spelling of words from outside the language system, allowing input and/or output signals that are not handled within the language system to be processed.
Abstract: The method involves receiving input signals by voice input units (2) and outputting output signals by voice output units (3). For processing input and/or output signals that are not handled within the language system, a phonetic transcription is generated in the language system from the given spelling of words from outside the language system. The phonetic transcription is recorded in a data file or list, the phonetic transcription in the language system being determined by replacing sound sequences from outside the language system. An independent claim is also included for a speech recognition and/or speech rendering device for conducting a speech dialogue between a person and a machine.

Proceedings Article
01 Jan 2006
TL;DR: This paper demonstrates how to train phoneme-based acoustic models with carefully designed electromyographic feature extraction methods; by decomposing the signal into different feature spaces, the useful information is kept while the noise is reduced.
Abstract: We present our research on continuous speech recognition of the surface electromyographic signals that are generated by the human articulatory muscles. Previous research on electromyographic speech recognition was limited to isolated word recognition because it was very difficult to train phoneme-based acoustic models for the electromyographic speech recognizer. In this paper, we demonstrate how to train phoneme-based acoustic models with carefully designed electromyographic feature extraction methods. By decomposing the signal into different feature spaces, we successfully keep the useful information while reducing the noise. Additionally, we model the anticipatory effect of the electromyographic signals relative to the speech signal. With a 108-word decoding vocabulary, the experimental results show that the word error rate improves from 86.8% to 32.0% with our novel feature extraction methods. Index Terms: speech recognition, electromyography, articulatory muscles, feature extraction.

Patent
17 Nov 2006
TL;DR: In one embodiment, the invention provides a method for building a voice response system: developing voice content (prompts and information to be played to a user) and integrating that content with logic to define a voice user-interface capable of interacting with the user in the manner of a conversation, receiving an utterance and presenting a selection of the voice content in response.
Abstract: In one embodiment, the invention provides a method for building a voice response system. The method comprises developing voice content for the voice response system, the voice content including prompts and information to be played to a user; and integrating the voice content with logic to define a voice user-interface that is capable of interacting with the user in a manner of a conversation in which the voice user-interface receives an utterance from the user and presents a selection of the voice content to the user in response to the utterance.

Proceedings ArticleDOI
14 May 2006
TL;DR: A new approach is presented that applies unit selection to find corresponding time frames in source and target speech, achieving the same performance as conventional text-dependent training.
Abstract: So far, most of the voice conversion training procedures are text-dependent, i.e., they are based on parallel training utterances of source and target speaker. Since several applications (e.g. speech-to-speech translation or dubbing) require text-independent training, over the last two years, training techniques that use non-parallel data were proposed. In this paper, we present a new approach that applies unit selection to find corresponding time frames in source and target speech. By means of a subjective experiment it is shown that this technique achieves the same performance as the conventional text-dependent training.
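
The frame-pairing step can be sketched as a nearest-neighbour search over target frames in a spectral feature space; a full unit-selection system would add concatenation costs, which this sketch omits.

```python
# Rough sketch of pairing source and target frames for non-parallel
# voice conversion training: unit selection degenerates here to a
# nearest-neighbour search over MFCC-like feature vectors.
import numpy as np

def pair_frames(src_feats: np.ndarray, tgt_feats: np.ndarray) -> np.ndarray:
    """src_feats: (Ns, D), tgt_feats: (Nt, D); returns the index of the
    selected target frame for every source frame."""
    # squared Euclidean distances between all source/target frame pairs
    d2 = ((src_feats[:, None, :] - tgt_feats[None, :, :]) ** 2).sum(-1)
    return np.argmin(d2, axis=1)   # best-matching target unit per frame

src = np.random.default_rng(1).normal(size=(100, 13))
tgt = np.random.default_rng(2).normal(size=(120, 13))
print(pair_frames(src, tgt)[:5])
```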

01 Jan 2006
TL;DR: The evaluation results show that the proposed USTC speech synthesis system is able to synthesize speech with high naturalness and intelligibility using either the full database or only the ARCTIC subset.
Abstract: This paper introduces the USTC speech synthesis system for Blizzard Challenge 2006. The HMM-based parametric synthesis approach was adopted for its convenience and effectiveness in building a new voice, especially for non-native developers. Some useful techniques were also integrated into our system, such as minimum generation error (MGE) training, phone duration modeling, and line spectral pair (LSP) based formant enhancement. The evaluation results show that the proposed system is able to synthesize speech with high naturalness and intelligibility using either the full database or only the ARCTIC subset.

Proceedings ArticleDOI
14 May 2006
TL;DR: A digital signal processing algorithm to improve intelligibility of clean far end speech for the near end listener who is located in an environment with background noise is presented.
Abstract: In contrast to common noise reduction systems, this contribution presents a digital signal processing algorithm to improve intelligibility of clean far end speech for the near end listener who is located in an environment with background noise. Since the noise reaches the ears of the near end listener directly and therefore can hardly be influenced, a sensible option is to manipulate the far end speech. The proposed algorithm raises the average speech spectrum over the average noise spectrum and takes precautions to prevent hearing damage. Informal listening tests and the Speech Intelligibility Index indicate an improved speech intelligibility.
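
The core gain computation can be sketched per frequency band: lift the long-term speech spectrum a margin above the measured noise spectrum, with a hard cap standing in for the hearing-protection precautions. The 6 dB margin and 12 dB cap are illustrative values, not from the paper.

```python
# Per-band gains pushing the far-end speech spectrum above the near-end
# noise spectrum; margin and cap values are illustrative assumptions.
import numpy as np

def band_gains(speech_psd_db: np.ndarray, noise_psd_db: np.ndarray,
               margin_db: float = 6.0, max_gain_db: float = 12.0) -> np.ndarray:
    """Per-band gains (dB) pushing speech margin_db above the noise."""
    needed = (noise_psd_db + margin_db) - speech_psd_db
    return np.clip(needed, 0.0, max_gain_db)  # never attenuate, cap the boost

speech = np.array([60.0, 55.0, 50.0])   # dB per band (illustrative)
noise = np.array([58.0, 40.0, 52.0])
print(band_gains(speech, noise))        # [4. 0. 8.]
```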

Patent
18 Jul 2006
TL;DR: In this article, a method of transferring a real-time audio signal transmission, including registering voice patterns (or other characteristics) of one or more users to be used to identify the voices of the users, accepting an audio signal as it is created as a sequence of segments, analyzing each segment of the accepted audio signal to determine if it contains voice activity, determining a probability level that the voice activity of the segment is of a registered user, and selectively transferring the contents of a segment responsive to the determined probability level.
Abstract: A method of transferring a real-time audio signal transmission, including: registering voice patterns (or other characteristics) of one or more users to be used to identify the voices of the users; accepting an audio signal as it is created as a sequence of segments; analyzing each segment of the accepted audio signal to determine if it contains voice activity (314); determining a probability level that the voice activity of the segment is of a registered user (320 & 322); and selectively transferring the contents of a segment responsive to the determined probability level (324).
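
The per-segment gating reduces to a short loop: a VAD check, then a registered-speaker probability, then selective transfer. The two detector callables below are assumptions standing in for the patent's components; the numbers in the comments refer to the claim's reference numerals.

```python
# Sketch of the selective-transfer pipeline over audio segments.
from typing import Callable, Iterable, List

def transfer_segments(segments: Iterable[bytes],
                      has_voice: Callable[[bytes], bool],
                      prob_registered: Callable[[bytes], float],
                      min_prob: float = 0.6) -> List[bytes]:
    sent = []
    for seg in segments:
        if not has_voice(seg):                 # step (314): voice activity check
            continue
        if prob_registered(seg) >= min_prob:   # steps (320/322): identity score
            sent.append(seg)                   # step (324): selective transfer
    return sent
```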

Journal ArticleDOI
TL;DR: The results show that ERVU successfully increased the intelligibility of speech using a simple automated segmentation algorithm, applicable to a wide variety of communication systems such as cell phones and public address systems.

PatentDOI
TL;DR: In this article, a voice browsing system maintains a database containing a list of information sources, such as web sites, connected to a network; each information source is assigned a rank number, which is listed in the database along with the record for that source.
Abstract: The present invention relates to a system for acquiring information from sources on a network, such as the Internet. A voice browsing system maintains a database containing a list of information sources, such as web sites, connected to a network. Each of the information sources is assigned a rank number which is listed in the database along with the record for the information source. In response to a speech command received from a user, a network interface system accesses the information source with the highest rank number in order to retrieve information requested by the user.
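
A minimal sketch of the ranked lookup: the database maps a request topic to candidate sources, each carrying a rank number, and the interface fetches from the highest-ranked one. The schema is illustrative, not the patent's.

```python
# Illustrative ranked-source database and lookup; the dict schema and
# example URLs are assumptions for the sketch.
sources = {
    "weather": [{"url": "http://example.com/wx", "rank": 2},
                {"url": "http://example.net/weather", "rank": 5}],
}

def pick_source(topic: str) -> str:
    candidates = sources[topic]
    return max(candidates, key=lambda s: s["rank"])["url"]

print(pick_source("weather"))  # the rank-5 source wins
```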

Journal ArticleDOI
TL;DR: This paper presents an expressive voice conversion model (DeBi-HMM) as the post-processing of a text-to-speech (TTS) system for expressive speech synthesis, named for the duration-embedded characteristic of its two HMMs, which model the source and target speech signals, respectively.
Abstract: This paper presents an expressive voice conversion model (DeBi-HMM) as the post-processing of a text-to-speech (TTS) system for expressive speech synthesis. DeBi-HMM is named for the duration-embedded characteristic of its two HMMs, which model the source and target speech signals, respectively. Joint estimation of source and target HMMs is exploited for spectrum conversion from neutral to expressive speech. A Gamma distribution is embedded as the duration model for each state in the source and target HMMs. Expressive style-dependent decision trees achieve prosodic conversion. The STRAIGHT algorithm is adopted for the analysis and synthesis process. A set of small-sized speech databases for each expressive style was designed and collected to train the DeBi-HMM voice conversion models. Several experiments with statistical hypothesis testing were conducted to evaluate the quality of synthetic speech as perceived by human subjects. Compared with previous voice conversion methods, the proposed method exhibits encouraging potential in expressive speech synthesis.

Journal ArticleDOI
TL;DR: A brain-based mechanism is suggested that uses the voice pitch cue in the low-frequency sound to first segregate the target voice from the competing voice and then to group appropriate temporal envelope cues in the target voice for robust speech recognition under realistic listening situations.
Abstract: Speech can be recognized by multiple acoustic cues in both frequency and time domains. These acoustic cues are often thought to be redundant. One example is the low-frequency sound component below 300 Hz, which is not even transmitted by the majority of communication devices, including telephones. Here, we showed that this low-frequency sound component, although unintelligible when presented alone, could improve the functional signal-to-noise ratio (SNR) by 10-15 dB for speech recognition in noise when presented in combination with a cochlear-implant simulation. A similar low-frequency enhancement effect could be obtained by presenting the low-frequency sound component to one ear and the cochlear-implant simulation to the other ear. However, a high-frequency sound could not produce a similar speech enhancement in noise. We argue that this low-frequency enhancement effect cannot be due to linear addition of intelligibility between low- and high-frequency components or an increase in the physical SNR. We suggest a brain-based mechanism that uses the voice pitch cue in the low-frequency sound to first segregate the target voice from the competing voice and then to group appropriate temporal envelope cues in the target voice for robust speech recognition under realistic listening situations.

01 Jan 2006
TL;DR: The paper proposes the use of synthetic speech coding algorithms (vocoders) to provide redundancy, since the algorithms produce a very low bit-rate stream, which only adds a small overhead to a packet.
Abstract: This paper describes current problems found with audio applications over the MBONE (Multicast Backbone), and investigates possible solutions to the most common one: packet loss. The principles of packet speech systems are discussed, along with how their structure allows the use of redundancy to design viable solutions to the problem. The paper proposes the use of synthetic speech coding algorithms (vocoders) to provide redundancy, since these algorithms produce a very low bit-rate stream, which adds only a small overhead to a packet. Preliminary experiments show that normal speech repaired with synthetic-quality speech is intelligible, even at very high loss rates.
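
The redundancy scheme can be sketched as packets that carry the primary payload for frame n plus a tiny vocoder rendering of frame n-1, so a lost packet can be patched with synthetic-quality speech from its successor. The payloads below are placeholders, not real codec output.

```python
# Sketch of vocoder-based redundancy for packet-loss repair.
from typing import List, Optional, Tuple

def build_packets(primary: List[bytes], vocoded: List[bytes]) -> List[Tuple[int, bytes, bytes]]:
    """Pair each primary frame with the low-bit-rate copy of the previous one."""
    return [(n, primary[n], vocoded[n - 1] if n > 0 else b"")
            for n in range(len(primary))]

def receive(packets: List[Optional[Tuple[int, bytes, bytes]]]) -> List[bytes]:
    """None marks a lost packet; repair it from the redundancy that follows."""
    out: List[bytes] = []
    for i, pkt in enumerate(packets):
        if pkt is not None:
            out.append(pkt[1])                       # normal-quality frame
        elif i + 1 < len(packets) and packets[i + 1] is not None:
            out.append(packets[i + 1][2])            # synthetic-quality patch
        else:
            out.append(b"")                          # unrecoverable gap
    return out
```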

Book
01 Jan 2006
TL;DR: This book discusses speech recognition with HMMs, alternative representations of the LPC coefficients, front-end processing for robust feature extraction, and a review of channel coding techniques.
Abstract: Foreword. Preface. 1 Introduction. 1.1 Introduction. 1.2 RSR over Digital Channels. 1.3 Organization of the Book. 2 Speech Recognition with HMMs. 2.1 Introduction. 2.2 Some General Issues. 2.3 Analysis of Speech Signals. 2.4 Vector Quantization. 2.5 Approaches to ASR. 2.6 Hidden Markov Models. 2.7 Application of HMMs to Speech Recognition. 2.8 Model Adaptation. 2.9 Dealing with Uncertainty. 3 Networks and Degradation. 3.1 Introduction. 3.2 Mobile and Wireless Networks. 3.3 IP Networks. 3.4 The Acoustic Environment. 4 Speech Compression and Architectures for RSR. 4.1 Introduction. 4.2 Speech Coding. 4.3 Recognition from Decoded Speech. 4.4 Recognition from Codec Parameters. 4.5 Distributed Speech Recognition. 4.6 Comparison between NSR and DSR. 5 Robustness Against Transmission Channel Errors. 5.1 Introduction. 5.2 Channel Coding Techniques. 5.3 Error Concealment (EC). 6 Front-end Processing for Robust Feature Extraction. 6.1 Introduction. 6.2 Noise Reduction Techniques. 6.3 Voice Activity Detection. 6.4 Feature Normalization. 7 Standards for Distributed Speech Recognition. 7.1 Introduction. 7.2 Signal Preprocessing. 7.3 Feature Extraction. 7.4 Feature Compression and Encoding. 7.5 Feature Decoding and Postprocessing. A Alternative Representations of the LPC Coefficients. B Basic Digital Modulation Concepts. C Review of Channel Coding Techniques. C.1 Media-independent FEC. C.2 Interleaving. Bibliography. List of Acronyms. Index.

Patent
01 Mar 2006
TL;DR: In this paper, a transcript associated with the speech processing may be displayed to a user with a first visual indication of words having a confidence level within a first predetermined confidence range, and an error correction facility may be provided for the user to correct errors in the displayed transcript.
Abstract: A method, a processing device, and a machine-readable medium are provided for improving speech processing. A transcript associated with the speech processing may be displayed to a user with a first visual indication of words having a confidence level within a first predetermined confidence range. An error correction facility may be provided for the user to correct errors in the displayed transcript. Error correction information, collected from use of the error correction facility, may be provided to a speech processing module to improve speech processing accuracy.
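
A toy rendering of the confidence-range display: words whose confidence falls inside the low range get a visual marker so the user knows where to apply corrections. The bracket notation and the 0.8 cutoff are assumptions for the sketch.

```python
# Illustrative confidence-range rendering of a transcript.
from typing import List, Tuple

def render_transcript(words: List[Tuple[str, float]], low_cutoff: float = 0.8) -> str:
    return " ".join(w if c >= low_cutoff else f"[{w}?]" for w, c in words)

hyp = [("send", 0.95), ("it", 0.91), ("to", 0.88), ("Jon", 0.42)]
print(render_transcript(hyp))   # send it to [Jon?]
```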

Patent
28 Oct 2006
TL;DR: A speech recognition method includes receiving input speech from a user, processing the input speech to obtain at least one parameter value, and determining an experience level of the user using the parameter value(s) as discussed by the authors.
Abstract: A speech recognition method includes receiving input speech from a user, processing the input speech to obtain at least one parameter value, and determining an experience level of the user using the parameter value(s). The method can also include prompting the user based upon the determined experience level of the user to assist the user in delivering speech commands.
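
A small sketch of the experience-level mapping: derive parameter values from the input speech and use them to choose the prompting style. Speaking rate and pause ratio are assumed examples; the patent leaves the parameters open, and the thresholds here are illustrative.

```python
# Hypothetical mapping from speech-derived parameter values to an
# experience level that controls prompting verbosity.
def experience_level(speaking_rate_wps: float, pause_ratio: float) -> str:
    if speaking_rate_wps > 2.5 and pause_ratio < 0.2:
        return "experienced"    # fluent commands: terse prompts
    return "novice"             # hesitant speech: detailed guidance

def prompt_for(level: str) -> str:
    return "Say a command." if level == "experienced" else \
           "Please say one of: call, text, or navigate. Speak after the tone."

print(prompt_for(experience_level(3.0, 0.1)))
```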

Proceedings ArticleDOI
14 May 2006
TL;DR: A system combination approach using different models and features for deception detection is proposed, resulting in improved accuracy over the individual systems.
Abstract: We report on machine learning experiments to distinguish deceptive from nondeceptive speech in the Columbia-SRI-Colorado (CSC) corpus. Specifically, we propose a system combination approach using different models and features for deception detection. Scores from an SVM system based on prosodic/lexical features are combined with scores from a Gaussian mixture model system based on acoustic features, resulting in improved accuracy over the individual systems. Finally, we compare results from the prosodic-only SVM system using features derived either from recognized words or from human transcriptions.
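
Score-level fusion of this kind is a weighted sum of normalized per-system scores. The sketch below uses z-normalization and equal weighting as assumptions; the paper tunes the combination rather than fixing weights.

```python
# Sketch of combining an SVM score (prosodic/lexical) with a GMM
# log-likelihood ratio (acoustic) at the score level.
import numpy as np

def zscore(x: np.ndarray) -> np.ndarray:
    return (x - x.mean()) / (x.std() + 1e-12)

def fuse(svm_scores: np.ndarray, gmm_llrs: np.ndarray, w: float = 0.5) -> np.ndarray:
    """Higher fused score = more likely deceptive."""
    return w * zscore(svm_scores) + (1 - w) * zscore(gmm_llrs)

svm = np.array([0.2, 1.4, -0.7])
gmm = np.array([3.1, 0.4, -2.0])
print(fuse(svm, gmm))
```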

Patent
13 Mar 2006
TL;DR: In this paper, a method for providing help to voice-enabled applications, including multimodal applications, can include a step of identifying at least one speech grammar associated with a voice-enabled application.
Abstract: A method for providing help to voice-enabled applications, including multimodal applications, can include a step of identifying at least one speech grammar associated with a voice-enabled application. Help fields can be defined within the speech grammar. The help fields can include available speech commands for the voice enabled application. When the speech grammar is activated for use by the voice-enabled application, the available speech commands can be presented to a user of the voice-enabled application. The presented speech commands can be obtained from the help fields.
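
A toy sketch of help fields attached to a grammar: when the grammar is activated, its help field supplies the available commands to present. The dict representation is an assumption; a real system would use a grammar format such as SRGS.

```python
# Illustrative grammar with an embedded help field.
grammar = {
    "name": "media-player",
    "rules": ["play", "pause", "next track"],
    "help": ["You can say: play, pause, or next track."],  # help field
}

def on_grammar_activated(g: dict) -> None:
    for line in g["help"]:          # present the available speech commands
        print(line)

on_grammar_activated(grammar)
```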

Patent
06 Mar 2006
TL;DR: In this article, a speech processing method can automatically and dynamically adjust speech grammar weights at runtime based upon usage data, which can indicate a relative frequency with which each of the available speech commands is utilized.
Abstract: A speech processing method can automatically and dynamically adjust speech grammar weights at runtime based upon usage data. Each of the speech grammar weights can be associated with an available speech command contained within a speech grammar to which the speech grammar weights apply. The usage data can indicate a relative frequency with which each of the available speech commands is utilized.
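
The runtime adjustment can be sketched as re-deriving weights from observed command counts, i.e., relative frequencies with smoothing so unused commands keep a nonzero weight. The add-one smoothing and the direct frequency-to-weight mapping are assumptions for the sketch.

```python
# Sketch of usage-driven grammar weight adaptation.
from collections import Counter

def adjust_weights(commands: list[str], usage: Counter) -> dict[str, float]:
    total = sum(usage[c] + 1 for c in commands)   # add-one smoothing
    return {c: (usage[c] + 1) / total for c in commands}

usage = Counter({"call": 40, "text": 9})
print(adjust_weights(["call", "text", "navigate"], usage))
# 'call' ~0.79 dominates; 'navigate' keeps a small smoothed weight
```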