Showing papers on "Voice activity detection published in 2001"


01 Jan 2001
TL;DR: An unbiased noise power estimator based on minimum statistics is derived and its statistical properties and its performance in the context of spectral subtraction are discussed.
Abstract: This contribution presents and analyses an algorithm for the enhancement of noisy speech signals by means of spectral subtraction. In contrast to the standard spectral subtraction algorithm, the proposed method needs neither a speech activity detector nor histograms to learn signal statistics. The algorithm is capable of tracking non-stationary noise signals and compares favorably with standard spectral subtraction methods in terms of performance and computational complexity. Our noise estimation method is based on the observation that a noise power estimate can be obtained using minimum values of a smoothed power estimate of the noisy speech signal. Thus, the use of minimum statistics eliminates the problem of speech activity detection. The proposed method is conceptually simple and well suited for real-time implementations. In this paper we derive an unbiased noise power estimator based on minimum statistics and discuss its statistical properties and its performance in the context of spectral subtraction.

645 citations
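
The following is a minimal sketch of the idea, assuming a mono signal x and hand-picked smoothing, window-length, and bias constants: the noise power is taken as the minimum of a recursively smoothed power spectrum over a sliding window, scaled by a bias-compensation factor, and fed into a spectral-subtraction gain. The paper derives the unbiased estimator and its statistical properties rigorously, which this toy version omits.

import numpy as np

def enhance(x, frame=256, hop=128, alpha=0.85, win_frames=60, bias=1.5, floor=0.1):
    window = np.hanning(frame)
    n_frames = 1 + (len(x) - frame) // hop
    out = np.zeros(len(x))
    smoothed = None
    history = []                      # smoothed power of recent frames (sliding window)
    for i in range(n_frames):
        seg = x[i*hop:i*hop+frame] * window
        spec = np.fft.rfft(seg)
        power = np.abs(spec) ** 2
        # recursive smoothing of the noisy-speech power spectrum
        smoothed = power if smoothed is None else alpha*smoothed + (1-alpha)*power
        history.append(smoothed.copy())
        if len(history) > win_frames:
            history.pop(0)
        # noise estimate: minimum of the smoothed power over the window,
        # scaled by a bias-compensation factor (a constant here, derived in the paper)
        noise = bias * np.min(history, axis=0)
        # power spectral subtraction with a spectral floor to limit musical noise
        gain = np.sqrt(np.maximum(1.0 - noise / np.maximum(power, 1e-12), floor))
        out[i*hop:i*hop+frame] += np.fft.irfft(gain * spec, frame) * window
    return out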


Patent
02 May 2001
TL;DR: In this paper, new techniques and systems may be implemented to improve error correction in speech recognition systems, which may be used in a standard desktop environment, in a mobile environment, or in any other type of environment that can receive and/or present recognized speech.
Abstract: New techniques and systems may be implemented to improve error correction in speech recognition. These techniques and systems may be used to correct errors in speech recognition systems deployed in a standard desktop environment, in a mobile environment, or in any other type of environment that can receive and/or present recognized speech.

423 citations


PatentDOI
TL;DR: In this article, a system and method for the control of color-based lighting through voice control or speech recognition is presented, along with a syntax for use with such a system. In this approach, the spoken voice (in any language) can be used to control effects naturally, without the user having to learn the myriad manipulations required by some complex controller interfaces.
Abstract: A system and method for the control of color-based lighting through voice control or speech recognition, as well as a syntax for use with such a system. In this approach, the spoken voice (in any language) can be used to more naturally control effects without having to learn the myriad manipulations required by some complex controller interfaces. A simple control language based upon spoken words, consisting of commands and values, is constructed and used to provide a common base for lighting and system control.

260 citations


Journal ArticleDOI
TL;DR: The proposed VAD algorithm combines HOS metrics with second-order measures, such as SNR and LPC prediction error, to classify speech and noise frames and derives a voicing condition for speech frames based on the relation between the skewness and kurtosis of voiced speech.
Abstract: This paper presents a robust algorithm for voice activity detection (VAD) based on newly established properties of the higher order statistics (HOS) of speech. Analytical expressions for the third and fourth-order cumulants of the LPC residual of short-term speech are derived assuming a sinusoidal model. The flat spectral feature of this residual results in distinct characteristics for these cumulants in terms of phase, periodicity and harmonic content and yields closed-form expressions for the skewness and kurtosis. Important properties about these cumulants and their similarity with the autocorrelation function are revealed from this exploratory part. They show that the HOS of speech are sufficiently distinct from those of Gaussian noise and can be used as a basis for speech detection. Their immunity to Gaussian noise makes them particularly useful in algorithms designed for low SNR environments. The proposed VAD algorithm combines HOS metrics with second-order measures, such as SNR and LPC prediction error, to classify speech and noise frames. The variance of the HOS estimators is quantified and used to yield a likelihood measure for noise frames. Moreover, a voicing condition for speech frames is derived based on the relation between the skewness and kurtosis of voiced speech. The performance of the algorithm is compared to the ITU-T G.729B VAD in various noise conditions, and quantified using the probability of correct and false classifications. The results show that the proposed algorithm has an overall better performance than G.729B, with noticeable improvement in Gaussian-like noises, such as street and parking garage, and moderate to low SNR.

249 citations
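
A rough sketch of the decision logic described above, not the paper's algorithm: each frame's LPC residual is tested for non-Gaussianity via its skewness and excess kurtosis and combined with a simple energy-based SNR proxy. The LPC order, the thresholds, and the externally supplied noise power are all illustrative.

import numpy as np
from scipy.linalg import solve_toeplitz
from scipy.signal import lfilter
from scipy.stats import skew, kurtosis

def lpc_residual(frame, order=10):
    r = np.correlate(frame, frame, mode="full")[len(frame)-1:]
    a = solve_toeplitz((r[:order], r[:order]), r[1:order+1])    # LPC coefficients
    return lfilter(np.concatenate(([1.0], -a)), [1.0], frame)   # prediction error

def is_speech(frame, noise_power, skew_thr=0.3, kurt_thr=0.5, snr_thr=2.0):
    e = lpc_residual(frame)
    snr = np.mean(frame**2) / max(noise_power, 1e-12)
    # Gaussian noise has (near-)zero skewness and excess kurtosis, so large
    # HOS values of the residual point towards speech.
    return (abs(skew(e)) > skew_thr or kurtosis(e) > kurt_thr) and snr > snr_thr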


Journal ArticleDOI
TL;DR: Audiovisual speech processing results have shown that, with lip reading, it is possible to enhance the reliability of audio speech recognition, which may result in a computer that can truly understand the user via hands-free natural spoken language even in very noisy environments.
Abstract: We have reported activities in audiovisual speech processing, with emphasis on lip reading and lip synchronization. These research results have shown that, with lip reading, it is possible to enhance the reliability of audio speech recognition, which may result in a computer that can truly understand the user via hands-free natural spoken language even in very noisy environments. Similarly, with lip synchronization, it is possible to render realistic talking heads with lip movements synchronized with the voice, which is very useful for human-computer interaction. We envision that in the near future, advances in audiovisual speech processing will greatly increase the usability of computers. Once that happens, the camera and the microphone may replace the keyboard and the mouse as better mechanisms for human-computer interaction.

244 citations


Patent
26 Feb 2001
TL;DR: In this paper, a speech manager interface allows the speech recognition process and the text-to-speech process to be accessed by other application processes in handheld electronic devices such as a personal digital assistant (PDA).
Abstract: A handheld electronic device such as a personal digital assistant (PDA) has multiple application processes. A speech recognition process takes input speech from a user and produces a recognition output representative of the input speech. A text-to-speech process takes output text and produces a representative speech output. A speech manager interface allows the speech recognition process and the text-to-speech process to be accessed by other application processes.

239 citations


Patent
Steve Tischer
10 Dec 2001
TL;DR: In this paper, a method and system of customizing voice translation of a text to speech includes digitally recording speech samples of a known speaker, correlating each of the speech samples with a standardized audio representation, and organizing the recorded speech samples and correlated audio representations into a collection.
Abstract: A method and system of customizing voice translation of a text to speech includes digitally recording speech samples of a known speaker, correlating each of the speech samples with a standardized audio representation, and organizing the recorded speech samples and correlated audio representations into a collection. The collection of speech samples correlated with audio representations is saved as a single voice file and stored in a device capable of translating the text to speech. The voice file is applied to a translation of text to speech so that the translated speech is customized according to the applied voice file.

229 citations


PatentDOI
TL;DR: In this article, a computerized method and apparatus are presented for automatically generating, from a first speech recognizer, a second speech recognizer that can be adapted to a specific domain; the resulting second recognizer is specific to that single domain.
Abstract: The present invention provides a computerized method and apparatus for automatically generating from a first speech recognizer a second speech recognizer which can be adapted to a specific domain. The first speech recognizer can include a first acoustic model with a first decision network and corresponding first phonetic contexts. The first acoustic model can be used as a starting point for the adaptation process. A second acoustic model with a second decision network and corresponding second phonetic contexts for the second speech recognizer can be generated by re-estimating the first decision network and the corresponding first phonetic contexts based on domain-specific training data.

228 citations


Proceedings ArticleDOI
07 May 2001
TL;DR: Results of perceptual experiments show that by combining the steps of prosody prediction and unit selection the authors are able to achieve improved naturalness of synthetic speech compared to the sequential implementation.
Abstract: We describe how prosody prediction can be efficiently integrated with the unit selection process in a concatenative speech synthesizer under a weighted finite-state transducer (WFST) architecture. WFSTs representing prosody prediction and unit selection can be composed during synthesis, thus effectively expanding the space of possible prosodic targets. We implemented a symbolic prosody prediction module and a unit selection database as the synthesis components of a travel planning system. Results of perceptual experiments show that by combining the steps of prosody prediction and unit selection we are able to achieve improved naturalness of synthetic speech compared to the sequential implementation.

223 citations


01 Jan 2001
TL;DR: A new type of speech corpus is proposed that is especially suited to VT research and development because it consists of naturally time-aligned sentences; this yields high-quality speech samples that differ only in their segmental properties, the focus of transformation.
Abstract: Speaker identity, the sound of a person's voice, plays an important role in human communication. With speech systems becoming more and more ubiquitous, Voice Transformation (VT), a technology that modifies a source speaker's speech utterance to sound as if a target speaker had spoken it, offers a number of useful applications. For example, a novice user can adapt a text-to-speech system to speak with a new voice quickly and inexpensively. In this dissertation, we consider new approaches in both the design and the evaluation of VT techniques. We propose a new type of speech corpus that is especially suited to VT research and development by consisting of naturally time-aligned sentences. Consequently, removal of individual prosodic characteristics, such as fundamental pitch and durations, requires only very little processing and results in high-quality speech samples that only differ in their segmental properties, our focus of transformation. These “prosody-normalized” speech samples are used for training VT systems, as well as for evaluating their transformation performance objectively and subjectively. Our baseline transformation system (SET) is based on transforming the spectral envelope as represented by the LPC spectrum, using a harmonic sinusoidal model for analysis and synthesis. The transformation function is implemented as a regressive, joint-density Gaussian mixture model, trained on aligned LSF vectors by an expectation maximization algorithm. We improve upon the baseline by adding a residual prediction module, which predicts target LPC residuals from transformed LPC spectral envelopes, using a classifier and residual codebooks. The resulting high resolution transformation system (HRT) is capable of rendering transformed speech with a high degree of spectral detail. Because of the severe shortcomings of evaluating VT performance objectively, we propose a subjective evaluation strategy, consisting of several listening tests. In a speaker discrimination test, the HRT system performed significantly better than the SET system. However, discrimination is below that of natural utterances. Similarly, listeners selected the HRT system over other systems in a system comparison test. Finally, listeners rated the speech quality of the HRT system as better than the SET system. However, the quality of natural utterances was considered better than that of transformed speech.

198 citations
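
The spectral conversion step of the baseline system lends itself to a short sketch: a joint-density GMM is trained on stacked, aligned source/target LSF vectors and then used as a regression from source to target spectra. The mixture count, the use of sklearn, and the variable names are assumptions; residual prediction and sinusoidal analysis/synthesis are not shown.

import numpy as np
from sklearn.mixture import GaussianMixture
from scipy.stats import multivariate_normal

def train_jdgmm(src_lsf, tgt_lsf, n_mix=8):
    # src_lsf, tgt_lsf: aligned (N, d) arrays of LSF vectors
    z = np.hstack([src_lsf, tgt_lsf])
    return GaussianMixture(n_components=n_mix, covariance_type="full").fit(z)

def convert(gmm, x, d):
    # Regress a target LSF vector from a source LSF vector x of length d.
    w, mu, cov = gmm.weights_, gmm.means_, gmm.covariances_
    # posterior probability of each mixture given only the source part
    resp = np.array([w[m] * multivariate_normal.pdf(x, mu[m, :d], cov[m, :d, :d])
                     for m in range(len(w))])
    resp /= resp.sum()
    y = np.zeros(d)
    for m in range(len(w)):
        cross = cov[m, d:, :d] @ np.linalg.inv(cov[m, :d, :d])
        y += resp[m] * (mu[m, d:] + cross @ (x - mu[m, :d]))
    return y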


Journal ArticleDOI
TL;DR: A new and generalizing approach to error concealment is described as part of a modified robust speech decoder that can be applied to any speech codec standard and preserves bit exactness in the case of an error free channel.
Abstract: In digital speech communication over noisy channels there is the need for reducing the subjective effects of residual bit errors which have not been eliminated by channel decoding. This task is usually called error concealment. We describe a new and generalizing approach to error concealment as part of a modified robust speech decoder. It can be applied to any speech codec standard and preserves bit exactness in the case of an error free channel. The proposed method requires bit reliability information provided by the demodulator or by the equalizer or specifically by the channel decoder and can exploit additionally a priori knowledge about codec parameters. We apply our algorithms to PCM, ADPCM, and GSM full-rate speech coding using AWGN, fading, and GSM channel models, respectively. It turns out that the speech quality is significantly enhanced, showing the desired inherent muting mechanism or graceful degradation behavior in the case of extreme adverse transmission conditions.

PatentDOI
Steven G. Woodward
TL;DR: In this paper, a method for processing a misrecognition error in an embedded speech recognition system during a speech recognition session can include the step of speech-to-text converting audio input in the embedded speech recognition system based on an active language model.
Abstract: A method for processing a misrecognition error in an embedded speech recognition system during a speech recognition session can include the step of speech-to-text converting audio input in the embedded speech recognition system based on an active language model. The speech-to-text conversion can produce speech recognized text that can be presented through a user interface. A user-initiated misrecognition error notification can be detected. The audio input and a reference to the active language model can be provided to a speech recognition system training process associated with the embedded speech recognition system.

Patent
17 Dec 2001
TL;DR: In this article, speech input is received from two or more speakers, including a first speaker (such as a customer service representative); the portion of the speech input that originates from the first speaker is blocked, and the remaining portion is processed with a computer.
Abstract: The present invention comprises receiving speech input from two or more speakers, including a first speaker (such as a customer service representative for example); blocking a portion of the speech input that originates from the first speaker; and processing the remaining portion of the speech input with a computer. The blocking and processing are real-time processes, completed during a conversation. One example is a method for de-cluttering speech input for better automatic processing, by removing all but the pertinent words spoken by a customer. Another example is a system for executing methods of the present invention. A third example is a set of instructions on a computer-usable medium, or resident in a computer system, for executing methods of the present invention.

Patent
Richard Rose, Bojana Gajic
12 Oct 2001
TL;DR: In this paper, dynamically re-configurable speech recognition models are proposed for small devices such as mobile phones and personal digital assistants, as well as for environments such as the office, home, or vehicle, while maintaining the accuracy of the speech recognition.
Abstract: Speech recognition models are dynamically re-configurable based on user information, background information such as background noise, and transducer information such as transducer response characteristics, to provide users with alternate input modes to keyboard text entry (Fig. 5). The techniques of dynamic re-configurable speech recognition allow deployment of speech recognition on small devices such as mobile phones and personal digital assistants, as well as in environments such as the office, home, or vehicle, while maintaining the accuracy of the speech recognition.

PatentDOI
TL;DR: In this paper, a method for converting text to concatenated voice by utilizing a digital voice library and a set of playback rules is provided, which includes a plurality of speech items and a corresponding plurality of voice recordings.
Abstract: A method for converting text to concatenated voice by utilizing a digital voice library and a set of playback rules is provided. The digital voice library includes a plurality of speech items and a corresponding plurality of voice recordings. Each speech item corresponds to at least one available voice recording. Multiple voice recordings that correspond to a single speech item represent various inflections of that single speech item. The method includes receiving text data and converting the text data into a sequence of speech items in accordance with the digital voice library. The method further includes establishing multiple voice recordings in the digital voice library that correspond to a single inflection of a single speech item, for a plurality of inflections of a plurality of speech items, that represent various ligatures for the single inflection of the single speech item with adjacent speech items.
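
A toy sketch of the lookup-and-concatenate step implied above, assuming the voice library is a dictionary from speech items to lists of recordings; the inflection and ligature selection rules, which are the substance of the patent, are reduced here to picking the first available recording.

def concatenate_voice(text, library):
    # library: dict mapping a speech item (e.g. a word) to a list of
    # recordings, each a dict with "audio" (bytes) and "inflection" keys
    pieces = []
    for item in text.lower().split():
        recordings = library.get(item)
        if not recordings:
            continue                           # unknown item: skip (a real system would fall back)
        pieces.append(recordings[0]["audio"])  # naive choice; playback rules would pick an inflection/ligature
    return b"".join(pieces)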

PatentDOI
Min Chu, Hu Peng
TL;DR: In this article, a speech synthesizer is provided that concatenates stored samples of speech units without modifying the prosody of the samples, achieving a high level of naturalness in synthesized speech with a carefully designed training speech corpus.
Abstract: A speech synthesizer is provided that concatenates stored samples of speech units without modifying the prosody of the samples. The present invention is able to achieve a high level of naturalness in synthesized speech with a carefully designed training speech corpus by storing samples based on the prosodic and phonetic context in which they occur. In particular, some embodiments of the present invention limit the training text to those sentences that will produce the most frequent sets of prosodic contexts for each speech unit. Further embodiments of the present invention also provide a multi-tier selection mechanism for selecting a set of samples that will produce the most natural sounding speech.

Proceedings ArticleDOI
11 Nov 2001
TL;DR: It is suggested that voice-as-sound techniques can enhance the traditional voice recognition approach and achieve more direct, immediate interaction by using lower-level features of voice such as pitch and volume.
Abstract: We describe the use of non-verbal features in voice for direct control of interactive applications. Traditional speech recognition interfaces are based on an indirect, conversational model: first the user gives a direction, and then the system performs a certain operation. Our goal is to achieve more direct, immediate interaction, like using a button or joystick, by using lower-level features of voice such as pitch and volume. We are developing several prototype interaction techniques based on this idea, such as "control by continuous voice", "rate-based parameter control by pitch," and "discrete parameter control by tonguing." We have implemented several prototype systems, and they suggest that voice-as-sound techniques can enhance the traditional voice recognition approach.
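
A minimal sketch of extracting the two non-verbal cues these interaction techniques build on, per frame: volume as RMS energy (for "control by continuous voice") and pitch via autocorrelation (for "rate-based parameter control by pitch"). The voicing threshold and pitch range are illustrative.

import numpy as np

def voice_controls(frame, fs, f_lo=80.0, f_hi=400.0, voiced_thr=0.01):
    volume = np.sqrt(np.mean(frame**2))            # "control by continuous voice"
    if volume < voiced_thr:
        return volume, None                        # silence: no pitch to control with
    ac = np.correlate(frame, frame, mode="full")[len(frame)-1:]
    lo, hi = int(fs / f_hi), int(fs / f_lo)        # plausible pitch-lag range
    lag = lo + np.argmax(ac[lo:hi])
    pitch = fs / lag                               # "rate-based parameter control by pitch"
    return volume, pitch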

PatentDOI
Jebu Jacob Rajan
TL;DR: In this article, a system is described that allows a user to add word models to a speech recognition system: the user inputs a number of renditions of a new word, from which the system generates a sequence of phonemes representative of the new word.
Abstract: A system is provided for allowing a user to add word models to a speech recognition system. In particular, the system allows a user to input a number of renditions of the new word and generates from these a sequence of phonemes representative of the new word. This representative sequence of phonemes is stored in a word-to-phoneme dictionary, together with the typed version of the word, for subsequent use by the speech recognition system.

Patent
Magnus Westerlund, Anders Nohlgren, Anders Uvliden, Jonas Svedberg, Jim Sundqvist
10 May 2001
TL;DR: In this article, an improved forward error correction (FEC) technique for coding speech data was proposed, which provides interaction between the primary synthesis model and the redundant synthesis model during and after decoding to improve the quality of a synthesized output speech signal.
Abstract: An improved forward error correction (FEC) technique for coding speech data provides an encoder module which primary-encodes an input speech signal using a primary synthesis model to produce primary-encoded data, and redundant-encodes the input speech signal using a redundant synthesis model to produce redundant-encoded data. A packetizer combines the primary-encoded data and the redundant-encoded data into a series of packets and transmits the packets over a packet-based network, such as an Internet Protocol (IP) network. A decoding module primary-decodes the packets using the primary synthesis model, and redundant-decodes the packets using the redundant synthesis model. The technique provides interaction between the primary synthesis model and the redundant synthesis model during and after decoding to improve the quality of a synthesized output speech signal. Such "interaction," for instance, may take the form of updating states in one model using the other model.
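
A toy illustration of the packet layout such forward error correction implies, under the assumption that each packet carries frame i's primary encoding plus a redundant (typically lower-rate) encoding of frame i-1, so a single lost packet can be bridged; the interaction between the two synthesis models' states, which the patent emphasizes, happens in the decoders and is not shown.

def packetize(primary, redundant):
    # primary[i], redundant[i]: encoded bytes for frame i from the two codecs
    packets = []
    for i, p in enumerate(primary):
        r = redundant[i - 1] if i > 0 else b""
        packets.append({"seq": i, "primary": p, "redundant_prev": r})
    return packets

def recover(packets, lost_seq):
    # If packet `lost_seq` is missing, its redundant copy travels in the next packet.
    nxt = next((pkt for pkt in packets if pkt["seq"] == lost_seq + 1), None)
    return nxt["redundant_prev"] if nxt else None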

Proceedings ArticleDOI
20 Dec 2001
TL;DR: In this paper, a common narrow-band speech signal is expanded into a wide-band speech signal, and the expanded signal gives the impression of a wide-band speech signal regardless of what type of vocoder is used in a receiver.
Abstract: A common narrow-band speech signal is expanded into a wide-band speech signal. The expanded signal gives the impression of a wide-band speech signal regardless of what type of vocoder is used in a receiver. The robust techniques suggested herein are based on speech acoustics and fundamentals of human hearing. That is, the techniques extend the harmonic structure of the speech signal during voiced speech segments and introduce a linearly estimated amount of speech energy in the wide frequency-band. During unvoiced speech segments, a fricated noise may be introduced in the upper frequency-band.
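
A crude sketch in the spirit of the description above, assuming an 8 kHz narrow-band signal: upsample to 16 kHz, fill the empty upper band by spectrally folding the low band with an estimated gain to extend the harmonic structure, and add low-level noise as a stand-in for frication. The actual technique operates segment-by-segment and treats voiced and unvoiced speech differently.

import numpy as np
from scipy.signal import resample_poly

def expand_band(nb, fold_gain=0.3, noise_gain=0.05):
    wb = resample_poly(nb, 2, 1)                  # 8 kHz -> 16 kHz; upper band is empty
    spec = np.fft.rfft(wb)
    half = len(spec) // 2
    # extend harmonic structure: mirror the low band into the empty high band
    spec[half:2*half] = fold_gain * np.conj(spec[half:0:-1][:half])
    wb = np.fft.irfft(spec, len(wb))
    # very crude frication stand-in: a little broadband noise
    return wb + noise_gain * np.std(wb) * np.random.randn(len(wb))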

Proceedings ArticleDOI
07 May 2001
TL;DR: Results show that the speaker identity of speech whose LPC spectrum has been converted can be recognized as the target speaker with the same level of performance as discriminating between LPC coded speech; however, the level of discrimination of converted utterances produced by the full VC system is significantly below that of speaker discrimination of natural speech.
Abstract: The purpose of a voice conversion (VC) system is to change the perceived speaker identity of a speech signal. We propose an algorithm based on converting the LPC spectrum and predicting the residual as a function of the target envelope parameters. We conduct listening tests based on speaker discrimination of same/difference pairs to measure the accuracy by which the converted voices match the desired target voices. To establish the level of human performance as a baseline, we first measure the ability of listeners to discriminate between original speech utterances under three conditions: normal, fundamental frequency and duration normalized, and LPC coded. Additionally, the spectral parameter conversion function is tested in isolation by listening to source, target, and converted speakers as LPC coded speech. The results show that the speaker identity of speech whose LPC spectrum has been converted can be recognized as the target speaker with the same level of performance as discriminating between LPC coded speech. However, the level of discrimination of converted utterances produced by the full VC system is significantly below that of speaker discrimination of natural speech.

Proceedings ArticleDOI
09 Dec 2001
TL;DR: Speech recognition experiments show that it is beneficial in this multispeaker setting to use the output of the speech activity detector for presegmenting the recognizer input, achieving word error rates within 10% of those achieved with manual turn labeling.
Abstract: As part of a project into speech recognition in meeting environments, we have collected a corpus of multichannel meeting recordings. We expected the identification of speaker activity to be straightforward given that the participants had individual microphones, but simple approaches yielded unacceptably erroneous labelings, mainly due to crosstalk between nearby speakers and wide variations in channel characteristics. Therefore, we have developed a more sophisticated approach for multichannel speech activity detection using a simple hidden Markov model (HMM). A baseline HMM speech activity detector has been extended to use mixtures of Gaussians to achieve robustness for different speakers under different conditions. Feature normalization and crosscorrelation processing are used to increase the channel independence and to detect crosstalk. The use of both energy normalization and crosscorrelation based postprocessing results in a 35% relative reduction of the frame error rate. Speech recognition experiments show that it is beneficial in this multispeaker setting to use the output of the speech activity detector for presegmenting the recognizer input, achieving word error rates within 10% of those achieved with manual turn labeling.
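
A simplified sketch of per-channel speech activity detection in this multichannel setting, assuming frame log-energies per channel are available: a two-state Viterbi decode over normalized log-energy stands in for the paper's Gaussian-mixture HMM, and an energy-dominance test stands in for its cross-correlation crosstalk processing. State models, penalties, and thresholds are illustrative.

import numpy as np

def detect_activity(log_energy, other_channels, trans_stay=0.98):
    # log_energy: (T,) log-energy of this speaker's channel
    # other_channels: (C, T) log-energies of the other microphones
    e = (log_energy - log_energy.mean()) / (log_energy.std() + 1e-9)   # feature normalization
    # crosstalk cue (stand-in for cross-correlation): another channel dominates this frame
    crosstalk = (other_channels.max(axis=0) - log_energy) > 3.0
    # state 0 = nonspeech, state 1 = speech, each modeled by a single Gaussian here
    means, stds = np.array([-0.5, 1.0]), np.array([0.7, 0.7])
    loglik = -0.5 * ((e[None, :] - means[:, None]) / stds[:, None]) ** 2
    loglik[1, crosstalk] -= 5.0              # discourage "speech" on crosstalk frames
    logA = np.log(np.array([[trans_stay, 1 - trans_stay],
                            [1 - trans_stay, trans_stay]]))
    T = len(e)
    delta = np.zeros((2, T))
    psi = np.zeros((2, T), dtype=int)
    delta[:, 0] = loglik[:, 0]
    for t in range(1, T):                    # Viterbi decoding over the two states
        scores = delta[:, t - 1][:, None] + logA
        psi[:, t] = scores.argmax(axis=0)
        delta[:, t] = scores.max(axis=0) + loglik[:, t]
    path = np.zeros(T, dtype=int)
    path[-1] = delta[:, -1].argmax()
    for t in range(T - 2, -1, -1):
        path[t] = psi[path[t + 1], t + 1]
    return path.astype(bool)                 # True = speech frame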

PatentDOI
Ellen Eide
TL;DR: In this paper, a data-driven text-to-speech system is proposed to collect a database of natural speech from which to train models or select segments for concatenation.
Abstract: Building a data-driven text-to-speech system involves collecting a database of natural speech from which to train models or select segments for concatenation. Typically the speech in that database is produced by a single speaker. In this invention we include in our database speech from a multiplicity of speakers.

Patent
05 Jan 2001
TL;DR: In this article, a system and method for speech signal enhancement upsamples a narrowband speech signal at a receiver to generate a wideband speech signal; the lower frequency range of the wideband signal is reproduced using the received narrowband signal.
Abstract: A system and method for speech signal enhancement upsamples a narrowband speech signal at a receiver to generate a wideband speech signal. The lower frequency range of the wideband speech signal is reproduced using the received narrowband speech signal. The received narrowband speech signal is analyzed to determine its formants and pitch information. The upper frequency range of the wideband speech signal is synthesized using information derived from the received narrowband speech signal.

Patent
28 Feb 2001
TL;DR: In this paper, an energy level associated with audio input is ascertained, and a decision is rendered on whether to accept at least one recognized word as valid speech input, based on the ascertained energy level.
Abstract: Methods and apparatus for providing speech recognition in noisy environments. An energy level associated with audio input is ascertained, and a decision is rendered on whether to accept at least one recognized word as valid speech input, based on the ascertained energy level.
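
A minimal sketch of the energy-gated decision described above, assuming the audio samples behind a recognized word are available; the dB thresholds are illustrative, not from the patent.

import numpy as np

def accept_word(samples, noise_floor_db=-60.0, margin_db=15.0):
    # accept the recognized word only if its audio energy is plausibly speech
    energy_db = 10.0 * np.log10(np.mean(np.asarray(samples, float) ** 2) + 1e-12)
    return energy_db > noise_floor_db + margin_db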

PatentDOI
TL;DR: In this article, a distributed speech recognition system includes a speech processor linked to a plurality of speech recognition engines, each of which performs different functions, and the system further includes means for selectively activating or deactivating the plurality of servers based upon usage of the system.
Abstract: A distributed speech recognition system includes a speech processor linked to a plurality of speech recognition engines. The speech processor includes an input for receiving speech files from a plurality of users and storage means for storing the received speech files until such a time that they are forwarded to a selected speech recognition engine for processing. Each of the speech recognition engines includes a plurality of servers selectively performing different functions. The system further includes means for selectively activating or deactivating the plurality of servers based upon usage of the distributed speech recognition system.

Patent
08 Nov 2001
TL;DR: In this paper, the authors present a data processing system and method that uses RTSP and associated protocols to support voice applications and audio processing by various, distributed, speech processing engines.
Abstract: The present invention relates to a data processing system and method and, more particularly, to a computer aided telephony system and method which uses RTSP and associated protocols to support voice applications and audio processing by various, distributed, speech processing engines. Since RTSP is used to distribute the tasks to be performed by the speech processing engines, a distributed and scalable system can be realised. Furthermore, the integration of third party speech processing engines is greatly simplified due to the RTSP or HTTP interface to those engines.

Patent
16 Aug 2001
TL;DR: In this article, a technique is provided for updating speech models for speech recognition by identifying, from a class of users, speech data for a predetermined set of utterances that differ from a set of stored speech models by at least a predetermined amount.
Abstract: A technique is provided for updating speech models for speech recognition by identifying, from a class of users, speech data for a predetermined set of utterances that differ from a set of stored speech models by at least a predetermined amount. The identified speech data for similar utterances from the class of users is collected and used to correct the set of stored speech models. As a result, the corrected speech models are a closer match to the utterances than were the set of stored speech models. The set of speech models are subsequently updated with the corrected speech models to provide improved speech recognition of utterances from the class of users. For example, the corrected speech models may be processed and stored at a central database and returned, via a suitable communications channel (e.g. the Internet) to individual user sites to update the speech recognition apparatus at those sites.

Proceedings ArticleDOI
Wu Chou, Liang Gu
07 May 2001
TL;DR: A new set of features derived from the harmonic coefficient and its 4 Hz modulation values are developed in this paper and these new features provide additional and reliable cues to separate speech from singing, which leads to further improvements in speech/music discrimination.
Abstract: In this paper, an approach for robust singing signal detection in speech/music discrimination is proposed and applied to applications of audio indexing. Conventional approaches in speech/music discrimination can provide reasonable performance with regular music signals but often perform poorly with singing segments. This is due mainly to the fact that speech and singing signals are extremely close and traditional features used in speech recognition do not provide a reliable cue for speech and singing signal discrimination. In order to improve the robustness of speech/music discrimination, a new set of features derived from the harmonic coefficient and its 4 Hz modulation values are developed in this paper, and these new features provide additional and reliable cues to separate speech from singing. In addition, a rule-based post-filtering scheme is also described which leads to further improvements in speech/music discrimination. Source-independent audio indexing experiments on the PBS Skills database indicate that the proposed approach can greatly reduce the classification error rate on singing segments in the audio stream. Comparing with existing approaches, the overall segmentation error rate is reduced by more than 30%, averaged over all shows in the database.