
Showing papers on "Speech coding published in 2001"


Journal ArticleDOI
TL;DR: An unbiased noise estimator is developed which derives the optimal smoothing parameter for recursive smoothing of the power spectral density of the noisy speech signal by minimizing a conditional mean square estimation error criterion in each time step.
Abstract: We describe a method to estimate the power spectral density of nonstationary noise when a noisy speech signal is given. The method can be combined with any speech enhancement algorithm which requires a noise power spectral density estimate. In contrast to other methods, our approach does not use a voice activity detector. Instead it tracks spectral minima in each frequency band without any distinction between speech activity and speech pause. By minimizing a conditional mean square estimation error criterion in each time step we derive the optimal smoothing parameter for recursive smoothing of the power spectral density of the noisy speech signal. Based on the optimally smoothed power spectral density estimate and the analysis of the statistics of spectral minima an unbiased noise estimator is developed. The estimator is well suited for real time implementations. Furthermore, to improve the performance in nonstationary noise we introduce a method to speed up the tracking of the spectral minima. Finally, we evaluate the proposed method in the context of speech enhancement and low bit rate speech coding with various noise types.
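
The recursion at the heart of the method is compact enough to sketch. The fragment below (a minimal NumPy illustration, not the paper's implementation) uses a fixed smoothing parameter and a constant bias-compensation factor; the paper instead derives an optimal time- and frequency-dependent smoothing parameter and an analytic bias correction from the statistics of spectral minima.

```python
import numpy as np

def minimum_statistics_noise_psd(noisy_psd, alpha=0.85, win=100, bias=1.5):
    """Sketch of minimum-tracking noise PSD estimation.

    noisy_psd : (frames, bins) periodogram of the noisy speech signal.
    alpha     : fixed recursive smoothing parameter (the paper derives an
                optimal, time-varying value per frequency bin).
    win       : length of the sliding minimum-search window in frames.
    bias      : compensation for the downward bias of the minimum estimate.
    """
    noisy_psd = np.asarray(noisy_psd, dtype=float)
    smoothed = np.empty_like(noisy_psd)
    noise_est = np.empty_like(noisy_psd)
    smoothed[0] = noisy_psd[0]
    for t in range(1, len(noisy_psd)):
        # First-order recursive smoothing of the noisy periodogram.
        smoothed[t] = alpha * smoothed[t - 1] + (1 - alpha) * noisy_psd[t]
    for t in range(len(noisy_psd)):
        lo = max(0, t - win + 1)
        # The windowed minimum tracks the noise floor even during speech
        # activity, which is why no voice activity detector is required.
        noise_est[t] = bias * smoothed[lo:t + 1].min(axis=0)
    return noise_est
```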

1,731 citations


Book
01 Jan 2001
TL;DR: This book develops a discrete-time speech signal processing framework spanning the speech communication pathway, filter-bank summation (FBS) methods, homomorphic signal processing, and applications including speech coding, enhancement, and speaker recognition.
Abstract: (NOTE: Each chapter begins with an introduction and concludes with a Summary, Exercises and Bibliography.) 1. Introduction. Discrete-Time Speech Signal Processing. The Speech Communication Pathway. Analysis/Synthesis Based on Speech Production and Perception. Applications. Outline of Book. 2. A Discrete-Time Signal Processing Framework. Discrete-Time Signals. Discrete-Time Systems. Discrete-Time Fourier Transform. Uncertainty Principle. z-Transform. LTI Systems in the Frequency Domain. Properties of LTI Systems. Time-Varying Systems. Discrete-Fourier Transform. Conversion of Continuous Signals and Systems to Discrete Time. 3. Production and Classification of Speech Sounds. Anatomy and Physiology of Speech Production. Spectrographic Analysis of Speech. Categorization of Speech Sounds. Prosody: The Melody of Speech. Speech Perception. 4. Acoustics of Speech Production. Physics of Sound. Uniform Tube Model. A Discrete-Time Model Based on Tube Concatenation. Vocal Fold/Vocal Tract Interaction. 5. Analysis and Synthesis of Pole-Zero Speech Models. Time-Dependent Processing. All-Pole Modeling of Deterministic Signals. Linear Prediction Analysis of Stochastic Speech Sounds. Criterion of "Goodness". Synthesis Based on All-Pole Modeling. Pole-Zero Estimation. Decomposition of the Glottal Flow Derivative. Appendix 5.A: Properties of Stochastic Processes. Random Processes. Ensemble Averages. Stationary Random Process. Time Averages. Power Density Spectrum. Appendix 5.B: Derivation of the Lattice Filter in Linear Prediction Analysis. 6. Homomorphic Signal Processing. Concept. Homomorphic Systems for Convolution. Complex Cepstrum of Speech-Like Sequences. Spectral Root Homomorphic Filtering. Short-Time Homomorphic Analysis of Periodic Sequences. Short-Time Speech Analysis. Analysis/Synthesis Structures. Contrasting Linear Prediction and Homomorphic Filtering. 7. Short-Time Fourier Transform Analysis and Synthesis. Short-Time Analysis. Short-Time Synthesis. Short-Time Fourier Transform Magnitude. Signal Estimation from the Modified STFT or STFTM. Time-Scale Modification and Enhancement of Speech. Appendix 7.A: FBS Method with Multiplicative Modification. 8. Filter-Bank Analysis/Synthesis. Revisiting the FBS Method. Phase Vocoder. Phase Coherence in the Phase Vocoder. Constant-Q Analysis/Synthesis. Auditory Modeling. 9. Sinusoidal Analysis/Synthesis. Sinusoidal Speech Model. Estimation of Sinewave Parameters. Synthesis. Source/Filter Phase Model. Additive Deterministic-Stochastic Model. Appendix 9.A: Derivation of the Sinewave Model. Appendix 9.B: Derivation of Optimal Cubic Phase Parameters. 10. Frequency-Domain Pitch Estimation. A Correlation-Based Pitch Estimator. Pitch Estimation Based on a "Comb Filter". Pitch Estimation Based on a Harmonic Sinewave Model. Glottal Pulse Onset Estimation. Multi-Band Pitch and Voicing Estimation. 11. Nonlinear Measurement and Modeling Techniques. The STFT and Wavelet Transform Revisited. Bilinear Time-Frequency Distributions. Aeroacoustic Flow in the Vocal Tract. Instantaneous Teager Energy Operator. 12. Speech Coding. Statistical Models of Speech. Scalar Quantization. Vector Quantization (VQ). Frequency-Domain Coding. Model-Based Coding. LPC Residual Coding. 13. Speech Enhancement. Introduction. Preliminaries. Wiener Filtering. Model-Based Processing. Enhancement Based on Auditory Masking. Appendix 13.A: Stochastic-Theoretic Parameter Estimation. 14. Speaker Recognition. Introduction. Spectral Features for Speaker Recognition. Speaker Recognition Algorithms. Non-Spectral Features in Speaker Recognition. Signal Enhancement for the Mismatched Condition. Speaker Recognition from Coded Speech. Appendix 14.A: Expectation-Maximization (EM) Estimation. Glossary. Speech Signal Processing. Units. Databases. Index. About the Author.

984 citations


Journal ArticleDOI
TL;DR: The audio watermarking method presented below offers copyright protection to an audio signal by modifying its temporal characteristics, embedding a watermark generated from a key known only to the copyright owner.
Abstract: The audio watermarking method proposed in this paper offers copyright protection to an audio signal by time domain processing. The strength of audio signal modifications is limited by the necessity to produce an output signal that is perceptually similar to the original one. The watermarking method presented here does not require the use of the original signal for watermark detection. The watermark signal is generated using a key, i.e., a single number known only to the copyright owner. Watermark embedding depends on the audio signal amplitude and frequency in a way that minimizes the audibility of the watermark signal. The embedded watermark is robust to common audio signal manipulations like MPEG audio coding, cropping, time shifting, filtering, resampling, and requantization.
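
As a rough illustration of the general idea (not the authors' exact embedding rule, and with an arbitrary strength parameter), a pseudo-random watermark can be derived from the secret key, scaled by the local signal amplitude so it stays below audibility, and added in the time domain; detection then correlates against the regenerated sequence without needing the original signal:

```python
import numpy as np

def embed_watermark(signal, key, strength=0.005, frame=1024):
    """Key-driven time-domain embedding; parameters are illustrative."""
    rng = np.random.default_rng(key)           # key: single secret number
    wm = rng.choice([-1.0, 1.0], size=len(signal))
    out = np.asarray(signal, dtype=float).copy()
    for start in range(0, len(signal), frame):
        seg = slice(start, start + frame)
        # Scale by local amplitude so louder passages mask a stronger,
        # and therefore more robust, watermark.
        local_amp = np.abs(out[seg]).mean()
        out[seg] += strength * local_amp * wm[seg]
    return out

def detect_watermark(received, key):
    """Blind correlation detector: the original signal is not required."""
    rng = np.random.default_rng(key)
    wm = rng.choice([-1.0, 1.0], size=len(received))
    return float(np.dot(received, wm) / len(received))  # clearly positive => present
```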

491 citations


Journal ArticleDOI
TL;DR: A heuristic rule-based procedure, built upon morphological and statistical analysis of the time-varying functions of simple audio features, is proposed to segment and classify audio signals.
Abstract: While current approaches for audiovisual data segmentation and classification are mostly focused on visual cues, audio signals may actually play a more important role in content parsing for many applications. An approach to automatic segmentation and classification of audiovisual data based on audio content analysis is proposed. The audio signal from movies or TV programs is segmented and classified into basic types such as speech, music, song, environmental sound, speech with music background, environmental sound with music background, silence, etc. Simple audio features including the energy function, the average zero-crossing rate, the fundamental frequency, and the spectral peak tracks are extracted to ensure the feasibility of real-time processing. A heuristic rule-based procedure, built upon morphological and statistical analysis of the time-varying functions of these audio features, is proposed to segment and classify audio signals. Experimental results show that the proposed scheme achieves an accuracy rate of more than 90% in audio classification.
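
Two of the named features are simple enough to sketch directly; the fragment below (an illustration with made-up thresholds, not the paper's rule set) computes short-time energy and the average zero-crossing rate, then applies a toy rule of the heuristic kind the paper describes:

```python
import numpy as np

def short_time_features(x, sr, frame_ms=20):
    """Per-frame energy and zero-crossing rate, two of the paper's features."""
    n = int(sr * frame_ms / 1000)
    frames = np.asarray(x, dtype=float)[: len(x) // n * n].reshape(-1, n)
    energy = (frames ** 2).mean(axis=1)
    zcr = (np.abs(np.diff(np.sign(frames), axis=1)) > 0).mean(axis=1)
    return energy, zcr

def classify_frame(energy, zcr, e_sil=1e-4, z_speech=0.1):
    """Toy rule-based decision; thresholds are illustrative assumptions."""
    if energy < e_sil:
        return "silence"
    # Speech alternates voiced and unvoiced sounds, so its ZCR varies
    # strongly over time; a single-frame proxy is a moderately high ZCR.
    return "speech" if zcr > z_speech else "music"

# Usage: labels = [classify_frame(e, z) for e, z in zip(*short_time_features(x, sr))]
```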

473 citations


Journal ArticleDOI
TL;DR: This work describes a scheme that is able to classify audio segments into seven categories consisting of silence, single speaker speech, music, environmental noise, multiple speakers' speech, simultaneous speech and music, and speech and noise, and shows that cepstral-based features such as the Mel-frequency cepstral coefficients (MFCC) and linear prediction coefficients (LPC) provide better classification accuracy compared to temporal and spectral features.

315 citations


Proceedings ArticleDOI
Lie Lu1, Hao Jiang1, Hong-Jiang Zhang1
01 Oct 2001
TL;DR: A robust algorithm that is capable of segmenting and classifying an audio stream into speech, music, environment sound and silence is presented and some new features such as the noise frame ratio and band periodicity are introduced.
Abstract: In this paper, we present a robust algorithm for audio classification that is capable of segmenting and classifying an audio stream into speech, music, environment sound and silence. Audio classification is processed in two steps, which makes it suitable for different applications. The first step of the classification is speech and non-speech discrimination. In this step, a novel algorithm based on KNN and LSP VQ is presented. The second step further divides the non-speech class into music, environment sounds and silence with a rule-based classification scheme. Some new features such as the noise frame ratio and band periodicity are introduced and discussed in detail. Our experiments in the context of video structure parsing have shown that the algorithms produce very satisfactory results.

266 citations


Journal ArticleDOI
TL;DR: The proposed VAD algorithm combines HOS metrics with second-order measures, such as SNR and LPC prediction error, to classify speech and noise frames and derives a voicing condition for speech frames based on the relation between the skewness and kurtosis of voiced speech.
Abstract: This paper presents a robust algorithm for voice activity detection (VAD) based on newly established properties of the higher order statistics (HOS) of speech. Analytical expressions for the third and fourth-order cumulants of the LPC residual of short-term speech are derived assuming a sinusoidal model. The flat spectral feature of this residual results in distinct characteristics for these cumulants in terms of phase, periodicity and harmonic content and yields closed-form expressions for the skewness and kurtosis. Important properties about these cumulants and their similarity with the autocorrelation function are revealed from this exploratory part. They show that the HOS of speech are sufficiently distinct from those of Gaussian noise and can be used as a basis for speech detection. Their immunity to Gaussian noise makes them particularly useful in algorithms designed for low SNR environments. The proposed VAD algorithm combines HOS metrics with second-order measures, such as SNR and LPC prediction error, to classify speech and noise frames. The variance of the HOS estimators is quantified and used to yield a likelihood measure for noise frames. Moreover, a voicing condition for speech frames is derived based on the relation between the skewness and kurtosis of voiced speech. The performance of the algorithm is compared to the ITU-T G.729B VAD in various noise conditions, and quantified using the probability of correct and false classifications. The results show that the proposed algorithm has an overall better performance than G.729B, with noticeable improvement in Gaussian-like noises, such as street and parking garage, and moderate to low SNR.
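
The skewness and kurtosis statistics at the core of the detector are straightforward to compute from the LPC residual. A minimal sketch follows (autocorrelation-method LPC and sample cumulant ratios; the paper's estimator variances and full decision logic are omitted):

```python
import numpy as np
from scipy.linalg import solve_toeplitz
from scipy.signal import lfilter

def lpc_residual(frame, order=10):
    """LPC analysis via the autocorrelation method, then inverse filtering."""
    frame = np.asarray(frame, dtype=float)
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    a = solve_toeplitz((r[:order], r[:order]), r[1:order + 1])
    return lfilter(np.concatenate(([1.0], -a)), [1.0], frame)

def hos_metrics(residual):
    """Sample skewness and excess kurtosis of the LPC residual.

    For Gaussian noise both statistics are near zero, while voiced speech
    yields distinct, related values; this contrast is the basis of the
    paper's speech/noise decision.
    """
    e = residual - residual.mean()
    s2 = (e ** 2).mean()
    skewness = (e ** 3).mean() / s2 ** 1.5
    kurtosis = (e ** 4).mean() / s2 ** 2 - 3.0
    return skewness, kurtosis
```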

249 citations


Patent
26 Feb 2001
TL;DR: In this paper, a speech manager interface allows the speech recognition process and the text-to-speech process to be accessed by other application processes in handheld electronic devices such as a personal digital assistant (PDA).
Abstract: A handheld electronic device such as a personal digital assistant (PDA) has multiple application processes. A speech recognition process takes input speech from a user and produces a recognition output representative of the input speech. A text-to-speech process takes output text and produces a representative speech output. A speech manager interface allows the speech recognition process and the text-to-speech process to be accessed by other application processes.

239 citations


Patent
Steve Tischer1
10 Dec 2001
TL;DR: In this paper, a method and system of customizing voice translation of a text to speech includes digitally recording speech samples of a known speaker, correlating each of the speech samples with a standardized audio representation, and organizing the recorded speech samples and correlated audio representations into a collection.
Abstract: A method and system of customizing voice translation of a text to speech includes digitally recording speech samples of a known speaker, correlating each of the speech samples with a standardized audio representation, and organizing the recorded speech samples and correlated audio representations into a collection. The collection of speech samples correlated with audio representations is saved as a single voice file and stored in a device capable of translating the text to speech. The voice file is applied to a translation of text to speech so that the translated speech is customized according to the applied voice file.

229 citations


Book
01 Jan 2001
TL;DR: This book presents an introduction to real-time digital signal processing on the TMS320C55x, covering DSP fundamentals, filter design, adaptive filtering, and speech, audio, and image processing techniques, together with practical implementation considerations.
Abstract: Preface. Chapter 1. Introduction to Real-Time Digital Signal Processing. Chapter 2. Introduction to TMS320C55x Digital Signal Processor. Chapter 3. DSP Fundamentals and Implementation Considerations. Chapter 4. Design and Implementation of FIR Filters. Chapter 5. Design and Implementation of IIR Filters. Chapter 6. Frequency Analysis and Fast Fourier Transform. Chapter 7. Adaptive Filtering. Chapter 8. Digital Signal Generators. Chapter 9. Dual-Tone Multi-Frequency Detection. Chapter 10. Adaptive Echo Cancellation. Chapter 11. Speech Coding Techniques. Chapter 12. Speech Enhancement Techniques. Chapter 13. Audio Signal Processing. Chapter 14. Channel Coding Techniques. Chapter 15. Introduction to Digital Image Processing. Appendix A: Some Useful Formulas and Definitions. A.1 Trigonometric Identities. A.2 Geometric Series. A.3 Complex Variables. A.4 Units of Power. References. Appendix B: Software Organization and List of Experiments. Index.

228 citations


Journal ArticleDOI
TL;DR: A new and generalizing approach to error concealment is described as part of a modified robust speech decoder that can be applied to any speech codec standard and preserves bit exactness in the case of an error free channel.
Abstract: In digital speech communication over noisy channels there is the need for reducing the subjective effects of residual bit errors which have not been eliminated by channel decoding. This task is usually called error concealment. We describe a new and generalizing approach to error concealment as part of a modified robust speech decoder. It can be applied to any speech codec standard and preserves bit exactness in the case of an error free channel. The proposed method requires bit reliability information provided by the demodulator or by the equalizer or specifically by the channel decoder and can exploit additionally a priori knowledge about codec parameters. We apply our algorithms to PCM, ADPCM, and GSM full-rate speech coding using AWGN, fading, and GSM channel models, respectively. It turns out that the speech quality is significantly enhanced, showing the desired inherent muting mechanism or graceful degradation behavior in the case of extreme adverse transmission conditions.
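
One way to use bit reliabilities inside the decoder, sketched below under the assumption of a single scalar-quantized codec parameter, is a soft-decision (MMSE) estimate: every candidate bit pattern is weighted by its probability given the per-bit reliabilities, and the decoded value is the weighted mean over the codebook. As reliabilities drop, the estimate tends toward the codebook mean, which produces the gradual muting behavior described above. This is an illustrative reconstruction, not the patented algorithm itself.

```python
import numpy as np

def soft_decode_parameter(received_bits, bit_correct_prob, codebook):
    """MMSE decoding of one codec parameter from bit reliability values.

    received_bits    : hard-decided bits (bit k = bit k of the pattern index).
    bit_correct_prob : per-bit probability that the hard decision is right,
                       as delivered by the demodulator, equalizer, or
                       channel decoder.
    codebook         : parameter value for every possible bit pattern.
    """
    nbits = len(received_bits)
    probs = np.empty(len(codebook))
    for idx in range(len(codebook)):
        p = 1.0
        for k in range(nbits):
            bit = (idx >> k) & 1
            # Patterns agreeing with reliable bits get high probability.
            p *= bit_correct_prob[k] if bit == received_bits[k] \
                 else 1.0 - bit_correct_prob[k]
        probs[idx] = p
    probs /= probs.sum()
    # Reliability-weighted mean over all candidate parameter values.
    return float(np.dot(probs, codebook))
```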

PatentDOI
Steven G. Woodward1
TL;DR: In this paper, a method for processing a misrecognition error in an embedded speech recognition system during a speech recognition session can include the step of speech-to-text converting audio input in the embedded speech recognition system based on an active language model.
Abstract: A method for processing a misrecognition error in an embedded speech recognition system during a speech recognition session can include the step of speech-to-text converting audio input in the embedded speech recognition system based on an active language model. The speech-to-text conversion can produce speech recognized text that can be presented through a user interface. A user-initiated misrecognition error notification can be detected. The audio input and a reference to the active language model can be provided to a speech recognition system training process associated with the embedded speech recognition system.

Proceedings ArticleDOI
07 May 2001
TL;DR: In this paper, the authors present several mechanisms that enable effective spread-spectrum audio watermarking systems: prevention of detection desynchronization, cepstrum filtering, and chess watermarks.
Abstract: We present several mechanisms that enable effective spread-spectrum audio watermarking systems: prevention of detection desynchronization, cepstrum filtering, and chess watermarks. We have incorporated these techniques into a system capable of reliably detecting a watermark in an audio clip that has been modified using a composition of attacks that degrade the original audio characteristics well beyond the limit of acceptable quality. Such attacks include: fluctuating scaling in the time and frequency domain, compression, addition and multiplication of noise, resampling, requantization, normalization, filtering, and random cutting and pasting of signal samples.

Journal ArticleDOI
TL;DR: It is found that lossless audio coders have reached a limit in what can be achieved for lossless compression of audio, and a new lossless audio coder called AudioPak is described, which has low algorithmic complexity and performs as well as or better than most of the lossless audio coders that have been described in the literature.
Abstract: Lossless audio compression is likely to play an important part in music distribution over the Internet, DVD audio, digital audio archiving, and mixing. The article is a survey and a classification of the current state-of-the-art lossless audio compression algorithms. This study finds that lossless audio coders have reached a limit in what can be achieved for lossless compression of audio. It also describes a new lossless audio coder called AudioPak, which has low algorithmic complexity and performs as well as or better than most of the lossless audio coders that have been described in the literature.
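
Coders of this family typically pair a small set of fixed integer predictors with Golomb-Rice coding of the residual, and that core is simple enough to sketch. The fragment below is a minimal illustration, assuming a fixed second-order predictor and a fixed Rice parameter rather than the per-frame adaptation a real coder would use:

```python
import numpy as np

def rice_encode(values, k):
    """Golomb-Rice code for non-negative integers with parameter k."""
    bits = []
    for v in values:
        q, r = int(v) >> k, int(v) & ((1 << k) - 1)
        bits.extend([1] * q + [0])                             # unary quotient
        bits.extend((r >> j) & 1 for j in reversed(range(k)))  # k-bit remainder
    return bits

def lossless_encode(samples, k=4):
    """Fixed predictor e[n] = x[n] - 2x[n-1] + x[n-2], then Rice coding."""
    x = np.asarray(samples, dtype=np.int64)
    pred = np.zeros_like(x)
    pred[2:] = 2 * x[1:-1] - x[:-2]
    resid = x - pred
    # Zigzag map signed residuals to 0, 1, 2, ...: 0,-1,1,-2 -> 0,1,2,3.
    mapped = np.where(resid >= 0, 2 * resid, -2 * resid - 1)
    return rice_encode(mapped, k)
```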

Patent
11 Dec 2001
TL;DR: A computer-implemented method maps an audio file to a verbatim text file by linking words in a transcribed text file, which is associated with the audio file, to words in the verbatim text file.
Abstract: The invention includes a computer implemented method to map an audio file to a verbatim text file. First, a first window may be loaded with a transcribed text file having words. The transcribed text file is associated with the audio file. A second window is loaded with the verbatim text file having a plurality of words. At least one word from the transcribed text file and at least one word from the verbatim text file are selected. The at least one word from the transcribed text file and the at least one word from the verbatim text file is linked. The steps are repeated until all the words in the verbatim text file have been linked.

Patent
Magnus Westerlund1, Anders Nohlgren1, Anders Uvliden1, Jonas Svedberg1, Jim Sundqvist1 
10 May 2001
TL;DR: In this article, an improved forward error correction (FEC) technique for coding speech data was proposed, which provides interaction between the primary synthesis model and the redundant synthesis model during and after decoding to improve the quality of a synthesized output speech signal.
Abstract: An improved forward error correction (FEC) technique for coding speech data provides an encoder module which primary-encodes an input speech signal using a primary synthesis model to produce primary-encoded data, and redundant-encodes the input speech signal using a redundant synthesis model to produce redundant-encoded data. A packetizer combines the primary-encoded data and the redundant-encoded data into a series of packets and transmits the packets over a packet-based network, such as an Internet Protocol (IP) network. A decoding module primary-decodes the packets using the primary synthesis model, and redundant-decodes the packets using the redundant synthesis model. The technique provides interaction between the primary synthesis model and the redundant synthesis model during and after decoding to improve the quality of a synthesized output speech signal. Such "interaction," for instance, may take the form of updating states in one model using the other model.
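
A common packet layout for this kind of media FEC, shown below as an assumption for illustration (the patent's contribution is the interaction between the two synthesis models, not this particular packing), piggybacks the redundant encoding of frame n-1 onto the packet carrying frame n, so a single lost packet can be bridged from its successor:

```python
from dataclasses import dataclass

@dataclass
class Packet:
    seq: int
    primary: bytes    # frame n, primary synthesis model
    redundant: bytes  # frame n-1, redundant (lower-rate) synthesis model

def packetize(primary_frames, redundant_frames):
    """Pair each primary frame with redundancy for the previous frame."""
    return [Packet(n, prim, redundant_frames[n - 1] if n > 0 else b"")
            for n, prim in enumerate(primary_frames)]

def decode(received):
    """Prefer primary data; fall back to the redundant copy carried by
    the following packet when a packet was lost."""
    by_seq = {p.seq: p for p in received}
    out = []
    for n in range(max(by_seq) + 1 if by_seq else 0):
        if n in by_seq:
            out.append(("primary", by_seq[n].primary))
        elif n + 1 in by_seq and by_seq[n + 1].redundant:
            out.append(("redundant", by_seq[n + 1].redundant))
        else:
            out.append(("concealed", b""))   # neither copy arrived
    return out
```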

Patent
Steven E. Barile1
16 Mar 2001
TL;DR: In this paper, a system and method for embedding audio titles is presented, where information encoded in a format is received about an audio program and the information is transformed into an audio signal conveying an audio description about the audio program.
Abstract: A system and method for embedding audio titles is presented. Information encoded in a format is received about an audio program. The information is transformed into an audio signal conveying an audio description about the audio program. The audio description and the audio program are then embedded in a predetermined format.

PatentDOI
Dan Chazan1, Ron Hoory1
TL;DR: In this article, the spectral envelopes are integrated over a plurality of window functions in a frequency domain so as to determine elements of feature vectors corresponding to the speech segments, and an output speech signal is reconstructed by concatenating the feature vectors corresponding to a sequence of speech segments.
Abstract: A method for speech synthesis includes receiving an input speech signal containing a set of speech segments, and estimating spectral envelopes of the input speech signal in a succession of time intervals during each of the speech segments. The spectral envelopes are integrated over a plurality of window functions in a frequency domain so as to determine elements of feature vectors corresponding to the speech segments. An output speech signal is reconstructed by concatenating the feature vectors corresponding to a sequence of the speech segments.
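
The feature extraction step can be pictured as a filter-bank integration. The sketch below assumes triangular window functions on a linear frequency axis purely for concreteness; the patent leaves the window shapes and spacing general:

```python
import numpy as np

def envelope_features(envelope, centers_hz, sr, width_hz=200.0):
    """Integrate a spectral envelope over a bank of window functions.

    envelope   : spectral envelope sampled on bins spanning 0 .. sr/2.
    centers_hz : center frequencies of the (assumed triangular) windows.
    """
    freqs = np.linspace(0.0, sr / 2.0, len(envelope))
    feats = np.empty(len(centers_hz))
    for i, c in enumerate(centers_hz):
        # Triangular weight, nonzero on [c - width_hz, c + width_hz].
        w = np.clip(1.0 - np.abs(freqs - c) / width_hz, 0.0, None)
        feats[i] = np.dot(w, envelope) / (w.sum() + 1e-12)
    return feats
```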

Patent
14 Dec 2001
TL;DR: In this paper, an audio encoder dynamically selects between joint and independent coding of a multi-channel audio signal via an open-loop decision based upon energy separation between the coding channels, and the disparity between excitation patterns of the separate input channels.
Abstract: An audio encoder implements multi-channel coding decision, band truncation, multi-channel rematrixing, and header reduction techniques to improve quality and coding efficiency. In the multi-channel coding decision technique, the audio encoder dynamically selects between joint and independent coding of a multi-channel audio signal via an open-loop decision based upon (a) energy separation between the coding channels, and (b) the disparity between excitation patterns of the separate input channels. In the band truncation technique, the audio encoder performs open-loop band truncation at a cut-off frequency based on a target perceptual quality measure. In the multi-channel rematrixing technique, the audio encoder suppresses certain coefficients of a difference channel by scaling according to a scale factor, which is based on current average levels of perceptual quality, current rate control buffer fullness, coding mode, and the amount of channel separation in the source. In the header reduction technique, the audio encoder selectively modifies the quantization step size of zeroed quantization bands so as to encode in fewer frame header bits.
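
The open-loop joint/independent decision can be illustrated with the energy-separation half of the criterion (the excitation-pattern disparity term is omitted here, and the threshold is an arbitrary placeholder):

```python
import numpy as np

def choose_channel_mode(left, right, threshold_db=10.0):
    """Open-loop choice between joint (mid/side) and independent coding."""
    mid = 0.5 * (left + right)
    side = 0.5 * (left - right)
    e_mid = float(np.sum(mid ** 2)) + 1e-12
    e_side = float(np.sum(side ** 2)) + 1e-12
    # Highly correlated channels leave little energy in the difference
    # (side) channel, so joint coding is the cheaper representation.
    separation_db = 10.0 * np.log10(e_mid / e_side)
    return "joint" if separation_db > threshold_db else "independent"
```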

Journal ArticleDOI
TL;DR: An audio-visual approach to the problem of speech enhancement in noise is considered, since it has been demonstrated in several studies that viewing the speaker's face improves message intelligibility, especially in noisy environments.
Abstract: A key problem for telecommunication or human–machine communication systems concerns speech enhancement in noise. In this domain, a certain number of techniques exist, all of them based on an acoustic-only approach—that is, the processing of the audio corrupted signal using audio information (from the corrupted signal only or additive audio information). In this paper, an audio-visual approach to the problem is considered, since it has been demonstrated in several studies that viewing the speaker’s face improves message intelligibility, especially in noisy environments. A speech enhancement prototype system that takes advantage of visual inputs is developed. A filtering process approach is proposed that uses enhancement filters estimated with the help of lip shape information. The estimation process is based on linear regression or simple neural networks using a training corpus. A set of experiments assessed by Gaussian classification and perceptual tests demonstrates that it is indeed possible to enhance simple stimuli (vowel-plosive-vowel sequences) embedded in white Gaussian noise.

Patent
10 Dec 2001
TL;DR: In this article, the authors propose a technique to adjust the ratio of the primary vocal/dialog content of an audio program relative to the remaining portion of the audio content in that program.
Abstract: The invention enables the inclusion of voice and remaining audio information at different parts of the audio production process. In particular, the invention embodies special techniques for VRA-capable digital mastering and for accommodating VRA in those audio compression formats that lose less audio data than codecs whose net losses equal or exceed those of the AC3 compression format. The invention facilitates an end-listener's voice-to-remaining audio (VRA) adjustment upon the playback of digital audio media formats by focusing on new configurations of multiple parts of the entire digital audio system, thereby enabling a new technique intended to benefit audio end-users (end-listeners) who wish to control the ratio of the primary vocal/dialog content of an audio program relative to the remaining portion of the audio content in that program.
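
The playback-side adjustment itself reduces to a user-controlled mix of the two separately carried program parts; a minimal sketch follows (the gain law and clipping guard are assumptions, not the patented processing):

```python
import numpy as np

def vra_mix(voice, remaining_audio, voice_gain_db=6.0):
    """Mix separately delivered voice and remaining-audio tracks at
    playback, applying the end-listener's voice gain setting."""
    g = 10.0 ** (voice_gain_db / 20.0)
    mixed = g * np.asarray(voice, float) + np.asarray(remaining_audio, float)
    peak = np.abs(mixed).max()
    return mixed / peak if peak > 1.0 else mixed   # avoid clipping
```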

Patent
04 Sep 2001
TL;DR: In this article, an audio converter device and a method for using the same is presented, in which the audio data is decompressed and converted into analog electrical data and then transferred to an audio playback device.
Abstract: An audio converter device and a method for using the same are provided. In one embodiment, the audio converter device receives the digital audio data from a first device via a local area network. The audio converter device decompresses the digital audio data and converts the digital audio data into analog electrical data. The audio converter device transfers the analog electrical data to an audio playback device.

Proceedings ArticleDOI
20 Dec 2001
TL;DR: In this paper, a common narrow-band speech signal is expanded into a wide-band signal and the expanded signal gives the impression of a wide band speech signal regardless of what type of vocoder is used in a receiver.
Abstract: A common narrow-band speech signal is expanded into a wide-band speech signal. The expanded signal gives the impression of a wide-band speech signal regardless of what type of vocoder is used in a receiver. The robust techniques suggested herein are based on speech acoustics and fundamentals of human hearing. That is, the techniques extend the harmonic structure of the speech signal during voiced speech segments and introduce a linearly estimated amount of speech energy in the wide frequency-band. During unvoiced speech segments, a fricated noise may be introduced in the upper frequency-band.
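
One classical way to extend the harmonic structure into the upper band, given here as an illustration rather than the patent's method, is spectral folding: upsampling by zero insertion mirrors the 0-4 kHz spectrum into 4-8 kHz, and the mirrored image is mixed in at an estimated level (a fixed gain below, where the text describes a linearly estimated energy and a voiced/unvoiced switch):

```python
import numpy as np
from scipy.signal import butter, lfilter

def expand_bandwidth(narrow, high_gain=0.3):
    """Crude 8 kHz -> 16 kHz expansion by spectral folding."""
    # Zero insertion doubles the rate and creates a spectral image of the
    # 0-4 kHz band in 4-8 kHz, continuing the harmonic structure.
    up = np.zeros(2 * len(narrow))
    up[::2] = np.asarray(narrow, dtype=float)
    b_lo, a_lo = butter(8, 0.5)            # keep the original 0-4 kHz band
    b_hi, a_hi = butter(8, 0.5, "high")    # isolate the 4-8 kHz image
    low = 2.0 * lfilter(b_lo, a_lo, up)    # factor 2 restores amplitude
    high = 2.0 * lfilter(b_hi, a_hi, up)
    return low + high_gain * high
```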

Proceedings ArticleDOI
07 May 2001
TL;DR: Results show that the speaker identity of speech whose LPC spectrum has been converted can be recognized as the target speaker with the same level of performance as discriminating between LPC coded speech; however, the level of discrimination of converted utterances produced by the full VC system is significantly below that of speaker discrimination of natural speech.
Abstract: The purpose of a voice conversion (VC) system is to change the perceived speaker identity of a speech signal. We propose an algorithm based on converting the LPC spectrum and predicting the residual as a function of the target envelope parameters. We conduct listening tests based on speaker discrimination of same/difference pairs to measure the accuracy by which the converted voices match the desired target voices. To establish the level of human performance as a baseline, we first measure the ability of listeners to discriminate between original speech utterances under three conditions: normal, fundamental frequency and duration normalized, and LPC coded. Additionally, the spectral parameter conversion function is tested in isolation by listening to source, target, and converted speakers as LPC coded speech. The results show that the speaker identity of speech whose LPC spectrum has been converted can be recognized as the target speaker with the same level of performance as discriminating between LPC coded speech. However, the level of discrimination of converted utterances produced by the full VC system is significantly below that of speaker discrimination of natural speech.
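
The abstract does not spell out the conversion function; a commonly used minimal stand-in (an assumption here, not the authors' method) is an affine map between source and target spectral parameters such as line spectral frequencies, trained by least squares on time-aligned parallel data:

```python
import numpy as np

def train_spectral_mapping(src_lsf, tgt_lsf):
    """Least-squares affine map from source to target LSF vectors.

    src_lsf, tgt_lsf : (frames, order) time-aligned parallel training data.
    """
    X = np.hstack([src_lsf, np.ones((len(src_lsf), 1))])  # affine term
    W, *_ = np.linalg.lstsq(X, tgt_lsf, rcond=None)
    return W

def convert_frame(lsf, W):
    """Map one source LSF vector toward the target speaker; the residual
    would then be predicted from the converted envelope, as described."""
    out = np.append(lsf, 1.0) @ W
    # Keep the result a valid, ordered LSF vector in (0, pi).
    return np.sort(np.clip(out, 1e-3, np.pi - 1e-3))
```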

Patent
05 Jan 2001
TL;DR: In this article, a system and method for speech signal enhancement upsamples a narrowband speech signal at a receiver to generate a wide band speech signal, and then the lower frequency range of the wide band signal is reproduced using the received narrow band signal.
Abstract: A system and method for speech signal enhancement upsamples a narrowband speech signal at a receiver to generate a wideband speech signal. The lower frequency range of the wideband speech signal is reproduced using the received narrowband speech signal. The received narrowband speech signal is analyzed to determine its formants and pitch information. The upper frequency range of the wideband speech signal is synthesized using information derived from the received narrowband speech signal.

Proceedings ArticleDOI
01 Jan 2001
TL;DR: Experiments on a database composed of clips of 14870 seconds in total length show that the average accuracy rate for the SVM method is much better than that of the traditional Euclidean distance based (nearest neighbor) method.
Abstract: Audio data exist everywhere but are often unorganized. It is necessary to arrange them into regular classes in order to use them more easily. It is also useful, especially in video content analysis, to segment an audio stream according to audio types. In this paper, we present our work in applying support vector machines (SVMs) in audio segmentation and classification. Five audio classes are considered: silence, music, background sound, pure speech, and non-pure speech which includes speech over music and speech over noise. An SVM learns optimal class boundaries from training data to best separate between two classes. A sound clip is segmented by classifying each sub-clip of one second into one of these five classes. Experiments on a database composed of clips of 14870 seconds in total length show that the average accuracy rate for the SVM method is much better than that of the traditional Euclidean distance based (nearest neighbor) method.
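
The classification stage maps directly onto a standard SVM toolkit; a brief sketch follows (scikit-learn, an RBF kernel, and integer class labels are assumptions, and the paper's exact features are not reproduced):

```python
import numpy as np
from sklearn.svm import SVC

CLASSES = ["silence", "music", "background", "pure_speech", "non_pure_speech"]

def train_classifier(features, labels):
    """Train an SVM on per-second feature vectors; labels index CLASSES."""
    clf = SVC(kernel="rbf", C=10.0, gamma="scale")
    clf.fit(features, labels)
    return clf

def segment_clip(clf, subclip_features):
    """Classify each one-second sub-clip, then merge equal neighbours
    into segments reported as (start_s, end_s, class)."""
    preds = clf.predict(subclip_features)
    segments, start = [], 0
    for i in range(1, len(preds) + 1):
        if i == len(preds) or preds[i] != preds[start]:
            segments.append((start, i, CLASSES[int(preds[start])]))
            start = i
    return segments
```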

PatentDOI
TL;DR: In this article, a distributed speech recognition system includes a speech processor linked to a plurality of speech recognition engines, each of which performs different functions, and the system further includes means for selectively activating or deactivating the plurality of servers based upon usage of the system.
Abstract: A distributed speech recognition system includes a speech processor linked to a plurality of speech recognition engines. The speech processor includes an input for receiving speech files from a plurality of users and storage means for storing the received speech files until such a time that they are forwarded to a selected speech recognition engine for processing. Each of the speech recognition engines includes a plurality of servers selectively performing different functions. The system further includes means for selectively activating or deactivating the plurality of servers based upon usage of the distributed speech recognition system.

PatentDOI
TL;DR: In this paper, a series of original sentences for messages is segmented and stored as audio files with search criteria, and the audio files of the segments for which the examination establishes the prerequisites for optimally maintaining the natural speech rhythm are combined and output for reproduction.
Abstract: A method of composing messages for speech output and improving the quality of reproduction of speech outputs. A series of original sentences for messages is segmented and stored as audio files with search criteria. The length, position, and transition values for the respective segments can be recorded and stored. A sentence to be reproduced is transmitted in a format corresponding to the format of the search criteria. It is determined whether the sentence to be reproduced can be fully reproduced by one segment or a succession of stored segments. The segments found in each case are examined, using their stored entries, as to how closely the individual segments match in speech rhythm. The audio files of the segments for which the examination establishes the prerequisites for optimally maintaining the natural speech rhythm are combined and output for reproduction.

Patent
Martin Holzapfel1
10 Sep 2001
TL;DR: In this article, a first speech model is trained with a first time pattern and a second speech model with a second time pattern, and the second model is initialized with the first model.
Abstract: A method and also a configuration for determining a descriptive feature of a speech signal, in which a first speech model is trained with a first time pattern and a second speech model is trained with a second time pattern. The second speech model is initialized with the first speech model.

Patent
16 Aug 2001
TL;DR: In this article, a technique is provided for updating speech models for speech recognition by identifying, from a class of users, speech data for a predetermined set of utterances that differ from a set of stored speech models by at least a predetermined amount.
Abstract: A technique is provided for updating speech models for speech recognition by identifying, from a class of users, speech data for a predetermined set of utterances that differ from a set of stored speech models by at least a predetermined amount. The identified speech data for similar utterances from the class of users is collected and used to correct the set of stored speech models. As a result, the corrected speech models are a closer match to the utterances than were the set of stored speech models. The set of speech models are subsequently updated with the corrected speech models to provide improved speech recognition of utterances from the class of users. For example, the corrected speech models may be processed and stored at a central database and returned, via a suitable communications channel (e.g. the Internet) to individual user sites to update the speech recognition apparatus at those sites.