
Showing papers on "Linear predictive coding published in 2008"


BookDOI
17 Sep 2008
TL;DR: In this paper, the state of the art in important areas of speech and audio signal processing is discussed, including multi-microphone systems, specific approaches for noise reduction, and evaluations of speech signals and speech processing systems.
Abstract: The book reflects the state of the art in important areas of speech and audio signal processing. It presents topics that have so far been missing from the literature along with the most recent findings in the field. Leading international experts report on their fields of work and their new results. Considerable space is devoted to multi-microphone systems, specific approaches for noise reduction, and evaluations of speech signals and speech processing systems. Multi-microphone systems include automatic calibration of microphones, localisation of sound sources, and source separation procedures. Also covered are recent approaches to the problem of adaptive echo and noise suppression. A novel solution allows the design of filter banks exhibiting bands spaced according to the Bark scale and especially short delay times. Furthermore, a method for engine noise reduction and proposals for improving the signal-to-noise ratio based on partial signal reconstruction or on a noise reference are reported. A number of contributions deal with speech quality. Besides basic considerations for quality evaluation, specific methods for bandwidth extension of telephone speech are described. Procedures to reduce the reverberation of audio signals can help to increase speech intelligibility and speech recognition rates. In addition, solutions for specific applications in speech and audio signal processing are reported, including, e.g., the enhancement of audio signal reproduction in automobiles and the automatic evaluation of hands-free systems and hearing aids.

68 citations


Journal ArticleDOI
TL;DR: The results show that, despite a dramatic reduction of speech information, a pattern of binary gains provides an adequate basis for speech perception.
Abstract: For a given mixture of speech and noise, an ideal binary time-frequency mask is constructed by checking whether the SNR within each time-frequency unit exceeds a local SNR criterion (LC). With linear filters, co-reducing the mixture SNR and LC does not alter the ideal binary mask. Taking this manipulation to the limit by setting both the mixture SNR and LC to minus infinity produces an output that contains only noise, with no target speech at all. This particular output corresponds to turning the filtered noise on or off according to a pattern prescribed by the ideal binary mask. Our study was designed to test the speech intelligibility of noise gated by the ideal binary mask obtained this way. We observed that listeners achieve nearly perfect speech recognition from gated noise. Only sixteen filter channels and a frame rate of one hundred Hertz are sufficient for high intelligibility. The results show that, despite a dramatic reduction of speech information, a pattern of binary gains provides an adequate basis for speech perception in noise.
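The mask construction described above translates into a few lines of numpy. This is a minimal sketch, not the authors' code: `speech_tf` and `noise_tf` are hypothetical premixed magnitude spectrograms, and the final comparison illustrates the invariance property under co-reducing mixture SNR and LC by the same amount in dB.

```python
import numpy as np

def ideal_binary_mask(speech_tf, noise_tf, lc_db=0.0):
    """Ideal binary mask: 1 where the local SNR exceeds LC, else 0.

    speech_tf, noise_tf: magnitude spectrograms (channels x frames)
    of the premixed target speech and noise."""
    eps = 1e-12
    local_snr_db = 10.0 * np.log10((speech_tf ** 2 + eps) / (noise_tf ** 2 + eps))
    return (local_snr_db > lc_db).astype(float)

def gate_noise(noise_tf, mask):
    """Turn the filtered noise on or off per T-F unit, giving the
    'gated noise' stimulus described in the study."""
    return noise_tf * mask

# toy example: 4 channels x 3 frames, unit-magnitude noise
speech = np.array([[2.0, 0.1, 0.9],
                   [0.5, 3.0, 0.2],
                   [0.7, 0.8, 0.9],
                   [0.0, 2.0, 0.5]])
noise = np.ones_like(speech)
mask = ideal_binary_mask(speech, noise, lc_db=0.0)
gated = gate_noise(noise, mask)

# co-reducing mixture SNR (-20 dB on the target) and LC (-20 dB)
# leaves the mask unchanged
same_mask = ideal_binary_mask(speech * 0.1, noise, lc_db=-20.0)
```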

62 citations


Proceedings ArticleDOI
22 Sep 2008
TL;DR: The Glottal Spectral Separation method, which consists of separating the glottal source effects from the spectral envelope of the speech, enables us to compare the LF-model with the simple impulse excitation, using the same spectral envelope to synthesize speech.
Abstract: The great advantage of using a glottal source model in parametric speech synthesis is the degree of parametric flexibility it gives to transform and model aspects of voice quality and speaker identity. However, few studies have addressed how the glottal source affects the quality of synthetic speech. Here, we have developed the Glottal Spectral Separation (GSS) method which consists of separating the glottal source effects from the spectral envelope of the speech. It enables us to compare the LF-model with the simple impulse excitation, using the same spectral envelope to synthesize speech. The results of a perceptual evaluation showed that the LF-model clearly outperformed the impulse. The GSS method was also used to successfully transform a modal voice into a breathy or tense voice, by modifying the LF-parameters. The proposed technique could be used to improve the speech quality and source parametrization of HMM-based speech synthesizers, which use an impulse excitation. Index Terms: Glottal Spectral Separation, HMM-based speech synthesis, LF-model.

62 citations


Journal ArticleDOI
TL;DR: A new speech dereverberation approach based on a statistical speech model that represents the dynamic short time characteristics of nonreverberant speech segments, including the time and frequency structures of the speech spectrum is proposed.
Abstract: Distant acquisition of acoustic signals in an enclosed space often produces reverberant components due to acoustic reflections in the room. Speech dereverberation is in general desirable when the signal is acquired through distant microphones in such applications as hands-free speech recognition, teleconferencing, and meeting recording. This paper proposes a new speech dereverberation approach based on a statistical speech model. A time-varying Gaussian source model (TVGSM) is introduced as a model that represents the dynamic short time characteristics of nonreverberant speech segments, including the time and frequency structures of the speech spectrum. With this model, dereverberation of the speech signal is formulated as a maximum-likelihood (ML) problem based on multichannel linear prediction, in which the speech signal is recovered by transforming the observed signal into one that is probabilistically more like nonreverberant speech. We first present a general ML solution based on TVGSM, and derive several dereverberation algorithms based on various source models. Specifically, we present a source model consisting of a finite number of states, each of which is manifested by a short time speech spectrum, defined by a corresponding autocorrelation (AC) vector. The dereverberation algorithm based on this model involves a finite collection of spectral patterns that form a codebook. We confirm experimentally that both the time and frequency characteristics represented in the source models are very important for speech dereverberation, and that the prior knowledge represented by the codebook allows us to further improve the dereverberated speech quality. We also confirm that the quality of reverberant speech signals can be greatly improved, in terms of spectral shape and energy time-pattern distortions, using a speaker-independent codebook and only a short speech signal.

45 citations


Proceedings Article
01 Jan 2008
TL;DR: This paper presents two new classes of linear prediction schemes based on the concept of creating a sparse residual rather than a minimum variance one, which will allow a more efficient quantization and creates a sparser residual in the case of unvoiced speech.
Abstract: This paper presents two new classes of linear prediction schemes. The first is based on the concept of creating a sparse residual rather than a minimum-variance one, which allows more efficient quantization; we show that this works well in the presence of voiced speech, where the excitation can be represented by an impulse train, and that it creates a sparser residual in the case of unvoiced speech. The second class aims at finding sparse prediction coefficients; interesting results can be seen when applying it to the joint estimation of long-term and short-term predictors. The proposed estimators are all solutions to convex optimization problems, which can be solved efficiently and reliably using, e.g., interior-point methods.
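A toy illustration of the first scheme: minimizing the 1-norm of the prediction residual instead of its 2-norm. The paper uses interior-point solvers; the sketch below substitutes simple iteratively reweighted least squares (IRLS), which approximates the same l1 solution. All names and the voiced-like test signal are illustrative assumptions, not the authors' code.

```python
import numpy as np

def sparse_lp(x, order, n_iter=20, eps=1e-6):
    """Sparse linear prediction: coefficients a minimizing the 1-norm of
    the residual x[n] - sum_k a_k x[n-k], approximated via IRLS."""
    N = len(x)
    # prediction matrix: row for sample n holds x[n-1], ..., x[n-order]
    X = np.column_stack([x[order - k - 1:N - k - 1] for k in range(order)])
    t = x[order:]
    a = np.linalg.lstsq(X, t, rcond=None)[0]      # 2-norm (minimum variance) start
    for _ in range(n_iter):
        r = t - X @ a
        w = 1.0 / np.sqrt(np.abs(r) + eps)        # IRLS weights approximate the l1 cost
        a = np.linalg.lstsq(X * w[:, None], t * w, rcond=None)[0]
    return a

# voiced-like test signal: AR(1) driven by an impulse train
e = np.zeros(200)
e[::10] = 1.0
x = np.zeros(200)
x[0] = e[0]
for n in range(1, 200):
    x[n] = 0.9 * x[n - 1] + e[n]

a = sparse_lp(x, 1)
res = x[1:] - a[0] * x[:-1]   # residual: close to the sparse impulse train
```

With an impulse-train excitation, the l1 criterion recovers the true coefficient almost exactly, and the residual stays sparse; the 2-norm start is biased by the impulses.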

44 citations


28 Sep 2008
TL;DR: A new pattern classification method was implemented, in which Neural Networks trained using the Al-Alaoui Algorithm gave results comparable to the already implemented HMM method for the recognition of words and outperformed HMM in the recognition of sentences.
Abstract: In this paper, we compare two different methods for automatic Arabic speech recognition for isolated words and sentences. Isolated word/sentence recognition was performed using cepstral feature extraction by linear predictive coding, as well as Hidden Markov Models (HMM) for pattern training and classification. We implemented a new pattern classification method, in which we used Neural Networks trained using the Al-Alaoui Algorithm. This new method gave results comparable to the already implemented HMM method for the recognition of words, and it outperformed HMM in the recognition of sentences. The speech recognition system implemented is part of the Teaching and Learning Using Information Technology (TLIT) project, which would implement a set of reading lessons to assist adult illiterates in developing better reading capabilities.
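The LPC-cepstral front end mentioned above can be sketched in a few lines: Levinson-Durbin recursion on the frame autocorrelation, then the standard LPC-to-cepstrum recursion. This is a generic sketch of the textbook method under my own naming, not the authors' implementation (the gain term of the cepstrum is omitted).

```python
import numpy as np

def lpc(x, order):
    """LPC via the autocorrelation method and Levinson-Durbin.
    Model: x[n] ~ sum_{k=1..p} a[k-1] * x[n-k]."""
    r = np.correlate(x, x, mode="full")[len(x) - 1:len(x) + order]
    a = np.zeros(order)
    err = r[0]
    for i in range(order):
        k = (r[i + 1] - np.dot(a[:i], r[1:i + 1][::-1])) / err
        a_prev = a[:i].copy()
        a[:i] = a_prev - k * a_prev[::-1]   # update previous coefficients
        a[i] = k                             # new reflection coefficient
        err *= 1.0 - k * k                   # prediction error power
    return a

def lpcc(a, n_ceps):
    """LPC cepstral coefficients from predictor coefficients (gain omitted)."""
    p = len(a)
    c = np.zeros(n_ceps + 1)
    for n in range(1, n_ceps + 1):
        c[n] = a[n - 1] if n <= p else 0.0
        for k in range(max(1, n - p), n):
            c[n] += (k / n) * c[k] * a[n - 1 - k]
    return c[1:]

# sanity check on a synthetic AR(1) frame with pole at 0.9
rng = np.random.default_rng(7)
x = np.zeros(5000)
e = rng.standard_normal(5000)
for n in range(1, 5000):
    x[n] = 0.9 * x[n - 1] + e[n]
a = lpc(x, 1)                    # close to [0.9]
cc = lpcc(np.array([0.9]), 3)    # [0.9, 0.405, 0.243] for an AR(1) pole at 0.9
```

For an AR(1) model the cepstrum has the closed form c_n = a^n / n, which the recursion reproduces.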

40 citations


Proceedings ArticleDOI
12 May 2008
TL;DR: This article investigates the use of different techniques of voice scramblers applied to mobile communications vocoders in terms of LPC and cepstral distances, and PESQ values.
Abstract: Speech privacy techniques are used to scramble clear speech into an unintelligible signal in order to avoid eavesdropping. Some analog speech-privacy equipment (scramblers) has been replaced by digital encryption devices (comsec), which have a higher degree of security but require complex implementations and large transmission bandwidth. However, if speech privacy is wanted in a mobile phone using a modern commercial codec, such as the AMR (adaptive multirate) codec, digital encryption may not be an option because it requires internal hardware and software modifications. If encryption is applied before the codec, poor voice quality may result, since the vocoder would handle a digitally encrypted signal resembling noise. On the other hand, analog scramblers may be placed before the voice encoder without causing much penalty to its performance. Analog scramblers are intended for applications where the degree of security is not too critical and hardware modifications are prohibitively costly. In this article we investigate the use of different voice scrambling techniques applied to mobile communications vocoders. We present our results in terms of LPC and cepstral distances, and PESQ values.

32 citations


Proceedings ArticleDOI
01 Nov 2008
TL;DR: This approach uses a template of a speaker's normal phonated speech for extraction of excitation parameters such as pitch and gain, and then injects these estimated excitations into the whispered signal to synthesize normal-sounding speech through the CELP codec.
Abstract: In the following paper, a method for the real-time conversion of whispers to normal phonated speech through a code-excited linear prediction (CELP) analysis-by-synthesis codec is discussed. This approach uses a template of a speaker's normal phonated speech for extraction of excitation parameters such as pitch and gain, and then injects these estimated excitations into the whispered signal to synthesize normal-sounding speech through the CELP codec. Furthermore, since restoring pitch to whispered speech requires some consideration of quality and accuracy, spectral enhancements are required in terms of formant shifting (LSP modification) and pitch injection based on a voiced/unvoiced decision. Spectral shifting is accomplished through line-spectral pair adjustment. Implementing such methods using the popular CELP codec allows integration of the technique with modern speech applications and devices. Subjective testing results are presented to determine the effectiveness of the technique.

31 citations


Patent
22 Aug 2008
TL;DR: In this article, a technique of spectral noise shaping in an audio coding system is disclosed, where the tonality of each sub-band is determined using time domain linear prediction (TDLP) processing.
Abstract: A technique of spectral noise shaping in an audio coding system is disclosed. Frequency decomposition of an input audio signal is performed to obtain multiple frequency sub-bands that closely follow the critical bands of the human auditory system. The tonality of each sub-band is determined. If a sub-band is tonal, time-domain linear prediction (TDLP) processing is applied to the sub-band, yielding a residual signal and linear predictive coding (LPC) coefficients of an all-pole model representing the sub-band signal. The residual signal is further processed using a frequency-domain linear prediction (FDLP) method. The FDLP parameters and LPC coefficients are transferred to a decoder. At the decoder, an inverse-FDLP process is applied to the encoded residual signal, followed by an inverse-TDLP process, which shapes the quantization noise according to the power spectral density of the original sub-band signal. Non-tonal sub-band signals bypass the TDLP process.

30 citations


Journal ArticleDOI
TL;DR: Results indicate that the combination of speech enhancement pre-processing and the auditory model front-end provides an improvement in recognition performance in noisy conditions over the ETSI front-ends.

27 citations


Patent
30 Sep 2008
TL;DR: In this article, a microphone-array-based speech recognition system using a blind source separation (BBS) and a target speech extraction method in the system is provided. But it is not possible to obtain a high speech recognition rate even in a noise environment.
Abstract: A microphone-array-based speech recognition system using blind source separation (BSS) and a target speech extraction method in the system are provided. The speech recognition system performs an independent component analysis (ICA) to separate mixed signals input through a plurality of microphones into sound-source signals, extracts one target speech spoken for speech recognition from the separated sound-source signals by using a Gaussian mixture model (GMM) or a hidden Markov model (HMM), and automatically recognizes the desired speech from the extracted target speech. Accordingly, it is possible to obtain a high speech recognition rate even in a noisy environment.

Patent
23 Oct 2008
TL;DR: A speech enhancement system as discussed by the authors improves the speech quality and intelligibility of a speech signal by dynamically attenuating a portion of the noise that occurs in a part of the spectrum of the speech signal.
Abstract: A speech enhancement system improves the speech quality and intelligibility of a speech signal. The system includes a time-to-frequency converter that converts segments of a speech signal into frequency bands. A signal detector measures the signal power of the frequency bands of each speech segment. A background noise estimator measures the background noise detected in the speech signal. A dynamic noise reduction controller dynamically models the background noise in the speech signal. The system renders the speech signal perceptually pleasing to a listener by dynamically attenuating a portion of the noise that occurs in a portion of the spectrum of the speech signal.

Book
01 Nov 2008
TL;DR: Signal Processing Fundamentals, Basic Speech Processing Concepts, Speech Recognition, and Speech Synthesis are taught.
Abstract: Signal Processing Fundamentals.- Basic Speech Processing Concepts.- CPU Architectures for Speech Processing.- Peripherals for Speech Processing.- Speech Compression Overview.- Waveform Coders.- Voice Coders.- Noise and Echo Cancellation.- Speech Recognition.- Speech Synthesis.- Conclusion.

Journal ArticleDOI
18 Feb 2008
TL;DR: A combination of linear predictive coding (LPC), mel frequency cepstrum coefficients (MFCCs), Haar Wavelet, Daubechies Wavelet and Symlet coefficients as feature sets for the proposed audio classifier are examined.
Abstract: Content-based music genre classification is a key component for next-generation multimedia search agents. This paper introduces an audio classification technique based on audio content analysis. Artificial Neural Networks (ANNs), specifically multi-layered perceptrons (MLPs), are implemented to perform the classification task. Windowed audio files of finite length are analyzed to generate multiple feature sets, which are used as input vectors to a parallel neural architecture that performs the classification. This paper examines a combination of linear predictive coding (LPC), mel frequency cepstrum coefficients (MFCCs), Haar wavelet, Daubechies wavelet and Symlet coefficients as feature sets for the proposed audio classifier. In parallel to the MLP, a Gaussian radial basis function (GRBF) based ANN is also implemented and analyzed. The obtained prediction accuracy of 87.3% in determining audio genres demonstrates the effectiveness of the proposed architecture. The ANN prediction values are processed by a rule-based inference engine (IE) that presents the final decision.

Proceedings ArticleDOI
25 Mar 2008
TL;DR: The analysis presented in this paper is seen as the first step in creating an Urdu speech recognition system; future work involves the use of the TI6000 DSK series or linear predictive coding.
Abstract: In this paper, frequency analysis of spoken Urdu numbers from 'sifr' (zero) to 'nau' (nine) is described. Sound samples from multiple speakers were utilized to extract different features. Initial processing of the data, i.e. time-slicing and normalizing, was done using a combination of Simulink and MATLAB. Afterwards, the same tools were used for calculation of Fourier descriptions and correlations. The correlation allowed comparison of the same words spoken by the same and different speakers. The analysis presented in this paper is seen as the first step in creating an Urdu speech recognition system. Feed-forward neural network models for speech recognition were developed in MATLAB, and the models and algorithm exhibited high training and testing accuracies. Our major future work involves the use of the TI6000 DSK series or linear predictive coding. Such a system could potentially be utilized in the implementation of a voice-driven help setup in different systems.

Proceedings ArticleDOI
16 Mar 2008
TL;DR: This research used Spectral Subtraction as a method to remove noise from speech signals using the fast Fourier transform and had recourse to the speech to noise ratio (SNR) in order to evaluate the performance of the proposed algorithm.
Abstract: We used Spectral Subtraction in this research as a method to remove noise from speech signals. Initially, the spectrum of the noisy speech is computed using the fast Fourier transform (FFT); then the average magnitude of the noise spectrum is subtracted from the noisy speech spectrum. We applied Spectral Subtraction to the speech signal "Hot dog", to which we digitally added vacuum cleaner noise. We implemented the noise removal algorithm by storing the noisy speech data into Hanning time-windowed, half-overlapped data buffers, computing the corresponding spectra using the FFT, removing the noise from the noisy speech, and reconstructing the speech back into the time domain using the inverse fast Fourier transform (IFFT). We used the speech-to-noise ratio (SNR) to evaluate the performance of the proposed algorithm.
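The steps described above translate almost directly into code. This is a minimal sketch under my own naming (the toy sine-plus-noise signal stands in for the "Hot dog" recording): analysis-only Hanning windows with half overlap, magnitude subtraction with half-wave rectification, and overlap-add resynthesis using the noisy phase.

```python
import numpy as np

def spectral_subtraction(noisy, noise_mag_avg, frame_len=256):
    """Basic magnitude spectral subtraction with Hanning-windowed,
    half-overlapped frames.

    noise_mag_avg: average magnitude spectrum of the noise, estimated
    beforehand from a noise-only stretch of the recording."""
    hop = frame_len // 2
    win = np.hanning(frame_len)
    out = np.zeros(len(noisy))
    for start in range(0, len(noisy) - frame_len + 1, hop):
        frame = noisy[start:start + frame_len] * win
        spec = np.fft.rfft(frame)
        # subtract the average noise magnitude, half-wave rectify
        mag = np.maximum(np.abs(spec) - noise_mag_avg, 0.0)
        clean_frame = np.fft.irfft(mag * np.exp(1j * np.angle(spec)), frame_len)
        out[start:start + frame_len] += clean_frame   # overlap-add
    return out

# toy demo: bin-centered sine plus white noise at 8 kHz
fs, n = 8000, 4096
t = np.arange(n) / fs
rng = np.random.default_rng(0)
noise = 0.3 * rng.standard_normal(n)
clean = np.sin(2 * np.pi * 437.5 * t)   # 437.5 Hz sits exactly on an FFT bin
noisy = clean + noise
# estimate the average noise magnitude spectrum from a noise-only segment
frames = [noise[s:s + 256] * np.hanning(256) for s in range(0, n - 255, 128)]
noise_mag_avg = np.mean([np.abs(np.fft.rfft(f)) for f in frames], axis=0)
enhanced = spectral_subtraction(noisy, noise_mag_avg)
```

In the interior of the signal (away from the partially covered edges), the enhanced output should lie closer to the clean sine than the noisy input does.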

Proceedings ArticleDOI
06 May 2008
TL;DR: A robust dereverberation technique for real-time hands-free speech recognition application that uses the impulse response by effectively identifying the late reflection components of it and introduces a training strategy to optimize the values of the multi-band coefficients to minimize the error.
Abstract: A robust dereverberation technique for real-time hands-free speech recognition applications is proposed. Real-time implementation is made possible by avoiding time-consuming blind estimation. Instead, we use the impulse response, effectively identifying its late reflection components. Using this information, together with the concept of Spectral Subtraction (SS), we were able to remove the effects of the late reflection of the reverberant signal. After dereverberation, only the effects of the early component are left and used as input to the recognizer. In this method, multi-band SS is used in order to compensate for the error arising from approximation. We also introduced a training strategy to optimize the values of the multi-band coefficients to minimize the error.

Proceedings Article
01 Jan 2008
TL;DR: It is shown that feature extraction based on auditory processing provides better performance in the presence of additive background noise than traditional MFCC processing and it is argued that an expansive nonlinearity in the auditory model contributes the most to noise robustness.
Abstract: This paper discusses the relative impact that different stages of a popular auditory model have on improving the accuracy of automatic speech recognition in the presence of additive noise. Recognition accuracy is measured using the CMU SPHINX-III speech recognition system, and the DARPA Resource Management speech corpus for training and testing. It is shown that feature extraction based on auditory processing provides better performance in the presence of additive background noise than traditional MFCC processing and it is argued that an expansive nonlinearity in the auditory model contributes the most to noise robustness.

Proceedings ArticleDOI
01 Nov 2008
TL;DR: New feature extraction methods that utilize reduced-order Linear Predictive Coding (LPC) coefficients for speech recognition are proposed; the coefficients are derived from speech frames decomposed using the Discrete Wavelet Transform (DWT).
Abstract: In this paper, new feature extraction methods that utilize reduced-order Linear Predictive Coding (LPC) coefficients for speech recognition are proposed. The coefficients are derived from speech frames decomposed using the Discrete Wavelet Transform (DWT). In the literature it is assumed that a speech frame of 10 msec to 30 msec is stationary; however, in practice different parts of the speech signal may convey different amounts of information (hence may not be perfectly stationary). LPC coefficients derived from a subband decomposition of the speech frame provide a better representation than modeling the frame directly. It has been shown experimentally that the proposed approaches provide effective (better recognition rate) and efficient (reduced feature vector dimension) features. A speech recognition system using continuous Hidden Markov Models (HMMs) has been implemented. The proposed algorithms are evaluated using the NIST TI-46 isolated-word database.
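A minimal illustration of the idea, under assumptions of my own (a one-level Haar DWT and least-squares LPC; the paper's exact wavelet, decomposition depth, and model orders may differ): low-order LPC fits on each subband replace a single higher-order fit on the full frame.

```python
import numpy as np

def haar_dwt(x):
    """One-level Haar DWT: approximation and detail coefficients."""
    x = x[:len(x) // 2 * 2]                  # truncate to even length
    approx = (x[0::2] + x[1::2]) / np.sqrt(2)
    detail = (x[0::2] - x[1::2]) / np.sqrt(2)
    return approx, detail

def lpc_ls(x, order):
    """Low-order LPC by least squares (covariance method)."""
    X = np.column_stack([x[order - k - 1:len(x) - k - 1] for k in range(order)])
    return np.linalg.lstsq(X, x[order:], rcond=None)[0]

def subband_lpc_features(frame, order=4):
    """Reduced-order LPC features: 'order' coefficients per subband
    instead of one high-order model of the whole frame."""
    approx, detail = haar_dwt(frame)
    return np.concatenate([lpc_ls(approx, order), lpc_ls(detail, order)])

# toy 256-sample frame with slowly varying structure
rng = np.random.default_rng(3)
frame = np.cumsum(rng.standard_normal(256)) * 0.1
feat = subband_lpc_features(frame, order=4)   # 8 features: 4 per subband
```

The resulting feature vector is shorter than, e.g., a 12th-order full-frame LPC vector, which is the "reduced feature vector dimension" benefit the abstract refers to.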

Proceedings ArticleDOI
26 Nov 2008
TL;DR: Reversible watermarking for compressed speech is proposed and can achieve transparent quality of marked speech and a fairly high embedding rate.
Abstract: Reversible watermarking for compressed speech is proposed. Two properties of parameter-based speech coding bitstream are pointed out. Based on the properties, unimportant parameters are further compressed by entropy coding and concatenated with the watermark payload. The proposed scheme is demonstrated with G.729 speech coding and can be applied to other speech coding standards. Experiment results show that this method can achieve transparent quality of marked speech and a fairly high embedding rate.

Proceedings ArticleDOI
15 Aug 2008
TL;DR: The proposed method satisfies the robustness and discrimination requirements of a perceptual hash at a very low hash bitrate, and is also a computationally efficient algorithm that can be applied to scenarios with power restrictions or real-time communication requirements.
Abstract: A speech perceptual hash algorithm in the compression domain is proposed in this paper. Speech coded at a very low bitrate requires a hash algorithm with high compactness and robustness. Line spectral frequencies (LSFs) model the changing shape of the speaker's vocal tract and are an intermediate result of partial decoding; they are used to generate the hash value. The proposed method satisfies the robustness and discrimination requirements of a perceptual hash at a very low hash bitrate. It is also a computationally efficient algorithm that can be applied to scenarios with power restrictions or real-time communication requirements.


01 Jan 2008
TL;DR: This study addresses the segmentation problem for vocal effort change by deploying an improved feature-based T²-BIC algorithm and proposes a new fused evaluation criterion, the Multi-Error Score (MES), to explore which feature conveys the most information on vocal effort.
Abstract: Non-neutral speech data has a strong negative impact on speech processing systems such as Automatic Speech Recognition (ASR) or speaker ID systems [1]. It is therefore necessary to detect and segment non-neutral speech data before further processing steps. Alternatively, the detection and segmentation of non-neutral speech segments from an input speech stream can be used in speech analysis and understanding, or in speech file retrieval systems to detect speech files containing whispered speech representing sensitive information, or shouted speech denoting strong emotion. This study addresses the segmentation problem for vocal effort change by deploying an improved feature-based T²-BIC algorithm. Several features are considered as input to the T²-BIC algorithm in this study. A new fused evaluation criterion, the Multi-Error Score (MES), is proposed to explore which feature conveys the most information on vocal effort. Results show that the lowest mean MES (56.49) occurs for the energy ratio feature for segmentation of different vocal effort speech segments based on vocal effort change point detection. Finally, recommendations are made for integrating this framework to advance knowledge processing for subsequent speech systems. Index Terms: Segmentation, vocal effort, change point detection, whispered speech, shouted speech, T²-BIC

Patent
16 May 2008
TL;DR: In this paper, a method or corresponding apparatus in an example embodiment of the present invention performs voice quality enhancement transparently within a network by detecting use of a coder applying rate reduction to a speech signal and known to have adverse effect on a coded speech signal.
Abstract: To increase channel capacity, mobile phone carriers have deployed speech coders, such as Advanced MultiBand Excitation coding (AMBE), in networks to reduce the bit rate of each call. One undesired consequence of employing such speech coders is that the voice quality can be much worse as compared to higher bit-rate speech coders. A method or corresponding apparatus in an example embodiment of the present invention performs voice quality enhancement transparently within a network by detecting use of a coder applying rate reduction to a speech signal and known to have an adverse effect on a coded speech signal. Upon detection of the use of such coder, the coded speech signal is corrected based on components introduced into the coded speech signal due to the rate reduction. As a result of applying the voice quality enhancement, adverse effects of speech coders can be reduced, while maintaining high quality voice signals.

Patent
17 Nov 2008
TL;DR: In this paper, a non-linear transformation and/or a spectral warping process is employed to enhance particular short-term spectral characteristic information for respective voiced intervals of a speech signal.
Abstract: Coding systems that provide a perceptually improved approximation of the short-term characteristics of speech signals compared to typical coding techniques, such as linear predictive analysis, while maintaining enhanced coding efficiency. The invention advantageously employs a non-linear transformation and/or a spectral warping process to enhance particular short-term spectral characteristic information for respective voiced intervals of a speech signal. The non-linearly transformed and/or warped spectral characteristic information is then coded, such as by linear predictive analysis, to produce a corresponding coded speech signal. The use of the non-linear transformation and/or spectral warping operation on the particular spectral information advantageously causes more coding resources to be used for those spectral components that contribute more to the perceptual quality of the corresponding synthesized speech. It is possible to employ this coding technique in a variety of speech coding systems including, for example, vocoder and analysis-by-synthesis coding systems.

Book ChapterDOI
01 Aug 2008
TL;DR: An improved method for the Wiener filter algorithm is proposed that introduces complex LPC speech analysis in place of the conventional LPC analysis; results demonstrate that the proposed method performs better for babble or car-interior noise than the conventional real-valued method.
Abstract: Recently, applications of speech coding and speech recognition have been exploding; for example, cellular phones and car navigation systems in automobiles. Since these are commonly used in noisy environments, a noise reduction method, viz. speech enhancement, is required as a pre-processor for speech coding and recognition. The iterative Wiener filter (IWF) method has been adopted for speech enhancement; it estimates the speech and noise power spectra iteratively using LPC analysis. In this paper, we propose an improved method for the Wiener filter algorithm by introducing complex LPC speech analysis instead of the conventional LPC analysis. The complex speech analysis can estimate the spectrum more accurately at low frequencies; it is therefore expected to perform better in the IWF, especially for babble noise or car-interior noise, which contain much energy at low frequencies. An objective evaluation was performed, by means of spectral distance, for speech corrupted by white Gaussian noise, pink noise, babble noise, or car-interior noise. The results demonstrate that the proposed method performs better for babble or car-interior noise than the conventional real-valued method.
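The IWF loop itself (with the conventional real-valued LPC analysis, not the paper's complex LPC) can be sketched per frame as follows: the all-pole speech PSD is re-estimated from the enhanced frame on each pass, and the Wiener gain is re-applied to the original noisy spectrum. Names, model orders, and the flat known noise PSD are illustrative assumptions.

```python
import numpy as np

def lpc_psd(x, order, nfft):
    """All-pole PSD estimate from LPC (autocorrelation method)."""
    r = np.correlate(x, x, mode="full")[len(x) - 1:len(x) + order] / len(x)
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, r[1:order + 1])          # Yule-Walker equations
    g2 = r[0] - np.dot(a, r[1:order + 1])           # prediction error power
    A = np.fft.rfft(np.concatenate(([1.0], -a)), nfft)
    return g2 / np.abs(A) ** 2

def iterative_wiener(noisy, noise_psd, order=10, n_iter=3):
    """Iterative Wiener filter on one frame: re-estimate the speech LPC
    spectrum from the current enhanced frame on each pass."""
    nfft = len(noisy)
    spec = np.fft.rfft(noisy)
    est = noisy.copy()
    for _ in range(n_iter):
        ps = lpc_psd(est, order, nfft)
        gain = ps / (ps + noise_psd)                # Wiener gain per bin
        est = np.fft.irfft(spec * gain, nfft)
    return est

# toy frame: bin-centered sine plus white noise of known variance 0.09
fs, n = 8000, 512
t = np.arange(n) / fs
rng = np.random.default_rng(5)
clean = np.sin(2 * np.pi * 437.5 * t)   # 437.5 Hz falls exactly on FFT bin 28
noisy = clean + 0.3 * rng.standard_normal(n)
noise_psd = np.full(n // 2 + 1, 0.09)   # assumed-known flat noise PSD
enhanced = iterative_wiener(noisy, noise_psd)
```

Since the gain is strictly below one in every bin, the output energy always drops; with the tonal test frame, the error against the clean signal drops as well.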

Proceedings ArticleDOI
01 Mar 2008
TL;DR: The experimental results show that the combined codec can achieve a performance close to that of iLBC at different loss conditions but with a smaller bit-rate, and scalability is achieved by modifying the number of inserted ACELP-coded frames.
Abstract: While VoIP (voice over IP) is gaining importance in comparison with other types of telephony, packet loss remains the main source of degradation in VoIP systems. Traditional speech codecs, such as those based on the CELP (code-excited linear prediction) paradigm, can achieve low bit-rates at the cost of introducing interframe dependencies. As a result, the effect of a packet loss burst is propagated to the frames correctly received after the burst. iLBC (internet low bit-rate codec) alleviates this problem by removing the interframe dependencies at the cost of a higher bit-rate. In this paper we propose a combination of iLBC with an ACELP (algebraic CELP) codec in which a variable number of ACELP-coded frames is inserted between every two iLBC-coded frames. The experimental results show that the combined codec can achieve a performance close to that of iLBC under different loss conditions but with a smaller bit-rate. Also, scalability is achieved by modifying the number of inserted ACELP-coded frames.

Proceedings ArticleDOI
25 Aug 2008
TL;DR: A novel method of tackling the problem of artificially extending the bandwidth of a narrow-band speech signal is described, in which a scalar high-band energy parameter effectively controls the artificial information added to the high-band of the bandwidth-extended output speech signal.
Abstract: In this paper, we describe a novel method of tackling the problem of artificially extending the bandwidth of a narrow-band speech signal. For a given narrow-band signal, we first estimate the energy in the high-band. The high-band energy is then used to select a suitable high-band spectral envelope shape that is consistent with the estimated high-band energy while simultaneously ensuring that the resulting wide-band spectral envelope is continuous at the boundary between narrow-band and high-band. The scalar high-band energy parameter thus effectively controls the artificial information added to the high-band of the bandwidth extended output speech signal. Artifacts in the output speech are minimized by adapting the high-band energy parameter appropriately. Formal subjective listening tests show that the bandwidth extended speech output generated by the described method outscores the input narrow-band speech by 0.25 MOS.

Patent
Yang Gao1
13 Jun 2008
TL;DR: In this article, a method of speech encoding comprises generating a first synthesized speech signal from a first excitation signal, weighting the first synthesised speech signal using a first error weighting filter to generate a first weighted speech signal, and generating an error signal using the first weighted signal and the second signal.
Abstract: A method of speech encoding comprises generating a first synthesized speech signal from a first excitation signal, weighting the first synthesized speech signal using a first error weighting filter to generate a first weighted speech signal, generating a second synthesized speech signal from a second excitation signal, weighting the second synthesized speech signal using a second error weighting filter to generate a second weighted speech signal, and generating an error signal using the first weighted speech signal and the second weighted speech signal, wherein the first error weighting filter is different from the second error weighting filter. The method may further generate the error signal by weighting the speech signal using a third error weighting filter to generate a third weighted speech signal, and subtracting the first weighted speech signal and the second weighted speech signal from the third weighted speech signal to generate the error signal.

Journal ArticleDOI
01 Jun 2008
TL;DR: The Audio Rate Cognition protocol incorporates the ANN topology that appears to be the most effective into the end-points of a (multi-hop) flow, using it to adapt its transmission rate, and minimizes linear predictive coding cepstral distance, which closely correlates to subjective audio measures.
Abstract: We design a transport protocol that uses artificial neural networks (ANNs) to adapt the audio transmission rate to changing conditions in a mobile ad hoc network. The response variables of throughput, end-to-end delay, and jitter are examined. For each, statistically significant factors and interactions are identified and used in the ANN design. The efficacy of different ANN topologies are evaluated for their predictive accuracy. The Audio Rate Cognition (ARC) protocol incorporates the ANN topology that appears to be the most effective into the end-points of a (multi-hop) flow, using it to adapt its transmission rate. Compared to competing protocols for media streaming, ARC achieves a significant reduction in packet loss and increased goodput while satisfying the requirements of end-to-end delay and jitter. While the average throughput of ARC is less than that of TFRC, its average goodput is much higher. As a result, ARC transmits higher quality audio, minimizing root mean square and Itakura-Saito spectral distances, as well as several parametric distance measures. In particular, ARC minimizes linear predictive coding cepstral (sic) distance, which closely correlates to subjective audio measures.