
Showing papers on "Linear predictive coding published in 2009"


Patent
10 Dec 2009
Abstract: A speech signal processing system comprises an audio processor (103) for providing a first signal representing an acoustic speech signal of a speaker. An EMG processor (109) provides a second signal which represents an electromyographic signal for the speaker, captured simultaneously with the acoustic speech signal. A speech processor (105) is arranged to process the first signal in response to the second signal to generate a modified speech signal. The processing may, for example, be beamforming, noise compensation, or speech encoding. Improved speech processing may be achieved, in particular in an acoustically noisy environment.

547 citations


Proceedings ArticleDOI
19 Apr 2009
TL;DR: This new codec, which exhibits consistently high quality for speech, music and mixed audio content, forms the basis of the reference model in the ongoing MPEG standardization activity for Unified Speech and Audio Coding.
Abstract: Traditionally, speech coding and audio coding were separate worlds. Based on different technical approaches and different assumptions about the source signal, neither of the two coding schemes could efficiently represent both speech and music at low bitrates. This paper presents a unified speech and audio codec, which efficiently combines techniques from both worlds. This results in a codec that exhibits consistently high quality for speech, music and mixed audio content. The paper gives an overview of the codec architecture and presents results of formal listening tests comparing this new codec with HE-AAC(v2) and AMR-WB+. This new codec forms the basis of the reference model in the ongoing MPEG standardization activity for Unified Speech and Audio Coding.

108 citations


Proceedings ArticleDOI
04 Feb 2009
TL;DR: A comparison among different structures of neural networks is conducted here for a better understanding of the problem and its possible solutions.
Abstract: This paper presents a Bangla speech recognition system. The system is divided into two major parts: the first is speech signal processing and the second is the speech pattern recognition technique. The speech processing stage consists of speech starting- and end-point detection, windowing, filtering, calculation of the Linear Predictive Coding (LPC) and cepstral coefficients, and finally construction of the codebook by vector quantization. The second part is a pattern recognition system using an Artificial Neural Network (ANN). Speech signals are recorded using an audio wave recorder in a normal room environment. The recorded speech signal is passed through the starting- and end-point detection algorithm to detect the presence of the speech signal and remove the silence and pause portions of the signal. The resulting signal is then filtered to remove unwanted background noise. The filtered signal is then windowed, ensuring half-frame overlap. After windowing, the LPC and cepstral coefficients of the speech signal are calculated. The feature extractor uses a standard LPC cepstrum coder, which converts the incoming speech signal into the LPC cepstrum feature space. A Self-Organizing Map (SOM) neural network maps each variable-length LPC trajectory of an isolated word onto a fixed-length LPC trajectory, thereby producing the fixed-length feature vector to be fed into the recognizer. The structure of the neural network is designed with a Multi-Layer Perceptron approach and tested with 3, 4, and 5 hidden layers using tanh sigmoid transfer functions. A comparison among different structures of neural networks is conducted here for a better understanding of the problem and its possible solutions.
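
The LPC-plus-cepstrum front end described above is a textbook pipeline. As a minimal illustration, here is a numpy sketch of the standard autocorrelation/Levinson-Durbin recursion and the usual LPC-to-cepstrum conversion (illustrative only, not code from the paper; frame is one windowed float array):

    import numpy as np

    def lpc(frame, order):
        # Autocorrelation method + Levinson-Durbin recursion.
        r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
        a = np.zeros(order + 1)
        a[0], err = 1.0, r[0] + 1e-12
        for i in range(1, order + 1):
            k = -(r[i] + np.dot(a[1:i], r[i - 1:0:-1])) / err
            a[1:i + 1] = a[1:i + 1] + k * a[i - 1::-1]
            err *= 1.0 - k * k
        return a, err  # A(z) coefficients (a[0] = 1) and residual energy

    def lpc_cepstrum(a, n_ceps):
        # Standard recursion from the LPC polynomial to cepstral coefficients.
        p, c = len(a) - 1, np.zeros(n_ceps + 1)
        for n in range(1, n_ceps + 1):
            c[n] = -(a[n] if n <= p else 0.0)
            for k in range(1, n):
                if n - k <= p:
                    c[n] -= (k / n) * c[k] * a[n - k]
        return c[1:]

Endpoint detection, windowing and vector quantization wrap around these two routines; the SOM-based trajectory normalization is a separate stage.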

92 citations


Patent
Nobuyuki Washio1
22 Dec 2009
TL;DR: In this article, an information processing apparatus for speech recognition includes a first speech dataset storing speech data uttered by low recognition rate speakers, a second speech dataset storing speech data uttered by a plurality of speakers, and a third speech dataset storing speech data to be mixed with the speech data of the second dataset.
Abstract: An information processing apparatus for speech recognition includes a first speech dataset storing speech data uttered by low recognition rate speakers; a second speech dataset storing speech data uttered by a plurality of speakers; a third speech dataset storing speech data to be mixed with the speech data of the second speech dataset; a similarity calculating part obtaining, for each piece of the speech data in the second speech dataset, a degree of similarity to a given average voice in the first speech dataset; a speech data selecting part recording the speech data, the degree of similarity of which is within a given selection range, as selected speech data in the third speech dataset; and an acoustic model generating part generating a first acoustic model using the speech data recorded in the second speech dataset and the third speech dataset.

81 citations


Dissertation
01 Jan 2009
TL;DR: A study of the implementation of a speech generative model whereby speech is synthesized and recovered from its MFCC representation; the spectral distance between the original speech signal and the one produced from the MFCC vectors has been computed.
Abstract: The classical front-end analysis in speech recognition is a spectral analysis which parametrizes the speech signal into feature vectors; the most popular set of these is the Mel Frequency Cepstral Coefficients (MFCC). They are based on a standard power spectrum estimate which is first subjected to a log-based transform of the frequency axis (mel-frequency scale), and then decorrelated by using a modified discrete cosine transform. Following a focused introduction to speech production, perception and analysis, this paper gives a study of the implementation of a speech generative model, whereby the speech is synthesized and recovered back from its MFCC representation. The work has been developed in two steps: first, the computation of the MFCC vectors from the source speech files by using the HTK software; and second, the implementation of the generative model itself, which represents the conversion chain from HTK-generated MFCC vectors to speech reconstruction. In order to assess the quality of the speech coding into feature vectors and to evaluate the generative model, the spectral distance between the original speech signal and the one produced from the MFCC vectors has been computed. For that, spectral models based on Linear Predictive Coding (LPC) analysis have been used. During the implementation of the generative model, results have been obtained in terms of the reconstruction of the spectral representation and the quality of the synthesized speech.
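
The MFCC chain summarized above (power spectrum, mel filterbank, log, DCT) is easy to sketch. A compact numpy version for a single pre-emphasized, windowed frame (parameter values are illustrative, not those of the HTK configuration used in the dissertation):

    import numpy as np
    from scipy.fft import dct

    def hz_to_mel(f):
        return 2595.0 * np.log10(1.0 + f / 700.0)

    def mel_to_hz(m):
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

    def mfcc(frame, sr=16000, n_fft=512, n_mels=26, n_ceps=13):
        spec = np.abs(np.fft.rfft(frame, n_fft)) ** 2
        # Triangular filters spaced uniformly on the mel scale.
        mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
        bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
        fbank = np.zeros((n_mels, n_fft // 2 + 1))
        for m in range(1, n_mels + 1):
            l, c, r = bins[m - 1], bins[m], bins[m + 1]
            fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
            fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
        # Log filterbank energies, decorrelated by a DCT.
        log_e = np.log(fbank @ spec + 1e-10)
        return dct(log_e, type=2, norm="ortho")[:n_ceps]

The generative model studied in the dissertation essentially inverts this chain, which is why an LPC-based spectral distance is a natural measure of reconstruction fidelity.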

49 citations


Proceedings ArticleDOI
24 Aug 2009
TL;DR: A measure based on the similarity between the time-varying spectral envelopes of target speech and system output, as measured by correlation, can provide a more meaningful evaluation measure for nonlinear speech enhancement systems, as well as providing a transparent objective function for the optimization of such systems.
Abstract: Applying a binary mask to a pure noise signal can result in speech that is highly intelligible, despite the absence of any of the target speech signal. Therefore, to estimate the intelligibility benefit of highly nonlinear speech enhancement techniques, we contend that SNR is not useful; instead we propose a measure based on the similarity between the time-varying spectral envelopes of target speech and system output, as measured by correlation. As with previous correlation-based intelligibility measures, our system can broadly match subjective intelligibility for a range of enhanced signals. Our system, however, is notably simpler and we explain the practical motivation behind each stage. This measure, freely available as a small Matlab implementation, can provide a more meaningful evaluation measure for nonlinear speech enhancement systems, as well as providing a transparent objective function for the optimization of such systems.
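
In the spirit of the proposed measure (the paper defines the exact processing stages), a toy "correlation between time-varying spectral envelopes" can be written as follows; frame sizes and the band count are arbitrary choices for illustration:

    import numpy as np

    def envelope_correlation(target, output, frame=256, hop=128, n_bands=16):
        # Short-time magnitude spectrogram with a Hann window.
        def spectrogram(x):
            w = np.hanning(frame)
            n = 1 + (len(x) - frame) // hop
            return np.array([np.abs(np.fft.rfft(w * x[i * hop:i * hop + frame]))
                             for i in range(n)])
        S_t, S_o = spectrogram(target), spectrogram(output)
        # Collapse FFT bins into coarse bands to obtain smooth envelopes.
        edges = np.linspace(0, S_t.shape[1], n_bands + 1).astype(int)
        rho = []
        for b in range(n_bands):
            et = S_t[:, edges[b]:edges[b + 1]].sum(axis=1)
            eo = S_o[:, edges[b]:edges[b + 1]].sum(axis=1)
            if et.std() > 0 and eo.std() > 0:
                rho.append(np.corrcoef(et, eo)[0, 1])
        return float(np.mean(rho))

Because only envelope similarity is scored, a binary-masked noise signal that tracks the target's envelopes rates well, exactly the case where SNR fails.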

43 citations


Journal ArticleDOI
TL;DR: The results show that residual features carry speaker-dependent information, and their combination with the LPCC or the MFCC shows global improvements in terms of robustness under different mismatches.

42 citations


Journal ArticleDOI
TL;DR: The proposed modulation scheme was compared to the regular frequency shift keying (FSK) method, and a performance improvement of ARDMA over FSK is observed at higher bit rates for the three GSM speech coders compared.

35 citations


Patent
12 Mar 2009
TL;DR: In this paper, the authors describe methods and apparatus for code excited linear prediction (CELP) audio encoding and decoding that employ linear predictive coding (LPC) synthesis filters controlled by LPC parameters.
Abstract: The invention relates to the coding of audio signals that may include both speech-like and non-speech-like signal components. It describes methods and apparatus for code excited linear prediction (CELP) audio encoding and decoding that employ linear predictive coding (LPC) synthesis filters controlled by LPC parameters, a plurality of codebooks each having codevectors, at least one codebook providing an excitation more appropriate for non-speech-like signals and at least one codebook providing an excitation more appropriate for speech-like signals, and a plurality of gain factors, each associated with a codebook. The encoding methods and apparatus select from the codebooks codevectors and/or associated gain factors by minimizing a measure of the difference between the audio signal and a reconstruction of the audio signal derived from the codebook excitations. The decoding methods and apparatus generate a reconstructed output signal from the LPC parameters, codevectors, and gain factors.
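
The selection rule in the last two sentences is the classic analysis-by-synthesis search. A deliberately naive sketch (exhaustive search, one scalar gain per codevector, no perceptual weighting; all names are ours):

    import numpy as np
    from scipy.signal import lfilter

    def select_excitation(target, lpc_a, codebooks):
        # lpc_a = [1, a1, ..., ap]; each codebook is a list of codevectors.
        best = (None, None, 0.0, np.inf)
        for cb_idx, cb in enumerate(codebooks):     # e.g. one codebook tuned
            for cv_idx, cv in enumerate(cb):        # for speech, one for music
                synth = lfilter([1.0], lpc_a, cv)   # 1/A(z) synthesis filter
                g = np.dot(target, synth) / (np.dot(synth, synth) + 1e-12)
                err = np.sum((target - g * synth) ** 2)
                if err < best[3]:
                    best = (cb_idx, cv_idx, g, err)
        return best  # chosen codebook, codevector index, gain, squared error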

34 citations


Proceedings ArticleDOI
24 Aug 2009
TL;DR: A simple condition on the speech spectrum level of every subband that maximizes the SII for a given noise spectrum level is derived and used to obtain a theoretical bound for the maximum achievable SII, as well as a new SII-optimized algorithm for near-end listening enhancement.
Abstract: Signal processing algorithms for near-end listening enhancement make it possible to improve the intelligibility of clean (far-end) speech for the near-end listener, who perceives not only the far-end speech but also ambient background noise. A typical scenario is mobile communication conducted in the presence of acoustic background noise such as traffic or babble noise.

28 citations


Proceedings Article
01 Aug 2009
TL;DR: The original DFT was replaced by the state-of-the-art MDCT transform, and the vector quantization by the combination of a scalar quantization and an evolved context-adaptive arithmetic coder, to enhance the coding efficiency of AMR-WB+ while maintaining its high flexibility.
Abstract: Coding audio material at low bit rates with consistent quality over a wide range of signals is a current and challenging problem. The high-granularity switched speech and audio coder AMR-WB+ performs especially well for speech and mixed content by promptly adapting its coding scheme to the signal. However, the high adaptation rate comes at the price of limited performance for non-speech signals. The aim of the paper is to enhance the coding efficiency of AMR-WB+ while maintaining its high flexibility. For this purpose, the original DFT was replaced by the state-of-the-art MDCT transform, and the vector quantization by the combination of a scalar quantization and an evolved context-adaptive arithmetic coder. The improvements were measured by both objective and subjective evaluations.

Journal ArticleDOI
TL;DR: The experimental results showed the GMM can achieve a better recognition rate with feature extraction using the FLPCS method, and it is suggested the GMM can complete training and identification in a very short time.
Abstract: In this paper, a frame linear predictive coding spectrum (FLPCS) technique for speaker identification is presented. Traditionally, linear predictive coding (LPC) has been applied in many speech recognition applications; in this study, a modification of LPC termed FLPCS is proposed for speaker identification. The analysis procedure consists of feature extraction and voice classification. In the feature extraction stage, the representative characteristics were extracted using the FLPCS technique. Through this approach, the size of a speaker's feature vector can be reduced while maintaining an acceptable recognition rate. In the classification stage, a general regression neural network (GRNN) and a Gaussian mixture model (GMM) were applied because of their rapid response and simplicity of implementation. In the experimental investigation, the performances of FLPCS coefficients of different orders, derived from the LPC spectrum, were compared with one another. Further, a capability analysis of the GRNN and the GMM is also described. The experimental results showed the GMM can achieve a better recognition rate with feature extraction using the FLPCS method. It is also suggested the GMM can complete training and identification in a very short time.
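
The GMM stage is standard and easy to reproduce. A compact sketch using scikit-learn (our tooling choice, not necessarily the authors'), where features maps each enrolled speaker to a matrix of FLPCS-style vectors:

    import numpy as np
    from sklearn.mixture import GaussianMixture

    def train_models(features, n_components=8):
        # features: dict of speaker id -> (n_frames, n_coeffs) array
        models = {}
        for spk, X in features.items():
            gmm = GaussianMixture(n_components=n_components,
                                  covariance_type="diag")
            models[spk] = gmm.fit(X)
        return models

    def identify(models, X_test):
        # Highest average log-likelihood over the test frames wins.
        return max(models, key=lambda spk: models[spk].score(X_test))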

Patent
20 Jul 2009
TL;DR: In this paper, a voice conversion rule is used to convert the voice quality of source speech into that of target speech using spectral parameters of the target speech; a speech waveform is then generated from the converted spectral parameters.
Abstract: A voice conversion apparatus stores, in a parameter memory, target speech spectral parameters of target speech; stores, in a voice conversion rule memory, a voice conversion rule for converting the voice quality of source speech into the voice quality of the target speech; extracts, from an input source speech, a source speech spectral parameter; converts the extracted source speech spectral parameter into a first conversion spectral parameter by using the voice conversion rule; selects a target speech spectral parameter similar to the first conversion spectral parameter from the parameter memory; generates an aperiodic component spectral parameter from the selected target speech spectral parameter; mixes a periodic component spectral parameter included in the first conversion spectral parameter with the aperiodic component spectral parameter to obtain a second conversion spectral parameter; and generates a speech waveform from the second conversion spectral parameter.

Proceedings ArticleDOI
08 Mar 2009
TL;DR: The experimental results show that the highest recognition rate that can be achieved by the system is 87%.
Abstract: This paper describes speech recognition implemented on an ATmega162 microcontroller. A word (voice command) in a speech signal is used to control the movement of a mobile robot; the robot moves according to the voice command in the speech signal. There are five commands, in Indonesian, used to control the movement of the mobile robot: “maju”, “mundur”, “kiri”, “kanan”, and “stop”, which command the mobile robot to move forward, move backward, turn left, turn right and stop, respectively. The methods used in this research are Linear Predictive Coding (LPC) and the Hidden Markov Model (HMM). LPC is used to extract word data from a speech signal. HMM is used to recognize the word pattern data extracted from a speech signal. The sampling rate of the speech signal is 8 kHz and the speech signal is sampled for 0.5 seconds. Experiments were done with several variations of the number of observation symbols and the number of samples. The experimental results show that the highest recognition rate that can be achieved by the system is 87%. The mobile robot can move in accordance with the voice command that is given to the robot.
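
Conceptually, the recognizer picks the command-word HMM with the highest likelihood for the observed LPC feature sequence. A desktop sketch using the hmmlearn package (our stand-in; the paper's implementation runs on the microcontroller itself):

    import numpy as np
    from hmmlearn import hmm

    def train_word_models(train_data, n_states=5):
        # train_data: dict of word ("maju", "mundur", "kiri", "kanan",
        # "stop") -> list of (n_frames, n_lpc) LPC feature arrays.
        models = {}
        for word, seqs in train_data.items():
            X, lengths = np.vstack(seqs), [len(s) for s in seqs]
            m = hmm.GaussianHMM(n_components=n_states, covariance_type="diag")
            models[word] = m.fit(X, lengths)
        return models

    def recognize(models, lpc_seq):
        # The word model with the highest log-likelihood wins.
        return max(models, key=lambda w: models[w].score(lpc_seq))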

Proceedings ArticleDOI
Jianle Chen1, Woo-Jin Han1
07 Nov 2009
TL;DR: This paper investigates the linear prediction method for block-based lossy image coding and proposes a method that merges the linear prediction technique into the H.264/AVC video coding framework; experimental results show that the proposed technique improves coding efficiency.
Abstract: The linear prediction model has been well investigated and applied in lossless image and video coding. In this paper, we investigate the linear prediction method for block-based lossy image coding and propose a method that merges the linear prediction technique into the H.264/AVC video coding framework. A block-based linear prediction method is designed instead of a pixel-based one in order to cooperate with the transform module. Furthermore, line-based linear prediction with a 1D transform is developed by considering the coding gain tradeoff between prediction and transform. The linear prediction model coefficients are derived from neighboring reconstructed data with the least-squares-error method. The model coefficients implicitly embed the local texture characteristics, and no bit overhead is needed for signaling the coefficients since they can be derived by the same process at the decoder side. We insert block-based and line-based linear prediction modes into H.264/AVC as additional intra prediction modes and select the best mode in the minimum rate-distortion sense. Experimental results show that the proposed technique improves the coding efficiency of H.264/AVC intra pictures, with an average 4.3% bit saving and up to 7.0% bit saving.
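
The decoder-side derivation is the key trick: encoder and decoder fit the same predictor to already-reconstructed neighbors, so no coefficients are transmitted. A least-squares sketch (the context shape and training-region size are our illustrative choices; the block is assumed not to sit on the image border):

    import numpy as np

    def derive_predictor(recon, by, bx, size=4):
        # Fit weights on the causal (above/left) region next to block (by, bx).
        ctxs, vals = [], []
        for y in range(by - size, by):
            for x in range(bx - size, bx):
                ctxs.append([recon[y, x - 1], recon[y - 1, x],
                             recon[y - 1, x - 1], recon[y - 1, x + 1]])
                vals.append(recon[y, x])
        w, *_ = np.linalg.lstsq(np.array(ctxs, float),
                                np.array(vals, float), rcond=None)
        return w  # weights implicitly encode the local texture

    def predict_pixel(recon, y, x, w):
        ctx = np.array([recon[y, x - 1], recon[y - 1, x],
                        recon[y - 1, x - 1], recon[y - 1, x + 1]], float)
        return float(w @ ctx)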

Proceedings ArticleDOI
Jie Zhan1, Ki-hyun Choo1, Eunmi Oh1
19 Apr 2009
TL;DR: Subjective testing results show that the presented technology exhibits performance comparable to 3GPP AMR-WB+ at the same bit rate in the framework of China's Audio Video coding Standard (AVS) Part 10 - Mobile Speech and Audio Codec.
Abstract: We propose a new frequency-domain BandWidth Extension (BWE) technology. In the new technology, FFT-based frequency-domain gain shaping, combined with Linear Prediction Coding (LPC) based spectral envelope shaping, is used to generate the high-frequency signals. To preserve the amount of noise component in the reconstructed band, gain reduction controlled by a Spectrum Flatness Measurement (SFM) is employed. Subjective testing results show that the presented technology exhibits performance comparable to 3GPP AMR-WB+ at the same bit rate in the framework of China's Audio Video coding Standard (AVS) Part 10 - Mobile Speech and Audio Codec. This technology has been formally adopted as the artificial high-band coding module in AVS P10.
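
The SFM that controls the gain reduction is a standard quantity, the ratio of the geometric to the arithmetic mean of the power spectrum:

    import numpy as np

    def spectral_flatness(power_spectrum, eps=1e-12):
        p = np.asarray(power_spectrum) + eps
        return np.exp(np.mean(np.log(p))) / np.mean(p)

Values near 1 indicate a noise-like band and values near 0 a tonal one, which is what lets an SFM-driven gain preserve the noise component of the reconstructed band.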

Patent
26 Jan 2009
TL;DR: A pitch pre-processing procedure forms, from the input speech signal, a revised speech signal biased toward an ideal voiced and stationary characteristic, allowing the encoder to fully capture the benefits of a bandwidth-efficient, long-term predictive procedure for a greater amount of the speech components of an input speech signal than would otherwise be possible.
Abstract: In accordance with one aspect of the invention, a selector supports the selection of a first encoding scheme or a second encoding scheme based upon the detection or absence of the triggering characteristic in the interval of the input speech signal. The first encoding scheme has a pitch pre-processing procedure for processing the input speech signal to form a revised speech signal biased toward an ideal voiced and stationary characteristic. The pre-processing procedure allows the encoder to fully capture the benefits of a bandwidth-efficient, long-term predictive procedure for a greater amount of the speech components of an input speech signal than would otherwise be possible. In accordance with another aspect of the invention, the second encoding scheme entails a long-term prediction mode for encoding the pitch on a sub-frame by sub-frame basis. The long-term prediction mode is tailored to cases where the generally periodic component of the speech is not stationary, or less than completely periodic, and requires more frequent updates from the adaptive codebook to achieve a desired perceptual quality of the reproduced speech under a long-term predictive procedure.

Proceedings ArticleDOI
19 Apr 2009
TL;DR: In this work, accurate spectral envelope estimation is applied to Voice Conversion in order to achieve High-Quality timbre conversion and shows improved spectral conversion performance as well as increased converted-speech quality when compared to Linear Prediction.
Abstract: In this work, accurate spectral envelope estimation is applied to Voice Conversion in order to achieve high-quality timbre conversion. True-Envelope based estimators allow model order selection, leading to an adaptation of the spectral features to the characteristics of the speaker. Optimal residual signals can also be computed following a local adaptation of the model order in terms of the F0. A new perceptual criterion is proposed to measure the impact of the spectral conversion error. The proposed envelope models show improved spectral conversion performance as well as increased converted-speech quality when compared to Linear Prediction.

Proceedings ArticleDOI
30 Sep 2009
TL;DR: A Grassmannian prediction and predictive coding framework for delayed feedback systems is proposed that exploits the memory in the channel, and a prediction step size optimization criterion is derived for correlated time series evolving on the Grassmann manifold.
Abstract: Limited feedback in multiple antenna wireless systems is a practical technique for obtaining channel state information at the transmitter. When the channel is time-varying with memory, however, the selected codeword may become outdated before its use at the transmitter. To overcome this problem, we propose a Grassmannian prediction and predictive coding framework for delayed feedback systems that exploits the memory in the channel. A prediction step size optimization criterion for correlated time series evolving on the Grassmann manifold is derived. The proposed predictive coding framework uses optimized prediction to account for the feedback delay. Application to a delayed limited feedback multiuser multiple antenna system shows sum rate improvement and robustness to delay.

Proceedings ArticleDOI
19 Apr 2009
TL;DR: An analysis model is proposed that jointly finds the two predictors by adding a regularization term in the minimization process to impose sparsity constraints on a high order predictor, resulting in a linear predictor that can be easily factorized into the short-term and long-term predictors.
Abstract: In low bit-rate coders, the near-sample and far-sample redundancies of the speech signal are usually removed by a cascade of a short-term and a long-term linear predictor. These two predictors are usually found in a sequential, and therefore suboptimal, approach. In this paper we propose an analysis model that jointly finds the two predictors by adding a regularization term to the minimization process to impose sparsity constraints on a high-order predictor. The result is a linear predictor that can be easily factorized into the short-term and long-term predictors. This estimation method is then incorporated into an Algebraic Code Excited Linear Prediction scheme and is shown to perform better than traditional cascade methods and other joint optimization methods, offering lower distortion and higher perceptual speech quality.
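
The core idea, sparsity-regularized joint estimation, can be illustrated by fitting a single high-order predictor with an L1 penalty (scikit-learn's Lasso as a stand-in for the paper's regularized minimization; order and penalty weight are arbitrary):

    import numpy as np
    from sklearn.linear_model import Lasso

    def sparse_predictor(x, order=120, lam=0.01):
        # x: 1-D numpy array of speech samples.
        X = np.array([x[n - order:n][::-1] for n in range(order, len(x))])
        y = x[order:]
        model = Lasso(alpha=lam, fit_intercept=False, max_iter=10000)
        model.fit(X, y)
        return model.coef_  # coef_[k] is the predictor tap at lag k + 1

The few surviving taps separate into low-lag (short-term) taps and a cluster around the pitch lag (long-term), which is the factorization the paper exploits.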

Patent
27 Feb 2009
TL;DR: Voice activity detection in a low SNR environment is performed by extracting a long-term spectrum variation component and a harmonic structure as feature vectors from a speech signal and increasing the difference in feature vectors between speech and non-speech.
Abstract: A voice activity detection method for low SNR environments. Voice activity detection is performed by extracting a long-term spectrum variation component and a harmonic structure as feature vectors from a speech signal and increasing the difference in feature vectors between speech and non-speech, (i) using the long-term spectrum variation component feature or (ii) using a long-term spectrum variation component extraction and a harmonic structure feature extraction. The correct rate and accuracy rate of the voice activity detection are improved over conventional methods by using a long-term spectrum variation component having a window length longer than the average phoneme duration of an utterance in the speech signal. The voice activity detection system and method provide speech processing, automatic speech recognition, and speech output capable of very accurate voice activity detection.

Proceedings ArticleDOI
18 Sep 2009
TL;DR: A novel algorithm for speech endpoint detection based on the Hilbert-Huang transform is provided, and results show that the speech signal can be effectively detected by this algorithm at low signal-to-noise ratios.
Abstract: Speech endpoint detection in strong noise environments plays an important role in speech signal processing. The Hilbert-Huang Transform (HHT) is based on the local characteristics of signals and is an adaptive and efficient transformation method. It is particularly suitable for analyzing non-linear and non-stationary signals such as speech. In this paper, we consider noisy speech signals whose signal-to-noise ratio is negative. A novel algorithm for speech endpoint detection based on the Hilbert-Huang transform is provided after analyzing the noisy speech signal. The signal is first decomposed by Empirical Mode Decomposition (EMD), and part of the decomposition results are processed by the Hilbert transform. The noise threshold is estimated by analyzing the front portion of the signal's Hilbert amplitude spectrum. The speech segments and non-speech segments can then be distinguished using the threshold and the whole signal's Hilbert amplitude spectrum. Simulation results show that the speech signal can be effectively detected by this algorithm at low signal-to-noise ratios.
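
A bare-bones version of such a detector, using the third-party PyEMD package and scipy's Hilbert transform (the IMF selection and thresholding here are our simplifications, not the paper's):

    import numpy as np
    from scipy.signal import hilbert
    from PyEMD import EMD  # third-party package, our tooling choice

    def endpoint_mask(x, sr, noise_head=0.1, factor=3.0):
        imfs = EMD().emd(x)
        # Keep a few mid-order IMFs; the first is dominated by wideband noise.
        band = imfs[1:4].sum(axis=0) if len(imfs) > 3 else imfs.sum(axis=0)
        amp = np.abs(hilbert(band))
        # Threshold from the (assumed speech-free) head of the recording.
        head = amp[: int(noise_head * sr)]
        threshold = head.mean() + factor * head.std()
        return amp > threshold  # True where speech is declared present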

Patent
Jong-Hoon Jeong1, Geon-Hyoung Lee1, Chul-woo Lee1, Nam-Suk Lee1, Han-gil Moon1 
29 Jan 2009
TL;DR: A method and apparatus for encoding or decoding an audio signal by adaptively interpolating a linear predictive coding (LPC) coefficient are presented; the interpolation is performed selectively, depending on whether a transient section is present in the current frame, thereby preventing noise when interpolating LPC coefficients in the transient section.
Abstract: Provided are a method and apparatus for encoding or decoding an audio signal by adaptively interpolating a linear predictive coding (LPC) coefficient. In the method and apparatus of encoding or decoding an audio signal, LPC coefficient interpolation is selectively performed depending on whether a transient section is present in a current frame, thereby preventing noise from occurring when interpolating LPC coefficients in the transient section.

Proceedings ArticleDOI
24 Aug 2009
TL;DR: This paper evaluates the performance of four established state-of-the-art algorithms for pitch estimation in additive noise and reverberation and shows how accurate estimation of the pitch of a speech signal can influence objective speech quality measurement algorithms.
Abstract: Pitch estimation has a central role in many speech processing applications. In voiced speech, pitch can be objectively defined as the rate of vibration of the vocal folds. However, pitch is an inherently subjective quantity and cannot be directly measured from the speech signal. It is a nonlinear function of the signal's spectral and temporal energy distribution. A number of methods for pitch estimation have been developed but none can claim to work accurately in the presence of high levels of additive noise or reverberation. Any system of practical importance must be robust to additive noise and reverberation as these are encountered frequently in the field of operation of voice telecommunications systems. In non-intrusive speech quality measurement algorithms, such as the P.563 and LCQA, pitch is used as a feature for quality assessment. The accuracy of this feature in noisy speech signals will be shown to correlate with the accuracy of the objective measure of the quality of the speech signal. In this paper we evaluate the performance of four established state-of-the-art algorithms for pitch estimation in additive noise and reverberation. Furthermore, we show how accurate estimation of the pitch of a speech signal can influence objective speech quality measurement algorithms.
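
For reference, the kind of baseline these algorithms improve on is the plain autocorrelation estimator below, which degrades quickly in noise and reverberation:

    import numpy as np

    def pitch_autocorr(frame, sr, fmin=60.0, fmax=400.0):
        # Largest autocorrelation peak within the plausible lag range.
        frame = frame - frame.mean()
        ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
        lo, hi = int(sr / fmax), int(sr / fmin)
        lag = lo + int(np.argmax(ac[lo:hi]))
        return sr / lag  # pitch estimate in Hz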

Proceedings ArticleDOI
15 Jun 2009
TL;DR: This paper describes how the selection of parameters for the variance fractal dimension (VFD) multiscale time-domain algorithm can create an amplification of the fractal Dimension trajectory that is obtained for a natural-speech waveform in the presence of ambient noise.
Abstract: This paper describes how the selection of parameters for the variance fractal dimension (VFD) multiscale time-domain algorithm can create an amplification of the fractal dimension trajectory that is obtained for a natural-speech waveform in the presence of ambient noise. The technique is based on the variance fractal dimension trajectory (VFDT) algorithm that is used not only to detect the external boundaries of an utterance, but also its internal pauses representing the unvoiced speech. The VFDT algorithm can also amplify internal features of phonemes. This fractal feature amplification is accomplished when the time increments are selected in a dyadic manner rather than selecting the increments in a unit distance sequence. These amplified trajectories for different phonemes are more distinct, thus providing a better characterization of the individual segments in the speech signal. This approach is superior to other energy-based boundary-detection techniques. These observations are based on extensive experimental results on speech utterances digitized at 44.1 kilosamples per second, with 16 bits in each sample.
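
The VFD computation itself is compact; the dyadic time increments the paper highlights appear below as lags of 2^k samples (a sketch of the standard variance fractal dimension estimate, not the authors' exact implementation):

    import numpy as np

    def variance_fractal_dimension(x, n_scales=6):
        # Slope of log Var[x(t+dt) - x(t)] versus log dt equals 2H;
        # the trajectory dimension is D = 2 - H.
        log_dt, log_var = [], []
        for k in range(1, n_scales + 1):
            dt = 2 ** k  # dyadic increments rather than unit steps
            inc = x[dt:] - x[:-dt]
            log_dt.append(np.log(dt))
            log_var.append(np.log(np.var(inc) + 1e-20))
        slope = np.polyfit(log_dt, log_var, 1)[0]
        return 2.0 - slope / 2.0

Sliding this estimate over successive windows yields the trajectory (VFDT) whose amplified phoneme features the paper analyzes.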

Journal ArticleDOI
TL;DR: A method for exploiting both the temporal and spatial redundancy, typical of multidimensional biomedical signals, has been proposed and proved to be superior to previous coding schemes.
Abstract: In this paper, we propose a model-based lossy coding technique for biomedical signals in multiple dimensions. The method is based on the codebook-excited linear prediction approach and models signals as filtered noise. The filter models short-term redundancy in time; the shape of the power spectrum of the signal and the residual noise, quantized using an algebraic codebook, is used for reconstruction of the waveforms. In addition to temporal redundancy, redundancy in the coding of the filter and residual noise across spatially related signals is also exploited, yielding better compression performance in terms of SNR for a given bit rate. The proposed coding technique was tested on sets of multichannel electromyography (EMG) and EEG signals as representative examples. For 2-D EMG recordings of 56 signals, the coding technique resulted in SNR greater than 3.4 ± 1.3 dB with respect to independent coding of the signals in the grid when the compression ratio was 89%. For EEG recordings of 15 signals and the same compression ratio as for EMG, the average gain in SNR was 2.4 ± 0.1 dB. In conclusion, a method for exploiting both the temporal and spatial redundancy, typical of multidimensional biomedical signals, has been proposed and proved to be superior to previous coding schemes.

Patent
Eun-Mi Oh1, Jung-Hoe Kim1, Ki-Hyun Choo1, Ho-Sang Sung1, Miyoung Kim1 
14 Jul 2009
TL;DR: In this paper, a method and apparatus to encode and decode an audio/speech signal is described, where an inputted audio signal or speech signal may be transformed into at least one of a high frequency resolution signal and a high temporal resolution signal.
Abstract: A method and apparatus to encode and decode an audio/speech signal are provided. An input audio signal or speech signal may be transformed into at least one of a high frequency resolution signal and a high temporal resolution signal. The signal may be encoded by determining an appropriate resolution; the encoded signal may be decoded; and thus the audio signal, the speech signal, and a mixed signal of the audio signal and the speech signal may be processed.

Proceedings ArticleDOI
19 Apr 2009
TL;DR: A previously suggested time-varying quasi-harmonic model is extended in order to estimate the chirp rate of each sinusoidal component, thus successfully tracking fast variations in frequency and amplitude.
Abstract: The speech signal is usually considered stationary during short analysis time intervals. Though this assumption may be sufficient in some applications, it is not valid for high-resolution speech analysis or for applications such as speech transformation and objective voice function assessment for the detection of voice disorders. In speech, there are non-stationary components, for instance time-varying amplitudes and frequencies, which may change quickly over short time intervals. In this paper, a previously suggested time-varying quasi-harmonic model is extended in order to estimate the chirp rate of each sinusoidal component, thus successfully tracking fast variations in frequency and amplitude. The parameters of the model are estimated through linear Least Squares, and the model accuracy is evaluated on synthetic chirp signals. Experiments on speech signals indicate that the new model is able to efficiently estimate the signal component chirp rates, providing means to develop more accurate speech models for high-quality speech transformations.

Proceedings ArticleDOI
04 Dec 2009
TL;DR: Results show that modified speech created by amplifying transient speech and adding it to original speech has higher percent word recognition scores than original speech in the presence of background noise.
Abstract: Speech transients have been shown to be important cues for identifying and discriminating speech sounds. We previously described a wavelet packet-based method for extracting transient speech (Rasetshwane et al. WASPAA 2007, pp. 179–182). The algorithm uses a “transitivity function” to characterize the rate of change of wavelet coefficients, and it can be implemented in real-time to process continuous speech. Psycho-acoustic experiments to select parameters for and to evaluate this method are presented. Results show that modified speech created by amplifying transient speech and adding it to original speech has higher percent word recognition scores than original speech in the presence of background noise.

Patent
10 Dec 2009
TL;DR: In this paper, the authors proposed a method of regenerating wideband speech from narrowband speech using a modulation signal adapted to upshift each frequency in the first range of frequencies by an amount determined by the modulating frequency.
Abstract: A method of regenerating wideband speech from narrowband speech, the method comprising: receiving samples of a narrowband speech signal in a first range of frequencies; modulating received samples of the narrowband speech signal with a modulation signal having a modulating frequency adapted to upshift each frequency in the first range of frequencies by an amount determined by the modulating frequency, wherein the modulating frequency is selected to translate a selected frequency band within the first range of frequencies into a target band; filtering the modulated samples using a target band filter to form a regenerated speech signal in the target band; and combining the narrowband speech signal with the regenerated speech signal in the target band to regenerate a wideband speech signal, the method comprising the step of controlling the modulated samples to lie in a second range of frequencies identified by determining a signal characteristic of frequencies in the first range of frequencies.
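
The modulation-plus-filtering core of the claim can be sketched directly; this assumes the narrowband signal nb has already been resampled to the wideband rate sr_wide, and the filter order and band edges (e.g. target_band = (4000.0, 7000.0)) are illustrative:

    import numpy as np
    from scipy.signal import butter, sosfilt

    def regenerate_wideband(nb, sr_wide, f_mod, target_band):
        # Cosine modulation shifts the selected band up by f_mod.
        t = np.arange(len(nb)) / sr_wide
        shifted = nb * np.cos(2.0 * np.pi * f_mod * t)
        # The target-band filter keeps only the regenerated high band.
        sos = butter(6, target_band, btype="bandpass", fs=sr_wide,
                     output="sos")
        highband = sosfilt(sos, shifted)
        return nb + highband  # wideband = narrowband + regenerated band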