
Showing papers on "Linear predictive coding published in 2014"


Journal ArticleDOI
TL;DR: It is shown that, when the noisy phase is enhanced using the proposed phase reconstruction, instrumental measures predict an increase of speech quality over a range of signal-to-noise ratios, even without explicit amplitude enhancement.
Abstract: The enhancement of speech corrupted by noise is commonly performed in the short-time discrete Fourier transform domain. In case only a single microphone signal is available, typically only the spectral amplitude is modified. However, it has recently been shown that an improved spectral phase can also be utilized for speech enhancement, e.g., for phase-sensitive amplitude estimation. In this paper, we therefore present a method to reconstruct the spectral phase of voiced speech from only the fundamental frequency and the noisy observation. The importance of the spectral phase is highlighted, and we elaborate on the reason why noise reduction can be achieved by modifications of the spectral phase. We show that, when the noisy phase is enhanced using the proposed phase reconstruction, instrumental measures predict an increase of speech quality over a range of signal-to-noise ratios, even without explicit amplitude enhancement.

197 citations
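
The core mechanic behind such f0-based phase reconstruction, advancing the phase of each voiced STFT bin according to the harmonic that dominates it, can be sketched in a few lines. This is a minimal numpy illustration of the general idea, not the paper's exact algorithm; all parameter values and the simple nearest-harmonic model are assumptions.

```python
import numpy as np

def reconstruct_voiced_phase(f0, n_fft=512, hop=128, fs=16000):
    """Propagate STFT phase along time for voiced speech, assuming each
    frequency bin is dominated by the harmonic of f0 nearest its center.
    Sketch only: the published method also treats phase relations across
    bands and handles unvoiced sounds separately."""
    n_bins = n_fft // 2 + 1
    bin_freqs = np.arange(n_bins) * fs / n_fft
    phase = np.zeros((n_bins, len(f0)))
    for l in range(1, len(f0)):
        # harmonic number nearest to each bin's center frequency
        h = np.maximum(1, np.round(bin_freqs / max(f0[l], 1e-6)))
        # advance each bin's phase by 2*pi*f*hop/fs from frame to frame
        phase[:, l] = phase[:, l - 1] + 2 * np.pi * h * f0[l] * hop / fs
    return np.angle(np.exp(1j * phase))  # wrapped to (-pi, pi]
```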


Journal ArticleDOI
TL;DR: In this paper, the classification of various human activities based on micro-Doppler signatures is studied using linear predictive coding (LPC) to reduce the computational cost of feature extraction, which makes real-time processing feasible.
Abstract: In this letter, the classification of various human activities based on micro-Doppler signatures is studied using linear predictive coding (LPC). LPC is proposed to extract features of micro-Doppler signatures, which are mixtures of different frequencies. The use of LPC can not only decrease the time frame required to capture the Doppler signature of human motion but also reduce the computational cost of extracting its features, which makes real-time processing feasible. The measured data of 12 human subjects performing seven different activities using a Doppler radar are used. These activities include running, walking, walking while holding a stick, crawling, boxing while moving forward, boxing while standing in place, and sitting still. A support vector machine is then trained using the output of LPC to classify the activities. Multiclass classification is implemented using a one-versus-one decision structure. The resulting classification accuracy is found to be over 85%. The effects of the number of LPC coefficients and the size of the sliding time window, as well as the decision time-frame size used in the extraction of micro-Doppler signatures, are also discussed.

113 citations
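
As an illustration of this kind of LPC-plus-SVM pipeline, here is a hedged sketch: autocorrelation-method LPC coefficients computed with the Levinson-Durbin recursion and fed to an SVM (scikit-learn's SVC, which handles multiclass problems one-versus-one internally). The random signals, labels, and sizes are placeholders, not the paper's radar data or settings.

```python
import numpy as np
from sklearn.svm import SVC  # multiclass handled one-versus-one internally

def lpc_coefficients(x, order=10):
    """Autocorrelation-method LPC via the Levinson-Durbin recursion."""
    r = np.correlate(x, x, mode="full")[len(x) - 1:len(x) + order]
    a, e = np.zeros(order + 1), r[0]
    a[0] = 1.0
    for i in range(1, order + 1):
        k = -(r[i] + np.dot(a[1:i], r[i - 1:0:-1])) / e
        a_prev = a.copy()
        for j in range(1, i):
            a[j] = a_prev[j] + k * a_prev[i - j]
        a[i] = k
        e *= 1.0 - k * k
    return a[1:]  # prediction coefficients (leading 1 dropped)

# placeholder training set: one LPC vector per micro-Doppler time window,
# seven activity classes with ten windows each
X = np.vstack([lpc_coefficients(np.random.randn(256)) for _ in range(70)])
y = np.repeat(np.arange(7), 10)
clf = SVC(kernel="rbf").fit(X, y)
```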


Proceedings ArticleDOI
04 May 2014
TL;DR: It is demonstrated that distortion caused by reverberation is substantially attenuated by the DNN, whose outputs can be resynthesized into the dereverberated speech signal.
Abstract: Reverberation distorts human speech and usually has negative effects on speech intelligibility, especially for hearing-impaired listeners. It also causes performance degradation in automatic speech recognition and speaker identification systems. Therefore, the dereverberation problem must be dealt with in daily listening environments. We propose to use deep neural networks (DNNs) to learn a spectral mapping from the reverberant speech to the anechoic speech. The trained DNN produces the estimated spectral representation of the corresponding anechoic speech. We demonstrate that distortion caused by reverberation is substantially attenuated by the DNN, whose outputs can be resynthesized into the dereverberated speech signal. The proposed approach is simple, and our systematic evaluation shows promising dereverberation results, which are significantly better than those of related systems.

87 citations
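
A minimal sketch of spectral-mapping dereverberation in this spirit, using scikit-learn's MLPRegressor as a stand-in DNN. The random arrays are placeholders for paired reverberant/anechoic log-magnitude spectra; the paper's network architecture and training setup differ.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# placeholder paired training data: log-magnitude STFT frames of reverberant
# speech (with a 5-frame context window) and the matching anechoic frames
n_frames, n_bins = 500, 257
X_reverb = np.random.randn(n_frames, n_bins * 5)
Y_anechoic = np.random.randn(n_frames, n_bins)

dnn = MLPRegressor(hidden_layer_sizes=(256, 256), max_iter=20)
dnn.fit(X_reverb, Y_anechoic)

# estimated anechoic log-spectrum for one frame; resynthesis would combine
# this magnitude with the reverberant phase and apply an inverse STFT
est = dnn.predict(X_reverb[:1])
```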


Journal ArticleDOI
TL;DR: Speech processing has vast applications in voice dialing, telephone communication, call routing, domestic appliance control, speech-to-text conversion, text-to-speech conversion, lip synchronization, automation systems, etc.
Abstract: Automatic recognition of speech means enabling a natural and easy mode of communication between human and machine. Speech processing has vast applications in voice dialing, telephone communication, call routing, domestic appliance control, speech-to-text conversion, text-to-speech conversion, lip synchronization, automation systems, etc. Here we discuss some of the most widely used feature extraction techniques: Mel Frequency Cepstral Coefficients (MFCC), Linear Predictive Coding (LPC) analysis, Dynamic Time Warping (DTW), Relative Spectral Processing (RASTA), and Zero Crossings with Peak Amplitudes (ZCPA). Techniques like RASTA and MFCC take the perceptual nature of speech into account while extracting features, whereas LPC predicts future features from previous ones.

53 citations
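
For concreteness, here is a compact, self-contained implementation of one of the surveyed techniques, MFCC (framing, power spectrum, mel filterbank, log compression, DCT). Parameter values are common defaults rather than anything prescribed by this survey.

```python
import numpy as np
from scipy.fft import dct

def mfcc(signal, fs=16000, n_fft=512, n_mels=26, n_ceps=13,
         frame_len=400, hop=160):
    """Minimal MFCC: framing -> power spectrum -> mel filterbank -> log -> DCT."""
    hz2mel = lambda f: 2595 * np.log10(1 + f / 700)
    mel2hz = lambda m: 700 * (10 ** (m / 2595) - 1)

    # slice the signal into overlapping, Hamming-windowed frames
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len) + hop * np.arange(n_frames)[:, None]
    frames = np.asarray(signal)[idx] * np.hamming(frame_len)
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2

    # triangular filters spaced uniformly on the mel scale
    pts = np.floor((n_fft + 1) * mel2hz(np.linspace(0, hz2mel(fs / 2),
                                                    n_mels + 2)) / fs).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = pts[m - 1], pts[m], pts[m + 1]
        fbank[m - 1, l:c] = np.linspace(0, 1, c - l, endpoint=False)
        fbank[m - 1, c:r] = np.linspace(1, 0, r - c, endpoint=False)

    logmel = np.log(power @ fbank.T + 1e-10)
    return dct(logmel, type=2, axis=1, norm="ortho")[:, :n_ceps]
```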


Journal ArticleDOI
TL;DR: Instrumental measures predict that by incorporating uncertain prior information about the phase, the quality and intelligibility of processed speech can be improved both over traditional phase-insensitive approaches and over approaches that treat prior information on the phase as deterministic.
Abstract: While most short-time discrete Fourier transform-based single-channel speech enhancement algorithms only modify the noisy spectral amplitude, in recent years the interest in phase processing has increased in the field. The goal of this paper is twofold. First, we derive Bayesian probability density functions and estimators for the clean speech phase when different amounts of prior knowledge about the speech and noise amplitudes are given. Second, we derive a joint Bayesian estimator of the clean speech amplitudes and phases when uncertain a priori knowledge on the phase is available. Instrumental measures predict that by incorporating uncertain prior information about the phase, the quality and intelligibility of processed speech can be improved both over traditional phase-insensitive approaches and over approaches that treat prior information on the phase as deterministic.

47 citations


Proceedings ArticleDOI
04 May 2014
TL;DR: The proposed Modulation of Medium Duration Speech Amplitude (MMeDuSA) feature is a composite feature capturing subband speech modulations and a summary modulation that improved recognition performance under both noisy and channel degraded conditions in almost all the recognition tasks.
Abstract: Studies have shown that the performance of state-of-the-art automatic speech recognition (ASR) systems deteriorates significantly with increased noise levels and channel degradations when compared to human speech recognition capability. Traditionally, noise-robust acoustic features are deployed to improve speech recognition performance under varying background conditions and to compensate for the performance degradations. In this paper, we present the Modulation of Medium Duration Speech Amplitude (MMeDuSA) feature, a composite feature capturing subband speech modulations and a summary modulation. We analyze MMeDuSA's speech recognition performance using SRI International's DECIPHER® large vocabulary continuous speech recognition (LVCSR) system on noise- and channel-degraded Levantine Arabic speech distributed through the Defense Advanced Research Projects Agency (DARPA) Robust Automatic Speech Transcription (RATS) program. We also analyze MMeDuSA's performance on the Aurora-4 noise- and channel-degraded English corpus. Our results from all these experiments suggest that the proposed MMeDuSA feature improves recognition performance under both noisy and channel-degraded conditions in almost all the recognition tasks.

41 citations


Proceedings ArticleDOI
09 Jul 2014
TL;DR: Experimental results show that the quality of the estimated clean speech signal is improved both subjectively and objectively in terms of perceptual evaluation of speech quality (PESQ), especially in mismatched environments where the additive noise is not seen during DNN training.
Abstract: We address an over-smoothing issue of enhanced speech in deep neural network (DNN) based speech enhancement and propose a global variance equalization framework with two schemes, namely post-processing and post-training with a modified objective function, to equalize the global variance of the estimated speech with that of the reference speech. Experimental results show that the quality of the estimated clean speech signal is improved both subjectively and objectively in terms of perceptual evaluation of speech quality (PESQ), especially in mismatched environments where the additive noise is not seen during DNN training.

40 citations
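
The post-processing scheme reduces to a short transform: rescale each feature dimension of the enhanced speech so that its global variance matches a reference statistic from clean speech. A sketch of that idea follows; the paper's exact formulation, and its post-training variant, are not reproduced here.

```python
import numpy as np

def gv_equalize(est_feats, gv_ref):
    """Rescale each feature dimension of DNN-enhanced speech features so
    that its global variance matches gv_ref, a per-dimension variance
    statistic gathered from clean reference speech.
    est_feats: array of shape (frames, dims)."""
    mean = est_feats.mean(axis=0)
    gv_est = est_feats.var(axis=0)
    scale = np.sqrt(gv_ref / np.maximum(gv_est, 1e-10))
    return mean + scale * (est_feats - mean)
```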


Journal ArticleDOI
TL;DR: If the speech correlation is properly estimated, the previously derived subband filters discussed in this work show significantly less speech distortion than conventional noise reduction algorithms; the quality and intelligibility of the processed signals are predicted by objective measures.
Abstract: Recently, it has been proposed to use the minimum-variance distortionless-response (MVDR) approach in single-channel speech enhancement in the short-time frequency domain. By applying optimal FIR filters to each subband signal, these filters reduce additive noise components with less speech distortion compared to conventional approaches. An important ingredient of these filters is the temporal correlation of the speech signals. We derive algorithms that provide a blind estimation of this quantity based on maximum-likelihood and maximum a-posteriori estimation. To derive proper models for the inter-frame correlation of the speech and noise signals, we investigate their statistics on a large dataset. If the speech correlation is properly estimated, the previously derived subband filters discussed in this work show significantly less speech distortion compared to conventional noise reduction algorithms. Therefore, the focus of the experimental part of this work lies on the quality and intelligibility of the processed signals. To evaluate the performance of the subband filters in combination with the clean speech inter-frame correlation estimators, we predict the speech quality and intelligibility by objective measures.

37 citations


Journal ArticleDOI
TL;DR: The experiments presented here show that the analysis-synthesis technique based on GSS can produce speech comparable to that of a high-quality vocoder based on the spectral envelope representation, and that it permits control over voice qualities, namely transforming a modal voice into breathy or tense, by modifying the glottal parameters.
Abstract: This paper proposes an analysis method, called Glottal Spectral Separation (GSS), to separate the glottal source and vocal tract components of speech. This method can produce high-quality synthetic speech using an acoustic glottal source model. The source-filter models commonly used in speech technology applications assume that the source is a spectrally flat excitation signal and that the vocal tract filter can be represented by the spectral envelope of speech. Although this model can produce high-quality speech, it has limitations for voice transformation because it does not allow control over glottal parameters, which are correlated with voice quality. The main problem with using a speech model that better represents the glottal source and the vocal tract filter is that current analysis methods for separating these components are not robust enough to produce the same speech quality as a model based on the spectral envelope of speech. The proposed GSS method is an attempt to overcome this problem and consists of three steps. First, the glottal source signal is estimated from the speech signal. Then, the speech spectrum is divided by the spectral envelope of the glottal source signal in order to remove the glottal source effects from the speech signal. Finally, the vocal tract transfer function is obtained by computing the spectral envelope of the resulting signal. In this work, the glottal source signal is represented using the Liljencrants-Fant model (LF-model). The experiments we present here show that the analysis-synthesis technique based on GSS can produce speech comparable to that of a high-quality vocoder based on the spectral envelope representation. Moreover, it also permits control over voice qualities, namely transforming a modal voice into breathy or tense, by modifying the glottal parameters.

34 citations
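
The central separation step of GSS is a spectral division; a toy sketch follows. It divides raw magnitude spectra, whereas the paper divides by a proper spectral envelope of an LF-model glottal source estimate.

```python
import numpy as np

def gss_vocal_tract_spectrum(speech_mag, glottal_mag, eps=1e-10):
    """Divide the speech magnitude spectrum by the spectrum of the
    estimated glottal source; the spectral envelope of the result
    approximates the vocal tract transfer function."""
    return speech_mag / np.maximum(glottal_mag, eps)
```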


Proceedings ArticleDOI
04 May 2014
TL;DR: This study focuses on the SNR level of -5 dB, at which speech is generally dominated by background noise, and proposes a new feature called the Multi-Resolution Cochleagram (MRCG), which is extracted from four cochleagrams of different resolutions to capture both local information and spectrotemporal context.
Abstract: Speech separation is a challenging problem at low signal-to-noise ratios (SNRs). Separation can be formulated as a classification problem. In this study, we focus on the SNR level of -5 dB, at which speech is generally dominated by background noise. In such a low SNR condition, extracting robust features from a noisy mixture is crucial for successful classification. Using a common neural network classifier, we systematically compare the separation performance of many monaural features. In addition, we propose a new feature called the Multi-Resolution Cochleagram (MRCG), which is extracted from four cochleagrams of different resolutions to capture both local information and spectrotemporal context. Comparisons using two non-stationary noises show a range of feature robustness for speech separation, with the proposed MRCG performing the best. We also find that ARMA filtering, a post-processing technique previously used for robust speech recognition, improves speech separation performance by smoothing the temporal trajectories of feature dimensions.

31 citations
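
A sketch of how the four MRCG resolutions might be assembled, assuming two gammatone cochleagrams (short and long analysis windows, resampled to a common frame rate) are already computed. The 11- and 23-point mean-smoothing sizes follow the commonly cited MRCG recipe and should be checked against the paper.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def mrcg(cg_short, cg_long):
    """Stack four cochleagram resolutions: the high-resolution cochleagram,
    a low-resolution (long-window) one, and two mean-smoothed versions of
    the high-resolution one that capture spectrotemporal context.
    Both inputs are (channels, frames) arrays of equal shape."""
    cg3 = uniform_filter(cg_short, size=11)  # local smoothing
    cg4 = uniform_filter(cg_short, size=23)  # wider context
    return np.concatenate([cg_short, cg_long, cg3, cg4], axis=0)
```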


Proceedings ArticleDOI
04 May 2014
TL;DR: Experimental results, obtained using measured impulse responses, indicate that the proposed approach could be used to improve the dereverberation performance compared to the classical technique.
Abstract: Reverberation has a considerable impact on the quality and intelligibility of captured speech signals. In this paper we present an approach for blind multi-microphone speech dereverberation based on the weighted prediction error method, where the reverberant observations are modeled using multi-channel linear prediction in the short-time Fourier transform domain. Instead of using the commonly employed Gaussian distribution for the desired speech signal, the proposed approach uses a Laplacian distribution which is known to be more accurate in modeling speech signals. Maximum-likelihood estimation is used for estimating the model parameters, leading to a linear programming optimization problem. Experimental results, obtained using measured impulse responses, indicate that the proposed approach could be used to improve the dereverberation performance compared to the classical technique.
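
To make the linear prediction model concrete, here is a single-channel, Gaussian (least-squares) weighted-prediction-error sketch applied per frequency bin. The paper's actual contribution, the Laplacian model solved via linear programming, and the multi-microphone case are not reproduced here; names and defaults are assumptions.

```python
import numpy as np

def wpe_dereverb(Y, taps=10, delay=3, iters=3, eps=1e-8):
    """Estimate late reverberation in each frequency bin as a delayed
    linear prediction from past STFT frames and subtract it.
    Y: complex STFT of shape (freq_bins, frames)."""
    F, T = Y.shape
    X = Y.copy()
    for _ in range(iters):
        lam = np.maximum(np.abs(X) ** 2, eps)  # time-varying variance estimate
        for f in range(F):
            # stack delayed past observations: column k holds Y delayed by delay+k
            G = np.zeros((T, taps), dtype=complex)
            for k in range(taps):
                d = delay + k
                G[d:, k] = Y[f, :T - d]
            w = 1.0 / np.sqrt(lam[f])  # per-frame least-squares weights
            g, *_ = np.linalg.lstsq(G * w[:, None], Y[f] * w, rcond=None)
            X[f] = Y[f] - G @ g  # dereverberated bin trajectory
    return X
```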

Proceedings ArticleDOI
20 Nov 2014
TL;DR: Six different single-channel dereverberation algorithms are evaluated subjectively in terms of speech intelligibility and speech quality and the best performance was observed for the regularized spectral inverse approach with pre-echo removal.
Abstract: In this contribution, six different single-channel dereverberation algorithms are evaluated subjectively in terms of speech intelligibility and speech quality. In order to study the influence of the dereverberation algorithms on speech intelligibility, speech reception thresholds in noise were measured for different reverberation times. The quality ratings were obtained following the ITU-T P.835 recommendations (with slight changes for adaptation to the problem of dereverberation) and included assessment of the attributes reverberant, colored, distorted, and overall quality. Most of the algorithms improved speech intelligibility for short as well as long reverberation times compared to the reverberant condition. The best performance in terms of speech intelligibility and quality was observed for the regularized spectral inverse approach with pre-echo removal. The overall quality of the processed signals was highly correlated with the attributes reverberant and/or distorted. To generalize the present outcomes, further studies are needed to account for the influence of estimation errors.

Journal ArticleDOI
TL;DR: This paper focuses on a survey of various feature extraction techniques in speech processing, such as Fast Fourier Transforms, Linear Predictive Coding, Mel Frequency Cepstral Coefficients, Discrete Wavelet Transforms, Wavelet Packet Transforms, and their applications in speech processing.
Abstract: Speech processing includes various techniques such as speech coding, speech synthesis, speech recognition, and speaker recognition. In the area of digital signal processing, speech processing has versatile applications, so it is still an intensive field of research. Speech processing mostly performs two fundamental operations: feature extraction and classification. The main criterion for a good speech processing system is the selection of the feature extraction technique, which plays an important role in the system's accuracy. This paper intends to focus on a survey of various feature extraction techniques in speech processing, such as Fast Fourier Transforms, Linear Predictive Coding, Mel Frequency Cepstral Coefficients, Discrete Wavelet Transforms, Wavelet Packet Transforms, the hybrid algorithm DWPD, and their applications in speech processing.

Proceedings ArticleDOI
04 May 2014
TL;DR: A mask-based enhancer for very low quality speech that is able to preserve important cues in a noise-robust manner by identifying the time-frequency regions that contain significant speech energy is proposed.
Abstract: We propose a mask-based enhancer for very low quality speech that is able to preserve important cues in a noise-robust manner by identifying the time-frequency regions that contain significant speech energy. We use a classifier to estimate a time-frequency mask from an input feature set that provides information about the energy distribution of both voiced and unvoiced speech. We evaluate the enhancer on a range of noisy speech signals and demonstrate that it yields consistent improvements in an objective intelligibility measure.
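
The final masking-and-resynthesis step is simple to sketch; estimating the mask with the paper's classifier and voiced/unvoiced feature set is the hard part and is assumed given here.

```python
import numpy as np
from scipy.signal import stft, istft

def apply_tf_mask(noisy, mask, fs=16000, nperseg=512):
    """Apply an estimated time-frequency mask (same shape as the STFT,
    values in [0, 1]) to a noisy signal and resynthesize it."""
    _, _, Z = stft(noisy, fs=fs, nperseg=nperseg)
    _, enhanced = istft(Z * mask, fs=fs, nperseg=nperseg)
    return enhanced
```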

Proceedings ArticleDOI
04 May 2014
TL;DR: Several strategies involving front-end filter bank redistribution, cepstral dimensionality reduction, and lexicon expansion for alternative pronunciations are proposed to improve robustness of automatic speech recognition of whispered speech with neutral-trained acoustic models.
Abstract: This study focuses on acoustic variations in speech introduced by whispering, and proposes several strategies to improve robustness of automatic speech recognition of whispered speech with neutral-trained acoustic models. In the analysis part, differences in neutral and whispered speech captured in the UT-Vocal Effort II corpus are studied in terms of energy, spectral slope, and formant center frequency and bandwidth distributions in silence, voiced, and unvoiced speech signal segments. In the part dedicated to speech recognition, several strategies involving front-end filter bank redistribution, cepstral dimensionality reduction, and lexicon expansion for alternative pronunciations are proposed. The proposed neutral-trained system employing redistributed filter bank and reduced features provides a 7.7% absolute WER reduction over the baseline system trained on neutral speech, and a 1.3% reduction over a baseline system with whisper-adapted acoustic models.

Journal ArticleDOI
TL;DR: A method for modifying the Mel cepstral coefficients generated by statistical parametric models that have been trained on plain speech such that the glimpse proportion - an objective measure of the intelligibility of speech in noise - increases, while keeping the speech energy fixed.

Proceedings ArticleDOI
04 May 2014
TL;DR: This work addresses the F0 modeling in whisper-to-speech conversion and shows that F0 contours can be derived from the mapped spectral vectors, which can be used for the synthesis of a speech signal.
Abstract: In this work, we address the issues involved in whisper-to-audible speech conversion. Spectral mapping techniques using Gaussian mixture models or artificial neural networks borrowed from voice conversion have been applied to transform whisper spectral features to normally phonated audible speech. However, the modeling and generation of the fundamental frequency (F0) and its contour in the converted speech is a major issue. Whispered speech does not contain explicit voicing characteristics, and hence it is hard to derive a suitable F0, making it difficult to generate a natural prosody after conversion. Our work addresses the F0 modeling in whisper-to-speech conversion. We show that F0 contours can be derived from the mapped spectral vectors, which can be used for the synthesis of a speech signal. We also present a hybrid unit selection approach for whisper-to-speech conversion. Unit selection is performed on the spectral vectors, where F0 and its contour can be obtained as a byproduct without any additional modeling.

Proceedings ArticleDOI
04 May 2014
TL;DR: This paper first obtains a clean speech phase estimate using a recent phase reconstruction algorithm, then proposes to treat this reconstructed phase as uncertain a priori knowledge when deriving a joint MMSE estimate of the clean speech amplitude and phase.
Abstract: In most STFT-based speech enhancement algorithms, only the STFT amplitude of speech is processed, while the STFT phase of the noisy signal is neither modified nor employed to improve amplitude estimation. This is also because modifying the spectral phase often yields undesired artifacts and unnatural sounding speech. In this paper, we first obtain a clean speech phase estimate using a recent phase reconstruction algorithm. Then, we propose to treat this reconstructed phase as uncertain a priori knowledge when deriving a joint MMSE estimate of the clean speech amplitude and phase. The resulting MMSE estimator yields a compromise between the phase of the noisy signal and the prior phase estimate. Instrumental measures and informal listening show that the proposed estimator reduces undesired artifacts and results in improved speech quality.

Patent
18 Jul 2014
TL;DR: In this paper, the authors propose a method for generating clean speech from a speech signal representing a mixture of noise and speech, using a model of speech based on auditory and speech production principles.
Abstract: Provided are systems and methods for generating clean speech from a speech signal representing a mixture of noise and speech. The clean speech may be generated from synthetic speech parameters. The synthetic speech parameters are derived from the speech signal components and a model of speech based on auditory and speech production principles. The modeling may utilize a source-filter structure of the speech signal. One or more spectral analyses are performed on the speech signal to generate spectral representations. Feature data are derived from a spectral representation. Features corresponding to the target speech are grouped according to a model of speech and separated from the feature data. The synthetic speech parameters, including spectral envelope, pitch data, and voice classification data, are generated based on the features corresponding to the target speech.

Proceedings ArticleDOI
20 Nov 2014
TL;DR: This paper proposes a probabilistic method to estimate the clean speech phase from noisy observation using von Mises phase priors, which leads to improved speech quality and intelligibility predicted by instrumental measures without explicit incorporation of amplitude enhancement.
Abstract: In many artificial intelligence systems, the human voice is considered the medium for information transmission. Human-machine communication by voice becomes difficult when speech is mixed with background noise. As a remedy, single-channel speech enhancement is indispensable for reducing background noise from noisy speech to make it suitable for automatic speech recognition and telephony. While conventional techniques for single-channel speech enhancement incorporate the noisy phase in both the amplitude estimation and signal reconstruction stages, in this paper we propose a probabilistic method to estimate the clean speech phase from the noisy observation. Our proposed method consists of phase unwrapping followed by threshold-based temporal smoothing using von Mises phase priors. The proposed phase enhancement method leads to improved speech quality and intelligibility as predicted by instrumental measures, without explicit incorporation of amplitude enhancement.

Journal ArticleDOI
TL;DR: Results show that the proposed coding scheme achieves an average Mean Opinion Score of 3.083 for the synthesized speech at a moderate bit rate (4.2 kbps), outperforming the quality of Code-Excited Linear Prediction (CELP).
Abstract: In this paper, we propose a novel speech coding scheme based on compressed sensing and sparse representation. Compressed sensing (CS) attracts great interest for its ability to recover original signals from a few measurements. The measurements, obtained by projection with a row-echelon matrix, preserve part of the speech features. A dictionary is learned so as to contain redundant information about the speech measurements. The synthesized speech is recovered from a sparse approximation of the corresponding measurement. A low-pass post-filter is adopted to improve the subjective quality of the synthesized speech. Results show that the proposed coding scheme achieves an average Mean Opinion Score (MOS) of 3.083 for the synthesized speech at a moderate bit rate (4.2 kbps), which outperforms the quality of Code-Excited Linear Prediction (CELP).

Journal ArticleDOI
TL;DR: In this paper, the authors proposed a hybrid method combining Hidden Markov Model (HMM)-based speech synthesis and unit selection synthesis using rich context models, which are statistical models that represent individual acoustic parameter segments.
Abstract: In this paper, we propose parameter generation methods using rich context models as yet another hybrid method combining Hidden Markov Model (HMM)-based speech synthesis and unit selection synthesis. Traditional HMM-based speech synthesis enables flexible modeling of acoustic features based on a statistical approach. However, the speech parameters tend to be excessively smoothed. To address this problem, several hybrid methods combining HMM-based speech synthesis and unit selection synthesis have been proposed. Although they significantly improve the quality of synthetic speech, they usually lose the flexibility of the original HMM-based speech synthesis. In the proposed methods, we use rich context models, which are statistical models that represent individual acoustic parameter segments. In training, the rich context models are reformulated as Gaussian Mixture Models (GMMs). In synthesis, initial speech parameters are generated from probability distributions over-fitted to individual segments, and the speech parameter sequence is iteratively generated from GMMs using a parameter generation method based on the maximum likelihood criterion. Since the basic framework of the proposed methods is the same as the traditional framework, the capability of flexibly modeling acoustic features remains. The experimental results demonstrate that: (1) approximation with a single Gaussian component sequence yields better synthetic speech quality than the EM algorithm in the proposed parameter generation method; (2) state-based model selection yields quality improvements at the same level as frame-based model selection; (3) using the initial parameters generated from the over-fitted speech probability distributions is very effective for further improving speech quality; and (4) the proposed methods for the spectral and F0 components yield significant improvements in synthetic speech quality compared with traditional HMM-based speech synthesis.

Proceedings ArticleDOI
03 Apr 2014
TL;DR: This work presents a simple novel scheme for classifying audio speech signals into male speech and female speech, using popular salient low-level time-domain acoustic features that are closely related to the physical properties of the source audio signal.
Abstract: In this work, we present a novel, simple scheme for classifying audio speech signals into male speech and female speech. In the context of content-based multimedia indexing, gender identification based on the speech signal is an important task. Popular salient low-level time-domain acoustic features that are closely related to the physical properties of the source audio signal, such as zero crossing rate (ZCR) and short-time energy (STE), along with spectral flux, a low-level frequency-domain feature, are used for this discrimination. RANSAC and a neural network have been used as classifiers. The experimental results demonstrate the efficiency of the proposed scheme.
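
The time-domain features named above are one-liners over framed audio. A sketch, with the frame matrix (one frame per row) assumed to be precomputed:

```python
import numpy as np

def zcr_ste(frames):
    """Per-frame zero crossing rate and short-time energy for a matrix of
    overlapping frames (one frame per row)."""
    zcr = np.mean(np.abs(np.diff(np.sign(frames), axis=1)) > 0, axis=1)
    ste = np.sum(frames ** 2, axis=1)
    return zcr, ste
```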

Journal ArticleDOI
TL;DR: The highest speech recognition rate was obtained using a 10 ms analysis window with a frame shift varying from 7.5 to 10 ms (regardless of analysis type); the highest increase in recognition rate was 2.5%.
Abstract: The speech signal is redundant and non-stationary by nature. Because of the inertia of the vocal tract, its variations are not very rapid, and the signal can be considered stationary over short segments. It is presumed that the short-time magnitude spectrum contains the most distinctive information of speech. This is the main reason for analyzing the speech signal frame by frame. For this purpose, the speech signal is segmented into overlapping segments (so-called frames). Segments of 15-25 ms with an overlap of 10-15 ms are usually used. In this paper, we present the results of our investigation of the influence of analysis window length and frame shift on the speech recognition rate. We analyzed three different cepstral analysis approaches for this purpose: mel frequency cepstral analysis (MFCC), linear prediction cepstral analysis (LPCC), and perceptual linear prediction cepstral analysis (PLPC). The highest speech recognition rate was obtained using a 10 ms analysis window with a frame shift varying from 7.5 to 10 ms (regardless of analysis type). The highest increase in recognition rate was 2.5%.
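
The frame-by-frame segmentation described above is a short indexing exercise; a sketch with the study's best-performing window and shift as defaults (the Hamming window is an assumption):

```python
import numpy as np

def frame_signal(x, fs, win_ms=10.0, shift_ms=7.5):
    """Segment a signal into overlapping Hamming-windowed frames."""
    win = int(fs * win_ms / 1000)
    shift = int(fs * shift_ms / 1000)
    n_frames = 1 + (len(x) - win) // shift
    idx = np.arange(win) + shift * np.arange(n_frames)[:, None]
    return np.asarray(x)[idx] * np.hamming(win)
```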

Patent
30 Dec 2014
TL;DR: In this paper, the authors propose a speech separation method and system: extract a speech feature of the mixture speech signal, input it into a regression model for speech separation to obtain an estimated speech feature of the target speech signal, and synthesize the target speech signal from the estimated feature.
Abstract: An example of the present invention discloses a speech separation method and system. The method comprises: receiving a mixture speech signal to be separated; extracting a speech feature of the mixture speech signal; inputting the extracted speech feature into a regression model for speech separation to obtain an estimated speech feature of a target speech signal; and synthesizing the target speech signal according to the estimated speech feature. The speech separation effect can be effectively improved using the present invention.

Proceedings ArticleDOI
25 Sep 2014
TL;DR: It is shown in terms of objective measures, spectrogram analysis, and subjective listening tests that the proposed method consistently outperforms one of the state-of-the-art speech enhancement methods for noisy speech corrupted by babble or car noise at high as well as very low SNR levels.
Abstract: In this paper, a noisy speech enhancement method based on modified spectral subtraction performed on the short-time magnitude spectrum is presented. Here, the cross-terms containing the spectra of the noise and clean signals are taken into consideration; these are neglected in the traditional spectral subtraction method on the basis of the assumption that clean speech and noise signals are completely uncorrelated, which is not true for most noises. In this method, the noise estimate to be subtracted from the noisy speech spectrum is determined by exploiting the low-frequency regions of the noisy speech of the current frame, rather than depending only on the initial silence frames. We argue that this approach to noise estimation is capable of tracking the time variation of non-stationary noise. By employing the noise estimates thus obtained, a procedure is formulated to reduce noise from the magnitude spectrum of the noisy speech signal. The noise-reduced magnitude spectrum is then recombined with the unchanged phase spectrum to produce a modified complex spectrum prior to synthesizing an enhanced frame. Extensive simulations are carried out using the NOIZEUS database in order to evaluate the performance of the proposed method. It is shown in terms of objective measures, spectrogram analysis, and subjective listening tests that the proposed method consistently outperforms one of the state-of-the-art speech enhancement methods for noisy speech corrupted by babble or car noise at high as well as very low SNR levels.
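
For orientation, here is the classical magnitude spectral-subtraction baseline that the paper modifies. The cross-term modeling and the per-frame low-frequency noise tracking that constitute the paper's contribution are not included; all parameters are placeholders.

```python
import numpy as np
from scipy.signal import stft, istft

def spectral_subtraction(noisy, fs, n_init_frames=6, alpha=2.0, beta=0.01):
    """Classical spectral subtraction: estimate the noise spectrum from
    initial silence frames, over-subtract it from the noisy magnitude with
    factor alpha, floor the result at beta times the noise estimate, and
    recombine with the unchanged noisy phase."""
    _, _, Z = stft(noisy, fs=fs, nperseg=512)
    mag, phase = np.abs(Z), np.angle(Z)
    noise_mag = mag[:, :n_init_frames].mean(axis=1, keepdims=True)
    clean_mag = np.maximum(mag - alpha * noise_mag, beta * noise_mag)
    _, enhanced = istft(clean_mag * np.exp(1j * phase), fs=fs, nperseg=512)
    return enhanced
```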

Proceedings ArticleDOI
04 May 2014
TL;DR: A novel scheme is proposed to estimate the missing values occurring during LPC model transmission; it applies the Dirichlet mixture model (DMM) to capture the correlations among the elements of the ΔLSF vector.
Abstract: In packet networks, a reliable scheme to handle packet loss during speech transmission is of great importance. As a common representation of the linear predictive coding (LPC) model, the line spectral frequency (LSF) parameters are widely used in speech quantization and transmission. In this paper, we propose a novel scheme to estimate the missing values occurring during LPC model transmission. In order to exploit the boundary and ordering properties of the LSF parameters, we utilize the ΔLSF representation and apply the Dirichlet mixture model (DMM) to capture the correlations among the elements in the ΔLSF vector. With the conditional distribution of the missing part given the received part, an optimal nonlinear minimum mean square error estimator for the missing values is proposed. Compared to the previously presented Gaussian mixture model based method, the proposed DMM based nonlinear estimator shows a convincing improvement.

Proceedings ArticleDOI
18 Sep 2014
TL;DR: Aiming at improved accuracy and robustness of the VAD technique to noise, feature selection is conducted by introducing the class separation measure (CSM) criterion to evaluate the capability of the extracted feature vectors to classify speech and non-speech segments.
Abstract: Voice Activity Detection (VAD) is one of the key techniques for many speech applications. Existing VAD algorithms have shown unsatisfactory performance under non-stationary noise and low signal-to-noise-ratio (SNR) conditions. Motivated by the fact that people are able to distinguish speech from non-speech even in low SNR situations, this paper studies the VAD technique from a pattern recognition point of view, where VAD is essentially formulated as a binary classification problem. Specifically, VAD is implemented by classifying the speech signal into speech and non-speech segments. A radial basis function (RBF) based support vector machine (SVM) is employed in a supervised manner, which is well suited for binary classification tasks with some training samples. Aiming at improved accuracy and robustness of the VAD technique to noise, feature selection is conducted by introducing the class separation measure (CSM) criterion to evaluate the capability of the extracted feature vectors to classify speech and non-speech segments. The most common speech features have been taken into account, including Mel-frequency cepstral coefficients (MFCC), the principal component analysis of the MFCC (PCA-MFCC), linear predictive coding (LPC), and linear predictive cepstral coding (LPCC). Intensive experimental results show that the MFCC features capture the most relevant information of speech and keep good class separability in different noisy conditions, as do the PCA-MFCC features. Moreover, the PCA-MFCC features are more robust to noise with less computational cost. As a result, a VAD method using PCA-MFCC features and an RBF-SVM classifier has been developed, termed PCA-SVM-VAD for short. Experimental results with the NOIZEUS database show that the proposed PCA-SVM-VAD method clearly improves over other VAD methods and performs much more robustly in car noise environments at various SNRs.
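
The resulting PCA-SVM-VAD pipeline maps naturally onto scikit-learn. A sketch with placeholder features and labels; frame-level MFCC extraction and the CSM-based feature selection are assumed to happen upstream.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.svm import SVC

# placeholder frame-level features (e.g. 12 MFCCs per frame) and labels:
# 1 = speech frame, 0 = non-speech frame
X_train = np.random.randn(1000, 12)
y_train = np.random.randint(0, 2, 1000)

# PCA-reduced MFCC features classified by an RBF-kernel SVM
vad = make_pipeline(StandardScaler(), PCA(n_components=8), SVC(kernel="rbf"))
vad.fit(X_train, y_train)

is_speech = vad.predict(X_train[:5])  # per-frame speech/non-speech decisions
```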

Journal ArticleDOI
TL;DR: This method proposes the use of the signal-whitening property of LPC as a preprocessing step in OFDM systems and can achieve a significant reduction in PAPR without degrading the power spectral level, error performance, or computational complexity of the system.
Abstract: High peak-to-average power ratio (PAPR) has always been a major problem in orthogonal frequency division multiplexing (OFDM), leading to severe nonlinear distortion in practical hardware implementations of high-power amplifiers. In this article, a new PAPR reduction method based on linear predictive coding (LPC) is proposed. The method uses the signal-whitening property of LPC as a preprocessing step in OFDM systems. The error filtering of the proposed method removes the predictable content of stationary stochastic processes, which reduces the autocorrelation of the input data sequences and is shown to be a very effective solution to the PAPR problem in OFDM transmissions. It is shown that the proposed method can achieve a significant reduction in PAPR without degrading the power spectral level, error performance, or computational complexity of the system. It is also shown that the proposed method is independent of the modulation scheme and can be applied to any number of subcarriers under both additive white Gaussian noise and wireless Rayleigh fading channels.
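
A hedged sketch of the whitening idea: filter the subcarrier data with an LPC analysis (error) filter before the OFDM IFFT and compare PAPR. The placeholder coefficients and sizes are assumptions, not the article's configuration; in practice the coefficients would be estimated from the data statistics, e.g. with a Levinson-Durbin routine.

```python
import numpy as np
from scipy.signal import lfilter

def papr_db(x):
    """Peak-to-average power ratio of a time-domain symbol, in dB."""
    p = np.abs(x) ** 2
    return 10 * np.log10(p.max() / p.mean())

# hypothetical QPSK data on 256 subcarriers
rng = np.random.default_rng(0)
data = (rng.choice([-1, 1], 256) + 1j * rng.choice([-1, 1], 256)) / np.sqrt(2)

# LPC error (analysis) filter A(z) = 1 + a1 z^-1 + ... + ap z^-p
a = np.array([-0.5, 0.2, -0.1])  # placeholder LPC coefficients
whitened = lfilter(np.concatenate(([1.0], a)), [1.0], data)

print(papr_db(np.fft.ifft(data)), papr_db(np.fft.ifft(whitened)))
```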

Proceedings ArticleDOI
04 May 2014
TL;DR: A novel probabilistic model is proposed that precisely models speech by taking into account the underlying speech production process as well as its dynamics, and that outperforms state-of-the-art methods in terms of objective measures.
Abstract: Model-based speech enhancement methods, which rely on separately modeling the speech and the noise, have been shown to be powerful in many different problem settings. When the structure of the noise can be arbitrary, which is often the case in practice, model-based methods have to focus on developing good speech models, whose quality will be key to their performance. In this study, we propose a novel probabilistic model for speech enhancement which precisely models the speech by taking into account the underlying speech production process as well as its dynamics. The proposed model follows a source-filter approach, where the excitation and filter parts are modeled as non-negative dynamical systems. We present convergence-guaranteed update rules for each latent factor. In order to assess performance, we evaluate our model on a challenging speech enhancement task where the speech is observed under non-stationary noises recorded in a car. We show that our model outperforms state-of-the-art methods in terms of objective measures.