
Showing papers on "Linear predictive coding published in 2022"


Proceedings ArticleDOI
23 May 2022
TL;DR: This work proposes a neural extension to a low-bit-rate speech codec that aims to improve the perceptual quality of synthesized speech, and uses the least-squares generative adversarial network to reduce artifacts and prevent over-smoothing in the reconstructed audio.
Abstract: A speech codec compresses the input signal into a compact bit stream, which is then decoded at the receiver to generate the best possible perceptual quality. This compression makes storing and transmitting speech efficient. In this work, we propose a neural extension to low-bit-rate speech codecs (e.g., Codec2) that aims to improve the perceptual quality of synthesized speech. Our proposed framework combines decoded audio with neural embeddings without breaking the existing speech coders. In addition to embeddings, we also use the least-squares generative adversarial network (LSGAN) to reduce artifacts and prevent over-smoothing in the reconstructed audio. Mean Opinion Scores (MOS) from listening tests show that our framework can boost the quality of speech encoded at 3.6 kbps to outperform that of speech encoded at 6 kbps with Opus.

3 citations


Journal ArticleDOI
TL;DR: This paper explored the role of formant and duration-modification-based out-of-domain data augmentation for zero-resource children's speech recognition under clean and noisy conditions.

2 citations


Proceedings ArticleDOI
06 Sep 2022
TL;DR: This study evaluates the impact of various filters on speech signal processing, based on a series of characterizations and processing of speech signals in MATLAB, to inform the design of speech coding, voice recognition, speech synthesis, and other approaches.
Abstract: Speech signal processing is one of the fastest-growing domains of information science and a very active and popular research subject with significant academic and practical value. Speech signal processing research is vital to disciplines such as machine language, speech recognition, and speech synthesis. With its strong computational capability, MATLAB can readily handle the processing of speech signals: it can analyze digitized speech signals in the time and frequency domains, display their time- and frequency-domain curves, and analyze speech based on its properties. Based on a series of characterizations and processing of speech signals using MATLAB, this study evaluates the impact of various filters on speech signal processing to inform the design of speech coding, voice recognition, speech synthesis, and other approaches.

2 citations


Journal ArticleDOI
TL;DR: Experimental results show that each of these feature extraction methods gives different results, but the classification accuracy obtained using PLP features is the best.
Abstract: In this research, different audio feature extraction techniques are implemented and classification approaches are presented to classify seven types of wind. We applied feature extraction techniques such as Zero Crossing Rate (ZCR), Fast Fourier Transformation (FFT), Linear Predictive Coding (LPC), and Perceptual Linear Prediction (PLP). Some of these methods are known to work well with human voices, but we applied them here to characterize the wind audio content. A CNN classification method is implemented to determine the class of the input wind sound signal. Experimental results show that each of these feature extraction methods gives different results, but the classification accuracy obtained using PLP features is the best.
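Of the listed features, the zero crossing rate is the simplest to illustrate. Below is a minimal NumPy sketch (not the authors' implementation; the sampling rate and test tone are arbitrary choices). A pure tone at frequency f crosses zero about 2f times per second, so its per-sample ZCR is roughly 2f/fs:

```python
import numpy as np

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    signs = np.sign(frame)
    signs[signs == 0] = 1  # treat exact zeros as positive
    return np.mean(signs[:-1] != signs[1:])

# A 100 Hz tone at fs = 16 kHz crosses zero ~200 times per second,
# so the per-sample rate is roughly 2*100/16000 = 0.0125.
fs = 16000
t = np.arange(fs) / fs
tone = np.sin(2 * np.pi * 100 * t)
print(zero_crossing_rate(tone))  # ≈ 0.0125
```

For voiced speech the ZCR is low; for fricatives and broadband noise it is high, which is why it is a cheap voicing/texture cue.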

1 citation


Proceedings ArticleDOI
24 Mar 2022
TL;DR: This work proposes a new approach to VAD using a deep generative model, the Speech Enhancement GAN (SEGAN), a variant of GAN, and gives better results than other state-of-the-art methods.
Abstract: Voice activity detection (VAD) plays an important role as a pre-processing block in many speech processing applications such as speech coding, speech enhancement, and speech recognition systems. The main objective of a VAD algorithm is to identify speech and non-speech regions in a given audio signal. The challenging task for VAD systems, however, is classifying speech/non-speech frames in an input audio signal corrupted by noise, e.g., environmental noise. To address this problem, we propose a new approach to VAD using a deep generative model. These models can learn the underlying distribution of target data through an adversarial learning process. In this work, we explore the Speech Enhancement GAN (SEGAN), a variant of GAN, for the VAD application. The proposed work is evaluated on a subset of the Apollo speech corpus, as the dataset contains speech files with multiple challenges: multiple speakers, different noise types, different Signal-to-Noise Ratio (SNR) levels, channel distortion, and long non-speech stretches. The performance of the system is evaluated using the detection cost function (DCF) metric. The proposed work gives better results than other state-of-the-art methods.

1 citation


Journal ArticleDOI
TL;DR: This work presents a low-complexity, low-delay speech codec based on tree-coding with sample-by-sample adaptive long- and short-code generators that incorporates pre- and post-filtering for perceptual weighting and multimode speech classification with comfort noise generation (CNG).
Abstract: As speech-coding standards have improved over the years, complexity has increased and less emphasis has been placed on low encoding/decoding delay. We present a low-complexity, low-delay speech codec based on tree coding with sample-by-sample adaptive long- and short-code generators that incorporates pre- and post-filtering for perceptual weighting and multimode speech classification with comfort noise generation (CNG). The pre-/post-weighting filters adapt based on the code generator parameters available at both the encoder and decoder, rather than the usual method that uses the input speech. The coding of the multiple speech modes and comfort noise generation is likewise accomplished using the code generator adaptation algorithms rather than the input speech. Codec complexity comparisons are presented, and operational rate distortion curves for several standardized speech codecs and the new codec are given. Finally, codec performance is shown in relation to theoretical rate distortion bounds.

1 citation


Journal ArticleDOI
Hao Duc Do1
TL;DR: In this article, the authors propose two new methods to strengthen the speech feature vector using the usually ignored linear trend of the signal: linear regression identifies each speech frame's linear trend, or linear envelope, in the time domain, and removing that trend normalizes the signal and emphasizes the stationary elements in it.
Abstract: Speech feature extraction usually begins with transforming the signal from the time domain to the frequency domain via integral transforms. Human speech contains sine-shape waves and a linear trend in the time domain, usually ignored. This research proposes two new methods to strengthen the speech feature vector using this unused factor. Before transforming the speech frames into the frequency domain, we use linear regression to identify each speech frame’s linear trend or linear envelope in the time domain. Then we remove the impact of that trend to normalize the signal and emphasize the stationary elements in the signal. The proposed feature vector includes parameters from the linear envelope and the conventional vectors as a spectrum or MFCC. Experimental results demonstrate that the impact of the linear envelope is significant and the linear envelope subtraction is a meaningful stage. Our new features emphasize the stationary property in the speech signal and improve the result for speech recognition in terms of error rate.

Proceedings ArticleDOI
01 Nov 2022
TL;DR: In this article, a method to improve the residual-signal excitation based on LPC10 was proposed; the generated speech and the original speech were scored with the PESQ algorithm, and the improved speech scored 1.68, which is 0.34 points higher than the LPC10 synthesized speech.
Abstract: Under narrowband shortwave communication conditions, digital speech coding mostly takes the form of low-rate linear predictive coding, but LPC parametric coding recovers speech of low naturalness with a characteristic buzz. In this paper, we propose a method to improve the residual-signal excitation based on LPC10. At the encoder, the prediction coefficients are obtained by linear prediction analysis, and the original speech is inverse-filtered with these coefficients to obtain the residual signal; at the decoder, the original buzzy pulse excitation is replaced with the residual signal, and the improved synthesized speech reduces the buzz of the original LPC-synthesized speech. The generated speech and the original speech were scored with the PESQ algorithm; the improved speech scored 1.68, which is 0.34 points higher than the LPC10 synthesized speech.
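The analysis/synthesis loop described above (solve for prediction coefficients, inverse-filter the speech through A(z) to get the residual, then excite 1/A(z) with that residual at the decoder) can be sketched in a few lines of NumPy. This is an illustrative single-frame sketch, not the paper's LPC10 implementation; the frame length, model order, and test signal are arbitrary assumptions:

```python
import numpy as np

def lpc_coeffs(x, order):
    """LPC polynomial A(z) via the autocorrelation method (normal equations)."""
    r = np.correlate(x, x, mode="full")[len(x) - 1:len(x) + order]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, r[1:])
    return np.concatenate(([1.0], -a))          # A(z) = 1 - sum_k a_k z^-k

def synthesize(e, A):
    """Drive the all-pole filter 1/A(z) with the excitation e."""
    y = np.zeros(len(e))
    for n in range(len(e)):
        y[n] = e[n] - sum(A[k] * y[n - k] for k in range(1, len(A)) if n >= k)
    return y

rng = np.random.default_rng(0)
fs = 8000
t = np.arange(240) / fs                          # one 30 ms frame
frame = (np.sin(2 * np.pi * 200 * t) + 0.5 * np.sin(2 * np.pi * 400 * t)
         + 0.01 * rng.standard_normal(240))

A = lpc_coeffs(frame, order=10)
residual = np.convolve(frame, A)[:len(frame)]    # inverse filtering: e = A(z) x
resynth = synthesize(residual, A)                # residual-excited synthesis
print(np.max(np.abs(resynth - frame)))           # near-zero reconstruction error
```

With the true residual as excitation, the synthesis filter reconstructs the frame almost exactly; LPC10's buzz comes precisely from replacing this residual with a coarse pulse/noise model, which is what the paper improves on.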

Proceedings ArticleDOI
11 Dec 2022
TL;DR: In this article, the authors investigated the effect of temporal misalignment between the two input speech signals on the performance of speech quality evaluation, and found that the prediction power of the PESQ measure was not affected, mainly due to its embedded temporal alignment processing, while the prediction powers of other quality measures were markedly influenced by increased temporal misalignment, particularly the measures of output signal-to-noise ratio.
Abstract: Reliable objective speech quality prediction is important for the design of new speech processing and coding algorithms. Many factors may affect the performance of objective speech quality evaluation. Most speech processing algorithms cause temporal misalignment between the probe and processed (e.g., noise-suppressed) speech signals, which are used as inputs to existing speech quality models. The present work investigated the effect of temporal misalignment between the two input speech signals on the performance of speech quality evaluation. Subjective speech quality ratings from 120 noise-masked/suppressed conditions (processed by 14 single-channel noise-suppression algorithms) were correlated with existing objective speech quality indices. The probe and processed speech signals were artificially misaligned over a time range of up to 50 ms. Results showed that the prediction power of the PESQ measure was not affected by temporal misalignment, mainly due to its embedded temporal alignment processing, while the prediction powers of other quality measures were markedly influenced by increased temporal misalignment between the two input signals, particularly the measures of output signal-to-noise ratio. These findings demonstrate the importance of compensating for the temporal misalignment effect in objective speech quality evaluation.

Proceedings ArticleDOI
04 Dec 2022
TL;DR: In this article, an LPC-based speech enhancement (LPCSE) architecture is proposed, which leverages the strong inductive biases in the LPC speech model in conjunction with the expressive power of neural networks.
Abstract: The increasingly stringent requirements on quality of experience in 5G/B5G communication systems have led to emerging neural speech enhancement techniques, which, however, have been developed in isolation from existing expert-rule-based models of speech pronunciation and distortion, such as the classic Linear Predictive Coding (LPC) speech model, because it is difficult to integrate such models with auto-differentiable machine learning frameworks. In this paper, to improve the efficiency of neural speech enhancement, we introduce an LPC-based speech enhancement (LPCSE) architecture, which leverages the strong inductive biases in the LPC speech model in conjunction with the expressive power of neural networks. Differentiable end-to-end learning is achieved in LPCSE via two novel blocks: one that utilizes the expert rules to reduce the computational overhead of integrating the LPC speech model into neural networks, and one that ensures the stability of the model and avoids exploding gradients in end-to-end training by mapping the linear prediction coefficients to the filter poles. The experimental results show that LPCSE successfully restores the formants of speech distorted by transmission loss, and outperforms two existing neural speech enhancement methods of comparable neural network size in terms of Perceptual Evaluation of Speech Quality (PESQ) and Short-Time Objective Intelligibility (STOI) on the LJ Speech corpus.

Book ChapterDOI
16 Jun 2022
TL;DR: In this chapter, the authors describe methods for extracting linear prediction coefficients (LPC), reflection coefficients (RC), linear prediction cepstral coefficients (LPCC), log area ratio (LAR) coefficients, mel-frequency cepstral coefficients (MFCC), bark-frequency cepstral coefficients (BFCC), their perceptual counterparts (PLPC, PRC, PLPCC, PLAR), and the reconsidered perceptual variants (RPLPC, RPRC, RPLPCC, RPLAR).
Abstract: The preliminary stage of biometric identification is speech signal structuring and feature extraction. For calculating the fundamental tone, the following methods are considered and numerically investigated: the autocorrelation function (ACF) method, the average magnitude difference function (AMDF) method, the simplified inverse filter transformation (SIFT) method, a wavelet-analysis-based method, a cepstral-analysis-based method, and the harmonic product spectrum (HPS) method. For speech feature extraction, the following methods are considered and numerically investigated: the digital bandpass filter bank, spectral analysis, homomorphic processing, and linear predictive coding. These methods make it possible to extract linear prediction coefficients (LPC), reflection coefficients (RC), linear prediction cepstral coefficients (LPCC), log area ratio (LAR) coefficients, mel-frequency cepstral coefficients (MFCC), bark-frequency cepstral coefficients (BFCC), perceptual linear prediction coefficients (PLPC), perceptual reflection coefficients (PRC), perceptual linear prediction cepstral coefficients (PLPCC), perceptual log area ratio (PLAR) coefficients, reconsidered perceptual linear prediction coefficients (RPLPC), reconsidered perceptual reflection coefficients (RPRC), reconsidered perceptual linear prediction cepstral coefficients (RPLPCC), and reconsidered perceptual log area ratio (RPLAR) coefficients. The largest identification probability (0.98) with the smallest number of coefficients (4) is provided by coding the vocal speech sounds from TIMIT based on PRC.
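The autocorrelation function (ACF) method for the fundamental tone, the first pitch estimator listed above, can be sketched generically as follows. This is an illustration, not the chapter's implementation; the search range and test signal are assumptions:

```python
import numpy as np

def acf_pitch(frame, fs, fmin=60, fmax=400):
    """Fundamental frequency estimate from the autocorrelation peak
    within the plausible pitch-lag range [fs/fmax, fs/fmin]."""
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(fs / fmax), int(fs / fmin)
    lag = lo + np.argmax(r[lo:hi])
    return fs / lag

fs = 16000
t = np.arange(int(0.04 * fs)) / fs
# Harmonic test signal with a 125 Hz fundamental
x = sum(np.sin(2 * np.pi * 125 * k * t) / k for k in (1, 2, 3))
print(acf_pitch(x, fs))  # ≈ 125 Hz
```

The autocorrelation of a periodic frame peaks at lags equal to the pitch period, so restricting the search to plausible lags and taking the maximum recovers the fundamental even when harmonics dominate the spectrum.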


Posted ContentDOI
14 Jun 2022
TL;DR: In this paper , an LPC-based speech enhancement (LPCSE) architecture is proposed, which leverages the strong inductive biases in the LPC speech model in conjunction with the expressive power of neural networks.
Abstract: The increasingly stringent requirements on quality of experience in 5G/B5G communication systems have led to emerging neural speech enhancement techniques, which, however, have been developed in isolation from existing expert-rule-based models of speech pronunciation and distortion, such as the classic Linear Predictive Coding (LPC) speech model, because it is difficult to integrate such models with auto-differentiable machine learning frameworks. In this paper, to improve the efficiency of neural speech enhancement, we introduce an LPC-based speech enhancement (LPCSE) architecture, which leverages the strong inductive biases in the LPC speech model in conjunction with the expressive power of neural networks. Differentiable end-to-end learning is achieved in LPCSE via two novel blocks: one that utilizes the expert rules to reduce the computational overhead of integrating the LPC speech model into neural networks, and one that ensures the stability of the model and avoids exploding gradients in end-to-end training by mapping the linear prediction coefficients to the filter poles. The experimental results show that LPCSE successfully restores the formants of speech distorted by transmission loss, and outperforms two existing neural speech enhancement methods of comparable neural network size in terms of Perceptual Evaluation of Speech Quality (PESQ) and Short-Time Objective Intelligibility (STOI) on the LJ Speech corpus.

Journal ArticleDOI
TL;DR: It is proved that despite the high level of development of digital processors, calculating the parameters of the vocal tract model in real time remains one of the main difficulties in speech coding, so methods that reduce this complexity are currently of significant interest.
Abstract: The article describes the basic idea of coding a speech signal by the method of linear prediction (Linear Predictive Coding, LPC): instead of the parameters of the speech signal itself, the communication line carries the encoded parameters of a filter that is, in a sense, an equivalent of the human vocal tract, together with the parameters of that filter's excitation signal (tone or noise). The essence of the synthesizing filter's parameters, the linear prediction (LP) coefficients calculated by frame-by-frame adaptive filtering, is disclosed. Their main advantage is the ability to completely describe the state of the predictor filter; their main disadvantage, established by numerous studies, is a sensitivity to quantization errors that prevents direct transmission of the LP coefficients over the communication channel. This motivates the search for mathematically equivalent parameters of the predictor filter. Alternative parameters of the vocal tract model, line spectral frequencies (LSF, also known as line spectral pairs, LSP), are proposed; they are the representation most often used in low-rate speech codecs today. The main difficulty is computing the LSP directly from the LPC in real time, and most of the analyzer's processor time is typically spent on this task. Despite the high level of development of digital processors, calculating the parameters of the vocal tract model in real time remains one of the main difficulties in speech coding, so methods that reduce the complexity of this procedure are of significant interest. A new method for calculating the LSP and an algorithm based on it are proposed, and its main advantages over existing methods are described.
An assessment was made of the possible improvement in speech processing quality due to the use of the developed method for calculating the LSP.
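For context, the textbook way to obtain the LSP/LSF parameters discussed above is to form the sum and difference polynomials of A(z) and take the angles of their unit-circle roots. A minimal root-finding sketch follows (a generic illustration whose cost is exactly what the article's reduced-complexity method aims to avoid; the example predictor is an arbitrary stable choice):

```python
import numpy as np

def lpc_to_lsf(A):
    """Line spectral frequencies of the LPC polynomial A(z) (A[0] = 1, order p).
    The roots of the sum/difference polynomials P, Q lie on the unit circle
    and interlace; the p angles in (0, pi) are the LSFs."""
    Az = np.concatenate((A, [0.0]))
    P = Az + Az[::-1]            # P(z) = A(z) + z^-(p+1) A(z^-1)
    Q = Az - Az[::-1]            # Q(z) = A(z) - z^-(p+1) A(z^-1)
    eps = 1e-6                   # exclude the trivial roots at z = +/-1
    lsf = [np.angle(r) for poly in (P, Q) for r in np.roots(poly)
           if eps < np.angle(r) < np.pi - eps]
    return np.sort(lsf)

# Illustrative 2nd-order predictor with poles at 0.6 +/- 0.374j (stable)
A = np.array([1.0, -1.2, 0.5])
print(lpc_to_lsf(A))  # ≈ [arccos(0.85), arccos(0.35)] ≈ [0.555, 1.213]
```

Because `np.roots` costs an eigendecomposition per frame, real-time codecs instead use Chebyshev-polynomial grid searches or, as here, specially designed low-complexity methods.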

Proceedings ArticleDOI
06 May 2022
TL;DR: In this article, the authors used LPC (linear predictive coding) and CELP (code-excited linear prediction) to reproduce the original signal through the application of some predictive techniques.
Abstract: Snoring is a disagreeable sound produced by humans while they sleep, and in some respects it is considered a pathology. Characterized by inspiratory signals, it is closely related to the breathing function. This paper deals with the sleeping snore using an efficient approach based on the synthesis of a recorded snoring signal. The advantages of this approach are varied: it offers a non-contact substitute and an artificial, machine reproduction of the original snoring signal, which could later be integrated into humanoid robots, for example. The reconstitution itself reproduces the signal through predictive techniques such as LPC (linear predictive coding) and CELP (code-excited linear prediction). The difference between the original and synthetic signals, also called the residual, can be explained by a scanning factor and different types of noise. Finally, to evaluate our approach, we compute the Segmental Signal-to-Noise Ratio (segmental SNR, a variant of SNR well suited to segmented signals) and the Root Mean Square Error (RMSE), both suitable criteria for sound signals, in order to show the effectiveness of these different methods.

Journal ArticleDOI
TL;DR: The improved linear predictive coding method yields more natural, higher-intelligibility speech signals, and the designed machine learning evaluation system accurately detects information about the quality of the learner's pronunciation.
Abstract: In the teaching of English, there is an increasing focus on practical communication skills. As a result, the speaking test component has received more and more attention from education experts. With the rapid development of modern computer technology and network technology, using computers to assess the quality of spoken English has become a hot research topic. A machine learning assessment system based on linear predictive coding is proposed in order to achieve automatic scoring of spoken English tests. First, the principle of linear predictive coding and decoding is analyzed, and the traditional linear predictive codec algorithm is improved by using hybrid excitation instead of the traditional binary excitation. Second, the overall structure of the machine learning assessment system is designed; it comprises four modules: an acoustic model acquisition module, a speech recognition module, a standard pronunciation transcription module, and a decision module. Then, the speech recognition module is implemented with the improved linear predictive speech coding method to acquire the feature parameters of the speech signal and generate the speech feature vector. Finally, a convolutional neural network is trained on the speech features to implement the acoustic model acquisition module. The experimental results show that the improved linear predictive speech coding method yields more natural and more intelligible speech signals, and the designed machine learning evaluation system accurately detects information about the quality of the learner's pronunciation.

Journal ArticleDOI
TL;DR: This paper presents an overview of Linear Predictive Coding (LPC) for feature extraction in speech processing systems, where it is used to convert the speech waveform into a parametric representation.
Abstract: Over the past years, advancements in speech processing have mostly been driven by DSP approaches. The speech interface converts speech input into a parametric form for further processing (Speech-to-Text) and converts the resulting text output back into speech (Text-to-Speech). Feature extraction changes the speech waveform into a parametric representation at a relatively low data rate so that it can be processed and analyzed later. Numerous feature extraction techniques are available. This paper presents an overview of Linear Predictive Coding (LPC).
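As background to the LPC overview above, the prediction coefficients are classically obtained from a frame's autocorrelation sequence via the Levinson-Durbin recursion, which also yields the reflection coefficients as a by-product. A minimal sketch of the generic textbook algorithm (not tied to this paper):

```python
import numpy as np

def levinson_durbin(r, order):
    """Levinson-Durbin recursion: predictor polynomial a, reflection
    coefficients k, and residual energy E from autocorrelations r[0..order]."""
    a = np.zeros(order + 1)
    a[0] = 1.0
    E = r[0]
    ks = []
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[1:i][::-1])
        k = -acc / E
        a[1:i] = a[1:i] + k * a[1:i][::-1]   # update lower-order coefficients
        a[i] = k
        E *= (1 - k * k)                     # prediction error shrinks each step
        ks.append(k)
    return a, np.array(ks), E

# AR(1) process with pole at 0.8 has autocorrelation r[m] proportional to 0.8**m
r = 0.8 ** np.arange(3)
a, k, E = levinson_durbin(r, 2)
print(a)  # ≈ [1, -0.8, 0]: the recursion recovers the AR(1) predictor exactly
```

The recursion solves the Toeplitz normal equations in O(p²) rather than O(p³), which is why it is the standard front end for LPC-based features.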

Proceedings ArticleDOI
09 Jun 2022
TL;DR: In this article, a basic waveform encoding technique, Pulse Code Modulation (PCM), is introduced as the most common technique for converting an analogue signal to digital data, and the Diagnostic Rhyme Test (DRT) and Mean Opinion Score (MOS) methods are discussed as voice quality tests for evaluating the performance of voice coders and checking the quality of the synthetic speech.
Abstract: Nowadays, voice encoders are among the basic elements of multimedia and telecommunications. This report explains the properties of speech, how sound waves are generated and classified, and what voiced and unvoiced speech are. A basic waveform encoding technique, Pulse Code Modulation (PCM), is introduced as the most common technique for converting an analogue signal to digital data. Three types of voice encoder, the Linear Predictive Coder (LPC), Regular Pulse-Excited (RPE), and Code-Excited Linear Predictive (CELP) coders, are briefly discussed, with a performance comparison table summarizing their differences. Finally, the Diagnostic Rhyme Test (DRT) and Mean Opinion Score (MOS) methods are discussed as voice quality tests for evaluating the performance of voice coders and checking the quality of the synthetic speech.
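The uniform PCM quantization introduced above is easy to sketch, and the familiar "about 6 dB of SNR per bit" rule of thumb falls out directly. This is a generic mid-rise quantizer, not the report's specific configuration; the input distribution and bit depths are illustrative assumptions:

```python
import numpy as np

def pcm_quantize(x, bits, xmax=1.0):
    """Uniform mid-rise PCM: map x in [-xmax, xmax) onto 2**bits levels
    and return the reconstructed (dequantized) values."""
    levels = 2 ** bits
    step = 2 * xmax / levels
    idx = np.clip(np.floor(x / step), -levels // 2, levels // 2 - 1)
    return (idx + 0.5) * step

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, 100_000)          # full-scale uniform test signal
for bits in (8, 12, 16):
    err = x - pcm_quantize(x, bits)
    snr = 10 * np.log10(np.mean(x**2) / np.mean(err**2))
    print(bits, round(snr, 1))           # roughly 6 dB per bit
```

Waveform coders like PCM spend bits on the sample values themselves, which is why parametric coders (LPC, RPE, CELP) reach far lower rates by transmitting model parameters instead.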

Posted ContentDOI
27 Aug 2022
TL;DR: In this article, the analysis of isolated digit recognition at different bit rates and different noise levels is performed using Audacity and the HTK toolkit; the feature extraction techniques used are Mel-frequency cepstral coefficients (MFCC), Linear Predictive Coding (LPC), perceptual linear prediction (PLP), mel spectrum (MELSPEC), and filter bank (FBANK) features.
Abstract: This research work concerns recent developments in speech recognition. In it, isolated digit recognition is analyzed at different bit rates and different noise levels. The work was carried out using Audacity and the HTK toolkit, with the Hidden Markov Model (HMM) as the recognition model. The feature extraction techniques used are Mel-frequency cepstral coefficients (MFCC), Linear Predictive Coding (LPC), perceptual linear prediction (PLP), mel spectrum (MELSPEC), and filter bank (FBANK) features. Three different noise levels were considered for testing: random noise, fan noise, and random noise in a real-time environment. This was done to determine the best environment for real-time applications. Further, five commonly used bit rates at different sampling rates were considered to find the optimal bit rate.

Journal ArticleDOI
22 Dec 2022-Sensors
TL;DR: In this paper, an improved estimator for the speech spatial covariance matrices (SCM) is proposed, which can be parameterized with the speech power spectral density (PSD) and relative transfer function (RTF).
Abstract: Online multi-microphone speech enhancement aims to extract target speech from multiple noisy inputs by exploiting spatial information as well as spectro-temporal characteristics with low latency. Acoustic parameters such as the acoustic transfer function and the speech and noise spatial covariance matrices (SCMs) should be estimated in a causal manner to enable online estimation of the clean speech spectra. In this paper, we propose an improved estimator for the speech SCM, which can be parameterized with the speech power spectral density (PSD) and relative transfer function (RTF). Specifically, we adopt the temporal cepstrum smoothing (TCS) scheme to estimate the speech PSD, which is conventionally estimated with temporal smoothing. In addition, we propose a novel RTF estimator based on a time difference of arrival (TDoA) estimate obtained by the cross-correlation method. We also propose refining the initial estimate of the speech SCM by utilizing the estimates of the clean speech spectrum and clean speech power spectrum. The proposed approach showed superior performance in terms of perceptual evaluation of speech quality (PESQ) scores, extended short-time objective intelligibility (eSTOI), and scale-invariant signal-to-distortion ratio (SISDR) in our experiments on the CHiME-4 database.
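The cross-correlation TDoA estimate underlying the proposed RTF estimator can be illustrated generically as follows. This is a simplified free-field, noise-free two-microphone sketch, not the paper's estimator; the sampling rate, delay, and white-noise source are assumptions:

```python
import numpy as np

def tdoa_crosscorr(x1, x2, fs):
    """Time difference of arrival between two microphone signals, taken as
    the lag of the cross-correlation peak (positive means x2 lags x1)."""
    cc = np.correlate(x2, x1, mode="full")
    lag = np.argmax(cc) - (len(x1) - 1)
    return lag / fs

fs = 16000
rng = np.random.default_rng(0)
s = rng.standard_normal(2048)                      # broadband source signal
delay = 12                                         # samples: 0.75 ms at 16 kHz
x1 = s
x2 = np.concatenate((np.zeros(delay), s[:-delay]))  # delayed copy at mic 2
print(tdoa_crosscorr(x1, x2, fs) * 1000)           # ≈ 0.75 ms
```

In an anechoic far-field model, this TDoA τ fixes the RTF phase at each frequency (roughly e^{-j2πfτ}), which is the link the paper exploits; real rooms add reverberation and noise that the full estimator has to handle.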

Proceedings ArticleDOI
18 Dec 2022
TL;DR: In this paper , a method for performance improvement using the combination of feature vectors for speaker detection from the mixed speech of multiple speakers is presented, where pLPCPD is the feature vector that is introduced and developed for speaker verification.
Abstract: This paper presents a method for improving performance in speaker detection from the mixed speech of multiple speakers by combining feature vectors. Recently, we have shown that the combination of the linear predictive coding spectral envelope (LPCSE), the Mel-frequency cepstral coefficients (MFCC), and the piecewise linear predictive coding pole distribution (pLPCPD) improves the performance of speaker verification, where pLPCPD is a feature vector that we introduced and developed for speaker verification. We have also examined a method of speaker verification from the mixed speech of multiple speakers using pLPCPD feature vectors, showing that the recall, a performance measure for the verification, drops suddenly when the speech changes from unmixed to mixed, while the precision decreases much less. This paper applies the above feature-combination method and the findings on speaker verification to speaker detection from the mixed speech of multiple speakers using the probabilistic prediction method we are also developing. Through experiments, we show the performance improvement obtained by combining the three features and the effectiveness of the present prediction method.

Proceedings ArticleDOI
23 Sep 2022
TL;DR: In this paper, the authors propose a spectral subtraction speech enhancement algorithm to restore the spectral magnitude of the signal; objective measures such as SNR and PESQ were calculated in simulations.
Abstract: Speech enhancement, which processes a noisy speech signal to improve human perception of the speech, is an attractive research topic. The purpose of this paper is to ensure speech quality effectively. To enhance signal quality, the average noise spectrum is estimated from the noisy signal spectrum, and a spectral subtraction speech enhancement algorithm restores the spectral magnitude of the signal. Objective measures such as SNR and PESQ were calculated in simulations, and the results show improved speech quality. The TMS320C6713 digital signal processor was chosen as a suitable platform on which to implement the selected algorithm. The real-time implementation is carried out on a PC using MATLAB, while the embedded implementation runs on the Texas Instruments TMS320C6713 DSP using Spectrum Digital Incorporated's DSP Starter Kit (DSK6713).
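A basic magnitude spectral subtraction of the kind evaluated in the paper can be sketched in NumPy as follows. This is the textbook formulation with a spectral floor; the frame size, oversubtraction factor, floor, and test signal are illustrative assumptions, not the paper's parameters:

```python
import numpy as np

def spectral_subtraction(noisy, noise_est, frame=256, alpha=1.0, floor=0.02):
    """Magnitude spectral subtraction with overlap-add.
    noise_est: noise-only samples used to estimate the noise magnitude spectrum."""
    win = np.hanning(frame)
    hop = frame // 2
    # Average noise magnitude spectrum over noise-only frames
    nmags = [np.abs(np.fft.rfft(noise_est[i:i + frame] * win))
             for i in range(0, len(noise_est) - frame, hop)]
    noise_mag = np.mean(nmags, axis=0)
    out = np.zeros(len(noisy))
    for i in range(0, len(noisy) - frame, hop):
        X = np.fft.rfft(noisy[i:i + frame] * win)
        # Subtract the noise magnitude, keep a spectral floor, reuse the phase
        mag = np.maximum(np.abs(X) - alpha * noise_mag, floor * np.abs(X))
        out[i:i + frame] += np.fft.irfft(mag * np.exp(1j * np.angle(X)), frame)
    return out

fs = 8000
t = np.arange(fs) / fs
rng = np.random.default_rng(2)
clean = np.sin(2 * np.pi * 440 * t)
noise = 0.3 * rng.standard_normal(2 * fs)
noisy = clean + noise[fs:]                       # second half of noise corrupts the tone
enhanced = spectral_subtraction(noisy, noise[:fs])  # first half estimates the noise
```

The spectral floor is what keeps residual "musical noise" in check: subtracting the full noise estimate per bin would otherwise leave isolated spectral spikes.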

Journal ArticleDOI
TL;DR: A comparison of selected coding methods for the speech signal produced by an Electro Larynx (EL) device indicates that the PVWT and ACELP coders perform better than the other methods, achieving about 40 dB SNR and a PESQ score of 3 for EL speech, and 75 dB with a PESQ score of 3.5 for normal speech, respectively.
Abstract: Speech coding is a method of obtaining a compact representation of speech signals for efficient storage and transmission over band-limited wired or wireless channels. This is usually achieved with an acceptable representation and the fewest possible bits, without degrading perceptual quality. A number of speech coding methods have already been developed, and various algorithms for speech analysis and synthesis are in use. This paper compares selected coding methods for the speech signal produced by an Electro Larynx (EL) device, a device used by cancer patients whose vocal cords have been removed. The methods compared are Residual-Excited Linear Prediction (RELP), Code-Excited Linear Prediction (CELP), Algebraic Code-Excited Linear Prediction (ACELP), Phase Vocoders based on the Wavelet Transform (PVWT), Channel Vocoders based on the Wavelet Transform (CVWT), and a Phase Vocoder based on the Dual-Tree Rational-Dilation Complex Wavelet Transform (PVDT-RADWT). The aim is to select the best coding approach based on the quality of the reproduced speech. The test signals are speech recorded either directly from normal speakers or produced by an EL device. The performance of each method is evaluated using both objective measures and subjective listening tests. The results indicate that the PVWT and ACELP coders perform better than the other methods, achieving about 40 dB SNR and a PESQ score of 3 for EL speech, and 75 dB with a PESQ score of 3.5 for normal speech, respectively.

Journal ArticleDOI
TL;DR: This work reformulates speech coding as a multistage reinforcement learning problem with L step lookahead that incorporates exploration and exploitation to adapt model parameters and to control the speech analysis/synthesis process on a sample-by-sample basis.
Abstract: Speech coding is an essential technology for digital cellular communications, voice over IP, and video conferencing systems. For more than 25 years, the main approach to speech coding for these applications has been block-based analysis-by-synthesis linear predictive coding. An alternative approach that has been less successful is sample-by-sample tree coding of speech. We reformulate this latter approach as a multistage reinforcement learning problem with L step lookahead that incorporates exploration and exploitation to adapt model parameters and to control the speech analysis/synthesis process on a sample-by-sample basis. The minimization of the spectrally shaped reconstruction error to finite depth manages complexity and serves as an effective stand in for the overall subjective evaluation of reconstructed speech quality and intelligibility. Different control policies that attempt to persistently excite the system states and that encourage exploration are studied and evaluated. The resulting methods produce reconstructed speech quality competitive with the most popular speech codec utilized today. This new reinforcement learning formulation provides new insights and opens up new directions for system design and performance improvement.

Journal ArticleDOI
TL;DR: In this paper, a robust polynomial-decomposition-based linear prediction coding algorithm, PDLPC, was proposed for formant estimation, which can effectively eliminate merger peaks.
Abstract: This letter proposes a robust polynomial-decomposition-based linear prediction coding algorithm, PDLPC, for formant estimation, which can effectively eliminate merger peaks. PDLPC first combines LPC with statistical analysis to obtain screening criteria for peaks and suspicious merger peaks; it then uses the Cauchy integral formula to count the poles located in the fan-shaped area near a suspicious merger peak to determine whether a merger has occurred; finally, it uses the polynomial division algorithm to separate the merger peaks. Evaluations on Primi and VTR show that PDLPC is effective in separating merger peaks and is competitive with existing formant estimation algorithms.