
Showing papers on "Linear predictive coding published in 2020"


Journal ArticleDOI
TL;DR: This work approaches the problem of speaker recognition from severely degraded audio data by judiciously combining two commonly used features: Mel Frequency Cepstral Coefficients (MFCC) and Linear Predictive Coding (LPC), and concludes that MFCC and LPC capture two distinct aspects of speech, viz., speech perception and speech production.
Abstract: Speaker recognition algorithms are negatively impacted by the quality of the input speech signal. In this work, we approach the problem of speaker recognition from severely degraded audio data by judiciously combining two commonly used features: Mel Frequency Cepstral Coefficients (MFCC) and Linear Predictive Coding (LPC). Our hypothesis rests on the observation that MFCC and LPC capture two distinct aspects of speech, viz., speech perception and speech production. A carefully crafted 1D Triplet Convolutional Neural Network (1D-Triplet-CNN) is used to combine these two features in a novel manner, thereby enhancing the performance of speaker recognition in challenging scenarios. Extensive evaluation on multiple datasets, different types of audio degradations, multi-lingual speech, varying length of audio samples, etc. convey the efficacy of the proposed approach over existing speaker recognition methods, including those based on iVector and xVector.

104 citations
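
As a rough illustration of the two feature streams this paper fuses (the 1D-Triplet-CNN itself is not reproduced here), the sketch below extracts per-frame MFCC and LPC features with librosa; the sample rate, frame size, and LPC order are illustrative assumptions, not the paper's settings.

```python
# Sketch: extract the perceptual (MFCC) and production (LPC) feature
# streams from one utterance. All settings are illustrative assumptions.
import numpy as np
import librosa

def mfcc_and_lpc(path, n_mfcc=20, lpc_order=12):
    y, sr = librosa.load(path, sr=16000)
    # Perceptual stream: Mel Frequency Cepstral Coefficients, (frames, n_mfcc).
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T
    # Production stream: per-frame LPC coefficients (all-pole vocal tract model).
    frames = librosa.util.frame(y, frame_length=512, hop_length=256)
    lpc = np.stack([librosa.lpc(np.ascontiguousarray(f), order=lpc_order)[1:]
                    for f in frames.T])  # drop the leading 1 of each filter
    return mfcc, lpc
```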


Journal ArticleDOI
TL;DR: This work proposes a meta-heuristic feature selection (FS) method using a hybrid of the Golden Ratio Optimization (GRO) and Equilibrium Optimization (EO) algorithms, named the Golden Ratio based Equilibrium Optimization (GREO) algorithm.
Abstract: Speech is the most important medium through which human beings express emotion. Thus, understanding a person's emotion from his/her speech using the intelligence of computing devices has often been an area of interest. Traditional machine learning techniques are very popular for accomplishing such tasks. To provide a less computationally expensive model for emotion classification through speech analysis, we propose a meta-heuristic feature selection (FS) method using a hybrid of the Golden Ratio Optimization (GRO) and Equilibrium Optimization (EO) algorithms, which we have named the Golden Ratio based Equilibrium Optimization (GREO) algorithm. The optimally selected features are fed to the XGBoost classifier. Linear Predictive Coding (LPC) and Linear Prediction Cepstral Coefficients (LPCC) based features are considered as the input here, and these are optimized using the proposed GREO algorithm. We have achieved impressive recognition accuracies of 97.31% and 98.46% on two standard datasets, SAVEE and EmoDB respectively. The proposed FS model is also found to perform better than its constituent algorithms as well as many well-known optimization algorithms used for FS in the past. Source code of the present work is made available at: https://github.com/arijitdey1/Hybrid-GREO .

31 citations
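
For context, metaheuristic wrapper FS methods such as GREO optimize a fitness of roughly this shape. The sketch below is a generic stand-in (the actual GRO/EO update rules are in the authors' linked repository), assuming a precomputed LPC/LPCC feature matrix `X` and labels `y`; the penalty weight `alpha` is an illustrative choice.

```python
# Sketch: wrapper fitness for a binary feature mask, scored with the
# XGBoost classifier named in the abstract. Generic stand-in, not GREO.
import numpy as np
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

def fitness(mask, X, y, alpha=0.99):
    """Higher is better: cross-validated accuracy on the selected
    features, lightly penalized by the fraction of features kept."""
    idx = np.flatnonzero(mask)
    if idx.size == 0:
        return 0.0
    acc = cross_val_score(XGBClassifier(n_estimators=100),
                          X[:, idx], y, cv=5).mean()
    return alpha * acc + (1.0 - alpha) * (1.0 - idx.size / X.shape[1])
```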


Proceedings ArticleDOI
04 May 2020
TL;DR: In this paper, a collaborative quantization (CQ) scheme is proposed to jointly learn the codebook of LPC coefficients and the corresponding residuals, which achieves much higher quality than its predecessor at 9 kbps with even lower model complexity.
Abstract: Scalability and efficiency are desired in neural speech codecs, which should support a wide range of bitrates for applications on various devices. We propose a collaborative quantization (CQ) scheme to jointly learn the codebook of LPC coefficients and the corresponding residuals. CQ does not simply shoehorn LPC into a neural network, but bridges the computational capacity of advanced neural network models and traditional, yet efficient and domain-specific, digital signal processing methods in an integrated manner. We demonstrate that CQ achieves much higher quality than its predecessor at 9 kbps with even lower model complexity. We also show that CQ can scale up to 24 kbps, where it outperforms AMR-WB and Opus. As a neural waveform codec, CQ models have fewer than 1 million parameters, significantly fewer than many other generative models.

20 citations
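
The DSP side that CQ couples to a neural network is classical LPC analysis: a short-term all-pole filter plus a residual. A minimal sketch of that decomposition follows (the learned codebooks and quantizers of the paper are not reproduced); the LPC order is an illustrative assumption.

```python
# Sketch: per-frame LPC analysis/synthesis. The residual is what the
# CQ scheme quantizes jointly with the LPC coefficients.
import librosa
from scipy.signal import lfilter

def lpc_analyze(frame, order=16):
    a = librosa.lpc(frame, order=order)   # A(z) = 1 + a1 z^-1 + ... + ap z^-p
    residual = lfilter(a, [1.0], frame)   # inverse filtering: e[n] = A(z) x[n]
    return a, residual

def lpc_synthesize(a, residual):
    return lfilter([1.0], a, residual)    # all-pole synthesis recovers the frame
```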


Proceedings ArticleDOI
21 Sep 2020
TL;DR: It is shown that the SBL-based method has advantages in signal-to-interference-plus-noise ratio (SINR) and target peak detection in comparison to conventional CS or classical signal reconstruction algorithms like linear predictive coding (LPC).
Abstract: Automotive radar plays an important role in advanced driver assistance systems to support level-2 automated driving functions. However, the mutual interference between automotive radars increases due to the rising density of radars on the road. Therefore, the radar signal will be distorted to some extent and the performance of radars will degrade if no countermeasures are taken. In this paper, an interference mitigation approach using compressive sensing (CS) and Bayesian learning is introduced. By utilizing the sparsity of the beat signal in the frequency domain, the range-Doppler (RD) spectrum can be reconstructed with the help of undistorted samples in the beat signal. The sparse Bayesian learning (SBL) method is used to estimate the posterior of the signal's sparse representation and to infer the maximally sparse representation by using the Expectation-Maximization (EM) algorithm. It is shown that the SBL-based method has advantages in signal-to-interference-plus-noise ratio (SINR) and target peak detection in comparison to conventional CS or classical signal reconstruction algorithms like linear predictive coding (LPC).

15 citations
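
The classical LPC baseline named at the end of the abstract repairs interference-distorted samples by linear prediction from clean ones. A minimal sketch, assuming a real-valued beat signal (a complex signal would be handled per I/Q component), a distorted burst that does not start at the very beginning, and that the distorted-sample mask `bad` has been determined elsewhere:

```python
# Sketch: fill interference-distorted samples by forward LPC prediction
# from the clean leading segment.
import numpy as np
import librosa

def lpc_fill(x, bad, order=8):
    x = np.asarray(x, dtype=float).copy()
    first_bad = int(np.flatnonzero(bad)[0])
    a = librosa.lpc(x[:first_bad], order=order)      # fit on clean samples
    for n in np.flatnonzero(bad):
        x[n] = -np.dot(a[1:], x[n - order:n][::-1])  # x_hat[n] = -sum a_k x[n-k]
    return x
```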


Journal ArticleDOI
TL;DR: Simulations of the proposed dynamic predictive block-adaptive quantization (DP-BAQ) are carried out considering a Tandem-L-like staggered SAR system for different orders of prediction and target scenarios, demonstrating that a significant data reduction can be achieved with a modest increase of the system complexity.
Abstract: Staggered synthetic aperture radar (SAR) is an innovative SAR acquisition concept which exploits digital beamforming (DBF) in elevation to form multiple receive beams, together with continuous variation of the pulse repetition interval (PRI), to achieve high-resolution imaging of a wide continuous swath. Staggered SAR requires an azimuth oversampling higher than that of a SAR with constant PRI, which results in an increased volume of data. In this article, we investigate the use of linear predictive coding, which exploits the correlation properties exhibited by the nonuniform azimuth raw data stream. In this scheme, the prediction of each sample is calculated onboard as a linear combination of a set of previous samples. The resulting prediction error is then quantized and downlinked (instead of the original value), which allows for a reduction of the signal entropy and, in turn, of the onboard data rate required for a given target performance. In addition, the a priori knowledge of the gap positions can be exploited to dynamically adapt the bit rate allocation and the prediction order to further improve the performance. Simulations of the proposed dynamic predictive block-adaptive quantization (DP-BAQ) are carried out considering a Tandem-L-like staggered SAR system for different orders of prediction and target scenarios, demonstrating that a significant data reduction can be achieved with a modest increase in system complexity.

13 citations
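
The core predictive-quantization loop described above (predict each sample from previous samples, quantize and downlink only the prediction error) looks roughly like the DPCM-style sketch below; the fixed prediction weights and uniform quantizer are simplifying stand-ins for DP-BAQ's adaptive bit allocation and prediction order.

```python
# Sketch: closed-loop predictive quantization. Only the integer codes
# would be downlinked; the decoder runs the identical loop.
import numpy as np

def predictive_encode(x, w, step):
    """w: prediction weights (most recent sample first); step: quantizer step."""
    p = len(w)
    history = np.zeros(p)             # decoder-visible reconstructed samples
    codes, recon = [], []
    for sample in x:
        pred = np.dot(w, history)                     # predict from quantized past
        code = int(np.round((sample - pred) / step))  # transmitted prediction error
        codes.append(code)
        value = pred + code * step
        recon.append(value)
        history = np.concatenate(([value], history[:-1]))
    return codes, np.array(recon)
```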


Posted Content
Qiao Tian, Zewang Zhang, Heng Lu, Ling-Hui Chen, Shan Liu
TL;DR: The FeatherWave, yet another variant of the WaveRNN vocoder combining multi-band signal processing and linear predictive coding, is proposed, which can significantly improve the efficiency of speech synthesis.
Abstract: In this paper, we propose the FeatherWave, yet another variant of the WaveRNN vocoder, combining multi-band signal processing and linear predictive coding. LPCNet, a recently proposed neural vocoder which utilizes the linear predictive characteristics of the speech signal in the WaveRNN architecture, can generate high-quality speech faster than real time on a single CPU core. However, LPCNet is still not efficient enough for online speech generation tasks. To address this issue, we adopt multi-band linear predictive coding for the WaveRNN vocoder. The multi-band method enables the model to generate several speech samples in parallel at each step and can therefore significantly improve the efficiency of speech synthesis. The proposed model with 4 sub-bands needs less than 1.6 GFLOPS for speech generation. In our experiments, it can generate 24 kHz high-fidelity audio 9x faster than real time on a single CPU, which is much faster than the LPCNet vocoder. Furthermore, our subjective listening test shows that the FeatherWave can generate speech with better quality than LPCNet.

12 citations


Posted Content
TL;DR: This work proposes CLCNet, a framework based on complex-valued linear coding motivated by linear predictive coding and applied in the complex frequency domain; it also defines a parametric normalization for complex-valued spectrograms that complies with low-latency and online processing.
Abstract: Noise reduction is an important part of modern hearing aids and is included in most commercially available devices. Deep learning-based state-of-the-art algorithms, however, either do not consider real-time and frequency resolution constraints or result in poor quality under very noisy conditions. To improve monaural speech enhancement in noisy environments, we propose CLCNet, a framework based on complex-valued linear coding. First, we define complex linear coding (CLC), motivated by linear predictive coding (LPC), that is applied in the complex frequency domain. Second, we propose a framework that incorporates complex spectrogram input and coefficient output. Third, we define a parametric normalization for complex-valued spectrograms that complies with low-latency and online processing. Our CLCNet was evaluated on a mixture of the EUROM database and a real-world noise dataset recorded with hearing aids, and compared to traditional real-valued Wiener-filter gains.

11 citations
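
A minimal sketch of one plausible form of complex linear coding: per frequency bin, the output frame is a complex-weighted sum of the current and previous noisy STFT frames. Here the coefficients are given as an input; in CLCNet they would be produced by the network, and the filter depth `N` is an illustrative assumption.

```python
# Sketch: apply complex linear coding coefficients to a complex STFT.
import numpy as np

def apply_clc(X, C):
    """X: complex spectrogram, shape (freq, time).
    C: complex coefficients, shape (N, freq, time); C[k] weights frame t-k."""
    N, _, T = C.shape
    Xpad = np.pad(X, ((0, 0), (N - 1, 0)))   # zero history before frame 0
    Y = np.zeros_like(X)
    for k in range(N):
        Y += C[k] * Xpad[:, N - 1 - k : N - 1 - k + T]
    return Y
```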


Journal ArticleDOI
TL;DR: A new lossy compression approach using Bayesian predictive coding (BPC), which is as efficient as linear predictive coding when handling independent signals that follow a stationary probability distribution, and is more robust to occasionally erroneous or missing sensor data.
Abstract: Wireless sensor networks (WSNs) generate a variety of continuous data streams. To reduce data storage and transmission cost, compression is recommended to be applied to the data streams from every single sensor node. Local compression falls into two categories: lossless and lossy. Lossy compression techniques are generally preferable to lossless ones for sensors in commercial nodes, as they provide a better compression ratio at a lower computational cost. However, the traditional approaches for data compression in WSNs are sensitive to sensor accuracy. They are less efficient when there are abnormal or faulty measurements or missing data. This paper proposes a new lossy compression approach using Bayesian predictive coding (BPC). Instead of the original signals, predictive coding transmits to the receiving node the error terms, which are calculated by subtracting the predicted signals from the actual signals. Its compression performance depends on the accuracy of the adopted prediction technique. BPC combines Bayesian inference with predictive coding. Prediction is made by Bayesian inference instead of regression models as in traditional predictive coding. In this way, it can utilize prior information and provide inferences that are conditional on the data, without reliance on asymptotic approximation. Experimental tests show that the BPC is as efficient as linear predictive coding when handling independent signals that follow a stationary probability distribution. Moreover, the BPC is more robust to occasionally erroneous or missing sensor data. The proposed approach is based on physical knowledge of the phenomenon in applications. It can be considered a complementary approach to the existing lossy compression family for WSNs.

11 citations
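
A minimal sketch of the transmit-the-error-term idea with a Bayesian predictor: here a scalar Gaussian random-walk model whose posterior mean is the prediction, so only innovations leave the node. The paper's full BPC formulation builds its prior from physical knowledge of the sensed phenomenon, which is not modeled here; `q` and `r` are illustrative prior choices.

```python
# Sketch: Bayesian predictive coding for one sensor stream. The decoder
# can rerun the identical recursion, so only `errors` must be sent.
import numpy as np

def bpc_encode(x, q=0.01, r=0.1):
    """q: process (drift) variance, r: measurement variance."""
    mean, var = 0.0, 1.0               # prior over the underlying signal
    errors = []
    for z in x:
        errors.append(z - mean)        # innovation = transmitted error term
        var += q                       # predict step of the random-walk model
        gain = var / (var + r)         # Bayesian (Kalman) update
        mean += gain * (z - mean)
        var *= 1.0 - gain
    return np.array(errors)
```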


Proceedings ArticleDOI
Qiao Tian, Zewang Zhang, Heng Lu, Ling-Hui Chen, Shan Liu
25 Oct 2020
TL;DR: In this paper, the authors proposed FeatherWave, another variant of the WaveRNN vocoder combining multi-band signal processing and the linear predictive coding used in LPCNet, which can generate high-quality speech faster than real time on a single CPU core.
Abstract: In this paper, we propose the FeatherWave, yet another variant of the WaveRNN vocoder, combining multi-band signal processing and linear predictive coding. LPCNet, a recently proposed neural vocoder which utilizes the linear predictive characteristics of the speech signal in the WaveRNN architecture, can generate high-quality speech faster than real time on a single CPU core. However, LPCNet is still not efficient enough for online speech generation tasks. To address this issue, we adopt multi-band linear predictive coding for the WaveRNN vocoder. The multi-band method enables the model to generate several speech samples in parallel at each step and can therefore significantly improve the efficiency of speech synthesis. The proposed model with 4 sub-bands needs less than 1.6 GFLOPS for speech generation. In our experiments, it can generate 24 kHz high-fidelity audio 9x faster than real time on a single CPU, which is much faster than the LPCNet vocoder. Furthermore, our subjective listening test shows that the FeatherWave can generate speech with better quality than LPCNet.

10 citations


Journal ArticleDOI
TL;DR: The experimental results prove the feasibility of the proposed artificial intelligence model, which combines a denoising autoencoder with generative adversarial networks and generates voices with similar semantics through random input from the latent space of the generator.
Abstract: Linear predictive coding is an extremely effective voice generation method that operates through a simple process. However, linear predictive coding–generated voices have limited variations and exhib...

9 citations


Proceedings ArticleDOI
01 Jun 2020
TL;DR: Fusion of MFCC, LPC and PLP features greatly enhances the speaker recognition system's performance; the proposed method gives the best result of 100% accuracy for the speaker identification system and a 0 equal error rate (EER) for the speaker verification system.
Abstract: Speaker recognition is the task of identifying/verifying an individual's identity with the help of an input voice sample. Speaker recognition is further classified into speaker identification (SI) and speaker verification (SV). Unlike other approaches using the ELSDSR speech database, which focus only on speaker identification performance, the proposed work also reports speaker verification performance along with SI performance. The main goal of this research is to find the best model for a speaker identification and speaker verification system on a clean speech database. A comparative study is performed across various combinations of features for the speaker identification and speaker verification systems, with a Feedforward Artificial Neural Network (FFANN) and a Support Vector Machine (SVM) as classification techniques, using the ELSDSR voice database. The features used are Linear Predictive Coding (LPC), Mel Frequency Cepstral Coefficients (MFCC) and Perceptual Linear Prediction (PLP). All features are tested separately and in fusion with each other, with the FFANN and SVM classifiers, in MATLAB. The proposed model's results are also compared with some well-known techniques using the same ELSDSR database for speaker identification. By comparing the experimental results of the proposed model with others, it is observed that fusing different features gives better results, and speaker identification accuracy increases by 3%-5% compared with a single feature's result. In addition, the proposed method gives the best result of 100% accuracy for the speaker identification system and a 0 equal error rate (EER) for the speaker verification system when the fusion of MFCC, LPC and PLP features is used with the ANN and SVM classifiers. Therefore, it can be said that the fusion of MFCC, LPC and PLP features greatly enhances the speaker recognition system's performance.

Proceedings ArticleDOI
01 May 2020
TL;DR: In this paper, the authors proposed a framework based on complex valued linear coding (CLC) motivated by linear predictive coding (LPC) that is applied in the complex frequency domain.
Abstract: Noise reduction is an important part of modern hearing aids and is included in most commercially available devices. Deep learning-based state-of-the-art algorithms, however, either do not consider real-time and frequency resolution constraints or result in poor quality under very noisy conditions. To improve monaural speech enhancement in noisy environments, we propose CLCNet, a framework based on complex-valued linear coding. First, we define complex linear coding (CLC), motivated by linear predictive coding (LPC), that is applied in the complex frequency domain. Second, we propose a framework that incorporates complex spectrogram input and coefficient output. Third, we define a parametric normalization for complex-valued spectrograms that complies with low-latency and online processing. Our CLCNet was evaluated on a mixture of the EUROM database and a real-world noise dataset recorded with hearing aids, and compared to traditional real-valued Wiener-filter gains.

Posted Content
TL;DR: This paper proposes that concatenating features extracted using different existing feature extraction methods can not only boost classification accuracy but also expand the possibility of efficient feature selection in speech emotion analysis.
Abstract: Emotion recognition from audio signals has been regarded as a challenging task in signal processing, as it can be considered a collection of static and dynamic classification tasks. Recognition of emotions from speech data has relied heavily on end-to-end feature extraction and classification using machine learning models, though the absence of feature selection and optimization has restrained the performance of these methods. Recent studies have shown that Mel Frequency Cepstral Coefficients (MFCC) have emerged as one of the most relied-upon feature extraction methods, though MFCC alone circumscribes classification accuracy with its very small feature dimension. In this paper, we propose that the concatenation of features extracted using different existing feature extraction methods can not only boost classification accuracy but also expand the possibility of efficient feature selection. We have used Linear Predictive Coding (LPC) alongside the MFCC feature extraction method before feature merging. Besides, we have performed a novel application of Manta Ray optimization to speech emotion recognition tasks, which yielded a state-of-the-art result in this field. We have evaluated the performance of our model using SAVEE and Emo-DB, two publicly available datasets. Our proposed method outperformed the existing methods in speech emotion analysis on these two datasets, with classification accuracies of 97.06% and 97.68%, respectively.

Journal ArticleDOI
01 Jan 2020
TL;DR: In this article, the authors investigated the sources of variability in vowel formant estimation, a major analytic activity in sociophonetics, by reviewing the outcomes of two simulations that manipulated the settings used for linear predictive coding (LPC)-based vowel estimation.
Abstract: This paper contributes insight into the sources of variability in vowel formant estimation, a major analytic activity in sociophonetics, by reviewing the outcomes of two simulations that manipulated the settings used for linear predictive coding (LPC)-based vowel formant estimation. Simulation 1 explores the range of frequency differences obtained when minor adjustments are made to LPC settings and measurement timepoints around the settings used by trained analysts, in order to determine the range of variability that should be expected in sociophonetic vowel studies. Simulation 2 examines the variability that emerges when LPC settings are varied combinatorially around constant default settings, rather than settings set by trained analysts. The impacts of different LPC settings are discussed as a way of demonstrating the inherent properties of LPC-based formant estimation. This work suggests that differences more fine-grained than about 10 Hz in F1 and 15–20 Hz in F2 are within the range of LPC-based formant estimation variability.
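
In its simplest form, the LPC-based formant estimation under study reduces to picking the angles of the LPC polynomial roots; settings such as the LPC order (and the pre-analysis sample rate) are exactly the knobs the simulations vary. A minimal sketch follows; bandwidth-based root filtering, which practical tools also apply, is omitted.

```python
# Sketch: candidate formant frequencies from the roots of the LPC
# polynomial fitted to one analysis frame.
import numpy as np
import librosa

def lpc_formants(frame, sr, order=12):
    a = librosa.lpc(np.asarray(frame, dtype=float), order=order)
    roots = [r for r in np.roots(a) if np.imag(r) > 0]  # one of each conjugate pair
    return sorted(np.angle(r) * sr / (2 * np.pi) for r in roots)  # Hz; lowest ~ F1
```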

Journal ArticleDOI
TL;DR: The results strongly prove the usefulness of the proposed multivariate feature selection algorithm when compared with the filter approach.
Abstract: Speech signals convey a speaker's neurodevelopmental state along with phonological information. Recognizing a speech disorder by analyzing speech is essential for human–machine interaction. To develop a subject-independent speech recognition system for neurodevelopmental disorders, voice features identified with the MATLAB toolbox, spectral characteristics and feature selection algorithms are proposed in this paper. Feature selection is applied to overcome the challenges of dimensionality in various applications. This work presents a novel particle swarm optimization (PSO) based algorithm for feature selection. The experiments were conducted using a speech database of children with intellectual disability together with age-matched typically developing children, and reliability was validated using a 10-fold cross-validation technique. The database consists of 141 speech features extracted from linear predictive coding (LPC) based cepstral parameters and Mel-frequency cepstral coefficients (MFCC). Three classification models were applied, obtaining recognition accuracies of 90.30% with ANN, 98.00% with SVM and 91.00% with random forest using the PSO feature selection algorithm. The results strongly prove the usefulness of the proposed multivariate feature selection algorithm when compared with the filter approach.

Proceedings ArticleDOI
01 Jan 2020
TL;DR: In this paper, the authors used linear predictive coding coefficients (LPCs) to detect specific language impairment (SLI) from children's voices, a diagnosis traditionally made through behavioral analysis and age-specific language test scores.
Abstract: Specific language impairment (SLI) is a kind of neurodevelopmental disorder which can disturb the speech production and language comprehension ability of a child. The detection/prediction of SLI in children is done through behavioral analysis and age-specific language test scores by speech therapists and child psychologists. Modern computational approaches may cut down the long prediction time needed to diagnose the disorder and help the child as well as the parents to start treatment early. The present work gives an idea for the detection of SLI from children's voices with the help of linear predictive coding coefficients (LPCs), as they have the ability to model the human vocal system. Two classifiers, naive Bayes (NB) and support vector machine (SVM), were employed to categorize children into SLI and healthy groups. The best classification accuracies of 97.9% (for the top-20 LPC features) and 97.8% (for the top-10 LPC features) were secured by the NB classifier under a 5-fold cross-validation protocol. The study concluded that LPC parameters play a crucial role in the detection of SLI and could be applied to uncover other neurodevelopmental disorders via children's speech signals.
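
The reported pipeline (top-k LPC features into a naive Bayes classifier under 5-fold cross-validation) can be sketched as below. The feature matrix `X` (one row per child) is assumed precomputed, and the univariate ranking used here is a generic stand-in for the paper's feature-ranking step.

```python
# Sketch: naive Bayes on the top-k LPC-derived features, 5-fold CV.
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline

def sli_cv_accuracy(X, y, k=20):
    model = make_pipeline(SelectKBest(f_classif, k=k), GaussianNB())
    return cross_val_score(model, X, y, cv=5).mean()
```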

Journal ArticleDOI
TL;DR: In the proposed technique, roots of the linear predictive polynomial are used to detect oscillations, and the spectrum of the cross-correlation between the linear predictive residual and the original data is used for quantification.

Journal ArticleDOI
TL;DR: The experimental outcomes show that the proposed technique can be used to help speech pathologists in estimating intellectual disability at early ages.
Abstract: Classification of intellectually disabled children through manual assessment of speech at an early age is inconsistent, subjective, time-consuming and prone to error. This study attempts to classif...

Proceedings ArticleDOI
23 Oct 2020
TL;DR: In this paper, a speaker recognition algorithm based on Long Short-Term Memory Networks (LSTM) was proposed, which uses a method that combines Linear Predictive Coding (LPC) with Log-Mel spectrum, and the d-vector output through the LSTM network is classified using the Softmax loss function.
Abstract: Speaker recognition is the process of identifying a speaker's identity by extracting acoustic features from the speaker's audio. It mainly involves perceiving and modeling the speaker's vocal tract information and the human ear's auditory information, and it has profound significance in daily life and in military applications. To address this task, this paper proposes a speaker recognition algorithm based on Long Short-Term Memory networks (LSTM). The features are obtained with a method that combines Linear Predictive Coding (LPC) with the Log-Mel spectrum. The d-vector output of the LSTM network is classified using the Softmax loss function. This method is applied to the VCTK audio dataset. The experimental results show that the recognition rate of this method reaches 94.9%.
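
A minimal Keras sketch of the described pipeline: per-frame features (LPC concatenated with the log-Mel spectrum) feed an LSTM whose final hidden state acts as the d-vector, trained with a softmax over speakers. The layer sizes are illustrative assumptions, not the paper's configuration.

```python
# Sketch: LSTM d-vector model with softmax speaker classification.
import tensorflow as tf

def build_dvector_model(n_frames, feat_dim, n_speakers, d=256):
    inputs = tf.keras.Input(shape=(n_frames, feat_dim))  # LPC + log-Mel per frame
    dvector = tf.keras.layers.LSTM(d)(inputs)            # last state = d-vector
    outputs = tf.keras.layers.Dense(n_speakers, activation="softmax")(dvector)
    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```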

Journal ArticleDOI
TL;DR: A variable-bit-rate speech codec based on mixed excitation linear prediction enhanced (MELPe) with an average bit rate of 2 kbps and a better representation of the excitation signal is proposed; the incorporation of Mel-LPC gives better performance in the estimation of formants and GCIs.
Abstract: In this paper, we propose a variable-bit-rate speech codec based on mixed excitation linear prediction enhanced (MELPe) with an average bit rate of 2 kbps and a better representation of the excitation signal. The order of the prediction filter in the MELPe coding architecture is reduced from 10 to 7 without affecting the perceptual quality of the decoded speech by using the psychoacoustic Mel scale. An efficient two-split vector quantization is developed with a weighted Euclidean distance measure for Mel scale-based linear predictive coding (Mel-LPC), and it requires only 18 bits/frame. The instantaneous pitch or epoch that is vital for many speech processing applications is preserved in this codec by including it in the excitation signal used for reconstructing the voiced speech. The quantization scheme developed for glottal closure instants (GCIs) increases the bit requirement for voiced frames by 4–25 bits depending on the position of the GCIs. To compensate for that, the Mel-LPC order for both silence and unvoiced frames has been brought down to 4 without compromising the perceptual quality of the reconstructed speech. The lowered bit budget for an unvoiced frame is 41 bits/frame, and for silence, it is 31 bits/frame. A further reduction of 10 bits for silence frames is obtained by reducing the number of transmitted parameters and by tuning the quantization bit requirement for each. For categorizing the speech frames at the entry of the encoder, a neural network-based voiced/unvoiced/silence classification algorithm using a five-dimensional feature set is created. The experimental results show that the proposed coding scheme operates at an average bit rate of 2 kbps, which is less than the bit rate of MELPe (2.4 kbps), but with a better perceptual score. In addition, the incorporation of Mel-LPC gives better performance in the estimation of formants and GCIs.

Journal ArticleDOI
31 Jul 2020
TL;DR: The use of the linear predictive coding coefficients along with the Elman neural network has led to higher recognition accuracy and improved the speech recognition system.
Abstract: Today, the requirement for automatic intelligent systems has drawn increasing attention to modern interactive techniques between human beings and machines. These techniques generally fall into two types: audio and visual methods. The need to develop algorithms that enable human speech recognition by machines is of high importance and is frequently studied by researchers. Using artificial intelligence methods has led to better results in human speech recognition, but the basic problem is the lack of an appropriate strategy for selecting the recognition data from the huge amount of speech information, which practically makes it impossible for the available algorithms to work. In this article, to solve this problem, the linear predictive coding coefficient extraction method is used to summarize the data related to the pronunciation of English digits. After extraction, the data are fed to an Elman neural network to learn the relation between the linear coding coefficients of an audio file and the pronounced digit. The results show that this method performs well compared to other methods. According to the experiments, the obtained network training results (99% recognition accuracy) indicate that the network still performs better than RBF despite many errors. The experiments showed that the Elman memory neural network has an acceptable performance in recognizing the speech signal compared to the other algorithms. The use of the linear predictive coding coefficients along with the Elman neural network has led to higher recognition accuracy and an improved speech recognition system.

Journal ArticleDOI
TL;DR: This paper integrates a simple non-parallel voice conversion (VC) system with a WaveNet (WN) vocoder and a proposed collapsed speech suppression technique and proposes a collapsed speech segment detector (CSSD) to mitigate the negative effects introduced by the LPCDC.
Abstract: In this paper, we integrate a simple non-parallel voice conversion (VC) system with a WaveNet (WN) vocoder and a proposed collapsed speech suppression technique. The effectiveness of WN as a vocoder for generating high-fidelity speech waveforms on the basis of acoustic features has been confirmed in recent works. However, when combining the WN vocoder with a VC system, the distorted acoustic features, acoustic and temporal mismatches, and exposure bias usually lead to significant speech quality degradation, making WN generate some very noisy speech segments called collapsed speech. To tackle the problem, we take conventional-vocoder-generated speech as the reference speech to derive a linear predictive coding distribution constraint (LPCDC) to avoid the collapsed speech problem. Furthermore, to mitigate the negative effects introduced by the LPCDC, we propose a collapsed speech segment detector (CSSD) to ensure that the LPCDC is only applied to the problematic segments to limit the loss of quality to short periods. Objective and subjective evaluations are conducted, and the experimental results confirm the effectiveness of the proposed method, which further improves the speech quality of our previous non-parallel VC system submitted to Voice Conversion Challenge 2018.

Journal ArticleDOI
TL;DR: In this work a new Bangla speech corpus along with proper transcriptions has been developed; various acoustic feature extraction methods have also been investigated using a Long Short-Term Memory (LSTM) neural network to find their effective integration into a state-of-the-art Bangla speech recognition system.
Abstract: In this work a new Bangla speech corpus along with proper transcriptions has been developed; various acoustic feature extraction methods have also been investigated using a Long Short-Term Memory (LSTM) neural network to find their effective integration into a state-of-the-art Bangla speech recognition system. The acoustic features are usually a sequence of representative vectors that are extracted from speech signals, and the classes are either words or sub-word units such as phonemes. The most commonly used feature extraction method, known as linear predictive coding (LPC), has been used first in this work. Then the other two popular methods, namely the Mel frequency cepstral coefficients (MFCC) and perceptual linear prediction (PLP), have also been applied. These methods are based on models of the human auditory system. A detailed review of the implementation of these methods has been presented first. Then the steps of the implementation have been elaborated for the development of an automatic speech recognition (ASR) system for Bangla speech.

Journal ArticleDOI
TL;DR: The objective of SI is to achieve the security needed for the remote access system, and this security can be increased using coding and compression processes.
Abstract: This paper investigates the effect of both decoding and decompression on Speaker Identification (SI) in a remote access system. The coding and compression processes are used for communication purposes as a normal action taken for voice communication over the Internet or mobile networks. In the proposed system, the speech signal is coded with the Linear Predictive Coding (LPC) technique. Also, the speech signal is compressed using two techniques. The first technique depends on a decimation process to compress the signal. The signal can be recovered using inverse solutions; these include maximum entropy and regularized reconstruction. The second technique is Compressive Sensing (CS), where the speech signal can be reconstructed using linear programming. The coded or compressed speech signal is transmitted to the receiver via a wireless communication channel. At the receiver, the received signal is decoded or decompressed, and then SI is performed on the decoded or decompressed speech signal. The performance of the coding and compression techniques is evaluated using metrics such as Perceptual Evaluation of Speech Quality (PESQ) and Dynamic Time Warping (DTW). The objective of SI is to achieve the security needed for the remote access system, and this security can be increased using coding and compression processes. In the SI system, the feature vectors are captured from different discrete transforms such as the Discrete Wavelet Transform (DWT), Discrete Cosine Transform (DCT), and Discrete Sine Transform (DST), besides the time domain. The recognition rate for all transforms is computed to evaluate the performance of the SI system.

Journal ArticleDOI
TL;DR: A fast computation algorithm based on the Bairstow method for computing LSP frequencies from linear prediction coefficients is proposed in this paper; it can extract the polynomial roots efficiently and accurately with significantly reduced computation complexity.
Abstract: Linear prediction is the kernel technology in speech processing. It has been widely applied in speech recognition, synthesis, and coding, and can efficiently and correctly represent the speech frequency spectrum with only a few parameters. Line Spectrum Pair (LSP) frequencies, as an alternative representation of Linear Predictive Coding (LPC), have the advantages of good quantization accuracy and low spectral sensitivity. However, computing the LSP frequencies takes a long time. To address this issue, a fast computation algorithm, based on the Bairstow method, for computing LSP frequencies from linear prediction coefficients is proposed in this paper. The algorithm first transforms the symmetric and antisymmetric polynomials to general polynomials, then extracts the polynomial roots. Exploiting the short-term stationarity of the speech signal, an adaptive initialization method is applied to reduce the average number of iterations by 26% compared to static initialization, with a Perceptual Evaluation of Speech Quality (PESQ) score reaching 3.46. Experimental results show that the proposed method can extract the polynomial roots efficiently and accurately with significantly reduced computation complexity. Compared to previous works, the proposed method is 17 times faster than the Tschirnhaus transform, and has a 22% PESQ improvement over the Birge-Vieta method with an almost comparable computation time.
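
For reference, the LPC-to-LSP conversion being accelerated forms the symmetric polynomial P(z) = A(z) + z^-(p+1) A(z^-1) and the antisymmetric Q(z) = A(z) - z^-(p+1) A(z^-1), whose unit-circle root angles interlace as the LSP frequencies. The sketch below uses numpy's eigenvalue-based np.roots in place of the Bairstow iteration the paper optimizes.

```python
# Sketch: LSP frequencies (radians in (0, pi)) from LPC coefficients.
import numpy as np

def lpc_to_lsp(a):
    """a: LPC coefficients [1, a1, ..., ap] of A(z)."""
    sym  = np.concatenate((a, [0.0])) + np.concatenate(([0.0], a[::-1]))  # P(z)
    asym = np.concatenate((a, [0.0])) - np.concatenate(([0.0], a[::-1]))  # Q(z)
    angles = []
    for poly in (sym, asym):
        roots = np.roots(poly)
        angles += [np.angle(r) for r in roots if np.imag(r) > 0]  # skip z = +/-1
    return np.sort(angles)   # interlaced line spectral frequencies
```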

Proceedings ArticleDOI
TL;DR: An improved model that uses both linear predictive coding (LPC) and line spectral frequency (LSF) coefficients to parametrize the source speech signal was developed in this work to reveal the effect of over-smoothing.
Abstract: The research presents a voice conversion model using coefficient mapping and a neural network. Most previous works on parametric speech synthesis did not account for losses in spectral detail, causing over-smoothing and, invariably, an appreciable deviation of the converted speech from the target speaker. An improved model that uses both linear predictive coding (LPC) and line spectral frequency (LSF) coefficients to parametrize the source speech signal was developed in this work to reveal the effect of over-smoothing. The non-linear mapping ability of a neural network was employed to map the source speech vectors into the acoustic vector space of the target. Training LPC coefficients with a neural network yielded poor results due to the instability of the LPC filter poles, so the LPC coefficients were converted to line spectral frequency coefficients before being trained with a 3-layer neural network. The algorithm was tested on noisy data, with the result evaluated using the Mel-Cepstral Distance measurement. Cepstral distance evaluation shows a 35.7 percent reduction in the spectral distance between the target and the converted speech.

Proceedings ArticleDOI
01 May 2020
TL;DR: Experimental results indicate that the CS method achieves significant performance improvements with respect to the aforementioned methods, namely code-excited linear prediction and the Lloyd-Max quantization algorithm.
Abstract: Speech coding is an essential procedure in the public switched telephone network (PSTN), digital cellular communications, videoconferencing systems, and emerging voice over Internet applications. Compressed sensing is an original signal processing tool for efficiently acquiring and reconstructing a signal by exploiting its compressibility. In this paper, compressive sensing is employed for speech coding. In particular, in order to demonstrate its efficiency in speech coding, we propose a comparative study between this method and existing optimal methods, namely code-excited linear prediction and the Lloyd-Max quantization algorithm. Experimental results indicate that the CS method achieves significant performance improvements with respect to the aforementioned methods.

22 Jan 2020
TL;DR: This work investigated optimization techniques for a voice pathology and voice disorder classification system, surveying acoustic analysis for human voice disorder classification using optimization and machine learning techniques, with emphasis on pathology detection and its classification.
Abstract: Voice diseases are increasing dramatically due to unhealthy social habits and voice abuse. In this work, we investigated optimization techniques for a voice pathology and voice disorder classification system. The literature review presents a survey of acoustic analysis for human voice disorder classification using optimization and machine learning techniques, with emphasis on pathology detection and its classification. The input voice signal is given to numerous classification methods, such that the classification distinguishes pathological voices from normal voices, with relevance to male and female voice classification. The foremost goal of this work is to provide a complete study of the most popular machine learning techniques: noise removal, silence removal and different filters in preprocessing; the feature extraction techniques, namely acoustic features (signal energy, pitch, formants, jitter, shimmer), reflection coefficients, autocorrelation, Linear Predictive Coding (LPC), Mel-frequency cepstral coefficients (MFCC), zero crossing with peak amplitude (ZCPA), dynamic time warping (DTW) and relative spectral processing (RASTA); and the classification of the input voice signal with the help of classification algorithms such as Support Vector Machine (SVM) and Back Propagation Neural Network (BPNN).

Book ChapterDOI
01 Jan 2020
TL;DR: In this article, a review of the related research work in the field of feature extraction methodologies viz MFCC (Mel Frequency Cepstral Coefficient), LPC (Linear Predictive Coding), Wavelet, DWT (Discrete Wavelet Transform) and PLP (Perceptual Linear Predictive) etc.
Abstract: Current technology development in the fields of artificial intelligence and IoT has increased the importance of research in speech processing. Researchers are emphasizing speech processing and its applications due to the increased acceptance of technology based on AI and IoT. The natural voice or speech signal needs to be digitized for use in processing and feature extraction. The speech signal consists of scads of information, categorized broadly as gender based, voice-characteristics based, emotion based, speaker based, etc. Recognizing the importance of feature extraction and classification for speech processing in various applications, significant research has been carried out on various methodologies related to diversified applications. This manuscript attempts to study and review the related research work in the field of feature extraction methodologies, viz. MFCC (Mel Frequency Cepstral Coefficient), LPC (Linear Predictive Coding), Wavelet, DWT (Discrete Wavelet Transform) and PLP (Perceptual Linear Predictive). Researchers have also given importance to classifiers like SVM (Support Vector Machine), ANN (Artificial Neural Network), GMM (Gaussian Mixture Model) and HMM (Hidden Markov Model). A comparison of these classifiers is presented in this review. The prime objective of this review paper is to observe the relationship between the variance of speech parameters, feature extraction methodologies and classifiers. The endeavor of this review is to establish comparative observations which shall help budding researchers in selecting a feature extraction technique as well as a classifier for various speech processing applications, considering their specific advantages and disadvantages.

Proceedings ArticleDOI
05 Nov 2020
TL;DR: In this article, a speech recognition system for Marathi Ank (Numbers) was proposed, where feature extraction and feature matching technique plays a vital role for speech recognition and here LPC (Linear Predictive Coding) is used for extracting features from samples, whereas ANN (Artificial neural network) was used to classify them.
Abstract: Speech recognition is gaining an increasing research interest in the last five decades. Speech processing is considered as an interdisciplinary branch of electronics and computer science domain. It considers speech as an input and converts it into the corresponding text. This paper describes the design and development of Marathi Numeric Speech Dataset. Marathi Numbers (Ank) ranging from Shunya(0) to Nau(9) and are taken into consideration for recording. Speech samples are collected from 50 native and 50 non-native speakers of Marathi language. The dataset remains as a gender balanced since it is recorded from 50 females and 50 male speakers. The age of speakers will affect the speech. Therefore, 5 different age groups such as 11-20, 21-30, 31-40, 41-50 and 51-60 are considered. Native and non-native speakers are selected to obtain ample amount of variations in the pronunciation of Marathi numerals. Feature extraction and feature matching technique plays a vital role for speech recognition and here LPC (Linear Predictive Coding) is used for extracting features from samples, whereas ANN (Artificial neural network) is used to classify them. Experimental specifications and results are also discussed. This research work has attempted to design and develop a speech recognition system, which can understand Marathi Ank (Numbers) and identify them accurately.