
Showing papers on "Linear predictive coding published in 2020"


Journal ArticleDOI
TL;DR: This work approaches the problem of speaker recognition from severely degraded audio data by judiciously combining two commonly used features: Mel Frequency Cepstral Coefficients (MFCC) and Linear Predictive Coding (LPC), and concludes that MFCC and LPC capture two distinct aspects of speech, viz., speech perception and speech production.
Abstract: Speaker recognition algorithms are negatively impacted by the quality of the input speech signal. In this work, we approach the problem of speaker recognition from severely degraded audio data by judiciously combining two commonly used features: Mel Frequency Cepstral Coefficients (MFCC) and Linear Predictive Coding (LPC). Our hypothesis rests on the observation that MFCC and LPC capture two distinct aspects of speech, viz., speech perception and speech production. A carefully crafted 1D Triplet Convolutional Neural Network (1D-Triplet-CNN) is used to combine these two features in a novel manner, thereby enhancing the performance of speaker recognition in challenging scenarios. Extensive evaluation on multiple datasets, different types of audio degradations, multi-lingual speech, varying length of audio samples, etc. convey the efficacy of the proposed approach over existing speaker recognition methods, including those based on iVector and xVector.

104 citations
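
As a rough illustration of the two feature streams this paper fuses (the 1D-Triplet-CNN itself is not reproduced here), the sketch below extracts per-frame MFCC and LPC features with librosa; the sample rate, frame size, and LPC order are illustrative assumptions, not the paper's settings.

```python
# Sketch: extract the perceptual (MFCC) and production (LPC) feature
# streams from one utterance. All settings are illustrative assumptions.
import numpy as np
import librosa

def mfcc_and_lpc(path, n_mfcc=20, lpc_order=12):
    y, sr = librosa.load(path, sr=16000)
    # Perceptual stream: Mel Frequency Cepstral Coefficients, (frames, n_mfcc).
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T
    # Production stream: per-frame LPC coefficients (all-pole vocal tract model).
    frames = librosa.util.frame(y, frame_length=512, hop_length=256)
    lpc = np.stack([librosa.lpc(np.ascontiguousarray(f), order=lpc_order)[1:]
                    for f in frames.T])  # drop the leading 1 of each filter
    return mfcc, lpc
```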


Journal ArticleDOI
TL;DR: This work proposes a meta-heuristic feature selection (FS) method using a hybrid of the Golden Ratio Optimization (GRO) and Equilibrium Optimization (EO) algorithms, named the Golden Ratio based Equilibrium Optimization (GREO) algorithm.
Abstract: Speech is the most important medium through which human beings express emotion. Thus, understanding a person's emotion from his/her speech using the intelligence of computing devices has often been an area of interest. Traditional machine learning techniques are very popular for accomplishing such tasks. To provide a less computationally expensive model for emotion classification through speech analysis, we propose a meta-heuristic feature selection (FS) method using a hybrid of the Golden Ratio Optimization (GRO) and Equilibrium Optimization (EO) algorithms, which we have named the Golden Ratio based Equilibrium Optimization (GREO) algorithm. The optimally selected features are fed to the XGBoost classifier. Linear Predictive Coding (LPC) and Linear Prediction Cepstral Coefficients (LPCC) based features are considered as the input here, and these are optimized using the proposed GREO algorithm. We have achieved impressive recognition accuracies of 97.31% and 98.46% on two standard datasets, SAVEE and EmoDB respectively. The proposed FS model is also found to perform better than its constituent algorithms as well as many well-known optimization algorithms used for FS in the past. Source code of the present work is made available at: https://github.com/arijitdey1/Hybrid-GREO .

31 citations
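
For context, metaheuristic wrapper FS methods such as GREO optimize a fitness of roughly this shape. The sketch below is a generic stand-in (the actual GRO/EO update rules are in the authors' linked repository), assuming a precomputed LPC/LPCC feature matrix `X` and labels `y`; the penalty weight `alpha` is an illustrative choice.

```python
# Sketch: wrapper fitness for a binary feature mask, scored with the
# XGBoost classifier named in the abstract. Generic stand-in, not GREO.
import numpy as np
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

def fitness(mask, X, y, alpha=0.99):
    """Higher is better: cross-validated accuracy on the selected
    features, lightly penalized by the fraction of features kept."""
    idx = np.flatnonzero(mask)
    if idx.size == 0:
        return 0.0
    acc = cross_val_score(XGBClassifier(n_estimators=100),
                          X[:, idx], y, cv=5).mean()
    return alpha * acc + (1.0 - alpha) * (1.0 - idx.size / X.shape[1])
```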


Proceedings ArticleDOI
04 May 2020
TL;DR: In this paper, a collaborative quantization (CQ) scheme is proposed to jointly learn the codebook of LPC coefficients and the corresponding residuals, which achieves much higher quality than its predecessor at 9 kbps with even lower model complexity.
Abstract: Scalability and efficiency are desired in neural speech codecs, which should support a wide range of bitrates for applications on various devices. We propose a collaborative quantization (CQ) scheme to jointly learn the codebook of LPC coefficients and the corresponding residuals. CQ does not simply shoehorn LPC into a neural network, but bridges the computational capacity of advanced neural network models and traditional, yet efficient and domain-specific, digital signal processing methods in an integrated manner. We demonstrate that CQ achieves much higher quality than its predecessor at 9 kbps with even lower model complexity. We also show that CQ can scale up to 24 kbps, where it outperforms AMR-WB and Opus. As a neural waveform codec, CQ models have fewer than 1 million parameters, significantly fewer than many other generative models.

20 citations
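
The DSP side that CQ couples to a neural network is classical LPC analysis: a short-term all-pole filter plus a residual. A minimal sketch of that decomposition follows (the learned codebooks and quantizers of the paper are not reproduced); the LPC order is an illustrative assumption.

```python
# Sketch: per-frame LPC analysis/synthesis. The residual is what the
# CQ scheme quantizes jointly with the LPC coefficients.
import librosa
from scipy.signal import lfilter

def lpc_analyze(frame, order=16):
    a = librosa.lpc(frame, order=order)   # A(z) = 1 + a1 z^-1 + ... + ap z^-p
    residual = lfilter(a, [1.0], frame)   # inverse filtering: e[n] = A(z) x[n]
    return a, residual

def lpc_synthesize(a, residual):
    return lfilter([1.0], a, residual)    # all-pole synthesis recovers the frame
```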


Proceedings ArticleDOI
21 Sep 2020
TL;DR: It is shown that the SBL-based method has advantages in signal-to-interference-plus-noise ratio (SINR) and target peak detection in comparison to conventional CS or classical signal reconstruction algorithms like linear predictive coding (LPC).
Abstract: Automotive radar plays an important role in advanced driver assistance systems to support level-2 automated driving functions. However, the mutual interference between automotive radars increases due to the rising density of radars on the road. Therefore, the radar signal will be distorted to some extent and the performance of radars will degrade if no countermeasures are taken. In this paper, an interference mitigation approach using compressive sensing (CS) and Bayesian learning is introduced. By utilizing the sparsity of the beat signal in the frequency domain, the range-Doppler (RD) spectrum can be reconstructed with the help of undistorted samples in the beat signal. The sparse Bayesian learning (SBL) method is used to estimate the posterior of the signal's sparse representation and to infer the maximally sparse representation by using the Expectation-Maximization (EM) algorithm. It is shown that the SBL-based method has advantages in signal-to-interference-plus-noise ratio (SINR) and target peak detection in comparison to conventional CS or classical signal reconstruction algorithms like linear predictive coding (LPC).

15 citations
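
The classical LPC baseline named at the end of the abstract repairs interference-distorted samples by linear prediction from clean ones. A minimal sketch, assuming a real-valued beat signal (a complex signal would be handled per I/Q component), a distorted burst that does not start at the very beginning, and that the distorted-sample mask `bad` has been determined elsewhere:

```python
# Sketch: fill interference-distorted samples by forward LPC prediction
# from the clean leading segment.
import numpy as np
import librosa

def lpc_fill(x, bad, order=8):
    x = np.asarray(x, dtype=float).copy()
    first_bad = int(np.flatnonzero(bad)[0])
    a = librosa.lpc(x[:first_bad], order=order)      # fit on clean samples
    for n in np.flatnonzero(bad):
        x[n] = -np.dot(a[1:], x[n - order:n][::-1])  # x_hat[n] = -sum a_k x[n-k]
    return x
```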


Journal ArticleDOI
TL;DR: Simulations of the proposed dynamic predictive block-adaptive quantization (DP-BAQ) are carried out considering a Tandem-L-like staggered SAR system for different orders of prediction and target scenarios, demonstrating that a significant data reduction can be achieved with a modest increase of the system complexity.
Abstract: Staggered synthetic aperture radar (SAR) is an innovative SAR acquisition concept which exploits digital beamforming (DBF) in elevation to form multiple receive beams, together with continuous variation of the pulse repetition interval (PRI), to achieve high-resolution imaging of a wide continuous swath. Staggered SAR requires an azimuth oversampling higher than that of a SAR with constant PRI, which results in an increased volume of data. In this article, we investigate the use of linear predictive coding, which exploits the correlation properties exhibited by the nonuniform azimuth raw data stream. In this scheme, the prediction of each sample is calculated onboard as a linear combination of a set of previous samples. The resulting prediction error is then quantized and downlinked (instead of the original value), which allows for a reduction of the signal entropy and, in turn, of the onboard data rate required for a given target performance. In addition, the a priori knowledge of the gap positions can be exploited to dynamically adapt the bit rate allocation and the prediction order to further improve the performance. Simulations of the proposed dynamic predictive block-adaptive quantization (DP-BAQ) are carried out considering a Tandem-L-like staggered SAR system for different orders of prediction and target scenarios, demonstrating that a significant data reduction can be achieved with a modest increase in system complexity.

13 citations
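
The core predictive-quantization loop described above (predict each sample from previous samples, quantize and downlink only the prediction error) looks roughly like the DPCM-style sketch below; the fixed prediction weights and uniform quantizer are simplifying stand-ins for DP-BAQ's adaptive bit allocation and prediction order.

```python
# Sketch: closed-loop predictive quantization. Only the integer codes
# would be downlinked; the decoder runs the identical loop.
import numpy as np

def predictive_encode(x, w, step):
    """w: prediction weights (most recent sample first); step: quantizer step."""
    p = len(w)
    history = np.zeros(p)             # decoder-visible reconstructed samples
    codes, recon = [], []
    for sample in x:
        pred = np.dot(w, history)                     # predict from quantized past
        code = int(np.round((sample - pred) / step))  # transmitted prediction error
        codes.append(code)
        value = pred + code * step
        recon.append(value)
        history = np.concatenate(([value], history[:-1]))
    return codes, np.array(recon)
```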


Posted Content
Qiao Tian, Zewang Zhang, Heng Lu, Ling-Hui Chen, Shan Liu
TL;DR: The FeatherWave, yet another variant of the WaveRNN vocoder combining multi-band signal processing and linear predictive coding, is proposed, which can significantly improve the efficiency of speech synthesis.
Abstract: In this paper, we propose the FeatherWave, yet another variant of the WaveRNN vocoder, combining multi-band signal processing and linear predictive coding. LPCNet, a recently proposed neural vocoder which utilizes the linear predictive characteristics of the speech signal in the WaveRNN architecture, can generate high-quality speech faster than real time on a single CPU core. However, LPCNet is still not efficient enough for online speech generation tasks. To address this issue, we adopt multi-band linear predictive coding for the WaveRNN vocoder. The multi-band method enables the model to generate several speech samples in parallel at each step and can therefore significantly improve the efficiency of speech synthesis. The proposed model with 4 sub-bands needs less than 1.6 GFLOPS for speech generation. In our experiments, it can generate 24 kHz high-fidelity audio 9x faster than real time on a single CPU, which is much faster than the LPCNet vocoder. Furthermore, our subjective listening test shows that the FeatherWave can generate speech with better quality than LPCNet.

12 citations


Posted Content
TL;DR: This work proposes CLCNet, a framework based on complex-valued linear coding motivated by linear predictive coding and applied in the complex frequency domain; it also defines a parametric normalization for complex-valued spectrograms that complies with low-latency and online processing.
Abstract: Noise reduction is an important part of modern hearing aids and is included in most commercially available devices. Deep learning-based state-of-the-art algorithms, however, either do not consider real-time and frequency resolution constraints or result in poor quality under very noisy conditions. To improve monaural speech enhancement in noisy environments, we propose CLCNet, a framework based on complex-valued linear coding. First, we define complex linear coding (CLC), motivated by linear predictive coding (LPC), that is applied in the complex frequency domain. Second, we propose a framework that incorporates complex spectrogram input and coefficient output. Third, we define a parametric normalization for complex-valued spectrograms that complies with low-latency and online processing. Our CLCNet was evaluated on a mixture of the EUROM database and a real-world noise dataset recorded with hearing aids, and compared to traditional real-valued Wiener-filter gains.

11 citations
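
A minimal sketch of one plausible form of complex linear coding: per frequency bin, the output frame is a complex-weighted sum of the current and previous noisy STFT frames. Here the coefficients are given as an input; in CLCNet they would be produced by the network, and the filter depth `N` is an illustrative assumption.

```python
# Sketch: apply complex linear coding coefficients to a complex STFT.
import numpy as np

def apply_clc(X, C):
    """X: complex spectrogram, shape (freq, time).
    C: complex coefficients, shape (N, freq, time); C[k] weights frame t-k."""
    N, _, T = C.shape
    Xpad = np.pad(X, ((0, 0), (N - 1, 0)))   # zero history before frame 0
    Y = np.zeros_like(X)
    for k in range(N):
        Y += C[k] * Xpad[:, N - 1 - k : N - 1 - k + T]
    return Y
```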


Journal ArticleDOI
TL;DR: A new lossy compression approach using Bayesian predictive coding (BPC), which is as efficient as linear predictive coding when handling independent signals that follow a stationary probability distribution, and is more robust to occasionally erroneous or missing sensor data.
Abstract: Wireless sensor networks (WSNs) generate a variety of continuous data streams. To reduce data storage and transmission cost, compression is recommended to be applied to the data streams from every single sensor node. Local compression falls into two categories: lossless and lossy. Lossy compression techniques are generally preferable to lossless ones for sensors in commercial nodes, as they provide a better compression ratio at a lower computational cost. However, the traditional approaches for data compression in WSNs are sensitive to sensor accuracy. They are less efficient when there are abnormal or faulty measurements or missing data. This paper proposes a new lossy compression approach using Bayesian predictive coding (BPC). Instead of the original signals, predictive coding transmits to the receiving node the error terms, which are calculated by subtracting the predicted signals from the actual signals. Its compression performance depends on the accuracy of the adopted prediction technique. BPC combines Bayesian inference with predictive coding. Prediction is made by Bayesian inference instead of regression models as in traditional predictive coding. In this way, it can utilize prior information and provide inferences that are conditional on the data, without reliance on asymptotic approximation. Experimental tests show that the BPC is as efficient as linear predictive coding when handling independent signals that follow a stationary probability distribution. Moreover, the BPC is more robust to occasionally erroneous or missing sensor data. The proposed approach is based on physical knowledge of the phenomenon in applications. It can be considered a complementary approach to the existing lossy compression family for WSNs.

11 citations
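
A minimal sketch of the transmit-the-error-term idea with a Bayesian predictor: here a scalar Gaussian random-walk model whose posterior mean is the prediction, so only innovations leave the node. The paper's full BPC formulation builds its prior from physical knowledge of the sensed phenomenon, which is not modeled here; `q` and `r` are illustrative prior choices.

```python
# Sketch: Bayesian predictive coding for one sensor stream. The decoder
# can rerun the identical recursion, so only `errors` must be sent.
import numpy as np

def bpc_encode(x, q=0.01, r=0.1):
    """q: process (drift) variance, r: measurement variance."""
    mean, var = 0.0, 1.0               # prior over the underlying signal
    errors = []
    for z in x:
        errors.append(z - mean)        # innovation = transmitted error term
        var += q                       # predict step of the random-walk model
        gain = var / (var + r)         # Bayesian (Kalman) update
        mean += gain * (z - mean)
        var *= 1.0 - gain
    return np.array(errors)
```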


Proceedings ArticleDOI
Qiao Tian, Zewang Zhang, Heng Lu, Ling-Hui Chen, Shan Liu
25 Oct 2020
TL;DR: In this paper, the authors proposed FeatherWave, another variant of the WaveRNN vocoder combining multi-band signal processing and the linear predictive coding used in LPCNet, which can generate high-quality speech faster than real time on a single CPU core.
Abstract: In this paper, we propose the FeatherWave, yet another variant of the WaveRNN vocoder, combining multi-band signal processing and linear predictive coding. LPCNet, a recently proposed neural vocoder which utilizes the linear predictive characteristics of the speech signal in the WaveRNN architecture, can generate high-quality speech faster than real time on a single CPU core. However, LPCNet is still not efficient enough for online speech generation tasks. To address this issue, we adopt multi-band linear predictive coding for the WaveRNN vocoder. The multi-band method enables the model to generate several speech samples in parallel at each step and can therefore significantly improve the efficiency of speech synthesis. The proposed model with 4 sub-bands needs less than 1.6 GFLOPS for speech generation. In our experiments, it can generate 24 kHz high-fidelity audio 9x faster than real time on a single CPU, which is much faster than the LPCNet vocoder. Furthermore, our subjective listening test shows that the FeatherWave can generate speech with better quality than LPCNet.

10 citations


Journal ArticleDOI
TL;DR: The experimental results prove the feasibility of the proposed artificial intelligence model, which combines a denoising autoencoder with generative adversarial networks and generates voices with similar semantics through random input from the latent space of the generator.
Abstract: Linear predictive coding is an extremely effective voice generation method that operates through a simple process. However, linear predictive coding–generated voices have limited variations and exhib...

9 citations


Proceedings ArticleDOI
01 Jun 2020
TL;DR: Fusion of MFCC, LPC and PLP features greatly enhances the speaker recognition system's performance; the proposed method gives the best result of 100% accuracy for the speaker identification system and a 0 equal error rate (EER) for the speaker verification system.
Abstract: Speaker recognition is the task of identifying/verifying an individual's identity with the help of an input voice sample. Speaker recognition is further classified into speaker identification (SI) and speaker verification (SV). Unlike other approaches using the ELSDSR speech database, which focus only on speaker identification performance, the proposed work also reports speaker verification performance along with SI performance. The main goal of this research is to find the best model for a speaker identification and speaker verification system on a clean speech database. A comparative study is performed across various combinations of features for the speaker identification and speaker verification systems, with a Feedforward Artificial Neural Network (FFANN) and a Support Vector Machine (SVM) as classification techniques, using the ELSDSR voice database. The features used are Linear Predictive Coding (LPC), Mel Frequency Cepstral Coefficients (MFCC) and Perceptual Linear Prediction (PLP). All features are tested separately and in fusion with each other, with the FFANN and SVM classifiers, in MATLAB. The proposed model's results are also compared with some well-known techniques using the same ELSDSR database for speaker identification. By comparing the experimental results of the proposed model with others, it is observed that fusing different features gives better results, and speaker identification accuracy increases by 3%-5% compared with a single feature's result. In addition, the proposed method gives the best result of 100% accuracy for the speaker identification system and a 0 equal error rate (EER) for the speaker verification system when the fusion of MFCC, LPC and PLP features is used with the ANN and SVM classifiers. Therefore, it can be said that the fusion of MFCC, LPC and PLP features greatly enhances the speaker recognition system's performance.

Proceedings ArticleDOI
01 May 2020
TL;DR: In this paper, the authors proposed a framework based on complex valued linear coding (CLC) motivated by linear predictive coding (LPC) that is applied in the complex frequency domain.
Abstract: Noise reduction is an important part of modern hearing aids and is included in most commercially available devices. Deep learning-based state-of-the-art algorithms, however, either do not consider real-time and frequency resolution constraints or result in poor quality under very noisy conditions. To improve monaural speech enhancement in noisy environments, we propose CLCNet, a framework based on complex-valued linear coding. First, we define complex linear coding (CLC), motivated by linear predictive coding (LPC), that is applied in the complex frequency domain. Second, we propose a framework that incorporates complex spectrogram input and coefficient output. Third, we define a parametric normalization for complex-valued spectrograms that complies with low-latency and online processing. Our CLCNet was evaluated on a mixture of the EUROM database and a real-world noise dataset recorded with hearing aids, and compared to traditional real-valued Wiener-filter gains.

Posted Content
TL;DR: This paper proposes that concatenating features extracted using different existing feature extraction methods can not only boost classification accuracy but also expand the possibility of efficient feature selection in speech emotion analysis.
Abstract: Emotion recognition from audio signals has been regarded as a challenging task in signal processing, as it can be considered a collection of static and dynamic classification tasks. Recognition of emotions from speech data has relied heavily on end-to-end feature extraction and classification using machine learning models, though the absence of feature selection and optimization has restrained the performance of these methods. Recent studies have shown that Mel Frequency Cepstral Coefficients (MFCC) have emerged as one of the most relied-upon feature extraction methods, though MFCC alone circumscribes classification accuracy with its very small feature dimension. In this paper, we propose that the concatenation of features extracted using different existing feature extraction methods can not only boost classification accuracy but also expand the possibility of efficient feature selection. We have used Linear Predictive Coding (LPC) alongside the MFCC feature extraction method before feature merging. Besides, we have performed a novel application of Manta Ray optimization to speech emotion recognition tasks, which yielded a state-of-the-art result in this field. We have evaluated the performance of our model using SAVEE and Emo-DB, two publicly available datasets. Our proposed method outperformed the existing methods in speech emotion analysis on these two datasets, with classification accuracies of 97.06% and 97.68%, respectively.

Journal ArticleDOI
01 Jan 2020
TL;DR: In this article, the authors investigated the sources of variability in vowel formant estimation, a major analytic activity in sociophonetics, by reviewing the outcomes of two simulations that manipulated the settings used for linear predictive coding (LPC)-based vowel estimation.
Abstract: This paper contributes insight into the sources of variability in vowel formant estimation, a major analytic activity in sociophonetics, by reviewing the outcomes of two simulations that manipulated the settings used for linear predictive coding (LPC)-based vowel formant estimation. Simulation 1 explores the range of frequency differences obtained when minor adjustments are made to LPC settings and measurement timepoints around the settings used by trained analysts, in order to determine the range of variability that should be expected in sociophonetic vowel studies. Simulation 2 examines the variability that emerges when LPC settings are varied combinatorially around constant default settings, rather than settings set by trained analysts. The impacts of different LPC settings are discussed as a way of demonstrating the inherent properties of LPC-based formant estimation. This work suggests that differences more fine-grained than about 10 Hz in F1 and 15–20 Hz in F2 are within the range of LPC-based formant estimation variability.
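
In its simplest form, the LPC-based formant estimation under study reduces to picking the angles of the LPC polynomial roots; settings such as the LPC order (and the pre-analysis sample rate) are exactly the knobs the simulations vary. A minimal sketch follows; bandwidth-based root filtering, which practical tools also apply, is omitted.

```python
# Sketch: candidate formant frequencies from the roots of the LPC
# polynomial fitted to one analysis frame.
import numpy as np
import librosa

def lpc_formants(frame, sr, order=12):
    a = librosa.lpc(np.asarray(frame, dtype=float), order=order)
    roots = [r for r in np.roots(a) if np.imag(r) > 0]  # one of each conjugate pair
    return sorted(np.angle(r) * sr / (2 * np.pi) for r in roots)  # Hz; lowest ~ F1
```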

Journal ArticleDOI
TL;DR: The results strongly prove the usefulness of the proposed multivariate feature selection algorithm when compared with the filter approach.
Abstract: Speech signals convey a speaker's neurodevelopmental state along with phonological information. Recognizing a speech disorder by analyzing speech is essential for human–machine interaction. To develop a subject-independent speech recognition system for neurodevelopmental disorders, voice features identified with the MATLAB toolbox, spectral characteristics and feature selection algorithms are proposed in this paper. Feature selection is applied to overcome the challenges of dimensionality in various applications. This work presents a novel particle swarm optimization (PSO) based algorithm for feature selection. The experiments were conducted using a speech database of children with intellectual disability together with age-matched typically developing children, and reliability was validated using a 10-fold cross-validation technique. The database consists of 141 speech features extracted from linear predictive coding (LPC) based cepstral parameters and Mel-frequency cepstral coefficients (MFCC). Three classification models were applied, obtaining recognition accuracies of 90.30% with ANN, 98.00% with SVM and 91.00% with random forest using the PSO feature selection algorithm. The results strongly prove the usefulness of the proposed multivariate feature selection algorithm when compared with the filter approach.

Proceedings ArticleDOI
01 Jan 2020
TL;DR: In this paper, the authors used linear predictive coding coefficients (LPCs) to detect specific language impairment (SLI) from children's voices, a diagnosis traditionally made through behavioral analysis and age-specific language test scores.
Abstract: Specific language impairment (SLI) is a kind of neurodevelopmental disorder which can disturb the speech production and language comprehension ability of a child. The detection/prediction of SLI in children is done through behavioral analysis and age-specific language test scores by speech therapists and child psychologists. Modern computational approaches may cut down the long prediction time needed to diagnose the disorder and help the child as well as the parents to start treatment early. The present work gives an idea for the detection of SLI from children's voices with the help of linear predictive coding coefficients (LPCs), as they have the ability to model the human vocal system. Two classifiers, naive Bayes (NB) and support vector machine (SVM), were employed to categorize children into SLI and healthy groups. The best classification accuracies of 97.9% (for the top-20 LPC features) and 97.8% (for the top-10 LPC features) were secured by the NB classifier under a 5-fold cross-validation protocol. The study concluded that LPC parameters play a crucial role in the detection of SLI and could be applied to uncover other neurodevelopmental disorders via children's speech signals.
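
The reported pipeline (top-k LPC features into a naive Bayes classifier under 5-fold cross-validation) can be sketched as below. The feature matrix `X` (one row per child) is assumed precomputed, and the univariate ranking used here is a generic stand-in for the paper's feature-ranking step.

```python
# Sketch: naive Bayes on the top-k LPC-derived features, 5-fold CV.
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline

def sli_cv_accuracy(X, y, k=20):
    model = make_pipeline(SelectKBest(f_classif, k=k), GaussianNB())
    return cross_val_score(model, X, y, cv=5).mean()
```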

Journal ArticleDOI
TL;DR: In the proposed technique, roots of the linear predictive polynomial are used to detect oscillations, and the spectrum of the cross-correlation between the linear predictive residual and the original data is used for quantification.

Journal ArticleDOI
TL;DR: The experimental outcomes show that the proposed technique can be used to help speech pathologists in estimating intellectual disability at early ages.
Abstract: Classification of intellectually disabled children through manual assessment of speech at an early age is inconsistent, subjective, time-consuming and prone to error. This study attempts to classif...

Proceedings ArticleDOI
23 Oct 2020
TL;DR: In this paper, a speaker recognition algorithm based on Long Short-Term Memory Networks (LSTM) was proposed, which uses a method that combines Linear Predictive Coding (LPC) with Log-Mel spectrum, and the d-vector output through the LSTM network is classified using the Softmax loss function.
Abstract: Speaker recognition is the process of identifying a speaker's identity by extracting acoustic features from the speaker's audio. It mainly involves perceiving and modeling the speaker's vocal tract information and the human ear's auditory information, and it has profound significance in daily life and in military applications. To address this task, this paper proposes a speaker recognition algorithm based on Long Short-Term Memory networks (LSTM). The features are obtained with a method that combines Linear Predictive Coding (LPC) with the Log-Mel spectrum. The d-vector output of the LSTM network is classified using the Softmax loss function. This method is applied to the VCTK audio dataset. The experimental results show that the recognition rate of this method reaches 94.9%.
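
A minimal Keras sketch of the described pipeline: per-frame features (LPC concatenated with the log-Mel spectrum) feed an LSTM whose final hidden state acts as the d-vector, trained with a softmax over speakers. The layer sizes are illustrative assumptions, not the paper's configuration.

```python
# Sketch: LSTM d-vector model with softmax speaker classification.
import tensorflow as tf

def build_dvector_model(n_frames, feat_dim, n_speakers, d=256):
    inputs = tf.keras.Input(shape=(n_frames, feat_dim))  # LPC + log-Mel per frame
    dvector = tf.keras.layers.LSTM(d)(inputs)            # last state = d-vector
    outputs = tf.keras.layers.Dense(n_speakers, activation="softmax")(dvector)
    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```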

Journal ArticleDOI
TL;DR: A variable-bit-rate speech codec based on mixed excitation linear prediction enhanced (MELPe) with an average bit rate of 2 kbps and a better representation of the excitation signal is proposed; the incorporation of Mel-LPC gives better performance in the estimation of formants and GCIs.
Abstract: In this paper, we propose a variable-bit-rate speech codec based on mixed excitation linear prediction enhanced (MELPe) with an average bit rate of 2 kbps and a better representation of the excitation signal. The order of the prediction filter in the MELPe coding architecture is reduced from 10 to 7 without affecting the perceptual quality of the decoded speech by using the psychoacoustic Mel scale. An efficient two-split vector quantization is developed with a weighted Euclidean distance measure for Mel scale-based linear predictive coding (Mel-LPC), and it requires only 18 bits/frame. The instantaneous pitch or epoch that is vital for many speech processing applications is preserved in this codec by including it in the excitation signal used for reconstructing the voiced speech. The quantization scheme developed for glottal closure instants (GCIs) increases the bit requirement for voiced frames by 4–25 bits depending on the position of the GCIs. To compensate for that, the Mel-LPC order for both silence and unvoiced frames has been brought down to 4 without compromising the perceptual quality of the reconstructed speech. The lowered bit budget for an unvoiced frame is 41 bits/frame, and for silence, it is 31 bits/frame. A further reduction of 10 bits for silence frames is obtained by reducing the number of transmitted parameters and by tuning the quantization bit requirement for each. For categorizing the speech frames at the entry of the encoder, a neural network-based voiced/unvoiced/silence classification algorithm using a five-dimensional feature set is created. The experimental results show that the proposed coding scheme operates at an average bit rate of 2 kbps, which is less than the bit rate of MELPe (2.4 kbps), but with a better perceptual score. In addition, the incorporation of Mel-LPC gives better performance in the estimation of formants and GCIs.

Journal ArticleDOI
31 Jul 2020
TL;DR: The use of the linear predictive coding coefficients along with the Elman neural network has led to higher recognition accuracy and improved the speech recognition system.
Abstract: Today, the requirement for automatic intelligent systems has drawn increasing attention to modern interactive techniques between human beings and machines. These techniques generally fall into two types: audio and visual methods. The need to develop algorithms that enable human speech recognition by machines is of high importance and is frequently studied by researchers. Using artificial intelligence methods has led to better results in human speech recognition, but the basic problem is the lack of an appropriate strategy for selecting the recognition data from the huge amount of speech information, which practically makes it impossible for the available algorithms to work. In this article, to solve this problem, the linear predictive coding coefficient extraction method is used to summarize the data related to the pronunciation of English digits. After extraction, the data are fed to an Elman neural network to learn the relation between the linear coding coefficients of an audio file and the pronounced digit. The results show that this method performs well compared to other methods. According to the experiments, the obtained network training results (99% recognition accuracy) indicate that the network still performs better than RBF despite many errors. The experiments showed that the Elman memory neural network has an acceptable performance in recognizing the speech signal compared to the other algorithms. The use of the linear predictive coding coefficients along with the Elman neural network has led to higher recognition accuracy and an improved speech recognition system.

Journal ArticleDOI
TL;DR: This paper integrates a simple non-parallel voice conversion (VC) system with a WaveNet (WN) vocoder and a proposed collapsed speech suppression technique and proposes a collapsed speech segment detector (CSSD) to mitigate the negative effects introduced by the LPCDC.
Abstract: In this paper, we integrate a simple non-parallel voice conversion (VC) system with a WaveNet (WN) vocoder and a proposed collapsed speech suppression technique. The effectiveness of WN as a vocoder for generating high-fidelity speech waveforms on the basis of acoustic features has been confirmed in recent works. However, when combining the WN vocoder with a VC system, the distorted acoustic features, acoustic and temporal mismatches, and exposure bias usually lead to significant speech quality degradation, making WN generate some very noisy speech segments called collapsed speech. To tackle the problem, we take conventional-vocoder-generated speech as the reference speech to derive a linear predictive coding distribution constraint (LPCDC) to avoid the collapsed speech problem. Furthermore, to mitigate the negative effects introduced by the LPCDC, we propose a collapsed speech segment detector (CSSD) to ensure that the LPCDC is only applied to the problematic segments to limit the loss of quality to short periods. Objective and subjective evaluations are conducted, and the experimental results confirm the effectiveness of the proposed method, which further improves the speech quality of our previous non-parallel VC system submitted to Voice Conversion Challenge 2018.

Journal ArticleDOI
TL;DR: In this work a new Bangla speech corpus along with proper transcriptions has been developed; various acoustic feature extraction methods have also been investigated using a Long Short-Term Memory (LSTM) neural network to find their effective integration into a state-of-the-art Bangla speech recognition system.
Abstract: In this work a new Bangla speech corpus along with proper transcriptions has been developed; various acoustic feature extraction methods have also been investigated using a Long Short-Term Memory (LSTM) neural network to find their effective integration into a state-of-the-art Bangla speech recognition system. The acoustic features are usually a sequence of representative vectors that are extracted from speech signals, and the classes are either words or sub-word units such as phonemes. The most commonly used feature extraction method, known as linear predictive coding (LPC), has been used first in this work. Then the other two popular methods, namely the Mel frequency cepstral coefficients (MFCC) and perceptual linear prediction (PLP), have also been applied. These methods are based on models of the human auditory system. A detailed review of the implementation of these methods has been presented first. Then the steps of the implementation have been elaborated for the development of an automatic speech recognition (ASR) system for Bangla speech.

Journal ArticleDOI
TL;DR: The objective of SI is to achieve the security needed for the remote access system, and this security can be increased using coding and compression processes.
Abstract: This paper investigates the effect of both decoding and decompression on Speaker Identification (SI) in a remote access system. The coding and compression processes are used for communication purposes as a normal action taken for voice communication over the Internet or mobile networks. In the proposed system, the speech signal is coded with the Linear Predictive Coding (LPC) technique. Also, the speech signal is compressed using two techniques. The first technique depends on a decimation process to compress the signal. The signal can be recovered using inverse solutions; these include maximum entropy and regularized reconstruction. The second technique is Compressive Sensing (CS), where the speech signal can be reconstructed using linear programming. The coded or compressed speech signal is transmitted to the receiver via a wireless communication channel. At the receiver, the received signal is decoded or decompressed, and then SI is performed on the decoded or decompressed speech signal. The performance of the coding and compression techniques is evaluated using metrics such as Perceptual Evaluation of Speech Quality (PESQ) and Dynamic Time Warping (DTW). The objective of SI is to achieve the security needed for the remote access system, and this security can be increased using coding and compression processes. In the SI system, the feature vectors are captured from different discrete transforms such as the Discrete Wavelet Transform (DWT), Discrete Cosine Transform (DCT), and Discrete Sine Transform (DST), besides the time domain. The recognition rate for all transforms is computed to evaluate the performance of the SI system.

Journal ArticleDOI
TL;DR: A fast computation algorithm based on the Bairstow method for computing LSP frequencies from linear prediction coefficients is proposed in this paper; it can extract the polynomial roots efficiently and accurately with significantly reduced computation complexity.
Abstract: Linear prediction is the kernel technology in speech processing. It has been widely applied in speech recognition, synthesis, and coding, and can efficiently and correctly represent the speech frequency spectrum with only a few parameters. Line Spectrum Pair (LSP) frequencies, as an alternative representation of Linear Predictive Coding (LPC), have the advantages of good quantization accuracy and low spectral sensitivity. However, computing the LSP frequencies takes a long time. To address this issue, a fast computation algorithm, based on the Bairstow method, for computing LSP frequencies from linear prediction coefficients is proposed in this paper. The algorithm first transforms the symmetric and antisymmetric polynomials to general polynomials, then extracts the polynomial roots. Exploiting the short-term stationarity of the speech signal, an adaptive initialization method is applied to reduce the average number of iterations by 26% compared to static initialization, with a Perceptual Evaluation of Speech Quality (PESQ) score reaching 3.46. Experimental results show that the proposed method can extract the polynomial roots efficiently and accurately with significantly reduced computation complexity. Compared to previous works, the proposed method is 17 times faster than the Tschirnhaus transform, and has a 22% PESQ improvement over the Birge-Vieta method with an almost comparable computation time.
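
For reference, the LPC-to-LSP conversion being accelerated forms the symmetric polynomial P(z) = A(z) + z^-(p+1) A(z^-1) and the antisymmetric Q(z) = A(z) - z^-(p+1) A(z^-1), whose unit-circle root angles interlace as the LSP frequencies. The sketch below uses numpy's eigenvalue-based np.roots in place of the Bairstow iteration the paper optimizes.

```python
# Sketch: LSP frequencies (radians in (0, pi)) from LPC coefficients.
import numpy as np

def lpc_to_lsp(a):
    """a: LPC coefficients [1, a1, ..., ap] of A(z)."""
    sym  = np.concatenate((a, [0.0])) + np.concatenate(([0.0], a[::-1]))  # P(z)
    asym = np.concatenate((a, [0.0])) - np.concatenate(([0.0], a[::-1]))  # Q(z)
    angles = []
    for poly in (sym, asym):
        roots = np.roots(poly)
        angles += [np.angle(r) for r in roots if np.imag(r) > 0]  # skip z = +/-1
    return np.sort(angles)   # interlaced line spectral frequencies
```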

Proceedings ArticleDOI
TL;DR: An improved model that uses both linear predictive coding (LPC) and line spectral frequency (LSF) coefficients to parametrize the source speech signal was developed in this work to reveal the effect of over-smoothing.
Abstract: The research presents a voice conversion model using coefficient mapping and a neural network. Most previous works on parametric speech synthesis did not account for losses in spectral detail, causing over-smoothing and, invariably, an appreciable deviation of the converted speech from the target speaker. An improved model that uses both linear predictive coding (LPC) and line spectral frequency (LSF) coefficients to parametrize the source speech signal was developed in this work to reveal the effect of over-smoothing. The non-linear mapping ability of a neural network was employed to map the source speech vectors into the acoustic vector space of the target. Training LPC coefficients with a neural network yielded poor results due to the instability of the LPC filter poles, so the LPC coefficients were converted to line spectral frequency coefficients before being trained with a 3-layer neural network. The algorithm was tested on noisy data, with the result evaluated using the Mel-Cepstral Distance measurement. Cepstral distance evaluation shows a 35.7 percent reduction in the spectral distance between the target and the converted speech.

Proceedings ArticleDOI
01 May 2020
TL;DR: Experimental results indicate that the CS method achieves significant performance improvements with respect to the aforementioned methods, namely code-excited linear prediction and the Lloyd-Max quantization algorithm.
Abstract: Speech coding is an essential procedure in the public switched telephone network (PSTN), digital cellular communications, videoconferencing systems, and emerging voice over Internet applications. Compressed sensing is an original signal processing tool for efficiently acquiring and reconstructing a signal by exploiting its compressibility. In this paper, compressive sensing is employed for speech coding. In particular, in order to demonstrate its efficiency in speech coding, we propose a comparative study between this method and existing optimal methods, namely code-excited linear prediction and the Lloyd-Max quantization algorithm. Experimental results indicate that the CS method achieves significant performance improvements with respect to the aforementioned methods.

22 Jan 2020
TL;DR: This work investigated optimization techniques for a voice pathology and voice disorder classification system, surveying acoustic analysis for human voice disorder classification using optimization and machine learning techniques, with emphasis on pathology detection and its classification.
Abstract: Voice diseases are increasing dramatically due to unhealthy social habits and voice abuse. In this work, we investigated optimization techniques for a voice pathology and voice disorder classification system. The literature review presents a survey of acoustic analysis for human voice disorder classification using optimization and machine learning techniques, with emphasis on pathology detection and its classification. The input voice signal is given to numerous classification methods, such that the classification distinguishes pathological voices from normal voices, with relevance to male and female voice classification. The foremost goal of this work is to provide a complete study of the most popular machine learning techniques: noise removal, silence removal and different filters in preprocessing; the feature extraction techniques, namely acoustic features (signal energy, pitch, formants, jitter, shimmer), reflection coefficients, autocorrelation, Linear Predictive Coding (LPC), Mel-frequency cepstral coefficients (MFCC), zero crossing with peak amplitude (ZCPA), dynamic time warping (DTW) and relative spectral processing (RASTA); and the classification of the input voice signal with the help of classification algorithms such as Support Vector Machine (SVM) and Back Propagation Neural Network (BPNN).

Book ChapterDOI
01 Jan 2020
TL;DR: In this article, a review of the related research work in the field of feature extraction methodologies viz MFCC (Mel Frequency Cepstral Coefficient), LPC (Linear Predictive Coding), Wavelet, DWT (Discrete Wavelet Transform) and PLP (Perceptual Linear Predictive) etc.
Abstract: Current technology development in the fields of artificial intelligence and IoT has increased the importance of research in speech processing. Researchers are emphasizing speech processing and its applications due to the increased acceptance of technology based on AI and IoT. The natural voice or speech signal needs to be digitized for use in processing and feature extraction. The speech signal consists of scads of information, categorized broadly as gender based, voice-characteristics based, emotion based, speaker based, etc. Recognizing the importance of feature extraction and classification for speech processing in various applications, significant research has been carried out on various methodologies related to diversified applications. This manuscript attempts to study and review the related research work in the field of feature extraction methodologies, viz. MFCC (Mel Frequency Cepstral Coefficient), LPC (Linear Predictive Coding), Wavelet, DWT (Discrete Wavelet Transform) and PLP (Perceptual Linear Predictive). Researchers have also given importance to classifiers like SVM (Support Vector Machine), ANN (Artificial Neural Network), GMM (Gaussian Mixture Model) and HMM (Hidden Markov Model). A comparison of these classifiers is presented in this review. The prime objective of this review paper is to observe the relationship between the variance of speech parameters, feature extraction methodologies and classifiers. The endeavor of this review is to establish comparative observations which shall help budding researchers in selecting a feature extraction technique as well as a classifier for various speech processing applications, considering their specific advantages and disadvantages.

Proceedings ArticleDOI
05 Nov 2020
TL;DR: In this article, a speech recognition system for Marathi Ank (Numbers) was proposed, where feature extraction and feature matching technique plays a vital role for speech recognition and here LPC (Linear Predictive Coding) is used for extracting features from samples, whereas ANN (Artificial neural network) was used to classify them.
Abstract: Speech recognition is gaining an increasing research interest in the last five decades. Speech processing is considered as an interdisciplinary branch of electronics and computer science domain. It considers speech as an input and converts it into the corresponding text. This paper describes the design and development of Marathi Numeric Speech Dataset. Marathi Numbers (Ank) ranging from Shunya(0) to Nau(9) and are taken into consideration for recording. Speech samples are collected from 50 native and 50 non-native speakers of Marathi language. The dataset remains as a gender balanced since it is recorded from 50 females and 50 male speakers. The age of speakers will affect the speech. Therefore, 5 different age groups such as 11-20, 21-30, 31-40, 41-50 and 51-60 are considered. Native and non-native speakers are selected to obtain ample amount of variations in the pronunciation of Marathi numerals. Feature extraction and feature matching technique plays a vital role for speech recognition and here LPC (Linear Predictive Coding) is used for extracting features from samples, whereas ANN (Artificial neural network) is used to classify them. Experimental specifications and results are also discussed. This research work has attempted to design and develop a speech recognition system, which can understand Marathi Ank (Numbers) and identify them accurately.