
Showing papers on "Linear predictive coding" published in 2021


Journal ArticleDOI
TL;DR: A classification model is proposed in which a discrete wavelet transform (DWT) is used to transform the signal and a genetic algorithm (GA) together with a support vector machine (SVM) classifier is applied, achieving the best accuracy.

73 citations


Journal ArticleDOI
TL;DR: In this paper, a speaker-independent connected-word Hindi speech recognition system using different feature extraction techniques is presented, along with a comparative analysis of confusing words to understand the reasons for speech recognition errors.
Abstract: The research work presents experimental work to build a speaker-independent connected-word Hindi speech recognition system using different feature extraction techniques, with a comparative analysis of confusing words. Comparative analysis of confusing words is essential to understand the reasons for speech recognition errors. Based on the error analysis, different feature extraction techniques, classification techniques, acoustic models, and pronunciation dictionaries can be selected to improve the speech recognition system's performance. Earlier studies on Hindi speech recognition lack a detailed comparative analysis of confusing words for different feature extraction methods. As speaker-independent systems are developed for all users, a comparative analysis of confusing words is also presented for all feature extraction techniques. The speaker-independent system was proposed with a five-state monophone-based hidden Markov model (HMM) using the HMM toolkit (HTK). A self-created Hindi speech corpus has been used in the experiment. Feature extraction techniques such as linear predictive coding cepstral coefficients (LPCCs), mel frequency cepstral coefficients (MFCCs), and perceptual linear prediction coefficients (PLPs) were applied using delta, double-delta, and energy parameters to evaluate the performance of the proposed methodology. The system was assessed using different feature extraction techniques in speaker-independent mode. Research findings reveal that PLP coefficients show the highest recognition score, while LPCCs got the lowest recognition scores. Investigations also reveal that both PLP and MFCC coefficients are better than LPCCs in speech recognition. Comparative analysis of confusing words shows that PLPs and MFCCs show fewer confusions than LPCCs and exhibit mostly the same pattern in the confusion analysis. Research outcomes also reveal that substitution errors are a significant cause of low recognition. It was also found that some words were recognized with individual feature extraction techniques only. Confusion analysis indicates that words with nasal, liquid, and fricative sounds in the initial position exhibit more confusions. The investigation could improve speech recognition by choosing an appropriate feature extraction method and combining the various feature extraction methods. The research outcomes can also be utilized to build linguistic resources for improving speech recognition. The results show that the developed recognition framework achieved the highest word accuracy of 76.68% with PLPs for the speaker-independent model. The proposed system was also compared with existing similar work.
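
A minimal sketch of the delta/double-delta feature augmentation described above, using librosa and a hypothetical file name; it covers only MFCCs, since librosa has no built-in LPCC or PLP extractor:

```python
import numpy as np
import librosa

# Load a (hypothetical) Hindi utterance at 16 kHz.
y, sr = librosa.load("utterance.wav", sr=16000)

# 13 static MFCCs per frame, plus delta and double-delta (acceleration) coefficients,
# giving the 39-dimensional vectors commonly fed to HMM/HTK-style recognizers.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
delta = librosa.feature.delta(mfcc)
delta2 = librosa.feature.delta(mfcc, order=2)
features = np.vstack([mfcc, delta, delta2])   # shape: (39, n_frames)
```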

17 citations


Journal ArticleDOI
TL;DR: In this paper, the authors proposed an algorithm to enhance and encode the speech data by combining spectral subtraction with voice activity detection and linear predictive coding (LPC) under degraded conditions.
Abstract: In this paper, the encoding of noisy and enhanced speech data is demonstrated. To encode and enhance speech data under an uncontrolled environment, the linear predictive coding (LPC) and spectral subtraction with voice activity detection (SS-VAD) methods are studied individually. The noisy speech data is obtained by combining the clean speech signal with a noise model, and it is encoded using the LPC technique. The LPC encoder uses a lossy compression procedure that reduces the data rate from 64 to 2.4 kbps. Due to reverberations and degradations in noisy speech data, the quality of the encoded noisy speech is considerably lower. Therefore, an algorithm is proposed to enhance and encode the speech data by combining SS-VAD and LPC under degraded conditions. In the first step, the noisy speech data is encoded using LPC and its performance is evaluated using the signal-to-noise ratio. In the second step, the noisy speech data is given as input to the SS-VAD algorithm, and the output of SS-VAD is fed to the LPC encoder. In the LPC encoder, the coefficients are extracted from the input speech data to design all-pole filters. A cross-correlation process is also used to differentiate voiced and unvoiced samples at the analysis step. The pitch information and the extracted coefficients are used in the synthesis step. The experiments are conducted for different types of noisy speech data degraded by musical noise, F16 noise, factory noise, and car noise. The experimental results show that there is a significant improvement in the quality of the enhanced encoded speech data obtained by the proposed method compared to the encoded noisy speech data. Schematic representations of the outputs of LPC and of the proposed combined SS-VAD and LPC waveforms are also given in this work.
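
The spectral subtraction stage can be sketched as follows; this simplified version estimates the noise spectrum from the leading frames rather than from a full VAD, so it only approximates the SS-VAD front end described in the paper:

```python
import numpy as np
from scipy.signal import stft, istft

def spectral_subtraction(noisy, sr, noise_frames=10, alpha=2.0, floor=0.01):
    """Basic magnitude spectral subtraction with over-subtraction and a spectral floor."""
    _, _, X = stft(noisy, fs=sr, nperseg=512)
    mag, phase = np.abs(X), np.angle(X)
    noise_mag = mag[:, :noise_frames].mean(axis=1, keepdims=True)  # noise estimate
    clean_mag = np.maximum(mag - alpha * noise_mag, floor * mag)
    _, enhanced = istft(clean_mag * np.exp(1j * phase), fs=sr, nperseg=512)
    return enhanced   # this signal would then be passed to the LPC encoder
```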

10 citations


Posted Content
TL;DR: In this paper, a convolutional neural network (CNN) performs encoding and decoding as a neural waveform codec (NWC) during its feedforward routine, where quantization and entropy coding are handled during the optimization process.
Abstract: We present a scalable and efficient neural waveform coding system for speech compression. We formulate the speech coding problem as an autoencoding task, where a convolutional neural network (CNN) performs encoding and decoding as a neural waveform codec (NWC) during its feedforward routine. The proposed NWC also defines quantization and entropy coding as a trainable module, so the coding artifacts and bitrate control are handled during the optimization process. We achieve efficiency by introducing compact model components to the NWC, such as gated residual networks and depthwise separable convolution. Furthermore, the proposed models adopt a scalable architecture, cross-module residual learning (CMRL), to cover a wide range of bitrates. To this end, we employ the residual coding concept to concatenate multiple NWC autoencoding modules, where each NWC module performs residual coding to restore any reconstruction loss that its preceding modules have created. CMRL can scale down to cover lower bitrates as well, for which it employs a linear predictive coding (LPC) module as its first autoencoder. The hybrid design integrates LPC and NWC by redefining LPC's quantization as a differentiable process, making the system trainable end to end. The decoder of the proposed system uses either one NWC (0.12 million parameters) at low to medium bitrates (12 to 20 kbps) or two NWCs at the high bitrate (32 kbps). Although the decoding complexity is not yet as low as that of conventional speech codecs, it is significantly reduced from that of other neural speech coders, such as a WaveNet-based vocoder. For wide-band speech coding quality, our system yields comparable or superior performance to AMR-WB and Opus on TIMIT test utterances at low and medium bitrates. The proposed system can scale up to higher bitrates to achieve near transparent performance.
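
As an illustration of the compact components mentioned (not the authors' exact architecture), a depthwise separable 1-D convolution block can be sketched in PyTorch; channel count and kernel size are illustrative:

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv1d(nn.Module):
    """Depthwise + pointwise convolution: far fewer parameters than a full Conv1d."""
    def __init__(self, channels, kernel_size=9):
        super().__init__()
        self.depthwise = nn.Conv1d(channels, channels, kernel_size,
                                   padding=kernel_size // 2, groups=channels)
        self.pointwise = nn.Conv1d(channels, channels, kernel_size=1)

    def forward(self, x):                        # x: (batch, channels, time)
        return self.pointwise(self.depthwise(x))

block = DepthwiseSeparableConv1d(channels=64)
out = block(torch.randn(1, 64, 512))             # output shape: (1, 64, 512)
```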

7 citations


Journal ArticleDOI
TL;DR: A power-weighted formant frequency estimation procedure based on Linear Predictive Coding is presented; it works by pre-emphasizing the dominant spectral components of an input signal, which allows a subsequent estimation step to extract formant frequencies with greater accuracy.
Abstract: A power-weighted formant frequency estimation procedure based on Linear Predictive Coding (LPC) is presented. It works by pre-emphasizing the dominant spectral components of an input signal, which allows a subsequent estimation step to extract formant frequencies with greater accuracy. The accuracy of traditional LPC formant estimation is improved by this new power-weighted formant estimator for different classes of synthetic signals and for speech. Power-weighted LPC significantly and reliably outperforms LPC and variants of LPC at the task of formant estimation using the VTR formants dataset, a database consisting of the Vocal Tract Resonance (VTR) frequency trajectories obtained by human experts for the first three formant frequencies. This performance gain is evident over a range of filter orders.
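
For reference, conventional LPC formant estimation (the baseline the power-weighted method improves on) can be sketched by selecting formant candidates from the roots of the LPC polynomial; the thresholds below are illustrative assumptions:

```python
import numpy as np
import librosa

def lpc_formants(frame, sr, order=12):
    """Estimate candidate formant frequencies from the LPC polynomial roots of one frame."""
    a = librosa.lpc(np.ascontiguousarray(frame, dtype=float), order=order)
    roots = np.roots(a)
    roots = roots[np.imag(roots) > 0]                  # one root per conjugate pair
    freqs = np.angle(roots) * sr / (2 * np.pi)         # pole angle -> frequency (Hz)
    bandwidths = -np.log(np.abs(roots)) * sr / np.pi   # pole radius -> bandwidth (Hz)
    keep = (freqs > 90) & (bandwidths < 400)           # drop implausible resonances
    return np.sort(freqs[keep])
```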

7 citations


Journal ArticleDOI
29 Jan 2021
TL;DR: The PaMZ-HMM shows improved performance and reduced complexity over existing feature extraction techniques such as Mel-scale frequency cepstral coefficients (MFCC) and linear predictive coding (LPC) and is suitable for real-time detection.
Abstract: Passive acoustic monitoring (PAM) is generally used to extract acoustic signals produced by cetaceans. However, the large data volume from the PAM process is better analyzed using an automated technique such as hidden Markov models (HMMs). In this paper, the HMM is used as a detection and classification technique due to its robustness and low time complexity. Nonetheless, certain parameters, such as the choice of features to be extracted from the signal, the frame duration, and the number of states, affect the performance of the model. The results show that the HMM exhibits the best performance as the number of states increases with short frame duration. However, increasing the number of states creates more computational complexity in the model. The inshore Bryde's whales produce short pulse calls with distinct signal features, which are observable in the time domain. Hence, a time-domain feature vector is utilized to reduce the complexity of the HMM. Simulation results also show that average power as a time-domain feature provides the best performance compared to other feature vectors for detecting the short pulse call of inshore Bryde's whales with the HMM technique. Furthermore, the extracted features, namely the average power, mean, and zero-crossing rate, are combined to form a single 3-dimensional vector (PaMZ). The PaMZ-HMM shows improved performance and reduced complexity over existing feature extraction techniques such as Mel-scale frequency cepstral coefficients (MFCC) and linear predictive coding (LPC), making the PaMZ-HMM suitable for real-time detection.
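
The 3-dimensional PaMZ feature is simple to compute per frame; a minimal sketch, assuming frames are already extracted as 1-D arrays:

```python
import numpy as np

def pamz_vector(frame):
    """Average power, mean, and zero-crossing rate of one time-domain frame."""
    avg_power = np.mean(frame ** 2)
    mean_amp = np.mean(frame)
    zcr = np.mean(np.abs(np.diff(np.sign(frame)))) / 2.0   # crossings per sample
    return np.array([avg_power, mean_amp, zcr])
```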

6 citations


Journal ArticleDOI
TL;DR: A model-based compression scheme for seismic data is proposed and compared with the linear predictive coding (LPC) and distributed principal component analysis (DPCA) algorithms on a real seismic database, where its performance is shown to be superior to both.
Abstract: This work develops a model-based compression scheme for seismic data. First, seismic traces are modeled as a superposition of multitone sinusoidal waves. Each sinusoidal wave is regarded as a model component and is represented by a set of distinct parameters. Second, a parameter estimation algorithm for this model is proposed accordingly. In this algorithm, the parameters are estimated for each component sequentially. A suitable number of model components is determined by the level of the residual energy. Next, the residuals are compressed using entropy coding or quantization coding techniques. The corresponding compression ratios are presented. Finally, the proposed model-based compression scheme is compared with the linear predictive coding (LPC) algorithm and the distributed principal component analysis (DPCA) algorithm on a real seismic database. The performance of the proposed model-based scheme is shown to be superior to that of LPC and DPCA.
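
A rough sketch of the sequential estimation idea, peeling off one dominant sinusoid at a time until the residual energy is small; the stopping tolerance and component limit are illustrative, not the paper's values:

```python
import numpy as np

def fit_multitone_model(trace, fs, max_components=20, energy_tol=0.01):
    """Sequentially estimate sinusoid parameters; the residual is later entropy/quantization coded."""
    residual = np.asarray(trace, dtype=float).copy()
    n = np.arange(len(residual))
    total_energy = np.sum(residual ** 2)
    params = []
    for _ in range(max_components):
        spectrum = np.fft.rfft(residual)
        k = int(np.argmax(np.abs(spectrum)[1:]) + 1)       # dominant non-DC bin
        freq = k * fs / len(residual)
        amp = 2.0 * np.abs(spectrum[k]) / len(residual)
        phase = np.angle(spectrum[k])
        residual -= amp * np.cos(2 * np.pi * freq * n / fs + phase)
        params.append((amp, freq, phase))
        if np.sum(residual ** 2) < energy_tol * total_energy:
            break
    return params, residual
```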

6 citations


Journal ArticleDOI
28 Mar 2021
TL;DR: An individual voice recognition system was developed using a combination of the Linear Predictive Coding (LPC) method for feature extraction and K-Nearest Neighbor (K-NN) classification in the speech recognition process.
Abstract: Humans have a variety of characteristics that differ from one individual to another. These characteristics are genuine and can be used to distinguish one individual from another; one of them is the voice. Recognition of the voice is referred to as speech recognition. In this study, an individual voice recognition system was developed using a combination of the Linear Predictive Coding (LPC) method for feature extraction and K-Nearest Neighbor (K-NN) classification in the speech recognition process. Testing is done by varying several parameters, namely the LPC order, the number of frames, the value of K, and the distance method. The parameter combination test showed a fairly good accuracy of 73.56321839% with the combination of LPC order 8, number of frames 480, K = 5, and the Chebyshev distance method.
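
The LPC + K-NN combination with the Chebyshev distance can be sketched with scikit-learn; for brevity this computes one LPC vector per utterance rather than per frame, and the training arrays and file names are assumed placeholders:

```python
import librosa
from sklearn.neighbors import KNeighborsClassifier

def lpc_features(path, order=8, sr=8000):
    y, _ = librosa.load(path, sr=sr)
    return librosa.lpc(y, order=order)[1:]        # drop the leading 1.0 of the polynomial

# X_train / y_train: per-utterance LPC feature vectors and speaker labels (assumed prepared).
knn = KNeighborsClassifier(n_neighbors=5, metric="chebyshev")
# knn.fit(X_train, y_train)
# prediction = knn.predict([lpc_features("test_voice.wav")])   # hypothetical test file
```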

4 citations


Journal ArticleDOI
TL;DR: In this paper, an out-of-domain data augmentation approach based on formant and time-scale modification is proposed in order to reduce the acoustic mismatch due to differences in formant frequencies and speaking rate between the two groups of speakers.
Abstract: Developing an automatic speech recognition (ASR) system for children’s speech is extremely challenging due to the unavailability of data from the child domain for the majority of the languages. Consequently, in such zero-resource scenarios, we are forced to develop an ASR system using adults’ speech for transcribing data from child speakers. However, the acoustic mismatch due to differences in formant frequencies and speaking rate between the two groups of speakers results in poor recognition rates as reported in earlier works. To reduce the said mismatch, an out-of-domain data augmentation approach based on formant and time-scale modification is proposed in this work. For that purpose, formant frequencies of adults’ speech data are up-scaled using warping of linear predictive coding coefficients. Next, the speaking rate of adults’ speech data is decreased through time-scale modification. Due to simultaneous altering of formant frequencies and duration of adults’ speech and then pooling the modified data into training, the ill effects of the acoustic mismatch due to the aforementioned factors get reduced. This, in turn, enhances the recognition performance significantly. Additional improvement in recognition rate is obtained by combining the recently reported voice-conversion-based data augmentation technique with the proposed approach. As demonstrated by the experimental evaluations presented in this paper, compared to an adult data trained ASR system, a relative reduction of 37.6% in word error rate is achieved through data augmentation. Furthermore, the proposed approach yields large reductions in word error rates even under noisy test conditions.
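
The paper warps LPC coefficients to shift formants; a simpler resampling-based approximation of the same augmentation idea (raise formants, then slow the speaking rate) can be sketched with librosa, with scale factors chosen purely for illustration:

```python
import librosa

def augment_adult_utterance(y, sr, formant_scale=1.15, speaking_rate=0.85):
    """Approximately raise formants and slow down adult speech for child-ASR training."""
    # Resampling to sr/formant_scale and reinterpreting the result at sr multiplies all
    # frequencies (including formants) by formant_scale, but also shortens the signal.
    warped = librosa.resample(y, orig_sr=sr, target_sr=int(sr / formant_scale))
    # Undo the duration change introduced by the spectral warping step.
    warped = librosa.effects.time_stretch(warped, rate=1.0 / formant_scale)
    # Time-scale modification: rate < 1 lowers the speaking rate.
    return librosa.effects.time_stretch(warped, rate=speaking_rate)
```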

3 citations


Proceedings ArticleDOI
10 Jan 2021
TL;DR: In this paper, a block Kronecker compressed sensing (BKCS) algorithm is proposed to mitigate the mutual interference between two automotive radar systems in a 2D compressed sensing framework.
Abstract: In this paper, a computationally efficient approach called the block Kronecker compressed sensing (BKCS) algorithm is proposed to mitigate the mutual interference between two automotive radar systems in a 2-dimensional (2D) compressed sensing framework. Within the 2D framework, the received radar signals are jointly considered along both the fast-time and slow-time dimensions, so that the signal sparsity is better preserved than in the 1-dimensional (1D) case. Compared with conventional Kronecker compressed sensing, BKCS requires far fewer resources, i.e., storage and computational power. Its performance has been verified with simulations and real measurements. The numerical assessment has shown that BKCS overcomes the shortcomings of 1D CS methods and significantly outperforms classical signal reconstruction algorithms such as linear predictive coding as well.

3 citations


Proceedings ArticleDOI
25 Mar 2021
TL;DR: In this paper, the authors recognize three types of clustered phonemes from EEG signals using conventional speech recognition features, namely Mel-Frequency Cepstral Coefficients (MFCC) and Linear Predictive Coding (LPC), combined with Convolutional Neural Networks (CNN).
Abstract: Speech is a skill that most of the time we take for granted. In reality, this ability is a complex mechanism which requires thoughts to be translated into words, which are further transposed into sounds, a mechanism that involves precise coordination of several muscles and joints. In some cases, this complex mechanism can no longer be performed and may be accompanied by almost complete loss of motor activity, as in diseases such as stroke, locked-in syndrome, amyotrophic lateral sclerosis, cerebral palsy, etc. The most recent method that aims to supplement the speech mechanism is imaginary speech recognition using electroencephalographic (EEG) signals, employing complex computing mechanisms such as Deep Learning (DL) in order to decode the thoughts. In this paper we aim to recognize three types of clustered phonemes using conventional speech recognition techniques, namely Mel-Frequency Cepstral Coefficients (MFCC) and Linear Predictive Coding (LPC), combined with Convolutional Neural Networks (CNN). We compared four types of feature extraction: MFCC, LPC, MFCC + LPC combined into a 1-channel matrix, and MFCC + LPC combined into a 2-channel matrix. We showed that MFCC coefficients offer better accuracy than LPC and that concatenating MFCC and LPC into a 2-channel matrix yields better performance than combining them into a 1-channel matrix.
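
A sketch of assembling the 2-channel MFCC + LPC input described above, assuming a generic 1-D signal (the paper applies the idea to EEG channels); coefficient counts and frame sizes are illustrative:

```python
import numpy as np
import librosa

def two_channel_matrix(x, sr, n_coef=12, frame_len=512, hop=256):
    """Stack per-frame MFCC and LPC coefficients as a 2-channel CNN input."""
    mfcc = librosa.feature.mfcc(y=x, sr=sr, n_mfcc=n_coef, n_fft=frame_len, hop_length=hop)
    frames = librosa.util.frame(x, frame_length=frame_len, hop_length=hop)   # (frame_len, n_frames)
    lpc = np.stack([librosa.lpc(np.ascontiguousarray(f), order=n_coef)[1:]
                    for f in frames.T], axis=1)                               # (n_coef, n_frames)
    n = min(mfcc.shape[1], lpc.shape[1])
    return np.stack([mfcc[:, :n], lpc[:, :n]], axis=0)                        # (2, n_coef, n)
```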

Journal ArticleDOI
TL;DR: The experimental results showed that MFCC provides a better representation for Khasi speech compared with the other three spectral features.

Proceedings ArticleDOI
10 Jun 2021
TL;DR: In this paper, a new time-resolved spectral analysis method based on the Linear Prediction Coding (LPC) method was introduced for the study of the dynamics of EEG (Electroencephalography) activity.
Abstract: This paper introduces a new time-resolved spectral analysis method based on Linear Prediction Coding (LPC) that is particularly suited to the study of the dynamics of EEG (electroencephalography) activity. The spectral dynamics of EEG signals can be challenging to analyse as they contain multiple frequency components and are often corrupted by noise. The LPC Filtering (LPCF) method described here processes the LPC poles to generate a series of reduced-order filter transfer functions which can accurately estimate the dominant frequencies. The LPCF method is a parameterized time-frequency method that is suitable for identifying the dominant frequencies of multiple-component signals (e.g. EEG signals). We define bias and frequency resolution metrics to assess the ability of the LPCF method to estimate the frequencies. The experimental results show that the LPCF method reduces the bias of the LPC estimates in the low and high frequency bands and improves the frequency resolution. Furthermore, the LPCF method is less sensitive to the filter order and has a higher tolerance to noise compared to the LPC method. Finally, we apply the LPCF method to a real EEG signal, where it identifies the dominant frequency in each frequency band and significantly reduces the redundant estimates of the LPC method.
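
The core step of deriving reduced-order filters from the LPC poles can be sketched as follows: each conjugate pole pair is turned into a second-order all-pole section whose response peak gives a dominant-frequency estimate (filter order and frequency resolution are illustrative assumptions, not the paper's values):

```python
import numpy as np
import librosa
from scipy.signal import freqz

def pole_pair_responses(x, fs, order=10, n_freq=512):
    """Build a 2nd-order all-pole filter per LPC pole pair and return its frequency response."""
    a = librosa.lpc(np.ascontiguousarray(x, dtype=float), order=order)
    poles = np.roots(a)
    poles = poles[np.imag(poles) > 0]                     # one pole per conjugate pair
    responses = []
    for p in poles:
        denom = np.real(np.poly([p, np.conj(p)]))         # 1 - 2*Re(p) z^-1 + |p|^2 z^-2
        w, h = freqz(1.0, denom, worN=n_freq, fs=fs)
        peak_freq = w[np.argmax(np.abs(h))]               # dominant frequency of this section
        responses.append((peak_freq, w, h))
    return responses
```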

Journal ArticleDOI
TL;DR: A neural network-based voiced-unvoiced classification algorithm using 5 derived features as input has been constructed, and this selection of filter order based on signal statistics provides a bit reduction of 625 and 325 bps relative to 10th-order LPC and 7th-order Mel-LPC, respectively.
Abstract: This paper proposes a novel method to reduce the prediction filter order from 10 to 7 in the Code Excited Linear Prediction (CELP) coding framework by incorporating the psychoacoustic Mel scale into Linear Predictive Coding (Mel-LPC). Efficient quantization using 2-split Vector Quantization (VQ) for Mel-LPC obtained a reduction of 4 bits/frame and resulted in a total bit gain of 200 bps. A weighting scheme for the Euclidean distance measure gave a reduction of 6 bits/frame, which adds up to a total bit gain of 300 bps. A lower Mel-LPC order of 3 has been employed for unvoiced frames by using perceptual quality as the selection criterion, and an efficient VQ method using 5 bits is developed, which brought down the average bit requirement to 11.5 bits/frame. To incorporate this into the Mel-LPC-based CELP encoding scheme, a neural network-based voiced-unvoiced classification algorithm using 5 derived features as input has been constructed, and this selection of filter order based on signal statistics provides a bit reduction of 625 and 325 bps relative to 10th-order LPC and 7th-order Mel-LPC, respectively. In addition, the incorporation of Mel-LPC gives better performance in the estimation of formants.

Proceedings ArticleDOI
15 Jul 2021
TL;DR: In this article, the authors compared several classification algorithms to determine the effect of the number of features for Linear Predictive Coding and Linear Predictive Cepstral Coefficients on the averaged correct classification rate, in the context of audio signals, some of them used in healthcare applications, recorded by a service robot.
Abstract: The goal of this research is to compare several classification algorithms to determine the effect of the number of features for Linear Predictive Coding and Linear Predictive Cepstral Coefficients on the averaged correct classification rate, in the context of audio signals, some of them used in healthcare applications, recorded by a service robot. The standard deviation and the required computation time for every classifier are also illustrated. The best correct classification rate was obtained for Linear Predictive Cepstral Coefficients using Support Vector Machines, with 10-fold cross-validation.
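
The evaluation protocol (10-fold cross-validation of an SVM on cepstral features) is straightforward to sketch with scikit-learn; the feature files below are hypothetical placeholders for precomputed LPC/LPCC vectors and labels:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

X = np.load("lpcc_features.npy")    # one feature vector per audio clip (hypothetical file)
y = np.load("labels.npy")           # corresponding class labels (hypothetical file)

scores = cross_val_score(SVC(kernel="rbf"), X, y, cv=10)   # 10-fold cross-validation
print(f"average correct classification rate: {scores.mean():.3f} (+/- {scores.std():.3f})")
```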

Book ChapterDOI
01 Jan 2021
TL;DR: This study examines the effect of parameter variations, namely downsampling, frame size, overlapping percentage, the order of linear predictive coding, and the value of the first-order pre-emphasis filter, on the feature extraction techniques in order to obtain the best classification accuracy.
Abstract: Speech development usually involves the ability to perceive the dialectal architecture of a language, which is responsible for the production of sound and finally uttering the appropriate speech sentences for social communication. A deficit in any of these functions could lead to acoustic-prosodic impairments observed in many neurodevelopmental disabilities, such as intellectual disability. Therefore, an understanding of acoustic development can give significant insights into the cognitive development of individuals. In this direction, various research has been conducted exploring several acoustical parameters such as pitch, formants, energy, voice onset time, and time-domain features. Studies have shown that features like pitch and formant frequencies decrease as age progresses. The pitch of a male is lower than that of a female, and until the age of 5 years it is difficult to distinguish the pitch of a male voice from a female voice. From previous studies, it can be observed that feature extraction and classification methods play vital roles in this research field. Therefore, in this study, several feature extraction techniques, namely Mel-frequency cepstral coefficients (MFCC), linear predictive coding (LPC), linear predictive cepstral coefficients (LPCC), weighted LPCC, power spectral density, discrete cosine transform (DCT), and short-time Fourier transform (STFT), are used to extract the speech features. Furthermore, the effect of parameter variations, namely downsampling, frame size, overlapping percentage, the order of linear predictive coding, and the value of the first-order pre-emphasis filter, on the feature extraction techniques is examined for obtaining the best classification accuracy, and the correlation between speech features and different classification models is evaluated.
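
Two of the varied parameters, the first-order pre-emphasis coefficient and the frame overlap percentage, can be made concrete with a short sketch; the default values shown are typical choices, not the chapter's:

```python
import numpy as np

def preemphasize_and_frame(x, alpha=0.97, frame_len=400, overlap=0.5):
    """First-order pre-emphasis y[n] = x[n] - alpha*x[n-1], then overlapping Hamming frames."""
    y = np.append(x[0], x[1:] - alpha * x[:-1])
    hop = int(frame_len * (1.0 - overlap))
    n_frames = 1 + (len(y) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return y[idx] * np.hamming(frame_len)          # shape: (n_frames, frame_len)
```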

Journal ArticleDOI
TL;DR: This study discusses several ASP modeling techniques such as the Gaussian Mixture Model (GMM), Vector Quantization, and clustering algorithms, and finds that the MFCC and GMM methods can be considered the most successful techniques in the field of speaker recognition so far.
Abstract: Speaker recognition is defined as the process of recognizing a person by his/her voice through specific features extracted from his/her voice signal. Automatic speaker recognition (ASP) is a biometric authentication system. In the last decade, many advances in the speaker recognition field have been attained, along with many techniques in the feature extraction and modeling phases. In this paper, we present an overview of the most recent works in ASP technology. The study discusses several ASP modeling techniques such as the Gaussian Mixture Model (GMM), Vector Quantization (VQ), and clustering algorithms. Also, several feature extraction techniques such as Linear Predictive Coding (LPC) and Mel frequency cepstral coefficients (MFCC) are examined. Finally, as a result of this study, we found that the MFCC and GMM methods can be considered the most successful techniques in the field of speaker recognition so far.
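
The MFCC + GMM combination highlighted above can be sketched with scikit-learn: one Gaussian mixture per speaker, scored by average log-likelihood at test time (mixture size and the input dictionary are assumptions):

```python
from sklearn.mixture import GaussianMixture

def train_speaker_models(features_by_speaker, n_components=16):
    """Fit one diagonal-covariance GMM per speaker on that speaker's MFCC frames."""
    return {spk: GaussianMixture(n_components=n_components, covariance_type="diag").fit(feats)
            for spk, feats in features_by_speaker.items()}

def identify_speaker(models, mfcc_frames):
    """Pick the speaker whose GMM gives the highest average log-likelihood."""
    return max(models, key=lambda spk: models[spk].score(mfcc_frames))
```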

DOI
24 Apr 2021
TL;DR: This thesis aims to apply artificial neural networks to voice recognition and create programs that simulate this method using Matlab 7.1 software.
Abstract: The voice is one of the unique and distinguishable characteristics of the human body. Voice recognition is a biometric technology that does not require much cost or specialized equipment. One of the techniques for speech recognition uses artificial neural networks, a method whose working principle is similar to that of the human brain. This thesis aims to apply artificial neural networks to voice recognition and to create programs that simulate this method using Matlab 7.1 software. The data used, in the form of sound recordings, are converted into numerical values with the Linear Predictive Coding process. The steps in Linear Predictive Coding include pre-emphasis, frame blocking, windowing, autocorrelation analysis, LPC analysis, and conversion of the LPC parameters to cepstral coefficients. These cepstral coefficients form the series of observations used as input to the artificial neural network and are also used for the training and testing process. In this research, the artificial neural network architecture used is Learning Vector Quantization. The Learning Vector Quantization network was trained on 35 voice samples, with a learning rate of 0.01, a maximum epoch of 100, an alpha decrement of 0.02, and a minimum alpha of 0.00001. Validation tests on 15 voice samples led to the conclusion that 73.34% of all validation samples were successfully recognized.
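
The LPC-to-cepstrum conversion mentioned in the processing chain follows a standard recursion; sign conventions vary between textbooks, so the sketch below is one common variant assuming the polynomial form [1, a1, ..., ap]:

```python
import numpy as np

def lpc_to_cepstrum(a, n_ceps):
    """Convert LPC polynomial coefficients [1, a1, ..., ap] to cepstral coefficients."""
    p = len(a) - 1
    c = np.zeros(n_ceps)
    for m in range(1, n_ceps + 1):
        acc = -a[m] if m <= p else 0.0
        for k in range(1, m):
            if m - k <= p:
                acc -= (k / m) * c[k - 1] * a[m - k]
        c[m - 1] = acc
    return c
```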

Patent
18 Mar 2021
TL;DR: In this article, a method for encoding an audio signal is presented, comprising using one or more algorithms operating on a processor to filter the audio signal into two output signals, wherein each output signal has a sampling rate equal to the sampling rate of the audio signal, and wherein one of the output signals includes high frequency data.
Abstract: A method for encoding an audio signal, comprising using one or more algorithms operating on a processor to filter the audio signal into two output signals, wherein each output signal has a sampling rate that is equal to a sampling rate of the audio signal, and wherein one of the output signals includes high frequency data. Using one or more algorithms operating on the processor to window the high frequency data by selecting a set of the high frequency data. Using one or more algorithms operating on the processor to determine a set of linear predictive coding (LPC) coefficients for the windowed data. Using one or more algorithms operating on the processor to generate energy scale values for the windowed data. Using one or more algorithms operating on the processor to generate an encoded high frequency bitstream.

Proceedings ArticleDOI
15 Jul 2021
TL;DR: In this article, four types of acoustic features (i.e., Mel-Frequency Cepstral Coefficients (MFCCs), Linear Predictive Coding (LPC), Zero-crossing-rate (ZCR) and root mean square value (RMSV) values) were extracted from audio signals for automatic scene classification.
Abstract: The goal of acoustic scene classification (ASC) is to automatically classify the environment based on an audio recording of the situation. It has become a significant but difficult issue in audio processing, allowing for a vast variety of successive applications such as security, social care, and context-aware facilities. Recently, Convolutional Neural Networks (CNNs) have been broadly applied to speech and audio classification, yielding promising results. In this paper, we have extracted four types of acoustic features, i.e., Mel-Frequency Cepstral Coefficients (MFCCs), Linear Predictive Coding (LPC), zero-crossing rate (ZCR), and root mean square value (RMSV), from audio signals for automatic scene classification. We have used the Litis Rouen dataset for CNN model training and testing. Experimental results show that the proposed method achieves a significant accuracy of 84.15% on the Litis Rouen dataset. Moreover, we have also compared the performance of the proposed method with other ASC techniques used in the literature.

Posted Content
TL;DR: In this article, the authors propose a Data over Voice (DoV) technique based on codebooks of short harmonic waveforms. The proposed method relies on general principles of Linear Predictive Coding for voice compression (LPC voice coding) and is more versatile compared to solutions trained on exact channel models.
Abstract: The current increasing need for privacy-preserving voice communications is leading to new ideas for securing voice transmission. This paper refers to a relatively new concept of sending encrypted data or speech as pseudo-speech in the audio domain over existing voice communication infrastructures, such as 3G cellular networks and Voice over IP (VoIP). The distinctive characteristic of such a communication system is that it relies on the robust transmission of binary information in the form of an audio signal. This work presents a novel Data over Voice (DoV) technique based on codebooks of short harmonic waveforms. The technique provides a sufficiently fast and reliable data rate over cellular networks and many VoIP applications. The new method relies on general principles of Linear Predictive Coding for voice compression (LPC voice coding) and is more versatile compared to solutions trained on exact channel models. The technique gives, by design, high control over the desired transmission rate and provides robustness to channel distortion. In addition, an efficient codebook design approach inspired by quaternary error correcting codes is proposed. The usability of the proposed DoV technique for secure voice communication over cellular networks and VoIP has been successfully validated by empirical experiments. The paper details the system parameters, putting special emphasis on the system's security and technical challenges.


DOI
24 Aug 2021
TL;DR: A multilingual analysis of emotion extraction in the Turkish and English languages is proposed, with MFCC, Mel spectrogram, Linear Predictive Coding, and PLP-RASTA techniques used to extract acoustic features.
Abstract: Emotion extraction and detection are considered complex tasks due to the nature of the data and the subjects involved in the acquisition of sentiments. Speech analysis becomes a critical gateway in deep learning, where acoustic features are trained to obtain more accurate descriptors that disentangle sentiments and customs in natural language. Speech feature extraction varies with the quality of the audio records and linguistic properties. The nature of speech is handled through a broad spectrum of emotions depending on the age, gender, and social context of subjects. Speech emotion analysis has been fostered in the English and German languages through multilevel corpora. The emotion features disseminate the acoustic analysis in videos or texts. In this study, we propose a multilingual analysis of emotion extraction using the Turkish and English languages. MFCC (Mel-Frequency Cepstrum Coefficients), Mel spectrogram, Linear Predictive Coding (LPC), and PLP-RASTA techniques are used to extract acoustic features. Three different data sets are analyzed using a feed-forward neural network hierarchy. Different emotion states such as happy, calm, sad, and angry are compared in bilingual speech records. The accuracy and precision metrics reach levels higher than 80%. Emotion classification in the Turkish language is concluded to be more accurate with respect to the speech features.

04 Jun 2021
TL;DR: In this paper, several methods are proposed for extracting differential features from an audio signal in order to classify the speaker, such as Power Spectral Density (PSD), Short-Term Energy (STE), Fast Fourier Transform (FFT), Hu's Seven Moment Invariants method (HSMI), Mel Frequency Cepstrum Coefficients (MFCC), cross-correlation estimates of MFCC (XCORR), and Linear Predictive Coding (LPC).
Abstract: The primary challenge in identifying speakers is extracting recognition features from speech signals to optimize the performance of classification algorithms. Several methods are proposed in this article for extracting differential features from an audio signal in order to classify the speaker. The following methods were used to obtain the features of the audio signal in this approach: Power Spectral Density (PSD), Short-Term Energy (STE), Fast Fourier Transform (FFT), Hu's Seven Moment Invariants method (HSMI), Mel Frequency Cepstrum Coefficients (MFCC), cross-correlation estimates of MFCC (XCORR), and Linear Predictive Coding (LPC). The classification methods in this paper are the artificial neural network (ANN), Euclidean distance, and autocorrelation; the results obtained from the experiments showed that the accuracy rate is more than 96%.

19 Aug 2021
TL;DR: In this paper, the authors used a Backpropagation Neural Network (BPNN) to recognize spoken words, achieving 100% accuracy for trained data, 80% for the same respondents on untrained data, and 67.5% for new respondents.
Abstract: Technological developments in the world have no boundaries; one of them is speech recognition. At first, words spoken by humans cannot be recognized by computers; to be recognizable, the word must be processed using a specific method. The Linear Predictive Coding (LPC) method is used in this research to extract the characteristics of speech. The output of the LPC method is a set of LPC coefficients whose number equals the LPC order plus 1. The LPC coefficients are processed using a 512-point Fast Fourier Transform (FFT) to simplify the speech recognition process. The results are then trained using a Backpropagation Neural Network (BPNN) to recognize the spoken word. Speech recognition in the program is implemented as a motion controller for animated objects on the computer. The end result of this research is that animated objects move in accordance with the spoken word. The optimal BPNN structure in this research uses the traingda training function, 3 nodes, a learning rate of 0.05, 1000 epochs, and a performance goal of 0.00001. This structure produces the smallest MSE value, 0.000009957, and can recognize words with 100% accuracy for trained data, 80% for the same respondents on untrained data, and 67.5% for new respondents.

Posted Content
TL;DR: In this article, a convolutional neural network (CNN) performs encoding and decoding as its feedforward routine, where quantization and entropy coding as a trainable module, so the coding artifacts and bitrate control are handled during the optimization process.
Abstract: This work presents a scalable and efficient neural waveform codec (NWC) for speech compression. We formulate the speech coding problem as an autoencoding task, where a convolutional neural network (CNN) performs encoding and decoding as its feedforward routine. The proposed CNN autoencoder also defines quantization and entropy coding as a trainable module, so the coding artifacts and bitrate control are handled during the optimization process. We achieve efficiency by introducing compact model architectures to our fully convolutional network model, such as gated residual networks and depthwise separable convolution. Furthermore, the proposed models adopt a scalable architecture, cross-module residual learning (CMRL), to cover a wide range of bitrates. To this end, we employ the residual coding concept to concatenate multiple NWC autoencoding modules, where an NWC module performs residual coding to restore any reconstruction loss that its preceding modules have created. CMRL can scale down to cover lower bitrates as well, for which it employs a linear predictive coding (LPC) module as its first autoencoder. We redefine LPC's quantization as a trainable module to enhance the bit allocation tradeoff between LPC and its following NWC modules. Compared to other autoregressive decoder-based neural speech coders, our decoder has a significantly smaller architecture, e.g., with only 0.12 million parameters, more than 100 times smaller than a WaveNet decoder. Compared to the LPCNet-based speech codec, which leverages the speech production model to reduce the network complexity at low bitrates, ours can scale up to higher bitrates to achieve transparent performance. Our lightweight neural speech coding model achieves comparable subjective scores against AMR-WB at the low bitrate range and provides transparent coding quality at 32 kbps.