
Showing papers on "Linear predictive coding published in 2019"


Journal ArticleDOI
TL;DR: In this article, the authors proposed a DNN structure that is provided with the signal surrounding the gap in the form of time-frequency (TF) coefficients; two DNNs with either complex-valued or magnitude TF coefficient output were studied by separately training them on inpainting two types of audio signals (music and musical instruments) with 64-ms gaps.
Abstract: In this article, we study the ability of deep neural networks (DNNs) to restore missing audio content based on its context, i.e., inpaint audio gaps. We focus on a condition which has not received much attention yet: gaps in the range of tens of milliseconds. We propose a DNN structure that is provided with the signal surrounding the gap in the form of time-frequency (TF) coefficients. Two DNNs with either complex-valued TF coefficient output or magnitude TF coefficient output were studied by separately training them on inpainting two types of audio signals (music and musical instruments) having 64-ms long gaps. The magnitude DNN outperformed the complex-valued DNN in terms of signal-to-noise ratios and objective difference grades. Although, for instruments, a reference inpainting obtained through linear predictive coding performed better in both metrics, it performed worse than the magnitude DNN for music. This demonstrates the potential of the magnitude DNN, in particular for inpainting signals that are more complex than single instrument sounds.
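
As a concrete picture of the LPC reference method mentioned above, a minimal sketch in Python/NumPy might fit an all-pole predictor to the context before the gap and run it forward. The sample rate, model order, and test signal here are illustrative assumptions, not the paper's setup.

```python
import numpy as np

def lpc(x, order):
    """LPC via the Levinson-Durbin recursion on the autocorrelation of x."""
    r = np.correlate(x, x, mode="full")[len(x) - 1 : len(x) + order]
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[1:i][::-1])
        k = -acc / err
        a[1:i] += k * a[1:i][::-1]
        a[i] = k
        err *= 1.0 - k * k
    return a

def extrapolate(context, a, n):
    """Run the all-pole predictor forward to synthesize n samples."""
    p = len(a) - 1
    buf = list(context[-p:])
    out = []
    for _ in range(n):
        pred = -float(np.dot(a[1:], buf[::-1]))   # x[n] = -sum_k a[k] x[n-k]
        out.append(pred)
        buf = buf[1:] + [pred]
    return np.array(out)

sr = 16000
t = np.arange(2 * sr) / sr
signal = np.sin(2 * np.pi * 220 * t) + 0.3 * np.sin(2 * np.pi * 440 * t)
context = signal[:sr]                            # signal before the gap
a = lpc(context, order=32)
gap = extrapolate(context, a, int(0.064 * sr))   # fill a 64-ms gap
```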

55 citations


Proceedings ArticleDOI
15 Sep 2019
TL;DR: In this paper, a cross-module residual learning (CMRL) pipeline is proposed as a module carrier with each module reconstructing the residual from its preceding modules; it shows better objective performance than AMR-WB and better subjective quality at high bitrates than AMR-WB and OPUS.
Abstract: Speech codecs learn compact representations of speech signals to facilitate data transmission. Many recent deep neural network (DNN) based end-to-end speech codecs achieve low bitrates and high perceptual quality at the cost of model complexity. We propose a cross-module residual learning (CMRL) pipeline as a module carrier with each module reconstructing the residual from its preceding modules. CMRL differs from other DNN-based speech codecs in that, rather than modeling the speech compression problem in a single large neural network, it optimizes a series of less-complicated modules in a two-phase training scheme. The proposed method shows better objective performance than AMR-WB and the state-of-the-art DNN-based speech codec with a similar network architecture. As an end-to-end model, it takes raw PCM signals as input, but is also compatible with linear predictive coding (LPC), showing better subjective quality at high bitrates than AMR-WB and OPUS. The gain is achieved using only 0.9 million trainable parameters, a significantly less complex architecture than the other DNN-based codecs in the literature.
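
The residual-cascade idea can be illustrated with a toy NumPy sketch in which coarse uniform quantizers stand in for CMRL's trained neural modules: each module codes what its predecessors left behind, and the decoder sums all module outputs. Step sizes are illustrative.

```python
import numpy as np

def quantize(x, step):
    """Stand-in for one trained coding module: a uniform scalar quantizer."""
    return np.round(x / step) * step

def cmrl_like(x, steps):
    """Each module codes the residual left by its predecessors; the
    decoder sums every module's reconstruction."""
    residual = x.copy()
    recon = np.zeros_like(x)
    for step in steps:
        y = quantize(residual, step)   # this module's contribution
        recon += y
        residual -= y                  # pass the new residual onward
    return recon

rng = np.random.default_rng(0)
x = rng.standard_normal(1024)
recon = cmrl_like(x, steps=[0.5, 0.1, 0.02])
print("SNR (dB):", 10 * np.log10(np.sum(x**2) / np.sum((x - recon)**2)))
```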

33 citations


Journal ArticleDOI
TL;DR: A network architecture is proposed, which allows model adaptation to different formant frequency ranges that were not seen at training time and which compares favorably with alternative methods for formant estimation and tracking.
Abstract: Formant frequency estimation and tracking are among the most fundamental problems in speech processing. In the estimation task, the input is a stationary speech segment such as the middle part of a vowel, and the goal is to estimate the formant frequencies, whereas in the tracking task the input is a series of speech frames, and the goal is to track the trajectory of the formant frequencies throughout the signal. The use of supervised machine learning techniques trained on an annotated corpus of read speech for these tasks is proposed. Two deep network architectures were evaluated for estimation, feed-forward multilayer perceptrons and convolutional neural networks, and, correspondingly, two architectures for tracking, recurrent and convolutional recurrent networks. The inputs to the former are composed of linear predictive coding–based cepstral coefficients with a range of model orders and pitch-synchronous cepstral coefficients, whereas the inputs to the latter are raw spectrograms. The performance of the methods compares favorably with alternative methods for formant estimation and tracking. A network architecture is further proposed, which allows model adaptation to different formant frequency ranges that were not seen at training time. The adapted networks were evaluated on three datasets, and their performance was further improved.
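
For reference, the classical LPC-based formant estimation that such learned models are compared against can be sketched as follows: fit an LPC polynomial, take its complex roots, and read formant candidates off the root angles. The order, thresholds, and toy vowel-like frame are assumptions.

```python
import numpy as np
import librosa
from scipy.signal import lfilter

def lpc_formants(frame, sr, order=12):
    """Classical LPC formant estimation: angles of LPC polynomial roots."""
    a = librosa.lpc(frame.astype(float), order=order)
    roots = np.roots(a)
    roots = roots[np.imag(roots) > 0]                  # one root per conjugate pair
    freqs = np.angle(roots) * sr / (2 * np.pi)
    bands = -(sr / np.pi) * np.log(np.abs(roots))      # 3-dB bandwidths
    cands = sorted(f for f, b in zip(freqs, bands) if f > 90 and b < 400)
    return cands[:3]                                   # F1, F2, F3

# Toy vowel-like frame: white noise through two resonators near 700/1200 Hz.
sr = 8000
rng = np.random.default_rng(0)
frame = rng.standard_normal(2048)
for f0, r in [(700, 0.98), (1200, 0.97)]:
    th = 2 * np.pi * f0 / sr
    frame = lfilter([1.0], [1.0, -2 * r * np.cos(th), r * r], frame)
print(lpc_formants(frame, sr))
```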

28 citations


Proceedings ArticleDOI
01 Feb 2019
TL;DR: The results show that system accuracy does not drop when the number of speakers increases from 10 to 20, provided the combination of LPC, MFCC, and ZCR features is used with an ANN classifier.
Abstract: Speaker recognition is a biometric technique which uses an individual speaker's voice samples as input for recognition. The main goal of this work is to obtain better accuracy for a speaker recognition system over a large voice database. In this paper, a comparative study is made between various combinations of features for a speaker identification system with a feedforward artificial neural network (FFANN) and a support vector machine for 10 and 20 speakers. Linear predictive coding, Mel frequency cepstral coefficients, and zero crossing rate are used as feature extraction techniques. Each feature is tested separately and in combination with the FFANN and SVM classifiers in Matlab. For the ANN classifier, 70% of the database is used for training, 15% for validation, and the remaining 15% for testing; the network uses 2 hidden layers with 80 neurons. The results show that system accuracy does not drop when the number of speakers increases from 10 (320 voice samples) to 20 (640 voice samples) when the combination of LPC, MFCC, and ZCR features is used with the ANN classifier.
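
A sketch of the winning feature combination, assuming librosa for feature extraction; the feature dimensions and the per-utterance averaging are assumptions, and MLPClassifier is mentioned only as one possible stand-in for the paper's FFANN.

```python
import numpy as np
import librosa

def speaker_features(y, sr):
    """Concatenate per-utterance LPC, MFCC, and ZCR features."""
    lpc = librosa.lpc(y, order=12)[1:]                        # 12 LPC coefficients
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).mean(axis=1)
    zcr = librosa.feature.zero_crossing_rate(y).mean()
    return np.concatenate([lpc, mfcc, [zcr]])

# Placeholder utterance; real use would load recordings per speaker and feed
# the vectors to e.g. sklearn's MLPClassifier(hidden_layer_sizes=(80, 80)).
sr = 16000
y = np.random.default_rng(0).standard_normal(sr).astype(np.float64)
print(speaker_features(y, sr).shape)   # (26,)
```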

28 citations


Proceedings ArticleDOI
01 Dec 2019
TL;DR: A suitable set of voice features and classifiers is determined to detect voice disability with high accuracy; the results show that 100% accuracy can be achieved provided proper voice features and a suitable classifier algorithm are used.
Abstract: Voice signal processing is a popular tool to detect pathological voice in children. Voice features are first extracted from voice samples and then classifiers are used to discriminate pathological voices from normal voices. However, there is no consensus among researchers about which voice features and classifier algorithms provide high accuracy. The main contribution of this paper is to determine a suitable set of voice features and classifiers to detect voice disability with high accuracy. In contrast to other existing works, several discriminative voice features including peaks, pitch, linear predictive coding (LPC) coefficients, jitter, shimmer, formants, Mel frequency cepstral coefficients (MFCCs), relative spectral amplitude (RASTA), and perceptual linear prediction (PLP) have been used. We use several classifier algorithms to discriminate pathological voices from healthy ones and compare their performance. The results show that an accuracy of 100% can be achieved provided proper voice features and a suitable classifier algorithm are used.
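
The classifier-comparison protocol can be sketched with scikit-learn; here a random matrix stands in for the extracted voice features, and the particular classifiers are illustrative choices rather than the paper's exact set.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.standard_normal((120, 30))   # placeholder: rows = voice samples, columns = features
y = rng.integers(0, 2, 120)          # placeholder labels: pathological vs. healthy

classifiers = {
    "SVM (RBF)": SVC(kernel="rbf"),
    "k-NN": KNeighborsClassifier(n_neighbors=5),
    "random forest": RandomForestClassifier(n_estimators=200, random_state=0),
}
for name, clf in classifiers.items():
    scores = cross_val_score(clf, X, y, cv=5)   # 5-fold cross-validated accuracy
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```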

11 citations


Journal ArticleDOI
TL;DR: The results showed that F1 increased with the peak area of the time-varying glottis, while F2 and F3 were not systematically affected, and the effect of the peak glottal area on F1 was strongest for close-mid to close vowels, and more moderate for mid to open vowels.
Abstract: The estimation of formant frequencies from acoustic speech signals is mostly based on Linear Predictive Coding (LPC) algorithms. Since LPC is based on the source-filter model of speech production, the formant frequencies obtained are often implicitly regarded as those for an infinite glottal impedance, i.e., a closed glottis. However, previous studies have indicated that LPC-based formant estimates of vowels generated with a realistically varying glottal area may substantially differ from the resonances of the vocal tract with a closed glottis. In the present study, the deviation between closed-glottis resonances and LPC-estimated formants during phonation with different peak glottal areas has been systematically examined, both by using physical vocal tract models excited with a self-oscillating rubber model of the vocal folds and by computer simulations of interacting source and filter models. Ten vocal tract resonators representing different vowels have been analyzed. The results showed that F1 increased with the peak area of the time-varying glottis, while F2 and F3 were not systematically affected. The effect of the peak glottal area on F1 was strongest for close-mid to close vowels, and more moderate for mid to open vowels.

10 citations


Journal ArticleDOI
TL;DR: Experimental results demonstrate that the proposed method performs better than the existing steganalysis methods for detecting multiple steganographies in the AbS-LPC low-bit-rate compressed speech.
Abstract: Analysis-by-synthesis linear predictive coding (AbS-LPC) is widely used in a variety of low-bit-rate speech codecs. The existing steganalysis methods for AbS-LPC low-bit-rate compressed speech steganography are specifically designed for one certain category of steganography methods, thus lacking generalization capability. In this paper, a common method for detecting multiple steganographies in low-bit-rate compressed speech based on a code element Bayesian network is proposed. In an AbS-LPC low-bit-rate compressed speech stream, spatiotemporal correlations exist between the code elements, and steganography will eventually change the values of these code elements. Thus, the method presented in this paper is developed from the code element perspective. It consists of constructing a code element Bayesian network based on the strong correlations between code elements, learning the network parameters by utilizing a Dirichlet distribution as the prior distribution, and finally implementing steganalysis based on Bayesian inference. Experimental results demonstrate that the proposed method performs better than the existing steganalysis methods for detecting multiple steganographies in the AbS-LPC low-bit-rate compressed speech.
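
As a much-simplified illustration of two ingredients above (code-element correlations and Dirichlet-smoothed parameter learning), the following NumPy sketch models successive code elements as a first-order chain; the paper's actual model is a Bayesian network over many code elements, and the random streams here are placeholders.

```python
import numpy as np

def learn_transitions(streams, n_values, alpha=1.0):
    """Estimate P(next | current) with a symmetric Dirichlet(alpha) prior."""
    counts = np.full((n_values, n_values), alpha)   # prior pseudo-counts
    for s in streams:
        for cur, nxt in zip(s[:-1], s[1:]):
            counts[cur, nxt] += 1
    return counts / counts.sum(axis=1, keepdims=True)

def log_likelihood(stream, P):
    return float(sum(np.log(P[cur, nxt]) for cur, nxt in zip(stream[:-1], stream[1:])))

# Placeholder cover streams; real input would be code elements parsed from an
# AbS-LPC bitstream. A stream whose log-likelihood under the cover model is
# unusually low would be flagged as suspicious.
rng = np.random.default_rng(1)
cover = [list(rng.integers(0, 8, 500)) for _ in range(50)]
P = learn_transitions(cover, n_values=8)
print(log_likelihood(cover[0], P))
```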

6 citations


Proceedings ArticleDOI
01 Jan 2019
TL;DR: A wavelet-based method using filter bank concepts is presented to estimate pitch and formant frequencies; pitch is the fundamental frequency of the speech signal, and formants are the resonance frequencies of the vocal tract.
Abstract: Extraction of pitch and formant frequencies is an important issue in speech processing. Pitch frequency is the fundamental frequency of the speech signal, and formant frequencies are essentially resonance frequencies of the vocal tract. These frequencies vary among persons and words, but they remain within certain frequency ranges. In practice, the first three formants are enough for coding and other processes. The most common methods for estimating formants are the cepstrum and linear predictive coding. In this study, a wavelet-based method using filter bank concepts is presented to estimate these frequencies.

6 citations


Journal ArticleDOI
TL;DR: This paper proposes lossless linear predictive coding based on the directionality of the interference patterns of a hologram, using differential pulse code modulation (DPCM) and deriving an appropriate compression scheme for the hologram by comparing DPCM with linear predictive coding.
Abstract: This paper proposes lossless linear predictive coding based on the directionality of the interference patterns of a hologram. We approached this study from two aspects. First, to determine the directionality of the interference patterns, we performed differential pulse code modulation (DPCM), segmenting the interference patterns into n blocks and scanning the pixels in eight directions for each block. We then determined the direction with minimum entropy by calculating the entropy in each direction, and encoded the difference by DPCM in that direction. In the second approach, we attempted linear prediction using the prediction coefficients for the direction determined by the first process. In this case, DPCM was utilized only to determine the direction in which to carry out prediction of the original pixel. We then calculated the difference between the predicted and original values and encoded it. Through this procedure, we derived an appropriate compression scheme for the hologram by comparing DPCM with linear predictive coding. Experimental results showed that a compression rate of 26.7% could be obtained through the first process.
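
A small NumPy sketch of the first stage: per-block DPCM in candidate directions, with the minimum-entropy direction selected. Only two of the eight directions are shown, and the synthetic block is a placeholder.

```python
import numpy as np

def entropy(residual):
    _, counts = np.unique(residual, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def dpcm(block, direction):
    """Residual of predicting each pixel from its neighbour in `direction`."""
    axis = 1 if direction == "horizontal" else 0
    pred = np.roll(block, 1, axis=axis)
    if axis == 1:
        pred[:, 0] = 0     # first column has no left neighbour
    else:
        pred[0, :] = 0     # first row has no neighbour above
    return block.astype(int) - pred.astype(int)

def best_direction(block, directions=("horizontal", "vertical")):
    residuals = {d: dpcm(block, d) for d in directions}
    best = min(residuals, key=lambda d: entropy(residuals[d]))
    return best, residuals[best]

# Block with horizontal structure: horizontal DPCM should win.
rng = np.random.default_rng(0)
block = np.cumsum(rng.integers(0, 3, (16, 16)), axis=1)
print(best_direction(block)[0])   # -> "horizontal"
```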

5 citations


Journal ArticleDOI
TL;DR: The proposed framework for recognition of indoor human activity has been extensively validated on benchmark ADL datasets, showing that the methodology is robust and attains a more precise human activity recognition rate than current methodologies.
Abstract: In this paper, we introduce a novel approach for recognition of activities of daily living (ADL), the activities that human beings perform in daily life. At the object level, we use a computational color model for efficient object segmentation and tracking to handle dynamic background change in indoor environments. To make it computationally efficient, the cosine of the angle between the expected and current image color vectors is used. At the feature level, we present linear predictive coding of the histogram of directional derivatives as a spatio-temporal descriptor. The proposed descriptor describes the local object shape and appearance within cuboids effectively and distinctively. A multiclass support vector machine is used to classify the human activities. The proposed framework for recognition of indoor human activity has been extensively validated on benchmark ADL datasets, showing that the methodology is robust and attains a more precise human activity recognition rate than current methodologies.

Book ChapterDOI
01 Jan 2019
TL;DR: The results for the selected feature vectors show that the emotion recognition rate is satisfactory when multilingual speech material is used for training and testing, and that the features extracted from speech display a close dependency on the spoken language.
Abstract: Emotion recognition from the speech signal has become more and more important in advanced human-machine applications. The detailed description of emotions and their detection play an important role in psychiatric studies but also in other fields of medicine such as anamnesis, clinical studies, and lie detection. In this paper some experiments using multilingual emotional databases are presented. As features extracted from the speech material, the LPC (linear predictive coding), LPCC (linear predictive cepstral coefficients), and MFCC (Mel frequency cepstral coefficients) coefficients are employed. The Weka tool was used for the classification task, selecting the k-NN (k-nearest neighbors) and SVM (support vector machine) classifiers. The results for the selected feature vectors show that the emotion recognition rate is satisfactory when multilingual speech material is used for training and testing. When training uses emotional material from one language and testing uses material from another language, the results are poor. This shows that the features extracted from speech display a close dependency on the spoken language.
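
The LPCC features mentioned above are derived from LPC coefficients by a standard recursion; with the convention A(z) = 1 + a1·z^-1 + ... (as returned by common LPC routines), c[n] = -a[n] - Σ_{k=1}^{n-1} (k/n)·c[k]·a[n-k]. A minimal sketch:

```python
import numpy as np

def lpc_to_lpcc(a, n_ceps):
    """a = [1, a1, ..., ap], e.g. from librosa.lpc; returns c[1..n_ceps]."""
    p = len(a) - 1
    c = np.zeros(n_ceps + 1)
    for n in range(1, n_ceps + 1):
        acc = a[n] if n <= p else 0.0          # a[n] term vanishes beyond order p
        for k in range(max(1, n - p), n):
            acc += (k / n) * c[k] * a[n - k]
        c[n] = -acc
    return c[1:]

# Sanity check on a one-pole model 1/(1 - 0.9 z^-1): c[n] = 0.9**n / n.
print(lpc_to_lpcc(np.array([1.0, -0.9]), 4))
```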

Journal Article
TL;DR: This paper focuses on speech recognition techniques such as LPC (linear predictive coding), MFCC (Mel-frequency cepstral coefficients) with hidden Markov models, LPCC (linear predictive cepstral coding), and RASTA, and compares these techniques to find the most accurate and efficient way to recognize speech.
Abstract: This paper focuses on speech recognition techniques such as LPC (linear predictive coding), MFCC (Mel-frequency cepstral coefficients) with hidden Markov models, LPCC (linear predictive cepstral coding), and RASTA, and compares these techniques to find the most accurate and efficient way to recognize speech. Speech recognition is the process in which a program or machine identifies words or phrases and converts them to a machine-readable format. Additionally, this paper also covers NLP (natural language processing) techniques used with the speech recognition process. Once the speech signal is converted to text, NLP is used to understand and generate what has been said. NLU (natural language understanding) and NLG (natural language generation) are two important steps in NLP; through this paper we compare and analyze techniques to find out which can be used with speech recognition for effective results. The objective of this paper is to identify the best technique currently in use.

DOI
24 Oct 2019
TL;DR: This research develops a speech recognition system that converts spoken words to text and is able to recognize words with an accuracy rate of 71.875%, transforming them into text form properly.
Abstract: Artificial intelligence technology is developing very rapidly, and various fields have applied it to help human work. The speech recognition system is one of the artificial intelligence technologies that is widely applied in various fields. However, some research has shown that it is still necessary to develop better methods for speech recognition, and systems that provide practical benefits, such as text recording, need to be developed. Based on this, the research focuses on developing a speech recognition system that takes spoken words and converts them to text. Features are extracted from the recorded words using the linear predictive coding method. The characteristic features of each sound are then trained and tested using the Support Vector Machine (SVM) method to recognize the words and convert them into text. The evaluation results show that this system is able to recognize words with an accuracy rate of 71.875%. This percentage indicates that the system is able to recognize spoken words and transform them into text form properly.
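
The LPC-plus-SVM pipeline can be sketched with librosa and scikit-learn; the random signals and labels below are placeholders for real recorded words.

```python
import numpy as np
import librosa
from sklearn.svm import SVC

def lpc_features(y, order=12):
    return librosa.lpc(y, order=order)[1:]   # drop the leading 1

# Placeholder "recordings": random signals standing in for spoken words.
rng = np.random.default_rng(0)
X = np.array([lpc_features(rng.standard_normal(4000)) for _ in range(40)])
labels = rng.integers(0, 4, 40)              # placeholder word labels
clf = SVC(kernel="rbf").fit(X, labels)
print("training accuracy:", clf.score(X, labels))
```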

Proceedings ArticleDOI
01 Dec 2019
TL;DR: The experimental results show that the SS-VAD algorithm significantly improves the signal-to-noise ratio (SNR) of speech under various degraded conditions; combined with LPC, encoding the enhanced speech yields better audibility and intelligibility than encoding the noisy speech.
Abstract: In this paper, enhancement of noisy speech data and encoding of corrupted and enhanced speech data are presented. An algorithm is proposed which is an amalgamation of spectral subtraction with voice activity detection (SS-VAD) and linear predictive coding (LPC) for speech enhancement and encoding. Firstly, the significance of the SS-VAD and LPC methods is studied in detail for various types of noise. In the SS-VAD technique, the noisy speech data is treated as an input signal that is a combination of clean speech and a noise model. The corrupted speech is windowed using a Hanning window into 20-ms frames with 50% overlap. The output of SS-VAD is given as input to the LPC encoder. Coefficients are extracted from the input speech to design all-pole filters. Cross-correlation is also used to differentiate voiced from unvoiced samples at the analysis step. The pitch information and extracted coefficients are used at the synthesis step. Experiments are conducted for speech degraded by background noise, F16 noise, factory noise, white noise, and car noise. The experimental results show that the SS-VAD algorithm significantly improves the signal-to-noise ratio (SNR) of speech under various degraded conditions. The SS-VAD algorithm is therefore combined with LPC for encoding of enhanced speech, which yields better audibility and intelligibility compared to encoding of noisy speech.
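
A minimal NumPy sketch of the spectral-subtraction front end with 20-ms Hanning frames and 50% overlap; a noise estimate from assumed-silent leading frames stands in for the paper's VAD, and the toy input is an assumption.

```python
import numpy as np

def spectral_subtract(noisy, sr, n_noise_frames=10):
    frame = int(0.020 * sr)                 # 20-ms frames, as in the paper
    hop = frame // 2                        # 50% overlap
    win = np.hanning(frame)
    starts = range(0, len(noisy) - frame, hop)
    spectra = [np.fft.rfft(noisy[i:i + frame] * win) for i in starts]
    noise_mag = np.mean([np.abs(s) for s in spectra[:n_noise_frames]], axis=0)
    out = np.zeros(len(noisy))
    for i, s in zip(starts, spectra):
        mag = np.maximum(np.abs(s) - noise_mag, 0.0)   # subtract, half-wave rectify
        out[i:i + frame] += np.fft.irfft(mag * np.exp(1j * np.angle(s)), n=frame)
    return out

rng = np.random.default_rng(0)
sr = 8000
clean = np.sin(2 * np.pi * 440 * np.arange(sr) / sr)
clean[: sr // 4] = 0.0                      # leading silence for the noise estimate
noisy = clean + 0.3 * rng.standard_normal(sr)
enhanced = spectral_subtract(noisy, sr)     # would then be passed to the LPC encoder
```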

Journal ArticleDOI
30 Nov 2019
TL;DR: Voice input is one way of changing how humans interact with computers; this work converts speech to text using Linear Predictive Coding features and a backpropagation neural network, achieving an 80% recognition rate.
Abstract: Providing voice input is one way of changing how humans interact with computers. Conversion of sound into text with the backpropagation method can be realized through feature extraction, here using Linear Predictive Coding (LPC). Linear Predictive Coding is one way to represent a signal and obtain the features of each sound pattern. In brief, this speech recognition system works by capturing the human voice through a microphone (an analog signal), which is then sampled at 8000 Hz to become a digital signal with the assistance of the computer's sound card. The digital signal then enters the initial processing stage using LPC, yielding several LPC coefficients. The LPC outputs are then trained using the backpropagation learning method, and each result is classified as a word and stored in a database. The test produced a recognition program able to display the voice plots; the real-time speech recognition rate for respondents in the database was 80% over the 100 test samples.

Proceedings ArticleDOI
18 Apr 2019
TL;DR: The experimental results show that the joint use of Linear Predictive Coding and Lempel-Ziv-Welch is an adequate lossless approach, and the amplitude scaling followed by the Discrete Wavelet Transform achieves the best compression ratio, with a small distortion, among the lossy techniques.
Abstract: The compression of Electrocardiography (ECG) signals acquired in off-the-person scenarios requires methods that cope with noise and other impairments in the acquisition process. In this paper, after a brief review of common on-the-person ECG signal compression algorithms, we propose and evaluate techniques for this compression task with off-the-person acquired signals, in both lossy and lossless scenarios, evaluated with standard metrics. Our experimental results show that the joint use of Linear Predictive Coding and Lempel-Ziv-Welch is an adequate lossless approach, and that amplitude scaling followed by the Discrete Wavelet Transform achieves the best compression ratio, with small distortion, among the lossy techniques.
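
A sketch of the lossless path under two stated substitutions: a fixed second-order predictor stands in for adaptive LPC, and zlib's DEFLATE stands in for Lempel-Ziv-Welch, which is not in the Python standard library. The synthetic ECG-like signal is a placeholder.

```python
import zlib
import numpy as np

def compression_ratio(samples):
    """Compare dictionary coding of raw samples vs. prediction residuals."""
    x = samples.astype(np.int64)
    pred = 2 * x[1:-1] - x[:-2]                 # x_hat[n] = 2x[n-1] - x[n-2]
    residual = x[2:] - pred
    res_stream = np.concatenate([x[:2], residual]).astype(np.int16).tobytes()
    raw_stream = samples.astype(np.int16).tobytes()
    return len(zlib.compress(res_stream)) / len(zlib.compress(raw_stream))

rng = np.random.default_rng(0)
t = np.arange(5000)
ecg = (1000 * np.sin(2 * np.pi * t / 250) + rng.normal(0, 5, t.size)).astype(np.int16)
print("residual/raw compressed size:", compression_ratio(ecg))
```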

Journal ArticleDOI
TL;DR: When the spectral distortion results of multistage and split vector quantization codebooks are compared, multistage codebooks gave better performance in every configuration.
Abstract: Vector quantization codebook algorithms are used for coding of narrowband speech signals. Multistage vector quantization and split vector quantization are two important techniques used for coding narrowband speech signals, and these methods are very popular due to the large bit-rate reduction they achieve during coding. This paper presents performance measurements of the multistage vector quantization and split vector quantization methods. We used line spectral frequencies for coding the speech signals in the codebook tables so as to ensure filter stability after quantization. The codebooks were generated using the Linde-Buzo-Gray (LBG) algorithm. The tests were performed with a large amount of input data in the training and test stages, and to evaluate noise robustness both noisy and clean speech signals were used. Different codebooks were designed and tested with many stages and different bit rates to measure quantization performance in terms of spectral distortion. We obtained the best result with a 24-bit multistage vector quantization codebook, with a spectral distortion of less than 1 dB for clean speech training data. Comparing the spectral distortion results of the multistage and split vector quantization codebooks, the multistage codebooks gave better performance in every configuration.
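
A toy sketch of two-stage vector quantization, with plain k-means (SciPy) standing in for the LBG algorithm; the 10-dimensional random vectors and 6-bit codebooks are illustrative, not the paper's LSF data or 24-bit setup.

```python
import numpy as np
from scipy.cluster.vq import kmeans2

rng = np.random.default_rng(0)
train = rng.standard_normal((5000, 10))        # stand-in for 10-dim LSF vectors

cb1, idx1 = kmeans2(train, 64, minit="++")     # stage 1: 6-bit codebook
residual = train - cb1[idx1]
cb2, idx2 = kmeans2(residual, 64, minit="++")  # stage 2 codes stage-1 residuals

recon = cb1[idx1] + cb2[idx2]                  # decoder sums both codewords
print("1-stage MSE:", float(np.mean((train - cb1[idx1]) ** 2)))
print("2-stage MSE:", float(np.mean((train - recon) ** 2)))
```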

Patent
18 Jun 2019
TL;DR: In this paper, a linear predictive coding apparatus is provided that performs linear predictive analysis using a pseudo correlation function signal sequence obtained by performing inverse Fourier transform regarding the absolute values of the frequency domain sample sequence corresponding to the time-series signal as a power spectrum to obtain coefficients transformable to linear predictive coefficients.
Abstract: A linear predictive coding apparatus is provided that performs linear predictive analysis using a pseudo correlation function signal sequence obtained by performing an inverse Fourier transform, regarding the η-th power of the absolute values of the frequency domain sample sequence corresponding to the time-series signal as a power spectrum, to obtain coefficients transformable to linear predictive coefficients. The apparatus further adapts values of η for a plurality of candidates for coefficients transformable to linear predictive coefficients stored in a code book, and the coefficients transformable to linear predictive coefficients are obtained by the linear predictive analysis. The apparatus further obtains a linear predictive coefficient code corresponding to the coefficients transformable to linear predictive coefficients obtained by the linear predictive analysis, using the plurality of candidates for coefficients transformable to linear predictive coefficients and the coefficients transformable to linear predictive coefficients for which the values of η have been adapted.
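
A sketch of the analysis step the patent describes: the magnitude spectrum raised to the power η is treated as a power spectrum, its inverse FFT gives a pseudo correlation sequence, and Levinson-Durbin converts that to LPC-like coefficients; η = 2 recovers ordinary (circular) LPC analysis. The frame length and order are assumptions.

```python
import numpy as np

def levinson(r, order):
    """Levinson-Durbin: autocorrelation r[0..order] -> [1, a1, ..., ap]."""
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[1:i][::-1])
        k = -acc / err
        a[1:i] += k * a[1:i][::-1]
        a[i] = k
        err *= 1.0 - k * k
    return a

def eta_lpc(frame, order=16, eta=1.0):
    """Treat |X|^eta as a power spectrum; eta = 2 gives ordinary LPC."""
    spec = np.abs(np.fft.rfft(frame)) ** eta
    pseudo_r = np.fft.irfft(spec)          # pseudo correlation function sequence
    return levinson(pseudo_r[: order + 1], order)

frame = np.random.default_rng(0).standard_normal(512)
print(eta_lpc(frame, order=8, eta=1.0)[:4])
```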

Patent
11 Dec 2019
TL;DR: In this paper, the authors proposed a weighting function determination method that combines a first weighting function based on spectral analysis information and a second weighting function based on position information of the line spectral frequency (LSF) or immittance spectral frequency (ISF) coefficients.
Abstract: A weighting function determination method includes obtaining a line spectral frequency (LSF) coefficient or an immittance spectral frequency (ISF) coefficient from a linear predictive coding (LPC) coefficient of an input signal and determining a weighting function by combining a first weighting function based on spectral analysis information and a second weighting function based on position information of the LSF coefficient or the ISF coefficient.

Book ChapterDOI
01 Jan 2019
TL;DR: This paper proposes an approach to improve the intelligibility of dysarthric speech using simple yet effective speech-transformation techniques: warping the frequencies of the LPC poles and mapping the linear predictive coding coefficients.
Abstract: Humans utilize many muscles to produce intelligible speech, including those of the lips, face, and throat. Dysarthria is a speech disorder that surfaces when one has weak muscles due to brain damage. The primary characteristics of a dysarthric patient are slurred and slow speech that can be difficult to understand, depending on the severity of the condition. This paper proposes an approach to improve the intelligibility of dysarthric speech using simple yet effective speech-transformation techniques: warping the frequencies of the LPC poles and mapping the linear predictive coding coefficients. The technique was applied to dysarthric audio from the UA-Speech database. Both objective and subjective measures were used to evaluate the transformed speech, and the results pointed towards a significant improvement in the intelligibility of the dysarthric speech. This method can be used to develop voice-enabled search platforms for dysarthric patients and to help their rehabilitation through auditory-feedback speech therapy.
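
A minimal sketch of the pole-warping transformation: estimate the LPC poles, scale the angles of the complex poles by a warp factor, rebuild the polynomial, and refilter the LPC residual through the modified synthesis filter. The warp factor and the residual-refiltering scheme are assumptions; the paper tunes the mapping per speaker.

```python
import numpy as np
import librosa
from scipy.signal import lfilter

def warp_formants(y, order=16, warp=1.1):
    """Scale the angles of complex LPC poles, then refilter the residual."""
    a = librosa.lpc(y, order=order)
    residual = lfilter(a, [1.0], y)                # inverse filter -> excitation
    roots = np.roots(a)
    theta = np.angle(roots)
    is_complex = np.abs(np.imag(roots)) > 1e-8     # leave real poles alone
    theta = np.where(is_complex, theta * warp, theta)
    a_new = np.real(np.poly(np.abs(roots) * np.exp(1j * theta)))
    return lfilter([1.0], a_new, residual)         # synthesis with warped poles

sr = 16000
y = np.sin(2 * np.pi * 150 * np.arange(sr) / sr) \
    + 0.05 * np.random.default_rng(0).standard_normal(sr)
out = warp_formants(y, warp=1.1)
```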

Proceedings ArticleDOI
01 Dec 2019
TL;DR: Using AI and self-training techniques, a high-fidelity coding and decoding library can be developed by the layman on a computer and applied to radio voice applications using properly equipped hardware.
Abstract: High digital compression ratios are spectrally efficient, and excellent fidelity makes their application in radio very useful. LPC (linear predictive coding) methods have been honed for mobile communication, but a more generic form of compression that can be applied to ham radio and personal communication services (PCS) in the private sector would extend the usefulness of these applications and services. Using AI and self-training techniques, a high-fidelity coding and decoding library can be developed by the layman on a computer and applied to radio voice applications using properly equipped hardware. This paper discusses the coding and decoding algorithm, as well as the recommended hardware and software for implementation.


Proceedings ArticleDOI
01 Oct 2019
TL;DR: In this article, the similarity property of the vocal tract is explored to reduce the count of transmitted LPC coefficients and excitation parameters, which consume significant bits and are key parameters of the LP filter.
Abstract: Speech is a highly complex and dynamic acoustic wave produced by the vocal tract as a result of excitation in the form of air expelled from the lungs. The vocal tract characteristics vary in different ways during the production of the various speech categories. This time-variant acoustic filter is represented by a Linear Prediction (LP) filter in the speech production model on which Code Excited Linear Prediction (CELP) and many other speech coders are built. The periodic nature of voiced speech, due to vocal cord vibration, causes the vocal tract characteristics to vary slowly, so similarity exists among nearby portions of voiced speech. This similarity property is explored to reduce the count of transmitted Linear Predictive Coding (LPC) coefficients and excitation parameters, which consume significant bits and are key parameters of the LP filter. This has been implemented in 7.3 kbps CELP by determining appropriate thresholds for the similarity values of both parameters. The proposed method brings the bit rate of CELP down to 6.4 kbps, a 12% reduction in bit requirement, without compromising the perceptual quality of the reconstructed speech.
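
A toy sketch of the similarity idea: if a frame's LPC coefficient vector is close enough (cosine similarity above a threshold) to the last transmitted one, send a one-bit reuse flag instead of quantized coefficients. The threshold and the cosine measure are illustrative, not the paper's tuned similarity criterion.

```python
import numpy as np

def select_frames(lpc_frames, threshold=0.98):
    """Return transmitted vectors plus per-frame reuse flags."""
    sent = [np.asarray(lpc_frames[0], dtype=float)]
    flags = [False]                       # the first frame is always transmitted
    for v in lpc_frames[1:]:
        v = np.asarray(v, dtype=float)
        ref = sent[-1]                    # last transmitted coefficient set
        sim = np.dot(v, ref) / (np.linalg.norm(v) * np.linalg.norm(ref))
        if sim >= threshold:
            flags.append(True)            # 1-bit flag: reuse previous frame
        else:
            flags.append(False)
            sent.append(v)
    return sent, flags

rng = np.random.default_rng(0)
base = rng.standard_normal(10)
frames = [base + 0.01 * rng.standard_normal(10) for _ in range(20)]  # slowly varying
sent, flags = select_frames(frames)
print(f"transmitted {len(sent)} of {len(frames)} coefficient sets")
```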
