Speech Processing and Recognition System
01 Jan 2019, pp. 13-43
TL;DR: In the first decade of the twentieth century, scientists in the Bell System realized that universal services such as telephony were becoming feasible thanks to a large-scale technological revolution.
Abstract: In the first decade of the twentieth century, scientists in the Bell System realized that universal services such as telephony were becoming feasible thanks to a large-scale technological revolution [1].
Citations
TL;DR: Provides a state-of-the-art summary of the widely used machine learning and deep learning methods for speech enhancement, and identifies the challenges and future research directions of speech enhancement systems.
Abstract: Speech enhancement is of substantial interest for speaker identification, video conferencing, speech transmission over communication channels, speech-based biometric systems, mobile phones, hearing aids, microphones, voice conversion, etc. Pattern mining methods are a vital step in the development of speech enhancement schemes. Designing a successful speech enhancement system requires attention to background-noise processing. A substantial number of methods, from traditional techniques to machine learning, have been used to process and remove additive noise from a speech signal. With the advancement of machine learning and deep learning, classification of speech has become more significant. Speech enhancement methods consist of several stages, such as feature extraction from the input speech signal, feature selection, and classification. Deep learning techniques are also an emerging field in the classification domain, which is discussed in this review. The intention of this paper is to provide a state-of-the-art summary of the widely used machine learning and deep learning methods and to identify the challenges and future research directions of speech enhancement systems.
10 citations
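One of the traditional additive-noise techniques such surveys cover is spectral subtraction: estimate a noise magnitude spectrum from noise-only frames and subtract it from each noisy frame. The sketch below is a minimal illustration (hand-rolled DFT, synthetic frames, no windowing or phase reconstruction), not code from the paper:

```python
import cmath
import math
import random

def dft_mag(frame):
    """Magnitude spectrum via a naive DFT (an FFT would be used in practice)."""
    n = len(frame)
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(frame)))
            for k in range(n)]

def spectral_subtract(noisy_frames, noise_frames, floor=0.0):
    """Subtract the average noise magnitude spectrum, clamping at a spectral floor."""
    n = len(noise_frames[0])
    noise_mag = [sum(dft_mag(f)[k] for f in noise_frames) / len(noise_frames)
                 for k in range(n)]
    return [[max(m - noise_mag[k], floor) for k, m in enumerate(dft_mag(f))]
            for f in noisy_frames]

random.seed(0)
# Synthetic data: noise-only frames, then a tone buried in the same noise.
noise_only = [[random.gauss(0.0, 0.1) for _ in range(16)] for _ in range(4)]
noisy_speech = [[math.sin(2 * math.pi * 3 * i / 16) + random.gauss(0.0, 0.1)
                 for i in range(16)] for _ in range(2)]
enhanced = spectral_subtract(noisy_speech, noise_only)
```

Real systems operate on overlapping windowed frames, reuse the noisy phase for resynthesis, and track the noise estimate adaptively.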
TL;DR: The proposed model is a speaker-independent and language-independent emotion detection system, a state-of-the-art model reported with an accuracy of 97.89%.
7 citations
TL;DR: The aim of this state-of-the-art paper is to produce a summary of and guidelines for the broadly used methods, and to identify the challenges and future research directions of acoustic signal processing.
Abstract: Audio signal processing is currently among the most challenging fields for the analysis of audio signals. Audio signal classification (ASC) comprises generating appropriate features from a sound and using those features to identify the class the sound most likely belongs to. Depending on the application's classification domain, the feature extraction and classification/clustering algorithms used may be quite diverse. The paper surveys the state of the art across ASC's general research scope, including different types of audio; representations of audio such as acoustic and spectrogram forms; audio feature extraction techniques (physical, perceptual, static, dynamic); audio pattern matching approaches (pattern matching, acoustic-phonetic, artificial intelligence); and classification and clustering techniques. The aim of this state-of-the-art paper is to produce a summary of and guidelines for the broadly used methods, and to identify the challenges and future research directions of acoustic signal processing.
5 citations
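As a concrete instance of the feature-extraction-plus-pattern-matching pipeline the survey describes, the sketch below computes two classic physical features (zero-crossing rate and signal energy) and assigns a signal to the nearest class centroid. The centroid values and synthetic signals are assumed purely for illustration:

```python
import math
import random

def zero_crossing_rate(signal):
    """Fraction of adjacent sample pairs whose signs differ: a classic physical feature."""
    return sum(1 for a, b in zip(signal, signal[1:]) if a * b < 0) / (len(signal) - 1)

def energy(signal):
    """Mean squared amplitude."""
    return sum(x * x for x in signal) / len(signal)

def nearest_centroid(features, centroids):
    """Pattern matching by squared Euclidean distance to per-class prototype vectors."""
    return min(centroids,
               key=lambda c: sum((f - m) ** 2 for f, m in zip(features, centroids[c])))

# Two synthetic signals: a low-frequency tone and broadband noise.
tone = [math.sin(2 * math.pi * 2 * i / 64) for i in range(64)]
random.seed(1)
noise = [random.uniform(-1, 1) for _ in range(64)]
# (ZCR, energy) class prototypes, assumed for illustration.
centroids = {"tonal": (0.1, 0.5), "noisy": (0.5, 0.33)}
label = nearest_centroid((zero_crossing_rate(noise), energy(noise)), centroids)
```

In practice the prototypes would be learned from labeled training data, and richer perceptual or spectrogram-based features would replace these two scalars.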
TL;DR: The classification results show that the proposed system achieves a validation loss of 57.3, a testing accuracy of 92.0%, a loss value of 193.8, and a training accuracy of 100%; these results evaluate the system's ability to improve footstep signal recognition for security systems in a smart-home environment.
Abstract: Footstep recognition is a relatively new biometric, based on learning from footstep signals captured from people walking on a sensing area. The footstep-signal classification process for security systems still has a low level of accuracy, so a classification system with high accuracy is needed. Most systems are developed using geometric and holistic features but still yield high error rates. In this research, a new system is proposed using Mel Frequency Cepstral Coefficients (MFCCs) for feature extraction, because the mel scale closely mimics the human hearing system, and an Artificial Neural Network (ANN) as the classification algorithm, because it offers good accuracy on a dataset of 500 recorded footsteps. The classification results show that the proposed system achieves a validation loss of 57.3, a testing accuracy of 92.0%, a loss value of 193.8, and a training accuracy of 100%; these results evaluate the system's ability to improve footstep signal recognition for security systems in a smart-home environment.
Cites methods from "Speech Processing and Recognition S..."
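The MFCC feature extraction named above is built on a mel-spaced triangular filterbank, chosen because the mel scale approximates human hearing. Below is a minimal sketch of that filterbank construction; the filter count, FFT size, and sample rate are arbitrary illustrative choices, and the log-energy and DCT steps that complete MFCC are omitted:

```python
import math

def hz_to_mel(hz):
    """Standard mel-scale mapping that mimics human pitch perception."""
    return 2595.0 * math.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_filterbank(num_filters, fft_bins, sample_rate):
    """Triangular filters spaced evenly on the mel scale: the core of MFCC extraction."""
    low, high = hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0)
    mel_points = [low + (high - low) * i / (num_filters + 1)
                  for i in range(num_filters + 2)]
    bin_points = [int((fft_bins + 1) * mel_to_hz(m) / sample_rate) for m in mel_points]
    banks = []
    for j in range(1, num_filters + 1):
        left, center, right = bin_points[j - 1], bin_points[j], bin_points[j + 1]
        filt = [0.0] * (fft_bins // 2 + 1)
        for k in range(left, center):            # rising edge of the triangle
            filt[k] = (k - left) / max(center - left, 1)
        for k in range(center, right):           # falling edge of the triangle
            filt[k] = (right - k) / max(right - center, 1)
        banks.append(filt)
    return banks

banks = mel_filterbank(num_filters=10, fft_bins=256, sample_rate=16000)
```

Each filter's output (a weighted sum over the power spectrum) would then be log-compressed and decorrelated with a DCT to give the cepstral coefficients fed to the ANN.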
References
TL;DR: Back-propagation repeatedly adjusts the weights of the connections in the network so as to minimize a measure of the difference between the actual output vector of the net and the desired output vector; in doing so, hidden units come to represent important features of the task domain.
Abstract: We describe a new learning procedure, back-propagation, for networks of neurone-like units. The procedure repeatedly adjusts the weights of the connections in the network so as to minimize a measure of the difference between the actual output vector of the net and the desired output vector. As a result of the weight adjustments, internal ‘hidden’ units which are not part of the input or output come to represent important features of the task domain, and the regularities in the task are captured by the interactions of these units. The ability to create useful new features distinguishes back-propagation from earlier, simpler methods such as the perceptron-convergence procedure.
23,814 citations
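The procedure the abstract describes (repeatedly adjusting weights to shrink the gap between actual and desired outputs) can be sketched for a tiny two-hidden-unit network on XOR, the classic task a single-layer perceptron cannot solve. This is a minimal illustrative implementation using batch gradient descent, not the authors' code:

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

random.seed(0)
# A 2-input, 2-hidden-unit, 1-output network; each unit's last weight is its bias.
w_hidden = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(2)]
w_out = [random.uniform(-1, 1) for _ in range(3)]
data = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 0)]  # XOR

def forward(x):
    h = [sigmoid(w[0] * x[0] + w[1] * x[1] + w[2]) for w in w_hidden]
    y = sigmoid(w_out[0] * h[0] + w_out[1] * h[1] + w_out[2])
    return h, y

def mse():
    return sum((forward(x)[1] - t) ** 2 for x, t in data) / len(data)

def train_step(lr=0.5):
    """One batch gradient-descent step of back-propagation."""
    g_out = [0.0, 0.0, 0.0]
    g_hid = [[0.0, 0.0, 0.0] for _ in range(2)]
    for x, t in data:
        h, y = forward(x)
        dy = (y - t) * y * (1 - y)                   # output-layer delta
        for j in range(2):
            g_out[j] += dy * h[j]
            dh = dy * w_out[j] * h[j] * (1 - h[j])   # delta propagated backwards
            g_hid[j][0] += dh * x[0]
            g_hid[j][1] += dh * x[1]
            g_hid[j][2] += dh
        g_out[2] += dy
    for j in range(3):
        w_out[j] -= lr * g_out[j]
    for j in range(2):
        for k in range(3):
            w_hidden[j][k] -= lr * g_hid[j][k]

before = mse()
for _ in range(2000):
    train_step()
after = mse()
```

The hidden units are exactly the internal feature detectors the abstract refers to: to reduce the error on XOR, they must learn intermediate functions of the inputs that no single-layer weight vector can express.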
TL;DR: The upper bound is obtained for a specific probabilistic nonsequential decoding algorithm which is shown to be asymptotically optimum for rates above R0 and whose performance bears certain similarities to that of sequential decoding algorithms.
Abstract: The probability of error in decoding an optimal convolutional code transmitted over a memoryless channel is bounded from above and below as a function of the constraint length of the code. For all but pathological channels the bounds are asymptotically (exponentially) tight for rates above R0, the computational cutoff rate of sequential decoding. As a function of constraint length the performance of optimal convolutional codes is shown to be superior to that of block codes of the same length, the relative improvement increasing with rate. The upper bound is obtained for a specific probabilistic nonsequential decoding algorithm which is shown to be asymptotically optimum for rates above R0 and whose performance bears certain similarities to that of sequential decoding algorithms.
6,412 citations
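A nonsequential decoder of this kind is commonly implemented as a minimum-distance trellis search (Viterbi decoding). Below is a minimal hard-decision sketch for a rate-1/2, constraint-length-3 convolutional code with generators 7 and 5 in octal; the code parameters are assumed for illustration, not taken from the paper:

```python
def conv_encode(bits):
    """Rate-1/2, constraint-length-3 convolutional encoder, generators 7 and 5 (octal)."""
    state, out = 0, []
    for b in bits:
        reg = (b << 2) | state                       # register: [newest, prev, prev-prev]
        out.append(bin(reg & 0b111).count("1") & 1)  # generator 111 (octal 7)
        out.append(bin(reg & 0b101).count("1") & 1)  # generator 101 (octal 5)
        state = reg >> 1                             # shift the register
    return out

def viterbi_decode(received):
    """Hard-decision Viterbi search: keep the minimum-Hamming-distance path per state."""
    INF = float("inf")
    metrics = [0.0, INF, INF, INF]                   # encoder starts in state 0
    paths = [[], [], [], []]
    for t in range(len(received) // 2):
        r0, r1 = received[2 * t], received[2 * t + 1]
        new_metrics = [INF] * 4
        new_paths = [None] * 4
        for s in range(4):
            if metrics[s] == INF:
                continue
            for b in (0, 1):                         # try both branch inputs
                reg = (b << 2) | s
                o0 = bin(reg & 0b111).count("1") & 1
                o1 = bin(reg & 0b101).count("1") & 1
                m = metrics[s] + (o0 != r0) + (o1 != r1)
                ns = reg >> 1
                if m < new_metrics[ns]:              # survivor path for state ns
                    new_metrics[ns] = m
                    new_paths[ns] = paths[s] + [b]
        metrics, paths = new_metrics, new_paths
    return paths[min(range(4), key=lambda s: metrics[s])]

message = [1, 0, 1, 1, 0, 0, 1, 0]
codeword = conv_encode(message)
```

Unlike sequential decoding, the search effort here is fixed: every state is extended at every step, which is what makes the algorithm nonsequential.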
TL;DR: This paper investigates deep recurrent neural networks, which combine the multiple levels of representation that have proved so effective in deep networks with the flexible use of long range context that empowers RNNs.
Abstract: Recurrent neural networks (RNNs) are a powerful model for sequential data. End-to-end training methods such as Connectionist Temporal Classification make it possible to train RNNs for sequence labelling problems where the input-output alignment is unknown. The combination of these methods with the Long Short-term Memory RNN architecture has proved particularly fruitful, delivering state-of-the-art results in cursive handwriting recognition. However RNN performance in speech recognition has so far been disappointing, with better results returned by deep feedforward networks. This paper investigates deep recurrent neural networks, which combine the multiple levels of representation that have proved so effective in deep networks with the flexible use of long range context that empowers RNNs. When trained end-to-end with suitable regularisation, we find that deep Long Short-term Memory RNNs achieve a test set error of 17.7% on the TIMIT phoneme recognition benchmark, which to our knowledge is the best recorded score.
5,938 citations
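The Long Short-term Memory architecture referenced above replaces a plain recurrent unit with a gated cell, and a deep RNN stacks layers of such cells. A single scalar LSTM cell step, with assumed (untrained) weights purely for illustration, looks like:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_cell(x, h_prev, c_prev, W):
    """One LSTM step: gates decide what to forget, what to write, what to expose.

    W maps gate name -> (input weight, recurrent weight, bias); scalars keep it tiny.
    """
    f = sigmoid(W["f"][0] * x + W["f"][1] * h_prev + W["f"][2])    # forget gate
    i = sigmoid(W["i"][0] * x + W["i"][1] * h_prev + W["i"][2])    # input gate
    g = math.tanh(W["g"][0] * x + W["g"][1] * h_prev + W["g"][2])  # candidate value
    o = sigmoid(W["o"][0] * x + W["o"][1] * h_prev + W["o"][2])    # output gate
    c = f * c_prev + i * g        # cell state: gated, additive memory update
    h = o * math.tanh(c)          # hidden state passed upward and onward in time
    return h, c

# Illustrative weights (assumed, not trained): run a short sequence through one cell.
W = {"f": (0.5, 0.1, 1.0), "i": (0.6, -0.2, 0.0),
     "g": (1.0, 0.3, 0.0), "o": (0.8, 0.0, 0.5)}
h, c = 0.0, 0.0
for x in [0.5, -1.0, 0.25]:
    h, c = lstm_cell(x, h, c, W)
```

The additive cell-state update is what lets gradients flow over long ranges; in the deep variant, each layer's h sequence becomes the next layer's input.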
TL;DR: This work surveys the most widely used algorithms for smoothing models in language n-gram modeling, presents an extensive empirical comparison of several of these smoothing techniques, including those described by Jelinek and Mercer (1980), and introduces methodologies for analyzing smoothing-algorithm efficacy in detail.
Abstract: We survey the most widely-used algorithms for smoothing models for language n-gram modeling. We then present an extensive empirical comparison of several of these smoothing techniques, including those described by Jelinek and Mercer (1980); Katz (1987); Bell, Cleary and Witten (1990); Ney, Essen and Kneser (1994); and Kneser and Ney (1995). We investigate how factors such as training data size, training corpus (e.g. Brown vs. Wall Street Journal), count cutoffs, and n-gram order (bigram vs. trigram) affect the relative performance of these methods, which is measured through the cross-entropy of test data. We find that these factors can significantly affect the relative performance of models, with the most significant factor being training data size. Since no previous comparisons have examined these factors systematically, this is the first thorough characterization of the relative performance of various algorithms. In addition, we introduce methodologies for analyzing smoothing algorithm efficacy in detail, and using these techniques we motivate a novel variation of Kneser-Ney smoothing that consistently outperforms all other algorithms evaluated. Finally, results showing that improved language model smoothing leads to improved speech recognition performance are presented.
1,908 citations
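Among the surveyed techniques, Jelinek-Mercer interpolation is the simplest to sketch: the bigram maximum-likelihood estimate is mixed with a unigram fallback so that unseen bigrams keep nonzero probability. The toy corpus and fixed interpolation weight below are assumed for illustration; the paper's novel variant is a modified Kneser-Ney, which is not shown here:

```python
from collections import Counter

def train_bigram(tokens):
    """Collect unigram and bigram counts from a token stream."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return unigrams, bigrams, len(tokens)

def interp_prob(w_prev, w, unigrams, bigrams, total, lam=0.7):
    """Jelinek-Mercer smoothing: lam * ML bigram + (1 - lam) * unigram fallback."""
    uni = unigrams[w] / total
    bi = bigrams[(w_prev, w)] / unigrams[w_prev] if unigrams[w_prev] else 0.0
    return lam * bi + (1 - lam) * uni

corpus = "the cat sat on the mat the cat ran".split()
unigrams, bigrams, total = train_bigram(corpus)
p_seen = interp_prob("the", "cat", unigrams, bigrams, total)    # observed bigram
p_unseen = interp_prob("the", "ran", unigrams, bigrams, total)  # unseen bigram
```

In practice the weight lam is tuned on held-out data (often per count bucket) rather than fixed, and the comparison in the paper measures the resulting models by test-set cross-entropy.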