Book ChapterDOI

Experimental Evaluation of CNN Architecture for Speech Recognition

01 Jan 2020-pp 507-514
TL;DR: A method that applies the CNN to audio samples rather than the images on which CNNs are usually trained; the three-layer convolution architecture was found to have the highest accuracy among the discussed architectures.
Abstract: In recent years, deep learning has been widely used in signal and information processing. Among deep learning algorithms, the Convolutional Neural Network (CNN) has been widely used for image recognition and classification because of its architecture, high accuracy and efficiency. This paper proposes a method that applies the CNN to audio samples rather than the image samples on which the CNN is usually trained. The one-dimensional audio samples are converted into two-dimensional data: a matrix of Mel-Frequency Cepstral Coefficients (MFCCs) extracted from the audio samples, with one row per analysis window used in the extraction. The proposed CNN model has been evaluated on the TIDIGITS corpus. The paper analyzes different convolution layer architectures with different numbers of feature maps in each architecture. The three-layer convolution architecture was found to have the highest accuracy, 97.46%, among the discussed architectures.
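The paper's conversion of one-dimensional audio into a two-dimensional matrix rests on sliding-window framing: each analysis window becomes one row, and MFCC extraction would then replace each row with its cepstral coefficients. A minimal NumPy sketch of the framing step (frame and hop sizes here are illustrative, not taken from the paper):

```python
import numpy as np

def frame_signal(x, frame_len, hop):
    """Slice a 1-D signal into overlapping frames, one row per window."""
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]  # shape: (n_frames, frame_len)

# 1 second of audio at 8 kHz, 25 ms frames with a 10 ms hop
x = np.random.randn(8000)
frames = frame_signal(x, frame_len=200, hop=80)
print(frames.shape)  # (98, 200)
```

Computing, say, 13 MFCCs per row would yield a (98, 13) matrix, the kind of 2-D input the proposed CNN consumes.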
Citations
Proceedings ArticleDOI
28 Jun 2021
TL;DR: In this paper, an ensemble-based convolutional neural network (CNN) model trained using various loss functions for tagging musical genres from audio is presented, and the effect of different loss functions and resampling strategies on prediction performance is investigated.
Abstract: Given the ever-increasing volume of music created and released every day, it has never been more important to study automatic music tagging. In this paper, we present an ensemble-based convolutional neural network (CNN) model trained using various loss functions for tagging musical genres from audio. We investigate the effect of different loss functions and resampling strategies on prediction performance, finding that using focal loss improves overall performance on the MTG-Jamendo dataset: an imbalanced, multi-label dataset with over 18,000 songs in the public domain, containing 57 labels. Additionally, we report results from varying the receptive field on our base classifier, a CNN-based architecture trained using Mel spectrograms, which also results in a model performance boost and state-of-the-art performance on the Jamendo dataset. We conclude that the choice of the loss function is paramount for improving on existing methods in music tagging, particularly in the presence of class imbalance.
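The focal loss the abstract credits for handling class imbalance down-weights well-classified examples relative to binary cross-entropy. A self-contained NumPy sketch (the function name and example probabilities are illustrative; with gamma=0 it reduces to plain BCE):

```python
import numpy as np

def focal_loss(y_true, p, gamma=2.0, eps=1e-7):
    """Binary focal loss for multi-label tagging.
    gamma=0 recovers ordinary binary cross-entropy."""
    p = np.clip(p, eps, 1 - eps)
    pt = np.where(y_true == 1, p, 1 - p)      # probability of the true label
    return np.mean(-((1 - pt) ** gamma) * np.log(pt))

y = np.array([1, 0, 1, 0])
p = np.array([0.9, 0.1, 0.6, 0.4])
bce = focal_loss(y, p, gamma=0.0)  # plain BCE
fl = focal_loss(y, p, gamma=2.0)   # easy examples contribute much less
```

The (1 - pt)^gamma factor is what shifts the gradient toward rare, hard-to-classify tags in an imbalanced multi-label setting.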

3 citations

Journal ArticleDOI
TL;DR: A two-dimensional data-driven convolutional neural network model (2DD-CNN) is proposed, along with a novel sequence score matching-filling (SSMF) algorithm that fills missing values by matching similar historical series; the model achieves high forecast accuracy and good generalizability.
Abstract: It is important to establish a precise dissolved oxygen (DO) model to obtain clear knowledge about the prospective changing conditions of the aquatic environment of marine ranches and to ensure the healthy growth of fisheries. However, DO in marine ranches is affected by many factors, and DO trends have complex nonlinear characteristics, so accurate prediction of DO is challenging. On this basis, a two-dimensional data-driven convolutional neural network model (2DD-CNN) is proposed. To reduce the influence of missing values on experimental results, a novel sequence score matching-filling (SSMF) algorithm is first presented that fills missing values based on matching similar historical series. This paper extends the DO expression dimension and constructs a method that converts a DO sequence into two-dimensional images, which is also convenient for the 2D convolution kernel to further extract various pieces of information. In addition, a self-attention mechanism is applied in constructing the CNN to capture the interdependent features of the time series. Finally, DO samples from multiple marine ranches are validated and compared with those predicted by other models. The experimental results show that the proposed model is a suitable and effective method for predicting DO in multiple marine ranches. The MSE, MAE, RMSE and MAPE of the 2DD-CNN prediction results are reduced by 51.63%, 30.06%, 32.53%, and 30.75% on average, respectively, compared with those of other models, and the R2 is 2.68% higher on average. It is clear that the proposed 2DD-CNN model achieves high forecast accuracy and exhibits good generalizability.
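The core idea behind matching-based gap filling can be shown in a toy form: fill each missing value by finding the historical window most similar to the values just before the gap and copying what followed it. This is only a stand-in for the paper's SSMF algorithm (the function `fill_missing` and window width `w` are hypothetical, not the authors' scoring scheme):

```python
import numpy as np

def fill_missing(series, hist, w=3):
    """Fill each NaN by matching the w values before it against every
    historical window and copying the value that followed the best match."""
    s = series.copy()
    for i in np.flatnonzero(np.isnan(s)):
        if i < w:
            continue  # not enough context to match
        q = s[i - w:i]
        best, best_d = None, np.inf
        for j in range(len(hist) - w):
            d = np.sum((hist[j:j + w] - q) ** 2)  # similarity score
            if d < best_d:
                best, best_d = hist[j + w], d
        s[i] = best
    return s

hist = np.array([1., 2., 3., 4., 5., 6., 7., 8.])
s = np.array([2., 3., 4., np.nan, 6.])
filled = fill_missing(s, hist, w=3)
print(filled)  # NaN replaced by 5.0
```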

1 citation

Proceedings ArticleDOI
04 Feb 2021
TL;DR: In this paper, the authors apply prevalent feature extraction techniques to extract the speech signal, with a trade-off between complexity and compression ratio, for the application of voice communication.
Abstract: Automatic speech recognition plays an evident role in extracting the voice signal from a noisy background. The reduction of noise in the signal is sensitive to the information to be transmitted, since not all of the information is emphasized. This leads to deterioration of the transmitted information and motivates further work on automatic speech recognition. Prevalent feature extraction techniques are applied to extract the speech signal, with a trade-off between complexity and compression ratio. For the application of voice communication, filter bank analysis is applied to extract voice signals in a noisy environment. This work emphasizes the perceptual quality attributes of loudness, pitch, intensity, and timing. Band-pass filtering provides reliable extraction of voice signal features in a noisy environment. The power distribution of the extracted signals, for a selected audio signal of more than 20 seconds sampled at 16 kHz with background noise, has been plotted along with its spectrogram.
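The band-pass filtering step the abstract relies on can be sketched crudely in the frequency domain: zero out every FFT bin outside the voice band. This is a minimal illustration, not the paper's filter-bank design (the band edges 80-3400 Hz are the conventional telephone voice band, assumed here for illustration):

```python
import numpy as np

def bandpass_fft(x, fs, lo, hi):
    """Crude FFT band-pass: zero every frequency bin outside [lo, hi] Hz."""
    X = np.fft.rfft(x)
    f = np.fft.rfftfreq(len(x), d=1.0 / fs)
    X[(f < lo) | (f > hi)] = 0.0
    return np.fft.irfft(X, n=len(x))

fs = 16000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 300 * t) + np.sin(2 * np.pi * 5000 * t)  # voice tone + out-of-band tone
y = bandpass_fft(x, fs, lo=80, hi=3400)  # keeps the 300 Hz component only
```

A real system would use a proper filter bank (e.g. windowed FIR band-pass filters) to avoid the edge artifacts of hard spectral truncation.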

1 citation

Book ChapterDOI
TL;DR: In this paper, fuzzy-based convolutional neural network (CNN) methods for image classification are analyzed, and an interval type-2 fuzzy-based CNN is proposed to handle uncertain information effectively.
Abstract: Over the last two decades, neural networks and fuzzy logic have been successfully implemented in intelligent systems. The fuzzy neural network (FNN) framework denotes the union of fuzzy logic and neural network ideas, consolidating the advantages of both. The FNN is applied in many scientific and engineering areas. Wherever there is uncertainty associated with data, fuzzy logic plays a vital role, and fuzzy sets can represent and handle uncertain information effectively. The main objective of the FNN system is to achieve a high level of accuracy by including fuzzy logic in either the neural network structure, the activation function, or the learning algorithm. In computer vision and intelligent systems, the convolutional neural network is among the most popular architectures, and its performance is excellent in many applications. In this article, fuzzy-based CNN image classification methods are analyzed, and an interval type-2 fuzzy-based CNN is proposed. The experiments indicate that the proposed method performs well.

1 citation

References
Proceedings Article
03 Dec 2012
TL;DR: A deep convolutional neural network consisting of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax, which achieved state-of-the-art performance on ImageNet, as discussed by the authors.
Abstract: We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0%, which is considerably better than the previous state-of-the-art. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation. To reduce overfitting in the fully-connected layers we employed a recently-developed regularization method called "dropout" that proved to be very effective. We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry.
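The "dropout" regularization mentioned in the abstract randomly zeroes activations during training; in the now-standard inverted form, survivors are rescaled so the expected activation is unchanged at test time. A minimal NumPy sketch (the rate and array sizes are illustrative):

```python
import numpy as np

def dropout(a, rate, rng):
    """Inverted dropout: zero a fraction `rate` of activations during
    training and rescale the rest so the expected value is preserved."""
    mask = rng.random(a.shape) >= rate
    return a * mask / (1.0 - rate)

rng = np.random.default_rng(0)
a = np.ones(1000)
d = dropout(a, rate=0.5, rng=rng)  # entries are either 0.0 or 2.0
```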

73,978 citations

Proceedings ArticleDOI
Yoon Kim
25 Aug 2014
TL;DR: The CNN models discussed herein improve upon the state of the art on 4 out of 7 tasks, which include sentiment analysis and question classification, and are proposed to allow for the use of both task-specific and static vectors.
Abstract: We report on a series of experiments with convolutional neural networks (CNN) trained on top of pre-trained word vectors for sentence-level classification tasks. We show that a simple CNN with little hyperparameter tuning and static vectors achieves excellent results on multiple benchmarks. Learning task-specific vectors through fine-tuning offers further gains in performance. We additionally propose a simple modification to the architecture to allow for the use of both task-specific and static vectors. The CNN models discussed herein improve upon the state of the art on 4 out of 7 tasks, which include sentiment analysis and question classification.
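The basic building block of such a sentence-classification CNN is a filter of width h slid over the sequence of word vectors, followed by max-over-time pooling so sentences of any length yield one scalar per filter. A hedged NumPy sketch (the embeddings and filter here are random placeholders, not pre-trained vectors):

```python
import numpy as np

def conv_max_pool(E, W, b):
    """One feature map: slide a width-h filter over the word-vector
    sequence E (words x dims), apply ReLU, then max over time."""
    h = W.shape[0]
    c = np.array([np.sum(E[i:i + h] * W) + b for i in range(len(E) - h + 1)])
    return np.max(np.maximum(c, 0.0))

rng = np.random.default_rng(1)
E = rng.standard_normal((7, 50))   # 7 words, 50-dim embeddings (illustrative)
W = rng.standard_normal((3, 50))   # one trigram filter
feat = conv_max_pool(E, W, b=0.0)  # one pooled feature for the sentence
```

A full model concatenates many such pooled features (over several filter widths) and feeds them to a softmax classifier; the static/fine-tuned "channels" in the paper are two copies of E, one frozen and one trainable.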

9,776 citations

Journal ArticleDOI
TL;DR: This article provides an overview of progress and represents the shared views of four research groups that have had recent successes in using DNNs for acoustic modeling in speech recognition.
Abstract: Most current speech recognition systems use hidden Markov models (HMMs) to deal with the temporal variability of speech and Gaussian mixture models (GMMs) to determine how well each state of each HMM fits a frame or a short window of frames of coefficients that represents the acoustic input. An alternative way to evaluate the fit is to use a feed-forward neural network that takes several frames of coefficients as input and produces posterior probabilities over HMM states as output. Deep neural networks (DNNs) that have many hidden layers and are trained using new methods have been shown to outperform GMMs on a variety of speech recognition benchmarks, sometimes by a large margin. This article provides an overview of this progress and represents the shared views of four research groups that have had recent successes in using DNNs for acoustic modeling in speech recognition.
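The hybrid scheme the abstract describes replaces the GMM likelihoods with a feed-forward network that maps a context window of acoustic frames to posterior probabilities over HMM states. A minimal sketch with random weights (layer sizes and the single hidden layer are illustrative assumptions, not the surveyed systems):

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def dnn_posteriors(frames, W1, b1, W2, b2):
    """Flatten a context window of frames, pass it through one hidden
    layer, and return softmax posteriors over HMM states."""
    x = frames.ravel()
    h = np.maximum(W1 @ x + b1, 0.0)  # hidden layer (ReLU)
    return softmax(W2 @ h + b2)       # posteriors over HMM states

rng = np.random.default_rng(2)
frames = rng.standard_normal((11, 13))  # 11-frame window of 13 coefficients
W1, b1 = rng.standard_normal((64, 143)) * 0.1, np.zeros(64)
W2, b2 = rng.standard_normal((30, 64)) * 0.1, np.zeros(30)
post = dnn_posteriors(frames, W1, b1, W2, b2)  # sums to 1 over 30 states
```

In decoding, these posteriors are divided by the state priors to obtain scaled likelihoods for the HMM.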

9,091 citations

Proceedings ArticleDOI
26 May 2013
TL;DR: This paper investigates deep recurrent neural networks, which combine the multiple levels of representation that have proved so effective in deep networks with the flexible use of long range context that empowers RNNs.
Abstract: Recurrent neural networks (RNNs) are a powerful model for sequential data. End-to-end training methods such as Connectionist Temporal Classification make it possible to train RNNs for sequence labelling problems where the input-output alignment is unknown. The combination of these methods with the Long Short-term Memory RNN architecture has proved particularly fruitful, delivering state-of-the-art results in cursive handwriting recognition. However RNN performance in speech recognition has so far been disappointing, with better results returned by deep feedforward networks. This paper investigates deep recurrent neural networks, which combine the multiple levels of representation that have proved so effective in deep networks with the flexible use of long range context that empowers RNNs. When trained end-to-end with suitable regularisation, we find that deep Long Short-term Memory RNNs achieve a test set error of 17.7% on the TIMIT phoneme recognition benchmark, which to our knowledge is the best recorded score.
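The Connectionist Temporal Classification objective mentioned above works by summing over all alignment paths that collapse to the same label sequence; the collapse rule itself (merge repeats, then drop blanks) is simple to state in code:

```python
def ctc_collapse(path, blank=0):
    """CTC decoding rule: merge repeated labels, then drop blanks,
    so many per-frame alignment paths map to one label sequence."""
    out, prev = [], None
    for s in path:
        if s != prev and s != blank:
            out.append(s)
        prev = s
    return out

print(ctc_collapse([0, 1, 1, 0, 1, 2, 2, 0]))  # [1, 1, 2]
```

Note that a blank between two identical labels is what allows repeated labels in the output, as in the example above.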

7,316 citations

Posted Content
TL;DR: In this paper, deep recurrent neural networks (RNNs) are used to combine the multiple levels of representation that have proved so effective in deep networks with the flexible use of long range context that empowers RNNs.
Abstract: Recurrent neural networks (RNNs) are a powerful model for sequential data. End-to-end training methods such as Connectionist Temporal Classification make it possible to train RNNs for sequence labelling problems where the input-output alignment is unknown. The combination of these methods with the Long Short-term Memory RNN architecture has proved particularly fruitful, delivering state-of-the-art results in cursive handwriting recognition. However RNN performance in speech recognition has so far been disappointing, with better results returned by deep feedforward networks. This paper investigates \emph{deep recurrent neural networks}, which combine the multiple levels of representation that have proved so effective in deep networks with the flexible use of long range context that empowers RNNs. When trained end-to-end with suitable regularisation, we find that deep Long Short-term Memory RNNs achieve a test set error of 17.7% on the TIMIT phoneme recognition benchmark, which to our knowledge is the best recorded score.

5,310 citations