Book ChapterDOI

Experimental Evaluation of CNN Architecture for Speech Recognition

01 Jan 2020-pp 507-514
TL;DR: A method that applies the CNN to audio samples rather than the images on which CNNs are usually trained; the three-layer convolution architecture was found to have the highest accuracy among the discussed architectures.
Abstract: In recent years, deep learning has been widely used in signal and information processing. Among deep learning algorithms, the Convolutional Neural Network (CNN) has been widely used for image recognition and classification because of its architecture, high accuracy and efficiency. This paper proposes a method that applies the CNN to audio samples rather than the image samples on which the CNN is usually trained. The one-dimensional audio samples are converted into two-dimensional data: a matrix of Mel-Frequency Cepstral Coefficients (MFCCs) extracted from the audio samples, with one row per analysis window used in the extraction. The proposed CNN model has been evaluated on the TIDIGITS corpus. The paper analyzes different convolution layer architectures with different numbers of feature maps in each architecture. The three-layer convolution architecture was found to have the highest accuracy, 97.46%, among the discussed architectures.
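The paper's conversion of one-dimensional audio into a two-dimensional matrix rests on sliding-window framing: each analysis window becomes one row, and MFCC extraction would then replace each row with its cepstral coefficients. A minimal NumPy sketch of the framing step (frame and hop sizes here are illustrative, not taken from the paper):

```python
import numpy as np

def frame_signal(x, frame_len, hop):
    """Slice a 1-D signal into overlapping frames, one row per window."""
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]  # shape: (n_frames, frame_len)

# 1 second of audio at 8 kHz, 25 ms frames with a 10 ms hop
x = np.random.randn(8000)
frames = frame_signal(x, frame_len=200, hop=80)
print(frames.shape)  # (98, 200)
```

Computing, say, 13 MFCCs per row would yield a (98, 13) matrix, the kind of 2-D input the proposed CNN consumes.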
Citations
Proceedings ArticleDOI
28 Jun 2021
TL;DR: In this paper, an ensemble-based convolutional neural network (CNN) model trained using various loss functions for tagging musical genres from audio is presented, and the effect of different loss functions and resampling strategies on prediction performance is investigated.
Abstract: Given the ever-increasing volume of music created and released every day, it has never been more important to study automatic music tagging. In this paper, we present an ensemble-based convolutional neural network (CNN) model trained using various loss functions for tagging musical genres from audio. We investigate the effect of different loss functions and resampling strategies on prediction performance, finding that using focal loss improves overall performance on the MTG-Jamendo dataset: an imbalanced, multi-label dataset with over 18,000 songs in the public domain, containing 57 labels. Additionally, we report results from varying the receptive field on our base classifier, a CNN-based architecture trained using Mel spectrograms, which also results in a model performance boost and state-of-the-art performance on the Jamendo dataset. We conclude that the choice of the loss function is paramount for improving on existing methods in music tagging, particularly in the presence of class imbalance.
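The focal loss the abstract credits for handling class imbalance down-weights well-classified examples relative to binary cross-entropy. A self-contained NumPy sketch (the function name and example probabilities are illustrative; with gamma=0 it reduces to plain BCE):

```python
import numpy as np

def focal_loss(y_true, p, gamma=2.0, eps=1e-7):
    """Binary focal loss for multi-label tagging.
    gamma=0 recovers ordinary binary cross-entropy."""
    p = np.clip(p, eps, 1 - eps)
    pt = np.where(y_true == 1, p, 1 - p)      # probability of the true label
    return np.mean(-((1 - pt) ** gamma) * np.log(pt))

y = np.array([1, 0, 1, 0])
p = np.array([0.9, 0.1, 0.6, 0.4])
bce = focal_loss(y, p, gamma=0.0)  # plain BCE
fl = focal_loss(y, p, gamma=2.0)   # easy examples contribute much less
```

The (1 - pt)^gamma factor is what shifts the gradient toward rare, hard-to-classify tags in an imbalanced multi-label setting.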

3 citations

Journal ArticleDOI
TL;DR: A two-dimensional data-driven convolutional neural network model (2DD-CNN) is proposed, along with a novel sequence score matching-filling (SSMF) algorithm that fills missing values by matching similar historical series; the model achieves high forecast accuracy and good generalizability.
Abstract: It is important to establish a precise dissolved oxygen (DO) model to obtain clear knowledge about the prospective changing conditions of the aquatic environment of marine ranches and to ensure the healthy growth of fisheries. However, DO in marine ranches is affected by many factors, and DO trends have complex nonlinear characteristics, so accurate prediction of DO is challenging. On this basis, a two-dimensional data-driven convolutional neural network model (2DD-CNN) is proposed. To reduce the influence of missing values on experimental results, a novel sequence score matching-filling (SSMF) algorithm is first presented that fills missing values based on matching similar historical series. This paper extends the DO expression dimension and constructs a method that converts a DO sequence into two-dimensional images, which is also convenient for the 2D convolution kernel to further extract various pieces of information. In addition, a self-attention mechanism is applied in constructing the CNN to capture the interdependent features of the time series. Finally, DO samples from multiple marine ranches are validated and compared with those predicted by other models. The experimental results show that the proposed model is a suitable and effective method for predicting DO in multiple marine ranches. The MSE, MAE, RMSE and MAPE of the 2DD-CNN prediction results are reduced by 51.63%, 30.06%, 32.53%, and 30.75% on average, respectively, compared with those of other models, and the R2 is 2.68% higher on average. It is clear that the proposed 2DD-CNN model achieves high forecast accuracy and exhibits good generalizability.
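The core idea behind matching-based gap filling can be shown in a toy form: fill each missing value by finding the historical window most similar to the values just before the gap and copying what followed it. This is only a stand-in for the paper's SSMF algorithm (the function `fill_missing` and window width `w` are hypothetical, not the authors' scoring scheme):

```python
import numpy as np

def fill_missing(series, hist, w=3):
    """Fill each NaN by matching the w values before it against every
    historical window and copying the value that followed the best match."""
    s = series.copy()
    for i in np.flatnonzero(np.isnan(s)):
        if i < w:
            continue  # not enough context to match
        q = s[i - w:i]
        best, best_d = None, np.inf
        for j in range(len(hist) - w):
            d = np.sum((hist[j:j + w] - q) ** 2)  # similarity score
            if d < best_d:
                best, best_d = hist[j + w], d
        s[i] = best
    return s

hist = np.array([1., 2., 3., 4., 5., 6., 7., 8.])
s = np.array([2., 3., 4., np.nan, 6.])
filled = fill_missing(s, hist, w=3)
print(filled)  # NaN replaced by 5.0
```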

1 citation

Proceedings ArticleDOI
04 Feb 2021
TL;DR: In this paper, the authors apply prevalent feature extraction techniques to extract the speech signal, with a trade-off between complexity and compression ratio, for the application of voice communication.
Abstract: Automatic speech recognition plays an evident role in extracting the voice signal from a noisy background. The reduction of noise in the signal is sensitive to the information to be transmitted, since not all of the information is emphasized. This leads to deterioration of the transmitted information and motivates further work on automatic speech recognition. Prevalent feature extraction techniques are applied to extract the speech signal, with a trade-off between complexity and compression ratio. For the application of voice communication, filter bank analysis is applied to extract voice signals in a noisy environment. This work emphasizes the perceptual quality attributes of loudness, pitch, intensity, and timing. Band-pass filtering provides reliable extraction of voice signal features in a noisy environment. The power distribution of the extracted signals, for a selected audio signal of more than 20 seconds sampled at 16 kHz with background noise, has been plotted along with its spectrogram.
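The band-pass filtering step the abstract relies on can be sketched crudely in the frequency domain: zero out every FFT bin outside the voice band. This is a minimal illustration, not the paper's filter-bank design (the band edges 80-3400 Hz are the conventional telephone voice band, assumed here for illustration):

```python
import numpy as np

def bandpass_fft(x, fs, lo, hi):
    """Crude FFT band-pass: zero every frequency bin outside [lo, hi] Hz."""
    X = np.fft.rfft(x)
    f = np.fft.rfftfreq(len(x), d=1.0 / fs)
    X[(f < lo) | (f > hi)] = 0.0
    return np.fft.irfft(X, n=len(x))

fs = 16000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 300 * t) + np.sin(2 * np.pi * 5000 * t)  # voice tone + out-of-band tone
y = bandpass_fft(x, fs, lo=80, hi=3400)  # keeps the 300 Hz component only
```

A real system would use a proper filter bank (e.g. windowed FIR band-pass filters) to avoid the edge artifacts of hard spectral truncation.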

1 citation

Book ChapterDOI
TL;DR: In this paper, fuzzy-based convolutional neural network (CNN) methods for image classification are analyzed, and an interval type-2 fuzzy-based CNN is proposed to handle uncertain information effectively.
Abstract: Over the last two decades, neural networks and fuzzy logic have been successfully implemented in intelligent systems. The fuzzy neural network (FNN) framework denotes the union of fuzzy logic and neural network ideas, consolidating the advantages of both. The FNN is applied in many scientific and engineering areas. Wherever there is uncertainty associated with data, fuzzy logic plays a vital role, and fuzzy sets can represent and handle uncertain information effectively. The main objective of the FNN system is to achieve a high level of accuracy by including fuzzy logic in either the neural network structure, the activation function, or the learning algorithm. In computer vision and intelligent systems, the convolutional neural network is among the most popular architectures, and its performance is excellent in many applications. In this article, fuzzy-based CNN image classification methods are analyzed, and an interval type-2 fuzzy-based CNN is proposed. The experiments indicate that the proposed method performs well.

1 citation

References
Proceedings Article
03 Dec 2012
TL;DR: A deep convolutional neural network consisting of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax, which achieved state-of-the-art performance on ImageNet, as discussed by the authors.
Abstract: We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0%, which is considerably better than the previous state-of-the-art. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation. To reduce overfitting in the fully-connected layers we employed a recently-developed regularization method called "dropout" that proved to be very effective. We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry.
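The "dropout" regularization mentioned in the abstract randomly zeroes activations during training; in the now-standard inverted form, survivors are rescaled so the expected activation is unchanged at test time. A minimal NumPy sketch (the rate and array sizes are illustrative):

```python
import numpy as np

def dropout(a, rate, rng):
    """Inverted dropout: zero a fraction `rate` of activations during
    training and rescale the rest so the expected value is preserved."""
    mask = rng.random(a.shape) >= rate
    return a * mask / (1.0 - rate)

rng = np.random.default_rng(0)
a = np.ones(1000)
d = dropout(a, rate=0.5, rng=rng)  # entries are either 0.0 or 2.0
```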

73,978 citations

Proceedings ArticleDOI
Yoon Kim
25 Aug 2014
TL;DR: The CNN models discussed herein improve upon the state of the art on 4 out of 7 tasks, which include sentiment analysis and question classification, and are proposed to allow for the use of both task-specific and static vectors.
Abstract: We report on a series of experiments with convolutional neural networks (CNN) trained on top of pre-trained word vectors for sentence-level classification tasks. We show that a simple CNN with little hyperparameter tuning and static vectors achieves excellent results on multiple benchmarks. Learning task-specific vectors through fine-tuning offers further gains in performance. We additionally propose a simple modification to the architecture to allow for the use of both task-specific and static vectors. The CNN models discussed herein improve upon the state of the art on 4 out of 7 tasks, which include sentiment analysis and question classification.
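The basic building block of such a sentence-classification CNN is a filter of width h slid over the sequence of word vectors, followed by max-over-time pooling so sentences of any length yield one scalar per filter. A hedged NumPy sketch (the embeddings and filter here are random placeholders, not pre-trained vectors):

```python
import numpy as np

def conv_max_pool(E, W, b):
    """One feature map: slide a width-h filter over the word-vector
    sequence E (words x dims), apply ReLU, then max over time."""
    h = W.shape[0]
    c = np.array([np.sum(E[i:i + h] * W) + b for i in range(len(E) - h + 1)])
    return np.max(np.maximum(c, 0.0))

rng = np.random.default_rng(1)
E = rng.standard_normal((7, 50))   # 7 words, 50-dim embeddings (illustrative)
W = rng.standard_normal((3, 50))   # one trigram filter
feat = conv_max_pool(E, W, b=0.0)  # one pooled feature for the sentence
```

A full model concatenates many such pooled features (over several filter widths) and feeds them to a softmax classifier; the static/fine-tuned "channels" in the paper are two copies of E, one frozen and one trainable.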

9,776 citations

Journal ArticleDOI
TL;DR: This article provides an overview of progress and represents the shared views of four research groups that have had recent successes in using DNNs for acoustic modeling in speech recognition.
Abstract: Most current speech recognition systems use hidden Markov models (HMMs) to deal with the temporal variability of speech and Gaussian mixture models (GMMs) to determine how well each state of each HMM fits a frame or a short window of frames of coefficients that represents the acoustic input. An alternative way to evaluate the fit is to use a feed-forward neural network that takes several frames of coefficients as input and produces posterior probabilities over HMM states as output. Deep neural networks (DNNs) that have many hidden layers and are trained using new methods have been shown to outperform GMMs on a variety of speech recognition benchmarks, sometimes by a large margin. This article provides an overview of this progress and represents the shared views of four research groups that have had recent successes in using DNNs for acoustic modeling in speech recognition.
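The hybrid scheme the abstract describes replaces the GMM likelihoods with a feed-forward network that maps a context window of acoustic frames to posterior probabilities over HMM states. A minimal sketch with random weights (layer sizes and the single hidden layer are illustrative assumptions, not the surveyed systems):

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def dnn_posteriors(frames, W1, b1, W2, b2):
    """Flatten a context window of frames, pass it through one hidden
    layer, and return softmax posteriors over HMM states."""
    x = frames.ravel()
    h = np.maximum(W1 @ x + b1, 0.0)  # hidden layer (ReLU)
    return softmax(W2 @ h + b2)       # posteriors over HMM states

rng = np.random.default_rng(2)
frames = rng.standard_normal((11, 13))  # 11-frame window of 13 coefficients
W1, b1 = rng.standard_normal((64, 143)) * 0.1, np.zeros(64)
W2, b2 = rng.standard_normal((30, 64)) * 0.1, np.zeros(30)
post = dnn_posteriors(frames, W1, b1, W2, b2)  # sums to 1 over 30 states
```

In decoding, these posteriors are divided by the state priors to obtain scaled likelihoods for the HMM.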

9,091 citations

Proceedings ArticleDOI
26 May 2013
TL;DR: This paper investigates deep recurrent neural networks, which combine the multiple levels of representation that have proved so effective in deep networks with the flexible use of long range context that empowers RNNs.
Abstract: Recurrent neural networks (RNNs) are a powerful model for sequential data. End-to-end training methods such as Connectionist Temporal Classification make it possible to train RNNs for sequence labelling problems where the input-output alignment is unknown. The combination of these methods with the Long Short-term Memory RNN architecture has proved particularly fruitful, delivering state-of-the-art results in cursive handwriting recognition. However RNN performance in speech recognition has so far been disappointing, with better results returned by deep feedforward networks. This paper investigates deep recurrent neural networks, which combine the multiple levels of representation that have proved so effective in deep networks with the flexible use of long range context that empowers RNNs. When trained end-to-end with suitable regularisation, we find that deep Long Short-term Memory RNNs achieve a test set error of 17.7% on the TIMIT phoneme recognition benchmark, which to our knowledge is the best recorded score.
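The Connectionist Temporal Classification objective mentioned above works by summing over all alignment paths that collapse to the same label sequence; the collapse rule itself (merge repeats, then drop blanks) is simple to state in code:

```python
def ctc_collapse(path, blank=0):
    """CTC decoding rule: merge repeated labels, then drop blanks,
    so many per-frame alignment paths map to one label sequence."""
    out, prev = [], None
    for s in path:
        if s != prev and s != blank:
            out.append(s)
        prev = s
    return out

print(ctc_collapse([0, 1, 1, 0, 1, 2, 2, 0]))  # [1, 1, 2]
```

Note that a blank between two identical labels is what allows repeated labels in the output, as in the example above.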

7,316 citations

Posted Content
TL;DR: In this paper, deep recurrent neural networks (RNNs) are used to combine the multiple levels of representation that have proved so effective in deep networks with the flexible use of long range context that empowers RNNs.
Abstract: Recurrent neural networks (RNNs) are a powerful model for sequential data. End-to-end training methods such as Connectionist Temporal Classification make it possible to train RNNs for sequence labelling problems where the input-output alignment is unknown. The combination of these methods with the Long Short-term Memory RNN architecture has proved particularly fruitful, delivering state-of-the-art results in cursive handwriting recognition. However RNN performance in speech recognition has so far been disappointing, with better results returned by deep feedforward networks. This paper investigates \emph{deep recurrent neural networks}, which combine the multiple levels of representation that have proved so effective in deep networks with the flexible use of long range context that empowers RNNs. When trained end-to-end with suitable regularisation, we find that deep Long Short-term Memory RNNs achieve a test set error of 17.7% on the TIMIT phoneme recognition benchmark, which to our knowledge is the best recorded score.

5,310 citations