Proceedings ArticleDOI

Speech recognition using MFCC and DTW

19 Jun 2014-pp 1-4
TL;DR: In this paper, an implementation of a speech recognition system in the MATLAB environment is explained, where two algorithms, Mel-Frequency Cepstral Coefficients (MFCC) and Dynamic Time Warping (DTW), are adapted for feature extraction and pattern matching respectively.
Abstract: Speech recognition has a wide range of applications in security systems, healthcare, telephony, the military, and equipment designed for the handicapped. Speech is a continuously varying signal, so a suitable digital processing algorithm has to be selected for an automatic speech recognition system. To obtain the required information from a speech sample, features have to be extracted from it; for recognition, these features are analyzed to make decisions. In this paper the implementation of a speech recognition system in the MATLAB environment is explained. Mel-Frequency Cepstral Coefficients (MFCC) and Dynamic Time Warping (DTW) are the two algorithms adapted for feature extraction and pattern matching respectively. Results are obtained with a one-time training phase and a continuous testing phase.
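As a hedged illustration of the MFCC feature-extraction chain described above (a minimal numpy sketch, not the paper's MATLAB code): pre-emphasis, Hamming-windowed framing, power spectrum, a triangular mel filterbank, log compression, and a DCT. All parameter values here are common defaults, not taken from the paper.

```python
import numpy as np

def hz_to_mel(f):
    # Common mel-scale formula: mel = 2595 * log10(1 + f / 700).
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, fs=8000, frame_len=256, hop=128, n_filters=20, n_ceps=13):
    # 1. Pre-emphasis: boost high frequencies relative to low ones.
    sig = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    # 2. Split into overlapping Hamming-windowed frames.
    n_frames = 1 + (len(sig) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = sig[idx] * np.hamming(frame_len)
    # 3. Power spectrum of each frame.
    power = np.abs(np.fft.rfft(frames, frame_len)) ** 2 / frame_len
    # 4. Triangular filterbank spaced evenly on the mel scale.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_filters + 2)
    bins = np.floor((frame_len + 1) * mel_to_hz(mel_pts) / fs).astype(int)
    fbank = np.zeros((n_filters, frame_len // 2 + 1))
    for i in range(n_filters):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    # 5. Log filterbank energies, then a DCT-II to decorrelate them
    #    into cepstral coefficients.
    logfb = np.log(power @ fbank.T + 1e-10)
    k = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * k + 1) / (2 * n_filters))
    return logfb @ dct.T
```

The matching step would then compare two such coefficient matrices frame-by-frame under a DTW alignment.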
Citations
Journal ArticleDOI
TL;DR: This review describes how automated sound-analysis methods can be applied to the vocalizations of three of the most important farm livestock species to monitor animal welfare, opening new potential for automated monitoring in large-scale farming.
Abstract: Vocalizations carry emotional, physiological and individual information. This suggests that they may serve as potentially useful indicators for inferring animal welfare. At the same time, automated methods for analysing and classifying sound have developed rapidly, particularly in the fields of ecology, conservation and sound scene classification. These methods are already used to automatically classify animal vocalizations, for example, in identifying animal species and estimating numbers of individuals. Despite this potential, they have not yet found widespread application in animal welfare monitoring. In this review, we first discuss current trends in sound analysis for ecology, conservation and sound classification. Following this, we detail the vocalizations produced by three of the most important farm livestock species: chickens ( Gallus gallus domesticus), pigs ( Sus scrofa domesticus) and cattle ( Bos taurus). Finally, we describe how these methods can be applied to monitor animal welfare with new potential for developing automated methods for large-scale farming.

64 citations

Journal ArticleDOI
18 Mar 2020
TL;DR: This paper presents mmASL, a home assistant system for Deaf and Hard-of-Hearing users that recognizes American Sign Language signs from 60 GHz millimeter-wave wireless signals, using spatial spectrograms for wake-word detection and a multi-task deep learning model for sign recognition.
Abstract: Home assistant devices such as Amazon Echo and Google Home have become tremendously popular in the last couple of years. However, due to their voice-controlled functionality, these devices are not accessible to Deaf and Hard-of-Hearing (DHH) people. Given that over half a million people in the United States communicate using American Sign Language (ASL), there is a need for a home assistant system that can recognize ASL. The objective of this work is to design a home assistant system for DHH users (referred to as mmASL) that can perform ASL recognition using 60 GHz millimeter-wave wireless signals. mmASL has two important components. First, it can perform reliable wake-word detection using spatial spectrograms. Second, using a scalable and extensible multi-task deep learning model, mmASL can learn the phonological properties of ASL signs and use them to accurately recognize the ASL signs. We implement mmASL on a 60 GHz software radio platform with a phased array, and evaluate it using a large-scale data collection from 15 signers, 50 ASL signs and over 12K sign instances. We show that mmASL is tolerant to the presence of other interfering users and their activities, changes of environment and different user positions. We compare mmASL with well-studied Kinect and RGB camera based ASL recognition systems, and find that it can achieve comparable performance (87% average accuracy of sign recognition), validating the feasibility of using a 60 GHz mmWave system for ASL sign recognition.

51 citations


Cites methods from "Speech recognition using MFCC and D..."

  • ...Lastly, we use the log-transformation of amplitude values to normalize (addressed as log normalization) and emphasize on low intensity components (inspired by speech recognition literature [52, 57, 59])....

    [...]

Journal ArticleDOI
TL;DR: Three distinctive modalities consisting of audio, video and physiological channels are assessed and combined for the classification of several levels of pain elicitation and an extensive assessment of several fusion strategies is carried out in order to design a classification architecture that improves the performance of the pain recognition system.
Abstract: The subjective nature of pain makes it a very challenging phenomenon to assess. Most current pain assessment approaches rely on an individual's ability to recognise and report an observed pain episode. However, pain perception and expression are affected by numerous factors ranging from personality traits to physical and psychological health state. Hence, several approaches have been proposed for the automatic recognition of pain intensity, based on measurable physiological and audiovisual parameters. In the current paper, an assessment of several fusion architectures for the development of a multi-modal pain intensity classification system is performed. The contribution of the presented work is two-fold: (1) three distinctive modalities consisting of audio, video and physiological channels are assessed and combined for the classification of several levels of pain elicitation. (2) An extensive assessment of several fusion strategies is carried out in order to design a classification architecture that improves the performance of the pain recognition system. The assessment is based on the SenseEmotion Database, and experimental validation demonstrates the relevance of the multi-modal classification approach, which achieves classification rates of 83.39%, 59.53% and 43.89% in the 2-class, 3-class and 4-class pain intensity classification tasks respectively.

47 citations

Proceedings ArticleDOI
01 Nov 2017
TL;DR: The results show that the proposed method obtains good accuracy, with an average accuracy of 92.42% and per-letter recognition accuracies for sa, sya, and tsa of 92.38%, 93.26% and 91.63% respectively.
Abstract: This research addresses the challenging issue of recognizing spoken Arabic letters: three hijaiyah letters that have identical pronunciation when spoken by Indonesian speakers but different makhraj in Arabic, namely sa, sya and tsa. The research uses Mel-Frequency Cepstral Coefficients (MFCC) based feature extraction and an Artificial Neural Network (ANN) classification method. The results show that the proposed method obtains good accuracy, with an average accuracy of 92.42% and per-letter recognition accuracies (sa, sya, and tsa) of 92.38%, 93.26% and 91.63% respectively.

45 citations


Additional excerpts

  • ...Word recognition to compare the similarity of letters in the words “Forward”, “Left”, “Right”, “Reverse”, and “Control” [4]....

    [...]

Proceedings ArticleDOI
24 Jul 2018
TL;DR: All of the tested denoising methods using wavelet transformation were able to improve the accuracy of the speech recognition system on input signals with an SNR of 0-10 dB, compared to the system without denoising.
Abstract: Mel frequency cepstral coefficient (MFCC) is a popular feature extraction method for speech recognition systems. However, the method is susceptible to noise even though it generates high accuracy; the conventional MFCC method has degraded performance when the input signal contains noise. This paper presents the implementation of wavelet denoising on the speech input of the MFCC feature extraction method. The addition of a denoising process using the wavelet transformation was expected to improve MFCC performance on noisy signals. The study used 120 speech recordings, of which 30 were used as the reference and the other 90 as testing data. The testing data were mixed with white Gaussian noise and then tested against the speech recognition system that already had the reference data. The parameters used in the wavelet denoising process were soft thresholding with the Minimaxi thresholding rule. Eleven wavelet methods at decomposition level 10 were tested in the denoising process. The classification process used the K-nearest neighbor (KNN) method. The Fejer-Korovkin 6 wavelet was the best denoising method, achieving the highest accuracy on input signals with an SNR of 5-15 dB. Meanwhile, the Daubechies 5 method had high accuracy on input signals with an SNR of 3 dB. All of the tested denoising methods using the wavelet transformation were able to improve the accuracy of the speech recognition system on input signals with an SNR of 0-10 dB, compared to the system without denoising.
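As a rough illustration of the idea (not the paper's setup, which uses level-10 decompositions of Fejer-Korovkin and Daubechies wavelets in MATLAB): even a single-level Haar transform with soft thresholding of the detail coefficients shows how wavelet shrinkage suppresses low-amplitude noise while preserving the underlying signal.

```python
import numpy as np

def haar_denoise(x, thr):
    # Single-level Haar DWT: pairwise averages (approximation) and
    # pairwise differences (detail). Requires an even-length input.
    a = (x[0::2] + x[1::2]) / np.sqrt(2.0)
    d = (x[0::2] - x[1::2]) / np.sqrt(2.0)
    # Soft thresholding shrinks detail coefficients toward zero,
    # removing low-amplitude (presumably noise) components.
    d = np.sign(d) * np.maximum(np.abs(d) - thr, 0.0)
    # Inverse Haar transform.
    y = np.empty_like(x)
    y[0::2] = (a + d) / np.sqrt(2.0)
    y[1::2] = (a - d) / np.sqrt(2.0)
    return y

rng = np.random.default_rng(0)
t = np.linspace(0, 1, 1024, endpoint=False)
clean = np.sin(2 * np.pi * 5 * t)                  # smooth "speech-like" tone
noisy = clean + 0.3 * rng.standard_normal(1024)    # additive white Gaussian noise
denoised = haar_denoise(noisy, thr=0.5)            # threshold value is illustrative
```

The denoised signal would then be passed to the MFCC front end in place of the raw noisy input.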

33 citations


Cites background or methods from "Speech recognition using MFCC and D..."

  • ...Some studies concerning speech recognition system using MFCC as the feature extraction methods have been conducted [3]–[6]....

    [...]

  • ...with N is the number of samples per frame, Y[n] is the output signal, X[n] is the input signal, and W[n] is the nth coefficient of the Hamming window [3]....

    [...]
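The windowing relation quoted in this excerpt, Y[n] = X[n]·W[n] with W[n] the N-point Hamming window, can be sketched directly (the frame length of 256 samples is an illustrative choice, not taken from the paper):

```python
import numpy as np

N = 256                                 # samples per frame (illustrative)
n = np.arange(N)
# Hamming window: w[n] = 0.54 - 0.46 * cos(2*pi*n / (N - 1))
w = 0.54 - 0.46 * np.cos(2 * np.pi * n / (N - 1))
x = np.random.randn(N)                  # one frame of the input signal X[n]
y = x * w                               # Y[n] = X[n] * W[n], element-wise
```

Tapering each frame this way reduces the spectral leakage that abrupt frame edges would otherwise introduce.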

References
More filters
Journal ArticleDOI
Lawrence R. Rabiner1
01 Feb 1989
TL;DR: In this paper, the authors provide an overview of the basic theory of hidden Markov models (HMMs) as originated by L.E. Baum and T. Petrie (1966) and give practical details on methods of implementation of the theory along with a description of selected applications of HMMs to distinct problems in speech recognition.
Abstract: This tutorial provides an overview of the basic theory of hidden Markov models (HMMs) as originated by L.E. Baum and T. Petrie (1966) and gives practical details on methods of implementation of the theory along with a description of selected applications of the theory to distinct problems in speech recognition. Results from a number of original sources are combined to provide a single source of acquiring the background required to pursue further this area of research. The author first reviews the theory of discrete Markov chains and shows how the concept of hidden states, where the observation is a probabilistic function of the state, can be used effectively. The theory is illustrated with two simple examples, namely coin-tossing and the classic balls-in-urns system. Three fundamental problems of HMMs are noted and several practical techniques for solving these problems are given. The various types of HMMs that have been studied, including ergodic as well as left-right models, are described.
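The first of the tutorial's three fundamental problems, evaluating the probability of an observation sequence given a model, is solved by the forward algorithm. A minimal numpy sketch on a toy two-state model (all probability values are illustrative, not from the tutorial):

```python
import numpy as np

# Toy 2-state, 2-symbol HMM (illustrative numbers).
A = np.array([[0.7, 0.3],      # state transition probabilities a_ij
              [0.4, 0.6]])
B = np.array([[0.9, 0.1],      # emission probabilities b_j(o)
              [0.2, 0.8]])
pi = np.array([0.5, 0.5])      # initial state distribution

def forward(obs):
    # alpha_t(j) = P(o_1..o_t, q_t = j | model); induction over t.
    alpha = pi * B[:, obs[0]]
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
    return alpha.sum()         # P(O | model)

p = forward([0, 1, 0])
```

The recursion costs O(T·N²) rather than the O(T·Nᵀ) of summing over all state paths, which is the point of the algorithm.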

21,819 citations

Posted Content
TL;DR: This paper presents the viability of MFCC to extract features and DTW to compare the test patterns and explains why the alignment is important to produce the better performance.
Abstract: Digital processing of the speech signal and voice recognition algorithms are very important for fast and accurate automatic voice recognition technology. The voice is a signal of infinite information, and direct analysis and synthesis of the complex voice signal is difficult due to too much information contained in the signal. Therefore digital signal processes such as Feature Extraction and Feature Matching are introduced to represent the voice signal. Several methods such as Linear Predictive Coding (LPC), Hidden Markov Models (HMM), Artificial Neural Networks (ANN), etc. are evaluated with a view to identifying a straightforward and effective method for the voice signal. The extraction and matching process is implemented right after the pre-processing or filtering of the signal is performed. Mel Frequency Cepstral Coefficients (MFCCs), a non-parametric method for modelling the human auditory perception system, are utilized as the extraction technique. The non-linear sequence alignment known as Dynamic Time Warping (DTW), introduced by Sakoe and Chiba, has been used as the feature matching technique. Since the voice signal tends to have a different temporal rate from utterance to utterance, the alignment is important to produce better performance. This paper presents the viability of MFCC to extract features and DTW to compare the test patterns.
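The DTW alignment described here, where one sequence may be warped non-linearly by stretching or shrinking along its time axis, reduces to a simple dynamic program over a cost matrix. A minimal sketch (unconstrained, i.e. without the Sakoe-Chiba band the abstract's authors are known for; absolute difference as the local cost):

```python
import numpy as np

def dtw_distance(a, b):
    """Minimal unconstrained DTW between two 1-D sequences."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])          # local distance
            # Best of match, insertion, deletion: the allowed warping steps.
            D[i, j] = cost + min(D[i - 1, j - 1], D[i - 1, j], D[i, j - 1])
    return D[n, m]

# A sequence aligns at zero cost with a time-stretched copy of itself.
x = np.array([0.0, 1.0, 2.0, 1.0, 0.0])
y = np.array([0.0, 0.0, 1.0, 1.0, 2.0, 1.0, 0.0, 0.0])
```

For speech, the scalars would be replaced by per-frame MFCC vectors and the local cost by a Euclidean distance between them.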

846 citations


"Speech recognition using MFCC and D..." refers background or methods in this paper

  • ...DCT is calculated using equation shown in (6) [6]....

    [...]

  • ...Features obtained by MFCC algorithm are similar to known variation of the human cochlea’s critical bandwidth with frequency [6]....

    [...]

  • ...On the other hand it should not be too long, so that within a particular frame the voice sample is time invariant [1,6]....

    [...]

  • ...For simple isolated word detection MFCC and DTW approach is enough and efficient [6]....

    [...]

  • ...DTW finds the optimal alignment between two time series, where one time series may be “warped” non-linearly by stretching or shrinking it along its time axis [6]....

    [...]

01 Jan 2004
TL;DR: This paper presents a security system based on speaker identification, in which Mel Frequency Cepstral Coefficients (MFCCs) are used for feature extraction and a vector quantization technique is used to minimize the amount of data to be handled.
Abstract: This paper presents a security system based on speaker identification. Mel Frequency Cepstral Coefficients (MFCCs) have been used for feature extraction, and a vector quantization technique is used to minimize the amount of data to be handled.
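The vector quantization step mentioned here can be sketched with a few Lloyd (k-means) iterations that build a small codebook of representative codewords; the codebook size, toy 2-D "feature vectors", and iteration count below are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

def train_codebook(vectors, k, iters=10, seed=0):
    # Lloyd (k-means) iterations: the codebook replaces many feature
    # vectors with k representative codewords, shrinking the data to handle.
    rng = np.random.default_rng(seed)
    code = vectors[rng.choice(len(vectors), k, replace=False)]
    for _ in range(iters):
        # Assign each vector to its nearest codeword.
        d = np.linalg.norm(vectors[:, None, :] - code[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Move each codeword to the mean of its assigned vectors.
        for j in range(k):
            if np.any(labels == j):
                code[j] = vectors[labels == j].mean(axis=0)
    return code

def quantize(vectors, code):
    # Map each vector to the index of its nearest codeword.
    d = np.linalg.norm(vectors[:, None, :] - code[None, :, :], axis=2)
    return d.argmin(axis=1)

# Two well-separated clusters of toy "feature vectors".
rng = np.random.default_rng(1)
pts = np.vstack([rng.normal(0, 0.1, (50, 2)), rng.normal(5, 0.1, (50, 2))])
book = train_codebook(pts, k=2)
labels = quantize(pts, book)
```

In a speaker-identification setting, each speaker's MFCC vectors would train a per-speaker codebook, and an unknown utterance would be scored by its quantization distortion against each codebook.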

281 citations


"Speech recognition using MFCC and D..." refers background in this paper

  • ...The feature matching algorithm cannot discern the difference between two closely spaced frequencies [9]....

    [...]

01 Jan 2010
TL;DR: The feasibility of MFCC to extract features and DTW to compare the test patterns is presented; the non-linear sequence alignment known as Dynamic Time Warping, introduced by Sakoe and Chiba, is used as the feature matching technique.
Abstract: The voice is a signal of infinite information. Digital processing of the speech signal is very important for high-speed and precise automatic voice recognition technology. Nowadays it is being used in healthcare, telephony, the military, and for people with disabilities; therefore digital signal processes such as Feature Extraction and Feature Matching are the latest issues in the study of the voice signal. In order to extract valuable information from the speech signal, make decisions on the process, and obtain results, the data needs to be manipulated and analyzed. The basic method used for extracting the features of the voice signal is to find the Mel frequency cepstral coefficients. Mel-frequency cepstral coefficients (MFCCs) are coefficients that collectively represent the short-term power spectrum of a sound, based on a linear cosine transform of a log power spectrum on a nonlinear mel scale of frequency. This paper is divided into two modules: in the first module, features of the speech signal are extracted in the form of MFCC coefficients, and in the other module the non-linear sequence alignment known as Dynamic Time Warping (DTW), introduced by Sakoe and Chiba, is used as the feature matching technique. Since the voice signal tends to have a different temporal rate, the alignment is important to produce better performance. This paper presents the feasibility of MFCC to extract features and DTW to compare the test patterns.

63 citations


"Speech recognition using MFCC and D..." refers background in this paper

  • ...The equation for calculating MEL for a given frequency is shown in (5) [8]....

    [...]

  • ...Pre-emphasis stage increases the magnitude of higher frequency with respect to lower frequencies [8]....

    [...]

  • ...They avoid interaction of noise with significant features [8]....

    [...]

01 Jan 2012
TL;DR: An accuracy of 85% is obtained by the combination of features, when the proposed approach is tested using a dataset of 280 speech samples, which is more than those obtained by using the features singly.
Abstract: This paper proposes an approach to recognize English words corresponding to digits Zero to Nine spoken in an isolated way by different male and female speakers. A set of features consisting of a combination of Mel Frequency Cepstral Coefficients (MFCC), Linear Predictive Coding (LPC), Zero Crossing Rate (ZCR), and Short Time Energy (STE) of the audio signal, is used to generate a 63-element feature vector, which is subsequently used for discrimination. Classification is done using artificial neural networks (ANN) with feedforward back-propagation architectures. An accuracy of 85% is obtained by the combination of features, when the proposed approach is tested using a dataset of 280 speech samples, which is more than those obtained by using the features singly.
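Two of the features combined above, Zero Crossing Rate and Short Time Energy, are cheap per-frame measures; a minimal sketch using their standard definitions (framing parameters and test tones are illustrative, not from the paper):

```python
import numpy as np

def frame(signal, frame_len=256, hop=128):
    # Split a 1-D signal into overlapping frames.
    n = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n)[:, None]
    return signal[idx]

def short_time_energy(frames):
    # STE: sum of squared samples per frame.
    return (frames ** 2).sum(axis=1)

def zero_crossing_rate(frames):
    # ZCR: fraction of adjacent-sample pairs whose signs differ.
    signs = np.sign(frames)
    return (np.abs(np.diff(signs, axis=1)) > 0).mean(axis=1)

fs = 8000
t = np.arange(fs) / fs
low = np.sin(2 * np.pi * 50 * t)      # low-frequency tone -> low ZCR
high = np.sin(2 * np.pi * 2000 * t)   # high-frequency tone -> high ZCR
zcr_low = zero_crossing_rate(frame(low)).mean()
zcr_high = zero_crossing_rate(frame(high)).mean()
```

Concatenating such per-frame scalars alongside MFCC and LPC coefficients is one way a combined feature vector like the paper's 63-element one could be assembled.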

42 citations


"Speech recognition using MFCC and D..." refers background in this paper

  • ...Combination of various features is to be adapted in this case for high reliability [4]....

    [...]

  • ...Various methodologies have been proposed for isolated word detection and continuous speech recognition over the years [4]....

    [...]