
Showing papers on "Hidden Markov model published in 2013"


Proceedings ArticleDOI
26 May 2013
TL;DR: Deep neural networks with rectified linear unit (ReLU) non-linearities, trained with dropout and minimal human hyper-parameter tuning on a 50-hour English Broadcast News task, show a 4.2% relative improvement over a DNN trained with sigmoid units and a 14.4% relative improvement over a strong GMM/HMM system.
Abstract: Recently, pre-trained deep neural networks (DNNs) have outperformed traditional acoustic models based on Gaussian mixture models (GMMs) on a variety of large vocabulary speech recognition benchmarks. Deep neural nets have also achieved excellent results on various computer vision tasks using a random “dropout” procedure that drastically improves generalization by randomly omitting a fraction of the hidden units in all layers. Since dropout helps avoid over-fitting, it has also been successful on a small-scale phone recognition task using larger neural nets. However, training deep neural net acoustic models for large vocabulary speech recognition takes a very long time and dropout is likely to only increase training time. Neural networks with rectified linear unit (ReLU) non-linearities have been highly successful for computer vision tasks and proved faster to train than standard sigmoid units, sometimes also improving discriminative performance. In this work, we show on a 50-hour English Broadcast News task that modified deep neural networks using ReLUs trained with dropout during frame-level training provide a 4.2% relative improvement over a DNN trained with sigmoid units, and a 14.4% relative improvement over a strong GMM/HMM system. We were able to obtain our results with minimal human hyper-parameter tuning using publicly available Bayesian optimization code.
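
As a rough illustration of the two ingredients the abstract combines, here is a minimal numpy sketch of a forward pass with ReLU non-linearities and inverted dropout; layer sizes, the dropout rate, and all names are illustrative, not the paper's configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(0.0, x)

def forward(x, weights, p_drop=0.5, train=True):
    """Forward pass through ReLU hidden layers with inverted dropout."""
    h = x
    for W in weights:
        h = relu(h @ W)
        if train:
            # Randomly zero a fraction p_drop of hidden units; scale the
            # survivors so the expected activation matches test time.
            mask = rng.random(h.shape) > p_drop
            h = h * mask / (1.0 - p_drop)
    return h

x = rng.standard_normal((8, 40))                   # 8 frames, 40-dim features
weights = [rng.standard_normal((40, 64)) * 0.1,
           rng.standard_normal((64, 64)) * 0.1]
print(forward(x, weights).shape)                   # (8, 64)
```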

1,342 citations


Proceedings ArticleDOI
26 May 2013
TL;DR: This paper examines an alternative scheme based on a deep neural network (DNN), in which the relationship between input texts and their acoustic realizations is modeled by a DNN; experimental results show that the DNN-based systems outperformed the HMM-based systems with similar numbers of parameters.
Abstract: Conventional approaches to statistical parametric speech synthesis typically use decision tree-clustered context-dependent hidden Markov models (HMMs) to represent probability densities of speech parameters given texts. Speech parameters are generated from the probability densities to maximize their output probabilities, then a speech waveform is reconstructed from the generated parameters. This approach is reasonably effective but has a couple of limitations, e.g. decision trees are inefficient to model complex context dependencies. This paper examines an alternative scheme that is based on a deep neural network (DNN). The relationship between input texts and their acoustic realizations is modeled by a DNN. The use of the DNN can address some limitations of the conventional approach. Experimental results show that the DNN-based systems outperformed the HMM-based systems with similar numbers of parameters.

880 citations


Proceedings ArticleDOI
Jui-Ting Huang1, Jinyu Li1, Dong Yu1, Li Deng1, Yifan Gong1 
26 May 2013
TL;DR: It is shown that hidden layers learned by sharing across languages can be transferred to improve recognition accuracy for new languages, with relative error reductions ranging from 6% to 28% against DNNs trained without exploiting the transferred hidden layers.
Abstract: In the deep neural network (DNN), the hidden layers can be considered as increasingly complex feature transformations and the final softmax layer as a log-linear classifier making use of the most abstract features computed in the hidden layers. While the log-linear classifier should be different for different languages, the feature transformations can be shared across languages. In this paper we propose a shared-hidden-layer multilingual DNN (SHL-MDNN), in which the hidden layers are made common across many languages while the softmax layers are made language-dependent. We demonstrate that the SHL-MDNN can reduce errors by 3-5%, relatively, for all the languages decodable with the SHL-MDNN, over the monolingual DNNs trained using only the language-specific data. Further, we show that the hidden layers learned by sharing across languages can be transferred to improve recognition accuracy for new languages, with relative error reductions ranging from 6% to 28% against DNNs trained without exploiting the transferred hidden layers. It is particularly interesting that the error reduction can be achieved even when the target language belongs to a different family than the languages used to learn the hidden layers.

653 citations


Proceedings ArticleDOI
Dong Yu1, Kaisheng Yao1, Hang Su1, Gang Li1, Frank Seide1 
26 May 2013
TL;DR: Experiments demonstrate that the proposed adaptation technique can provide 2%-30% relative error reduction against the already very strong speaker independent CD-DNN-HMM systems using different adaptation sets under both supervised and unsupervised adaptation setups.
Abstract: We propose a novel regularized adaptation technique for context-dependent deep neural network hidden Markov models (CD-DNN-HMMs). The CD-DNN-HMM has a large output layer and many large hidden layers, each with thousands of neurons. The huge number of parameters in the CD-DNN-HMM makes adaptation a challenging task, especially when the adaptation set is small. The technique developed in this paper adapts the model conservatively by forcing the senone distribution estimated from the adapted model to be close to that from the unadapted model. This constraint is realized by adding Kullback-Leibler divergence (KLD) regularization to the adaptation criterion. We show that applying this regularization is equivalent to changing the target distribution in the conventional backpropagation algorithm. Experiments on Xbox voice search, short message dictation, and Switchboard and lecture speech transcription tasks demonstrate that the proposed adaptation technique can provide 2%-30% relative error reduction against the already very strong speaker-independent CD-DNN-HMM systems using different adaptation sets under both supervised and unsupervised adaptation setups.
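
The equivalence mentioned in the abstract can be made concrete: under KLD regularization, backpropagation proceeds as usual but with an interpolated target distribution. A minimal sketch assuming that form, where `rho` plays the role of the regularization weight and all names are illustrative:

```python
import numpy as np

def kld_adapted_targets(hard_targets, si_posteriors, rho=0.5):
    """Interpolate the supervised one-hot targets with the unadapted
    (speaker-independent) model's senone posteriors. rho controls how
    conservative adaptation is: rho=1 keeps the unadapted distribution,
    rho=0 reduces to plain backpropagation on the adaptation data."""
    return (1.0 - rho) * hard_targets + rho * si_posteriors

# toy example: 2 frames, 4 senones
hard = np.array([[0, 1, 0, 0], [1, 0, 0, 0]], dtype=float)
si = np.array([[0.1, 0.7, 0.1, 0.1], [0.6, 0.2, 0.1, 0.1]])
print(kld_adapted_targets(hard, si, rho=0.3))
```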

479 citations


Journal ArticleDOI
09 Apr 2013
TL;DR: This paper gives a general overview of hidden Markov model (HMM)-based speech synthesis, which has recently been demonstrated to be very effective in synthesizing speech.
Abstract: This paper gives a general overview of hidden Markov model (HMM)-based speech synthesis, which has recently been demonstrated to be very effective in synthesizing speech. The main advantage of this approach is its flexibility in changing speaker identities, emotions, and speaking styles. This paper also discusses the relation between the HMM-based approach and the more conventional unit-selection approach that has dominated over the last decades. Finally, advanced techniques for future developments are described.

424 citations


Journal ArticleDOI
TL;DR: This paper proposes a novel speech enhancement method that is based on a Bayesian formulation of NMF (BNMF), and compares the performance of the developed algorithms with state-of-the-art speech enhancement schemes using various objective measures.
Abstract: Reducing the interference noise in a monaural noisy speech signal has been a challenging task for many years. Compared to traditional unsupervised speech enhancement methods, e.g., Wiener filtering, supervised approaches, such as algorithms based on hidden Markov models (HMM), lead to higher-quality enhanced speech signals. However, the main practical difficulty of these approaches is that for each noise type a model is required to be trained a priori. In this paper, we investigate a new class of supervised speech denoising algorithms using nonnegative matrix factorization (NMF). We propose a novel speech enhancement method that is based on a Bayesian formulation of NMF (BNMF). To circumvent the mismatch problem between the training and testing stages, we propose two solutions. First, we use an HMM in combination with BNMF (BNMF-HMM) to derive a minimum mean square error (MMSE) estimator for the speech signal with no information about the underlying noise type. Second, we suggest a scheme to learn the required noise BNMF model online, which is then used to develop an unsupervised speech enhancement system. Extensive experiments are carried out to investigate the performance of the proposed methods under different conditions. Moreover, we compare the performance of the developed algorithms with state-of-the-art speech enhancement schemes using various objective measures. Our simulations show that the proposed BNMF-based methods outperform the competing algorithms substantially.
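
For background, a minimal sketch of classical multiplicative-update NMF with the KL objective on a magnitude spectrogram V ≈ WH. The paper's actual method is a Bayesian formulation (BNMF) combined with an HMM, which this baseline does not implement.

```python
import numpy as np

def nmf_kl(V, rank=8, n_iter=100, eps=1e-9):
    """Classical multiplicative-update NMF (KL objective): V ~= W @ H.
    W holds spectral basis vectors, H their time-varying activations."""
    rng = np.random.default_rng(0)
    F, T = V.shape
    W = rng.random((F, rank)) + eps
    H = rng.random((rank, T)) + eps
    for _ in range(n_iter):
        H *= (W.T @ (V / (W @ H + eps))) / (W.sum(axis=0)[:, None] + eps)
        W *= ((V / (W @ H + eps)) @ H.T) / (H.sum(axis=1)[None, :] + eps)
    return W, H

V = np.abs(np.random.default_rng(1).standard_normal((257, 100)))  # |STFT| stand-in
W, H = nmf_kl(V)
```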

399 citations


Proceedings ArticleDOI
25 Aug 2013
TL;DR: This paper investigates several CNN architectures, including full and limited weight sharing, convolution along frequency and time axes, and stacking of several convolution layers, and develops a novel weighted softmax pooling layer so that the size of the pooling layer can be automatically learned.
Abstract: Recently, convolutional neural networks (CNNs) have been shown to outperform the standard fully connected deep neural networks within the hybrid deep neural network / hidden Markov model (DNN/HMM) framework on the phone recognition task. In this paper, we extend the earlier basic form of the CNN and explore it in multiple ways. We first investigate several CNN architectures, including full and limited weight sharing, convolution along frequency and time axes, and stacking of several convolution layers. We then develop a novel weighted softmax pooling layer so that the size in the pooling layer can be automatically learned. Further, we evaluate the effect of CNN pretraining, which is achieved by using a convolutional version of the RBM. We show that all CNN architectures we have investigated outperform the earlier basic form of the DNN on both the phone recognition and large vocabulary speech recognition tasks. The architecture with limited weight sharing provides additional gains over the full weight sharing architecture. The softmax pooling layer performs as well as the best CNN with the manually tuned fixed-pooling size, and has a potential for further improvement. Finally, we show that CNN pretraining produces significantly better results on a large vocabulary speech recognition task.
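
One plausible reading of the weighted softmax pooling idea, sketched under the assumption that a trainable score vector `a` gates each position in the pooling region; the paper's exact formulation may differ.

```python
import numpy as np

def weighted_softmax_pool(region, a):
    """Pool over a frequency region with learned weights w = softmax(a)
    instead of a fixed-size max pool. Because a is trainable, positions
    with near-zero weight are effectively excluded, so the pooling size
    is learned rather than manually tuned."""
    e = np.exp(a - a.max())
    w = e / e.sum()
    return w @ region          # region: (positions, channels) -> (channels,)

region = np.random.default_rng(0).standard_normal((5, 32))  # 5 pooled positions
print(weighted_softmax_pool(region, np.zeros(5)).shape)      # (32,)
```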

378 citations


Journal ArticleDOI
TL;DR: This paper introduces the explicit-duration Hierarchical Dirichlet Process Hidden semi-Markov Model (HDP-HSMM) and develops modular Gibbs sampling algorithms for efficient posterior inference, demonstrated on both synthetic and real data.
Abstract: There is much interest in the Hierarchical Dirichlet Process Hidden Markov Model (HDP-HMM) as a natural Bayesian nonparametric extension of the ubiquitous Hidden Markov Model for learning from sequential and time-series data. However, in many settings the HDP-HMM's strict Markovian constraints are undesirable, particularly if we wish to learn or encode non-geometric state durations. We can extend the HDP-HMM to capture such structure by drawing upon explicit-duration semi-Markov modeling, which has been developed mainly in the parametric non-Bayesian setting, to allow construction of highly interpretable models that admit natural prior information on state durations. In this paper we introduce the explicit-duration Hierarchical Dirichlet Process Hidden semi-Markov Model (HDP-HSMM) and develop sampling algorithms for efficient posterior inference. The methods we introduce also yield new approaches to sampling inference in the finite Bayesian HSMM. Our modular Gibbs sampling methods can be embedded in samplers for larger hierarchical Bayesian models, adding semi-Markov chain modeling as another tool in the Bayesian inference toolbox. We demonstrate the utility of the HDP-HSMM and our inference methods on both synthetic and real experiments.
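
To make the "explicit-duration" idea concrete, here is a toy generative sketch of an HSMM: each state draws a non-geometric duration before jumping, unlike the HMM's implicit geometric durations. The distributions are placeholders; the HDP-HSMM additionally places hierarchical Dirichlet process priors on the transition structure, which is not shown.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_hsmm(pi, A, dur_sample, emit_sample, T):
    """Generate from an explicit-duration HSMM: pick a state, draw a
    duration from its duration distribution, emit that many observations,
    then jump according to A (zero diagonal: no self-transitions)."""
    s = rng.choice(len(pi), p=pi)
    states, obs = [], []
    while len(obs) < T:
        d = dur_sample(s)                    # explicit, non-geometric duration
        states += [s] * d
        obs += [emit_sample(s) for _ in range(d)]
        s = rng.choice(len(pi), p=A[s])
    return np.array(states[:T]), np.array(obs[:T])

pi = np.array([0.5, 0.5])
A = np.array([[0.0, 1.0], [1.0, 0.0]])
dur = lambda s: 1 + rng.poisson([3, 10][s])  # state-specific mean durations
emit = lambda s: rng.normal([0.0, 5.0][s], 1.0)
print(sample_hsmm(pi, A, dur, emit, 30)[0])
```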

348 citations


Proceedings ArticleDOI
26 May 2013
TL;DR: A new fast speaker adaptation method for the hybrid NN-HMM speech recognition model that can achieve over 10% relative reduction in phone error rate by using only seven utterances for adaptation.
Abstract: In this paper, we propose a new fast speaker adaptation method for the hybrid NN-HMM speech recognition model. The adaptation method depends on a joint learning of a large generic adaptation neural network for all speakers as well as multiple small speaker codes (one per speaker). The joint training method uses all training data along with speaker labels to update adaptation NN weights and speaker codes based on the standard back-propagation algorithm. In this way, the learned adaptation NN is capable of transforming each speaker's features into a generic speaker-independent feature space when a small speaker code is given. Adaptation to a new speaker can be simply done by learning a new speaker code using the same back-propagation algorithm without changing any NN weights. In this method, a separate speaker code is learned for each speaker while the large adaptation NN is learned from the whole training set. The main advantage of this method is that the size of speaker codes is very small. As a result, it is possible to conduct a very fast adaptation of the hybrid NN/HMM model for each speaker based on only a small amount of adaptation data (i.e., just a few utterances). Experimental results on TIMIT have shown that it can achieve over 10% relative reduction in phone error rate by using only seven utterances for adaptation.
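
A minimal sketch of the speaker-code idea under simplifying assumptions (a single linear adaptation layer and a squared-error objective): the network weights stay frozen and only the small per-speaker code is updated by gradient descent.

```python
import numpy as np

def adapt_speaker_code(frames, targets, W, code_dim=8, lr=0.1, steps=200):
    """Learn a per-speaker code while the adaptation network W stays frozen.
    Input to the (here linear) net is [frame ; speaker_code]; the output
    approximates a speaker-normalized feature."""
    code = np.zeros(code_dim)
    for _ in range(steps):
        x = np.hstack([frames, np.tile(code, (len(frames), 1))])
        err = x @ W - targets                         # squared-error residual
        # gradient w.r.t. the code = code-columns of dL/dx, averaged over frames
        code -= lr * (err @ W.T)[:, -code_dim:].mean(axis=0)
    return code

rng = np.random.default_rng(0)
frames, targets = rng.standard_normal((50, 40)), rng.standard_normal((50, 40))
W = rng.standard_normal((48, 40)) * 0.1   # frozen: 40 feature dims + 8 code dims
print(adapt_speaker_code(frames, targets, W).shape)   # (8,)
```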

269 citations


Journal ArticleDOI
24 Apr 2013-Sensors
TL;DR: This paper describes the use of two powerful machine learning schemes, ANN and SVM, within the framework of HMM (Hidden Markov Model), in order to tackle the task of activity recognition in a home setting and shows how the hybrid models achieve significantly better recognition performance.
Abstract: Activities of daily living are good indicators of elderly health status, and activity recognition in smart environments is a well-known problem that has been previously addressed by several studies. In this paper, we describe the use of two powerful machine learning schemes, ANN (Artificial Neural Network) and SVM (Support Vector Machines), within the framework of HMM (Hidden Markov Model) in order to tackle the task of activity recognition in a home setting. The output scores of the discriminative models, after processing, are used as observation probabilities of the hybrid approach. We evaluate our approach by comparing these hybrid models with other classical activity recognition methods using five real datasets. We show how the hybrid models achieve significantly better recognition performance, with significance level p < 0.05, proving that the hybrid approach is better suited for the addressed domain.
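
In hybrid systems of this kind, the discriminative model's output scores are typically converted to scaled likelihoods before HMM decoding. A sketch assuming the standard conversion p(x|s) ∝ p(s|x)/p(s) followed by Viterbi; all names are illustrative.

```python
import numpy as np

def viterbi_from_posteriors(post, priors, log_A, log_pi):
    """Hybrid decoding: turn classifier posteriors p(s|x) (strictly positive,
    shape (T, S)) into scaled likelihoods p(x|s) ~ p(s|x)/p(s), then run
    standard Viterbi with transition matrix log_A and initial probs log_pi."""
    log_b = np.log(post) - np.log(priors)
    T, S = log_b.shape
    delta = log_pi + log_b[0]
    psi = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_A        # scores[i, j]: from i to j
        psi[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_b[t]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):              # backtrace
        path.append(int(psi[t][path[-1]]))
    return path[::-1]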

243 citations


Journal ArticleDOI
TL;DR: The two-step approach was found to improve the results substantially compared to the context-independent baseline system, and the detection accuracy can be almost doubled by using the proposed context-dependent event detection.
Abstract: The work presented in this article studies how context information can be used in the automatic sound event detection process, and how the detection system can benefit from such information. Humans use context information to make more accurate predictions about sound events and to rule out unlikely events given the context. We propose a similar utilization of context information in the automatic sound event detection process. The proposed approach is composed of two stages: automatic context recognition stage and sound event detection stage. Contexts are modeled using Gaussian mixture models and sound events are modeled using three-state left-to-right hidden Markov models. In the first stage, audio context of the tested signal is recognized. Based on the recognized context, a context-specific set of sound event classes is selected for the sound event detection stage. The event detection stage also uses context-dependent acoustic models and count-based event priors. Two alternative event detection approaches are studied. In the first one, a monophonic event sequence is output by detecting the most prominent sound event at each time instance using Viterbi decoding. The second approach introduces a new method for producing a polyphonic event sequence by detecting multiple overlapping sound events using multiple restricted Viterbi passes. A new metric is introduced to evaluate the sound event detection performance with various levels of polyphony. This combines the detection accuracy and coarse time-resolution error into one metric, making the comparison of the performance of detection algorithms simpler. The two-step approach was found to improve the results substantially compared to the context-independent baseline system. At the block level, the detection accuracy can be almost doubled by using the proposed context-dependent event detection.

Proceedings ArticleDOI
26 May 2013
TL;DR: This work investigates multilingual modeling in the context of a DNN - hidden Markov model (HMM) hybrid, where the DNN outputs are used as the HMM state likelihoods, and proposes that training the hidden layers on multiple languages makes them more suitable for cross-lingual transfer.
Abstract: We investigate multilingual modeling in the context of a deep neural network (DNN) - hidden Markov model (HMM) hybrid, where the DNN outputs are used as the HMM state likelihoods. By viewing neural networks as a cascade of feature extractors followed by a logistic regression classifier, we hypothesise that the hidden layers, which act as feature extractors, will be transferable between languages. As a corollary, we propose that training the hidden layers on multiple languages makes them more suitable for such cross-lingual transfer. We experimentally confirm these hypotheses on the GlobalPhone corpus using seven languages from three different language families: Germanic, Romance, and Slavic. The experiments demonstrate substantial improvements over a monolingual DNN-HMM hybrid baseline, and hint at avenues of further exploration.
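
The architecture the abstract describes amounts to a shared trunk of feature-extracting layers with one classifier head per language. A schematic numpy sketch (ReLU units and all sizes are assumptions made for brevity, not the paper's configuration):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

class SharedHiddenLayerNet:
    """Hidden layers shared across languages; one softmax head per language."""
    def __init__(self, shared_weights, heads):
        self.shared = shared_weights        # list of weight matrices
        self.heads = heads                  # dict: language -> output matrix

    def posteriors(self, x, language):
        h = x
        for W in self.shared:               # language-independent feature extractor
            h = np.maximum(0.0, h @ W)
        return softmax(h @ self.heads[language])

rng = np.random.default_rng(0)
net = SharedHiddenLayerNet([rng.standard_normal((40, 64)) * 0.1],
                           {"FR": rng.standard_normal((64, 100)) * 0.1})
print(net.posteriors(rng.standard_normal((5, 40)), "FR").shape)   # (5, 100)
```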

Journal ArticleDOI
TL;DR: A wearable motion detection device using a tri-axial accelerometer is designed and realized, which can detect and predict falls from the tri-axial acceleration of the human upper trunk, using acceleration time series extracted from human motion processes.
Abstract: Falls in the elderly have always been a serious medical and social problem. To detect and predict falls, a hidden Markov model (HMM)-based method using tri-axial accelerations of the human body is proposed. A wearable motion detection device using a tri-axial accelerometer is designed and realized, which can detect and predict falls based on the tri-axial acceleration of the human upper trunk. The acceleration time series (ATS) extracted from human motion processes are used to describe human motion features, and the ATS extracted from human fall courses but before the collision are used to train the HMM so as to build a random process mathematical model. Thus, the outputs of the HMM, which express the matching degree between an input ATS and the HMM, can be used to evaluate the risk of falling. The experimental results show that fall events can be predicted 200-400 ms ahead of the occurrence of collisions, and distinguished from other daily life activities with an accuracy of 100%.
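
In the usual formulation, the "matching degree" of an input window against a trained HMM is its forward-algorithm likelihood, and thresholding this score gives the fall-risk decision. A log-space sketch under that standard assumption:

```python
import numpy as np

def hmm_log_likelihood(log_b, log_A, log_pi):
    """Forward algorithm in log space. log_b: (T, S) per-frame emission
    log-probs of the observation window (e.g. an acceleration time series),
    log_A: (S, S) transitions, log_pi: (S,) initial state log-probs."""
    alpha = log_pi + log_b[0]
    for t in range(1, len(log_b)):
        m = alpha.max()                      # logsumexp over previous states
        alpha = m + np.log(np.exp(alpha - m) @ np.exp(log_A)) + log_b[t]
    m = alpha.max()
    return m + np.log(np.exp(alpha - m).sum())

S = 3
rng = np.random.default_rng(0)
log_b = np.log(rng.dirichlet(np.ones(S), size=20))
log_A = np.log(np.full((S, S), 1.0 / S))
log_pi = np.log(np.full(S, 1.0 / S))
print(hmm_log_likelihood(log_b, log_A, log_pi))
```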

Journal ArticleDOI
TL;DR: An adaptation of the original algorithm that incorporates a cancer-specific model to potentiate the functional analysis of driver mutations is described; improved performance was observed when distinguishing between driver mutations and other germ line variants (both disease-causing and putatively neutral mutations).
Abstract: Motivation: The number of missense mutations being identified in cancer genomes has greatly increased as a consequence of technological advances and the reduced cost of whole-genome/whole-exome sequencing methods. However, a high proportion of the amino acid substitutions detected in cancer genomes have little or no effect on tumour progression (passenger mutations). Therefore, accurate automated methods capable of discriminating between driver (cancer-promoting) and passenger mutations are becoming increasingly important. In our previous work, we developed the Functional Analysis through Hidden Markov Models (FATHMM) software and, using a model weighted for inherited disease mutations, observed improved performances over alternative computational prediction algorithms. Here, we describe an adaptation of our original algorithm that incorporates a cancer-specific model to potentiate the functional analysis of driver mutations. Results: The performance of our algorithm was evaluated using two separate benchmarks. In our analysis, we observed improved performances when distinguishing between driver mutations and other germ line variants (both disease-causing and putatively neutral mutations). In addition, when discriminating between somatic driver and passenger mutations, we observed performances comparable with the leading computational prediction algorithms: SPF-Cancer and TransFIC. Availability and implementation: A web-based implementation of our cancer-specific model, including a downloadable stand-alone package, is available at http://fathmm.biocompute.org.uk. Contact: fathmm@biocompute.org.uk. Supplementary information: Supplementary data are available at Bioinformatics online.

Patent
31 Jul 2013
TL;DR: In this paper, techniques are shown and described relating to a Hidden Markov Model based architecture to monitor network node activities and predict relevant periods of traffic for individual nodes by extrapolating a most-likely future sequence based on prior respective traffic of the individual nodes and the corresponding matching traffic profile.
Abstract: In one embodiment, techniques are shown and described relating to a Hidden Markov Model based architecture to monitor network node activities and predict relevant periods. In particular, in one embodiment, a device determines a statistical model for each of one or more singular-node traffic profiles (e.g., based on one or more Hidden Markov Models (HMMs) each corresponding to a respective one of the one or more traffic profiles). By analyzing respective traffic from individual nodes in a computer network, and matching the respective traffic against the statistical model for the one or more traffic profiles, the device may detect a matching traffic profile for the individual nodes in a computer network. In addition, the device may predict relevant periods of traffic for the individual nodes by extrapolating a most-likely future sequence based on prior respective traffic of the individual nodes and the corresponding matching traffic profile.

Proceedings ArticleDOI
26 May 2013
TL;DR: A novel deep convolutional neural network architecture is developed, where heterogeneous pooling is used to provide constrained frequency-shift invariance in the speech spectrogram while minimizing speech-class confusion induced by such invariance.
Abstract: We develop and present a novel deep convolutional neural network architecture, where heterogeneous pooling is used to provide constrained frequency-shift invariance in the speech spectrogram while minimizing speech-class confusion induced by such invariance. The design of the pooling layer is guided by domain knowledge about how speech classes would change when formant frequencies are modified. The convolution and heterogeneous-pooling layers are followed by a fully connected multi-layer neural network to form a deep architecture interfaced to an HMM for continuous speech recognition. During training, all layers of this entire deep net are regularized using a variant of the “dropout” technique. Experimental evaluation demonstrates the effectiveness of both heterogeneous pooling and dropout regularization. On the TIMIT phonetic recognition task, we have achieved an 18.7% phone error rate, lowest on this standard task reported in the literature with a single system and with no use of information about speaker identity. Preliminary experiments on large vocabulary speech recognition in a voice search task also show error rate reduction using heterogeneous pooling in the deep convolutional neural network.

Proceedings ArticleDOI
Jing Huang1, Brian Kingsbury1
26 May 2013
TL;DR: This work uses DBNs for audio-visual speech recognition; in particular, it uses deep learning from audio and visual features for noise-robust speech recognition and tests two methods for using DBNs in a multimodal setting.
Abstract: Deep belief networks (DBN) have shown impressive improvements over Gaussian mixture models for automatic speech recognition. In this work we use DBNs for audio-visual speech recognition; in particular, we use deep learning from audio and visual features for noise robust speech recognition. We test two methods for using DBNs in a multimodal setting: a conventional decision fusion method that combines scores from single-modality DBNs, and a novel feature fusion method that operates on mid-level features learned by the single-modality DBNs. On a continuously spoken digit recognition task, our experiments show that these methods can reduce word error rate by as much as 21% relative over a baseline multi-stream audio-visual GMM/HMM system.
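
The two fusion strategies can be contrasted in a few lines. A schematic sketch (the weighting scheme and any joint classifier are illustrative assumptions, not the paper's exact setup):

```python
import numpy as np

def decision_fusion(log_scores_audio, log_scores_video, w=0.7):
    """Late fusion: combine per-class log scores produced by the two
    single-modality models, weighted by a stream reliability factor."""
    return w * log_scores_audio + (1.0 - w) * log_scores_video

def feature_fusion(mid_audio, mid_video):
    """Early/mid fusion: concatenate mid-level features learned by each
    single-modality DBN; a joint classifier is then trained on the result."""
    return np.concatenate([mid_audio, mid_video], axis=-1)
```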

Journal ArticleDOI
TL;DR: A new unsupervised approach for human activity recognition from raw acceleration data measured using inertial wearable sensors based upon joint segmentation of multidimensional time series using a Hidden Markov Model (HMM) in a multiple regression context.
Abstract: Using supervised machine learning approaches to recognize human activities from on-body wearable accelerometers generally requires a large amount of labeled data. When ground truth information is not available, too expensive, time consuming or difficult to collect, one has to rely on unsupervised approaches. This paper presents a new unsupervised approach for human activity recognition from raw acceleration data measured using inertial wearable sensors. The proposed method is based upon joint segmentation of multidimensional time series using a Hidden Markov Model (HMM) in a multiple regression context. The model is learned in an unsupervised framework using the Expectation-Maximization (EM) algorithm where no activity labels are needed. The proposed method takes into account the sequential appearance of the data. It is therefore adapted for the temporal acceleration data to accurately detect the activities. It allows both segmentation and classification of the human activities. Experimental results are provided to demonstrate the efficiency of the proposed approach with respect to standard supervised and unsupervised classification approaches.
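
As an analogous (much simplified) baseline, a plain Gaussian-emission HMM can already perform unsupervised joint segmentation and labeling of acceleration data via EM; the paper's model additionally embeds the HMM in a multiple regression context. A sketch using the hmmlearn package, assuming it is installed:

```python
import numpy as np
from hmmlearn.hmm import GaussianHMM

# X: tri-axial acceleration frames, shape (n_samples, 3); synthetic stand-in
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (200, 3)),      # e.g. "still"
               rng.normal(5, 2, (200, 3))])     # e.g. "walking"

# EM training needs no activity labels; states are discovered, not named.
model = GaussianHMM(n_components=2, covariance_type="diag", n_iter=50)
model.fit(X)
segments = model.predict(X)   # joint segmentation + (unlabeled) classification
```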

Journal ArticleDOI
TL;DR: The proposed spectral modeling method can significantly alleviate the over-smoothing effect and improve the naturalness of the conventional HMM-based speech synthesis system using mel-cepstra.
Abstract: This paper presents a new spectral modeling method for statistical parametric speech synthesis. In the conventional methods, high-level spectral parameters, such as mel-cepstra or line spectral pairs, are adopted as the features for hidden Markov model (HMM)-based parametric speech synthesis. Our proposed method described in this paper improves the conventional method in two ways. First, distributions of low-level, un-transformed spectral envelopes (extracted by the STRAIGHT vocoder) are used as the parameters for synthesis. Second, instead of using single Gaussian distribution, we adopt the graphical models with multiple hidden variables, including restricted Boltzmann machines (RBM) and deep belief networks (DBN), to represent the distribution of the low-level spectral envelopes at each HMM state. At the synthesis time, the spectral envelopes are predicted from the RBM-HMMs or the DBN-HMMs of the input sentence following the maximum output probability parameter generation criterion with the constraints of the dynamic features. A Gaussian approximation is applied to the marginal distribution of the visible stochastic variables in the RBM or DBN at each HMM state in order to achieve a closed-form solution to the parameter generation problem. Our experimental results show that both RBM-HMM and DBN-HMM are able to generate spectral envelope parameter sequences better than the conventional Gaussian-HMM with superior generalization capabilities and that DBN-HMM and RBM-HMM perform similarly due possibly to the use of Gaussian approximation. As a result, our proposed method can significantly alleviate the over-smoothing effect and improve the naturalness of the conventional HMM-based speech synthesis system using mel-cepstra.

Journal ArticleDOI
TL;DR: HMMs with log-normal, Gamma and generalized extreme value (GEV) distributions are developed to predict PM2.5 concentration at the Concord and Sacramento monitors in Northern California, and show that the closer the distribution employed in the HMM is to the observation sequence, the better the model's prediction performance.

Proceedings ArticleDOI
12 Jun 2013
TL;DR: In this article, the authors investigate a novel approach in which the input to the ANN is the raw speech signal and the output is phoneme class conditional probability estimates, and compare it against a conventional approach in which spectral-based MFCC features are extracted and modeled by a multilayer perceptron.
Abstract: In a hybrid hidden Markov model/artificial neural network (HMM/ANN) automatic speech recognition (ASR) system, the phoneme class conditional probabilities are estimated by first extracting acoustic features from the speech signal based on prior knowledge, such as speech perception and/or speech production knowledge, and then modeling the acoustic features with an ANN. Recent advances in machine learning techniques, more specifically in the fields of image processing and text processing, have shown that such a divide-and-conquer strategy (i.e., separating the feature extraction and modeling steps) may not be necessary. Motivated by these studies, in the framework of convolutional neural networks (CNNs), this paper investigates a novel approach, where the input to the ANN is the raw speech signal and the output is phoneme class conditional probability estimates. On the TIMIT phoneme recognition task, we study different ANN architectures to show the benefit of CNNs and compare the proposed approach against the conventional approach, in which spectral-based MFCC features are extracted and modeled by a multilayer perceptron. Our studies show that the proposed approach can yield comparable or better phoneme recognition performance when compared to the conventional approach. This indicates that CNNs can learn features relevant for phoneme classification automatically from the raw speech signal.
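
A schematic PyTorch sketch of the raw-waveform-to-phoneme-posterior pipeline the abstract describes; all layer shapes and the phoneme count are illustrative, not the paper's CNN configuration.

```python
import torch
import torch.nn as nn

class RawSpeechCNN(nn.Module):
    """Sketch: map a raw waveform window to phoneme posteriors directly,
    letting the first convolution play the role of a learned filterbank."""
    def __init__(self, n_phonemes=39):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, 32, kernel_size=160, stride=80),  # learned "filterbank"
            nn.ReLU(),
            nn.Conv1d(32, 32, kernel_size=5),
            nn.ReLU(),
            nn.AdaptiveMaxPool1d(1),
        )
        self.classifier = nn.Linear(32, n_phonemes)

    def forward(self, wav):                    # wav: (batch, 1, n_samples)
        h = self.features(wav).squeeze(-1)     # (batch, 32)
        return torch.log_softmax(self.classifier(h), dim=-1)

posts = RawSpeechCNN()(torch.randn(4, 1, 4000))   # 250 ms windows at 16 kHz
print(posts.shape)                                 # torch.Size([4, 39])
```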

Journal ArticleDOI
TL;DR: This paper proposes a discriminative probabilistic object modeling approach that builds probabilistic models for each object based on the distribution of its views; the distance between two objects is defined as the upper bound of the Kullback-Leibler divergence of the corresponding probabilistic models.
Abstract: In view-based 3D object retrieval and recognition, each object is described by multiple views. A central problem is how to estimate the distance between two objects. Most conventional methods integrate the distances of view pairs across two objects as an estimation of their distance. In this paper, we propose a discriminative probabilistic object modeling approach. It builds probabilistic models for each object based on the distribution of its views, and the distance between two objects is defined as the upper bound of the Kullback-Leibler divergence of the corresponding probabilistic models. 3D object retrieval and recognition is accomplished based on the distance measures. We first learn models for each object by the adaptation from a set of global models with a maximum likelihood principle. A further adaption step is then performed to enhance the discriminative ability of the models. We conduct experiments on the ETH 3D object dataset, the National Taiwan University 3D model dataset, and the Princeton Shape Benchmark. We compare our approach with different methods, and experimental results demonstrate the superiority of our approach.

Journal ArticleDOI
TL;DR: A practically feasible approximation via Hidden Markov Models (HMMs) is derived, and it is shown how various molecular observables of interest that are often computed from MSMs can be computed from HMMs/PMMs.
Abstract: Markov state models (MSMs) have been successful in computing metastable states, slow relaxation timescales and associated structural changes, and stationary or kinetic experimental observables of complex molecules from large amounts of molecular dynamics simulation data. However, MSMs approximate the true dynamics by assuming a Markov chain on a clustered discretization of the state space. This approximation is difficult to make for high-dimensional biomolecular systems, and the quality and reproducibility of MSMs have, therefore, been limited. Here, we discard the assumption that dynamics are Markovian on the discrete clusters. Instead, we only assume that the full phase-space molecular dynamics is Markovian, and a projection of this full dynamics is observed on the discrete states, leading to the concept of Projected Markov Models (PMMs). Robust estimation methods for PMMs are not yet available, but we derive a practically feasible approximation via Hidden Markov Models (HMMs). It is shown how various molecular observables of interest that are often computed from MSMs can be computed from HMMs/PMMs. The new framework is applicable to both simulation and single-molecule experimental data. We demonstrate its versatility by applications to educative model systems, a 1 ms Anton MD simulation of the bovine pancreatic trypsin inhibitor protein, and an optical tweezer force probe trajectory of an RNA hairpin.

Proceedings ArticleDOI
02 Sep 2013
TL;DR: Experimental results show that when the numbers of hidden layers and hidden units are properly set, the DNN can extend the labeling ability of the GMM-HMM.
Abstract: Deep Neural Network Hidden Markov Models, or DNN-HMMs, are promising recent acoustic models that achieve better speech recognition results than Gaussian mixture model based HMMs (GMM-HMMs). In this paper, for emotion recognition from speech, we investigate DNN-HMMs with restricted Boltzmann Machine (RBM) based unsupervised pre-training, and DNN-HMMs with discriminative pre-training. Emotion recognition experiments are carried out on these two models on the eNTERFACE'05 database and Berlin database, respectively, and results are compared with those from the GMM-HMMs, the shallow-NN-HMMs with two layers, as well as the Multi-layer Perceptron HMMs (MLP-HMMs). Experimental results show that when the numbers of hidden layers and hidden units are properly set, the DNN can extend the labeling ability of the GMM-HMM. Among all the models, the DNN-HMMs with discriminative pre-training obtain the best results. For example, on the eNTERFACE'05 database, the recognition accuracy improves by 12.22% over the DNN-HMMs with unsupervised pre-training, by 11.67% over the GMM-HMMs, by 10.56% over the MLP-HMMs, and by 17.22% over the shallow-NN-HMMs.

Journal ArticleDOI
TL;DR: An end-to-end learning-from-demonstration framework for teaching force-based manipulation tasks to robots, in which the robot execution phase is developed by implementing a modified version of Gaussian Mixture Regression that uses implicit temporal information from the probabilistic model, needed when tackling tasks with ambiguous perceptions.
Abstract: This paper proposes an end-to-end learning from demonstration framework for teaching force-based manipulation tasks to robots. The strengths of this work are manifold. First, we deal with the problem of learning through force perceptions exclusively. Second, we propose to exploit haptic feedback both as a means for improving teacher demonstrations and as a human-robot interaction tool, establishing a bidirectional communication channel between the teacher and the robot, in contrast to the works using kinesthetic teaching. Third, we address the well-known "what to imitate?" problem from a different point of view, based on the mutual information between perceptions and actions. Lastly, the teacher's demonstrations are encoded using a Hidden Markov Model, and the robot execution phase is developed by implementing a modified version of Gaussian Mixture Regression that uses implicit temporal information from the probabilistic model, needed when tackling tasks with ambiguous perceptions. Experimental results show that the robot is able to learn and reproduce two different manipulation tasks, with a performance comparable to the teacher's.
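
Standard Gaussian Mixture Regression, which the paper modifies, conditions a joint GMM over (input, output) dimensions on an observed input and returns the responsibility-weighted conditional means. A minimal sketch of that baseline (the paper's temporal-information modification is not shown):

```python
import numpy as np

def gmr(t, priors, means, covs, in_idx, out_idx):
    """Gaussian Mixture Regression: condition a joint GMM on input t and
    return the expected output E[y | t]."""
    resp, cond_means = [], []
    for w, mu, S in zip(priors, means, covs):
        S_ii = S[np.ix_(in_idx, in_idx)]
        S_oi = S[np.ix_(out_idx, in_idx)]
        diff = t - mu[in_idx]
        # responsibility of this component for the observed input
        quad = diff @ np.linalg.solve(S_ii, diff)
        resp.append(w * np.exp(-0.5 * quad)
                    / np.sqrt(np.linalg.det(2 * np.pi * S_ii)))
        # conditional mean of the output dimensions given the input
        cond_means.append(mu[out_idx] + S_oi @ np.linalg.solve(S_ii, diff))
    resp = np.array(resp) / np.sum(resp)
    return sum(r * m for r, m in zip(resp, cond_means))

# toy: 1-D input (dim 0), 1-D output (dim 1), two components
priors = [0.5, 0.5]
means = [np.array([0.0, 0.0]), np.array([2.0, 3.0])]
covs = [np.eye(2), np.eye(2)]
print(gmr(np.array([1.0]), priors, means, covs, [0], [1]))
```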

Proceedings ArticleDOI
19 Jun 2013
TL;DR: This paper presents a human activity detection model that uses only 3-D skeleton features generated from an RGB-D sensor (Microsoft Kinect TM); to infer the human activities, a Gaussian Mixture Model (GMM) based Hidden Markov Model (HMM) is implemented.
Abstract: The ability to recognize human activities will enhance the capabilities of a robot that interacts with humans. However, automatic detection of human activities could be challenging due to the individual nature of the activities. In this paper, we present a human activity detection model that uses only 3-D skeleton features generated from an RGB-D sensor (Microsoft Kinect TM). To infer the human activities, we implemented a Gaussian Mixture Model (GMM) based Hidden Markov Model (HMM). The GMM outputs of the HMM effectively captured the multimodal nature of the 3-D positions of each skeleton joint. We test our model on a publicly available dataset that consists of twelve different daily activities performed by four different people. The proposed model recorded a recognition recall accuracy of 84% with previously seen people and 78% with previously unseen people.

Proceedings ArticleDOI
23 Jun 2013
TL;DR: An efficient clustering framework designed specifically for face clustering in videos, exploiting the fact that faces in adjacent frames of the same face track are very similar, is introduced; the framework is also applicable to other clustering algorithms, significantly reducing the computational cost.
Abstract: In this paper, we focus on face clustering in videos. Given the detected faces from real-world videos, we partition all faces into K disjoint clusters. Different from clustering on a collection of facial images, the faces from videos are organized as face tracks and the frame index of each face is also provided. As a result, many pairwise constraints between faces can be easily obtained from the temporal and spatial knowledge of the face tracks. These constraints can be effectively incorporated into a generative clustering model based on the Hidden Markov Random Fields (HMRFs). Within the HMRF model, the pairwise constraints are augmented by label-level and constraint-level local smoothness to guide the clustering process. The parameters for both the unary and the pairwise potential functions are learned by the simulated field algorithm, and the weights of constraints can be easily adjusted. We further introduce an efficient clustering framework specifically for face clustering in videos, considering that faces in adjacent frames of the same face track are very similar. The framework is applicable to other clustering algorithms to significantly reduce the computational cost. Experiments on two face data sets from real-world videos demonstrate the significantly improved performance of our algorithm over state-of-the-art algorithms.
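
The constraint-generation step the abstract relies on is easy to make concrete: faces within one track must be the same person, and faces from different tracks that co-occur in a frame cannot be. A sketch under that reading; the data layout is an assumption.

```python
def track_constraints(tracks):
    """Derive pairwise constraints from face tracks. Each track is a list
    of (frame_index, face_id) pairs for one tracked face."""
    must_link, cannot_link = set(), set()
    for track in tracks:
        ids = [face for _, face in track]
        # temporal knowledge: faces in the same track share a cluster
        must_link.update((a, b) for i, a in enumerate(ids) for b in ids[i + 1:])
    # spatial knowledge: co-occurring faces from different tracks differ
    for i, t1 in enumerate(tracks):
        frames1 = {f: face for f, face in t1}
        for t2 in tracks[i + 1:]:
            cannot_link.update((frames1[f], face) for f, face in t2 if f in frames1)
    return must_link, cannot_link

tracks = [[(0, "a0"), (1, "a1")], [(0, "b0"), (1, "b1")]]
ml, cl = track_constraints(tracks)
print(sorted(ml), sorted(cl))
```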

Proceedings ArticleDOI
Zhi-Jie Yan1, Qiang Huo1, Jian Xu1
25 Aug 2013
TL;DR: A new scalable approach to using deep neural network (DNN) derived features in Gaussian mixture density hidden Markov model (GMM-HMM) based acoustic modeling for large vocabulary continuous speech recognition (LVCSR).
Abstract: We present a new scalable approach to using deep neural network (DNN) derived features in Gaussian mixture density hidden Markov model (GMM-HMM) based acoustic modeling for large vocabulary continuous speech recognition (LVCSR). The DNN-based feature extractor is trained from a subset of training data to mitigate the scalability issue of DNN training, while GMM-HMMs are trained by using state-of-the-art scalable training methods and tools to leverage the whole training set. In a benchmark evaluation, we used 309-hour Switchboard-I (SWB) training data to train a DNN first, which achieves a word error rate (WER) of 15.4% on NIST-2000 Hub5 evaluation set by a traditional DNN-HMM based approach. When the same DNN is used as a feature extractor and 2,000-hour “SWB+Fisher” training data is used to train the GMM-HMMs, our DNN-GMM-HMM approach achieves a WER of 13.8%. If per-conversation-side based unsupervised adaptation is performed, a WER of 13.1% can be achieved.
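
The feature-extractor role of the DNN can be sketched as tapping a hidden layer's activations and handing them to conventional GMM-HMM training. A minimal sketch (sigmoid units and the tapped layer are assumptions; the usual decorrelation/PCA step before GMM training is omitted):

```python
import numpy as np

def dnn_features(frames, weights, biases, tap_layer=-1):
    """Run acoustic frames through a trained DNN and tap one hidden layer's
    activations as features for a conventional GMM-HMM system.
    frames: (N, F); weights[i]: (d_in, d_out); biases[i]: (d_out,)."""
    h = frames
    activations = []
    for W, b in zip(weights, biases):
        h = 1.0 / (1.0 + np.exp(-(h @ W + b)))   # sigmoid hidden layers
        activations.append(h)
    return activations[tap_layer]

rng = np.random.default_rng(0)
frames = rng.standard_normal((100, 40))
weights = [rng.standard_normal((40, 64)) * 0.1, rng.standard_normal((64, 64)) * 0.1]
biases = [np.zeros(64), np.zeros(64)]
print(dnn_features(frames, weights, biases).shape)   # (100, 64) -> GMM-HMM input
```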