
Showing papers on "Hidden Markov model published in 2012"


Journal ArticleDOI
TL;DR: This article provides an overview of progress and represents the shared views of four research groups that have had recent successes in using DNNs for acoustic modeling in speech recognition.
Abstract: Most current speech recognition systems use hidden Markov models (HMMs) to deal with the temporal variability of speech and Gaussian mixture models (GMMs) to determine how well each state of each HMM fits a frame or a short window of frames of coefficients that represents the acoustic input. An alternative way to evaluate the fit is to use a feed-forward neural network that takes several frames of coefficients as input and produces posterior probabilities over HMM states as output. Deep neural networks (DNNs) that have many hidden layers and are trained using new methods have been shown to outperform GMMs on a variety of speech recognition benchmarks, sometimes by a large margin. This article provides an overview of this progress and represents the shared views of four research groups that have had recent successes in using DNNs for acoustic modeling in speech recognition.

9,091 citations
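
To make the hybrid setup concrete, here is a minimal sketch of the kind of feed-forward network described above (PyTorch; the layer sizes, window width, and state count are illustrative assumptions, not taken from the article):

```python
import torch
import torch.nn as nn

N_COEFFS = 40        # acoustic coefficients per frame (assumed)
CONTEXT = 11         # frames in the stacked input window (assumed)
N_HMM_STATES = 2000  # number of HMM states / senones (assumed)

# Feed-forward net: a stacked window of frames in, HMM-state posteriors out.
dnn = nn.Sequential(
    nn.Linear(N_COEFFS * CONTEXT, 1024), nn.Sigmoid(),
    nn.Linear(1024, 1024), nn.Sigmoid(),
    nn.Linear(1024, N_HMM_STATES),  # logits; softmax gives posteriors
)

window = torch.randn(1, N_COEFFS * CONTEXT)      # one stacked frame window
posteriors = torch.softmax(dnn(window), dim=-1)  # p(HMM state | acoustics)
```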


Journal ArticleDOI
TL;DR: A pre-trained deep neural network hidden Markov model (DNN-HMM) hybrid architecture that trains the DNN to produce a distribution over senones (tied triphone states) as its output that can significantly outperform the conventional context-dependent Gaussian mixture model (GMM)-HMMs.
Abstract: We propose a novel context-dependent (CD) model for large-vocabulary speech recognition (LVSR) that leverages recent advances in using deep belief networks for phone recognition. We describe a pre-trained deep neural network hidden Markov model (DNN-HMM) hybrid architecture that trains the DNN to produce a distribution over senones (tied triphone states) as its output. The deep belief network pre-training algorithm is a robust and often helpful way to initialize deep neural networks generatively, which can aid in optimization and reduce generalization error. We illustrate the key components of our model, describe the procedure for applying CD-DNN-HMMs to LVSR, and analyze the effects of various modeling choices on performance. Experiments on a challenging business search dataset demonstrate that CD-DNN-HMMs can significantly outperform the conventional context-dependent Gaussian mixture model (GMM)-HMMs, with an absolute sentence accuracy improvement of 5.8% and 9.2% (or relative error reduction of 16.0% and 23.2%) over the CD-GMM-HMMs trained using the minimum phone error rate (MPE) and maximum-likelihood (ML) criteria, respectively.

3,120 citations
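
At decoding time, hybrid systems commonly convert the DNN's senone posteriors into scaled likelihoods by dividing by the senone priors, since the HMM decoder expects likelihoods. A small numpy sketch of this standard conversion (toy values, not the paper's code):

```python
import numpy as np

def scaled_log_likelihoods(log_posteriors, log_priors):
    """Turn senone posteriors p(s|x) into scaled likelihoods
    p(x|s) ~ p(s|x) / p(s), in the log domain, for HMM decoding."""
    return log_posteriors - log_priors

rng = np.random.default_rng(0)
logits = rng.normal(size=(3, 5))  # 3 frames, 5 senones (toy)
log_post = logits - np.logaddexp.reduce(logits, axis=1, keepdims=True)
log_prior = np.log(np.full(5, 0.2))  # toy priors; in practice estimated from alignments
print(scaled_log_likelihoods(log_post, log_prior))
```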


Journal Article
TL;DR: This paper provides an overview of this progress and represents the shared views of four research groups who have had recent successes in using deep neural networks for acoustic modeling in speech recognition.
Abstract: Most current speech recognition systems use hidden Markov models (HMMs) to deal with the temporal variability of speech and Gaussian mixture models (GMMs) to determine how well each state of each HMM fits a frame or a short window of frames of coefficients that represents the acoustic input. An alternative way to evaluate the fit is to use a feed-forward neural network that takes several frames of coefficients as input and produces posterior probabilities over HMM states as output. Deep neural networks (DNNs) that have many hidden layers and are trained using new methods have been shown to outperform GMMs on a variety of speech recognition benchmarks, sometimes by a large margin. This article provides an overview of this progress and represents the shared views of four research groups that have had recent successes in using DNNs for acoustic modeling in speech recognition.

2,527 citations


Journal ArticleDOI
TL;DR: An open-source, general-purpose tool that represents both query and database sequences by profile hidden Markov models (HMMs): 'HMM-HMM-based lightning-fast iterative sequence search' (HHblits; http://toolkit.genzentrum.lmu.de/hhblits/).
Abstract: Sequence-based protein function and structure prediction depends crucially on sequence-search sensitivity and accuracy of the resulting sequence alignments. We present an open-source, general-purpose tool that represents both query and database sequences by profile hidden Markov models (HMMs): 'HMM-HMM-based lightning-fast iterative sequence search' (HHblits; http://toolkit.genzentrum.lmu.de/hhblits/). Compared to the sequence-search tool PSI-BLAST, HHblits is faster owing to its discretized-profile prefilter, has 50-100% higher sensitivity and generates more accurate alignments.

1,865 citations


Journal ArticleDOI
TL;DR: It is shown that better phone recognition on the TIMIT dataset can be achieved by replacing Gaussian mixture models by deep neural networks that contain many layers of features and a very large number of parameters.
Abstract: Gaussian mixture models are currently the dominant technique for modeling the emission distribution of hidden Markov models for speech recognition. We show that better phone recognition on the TIMIT dataset can be achieved by replacing Gaussian mixture models by deep neural networks that contain many layers of features and a very large number of parameters. These networks are first pre-trained as a multi-layer generative model of a window of spectral feature vectors without making use of any discriminative information. Once the generative pre-training has designed the features, we perform discriminative fine-tuning using backpropagation to adjust the features slightly to make them better at predicting a probability distribution over the states of monophone hidden Markov models.

1,767 citations
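
The generative pre-training described above stacks restricted Boltzmann machines trained layer by layer; the following toy numpy sketch shows one contrastive-divergence (CD-1) update for a single binary RBM layer (sizes and learning rate are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(v0, W, b_vis, b_hid, lr=0.01):
    """One contrastive-divergence (CD-1) update for a binary RBM,
    the building block of layer-by-layer generative pre-training."""
    h0_prob = sigmoid(v0 @ W + b_hid)       # up: data -> hidden
    h0 = (rng.random(h0_prob.shape) < h0_prob).astype(float)
    v1_prob = sigmoid(h0 @ W.T + b_vis)     # down: reconstruction
    h1_prob = sigmoid(v1_prob @ W + b_hid)  # up again
    # Move weights toward data statistics, away from reconstruction statistics.
    W += lr * (v0.T @ h0_prob - v1_prob.T @ h1_prob) / len(v0)
    b_vis += lr * (v0 - v1_prob).mean(axis=0)
    b_hid += lr * (h0_prob - h1_prob).mean(axis=0)

# Toy usage: 100 binary "spectral windows" of 20 features, 50 hidden units.
v = (rng.random((100, 20)) < 0.5).astype(float)
W, b_vis, b_hid = rng.normal(0, 0.01, (20, 50)), np.zeros(20), np.zeros(50)
for _ in range(10):
    cd1_step(v, W, b_vis, b_hid)
```

After pre-training, the learned weights initialize a feed-forward net that is then fine-tuned with backpropagation, as the abstract describes.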


Proceedings ArticleDOI
16 Jun 2012
TL;DR: This paper presents a novel approach for human action recognition with histograms of 3D joint locations (HOJ3D) as a compact representation of postures and achieves superior results on the challenging 3D action dataset.
Abstract: In this paper, we present a novel approach for human action recognition with histograms of 3D joint locations (HOJ3D) as a compact representation of postures. We extract the 3D skeletal joint locations from Kinect depth maps using Shotton et al.'s method [6]. The HOJ3D histograms computed from the action depth sequences are reprojected using LDA and then clustered into k posture visual words, which represent the prototypical poses of actions. The temporal evolutions of those visual words are modeled by discrete hidden Markov models (HMMs). In addition, due to the design of our spherical coordinate system and the robust 3D skeleton estimation from Kinect, our method demonstrates significant view invariance on our 3D action dataset. Our dataset is composed of 200 3D sequences of 10 indoor activities performed by 10 individuals in varied views. Our method is real-time and achieves superior results on the challenging 3D action dataset. We also tested our algorithm on the MSR Action 3D dataset, where it outperforms Li et al. [25] in most cases.

1,453 citations
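
Each action class here gets its own discrete HMM over posture visual words, and a test sequence is assigned to the class whose HMM scores it highest. A minimal numpy forward-algorithm scorer (toy parameters, not the authors' code):

```python
import numpy as np

def forward_log_likelihood(obs, log_pi, log_A, log_B):
    """Log-likelihood of a discrete observation sequence under an HMM.
    obs: visual-word indices; log_pi: (S,) initial log-probs;
    log_A: (S, S) transition log-probs; log_B: (S, K) emission log-probs."""
    alpha = log_pi + log_B[:, obs[0]]
    for o in obs[1:]:
        alpha = np.logaddexp.reduce(alpha[:, None] + log_A, axis=0) + log_B[:, o]
    return np.logaddexp.reduce(alpha)

# Toy model: 3 hidden states, 4 posture visual words.
pi = np.array([0.6, 0.3, 0.1])
A = np.array([[0.8, 0.15, 0.05], [0.1, 0.8, 0.1], [0.05, 0.15, 0.8]])
B = np.array([[0.7, 0.1, 0.1, 0.1], [0.1, 0.7, 0.1, 0.1], [0.1, 0.1, 0.4, 0.4]])
print(forward_log_likelihood([0, 0, 1, 2, 3], np.log(pi), np.log(A), np.log(B)))
```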


Proceedings ArticleDOI
25 Mar 2012
TL;DR: The proposed CNN architecture is applied to speech recognition within a hybrid NN-HMM framework, using local filtering and max-pooling in the frequency domain to normalize speaker variance and achieve higher multi-speaker speech recognition performance.
Abstract: Convolutional Neural Networks (CNN) have shown success in achieving translation invariance for many image processing tasks. The success is largely attributed to the use of local filtering and max-pooling in the CNN architecture. In this paper, we propose to apply CNN to speech recognition within the framework of a hybrid NN-HMM model. We propose to use local filtering and max-pooling in the frequency domain to normalize speaker variance and achieve higher multi-speaker speech recognition performance. In our method, a pair of local filtering and max-pooling layers is added at the lowest end of the neural network (NN) to normalize spectral variations of speech signals. In our experiments, the proposed CNN architecture is evaluated in a speaker-independent speech recognition task using the standard TIMIT data sets. Experimental results show that the proposed CNN method can achieve over 10% relative error reduction on the core TIMIT test sets compared with a regular NN using the same number of hidden layers and weights. Our results also show that the best result of the proposed CNN model is better than previously published results on the same TIMIT test sets that use a pre-trained deep NN model.

901 citations
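
A minimal PyTorch sketch of the core idea (kernel and pooling sizes are assumptions): convolve and max-pool along the frequency axis of a spectral feature map, so that small spectral shifts between speakers collapse to the same pooled features.

```python
import torch
import torch.nn as nn

N_BANDS, CONTEXT = 40, 11  # mel bands x context frames (assumed)

class FreqCNNFrontEnd(nn.Module):
    """Local filtering plus max-pooling along frequency, as a front end
    to a hybrid NN-HMM system."""
    def __init__(self):
        super().__init__()
        # Each kernel spans 8 frequency bands and the full time context.
        self.conv = nn.Conv2d(1, 32, kernel_size=(8, CONTEXT))
        # Pool only along frequency, to absorb speaker-dependent shifts.
        self.pool = nn.MaxPool2d(kernel_size=(3, 1))

    def forward(self, x):  # x: (batch, 1, bands, frames)
        return self.pool(torch.relu(self.conv(x)))

feats = torch.randn(4, 1, N_BANDS, CONTEXT)
print(FreqCNNFrontEnd()(feats).shape)  # torch.Size([4, 32, 11, 1])
```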


Proceedings ArticleDOI
14 May 2012
TL;DR: This paper uses an RGBD sensor as the input sensor and computes a set of features based on human pose and motion, as well as on image and point-cloud information; activities are recognized with a hierarchical maximum entropy Markov model (MEMM).
Abstract: Being able to detect and recognize human activities is essential for several applications, including personal assistive robotics. In this paper, we perform detection and recognition of unstructured human activity in unstructured environments. We use an RGBD sensor (Microsoft Kinect) as the input sensor, and compute a set of features based on human pose and motion, as well as on image and point-cloud information. Our algorithm is based on a hierarchical maximum entropy Markov model (MEMM), which considers a person's activity as composed of a set of sub-activities. We infer the two-layered graph structure using a dynamic programming approach. We test our algorithm on detecting and recognizing twelve different activities performed by four people in different environments, such as a kitchen, a living room, and an office, and achieve good performance even when the person was not seen before in the training set.

555 citations


Proceedings ArticleDOI
16 Jun 2012
TL;DR: A conditional model trained in a max-margin framework is utilized that is able to automatically discover discriminative and interesting segments of video, while simultaneously achieving competitive accuracies on difficult detection and recognition tasks.
Abstract: In this paper, we tackle the problem of understanding the temporal structure of complex events in highly varying videos obtained from the Internet. Towards this goal, we utilize a conditional model trained in a max-margin framework that is able to automatically discover discriminative and interesting segments of video, while simultaneously achieving competitive accuracies on difficult detection and recognition tasks. We introduce latent variables over the frames of a video, and allow our algorithm to discover and assign sequences of states that are most discriminative for the event. Our model is based on the variable-duration hidden Markov model, and models durations of states in addition to the transitions between states. The simplicity of our model allows us to perform fast, exact inference using dynamic programming, which is extremely important when we set our sights on being able to process a very large number of videos quickly and efficiently. We show promising results on the Olympic Sports dataset [16] and the 2011 TRECVID Multimedia Event Detection task [18]. We also illustrate and visualize the semantic understanding capabilities of our model.

430 citations
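
The variable-duration (semi-Markov) model scores whole segments rather than single frames, so decoding adds an explicit duration term to the Viterbi recursion. A compact numpy sketch of that dynamic program (toy parameters; this illustrates the inference only, not the paper's max-margin training):

```python
import numpy as np

def hsmm_viterbi_score(log_emit, log_trans, log_dur, log_pi):
    """Best segmentation score under an explicit-duration HMM.
    log_emit: (T, S) per-frame state scores; log_trans: (S, S);
    log_dur: (S, D), log-prob that a state lasts d+1 frames; log_pi: (S,)."""
    T, S = log_emit.shape
    D = log_dur.shape[1]
    cum = np.vstack([np.zeros(S), np.cumsum(log_emit, axis=0)])
    # best[t, s]: best score of frames [0, t) with a state-s segment ending at t.
    best = np.full((T + 1, S), -np.inf)
    for t in range(1, T + 1):
        for d in range(1, min(D, t) + 1):
            seg = cum[t] - cum[t - d]  # emission score of the length-d segment
            if t - d == 0:             # first segment of the sequence
                enter = log_pi
            else:                      # best transition from a previous segment
                enter = (best[t - d][:, None] + log_trans).max(axis=0)
            best[t] = np.maximum(best[t], enter + log_dur[:, d - 1] + seg)
    return best[T].max()

# Toy example: 2 states, 6 frames, durations up to 3 frames.
rng = np.random.default_rng(0)
emit = rng.normal(size=(6, 2))
A = np.log([[0.1, 0.9], [0.9, 0.1]])  # encourage state switching
dur = np.log(np.full((2, 3), 1 / 3))  # uniform durations
print(hsmm_viterbi_score(emit, A, dur, np.log([0.5, 0.5])))
```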


01 Jan 2012
TL;DR: This article provides a simple and intuitive derivation of the Kalman filter, with the aim of teaching this useful tool to students from disciplines that do not require a strong mathematical background.
Abstract: This article provides a simple and intuitive derivation of the Kalman filter, with the aim of teaching this useful tool to students from disciplines that do not require a strong mathematical background. The most complicated level of mathematics required to understand this derivation is the ability to multiply two Gaussian functions together and reduce the result to a compact form. The Kalman filter is over 50 years old but is still one of the most important and common data fusion algorithms in use today. Named after Rudolf E. Kalman, the great success of the Kalman filter is due to its small computational requirement, elegant recursive properties, and its status as the optimal estimator for one-dimensional linear systems with Gaussian error statistics [1]. Typical uses of the Kalman filter include smoothing noisy data and providing estimates of parameters of interest. Applications include global positioning system receivers, phase-locked loops in radio equipment, smoothing the output from laptop trackpads, and many more. From a theoretical standpoint, the Kalman filter is an algorithm permitting exact inference in a linear dynamical system, which is a Bayesian model similar to a hidden Markov model but where the state space of the latent variables is continuous and where all latent and observed variables have a Gaussian distribution (often a multivariate Gaussian distribution). The aim of this lecture note is to permit people who find this description confusing or terrifying to understand the basis of the Kalman filter via a simple and intuitive derivation.

379 citations
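
In the one-dimensional case, the predict/update cycle the note derives reduces to a few lines. A minimal sketch with toy noise values:

```python
def kalman_1d(measurements, x0, p0, q, r):
    """One-dimensional Kalman filter for a (nearly) constant state.
    x0, p0: initial estimate and its variance; q: process-noise
    variance; r: measurement-noise variance."""
    x, p = x0, p0
    estimates = []
    for z in measurements:
        p = p + q            # predict: only the uncertainty grows
        k = p / (p + r)      # Kalman gain
        x = x + k * (z - x)  # update: blend prediction and measurement
        p = (1 - k) * p
        estimates.append(x)
    return estimates

# Smoothing noisy readings of a constant true value of 1.0.
print(kalman_1d([1.1, 0.9, 1.05, 0.95], x0=0.0, p0=1.0, q=1e-4, r=0.1))
```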


Journal ArticleDOI
01 Nov 2012-Ecology
TL;DR: A number of extensions of HMMs for animal movement modeling are described, including more flexible state transition models and individual random effects (fitted in a non-Bayesian framework).
Abstract: We discuss hidden Markov-type models for fitting a variety of multistate random walks to wildlife movement data. Discrete-time hidden Markov models (HMMs) achieve considerable computational gains by focusing on observations that are regularly spaced in time, and for which the measurement error is negligible. These conditions are often met, in particular for data related to terrestrial animals, so that a likelihood-based HMM approach is feasible. We describe a number of extensions of HMMs for animal movement modeling, including more flexible state transition models and individual random effects (fitted in a non-Bayesian framework). In particular we consider so-called hidden semi-Markov models, which may substantially improve the goodness of fit and provide important insights into the behavioral state switching dynamics. To showcase the expediency of these methods, we consider an application of a hierarchical hidden semi-Markov model to multiple bison movement paths.
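
In practice, multistate random walks of this kind are often fitted with off-the-shelf HMM software. A toy sketch using the third-party hmmlearn package (not the authors' code), with two behavioral states inferred from synthetic step lengths:

```python
import numpy as np
from hmmlearn.hmm import GaussianHMM  # third-party package, assumed installed

rng = np.random.default_rng(0)
# Synthetic track: an "encamped" state with short steps and a
# "travelling" state with long steps, each persisting for a while.
steps = np.concatenate([rng.normal(0.5, 0.2, 200),
                        rng.normal(3.0, 0.8, 200),
                        rng.normal(0.5, 0.2, 200)])
X = steps.reshape(-1, 1)

model = GaussianHMM(n_components=2, n_iter=100)
model.fit(X)
states = model.predict(X)  # decoded behavioral state per time step
print(model.means_.ravel(), np.bincount(states))
```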

Proceedings Article
16 Jun 2012
TL;DR: In this article, a method of moments approach is proposed for parameter estimation for a broad class of high-dimensional mixture models with many components, including multi-view mixtures of Gaussians and hidden Markov models.
Abstract: Mixture models are a fundamental tool in applied statistics and machine learning for treating data taken from multiple subpopulations. The current practice for estimating the parameters of such models relies on local search heuristics (e.g., the EM algorithm) which are prone to failure, and existing consistent methods are unfavorable due to their high computational and sample complexity which typically scale exponentially with the number of mixture components. This work develops an efficient method of moments approach to parameter estimation for a broad class of high-dimensional mixture models with many components, including multi-view mixtures of Gaussians (such as mixtures of axis-aligned Gaussians) and hidden Markov models. The new method leads to rigorous unsupervised learning results for mixture models that were not achieved by previous works; and, because of its simplicity, it offers a viable alternative to EM for practical deployment.

Proceedings ArticleDOI
05 Sep 2012
TL;DR: This paper presents a hybrid method for predicting human mobility on the basis of Hidden Markov Models, and reports on a series of experiments with a real-world location history dataset from the GeoLife project, showing that a prediction accuracy of 13.85% can be achieved when considering regions of roughly 1280 square meters.
Abstract: The analysis of human location histories is currently receiving increasing attention, due to the widespread usage of geopositioning technologies such as the GPS, and also of online location-based services that allow users to share this information. Tasks such as the prediction of human movement can be addressed through the usage of these data, in turn offering support for more advanced applications, such as adaptive mobile services with proactive context-based functions. This paper presents a hybrid method for predicting human mobility on the basis of Hidden Markov Models (HMMs). The proposed approach clusters location histories according to their characteristics, and later trains an HMM for each cluster. The usage of HMMs allows us to account for location characteristics as unobservable parameters, and also for the effects of each individual's previous actions. We report on a series of experiments with a real-world location history dataset from the GeoLife project, showing that a prediction accuracy of 13.85% can be achieved when considering regions of roughly 1280 square meters.

Journal ArticleDOI
01 Feb 2012
TL;DR: The proposed fully automatic method enables the detection of a much larger range of facial behavior by recognizing facial muscle actions [action units (AUs)] that compound expressions.
Abstract: Past work on automatic analysis of facial expressions has focused mostly on detecting prototypic expressions of basic emotions like happiness and anger. The method proposed here enables the detection of a much larger range of facial behavior by recognizing facial muscle actions [action units (AUs)] that compound expressions. AUs are agnostic, leaving the inference about conveyed intent to higher order decision making (e.g., emotion recognition). The proposed fully automatic method not only allows the recognition of 22 AUs but also explicitly models their temporal characteristics (i.e., sequences of temporal segments: neutral, onset, apex, and offset). To do so, it uses a facial point detector based on Gabor-feature-based boosted classifiers to automatically localize 20 facial fiducial points. These points are tracked through a sequence of images using a method called particle filtering with factorized likelihoods. To encode AUs and their temporal activation models based on the tracking data, it applies a combination of GentleBoost, support vector machines, and hidden Markov models. We attain an average AU recognition rate of 95.3% when tested on a benchmark set of deliberately displayed facial expressions and 72% when tested on spontaneous expressions.

Proceedings ArticleDOI
25 Mar 2012
TL;DR: This paper illustrates how each of these three aspects contributes to the DBN's good recognition performance, using both phone recognition performance on the TIMIT corpus and a dimensionally reduced visualization of the relationships between the feature vectors learned by the DBNs that preserves the similarity structure of the feature vectors at multiple scales.
Abstract: Deep Belief Networks (DBNs) are a very competitive alternative to Gaussian mixture models for relating states of a hidden Markov model to frames of coefficients derived from the acoustic input. They are competitive for three reasons: DBNs can be fine-tuned as neural networks; DBNs have many non-linear hidden layers; and DBNs are generatively pre-trained. This paper illustrates how each of these three aspects contributes to the DBN's good recognition performance using both phone recognition performance on the TIMIT corpus and a dimensionally reduced visualization of the relationships between the feature vectors learned by the DBNs that preserves the similarity structure of the feature vectors at multiple scales. The same two methods are also used to investigate the most suitable type of input representation for a DBN.

Proceedings Article
01 Jan 2012
TL;DR: This paper reports results of a DBN-pretrained context-dependent ANN/HMM system trained on two datasets that are much larger than any reported previously, showing that it outperforms the best Gaussian Mixture Model Hidden Markov Model baseline.
Abstract: The use of Deep Belief Networks (DBN) to pretrain Neural Networks has recently led to a resurgence in the use of Artificial Neural Network Hidden Markov Model (ANN/HMM) hybrid systems for Automatic Speech Recognition (ASR). In this paper we report results of a DBN-pretrained context-dependent ANN/HMM system trained on two datasets that are much larger than any reported previously with DBN-pretrained ANN/HMM systems: 5870 hours of Voice Search and 1400 hours of YouTube data. On the first dataset, the pretrained ANN/HMM system outperforms the best Gaussian Mixture Model Hidden Markov Model (GMM/HMM) baseline, built with a much larger dataset, by 3.7% absolute WER, while on the second dataset, it outperforms the GMM/HMM baseline by 4.7% absolute. Maximum Mutual Information (MMI) fine-tuning and model combination using Segmental Conditional Random Fields (SCARF) give additional gains of 0.1% and 0.4% on the first dataset and 0.5% and 0.9% absolute on the second dataset.

Journal ArticleDOI
TL;DR: In this paper, the authors prove that under a natural separation condition (bounds on the smallest singular value of the HMM parameters), there is an efficient and provably correct algorithm for learning hidden Markov models.

Journal ArticleDOI
TL;DR: A new representation of interventions is proposed, in terms of multidimensional time series formed by synchronized signals acquired over time. This yields workflow models that combine low-level signals with high-level information such as predefined phases, and that can be used to detect actions and trigger events.

Journal ArticleDOI
TL;DR: The use of an unsupervised feature learning architecture called deep belief nets (DBNs) is proposed, and it is shown how to apply it to sleep data in order to eliminate the use of handmade features.
Abstract: Most attempts at training computers for the difficult and time-consuming task of sleep stage classification involve a feature extraction step. Due to the complexity of multimodal sleep data, the size of the feature space can grow to the extent that it is also necessary to include a feature selection step. In this paper, we propose the use of an unsupervised feature learning architecture called deep belief nets (DBNs) and show how to apply it to sleep data in order to eliminate the use of handmade features. Using a hidden Markov model (HMM) postprocessing step to accurately capture sleep stage switching, we compare our results to a feature-based approach. A study of anomaly detection with application to home-environment data collection is also presented. The results using raw data with a deep architecture, such as the DBN, were comparable to a feature-based approach when validated on clinical datasets.
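
The HMM postprocessing step amounts to Viterbi-smoothing the per-epoch classifier posteriors with a "sticky" transition matrix that discourages implausibly rapid stage switching. A minimal numpy sketch (toy probabilities; the number of stages is assumed):

```python
import numpy as np

def viterbi_smooth(log_post, log_trans, log_pi):
    """Most likely stage sequence given per-epoch classifier posteriors
    (used as emission scores) and stage-transition log-probabilities."""
    T, S = log_post.shape
    score = log_pi + log_post[0]
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        step = score[:, None] + log_trans  # (from, to)
        back[t] = step.argmax(axis=0)
        score = step.max(axis=0) + log_post[t]
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]

# Toy: 3 stages with sticky transitions; 4 epochs of posteriors.
A = np.log([[0.90, 0.05, 0.05], [0.05, 0.90, 0.05], [0.05, 0.05, 0.90]])
post = np.log([[0.8, 0.1, 0.1], [0.4, 0.5, 0.1], [0.7, 0.2, 0.1], [0.1, 0.1, 0.8]])
print(viterbi_smooth(post, A, np.log([1 / 3, 1 / 3, 1 / 3])))
```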

Proceedings ArticleDOI
Kaisheng Yao, Dong Yu, Frank Seide, Hang Su, Li Deng, Yifan Gong
01 Dec 2012
TL;DR: On a large vocabulary speech recognition task, a stochastic gradient ascent implementation of the fDLR and the top hidden layer adaptation is shown to reduce word error rates (WERs) by 17% and 14%, respectively, compared to the baseline DNN performances.
Abstract: In this paper, we evaluate the effectiveness of adaptation methods for context-dependent deep-neural-network hidden Markov models (CD-DNN-HMMs) for automatic speech recognition. We investigate the affine transformation and several of its variants for adapting the top hidden layer. We compare the affine transformations against direct adaptation of the softmax layer weights. The feature-space discriminative linear regression (fDLR) method with the affine transformations on the input layer is also evaluated. On a large vocabulary speech recognition task, a stochastic gradient ascent implementation of the fDLR and the top hidden layer adaptation is shown to reduce word error rates (WERs) by 17% and 14%, respectively, compared to the baseline DNN performances. With a batch update implementation, the softmax layer adaptation technique reduces WERs by 10%. We observe that using bias shift performs as well as doing scaling plus bias shift.

Proceedings Article
08 Jul 2012
TL;DR: An unsupervised model is presented that simultaneously segments the speech, discovers a proper set of sub-word units and learns a Hidden Markov Model for each induced acoustic unit and outperforms a language-mismatched acoustic model.
Abstract: We investigate the problem of acoustic modeling in which prior language-specific knowledge and transcribed data are unavailable. We present an unsupervised model that simultaneously segments the speech, discovers a proper set of sub-word units (e.g., phones) and learns a Hidden Markov Model (HMM) for each induced acoustic unit. Our approach is formulated as a Dirichlet process mixture model in which each mixture is an HMM that represents a sub-word unit. We apply our model to the TIMIT corpus, and the results demonstrate that our model discovers sub-word units that are highly correlated with English phones and also produces better segmentation than the state-of-the-art unsupervised baseline. We test the quality of the learned acoustic models on a spoken term detection task. Compared to the baselines, our model improves the relative precision of top hits by at least 22.1% and outperforms a language-mismatched acoustic model.

Journal ArticleDOI
TL;DR: A recognition algorithm based on sign sequence and template matching, as presented in this paper, can be used for nonspecific-user hand-gesture recognition without the time-consuming user-training process prior to gesture recognition.
Abstract: This paper presents three different gesture recognition models that are capable of recognizing seven hand gestures, i.e., up, down, left, right, tick, circle, and cross, based on the input signals from MEMS 3-axes accelerometers. The accelerations of a hand in motion in three perpendicular directions are detected by three accelerometers, respectively, and transmitted to a PC via the Bluetooth wireless protocol. An automatic gesture segmentation algorithm is developed to identify individual gestures in a sequence. To compress data and to minimize the influence of variations resulting from gestures made by different users, a basic feature based on the sign sequence of gesture acceleration is extracted. This method reduces hundreds of data values of a single gesture to a gesture code of 8 numbers. Finally, the gesture is recognized by comparing the gesture code with the stored templates. Results based on 72 experiments, each containing a sequence of hand gestures (totaling 628 gestures), show that the best of the three models discussed in this paper achieves an overall recognition accuracy of 95.6%, with the correct recognition accuracy of each gesture ranging from 91% to 100%. We conclude that a recognition algorithm based on sign sequence and template matching as presented in this paper can be used for nonspecific-user hand-gesture recognition without the time-consuming user-training process prior to gesture recognition.
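
A plausible reconstruction of the sign-sequence feature (illustrative only; the paper's exact procedure may differ): split each axis's acceleration trace into a few segments and keep only the sign of each segment's mean, which reduces hundreds of samples to a short code that can be matched against stored templates.

```python
import numpy as np

def sign_code(accel, n_segments=8):
    """Reduce one axis of a gesture's acceleration trace to a short
    sign code (+1, 0, -1 per segment). Illustrative reconstruction."""
    segments = np.array_split(np.asarray(accel, dtype=float), n_segments)
    # Rounding gives a small dead zone so near-zero means map to 0.
    return [int(np.sign(round(seg.mean(), 1))) for seg in segments]

# A rough "up" gesture: acceleration rises, then reverses.
t = np.linspace(0, 1, 200)
print(sign_code(np.sin(2 * np.pi * t)))  # [1, 1, 1, 1, -1, -1, -1, -1]
```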

Journal ArticleDOI
TL;DR: A new feature based on relative phase shift (RPS) is proposed, reliable detection of synthetic speech is demonstrated, and it is shown how this classifier can be used to improve the security of SV systems.
Abstract: In this paper, we evaluate the vulnerability of speaker verification (SV) systems to synthetic speech. The SV systems are based on either the Gaussian mixture model–universal background model (GMM-UBM) or support vector machine (SVM) using GMM supervectors. We use a hidden Markov model (HMM)-based text-to-speech (TTS) synthesizer, which can synthesize speech for a target speaker using small amounts of training data through model adaptation of an average voice or background model. Although the SV systems have a very low equal error rate (EER), when tested with synthetic speech generated from speaker models derived from the Wall Street Journal (WSJ) speech corpus, over 81% of the matched claims are accepted. This result suggests vulnerability in SV systems and thus a need to accurately detect synthetic speech. We propose a new feature based on relative phase shift (RPS), demonstrate reliable detection of synthetic speech, and show how this classifier can be used to improve security of SV systems.

Proceedings ArticleDOI
01 Sep 2012
TL;DR: This work proposes an online map-matching algorithm based on the Hidden Markov Model (HMM) that is robust to noise and sparseness, and suggests that it is viable for low-latency applications such as traffic sensing.
Abstract: In many Intelligent Transportation System (ITS) applications that crowd-source data from probe vehicles, a crucial step is to accurately map the GPS trajectories to the road network in real time. This process, known as map-matching, often needs to account for noise and sparseness of the data because (1) highly precise GPS traces are rarely available, and (2) dense trajectories are costly for live transmission and storage.
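
A common way to set up the HMM for map-matching (parameter values here are illustrative): score each candidate road segment by a Gaussian on its distance to the GPS fix, and score consecutive candidates by how closely the on-road route distance matches the straight-line distance; standard Viterbi decoding over these scores then yields the matched path.

```python
import math

def emission_log_prob(gps_to_segment_m, sigma=5.0):
    """Candidate segment scored by a Gaussian on the distance between
    the GPS fix and its projection onto the segment (sigma assumed)."""
    return (-0.5 * (gps_to_segment_m / sigma) ** 2
            - math.log(sigma * math.sqrt(2 * math.pi)))

def transition_log_prob(route_dist_m, straight_dist_m, beta=20.0):
    """Consecutive candidates scored by an exponential penalty on the gap
    between route distance and straight-line distance (beta assumed)."""
    return -abs(route_dist_m - straight_dist_m) / beta - math.log(beta)

# These per-point scores feed a standard Viterbi decode over the
# candidate segments of each GPS fix.
print(emission_log_prob(5.0), transition_log_prob(120.0, 100.0))
```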

Proceedings ArticleDOI
24 Dec 2012
TL;DR: This work uses the Beta Process Autoregressive Hidden Markov Model and Dynamic Movement Primitives to learn and generalize a multi-step task on the PR2 mobile manipulator and to demonstrate the potential of this framework to learn a large library of skills over time.
Abstract: We present a novel method for segmenting demonstrations, recognizing repeated skills, and generalizing complex tasks from unstructured demonstrations. This method combines many of the advantages of recent automatic segmentation methods for learning from demonstration into a single principled, integrated framework. Specifically, we use the Beta Process Autoregressive Hidden Markov Model and Dynamic Movement Primitives to learn and generalize a multi-step task on the PR2 mobile manipulator and to demonstrate the potential of our framework to learn a large library of skills over time.

Proceedings ArticleDOI
Dong Yu, Frank Seide, Gang Li, Li Deng
25 Mar 2012
TL;DR: The goal of enforcing sparseness is formulated as soft regularization and convex constraint optimization problems; solutions under the stochastic gradient ascent setting are proposed, along with novel data structures that exploit the random sparseness patterns to reduce model size and computation time.
Abstract: Recently, we developed context-dependent deep neural network (DNN) hidden Markov models for large vocabulary speech recognition. While it reduces errors by 33% compared to its discriminatively trained Gaussian-mixture counterpart on the Switchboard benchmark task, the DNN requires many more parameters. In this paper, we report our recent work on DNN for improved generalization, model size, and computation speed by exploiting parameter sparseness. We formulate the goal of enforcing sparseness as soft regularization and convex constraint optimization problems, and propose solutions under the stochastic gradient ascent setting. We also propose novel data structures to exploit the random sparseness patterns to reduce model size and computation time. The proposed solutions have been evaluated on the voice-search and Switchboard datasets. They have decreased the number of nonzero connections to one third while reducing the error rate by 0.2–0.3% over the fully connected model on both datasets. The nonzero connections have been further reduced to only 12% and 19% on the two respective datasets without sacrificing speech recognition performance. Under these conditions we can reduce the model size to 18% and 29%, and computation to 14% and 23%, respectively, on these two datasets.
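
One simple way to realize sparseness-enforcing training of this kind is to truncate small weights after a gradient step and keep pruned connections at zero thereafter. A numpy sketch (threshold and rates assumed; not the paper's exact algorithm):

```python
import numpy as np

def sparsify_step(W, grad, lr=0.1, threshold=0.01, mask=None):
    """Gradient step followed by weight truncation: connections whose
    magnitude falls below the threshold are zeroed, and an optional mask
    keeps previously pruned connections at zero on later steps."""
    W = W - lr * grad
    if mask is None:
        mask = np.abs(W) >= threshold  # decide which connections survive
    W = np.where(mask, W, 0.0)
    return W, mask

rng = np.random.default_rng(0)
W = rng.normal(0, 0.05, (4, 4))
W, mask = sparsify_step(W, grad=rng.normal(0, 0.01, (4, 4)))
print(int(mask.sum()), "of", mask.size, "connections kept")
```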


Proceedings ArticleDOI
25 Mar 2012
TL;DR: The Recurrent Neural Network (RNN) is revisited; it explicitly models the Markovian dynamics of a set of observations through a non-linear function with a much larger hidden state space than traditional sequence models such as an HMM.
Abstract: In this paper, we show how new training principles and optimization techniques for neural networks can be used for different network structures. In particular, we revisit the Recurrent Neural Network (RNN), which explicitly models the Markovian dynamics of a set of observations through a non-linear function with a much larger hidden state space than traditional sequence models such as an HMM. We apply pretraining principles used for Deep Neural Networks (DNNs) and second-order optimization techniques to train an RNN. Moreover, we explore its application in the Aurora2 speech recognition task under mismatched noise conditions using a Tandem approach. We observe top performance on clean speech, and under high noise conditions, compared to multi-layer perceptrons (MLPs) and DNNs, with the added benefit of being a “deeper” model than an MLP but more compact than a DNN.

Journal ArticleDOI
01 Aug 2012
TL;DR: The principal advantage of the proposed approach is utilization of the trajectory key points from all demonstrations for generation of a generalized trajectory, resulting in a generalization procedure which accounts for the relevance of reproduction of different parts of the trajectories.
Abstract: The main objective of this paper is to develop an efficient method for learning and reproduction of complex trajectories for robot programming by demonstration. Encoding of the demonstrated trajectories is performed with hidden Markov model, and generation of a generalized trajectory is achieved by using the concept of key points. Identification of the key points is based on significant changes in position and velocity in the demonstrated trajectories. The resulting sequences of trajectory key points are temporally aligned using the multidimensional dynamic time warping algorithm, and a generalized trajectory is obtained by smoothing spline interpolation of the clustered key points. The principal advantage of our proposed approach is utilization of the trajectory key points from all demonstrations for generation of a generalized trajectory. In addition, variability of the key points' clusters across the demonstrated set is employed for assigning weighting coefficients, resulting in a generalization procedure which accounts for the relevance of reproduction of different parts of the trajectories. The approach is verified experimentally for trajectories with two different levels of complexity.
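
The temporal alignment step relies on multidimensional dynamic time warping; a compact numpy version with Euclidean local cost (and no warping-window constraint) looks like this:

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic time warping distance between two multidimensional
    trajectories a (n, d) and b (m, d), with Euclidean local cost."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

# Two similar 2-D trajectories sampled at different speeds.
t1, t2 = np.linspace(0, 1, 50), np.linspace(0, 1, 80)
traj1 = np.stack([t1, np.sin(2 * np.pi * t1)], axis=1)
traj2 = np.stack([t2, np.sin(2 * np.pi * t2)], axis=1)
print(dtw_distance(traj1, traj2))
```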

Book ChapterDOI
01 Jan 2012
TL;DR: Experiments on speech and handwriting recognition show that a BLSTM network with a CTC output layer is an effective sequence labeller, generally outperforming standard HMMs and HMM-neural network hybrids, as well as more recent sequence labelling algorithms such as large margin HMMs and conditional random fields.
Abstract: This chapter introduces the connectionist temporal classification (CTC) output layer for recurrent neural networks (Graves et al., 2006). As its name suggests, CTC was specifically designed for temporal classification tasks; that is, for sequence labelling problems where the alignment between the inputs and the target labels is unknown. Unlike the hybrid approach described in the previous chapter, CTC models all aspects of the sequence with a single neural network, and does not require the network to be combined with a hidden Markov model. It also does not require presegmented training data, or external postprocessing to extract the label sequence from the network outputs. Experiments on speech and handwriting recognition show that a BLSTM network with a CTC output layer is an effective sequence labeller, generally outperforming standard HMMs and HMM-neural network hybrids, as well as more recent sequence labelling algorithms such as large margin HMMs (Sha and Saul, 2006) and conditional random fields (Lafferty et al., 2001).
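
In current toolkits the CTC objective is available directly. A minimal PyTorch sketch (sizes are illustrative) of training a BLSTM sequence labeller with torch.nn.CTCLoss:

```python
import torch
import torch.nn as nn

N_FEATS, N_LABELS = 26, 28  # e.g. 27 labels + blank (illustrative)

class BLSTMLabeller(nn.Module):
    def __init__(self):
        super().__init__()
        self.blstm = nn.LSTM(N_FEATS, 64, bidirectional=True)
        self.out = nn.Linear(128, N_LABELS)  # index 0 reserved for blank

    def forward(self, x):  # x: (time, batch, features)
        h, _ = self.blstm(x)
        return self.out(h).log_softmax(dim=-1)

model, ctc = BLSTMLabeller(), nn.CTCLoss(blank=0)
x = torch.randn(100, 2, N_FEATS)               # 2 sequences, 100 frames each
targets = torch.randint(1, N_LABELS, (2, 15))  # label sequences (no blanks)
loss = ctc(model(x), targets,
           torch.full((2,), 100), torch.full((2,), 15))
loss.backward()
```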