Distributions associated with general runs and patterns in hidden Markov models

doi:10.1214/07-AOAS125

Journal Article•DOI•

27 Jun 2007 - arXiv: Methodology

TL;DR: In this article, a method for computing distributions associated with patterns in the state sequence of a hidden Markov model, conditional on observing all or part of the observation sequence, is presented.

Abstract: This paper gives a method for computing distributions associated with patterns in the state sequence of a hidden Markov model, conditional on observing all or part of the observation sequence. Probabilities are computed for very general classes of patterns (competing patterns and generalized later patterns), and thus, the theory includes as special cases results for a large class of problems that have wide application. The unobserved state sequence is assumed to be Markovian with a general order of dependence. An auxiliary Markov chain is associated with the state sequence and is used to simplify the computations. Two examples are given to illustrate the use of the methodology. Whereas the first application is more to illustrate the basic steps in applying the theory, the second is a more detailed application to DNA sequences, and shows that the methods can be adapted to include restrictions related to biological knowledge.
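
The auxiliary-chain idea in the abstract can be illustrated on the simplest kind of pattern, a run. The sketch below is not the paper's algorithm: it assumes a first-order hidden chain (the paper allows a general order), a single pattern of the form "at least k consecutive visits to one state", and hypothetical names (`run_probability`, `pi`, `A`, `B`). Each hidden state is paired with a run counter, and an ordinary forward recursion is run on the pair, so that the posterior probability of the run is read off the augmented chain.

```python
def run_probability(pi, A, B, obs, target, k):
    """P(hidden path contains >= k consecutive `target` states | obs).

    pi[s]    : initial probability of hidden state s
    A[s][t]  : transition probability s -> t (first-order chain, an assumption)
    B[s][o]  : probability of emitting symbol o from state s
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    n_states = len(pi)

    def next_counter(r, s):
        # r == k is absorbing: the run has already occurred.
        if r == k:
            return k
        return r + 1 if s == target else 0

    # alpha[s][r]: joint weight of the observations so far, the current
    # hidden state s, and the run counter r of the auxiliary chain.
    alpha = [[0.0] * (k + 1) for _ in range(n_states)]
    for s in range(n_states):
        alpha[s][next_counter(0, s)] += pi[s] * B[s][obs[0]]

    for o in obs[1:]:
        new = [[0.0] * (k + 1) for _ in range(n_states)]
        for s in range(n_states):
            for r in range(k + 1):
                w = alpha[s][r]
                if w == 0.0:
                    continue
                for t in range(n_states):
                    new[t][next_counter(r, t)] += w * A[s][t] * B[t][o]
        alpha = new

    total = sum(sum(row) for row in alpha)   # likelihood of the observations
    if total == 0.0:
        raise ValueError("observations have zero probability under the model")
    hit = sum(row[k] for row in alpha)       # weight of paths containing the run
    return hit / total


# Illustrative two-state model; all numbers are made up.
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.2, 0.8]]
B = [[0.9, 0.1], [0.3, 0.7]]
obs = [0, 1, 1, 0, 1]
print(run_probability(pi, A, B, obs, target=1, k=2))
```

The cost is proportional to the sequence length times the size of the augmented state space, so no enumeration of hidden paths is needed.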

Citations

Book Chapter•DOI•

Biological sequence analysis: Background on probability

[...]

Richard Durbin, Sean R. Eddy¹, Anders Krogh², Graeme Mitchison•Institutions (2)

Washington University in St. Louis¹, Technical University of Denmark²

01 Apr 1998

213 citations

Journal Article•DOI•

Quantifying the uncertainty in change points

[...]

Christopher F. H. Nam¹, John A. D. Aston¹, Adam M. Johansen¹•Institutions (1)

University of Warwick¹

01 Sep 2012-Journal of Time Series Analysis

TL;DR: In this article, the authors proposed a methodology for approximating the full posterior distribution of various change point characteristics in the presence of parameter uncertainty, which does not require estimates of the underlying state sequence.

Abstract: Quantifying the uncertainty in the location and nature of change points in time series is important in a variety of applications. Many existing methods for estimation of the number and location of change points fail to capture fully or explicitly the uncertainty regarding these estimates, whilst others require explicit simulation of large vectors of dependent latent variables. This article proposes methodology for approximating the full posterior distribution of various change point characteristics in the presence of parameter uncertainty. The methodology combines recent work on evaluation of exact change point distributions conditional on model parameters via finite Markov chain imbedding in a hidden Markov model setting, and accounting for parameter uncertainty and estimation via Bayesian modelling and sequential Monte Carlo. The combination of the two leads to a flexible and computationally efficient procedure, which does not require estimates of the underlying state sequence. We illustrate that good estimation of the posterior distributions of change point characteristics is provided for simulated data and functional magnetic resonance imaging data. We use the methodology to show that the modelling of relevant physical properties of the scanner can influence detection of change points and their uncertainty.

43 citations

Cites background from "Distributions associated with gener..."

...Typical examples include Economics where a recession is said to have occurred when there are at least two consecutive negative growth (contraction) states and thus k = 2, or in Genetics where a specific genetic phenomena, for example a CpG island (Aston and Martin, 2007), is at least a few hundred bases long (e.g. k = 1000) before being deemed in progress....
[...]

Journal Article•DOI•

Exact distribution of a pattern in a set of random sequences generated by a Markov source: applications to biological data.

[...]

Gregory Nuel¹, Leslie Regad², Juliette Martin²,³,⁴, Anne-Claude Camproux²•Institutions (4)

Centre national de la recherche scientifique¹, Paris Diderot University², Institut national de la recherche agronomique³, University of Lyon⁴

26 Jan 2010-Algorithms for Molecular Biology

TL;DR: Three innovative algorithms based on optimal Markov chain embedding with deterministic finite automata are introduced; they prove effective and able to handle real data sets with multiple sequences, as well as biological patterns of interest, even when the latter display a high complexity.

Abstract: Background: In bioinformatics it is common to search for a pattern of interest in a potentially large set of rather short sequences (upstream gene regions, proteins, exons, etc.). Although many methodological approaches allow practitioners to compute the distribution of a pattern count in a random sequence generated by a Markov source, no specific developments have taken into account the counting of occurrences in a set of independent sequences. We aim to address this problem by deriving efficient approaches and algorithms to perform these computations both for low and high complexity patterns in the framework of homogeneous or heterogeneous Markov models.

26 citations

Journal Article•DOI•

Best hits of 11110110111: model-free selection and parameter-free sensitivity calculation of spaced seeds

[...]

Laurent Noé¹•Institutions (1)

University of Lille¹

14 Feb 2017-Algorithms for Molecular Biology

TL;DR: A generic framework on all the proposed models is established by applying a counting semi-ring to quickly compute the large polynomial coefficients needed by the dominance filter, and it is confirmed that dominant seeds reduce the set of single seeds to analyse thoroughly, or of multiple seeds to store.

Abstract: Spaced seeds, also named gapped q-grams, gapped k-mers, spaced q-grams, have been proven to be more sensitive than contiguous seeds (contiguous q-grams, contiguous k-mers) in nucleic and amino-acid sequences analysis. Initially proposed to detect sequence similarities and to anchor sequence alignments, spaced seeds have more recently been applied in several alignment-free related methods. Unfortunately, spaced seeds need to be initially designed. This task is known to be time-consuming due to the number of spaced seed candidates. Moreover, it can be altered by a set of arbitrary chosen parameters from the probabilistic alignment models used. In this general context, Dominant seeds have been introduced by Mak and Benson (Bioinformatics 25:302–308, 2009) on the Bernoulli model, in order to reduce the number of spaced seed candidates that are further processed in a parameter-free calculation of the sensitivity. We expand the scope of work of Mak and Benson on single and multiple seeds by considering the Hit Integration model of Chung and Park (BMC Bioinform 11:31, 2010), demonstrate that the same dominance definition can be applied, and that a parameter-free study can be performed without any significant additional cost. We also consider two new discrete models, namely the Heaviside and the Dirac models, where lossless seeds can be integrated. From a theoretical standpoint, we establish a generic framework on all the proposed models, by applying a counting semi-ring to quickly compute large polynomial coefficients needed by the dominance filter. From a practical standpoint, we confirm that dominant seeds reduce the set of, either single seeds to thoroughly analyse, or multiple seeds to store. Moreover, in http://bioinfo.cristal.univ-lille.fr/yass/iedera_dominance , we provide a full list of spaced seeds computed on the four aforementioned models, with one (continuous) parameter left free for each model, and with several (discrete) alignment lengths.
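
To make the notion of seed sensitivity on the Bernoulli model concrete, here is a minimal sketch. It is a brute-force enumeration, not the counting semi-ring method of the paper, and it is only practical for short alignments. The assumptions: every alignment column is a match independently with probability `p`, and a seed hits when all of its `1` positions fall on matches. The function name `seed_sensitivity` is hypothetical.

```python
def seed_sensitivity(seed, length, p):
    """P(seed hits a random alignment of `length` columns), Bernoulli model.

    seed   : string over {'0', '1'}; '1' is a position that must match
    length : number of alignment columns
    p      : probability that a column is a match (columns independent)
    """
    span = len(seed)
    # The seed is read as a bitmask. This places the seed in reversed column
    # order, which does not change the result because i.i.d. columns are
    # symmetric under reversal.
    mask = int(seed, 2)
    total = 0.0
    for aln in range(1 << length):          # bit set = matching column
        hit = any(((aln >> shift) & mask) == mask
                  for shift in range(length - span + 1))
        if hit:
            matches = bin(aln).count("1")
            total += p ** matches * (1 - p) ** (length - matches)
    return total


# The seed from the paper title and a contiguous seed of the same weight (9).
# The alignment length and p are arbitrary illustrative choices.
for s in ("11110110111", "111111111"):
    print(s, seed_sensitivity(s, length=14, p=0.7))
```

The enumeration visits 2^length alignments, which is why methods such as the one in the paper are needed for realistic alignment lengths.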

25 citations

Cites background from "Distributions associated with gener..."

...We have already noticed that several authors have performed equivalent tasks with a matrix for the full automaton [86], or with a vector for each automaton state [1], probably because contiguous memory performance is better....
[...]

Journal Article•DOI•

Statistical Inference in Hidden Markov Models Using k-Segment Constraints

[...]

Michalis K. Titsias, Christopher Holmes, Christopher Yau

05 May 2016-Journal of the American Statistical Association

TL;DR: The amount of information one could obtain from the posterior distribution of an HMM is expanded by introducing linear-time dynamic programming recursions that, conditional on a user-specified constraint in the number of segments, allow us to find MAP sequences, compute posterior probabilities, and simulate sample paths.

Abstract: Hidden Markov models (HMMs) are one of the most widely used statistical methods for analyzing sequence data. However, the reporting of output from HMMs has largely been restricted to the presentation of the most-probable (MAP) hidden state sequence, found via the Viterbi algorithm, or the sequence of most probable marginals using the forward-backward algorithm. In this article, we expand the amount of information we could obtain from the posterior distribution of an HMM by introducing linear-time dynamic programming recursions that, conditional on a user-specified constraint in the number of segments, allow us to (i) find MAP sequences, (ii) compute posterior probabilities, and (iii) simulate sample paths. We collectively call these recursions k-segment algorithms and illustrate their utility using simulated and real examples. We also highlight the prospective and retrospective use of k-segment constraints for fitting HMMs or exploring existing model fits. Supplementary materials for this article are available online.
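
One quantity that segment-aware recursions of this kind make accessible is the posterior distribution of the number of segments in the hidden path. The sketch below is an illustration under stated assumptions, not the k-segment algorithms of the article: it assumes a first-order HMM with discrete emissions, defines a segment as a maximal block of equal consecutive hidden states, and uses hypothetical names (`segment_count_posterior`, `pi`, `A`, `B`).

```python
def segment_count_posterior(pi, A, B, obs):
    """Return post, where post[m] = P(hidden path has m segments | obs).

    A segment is a maximal block of equal consecutive hidden states.
    post[0] is unused and stays 0.0.
    """
    n, T = len(pi), len(obs)
    # alpha[s][m]: joint weight of the observations so far, current hidden
    # state s, and m segments used up to now.
    alpha = [[0.0] * (T + 1) for _ in range(n)]
    for s in range(n):
        alpha[s][1] = pi[s] * B[s][obs[0]]

    for o in obs[1:]:
        new = [[0.0] * (T + 1) for _ in range(n)]
        for s in range(n):
            for m in range(1, T + 1):
                w = alpha[s][m]
                if w == 0.0:
                    continue
                for t in range(n):
                    # A change of state opens a new segment.
                    m_next = m if t == s else m + 1
                    new[t][m_next] += w * A[s][t] * B[t][o]
        alpha = new

    total = sum(sum(row) for row in alpha)
    return [sum(alpha[s][m] for s in range(n)) / total for m in range(T + 1)]


# Illustrative two-state model; all numbers are made up.
pi = [0.5, 0.5]
A = [[0.9, 0.1], [0.1, 0.9]]
B = [[0.8, 0.2], [0.2, 0.8]]
print(segment_count_posterior(pi, A, B, [0, 0, 1, 1, 1, 0]))
```

Tracking every possible count makes this version quadratic in the sequence length. Capping the counter at a fixed number of segments, as a user-specified constraint would, keeps the cost linear in the length.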

19 citations

Cites background from "Distributions associated with gener..."

...These ideas have been extended and applied more recently to compute distributions of general patterns (Aston and Martin 2007), quantify uncertainty in change points in HMMs (Aston, Peng, and Martin 2012; Nam, Aston, and Johansen 2012) and more general graphical model structures (Martin and Aston…...
[...]

References

Journal Article•DOI•

A tutorial on hidden Markov models and selected applications in speech recognition

[...]

Lawrence R. Rabiner¹•Institutions (1)

Bell Labs¹

01 Feb 1989

TL;DR: In this paper, the authors provide an overview of the basic theory of hidden Markov models (HMMs) as originated by L.E. Baum and T. Petrie (1966) and give practical details on methods of implementation of the theory along with a description of selected applications of HMMs to distinct problems in speech recognition.

Abstract: This tutorial provides an overview of the basic theory of hidden Markov models (HMMs) as originated by L.E. Baum and T. Petrie (1966) and gives practical details on methods of implementation of the theory along with a description of selected applications of the theory to distinct problems in speech recognition. Results from a number of original sources are combined to provide a single source of acquiring the background required to pursue further this area of research. The author first reviews the theory of discrete Markov chains and shows how the concept of hidden states, where the observation is a probabilistic function of the state, can be used effectively. The theory is illustrated with two simple examples, namely coin-tossing, and the classic balls-in-urns system. Three fundamental problems of HMMs are noted and several practical techniques for solving these problems are given. The various types of HMMs that have been studied, including ergodic as well as left-right models, are described.

21,819 citations

Journal Article•DOI•

A New Approach to the Economic Analysis of Nonstationary Time Series and the Business Cycle.

[...]

James D. Hamilton

01 Mar 1989-Econometrica

TL;DR: In this article, the parameters of an autoregression are viewed as the outcome of a discrete-state Markov process, and an algorithm for drawing such probabilistic inference in the form of a nonlinear iterative filter is presented.

Abstract: This paper proposes a very tractable approach to modeling changes in regime. The parameters of an autoregression are viewed as the outcome of a discrete-state Markov process. For example, the mean growth rate of a nonstationary series may be subject to occasional, discrete shifts. The econometrician is presumed not to observe these shifts directly, but instead must draw probabilistic inference about whether and when they may have occurred based on the observed behavior of the series. The paper presents an algorithm for drawing such probabilistic inference in the form of a nonlinear iterative filter.

9,189 citations

"Distributions associated with gener..." refers methods in this paper

...As examples, they serve as models in speech recognition [Rabiner (1989)], image processing [Li and Gray (2000)], DNA sequence analysis [Durbin, Eddy, Krogh and Mitchison (1998) and Koski (2001)], DNA microarray time course analysis [Yuan and Kendziorski (2006)] and econometrics [Hamilton (1989) and Sims and Zha (2006)], to name just a few. HMMs essentially specify two structures, an underlying model for the unobserved state of the system, and one for the observations, conditional on the unobserved states. Thus, HMMs are a sub-class of state space models [Harvey (1989)], but have the restriction that the models for the hidden states are defined on finite dimensional spaces....
[...]

Book•

Forecasting, Structural Time Series Models and the Kalman Filter

[...]

Andrew Harvey¹•Institutions (1)

London School of Economics and Political Science¹

30 Mar 1990

TL;DR: This book treats univariate structural time series models through state space models and the Kalman filter, covering estimation, prediction and smoothing.

Abstract: Contents: List of figures; Acknowledgement; Preface; Notation and conventions; List of abbreviations; 1. Introduction; 2. Univariate time series models; 3. State space models and the Kalman filter; 4. Estimation, prediction and smoothing for univariate structural time series models; 5. Testing and model selection; 6. Extensions of the univariate model; 7. Explanatory variables; 8. Multivariate models; 9. Continuous time; Appendices; Selected answers to exercises; References; Author index; Subject index.

5,071 citations

Book•

Biological Sequence Analysis

[...]

Durbin

01 Jan 1998

1,639 citations

Journal Article•DOI•

Comprehensive analysis of CpG islands in human chromosomes 21 and 22

[...]

Daiya Takai¹, Peter A. Jones•Institutions (1)

University of Southern California¹

19 Mar 2002-Proceedings of the National Academy of Sciences of the United States of America

TL;DR: The complete genomic sequences of human chromosomes 21 and 22 are used to examine the properties of CpG islands in different sequence classes with a purpose-built search algorithm. Genome sequences also show strong CpG suppression in the human genome and slight suppression in Drosophila melanogaster and Saccharomyces cerevisiae, a finding compatible with the recent detection of 5-methylcytosine in Drosophila that might suggest that S. cerevisiae has, or once had, CpG methylation.

Abstract: CpG islands are useful markers for genes in organisms containing 5-methylcytosine in their genomes. In addition, CpG islands located in the promoter regions of genes can play important roles in gene silencing during processes such as X-chromosome inactivation, imprinting, and silencing of intragenomic parasites. The generally accepted definition of what constitutes a CpG island was proposed in 1987 by Gardiner-Garden and Frommer [Gardiner-Garden, M. & Frommer, M. (1987) J. Mol. Biol. 196, 261–282] as being a 200-bp stretch of DNA with a C+G content of 50% and an observed CpG/expected CpG in excess of 0.6. Any definition of a CpG island is somewhat arbitrary, and this one, which was derived before the sequencing of mammalian genomes, will include many sequences that are not necessarily associated with controlling regions of genes but rather are associated with intragenomic parasites. We have therefore used the complete genomic sequences of human chromosomes 21 and 22 to examine the properties of CpG islands in different sequence classes by using a search algorithm that we have developed. Regions of DNA of greater than 500 bp with a G+C equal to or greater than 55% and observed CpG/expected CpG of 0.65 were more likely to be associated with the 5′ regions of genes and this definition excluded most Alu-repetitive elements. We also used genome sequences to show strong CpG suppression in the human genome and slight suppression in Drosophila melanogaster and Saccharomyces cerevisiae. This finding is compatible with the recent detection of 5-methylcytosine in Drosophila, and might suggest that S. cerevisiae has, or once had, CpG methylation.
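
The quantities used in these threshold-based definitions can be computed directly. The sketch below is not the authors' search algorithm: it only evaluates a single candidate region against the length, G+C and observed/expected CpG thresholds quoted in the abstract. The function names are hypothetical, the observed/expected ratio uses the usual form (number of CpG × N) / (number of C × number of G), and treating every threshold as an inclusive lower bound is an assumption.

```python
def cpg_stats(seq):
    """Return (G+C fraction, observed/expected CpG ratio) for a DNA string."""
    seq = seq.upper()
    n = len(seq)
    c = seq.count("C")
    g = seq.count("G")
    cpg = seq.count("CG")   # "CG" cannot overlap itself, so count() is exact
    gc_fraction = (c + g) / n if n else 0.0
    # Expected CpG count under independence of neighbouring bases: c * g / n.
    obs_exp = cpg * n / (c * g) if c and g else 0.0
    return gc_fraction, obs_exp


def meets_island_criteria(seq, min_len=500, min_gc=0.55, min_ratio=0.65):
    """Check one region against the thresholds quoted in the abstract.

    Boundary handling (>= for all three thresholds) is an assumption.
    """
    gc_fraction, obs_exp = cpg_stats(seq)
    return len(seq) >= min_len and gc_fraction >= min_gc and obs_exp >= min_ratio


# Toy input, far too regular to be a real sequence.
print(cpg_stats("ACGTCGCGGA"))
print(meets_island_criteria("CG" * 300))
```

Passing `min_len=200, min_gc=0.5, min_ratio=0.6` gives the corresponding check for the 1987 definition mentioned in the same abstract.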

1,553 citations

"Distributions associated with gener..." refers background or methods in this paper

...This is biologically realistic as islands are presumed not only to be longer than a certain length but also to be infrequent and not particularly close to one another [Takai and Jones (2002)]....
[...]
...As CpG islands can be especially useful for identifying genes in human DNA [Takai and Jones (2002)], different methods have been developed for their detection....
[...]
...These locations and size differences were verified by using the software CpG islands searcher [Takai and Jones (2003)], which is based on the (non-HMM) algorithm of Takai and Jones (2002) and requires the use of predetermined thresholds....
[...]
...If the frequency of CG pairs is higher than some predetermined threshold, then the segment is defined to be within a CpG island [Takai and Jones (2002)]....
[...]