Journal Article

Distributions associated with general runs and patterns in hidden Markov models

TL;DR: In this article, a method for computing distributions associated with patterns in the state sequence of a hidden Markov model, conditional on observing all or part of the observation sequence, is presented.
Abstract: This paper gives a method for computing distributions associated with patterns in the state sequence of a hidden Markov model, conditional on observing all or part of the observation sequence. Probabilities are computed for very general classes of patterns (competing patterns and generalized later patterns), and thus, the theory includes as special cases results for a large class of problems that have wide application. The unobserved state sequence is assumed to be Markovian with a general order of dependence. An auxiliary Markov chain is associated with the state sequence and is used to simplify the computations. Two examples are given to illustrate the use of the methodology. Whereas the first application is more to illustrate the basic steps in applying the theory, the second is a more detailed application to DNA sequences, and shows that the methods can be adapted to include restrictions related to biological knowledge.
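
The auxiliary-chain idea in the abstract can be illustrated on the simplest pattern, a run of length k. The following is a minimal Python sketch, not the paper's algorithm: it assumes a first-order hidden chain, a single run pattern, and discrete observations. The function name `run_probability` and all parameter names are hypothetical. Each hidden state is paired with a run-length counter that is absorbing at k, and an ordinary forward recursion is run over the enlarged state space.

```python
def run_probability(init, trans, emit, obs, target, k):
    """P(hidden path contains >= k consecutive `target` states | obs).

    Assumptions (illustrative only): first-order hidden chain, k >= 1,
    init[s], trans[s_prev][s] and emit[s][symbol] are plain nested lists.
    """
    n = len(init)

    def step(r, s):
        # Counter of the auxiliary chain: once it reaches k it stays there.
        if r == k:
            return k
        return min(r + 1, k) if s == target else 0

    # alpha[s][r]: weight of (current state s, run counter r) given data so far.
    alpha = [[0.0] * (k + 1) for _ in range(n)]
    for s in range(n):
        alpha[s][step(0, s)] += init[s] * emit[s][obs[0]]

    for o in obs[1:]:
        new = [[0.0] * (k + 1) for _ in range(n)]
        for sp in range(n):
            for r in range(k + 1):
                a = alpha[sp][r]
                if a == 0.0:
                    continue
                for s in range(n):
                    new[s][step(r, s)] += a * trans[sp][s] * emit[s][o]
        # Rescale each step to avoid underflow; ratios are unchanged.
        total = sum(map(sum, new))
        alpha = [[v / total for v in row] for row in new]

    total = sum(map(sum, alpha))
    return sum(row[k] for row in alpha) / total
```

With uninformative emissions the result reduces to the prior probability of the run, which gives a quick sanity check. Higher-order dependence or competing patterns would need a larger auxiliary state space than this sketch uses.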
Citations
Journal Article
TL;DR: In this article, the authors proposed a methodology for approximating the full posterior distribution of various change point characteristics in the presence of parameter uncertainty, which does not require estimates of the underlying state sequence.
Abstract: Quantifying the uncertainty in the location and nature of change points in time series is important in a variety of applications. Many existing methods for estimation of the number and location of change points fail to capture fully or explicitly the uncertainty regarding these estimates, whilst others require explicit simulation of large vectors of dependent latent variables. This article proposes methodology for approximating the full posterior distribution of various change point characteristics in the presence of parameter uncertainty. The methodology combines recent work on evaluation of exact change point distributions conditional on model parameters via finite Markov chain imbedding in a hidden Markov model setting, and accounting for parameter uncertainty and estimation via Bayesian modelling and sequential Monte Carlo. The combination of the two leads to a flexible and computationally efficient procedure, which does not require estimates of the underlying state sequence. We illustrate that good estimation of the posterior distributions of change point characteristics is provided for simulated data and functional magnetic resonance imaging data. We use the methodology to show that the modelling of relevant physical properties of the scanner can influence detection of change points and their uncertainty.

43 citations


Cites background from "Distributions associated with gener..."

  • ...Typical examples include Economics where a recession is said to have occurred when there are at least two consecutive negative growth (contraction) states and thus k = 2, or in Genetics where a specific genetic phenomena, for example a CpG island (Aston and Martin, 2007), is at least a few hundred bases long (e.g. k = 1000) before being deemed in progress....

    [...]

Journal Article
TL;DR: Three algorithms built on optimal Markov chain embedding via deterministic finite automata are introduced; they prove effective and able to handle real data sets with multiple sequences, as well as biological patterns of interest, even when the latter display a high complexity.
Abstract: Background In bioinformatics it is common to search for a pattern of interest in a potentially large set of rather short sequences (upstream gene regions, proteins, exons, etc.). Although many methodological approaches allow practitioners to compute the distribution of a pattern count in a random sequence generated by a Markov source, no specific developments have taken into account the counting of occurrences in a set of independent sequences. We aim to address this problem by deriving efficient approaches and algorithms to perform these computations both for low and high complexity patterns in the framework of homogeneous or heterogeneous Markov models.

26 citations

Journal Article
Laurent Noé
TL;DR: A generic framework covering all the proposed models is established by applying a counting semi-ring to quickly compute the large polynomial coefficients needed by the dominance filter, and dominant seeds are confirmed to reduce the set of single seeds to analyse thoroughly, or of multiple seeds to store.
Abstract: Spaced seeds, also named gapped q-grams, gapped k-mers, spaced q-grams, have been proven to be more sensitive than contiguous seeds (contiguous q-grams, contiguous k-mers) in nucleic and amino-acid sequences analysis. Initially proposed to detect sequence similarities and to anchor sequence alignments, spaced seeds have more recently been applied in several alignment-free related methods. Unfortunately, spaced seeds need to be initially designed. This task is known to be time-consuming due to the number of spaced seed candidates. Moreover, it can be altered by a set of arbitrary chosen parameters from the probabilistic alignment models used. In this general context, Dominant seeds have been introduced by Mak and Benson (Bioinformatics 25:302–308, 2009) on the Bernoulli model, in order to reduce the number of spaced seed candidates that are further processed in a parameter-free calculation of the sensitivity. We expand the scope of work of Mak and Benson on single and multiple seeds by considering the Hit Integration model of Chung and Park (BMC Bioinform 11:31, 2010), demonstrate that the same dominance definition can be applied, and that a parameter-free study can be performed without any significant additional cost. We also consider two new discrete models, namely the Heaviside and the Dirac models, where lossless seeds can be integrated. From a theoretical standpoint, we establish a generic framework on all the proposed models, by applying a counting semi-ring to quickly compute large polynomial coefficients needed by the dominance filter. From a practical standpoint, we confirm that dominant seeds reduce the set of, either single seeds to thoroughly analyse, or multiple seeds to store. Moreover, in http://bioinfo.cristal.univ-lille.fr/yass/iedera_dominance , we provide a full list of spaced seeds computed on the four aforementioned models, with one (continuous) parameter left free for each model, and with several (discrete) alignment lengths.

25 citations


Cites background from "Distributions associated with gener..."

  • ...We have already noticed that several authors have performed equivalent tasks with a matrix for the full automaton [86], or with a vector for each automaton state [1], probably because contiguous memory performance is better....

    [...]

Journal Article
TL;DR: The amount of information one could obtain from the posterior distribution of an HMM is expanded by introducing linear-time dynamic programming recursions that, conditional on a user-specified constraint in the number of segments, allow us to find MAP sequences, compute posterior probabilities, and simulate sample paths.
Abstract: Hidden Markov models (HMMs) are one of the most widely used statistical methods for analyzing sequence data. However, the reporting of output from HMMs has largely been restricted to the presentation of the most-probable (MAP) hidden state sequence, found via the Viterbi algorithm, or the sequence of most probable marginals using the forward-backward algorithm. In this article, we expand the amount of information we could obtain from the posterior distribution of an HMM by introducing linear-time dynamic programming recursions that, conditional on a user-specified constraint in the number of segments, allow us to (i) find MAP sequences, (ii) compute posterior probabilities, and (iii) simulate sample paths. We collectively call these recursions k-segment algorithms and illustrate their utility using simulated and real examples. We also highlight the prospective and retrospective use of k-segment constraints for fitting HMMs or exploring existing model fits. Supplementary materials for this article are available online.

19 citations


Cites background from "Distributions associated with gener..."

  • ...These ideas have been extended and applied more recently to compute distributions of general patterns (Aston and Martin 2007), quantify uncertainty in change points in HMMs (Aston, Peng, and Martin 2012; Nam, Aston, and Johansen 2012) and more general graphical model structures (Martin and Aston…...

    [...]

References
Journal Article
Lawrence R. Rabiner
01 Feb 1989
TL;DR: In this paper, the authors provide an overview of the basic theory of hidden Markov models (HMMs) as originated by L.E. Baum and T. Petrie (1966) and give practical details on methods of implementation of the theory along with a description of selected applications of HMMs to distinct problems in speech recognition.
Abstract: This tutorial provides an overview of the basic theory of hidden Markov models (HMMs) as originated by L.E. Baum and T. Petrie (1966) and gives practical details on methods of implementation of the theory along with a description of selected applications of the theory to distinct problems in speech recognition. Results from a number of original sources are combined to provide a single source of acquiring the background required to pursue further this area of research. The author first reviews the theory of discrete Markov chains and shows how the concept of hidden states, where the observation is a probabilistic function of the state, can be used effectively. The theory is illustrated with two simple examples, namely coin-tossing, and the classic balls-in-urns system. Three fundamental problems of HMMs are noted and several practical techniques for solving these problems are given. The various types of HMMs that have been studied, including ergodic as well as left-right models, are described.

21,819 citations

Journal Article
TL;DR: In this article, the parameters of an autoregression are viewed as the outcome of a discrete-state Markov process, and an algorithm for drawing such probabilistic inference in the form of a nonlinear iterative filter is presented.
Abstract: This paper proposes a very tractable approach to modeling changes in regime. The parameters of an autoregression are viewed as the outcome of a discrete-state Markov process. For example, the mean growth rate of a nonstationary series may be subject to occasional, discrete shifts. The econometrician is presumed not to observe these shifts directly, but instead must draw probabilistic inference about whether and when they may have occurred based on the observed behavior of the series. The paper presents an algorithm for drawing such probabilistic inference in the form of a nonlinear iterative filter.
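
The iterative filter described in this abstract can be sketched in a stripped-down form. The Python below is an illustration under strong assumptions, not the paper's full algorithm: it drops the autoregressive terms and treats each observation as Gaussian around a regime-specific mean with a common standard deviation. The function name `hamilton_filter` and its parameters are hypothetical. Each step updates the regime probabilities with the new observation and then propagates them through the transition matrix.

```python
import math


def hamilton_filter(y, means, sigma, trans, init):
    """Filtered regime probabilities P(s_t | y_1..y_t).

    Simplifying assumptions (illustrative only): no autoregressive lags,
    y_t ~ Normal(means[s_t], sigma), trans[s_prev][s] is the regime
    transition matrix, init[s] is P(s_1 = s).
    """

    def normal_pdf(x, mu):
        z = (x - mu) / sigma
        return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

    n = len(means)
    pred = list(init)  # one-step-ahead regime probabilities
    filtered = []
    for obs in y:
        # Update: weight each regime by how well it explains the observation.
        joint = [pred[s] * normal_pdf(obs, means[s]) for s in range(n)]
        lik = sum(joint)
        post = [j / lik for j in joint]
        filtered.append(post)
        # Predict: push the filtered probabilities through the Markov chain.
        pred = [sum(post[sp] * trans[sp][s] for sp in range(n)) for s in range(n)]
    return filtered
```

When the regime means coincide, the data carry no information about the regime and the filtered probabilities simply follow the Markov chain, which is a convenient check on an implementation.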

9,189 citations


"Distributions associated with gener..." refers methods in this paper

  • ...As examples, they serve as models in speech recognition [Rabiner (1989)], image processing [Li and Gray (2000)], DNA sequence analysis [Durbin, Eddy, Krogh and Mitchison (1998) and Koski (2001)], DNA microarray time course analysis [Yuan and Kendziorski (2006)] and econometrics [Hamilton (1989) and Sims and Zha (2006)], to name just a few. HMMs essentially specify two structures, an underlying model for the unobserved state of the system, and one for the observations, conditional on the unobserved states. Thus, HMMs are a sub-class of state space models [Harvey (1989)], but have the restriction that the models for the hidden states are defined on finite dimensional spaces....

    [...]

Book
30 Mar 1990
TL;DR: This book covers state space models and the Kalman filter, including estimation, prediction and smoothing for univariate structural time series models.
Abstract (table of contents): List of figures; Acknowledgement; Preface; Notation and conventions; List of abbreviations; 1. Introduction; 2. Univariate time series models; 3. State space models and the Kalman filter; 4. Estimation, prediction and smoothing for univariate structural time series models; 5. Testing and model selection; 6. Extensions of the univariate model; 7. Explanatory variables; 8. Multivariate models; 9. Continuous time; Appendices; Selected answers to exercises; References; Author index; Subject index.

5,071 citations

Book
01 Jan 1998

1,639 citations

Journal Article
TL;DR: The complete genomic sequences of human chromosomes 21 and 22 are used to examine the properties of CpG islands in different sequence classes with a newly developed search algorithm. Genome sequences also show strong CpG suppression in the human genome and slight suppression in Drosophila melanogaster and Saccharomyces cerevisiae, which might suggest that S. cerevisiae has, or once had, CpG methylation.
Abstract: CpG islands are useful markers for genes in organisms containing 5-methylcytosine in their genomes. In addition, CpG islands located in the promoter regions of genes can play important roles in gene silencing during processes such as X-chromosome inactivation, imprinting, and silencing of intragenomic parasites. The generally accepted definition of what constitutes a CpG island was proposed in 1987 by Gardiner-Garden and Frommer [Gardiner-Garden, M. & Frommer, M. (1987) J. Mol. Biol. 196, 261–282] as being a 200-bp stretch of DNA with a C+G content of 50% and an observed CpG/expected CpG in excess of 0.6. Any definition of a CpG island is somewhat arbitrary, and this one, which was derived before the sequencing of mammalian genomes, will include many sequences that are not necessarily associated with controlling regions of genes but rather are associated with intragenomic parasites. We have therefore used the complete genomic sequences of human chromosomes 21 and 22 to examine the properties of CpG islands in different sequence classes by using a search algorithm that we have developed. Regions of DNA of greater than 500 bp with a G+C equal to or greater than 55% and observed CpG/expected CpG of 0.65 were more likely to be associated with the 5′ regions of genes and this definition excluded most Alu-repetitive elements. We also used genome sequences to show strong CpG suppression in the human genome and slight suppression in Drosophila melanogaster and Saccharomyces cerevisiae. This finding is compatible with the recent detection of 5-methylcytosine in Drosophila, and might suggest that S. cerevisiae has, or once had, CpG methylation.
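
The threshold criteria quoted in this abstract can be written down compactly. The Python below is a minimal sketch, assuming the usual observed/expected CpG ratio (number of CG dinucleotides × N) / (number of C × number of G) for a segment of length N. The function names `cpg_stats` and `meets_thresholds` are hypothetical, and the sketch tests a single fixed segment rather than reproducing the authors' search algorithm. Treating the 0.65 ratio as a lower bound is also an assumption.

```python
def cpg_stats(seq):
    """Return (G+C fraction, observed/expected CpG ratio) for one segment."""
    seq = seq.upper()
    n = len(seq)
    c = seq.count("C")
    g = seq.count("G")
    cpg = seq.count("CG")  # "CG" cannot overlap itself, so count() is exact
    gc_fraction = (c + g) / n if n else 0.0
    expected = c * g / n if n else 0.0
    ratio = cpg / expected if expected else 0.0
    return gc_fraction, ratio


def meets_thresholds(seq, min_len=500, min_gc=0.55, min_ratio=0.65):
    """Check the stricter criteria from the abstract (defaults are assumptions
    read off its wording: longer than 500 bp, G+C >= 55%, ratio >= 0.65)."""
    gc_fraction, ratio = cpg_stats(seq)
    return len(seq) > min_len and gc_fraction >= min_gc and ratio >= min_ratio
```

Passing `min_len=199, min_gc=0.5, min_ratio=0.6` would approximate the earlier 200-bp definition mentioned in the same abstract, again only as a rough reading of its wording.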

1,553 citations


"Distributions associated with gener..." refers background or methods in this paper

  • ...This is biologically realistic as islands are presumed not only to be longer than a certain length but also to be infrequent and not particularly close to one another [Takai and Jones (2002)]....

    [...]

  • ...As CpG islands can be especially useful for identifying genes in human DNA [Takai and Jones (2002)], different methods have been developed for their detection....

    [...]

  • ...These locations and size differences were verified by using the software CpG islands searcher [Takai and Jones (2003)], which is based on the (non-HMM) algorithm of Takai and Jones (2002) and requires the use of predetermined thresholds....

    [...]

  • ...If the frequency of CG pairs is higher than some predetermined threshold, then the segment is defined to be within a CpG island [Takai and Jones (2002)]....

    [...]