

Speeding Up HMM Decoding and Training by Exploiting Sequence Repetitions*

Yury Lifshits¹, Shay Mozes**,², Oren Weimann², and Michal Ziv-Ukelson³

¹ Steklov Institute of Mathematics at St. Petersburg, 27 Fontanka, St. Petersburg 191023, Russia. yura@logic.pdmi.ras.ru
² MIT Computer Science and Artificial Intelligence Laboratory, 32 Vassar Street, Cambridge, MA 02139, USA. shaymozes@gmail.com, oweimann@mit.edu
³ School of Computer Science, Tel-Aviv University, Tel-Aviv 69978, Israel. michaluz@post.tau.ac.il

* A preliminary version of this paper appeared in [27].
** Work conducted while visiting MIT.
Abstract. We present a method to speed up the dynamic program algorithms used for solving the HMM
decoding and training problems for discrete time-independent HMMs. We discuss the application of our
method to Viterbi’s decoding and training algorithms [33], as well as to the forward-backward and Baum-
Welch [6] algorithms. Our approach is based on identifying repeated substrings in the observed input
sequence. Initially, we show how to exploit repetitions of all sufficiently small substrings (this is similar to
the Four Russians method). Then, we describe four algorithms based alternatively on run length encoding
(RLE), Lempel-Ziv (LZ78) parsing, grammar-based compression (SLP), and byte pair encoding (BPE).
Compared to Viterbi’s algorithm, we achieve speedups of Θ(log n) using the Four Russians method, Ω(r/log r) using RLE, Ω(log n/k) using LZ78, Ω(r/k) using SLP, and Ω(r) using BPE, where k is the number
of hidden states, n is the length of the observed sequence and r is its compression ratio (under each
compression scheme). Our experimental results demonstrate that our new algorithms are indeed faster
in practice. Furthermore, unlike Viterbi’s algorithm, our algorithms are highly parallelizable.
Key words: HMM, Viterbi, dynamic programming, compression
1 Introduction
Over the last few decades, Hidden Markov Models (HMMs) proved to be an extremely useful frame-
work for modeling processes in diverse areas such as error-correction in communication links [33],
speech recognition [8], optical character recognition [2], computational linguistics [25], and bioinfor-
matics [16].
The core HMM-based applications fall in the domain of classification methods and are technically
divided into two stages: a training stage and a decoding stage. During the training stage, the emission
and transition probabilities of an HMM are estimated, based on an input set of observed sequences.
This stage is usually executed once as a preprocessing stage and the generated (“trained”) models
are stored in a database. Then, a decoding stage is run, again and again, in order to classify input
sequences. The objective of this stage is to find the most probable sequence of states to have generated
each input sequence given each model, as illustrated in Fig. 1.
Obviously, the training problem is more difficult to solve than the decoding problem. However, the
techniques used for decoding serve as basic ingredients in solving the training problem. The Viterbi
algorithm (VA) [33] is the best known tool for solving the decoding problem. Following its invention
in 1967, several other algorithms have been devised for the decoding and training problems, such as
the forward-backward and Baum-Welch [6] algorithms. These algorithms are all based on dynamic
programs whose running times depend linearly on the length of the observed sequence. The challenge
of speeding up VA by utilizing HMM topology was posed in 1997 by Buchsbaum and Giancarlo [8] as
a major open problem. In this contribution, we address this open problem by using text compression and present the first provable speedup of these algorithms.

Fig. 1. The HMM on the observed sequence X = x_1, x_2, ..., x_n and states 1, 2, ..., k. The highlighted path is a possible path of states that generate the observed sequence. VA finds the path with highest probability.
The traditional aim of text compression is the efficient use of resources such as storage and
bandwidth. Here, we will compress the observed sequences in order to speed up HMM algorithms.
Note that this approach, denoted “acceleration by text-compression”, has been recently applied to
some classical problems on strings. Various compression schemes, such as LZ77, LZW-LZ78, Huffman
coding, Byte Pair Encoding (BPE) and Run Length Encoding (RLE), were employed to accelerate
exact string matching [19, 24, 31, 22], subsequence matching [10], approximate pattern matching [1,
18, 20, 28] and sequence alignment [3, 4, 9, 15, 23]. In light of the practical importance of HMM-based
classification methods in state-of-the-art research, and in view of the fact that such techniques are
also based on dynamic programming, we set out to answer the following question: can “acceleration
by text compression” be applied to HMM decoding and training algorithms?
Our results. In this study we address the above challenge of speeding up HMM dynamic program-
ming algorithms (Viterbi, forward-backward and Baum-Welch) by compression. We compress only
in one dimension, the sequence axis, since typically n >> k and the states are non-repetitive. This
compression enables the algorithm to adapt to the data and to utilize its repetitions. We present a
basic toolkit of operations that could be further extended beyond this paper and applied to variant
HMM-based problems in order to utilize common and repeated substrings. In general, the input
sequences can be pre-compressed, as an offline stage, before our algorithms are applied. Such pre-compression usually takes time linear in the size of the input sequences, and it is done here not to save space but as a good investment in an all-against-all classification scheme: each input sequence will be decoded many times according to various models, so it pays off to pre-compress it once and for all.
Let X denote the input sequence and let n denote its length. Let k denote the number of states in the HMM and |Σ| denote the size of the alphabet. For any given compression scheme, let n′ denote the number of parsed blocks in X and let r = n/n′ denote the compression ratio. Our results are as follows.
1. Using the Four Russians method, we accelerate decoding by a factor of Θ(log n). Here we assume that k < n/(2|Σ|^2 log n).
2. RLE is used to accelerate decoding by a factor of Ω(r/log r).
3. Using LZ78, we accelerate decoding by a factor of Ω(log n/k). Our algorithm guarantees no degradation in efficiency even when k > log n and is experimentally more than five times faster than VA when applied to DNA sequences.
4. SLP is used to accelerate decoding by a factor of Ω(r/k).
5. BPE is used to accelerate decoding by a factor of Ω(r).
6. The same speedup factors apply to the Viterbi training algorithm.

7. For the Baum-Welch training algorithm, we show how to preprocess a repeated substring of size ℓ once in O(ℓk^4) time so that we may replace the usual O(ℓk^2) processing work for each occurrence of this substring with an alternative O(k^4) computation. This is beneficial for any repeat with λ non-overlapping occurrences, such that λ > ℓk^2/(ℓ - k^2).
8. As opposed to VA, our algorithms are highly parallelizable.
Roadmap. The rest of the paper is organized as follows. In section 2 we give a unified presentation of
the HMM dynamic programs. We then show in section 3 how these algorithms can be improved by
identifying repeated substrings. Five different implementations of this general idea are presented in
section 4. Section 5 discusses the recovery of the optimal state-path. In section 6 we show how to adapt
the algorithms to the training problem. A parallel implementation of our algorithms is described in
section 7, and experimental results are presented in section 8. We summarize and discuss future work
in section 9.
2 Preliminaries
Let Σ denote a finite alphabet and let X ∈ Σ^n, X = x_1, x_2, ..., x_n be a sequence of observed letters. A Markov model is a set of k states, along with emission probabilities e_k(σ) - the probability to observe σ ∈ Σ given that the state is k, and transition probabilities P_{i,j} - the probability to make a transition to state i from state j.
The Viterbi Algorithm. The Viterbi algorithm (VA) finds the most probable sequence of hidden states given the model and the observed sequence, i.e., the sequence of states s_1, s_2, ..., s_n which maximizes

∏_{i=1}^{n} e_{s_i}(x_i) · P_{s_i, s_{i-1}}     (1)
The dynamic program of VA calculates a vector v_t[i] which is the probability of the most probable sequence of states emitting x_1, ..., x_t and ending with the state i at time t. v_0 is usually taken to be the vector of uniform probabilities (i.e., v_0[i] = 1/k). v_{t+1} is calculated from v_t according to

v_{t+1}[i] = e_i(x_{t+1}) · max_j {P_{i,j} · v_t[j]}     (2)
Definition 1 (Viterbi Step). We call the computation of v_{t+1} from v_t a Viterbi step.

Clearly, each Viterbi step requires O(k^2) time. Therefore, the total runtime required to compute the vector v_n is O(nk^2). The probability of the most likely sequence of states is the maximal element in v_n. The actual sequence of states can then be reconstructed in linear time.
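To make the recurrence concrete, the following is a minimal sketch (ours, not the paper's) of a single Viterbi step and the resulting O(nk^2) scan. The emission table e and transition table P are illustrative placeholders, probabilities are left unnormalized, and underflow (normally avoided by working in log-space) is ignored.

```python
def viterbi_step(v, e, P, x):
    # Eq. (2): v_{t+1}[i] = e_i(x_{t+1}) * max_j { P[i][j] * v[j] }  -- O(k^2) work.
    k = len(v)
    return [e[i][x] * max(P[i][j] * v[j] for j in range(k)) for i in range(k)]

def viterbi_probability(X, e, P, k):
    # e[i][sigma]: probability that state i emits sigma; P[i][j]: transition j -> i.
    v = [1.0 / k] * k                 # v_0: uniform start probabilities
    for x in X:                       # n Viterbi steps, O(n k^2) total
        v = viterbi_step(v, e, P, x)
    return max(v)                     # probability of the most likely state path
```

The path itself can be recovered by additionally recording the maximizing j at each step, as discussed in section 5.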
Fig. 2. The VA dynamic program table on sequence X = x_1, x_2, ..., x_13 and states 1, 2, 3, 4, 5, 6. The marked cell corresponds to v_8[4] = e_4(x_8) · max{P_{4,1} · v_7[1], P_{4,2} · v_7[2], ..., P_{4,6} · v_7[6]}.

It is useful for our purposes to rewrite VA in a slightly different way. Let M^σ be a k × k matrix with elements M^σ_{i,j} = e_i(σ) · P_{i,j}. We can now express v_n as:

v_n = M^{x_n} ⊙ M^{x_{n-1}} ⊙ ··· ⊙ M^{x_2} ⊙ M^{x_1} ⊙ v_0     (3)

where (A ⊙ B)_{i,j} = max_k {A_{i,k} · B_{k,j}} is the so-called max-times matrix multiplication. VA computes v_n using (3) from right to left in O(nk^2) time. Notice that if (3) is evaluated from left to right the computation would take O(nk^3) time (matrix-vector multiplication vs. matrix-matrix multiplication). Throughout, we assume that the max-times matrix-matrix multiplications are done naïvely in O(k^3). Faster methods for max-times matrix multiplication [11, 12] and standard matrix multiplication [14, 32] can be used to reduce the k^3 term. However, for small values of k this is not profitable.
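As an illustration (ours, not from the paper), the same computation can be phrased with an explicit max-times product. The helper below is the naïve O(k^3) product, and applying a single matrix to a vector costs O(k^2), which is exactly the right-to-left evaluation of (3); the letter matrices M[sigma] are assumed to be given.

```python
import numpy as np

def max_times(A, B):
    # (A ⊙ B)[i, j] = max_m { A[i, m] * B[m, j] }; naive O(k^3) max-times product.
    return (A[:, :, None] * B[None, :, :]).max(axis=1)

def viterbi_max_times(X, M, k):
    # M[sigma] is the k x k matrix with entries M[sigma][i, j] = e_i(sigma) * P[i, j].
    v = np.full(k, 1.0 / k)                      # v_0
    for sigma in X:                              # evaluates eq. (3) from right to left
        v = (M[sigma] * v[None, :]).max(axis=1)  # max-times matrix-vector step, O(k^2)
    return v.max()
```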
The Forward-Backward Algorithms. The forward-backward algorithms are closely related to VA and are based on very similar dynamic programs. In contrast to VA, these algorithms apply standard matrix multiplication instead of max-times multiplication. The forward algorithm calculates f_t[i], the probability to observe the sequence x_1, x_2, ..., x_t requiring that s_t = i, as follows:

f_t = M^{x_t} · M^{x_{t-1}} · ··· · M^{x_2} · M^{x_1} · f_0     (4)

The backward algorithm calculates b_t[i], the probability to observe the sequence x_{t+1}, x_{t+2}, ..., x_n given that s_t = i, as follows:

b_t = b_n · M^{x_n} · M^{x_{n-1}} · ··· · M^{x_{t+2}} · M^{x_{t+1}}     (5)

Another algorithm which is used in the training stage and employs the forward-backward algorithm as a subroutine is the Baum-Welch algorithm, to be further discussed in Section 6.
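For contrast with the max-times version, here is a brief sketch (ours) of the recursions (4) and (5) with ordinary matrix-vector products; the matrices M[sigma] are as above, and f0 and b_n are the boundary vectors.

```python
import numpy as np

def forward(X, M, f0):
    # Eq. (4): f_t = M^{x_t} · ... · M^{x_1} · f_0, accumulated left to right over X.
    f, fs = f0, []
    for sigma in X:          # O(n k^2) in total
        f = M[sigma] @ f
        fs.append(f)
    return fs                # fs[t-1] holds f_t

def backward(X, M, b_n):
    # Eq. (5): b_t = b_{t+1} · M^{x_{t+1}}, accumulated from the end of the sequence.
    bs = [b_n]
    for sigma in reversed(X):
        bs.append(bs[-1] @ M[sigma])
    bs.reverse()
    return bs                # bs[t] holds b_t for t = 0, ..., n
```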
A motivating example. We briefly describe one concrete example from computational biology to
which our algorithms naturally apply. CpG islands [7] are regions of DNA with a large concentration
of the nucleotide pair CG. These regions are typically a few hundred to a few thousand nucleotides
long, located around the promoters of many genes. As such, they are useful landmarks for the
identification of genes. The observed sequence (X) is a long DNA sequence composed of four possible
nucleotides (Σ = {A, C, G, T}). The length of this sequence is typically a few million nucleotides (n ≈ 2^25). A well-studied classification problem is that of parsing a given DNA sequence into CpG islands and non-CpG regions. Previous work on CpG island classification used Markov models with either 8 or 2 states (k = 8 or k = 2) [13, 16].
3 Exploiting Repeated Substrings in the Decoding Stage
Consider a substring W = w_1, w_2, ..., w_ℓ of X, and define

M(W) = M^{w_ℓ} ⊙ M^{w_{ℓ-1}} ⊙ ··· ⊙ M^{w_2} ⊙ M^{w_1}     (6)

Intuitively, M_{i,j}(W) is the probability of the most likely path starting with state j, making a transition into some other state, emitting w_1, then making a transition into yet another state and emitting w_2, and so on, until making a final transition into state i and emitting w_ℓ.

At the core of our method stands the following observation, which is immediate from the associative nature of matrix multiplication.

Observation 1. We may replace any occurrence of M^{w_ℓ} ⊙ M^{w_{ℓ-1}} ⊙ ··· ⊙ M^{w_1} in eq. (3) with M(W).

The application of Observation 1 to the computation of equation (3) saves ℓ - 1 Viterbi steps each time W appears in X, but incurs the additional cost of computing M(W) once.

An intuitive exercise. Let λ denote the number of times a given word W appears, in non-overlapping occurrences, in the input string X. Suppose we naïvely compute M(W) using (|W| - 1) max-times matrix multiplications, and then apply Observation 1 to all occurrences of W before running VA. We gain some speedup in doing so if

(|W| - 1)k^3 + λk^2 < λ|W|k^2   ⟺   λ > k     (7)

Hence, if there are at least k non-overlapping occurrences of W in the input sequence, then it is worthwhile to naïvely precompute M(W), regardless of its size |W|.
Definition 2 (Good Substring). We call a substring W good if we decide to compute M(W ).
We can now give a general four-step framework of our method:
(I) Dictionary Selection: choose the set D = {W_i} of good substrings.
(II) Encoding: precompute the matrices M(W_i) for every W_i ∈ D.
(III) Parsing: partition the input sequence X into consecutive good substrings X = W_{i_1} W_{i_2} ··· W_{i_{n′′}} and let X′ denote the compressed representation of this parsing of X, such that X′ = i_1 i_2 ··· i_{n′′}.
(IV) Propagation: run VA on X′, using the matrices M(W_i) (see the sketch below).
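Purely as an illustration (ours, with simplified stand-ins), the sketch below instantiates steps (I)-(IV) for an arbitrary dictionary D of good substrings. It assumes that every single letter is itself a good substring, so that the greedy parse in step (III) always succeeds, and that M[sigma] is the k × k letter matrix with entries e_i(σ) · P_{i,j}.

```python
import numpy as np

def max_times(A, B):
    # (A ⊙ B)[i, j] = max_m { A[i, m] * B[m, j] }
    return (A[:, :, None] * B[None, :, :]).max(axis=1)

def encode(D, M):
    # Step (II): naively precompute M(W) for every good substring W, as in eq. (6).
    table = {}
    for W in D:
        prod = M[W[0]]
        for sigma in W[1:]:
            prod = max_times(M[sigma], prod)   # later letters multiply on the left
        table[W] = prod
    return table

def parse(X, D):
    # Step (III): greedily partition X into consecutive good substrings,
    # preferring longer matches; assumes every single letter belongs to D.
    max_len = max(len(W) for W in D)
    blocks, i = [], 0
    while i < len(X):
        for l in range(min(max_len, len(X) - i), 0, -1):
            if X[i:i + l] in D:
                blocks.append(X[i:i + l])
                i += l
                break
    return blocks

def decode(X, M, D, k):
    # Step (IV): run VA on the parsed sequence X', one max-times step per block.
    table = encode(D, M)                         # Step (II)
    v = np.full(k, 1.0 / k)                      # v_0
    for W in parse(X, D):                        # Step (III)
        v = (table[W] * v[None, :]).max(axis=1)  # one step per block, not per letter
    return v.max()
```

The choice of the dictionary D and of the parsing is exactly what distinguishes the five implementations of section 4.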
The above framework introduces the challenge of how to select the set of good substrings (step
I) and how to efficiently compute their matrices (step II). In the next section we show how the RLE,
LZ78, SLP and BPE compression schemes can be applied to address this challenge, and how the above
framework can be utilized to exploit repetitions of all sufficiently small substrings (this is similar to
the Four Russians method). In practice, the choice of the appropriate compression scheme should be
made according to the nature of the observed sequences. For example, genomic sequences tend to
compress well with BPE [31] and binary images in facsimile or in optical character recognition [2–4,
9, 23, 26] are well compressed by RLE. LZ78 guarantees asymptotic compression for any sequence
and is useful in cases such as the CpG islands [7] identification problem in DNA sequences [13, 16],
where k is smaller than log n.
Another challenge is how to parse the sequence X (step III) in order to maximize acceleration.
We show that, surprisingly, this optimal parsing may differ from the initial parsing induced by the
selected compression scheme. To our knowledge, this feature was not applied by previous “acceleration
by compression” algorithms.
Throughout this paper we focus on computing path probabilities rather than the paths themselves.
The actual paths can be reconstructed in linear time as described in section 5.
4 Five Different Implementations of the General Framework
4.1 Acceleration via the Four Russians Method
The most naïve approach is probably to use all possible substrings of sufficiently small length ℓ as good substrings. This approach is quite similar to the Four Russians method [5], and leads to a Θ(log n) asymptotic speedup.
(I) Dictionary Selection: all possible strings of length ℓ over the alphabet Σ are good substrings.
(II) Encoding: for i = 2, ..., ℓ, compute M(W) for all strings W of length i by computing M(W′) ⊙ M(σ), where W = W′σ for some previously computed string W′ of length i - 1 and some letter σ ∈ Σ.
(III) Parsing: X′ is constructed by splitting the input X into blocks of length ℓ.
(IV) Propagation: run VA on X′, using the matrices M(W_i), as described in section 3 (a sketch follows below).
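A sketch (ours, under the same assumptions as before) of this variant: the table holds M(W) for every string of length at most ℓ, each matrix obtained from a one-letter-shorter string by a single max-times multiplication, and X is then processed in blocks of length ℓ.

```python
import itertools
import numpy as np

def max_times(A, B):
    # (A ⊙ B)[i, j] = max_m { A[i, m] * B[m, j] }
    return (A[:, :, None] * B[None, :, :]).max(axis=1)

def encode_all(alphabet, M, ell):
    # Step (II): M(W) for every string of length 1..ell, one multiplication each,
    # i.e., O(|Sigma|^ell) matrices and O(|Sigma|^ell * k^3) total encoding time.
    table = {sigma: M[sigma] for sigma in alphabet}
    for i in range(2, ell + 1):
        for W_prime in map(''.join, itertools.product(alphabet, repeat=i - 1)):
            for sigma in alphabet:
                # Extend W' by one letter emitted last; by eq. (6) the matrix of the
                # last emitted letter multiplies on the left of M(W').
                table[W_prime + sigma] = max_times(M[sigma], table[W_prime])
    return table

def decode_blocks(X, M, alphabet, ell, k):
    # Steps (III)-(IV): split X into blocks of length ell (the last block may be
    # shorter) and perform one max-times step per block.
    table = encode_all(alphabet, M, ell)
    v = np.full(k, 1.0 / k)
    for t in range(0, len(X), ell):
        v = (table[X[t:t + ell]] * v[None, :]).max(axis=1)
    return v.max()
```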

Frequently Asked Questions (22)
Q1. What are the contributions mentioned in the paper "Speeding Up HMM Decoding and Training by Exploiting Sequence Repetitions"?

The authors present a method to speed up the dynamic program algorithms used for solving the HMM decoding and training problems for discrete time-independent HMMs. The authors discuss the application of their method to Viterbi's decoding and training algorithms [33], as well as to the forward-backward and Baum-Welch [6] algorithms. Their approach is based on identifying repeated substrings in the observed input sequence. Initially, the authors show how to exploit repetitions of all sufficiently small substrings (this is similar to the Four Russians method). Then, the authors describe four algorithms based alternatively on run length encoding (RLE), Lempel-Ziv (LZ78) parsing, grammar-based compression (SLP), and byte pair encoding (BPE). Furthermore, unlike Viterbi's algorithm, their algorithms are highly parallelizable.

Other promising directions for future work include extending their results to higher order HMMs, HMMs with numerical observables [21], hierarchical HMMs, infinite alphabet size, and sparse transition matrices. 

The Baum-Welch training algorithm converges to a set of parameters that maximizes the likelihood to observe the given sequence P(X|θ), and is the most commonly used method for model training.

The max-times multiplication of two matrices can be done in parallel in O(1) time using k^4 processors, so the authors can calculate the product of n matrices in O(log n) time.

The encoding step takes O(2|Σ|^ℓ k^3) time, as the authors compute O(2|Σ|^ℓ) matrices and each matrix is computed in O(k^3) time by a single max-times multiplication.

Their approach for constructing X′ is to first parse X into all LZ78-words and then apply the following greedy parsing to each LZ78-word W: using the trie, find the longest good substring w′ ∈ D that is a prefix of W, place a parsing comma immediately after w′, and repeat the process for the remainder of W.

The benchmarks were performed on a single processor of a SunFire V880 server with 8 UltraSPARC-IV processors and 16GB main memory.

LZ78 parses the string X into substrings (LZ78-words) in a single pass over X. Each LZ78-word is composed of the longest LZ78-word previously seen plus a single letter.

The corresponding pre-processing term for encoding is O(|Σ′|k^3), where Σ′ denotes the set of character codes in the extended alphabet.

In this section byte pair encoding is utilized to accelerate the Viterbi decoding computations by a factor of Ω(r), where n′ is the number of characters in the BPE-compressed sequence X and r = n/n′ is the BPE compression ratio of the sequence. 

In LZ78-accelerated decoding, W_B is a single letter; when using RLE, W_A = W_B = σ^{|W|/2}; SLP consists just of production rules involving a single letter or exactly two non-terminals; and with BPE, W_A, W_B ∈ Σ′.

In this subsection the authors show that if an input sequence has a grammar representation with compression ratio r, then HMM decoding can be accelerated by a factor of Ω(r/k).

It is possible to compute the maximum in (2) in parallel for all states i, but the dependency of v_{t+1} on v_t makes it difficult to achieve a sublinear parallel algorithm.

The authors show how to use the LZ78 [34] parsing to find good substrings and how to use the incremental nature of the LZ78 parse to compute M(W) for a good substring W in O(k^3) time.

The propagation step takes O(nk^2/ℓ) time, resulting in an overall running time of O(2|Σ|^ℓ k^3 + nk^2/ℓ). Choosing ℓ = ½ log_|Σ|(n), the running time is O(2√n·k^3 + 2nk^2/log_|Σ|(n)).

In the propagation step (IV), given an input sequence of size n, compressed into its BPE-encoding of size n′ (e.g. n′ = 4 in the above example, where X = ABABCABCD and X′ = XYYD), the authors run VA using at most n′ matrices.

(I) Dictionary Selection: the good substrings are all the LZ78-words in the LZ78-parse of X. (II) Encoding: construct the matrices incrementally, according to their order in the LZ78-trie: M(Wσ) = M(W) ⊙ M^σ. (III) Parsing: X′ is the LZ78-parsing of X. (IV) Propagation: run VA on X′, as described in section 3.

In this section the authors obtain an Ω(r/log r) speedup for decoding an observed sequence with run-length compression ratio r. A string S is run-length encoded if it is described as an ordered sequence of pairs (σ, i), often denoted "σ^i".