

Speeding Up HMM Decoding and Training by Exploiting Sequence Repetitions*

Yury Lifshits¹, Shay Mozes**,², Oren Weimann², and Michal Ziv-Ukelson³

¹ Steklov Institute of Mathematics at St. Petersburg, 27 Fontanka, St. Petersburg 191023, Russia. yura@logic.pdmi.ras.ru
² MIT Computer Science and Artificial Intelligence Laboratory, 32 Vassar Street, Cambridge, MA 02139, USA. shaymozes@gmail.com, oweimann@mit.edu
³ School of Computer Science, Tel-Aviv University, Tel-Aviv 69978, Israel. michaluz@post.tau.ac.il

* A preliminary version of this paper appeared in [27].
** Work conducted while visiting MIT.
Abstract. We present a method to speed up the dynamic program algorithms used for solving the HMM
decoding and training problems for discrete time-independent HMMs. We discuss the application of our
method to Viterbi’s decoding and training algorithms [33], as well as to the forward-backward and Baum-
Welch [6] algorithms. Our approach is based on identifying repeated substrings in the observed input
sequence. Initially, we show how to exploit repetitions of all sufficiently small substrings (this is similar to
the Four Russians method). Then, we describe four algorithms based alternatively on run length encoding
(RLE), Lempel-Ziv (LZ78) parsing, grammar-based compression (SLP), and byte pair encoding (BPE).
Compared to Viterbi’s algorithm, we achieve speedups of Θ(log n) using the Four Russians method, Ω(r/log r) using RLE, Ω(log n/k) using LZ78, Ω(r/k) using SLP, and Ω(r) using BPE, where k is the number
of hidden states, n is the length of the observed sequence and r is its compression ratio (under each
compression scheme). Our experimental results demonstrate that our new algorithms are indeed faster
in practice. Furthermore, unlike Viterbi’s algorithm, our algorithms are highly parallelizable.
Key words: HMM, Viterbi, dynamic programming, compression
1 Introduction
Over the last few decades, Hidden Markov Models (HMMs) proved to be an extremely useful frame-
work for modeling processes in diverse areas such as error-correction in communication links [33],
speech recognition [8], optical character recognition [2], computational linguistics [25], and bioinfor-
matics [16].
The core HMM-based applications fall in the domain of classification methods and are technically
divided into two stages: a training stage and a decoding stage. During the training stage, the emission
and transition probabilities of an HMM are estimated, based on an input set of observed sequences.
This stage is usually executed once as a preprocessing stage and the generated (“trained”) models
are stored in a database. Then, a decoding stage is run, again and again, in order to classify input
sequences. The objective of this stage is to find the most probable sequence of states to have generated
each input sequence given each model, as illustrated in Fig. 1.
Obviously, the training problem is more difficult to solve than the decoding problem. However, the
techniques used for decoding serve as basic ingredients in solving the training problem. The Viterbi
algorithm (VA) [33] is the best known tool for solving the decoding problem. Following its invention
in 1967, several other algorithms have been devised for the decoding and training problems, such as
the forward-backward and Baum-Welch [6] algorithms. These algorithms are all based on dynamic
programs whose running times depend linearly on the length of the observed sequence. The challenge
of speeding up VA by utilizing HMM topology was posed in 1997 by Buchsbaum and Giancarlo [8] as
a major open problem. In this contribution, we address this open problem by using text compression and present the first provable speedup of these algorithms.

Fig. 1. The HMM on the observed sequence X = x_1, x_2, ..., x_n and states 1, 2, ..., k. The highlighted path is a possible path of states that generate the observed sequence. VA finds the path with highest probability.
The traditional aim of text compression is the efficient use of resources such as storage and
bandwidth. Here, we will compress the observed sequences in order to speed up HMM algorithms.
Note that this approach, denoted “acceleration by text-compression”, has been recently applied to
some classical problems on strings. Various compression schemes, such as LZ77, LZW-LZ78, Huffman
coding, Byte Pair Encoding (BPE) and Run Length Encoding (RLE), were employed to accelerate
exact string matching [19, 24, 31, 22], subsequence matching [10], approximate pattern matching [1,
18, 20, 28] and sequence alignment [3, 4, 9, 15, 23]. In light of the practical importance of HMM-based
classification methods in state-of-the-art research, and in view of the fact that such techniques are
also based on dynamic programming, we set out to answer the following question: can “acceleration
by text compression” be applied to HMM decoding and training algorithms?
Our results. In this study we address the above challenge of speeding up HMM dynamic program-
ming algorithms (Viterbi, forward-backward and Baum-Welch) by compression. We compress only
in one dimension, the sequence axis, since typically n >> k and the states are non-repetitive. This
compression enables the algorithm to adapt to the data and to utilize its repetitions. We present a
basic toolkit of operations that could be further extended beyond this paper and applied to variant
HMM-based problems in order to utilize common and repeated substrings. In general, the input
sequences can be pre-compressed, as an offline stage, before our algorithms are applied. Such pre-compression usually takes time linear in the size of the input sequences, and it is done here not to save space but as a good investment in an all-against-all classification scheme: each input sequence will be decoded many times according to various models, so it pays off to pre-compress it once and for all.
Let X denote the input sequence and let n denote its length. Let k denote the number of states in the HMM and |Σ| denote the size of the alphabet. For any given compression scheme, let n′ denote the number of parsed blocks in X and let r = n/n′ denote the compression ratio. Our results are as follows.
1. Using the Four Russians method, we accelerate decoding by a factor of Θ(log n). Here we assume that k < n/(2|Σ|^2 log n).
2. RLE is used to accelerate decoding by a factor of Ω(r/log r).
3. Using LZ78, we accelerate decoding by a factor of Ω(log n/k). Our algorithm guarantees no degradation in efficiency even when k > log n and is experimentally more than five times faster than VA when applied to DNA sequences.
4. SLP is used to accelerate decoding by a factor of Ω(r/k).
5. BPE is used to accelerate decoding by a factor of Ω(r).
6. The same speedup factors apply to the Viterbi training algorithm.

7. For the Baum-Welch training algorithm, we show how to preprocess a repeated substring of size ℓ once in O(ℓk^4) time so that we may replace the usual O(ℓk^2) processing work for each occurrence of this substring with an alternative O(k^4) computation. This is beneficial for any repeat with λ non-overlapping occurrences, such that λ > ℓk^2/(ℓ - k^2).
8. As opposed to VA, our algorithms are highly parallelizable.
Roadmap. The rest of the paper is organized as follows. In section 2 we give a unified presentation of
the HMM dynamic programs. We then show in section 3 how these algorithms can be improved by
identifying repeated substrings. Five different implementations of this general idea are presented in
section 4. Section 5 discusses the recovery of the optimal state-path. In section 6 we show how to adapt
the algorithms to the training problem. A parallel implementation of our algorithms is described in
section 7, and experimental results are presented in section 8. We summarize and discuss future work
in section 9.
2 Preliminaries
Let Σ denote a finite alphabet and let X ∈ Σ^n, X = x_1, x_2, ..., x_n be a sequence of observed letters. A Markov model is a set of k states, along with emission probabilities e_k(σ) - the probability to observe σ ∈ Σ given that the state is k, and transition probabilities P_{i,j} - the probability to make a transition to state i from state j.
The Viterbi Algorithm. The Viterbi algorithm (VA) finds the most probable sequence of hidden states given the model and the observed sequence, i.e., the sequence of states s_1, s_2, ..., s_n which maximizes

∏_{i=1}^{n} e_{s_i}(x_i) · P_{s_i, s_{i-1}}     (1)
The dynamic program of VA calculates a vector v_t[i] which is the probability of the most probable sequence of states emitting x_1, ..., x_t and ending with the state i at time t. v_0 is usually taken to be the vector of uniform probabilities (i.e., v_0[i] = 1/k). v_{t+1} is calculated from v_t according to

v_{t+1}[i] = e_i(x_{t+1}) · max_j {P_{i,j} · v_t[j]}     (2)
Definition 1 (Viterbi Step). We call the computation of v_{t+1} from v_t a Viterbi step.

Clearly, each Viterbi step requires O(k^2) time. Therefore, the total runtime required to compute the vector v_n is O(nk^2). The probability of the most likely sequence of states is the maximal element in v_n. The actual sequence of states can then be reconstructed in linear time.
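To make the recurrence concrete, the following is a minimal sketch (ours, not the paper's) of a single Viterbi step and the resulting O(nk^2) scan. The emission table e and transition table P are illustrative placeholders, probabilities are left unnormalized, and underflow (normally avoided by working in log-space) is ignored.

```python
def viterbi_step(v, e, P, x):
    # Eq. (2): v_{t+1}[i] = e_i(x_{t+1}) * max_j { P[i][j] * v[j] }  -- O(k^2) work.
    k = len(v)
    return [e[i][x] * max(P[i][j] * v[j] for j in range(k)) for i in range(k)]

def viterbi_probability(X, e, P, k):
    # e[i][sigma]: probability that state i emits sigma; P[i][j]: transition j -> i.
    v = [1.0 / k] * k                 # v_0: uniform start probabilities
    for x in X:                       # n Viterbi steps, O(n k^2) total
        v = viterbi_step(v, e, P, x)
    return max(v)                     # probability of the most likely state path
```

The path itself can be recovered by additionally recording the maximizing j at each step, as discussed in section 5.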
Fig. 2. The VA dynamic program table on sequence X = x_1, x_2, ..., x_13 and states 1, 2, 3, 4, 5, 6. The marked cell corresponds to v_8[4] = e_4(x_8) · max{P_{4,1} · v_7[1], P_{4,2} · v_7[2], ..., P_{4,6} · v_7[6]}.

It is useful for our purposes to rewrite VA in a slightly different way. Let M^σ be a k × k matrix with elements M^σ_{i,j} = e_i(σ) · P_{i,j}. We can now express v_n as:

v_n = M^{x_n} ⊙ M^{x_{n-1}} ⊙ ··· ⊙ M^{x_2} ⊙ M^{x_1} ⊙ v_0     (3)

where (A ⊙ B)_{i,j} = max_k {A_{i,k} · B_{k,j}} is the so-called max-times matrix multiplication. VA computes v_n using (3) from right to left in O(nk^2) time. Notice that if (3) is evaluated from left to right the computation would take O(nk^3) time (matrix-vector multiplication vs. matrix-matrix multiplication). Throughout, we assume that the max-times matrix-matrix multiplications are done naïvely in O(k^3). Faster methods for max-times matrix multiplication [11, 12] and standard matrix multiplication [14, 32] can be used to reduce the k^3 term. However, for small values of k this is not profitable.
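As an illustration (ours, not from the paper), the same computation can be phrased with an explicit max-times product. The helper below is the naïve O(k^3) product, and applying a single matrix to a vector costs O(k^2), which is exactly the right-to-left evaluation of (3); the letter matrices M[sigma] are assumed to be given.

```python
import numpy as np

def max_times(A, B):
    # (A ⊙ B)[i, j] = max_m { A[i, m] * B[m, j] }; naive O(k^3) max-times product.
    return (A[:, :, None] * B[None, :, :]).max(axis=1)

def viterbi_max_times(X, M, k):
    # M[sigma] is the k x k matrix with entries M[sigma][i, j] = e_i(sigma) * P[i, j].
    v = np.full(k, 1.0 / k)                      # v_0
    for sigma in X:                              # evaluates eq. (3) from right to left
        v = (M[sigma] * v[None, :]).max(axis=1)  # max-times matrix-vector step, O(k^2)
    return v.max()
```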
The Forward-Backward Algorithms. The forward-backward algorithms are closely related to VA and are based on very similar dynamic programs. In contrast to VA, these algorithms apply standard matrix multiplication instead of max-times multiplication. The forward algorithm calculates f_t[i], the probability to observe the sequence x_1, x_2, ..., x_t requiring that s_t = i, as follows:

f_t = M^{x_t} · M^{x_{t-1}} · ··· · M^{x_2} · M^{x_1} · f_0     (4)

The backward algorithm calculates b_t[i], the probability to observe the sequence x_{t+1}, x_{t+2}, ..., x_n given that s_t = i, as follows:

b_t = b_n · M^{x_n} · M^{x_{n-1}} · ··· · M^{x_{t+2}} · M^{x_{t+1}}     (5)

Another algorithm which is used in the training stage and employs the forward-backward algorithm as a subroutine is the Baum-Welch algorithm, to be further discussed in Section 6.
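For contrast with the max-times version, here is a brief sketch (ours) of the recursions (4) and (5) with ordinary matrix-vector products; the matrices M[sigma] are as above, and f0 and b_n are the boundary vectors.

```python
import numpy as np

def forward(X, M, f0):
    # Eq. (4): f_t = M^{x_t} · ... · M^{x_1} · f_0, accumulated left to right over X.
    f, fs = f0, []
    for sigma in X:          # O(n k^2) in total
        f = M[sigma] @ f
        fs.append(f)
    return fs                # fs[t-1] holds f_t

def backward(X, M, b_n):
    # Eq. (5): b_t = b_{t+1} · M^{x_{t+1}}, accumulated from the end of the sequence.
    bs = [b_n]
    for sigma in reversed(X):
        bs.append(bs[-1] @ M[sigma])
    bs.reverse()
    return bs                # bs[t] holds b_t for t = 0, ..., n
```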
A motivating example. We briefly describe one concrete example from computational biology to
which our algorithms naturally apply. CpG islands [7] are regions of DNA with a large concentration
of the nucleotide pair CG. These regions are typically a few hundred to a few thousand nucleotides
long, located around the promoters of many genes. As such, they are useful landmarks for the
identification of genes. The observed sequence (X) is a long DNA sequence composed of four possible
nucleotides (Σ = {A, C, G, T}). The length of this sequence is typically a few million nucleotides (n ≈ 2^25). A well-studied classification problem is that of parsing a given DNA sequence into CpG islands and non-CpG regions. Previous work on CpG island classification used Markov models with either 8 or 2 states (k = 8 or k = 2) [13, 16].
3 Exploiting Repeated Substrings in the Decoding Stage
Consider a substring W = w_1, w_2, ..., w_ℓ of X, and define

M(W) = M^{w_ℓ} ⊙ M^{w_{ℓ-1}} ⊙ ··· ⊙ M^{w_2} ⊙ M^{w_1}     (6)

Intuitively, M_{i,j}(W) is the probability of the most likely path starting with state j, making a transition into some other state, emitting w_1, then making a transition into yet another state and emitting w_2, and so on, until making a final transition into state i and emitting w_ℓ.

At the core of our method stands the following observation, which is immediate from the associative nature of matrix multiplication.

Observation 1. We may replace any occurrence of M^{w_ℓ} ⊙ M^{w_{ℓ-1}} ⊙ ··· ⊙ M^{w_1} in eq. (3) with M(W).

The application of Observation 1 to the computation of equation (3) saves ℓ - 1 Viterbi steps each time W appears in X, but incurs the additional cost of computing M(W) once.

An intuitive exercise. Let λ denote the number of times a given word W appears, in non-overlapping occurrences, in the input string X. Suppose we naïvely compute M(W) using (|W| - 1) max-times matrix multiplications, and then apply Observation 1 to all occurrences of W before running VA. We gain some speedup in doing so if

(|W| - 1)k^3 + λk^2 < λ|W|k^2   ⟺   λ > k     (7)

Hence, if there are at least k non-overlapping occurrences of W in the input sequence, then it is worthwhile to naïvely precompute M(W), regardless of its size |W|.
Definition 2 (Good Substring). We call a substring W good if we decide to compute M(W ).
We can now give a general four-step framework of our method:
(I) Dictionary Selection: choose the set D = {W_i} of good substrings.
(II) Encoding: precompute the matrices M(W_i) for every W_i ∈ D.
(III) Parsing: partition the input sequence X into consecutive good substrings X = W_{i_1} W_{i_2} ··· W_{i_{n′′}} and let X′ denote the compressed representation of this parsing of X, such that X′ = i_1 i_2 ··· i_{n′′}.
(IV) Propagation: run VA on X′, using the matrices M(W_i) (see the sketch below).
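Purely as an illustration (ours, with simplified stand-ins), the sketch below instantiates steps (I)-(IV) for an arbitrary dictionary D of good substrings. It assumes that every single letter is itself a good substring, so that the greedy parse in step (III) always succeeds, and that M[sigma] is the k × k letter matrix with entries e_i(σ) · P_{i,j}.

```python
import numpy as np

def max_times(A, B):
    # (A ⊙ B)[i, j] = max_m { A[i, m] * B[m, j] }
    return (A[:, :, None] * B[None, :, :]).max(axis=1)

def encode(D, M):
    # Step (II): naively precompute M(W) for every good substring W, as in eq. (6).
    table = {}
    for W in D:
        prod = M[W[0]]
        for sigma in W[1:]:
            prod = max_times(M[sigma], prod)   # later letters multiply on the left
        table[W] = prod
    return table

def parse(X, D):
    # Step (III): greedily partition X into consecutive good substrings,
    # preferring longer matches; assumes every single letter belongs to D.
    max_len = max(len(W) for W in D)
    blocks, i = [], 0
    while i < len(X):
        for l in range(min(max_len, len(X) - i), 0, -1):
            if X[i:i + l] in D:
                blocks.append(X[i:i + l])
                i += l
                break
    return blocks

def decode(X, M, D, k):
    # Step (IV): run VA on the parsed sequence X', one max-times step per block.
    table = encode(D, M)                         # Step (II)
    v = np.full(k, 1.0 / k)                      # v_0
    for W in parse(X, D):                        # Step (III)
        v = (table[W] * v[None, :]).max(axis=1)  # one step per block, not per letter
    return v.max()
```

The choice of the dictionary D and of the parsing is exactly what distinguishes the five implementations of section 4.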
The above framework introduces the challenge of how to select the set of good substrings (step
I) and how to efficiently compute their matrices (step II). In the next section we show how the RLE,
LZ78, SLP and BPE compression schemes can be applied to address this challenge, and how the above
framework can be utilized to exploit repetitions of all sufficiently small substrings (this is similar to
the Four Russians method). In practice, the choice of the appropriate compression scheme should be
made according to the nature of the observed sequences. For example, genomic sequences tend to
compress well with BPE [31] and binary images in facsimile or in optical character recognition [2–4,
9, 23, 26] are well compressed by RLE. LZ78 guarantees asymptotic compression for any sequence
and is useful in cases such as the CpG islands [7] identification problem in DNA sequences [13, 16],
where k is smaller than log n.
Another challenge is how to parse the sequence X (step III) in order to maximize acceleration.
We show that, surprisingly, this optimal parsing may differ from the initial parsing induced by the
selected compression scheme. To our knowledge, this feature was not applied by previous “acceleration
by compression” algorithms.
Throughout this paper we focus on computing path probabilities rather than the paths themselves.
The actual paths can be reconstructed in linear time as described in section 5.
4 Five Different Implementations of the General Framework
4.1 Acceleration via the Four Russians Method
The most naïve approach is probably to use all possible substrings of sufficiently small length ℓ as good substrings. This approach is quite similar to the Four Russians method [5], and leads to a Θ(log n) asymptotic speedup.
(I) Dictionary Selection: all possible strings of length ℓ over the alphabet Σ are good substrings.
(II) Encoding: for i = 2, ..., ℓ, compute M(W) for all strings W of length i by computing M(W′) ⊙ M(σ), where W = W′σ for some previously computed string W′ of length i - 1 and some letter σ ∈ Σ.
(III) Parsing: X′ is constructed by splitting the input X into blocks of length ℓ.
(IV) Propagation: run VA on X′, using the matrices M(W_i), as described in section 3 (a sketch follows below).
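A sketch (ours, under the same assumptions as before) of this variant: the table holds M(W) for every string of length at most ℓ, each matrix obtained from a one-letter-shorter string by a single max-times multiplication, and X is then processed in blocks of length ℓ.

```python
import itertools
import numpy as np

def max_times(A, B):
    # (A ⊙ B)[i, j] = max_m { A[i, m] * B[m, j] }
    return (A[:, :, None] * B[None, :, :]).max(axis=1)

def encode_all(alphabet, M, ell):
    # Step (II): M(W) for every string of length 1..ell, one multiplication each,
    # i.e., O(|Sigma|^ell) matrices and O(|Sigma|^ell * k^3) total encoding time.
    table = {sigma: M[sigma] for sigma in alphabet}
    for i in range(2, ell + 1):
        for W_prime in map(''.join, itertools.product(alphabet, repeat=i - 1)):
            for sigma in alphabet:
                # Extend W' by one letter emitted last; by eq. (6) the matrix of the
                # last emitted letter multiplies on the left of M(W').
                table[W_prime + sigma] = max_times(M[sigma], table[W_prime])
    return table

def decode_blocks(X, M, alphabet, ell, k):
    # Steps (III)-(IV): split X into blocks of length ell (the last block may be
    # shorter) and perform one max-times step per block.
    table = encode_all(alphabet, M, ell)
    v = np.full(k, 1.0 / k)
    for t in range(0, len(X), ell):
        v = (table[X[t:t + ell]] * v[None, :]).max(axis=1)
    return v.max()
```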

Frequently Asked Questions (22)
Q1. What are the contributions mentioned in the paper "Speeding Up HMM Decoding and Training by Exploiting Sequence Repetitions"?

The authors present a method to speed up the dynamic program algorithms used for solving the HMM decoding and training problems for discrete time-independent HMMs. The authors discuss the application of their method to Viterbi's decoding and training algorithms [33], as well as to the forward-backward and Baum-Welch [6] algorithms. Their approach is based on identifying repeated substrings in the observed input sequence. Initially, the authors show how to exploit repetitions of all sufficiently small substrings (this is similar to the Four Russians method). Then, the authors describe four algorithms based alternatively on run length encoding (RLE), Lempel-Ziv (LZ78) parsing, grammar-based compression (SLP), and byte pair encoding (BPE). Furthermore, unlike Viterbi's algorithm, their algorithms are highly parallelizable.

Other promising directions for future work include extending their results to higher order HMMs, HMMs with numerical observables [21], hierarchical HMMs, infinite alphabet size, and sparse transition matrices. 

The Baum-Welch training algorithm converges to a set of parameters that maximizes the likelihood to observe the given sequence P(X|θ), and is the most commonly used method for model training.

The max-times multiplication of two matrices can be done in parallel in O(1) time using k^4 processors, so the authors can calculate the product of n matrices in O(log n) time.

The encoding step takes O(2|Σ|^ℓ k^3) time, as the authors compute O(2|Σ|^ℓ) matrices and each matrix is computed in O(k^3) time by a single max-times multiplication.

Their approach for constructing X′ is to first parse X into all LZ78-words and then apply the following greedy parsing to each LZ78-word W: using the trie, find the longest good substring w′ ∈ D that is a prefix of W, place a parsing comma immediately after w′, and repeat the process for the remainder of W.

The benchmarks were performed on a single processor of a SunFire V880 server with 8 UltraSPARC-IV processors and 16GB main memory.

LZ78 parses the string X into substrings (LZ78-words) in a single pass over X. Each LZ78-word is composed of the longest LZ78-word previously seen plus a single letter.

The corresponding pre-processing term for encoding is O(|Σ′|k^3), where Σ′ denotes the set of character codes in the extended alphabet.

In this section byte pair encoding is utilized to accelerate the Viterbi decoding computations by a factor of Ω(r), where n′ is the number of characters in the BPE-compressed sequence X and r = n/n′ is the BPE compression ratio of the sequence. 

In LZ78-accelerated decoding, W_B is a single letter; when using RLE, W_A = W_B = σ^{|W|/2}; SLP consists just of production rules involving a single letter or exactly two non-terminals; and with BPE, W_A, W_B ∈ Σ′.

In this subsection the authors show that if an input sequence has a grammar representation with compression ratio r, then HMM decoding can be accelerated by a factor of Ω(r/k).

It is possible to compute the maximum in (2) in parallel for all states i, but the dependency of v_{t+1} on v_t makes it difficult to achieve a sublinear parallel algorithm.

The authors show how to use the LZ78 [34] parsing to find good substrings and how to use the incremental nature of the LZ78 parse to compute M(W) for a good substring W in O(k^3) time.

The propagation step takes O(nk^2/ℓ) time, resulting in an overall running time of O(2|Σ|^ℓ k^3 + nk^2/ℓ). Choosing ℓ = ½ log_|Σ|(n), the running time is O(2√n·k^3 + 2nk^2/log_|Σ|(n)).

In the propagation step (IV), given an input sequence of size n, compressed into its BPE-encoding of size n′ (e.g. n′ = 4 in the above example, where X = ABABCABCD and X′ = XYYD), the authors run VA using at most n′ matrices.

(I) Dictionary Selection: the good substrings are all the LZ78-words in the LZ78-parse of X. (II) Encoding: construct the matrices incrementally, according to their order in the LZ78-trie: M(Wσ) = M(W) ⊙ M^σ. (III) Parsing: X′ is the LZ78-parsing of X. (IV) Propagation: run VA on X′, as described in section 3.

In this section the authors obtain an Ω(r/log r) speedup for decoding an observed sequence with run-length compression ratio r. A string S is run-length encoded if it is described as an ordered sequence of pairs (σ, i), often denoted "σ^i".