
HAL Id: hal-00619974
https://hal-upec-upem.archives-ouvertes.fr/hal-00619974
Submitted on 26 Mar 2013
A trie-based approach for compacting automata
Maxime Crochemore, Chiara Epifanio, Roberto Grossi, Filippo Mignosi

To cite this version:
Maxime Crochemore, Chiara Epifanio, Roberto Grossi, Filippo Mignosi. A trie-based approach for compacting automata. Combinatorial Pattern Matching, 2004, Turkey. pp. 145-158, 10.1007/978-3-540-27801-6_11. hal-00619974

A Trie-Based Approach for Compacting Automata
Maxime Crochemore, Chiara Epifanio⋆⋆, Roberto Grossi⋆⋆⋆, and Filippo Mignosi
Abstract. We describe a new technique for reducing the number of nodes and symbols in automata based on tries. The technique stems from some results on anti-dictionaries for data compression and does not need to retain the input string, differently from other methods based on compact automata. The net effect is that of obtaining a lighter automaton than the directed acyclic word graph (DAWG) of Blumer et al., as it uses fewer nodes, still with arcs labeled by single characters.
Keywords: Automata and formal languages, suffix tree, factor and suffix au-
tomata, index, text compression.
1 Introduction
One of the seminal results in pattern matching is that the size of the minimal automaton accepting the suffixes of a word (DAWG) is linear [4]. This result is surprising, as the maximal number of subwords that may occur in a word is quadratic in the length of the word. Suffix trees are linear too, but they represent strings by pointers to the text, while DAWGs work without the need of accessing it.

DAWGs can be built in linear time. This result has stimulated further work. For example, [8] gives a compact version of the DAWG and a direct algorithm to construct it. An algorithm for the online construction of DAWGs is given in [13] and [14]. Space-efficient implementations of compact DAWGs are designed in [11] and [12]. For comparisons and results on this subject, see also [5].
In this paper we present a new compaction technique for shrinking automata based on antifactorial tries of words. In particular, we show how to apply our technique to factor automata and DAWGs by compacting their spanning tree obtained by a breadth-first search. The average number of nodes of the structure thus obtained can be sublinear in the number of symbols of the text, for highly compressible sources. This property seems new to us, and it is reinforced by the fact that the number of nodes of our automata is always smaller than that of DAWGs.

Institut Gaspard-Monge, Université de Marne-la-Vallée, France and King's College (London), Great Britain (mac@univ-mlv.fr).
⋆⋆ Dipartimento di Matematica e Applicazioni, Università di Palermo, Italy (epifanio@math.unipa.it).
⋆⋆⋆ Dipartimento di Informatica, Università di Pisa, Italy (grossi@di.unipi.it).
Dipartimento di Matematica e Applicazioni, Università di Palermo, Italy (mignosi@math.unipa.it).
We build up our finding on "self-compressing" tries of antifactorial binary sets of words. They were introduced in [7] for compressing binary strings with antidictionaries, with the aim of representing in a compact way antidictionaries to be sent to the decoder of a static compression scheme. We present an improved scheme for this algorithm that extends its functionality to any chosen alphabet for the antifactorial sets of words M. We employ it to represent compactly the automaton (or, better, the trim) A(M) defined in [6] for recognizing the language of all the words avoiding elements of M (we recall that a word w avoids x ∈ M if x does not appear in w as a factor).
Our scheme is general enough to be applied to any index structure having a failure function. One such example is that of (generalized) suffix tries, which are the uncompacted version of the well-known suffix trees. Unfortunately, their number of nodes is O(n²), and this is why researchers prefer to use the O(n)-node suffix tree. With our scheme we obtain compact suffix tries that have a linear number of nodes but are different from suffix trees. Although a compact suffix trie has slightly more nodes than the corresponding suffix tree, all of its arcs are labeled by single symbols rather than factors (substrings). Because of this we can completely drop the text, as searching does not need to access the text, contrary to what is required for the suffix tree. We exploit suffix links for this kind of searching. As a result, we obtain a family of automata that can be seen as an alternative to suffix trees and DAWGs.
This paper is organized as follows. Section 2 contains our generalization of some of the algorithms in [7] so as to make them work with any alphabet. Section 3 presents our data structure, the compact suffix trie, and its connection to automata. Section 4 contains our new searching algorithms for detecting a pattern in the compact tries and related automata. Finally, we present some open problems and further work on this subject in Section 5.
2 Compressing with Antidictionaries and Compact Tries
In this section we describe a non-trivial generalization of some of the algorithms in [7] to any alphabet A, in particular with the Encoder and Decoder algorithms described next. We recall that if w is a word over a finite alphabet A, the set of its factors is called F(w). For instance, if w = aeddebc, then F(w) = {ε, a, b, ..., aeddebc}.

Let us take some words in the complement of F(w), i.e., let us take some words that are not factors of w; we call these words forbidden. This set of such words AD is called an antidictionary for the language F(w). Antidictionaries can be finite as well as infinite. For instance, if w = aeddebc, the words aa, ddd, and ded are forbidden and the set {aa, ddd, ded} is an antidictionary for F(w). If w_1 = 001001001001, the infinite set of all words that have two 1's in the i-th and (i+2)-th positions, for some integer i, is an antidictionary for w_1.
We want to stress that an antidictionary can be any subset of the complement
of F (w). Therefore an antidictionary can be defined by any property concerning
words.
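A few lines of Python make the definition concrete (the helper name `factors` is our own choice): they check that the antidictionary {aa, ddd, ded} contains no factor of aeddebc.

```python
def factors(w):
    """Return F(w), the set of all factors (substrings) of w, including the empty word."""
    return {w[i:j] for i in range(len(w) + 1) for j in range(i, len(w) + 1)}

w = "aeddebc"
AD = {"aa", "ddd", "ded"}
# an antidictionary must be disjoint from F(w): none of its words occurs in w
assert all(x not in factors(w) for x in AD)
print(sorted(factors("ab")))  # → ['', 'a', 'ab', 'b']
```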
The compression algorithm in [7] treats the input word in an on-line manner. Let us suppose to have just read the word v, a proper prefix of w. If there exists a word u = u′a, with a ∈ {0, 1}, in the antidictionary AD such that u′ is a suffix of v, then surely the letter following v cannot be a, i.e., the next letter is b, with b ≠ a. In other words, we know in advance the next letter b, which turns out to be "redundant", or predictable. As remarked in [7], this argument works only in the case of binary alphabets.

We show how to generalize the above argument to any alphabet A, i.e., any cardinality of A. The main idea is that of eliminating redundant letters with the compression algorithm Encoder. In what follows the word to be compressed is denoted w = a_1 ··· a_n and its compressed version is denoted by γ(w).
Encoder (antidictionary AD, word w ∈ A∗)
1. v ← ε; γ ← ε;
2. for a ← first to last letter of w
3.   if there exists a letter b ∈ A, b ≠ a, such that for every suffix u′ of v, u′b ∉ AD then
4.     γ ← γ · a;
5.   v ← v · a;
6. return (|v|, γ);
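The pseudocode above can be transcribed into Python as a minimal sketch (the function name `encoder` and the way the alphabet is inferred from w and AD are our own choices):

```python
def encoder(ad, w):
    """Illustrative transcription of the Encoder pseudocode above.

    A letter a of w is copied to gamma only when it is not predictable,
    i.e. when some other letter b could also legally follow the current
    prefix v (no suffix u' of v makes u'b a forbidden word of AD).
    """
    alphabet = sorted(set(w) | {x[-1] for x in ad})  # our choice: infer A
    v, gamma = "", ""
    for a in w:
        suffixes = [v[i:] for i in range(len(v) + 1)]  # all suffixes u' of v
        if any(b != a and all(u + b not in ad for u in suffixes)
               for b in alphabet):
            gamma += a  # a is not forced by AD, so keep it
        v += a
    return len(v), gamma

AD = {"aa", "ab", "ac", "ad", "aeb", "ba", "bb", "bd", "be",
      "da", "db", "dc", "ddd", "ea", "ec", "ede", "ee"}
print(encoder(AD, "aeddebc"))  # → (7, 'ab')
```

Running it on the example below reproduces γ(aeddebc) = ab: only the first letter a and the letter b survive, all other letters being forced by the antidictionary.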
As an example, let us run this algorithm on the string w = aeddebc, with
AD = {aa, ab, ac, ad, aeb, ba, bb, bd, be, da, db, dc, ddd, ea, ec, ede, ee}.

The steps of the execution are described in the next array by the current values of the prefix v_i = a_1 ··· a_i of w that has been just considered and of the output γ(v_i). In the case of a positive answer to the query to the antidictionary AD, the array indicates the values of the corresponding forbidden words, too. The number of times the answer is positive in a run corresponds to the number of letters erased.

        ε          γ(ε)   = ε
v_1 = a            γ(v_1) = a
v_2 = ae           γ(v_2) = a     aa, ab, ac, ad ∈ AD
v_3 = aed          γ(v_3) = a     ea, ec, ee, aeb ∈ AD
v_4 = aedd         γ(v_4) = a     da, db, dc, ede ∈ AD
v_5 = aedde        γ(v_5) = a     da, db, dc, ddd ∈ AD
v_6 = aeddeb       γ(v_6) = ab
v_7 = aeddebc      γ(v_7) = ab    ba, bb, bd, be ∈ AD
Remark that γ is not injective. For instance, γ(aed) = γ(ae) = a. In order to have an injective mapping we consider the function γ′(w) = (|w|, γ(w)). In this case we can reconstruct the original word w from both γ′(w) and the antidictionary.

Remark 1. Instead of adding the length |w| of the whole word w, other choices are possible, such as adding the length |w′| of the last encoded fragment w′ of w. In the special case in which the last letter in w is not erased, we have that |w′| = 0 and it is not necessary to code this length. We will examine this case while examining the algorithm Decompact.
The decoding algorithm works as follows. The compressed word is γ(w) = b_1 ··· b_h and the length of w is n. The algorithm recovers the word w by predicting the letter following the current prefix v of w already decompressed. If there exists a unique letter a in the alphabet A such that for any suffix u′ of v, the concatenation u′a does not belong to the antidictionary, then the output letter is a. Otherwise we have to read the next letter from the input γ.
Decoder (antidictionary AD, word γ ∈ A∗, integer n)
1. v ← ε;
2. while |v| < n
3.   if there exists a unique letter a ∈ A such that for any suffix u′ of v, u′a does not belong to AD then
4.     v ← v · a;
5.   else
6.     b ← next letter of γ;
7.     v ← v · b;
8. return (v);
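Under the same assumptions as the Encoder sketch, the Decoder can be transcribed as follows (here the alphabet A must be supplied explicitly, since it cannot be inferred from γ alone):

```python
def decoder(ad, gamma, n, alphabet):
    """Illustrative transcription of the Decoder pseudocode above.

    Rebuilds w letter by letter: a letter is reconstructed for free when
    it is the unique allowed continuation of the current prefix v,
    otherwise it is read from the compressed word gamma.
    """
    v = ""
    remaining = iter(gamma)
    while len(v) < n:
        suffixes = [v[i:] for i in range(len(v) + 1)]  # all suffixes u' of v
        allowed = [a for a in alphabet
                   if all(u + a not in ad for u in suffixes)]
        if len(allowed) == 1:
            v += allowed[0]       # the next letter is forced by AD
        else:
            v += next(remaining)  # ambiguity: read the next letter of gamma
    return v

AD = {"aa", "ab", "ac", "ad", "aeb", "ba", "bb", "bd", "be",
      "da", "db", "dc", "ddd", "ea", "ec", "ede", "ee"}
print(decoder(AD, "ab", 7, "abcde"))  # → aeddebc
```

On the running example, decoding γ = ab together with the length n = 7 indeed recovers aeddebc, illustrating that γ′(w) = (|w|, γ(w)) is invertible given AD.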
The antidictionary AD must be structured in order to answer, for a given word v, whether there exist |A| − 1 words u = u′b in AD, with b ∈ A and b ≠ a, such that u′ is a suffix of v. In case of a positive answer the output should also include the letter a.
Languages avoiding finite sets of words are called local, and automata recognizing them are ubiquitously present in Computer Science (cf. [2]).

Given an antidictionary AD, the algorithm in [6], called L-automaton, takes as input the trie T that represents AD, and gives as output an automaton recognizing the language L(AD) of all words avoiding the antidictionary. This automaton has the same states as those in trie T, and the set of labeled edges of this automaton properly includes the one of the trie. The transition function of automaton A(AD) is called δ. This automaton is complete, i.e., for any letter a and for any state v, the value of δ(v, a) is defined.
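Completing the trie of AD into such an automaton can be sketched in the style of the Aho-Corasick construction (this is our own illustrative sketch, not the exact L-automaton algorithm of [6]; the names `l_automaton` and `avoids` are ours, and states that spell a word of AD are collected into a rejecting sink set):

```python
from collections import deque

def l_automaton(ad, alphabet):
    """Build a complete transition function delta from the trie of AD,
    Aho-Corasick style.  Returns delta and the set of 'sink' states
    entered as soon as a forbidden word has been read."""
    goto, sink = [{}], set()
    for word in ad:                      # build the trie of AD; state 0 is the root
        s = 0
        for c in word:
            if c not in goto[s]:
                goto.append({})
                goto[s][c] = len(goto) - 1
            s = goto[s][c]
        sink.add(s)                      # this state spells a word of AD
    fail = [0] * len(goto)               # failure links (longest proper suffix)
    delta = [dict() for _ in goto]
    queue = deque()
    for c in alphabet:                   # depth-1 states fail to the root
        if c in goto[0]:
            delta[0][c] = goto[0][c]
            queue.append(goto[0][c])
        else:
            delta[0][c] = 0
    while queue:                         # complete delta by breadth-first search
        s = queue.popleft()
        if fail[s] in sink:              # a suffix of s spells a forbidden word
            sink.add(s)
        for c in alphabet:
            if c in goto[s]:
                t = goto[s][c]
                fail[t] = delta[fail[s]][c]
                delta[s][c] = t
                queue.append(t)
            else:
                delta[s][c] = delta[fail[s]][c]
    return delta, sink

def avoids(delta, sink, w):
    """A word belongs to L(AD) iff its run never enters a sink state."""
    s = 0
    for c in w:
        s = delta[s][c]
        if s in sink:
            return False
    return True

delta, sink = l_automaton({"aa", "ddd", "ded"}, "abcde")
print(avoids(delta, sink, "aeddebc"))  # → True
print(avoids(delta, sink, "addda"))    # → False (contains ddd)
```

The breadth-first completion is also what makes the spanning tree mentioned below well defined: every state is first reached along a shortest path from the root.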
If AD is the set of the minimal forbidden words of a text t, then it is proved in [6] that the trimmed version of automaton A(AD) is the factor automaton of t. If the last letter in t is a letter $ that does not appear elsewhere in the text, the factor automaton coincides with the DAWG, apart from the set of final states. In fact, while in the factor automaton every state is final, in the DAWG the only final state is the last one in every topological order. Therefore, if we have a technique for shrinking automata of the form A(AD), for some antidictionary AD, this technique will automatically hold for the DAWG, by appending at the end of the text a symbol $ that does not appear elsewhere. Actually the trie that we compact is the spanning tree obtained by a breadth-first search of the

References

- A. V. Aho, M. J. Corasick. Efficient string matching: an aid to bibliographic search.
- M. Lothaire. Algebraic Combinatorics on Words.
- M. Lothaire. Applied Combinatorics on Words (Encyclopedia of Mathematics and its Applications).
- A. Blumer et al. The smallest automaton recognizing the subwords of a text.