
HAL Id: hal-00619974
https://hal-upec-upem.archives-ouvertes.fr/hal-00619974
Submitted on 26 Mar 2013
A trie-based approach for compacting automata
Maxime Crochemore, Chiara Epifanio, Roberto Grossi, Filippo Mignosi

To cite this version:
Maxime Crochemore, Chiara Epifanio, Roberto Grossi, Filippo Mignosi. A trie-based approach for compacting automata. Combinatorial Pattern Matching, 2004, Turkey. pp. 145-158, 10.1007/978-3-540-27801-6_11. hal-00619974

A Trie-Based Approach for Compacting Automata
Maxime Crochemore, Chiara Epifanio⋆⋆, Roberto Grossi⋆⋆⋆, and Filippo Mignosi
Abstract. We describe a new technique for reducing the number of nodes and symbols in automata based on tries. The technique stems from some results on anti-dictionaries for data compression and does not need to retain the input string, differently from other methods based on compact automata. The net effect is that of obtaining a lighter automaton than the directed acyclic word graph (DAWG) of Blumer et al., as it uses fewer nodes, still with arcs labeled by single characters.
Keywords: Automata and formal languages, suffix tree, factor and suffix au-
tomata, index, text compression.
1 Introduction
One of the seminal results in pattern matching is that the size of the minimal automaton accepting the suffixes of a word (DAWG) is linear [4]. This result is surprising, as the maximal number of subwords that may occur in a word is quadratic in the length of the word. Suffix trees are linear too, but they represent strings by pointers to the text, while DAWGs work without the need of accessing it.

DAWGs can be built in linear time. This result has stimulated further work. For example, [8] gives a compact version of the DAWG and a direct algorithm to construct it. An algorithm for the online construction of DAWGs is given in [13] and [14]. Space-efficient implementations of compact DAWGs are designed in [11] and [12]. For comparisons and results on this subject, see also [5].
In this paper we present a new compaction technique for shrinking automata based on antifactorial tries of words. In particular, we show how to apply our technique to factor automata and DAWGs by compacting their spanning tree obtained by a breadth-first search. The average number of nodes of the structure thus obtained can be sublinear in the number of symbols of the text, for highly compressible sources. This property seems new to us, and it is reinforced by the fact that the number of nodes of our automata is always smaller than that of DAWGs.

Institut Gaspard-Monge, Université de Marne-la-Vallée, France and King's College (London), Great Britain (mac@univ-mlv.fr).
⋆⋆ Dipartimento di Matematica e Applicazioni, Università di Palermo, Italy (epifanio@math.unipa.it).
⋆⋆⋆ Dipartimento di Informatica, Università di Pisa, Italy (grossi@di.unipi.it).
Dipartimento di Matematica e Applicazioni, Università di Palermo, Italy (mignosi@math.unipa.it).
We build up our finding on "self-compressing" tries of antifactorial binary sets of words. They were introduced in [7] for compressing binary strings with antidictionaries, with the aim of representing in a compact way antidictionaries to be sent to the decoder of a static compression scheme. We present an improved scheme for this algorithm that extends its functionality to any chosen alphabet for the antifactorial sets of words M. We employ it to represent compactly the automaton (or, better, the trim) A(M) defined in [6] for recognizing the language of all the words avoiding elements of M (we recall that a word w avoids x ∈ M if x does not appear in w as a factor).
Our scheme is general enough to be applied to any index structure having a failure function. One such example is that of (generalized) suffix tries, which are the uncompacted version of the well-known suffix trees. Unfortunately, their number of nodes is O(n²), and this is why researchers prefer to use the O(n)-node suffix tree. With our scheme we obtain compact suffix tries that have a linear number of nodes but are different from suffix trees. Although a compact suffix trie has slightly more nodes than the corresponding suffix tree, all of its arcs are labeled by single symbols rather than factors (substrings). Because of this we can completely drop the text, as searching does not need to access the text, contrary to what is required for the suffix tree. We exploit suffix links for this kind of searching. As a result, we obtain a family of automata that can be seen as an alternative to suffix trees and DAWGs.
This paper is organized as follows. Section 2 contains our generalization of some of the algorithms in [7] so as to make them work with any alphabet. Section 3 presents our data structure, the compact suffix trie, and its connection to automata. Section 4 contains our new searching algorithms for detecting a pattern in the compact tries and related automata. Finally, we present some open problems and further work on this subject in Section 5.
2 Compressing with Antidictionaries and Compact Tries
In this section we describe a non-trivial generalization of some of the algorithms in [7] to any alphabet A, in particular with the Encoder and Decoder algorithms described next. We recall that if w is a word over a finite alphabet A, the set of its factors is called F(w). For instance, if w = aeddebc, then F(w) = {ε, a, b, ..., aeddebc}.

Let us take some words in the complement of F(w), i.e., let us take some words that are not factors of w; we call these words forbidden. This set of such words AD is called an antidictionary for the language F(w). Antidictionaries can be finite as well as infinite. For instance, if w = aeddebc, the words aa, ddd, and ded are forbidden and the set {aa, ddd, ded} is an antidictionary for F(w). If w_1 = 001001001001, the infinite set of all words that have two 1's in the i-th and (i+2)-th positions, for some integer i, is an antidictionary for w_1.
We want to stress that an antidictionary can be any subset of the complement
of F (w). Therefore an antidictionary can be defined by any property concerning
words.
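A few lines of Python make the definition concrete (the helper name `factors` is our own choice): they check that the antidictionary {aa, ddd, ded} contains no factor of aeddebc.

```python
def factors(w):
    """Return F(w), the set of all factors (substrings) of w, including the empty word."""
    return {w[i:j] for i in range(len(w) + 1) for j in range(i, len(w) + 1)}

w = "aeddebc"
AD = {"aa", "ddd", "ded"}
# an antidictionary must be disjoint from F(w): none of its words occurs in w
assert all(x not in factors(w) for x in AD)
print(sorted(factors("ab")))  # → ['', 'a', 'ab', 'b']
```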
The compression algorithm in [7] treats the input word in an on-line manner. Let us suppose to have just read the word v, a proper prefix of w. If there exists a word u = u′a, with a ∈ {0, 1}, in the antidictionary AD such that u′ is a suffix of v, then surely the letter following v cannot be a, i.e., the next letter is b, with b ≠ a. In other words, we know in advance the next letter b, which turns out to be "redundant", or predictable. As remarked in [7], this argument works only in the case of binary alphabets.

We show how to generalize the above argument to any alphabet A, i.e., any cardinality of A. The main idea is that of eliminating redundant letters with the compression algorithm Encoder. In what follows the word to be compressed is denoted w = a_1 ··· a_n and its compressed version is denoted by γ(w).
Encoder (antidictionary AD, word w ∈ A∗)
1. v ← ε; γ ← ε;
2. for a ← first to last letter of w
3.   if there exists a letter b ∈ A, b ≠ a, such that for every suffix u′ of v, u′b ∉ AD then
4.     γ ← γ · a;
5.   v ← v · a;
6. return (|v|, γ);
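The pseudocode above can be transcribed into Python as a minimal sketch (the function name `encoder` and the way the alphabet is inferred from w and AD are our own choices):

```python
def encoder(ad, w):
    """Illustrative transcription of the Encoder pseudocode above.

    A letter a of w is copied to gamma only when it is not predictable,
    i.e. when some other letter b could also legally follow the current
    prefix v (no suffix u' of v makes u'b a forbidden word of AD).
    """
    alphabet = sorted(set(w) | {x[-1] for x in ad})  # our choice: infer A
    v, gamma = "", ""
    for a in w:
        suffixes = [v[i:] for i in range(len(v) + 1)]  # all suffixes u' of v
        if any(b != a and all(u + b not in ad for u in suffixes)
               for b in alphabet):
            gamma += a  # a is not forced by AD, so keep it
        v += a
    return len(v), gamma

AD = {"aa", "ab", "ac", "ad", "aeb", "ba", "bb", "bd", "be",
      "da", "db", "dc", "ddd", "ea", "ec", "ede", "ee"}
print(encoder(AD, "aeddebc"))  # → (7, 'ab')
```

Running it on the example below reproduces γ(aeddebc) = ab: only the first letter a and the letter b survive, all other letters being forced by the antidictionary.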
As an example, let us run this algorithm on the string w = aeddebc, with
AD = {aa, ab, ac, ad, aeb, ba, bb, bd, be, da, db, dc, ddd, ea, ec, ede, ee}.

The steps of the execution are described in the next array by the current values of the prefix v_i = a_1 ··· a_i of w that has been just considered and of the output γ(v_i). In the case of a positive answer to the query to the antidictionary AD, the array indicates the values of the corresponding forbidden words, too. The number of times the answer is positive in a run corresponds to the number of letters erased.

        ε          γ(ε)   = ε
v_1 = a            γ(v_1) = a
v_2 = ae           γ(v_2) = a     aa, ab, ac, ad ∈ AD
v_3 = aed          γ(v_3) = a     ea, ec, ee, aeb ∈ AD
v_4 = aedd         γ(v_4) = a     da, db, dc, ede ∈ AD
v_5 = aedde        γ(v_5) = a     da, db, dc, ddd ∈ AD
v_6 = aeddeb       γ(v_6) = ab
v_7 = aeddebc      γ(v_7) = ab    ba, bb, bd, be ∈ AD
Remark that γ is not injective. For instance, γ(aed) = γ(ae) = a. In order to have an injective mapping we consider the function γ′(w) = (|w|, γ(w)). In this case we can reconstruct the original word w from both γ′(w) and the antidictionary.

Remark 1. Instead of adding the length |w| of the whole word w, other choices are possible, such as adding the length |w′| of the last encoded fragment w′ of w. In the special case in which the last letter in w is not erased, we have that |w′| = 0 and it is not necessary to code this length. We will examine this case while examining the algorithm Decompact.
The decoding algorithm works as follows. The compressed word is γ(w) = b_1 ··· b_h and the length of w is n. The algorithm recovers the word w by predicting the letter following the current prefix v of w already decompressed. If there exists a unique letter a in the alphabet A such that for any suffix u′ of v, the concatenation u′a does not belong to the antidictionary, then the output letter is a. Otherwise we have to read the next letter from the input γ.
Decoder (antidictionary AD, word γ ∈ A∗, integer n)
1. v ← ε;
2. while |v| < n
3.   if there exists a unique letter a ∈ A such that for any suffix u′ of v, u′a does not belong to AD then
4.     v ← v · a;
5.   else
6.     b ← next letter of γ;
7.     v ← v · b;
8. return (v);
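Under the same assumptions as the Encoder sketch, the Decoder can be transcribed as follows (here the alphabet A must be supplied explicitly, since it cannot be inferred from γ alone):

```python
def decoder(ad, gamma, n, alphabet):
    """Illustrative transcription of the Decoder pseudocode above.

    Rebuilds w letter by letter: a letter is reconstructed for free when
    it is the unique allowed continuation of the current prefix v,
    otherwise it is read from the compressed word gamma.
    """
    v = ""
    remaining = iter(gamma)
    while len(v) < n:
        suffixes = [v[i:] for i in range(len(v) + 1)]  # all suffixes u' of v
        allowed = [a for a in alphabet
                   if all(u + a not in ad for u in suffixes)]
        if len(allowed) == 1:
            v += allowed[0]       # the next letter is forced by AD
        else:
            v += next(remaining)  # ambiguity: read the next letter of gamma
    return v

AD = {"aa", "ab", "ac", "ad", "aeb", "ba", "bb", "bd", "be",
      "da", "db", "dc", "ddd", "ea", "ec", "ede", "ee"}
print(decoder(AD, "ab", 7, "abcde"))  # → aeddebc
```

On the running example, decoding γ = ab together with the length n = 7 indeed recovers aeddebc, illustrating that γ′(w) = (|w|, γ(w)) is invertible given AD.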
The antidictionary AD must be structured in order to answer, for a given word v, whether there exist |A| − 1 words u = u′b in AD, with b ∈ A and b ≠ a, such that u′ is a suffix of v. In case of a positive answer the output should also include the letter a.
Languages avoiding finite sets of words are called local, and automata recognizing them are ubiquitously present in Computer Science (cf. [2]).

Given an antidictionary AD, the algorithm in [6], called L-automaton, takes as input the trie T that represents AD, and gives as output an automaton recognizing the language L(AD) of all words avoiding the antidictionary. This automaton has the same states as those in trie T, and the set of labeled edges of this automaton properly includes the one of the trie. The transition function of automaton A(AD) is called δ. This automaton is complete, i.e., for any letter a and for any state v, the value of δ(v, a) is defined.
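Completing the trie of AD into such an automaton can be sketched in the style of the Aho-Corasick construction (this is our own illustrative sketch, not the exact L-automaton algorithm of [6]; the names `l_automaton` and `avoids` are ours, and states that spell a word of AD are collected into a rejecting sink set):

```python
from collections import deque

def l_automaton(ad, alphabet):
    """Build a complete transition function delta from the trie of AD,
    Aho-Corasick style.  Returns delta and the set of 'sink' states
    entered as soon as a forbidden word has been read."""
    goto, sink = [{}], set()
    for word in ad:                      # build the trie of AD; state 0 is the root
        s = 0
        for c in word:
            if c not in goto[s]:
                goto.append({})
                goto[s][c] = len(goto) - 1
            s = goto[s][c]
        sink.add(s)                      # this state spells a word of AD
    fail = [0] * len(goto)               # failure links (longest proper suffix)
    delta = [dict() for _ in goto]
    queue = deque()
    for c in alphabet:                   # depth-1 states fail to the root
        if c in goto[0]:
            delta[0][c] = goto[0][c]
            queue.append(goto[0][c])
        else:
            delta[0][c] = 0
    while queue:                         # complete delta by breadth-first search
        s = queue.popleft()
        if fail[s] in sink:              # a suffix of s spells a forbidden word
            sink.add(s)
        for c in alphabet:
            if c in goto[s]:
                t = goto[s][c]
                fail[t] = delta[fail[s]][c]
                delta[s][c] = t
                queue.append(t)
            else:
                delta[s][c] = delta[fail[s]][c]
    return delta, sink

def avoids(delta, sink, w):
    """A word belongs to L(AD) iff its run never enters a sink state."""
    s = 0
    for c in w:
        s = delta[s][c]
        if s in sink:
            return False
    return True

delta, sink = l_automaton({"aa", "ddd", "ded"}, "abcde")
print(avoids(delta, sink, "aeddebc"))  # → True
print(avoids(delta, sink, "addda"))    # → False (contains ddd)
```

The breadth-first completion is also what makes the spanning tree mentioned below well defined: every state is first reached along a shortest path from the root.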
If AD is the set of the minimal forbidden words of a text t, then it is proved in [6] that the trimmed version of automaton A(AD) is the factor automaton of t. If the last letter in t is a letter $ that does not appear elsewhere in the text, the factor automaton coincides with the DAWG, apart from the set of final states. In fact, while in the factor automaton every state is final, in the DAWG the only final state is the last one in every topological order. Therefore, if we have a technique for shrinking automata of the form A(AD), for some antidictionary AD, this technique will automatically hold for the DAWG, by appending at the end of the text a symbol $ that does not appear elsewhere. Actually the trie that we compact is the spanning tree obtained by a breadth-first search of the

References

- A. V. Aho, M. J. Corasick. Efficient string matching: an aid to bibliographic search.
- M. Lothaire. Algebraic Combinatorics on Words.
- M. Lothaire. Applied Combinatorics on Words (Encyclopedia of Mathematics and its Applications).
- A. Blumer et al. The smallest automaton recognizing the subwords of a text.