Adding nesting structure to words

doi:10.1145/1516512.1516518

Adding Nesting Structure to Words∗

Rajeev Alur

University of Pennsylvania

alur@cis.upenn.edu

P. Madhusudan

University of Illinois, Urbana-Champaign

madhu@cs.uiuc.edu

Abstract

We propose the model of nested words for representation of data with both a linear ordering and

a hierarchically nested matching of items. Examples of data with such dual linear-hierarchical struc-

ture include executions of structured programs, annotated linguistic data, and HTML/XML documents.

Nested words generalize both words and ordered trees, and allow both word and tree operations. We

deﬁne nested word automata—ﬁnite-state acceptors for nested words, and show that the resulting class

of regular languages of nested words has all the appealing theoretical properties that the classical regular

word languages enjoy: deterministic nested word automata are as expressive as their nondeterministic

counterparts; the class is closed under union, intersection, complementation, concatenation, Kleene-*,

preﬁxes, and language homomorphisms; membership, emptiness, language inclusion, and language equiv-

alence are all decidable; and deﬁnability in monadic second order logic corresponds exactly to ﬁnite-state

recognizability. We also consider regular languages of inﬁnite nested words and show that the closure

properties, MSO-characterization, and decidability of decision problems carry over.

The linear encodings of nested words give the class of visibly pushdown languages of words, and

this class lies between balanced languages and deterministic context-free languages. We argue that for

algorithmic veriﬁcation of structured programs, instead of viewing the program as a context-free language

over words, one should view it as a regular language of nested words (or equivalently, a visibly pushdown

language), and this would allow model checking of many properties (such as stack inspection, pre-post

conditions) that are not expressible in existing speciﬁcation logics.

We also study the relationship between ordered trees and nested words, and the corresponding au-

tomata: while the analysis complexity of nested word automata is the same as that of classical tree

automata, they combine both bottom-up and top-down traversals, and enjoy expressiveness and suc-

cinctness beneﬁts over tree automata.

1 Introduction

Linearly structured data is usually modeled as words, and queried using word automata and related speciﬁca-

tion languages such as regular expressions. Hierarchically structured data is naturally modeled as (unordered)

trees, and queried using tree automata. In many applications including executions of structured programs,

annotated linguistic data, and primary/secondary bonds in genomic sequences, the data has both a natural

linear sequencing of positions and a hierarchically nested matching of positions. For example, in natural

language processing, the sentence is a linear sequence of words, and parsing into syntactic categories imparts

the hierarchical structure. Sometimes, even though the only logical structure on data is hierarchical, linear

sequencing is added either for storage or for stream processing. For example, in SAX representation of XML

data, the document is a linear sequence of text characters, along with a hierarchically nested matching of

open-tags with closing tags.

∗ This paper unifies and extends results that have appeared in conference papers [AM04], [AM06], and [Alu07].

In this paper, we propose the model of nested words for representing and querying data with dual linear-

hierarchical structure. A nested word consists of a sequence of linearly ordered positions, augmented with

nesting edges connecting calls to returns (or open-tags to close-tags). The edges do not cross, creating a

properly nested hierarchical structure, and we allow some of the edges to be pending. This nesting structure

can be uniquely represented by a sequence specifying the types of positions (calls, returns, and internals).

Words are nested words where all positions are internals. Ordered trees can be interpreted as nested words

using the following traversal: to process an a-labeled node, ﬁrst print an a-labeled call, process all the

children in order, and print an a-labeled return. Note that this is a combination of top-down and bottom-

up traversals, and each node is processed twice. Binary trees, ranked trees, unranked trees, hedges, and

documents that do not parse correctly, all can be represented with equal ease. Word operations such as

preﬁxes, suﬃxes, concatenation, reversal, as well as tree operations referring to the hierarchical structure,

can be deﬁned naturally on nested words.
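The traversal that turns an ordered tree into a nested word can be sketched in a few lines of Python. The `(label, children)` tuple representation and the `"call"` / `"ret"` tags are illustrative assumptions, not notation from this paper.

```python
def tree_to_nested_word(tree):
    """Emit an a-labeled call on entering a node and an a-labeled return on leaving it."""
    label, children = tree              # hypothetical representation: (label, [subtrees])
    out = [("call", label)]
    for child in children:              # children are processed in order
        out.extend(tree_to_nested_word(child))
    out.append(("ret", label))
    return out

example = ("a", [("b", []), ("c", [("d", [])])])
print(tree_to_nested_word(example))
```

Each node contributes exactly two positions, which reflects the remark that every node is processed twice.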

We deﬁne and study ﬁnite-state automata as acceptors of nested words. A nested word automaton

(NWA) is similar to a classical ﬁnite-state word automaton, and reads the input from left to right according

to the linear sequence. At a call, it can propagate states along both linear and nesting outgoing edges,

and at a return, the new state is determined based on states labeling both the linear and nesting incoming

edges. The resulting class of regular languages of nested words has all the appealing theoretical properties

that the regular languages of words and trees enjoy. In particular, we show that deterministic nested word

automata are as expressive as their nondeterministic counterparts. Given a nondeterministic automaton

A with s states, the determinization involves subsets of pairs of states (as opposed to subsets of states

for word automata), leading to a deterministic automaton with $2^{s^2}$ states, and we show this bound to

be tight. The class is closed under all Boolean operations (union, intersection, and complement), and a

variety of word operations such as concatenation, Kleene-∗, and preﬁx-closure. The class is also closed under

nesting-respecting language homomorphisms, which can model tree operations. Decision problems such as

membership, emptiness, language inclusion, and language equivalence are all decidable. We also establish

that the notion of regularity coincides with the deﬁnability in the monadic second order logic (MSO) of

nested words (MSO of nested words has unary predicates over positions, ﬁrst and second order quantiﬁers,

linear successor relation, and the nesting relation).
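To make the informal description of a run concrete, the following Python sketch executes a deterministic NWA over the linear encoding of a nested word. The formal definition appears only in Section 3, so the interface here is an assumption: `on_call` returns a pair (linear state, state sent along the nesting edge), `on_return` reads both incoming states, and a pending return reads a fixed initial value `p0`. The list `open_edges` merely records the states travelling along open nesting edges.

```python
def run_nwa(word, q0, p0, on_call, on_internal, on_return, is_accepting):
    """Run a deterministic NWA over (kind, label) pairs, kind in {"call", "int", "ret"}."""
    state = q0
    open_edges = []                      # states sent along nesting edges still open
    for kind, label in word:
        if kind == "call":
            # One state continues along the linear edge, one along the nesting edge.
            state, hier = on_call(state, label)
            open_edges.append(hier)
        elif kind == "ret":
            # The new state depends on both the linear and the nesting incoming edge.
            hier = open_edges.pop() if open_edges else p0   # p0 for a pending return
            state = on_return(state, hier, label)
        else:
            state = on_internal(state, label)
    return is_accepting(state)

# Hypothetical example property: every return carries the label of its matching call.
def on_call(q, a):
    return q, a

def on_internal(q, a):
    return q

def on_return(q, hier, a):
    return q if hier == a else "bad"

def tags_match(word):
    return run_nwa(word, "ok", None, on_call, on_internal, on_return,
                   lambda q: q == "ok")

print(tags_match([("call", "a"), ("int", "x"), ("ret", "a")]))
```

In this example a pending return is rejected because `p0` is `None`, while pending calls are tolerated; both are choices of the example, not of the model.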

The motivating application area for our results has been software veriﬁcation. Pushdown automata nat-

urally model the control ﬂow of sequential computation in typical programming languages with nested, and

potentially recursive, invocations of program modules such as procedures and method calls. Consequently,

a variety of program analysis, compiler optimization, and model checking questions can be formulated as

decision problems for pushdown automata. For instance, in contemporary software model checking tools,

to verify whether a program P (written in C, for instance) satisﬁes a regular correctness requirement ϕ

(written in linear temporal logic LTL, for instance), the veriﬁer ﬁrst abstracts the program into a pushdown

model $P^a$ with finite-state control, compiles the negation of the specification into a finite-state automaton

$A_{\neg\varphi}$ that accepts all computations that violate ϕ, and algorithmically checks that the intersection of the

languages of $P^a$ and $A_{\neg\varphi}$ is empty. The problem of checking regular requirements of pushdown models has

been extensively studied in recent years leading to eﬃcient implementations and applications to program

analysis [RHS95, BEM97, BR00, ABE+05, HJM+02, EKS03, CW02]. While many analysis problems such as

identifying dead code and accesses to uninitialized variables can be captured as regular requirements, many

others require inspection of the stack or matching of calls and returns, and are context-free. Even though

the general problem of checking context-free properties of pushdown automata is undecidable, algorithmic

solutions have been proposed for checking many diﬀerent kinds of non-regular properties. For example,

access control requirements such as “a module A should be invoked only if the module B belongs to the

call-stack,” and bounds on stack size such as “if the number of interrupt-handlers in the call-stack currently

is less than 5, then a property p holds” require inspection of the stack, and decision procedures for certain

classes of stack properties already exist [JMT99, CW02, EKS03, CMM+04]. A separate class of non-regular,

but decidable, properties includes the temporal logic Caret that allows matching of calls and returns and

can express the classical correctness requirements of program modules with pre and post conditions, such

as “if p holds when a module is invoked, the module must return, and q holds upon return” [AEM04]. This

suggests that the answer to the question “which class of properties are algorithmically checkable against

pushdown models?” should be more general than “regular word languages.” Our results suggest that the

answer lies in viewing the program as a generator of nested words. The key feature of checkable requirements,

such as stack inspection and matching calls and returns, is that the stacks in the model and the property are

correlated: while the stacks are not identical, the two synchronize on when to push and when to pop, and

are always of the same depth. This can be best captured by modeling the execution of a program P as a

nested word with nesting edges from calls to returns. Speciﬁcation of the program is given as a nested word

automaton A (or written as a formula ϕ in one of the new temporal logics for nested words), and veriﬁcation

corresponds to checking whether every nested word generated by P is accepted by A. If P is abstracted

into a model $P^a$ with only boolean variables, then it can be interpreted as an NWA, and verification can

be solved using decision procedures for NWAs. Nested-word automata can express a variety of requirements

such as stack-inspection properties, pre-post conditions, and interprocedural data-ﬂow properties. More

broadly, modeling structured programs and program speciﬁcations as languages of nested words generalizes

the linear-time semantics that allows integration of Pnueli-style temporal reasoning [Pnu77] and Hoare-style

structured reasoning [Hoa69]. We believe that the nested-word view will provide a unifying basis for the

next generation of speciﬁcation logics for program analysis, software veriﬁcation, and runtime monitoring.

Given a language L of nested words over Σ, the linear encoding of nested words gives a language $\hat{L}$ over

the tagged alphabet consisting of symbols tagged with the type of the position. If L is a regular language of

nested words, then $\hat{L}$ is context-free. In fact, the pushdown automata accepting $\hat{L}$ have a special structure:

while reading a call, the automaton must push one symbol, while reading a return symbol, it must pop one

symbol (if the stack is non-empty), and while reading an internal symbol, it can only update its control

state. We call such automata visibly pushdown automata and the class of word languages they accept visibly

pushdown languages (VPL). Since our automata can be determinized, VPLs correspond to a subclass of

deterministic context-free languages (DCFL). We give a grammar-based characterization of VPLs, which

helps in understanding VPLs as a generalization of parenthesis languages, bracketed languages, and

balanced languages [McN67, GH67, BB02]. Note that VPLs have better closure properties than CFLs,

DCFLs, or parenthesis languages: CFLs are not closed under intersection and complement, DCFLs are not

closed under union, intersection, and concatenation, and balanced languages are not closed under complement

and preﬁx-closure.
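The visibly pushdown discipline can be illustrated on a small language that is not regular as a word language: n calls followed by n returns. The sketch below is a hand-written recognizer, not a general VPA construction; tagged symbols are assumed to be strings with a leading `<` for calls and a trailing `>` for returns, and early `return False` stands in for moving to a rejecting sink state.

```python
def accepts_calls_then_returns(tagged):
    """Accept n calls followed by n returns (n >= 0), with no internals."""
    stack = []
    seen_return = False                  # finite control: which phase we are in
    for sym in tagged:
        if sym.startswith("<"):          # call: push exactly one symbol
            if seen_return:
                return False
            stack.append("X")
        elif sym.endswith(">"):          # return: pop one symbol
            if not stack:
                return False             # a pending return is not in this language
            stack.pop()
            seen_return = True
        else:                            # internal: stack is untouched
            return False
    return not stack                     # no pending calls may remain

print(accepts_calls_then_returns(["<a", "<a", "a>", "a>"]))
```

Note that the input symbol alone decides whether the stack is pushed, popped, or left alone; the control state never makes that choice.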

Data with dual linear-hierarchical structure is traditionally modeled using binary, and more generally,

using ordered unranked, trees, and queried using tree automata (see [Nev02, Lib05, Sch07] for recent surveys

on applications of unranked trees and tree automata to XML processing). In ordered trees, nodes with

the same parent are linearly ordered, and the classical tree traversals such as infix (or depth-first left-to-

right) can be used to define an implicit ordering of all nodes. It turns out that hedges, where a hedge is

a sequence of ordered trees, are a special class of nested words, namely, the ones corresponding to Dyck

words, and regular hedge languages correspond to balanced languages. For document processing, nested

words do have many advantages over ordered trees as trees lack an explicit ordering of all nodes. First, tree-based

representation implicitly assumes that the input linear data can be parsed into a tree, and thus, one cannot

represent and process data that may not parse correctly. Word operations such as preﬁxes, suﬃxes, and

concatenation, while natural for document processing, do not have analogous tree operations. Second, tree

automata can naturally express constraints on the sequence of labels along a hierarchical path, and also

along the left-to-right siblings, but they have difficulty capturing constraints that refer to the global linear

order. For example, the query that patterns $p_1, \ldots, p_k$ appear in the document in that order (that is, the

regular expression $\Sigma^* p_1 \Sigma^* \cdots p_k \Sigma^*$ over the linear order) compiles into a deterministic word automaton (and

hence deterministic NWA) of linear size, but a standard deterministic bottom-up tree automaton for this query

must be of size exponential in k. In fact, NWAs can be viewed as a kind of tree automata such that both

bottom-up tree automata and top-down tree automata are special cases.
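The linear-size deterministic word automaton for the ordered-pattern query is easy to see in the simplified case where each pattern is a single symbol (an assumption made only for this sketch). The automaton state is the number of patterns matched so far, giving k + 1 states.

```python
def patterns_in_order(word, patterns):
    """Deterministic automaton for Sigma* p_1 Sigma* ... p_k Sigma* with single-symbol patterns."""
    matched = 0                          # state: how many patterns have been seen so far
    for symbol in word:
        if matched < len(patterns) and symbol == patterns[matched]:
            matched += 1                 # advance on the next expected pattern
    return matched == len(patterns)      # accepting state: all k patterns seen

print(patterns_in_order("xaybz", "ab"))
```

Because the automaton follows only the linear order, it ignores how the document nests, which is exactly what a bottom-up tree automaton cannot do cheaply.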

Analysis of liveness requirements such as “every write operation must be followed by a read operation”

is formulated using automata over inﬁnite words, and the theory of ω-regular languages is well developed

with many of the counterparts of the results for regular languages (cf. [Tho90, VW94]). Consequently, we

also deﬁne nested ω-words and consider nested word automata augmented with acceptance conditions such

as Büchi and Muller, that accept languages of nested ω-words. We establish that the resulting class of

regular languages of nested ω-words is closed under operations such as union, intersection, complementation,

and homomorphisms. Decision problems for these automata have the same complexity as the corresponding

problems for NWAs. As in the ﬁnite case, the class can be characterized by the monadic second order logic.

The signiﬁcant diﬀerence is that deterministic automata with Muller acceptance condition on states that

appear inﬁnitely often along the linear run do not capture all regular properties: the language “there are only

finitely many pending calls” can be easily characterized using a nondeterministic Büchi NWA, and we prove

that no deterministic Muller automaton accepts this language. However, we show that nondeterministic

Büchi NWAs can be complemented and hence problems such as checking for inclusion are still decidable.

Outline

Section 2 deﬁnes nested words and their word encodings, and gives diﬀerent application domains where

nested words can be useful. Section 3 deﬁnes nested word automata and the notion of regularity. We

consider some variations of the deﬁnition of the automata, including the nondeterministic automata, show

how NWAs can be useful in program analysis, and establish closure properties. Section 4 gives logic based

characterization of regularity. In Section 5, we deﬁne visibly pushdown languages as the class of word

languages equivalent to regular languages of nested words. We also give grammar based characterization, and

study relationship to parenthesis languages and balanced grammars. Section 6 studies decision problems for

NWAs. Section 7 presents encoding of ordered trees and hedges as nested words, and studies the relationship

between regular tree languages, regular nested-word languages, and balanced languages. To understand the

relationship between tree automata and NWAs, we also introduce bottom-up and top-down restrictions of

NWAs. Section 8 considers the extension of nested words and automata over nested words to the case of

inﬁnite words. Finally, we discuss related work and conclusions.

2 Linear Hierarchical Models

2.1 Nested Words

Given a linear sequence, we add hierarchical structure using edges that are well nested (that is, they do not

cross). We will use edges starting at −∞ and edges ending at +∞ to model “pending” edges. Assume that

$-\infty < i < +\infty$ for every integer i.

A matching relation $\leadsto$ of length $\ell$, for $\ell \ge 0$, is a subset of $\{-\infty, 1, 2, \ldots, \ell\} \times \{1, 2, \ldots, \ell, +\infty\}$ such that

1. Nesting edges go only forward: if $i \leadsto j$ then $i < j$;

2. No two nesting edges share a position: for $1 \le i \le \ell$, $|\{j \mid i \leadsto j\}| \le 1$ and $|\{j \mid j \leadsto i\}| \le 1$;

3. Nesting edges do not cross: if $i \leadsto j$ and $i' \leadsto j'$ then it is not the case that $i < i' \le j < j'$.
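The three conditions can be checked mechanically. The following Python sketch represents a candidate relation as a set of pairs, with `-math.inf` and `math.inf` standing for −∞ and +∞; the function name and representation are illustrative assumptions. The two sample relations are the ones given for Figure 1, with lengths 9 and 8 assumed.

```python
import math
from itertools import permutations

def is_matching_relation(edges, length):
    """Check the three conditions on a set of (i, j) pairs for the given length."""
    edges = set(edges)
    for i, j in edges:
        # Sources range over {-inf, 1..length}, targets over {1..length, +inf}.
        if i != -math.inf and not 1 <= i <= length:
            return False
        if j != math.inf and not 1 <= j <= length:
            return False
        if not i < j:                      # 1. nesting edges go only forward
            return False
    for pos in range(1, length + 1):       # 2. no two edges share a position
        if sum(1 for i, _ in edges if i == pos) > 1:
            return False
        if sum(1 for _, j in edges if j == pos) > 1:
            return False
    for (i, j), (i2, j2) in permutations(edges, 2):
        if i < i2 <= j < j2:               # 3. nesting edges do not cross
            return False
    return True

left = {(2, 8), (4, 7)}
right = {(-math.inf, 1), (-math.inf, 4), (2, 3), (5, math.inf), (7, math.inf)}
print(is_matching_relation(left, 9), is_matching_relation(right, 8))
```

Ordered pairs of edges are enumerated because condition 3 is not symmetric in the two edges. The non-strict inequality in that condition also rules out a position that is both a return and a call, as in {(1, 2), (2, 3)}.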

When $i \leadsto j$ holds, for $1 \le i \le \ell$, the position i is called a call position. For a call position i, if $i \leadsto +\infty$,

then i is called a pending call, otherwise i is called a matched call, and the unique position j such that $i \leadsto j$

is called its return-successor. Similarly, when $i \leadsto j$ holds, for $1 \le j \le \ell$, the position j is called a return

position. For a return position j, if $-\infty \leadsto j$, then j is called a pending return, otherwise j is called a matched

return, and the unique position i such that $i \leadsto j$ is called its call-predecessor. Our definition requires that a

position cannot be both a call and a return. A position $1 \le i \le \ell$ that is neither a call nor a return is called

internal.
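As a small illustration of this terminology, the sketch below labels every position of a matching relation with its type. The pair-set representation with `math.inf` and the function name are assumptions made for the example; the sample relation is the right-hand one from Figure 1, with length 8 assumed.

```python
import math

def classify_positions(edges, length):
    """Map each position 1..length to its type under the given matching relation."""
    successor = {i: j for i, j in edges if i != -math.inf}    # call -> return or +inf
    predecessor = {j: i for i, j in edges if j != math.inf}   # return -> call or -inf
    kinds = {}
    for pos in range(1, length + 1):
        if pos in successor:
            kinds[pos] = "pending call" if successor[pos] == math.inf else "matched call"
        elif pos in predecessor:
            kinds[pos] = "pending return" if predecessor[pos] == -math.inf else "matched return"
        else:
            kinds[pos] = "internal"
    return kinds

right = {(-math.inf, 1), (-math.inf, 4), (2, 3), (5, math.inf), (7, math.inf)}
print(classify_positions(right, 8))
```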

A matching relation $\leadsto$ of length $\ell$ can be viewed as a directed acyclic graph over $\ell$ vertices corresponding

to positions. For $1 \le i < \ell$, there is a linear edge from i to i + 1. The initial position has an incoming linear

edge with no source, and the last position has an outgoing linear edge with no destination. For matched call

positions i, there is a nesting edge (sometimes also called a summary edge) from i to its return-successor.

For pending calls i, there is a nesting edge from i with no destination, and for pending returns j, there

is a nesting edge to j with no source. We call such graphs corresponding to matching relations nested

sequences. Note that a call has indegree 1 and outdegree 2, a return has indegree 2 and outdegree 1, and an

internal has indegree 1 and outdegree 1.

Figure 1: Sample nested sequences

Figure 1 shows two nested sequences. Nesting edges are drawn using dotted lines. For the left sequence,

the matching relation is {(2, 8), (4, 7)}, and for the right sequence, it is {(−∞, 1), (−∞, 4),

(2, 3), (5, +∞), (7, +∞)}. Note that our deﬁnition allows a nesting edge from a position i to its linear

successor, and in that case there will be two edges from i to i + 1; this is the case for positions 2 and 3 of

the second sequence. The second sequence has two pending calls and two pending returns. Also note that

all pending return positions in a nested sequence appear before any of the pending call positions.

A nested word n over an alphabet Σ is a pair $(a_1 \ldots a_\ell, \leadsto)$, for $\ell \ge 0$, such that $a_i$, for each $1 \le i \le \ell$, is

a symbol in Σ, and $\leadsto$ is a matching relation of length $\ell$. In other words, a nested word is a nested sequence

whose positions are labeled with symbols in Σ. Let us denote the set of all nested words over Σ as NW(Σ).

A language of nested words over Σ is a subset of NW(Σ).

A nested word n with matching relation $\leadsto$ is said to be well-matched if there is no position i such that

$-\infty \leadsto i$ or $i \leadsto +\infty$. Thus, in a well-matched nested word, every call has a return-successor and every

return has a call-predecessor. We will use WNW(Σ) ⊆ NW(Σ) to denote the set of all well-matched nested

words over Σ. A nested word n of length $\ell$ is said to be rooted if $1 \leadsto \ell$ holds. Observe that a rooted word

must be well-matched. In Figure 1, only the left sequence is well-matched, and neither of the sequences is

rooted.
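Both notions reduce to simple tests on the matching relation. The following Python sketch, using an assumed pair-set representation with `math.inf` for the infinite endpoints, reproduces the observations about Figure 1 (lengths 9 and 8 are assumed for the two sequences).

```python
import math

def is_well_matched(edges):
    """No edge starts at -inf or ends at +inf."""
    return all(i != -math.inf and j != math.inf for i, j in edges)

def is_rooted(edges, length):
    """The first position is a call whose return-successor is the last position."""
    return (1, length) in set(edges)

left = {(2, 8), (4, 7)}
right = {(-math.inf, 1), (-math.inf, 4), (2, 3), (5, math.inf), (7, math.inf)}
print(is_well_matched(left), is_rooted(left, 9))
print(is_well_matched(right), is_rooted(right, 8))
```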

While the length of a nested word captures its linear complexity, its (nesting) depth captures its hierarchical

complexity. For $i \leadsto j$, we say that the call position i is pending at every position k such that

$i < k < j$. The depth of a position i is the number of calls that are pending at i. Note that the depth of the

first position is 0, it increases by 1 following a call, and decreases by 1 following a matched return. The depth

of a nested word is the maximum depth of any of its positions. In Figure 1, both sequences have depth 2.
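The depth can be computed directly from the definition. The Python sketch below counts, for each position k, the call positions i with a nesting edge to some j such that i < k < j; the representation with `math.inf` and the function names are assumptions of the example, as are the lengths 9 and 8 for the two sequences of Figure 1.

```python
import math

def depths(edges, length):
    """Depth of each position 1..length, read literally from the definition."""
    result = []
    for k in range(1, length + 1):
        # The source -inf of a pending return is not a call position, so it is skipped.
        pending = sum(1 for i, j in edges if i != -math.inf and i < k < j)
        result.append(pending)
    return result

def nesting_depth(edges, length):
    """Maximum depth over all positions (0 for the empty nested word)."""
    return max(depths(edges, length), default=0)

left = {(2, 8), (4, 7)}
right = {(-math.inf, 1), (-math.inf, 4), (2, 3), (5, math.inf), (7, math.inf)}
print(nesting_depth(left, 9), nesting_depth(right, 8))
```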

2.2 Word Encoding

Nested words over Σ can be encoded by words in a natural way by using the tags $\langle$ and $\rangle$ to denote calls and

returns, respectively. For each symbol a in Σ, we will use a new symbol $\langle a$ to denote a call position labeled

with a, and a new symbol $a\rangle$ to denote a return position labeled with a. We use $\langle\Sigma$ to denote the set of

symbols $\{\langle a \mid a \in \Sigma\}$, and $\Sigma\rangle$ to denote the set of symbols $\{a\rangle \mid a \in \Sigma\}$. Then, given an alphabet Σ, define

the tagged alphabet $\hat{\Sigma}$ to be the set $\Sigma \cup \langle\Sigma \cup \Sigma\rangle$. Formally, we define the mapping $\mathit{nw\_w} : NW(\Sigma) \to \hat{\Sigma}^*$ as

follows: given a nested word $n = (a_1 \ldots a_\ell, \leadsto)$ of length $\ell$ over Σ, $\hat{n} = \mathit{nw\_w}(n)$ is a word $b_1 \ldots b_\ell$ over $\hat{\Sigma}$

such that for each $1 \le i \le \ell$, $b_i = a_i$ if i is an internal, $b_i = \langle a_i$ if i is a call, and $b_i = a_i\rangle$ if i is a return.

For Figure 1, assuming all positions are labeled with the same symbol a, the tagged words corresponding

to the two nested sequences are $a\ \langle a\ a\ \langle a\ a\ a\ a\rangle\ a\rangle\ a$ and $a\rangle\ \langle a\ a\rangle\ a\rangle\ \langle a\ a\ \langle a\ a$.
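The encoding can be sketched in Python as follows. ASCII `<` and `>` stand in for the call and return tags, labels are plain strings, and the matching relation is a set of pairs with `math.inf` for the infinite endpoints; all of these are assumptions of the example rather than notation from the paper.

```python
import math

def nw_w(labels, edges):
    """Encode the nested word (labels, edges) as a list of tagged symbols."""
    calls = {i for i, _ in edges if i != -math.inf}
    returns = {j for _, j in edges if j != math.inf}
    word = []
    for pos, a in enumerate(labels, start=1):    # positions are 1-indexed
        if pos in calls:
            word.append("<" + a)                 # call position
        elif pos in returns:
            word.append(a + ">")                 # return position
        else:
            word.append(a)                       # internal position
    return word

left = {(2, 8), (4, 7)}
right = {(-math.inf, 1), (-math.inf, 4), (2, 3), (5, math.inf), (7, math.inf)}
print(" ".join(nw_w(["a"] * 9, left)))
print(" ".join(nw_w(["a"] * 8, right)))
```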

Since we allow calls and returns to be pending, every word over the tagged alphabet $\hat{\Sigma}$ corresponds to a

nested word. This correspondence is captured by the following lemma:

Lemma 1 The transformation $\mathit{nw\_w} : NW(\Sigma) \to \hat{\Sigma}^*$ is a bijection.
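One way to see why every tagged word decodes to a nested word is to recover the matching relation with a stack: a return is matched with the most recent open call, a return with no open call becomes pending, and calls left open at the end become pending. The Python sketch below assumes tagged symbols are strings with a leading `<` for calls and a trailing `>` for returns, and that labels themselves contain neither character.

```python
import math

def w_nw(tagged):
    """Decode a list of tagged symbols into (labels, matching relation)."""
    labels, edges, open_calls = [], set(), []
    for pos, sym in enumerate(tagged, start=1):
        if sym.startswith("<"):                  # call: remember its position
            labels.append(sym[1:])
            open_calls.append(pos)
        elif sym.endswith(">"):                  # return: match innermost open call
            labels.append(sym[:-1])
            source = open_calls.pop() if open_calls else -math.inf
            edges.add((source, pos))
        else:                                    # internal
            labels.append(sym)
    edges.update((pos, math.inf) for pos in open_calls)   # leftover calls are pending
    return labels, edges

labels, edges = w_nw(["a>", "<a", "a>", "a>", "<a", "a", "<a", "a"])
print(labels, sorted(edges))
```

On the second tagged word of Figure 1 this recovers the relation stated there, including its two pending returns and two pending calls.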

The inverse of $\mathit{nw\_w}$ is a transformation function that maps words over $\hat{\Sigma}$ to nested words over Σ, and

will be denoted $\mathit{w\_nw} : \hat{\Sigma}^* \to NW(\Sigma)$. This one-to-one correspondence shows that:


Journal of the ACM


References

Introduction to Automata Theory, Languages, and Computation

The temporal logic of programs

An Axiomatic Basis for Computer Programming

A Temporal Logic of Nested Calls and Returns
