Adding Nesting Structure to Words
Rajeev Alur
University of Pennsylvania
alur@cis.upenn.edu
P. Madhusudan
University of Illinois, Urbana-Champaign
madhu@cs.uiuc.edu
Abstract
We propose the model of nested words for representation of data with both a linear ordering and
a hierarchically nested matching of items. Examples of data with such dual linear-hierarchical structure
include executions of structured programs, annotated linguistic data, and HTML/XML documents.
Nested words generalize both words and ordered trees, and allow both word and tree operations. We
define nested word automata, finite-state acceptors for nested words, and show that the resulting class
of regular languages of nested words has all the appealing theoretical properties that the classical regular
word languages enjoy: deterministic nested word automata are as expressive as their nondeterministic
counterparts; the class is closed under union, intersection, complementation, concatenation, Kleene-*,
prefixes, and language homomorphisms; membership, emptiness, language inclusion, and language
equivalence are all decidable; and definability in monadic second order logic corresponds exactly to
finite-state recognizability. We also consider regular languages of infinite nested words and show that the
closure properties, MSO-characterization, and decidability of decision problems carry over.
The linear encodings of nested words give the class of visibly pushdown languages of words, and
this class lies between balanced languages and deterministic context-free languages. We argue that for
algorithmic verification of structured programs, instead of viewing the program as a context-free language
over words, one should view it as a regular language of nested words (or equivalently, a visibly pushdown
language), and this would allow model checking of many properties (such as stack inspection, pre-post
conditions) that are not expressible in existing specification logics.
We also study the relationship between ordered trees and nested words, and the corresponding
automata: while the analysis complexity of nested word automata is the same as that of classical tree
automata, they combine both bottom-up and top-down traversals, and enjoy expressiveness and
succinctness benefits over tree automata.
1 Introduction
Linearly structured data is usually modeled as words, and queried using word automata and related
specification languages such as regular expressions. Hierarchically structured data is naturally modeled as (unordered)
trees, and queried using tree automata. In many applications including executions of structured programs,
annotated linguistic data, and primary/secondary bonds in genomic sequences, the data has both a natural
linear sequencing of positions and a hierarchically nested matching of positions. For example, in natural
language processing, the sentence is a linear sequence of words, and parsing into syntactic categories imparts
the hierarchical structure. Sometimes, even though the only logical structure on data is hierarchical, linear
sequencing is added either for storage or for stream processing. For example, in SAX representation of XML
data, the document is a linear sequence of text characters, along with a hierarchically nested matching of
open-tags with closing tags.
This paper unifies and extends results that have appeared in conference papers [AM04], [AM06], and [Alu07].
In this paper, we propose the model of nested words for representing and querying data with dual
linear-hierarchical structure. A nested word consists of a sequence of linearly ordered positions, augmented with
nesting edges connecting calls to returns (or open-tags to close-tags). The edges do not cross, creating a
properly nested hierarchical structure, and we allow some of the edges to be pending. This nesting structure
can be uniquely represented by a sequence specifying the types of positions (calls, returns, and internals).
Words are nested words where all positions are internals. Ordered trees can be interpreted as nested words
using the following traversal: to process an a-labeled node, first print an a-labeled call, process all the
children in order, and print an a-labeled return. Note that this is a combination of top-down and bottom-
up traversals, and each node is processed twice. Binary trees, ranked trees, unranked trees, hedges, and
documents that do not parse correctly, all can be represented with equal ease. Word operations such as
prefixes, suffixes, concatenation, reversal, as well as tree operations referring to the hierarchical structure,
can be defined naturally on nested words.
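The tree traversal described above can be sketched in a few lines of Python. This is only an illustration: the representation of a tree as a `(label, children)` pair, the function name `tree_to_tagged_word`, and the ASCII spellings `<a` for an a-labeled call and `a>` for an a-labeled return are assumptions made for this example, not notation from the paper.

```python
def tree_to_tagged_word(tree):
    """Encode an ordered tree as a list of tagged symbols.

    A tree is assumed to be a pair (label, children), where children is a
    list of trees. Each node contributes a call before its children and a
    return after them, so every node is visited twice.
    """
    label, children = tree
    word = ["<" + label]                 # a-labeled call (top-down visit)
    for child in children:               # children in left-to-right order
        word.extend(tree_to_tagged_word(child))
    word.append(label + ">")             # a-labeled return (bottom-up visit)
    return word


# Hypothetical example tree: root a with two leaf children b and c.
example_tree = ("a", [("b", []), ("c", [])])
example_word = tree_to_tagged_word(example_tree)
```

For the example tree, the call for `a` comes first and its return comes last, with the encodings of `b` and `c` in between, which is the combined top-down and bottom-up order mentioned in the text.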
We define and study finite-state automata as acceptors of nested words. A nested word automaton
(NWA) is similar to a classical finite-state word automaton, and reads the input from left to right according
to the linear sequence. At a call, it can propagate states along both linear and nesting outgoing edges,
and at a return, the new state is determined based on states labeling both the linear and nesting incoming
edges. The resulting class of regular languages of nested words has all the appealing theoretical properties
that the regular languages of words and trees enjoy. In particular, we show that deterministic nested word
automata are as expressive as their nondeterministic counterparts. Given a nondeterministic automaton
A with s states, the determinization involves subsets of pairs of states (as opposed to subsets of states
for word automata), leading to a deterministic automaton with $2^{s^2}$ states, and we show this bound to
be tight. The class is closed under all Boolean operations (union, intersection, and complement), and a
variety of word operations such as concatenation, Kleene-*, and prefix-closure. The class is also closed under
nesting-respecting language homomorphisms, which can model tree operations. Decision problems such as
membership, emptiness, language inclusion, and language equivalence are all decidable. We also establish
that the notion of regularity coincides with the definability in the monadic second order logic (MSO) of
nested words (MSO of nested words has unary predicates over positions, first and second order quantifiers,
linear successor relation, and the nesting relation).
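One possible reading of the informal description of an NWA given above is sketched below in Python. It is not the formal definition (which belongs to Section 3): the function names, the state names `top`, `inside`, and `bad`, the marker `BOTTOM` for the state on a pending return's missing nesting edge, and the ASCII tags `<a` and `a>` are all assumptions made for illustration. The list `nesting` simply stores the states propagated along nesting edges until the matching return reads them.

```python
BOTTOM = "bottom"  # hypothetical hierarchical state seen by pending returns


def run_nwa(tagged_word, delta_call, delta_internal, delta_return, initial, accepting):
    """Run a deterministic NWA left to right and report acceptance."""
    state = initial
    nesting = []  # states travelling along currently open nesting edges
    for symbol in tagged_word:
        if symbol.startswith("<"):        # call: feed linear and nesting edge
            state, hier = delta_call(state, symbol[1:])
            nesting.append(hier)
        elif symbol.endswith(">"):        # return: read linear and nesting edge
            hier = nesting.pop() if nesting else BOTTOM
            state = delta_return(state, hier, symbol[:-1])
        else:                             # internal: ordinary word-automaton step
            state = delta_internal(state, symbol)
    return state in accepting


# Example automaton (an assumption for illustration): accept nested words
# without pending edges in which each return has the label of its call.
def delta_call(state, label):
    if state == "bad":
        return "bad", BOTTOM
    return "inside", (state, label)       # remember caller state and label


def delta_internal(state, label):
    return state


def delta_return(state, hier, label):
    if state == "bad" or hier == BOTTOM or hier[1] != label:
        return "bad"
    return hier[0]                        # resume the caller's linear state


def accepts(tagged_word):
    return run_nwa(tagged_word, delta_call, delta_internal, delta_return,
                   "top", {"top"})
```

The example automaton ends in state `top` only when every call has been closed by a return with the same label, so a word with a pending call ends in `inside` and is rejected.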
The motivating application area for our results has been software verification. Pushdown automata
naturally model the control flow of sequential computation in typical programming languages with nested, and
potentially recursive, invocations of program modules such as procedures and method calls. Consequently,
a variety of program analysis, compiler optimization, and model checking questions can be formulated as
decision problems for pushdown automata. For instance, in contemporary software model checking tools,
to verify whether a program P (written in C, for instance) satisfies a regular correctness requirement ϕ
(written in linear temporal logic LTL, for instance), the verifier first abstracts the program into a pushdown
model $P^a$ with finite-state control, compiles the negation of the specification into a finite-state automaton
$A_{\neg\varphi}$ that accepts all computations that violate ϕ, and algorithmically checks that the intersection of the
languages of $P^a$ and $A_{\neg\varphi}$ is empty. The problem of checking regular requirements of pushdown models has
been extensively studied in recent years leading to efficient implementations and applications to program
analysis [RHS95, BEM97, BR00, ABE+05, HJM+02, EKS03, CW02]. While many analysis problems such as
identifying dead code and accesses to uninitialized variables can be captured as regular requirements, many
others require inspection of the stack or matching of calls and returns, and are context-free. Even though
the general problem of checking context-free properties of pushdown automata is undecidable, algorithmic
solutions have been proposed for checking many different kinds of non-regular properties. For example,
access control requirements such as “a module A should be invoked only if the module B belongs to the
call-stack,” and bounds on stack size such as “if the number of interrupt-handlers in the call-stack currently
is less than 5, then a property p holds” require inspection of the stack, and decision procedures for certain
classes of stack properties already exist [JMT99, CW02, EKS03, CMM+04]. A separate class of non-regular,
but decidable, properties includes the temporal logic Caret that allows matching of calls and returns and
can express the classical correctness requirements of program modules with pre and post conditions, such
as “if p holds when a module is invoked, the module must return, and q holds upon return” [AEM04]. This
suggests that the answer to the question “which class of properties are algorithmically checkable against
pushdown models?” should be more general than “regular word languages.” Our results suggest that the
answer lies in viewing the program as a generator of nested words. The key feature of checkable requirements,
such as stack inspection and matching calls and returns, is that the stacks in the model and the property are
correlated: while the stacks are not identical, the two synchronize on when to push and when to pop, and
are always of the same depth. This can be best captured by modeling the execution of a program P as a
nested word with nesting edges from calls to returns. Specification of the program is given as a nested word
automaton A (or written as a formula ϕ in one of the new temporal logics for nested words), and verification
corresponds to checking whether every nested word generated by P is accepted by A. If P is abstracted
into a model $P^a$ with only boolean variables, then it can be interpreted as an NWA, and verification can
be solved using decision procedures for NWAs. Nested-word automata can express a variety of requirements
such as stack-inspection properties, pre-post conditions, and interprocedural data-flow properties. More
broadly, modeling structured programs and program specifications as languages of nested words generalizes
the linear-time semantics that allows integration of Pnueli-style temporal reasoning [Pnu77] and Hoare-style
structured reasoning [Hoa69]. We believe that the nested-word view will provide a unifying basis for the
next generation of specification logics for program analysis, software verification, and runtime monitoring.
Given a language L of nested words over Σ, the linear encoding of nested words gives a language $\hat{L}$ over
the tagged alphabet consisting of symbols tagged with the type of the position. If L is a regular language of
nested words, then $\hat{L}$ is context-free. In fact, the pushdown automata accepting $\hat{L}$ have a special structure:
while reading a call, the automaton must push one symbol, while reading a return symbol, it must pop one
symbol (if the stack is non-empty), and while reading an internal symbol, it can only update its control
state. We call such automata visibly pushdown automata and the class of word languages they accept visibly
pushdown languages (VPL). Since our automata can be determinized, VPLs correspond to a subclass of
deterministic context-free languages (DCFL). We give a grammar-based characterization of VPLs, which
helps in understanding VPLs as a generalization of parenthesis languages, bracketed languages, and
balanced languages [McN67, GH67, BB02]. Note that VPLs have better closure properties than CFLs,
DCFLs, or parenthesis languages: CFLs are not closed under intersection and complement, DCFLs are not
closed under union, intersection, and concatenation, and balanced languages are not closed under complement
and prefix-closure.
Data with dual linear-hierarchical structure is traditionally modeled using binary, and more generally,
using ordered unranked, trees, and queried using tree automata (see [Nev02, Lib05, Sch07] for recent surveys
on applications of unranked trees and tree automata to XML processing). In ordered trees, nodes with
the same parent are linearly ordered, and the classical tree traversals such as infix (or depth-first left-to-
right) can be used to define an implicit ordering of all nodes. It turns out that hedges, where a hedge is
a sequence of ordered trees, are a special class of nested words, namely, the ones corresponding to Dyck
words, and regular hedge languages correspond to balanced languages. For document processing, nested
words do have many advantages over ordered trees as trees lack an explicit ordering of all nodes. Tree-based
representation implicitly assumes that the input linear data can be parsed into a tree, and thus, one cannot
represent and process data that may not parse correctly. Word operations such as prefixes, suffixes, and
concatenation, while natural for document processing, do not have analogous tree operations. Second, tree
automata can naturally express constraints on the sequence of labels along a hierarchical path, and also
along the left-to-right siblings, but they have difficulty capturing constraints that refer to the global linear
order. For example, the query that patterns $p_1, \ldots, p_k$ appear in the document in that order (that is, the
regular expression $\Sigma^* p_1 \Sigma^* \cdots p_k \Sigma^*$ over the linear order) compiles into a deterministic word automaton (and
hence deterministic NWA) of linear size, but standard deterministic bottom-up tree automaton for this query
must be of size exponential in k. In fact, NWAs can be viewed as a kind of tree automata such that both
bottom-up tree automata and top-down tree automata are special cases.
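To make the linear-size claim for the word automaton concrete, here is a minimal Python sketch under a simplifying assumption: each pattern is a single symbol, so the deterministic automaton needs only k + 1 states, where state i means that the first i patterns have already been seen in order. The function name `patterns_in_order` is hypothetical, and general patterns would need a larger (but still word-based) automaton.

```python
def patterns_in_order(word, patterns):
    """Simulate a (k+1)-state deterministic word automaton.

    State i (0 <= i <= k) records that patterns[0..i-1] have occurred in
    this order along the linear sequence; state k is the accepting state.
    Each pattern is assumed to be a single symbol.
    """
    state = 0
    for symbol in word:
        # Advance only when the next expected pattern is read; every other
        # symbol is absorbed by the surrounding Sigma-star self-loops.
        if state < len(patterns) and symbol == patterns[state]:
            state += 1
    return state == len(patterns)
```

Because the automaton only follows the linear order, the same check applies unchanged to the linear sequence of a nested word, whatever its nesting edges are.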
Analysis of liveness requirements such as “every write operation must be followed by a read operation”
is formulated using automata over infinite words, and the theory of ω-regular languages is well developed
with many of the counterparts of the results for regular languages (cf. [Tho90, VW94]). Consequently, we
also define nested ω-words and consider nested word automata augmented with acceptance conditions such
as Büchi and Muller, that accept languages of nested ω-words. We establish that the resulting class of
regular languages of nested ω-words is closed under operations such as union, intersection, complementation,
and homomorphisms. Decision problems for these automata have the same complexity as the corresponding
problems for NWAs. As in the finite case, the class can be characterized by the monadic second order logic.
The significant difference is that deterministic automata with Muller acceptance condition on states that
appear infinitely often along the linear run do not capture all regular properties: the language “there are only
finitely many pending calls” can be easily characterized using a nondeterministic Büchi NWA, and we prove
that no deterministic Muller automaton accepts this language. However, we show that nondeterministic
Büchi NWAs can be complemented and hence problems such as checking for inclusion are still decidable.
Outline
Section 2 defines nested words and their word encodings, and gives different application domains where
nested words can be useful. Section 3 defines nested word automata and the notion of regularity. We
consider some variations of the definition of the automata, including the nondeterministic automata, show
how NWAs can be useful in program analysis, and establish closure properties. Section 4 gives a logic-based
characterization of regularity. In Section 5, we define visibly pushdown languages as the class of word
languages equivalent to regular languages of nested words. We also give a grammar-based characterization, and
study the relationship to parenthesis languages and balanced grammars. Section 6 studies decision problems for
NWAs. Section 7 presents encoding of ordered trees and hedges as nested words, and studies the relationship
between regular tree languages, regular nested-word languages, and balanced languages. To understand the
relationship between tree automata and NWAs, we also introduce bottom-up and top-down restrictions of
NWAs. Section 8 considers the extension of nested words and automata over nested words to the case of
infinite words. Finally, we discuss related work and conclusions.
2 Linear Hierarchical Models
2.1 Nested Words
Given a linear sequence, we add hierarchical structure using edges that are well nested (that is, they do not
cross). We will use edges starting at $-\infty$ and edges ending at $+\infty$ to model “pending” edges. Assume that
$-\infty < i < +\infty$ for every integer $i$.
A matching relation $\leadsto$ of length $\ell$, for $\ell \geq 0$, is a subset of
$\{-\infty, 1, 2, \ldots, \ell\} \times \{1, 2, \ldots, \ell, +\infty\}$ such that
1. Nesting edges go only forward: if $i \leadsto j$ then $i < j$;
2. No two nesting edges share a position: for $1 \leq i \leq \ell$, $|\{j \mid i \leadsto j\}| \leq 1$ and $|\{j \mid j \leadsto i\}| \leq 1$;
3. Nesting edges do not cross: if $i \leadsto j$ and $i' \leadsto j'$ then it is not the case that $i < i' \leq j < j'$.
When $i \leadsto j$ holds, for $1 \leq i \leq \ell$, the position $i$ is called a call position. For a call position $i$, if $i \leadsto +\infty$,
then $i$ is called a pending call, otherwise $i$ is called a matched call, and the unique position $j$ such that $i \leadsto j$
is called its return-successor. Similarly, when $i \leadsto j$ holds, for $1 \leq j \leq \ell$, the position $j$ is called a return
position. For a return position $j$, if $-\infty \leadsto j$, then $j$ is called a pending return, otherwise $j$ is called a matched
return, and the unique position $i$ such that $i \leadsto j$ is called its call-predecessor. Our definition requires that a
position cannot be both a call and a return. A position $1 \leq i \leq \ell$ that is neither a call nor a return is called
internal.
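The three conditions on a matching relation can be checked mechanically. The following Python sketch is one way to do so, assuming a relation is given as a set of pairs with `-math.inf` and `math.inf` standing in for the two infinite endpoints; the function name `is_matching_relation` is hypothetical.

```python
import math

NEG_INF, POS_INF = -math.inf, math.inf


def is_matching_relation(length, edges):
    """Check the three conditions for a matching relation of the given length."""
    edges = set(edges)
    positions = range(1, length + 1)
    for i, j in edges:
        # Edges must start at -infinity or a position, and end at a position
        # or +infinity.
        if not (i == NEG_INF or i in positions):
            return False
        if not (j == POS_INF or j in positions):
            return False
        # Condition 1: nesting edges go only forward.
        if not i < j:
            return False
    # Condition 2: no two nesting edges share a position.
    for p in positions:
        if sum(1 for i, _ in edges if i == p) > 1:
            return False
        if sum(1 for _, j in edges if j == p) > 1:
            return False
    # Condition 3: nesting edges do not cross.
    for i, j in edges:
        for i2, j2 in edges:
            if i < i2 <= j < j2:
                return False
    return True
```

Condition 3 also rules out a position that is both a return and a call (take the second edge to start where the first one ends), and it rejects a pending call that occurs before a pending return.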
A matching relation $\leadsto$ of length $\ell$ can be viewed as a directed acyclic graph over $\ell$ vertices corresponding
to positions. For $1 \leq i < \ell$, there is a linear edge from $i$ to $i + 1$. The initial position has an incoming linear
edge with no source, and the last position has an outgoing linear edge with no destination. For matched call
positions $i$, there is a nesting edge (sometimes also called a summary edge) from $i$ to its return-successor.
For pending calls $i$, there is a nesting edge from $i$ with no destination, and for pending returns $j$, there
is a nesting edge to $j$ with no source. We call such graphs corresponding to matching relations nested
sequences. Note that a call has indegree 1 and outdegree 2, a return has indegree 2 and outdegree 1, and an
internal has indegree 1 and outdegree 1.
Figure 1: Sample nested sequences
Figure 1 shows two nested sequences. Nesting edges are drawn using dotted lines. For the left sequence,
the matching relation is $\{(2, 8), (4, 7)\}$, and for the right sequence, it is
$\{(-\infty, 1), (-\infty, 4), (2, 3), (5, +\infty), (7, +\infty)\}$. Note that our definition allows a nesting edge from a position $i$ to its linear
successor, and in that case there will be two edges from $i$ to $i + 1$; this is the case for positions 2 and 3 of
the second sequence. The second sequence has two pending calls and two pending returns. Also note that
all pending return positions in a nested sequence appear before any of the pending call positions.
A nested word n over an alphabet Σ is a pair $(a_1 \ldots a_\ell, \leadsto)$, for $\ell \geq 0$, such that $a_i$, for each $1 \leq i \leq \ell$, is
a symbol in Σ, and $\leadsto$ is a matching relation of length $\ell$. In other words, a nested word is a nested sequence
whose positions are labeled with symbols in Σ. Let us denote the set of all nested words over Σ as NW(Σ).
A language of nested words over Σ is a subset of NW(Σ).
A nested word n with matching relation $\leadsto$ is said to be well-matched if there is no position $i$ such that
$-\infty \leadsto i$ or $i \leadsto +\infty$. Thus, in a well-matched nested word, every call has a return-successor and every
return has a call-predecessor. We will use $WNW(\Sigma) \subseteq NW(\Sigma)$ to denote the set of all well-matched nested
words over Σ. A nested word n of length $\ell$ is said to be rooted if $1 \leadsto \ell$ holds. Observe that a rooted word
must be well-matched. In Figure 1, only the left sequence is well-matched, and neither of the sequences is
rooted.
While the length of a nested word captures its linear complexity, its (nesting) depth captures its
hierarchical complexity. For $i \leadsto j$, we say that the call position $i$ is pending at every position $k$ such that
$i < k < j$. The depth of a position $i$ is the number of calls that are pending at $i$. Note that the depth of the
first position is 0, it increases by 1 following a call, and decreases by 1 following a matched return. The depth
of a nested word is the maximum depth of any of its positions. In Figure 1, both sequences have depth 2.
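The definition of depth can be followed literally in code. The sketch below assumes a matching relation is a set of pairs with `-math.inf` and `math.inf` as the infinite endpoints; the function names are hypothetical. Edges that start at minus infinity are skipped, since they belong to pending returns and have no call position that could be pending.

```python
import math


def position_depths(length, matching):
    """Return the depth of each position 1..length as a list."""
    # Only edges that start at a real position have a call position.
    call_edges = [(i, j) for (i, j) in matching if i != -math.inf]
    depths = []
    for k in range(1, length + 1):
        # The call at i is pending at position k when i < k < j.
        depths.append(sum(1 for (i, j) in call_edges if i < k < j))
    return depths


def nesting_depth(length, matching):
    """Depth of a nested word: the maximum depth over all its positions."""
    return max(position_depths(length, matching), default=0)


# Matching relations of the two nested sequences of Figure 1.
left = {(2, 8), (4, 7)}
right = {(-math.inf, 1), (-math.inf, 4), (2, 3), (5, math.inf), (7, math.inf)}
```

Applied to the two matching relations stated for Figure 1, with lengths 9 and 8 read off from the tagged words of Section 2.2, `nesting_depth` gives 2 in both cases, in agreement with the text.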
2.2 Word Encoding
Nested words over Σ can be encoded by words in a natural way by using the tags $\langle$ and $\rangle$ to denote calls and
returns, respectively. For each symbol $a$ in Σ, we will use a new symbol $\langle a$ to denote a call position labeled
with $a$, and a new symbol $a\rangle$ to denote a return position labeled with $a$. We use $\langle\Sigma$ to denote the set of
symbols $\{\langle a \mid a \in \Sigma\}$, and $\Sigma\rangle$ to denote the set of symbols $\{a\rangle \mid a \in \Sigma\}$. Then, given an alphabet Σ, define
the tagged alphabet $\hat{\Sigma}$ to be the set $\langle\Sigma \cup \Sigma \cup \Sigma\rangle$. Formally, we define the mapping
$\mathit{nw\_w} : NW(\Sigma) \to \hat{\Sigma}^*$ as follows: given a nested word $n = (a_1 \ldots a_\ell, \leadsto)$ of length $\ell$ over Σ,
$\hat{n} = \mathit{nw\_w}(n)$ is a word $b_1 \ldots b_\ell$ over $\hat{\Sigma}$ such that for each $1 \leq i \leq \ell$, $b_i = a_i$ if $i$ is an internal,
$b_i = \langle a_i$ if $i$ is a call, and $b_i = a_i\rangle$ if $i$ is a return.
For Figure 1, assuming all positions are labeled with the same symbol $a$, the tagged words corresponding
to the two nested sequences are $a\;\langle a\;a\;\langle a\;a\;a\;a\rangle\;a\rangle\;a$, and $a\rangle\;\langle a\;a\rangle\;a\rangle\;\langle a\;a\;\langle a\;a$.
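The encoding of a nested word as a tagged word can be sketched as follows. The representation is an assumption made for this example: labels are given as a sequence of symbols, the matching relation as a set of pairs using `-math.inf` and `math.inf`, and the tagged symbols are spelled in ASCII as `<a` for a call and `a>` for a return. The function is named after the mapping in the text, but it is only an illustrative sketch.

```python
import math


def nw_w(labels, matching):
    """Encode a nested word (labels, matching) as a list of tagged symbols."""
    # A position is a call if an edge starts there, a return if one ends there.
    calls = {i for (i, j) in matching if i != -math.inf}
    returns = {j for (i, j) in matching if j != math.inf}
    word = []
    for position, symbol in enumerate(labels, start=1):
        if position in calls:
            word.append("<" + symbol)    # call position labeled with symbol
        elif position in returns:
            word.append(symbol + ">")    # return position labeled with symbol
        else:
            word.append(symbol)          # internal position
    return word


# The two nested sequences of Figure 1, with every position labeled a.
left = nw_w("a" * 9, {(2, 8), (4, 7)})
right = nw_w("a" * 8, {(-math.inf, 1), (-math.inf, 4), (2, 3),
                       (5, math.inf), (7, math.inf)})
```

Pending calls and pending returns need no special treatment here: they are tagged like matched ones, and only the matching relation records that their partner is missing.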
Since we allow calls and returns to be pending, every word over the tagged alphabet $\hat{\Sigma}$ corresponds to a
nested word. This correspondence is captured by the following lemma:
Lemma 1 The transformation $\mathit{nw\_w} : NW(\Sigma) \to \hat{\Sigma}^*$ is a bijection.
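One way to see why every tagged word corresponds to exactly one nested word is to recover the matching relation with a stack of open calls. The Python sketch below does this under stated assumptions: tagged symbols are ASCII strings `<a` (call), `a>` (return), and `a` (internal), symbols of Σ contain neither `<` nor `>`, and `-math.inf` and `math.inf` stand for the infinite endpoints. The function is named after the inverse mapping, but the code is an illustration rather than the paper's construction.

```python
import math


def w_nw(tagged_word):
    """Decode a tagged word into (labels, matching relation)."""
    labels = []
    matching = set()
    open_calls = []  # positions of calls that have not been matched yet
    for position, symbol in enumerate(tagged_word, start=1):
        if symbol.startswith("<"):
            labels.append(symbol[1:])
            open_calls.append(position)
        elif symbol.endswith(">"):
            labels.append(symbol[:-1])
            # Match the innermost open call; with none open, the return is
            # pending and its nesting edge starts at minus infinity.
            source = open_calls.pop() if open_calls else -math.inf
            matching.add((source, position))
        else:
            labels.append(symbol)
    # Calls that are still open at the end are pending calls.
    matching.update((position, math.inf) for position in open_calls)
    return labels, matching
```

Since a return always closes the innermost open call, the recovered edges never cross, and a pending return can only occur while no call is open, which is why pending returns precede pending calls.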
The inverse of $\mathit{nw\_w}$ is a transformation function that maps words over $\hat{\Sigma}$ to nested words over Σ, and
will be denoted $\mathit{w\_nw} : \hat{\Sigma}^* \to NW(\Sigma)$. This one-to-one correspondence shows that:
References
Introduction to Automata Theory, Languages, and Computation. Book.
Amir Pnueli. The temporal logic of programs. Proceedings article.
An axiomatic basis for computer programming. Journal article.
A temporal logic of nested calls and returns. Book chapter.