
Showing papers on "Context-sensitive grammar published in 2010"


Book
25 Aug 2010
TL;DR: This book provides an extensive overview of the formal language landscape between CFG and PTIME, moving from Tree Adjoining Grammars to Multiple Context-Free Grammars and then to Range Concatenation Grammars while explaining available parsing techniques for these formalisms.
Abstract: Given that context-free grammars (CFG) cannot adequately describe natural languages, grammar formalisms beyond CFG that are still computationally tractable are of central interest for computational linguists. This book provides an extensive overview of the formal language landscape between CFG and PTIME, moving from Tree Adjoining Grammars to Multiple Context-Free Grammars and then to Range Concatenation Grammars while explaining available parsing techniques for these formalisms. Although familiarity with the basic notions of parsing and formal languages is helpful when reading this book, it is not a strict requirement. The presentation is supported with many illustrations and examples relating to the different formalisms and algorithms, and chapter summaries, problems and solutions. The book will be useful for students and researchers in computational linguistics and in formal language theory.
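As a quick illustration of why the landscape beyond CFG matters (a sketch of ours, not material from the book): the canonical non-context-free language {a^n b^n c^n} is trivially decidable in linear time, even though no context-free grammar generates it.

```python
def in_anbncn(s: str) -> bool:
    """Recognize {a^n b^n c^n : n >= 0}, a mildly context-sensitive
    language that no context-free grammar can generate."""
    n, r = divmod(len(s), 3)
    if r != 0:
        return False
    # The word must be exactly n a's, then n b's, then n c's.
    return s == "a" * n + "b" * n + "c" * n
```

Formalisms such as TAG and MCFG are designed precisely to capture patterns like this while staying polynomially parsable.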

134 citations


Proceedings ArticleDOI
17 Jan 2010
TL;DR: The design and theory of a new parsing engine, YAKKER, capable of satisfying the many needs of modern programmers and modern data processing applications is presented and its use on examples ranging from difficult programming language grammars to web server logs to binary data specification is illustrated.
Abstract: We present the design and theory of a new parsing engine, YAKKER, capable of satisfying the many needs of modern programmers and modern data processing applications. In particular, our new parsing engine handles (1) full scannerless context-free grammars with (2) regular expressions as right-hand sides for defining nonterminals. YAKKER also includes (3) facilities for binding variables to intermediate parse results and (4) using such bindings within arbitrary constraints to control parsing. These facilities allow the kind of data-dependent parsing commonly needed in systems applications, particularly those that operate over binary data. In addition, (5) nonterminals may be parameterized by arbitrary values, which gives the system good modularity and abstraction properties in the presence of data-dependent parsing. Finally, (6) legacy parsing libraries, such as sophisticated libraries for dates and times, may be directly incorporated into parser specifications. We illustrate the importance and utility of this rich collection of features by presenting its use on examples ranging from difficult programming language grammars to web server logs to binary data specification. We also show that our grammars have important compositionality properties and explain why such properties are important in modern applications such as automatic grammar induction. In terms of technical contributions, we provide a traditional high-level semantics for our new grammar formalization and show how to compile grammars into nondeterministic automata. These automata are stack-based, somewhat like conventional push-down automata, but are also equipped with environments to track data-dependent parsing state. We prove the correctness of our translation of data-dependent grammars into these new automata and then show how to implement the automata efficiently using a variation of Earley's parsing algorithm.
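The data-dependent parsing described above can be illustrated with a netstring-style sketch (ours, not code from the paper): a parsed length value is bound and then constrains how much payload is consumed.

```python
def parse_length_prefixed(data: bytes, pos: int = 0):
    """Toy data-dependent parse: a decimal length field, a ':' separator,
    then exactly that many payload bytes (netstring-style).
    Returns (payload, next_position); raises ValueError on malformed input."""
    end = data.index(b":", pos)      # find the separator (ValueError if absent)
    n = int(data[pos:end])           # bind the parsed length...
    start = end + 1
    if len(data) - start < n:
        raise ValueError("payload shorter than declared length")
    # ...and use the binding to control how much input the parser takes.
    return data[start:start + n], start + n
```

No pure context-free grammar can express "read N, then consume N bytes"; this is the kind of constraint YAKKER's variable bindings are meant to handle.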

61 citations


Book ChapterDOI
13 Sep 2010
TL;DR: This work presents a learning algorithm for context free grammars which uses positive data and membership queries, and proves its correctness under the identification in the limit paradigm.
Abstract: The Syntactic Concept Lattice is a residuated lattice based on the distributional structure of a language; the natural representation based on this is a context sensitive formalism. Here we examine the possibility of basing a context free grammar (CFG) on the structure of this lattice; in particular by choosing non-terminals to correspond to concepts in this lattice. We present a learning algorithm for context free grammars which uses positive data and membership queries, and prove its correctness under the identification in the limit paradigm. Since the lattice itself may be infinite, we consider only a polynomially bounded subset of the set of concepts, in order to get an efficient algorithm. We compare this on the one hand to learning algorithms for context free grammars, where the non-terminals correspond to congruence classes, and on the other hand to the use of context sensitive techniques such as Binary Feature Grammars and Distributional Lattice Grammars. The class of CFGs that can be learned in this way includes inherently ambiguous and thus non-deterministic languages; this approach therefore breaks through an important barrier in CFG inference.

34 citations


Journal ArticleDOI
TL;DR: A simple, direct proof of the fact that second-order ACGs are simulated by hyperedge replacement grammars is given, which implies that the string and tree generating power of the former is included in that of the latter.
Abstract: Second-order abstract categorial grammars (de Groote in Association for computational linguistics, 39th annual meeting and 10th conference of the European chapter, proceedings of the conference, pp. 148–155, 2001) and hyperedge replacement grammars (Bauderon and Courcelle in Math Syst Theory 20:83–127, 1987; Habel and Kreowski in STACS 87: 4th Annual symposium on theoretical aspects of computer science. Lecture notes in computer science, vol 247, Springer, Berlin, pp 207–219, 1987) are two natural ways of generalizing "context-free" grammar formalisms for string and tree languages. It is known that the string generating power of both formalisms is equivalent to (non-erasing) multiple context-free grammars (Seki et al. in Theor Comput Sci 88:191–229, 1991) or linear context-free rewriting systems (Weir in Characterizing mildly context-sensitive grammar formalisms, University of Pennsylvania, 1988). In this paper, we give a simple, direct proof of the fact that second-order ACGs are simulated by hyperedge replacement grammars, which implies that the string and tree generating power of the former is included in that of the latter. The normal form for tree-generating hyperedge replacement grammars given by Engelfriet and Maneth (Graph transformation. Lecture notes in computer science, vol 1764. Springer, Berlin, pp 15–29, 2000) can then be used to show that the tree generating power of second-order ACGs is exactly the same as that of hyperedge replacement grammars.

25 citations


Journal ArticleDOI
TL;DR: In adaptive star grammars, rules are actually schemata which, via the cloning of so-called multiple nodes, may adapt to potentially infinitely many contexts when they are applied, and they turn out to be restricted enough to share some of the basic characteristics of context-free devices.

24 citations


Journal ArticleDOI
TL;DR: If it is furthermore required that each rule of the general form A->w has a nonempty w, then a substantial subfamily of conjunctive languages can be generated, yet it remains unknown whether such grammars are as powerful as conjunctive grammars of the general form.

21 citations


Book ChapterDOI
06 Jul 2010
TL;DR: Three open questions in the theory of regulated rewriting are addressed, including whether every permitting random context grammar has a non-erasing equivalent and whether permitting random context grammars have the same generative capacity as matrix grammars without appearance checking.
Abstract: Three open questions in the theory of regulated rewriting are addressed. The first is whether every permitting random context grammar has a non-erasing equivalent. The second asks whether the same is true for matrix grammars without appearance checking. The third concerns whether permitting random context grammars have the same generative capacity as matrix grammars without appearance checking. The main result is a positive answer to the first question. For the other two, conjectures are presented. It is then deduced from the main result that at least one of the two holds.
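A permitting random context rule can be sketched as follows (an illustrative toy with single-character symbols, not code from the paper): a rule (A, w, P) may rewrite an occurrence of A only if every symbol of the permitting set P appears in the current sentential form.

```python
def applicable(form, rule):
    """A permitting random context rule (A, w, P) may rewrite an occurrence
    of A in the sentential form only if every symbol of the permitting
    set P occurs somewhere in the form."""
    lhs, _, permitting = rule
    return lhs in form and all(sym in form for sym in permitting)

def apply_rule(form, rule):
    """Rewrite the leftmost occurrence of the rule's left-hand side."""
    lhs, rhs, _ = rule
    i = form.index(lhs)
    return form[:i] + rhs + form[i + 1:]
```

The non-erasing question above asks whether rules with empty w (erasing productions) add generative power to this mechanism.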

20 citations


Journal ArticleDOI
TL;DR: In this article, it was shown that the membership problem for second-order non-linear abstract categorial grammars is decidable, and that Montague-like semantics yield a decidable text generation problem.
Abstract: In this paper we show that the membership problem for second-order non-linear Abstract Categorial Grammars is decidable. A consequence of that result is that Montague-like semantics yield a decidable text generation problem. Furthermore, the proof we propose is based on a new tool, Higher Order Intersection Signatures, which statically captures dynamic properties of λ-terms and is of interest in its own right.

16 citations


Journal ArticleDOI
TL;DR: It is shown that a graph grammar can be translated into an Event-B specification preserving its semantics, such that one can use several theorem provers available for Event- B to analyze the reachable states of the original graph grammar.
Abstract: Graph grammars may be used as a specification technique for different kinds of systems, especially in situations in which states are complex structures that can be adequately modeled as graphs (possibly with an attribute data part) and in which the behavior involves a large amount of parallelism and can be described as reactions to stimuli that can be observed in the state of the system. The verification of properties of such systems is a difficult task due to many aspects: in many situations the systems have an infinite number of states; states themselves are complex and large; there are a number of different computation possibilities due to the fact that rule applications may occur in parallel. There are already some approaches to verification of graph grammars based on model checking, but in these cases only finite-state systems can be analyzed. Other approaches propose over- and/or under-approximations of the state space, but in this case it is not possible to check arbitrary properties. In this work, we propose to use the Event-B formal method and its theorem proving tools to analyze graph grammars. We show that a graph grammar can be translated into an Event-B specification preserving its semantics, such that one can use several theorem provers available for Event-B to analyze the reachable states of the original graph grammar. The translation is based on a relational definition of graph grammars, which was shown to be equivalent to the Single-Pushout approach to graph grammars.

16 citations


Journal ArticleDOI
TL;DR: It is explained how making this distinction obviates the need for directed types in type-theoretic grammars and a simple grammatical formalism is sketched in which representations at all levels are lambda terms.
Abstract: This paper argues for the idea that in describing language we should follow Haskell Curry in distinguishing between the structure of an expression and its appearance or manifestation. It is explained how making this distinction obviates the need for directed types in type-theoretic grammars and a simple grammatical formalism is sketched in which representations at all levels are lambda terms. The lambda term representing the abstract structure of an expression is homomorphically translated to a lambda term representing its manifestation, but also to a lambda term representing its semantics.

14 citations


Book ChapterDOI
21 Jun 2010
TL;DR: This article presents a framework for grammars and grammar transformations using Agda, and implements the left-corner transformation for left-recursion removal and proves a language-inclusion property as use cases.
Abstract: Parser combinators are a popular tool for designing parsers in functional programming languages. If such combinators generate an abstract representation of the grammar as an intermediate step, it becomes easier to perform analyses and transformations that can improve the behaviour of the resulting parser. Grammar transformations must satisfy a number of invariants. In particular, they have to preserve the semantics associated with the grammar. Using conventional type systems, these constraints cannot be expressed satisfactorily, but as we show in this article, dependent types are a natural fit. We present a framework for grammars and grammar transformations using Agda. We implement the left-corner transformation for left-recursion removal and prove a language-inclusion property as use cases.
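The combinator style this paper builds on can be sketched in a few lines of Python (an illustrative sketch only; the paper itself works in Agda with dependent types). A parser here maps an input position to the set of positions it can reach, so ambiguity falls out naturally; note that naive combinators like these loop on left-recursive grammars, which is exactly what the left-corner transformation mentioned above removes.

```python
# A parser is a function (input string, position) -> set of reachable positions.
def lit(c):
    """Match a single literal character."""
    return lambda s, i: {i + 1} if i < len(s) and s[i] == c else set()

def seq(p, q):
    """Sequence: run q from every position p can reach."""
    return lambda s, i: {k for j in p(s, i) for k in q(s, j)}

def alt(p, q):
    """Alternation: union of both parsers' results."""
    return lambda s, i: p(s, i) | q(s, i)

def a_plus(s, i):
    """A -> 'a' A | 'a'  (right recursion is fine; left recursion is not)."""
    return alt(seq(lit("a"), a_plus), lit("a"))(s, i)

def parses(p, s):
    """Accept iff some parse consumes the entire input."""
    return len(s) in p(s, 0)
```

An abstract grammar representation, as the paper advocates, makes transformations like left-recursion removal possible before the combinators are run.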

Journal ArticleDOI
TL;DR: It is proved that twelve nonterminals are enough for cooperating distributed grammar systems working in the terminal derivation mode with two left-forbidding components (including erasing productions) to characterize the family of recursively enumerable languages.

Book ChapterDOI
18 Jan 2010
TL;DR: A lazy-evaluation based top-down parsing algorithm has been implemented as a set of higher-order functions (combinators) which support directly-executable specifications of fully general attribute grammars.
Abstract: A lazy-evaluation based top-down parsing algorithm has been implemented as a set of higher-order functions (combinators) which support directly-executable specifications of fully general attribute grammars. This approach extends aspects of previous approaches, and allows natural language processors to be constructed as modular and declarative specifications while accommodating ambiguous context-free grammars (including direct and indirect left-recursive rules), augmented with semantic rules with arbitrary attribute dependencies (including dependencies from right). This one-pass syntactic and semantic analysis method has polynomial time and space (w.r.t. the input length) for processing ambiguous input, and helps language developers build and test their models with little concern for the underlying computational methods.

Journal ArticleDOI
TL;DR: It is proved that the language family generated by Boolean grammars is effectively closed under injective gsm mappings and inverse gsm mappings.
Abstract: It is proved that the language family generated by Boolean grammars is effectively closed under injective gsm mappings and inverse gsm mappings (where gsm stands for a generalized sequential machine). The same results hold for conjunctive grammars, unambiguous Boolean grammars and unambiguous conjunctive grammars.

Journal ArticleDOI
TL;DR: A polynomial algorithm for deciding whether a given word belongs to a language generated by a given unidirectional Lambek grammar is presented.
Abstract: Lambek grammars provide a useful tool for studying formal and natural languages. The generative power of unidirectional Lambek grammars equals that of context-free grammars. However, no feasible algorithm was known for deciding membership in the corresponding formal languages. In this paper we present a polynomial algorithm for deciding whether a given word belongs to a language generated by a given unidirectional Lambek grammar.

01 Dec 2010
TL;DR: The various structures and rules that are needed to derive a semantic representation from the categorial view of a transformational syntactic analysis are illustrated.
Abstract: We first recall some basic notions on minimalist grammars and on categorial grammars. Next we briefly introduce partially commutative linear logic, and our representation of minimalist grammars within this categorial system, the so-called categorial minimalist grammars. Thereafter we briefly present λμ-DRT (Discourse Representation Theory), an extension of λ-DRT (compositional DRT) in the framework of λμ calculus: it avoids type raising and derives different readings from a single semantic representation, in a setting which follows discourse structure. We run a complete example which illustrates the various structures and rules that are needed to derive a semantic representation from the categorial view of a transformational syntactic analysis.

Journal ArticleDOI
01 Jan 2010
TL;DR: The new version of the TBL algorithm has been experimentally shown to be less sensitive to block size and population size, and to find solutions faster than the standard one.
Abstract: This paper describes an improved version of the TBL algorithm [Y. Sakakibara, Learning context-free grammars using tabular representations, Pattern Recognition 38 (2005) 1372-1383; Y. Sakakibara, M. Kondo, GA-based learning of context-free grammars using tabular representations, in: Proceedings of the 16th International Conference on Machine Learning (ICML-99), Morgan-Kaufmann, Los Altos, CA, 1999] for inference of context-free grammars in Chomsky Normal Form. The TBL algorithm is a novel approach to overcoming the hardness of learning context-free grammars from examples without structural information available. The algorithm represents the grammars by parsing tables, and thanks to this tabular representation the problem of grammar learning is reduced to the problem of partitioning the set of nonterminals. A genetic algorithm is used to solve the NP-hard partitioning problem. In the improved version a modified fitness function and a new specialized delete operator are applied. Computer simulations have been performed to determine the efficiency of the improved tabular representation. The set of experiments has been divided into two groups: in the first, learning the unknown context-free grammar proceeds without any extra information about grammatical structure; in the second, learning is supported by partial knowledge of the structure. In each of the performed experiments the influence of the partition block size in an initial population and of the population size on grammar induction has been tested. The new version of the TBL algorithm has been experimentally shown to be less sensitive to block size and population size, and to find solutions faster than the standard one.
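The tabular representation at the heart of TBL is the CYK parsing table; a minimal CNF recognizer over such a table (our sketch, not the paper's implementation) looks like this:

```python
def cyk(word, unary, binary, start="S"):
    """CYK recognition for a grammar in Chomsky Normal Form.
    unary:  {(A, 'a'), ...}   rules A -> a
    binary: {(A, B, C), ...}  rules A -> B C
    table[i][span] holds every nonterminal deriving word[i:i+span]."""
    n = len(word)
    table = [[set() for _ in range(n + 1)] for _ in range(n)]
    for i, ch in enumerate(word):
        table[i][1] = {A for (A, a) in unary if a == ch}
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            for split in range(1, span):
                for (A, B, C) in binary:
                    if B in table[i][split] and C in table[i + split][span - split]:
                        table[i][span].add(A)
    return n > 0 and start in table[0][n]
```

In TBL, learning reduces to choosing how to partition the nonterminal labels that populate such tables, which is the NP-hard problem the genetic algorithm attacks.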

Journal ArticleDOI
TL;DR: Some results on the power of tree controlled grammars are presented where the regular languages are restricted to some known subclasses of the family of regular languages.
Abstract: Tree controlled grammars are context-free grammars where the associated language only contains those terminal words which have a derivation where the word of any level of the corresponding derivation tree belongs to a given regular language. We present some results on the power of such grammars where we restrict the regular languages to some known subclasses of the family of regular languages.
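The level-wise control condition can be sketched directly (illustrative Python, not from the paper): read off the word formed by each level of the derivation tree and test it against the regular control language.

```python
import re

def levels(tree):
    """Yield the word (concatenated labels) of each level of a derivation
    tree. A tree is (label, [children]); leaves have an empty child list."""
    layer = [tree]
    while layer:
        yield "".join(label for (label, _) in layer)
        layer = [child for (_, kids) in layer for child in kids]

def tree_controlled_ok(tree, control):
    """Accept the derivation iff every level word lies in the control
    language, here given as a compiled regular expression."""
    return all(control.fullmatch(w) for w in levels(tree))
```

Restricting the control language to subclasses of the regular languages, as the paper does, restricts which derivations survive this filter and hence the generative power.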

Journal Article
TL;DR: The generative power and closure properties of the families of languages generated by such Petri net controlled grammars are investigated, and it is shown that these families form an infinite hierarchy with respect to the number of additional places.
Abstract: A context-free grammar and its derivations can be described by a Petri net, called a context-free Petri net, whose places and transitions correspond to the nonterminals and the production rules of the grammar, respectively, and whose tokens are separate instances of the nonterminals in a sentential form. Therefore, the control of the derivations in a context-free grammar can be implemented by adding some features to the associated Petri net. The addition of new places, and of new arcs from/to these new places to/from transitions of the net, leads to grammars controlled by k-Petri nets, i.e., Petri nets with k additional places. In the paper we investigate the generative power and give closure properties of the families of languages generated by such Petri net controlled grammars; in particular, we show that these families form an infinite hierarchy with respect to the number of additional places.

Book ChapterDOI
24 May 2010
TL;DR: Stochastic context-free grammars are extended such that the probability of applying a production can depend on the length of the subword that is generated from the application and show that existing algorithms for training and determining the most probable parse tree can easily be adapted to the extended model without losses in performance.
Abstract: We extend stochastic context-free grammars such that the probability of applying a production can depend on the length of the subword that is generated from the application and show that existing algorithms for training and determining the most probable parse tree can easily be adapted to the extended model without losses in performance. Furthermore we show that the extended model is suited to improve the quality of predictions of RNA secondary structures. The extended model may also be applied to other fields where SCFGs are used like natural language processing. Additionally some interesting questions in the field of formal languages arise from it.
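The extension can be sketched as an inside (weighted CYK) computation in which a rule's probability is a function of the length of the subword it derives (our sketch under that reading of the model; plain SCFGs are the special case where the length argument is ignored).

```python
def inside(word, unary, binary, prob, start="S"):
    """Inside probability for a CNF stochastic CFG whose rule probability
    may depend on the length of the derived subword:
    prob is a function (rule, length) -> float."""
    n = len(word)
    P = {}  # (nonterminal, start index, span length) -> inside probability
    for i, ch in enumerate(word):
        for (A, a) in unary:
            if a == ch:
                P[(A, i, 1)] = P.get((A, i, 1), 0.0) + prob((A, a), 1)
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            for (A, B, C) in binary:
                for k in range(1, span):
                    left = P.get((B, i, k), 0.0)
                    right = P.get((C, i + k, span - k), 0.0)
                    if left and right:
                        P[(A, i, span)] = (P.get((A, i, span), 0.0)
                                           + prob((A, B, C), span) * left * right)
    return P.get((start, 0, n), 0.0)
```

Because the length of each span is already an index of the dynamic-programming table, the length-dependent lookup adds no asymptotic cost, which is consistent with the paper's claim of no performance loss.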

Posted Content
TL;DR: A way of rewriting Minimalist Grammars as Linear Context-Free Rewriting Systems is proposed, making it possible to easily create a top-down parser, together with a method of refining the probabilistic field using algorithms from data compression.
Abstract: This paper describes a probabilistic top-down parser for minimalist grammars. Top-down parsers have the great advantage of having a certain predictive power during parsing, which takes place in a left-to-right reading of the sentence. Such parsers have already been well implemented and studied in the case of context-free grammars, which are already top-down, but they are difficult to adapt to minimalist grammars, which generate sentences bottom-up. I propose here a way of rewriting Minimalist Grammars as Linear Context-Free Rewriting Systems, making it possible to easily create a top-down parser. This rewriting also allows a probabilistic field to be placed on these grammars, which can be used to accelerate the parser. Finally, I propose a method of refining the probabilistic field using algorithms from data compression.

Proceedings ArticleDOI
28 Mar 2010
TL;DR: Tear-Insert-Fold grammars are introduced, adding tree-manipulation annotations to standard CFGs that allow typical abstract forms to be constructed directly from the grammar for the concrete syntax and provide a convenient and concise specification of the relationship between the set of derivation trees and the set of abstract trees for a parser.
Abstract: Context Free Grammars (CFGs) are simple and powerful formalisms for defining languages (sets of strings) whose semantics are specified hierarchically: the meaning of a string is determined by terminals and the meanings of substrings. This hierarchy is captured in the derivation tree corresponding to the string. Derivation trees usually contain more structure than is strictly required to determine the semantics of the string, so in practice a simplified or abstract syntax tree is used as an internal representation of a concrete text. Indeed, much of the work of a compiler or source-to-source translator may be described in terms of stepwise transformation of such trees, culminating in a final traversal during which the translated text is output. This paper introduces Tear-Insert-Fold grammars, which add tree-manipulation annotations to standard CFGs. These annotations allow typical abstract forms to be constructed directly from the grammar for the concrete syntax and provide a convenient and concise specification of the relationship between the set of derivation trees and the set of abstract trees for a parser. More significantly, for any TIF grammar Γ0 there is a TIF grammar Γ1 whose derivation trees are the abstract trees produced by Γ0.
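The flavor of derivation-tree-to-abstract-tree rewriting that such annotations specify can be illustrated with one common simplification (our sketch, not the TIF formalism itself): collapsing chains of unary nodes.

```python
def fold_chains(tree):
    """Collapse chains of unary nodes in a derivation tree, so that e.g.
    Expr -> Term -> Factor -> 'x' becomes just the leaf 'x'.
    A tree is (label, [children]); leaves have an empty child list.
    (Illustrative of the kind of rewriting TIF annotations specify
    declaratively, not the TIF formalism itself.)"""
    label, kids = tree
    kids = [fold_chains(k) for k in kids]
    if len(kids) == 1:
        return kids[0]   # tear out the singleton node
    return (label, kids)
```

Expression grammars are the classic case: operator-precedence levels introduce long unary chains in the derivation tree that carry no semantic content.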

Journal Article
TL;DR: It is proved that multigenerative grammar systems based on cooperating context-free grammatical components that simultaneously generate their strings in a rule-controlled or nonterminal-controlled rewriting way are equivalent to matrix grammars.
Abstract: Multigenerative grammar systems are based on cooperating context-free grammatical components that simultaneously generate their strings in a rule-controlled or nonterminal-controlled rewriting way; after this simultaneous generation is completed, all the generated terminal strings are combined together by some common string operations, such as concatenation, and placed into the generated languages of these systems. The present paper proves that these systems are equivalent to matrix grammars. In addition, we demonstrate that these systems with any number of grammatical components can be transformed to equivalent two-component versions of these systems. The paper points out that if these systems work in the leftmost rewriting way, they are more powerful than the systems working in a general way.


Proceedings Article
01 May 2010
TL;DR: It is found that the parsing performance of a STIG model is tied to the size of the underlying Tree Insertion Grammar, with a more compact grammar, a spinal STIG, outperforming a genuine STIG.
Abstract: We evaluate statistical parsing of French using two probabilistic models derived from the Tree Adjoining Grammar framework: a Stochastic Tree Insertion Grammars model (STIG) and a specific instance of this formalism, called the Spinal Tree Insertion Grammar model, which exhibits interesting properties with regard to data sparseness issues common to small treebanks such as the Paris 7 French Treebank. Using David Chiang's STIG parser (Chiang, 2003), we present results of various experiments we conducted to explore those models for French parsing. The grammar induction makes use of a head percolation table tailored for the French Treebank, which is provided in this paper. Using two evaluation metrics, we found that the parsing performance of a STIG model is tied to the size of the underlying Tree Insertion Grammar, with a more compact grammar, a spinal STIG, outperforming a genuine STIG. We finally note that a "spinal" framework seems to emerge in the literature. Indeed, the use of vertical grammars such as Spinal STIG instead of horizontal grammars such as PCFGs, afflicted with well-known data sparseness issues, seems to be a promising path toward better parsing performance.

Book ChapterDOI
17 Aug 2010
TL;DR: It is shown that every Muller context-free grammar can be transformed into a normal form grammar in polynomial space without increasing the size of the grammar, and that many decision problems can be solved in polynomial time for Muller context-free grammars in normal form.
Abstract: We define context-free grammars with Muller acceptance condition that generate languages of countable words. We establish several elementary properties of the class of Muller context-free languages including closure properties and others. We show that every Muller context-free grammar can be transformed into a normal form grammar in polynomial space without increasing the size of the grammar, and then we show that many decision problems can be solved in polynomial time for Muller context-free grammars in normal form. These problems include deciding whether the language generated by a normal form grammar contains only well-ordered, scattered, or dense words. In a further result we establish a limitedness property of Muller context-free grammars: If the language generated by a grammar contains only scattered words, then either there is an integer n such that each word of the language has Hausdorff rank at most n, or the language contains scattered words of arbitrarily large Hausdorff rank. We also show that it is decidable which of the two cases applies.


Book ChapterDOI
13 Sep 2010
TL;DR: This paper develops methods to find upper bounds for the unlabeled F1 performance that any UWNTS grammar can achieve over a given treebank and defines a new metric that is NP-Hard but solvable with specialized software.
Abstract: Unambiguous Non-Terminally Separated (UNTS) grammars have good learnability properties but are too restrictive to be used for natural language parsing. We present a generalization of UNTS grammars called Unambiguous Weakly NTS (UWNTS) grammars that preserve the learnability properties. Then, we study the problem of using them to parse natural language and evaluating against a gold treebank. If the target language is not UWNTS, there will be an upper bound on the parsing performance. In this paper we develop methods to find upper bounds for the unlabeled F1 performance that any UWNTS grammar can achieve over a given treebank. We define a new metric, show that its optimization is NP-hard but solvable with specialized software, and show a translation of the result to a bound for the F1. We do experiments with the WSJ10 corpus, finding an F1 bound of 76.1% for the UWNTS grammars over the POS tags alphabet.

Journal ArticleDOI
TL;DR: It is shown that Non-Associative Lambek grammars as well as their derivations can be defined using ACGs of order two, which solves a natural but still open question: can abstract categorial grammars (ACGs) represent usual categorial grammars?
Abstract: This paper solves a natural but still open question: can abstract categorial grammars (ACGs) represent usual categorial grammars? Despite their name and their claim to be a unifying framework, up to now there was no faithful representation of usual categorial grammars in ACGs. This paper shows that Non-Associative Lambek grammars as well as their derivations can be defined using ACGs of order two. To conclude, the outcomes of such a representation are discussed.

Proceedings ArticleDOI
01 Sep 2010