
Showing papers on "Context-free grammar published in 2007"


Proceedings ArticleDOI
07 Jul 2007
TL;DR: This tutorial gives a brief introduction to Backus Naur Form Grammars and a background into the use of grammars with Genetic Programming, before describing the inner workings of Grammatical Evolution and some of the more commonly used extensions.
Abstract: Grammatical Evolution is an automatic programming system that is a form of Genetic Programming that uses grammars to evolve structures. These structures can be in any form that can be specified using a grammar, including computer languages, graphs and neural networks. When evolving computer languages, multiple types can be handled in a completely transparent manner. This tutorial gives a brief introduction to Backus Naur Form grammars and a background into the use of grammars with Genetic Programming, before describing the inner workings of Grammatical Evolution and some of the more commonly used extensions.
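As a rough illustration of the genotype-to-phenotype mapping the abstract describes, here is a minimal sketch in Python. The grammar, codon values, and function name are our own toy choices, not the paper's; real Grammatical Evolution systems add wrapping limits, invalid-individual handling, and richer grammars.

```python
# Minimal sketch of Grammatical Evolution's mapping: each integer codon
# selects a production for the leftmost nonterminal, via codon modulo the
# number of available productions for that nonterminal.

GRAMMAR = {
    "<expr>": [["<expr>", "+", "<expr>"], ["x"], ["1"]],
}

def ge_map(codons, start="<expr>", max_steps=100):
    """Map a list of integer codons to a string of terminals."""
    form = [start]
    i = 0
    for _ in range(max_steps):
        # find the leftmost nonterminal, if any remain
        nts = [k for k, s in enumerate(form) if s in GRAMMAR]
        if not nts:
            return "".join(form)
        k = nts[0]
        rules = GRAMMAR[form[k]]
        choice = codons[i % len(codons)] % len(rules)  # wrap codons if exhausted
        i += 1
        form = form[:k] + rules[choice] + form[k + 1:]
    return None  # mapping did not terminate within max_steps
```

For example, the codon list `[0, 1, 2]` derives `<expr>` → `<expr>+<expr>` → `x+<expr>` → `x+1`.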

344 citations


Proceedings Article
01 Apr 2007
TL;DR: Two Markov chain Monte Carlo algorithms for Bayesian inference of probabilistic context free grammars (PCFGs) from terminal strings are presented, providing an alternative to maximum-likelihood estimation using the Inside-Outside algorithm.
Abstract: This paper presents two Markov chain Monte Carlo (MCMC) algorithms for Bayesian inference of probabilistic context free grammars (PCFGs) from terminal strings, providing an alternative to maximum-likelihood estimation using the Inside-Outside algorithm. We illustrate these methods by estimating a sparse grammar describing the morphology of the Bantu language Sesotho, demonstrating that with suitable priors Bayesian techniques can infer linguistic structure in situations where maximum likelihood methods such as the Inside-Outside algorithm only produce a trivial grammar.
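Both the Bayesian methods of this paper and the maximum-likelihood Inside-Outside algorithm they replace are built on inside probabilities. As a self-contained illustration (with a made-up two-rule grammar, not the Sesotho grammar of the paper), the inside recursion for a PCFG in Chomsky normal form can be sketched as:

```python
from collections import defaultdict

# Toy PCFG in Chomsky normal form: rule -> probability (made-up grammar).
BINARY = {("S", "A", "B"): 1.0}                        # S -> A B
LEXICAL = {("A", "a"): 1.0, ("B", "b"): 0.5, ("B", "a"): 0.5}

def inside_probability(tokens, start="S"):
    """P(string | grammar) via the CKY-style inside recursion."""
    n = len(tokens)
    inside = defaultdict(float)   # (nt, i, j): prob that nt derives tokens[i:j]
    for i, tok in enumerate(tokens):
        for (nt, term), p in LEXICAL.items():
            if term == tok:
                inside[(nt, i, i + 1)] += p
    for width in range(2, n + 1):            # build up over wider spans
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):        # split point
                for (nt, left, right), p in BINARY.items():
                    inside[(nt, i, j)] += p * inside[(left, i, k)] * inside[(right, k, j)]
    return inside[(start, 0, n)]
```

Here `inside_probability(["a", "b"])` is 1.0 × 1.0 × 0.5 = 0.5, while `["b", "a"]` gets probability 0 since `A` cannot derive `b`.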

214 citations


Proceedings ArticleDOI
05 Nov 2007
TL;DR: CESE, a tool that combines exhaustive enumeration of test inputs from a structured domain with symbolic execution driven test generation, targets programs whose valid inputs are determined by some context-free grammar; the concrete input syntax is abstracted with symbolic grammars, in which some original tokens are replaced with symbolic constants.
Abstract: We present CESE, a tool that combines exhaustive enumeration of test inputs from a structured domain with symbolic execution driven test generation. We target programs whose valid inputs are determined by some context free grammar. We abstract the concrete input syntax with symbolic grammars, where some original tokens are replaced with symbolic constants. This reduces the set of input strings that must be enumerated exhaustively. For each enumerated input string, which may contain symbolic constants, symbolic execution based test generation instantiates the constants based on program execution paths. The "template" generated by enumerating valid strings reduces the burden on the symbolic execution to generate syntactically valid inputs and helps exercise interesting code paths. Together, symbolic grammars provide a link between exhaustive enumeration of valid inputs and execution-directed symbolic test generation. Preliminary experiments with CESE show that the combination achieves better coverage than both pure enumerative test generation and pure directed symbolic test generation, with orders of magnitude less time and fewer generated inputs. In addition, CESE is able to automatically generate inputs that achieve coverage within 10% of manually constructed tests.
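The enumeration phase described above can be sketched as a bounded breadth-first derivation over a grammar in which one token is symbolic. The mini-grammar below is our own illustration, not CESE's input language: `NUM` stands in for any numeric literal, to be instantiated later by symbolic execution (not shown).

```python
from collections import deque

# Toy input grammar with a symbolic token: 'NUM' abstracts the class of
# numeric literals, so far fewer concrete strings need to be enumerated.
RULES = {"E": [["(", "E", ")"], ["NUM"]]}

def enumerate_templates(rules, start, max_len):
    """Exhaustively enumerate terminal strings (templates) up to max_len tokens."""
    results = set()
    queue = deque([(start,)])
    while queue:
        form = queue.popleft()
        if sum(1 for s in form if s not in rules) > max_len:
            continue                                  # too many terminals already
        nts = [i for i, s in enumerate(form) if s in rules]
        if not nts:
            results.add(form)                         # fully terminal: a template
            continue
        i = nts[0]                                    # expand leftmost nonterminal
        for alt in rules[form[i]]:
            queue.append(form[:i] + tuple(alt) + form[i + 1:])
    return results
```

With `max_len=5` this yields the three templates `NUM`, `( NUM )`, and `(( NUM ))`, each of which would then be handed to the symbolic-execution stage.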

84 citations


Book ChapterDOI
16 Jul 2007
TL;DR: This work observes that there is a simple linguistic characterization of the grammar ambiguity problem, and shows how to exploit this to conservatively approximate the problem based on local regular approximations and grammar unfoldings.
Abstract: It has been known since 1962 that the ambiguity problem for context-free grammars is undecidable. Ambiguity in context-free grammars is a recurring problem in language design and parser generation, as well as in applications where grammars are used as models of real-world physical structures. We observe that there is a simple linguistic characterization of the grammar ambiguity problem, and we show how to exploit this to conservatively approximate the problem based on local regular approximations and grammar unfoldings. As an application, we consider grammars that occur in RNA analysis in bioinformatics, and we demonstrate that our static analysis of context-free grammars is sufficiently precise and efficient to be practically useful.
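The paper's contribution is a conservative static analysis; by contrast, the naive approach is a semi-decision procedure that searches for a short string with more than one parse tree. A small sketch of that baseline (toy CNF grammar and names are ours) counts parse trees per span:

```python
# Classic ambiguous grammar, in Chomsky normal form:
#   E -> E R | 'a',   R -> P E,   P -> '+'
BINARY = {"E": [("E", "R")], "R": [("P", "E")]}
LEXICAL = {"E": ["a"], "P": ["+"]}

def count_parse_trees(tokens, start="E"):
    """Count distinct parse trees; a count > 1 for any string witnesses ambiguity."""
    n = len(tokens)
    memo = {}
    def count(nt, i, j):
        key = (nt, i, j)
        if key in memo:
            return memo[key]
        total = 0
        if j - i == 1 and tokens[i] in LEXICAL.get(nt, []):
            total += 1                                   # lexical derivation
        for left, right in BINARY.get(nt, []):
            for k in range(i + 1, j):                    # every split point
                total += count(left, i, k) * count(right, k, j)
        memo[key] = total
        return total
    return count(start, 0, n)
```

`a+a+a` has exactly two trees, `(a+a)+a` and `a+(a+a)`, so the grammar is ambiguous; the catch, and the motivation for conservative approximation, is that no bound on string length suffices to prove unambiguity.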

75 citations


Proceedings Article
22 Jul 2007
TL;DR: This work describes a method in which a minimal grammar is hierarchically refined using EM to give accurate, compact grammars, yet the resulting parser gives the best published accuracies on several languages, as well as the best generative parsing numbers in English.
Abstract: Treebank parsing can be seen as the search for an optimally refined grammar consistent with a coarse training treebank. We describe a method in which a minimal grammar is hierarchically refined using EM to give accurate, compact grammars. The resulting grammars are extremely compact compared to other high-performance parsers, yet the parser gives the best published accuracies on several languages, as well as the best generative parsing numbers in English. In addition, we give an associated coarse-to-fine inference scheme which vastly improves inference time with no loss in test set accuracy.

67 citations


Book ChapterDOI
03 Jul 2007
TL;DR: A negative answer is given, contrary to the conjectured positive one, by constructing a conjunctive grammar for the language \(\{ a^{4^{n}} : n \in \mathbb{N} \}\).
Abstract: Conjunctive grammars were introduced by A. Okhotin in [1] as a natural extension of context-free grammars with an additional operation of intersection in the body of any production of the grammar. Several theorems and algorithms for context-free grammars generalize to the conjunctive case. Still some questions remained open. A. Okhotin posed nine problems concerning those grammars. One of them was the question whether a conjunctive grammar over a unary alphabet can generate only regular languages. We give a negative answer, contrary to the conjectured positive one, by constructing a conjunctive grammar for the language \(\{ a^{4^{n}} : n \in \mathbb{N} \}\). We then generalise this result: for every set of numbers L whose representation in some k-ary system is a regular set, we show that \(\{ a^{k^{n}} : n \in L \}\) is generated by some conjunctive grammar over a unary alphabet.

61 citations


Journal ArticleDOI
TL;DR: This paper presents a general demand-driven evaluation algorithm for CRAGs, exemplified by the specification and computation of the nullable, first, and follow sets used in parser construction, a problem which is highly recursive and normally programmed by hand using an iterative algorithm.
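The hand-written iterative algorithm that the TL;DR alludes to is the classic joint fixpoint over the nullable, FIRST, and FOLLOW sets. A compact sketch on a two-rule toy grammar of our own (the paper's CRAG formulation replaces exactly this kind of code with declarative attribute equations):

```python
GRAMMAR = {
    "S": [["A", "b"]],
    "A": [["a", "A"], []],          # [] is the empty (epsilon) production
}

def nullable_first_follow(grammar, start="S"):
    nts = set(grammar)
    nullable = set()
    first = {nt: set() for nt in nts}
    follow = {nt: set() for nt in nts}
    follow[start].add("$")          # conventional end-of-input marker
    changed = True
    while changed:                  # iterate all three analyses to a joint fixpoint
        changed = False
        for nt, alts in grammar.items():
            for alt in alts:
                # nullable: every symbol of some alternative is nullable
                if nt not in nullable and all(s in nullable for s in alt):
                    nullable.add(nt)
                    changed = True
                for s in alt:       # FIRST: scan until a non-nullable symbol
                    add = first[s] if s in nts else {s}
                    if add - first[nt]:
                        first[nt] |= add
                        changed = True
                    if s not in nullable:
                        break
                for i, s in enumerate(alt):   # FOLLOW: look past each nonterminal
                    if s not in nts:
                        continue
                    rest_nullable = True
                    for t in alt[i + 1:]:
                        add = first[t] if t in nts else {t}
                        if add - follow[s]:
                            follow[s] |= add
                            changed = True
                        if t not in nullable:
                            rest_nullable = False
                            break
                    if rest_nullable and follow[nt] - follow[s]:
                        follow[s] |= follow[nt]
                        changed = True
    return nullable, first, follow
```

For this grammar the fixpoint gives nullable = {A}, FIRST(S) = {a, b}, and FOLLOW(A) = {b}.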

60 citations


Journal ArticleDOI
TL;DR: This paper describes Christiansen grammar evolution (CGE), a new evolutionary automatic programming algorithm that extends standard grammar evolution by replacing context-free grammars by Christiansen Grammars.
Abstract: This paper describes Christiansen grammar evolution (CGE), a new evolutionary automatic programming algorithm that extends standard grammar evolution (GE) by replacing context-free grammars with Christiansen grammars. GE only takes into account syntactic restrictions to generate valid individuals. CGE adds semantics to ensure that both semantically and syntactically valid individuals are generated. It is empirically shown that our approach improves GE performance and even allows the solution of some problems that are difficult to tackle with GE.

59 citations


Patent
28 Feb 2007
TL;DR: In this article, a system is disclosed for checking grammar and usage using a flexible portfolio of different mechanisms, and automatically providing a variety of different examples of standard usage, selected from analogous Web content.
Abstract: A system is disclosed for checking grammar and usage using a flexible portfolio of different mechanisms, and automatically providing a variety of different examples of standard usage, selected from analogous Web content. The system can be used for checking the grammar and usage in any application that involves natural language text, such as word processing, email, and presentation applications. The grammar and usage can be evaluated using several complementary evaluation modules, which may include one based on a trained classifier, one based on regular expressions, and one based on comparative searches of the Web or a local corpus. The evaluation modules can provide a set of suggested alternative segments with corrected grammar and usage. A followup, screened Web search based on the alternative segments, in context, may provide several different in-context examples of proper grammar and usage that the user can consider and select from.

59 citations


Journal ArticleDOI
TL;DR: Any parsing or labeling accuracy improvement from conditional estimation of WCFGs or conditional random fields (CRFs) over joint estimation of PCFGs or hidden Markov models (HMMs) is due to the estimation procedure rather than the change in model class, becausePCFGs and HMMs are exactly as expressive as W CFGs and chain-structured CRFs, respectively.
Abstract: This article studies the relationship between weighted context-free grammars (WCFGs), where each production is associated with a positive real-valued weight, and probabilistic context-free grammars (PCFGs), where the weights of the productions associated with a nonterminal are constrained to sum to one. Because the class of WCFGs properly includes the PCFGs, one might expect that WCFGs can describe distributions that PCFGs cannot. However, Z. Chi (1999, Computational Linguistics, 25(1):131--160) and S. P. Abney, D. A. McAllester, and F. Pereira (1999, In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics, pages 542--549, College Park, MD) proved that every WCFG distribution is equivalent to some PCFG distribution. We extend their results to conditional distributions, and show that every WCFG conditional distribution of parses given strings is also the conditional distribution defined by some PCFG, even when the WCFG's partition function diverges. This shows that any parsing or labeling accuracy improvement from conditional estimation of WCFGs or conditional random fields (CRFs) over joint estimation of PCFGs or hidden Markov models (HMMs) is due to the estimation procedure rather than the change in model class, because PCFGs and HMMs are exactly as expressive as WCFGs and chain-structured CRFs, respectively.
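The equivalence proof of Chi rests on renormalizing each rule weight by the "inner" partition values Z_A (the total weight of all trees rooted at A): p(A → α) = w(A → α) · ∏_{B∈α} Z_B / Z_A. A sketch of that construction for a convergent one-nonterminal WCFG of our own invention (the divergent case handled in this paper needs more care):

```python
# Each rule: (lhs, rhs_tuple) -> weight; symbols never on a LHS are terminals.
WEIGHTS = {("S", ("S", "S")): 0.1, ("S", ("a",)): 1.0}

def renormalize(weights, iters=200):
    """Chi (1999)-style conversion of a convergent WCFG into an equivalent PCFG:
    p(A -> alpha) = w(A -> alpha) * prod(Z_B for B in alpha) / Z_A,
    where Z_A is the total weight of all derivation trees rooted at A."""
    nts = {lhs for lhs, _ in weights}
    Z = {A: 0.0 for A in nts}
    for _ in range(iters):                       # fixpoint iteration for Z
        new = {A: 0.0 for A in nts}
        for (A, rhs), w in weights.items():
            prod = w
            for s in rhs:
                prod *= Z[s] if s in nts else 1.0
            new[A] += prod
        Z = new
    probs = {}
    for (A, rhs), w in weights.items():          # renormalize each rule
        prod = w
        for s in rhs:
            prod *= Z[s] if s in nts else 1.0
        probs[(A, rhs)] = prod / Z[A]
    return probs, Z
```

Here Z_S solves Z = 0.1·Z² + 1, i.e. Z ≈ 1.12702, and the renormalized rule probabilities sum to one by construction, while every parse tree keeps its relative weight.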

54 citations


Book
01 Jan 2007
TL;DR: This book provides an introduction to automata theory, covering finite state machines and regular languages, context-free grammars and pushdown automata, and Turing machines, computability, and undecidability, together with applications across computer science and linguistics.
Abstract: PART I: INTRODUCTION 1 Why Study Automata Theory? 2 Review of Mathematical Concepts 2.1 Logic 2.2 Sets 2.3 Relations 2.4 Functions 2.5 Closures 2.6 Proof Techniques 2.7 Reasoning about Programs 2.8 References 3 Languages and Strings 3.1 Strings 3.2 Languages 4 The Big Picture: A Language Hierarchy 4.1 Defining the Task: Language Recognition 4.2 The Power of Encoding 4.3 A Hierarchy of Language Classes 5 Computation 5.1 Decision Procedures 5.2 Determinism and Nondeterminism 5.3 Functions on Languages and Programs PART II: FINITE STATE MACHINES AND REGULAR LANGUAGES 6 Finite State Machines 6.2 Deterministic Finite State Machines 6.3 The Regular Languages 6.4 Programming Deterministic Finite State Machines 6.5 Nondeterministic FSMs 6.6 Interpreters for FSMs 6.7 Minimizing FSMs 6.8 Finite State Transducers 6.9 Bidirectional Transducers 6.10 Stochastic Finite Automata 6.11 Finite Automata, Infinite Strings: Buchi Automata 6.12 Exercises 7 Regular Expressions 7.1 What is a Regular Expression? 7.2 Kleene's Theorem 7.3 Applications of Regular Expressions 7.4 Manipulating and Simplifying Regular Expressions 8 Regular Grammars 8.1 Definition of a Regular Grammar 8.2 Regular Grammars and Regular Languages 8.3 Exercises 9 Regular and Nonregular Languages 9.1 How Many Regular Languages Are There? 
9.2 Showing That a Language Is Regular 9.3 Some Important Closure Properties of Regular Languages 9.4 Showing That a Language is Not Regular 9.5 Exploiting Problem-Specific Knowledge 9.6 Functions on Regular Languages 9.7 Exercises 10 Algorithms and Decision Procedures for Regular Languages 10.1 Fundamental Decision Procedures 10.2 Summary of Algorithms and Decision Procedures for Regular Languages 10.3 Exercises 11 Summary and References PART III: CONTEXT-FREE LANGUAGES AND PUSHDOWN AUTOMATA 12 Context-Free Grammars 12.1 Introduction to Grammars 12.2 Context-Free Grammars and Languages 12.3 Designing Context-Free Grammars 12.4 Simplifying Context-Free Grammars 12.5 Proving That a Grammar is Correct 12.6 Derivations and Parse Trees 12.7 Ambiguity 12.8 Normal Forms 12.9 Stochastic Context-Free Grammars 12.10 Exercises 13 Pushdown Automata 13.1 Definition of a (Nondeterministic) PDA 13.2 Deterministic and Nondeterministic PDAs 13.3 Equivalence of Context-Free Grammars and PDAs 13.4 Nondeterminism and Halting 13.5 Alternative Definitions of a PDA 13.6 Exercises 14 Context-Free and Noncontext-Free Languages 14.1 Where Do the Context-Free Languages Fit in the Big Picture? 
14.2 Showing That a Language is Context-Free 14.3 The Pumping Theorem for Context-Free Languages 14.4 Some Important Closure Properties of Context-Free Languages 14.5 Deterministic Context-Free Languages 14.6 Other Techniques for Proving That a Language is Not Context-Free 14.7 Exercises 15 Algorithms and Decision Procedures for Context-Free Languages 15.1 Fundamental Decision Procedures 15.2 Summary of Algorithms and Decision Procedures for Context-Free Languages 16 Context-Free Parsing 16.1 Lexical Analysis 16.2 Top-Down Parsing 16.3 Bottom-Up Parsing 16.4 Parsing Natural Languages 16.5 Stochastic Parsing 16.6 Exercises 17 Summary and References PART IV: TURING MACHINES AND UNDECIDABILITY 18 Turing Machines 18.1 Definition, Notation and Examples 18.2 Computing With Turing Machines 18.3 Turing Machines: Extensions and Alternative Definitions 18.4 Encoding Turing Machines as Strings 18.5 The Universal Turing Machine 18.6 Exercises 19 The Church-Turing Thesis 19.1 The Thesis 19.2 Examples of Equivalent Formalisms 20 The Unsolvability of the Halting Problem 20.1 The Language H is Semidecidable but Not Decidable 20.2 Some Implications of the Undecidability of H 20.3 Back to Turing, Church, and the Entscheidungsproblem 21 Decidable and Semidecidable Languages 21.2 Subset Relationships between D and SD 21.3 The Classes D and SD Under Complement 21.4 Enumerating a Language 21.5 Summary 21.6 Exercises 22 Decidability and Undecidability Proofs 22.1 Reduction 22.2 Using Reduction to Show that a Language is Not Decidable 22.3 Rice's Theorem 22.4 Undecidable Questions About Real Programs 22.5 Showing That a Language is Not Semidecidable 22.6 Summary of D, SD/D and (R)SD Languages that Include Turing Machine Descriptions 22.7 Exercises 23 Undecidable Languages That Do Not Ask Questions about Turing Machines 23.1 Hilbert's 10th Problem 23.2 Post Correspondence Problem 23.3 Tiling Problems 23.4 Logical Theories 23.5 Undecidable Problems about Context-Free Languages APPENDIX C: HISTORY, 
PUZZLES, AND POEMS 43 Part I: Introduction 43.1 The 15-Puzzle Part II: Finite State Machines and Regular Languages 44.1 Finite State Machines Predate Computers 44.2 The Pumping Theorem Inspires Poets REFERENCES INDEX Appendices for Automata, Computability and Complexity: Theory and Applications: * Math Background* Working with Logical Formulas* Finite State Machines and Regular Languages* Context-Free Languages and PDAs* Turing Machines and Undecidability* Complexity* Programming Languages and Compilers* Tools for Programming, Databases and Software Engineering* Networks* Security* Computational Biology* Natural Language Processing* Artificial Intelligence and Computational Reasoning* Art & Entertainment: Music & Games* Using Regular Expressions* Using Finite State Machines and Transducers* Using Grammars

Journal ArticleDOI
TL;DR: This work introduces event-driven Grammars, a kind of graph grammars that are especially suited for visual modelling environments generated by meta-modelling and their combination with triple graph transformation systems.
Abstract: In this work we introduce event-driven grammars, a kind of graph grammars that are especially suited for visual modelling environments generated by meta-modelling. Rules in these grammars may be triggered by user actions (such as creating, editing or connecting elements) and in their turn may trigger other user-interface events. Their combination with triple graph transformation systems allows constructing and checking the consistency of the abstract syntax graph while the user is building the concrete syntax model, as well as managing the layout of the concrete syntax representation. As an example of these concepts, we show the definition of a modelling environment for UML sequence diagrams. A discussion is also presented of methodological aspects for the generation of environments for visual languages with multiple views, its connection with triple graph grammars, the formalization of the latter in the double pushout approach and its extension with an inheritance concept.

Proceedings ArticleDOI
03 Sep 2007
TL;DR: CESI, an algorithm that combines exhaustive enumeration of test inputs from a structured domain with symbolic execution driven test generation, using symbolic grammars, in which the original tokens are replaced with symbolic constants, to link enumerative grammar-based input generation with symbolic directed testing.
Abstract: We present CESI, an algorithm that combines exhaustive enumeration of test inputs from a structured domain with symbolic execution driven test generation. CESI is a hybrid of two predominant techniques: specification-based enumerative test generation (which exhaustively generates all possible inputs satisfying some constraint) and symbolic directed test generation (which explores program paths based on symbolic path constraint solving). We target programs whose valid inputs are determined by some context free grammar. We introduce symbolic grammars, in which the original tokens are replaced with symbolic constants, linking enumerative grammar-based input generation with symbolic directed testing. Symbolic grammars abstract the concrete input syntax, thus reducing the set of input strings that must be enumerated exhaustively. For each enumerated input string, which may contain symbolic constants, symbolic execution based test generation instantiates the constants based on program execution paths. The "template" generated by enumerating valid strings reduces the burden on the symbolic execution to generate syntactically valid inputs and hence exercise interesting code paths. Together, symbolic grammars provide a link between exhaustive enumeration of valid inputs and execution-directed symbolic test generation. In preliminary experiments, CESI performs better than either the enumerative or the symbolic technique used alone.

Book ChapterDOI
09 Jul 2007
TL;DR: A safe, conservative approach is presented, where the approximations cannot result in overlooked ambiguous cases and the complexity of the verification is analyzed, and formal comparisons are provided with several other ambiguity detection methods.
Abstract: The ability to detect ambiguities in context-free grammars is vital for their use in several fields, but the problem is undecidable in the general case. We present a safe, conservative approach, where the approximations cannot result in overlooked ambiguous cases. We analyze the complexity of the verification, and provide formal comparisons with several other ambiguity detection methods.

01 Jan 2007
TL;DR: It is argued that the tools and techniques of grammar engineering provide a means to take the development and evaluation of syntactic hypothesis testing to a new level, with theoretical ideas validated through the development of explicit grammars which can relate strings from some fragment.
Abstract: In this paper, I argue that the tools and techniques of grammar engineering provide a means to take the development and evaluation of syntactic hypothesis testing to a new level. Grammar engineering is the process of creating machine-readable implementations of formal grammars. Traditionally, linguistic hypotheses are encoded as statements within a grammatical theory and tested by collecting relevant examples and manually verifying that the grammars correctly predict the grammaticality and linguistic structure of those examples. Computerized implementations of their grammars allow linguists to more efficiently and effectively test hypotheses, for two reasons: First, languages are made up of many subsystems with complex interactions. Linguists generally focus on just one subsystem at a time, yet the predictions of any particular analysis cannot be calculated independently of the interacting subsystems. With implemented grammars, the computer can track the effects of all aspects of the implementation while the linguist focuses on developing just one. Second, automated application of grammars to test suites and naturally occurring data allows for much more thorough testing of linguistic analyses against thousands as opposed to tens of examples, including examples not anticipated by the linguist. This work is situated within the Montagovian tradition of the "method of fragments" (Montague, 1974; Partee, 1979; Gazdar et al., 1985). In this methodology, theoretical ideas are validated (and extended) through the development of explicit grammars which can relate strings from some fragment.

Journal ArticleDOI
TL;DR: The goal is to make it possible for linguistically untrained programmers to write linguistically correct application grammars encoding the semantics of special domains, and the type system of GF guarantees that grammaticality is preserved.
Abstract: The Grammatical Framework GF is a grammar formalism designed for multilingual grammars. A multilingual grammar has a shared representation, called abstract syntax, and a set of concrete syntaxes that map the abstract syntax to different languages. A GF grammar consists of modules, which can share code through inheritance, but which can also hide information to achieve division of labour between grammarians working on different modules. The goal is to make it possible for linguistically untrained programmers to write linguistically correct application grammars encoding the semantics of special domains. Such programmers can rely on resource grammars, written by linguists, which play the role of standard libraries. Application grammarians use resource grammars through abstract interfaces, and the type system of GF guarantees that grammaticality is preserved. The ongoing GF resource grammar project provides resource grammars for ten languages. In addition to their use as libraries, resource grammars serve as an experiment showing how much grammar code can be shared between different languages.
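The core architectural idea, one shared abstract syntax with several concrete linearizations, can be illustrated with a deliberately tiny Python sketch. This is our own toy illustration of the design, not GF notation (GF is its own typed grammar language), and the second "language" here is just a formal notation rather than a natural language:

```python
from dataclasses import dataclass

# Abstract syntax: one constructor for a tiny predication fragment.
@dataclass
class Pred:
    subj: str       # e.g. "cat"
    verb: str       # e.g. "sleep"

# Two concrete syntaxes linearize the same abstract tree differently.
def lin_english(t: Pred) -> str:
    return f"the {t.subj} {t.verb}s"       # naive 3rd-person-singular morphology

def lin_logic(t: Pred) -> str:
    return f"{t.verb}({t.subj})"           # a formal notation as a second "language"

tree = Pred("cat", "sleep")                # the shared abstract representation
```

In GF proper, the abstract module plays the role of `Pred`, each concrete module plays the role of a `lin_*` function, and the type system checks that every linearization covers every abstract constructor.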

Proceedings ArticleDOI
09 Jul 2007
TL;DR: This work has built an interface compiler that takes the interface grammar for a component as input and generates a stub for that component, which can be used to replace that component during state space exploration, or to provide an executable environment for the component under verification.
Abstract: We propose an interface specification language based on grammars for modular software model checking. In our interface specification language, component interfaces are specified as context free grammars. An interface grammar for a component specifies the sequences of method invocations that are allowed by that component. Using interface grammars one can specify nested call sequences that cannot be specified using interface specification formalisms that rely on finite state machines. Moreover, our interface grammars allow specification of semantic predicates and actions, which are Java code segments that can be used to express additional interface constraints. We have built an interface compiler that takes the interface grammar for a component as input and generates a stub for that component. The resulting stub is a table-driven parser generated from the input interface grammar. Invocation of a method within the component becomes the lookahead symbol for the stub/parser. The stub/parser uses a parser stack, the lookahead, and a parse table to guide the parsing. The semantic predicates and semantic actions that appear in the right hand sides of the production rules are executed when they appear at the top of the stack. We conducted a case study by writing an interface grammar for the Enterprise JavaBeans (EJB) persistence interface. Using our interface compiler we automatically generated an EJB stub using the EJB interface grammar. We used the JPF model checker to check EJB clients using this automatically generated EJB stub. Our results show that EJB clients can be verified efficiently using our approach.
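To see why grammars are strictly more expressive here than finite state machines, consider a hypothetical interface grammar for a transactional component with arbitrarily nested transactions. The grammar and method names below are our own illustration, not the paper's EJB case study, and the generated stub is reduced to a plain recursive-descent check over a recorded call trace:

```python
# Hypothetical interface grammar for a transactional component:
#   Session -> 'open' Txns 'close'
#   Txns    -> 'begin' Txns 'commit' Txns | epsilon
# Arbitrarily deep begin/commit nesting cannot be captured by a
# finite-state interface specification, but a CFG handles it directly.

def accepts(calls):
    """Recursive-descent check that a method-call trace matches the grammar.
    Returns True iff the whole trace is a valid Session."""
    def txns(i):
        # Txns -> 'begin' Txns 'commit' Txns | epsilon; the production is
        # chosen deterministically from the lookahead call.
        if i < len(calls) and calls[i] == "begin":
            j = txns(i + 1)
            if j is not None and j < len(calls) and calls[j] == "commit":
                return txns(j + 1)
            return None                       # unmatched 'begin'
        return i                              # epsilon
    if not calls or calls[0] != "open":
        return False
    j = txns(1)
    return j is not None and j + 1 == len(calls) and calls[j] == "close"
```

The paper's stubs go further: they parse incrementally as the client under test makes calls, and execute semantic predicates and actions attached to the productions.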

Journal Article
TL;DR: It is proved that every recursively enumerable language can be generated by a graph-controlled grammar with only two nonterminal symbols when both symbols are used in the appearance checking mode.
Abstract: We refine the classical notion of the nonterminal complexity of graph-controlled grammars, programmed grammars, and matrix grammars by also counting, in addition, the number of nonterminal symbols that are actually used in the appearance checking mode. We prove that every recursively enumerable language can be generated by a graph-controlled grammar with only two nonterminal symbols when both symbols are used in the appearance checking mode. This result immediately implies that programmed grammars with three nonterminal symbols where two of them are used in the appearance checking mode as well as matrix grammars with three nonterminal symbols all of them used in the appearance checking mode are computationally complete. Moreover, we prove that matrix grammars with four nonterminal symbols with only two of them being used in the appearance checking mode are computationally complete, too. On the other hand, every language is recursive if it is generated by a graph-controlled grammar with an arbitrary number of nonterminal symbols but only one of the nonterminal symbols being allowed to be used in the appearance checking mode. This implies, in particular, that the result proving the computational completeness of graph-controlled grammars with two nonterminal symbols and both of them being used in the appearance checking mode is already optimal with respect to the overall number of nonterminal symbols as well as with respect to the number of nonterminal symbols used in the appearance checking mode, too. Finally, we also investigate in more detail the computational power of several language families which are generated by graph-controlled, programmed grammars or matrix grammars, respectively, with a very small number of nonterminal symbols and therefore are proper subfamilies of the family of recursively enumerable languages.

Proceedings ArticleDOI
23 Jun 2007
TL;DR: This paper combines aspects of previous approaches and presents a method by which parsers can be built as modular and efficient executable specifications of ambiguous grammars containing unconstrained left recursion.
Abstract: In functional and logic programming, parsers can be built as modular executable specifications of grammars, using parser combinators and definite clause grammars respectively. These techniques are based on top-down backtracking search. Commonly used implementations are inefficient for ambiguous languages, cannot accommodate left-recursive grammars, and require exponential space to represent parse trees for highly ambiguous input. Memoization is known to improve efficiency, and work by other researchers has had some success in accommodating left recursion. This paper combines aspects of previous approaches and presents a method by which parsers can be built as modular and efficient executable specifications of ambiguous grammars containing unconstrained left recursion.
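The combination of memoization with support for left recursion can be illustrated in miniature. The sketch below is not the paper's combinator library: instead of its curtailment technique it iterates the memo table to a fixpoint, which is simpler to present (though less efficient) and likewise accepts directly left-recursive grammars:

```python
RULES = {"E": [["E", "+", "a"], ["a"]]}       # direct left recursion

def recognize(rules, tokens, start):
    """Memoized top-down recognition that tolerates left recursion by
    iterating the memo table of end-position sets to a fixpoint."""
    n = len(tokens)
    memo = {}   # (nonterminal, start position) -> set of end positions found so far

    def expand(sym, j):
        # One pass over sym's alternatives using current memo approximations.
        if sym not in rules:                                   # terminal symbol
            return {j + 1} if j < n and tokens[j] == sym else set()
        ends = set()
        for alt in rules[sym]:
            positions = {j}
            for s in alt:
                nxt = set()
                for p in positions:
                    if s in rules:
                        nxt |= memo.setdefault((s, p), set())  # approximation
                    else:
                        nxt |= expand(s, p)
                positions = nxt
            ends |= positions
        return ends

    memo[(start, 0)] = set()
    changed = True
    while changed:                                             # monotone fixpoint
        changed = False
        snapshot = list(memo)
        for key in snapshot:
            new = expand(*key)
            if new - memo[key]:
                memo[key] |= new
                changed = True
        if len(memo) > len(snapshot):      # new (sym, pos) pairs were discovered
            changed = True
    return memo[(start, 0)]
```

On `a+a+a` the recognizer returns end positions {1, 3, 5}: a naive top-down parser would instead loop forever on `E -> E + a`.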

Journal ArticleDOI
TL;DR: A Java library for substructure matching that features easy-to-read syntax and extensibility and converts the found match from the internal format of MQL to the format of the external toolkit.
Abstract: We have developed a Java library for substructure matching that features easy-to-read syntax and extensibility. This molecular query language (MQL) is grounded on a context-free grammar, which allows for straightforward modification and extension. The formal description of MQL is provided in this paper. Molecule primitives are atoms, bonds, properties, branching, and rings. User-defined features can be added via a Java interface. In MQL, molecules are represented as graphs. Substructure matching was implemented using the Ullmann algorithm because of favorable run-time performance. The Ullmann algorithm carries out a fast subgraph isomorphism search by combining backtracking with effective forward checking. MQL software design was driven by the aim to facilitate the use of various cheminformatics toolkits. Two Java interfaces provide a bridge from our MQL package to an external toolkit: the first one provides the matching rules for every feature of a particular toolkit; the second one converts the found match from the internal format of MQL to the format of the external toolkit.

Journal ArticleDOI
TL;DR: The design and implementation of a parser combinator library in Newspeak, a new language in the Smalltalk family, which allows the grammar to be specified as a separate class or mixin, independent of tools that rely upon it such as parsers, syntax colorizers etc.

Journal ArticleDOI
TL;DR: A context-free grammatical inference algorithm operating on positive data only is described, which integrates an information theoretic constituent likelihood measure together with more traditional heuristics based on substitutability and frequency.
Abstract: This paper describes the winning entry to the Omphalos context free grammar learning competition. We describe a context-free grammatical inference algorithm operating on positive data only, which integrates an information theoretic constituent likelihood measure together with more traditional heuristics based on substitutability and frequency. The competition is discussed from the perspective of a competitor. We discuss a class of deterministic grammars, the Non-terminally Separated (NTS) grammars, that have a property relied on by our algorithm, and consider the possibilities of extending the algorithm to larger classes of languages.

Patent
Mehryar Mohri1
18 Sep 2007
TL;DR: In this paper, the output rules are output in a specific format that specifies, for each rule, the left-hand non-terminal symbol, a single right-hand non-terminal symbol, and zero, one, or more terminal symbols.
Abstract: Context-free grammars generally comprise a large number of rules, where each rule defines how a string of symbols is generated from a different series of symbols. While techniques for creating finite-state automata from the rules of context-free grammars exist, these techniques require an input grammar to be strongly regular. Systems and methods that convert the rules of a context-free grammar into a strongly regular grammar include transforming each input rule into a set of output rules that approximate the input rule. The output rules are all right- or left-linear and are strongly regular. In various exemplary embodiments, the output rules are output in a specific format that specifies, for each rule, the left-hand non-terminal symbol, a single right-hand non-terminal symbol, and zero, one or more terminal symbols. If the input context-free grammar rule is weighted, the weight of that rule is distributed and assigned to the output rules.

Proceedings ArticleDOI
26 Apr 2007
TL;DR: By modifying the algorithm of Uno and Yagiura (2000) for the closely related problem of finding all common intervals of two permutations, this paper achieves a linear-time algorithm for the permutation factorization problem.
Abstract: Factoring a Synchronous Context-Free Grammar into an equivalent grammar with a smaller number of nonterminals in each rule enables synchronous parsing algorithms of lower complexity. The problem can be formalized as searching for the tree-decomposition of a given permutation with the minimal branching factor. In this paper, by modifying the algorithm of Uno and Yagiura (2000) for the closely related problem of finding all common intervals of two permutations, we achieve a linear-time algorithm for the permutation factorization problem. We also use the algorithm to analyze the maximum SCFG rule length needed to cover hand-aligned data from various language pairs.
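The common intervals driving the factorization can be sketched in a few lines: a span of a permutation is an interval exactly when the values it covers form a contiguous range (max minus min equals the span length). The quadratic enumeration below only illustrates the structure that the Uno-Yagiura-style algorithm finds in linear time; it is not the paper's algorithm.

```python
def common_intervals(perm):
    """Return all spans (i, j), inclusive, of perm whose values form a
    contiguous range -- the candidate constituents when factoring an
    SCFG permutation into a tree decomposition."""
    out = []
    for i in range(len(perm)):
        lo = hi = perm[i]
        for j in range(i, len(perm)):
            lo, hi = min(lo, perm[j]), max(hi, perm[j])
            if hi - lo == j - i:   # contiguous value range => interval
                out.append((i, j))
    return out

# The permutation (2, 0, 1, 3) factors as ((2, (0, 1)), 3):
iv = common_intervals([2, 0, 1, 3])
```

Nesting the intervals found this way yields the tree decomposition; a rule is binarizable precisely when every node of that tree has branching factor two.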

Journal ArticleDOI
TL;DR: The recursive descent parsing method for the context-free grammars is extended for their generalization, Boolean Grammars, which include explicit set-theoretic operations in the formalism of rules and which are formally defined by language equations.
Abstract: The recursive descent parsing method for the context-free grammars is extended for their generalization, Boolean grammars, which include explicit set-theoretic operations in the formalism of rules and which are formally defined by language equations. The algorithm is applicable to a subset of Boolean grammars. The complexity of a direct implementation varies between linear and exponential, while memoization keeps it down to linear.
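Boolean grammars extend context-free rules with set-theoretic operations such as conjunction and negation. As a loose illustration (a memoized span recognizer, not the paper's recursive descent algorithm), the conjunctive grammar S → AB & DC with A → aA | ε, B → bBc | ε, D → aDb | ε, C → cC | ε recognizes { aⁿbⁿcⁿ }, a language no context-free grammar can define:

```python
from functools import lru_cache

def recognize(s):
    """Recognize { a^n b^n c^n } via the conjunctive grammar
    S -> AB & DC;  A -> aA|e;  B -> bBc|e;  D -> aDb|e;  C -> cC|e.
    AB yields a^i b^n c^n and DC yields a^n b^n c^j, so the
    conjunction is exactly a^n b^n c^n.  Memoizing the span
    predicates keeps the recognizer polynomial."""
    n = len(s)

    @lru_cache(maxsize=None)
    def A(i, j):                      # a*
        return all(c == "a" for c in s[i:j])

    @lru_cache(maxsize=None)
    def C(i, j):                      # c*
        return all(c == "c" for c in s[i:j])

    @lru_cache(maxsize=None)
    def B(i, j):                      # b^k c^k
        return i == j or (j - i >= 2 and s[i] == "b"
                          and s[j - 1] == "c" and B(i + 1, j - 1))

    @lru_cache(maxsize=None)
    def D(i, j):                      # a^k b^k
        return i == j or (j - i >= 2 and s[i] == "a"
                          and s[j - 1] == "b" and D(i + 1, j - 1))

    AB = any(A(0, k) and B(k, n) for k in range(n + 1))
    DC = any(D(0, k) and C(k, n) for k in range(n + 1))
    return AB and DC                  # the conjunction S -> AB & DC
```

Conjunction costs only one extra recognition pass over the same span, which is why the paper's memoized variant can stay linear on its grammar subset.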

Proceedings ArticleDOI
17 Jun 2007
TL;DR: This work introduces a novel method for representing and classifying events in video sequences using reversible context-free grammars and demonstrates the efficacy of the learning algorithm and the event detection method applied to traffic video sequences.
Abstract: Automatic detection of dynamic events in video sequences has a variety of applications including visual surveillance and monitoring, video highlight extraction, intelligent transportation systems, video summarization, and many more. Learning an accurate description of the various events in real-world scenes is challenging owing to the limited user-labeled data as well as the large variations in the pattern of the events. Pattern differences arise either from the nature of the events themselves, such as spatio-temporal events, or from missing or ambiguous data interpretation by computer vision methods. In this work, we introduce a novel method for representing and classifying events in video sequences using reversible context-free grammars. The grammars are learned using a semi-supervised learning method. More concretely, using the classification entropy as a heuristic cost function, the grammars are iteratively learned by a search method. Experimental results demonstrating the efficacy of the learning algorithm and the event detection method applied to traffic video sequences are presented.

Book ChapterDOI
28 Jul 2007
TL;DR: This paper offers a retrospective analysis and evaluation of Noam Chomsky's Syntactic Structures, arguing that several widely believed claims about its contributions are not true.
Abstract: Syntactic Structures (Chomsky [6]) is widely believed to have laid the foundations of a cognitive revolution in linguistic science, and to have presented (i) the first use in linguistics of powerful new ideas regarding grammars as generative systems, (ii) a proof that English was not a regular language, (iii) decisive syntactic arguments against context-free phrase structure grammar description, and (iv) a demonstration of how transformational rules could provide a formal solution to those problems. None of these things are true. This paper offers a retrospective analysis and evaluation.

01 Jan 2007
TL;DR: It is argued that context-free grammars are not sufficiently expressive to handle important use cases in spatial intention recognition, and it is shown that Tree Adjoining Grammars can be used to handle rule-rule constraints.
Abstract: In its most general form, the problem of inferring the intentions of a mobile user from his or her spatial behavior is equivalent to the plan recognition problem, which is known to be intractable. Tractable special cases of the problem are therefore of great practical interest. Using formal grammars, intention recognition problems can be stated as parsing problems in a way that makes the connection between expressiveness and complexity explicit. We argue that context-free grammars are not sufficiently expressive to handle important use cases. Furthermore, we identify three types of constraints on the grammar's productions that may arise in spatial intention recognition: rule-at-location constraints, rule-rule constraints, and complex rule-location constraints. Finally, we show that Tree Adjoining Grammars can be used to handle rule-rule constraints.

Journal IssueDOI
TL;DR: The technique maps a program's hierarchical structure to a context-free grammar, normalizes the grammar, and uses a fast check for homomorphism between the normalized grammars.
Abstract: Computer viruses continue to proliferate despite the use of virus detection systems (VDS). This is due to the inability of VDS to detect variants not represented in signature databases. Detection systems look for contiguous byte sequences, use regular expressions for noncontiguous sequences, or detect initial behavior within a sandbox. Recent research has focused on using control-flow graph isomorphism in detection. These techniques are ineffective at detecting some polymorphs, which change their byte sequences and initial behavior and produce nonisomorphic control-flow graphs. Our approach compares program hierarchical structure. We observed that polymorphic instances are variants of the same program, that these variants use the same algorithm, and that a program's algorithm determines its hierarchical structure. Our technique maps a program's hierarchical structure to a context-free grammar, normalizes the grammar, and uses a fast check for homomorphism between the normalized grammars. © 2007 Alcatel-Lucent.

Proceedings ArticleDOI
29 Jun 2007
TL;DR: This paper describes how grammar-based language models for speech recognition systems can be generated from Grammatical Framework (GF) grammars, which enables rapid development of portable, multilingual and easily modifiable speech recognition applications.
Abstract: This paper describes how grammar-based language models for speech recognition systems can be generated from Grammatical Framework (GF) grammars. Context-free grammars and finite-state models can be generated in several formats: GSL, SRGS, JSGF, and HTK SLF. In addition, semantic interpretation code can be embedded in the generated context-free grammars. This enables rapid development of portable, multilingual and easily modifiable speech recognition applications.