Monadic datalog and the expressive power of languages for Web information extraction
Georg Gottlob, Christoph Koch +1 more
TL;DR: The authors argue that MSO has the right expressiveness for Web information extraction, propose it as a yardstick for evaluating and comparing wrappers, and present a simple normal form for monadic datalog over trees.
Abstract:
Research on information extraction from Web pages (wrapping) has seen much activity recently (particularly systems implementations), but little work has been done on formally studying the expressiveness of the formalisms proposed or on the theoretical foundations of wrapping. In this paper, we first study monadic datalog over trees as a wrapping language. We show that this simple language is equivalent to monadic second-order logic (MSO) in its ability to specify wrappers. We believe that MSO has the right expressiveness required for Web information extraction and propose MSO as a yardstick for evaluating and comparing wrappers. Along the way, several other results on the complexity of query evaluation and query containment for monadic datalog over trees are established, and a simple normal form for this language is presented. Using the above results, we subsequently study the kernel fragment Elog− of the Elog wrapping language used in the Lixto system (a visual wrapper generator). Curiously, Elog− exactly captures MSO, yet is easier to use. Indeed, programs in this language can be entirely visually specified.
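To make the idea of monadic datalog over trees as a wrapping language more concrete, the following is a minimal Python sketch, not the paper's construction. It assumes a toy document tree and a hypothetical four-rule program. Every derived predicate is unary (it marks a set of nodes), and each rule has one of three simple shapes loosely modelled on the kind of normal form the abstract mentions. The names `TREE`, `PROGRAM`, `base_relations`, `evaluate`, and the rule tags `copy`, `step`, and `and` are all illustrative assumptions.

```python
from collections import defaultdict

# Hypothetical toy document tree: node id -> (label, ordered child ids).
TREE = {
    0: ("table", [1, 2]),
    1: ("tr", [3, 4]),
    2: ("tr", [5]),
    3: ("td", []),
    4: ("td", []),
    5: ("th", []),
}


def base_relations(tree):
    """Build the given (extensional) relations of the tree.

    Unary: one predicate per label, plus 'leaf'.
    Binary: 'firstchild' and 'nextsibling', plus their inverses.
    """
    unary = defaultdict(set)
    binary = defaultdict(set)
    for node, (label, children) in tree.items():
        unary["label_" + label].add(node)
        if children:
            binary["firstchild"].add((node, children[0]))
        else:
            unary["leaf"].add(node)
        for left, right in zip(children, children[1:]):
            binary["nextsibling"].add((left, right))
    for name in list(binary):
        binary[name + "_inv"] = {(b, a) for (a, b) in binary[name]}
    return unary, binary


# Assumed rule shapes (every head is a unary predicate):
#   ("copy", p, p0)      p(x) <- p0(x)
#   ("step", p, p0, B)   p(x) <- p0(x0), B(x0, x)
#   ("and",  p, p0, p1)  p(x) <- p0(x), p1(x)
PROGRAM = [
    ("copy", "row", "label_tr"),
    ("step", "child_of_row", "row", "firstchild"),
    ("step", "child_of_row", "child_of_row", "nextsibling"),
    ("and", "cell", "child_of_row", "label_td"),
]


def evaluate(rules, unary, binary):
    """Naive fixpoint: apply all rules until no new fact is derived."""
    facts = {name: set(nodes) for name, nodes in unary.items()}
    changed = True
    while changed:
        changed = False
        for rule in rules:
            kind, head = rule[0], rule[1]
            if kind == "copy":
                derived = set(facts.get(rule[2], set()))
            elif kind == "step":
                source = facts.get(rule[2], set())
                derived = {y for (x, y) in binary.get(rule[3], set()) if x in source}
            elif kind == "and":
                derived = facts.get(rule[2], set()) & facts.get(rule[3], set())
            else:
                raise ValueError(f"unknown rule kind: {kind}")
            known = facts.setdefault(head, set())
            if not derived <= known:
                known |= derived
                changed = True
    return facts


# The wrapper's output is the set of nodes marked by the predicate 'cell'.
cells = evaluate(PROGRAM, *base_relations(TREE))["cell"]
print(sorted(cells))  # [3, 4]
```

The recursive `child_of_row` rule walks along the sibling chain, which is how a program with only unary derived predicates can still reach every child of a node. The loop is a plain naive fixpoint chosen for readability; it makes no attempt to match the complexity results the abstract refers to.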
Citations
Journal Article
Web data extraction, applications and techniques
TL;DR: A structured and comprehensive overview of the literature in the field of Web Data Extraction is provided, covering applications at the Enterprise level and at the Social Web level; the latter make it possible to gather the large amount of structured data continuously generated and disseminated by Web 2.0, Social Media and Online Social Network users.
Book Chapter
Structural Properties of XPath Fragments
TL;DR: This work characterizes the expressive power of these language fragments in terms of both logics and tree patterns, and investigates closure properties, focusing on the ability to perform basic Boolean operations while remaining within the fragment.
Proceedings Article
Datalog+/-: A Family of Logical Knowledge Representation and Query Languages for New Applications
TL;DR: This paper discusses three paradigms ensuring decidability: chase termination, guardedness, and stickiness. Datalog+/- extends plain Datalog by features such as existentially quantified rule heads and restricts the rule syntax so as to achieve decidability and tractability.
Journal Article
Automata for XML---A survey
TL;DR: An overview of fundamental properties of the different kinds of automata used in XML processing is given, relating them to the four key aspects of XML processing: schemas, navigation, querying and transformation.
Proceedings Article
The Lixto data extraction project: back and forth between theory and practice
TL;DR: The Lixto project is presented, which is both a research project in database theory and a commercial enterprise that develops Web data extraction (wrapping) and Web service definition software. Theoretical results on monadic datalog over trees and on Elog, its close relative, are also presented.
References
Book
Introduction to Automata Theory, Languages, and Computation
TL;DR: This book is a rigorous exposition of formal languages and models of computation, with an introduction to computational complexity, appropriate for upper-level computer science undergraduates who are comfortable with mathematical arguments.
Book
Foundations of databases
TL;DR: This book discusses Languages, Computability, and Complexity, and the Relational Model, which aims to clarify the role of Semantic Data Models in the development of Query Language Design.
Book Chapter
Languages, automata, and logic
TL;DR: The subject of this chapter is the study of formal languages (mostly languages recognizable by finite automata) in the framework of mathematical logic.
Journal Article
Linear-time algorithms for testing the satisfiability of propositional horn formulae
William F. Dowling, Jean Gallier +1 more
TL;DR: The formulation of the satisfiability problem as a data flow problem appears to be new and suggests the possibility of improving efficiency using parallel processors.
Journal Article
Generalized finite automata theory with an application to a decision problem of second-order logic
TL;DR: The standard closure theorems are proved for the class of sets “recognizable” by finite algebras, and a generalization of Kleene's regularity theory is presented.