Open Access Journal Article

Monadic datalog and the expressive power of languages for Web information extraction

Georg Gottlob, +1 more
01 Jan 2004, Vol. 51, Iss. 1, pp. 74-113
TLDR
The authors argue that MSO has the right expressiveness for Web information extraction, propose it as a yardstick for evaluating and comparing wrappers, and present a simple normal form for monadic datalog over trees.
Abstract
Research on information extraction from Web pages (wrapping) has seen much activity recently (particularly systems implementations), but little work has been done on formally studying the expressiveness of the formalisms proposed or on the theoretical foundations of wrapping. In this paper, we first study monadic datalog over trees as a wrapping language. We show that this simple language is equivalent to monadic second order logic (MSO) in its ability to specify wrappers. We believe that MSO has the right expressiveness required for Web information extraction and propose MSO as a yardstick for evaluating and comparing wrappers. Along the way, several other results on the complexity of query evaluation and query containment for monadic datalog over trees are established, and a simple normal form for this language is presented. Using the above results, we subsequently study the kernel fragment Elog− of the Elog wrapping language used in the Lixto system (a visual wrapper generator). Curiously, Elog− exactly captures MSO, yet is easier to use. Indeed, programs in this language can be entirely visually specified.
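
As a rough illustration of what a monadic datalog wrapper over a tree might look like, the sketch below evaluates a tiny program by naive fixpoint iteration. Everything in it is an assumption made for illustration: the toy tree, the predicate names `under` and `cell`, the rule encoding, and the evaluator itself. It is not the paper's algorithm, and the rule shapes are chosen for brevity rather than claimed to match the paper's normal form. The program it encodes is:

    under(x) :- label_table(y), firstchild(y, x).
    under(x) :- under(y), firstchild(y, x).
    under(x) :- under(y), nextsibling(y, x).
    cell(x)  :- under(x), label_td(x).

```python
from collections import defaultdict

# Hypothetical toy document tree: node id -> (label, ordered list of child ids).
TREE = {
    0: ("html", [1, 5]),
    1: ("table", [2]),
    2: ("tr", [3, 4]),
    3: ("td", []),
    4: ("td", []),
    5: ("td", []),  # a td outside any table; it should not be extracted
}

# Illustrative rule encoding (an assumption, not the paper's syntax):
#   ("step", head, p, rel) means  head(x) :- p(y), rel(y, x).
#   ("and",  head, p, q)   means  head(x) :- p(x), q(x).
RULES = [
    ("step", "under", "label_table", "firstchild"),
    ("step", "under", "under", "firstchild"),
    ("step", "under", "under", "nextsibling"),
    ("and", "cell", "under", "label_td"),
]


def tree_relations(tree):
    """Build the binary relations firstchild and nextsibling as sets of pairs."""
    firstchild, nextsibling = set(), set()
    for node, (_, children) in tree.items():
        if children:
            firstchild.add((node, children[0]))
        for left, right in zip(children, children[1:]):
            nextsibling.add((left, right))
    return {"firstchild": firstchild, "nextsibling": nextsibling}


def evaluate(tree, rules):
    """Naive fixpoint: apply every rule until no unary predicate grows."""
    binary = tree_relations(tree)
    unary = defaultdict(set)
    for node, (label, _) in tree.items():
        unary["label_" + label].add(node)  # extensional label predicates

    changed = True
    while changed:
        changed = False
        for kind, head, p, q in rules:
            if kind == "step":
                derived = {x for (y, x) in binary[q] if y in unary[p]}
            else:  # "and"
                derived = unary[p] & unary[q]
            if not derived <= unary[head]:
                unary[head] |= derived
                changed = True
    return unary


result = evaluate(TREE, RULES)
print(sorted(result["cell"]))  # -> [3, 4]
```

All derived predicates are unary, which is what makes the program monadic: the wrapper selects a set of nodes (here, the `td` nodes below a `table`) rather than building tuples. The naive loop is used only for clarity and makes no attempt to match the evaluation complexity established in the paper.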


Citations
Journal Article

Web data extraction, applications and techniques

TL;DR: A structured and comprehensive overview of the literature on Web Data Extraction is provided, covering applications at the Enterprise level and at the Social Web level; the latter make it possible to gather the large amount of structured data continuously generated and disseminated by Web 2.0, Social Media and Online Social Network users.
Book Chapter

Structural Properties of XPath Fragments

TL;DR: This work characterizes the expressive power of these language fragments in terms of both logics and tree patterns, and investigates closure properties, focusing on the ability to perform basic Boolean operations while remaining within the fragment.
Proceedings Article

Datalog+/-: A Family of Logical Knowledge Representation and Query Languages for New Applications

TL;DR: This paper discusses three paradigms ensuring decidability: chase termination, guardedness, and stickiness. The Datalog+/- family extends plain Datalog with features such as existentially quantified rule heads and restricts the rule syntax so as to achieve decidability and tractability.
Journal Article

Automata for XML---A survey

TL;DR: An overview of fundamental properties of the different kinds of automata used in XML processing is given, relating them to the four key aspects of XML processing: schemas, navigation, querying and transformation.
Proceedings Article

The Lixto data extraction project: back and forth between theory and practice

TL;DR: The Lixto project is presented, which is both a research project in database theory and a commercial enterprise that develops Web data extraction (wrapping) and Web service definition software. Theoretical results on monadic datalog over trees and on Elog, its close relative, are also presented.
References
Book

Introduction to Automata Theory, Languages, and Computation

TL;DR: This book is a rigorous exposition of formal languages and models of computation, with an introduction to computational complexity, appropriate for upper-level computer science undergraduates who are comfortable with mathematical arguments.
Book

Foundations of databases

TL;DR: This book discusses Languages, Computability, and Complexity, and the Relational Model, which aims to clarify the role of Semantic Data Models in the development of Query Language Design.
Book Chapter

Languages, automata, and logic

TL;DR: The subject of this chapter is the study of formal languages (mostly languages recognizable by finite automata) in the framework of mathematical logic.
Journal Article

Linear-time algorithms for testing the satisfiability of propositional Horn formulae

TL;DR: The formulation of the satisfiability problem as a data flow problem appears to be new and suggests the possibility of improving efficiency using parallel processors.
Journal Article

Generalized finite automata theory with an application to a decision problem of second-order logic

TL;DR: The standard closure theorems are proved for the class of sets “recognizable” by finite algebras, and a generalization of Kleene's regularity theory is presented.