Book Chapter

Automatic Web Data Extraction Based on Genetic Algorithms and Regular Expressions

TLDR
This chapter proposes an Evolutionary Computation approach, based on Genetic Algorithms and regular expressions, to the problem of automatically learning software entities, also called wrappers, that can extract certain kinds of Web data structures from examples.
Abstract
Data extraction from the World Wide Web is a well-known, unsolved, and critical problem in the design of complex information systems. The problem concerns the extraction, management, and reuse of the huge amount of Web data available. These data are usually highly heterogeneous, volatile, and of low quality (i.e., they contain format and content mistakes), so it is quite hard to build reliable systems on top of them. This chapter proposes an Evolutionary Computation approach, based on Genetic Algorithms and regular expressions, to the problem of automatically learning software entities. These entities, also called wrappers, can extract certain kinds of Web data structures from examples.
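
The abstract does not spell out how the wrappers are encoded or scored, so the following is only a minimal sketch of the general idea: evolving a regular expression from positive and negative examples with a genetic algorithm. The token set `TOKENS`, the accuracy-based `fitness` function, the one-point crossover, the mutation rate, and the elitist selection scheme are all assumptions made for illustration, not the chapter's actual method.

```python
import random
import re

# Hypothetical regex building blocks; the chapter's real encoding is not given here.
TOKENS = [r"\d", r"\d+", r"[a-z]+", r"\w+", r"\.", r"-", r"/", r"@", r"\s"]
MAX_GENES = 8  # assumed cap on individual length, to limit bloat


def fitness(pattern, positives, negatives):
    """Fraction of examples classified correctly, using a full-string match."""
    try:
        rx = re.compile(pattern)
    except re.error:
        return 0.0  # invalid regexes get the worst score
    hits = sum(1 for s in positives if rx.fullmatch(s))
    hits += sum(1 for s in negatives if not rx.fullmatch(s))
    return hits / (len(positives) + len(negatives))


def random_individual(rng, max_len=6):
    """An individual is a list of tokens; joined together they form a regex."""
    return [rng.choice(TOKENS) for _ in range(rng.randint(1, max_len))]


def crossover(rng, a, b):
    """One-point crossover: head of parent a followed by tail of parent b."""
    child = (a[: rng.randint(0, len(a))] + b[rng.randint(0, len(b)):])[:MAX_GENES]
    return child or [rng.choice(TOKENS)]  # never return an empty individual


def mutate(rng, individual, rate=0.2):
    """Replace each token with a random one with probability `rate`."""
    return [rng.choice(TOKENS) if rng.random() < rate else t for t in individual]


def evolve(positives, negatives, generations=50, pop_size=40, seed=0):
    """Return (best_pattern, best_fitness) after a fixed number of generations."""
    rng = random.Random(seed)

    def score(individual):
        return fitness("".join(individual), positives, negatives)

    population = [random_individual(rng) for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=score, reverse=True)
        elite = population[: max(1, pop_size // 4)]  # keep the best quarter
        children = [
            mutate(rng, crossover(rng, rng.choice(elite), rng.choice(elite)))
            for _ in range(pop_size - len(elite))
        ]
        population = elite + children
    best = max(population, key=score)
    return "".join(best), score(best)
```

Calling `evolve(["12-34", "5-6"], ["abc", "12"])` returns a candidate pattern together with its score on those examples. A realistic system would need a richer encoding (grouping, alternation, quantifiers) and a fitness function that also penalizes overly long or overly general expressions.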


Citations
Journal Article

Automatic Synthesis of Regular Expressions from Examples

TL;DR: A system that can produce regular expressions from user-provided examples performed with high precision and recall in 12 text-extraction tasks from real-world datasets, demonstrating the effectiveness of text extraction based on genetic programming.

IJCSI Publicity Board 2010

TL;DR: This paper presents a systematic analysis of a variety of different ad hoc network topologies in terms of node placement, node mobility and routing protocols through several simulated scenarios.
Proceedings Article

Program Boosting: Program Synthesis via Crowd-Sourcing

TL;DR: This paper proposes an approach to program synthesis that involves crowd-sourcing imperfect solutions to a difficult programming problem from developers and then blending these programs together in a way that improves their correctness, and demonstrates that program boosting can be performed at a relatively modest monetary cost.
Book Chapter

Learning Text Patterns Using Separate-and-Conquer Genetic Programming

TL;DR: This work considers the problem of extracting text slices that adhere to a syntactic pattern and proposes an approach, based on Genetic Programming, that generates the desired pattern automatically from a few annotated examples; the resulting extraction patterns are regular expressions that can be fed to existing engines without any post-processing.
Proceedings Article

Playing regex golf with genetic programming

TL;DR: This paper generates a population of candidate regular expressions represented as trees and evolves this population based on a multi-objective fitness that minimizes both the errors and the length of the regular expression.
References
Book

Adaptation in Natural and Artificial Systems: An Introductory Analysis with Applications to Biology, Control and Artificial Intelligence

TL;DR: Initially applying his concepts to simply defined artificial systems with limited numbers of parameters, Holland goes on to explore their use in the study of a wide range of complex, naturally occurring processes, concentrating on systems having multiple factors that interact in nonlinear ways.
Book

Modern Information Retrieval

TL;DR: The authors present a rigorous and complete textbook for a first course on information retrieval from the computer science (as opposed to a user-centred) perspective, providing an up-to-date, student-oriented treatment of the subject.
Proceedings Article

Wrapper induction for information extraction

TL;DR: This work introduces wrapper induction, a method for automatically constructing wrappers, and identifies HLRT, a wrapper class that is efficiently learnable, yet expressive enough to handle 48% of a recently surveyed sample of Internet resources.
Journal Article

Programming Techniques: Regular expression search algorithm

TL;DR: A method for locating specific character strings embedded in character text is described and an implementation of this method in the form of a compiler is discussed.
Journal Article

Wrapper induction: efficiency and expressiveness

TL;DR: This article describes six wrapper classes and uses a combination of empirical and analytical techniques to evaluate the computational tradeoffs among them, finding that most of the wrapper classes are reasonably useful, yet can be learned rapidly.