Book Chapter

Finite-State Approaches to Web Information Extraction

TLDR
This work surveys a variety of information extraction techniques that enable information agents to automatically gather information from heterogeneous sources.
Abstract
Information agents are emerging as an important approach to building next-generation value-added information services. An information agent is a distributed system that receives a goal through its user interface, gathers information relevant to this goal from a variety of sources, processes this content as appropriate, and delivers the results to the users. We focus on the second stage in this generic architecture. We survey a variety of information extraction techniques that enable information agents to automatically gather information from heterogeneous sources.

Citations
Journal Article

Web data extraction, applications and techniques

TL;DR: A structured and comprehensive overview of the literature in the field of Web Data Extraction is provided, covering applications at the Enterprise level and at the Social Web level; the latter make it possible to gather the large amount of structured data continuously generated and disseminated by Web 2.0, Social Media and Online Social Network users.
Patent

Methods and systems for selecting a language for text segmentation

TL;DR: Methods and systems for selecting a language for text segmentation are disclosed. They do not specify a language classifier for each of the candidate languages and the second candidate language associated with a string of characters.
Journal Article

TEG—a hybrid approach to information extraction

TL;DR: The experiments show that the hybrid approach outperforms both purely statistical and purely knowledge-based systems, while requiring orders of magnitude less manual rule writing and smaller amounts of training data.
Journal Article

Understanding deep web search interfaces: a survey

TL;DR: This paper presents a survey of the major approaches to search interface understanding, and organizes the works on a 2-D graph based on the underlying database information extracted and on the technique employed.
Journal Article

Interactive learning of node selecting tree transducer

TL;DR: In this paper, the authors propose to represent monadic queries by bottom-up deterministic Node Selecting Tree Transducers (NSTTs), a particular class of tree automata that they introduce.
References
Proceedings Article

Wrapper induction for information extraction

TL;DR: This work introduces wrapper induction, a method for automatically constructing wrappers, and identifies HLRT, a wrapper class that is efficiently learnable, yet expressive enough to handle 48% of a recently surveyed sample of Internet resources.
Book Chapter

Extracting Patterns and Relations from the World Wide Web

TL;DR: In this article, the authors present a technique that exploits the duality between sets of patterns and relations to grow the target relation starting from a small sample, and test it by extracting a relation of (author, title) pairs from the World Wide Web.
Journal Article

Learning Information Extraction Rules for Semi-Structured and Free Text

TL;DR: WHISK is designed to handle text styles ranging from highly structured to free text, including text that is neither rigidly formatted nor composed of grammatical sentences, and can also handle extraction from free text such as news stories.
Proceedings Article

Towards automatic data extraction from large web sites

Abstract: The paper investigates techniques for extracting data from HTML sites through the use of automatically generated wrappers. To automate the wrapper generation and the data extraction process, the paper develops a novel technique to compare HTML pages and generate a wrapper based on their similarities and differences. Experimental results on real-life data-intensive Web sites confirm the feasibility of the approach.
Proceedings Article

RoadRunner: Towards Automatic Data Extraction from Large Web Sites

TL;DR: A novel technique to compare HTML pages and generate a wrapper based on their similarities and differences is developed; experimental results on real-life data-intensive Web sites confirm the feasibility of the approach.