Finite-State Approaches to Web Information Extraction

doi:10.1007/978-3-540-45092-4_4

Book Chapter

Nicholas Kushmerick

- 31 Jul 2002 -

Lecture Notes in Computer Science

- pp 77-91

TLDR

This work surveys a variety of information extraction techniques that enable information agents to automatically gather information from heterogeneous sources and deliver the results to users.

Abstract:

Information agents are emerging as an important approach to building next-generation value-added information services. An information agent is a distributed system that receives a goal through its user interface, gathers information relevant to this goal from a variety of sources, processes this content as appropriate, and delivers the results to the users. We focus on the second stage in this generic architecture. We survey a variety of information extraction techniques that enable information agents to automatically gather information from heterogeneous sources.

Citations

Journal Article

Web data extraction, applications and techniques

Emilio Ferrara, +3 more

- 01 Nov 2014 -

Knowledge Based Systems

TL;DR: This work provides a structured and comprehensive overview of the literature on Web Data Extraction, covering applications at the Enterprise level and at the Social Web level, where it makes it possible to gather the large amount of structured data continuously generated and disseminated by Web 2.0, Social Media and Online Social Network users.

Patent

Methods and systems for selecting a language for text segmentation

Gilad Israel Elbaz, +1 more

TL;DR: In this paper, methods and systems for selecting a language for text segmentation are disclosed. But they do not specify a language classifier for each of the candidate languages and the second candidate language associated with a string of characters.

Journal Article

TEG—a hybrid approach to information extraction

Ronen Feldman, +2 more

- 01 Jan 2006 -

Knowledge and Information Systems

TL;DR: The experiments show that the hybrid approach outperforms both purely statistical and purely knowledge-based systems, while requiring orders of magnitude less manual rule writing and smaller amounts of training data.

Journal Article

Understanding deep web search interfaces: a survey

Ritu Khare, +2 more

TL;DR: This paper presents a survey of the major approaches to search interface understanding, and organizes the works on a 2-D graph based on the underlying database information extracted and on the technique employed.

Journal Article

Interactive learning of node selecting tree transducer

Julien Carme, +3 more

- 01 Jan 2007 -

Machine Learning

TL;DR: In this paper, the authors propose to represent monadic queries by bottom-up deterministic Node Selecting Tree Transducers (NSTTs), a particular class of tree automata that they introduce.

References

Proceedings Article

Wrapper induction for information extraction

Nicholas Kushmerick, +1 more

TL;DR: This work introduces wrapper induction, a method for automatically constructing wrappers, and identifies HLRT, a wrapper class that is efficiently learnable, yet expressive enough to handle 48% of a recently surveyed sample of Internet resources.

Book Chapter

Extracting Patterns and Relations from the World Wide Web

Sergey Brin

TL;DR: In this article, the author presents a technique that exploits the duality between sets of patterns and relations to grow the target relation starting from a small sample, and tests it by extracting a relation of (author, title) pairs from the World Wide Web.

Journal Article

Learning Information Extraction Rules for Semi-Structured and Free Text

Stephen Soderland

- 01 Feb 1999 -

Machine Learning

TL;DR: WHISK is designed to handle text styles ranging from highly structured to free text, including text that is neither rigidly formatted nor composed of grammatical sentences, as well as free text such as news stories.

Proceedings Article

Towards automatic data extraction from large web sites

Valter Crescenzi, +2 more

Abstract: The paper investigates techniques for extracting data from HTML sites through the use of automatically generated wrappers. To automate the wrapper generation and the data extraction process, the paper develops a novel technique to compare HTML pages and generate a wrapper based on their similarities and differences. Experimental results on real-life data-intensive Web sites confirm the feasibility of the approach.

Proceedings Article

RoadRunner: Towards Automatic Data Extraction from Large Web Sites

Valter Crescenzi, +2 more

TL;DR: A novel technique to compare HTML pages and generate a wrapper based on their similarities and differences is developed; experimental results on real-life data-intensive Web sites confirm the feasibility of the approach.
