A fast and simple method for extracting relevant content from news webpages

doi:10.1145/1645953.1646204

Proceedings Article

Eduardo Sany Laber, +7 more

pp. 1685-1688

TLDR

The main advantages of the NCE method are its simplicity and its computational performance: it is at least an order of magnitude faster than methods that use visual features, which makes it well suited to applications that process a large number of pages.

Abstract:

We propose NCE, an efficient algorithm to identify and extract relevant content from news webpages. We define relevant content as the textual sections that most objectively describe the main event in the article. This includes the title and the main body section, and excludes comments about the story and presentation elements. Our experiments suggest that NCE is competitive, in terms of extraction quality, with the best methods available in the literature. It achieves F1 = 90.7% on our test corpus of 324 news webpages from 22 sites. The main advantages of our method are its simplicity and its computational performance. It is at least an order of magnitude faster than methods that use visual features, which makes it well suited to applications that process a large number of pages.
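To make the reported F1 figure concrete, the following is a minimal sketch of how an F1 score for content extraction might be computed. It assumes whitespace tokenisation and a multiset (bag-of-words) overlap between the extracted text and a hand-labelled reference; the abstract does not state the evaluation protocol, so the function name `extraction_f1` and the tokenisation are illustrative assumptions, not the authors' procedure.

```python
from collections import Counter


def extraction_f1(extracted: str, reference: str) -> float:
    """Token-level F1 between extracted text and a labelled reference.

    Assumption: lowercase whitespace tokens compared as multisets.
    The paper's own evaluation protocol may differ.
    """
    extracted_tokens = Counter(extracted.lower().split())
    reference_tokens = Counter(reference.lower().split())

    # Multiset intersection: each token counts at most as often as it
    # appears in both texts.
    overlap = sum((extracted_tokens & reference_tokens).values())
    if overlap == 0:
        return 0.0

    precision = overlap / sum(extracted_tokens.values())
    recall = overlap / sum(reference_tokens.values())
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical example: the extractor keeps the whole article but
    # also picks up three tokens of page furniture.
    reference = "Storm hits coast Thousands lose power"
    extracted = "Storm hits coast Thousands lose power Share this story"
    print(round(extraction_f1(extracted, reference), 3))  # prints 0.8
```

In this example recall is 1.0 (all six reference tokens are recovered) and precision is 6/9, so the extra presentation text lowers F1 to 0.8. A corpus-level score such as the one quoted above would typically aggregate such per-page values, though the aggregation method is also an assumption here.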

Citations

Journal Article

Language independent web news extraction system based on text detection framework

Yu-Chieh Wu

- 10 May 2016 -

Information Sciences

TL;DR: This study presents a web news extraction system that is based on a text detection framework and is very useful for constructing a multilingual corpus because it requires no language-specific processing component.

Proceedings Article

An efficient language-independent method to extract content from news webpages

Eduardo Teixeira Cardoso, +4 more

TL;DR: The approach extends previous work in the area by combining structural properties with hints of visual presentation style, computed with a method quicker than regular rendering, and machine learning algorithms, while retaining good extraction quality.

Journal Article

Specification and discovery of web patterns

Amin Roudaki, +2 more

- 20 Jan 2016 -

Information Sciences

TL;DR: A generic framework is presented for discovering Web patterns and recognizing their instances (i.e., structured data) using graph grammars, built on a grammar induction engine and a graph parsing process.

Journal Article

An FAR-SW based approach for webpage information extraction

Zhan Bu, +3 more

- 01 Nov 2014 -

Information Systems Frontiers

TL;DR: A statistics-based approach that integrates the concept of fuzzy association rules (FAR) with that of sliding window (SW) to efficiently extract the main text content from web pages is proposed.

Proceedings Article

SpeedReader: Reader Mode Made Fast and Private

Mohammad Ghasemisharif, +3 more

TL;DR: This work proposes SpeedReader as an alternative multistep pipeline that is part of the rendering pipeline; the authors believe that SpeedReader can be continuously enabled in order to drastically improve end-user experience, especially on slow mobile connections.

References

Journal Article

A brief survey of web data extraction tools

Alberto H. F. Laender, +3 more

TL;DR: A taxonomy for characterizing Web data extraction tools is proposed, major Web data extraction tools described in the literature are briefly surveyed, and a qualitative analysis of them is provided.

Proceedings Article

Automatic web news extraction using tree edit distance

Davi De Castro Reis, +3 more

TL;DR: A domain-oriented approach to Web data extraction is presented and its application to automatically extracting news from Web sites is discussed, based on a highly efficient tree structure analysis that produces very effective results.

Improving pseudo-relevance feedback in web information retrieval using web page segmentation

S. Yu

TL;DR: In this paper, a VIsion-based Page Segmentation (VIPS) algorithm was proposed to detect the semantic content structure in a web page, which utilizes useful visual cues to obtain a better partition of a page at the semantic level.

Proceedings Article

Improving pseudo-relevance feedback in web information retrieval using web page segmentation

Shipeng Yu, +3 more

TL;DR: This paper proposes a VIsion-based Page Segmentation (VIPS) algorithm to detect the semantic content structure in a web page and achieves 27% performance improvement on Web Track dataset.

Proceedings Article

The volume and evolution of web page templates

David Gibson, +2 more

TL;DR: This work develops new randomized algorithms for template extraction that perform approximately twenty times faster than existing approaches with similar quality.

Related Papers (5)

ViDE: A Vision-Based Approach for Deep Web Data Extraction

Wei Liu, +2 more

- 01 Mar 2010 -

IEEE Transactions on Knowledge and Data ...

A very efficient approach to news title and content extraction on the web

Extraction of News Content for Text Mining Based on Edit Distance

Template-independent news extraction based on visual consistency

Automatic web news extraction using tree edit distance