Proceedings Article

A fast and simple method for extracting relevant content from news webpages

TLDR
The main advantages of the NCE method are its simplicity and its computational performance: it is at least an order of magnitude faster than methods that use visual features, which makes it well suited to applications that process a large number of pages.
Abstract
We propose NCE, an efficient algorithm to identify and extract relevant content from news webpages. We define relevant content as the textual sections that most objectively describe the main event in the article. This includes the title and the main body section, and excludes comments about the story and presentation elements. Our experiments suggest that NCE is competitive, in terms of extraction quality, with the best methods available in the literature. It achieves F1 = 90.7% on our test corpus of 324 news webpages from 22 sites. The main advantages of our method are its simplicity and its computational performance. It is at least an order of magnitude faster than methods that use visual features, which makes it well suited to applications that process a large number of pages.
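
To make the reported F1 figure concrete, the following is a minimal sketch of how an F1 score can be computed for one extracted page. It assumes a token-level comparison against a hand-labeled reference text, with whitespace tokenization; the abstract does not state the exact evaluation protocol, so the function name `extraction_f1` and the token-overlap definition are illustrative assumptions, not the authors' procedure.

```python
from collections import Counter


def extraction_f1(extracted: str, reference: str) -> float:
    """Token-level F1 between extracted text and a gold reference.

    Assumption: tokens are whitespace-separated words and overlap is
    counted as a multiset intersection. The paper may measure it differently.
    """
    ext_tokens = extracted.split()
    ref_tokens = reference.split()
    if not ext_tokens or not ref_tokens:
        return 0.0
    # Multiset intersection: tokens present in both texts, with multiplicity.
    overlap = sum((Counter(ext_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(ext_tokens)  # share of extracted tokens that are relevant
    recall = overlap / len(ref_tokens)     # share of relevant tokens that were extracted
    return 2 * precision * recall / (precision + recall)
```

Under this definition, an extractor that returns the full article plus some unrelated text keeps perfect recall but loses precision, so F1 falls below 1. A corpus-level score would then be an average of such per-page values.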


Citations
Journal Article

Language independent web news extraction system based on text detection framework

TL;DR: This study presents a web news extraction system that is based on a text detection framework and is very useful for constructing a multilingual corpus because it requires no language-specific processing component.
Proceedings Article

An efficient language-independent method to extract content from news webpages

TL;DR: The chosen approach extends previous work in the area, combining structural properties with hints of visual presentation styles, computed with a quicker method than regular rendering, and machine learning algorithms, and retaining a good quality of extraction.
Journal Article

Specification and discovery of web patterns

TL;DR: A generic framework, based on graph grammars, for discovering Web patterns and recognizing their instances (i.e., structured data) is presented; it relies on a grammar induction engine and a graph parsing process.
Journal Article

An FAR-SW based approach for webpage information extraction

TL;DR: A statistics-based approach that integrates the concept of fuzzy association rules (FAR) with that of sliding window (SW) to efficiently extract the main text content from web pages is proposed.
Proceedings Article

SpeedReader: Reader Mode Made Fast and Private

TL;DR: This work proposes SpeedReader as an alternative multistep pipeline that is part of the rendering pipeline, and believes that SpeedReader can be continuously enabled in order to drastically improve end-user experience, especially on slow mobile connections.
References
Journal Article

A brief survey of web data extraction tools

TL;DR: A taxonomy for characterizing Web data extraction tools is proposed, the major Web data extraction tools described in the literature are briefly surveyed, and a qualitative analysis of them is provided.
Proceedings Article

Automatic web news extraction using tree edit distance

TL;DR: A domain-oriented approach to Web data extraction is presented and its application to automatically extracting news from Web sites is discussed, based on a highly efficient tree structure analysis that produces very effective results.

Improving pseudo-relevance feedback in web information retrieval using web page segmentation

S. Yu
TL;DR: In this paper, a VIsion-based Page Segmentation (VIPS) algorithm was proposed to detect the semantic content structure in a web page, which utilizes useful visual cues to obtain a better partition of a page at the semantic level.
Proceedings Article

Improving pseudo-relevance feedback in web information retrieval using web page segmentation

TL;DR: This paper proposes a VIsion-based Page Segmentation (VIPS) algorithm to detect the semantic content structure in a web page and achieves a 27% performance improvement on the Web Track dataset.
Proceedings Article

The volume and evolution of web page templates

TL;DR: This work develops new randomized algorithms for template extraction that perform approximately twenty times faster than existing approaches with similar quality.