Proceedings Article
A fast and simple method for extracting relevant content from news webpages
Eduardo Sany Laber, Críston P. de Souza, Iam Vita Jabour, Evelin Amorim, Eduardo Teixeira Cardoso, Raúl Pierre Rentería, Lucio C. Tinoco, Caio Valentim
- pp 1685-1688
TLDR
The main advantages of the NCE method are its simplicity and its computational performance: it is at least an order of magnitude faster than methods that use visual features, which makes it well suited to applications that process a large number of pages.
Abstract:
We propose NCE, an efficient algorithm to identify and extract relevant content from news webpages. We define relevant content as the textual sections that most objectively describe the main event in the article. This includes the title and the main body section, and excludes comments about the story and presentation elements. Our experiments suggest that NCE is competitive, in terms of extraction quality, with the best methods available in the literature. It achieves F1 = 90.7% on our test corpus of 324 news webpages from 22 sites. The main advantages of our method are its simplicity and its computational performance. It is at least an order of magnitude faster than methods that use visual features, which makes it well suited to applications that process a large number of pages.
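The abstract reports extraction quality as an F1 score but does not say here how it is computed. As a minimal sketch, assuming a token-level comparison between the extracted text and a hand-labelled reference (the function name `token_f1` and the whitespace tokenization are illustrative assumptions, not the authors' evaluation protocol), the score could be calculated as follows:

```python
from collections import Counter


def token_f1(extracted: str, reference: str) -> float:
    """Token-level F1 between extracted text and a reference text.

    Assumption: tokens are lowercase whitespace-separated words and
    repeated words are matched as a multiset. The paper may use a
    different unit of comparison.
    """
    ext = Counter(extracted.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((ext & ref).values())  # multiset intersection size
    if overlap == 0:
        return 0.0
    precision = overlap / sum(ext.values())  # share of extracted tokens that are relevant
    recall = overlap / sum(ref.values())     # share of relevant tokens that were extracted
    return 2 * precision * recall / (precision + recall)
```

Under this definition, an extractor that returns the whole article plus an equal amount of boilerplate has perfect recall but a precision of 0.5, which gives an F1 of about 0.67.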
Citations
Journal Article
Language independent web news extraction system based on text detection framework
TL;DR: This study presents a web news extraction system that is based on a text detection framework and is very useful for constructing a multilingual corpus because it requires no language-specific processing component.
Proceedings Article
An efficient language-independent method to extract content from news webpages
TL;DR: The approach extends previous work in the area by combining structural properties, hints of visual presentation style (computed with a method quicker than regular rendering), and machine learning algorithms, while retaining good extraction quality.
Journal Article
Specification and discovery of web patterns
Amin Roudaki, Jun Kong, Kang Zhang
TL;DR: A generic framework for discovering Web patterns and recognizing their instances (i.e., structured data) is presented; it is based on graph grammars and uses a grammar induction engine and a graph parsing process.
Journal Article
An FAR-SW based approach for webpage information extraction
TL;DR: A statistics-based approach that integrates the concept of fuzzy association rules (FAR) with that of sliding window (SW) to efficiently extract the main text content from web pages is proposed.
Proceedings Article
SpeedReader: Reader Mode Made Fast and Private
TL;DR: This work proposes SpeedReader, an alternative multistep pipeline that is part of the rendering pipeline; the authors believe SpeedReader can be continuously enabled to drastically improve the end-user experience, especially on slow mobile connections.
References
Journal Article
A brief survey of web data extraction tools
Alberto H. F. Laender, Berthier Ribeiro-Neto, Altigran Soares da Silva, Juliana Silveira Teixeira
TL;DR: A taxonomy for characterizing Web data extraction tools is proposed, the major Web data extraction tools described in the literature are briefly surveyed, and a qualitative analysis of them is provided.
Proceedings Article
Automatic web news extraction using tree edit distance
TL;DR: A domain-oriented approach to Web data extraction is presented and its application to automatically extracting news from Web sites is discussed, based on a highly efficient tree structure analysis that produces very effective results.
Improving pseudo-relevance feedback in web information retrieval using web page segmentation
TL;DR: In this paper, a VIsion-based Page Segmentation (VIPS) algorithm was proposed to detect the semantic content structure in a web page, which utilizes useful visual cues to obtain a better partition of a page at the semantic level.
Proceedings Article
Improving pseudo-relevance feedback in web information retrieval using web page segmentation
TL;DR: This paper proposes a VIsion-based Page Segmentation (VIPS) algorithm to detect the semantic content structure in a web page and achieves a 27% performance improvement on the Web Track dataset.
Proceedings Article
The volume and evolution of web page templates
TL;DR: This work develops new randomized algorithms for template extraction that perform approximately twenty times faster than existing approaches with similar quality.