Proceedings Article

A densitometric approach to web page segmentation

TL;DR
A new approach to segment HTML pages is described, building on methods from Quantitative Linguistics and strategies borrowed from the area of Computer Vision, utilizing the notion of text-density as a measure to identify the individual text segments of a web page.
Abstract
Web page segmentation is a crucial step for many applications in Information Retrieval, such as text classification, de-duplication and full-text search. In this paper we describe a new approach to segmenting HTML pages, building on methods from Quantitative Linguistics and strategies borrowed from the area of Computer Vision. We utilize the notion of text density as a measure to identify the individual text segments of a web page, reducing the problem to solving a 1D-partitioning task. The distribution of segment-level text density seems to follow a negative hypergeometric distribution, described by Frumkina's Law. Our extensive evaluation confirms the validity and quality of our approach and its applicability to the Web.
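The idea of treating segmentation as a 1D-partitioning task over text density can be illustrated with a minimal sketch. The abstract does not define the density measure or the partitioning rule, so everything concrete here is an assumption made for illustration: density is taken as tokens per word-wrapped line (wrap width 80), adjacent blocks are fused when their relative density difference stays below a hypothetical threshold `max_delta`, and the function names `text_density` and `fuse_blocks` are invented for this example.

```python
import textwrap

WRAP_WIDTH = 80  # assumed wrap width; not given in the abstract


def text_density(text, width=WRAP_WIDTH):
    """Tokens per wrapped line (assumed definition of text density).

    When the text wraps to several lines, the last, partially filled
    line is ignored so that it does not skew the ratio.
    """
    lines = textwrap.wrap(text, width)
    if not lines:
        return 0.0
    if len(lines) > 1:
        lines = lines[:-1]
    tokens = sum(len(line.split()) for line in lines)
    return tokens / len(lines)


def fuse_blocks(blocks, max_delta=0.4):
    """Greedy 1D partitioning of a sequence of text blocks.

    A block joins the current segment when its density is close to the
    density of the block immediately before it; otherwise it starts a
    new segment. max_delta is a hypothetical threshold.
    """
    segments = []
    prev_density = None
    for block in blocks:
        density = text_density(block)
        if prev_density is not None:
            top = max(prev_density, density)
            delta = abs(prev_density - density) / top if top else 0.0
            if delta <= max_delta:
                segments[-1].append(block)
                prev_density = density
                continue
        segments.append([block])
        prev_density = density
    return segments


if __name__ == "__main__":
    page = ["Home", "About", "Contact", "word " * 60, "text " * 40]
    for segment in fuse_blocks(page):
        print(len(segment), "block(s), first density:", text_density(segment[0]))
```

In this sketch the short navigation-style blocks share a low density and end up in one segment, while the two long paragraphs share a high density and form a second segment. A real implementation would first derive the block sequence from the HTML, which is outside the scope of this example.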


Citations
Proceedings Article

Boilerplate detection using shallow text features

TL;DR: This paper analyzes a small set of shallow text features for classifying the individual text elements in a Web page and derives a simple and plausible stochastic model for describing the boilerplate creation process.
Proceedings Article

Evaluating the visual quality of web pages using a computational aesthetic approach

TL;DR: A computational aesthetics approach is proposed to learn an evaluation model for the visual quality of Web pages; the authors conclude that a Web page's layout visual features and text visual features are the primary factors affecting its visual quality.
Journal Article

A hybrid approach for extracting informative content from web pages

TL;DR: This paper presents a hybrid approach consisting of two steps that can invoke each other; it discovers informative content using Decision Tree Learning and creates rules from the results of this learning method.
Journal Article

Measuring the Visual Complexities of Web Pages

TL;DR: A new approach combining Web mining techniques and machine learning algorithms for measuring the VisComs of Web pages is provided, utilizing a distribution to quantify the VisCom of a Web page.
Proceedings Article

Block-o-Matic: A web page segmentation framework

TL;DR: The proposed Block-o-Matic is a hybrid approach inspired by automated document processing methods and visual-based content segmentation techniques that gives promising results in segmentation of a web page.