Proceedings Article
A densitometric approach to web page segmentation
Christian Kohlschütter, Wolfgang Nejdl
- pp 1173-1182
TL;DR: A new approach to segment HTML pages is described, building on methods from Quantitative Linguistics and strategies borrowed from the area of Computer Vision, utilizing the notion of text-density as a measure to identify the individual text segments of a web page.
Abstract:
Web Page segmentation is a crucial step for many applications in Information Retrieval, such as text classification, de-duplication and full-text search. In this paper we describe a new approach to segment HTML pages, building on methods from Quantitative Linguistics and strategies borrowed from the area of Computer Vision. We utilize the notion of text-density as a measure to identify the individual text segments of a web page, reducing the problem to solving a 1D-partitioning task. The distribution of segment-level text density seems to follow a negative hypergeometric distribution, described by Frumkina's Law. Our extensive evaluation confirms the validity and quality of our approach and its applicability to the Web.
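The idea of using text density to reduce segmentation to a 1D-partitioning task can be illustrated with a minimal sketch. The abstract does not specify the details, so everything concrete here is an assumption made for illustration: density is taken as tokens per wrapped line at a hypothetical line width of 80 characters, and adjacent blocks are fused greedily when their relative density difference stays below a hypothetical threshold `max_delta`. The names `text_density` and `fuse_blocks` are invented for this example and should not be read as the authors' implementation.

```python
import textwrap


def text_density(block: str, line_width: int = 80) -> float:
    """Tokens per wrapped line for one text block.

    Assumed definition for illustration; the 80-character width is
    a hypothetical choice, not taken from the abstract.
    """
    tokens = block.split()
    if not tokens:
        return 0.0
    lines = textwrap.wrap(" ".join(tokens), width=line_width)
    return len(tokens) / len(lines)


def fuse_blocks(blocks, max_delta: float = 0.4, line_width: int = 80):
    """Greedy left-to-right 1D partitioning of a block sequence.

    A block is merged into the previous segment when the relative
    difference of their densities is at most max_delta (an assumed
    threshold); otherwise it starts a new segment.
    """
    segments = []
    for block in blocks:
        if not block.strip():
            continue  # skip empty blocks
        if segments:
            d_prev = text_density(segments[-1], line_width)
            d_cur = text_density(block, line_width)
            delta = abs(d_prev - d_cur) / max(d_prev, d_cur)
            if delta <= max_delta:
                segments[-1] = segments[-1] + " " + block
                continue
        segments.append(block)
    return segments


if __name__ == "__main__":
    nav = ["Home", "About"]
    body = [("word " * 48).strip(), ("text " * 64).strip()]
    for segment in fuse_blocks(nav + body):
        print(round(text_density(segment), 2), segment[:40])
```

In this toy input, the short navigation-like blocks have a low density and the long paragraphs a high one, so the sequence falls into two segments. A real page would first need its HTML split into atomic text blocks, which is outside the scope of this sketch.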
Citations
Proceedings Article
Boilerplate detection using shallow text features
TL;DR: This paper analyzes a small set of shallow text features for classifying the individual text elements in a Web page and derives a simple and plausible stochastic model for describing the boilerplate creation process.
Proceedings Article
Evaluating the visual quality of web pages using a computational aesthetic approach
TL;DR: A computational aesthetics approach is proposed to learn the evaluation model for the visual quality of Web pages, and it is concluded that a Web page's layout visual features and text visual features are the primary factors affecting its visual quality.
Journal Article
A hybrid approach for extracting informative content from web pages
TL;DR: This paper presents a hybrid approach consisting of two steps that can invoke each other: it discovers informative content using Decision Tree Learning as the machine learning method and creates rules from the results of this learning method.
Journal Article
Measuring the Visual Complexities of Web Pages
Ou Wu, Weiming Hu, Lei Shi +2 more
TL;DR: A new approach combining Web mining techniques and machine learning algorithms for measuring the VisComs of Web pages is provided, utilizing a distribution to quantify the VisCom of a Web page.
Proceedings Article
Block-o-Matic: A web page segmentation framework
Andrés Sanoja, Stéphane Gançarski
TL;DR: The proposed Block-o-Matic is a hybrid approach inspired by automated document processing methods and visual-based content segmentation techniques that gives promising results in segmentation of a web page.
References
Journal Article
Cluster ensembles --- a knowledge reuse framework for combining multiple partitions
Alexander Strehl, Joydeep Ghosh
TL;DR: This paper introduces the problem of combining multiple partitionings of a set of objects into a single consolidated clustering without accessing the features or algorithms that determined these partitionings and proposes three effective and efficient techniques for obtaining high-quality combiners (consensus functions).
Journal Article
Computer vision
TL;DR: How the field of computer (and robot) vision has evolved, particularly over the past 20 years, is described, and its central methodological paradigms are introduced.
Book
Computer Vision
George Stockman, Linda G. Shapiro
TL;DR: Computer Vision presents the necessary theory and techniques for students and practitioners who will work in fields where significant information must be extracted automatically from images, a useful resource book for professionals and a core text for both undergraduate and beginning graduate computer vision and imaging courses.
Proceedings Article
Multi-paragraph segmentation of expository text
TL;DR: TextTiling is an algorithm for partitioning expository texts into coherent multi-paragraph discourse units that reflect the subtopic structure of the texts, using domain-independent lexical frequency and distribution information to recognize the interactions of multiple simultaneous themes.
Book
Web Site Usability: A Designer's Guide
TL;DR: This book presents the most comprehensive data demonstrating how Web sites actually work when users need specific answers, and offers guidance for evaluating and improving the usability of Web sites.