Proceedings ArticleDOI

Big Data Linkage for Product Specification Pages

TLDR
This paper presents the RaF (Redundancy as Friend) solution to the problem of big data linkage for product specification pages, which takes advantage of the redundancy of identifiers at a global level, and the homogeneity of structure and semantics at the local source level, to effectively and efficiently link millions of pages of head and tail products across thousands of head and tail sources.
Abstract
An increasing number of product pages are available from thousands of web sources, each page associated with a product, containing its attributes and one or more product identifiers. The sources provide overlapping information about the products, using diverse schemas, making web-scale integration extremely challenging. In this paper, we take advantage of the opportunity that sources publish product identifiers to perform big data linkage across sources at the beginning of the data integration pipeline, before schema alignment. To realize this opportunity, several challenges need to be addressed: identifiers need to be discovered on product pages, made difficult by the diversity of identifiers; the main product identifier on the page needs to be identified, made difficult by the many related products presented on the page; and identifiers across pages need to be resolved, made difficult by the ambiguity between identifiers across product categories. We present our RaF (Redundancy as Friend) solution to the problem of big data linkage for product specification pages, which takes advantage of the redundancy of identifiers at a global level, and the homogeneity of structure and semantics at the local source level, to effectively and efficiently link millions of pages of head and tail products across thousands of head and tail sources. We perform a thorough empirical evaluation of our RaF approach using the publicly available Dexter dataset consisting of 1.9M product pages from 7.1k sources of 3.5k websites, and demonstrate its effectiveness in practice.
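The core linkage idea in the abstract can be illustrated as grouping pages that transitively share a product identifier. The sketch below is a minimal illustration only: `link_pages` and its input format are hypothetical, and the actual RaF system additionally handles identifier discovery on the page, selection of the main identifier, and cross-category identifier ambiguity.

```python
from collections import defaultdict

def link_pages(pages):
    """Cluster pages that transitively share an identifier.

    pages: dict mapping page_id -> set of product identifiers found on that page.
    Returns a list of sets of page_ids (one set per linked product).
    """
    # Union-find over page ids
    parent = {p: p for p in pages}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb

    # Pages sharing any identifier are linked
    by_id = defaultdict(list)
    for page, ids in pages.items():
        for ident in ids:
            by_id[ident].append(page)
    for plist in by_id.values():
        for other in plist[1:]:
            union(plist[0], other)

    clusters = defaultdict(set)
    for page in pages:
        clusters[find(page)].add(page)
    return list(clusters.values())
```

For example, a page with identifiers {X1, X2} links a page carrying only X1 with one carrying only X2, which is where redundancy across sources helps.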


Citations
Journal ArticleDOI

Incorporating Data Context to Cost-Effectively Automate End-to-End Data Wrangling

TL;DR: A notion of data context is introduced, which associates portions of a target schema with extensional data of types that are commonly available, and a scalable methodology to bootstrap an end-to-end data wrangling process based on data profiling is defined.
Journal Article

Big Data Integration for Product Specifications.

TL;DR: This paper presents a pipeline that decomposes the problem into tasks ranging from source and data discovery, to extraction, data linkage, schema alignment, and data fusion, and reports the results of these efforts towards big data integration for product specifications.
Journal ArticleDOI

Fine-grained semantic type discovery for heterogeneous sources using clustering

TL;DR: In this article, the authors focus on the key task of semantic type discovery over a set of heterogeneous sources and propose an iterative RaF-STD solution, which consists of three key steps: (i) a Bayesian model analysis of overlapping information across sources to match the most locally homogeneous attributes; (ii) a tagging approach, inspired by NLP techniques, to create (virtual) attributes from portions of heterogeneous attribute values; and (iii) a novel use of classical techniques based on matching of attribute names and domains.

Lessons Learned and Research Agenda for Big Data Integration of Product Specifications.

TL;DR: This paper presents ongoing efforts, challenges, and a research agenda for big data integration of product specifications.
Proceedings Article

OpenTRIAGE: Entity Linkage for Detail Webpages

TL;DR: OpenTriage, a system for extracting structured entities from detail Web pages of several sites and finding linkages between the extracted data, is presented, based on a hybrid human-machine learning technique that targets a desired quality level.
References
Journal ArticleDOI

Fast unfolding of communities in large networks

TL;DR: This work proposes a heuristic method shown to outperform all other known community detection methods in terms of computation time; the quality of the communities detected, as measured by modularity, is also very good.
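Modularity, the quality measure named in this TL;DR, compares the fraction of edges inside communities with its expectation under a random degree-preserving rewiring. A minimal sketch of evaluating it for a given partition (function and variable names are illustrative; the Louvain method itself greedily optimizes this quantity rather than merely computing it):

```python
from collections import Counter

def modularity(edges, community):
    """Modularity Q of a partition of an undirected graph.

    edges: list of (u, v) pairs (no self-loops assumed).
    community: dict mapping node -> community label.
    """
    m = len(edges)
    degree = Counter()     # node degrees
    intra = Counter()      # edges fully inside each community
    deg_sum = Counter()    # total degree per community
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
        if community[u] == community[v]:
            intra[community[u]] += 1
    for node, d in degree.items():
        deg_sum[community[node]] += d
    # Q = sum over communities of (intra-edge fraction - expected fraction)
    return sum(intra[c] / m - (deg_sum[c] / (2 * m)) ** 2 for c in deg_sum)
```

On two triangles joined by a single bridge edge, partitioning at the bridge yields Q = 5/14, reflecting a clear community structure.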
Journal ArticleDOI

Authoritative sources in a hyperlinked environment

TL;DR: This work proposes and tests an algorithmic formulation of the notion of authority, based on the relationship between a set of relevant authoritative pages and the set of "hub pages" that join them together in the link structure; the formulation has connections to the eigenvectors of certain matrices associated with the link graph.
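The mutually reinforcing hub/authority relationship described here can be sketched with the standard power-iteration updates (a simplified illustration; node names and the fixed iteration count are assumptions, and real implementations iterate to convergence on a focused subgraph):

```python
import math

def hits(edges, nodes, iters=50):
    """Hub and authority scores via mutually reinforcing updates.

    edges: list of directed (src, dst) pairs.
    nodes: iterable of all node ids.
    """
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iters):
        # A page's authority is the sum of hub scores of pages linking to it
        auth = {n: sum(hub[s] for s, d in edges if d == n) for n in nodes}
        # A page's hub score is the sum of authority scores of pages it links to
        hub = {n: sum(auth[d] for s, d in edges if s == n) for n in nodes}
        # L2-normalize both score vectors each round
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for n in scores:
                scores[n] /= norm
    return hub, auth
```

A page linked to by many good hubs accumulates authority, and a page linking to many good authorities becomes a better hub, which is exactly the eigenvector relationship the TL;DR mentions.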
Proceedings ArticleDOI

Knowledge vault: a web-scale approach to probabilistic knowledge fusion

TL;DR: The Knowledge Vault is a Web-scale probabilistic knowledge base that combines extractions from Web content (obtained via analysis of text, tabular data, page structure, and human annotations) with prior knowledge derived from existing knowledge repositories, and computes calibrated probabilities of fact correctness.
Journal ArticleDOI

YAGO2: A spatially and temporally enhanced knowledge base from Wikipedia

TL;DR: YAGO2 is an extension of the YAGO knowledge base in which entities, facts, and events are anchored in both time and space; it contains 447 million facts about 9.8 million entities.