Proceedings ArticleDOI

Big Data Linkage for Product Specification Pages

TLDR
This paper presents the RaF (Redundancy as Friend) solution to the problem of big data linkage for product specification pages, which takes advantage of the redundancy of identifiers at a global level, and the homogeneity of structure and semantics at the local source level, to effectively and efficiently link millions of pages of head and tail products across thousands of head and tail sources.
Abstract
An increasing number of product pages are available from thousands of web sources, each page associated with a product, containing its attributes and one or more product identifiers. The sources provide overlapping information about the products, using diverse schemas, making web-scale integration extremely challenging. In this paper, we take advantage of the opportunity that sources publish product identifiers to perform big data linkage across sources at the beginning of the data integration pipeline, before schema alignment. To realize this opportunity, several challenges need to be addressed: identifiers need to be discovered on product pages, made difficult by the diversity of identifiers; the main product identifier on the page needs to be identified, made difficult by the many related products presented on the page; and identifiers across pages need to be resolved, made difficult by the ambiguity between identifiers across product categories. We present our RaF (Redundancy as Friend) solution to the problem of big data linkage for product specification pages, which takes advantage of the redundancy of identifiers at a global level, and the homogeneity of structure and semantics at the local source level, to effectively and efficiently link millions of pages of head and tail products across thousands of head and tail sources. We perform a thorough empirical evaluation of our RaF approach using the publicly available Dexter dataset consisting of 1.9M product pages from 7.1k sources of 3.5k websites, and demonstrate its effectiveness in practice.
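The core linkage idea in the abstract can be illustrated as grouping pages that transitively share a product identifier. The sketch below is a minimal illustration only: `link_pages` and its input format are hypothetical, and the actual RaF system additionally handles identifier discovery on the page, selection of the main identifier, and cross-category identifier ambiguity.

```python
from collections import defaultdict

def link_pages(pages):
    """Cluster pages that transitively share an identifier.

    pages: dict mapping page_id -> set of product identifiers found on that page.
    Returns a list of sets of page_ids (one set per linked product).
    """
    # Union-find over page ids
    parent = {p: p for p in pages}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb

    # Pages sharing any identifier are linked
    by_id = defaultdict(list)
    for page, ids in pages.items():
        for ident in ids:
            by_id[ident].append(page)
    for plist in by_id.values():
        for other in plist[1:]:
            union(plist[0], other)

    clusters = defaultdict(set)
    for page in pages:
        clusters[find(page)].add(page)
    return list(clusters.values())
```

For example, a page with identifiers {X1, X2} links a page carrying only X1 with one carrying only X2, which is where redundancy across sources helps.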


Citations
Journal ArticleDOI

Incorporating Data Context to Cost-Effectively Automate End-to-End Data Wrangling

TL;DR: A notion of data context is introduced, which associates portions of a target schema with extensional data of types that are commonly available, and a scalable methodology to bootstrap an end-to-end data wrangling process based on data profiling is defined.
Journal Article

Big Data Integration for Product Specifications.

TL;DR: This paper presents a pipeline that decomposes the problem into tasks ranging from source and data discovery, to extraction, data linkage, schema alignment, and data fusion, and reports the results of these efforts towards big data integration for product specifications.
Journal ArticleDOI

Fine-grained semantic type discovery for heterogeneous sources using clustering

TL;DR: In this article, the authors focus on the key task of semantic type discovery over a set of heterogeneous sources and propose an iterative RaF-STD solution, which consists of three key steps: (i) a Bayesian model analysis of overlapping information across sources to match the most locally homogeneous attributes; (ii) a tagging approach, inspired by NLP techniques, to create (virtual) attributes from portions of heterogeneous attribute values; and (iii) a novel use of classical techniques based on matching of attribute names and domains.

Lessons Learned and Research Agenda for Big Data Integration of Product Specifications.

TL;DR: This paper presents ongoing efforts, challenges, and a research agenda for big data integration of product specifications.
Proceedings Article

OpenTRIAGE: Entity Linkage for Detail Webpages

TL;DR: OpenTriage, a system for extracting structured entities from detail Web pages of several sites and finding linkages between the extracted data, is presented, based on a hybrid human-machine learning technique that targets a desired quality level.
References
Journal ArticleDOI

Fast unfolding of communities in large networks

TL;DR: This work proposes a heuristic method shown to outperform all other known community detection methods in terms of computation time; the quality of the communities detected, as measured by modularity, is also very good.
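Modularity, the quality measure named in this TL;DR, compares the fraction of edges inside communities with its expectation under a random degree-preserving rewiring. A minimal sketch of evaluating it for a given partition (function and variable names are illustrative; the Louvain method itself greedily optimizes this quantity rather than merely computing it):

```python
from collections import Counter

def modularity(edges, community):
    """Modularity Q of a partition of an undirected graph.

    edges: list of (u, v) pairs (no self-loops assumed).
    community: dict mapping node -> community label.
    """
    m = len(edges)
    degree = Counter()     # node degrees
    intra = Counter()      # edges fully inside each community
    deg_sum = Counter()    # total degree per community
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
        if community[u] == community[v]:
            intra[community[u]] += 1
    for node, d in degree.items():
        deg_sum[community[node]] += d
    # Q = sum over communities of (intra-edge fraction - expected fraction)
    return sum(intra[c] / m - (deg_sum[c] / (2 * m)) ** 2 for c in deg_sum)
```

On two triangles joined by a single bridge edge, partitioning at the bridge yields Q = 5/14, reflecting a clear community structure.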
Journal ArticleDOI

Authoritative sources in a hyperlinked environment

TL;DR: This work proposes and tests an algorithmic formulation of the notion of authority, based on the relationship between a set of relevant authoritative pages and the set of "hub pages" that join them together in the link structure; the formulation has connections to the eigenvectors of certain matrices associated with the link graph.
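The mutually reinforcing hub/authority relationship described here can be sketched with the standard power-iteration updates (a simplified illustration; node names and the fixed iteration count are assumptions, and real implementations iterate to convergence on a focused subgraph):

```python
import math

def hits(edges, nodes, iters=50):
    """Hub and authority scores via mutually reinforcing updates.

    edges: list of directed (src, dst) pairs.
    nodes: iterable of all node ids.
    """
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iters):
        # A page's authority is the sum of hub scores of pages linking to it
        auth = {n: sum(hub[s] for s, d in edges if d == n) for n in nodes}
        # A page's hub score is the sum of authority scores of pages it links to
        hub = {n: sum(auth[d] for s, d in edges if s == n) for n in nodes}
        # L2-normalize both score vectors each round
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for n in scores:
                scores[n] /= norm
    return hub, auth
```

A page linked to by many good hubs accumulates authority, and a page linking to many good authorities becomes a better hub, which is exactly the eigenvector relationship the TL;DR mentions.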
Proceedings ArticleDOI

Knowledge vault: a web-scale approach to probabilistic knowledge fusion

TL;DR: The Knowledge Vault is a Web-scale probabilistic knowledge base that combines extractions from Web content (obtained via analysis of text, tabular data, page structure, and human annotations) with prior knowledge derived from existing knowledge repositories, and computes calibrated probabilities of fact correctness.
Journal ArticleDOI

YAGO2: A spatially and temporally enhanced knowledge base from Wikipedia

TL;DR: YAGO2 is an extension of the YAGO knowledge base in which entities, facts, and events are anchored in both time and space; it contains 447 million facts about 9.8 million entities.