Proceedings Article

InfoGather+: semantic matching and annotation of numeric and time-varying attributes in web tables

TLDR
The paper presents a semantic graph that labels columns with unit, scale, and timestamp information and computes semantic matches between columns even when the same numeric attribute is expressed in different units or scales, together with a novel entity augmentation API for numeric and time-varying attributes that leverages this graph.
Abstract
Users often need to gather information about "entities" of interest. Recent efforts try to automate this task by leveraging the vast corpus of HTML tables; this is referred to as "entity augmentation". The accuracy of entity augmentation critically depends on semantic relationships between web tables as well as semantic labels of those tables. Current techniques work well for string-valued and static attributes but perform poorly for numeric and time-varying attributes. In this paper, we first build a semantic graph that (i) labels columns with unit, scale and timestamp information and (ii) computes semantic matches between columns even when the same numeric attribute is expressed in different units or scales. Second, we develop a novel entity augmentation API suited for numeric and time-varying attributes that leverages the semantic graph. Building the graph is challenging as such label information is often missing from the column headers. Our key insight is to leverage the wealth of tables on the web and infer label information from semantically matching columns of other web tables; this complements "local" extraction from column headers. However, this creates an interdependence between labels and semantic matches; we address this challenge by representing the task as a probabilistic graphical model that jointly discovers labels and semantic matches over all columns. Our experiments on real-life datasets show that (i) our semantic graph contains higher quality labels and semantic matches and (ii) entity augmentation based on the above graph has significantly higher precision and recall compared with the state-of-the-art.
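To make the idea of matching columns "expressed in different units or scales" concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the paper's algorithm, which uses a probabilistic graphical model to discover labels and matches jointly. The function name `match_under_scaling`, the candidate factor list, the tolerance, and the sample values are all hypothetical. Each column is modeled as a mapping from entity name to numeric value, and two columns are taken to match under a factor when most shared entities agree after rescaling.

```python
def match_under_scaling(col_a, col_b,
                        factors=(1.0, 1e3, 1e6, 1e-3, 1e-6),
                        tolerance=0.05):
    """Return (factor, score) for the candidate factor that best maps col_a onto col_b.

    col_a, col_b: dicts mapping entity name -> numeric value (hypothetical layout).
    score is the fraction of shared entities whose values agree within
    `tolerance` (relative error) after multiplying col_a's value by the factor.
    """
    # Only entities present in both columns can provide evidence.
    shared = [e for e in col_a if e in col_b and col_a[e] != 0]
    if not shared:
        return None, 0.0

    best_factor, best_score = None, 0.0
    for factor in factors:
        agree = sum(
            1 for e in shared
            if abs(col_b[e] - factor * col_a[e]) <= tolerance * abs(factor * col_a[e])
        )
        score = agree / len(shared)
        if score > best_score:
            best_factor, best_score = factor, score
    return best_factor, best_score


# Made-up example: the same revenue attribute reported in millions and in thousands.
revenue_millions = {"Acme": 12.5, "Globex": 300.0, "Initech": 4.2}
revenue_thousands = {"Acme": 12500.0, "Globex": 300000.0, "Initech": 4250.0}

factor, score = match_under_scaling(revenue_millions, revenue_thousands)
print(factor, score)
```

A high score for a factor other than 1.0 suggests that the two columns carry the same attribute at different scales. In that case a scale label known for one column could be carried over to the other, which is the kind of label inference from matching columns that the abstract describes.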

