
Effective and Efficient Semantic Table Interpretation using TableMiner+

Ziqi Zhang
07 Aug 2017 - Semantic Web, Vol. 8, Iss. 6, pp. 921-957

Effective and Efficient Semantic Table Interpretation using TableMiner+
Ziqi Zhang
Department of Computer Science, University of Sheffield, Regent Court, 211 Portobello, Sheffield, S1 4DP
E-mail: ziqi.zhang@sheffield.ac.uk
Abstract. This article introduces TableMiner+, a Semantic Table Interpretation method that annotates Web tables in both an effective and efficient way. Built on our previous work TableMiner, the extended version advances the state of the art in several ways. First, it improves annotation accuracy by making innovative use of various types of contextual information, both inside and outside tables, as features for inference. Second, it reduces computational overheads by adopting an incremental, bootstrapping approach that starts by creating preliminary and partial annotations of a table using 'sample' data in the table, then uses the outcome as a 'seed' to guide the interpretation of the remaining content. This is followed by a message-passing process that iteratively refines the results on the entire table to create the final, optimal annotations. Third, it is able to handle all annotation tasks of Semantic Table Interpretation (e.g., annotating a column, or entity cells), while state-of-the-art methods are limited in different ways. We also compile the largest dataset known to date and extensively evaluate TableMiner+ against four baselines and two re-implemented state-of-the-art methods. TableMiner+ consistently outperforms all comparative models under all experimental settings. On the two most diverse datasets, covering multiple domains and various table schemata, it achieves an improvement in F1 of between 1 and 42 percentage points, depending on the specific annotation task. It also significantly reduces computational overheads in terms of wall-clock time when compared against classic methods that 'exhaustively' process the entire table content to build features for inference. As a concrete example, compared against an existing method based on joint inference implemented with parallel computation, the non-parallel implementation of TableMiner+ achieves a significant improvement in learning accuracy and almost an order of magnitude of savings in wall-clock time.
Keywords: Web table, Named Entity Recognition, Named Entity Disambiguation, Relation Extraction, Linked Data, Semantic Table Interpretation, table annotation
1. Introduction
Recovering semantics from tables is a crucial task in realizing the vision of the Semantic Web. On the one hand, the number of high-quality tables containing useful relational data is growing rapidly into the hundreds of millions [4,5]. On the other hand, search engines typically ignore the underlying semantics of such structures at indexing time, and hence perform poorly on tabular data [21,26].
The research directed at this particular problem is Semantic Table Interpretation [14,15,21,27,34,25,35,36,3,26,23], which deals with three types of annotation tasks in tables. Given a well-formed relational table (e.g., Figure 1) and reference sets of concepts (or classes, types), named entities (or simply 'entities') and relations, the tasks are to: (1) disambiguate entity mentions in content cells (or simply 'cells') by linking them to existing reference entities; (2) annotate columns with semantic concepts if they contain entity mentions (NE-columns), or with properties of concepts if they contain data literals (literal-columns); and (3) identify the semantic relations between columns.
The annotations created can enable semantic indexing and search of the data, and can be used to create Linked Open Data (LOD).
Although classic Natural Language Processing (NLP) and Information Extraction (IE) techniques address similar research problems [45,10,6,28], they are tailored for well-formed sentences in unstructured texts, and are unlikely to succeed on tabular data [21,26]. Typical Semantic Table Interpretation methods make extensive use of structured knowledge bases, which contain candidate concepts and entities, each defined with rich lexical and semantic information and linked by relations. The general workflow involves: (1) retrieving candidates corresponding to table components (e.g., concepts given a column header, or entities given the text of a cell) from the knowledge base; (2) representing candidates using features extracted from both the knowledge base and the tables, to model the semantic interdependence between table components and candidates (e.g., between the header text of a column and the name of a candidate concept) and between the various table components (e.g., a column should be annotated by a concept that is shared by all entities in the cells of that column); and (3) applying inference to choose the best candidates.
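As an illustration of this retrieve-represent-infer workflow, the following minimal Python sketch classifies a single column against a toy knowledge base. All names, the toy knowledge base, and the single-feature scoring are hypothetical stand-ins, not the implementation of any of the surveyed methods.

# A minimal sketch of the generic three-step workflow: (1) retrieve
# candidates, (2) build feature representations, (3) infer the best
# candidate. Knowledge base and scoring are illustrative only.

import re
from dataclasses import dataclass

@dataclass
class Candidate:
    uri: str
    label: str
    score: float = 0.0

# Step 1: candidate retrieval (a stand-in for querying a real knowledge base).
TOY_KB = {"title": [Candidate("kb:/concept/Film", "film"),
                    Candidate("kb:/concept/Book", "book")]}

def retrieve_candidates(header_text):
    return TOY_KB.get(header_text.lower(), [])

# Step 2: feature representation, here a single token-overlap feature
# between a piece of table text and the candidate's label.
def token_overlap(a, b):
    ta = set(re.findall(r"[a-z0-9]+", a.lower()))
    tb = set(re.findall(r"[a-z0-9]+", b.lower()))
    return len(ta & tb) / max(len(ta | tb), 1)

# Step 3: inference, here a naive argmax over the summed feature scores
# of the header and every cell in the column.
def classify_column(header, cells):
    candidates = retrieve_candidates(header)
    for c in candidates:
        c.score = token_overlap(header, c.label) + sum(
            token_overlap(cell, c.label) for cell in cells)
    return max(candidates, key=lambda c: c.score, default=None)

print(classify_column("Title", ["A Difficult Life (film)", "Il Sorpasso (film)"]))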
This work addresses several limitations of the state of the art along three dimensions: effectiveness, efficiency, and completeness.
Effectiveness - Semantic Table Interpretation methods so far have primarily exploited features derived from two sources: knowledge bases, and table components such as the header and row content (to be called 'in-table context'). In this work, we propose to utilize the so-called 'out-table context', i.e., the textual content around and outside tables (e.g., paragraphs, captions), to further improve interpretation accuracy. As an example, the first column in the table shown in Figure 1 (to be called the 'example table') has the header 'Title', which is highly ambiguous and arguably irrelevant to the concept we should use to annotate the column. However, on the containing webpage, the word 'film' is repeated 17 times. This is a strong indicator for selecting a suitable concept for the column. A particular type of out-table context we utilize is the semantic markup inserted within webpages by data publishers, such as RDFa/Microdata annotations. These markups are growing rapidly as major search engines use them to enable semantic indexing and search. When available, they provide high-quality, important information about the webpages and the tables they contain.
We show empirically that we can derive useful features from out-table context to improve annotation accuracy. Such out-table features are highly generic and generally available. On the contrary, many existing methods use knowledge-base-specific features that are impossible to generalize, or that suffer substantially in terms of accuracy when they can be adapted, as we shall show in the experiments.
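To make the intuition concrete, the sketch below computes one plausible out-table feature: how often a candidate concept's label occurs in the text surrounding a table. This is only an illustration of the idea; the actual out-table features and their weighting are defined later in the paper.

# Hypothetical out-table context feature: frequency of a candidate concept's
# label in the text around the table (page title, captions, paragraphs).

import re
from collections import Counter

def out_table_frequency(candidate_label, page_text):
    counts = Counter(re.findall(r"[a-z]+", page_text.lower()))
    return counts[candidate_label.lower()]

page = ("This page lists films directed by Dino Risi. "
        "Each film is shown with its release date. " * 8)
# A high count for 'film' supports annotating the ambiguous 'Title'
# column with the concept Film.
print(out_table_frequency("Film", page))  # -> 8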
Efficiency - We argue that efficiency is also an important factor to consider in the task of Semantic Table Interpretation, even though it has never been explicitly addressed before. The major bottleneck is due to three types of operations: querying the knowledge bases, building feature representations for candidates, and computing similarity between candidates. Both the number of queries and the number of similarity computations can grow quadratically with the size of a table, as such operations are often required for each pair of candidates [21,26,24]. Empirically, Limaye et al. [21] show that the actual inference algorithm consumes less than 1% of total running time. Using a local copy of the knowledge base only partially addresses the issue, and introduces further problems. First, hosting a local knowledge base requires infrastructural support and involves set-up and maintenance. As we enter the 'Big Data' era, knowledge bases are growing rapidly towards colossal structures such as the Google Knowledge Graph [1], which constantly integrates increasing numbers of heterogeneous sources. Maintaining a local copy of such a knowledge base is likely to require an infrastructure that not every organization can afford [29]. Second, local data are not guaranteed to be up-to-date. Third, scaling up to very large amounts of input data requires efficient algorithms in addition to parallelization [32], as the process can be bound by the large number of I/O operations. Therefore, in our view, a more versatile solution is to cut down the number of queries and the number of data items to be processed. This reduces I/O operations in both the local and remote scenarios, and also reduces the costs associated with making remote calls to Web service providers.
Fig. 1. An example Wikipedia webpage containing a relational table (last retrieved on 9 April 2014).

In this direction, we identify an opportunity to improve the state of the art in terms of efficiency. To illustrate, consider the example table, which in reality contains over 60 rows. To annotate each column, existing methods would use content from every row in the column. However, from a human reader's point of view, this is unnecessary: simply by reading the eight rows shown, one can confidently assign a concept to the column that best describes its content. Being able to make such inferences with limited data gives a substantial efficiency advantage to Semantic Table Interpretation algorithms, as it significantly reduces the number of queries to the underlying knowledge bases and the number of candidates to be considered during inference.
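A minimal sketch of this 'start small' idea is shown below: candidate concepts are scored row by row, and processing stops once the top-ranked candidate has been stable for a few consecutive rows. The stopping rule used here is an assumed simplification; TableMiner+'s actual convergence criterion is defined later in the paper.

# Toy sketch of sample-based column classification: stop processing rows
# once the top-ranked candidate concept has been stable for `patience`
# consecutive rows, instead of exhaustively processing the whole column.

from collections import defaultdict

def classify_with_sampling(cells, score_cell, patience=3):
    scores = defaultdict(float)
    stable, best_so_far, processed = 0, None, 0
    for processed, cell in enumerate(cells, start=1):
        for concept, s in score_cell(cell).items():  # per-cell candidate scores
            scores[concept] += s
        best = max(scores, key=scores.get)
        stable = stable + 1 if best == best_so_far else 0
        best_so_far = best
        if stable >= patience:                       # ranking has stabilized
            break
    return best_so_far, processed

cells = ["A Difficult Life"] * 60                    # a 60-row column
toy_scorer = lambda cell: {"Film": 1.0, "Book": 0.4}
print(classify_with_sampling(cells, toy_scorer))     # ('Film', 4): 4 rows, not 60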
Completeness - Many existing methods deal with only one or two of the annotation tasks in a table [35,36]. Among those that deal with all tasks [21,27,25,26,34], only NE-columns are considered. As shown in Figure 1, tables can contain both NE-columns, containing entity mentions, and literal-columns, containing data values of the entities on the corresponding rows. Methods such as Limaye et al. [21] and Mulwad et al. [26] can recognize the relation between the first and third columns, but are unable to identify the relation between the first and second columns. We therefore argue that a 'complete' Semantic Table Interpretation method should handle columns of both data types.
To address these issues, we previously developed TableMiner [42], which uses features from both in- and out-table context and annotates NE-columns and cells in a relational table based on the principle of 'start small, build complete'. That is, (1) create preliminary, likely erroneous annotations based on partial table content and a simple model assuming limited interdependence between table components; and (2) iteratively optimize the preliminary annotations by enforcing interdependence between table components. In this work we extend it to build TableMiner+, adding 'subject column' [35,36] detection and relation enumeration, and improving the iterative optimization process. Concretely, TableMiner+ first interprets NE-columns (to be called column interpretation), coupling column classification and entity disambiguation in a mutually recursive process that consists of a LEARNING phase and an UPDATE phase. The LEARNING phase interprets each column independently, by first learning to create preliminary column annotations using an automatically determined 'sample' from the column, followed by 'constrained' entity disambiguation of the cells in the column (limiting the candidate entity space using the preliminary column annotations). The UPDATE phase iteratively optimizes the classification and disambiguation results in each column based on a notion of 'domain consensus' that captures inter-column and inter-task dependence, creating a global optimum. For relation enumeration, TableMiner+ detects a subject column in the table and infers its relations with the other columns (both NE- and literal-columns) in the table.
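The control flow just described can be summarized in a short, runnable sketch. Every component below is a toy stand-in: the real sampling, scoring, 'domain consensus' optimization and subject-column detection are specified in Sections 4 to 9.

# Toy sketch of the LEARNING/UPDATE control flow described above. The
# classification and disambiguation logic are placeholders, not the
# components TableMiner+ actually uses.

def learn_column(cells, sample_size=3):
    """LEARNING: preliminary concept from a sample, then constrained
    disambiguation of all cells in the column."""
    sample = cells[:sample_size]
    concept = "Film" if any("film" in c.lower() for c in sample) else "Thing"
    entities = [(c, f"kb:/{concept.lower()}/{c.lower().replace(' ', '_')}")
                for c in cells]                     # candidate space limited by concept
    return concept, entities

def update(columns, annotations, max_iter=10):
    """UPDATE: iteratively revise all columns until nothing changes
    (a simplified stand-in for the 'domain consensus' optimization)."""
    for _ in range(max_iter):
        revised = {name: learn_column(cells, sample_size=len(cells))
                   for name, cells in columns.items()}
        if revised == annotations:                  # simplified convergence test
            break
        annotations = revised
    return annotations

columns = {"Title": ["A Difficult Life (film)", "Il Sorpasso (film)"]}
preliminary = {name: learn_column(cells) for name, cells in columns.items()}
print(update(columns, preliminary))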
TableMiner+ is evaluated on four datasets containing over 15,000 tables, against four baselines and two re-implemented state-of-the-art methods. It consistently obtains the best performance on all datasets. On the two most diverse datasets, covering multiple domains and various table schemata, it obtains an improvement of about 1-18 percentage points in disambiguation, 6-42 in classification, and 4-16 in relation enumeration. It is also very efficient, achieving up to a 66% reduction in the number of candidates to be processed and up to 29% savings in wall-clock time compared against exhaustive baseline methods. Even in the setting where a local copy of the knowledge base is used, TableMiner+ delivers almost an order of magnitude of savings in wall-clock time compared against one re-implemented state-of-the-art method.
The remainder of this paper is organized as follows. Section 2 defines terms and concepts used in the relevant domain. Section 3 discusses related work. Sections 4 to 9 introduce TableMiner+ in detail. Sections 10 and 11 describe the experiment settings and discuss results, followed by conclusions in Section 12.
2. Terms and concepts
A relational table contains regular rows and columns, resembling tables in traditional databases. In practice, tables containing complex structures constitute a small population and have not been the focus of research. In theory, complex tables can be interpreted by adding a pre-processing step that parses complex structures using methods such as Zanibbi et al. [39].

Relational tables may or may not contain a header row, which is typically the first row in a table. They often contain a subject column that usually (but not necessarily) corresponds to the 'primary key' column in a database table [35,36]. This column contains the set of entities the table is about (subject entities; e.g., column 'Title' in Figure 1 contains the list of films the table is about), while other columns contain either entities forming binary relationships with the subject entities, or literals describing attributes of the subject entities.
A knowledge base defines a set of concepts (or types, classes), their object instances or entities, literals representing concrete data values, and semantic relations that define possible associations between entities (hence also between the concepts they belong to), or between an entity and a literal, in which case the relation is usually called a property of the entity (hence a property of its concept), and the literal the property value. In its generic form, a knowledge base is a linked dataset containing a set of triples, statements, or facts, each composed of a subject, predicate and object. The subject can be a concept or entity; the object can be a concept, entity, or literal; and the predicate can be a relation or property. A knowledge base can be a populated ontology, such as the YAGO (http://www.mpi-inf.mpg.de/yago-naga/yago/) and DBpedia (http://wiki.dbpedia.org/Ontology) datasets, in which case a concept hierarchy is defined. However, this is not always the case, as some knowledge bases define not a strict ontology but a loose concept network, such as Freebase (http://www.freebase.com).
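For illustration, this triple model can be captured with a minimal data structure. The triples below reuse the Freebase identifiers quoted in the running example further down; the identifier syntax and the literal value are illustrative rather than an exact Freebase serialization.

# Minimal sketch of the subject-predicate-object triple model described
# above. Identifier syntax is illustrative, not an exact Freebase format.

from typing import NamedTuple, Union

class Triple(NamedTuple):
    subject: str                  # concept or entity
    predicate: str                # relation or property
    object: Union[str, float]     # concept, entity, or literal value

kb = [
    Triple("fb:/m/02qlhz2", "rdf:type", "fb:/film/film"),
    Triple("fb:/m/02qlhz2", "fb:/film/film/directed_by", "fb:/m/0j_nhj"),
    Triple("fb:/m/02qlhz2", "fb:/film/film/initial_release_date", "1961"),
]

# When the object is a literal, the predicate acts as a property of the
# entity (and hence of its concept):
print([t for t in kb if not str(t.object).startswith(("fb:", "rdf:"))])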
The task of Semantic Table Interpretation addresses three annotation tasks. Named Entity Disambiguation associates each cell in an NE-column with one canonical entity. Column classification annotates each NE-column with one concept or, in the case of literal-columns, associates the column with one property of the concept assigned to the subject column of the table. Relation Extraction (or enumeration) identifies binary relations between NE-columns or, in the case of one NE-column and a literal-column, and given that the NE-column is annotated by a specific concept, identifies a property of that concept that can explain the data literals. The candidate entities, concepts and relations are drawn from the knowledge base.
Using the example table and Freebase as an example, the first column can be considered a reasonable subject column and should be annotated by the Freebase type 'Film' (URI 'fb:/film/film', where fb denotes http://www.freebase.com). 'A Difficult Life' in the first column should be annotated by 'fb:/m/02qlhz2', which denotes a movie directed by 'Dino Risi' (in the third column, 'fb:/m/0j_nhj'). The relation between the first and third columns should be annotated as 'Directed by' ('fb:/film/film/directed_by'), and the relation between the first and second columns (the latter being a literal-column) should be the property of 'Film' 'initial release date' ('fb:/film/film/initial_release_date'), which we also use to annotate the second column.
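Putting the three tasks together, the complete set of annotations for the example table can be written down as plain data, using the URIs quoted above. The header names 'Release date' and 'Director' are assumed here for the second and third columns, and the dictionary layout is only an illustrative convention.

# Expected output of the three annotation tasks on the example table.
# 'Release date' and 'Director' are assumed column headers.

annotations = {
    "columns": {
        "Title": "fb:/film/film",                               # classification
        "Release date": "fb:/film/film/initial_release_date",   # literal-column property
    },
    "cells": {
        ("Title", "A Difficult Life"): "fb:/m/02qlhz2",          # disambiguation
        ("Director", "Dino Risi"): "fb:/m/0j_nhj",
    },
    "relations": {
        ("Title", "Director"): "fb:/film/film/directed_by",      # relation enumeration
        ("Title", "Release date"): "fb:/film/film/initial_release_date",
    },
}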
3. Related work
3.1. Legacy tabular data to linked data
Research on converting tabular data in legacy data sources to the linked data format has made a solid contribution toward the rapid growth of the LOD cloud in the past decade [12,19,30,7]. The key difference from the task of Semantic Table Interpretation is that the focus is on data generation rather than interpretation, since the goal is to pragmatically convert tabular data from databases, spreadsheets, and similar data structures into RDF.
Typical methods require manual (or partially automated) mapping between the two data structures (the input and the output RDF), and they do not link data to existing concepts, entities and relations in the LOD cloud. As a result, the implicit semantics of the data remain hidden.
3.2. General NLP and IE
One may argue for applying general-purpose NLP/IE methods to Semantic Table Interpretation, given their highly similar objectives. This is infeasible for a number of reasons. First, state-of-the-art methods [31,17] are typically tailored to unstructured text content, which is different from tabular data. The interdependence among table components cannot be easily modeled in such methods [22]. Second, and particularly for the tasks of Named Entity Classification and Relation Extraction, classic methods require each target semantic label (i.e., concept or relation) to be pre-defined, and learning requires training or seed data [28,40]. In Semantic Table Interpretation, however, due to the large degree of variation in table schemata (e.g., Limaye et al. [21] use a dataset of over 6,000 randomly crawled Web tables for which no information about the table schemata is known a priori), defining a comprehensive set of semantic concepts and relations and subsequently creating the necessary training or seed data is infeasible.
A related IE task tailored to structured data is wrapper induction [18,9], which automatically learns wrappers that can extract information from regular, recurrent structures (e.g., product attributes from Amazon webpages). In the context of relational tables, wrapper induction methods can be adapted to annotate table columns that describe entity attributes. However, they also require training data and the table schemata to be known a priori.
3.3. Table extension and augmentation
Table extension and augmentation aims at gathering relational tables that contain the same entities but cover complementary attributes of those entities, and integrating these tables by joining them on the same entities. For example, Yakout et al. [38] propose InfoGather for populating a table of entities with their attributes by harvesting related tables on the Web. Users need to provide either the desired attribute names of the entities or example values of their attributes. The system can also discover the set of attributes for similar entities. Bhagavatula et al. [2] introduce WikiTables, which, given a query table and a collection of other tables, identifies columns from the other tables that would make relevant additions to the query table. They first identify a reference column (e.g., country names in a table of country populations) in the query table to use for joining, then find a different table (e.g., a list of countries by GDP) with a column similar to the reference column, and perform a left outer join to augment the query table with an automatically selected column from the new table (e.g., the GDP amounts). Lehmberg et al. [20] create the Mannheim Search Joins Engine with the same goal as WikiTables, but focus on handling tens of millions of tables from heterogeneous sources.
The key difference between these systems and the task of Semantic Table Interpretation is that they focus on integration rather than interpretation. The data collected are not linked to knowledge bases, and ambiguity still remains.
3.4. Semantic Table Interpretation
Hignette et al. [14,15] and Buche et al. [3] propose methods to identify the concepts represented by table columns and to detect the relations present in tables, in a domain-specific context. An NE-column is annotated based on two factors: the similarity between the header text of the column and the name of a candidate concept, plus the similarities calculated between each cell in the column and each term in the hierarchical paths containing the candidate concept. For relations, they only detect the presence of semantic relations in the table, without specifying which columns form the binary relations.
Venetis et al. [35] annotate table columns and identify relations between the subject column and other columns using types and relations from a database constructed by mining the Web with lexico-syntactic patterns such as the Hearst patterns [13]. The database contains co-occurrence statistics about the subjects and objects of triples, for example, how many times the words 'cat' and 'animal' have been extracted by the pattern <?, such as, ?> representing the is-a relation between a concept and its instances. A maximum-likelihood inference model predicts the best type for a column to be the one maximizing the probability of seeing all the values in the column given that type for the column. This probability is computed from the co-occurrence statistics gathered in the database. Relation interpretation follows the same principle.
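In schematic form (a simplified reading of this model, not the exact formulation in [35]), the predicted type t* for a column with cell values v_1, ..., v_n is:

t^{*} = \arg\max_{t} P(t \mid v_1, \dots, v_n) \propto P(t) \prod_{i=1}^{n} P(v_i \mid t)

where each P(v_i | t) is estimated from the is-a co-occurrence counts mined with patterns such as 't such as v_i'.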

References
M.A. Hearst. Automatic acquisition of hyponyms from large text corpora. COLING, 1992.
D. Nadeau and S. Sekine. A survey of named entity recognition and classification. Lingvisticae Investigationes, 2007.
S. Cucerzan. Large-scale named entity disambiguation based on Wikipedia data. EMNLP-CoNLL, 2007.
N. Kushmerick, D.S. Weld and R. Doorenbos. Wrapper induction for information extraction. IJCAI, 1997.
W. Wu, H. Li, H. Wang and K.Q. Zhu. Probase: a probabilistic taxonomy for text understanding. SIGMOD, 2012.