
Effective and Efficient Semantic Table Interpretation using TableMiner+

Ziqi Zhang
07 Aug 2017 - Semantic Web, Vol. 8, Iss. 6, pp. 921-957

Effective and Efficient Semantic Table Interpretation using TableMiner+
Ziqi Zhang
Department of Computer Science, University of Sheffield, Regent Court, 211 Portobello, Sheffield, S1 4DP
E-mail: ziqi.zhang@sheffield.ac.uk
Abstract. This article introduces TableMiner+, a Semantic Table Interpretation method that annotates Web tables in both an effective and efficient way. Built on our previous work TableMiner, the extended version advances the state of the art in several ways. First, it improves annotation accuracy by making innovative use of various types of contextual information, both inside and outside tables, as features for inference. Second, it reduces computational overheads by adopting an incremental, bootstrapping approach that starts by creating preliminary and partial annotations of a table using 'sample' data in the table, then uses the outcome as a 'seed' to guide the interpretation of the remaining content. This is followed by a message-passing process that iteratively refines the results on the entire table to create the final, optimal annotations. Third, it is able to handle all annotation tasks of Semantic Table Interpretation (e.g., annotating a column, or entity cells), while state-of-the-art methods are limited in different ways. We also compile the largest dataset known to date and extensively evaluate TableMiner+ against four baselines and two re-implemented state-of-the-art methods. TableMiner+ consistently outperforms all comparative models under all experimental settings. On the two most diverse datasets, covering multiple domains and various table schemata, it achieves an improvement in F1 of between 1 and 42 percentage points, depending on the specific annotation task. It also significantly reduces computational overheads in terms of wall-clock time when compared against classic methods that 'exhaustively' process the entire table content to build features for inference. As a concrete example, compared against an existing method based on joint inference implemented with parallel computation, the non-parallel implementation of TableMiner+ achieves a significant improvement in learning accuracy and almost an order of magnitude of savings in wall-clock time.
Keywords: Web table, Named Entity Recognition, Named Entity Disambiguation, Relation Extraction, Linked Data, Semantic Table Interpretation, table annotation
1. Introduction
Recovering semantics from tables is a crucial task in realizing the vision of the Semantic Web. On the one hand, the number of high-quality tables containing useful relational data is growing rapidly into the hundreds of millions [4,5]. On the other hand, search engines typically ignore the underlying semantics of such structures at indexing time, and hence perform poorly on tabular data [21,26].
The research directed at this particular problem is Semantic Table Interpretation [14,15,21,27,34,25,35,36,3,26,23], which deals with three types of annotation tasks in tables. Given a well-formed relational table (e.g., Figure 1) and reference sets of concepts (or classes, types), named entities (or simply 'entities') and relations, the tasks are to: (1) disambiguate entity mentions in content cells (or simply 'cells') by linking them to existing reference entities; (2) annotate columns with semantic concepts if they contain entity mentions (NE-columns), or with properties of concepts if they contain data literals (literal-columns); and (3) identify the semantic relations between columns.
The annotations created can enable semantic indexing and search of the data, and can be used to create Linked Open Data (LOD).
Although classic Natural Language Processing (NLP) and Information Extraction (IE) techniques address similar research problems [45,10,6,28], they are tailored for well-formed sentences in unstructured texts, and are unlikely to succeed on tabular data [21,26]. Typical Semantic Table Interpretation methods make extensive use of structured knowledge bases, which contain candidate concepts and entities, each defined with rich lexical and semantic information and linked by relations. The general workflow involves: (1) retrieving candidates corresponding to table components (e.g., concepts given a column header, or entities given the text of a cell) from the knowledge base; (2) representing candidates using features extracted from both the knowledge base and the tables, to model the semantic interdependence between table components and candidates (e.g., between the header text of a column and the name of a candidate concept) and between the various table components (e.g., a column should be annotated by a concept that is shared by all entities in the cells of that column); and (3) applying inference to choose the best candidates.
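As an illustration of this retrieve-represent-infer workflow, the following minimal Python sketch classifies a single column against a toy knowledge base. All names, the toy knowledge base, and the single-feature scoring are hypothetical stand-ins, not the implementation of any of the surveyed methods.

# A minimal sketch of the generic three-step workflow: (1) retrieve
# candidates, (2) build feature representations, (3) infer the best
# candidate. Knowledge base and scoring are illustrative only.

import re
from dataclasses import dataclass

@dataclass
class Candidate:
    uri: str
    label: str
    score: float = 0.0

# Step 1: candidate retrieval (a stand-in for querying a real knowledge base).
TOY_KB = {"title": [Candidate("kb:/concept/Film", "film"),
                    Candidate("kb:/concept/Book", "book")]}

def retrieve_candidates(header_text):
    return TOY_KB.get(header_text.lower(), [])

# Step 2: feature representation, here a single token-overlap feature
# between a piece of table text and the candidate's label.
def token_overlap(a, b):
    ta = set(re.findall(r"[a-z0-9]+", a.lower()))
    tb = set(re.findall(r"[a-z0-9]+", b.lower()))
    return len(ta & tb) / max(len(ta | tb), 1)

# Step 3: inference, here a naive argmax over the summed feature scores
# of the header and every cell in the column.
def classify_column(header, cells):
    candidates = retrieve_candidates(header)
    for c in candidates:
        c.score = token_overlap(header, c.label) + sum(
            token_overlap(cell, c.label) for cell in cells)
    return max(candidates, key=lambda c: c.score, default=None)

print(classify_column("Title", ["A Difficult Life (film)", "Il Sorpasso (film)"]))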
This work addresses several limitations of the state of the art along three dimensions: effectiveness, efficiency, and completeness.
Effectiveness - Semantic Table Interpretation methods so far have primarily exploited features derived from two sources: knowledge bases, and table components such as the header and row content (to be called 'in-table context'). In this work, we propose to utilize the so-called 'out-table context', i.e., the textual content around and outside tables (e.g., paragraphs, captions), to further improve interpretation accuracy. As an example, the first column in the table shown in Figure 1 (to be called the 'example table') has the header 'Title', which is highly ambiguous and arguably irrelevant to the concept we should use to annotate the column. However, on the containing webpage, the word 'film' is repeated 17 times. This is a strong indicator for selecting a suitable concept for the column. A particular type of out-table context we utilize is the semantic markup inserted within webpages by data publishers, such as RDFa/Microdata annotations. These markups are growing rapidly as major search engines use them to enable semantic indexing and search. When available, they provide high-quality, important information about the webpages and the tables they contain.
We show empirically that we can derive useful features from out-table context to improve annotation accuracy. Such out-table features are highly generic and generally available. On the contrary, many existing methods use knowledge-base-specific features that are impossible to generalize, or that suffer substantially in terms of accuracy when they can be adapted, as we shall show in the experiments.
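To make the intuition concrete, the sketch below computes one plausible out-table feature: how often a candidate concept's label occurs in the text surrounding a table. This is only an illustration of the idea; the actual out-table features and their weighting are defined later in the paper.

# Hypothetical out-table context feature: frequency of a candidate concept's
# label in the text around the table (page title, captions, paragraphs).

import re
from collections import Counter

def out_table_frequency(candidate_label, page_text):
    counts = Counter(re.findall(r"[a-z]+", page_text.lower()))
    return counts[candidate_label.lower()]

page = ("This page lists films directed by Dino Risi. "
        "Each film is shown with its release date. " * 8)
# A high count for 'film' supports annotating the ambiguous 'Title'
# column with the concept Film.
print(out_table_frequency("Film", page))  # -> 8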
Efficiency - We argue that efficiency is also an important factor to consider in the task of Semantic Table Interpretation, even though it has never been explicitly addressed before. The major bottleneck is due to three types of operations: querying the knowledge bases, building feature representations for candidates, and computing similarity between candidates. Both the number of queries and the number of similarity computations can grow quadratically with the size of a table, as such operations are often required for each pair of candidates [21,26,24]. Empirically, Limaye et al. [21] show that the actual inference algorithm consumes less than 1% of total running time. Using a local copy of the knowledge base only partially addresses the issue, and introduces further problems. First, hosting a local knowledge base requires infrastructural support and involves set-up and maintenance. As we enter the 'Big Data' era, knowledge bases are growing rapidly towards colossal structures such as the Google Knowledge Graph [1], which constantly integrates increasing numbers of heterogeneous sources. Maintaining a local copy of such a knowledge base is likely to require an infrastructure that not every organization can afford [29]. Second, local data are not guaranteed to be up-to-date. Third, scaling up to very large amounts of input data requires efficient algorithms in addition to parallelization [32], as the process can be bound by the large number of I/O operations. Therefore, in our view, a more versatile solution is to cut down the number of queries and the number of data items to be processed. This reduces I/O operations in both the local and remote scenarios, and also reduces the costs associated with making remote calls to Web service providers.
Fig. 1. An example Wikipedia webpage containing a relational table (last retrieved on 9 April 2014).

In this direction, we identify an opportunity to improve the state of the art in terms of efficiency. To illustrate, consider the example table, which in reality contains over 60 rows. To annotate each column, existing methods would use content from every row in the column. However, from a human reader's point of view, this is unnecessary: simply by reading the eight rows shown, one can confidently assign a concept to the column that best describes its content. Being able to make such inferences with limited data gives a substantial efficiency advantage to Semantic Table Interpretation algorithms, as it significantly reduces the number of queries to the underlying knowledge bases and the number of candidates to be considered during inference.
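A minimal sketch of this 'start small' idea is shown below: candidate concepts are scored row by row, and processing stops once the top-ranked candidate has been stable for a few consecutive rows. The stopping rule used here is an assumed simplification; TableMiner+'s actual convergence criterion is defined later in the paper.

# Toy sketch of sample-based column classification: stop processing rows
# once the top-ranked candidate concept has been stable for `patience`
# consecutive rows, instead of exhaustively processing the whole column.

from collections import defaultdict

def classify_with_sampling(cells, score_cell, patience=3):
    scores = defaultdict(float)
    stable, best_so_far, processed = 0, None, 0
    for processed, cell in enumerate(cells, start=1):
        for concept, s in score_cell(cell).items():  # per-cell candidate scores
            scores[concept] += s
        best = max(scores, key=scores.get)
        stable = stable + 1 if best == best_so_far else 0
        best_so_far = best
        if stable >= patience:                       # ranking has stabilized
            break
    return best_so_far, processed

cells = ["A Difficult Life"] * 60                    # a 60-row column
toy_scorer = lambda cell: {"Film": 1.0, "Book": 0.4}
print(classify_with_sampling(cells, toy_scorer))     # ('Film', 4): 4 rows, not 60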
Completeness - Many existing methods deal with only one or two of the annotation tasks in a table [35,36]. Among those that deal with all tasks [21,27,25,26,34], only NE-columns are considered. As shown in Figure 1, tables can contain both NE-columns, containing entity mentions, and literal-columns, containing data values of the entities on the corresponding rows. Methods such as Limaye et al. [21] and Mulwad et al. [26] can recognize the relation between the first and third columns, but are unable to identify the relation between the first and second columns. We therefore argue that a 'complete' Semantic Table Interpretation method should handle columns of both data types.
To address these issues, we previously developed TableMiner [42], which uses features from both in- and out-table context and annotates NE-columns and cells in a relational table based on the principle of 'start small, build complete'. That is, (1) create preliminary, likely erroneous annotations based on partial table content and a simple model assuming limited interdependence between table components; and (2) iteratively optimize the preliminary annotations by enforcing interdependence between table components. In this work we extend it to build TableMiner+, adding 'subject column' [35,36] detection and relation enumeration, and improving the iterative optimization process. Concretely, TableMiner+ first interprets NE-columns (to be called column interpretation), coupling column classification and entity disambiguation in a mutually recursive process that consists of a LEARNING phase and an UPDATE phase. The LEARNING phase interprets each column independently, by first learning to create preliminary column annotations using an automatically determined 'sample' from the column, followed by 'constrained' entity disambiguation of the cells in the column (limiting the candidate entity space using the preliminary column annotations). The UPDATE phase iteratively optimizes the classification and disambiguation results in each column based on a notion of 'domain consensus' that captures inter-column and inter-task dependence, creating a global optimum. For relation enumeration, TableMiner+ detects a subject column in the table and infers its relations with the other columns (both NE- and literal-columns) in the table.
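The control flow just described can be summarized in a short, runnable sketch. Every component below is a toy stand-in: the real sampling, scoring, 'domain consensus' optimization and subject-column detection are specified in Sections 4 to 9.

# Toy sketch of the LEARNING/UPDATE control flow described above. The
# classification and disambiguation logic are placeholders, not the
# components TableMiner+ actually uses.

def learn_column(cells, sample_size=3):
    """LEARNING: preliminary concept from a sample, then constrained
    disambiguation of all cells in the column."""
    sample = cells[:sample_size]
    concept = "Film" if any("film" in c.lower() for c in sample) else "Thing"
    entities = [(c, f"kb:/{concept.lower()}/{c.lower().replace(' ', '_')}")
                for c in cells]                     # candidate space limited by concept
    return concept, entities

def update(columns, annotations, max_iter=10):
    """UPDATE: iteratively revise all columns until nothing changes
    (a simplified stand-in for the 'domain consensus' optimization)."""
    for _ in range(max_iter):
        revised = {name: learn_column(cells, sample_size=len(cells))
                   for name, cells in columns.items()}
        if revised == annotations:                  # simplified convergence test
            break
        annotations = revised
    return annotations

columns = {"Title": ["A Difficult Life (film)", "Il Sorpasso (film)"]}
preliminary = {name: learn_column(cells) for name, cells in columns.items()}
print(update(columns, preliminary))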
TableMiner+ is evaluated on four datasets containing over 15,000 tables, against four baselines and two re-implemented state-of-the-art methods. It consistently obtains the best performance on all datasets. On the two most diverse datasets, covering multiple domains and various table schemata, it obtains an improvement of about 1-18 percentage points in disambiguation, 6-42 in classification, and 4-16 in relation enumeration. It is also very efficient, achieving up to a 66% reduction in the number of candidates to be processed and up to 29% savings in wall-clock time compared against exhaustive baseline methods. Even in the setting where a local copy of the knowledge base is used, TableMiner+ delivers almost an order of magnitude of savings in wall-clock time compared against one re-implemented state-of-the-art method.
The remainder of this paper is organized as follows. Section 2 defines terms and concepts used in the relevant domain. Section 3 discusses related work. Sections 4 to 9 introduce TableMiner+ in detail. Sections 10 and 11 describe the experiment settings and discuss results, followed by conclusions in Section 12.
2. Terms and concepts
A relational table contains regular rows and columns, resembling tables in traditional databases. In practice, tables containing complex structures constitute a small population and have not been the focus of research. In theory, complex tables can be interpreted by adding a pre-processing step that parses complex structures using methods such as Zanibbi et al. [39].

Relational tables may or may not contain a header row, which is typically the first row in a table. They often contain a subject column that usually (but not necessarily) corresponds to the 'primary key' column in a database table [35,36]. This column contains the set of entities the table is about (subject entities; e.g., column 'Title' in Figure 1 contains the list of films the table is about), while other columns contain either entities forming binary relationships with the subject entities, or literals describing attributes of the subject entities.
A knowledge base defines a set of concepts (or types, classes), their object instances or entities, literals representing concrete data values, and semantic relations that define possible associations between entities (hence also between the concepts they belong to), or between an entity and a literal, in which case the relation is usually called a property of the entity (hence a property of its concept), and the literal the property value. In its generic form, a knowledge base is a linked dataset containing a set of triples, statements, or facts, each composed of a subject, predicate and object. The subject can be a concept or entity; the object can be a concept, entity, or literal; and the predicate can be a relation or property. A knowledge base can be a populated ontology, such as the YAGO (http://www.mpi-inf.mpg.de/yago-naga/yago/) and DBpedia (http://wiki.dbpedia.org/Ontology) datasets, in which case a concept hierarchy is defined. However, this is not always the case, as some knowledge bases define not a strict ontology but a loose concept network, such as Freebase (http://www.freebase.com).
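For illustration, this triple model can be captured with a minimal data structure. The triples below reuse the Freebase identifiers quoted in the running example further down; the identifier syntax and the literal value are illustrative rather than an exact Freebase serialization.

# Minimal sketch of the subject-predicate-object triple model described
# above. Identifier syntax is illustrative, not an exact Freebase format.

from typing import NamedTuple, Union

class Triple(NamedTuple):
    subject: str                  # concept or entity
    predicate: str                # relation or property
    object: Union[str, float]     # concept, entity, or literal value

kb = [
    Triple("fb:/m/02qlhz2", "rdf:type", "fb:/film/film"),
    Triple("fb:/m/02qlhz2", "fb:/film/film/directed_by", "fb:/m/0j_nhj"),
    Triple("fb:/m/02qlhz2", "fb:/film/film/initial_release_date", "1961"),
]

# When the object is a literal, the predicate acts as a property of the
# entity (and hence of its concept):
print([t for t in kb if not str(t.object).startswith(("fb:", "rdf:"))])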
The task of Semantic Table Interpretation addresses three annotation tasks. Named Entity Disambiguation associates each cell in an NE-column with one canonical entity. Column classification annotates each NE-column with one concept or, in the case of literal-columns, associates the column with one property of the concept assigned to the subject column of the table. Relation Extraction (or enumeration) identifies binary relations between NE-columns or, in the case of one NE-column and a literal-column, and given that the NE-column is annotated by a specific concept, identifies a property of that concept that can explain the data literals. The candidate entities, concepts and relations are drawn from the knowledge base.
Using the example table and Freebase as an example, the first column can be considered a reasonable subject column and should be annotated by the Freebase type 'Film' (URI 'fb:/film/film', where fb denotes http://www.freebase.com). 'A Difficult Life' in the first column should be annotated by 'fb:/m/02qlhz2', which denotes a movie directed by 'Dino Risi' (in the third column, 'fb:/m/0j_nhj'). The relation between the first and third columns should be annotated as 'Directed by' ('fb:/film/film/directed_by'), and the relation between the first and second columns (the latter being a literal-column) should be the property of 'Film' 'initial release date' ('fb:/film/film/initial_release_date'), which we also use to annotate the second column.
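Putting the three tasks together, the complete set of annotations for the example table can be written down as plain data, using the URIs quoted above. The header names 'Release date' and 'Director' are assumed here for the second and third columns, and the dictionary layout is only an illustrative convention.

# Expected output of the three annotation tasks on the example table.
# 'Release date' and 'Director' are assumed column headers.

annotations = {
    "columns": {
        "Title": "fb:/film/film",                               # classification
        "Release date": "fb:/film/film/initial_release_date",   # literal-column property
    },
    "cells": {
        ("Title", "A Difficult Life"): "fb:/m/02qlhz2",          # disambiguation
        ("Director", "Dino Risi"): "fb:/m/0j_nhj",
    },
    "relations": {
        ("Title", "Director"): "fb:/film/film/directed_by",      # relation enumeration
        ("Title", "Release date"): "fb:/film/film/initial_release_date",
    },
}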
3. Related work
3.1. Legacy tabular data to linked data
Research on converting tabular data in legacy data sources to the linked data format has made a solid contribution toward the rapid growth of the LOD cloud in the past decade [12,19,30,7]. The key difference from the task of Semantic Table Interpretation is that the focus is on data generation rather than interpretation, since the goal is to pragmatically convert tabular data from databases, spreadsheets, and similar data structures into RDF.
Typical methods require manual (or partially automated) mapping between the two data structures (the input and the output RDF), and they do not link data to existing concepts, entities and relations in the LOD cloud. As a result, the implicit semantics of the data remain hidden.
3.2. General NLP and IE
One may argue for applying general-purpose NLP/IE methods to Semantic Table Interpretation, given their highly similar objectives. This is infeasible for a number of reasons. First, state-of-the-art methods [31,17] are typically tailored to unstructured text content, which is different from tabular data. The interdependence among table components cannot be easily modeled in such methods [22]. Second, and particularly for the tasks of Named Entity Classification and Relation Extraction, classic methods require each target semantic label (i.e., concept or relation) to be pre-defined, and learning requires training or seed data [28,40]. In Semantic Table Interpretation, however, due to the large degree of variation in table schemata (e.g., Limaye et al. [21] use a dataset of over 6,000 randomly crawled Web tables for which no information about the table schemata is known a priori), defining a comprehensive set of semantic concepts and relations and subsequently creating the necessary training or seed data is infeasible.
A related IE task tailored to structured data is wrapper induction [18,9], which automatically learns wrappers that can extract information from regular, recurrent structures (e.g., product attributes from Amazon webpages). In the context of relational tables, wrapper induction methods can be adapted to annotate table columns that describe entity attributes. However, they also require training data and the table schemata to be known a priori.
3.3. Table extension and augmentation
Table extension and augmentation aims at gathering relational tables that contain the same entities but cover complementary attributes of those entities, and integrating these tables by joining them on the same entities. For example, Yakout et al. [38] propose InfoGather for populating a table of entities with their attributes by harvesting related tables on the Web. Users need to provide either the desired attribute names of the entities or example values of their attributes. The system can also discover the set of attributes for similar entities. Bhagavatula et al. [2] introduce WikiTables, which, given a query table and a collection of other tables, identifies columns from the other tables that would make relevant additions to the query table. They first identify a reference column (e.g., country names in a table of country populations) in the query table to use for joining, then find a different table (e.g., a list of countries by GDP) with a column similar to the reference column, and perform a left outer join to augment the query table with an automatically selected column from the new table (e.g., the GDP amounts). Lehmberg et al. [20] create the Mannheim Search Joins Engine with the same goal as WikiTables, but focus on handling tens of millions of tables from heterogeneous sources.
The key difference between these systems and the task of Semantic Table Interpretation is that they focus on integration rather than interpretation. The data collected are not linked to knowledge bases, and ambiguity still remains.
3.4. Semantic Table Interpretation
Hignette et al. [14,15] and Buche et al. [3] propose methods to identify the concepts represented by table columns and to detect the relations present in tables, in a domain-specific context. An NE-column is annotated based on two factors: the similarity between the header text of the column and the name of a candidate concept, plus the similarities calculated between each cell in the column and each term in the hierarchical paths containing the candidate concept. For relations, they only detect the presence of semantic relations in the table, without specifying which columns form the binary relations.
Venetis et al. [35] annotate table columns and identify relations between the subject column and other columns using types and relations from a database constructed by mining the Web with lexico-syntactic patterns such as the Hearst patterns [13]. The database contains co-occurrence statistics about the subjects and objects of triples, for example, how many times the words 'cat' and 'animal' have been extracted by the pattern <?, such as, ?> representing the is-a relation between a concept and its instances. A maximum-likelihood inference model predicts the best type for a column to be the one maximizing the probability of seeing all the values in the column given that type for the column. This probability is computed from the co-occurrence statistics gathered in the database. Relation interpretation follows the same principle.
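In schematic form (a simplified reading of this model, not the exact formulation in [35]), the predicted type t* for a column with cell values v_1, ..., v_n is:

t^{*} = \arg\max_{t} P(t \mid v_1, \dots, v_n) \propto P(t) \prod_{i=1}^{n} P(v_i \mid t)

where each P(v_i | t) is estimated from the is-a co-occurrence counts mined with patterns such as 't such as v_i'.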

References
M.A. Hearst. Automatic acquisition of hyponyms from large text corpora. COLING, 1992.
D. Nadeau and S. Sekine. A survey of named entity recognition and classification. Lingvisticae Investigationes, 2007.
S. Cucerzan. Large-scale named entity disambiguation based on Wikipedia data. EMNLP-CoNLL, 2007.
N. Kushmerick, D.S. Weld and R. Doorenbos. Wrapper induction for information extraction. IJCAI, 1997.
W. Wu, H. Li, H. Wang and K.Q. Zhu. Probase: a probabilistic taxonomy for text understanding. SIGMOD, 2012.