
Lessons Learned — The Case of CROCUS: Cluster-Based Ontology Data Cleansing


Didier Cherix², Ricardo Usbeck¹,², Andreas Both², and Jens Lehmann¹
¹ University of Leipzig, Germany
{usbeck,lehmann}@informatik.uni-leipzig.de
² R & D, Unister GmbH, Leipzig, Germany
{andreas.both,didier.cherix}@unister.de
Abstract. Over the past years, a vast number of datasets have been published
based on Semantic Web standards, which provides an opportunity for creating
novel industrial applications. However, industrial requirements on data quality
are high, while the time to market as well as the costs required for data
preparation have to be kept low. Unfortunately, many Linked Data sources are
error-prone, which prevents their direct use in productive systems. Hence,
(semi-)automatic quality assurance processes are needed, as manual ontology
repair procedures by domain experts are expensive and time-consuming. In this
article, we present CROCUS, a pipeline for cluster-based ontology data
cleansing. Our system provides a semi-automatic approach for instance-level
error detection in ontologies which is agnostic of the underlying Linked Data
knowledge base and works at very low costs. CROCUS has been evaluated on two
datasets. The experiments show that we are able to detect errors with high
recall. Furthermore, we provide an extensive review of related work as well as
a number of lessons learned.
1 Introduction
The Semantic Web movement, including the Linked Open Data (LOD) cloud¹,
represents a combustion point for commercial and free-to-use applications. The
Linked Open Data cloud hosts over 300 publicly available knowledge bases
covering an extensive range of topics, with DBpedia [1] as the central and most
important dataset. While providing a short time-to-market for large and
structured datasets, Linked Data has not yet met industrial requirements in
terms of provenance, interlinking and especially data quality. In general, LOD
knowledge bases comprise only few logical constraints or are not well modelled.
Industrial environments need to provide high-quality data in a short amount
of time. One solution might be to employ a significant number of domain experts
who check a given dataset and define constraints, thereby ensuring the demanded
data quality. However, depending on the size of the given dataset, the manual
evaluation process by domain experts will be time-consuming and expensive.
Commonly, a dataset is integrated repeatedly in iteration cycles, which leads to
a generally good data quality. However, new or updated instances might be
error-prone. Hence, the data quality of the dataset might be contaminated after
a re-import.
¹ http://lod-cloud.net/
From this scenario, we derive the requirements for our data quality evaluation
process. (1) Our aim is to find singular faults, i.e., unique instance errors,
conflicting with large business-relevant areas of a knowledge base. (2) The data
evaluation process has to be efficient. Given the size of LOD datasets,
reasoning is infeasible due to performance constraints, whereas graph-based
statistics and clustering methods can work efficiently. (3) This process has to
be agnostic of the underlying knowledge base, i.e., it should be independent of
the evaluated dataset.
Often, mature ontologies that have grown over years, been edited by a large
number of processes and people, and been created by a third party provide the
basis for industrial applications (e.g., DBpedia). Aiming at a short
time-to-market, industry needs scalable algorithms to detect errors.
Furthermore, the lack of costly domain experts requires non-experts or even
laymen to validate the data before it influences a productive system. The
resulting knowledge bases may still contain errors; however, they offer a fair
trade-off in an iterative production cycle.
In this article, we present CROCUS, a cluster-based ontology data cleansing
framework. CROCUS can be configured to find several types of errors in a semi-
automatic way, which are afterwards validated by non-expert users called quality
raters. By applying CROCUS’ methodology iteratively, resulting ontology data
can be safely used in industrial environments.
Building on our previous work [2], our contributions are as follows: we present
(1) an extensive review of related work and classify our approach according to
three well-known surveys, and (2) a pipeline for semi-automatic instance-level
error detection that is (3) capable of evaluating large datasets. Moreover, it
is (4) an approach agnostic to the analysed class of the instance as well as
the Linked Data knowledge base. (5) We provide an evaluation on a synthetic and
a real-world dataset. Finally, (6) we present a number of lessons learned
regarding error detection in real-world datasets.
2 Related Work
The research field of ontology data cleansing, especially of instance data, can
be divided into three areas: (1) development of statistical metrics to discover
anomalies, (2) manual, semi-automatic and fully automatic evaluation of data
quality and (3) rule- or logic-based approaches to prevent outliers in
application data.
In 2013, Zaveri et al. [10] evaluate the data quality of DBpedia. This manual
approach introduces a taxonomy of quality dimensions: (i) accuracy, which
concerns wrong triples, data type problems and implicit relations between
attributes, (ii) relevance, indicating the significance of extracted
information, (iii) representational consistency, measuring numerical stability
and (iv) interlinking, which looks for links to external resources. Moreover,
the authors present a manual error detection tool called TripleCheckMate² and a
semi-automatic approach supported by the description logic learner (DL-Learner)
[11,12], which generates a schema extension for preventing already identified
errors. Those methods measured an error rate of 11.93% in DBpedia, which will be
a starting point for our evaluation.

                                   Procedure
Category        Dimension          Automatic  Semi-automatic  Manual  Not specified
-----------------------------------------------------------------------------------
Intrinsic DQ    Believability      -          -               [3]     -
                Objectivity        -          -               [3]     -
                Reputation         -          -               [3]     -
                Correctness        -          [4,5]           [3]     -
Contextual DQ   Completeness       [6]        [5]             [3]     -
                Added Value        [6]        -               [3]     -
                Relevancy          -          -               [3]     -
                Timeliness         -          -               [3]     -
                Amount of data     -          -               [3]     -
Representation  Interpretability   -          -               [3]     [7]
                Understandability  -          -               [3]     -
                Consistency        -          -               [3]     [7]
                Conciseness        -          -               [3]     [7]
Accessibility   Availability       -          -               [3]     [7]
                Response time      -          -               [3]     -
                Security           -          -               [3]     -
Table 1: Papers found for each dimension defined in [8], based on [9,
Tables 8-9]. The blue dimensions are considered in this work.
A rule-based framework is presented by Fürber et al. [13], where the authors
define 9 rules of data quality. Subsequently, the authors define an error as the
number of instances not following a specific rule, normalized by the overall
number of relevant instances. Afterwards, the framework is able to generate
statistics on which rules have been applied to the data. Several semi-automatic
processes, e.g., [4,5], have been developed to detect errors in instance data of
ontologies. Böhm et al. [4] profiled LOD knowledge bases, i.e., statistical
metadata is generated to discover outliers. To this end, the authors clustered
the ontology to ensure that partitions contain only semantically correlated data
and are able to detect outliers. Hogan et al. [5] only identified errors in RDF
data without evaluating the data properties themselves.
² http://github.com/AKSW/TripleCheckMate
In 2013, Kontokostas et al. [14] present an automatic methodology to assess
data quality via a SPARQL endpoint³. The authors define 14 basic graph patterns
(BGP) to detect diverse error types. Each pattern leads to the construction of
several cases with meta variables bound to specific instances of resources and
literals, e.g., constructing a SPARQL query testing that a person is born before
the person dies. This approach is not able to work iteratively to refine its
result and is thus not usable in circular development processes.
Bizer et al. [3] present a manual framework as well as a browser to filter
Linked Data. The framework enables users to define rules which will be used to
clean the RDF data. Those rules have to be created manually in a SPARQL-like
syntax. In turn, the browser shows the processed data along with an explanation
of the filtering.
Network measures like degree and centrality are used by Guéret et al. [6] to
quantify the quality of data. Furthermore, they present an automatic framework
to evaluate the influence of each measure on the data quality. The authors show
that the presented measures are only capable of discovering a few
quality-lacking triples.
Hogan et al. [7] compare the quality of several Linked Data datasets. To this
end, the authors extracted 14 rules from best practices and publications. Those
rules are applied to each dataset and compared against the PageRank of each
data supplier. Thereafter, the PageRank of a certain data supplier is correlated
with the dataset's quality. The authors suggest new guidelines to align the
Linked Data quality with the users' need for certain dataset properties.
A first classification of quality dimensions is presented by Wang et al. [8]
with respect to their importance to the user. This study yields a
classification of data quality metrics into four categories, cf. Table 1.
Recently, Zaveri et al. [9] presented a systematic literature review on
different methodologies for data quality assessment. The authors chose 21
articles, extracted 26 quality dimensions and categorized them according to [8].
The results show which error types exist and whether they are repairable
manually, semi-automatically or fully automatically. The presented measures
were used to classify CROCUS.
To the best of our knowledge, our tool is the first to tackle accuracy
(intrinsic data quality), completeness (contextual data quality) and consistency
(data modelling) at once in a semi-automatic manner while reaching a high
F1-measure on real-world data.
3 Method
First, we need a standardized extraction of target data to be agnostic of the
underlying knowledge base. SPARQL [15] is a W3C standard to query instance data
from Linked Data knowledge bases. The DESCRIBE query command is a way to
retrieve descriptive data of certain instances. However, the result of this
query command depends on the knowledge base vendor and its configuration. To
circumvent knowledge base dependence, we use Concise Bounded Descriptions (CBD)
[16]. Given a resource r and a certain description depth d, the CBD works as
follows: (1) extract all triples with r as subject and (2) resolve all blank
nodes retrieved so far, i.e., for each blank node add every triple containing a
blank node with the same identifier as a subject to the description. Finally,
CBD repeats these steps d times. CBD configured with d = 1 retrieves only
triples with r as subject, although triples with r as object could contain
useful information. Therefore, a rule is added to CBD, i.e., (3) extract all
triples with r as object, which is called Symmetric Concise Bounded Description
(SCBD) [16].
³ http://www.w3.org/TR/rdf-sparql-query/
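The three extraction rules can be illustrated with a minimal Python sketch over
an in-memory list of triples. It is not the CROCUS implementation: the function
names, the "_:" prefix for blank nodes and the reading of the depth d (resources
reached in one round become subjects in the next round) are assumptions made
for illustration only.

```python
from typing import Iterable, Set, Tuple

# A triple is (subject, predicate, object); literals may be any hashable value.
Triple = Tuple[str, str, object]


def is_blank(node: object) -> bool:
    """Assumption of this sketch: blank nodes are strings prefixed with '_:'."""
    return isinstance(node, str) and node.startswith("_:")


def scbd(triples: Iterable[Triple], resource: str, depth: int = 1) -> Set[Triple]:
    """Collect a symmetric concise bounded description of `resource`."""
    graph = set(triples)
    subjects = {s for s, _, _ in graph}
    description: Set[Triple] = set()
    frontier, seen = {resource}, set()
    for _ in range(depth):
        reached = set()
        for node in frontier:
            seen.add(node)
            # Rule (1): every triple with the node as subject.
            selected = {t for t in graph if t[0] == node}
            reached |= {t[2] for t in selected}
            # Rule (2): resolve blank nodes by adding the triples
            # in which they occur as subject.
            pending = [t[2] for t in selected if is_blank(t[2])]
            while pending:
                blank = pending.pop()
                for t in graph:
                    if t[0] == blank and t not in selected:
                        selected.add(t)
                        if is_blank(t[2]):
                            pending.append(t[2])
            description |= selected
        # Assumed meaning of depth: follow named resources reached so far.
        frontier = {n for n in reached
                    if isinstance(n, str) and n in subjects
                    and not is_blank(n) and n not in seen}
    # Rule (3): the symmetric extension, triples with the resource as object.
    description |= {t for t in graph if t[2] == resource}
    return description
```

With depth 1, the sketch returns the outgoing triples of the resource, the
triples of any blank nodes they lead to, and the incoming triples of the
resource; triples about unrelated resources are left out.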
Second, CROCUS needs to calculate a numeric representation of an instance to
facilitate further clustering steps. Metrics are split into three categories:
(1) The simplest metric counts each property (count). For example, this metric
can be used if a person is expected to have only one telephone number.
(2) For each instance, the range of the resource at a certain property is
counted (range count). In general, an undergraduate student should take
undergraduate courses. If there is an undergraduate student taking courses of
another type (e.g., graduate courses), this metric is able to detect it.
(3) The most general metric transforms each instance into a numeric vector and
normalizes it (numeric). Since instances created by the SCBD consist of
properties with multiple ranges, CROCUS defines the following metrics: (a)
numeric properties are taken as is, (b) properties based on strings are
converted to a metric by using string length, although more sophisticated
measures could be used (e.g., n-gram similarities) and (c) object properties
are discarded for this metric.
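The three metric categories can be sketched in Python as follows. This is an
illustration under assumptions, not the CROCUS code: all function names are
hypothetical, the class of each object resource is passed in as a plain
dictionary, multi-valued properties keep their largest value, missing
properties count as 0, and min-max scaling stands in for the normalization,
which the text does not specify further.

```python
from collections import Counter
from typing import Dict, Iterable, List, Sequence, Set, Tuple

Triple = Tuple[str, str, object]


def count_metric(description: Iterable[Triple], resource: str) -> Counter:
    """(1) count: how often each property occurs on the instance."""
    return Counter(p for s, p, _ in description if s == resource)


def range_count_metric(description: Iterable[Triple], resource: str,
                       types: Dict[str, str]) -> Counter:
    """(2) range count: occurrences of each (property, class of object) pair."""
    return Counter((p, types[o]) for s, p, o in description
                   if s == resource and isinstance(o, str) and o in types)


def numeric_vector(description: Iterable[Triple], resource: str,
                   properties: Sequence[str], resources: Set[str]) -> List[float]:
    """(3) numeric: one value per property, in the order of `properties`."""
    values: Dict[str, float] = {}
    for s, p, o in description:
        if s != resource or isinstance(o, bool):
            continue
        if isinstance(o, (int, float)):
            v = float(o)                  # (a) numeric properties taken as is
        elif isinstance(o, str) and o not in resources:
            v = float(len(o))             # (b) strings mapped to their length
        else:
            continue                      # (c) object properties are discarded
        values[p] = max(values.get(p, v), v)
    return [values.get(p, 0.0) for p in properties]


def normalize(vectors: List[List[float]]) -> List[List[float]]:
    """Min-max scale every dimension to [0, 1] (an assumed normalization)."""
    if not vectors:
        return []
    lows = [min(col) for col in zip(*vectors)]
    highs = [max(col) for col in zip(*vectors)]
    return [[(x - lo) / (hi - lo) if hi > lo else 0.0
             for x, lo, hi in zip(vec, lows, highs)] for vec in vectors]
```

For a person with two telephone numbers, `count_metric` yields a count of 2 for
the telephone property, which separates this instance from the majority of
persons that have exactly one.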
As a third step, we apply the density-based spatial clustering of applications
with noise (DBSCAN) algorithm [17] since it is an efficient algorithm and the
order of instances has no influence on the clustering result. DBSCAN clusters
instances based on the size of a cluster and the distance between those
instances. Thus, DBSCAN has two parameters: ε, the distance between two
instances, here calculated by the metrics above, and MinPts, the minimum number
of instances needed to form a cluster. If a cluster has fewer than MinPts
instances, they are regarded as outliers. We report the quality of CROCUS for
different values of MinPts in Section 4.
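A compact, dependency-free Python sketch of this step is given below. It is a
textbook version of DBSCAN rather than the implementation used by CROCUS; the
Euclidean distance over the metric vectors, the function names and the label -1
for noise are assumptions of the sketch.

```python
import math
from typing import List, Optional, Sequence

NOISE = -1  # label for instances that belong to no cluster


def region_query(points: Sequence[Sequence[float]], index: int, eps: float) -> List[int]:
    """Indices of all points within distance eps of points[index] (itself included)."""
    return [j for j, q in enumerate(points) if math.dist(points[index], q) <= eps]


def dbscan(points: Sequence[Sequence[float]], eps: float, min_pts: int) -> List[Optional[int]]:
    """Return one cluster label per point; NOISE marks outliers."""
    labels: List[Optional[int]] = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbours = region_query(points, i, eps)
        if len(neighbours) < min_pts:
            labels[i] = NOISE              # may later become a border point
            continue
        cluster += 1                       # i is a core point: start a cluster
        labels[i] = cluster
        queue = [j for j in neighbours if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster        # border point, not expanded further
            if labels[j] is not None:
                continue
            labels[j] = cluster
            more = region_query(points, j, eps)
            if len(more) >= min_pts:       # j is a core point as well
                queue.extend(more)
    return labels


def outliers(points: Sequence[Sequence[float]], eps: float, min_pts: int) -> List[int]:
    """Indices of the instances that would be handed to the quality raters."""
    return [i for i, label in enumerate(dbscan(points, eps, min_pts)) if label == NOISE]
```

Instances labelled as noise are the candidates passed on for manual validation;
raising MinPts makes the sketch stricter about what counts as a cluster and
therefore tends to flag more instances.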
Finally, identified outliers are extracted and given to human quality judges.
Based on the revised set of outliers, the algorithm can be adjusted and con-
straints can be added to the Linked Data knowledge base to prevent repeating
discovered errors.
4 Evaluation
LUBM benchmark. First, we used the LUBM benchmark [18] to create a
perfectly modelled dataset. This benchmark allows to generate arbitrary know-
