
Lessons Learned — The Case of CROCUS: Cluster-Based Ontology Data Cleansing

Didier Cherix, Ricardo Usbeck, Andreas Both, Jens Lehmann

25 May 2014, pp. 14-24

TL;DR: This system provides a semi-automatic approach for instance-level error detection in ontologies which is agnostic of the underlying Linked Data knowledge base and works at very low costs.


Abstract: Over the past years, a vast number of datasets have been published based on Semantic Web standards, which provides an opportunity for creating novel industrial applications. However, industrial requirements on data quality are high while the time to market as well as the required costs for data preparation have to be kept low. Unfortunately, many Linked Data sources are error-prone which prevents their direct use in productive systems. Hence, (semi-)automatic quality assurance processes are needed as manual ontology repair procedures by domain experts are expensive and time consuming. In this article, we present CROCUS – a pipeline for cluster-based ontology data cleansing. Our system provides a semi-automatic approach for instance-level error detection in ontologies which is agnostic of the underlying Linked Data knowledge base and works at very low costs. CROCUS has been evaluated on two datasets. The experiments show that we are able to detect errors with high recall. Furthermore, we provide an exhaustive related work as well as a number of lessons learned.


Summary (2 min read)


1 Introduction

The Semantic Web movement including the Linked Open Data (LOD) cloud represents a combustion point for commercial and free-to-use applications.
Depending on the size of the given dataset the manual evaluation process by domain experts will be time consuming and expensive.
From this scenario, the authors derive the requirements for their data quality evaluation process.
Often, mature ontologies, grown over years, edited by a large number of processes and people, and created by a third party provide the basis for industrial applications (e.g., DBpedia).
The authors' contributions are as follows: the authors present (1) an exhaustive related work and classify their approach according to three well-known surveys, (2) a pipeline for semi-automatic instance-level error detection that is (3) capable of evaluating large datasets.

3 Method

SPARQL [15] (http://www.w3.org/TR/rdf-sparql-query/) is a W3C standard to query instance data from Linked Data knowledge bases.
Therefore, a rule is added to CBD, i.e., (3) extract all triples with r as object, which is called Symmetric Concise Bounded Description (SCBD) [16].
Metrics are split into three categories: (1) The simplest metric counts each property (count).
DBSCAN clusters instances based on the size of a cluster and the distance between those instances.
If a cluster has fewer than MinPts instances, they are regarded as outliers.

4 Evaluation

First, the authors used the LUBM benchmark [18] to create a perfectly modelled dataset.
This benchmark allows to generate arbitrary know-.


The authors' dataset consists of exactly one university and can be downloaded from their project homepage.
– semantic correctness of properties (range count) has been evaluated by adding courses for non-graduate students to 20 graduate students.
For each set of instances, the following holds: $|I_{\text{count}}| = |I_{\text{rangecount}}| = |I_{\text{numeric}}| = 20$ and additionally $|I_{\text{count}} \cap I_{\text{rangecount}} \cap I_{\text{numeric}}| = 3$. The second equation overcomes a biased evaluation and introduces some realistic noise into the dataset.
To evaluate the performance of CROCUS, the authors used each error type individually on the adjusted LUBM benchmark datasets as well as a combination of all error types on LUBM and the real-world DBpedia subset.

LUBM

Table 3 presents the results for the combination of all error types for the LUBM benchmark as well as for the German universities DBpedia subset.
CROCUS achieves a high recall on the real-world data from DBpedia.
Table 4 lists the identified reasons of errors from the German universities DBpedia subset detected as outlier.
As mentioned before, some universities do not have a dbo:country property.
Some literals are of type xsd:string although they represent a numeric value.

5 Lessons learned

Based on those candidates, quality raters and domain experts are able to define constraints to avoid a specific type of failure.
Obviously, there are some failures which are too complex for a single constraint.
An object property with more than one authorized class as range needs another rule for each class, e.g., a property locatedIn with the possible classes Continent, Country, AdminDivision.
Any object having this property should only have one instance of Country linked to it.
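The constraint described above can be sketched as a small check over triples. This is only an illustration, not the authors' implementation: the function name `single_country_violations`, the triple layout and the `class_of` lookup are assumptions, and "only one instance of Country" is read here as "exactly one".

```python
def single_country_violations(triples, class_of, prop="locatedIn", cls="Country"):
    """Return subjects that use `prop` but are not linked to exactly one `cls` instance.

    triples  -- iterable of (subject, predicate, object) tuples (assumed layout)
    class_of -- dict mapping an object to its class name (assumed lookup)
    """
    counts = {}
    for s, p, o in triples:
        if p != prop:
            continue
        counts.setdefault(s, 0)
        if class_of.get(o) == cls:
            counts[s] += 1
    # Assumption: zero Country links is as much a violation as two or more.
    return sorted(s for s, n in counts.items() if n != 1)


example_triples = [
    ("Leipzig", "locatedIn", "Germany"),
    ("Leipzig", "locatedIn", "Europe"),
    ("Leipzig", "locatedIn", "Saxony"),
    ("Basel", "locatedIn", "Switzerland"),
    ("Basel", "locatedIn", "France"),
    ("Atlantis", "locatedIn", "Europe"),
]
example_classes = {
    "Germany": "Country", "Switzerland": "Country", "France": "Country",
    "Europe": "Continent", "Saxony": "AdminDivision",
}
violations = single_country_violations(example_triples, example_classes)
```

In this toy data, `Basel` is flagged for two countries and `Atlantis` for none, while `Leipzig` passes because its other links point to a Continent and an AdminDivision.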

6 Conclusion

The authors presented CROCUS, a novel architecture for cluster-based, iterative ontology data cleansing, agnostic of the underlying knowledge base.
With this approach the authors aim at the iterative integration of data into a productive environment which is a typical task of industrial software life cycles.
Furthermore, CROCUS will be extended towards a pipeline comprising change management, an open API and semantic versioning of the underlying data.
Additionally, a guided constraint derivation for laymen will be added.
This work has been partly supported by the ESF and the Free State of Saxony and by grants from the European Union’s 7th Framework Programme provided for the project GeoKnow (GA no. 318159).


Figures (5)

Table 1: Papers found for each dimension defined in [8], on the basis of [9, Table 8-9]. The blue dimensions are considered in this work.

Table 4: Different error types discovered by quality raters using the German universities DBpedia subset.

Table 2: Results of the LUBM benchmark for all three error types.

Table 3: Evaluation of CROCUS against a synthetic and a real-world dataset using all metrics combined.



Didier Cherix, Ricardo Usbeck, Andreas Both, and Jens Lehmann

University of Leipzig, Germany
{usbeck,lehmann}@informatik.uni-leipzig.de

R & D, Unister GmbH, Leipzig, Germany
{andreas.both,didier.cherix}@unister.de


1 Introduction

The Semantic Web movement including the Linked Open Data (LOD) cloud (http://lod-cloud.net/) represents a combustion point for commercial and free-to-use applications. The Linked Open Data cloud hosts over 300 publicly available knowledge bases with an extensive range of topics and DBpedia [1] as the central and most important dataset. While providing a short time-to-market of large and structured datasets, Linked Data has not yet reached industrial requirements in terms of provenance, interlinking and especially data quality. In general, LOD knowledge bases comprise only a few logical constraints or are not well modelled.

Industrial environments need to provide high quality data in a short amount of time. A solution might be a significant number of domain experts that are checking a given dataset and defining constraints, ensuring the demanded data quality. However, depending on the size of the given dataset the manual evaluation process by domain experts will be time consuming and expensive. Commonly, a dataset is integrated in iteration cycles repeatedly which leads to a generally good data quality. However, new or updated instances might be error-prone. Hence, the data quality of the dataset might be contaminated after a re-import.

From this scenario, we derive the requirements for our data quality evaluation process. (1) Our aim is to find singular faults, i.e., unique instance errors, conflicting with large business relevant areas of a knowledge base. (2) The data evaluation process has to be efficient. Given the size of LOD datasets, reasoning is infeasible due to performance constraints, but graph-based statistics and clustering methods can work efficiently. (3) This process has to be agnostic of the underlying knowledge base, i.e., it should be independent of the evaluated dataset.

Often, mature ontologies, grown over years, edited by a large number of processes and people, and created by a third party provide the basis for industrial applications (e.g., DBpedia). Aiming at short time-to-market, industry needs scalable algorithms to detect errors. Furthermore, the lack of costly domain experts requires non-experts or even laymen to validate the data before it influences a productive system. Resulting knowledge bases may still contain errors; however, they offer a fair trade-off in an iterative production cycle.

In this article, we present CROCUS, a cluster-based ontology data cleansing framework. CROCUS can be configured to find several types of errors in a semi-automatic way, which are afterwards validated by non-expert users called quality raters. By applying CROCUS' methodology iteratively, resulting ontology data can be safely used in industrial environments.

On top of our previous work [2], our contributions are as follows: we present (1) an exhaustive related work and classify our approach according to three well-known surveys, (2) a pipeline for semi-automatic instance-level error detection that is (3) capable of evaluating large datasets. Moreover, it is (4) an approach agnostic to the analysed class of the instance as well as the Linked Data knowledge base. (5) We provide an evaluation on a synthetic and a real-world dataset. Finally, (6) we present a number of lessons learned regarding error detection in real-world datasets.

2 Related Work

The research field of ontology data cleansing, especially of instance data, can be regarded as threefold: (1) development of statistical metrics to discover anomalies, (2) manual, semi-automatic and fully automatic evaluation of data quality and (3) rule- or logic-based approaches to prevent outliers in application data.

In 2013, Zaveri et al. [10] evaluate the data quality of DBpedia. This manual approach introduces a taxonomy of quality dimensions: (i) accuracy, which concerns wrong triples, data type problems and implicit relations between attributes, (ii) relevance, indicating significance of extracted information, (iii) representational consistency, measuring numerical stability and (iv) interlinking, which looks for links to external resources. Moreover, the authors present a manual error detection tool called TripleCheckMate (http://github.com/AKSW/TripleCheckMate) and a semi-automatic approach supported by the description logic learner (DL-Learner) [11,12], which generates a schema extension for preventing already identified errors. Those methods measured an error rate of 11.93% in DBpedia which will be a starting point for our evaluation.

| Category | Dimension | Automatic | Semi-automatic | Manual | Not specified |
|---|---|---|---|---|---|
| Intrinsic DQ | Believability | | | [3] | |
| Intrinsic DQ | Objectivity | | | [3] | |
| Intrinsic DQ | Reputation | | | [3] | |
| Intrinsic DQ | Correctness | | [4,5] | [3] | |
| Contextual DQ | Completeness | [6] | [5] | [3] | |
| Contextual DQ | Added Value | [6] | | [3] | |
| Contextual DQ | Relevancy | | | [3] | |
| Contextual DQ | Timeliness | | | [3] | |
| Contextual DQ | Amount of data | | | [3] | |
| Representation | Interpretability | | | [3] | [7] |
| Representation | Understandability | | | [3] | |
| Representation | Consistency | | | [3] | [7] |
| Representation | Conciseness | | | [3] | [7] |
| Accessibility | Availability | | | [3] | [7] |
| Accessibility | Response time | | | [3] | |
| Accessibility | Security | | | [3] | |

Table 1: Papers found for each dimension defined in [8], by procedure, on the basis of [9, Table 8-9]. The blue dimensions are considered in this work.

A rule-based framework is presented by Fürber et al. [13] where the authors define 9 rules of data quality. Subsequently, the authors define an error by the number of instances not following a specific rule normalized by the overall number of relevant instances. Afterwards, the framework is able to generate statistics on which rules have been applied to the data. Several semi-automatic processes, e.g., [4,5], have been developed to detect errors in instance data of ontologies. Böhm et al. [4] profiled LOD knowledge bases, i.e., statistical metadata is generated to discover outliers. To this end, the authors clustered the ontology to ensure partitions contain only semantically correlated data and are able to detect outliers. Hogan et al. [5] only identified errors in RDF data without evaluating the data properties themselves.

In 2013, Kontokostas et al. [14] present an automatic methodology to assess data quality via a SPARQL endpoint. The authors define 14 basic graph patterns (BGP) to detect diverse error types. Each pattern leads to the construction of several cases with meta variables bound to specific instances of resources and literals, e.g., constructing a SPARQL query testing that a person is born before the person dies. This approach is not able to work iteratively to refine its result and is thus not usable in circular development processes.
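To make the idea of binding meta variables concrete, the following sketch instantiates one hypothetical pattern with Python's `string.Template`. The pattern text, the placeholder names `p1`, `p2`, `op` and the helper `instantiate` are assumptions made for illustration and are not taken from [14]; `dbo:birthDate` and `dbo:deathDate` are DBpedia ontology properties.

```python
from string import Template

# Hypothetical pattern: select resources whose two property values
# stand in a forbidden relation to each other.
PATTERN = Template(
    "PREFIX dbo: <http://dbpedia.org/ontology/>\n"
    "SELECT ?s WHERE {\n"
    "  ?s $p1 ?v1 .\n"
    "  ?s $p2 ?v2 .\n"
    "  FILTER (?v1 $op ?v2)\n"
    "}"
)


def instantiate(p1, p2, op):
    """Bind the meta variables of the pattern to concrete properties."""
    return PATTERN.substitute(p1=p1, p2=p2, op=op)


# Test case: persons whose birth date lies after their death date.
query = instantiate("dbo:birthDate", "dbo:deathDate", ">")
```

Each binding of the placeholders yields one concrete test query, so a single pattern can produce many test cases.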

Bizer et al. [3] present a manual framework as well as a browser to filter Linked Data. The framework enables users to define rules which will be used to clean the RDF data. Those rules have to be created manually in a SPARQL-like syntax. In turn, the browser shows the processed data along with an explanation of the filtering.

Network measures like degree and centrality are used by Guéret et al. [6] to quantify the quality of data. Furthermore, they present an automatic framework to evaluate the influence of each measure on the data quality. The authors prove that the presented measures are only capable of discovering a few quality-lacking triples.

Hogan et al. [7] compare the quality of several Linked Data datasets. To this end, the authors extracted 14 rules from best practices and publications. Those rules are applied to each dataset and compared against the Page Rank of each data supplier. Thereafter, the Page Rank of a certain data supplier is correlated with the dataset's quality. The authors suggest new guidelines to align the Linked Data quality with the users' need for certain dataset properties.

A first classification of quality dimensions is presented by Wang et al. [8] with respect to their importance to the user. This study reveals a classification of data quality metrics in four categories, cf. Table 1. Recently, Zaveri et al. [9] present a systematic literature review on different methodologies for data quality assessment. The authors chose 21 articles, extracted 26 quality dimensions and categorized them according to [8]. The results show which error types exist and whether they are repairable manually, semi-automatically or fully automatically. The presented measures were used to classify CROCUS.

To the best of our knowledge, our tool is the first tool tackling error accuracy (intrinsic data quality), completeness (contextual data quality) and consistency (data modelling) at once in a semi-automatic manner reaching high f1-measure on real-world data.

3 Method

First, we need a standardized extraction of target data to be agnostic of the underlying knowledge base. SPARQL [15] (http://www.w3.org/TR/rdf-sparql-query/) is a W3C standard to query instance data from Linked Data knowledge bases. The DESCRIBE query command is a way to retrieve descriptive data of certain instances. However, this query command depends on the knowledge base vendor and its configuration. To circumvent knowledge base dependence, we use Concise Bounded Descriptions (CBD) [16]. Given a resource r and a certain description depth d, the CBD works as follows: (1) extract all triples with r as subject and (2) resolve all blank nodes retrieved so far, i.e., for each blank node add every triple containing a blank node with the same identifier as a subject to the description. Finally, CBD repeats these steps d times. CBD configured with d = 1 retrieves only triples with r as subject although triples with r as object could contain useful information. Therefore, a rule is added to CBD, i.e., (3) extract all triples with r as object, which is called Symmetric Concise Bounded Description (SCBD) [16].
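The three extraction rules can be sketched over an in-memory list of triples. This is a minimal illustration under stated assumptions, not the CROCUS implementation: triples are plain string tuples, blank nodes are recognised by a `_:` prefix, and "repeating the steps d times" is interpreted as describing, at each further level, the objects reached at the previous level. The names `describe` and `is_blank` are hypothetical.

```python
def is_blank(node):
    # Assumption of this sketch: blank node identifiers start with "_:".
    return node.startswith("_:")


def describe(triples, r, depth=1, symmetric=False):
    """Collect a (symmetric) concise bounded description of resource r."""
    description = set()
    frontier = {r}
    for _ in range(depth):
        reached = set()
        for node in frontier:
            # (1) all triples with the node as subject
            found = {t for t in triples if t[0] == node}
            # (3) symmetric variant: also all triples with the node as object
            if symmetric:
                found |= {t for t in triples if t[2] == node}
            # (2) resolve blank nodes: keep adding triples whose subject is
            # a blank node that already occurs in the description
            changed = True
            while changed:
                blanks = {n for s, p, o in found for n in (s, o) if is_blank(n)}
                extra = {t for t in triples if t[0] in blanks} - found
                found |= extra
                changed = bool(extra)
            description |= found
            reached |= {o for s, p, o in found if s == node}
        frontier = reached - {r}
    return description


example = [
    ("ex:alice", "ex:name", "Alice"),
    ("ex:alice", "ex:address", "_:b1"),
    ("_:b1", "ex:city", "Leipzig"),
    ("ex:bob", "ex:knows", "ex:alice"),
    ("ex:bob", "ex:name", "Bob"),
]
cbd = describe(example, "ex:alice")
scbd = describe(example, "ex:alice", symmetric=True)
```

With d = 1, `cbd` holds the two triples about `ex:alice` plus the resolved blank node triple, and `scbd` additionally holds the incoming `ex:knows` triple from `ex:bob`.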

Second, CROCUS needs to calculate a numeric representation of an instance to facilitate further clustering steps. Metrics are split into three categories:

(1) The simplest metric counts each property (count). For example, this metric can be used if a person is expected to have only one telephone number.

(2) For each instance, the range of the resource at a certain property is counted (range count). In general, an undergraduate student should take undergraduate courses. If there is an undergraduate student taking courses with another type (e.g., graduate courses), this metric is able to detect it.

(3) The most general metric transforms each instance into a numeric vector and normalizes it (numeric). Since instances created by the SCBD consist of properties with multiple ranges, CROCUS defines the following metrics: (a) numeric properties are taken as is, (b) properties based on strings are converted to a metric by using string length although more sophisticated measures could be used (e.g., n-gram similarities) and (c) object properties are discarded for this metric.
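The three metric categories can be illustrated with a short sketch. Several choices here are assumptions rather than details given in the text: resources are marked with a small `IRI` string subclass to tell them apart from string literals, a missing value contributes 0, only the first value of a property is used, and the vector is normalized by its Euclidean length.

```python
import math
from collections import Counter


class IRI(str):
    """Marks a value as a resource rather than a string literal (sketch convention)."""


def count_metric(instance):
    """(1) count: number of values per property."""
    return Counter(p for _, p, _ in instance)


def range_count_metric(instance, class_of):
    """(2) range count: how often each class occurs as range of each property."""
    return Counter(
        (p, class_of.get(o, "unknown"))
        for _, p, o in instance
        if isinstance(o, IRI)
    )


def numeric_metric(instance, datatype_properties):
    """(3) numeric: normalized vector over the given datatype properties."""
    vector = []
    for prop in datatype_properties:
        # (c) object properties are discarded: IRI values are skipped.
        value = next(
            (o for _, p, o in instance if p == prop and not isinstance(o, IRI)),
            None,
        )
        if value is None:
            vector.append(0.0)                # assumption: missing value counts as 0
        elif isinstance(value, str):
            vector.append(float(len(value)))  # (b) string length
        else:
            vector.append(float(value))       # (a) numeric value as is
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else vector


student = [
    ("ex:s1", "ex:takesCourse", IRI("ex:c1")),
    ("ex:s1", "ex:takesCourse", IRI("ex:c2")),
    ("ex:s1", "ex:name", "Anna"),
    ("ex:s1", "ex:semester", 3),
]
class_of = {"ex:c1": "Course", "ex:c2": "GraduateCourse"}

counts = count_metric(student)
ranges = range_count_metric(student, class_of)
vector = numeric_metric(student, ["ex:name", "ex:semester"])
```

In this toy instance the range count shows one `Course` and one `GraduateCourse` for `ex:takesCourse`, which is the kind of mixture the second metric is meant to expose.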

As a third step, we apply the density-based spatial clustering of applications with noise (DBSCAN) algorithm [17] since it is an efficient algorithm and the order of instances has no influence on the clustering result. DBSCAN clusters instances based on the size of a cluster and the distance between those instances. Thus, DBSCAN has two parameters: ε, the distance between two instances, here calculated by the metrics above, and MinPts, the minimum number of instances needed to form a cluster. If a cluster has fewer than MinPts instances, they are regarded as outliers. We report the quality of CROCUS for different values of MinPts in Section 4.
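The role of the two parameters can be seen in a compact, textbook-style DBSCAN sketch. It is not the implementation used by CROCUS: it assumes instances are already numeric vectors, uses Euclidean distance via `math.dist` (Python 3.8+), and scans all points for neighbours, which is quadratic and only suitable for small examples. Points labelled `NOISE` are the outlier candidates that would be handed to quality raters.

```python
import math

NOISE = -1


def dbscan(points, eps, min_pts):
    """Return one cluster label per point; NOISE marks outliers."""
    labels = [None] * len(points)

    def neighbours(i):
        # All points within distance eps of point i (including i itself).
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE          # may later become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster    # border point: joins but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            more = neighbours(j)
            if len(more) >= min_pts:   # core point: expand the cluster
                queue.extend(more)
    return labels


points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
labels = dbscan(points, eps=0.2, min_pts=3)
outliers = [p for p, label in zip(points, labels) if label == NOISE]
```

Here the four nearby points form one cluster and the isolated point `(5.0, 5.0)` is reported as an outlier; raising `min_pts` above 4 would turn every point into noise, which shows why the parameter has to be tuned per dataset.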

Finally, identified outliers are extracted and given to human quality judges. Based on the revised set of outliers, the algorithm can be adjusted and constraints can be added to the Linked Data knowledge base to prevent repeating discovered errors.

4 Evaluation

LUBM benchmark. First, we used the LUBM benchmark [18] to create a perfectly modelled dataset. This benchmark allows to generate arbitrary know-


Frequently Asked Questions (18)

Q1. What contributions have the authors mentioned in the paper "Lessons learned — the case of crocus: cluster-based ontology data cleansing" ?

Over the past years, a vast number of datasets have been published based on Semantic Web standards, which provides an opportunity for creating novel industrial applications. In this article, the authors present CROCUS – a pipeline for cluster-based ontology data cleansing. Their system provides a semi-automatic approach for instance-level error detection in ontologies which is agnostic of the underlying Linked Data knowledge base and works at very low costs. Furthermore, the authors provide an exhaustive related work as well as a number of lessons learned.

Q2. What future works have the authors mentioned in the paper "Lessons learned — the case of crocus: cluster-based ontology data cleansing" ?

In the future, the authors aim at a more extensive evaluation on domain specific knowledge bases. Furthermore, CROCUS will be extended towards a pipeline comprising a change management, an open API and semantic versioning of the underlying data.

Q3. What metric is used to determine the range of a resource?

Since instances created by the SCBD consist of properties with multiple ranges, CROCUS defines the following metrics: (a) numeric properties are taken as is, (b) properties based on strings are converted to a metric by using string length although more sophisticated measures could be used (e.g., n-gram similarities) and (c) object properties are discarded for this metric.

Q4. What is the purpose of the article?

The Semantic Web movement including the Linked Open Data (LOD) cloud represents a combustion point for commercial and free-to-use applications.

Q5. Why is the LOD cloud so important?

Due to the size of LOD datasets, reasoning is infeasible due to performance constraints, but graph-based statistics and clustering methods can work efficiently.

Q6. What is the way to classify CROCUS?

To the best of their knowledge, their tool is the first tool tackling error accuracy (intrinsic data quality), completeness (contextual data quality) and consistency (data modelling) at once in a semi-automatic manner reaching high f1-measure on real-world data.

Q7. What is the main reason why CROCUS is used in industrial environments?

The lack of costly domain experts requires non-experts or even laymen to validate the data before it influences a productive system.

Q8. How many instances of CROCUS have been used?

CROCUS has already been successfully used on a travel domain-specific productive environment comprising more than 630,000 instances (the dataset cannot be published due to its license).

Q9. What type of property needs another rule for each class?

For instance, an object property with more than one authorized class as range needs another rule for each class, e.g., a property locatedIn with the possible classes Continent, Country, AdminDivision.

Q10. What is the aim of the evaluation process?

Their aim is to find singular faults, i.e., unique instance errors, conflicting with large business relevant areas of a knowledge base.

Q11. What is the way to extract data from a Linked Data knowledge base?

Given a resource r and a certain description depth d the CBD works as follows: (1) extract all triples with r as subject and (2) resolve all blank nodes retrieved so far, i.e., for each blank node add every triple containing a blank node with the same identifier as a subject to the description.

Q12. What is the definition of a good data quality?

A dataset is integrated in iteration cycles repeatedly which leads to a generally good data quality.

Q13. What is the main purpose of CROCUS?

CROCUS can be configured to find several types of errors in a semi-automatic way, which are afterwards validated by non-expert users called quality raters.

Q14. How did the authors cluster the ontology?

The authors clustered the ontology to ensure partitions contain only semantically correlated data and are able to detect outliers.

Q15. How many errors were found in the LUBM benchmark?

To evaluate the performance of CROCUS, the authors used each error type individually on the adjusted LUBM benchmark datasets as well as a combination of all error types on LUBM and the real-world DBpedia subset.

Q16. What is the way to cluster a data set?

As a third step, the authors apply the density-based spatial clustering of applications with noise (DBSCAN) algorithm [17] since it is an efficient algorithm and the order of instances has no influence on the clustering result.

Q17. How much time does it take to evaluate a given dataset?

Depending on the size of the given dataset the manual evaluation process by domain experts will be time consuming and expensive.

Q18. What is the effect of a more realistic scenario?

Combining different error types yielding a more realistic scenario influences the recall which results in a lower f1-measure than on each individual error type.
