Book ChapterDOI

Lessons Learned — The Case of CROCUS: Cluster-Based Ontology Data Cleansing

25 May 2014, pp. 14-24
TL;DR: This system provides a semi-automatic approach for instance-level error detection in ontologies which is agnostic of the underlying Linked Data knowledge base and works at very low costs.
Abstract: Over the past years, a vast number of datasets have been published based on Semantic Web standards, which provides an opportunity for creating novel industrial applications. However, industrial requirements on data quality are high while the time to market as well as the required costs for data preparation have to be kept low. Unfortunately, many Linked Data sources are error-prone which prevents their direct use in productive systems. Hence, (semi-)automatic quality assurance processes are needed as manual ontology repair procedures by domain experts are expensive and time consuming. In this article, we present CROCUS – a pipeline for cluster-based ontology data cleansing. Our system provides a semi-automatic approach for instance-level error detection in ontologies which is agnostic of the underlying Linked Data knowledge base and works at very low costs. CROCUS has been evaluated on two datasets. The experiments show that we are able to detect errors with high recall. Furthermore, we provide an exhaustive related work as well as a number of lessons learned.

Summary (2 min read)

1 Introduction

  • The Semantic Web movement including the Linked Open Data (LOD) cloud represents a combustion point for commercial and free-to-use applications.
  • Depending on the size of the given dataset the manual evaluation process by domain experts will be time consuming and expensive.
  • From this scenario, the authors derive the requirements for their data quality evaluation process.
  • Often, mature ontologies, grown over years, edited by a large number of processes and people, created by a third party provide the basis for industrial applications (e.g., DBpedia).
  • The authors' contributions are as follows: they present (1) an exhaustive related work and classify their approach according to three well-known surveys, (2) a pipeline for semi-automatic instance-level error detection that is (3) capable of evaluating large datasets.

3 Method

  • SPARQL [15] (http://www.w3.org/TR/rdf-sparql-query/) is a W3C standard to query instance data from Linked Data knowledge bases.
  • Therefore, a rule is added to CBD, i.e., (3) extract all triples with r as object, which is called Symmetric Concise Bounded Description (SCBD) [16].
  • Metrics are split into three categories: (1) The simplest metric counts each property (count).
  • DBSCAN clusters instances based on the size of a cluster and the distance between those instances.
  • If a cluster has fewer than MinPts instances, they are regarded as outliers.

4 Evaluation

  • First, the authors used the LUBM benchmark [18] to create a perfectly modelled dataset.
  • This benchmark allows to generate arbitrary know-.


  • The authors' dataset consists of exactly one university and can be downloaded from their project homepage.
  • – semantic correctness of properties (range count) has been evaluated by adding for non-graduate students to 20 graduate students .
  • For each set of instances holds: $|I_{\mathrm{count}}| = |I_{\mathrm{rangecount}}| = |I_{\mathrm{numeric}}| = 20$ and additionally $|I_{\mathrm{count}} \cap I_{\mathrm{rangecount}} \cap I_{\mathrm{numeric}}| = 3$. The second equation overcomes a biased evaluation and introduces some realistic noise into the dataset.
  • To evaluate the performance of CROCUS, the authors used each error type individually on the adjusted LUBM benchmark datasets as well as a combination of all error types on LUBM and the real-world DBpedia subset.

LUBM

  • Table 3 presents the results for the combination of all error types for the LUBM benchmark as well as for the German universities DBpedia subset.
  • CROCUS achieves a high recall on the real-world data from DBpedia.
  • Table 4 lists the identified reasons of errors from the German universities DBpedia subset detected as outlier.
  • As mentioned before, some universities do not have a dbo:country property.
  • Some literals are of type xsd:string although they represent a numeric value.

5 Lessons learned

  • Based on those candidates quality raters and domain experts are able to define constraints to avoid a specific type of failure.
  • Obviously, there are some failures which are too complex for a single constraint.
  • An object property with more than one authorized class as range needs another rule for each class, e.g., a property locatedIn with the possible classes Continent, Country, AdminDivision.
  • Any object having this property should only have one instance of Country linked to it.
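The constraint described in the bullets above can be illustrated with a minimal sketch. It assumes triples are plain `(subject, property, object)` tuples and that the class of each object is available in a dictionary; the property name `ex:locatedIn`, the `ex:` prefix and the function name `country_violations` are hypothetical and chosen only for illustration. "Only one instance of Country" is read here as "at most one", which is an interpretation rather than something the text states.

```python
from collections import defaultdict


def country_violations(triples, types, prop="ex:locatedIn", target_class="Country"):
    """Return subjects linked to more than one instance of `target_class` via `prop`.

    `types` maps an object identifier to its class name (an assumed lookup table).
    """
    linked = defaultdict(set)
    for s, p, o in triples:
        # Only objects of the constrained class are counted; other authorized
        # range classes (e.g. Continent) would need their own rule.
        if p == prop and types.get(o) == target_class:
            linked[s].add(o)
    return {s: sorted(objs) for s, objs in linked.items() if len(objs) > 1}
```

Subjects returned by such a check would be candidates for review by quality raters, not automatically confirmed errors.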

6 Conclusion

  • The authors presented CROCUS, a novel architecture for cluster-based, iterative ontology data cleansing, agnostic of the underlying knowledge base.
  • With this approach the authors aim at the iterative integration of data into a productive environment which is a typical task of industrial software life cycles.
  • Furthermore, CROCUS will be extended towards a pipeline comprising a change management, an open API and semantic versioning of the underlying data.
  • Additionally, a guided constraint derivation for laymen will be added.
  • This work has been partly supported by the ESF and the Free State of Saxony and by grants from the European Union’s 7th Framework Programme provided for the project GeoKnow (GA no. 318159).


Lessons Learned the Case of CROCUS:
Cluster-based Ontology Data Cleansing
Didier Cherix², Ricardo Usbeck¹,², Andreas Both², and Jens Lehmann¹
¹ University of Leipzig, Germany
{usbeck,lehmann}@informatik.uni-leipzig.de
² R & D, Unister GmbH, Leipzig, Germany
{andreas.both,didier.cherix}@unister.de
Abstract. Over the past years, a vast number of datasets have been
published based on Semantic Web standards, which provides an oppor-
tunity for creating novel industrial applications. However, industrial re-
quirements on data quality are high while the time to market as well
as the required costs for data preparation have to be kept low. Unfortu-
nately, many Linked Data sources are error-prone which prevents their
direct use in productive systems. Hence, (semi-)automatic quality as-
surance processes are needed as manual ontology repair procedures by
domain experts are expensive and time consuming. In this article, we
present CROCUS, a pipeline for cluster-based ontology data cleansing.
Our system provides a semi-automatic approach for instance-level error
detection in ontologies which is agnostic of the underlying Linked Data
knowledge base and works at very low costs. CROCUS has been evalu-
ated on two datasets. The experiments show that we are able to detect
errors with high recall. Furthermore, we provide an exhaustive related
work as well as a number of lessons learned.
1 Introduction
The Semantic Web movement including the Linked Open Data (LOD) cloud¹
represents a combustion point for commercial and free-to-use applications. The
Linked Open Data cloud hosts over 300 publicly available knowledge bases with
an extensive range of topics and DBpedia [1] as central and most important
dataset. While providing a short time-to-market of large and structured datasets,
Linked Data has not yet reached industrial requirements in terms of provenance,
interlinking and especially data quality. In general, LOD knowledge bases
comprise only few logical constraints or are not well modelled.
Industrial environments need to provide high quality data in a short amount
of time. A solution might be a significant number of domain experts that are
checking a given dataset and defining constraints, ensuring the demanded data
quality. However, depending on the size of the given dataset the manual
evaluation process by domain experts will be time consuming and expensive.
Commonly, a dataset is integrated in iteration cycles repeatedly which leads to a
generally good data quality. However, new or updated instances might be
error-prone. Hence, the data quality of the dataset might be contaminated after a
re-import.
¹ http://lod-cloud.net/
From this scenario, we derive the requirements for our data quality evalua-
tion process. (1) Our aim is to find singular faults, i.e., unique instance errors,
conflicting with large business relevant areas of a knowledge base. (2) The data
evaluation process has to be efficient. Given the size of LOD datasets,
reasoning is infeasible due to performance constraints, but graph-based statistics and
clustering methods can work efficiently. (3) This process has to be agnostic of
the underlying knowledge base, i.e., it should be independent of the evaluated
dataset.
Often, mature ontologies, grown over years, edited by a large number of
processes and people, created by a third party provide the basis for industrial
applications (e.g., DBpedia). Aiming at short time-to-market, industry needs
scalable algorithms to detect errors. Furthermore, the lack of costly domain
experts requires non-experts or even laymen to validate the data before it
influences a productive system. Resulting knowledge bases may still contain
errors; however, they offer a fair trade-off in an iterative production cycle.
In this article, we present CROCUS, a cluster-based ontology data cleansing
framework. CROCUS can be configured to find several types of errors in a semi-
automatic way, which are afterwards validated by non-expert users called quality
raters. By applying CROCUS’ methodology iteratively, resulting ontology data
can be safely used in industrial environments.
On top of our previous work [2], our contributions are as follows: we present
(1) an exhaustive related work and classify our approach according to three
well-known surveys, (2) a pipeline for semi-automatic instance-level error
detection that is (3) capable of evaluating large datasets. Moreover, it is
(4) an approach agnostic to the analysed class of the instance as well as the
Linked Data knowledge base. (5) We provide an evaluation on a synthetic and a
real-world dataset. Finally, (6) we present a number of lessons learned
regarding error detection in real-world datasets.
2 Related Work
The research field of ontology data cleansing, especially for instance data, can
be regarded as threefold: (1) development of statistical metrics to discover
anomalies, (2) manual, semi-automatic and fully automatic evaluation of data
quality and (3) rule- or logic-based approaches to prevent outliers in
application data.
In 2013, Zaveri et al. [10] evaluate the data quality of DBpedia. This manual
approach introduces a taxonomy of quality dimensions: (i) accuracy, which
concerns wrong triples, data type problems and implicit relations between
attributes, (ii) relevance, indicating significance of extracted information,
(iii) representational consistency, measuring numerical stability and
(iv) interlinking, which looks for links to external resources. Moreover, the
authors present a manual error detection tool called TripleCheckMate² and a
semi-automatic approach supported by the description logic learner
(DL-Learner) [11,12], which generates a schema extension for preventing already
identified errors. Those methods measured an error rate of 11.93% in DBpedia
which will be a starting point for our evaluation.

Table 1: Papers found for each dimension defined in [8], on the basis of
[9, Tables 8-9]. The blue dimensions are considered in this work. For each
dimension, the papers are listed in the order of the procedure columns
(automatic, semi-automatic, manual, not specified).

Category         Dimension           Papers
Intrinsic DQ     Believability       [3]
Intrinsic DQ     Objectivity         [3]
Intrinsic DQ     Reputation          [3]
Intrinsic DQ     Correctness         [4,5] [3]
Contextual DQ    Completeness        [6] [5] [3]
Contextual DQ    Added Value         [6] [3]
Contextual DQ    Relevancy           [3]
Contextual DQ    Timeliness          [3]
Contextual DQ    Amount of data      [3]
Representation   Interpretability    [3] [7]
Representation   Understandability   [3]
Representation   Consistency         [3] [7]
Representation   Conciseness         [3] [7]
Accessibility    Availability        [3] [7]
Accessibility    Response time       [3]
Accessibility    Security            [3]
A rule-based framework is presented by Fürber et al. [13] where the authors
define 9 rules of data quality. Subsequently, the authors define an error by the
number of instances not following a specific rule normalized by the overall
number of relevant instances. Afterwards, the framework is able to generate
statistics on which rules have been applied to the data. Several semi-automatic
processes, e.g., [4,5], have been developed to detect errors in instance data of
ontologies. Böhm et al. [4] profiled LOD knowledge bases, i.e., statistical
metadata is generated to discover outliers. To this end, the authors clustered
the ontology to ensure partitions contain only semantically correlated data and
are able to detect outliers. Hogan et al. [5] only identified errors in RDF data
without evaluating the data properties themselves.
² http://github.com/AKSW/TripleCheckMate
In 2013, Kontokostas et al. [14] present an automatic methodology to assess
data quality via a SPARQL endpoint³. The authors define 14 basic graph patterns
(BGP) to detect diverse error types. Each pattern leads to the construction
of several cases with meta variables bound to specific instances of resources and
literals, e.g., constructing a SPARQL query testing that a person is born before
the person dies. This approach is not able to work iteratively to refine its result
and is thus not usable in circular development processes.
Bizer et al. [3] present a manual framework as well as a browser to filter
Linked Data. The framework enables users to define rules which will be used to
clean the RDF data. Those rules have to be created manually in a SPARQL-like
syntax. In turn, the browser shows the processed data along with an explanation
of the filtering.
Network measures like degree and centrality are used by Guéret et al. [6] to
quantify the quality of data. Furthermore, they present an automatic framework
to evaluate the influence of each measure on the data quality. The authors prove
that the presented measures are only capable of discovering a few quality-lacking
triples.
Hogan et al. [7] compare the quality of several Linked Data datasets. To this
end, the authors extracted 14 rules from best practices and publications. Those
rules are applied to each dataset and compared against the Page Rank of each
data supplier. Thereafter, the Page Rank of a certain data supplier is correlated
with the dataset's quality. The authors suggest new guidelines to align the Linked
Data quality with the users' need for certain dataset properties.
A first classification of quality dimensions is presented by Wang et al. [8]
with respect to their importance to the user. This study reveals a classification
of data quality metrics in four categories, cf. Table 1. Recently, Zaveri et al. [9]
present a systematic literature review on different methodologies for data quality
assessment. The authors chose 21 articles, extracted 26 quality dimensions and
categorized them according to [8]. The results show which error types exist and
whether they are repairable manually, semi-automatically or fully automatically.
The presented measures were used to classify CROCUS.
To the best of our knowledge, our tool is the first tool tackling error accuracy
(intrinsic data quality), completeness (contextual data quality) and consistency
(data modelling) at once in a semi-automatic manner reaching a high F1-measure
on real-world data.
³ http://www.w3.org/TR/rdf-sparql-query/

3 Method
First, we need a standardized extraction of target data to be agnostic of the
underlying knowledge base. SPARQL [15] is a W3C standard to query instance
data from Linked Data knowledge bases. The DESCRIBE query command is a way
to retrieve descriptive data of certain instances. However, this query command
depends on the knowledge base vendor and its configuration. To circumvent
knowledge base dependence, we use Concise Bounded Descriptions (CBD) [16].
Given a resource r and a certain description depth d the CBD works as follows:
(1) extract all triples with r as subject and (2) resolve all blank nodes retrieved
so far, i.e., for each blank node add every triple containing a blank node with
the same identifier as a subject to the description. Finally, CBD repeats these
steps d times. CBD configured with d = 1 retrieves only triples with r as subject
although triples with r as object could contain useful information. Therefore, a
rule is added to CBD, i.e., (3) extract all triples with r as object, which is called
Symmetric Concise Bounded Description (SCBD) [16].
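The three extraction rules can be sketched over an in-memory list of triples. This is a minimal illustration under stated assumptions, not the CROCUS implementation: triples are plain tuples, blank nodes carry the conventional `_:` prefix, string literals are written in double quotes, and the way the depth d expands to neighbouring resources is one possible reading of "repeats these steps d times". The function names are hypothetical.

```python
def is_blank(node):
    """Blank nodes are assumed to be strings with the '_:' prefix."""
    return isinstance(node, str) and node.startswith("_:")


def is_literal(node):
    """Assumption: literals are non-string values or double-quoted strings."""
    return not isinstance(node, str) or node.startswith('"')


def scbd(triples, resource, depth=1):
    """Collect a symmetric concise bounded description of `resource`."""
    graph = list(triples)
    description = set()
    frontier, visited = {resource}, set()
    for _ in range(depth):
        visited |= frontier
        reached = set()
        for s, p, o in graph:
            if s in frontier:        # rule (1): resource as subject
                description.add((s, p, o))
                reached.add(o)
            elif o in frontier:      # rule (3): resource as object
                description.add((s, p, o))
                reached.add(s)
        # rule (2): resolve blank nodes retrieved so far, transitively
        pending = [n for n in reached if is_blank(n)]
        resolved = set()
        while pending:
            blank = pending.pop()
            if blank in resolved:
                continue
            resolved.add(blank)
            for s, p, o in graph:
                if s == blank:
                    description.add((s, p, o))
                    if is_blank(o):
                        pending.append(o)
        # next round: newly reached named resources only
        frontier = {n for n in reached
                    if not is_literal(n) and not is_blank(n)} - visited
    return description
```

In practice the triples would come from a SPARQL endpoint instead of a list; the in-memory form only keeps the sketch self-contained.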
Second, CROCUS needs to calculate a numeric representation of an instance
to facilitate further clustering steps. Metrics are split into three categories:
(1) The simplest metric counts each property (count). For example, this
metric can be used if a person is expected to have only one telephone number.
(2) For each instance, the range of the resource at a certain property is
counted (range count). In general, an undergraduate student should take
undergraduate courses. If there is an undergraduate student taking courses with
another type (e.g., graduate courses), this metric is able to detect it.
(3) The most general metric transforms each instance into a numeric vector
and normalizes it (numeric). Since instances created by the SCBD consist of
properties with multiple ranges, CROCUS defines the following metrics:
(a) numeric properties are taken as is, (b) properties based on strings are
converted to a metric by using string length although more sophisticated
measures could be used (e.g., n-gram similarities) and (c) object properties are
discarded for this metric.
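The three metric categories can be sketched as follows. The sketch assumes a description given as `(subject, property, object)` tuples, string literals written in double quotes, and a dictionary `types` that maps a resource to its class; all function names are hypothetical. The text does not say how the numeric vector is normalized, so the L2 normalization used here is an assumption, as is keeping the last value when a property occurs several times.

```python
from collections import Counter
from math import sqrt


def count_metric(description, resource):
    """(1) count: number of values per property of `resource`."""
    return Counter(p for s, p, o in description if s == resource)


def range_count_metric(description, resource, prop, types):
    """(2) range count: how often each class occurs among the objects of `prop`."""
    return Counter(types.get(o, "unknown")
                   for s, p, o in description
                   if s == resource and p == prop)


def numeric_metric(description, resource, properties):
    """(3) numeric: one vector entry per property, then L2-normalized (assumption)."""
    raw = dict.fromkeys(properties, 0.0)
    for s, p, o in description:
        if s != resource or p not in raw or isinstance(o, bool):
            continue
        if isinstance(o, (int, float)):
            raw[p] = float(o)                    # (a) numeric values as is
        elif isinstance(o, str) and o.startswith('"'):
            raw[p] = float(len(o.strip('"')))    # (b) string length
        # (c) everything else is treated as an object property and discarded
    vector = [raw[p] for p in properties]
    norm = sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector] if norm else vector
```

With the range count, an undergraduate student taking a graduate course shows up as an unexpected class in the counter for the course property.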
As a third step, we apply the density-based spatial clustering of applications
with noise (DBSCAN) algorithm [17] since it is an efficient algorithm and the
order of instances has no influence on the clustering result. DBSCAN clusters
instances based on the size of a cluster and the distance between those instances.
Thus, DBSCAN has two parameters: ε, the distance between two instances, here
calculated by the metrics above, and MinPts, the minimum number of instances
needed to form a cluster. If a cluster has fewer than MinPts instances, they are
regarded as outliers. We report the quality of CROCUS for different values of
MinPts in Section 4.
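How the two parameters lead to outliers can be seen in a compact, textbook-style DBSCAN sketch. It is not the implementation used by CROCUS; it assumes instances are already numeric vectors, uses Euclidean distance, and counts a point as its own neighbour. The names `dbscan` and `NOISE` are chosen for illustration.

```python
from math import dist

NOISE = -1


def dbscan(points, eps, min_pts):
    """Return one label per point: a cluster id (0, 1, ...) or NOISE for outliers."""
    labels = [None] * len(points)

    def neighbours(i):
        # All points within distance eps of point i, including i itself.
        return [j for j in range(len(points)) if dist(points[i], points[j]) <= eps]

    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE          # may later become a border point
            continue
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster    # border point: joins, but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbours(j)
            if len(reachable) >= min_pts:
                queue.extend(reachable)  # core point: keep growing the cluster
        cluster += 1
    return labels
```

The indices labelled `NOISE` are the outlier candidates that would be handed to the quality raters; lowering `min_pts` or raising `eps` makes the detection less strict.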
Finally, identified outliers are extracted and given to human quality judges.
Based on the revised set of outliers, the algorithm can be adjusted and con-
straints can be added to the Linked Data knowledge base to prevent repeating
discovered errors.
4 Evaluation
LUBM benchmark. First, we used the LUBM benchmark [18] to create a
perfectly modelled dataset. This benchmark allows to generate arbitrary know-

Citations
Journal ArticleDOI
TL;DR: This substantial review of the literature on the state of the art of research over the past decades identified six main issues to be explored and analyzed: the available data sources; 2) data preparation techniques; 3) data representations; 4) forecasting models and methods; 5) dengue forecasting models evaluation approaches; and 6) future challenges and possibilities in forecasting modeling of d Dengue outbreaks.
Abstract: Dengue infection is a mosquitoborne disease caused by dengue viruses, which are carried by several species of mosquito of the genus Aedes , principally Ae. aegypti . Dengue outbreaks are endemic in tropical and sub-tropical regions of the world, mainly in urban and sub-urban areas. The outbreak is one of the top 10 diseases causing the most deaths worldwide. According to the World Health Organization, dengue infection has increased 30-fold globally over the past five decades. About 50–100 million new infections occur annually in more than 80 countries. Many researchers are working on measures to prevent and control the spread. One avenue of research is collaboration between computer science and the epidemiology researchers in developing methods of predicting potential outbreaks of dengue infection. An important research objective is to develop models that enable, or enhance, forecasting of outbreaks of dengue, giving medical professionals the opportunity to develop plans for handling the outbreak, well in advance. Researchers have been gathering and analyzing data to better identify the relational factors driving the spread of the disease, as well as the development of a variety of methods of predictive modeling using statistical and mathematical analysis and machine learning. In this substantial review of the literature on the state of the art of research over the past decades, we identified six main issues to be explored and analyzed: 1) the available data sources; 2) data preparation techniques; 3) data representations; 4) forecasting models and methods; 5) dengue forecasting models evaluation approaches; and 6) future challenges and possibilities in forecasting modeling of dengue outbreaks. Our comprehensive exploration of the issues provides a valuable information foundation for new researchers in this important area of public health research and epidemiology.

45 citations

Journal ArticleDOI
TL;DR: The results show that a combination of the two styles of crowdsourcing is likely to achieve more efficient results than each of them used in isolation, and that human computation is a promising and affordable way to enhance the quality of Linked Data.
Abstract: In this paper we examine the use of crowdsourcing as a means to master Linked Data quality problems that are difficult to solve automatically. We base our approach on the analysis of the most common errors encountered in Linked Data sources, and a classification of these errors according to the extent to which they are likely to be amenable to crowdsourcing. We then propose and compare different crowdsourcing approaches to identify these Linked Data quality issues, employing the DBpedia dataset as our use case: (i) a contest targeting the Linked Data expert community, and (ii) paid microtasks published on Amazon Mechanical Turk. We secondly focus on adapting the Find-Fix-Verify crowdsourcing pattern to exploit the strengths of experts and lay workers. By testing two distinct Find-Verify workflows (lay users only and experts verified by lay users) we reveal how to best combine different crowds’ complementary aptitudes in quality issue detection. The results show that a combination of the two styles of crowdsourcing is likely to achieve more efficient results than each of them used in isolation, and that human computation is a promising and affordable way to enhance the quality of Linked Data.

33 citations

Journal ArticleDOI
TL;DR: An improvement in the efficiency of predicting missing data utilizing Particle Swarm Optimization (PSO), which is applied to the numerical data cleansing problem, with the performance of PSO being enhanced using K-means to help determine the fitness value.
Abstract: Missing data are a major problem that affects data analysis techniques for forecasting. Traditional methods suffer from poor performance in predicting missing values using simple techniques, e.g., mean and mode. In this paper, we present and discuss a novel method of imputing missing values semantically with the use of an ontology model. We make three new contributions to the field: first, an improvement in the efficiency of predicting missing data utilizing Particle Swarm Optimization (PSO), which is applied to the numerical data cleansing problem, with the performance of PSO being enhanced using K-means to help determine the fitness value. Second, the incorporation of an ontology with PSO for the purpose of narrowing the search space, to make PSO provide greater accuracy in predicting numerical missing values while quickly converging on the answer. Third, the facilitation of a framework to substitute nominal data that are lost from the dataset using the relationships of concepts and a reasoning mechanism concerning the knowledge-based model. The experimental results indicated that the proposed method could estimate missing data more efficiently and with less chance of error than conventional methods, as measured by the root mean square error.

8 citations

Dissertation
01 Jan 2015
TL;DR: A number of approaches that allow to detect errors in Linked Data without a requirement for additional, external data are introduced, including an approach to detect data-level errors in numerical values.
Abstract: Linked Data is one of the most successful implementations of the Semantic Web idea. This is demonstrated by the large amount of data available in repositories constituting the Linked Open Data cloud and being linked to each other. Many of these datasets are not created manually but are extracted automatically from existing datasets. Thus, extraction errors, which a human would easily recognize, might go unnoticed and could hence considerably diminish the usability of Linked Data. The large amount of data renders manual detection of such errors unrealistic and makes automatic approaches for detecting errors desirable. To tackle this need, this thesis focuses on error detection approaches on the logical level and on the level of numerical data. In addition, the presented methods operate solely on the Linked Data dataset without a requirement for additional external data. The first two parts of this work deal with the detection of logical errors in Linked Data. It is argued that an upstream formalization of the knowledge, which is required for the error detection, into ontologies and then applying it in a separate step has several advantages over approaches that skip the formalization step. Consequently, the first part introduces inductive approaches for learning highly expressive ontologies from existing instance data as a basis for detecting logical errors. The proposed and evaluated techniques allow to learn class disjointness axioms as well as several property-centric axiom types from instance data. The second part of this thesis operates on the ontologies learned by the approaches proposed in the previous part. First, their quality is improved by detecting errors possibly introduced by the automatic learning process. For this purpose, a pattern-based approach for finding the root causes of ontology errors that is tailored to the specifics of the learned ontologies is proposed and then used in the context of ontology debugging approaches. 
To conclude the logical error detection, the usage of learned ontologies for finding erroneous statements in Linked Data is evaluated in the final chapter of the second part. This is done by applying a pattern-based error detection approach that employs the learned ontologies to the DBpedia dataset and then manually evaluating the results which finally shows the adequacy of learned ontologies for logical error detection. The final part of this thesis complements the previously shown logical error detection with an approach to detect data-level errors in numerical values. The presented method applies outlier detection techniques to the datatype property values to find potentially erroneous ones whereby the result and performance of the detection step is improved by the introduction of additional preprocessing steps. Furthermore, a subsequent cross-checking step is proposed which allows to handle the outlier detection imminent problem of natural outliers. In summary, this work introduces a number of approaches that allow to detect errors in Linked Data without a requirement for additional, external data. The generated lists of potentially erroneous facts can be a first indication for errors and the intermediate step of learning ontologies makes the full workflow even more suited for being used in a scenario which includes human interaction.
References
01 Jan 2010
TL;DR: This paper discusses common errors in RDF publishing, their consequences for applications, along with possible publisher-oriented approaches to improve the quality of structured, machine-readable and open data on the Web.
Abstract: Over a decade after RDF has been published as a W3C recommendation, publishing open and machine-readable content on the Web has recently received a lot more attention, including from corporate and governmental bodies; notably thanks to the Linked Open Data community, there now exists a rich vein of heterogeneous RDF data published on the Web (the so-called "Web of Data") accessible to all. However, RDF publishers are prone to making errors which compromise the effectiveness of applications leveraging the resulting data. In this paper, we discuss common errors in RDF publishing, their consequences for applications, along with possible publisher-oriented approaches to improve the quality of structured, machine-readable and open data on the Web.

219 citations


"Lessons Learned — The Case of CROCU..." refers background in this paper

  • ...[5] only identified errors in RDF data without evaluating the data properties itself....

    [...]

  • ..., [4,5], have been developed to detect errors in instance data of ontologies....

    [...]

Journal Article
Jens Lehmann
TL;DR: DL-Learner is a framework for learning in description logics and OWL, a cross-platform framework implemented in Java that allows easy programmatic access and provides a command line interface, a graphical interface as well as a WSDL-based web service.
Abstract: In this paper, we introduce DL-Learner, a framework for learning in description logics and OWL. OWL is the official W3C standard ontology language for the Semantic Web. Concepts in this language can be learned for constructing and maintaining OWL ontologies or for solving problems similar to those in Inductive Logic Programming. DL-Learner includes several learning algorithms, support for different OWL formats, reasoner interfaces, and learning problems. It is a cross-platform framework implemented in Java. The framework allows easy programmatic access and provides a command line interface, a graphical interface as well as a WSDL-based web service.

197 citations


Additional excerpts

  • ...error detection tool called TripleCheckMate(2) and a semi-automatic approach supported by the description logic learner (DL-Learner) [11,12], which generates a schema extension for preventing already identified errors....

    [...]

Journal ArticleDOI
TL;DR: A list of fourteen concrete guidelines as given in the ''How to Publish Linked Data on the Web'' tutorial is compiled, and conformance of current RDF data providers with respect to these guidelines is evaluated.

196 citations


"Lessons Learned — The Case of CROCU..." refers background in this paper

  • ...[7] compare the quality of several Linked Data datasets....

    [...]

Proceedings ArticleDOI
07 Apr 2014

173 citations

Proceedings ArticleDOI
04 Sep 2013
TL;DR: This study aims to assess the quality of this sample of DBpedia resources and adopt an agile methodology to improve the quality in future versions by regularly providing feedback to the DBpedia maintainers.
Abstract: Linked Open Data (LOD) comprises of an unprecedented volume of structured datasets on the Web. However, these datasets are of varying quality ranging from extensively curated datasets to crowdsourced and even extracted data of relatively low quality. We present a methodology for assessing the quality of linked data resources, which comprises of a manual and a semi-automatic process. The first phase includes the detection of common quality problems and their representation in a quality problem taxonomy. In the manual process, the second phase comprises of the evaluation of a large number of individual resources, according to the quality problem taxonomy via crowdsourcing. This process is accompanied by a tool wherein a user assesses an individual resource and evaluates each fact for correctness. The semi-automatic process involves the generation and verification of schema axioms. We report the results obtained by applying this methodology to DBpedia. We identified 17 data quality problem types and 58 users assessed a total of 521 resources. Overall, 11.93% of the evaluated DBpedia triples were identified to have some quality issues. Applying the semi-automatic component yielded a total of 222,982 triples that have a high probability to be incorrect. In particular, we found that problems such as object values being incorrectly extracted, irrelevant extraction of information and broken links were the most recurring quality problems. With this study, we not only aim to assess the quality of this sample of DBpedia resources but also adopt an agile methodology to improve the quality in future versions by regularly providing feedback to the DBpedia maintainers.

131 citations


Additional excerpts

  • ...[10] evaluate the data quality of DBpedia....


Frequently Asked Questions (18)
Q1. What contributions have the authors mentioned in the paper "Lessons Learned — The Case of CROCUS: Cluster-Based Ontology Data Cleansing"?

Over the past years, a vast number of datasets have been published based on Semantic Web standards, which provides an opportunity for creating novel industrial applications. In this article, the authors present CROCUS – a pipeline for cluster-based ontology data cleansing. Their system provides a semi-automatic approach for instance-level error detection in ontologies which is agnostic of the underlying Linked Data knowledge base and works at very low costs. Furthermore, the authors provide an exhaustive related work as well as a number of lessons learned. 

In the future, the authors aim at a more extensive evaluation on domain specific knowledge bases. Furthermore, CROCUS will be extended towards a pipeline comprising a change management, an open API and semantic versioning of the underlying data. 

Since instances created by the SCDB consist of properties with multiple ranges, CROCUS defines the following metrics: (a) numeric properties are taken as is, (b) properties based on strings are converted to a metric by using string length although more sophisticated measures could be used (e.g., n-gram similarities) and (c) object properties are discarded for this metric. 
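The three rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Resource` wrapper used to mark object-property values and the helper names `to_metric` and `instance_vector` are hypothetical and do not come from CROCUS itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resource:
    """Hypothetical wrapper marking a value as an object-property target (a URI)."""
    uri: str


def to_metric(value):
    """Map one property value to a number, or None if it is discarded."""
    if isinstance(value, (int, float)):
        return value            # (a) numeric properties are taken as is
    if isinstance(value, str):
        return len(value)       # (b) strings are reduced to their length
    return None                 # (c) object properties are discarded


def instance_vector(instance):
    """Turn a {property: value} dict into a {property: number} dict."""
    vector = {}
    for prop, value in instance.items():
        metric = to_metric(value)
        if metric is not None:
            vector[prop] = metric
    return vector


example = {
    "age": 30,
    "name": "Ann",
    "worksFor": Resource("http://example.org/university"),
}
vector = instance_vector(example)
```

String length is a deliberately crude measure; as the answer notes, something like n-gram similarity could be swapped into `to_metric` without changing the rest of the sketch.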

The Semantic Web movement, including the Linked Open Data (LOD) cloud (http://lod-cloud.net/), represents a combustion point for commercial and free-to-use applications. 

Because of the size of LOD datasets, reasoning is infeasible under performance constraints, but graph-based statistics and clustering methods can work efficiently. 

To the best of their knowledge, their tool is the first tool tackling error accuracy (intrinsic data quality), completeness (contextual data quality) and consistency (data modelling) at once in a semi-automatic manner reaching high f1-measure on real-world data. 

The lack of costly domain experts requires non-experts or even laymen to validate the data before it influences a productive system. 

CROCUS has already been successfully used on a travel domain-specific productive environment comprising more than 630,000 instances (the dataset cannot be published due to its license). 

For instance, an object property with more than one authorized class as range needs another rule for each class, e.g., a property located 

(1) Their aim is to find singular faults, i.e., unique instance errors, conflicting with large business relevant areas of a knowledge base. 

Given a resource r and a certain description depth d the CBD works as follows: (1) extract all triples with r as subject and (2) resolve all blank nodes retrieved so far, i.e., for each blank node add every triple containing a blank node with the same identifier as a subject to the description. 
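The two CBD steps above can be sketched as follows. This is a sketch under assumptions, not the authors' implementation: triples are plain `(subject, predicate, object)` tuples, blank nodes are strings starting with `_:`, and the depth `d` is interpreted as the maximum number of blank-node resolution rounds, which the excerpt does not spell out. The function name and the sample data are hypothetical.

```python
def is_blank(node):
    """Assumed convention: blank node identifiers start with '_:'."""
    return isinstance(node, str) and node.startswith("_:")


def concise_bounded_description(triples, resource, depth):
    """Collect the description of `resource` from a list of (s, p, o) tuples."""
    # Step 1: every triple that has the resource as its subject.
    description = [t for t in triples if t[0] == resource]

    # Step 2: resolve blank nodes found so far, at most `depth` rounds.
    resolved = set()
    frontier = {o for (_, _, o) in description if is_blank(o)}
    for _ in range(depth):
        frontier -= resolved
        if not frontier:
            break
        new_triples = [t for t in triples if t[0] in frontier]
        resolved |= frontier
        for t in new_triples:
            if t not in description:
                description.append(t)
        frontier = {o for (_, _, o) in new_triples if is_blank(o)}
    return description


sample = [
    ("ex:a", "ex:name", "Alice"),
    ("ex:a", "ex:address", "_:b1"),
    ("_:b1", "ex:city", "Leipzig"),
    ("_:b1", "ex:geo", "_:b2"),
    ("_:b2", "ex:lat", 51.3),
    ("ex:b", "ex:name", "Bob"),
]
cbd_shallow = concise_bounded_description(sample, "ex:a", 1)
cbd_deep = concise_bounded_description(sample, "ex:a", 2)
```

With depth 1 the nested blank node `_:b2` is referenced but not expanded; with depth 2 its triples are included as well. Triples about `ex:b` never enter the description.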

A dataset is integrated in iteration cycles repeatedly, which leads to a generally good data quality. 

CROCUS can be configured to find several types of errors in a semi-automatic way, which are afterwards validated by non-expert users called quality raters. 

The authors clustered the ontology to ensure partitions contain only semantically correlated data and are able to detect outliers. 

To evaluate the performance of CROCUS, the authors used each error type individually on the adjusted LUBM benchmark datasets as well as a combination of all error types on LUBM5 and the real-world DBpedia subset. 

As a third step, the authors apply the density-based spatial clustering of applications with noise (DBSCAN) algorithm [17] since it is an efficient algorithm and the order of instances has no influence on the clustering result. 
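To make the clustering step concrete, here is a minimal pure-Python DBSCAN sketch in which points labelled as noise play the role of outlier candidates. It is illustrative only: the excerpt does not say which implementation or parameters CROCUS uses, so `eps`, `min_pts` and the one-dimensional sample values are assumptions.

```python
import math

NOISE = -1


def region_query(points, i, eps):
    """Indices of all points within distance eps of points[i] (including i)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Return one label per point: a cluster id (0, 1, ...) or NOISE."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbours = region_query(points, i, eps)
        if len(neighbours) < min_pts:
            labels[i] = NOISE          # may later become a border point
            continue
        cluster += 1                   # points[i] is a core point
        labels[i] = cluster
        queue = list(neighbours)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster    # border point: joins, is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbours = region_query(points, j, eps)
            if len(j_neighbours) >= min_pts:
                queue.extend(j_neighbours)   # j is a core point too
    return labels


# Hypothetical one-dimensional metric values, one tuple per instance.
values = [(1.0,), (1.1,), (0.9,), (1.05,), (5.0,), (5.1,), (4.9,), (20.0,)]
labels = dbscan(values, eps=0.3, min_pts=3)
outliers = [values[i] for i, label in enumerate(labels) if label == NOISE]
```

In this toy input the two dense groups form clusters and the isolated value `20.0` is left as noise, which is the kind of instance that would be handed to a quality rater for validation.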

Depending on the size of the given dataset, the manual evaluation process by domain experts will be time-consuming and expensive. 

Combining different error types, which yields a more realistic scenario, influences the recall, which results in a lower f1-measure than on each individual error type.
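
The link between recall and f1-measure can be illustrated with the standard definition, the harmonic mean of precision and recall. The numbers below are made up for illustration and are not results from the paper.

```python
def f1_measure(precision, recall):
    """Harmonic mean of precision and recall; 0.0 if both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical values: precision stays fixed while recall drops.
single_error_type = f1_measure(precision=0.9, recall=0.9)
combined_error_types = f1_measure(precision=0.9, recall=0.5)
```

Because the harmonic mean is pulled towards the smaller of its two inputs, a drop in recall lowers the f1-measure even when precision is unchanged.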