Book ChapterDOI

Lessons Learned — The Case of CROCUS: Cluster-Based Ontology Data Cleansing

25 May 2014, pp. 14-24
TL;DR: This system provides a semi-automatic approach for instance-level error detection in ontologies which is agnostic of the underlying Linked Data knowledge base and works at very low costs.
Abstract: Over the past years, a vast number of datasets have been published based on Semantic Web standards, which provides an opportunity for creating novel industrial applications. However, industrial requirements on data quality are high while the time to market as well as the required costs for data preparation have to be kept low. Unfortunately, many Linked Data sources are error-prone which prevents their direct use in productive systems. Hence, (semi-)automatic quality assurance processes are needed as manual ontology repair procedures by domain experts are expensive and time consuming. In this article, we present CROCUS – a pipeline for cluster-based ontology data cleansing. Our system provides a semi-automatic approach for instance-level error detection in ontologies which is agnostic of the underlying Linked Data knowledge base and works at very low costs. CROCUS has been evaluated on two datasets. The experiments show that we are able to detect errors with high recall. Furthermore, we provide an exhaustive related work as well as a number of lessons learned.

Summary (2 min read)

1 Introduction

  • The Semantic Web movement including the Linked Open Data (LOD) cloud1 represents a combustion point for commercial and free-to-use applications.
  • Depending on the size of the given dataset, the manual evaluation process by domain experts will be time consuming and expensive.
  • From this scenario, the authors derive the requirements for their data quality evaluation process.
  • Often, mature ontologies, grown over years, edited by a large number of processes and people, created by a third party provide the basis for industrial applications (e.g., DBpedia).
  • The authors' contributions are as follows: they present (1) an exhaustive related work and classify their approach according to three well-known surveys, (2) a pipeline for semi-automatic instance-level error detection that is (3) capable of evaluating large datasets.

3 Method

  • SPARQL [15] is a W3C standard to query instance data from Linked Data knowledge bases (http://www.w3.org/TR/rdf-sparql-query/).
  • Therefore, a rule is added to CBD, i.e., (3) extract all triples with r as object, which is called Symmetric Concise Bounded Description (SCDB) [16].
  • Metrics are split into three categories: (1) the simplest metric counts each property (count).
  • DBSCAN clusters instances based on the size of a cluster and the distance between those instances.
  • If a cluster has less than MinPts instances, they are regarded as outliers.

4 Evaluation

  • First, the authors used the LUBM benchmark [18] to create a perfectly modelled dataset.
  • This benchmark allows the generation of arbitrary knowledge bases.

  • The authors' dataset consists of exactly one university and can be downloaded from their project homepage4.
  • Semantic correctness of properties (range count) has been evaluated by adding courses intended for non-graduate students to 20 graduate students.
  • For each set of instances the following holds: |I_count| = |I_rangecount| = |I_numeric| = 20 and additionally |I_count ∩ I_rangecount ∩ I_numeric| = 3. The second equation overcomes a biased evaluation and introduces some realistic noise into the dataset (see the sketch after this list).
  • To evaluate the performance of CROCUS, the authors used each error type individually on the adjusted LUBM benchmark datasets as well as a combination of all error types on LUBM5 and the real-world DBpedia subset.
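The construction of the three error sets with a controlled overlap can be sketched as follows. The instance IDs and the sampling strategy are illustrative placeholders, not the authors' actual procedure, but the result satisfies |I_count| = |I_rangecount| = |I_numeric| = 20 and |I_count ∩ I_rangecount ∩ I_numeric| = 3 as stated above.

    import random

    random.seed(42)
    instances = [f"ex:Student{i}" for i in range(200)]   # placeholder LUBM-style IDs

    shared = random.sample(instances, 3)                 # corrupted with all three error types
    remaining = [i for i in instances if i not in shared]

    def error_set(pool, shared, size=20):
        """A set of `size` instances: the 3 shared ones plus (size - 3) unique picks."""
        unique = random.sample(pool, size - len(shared))
        for i in unique:
            pool.remove(i)                               # keep the non-shared parts disjoint
        return set(shared) | set(unique)

    I_count, I_rangecount, I_numeric = (error_set(remaining, shared) for _ in range(3))
    assert len(I_count) == len(I_rangecount) == len(I_numeric) == 20
    assert len(I_count & I_rangecount & I_numeric) == 3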

LUBM

  • Table 3 presents the results for the combination of all error types for the LUBM benchmark as well as for the German universities DBpedia subset.
  • CROCUS achieves a high recall on the real-world data from DBpedia.
  • Table 4 lists the identified reasons for the errors detected as outliers in the German universities DBpedia subset.
  • As mentioned before, some universities do not have a dbo:country property.
  • Some literals are of type xsd:string although they represent a numeric value.

5 Lessons learned

  • Based on those candidates, quality raters and domain experts are able to define constraints to avoid a specific type of failure.
  • Obviously, there are some failures which are too complex for a single constraint.
  • An object property with more than one authorized class as range needs another rule for each class, e.g., a property locatedIn with the possible classes Continent, Country and AdminDivision.
  • Any object having this property should only have one instance of Country linked to it (a sketch of such a constraint check follows this list).
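As an illustration of such a constraint, the check below looks for instances that violate an "at most one Country per located-in property" rule. The property and class IRIs (ex:locatedIn, dbo:Country) are placeholders chosen for this sketch, not necessarily the identifiers used by the authors.

    # Hypothetical constraint check: instances linked to more than one Country
    # via an ex:locatedIn-style property violate the rule described above.
    VIOLATION_QUERY = """
    PREFIX dbo: <http://dbpedia.org/ontology/>
    PREFIX ex:  <http://example.org/>

    SELECT ?instance (COUNT(?country) AS ?countries)
    WHERE {
        ?instance ex:locatedIn ?country .
        ?country a dbo:Country .
    }
    GROUP BY ?instance
    HAVING (COUNT(?country) > 1)
    """
    # Send VIOLATION_QUERY to the knowledge base's SPARQL endpoint; every result
    # row is a candidate for the quality raters to inspect.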

6 Conclusion

  • The authors presented CROCUS, a novel architecture for cluster-based, iterative ontology data cleansing, agnostic of the underlying knowledge base.
  • With this approach the authors aim at the iterative integration of data into a productive environment which is a typical task of industrial software life cycles.
  • Furthermore, CROCUS will be extended towards a pipeline comprising a change management, an open API and semantic versioning of the underlying data.
  • Additionally, a guided constraint derivation for laymen will be added.
  • This work has been partly supported by the ESF and the Free State of Saxony and by grants from the European Union’s 7th Framework Programme provided for the project GeoKnow (GA no. 318159).


Lessons Learned — The Case of CROCUS: Cluster-based Ontology Data Cleansing

Didier Cherix (2), Ricardo Usbeck (1,2), Andreas Both (2), and Jens Lehmann (1)

(1) University of Leipzig, Germany, {usbeck,lehmann}@informatik.uni-leipzig.de
(2) R & D, Unister GmbH, Leipzig, Germany, {andreas.both,didier.cherix}@unister.de
Abstract. Over the past years, a vast number of datasets have been published based on Semantic Web standards, which provides an opportunity for creating novel industrial applications. However, industrial requirements on data quality are high while the time to market as well as the required costs for data preparation have to be kept low. Unfortunately, many Linked Data sources are error-prone which prevents their direct use in productive systems. Hence, (semi-)automatic quality assurance processes are needed as manual ontology repair procedures by domain experts are expensive and time consuming. In this article, we present CROCUS, a pipeline for cluster-based ontology data cleansing. Our system provides a semi-automatic approach for instance-level error detection in ontologies which is agnostic of the underlying Linked Data knowledge base and works at very low costs. CROCUS has been evaluated on two datasets. The experiments show that we are able to detect errors with high recall. Furthermore, we provide an exhaustive related work as well as a number of lessons learned.
1 Introduction

The Semantic Web movement including the Linked Open Data (LOD) cloud (http://lod-cloud.net/) represents a combustion point for commercial and free-to-use applications. The Linked Open Data cloud hosts over 300 publicly available knowledge bases with an extensive range of topics and DBpedia [1] as central and most important dataset. While providing a short time-to-market of large and structured datasets, Linked Data has yet not reached industrial requirements in terms of provenance, interlinking and especially data quality. In general, LOD knowledge bases comprise only few logical constraints or are not well modelled.

Industrial environments need to provide high quality data in a short amount of time. A solution might be a significant number of domain experts that are checking a given dataset and defining constraints, ensuring the demanded data quality. However, depending on the size of the given dataset, the manual evaluation process by domain experts will be time consuming and expensive. Commonly, a dataset is integrated in iteration cycles repeatedly which leads to a generally good data quality. However, new or updated instances might be error-prone. Hence, the data quality of the dataset might be contaminated after a re-import.
From this scenario, we derive the requirements for our data quality evaluation process. (1) Our aim is to find singular faults, i.e., unique instance errors, conflicting with large business relevant areas of a knowledge base. (2) The data evaluation process has to be efficient. Due to the size of LOD datasets, reasoning is infeasible due to performance constraints, but graph-based statistics and clustering methods can work efficiently. (3) This process has to be agnostic of the underlying knowledge base, i.e., it should be independent of the evaluated dataset.

Often, mature ontologies, grown over years, edited by a large number of processes and people, created by a third party provide the basis for industrial applications (e.g., DBpedia). Aiming at short time-to-market, industry needs scalable algorithms to detect errors. Furthermore, the lack of costly domain experts requires non-experts or even laymen to validate the data before it influences a productive system. Resulting knowledge bases may still contain errors, however, they offer a fair trade-off in an iterative production cycle.

In this article, we present CROCUS, a cluster-based ontology data cleansing framework. CROCUS can be configured to find several types of errors in a semi-automatic way, which are afterwards validated by non-expert users called quality raters. By applying CROCUS' methodology iteratively, the resulting ontology data can be safely used in industrial environments.

On top of our previous work [2], our contributions are as follows: we present (1) an exhaustive related work and classify our approach according to three well-known surveys, (2) a pipeline for semi-automatic instance-level error detection that is (3) capable of evaluating large datasets. Moreover, it is (4) an approach agnostic to the analysed class of the instance as well as the Linked Data knowledge base. (5) We provide an evaluation on a synthetic and a real-world dataset. Finally, (6) we present a number of lessons learned according to error detection in real-world datasets.
2 Related Work

The research field of ontology data cleansing, especially instance data, can be regarded threefold: (1) development of statistical metrics to discover anomalies, (2) manual, semi-automatic and fully automatic evaluation of data quality and (3) rule- or logic-based approaches to prevent outliers in application data.

In 2013, Zaveri et al. [10] evaluate the data quality of DBpedia. This manual approach introduces a taxonomy of quality dimensions: (i) accuracy, which concerns wrong triples, data type problems and implicit relations between attributes, (ii) relevance, indicating significance of extracted information, (iii) representational consistency, measuring numerical stability and (iv) interlinking, which looks for links to external resources. Moreover, the authors present a manual error detection tool called TripleCheckMate (http://github.com/AKSW/TripleCheckMate) and a semi-automatic approach supported by the description logic learner (DL-Learner) [11,12], which generates a schema extension for preventing already identified errors. Those methods measured an error rate of 11.93% in DBpedia, which will be a starting point for our evaluation.

Table 1: Papers found for each data quality dimension defined in [8], on the basis of [9, Tables 8-9], classified by procedure (automatic, semi-automatic, manual, or not specified). The dimensions considered in this work are correctness, completeness and consistency.

  Intrinsic DQ:   Believability [3]; Objectivity [3]; Reputation [3]; Correctness [4,5], [3]
  Contextual DQ:  Completeness [6], [5], [3]; Added Value [6], [3]; Relevancy [3]; Timeliness [3]; Amount of data [3]
  Representation: Interpretability [3], [7]; Understandability [3]; Consistency [3], [7]; Conciseness [3], [7]
  Accessibility:  Availability [3], [7]; Response time [3]; Security [3]
A rule-based framework is presented by Fürber et al. [13] where the authors define 9 rules of data quality. Following this, the authors define an error as the number of instances not following a specific rule, normalized by the overall number of relevant instances. Afterwards, the framework is able to generate statistics on which rules have been applied to the data.

Several semi-automatic processes, e.g., [4,5], have been developed to detect errors in instance data of ontologies. Böhm et al. [4] profiled LOD knowledge bases, i.e., statistical metadata is generated to discover outliers. Therefore, the authors clustered the ontology to ensure partitions contain only semantically correlated data and are able to detect outliers. Hogan et al. [5] only identified errors in RDF data without evaluating the data properties themselves.
In 2013, Kontokostas et al. [14] present an automatic methodology to assess data quality via a SPARQL endpoint (http://www.w3.org/TR/rdf-sparql-query/). The authors define 14 basic graph patterns (BGP) to detect diverse error types. Each pattern leads to the construction of several cases with meta variables bound to specific instances of resources and literals, e.g., constructing a SPARQL query testing that a person is born before the person dies. This approach is not able to work iteratively to refine its result and is thus not usable in circular development processes.
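The "born before death" test mentioned above could, for example, be phrased as a SPARQL query that selects persons whose recorded death date precedes their birth date. The query below is an illustrative reading using DBpedia ontology properties; it is not necessarily one of the 14 patterns defined in [14].

    # Illustrative data-quality test in the spirit of the BGP patterns of [14]:
    # find persons whose recorded death date lies before their birth date.
    BORN_BEFORE_DEATH_TEST = """
    PREFIX dbo: <http://dbpedia.org/ontology/>

    SELECT ?person ?birth ?death
    WHERE {
        ?person a dbo:Person ;
                dbo:birthDate ?birth ;
                dbo:deathDate ?death .
        FILTER (?death < ?birth)
    }
    """
    # Any binding returned by this query marks an erroneous instance.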
Bizer et al. [3] present a manual framework as well as a browser to filter
Linked Data. The framework enables users to define rules which will be used to
clean the RDF data. Those rules have to be created manually in a SPARQL-like
syntax. In turn, the browser shows the processed data along with an explanation
of the filtering.
Network measures like degree and centrality are used by Guéret et al. [6] to quantify the quality of data. Furthermore, they present an automatic framework to evaluate the influence of each measure on the data quality. The authors prove that the presented measures are only capable of discovering a few quality-lacking triples.
Hogan et al. [7] compare the quality of several Linked Data datasets. Therefore, the authors extracted 14 rules from best practices and publications. Those rules are applied to each dataset and compared against the Page Rank of each data supplier. Thereafter, the Page Rank of a certain data supplier is correlated with the dataset's quality. The authors suggest new guidelines to align the Linked Data quality with the users' need for certain dataset properties.
A first classification of quality dimensions is presented by Wang et al. [8] with respect to their importance to the user. This study reveals a classification of data quality metrics in four categories, cf. Table 1. Recently, Zaveri et al. [9] present a systematic literature review on different methodologies for data quality assessment. The authors chose 21 articles, extracted 26 quality dimensions and categorized them according to [8]. The results show which error types exist and whether they are repairable manually, semi-automatically or fully automatically. The presented measures were used to classify CROCUS.

To the best of our knowledge, our tool is the first tool tackling error accuracy (intrinsic data quality), completeness (contextual data quality) and consistency (data modelling) at once in a semi-automatic manner, reaching a high f1-measure on real-world data.
3 Method

First, we need a standardized extraction of target data to be agnostic of the underlying knowledge base. SPARQL [15] is a W3C standard to query instance data from Linked Data knowledge bases (http://www.w3.org/TR/rdf-sparql-query/). The DESCRIBE query command is a way to retrieve descriptive data of certain instances. However, this query command depends on the knowledge base vendor and its configuration. To circumvent knowledge base dependence, we use Concise Bounded Descriptions (CBD) [16]. Given a resource r and a certain description depth d, the CBD works as follows: (1) extract all triples with r as subject and (2) resolve all blank nodes retrieved so far, i.e., for each blank node add every triple containing a blank node with the same identifier as a subject to the description. Finally, CBD repeats these steps d times. CBD configured with d = 1 retrieves only triples with r as subject although triples with r as object could contain useful information. Therefore, a rule is added to CBD, i.e., (3) extract all triples with r as object, which is called Symmetric Concise Bounded Description (SCDB) [16].
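A simplified SCDB extraction can be approximated with a single SPARQL CONSTRUCT query that collects triples with the resource as subject and as object. The sketch below is a minimal illustration assuming the SPARQLWrapper library and a public endpoint; the endpoint URL and resource IRI are placeholders, and the blank-node resolution of CBD step (2) is omitted for brevity.

    from SPARQLWrapper import SPARQLWrapper

    # Hypothetical endpoint and resource; adjust to the knowledge base at hand.
    ENDPOINT = "http://dbpedia.org/sparql"
    RESOURCE = "http://dbpedia.org/resource/Leipzig_University"

    def scdb(resource, endpoint=ENDPOINT):
        """Fetch a simplified Symmetric Concise Bounded Description: all triples
        with the resource as subject plus all triples with it as object.
        Blank-node expansion (CBD step 2) is left out of this sketch."""
        sparql = SPARQLWrapper(endpoint)
        sparql.setQuery(f"""
            CONSTRUCT {{ <{resource}> ?p ?o . ?s ?q <{resource}> . }}
            WHERE {{
                {{ <{resource}> ?p ?o . }}
                UNION
                {{ ?s ?q <{resource}> . }}
            }}
        """)
        return sparql.queryAndConvert()  # an rdflib Graph for CONSTRUCT queries

    description = scdb(RESOURCE)
    print(len(description), "triples in the SCDB")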
Second, CROCUS needs to calculate a numeric representation of an instance to facilitate further clustering steps. Metrics are split into three categories:

(1) The simplest metric counts each property (count). For example, this metric can be used if a person is expected to have only one telephone number.

(2) For each instance, the range of the resource at a certain property is counted (range count). In general, an undergraduate student should take undergraduate courses. If there is an undergraduate student taking courses with another type (e.g., graduate courses), this metric is able to detect it.

(3) The most general metric transforms each instance into a numeric vector and normalizes it (numeric). Since instances created by the SCDB consist of properties with multiple ranges, CROCUS defines the following metrics: (a) numeric properties are taken as is, (b) properties based on strings are converted to a metric by using string length although more sophisticated measures could be used (e.g., n-gram similarities) and (c) object properties are discarded for this metric.
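As an illustration of these three metric families, the following sketch computes, for one instance description, (1) a property count, (2) a per-property range count and (3) a numeric vector in the spirit described above (numbers kept as-is, strings replaced by their length, object values dropped). The property names and the triple representation are illustrative placeholders, not the authors' actual data structures.

    from collections import Counter

    # An instance description as (property, value) pairs -- illustrative only.
    instance = [
        ("telephone", "+49 341 123456"),
        ("telephone", "+49 341 654321"),
        ("takesCourse", {"type": "GraduateCourse", "uri": "ex:Course42"}),
        ("age", 24),
    ]

    # (1) count: how often each property occurs on the instance.
    count = Counter(p for p, _ in instance)

    # (2) range count: how often each (property, type-of-value) pair occurs.
    def value_type(v):
        if isinstance(v, dict):           # object property: use the rdf:type of the object
            return v["type"]
        return type(v).__name__           # literal: use its (Python) datatype as a proxy

    range_count = Counter((p, value_type(v)) for p, v in instance)

    # (3) numeric: numbers as-is, strings by length, object properties discarded.
    numeric = [float(v) if isinstance(v, (int, float)) else float(len(v))
               for _, v in instance if not isinstance(v, dict)]

    print(count, range_count, numeric)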
As a third step, we apply the density-based spatial clustering of applications with noise (DBSCAN) algorithm [17] since it is an efficient algorithm and the order of instances has no influence on the clustering result. DBSCAN clusters instances based on the size of a cluster and the distance between those instances. Thus, DBSCAN has two parameters: ε, the distance between two instances, here calculated by the metrics above, and MinPts, the minimum number of instances needed to form a cluster. If a cluster has less than MinPts instances, they are regarded as outliers. We report the quality of CROCUS for different values of MinPts in Section 4.
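A minimal sketch of this clustering step using scikit-learn's DBSCAN follows, assuming each instance has already been turned into a fixed-length numeric vector by one of the metrics. The values of eps (the ε above) and min_samples (MinPts) are example choices, not the ones used in the paper; instances labelled -1 are the outlier candidates handed to the quality raters.

    import numpy as np
    from sklearn.cluster import DBSCAN

    # One row per instance; columns come from the chosen metric (illustrative values).
    X = np.array([
        [1.0, 3.0], [1.1, 3.2], [0.9, 2.9],   # a dense cluster of "normal" instances
        [1.0, 3.1], [5.0, 0.5],               # the last row is an outlier candidate
    ])

    clustering = DBSCAN(eps=0.5, min_samples=3).fit(X)   # eps ~ ε, min_samples ~ MinPts
    outlier_candidates = np.where(clustering.labels_ == -1)[0]
    print("outlier candidate indices:", outlier_candidates)  # -> [4]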
Finally, identified outliers are extracted and given to human quality judges. Based on the revised set of outliers, the algorithm can be adjusted and constraints can be added to the Linked Data knowledge base to prevent repeating discovered errors.
4 Evaluation

LUBM benchmark. First, we used the LUBM benchmark [18] to create a perfectly modelled dataset. This benchmark allows the generation of arbitrary knowledge bases.

Citations
Journal ArticleDOI
TL;DR: This substantial review of the literature on the state of the art of research over the past decades identified six main issues to be explored and analyzed: 1) the available data sources; 2) data preparation techniques; 3) data representations; 4) forecasting models and methods; 5) dengue forecasting models evaluation approaches; and 6) future challenges and possibilities in forecasting modeling of dengue outbreaks.
Abstract: Dengue infection is a mosquitoborne disease caused by dengue viruses, which are carried by several species of mosquito of the genus Aedes , principally Ae. aegypti . Dengue outbreaks are endemic in tropical and sub-tropical regions of the world, mainly in urban and sub-urban areas. The outbreak is one of the top 10 diseases causing the most deaths worldwide. According to the World Health Organization, dengue infection has increased 30-fold globally over the past five decades. About 50–100 million new infections occur annually in more than 80 countries. Many researchers are working on measures to prevent and control the spread. One avenue of research is collaboration between computer science and the epidemiology researchers in developing methods of predicting potential outbreaks of dengue infection. An important research objective is to develop models that enable, or enhance, forecasting of outbreaks of dengue, giving medical professionals the opportunity to develop plans for handling the outbreak, well in advance. Researchers have been gathering and analyzing data to better identify the relational factors driving the spread of the disease, as well as the development of a variety of methods of predictive modeling using statistical and mathematical analysis and machine learning. In this substantial review of the literature on the state of the art of research over the past decades, we identified six main issues to be explored and analyzed: 1) the available data sources; 2) data preparation techniques; 3) data representations; 4) forecasting models and methods; 5) dengue forecasting models evaluation approaches; and 6) future challenges and possibilities in forecasting modeling of dengue outbreaks. Our comprehensive exploration of the issues provides a valuable information foundation for new researchers in this important area of public health research and epidemiology.

45 citations

Journal ArticleDOI
TL;DR: The results show that a combination of the two styles of crowdsourcing is likely to achieve more efficient results than each of them used in isolation, and that human computation is a promising and affordable way to enhance the quality of Linked Data.
Abstract: In this paper we examine the use of crowdsourcing as a means to master Linked Data quality problems that are difficult to solve automatically. We base our approach on the analysis of the most common errors encountered in Linked Data sources, and a classification of these errors according to the extent to which they are likely to be amenable to crowdsourcing. We then propose and compare different crowdsourcing approaches to identify these Linked Data quality issues, employing the DBpedia dataset as our use case: (i) a contest targeting the Linked Data expert community, and (ii) paid microtasks published on Amazon Mechanical Turk. We secondly focus on adapting the Find-Fix-Verify crowdsourcing pattern to exploit the strengths of experts and lay workers. By testing two distinct Find-Verify workflows (lay users only and experts verified by lay users) we reveal how to best combine different crowds’ complementary aptitudes in quality issue detection. The results show that a combination of the two styles of crowdsourcing is likely to achieve more efficient results than each of them used in isolation, and that human computation is a promising and affordable way to enhance the quality of Linked Data.

33 citations

Journal ArticleDOI
TL;DR: An improvement in the efficiency of predicting missing data utilizing Particle Swarm Optimization (PSO), which is applied to the numerical data cleansing problem, with the performance of PSO being enhanced using K-means to help determine the fitness value.
Abstract: Missing data are a major problem that affects data analysis techniques for forecasting. Traditional methods suffer from poor performance in predicting missing values using simple techniques, e.g., mean and mode. In this paper, we present and discuss a novel method of imputing missing values semantically with the use of an ontology model. We make three new contributions to the field: first, an improvement in the efficiency of predicting missing data utilizing Particle Swarm Optimization (PSO), which is applied to the numerical data cleansing problem, with the performance of PSO being enhanced using K-means to help determine the fitness value. Second, the incorporation of an ontology with PSO for the purpose of narrowing the search space, to make PSO provide greater accuracy in predicting numerical missing values while quickly converging on the answer. Third, the facilitation of a framework to substitute nominal data that are lost from the dataset using the relationships of concepts and a reasoning mechanism concerning the knowledge-based model. The experimental results indicated that the proposed method could estimate missing data more efficiently and with less chance of error than conventional methods, as measured by the root mean square error.

8 citations

Dissertation
01 Jan 2015
TL;DR: A number of approaches that allow to detect errors in Linked Data without a requirement for additional, external data are introduced, including an approach to detect data-level errors in numerical values.
Abstract: Linked Data is one of the most successful implementations of the Semantic Web idea. This is demonstrated by the large amount of data available in repositories constituting the Linked Open Data cloud and being linked to each other. Many of these datasets are not created manually but are extracted automatically from existing datasets. Thus, extraction errors, which a human would easily recognize, might go unnoticed and could hence considerably diminish the usability of Linked Data. The large amount of data renders manual detection of such errors unrealistic and makes automatic approaches for detecting errors desirable. To tackle this need, this thesis focuses on error detection approaches on the logical level and on the level of numerical data. In addition, the presented methods operate solely on the Linked Data dataset without a requirement for additional external data. The first two parts of this work deal with the detection of logical errors in Linked Data. It is argued that an upstream formalization of the knowledge, which is required for the error detection, into ontologies and then applying it in a separate step has several advantages over approaches that skip the formalization step. Consequently, the first part introduces inductive approaches for learning highly expressive ontologies from existing instance data as a basis for detecting logical errors. The proposed and evaluated techniques allow to learn class disjointness axioms as well as several property-centric axiom types from instance data. The second part of this thesis operates on the ontologies learned by the approaches proposed in the previous part. First, their quality is improved by detecting errors possibly introduced by the automatic learning process. For this purpose, a pattern-based approach for finding the root causes of ontology errors that is tailored to the specifics of the learned ontologies is proposed and then used in the context of ontology debugging approaches. To conclude the logical error detection, the usage of learned ontologies for finding erroneous statements in Linked Data is evaluated in the final chapter of the second part. This is done by applying a pattern-based error detection approach that employs the learned ontologies to the DBpedia dataset and then manually evaluating the results which finally shows the adequacy of learned ontologies for logical error detection. The final part of this thesis complements the previously shown logical error detection with an approach to detect data-level errors in numerical values. The presented method applies outlier detection techniques to the datatype property values to find potentially erroneous ones whereby the result and performance of the detection step is improved by the introduction of additional preprocessing steps. Furthermore, a subsequent cross-checking step is proposed which allows to handle the outlier detection imminent problem of natural outliers. In summary, this work introduces a number of approaches that allow to detect errors in Linked Data without a requirement for additional, external data. The generated lists of potentially erroneous facts can be a first indication for errors and the intermediate step of learning ontologies makes the full workflow even more suited for being used in a scenario which includes human interaction.
References
Proceedings Article
02 Aug 1996
TL;DR: In this paper, a density-based notion of clusters is proposed to discover clusters of arbitrary shape, which can be used for class identification in large spatial databases and is shown to be more efficient than the well-known algorithm CLAR-ANS.
Abstract: Clustering algorithms are attractive for the task of class identification in spatial databases. However, the application to large spatial databases rises the following requirements for clustering algorithms: minimal requirements of domain knowledge to determine the input parameters, discovery of clusters with arbitrary shape and good efficiency on large databases. The well-known clustering algorithms offer no solution to the combination of these requirements. In this paper, we present the new clustering algorithm DBSCAN relying on a density-based notion of clusters which is designed to discover clusters of arbitrary shape. DBSCAN requires only one input parameter and supports the user in determining an appropriate value for it. We performed an experimental evaluation of the effectiveness and efficiency of DBSCAN using synthetic data and real data of the SEQUOIA 2000 benchmark. The results of our experiments demonstrate that (1) DBSCAN is significantly more effective in discovering clusters of arbitrary shape than the well-known algorithm CLAR-ANS, and that (2) DBSCAN outperforms CLARANS by a factor of more than 100 in terms of efficiency.

17,056 citations



"Lessons Learned — The Case of CROCU..." refers methods in this paper

  • ...As a third step, we apply the density-based spatial clustering of applications with noise (DBSCAN) algorithm [17] since it is an efficient algorithm and the order of instances has no influence on the clustering result....

    [...]

Journal ArticleDOI
TL;DR: Using this framework, IS managers were able to better understand and meet their data consumers' data quality needs and this research provides a basis for future studies that measure data quality along the dimensions of this framework.
Abstract: Poor data quality (DQ) can have substantial social and economic impacts. Although firms are improving data quality with practical approaches and tools, their improvement efforts tend to focus narrowly on accuracy. We believe that data consumers have a much broader data quality conceptualization than IS professionals realize. The purpose of this paper is to develop a framework that captures the aspects of data quality that are important to data consumers.A two-stage survey and a two-phase sorting study were conducted to develop a hierarchical framework for organizing data quality dimensions. This framework captures dimensions of data quality that are important to data consumers. Intrinsic DQ denotes that data have quality in their own right. Contextual DQ highlights the requirement that data quality must be considered within the context of the task at hand. Representational DQ and accessibility DQ emphasize the importance of the role of systems. These findings are consistent with our understanding that high-quality data should be intrinsically good, contextually appropriate for the task, clearly represented, and accessible to the data consumer.Our framework has been used effectively in industry and government. Using this framework, IS managers were able to better understand and meet their data consumers' data quality needs. The salient feature of this research study is that quality attributes of data are collected from data consumers instead of being defined theoretically or based on researchers' experience. Although exploratory, this research provides a basis for future studies that measure data quality along the dimensions of this framework.

4,069 citations

Journal ArticleDOI
TL;DR: An overview of the DBpedia community project is given, including its architecture, technical implementation, maintenance, internationalisation, usage statistics and applications, including DBpedia one of the central interlinking hubs in the Linked Open Data (LOD) cloud.
Abstract: The DBpedia community project extracts structured, multilingual knowledge from Wikipedia and makes it freely available on the Web using Semantic Web and Linked Data technologies. The project extracts knowledge from 111 different language editions of Wikipedia. The largest DBpedia knowledge base which is extracted from the English edition of Wikipedia consists of over 400 million facts that describe 3.7 million things. The DBpedia knowledge bases that are extracted from the other 110 Wikipedia editions together consist of 1.46 billion facts and describe 10 million additional things. The DBpedia project maps Wikipedia infoboxes from 27 different language editions to a single shared ontology consisting of 320 classes and 1,650 properties. The mappings are created via a world-wide crowd-sourcing effort and enable knowledge from the different Wikipedia editions to be combined. The project publishes releases of all DBpedia knowledge bases for download and provides SPARQL query access to 14 out of the 111 language editions via a global network of local DBpedia chapters. In addition to the regular releases, the project maintains a live knowledge base which is updated whenever a page in Wikipedia changes. DBpedia sets 27 million RDF links pointing into over 30 external data sources and thus enables data from these sources to be used together with DBpedia data. Several hundred data sets on the Web publish RDF links pointing to DBpedia themselves and make DBpedia one of the central interlinking hubs in the Linked Open Data (LOD) cloud. In this system report, we give an overview of the DBpedia community project, including its architecture, technical implementation, maintenance, internationalisation, usage statistics and applications.

2,856 citations


"Lessons Learned — The Case of CROCU..." refers background in this paper

  • ...Table 4 lists the identified reasons of errors from the German universities DBpedia subset detected as outlier....

    [...]

  • ...However, CROCUS achieves a high recall on the real-world data from DBpedia....

    [...]

  • ...To evaluate the performance of CROCUS, we used each error type individually on the adjusted LUBM benchmark datasets as well as a combination of all error types on LUBM5 and the real-world DBpedia subset....

    [...]

  • ...Often, mature ontologies, grown over years, edited by a large amount of processes and people, created by a third party provide the basis for industrial applications (e.g., DBpedia)....

    [...]

  • ...In 2013, Zaveri et al. [10] evaluate the data quality of DBpedia....

    [...]

BookDOI
01 Jan 2004
TL;DR: DODDLE-R, a support environment for user-centered ontology development, consists of two main parts: pre-processing part and quality improvement part, which generates a prototype ontology semi-automatically and supports the refinement of it interactively.
Abstract: In order to realize the on-the-fly ontology construction for the Semantic Web, this paper proposes DODDLE-R, a support environment for user-centered ontology development. It consists of two main parts: pre-processing part and quality improvement part. Pre-processing part generates a prototype ontology semi-automatically, and quality improvement part supports the refinement of it interactively. As we believe that careful construction of ontologies from preliminary phase is more efficient than attempting generate ontologies full-automatically (it may cause too many modification by hand), quality improvement part plays significant role in DODDLE-R. Through interactive support for improving the quality of prototype ontology, OWL-Lite level ontology, which consists of taxonomic relationships (class sub class relationship) and non-taxonomic relationships (defined as property), is constructed efficiently.

2,006 citations

Frequently Asked Questions (18)
Q1. What contributions have the authors mentioned in the paper "Lessons learned — the case of crocus: cluster-based ontology data cleansing" ?

Over the past years, a vast number of datasets have been published based on Semantic Web standards, which provides an opportunity for creating novel industrial applications. In this article, the authors present CROCUS – a pipeline for cluster-based ontology data cleansing. Their system provides a semi-automatic approach for instance-level error detection in ontologies which is agnostic of the underlying Linked Data knowledge base and works at very low costs. Furthermore, the authors provide an exhaustive related work as well as a number of lessons learned. 

In the future, the authors aim at a more extensive evaluation on domain specific knowledge bases. Furthermore, CROCUS will be extended towards a pipeline comprising a change management, an open API and semantic versioning of the underlying data. 

Since instances created by the SCDB consist of properties with multiple ranges, CROCUS defines the following metrics: (a) numeric properties are taken as is, (b) properties based on strings are converted to a metric by using string length although more sophisticated measures could be used (e.g., n-gram similarities) and (c) object properties are discarded for this metric. 

The Semantic Web movement including the Linked Open Data (LOD) cloud1 represents a combustion point for commercial and free-to-use applications. 

Due to the size of LOD datasets, reasoning is infeasible due to performance constraints, but graph-based statistics and clustering methods can work efficiently. 

To the best of their knowledge, their tool is the first tool tackling error accuracy (intrinsic data quality), completeness (contextual data quality) and consistency (data modelling) at once in a semi-automatic manner reaching high f1-measure on real-world data. 

Furthermore, the lack of costly domain experts requires non-experts or even laymen to validate the data before it influences a productive system. 

CROCUS has already been successfully used on a travel domain-specific productive environment comprising more than 630,000 instances (the dataset cannot be published due to its license). 

For instance, an object property with more than one authorized class as range needs another rule for each class, e.g., a property locatedIn with the possible classes Continent, Country and AdminDivision. 
(1) Their aim is to find singular faults, i.e., unique instance errors, conflicting with large business relevant areas of a knowledge base. 

Given a resource r and a certain description depth d the CBD works as follows: (1) extract all triples with r as subject and (2) resolve all blank nodes retrieved so far, i.e., for each blank node add every triple containing a blank node with the same identifier as a subject to the description. 

Commonly, a dataset is integrated in iteration cycles repeatedly, which leads to a generally good data quality. 

CROCUS can be configured to find several types of errors in a semiautomatic way, which are afterwards validated by non-expert users called quality raters. 

Therefore, the authors clustered the ontology to ensure partitions contain only semantically correlated data and are able to detect outliers. 

To evaluate the performance of CROCUS, the authors used each error type individually on the adjusted LUBM benchmark datasets as well as a combination of all error types on LUBM5 and the real-world DBpedia subset. 

As a third step, the authors apply the density-based spatial clustering of applications with noise (DBSCAN) algorithm [17] since it is an efficient algorithm and the order of instances has no influence on the clustering result. 

However, depending on the size of the given dataset, the manual evaluation process by domain experts will be time consuming and expensive. 

Combining different error types, which yields a more realistic scenario, influences the recall and results in a lower f1-measure than on each individual error type.