Lessons Learned — The Case of CROCUS: Cluster-Based Ontology Data Cleansing
Summary
1 Introduction
- The Semantic Web movement, including the Linked Open Data (LOD) cloud (http://lod-cloud.net/), represents a focal point for commercial and free-to-use applications.
- Depending on the size of the given dataset, manual evaluation by domain experts is time-consuming and expensive.
- From this scenario, the authors derive the requirements for their data quality evaluation process.
- Often, mature ontologies that have grown over years, been edited by a large number of processes and people, and been created by a third party provide the basis for industrial applications (e.g., DBpedia).
- The authors' contributions are as follows: they present (1) an exhaustive review of related work, classifying their approach according to three well-known surveys, (2) a pipeline for semi-automatic instance-level error detection that is (3) capable of evaluating large datasets.
3 Method
- SPARQL [15] is a W3C standard for querying instance data from Linked Data knowledge bases (http://www.w3.org/TR/rdf-sparql-query/).
- Therefore, a third rule is added to the Concise Bounded Description (CBD), i.e., (3) extract all triples with r as object; the result is called the Symmetric Concise Bounded Description (SCBD) [16] (a minimal extraction sketch follows this list).
- Metrics are split into three categories: (1) the simplest metric counts the occurrences of each property per instance, (2) the range count metric counts the types of resources a property links to, and (3) the numeric metric converts property values into numbers (cf. Q3 below).
- DBSCAN clusters instances based on the size of a cluster and the distance between those instances.
- If a cluster contains fewer than MinPts instances, its members are regarded as outliers (see the clustering sketch below).
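A minimal sketch of the SCBD extraction step, assuming the data already sits in an rdflib Graph; the function name and traversal details are illustrative, not taken from the paper:

```python
from rdflib import Graph, BNode, URIRef

def scbd(graph: Graph, r: URIRef) -> Graph:
    """Sketch of a Symmetric Concise Bounded Description for resource r."""
    desc = Graph()
    # (1) all triples with r as subject
    for t in graph.triples((r, None, None)):
        desc.add(t)
    # (3) all triples with r as object (the symmetric extension)
    for t in graph.triples((None, None, r)):
        desc.add(t)
    # (2) resolve blank nodes collected so far, transitively
    queue = [o for _, _, o in desc if isinstance(o, BNode)]
    seen = set()
    while queue:
        b = queue.pop()
        if b in seen:
            continue
        seen.add(b)
        for t in graph.triples((b, None, None)):
            desc.add(t)
            if isinstance(t[2], BNode):
                queue.append(t[2])
    return desc
```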
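And a sketch of the clustering step using scikit-learn's DBSCAN on a hypothetical feature matrix (one row per instance, one column per metric value); the data and eps value are made up for illustration:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical feature matrix derived from the metrics above,
# e.g., a property count and a string-length feature per instance.
X = np.array([
    [2.0, 11.0],
    [2.0, 12.0],
    [2.0, 10.0],
    [9.0, 95.0],   # deviating instance
])

# eps bounds the neighbourhood distance; min_samples plays the role of MinPts.
labels = DBSCAN(eps=3.0, min_samples=2).fit_predict(X)

# DBSCAN assigns the label -1 to points in no sufficiently dense cluster;
# such instances are the outlier candidates handed to quality raters.
outliers = np.where(labels == -1)[0]
print(outliers)  # -> [3]
```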
4 Evaluation
- First, the authors used the LUBM benchmark [18] to create a perfectly modelled dataset.
- This benchmark allows generating knowledge bases of arbitrary size describing universities.
- The authors' dataset consists of exactly one university and can be downloaded from their project homepage.
- Semantic correctness of properties (range count) has been evaluated by adding courses intended for non-graduate students to 20 graduate students.
- For each set of erroneous instances the following holds: |I_count| = |I_rangecount| = |I_numeric| = 20, and additionally |I_count ∩ I_rangecount ∩ I_numeric| = 3. The intersection constraint prevents a biased evaluation and introduces some realistic noise into the dataset.
- To evaluate the performance of CROCUS, the authors used each error type individually on the adjusted LUBM benchmark datasets, as well as a combination of all error types on LUBM and the real-world DBpedia subset.
- Table 3 presents the results for the combination of all error types for the LUBM benchmark as well as for the German universities DBpedia subset.
- CROCUS achieves a high recall on the real-world data from DBpedia.
- Table 4 lists the identified reasons for the errors in the German universities DBpedia subset that were detected as outliers.
- As mentioned before, some universities do not have a dbo:country property.
- Some literals are of type xsd:string although they represent a numeric value (a detection sketch follows this list).
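A small sketch of how such mistyped literals could be spotted with rdflib; this helper is our illustration, not the paper's implementation:

```python
from rdflib import Graph, Literal
from rdflib.namespace import XSD

def mistyped_numeric_literals(graph: Graph):
    """Yield triples whose literal is typed xsd:string but parses as a number."""
    for s, p, o in graph:
        if isinstance(o, Literal) and o.datatype == XSD.string:
            try:
                float(o)          # "12000" or "3.5" parse, "Leipzig" does not
                yield s, p, o
            except ValueError:
                pass
```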
5 Lessons learned
- Based on those candidates, quality raters and domain experts are able to define constraints to avoid a specific type of failure.
- Obviously, there are some failures which are too complex for a single constraint.
- An object property with more than one authorized class as range needs a separate rule for each class, e.g., a property locatedIn with the possible range classes Continent, Country, and AdminDivision.
- Any object having this property should be linked to only one instance of Country (see the query sketch after this list).
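One way to express such a check is a SPARQL query run via rdflib; the ex: namespace and the input file name are hypothetical, only the property and class names come from the paper:

```python
from rdflib import Graph

g = Graph().parse("universities.ttl")  # hypothetical input file

# Find subjects linked to more than one Country via locatedIn.
query = """
PREFIX ex: <http://example.org/>
SELECT ?s (COUNT(?c) AS ?countries) WHERE {
    ?s ex:locatedIn ?c .
    ?c a ex:Country .
} GROUP BY ?s HAVING (COUNT(?c) > 1)
"""
for row in g.query(query):
    print(f"{row.s} is linked to {row.countries} Country instances")
```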
6 Conclusion
- The authors presented CROCUS, a novel architecture for cluster-based, iterative ontology data cleansing, agnostic of the underlying knowledge base.
- With this approach the authors aim at the iterative integration of data into a productive environment, which is a typical task of industrial software life cycles.
- Furthermore, CROCUS will be extended towards a pipeline comprising a change management, an open API and semantic versioning of the underlying data.
- Additionally, a guided constraint derivation for laymen will be added.
- This work has been partly supported by the ESF and the Free State of Saxony and by grants from the European Union’s 7th Framework Programme provided for the project GeoKnow (GA no. 318159).
Frequently Asked Questions (18)
Q2. What future works have the authors mentioned in the paper "Lessons learned — the case of crocus: cluster-based ontology data cleansing" ?
In the future, the authors aim at a more extensive evaluation on domain specific knowledge bases. Furthermore, CROCUS will be extended towards a pipeline comprising a change management, an open API and semantic versioning of the underlying data.
Q3. What metric is used to determine the range of a resource?
Since instances created by the SCBD consist of properties with multiple ranges, CROCUS defines the following metrics: (a) numeric properties are taken as is, (b) properties based on strings are converted to a metric using string length, although more sophisticated measures could be used (e.g., n-gram similarities), and (c) object properties are discarded for this metric (see the sketch below).
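A minimal Python rendering of that conversion; the function name and the use of Python types as stand-ins for RDF literal types are our assumptions:

```python
def to_metric(value):
    # (a) numeric properties are taken as is
    if isinstance(value, (int, float)):
        return float(value)
    # (b) string-based properties are mapped to their length
    if isinstance(value, str):
        return float(len(value))
    # (c) object properties are discarded for this metric
    return None
```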
Q4. What is the purpose of the article?
The Semantic Web movement, including the Linked Open Data (LOD) cloud (http://lod-cloud.net/), represents a focal point for commercial and free-to-use applications.
Q5. Why is the LOD cloud so important?
Due to the size of LOD datasets, reasoning is computationally infeasible, but graph-based statistics and clustering methods can work efficiently.
Q6. What is the way to classify CROCUS?
To the best of their knowledge, their tool is the first to tackle error accuracy (intrinsic data quality), completeness (contextual data quality) and consistency (data modelling) at once in a semi-automatic manner, reaching a high F1-measure on real-world data.
Q7. What is the main reason why CROCUS is used in industrial environments?
The lack of costly domain experts requires non-experts or even laymen to validate the data before it influences a productive system.
Q8. How many instances of CROCUS have been used?
CROCUS has already been successfully used on a travel domain-specific productive environment comprising more than 630,000 instances (the dataset cannot be published due to its license).
Q9. What type of property needs another rule for each class?
For instance, an object property with more than one authorized class as range needs another rule for each class, e.g., a property locatedIn with the possible range classes Continent, Country, and AdminDivision.
Q10. What is the aim of the evaluation process?
Their aim is to find singular faults, i.e., unique instance errors, conflicting with large business-relevant areas of a knowledge base.
Q11. What is the way to extract data from a Linked Data knowledge base?
Given a resource r and a certain description depth d, the CBD works as follows: (1) extract all triples with r as subject, and (2) resolve all blank nodes retrieved so far, i.e., for each blank node, add every triple containing a blank node with the same identifier as subject to the description.
Q12. What is the definition of a good data quality?
A dataset is integrated repeatedly in iteration cycles, which generally leads to good data quality.
Q13. What is the main purpose of CROCUS?
CROCUS can be configured to find several types of errors in a semi-automatic way, which are afterwards validated by non-expert users called quality raters.
Q14. How did the authors cluster the ontology?
The authors clustered the ontology to ensure partitions contain only semantically correlated data and are able to detect outliers.
Q15. How many errors were found in the LUBM benchmark?
To evaluate the performance of CROCUS, the authors used each error type individually on the adjusted LUBM benchmark datasets, as well as a combination of all error types on LUBM and the real-world DBpedia subset.
Q16. What is the way to cluster a data set?
As a third step, the authors apply the density-based spatial clustering of applications with noise (DBSCAN) algorithm [17] since it is an efficient algorithm and the order of instances has no influence on the clustering result.
Q17. How much time does it take to evaluate a given dataset?
Depending on the size of the given dataset, the manual evaluation process by domain experts will be time-consuming and expensive.
Q18. What is the effect of a more realistic scenario?
Combining different error types, yielding a more realistic scenario, affects the recall, which results in a lower F1-measure than on each individual error type.