Lessons Learned — The Case of CROCUS: Cluster-Based Ontology Data Cleansing
Frequently Asked Questions (18)
Q2. What future works have the authors mentioned in the paper "Lessons learned — the case of crocus: cluster-based ontology data cleansing" ?
In the future, the authors aim at a more extensive evaluation on domain-specific knowledge bases. Furthermore, CROCUS will be extended into a pipeline comprising change management, an open API and semantic versioning of the underlying data.
Q3. What metric is used to determine the range of a resource?
Since instances created by the CBD consist of properties with multiple ranges, CROCUS defines the following metrics: (a) numeric properties are taken as is, (b) string-based properties are converted to a metric by using the string length, although more sophisticated measures could be used (e.g., n-gram similarities), and (c) object properties are discarded for this metric.
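As a rough illustration (not the authors' code), the conversion described above could look like the following Python sketch; the property names in the example comment are made up.

# Numeric literals are used as-is, string literals are mapped to their length,
# and object properties (non-literal values) are discarded.
def property_value_to_number(value):
    if isinstance(value, (int, float)):   # (a) numeric properties: taken as is
        return float(value)
    if isinstance(value, str):            # (b) string properties: string length
        return float(len(value))
    return None                           # (c) object properties: discarded

def instance_to_feature_vector(instance):
    """Map a dict of property -> value to a numeric feature vector."""
    features = {}
    for prop, value in instance.items():
        number = property_value_to_number(value)
        if number is not None:
            features[prop] = number
    return features

# Example (hypothetical properties): {"age": 42, "name": "Alice"} -> {"age": 42.0, "name": 5.0}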
Q4. What is the purpose of the article?
The Semantic Web movement including the Linked Open Data (LOD) cloud (http://lod-cloud.net/) represents a combustion point for commercial and free-to-use applications.
Q5. Why is the LOD cloud so important?
Due to the size of LOD datasets, reasoning is infeasible for performance reasons, but graph-based statistics and clustering methods can work efficiently.
Q6. What is the way to classify CROCUS?
To the best of their knowledge, their tool is the first to tackle error accuracy (intrinsic data quality), completeness (contextual data quality) and consistency (data modelling) at once in a semi-automatic manner, reaching a high F1-measure on real-world data.
Q7. What is the main reason why CROCUS is used in industrial environments?
The lack of costly domain experts requires non-experts or even laymen to validate the data before it influences a productive system.
Q8. How many instances of CROCUS have been used?
CROCUS has already been successfully used in a travel domain-specific productive environment comprising more than 630,000 instances (the dataset cannot be published due to its license).
Q9. What type of property needs another rule for each class?
For instance, an object property with more than one authorized class as range needs another rule for each class, e.g., a property located
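A hypothetical sketch of this "one rule per authorized range class" idea, using a made-up namespace, property and class names, could generate one SPARQL ASK query per allowed range class:

# For an object property such as ex:located whose range may be ex:City or
# ex:Country, a separate ASK query per class checks whether a value fits it.
RANGE_CLASSES = {
    "http://example.org/located": ["http://example.org/City",
                                   "http://example.org/Country"],
}

def range_rules(prop):
    """Build one ASK query (rule) per authorized range class of the property."""
    return [f"ASK {{ ?s <{prop}> ?o . ?o a <{cls}> . }}"
            for cls in RANGE_CLASSES[prop]]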
Q10. What is the aim of the evaluation process?
Their aim is to find singular faults, i.e., unique instance errors, conflicting with large, business-relevant areas of a knowledge base.
Q11. What is the way to extract data from a Linked Data knowledge base?
Given a resource r and a certain description depth d, the CBD works as follows: (1) extract all triples with r as subject, and (2) resolve all blank nodes retrieved so far, i.e., for each blank node, add every triple that contains a blank node with the same identifier in the subject position to the description.
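The following Python sketch (using rdflib, and not the authors' implementation) illustrates the procedure; the depth parameter d is omitted for brevity, and only blank nodes are resolved recursively as described above.

from rdflib import Graph, BNode, URIRef

def concise_bounded_description(graph: Graph, resource: URIRef) -> Graph:
    """Collect all triples with the resource as subject, resolving blank nodes."""
    cbd = Graph()
    queue = [resource]
    seen = set()
    while queue:
        node = queue.pop()
        if node in seen:
            continue
        seen.add(node)
        for s, p, o in graph.triples((node, None, None)):
            cbd.add((s, p, o))
            if isinstance(o, BNode):   # step (2): resolve blank nodes transitively
                queue.append(o)
    return cbd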
Q12. What is the definition of a good data quality?
A dataset is integrated in iteration cycles repeatedly, which generally leads to good data quality.
Q13. What is the main purpose of CROCUS?
CROCUS can be configured to find several types of errors in a semi-automatic way; the detected errors are afterwards validated by non-expert users called quality raters.
Q14. How did the authors cluster the ontology?
The authors clustered the ontology to ensure partitions contain only semantically correlated data and to be able to detect outliers.
Q15. How many errors were found in the LUBM benchmark?
To evaluate the performance of CROCUS, the authors used each error type individually on the adjusted LUBM benchmark datasets as well as a combination of all error types on LUBM5 and the real-world DBpedia subset.
Q16. What is the way to cluster a data set?
As a third step, the authors apply the density-based spatial clustering of applications with noise (DBSCAN) algorithm [17], since it is efficient and the order of instances has no influence on the clustering result.
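A rough sketch of this step with scikit-learn (the eps and min_samples values below are placeholders, not the paper's settings): instances that DBSCAN labels as noise (-1) are treated as outlier candidates.

import numpy as np
from sklearn.cluster import DBSCAN

# Toy feature vectors; in CROCUS these would come from the metric conversion above.
feature_vectors = np.array([[1.0, 5.0], [1.1, 5.2], [0.9, 4.8], [10.0, 50.0]])
labels = DBSCAN(eps=0.5, min_samples=2).fit_predict(feature_vectors)
outlier_candidates = np.where(labels == -1)[0]   # indices of noise instances
print(outlier_candidates)                        # -> [3] in this toy example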
Q17. How much time does it take to evaluate a given dataset?
Depending on the size of the given dataset, the manual evaluation process by domain experts will be time-consuming and expensive.
Q18. What is the effect of a more realistic scenario?
Combining different error types, which yields a more realistic scenario, influences the recall, resulting in a lower F1-measure than on each individual error type.
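For reference, the F1-measure is the harmonic mean of precision and recall, so a drop in recall directly lowers F1 even if precision stays unchanged: F1 = 2 * precision * recall / (precision + recall).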