
Showing papers by Heiko Paulheim published in 2018


Proceedings ArticleDOI
28 Aug 2018
TL;DR: A weakly supervised approach that automatically collects a large-scale but very noisy training dataset of hundreds of thousands of tweets, labeled only by the trustworthiness of their source, and shows that despite these inaccurate labels, fake news can be detected with an F1 score of up to 0.9.
Abstract: The problem of automatic detection of fake news in social media, e.g., on Twitter, has recently drawn some attention. Although, from a technical perspective, it can be regarded as a straightforward binary classification problem, the major challenge is the collection of large enough training corpora, since manual annotation of tweets as fake or non-fake news is an expensive and tedious endeavor. In this paper, we discuss a weakly supervised approach which automatically collects a large-scale, but very noisy training dataset comprising hundreds of thousands of tweets. During collection, we automatically label tweets by their source, i.e., trustworthy or untrustworthy source, and train a classifier on this dataset. We then use that classifier for a different classification target, i.e., the classification of fake and non-fake tweets. Although the labels are not accurate according to the new classification target (not all tweets by an untrustworthy source need to be fake news, and vice versa), we show that, despite these noisy and inaccurate labels, it is possible to detect fake news with an F1 score of up to 0.9.
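The core of the approach, training on source-based weak labels and then predicting a different target, can be illustrated with a few lines of scikit-learn. The tweet lists, features, and classifier below are illustrative assumptions, not the pipeline evaluated in the paper:

```python
# Minimal sketch of source-based weak supervision for fake news detection.
# The toy tweets, TF-IDF features, and logistic regression are placeholders;
# the paper's actual feature set and models may differ.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

trusted = [
    "city council publishes official minutes of the budget meeting",
    "study in peer reviewed journal reports new findings on air quality",
]
untrusted = [
    "SHOCKING truth they do not want you to know about vaccines",
    "you won't believe what this politician is secretly hiding",
]

# Weak labels: 0 = from a trustworthy source, 1 = from an untrustworthy source.
X = trusted + untrusted
y = [0] * len(trusted) + [1] * len(untrusted)

model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2), min_df=1),
                      LogisticRegression(max_iter=1000))
model.fit(X, y)

# At prediction time the classifier is applied to a *different* target:
# judging individual tweets as fake vs. non-fake news.
print(model.predict(["breaking: secret cure suppressed by the government"]))
```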

106 citations


Journal ArticleDOI
TL;DR: It is argued that in order to gain a better understanding of forest health in the authors' complex world, it would be conducive to implement the concepts of data science, with components such as digitalization and standardization with metadata management following the FAIR principles.
Abstract: Forest ecosystems fulfill a whole host of ecosystem functions that are essential for life on our planet. However, an unprecedented level of anthropogenic influences is reducing the resilience and stability of our forest ecosystems as well as their ecosystem functions. The relationships between drivers, stress, and ecosystem functions in forest ecosystems are complex, multi-faceted, and often non-linear, and yet forest managers, decision makers, and politicians need to be able to make rapid decisions that are data-driven and based on short- and long-term monitoring information, complex modeling, and analysis approaches. A huge number of long-standing and standardized forest health inventory approaches already exist, and are increasingly integrating remote-sensing based monitoring approaches. Unfortunately, these approaches in monitoring, data storage, analysis, prognosis, and assessment still do not satisfy the future requirements of information and digital knowledge processing of the 21st century. Therefore, this paper discusses and presents in detail five sets of requirements, including their relevance, necessity, and the possible solutions that would be necessary for establishing a feasible multi-source forest health monitoring network for the 21st century. Namely, these requirements are: (1) understanding the effects of multiple stressors on forest health; (2) using remote sensing (RS) approaches to monitor forest health; (3) coupling different monitoring approaches; (4) using data science as a bridge between complex and multidimensional big forest health (FH) data; and (5) a future multi-source forest health monitoring network. It became apparent that no existing monitoring approach, technique, model, or platform is sufficient on its own to monitor, model, forecast, or assess forest health and its resilience. In order to advance the development of a multi-source forest health monitoring network, we argue that, to gain a better understanding of forest health in our complex world, it would be conducive to implement the concepts of data science with the following components: (i) digitalization; (ii) standardization with metadata management following the FAIR (Findability, Accessibility, Interoperability, and Reusability) principles; (iii) Semantic Web; (iv) proof, trust, and uncertainties; (v) tools for data science analysis; and (vi) easy tools for scientists, data managers, and stakeholders for decision-making support.

68 citations


Journal ArticleDOI
TL;DR: This paper uses neural language models to produce word embeddings from large quantities of publicly available product data marked up with Microdata; these embeddings boost the performance of the feature extraction model, leading to better product matching and categorization performance.
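As a rough illustration of the idea, word embeddings can be trained on product titles harvested from Microdata annotations and then used to compare offers. The corpus, hyperparameters, and similarity-based comparison below are illustrative stand-ins, not the paper's actual feature extraction model:

```python
# Hedged sketch: train word embeddings on product titles (e.g., schema.org
# Microdata "name" fields) and compare two offers via averaged word vectors.
import numpy as np
from gensim.models import Word2Vec

titles = [
    "apple iphone 6 16gb space gray smartphone",
    "iphone 6 16 gb grey apple",
    "samsung galaxy s5 black 32gb",
]
sentences = [t.split() for t in titles]
w2v = Word2Vec(sentences, vector_size=50, window=3, min_count=1, epochs=50)

def embed(title):
    # Average the word vectors of all in-vocabulary tokens in the title.
    vecs = [w2v.wv[w] for w in title.split() if w in w2v.wv]
    return np.mean(vecs, axis=0)

a, b = embed(titles[0]), embed(titles[1])
cos = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
print(f"cosine similarity: {cos:.2f}")  # a high value suggests a product match
```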

33 citations


01 Jan 2018
TL;DR: This paper proposes ways to estimate the cost of knowledge graphs, and advocates for taking cost into account as an evaluation metric, showing the correspondence between cost per triple and semantic validity as an example.
Abstract: Knowledge graphs are used in various applications and have been widely analyzed. A question that is not very well researched is: what is the price of their production? In this paper, we propose ways to estimate the cost of those knowledge graphs. We show that the cost of manually curating a triple is between $2 and $6, and that automatically created knowledge graphs are cheaper by a factor of 15 to 250 (i.e., 1¢ to 15¢ per statement). Furthermore, we advocate for taking cost into account as an evaluation metric, showing the correspondence between cost per triple and semantic validity as an example.
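The estimates boil down to simple per-triple arithmetic: total creation cost divided by the number of statements. The sketch below uses made-up project figures to show the calculation, not the numbers derived in the paper:

```python
# Back-of-the-envelope cost-per-triple calculation in the spirit of the paper.
# The project figures below are placeholders, not figures from the paper.
def cost_per_triple(total_cost_usd, num_triples):
    return total_cost_usd / num_triples

manual = cost_per_triple(total_cost_usd=6_000_000, num_triples=2_000_000)   # e.g., a manually curated KG
auto = cost_per_triple(total_cost_usd=1_000_000, num_triples=50_000_000)    # e.g., an automatically extracted KG

print(f"manual: ${manual:.2f}/triple, automatic: ${auto:.3f}/triple, "
      f"factor: {manual / auto:.0f}x cheaper")
```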

33 citations


Proceedings ArticleDOI
01 Nov 2018
TL;DR: This paper shows how to create one consolidated knowledge graph, called DBkWik, from thousands of Wikis, and shows that the resulting large-scale knowledge graph is complementary to DBpedia.
Abstract: Popular knowledge graphs such as DBpedia and YAGO are built from Wikipedia, and therefore similar in coverage. In contrast, Wikifarms like Fandom contain Wikis for specific topics, which are often complementary to the information contained in Wikipedia, and thus DBpedia and YAGO. Extracting these Wikis with the DBpedia extraction framework is possible, but results in many isolated knowledge graphs. In this paper, we show how to create one consolidated knowledge graph, called DBkWik, from thousands of Wikis. We perform entity resolution and schema matching, and show that the resulting large-scale knowledge graph is complementary to DBpedia.
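A minimal sketch of the instance-matching step is shown below, assuming simple label-based string similarity between entities of two extracted graphs; DBkWik's actual entity resolution and schema matching are considerably more elaborate, and the identifiers and threshold here are illustrative assumptions:

```python
# Illustrative label-based entity resolution between two extracted graphs.
from difflib import SequenceMatcher

kg_a = {"dbkwik:harrypotter/Hogwarts": "Hogwarts School",
        "dbkwik:harrypotter/Rubeus_Hagrid": "Rubeus Hagrid"}
kg_b = {"dbpedia:Hogwarts": "Hogwarts",
        "dbpedia:Rubeus_Hagrid": "Rubeus Hagrid"}

def similarity(a, b):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

matches = []
for ea, la in kg_a.items():
    for eb, lb in kg_b.items():
        score = similarity(la, lb)
        if score > 0.6:  # illustrative threshold for proposing a correspondence
            matches.append((ea, eb, round(score, 2)))
print(matches)
```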

28 citations


01 Jan 2018
TL;DR: DOME (Deep Ontology MatchEr) is a scalable matcher which relies on large texts describing the ontological concepts to train a fixed-length vector representation of the concepts using the doc2vec approach.
Abstract: DOME (Deep Ontology MatchEr) is a scalable matcher which relies on large texts describing the ontological concepts. Using the doc2vec approach, these texts are used to train a fixed-length vector representation of the concepts. Mappings are generated if two concepts are close to each other in the resulting vector space. If no large texts are available, DOME falls back to a string-based matching technique. Due to its high scalability, it can also produce results in the largebio track of the OAEI and can be applied to very large ontologies. The results look promising if large texts are available, but there is still a lot of room for improvement.
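The matching idea can be sketched with gensim's doc2vec implementation: concept descriptions are embedded as fixed-length vectors, and a correspondence is proposed when two concepts are close in that space. The toy descriptions, hyperparameters, and threshold below are assumptions for illustration, not DOME's configuration:

```python
# Sketch of the doc2vec idea behind DOME: train document vectors from concept
# descriptions and propose a mapping when two concepts are close in vector space.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from numpy import dot
from numpy.linalg import norm

descriptions = {
    "onto1#Author":    "a person who has written a book or an article",
    "onto2#Writer":    "person writing books articles or other texts",
    "onto2#Publisher": "organization that publishes books and journals",
}
corpus = [TaggedDocument(text.split(), [concept])
          for concept, text in descriptions.items()]
model = Doc2Vec(corpus, vector_size=50, min_count=1, epochs=100)

def cosine(u, v):
    return float(dot(u, v) / (norm(u) * norm(v)))

sim = cosine(model.dv["onto1#Author"], model.dv["onto2#Writer"])
print(f"Author vs. Writer: {sim:.2f}")  # mapping candidate if above a threshold
```

When no sufficiently long descriptions exist, the comparison would fall back to string similarity between the concept labels, as described in the abstract.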

21 citations


Journal ArticleDOI
TL;DR: A language-agnostic approach is presented that exploits background knowledge from the graph instead of language-specific techniques and builds machine learning models only from language-independent features.
Abstract: Large-scale knowledge graphs, such as DBpedia, Wikidata, or YAGO, can be enhanced by relation extraction from text, using the data in the knowledge graph as training data, i.e., using distant supervision. While most existing approaches use language-specific methods (usually for English), we present a language-agnostic approach that exploits background knowledge from the graph instead of language-specific techniques and builds machine learning models only from language-independent features. We demonstrate the extraction of relations from Wikipedia abstracts, using the twelve largest language editions of Wikipedia. From those, we can extract 1.6 M new relations in DBpedia at a level of precision of 95%, using a RandomForest classifier trained only on language-independent features. We furthermore investigate the similarity of models for different languages and show an exemplary geographical breakdown of the information extracted. In a second series of experiments, we show how the approach can be transferred to DBkWik, a knowledge graph extracted from thousands of Wikis. We discuss the challenges and first results of extracting relations from a larger set of Wikis, using a less formalized knowledge graph.
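A hedged sketch of the language-independent setup: each training example is described only by features taken from the knowledge graph (here, the types of the linked entity), with the relation already present in DBpedia acting as the distant-supervision label. The feature set and toy data below are placeholders, not the paper's exact design:

```python
# Illustrative distantly supervised, language-independent relation extraction:
# (subject, linked entity) pairs from abstracts are described by graph-based
# features only (one-hot encoded entity types), and a RandomForest predicts
# the DBpedia relation.
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

samples = [
    {"types": ["dbo:City", "dbo:Place"],                "relation": "dbo:birthPlace"},
    {"types": ["dbo:Country", "dbo:Place"],             "relation": "dbo:birthPlace"},
    {"types": ["dbo:Band", "dbo:Organisation"],         "relation": "dbo:associatedBand"},
    {"types": ["dbo:University", "dbo:Organisation"],   "relation": "dbo:almaMater"},
]

mlb = MultiLabelBinarizer()
X = mlb.fit_transform([s["types"] for s in samples])
y = [s["relation"] for s in samples]

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(clf.predict(mlb.transform([["dbo:City", "dbo:Place"]])))
```

Because none of the features depend on the surface text, the same model structure can be reused across language editions, which is the key point of the approach.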

18 citations


01 Jan 2018
TL;DR: The Ontology Alignment Evaluation Initiative (OAEI) aims at comparing ontology matching systems on precisely defined test cases, which can be based on ontologies of different levels of complexity and use different evaluation modalities (e.g., blind evaluation, open evaluation, or consensus).
Abstract: The Ontology Alignment Evaluation Initiative (OAEI) aims at comparing ontology matching systems on precisely defined test cases. These test cases can be based on ontologies of different levels of complexity (from simple thesauri to expressive OWL ontologies) and use different evaluation modalities (e.g., blind evaluation, open evaluation, or consensus). The OAEI 2018 campaign offered 12 tracks with 23 test cases, and was attended by 19 participants. This paper is an overall presentation of that campaign.
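At its core, each test case scores a system's alignment against a reference alignment with precision, recall, and F-measure. The minimal sketch below illustrates that computation on toy correspondences; the actual campaign relies on dedicated evaluation tooling and richer alignment formats:

```python
# Toy OAEI-style scoring of a system alignment against a reference alignment.
# Correspondences are modeled as (entity1, entity2, relation) tuples.
reference = {("o1#Author", "o2#Writer", "="), ("o1#Book", "o2#Book", "=")}
system    = {("o1#Author", "o2#Writer", "="), ("o1#Book", "o2#Novel", "=")}

tp = len(system & reference)
precision = tp / len(system)
recall = tp / len(reference)
f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
```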

15 citations


01 Jan 2018
TL;DR: A case is made for semantic embeddings, and possible ideas towards their construction are discussed, motivated by the superior performance that knowledge graph embeddings show in many tasks, including relation prediction, recommender systems, and the enrichment of predictive data mining tasks.
Abstract: The original Semantic Web vision foresees describing entities in a way that their meaning can be interpreted by both machines and humans. Following that idea, large-scale knowledge graphs capturing a significant portion of knowledge have been developed. In the recent past, vector space embeddings of semantic web knowledge graphs – i.e., projections of a knowledge graph into a lower-dimensional, numerical feature space (a.k.a. latent feature space) – have been shown to yield superior performance in many tasks, including relation prediction, recommender systems, or the enrichment of predictive data mining tasks. At the same time, those projections describe an entity as a numerical vector, without any semantics attached to the dimensions. Thus, embeddings are as far from the original Semantic Web vision as can be. As a consequence, the results achieved with embeddings – as impressive as they are in terms of quantitative performance – are most often not interpretable, and it is hard to obtain a justification for a prediction, e.g., an explanation of why an item has been suggested by a recommender system. In this paper, we make a claim for semantic embeddings and discuss possible ideas towards their construction.

15 citations


Proceedings ArticleDOI
09 Apr 2018
TL;DR: This work proposes a data-driven method to detect incorrect mappings automatically by analyzing information from both instance data and ontological axioms; the best model achieves 93% accuracy.
Abstract: DBpedia releases consist of more than 70 multilingual datasets that cover data extracted from different language-specific Wikipedia instances. The data extracted from those Wikipedia instances are transformed into RDF using mappings created by the DBpedia community. Nevertheless, not all the mappings are correct and consistent across all the distinct language-specific DBpedia datasets. As these incorrect mappings are spread across a large number of mappings, it is not feasible to inspect all of them manually to ensure their correctness. Thus, the goal of this work is to propose a data-driven method to detect incorrect mappings automatically by analyzing information from both instance data and ontological axioms. We propose a machine learning-based approach to building a predictive model which can detect incorrect mappings. We have evaluated different supervised classification algorithms for this task, and our best model achieves 93% accuracy. These results help us to detect incorrect mappings and achieve a high-quality DBpedia.
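A minimal sketch of framing mapping validation as supervised classification is given below; the features (e.g., shares of axiom violations observed in the instance data) and the classifier are illustrative assumptions, not the paper's feature set or best-performing model:

```python
# Illustrative binary classification of DBpedia mappings as correct/incorrect.
# Feature columns (placeholders): [share_of_type_violations,
#   share_of_literal_type_mismatches, log_number_of_instances, property_depth]
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X = np.array([
    [0.01, 0.00, 4.2, 2],
    [0.02, 0.05, 3.1, 3],
    [0.40, 0.35, 2.0, 1],
    [0.55, 0.20, 2.5, 2],
    [0.03, 0.01, 5.0, 4],
    [0.60, 0.50, 1.8, 1],
])
y = np.array([1, 1, 0, 0, 1, 0])  # 1 = correct mapping, 0 = incorrect

clf = GradientBoostingClassifier(random_state=0)
print(cross_val_score(clf, X, y, cv=3, scoring="accuracy"))
```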

15 citations


Posted Content
04 Mar 2018
TL;DR: This position paper questions the assumption that simpler rule-based models are more interpretable than complex ones, and reports a crowd-sourcing study that does not reveal a strong preference for simple rules, but a weak preference for longer rules in some domains.
Abstract: It is conventional wisdom in machine learning and data mining that logical models such as rule sets are more interpretable than other models, and that among such rule-based models, simpler models are more interpretable than more complex ones. In this position paper, we question this latter assumption, and recapitulate evidence for and against this postulate. We also report the results of an evaluation in a crowd-sourcing study, which does not reveal a strong preference for simple rules, whereas we can observe a weak preference for longer rules in some domains. We then continue to review criteria for interpretability from the psychological literature, evaluate some of them, and briefly discuss their potential use in machine learning.

BookDOI
01 Jan 2018
TL;DR: This work considers how to select a subgraph of an RDF graph in an ontology learning problem; it proposes selecting the RDF triples that cannot be inferred using a reasoner and presents an algorithm to find them.
Abstract: We consider how to select a subgraph of an RDF graph in an ontology learning problem in order to avoid learning redundant axioms. We propose to address this by selecting the RDF triples that cannot be inferred using a reasoner, and we present an algorithm to find them.
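The selection criterion can be sketched with rdflib and the owlrl reasoner: a triple is kept only if it cannot be re-derived from the remaining graph. The use of RDFS semantics and the toy graph below are illustrative; the paper's algorithm and ontology language may differ:

```python
# Keep only triples that a reasoner cannot re-derive from the rest of the graph.
from rdflib import Graph, Namespace, RDF, RDFS
import owlrl

EX = Namespace("http://example.org/")
g = Graph()
g.add((EX.Dog, RDFS.subClassOf, EX.Animal))
g.add((EX.rex, RDF.type, EX.Dog))
g.add((EX.rex, RDF.type, EX.Animal))  # redundant: inferable from the two above

informative = []
for triple in list(g):
    rest = Graph()
    for t in g:
        if t != triple:
            rest.add(t)
    # Compute the RDFS deductive closure of the graph without this triple.
    owlrl.DeductiveClosure(owlrl.RDFS_Semantics).expand(rest)
    if triple not in rest:
        informative.append(triple)

for t in informative:
    print(t)  # (rex type Animal) is dropped; the other two triples are kept
```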

Book ChapterDOI
22 Sep 2018
TL;DR: This chapter discusses how machine learning is used to improve knowledge graphs, and how knowledge graphs can be exploited as background knowledge in popular machine learning tasks, such as recommender systems.
Abstract: Large-scale cross-domain knowledge graphs, such as DBpedia or Wikidata, are some of the most popular and widely used datasets of the Semantic Web. In this paper, we introduce some of the most popular knowledge graphs on the Semantic Web. We discuss how machine learning is used to improve those knowledge graphs, and how they can be exploited as background knowledge in popular machine learning tasks, such as recommender systems.

Journal ArticleDOI
TL;DR: In this paper, the plausibility of a model is defined as the likelihood that a user accepts it as an explanation for a prediction, and it is argued that, all other things being equal, longer explanations may be more convincing than shorter ones.
Abstract: It is conventional wisdom in machine learning and data mining that logical models such as rule sets are more interpretable than other models, and that among such rule-based models, simpler models are more interpretable than more complex ones. In this position paper, we question this latter assumption by focusing on one particular aspect of interpretability, namely the plausibility of models. Roughly speaking, we equate the plausibility of a model with the likelihood that a user accepts it as an explanation for a prediction. In particular, we argue that, all other things being equal, longer explanations may be more convincing than shorter ones, and that the predominant bias for shorter models, which is typically necessary for learning powerful discriminative models, may not be suitable when it comes to user acceptance of the learned models. To that end, we first recapitulate evidence for and against this postulate, and then report the results of an evaluation in a crowd-sourcing study based on about 3,000 judgments. The results do not reveal a strong preference for simple rules, whereas we can observe a weak preference for longer rules in some domains. We then relate these results to well-known cognitive biases such as the conjunction fallacy, the representativeness heuristic, or the recognition heuristic, and investigate their relation to rule length and plausibility.

01 Jan 2018
TL;DR: Several alternatives and design decisions for providing statement-level provenance information at large scale for the WebIsALOD dataset are described, and the practical impact of that provenance data for computing confidence scores approximating the correctness of each subsumption relation is shown.
Abstract: The WebIsALOD dataset provides a linked data endpoint to the WebIsA database, which harvests millions of subsumption relations from a large scale Web crawl using text patterns. For each of the relations, the dataset also contains rich provenance data, such as the text pattern used, the original sentence in which the pattern was found, and the source on the Web. In this paper, we describe several alternatives and design decisions for providing statement-level provenance information at large scale for the WebIsALOD dataset. Furthermore, we show the practical impact of that provenance information for computing confidence scores approximating the correctness of each subsumption relation.
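One family of designs for statement-level provenance attaches metadata to individual statements. The sketch below uses plain RDF reification with rdflib purely to illustrate that idea and should not be read as the modeling alternative WebIsALOD finally adopted:

```python
# Illustrative statement-level provenance via RDF reification (one of several
# possible designs; property names below are assumptions for the example).
from rdflib import Graph, Namespace, Literal, BNode, RDF, RDFS

EX = Namespace("http://example.org/")
PROV = Namespace("http://www.w3.org/ns/prov#")

g = Graph()
g.bind("ex", EX)

# The subsumption statement itself: "espresso is a coffee".
g.add((EX.espresso, RDFS.subClassOf, EX.coffee))

# Reified statement carrying provenance: pattern, sentence, and source.
stmt = BNode()
g.add((stmt, RDF.type, RDF.Statement))
g.add((stmt, RDF.subject, EX.espresso))
g.add((stmt, RDF.predicate, RDFS.subClassOf))
g.add((stmt, RDF.object, EX.coffee))
g.add((stmt, EX.extractionPattern, Literal("NP such as NP")))
g.add((stmt, EX.sourceSentence, Literal("coffees such as espresso are popular")))
g.add((stmt, PROV.wasDerivedFrom, Literal("http://example.org/some-page")))

print(g.serialize(format="turtle"))
```

Features like the extraction pattern and source can then serve as inputs to a model that estimates a confidence score for each subsumption relation, as described in the abstract.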

Book ChapterDOI
19 Sep 2018
TL;DR: This work extends snapshot ensembles to time series forecasting and shows that choosing reasonable sequence lengths can be used to efficiently escape local minima, and that combining the forecasts of snapshot LSTMs with a stacking approach greatly boosts performance.
Abstract: Ensembles of machine learning models have proven to improve the performance of prediction tasks in various domains. The additional computational costs for the performance increase are usually high, since multiple models must be trained. Recently, snapshot ensembles (Huang et al., "Snapshot Ensembles: Train 1, Get M for Free", 2017 [16]) have provided a comparably computationally cheap way of ensemble learning for artificial neural networks (ANNs). We extend snapshot ensembles to the application of time series forecasting, which comprises two essential steps. First, we show that determining reasonable selections for sequence lengths can be used to efficiently escape local minima. Second, combining the forecasts of snapshot LSTMs with a stacking approach greatly boosts the performance compared to taking the mean of the forecasts, as in the original snapshot ensemble approach. We demonstrate the effectiveness of the algorithm on five real-world datasets and show that the forecasting performance of our approach is superior to conservative ensemble architectures as well as a single, highly optimized LSTM.
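The two ingredients, collecting snapshots during a single training run and stacking their forecasts instead of averaging them, can be sketched with Keras and scikit-learn. Note that the paper's variant escapes local minima via sequence-length selection, whereas this sketch uses the cyclical learning-rate schedule of the original snapshot ensemble; the synthetic data, hyperparameters, and Ridge meta-learner are all illustrative assumptions:

```python
# Hedged sketch: snapshot LSTM ensemble for one-step-ahead forecasting,
# with the snapshots' forecasts combined by a stacking meta-learner.
import math
import numpy as np
import tensorflow as tf
from sklearn.linear_model import Ridge

# Synthetic univariate series -> supervised windows of length 24.
rng = np.random.default_rng(0)
series = np.sin(np.arange(3000) / 20.0) + 0.1 * rng.standard_normal(3000)
window = 24
X = np.array([series[i:i + window] for i in range(len(series) - window)])
y = series[window:]
X = X[..., None]                                  # shape: (samples, timesteps, 1)
split = int(0.8 * len(X))
X_tr, X_val, y_tr, y_val = X[:split], X[split:], y[:split], y[split:]

model = tf.keras.Sequential([
    tf.keras.Input(shape=(window, 1)),
    tf.keras.layers.LSTM(32),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer=tf.keras.optimizers.Adam(), loss="mse")

n_cycles, epochs_per_cycle, lr_max = 4, 5, 1e-2
snapshots = []

def cosine_lr(epoch_in_cycle):
    # Cosine annealing from lr_max down towards 0 within each cycle.
    return 0.5 * lr_max * (1 + math.cos(math.pi * epoch_in_cycle / epochs_per_cycle))

for cycle in range(n_cycles):
    scheduler = tf.keras.callbacks.LearningRateScheduler(
        lambda epoch, lr: cosine_lr(epoch % epochs_per_cycle))
    model.fit(X_tr, y_tr, epochs=epochs_per_cycle, batch_size=64,
              callbacks=[scheduler], verbose=0)
    snapshots.append(model.get_weights())         # snapshot at the end of the cycle

# Stack the snapshots' forecasts with a Ridge meta-learner.
# (In practice the meta-learner would be fit on a separate split, not the
# same data used for its evaluation.)
preds = []
for weights in snapshots:
    model.set_weights(weights)
    preds.append(model.predict(X_val, verbose=0).ravel())
P = np.column_stack(preds)

meta = Ridge(alpha=1.0).fit(P, y_val)
print("mean-ensemble MSE:", float(np.mean((P.mean(axis=1) - y_val) ** 2)))
print("stacked MSE      :", float(np.mean((meta.predict(P) - y_val) ** 2)))
```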