
Showing papers by Heiko Paulheim published in 2018


Proceedings ArticleDOI
28 Aug 2018
TL;DR: A weakly supervised approach that automatically collects a large-scale but very noisy training dataset of hundreds of thousands of tweets, labeled only by the trustworthiness of their source, and shows that despite these inaccurate labels, fake news can be detected with an F1 score of up to 0.9.
Abstract: The problem of automatic detection of fake news in social media, e.g., on Twitter, has recently drawn some attention. Although, from a technical perspective, it can be regarded as a straightforward binary classification problem, the major challenge is the collection of large enough training corpora, since manual annotation of tweets as fake or non-fake news is an expensive and tedious endeavor. In this paper, we discuss a weakly supervised approach which automatically collects a large-scale, but very noisy training dataset comprising hundreds of thousands of tweets. During collection, we automatically label tweets by their source, i.e., trustworthy or untrustworthy source, and train a classifier on this dataset. We then use that classifier for a different classification target, i.e., the classification of fake and non-fake tweets. Although the labels are not accurate according to the new classification target (not all tweets by an untrustworthy source need to be fake news, and vice versa), we show that, despite these noisy and inaccurate labels, it is possible to detect fake news with an F1 score of up to 0.9.
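The core of the approach, training on source-based weak labels and then predicting a different target, can be illustrated with a few lines of scikit-learn. The tweet lists, features, and classifier below are illustrative assumptions, not the pipeline evaluated in the paper:

```python
# Minimal sketch of source-based weak supervision for fake news detection.
# The toy tweets, TF-IDF features, and logistic regression are placeholders;
# the paper's actual feature set and models may differ.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

trusted = [
    "city council publishes official minutes of the budget meeting",
    "study in peer reviewed journal reports new findings on air quality",
]
untrusted = [
    "SHOCKING truth they do not want you to know about vaccines",
    "you won't believe what this politician is secretly hiding",
]

# Weak labels: 0 = from a trustworthy source, 1 = from an untrustworthy source.
X = trusted + untrusted
y = [0] * len(trusted) + [1] * len(untrusted)

model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2), min_df=1),
                      LogisticRegression(max_iter=1000))
model.fit(X, y)

# At prediction time the classifier is applied to a *different* target:
# judging individual tweets as fake vs. non-fake news.
print(model.predict(["breaking: secret cure suppressed by the government"]))
```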

106 citations


Journal ArticleDOI
TL;DR: It is argued that in order to gain a better understanding of forest health in the authors' complex world, it would be conducive to implement the concepts of data science, with components such as digitalization and standardization with metadata management following the FAIR principles.
Abstract: Forest ecosystems fulfill a whole host of ecosystem functions that are essential for life on our planet. However, an unprecedented level of anthropogenic influences is reducing the resilience and stability of our forest ecosystems as well as their ecosystem functions. The relationships between drivers, stress, and ecosystem functions in forest ecosystems are complex, multi-faceted, and often non-linear, and yet forest managers, decision makers, and politicians need to be able to make rapid decisions that are data-driven and based on short- and long-term monitoring information, complex modeling, and analysis approaches. A huge number of long-standing and standardized forest health inventory approaches already exist, and are increasingly integrating remote-sensing based monitoring approaches. Unfortunately, these approaches in monitoring, data storage, analysis, prognosis, and assessment still do not satisfy the future requirements of information and digital knowledge processing of the 21st century. Therefore, this paper discusses and presents in detail five sets of requirements, including their relevance, necessity, and the possible solutions that would be necessary for establishing a feasible multi-source forest health monitoring network for the 21st century. Namely, these requirements are: (1) understanding the effects of multiple stressors on forest health; (2) using remote sensing (RS) approaches to monitor forest health; (3) coupling different monitoring approaches; (4) using data science as a bridge between complex and multidimensional big forest health (FH) data; and (5) a future multi-source forest health monitoring network. It became apparent that no existing monitoring approach, technique, model, or platform is sufficient on its own to monitor, model, forecast, or assess forest health and its resilience. In order to advance the development of a multi-source forest health monitoring network, we argue that, to gain a better understanding of forest health in our complex world, it would be conducive to implement the concepts of data science with the following components: (i) digitalization; (ii) standardization with metadata management following the FAIR (Findability, Accessibility, Interoperability, and Reusability) principles; (iii) Semantic Web; (iv) proof, trust, and uncertainties; (v) tools for data science analysis; and (vi) easy tools for scientists, data managers, and stakeholders for decision-making support.

68 citations


Journal ArticleDOI
TL;DR: This paper uses neural language models to produce word embeddings from large quantities of publicly available product data marked up with Microdata; these embeddings boost the performance of the feature extraction model, leading to better product matching and categorization performance.
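As a rough illustration of the idea, word embeddings can be trained on product titles harvested from Microdata annotations and then used to compare offers. The corpus, hyperparameters, and similarity-based comparison below are illustrative stand-ins, not the paper's actual feature extraction model:

```python
# Hedged sketch: train word embeddings on product titles (e.g., schema.org
# Microdata "name" fields) and compare two offers via averaged word vectors.
import numpy as np
from gensim.models import Word2Vec

titles = [
    "apple iphone 6 16gb space gray smartphone",
    "iphone 6 16 gb grey apple",
    "samsung galaxy s5 black 32gb",
]
sentences = [t.split() for t in titles]
w2v = Word2Vec(sentences, vector_size=50, window=3, min_count=1, epochs=50)

def embed(title):
    # Average the word vectors of all in-vocabulary tokens in the title.
    vecs = [w2v.wv[w] for w in title.split() if w in w2v.wv]
    return np.mean(vecs, axis=0)

a, b = embed(titles[0]), embed(titles[1])
cos = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
print(f"cosine similarity: {cos:.2f}")  # a high value suggests a product match
```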

33 citations


01 Jan 2018
TL;DR: This paper proposes ways to estimate the cost of knowledge graphs, and advocates for taking cost into account as an evaluation metric, showing the correspondence between cost per triple and semantic validity as an example.
Abstract: Knowledge graphs are used in various applications and have been widely analyzed. A question that is not very well researched is: what is the price of their production? In this paper, we propose ways to estimate the cost of those knowledge graphs. We show that the cost of manually curating a triple is between $2 and $6, and that automatically created knowledge graphs are cheaper by a factor of 15 to 250 (i.e., 1¢ to 15¢ per statement). Furthermore, we advocate for taking cost into account as an evaluation metric, showing the correspondence between cost per triple and semantic validity as an example.
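The estimates boil down to simple per-triple arithmetic: total creation cost divided by the number of statements. The sketch below uses made-up project figures to show the calculation, not the numbers derived in the paper:

```python
# Back-of-the-envelope cost-per-triple calculation in the spirit of the paper.
# The project figures below are placeholders, not figures from the paper.
def cost_per_triple(total_cost_usd, num_triples):
    return total_cost_usd / num_triples

manual = cost_per_triple(total_cost_usd=6_000_000, num_triples=2_000_000)   # e.g., a manually curated KG
auto = cost_per_triple(total_cost_usd=1_000_000, num_triples=50_000_000)    # e.g., an automatically extracted KG

print(f"manual: ${manual:.2f}/triple, automatic: ${auto:.3f}/triple, "
      f"factor: {manual / auto:.0f}x cheaper")
```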

33 citations


Proceedings ArticleDOI
01 Nov 2018
TL;DR: This paper shows how to create one consolidated knowledge graph, called DBkWik, from thousands of Wikis, and shows that the resulting large-scale knowledge graph is complementary to DBpedia.
Abstract: Popular knowledge graphs such as DBpedia and YAGO are built from Wikipedia, and therefore similar in coverage. In contrast, Wikifarms like Fandom contain Wikis for specific topics, which are often complementary to the information contained in Wikipedia, and thus DBpedia and YAGO. Extracting these Wikis with the DBpedia extraction framework is possible, but results in many isolated knowledge graphs. In this paper, we show how to create one consolidated knowledge graph, called DBkWik, from thousands of Wikis. We perform entity resolution and schema matching, and show that the resulting large-scale knowledge graph is complementary to DBpedia.
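A minimal sketch of the instance-matching step is shown below, assuming simple label-based string similarity between entities of two extracted graphs; DBkWik's actual entity resolution and schema matching are considerably more elaborate, and the identifiers and threshold here are illustrative assumptions:

```python
# Illustrative label-based entity resolution between two extracted graphs.
from difflib import SequenceMatcher

kg_a = {"dbkwik:harrypotter/Hogwarts": "Hogwarts School",
        "dbkwik:harrypotter/Rubeus_Hagrid": "Rubeus Hagrid"}
kg_b = {"dbpedia:Hogwarts": "Hogwarts",
        "dbpedia:Rubeus_Hagrid": "Rubeus Hagrid"}

def similarity(a, b):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

matches = []
for ea, la in kg_a.items():
    for eb, lb in kg_b.items():
        score = similarity(la, lb)
        if score > 0.6:  # illustrative threshold for proposing a correspondence
            matches.append((ea, eb, round(score, 2)))
print(matches)
```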

28 citations


01 Jan 2018
TL;DR: DOME (Deep Ontology MatchEr) is a scalable matcher which relies on large texts describing the ontological concepts to train a fixed-length vector representation of the concepts using the doc2vec approach.
Abstract: DOME (Deep Ontology MatchEr) is a scalable matcher which relies on large texts describing the ontological concepts. Using the doc2vec approach, these texts are used to train a fixed-length vector representation of the concepts. Mappings are generated if two concepts are close to each other in the resulting vector space. If no large texts are available, DOME falls back to a string-based matching technique. Due to its high scalability, it can also produce results in the largebio track of the OAEI and can be applied to very large ontologies. The results look promising if large texts are available, but there is still a lot of room for improvement.
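The matching idea can be sketched with gensim's doc2vec implementation: concept descriptions are embedded as fixed-length vectors, and a correspondence is proposed when two concepts are close in that space. The toy descriptions, hyperparameters, and threshold below are assumptions for illustration, not DOME's configuration:

```python
# Sketch of the doc2vec idea behind DOME: train document vectors from concept
# descriptions and propose a mapping when two concepts are close in vector space.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from numpy import dot
from numpy.linalg import norm

descriptions = {
    "onto1#Author":    "a person who has written a book or an article",
    "onto2#Writer":    "person writing books articles or other texts",
    "onto2#Publisher": "organization that publishes books and journals",
}
corpus = [TaggedDocument(text.split(), [concept])
          for concept, text in descriptions.items()]
model = Doc2Vec(corpus, vector_size=50, min_count=1, epochs=100)

def cosine(u, v):
    return float(dot(u, v) / (norm(u) * norm(v)))

sim = cosine(model.dv["onto1#Author"], model.dv["onto2#Writer"])
print(f"Author vs. Writer: {sim:.2f}")  # mapping candidate if above a threshold
```

When no sufficiently long descriptions exist, the comparison would fall back to string similarity between the concept labels, as described in the abstract.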

21 citations


Journal ArticleDOI
TL;DR: A language-agnostic approach is presented that exploits background knowledge from the graph instead of language-specific techniques and builds machine learning models only from language-independent features.
Abstract: Large-scale knowledge graphs, such as DBpedia, Wikidata, or YAGO, can be enhanced by relation extraction from text, using the data in the knowledge graph as training data, i.e., using distant supervision. While most existing approaches use language-specific methods (usually for English), we present a language-agnostic approach that exploits background knowledge from the graph instead of language-specific techniques and builds machine learning models only from language-independent features. We demonstrate the extraction of relations from Wikipedia abstracts, using the twelve largest language editions of Wikipedia. From those, we can extract 1.6 M new relations in DBpedia at a level of precision of 95%, using a RandomForest classifier trained only on language-independent features. We furthermore investigate the similarity of models for different languages and show an exemplary geographical breakdown of the information extracted. In a second series of experiments, we show how the approach can be transferred to DBkWik, a knowledge graph extracted from thousands of Wikis. We discuss the challenges and first results of extracting relations from a larger set of Wikis, using a less formalized knowledge graph.
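A hedged sketch of the language-independent setup: each training example is described only by features taken from the knowledge graph (here, the types of the linked entity), with the relation already present in DBpedia acting as the distant-supervision label. The feature set and toy data below are placeholders, not the paper's exact design:

```python
# Illustrative distantly supervised, language-independent relation extraction:
# (subject, linked entity) pairs from abstracts are described by graph-based
# features only (one-hot encoded entity types), and a RandomForest predicts
# the DBpedia relation.
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

samples = [
    {"types": ["dbo:City", "dbo:Place"],                "relation": "dbo:birthPlace"},
    {"types": ["dbo:Country", "dbo:Place"],             "relation": "dbo:birthPlace"},
    {"types": ["dbo:Band", "dbo:Organisation"],         "relation": "dbo:associatedBand"},
    {"types": ["dbo:University", "dbo:Organisation"],   "relation": "dbo:almaMater"},
]

mlb = MultiLabelBinarizer()
X = mlb.fit_transform([s["types"] for s in samples])
y = [s["relation"] for s in samples]

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(clf.predict(mlb.transform([["dbo:City", "dbo:Place"]])))
```

Because none of the features depend on the surface text, the same model structure can be reused across language editions, which is the key point of the approach.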

18 citations


01 Jan 2018
TL;DR: The Ontology Alignment Evaluation Initiative (OAEI) aims at comparing ontology matching systems on precisely defined test cases, which can be based on ontologies of different levels of complexity and use different evaluation modalities (e.g., blind evaluation, open evaluation, or consensus).
Abstract: The Ontology Alignment Evaluation Initiative (OAEI) aims at comparing ontology matching systems on precisely defined test cases. These test cases can be based on ontologies of different levels of complexity (from simple thesauri to expressive OWL ontologies) and use different evaluation modalities (e.g., blind evaluation, open evaluation, or consensus). The OAEI 2018 campaign offered 12 tracks with 23 test cases, and was attended by 19 participants. This paper is an overall presentation of that campaign.
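At its core, each test case scores a system's alignment against a reference alignment with precision, recall, and F-measure. The minimal sketch below illustrates that computation on toy correspondences; the actual campaign relies on dedicated evaluation tooling and richer alignment formats:

```python
# Toy OAEI-style scoring of a system alignment against a reference alignment.
# Correspondences are modeled as (entity1, entity2, relation) tuples.
reference = {("o1#Author", "o2#Writer", "="), ("o1#Book", "o2#Book", "=")}
system    = {("o1#Author", "o2#Writer", "="), ("o1#Book", "o2#Novel", "=")}

tp = len(system & reference)
precision = tp / len(system)
recall = tp / len(reference)
f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
```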

15 citations


01 Jan 2018
TL;DR: A case is made for semantic embeddings, and possible ideas towards their construction are discussed, motivated by the superior performance that knowledge graph embeddings show in many tasks, including relation prediction, recommender systems, and the enrichment of predictive data mining tasks.
Abstract: The original Semantic Web vision foresees describing entities in a way that their meaning can be interpreted by both machines and humans. Following that idea, large-scale knowledge graphs capturing a significant portion of knowledge have been developed. In the recent past, vector space embeddings of semantic web knowledge graphs – i.e., projections of a knowledge graph into a lower-dimensional, numerical feature space (a.k.a. latent feature space) – have been shown to yield superior performance in many tasks, including relation prediction, recommender systems, or the enrichment of predictive data mining tasks. At the same time, those projections describe an entity as a numerical vector, without any semantics attached to the dimensions. Thus, embeddings are as far from the original Semantic Web vision as can be. As a consequence, the results achieved with embeddings – as impressive as they are in terms of quantitative performance – are most often not interpretable, and it is hard to obtain a justification for a prediction, e.g., an explanation of why an item has been suggested by a recommender system. In this paper, we make a claim for semantic embeddings and discuss possible ideas towards their construction.

15 citations


Proceedings ArticleDOI
09 Apr 2018
TL;DR: This work proposes a data-driven method to detect incorrect mappings automatically by analyzing information from both instance data and ontological axioms; the best model achieves 93% accuracy.
Abstract: DBpedia releases consist of more than 70 multilingual datasets that cover data extracted from different language-specific Wikipedia instances. The data extracted from those Wikipedia instances are transformed into RDF using mappings created by the DBpedia community. Nevertheless, not all the mappings are correct and consistent across all the distinct language-specific DBpedia datasets. As these incorrect mappings are spread across a large number of mappings, it is not feasible to inspect all of them manually to ensure their correctness. Thus, the goal of this work is to propose a data-driven method to detect incorrect mappings automatically by analyzing information from both instance data and ontological axioms. We propose a machine learning-based approach to building a predictive model which can detect incorrect mappings. We have evaluated different supervised classification algorithms for this task, and our best model achieves 93% accuracy. These results help us to detect incorrect mappings and achieve a high-quality DBpedia.
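A minimal sketch of framing mapping validation as supervised classification is given below; the features (e.g., shares of axiom violations observed in the instance data) and the classifier are illustrative assumptions, not the paper's feature set or best-performing model:

```python
# Illustrative binary classification of DBpedia mappings as correct/incorrect.
# Feature columns (placeholders): [share_of_type_violations,
#   share_of_literal_type_mismatches, log_number_of_instances, property_depth]
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X = np.array([
    [0.01, 0.00, 4.2, 2],
    [0.02, 0.05, 3.1, 3],
    [0.40, 0.35, 2.0, 1],
    [0.55, 0.20, 2.5, 2],
    [0.03, 0.01, 5.0, 4],
    [0.60, 0.50, 1.8, 1],
])
y = np.array([1, 1, 0, 0, 1, 0])  # 1 = correct mapping, 0 = incorrect

clf = GradientBoostingClassifier(random_state=0)
print(cross_val_score(clf, X, y, cv=3, scoring="accuracy"))
```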

15 citations


Posted Content
04 Mar 2018
TL;DR: This position paper questions the assumption that simpler rule-based models are more interpretable than complex ones, and reports a crowd-sourcing study that does not reveal a strong preference for simple rules, but a weak preference for longer rules in some domains.
Abstract: It is conventional wisdom in machine learning and data mining that logical models such as rule sets are more interpretable than other models, and that among such rule-based models, simpler models are more interpretable than more complex ones. In this position paper, we question this latter assumption, and recapitulate evidence for and against this postulate. We also report the results of an evaluation in a crowd-sourcing study, which does not reveal a strong preference for simple rules, whereas we can observe a weak preference for longer rules in some domains. We then continue to review criteria for interpretability from the psychological literature, evaluate some of them, and briefly discuss their potential use in machine learning.

BookDOI
01 Jan 2018
TL;DR: This work considers how to select a subgraph of an RDF graph in an ontology learning problem; it proposes selecting the RDF triples that cannot be inferred using a reasoner and presents an algorithm to find them.
Abstract: We consider how to select a subgraph of an RDF graph in an ontology learning problem in order to avoid learning redundant axioms. We propose to address this by selecting the RDF triples that cannot be inferred using a reasoner, and we present an algorithm to find them.
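The selection criterion can be sketched with rdflib and the owlrl reasoner: a triple is kept only if it cannot be re-derived from the remaining graph. The use of RDFS semantics and the toy graph below are illustrative; the paper's algorithm and ontology language may differ:

```python
# Keep only triples that a reasoner cannot re-derive from the rest of the graph.
from rdflib import Graph, Namespace, RDF, RDFS
import owlrl

EX = Namespace("http://example.org/")
g = Graph()
g.add((EX.Dog, RDFS.subClassOf, EX.Animal))
g.add((EX.rex, RDF.type, EX.Dog))
g.add((EX.rex, RDF.type, EX.Animal))  # redundant: inferable from the two above

informative = []
for triple in list(g):
    rest = Graph()
    for t in g:
        if t != triple:
            rest.add(t)
    # Compute the RDFS deductive closure of the graph without this triple.
    owlrl.DeductiveClosure(owlrl.RDFS_Semantics).expand(rest)
    if triple not in rest:
        informative.append(triple)

for t in informative:
    print(t)  # (rex type Animal) is dropped; the other two triples are kept
```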

Book ChapterDOI
22 Sep 2018
TL;DR: This chapter discusses how machine learning is used to improve knowledge graphs, and how knowledge graphs can be exploited as background knowledge in popular machine learning tasks, such as recommender systems.
Abstract: Large-scale cross-domain knowledge graphs, such as DBpedia or Wikidata, are some of the most popular and widely used datasets of the Semantic Web. In this paper, we introduce some of the most popular knowledge graphs on the Semantic Web. We discuss how machine learning is used to improve those knowledge graphs, and how they can be exploited as background knowledge in popular machine learning tasks, such as recommender systems.

Journal ArticleDOI
TL;DR: In this paper, the plausibility of a model is defined as the likelihood that a user accepts it as an explanation for a prediction, and it is argued that, all other things being equal, longer explanations may be more convincing than shorter ones.
Abstract: It is conventional wisdom in machine learning and data mining that logical models such as rule sets are more interpretable than other models, and that among such rule-based models, simpler models are more interpretable than more complex ones. In this position paper, we question this latter assumption by focusing on one particular aspect of interpretability, namely the plausibility of models. Roughly speaking, we equate the plausibility of a model with the likelihood that a user accepts it as an explanation for a prediction. In particular, we argue that, all other things being equal, longer explanations may be more convincing than shorter ones, and that the predominant bias for shorter models, which is typically necessary for learning powerful discriminative models, may not be suitable when it comes to user acceptance of the learned models. To that end, we first recapitulate evidence for and against this postulate, and then report the results of an evaluation in a crowd-sourcing study based on about 3,000 judgments. The results do not reveal a strong preference for simple rules, whereas we can observe a weak preference for longer rules in some domains. We then relate these results to well-known cognitive biases such as the conjunction fallacy, the representativeness heuristic, or the recognition heuristic, and investigate their relation to rule length and plausibility.

01 Jan 2018
TL;DR: Several alternatives and design decisions for providing statement-level provenance information at large scale for the WebIsALOD dataset are described, and the practical impact of that provenance data for computing confidence scores approximating the correctness of each subsumption relation is shown.
Abstract: The WebIsALOD dataset provides a linked data endpoint to the WebIsA database, which harvests millions of subsumption relations from a large scale Web crawl using text patterns. For each of the relations, the dataset also contains rich provenance data, such as the text pattern used, the original sentence in which the pattern was found, and the source on the Web. In this paper, we describe several alternatives and design decisions for providing statement-level provenance information at large scale for the WebIsALOD dataset. Furthermore, we show the practical impact of that provenance information for computing confidence scores approximating the correctness of each subsumption relation.
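One family of designs for statement-level provenance attaches metadata to individual statements. The sketch below uses plain RDF reification with rdflib purely to illustrate that idea and should not be read as the modeling alternative WebIsALOD finally adopted:

```python
# Illustrative statement-level provenance via RDF reification (one of several
# possible designs; property names below are assumptions for the example).
from rdflib import Graph, Namespace, Literal, BNode, RDF, RDFS

EX = Namespace("http://example.org/")
PROV = Namespace("http://www.w3.org/ns/prov#")

g = Graph()
g.bind("ex", EX)

# The subsumption statement itself: "espresso is a coffee".
g.add((EX.espresso, RDFS.subClassOf, EX.coffee))

# Reified statement carrying provenance: pattern, sentence, and source.
stmt = BNode()
g.add((stmt, RDF.type, RDF.Statement))
g.add((stmt, RDF.subject, EX.espresso))
g.add((stmt, RDF.predicate, RDFS.subClassOf))
g.add((stmt, RDF.object, EX.coffee))
g.add((stmt, EX.extractionPattern, Literal("NP such as NP")))
g.add((stmt, EX.sourceSentence, Literal("coffees such as espresso are popular")))
g.add((stmt, PROV.wasDerivedFrom, Literal("http://example.org/some-page")))

print(g.serialize(format="turtle"))
```

Features like the extraction pattern and source can then serve as inputs to a model that estimates a confidence score for each subsumption relation, as described in the abstract.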

Book ChapterDOI
19 Sep 2018
TL;DR: This work extends snapshot ensembles to time series forecasting and shows that choosing reasonable sequence lengths can be used to efficiently escape local minima, and that combining the forecasts of snapshot LSTMs with a stacking approach greatly boosts performance.
Abstract: Ensembles of machine learning models have proven to improve the performance of prediction tasks in various domains. The additional computational costs for the performance increase are usually high, since multiple models must be trained. Recently, snapshot ensembles (Huang et al., "Snapshot Ensembles: Train 1, Get M for Free", 2017 [16]) have provided a comparably computationally cheap way of ensemble learning for artificial neural networks (ANNs). We extend snapshot ensembles to the application of time series forecasting, which comprises two essential steps. First, we show that determining reasonable selections for sequence lengths can be used to efficiently escape local minima. Second, combining the forecasts of snapshot LSTMs with a stacking approach greatly boosts the performance compared to taking the mean of the forecasts, as in the original snapshot ensemble approach. We demonstrate the effectiveness of the algorithm on five real-world datasets and show that the forecasting performance of our approach is superior to conservative ensemble architectures as well as a single, highly optimized LSTM.
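The two ingredients, collecting snapshots during a single training run and stacking their forecasts instead of averaging them, can be sketched with Keras and scikit-learn. Note that the paper's variant escapes local minima via sequence-length selection, whereas this sketch uses the cyclical learning-rate schedule of the original snapshot ensemble; the synthetic data, hyperparameters, and Ridge meta-learner are all illustrative assumptions:

```python
# Hedged sketch: snapshot LSTM ensemble for one-step-ahead forecasting,
# with the snapshots' forecasts combined by a stacking meta-learner.
import math
import numpy as np
import tensorflow as tf
from sklearn.linear_model import Ridge

# Synthetic univariate series -> supervised windows of length 24.
rng = np.random.default_rng(0)
series = np.sin(np.arange(3000) / 20.0) + 0.1 * rng.standard_normal(3000)
window = 24
X = np.array([series[i:i + window] for i in range(len(series) - window)])
y = series[window:]
X = X[..., None]                                  # shape: (samples, timesteps, 1)
split = int(0.8 * len(X))
X_tr, X_val, y_tr, y_val = X[:split], X[split:], y[:split], y[split:]

model = tf.keras.Sequential([
    tf.keras.Input(shape=(window, 1)),
    tf.keras.layers.LSTM(32),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer=tf.keras.optimizers.Adam(), loss="mse")

n_cycles, epochs_per_cycle, lr_max = 4, 5, 1e-2
snapshots = []

def cosine_lr(epoch_in_cycle):
    # Cosine annealing from lr_max down towards 0 within each cycle.
    return 0.5 * lr_max * (1 + math.cos(math.pi * epoch_in_cycle / epochs_per_cycle))

for cycle in range(n_cycles):
    scheduler = tf.keras.callbacks.LearningRateScheduler(
        lambda epoch, lr: cosine_lr(epoch % epochs_per_cycle))
    model.fit(X_tr, y_tr, epochs=epochs_per_cycle, batch_size=64,
              callbacks=[scheduler], verbose=0)
    snapshots.append(model.get_weights())         # snapshot at the end of the cycle

# Stack the snapshots' forecasts with a Ridge meta-learner.
# (In practice the meta-learner would be fit on a separate split, not the
# same data used for its evaluation.)
preds = []
for weights in snapshots:
    model.set_weights(weights)
    preds.append(model.predict(X_val, verbose=0).ravel())
P = np.column_stack(preds)

meta = Ridge(alpha=1.0).fit(P, y_val)
print("mean-ensemble MSE:", float(np.mean((P.mean(axis=1) - y_val) ** 2)))
print("stacked MSE      :", float(np.mean((meta.predict(P) - y_val) ** 2)))
```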