scispace - formally typeset
Author

Julián Moreno-Schneider

Bio: Julián Moreno-Schneider is an academic researcher from the German Research Centre for Artificial Intelligence. The author has contributed to research in the topics of Workflow and Digital content. The author has an h-index of 9 and has co-authored 23 publications receiving 208 citations.

Papers
Book Chapter
09 Sep 2019
TL;DR: The work presented in this paper was carried out under the umbrella of the European project LYNX that develops a semantic platform that enables the development of various document processing and analysis applications for the legal domain.
Abstract: This paper describes an approach to Named Entity Recognition (NER) in German-language documents from the legal domain. For this purpose, a dataset consisting of German court decisions was developed. The source texts were manually annotated with 19 semantic classes: person, judge, lawyer, country, city, street, landscape, organization, company, institution, court, brand, law, ordinance, European legal norm, regulation, contract, court decision, and legal literature. The dataset consists of approx. 67,000 sentences and contains 54,000 annotated entities. The 19 fine-grained classes were automatically generalised to seven coarser classes (person, location, organization, legal norm, case-by-case regulation, court decision, and legal literature). Thus, the dataset includes two annotation variants, i.e., coarse- and fine-grained. For the task of NER, Conditional Random Fields (CRFs) and bidirectional Long Short-Term Memory networks (BiLSTMs) were applied to the dataset as state-of-the-art models. Three different models were developed for each of these two model families and tested with the coarse- and fine-grained annotations. The BiLSTM models achieve the best performance, with a 95.46 F1 score for the fine-grained classes and 95.95 for the coarse-grained ones. The CRF models reach a maximum of 93.23 for the fine-grained classes and 93.22 for the coarse-grained ones. The work presented in this paper was carried out under the umbrella of the European project LYNX, which develops a semantic platform that enables the development of various document processing and analysis applications for the legal domain.
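As an informal illustration of the CRF side of this setup, the sketch below builds a minimal feature-based sequence tagger with the third-party sklearn-crfsuite package. The two toy sentences, the hand-rolled features and the coarse BIO labels are invented for demonstration; they are not the paper's court-decision dataset or its feature set.

```python
# Minimal CRF sequence-tagging sketch (illustrative only, not the paper's model).
import sklearn_crfsuite

def token_features(sentence, i):
    """Simple per-token features; the paper's feature set is richer."""
    word = sentence[i]
    feats = {
        "lower": word.lower(),
        "is_upper": word.isupper(),
        "is_title": word.istitle(),
        "is_digit": word.isdigit(),
        "suffix3": word[-3:],
    }
    if i > 0:
        feats["prev_lower"] = sentence[i - 1].lower()
    else:
        feats["BOS"] = True
    if i < len(sentence) - 1:
        feats["next_lower"] = sentence[i + 1].lower()
    else:
        feats["EOS"] = True
    return feats

# Invented toy sentences with coarse-grained BIO labels.
train_sents = [
    (["Das", "Bundesverfassungsgericht", "entschied", "."],
     ["O", "B-ORG", "O", "O"]),
    (["Richter", "Müller", "verwies", "auf", "§", "242", "BGB", "."],
     ["O", "B-PER", "O", "O", "B-NORM", "I-NORM", "I-NORM", "O"]),
]

X_train = [[token_features(toks, i) for i in range(len(toks))] for toks, _ in train_sents]
y_train = [labels for _, labels in train_sents]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=100)
crf.fit(X_train, y_train)

test_tokens = ["Das", "Landgericht", "Berlin", "zitierte", "§", "242", "BGB", "."]
pred = crf.predict([[token_features(test_tokens, i) for i in range(len(test_tokens))]])
print(list(zip(test_tokens, pred[0])))
```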

68 citations

Posted Content
TL;DR: Building upon BERT, a deep neural language model, it is demonstrated how to combine text representations with metadata and knowledge graph embeddings, which encode author information.
Abstract: In this paper, we focus on the classification of books using short descriptive texts (cover blurbs) and additional metadata. Building upon BERT, a deep neural language model, we demonstrate how to combine text representations with metadata and knowledge graph embeddings, which encode author information. Compared to the standard BERT approach we achieve considerably better results for the classification task. For a more coarse-grained classification using eight labels we achieve an F1 score of 87.20, while a detailed classification using 343 labels yields an F1 score of 64.70. We make the source code and trained models of our experiments publicly available.
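A minimal sketch of the fusion idea, assuming PyTorch and Hugging Face Transformers: the [CLS] vector of a BERT encoder is concatenated with an extra feature vector (standing in for metadata or author knowledge graph embeddings) before a linear classifier. The model name, vector dimensions and single-layer head are illustrative assumptions, not the authors' released architecture.

```python
# Illustrative fusion of a BERT text representation with extra features.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class BlurbClassifier(nn.Module):
    def __init__(self, model_name="bert-base-german-cased",
                 extra_dim=128, num_labels=8):
        super().__init__()
        self.bert = AutoModel.from_pretrained(model_name)
        hidden = self.bert.config.hidden_size
        # Concatenate the [CLS] vector with the metadata / KG embedding.
        self.classifier = nn.Linear(hidden + extra_dim, num_labels)

    def forward(self, input_ids, attention_mask, extra_features):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]            # [batch, hidden]
        fused = torch.cat([cls, extra_features], dim=-1)
        return self.classifier(fused)

tokenizer = AutoTokenizer.from_pretrained("bert-base-german-cased")
model = BlurbClassifier()
batch = tokenizer(["Ein historischer Roman über eine Seefahrerfamilie."],
                  padding=True, return_tensors="pt")
kg_vec = torch.zeros(1, 128)                         # placeholder author embedding
logits = model(batch["input_ids"], batch["attention_mask"], kg_vec)
```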

47 citations

Book Chapter
13 Sep 2017
TL;DR: This contribution evaluates a set of classification algorithms on two types of user-generated online content (tweets and Wikipedia Talk comments) in two languages (English and German) and focuses on classifying the data according to the annotated characteristics using several text classification algorithms.
Abstract: The sheer ease with which abusive and hateful utterances can be made online using today’s digital communication technologies (especially social media) – typically from the comfort of one’s home and without any immediate negative repercussions – is responsible for their significant increase and global ubiquity. Natural Language Processing technologies can help in addressing the negative effects of this development. In this contribution we evaluate a set of classification algorithms on two types of user-generated online content (tweets and Wikipedia Talk comments) in two languages (English and German). The different sets of data we work on were classified towards aspects such as racism, sexism, hate speech, aggression and personal attacks. While acknowledging issues with inter-annotator agreement for classification tasks using these labels, the focus of this paper is on classifying the data according to the annotated characteristics using several text classification algorithms. For some classification tasks we are able to reach F-scores of up to 81.58.
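One of the "several text classification algorithms" could look like the following linear baseline, sketched here with scikit-learn. The four toy examples and the binary labels are invented; the real corpora (tweets, Wikipedia Talk comments) and their annotation schemes must be obtained separately.

```python
# Toy text-classification baseline (illustrative only).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

texts = ["have a great day", "thanks for the helpful answer",
         "you are a complete idiot", "go away, nobody wants you here"]
labels = [0, 0, 1, 1]                      # 0 = acceptable, 1 = abusive

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
clf.fit(texts, labels)

print(clf.predict(["what a complete idiot"]))       # toy prediction
print(f1_score(labels, clf.predict(texts)))         # trivially high on training data
```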

32 citations

Book Chapter
29 May 2016
TL;DR: The platform is intended to enable human experts (knowledge workers) to get a grasp and understand the contents of large document collections in an efficient way so that they can curate, process and further analyse the collection according to their sector-specific needs.
Abstract: In an attempt to put a Semantic Web layer that provides linguistic analysis and discourse information on top of digital content, we develop a platform for digital curation technologies. The platform offers language-, knowledge- and data-aware services as a flexible set of workflows and pipelines for the efficient processing of various types of digital content. The platform is intended to enable human experts (knowledge workers) to grasp and understand the contents of large document collections in an efficient way so that they can curate, process and further analyse the collection according to their sector-specific needs.
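A very rough sketch of the workflow idea, assuming nothing about the platform's actual service API: each curation service is modelled as a function that enriches a shared document record, and a pipeline simply chains such services. The two toy services below are stand-ins for real language- and knowledge-aware components.

```python
# Conceptual pipeline-of-services sketch (stand-in services, not the platform's API).
from typing import Callable, Dict, List

Document = Dict[str, object]
Service = Callable[[Document], Document]

def detect_language(doc: Document) -> Document:
    """Toy language guesser standing in for a real language-identification service."""
    german_hints = {"der", "die", "das", "und", "eine"}
    words = set(str(doc["text"]).lower().replace(".", "").split())
    doc["language"] = "de" if words & german_hints else "en"
    return doc

def spot_entities(doc: Document) -> Document:
    """Toy entity spotter: title-cased tokens stand in for a real NER service."""
    doc["entities"] = [t.strip(".") for t in str(doc["text"]).split() if t.istitle()]
    return doc

def run_pipeline(doc: Document, services: List[Service]) -> Document:
    for service in services:               # each service enriches the document in turn
        doc = service(doc)
    return doc

doc = {"text": "Die Staatsoper Berlin zeigt eine neue Produktion."}
print(run_pipeline(doc, [detect_language, spot_entities]))
```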

30 citations

Book Chapter
09 Jul 2017
TL;DR: A platform that provides curation services that can be integrated into concrete curation or content management systems and a user interface that is currently under development at ART+COM, one of the SME partners in the project.
Abstract: Digital content and online media have reached an unprecedented level of relevance and importance. In the context of a research and technology transfer project on Digital Curation Technologies for online content, we develop a platform that provides curation services that can be integrated into concrete curation or content management systems. In this project, the German Research Center for Artificial Intelligence (DFKI) collaborates with four Berlin-based SMEs that work with and on digital content in four different sectors. The curation services comprise several semantic text and document analytics processes as well as knowledge technologies that can be applied to document collections. The key objective of this set of curation services is to support knowledge workers and digital curators in their daily work, i.e., to automate or semi-automate processes that the human experts are normally required to do intellectually and without tool support. The goal is to help this group of information and knowledge workers become more efficient and more effective, and to enable them to produce high-quality content in their respective sectors. In this article we concentrate on the current state of the user interface under development at ART+COM, one of the SME partners in the project. A second, more generic, i.e., not domain-specific, user interface is under development at DFKI. We describe the technology platform and the two different interfaces, and we also take a look at the different requirements for ART+COM’s domain-specific and DFKI’s generic user interface.

16 citations


Cited by
Journal Article
TL;DR: In this book, van Dijk proposed a new, interdisciplinary theory of news in the press, which represents a very ambitious and somewhat speculative effort to weave together a broad range of existing news research approaches into a coherent, heuristic framework.
Abstract: VAN DIJK, TEUN A., News as Discourse. Hillsdale, N.J.: Lawrence Erlbaum, 1988. $29.95 cloth. This book attempts the development of a "new, interdisciplinary theory of news in the press" (p. vii). It represents a very ambitious and somewhat speculative effort to weave together a broad range of existing news research approaches into a coherent, heuristic framework. Van Dijk succeeds in providing a useful summary of the literature in news research. Especially valuable is his discussion of recent European research. However, the overall framework is still at an early stage of development. Its utility remains to be demonstrated by future research. For some time now, communication researchers have talked of passing paradigms and ferment in the field. With the decline of past paradigms, our discipline has been left with a hodgepodge of small-scale theories. This is particularly true in the area of news research where various narrative, discourse and information processing theories abound. Van Dijk's book may signal a new era, an era in which efforts will be made to integrate existing conceptual fragments into broader frameworks. His work may be seen as providing a model for others who seek to make sense of our current proliferation of theories. Van Dijk's approach is centered in the tradition of discourse analysis, which evolved out of an integration of literary analysis and linguistics. However, he has aggressively modified earlier forms of discourse analysis in an effort to incorporate insights into the structure and interpretation of discourse derived from cognitive psychology. He is not content to simply apply discourse analysis to the evaluation of news stories. He recognizes the utility of constructing an approach which also considers the production of news by media practitioners and the interpretation of news by audience members. It is these broader concerns which set van Dijk's approach apart from previous analyses of news content. A central concept in van Dijk's theory is the notion of story schemas, which are defined as implicit structures that underlie typical stories. These schemas permit the easy production of news and also facilitate its interpretation by news consumers. The schema concept is at once powerful and ambiguous. There is growing research evidence that demonstrates the utility of positing the existence of cognitive structures (schemas) in people's minds which are activated by content cues and guide interpretation of all forms of communication. The schema concept helps to explain why complex and seemingly ambiguous messages often are easily interpreted by audience members. It also can explain why the same message can be interpreted in highly discrepant ways. If messages contain conflicting cues that lead people to activate different schemas, or if people don't share a homogeneous set of schemas, then it is likely that many contrasting interpretations of story content will be developed. But despite growing consensus concerning the utility of schema as a concept, researchers remain quite divided over both its definition and the type of research that will lead to the most useful findings. …

581 citations

Proceedings Article
01 Jun 2019
TL;DR: It is shown that classification scores on popular datasets reported in previous work are much lower under realistic settings in which data bias is reduced, most notably on datasets that are created by focused sampling instead of random sampling.
Abstract: We discuss the impact of data bias on abusive language detection. We show that classification scores on popular datasets reported in previous work are much lower under realistic settings in which this bias is reduced. Such biases are most notably observed on datasets that are created by focused sampling instead of random sampling. Datasets with a higher proportion of implicit abuse are more affected than datasets with a lower proportion.

185 citations

Proceedings Article
01 Sep 2017
TL;DR: This work wants to contribute to the debate on how to deal with fake news and related online phenomena with technological means, by providing means to separate related from unrelated headlines and to further classify the related headlines.
Abstract: We present a system for the detection of the stance of headlines with regard to their corresponding article bodies. The approach can be applied in fake news, especially clickbait detection scenarios. The component is part of a larger platform for the curation of digital content; we consider veracity and relevancy an increasingly important part of curating online information. We want to contribute to the debate on how to deal with fake news and related online phenomena with technological means, by providing means to separate related from unrelated headlines and further classifying the related headlines. On a publicly available data set annotated for the stance of headlines with regard to their corresponding article bodies, we achieve a (weighted) accuracy score of 89.59.
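The two-stage idea can be sketched as follows: a cheap lexical-similarity filter separates related from unrelated headline/body pairs, after which a trained classifier (not shown) would assign a finer stance label to the related pairs. The scikit-learn features, the example pairs and the 0.1 threshold are illustrative assumptions, not the system's actual components.

```python
# Toy relatedness filter for headline/body pairs (illustrative only).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def relatedness(headline: str, body: str) -> float:
    """Lexical overlap as a crude relatedness signal between headline and body."""
    vec = TfidfVectorizer().fit([headline, body])
    sims = cosine_similarity(vec.transform([headline, body]))
    return float(sims[0, 1])

pairs = [
    ("Company X announces record profits",
     "Company X reported record quarterly profits on Tuesday."),
    ("Company X announces record profits",
     "A new species of frog was discovered in the rainforest."),
]

for headline, body in pairs:
    score = relatedness(headline, body)
    label = "related" if score > 0.1 else "unrelated"   # assumed threshold
    print(f"{label:9s} ({score:.2f})  {body}")
```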

127 citations

Proceedings Article
01 Jul 2020
TL;DR: This work proposes a novel topic-informed BERT-based architecture for pairwise semantic similarity detection and shows that the model improves performance over strong neural baselines across a variety of English language datasets.
Abstract: Semantic similarity detection is a fundamental task in natural language understanding. Adding topic information has been useful for previous feature-engineered semantic similarity models as well as neural models for other tasks. There is currently no standard way of combining topics with pretrained contextual representations such as BERT. We propose a novel topic-informed BERT-based architecture for pairwise semantic similarity detection and show that our model improves performance over strong neural baselines across a variety of English language datasets. We find that the addition of topics to BERT helps particularly with resolving domain-specific cases.
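One simple way to obtain such a topic signal is sketched below with scikit-learn's LDA: a topic distribution is inferred for the concatenated sentence pair and could then be appended to the pair representation produced by a BERT encoder (as in the fusion sketch after the book-classification abstract above). The toy corpus, the number of topics and the choice of LDA are assumptions for illustration, not the paper's exact topic model.

```python
# Toy topic-vector computation for a sentence pair (illustrative only).
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Invented toy corpus with two crude "topics" (account management vs. baking).
corpus = [
    "reset password account login email",
    "forgot password change account credentials",
    "bake bread flour yeast oven",
    "knead dough bake oven bread",
]
vec = CountVectorizer()
lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(vec.fit_transform(corpus))

pair_text = ("How do I reset my password? "
             "What is the procedure for changing a forgotten password?")
topic_vec = lda.transform(vec.transform([pair_text]))[0]
print(np.round(topic_vec, 2))   # distribution to append to a BERT pair vector
```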

107 citations