
Showing papers by "Sebastian Möller published in 2022"


Journal Article
TL;DR: The GermEval 2022 shared task was designed as a text regression task in which participants developed models to predict the complexity of pieces of text for a German learner on a scale from 1 to 7.
Abstract: In this paper we present the GermEval 2022 shared task on Text Complexity Assessment of German text. Text forms an integral part of exchanging information and interacting with the world, correlating with quality and experience of life. Text complexity is one of the factors which affect a reader's understanding of a text. The mapping of a body of text to a mathematical unit quantifying the degree of readability is the basis of complexity assessment. As readability might be influenced by representation, we only target the text complexity for readers in this task. We designed the task as text regression, in which participants developed models to predict the complexity of pieces of text for a German learner on a scale from 1 to 7. The shared task was organized in two phases: the development phase and the test phase. Among the 24 participants who registered for the shared task, ten teams submitted their results on the test data.
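
The task setup described above (mapping text to a complexity score between 1 and 7) lends itself to a simple regression baseline. Below is a minimal sketch of such a setup; the file and column names ("germeval2022_train.csv", "text", "complexity") are assumptions for illustration, not the official task schema.

```python
# Minimal sketch of a text regression baseline for the task described above.
# File and column names are hypothetical; predictions are clipped to the
# task's 1-7 complexity scale.
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

df = pd.read_csv("germeval2022_train.csv")  # hypothetical file name
X_train, X_dev, y_train, y_dev = train_test_split(
    df["text"], df["complexity"], test_size=0.2, random_state=42)

# Character n-gram TF-IDF is a common, language-agnostic readability baseline.
model = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 5)),
    Ridge(alpha=1.0))
model.fit(X_train, y_train)

pred = np.clip(model.predict(X_dev), 1.0, 7.0)
print("RMSE:", np.sqrt(mean_squared_error(y_dev, pred)))
```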

9 citations


Proceedings ArticleDOI
01 Jan 2022
TL;DR: This work presents one of the first studies to include experiments on both parallel corpora of German Sign Language (PHOENIX14T and the Public DGS Corpus), experimenting with two NMT architectures with optimized hyperparameters, several tokenization methods, and two data augmentation techniques (back-translation and paraphrasing).
Abstract: We examine methods and techniques, proven to be helpful for the text-to-text translation of spoken languages, in the context of gloss-to-text translation systems, where the glosses are the written representation of the signs. We present one of the first works that include experiments on both parallel corpora of German Sign Language (PHOENIX14T and the Public DGS Corpus). We experiment with two NMT architectures with optimization of their hyperparameters, several tokenization methods, and two data augmentation techniques (back-translation and paraphrasing). Through our investigation we achieve substantial improvements of 5.0 and 2.2 BLEU points for the models trained on the two corpora, respectively. Our RNN models outperform our Transformer models; the segmentation method with which we achieve the best results is BPE, whereas back-translation and paraphrasing lead to minor but not significant improvements.
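
The reported gains are measured in corpus-level BLEU. As an illustration of how such scores are computed, here is a minimal sketch using the sacrebleu library; the hypothesis and reference sentences are invented placeholders, not data from PHOENIX14T or the Public DGS Corpus.

```python
# Corpus-level BLEU scoring as used to compare gloss-to-text systems.
# The sentences below are invented placeholders.
import sacrebleu

hypotheses = ["tomorrow it will rain in the north",
              "in the south it stays dry"]
references = [["tomorrow rain is expected in the north",
               "in the south it remains dry"]]  # one reference per hypothesis

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU = {bleu.score:.1f}")
```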

6 citations


Proceedings Article
TL;DR: In this paper, the authors present a fine-grained test suite for the language pair German–English, which is based on a number of linguistically motivated categories and phenomena, and the semi-automatic evaluation is carried out with regular expressions.
Abstract: This paper presents a fine-grained test suite for the language pair German–English. The test suite is based on a number of linguistically motivated categories and phenomena, and the semi-automatic evaluation is carried out with regular expressions. We describe the creation and implementation of the test suite in detail, providing a full list of all categories and phenomena. Furthermore, we present various exemplary applications of our test suite that have been implemented in the past years, like contributions to the Conference on Machine Translation, the usage of the test suite and MT outputs for quality estimation, and the expansion of the test suite to the language pair Portuguese–English. We describe how we tracked the development of the performance of various MT systems over the years with the help of the test suite, and which categories and phenomena are prone to resulting in MT errors. For the first time, we also make a large part of our test suite publicly available to the research community.
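
The semi-automatic evaluation with regular expressions can be illustrated with a small sketch. The phenomena and patterns below are invented stand-ins, not entries from the actual test suite.

```python
# Sketch of a semi-automatic, regex-based pass/fail check over MT outputs.
# Phenomenon names and patterns are illustrative, not the actual test suite.
import re

test_items = [
    # (source, regex a correct English translation should match, phenomenon)
    ("Er hat ins Gras gebissen.", r"\b(kicked the bucket|bit the dust)\b", "idiom"),
    ("Sie müsste schlafen.", r"\bshe (should|ought to|probably has to) sleep\b", "modal verb"),
]

def evaluate(mt_outputs):
    """Count how many MT outputs match the expected pattern per phenomenon."""
    passed = 0
    for (src, pattern, phenomenon), hyp in zip(test_items, mt_outputs):
        if re.search(pattern, hyp, flags=re.IGNORECASE):
            passed += 1
        else:
            print(f"FAIL [{phenomenon}]: {src!r} -> {hyp!r}")
    return passed / len(test_items)

print("accuracy:", evaluate(["He kicked the bucket.", "She must sleeping."]))
```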

3 citations


Book ChapterDOI
TL;DR: This chapter investigates key UI design variables that can improve the text reading experience in VR, which is often not in focus for VR systems because of limited hardware capabilities, lack of standardization, user interface (UI) design flaws, and the physical design of head-mounted displays (HMDs).
Abstract: Virtual Reality (VR) technology is mostly used in gaming, videos, engineering applications, and training simulators. One thing shared among all of them is the necessity to display text. The text reading experience is not always in focus for VR systems because of limited hardware capabilities, lack of standardization, user interface (UI) design flaws, and the physical design of Head-Mounted Displays (HMDs). In this paper, key variables on the UI design side that can improve the text reading user experience in VR were researched. Four important points for reading in VR applications were selected as focus: 1) difference in canvas type (flat/curved), 2) contrast of the virtual scene (light/dark), 3) number of columns in the layout (1 column/2 columns/3 columns), and 4) text distance from the subject (1.5 m/6.5 m). For the user study, a VR app for the Oculus Quest was developed that can display text while varying some of the features important for readability in VR. The experiment revealed which parameters are important for the text reading experience in VR. Specifically, subjects performed very well when the text was at a distance of 6.5 m from the subject with a font size of 22 pt, on a flat canvas with a one-column layout. Regarding physiological variables, the measurements behaved similarly across conditions, as all of the selected parameters were in line with the design guidelines. Therefore, the selection of final settings should be oriented more towards user experience and preferences.

3 citations


Proceedings ArticleDOI
05 Sep 2022
TL;DR: In this paper, the authors created a dataset of simulated videotelephony clips to act as stimuli in quality perception research, consisting of four different stories in the German language that are told through ten consecutive parts, each about 10 seconds long.
Abstract: To study people's natural behavior under different conditions of audiovisual quality, we usually invite people into a lab and let them talk to each other. In such conversation settings, not only does the media quality impact quality perception; social aspects of a real conversation, for example, are also reflected in individual conversational and rating behavior. Hence, to study quality perception in conversational settings, we try to create an environment that isolates the media quality from such outside factors and is consistent for each participant in the lab. Therefore, we created a dataset of simulated videotelephony clips to act as stimuli in quality perception research. The dataset consists of four different stories in the German language that are told through ten consecutive parts, each about 10 seconds long. Each of these parts is available in four different quality levels, ranging from perfect to stalling. All clips (FullHD, H.264 / AAC) are actual recordings from end-user video-conference software to ensure ecological validity and realism of quality degradation. To ensure consistency among different clips of the same quality level, each video has been scored using VMAF and POLQA and selected to match predefined selection criteria. To analyze the perceived quality of the clips, we conducted a user study (N=25) and evaluated perceived quality, interest in the stories, and speaker engagement. The results validate the consistency of the quality levels of the video clips. Apart from a detailed description of the methodological approach, we contribute the entire stimuli dataset containing 160 videos and all rating scores for each file.
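
The selection step, matching each clip's VMAF score against predefined criteria for its intended quality level, can be sketched as follows. The VMAF bands and file names are invented for illustration; an analogous check would apply to POLQA for the audio.

```python
# Sketch of the consistency check: keep only clips whose VMAF score falls into
# the predefined band of their intended quality level. Bands are hypothetical.
vmaf_bands = {  # quality level -> (min VMAF, max VMAF), invented thresholds
    "perfect": (93, 100),
    "good": (75, 93),
    "poor": (40, 75),
    "stalling": (0, 40),
}

clips = [  # (file name, intended level, measured VMAF) -- invented examples
    ("story1_part03_perfect.mp4", "perfect", 95.2),
    ("story1_part03_poor.mp4", "poor", 81.0),
]

for name, level, vmaf in clips:
    lo, hi = vmaf_bands[level]
    status = "OK" if lo <= vmaf < hi else "REJECT (re-record or re-encode)"
    print(f"{name}: VMAF={vmaf:.1f}, target '{level}' -> {status}")
```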

2 citations


Journal ArticleDOI
27 Apr 2022 - PLOS ONE
TL;DR: This work presents a multidisciplinary view on machine learning in medical decision support systems, covering information technology, medical, and ethical aspects, as well as an implemented risk prediction system in nephrology.
Abstract: Scientific publications about the application of machine learning models in healthcare often focus on improving performance metrics. However, beyond often short-lived improvements, many additional aspects need to be taken into consideration to make sustainable progress. What does it take to implement a clinical decision support system, what makes it usable for the domain experts, and what brings it eventually into practical usage? So far, there has been little research to answer these questions. This work presents a multidisciplinary view of machine learning in medical decision support systems and covers information technology, medical, and ethical aspects. The target audience is computer scientists who plan to do research in a clinical context. The paper starts from a relatively straightforward risk prediction system in the subspecialty of nephrology that was evaluated on historic patient data, both intrinsically and based on a reader study with medical doctors. Although the results were quite promising, the focus of this article is not on the model itself or potential performance improvements. Instead, we want to share with other researchers the lessons we have learned and the insights we have gained when implementing and evaluating our system in a clinical setting, within a highly interdisciplinary pilot project carried out in cooperation between computer scientists, medical doctors, ethicists, and legal experts.

2 citations


Proceedings ArticleDOI
01 Jan 2022
TL;DR: Assessing information savviness from chatbot interactions in a technical customer service domain reveals assessable personalization targets and suggests user adaptation strategies for the personalization of technical customer support chatbots.
Abstract: Information savviness describes the ability to find, evaluate, and reflect on information online. Customers with high information savviness are more likely to look up product information online and to read customer reviews before making a purchase decision. By assessing information savviness from chatbot interactions in a technical customer service domain, we analyze its impact on user experience (UX), expectations, and preferences of the users, in order to determine assessable personalization targets that act directly on UX. To find out which UX factors can be assessed reliably, we conduct an assessment study through a set of scenario-based tasks using a crowd-sourcing set-up and analyze the UX factors. We reveal significant differences in users' UX expectations with respect to a series of UX factors like acceptability, task efficiency, system error, ease of use, naturalness, personality, and promoter score. Our results strongly suggest a potential application of essential personalization and user adaptation strategies utilizing information savviness for the personalization of technical customer support chatbots.

2 citations



Proceedings Article
TL;DR: In this paper, a semantic similarity detection model is proposed that compares text in the test set with sentences in the training set to find the most similar instances, in order to identify propaganda techniques in Arabic social media text.
Abstract: Propaganda and the spreading of fake news through social media have become a serious problem in recent years. In this paper we present our approach for the shared task on propaganda detection in Arabic, in which the goal is to identify propaganda techniques in Arabic social media text. We propose a semantic similarity detection model to compare text in the test set with the sentences in the training set to find the most similar instances. The label of the target text is obtained from the most similar texts in the training set. The proposed model obtained a micro F1 score of 0.494 on the test data set.
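
A minimal sketch of this nearest-neighbour label transfer is shown below, using sentence embeddings and cosine similarity via the sentence-transformers library; the model choice and the placeholder data are assumptions, not necessarily what the authors used.

```python
# Sketch of nearest-neighbour label transfer: embed train and test sentences,
# then copy the label of the most similar training instance.
# The multilingual model name is one plausible choice, not the authors' model.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

train_texts = ["...", "..."]                     # Arabic sentences (placeholders)
train_labels = [["loaded_language"], ["doubt"]]  # propaganda technique labels

train_emb = model.encode(train_texts, convert_to_tensor=True)

def predict(test_text):
    query = model.encode(test_text, convert_to_tensor=True)
    scores = util.cos_sim(query, train_emb)[0]   # similarity to each train item
    return train_labels[int(scores.argmax())]    # label of the nearest neighbour

print(predict("..."))  # placeholder test sentence
```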

1 citation


Proceedings ArticleDOI
23 Aug 2022
TL;DR: It is suggested that increased control over information sharing does not necessarily lead to improved privacy decision-making, and that privacy by default might be a more effective design choice.
Abstract: Android applications request specific permissions from users during installation in order to perform the required functionalities by accessing system resources and personal information. Usually, users must approve the permissions requested by applications (apps) during the installation process and before the apps can collect privacy- or security-relevant information. However, recent studies have shown that users are overwhelmed with the information provided in privacy policies and do not understand permission requests or which functionalities are necessary for certain applications. Hereby, the collection of personal information remains mostly hidden, as verifying which information different apps have access to can be very complicated. Therefore, it is necessary to develop frameworks and apps that enable the user to make informed decisions about apps' run-time permission access, to facilitate control over sensitive information collected by various apps on smartphones. In this work, we conducted an online study with 70 participants who interacted with a mockup app that enables advanced control over permission requests. The selected permissions are based on the apps' run-time permission access patterns and explanations, and commonly known visual cues are used to facilitate the user's understanding and privacy-conscious decision making. Furthermore, the effects of perceived control over information sharing and privacy awareness are examined in combination with the permission manager mockup app to investigate whether increased control over information sharing increases general privacy awareness. Our results show an interplay between increased control and privacy awareness when explanations and common visual cues are presented to the user. However, the direction of the interplay between increased control and privacy awareness was surprising. Privacy awareness dropped for the experimental group, which received advanced explanations and visual nudges for privacy-conscious decision making. Interestingly, privacy awareness significantly increased for the control group, which only received a plain privacy nudge. Therefore, we suggest that increased control over information sharing does not necessarily lead to improved privacy decision-making, and that privacy by default might be a more effective design choice.

1 citation



Proceedings ArticleDOI
10 Oct 2022
TL;DR: A new bitstream-based model named Deep-BVQM is proposed, which outperforms the standard models on the tested datasets and offers frame-level quality prediction, which is essential diagnostic information for some video streaming services such as cloud gaming.
Abstract: With the rapid increase of video streaming content, high-quality video quality metrics, mainly signal-based video quality metrics, are emerging, notably VMAF, SSIMPLUS, and AVQM. Besides signal-based video quality metrics, within the standardization body ITU-T Study Group 12, two well-known bitstream-based video quality metrics have been developed, named P.1203 and P.1204.3. Due to their low complexity and the low level of access to the bitstream data they require, these models have gained attention from network providers and service providers. In this paper, we propose a new bitstream-based model named Deep-BVQM, which outperforms the standard models on the tested datasets. While the model comes with slightly higher computational complexity, it offers frame-level quality prediction, which is essential diagnostic information for some video streaming services such as cloud gaming. Deep-BVQM is developed in two layers: first, frame quality is predicted using a lightweight CNN model; next, the latent features of the CNN are used to train an LSTM network to predict video quality over a short-term duration.
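
The two-layer design can be sketched architecturally as follows: a lightweight CNN maps per-frame bitstream features to a frame quality score, and its latent features feed an LSTM that predicts short-term video quality. All layer sizes and feature dimensions below are invented; this is a structural sketch, not the actual Deep-BVQM implementation.

```python
# Structural sketch of a two-stage bitstream model in the spirit of Deep-BVQM.
import torch
import torch.nn as nn

class FrameCNN(nn.Module):
    def __init__(self, n_features=16, latent_dim=32):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_features, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1))
        self.latent = nn.Linear(32, latent_dim)
        self.head = nn.Linear(latent_dim, 1)    # per-frame quality score

    def forward(self, x):                       # x: (N, features, positions)
        z = self.latent(self.conv(x).squeeze(-1))
        return self.head(z), z                  # frame score and latent features

class TemporalLSTM(nn.Module):
    def __init__(self, latent_dim=32, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(latent_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)        # short-term video quality

    def forward(self, z_seq):                   # z_seq: (clips, frames, latent)
        out, _ = self.lstm(z_seq)
        return self.head(out[:, -1])

cnn, lstm = FrameCNN(), TemporalLSTM()
B, T = 4, 30                                    # 4 clips of 30 frames each
x = torch.randn(B * T, 16, 8)                   # invented per-frame feature maps
frame_scores, z = cnn(x)                        # stage 1: frame-level quality
video_scores = lstm(z.view(B, T, -1))           # stage 2: short-term quality
print(frame_scores.shape, video_scores.shape)   # (120, 1) and (4, 1)
```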

Book ChapterDOI
01 Jan 2022
TL;DR: In this paper, the authors compared the performance of fake news detection in tweets as a text classification task, using support vector machines, long short-term memory networks, and a pre-trained BERT model.
Abstract: Fake news spreading through social media has become a serious problem in recent years, especially after the United States presidential election in 2016. Accordingly, more attention has been paid to this issue by scientists to develop automated tools that combat pieces of information containing misinformation, using natural language processing methods. Although the performance of fake news detection models has increased through more complex architectures and state-of-the-art models, less attention has been paid to the impact of pre-processing on the overall performance of such models. In this study, we focus on investigating the impact of pre-processing, especially removing URLs, on the performance of fake news detection systems. We compared the performance of fake news detection in tweets as a text classification task, using support vector machines, long short-term memory networks, and a pre-trained BERT model. In addition to URLs, we analyzed the impact of different approaches for dealing with emojis and Twitter handles on the performance of the models. Our results show that URLs can be good clues for identifying fake news, despite the fact that they are usually removed in the pre-processing step.

Keywords: Fake News Detection, Pre-processing, LSTMs, BERT
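
The pre-processing variants under study (removing or keeping URLs, Twitter handles, and emojis) can be sketched with a small helper; the regular expressions below are simplified illustrations, not the exact rules used in the chapter.

```python
# Sketch of tweet pre-processing variants: optionally remove URLs, handles,
# and emojis before classification. Patterns are deliberately simplified.
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
HANDLE_RE = re.compile(r"@\w+")
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")

def preprocess(tweet, remove_urls=True, remove_handles=True, remove_emojis=True):
    if remove_urls:
        tweet = URL_RE.sub("", tweet)
    if remove_handles:
        tweet = HANDLE_RE.sub("", tweet)
    if remove_emojis:
        tweet = EMOJI_RE.sub("", tweet)
    return " ".join(tweet.split())

t = "BREAKING 😱 @user claims X! Read more: https://example.com/fake"
print(preprocess(t))                     # aggressive cleaning
print(preprocess(t, remove_urls=False))  # keep URLs as potential clues
```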

Proceedings Article
TL;DR: This paper describes subjective experiments to assess the readability of German text, and shows that a linear regression model with a subset of linguistically motivated features can be a very good predictor of text complexity.
Abstract: For different reasons, text can be difficult to read and understand for many people, especially if the text's language is too complex. In order to provide suitable text for the target audience, it is necessary to measure its complexity. In this paper we describe subjective experiments to assess the readability of German text. We compile a new corpus of sentences provided by a German IT service provider. The sentences are annotated with subjective complexity ratings by two groups of participants, namely experts and non-experts for that text domain. We then extract an extensive set of linguistically motivated features that supposedly interact with complexity perception. We show that a linear regression model with a subset of these features can be a very good predictor of text complexity.
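
A minimal sketch of such a feature-based linear regression is shown below; the three surface features and the example sentences with ratings are invented stand-ins for the much larger feature set and annotated corpus described above.

```python
# Sketch of a feature-based complexity predictor: linear regression over
# linguistically motivated features. Features and data are illustrative only.
import numpy as np
from sklearn.linear_model import LinearRegression

def features(sentence):
    words = sentence.split()
    return [
        len(words),                           # sentence length
        np.mean([len(w) for w in words]),     # mean word length
        sum(len(w) > 12 for w in words),      # very long (compound) words
    ]

sentences = ["Das ist ein kurzer Satz.",
             "Die Inbetriebnahmevoraussetzungen erfordern eine Konfigurationsdatei."]
ratings = [1.5, 5.8]                          # subjective complexity (invented)

X = np.array([features(s) for s in sentences])
model = LinearRegression().fit(X, ratings)
print(model.predict([features("Ein weiterer Beispielsatz zur Veranschaulichung.")]))
```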

Proceedings ArticleDOI
05 Sep 2022
TL;DR: Results show that Placement Modality significantly affects Personal Space boundaries in terms of position, width, and hedonic quality of the experience; an effect of Virtual Character Gender was found on users' Personal Space inner boundary and on Social Presence in its emotional contagion dimension.
Abstract: Over the past five years, interest in Augmented Reality (AR) technologies has significantly increased. Different industries have adopted it, such as gaming, tourism, e-commerce, and entertainment, to name a few. Due to its non-fully immersive property, easy implementation, and compatibility with the most recent smartphones, AR seems to be suitable for daily usage. This poses a significant challenge in creating a good User Experience (UX), as this kind of technology needs to be designed for both private and public scenarios. In this context, it is essential to understand how users are influenced in their social interactions while using AR. Through the use of humanoid Virtual Characters, this paper is intended to improve the understanding of how people perceive personal space in AR. By varying the Virtual Character Gender (male, female) and its Placement Modality (predefined, dynamic distance) in the real environment, the effects on UX, Emotions, and Social Presence are investigated. As Personal Space is a dynamic psychological construct, which may depend on individual factors, this paper further investigates possible co-variant effects of users' preferences towards a female or a male character based on their Sexual Attraction. Results show that Placement Modality significantly affects Personal Space boundaries in terms of position, width, and hedonic quality of the experience. Moreover, an effect of Virtual Character Gender was found on users' Personal Space inner boundary and on Social Presence in its emotional contagion dimension. Finally, effects of Sexual Attraction on UX, on Social Presence in its emotional contagion dimension, and on dominance were discovered.

Proceedings ArticleDOI
05 Sep 2022
TL;DR: This paper assesses the data quality differences between two P.808 implementations used in a large-scale crowdsourcing study with about two hundred users from Amazon Mechanical Turk and shows that both implementations correlate strongly with the laboratory and with each other, suggesting that ITU-T Rec. P.808 is robust enough to be implemented by non-experts in speech evaluation or crowdsourcing.
Abstract: Subjective assessments are a key component of speech quality research. Traditionally, the assessments are conducted in laboratories under controlled conditions, following international standards like ITU-T Rec. P.800. However, even before the current pandemic, more and more speech quality research used crowdsourcing-based approaches for collecting subjective ratings. Crowdsourcing allows researchers to collect data even without a dedicated test laboratory, to collect data from a huge and diverse group of participants, and to perform the assessment in various real-life settings. Still, this approach raises questions about the reliability and validity of the subjective ratings, especially when comparing the ratings with data collected in standardized procedures. One step towards approaching these challenges was the development of the ITU-T Rec. P.808 standard. This standard helps practitioners implement best practices from speech quality studies and crowdsourcing studies in their crowdsourced speech quality assessments. However, even with ITU-T Rec. P.808 in action, it is unclear how much background knowledge is necessary to successfully “implement” this standard. Therefore, this paper aims to assess the data quality differences between two P.808 implementations. One implementation is from a co-author of the P.808 standard, and the other is from a researcher with only a little background in crowdsourcing and speech quality assessments. Both implementations are used in a large-scale crowdsourcing study with about two hundred users from Amazon Mechanical Turk. The collected ratings are compared to gold-standard data from a certified laboratory. Also, the two implementations are compared to analyze whether they lead to the same conclusions. The results show that both implementations correlate strongly with the laboratory and with each other, suggesting that ITU-T Rec. P.808 is robust enough to be implemented by non-experts in speech evaluation or crowdsourcing.
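
The headline analysis, correlating per-condition MOS values of each implementation with the laboratory and with each other, can be sketched in a few lines; all MOS values below are invented.

```python
# Sketch of the validity check: per-condition MOS from each P.808
# implementation is correlated with the laboratory MOS. Numbers are invented.
import numpy as np
from scipy.stats import pearsonr

lab_mos = np.array([1.2, 2.1, 2.9, 3.8, 4.5])  # certified laboratory
impl_a  = np.array([1.4, 2.3, 3.0, 3.9, 4.4])  # expert implementation
impl_b  = np.array([1.3, 2.0, 3.1, 3.7, 4.3])  # non-expert implementation

for name, mos in (("A vs lab", impl_a), ("B vs lab", impl_b)):
    r, p = pearsonr(lab_mos, mos)
    print(f"{name}: r={r:.3f}, p={p:.4f}")

r, p = pearsonr(impl_a, impl_b)
print(f"A vs B: r={r:.3f}, p={p:.4f}")
```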


Proceedings ArticleDOI
01 Jan 2022
TL;DR: A crowdsourcing survey in the style of a Semantic Differential and an Exploratory Factor Analysis revealed four factors that operate as relevant dimensions for the Quality of Experience of MT outputs: precision, complexity, grammaticality, and transparency.
Abstract: The quality of machine-generated text is a complex construct consisting of various aspects and dimensions. We present a study that aims to uncover relevant perceptual quality dimensions for one type of machine-generated text, that is, Machine Translation. We conducted a crowdsourcing survey in the style of a Semantic Differential to collect attribute ratings for German MT outputs. An Exploratory Factor Analysis revealed the underlying perceptual dimensions. As a result, we extracted four factors that operate as relevant dimensions for the Quality of Experience of MT outputs: precision, complexity, grammaticality, and transparency.
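
The extraction step can be sketched as follows, using a varimax-rotated factor analysis over an attribute-rating matrix (participants × adjective scales); the data is random and the scale names are invented stand-ins for the actual semantic differential items.

```python
# Sketch of the factor extraction: attribute ratings reduced to four latent
# quality dimensions. Data is random; scale names are invented stand-ins.
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
ratings = rng.integers(1, 8, size=(200, 12)).astype(float)  # 200 raters, 12 scales
scales = ["accurate", "exact", "wordy", "convoluted", "correct", "well-formed",
          "fluent", "clear", "understandable", "faithful", "natural", "simple"]

fa = FactorAnalysis(n_components=4, rotation="varimax").fit(ratings)
for i, factor in enumerate(fa.components_):
    top = np.argsort(-np.abs(factor))[:3]        # highest-loading scales
    print(f"factor {i + 1}:", [scales[j] for j in top])
```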

Proceedings Article
TL;DR: In this article , a fine-grained linguistically motivated analysis of 29 machine translation systems submitted at the shared task of the 7th Conference of Machine Translation (WMT22) is presented.
Abstract: This document describes a fine-grained, linguistically motivated analysis of 29 machine translation systems submitted to the Shared Task of the 7th Conference on Machine Translation (WMT22). This submission expands the test suite work of previous years by adding the language direction English–Russian. As a result, evaluation takes place for the language directions German–English, English–German, and English–Russian. We find that the German–English systems suffer in translating idioms, some tenses of modal verbs, and resultative predicates; the English–German ones in idioms, transitive past progressive, and middle voice; whereas the English–Russian ones suffer in pseudogapping and idioms.

Journal ArticleDOI
TL;DR: In this article, the authors investigated the effect of spatial auditory cues on human listeners' response strategy for identifying two alternately active talkers in a “turn-taking” listening scenario.
Abstract: This study investigates effects of spatial auditory cues on human listeners' response strategy for identifying two alternately active talkers (“turn-taking” listening scenario). Previous research has demonstrated subjective benefits of audio spatialization with regard to speech intelligibility and talker-identification effort. So far, the deliberate activation of specific perceptual and cognitive processes by listeners to optimize their task performance remained largely unexamined. Spoken sentences selected as stimuli were either clean or degraded due to background noise or bandpass filtering. Stimuli were presented via three horizontally positioned loudspeakers: In a non-spatial mode, both talkers were presented through a central loudspeaker; in a spatial mode, each talker was presented through the central or a talker-specific lateral loudspeaker. Participants identified talkers via speeded keypresses and afterwards provided subjective ratings (speech quality, speech intelligibility, voice similarity, talker-identification effort). In the spatial mode, presentations at lateral loudspeaker locations entailed quicker behavioral responses, which were significantly slower in comparison to a talker-localization task. Under clean speech, response times globally increased in the spatial vs. non-spatial mode (across all locations); these “response time switch costs,” presumably being caused by repeated switching of spatial auditory attention between different locations, diminished under degraded speech. No significant effects of spatialization on subjective ratings were found. The results suggested that when listeners could utilize task-relevant auditory cues about talker location, they continued to rely on voice recognition instead of localization of talker sound sources as primary response strategy. Besides, the presence of speech degradations may have led to increased cognitive control, which in turn compensated for incurring response time switch costs.
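
The "response time switch costs" mentioned above boil down to a difference in mean response times between the spatial and non-spatial presentation modes; a toy computation with invented numbers:

```python
# Sketch of the switch-cost computation implied above: mean response time in
# the spatial mode minus the non-spatial mode, per speech condition.
# Column names, condition labels, and all values are invented.
import pandas as pd

trials = pd.DataFrame({
    "mode": ["spatial", "spatial", "non_spatial", "non_spatial"],
    "speech": ["clean", "degraded", "clean", "degraded"],
    "rt_ms": [742, 801, 688, 795],   # invented mean response times
})

mean_rt = trials.pivot_table(index="speech", columns="mode", values="rt_ms")
mean_rt["switch_cost_ms"] = mean_rt["spatial"] - mean_rt["non_spatial"]
print(mean_rt)  # in the study, the cost diminished under degraded speech
```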

Journal ArticleDOI
TL;DR: In this article, the authors investigated the impact of the derived vocal features on the generation of the desired characteristics and found that convex combinations of acoustic features yield higher Mean Opinion Scores for warmth and competence than individual features.
Abstract: In our previous work, we derived the acoustic features that contribute to the perception of warmth and competence in synthetic speech. As an extension, in our current work we investigate the impact of the derived vocal features on the generation of the desired characteristics. The acoustic features spectral flux, F1 mean, and F2 mean, and their convex combinations, were explored for the generation of higher warmth in female speech. The voiced slope, spectral flux, and their convex combinations were investigated for the generation of higher competence in female speech. We employed a feature quantization approach in a traditional end-to-end Tacotron-based speech synthesis model. The listening tests have shown that the convex combination of acoustic features yields higher Mean Opinion Scores for warmth and competence than the individual features.
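
A convex combination constrains the mixing weights to be non-negative and sum to one. A minimal sketch of how such a combined conditioning value could be formed from normalised acoustic features (all values and weights below are invented):

```python
# Sketch of a convex combination of normalised acoustic features, as used to
# condition the synthesis model. Weights w_i must satisfy w_i >= 0 and
# sum(w_i) == 1, which is what makes the combination convex.
import numpy as np

def convex_combination(features, weights):
    w = np.asarray(weights, dtype=float)
    assert np.all(w >= 0) and np.isclose(w.sum(), 1.0), "not a convex combination"
    return np.dot(w, features)

spectral_flux, f1_mean, f2_mean = 0.62, 0.48, 0.55   # normalised (invented)
target = convex_combination([spectral_flux, f1_mean, f2_mean], [0.5, 0.3, 0.2])
print(f"conditioning value for 'warmth': {target:.3f}")
```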

Proceedings Article
TL;DR: An adaptive conversational agent is analyzed that can automatically adjust to a user's personality type, carefully excerpted from the Myers-Briggs type indicators.
Abstract: Chatbots are increasingly used to automate operational processes in customer service. However, most chatbots lack adaptation towards their users, which may result in an unsatisfactory experience. Since knowing and meeting personal preferences is a key factor for enhancing usability in conversational agents, in this study we analyze an adaptive conversational agent that can automatically adjust to a user's personality type, carefully excerpted from the Myers-Briggs type indicators. An experiment including 300 crowd workers examined how typifications like extroversion/introversion and thinking/feeling can be assessed and designed for a conversational agent in a job recommender domain. Our results validate the proposed design choices, and experiments on user-matched personality typification, following the so-called law-of-attraction rule, show a significant positive influence on a range of selected usability criteria such as overall satisfaction, naturalness, promoter score, trust, and appropriateness of the conversation.