Author
Chris Emmery
Other affiliations: University of Antwerp
Bio: Chris Emmery is an academic researcher at Tilburg University. The author has contributed to research on computer science and stylometry. The author has an h-index of 6 and has co-authored 16 publications receiving 232 citations. Previous affiliations of Chris Emmery include the University of Antwerp.
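For readers unfamiliar with the metric, the h-index mentioned in the bio is the largest number h such that h of an author's papers each have at least h citations. A minimal sketch follows; the function name `h_index` is an illustrative choice, and the input is only the five citation counts shown on this page, not the author's full record of 16 publications.

```python
from typing import Iterable


def h_index(citations: Iterable[int]) -> int:
    """Return the largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank  # the top `rank` papers all have >= rank citations
        else:
            break
    return h


# Citation counts of the five papers listed on this page (a partial record).
listed_counts = [231, 28, 26, 21, 18]
partial_h = h_index(listed_counts)
```

On these five counts alone the result is 5; the reported value of 6 would depend on publications that are not listed here.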
Papers
TL;DR: This paper describes the collection and fine-grained annotation of a cyberbullying corpus for English and Dutch and performs a series of binary classification experiments to determine the feasibility of automatic cyberbullying detection.
Abstract: While social media offer great communication opportunities, they also increase the vulnerability of young people to threatening situations online. Recent studies report that cyberbullying constitutes a growing problem among youngsters. Successful prevention depends on the adequate detection of potentially harmful messages and the information overload on the Web requires intelligent systems to identify potential risks automatically. The focus of this paper is on automatic cyberbullying detection in social media text by modelling posts written by bullies, victims, and bystanders of online bullying. We describe the collection and fine-grained annotation of a cyberbullying corpus for English and Dutch and perform a series of binary classification experiments to determine the feasibility of automatic cyberbullying detection. We make use of linear support vector machines exploiting a rich feature set and investigate which information sources contribute the most for the task. Experiments on a hold-out test set reveal promising results for the detection of cyberbullying-related posts. After optimisation of the hyperparameters, the classifier yields an F1 score of 64% and 61% for English and Dutch respectively, and considerably outperforms baseline systems.
231 citations
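The F1 scores reported in the abstract above combine precision and recall into a single number (their harmonic mean). The sketch below shows that calculation for a binary classifier; the function name and the example counts are hypothetical and are not taken from the paper.

```python
def f1_score(true_pos: int, false_pos: int, false_neg: int) -> float:
    """Compute F1 as the harmonic mean of precision and recall."""
    if true_pos == 0:
        # No correct positive predictions: precision and recall are both 0.
        return 0.0
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts for the positive (bullying-related) class.
example_f1 = f1_score(true_pos=64, false_pos=36, false_neg=36)
```

With these made-up counts, precision and recall are both 0.64, so the F1 score is also about 0.64. Because F1 ignores true negatives, it is a common choice when the positive class is rare, as harmful posts usually are.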
TL;DR: An effective crowdsourcing method is presented: simulating real-life bullying scenarios in a lab setting generates plausible data that can be used to enrich real data. This largely circumvents the restrictions on data that can be collected and increases classifier performance.
Abstract: The detection of online cyberbullying has seen an increase in societal importance, popularity in research, and available open data. Nevertheless, while computational power and affordability of resources continue to increase, the access restrictions on high-quality data limit the applicability of state-of-the-art techniques. Consequently, much of the recent research uses small, heterogeneous datasets, without a thorough evaluation of applicability. In this paper, we further illustrate these issues, as we (i) evaluate many publicly available resources for this task and demonstrate difficulties with data collection. These predominantly yield small datasets that fail to capture the required complex social dynamics and impede direct comparison of progress. We (ii) conduct an extensive set of experiments that indicate a general lack of cross-domain generalization of classifiers trained on these sources, and openly provide this framework to replicate and extend our evaluation criteria. Finally, we (iii) present an effective crowdsourcing method: simulating real-life bullying scenarios in a lab setting generates plausible data that can be effectively used to enrich real data. This largely circumvents the restrictions on data that can be collected, and increases classifier performance. We believe these contributions can aid in improving the empirical practices of future research in the field.
28 citations
01 Jul 2016
TL;DR: The authors demonstrate the performance of multiple types of embeddings, created with both count and prediction-based architectures on a variety of corpora, in two language-specific tasks: relation evaluation, and dialect identification.
Abstract: Word embeddings have recently seen a strong increase in interest as a result of strong performance gains on a variety of tasks. However, most of this research also underlined the importance of benchmark datasets, and the difficulty of constructing these for a variety of language-specific tasks. Still, many of the datasets used in these tasks could prove to be fruitful linguistic resources, allowing for unique observations into language use and variability. In this paper we demonstrate the performance of multiple types of embeddings, created with both count and prediction-based architectures on a variety of corpora, in two language-specific tasks: relation evaluation, and dialect identification. For the latter, we compare unsupervised methods with a traditional, hand-crafted dictionary. With this research, we provide the embeddings themselves, the relation evaluation task benchmark for use in further research, and demonstrate how the benchmarked embeddings prove a useful unsupervised linguistic resource, effectively used in a downstream task.
26 citations
TL;DR: This paper demonstrates the performance of multiple types of embeddings, created with both count and prediction-based architectures on a variety of corpora, in two language-specific tasks: relation evaluation, and dialect identification.
Abstract: Word embeddings have recently seen a strong increase in interest as a result of strong performance gains on a variety of tasks. However, most of this research also underlined the importance of benchmark datasets, and the difficulty of constructing these for a variety of language-specific tasks. Still, many of the datasets used in these tasks could prove to be fruitful linguistic resources, allowing for unique observations into language use and variability. In this paper we demonstrate the performance of multiple types of embeddings, created with both count and prediction-based architectures on a variety of corpora, in two language-specific tasks: relation evaluation, and dialect identification. For the latter, we compare unsupervised methods with a traditional, hand-crafted dictionary. With this research, we provide the embeddings themselves, the relation evaluation task benchmark for use in further research, and demonstrate how the benchmarked embeddings prove a useful unsupervised linguistic resource, effectively used in a downstream task.
21 citations
01 Sep 2017
TL;DR: This paper demonstrates the effectiveness of gathering distant labels for self-reported gender on Twitter using simple queries and offers a cheap, extensible, and fast alternative that can be employed beyond the task of gender classification.
Abstract: The majority of research on extracting missing user attributes from social media profiles uses costly hand-annotated labels for supervised learning. Distantly supervised methods exist, although these generally rely on knowledge gathered using external sources. This paper demonstrates the effectiveness of gathering distant labels for self-reported gender on Twitter using simple queries. We confirm the reliability of this query heuristic by comparing with manual annotation. Moreover, using these labels for distant supervision, we demonstrate competitive model performance on the same data as models trained on manual annotations. As such, we offer a cheap, extensible, and fast alternative that can be employed beyond the task of gender classification.
18 citations
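The idea of gathering distant labels from self-reports with simple queries, as described in the abstract above, can be illustrated with a small pattern matcher. This is only a sketch under stated assumptions: the term lists, the regular expression, and the function name `distant_label` are hypothetical and do not reproduce the paper's actual queries.

```python
import re
from typing import Optional

# Hypothetical self-report terms; the paper's real query list is not shown here.
FEMALE_TERMS = {"girl", "woman", "female"}
MALE_TERMS = {"guy", "man", "male", "boy"}

# Matches phrases such as "I'm a woman" or "I am a guy" (case-insensitive).
SELF_REPORT = re.compile(r"\bi(?:'m| am) an? (\w+)\b", re.IGNORECASE)


def distant_label(text: str) -> Optional[str]:
    """Return a noisy gender label if the text contains a self-report, else None.

    Only the first "I'm a ..." phrase in the text is checked.
    """
    match = SELF_REPORT.search(text)
    if match is None:
        return None
    term = match.group(1).lower()
    if term in FEMALE_TERMS:
        return "female"
    if term in MALE_TERMS:
        return "male"
    return None  # a self-description that is not in either term list
```

Labels produced this way are noisy (quotes, jokes, and negations all slip through), which is why the abstract reports checking the heuristic against manual annotation.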
Cited by
01 Jan 2015
410 citations
30 Jan 2020
TL;DR: The authors examined and analyzed the linguistic and psychological features of political discourse using a computer-based Linguistic Inquiry and Word Count (LIWC) content analysis program to explore the relationship between political discourse and the personality of politicians.
Abstract: The article examines and analyzes the linguistic and psychological features of political discourse using a computer-based Linguistic Inquiry and Word Count (LIWC) content analysis program to explore the relationship between political discourse and the personality of politicians. As for political discourse, it is perhaps the communicator, the linguistic personality, who plays the most important role in the communication. The linguistic personality of a politician is of particular interest in political discourse content analysis, since it has the greatest influence on the public consciousness via mass media. Using text as a source of psychological and cognitive information has been gaining popularity. Researchers use a variety of methods to analyze texts, but LIWC has proved to be the most common technique. The analysis of linguistic patterns of political discourse shows that in the context of political speech events such as media interviews, politicians make a unique choice of lexical units, which can be interpreted as a manifestation of certain personality traits. However, despite the significance of the results, there are clear limitations to the use of computerized methodologies for content analysis of political discourse, such as the limited interpretive capacity of software to understand pragmatic and contextual use of lexical units.
286 citations
TL;DR: It is found that all of the recommendation logics under study proved to lead to a rather diverse set of recommendations that are on par with human editors and that basing recommendations on user histories can substantially increase topic diversity within a recommendation set.
Abstract: In the debate about filter bubbles caused by algorithmic news recommendation, the conceptualization of the two core concepts in this debate, diversity and algorithms, has received little attention in social scientific research. This paper examines the effect of multiple recommender systems on different diversity dimensions. To this end, it maps different values that diversity can serve, and a respective set of criteria that characterizes a diverse information offer in this particular conception of diversity. We make use of a data set of simulated article recommendations based on actual content of one of the major Dutch broadsheet newspapers and its users (N=21,973 articles, N=500 users). We find that all of the recommendation logics under study proved to lead to a rather diverse set of recommendations that are on par with human editors and that basing recommendations on user histories can substantially increase topic diversity within a recommendation set.
222 citations
01 Aug 2019
TL;DR: In this article, the authors delineate and clarify the main challenges and frontiers in the abusive content detection field, critically evaluate their implications and discuss potential solutions, and highlight ways in which social scientific insights can advance research.
Abstract: Online abusive content detection is an inherently difficult task. It has received considerable attention from academia, particularly within the computational linguistics community, and performance appears to have improved as the field has matured. However, considerable challenges and unaddressed frontiers remain, spanning technical, social and ethical dimensions. These issues constrain the performance, efficiency and generalizability of abusive content detection systems. In this article we delineate and clarify the main challenges and frontiers in the field, critically evaluate their implications and discuss potential solutions. We also highlight ways in which social scientific insights can advance research. We discuss the lack of support given to researchers working with abusive content and provide guidelines for ethical research.
153 citations
24 Mar 1995
TL;DR: This report summarizes results from the national survey, 24 state surveys, and nine local surveys conducted among high school students during February through May 1993. The results indicate that substantial morbidity and social problems among adolescents also result from unintended pregnancies and sexually transmitted diseases.
Abstract: PROBLEM/CONDITION
Priority health risk behaviors that contribute to the leading causes of mortality, morbidity, and social problems among youth and adults often are established during youth, extend into adulthood, and are interrelated.
REPORTING PERIOD
February through May 1993.
DESCRIPTION OF SYSTEM
The Youth Risk Behavior Surveillance System (YRBSS) monitors six categories of priority health risk behaviors among youth and young adults: behaviors that contribute to unintentional and intentional injuries, tobacco use, alcohol and other drug use, sexual behaviors, dietary behaviors, and physical activity. The YRBSS includes a national, school-based survey conducted by CDC and state and local school-based surveys conducted by state and local education agencies. This report summarizes results from the national survey, 24 state surveys, and nine local surveys conducted among high school students during February through May 1993.
RESULTS AND INTERPRETATION
In the United States, 72% of all deaths among school-age youth and young adults are from four causes: motor vehicle crashes, other unintentional injuries, homicide, and suicide. Results from the 1993 YRBSS suggest that many high school students practice behaviors that may increase their likelihood of death from these four causes: 19.1% rarely or never used a safety belt, 35.3% had ridden with a driver who had been drinking alcohol during the 30 days preceding the survey, 22.1% had carried a weapon during the 30 days preceding the survey, 80.9% ever drank alcohol, 32.8% ever used marijuana, and 8.6% had attempted suicide during the 12 months preceding the survey. Substantial morbidity and social problems among adolescents also result from unintended pregnancies and sexually transmitted diseases, including human immunodeficiency virus (HIV) infection. YRBSS results indicate that in 1993, 53.0% of high school students had had sexual intercourse, 52.8% of sexually active students had used a condom during last sexual intercourse, and 1.4% ever injected an illegal drug. Among adults, 67% of all deaths are from three causes: heart disease, cancer, and stroke. In 1993, many high school students practiced behaviors that may increase the risk for these health problems: 30.5% of high school students had smoked cigarettes during the 30 days preceding the survey, only 15.4% had eaten five or more servings of fruits and vegetables during the day preceding the survey, and only 34.3% had attended physical education class daily.
ACTIONS TAKEN
YRBSS data are being used nationwide by health and education officials to improve school health policies and programs designed to reduce risks associated with the leading causes of mortality and morbidity. At the national level, YRBSS data are being used to measure progress toward achieving 26 national health objectives and one of eight National Education Goals.
132 citations