Hannah Rose Kirk
Researcher at University of Oxford
Publications - 5
Citations - 5
Hannah Rose Kirk is an academic researcher at the University of Oxford. The author has contributed to research on hate speech detection and bias in generative language models, and has co-authored 5 publications.
Papers
Posted Content
Hatemoji: A Test Suite and Adversarially-Generated Dataset for Benchmarking and Detecting Emoji-based Hate.
TL;DR: HatemojiCheck is a test suite of 3,930 short-form statements for evaluating how detection models perform on hateful language expressed with emoji.
Proceedings ArticleDOI
Memes in the Wild: Assessing the Generalizability of the Hateful Memes Challenge Dataset
Hannah Rose Kirk,Yennie Jun,Paulius Rauba,Gal Wachtel,Ruining Li,Xingjian Bai,Noah Broestl,Martin Doff-Sotta,Aleksandar Shtedritski,Yuki M. Asano +9 more
TL;DR: The authors collected hateful and non-hateful memes from Pinterest to evaluate the out-of-sample performance of models pre-trained on the Facebook dataset, and found that memes in the wild are more diverse than traditional memes.
Posted Content
Bias Out-of-the-Box: An Empirical Analysis of Intersectional Occupational Biases in Popular Generative Language Models
Hannah Rose Kirk,Yennie Jun,Haider Iqbal,Elias Benussi,Filippo Volpin,Frédéric A. Dreyer,Aleksandar Shtedritski,Yuki M. Asano +7 more
TL;DR: The authors conducted an in-depth analysis of intersectional occupational biases in GPT-2, the most downloaded text generation model on HuggingFace, with over half a million downloads in the past month alone.
Posted Content
Memes in the Wild: Assessing the Generalizability of the Hateful Memes Challenge Dataset
Hannah Rose Kirk,Yennie Jun,Paulius Rauba,Gal Wachtel,Ruining Li,Xingjian Bai,Noah Broestl,Martin Doff-Sotta,Aleksandar Shtedritski,Yuki M. Asano +9 more
TL;DR: The authors collected hateful and non-hateful memes from Pinterest to evaluate the out-of-sample performance of models pre-trained on the Facebook dataset, and found that memes in the wild differ in two key aspects: 1) captions must be extracted via OCR, injecting noise and diminishing the performance of multimodal models, and 2) memes are more diverse than traditional memes, including screenshots of conversations or text on a plain background.
Posted Content
How True is GPT-2? An Empirical Analysis of Intersectional Occupational Biases.
Hannah Rose Kirk,Yennie Jun,Haider Iqbal,Elias Benussi,Filippo Volpin,Frédéric A. Dreyer,Aleksandar Shtedritski,Yuki M. Asano +7 more
TL;DR: The authors analyzed the occupational biases of a popular generative language model, GPT-2, intersecting gender with five protected categories: religion, sexuality, ethnicity, political affiliation, and name origin.