Gender and Dialect Bias in YouTube’s Automatic Captions
Rachael Tatman
pp. 53–59
TLDR
This project evaluates the accuracy of YouTube's automatically generated captions across two genders and five dialect groups, and demonstrates the need for sociolinguistically stratified validation of speech recognition systems.
Citations
Journal Article
Data Statements for Natural Language Processing: Toward Mitigating System Bias and Enabling Better Science
Emily M. Bender, Batya Friedman +1 more
TL;DR: It is argued that data statements will help alleviate issues of exclusion and bias in language technology, lead to more precise claims about how natural language processing research generalizes and thus to better engineering results, protect companies from public embarrassment, and ultimately yield language technology that meets its users in their own preferred linguistic style.
Posted Content
WILDS: A Benchmark of in-the-Wild Distribution Shifts
Pang Wei Koh, Shiori Sagawa, Henrik Marklund, Sang Michael Xie, Marvin Zhang, Akshay Balsubramani, Weihua Hu, Michihiro Yasunaga, Richard Lanas Phillips, Irena Gao, Tony Lee, Etienne David, Ian Stavness, Wei Guo, Berton A. Earnshaw, Imran S. Haque, Sara Beery, Jure Leskovec, Anshul Kundaje, Emma Pierson, Sergey Levine, Chelsea Finn, Percy Liang +22 more
TL;DR: WILDS, a benchmark of in-the-wild distribution shifts spanning diverse data modalities and applications, is presented in the hope of encouraging the development of general-purpose methods that are anchored to real-world distribution shifts and that work well across different applications and problem settings.
Posted Content
Distributionally Robust Neural Networks for Group Shifts: On the Importance of Regularization for Worst-Case Generalization.
TL;DR: The results suggest that regularization is important for worst-group generalization in the overparameterized regime, even if it is not needed for average generalization, and introduce a stochastic optimization algorithm, with convergence guarantees, to efficiently train group DRO models.
Proceedings Article
Measuring and Mitigating Unintended Bias in Text Classification
TL;DR: A new approach to measuring and mitigating unintended bias in machine learning models is introduced, using a set of common demographic identity terms as the subset of input features on which to measure bias.
Proceedings Article
Racial Bias in Hate Speech and Abusive Language Detection Datasets
TL;DR: This article examined racial bias in five different sets of Twitter data annotated for hate speech and abusive language, finding that abusive language detection systems may discriminate against the very groups who are often the targets of the abuse such systems are intended to detect.
References
Proceedings Article
Librispeech: An ASR corpus based on public domain audio books
TL;DR: It is shown that acoustic models trained on LibriSpeech give lower error rates on the Wall Street Journal (WSJ) test sets than models trained on WSJ itself.
Proceedings Article
Unbiased look at dataset bias
Antonio Torralba, Alexei A. Efros +1 more
TL;DR: A comparison study using a set of popular datasets, evaluated based on a number of criteria including: relative data bias, cross-dataset generalization, effects of closed-world assumption, and sample value is presented.
Proceedings Article
SWITCHBOARD: telephone speech corpus for research and development
TL;DR: SWITCHBOARD is a large multispeaker corpus of conversational speech and text that should be of interest to researchers in speaker authentication and large-vocabulary speech recognition.
Dataset
TIMIT Acoustic-Phonetic Continuous Speech Corpus
John S. Garofolo, Lori Lamel, William M. Fisher, Jonathan C. Fiscus, David S. Pallett, Nancy L. Dahlgren, Victor W. Zue +6 more
TL;DR: The TIMIT corpus contains broadband recordings of 630 speakers of eight major dialects of American English, each reading ten phonetically rich sentences; it includes time-aligned orthographic, phonetic, and word transcriptions as well as a 16-bit, 16 kHz speech waveform file for each utterance.
Related Papers (5)
Semantics derived automatically from language corpora contain human-like biases
Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification
Joy Buolamwini, Timnit Gebru +1 more