Datasheets for Datasets

Timnit Gebru, +6 more

- 23 Mar 2018 -

arXiv: Databases

TLDR

Documentation to facilitate communication between dataset creators and dataset consumers is presented.

Abstract:

The machine learning community currently has no standardized process for documenting datasets, which can lead to severe consequences in high-stakes domains. To address this gap, we propose datasheets for datasets. In the electronics industry, every component, no matter how simple or complex, is accompanied with a datasheet that describes its operating characteristics, test results, recommended uses, and other information. By analogy, we propose that every dataset be accompanied with a datasheet that documents its motivation, composition, collection process, recommended uses, and so on. Datasheets for datasets will facilitate better communication between dataset creators and dataset consumers, and encourage the machine learning community to prioritize transparency and accountability.

Citations

Posted Content

A Survey on Bias and Fairness in Machine Learning

Ninareh Mehrabi, +4 more

- 23 Aug 2019 -

arXiv: Learning

TL;DR: This survey investigated different real-world applications that have shown biases in various ways, and created a taxonomy for fairness definitions that machine learning researchers have defined to avoid the existing bias in AI systems.

Journal Article

PaLM: Scaling Language Modeling with Pathways

Aakanksha Chowdhery, +66 more

- 05 Apr 2022 -

arXiv.org

TL;DR: PaLM, a 540-billion parameter, densely activated Transformer language model, achieves breakthrough performance, outperforming the state of the art on a suite of multi-step reasoning tasks and outperforming average human performance on the recently released BIG-bench benchmark.

Proceedings Article

Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding

Chitwan Saharia, +13 more

TL;DR: This work presents Imagen, a text-to-image diffusion model with an unprecedented degree of photorealism and a deep level of language understanding, and finds that human raters prefer Imagen over other models in side-by-side comparisons, both in terms of sample quality and image-text alignment.

Proceedings Article

Model Cards for Model Reporting

Margaret Mitchell, +8 more

- 05 Oct 2018 -

arXiv: Learning

TL;DR: This work proposes model cards, a framework that can be used to document any trained machine learning model in the application fields of computer vision and natural language processing, and provides cards for two supervised models: one trained to detect smiling faces in images, and one trained to detect toxic comments in text.

References

Labeled Faces in the Wild: A Database for Studying Face Recognition in Unconstrained Environments

Gary B. Huang, +3 more

TL;DR: The database contains labeled face photographs spanning the range of conditions typically encountered in everyday life, and exhibits “natural” variability in factors such as pose, lighting, race, accessories, occlusions, and background.

Journal Article

Minimum information about a microarray experiment (MIAME)-toward standards for microarray data.

Alvis Brazma, +23 more

- 01 Dec 2001 -

Nature Genetics

TL;DR: The ultimate goal of this work is to establish a standard for recording and reporting microarray-based gene expression data, which will in turn facilitate the establishment of databases and public repositories and enable the development of data analysis tools.

Proceedings Article

A Sentimental Education: Sentiment Analysis Using Subjectivity Summarization Based on Minimum Cuts

Bo Pang, +1 more

TL;DR: This paper proposed a machine learning method that applies text-categorization techniques to just the subjective portions of the document. Extracting these portions can be implemented using efficient techniques for finding minimum cuts in graphs, which greatly facilitates incorporation of cross-sentence contextual constraints.

Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification

Joy Buolamwini, +1 more

TL;DR: It is shown that the highest error involves images of dark-skinned women, while the most accurate result is for light-skinned men, in commercial API-based classifiers of gender from facial images, including IBM Watson Visual Recognition.

Journal Article

Semantics derived automatically from language corpora contain human-like biases

Aylin Caliskan, +3 more

- 14 Apr 2017 -

Science

TL;DR: This article showed that applying machine learning to ordinary human language results in human-like semantic biases and replicated a spectrum of known biases, as measured by the Implicit Association Test, using a widely used, purely statistical machine-learning model trained on a standard corpus of text from the World Wide Web.

Citations

A Survey on Bias and Fairness in Machine Learning

PaLM: Scaling Language Modeling with Pathways

Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding

OPT: Open Pre-trained Transformer Language Models

Model Cards for Model Reporting
