Journal ArticleDOI

Debiasing Vision-Language Models via Biased Prompts

TLDR
This article proposes a general approach for debiasing vision-language foundation models by projecting out biased directions in the text embedding, which reduces social bias and spurious correlation in both discriminative and generative vision-language models without the need for additional data or training.
Abstract
Machine learning models have been shown to inherit biases from their training datasets. This can be particularly problematic for vision-language foundation models trained on uncurated datasets scraped from the internet. The biases can be amplified and propagated to downstream applications like zero-shot classifiers and text-to-image generative models. In this study, we propose a general approach for debiasing vision-language foundation models by projecting out biased directions in the text embedding. In particular, we show that debiasing only the text embedding with a calibrated projection matrix suffices to yield robust classifiers and fair generative models. The proposed closed-form solution enables easy integration into large-scale pipelines, and empirical results demonstrate that our approach effectively reduces social bias and spurious correlation in both discriminative and generative vision-language models without the need for additional data or training.
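The core idea of projecting out biased directions can be sketched in a few lines. The sketch below is illustrative only: the function name and the toy bias direction are assumptions, and the paper's calibrated projection matrix is more involved than this plain orthogonal projection.

```python
import numpy as np

def debias_projection(bias_directions):
    """Build a matrix that projects out the span of the given bias directions.

    bias_directions: array of shape (k, d), e.g. differences between
    embeddings of biased prompts such as "a photo of a man" and
    "a photo of a woman". Returns P of shape (d, d) such that P @ v is
    orthogonal to every bias direction.
    """
    V = np.asarray(bias_directions, dtype=float)
    # Orthonormalize the bias directions via QR on the transpose.
    Q, _ = np.linalg.qr(V.T)  # Q: (d, k) with orthonormal columns
    return np.eye(V.shape[1]) - Q @ Q.T

# Toy example: remove a single bias direction in 3-D.
P = debias_projection(np.array([[1.0, 0.0, 0.0]]))
text_embedding = np.array([0.5, 0.2, 0.1])
debiased = P @ text_embedding
# The debiased embedding has no component along the bias direction.
```

In a real pipeline, the bias directions would come from encoding pairs of biased prompts with the model's text encoder, and the projected text embeddings would be used as-is by the downstream zero-shot classifier or generator, which is why no retraining is needed.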



Citations
Journal ArticleDOI

A Categorical Archive of ChatGPT Failures

Ali Borji
- 06 Feb 2023
TL;DR: In this paper, the authors present a comprehensive analysis of ChatGPT's failures, covering reasoning, factual errors, math, coding, and bias, and highlight the risks, limitations, and societal implications of ChatGPT.
Journal ArticleDOI

What does CLIP know about a red circle? Visual prompt engineering for VLMs

TL;DR: In this article, the authors explore visual prompt engineering for solving computer vision tasks beyond classification by editing in image space instead of text, and show the power of this simple approach by achieving state-of-the-art results in zero-shot referring expression comprehension and strong performance in keypoint localization tasks.
Journal ArticleDOI

The Hidden Language of Diffusion Models

TL;DR: Chefer et al. decompose an input text prompt into a small set of interpretable elements and learn a pseudo-token, a sparse weighted combination of tokens from the model's vocabulary, with the objective of reconstructing the images generated for the given concept.

Linear Spaces of Meanings: Compositional Structures in Vision-Language Models

TL;DR: The authors investigate compositional structures in data embeddings from pre-trained vision-language models (VLMs), empirically explore these structures in CLIP's embedding space, and evaluate their usefulness for solving different vision-language tasks such as classification, debiasing, and retrieval.

Bias-to-Text: Debiasing Unknown Visual Biases through Language Interpretation

TL;DR: The authors propose a bias-to-text (B2T) framework to identify and mitigate biases in vision models, such as image classifiers and text-to-image generative models.
References
Posted Content

Deep Residual Learning for Image Recognition

TL;DR: This work presents a residual learning framework to ease the training of networks that are substantially deeper than those used previously, and provides comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth.
Posted Content

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

TL;DR: Vision Transformer (ViT) attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train.
Proceedings ArticleDOI

Deep Learning Face Attributes in the Wild

TL;DR: A novel deep learning framework for attribute prediction in the wild that cascades two CNNs, LNet and ANet, which are fine-tuned jointly with attribute tags, but pre-trained differently.
Journal ArticleDOI

Hierarchical Text-Conditional Image Generation with CLIP Latents

TL;DR: This work proposes a two-stage model: a prior that generates a CLIP image embedding given a text caption, and a decoder that generates an image conditioned on the image embedding, and shows that explicitly generating image representations improves image diversity with minimal loss in photorealism and caption similarity.
Posted Content

A Survey on Bias and Fairness in Machine Learning

TL;DR: This survey investigated different real-world applications that have shown biases in various ways, and created a taxonomy for fairness definitions that machine learning researchers have defined to avoid the existing bias in AI systems.
Trending Questions (1)
How to find spurious correlations in vision-language models?

The study proposes a method to debias vision-language models by projecting out biased directions in the text embedding, effectively reducing spurious correlations.