Open Access · Journal Article · DOI

MULE: Multimodal Universal Language Embedding

TL;DR
A modular approach which can easily be incorporated into existing vision-language methods in order to support many languages by learning a single shared Multimodal Universal Language Embedding (MULE), which has been visually-semantically aligned across all languages.
Abstract
Existing vision-language methods typically support at most two languages at a time. In this paper, we present a modular approach which can easily be incorporated into existing vision-language methods in order to support many languages. We accomplish this by learning a single shared Multimodal Universal Language Embedding (MULE) which has been visually-semantically aligned across all languages. Then we learn to relate MULE to visual data as if it were a single language. Our method is not architecture specific, unlike prior work which typically learned separate branches for each language, enabling our approach to be easily adapted to many vision-language methods and tasks. Since MULE learns a single language branch in the multimodal model, we can also scale to support many languages, and languages with fewer annotations can take advantage of the good representation learned from other (more abundant) language data. We demonstrate the effectiveness of our embeddings on the bidirectional image-sentence retrieval task, supporting up to four languages in a single model. In addition, we show that Machine Translation can be used for data augmentation in multilingual learning, which, combined with MULE, improves mean recall by up to 20.2% on a single language compared to prior work, with the most significant gains seen on languages with relatively few annotations. Our code is publicly available.
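The core idea lends itself to a compact sketch: a small encoder per language maps sentences into one shared embedding space (MULE), and a single multimodal branch then aligns that space with image features, so adding a language only adds a language-specific encoder rather than a new vision-language branch. Below is a minimal PyTorch-style sketch of that structure; the module names, dimensions, GRU text encoder, and triplet-style retrieval loss are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MULESketch(nn.Module):
    """Minimal sketch: per-language encoders feed one shared embedding space,
    which a single multimodal branch aligns with image features."""

    def __init__(self, vocab_sizes, word_dim=300, mule_dim=512, joint_dim=512, img_dim=2048):
        super().__init__()
        # One lightweight encoder per language (embedding + GRU here; an assumption).
        self.lang_encoders = nn.ModuleDict({
            lang: nn.ModuleDict({
                "embed": nn.Embedding(vocab, word_dim),
                "rnn": nn.GRU(word_dim, mule_dim, batch_first=True),
            })
            for lang, vocab in vocab_sizes.items()
        })
        # Single shared language branch: MULE space -> joint visual-semantic space.
        self.text_proj = nn.Linear(mule_dim, joint_dim)
        # Single visual branch: precomputed CNN features -> joint space.
        self.image_proj = nn.Linear(img_dim, joint_dim)

    def encode_text(self, tokens, lang):
        enc = self.lang_encoders[lang]
        _, h = enc["rnn"](enc["embed"](tokens))          # h: (1, B, mule_dim)
        return F.normalize(self.text_proj(h.squeeze(0)), dim=-1)

    def encode_image(self, feats):
        return F.normalize(self.image_proj(feats), dim=-1)


def triplet_retrieval_loss(img, txt, margin=0.2):
    """Bidirectional hinge loss over in-batch hard negatives (an assumption;
    the paper's full objective also aligns the languages with one another)."""
    sims = img @ txt.t()                                  # (B, B) cosine similarities
    pos = sims.diag().unsqueeze(1)                        # similarity of each true pair
    mask = torch.eye(sims.size(0), dtype=torch.bool, device=sims.device)
    cost_i2t = (margin + sims - pos).clamp(min=0).masked_fill(mask, 0).max(dim=1).values
    cost_t2i = (margin + sims - pos.t()).clamp(min=0).masked_fill(mask, 0).max(dim=0).values
    return (cost_i2t + cost_t2i).mean()
```

Because only the per-language encoders grow when a language is added, the projection into the joint visual-semantic space stays shared, which is what lets low-resource languages benefit from higher-resource ones; machine-translated captions can simply be routed through the corresponding language encoder as additional training pairs.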



Citations
Posted Content

Trends in Integration of Vision and Language Research: A Survey of Tasks, Datasets, and Methods

TL;DR: This survey focuses on ten prominent tasks that integrate language and vision, discussing their problem formulations, methods, existing datasets, and evaluation measures, and comparing the results obtained with corresponding state-of-the-art methods.
Posted Content

COOT: Cooperative Hierarchical Transformer for Video-Text Representation Learning

TL;DR: This paper proposes a Cooperative Hierarchical Transformer (COOT) to leverage the hierarchy present in real-world video-text tasks and model the interactions between different levels of granularity and between different modalities.
Posted Content

M3P: Learning Universal Representations via Multitask Multilingual Multimodal Pre-training

TL;DR: M3P, a Multitask Multilingual Multimodal Pre-trained model that combines multilingual pre-training and multimodal pre-training into a unified framework via multitask pre-training, can achieve comparable results for English and new state-of-the-art results for non-English languages.
Proceedings Article · DOI

LightningDOT: Pre-training Visual-Semantic Embeddings for Real-Time Image-Text Retrieval

TL;DR: A simple yet highly effective approach, LightningDOT, is proposed that accelerates image-text retrieval (ITR) inference by thousands of times without sacrificing accuracy, achieving superior performance across mainstream ITR benchmarks such as the Flickr30k and COCO datasets.
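The speed comes from decoupling the two encoders: image embeddings are computed offline once, so a query reduces to a matrix-vector product and a top-k selection instead of running a joint cross-attention model over every candidate pair. The snippet below is a generic sketch of that precomputed dot-product retrieval pattern (array names and shapes are placeholders), not LightningDOT's actual code.

```python
import numpy as np

def build_index(image_embs):
    """Precompute and L2-normalize image embeddings offline (shape: N x D)."""
    return image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)

def retrieve(query_emb, index, k=5):
    """At query time, ranking is one matrix-vector product plus a top-k selection."""
    q = query_emb / np.linalg.norm(query_emb)
    scores = index @ q                           # cosine similarity against all images
    topk = np.argpartition(-scores, k)[:k]       # k best candidates, unordered
    return topk[np.argsort(-scores[topk])]       # indices of the k best images, best first
```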
Proceedings Article · DOI

UC2: Universal Cross-lingual Cross-modal Vision-and-Language Pre-training

TL;DR: UC2 extends the standard Masked Language Modeling and Image-Text Matching training objectives to a multilingual setting, where alignment between different languages is captured through shared visual context (i.e., using the image as a pivot).
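The "image as pivot" idea can be illustrated with a toy objective: captions in different languages are never compared to each other directly, but each is pulled toward the embedding of the image they both describe, so the languages become aligned through the shared visual anchor. The following is a hypothetical sketch of that principle using an InfoNCE-style loss, not UC2's actual pre-training objectives (which are masked language modeling and image-text matching over a multilingual multimodal transformer).

```python
import torch
import torch.nn.functional as F

def pivot_alignment_loss(img_emb, caps_by_lang, temperature=0.07):
    """Toy contrastive loss: every language's caption embedding is matched to its own
    image; languages align indirectly because they share the same visual pivot.

    img_emb: (B, D) image embeddings; caps_by_lang: dict lang -> (B, D) caption
    embeddings, where row i in every tensor describes image i."""
    img = F.normalize(img_emb, dim=-1)
    targets = torch.arange(img.size(0), device=img.device)
    loss = 0.0
    for cap_emb in caps_by_lang.values():
        cap = F.normalize(cap_emb, dim=-1)
        logits = cap @ img.t() / temperature     # caption-to-image similarities
        loss = loss + F.cross_entropy(logits, targets)
    return loss / len(caps_by_lang)
```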