Open Access · Journal Article · DOI

MULE: Multimodal Universal Language Embedding

TL;DR
A modular approach which can easily be incorporated into existing vision-language methods in order to support many languages by learning a single shared Multimodal Universal Language Embedding (MULE), which has been visually-semantically aligned across all languages.
Abstract
Existing vision-language methods typically support at most two languages at a time. In this paper, we present a modular approach which can easily be incorporated into existing vision-language methods in order to support many languages. We accomplish this by learning a single shared Multimodal Universal Language Embedding (MULE) which has been visually-semantically aligned across all languages. Then we learn to relate MULE to visual data as if it were a single language. Our method is not architecture specific, unlike prior work which typically learned separate branches for each language, enabling our approach to be easily adapted to many vision-language methods and tasks. Since MULE learns a single language branch in the multimodal model, we can also scale to support many languages, and languages with fewer annotations can take advantage of the good representation learned from other (more abundant) language data. We demonstrate the effectiveness of our embeddings on the bidirectional image-sentence retrieval task, supporting up to four languages in a single model. In addition, we show that Machine Translation can be used for data augmentation in multilingual learning, which, combined with MULE, improves mean recall by up to 20.2% on a single language compared to prior work, with the most significant gains seen on languages with relatively few annotations. Our code is publicly available.
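The core idea lends itself to a compact sketch: a small encoder per language maps sentences into one shared embedding space (MULE), and a single multimodal branch then aligns that space with image features, so adding a language only adds a language-specific encoder rather than a new vision-language branch. Below is a minimal PyTorch-style sketch of that structure; the module names, dimensions, GRU text encoder, and triplet-style retrieval loss are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MULESketch(nn.Module):
    """Minimal sketch: per-language encoders feed one shared embedding space,
    which a single multimodal branch aligns with image features."""

    def __init__(self, vocab_sizes, word_dim=300, mule_dim=512, joint_dim=512, img_dim=2048):
        super().__init__()
        # One lightweight encoder per language (embedding + GRU here; an assumption).
        self.lang_encoders = nn.ModuleDict({
            lang: nn.ModuleDict({
                "embed": nn.Embedding(vocab, word_dim),
                "rnn": nn.GRU(word_dim, mule_dim, batch_first=True),
            })
            for lang, vocab in vocab_sizes.items()
        })
        # Single shared language branch: MULE space -> joint visual-semantic space.
        self.text_proj = nn.Linear(mule_dim, joint_dim)
        # Single visual branch: precomputed CNN features -> joint space.
        self.image_proj = nn.Linear(img_dim, joint_dim)

    def encode_text(self, tokens, lang):
        enc = self.lang_encoders[lang]
        _, h = enc["rnn"](enc["embed"](tokens))          # h: (1, B, mule_dim)
        return F.normalize(self.text_proj(h.squeeze(0)), dim=-1)

    def encode_image(self, feats):
        return F.normalize(self.image_proj(feats), dim=-1)


def triplet_retrieval_loss(img, txt, margin=0.2):
    """Bidirectional hinge loss over in-batch hard negatives (an assumption;
    the paper's full objective also aligns the languages with one another)."""
    sims = img @ txt.t()                                  # (B, B) cosine similarities
    pos = sims.diag().unsqueeze(1)                        # similarity of each true pair
    mask = torch.eye(sims.size(0), dtype=torch.bool, device=sims.device)
    cost_i2t = (margin + sims - pos).clamp(min=0).masked_fill(mask, 0).max(dim=1).values
    cost_t2i = (margin + sims - pos.t()).clamp(min=0).masked_fill(mask, 0).max(dim=0).values
    return (cost_i2t + cost_t2i).mean()
```

Because only the per-language encoders grow when a language is added, the projection into the joint visual-semantic space stays shared, which is what lets low-resource languages benefit from higher-resource ones; machine-translated captions can simply be routed through the corresponding language encoder as additional training pairs.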



Citations
Posted Content

Trends in Integration of Vision and Language Research: A Survey of Tasks, Datasets, and Methods

TL;DR: This survey focuses on ten prominent tasks that integrate language and vision, discussing their problem formulations, methods, existing datasets, and evaluation measures, and comparing the results obtained with corresponding state-of-the-art methods.
Posted Content

COOT: Cooperative Hierarchical Transformer for Video-Text Representation Learning

TL;DR: This paper proposes a Cooperative Hierarchical Transformer (COOT) to leverage the hierarchy present in real-world video-text tasks and model the interactions between different levels of granularity and between different modalities.
Posted Content

M3P: Learning Universal Representations via Multitask Multilingual Multimodal Pre-training

TL;DR: M3P, a Multitask Multilingual Multimodal Pre-trained model that combines multilingual pre-training and multimodal pre-training into a unified framework via multitask pre-training, can achieve comparable results for English and new state-of-the-art results for non-English languages.
Proceedings Article · DOI

LightningDOT: Pre-training Visual-Semantic Embeddings for Real-Time Image-Text Retrieval

TL;DR: A simple yet highly effective approach, LightningDOT, is proposed that accelerates image-text retrieval (ITR) inference by thousands of times without sacrificing accuracy, achieving superior performance across mainstream ITR benchmarks such as the Flickr30k and COCO datasets.
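The speed comes from decoupling the two encoders: image embeddings are computed offline once, so a query reduces to a matrix-vector product and a top-k selection instead of running a joint cross-attention model over every candidate pair. The snippet below is a generic sketch of that precomputed dot-product retrieval pattern (array names and shapes are placeholders), not LightningDOT's actual code.

```python
import numpy as np

def build_index(image_embs):
    """Precompute and L2-normalize image embeddings offline (shape: N x D)."""
    return image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)

def retrieve(query_emb, index, k=5):
    """At query time, ranking is one matrix-vector product plus a top-k selection."""
    q = query_emb / np.linalg.norm(query_emb)
    scores = index @ q                           # cosine similarity against all images
    topk = np.argpartition(-scores, k)[:k]       # k best candidates, unordered
    return topk[np.argsort(-scores[topk])]       # indices of the k best images, best first
```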
Proceedings Article · DOI

UC2: Universal Cross-lingual Cross-modal Vision-and-Language Pre-training

TL;DR: UC2 extends the standard Masked Language Modeling and Image-Text Matching training objectives to a multilingual setting, where alignment between different languages is captured through shared visual context (i.e., using the image as a pivot).
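The "image as pivot" idea can be illustrated with a toy objective: captions in different languages are never compared to each other directly, but each is pulled toward the embedding of the image they both describe, so the languages become aligned through the shared visual anchor. The following is a hypothetical sketch of that principle using an InfoNCE-style loss, not UC2's actual pre-training objectives (which are masked language modeling and image-text matching over a multilingual multimodal transformer).

```python
import torch
import torch.nn.functional as F

def pivot_alignment_loss(img_emb, caps_by_lang, temperature=0.07):
    """Toy contrastive loss: every language's caption embedding is matched to its own
    image; languages align indirectly because they share the same visual pivot.

    img_emb: (B, D) image embeddings; caps_by_lang: dict lang -> (B, D) caption
    embeddings, where row i in every tensor describes image i."""
    img = F.normalize(img_emb, dim=-1)
    targets = torch.arange(img.size(0), device=img.device)
    loss = 0.0
    for cap_emb in caps_by_lang.values():
        cap = F.normalize(cap_emb, dim=-1)
        logits = cap @ img.t() / temperature     # caption-to-image similarities
        loss = loss + F.cross_entropy(logits, targets)
    return loss / len(caps_by_lang)
```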