MULE: Multimodal Universal Language Embedding
Donghyun Kim, Kuniaki Saito, Kate Saenko, Stan Sclaroff, Bryan A. Plummer
Vol. 34, Iss. 07, pp. 11254–11261
TLDR
A modular approach which can easily be incorporated into existing vision-language methods in order to support many languages by learning a single shared Multimodal Universal Language Embedding (MULE), which has been visually-semantically aligned across all languages.
Abstract:
Existing vision-language methods typically support at most two languages at a time. In this paper, we present a modular approach which can easily be incorporated into existing vision-language methods in order to support many languages. We accomplish this by learning a single shared Multimodal Universal Language Embedding (MULE) which has been visually-semantically aligned across all languages. Then we learn to relate MULE to visual data as if it were a single language. Unlike prior work, which typically learned a separate branch for each language, our method is not architecture specific, enabling our approach to easily be adapted to many vision-language methods and tasks. Since MULE learns a single language branch in the multimodal model, we can also scale to support many languages, and languages with fewer annotations can take advantage of the good representation learned from other (more abundant) language data. We demonstrate the effectiveness of our embeddings on the bidirectional image-sentence retrieval task, supporting up to four languages in a single model. In addition, we show that Machine Translation can be used for data augmentation in multilingual learning, which, combined with MULE, improves mean recall by up to 20.2% on a single language compared to prior work, with the most significant gains seen on languages with relatively few annotations. Our code is publicly available.
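The abstract's core idea can be sketched in a few lines of numpy. This is a minimal illustration, not the authors' implementation: all names and dimensions (`lang_proj`, `shared_branch`, the 300/512/2048 sizes) are illustrative assumptions. Each language gets its own small projection into a shared "universal" embedding space, and a single shared multimodal branch then maps that space into the joint space where sentences can be compared against image features.

```python
import numpy as np

rng = np.random.default_rng(0)
EMB, UNIV, VIS = 300, 512, 2048  # word-embedding, universal, and joint-space dims

# One projection matrix per supported language (learned in practice,
# random here just to make the sketch runnable).
lang_proj = {lang: rng.standard_normal((EMB, UNIV)) * 0.01
             for lang in ["en", "de", "ja", "cs"]}

# A single shared language branch: universal embedding -> joint space.
# Because it is shared, low-resource languages benefit from the others.
shared_branch = rng.standard_normal((UNIV, VIS)) * 0.01

def embed_sentence(word_vecs: np.ndarray, lang: str) -> np.ndarray:
    """Mean-pool word vectors, project into MULE, then into the joint space."""
    universal = word_vecs.mean(axis=0) @ lang_proj[lang]  # language-specific part
    return universal @ shared_branch                      # shared across all languages

sent = rng.standard_normal((7, EMB))  # 7 fake word vectors for one sentence
joint = embed_sentence(sent, "de")
print(joint.shape)  # (2048,) -- directly comparable against image features
```

The design point the paper makes is visible here: adding a fifth language only adds one small `lang_proj` matrix, while the multimodal branch stays unchanged.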
Citations
Posted Content
Trends in Integration of Vision and Language Research: A Survey of Tasks, Datasets, and Methods
TL;DR: This survey focuses on ten prominent tasks that integrate language and vision by discussing their problem formulations, methods, existing datasets, and evaluation measures, and comparing the results obtained with corresponding state-of-the-art methods.
Posted Content
COOT: Cooperative Hierarchical Transformer for Video-Text Representation Learning
TL;DR: This paper proposes a Cooperative hierarchical Transformer to leverage this hierarchy information and model the interactions between different levels of granularity and different modalities in real-world video-text tasks.
Posted Content
M3P: Learning Universal Representations via Multitask Multilingual Multimodal Pre-training
Haoyang Huang, Lin Su, Di Qi, Nan Duan, Edward Cui, Taroon Bharti, Lei Zhang, Lijuan Wang, Jianfeng Gao, Bei Liu, Jianlong Fu, Dongdong Zhang, Xin Liu, Ming Zhou
TL;DR: M3P, a Multitask Multilingual Multimodal Pre-trained model that combines multilingual pre-training and multimodal pre-training into a unified framework via multitask pre-training, can achieve comparable results for English and new state-of-the-art results for non-English languages.
Proceedings ArticleDOI
LightningDOT: Pre-training Visual-Semantic Embeddings for Real-Time Image-Text Retrieval
TL;DR: A simple yet highly effective approach, LightningDOT is proposed that accelerates the inference time of ITR by thousands of times, without sacrificing accuracy, and achieves superior performance across mainstream ITR benchmarks such as Flickr30k and COCO datasets.
Proceedings ArticleDOI
UC2: Universal Cross-lingual Cross-modal Vision-and-Language Pre-training
TL;DR: UC2 extends the standard Masked Language Modeling and Image-Text Matching training objectives to a multilingual setting, where alignment between different languages is captured through shared visual context (i.e., using the image as a pivot).
References
Proceedings ArticleDOI
Deep Residual Learning for Image Recognition
TL;DR: In this article, the authors proposed a residual learning framework to ease the training of networks that are substantially deeper than those used previously, which won the 1st place on the ILSVRC 2015 classification task.
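The residual idea summarized above is simple enough to show directly. A minimal sketch (not the paper's convolutional architecture; a dense two-layer block with illustrative sizes): instead of learning a mapping `H(x)` outright, the block learns a residual `F(x)` and outputs `relu(F(x) + x)`, so the identity shortcut makes very deep networks easier to train.

```python
import numpy as np

def relu(x: np.ndarray) -> np.ndarray:
    return np.maximum(x, 0.0)

def residual_block(x: np.ndarray, W1: np.ndarray, W2: np.ndarray) -> np.ndarray:
    """Two-layer transform F(x) with an identity shortcut: y = relu(F(x) + x)."""
    fx = relu(x @ W1) @ W2   # the learned residual F(x)
    return relu(fx + x)      # add the shortcut, then the final nonlinearity

rng = np.random.default_rng(0)
x = rng.standard_normal(64)
W1 = rng.standard_normal((64, 64)) * 0.01
W2 = rng.standard_normal((64, 64)) * 0.01

y = residual_block(x, W1, W2)
print(y.shape)  # (64,)
```

With small weights, `F(x)` is near zero and the block is close to the identity, which is exactly why stacking many such blocks does not degrade training the way plain deep stacks do.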
Proceedings Article
Adam: A Method for Stochastic Optimization
Diederik P. Kingma, Jimmy Ba
TL;DR: This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.
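The "adaptive estimates of lower-order moments" in the summary above can be made concrete with a minimal scalar Adam update. This is a sketch using the paper's default hyperparameters; the toy objective `f(x) = (x - 3)^2` is my own example, not from the paper.

```python
import numpy as np

def adam(grad_fn, theta, steps=2000, lr=0.01, b1=0.9, b2=0.999, eps=1e-8):
    """Adam for a scalar parameter, with the paper's default hyperparameters."""
    m = v = 0.0
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        m = b1 * m + (1 - b1) * g        # first moment (running mean of gradients)
        v = b2 * v + (1 - b2) * g * g    # second moment (running mean of g^2)
        m_hat = m / (1 - b1 ** t)        # bias correction for zero initialization
        v_hat = v / (1 - b2 ** t)
        theta -= lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta

# Minimize f(x) = (x - 3)^2, whose gradient is 2 * (x - 3).
x_opt = adam(lambda x: 2.0 * (x - 3.0), theta=0.0)
print(x_opt)  # close to 3.0
```

Note the bias-correction terms: without them, `m` and `v` start at zero and the early steps would be systematically too small.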
Proceedings Article
Very Deep Convolutional Networks for Large-Scale Image Recognition
Karen Simonyan, Andrew Zisserman
TL;DR: In this paper, the authors investigated the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting and showed that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 layers.
Proceedings ArticleDOI
ImageNet: A large-scale hierarchical image database
TL;DR: A new database called “ImageNet” is introduced, a large-scale ontology of images built upon the backbone of the WordNet structure, much larger in scale and diversity and much more accurate than the current image datasets.
Proceedings Article
Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
Sergey Ioffe, Christian Szegedy
TL;DR: Applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin.
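The normalization step the summary above describes is a one-liner per feature. A minimal training-mode forward pass (the learnable `gamma`/`beta` and the running statistics used at inference are simplified away to constants here): each feature is standardized over the batch, then scaled and shifted.

```python
import numpy as np

def batch_norm(x: np.ndarray, gamma: np.ndarray, beta: np.ndarray,
               eps: float = 1e-5) -> np.ndarray:
    """Normalize each feature over the batch dimension, then scale and shift."""
    mu = x.mean(axis=0)                    # per-feature batch mean
    var = x.var(axis=0)                    # per-feature batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)  # standardized activations
    return gamma * x_hat + beta            # learnable scale and shift

rng = np.random.default_rng(0)
x = rng.standard_normal((32, 4)) * 5.0 + 10.0  # batch of 32, 4 badly scaled features

y = batch_norm(x, gamma=np.ones(4), beta=np.zeros(4))
print(np.round(y.mean(axis=0), 6))  # per-feature means, all ~0
print(np.round(y.std(axis=0), 3))   # per-feature stds, all ~1
```

Stabilizing these per-layer statistics is what lets training use much higher learning rates, which is where the "14 times fewer training steps" claim comes from.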