Home
/
Authors
/
Mei Gao

Author

Mei Gao

Bio: Mei Gao is an academic researcher from Microsoft. The author has contributed to research in topics: Engineering & Computer science. The author has an hindex of 1, co-authored 1 publications receiving 9 citations.

Topics: Engineering, Computer science, Linear classifier, Feature learning ...read more

Papers

PDF

Open Access

More filters

Journal Article•DOI•

i-Code: An Integrative and Composable Multimodal Learning Framework

[...]

Ziyi Yang, Yuwei Fang, Chenguang Zhu, Reid Pryzant, Dongdong Chen, Yuefeng Shi, Yichong Xu, Yao Qian, Mei Gao, Yi-Ling Chen, Liyang Lu, Yujia Xie, Robert Gmyr, Noel C. F. Codella, Naoyuki Kanda, Bin Xiao, Yuanxun Lu, Takuya Yoshioka, Michael Zeng, Xuedong Huang - Show less +16 more

03 May 2022

TL;DR: Experimental results demonstrate how i-Code can outperform state-of-the-art techniques on video understanding tasks and the GLUE NLP benchmark, improving by as much as 11% and demonstrating the power of integrative multimodal pretraining.

...read moreread less

Abstract: Human intelligence is multimodal; we integrate visual, linguistic, and acoustic signals to maintain a holistic worldview. Most current pretraining methods, however, are limited to one or two modalities. We present i-Code, a self-supervised pretraining framework where users may flexibly combine the modalities of vision, speech, and language into unified and general-purpose vector representations. In this framework, data from each modality are first given to pretrained single-modality encoders. The encoder outputs are then integrated with a multimodal fusion network, which uses novel merge- and co-attention mechanisms to effectively combine information from the different modalities. The entire system is pretrained end-to-end with new objectives including masked modality unit modeling and cross-modality contrastive learning. Unlike previous research using only video for pretraining, the i-Code framework can dynamically process single, dual, and triple-modality data during training and inference, flexibly projecting different combinations of modalities into a single representation space. Experimental results demonstrate how i-Code can outperform state-of-the-art techniques on five multimodal understanding tasks and single-modality benchmarks, improving by as much as 11% and demonstrating the power of integrative multimodal pretraining.

...read moreread less

18 citations

Posted Content•

Efficient Self-supervised Vision Transformers for Representation Learning

[...]

Chunyuan Li¹, Jianwei Yang¹, Pengchuan Zhang¹, Mei Gao¹, Bin Xiao¹, Xiyang Dai¹, Lu Yuan¹, Jianfeng Gao¹ - Show less +4 more•Institutions (1)

Microsoft¹

17 Jun 2021-arXiv: Computer Vision and Pattern Recognition

TL;DR: In this paper, self-supervised vision transformers (EsViT) are proposed to capture fine-grained correspondences between image regions and improve the quality of the learned vision representations.

...read moreread less

Abstract: This paper investigates two techniques for developing efficient self-supervised vision transformers (EsViT) for visual representation learning. First, we show through a comprehensive empirical study that multi-stage architectures with sparse self-attentions can significantly reduce modeling complexity but with a cost of losing the ability to capture fine-grained correspondences between image regions. Second, we propose a new pre-training task of region matching which allows the model to capture fine-grained region dependencies and as a result significantly improves the quality of the learned vision representations. Our results show that combining the two techniques, EsViT achieves 81.3% top-1 on the ImageNet linear probe evaluation, outperforming prior arts with around an order magnitude of higher throughput. When transferring to downstream linear classification tasks, EsViT outperforms its supervised counterpart on 17 out of 18 datasets. The code and models will be publicly available.

...read moreread less

10 citations

E fficient s elf - supervised v ision t ransformers for r epresentation l earning

[...]

Jianwei Yang, Pengchuan Zhang, Mei Gao, Bin Xiao, Xiyang Dai, Lu Yuan, Jianfeng Gao - Show less +3 more

TL;DR: This paper investigates two techniques for developing efﬁcient self-supervised vision transformers (EsViT) for visual representation learning and proposes a new pre-training task of region matching which allows the model to capture ﬁne-grained region dependencies and as a result improves the quality of the learned vision representations.

...read moreread less

Abstract: This paper investigates two techniques for developing efﬁcient self-supervised vision transformers (EsViT) for visual representation learning. First, we show through a comprehensive empirical study that multi-stage architectures with sparse self-attentions can signiﬁcantly reduce modeling complexity but with a cost of losing the ability to capture ﬁne-grained correspondences between image regions. Second, we propose a new pre-training task of region matching which allows the model to capture ﬁne-grained region dependencies and as a result signiﬁcantly improves the quality of the learned vision representations. Our results show that combining the two techniques, EsViT achieves 81.3% top-1 accuracy on the ImageNet linear probe evaluation, outperforming prior arts with around an order magnitude of higher throughput. When transferring to downstream linear classiﬁcation tasks, EsViT outperforms its supervised counterpart on 17 out of 18 datasets. The code and pre-trained models are released at: https://github. com/microsoft/esvit 2020). When adding region-matching tasks to BYOL and pre-training 50 epochs, Softmax and MSE yield k -NN accuracy of 37.2% and 34.9%, while the baseline BYOL yields 33.1%. We also replace the region-matching metric in EsViT as MSE , yielding k -NN accuracy 72.6%, which lower than the view-level task only (74.2%). These results show that Softmax is essential in L R . ( ii ) Optimal Transport (OT) vs Simple Argmax. To avoid heavy computational overhead, a simple feature-level argmax solution is considered in Eq. (2) to pair two local regions. To study the impact of high region-matching quality, we consider OT. Empirically, we observe OT yields slightly higher k -NN accuracy at the early stage, but the gain is diminished in the end. Considering the extra computational cost of solving OT with an inner loop in sinkhorn algorithm (Cuturi, 2013), we opt for simple argmax in our experiments.

...read moreread less

Journal Article•DOI•

i-Code V2: An Autoregressive Generation Framework over Vision, Language, and Speech Data

[...]

Ziyi Yang, Mahmoud Khademi, Yichong Xu, Reid Pryzant, Yuwei Fang, Chenguang Zhu, Dongdong Chen, Yao Qian, Mei Gao, Yi-Ling Chen, Robert Gmyr, Naoyuki Kanda, Noel C. F. Codella, Bin Xiao, Yuefeng Shi, Lu Yuan, Takuya Yoshioka, Michael Zeng, Xuedong Huang - Show less +15 more

21 May 2023-arXiv.org

TL;DR: i-Code V2 as mentioned in this paper is an integrative system that leverages state-of-the-art single-modality encoders, combining their outputs with a new modality-fusing encoder in order to flexibly project combinations of modalities into a shared representational space.

...read moreread less

Abstract: The convergence of text, visual, and audio data is a key step towards human-like artificial intelligence, however the current Vision-Language-Speech landscape is dominated by encoder-only models which lack generative abilities. We propose closing this gap with i-Code V2, the first model capable of generating natural language from any combination of Vision, Language, and Speech data. i-Code V2 is an integrative system that leverages state-of-the-art single-modality encoders, combining their outputs with a new modality-fusing encoder in order to flexibly project combinations of modalities into a shared representational space. Next, language tokens are generated from these representations via an autoregressive decoder. The whole framework is pretrained end-to-end on a large collection of dual- and single-modality datasets using a novel text completion objective that can be generalized across arbitrary combinations of modalities. i-Code V2 matches or outperforms state-of-the-art single- and dual-modality baselines on 7 multimodal tasks, demonstrating the power of generative multimodal pretraining across a diversity of tasks and signals.

...read moreread less

Cited by

PDF

Open Access

More filters

Posted Content•

Transformers in Vision: A Survey

[...]

Salman H. Khan, Muzammal Naseer, Munawar Hayat, Syed Waqas Zamir, Fahad Shahbaz Khan, Mubarak Shah - Show less +2 more

04 Jan 2021-arXiv: Computer Vision and Pattern Recognition

TL;DR: Transformer networks as mentioned in this paper enable modeling long dependencies between input sequence elements and support parallel processing of sequence as compared to recurrent networks e.g., Long short-term memory (LSTM).

...read moreread less

Abstract: Astounding results from Transformer models on natural language tasks have intrigued the vision community to study their application to computer vision problems. Among their salient benefits, Transformers enable modeling long dependencies between input sequence elements and support parallel processing of sequence as compared to recurrent networks e.g., Long short-term memory (LSTM). Different from convolutional networks, Transformers require minimal inductive biases for their design and are naturally suited as set-functions. Furthermore, the straightforward design of Transformers allows processing multiple modalities (e.g., images, videos, text and speech) using similar processing blocks and demonstrates excellent scalability to very large capacity networks and huge datasets. These strengths have led to exciting progress on a number of vision tasks using Transformer networks. This survey aims to provide a comprehensive overview of the Transformer models in the computer vision discipline. We start with an introduction to fundamental concepts behind the success of Transformers i.e., self-attention, large-scale pre-training, and bidirectional encoding. We then cover extensive applications of transformers in vision including popular recognition tasks (e.g., image classification, object detection, action recognition, and segmentation), generative modeling, multi-modal tasks (e.g., visual-question answering, visual reasoning, and visual grounding), video processing (e.g., activity recognition, video forecasting), low-level vision (e.g., image super-resolution, image enhancement, and colorization) and 3D analysis (e.g., point cloud classification and segmentation). We compare the respective advantages and limitations of popular techniques both in terms of architectural design and their experimental value. Finally, we provide an analysis on open research directions and possible future works.

...read moreread less

128 citations

Journal Article•DOI•

VIMA: General Robot Manipulation with Multimodal Prompts

[...]

Yunfan Jiang, Agrim Gupta, Zichen Zhang, Guanzhi Wang, Yongqiang Dou, Yanjun Chen, Li Fei-Fei, Animashree Anandkumar, Yuke Zhu, Linxi Fan - Show less +6 more

06 Oct 2022-arXiv.org

TL;DR: This work designs a transformer-based generalist robot agent, VIMA, that processes these prompts and outputs motor actions autoregressively and achieves strong scalability in both model capacity and data size.

...read moreread less

Abstract: Prompt-based learning has emerged as a successful paradigm in natural language processing, where a single general-purpose language model can be instructed to perform any task specified by input prompts. Yet task specification in robotics comes in various forms, such as imitating one-shot demonstrations, following language instructions, and reaching visual goals. They are often considered different tasks and tackled by specialized models. We show that a wide spectrum of robot manipulation tasks can be expressed with multimodal prompts, interleaving textual and visual tokens. Accordingly, we develop a new simulation benchmark that consists of thousands of procedurally-generated tabletop tasks with multimodal prompts, 600K+ expert trajectories for imitation learning, and a four-level evaluation protocol for systematic generalization. We design a transformer-based robot agent, VIMA, that processes these prompts and outputs motor actions autoregressively. VIMA features a recipe that achieves strong model scalability and data efficiency. It outperforms alternative designs in the hardest zero-shot generalization setting by up to $2.9\times$ task success rate given the same training data. With $10\times$ less training data, VIMA still performs $2.7\times$ better than the best competing variant. Code and video demos are available at https://vimalabs.github.io/

...read moreread less

58 citations

Posted Content•

A Survey on Vision Transformer

[...]

Kai Han, Yunhe Wang, Hanting Chen, Xinghao Chen, Jianyuan Guo, Zhenhua Liu, Yehui Tang, An Xiao, Chunjing Xu, Yixing Xu, Zhaohui Yang, Yiman Zhang, Dacheng Tao - Show less +9 more

23 Dec 2020-arXiv: Computer Vision and Pattern Recognition

TL;DR: Transformer as mentioned in this paper is a type of deep neural network mainly based on the self-attention mechanism, which has been applied to the field of natural language processing, and has received more and more attention from the computer vision community.

...read moreread less

Abstract: Transformer, first applied to the field of natural language processing, is a type of deep neural network mainly based on the self-attention mechanism. Thanks to its strong representation capabilities, researchers are looking at ways to apply transformer to computer vision tasks. In a variety of visual benchmarks, transformer-based models perform similar to or better than other types of networks such as convolutional and recurrent networks. Given its high performance and less need for vision-specific inductive bias, transformer is receiving more and more attention from the computer vision community. In this paper, we review these vision transformer models by categorizing them in different tasks and analyzing their advantages and disadvantages. The main categories we explore include the backbone network, high/mid-level vision, low-level vision, and video processing. We also include efficient transformer methods for pushing transformer into real device-based applications. Furthermore, we also take a brief look at the self-attention mechanism in computer vision, as it is the base component in transformer. Toward the end of this paper, we discuss the challenges and provide several further research directions for vision transformers.

...read moreread less

36 citations

Proceedings Article•DOI•

Language Models with Image Descriptors are Strong Few-Shot Video-Language Learners

[...]

Zhenhailong Wang, Manling Li, Ruochen Xu, Luowei Zhou, Jie Lei, Xudong Lin, Shuohang Wang, Ziyi Yang, Chenguang Zhu, Derek Hoiem, Shih-Fu U. Chang, Mohit Bansal, Heng Ji - Show less +9 more

22 May 2022

TL;DR: The goal of this work is to build ﬂexible video-language models that can generalize to various video-to-text tasks from few examples, such as domain-speciﬁc captioning, question answering, and future event prediction, which outperforms state-of-the-art supervised models trained on any video datasets.

...read moreread less

Abstract: The goal of this work is to build flexible video-language models that can generalize to various video-to-text tasks from few examples, such as domain-specific captioning, question answering, and future event prediction. Existing few-shot video-language learners focus exclusively on the encoder, resulting in the absence of a video-to-text decoder to handle generative tasks. Video captioners have been pretrained on large-scale video-language datasets, but they rely heavily on finetuning and lack the ability to generate text for unseen tasks in a few-shot setting. We propose VidIL, a few-shot Video-language Learner via Image and Language models, which demonstrates strong performance on few-shot video-to-text tasks without the necessity of pretraining or finetuning on any video datasets. We use the image-language models to translate the video content into frame captions, object, attribute, and event phrases, and compose them into a temporal structure template. We then instruct a language model, with a prompt containing a few in-context examples, to generate a target output from the composed content. The flexibility of prompting allows the model to capture any form of text input, such as automatic speech recognition (ASR) transcripts. Our experiments demonstrate the power of language models in understanding videos on a wide variety of video-language tasks, including video captioning, video question answering, video caption retrieval, and video future event prediction. Especially, on video future event prediction, our few-shot model significantly outperforms state-of-the-art supervised models trained on large-scale video datasets. Code and resources are publicly available for research purposes at https://github.com/MikeWangWZHL/VidIL .

...read moreread less

33 citations

Journal Article•DOI•

Large-scale Multi-Modal Pre-trained Models: A Comprehensive Survey

[...]

Xiao Zhang Wang, Guangyao Chen, Guangwu Qian, Pengcheng Gao, Xiaoyong Wei, Yaowei Wang, Yonghong Tian, Wen Gao - Show less +4 more

20 Feb 2023-Machine Intelligence Research

TL;DR: Wang et al. as discussed by the authors provide a comprehensive survey of multi-modal pre-training models, including bidirectional encoder representations (BERT), vision transformer (ViT), generative pre-trained transformers (GPT), etc.

...read moreread less

Abstract: With the urgent demand for generalized deep models, many pre-trained big models are proposed, such as bidirectional encoder representations (BERT), vision transformer (ViT), generative pre-trained transformers (GPT), etc. Inspired by the success of these models in single domains (like computer vision and natural language processing), the multi-modal pre-trained big models have also drawn more and more attention in recent years. In this work, we give a comprehensive survey of these models and hope this paper could provide new insights and helps fresh researchers to track the most cutting-edge works. Specifically, we firstly introduce the background of multi-modal pre-training by reviewing the conventional deep learning, pre-training works in natural language process, computer vision, and speech. Then, we introduce the task definition, key challenges, and advantages of multi-modal pre-training models (MM-PTMs), and discuss the MM-PTMs with a focus on data, objectives, network architectures, and knowledge enhanced pre-training. After that, we introduce the downstream tasks used for the validation of large-scale MM-PTMs, including generative, classification, and regression tasks. We also give visualization and analysis of the model parameters and results on representative downstream tasks. Finally, we point out possible research directions for this topic that may benefit future works. In addition, we maintain a continuously updated paper list for large-scale pre-trained multi-modal big models: https://github.com/wangxiao5791509/MultiModal_BigModels_Survey.

...read moreread less

16 citations

1
2
3
4
…
5
6

Collapse