Yumao Lu
Researcher at Microsoft
Publications - 6
Citations - 28
Yumao Lu is an academic researcher from Microsoft. The author has contributed to research in topics: Computer science & Closed captioning. The author has an h-index of 1 and has co-authored 6 publications receiving 3 citations.
Papers
Posted Content
SwinBERT: End-to-End Transformers with Sparse Attention for Video Captioning
TL;DR: The authors propose SwinBERT, an end-to-end transformer-based model for video captioning that takes video frame patches directly as input and outputs a natural language description.
Posted Content
An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA
TL;DR: The authors propose PICa, which uses image captions for knowledge-based visual question answering (VQA) in a few-shot manner by converting the image into captions (or tags) that GPT-3 can understand.
Posted Content
Florence: A New Foundation Model for Computer Vision
Lu Yuan, Dongdong Chen, Yi-Ling Chen, Noel C. F. Codella, Xiyang Dai, Jianfeng Gao, Houdong Hu, Xuedong Huang, Boxin Li, Chunyuan Li, Ce Liu, Mengchen Liu, Zicheng Liu, Yumao Lu, Yu Shi, Lijuan Wang, Jianfeng Wang, Bin Xiao, Zhen Xiao, Jianwei Yang, Michael Zeng, Luowei Zhou, Pengchuan Zhang
TL;DR: The authors propose Florence, a new computer vision foundation model that expands representations from coarse (scene) to fine (object), from static (images) to dynamic (videos), and from RGB to multiple modalities (caption, depth).
Posted Content
Scaling Up Vision-Language Pre-training for Image Captioning
TL;DR: The authors use the state-of-the-art VinVL model, which consists of an image feature extractor and a transformer model, as their reference model, and scale the transformer both up and down, with model sizes ranging from 13 to 675 million parameters.
Posted Content
UFO: A UniFied TransfOrmer for Vision-Language Representation Learning
TL;DR: The authors propose a single UniFied transfOrmer (UFO) for vision-language representation learning that can process either unimodal inputs (e.g., image or language) or multimodal inputs (e.g., the concatenation of the image and the question in visual question answering).