
Showing papers by "Jakob Uszkoreit published in 2022"


TL;DR: This project examines, on the ImageNet dataset, the idea of gradually shrinking the image patch length dimension of Vision Transformer as the layers go deeper, in order to save computational resources (Funnel-ViT).
Abstract: Vision Transformer (ViT) [6] adopts the Transformer architecture for image classification tasks and outperforms state-of-the-art convolutional networks with substantially fewer computational resources. However, training a Transformer is still expensive, whether on a very large pretraining dataset or with a large model size, so model efficiency remains an important area to explore. Spatial compression is a common technique in convolutional networks for image classification, which indicates that spatial information is redundant for classification tasks. Inspired by the success of Funnel-Transformer [4] in NLP, this project examines a similar idea on the ImageNet dataset: gradually shrinking the image patch length dimension of Vision Transformer as the layers go deeper, in order to save computational resources (Funnel-ViT). The results show that with a small pretraining accuracy compromise (< 1%), we can save 40% memory, obtain a 37.5% speedup with three funnel blocks, and gain a 0.6% fine-tuning accuracy improvement. The saved resources can even be re-invested in a wider and deeper Funnel-ViT model to further reduce the pre-training accuracy loss to 0.1%.
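The sketch below illustrates the funnel idea described in the abstract: a ViT-style encoder whose patch (sequence) dimension is pooled between stages so that deeper layers operate on fewer tokens. It is not the authors' implementation; the class name FunnelViT, the hyperparameters, and the use of mean pooling with stride 2 are assumptions made for illustration only.

```python
# Minimal sketch (assumed details, not the paper's code) of a funnel-style ViT:
# Transformer encoder stages separated by pooling over the patch dimension.
import torch
import torch.nn as nn


class FunnelViT(nn.Module):
    def __init__(self, dim=384, heads=6, depth_per_stage=4, num_stages=3,
                 num_patches=196, num_classes=1000, pool_stride=2):
        super().__init__()
        self.patch_embed = nn.Linear(16 * 16 * 3, dim)  # flattened 16x16 RGB patches
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, dim))
        self.stages = nn.ModuleList([
            nn.ModuleList([
                nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           batch_first=True)
                for _ in range(depth_per_stage)
            ])
            for _ in range(num_stages)
        ])
        self.pool_stride = pool_stride
        self.head = nn.Linear(dim, num_classes)

    def forward(self, patches):            # patches: (B, num_patches, 16*16*3)
        x = self.patch_embed(patches) + self.pos_embed
        for i, stage in enumerate(self.stages):
            for layer in stage:
                x = layer(x)
            if i < len(self.stages) - 1:   # shrink the patch-length dimension
                x = nn.functional.avg_pool1d(
                    x.transpose(1, 2), kernel_size=self.pool_stride
                ).transpose(1, 2)
        return self.head(x.mean(dim=1))    # global average pool + classifier


# Example: 14x14 = 196 patches are halved after each of the first two stages.
model = FunnelViT()
logits = model(torch.randn(2, 196, 16 * 16 * 3))  # -> (2, 1000)
```

Each pooling step roughly halves the attention cost of the following stage, which is where the memory and speed savings reported in the abstract come from.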