GPViT: A High Resolution Non-Hierarchical Vision Transformer with Group Propagation

Chenhongyi Yang*, Jiarui Xu, Shalini De Mello, Elliot J Crowley, Xiaolong Wang

*Corresponding author for this work

Research output: Contribution to conference › Paper › peer-review

Abstract / Description of output

We present the Group Propagation Vision Transformer (GPViT): a novel non-hierarchical (i.e. non-pyramidal) transformer model designed for general visual recognition with high-resolution features. High-resolution features (or tokens) are a natural fit for tasks that involve perceiving fine-grained details, such as detection and segmentation, but exchanging global information between these features is expensive in memory and computation because of the way self-attention scales. We provide a highly efficient alternative, the Group Propagation Block (GP Block), to exchange global information. In each GP Block, features are first grouped together by a fixed number of learnable group tokens; we then perform Group Propagation, where global information is exchanged between the grouped features; finally, global information in the updated grouped features is returned to the image features through a transformer decoder. We evaluate GPViT on a variety of visual recognition tasks including image classification, semantic segmentation, object detection, and instance segmentation. Our method achieves significant performance gains over previous works across all tasks, especially on tasks that require high-resolution outputs. For example, our GPViT-L3 outperforms Swin Transformer-B by 2.0 mIoU on ADE20K semantic segmentation with only half as many parameters.
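The scaling argument in the abstract can be illustrated with a rough count of pairwise token interactions. This is a minimal sketch, not the authors' implementation: the function names, the token count (a 56 x 56 feature map) and the group count (64) are hypothetical values chosen for illustration, and the assumption that groups exchange information pairwise is ours, since the abstract does not specify the propagation mechanism.

```python
def self_attention_pairs(num_tokens: int) -> int:
    """Pairwise interactions in full self-attention: every token attends to every token."""
    return num_tokens * num_tokens


def group_propagation_pairs(num_tokens: int, num_groups: int) -> int:
    """Rough interaction count for one grouping / propagation / ungrouping pass.

    Assumption (not stated in the abstract): grouping and ungrouping are
    cross-attention between image tokens and group tokens, and propagation
    lets every group interact with every other group.
    """
    grouping = num_groups * num_tokens    # group tokens gather from image tokens
    propagation = num_groups * num_groups  # groups exchange global information
    ungrouping = num_tokens * num_groups  # image tokens read back from groups
    return grouping + propagation + ungrouping


# Hypothetical sizes: a 56 x 56 high-resolution feature map and 64 group tokens.
num_tokens = 56 * 56
num_groups = 64

full = self_attention_pairs(num_tokens)
grouped = group_propagation_pairs(num_tokens, num_groups)
print(f"full self-attention: {full} pairs")
print(f"group propagation:   {grouped} pairs")
print(f"reduction factor:    {full / grouped:.1f}x")
```

With a fixed number of groups, the grouped count grows linearly in the number of image tokens, whereas full self-attention grows quadratically, which is why the gap widens as feature resolution increases.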
Original language: English
Publication status: Published - 1 May 2023
Event: The Eleventh International Conference on Learning Representations - Kigali, Rwanda
Duration: 1 May 2023 to 5 May 2023
https://iclr.cc/Conferences/2023

Conference

Conference: The Eleventh International Conference on Learning Representations
Abbreviated title: ICLR 2023
Country/Territory: Rwanda
City: Kigali
Period: 1/05/23 to 5/05/23
Internet address: https://iclr.cc/Conferences/2023
