E(2)-Equivariant Vision Transformer

Renjun Xu, Kaifan Yang, Ke Liu*, Fengxiang He

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Vision Transformer (ViT) has achieved remarkable performance in computer vision. However, the positional encoding in ViT makes it substantially more difficult to learn the intrinsic equivariance in data. Initial attempts have been made to design an equivariant ViT, but this paper proves them defective in some cases. To address this issue, we design a Group Equivariant Vision Transformer (GE-ViT) via a novel, effective positional encoding operator. We prove that GE-ViT meets all the theoretical requirements of an equivariant neural network. Comprehensive experiments on standard benchmark datasets demonstrate that GE-ViT significantly outperforms non-equivariant self-attention networks. The code is available at https://github.com/ZJUCDSYangKaifan/GEVit.
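To make the abstract's central claim concrete, the following is a minimal NumPy sketch (an illustration written for this page, not the authors' GE-ViT code; the grid size, random token contents, and sinusoidal-style encoding are all assumptions). It shows that self-attention scores computed with an absolute positional encoding fail to be equivariant to a 90-degree rotation of the patch grid, while the encoding-free scores are permutation-equivariant and pass the same check.

```python
import numpy as np

rng = np.random.default_rng(0)
H = W = 4          # hypothetical 4x4 grid of patch tokens (assumption)
d = 8              # token dimension (assumption)

# (row, col) coordinates of each patch, flattened in row-major order.
coords = np.stack(np.meshgrid(np.arange(H), np.arange(W), indexing="ij"),
                  axis=-1).reshape(-1, 2).astype(float)

# A sinusoidal-style absolute positional encoding (illustrative choice;
# any encoding that depends on absolute position shows the same failure).
freqs = rng.normal(size=(2, d // 2))
pe = np.concatenate([np.sin(coords @ freqs), np.cos(coords @ freqs)], axis=-1)

x = rng.normal(size=(H * W, d))   # random patch contents

# Permutation that a 90-degree image rotation induces on the flattened patches.
perm = np.rot90(np.arange(H * W).reshape(H, W)).reshape(-1)

def attn_logits(tokens, pos):
    """Unnormalized self-attention scores with positional encoding added."""
    z = tokens + pos
    return z @ z.T

# Rotation equivariance would require: scores of the rotated image equal the
# original scores with rows and columns permuted by `perm`.
A = attn_logits(x, pe)
print(np.allclose(attn_logits(x[perm], pe), A[np.ix_(perm, perm)]))    # False

# Without the absolute encoding, self-attention is permutation-equivariant,
# so the same check passes.
A0 = attn_logits(x, np.zeros_like(pe))
print(np.allclose(attn_logits(x[perm], np.zeros_like(pe)),
                  A0[np.ix_(perm, perm)]))                              # True
```

Per the abstract, GE-ViT's remedy is a new positional encoding operator that restores this equivariance; the sketch only demonstrates the failure mode it addresses.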
Original language: English
Title of host publication: Proceedings of the Thirty-Ninth Conference on Uncertainty in Artificial Intelligence
Editors: Robin J. Evans, Ilya Shpitser
Publisher: PMLR
Pages: 2356-2366
Number of pages: 11
Volume: 216
Publication status: Published - 1 Jul 2023
Event: 39th Conference on Uncertainty in Artificial Intelligence - Pittsburgh, United States
Duration: 31 Jul 2023 - 4 Aug 2023
Conference number: 39
https://www.auai.org/uai2023/

Publication series

Name: Proceedings of Machine Learning Research
Publisher: PMLR
ISSN (Electronic): 2640-3498

Conference

Conference: 39th Conference on Uncertainty in Artificial Intelligence
Abbreviated title: UAI 2023
Country/Territory: United States
City: Pittsburgh
Period: 31/07/23 - 4/08/23
Internet address: https://www.auai.org/uai2023/
