Constituent Attention for Vision Transformers

Li, HL; Xue, MQ; Song, J; Zhang, HF; Huang, WQ; Liang, LY; Song, ML

Song, J (corresponding author), Zhejiang Univ, Sch Software Technol, Hangzhou, Peoples R China.

COMPUTER VISION AND IMAGE UNDERSTANDING, 2023; 237 ():

Abstract

Multi-head self-attention (MSA) endows vision Transformers (ViTs) with the ability to model long-range interactions between tokens. However, recent......
