SCFormer: Integrating Hybrid Features in Vision Transformers

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Academic › peer-review

Abstract

Hybrid modules that combine self-attention and convolution operations can benefit from the advantages of both, and consequently achieve higher performance than either operation alone. However, current hybrid modules do not capitalize directly on the intrinsic relationship between self-attention and convolution, but rather introduce external mechanisms that come with increased computation cost. In this paper, we propose a new hybrid vision transformer called Shift and Concatenate Transformer (SCFormer), which benefits from the intrinsic relationship between convolution and self-attention. SCFormer is rooted in the Shift and Concatenate Attention (SCA) block, which integrates convolution and self-attention features. We propose a shifting mechanism and corresponding aggregation rules for the feature integration of SCA blocks, such that the generated features more closely approximate the optimal output features. Extensive experiments show that, with comparable computational complexity, SCFormer consistently achieves improved results over competitive baselines on image recognition and downstream tasks. Our code is available at: https://github.com/hotfinda/SCFormer.
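The abstract describes the SCA block only at a high level: a convolution branch and a self-attention branch whose features are shifted, concatenated, and aggregated. Below is a minimal sketch of such a hybrid block, written from the abstract alone; the class name `SCABlock`, the `shift_pixels` parameter, and the fusion layer are illustrative assumptions, not the authors' implementation. Refer to the repository linked above for the actual SCA block.

```python
# Minimal sketch of a "shift and concatenate" hybrid block, inferred from the
# abstract. All names and design details here are assumptions for illustration;
# the authors' real SCA block lives at https://github.com/hotfinda/SCFormer.
import torch
import torch.nn as nn


class SCABlock(nn.Module):
    """Hypothetical hybrid block: convolution and self-attention branches,
    with the attention features spatially shifted before concatenation."""

    def __init__(self, dim: int, num_heads: int = 4, shift_pixels: int = 1):
        super().__init__()
        self.shift_pixels = shift_pixels
        self.conv_branch = nn.Conv2d(dim, dim, kernel_size=3, padding=1)
        self.attn_branch = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Aggregation rule (assumed): concatenate along channels, fuse with 1x1 conv.
        self.fuse = nn.Conv2d(2 * dim, dim, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape

        # Convolution branch operates on the feature map directly.
        conv_feat = self.conv_branch(x)

        # Self-attention branch operates on flattened spatial tokens.
        tokens = x.flatten(2).transpose(1, 2)              # (B, H*W, C)
        attn_feat, _ = self.attn_branch(tokens, tokens, tokens)
        attn_feat = attn_feat.transpose(1, 2).reshape(b, c, h, w)

        # Toy shifting mechanism: roll the attention features by a few pixels
        # so the two branches contribute spatially offset contexts at fusion.
        attn_feat = torch.roll(
            attn_feat, shifts=(self.shift_pixels, self.shift_pixels), dims=(2, 3)
        )

        # Concatenate branch features and aggregate.
        return self.fuse(torch.cat([conv_feat, attn_feat], dim=1))


if __name__ == "__main__":
    block = SCABlock(dim=64)
    out = block(torch.randn(2, 64, 14, 14))
    print(out.shape)  # torch.Size([2, 64, 14, 14])
```

The sketch only shows how shifted convolution and attention features could be concatenated and fused at matched cost; the paper's specific shift pattern and aggregation rules are what it evaluates experimentally.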
Original language: English
Title of host publication: Proceedings - 2023 IEEE International Conference on Multimedia and Expo, ICME 2023
Publisher: IEEE
Pages: 1883-1888
Number of pages: 6
ISBN (Electronic): 978-1-6654-6891-6
DOIs
Publication status: Published - 2023

Publication series

Name: Proceedings - IEEE International Conference on Multimedia and Expo
Volume: 2023-July
ISSN (Print): 1945-7871
ISSN (Electronic): 1945-788X

Keywords

  • Vision transformer
  • feature integration
  • hybrid module
