Abstract
Attention layers have contributed to state-of-the-art results on vision tasks. Still, they leave room for improvement: position information is used in a fixed manner, and the computation cost is typically high. To mitigate both issues, we propose a convolution-style local attention layer (LA-layer) as a replacement for traditional attention layers. LA-layers not only encode the position information of pixels in a convolutional manner, but also produce position offsets following a novel constrained rule, so that the keys deform and yield larger receptive fields. The query and keys are processed by a novel aggregation function that outputs attention weights for the values. In our experiments with different types of ResNets, we replace convolutional layers with LA-layers and address image recognition, object detection, and instance segmentation tasks. We consistently demonstrate performance gains despite using fewer FLOPs and trainable parameters. Our code is available at: https://github.com/hotfinda/LA-layer.
Original language | English |
---|---|
Title of host publication | Proceedings - 2023 IEEE International Conference on Multimedia and Expo, ICME 2023 |
Pages | 2057-2062 |
Number of pages | 6 |
ISBN (Electronic) | 978-1-6654-6891-6 |
Publication status | Published - 25 Aug 2023 |
Publication series
Name | Proceedings - IEEE International Conference on Multimedia and Expo |
---|---|
Volume | 2023-July |
ISSN (Print) | 1945-7871 |
ISSN (Electronic) | 1945-788X |
Bibliographical note
Funding Information: This work is supported in part by a scholarship from the China Scholarship Council (CSC) under Grant No. 202106290068.
Publisher Copyright:
© 2023 IEEE.
Keywords
- Local attention
- CNN
- Deformable kernel
- Convolutional neural network