Abstract: This paper proposes an end-to-end video saliency prediction network, termed TM2SP-Net (Transformer-based Multi-level Spatiotemporal Feature Pyramid Network). Leveraging the strong ...
Abstract: Existing video inpainting approaches tend to adopt vision transformers with few customized designs, which poses two limitations. First, the conventional self-attention mechanism treats ...