ActPrompt: In-Domain Feature Adaptation via Action Cues for Video Temporal Grounding
- Yubin Wang,
- Xinyang Jiang,
- De Cheng,
- Dongsheng Li,
- Cairong Zhao
IEEE Transactions on Image Processing, Vol. 35, pp. 2714-2726
Video temporal grounding, including moment retrieval and highlight detection, is an emerging topic that aims to identify specific clips within videos. In addition to pre-trained video models, contemporary methods utilize pre-trained vision-language models (VLMs) to capture detailed characteristics of diverse scenes and objects in video frames. However, because VLMs are pre-trained on images, directly using pre-extracted VLM features neglects the domain gap between the pre-training and temporal grounding datasets, inducing domain shifts caused by the disparity in data distribution. As a result, VLMs may struggle to distinguish action-sensitive patterns from static objects, making it necessary to adapt them to the target data domain for effective feature representation in temporal grounding. In this work, we address two primary challenges to achieve this goal. First, to mitigate high adaptation costs, we propose an efficient preliminary in-domain fine-tuning paradigm that adapts features before standard downstream training, where downstream-adaptive features are learned through several well-designed pretext tasks that improve performance. Second, to integrate action-sensitive information into VLMs, we introduce Action-Cue-Injected Temporal Prompt Learning (ActPrompt), which injects action cues into the image encoder of VLMs to better discover action-sensitive visual patterns. This is followed by context-aware temporal prompt learning, which considers both action cues and temporal context to enhance the recognition of action-related patterns in downstream tasks. Extensive experiments demonstrate that ActPrompt is an off-the-shelf training framework that can be applied effectively to various SOTA methods, yielding notable improvements.
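The abstract describes injecting action cues as learnable prompts into a frozen VLM image encoder. Below is a minimal, hypothetical sketch of that general idea in PyTorch: a small module that conditions learnable prompt tokens on an action-cue vector and prepends them to frame patch tokens, so that only the prompt parameters need training. All names, shapes, and the way the cue is fused are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class TemporalPromptInjector(nn.Module):
    """Illustrative sketch (not the authors' code): prepend learnable prompt
    tokens, conditioned on an action-cue vector, to the patch-token sequence
    of a frozen image encoder. Shapes and names are assumptions."""

    def __init__(self, embed_dim: int = 512, num_prompts: int = 4):
        super().__init__()
        # Learnable base prompts shared across frames.
        self.base_prompts = nn.Parameter(torch.randn(num_prompts, embed_dim) * 0.02)
        # Small projection mapping an action cue (e.g., pooled frame-difference
        # features) into an additive offset for each prompt token.
        self.cue_proj = nn.Linear(embed_dim, num_prompts * embed_dim)

    def forward(self, patch_tokens: torch.Tensor, action_cue: torch.Tensor) -> torch.Tensor:
        # patch_tokens: (B, N, D) frame patch embeddings from the frozen encoder.
        # action_cue:   (B, D) video-level cue vector.
        b, _, d = patch_tokens.shape
        offsets = self.cue_proj(action_cue).view(b, -1, d)   # (B, P, D)
        prompts = self.base_prompts.unsqueeze(0) + offsets   # cue-conditioned prompts
        return torch.cat([prompts, patch_tokens], dim=1)     # (B, P+N, D)

# Usage: only the injector's parameters are trained; the VLM backbone stays frozen.
injector = TemporalPromptInjector()
frames = torch.randn(2, 49, 512)   # dummy patch tokens for 2 frames
cue = torch.randn(2, 512)          # dummy per-video action cue
tokens_with_prompts = injector(frames, cue)
print(tokens_with_prompts.shape)   # torch.Size([2, 53, 512])
```

The design choice illustrated here, keeping the backbone frozen and learning only a handful of prompt parameters, is what makes this style of in-domain adaptation inexpensive relative to full fine-tuning.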