Proposal-free video grounding based on motion excitation
2023, Vol. 28, No. 10, Pages: 3077-3091
Print publication date: 2023-10-16
DOI: 10.11834/jig.220109
Guo Yichen, Li Kun, Guo Dan. 2023. Proposal-free video grounding based on motion excitation. Journal of Image and Graphics, 28(10):3077-3091
Objective
Video grounding is an important and challenging task in video understanding. Given a query described in natural language, the task is to locate the video segment described by the text in an untrimmed video. Because the feature representations of the linguistic and visual modalities differ greatly, constructing an appropriate video-text multi-modal feature representation and localizing the target segment accurately and efficiently are the key difficulties of this task. To address these problems, this paper focuses on building an optimized video-text multi-modal feature representation: the motion information in the video is used to excite the motion semantics within the multi-modal features, and video grounding is performed in a proposal-free manner.
Method
A self-attention based method extracts multiple phrase features from the natural language query, and these features are fused with the video features across modalities to obtain multiple multi-modal features, each attending to a different semantic phrase. To optimize the multi-modal representation, the features are modeled along the temporal dimension and the feature channels: 1) on the temporal dimension, a skip-connection convolution, i.e., a 1D temporal convolution, models the local context of motion and aligns the semantic phrases with video segments over time; 2) on the feature channels, motion excitation computes the differences between temporally adjacent multi-modal feature vectors to build a channel weight distribution that responds to motion, thereby exciting the channels that encode motion information. To fuse the multi-modal features attending to different semantic phrases, a non-local neural network models the dependencies among the phrases, a temporal attentive pooling module aggregates the multi-modal features into a single vector, and the start and end times of the target segment are regressed from this vector.
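To make the channel-excitation step concrete, the following is a minimal sketch of a motion-excitation block written in PyTorch from the description above, not the authors' released code; the module name `MotionExcitation`, the channel-reduction ratio, the zero-padding of the last time step, and the residual re-weighting are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MotionExcitation(nn.Module):
    """Channel-wise motion excitation over a sequence of multi-modal features.

    Input:  x of shape (B, T, C), i.e., T clip-level multi-modal feature vectors.
    Output: the same shape, with motion-sensitive channels amplified.
    """

    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.squeeze = nn.Linear(channels, channels // reduction)  # channel reduction
        self.expand = nn.Linear(channels // reduction, channels)   # back to full channels
        self.sigmoid = nn.Sigmoid()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, c = x.shape
        z = self.squeeze(x)                                  # (B, T, C/r)
        # differences between temporally adjacent features approximate motion
        diff = z[:, 1:, :] - z[:, :-1, :]                    # (B, T-1, C/r)
        diff = torch.cat([diff, diff.new_zeros(b, 1, z.size(-1))], dim=1)  # pad to length T
        attn = diff.mean(dim=1)                              # temporal average pooling, (B, C/r)
        weights = self.sigmoid(self.expand(attn))            # channel weights in (0, 1), (B, C)
        # residual excitation: keep the original features and add the re-weighted part
        return x + x * weights.unsqueeze(1)


# usage on random multi-modal features: 2 videos, 64 clips, 512 channels
feats = torch.randn(2, 64, 512)
excited = MotionExcitation(channels=512)(feats)              # (2, 64, 512)
```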
Result
The effectiveness of the proposed method is verified on multiple datasets. On the Charades-STA and ActivityNet Captions datasets, the mean intersection over union (mIoU) of the model reaches 52.36% and 42.97%, respectively. The recall R@1 (Recall@1) at IoU thresholds of 0.3, 0.5 and 0.7 reaches 73.79%, 61.16% and 52.36% on Charades-STA, and 60.54%, 43.68% and 25.43% on ActivityNet Captions. Compared with methods such as LGI (local-global video-text interactions) and CPNet (contextual pyramid network), the proposed method achieves clear performance gains.
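The metrics quoted above follow the standard definitions of temporal IoU, mIoU and R@1 used in video grounding; the short sketch below illustrates them on top-1 predictions and is not the authors' evaluation script (the function names are hypothetical).

```python
from typing import List, Tuple

def temporal_iou(pred: Tuple[float, float], gt: Tuple[float, float]) -> float:
    """Intersection over union of two temporal segments given as (start, end) in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def evaluate(preds: List[Tuple[float, float]],
             gts: List[Tuple[float, float]],
             thresholds: Tuple[float, ...] = (0.3, 0.5, 0.7)):
    """Return mIoU and R@1 at each IoU threshold over a set of top-1 predictions."""
    ious = [temporal_iou(p, g) for p, g in zip(preds, gts)]
    miou = sum(ious) / len(ious)                       # mean IoU over all queries
    recall_at_1 = {t: sum(iou >= t for iou in ious) / len(ious) for t in thresholds}
    return miou, recall_at_1

# toy example: one accurate and one loose prediction
miou, r1 = evaluate([(2.0, 7.5), (10.0, 20.0)], [(2.2, 7.8), (14.0, 22.0)])
```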
Conclusion
For the video grounding task, this paper proposes a method that uses motion excitation to optimize the video-text multi-modal feature representation. Experimental results on multiple datasets demonstrate that the motion-excited features better represent the matching information between video segments and language queries.
Objective
Video grounding is an essential and challenging task in video understanding. Given a natural language query that describes a particular segment of an untrimmed video, the goal of video grounding is to locate the corresponding action segment in that video. As a high-level semantic understanding task in computer vision, video grounding faces many challenges, since it requires jointly modeling the visual and linguistic modalities. First, compared with static images, videos in the real world usually contain more complicated scenes. A video of a few minutes is typically composed of several action scenarios, each of which can be viewed as an integration of actors, objects, and motions. Second, natural language is inevitably ambiguous and subjective to some extent, and descriptions of the same activity may vary. Intuitively, there is a large semantic gap between the visual and textual modalities. Therefore, an appropriate video-text multi-modal feature representation needs to be built for accurate grounding. To resolve the challenges mentioned above, we propose a novel proposal-free method that learns appropriate multi-modal features with motion excitation. Specifically, the motion excitation is exploited to highlight motion cues in the multi-modal features for accurate grounding.
Method
The proposed method consists of three key modules: 1) feature extraction, 2) feature optimization, and 3) boundary prediction. First, in the feature extraction module, a 3D convolutional neural network (CNN) and a bi-directional long short-term memory (Bi-LSTM) layer are used to obtain the video and query features. To capture fine-grained semantic cues from the language query, we extract attention-based phrase-level features of the query. Video-text multi-modal features attending to multiple semantic phrases are obtained by fusing the phrase-level features with the video features. Subsequently, the feature optimization module highlights the motion information in these multi-modal features. The features carry contextual cues of motion along the temporal dimension; meanwhile, some channels of the features represent the dynamic motion patterns of the target moment, whereas the other channels carry irrelevant redundant information. To optimize the multi-modal representation with motion information, a skip-connection convolution and a motion excitation are used in this module. 1) For the skip-connection convolution, a 1D temporal convolution models the local context of motion and aligns it with the query along the temporal dimension. 2) For the motion excitation, the differences between temporally adjacent multi-modal feature vectors are calculated, an attention weight distribution over motion-responsive channels is constructed, and the motion-sensitive channels are activated. Finally, we aggregate the multi-modal features attending to different semantic phrases. A non-local neural network is utilized to model the dependencies among the semantic phrases, a temporal attentive pooling module aggregates the features into a single vector, and a multilayer perceptron (MLP) regresses the temporal boundaries.
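As an illustration of the final aggregation and boundary-prediction step, the sketch below shows one way to implement temporal attentive pooling followed by an MLP regressor in PyTorch; it follows the description above, but the layer sizes and the sigmoid normalization of the predicted (start, end) pair are our assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentivePoolingRegressor(nn.Module):
    """Temporal attentive pooling over T multi-modal features, then MLP boundary regression."""

    def __init__(self, channels: int, hidden: int = 256):
        super().__init__()
        self.score = nn.Linear(channels, 1)        # per-clip attention score
        self.regressor = nn.Sequential(            # MLP predicting normalized (start, end)
            nn.Linear(channels, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, 2),
            nn.Sigmoid(),                          # keep boundaries within [0, 1] of the video length
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, C) multi-modal features after phrase aggregation
        attn = F.softmax(self.score(x), dim=1)     # (B, T, 1), sums to 1 over time
        pooled = (attn * x).sum(dim=1)             # (B, C) single video-level vector
        return self.regressor(pooled)              # (B, 2) predicted (start, end)
```

The normalized predictions can then be rescaled by the video duration and supervised with a regression loss against the annotated boundaries, which is a common choice in proposal-free grounding.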
Result
Extensive experiments are carried out to verify the effectiveness of the proposed method on two public datasets, i.e., the Charades-STA dataset and the ActivityNet Captions dataset. The proposed method reaches 52.36% and 42.97% in terms of the evaluation metric mean intersection over union (mIoU) on these two datasets. In addition, the evaluation metric R@1 at IoU = {0.3, 0.5, 0.7} reaches 73.79%, 61.16% and 52.36% on Charades-STA, and 60.54%, 43.68% and 25.43% on ActivityNet Captions. The method is also compared with local-global video-text interactions (LGI) and contextual pyramid network (CPNet). Experimental results show that the proposed method achieves significant performance improvements over these methods.
Conclusion
To handle the complicated scenes in videos and bridge the gap between the video and the language, we enhance the motion patterns in the multi-modal features for video grounding. Accordingly, the skip-connection convolution and the motion excitation can be used to optimize the video-text multi-modal feature representation effectively. In this way, the model can represent the semantic matching information between video clips and text queries more accurately.
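For completeness, the skip-connection temporal convolution referred to above can be sketched as a 1D convolution over time with a residual connection; this reading of the description, the kernel size, and the ReLU placement are assumptions on our part.

```python
import torch
import torch.nn as nn

class SkipConnectionConv(nn.Module):
    """1D temporal convolution with a residual (skip) connection over clip-level features."""

    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        self.conv = nn.Conv1d(channels, channels, kernel_size,
                              padding=kernel_size // 2)      # keep the temporal length unchanged
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, C); nn.Conv1d expects (B, C, T)
        y = self.conv(x.transpose(1, 2)).transpose(1, 2)     # local temporal context of motion
        return x + self.relu(y)                              # skip connection preserves the input
```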
video grounding; motion excitation; multi-modal feature representation; proposal-free; computer vision; video understanding
Carreira J and Zisserman A. 2017. Quo vadis, action recognition? A new model and the kinetics dataset//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Honolulu, USA: IEEE: 4724-4733 [DOI: 10.1109/cvpr.2017.502]
Chen J Y, Chen X P, Ma L, Jie Z Q and Chua T S. 2018. Temporally grounding natural sentence in video//Proceedings of 2018 Conference on Empirical Methods in Natural Language Processing. Brussels, Belgium: ACL: 162-171 [DOI: 10.18653/v1/d18-1015]
Chen L, Lu C J, Tang S L, Xiao J, Zhang D, Tan C L and Li X L. 2020. Rethinking the bottom-up framework for query-based video localization. Proceedings of the AAAI Conference on Artificial Intelligence, 34(7): 10551-10558 [DOI: 10.1609/aaai.v34i07.6627]
Chen S X and Jiang Y G. 2019. Semantic proposal for activity localization in videos via sentence query. Proceedings of the AAAI Conference on Artificial Intelligence, 33(1): 8199-8206 [DOI: 10.1609/aaai.v33i01.33018199]
Gao J Y, Sun C, Yang Z H and Nevatia R. 2017. TALL: temporal activity localization via language query//Proceedings of 2017 IEEE International Conference on Computer Vision (ICCV). Venice, Italy: IEEE: 5277-5285 [DOI: 10.1109/iccv.2017.563]
Ge R Z, Gao J Y, Chen K and Nevatia R. 2019. MAC: mining activity concepts for language-based temporal localization//Proceedings of 2019 IEEE Winter Conference on Applications of Computer Vision (WACV). Waikoloa Village, USA: IEEE: 245-253 [DOI: 10.1109/wacv.2019.00032]
Ghosh S, Agarwal A, Parekh Z and Hauptmann A. 2019. ExCL: extractive clip localization using natural language descriptions//Proceedings of 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Minneapolis, USA: ACL: 1984-1990 [DOI: 10.18653/v1/N19-1198]
Hendricks L A, Wang O, Shechtman E, Sivic J, Darrell T and Russell B. 2017. Localizing moments in video with natural language//Proceedings of 2017 IEEE International Conference on Computer Vision (ICCV). Venice, Italy: IEEE: 5804-5813 [DOI: 10.1109/iccv.2017.618]
Hu J, Shen L and Sun G. 2018. Squeeze-and-excitation networks//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 7132-7141 [DOI: 10.1109/cvpr.2018.00745]
Jiang B Y, Wang M M, Gan W H, Wu W and Yan J J. 2019. STM: spatiotemporal and motion encoding for action recognition//Proceedings of 2019 IEEE/CVF International Conference on Computer Vision (ICCV). Seoul, Korea (South): IEEE: 2000-2009 [DOI: 10.1109/iccv.2019.00209]
Kingma D P and Ba J. 2015. Adam: a method for stochastic optimization//Proceedings of 2015 International Conference on Learning Representations. San Diego, USA: 1-13 [DOI: 10.48550/arXiv.1412.6980]
Krishna R, Hata K, Ren F, Li F F and Niebles J C. 2017. Dense-captioning events in videos//Proceedings of 2017 IEEE International Conference on Computer Vision (ICCV). Venice, Italy: IEEE: 706-715 [DOI: 10.1109/iccv.2017.83]
Li K, Guo D and Wang M. 2021. Proposal-free video grounding with contextual pyramid network. Proceedings of the AAAI Conference on Artificial Intelligence, 35(3): 1902-1910 [DOI: 10.1609/aaai.v35i3.16285]
Li Y, Ji B, Shi X T, Zhang J G, Kang B and Wang L M. 2020. TEA: temporal excitation and aggregation for action recognition//Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Seattle, USA: IEEE: 906-915 [DOI: 10.1109/cvpr42600.2020.00099]
Liu M, Wang X, Nie L Q, He X N, Chen B Q and Chua T S. 2018a. Attentive moment retrieval in videos//Proceedings of the 41st International ACM SIGIR Conference on Research and Development in Information Retrieval. Ann Arbor, USA: ACM: 15-24 [DOI: 10.1145/3209978.3210003]
Liu M, Wang X, Nie L Q, Tian Q, Chen B Q and Chua T S. 2018b. Cross-modal moment localization in videos//Proceedings of the 26th ACM International Conference on Multimedia. Seoul, Korea (South): ACM: 843-851 [DOI: 10.1145/3240508.3240549]
Liu X F, Nie X S, Teng J Y, Lian L and Yin Y L. 2021. Single-shot semantic matching network for moment localization in videos. ACM Transactions on Multimedia Computing, Communications, and Applications, 17(3): #84 [DOI: 10.1145/3441577]
Mun J, Cho M and Han B. 2020. Local-global video-text interactions for temporal grounding//Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, USA: IEEE: 10807-10816 [DOI: 10.1109/cvpr42600.2020.01082]
Pennington J, Socher R and Manning C. 2014. GloVe: global vectors for word representation//Proceedings of 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Doha, Qatar: ACL: 1532-1543 [DOI: 10.3115/v1/D14-1162]
Rodriguez-Opazo C, Marrese-Taylor E, Saleh F S, Li H D and Gould S. 2020. Proposal-free temporal moment localization of a natural-language query in video using guided attention//Proceedings of 2020 IEEE Winter Conference on Applications of Computer Vision (WACV). Snowmass Village, USA: IEEE: 2453-2462 [DOI: 10.1109/wacv45572.2020.9093328]
Sigurdsson G A, Varol G, Wang X L, Farhadi A, Laptev I and Gupta A. 2016. Hollywood in homes: crowdsourcing data collection for activity understanding//Proceedings of the 14th European Conference on Computer Vision (ECCV). Amsterdam, the Netherlands: Springer: 510-526 [DOI: 10.1007/978-3-319-46448-0_31]
Sun X Y, Wang H L and He B. 2021. MABAN: multi-agent boundary-aware network for natural language moment retrieval. IEEE Transactions on Image Processing, 30: 5589-5599 [DOI: 10.1109/tip.2021.3086591]
Tran D, Bourdev L, Fergus R, Torresani L and Paluri M. 2015. Learning spatiotemporal features with 3D convolutional networks//Proceedings of 2015 IEEE International Conference on Computer Vision (ICCV). Santiago, Chile: IEEE: 4489-4497 [DOI: 10.1109/iccv.2015.510]
Wang J W, Ma L and Jiang W H. 2020. Temporally grounding language queries in videos by contextual boundary-aware prediction. Proceedings of the AAAI Conference on Artificial Intelligence, 34(7): 12168-12175 [DOI: 10.1609/aaai.v34i07.6897]
Wang L M, Xiong Y J, Wang Z, Qiao Y, Lin D H, Tang X O and Van Gool L. 2016. Temporal segment networks: towards good practices for deep action recognition//Proceedings of the 14th European Conference on Computer Vision (ECCV). Amsterdam, the Netherlands: Springer: 20-36 [DOI: 10.1007/978-3-319-46484-8_2]
Wang W N, Huang Y and Wang L. 2019. Language-driven temporal activity localization: a semantic matching reinforcement learning model//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Long Beach, USA: IEEE: 334-343 [DOI: 10.1109/cvpr.2019.00042]
Wang X L, Girshick R, Gupta A and He K M. 2018. Non-local neural networks//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 7794-7803 [DOI: 10.1109/cvpr.2018.00813]
Xiao S N, Chen L, Zhang S Y, Ji W, Shao J, Ye L and Xiao J. 2021. Boundary proposal network for two-stage natural language video localization. Proceedings of the AAAI Conference on Artificial Intelligence, 35(4): 2986-2994 [DOI: 10.1609/aaai.v35i4.16406]
Yuan Y T, Mei T and Zhu W W. 2019. To find where you talk: temporal sentence localization in video with attention based location regression. Proceedings of the AAAI Conference on Artificial Intelligence, 33(1): 9159-9166 [DOI: 10.1609/aaai.v33i01.33019159]
Zeng R H, Xu H M, Huang W B, Chen P H, Tan M K and Gan C. 2020. Dense regression network for video grounding//Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Seattle, USA: IEEE: 10284-10293 [DOI: 10.1109/cvpr42600.2020.01030]
Zhang D, Dai X Y, Wang X, Wang Y F and Davis L S. 2019a. MAN: moment alignment network for natural language moment retrieval via iterative graph adjustment//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Long Beach, USA: IEEE: 1247-1257 [DOI: 10.1109/cvpr.2019.00134]
Zhang S Y, Peng H W, Fu J J and Luo J B. 2020. Learning 2D temporal adjacent networks for moment localization with natural language. Proceedings of the AAAI Conference on Artificial Intelligence, 34(7): 12870-12877 [DOI: 10.1609/aaai.v34i07.6984]
Zhang S Y, Su J S and Luo J B. 2019b. Exploiting temporal relationships in video moment localization with natural language//Proceedings of the 27th ACM International Conference on Multimedia. Nice, France: ACM: 1230-1238 [DOI: 10.1145/3343031.3350879]