Large language model-guided iterative optimization for video retrieval data
Pages: 1-15 (2024)
Published Online: 23 December 2024
DOI: 10.11834/jig.240545
Zeng Runhao, Li Jialiang, Zhuo Yishen, et al. Large language model-guided iterative optimization for video retrieval data[J]. Journal of Image and Graphics,
Objective
Video-text cross-modal retrieval aims to retrieve, from a video library or a given video, the videos or video moments that are semantically most similar to a given query text, and it is an important application of video understanding. Existing methods focus mainly on improving semantic matching between modalities through cross-modal interaction, but overlook a problem in current datasets: a single query text may correspond to multiple video moments or videos. During training, this problem can confuse the model and limit its performance. To address it, this paper proposes a large language model-guided iterative optimization method for video retrieval data.
Method
First, visual-text similarity is used to locate the query texts in the dataset that suffer from the one-to-many problem, together with their corresponding videos, and fine-grained information not covered by the query text, such as additional objects, detailed appearances, and color attributes, is extracted from each video. This information and the original query text are then fed into a large language model, which summarizes and refines them into a more fine-grained query text. Meanwhile, an iteration condition based on video-text semantic association automatically decides whether to update the current prompt and run another round of refinement or to terminate, so that the query text is optimized continuously. Finally, the optimized data are used to train video-text cross-modal retrieval models.
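As a concrete illustration of how the one-to-many pairs can be located by visual-text similarity, the following minimal sketch is not the authors' implementation: it assumes a CLIP-style encoder from the Hugging Face transformers library (openai/clip-vit-base-patch32), represents each video by mean-pooled frame embeddings, and flags a query as a hard sample whenever its annotated video is not its nearest neighbor.

# Sketch of the easy/hard sample split via visual-text similarity.
# Assumptions (not from the paper): CLIP ViT-B/32 as the encoder and
# mean-pooled frame embeddings as the video representation.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def encode_video(frames: list[Image.Image]) -> torch.Tensor:
    """Mean-pool frame embeddings into a single video embedding."""
    inputs = processor(images=frames, return_tensors="pt")
    feats = model.get_image_features(**inputs)            # (num_frames, dim)
    feats = feats / feats.norm(dim=-1, keepdim=True)
    return feats.mean(dim=0)                              # (dim,)

@torch.no_grad()
def split_easy_hard(queries: list[str], video_feats: torch.Tensor):
    """queries[i] is annotated for video i; return indices of easy and hard pairs."""
    inputs = processor(text=queries, return_tensors="pt", padding=True, truncation=True)
    text_feats = model.get_text_features(**inputs)
    text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)
    video_feats = video_feats / video_feats.norm(dim=-1, keepdim=True)
    sim = text_feats @ video_feats.T                      # (num_queries, num_videos)
    nearest = sim.argmax(dim=1)
    easy = [i for i in range(len(queries)) if nearest[i].item() == i]
    hard = [i for i in range(len(queries)) if nearest[i].item() != i]
    return easy, hard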
Result
On the video moment retrieval task, four neural network models trained on the Charades-Sentence Temporal Annotations (Charades-STA) dataset optimized by the proposed method improve Recall@Top1 (R@1) at an intersection over union (IoU) of 0.5 by 2.42% on average, and two neural network models on the query-based video highlights (QVHighlights) dataset improve by 3.42% on average. For video retrieval, two neural network models improve R@1 on the Microsoft Research Video to Text (MSR-VTT) dataset by 1.4% on average.
Conclusion
The proposed large language model-guided iterative optimization method for video retrieval data alleviates the one-to-many problem in existing datasets and significantly improves model performance.
Objective
In recent years, video-text cross-modal retrieval has attracted widespread attention from academia and industry because of its application value in areas such as video recommendation, public safety, sports analysis, and personalized advertising. The task primarily comprises video retrieval (VR) and video moment retrieval (VMR), which aim to identify the videos or video moments in a video library or a specific video that are semantically most similar to a given query text. The inherent heterogeneity between video and text, which belong to different modalities, makes direct feature matching highly challenging. Thus, the key challenge in video-text cross-modal retrieval lies in effectively aligning these two types of cross-modal data in the feature space to achieve precise semantic relevance calculation. Current methods primarily focus on enhancing semantic matching across modalities through cross-modal interactions on existing datasets to improve retrieval performance. Although significant progress has been made in model design, issues inherent to the datasets themselves remain largely unexplored. In the context of video-text cross-modal retrieval, this study observes an ill-posed problem when training with existing datasets: a single query text may correspond to multiple videos or video moments, which makes the retrieval result non-unique. These one-to-many samples frequently confuse the model during training, hinder the alignment of cross-modal feature representations, and degrade overall performance. For instance, if a query text describes both a target video and a non-target video, retrieving the latter during training is penalized as incorrect, artificially increasing the distance between the query text and the non-target video in the feature space despite their high semantic relevance. This paper defines such problematic one-to-many samples as hard samples and one-to-one samples as easy samples. To address this issue, this paper proposes a large language model-guided iterative optimization method for video retrieval data. By leveraging the built-in knowledge of large language models, the method augments one-to-many text-video pairs with fine-grained information and iteratively refines them into one-to-one mappings.
Method
Initially, the dataset is divided into easy and hard sample sets based on visual-text similarity. Specifically, the similarity between each query text and all videos is calculated; if the similarity between the query text and its target video is not the highest, the data pair is assigned to the hard sample set; otherwise, it is assigned to the easy sample set. For videos in the hard sample set, several frames are uniformly sampled and fed into an image-to-text generation model to produce frame-level descriptive texts. This process aims to capture fine-grained information, such as objects not described by the query text, detailed appearances, and color attributes in the video. However, because multiple frames may contain similar scenes and objects, the extracted fine-grained textual descriptions are often redundant and noisy. To address this, an iterative optimization module based on video-text semantic associations is introduced. This module combines the original query text with the fine-grained information extracted from the target video, integrates them into a carefully designed prompt template, and feeds the prompt into a large language model, which then generates a refined, fine-grained, and unique query text. The quality of the optimization results depends significantly on the design of the prompt templates. The templates include the following key elements: 1) clear task descriptions; 2) relevant examples that meet the specified conditions; and 3) specific requirements, such as extracting content that co-occurs across multiple frames during summarization. The emphasis on co-occurring content is justified for two reasons: first, such content often carries critical and essential information; second, summarizing shared elements reduces the likelihood of introducing erroneous descriptions. High-quality outputs from large language models typically result from multiple interactions with the user, because these models can refine their responses based on user feedback. Inspired by this, the study automates the optimization process without requiring a predefined number of interaction rounds. To further optimize the fine-grained query text, an iteration condition based on video-text semantic association is designed. Specifically, the optimized query text and the corresponding video are encoded by an encoder; if the similarity of the extracted features in the feature space meets a predefined condition, the optimized query text is deemed satisfactory and the optimization process terminates. Otherwise, the current optimization results are used to update the prompt, and the query text is refined iteratively until no query text in the dataset exhibits the one-to-many issue. Finally, the optimized data are used to train the video-text cross-modal retrieval model.
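The iterative refinement loop can be sketched as follows. This is an illustrative sketch rather than the paper's code: it assumes an OpenAI-compatible chat API, and the model name (gpt-4o-mini), the prompt wording, the similarity threshold, and the helper similarity_fn are all hypothetical placeholders. It stops either when the video-text semantic condition is satisfied or after a fixed round budget.

# Sketch of the LLM-guided iterative query refinement loop.
# Assumptions (not from the paper): an OpenAI-compatible chat endpoint,
# a cosine-similarity threshold as the iteration condition, and a
# caller-supplied similarity_fn that scores a query against its target video.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT_TEMPLATE = (
    "Task: rewrite the query so that it uniquely describes the target video.\n"
    "Original query: {query}\n"
    "Frame-level captions: {captions}\n"
    "Requirement: keep only content that co-occurs across frames; "
    "return a single fine-grained sentence."
)

def refine_query(query: str, frame_captions: list[str],
                 similarity_fn, threshold: float = 0.3,
                 max_rounds: int = 5) -> str:
    """Iteratively refine `query` until the video-text similarity
    condition is met or the round budget is exhausted."""
    current = query
    for _ in range(max_rounds):
        prompt = PROMPT_TEMPLATE.format(query=current,
                                        captions="; ".join(frame_captions))
        response = client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder model name
            messages=[{"role": "user", "content": prompt}],
        )
        current = response.choices[0].message.content.strip()
        # Iteration condition: stop once the refined query is similar
        # enough to its target video in the shared feature space.
        if similarity_fn(current) >= threshold:
            break
    return current

In practice, the stopping condition could equally be that the target video becomes the refined query's top-1 match, which directly mirrors the one-to-many criterion used for the initial easy/hard split.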
Result
The effectiveness of the proposed method was validated on multiple mainstream video-text cross-modal retrieval datasets. In the video moment retrieval task, four neural network models trained on the Charades-STA dataset optimized by the proposed method showed an average improvement of 2.42% in R@1 at IoU=0.5, with a maximum improvement of 3.23%. At IoU=0.7, the improvement reached up to 4.38%. On the QVHighlights dataset, the performance of MomentDETR and QDDETR improved by 5.48% and 1.35%, respectively, an average improvement of 3.42% at IoU=0.7. In the video retrieval task, two methods demonstrated an average improvement of 1.4% in R@1 on the MSR-VTT dataset, with a maximum improvement of 1.6%. These results demonstrate the method's effectiveness and its generalizability across different datasets.
Conclusion
The proposed large language model-guided iterative optimization method for video retrieval data effectively alleviates the one-to-many issue in datasets. A single optimization pass over a dataset can enhance the retrieval performance of multiple methods. This approach offers a novel perspective for video-text cross-modal retrieval research and promotes advancements in related technologies.
Keywords
video understanding; cross-modal retrieval; cross-modal feature alignment; large language model; data optimization
Anne Hendricks L, Wang O, Shechtman E, Sivic J, Darrell T and Russell B. 2017. Localizing moments in video with natural language//Proceedings of 2017 IEEE International Conference on Computer Vision (ICCV). Venice, Italy: IEEE: 5803-5812 [DOI: 10.1109/ICCV.2017.618]
Carreira J and Zisserman A. 2017. Quo vadis, action recognition? a new model and the kinetics dataset//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Honolulu, USA: IEEE: 6299-6308 [DOI: 10.1109/CVPR.2017.502]
Chen L, Xi X M and Liu L B. 2024. Survey on video-text cross-modal retrieval. Computer Engineering and Applications, 60(4): 1-20 [DOI: 10.3778/j.issn.1002-8331.2306-0382]
Chen S Z, Zhao Y D, Jin Q and Wu Q. 2020. Fine-grained video-text retrieval with hierarchical graph reasoning//Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Seattle, USA: IEEE: 10638-10647 [DOI: 10.1109/CVPR42600.2020.01065]
Deng C R, Chen Q, Qin P D, Chen D and Wu Q. 2023. Prompt switch: Efficient CLIP adaptation for text-video retrieval//Proceedings of 2023 IEEE/CVF International Conference on Computer Vision (ICCV). Paris, France: IEEE: 15648-15658 [DOI: 10.1109/ICCV51070.2023.01434]
Fang B, Wu W H, Liu C, Zhou Y, Song Y X, Wang W P, Shu X B, Ji X Y and Wang J D. 2023. UATVR: Uncertainty-adaptive text-video retrieval//Proceedings of 2023 IEEE/CVF International Conference on Computer Vision (ICCV). Paris, France: IEEE: 13723-13733 [DOI: 10.1109/ICCV51070.2023.01262]
Gabeur V, Sun C, Alahari K and Schmid C. 2020. Multi-modal transformer for video retrieval//Proceedings of the 16th European Conference on Computer Vision. Glasgow, UK: Springer: 214-229 [DOI: 10.1007/978-3-030-58548-8_13]
Gao J Y and Xu C S. 2021. Fast video moment retrieval//Proceedings of 2021 IEEE/CVF International Conference on Computer Vision (ICCV). Montreal, Canada: IEEE: 1523-1532 [DOI: 10.1109/ICCV48922.2021.00155]
Gao J Y, Sun C, Yang Z H and Nevatia R. 2017. TALL: Temporal activity localization via language query//Proceedings of 2017 IEEE International Conference on Computer Vision (ICCV). Venice, Italy: IEEE: 5267-5275 [DOI: 10.1109/ICCV.2017.563]
Gorti S K, Vouitsis N, Ma J W, Golestan K, Volkovs M, Garg A and Yu G W. 2022. X-Pool: Cross-modal language-video attention for text-video retrieval//Proceedings of 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). New Orleans, USA: IEEE: 5006-5015 [DOI: 10.1109/CVPR52688.2022.00495]
Hadamard J. 1902. Sur les problèmes aux dérivées partielles et leur signification physique. Princeton University Bulletin, 13: 49-52
He C and Wei H X. 2023. Image retrieval based on transformer and asymmetric learning strategy. Journal of Image and Graphics, 28(2): 535-544 [DOI: 10.11834/jig.210842]
Huang J B, Jin H L, Gong S G and Liu Y. 2022. Video activity localisation with uncertainties in temporal boundary//Proceedings of the 17th European Conference on Computer Vision. Tel Aviv, Israel: Springer: 724-740 [DOI: 10.1007/978-3-031-19830-4_41]
Jin P, Li H, Cheng Z S, Li K H, Ji X Y, Liu C, Yuan L and Chen J. 2023. DiffusionRet: Generative text-video retrieval with diffusion model//Proceedings of 2023 IEEE/CVF International Conference on Computer Vision (ICCV). Paris, France: IEEE: 2470-2481 [DOI: 10.1109/ICCV51070.2023.00234]
Lei J, Berg T L and Bansal M. 2021. Detecting moments and highlights in videos via natural language queries//Proceedings of the 35th Conference on Neural Information Processing Systems. [s.l.]: Curran Associates: 11846-11858 [DOI: 10.48550/arXiv.2107.09609]
Li K C, He Y N, Wang Y, Li Y Z, Wang W H, Luo P, Wang Y L, Wang L M and Qiao Y. 2024. VideoChat: Chat-centric video understanding [EB/OL]. [2024-01-04]. https://arxiv.org/pdf/2305.06355
Li P D, Xie C W, Xie H T, Zhao L M, Zhang L, Zheng Y, Zhao D L and Zhang Y D. 2023. MomentDiff: Generative video moment retrieval from random to real//Proceedings of the 37th Conference on Neural Information Processing Systems. New Orleans, USA: Curran Associates: 65948-65966 [DOI: 10.48550/arXiv.2307.02869]
Liu H F, Chen J J, Li L, Bao B K, Li Z C, Liu J Y and Nie L Q. 2023. Cross-modal representation learning and generation. Journal of Image and Graphics, 28(6): 1608-1629 [DOI: 10.11834/jig.230035]
Liu R Y, Huang J J, Li G, Feng J S, Wu X L and Li T H. 2023. Revisiting temporal modeling for CLIP-based image-to-video knowledge transferring//Proceedings of 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Vancouver, Canada: IEEE: 6555-6564 [DOI: 10.1109/CVPR52729.2023.00634]
Liu Y, Cheng M, Wang F P, Li D X, Liu W and Fan J L. 2020. Deep hashing image retrieval methods. Journal of Image and Graphics, 25(7): 1296-1317 [DOI: 10.11834/jig.190518]
Liu Y, Li S Y, Wu Y, Chen C W, Shan Y and Qie X H. 2022. UMT: Unified multi-modal transformers for joint video moment retrieval and highlight detection//Proceedings of 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). New Orleans, USA: IEEE: 3042-3051 [DOI: 10.1109/CVPR52688.2022.00305]
Liu Y Q, Xiong P F, Xu L H, Cao S M and Jin Q. 2022. TS2-Net: Token shift and selection transformer for text-video retrieval//Proceedings of the 17th European Conference on Computer Vision. Tel Aviv, Israel: Springer: 319-335 [DOI: 10.1007/978-3-031-19781-9_19]
Liu Z H, Li J, Xie H T, Li P D, Ge J N, Liu S A and Jin G Q. 2024. Towards balanced alignment: Modal-enhanced semantic modeling for video moment retrieval//Proceedings of the AAAI Conference on Artificial Intelligence. Vancouver, Canada: AAAI: 3855-3863 [DOI: 10.1609/aaai.v38i4.28177]
Li Y G and Wu H Y. 2012. A clustering method based on K-means algorithm. Physics Procedia, 25: 1104-1109 [DOI: 10.1016/j.phpro.2012.03.206]
Luo H S, Ji L, Zhong M, Chen Y, Lei W, Duan N and Li T R. 2022. CLIP4Clip: An empirical study of CLIP for end to end video clip retrieval and captioning. Neurocomputing, 508: 293-304 [DOI: 10.1016/j.neucom.2022.07.028]
Ma Y W, Xu G H, Sun X S, Yan M, Zhang J and Ji R. 2022. X-CLIP: End-to-end multi-grained contrastive learning for video-text retrieval//Proceedings of the 30th ACM International Conference on Multimedia. Lisbon, Portugal: ACM: 638-647 [DOI: 10.1145/3503161.3547910]
Moon W, Hyun S, Park S, Park D and Heo J P. 2023. Query-dependent video representation for moment retrieval and highlight detection//Proceedings of 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Vancouver, Canada: IEEE: 23023-23033 [DOI: 10.1109/CVPR52729.2023.02205]
Moon W, Hyun S, Lee S and Heo J P. 2024. Correlation-guided query-dependency calibration in video representation learning for temporal groundings [EB/OL]. [2024-03-30]. https://arxiv.org/pdf/2311.08835
Pennington J, Socher R and Manning C D. 2014. GloVe: Global vectors for word representation//Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Doha, Qatar: ACL: 1532-1543 [DOI: 10.3115/v1/D14-1162]
Radford A, Kim J W, Hallacy C, Ramesh A, Goh G, Agarwal S, Sastry G, Askell A, Mishkin P, Clark J, Krueger G and Sutskever I. 2021. Learning transferable visual models from natural language supervision//Proceedings of the 38th International Conference on Machine Learning. Vienna, Austria: PMLR: 8748-8763 [DOI: 10.48550/arXiv.2103.00020]
Sigurdsson G A, Varol G, Wang X L, Laptev I and Gupta A. 2016. Hollywood in homes: Crowdsourcing data collection for activity understanding//Proceedings of the 14th European Conference on Computer Vision. Amsterdam, Netherlands: Springer: 660-676 [DOI: 10.1007/978-3-030-58568-6_39]
Simonyan K and Zisserman A. 2015. Very deep convolutional networks for large-scale image recognition//Proceedings of the 3rd International Conference on Learning Representations. San Diego, USA: ICLR: 1-14 [DOI: 10.48550/arXiv.1409.1556]
Sun H, Zhou M Y, Chen W J and Xie W. 2024. TR-DETR: Task-reciprocal transformer for joint moment retrieval and highlight detection//Proceedings of the AAAI Conference on Artificial Intelligence. Vancouver, Canada: AAAI: 4998-5007 [DOI: 10.1609/aaai.v38i5.28304]
Wang W H, Lv Q S, Yu W M, Hong W Y, Qi J, Wang Y, Ji J H, Yang Z Y, Zhao L, Song X X, Xu J Z, Xu B, Li J Z, Dong Y X, Ding M and Tang J. 2024. CogVLM: Visual expert for pretrained language models [EB/OL]. [2024-02-04]. https://arxiv.org/pdf/2311.03079
Wang Y G, Kang X D, Guo J, Li B, Zhang H L and Liu H Q. 2020. Image hash retrieval with DenseNet. Journal of Image and Graphics, 25(5): 900-912 [DOI: 10.11834/jig.190416]
Wang Y, Li K C, Li Y Z, He Y N, Huang B K, Zhao Z Y, Zhang H J, Xu J L, Liu Y, Wang Z, Xing S, Chen G, Pan J T, Yu J S, Wang Y L, Wang L M and Qiao Y. 2022. InternVideo: General video foundation models via generative and discriminative learning [EB/OL]. [2022-12-07]. https://arxiv.org/pdf/2212.03191
Xu J, Mei T, Yao T and Rui Y. 2016. MSR-VTT: A large video description dataset for bridging video and language//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Las Vegas, USA: IEEE: 5288-5296 [DOI: 10.1109/CVPR.2016.571]
Xu Z, Chen D, Wei K, Deng C and Xue H. 2022. HiSA: Hierarchically semantic associating for video temporal grounding. IEEE Transactions on Image Processing, 31: 5178-5188 [DOI: 10.1109/TIP.2022.3191841]
Yin Q Y, Huang Y, Zhang J G, Wu S and Wang L. 2021. Survey on deep learning based cross-modal retrieval. Journal of Image and Graphics, 26(6): 1368-1388 [DOI: 10.11834/jig.200862]
Yu Y, Kim J and Kim G. 2018. A joint sequence fusion model for video question answering and retrieval//Proceedings of the 15th European Conference on Computer Vision. Munich, Germany: Springer: 471-487 [DOI: 10.1007/978-3-030-01234-2_29]
Zhang D, Dai X Y, Wang X, Wang Y F and Davis L S. 2019. MAN: Moment alignment network for natural language moment retrieval via iterative graph adjustment//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Long Beach, USA: IEEE: 1247-1257 [DOI: 10.1109/CVPR.2019.00134]
Zhang H Y, Wang T B, Li M Z, Zhao Z, Pu S L and Wu F. 2022. Comprehensive review of visual-language-oriented multimodal pretraining methods. Journal of Image and Graphics, 27(9): 2652-2682 [DOI: 10.11834/jig.220173]
Zhang S Y, Peng H W, Fu J L and Luo J B. 2020. Learning 2D temporal adjacent networks for moment localization with natural language//Proceedings of the AAAI Conference on Artificial Intelligence. New York, USA: AAAI: 12870-12877 [DOI: 10.1609/aaai.v34i07.6984]