Video highlight extraction based on the interests of users
2018, Vol. 23, No. 5, pp. 748-755
Received: 2017-07-10
Revised: 2017-10-19
Published in print: 2018-05-16
DOI: 10.11834/jig.170365
Objective
Video highlight extraction is a hot research topic in video content annotation, content-based video retrieval, and related fields. Existing methods extract highlights mainly on the basis of low-level video features and ignore the influence of user interests on the extraction results, so the extracted highlights may not match user expectations. On the other hand, semantic modeling of user interests requires a large number of labeled training videos to obtain a reasonably robust semantic classifier, and annotating such a large training set is time consuming and labor intensive. Considering that the Web contains abundant and easily accessible images, transferring knowledge from Web images into the semantic models of video segments can greatly reduce the video annotation effort. We therefore propose a framework for extracting video highlights oriented to user interests by leveraging Web images.
Method
We model user interest semantics with a large number of Web images. Knowledge obtained from the Web is diverse and noisy, and using it blindly and indiscriminately would degrade highlight extraction; we therefore group the images by semantic similarity, and images that share similar semantics but are retrieved with different keywords form a synonym image group. On this basis, we propose a joint synonym group weight model that assigns different weights to different image groups according to their semantic relevance to the video. First, according to the user's interest, image sets semantically related to that interest are retrieved from a Web image search engine as the knowledge source for interest-oriented highlight extraction. Then, the knowledge learned from the images is transferred to the video through joint group weight learning over the synonym image groups. Finally, the semantic model learned from the image sets is applied to the candidate segments to extract the highlights.
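The following is a minimal sketch of this pipeline, not the authors' implementation: `search_images` and `extract_features` are hypothetical placeholders for an image search API and a feature extractor, the group classifiers are plain linear SVMs, and the group weights are taken as given rather than learned by the joint optimization described above.

```python
import numpy as np
from sklearn.svm import LinearSVC

def search_images(keyword, n=200):
    """Hypothetical: return up to n images for `keyword` from a Web image search engine."""
    raise NotImplementedError

def extract_features(images_or_frames):
    """Hypothetical: return an (n_samples, n_dims) feature matrix for images or video frames."""
    raise NotImplementedError

def build_synonym_groups(synonyms, negative_images):
    """Train one classifier per synonym image group.

    Each group contains images retrieved with one synonym of the user-interest
    concept (e.g., "birthday party" vs. "birthday celebration").
    """
    neg = extract_features(negative_images)            # shared negative pool
    group_classifiers = []
    for word in synonyms:
        pos = extract_features(search_images(word))
        X = np.vstack([pos, neg])
        y = np.r_[np.ones(len(pos)), np.zeros(len(neg))]
        group_classifiers.append(LinearSVC(C=1.0).fit(X, y))
    return group_classifiers

def extract_highlights(frame_features, group_classifiers, group_weights, top_k=5):
    """Score frames as a weighted sum of group classifier scores and return the top_k indices."""
    scores = sum(w * clf.decision_function(frame_features)
                 for w, clf in zip(group_weights, group_classifiers))
    return np.argsort(-scores)[:top_k]
```

In the proposed framework the group weights are not fixed in advance but are learned jointly with the classifiers.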
Result
We evaluate the proposed method on videos from the CCV database and compare it with several existing video keyframe extraction algorithms. The experiments show that the mean average precision of our method reaches 46.54, a 21.6% improvement over the other algorithms, with no increase in running time. In addition, to investigate the influence of the different balancing parameters in the optimization and to further verify the effectiveness of the method, we remove each regularization term of the algorithm in turn and measure its effect on the framework. The results show that removing any term clearly lowers the accuracy, which demonstrates the effectiveness of the proposed joint group weight model for extracting the video segments a user is interested in.
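As a hedged illustration (the abstract does not spell out the evaluation protocol), the reported mAP can be read as the per-video average precision of the ranked frames, averaged over the test videos:

```python
import numpy as np
from sklearn.metrics import average_precision_score

def mean_average_precision(per_video_results):
    """per_video_results: list of (frame_labels, frame_scores) pairs, where
    frame_labels marks frames the user finds interesting (1) or not (0)."""
    aps = [average_precision_score(labels, scores)
           for labels, scores in per_video_results]
    return 100.0 * np.mean(aps)   # assuming the reported 46.54 is a percentage
```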
Conclusion
We propose a video highlight extraction method oriented to user interest semantics, which extracts, for each user, the video segments that match that user's particular focus of interest.
Objective
Video highlight extraction is of interest in video summary, organization, browsing, and indexing. Current research mainly focuses on extraction by optimizing the low-level feature diversity or representativeness of video frames, ignoring the interests of users, which leads to extraction results that are inconsistent with the expectations of users. However, collecting the large number of labeled videos required to model different user interest concepts for different videos is time consuming and labor intensive.
Method
We propose to learn models for user interest concepts on different videos by leveraging numerous Web images, which cover many roughly annotated concepts and are often captured in a maximally informative manner, to alleviate the labeling process. However, knowledge from the Web is noisy and diverse, such that brute-force knowledge transfer may adversely affect the highlight extraction performance. In this study, we propose a novel user-oriented keyframe extraction framework for online videos by leveraging a large number of Web images queried by synonyms from image search engines. Our work is based on the observation that users may have different interests in different frames when browsing the same video. By using user interest-related words as keywords, we can easily collect weakly labeled image data for interest concept model training. Given that different users may have different descriptions of the same interest concept, we denote different descriptions with similar semantic meanings as synonyms. When querying images from the Web, we use synonyms as keywords to avoid semantic one-sidedness. An image set returned by a synonym is considered a synonym group. Different synonym groups are weighted according to their relevance to the video frames. Moreover, the group weights and classifiers are learned simultaneously through a joint synonym group optimization problem so that they are mutually beneficial and reciprocal. We also exploit unlabeled online videos to optimize the group weights and classifiers for building the target classifier. Specifically, new data-dependent regularizers are introduced to enhance the generalization capability and adaptiveness of the target classifier.
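The abstract does not give the joint optimization in closed form, so the following is only a hedged sketch of the general idea: per-group classifiers are trained on their weakly labeled Web images, and the group weights are then refined against unlabeled target-video frames, with a softmax temperature standing in for the paper's data-dependent regularizers. The update rule and all symbols here are assumptions, not the authors' formulation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def learn_group_weights(group_data, target_frames, n_iters=10, tau=0.1):
    """Sketch: combine synonym-group classifiers with weights refined on unlabeled frames.

    group_data    : list of (X_g, y_g) weakly labeled synonym groups, each with
                    both positive and negative Web images
    target_frames : (n_frames, n_dims) features of unlabeled target-video frames
    tau           : softmax temperature acting as an entropy-style regularizer
    """
    # One classifier per synonym group, trained on its own Web images.
    classifiers = [LogisticRegression(max_iter=1000).fit(X_g, y_g)
                   for X_g, y_g in group_data]

    # Each group's probability that a target frame shows the interest concept.
    frame_probs = np.stack([c.predict_proba(target_frames)[:, 1]
                            for c in classifiers])          # (n_groups, n_frames)

    weights = np.full(len(classifiers), 1.0 / len(classifiers))
    for _ in range(n_iters):
        # Current target scorer: weighted combination of the group classifiers.
        consensus = weights @ frame_probs                   # (n_frames,)

        # Data-dependent term: groups that agree with the consensus on the
        # unlabeled frames receive larger weights; the softmax keeps all
        # weights positive and smooth.
        agreement = -np.mean((frame_probs - consensus) ** 2, axis=1)
        weights = np.exp(agreement / tau)
        weights /= weights.sum()

    return classifiers, weights
```

The learned weights and classifiers would then replace the fixed weights in the frame-scoring sketch given earlier.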
Result
Our method achieves a mean average precision (mAP) of 46.54, a 21.6% improvement over state-of-the-art methods, without taking much longer time. Experimental results on several challenging video datasets show that using grouped knowledge obtained from Web images for video highlight extraction is effective and provides comprehensive results.
Conclusion
We presented a new framework for video highlight extraction by leveraging a large number of loosely labeled Web images. Specifically, we exploited synonym groups to learn more sophisticated representations of source-domain Web images. The group classifiers and weights are jointly learned in a unified optimization algorithm to build the target-domain classifiers. We also introduced two new data-dependent regularizers based on the unlabeled target-domain consumer videos to enhance the generalization capability of the target classifier.
References
Wolf W H. Key frame selection by motion analysis[C]//Proceedings of the 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing. Atlanta, GA: IEEE, 1996: 1228-1231. [DOI:10.1109/ICASSP.1996.543588]
Zhang H J, Wu J H, Zhong D, et al. An integrated system for content-based video retrieval and browsing[J]. Pattern Recognition, 1997, 30(4): 643-658. [DOI:10.1016/S0031-3203(96)00109-4]
Lu Z, Grauman K. Story-driven summarization for egocentric video[C]//Proceedings of 2013 IEEE Conference on Computer Vision and Pattern Recognition. Portland, OR: IEEE, 2013: 2714-2721. [DOI:10.1109/CVPR.2013.350]
Yao T, Mei T, Ngo C W, et al. Annotation for free: video tagging by mining user search behavior[C]//Proceedings of the 21st ACM International Conference on Multimedia. New York: ACM, 2013: 977-986. [DOI:10.1145/2502081.2502085]
El Sayad I, Martinet J, Urruty T, et al. A semantically significant visual representation for social image retrieval[C]//Proceedings of 2011 IEEE International Conference on Multimedia and Expo. Barcelona, Spain: IEEE, 2011: 1-6. [DOI:10.1109/ICME.2011.6011867]
Wang H, Wu X X, Jia Y D. Video annotation by using heterogeneous multiple image groups on the web[J]. Chinese Journal of Computers, 2013, 36(10): 2062-2069. [DOI:10.3724/SP.J.1016.2013.02062]
Wang H. Video annotation based on transfer learning[D]. Beijing: Beijing Institute of Technology, 2014. http://cdmd.cnki.com.cn/article/cdmd-10007-1014086880.htm
Wang H, Wu X X. Finding event videos via image search engine[C]//Proceedings of 2015 IEEE International Conference on Data Mining Workshop. Atlantic City, NJ: IEEE, 2015: 1221-1228. [DOI:10.1109/ICDMW.2015.78]
Wang H, Wu X X, Jia Y D. Video annotation via image groups from the web[J]. IEEE Transactions on Multimedia, 2014, 16(5): 1282-1291. [DOI:10.1109/TMM.2014.2312251]
Wang H, Song H, Wu X X, et al. Video annotation by incremental learning from grouped heterogeneous sources[C]//Proceedings of the 12th Asian Conference on Computer Vision. Taipei, Taiwan, China: Springer, 2014: 493-507. [DOI:10.1007/978-3-319-16814-2_32]
Jiang Y G, Ye G N, Chang S F, et al. Consumer video understanding: a benchmark database and an evaluation of human and machine performance[C]//Proceedings of the 1st ACM International Conference on Multimedia Retrieval. Trento, Italy: ACM, 2011: 29. [DOI:10.1145/1991996.1992025]
Lowe D G. Distinctive image features from scale-invariant keypoints[J]. International Journal of Computer Vision, 2004, 60(2): 91-110. [DOI:10.1023/B:VISI.0000029664.99615.94]
Hoiem D, Efros A A, Hebert M. Recovering surface layout from an image[J]. International Journal of Computer Vision, 2007, 75(1): 151-172. [DOI:10.1007/s11263-006-0031-y]
Oliva A, Torralba A. Modeling the shape of the scene: a holistic representation of the spatial envelope[J]. International Journal of Computer Vision, 2001, 42(3): 145-175. [DOI:10.1023/A:1011139631724]
Fernando B, Habrard A, Sebban M, et al. Unsupervised visual domain adaptation using subspace alignment[C]//Proceedings of 2013 IEEE International Conference on Computer Vision. Sydney, NSW: IEEE, 2013: 2960-2967. [DOI:10.1109/ICCV.2013.368]
Gong B Q, Shi Y, Sha F, et al. Geodesic flow kernel for unsupervised domain adaptation[C]//Proceedings of 2012 IEEE Conference on Computer Vision and Pattern Recognition. Providence, RI: IEEE, 2012: 2066-2073. [DOI:10.1109/CVPR.2012.6247911]
Mei T, Tang L X, Tang J H, et al. Near-lossless semantic video summarization and its applications to video analysis[J]. ACM Transactions on Multimedia Computing, Communications, and Applications, 2013, 9(3): #16. [DOI:10.1145/2487268.2487269]
Platt J C, Cristianini N, Shawe-Taylor J. Large margin DAGs for multiclass classification[C]//Advances in Neural Information Processing Systems. Cambridge, MA: MIT Press, 2000: 547-553.
Meng J J, Wang H X, Yuan J S, et al. From keyframes to key objects: video summarization by representative object proposal selection[C]//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, NV: IEEE, 2016: 1039-1048. [DOI:10.1109/CVPR.2016.118]