GDC 2017 Conference Column

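1. School of Information Science & Technology, Beijing Forestry University, Beijing 100083;
2. School of Digital Media, Beijing Film Academy, Beijing 100088
 National Natural Science Foundation of China (61703046, 31770589); Fundamental Research Funds for the Central Universities (2015ZCQ-XX)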


Video highlight extraction based on the interests of users
Wang Han1, Yu Huangyue1, Hua Rui1, Zou Ling2
1. School of Information Science & Technology, Beijing Forestry University, Beijing 100083, China;
2. School of Digital Media, Beijing Film Academy, Beijing 100088, China
Supported by: National Natural Science Foundation of China (61703046, 31770589)

# Abstract

Objective Video highlight extraction is of interest for video summarization, organization, browsing, and indexing. Current research mainly focuses on extraction by optimizing the low-level feature diversity or representativeness of video frames while ignoring the interests of users, which leads to extraction results that are inconsistent with user expectations. However, collecting the large number of labeled videos required to model different user interest concepts for different videos is time consuming and labor intensive.

Method We propose to learn models for user interest concepts on different videos by leveraging numerous Web images, which cover many roughly annotated concepts and are often captured in a maximally informative manner, to alleviate the labeling burden. However, knowledge from the Web is noisy and diverse, so brute-force knowledge transfer may adversely affect highlight extraction performance. In this study, we propose a novel user-oriented keyframe extraction framework for online videos that leverages a large number of Web images queried by synonyms from image search engines. Our work is based on the observation that users may have different interests in different frames when browsing the same video. By using user interest-related words as keywords, we can easily collect weakly labeled image data for interest concept model training. Given that different users may describe the same interest concept differently, we denote different descriptions with similar semantic meanings as synonyms. When querying images from the Web, we use synonyms as keywords to avoid semantic one-sidedness. An image set returned by a synonym is considered a synonym group. Different synonym groups are weighted according to their relevance to the video frames. Moreover, the group weights and classifiers are learned simultaneously by solving a joint synonym group optimization problem to make them mutually beneficial and reciprocal. We also exploit unlabeled online videos to optimize the group weights and classifiers for building the target classifier. Specifically, new data-dependent regularizers are introduced to enhance the generalization capability and adaptiveness of the target classifier.

Result Our method achieved an average mAP of 46.54, an improvement of 21.6% over the state of the art, without requiring much longer running time. Experimental results on several challenging video datasets show that using grouped knowledge obtained from Web images for video highlight extraction is effective and provides comprehensive results.

Conclusion We presented a new framework for video highlight extraction that leverages a large number of loosely labeled Web images. Specifically, we exploited synonym groups to learn more sophisticated representations of source domain Web images. The group classifiers and weights are jointly learned in a unified optimization algorithm to build the target domain classifiers. We also introduced two new data-dependent regularizers based on the unlabeled target domain consumer videos to enhance the generalization capability of the target classifier.

# Key words

video retrieval; highlights extraction; video analysis; knowledge transfer

# 2.1 Pre-learning of synonym image group classifiers

 ${f_s}\left( {{\mathit{\boldsymbol{x}}^s}} \right) = {\left( {{\mathit{\boldsymbol{\omega }}^\mathit{s}}} \right)^{\rm{T}}}\varphi \left( {{\mathit{\boldsymbol{x}}^s}} \right)$ (1)

# 2.2 Joint group weight learning model

 ${F_t}\left( \mathit{\boldsymbol{x}} \right) = \sum\limits_{s = 1}^S {{\alpha _s}} {f_s}\left( {{\mathit{\boldsymbol{x}}^s}} \right)$ (2)
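To make Eqs. (1) and (2) concrete, the following is a minimal sketch of a linear group classifier and the weighted combination that forms the target classifier. It assumes that the feature map $\varphi$ is the identity by default and that the same frame feature vector is passed to every group classifier; the function names `linear_classifier` and `target_score` are hypothetical and do not come from the paper.

```python
from typing import Callable, List, Sequence

Vector = Sequence[float]


def linear_classifier(
    omega: Vector,
    phi: Callable[[Vector], Vector] = lambda x: x,
) -> Callable[[Vector], float]:
    """Eq. (1): f_s(x) = (omega^s)^T phi(x).

    phi defaults to the identity map, which is an assumption of this sketch.
    """
    def f(x: Vector) -> float:
        return sum(w * v for w, v in zip(omega, phi(x)))
    return f


def target_score(
    x: Vector,
    classifiers: List[Callable[[Vector], float]],
    alpha: Sequence[float],
) -> float:
    """Eq. (2): F_t(x) = sum_s alpha_s * f_s(x)."""
    return sum(a * f(x) for a, f in zip(alpha, classifiers))


# Two toy synonym-group classifiers and their weighted combination.
f1 = linear_classifier([1.0, 0.0])
f2 = linear_classifier([0.0, 1.0])
score = target_score([2.0, 4.0], [f1, f2], alpha=[0.25, 0.75])
```

Here the group weights sum to 1, matching the constraint imposed on $\alpha$ in Eq. (5).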

 $\begin{array}{l} \;\;\;\;\;\;\;\;\;\;\;\Omega \left( {{F_t}\left( \mathit{\boldsymbol{x}} \right)} \right) = \frac{1}{2}{\left\| \mathit{\boldsymbol{\alpha }} \right\|^2} + {\Omega _L}\left( {{F_t}\left( \mathit{\boldsymbol{x}} \right)} \right)\\ {\Omega _L}\left( {{F_t}\left( \mathit{\boldsymbol{x}} \right)} \right) = \sum\limits_{i = 0}^{{N_s}} {\sum\limits_{s = 1}^S {{\alpha _s}\sum\limits_{k = 1, k \ne s}^S {{{\left\| {{f_s}\left( {{\mathit{\boldsymbol{x}}^\mathit{s}}} \right) - {f_k}\left( {{\mathit{\boldsymbol{x}}^\mathit{k}}} \right)} \right\|}^2}} } } \end{array}$ (3)

 ${\Omega _G}\left( {F_t^i\left( \mathit{\boldsymbol{x}} \right)} \right) = \sum\limits_{i = 1}^{{N_s}} {{{\left\| {F_t^i\left( \mathit{\boldsymbol{x}} \right) - {\mathit{\boldsymbol{Y}}^i}} \right\|}^2}}$ (4)

 $\begin{array}{l} \mathop {\min }\limits_{\boldsymbol{\alpha}} Q\left( \boldsymbol{\alpha} \right) = \frac{1}{2}{\left\| \boldsymbol{\alpha} \right\|^2} + {\lambda _L}\sum\limits_{i = 0}^{{N_s}} {\sum\limits_{s = 1}^S {{\alpha _s}} } \sum\limits_{k = 1, k \ne s}^S {{{\left\| {{f_s}\left( {{\boldsymbol{x}^s}} \right) - {f_k}\left( {{\boldsymbol{x}^k}} \right)} \right\|}^2}} + {\lambda _G}\sum\limits_{i = 1}^{{N_s}} {{{\left\| {F_t^i\left( \boldsymbol{x} \right) - {\boldsymbol{Y}^i}} \right\|}^2}} \\ {\rm{s.t.}}\;\;\sum\limits_{s = 1}^S {{\alpha _s}} = 1 \end{array}$ (5)

 $L\left( \boldsymbol{\alpha}, \boldsymbol{\mu} \right) = Q\left( \boldsymbol{\alpha} \right) - {\boldsymbol{\mu}^{\rm{T}}}\left( {\sum\limits_{s = 1}^S {{\alpha _s}} - 1} \right)$ (6)
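As an illustration of how the weight subproblem in Eq. (5) could be solved, the sketch below holds the group classifiers fixed and solves for $\alpha$ only. It rests on several assumptions that are not stated in the source: all group classifiers are evaluated on the same $N$ samples and collected in a matrix `F` of shape `(N, S)`; the Lagrangian of Eq. (6) is read as $Q(\alpha) - \mu(\sum_s \alpha_s - 1)$ with a scalar multiplier; and no non-negativity constraint is placed on $\alpha$. Under these assumptions $Q$ is a strictly convex quadratic, so the stationarity conditions form a linear system. The function names are hypothetical.

```python
import numpy as np


def disagreement_costs(F: np.ndarray) -> np.ndarray:
    """c_s = sum_i sum_{k != s} (F[i, s] - F[i, k])^2, the inner term of Omega_L."""
    diff = F[:, :, None] - F[:, None, :]   # diff[i, s, k] = F[i, s] - F[i, k]
    return (diff ** 2).sum(axis=(0, 2))    # the k == s terms contribute zero


def objective(alpha, F, y, lam_L, lam_G) -> float:
    """Q(alpha) of Eq. (5) with the classifier outputs F held fixed."""
    c = disagreement_costs(F)
    fit = np.sum((F @ alpha - y) ** 2)
    return float(0.5 * alpha @ alpha + lam_L * c @ alpha + lam_G * fit)


def solve_group_weights(F, y, lam_L=0.1, lam_G=1.0):
    """Solve grad Q(alpha) - mu * 1 = 0 together with sum(alpha) = 1."""
    n_groups = F.shape[1]
    c = disagreement_costs(F)
    H = np.eye(n_groups) + 2.0 * lam_G * F.T @ F    # Hessian of Q
    b = 2.0 * lam_G * F.T @ y - lam_L * c
    K = np.zeros((n_groups + 1, n_groups + 1))      # KKT matrix
    K[:n_groups, :n_groups] = H
    K[:n_groups, n_groups] = -1.0                   # multiplier column
    K[n_groups, :n_groups] = 1.0                    # sum-to-one constraint row
    solution = np.linalg.solve(K, np.append(b, 1.0))
    return solution[:n_groups], solution[n_groups]  # (alpha, mu)


# Toy data: 20 samples scored by 3 group classifiers, with binary labels.
rng = np.random.default_rng(0)
F_demo = rng.normal(size=(20, 3))
y_demo = rng.integers(0, 2, size=20).astype(float)
alpha_demo, mu_demo = solve_group_weights(F_demo, y_demo)
```

This covers only the $\alpha$ step. The abstract describes the group weights and classifiers as jointly learned, so a full implementation would presumably alternate this step with classifier updates.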

# 3.1.1 Video datasets

Table 1 User-interest semantics

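| Video category | Number of events of interest | Synonym descriptions |
| --- | --- | --- |
| Basketball game | 3 | {dunk (扣篮), slam dunk (灌篮), jam (入樽)}, {dribbling (运球), ball handling (控球)}, {opening (开场), jump ball (跳球)} |
| Diving | 2 | {water entry (入水)}, {platform (跳台), springboard (跳板)} |
| Football | 1 | {goal (进球), shot (射门), ball into the net (入门), score (得分)} |
| Swimming | 1 | {reaching the finish (到达终点), ranking (排名), champion (冠军)} |
| Birthday video | 2 | {blowing out candles (吹蜡烛), putting out candles (灭蜡烛)}, {cutting the cake (切蛋糕)} |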

# 3.2 Experimental setup

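1) SIFT features [12] and HOG features [13]. The similarity between images and video frames is compared by quantizing local features, and PCA is used to reduce the SIFT feature vectors to 2048 dimensions.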

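2) GIST features [14]. Because different users focus on different interests, videos often appear casual and lack a clear focus. Recognizing images through local features in this setting would be computationally very expensive, so GIST features are used to ignore the local characteristics of an image and describe it in a more "macroscopic" way, which reduces computational complexity.

The features are concatenated into a 4324-dimensional feature vector, and the k-means algorithm is used to reduce the feature vector to about 2000 dimensions to build the training and test sets. Furthermore, to compare the results of different methods as objectively as possible, average precision (AP), mean average precision (mAP), and running time (RT) are used to evaluate algorithm performance.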
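The source names AP and mAP as evaluation measures but does not spell out how they are computed. The sketch below assumes the common ranking-based, non-interpolated definition: frames are sorted by descending score, precision is recorded at the rank of each relevant frame, and these values are averaged. The function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def average_precision(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Non-interpolated AP: mean of precision@rank over the relevant items."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precisions: List[float] = []
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            precisions.append(hits / rank)
    # By convention in this sketch, AP is 0 when there are no relevant items.
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision(
    per_concept: Sequence[Tuple[Sequence[float], Sequence[int]]],
) -> float:
    """mAP: the mean of AP over all interest concepts (or videos)."""
    aps = [average_precision(s, l) for s, l in per_concept]
    return sum(aps) / len(aps)


# Relevant frames sit at ranks 1 and 3, so AP = (1/1 + 2/3) / 2 = 5/6.
ap_demo = average_precision([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0])
map_demo = mean_average_precision([
    ([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0]),
    ([0.9, 0.1], [1, 0]),
])
```

The reported figures in the abstract appear to be on a 0 to 100 scale, so values from this sketch would need to be multiplied by 100 to be comparable.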

# 3.3.1 Comparison of methods

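1) Subspace alignment. The test frames and training images are regarded as feature sets in different spaces. Feature subspaces are constructed to connect the different features of the two spaces, so that the similarity between test frames and training images can be compared.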

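2) GFK kernel method. Data from the source domain (the training image set) and the target domain (the test frame set) are represented and connected on the Grassmann manifold. The kernel function fits the differences between the domains as closely as possible, yielding the target-domain video extraction results closest to the source-domain data.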

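3) Random selection. A random simulator is constructed to generate multiple random sample points $\nu$ that are uniformly distributed on [0, 1], giving the random number $rand = \nu \cdot \left( {{N_S} \cdot {n_s}} \right)$. The video segments containing the corresponding frames are then selected from the test video as highlights. This method does not consider user needs and does not use image features for training.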

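4) Color histogram comparison. The RGB color feature clustering algorithm proposed in [17] is used to cluster the training images and video frames by color features. Four cluster centers are defined for iterative clustering of the color images, the R, G, and B values of every pixel are each divided into 4×4 regions, and the histogram color information is collected. Highlights are extracted by comparing the Euclidean distance between the color histograms of the test images and the training frames.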

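5) Low-level feature comparison without joint group weight learning. PCA is used to reduce the 4324-dimensional feature vector extracted in our algorithm (color histogram, SIFT (scale-invariant feature transform), GIST (spatial envelope features), and HOG2×2 (histogram of oriented gradients)) to about 2000 dimensions. The KNN (k-nearest neighbor) distance between the test images and the video frames in feature space is then computed directly, without using any classification function to build a classifier. With $k=4$, the video segments containing the frames with smaller distances are extracted as highlights.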
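The last baseline in the list above can be sketched as follows. The source does not define "KNN distance" precisely, so this sketch assumes it is the mean Euclidean distance from a frame to its $k$ nearest training images, and it assumes the feature vectors have already been reduced by PCA. The function names `knn_distance` and `rank_frames` are hypothetical.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def knn_distance(frame: Vector, images: Sequence[Vector], k: int = 4) -> float:
    """Mean Euclidean distance from one frame to its k nearest training images."""
    distances = sorted(math.dist(frame, image) for image in images)
    nearest = distances[:k]
    return sum(nearest) / len(nearest)


def rank_frames(
    frames: Sequence[Vector], images: Sequence[Vector], k: int = 4
) -> List[int]:
    """Frame indices ordered from smallest to largest kNN distance."""
    scores = [knn_distance(frame, images, k) for frame in frames]
    return sorted(range(len(frames)), key=lambda i: scores[i])


# Toy 2-D features: four training images near the origin and one outlier.
train_images = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [10.0, 10.0]]
video_frames = [[5.0, 5.0], [0.5, 0.5]]
order = rank_frames(video_frames, train_images, k=4)
```

Frames at the front of `order` are the closest to the training images, and the video segments containing them would be taken as highlight candidates.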

