A temporal action recognition network based on a feature pyramid structure

He Jiayu, Lei Jun, Li Guohui (Science and Technology on Information Systems Engineering Laboratory, National University of Defense Technology, Changsha 410072, China)

Abstract
Objective Temporal action recognition is one of the most important tasks in video understanding. It requires simultaneously classifying and regressing the action segments in a video, and videos often contain action segments of different temporal lengths; detecting short-duration segments is especially difficult. To address this problem, this paper builds a 3D feature pyramid hierarchy to enhance the network's ability to detect action segments of different durations, and proposes a new two-stage network consisting of a proposal network followed by a classifier. Method The network takes consecutive RGB frames as input and, through the feature pyramid structure, produces feature maps of different resolutions and levels of abstraction. These multilevel feature maps mainly operate in the last two stages of the network: 1) in the proposal stage, combined with the anchor mechanism, anchor segments of different temporal lengths are given correspondingly sized receptive fields, making the initial predictions of the anchor segments more accurate; 2) in the region-of-interest pooling stage, each proposal segment is mapped to the feature map of the corresponding level for prediction, balancing the demands of classification and regression for abstraction and resolution. Result The model is evaluated on the THUMOS Challenge 2014 dataset. Compared with other typical methods that do not use optical flow features, our model exceeds them by more than 3% at various intersection-over-union thresholds; in a per-class comparison, detection accuracy for short-duration action segments is generally improved. In the ablation study, at an intersection-over-union threshold of 0.5, the network with the feature pyramid structure outperforms the model with an ordinary feature extraction network by 1.8%. Conclusion The proposed two-stage temporal action model based on a 3D feature pyramid extraction structure effectively improves detection accuracy for short-duration action segments.
Keywords
Temporal action detection based on feature pyramid hierarchies

He Jiayu, Lei Jun, Li Guohui(Science and Technology on Information Systems Engineering Laboratory, National University of Defense Technology, Changsha 410072, China)

Abstract
Objective Temporal action localization is one of the most important tasks in video understanding and has great practical application prospects. With the rise of online video applications, the number of short videos on the Internet has increased sharply, many of which contain different human behaviors. A model that can automatically locate and classify human action segments in videos is needed to detect and distinguish human behavior in short videos quickly and efficiently. In addition, public security departments need real-time human behavior detection systems to help monitor and provide early warning of public safety incidents. In temporal action localization, the human action segments in a video must be classified and regressed simultaneously. Accurately locating the boundaries of action segments is more difficult than classifying known segments. A video often contains action segments of different temporal lengths, and detecting short-duration segments is especially difficult because they are easily ignored by the detection model or regarded as part of a nearby, longer-duration segment. Existing methods have made various attempts to improve detection accuracy for action segments of different durations. In this paper, a 3D feature pyramid hierarchy is proposed to enhance the network's ability to detect action segments of different temporal durations. Method A new two-stage network with a proposal network followed by a classifier, named the 3D feature pyramid convolutional network (3D-FPCN), is proposed. In 3D-FPCN, feature extraction is performed by the proposed 3D feature pyramid feature extraction network, which has a bottom-up pathway and a top-down pathway.
The bottom-up pathway encodes the temporal and spatial characteristics of consecutive input frames through a series of 3D convolutional layers to obtain highly abstract feature maps. The top-down pathway uses a series of deconvolutional layers and lateral connections to fuse high-abstraction and high-resolution features and obtain lower-level feature maps. Through this pyramid, multilevel feature maps with different abstraction levels and resolutions are obtained. Highly abstract feature maps are used for the classification and regression of long-duration action segments, and high-resolution feature maps are used for short-duration action segments, which effectively improves the network's detection of action segments of different durations. The whole network takes RGB frames as input and generates feature maps of different resolutions and abstraction levels via the feature pyramid structure. These multilevel feature maps mainly operate in the latter two stages of the network. First, the anchor mechanism is used in the proposal stage, so anchor segments of different temporal lengths have correspondingly sized receptive fields, which is equivalent to a receptive-field calibration. Second, in the region-of-interest pooling stage, each proposal segment is mapped to the feature map of the corresponding level for prediction, which makes prediction more targeted and balances the demands of classification and regression for the abstraction and resolution of feature maps. Result Our model is evaluated on the THUMOS'14 dataset. Compared with other classic methods that do not use optical flow features, our network surpasses most of them. Specifically, when the intersection-over-union threshold is set to 0.5, the mean average precision (mAP) of 3D-FPCN reaches 37.4%.
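The top-down fusion and the proposal-to-level mapping described above can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the nearest-neighbour upsampling, the 1×1 lateral projection, and the `assign_pyramid_level` rule (including its `base_len` and level range) are assumptions standing in for the deconvolution layers and the mapping actually used in 3D-FPCN.

```python
import numpy as np

def lateral_fuse(top_down, bottom_up, proj):
    """Fuse a coarse top-down map with a finer bottom-up map.

    top_down:  (C, T/2) coarse, highly abstract features
    bottom_up: (C_in, T) finer features from the bottom-up pathway
    proj:      (C, C_in) 1x1 lateral projection aligning channel counts
    """
    # Nearest-neighbour upsampling along time stands in for the
    # deconvolution used in the top-down pathway.
    upsampled = np.repeat(top_down, 2, axis=1)   # (C, T)
    lateral = proj @ bottom_up                   # (C, T)
    return upsampled + lateral

def assign_pyramid_level(length, min_level=2, max_level=5, base_len=16):
    """Map a proposal of `length` frames to a pyramid level:
    short proposals go to high-resolution (low) levels,
    long proposals to highly abstract (high) levels."""
    k = min_level + int(np.floor(np.log2(max(length, 1) / base_len)))
    return int(np.clip(k, min_level, max_level))
```

A 16-frame proposal thus lands on the finest level of this sketch, while a 128-frame proposal lands on the coarsest, which is the balancing of resolution against abstraction described above.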
Compared with the classic two-stage region convolutional 3D network (R-C3D), the mAP of our method is 8.5 percentage points higher. Comparing per-class detection precision at an intersection-over-union threshold of 0.5, the detection results of 3D-FPCN for short-duration action segments are greatly improved over other methods. For example, 3D-FPCN's detection accuracy on basketball dunk and cliff diving is 10% higher than that of the two-stage R-C3D, and its detection accuracy on pole vault is approximately 40% higher than that of the multi-stage segment-based convolutional neural network (SCNN). These results demonstrate our model's improvement in detecting short-duration action segments. An ablation study on the feature pyramid feature extraction network is also conducted to explore the effect of this structure on the model. When the feature pyramid structure is removed, the detection accuracy of the network drops by approximately 2% at an intersection-over-union threshold of 0.5. When the multilevel feature maps generated by the feature pyramid are used only in the first stage of the network (the proposal generation stage), the detection accuracy is only 0.2% higher than that of the model without the feature pyramid structure. This shows that the feature pyramid hierarchy effectively enhances the detection of actions of different durations and that it mainly works in the second stage of the network, the region-of-interest pooling stage. Conclusion A two-stage temporal action localization network, 3D-FPCN, is proposed based on a 3D feature pyramid feature extraction network. The network takes consecutive RGB frames as input and can quickly and effectively detect human action segments in short videos.
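The thresholds quoted above are temporal intersection-over-union (IoU) values between predicted and ground-truth segments: at threshold 0.5, a prediction counts as correct only if this overlap is at least 0.5. A minimal sketch of the measure (the standard definition, not code from the paper):

```python
def temporal_iou(pred, gt):
    """Temporal IoU of two segments, each a (start, end) pair
    in frames or seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0
```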
Through a number of experiments, the superiority of the model is demonstrated, and the mechanism of the 3D feature pyramid structure within the model is discussed and explored. The 3D feature pyramid structure effectively improves the model's ability to detect short-duration action segments, but the overall mAP of the model remains low. In future work, the model will be improved and different feature inputs will be introduced to study temporal action localization further. We hope that our work can inspire other researchers and promote the development of the field.
Keywords
