Human similar action recognition by fusing saliency image semantic features
2023, Vol. 28, No. 9: 2872-2886
Print publication date: 2023-09-16
DOI: 10.11834/jig.220028
Bai Zhongyu, Ding Qichuan, Xu Hongli, Wu Chengdong. 2023. Human similar action recognition by fusing saliency image semantic features. Journal of Image and Graphics, 28(9): 2872-2886
Objective
Skeleton-based action recognition has become a research hotspot because of its strong robustness to illumination changes, dynamic viewpoints, and complex backgrounds. When skeleton/joint data are used to recognize similar human actions, the small differences in joint features between actions and the lack of other image semantic information easily cause recognition confusion. To address this problem, a saliency image feature enhancement based center-connected graph convolutional network (SIFE-CGCN) model is proposed.
Method
First, a center-connected skeleton topology is designed, which links every joint to the skeleton center to capture the subtle differences in joint movement between similar actions. Second, a Gaussian mixture background modeling algorithm compares each frame with a background model that is updated in real time, segmenting the dynamic image regions and removing background interference to obtain saliency images; feature maps are then extracted from these images with a pre-trained VGG-Net (Visual Geometry Group network) and classified by matching action semantic features, as illustrated by the sketch below. Finally, a fusion algorithm uses the classification results to reinforce and correct the recognition results of the center-connected graph convolutional network, improving the ability to recognize similar actions. In addition, a skeleton-based method for computing action similarity is proposed, and a similar action dataset is built.
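To make the saliency step concrete, the following is a minimal Python sketch of this pipeline, assuming OpenCV's MOG2 Gaussian-mixture background subtractor and a torchvision VGG-16 backbone. The function names, the MOG2 parameters, and the choice of VGG layer are our illustrative assumptions, not the paper's exact implementation.

import cv2
import numpy as np
import torch
from torchvision import models, transforms

# Gaussian mixture background model, updated online with every frame
# (parameter values are OpenCV defaults, assumed here, not from the paper).
subtractor = cv2.createBackgroundSubtractorMOG2(history=500, varThreshold=16,
                                                detectShadows=False)

def saliency_image(frame_bgr):
    # Compare the frame with the continuously updated background model and
    # keep only the dynamic (foreground) region, suppressing the background.
    mask = subtractor.apply(frame_bgr)  # 255 where motion is detected
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, np.ones((3, 3), np.uint8))
    return cv2.bitwise_and(frame_bgr, frame_bgr, mask=mask)

# Pre-trained VGG-16 convolutional backbone as a generic semantic extractor.
vgg = models.vgg16(pretrained=True).features.eval()
normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                 std=[0.229, 0.224, 0.225])

def semantic_feature_map(saliency_bgr):
    rgb = cv2.cvtColor(cv2.resize(saliency_bgr, (224, 224)), cv2.COLOR_BGR2RGB)
    x = normalize(torch.from_numpy(rgb).permute(2, 0, 1).float() / 255.0)
    with torch.no_grad():
        return vgg(x.unsqueeze(0))  # (1, 512, 7, 7) feature map for matching

In the paper, a fully connected layer then performs the semantic feature matching whose result reinforces the GCN prediction; the convolutional feature map above is simply the input to that matching stage.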
Result
Experiments compare the proposed method with other methods on the similar action dataset and on the NTU RGB+D 60/120 (Nanyang Technological University RGB+D 60/120) datasets. On the similar action dataset, the recognition accuracy exceeds that of the second-best model by 4.6% on the cross-subject (X-Sub) benchmark and 6.0% on the cross-view (X-View) benchmark; on NTU RGB+D 60, by 1.4% (X-Sub) and 0.6% (X-View); and on NTU RGB+D 120, by 1.7% (X-Sub) and 1.1% on the cross-setup (X-Set) benchmark. In addition, a variety of comparative experiments verify the effectiveness of the center-connected graph convolutional network, the saliency image extraction method, and the fusion algorithm.
Conclusion
The proposed method can accurately and effectively recognize and classify similar actions, and the overall recognition performance and robustness of the model are also improved.
Objective
Human action recognition is a valuable research area in computer vision, with a wide range of applications such as security surveillance, intelligent monitoring, human-computer interaction, and virtual reality. Skeleton-based action recognition first extracts the coordinates of the major body joints from video or images with hardware or software tools, and then uses the skeleton information to recognize actions. In recent years, it has received increasing attention because of its robustness to dynamic environments, complex backgrounds, and occlusion. Early methods usually relied on hand-crafted features, which generalize poorly because the extracted features lack diversity. Deep learning has since become the mainstream approach owing to its powerful automatic feature extraction. Traditional deep learning methods arrange skeleton data as joint-coordinate vectors or pseudo-images, which are fed directly into recurrent neural networks (RNNs) or convolutional neural networks (CNNs) for action classification. However, RNN-based and CNN-based methods lose the spatial structure of skeleton data because they force it into a Euclidean (grid-like) representation, and they cannot capture the natural correlations among human joints, so distinguishing subtle differences between similar actions becomes difficult. Human joints are naturally organized as a graph in non-Euclidean space, and several works have successfully adopted graph convolutional networks (GCNs) to achieve state-of-the-art performance in skeleton-based action recognition. Nevertheless, these methods do not explicitly learn the subtle differences between joints, which are crucial for recognizing similar actions. Moreover, the skeleton data extracted from video discard the information about objects that humans interact with and retain only the primary joint coordinates. The lack of image semantics, with reliance on joint sequences alone, makes recognizing similar actions considerably harder.
Method
Given the above factors, the saliency image feature enhancement based center-connected graph convolutional network (SIFE-CGCN) is proposed in this work for skeleton-based similar action recognition. The model builds on GCNs, which can fully exploit the spatial and temporal dependencies between human joints. First, the center-connected graph convolutional network (CGCN) is proposed. In the spatial dimension, a center-connection skeleton topology is designed that links every human joint to the skeleton center, so that small differences in joint movements between similar actions can be captured. In the temporal dimension, each frame is connected to the previous and subsequent frames in the sequence, so each joint has a fixed number (two) of temporal neighbors, and a regular 1D convolution along the temporal dimension serves as the temporal graph convolution. A basic graph convolution unit comprises a spatial graph convolution, a temporal graph convolution, and a dropout layer, with a residual connection added to each unit for training stability. The network is formed by stacking nine such units; a batch normalization (BN) layer at the input standardizes the data, and a global average pooling layer at the end unifies the feature dimensions. A dual-stream architecture uses the joint and bone information of the skeleton data simultaneously to extract features from multiple perspectives, and, because each joint plays a different role in different actions, an attention map is added to focus on the joints that dominate the motion. Second, saliency images are extracted from the video with the Gaussian mixture background modeling method: each frame is compared with the real-time updated background model to segment the image regions with considerable change, and the background interference is eliminated. Effective extraction of semantic feature maps from saliency images is the key to distinguishing similar actions; because the Visual Geometry Group network (VGG-Net) can effectively extract the spatial structure of objects in images, the feature map is extracted with a pre-trained VGG-Net, and the fully connected layer is used for feature matching. Finally, the feature-map matching result is used to reinforce and correct the recognition result of the CGCN, improving the ability to recognize similar actions. In addition, a similarity measure for skeleton sequences is proposed, and a similar action dataset is established.
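As an illustration of the architecture described above, here is a minimal PyTorch sketch of the center-connected adjacency and one basic graph convolution unit (spatial graph convolution, temporal 1D convolution, dropout, residual). All names are ours, details the abstract leaves open (such as the temporal kernel size) are marked as assumptions, and the attention map and dual-stream fusion are omitted for brevity.

import torch
import torch.nn as nn

def center_connected_adjacency(edges, num_joints, center=1):
    # Physical bone edges plus an edge from every joint to the skeleton
    # center, followed by symmetric normalization D^-1/2 (A + I) D^-1/2.
    A = torch.eye(num_joints)
    for i, j in edges:
        A[i, j] = A[j, i] = 1.0
    for j in range(num_joints):
        A[j, center] = A[center, j] = 1.0
    d = A.sum(dim=1).pow(-0.5)
    return d.unsqueeze(1) * A * d.unsqueeze(0)

class CGCNUnit(nn.Module):
    # One basic unit: spatial graph conv, temporal conv, dropout, residual.
    def __init__(self, in_ch, out_ch, A, t_kernel=9, dropout=0.5):
        # t_kernel=9 is a common ST-GCN choice, assumed here.
        super().__init__()
        self.register_buffer("A", A)                # (V, V) normalized adjacency
        self.spatial = nn.Conv2d(in_ch, out_ch, 1)  # 1x1 conv = per-joint linear map
        self.temporal = nn.Sequential(
            nn.BatchNorm2d(out_ch), nn.ReLU(),
            nn.Conv2d(out_ch, out_ch, (t_kernel, 1),
                      padding=((t_kernel - 1) // 2, 0)),
            nn.BatchNorm2d(out_ch), nn.Dropout(dropout))
        self.residual = (nn.Identity() if in_ch == out_ch
                         else nn.Conv2d(in_ch, out_ch, 1))

    def forward(self, x):                           # x: (N, C, T, V)
        # Aggregate joint features over the center-connected graph, then
        # convolve along time; add the residual for training stability.
        y = torch.einsum("nctv,vw->nctw", self.spatial(x), self.A)
        return torch.relu(self.temporal(y) + self.residual(x))

In the full model, nine such units sit between an input BN layer and a global average pooling layer, and a joint stream and a bone stream are trained in parallel with their scores fused.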
Result
The proposed model is compared with state-of-the-art models on the proposed similar action dataset and on the Nanyang Technological University RGB+D (NTU RGB+D) 60/120 datasets; the baselines include CNN-based, RNN-based, and GCN-based models. On the cross-subject (X-Sub) and cross-view (X-View) benchmarks of the similar action dataset, the proposed model reaches 80.3% and 92.1% recognition accuracy, exceeding the second-best model by 4.6% and 6.0%, respectively. On the X-Sub and X-View benchmarks of NTU RGB+D 60, it reaches 91.7% and 96.9%, improvements of 1.4% and 0.6% over the second-best model. On the X-Sub and cross-setup (X-Set) benchmarks of NTU RGB+D 120, it improves the recognition accuracy over the second-best model, the feedback graph convolutional network (FGCN), by 1.7% and 1.1%, respectively. In addition, a series of comparative experiments clearly demonstrates the effectiveness of the proposed CGCN, the saliency image extraction method, and the fusion algorithm.
Conclusion
In this study, we propose SIFE-CGCN to resolve the confusion that arises when recognizing similar actions, which is caused by the small differences between skeleton features and the lack of image semantic information. The experimental results show that the proposed method can effectively recognize similar actions and that the overall recognition performance and robustness of the model are improved.
Keywords: action recognition; skeleton sequence; similar action; graph convolutional network (GCN); image salient features
Cai N, Chen S W, Guo W T and Pan Q. 2011. Moving object detection using Gaussian mixture model and wavelet transform. Journal of Image and Graphics, 16(9): 1716-1721 [DOI: 10.11834/jig.20110923]
Cao C Q, Lan C L, Zhang Y F, Zeng W J, Lu H Q and Zhang Y N. 2019. Skeleton-based action recognition with gated convolutional neural networks. IEEE Transactions on Circuits and Systems for Video Technology, 29(11): 3247-3257 [DOI: 10.1109/TCSVT.2018.2879913]
Cheng K Y, Wu J X, Wang W S, Rong L and Zhan Y Z. 2021. Multi-person interaction action recognition based on spatio-temporal graph convolution. Journal of Image and Graphics, 26(7): 1681-1691 [DOI: 10.11834/jig.200510]
Du Y, Fu Y and Wang L. 2016. Representation learning of temporal dynamics for skeleton-based action recognition. IEEE Transactions on Image Processing, 25(7): 3010-3022 [DOI: 10.1109/TIP.2016.2552404]
Du Y, Wang W and Wang L. 2015. Hierarchical recurrent neural network for skeleton based action recognition//Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston, USA: IEEE: 1110-1118 [DOI: 10.1109/CVPR.2015.7298714]
Fernando B, Gavves E, Oramas M J, Ghodrati A and Tuytelaars T. 2015. Modeling video evolution for action recognition//Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston, USA: IEEE: 5378-5387 [DOI: 10.1109/CVPR.2015.7299176]
Gao X, Hu W, Tang J X, Liu J Y and Guo Z M. 2019. Optimized skeleton-based action recognition via sparsified graph regression//Proceedings of the 27th ACM International Conference on Multimedia. Nice, France: Association for Computing Machinery: 601-610 [DOI: 10.1145/3343031.3351170]
He K M, Zhang X Y, Ren S Q and Sun J. 2016. Deep residual learning for image recognition//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, USA: IEEE: 770-778 [DOI: 10.1109/CVPR.2016.90]
Hu J F, Zheng W S, Lai J H and Zhang J G. 2015. Jointly learning heterogeneous features for RGB-D activity recognition//Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston, USA: IEEE: 5344-5352 [DOI: 10.1109/CVPR.2015.7299172]
Ke Q H, Bennamoun M, An S J, Sohel F and Boussaid F. 2017. A new representation of skeleton sequences for 3D action recognition//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE: 4570-4579 [DOI: 10.1109/CVPR.2017.486]
Kim T S and Reiter A. 2017. Interpretable 3D human action analysis with temporal convolutional networks//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW). Honolulu, USA: IEEE: 1623-1631 [DOI: 10.1109/CVPRW.2017.207]
Lee I, Kim D, Kang S and Lee S. 2017. Ensemble deep learning for skeleton-based action recognition using temporal sliding LSTM networks//Proceedings of 2017 IEEE International Conference on Computer Vision. Venice, Italy: IEEE: 1012-1020 [DOI: 10.1109/ICCV.2017.115]
Li B, Dai Y C, Cheng X L, Chen H H, Lin Y and He M Y. 2017. Skeleton based action recognition using translation-scale invariant image mapping and multi-scale deep CNN//Proceedings of 2017 IEEE International Conference on Multimedia and Expo Workshops (ICMEW). Hong Kong, China: IEEE: 601-604 [DOI: 10.1109/ICMEW.2017.8026282]
Li M S, Chen S H, Chen X, Zhang Y, Wang Y F and Tian Q. 2019. Actional-structural graph convolutional networks for skeleton-based action recognition//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, USA: IEEE: 3590-3598 [DOI: 10.1109/CVPR.2019.00371]
Li S, Li W Q, Cook C, Zhu C and Gao Y B. 2018. Independently recurrent neural network (indRNN): building a longer and deeper RNN//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 5457-5466 [DOI: 10.1109/CVPR.2018.00572]
Liang C W, Liu D Y, Qi L and Guan L. 2020. Multi-modal human action recognition with sub-action exploiting and class-privacy preserved collaborative representation learning. IEEE Access, 8: 39920-39933 [DOI: 10.1109/ACCESS.2020.2976496]
Liu J, Shahroudy A, Perez M, Wang G, Duan L Y and Kot A C. 2020. NTU RGB+D 120: a large-scale benchmark for 3D human activity understanding. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(10): 2684-2701 [DOI: 10.1109/TPAMI.2019.2916873]
Liu J, Shahroudy A, Xu D and Wang G. 2016. Spatio-temporal LSTM with trust gates for 3D human action recognition//Proceedings of the 14th European Conference on Computer Vision. Amsterdam, the Netherlands: Springer: 816-833 [DOI: 10.1007/978-3-319-46487-9_50]
Liu J, Wang G, Duan L Y, Abdiyeva K and Kot A C. 2018. Skeleton-based human action recognition with global context-aware attention LSTM networks. IEEE Transactions on Image Processing, 27(4): 1586-1599 [DOI: 10.1109/TIP.2017.2785279]
Peng W, Shi J G, Xia Z Q and Zhao G Y. 2020. Mix dimension in poincaré geometry for 3D skeleton-based action recognition//Proceedings of the 28th ACM International Conference on Multimedia. Seattle, USA: ACM: 1432-1440 [DOI: 10.1145/3394171.3413910]
Plizzari C, Cannici M and Matteucci M. 2021. Skeleton-based action recognition via spatial and temporal transformer networks. Computer Vision and Image Understanding, 208-209: 103219 [DOI: 10.1016/j.cviu.2021.103219]
Pourchot A and Sigaud O. 2018. CEM-RL: combining evolutionary and gradient-based methods for policy search//Proceedings of the 7th International Conference on Learning Representations. New Orleans, USA: ICLR
Ran X Y, Liu K, Li G, Ding W W and Chen B. 2018. Human action recognition algorithm based on adaptive skeleton center. Journal of Image and Graphics, 23(4): 519-525 [DOI: 10.11834/jig.170420]
Sánchez J, Perronnin F, Mensink T and Verbeek J. 2013. Image classification with the fisher vector: theory and practice. International Journal of Computer Vision, 105(3): 222-245 [DOI: 10.1007/s11263-013-0636-x]
Shahroudy A, Liu J, Ng T T and Wang G. 2016. NTU RGB+D: a large scale dataset for 3D human activity analysis//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, USA: IEEE: 1010-1019 [DOI: 10.1109/CVPR.2016.115]
Shi L, Zhang Y F, Cheng J and Lu H Q. 2019. Two-stream adaptive graph convolutional networks for skeleton-based action recognition//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, USA: IEEE: 12018-12027 [DOI: 10.1109/CVPR.2019.01230]
Shi L, Zhang Y F, Cheng J and Lu H Q. 2020. Skeleton-based action recognition with multi-stream adaptive graph convolutional networks. IEEE Transactions on Image Processing, 29: 9532-9545 [DOI: 10.1109/TIP.2020.3028207]
Si C Y, Chen W T, Wang W, Wang L and Tan T N. 2019. An attention enhanced graph convolutional LSTM network for skeleton-based action recognition//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, USA: IEEE: 1227-1236 [DOI: 10.1109/CVPR.2019.00132]
Simonyan K and Zisserman A. 2014. Very deep convolutional networks for large-scale image recognition [EB/OL]. [2022-01-26]. https://arxiv.org/pdf/1409.1556.pdf
Sultani W and Saleemi I. 2014. Human action recognition across datasets by foreground-weighted histogram decomposition//Proceedings of 2014 IEEE Conference on Computer Vision and Pattern Recognition. Columbus, USA: IEEE: 764-771 [DOI: 10.1109/CVPR.2014.103]
Szegedy C, Liu W, Jia Y Q, Sermanet P, Reed S, Anguelov D, Erhan D, Vanhoucke V and Rabinovich A. 2015. Going deeper with convolutions//Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston, USA: IEEE: #7298594 [DOI: 10.1109/CVPR.2015.7298594]
Vemulapalli R, Arrate F and Chellappa R. 2014. Human action recognition by representing 3D skeletons as points in a lie group//Proceedings of 2014 IEEE Conference on Computer Vision and Pattern Recognition. Columbus, USA: IEEE: 588-595 [DOI: 10.1109/CVPR.2014.82]
Yan S J, Xiong Y J and Lin D H. 2018. Spatial temporal graph convolutional networks for skeleton-based action recognition//Proceedings of the 32nd AAAI Conference on Artificial Intelligence. New Orleans, USA: AAAI Press: 7444-7452 [DOI: 10.1609/aaai.v32i1.12328]
Yang H, Yan D, Zhang L, Sun Y D, Li D and Maybank S J. 2022. Feedback graph convolutional network for skeleton-based action recognition. IEEE Transactions on Image Processing, 31: 164-175 [DOI: 10.1109/TIP.2021.3129117]
Zhang P F, Lan C L, Xing J L, Zeng W J, Xue J R and Zheng N N. 2017. View adaptive recurrent neural networks for high performance human action recognition from skeleton data//Proceedings of 2017 IEEE International Conference on Computer Vision. Venice, Italy: IEEE: 2136-2145 [DOI: 10.1109/ICCV.2017.233]