Research on high-performance integer-multiple sparse networks for action recognition
Action recognition analysis derived from an integer sparse network
2022, Vol. 27, No. 8, pp. 2404-2417
Received: 2021-02-19
Revised: 2021-05-12
Accepted: 2021-05-19
Published in print: 2022-08-16
DOI: 10.11834/jig.210087
Objective
Action recognition has a wide range of applications in practical scenarios such as human interaction, behavior analysis, and surveillance. Most skeleton-based action recognition methods need information from both the spatial and temporal dimensions to achieve good results. Graph convolutional networks (GCNs) can combine spatial and temporal information effectively; however, GCN-based methods have high computational complexity, and combining attention modules with multi-stream fusion strategies makes the whole training process even less efficient. Most current research focuses on algorithm performance, so reducing computational cost while maintaining accuracy is a key problem that action recognition needs to solve. To this end, building on the lightweight Shift-GCN (shift graph convolutional network), this paper proposes the integer-multiple sparse network IntSparse-GCN (integer sparse graph convolutional network).
Method
We first propose a new sparse shift operation in which odd-numbered columns are shifted upward, even-numbered columns are shifted downward, and the shifted-out parts are replaced with 0. On this basis, the input and output of every network layer are set to an integer multiple of the number of joints, yielding the integer-multiple sparse network IntSparse-GCN. We then analyze the mask function in Shift-GCN and obtain the optimized parameters with the highest accuracy through an automated traversal procedure.
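The following is a minimal sketch of the proposed sparse shift, assuming a per-frame feature matrix of shape (V joints × C channels), a shift distance of one joint position, and a 0-based odd/even column convention; the exact shift distances and indexing convention are not fixed in this abstract.

```python
import torch

def int_sparse_shift(x: torch.Tensor, step: int = 1) -> torch.Tensor:
    """Sparse spatial shift: odd channel columns move up and even columns
    move down along the joint axis; vacated positions are filled with 0.

    x:    per-frame feature matrix of shape (V, C) -- V joints, C channels.
    step: assumed shift distance of one joint position (hypothetical).
    """
    V, C = x.shape
    out = torch.zeros_like(x)
    odd = torch.arange(1, C, 2)   # "odd" columns (0-based convention assumed)
    even = torch.arange(0, C, 2)  # "even" columns
    # odd columns shift up: row v receives the value from row v + step
    out[:V - step, odd] = x[step:, odd]
    # even columns shift down: row v receives the value from row v - step
    out[step:, even] = x[:V - step, even]
    return out

# Example: 25 joints (NTU RGB+D skeleton) and 75 channels, i.e. an integer
# multiple (3x) of the joint count, as IntSparse-GCN requires per layer.
features = torch.randn(25, 75)
print(int_sparse_shift(features).shape)  # torch.Size([25, 75])
```

Because roughly half of the entries in every joint's feature vector become 0, the resulting feature matrix is sparse with a regular pattern, which is what later enables pruning.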
Result
Ablation experiments show that each algorithmic improvement raises the overall performance. On the X-sub and X-view benchmarks of the NTU RGB+D dataset, the 4-stream IntSparse-GCN + M-Sparse achieves Top-1 accuracies of 90.72% and 96.57%, respectively. On the Northwestern-UCLA dataset, the 4-stream IntSparse-GCN + M-Sparse reaches a Top-1 accuracy of 96.77%, 2.17% higher than the original model. Compared with other representative algorithms, accuracy improves on the different datasets and across all four streams, with a particularly clear gain on the Northwestern-UCLA dataset.
Conclusion
For the sparse features produced by the shift operation, this paper proposes the integer-multiple IntSparse-GCN network, analyzes the mask function in Shift-GCN, and designs an automated traversal procedure to obtain the optimized parameters with the highest accuracy. The method not only improves accuracy but also provides a basis for further pruning and quantization.
Objective
The task of action recognition analyzes multi-frame images to estimate the pose of the human body from a given sensor input or to recognize the ongoing action of the human body from the captured images. Action recognition has a wide range of applications in real-world scenarios such as human interaction, behavior analysis, and monitoring, for example, monitoring illegal human behaviors in public sites such as bus interchanges, railway stations, and airports. At present, most skeleton-based methods need spatio-temporal information to obtain good results. Graph convolutional networks (GCNs) can combine spatial and temporal information effectively. However, GCN-based methods have high computational complexity, and integrating attention modules and multi-stream fusion lowers the efficiency of the whole training process. Reducing algorithm cost while ensuring accuracy is therefore a key issue in action recognition. The shift graph convolutional network (Shift-GCN) applies the shift operation to GCNs effectively: it is composed of novel shift graph operations and lightweight point-wise convolutions, where the shift graph operations provide flexible receptive fields for spatio-temporal graphs. Shift-GCN achieves more than 10× lower computational complexity on three skeleton-based action recognition datasets. However, its features are redundant and the internal structure of the network has not been optimized. Therefore, we optimize it on the basis of the lightweight Shift-GCN and obtain our integer sparse graph convolutional network (IntSparse-GCN).
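To illustrate the shift-plus-point-wise-convolution design that Shift-GCN builds on, a minimal sketch is given below; the tensor layout, the circular shift rule (channel i rolled by i mod V joint slots), and the layer sizes are illustrative assumptions rather than the paper's exact implementation.

```python
import torch
import torch.nn as nn

class ShiftPointwiseBlock(nn.Module):
    """Minimal sketch of the shift + point-wise convolution idea behind
    Shift-GCN: spatial information is mixed by a parameter-free channel
    shift, then re-combined by a cheap 1x1 convolution."""

    def __init__(self, channels: int):
        super().__init__()
        # point-wise (1x1) convolution: the only learnable part of the block
        self.pointwise = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, frames, joints)
        n, c, t, v = x.shape
        # non-local spatial shift: channel i is rolled by i mod V joint slots
        shifted = torch.stack(
            [torch.roll(x[:, i], shifts=i % v, dims=-1) for i in range(c)],
            dim=1,
        )
        return self.pointwise(shifted)

block = ShiftPointwiseBlock(channels=64)
out = block(torch.randn(2, 64, 30, 25))   # 2 clips, 30 frames, 25 joints
print(out.shape)                          # torch.Size([2, 64, 30, 25])
```

Because the shift itself has no parameters and no FLOPs, the cost of such a block is dominated by the 1×1 convolution, which is why the design is lightweight compared with standard graph convolutions.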
Method
To effectively solve the feature redundancy problem of Shift-GCN, we propose a feature shift operation applied to every layer of the network in which the odd-numbered columns are moved up, the even-numbered columns are moved down, and the shifted-out part is replaced with 0, and the input and output of each layer are set to an integer multiple of the number of joints. First, we adopt a basic network structure with parameters similar to the previous network. When designing the number of input and output channels, we try to balance the zeros in the features of each joint and finally obtain the optimized network structure. This network sets almost half of the positions in each feature channel to 0, which expresses features more accurately and turns the feature matrix into a sparse feature matrix with strong regularity, improving the robustness of the model and the accuracy of recognition. Next, we analyze the mask function in Shift-GCN. The results show that the learned mask values are distributed in a range centered on 0 and that the learned weights concentrate on only a few features; most features do not require mask intervention. Our experiments find that more than 80% of the mask function is ineffective. We therefore conduct extensive experiments in which mask values within different intervals are set to 0; because the influence is irregular, we design an automated traversal method to obtain the optimized parameters with the highest accuracy and thus the optimal network model. This not only improves the accuracy of the network but also reduces the multiplication operations between the feature matrix and the mask vector.
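A minimal sketch of the automated traversal over mask-zeroing intervals is shown below; the symmetric interval around 0, the candidate thresholds, and the `evaluate` hook are hypothetical placeholders for the paper's actual search space and evaluation pipeline.

```python
import numpy as np

def traverse_mask_thresholds(mask, evaluate, thresholds):
    """Sketch of the automated traversal over mask-zeroing intervals.

    mask:       learned 1-D mask vector from a trained Shift-GCN layer.
    evaluate:   callable(mask) -> validation Top-1 accuracy of the model run
                with the given (partially zeroed) mask; a hypothetical hook.
    thresholds: candidate interval half-widths around 0 to try.
    """
    best_thr, best_acc, best_mask = None, -1.0, mask
    for thr in thresholds:
        pruned = np.where(np.abs(mask) < thr, 0.0, mask)  # zero values near 0
        acc = evaluate(pruned)
        if acc > best_acc:
            best_thr, best_acc, best_mask = thr, acc, pruned
    return best_thr, best_acc, best_mask

# Toy usage with a random mask and a dummy evaluation function.
rng = np.random.default_rng(0)
mask = rng.normal(0.0, 0.1, size=256)  # learned values centered on 0
dummy_eval = lambda m: 90.0 + 0.01 * np.count_nonzero(m == 0) / m.size
thr, acc, _ = traverse_mask_thresholds(mask, dummy_eval, np.linspace(0.0, 0.3, 7))
print(f"best interval half-width: {thr:.2f}, Top-1: {acc:.2f}%")
```

Every mask entry set to 0 removes the corresponding feature-mask multiplication at inference time, which is where the efficiency gain comes from.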
Result
Our ablation experiments show that each algorithmic improvement raises the performance of the overall algorithm. On the X-sub benchmark of the NTU RGB+D dataset, the Top-1 accuracy of 1-stream (1 s) IntSparse-GCN reaches 87.98% and that of 1 s IntSparse-GCN + M-Sparse reaches 88.01%; 2 s IntSparse-GCN reaches 89.80% and 2 s IntSparse-GCN + M-Sparse reaches 89.82%; 4 s IntSparse-GCN reaches 90.72% and 4 s IntSparse-GCN + M-Sparse also reaches 90.72%. On the X-view benchmark, the Top-1 accuracy of 1 s IntSparse-GCN + M-Sparse reaches 94.89%, that of 2 s reaches 96.21%, and that of 4 s reaches 96.57%. On the Northwestern-UCLA dataset, the Top-1 accuracy of 1 s IntSparse-GCN + M-Sparse reaches 92.89%, that of 2 s reaches 95.26%, and that of 4 s reaches 96.77%, which is 2.17% higher than the original model. Compared with other representative algorithms, accuracy improves on multiple datasets and across all four streams.
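For context, the sketch below shows how multi-stream (1 s / 2 s / 4 s) Top-1 results are typically obtained by score-level fusion; the choice of four streams (joint, bone, joint motion, bone motion) and equal fusion weights follows the common Shift-GCN setting and is an assumption, not a detail stated in this abstract.

```python
import numpy as np

def fuse_streams(score_list):
    """score_list: list of (num_samples, num_classes) softmax-score arrays,
    one per independently trained stream; scores are summed before arg-max."""
    fused = np.sum(score_list, axis=0)
    return fused.argmax(axis=1)

def top1(pred, labels):
    """Top-1 accuracy in percent."""
    return 100.0 * np.mean(pred == labels)

# Toy example: 4 streams, 5 samples, 60 classes (NTU RGB+D has 60 actions).
rng = np.random.default_rng(1)
streams = [rng.random((5, 60)) for _ in range(4)]
labels = rng.integers(0, 60, size=5)
print("4-stream Top-1: %.2f%%" % top1(fuse_streams(streams), labels))
```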
Conclusion
We propose a novel method called IntSparse-GCN, which introduces a spatial shift algorithm based on channel counts that are integer multiples of the number of joints. The resulting feature matrix is a sparse matrix with strong regularity, which makes model pruning easier to optimize. To obtain the optimization parameters with the highest accuracy, we analyze the mask function in Shift-GCN and design an automated traversal method. The sparse feature matrix and the mask parameters also offer potential for further pruning and quantization.