Stroke recognition in badminton videos based on pose estimation and temporal segment networks analysis
2022, Vol. 27, No. 11, Pages: 3280-3291
Print publication date: 2022-11-16
Accepted: 2021-11-03
DOI: 10.11834/jig.210407
Shu Tao, Meili Wang. Stroke recognition in badminton videos based on pose estimation and temporal segment networks analysis[J]. Journal of Image and Graphics, 2022, 27(11): 3280-3291.
Objective
To meet diverse needs such as helping badminton coaches analyze players' strokes in singles videos and letting users enjoy highlight collections of each stroke type, this paper proposes a method for temporally locating and classifying the strokes of the ball-controlling player in extracted badminton video clips.
Method
In badminton video clips, the player's racket-holding arm is detected with a pose estimation method, the temporal extent of each stroke is located from the variation of the arm's swing amplitude, and meta-videos are generated from the localization results. A channel-spatial attention mechanism is introduced into the temporal segment network, which is trained to classify badminton strokes into four common types: forehand, backhand, overhead, and lob. In addition, an image morphology based method further distinguishes overhead strokes as clears or smashes.
Result
Experimental results show that the intersection over union (IoU) of stroke temporal localization in badminton video clips reaches 82.6%, the area under curve (AUC) for every stroke category exceeds 0.98, and the average recall and average precision are 91.2% and 91.6%, respectively. The method can effectively locate and classify strokes in badminton video clips and achieves good badminton stroke recognition.
Conclusion
The proposed stroke recognition method for badminton video clips combines temporal localization and classification of badminton strokes, makes the recognition process more intelligent, and provides significant application value for sports video analysis.
Objective
Video-based intelligent action recognition has become an active topic in computer vision, and the variety of video types means that actions often need to be recognized within a specific scenario. If badminton strokes can be accurately located and recognized in a video, coaches can analyze strokes more effectively, and users can enjoy meta-video collections of each stroke type. Such sports video analysis can also be transferred to tennis and table tennis, which share similar characteristics. For action recognition in long videos, the temporal extent of each action must first be located, and badminton videos belong to this category of stroke time-domain localization. Current research on temporal action localization assumes a clear switching boundary between adjacent actions, with distinctly different foreground or background features, as in the 50Salads and Breakfast datasets. In a badminton video, however, there is no obvious foreground or background boundary between adjacent strokes, so action recognition methods designed for long videos are not suitable for locating badminton strokes. In addition, most existing research on badminton stroke recognition is based on static images extracted from badminton videos, and stroke recognition on badminton meta-videos is lacking. Our method therefore focuses on locating and classifying the strokes of the ball-controlling player in an extracted badminton video highlight.
Method
First, the regional multi-person pose estimation (RMPE) model is used to detect human poses in a badminton video highlight. The pose of the target player is selected by applying prediction-score and position constraints that screen out irrelevant skeletons. Joint constraints are then applied to the detected pose to locate the player's arms, and the racket-holding arm is distinguished from the non-holding arm by the difference in their swing amplitudes.
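A minimal sketch of the player-selection step just described is given below; the detection format, court region, and score threshold are our own assumptions for illustration, since RMPE is only assumed here to return per-person keypoints with confidence scores.

def select_target_pose(detections, court_box, min_score=0.5):
    """Pick the target (ball-controlling) player's pose from multi-person detections.
    Each detection is assumed to look like {'keypoints': [(x, y, conf), ...], 'score': float};
    court_box = (x0, y0, x1, y1) marks the image region where the target player stands."""
    candidates = []
    for det in detections:
        if det['score'] < min_score:              # prediction-score constraint
            continue
        xs = [x for x, _, _ in det['keypoints']]
        ys = [y for _, y, _ in det['keypoints']]
        cx, cy = sum(xs) / len(xs), sum(ys) / len(ys)
        x0, y0, x1, y1 = court_box
        if x0 <= cx <= x1 and y0 <= cy <= y1:     # position constraint
            candidates.append((det['score'], det))
    return max(candidates, key=lambda t: t[0])[1] if candidates else None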
The temporal extent of each badminton stroke is then located from the variation of the holding arm's swing amplitude, and the corresponding stroke meta-video is extracted. The swing amplitude of the player's arm in a frame is defined as the linearly weighted sum of the squared moduli of the upper-limb and lower-limb swing vectors.
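As a minimal formalization of this definition (the symbols and weights below are our own notation rather than the paper's), the swing amplitude in frame t can be written as

A_t = \lambda_{1}\,\lVert \boldsymbol{v}^{u}_{t} \rVert^{2} + \lambda_{2}\,\lVert \boldsymbol{v}^{l}_{t} \rVert^{2},

where \boldsymbol{v}^{u}_{t} and \boldsymbol{v}^{l}_{t} denote the swing (displacement) vectors of the upper-limb and lower-limb joints of the holding arm between frames t-1 and t, and \lambda_{1}, \lambda_{2} are the linear weights.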
Then, a dataset of badminton meta-videos is used to train the convolutional block attention module-temporal segment network (CBAM-TSN), which adds a convolutional block attention module to the temporal segment network, to predict the stroke in each meta-video. Because the temporal segment network (TSN) inherits the structure of the two-stream convolutional neural network (CNN), the two streams of each meta-video, a spatial stream of RGB frames and a temporal stream of optical-flow frames, must be extracted from the dataset before training CBAM-TSN. The stroke predicted by the CBAM-TSN model belongs to one of four common types: forehand, backhand, overhead, and lob.
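To illustrate the attention block added to TSN, the following is a minimal PyTorch sketch of a convolutional block attention module in the spirit of Woo et al. (2018); the reduction ratio and spatial kernel size are assumptions, and this is not the authors' exact implementation.

import torch
import torch.nn as nn

class CBAM(nn.Module):
    """Channel attention followed by spatial attention; sketch only."""
    def __init__(self, channels, reduction=16, spatial_kernel=7):
        super().__init__()
        # Channel attention: shared MLP over global average- and max-pooled descriptors
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )
        # Spatial attention: convolution over channel-wise mean and max maps
        self.spatial = nn.Conv2d(2, 1, spatial_kernel, padding=spatial_kernel // 2)

    def forward(self, x):                          # x: (N, C, H, W) feature map
        n, c, _, _ = x.shape
        avg = self.mlp(x.mean(dim=(2, 3)))         # (N, C)
        mx = self.mlp(x.amax(dim=(2, 3)))          # (N, C)
        ca = torch.sigmoid(avg + mx).view(n, c, 1, 1)
        x = x * ca                                 # apply channel attention
        sa = torch.sigmoid(self.spatial(torch.cat(
            [x.mean(dim=1, keepdim=True), x.amax(dim=1, keepdim=True)], dim=1)))
        return x * sa                              # apply spatial attention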
Finally, overhead strokes are further classified as clear or smash by morphological processing: in clear meta-videos, a continuous dynamic mask tends to appear in the background area at the end of the stroke, whereas in smash meta-videos no such continuous dynamic mask appears in the background area. The shuttlecock mask in a meta-video is obtained from the result of image morphological processing, and clears and smashes are distinguished by the position-related features of this mask.
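A hedged sketch of this step is shown below; the frame-differencing threshold, structuring-element size, background region of interest, and persistence ratio are our assumptions rather than the authors' exact settings.

import cv2

def shuttle_mask(prev_gray, gray, thresh=25, ksize=3):
    """Frame differencing + morphological open/close to keep the moving shuttle."""
    diff = cv2.absdiff(gray, prev_gray)
    _, mask = cv2.threshold(diff, thresh, 255, cv2.THRESH_BINARY)
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (ksize, ksize))
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)   # remove speckle noise
    mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)  # fill small gaps
    return mask

def is_clear(tail_frames, background_roi, min_ratio=0.6):
    """Label the overhead stroke as a clear if a dynamic mask keeps appearing
    inside the background ROI (x, y, w, h) over the last frames of the meta-video."""
    x, y, w, h = background_roi
    hits = 0
    prev = cv2.cvtColor(tail_frames[0], cv2.COLOR_BGR2GRAY)
    for frame in tail_frames[1:]:
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        if shuttle_mask(prev, gray)[y:y + h, x:x + w].any():
            hits += 1
        prev = gray
    return hits / max(len(tail_frames) - 1, 1) >= min_ratio  # otherwise: smash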
Result
In a badminton video highlight, a segmentation is judged correct if the meta-video produced by the stroke localization method and the manually extracted meta-video contain the same badminton stroke. The intersection over union (IoU) indicator is used to evaluate the performance of stroke localization.
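For reference, the temporal IoU between a localized stroke segment and its manually extracted ground-truth segment can be computed as in the sketch below, assuming each segment is represented as a (start_frame, end_frame) pair.

def temporal_iou(pred, gt):
    """IoU of two temporal segments given as (start_frame, end_frame)."""
    inter = max(0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0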
Furthermore, the performance of badminton stroke classification is evaluated with the machine learning indicators ROC-AUC, recall, and precision. The experimental results show that the IoU of stroke localization in badminton video highlights reaches 82.6%. The AUC for each of the four badminton stroke types (forehand, backhand, overhead, and lob) predicted by the CBAM-TSN model exceeds 0.98, and the micro-AUC, macro-AUC, average recall, and average precision reach 0.990 8, 0.990 3, 93.5%, and 94.3%, respectively. In addition, compared with three popular action recognition approaches on the badminton stroke recognition task, CBAM-TSN obtains the highest precision, micro-AUC, and macro-AUC. The final average recall and average precision reach 91.2% and 91.6%, respectively. Therefore, the method can effectively locate and classify the main player's strokes in a badminton video highlight.
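These classification indicators can be reproduced with standard scikit-learn utilities; the sketch below uses placeholder inputs and assumes class indices 0-3 for forehand, backhand, overhead, and lob.

import numpy as np
from sklearn.metrics import roc_auc_score, recall_score, precision_score
from sklearn.preprocessing import label_binarize

def stroke_metrics(y_true, y_score):
    """y_true: (N,) integer labels in {0, 1, 2, 3}; y_score: (N, 4) per-class scores."""
    y_bin = label_binarize(y_true, classes=[0, 1, 2, 3])
    y_pred = np.argmax(y_score, axis=1)
    return {
        'micro_auc': roc_auc_score(y_bin, y_score, average='micro'),
        'macro_auc': roc_auc_score(y_bin, y_score, average='macro'),
        'avg_recall': recall_score(y_true, y_pred, average='macro'),
        'avg_precision': precision_score(y_true, y_pred, average='macro'),
    }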
Conclusion
We propose a novel badminton stroke recognition method for badminton video highlights that combines badminton stroke localization with badminton stroke classification, and it further develops the potential of sports video analysis.
Keywords: pose estimation; meta video; badminton stroke localization; convolutional block attention module-temporal segment network (CBAM-TSN); morphological processing; badminton stroke recognition
Chu W T and Situmeang S. 2017. Badminton video analysis based on spatiotemporal and stroke features//Proceedings of 2017 ACM on International Conference on Multimedia Retrieval. Bucharest, Romania: ACM: 448-451[DOI: 10.1145/3078971.3079032]
Fang H S, Xie S Q, Tai Y W and Lu C W. 2017. RMPE: regional multi-person pose estimation//Proceedings of 2017 IEEE International Conference on Computer Vision (ICCV). Venice, Italy: IEEE: 2353-2362[DOI: 10.1109/ICCV.2017.256]
Farha Y A and Gall J. 2019. MS-TCN: multi-stage temporal convolutional network for action segmentation//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, USA: IEEE: #369[DOI: 10.1109/CVPR.2019.00369]
Feng L, Liu S L, Wang J and Xiao Y. 2013. Human motion segmentation algorithm: manifold learning of sequence local warp. Journal of Computer-Aided Design and Computer Graphics, 25(4): 460-467, 473[DOI: 10.3969/j.issn.1003-9775.2013.04.004]
Hu J, Shen L, Albanie S, Sun G and Wu E H. 2020. Squeeze-and-excitation networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(8): 2011-2023[DOI: 10.1109/TPAMI.2019.2913372]
Lei P and Todorovic S. 2018. Temporal deformable residual networks for action segmentation in videos//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: #705[DOI: 10.1109/CVPR.2018.00705]
Munro J and Damen D. 2020. Multi-modal domain adaptation for fine-grained action recognition//Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Seattle, USA: IEEE: #20[DOI: 10.1109/CVPR42600.2020.00020]
Phomsoupha M and Laffaye G. 2015. The science of badminton: game characteristics, anthropometry, physiology, visual fitness and biomechanics. Sports Medicine, 45(4): 473-495[DOI: 10.1007/s40279-014-0287-2]
Qiu Z F, Yao T and Mei T. 2017. Learning spatio-temporal representation with pseudo-3D residual networks//Proceedings of 2017 IEEE International Conference on Computer Vision (ICCV). Venice, Italy: IEEE: #590[DOI: 10.1109/ICCV.2017.590]
Ramasinghe S, Chathuramali K G M and Rodrigo R. 2014. Recognition of badminton strokes using dense trajectories//Proceedings of the 7th International Conference on Information and Automation for Sustainability. Colombo, Sri Lanka: IEEE: #7069620[DOI: 10.1109/ICIAFS.2014.7069620]
Shen Q, Ban X J, Chang Z and Guo J. 2015. On-line detection and temporal segmentation of actions in video based human-computer interaction. Chinese Journal of Computers, 38(12): 2477-2487[DOI: 10.11897/SP.J.1016.2015.02477]
Simonyan K and Zisserman A. 2014. Two-stream convolutional networks for action recognition in videos//Advances in Neural Information Processing Systems 27 (NIPS 2014). Montreal, Canada
Szegedy C, Liu W, Jia Y Q, Sermanet P, Reed S, Anguelov D, Erhan D, Vanhoucke V and Rabinovich A. 2015. Going deeper with convolutions//Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Boston, USA: IEEE: #7298594[DOI: 10.1109/cvpr.2015.7298594]
Tao S, Luo J K, Shang J and Wang M L. 2020. Extracting highlights from a badminton video combine transfer learning with players' velocity//Proceedings of the 33rd International Conference on Computer Animation and Social Agents. Bournemouth, UK: Springer: 82-91[DOI: 10.1007/978-3-030-63426-1_9]
Wang L M, Xiong Y J, Wang Z, Qiao Y, Lin D H, Tang X O and Van Gool L. 2019. Temporal segment networks for action recognition in videos. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(11): 2740-2755[DOI: 10.1109/TPAMI.2018.2868668]
Woo S, Park J, Lee J Y and Kweon I S. 2018. CBAM: convolutional block attention module//Proceedings of the 15th European Conference on Computer Vision. Munich, Germany: Springer: 3-19[DOI: 10.1007/978-3-030-01234-2_1]
Xiong C X, Guo D and Liu X L. 2020. Temporal proposal optimization for temporal action detection. Journal of Image and Graphics, 25(7): 1447-1458[DOI: 10.11834/jig.190440]
Yan S J, Xiong Y J and Lin D H. 2018. Spatial temporal graph convolutional networks for skeleton-based action recognition//Proceedings of the 32nd AAAI Conference on Artificial Intelligence and the 30th Innovative Applications of Artificial Intelligence Conference and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence. New Orleans, USA: AAAI Press: 912
Yang J. 2018. Badminton athlete's movement identification in sports videos. Techniques of Automation and Applications, 37(10): 120-124[DOI: 10.3969/j.issn.1003-7241.2018.10.028]