Hand gesture recognition in complex background
2021, Vol. 26, No. 4, pp. 815-827
Received: 2020-06-16; Revised: 2020-09-16; Accepted: 2020-09-23; Published in print: 2021-04-16
DOI: 10.11834/jig.200211
Objective
Gesture recognition is a hot topic in human-computer interaction. To address the low recognition rate of traditional gesture recognition methods in complex backgrounds and the long detection time of existing deep-learning-based methods, this paper proposes a gesture recognition method based on an improved TinyYOLOv3 algorithm.
Method
The TinyYOLOv3 backbone network is redesigned and the number of network layers is increased so that the network extracts richer semantic information. Depthwise separable convolution replaces standard convolution, and features from different network layers are fused, reducing the model size while maintaining recognition accuracy. The CIoU (complete intersection over union) loss replaces the original bounding-box coordinate prediction loss, and a channel attention module is integrated into the feature extraction network, improving localization precision and recognition accuracy. Data augmentation is used to avoid overfitting during training, and hyperparameter optimization, prior-box clustering, and other methods accelerate network convergence.
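The parameter saving from depthwise separable convolution can be sketched with a quick count. The layer sizes below (3×3 kernel, 256→512 channels) are illustrative assumptions, not the configuration used in the paper:

```python
# Parameter counts for a standard vs. a depthwise separable convolution layer.
# Layer sizes are illustrative, not the paper's actual configuration.

def standard_conv_params(k, c_in, c_out):
    # One k x k x c_in filter per output channel.
    return k * k * c_in * c_out

def depthwise_separable_params(k, c_in, c_out):
    # Depthwise step: one k x k filter per input channel;
    # pointwise step: one 1 x 1 x c_in filter per output channel.
    return k * k * c_in + c_in * c_out

std = standard_conv_params(3, 256, 512)        # 1,179,648 parameters
sep = depthwise_separable_params(3, 256, 512)  # 133,376 parameters
print(std, sep, round(std / sep, 1))           # roughly 8.8x fewer parameters
```

For a 3×3 kernel the separable form needs close to nine times fewer weights, which is why it can deepen the network while still shrinking the model.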
Result
The improved network reaches a recognition accuracy of 99.1% with a model size of 27.6 MB, an 8.5% accuracy gain and a 5.6 MB size reduction over the original TinyYOLOv3. Compared with YOLO (you only look once) v3 and SSD (single shot multibox detector) 300, accuracy is slightly lower, but the model shrinks to roughly 1/8 and 1/3 of their sizes, respectively; compared with lightweight networks such as YOLO-lite and MobileNet-SSD, accuracy improves by 61.12% and 3.11%, respectively. On a self-built gesture dataset with complex backgrounds, the improved model reaches 97.3% accuracy, fully demonstrating the feasibility of the proposed algorithm.
Conclusion
The proposed improved TinyYOLOv3 gesture recognition method achieves high recognition accuracy for gestures in complex backgrounds while outperforming the compared algorithms in detection speed and model size, and can therefore meet the requirements of embedded devices.
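The prior-box clustering mentioned above is conventionally k-means over the labeled box sizes under a 1 − IoU distance, as introduced for YOLO-style anchors. A minimal pure-Python sketch on toy box sizes (not the paper's dataset):

```python
# k-means clustering of bounding-box (w, h) pairs under a 1 - IoU distance,
# the usual way YOLO-style prior boxes are chosen. Box sizes are toy data.

def iou_wh(a, b):
    # IoU of two boxes aligned at a common corner, given (w, h) only.
    inter = min(a[0], b[0]) * min(a[1], b[1])
    union = a[0] * a[1] + b[0] * b[1] - inter
    return inter / union

def kmeans_anchors(boxes, k, iters=50):
    # Simple deterministic init for the sketch: evenly strided samples.
    step = max(1, len(boxes) // k)
    centers = boxes[::step][:k]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for box in boxes:
            # Highest IoU = smallest 1 - IoU distance.
            best = max(range(k), key=lambda i: iou_wh(box, centers[i]))
            clusters[best].append(box)
        centers = [
            (sum(w for w, _ in c) / len(c), sum(h for _, h in c) / len(c))
            if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return sorted(centers)

boxes = [(10, 12), (11, 13), (50, 60), (52, 58), (100, 110), (98, 112)]
print(kmeans_anchors(boxes, 3))  # [(10.5, 12.5), (51.0, 59.0), (99.0, 111.0)]
```

The IoU distance groups boxes by shape rather than by absolute pixel difference, so small and large boxes each get well-fitting priors.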
Objective
The rapid development of artificial intelligence and target detection technology has accelerated the iteration of intelligent devices and promoted the development of related technologies in the field of human-computer interaction. As an important form of body language and an important means of realizing human-computer interaction, gesture recognition has attracted considerable attention. It is simple, efficient, direct, and rich in content; this interaction mode is in line with people's daily behavior and easy to understand. Gesture recognition has wide application prospects in smart homes, virtual reality, sign language recognition, and other fields, and it involves a wide range of disciplines, such as image processing, ergonomics, machine vision, and deep learning. In addition, due to the variety of gestures and the complexity of practical application environments, gesture recognition has become a very challenging research topic.
Method
Traditional vision-based gesture recognition methods mainly use the skin color or skeleton model of the human body to segment gestures and realize gesture classification through manually designed and extracted features. However, the collected RGB images are greatly affected by lighting conditions, skin color, clothing, and background; under backlight, dim light, or dark skin color, segmentation and recognition perform poorly. When features such as texture and edges are extracted manually, feature omission and misjudgment easily occur, so the recognition rate is low and robustness is poor in complex backgrounds. In recent years, deep learning has attracted increasing attention for its robustness and high accuracy, and convolutional neural network models have gradually replaced manual feature extraction as the mainstream approach to gesture recognition. Although mainstream deep learning methods such as you only look once (YOLO) and single shot multibox detector (SSD) achieve high accuracy in gesture recognition under complex backgrounds, their models are generally large and their detection time is long, making real-time detection on embedded devices difficult. How to reduce model and algorithm complexity while ensuring detection accuracy and meeting real-time requirements in practical applications has therefore become an urgent problem. The TinyYOLOv3 algorithm offers fast detection and a small model, but its recognition accuracy falls far short of practical requirements. To solve these problems, this study proposes a gesture recognition method based on an improved TinyYOLOv3 algorithm. The TinyYOLOv3 backbone network is redesigned: a convolution with stride 2 replaces the original maximum pooling layers, and the number of network layers is increased to ensure that the network extracts richer semantic information. At the same time, depthwise separable convolution replaces standard convolution, and features from different network layers are fused to reduce the model size, maintain recognition accuracy, and avoid the loss of feature information caused by deepening the network structure. For the loss function, the CIoU (complete intersection over union) loss replaces the original bounding-box coordinate prediction loss; experimental results show that CIoU speeds up model convergence, reduces training time, and improves accuracy to a certain extent. A channel attention module is integrated into the feature extraction network to recalibrate the information of different channels, improving recognition accuracy with only a small increase in parameters. Data augmentation is used to avoid overfitting during training, and hyperparameter optimization, dynamic learning rate scheduling, prior-box clustering, and other methods accelerate network convergence.
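The CIoU loss named above combines an IoU term with a normalized center-distance penalty and an aspect-ratio consistency term (Zheng et al., 2020). A minimal single-box sketch; real implementations operate on batched tensors and guard against zero-area boxes:

```python
import math

# CIoU loss for two axis-aligned boxes given as (x1, y1, x2, y2) corners,
# following the definition in Zheng et al. (2020):
#   L = 1 - IoU + rho^2 / c^2 + alpha * v

def ciou_loss(a, b):
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    # IoU term.
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    iou = inter / union
    # Squared center distance over squared diagonal of the enclosing box.
    cw = max(ax2, bx2) - min(ax1, bx1)
    ch = max(ay2, by2) - min(ay1, by1)
    rho2 = ((ax1 + ax2 - bx1 - bx2) ** 2 + (ay1 + ay2 - by1 - by2) ** 2) / 4
    c2 = cw ** 2 + ch ** 2
    # Aspect-ratio consistency term.
    v = (4 / math.pi ** 2) * (
        math.atan((bx2 - bx1) / (by2 - by1))
        - math.atan((ax2 - ax1) / (ay2 - ay1))
    ) ** 2
    alpha = v / (1 - iou + v) if iou < 1 else 0.0
    return 1 - (iou - rho2 / c2 - alpha * v)

print(ciou_loss((0, 0, 4, 4), (0, 0, 4, 4)))  # 0.0 for identical boxes
```

Unlike the plain IoU loss, the center-distance term still produces a useful gradient when the predicted and ground-truth boxes do not overlap, which is what speeds up convergence.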
Result
This study uses the NUS-II (National University of Singapore) gesture dataset for verification experiments. The results show that the recognition accuracy of the improved network reaches 99.1%, which is 8.5% higher than that of the original TinyYOLOv3 (90.6%), while the model size is reduced from 33.2 MB to 27.6 MB. Compared with YOLOv3, the recognition accuracy of the improved algorithm is slightly lower; however, the detection speed is nearly doubled, the model size is about one-eighth that of YOLOv3, and the number of parameters is reduced by nearly 10 times, verifying the feasibility of the algorithm. Ablation experiments on the individual improvements show that each module contributes to the accuracy of the algorithm. Comparing the accuracy and loss curves of TinyYOLOv3, TinyYOLOv3 improved with CIoU, and the proposed algorithm verifies the advantages of the proposed algorithm in training time and convergence speed. The improved algorithm was also compared with several classical traditional and deep learning gesture recognition algorithms, and it achieved better results in model size, detection time, and accuracy.
Conclusion
Gesture recognition in complex backgrounds is a key and difficult problem in the field of gesture recognition. To solve the problems of the low recognition rate of traditional methods in complex backgrounds and the long detection time of existing deep learning methods, a gesture recognition method based on an improved TinyYOLOv3 algorithm is proposed. The network structure, loss function, feature channels, and prior-box clustering are improved. Depthwise separable convolution makes it possible to deepen the network while reducing the number of parameters, and the deepened structure together with feature channel optimization enables the network to extract more effective semantic information and improves detection. The improved network ensures accuracy while balancing model size against detection time, and can meet the usage requirements of embedded equipment.
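The feature channel optimization referred to above follows the squeeze-and-excitation pattern of channel attention (Hu et al., 2018). A minimal pure-Python sketch; the weights here are stand-in values, not the paper's trained module, and real code would use framework tensors:

```python
import math

# Squeeze-and-excitation style channel attention on a feature map stored as
# nested lists [channel][row][col]. Weights w1, w2 are stand-in values for
# the two small fully connected layers; a trained network would learn them.

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def channel_attention(fmap, w1, w2):
    # Squeeze: global average pooling reduces each channel to one number.
    z = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in fmap]
    # Excitation: FC -> ReLU -> FC -> sigmoid produces one weight per channel.
    hidden = [max(0.0, sum(wij * zj for wij, zj in zip(wi, z))) for wi in w1]
    scale = [sigmoid(sum(wij * hj for wij, hj in zip(wi, hidden))) for wi in w2]
    # Recalibrate: multiply every value in a channel by that channel's weight.
    return [[[v * s for v in row] for row in ch] for ch, s in zip(fmap, scale)]

fmap = [[[1.0, 1.0], [1.0, 1.0]], [[2.0, 2.0], [2.0, 2.0]]]
identity = [[1.0, 0.0], [0.0, 1.0]]
print(channel_attention(fmap, identity, identity))
```

Because the module only adds two small fully connected layers per insertion point, it recalibrates channels with a negligible parameter increase.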
Girshick R, Donahue J, Darrell T and Malik J. 2014. Rich feature hierarchies for accurate object detection and semantic segmentation//Proceedings of 2014 IEEE Conference on Computer Vision and Pattern Recognition. Columbus, USA: IEEE: 580-587 [DOI: 10.1109/CVPR.2014.81]
Gong X T, Ma L and Ouyang H K. 2020. An improved method of Tiny YOLOV3. IOP Conference Series: Earth and Environmental Science, 440(5): #052025 [DOI: 10.1088/1755-1315/440/5/052025]
Gupta B, Shukla P and Mittal A. 2016. K-nearest correlated neighbor classification for Indian sign language gesture recognition using feature fusion//Proceedings of 2016 International Conference on Computer Communication and Informatics. Coimbatore, India: IEEE: 1-5 [DOI: 10.1109/ICCCI.2016.7479951]
Howard A G, Zhu M L, Chen B, Kalenichenko D, Wang W J, Weyand T, Andreetto M and Adam H. 2017. MobileNets: efficient convolutional neural networks for mobile vision applications [EB/OL]. [2020-06-02]. https://arxiv.org/pdf/1704.04861.pdf
Hu J, Shen L and Sun G. 2018. Squeeze-and-excitation networks//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 7132-7141 [DOI: 10.1109/CVPR.2018.00745]
Huang R, Pedoeem J and Chen C X. 2018. YOLO-LITE: a real-time object detection algorithm optimized for non-GPU computers//Proceedings of 2018 IEEE International Conference on Big Data (Big Data). Seattle, USA: IEEE: 2503-2510 [DOI: 10.1109/BigData.2018.8621865]
Iandola F N, Han S, Moskewicz M W, Ashraf K, Dally W J and Keutzer K. 2016. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size [EB/OL]. [2020-06-02]. https://arxiv.org/pdf/1602.07360.pdf
Jia J, Jiang J M and Wang D. 2008. Recognition of hand gesture based on Gaussian mixture model//Proceedings of 2008 International Workshop on Content-Based Multimedia Indexing. London, UK: IEEE: 353-356 [DOI: 10.1109/CBMI.2008.4564968]
Jiang D, Zheng Z J, Li G F, Sun Y, Kong J Y, Jiang G Z, Xiong H G, Tao B, Xu S, Yu H, Liu H H and Ju Z J. 2019. Gesture recognition based on binocular vision. Cluster Computing, 22(6): 13261-13271 [DOI: 10.1007/s10586-018-1844-5]
Liu S P, Liu Y, Yu J and Wang Z F. 2015. Hierarchical static hand gesture recognition by combining finger detection and HOG features. Journal of Image and Graphics, 20(6): 781-788 [DOI: 10.11834/jig.20150607]
Liu W, Anguelov D, Erhan D, Szegedy C, Reed S, Fu C Y and Berg A C. 2016. SSD: single shot MultiBox detector//Proceedings of the European Conference on Computer Vision. Amsterdam, the Netherlands: Springer: 21-37 [DOI: 10.1007/978-3-319-46448-0_2]
Lyu N, Yang X H, Jiang Y and Xu T. 2017. Sparse decomposition for data glove gesture recognition//Proceedings of the 10th International Congress on Image and Signal Processing, BioMedical Engineering and Informatics. Shanghai, China: IEEE: 1-5 [DOI: 10.1109/CISP-BMEI.2017.8302114]
Marcel S. 1999. Hand posture recognition in a body-face centered space//Proceedings of the CHI'99 Extended Abstracts on Human Factors in Computing Systems. Pittsburgh, USA: ACM: 302-303 [DOI: 10.1145/632716.632901]
Pisharady P K, Vadakkepat P and Loh A P. 2013. Attention based detection and recognition of hand postures against complex backgrounds. International Journal of Computer Vision, 101(3): 403-419 [DOI: 10.1007/s11263-012-0560-5]
Priyal S P and Bora P K. 2013. A robust static hand gesture recognition system using geometry based normalizations and Krawtchouk moments. Pattern Recognition, 46(8): 2202-2219 [DOI: 10.1016/j.patcog.2013.01.033]
Redmon J, Divvala S, Girshick R and Farhadi A. 2016. You only look once: unified, real-time object detection//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, USA: IEEE: 779-788 [DOI: 10.1109/CVPR.2016.91]
Redmon J and Farhadi A. 2017. YOLO9000: better, faster, stronger//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE: 6517-6525 [DOI: 10.1109/CVPR.2017.690]
Redmon J and Farhadi A. 2018. YOLOv3: an incremental improvement [EB/OL]. [2020-06-02]. https://arxiv.org/pdf/1804.02767.pdf
Ren S Q, He K M, Girshick R and Sun J. 2015. Faster R-CNN: towards real-time object detection with region proposal networks//Proceedings of the 28th International Conference on Neural Information Processing Systems. Montreal, Canada: NIPS: 91-99
Rezatofighi H, Tsoi N, Gwak J Y, Sadeghian A, Reid I and Savarese S. 2019. Generalized intersection over union: a metric and a loss for bounding box regression//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, USA: IEEE: 658-666 [DOI: 10.1109/CVPR.2019.00075]
Sandler M, Howard A, Zhu M L, Zhmoginov A and Chen L C. 2018. MobileNetV2: inverted residuals and linear bottlenecks//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 4510-4520 [DOI: 10.1109/CVPR.2018.00474]
Tan M X and Le Q V. 2019. MixConv: mixed depthwise convolutional kernels [EB/OL]. [2020-06-02]. https://arxiv.org/pdf/1907.09595.pdf
Triesch J and von der Malsburg C. 2001. A system for person-independent hand posture recognition against complex backgrounds. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(12): 1449-1453 [DOI: 10.1109/34.977568]
Wang L, Liu H, Wang B and Li P J. 2017. Gesture recognition method combining skin color models and convolution neural network. Computer Engineering and Applications, 53(6): 209-214 [DOI: 10.3778/j.issn.1002-8331.1508-0251]
Wu B C, Wan A, Iandola F, Jin P H and Keutzer K. 2017. SqueezeDet: unified, small, low power fully convolutional neural networks for real-time object detection for autonomous driving//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops. Honolulu, USA: IEEE: 129-137 [DOI: 10.1109/CVPRW.2017.60]
Wu Q. 2018. Research on Gesture Recognition Algorithm Based on Improved CNN and SVM. Nanchang: Jiangxi Agricultural University
Zhang X Y, Zhou X Y, Lin M X and Sun J. 2018. ShuffleNet: an extremely efficient convolutional neural network for mobile devices//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 6848-6856 [DOI: 10.1109/CVPR.2018.00716]
Zheng Z H, Wang P, Liu W, Li J Z, Ye R G and Ren D W. 2020. Distance-IoU loss: faster and better learning for bounding box regression//Proceedings of the AAAI Conference on Artificial Intelligence. New York, USA: AAAI: 12993-13000 [DOI: 10.1609/aaai.v34i07.6999]