An embedded dual-scale separable CBAM model for masked human face pose classification
2022, Vol. 27, No. 4: 1125-1136
Print publication date: 2022-04-16
Accepted: 2021-01-13
DOI: 10.11834/jig.200736
Senqiu Chen, Wenbo Liu, Gong Zhang. An embedded dual-scale separable CBAM model for masked human face pose classification[J]. Journal of Image and Graphics, 2022, 27(4): 1125-1136.
Objective
To meet the new need of classifying the poses of mask-occluded faces, and to improve the efficiency and accuracy of convolutional-neural-network-based face pose classification, a lightweight convolutional neural network for masked face pose classification is proposed.
Method
The core of the proposed lightweight network is a dual-scale separable attention convolution unit. The unit places two depthwise separable convolutions with 3 × 3 and 5 × 5 kernels in parallel, and embeds the spatial attention module (SAM) and channel attention module (CAM) of the convolutional block attention module (CBAM) into the depthwise (DW) convolution and the pointwise (PW) convolution respectively, so that the feature maps of the DW and PW convolutions are adjusted in a targeted way. SAM is further supplemented with the squeeze result of a 1 × 1 pointwise convolution to strengthen its use of spatial information and form a more effective attention map. While maintaining performance, the number of channels and units in the network is kept under control, and the fully connected layers are dropped and replaced with convolutional layers, further lightening the model.
Result
Experimental results show that the accuracy of the proposed model is 2.86%, 6.41%, and 12.16% higher than that of the model with separately embedded CBAM but without the improved SAM, the model with CBAM embedded in the standard way, and the model without any attention module, respectively. The dual-scale kernels enrich the features and enhance the feature-extraction capability within limited convolution units. Compared with classical convolutional neural networks, the proposed model has only 1.02 M parameters and 24.18 M floating-point operations (FLOPs), greatly lightening the model while reaching 98.57% accuracy.
Conclusion
A lightweight and efficient convolution unit is designed to build the network model. The model achieves high accuracy with few parameters and low computational complexity, improving both the efficiency and the accuracy of masked face pose classification.
Objective
Human face pose classification is a key aspect of computer vision and intelligent analysis, with potential applications in human behavior analysis, human-computer interaction, motivation detection, fatigue-driving monitoring, face recognition, and virtual reality. Wearing masks became a common intervention during the corona virus disease 2019 (COVID-19) pandemic, so classifying the poses of masked faces is a new challenge. Convolutional neural networks (CNNs) are widely applied to identifying human face information and are used in face pose estimation, including under low resolution, occlusion interference, and complicated environments. Given the strong capability of CNNs and their successful application to face pose classification, we apply them to masked face pose classification. Face pose estimation is an intermediate stage of computer vision and intelligent analysis pipelines, and its results feed subsequent analysis and decision-making. As an intermediate component, a lightweight and efficient network structure lets the estimator play a greater role within limited resources. Our research therefore focuses on an efficient and lightweight convolutional neural network for masked face pose estimation.
Method
The core of the designed network is an efficient and lightweight dual-scale separable attention convolution (DSAC) unit, and the model is built by stacking five DSAC units. The DSAC unit places two depthwise separable convolutions with 3 × 3 and 5 × 5 kernels in parallel, and the convolutional block attention module (CBAM) is embedded into both convolutional paths. However, our embedding differs from the conventional one: we split CBAM into its spatial attention module (SAM) and channel attention module (CAM), embed SAM after the depthwise (DW) convolution and CAM after the pointwise (PW) convolution, and cascade the two parts. In this way, the attention is matched to the features that the DW and PW convolutions actually produce. We also improve SAM by supplementing it with the result of a 1 × 1 pointwise-convolution squeeze, which enhances the utilization of spatial information and yields a more effective attention map.
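The separated embedding can be sketched as follows. This is a minimal NumPy illustration, not the authors' code: the 1 × 1 squeeze weights `w_pw`, the three-way mixing weights `w_mix` (standing in for the k × k fusion convolution of standard SAM), and the MLP reduction ratio are all illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def improved_sam(x, w_pw, w_mix):
    """Improved spatial attention (sketch), applied after the DW conv.
    x: feature map (C, H, W). Besides the channel-wise mean and max of
    standard SAM, a 1x1 pointwise squeeze (w_pw: (C,)) supplies a third
    spatial descriptor; w_mix: (3,) fuses the three descriptors."""
    avg = x.mean(axis=0)                   # (H, W) channel-wise mean
    mx = x.max(axis=0)                     # (H, W) channel-wise max
    pw = np.tensordot(w_pw, x, axes=1)     # (H, W) 1x1 pointwise squeeze
    att = sigmoid(w_mix[0] * avg + w_mix[1] * mx + w_mix[2] * pw)
    return x * att                         # rescale every channel map

def cam(x, w1, w2):
    """CBAM-style channel attention, applied after the PW conv.
    A shared two-layer MLP (w1: (C//r, C), w2: (C, C//r)) acts on the
    global average- and max-pooled channel descriptors."""
    avg = x.mean(axis=(1, 2))              # (C,) global average pool
    mx = x.max(axis=(1, 2))                # (C,) global max pool
    mlp = lambda v: w2 @ np.maximum(w1 @ v, 0.0)   # ReLU bottleneck
    att = sigmoid(mlp(avg) + mlp(mx))      # (C,) channel weights
    return x * att[:, None, None]
```

Both modules preserve the feature-map shape, so SAM can be dropped directly behind the DW convolution and CAM behind the PW convolution of either branch.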
We use five DSAC units to build a high-accuracy network, rigorously controlling the number of channels and units. The fully connected layers are discarded because of their redundant parameters and computational complexity; a pointwise convolutional layer combined with a global average pooling (GAP) layer replaces them, making the network lighter still. In addition, large-scale collection of masked-face data was temporarily infeasible because of the impact of COVID-19, so we paste properly scaled, rotated, and deformed mask images onto a common face pose dataset to construct a semi-synthetic dataset. A small number of real masked-face pose images were also collected to build a real dataset, and transfer learning is used to train the model in the absence of a massive real dataset.
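Putting the unit together, the following sketch shows one dual-scale forward pass under stated assumptions: attention is omitted for brevity (the SAM/CAM hooks are marked in comments), and fusing the branches by channel concatenation is our assumption, since the abstract only says the two scales run in parallel.

```python
import numpy as np

def depthwise_conv(x, k):
    """Stride-1 depthwise convolution with zero 'same' padding.
    x: (C, H, W); k: (C, kh, kw) -- one spatial filter per channel."""
    C, H, W = x.shape
    kh, kw = k.shape[1], k.shape[2]
    xp = np.pad(x, ((0, 0), (kh // 2, kh // 2), (kw // 2, kw // 2)))
    out = np.zeros_like(x)
    for i in range(kh):
        for j in range(kw):
            out += k[:, i:i + 1, j:j + 1] * xp[:, i:i + H, j:j + W]
    return out

def pointwise_conv(x, w):
    """1x1 convolution mixing channels. w: (C_out, C_in)."""
    return np.tensordot(w, x, axes=([1], [0]))   # (C_out, H, W)

def dsac_unit(x, params3, params5):
    """One dual-scale pass: 3x3 and 5x5 depthwise separable branches
    in parallel. Each branch runs DW conv (SAM would follow here) ->
    PW conv (CAM would follow here) -> ReLU."""
    branches = []
    for k_dw, w_pw in (params3, params5):
        h = depthwise_conv(x, k_dw)        # per-channel spatial filtering
        h = pointwise_conv(h, w_pw)        # cross-channel mixing
        branches.append(np.maximum(h, 0.0))
    return np.concatenate(branches, axis=0)   # stack the two scales
```

For the classification head described above, the same `pointwise_conv` followed by a spatial mean (`.mean(axis=(1, 2))`) reproduces the pointwise-conv-plus-GAP replacement of the fully connected layers.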
Result
The ablation results show that the accuracy of the proposed model is 2.86%, 6.41%, and 12.16% higher than that of the model with separately embedded CBAM but without the improved SAM, the model with CBAM embedded in the standard way, and the model without any attention module, respectively. The model with separately embedded CBAM and the improved SAM thus has an efficient and lightweight structure: it improves performance while adding few parameters and little computational complexity. Depthwise separable convolution makes the model more compact, and the CBAM attention makes it more effective. The dual-scale convolution enriches the features and enhances the feature-extraction capability within a limited number of convolution units; extracting features at different scales improves performance while avoiding the over-fitting and rapid parameter growth caused by stacking convolutional layers. Compared with classical convolutional neural networks such as AlexNet, Visual Geometry Group network (VGGNet), ResNet, and GoogLeNet, the parameters and computational complexity of the proposed model decrease significantly, and its accuracy is 3.57%-30.71% higher than AlexNet, VGG16, ResNet18, and GoogLeNet. Compared with classical lightweight networks such as SqueezeNet, MobileNet, ShuffleNet, and EfficientNet, the proposed model has the lowest parameters and computational complexity and the highest accuracy: only 1.02 M parameters and 24.18 M floating-point operations (FLOPs), while achieving 98.57% accuracy.
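The parameter economy behind these numbers comes from depthwise separable convolution: a standard k × k convolution from C_in to C_out channels costs k²·C_in·C_out weights, whereas the DW + PW pair costs only k²·C_in + C_in·C_out. A quick check with hypothetical channel widths (the abstract does not list per-layer sizes, so 64 -> 128 is illustrative only):

```python
def conv_params(k, c_in, c_out):
    """Weights in a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out

def separable_params(k, c_in, c_out):
    """Weights in a depthwise (k x k per channel) + pointwise (1x1) pair."""
    return k * k * c_in + c_in * c_out

# Hypothetical 64 -> 128 channel layer:
standard = conv_params(3, 64, 128)        # 73 728 weights
dual_scale = (separable_params(3, 64, 128)    # 8 768 weights
              + separable_params(5, 64, 128)) # + 9 792 weights
```

Even with both scales combined, the dual-scale separable pair (18 560 weights) uses roughly a quarter of the weights of a single standard 3 × 3 convolution, which is consistent with the large parameter reduction reported above.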
Conclusion
A lightweight and efficient convolution unit is designed to construct the network, which has few parameters, low computational complexity, and high accuracy.
Keywords: lightweight convolutional neural network; masked face pose classification; depthwise separable convolution; convolutional block attention module (CBAM); deep learning; corona virus disease 2019 (COVID-19)
Byungtae A, Park J and Kweon I S. 2015. Real-time head orientation from a monocular camera using deep neural network//Cremers D, Reid I, Saito H and Yang M H, eds. Lecture Notes in Computer Science. Cham: Springer: 82-96 [DOI: 10.1007/978-3-319-16811-1_6]
Borghi G, Fabbri M, Vezzani R, Calderara S and Cucchiara R. 2020. Face-from-depth for head pose estimation on depth images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(3): 596-609 [DOI: 10.1109/TPAMI.2018.2885472]
Denton E, Zaremba W, Bruna J, LeCun Y and Fergus R. 2014. Exploiting linear structure within convolutional networks for efficient evaluation//Proceedings of the 27th International Conference on Neural Information Processing Systems. Montreal, Canada: ACM: 1269-1277 [DOI: 10.5555/2968826.2968968]
Dong L F, Ren L L and Dong Y D. 2016. Complex illumination face pose estimation. Journal of Chinese Computer Systems, 37(3): 598-602
Dua I, Nambi A U, Jawahar C V and Padmanabhan V. 2019. AutoRate: how attentive is the driver?//Proceedings of the 14th IEEE International Conference on Automatic Face & Gesture Recognition. Lille, France: IEEE: 1-8 [DOI: 10.1109/FG.2019.8756620]
Han S, Pool J, Tran J and Dally W J. 2015. Learning both weights and connections for efficient neural networks//Proceedings of the 28th International Conference on Neural Information Processing Systems. Montreal, Canada: ACM: 1135-1143 [DOI: 10.5555/2969239.2969366]
Howard A G, Zhu M L, Chen B, Kalenichenko D, Wang W J, Weyand T, Andreetto M and Adam H. 2017. MobileNets: efficient convolutional neural networks for mobile vision applications [EB/OL]. [2020-11-11]. http://arxiv.org/pdf/1704.04861.pdf
Iandola F N, Han S, Moskewicz M W, Ashraf K, Dally W J and Keutzer K. 2016. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size [EB/OL]. [2020-11-11]. http://arxiv.org/pdf/1602.07360.pdf
Khan S D, Ali Y, Zafar B and Noorwali A. 2020. Robust head detection in complex videos using two-stage deep convolution framework. IEEE Access, 8: 98679-98692 [DOI: 10.1109/ACCESS.2020.2995764]
LeCun Y, Bengio Y and Hinton G. 2015. Deep learning. Nature, 521(7553): 436-444 [DOI: 10.1038/nature14539]
Lu Y, Wang S G, Zhao W T and Wu W. 2015. Technology of virtual eyeglasses try-on system based on face pose estimation. Chinese Optics, 8(4): 582-588 [DOI: 10.3788/CO.20150804.0582]
Ma N N, Zhang X Y, Zheng H T and Sun J. 2018. ShuffleNet v2: practical guidelines for efficient CNN architecture design//Proceedings of the 15th European Conference on Computer Vision. Munich, Germany: Springer: 122-138 [DOI: 10.1007/978-3-030-01264-9_8]
Murphy-Chutorian E and Trivedi M M. 2009. Head pose estimation in computer vision: a survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(4): 607-626 [DOI: 10.1109/TPAMI.2008.106]
Patacchiola M and Cangelosi A. 2017. Head pose estimation in the wild using convolutional neural networks and adaptive gradient methods. Pattern Recognition, 71: 132-143 [DOI: 10.1016/j.patcog.2017.06.009]
Raza M, Chen Z H, Rehman S U, Wang P and Bao P. 2018. Appearance based pedestrians' head pose and body orientation estimation using deep learning. Neurocomputing, 272: 647-659 [DOI: 10.1016/j.neucom.2017.07.029]
Ruiz N, Chong E and Rehg J M. 2018. Fine-grained head pose estimation without keypoints//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW). Salt Lake City, USA: IEEE: 2155-215509 [DOI: 10.1109/CVPRW.2018.00281]
Sandler M, Howard A, Zhu M L, Zhmoginov A and Chen L C. 2018. MobileNet v2: inverted residuals and linear bottlenecks//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 4510-4520 [DOI: 10.1109/CVPR.2018.00474]
Selvaraju R R, Cogswell M, Das A, Vedantam R, Parikh D and Batra D. 2017. Grad-CAM: visual explanations from deep networks via gradient-based localization//Proceedings of 2017 International Conference on Computer Vision. Venice, Italy: IEEE: 618-626 [DOI: 10.1109/ICCV.2017.74]
Szegedy C, Liu W, Jia Y Q, Sermanet P, Reed S, Anguelov D, Erhan D, Vanhoucke V and Rabinovich A. 2015. Going deeper with convolutions//Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Boston, USA: IEEE: 1-9 [DOI: 10.1109/CVPR.2015.7298594]
Tan M X and Le Q V. 2019a. EfficientNet: rethinking model scaling for convolutional neural networks [EB/OL]. [2020-11-11]. https://arxiv.org/pdf/1905.11946.pdf
Tan M X and Le Q V. 2019b. MixConv: mixed depthwise convolutional kernels [EB/OL]. [2020-11-11]. http://arxiv.org/pdf/1907.09595.pdf
Woo S, Park J, Lee J Y and Kweon I S. 2018. CBAM: convolutional block attention module//Proceedings of the 15th European Conference on Computer Vision. Munich, Germany: Springer: 3-19 [DOI: 10.1007/978-3-030-01234-2_1]
Wu C Z, Zheng R S, Zang H J, Liu M W, Xu J J and Zhan S. 2021. Face pose correction based on morphable model and image inpainting. Journal of Image and Graphics, 26(4): 828-836 [DOI: 10.11834/jig.200011]
Yosinski J, Clune J, Bengio Y and Lipson H. 2014. How transferable are features in deep neural networks?//Proceedings of the 27th International Conference on Neural Information Processing Systems. Montreal, Canada: ACM: 3320-3328 [DOI: 10.5555/2969033.2969197]
Zhang X, Zhou X, Lin M and Sun J. 2018. ShuffleNet: an extremely efficient convolutional neural network for mobile devices//Proceedings of 2018 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 6848-6856 [DOI: 10.1109/CVPR.2018.00716]
Zhang X H, Shan S G, Cao B, Gao W, Zhou D L and Zhao D B. 2005. CAS-PEAL: a large-scale Chinese face database and some primary evaluations. Journal of Computer-Aided Design & Computer Graphics, 17(1): 9-17 [DOI: 10.3321/j.issn:1003-9775.2005.01.002]
Zhou Y, Chen S C, Wang Y M and Huan W M. 2020. Review of research on lightweight convolutional neural networks//Proceedings of the 5th Information Technology and Mechatronics Engineering Conference (ITOEC). Chongqing, China: IEEE: 1713-1720 [DOI: 10.1109/ITOEC49072.2020.9141847]
Zhuang Y and Qi Y. 2021. Driving fatigue detection based on pseudo 3D convolutional neural network and attention mechanisms. Journal of Image and Graphics, 26(1): 143-153 [DOI: 10.11834/jig.200079]