Masked face pose classification with an embedded dual-scale separable convolutional block attention module

Chen Senqiu1,2, Liu Wenbo1,2, Zhang Gong3 (1. College of Automation Engineering, Nanjing University of Aeronautics and Astronautics, Nanjing 211106, China; 2. Key Laboratory of Non-Destructive Testing and Monitoring Technology for High-Speed Transport Facilities, Ministry of Industry and Information Technology, Nanjing 211106, China; 3. College of Electronic and Information Engineering, Nanjing University of Aeronautics and Astronautics, Nanjing 211106, China)

Abstract
Objective To meet the new demand for pose classification of faces occluded by masks, and to improve the efficiency and accuracy of convolutional-neural-network-based face pose classification, a lightweight convolutional neural network is proposed for masked face pose classification. Method The core of the proposed lightweight network is a dual-scale separable attention convolution unit. This unit consists of two depthwise separable convolutions with 3×3 and 5×5 kernels arranged in parallel, and the spatial attention module (SAM) and channel attention module (CAM) of the convolutional block attention module (CBAM) are embedded separately into the depthwise (DW) convolution and the pointwise (PW) convolution, so that the feature maps of the DW and PW convolutions are refined in a targeted manner. In addition, the SAM is supplemented with the squeezed result of a 1×1 pointwise convolution to strengthen its use of spatial information and form a more effective attention map. While maintaining model performance, the number of channels and the number of convolution units in the network are strictly controlled, and the fully connected layers are discarded and replaced with convolutional layers to further lighten the model. Result Experimental results show that the accuracy of the proposed model is 2.86%, 6.41%, and 12.16% higher than that of the model with separately embedded CBAM but without the improved SAM, the model with CBAM embedded in the standard way, and the model without any attention module, respectively. The dual-scale kernels enrich the features and strengthen feature extraction within a limited number of convolution units. Compared with classical convolutional neural networks, the proposed model has only 1.02 M parameters and 24.18 M floating-point operations (FLOPs), which greatly lightens the model while reaching an accuracy of 98.57%. Conclusion A lightweight and efficient convolution unit is designed to build the network model. The model achieves high accuracy with a low parameter count and low computational complexity, improving both the efficiency and the accuracy of masked face pose classification.
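To make the unit described above concrete, the following PyTorch sketch shows one possible reading of the dual-scale separable attention convolution unit: SAM applied after each DW convolution, CAM after each PW convolution, and the 3×3 and 5×5 branches concatenated. The module names, channel counts, normalization/activation placement, and the way the 1×1 squeeze map is merged into the SAM are assumptions for illustration, not the authors' published configuration.

```python
# Hypothetical sketch of the dual-scale separable attention convolution (DSAC) unit.
# Channel counts, normalization, and activation placement are assumptions.
import torch
import torch.nn as nn


class ChannelAttention(nn.Module):
    """CAM of CBAM: channel weights from global avg- and max-pooled descriptors."""
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1, bias=False),
        )

    def forward(self, x):
        avg = self.mlp(torch.mean(x, dim=(2, 3), keepdim=True))
        mx = self.mlp(torch.amax(x, dim=(2, 3), keepdim=True))
        return x * torch.sigmoid(avg + mx)


class ImprovedSpatialAttention(nn.Module):
    """SAM of CBAM, supplemented with a 1x1 pointwise-convolution squeeze map."""
    def __init__(self, channels, kernel_size=7):
        super().__init__()
        self.squeeze = nn.Conv2d(channels, 1, kernel_size=1, bias=False)  # extra 1x1 squeeze
        self.conv = nn.Conv2d(3, 1, kernel_size, padding=kernel_size // 2, bias=False)

    def forward(self, x):
        avg = torch.mean(x, dim=1, keepdim=True)       # channel-wise average map
        mx, _ = torch.max(x, dim=1, keepdim=True)      # channel-wise max map
        pw = self.squeeze(x)                           # supplementary 1x1 squeeze map
        attn = torch.sigmoid(self.conv(torch.cat([avg, mx, pw], dim=1)))
        return x * attn


class SeparableAttentionBranch(nn.Module):
    """DW conv refined by SAM, then PW conv refined by CAM."""
    def __init__(self, in_ch, out_ch, kernel_size):
        super().__init__()
        self.dw = nn.Conv2d(in_ch, in_ch, kernel_size, padding=kernel_size // 2,
                            groups=in_ch, bias=False)
        self.sam = ImprovedSpatialAttention(in_ch)
        self.pw = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)
        self.cam = ChannelAttention(out_ch)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.sam(self.dw(x))   # spatial attention adjusts DW feature maps
        x = self.cam(self.pw(x))   # channel attention adjusts PW feature maps
        return self.act(self.bn(x))


class DSACUnit(nn.Module):
    """Two separable-attention branches (3x3 and 5x5) in parallel, concatenated."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.branch3 = SeparableAttentionBranch(in_ch, out_ch // 2, kernel_size=3)
        self.branch5 = SeparableAttentionBranch(in_ch, out_ch // 2, kernel_size=5)

    def forward(self, x):
        return torch.cat([self.branch3(x), self.branch5(x)], dim=1)


if __name__ == "__main__":
    unit = DSACUnit(in_ch=32, out_ch=64)
    print(unit(torch.randn(1, 32, 56, 56)).shape)  # torch.Size([1, 64, 56, 56])
```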
Keywords
An embedded dual-scale separable CBAM model for masked human face pose classification

Chen Senqiu1,2, Liu Wenbo1,2, Zhang Gong3 (1. College of Automation Engineering, Nanjing University of Aeronautics and Astronautics, Nanjing 211106, China; 2. Key Laboratory of Non-Destructive Testing and Monitoring Technology for High-Speed Transport Facilities, Ministry of Industry and Information Technology, Nanjing 211106, China; 3. College of Electronic and Information Engineering, Nanjing University of Aeronautics and Astronautics, Nanjing 211106, China)

Abstract
Objective Human face pose classification is a key task in computer vision and intelligent analysis. It supports human behavior analysis, human-computer interaction, motivation detection, fatigue-driving monitoring, face recognition, and virtual reality. Wearing masks was adopted as an intervention during the coronavirus disease 2019 (COVID-19) pandemic, and classifying the poses of masked faces is therefore a new challenge. Convolutional neural networks (CNNs) are widely used to extract facial information and have been applied to face pose estimation under low resolution, occlusion, and complicated environments. Given the strong capability of CNNs and their successful application to face pose classification, we apply them to masked face pose classification. Face pose estimation is an intermediate step in computer vision and intelligent analysis, and its results feed subsequent analysis and decision making; a lightweight and efficient network structure therefore allows the estimation to play a greater role within limited resources. Our research thus focuses on an efficient and lightweight convolutional neural network for masked face pose classification. Method The core of the designed network is an efficient and lightweight dual-scale separable attention convolution (DSAC) unit, and the model is built by stacking five DSAC units. Each DSAC unit consists of two depthwise separable convolutions with 3×3 and 5×5 kernels in parallel, into which the convolutional block attention module (CBAM) is embedded. Unlike the conventional embedding of CBAM, we split it into its spatial attention module (SAM) and channel attention module (CAM): the SAM is embedded after the depthwise (DW) convolution, the CAM is embedded after the pointwise (PW) convolution, and the two branches are concatenated at the end. In this way, the attention modules match the features produced by the DW and PW convolutions. Meanwhile, the SAM is improved by supplementing it with the squeezed result of a 1×1 pointwise convolution, which enhances the use of spatial information and yields a more effective attention map. The number of convolution channels and the number of units in the network are strictly controlled. The fully connected layers are discarded because of their redundant parameters and computational complexity, and a pointwise convolutional layer combined with a global average pooling (GAP) layer replaces them, making the network even more lightweight. In addition, large-scale collection of masked face data could not be carried out temporarily because of the impact of COVID-19, so mask images that are properly scaled, rotated, and deformed are overlaid on a common face pose dataset to construct a semi-synthetic dataset, and a small number of real masked face pose images are collected to construct a real masked face pose dataset. Transfer learning is used to train the model in the absence of a massive real masked face pose dataset.
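As a rough illustration of the classification head described above, the sketch below replaces the fully connected layers with a pointwise convolution followed by global average pooling (GAP). The backbone placeholder, channel count, and number of pose classes are assumed for illustration and are not taken from the paper.

```python
# Hypothetical sketch: pointwise convolution + GAP in place of fully connected layers.
# `backbone` stands in for the stack of five DSAC units; sizes are placeholders.
import torch
import torch.nn as nn


class ConvGapHead(nn.Module):
    """1x1 convolution maps features to class scores; GAP removes spatial dimensions."""
    def __init__(self, in_ch, num_classes):
        super().__init__()
        self.pw = nn.Conv2d(in_ch, num_classes, kernel_size=1)  # pointwise classifier
        self.gap = nn.AdaptiveAvgPool2d(1)                      # global average pooling

    def forward(self, x):
        x = self.gap(self.pw(x))       # (N, num_classes, 1, 1)
        return torch.flatten(x, 1)     # (N, num_classes) logits, no FC layer needed


if __name__ == "__main__":
    backbone = nn.Sequential(          # placeholder for the five stacked DSAC units
        nn.Conv2d(3, 64, 3, stride=2, padding=1),
        nn.ReLU(inplace=True),
    )
    head = ConvGapHead(in_ch=64, num_classes=5)  # number of pose categories is assumed
    logits = head(backbone(torch.randn(1, 3, 128, 128)))
    print(logits.shape)                # torch.Size([1, 5])
```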
Result The experimental results show that the accuracy of the proposed model is 2.86%, 6.41%, and 12.16% higher than that of the model with separately embedded CBAM but without the improved SAM, the model with CBAM embedded in the standard way, and the model without any attention module, respectively. These results indicate that the separately embedded CBAM with the improved SAM yields an efficient and lightweight structure: it effectively improves model performance while adding few parameters and little computational complexity. Depthwise separable convolution makes the model more compact, and the CBAM attention module makes it more effective. In addition, the dual-scale convolution enriches the features and enhances the feature extraction capability within the limited number of convolution units. Extracting features at different scales to improve performance avoids the over-fitting and rapid parameter growth caused by stacking convolutional layers. Compared with classical convolutional neural networks such as AlexNet, Visual Geometry Group network (VGGNet), ResNet, and GoogLeNet, the parameters and computational complexity of the designed model decrease significantly, and its accuracy is 3.57% to 30.71% higher than that of AlexNet, VGG16, ResNet18, and GoogLeNet. Compared with classical lightweight convolutional neural networks such as SqueezeNet, MobileNet, ShuffleNet, and EfficientNet, the proposed model has the fewest parameters, the lowest computational complexity, and the highest accuracy: it has only 1.02 M parameters and 24.18 M floating-point operations (FLOPs) and achieves 98.57% accuracy. Conclusion A lightweight and efficient convolution unit is designed to construct the network, which has few parameters, low computational complexity, and high accuracy, improving the efficiency and accuracy of masked face pose classification.
Keywords
