Published: 2021-12-16
DOI: 10.11834/jig.200606
2021 | Volume 26 | Number 12

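Image Analysis and Recognition

Safety helmet wearing detection fusing environmental features with improved YOLOv4

Ge Qingqing, Zhang Zhijie, Yuan Long, Li Xiumei, Sun Junmei

School of Information Science and Technology, Hangzhou Normal University, Hangzhou 311121, China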

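Received: 2020-10-28; Revised: 2021-01-06; Preprint: 2021-01-13

Funding: National Natural Science Foundation of China (61571174); Open Project of the Fujian Provincial Software Evaluation Engineering Technology Research Center (ST2019004); Hangzhou Science and Technology Plan Project (20201203B124)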

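About the authors: Ge Qingqing, born in 1999, female, undergraduate student; research interests: machine vision and object detection. E-mail: geqingqing999@qq.com
Zhang Zhijie, male, undergraduate student; research interests: object detection and generative adversarial networks. E-mail: 1843926224@qq.com
Yuan Long, male, master's student; research interests: adversarial example attack and defense. E-mail: 835524032@qq.com
Li Xiumei, female, professor, master's supervisor; research interests: time-frequency analysis and applications, compressed sensing, and machine learning. E-mail: lixiumei@hznu.edu.cn
Sun Junmei, corresponding author, female, associate professor, master's supervisor; research interests: deep learning and intelligent software systems. E-mail: junmeisun@hznu.edu.cn
*Corresponding author: Sun Junmei, junmeisun@hznu.edu.cn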

CLC number: TP391.4

Document code: A

Article ID: 1006-8961(2021)12-2904-14

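Abstract

Objective On construction sites, the safety helmet is the most common and practical piece of personal protective equipment, and it effectively prevents or reduces head injuries from accidents. In helmet wearing detection on construction sites, however, small targets are often missed, and complex, changing environmental conditions lower detection accuracy. To address these problems, a safety helmet wearing detection method that fuses environmental features with an improved YOLOv4 (you only look once version 4) is proposed. Method To recover features lost during convolution and pooling, each of the three YOLOv4 output feature maps is added to a feature map extracted from the original image with a matching receptive field, which fuses high-level and low-level features and captures more detail. A 3×3 convolution is applied to the fused maps to reduce the aliasing effect of fusion and keep the features stable. To adapt to varied site conditions, several data augmentation methods simulate the environment, and adversarial training strengthens generalization and robustness. Result On the open-source safety helmet wearing dataset (SHWD), the improved YOLOv4 reaches a mean average precision (mAP) of 91.55%, higher than several popular detectors and 5.2 percentage points above YOLOv4. With environment-based data augmentation, the mAP of the improved YOLOv4 rises by a further 4.27 percentage points, and performance stays stable under a range of real environmental conditions. Conclusion By improving the model and augmenting the data, the proposed method raises accuracy, generalization, and robustness, and provides effective support for helmet wearing detection.

Keywords

safety helmet wearing detection; feature map fusion; data augmentation; adversarial examples; YOLOv4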

Safety helmet wearing detection method of fusing environmental features and improved YOLOv4

Ge Qingqing, Zhang Zhijie, Yuan Long, Li Xiumei, Sun Junmei

School of Information Science and Technology, Hangzhou Normal University, Hangzhou 311121, China

Supported by: National Natural Science Foundation of China (61571174)

Abstract

Objective National production safety data for 2019 show that 95% of production safety accidents were caused by unsafe worker behavior, including improper wearing of protective equipment. Safety helmet wearing detection therefore plays a vital role in safe production, and an end-to-end detection algorithm with high accuracy and strong generalization is important for protecting workers and reducing accidents. Helmet wearing detection is an object detection task. Early object detectors relied mostly on hand-crafted features. With the development of deep learning, detectors split into two-stage and one-stage families, both of which greatly improved detection speed and accuracy. Current deep learning detectors, however, still fail to detect small targets reliably and do not transfer well across scenes, which leads to poor generalization and weak resistance to interference. To solve these problems, a safety helmet wearing detection method that combines environmental features with an improved you only look once version 4 (YOLOv4) is proposed.

Method Following YOLOv4, cross stage partial Darknet53 (CSPDarknet53) serves as the backbone, and the path aggregation network (PANet) and spatial pyramid pooling (SPP) serve as the neck. YOLOv4 produces feature maps at three sizes: with a 608×608 pixel input, the YOLO heads have resolutions of 76×76, 38×38, and 19×19 pixels. Because high-level and low-level feature maps carry very different information, features are extracted from the input image until they reach the same resolution as each YOLO head. The input image passes through a 3×3 convolution, a batch normalization (BN) layer, and a ReLU activation, which provides one-sided suppression and sparse activation. This process is repeated until the resolution of the feature map matches that of the corresponding YOLO head. Then, with the receptive fields consistent, the three YOLOv4 output feature maps are added to the feature maps extracted from the original image, which fuses high-level with low-level features and captures more detail. A 3×3 convolution is applied to the fused maps to reduce the aliasing effect of fusion, giving outputs at three sizes. The feature map extracted from the original image represents a shallow, high-resolution network with more detailed features, which helps predict location; the YOLO head represents a deep, low-resolution network with more semantic features, which helps decide the category. Combining the two lets the model detect both large and small targets more accurately. In addition, data augmentation techniques, including random cropping, CutMix, noise-based environment simulation, and adversarial training with adversarial examples, add small perturbations to the training data and improve the generalization and robustness of the model.

Result The improved YOLOv4 was tested on the open-source safety helmet wearing dataset (SHWD), where it reaches a mean average precision (mAP) of 91.55% and a recall of 98.62%. It outperforms CornerNet-Lite, Faster region convolutional neural network (RCNN), YOLOv3, and YOLOv4 in both mAP and recall; compared with the original YOLOv4, mAP and recall rise by 5.2 and 5.79 percentage points, respectively. The data augmentation methods raise the mAP of CornerNet-Lite, Faster RCNN, YOLOv3, YOLOv4, and the improved YOLOv4 by 2 to 5 percentage points; for the improved YOLOv4, mAP rises by 4.27 points, from 91.55% to 95.82%. After augmentation the method is also more stable across environmental conditions; on night images, for instance, mAP rises from 67.73% to 84.10%. Adding adversarial examples to the training set raises recall by 0.29 percentage points and mAP by 0.56 percentage points.

Conclusion A safety helmet wearing detection method that fuses environmental features with an improved YOLOv4 is proposed. It uses a convolutional neural network to extract features, which greatly improves the efficiency of feature extraction and detection. Fusing feature maps combines high-level and low-level information and improves the detection accuracy of small targets. Multiple data augmentation methods improve robustness in complex scenes. The detector is trained end to end, and its accuracy, generalization, and robustness improve, which gives effective detection of helmet wearing.

Key words

safety helmet wearing detection; feature map fusion; data enhancement; adversarial examples; YOLOv4

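0 Introduction

With the progress of industrialization and growing safety awareness, construction site safety has become one of the main concerns of both employers and workers. An analysis of national production safety data for 2019 found that 95% of production safety accidents were caused by unsafe behavior of workers, including failure to wear protective equipment correctly. Safety helmet wearing detection therefore plays a vital role in safe production. An end-to-end helmet wearing detection algorithm with high accuracy and strong generalization is important for protecting workers and reducing accidents.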

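Safety helmet wearing detection is an object detection task. Early object detection algorithms were mostly built on hand-crafted features; lacking effective image representations, researchers had to design sophisticated feature representations and various acceleration techniques to make full use of limited computing resources. Viola and Jones (2001, 2004) proposed the Viola-Jones (VJ) face detection algorithm in 2001 and the VJ face detector in 2004, which used sliding windows and achieved real-time face detection for the first time without any constraints. Felzenszwalb et al. (2008) proposed the deformable parts model (DPM), the peak of traditional detectors; its cascade structure gave a speed-up of more than 10 times without sacrificing accuracy, but accuracy and speed still fell far short of practical requirements.

With the application of deep learning, object detection split into two-stage detection and one-stage detection. In the two-stage family, Girshick et al. (2014), Girshick (2015), and Ren et al. (2017) proposed the region convolutional neural network (RCNN), Fast RCNN, and Faster RCNN, a series of detectors with large gains in both accuracy and speed. Meanwhile, one-stage detectors represented by the YOLO (you only look once) series (Redmon et al., 2016; Redmon and Farhadi, 2017, 2018; Bochkovskiy et al., 2020) are less accurate than two-stage detectors but much faster. On the MS COCO (Microsoft common objects in context) dataset, a large open dataset with 80 categories and more than 330 000 images, YOLOv4 reaches an AP (average precision) of 43.5%, about 10 percentage points above the 33.0% of YOLOv3, with a 12% gain in speed. It is one of the best-performing detectors to date, but it detects small targets with low accuracy and does not cope with interference from the external environment.

For small-target detection, Jiang et al. (2019) proposed fusing multi-scale feature maps to build feature maps with stronger semantic information. Zeng et al. (2019) proposed enlarging the last extracted feature map by a factor of two with nearest-neighbor upsampling and then fusing feature information by addition. Xu and Wu (2020) proposed using a densely connected network (DenseNet) to strengthen feature extraction. All of these methods improve small-target detection accuracy to some extent. In addition, Law et al. (2020) proposed CornerNet-Lite, a keypoint-based detection method that does not rely on anchor boxes; it is a lightweight detection network, but it performs slightly worse than YOLOv4 and has a narrower range of use.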

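Several researchers have applied object detection algorithms to helmet wearing detection. Silva et al. (2014) used the circular Hough transform and the histogram of oriented gradients descriptor to extract attributes of motorcycle helmet images and then classified them with a multilayer perceptron. Liu and Ye (2014) located faces by skin color detection, extracted Hu moment feature vectors from the region above the face, and recognized helmets with a support vector machine (SVM). Fang et al. (2019) added dense blocks to YOLOv2 for helmet detection. Shi et al. (2019) added a feature pyramid structure to YOLOv3 and clustered the dimensions of the target boxes to detect helmets. Kumar et al. (2020) used YOLO to detect whether motor vehicle riders wear helmets.

Existing helmet wearing detection methods have two problems: 1) methods based on traditional image processing and machine learning have low detection accuracy; 2) deep learning algorithms fail to detect small targets accurately, or the models do not apply across environments, so generalization is poor. To overcome these problems, this paper proposes a safety helmet wearing detection method that fuses environmental features with an improved YOLOv4. With YOLOv4 as the main model and the receptive fields kept consistent, each YOLOv4 output feature map is added to a feature map extracted from the original image to recover lost features, which fuses high-level with low-level features and captures more detail. A 3×3 convolution is then applied to the fused feature maps to reduce the aliasing effect of fusion and keep the features stable, so that the model reaches higher accuracy on both large and small targets. Several data augmentation methods simulate construction sites under different conditions, and adversarial training with adversarial examples improves generalization and robustness, so that the model detects well in all of these environments. Experiments show that the model adapts to environmental changes and remains stable while keeping its detection accuracy.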

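1 Related work

1.1 YOLOv4 object detection algorithm

YOLOv4 (Bochkovskiy et al., 2020) is an object detection algorithm built on the YOLO architecture with a collection of proven optimization strategies from the convolutional neural network (CNN) field. It improves data processing, the backbone network, network training, activation functions, and loss functions to varying degrees, so that anyone can train a very fast and accurate object detector with a single 1080Ti or 2080Ti GPU (graphics processing unit). YOLOv4 verified the influence of a series of mainstream detector training methods and modified them to be more efficient and better suited to single-GPU training, including cross-iteration batch normalization (CBN), the path aggregation network (PANet), and the spatial attention module (SAM). YOLOv4 groups its methods into two classes: 1) bag of freebies (BoF), methods that change only the training strategy or increase only the training cost, such as data augmentation; 2) bag of specials (BoS), plug-in modules and post-processing methods that add only a small inference cost but can greatly improve detection accuracy.

YOLOv4 keeps the head of YOLOv3, changes the backbone to CSPDarknet53, uses spatial pyramid pooling (SPP) to enlarge the receptive field, and takes PANet as the neck; the structure is shown in Fig. 1 (Bochkovskiy et al., 2020). With this design, YOLOv4 compares well with other object detection algorithms in both accuracy (AP-50, the AP at an IoU threshold of 0.5) and speed (frames per second, FPS). Table 1 compares YOLOv4 with other detection frameworks on the MS COCO dataset (Bochkovskiy et al., 2020; Law et al., 2020).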

Fig. 1 YOLOv4 network structure

Table 1 Comparison of performances among YOLOv4 and other networks

Method	AP-50/%	Frame rate/(frame/s)
SSD512 (Liu et al., 2016)	48.5	22.0
RefineDet512 (Zhang et al., 2018)	54.5	22.3
RetinaNet-50-800 (Lin et al., 2017)	57.5	6.5
PFPNet-R512 (Kim et al., 2018)	57.6	24.0
CornerNet (Law and Deng, 2018)	57.8	4.4
YOLOv3-608 (Redmon and Farhadi, 2018)	57.9	20.0
CornerNet-Lite (Law et al., 2020)	58.4	17.0
Faster RCNN (Ren et al., 2017)	59.2	9.4
YOLOv4-608 (Bochkovskiy et al., 2020)	65.7	23.0
Note: Bold indicates the best result in each column. SSD: single shot multibox detector; PFPNet: parallel feature pyramid network. Frame rates were measured as inference time on a Maxwell GPU.

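1.2 Data augmentation methods

Deep learning methods generally need abundant samples: the more samples, the better the trained model and the stronger its generalization. In practice, sample quantity and quality often fall short of expectations, so the samples are augmented to improve them. Data augmentation is widely used in object detection; common methods include random cropping, noise injection, and CutMix (Yun et al., 2019).

Random cropping takes an image of a set size at a random position in the original image. This amounts to relating each feature factor to the weight of its class while reducing the weight of background (or noise) factors, so the model becomes less sensitive to missing values, learns better, and is more stable. Noise-based augmentation superimposes noise on the original image; different kinds of noise give different visual effects. CutMix lets the model recognize multiple targets from local views of one image and makes training more efficient. It cuts out a region of the image and, instead of filling it with zero pixels, fills it with the pixel values of a region from another randomly chosen training image; the labels are mixed in proportion. Let $\boldsymbol{X}_{A}$ and $\boldsymbol{X}_{B}$ be two different training samples with labels $Y_{A}$ and $Y_{B}$. CutMix generates a new training sample $\boldsymbol{X}$ and its label $Y$ as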

$ \begin{array}{c} \boldsymbol{X} = \boldsymbol{M} \odot \boldsymbol{X}_{A} + (\boldsymbol{1} - \boldsymbol{M}) \odot \boldsymbol{X}_{B}\\ Y = \lambda Y_{A} + (1 - \lambda) Y_{B} \end{array} $

(1)

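where $\boldsymbol{M} \in \{0, 1\}^{W \times H}$ is a binary mask that marks the region to cut out and fill, $\odot$ is element-wise multiplication, and $\lambda$ is drawn from a Beta distribution. The final $\boldsymbol{X}$ combines the two images $\boldsymbol{X}_{A}$ and $\boldsymbol{X}_{B}$.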
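The mixing rule of Eq. (1) can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: images are small nested lists, the function names `cutmix` and `random_box` are hypothetical, and taking $\lambda$ as the fraction of pixels kept from image A (with the box area sampled from Beta($\alpha$, $\alpha$)) follows a common CutMix convention that the text does not spell out.

```python
import random

def cutmix(img_a, img_b, label_a, label_b, box):
    """Paste the region `box` of img_b into img_a and mix the labels.

    img_a, img_b: equally sized 2-D lists (H rows, W columns) of pixel values.
    label_a, label_b: label vectors of equal length (e.g. one-hot).
    box: (top, left, bottom, right), half-open; inside the box M = 0.
    """
    h, w = len(img_a), len(img_a[0])
    top, left, bottom, right = box
    # Binary mask M: 1 keeps the pixel of A, 0 takes the pixel of B.
    mask = [[0 if (top <= r < bottom and left <= c < right) else 1
             for c in range(w)] for r in range(h)]
    # X = M * X_A + (1 - M) * X_B, element by element.
    mixed = [[mask[r][c] * img_a[r][c] + (1 - mask[r][c]) * img_b[r][c]
              for c in range(w)] for r in range(h)]
    # lambda: fraction of pixels that still come from A (assumed convention).
    lam = sum(map(sum, mask)) / (h * w)
    label = [lam * ya + (1 - lam) * yb for ya, yb in zip(label_a, label_b)]
    return mixed, label, lam

def random_box(h, w, alpha=1.0, rng=random):
    """Sample a box whose area ratio is about 1 - lambda, lambda ~ Beta(alpha, alpha)."""
    lam = rng.betavariate(alpha, alpha)
    cut_h, cut_w = int(h * (1 - lam) ** 0.5), int(w * (1 - lam) ** 0.5)
    top, left = rng.randint(0, h - cut_h), rng.randint(0, w - cut_w)
    return top, left, top + cut_h, left + cut_w
```

For a 4×4 image and a 2×2 box, a quarter of the pixels come from image B, so the mixed label weights are 0.75 and 0.25.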

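CutMix makes full use of the training pixels and keeps the regularization effect of regional dropout (dropout temporarily removes units of a neural network with a given probability during training). No uninformative pixels appear during training, which makes training more efficient, and the advantage of regional dropout is retained: the model can attend to the non-discriminative parts of objects. The pasted patches further strengthen localization, because the model must recognize objects from partial views.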

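2 Model improvement

YOLOv4 uses feature maps at three scales, with resolutions of 1/32, 1/16, and 1/8 of the input image, to detect targets of different sizes. It also uses Mosaic data augmentation, whose random scaling adds many small targets and evens out the distribution of large, medium, and small targets. In helmet wearing detection, target sizes vary widely and occlusions are frequent; YOLOv4 alone has low accuracy on small targets and does not meet the requirement.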

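In a convolutional neural network, high-level feature maps usually carry more abstract, semantic information, whereas low-level feature maps have higher spatial resolution and represent details more clearly. To detect small targets more accurately, this paper proposes a feature map fusion method based on YOLOv4, shown in Fig. 2, where "⊕" denotes fusion. The fusion has two inputs: the YOLO head, which is the output feature map of YOLOv4 itself, and a feature map extracted from the original image with the same receptive field as the YOLO head. Fusing the two maps makes the features discontinuous and disordered, so a 3×3 convolution layer follows the fusion to reduce the aliasing effect and keep the features stable.

As in YOLOv4, the cross stage partial network CSPDarknet53 is the backbone, and the path aggregation network (PANet) and SPP form the neck, which gives YOLO head feature maps at three sizes. With a 608×608 pixel input, the YOLO heads have resolutions of 76×76, 38×38, and 19×19 pixels, denoted $\hat{\boldsymbol{V}}_{1}$, $\hat{\boldsymbol{V}}_{2}$, and $\hat{\boldsymbol{V}}_{3}$. Because high-level and low-level feature maps differ greatly in information, features must be extracted from the given input image $\boldsymbol{X} \in {\bf{R}}^{C' \times H' \times W'}$ until they reach the resolution of the YOLO head. The input image passes through a 3×3 convolution $\tilde{\boldsymbol{F}}: \boldsymbol{X} \to \tilde{\boldsymbol{U}} \in {\bf{R}}^{C \times H \times W}$, followed by a batch normalization (BN) layer (Ioffe and Szegedy, 2015) and the ReLU (rectified linear unit) activation (Nair and Hinton, 2010), which has one-sided suppression and sparse activation. This process is iterated until the resolution of the feature map matches that of the corresponding YOLO head, which gives the feature maps $\hat{\boldsymbol{U}}_{1}$, $\hat{\boldsymbol{U}}_{2}$, and $\hat{\boldsymbol{U}}_{3}$. The two feature maps are fused by element-wise addition:

$ \boldsymbol{F}_{n} = \hat{\boldsymbol{U}}_{n} + \hat{\boldsymbol{V}}_{n}, \quad n = 1, 2, 3 $

(2)

Fig. 2 Improved YOLOv4 network structure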

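After fusion, each of the three outputs passes through one more 3×3 convolution to reduce the aliasing caused by fusion, which gives the final outputs $\boldsymbol{F}_{1}$, $\boldsymbol{F}_{2}$, and $\boldsymbol{F}_{3}$. The feature map extracted from the original image represents a shallow network: its resolution is high and it learns more detail, which helps predict location. The YOLO head represents a deep network: its resolution is low and it learns more semantic features, which helps decide the category. Fusing the two allows large and small targets to be detected with high accuracy at the same time. Table 2 lists the structure of the network branch that extracts $\hat{\boldsymbol{U}}_{1}$ from the original image, where $W$ and $H$ are both 608; the branches for $\hat{\boldsymbol{U}}_{2}$ and $\hat{\boldsymbol{U}}_{3}$ are configured by analogy.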
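The two fusion steps, element-wise addition followed by a 3×3 convolution, can be illustrated on a single-channel toy feature map. This is a sketch under stated assumptions: the function names `add_maps` and `conv3x3` are hypothetical, both inputs are assumed to have identical shape, and a fixed box filter stands in for the convolution kernel, which in the network would be learned.

```python
def add_maps(u, v):
    """Element-wise sum F = U + V of two equally sized 2-D feature maps."""
    assert len(u) == len(v) and len(u[0]) == len(v[0]), "shapes must match"
    return [[a + b for a, b in zip(ru, rv)] for ru, rv in zip(u, v)]

def conv3x3(fmap, kernel):
    """3x3 convolution with zero padding 1 and stride 1 (output keeps the input size)."""
    h, w = len(fmap), len(fmap[0])
    out = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            acc = 0.0
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < h and 0 <= cc < w:   # outside the map counts as zero
                        acc += kernel[dr + 1][dc + 1] * fmap[rr][cc]
            out[r][c] = acc
    return out

# Hypothetical 4x4 maps standing in for one channel of U_n and V_n.
u = [[1.0] * 4 for _ in range(4)]
v = [[2.0] * 4 for _ in range(4)]
# Illustration only: a box filter instead of learned weights.
box_kernel = [[1 / 9] * 3 for _ in range(3)]
fused = conv3x3(add_maps(u, v), box_kernel)
```

The interior values of `fused` stay at 3.0, the sum of the two inputs, while border values shrink because of the zero padding.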

Table 2 Structure of the $\hat{\boldsymbol{U}}_{1}$ network branch

Layer	Kernel size	Channels	Padding	Input size	Output size
Conv1	3×3	32	1	1×H×W	32×H×W
ReLU	-	-	-	32×H×W	32×H×W
MaxPooling	2×2	-	-	32×H×W	32×H/2×W/2
Conv2	3×3	64	1	32×H/2×W/2	64×H/2×W/2
ReLU	-	-	-	64×H/2×W/2	64×H/2×W/2
MaxPooling	2×2	-	-	64×H/2×W/2	64×H/4×W/4
Conv3	3×3	128	1	64×H/4×W/4	128×H/4×W/4
ReLU	-	-	-	128×H/4×W/4	128×H/4×W/4
MaxPooling	2×2	-	-	128×H/4×W/4	128×H/8×W/8
Conv4	3×3	256	1	128×H/8×W/8	256×H/8×W/8
ReLU	-	-	-	256×H/8×W/8	256×H/8×W/8
Note: "-" indicates that the parameter is not used.
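The sizes in Table 2 can be checked with a short shape trace. The sketch below assumes that each 3×3 convolution with padding 1 keeps the spatial size and that each 2×2 max pooling halves it; the helper names `branch_shapes` and `head_sizes` are hypothetical.

```python
def head_sizes(input_size, strides=(8, 16, 32)):
    """Spatial sizes of the three YOLO heads for a square input."""
    return [input_size // s for s in strides]

def branch_shapes(h, w, channels=(32, 64, 128, 256)):
    """Return (layer, C, H, W) after each stage of the U_1 branch."""
    shapes = []
    for i, c in enumerate(channels, start=1):
        shapes.append((f"Conv{i}+ReLU", c, h, w))   # 3x3, padding 1 keeps H x W
        if i < len(channels):                       # no pooling after the last conv
            h, w = h // 2, w // 2                   # 2x2 max pooling halves H and W
            shapes.append(("MaxPooling", c, h, w))
    return shapes

for row in branch_shapes(608, 608):
    print(row)
```

With H = W = 608, three poolings give 608 / 8 = 76, which matches the 76×76 YOLO head; by the same arithmetic, the 38×38 and 19×19 heads correspond to four and five halvings.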

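3 Data augmentation

Most object detection algorithms that use data augmentation do not consider weather, so they suit a single application environment, generalize poorly, and resist interference weakly: a slight perturbation causes wrong or missed detections. In helmet wearing detection, occlusions are frequent and the weather on real sites is complex; work may go on by day, at night, in rain, or in light fog. These factors degrade the clarity of camera video and therefore the accuracy of helmet recognition. This paper therefore uses random cropping, CutMix, noise-based environment simulation, Gaussian filter denoising, and adversarial example generation to simulate the environment. Adding small perturbations or changes to the training data both enlarges the training set, which improves generalization, and adds noisy data, which improves robustness.

3.1 Random cropping

Random cropping is a common data augmentation method that enlarges the dataset and improves the generality of the model. Fig. 3 shows the effect of this method.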

Fig. 3 Results of random cropping

((a)original images; (b)results after random cropping)
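For a detection dataset, cropping an image also requires updating the bounding boxes. The following minimal sketch shows one way to do this. The box handling (shift, clip, drop empty boxes) and the function name `random_crop` are assumptions for illustration, since the text does not describe how labels are treated.

```python
import random

def random_crop(img, boxes, crop_h, crop_w, rng=random):
    """Crop a crop_h x crop_w window at a random position and adjust the boxes.

    img: 2-D list (H rows, W columns); boxes: list of (x1, y1, x2, y2) in pixels.
    Boxes are shifted into window coordinates, clipped, and dropped if empty.
    """
    h, w = len(img), len(img[0])
    top = rng.randint(0, h - crop_h)
    left = rng.randint(0, w - crop_w)
    patch = [row[left:left + crop_w] for row in img[top:top + crop_h]]
    kept = []
    for x1, y1, x2, y2 in boxes:
        nx1, ny1 = max(x1 - left, 0), max(y1 - top, 0)
        nx2, ny2 = min(x2 - left, crop_w), min(y2 - top, crop_h)
        if nx2 > nx1 and ny2 > ny1:          # keep only boxes that still overlap the window
            kept.append((nx1, ny1, nx2, ny2))
    return patch, kept
```

A practical pipeline would usually also discard boxes whose remaining visible area is very small; that threshold is left out here.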

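3.2 CutMix

Object detection often meets images in which targets are heavily occluded. Occlusions are varied and lose much information, so the model tends to overfit during training, performs worse on data outside the training set, and is hard to improve at the model level. A common remedy is to add occlusion as a form of data augmentation: black rectangles cover different proportions (1/4, 1/3, 1/2) of the target at different positions (top left, top right, bottom left, bottom right, left, right, top, bottom). Black filling (zero pixels), however, discards part of the information completely, so this paper uses CutMix to handle occlusion. Two images are selected at random, and the same region is cut from one and pasted into the other; Fig. 4 shows the effect.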

Fig. 4 Results of CutMix

((a)original image 1; (b)original image 2;(c)results after combining)

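3.3 Noise-based environment simulation

Weather on construction sites changes; work may take place in rain, in fog, or at night, and it is precisely under these conditions that work is more dangerous. To make the model more realistic in application and to reduce the influence of the actual environment, weather, and video equipment on recognition, the model is trained on different datasets: different kinds of noise are added to the existing training set to form new training sets that simulate recognition in real scenes. Rain, night, and fog are the main factors simulated.

1) Rain simulation. Noise of different densities is first generated at random to simulate different amounts of rain; uniform random numbers and a threshold control the noise level. The noise is then stretched and rotated to simulate rain of different sizes and directions. Finally, the rain noise is superimposed on the original image to give a simulated rain scene.

2) Night simulation. A night-time construction site background image is captured and blended into the image with a weight of 0.3; at this weight, the picture seen by the human eye matches a real night scene.

3) Fog simulation. Fog is simulated in the same way as night: a foggy construction site background image is captured and blended into the image with a weight of 0.4, which gives data for foggy conditions. Fig. 5 shows images for the three environments.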
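The three simulations above can be sketched on grayscale images stored as nested lists. Several details are assumptions made for illustration: the weight is applied to the background image (so the output is `(1 - weight) * img + weight * background`), rain streaks are only stretched vertically with the rotation step omitted, and the function names `blend`, `rain_layer`, and `add_rain` are hypothetical.

```python
import random

def blend(img, background, weight):
    """Weighted overlay: out = (1 - weight) * img + weight * background."""
    return [[(1 - weight) * p + weight * q for p, q in zip(ri, rb)]
            for ri, rb in zip(img, background)]

def rain_layer(h, w, density=0.02, length=3, value=255, rng=random):
    """Sparse noise stretched into vertical streaks.

    A pixel seeds a drop when a uniform random number falls below `density`;
    each seed is then extended downward by `length` pixels.
    """
    layer = [[0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            if rng.random() < density:
                for k in range(length):
                    if r + k < h:
                        layer[r + k][c] = value
    return layer

def add_rain(img, layer):
    """Superimpose the rain layer on the image, saturating at 255."""
    return [[min(255, p + q) for p, q in zip(ri, rl)] for ri, rl in zip(img, layer)]

# Usage with the weights given in the text: 0.3 for night, 0.4 for fog.
image = [[100] * 4 for _ in range(4)]
night_bg = [[10] * 4 for _ in range(4)]     # hypothetical dark background
fog_bg = [[220] * 4 for _ in range(4)]      # hypothetical bright, hazy background
night = blend(image, night_bg, 0.3)
fog = blend(image, fog_bg, 0.4)
rainy = add_rain(image, rain_layer(4, 4, density=0.1, rng=random.Random(0)))
```

A larger `density` simulates heavier rain, and a longer `length` gives longer streaks.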

Fig. 5 Simulated images of different environmental effects

((a) original; (b) night; (c) rainy; (d) foggy)

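3.4 Gaussian filter denoising

Images captured by site cameras can be blurred because the cameras are old or of poor quality. To deal with this, the images are first denoised with a Gaussian filter, which greatly improves image quality, and then fed to the model, which effectively raises detection accuracy. Fig. 6 shows an image captured by a blurred camera and the result after denoising; the processed image is visibly clearer.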

Fig. 6 Comparison of images before and after denoising

((a) image captured under lens blurring; (b) image captured after performing the denoising algorithm)
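A Gaussian filter replaces each pixel with a weighted average of its neighborhood, where the weights follow a two-dimensional Gaussian, which suppresses high-frequency noise. The sketch below is a plain-Python illustration; the kernel size of 3, sigma of 1.0, and edge-replication border are assumptions, since the text does not give the filter parameters.

```python
import math

def gaussian_kernel(size=3, sigma=1.0):
    """Normalized size x size Gaussian kernel (size must be odd)."""
    half = size // 2
    k = [[math.exp(-(x * x + y * y) / (2 * sigma * sigma))
          for x in range(-half, half + 1)] for y in range(-half, half + 1)]
    total = sum(map(sum, k))
    return [[v / total for v in row] for row in k]

def gaussian_filter(img, size=3, sigma=1.0):
    """Filter a 2-D image, replicating edge pixels at the border."""
    h, w, half = len(img), len(img[0]), size // 2
    k = gaussian_kernel(size, sigma)
    out = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            acc = 0.0
            for dy in range(-half, half + 1):
                for dx in range(-half, half + 1):
                    rr = min(max(r + dy, 0), h - 1)   # clamp row index to the image
                    cc = min(max(c + dx, 0), w - 1)   # clamp column index to the image
                    acc += k[dy + half][dx + half] * img[rr][cc]
            out[r][c] = acc
    return out
```

Because the kernel sums to 1, flat regions are unchanged while isolated noisy pixels are spread out and attenuated.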

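3.5 Adversarial example generation

An adversarial example is an input that, after a slight modification, makes a machine learning algorithm output a wrong result. Deep neural networks are vulnerable to adversarial examples: a perturbation so small that the human eye cannot perceive it, and that does not affect human judgment, may cause a deep neural network to detect wrongly. It is therefore necessary to improve the defense of the helmet detection model against adversarial examples and to raise its robustness. Adversarial training, one of the defenses against adversarial examples, is used: adversarial examples produced by a generation algorithm are added to the training set to strengthen the model.

Specifically, a detection network similar to CSPDarknet is first constructed. The fast gradient sign method (FGSM) (Goodfellow et al., 2015) then adds a perturbation to the image along the sign of the gradient of the loss with respect to the input, which increases the loss and causes the model to detect wrongly. In other words, an increment is added in the gradient direction to induce the network to misclassify the generated image $\boldsymbol{X}'$. $\boldsymbol{X}'$ is the required adversarial example, generated as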

$ \begin{array}{c} \boldsymbol{X}' = \boldsymbol{X} + \boldsymbol{\eta}\\ \boldsymbol{\eta} = \varepsilon \times {\rm{sign}}({\nabla _{\boldsymbol{X}}}J(\theta ,\boldsymbol{X},\boldsymbol{Y})) \end{array} $

(3)

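where $\boldsymbol{\eta}$ is the added perturbation, $\theta$ the model parameters, $\boldsymbol{X}$ the model input, $\boldsymbol{Y}$ the label, $J(\theta, \boldsymbol{X}, \boldsymbol{Y})$ the loss function, and $\varepsilon$ the attack parameter; $\varepsilon$ = 0.01 in this paper. The optimal $\boldsymbol{\eta}$ is obtained by linearizing the cost function around the current value of $\theta$. The loss function is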
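Eq. (3) can be demonstrated on a toy model whose input gradient is known in closed form. The sketch below uses a logistic-regression classifier with a binary cross-entropy loss rather than the detection network and loss of this paper; the parameters `w`, `b`, the input `x`, and the function names are hypothetical, and only $\varepsilon$ = 0.01 is taken from the text.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss_and_grad(w, b, x, y):
    """Binary cross-entropy J(theta, x, y) of a logistic model and its gradient w.r.t. x."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    loss = -(y * math.log(p) + (1 - y) * math.log(1 - p))
    grad_x = [(p - y) * wi for wi in w]      # dJ/dx for this model
    return loss, grad_x

def sign(v):
    return (v > 0) - (v < 0)

def fgsm(w, b, x, y, eps=0.01):
    """X' = X + eps * sign(grad_X J(theta, X, Y))."""
    _, g = loss_and_grad(w, b, x, y)
    return [xi + eps * sign(gi) for xi, gi in zip(x, g)]

w, b = [2.0, -3.0, 0.5], 0.1      # hypothetical model parameters (theta)
x, y = [0.2, 0.4, 0.9], 1         # hypothetical input and label
x_adv = fgsm(w, b, x, y)
print(loss_and_grad(w, b, x, y)[0], loss_and_grad(w, b, x_adv, y)[0])
```

Each input component moves by exactly $\varepsilon$ in the direction that increases the loss, so the perturbation is bounded in the max norm while the loss on `x_adv` is higher than on `x`.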

$ \mathit{loss} = \sum\limits_{i \in \left\{ {j|{S_j} > t} \right\}} {{S_i}} + \sum\limits_{i = 0}^{{S^2}} {I_i^{{\rm{obj}}}} \sum\limits_{c \in {\rm{class}}} {{{\left( {{p_i}(c) - {{\hat p}_i}(c)} \right)}^2}} $

(4)

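where $j$ indexes the bounding boxes predicted by the grid cells of YOLOv4, $S_{j}$ is the confidence score of a bounding box, $t$ is the confidence threshold, $S^{2}$ is the number of grid cells, $I^{obj}_{i}$ indicates that cell $i$ contains the center of a real object, $p$ is the class probability, $c$ is a class, and class is the set of all classes. The loss has two parts: 1) the confidence of the candidate boxes; 2) the classification error. When YOLOv4 searches for targets, all bounding boxes are first screened. For each bounding box with $S_{j}$ above the threshold $t$, FGSM attacks the whole image and, by maximizing this loss, lowers the confidence score of the candidate box, so that YOLOv4 fails to detect the target once its confidence falls below $t$; in this paper $t$ = 0.3. Because the loss also contains the classification error, maximizing it misleads the network into misclassification: even if a target is detected, the added noise causes it to be classified as another object, so it is not recognized correctly.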
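As a worked example, the value of Eq. (4) can be computed for a handful of predictions. The sketch below only evaluates the expression as written; how the sign of the confidence term enters the attack is not spelled out in the text, and the function name `attack_loss` and all numbers are hypothetical.

```python
def attack_loss(conf_scores, t, obj_cells, p_true, p_pred):
    """Loss of Eq. (4): confidence term plus classification squared error.

    conf_scores: confidence S_j of every predicted bounding box.
    t: confidence threshold; only boxes with S_j > t enter the first sum.
    obj_cells: 0/1 indicator per grid cell (I_i^obj).
    p_true, p_pred: per-cell lists of class probabilities.
    """
    conf_term = sum(s for s in conf_scores if s > t)
    cls_term = sum(
        ind * sum((pt - pp) ** 2 for pt, pp in zip(p_true[i], p_pred[i]))
        for i, ind in enumerate(obj_cells)
    )
    return conf_term + cls_term

# Three boxes, threshold t = 0.3 as in the text, two grid cells with two classes each.
value = attack_loss(
    conf_scores=[0.9, 0.2, 0.5],
    t=0.3,
    obj_cells=[1, 0],
    p_true=[[1.0, 0.0], [0.0, 1.0]],
    p_pred=[[0.8, 0.2], [0.5, 0.5]],
)
print(value)
```

Here the box with score 0.2 is below the threshold and is ignored, and the second cell contributes nothing because it contains no object, which gives 0.9 + 0.5 + (0.04 + 0.04) = 1.48.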

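As Fig. 7 shows, adding a perturbation to the original image gives the adversarial example. The adversarial example is so close to the original that the naked eye can hardly tell them apart. Comparing the detections of the initial model (the improved YOLOv4) and the robust model (the initial model after adversarial training) on the original image and on the adversarial example shows that the initial model makes wrong and missed detections on the adversarial example, whereas the robust model still detects correctly. Adding the generated adversarial examples to the original training set lets the model learn small perturbations, which improves its robustness and stability.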

Fig. 7 Detection comparison between original and adversarial example

((a) original image; (b) initial model on original image; (c) robustness model on original image; (d) perturbations; (e) initial model on adversarial example; (f) robustness model on adversarial example)

4 Experimental results and analysis

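The experiments use the SHWD (safety helmet wearing detect dataset) dataset (https://github.com/njvisionpower/Safety-Helmet-Wearing-Datast), which contains 7 581 images. 80% of the images are selected at random as the training set and 20% as the test set. For each of the augmentation operations (random cropping, CutMix, noise-based environment simulation, Gaussian filter denoising, and adversarial example generation), 10% of the training set is drawn at random, processed, and added back to the training set. In addition, 1 500 real-environment images were collected as a test set for the environmental adaptability experiment.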
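The split sizes implied by these percentages can be worked out with a short sketch. The rounding rule (truncation), the fixed seed, and the assumption that each of the five operations draws its own independent 10% subset and yields one new image per sampled image are illustrative choices that the text does not state; the function names are hypothetical.

```python
import random

def split_dataset(n_images, train_fraction=0.8, seed=0):
    """Shuffle image indices and split them into train and test lists."""
    indices = list(range(n_images))
    random.Random(seed).shuffle(indices)
    n_train = int(n_images * train_fraction)   # truncation is an assumption
    return indices[:n_train], indices[n_train:]

def sample_for_augmentation(train, fraction=0.1, n_ops=5, seed=0):
    """Draw an independent random subset of the training set for each operation."""
    rng = random.Random(seed)
    k = int(len(train) * fraction)
    return [rng.sample(train, k) for _ in range(n_ops)]

train, test = split_dataset(7581)
subsets = sample_for_augmentation(train)
print(len(train), len(test), [len(s) for s in subsets])
```

Under these assumptions the split gives 6 064 training and 1 517 test images, and each operation processes 606 images, so the augmented training set would hold 6 064 + 5 × 606 = 9 094 images.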

Object detection models are generally evaluated by recall and mean average precision (mAP), computed as

$ \begin{array}{c} R = \frac{{TP}}{{TP + FN}}\\ P = \frac{{TP}}{{TP + FP}} \end{array} $

(5)

$ \mathit{mAP} = \frac{{\sum\limits_{i = 1}^N {AP(i)} }}{N} $

(6)

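where $R$ is recall, $P$ is precision, $TP$ is the number of true positives, $FP$ the number of false positives, and $FN$ the number of false negatives. $AP$ is the average of precision over different recall levels, and $mAP$ is the mean of $AP$ over all $N$ classes.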
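The metrics of Eqs. (5) and (6) can be computed as follows. The AP routine uses one common non-interpolated convention (the mean of the precision values at each correctly detected object, over detections sorted by confidence), which is an assumption because the text does not say which AP variant is used; all function names and numbers are illustrative.

```python
def recall(tp, fn):
    """R = TP / (TP + FN)."""
    return tp / (tp + fn)

def precision(tp, fp):
    """P = TP / (TP + FP)."""
    return tp / (tp + fp)

def average_precision(detections, n_gt):
    """AP for one class.

    detections: list of (confidence, is_true_positive) pairs.
    n_gt: number of ground-truth objects of this class.
    """
    tp = fp = 0
    total = 0.0
    for _, is_tp in sorted(detections, key=lambda d: d[0], reverse=True):
        if is_tp:
            tp += 1
            total += tp / (tp + fp)     # precision at this recall level
        else:
            fp += 1
    return total / n_gt if n_gt else 0.0

def mean_average_precision(ap_per_class):
    """mAP = sum of AP(i) over the N classes, divided by N."""
    return sum(ap_per_class) / len(ap_per_class)

print(recall(90, 10), precision(90, 30))
print(average_precision([(0.9, True), (0.8, False), (0.7, True)], n_gt=2))
```

In the last call the two correct detections are found at precisions 1/1 and 2/3, so AP = (1 + 2/3) / 2 ≈ 0.833.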

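Four comparison experiments verify the effectiveness of the proposed method.

1) Comparison of algorithms. On the SHWD helmet wearing detection dataset without data augmentation, the improved YOLOv4 is compared with CornerNet-Lite, Faster RCNN, YOLOv3, and YOLOv4. Fig. 8 and Fig. 9 compare the recall and mAP of the models.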

Fig. 8 Comparison of recall among different models

Fig. 9 Comparison of mAP among different models

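CornerNet-Lite, a representative anchor-free detection framework, discards anchor boxes and defines boxes from a new perspective, but its accuracy on this dataset is not high: recall is 87.9% and mAP is 71.3%. Faster RCNN, an improvement on Fast RCNN, pairs a region proposal network (RPN) with region of interest pooling (ROI pooling), which greatly improves two-stage detection; its recall is 89.1% and its mAP is 80.9%. YOLOv3 trains on the global region of the image, which speeds it up and separates targets from background better; it detects helmets well, with a recall of 91.96% and an mAP of 84.58%. YOLOv4, the latest one-stage detector, improves performance considerably, with a recall of 92.83% and an mAP of 86.35%. The improved YOLOv4 not only gathers a range of high-quality modules, with large gains in data augmentation and hyperparameter evolution, but also combines low-level feature information with high-level semantic information better, which improves how well the feature maps describe targets. It performs best on this dataset, with a recall of 98.62% and an mAP of 91.55%. Compared with YOLOv4, mAP and recall rise by 5.20 and 5.79 percentage points, respectively, which confirms the effectiveness of the proposed improved YOLOv4.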

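2) Data augmentation. Fig. 10 shows the mAP of each model on the same helmet dataset before and after data augmentation that simulates various weather conditions. The mAP of every model rises after augmentation, by 2 to 5 percentage points; that of the improved YOLOv4 rises by 4.27 percentage points to 95.82%. The augmentation methods thus not only enlarge the dataset effectively but also apply across models and raise detection accuracy.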

Fig. 10 Comparison of mAP of different models before and after data augmentation

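3) Environmental adaptability. To verify how data augmentation affects generalization and robustness, 300 real images each of sunny, night, foggy, rainy, and blurred scenes were collected as test sets, and the improved YOLOv4 was compared before and after data augmentation on each of them. Table 3 and Fig. 11 show the results. Without augmentation, the accuracy of the improved YOLOv4 drops markedly on night, foggy, rainy, and blurred images. With augmentation, the model is stable across real environments and its accuracy stays high. The gain is clearest at night, where mAP rises from 67.73% to 84.10%, which supports more effective safety monitoring of night-time work. Data augmentation lets the model detect well on data from all of these real environments.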

Table 3 Detection performance of the improved YOLOv4 on real-environment test sets before and after data augmentation

Test set	Recall before augmentation/%	Recall after augmentation/%	mAP before augmentation/%	mAP after augmentation/%
Sunny	96.53	97.71	91.48	93.23
Night	71.81	90.79	67.73	84.10
Foggy	83.96	93.23	78.59	88.82
Rainy	80.99	92.59	76.13	86.92
Blurred	86.31	95.08	80.02	90.05
Note: Bold indicates the best result for each metric in each row.
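The gains behind Table 3 can be made explicit with a few lines of arithmetic. The values are copied from the table; the dictionary layout and variable names are only for illustration.

```python
# (recall before, recall after, mAP before, mAP after), all in percent, from Table 3.
table3 = {
    "sunny":   (96.53, 97.71, 91.48, 93.23),
    "night":   (71.81, 90.79, 67.73, 84.10),
    "foggy":   (83.96, 93.23, 78.59, 88.82),
    "rainy":   (80.99, 92.59, 76.13, 86.92),
    "blurred": (86.31, 95.08, 80.02, 90.05),
}

# Gain in percentage points for recall and mAP in each environment.
gains = {
    env: (round(r_after - r_before, 2), round(m_after - m_before, 2))
    for env, (r_before, r_after, m_before, m_after) in table3.items()
}

largest = max(gains, key=lambda env: gains[env][1])
for env, (d_recall, d_map) in gains.items():
    print(f"{env:8s} recall +{d_recall:.2f}  mAP +{d_map:.2f}")
print("largest mAP gain:", largest)
```

The night test set shows the largest change, 18.98 points in recall and 16.37 points in mAP, while the sunny set changes by less than 2 points in either metric.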

Fig. 11 Verification experimental results of environmental adaptability on improved YOLOv4

((a)original images; (b)ground truths; (c)results before data enhancement; (d)results after data enhancement)

4) Effect of adversarial training on robustness. Table 4 compares the results before and after adding adversarial examples to the training set.

Table 4 Comparison of detection before and after adding adversarial examples to training

Training set	Recall/%	mAP/%
Before adding adversarial examples	98.62	95.82
After adding adversarial examples	98.91	96.38
Note: Bold indicates the best result in each column.

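The adversarial examples are trained together with the original data, and the loss they produce forms part of the total loss; that is, the loss of the model increases without any change to the model structure, which acts as regularization. With adversarial examples in training, the recall of the model rises by 0.29 percentage points and the mAP by 0.56 percentage points. The model gains generalization by learning the adversarial perturbations in the training set and is therefore more robust.

To show the improvement more directly, several images were detected with different models; Fig. 12 shows the results. Faster RCNN and YOLOv4 make more wrong and missed detections, whereas the detections of the improved YOLOv4 are closer to the ground-truth labels.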

Fig. 12 Comparison of detection among different models

((a) original image; (b) image with tag; (c) Faster RCNN; (d) YOLOv4; (e) ours)

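5 Conclusion

For helmet wearing detection on construction sites, this paper proposes a detection method that fuses environmental features with an improved YOLOv4. It makes two main contributions: 1) with the receptive fields kept consistent, each YOLOv4 output feature map is added to a feature map extracted from the original image to recover lost features, which fuses high-level with low-level features and captures more detail; a 3×3 convolution on the fused feature maps reduces the aliasing effect of fusion and avoids feature disorder, so that the model reaches higher accuracy on both large and small targets; 2) several data augmentation methods simulate the environment and are combined with adversarial training, which greatly improves the generalization and robustness of the model.

The experimental results show that the method raises both mAP and recall in helmet wearing detection and has some advantage over other methods. Applied on construction sites, it can detect in real time while keeping high accuracy.

The method uses adversarial examples as data augmentation to improve robustness, and it relies on the basic FGSM algorithm. Future work will explore adversarial example generation further and use better algorithms to improve the robustness of the model.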

References

Bochkovskiy A, Wang C Y and Liao H Y M. 2020. YOLOv4: optimal speed and accuracy of object detection[EB/OL]. [2021-09-23]. https://arxiv.org/pdf/2004.10934.pdf

Fang M, Sun T T, Shao Z. 2019. Fast helmet-wearing-condition detection based on improved YOLOv2. Optical Precision Engineering, 27(5): 1196-1205 (方明, 孙腾腾, 邵桢. 2019. 基于改进YOLOv2的快速安全帽佩戴情况检测. 光学精密工程, 27(5): 1196-1205) [DOI:10.3788/OPE.20192705.1196]

Felzenszwalb P, McAllester D and Ramanan D. 2008. A discriminatively trained, multiscale, deformable part model//Proceedings of 2008 IEEE Conference on Computer Vision and Pattern Recognition. Anchorage, USA: IEEE: 1-8[DOI:10.1109/CVPR.2008.4587597]

Girshick R. 2015. Fast R-CNN//Proceedings of 2015 IEEE International Conference on Computer Vision. Santiago, Chile: IEEE: 1440-1448[DOI:10.1109/ICCV.2015.169]

Girshick R, Donahue J, Darrell T and Malik J. 2014. Rich feature hierarchies for accurate object detection and semantic segmentation//Proceedings of 2014 IEEE Conference on Computer Vision and Pattern Recognition. Columbus, USA: IEEE: 580-587[DOI: 10.1109/CVPR.2014.81]

Goodfellow I J, Shlens J and Szegedy C. 2015. Explaining and harnessing adversarial examples[EB/OL]. [2021-09-23]. https://arxiv.org/pdf/1412.6572.pdf

Ioffe S and Szegedy C. 2015. Batch normalization: accelerating deep network training by reducing internal covariate shift[EB/OL]. [2021-09-23]. https://arxiv.org/pdf/1502.03167.pdf

Jiang W T, Zhang C, Zhang S C, Liu W J. 2019. Multiscale feature map fusion algorithm for target detection. Journal of Image and Graphics, 24(11): 1918-1931 (姜文涛, 张驰, 张晟翀, 刘万军. 2019. 多尺度特征图融合的目标检测. 中国图象图形学报, 24(11): 1918-1931) [DOI:10.11834/jig.190021]

Kim S W, Kook H K, Sun J Y, Kang M C and Ko S J. 2018. Parallel feature pyramid network for object detection//Proceedings of European Conference on Computer Vision. Munich, Germany: Springer: 239-256[DOI:10.1007/978-3-030-01228-1_15]

Kumar S, Neware N, Jain A, Swain D and Singh P. 2020. Automatic helmet detection in real-time and surveillance video//Machine Learning and Information Processing. Singapore, Singapore: Springer: 51-60[DOI:10.1007/978-981-15-1884-3_5]

Law H and Deng J. 2018. Cornernet: detecting objects as paired keypoints//Proceedings of European Conference on Computer Vision. Munich, Germany: Springer: 765-781[DOI:10.1007/978-3-030-01264-9_45]

Law H, Teng Y, Russakovsky O and Deng J. 2020. CornerNet-lite: efficient keypoint based object detection[EB/OL]. [2021-09-23]. https://arxiv.org/pdf/1904.08900.pdf

Lin T Y, Goyal P, Girshick R, He K M and Dollár P. 2017. Focal loss for dense object detection//Proceedings of 2017 IEEE International Conference on Computer Vision. Venice, Italy: IEEE: 2999-3007[DOI:10.1109/ICCV.2017.324]

Liu W, Anguelov D, Erhan D, Szegedy C, Reed S, Fu C Y and Berg A C. 2016. SSD: single shot MultiBox detector//Proceedings of European Conference on Computer Vision. Amsterdam, the Netherlands: Springer: 21-37[DOI:10.1007/978-3-319-46448-0_2]

Liu X H, Ye X N. 2014. Skin color detection and Hu moments in helmet recognition research. Journal of East China University of Science and Technology (Natural Science Edition), 40(3): 365-370 (刘晓慧, 叶西宁. 2014. 肤色检测和Hu矩在安全帽识别中的应用. 华东理工大学学报(自然科学版), 40(3): 365-370) [DOI:10.3969/j.issn.1006-3080.2014.03.016]

Nair V and Hinton G E. 2010. Rectified linear units improve restricted boltzmann machines//Proceedings of the 27th International Conference on Machine Learning. Haifa, Israel: Omnipress: 807-814

Redmon J, Divvala S, Girshick R and Farhadi A. 2016. You only look once: unified, real-time object detection//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, USA: IEEE: 779-788[DOI:10.1109/CVPR.2016.91]

Redmon J and Farhadi A. 2017. YOLO9000: better, faster, stronger//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Hawaii, USA: IEEE: 6517-6525[DOI:10.1109/CVPR.2017.690]

Redmon J and Farhadi A. 2018. YOLOv3: an incremental improvement[EB/OL]. [2021-09-23]. https://arxiv.org/pdf/1804.02767.pdf

Ren S Q, He K M, Girshick R, Sun J. 2017. Faster R-CNN: towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(6): 1137-1149 [DOI:10.1109/TPAMI.2016.2577031]

Shi H, Chen X Q, Yang Y. 2019. Safety helmet wearing detection method of improved YOLOv3. Computer Engineering and Applications, 55(11): 213-220 (施辉, 陈先桥, 杨英. 2019. 改进YOLOv3的安全帽佩戴检测方法. 计算机工程与应用, 55(11): 213-220) [DOI:10.3778/j.issn.1002-8331.1811-0389]

Silva R R V E, Aires K R T and de Melo Souza Veras R. 2014. Helmet detection on motorcyclists using image descriptors and classifiers//Proceedings of the 27th SIBGRAPI Conference on Graphics, Patterns and Images. Rio de Janeiro, Brazil: IEEE: 141-148[DOI:10.1109/SIBGRAPI.2014.28]

Viola P and Jones M. 2001. Rapid object detection using a boosted cascade of simple features//Proceedings of 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Kauai, USA: IEEE: 511-518[DOI:10.1109/CVPR.2001.990517]

Viola P, Jones M J. 2004. Robust real-time face detection. International Journal of Computer Vision, 57(2): 137-154 [DOI:10.1023/B:VISI.0000013087.49260.fb]

Xu D Q, Wu Y Q. 2020. Improved YOLO-V3 with DenseNet for multi-scale remote sensing target detection. Sensors, 20(15): #4276 [DOI:10.3390/s20154276]

Yun S, Han D, Chun S, Oh S J, Yoo Y and Choe J. 2019. CutMix: regularization strategy to train strong classifiers with localizable features//Proceedings of 2019 IEEE/CVF International Conference on Computer Vision. Seoul, Korea (South): IEEE: 6022-6031[DOI:10.1109/ICCV.2019.00612]

Zeng J X, Fang Q, Fu X, Leng L. 2019. Multi-scale pedestrian detection algorithm with multi-layer features. Journal of Image and Graphics, 24(10): 1683-1691 (曾接贤, 方琦, 符祥, 冷璐. 2019. 融合多层特征的多尺度行人检测. 中国图象图形学报, 24(10): 1683-1691) [DOI:10.11834/jig.190009]

Zhang S F, Wen L Y, Bian X, Lei Z and Li S Z. 2018. Single-shot refinement neural network for object detection//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 4203-4212[DOI:10.1109/CVPR.2018.00442]