NAS-FPNLite object detection method fused with cross stage connection and inverted residual

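Wang Hongxia1, Zhang Yongshan1, Song Bang2, Chen Deshan3, Yang Yi1 (1. School of Computer and Artificial Intelligence, Wuhan University of Technology, Wuhan 430063; 2. Beijing Yunchang Game Technology Co., Ltd., Beijing 100015; 3. Intelligent Transportation Systems Center, Wuhan University of Technology, Wuhan 430063)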

Abstract
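Objective Lightweight object detection methods aim to preserve detection accuracy while reducing the computational and storage costs of neural networks. To address the weak feature connections between the bottleneck layers (bneck) of MobileNetv3 and the tendency of depthwise separable convolution parameters to become 0 at low dimensions, we propose a NAS-FPNLite (neural architecture search-feature pyramid networks lite) object detection method that fuses cross stage connection and inverted residuals. Method A cross stage connection (CSC) structure is proposed that concatenates the initial input of a network block with its final output along the channel dimension to obtain the gradient combination with the largest difference, yielding an improved CSCMobileNetv3 network model. In the feature pyramid network (FPN) part of the NAS-FPNLite detector, the inverted residual structure is incorporated: the element-wise addition used to fuse different feature layers is replaced by channel concatenation, so that a higher channel count is kept during depthwise separable convolution, and a skip connection is added between the input feature layer and the final output layer for thorough feature fusion. This yields a NAS-FPNLite object detection method fused with inverted residuals. Result Experiments show that on the CIFAR (Canadian Institute for Advanced Research)-100 dataset, with scaling factors of 0.5, 0.75 and 1.0, the accuracy of CSCMobileNetv3 is 0.71% to 1.04% higher than that of MobileNetv3. In particular, CSCMobileNetv3 with a scaling factor of 0.75 is 0.19% more accurate than MobileNetv3 with a scaling factor of 1.0, while using 30% fewer parameters and 20% fewer floating point operations. On the ImageNet 1000 dataset, accuracy improves by 0.7% over MobileNetv3 and is also higher than that of other lightweight networks. On the COCO (common objects in context) dataset, the CSCMobileNetv3 + inverted residual NAS-FPNLite lightweight detector improves detection accuracy by 0.7% to 4% over other lightweight object detection methods at a comparable amount of computation. Conclusion The proposed CSCMobileNetv3 effectively captures differing gradient information and achieves higher accuracy with only a small increase in computation. The NAS-FPNLite method fused with inverted residuals effectively prevents parameters from becoming 0, improves detection accuracy, and strikes a better balance between computation and detection accuracy.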
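The difference between the two fusion modes named in the abstract, element-wise addition and channel concatenation, can be illustrated with a minimal sketch. It uses plain Python lists rather than a deep learning framework, and the helper names (`make_map`, `shape`, `fuse_add`, `fuse_concat`) and the feature map sizes are assumptions made for illustration, not part of the published method.

```python
from typing import List, Tuple

FeatureMap = List[List[List[float]]]  # indexed as [channel][row][column]


def make_map(channels: int, height: int, width: int, value: float) -> FeatureMap:
    """Build a constant feature map; a hypothetical helper for this sketch."""
    return [[[value] * width for _ in range(height)] for _ in range(channels)]


def shape(fmap: FeatureMap) -> Tuple[int, int, int]:
    """Return (channels, height, width) of a non-empty feature map."""
    return len(fmap), len(fmap[0]), len(fmap[0][0])


def fuse_add(a: FeatureMap, b: FeatureMap) -> FeatureMap:
    """Element-wise addition: shapes must match, channel count is unchanged."""
    if shape(a) != shape(b):
        raise ValueError("element-wise addition needs identical shapes")
    return [
        [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(ch_a, ch_b)]
        for ch_a, ch_b in zip(a, b)
    ]


def fuse_concat(a: FeatureMap, b: FeatureMap) -> FeatureMap:
    """Channel concatenation: spatial sizes must match, channel counts add up."""
    if shape(a)[1:] != shape(b)[1:]:
        raise ValueError("concatenation needs identical spatial sizes")
    return a + b  # new outer list holding the channels of a, then those of b


if __name__ == "__main__":
    p = make_map(64, 4, 4, 1.0)  # illustrative sizes only
    q = make_map(64, 4, 4, 2.0)
    print(shape(fuse_add(p, q)))     # (64, 4, 4)
    print(shape(fuse_concat(p, q)))  # (128, 4, 4)
```

With addition the fused map keeps the channel count of its inputs, whereas concatenation doubles it here, which is the property the method relies on to run the depthwise separable convolution at a higher channel count.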
Keywords
NAS-FPNLite object detection method fused with cross stage connection and inverted residual

Wang Hongxia1, Zhang Yongshan1, Song Bang2, Chen Deshan3, Yang Yi1(1.School of Computer and Artificial Intelligence, Wuhan University of Technology, Wuhan 430063, China;2.Beijing Yunchang Game Technology Co., Ltd., Beijing 100015, China;3.Intelligent Transportation Systems Center, Wuhan University of Technology, Wuhan 430063, China)

Abstract
Objective Lightweight object detection methods aim to preserve detection accuracy while reducing the computation and storage costs of neural networks. MobileNetv3 extracts features effectively through its bneck blocks, which are built on the inverted residual structure. However, features are connected only within each bneck and not between bnecks, so much of the initial feature information goes unused and network accuracy is limited. Neural architecture search-feature pyramid networks lite (NAS-FPNLite) is an effective deep learning based object detection method that balances computation and detection accuracy. The NAS-FPNLite detector uses depthwise separable convolutions in the feature pyramid part and compresses the channel number of the intermediate feature layers to a fixed 64 dimensions, which achieves a good balance between floating point operations and detection accuracy. At such a low channel count, however, the parameters of the depthwise separable convolutions easily become 0. To resolve these two problems, we develop a NAS-FPNLite lightweight object detection method that fuses cross stage connection and inverted residuals. First, an improved network model, CSCMobileNetv3, is proposed, which captures more diverse information and improves the efficiency of network feature extraction. Next, the inverted residual structure is applied to the feature pyramid part of NAS-FPNLite to keep a higher number of channels during the depthwise separable convolutions, which reduces the likelihood of parameters becoming 0 and improves detection accuracy. Finally, the CSCMobileNetv3 model is validated on the Canadian Institute for Advanced Research (CIFAR)-100 and ImageNet 1000 datasets, and the inverted residual NAS-FPNLite detector is validated on the common objects in context (COCO) dataset.
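To make the cost argument concrete, the following minimal sketch counts the weights of a standard convolution and of a depthwise separable convolution at the 64-channel width mentioned above. The function names, the 3×3 kernel size and the bias-free layers are assumptions chosen for illustration; they are not taken from the paper.

```python
def standard_conv_params(c_in: int, c_out: int, k: int = 3) -> int:
    """Weights of a k x k standard convolution (bias omitted by assumption)."""
    return k * k * c_in * c_out


def depthwise_separable_params(c_in: int, c_out: int, k: int = 3) -> int:
    """Weights of a k x k depthwise convolution followed by a 1 x 1 pointwise one."""
    depthwise = k * k * c_in  # one k x k filter per input channel
    pointwise = c_in * c_out  # 1 x 1 convolution that mixes the channels
    return depthwise + pointwise


if __name__ == "__main__":
    channels = 64  # fixed intermediate width stated in the abstract
    print(standard_conv_params(channels, channels))        # 36864
    print(depthwise_separable_params(channels, channels))  # 4672
```

Under these assumptions the separable form needs roughly one eighth of the weights, and only 9 × 64 = 576 of them sit in the depthwise step, which is the part the abstract describes as prone to parameters becoming 0 at low channel counts.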
Method First, to obtain different gradient information between network layers, a cross stage connection (CSC) structure is proposed, inspired by the dense connections of DenseNet. It combines the initial input of a network block with the block's final output to obtain the gradient combination with the maximum difference, yielding an improved CSCMobileNetv3 network model. CSCMobileNetv3 is composed of six blocks: the first two remain unchanged, and the last four incorporate the CSC structure. Within each of these blocks, the initial input is combined with the final output, and the result serves as the input of the next block. In addition, to obtain more varied gradient information, the channel numbers of the blocks are changed from the original 16, 24, 40, 80, 112, 160 to 16, 24, 40, 80, 160, 320. This effectively limits the increase in the number of parameters and floating point operations caused by excessive channel expansion. Then, in the feature pyramid part of the NAS-FPNLite detector, the inverted residual structure is incorporated, and the element-wise addition used to fuse different feature layers is replaced by channel concatenation. A higher number of channels is thus maintained during the depthwise separable convolutions, which effectively prevents parameters from becoming 0. For sufficient feature fusion, a skip connection is added between the input feature layer and the final output layer. The result is a NAS-FPNLite object detection method fused with inverted residuals. Result In the training stage, the configuration is as follows: 1) an NVIDIA GeForce RTX 2070 Super graphics card, 2) 8 GB of video memory, 3) CUDA 10.0, and 4) an 8-core AMD Ryzen 7 3700X CPU.
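The cross stage connection described in the Method part, in which a block's initial input is combined with its final output and passed to the next block, can be sketched as follows. This is a minimal illustration in plain Python: `toy_block` is a hypothetical stand-in for a stack of bnecks, features are stored as flattened channels, and none of the names or sizes come from the paper.

```python
from typing import Callable, List

Channel = List[float]    # one flattened channel
Feature = List[Channel]  # a feature map as a list of channels


def toy_block(x: Feature, out_channels: int) -> Feature:
    """Hypothetical stand-in for a block of bnecks.

    Output channel k is the per-position mean over the input channels,
    scaled by (k + 1). A real block would apply learned convolutions.
    """
    n = len(x)
    mean = [sum(ch[i] for ch in x) / n for i in range(len(x[0]))]
    return [[(k + 1) * v for v in mean] for k in range(out_channels)]


def cross_stage_connection(x: Feature, block: Callable[[Feature], Feature]) -> Feature:
    """Concatenate the block's initial input with its final output along channels."""
    y = block(x)
    return x + y  # input channels first, then the block's output channels


if __name__ == "__main__":
    x = [[1.0, 2.0], [3.0, 4.0]]  # 2 channels, 2 positions each
    out = cross_stage_connection(x, lambda f: toy_block(f, 3))
    print(len(out))  # 5 channels: 2 from the input plus 3 from the block
```

Because the untouched input travels alongside the transformed output, the next block receives both, which is one way to read the abstract's claim that the structure combines gradients with the largest difference.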
For training on the CIFAR-100 dataset, the images are 32×32 pixels, whereas CSCMobileNetv3 expects an input resolution of 224×224 pixels, so the convolution stride in the first convolutional layer and in the first and third bnecks is changed from 2 to 1. The network is trained for 200 epochs with a multi-stage learning rate schedule: the initial learning rate lr is 0.1, and it is reduced by a factor of 10 at epochs 100, 150 and 180. For training on the ImageNet 1000 dataset, the images are first preprocessed to a resolution of 224×224 pixels as the network input; the network is trained for 150 epochs with a cosine annealing learning rate schedule and an initial learning rate lr of 0.02. The experimental results show that on the CIFAR-100 dataset the accuracy of CSCMobileNetv3 is 0.71% to 1.04% higher than that of MobileNetv3 at scaling factors of 0.5, 0.75 and 1.0. At a scaling factor of 1.0, CSCMobileNetv3 has 6% more parameters and about 11% more floating point operations than MobileNetv3, but its accuracy is 1.04% higher. Notably, CSCMobileNetv3 with a scaling factor of 0.75 has 30% fewer parameters and 20% fewer floating point operations than MobileNetv3 with a scaling factor of 1.0, yet its accuracy is still 0.19% higher. On the ImageNet 1000 dataset, CSCMobileNetv3 improves accuracy by 0.7%, although its parameter count and floating point operations are slightly higher than those of MobileNetv3. Overall, CSCMobileNetv3 achieves a better trade-off among parameter count, floating point operations and accuracy. On the COCO dataset, the lightweight detector that combines CSCMobileNetv3 with the NAS-FPNLite method fused with inverted residuals improves detection accuracy by 0.7% to 4% over other lightweight object detection methods at an equivalent amount of computation. Conclusion CSCMobileNetv3 effectively obtains differing gradient information and achieves higher accuracy with only a small increase in computation. The NAS-FPNLite object detection method fused with inverted residuals effectively prevents parameters from becoming 0, which improves detection accuracy. The method achieves a better balance between the amount of computation and detection accuracy.
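The two learning rate schedules mentioned in the training setup can be written out as a short sketch. It assumes that the multi-stage schedule lowers the rate tenfold at each milestone (the usual step decay) and that cosine annealing decays to a minimum rate of 0; both points, as well as the function names, are assumptions rather than details confirmed by the abstract.

```python
import math
from typing import Sequence


def multistep_lr(
    epoch: int,
    base_lr: float = 0.1,
    milestones: Sequence[int] = (100, 150, 180),
    gamma: float = 0.1,  # assumed tenfold reduction per milestone
) -> float:
    """Step decay: multiply base_lr by gamma once for every milestone reached."""
    passed = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** passed


def cosine_lr(
    epoch: int,
    total_epochs: int = 150,
    base_lr: float = 0.02,
    min_lr: float = 0.0,  # assumed floor of the schedule
) -> float:
    """Cosine annealing from base_lr at epoch 0 to min_lr at total_epochs."""
    cosine = math.cos(math.pi * epoch / total_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + cosine)


if __name__ == "__main__":
    for e in (0, 100, 150, 180):
        print(e, multistep_lr(e))
    for e in (0, 75, 150):
        print(e, cosine_lr(e))
```

The step schedule holds the rate constant between milestones, while the cosine schedule lowers it smoothly, reaching half the initial value at the midpoint of training.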
Keywords
