NAS-FPNLite object detection method fused with cross stage connection and inverted residual
Vol. 28, Issue 4, Pages: 1004-1018 (2023)
Published: 16 April 2023
DOI: 10.11834/jig.211099
Wang Hongxia, Zhang Yongshan, Song Bang, Chen Deshan, Yang Yi. 2023. NAS-FPNLite object detection method fused with cross stage connection and inverted residual. Journal of Image and Graphics, 28(04):1004-1018
Objective
Lightweight object detection methods aim to preserve detection accuracy while reducing the computational and storage costs of neural networks. To address the weak feature connections between the bottleneck (bneck) layers of the MobileNetv3 network and the tendency of depthwise separable convolution to produce parameters equal to 0 at low dimensionality, a NAS-FPNLite (neural architecture search-feature pyramid networks lite) object detection method that fuses cross stage connection and inverted residuals is proposed.
Method
A cross stage connection (CSC) structure is proposed that fuses the initial input of a network block with the final output of the same block along the channel dimension, so that the gradient combinations with the largest differences are obtained, yielding an improved CSCMobileNetv3 network model. In the NAS-FPNLite detector, the feature pyramid network (FPN) part is fused with the inverted residual structure: the element-wise addition used to fuse different feature layers is replaced by channel concatenation, so that a higher number of channels is maintained during depthwise separable convolution, and a skip connection between the input feature layers and the final output layer provides sufficient feature fusion, yielding a NAS-FPNLite object detection method fused with inverted residuals.
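To make the cross stage connection concrete, the following PyTorch-style sketch shows one way a stage's initial input can be concatenated with its final output along the channel dimension. The module name CSCStage, the 1 × 1 transition convolution, and the pooling used to match spatial sizes are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CSCStage(nn.Module):
    """Illustrative cross stage connection (CSC): the initial input of a stage
    is concatenated channel-wise with the stage's final output, so the earliest
    and latest feature maps of the stage both contribute to the next stage."""

    def __init__(self, bneck_blocks: nn.Module, in_channels: int, out_channels: int):
        super().__init__()
        self.blocks = bneck_blocks  # the bneck blocks of one stage (assumed given)
        # 1x1 convolution brings the concatenated tensor back to the target width
        self.transition = nn.Conv2d(in_channels + out_channels, out_channels,
                                    kernel_size=1, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        initial = x                        # keep the stage's initial input
        out = self.blocks(x)               # usual bneck computation
        if out.shape[-2:] != initial.shape[-2:]:
            # if the stage downsamples, match spatial size before concatenation (assumption)
            initial = F.adaptive_avg_pool2d(initial, out.shape[-2:])
        fused = torch.cat([initial, out], dim=1)   # channel concatenation, not addition
        return self.transition(fused)
```

In CSCMobileNetv3, as described in the extended Method section below, such a connection is applied to the last four of the six stages.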
Result
The experimental data show that, on the CIFAR (Canadian Institute for Advanced Research)-100 dataset, the accuracy of the CSCMobileNetv3 network is 0.71% to 1.04% higher than that of MobileNetv3 at scaling factors of 0.5, 0.75, and 1.0. In particular, CSCMobileNetv3 with a scaling factor of 0.75 is 0.19% more accurate than MobileNetv3 with a scaling factor of 1.0 while using 30% fewer parameters and 20% fewer floating-point operations. On the ImageNet 1000 dataset, accuracy improves by 0.7% over MobileNetv3 and also improves to some extent over other lightweight networks. On the COCO (common objects in context) dataset, the lightweight object detection method combining CSCMobileNetv3 with the inverted-residual NAS-FPNLite improves detection accuracy by 0.7% to 4% over other lightweight object detection methods at comparable computational cost.
Conclusion
The proposed CSCMobileNetv3 obtains differential gradient information effectively and achieves higher accuracy with only a small increase in computation. The NAS-FPNLite object detection method fused with inverted residuals effectively avoids the situation where parameters become 0, improves detection accuracy, and achieves a better balance between computational cost and detection accuracy.
Objective
Lightweight object detection methods aim to preserve detection accuracy while reducing the computation and storage costs of neural networks. MobileNetv3 extracts features effectively through its inverted-residual bottleneck (bneck) blocks, but features are connected only inside each bneck; there is no feature connection between bnecks, so the initial features of a stage are not reused and network accuracy suffers. To achieve a better balance between computation and detection accuracy, NAS-FPNLite (neural architecture search-feature pyramid networks lite) is an effective deep-learning-based object detection method: its detector uses depthwise separable convolution in the feature pyramid part and compresses the channel number of the intermediate feature layers to a fixed 64 dimensions, trading floating-point operations against detection accuracy. Under such low dimensionality, however, depthwise separable convolution easily produces parameters that become 0. To resolve these two problems, we develop a lightweight NAS-FPNLite object detection method that fuses cross stage connection and inverted residuals. First, an improved network model, CSCMobileNetv3, is proposed, which captures more diverse gradient information and improves the efficiency of feature extraction. Next, the inverted residual structure is applied to the feature pyramid part of NAS-FPNLite so that a higher number of channels is maintained during depthwise separable convolution, which reduces the chance of parameters becoming 0 and improves detection accuracy. Finally, the CSCMobileNetv3 model is validated on the Canadian Institute for Advanced Research (CIFAR)-100 and ImageNet 1000 datasets, and the inverted-residual NAS-FPNLite detector is validated on the COCO (common objects in context) dataset.
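As background for the two problems described above, the sketch below shows a depthwise separable convolution of the kind used in lightweight FPN heads at a fixed low width such as 64 channels. The function name, kernel size, and normalization choices are assumptions for illustration only.

```python
import torch.nn as nn


def depthwise_separable_conv(in_channels: int = 64, out_channels: int = 64) -> nn.Sequential:
    """Sketch of a depthwise separable convolution: a per-channel 3x3 depthwise
    convolution followed by a 1x1 pointwise convolution. At a low, fixed width
    such as 64 channels, many depthwise weights can collapse toward 0, which is
    the problem addressed by the inverted-residual fusion in this paper."""
    return nn.Sequential(
        nn.Conv2d(in_channels, in_channels, kernel_size=3, padding=1,
                  groups=in_channels, bias=False),   # depthwise: one filter per channel
        nn.BatchNorm2d(in_channels),
        nn.ReLU(inplace=True),
        nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False),  # pointwise
        nn.BatchNorm2d(out_channels),
        nn.ReLU(inplace=True),
    )
```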
Method
First, to obtain different gradient information between network layers, a cross stage connection (CSC) structure is proposed, inspired by the dense connections of DenseNet: the initial input of a network block is combined with the final output of the same block so that the gradient combinations with the largest differences are obtained, yielding an improved network model, CSCMobileNetv3. CSCMobileNetv3 consists of six block structures; the first two blocks remain unchanged, and the last four are combined with the CSC structure. Within each of these blocks, the initial input is fused with the final output and used as the input of the next block. At the same time, to obtain more diverse gradient information, the numbers of channels between the blocks are changed from the original 16, 24, 40, 80, 112, 160 to 16, 24, 40, 80, 160, 320, while the increase in parameters and floating-point operations that excessive channel expansion would otherwise cause is kept under control. Then, in the detector part of NAS-FPNLite, the feature pyramid is fused with the inverted residual structure: the element-wise addition used to fuse different feature layers is replaced by channel concatenation, so that a higher number of channels is maintained during depthwise separable convolution and the situation where parameters become 0 is effectively avoided, and a skip connection between the input feature layers and the final output layer provides sufficient feature fusion. The result is a NAS-FPNLite object detection method fused with inverted residuals.
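A minimal sketch of the modified merge step is given below, assuming two feature maps of equal spatial size and illustrative channel and expansion settings; it is not the paper's exact cell design.

```python
import torch
import torch.nn as nn


class InvertedResidualFusion(nn.Module):
    """Sketch of the modified FPN merge cell: two feature maps are fused by
    channel concatenation (instead of element-wise addition), processed by an
    inverted-residual block (expand -> depthwise -> project), and combined with
    the fused input through a skip connection."""

    def __init__(self, channels: int = 64, expansion: int = 4):
        super().__init__()
        in_ch = 2 * channels               # concatenation doubles the channel count
        hidden = in_ch * expansion         # expand to a higher dimension first
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, hidden, kernel_size=1, bias=False),     # expand
            nn.BatchNorm2d(hidden),
            nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, hidden, kernel_size=3, padding=1,
                      groups=hidden, bias=False),                    # depthwise at high width
            nn.BatchNorm2d(hidden),
            nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, channels, kernel_size=1, bias=False),  # project back
            nn.BatchNorm2d(channels),
        )
        self.shortcut = nn.Conv2d(in_ch, channels, kernel_size=1, bias=False)

    def forward(self, feat_a: torch.Tensor, feat_b: torch.Tensor) -> torch.Tensor:
        x = torch.cat([feat_a, feat_b], dim=1)     # channel concatenation, not addition
        return self.block(x) + self.shortcut(x)    # skip connection with the fused input
```

Compared with element-wise addition at a fixed 64 channels, the depthwise convolution here operates on an expanded representation, which is what keeps its parameters from collapsing to 0.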
Result
The training environment is configured as follows: 1) the graphics card is an NVIDIA GeForce RTX 2070 Super with 8 GB of video memory, 2) the CUDA version is 10.0, and 3) the CPU is an 8-core AMD Ryzen 7 3700X. For training on the CIFAR-100 dataset, since the image resolution of CIFAR-100 is 32 × 32 pixels while CSCMobileNetv3 expects an input resolution of 224 × 224 pixels, the convolution stride of the first convolutional layer and of the first and third bnecks is changed from 2 to 1. Training runs for 200 epochs with a multi-stage learning rate schedule: the initial learning rate lr is set to 0.1 and is reduced by a factor of 10 at epochs 100, 150, and 180. For training on the ImageNet 1000 dataset, the data are first preprocessed and the image resolution is adjusted to 224 × 224 pixels as the network input; training runs for 150 epochs with a cosine annealing learning rate strategy and an initial learning rate lr of 0.02. The experimental results show that, on the CIFAR-100 dataset, the accuracy of CSCMobileNetv3 is 0.71% to 1.04% higher than that of MobileNetv3 at scaling factors of 0.5, 0.75, and 1.0. At a scaling factor of 1.0, CSCMobileNetv3 has 6% more parameters and about 11% more floating-point operations than MobileNetv3, but its accuracy is 1.04% higher. In particular, CSCMobileNetv3 with a scaling factor of 0.75 uses 30% fewer parameters and 20% fewer floating-point operations than MobileNetv3 with a scaling factor of 1.0, yet its accuracy is still 0.19% higher. On the ImageNet 1000 dataset, CSCMobileNetv3 improves accuracy by 0.7%, although its parameter count and floating-point operations are slightly higher than those of MobileNetv3. In summary, CSCMobileNetv3 achieves a better trade-off among parameters, floating-point operations, and accuracy. On the COCO dataset, compared with other lightweight object detection methods at comparable computational cost, the lightweight detector combining CSCMobileNetv3 with the inverted-residual NAS-FPNLite improves detection accuracy by 0.7% to 4%.
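Assuming an SGD optimizer and a placeholder model, and interpreting the multi-stage adjustment as a standard step decay by a factor of 10, the two learning rate schedules described above map onto standard PyTorch schedulers as sketched below.

```python
import torch
from torch.optim.lr_scheduler import MultiStepLR, CosineAnnealingLR

# Placeholder model standing in for the CSCMobileNetv3 backbone (assumption).
model = torch.nn.Conv2d(3, 16, kernel_size=3)

# CIFAR-100: 200 epochs, initial lr 0.1, decayed by a factor of 10 at epochs 100, 150 and 180.
optimizer_cifar = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
scheduler_cifar = MultiStepLR(optimizer_cifar, milestones=[100, 150, 180], gamma=0.1)

# ImageNet 1000: 150 epochs, cosine annealing from an initial lr of 0.02.
optimizer_imagenet = torch.optim.SGD(model.parameters(), lr=0.02, momentum=0.9)
scheduler_imagenet = CosineAnnealingLR(optimizer_imagenet, T_max=150)

for epoch in range(200):
    # ... one training epoch on CIFAR-100 would go here ...
    scheduler_cifar.step()  # advance the multi-stage schedule once per epoch
```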
Conclusion
CSCMobileNetv3 obtains differential gradient information effectively and achieves higher accuracy with only a small increase in computation. The NAS-FPNLite object detection method fused with inverted residuals effectively avoids the situation where parameters become 0 and thereby improves detection accuracy. Our method achieves a better balance between the amount of computation and detection accuracy.
lightweight object detection; image classification; depthwise separable convolution; multi-scale feature fusion
Chen B, Ghiasi G, Liu H X, Lin T Y, Kalenichenko D, Adam H and Le Q V. 2020. MnasFPN: learning latency-aware pyramid architecture for object detection on mobile devices//Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, USA: IEEE: 13604-13613 [DOI: 10.1109/CVPR42600.2020.01362]
Fu C Y, Liu W, Ranga A, Tyagi A and Berg A C. 2017. DSSD: deconvolutional single shot detector [EB/OL]. [2020-05-18]. https://arxiv.org/pdf/1701.06659.pdf
Ghiasi G, Lin T Y and Le Q V. 2019. NAS-FPN: learning scalable feature pyramid architecture for object detection//Proceedings of the 32nd IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, USA: IEEE: 7029-7038 [DOI: 10.1109/CVPR.2019.00720]
Girshick R, Donahue J, Darrell T and Malik J. 2014. Rich feature hierarchies for accurate object detection and semantic segmentation//Proceedings of 2014 IEEE Conference on Computer Vision and Pattern Recognition. Columbus, USA: IEEE: 580-587 [DOI: 10.1109/CVPR.2014.81]
Girshick R. 2015. Fast R-CNN//Proceedings of 2015 IEEE International Conference on Computer Vision. Santiago, Chile: IEEE: 1440-1448 [DOI: 10.1109/ICCV.2015.169]
He K, Zhang X, Ren S and Sun J. 2015. Deep residual learning for image recognition//Proceedings of the 29th IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, USA: IEEE Computer Society: 770-778 [DOI: 10.1109/CVPR.2016.90]
Howard A, Sandler M, Chen B, Wang W J, Chen L C, Tan M X, Chu G, Vasudevan V, Zhu Y K, Pang R M, Adam H and Le Q V. 2019. Searching for MobileNetv3//Proceedings of the 17th IEEE/CVF International Conference on Computer Vision. Seoul, Korea (South): IEEE: 1314-1324 [DOI: 10.1109/ICCV.2019.00140]
Howard A G, Zhu M L, Chen B, Kalenichenko D, Wang W J, Weyand T, Andreetto M and Adam H. 2017. MobileNets: efficient convolutional neural networks for mobile vision applications [EB/OL]. [2020-05-17]. https://arxiv.org/pdf/1704.04861.pdf
Huang G, Liu S C, Van Der Maaten L and Weinberger K Q. 2018. CondenseNet: an efficient DenseNet using learned group convolutions//Proceedings of the 31st IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 2752-2761 [DOI: 10.1109/CVPR.2018.00291]
Huang G, Liu Z, Van Der Maaten L and Weinberger K Q. 2017. Densely connected convolutional networks [EB/OL]. [2021-11-18]. https://arxiv.org/pdf/1608.06993.pdf
Iandola F N, Han S, Moskewicz M W, Ashraf K, Dally W J and Keutzer K. 2016. SqueezeNet: AlexNet-level accuracy with 50× fewer parameters and <0.5 MB model size [EB/OL]. [2020-05-12]. https://arxiv.org/pdf/1602.07360.pdf
Simonyan K and Zisserman A. 2014. Very deep convolutional networks for large-scale image recognition [EB/OL]. [2021-11-18]. https://arxiv.org/pdf/1409.1556.pdf
Krizhevsky A, Sutskever I and Hinton G E. 2012. ImageNet classification with deep convolutional neural networks//Proceedings of the 25th International Conference on Neural Information Processing Systems. Lake Tahoe, USA: Curran Associates Inc: 1097-1105
Li Y X, Li J W, Lin W Y and Li J G. 2018. Tiny-DSOD: lightweight object detection for resource-restricted usages//Proceedings of British Machine Vision Conference 2018. Newcastle, UK: BMVA Press: 1516-1524
Li Z X and Zhou F Q. 2017. FSSD: feature fusion single shot multibox detector [EB/OL]. [2020-06-10]. https://arxiv.org/pdf/1712.00960.pdf
Lin T Y, Goyal P, Girshick R B, He K M and Dollár P. 2020. Focal loss for dense object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(2): 318-327 [DOI: 10.1109/TPAMI.2018.2858826]
Liu W, Anguelov D, Erhan D, Szegedy C, Reed S, Fu C Y and Berg A C. 2016. SSD: single shot MultiBox detector//Proceedings of the 14th European Conference on Computer Vision. Amsterdam, the Netherlands: Springer: 21-37 [DOI: 10.1007/978-3-319-46448-0_2]
Redmon J, Divvala S, Girshick R and Farhadi A. 2016. You only look once: unified, real-time object detection//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Las Vegas, USA: IEEE: 779-788 [DOI: 10.1109/CVPR.2016.91]
Ren S Q, He K M, Girshick R and Sun J. 2017. Faster R-CNN: towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(6): 1137-1149 [DOI: 10.1109/TPAMI.2016.2577031]
Sandler M, Howard A, Zhu M L, Zhmoginov A and Chen L C. 2018. MobileNetV2: inverted residuals and linear bottlenecks//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 4510-4520 [DOI: 10.1109/CVPR.2018.00474]
Wang R J, Li X and Ling C X. 2018. Pelee: a real-time object detection system on mobile devices//Proceedings of the 32nd International Conference on Neural Information Processing Systems. Montréal, Canada: Curran Associates Inc: 2159-2165
Yan Z Y, Li X M, Li M, Zuo W M and Shan S G. 2018. Shift-Net: image inpainting via deep feature rearrangement//Proceedings of the 15th European Conference on Computer Vision. Munich, Germany: Springer: 3-19 [DOI: 10.1007/978-3-030-01264-9_1]
Zhang X Y, Zhou X Y, Lin M X and Sun J. 2018. ShuffleNet: an extremely efficient convolutional neural network for mobile devices//Proceedings of the 31st IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 6848-6856 [DOI: 10.1109/CVPR.2018.00716]