3D object detection based on domain attention and dilated convolution
2020, Vol. 25, No. 6, pp. 1221-1234
Received: 2019-08-16; Revised: 2019-11-08; Published in print: 2020-06-16
DOI: 10.11834/jig.190378
Objective
Methods for 3D object detection based on deep convolutional neural networks (CNNs) have made great progress. However, the features extracted by a CNN lack dependence relationships both across different regions and across different channels, and a CNN struggles to enlarge its receptive field without sacrificing spatial resolution. To address these shortcomings, this paper proposes a 3D object detection method that combines mixed-domain attention with dilated convolution.
Method
A spatial-domain attention mechanism is integrated at the input layer to transform the spatial position of the input information and preserve the regional features that deserve focused attention; a channel-domain attention mechanism is integrated into the network to compute the channel weights of the extracted features and obtain the key channel features; fusing the spatial-domain and channel-domain mechanisms applies mixed spatial-and-channel attention to the features. At the output layer of the feature extractor, a network layer combining dilated convolution with channel attention enlarges the receptive field without any loss of spatial resolution: channel weights are computed under each receptive field and then fused, yielding the key channel features of a global receptive field. A feature pyramid structure is introduced to build the feature extractor and extract high-resolution feature maps, which substantially improves detection performance. Finally, a two-stage region proposal network regresses more accurately localized 3D bounding boxes.
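As a side note on the dilated-convolution claim above, the following minimal Python sketch (illustrative only, not from the paper) works out the receptive-field arithmetic: a k x k kernel with dilation rate d acts like a kernel of effective size k + (k - 1)(d - 1), so stacked dilated layers grow the receptive field quickly while the feature map keeps its full resolution.

```python
# Effective kernel size of a dilated convolution: k_eff = k + (k - 1) * (d - 1).
# Stacking stride-1 dilated layers grows the receptive field without pooling,
# so spatial resolution is preserved. The rates below are illustrative.

def effective_kernel(k: int, d: int) -> int:
    """Effective kernel size of a k x k convolution with dilation rate d."""
    return k + (k - 1) * (d - 1)

def stacked_receptive_field(layers) -> int:
    """Receptive field of a stack of stride-1 (kernel, dilation) layers."""
    rf = 1
    for k, d in layers:
        rf += effective_kernel(k, d) - 1
    return rf

# Three 3x3 layers with dilation rates 1, 2, 4: receptive field 3 -> 7 -> 15,
# versus 3 -> 5 -> 7 for three ordinary 3x3 convolutions.
print(stacked_receptive_field([(3, 1), (3, 2), (3, 4)]))  # 15
```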
Result
Experimental results on the KITTI dataset (a project of Karlsruhe Institute of Technology and Toyota Technological Institute at Chicago) show that, as occlusion ranges from light to severe, the average precision of the 3D detection box (AP_3D) for the car class on the test set reaches 83.45%, 74.29%, and 67.92%, and the average precision of the 2D detection box in bird's eye view (AP_BEV) reaches 89.61%, 87.05%, and 79.69%. For the pedestrian and cyclist classes, the AP_3D and AP_BEV values likewise hold an advantage over the results of other methods.
Conclusion
The proposed 3D object detection network alleviates, to a certain extent, the lack of visual attention in CNN-extracted features for 3D detection tasks, making 3D object detection more effective for outdoor autonomous driving.
Objective
With the continuous development of convolutional neural networks (CNNs) in deep learning in recent years, 3D object detection networks based on deep learning have also made outstanding progress. 3D object detection aims to identify the class, location, orientation, and size of a target object in 3D space. It is widely used in vision applications such as autonomous driving, intelligent monitoring, and medical analysis. The features extracted by a deep learning network are important to detection accuracy. The detection task is similar to human vision in that it must also distinguish objects from the background. Human vision attends to target objects while disregarding the background; therefore, when detecting objects in an image, it is better to pay more attention to the target area and less to the background. However, a CNN does not distinguish which areas and channels of an image deserve more or less attention. Thus, the features extracted by a CNN lack the dependence relationships both between different regions and between different channels. Current 3D object detection methods based on deep learning place pooling layers behind multilayer convolution layers; these structures generally apply maximum or average pooling to feature maps to adjust the receptive field size of the extracted features. However, transforming the receptive field through pooling necessarily discards some information, causing a considerable loss of feature information, and this loss may result in detection errors. A CNN should therefore expand the receptive field without losing information in order to obtain good detection results. To address the shortcomings of the aforementioned 3D object detection methods, this study proposes a two-stage 3D object detection network that combines mixed-domain attention and dilated convolution.
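To make the resolution argument concrete, here is a minimal PyTorch-style sketch (tensor sizes, channel counts, and dilation rate are illustrative assumptions, not taken from the paper) contrasting a pooling layer, which halves the feature map, with a dilated convolution, which enlarges the receptive field at full resolution.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 32, 32)  # arbitrary feature map: N, C, H, W

# Max pooling enlarges the receptive field but halves spatial resolution.
pooled = nn.MaxPool2d(kernel_size=2, stride=2)(x)

# A dilated 3x3 convolution covers a 5x5 region (dilation=2) while keeping
# the full 32x32 resolution, provided the padding matches the dilation rate.
dilated = nn.Conv2d(64, 64, kernel_size=3, dilation=2, padding=2)(x)

print(pooled.shape)   # torch.Size([1, 64, 16, 16]) -- resolution lost
print(dilated.shape)  # torch.Size([1, 64, 32, 32]) -- resolution preserved
```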
Method
In this study, a 3D object detection network based on deep learning is built. First, integrating a spatial-domain attention mechanism into the input layer of the network transforms the spatial position of the input information, preserving the regional features that require more attention. Incorporating a channel-domain attention mechanism into the network computes the channel weights of the extracted features, obtaining the key channel features. The features then receive mixed attention by combining the aforementioned spatial and channel domain attention mechanisms.
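The paper does not spell out the implementation of this mixed attention module; the sketch below is one plausible CBAM-style reading, in PyTorch, in which a squeeze-and-excitation channel gate is followed by a spatial gate computed from per-position channel statistics. The channel count, reduction ratio, and 7x7 spatial kernel are assumptions.

```python
import torch
import torch.nn as nn

class MixedDomainAttention(nn.Module):
    """Channel attention followed by spatial attention (CBAM-style sketch)."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        # Channel branch: squeeze spatially, then excite each channel.
        self.channel = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )
        # Spatial branch: a 7x7 conv over per-position channel statistics.
        self.spatial = nn.Sequential(
            nn.Conv2d(2, 1, kernel_size=7, padding=3),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x * self.channel(x)  # re-weight channels
        stats = torch.cat([x.mean(dim=1, keepdim=True),
                           x.amax(dim=1, keepdim=True)], dim=1)
        return x * self.spatial(stats)  # re-weight spatial positions
```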
Second, the output layer of the feature extractor integrates a network layer that combines dilated convolution with the channel-domain attention mechanism, so the network can expand the receptive field of the extracted features without losing spatial resolution. In accordance with the different receptive fields obtained, the features determine their channel weights, and these weights are then fused across the different dilation schemes to obtain the channel weights of a global receptive field and identify the key channel features.
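Again as a hedged sketch rather than the authors' exact layer: one way to combine dilated convolution with channel attention as described is to run parallel 3x3 branches at different dilation rates, gate each branch with its own channel weights, and fuse by summation. The rates (1, 2, 4) and the reduction ratio are assumed for illustration.

```python
import torch
import torch.nn as nn

class DilatedChannelFusion(nn.Module):
    """Parallel dilated branches, each channel-reweighted, fused by summation."""

    def __init__(self, channels: int, rates=(1, 2, 4), reduction: int = 16):
        super().__init__()
        # One 3x3 branch per dilation rate; padding=r keeps the map size.
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, channels, 3, padding=r, dilation=r)
            for r in rates
        )
        # One squeeze-and-excitation gate per receptive-field branch.
        self.gates = nn.ModuleList(
            nn.Sequential(
                nn.AdaptiveAvgPool2d(1),
                nn.Conv2d(channels, channels // reduction, 1),
                nn.ReLU(inplace=True),
                nn.Conv2d(channels // reduction, channels, 1),
                nn.Sigmoid(),
            )
            for _ in rates
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        outs = [conv(x) for conv in self.branches]
        # Fuse the per-receptive-field features after channel re-weighting.
        return sum(g(o) * o for o, g in zip(outs, self.gates))
```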
In addition, a feature pyramid network structure is introduced to construct the feature extractor, through which the network extracts high-resolution feature maps, considerably improving detection performance. Lastly, the network architecture is based on a two-stage region proposal network, which regresses accurate 3D bounding boxes.
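For completeness, a minimal top-down feature pyramid sketch in the same PyTorch style (the backbone channel counts are placeholders): lateral 1x1 convolutions project each backbone map to a common width, coarser maps are upsampled and added into finer ones, and a 3x3 convolution smooths each fused map.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FPNTopDown(nn.Module):
    """Top-down pathway with lateral connections (feature pyramid sketch)."""

    def __init__(self, in_channels=(256, 512, 1024), out_channels: int = 256):
        super().__init__()
        self.lateral = nn.ModuleList(
            nn.Conv2d(c, out_channels, 1) for c in in_channels
        )
        self.smooth = nn.ModuleList(
            nn.Conv2d(out_channels, out_channels, 3, padding=1)
            for _ in in_channels
        )

    def forward(self, feats):
        # feats: backbone maps ordered fine -> coarse, e.g. strides 8, 16, 32.
        laterals = [l(f) for l, f in zip(self.lateral, feats)]
        # Upsample each coarser map and add it into the next finer lateral.
        for i in range(len(laterals) - 1, 0, -1):
            laterals[i - 1] = laterals[i - 1] + F.interpolate(
                laterals[i], size=laterals[i - 1].shape[-2:], mode="nearest")
        return [s(l) for s, l in zip(self.smooth, laterals)]
```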
Result
A series of experiments was conducted on the KITTI dataset (a project of Karlsruhe Institute of Technology and Toyota Technological Institute at Chicago) using the proposed method. Cases in which the object is slightly to severely occluded are denoted as "easy", "moderate", and "hard". For the car class on the test set, the average precision of the 3D detection box (AP_3D) reaches 83.45%, 74.29%, and 67.92%, and the average precision of the 2D detection box in bird's eye view (AP_BEV) reaches 89.61%, 87.05%, and 79.69%. For the pedestrian class, AP_3D reaches 52.23%, 44.91%, and 41.64%, and AP_BEV reaches 59.73%, 53.97%, and 49.62%. For the cyclist class, AP_3D reaches 65.02%, 54.38%, and 47.97%, and AP_BEV reaches 69.13%, 59.69%, and 52.11%. Ablation experiments were also performed on the test set: for the car class, relative to the full method, removing the pyramid structure reduces the average AP_3D by approximately 6.09%, removing the mixed-domain attention structure reduces it by approximately 0.99%, and removing the dilated convolution structure reduces it by approximately 0.71%.
Conclusion
For the 3D object detection task, we propose a two-stage 3D object detection network that combines dilated convolution and mixed-domain attention. The experimental results show that the proposed method outperforms several existing state-of-the-art 3D object detection methods, obtains accurate detection results, and can be effectively applied to outdoor autonomous driving.
References
Charles R Q, Su H, Kaichun M and Guibas L J. 2017. PointNet: deep learning on point sets for 3D classification and segmentation//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Honolulu, HI: IEEE: 77-85 [DOI: 10.1109/CVPR.2017.16]
Chen X Z, Ma H M, Wan J, Li B and Xia T. 2017. Multi-view 3D object detection network for autonomous driving//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, HI: IEEE: 6526-6534 [DOI: 10.1109/CVPR.2017.691]
Dalal N and Triggs B. 2005. Histograms of oriented gradients for human detection//Proceedings of 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05). San Diego, CA, USA: IEEE: 886-893 [DOI: 10.1109/CVPR.2005.177]
Geiger A, Lenz P, Stiller C and Urtasun R. 2013. Vision meets robotics: the KITTI dataset. International Journal of Robotics Research, 32(11): 1231-1237 [DOI: 10.1177/0278364913491297]
Hu J, Shen L, Albanie S, Sun G and Wu E H. 2017. Squeeze-and-excitation networks. IEEE Transactions on Pattern Analysis and Machine Intelligence: 1-13 [DOI: 10.1109/TPAMI.2019.2913372]
Ku J, Mozifian M, Lee J, Harakeh A and Waslander S L. 2018. Joint 3D proposal generation and object detection from view aggregation//Proceedings of 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). Madrid: IEEE: 1-8 [DOI: 10.1109/IROS.2018.8594049]
Li B, Zhang T L and Xia T. 2016. Vehicle detection from 3D lidar using fully convolutional network//Proceedings of Robotics: Science and Systems: #42 [DOI: 10.15607/RSS.2016.XII.042]
Li B. 2017. 3D fully convolutional network for vehicle detection in point cloud//Proceedings of 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). Vancouver, BC: IEEE: 1513-1518 [DOI: 10.1109/IROS.2017.8205955]
Lin T Y, Dollár P, Girshick R, He K M, Hariharan B and Belongie S. 2017. Feature pyramid networks for object detection//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, HI: IEEE: 936-944 [DOI: 10.1109/CVPR.2017.106]
Lowe D G. 1999. Object recognition from local scale-invariant features//Proceedings of the 7th IEEE International Conference on Computer Vision. Kerkyra, Greece: IEEE: 1150-1157 [DOI: 10.1109/ICCV.1999.790410]
Neubeck A and Van Gool L J. 2006. Efficient non-maximum suppression//Proceedings of the 18th International Conference on Pattern Recognition. Hong Kong, China: IEEE: 850-855 [DOI: 10.1109/ICPR.2006.479]
Papageorgiou C P, Oren M and Poggio T. 1998. A general framework for object detection//Proceedings of the 6th International Conference on Computer Vision. Bombay, India: IEEE: 555-562 [DOI: 10.1109/ICCV.1998.710772]
Qi C R, Liu W, Wu C X, Su H and Guibas L J. 2018. Frustum PointNets for 3D object detection from RGB-D data//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, UT: IEEE: 918-927 [DOI: 10.1109/CVPR.2018.00102]
Redmon J, Divvala S, Girshick R and Farhadi A. 2016. You only look once: unified, real-time object detection//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Las Vegas, NV: IEEE: 779-788 [DOI: 10.1109/CVPR.2016.91]
Ren S Q, He K M, Girshick R and Sun J. 2017. Faster R-CNN: towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(6): 1137-1149 [DOI: 10.1109/TPAMI.2016.2577031]
Simon M, Amende K, Kraus A, Honer J, Sämann T, Kaulbersch H, Milz S and Gross H M. 2019. Complexer-YOLO: real-time 3D object detection and tracking on semantic point clouds [EB/OL]. [2019-10-24]. https://arxiv.org/pdf/1904.07537.pdf
Simonyan K and Zisserman A. 2014. Very deep convolutional networks for large-scale image recognition [EB/OL]. [2019-10-24]. https://arxiv.org/pdf/1409.1556.pdf
Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez A N, Kaiser Ł and Polosukhin I. 2017. Attention is all you need [EB/OL]. [2019-10-24]. https://arxiv.org/pdf/1706.03762.pdf
Yu F and Koltun V. 2015. Multi-scale context aggregation by dilated convolutions [EB/OL]. [2019-10-24]. https://arxiv.org/pdf/1511.07122.pdf
Zhou J, Tan X, Shao Z W and Ma L Z. 2019. FVNet: 3D front-view proposal generation for real-time object detection from point clouds//Proceedings of the 12th International Congress on Image and Signal Processing, BioMedical Engineering and Informatics (CISP-BMEI). Suzhou, China: IEEE: 1-8 [DOI: 10.1109/CISP-BMEI48845.2019.8965844]
Zhou Y and Tuzel O. 2018. VoxelNet: end-to-end learning for point cloud based 3D object detection//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, UT: IEEE: 4490-4499 [DOI: 10.1109/CVPR.2018.00472]