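Abstract: Objective In recent years, three-dimensional object detection based on convolutional neural networks in deep learning has made great progress. In such methods, however, the features extracted by the convolutional neural network lack the dependencies both between features of different regions and between features of different channels, and it is difficult to enlarge the receptive field without losing spatial resolution. To address these shortcomings of existing three-dimensional object detection methods, this paper proposes a three-dimensional object detection method that combines mixed-domain attention with dilated convolution. Method A spatial-domain attention mechanism is integrated into the input layer of the network to transform the spatial position of the input information, which strengthens the network's capacity for adaptive geometric transformation. A channel-domain attention mechanism is then integrated into the network to extract the channel weights of the features and thereby obtain the key channel features. Fusing the spatial-domain and channel-domain attention mechanisms gives the features mixed spatial and channel attention. Next, a network layer combining dilated convolution with the channel attention mechanism is integrated into the output layer of the feature extractor, so that the receptive field is enlarged without loss of spatial resolution; the channel weights of the features extracted under different receptive fields are computed and then fused, yielding key channel features with a global receptive field. In addition, a feature pyramid structure is used to build the feature extractor so that high-resolution feature maps are extracted, which substantially improves detection performance. The network is based on a two-stage region proposal network, so as to regress more accurately localized three-dimensional bounding boxes. Result The proposed method was evaluated on the KITTI dataset. For the "Car" class of the test set, as occlusion increases from light to heavy, the two reported metrics are 83.45%, 74.29%, and 67.92%, and 89.61%, 87.05%, and 79.69%, respectively. For the "Pedestrian" and "Cyclist" classes, both metrics also compare favorably with the results of many other methods. Conclusion This paper proposes a two-stage three-dimensional object detection method combining mixed-domain attention and dilated convolution, which achieves better detection results than many strong existing two-stage three-dimensional object detection networks.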
Abstract: Objective In recent years, with the continuing development of the convolutional neural networks used in deep learning, 3D object detection networks based on deep learning have also advanced considerably. 3D object detection aims to identify the class, location, orientation, and size of a target object in three-dimensional space, and it is widely used in vision applications such as autonomous driving, intelligent monitoring, and medical analysis. The features extracted by a deep learning network are important for detection accuracy. The detection task resembles human vision in that it must distinguish objects from the background. Human vision pays more attention to target objects and ignores the background. When detecting objects in an image, it is therefore better to attend more to the target area and less to the background area. A convolutional neural network, however, does not distinguish which areas and channels of an image deserve more attention and which deserve less. As a result, the features it extracts lack the dependencies both between different regions and between different channels. Current 3D object detection methods based on deep learning place pooling layers after multiple convolution layers. These network structures generally apply max pooling or average pooling to the feature maps in order to adjust the receptive field size of the extracted features. However, a pooling layer changes the receptive field by discarding part of the information, which causes a large loss of feature information, and this loss may lead to detection errors.
Therefore, the convolutional neural network needs to expand the receptive field without losing information in order to obtain good detection results. To address the shortcomings of the above 3D object detection methods, this paper proposes a two-stage 3D object detection network combining mixed domain attention and dilated convolution. Method This paper builds a 3D object detection network based on deep learning. First, a spatial domain attention mechanism is integrated into the input layer of the network to transform the spatial position of the input information, which enhances the adaptive geometric transformation capability of the network. A channel domain attention mechanism is then incorporated into the network to compute the channel weights of the extracted features, so that the key channel features are obtained. Combining the spatial domain and channel domain attention mechanisms gives the features mixed spatial and channel attention. Secondly, a network layer that combines dilated convolution with the channel domain attention mechanism is integrated into the output layer of the feature extractor, so that the network can expand the receptive field of the extracted features without losing spatial resolution. Channel weights are computed for the features obtained under each receptive field, and these weights are then fused through different schemes to give the channel weights of the global receptive field, from which the key channel features are obtained. In addition, a feature pyramid network structure is used to construct the feature extractor, which allows the network to extract high-resolution feature maps and greatly improves its detection performance. Finally, the network architecture is based on a two-stage region proposal network, which can regress more accurate 3D bounding boxes.
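The idea of enlarging the receptive field without losing spatial resolution, and then weighting branches with different receptive fields, can be illustrated with a small sketch. It is a one-dimensional, pure-Python illustration and not the network described in this paper. The function names, the averaging kernel, the dilation rates 1, 2, and 4, and the simplified attention gate (a sigmoid of the global average, with no learned layers) are all assumptions made for illustration.

```python
import math


def dilated_conv1d(signal, kernel, dilation):
    """Zero-padded 1D dilated convolution; output length equals input length.

    Assumes an odd kernel size so the padding is symmetric.
    """
    k = len(kernel)
    span = (k - 1) * dilation            # distance between first and last tap
    pad = span // 2                      # zeros added on each side
    padded = [0.0] * pad + list(signal) + [0.0] * pad
    return [
        sum(kernel[j] * padded[i + j * dilation] for j in range(k))
        for i in range(len(signal))
    ]


def receptive_field(kernel_size, dilation):
    """Number of input positions covered by one dilated kernel."""
    return (kernel_size - 1) * dilation + 1


def max_pool1d(signal, size):
    """Non-overlapping max pooling; shortens the signal by a factor of `size`."""
    return [max(signal[i:i + size])
            for i in range(0, len(signal) - size + 1, size)]


def channel_attention(channels):
    """Simplified channel gate (assumption): sigmoid of each channel's mean.

    A learned channel attention module would insert trainable layers
    between the global average and the sigmoid.
    """
    means = [sum(c) / len(c) for c in channels]
    return [1.0 / (1.0 + math.exp(-m)) for m in means]


def fuse_branches(signal, kernel, dilations):
    """Run one dilated branch per rate, then fuse them with normalized gates."""
    branches = [dilated_conv1d(signal, kernel, d) for d in dilations]
    weights = channel_attention(branches)
    total = sum(weights)
    fused = [
        sum(w * b[i] for w, b in zip(weights, branches)) / total
        for i in range(len(signal))
    ]
    return fused, weights


signal = [float(i) for i in range(16)]
kernel = [1.0 / 3.0] * 3                 # hypothetical averaging kernel
fused, weights = fuse_branches(signal, kernel, [1, 2, 4])

# Dilation widens the receptive field (3, 5, 9) while the length stays 16;
# pooling with size 2 halves the length to 8.
print([receptive_field(3, d) for d in (1, 2, 4)])
print(len(fused), len(max_pool1d(signal, 2)))
```

In this sketch the fused output keeps the length of the input, whereas the pooled signal is shorter, which is the contrast between dilated convolution and pooling that motivates the design.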
Result We conducted a series of experiments with the proposed method on the KITTI dataset. The degree of occlusion increases across the ‘Easy’, ‘Moderate’ and ‘Hard’ levels shown in the tables. On the Car class in the test set, the two reported metrics are 83.45%, 74.29% and 67.92%, and 89.61%, 87.05% and 79.69%, respectively. On the Pedestrian class, they are 52.23%, 44.91% and 41.64%, and 59.73%, 53.97% and 49.62%. On the Cyclist class, they are 65.02%, 54.38% and 47.97%, and 69.13%, 59.69% and 52.11%. We also performed ablation experiments on the test set. On the Car class, relative to the full proposed method, the average metric value falls by about 6.09% after removing the pyramid structure, by about 0.99% after removing the mixed domain attention structure, and by about 0.71% after removing the dilated convolution structure. Conclusion For the 3D object detection task, we propose a two-stage 3D object detection network combining dilated convolution and mixed domain attention. The experimental results show that the proposed method outperforms several existing state-of-the-art 3D object detection methods.