贾迪1, 蔡鹏1, 吴思2, 王骞1, 宋慧伦1(1.辽宁工程技术大学;2.国网葫芦岛供电公司)
目的 近年来，采用神经网络完成立体匹配任务已成为计算机视觉领域的研究热点，目前现有方法存在弱纹理目标缺乏全局表征的问题，为此本文提出一种基于Transformer架构的密集特征提取网络。方法 首先，采用空间池化窗口策略使得Transformer层可以在维持线性计算复杂度的同时，捕获广泛的上下文表示，弥补局部弱纹理导致的特征匮乏问题。其次，通过卷积与转置卷积实现重叠式块嵌入，使得所有特征点都尽可能多地捕捉邻近特征，便于细粒度匹配。再次，将跳跃查询策略应用于编码器和解码器间的特征融合部分，实现高效的信息传递。最后，针对立体像对存在的遮挡情况，对固定区域内的匹配概率进行截断求和，输出更为合理的遮挡置信度。结果 在Scene Flow数据集上进行了消融实验，实验结果表明，本文给出的网络获得了0.33的绝对像素距离、0.92%的异常像素占比和98%的遮挡预测交并比。为了验证模型在实际路况场景下的有效性，在KITTI-2015数据集上进行了补充对比实验，本文方法获得了1.78的平均异常值百分比，上述指标均优于STTR等主流方法。此外，在KITTI-2015、MPI-Sintel和Middlebury-2014数据集的测试中，本文模型表现出较强的泛化性。结论 本文提出了一个纯粹的基于Transformer架构的密集特征提取器，使用空间池化窗口策略减小注意力计算的空间规模，并利用跳跃查询策略对编码器和解码器的特征进行了有效融合，可以较好地提高Transformer架构下的特征提取性能。
Feature Extraction Network for Stereo Matching of Weak Texture Objects
Jia Di1, Cai Peng1, Wu Si2, Wang Qian1, Song Huilun1(1.Liaoning Technical University;2.State Grid Huludao Electric Power Supply Company)
Objective In recent years, the use of neural networks for stereo matching tasks has become a research hotspot in computer vision. However, existing methods lack global representations for weakly textured objects. To address this issue, this paper proposes a dense feature extraction network based on the Transformer architecture. Methods First, a spatial pooling window strategy is employed to enable the Transformer layers to capture broad contextual representations while maintaining linear computational complexity, alleviating the feature scarcity caused by locally weak textures. Second, overlapping patch embedding is realized through convolution and transposed convolution so that every feature point captures as many neighboring features as possible, facilitating fine-grained matching. Third, a skip-query strategy is applied to the feature fusion between the encoder and decoder to propagate information efficiently. Finally, for stereo image pairs with occlusions, the matching probabilities within a fixed region are truncated and summed to output a more reasonable occlusion confidence. Results Ablation experiments were conducted on the Scene Flow dataset, and the results show that the proposed network achieves an absolute pixel distance of 0.33, an outlier pixel ratio of 0.92%, and an occlusion prediction intersection-over-union (IoU) of 98%. To validate the model's effectiveness in real-world driving scenarios, additional comparative experiments were conducted on the KITTI-2015 dataset, where the proposed method achieved an average outlier percentage of 1.78, outperforming mainstream methods such as STTR. Moreover, tests on the KITTI-2015, MPI-Sintel, and Middlebury-2014 datasets demonstrate that the proposed model generalizes well. Conclusion This paper presents a purely Transformer-based dense feature extractor.
It employs a spatial pooling window strategy to reduce the spatial scale of attention computation and fuses encoder and decoder features effectively with a skip-query strategy, improving feature extraction performance within the Transformer architecture.
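The computational benefit of the spatial pooling window strategy can be illustrated with a minimal, framework-free sketch: queries attend to average-pooled keys and values, so attention cost drops from O(N²) to O(N·N/w) for window size w. The function names and the use of simple average pooling here are illustrative assumptions, not the paper's exact formulation.

```python
import math

def avg_pool(seq, window):
    """Average-pool a list of feature vectors along the spatial axis."""
    pooled = []
    for i in range(0, len(seq), window):
        chunk = seq[i:i + window]
        dim = len(chunk[0])
        pooled.append([sum(v[d] for v in chunk) / len(chunk) for d in range(dim)])
    return pooled

def pooled_attention(queries, keys, values, window):
    """Each query attends to pooled keys/values: O(N * N/window) instead of O(N^2)."""
    pk = avg_pool(keys, window)
    pv = avg_pool(values, window)
    out = []
    for q in queries:
        # Scaled dot-product scores against the *pooled* keys.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(len(q)) for k in pk]
        # Numerically stable softmax.
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        weights = [e / z for e in exps]
        # Convex combination of pooled values.
        out.append([sum(w * v[d] for w, v in zip(weights, pv)) for d in range(len(pv[0]))])
    return out
```

With N = 8 tokens and a window of 4, each query scores only 2 pooled positions instead of 8, while the output still has one vector per original position.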
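The occlusion handling described in the Methods, truncating and summing matching probabilities within a fixed region, can be sketched as follows. This is a simplified one-pixel illustration under the assumption that a probability mass concentrated near the best match indicates a reliable correspondence, while a flat distribution indicates likely occlusion; the window size and thresholding are hypothetical.

```python
def occlusion_confidence(match_probs, window=2):
    """Truncated sum of matching probabilities around the argmax position.

    match_probs: matching probability distribution over candidate disparities
    for one pixel (non-negative, sums to ~1). A low truncated sum suggests
    no reliable match exists, i.e. the pixel is likely occluded.
    """
    best = max(range(len(match_probs)), key=lambda i: match_probs[i])
    lo = max(0, best - window)
    hi = min(len(match_probs), best + window + 1)
    return sum(match_probs[lo:hi])
```

A sharply peaked distribution yields a truncated sum near 1 (confidently matched), while a near-uniform distribution yields a small sum (likely occluded).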