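Lyu Niqi, Song Guanghua, Yang Bowei (Institute of Aerospace Information Technology, School of Aeronautics and Astronautics, Zhejiang University, Hangzhou 310027)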
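Objective In micro aerial vehicle systems, acquiring scene information in real time is key to autonomous obstacle avoidance and navigation. This paper proposes ADCC-TSGM, a texture-optimized semi-global stereo matching algorithm that fuses the center average Census feature with the absolute difference (AD) feature, and accelerates it in parallel with the Compute Unified Device Architecture (CUDA). Method Texture information is computed with a one-dimensional difference along the epipolar line, the center average Census and AD features are used for cost computation, and a texture-optimized SGM algorithm aggregates the costs to produce an initial disparity map. A left-right consistency check then removes unstable and occluded points from this coarse disparity map, and linear interpolation and median filtering fill the resulting holes. Finally, exploiting the characteristics of the GPU, the cost computation, semi-global matching (SGM) computation, and disparity computation steps are optimized with shared memory, single instruction multiple data (SIMD) instructions, and a hybrid pipeline to increase running speed. Result On the Middlebury stereo test set at Quarter Video Graphics Array (QVGA) resolution, the total bad-pixel rate of ADCC-TSGM is 36.1% lower than that of Semi-Global Block Matching (SGBM) and 28.3% lower than that of SGM; its average error rate is 44.5% lower than that of SGBM and 49.9% lower than that of SGM. The GPU acceleration experiments run on the NVIDIA Jetson TK1 embedded computing platform. With matching quality unchanged, CUDA parallel acceleration yields a speedup of more than 117 times; even compared with SGBM that has already been optimized with SIMD and multi-core parallelism, running time is reduced by 85%. At QVGA resolution, the GPU-accelerated implementation reaches 31.8 frames/s. Conclusion The proposed algorithm and its CUDA acceleration give embedded platforms an effective way to obtain high-quality depth information in real time, and can serve as a basic step in environmental perception, visual positioning, and map construction for devices such as micro aerial vehicles and small robots.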
Semi-global stereo matching algorithm based on feature fusion and its CUDA implementation
Lyu Niqi, Song Guanghua, Yang Bowei (Institute of Aerospace Information Technology, School of Aeronautics and Astronautics, Zhejiang University, Hangzhou 310027, China)
Objective In micro aerial vehicle systems, estimating scene information in real time is a key issue for automatic obstacle avoidance and navigation. A binocular stereo vision system is an effective means of obtaining scene information; it mimics the working principle of human eyes by using two cameras to capture the same scene at the same time and generates a disparity map with a stereo matching algorithm. In this work, we propose ADCC-TSGM, a novel texture-optimized semi-global stereo matching algorithm based on the fusion of the absolute difference (AD) feature and the center average census feature. The algorithm is accelerated in parallel with CUDA. Method First, a one-dimensional difference method is used to calculate texture information along the epipolar line, the center average census feature and the AD feature are used for cost computation, and a texture-optimized semi-global matching (SGM) algorithm aggregates the cost and produces the initial disparity. Second, a left-right consistency check is used to detect unstable and occluded pixels, and linear interpolation and median filtering are used to fill the holes in the disparity map. Lastly, to improve running speed, we optimize the GPU code for each step of the stereo matching. In feature calculations such as the center average census, memory access takes far more time than computation, and adjacent threads carry out a large number of data-intensive computations. Consequently, we divide the data needed by an entire thread block into four regions, copy them into shared memory, and compute from shared memory to reduce memory access overhead. A single thread can handle two consecutive disparity calculations simultaneously by using SIMD instructions. While the GPU is processing, the CPU is largely idle.
Therefore, a hybrid pipeline is designed to fully utilize the computing resources of the embedded platform. Result To demonstrate the effectiveness of the proposed algorithm, we use the NVIDIA Jetson TK1 developer kit, which has a quad-core ARM Cortex-A15 CPU, a Kepler GPU with 192 CUDA cores, and 2 GB of memory, as the embedded computing platform, and conduct experiments on the Middlebury stereo datasets resized to QVGA resolution. Given the intended application scenarios and the image resolution, the maximum disparity for each algorithm is set to 64, and the block matching window size of SGBM and BM is set to 9×9. The texture penalty coefficients ε1 and ε2 in the proposed algorithm are set to 0.25 and 0.125, respectively. Experimental results show that the total bad-pixel rate and the average error rate of the proposed algorithm are significantly lower than those of BM, SGBM, and SGM. The total bad-pixel rate of the ADCC-TSGM algorithm is 73.9% lower than that of the BM algorithm, 36.1% lower than that of the SGBM algorithm, and 28.3% lower than that of the SGM algorithm. The average error rate of the proposed algorithm is 83.2% lower than that of the BM algorithm, 44.5% lower than that of the SGBM algorithm, and 49.9% lower than that of the SGM algorithm. In particular, using the center average census in feature matching reduces both the bad-pixel rate and the error rate. The texture-based optimization adaptively increases the penalty coefficient in low-texture regions and reduces the average error rate from 6.62 to 4.84. The post-processing, which includes the disparity consistency check and hole filling, reduces the total bad-pixel rate from 14.46 to 7.12. With GPU parallel acceleration, the CUDA implementation of the proposed algorithm is more than 117 times faster than the pure CPU implementation, without any loss in the quality of the disparity map.
Even compared with SGBM, which is already optimized with SIMD and multi-core parallelism, the proposed algorithm reduces running time by 85%. At QVGA resolution, the frame rate reaches 31.8 FPS. Conclusion The proposed algorithm outperforms existing algorithms that are used in industry, such as BM, SGM, and SGBM. Its CUDA-accelerated implementation provides an effective and feasible way to obtain high-quality disparity information in real time and can serve as a basic step in environmental perception, visual positioning, and map construction for real-time embedded applications, such as micro aerial vehicle systems.
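To make the cost computation and the left-right consistency check described in the Method more concrete, the following is a minimal pure-Python sketch, not the authors' CUDA implementation. Several details are assumptions, because the abstract does not specify them: the census reference value is taken to be the mean of the 3×3 window, the two features are fused as a weighted sum with hypothetical weights `w_census` and `w_ad`, the check uses a hypothetical tolerance `tol`, and the texture-optimized SGM aggregation step is omitted entirely (disparities are picked directly by winner-takes-all). All function names are illustrative.

```python
def census_mean(img, r=1):
    """Census code per pixel, comparing each window pixel with the window mean.

    Assumption: the 'center average' reference is the mean of the
    (2r+1)x(2r+1) window; the paper may define it differently.
    """
    h, w = len(img), len(img[0])
    codes = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Clamp coordinates at the border so every window has the same size.
            win = [img[min(max(y + dy, 0), h - 1)][min(max(x + dx, 0), w - 1)]
                   for dy in range(-r, r + 1) for dx in range(-r, r + 1)]
            mean = sum(win) / len(win)
            bits = 0
            for v in win:
                bits = (bits << 1) | (1 if v > mean else 0)
            codes[y][x] = bits
    return codes


def cost_volume(left, right, max_disp, w_census=1.0, w_ad=0.1):
    """cost[y][x][d]: weighted sum of census Hamming distance and AD.

    The weighted-sum fusion and both weights are illustrative assumptions.
    """
    cl, cr = census_mean(left), census_mean(right)
    h, w = len(left), len(left[0])
    cost = [[[0.0] * max_disp for _ in range(w)] for _ in range(h)]
    for y in range(h):
        for x in range(w):
            for d in range(max_disp):
                xr = max(x - d, 0)  # matching column in the right image, clamped
                hamming = bin(cl[y][x] ^ cr[y][xr]).count("1")
                ad = abs(left[y][x] - right[y][xr])
                cost[y][x][d] = w_census * hamming + w_ad * ad
    return cost


def disparities(cost):
    """Winner-takes-all disparity maps for the left and the right view."""
    h, w, nd = len(cost), len(cost[0]), len(cost[0][0])
    disp_l = [[min(range(nd), key=lambda d: cost[y][x][d]) for x in range(w)]
              for y in range(h)]
    # Right pixel x matches left pixel x + d, so reuse the same cost volume.
    disp_r = [[min((d for d in range(nd) if x + d < w),
                   key=lambda d: cost[y][x + d][d]) for x in range(w)]
              for y in range(h)]
    return disp_l, disp_r


def lr_check(disp_l, disp_r, tol=1, invalid=-1):
    """Keep a left disparity only if the right view agrees within tol."""
    checked = []
    for row_l, row_r in zip(disp_l, disp_r):
        row = []
        for x, d in enumerate(row_l):
            xr = x - d
            ok = xr >= 0 and abs(d - row_r[xr]) <= tol
            row.append(d if ok else invalid)
        checked.append(row)
    return checked


def synthetic_pair(h=5, w=12, shift=2):
    """Textured right image and a left image shifted by `shift` pixels."""
    right = [[(37 * x + 17 * y) % 101 for x in range(w)] for y in range(h)]
    left = [[right[y][max(x - shift, 0)] for x in range(w)] for y in range(h)]
    return left, right
```

On the pair returned by `synthetic_pair()`, calling `cost_volume(left, right, max_disp=4)`, then `disparities`, then `lr_check` recovers the two-pixel shift for interior pixels. Pixels marked `invalid` are the holes that the paper fills by linear interpolation and median filtering, which this sketch does not implement.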