Visual place recognition with fusion event cameras
Vol. 29, Issue 4, Pages 1018-1029 (2024)
Received: 25 January 2023; Revised: 18 April 2023; Published: 16 April 2024
DOI: 10.11834/jig.230003
Objective
The performance of traditional visual place recognition (VPR) algorithms depends on the imaging quality of optical images, so the image degradation caused by high-speed and high-dynamic-range scenes further degrades recognition performance. To address this problem, this paper proposes a VPR algorithm that fuses an event camera, exploiting the low latency and high dynamic range of event cameras to improve recognition performance in extreme scenarios such as high-speed and high-dynamic-range scenes.
Method
The proposed method first extracts features from the good-quality reference images with an image feature extraction module, then uses a multimodal feature fusion module to extract fused features from the query image and the events within its exposure interval, and finally retrieves the reference image most similar to the query image through feature matching.
Result
Experiments on the MVSEC (multi-vehicle stereo event camera) and RobotCar datasets show that the proposed method has clear advantages over existing VPR algorithms in high-speed and high-dynamic-range scenes. In such scenes, it improves recall and precision over the best competing algorithm by 5.39% and 8.55% on MVSEC, and by 3.36% and 4.41% on RobotCar.
Conclusion
This paper proposes a VPR algorithm that fuses an event camera, exploiting the imaging advantages of event cameras in high-speed and high-dynamic-range scenes to effectively improve VPR performance in such scenarios.
Objective
The performance of traditional visual place recognition (VPR) algorithms depends on the imaging quality of optical images. However, optical cameras suffer from low temporal resolution and limited dynamic range. For example, in a scene with high-speed motion, an optical camera cannot continuously capture the rapid changes of scene position on the imaging plane, resulting in motion blur in the output image. When the scene brightness exceeds the recording range of the camera's photosensitive chip, the output image may be underexposed or overexposed. Blurring, underexposure, and overexposure all destroy image texture and structure information, which degrades the performance of VPR algorithms. Therefore, the recognition performance of image-based VPR algorithms is poor in high-speed and high dynamic range (HDR) scenarios. The event camera is a new type of bio-inspired visual sensor characterized by low latency and HDR, and using event cameras can effectively improve the recognition performance of VPR algorithms in high-speed and HDR scenes. This paper therefore proposes a VPR algorithm fused with event cameras, which exploits the low latency and HDR characteristics of event cameras to improve recognition performance in extreme scenarios such as high speed and HDR.
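As background, an event camera outputs an asynchronous stream of per-pixel brightness changes, each event a (x, y, t, polarity) tuple, rather than frames. A minimal sketch of accumulating such a stream into a two-channel count image, a common way to feed events into a convolutional network (the function and array layout are illustrative, not the paper's actual pipeline):

```python
import numpy as np

def events_to_frame(events, height, width):
    """Accumulate asynchronous events into a 2-channel count frame.

    `events` is an (N, 4) array of (x, y, t, polarity) tuples, the raw
    output format of event cameras such as those used in MVSEC.
    Channel 0 counts positive-polarity events, channel 1 negative ones.
    """
    frame = np.zeros((2, height, width), dtype=np.float32)
    for x, y, t, p in events:
        channel = 0 if p > 0 else 1
        frame[channel, int(y), int(x)] += 1.0
    return frame

# Toy example: three events on a 4x4 sensor.
evts = np.array([[0, 0, 0.01, 1], [1, 2, 0.02, -1], [0, 0, 0.03, 1]])
f = events_to_frame(evts, 4, 4)
# f[0, 0, 0] == 2.0: two positive events fell on pixel (0, 0)
```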
Method
The proposed method first fuses the information of the query image and the events within its exposure interval to obtain a multimodal feature of the query location, and then retrieves, from the reference image database, the reference image whose feature is closest to that multimodal feature. Specifically, an image feature extraction module extracts features from the good-quality reference images, while a multimodal feature fusion module takes the query image and the events within its exposure interval and produces a fused multimodal feature that can be compared with the reference features. Feature matching then retrieves the reference image most similar to the query image, completing visual place recognition. Network training is supervised by a triplet loss, which drives the network to reduce the vector distance between the query and positive features and to enlarge the distance between the query and negative features, until the negative distance exceeds the positive distance by at least a similarity margin constant. As a result, reference images whose fields of view are similar to or different from the query image can be distinguished by their similarity in the feature vector space, completing the VPR task.
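The retrieval step and the triplet supervision described above can be sketched as follows; the L2 feature distance and the margin value are assumptions for illustration, not the paper's exact settings:

```python
import numpy as np

def retrieve(query_feat, ref_feats):
    """Return the index of the reference feature closest to the query (L2)."""
    dists = np.linalg.norm(ref_feats - query_feat, axis=1)
    return int(np.argmin(dists))

def triplet_loss(query, positive, negative, margin=0.5):
    """Hinge-style triplet loss: drives the query-negative distance to
    exceed the query-positive distance by at least `margin`
    (the margin constant here is illustrative)."""
    d_pos = np.linalg.norm(query - positive)
    d_neg = np.linalg.norm(query - negative)
    return max(d_pos - d_neg + margin, 0.0)

# The loss vanishes once the negative is sufficiently farther than the positive.
q = np.array([0.0, 0.0])
loss = triplet_loss(q, np.array([0.0, 1.0]), np.array([3.0, 0.0]))  # 0.0
```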
Result
The experiments are conducted on the MVSEC and RobotCar datasets. The proposed method is compared with an image-based method, an event-based method, and methods that use both image and event information. Under different exposure and high-speed conditions, the proposed method outperforms existing visual place recognition algorithms. Specifically, on the MVSEC dataset, the proposed method reaches a maximum recall of 99.36% and a maximum precision of 96.34%, improving recall and precision by 5.39% and 8.55%, respectively, over existing VPR methods. On the RobotCar dataset, it reaches a maximum recall of 97.33% and a maximum precision of 93.30%, improving recall and precision by 3.36% and 4.41%, respectively. These results show that in high-speed and HDR scenes, the proposed method has clear advantages over existing VPR algorithms and delivers a remarkable improvement in recognition performance.
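For concreteness, a simplified sketch of how recall at top-1 is commonly measured in VPR evaluations: a retrieval counts as correct when the top-ranked reference lies within some distance tolerance of the query's ground-truth position. The tolerance and protocol here are generic assumptions, not the paper's exact setup:

```python
import numpy as np

def recall_at_1(query_feats, ref_feats, query_pos, ref_pos, tol=5.0):
    """Fraction of queries whose nearest reference in feature space
    lies within `tol` meters of the query's ground-truth position."""
    correct = 0
    for feat, pos in zip(query_feats, query_pos):
        # Top-1 retrieval by L2 distance in feature space.
        idx = np.argmin(np.linalg.norm(ref_feats - feat, axis=1))
        # Correct if the matched reference is geographically close.
        if np.linalg.norm(ref_pos[idx] - pos) <= tol:
            correct += 1
    return correct / len(query_feats)
```

In this scheme a stronger feature extractor raises recall because more queries retrieve a geographically correct reference at rank 1.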
Conclusion
This paper proposes a VPR algorithm that fuses event cameras, exploiting their low latency and HDR characteristics to overcome the loss of image information in high-speed and HDR scenes. The method effectively fuses information from the image and event modalities, thereby improving VPR performance in high-speed and HDR scenarios.
Arandjelovic R, Gronat P, Torii A, Pajdla T and Sivic J. 2016. NetVLAD: CNN architecture for weakly supervised place recognition // Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, USA: IEEE: 5297-5307 [DOI: 10.1109/cvpr.2016.572]
Bottou L. 2010. Large-scale machine learning with stochastic gradient descent // Proceedings of the 19th International Conference on Computational Statistics. Paris, France: Springer: 177-186 [DOI: 10.1007/978-3-7908-2604-3_16]
Campos C, Elvira R, Rodríguez J J G, Montiel J M M and Tardós J D. 2021. ORB-SLAM3: an accurate open-source library for visual, visual-inertial, and multimap SLAM. IEEE Transactions on Robotics, 37(6): 1874-1890 [DOI: 10.1109/tro.2021.3075644]
Chen X Y, Liu Y H, Zhang Z W, Qiao Y and Dong C. 2021. HDRUNet: single image HDR reconstruction with denoising and dequantization // Proceedings of 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops. Nashville, USA: IEEE: 354-363 [DOI: 10.1109/cvprw53098.2021.00045]
Cho S J, Ji S W, Hong J P, Jung S W and Ko S J. 2021. Rethinking coarse-to-fine approach in single image deblurring // Proceedings of 2021 IEEE/CVF International Conference on Computer Vision. Montreal, Canada: IEEE: 4621-4630 [DOI: 10.1109/iccv48922.2021.00460]
Fischer T and Milford M. 2020. Event-based visual place recognition with ensembles of temporal windows. IEEE Robotics and Automation Letters, 5(4): 6924-6931 [DOI: 10.1109/lra.2020.3025505]
Fu J Y, Yu L, Yang W and Lu X. 2023. Event-based continuous optical flow estimation. Acta Automatica Sinica, 49(9): 1845-1856 [DOI: 10.16383/j.aas.c210242]
Gallego G, Delbruck T, Orchard G, Bartolozzi C, Taba B, Censi A, Leutenegger S, Davison A J, Conradt J, Daniilidis K and Scaramuzza D. 2022. Event-based vision: a survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(1): 154-180 [DOI: 10.1109/TPAMI.2020.3008413]
Galvez-López D and Tardos J D. 2012. Bags of binary words for fast place recognition in image sequences. IEEE Transactions on Robotics, 28(5): 1188-1197 [DOI: 10.1109/tro.2012.2197158]
Gehrig D, Gehrig M, Hidalgo-Carrió J and Scaramuzza D. 2020. Video to events: recycling video datasets for event cameras // Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, USA: IEEE: 3583-3592 [DOI: 10.1109/cvpr42600.2020.00364]
He K M, Zhang X Y, Ren S Q and Sun J. 2016. Deep residual learning for image recognition // Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, USA: IEEE: 770-778 [DOI: 10.1109/cvpr.2016.90]
Huang Z W, Zhang T Y, Heng W, Shi B X and Zhou S C. 2022. Real-time intermediate flow estimation for video frame interpolation // Proceedings of the 17th European Conference on Computer Vision. Tel-Aviv, Israel: Springer: 624-642 [DOI: 10.1007/978-3-031-19781-9_36]
Jegou H, Douze M, Schmid C and Pérez P. 2010. Aggregating local descriptors into a compact image representation // Proceedings of 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. San Francisco, USA: IEEE: 3304-3311 [DOI: 10.1109/cvpr.2010.5540039]
Kong D L, Fang Z, Hou K X, Li H J, Jiang J J, Coleman S and Kerr D. 2022. Event-VPR: end-to-end weakly supervised deep network architecture for visual place recognition using event-based vision sensor. IEEE Transactions on Instrumentation and Measurement, 71: #5011418 [DOI: 10.1109/tim.2022.3168892]
Lee A J and Kim A. 2021. EventVLAD: visual place recognition with reconstructed edges from event cameras // Proceedings of 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems. Prague, Czech Republic: IEEE: 2247-2252 [DOI: 10.1109/iros51168.2021.9635907]
Liu Y L, Lai W S, Chen Y S, Kao Y L, Yang M H, Chuang Y Y and Huang J B. 2020. Single-image HDR reconstruction by learning to reverse the camera pipeline // Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, USA: IEEE: 1648-1657 [DOI: 10.1109/cvpr42600.2020.00172]
Lowry S, Sünderhauf N, Newman P, Leonard J J, Cox D, Corke P and Milford M J. 2016. Visual place recognition: a survey. IEEE Transactions on Robotics, 32(1): 1-19 [DOI: 10.1109/TRO.2015.2496823]
Maddern W, Pascoe G, Linegar C and Newman P. 2017. 1 year, 1000 km: the Oxford RobotCar dataset. The International Journal of Robotics Research, 36(1): 3-15 [DOI: 10.1177/0278364916679498]
Maqueda A I, Loquercio A, Gallego G, Garcia N and Scaramuzza D. 2018. Event-based vision meets deep learning on steering prediction for self-driving cars // Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 5419-5427 [DOI: 10.1109/cvpr.2018.00568]
Milford M J and Wyeth G F. 2012. SeqSLAM: visual route-based navigation for sunny summer days and stormy winter nights // Proceedings of 2012 IEEE International Conference on Robotics and Automation. Saint Paul, USA: IEEE: 1643-1649 [DOI: 10.1109/icra.2012.6224623]
Lowe D G. 2004. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2): 91-110 [DOI: 10.1023/B:VISI.0000029664.99615.94]
Rebecq H, Ranftl R, Koltun V and Scaramuzza D. 2021. High speed and high dynamic range video with an event camera. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(6): 1964-1980 [DOI: 10.1109/tpami.2019.2963386]
Rublee E, Rabaud V, Konolige K and Bradski G. 2011. ORB: an efficient alternative to SIFT or SURF // Proceedings of 2011 International Conference on Computer Vision. Barcelona, Spain: IEEE: 2564-2571 [DOI: 10.1109/iccv.2011.6126544]
Saputra M R U, Markham A and Trigoni N. 2018. Visual SLAM and structure from motion in dynamic environments: a survey. ACM Computing Surveys, 51(2): #37 [DOI: 10.1145/3177853]
Scheerlinck C, Rebecq H, Gehrig D, Barnes N, Mahony R E and Scaramuzza D. 2020. Fast image reconstruction with an event camera // Proceedings of 2020 IEEE Winter Conference on Applications of Computer Vision. Snowmass, USA: IEEE: 156-163 [DOI: 10.1109/wacv45572.2020.9093366]
Schroff F, Kalenichenko D and Philbin J. 2015. FaceNet: a unified embedding for face recognition and clustering // Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston, USA: IEEE: 815-823 [DOI: 10.1109/cvpr.2015.7298682]
Shang W, Ren D W, Zou D Q, Ren J S, Luo P and Zuo W M. 2021. Bringing events into video deblurring with non-consecutively blurry frames // Proceedings of 2021 IEEE/CVF International Conference on Computer Vision. Montreal, Canada: IEEE: 4511-4520 [DOI: 10.1109/iccv48922.2021.00449]
Torii A, Arandjelović R, Sivic J, Okutomi M and Pajdla T. 2015. 24/7 place recognition by view synthesis // Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston, USA: IEEE: 1808-1817 [DOI: 10.1109/cvpr.2015.7298790]
Wang B S, He J W, Yu L, Xia G S and Yang W. 2020. Event enhanced high-quality image recovery // Proceedings of the 16th European Conference on Computer Vision. Glasgow, UK: Springer: 155-171 [DOI: 10.1007/978-3-030-58601-0_10]
Woo S, Park J, Lee J Y and Kweon I S. 2018. CBAM: convolutional block attention module // Proceedings of the 15th European Conference on Computer Vision. Munich, Germany: Springer: 3-19 [DOI: 10.1007/978-3-030-01234-2_1]
Yu L, Liao W, Zhou Y L, Yang W and Xia G S. 2023. Event camera based synthetic aperture imaging. Acta Automatica Sinica, 49(7): 1393-1406 [DOI: 10.16383/j.aas.c200388]
Zhang X and Yu L. 2022. Unifying motion deblurring and frame interpolation with events // Proceedings of 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New Orleans, USA: IEEE: 17744-17753 [DOI: 10.1109/cvpr52688.2022.01724]
Zhu A Z, Thakur D, Ozaslan T, Pfrommer B, Kumar V and Daniilidis K. 2018. The multivehicle stereo event camera dataset: an event camera dataset for 3D perception. IEEE Robotics and Automation Letters, 3(3): 2032-2039 [DOI: 10.1109/lra.2018.2800793]