A survey of cross-view geo-localization methods based on deep learning

Zhou Bowen, Li Yang, Ma Xinji, Miao Zhuang, Zhang Rui (Army Engineering University of PLA)

Abstract

Cross-view geo-localization aims to estimate the geographic location of a target by matching images taken from different viewpoints. It is usually treated as an image retrieval task, a formulation widely adopted in artificial intelligence tasks such as person re-identification, vehicle re-identification, and image registration. The main challenge is the drastic change in appearance between viewpoints, which reduces retrieval performance. Conventional techniques rely on hand-crafted features, which limits localization precision. With the development of deep learning, deep learning-based methods have become the mainstream approach. However, because the task involves multiple steps and draws on extensive knowledge transferred from related fields, the field still lacks a dedicated literature review. In this paper, we present the first review of cross-view geo-localization methods based on deep learning and give a comprehensive overview of the current state of the art. We analyze developments in four areas: data preprocessing, deep learning networks, feature attention modules, and loss functions. Data preprocessing covers feature alignment, sampling strategies, and data augmentation. Feature alignment supplies prior knowledge that improves localization accuracy, and the use of generative adversarial networks (GANs) has emerged as a prominent trend for it. In addition, the imbalance in the number of satellite, ground, and drone images calls for effective sampling strategies and data augmentation to balance training.
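The retrieval formulation described above can be illustrated with a minimal sketch. It assumes that a trained network has already mapped each image to an embedding vector; the embeddings, identifiers, and coordinates below are made up for illustration, and the helper names `cosine_similarity` and `localize` are hypothetical rather than taken from any surveyed method.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def localize(query_embedding, gallery):
    """Rank geo-tagged gallery images from most to least similar to the query."""
    return sorted(
        gallery,
        key=lambda item: cosine_similarity(query_embedding, item["embedding"]),
        reverse=True,
    )

# Toy gallery of geo-tagged satellite embeddings (all values are illustrative).
gallery = [
    {"id": "sat_a", "latlon": (32.06, 118.79), "embedding": [0.9, 0.1, 0.0]},
    {"id": "sat_b", "latlon": (31.23, 121.47), "embedding": [0.1, 0.9, 0.2]},
    {"id": "sat_c", "latlon": (39.90, 116.40), "embedding": [0.0, 0.2, 0.9]},
]

# Embedding of a ground-view or drone-view query image (illustrative).
query = [0.2, 0.8, 0.1]

# The location of the best-matching gallery image is taken as the estimate.
best = localize(query, gallery)[0]
print(best["id"], best["latlon"])
```

In this toy setup the query is assigned the coordinates of the gallery image whose embedding is closest to it. The difficulty the abstract describes lies in learning embeddings for which this nearest-neighbor lookup remains reliable across viewpoints.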
Deep learning networks extract image features, and their performance directly affects the accuracy of cross-view geo-localization. In general, methods with a Transformer backbone achieve higher accuracy than those with a ResNet backbone, and methods based on ConvNeXt perform best among all approaches. Feature attention modules are designed to extract further image features and strengthen the discriminative power of the model. By learning effective attention mechanisms, these modules adaptively weight the input images or feature maps so that the model focuses on task-relevant regions or features. Experimental results show that they can uncover previously unattended feature information. Loss functions help the model fit the data and converge faster; the loss value guides the training direction of the whole network, enabling the model to learn better representations and further improving accuracy. Commonly used loss functions include contrastive loss, triplet loss, and three other types. As loss functions have improved, sample comparison during training has evolved from one-to-one to one-to-many, allowing the model to cover all samples during training and further improving its performance. Based on an analysis of nearly one hundred influential publications, this paper summarizes the characteristics of cross-view geo-localization methods and the ideas behind their improvements, which can inspire researchers to design new methods. In addition, this paper tests 10 deep learning-based cross-view geo-localization methods on two representative datasets.
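The two loss functions named above can be sketched in their generic textbook forms. This is a minimal illustration, not the exact formulation used by any surveyed method; the margin values and the toy 2-D embeddings are assumptions chosen for readability.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def contrastive_loss(x1, x2, same_location, margin=1.0):
    """Pull a matching pair together; push a non-matching pair at least `margin` apart."""
    d = euclidean(x1, x2)
    if same_location:
        return d ** 2
    return max(0.0, margin - d) ** 2

def triplet_loss(anchor, positive, negative, margin=0.3):
    """Penalize the anchor unless the negative is farther than the positive by `margin`."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)

# Toy 2-D embeddings (illustrative values only).
ground = [0.0, 0.0]      # anchor: ground-view query
sat_match = [0.0, 1.0]   # satellite image of the same place
sat_other = [3.0, 4.0]   # satellite image of a different place, far away
sat_hard = [0.0, 1.2]    # satellite image of a different place, nearly as close as the match

print(triplet_loss(ground, sat_match, sat_other))  # easy negative: no penalty
print(triplet_loss(ground, sat_match, sat_hard))   # hard negative: positive penalty
```

Both losses compare one sample with one or two others at a time. The one-to-many variants mentioned above extend the same idea so that an anchor is compared with many negatives in a single training step.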
The evaluation records the backbone network type and input image size of each method. On the University-1652 dataset, two accuracy metrics (R@1 and AP), the number of model parameters, and inference speed are evaluated. On the CVUSA dataset, four accuracy metrics (R@1, R@5, R@10, and R@Top1) are evaluated. The experimental results show that a higher-performing backbone network and a larger input image size both have a positive impact on model performance. Finally, building on this review of the current state of the art, we discuss the remaining challenges and suggest several directions for future research on cross-view geo-localization.
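For readers unfamiliar with the retrieval metrics named above, the following sketch computes R@K and AP for a single query. It uses the standard non-interpolated definition of average precision and assumes the ranking covers the whole gallery; official benchmark code may use a slightly different variant. The function names and identifiers are hypothetical.

```python
def recall_at_k(ranked_ids, true_ids, k):
    """Return 1.0 if any true match appears among the top-k results, else 0.0."""
    return 1.0 if any(r in true_ids for r in ranked_ids[:k]) else 0.0

def average_precision(ranked_ids, true_ids):
    """Average the precision measured at each rank where a true match is retrieved."""
    hits = 0
    total = 0.0
    for rank, r in enumerate(ranked_ids, start=1):
        if r in true_ids:
            hits += 1
            total += hits / rank
    return total / len(true_ids) if true_ids else 0.0

# Gallery ids sorted by decreasing similarity to one query (illustrative).
ranked = ["s7", "s2", "s9", "s4"]
# Ids of the gallery images that show the same location as the query.
true_matches = {"s2", "s4"}

print(recall_at_k(ranked, true_matches, 1))
print(recall_at_k(ranked, true_matches, 2))
print(average_precision(ranked, true_matches))
```

Dataset-level scores are obtained by averaging these per-query values over all queries.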