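Shi Zhenghao1, Li Chengjian1, Zhou Liang1, Zhang Zhijun1, Wu Chenwei1, You Zhenzhen1, Ren Wenqi2 (1. School of Computer Science and Engineering, Xi'an University of Technology, Xi'an 710048; 2. School of Cyber Science and Technology, Sun Yat-sen University, Shenzhen 518107)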

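Abstract
Image classification is the foundation of image understanding and plays an important role in practical applications of computer vision. However, because of the diversity of object shapes and types and the complexity of imaging environments, many image classification methods still give unsatisfactory results in practice, with problems such as low classification accuracy and high false-positive rates, which seriously limits their use in subsequent image and computer vision tasks. Improving the precision and accuracy of image classification through subsequent algorithms is therefore of great research significance and has attracted increasing attention. With the rapid development of deep learning and its wide application and excellent performance in image processing, research on deep-learning-based image classification has made great progress. To study existing methods more comprehensively and keep pace with the latest research, this paper systematically reviews and summarizes Transformer-driven deep learning methods and models for image classification. Unlike existing surveys on similar topics, this paper focuses on methods and models driven by Transformer variants, including Transformer image classification methods based on scalable positional encoding, Transformer image classification methods with low complexity and low computational cost, Transformer image classification methods that fuse local and global information, and image classification methods based on deep ViT (visual Transformer) models. Existing methods are analyzed in depth along multiple dimensions and levels, such as design ideas, structural characteristics, and open problems. To better compare the methods, their classification performance is evaluated experimentally on public image classification datasets such as ImageNet, CIFAR-10 (Canadian Institute for Advanced Research), and CIFAR-100, using evaluation metrics including accuracy, number of parameters, floating point operations (FLOPs), overall accuracy (OA), average accuracy (AA), and the Kappa (κ) coefficient. Finally, future research directions are discussed.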
Survey on Transformer for image classification

Shi Zhenghao1, Li Chengjian1, Zhou Liang1, Zhang Zhijun1, Wu Chenwei1, You Zhenzhen1, Ren Wenqi2 (1. School of Computer Science and Engineering, Xi'an University of Technology, Xi'an 710048, China; 2. School of Cyber Science and Technology, Sun Yat-sen University, Shenzhen 518107, China)

Image classification is an important research direction in image processing and computer vision. It aims to identify the category of the object in an image and has important practical value. However, because of the diversity of object shapes and types and the complexity of imaging environments, the results of existing methods are often unsatisfactory. Problems such as low classification accuracy and high false-positive rates seriously limit the use of image classification in subsequent image and computer vision tasks. Therefore, improving image classification accuracy through postprocessing algorithms is highly desirable. Given the wide application of deep learning techniques, such as deep convolutional neural networks and generative adversarial networks, in natural image object detection, the application of deep learning to image classification has received great attention and has become a research hotspot in image processing and computer vision in recent years, and many excellent works have emerged. As a rising star, the visual Transformer (ViT) has attracted increasing interest in image processing tasks, particularly because of its strong ability to model long-range dependencies and to process sequences in parallel. Several technical reviews of the Transformer have recently been published. They systematically summarize ViT and its variants from different angles and introduce applications of the Transformer in different visual tasks, which helps researchers study and track progress in image classification. Compared with the traditional convolutional neural network (CNN), ViT achieves global modeling and parallel processing of an image by dividing the input image into patches, which greatly improves the classification ability of the model. However, because of the complexity of image classification problems and the diverse development of ViT techniques, many problems remain, such as poor scalability, high computational overhead, slow convergence, and attention collapse. ViT variants have been proposed to address these problems in image processing tasks. Moreover, very few reviews help scholars comprehensively understand the latest progress of ViT for image processing tasks from a global perspective. Therefore, building on a thorough study of the latest reviews and related research, the present study systematically compares and summarizes ViT algorithms for image classification to help scholars understand and grasp the latest progress in ViT-based image classification research. Unlike existing review papers, our work focuses on methods published at home and abroad in the past 2 years (between January 2021 and December 31, 2022). We begin by describing the basic concept, principle, and structure of the traditional Transformer model for easy understanding. First, we introduce the attention mechanism and the multihead attention mechanism. Then, the feed-forward neural network and positional encoding are described. Finally, the model structure of the traditional Transformer is presented. Afterward, the evolution of the Transformer model and its applications in image processing in recent years are outlined. Then, the concept, principle, and structure of ViT are briefly introduced. Various vision Transformer models and their applications in image classification are described in detail according to the problems faced by ViT. Different solutions, including scalable positional encoding, low complexity and low computational cost, local and global information fusion, and deep ViT models, are described one by one. Experiments on ImageNet, CIFAR-10 (Canadian Institute for Advanced Research), and CIFAR-100 are provided, and many evaluations are presented to demonstrate the classification performance of ViT and its variants. Two indicators, namely, accuracy and number of parameters, are adopted to evaluate the experimental results. The number of floating point operations (FLOPs) is also used to analyze model performance comprehensively. Given that the Transformer has also been widely used in remote sensing image classification in recent years, the present study compares and analyzes Transformer-based remote sensing image classification methods. The experiments are performed on the hyperspectral image datasets Indian Pines, Trento, and Salinas to evaluate the Transformer for remote sensing image classification. Three indicators, namely, overall accuracy (OA), average accuracy (AA), and the Kappa coefficient, are employed in this work. Finally, the problems and challenges faced by the current application of ViT in image classification are presented, and future research and development trends are discussed.
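
The three remote sensing indicators named in the abstract (OA, AA, and the Kappa coefficient) can all be derived from a confusion matrix. The following is a minimal sketch using the standard textbook definitions, not the authors' evaluation code. The function name `classification_metrics` and the example matrix are hypothetical and chosen only for illustration.

```python
def classification_metrics(confusion):
    """Return (OA, AA, kappa) for a square confusion matrix.

    confusion[i][j] is the number of samples whose true class is i
    and whose predicted class is j. (Hypothetical helper for illustration.)
    """
    n_classes = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(n_classes))

    # Overall accuracy (OA): fraction of all samples classified correctly.
    oa = correct / total

    # Average accuracy (AA): mean of the per-class recalls.
    # Classes without any true samples are skipped to avoid division by zero.
    row_totals = [sum(row) for row in confusion]
    recalls = [confusion[i][i] / row_totals[i]
               for i in range(n_classes) if row_totals[i] > 0]
    aa = sum(recalls) / len(recalls)

    # Cohen's kappa: observed agreement (OA) corrected for the agreement
    # p_e expected by chance from the row and column marginals.
    # Assumes p_e < 1, i.e. the data are not all in one class and one prediction.
    col_totals = [sum(confusion[i][j] for i in range(n_classes))
                  for j in range(n_classes)]
    p_e = sum(r * c for r, c in zip(row_totals, col_totals)) / (total * total)
    kappa = (oa - p_e) / (1 - p_e)
    return oa, aa, kappa


# Hypothetical two-class example with imbalanced classes.
oa, aa, kappa = classification_metrics([[90, 10], [5, 5]])
print(f"OA={oa:.4f}  AA={aa:.4f}  kappa={kappa:.4f}")
```

On imbalanced data such as this example, OA is dominated by the majority class, whereas AA weights every class equally and Kappa discounts chance agreement, which is one reason the three indicators are commonly reported together.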