Orthogonality-constrained multihead self-attention for scene text recognition

Xu Shicheng, Zhu Ziqi (School of Computer Science and Technology, Wuhan University of Science and Technology, Wuhan 430065, China)

Abstract
Objective Scene text recognition (STR) is a popular research area in computer vision. Recently, a vision Transformer (ViT) model based on the multihead self-attention mechanism was proposed for STR to balance accuracy, speed, and computational load. However, no mechanism guarantees that different self-attention heads actually capture diverse features, which causes ViT models that rely on multihead self-attention to perform poorly on the highly diverse scene text recognition task. To address this problem, a novel orthogonality constraint is proposed to explicitly enhance the diversity among multiple self-attention heads and improve the ability of multihead self-attention to capture information in different subspaces, further improving the accuracy of the network while maintaining speed and computational efficiency. Method First, an orthogonality constraint on the Q (query), K (key), and V (value) features of different self-attention heads is proposed, which allows different self-attention heads to attend to features in different query, key, and value subspaces; attending to different subspaces explicitly enables different self-attention heads to capture more distinct features. An orthogonality constraint on the linear transformation weights of the Q, K, and V features of different self-attention heads is also proposed, which provides an orthogonal weight space for learning the Q, K, and V features and introduces an implicit regularization effect during network training. Result Experiments compare the proposed method with the baseline on seven datasets: accuracy is improved by 0.5% on the regular dataset Street View Text (SVT) and by 1.1% on the irregular dataset CUTE80 (CT), and the overall accuracy on the seven public datasets is improved by 0.5%. Conclusion The proposed plug-and-play orthogonality constraints improve the feature-capturing ability of the multihead self-attention mechanism in STR tasks and raise the recognition accuracy of the ViT model on STR. The code is publicly available: https://github.com/lexiaoyuan/XViTSTR.
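To make the feature-level constraint concrete, the following is a minimal PyTorch sketch of how an inter-head orthogonality penalty on the Q, K, and V features could be added to the training loss. It is an illustration under assumed tensor shapes and an assumed weighting factor `lam`, not the released XViTSTR implementation.

```python
# Minimal sketch (assumptions: per-head features of shape (batch, heads, seq_len, head_dim),
# a hypothetical regularization weight `lam`); not the authors' exact code.
import torch


def head_orthogonality_penalty(x: torch.Tensor) -> torch.Tensor:
    """Mean absolute cosine similarity between the features of different heads.

    x: (batch, heads, seq_len, head_dim) per-head features (Q, K, or V).
    Returns a scalar that is 0 when all head pairs are orthogonal.
    """
    b, h, n, d = x.shape
    # Flatten each head's features and normalize so the Gram matrix holds cosine similarities.
    flat = torch.nn.functional.normalize(x.reshape(b, h, n * d), dim=-1)
    gram = flat @ flat.transpose(1, 2)                                   # (b, h, h)
    # Keep only the off-diagonal entries, i.e., similarities between different heads.
    off_diag = gram - torch.diag_embed(torch.diagonal(gram, dim1=1, dim2=2))
    return off_diag.abs().mean()


def feature_orthogonality_loss(q, k, v, lam: float = 0.1) -> torch.Tensor:
    """Regularization term added to the recognition loss; `lam` is a hypothetical weight."""
    return lam * (head_orthogonality_penalty(q)
                  + head_orthogonality_penalty(k)
                  + head_orthogonality_penalty(v))
```

In this reading, the penalty is simply summed with the recognition loss before back-propagation, so non-orthogonal head features are discouraged during training.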
Keywords
Orthogonality-constrained multihead self-attention for scene text recognition

Xu Shicheng, Zhu Ziqi (School of Computer Science and Technology, Wuhan University of Science and Technology, Wuhan 430065, China)

Abstract
Objective Scene text recognition (STR) is a popular research field in computer vision that aims to recognize text information from natural scenes. STR is important in many tasks and applications, such as image search, robot navigation, license plate recognition, and autonomous driving. Most early STR models comprise a rectification network and a recognition network, while recent STR models usually comprise a convolutional neural network (CNN)-based feature encoder and a Transformer-based decoder, or a customized CNN module and a Transformer encoder-decoder. These STR models usually have a complex architecture, a large computational load, and high memory consumption. A vision Transformer (ViT)-based STR model called ViTSTR maintains a balance among accuracy, speed, and computational load. However, without data augmentation targeted at STR, the accuracy of ViTSTR still needs improvement. One reason for its low accuracy is that the naive use of multihead self-attention in ViT does not guarantee that different attention heads capture distinct features, especially the diverse features in complex scene text images. To address this problem, this paper studies the application of orthogonality constraints in ViT based on ViTSTR and proposes novel orthogonality constraints for the multihead self-attention mechanism in ViT, which explicitly encourage diversity among multiple self-attention heads, improve the ability of multihead self-attention to capture information in different subspaces, and further improve the accuracy of the network while maintaining speed and computational efficiency. Method The proposed orthogonality constraints comprise two parts, namely, the orthogonality constraints on the query (Q), key (K), and value (V) features of different self-attention heads and the orthogonality constraints on the linear transformation weights of Q, K, and V on different self-attention heads. As the input features of each attention head, Q, K, and V play important roles in the self-attention mechanism. Orthogonality of the features on different attention heads explicitly encourages diversity among the features of multiple attention heads. The orthogonality constraints on the Q, K, and V features allow different self-attention heads to focus on features in different query, key, and value subspaces, hence explicitly enabling different self-attention heads to capture distinct features and guiding the ViT model toward better performance on highly diverse scene text recognition tasks. Specifically, after the Q, K, and V features of each head are normalized, the orthogonality of the Q, K, and V features between different heads is calculated and added to the loss function as a regularization term. The orthogonality of the Q, K, and V features between different heads is then penalized by back-propagation, which acts as a constraint on the orthogonality of the corresponding features. Adding orthogonality constraints to the linear transformation weights of the Q, K, and V features on different self-attention heads provides an orthogonal weight space for the learning of these features, hence triggering implicit regularization in network training and making full use of the feature and weight spaces of multihead self-attention. A similar approach is used for the Q, K, and V weights: the orthogonality of the Q, K, and V weight spaces is constrained by penalizing the orthogonality of the corresponding weights.
The feature and weight orthogonality constraints produce improvements when used individually or in combination. Result Experiment results show that, compared to the benchmark method, the overall accuracy of the proposed method on the test datasets is improved by 0.447% when the feature orthogonality constraint is added and by 0.364% when the weight orthogonality constraint is added. When both the feature and weight orthogonality constraints are added, the overall accuracy is improved by 0.513%. We then compare the orthogonality changes in different ablation experiments, including the changes in the orthogonality of the Q, K, and V features and weights among different self-attention heads. The proposed orthogonality constraints lead to a significant improvement in the corresponding orthogonality. In addition, the feature orthogonality constraint favors the orthogonality of the weights, and the weight orthogonality constraint favors the orthogonality of the features, but these effects are small. We also produce attention maps of the model with the added orthogonality constraints and of the baseline model on the CUTE80 (CT) dataset. These attention maps show that the model with the orthogonality constraints focuses on more information in the attention region than the baseline model, which helps produce correct recognition results. We also compare the performance of our method with that of previous competitive methods on several popular benchmarks. Our proposed method shows improvements on both regular and irregular datasets. Compared with the baseline, the accuracy of the proposed method is improved by 0.5% on the regular datasets IIIT5K-words (IIIT), Street View Text (SVT), and ICDAR2003 (IC03) (860); by 0.5% and 0.8% on the irregular datasets ICDAR2015 (IC15) (1811) and IC15 (2077), respectively; by 0.8% on SVT Perspective (SVTP); and by 1.1% on CT. In sum, the proposed method achieves an overall accuracy improvement of 0.5%. Conclusion This paper proposes a novel orthogonality constraint for the multihead self-attention mechanism that explicitly encourages this mechanism to capture diverse subspace information. The Q, K, and V feature orthogonality constraints improve the ability of the multihead self-attention mechanism to capture the feature space information of the input sequence, and the Q, K, and V weight orthogonality constraints provide orthogonal weight space exploration for the learning of the features. Experiment results validate the effectiveness of the proposed plug-and-play orthogonality constraints in STR tasks, especially in improving the accuracy of the ViT model in irregular text recognition tasks. The code is publicly available: https://github.com/lexiaoyuan/XViTSTR.
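As a complement to the feature-level sketch above, the sketch below illustrates how the weight-level constraint might be computed on the fused QKV projection weight of a ViT attention block. The fused-weight layout, the `num_heads` argument, and the head-splitting scheme are assumptions made for illustration; the authors' released code should be consulted for the exact formulation.

```python
# Minimal sketch of a per-head Q/K/V *weight* orthogonality penalty, assuming a
# timm-style fused QKV linear layer with weight shape (3 * embed_dim, embed_dim).
# The splitting into per-head projection matrices below is an illustrative assumption.
import torch


def weight_orthogonality_penalty(qkv_weight: torch.Tensor, num_heads: int) -> torch.Tensor:
    """Penalize non-orthogonality between the per-head Q, K, and V projection weights."""
    three_dim, embed_dim = qkv_weight.shape
    head_dim = embed_dim // num_heads
    # Split into per-head projection matrices: (3, num_heads, head_dim, embed_dim).
    per_head = qkv_weight.reshape(3, num_heads, head_dim, embed_dim)
    penalty = qkv_weight.new_zeros(())
    for proj in per_head:                                    # loop over the Q, K, V weights
        flat = torch.nn.functional.normalize(proj.reshape(num_heads, -1), dim=-1)
        gram = flat @ flat.t()                               # (num_heads, num_heads)
        # Diagonal entries are 1 after normalization; keep only cross-head similarities.
        off_diag = gram - torch.eye(num_heads, device=gram.device, dtype=gram.dtype)
        penalty = penalty + off_diag.abs().mean()
    return penalty
```

Summed with the recognition loss (optionally with its own weighting factor), this term provides the orthogonal weight-space pressure described in the abstract without changing the attention computation itself.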
Keywords
