Objective Scene text recognition (STR) is a popular research area in computer vision. Recently, Vision Transformer (ViT) models based on multihead self-attention have been proposed for STR to balance accuracy, speed, and computational load. However, no mechanism guarantees that the different self-attention heads actually capture diverse features, which causes ViT models with multihead self-attention to underperform on highly diverse scene text recognition tasks. To address this problem, this paper proposes a novel orthogonality constraint that explicitly enhances the diversity among multiple self-attention heads, improving the ability of multihead self-attention to capture information from different subspaces and further raising network accuracy while preserving speed and computational efficiency. Method We first propose orthogonality constraints on the Query (Q), Key (K), and Value (V) features of different self-attention heads, which enable different heads to attend to features in different query, key, and value subspaces; attending to different subspaces explicitly drives the heads to capture more distinct features. We also propose orthogonality constraints on the linear transformation weights of the Q, K, and V features across heads, which provide an orthogonal weight-space solution for learning the Q, K, and V features and introduce an implicit regularization effect during training. Results Experiments compared the proposed method with the baseline on seven datasets: accuracy improves by 0.5% on the regular dataset Street View Text and by 1.1% on the irregular dataset CUTE80, and the overall accuracy across the seven public datasets improves by 0.5%. Conclusion The proposed plug-and-play orthogonality constraints improve the feature-capturing ability of multihead self-attention in STR and raise the recognition accuracy of ViT models on STR tasks.
Orthogonality-constrained multihead self-attention for scene text recognition
Xu Shicheng, Zhu Ziqi (School of Computer Science and Technology, Wuhan University of Science and Technology)
Objective Scene text recognition (STR) is a popular research field in computer vision that aims to recognize text in natural scenes. STR matters in many tasks and applications, such as image search, robot navigation, license plate recognition, and autonomous driving. Most early STR models comprise a rectification network and a recognition network. Recent STR models typically pair a convolutional neural network (CNN) based feature encoder with a Transformer-based decoder, or combine a customized CNN module with a Transformer encoder-decoder. These designs usually bring complex architectures, heavy computational load, and high memory consumption. A Vision Transformer (ViT) based STR model, ViTSTR, strikes a balance between accuracy, speed, and computational load. However, without data augmentation targeted at STR, ViTSTR still leaves room for improvement in accuracy. One reason for the lower accuracy is that the naive use of multihead self-attention in ViT does not guarantee that different attention heads actually capture distinct features, particularly the diverse features found in complex scene text images. To address this problem, this paper studies the application of orthogonality constraints in ViT, building on ViTSTR, and proposes novel orthogonality constraints for the multihead self-attention mechanism. These constraints explicitly encourage diversity among the self-attention heads, improve the ability of multihead self-attention to capture information in different subspaces, and further improve network accuracy while preserving speed and computational efficiency. Method The proposed orthogonality constraints consist of two parts: constraints on the Query (Q), Key (K), and Value (V) features of different self-attention heads, and constraints on the linear transformation weights that produce Q, K, and V on different heads.
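For context, the per-head Q, K, and V features that both constraints act on come from the standard multihead projection in ViT's self-attention. The following is a minimal numpy sketch with hypothetical shapes (`d_model`, `n_heads`, and the token count are illustrative choices, not values from the paper):

```python
import numpy as np

# Minimal sketch of the standard multihead Q/K/V projection in self-attention.
# All shapes here are hypothetical; the paper's constraints operate on the
# per-head features and per-head weight slices produced this way.
rng = np.random.default_rng(0)
d_model, n_heads, seq_len = 64, 4, 10
d_head = d_model // n_heads

x = rng.standard_normal((seq_len, d_model))      # one input token sequence
W_q = rng.standard_normal((d_model, d_model))    # linear transformation weights
W_k = rng.standard_normal((d_model, d_model))
W_v = rng.standard_normal((d_model, d_model))

def split_heads(z):
    # (seq_len, d_model) -> (n_heads, seq_len, d_head)
    return z.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)

Q, K, V = (split_heads(x @ W) for W in (W_q, W_k, W_v))
print(Q.shape)  # (4, 10, 16)
```

Each head thus owns one slice of the projected features and one column block of each weight matrix; the feature constraint compares the slices across heads, while the weight constraint compares the corresponding weight blocks.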
Q, K, and V play a central role in the self-attention mechanism as the input features of each attention head. Enforcing orthogonality between the features of different heads explicitly encourages those features to differ. The orthogonality constraints on the Q, K, and V features allow different self-attention heads to focus on different query, key, and value subspaces, which explicitly drives the heads to capture more distinct features and guides the ViT model toward higher performance on highly diverse scene text. Specifically, after normalizing the Q, K, and V features of each head, the orthogonality of the Q, K, and V features between different heads is computed, and the corresponding measures are added to the loss function as regularization terms. Backpropagation then penalizes the inter-head orthogonality measures of Q, K, and V, acting as a constraint on the corresponding features. Adding orthogonality constraints to the linear transformation weights of the Q, K, and V features on different heads provides an orthogonal weight-space solution for exploring the weight space while learning Q, K, and V; this brings an implicit regularization effect during training and exploits the feature and weight spaces of multihead self-attention more fully. A similar scheme constrains the Q, K, and V weight spaces by penalizing the orthogonality of the corresponding weights. The feature orthogonality constraint and the weight orthogonality constraint each yield improvements when used individually, and further improvements when combined.
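The normalize-then-penalize scheme described above can be sketched as a pairwise penalty over heads. This is a simplified illustration, not the paper's exact formulation: the choice of flattening each head's feature matrix and using squared cosine similarities between head pairs is an assumption of this sketch.

```python
import numpy as np

def head_orthogonality_penalty(feats):
    """Sum of squared pairwise cosine similarities between heads.

    feats: array of shape (n_heads, seq_len, d_head). Each head's
    feature matrix is flattened and L2-normalized, then all pairwise
    inner products between heads are penalized. Perfectly orthogonal
    heads incur zero penalty; identical heads incur the maximum.
    """
    h = feats.shape[0]
    flat = feats.reshape(h, -1)
    flat = flat / np.linalg.norm(flat, axis=1, keepdims=True)  # normalize each head
    gram = flat @ flat.T                # pairwise cosine similarities
    off_diag = gram - np.eye(h)         # drop each head's similarity with itself
    return float(np.sum(off_diag ** 2))

# Orthogonal one-hot "heads" give zero penalty ...
print(head_orthogonality_penalty(np.eye(4).reshape(4, 1, 4)))   # 0.0
# ... while identical heads are maximally penalized.
print(head_orthogonality_penalty(np.ones((2, 1, 3))))           # 2.0
```

In training, such a penalty would be added to the recognition loss once for the Q, K, and V features of each layer; the weight constraint reuses the same idea on the per-head slices of the Q, K, and V projection matrices instead of the features.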
Result The experimental results show that, compared with the benchmark method, the overall accuracy on the test datasets improves by 0.447% when the feature orthogonality constraint is added, by 0.364% when the weight orthogonality constraint is added, and by 0.513% when both constraints are added. We also compare how the orthogonality measures evolve across the ablation experiments, including the inter-head orthogonality of the Q, K, and V features and weights. As the figure shows, the proposed orthogonality constraints lead to a significant improvement in the corresponding orthogonality. The figure also shows that the feature orthogonality constraint mildly promotes weight orthogonality, and the weight orthogonality constraint mildly promotes feature orthogonality, but these cross effects are small. We further visualize the attention maps of the constrained model and the baseline model on the CT dataset. The attention maps show that the model with orthogonality constraints attends to more of the relevant information in the attention region than the baseline does, which helps it recognize the correct result. We also compare with previous competitive methods on several popular benchmarks. The proposed method improves on both regular and irregular datasets. Compared with the baseline, accuracy on the regular datasets IIIT, SVT, and IC03(860) improves by 0.5% each; on the irregular datasets IC15(1811) and IC15(2077) by 0.5% and 0.6%, respectively; on SVTP by 0.8%; and on CT by 1.1%. The overall accuracy improves by 0.5%.
Conclusion This paper proposes a novel orthogonality constraint for the multihead self-attention mechanism that explicitly encourages the heads to capture more diverse subspace information. The Q, K, and V feature orthogonality constraints improve the ability of multihead self-attention to capture the feature-space information of the input sequence, and the Q, K, and V weight orthogonality constraints provide orthogonal weight-space exploration for feature learning. The experiments demonstrate the effectiveness of the proposed plug-and-play orthogonality constraints on STR tasks, in particular improving the accuracy of the ViT model on irregular text recognition.