Orthogonality-constrained multihead self-attention for scene text recognition
2023, Vol. 28, No. 12, Pages: 3855-3869
Print publication date: 2023-12-16
DOI: 10.11834/jig.221049
Xu Shicheng, Zhu Ziqi. 2023. Orthogonality-constrained multihead self-attention for scene text recognition. Journal of Image and Graphics, 28(12): 3855-3869
Objective
Scene text recognition (STR) is a popular research area in computer vision. Recently, vision Transformer (ViT) models based on the multihead self-attention mechanism have been proposed for STR to balance accuracy, speed, and computational load. However, no mechanism guarantees that different self-attention heads actually capture diverse features, so ViT models that rely on multihead self-attention may perform poorly on the highly diverse scene text recognition task. To address this problem, a novel orthogonality constraint is proposed to explicitly enhance the diversity among multiple self-attention heads and improve the ability of multihead self-attention to capture information in different subspaces, further improving the accuracy of the network while maintaining speed and computational efficiency.
Method
First, an orthogonality constraint is proposed for the query (Q), key (K), and value (V) features on different self-attention heads. It allows different self-attention heads to attend to features in different query, key, and value subspaces, which explicitly drives the heads to capture more distinct features. An orthogonality constraint is also proposed for the linear transformation weights of the Q, K, and V features on different self-attention heads, which provides an orthogonal weight space for learning the Q, K, and V features and brings an implicit regularization effect to network training.
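The feature-level constraint can be pictured as a penalty on the pairwise similarity of per-head features. The following is a minimal sketch of such a penalty in PyTorch; it is not the authors' released implementation, and the (B, H, N, D) feature layout, the cosine-similarity formulation, and the function name are illustrative assumptions.

```python
# Minimal sketch of an inter-head feature orthogonality penalty (assumed form,
# not the released XViTSTR code). x holds per-head Q, K, or V features.
import torch
import torch.nn.functional as F


def feature_orthogonality_penalty(x: torch.Tensor) -> torch.Tensor:
    """x: per-head Q, K, or V features, shape (B, H, N, D)."""
    b, h, n, d = x.shape
    flat = x.reshape(b, h, n * d)                  # one feature vector per head
    flat = F.normalize(flat, dim=-1)               # normalize each head's features
    gram = flat @ flat.transpose(1, 2)             # (B, H, H) cosine similarities
    off_diag = gram - torch.eye(h, device=x.device)
    return off_diag.pow(2).sum(dim=(1, 2)).mean()  # penalize non-orthogonal head pairs
```

In training, such a penalty would typically be computed for Q, K, and V in each Transformer block and added to the recognition loss with a weighting coefficient; the exact coefficient and normalization are not specified here.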
Result
Experiments compare the proposed method with the baseline on seven datasets. Accuracy improves by 0.5% on the regular dataset Street View Text (SVT) and by 1.1% on the irregular dataset CUTE80 (CT), and the overall accuracy on the seven public datasets improves by 0.5%.
Conclusion
The proposed plug-and-play orthogonality constraints improve the feature-capturing ability of the multihead self-attention mechanism in STR tasks and raise the recognition accuracy of the ViT model on STR. The code is publicly available at https://github.com/lexiaoyuan/XViTSTR.
Objective
Scene text recognition (STR) is an active research area in computer vision that aims to recognize text information in natural scenes. STR matters in many tasks and applications, such as image search, robot navigation, license plate recognition, and autonomous driving. Most early STR models comprise a rectification network and a recognition network, while recent STR models usually comprise a convolutional neural network (CNN)-based feature encoder and a Transformer-based decoder, or a customized CNN module and a Transformer encoder-decoder. These models usually have complex architectures, large computational loads, and large memory consumption. A vision Transformer (ViT)-based STR model called ViTSTR maintains a balance among accuracy, speed, and computational load. However, without data augmentation targeted at STR, the accuracy of ViTSTR leaves room for improvement. One reason is that the naive use of multihead self-attention in ViT does not guarantee that different attention heads capture distinct features, especially the diverse features found in complex scene text images. To address this problem, this paper studies the application of orthogonality constraints in ViT based on ViTSTR and proposes novel orthogonality constraints for the multihead self-attention mechanism. These constraints explicitly encourage diversity among multiple self-attention heads, improve the ability of multihead self-attention to capture information in different subspaces, and further improve the accuracy of the network while ensuring speed and computational efficiency.
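For context, the per-head Q, K, and V features discussed in the following method come from a standard ViT-style multihead self-attention block, sketched here in PyTorch for illustration only; the embedding size and head count are assumptions rather than the paper's exact configuration, and nothing in this generic formulation forces the heads to learn different features.

```python
# A generic multihead self-attention block (illustrative, not the paper's model).
import torch
import torch.nn as nn


class MultiHeadSelfAttention(nn.Module):
    def __init__(self, dim: int = 192, heads: int = 3):
        super().__init__()
        self.heads, self.head_dim = heads, dim // heads
        self.qkv = nn.Linear(dim, dim * 3)    # joint projection to Q, K, V
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, n, c = x.shape
        qkv = self.qkv(x).reshape(b, n, 3, self.heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)           # each: (B, H, N, D)
        attn = (q @ k.transpose(-2, -1)) * self.head_dim ** -0.5
        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, n, c)
        return self.proj(out)                          # heads are not forced to differ
```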
Method
The proposed orthogonality constraints comprise two parts, namely, the orthogonality constraints for the query (Q), key (K), and value (V) features on different self-attention heads and the orthogonality constraints for the linear transformation weights of Q, K, and V on different self-attention heads. Q, K, and V play important roles in the self-attention mechanism as input features of the attention head, and the orthogonality of features on different attention heads explicitly encourages diversity among the heads. The orthogonality constraints for the Q, K, and V features allow different self-attention heads to focus on features in different query, key, and value subspaces, hence explicitly enabling different self-attention heads to capture distinct features and guiding the ViT model toward better performance in text recognition for highly diverse scenes. Specifically, after the Q, K, and V features of each head are normalized, the orthogonality of the Q, K, and V features between different heads is calculated and added to the loss function as a regularization term. The lack of orthogonality among the Q, K, and V features of different heads is then penalized through back-propagation, which constrains the corresponding features toward orthogonality. Adding orthogonality constraints to the linear transformation weights of the Q, K, and V features on different self-attention heads provides an orthogonal weight space in the learning process of these features, triggering implicit regularization in network training and fully utilizing the feature and weight spaces of multihead self-attention. A similar approach is used for the Q, K, and V weights: the orthogonality of the corresponding weight spaces is constrained by penalizing the deviation of the weights from orthogonality. The feature and weight orthogonality constraints produce improvements when used individually or in combination.
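As a companion to the feature-level sketch above, the weight-level constraint can be illustrated as follows: the Q, K, or V projection matrix is split row-wise into its per-head blocks, and the pairwise similarity of those blocks is penalized. This is a hedged sketch under the same assumptions, not the released code; the helper name and the commented loss combination are illustrative.

```python
# Minimal sketch of a per-head weight orthogonality penalty (assumed form).
import torch
import torch.nn.functional as F


def weight_orthogonality_penalty(w: torch.Tensor, heads: int) -> torch.Tensor:
    """w: projection weight for Q, K, or V, shape (dim, dim); rows are split by head."""
    per_head = w.reshape(heads, -1)                # one flattened weight block per head
    per_head = F.normalize(per_head, dim=-1)
    gram = per_head @ per_head.t()                 # (H, H) cosine similarities
    off_diag = gram - torch.eye(heads, device=w.device)
    return off_diag.pow(2).sum()


# Illustrative use: both penalties are added to the recognition loss with
# assumed weighting coefficients, e.g.
# loss = ce_loss + 1e-3 * (feat_pen_q + feat_pen_k + feat_pen_v) \
#                + 1e-3 * (weight_orthogonality_penalty(w_q, heads) + ...)
```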
Result
Experiment results show that, compared with the benchmark method, the overall accuracy of the proposed method on the test datasets improves by 0.447% when the feature orthogonality constraint is added, by 0.364% when the weight orthogonality constraint is added, and by 0.513% when both constraints are added. We then compare the orthogonality changes in different ablation experiments, including the changes in the orthogonality of the Q, K, and V features and weights among different self-attention heads. The proposed orthogonality constraints lead to a significant improvement in the corresponding orthogonality. In addition, the feature orthogonality constraint favors the orthogonality of the weights and the weight orthogonality constraint favors the orthogonality of the features, but these effects are small. We also produce attention maps of the model with the added orthogonality constraints and of the baseline model on the CUTE80 (CT) dataset. These attention maps show that the model with the orthogonality constraints attends to more information in the attention region than the baseline model, which helps it recognize the correct results. We also compare the performance of our method with that of previous competitive methods on several popular benchmarks. The proposed method shows improvements on both regular and irregular datasets. Compared with the baseline, its accuracy improves by 0.5% on the regular datasets IIIT5K-words (IIIT), Street View Text (SVT), and ICDAR2003 (IC03) (860), by 0.5% and 0.8% on the irregular datasets ICDAR2015 (IC15) (1811) and IC15 (2077), respectively, by 0.8% on SVT-Perspective (SVTP), and by 1.1% on CT. In sum, the proposed method achieves an overall accuracy improvement of 0.5%.
Conclusion
This paper proposes a novel orthogonality constraint for the multihead self-attention mechanism that explicitly encourages this mechanism to capture diverse subspace information. The Q, K, and V feature orthogonality constraints improve the ability of multihead self-attention to capture the feature-space information of the input sequence, and the Q, K, and V weight orthogonality constraints provide orthogonal weight-space exploration for feature learning. Experiment results validate the effectiveness of the proposed plug-and-play orthogonality constraints in STR tasks, especially in improving the accuracy of the ViT model on irregular text recognition. The code is publicly available at https://github.com/lexiaoyuan/XViTSTR.
scene text recognition (STR); vision Transformer (ViT); multihead self-attention; orthogonality constraint; computer vision
Arjovsky M, Shah A and Bengio Y. 2016. Unitary evolution recurrent neural networks [EB/OL]. [2022-09-22]. https://arxiv.org/pdf/1511.06464.pdf
Atienza R. 2021. Vision Transformer for fast and efficient scene text recognition//Proceedings of the 16th International Conference on Document Analysis and Recognition. Lausanne, Switzerland: Springer: 319-334 [DOI: 10.1007/978-3-030-86549-8_21]
Baek J, Kim G, Lee J, Park S, Han D, Yun S, Oh S J and Lee H. 2019. What is wrong with scene text recognition model comparisons? Dataset and model analysis//Proceedings of 2019 IEEE/CVF International Conference on Computer Vision (ICCV). Seoul, Korea (South): IEEE: 4714-4722 [DOI: 10.1109/ICCV.2019.00481]
Bahdanau D, Cho K and Bengio Y. 2016. Neural machine translation by jointly learning to align and translate [EB/OL]. [2022-08-29]. https://arxiv.org/pdf/1409.0473.pdf
Bai F, Cheng Z Z, Niu Y, Pu S L and Zhou S G. 2018. Edit probability for scene text recognition [EB/OL]. [2022-08-12]. https://arxiv.org/pdf/1805.03384.pdf
Bansal N, Chen X H and Wang Z Y. 2018. Can we gain more from orthogonality regularizations in training deep CNNs?//Proceedings of the 32nd International Conference on Neural Information Processing Systems. Montréal, Canada: Curran Associates Inc.: 4266-4276
Bojarski M, Del Testa D, Dworakowski D, Firner B, Flepp B, Goyal P, Jackel L D, Monfort M, Muller U, Zhang J K, Zhang X, Zhao J K and Zieba K. 2016. End to end learning for self-driving cars [EB/OL]. [2022-08-11]. https://arxiv.org/pdf/1604.07316.pdf
Borisyuk F, Gordo A and Sivakumar V. 2018. Rosetta: large scale system for text detection and recognition in images//Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. London, UK: ACM: 71-79 [DOI: 10.1145/3219819.3219861]
Cheng Z Z, Bai F, Xu Y L, Zheng G, Pu S L and Zhou S G. 2017. Focusing attention: towards accurate text recognition in natural images//Proceedings of 2017 IEEE International Conference on Computer Vision (ICCV). Venice, Italy: IEEE: 5086-5094 [DOI: 10.1109/ICCV.2017.543]
Dorobantu V, Stromhaug P A and Renteria J. 2016. DizzyRNN: reparameterizing recurrent neural networks for norm-preserving backpropagation [EB/OL]. [2022-09-22]. https://arxiv.org/pdf/1612.04035.pdf
Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X H, Unterthiner T, Dehghani M, Minderer M, Heigold G, Gelly S, Uszkoreit J and Houlsby N. 2021. An image is worth 16 × 16 words: Transformers for image recognition at scale [EB/OL]. [2022-04-07]. https://arxiv.org/pdf/2010.11929.pdf
Ghiasi G, Lin T Y and Le Q V. 2018. DropBlock: a regularization method for convolutional networks [EB/OL]. [2022-08-12]. https://arxiv.org/pdf/1810.12890.pdf
Gupta A, Vedaldi A and Zisserman A. 2016. Synthetic data for text localisation in natural images [EB/OL]. [2022-08-12]. https://arxiv.org/pdf/1604.06646.pdf
He K M, Zhang X Y, Ren S Q and Sun J. 2015. Delving deep into rectifiers: surpassing human-level performance on ImageNet classification//Proceedings of 2015 IEEE International Conference on Computer Vision (ICCV). Santiago, Chile: IEEE: 1026-1034 [DOI: 10.1109/ICCV.2015.123]
He Y Z, Prabhavalkar R, Rao K, Li W, Bakhtin A and McGraw I. 2017. Streaming small-footprint keyword spotting using sequence-to-sequence models [EB/OL]. [2022-08-29]. https://arxiv.org/pdf/1710.09617.pdf
Hinton G E, Srivastava N, Krizhevsky A, Sutskever I and Salakhutdinov R R. 2012. Improving neural networks by preventing co-adaptation of feature detectors [EB/OL]. [2022-08-12]. https://arxiv.org/pdf/1207.0580.pdf
Huang G, Sun Y, Liu Z, Sedra D and Weinberger K Q. 2016. Deep networks with stochastic depth [EB/OL]. [2022-08-12]. https://arxiv.org/pdf/1603.09382.pdf
Huang L, Liu X L, Lang B, Yu A, Wang Y L and Li B. 2018. Orthogonal weight normalization: solution to optimization over multiple dependent Stiefel manifolds in deep neural networks. Proceedings of the AAAI Conference on Artificial Intelligence, 32(1): 3271-3278 [DOI: 10.1609/aaai.v32i1.11768]
Jaderberg M, Simonyan K, Vedaldi A and Zisserman A. 2014. Synthetic data and artificial neural networks for natural scene text recognition [EB/OL]. [2022-08-12]. https://arxiv.org/pdf/1406.2227.pdf
Karatzas D, Shafait F, Uchida S, Iwamura M, Bigorda L G i, Mestre S R, Mas J, Mota D F, Almazan J A and de las Heras L P. 2013. ICDAR 2013 robust reading competition//Proceedings of the 12th International Conference on Document Analysis and Recognition. Washington, USA: IEEE: 1484-1493 [DOI: 10.1109/ICDAR.2013.221]
Karatzas D, Gomez-Bigorda L, Nicolaou A, Ghosh S, Bagdanov A, Iwamura M, Matas J, Neumann L, Chandrasekhar V R, Lu S J, Shafait F, Uchida S and Valveny E. 2015. ICDAR 2015 competition on robust reading//Proceedings of the 13th International Conference on Document Analysis and Recognition (ICDAR). Tunis, Tunisia: IEEE: 1156-1160 [DOI: 10.1109/ICDAR.2015.7333942]
Kingma D P and Ba J L. 2017. Adam: a method for stochastic optimization [EB/OL]. [2022-08-12]. https://arxiv.org/pdf/1412.6980.pdf
Lee J, Park S, Baek J, Oh S J, Kim S and Lee H. 2020. On recognizing texts of arbitrary shapes with 2D self-attention//Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW). Seattle, USA: IEEE: 2326-2335 [DOI: 10.1109/CVPRW50498.2020.00281]
Lee M, Lee J, Jang H J, Kim B, Chang W and Hwang K. 2019. Orthogonality constrained multi-head attention for keyword spotting//Proceedings of 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). Singapore, Singapore: IEEE: 86-92 [DOI: 10.1109/ASRU46091.2019.9003738]
Li B C, Tang X, Qi X B, Chen Y H and Xiao R. 2020. Hamming OCR: a locality sensitive hashing neural network for scene text recognition [EB/OL]. [2022-09-26]. https://arxiv.org/pdf/2009.10874.pdf
Li J, Tu Z P, Yang B S, Lyu M R and Zhang T. 2018. Multi-head attention with disagreement regularization//Proceedings of 2018 Conference on Empirical Methods in Natural Language Processing. Brussels, Belgium: ACL: 2897-2903 [DOI: 10.18653/v1/D18-1317]
Li Y R, Su H, Shen X Y, Li W J, Cao Z Q and Niu S Z. 2017. DailyDialog: a manually labelled multi-turn dialogue dataset [EB/OL]. [2022-08-12]. https://arxiv.org/pdf/1710.03957.pdf
Liu C Y, Chen X X, Luo C J, Jin L W, Xue Y and Liu Y L. 2021. Deep learning methods for scene text detection and recognition. Journal of Image and Graphics, 26(6): 1330-1367 [DOI: 10.11834/jig.210044]
Loshchilov I and Hutter F. 2017. SGDR: stochastic gradient descent with warm restarts [EB/OL]. [2022-08-12]. https://arxiv.org/pdf/1608.03983.pdf
Lucas S M, Panaretos A, Sosa L, Tang A, Wong S, Young R, Ashida K, Nagai H, Okamoto M, Yamamoto H, Miyao H, Zhu J M, Ou W W, Wolf C, Jolion J M, Todoran L, Worring M and Lin X F. 2005. ICDAR 2003 robust reading competitions: entries, results, and future directions. International Journal of Document Analysis and Recognition (IJDAR), 7(2): 105-122 [DOI: 10.1007/s10032-004-0134-3]
Luo C J, Jin L W and Sun Z H. 2019. MORAN: a multi-object rectified attention network for scene text recognition [EB/OL]. [2022-08-11]. https://arxiv.org/pdf/1901.03003.pdf
Luo C J, Zhu Y Z, Jin L W and Wang Y P. 2020. Learn to augment: joint data augmentation and network optimization for text recognition//Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Seattle, USA: IEEE: 13743-13752 [DOI: 10.1109/CVPR42600.2020.01376]
Mhammedi Z, Hellicar A, Rahman A and Bailey J. 2017. Efficient orthogonal parametrisation of recurrent neural networks using Householder reflections [EB/OL]. [2022-09-22]. https://arxiv.org/pdf/1612.00188.pdf
Mishra A, Alahari K and Jawahar C V. 2012. Scene text recognition using higher order language priors//Proceedings of 2012 British Machine Vision Conference. Surrey, UK: BMVA: 1-11 [DOI: 10.5244/C.26.127]
Phan T Q, Shivakumara P, Tian S X and Tan C L. 2013. Recognizing text with perspective distortion in natural scenes//Proceedings of 2013 IEEE International Conference on Computer Vision. Sydney, Australia: IEEE: 569-576 [DOI: 10.1109/ICCV.2013.76]
Risnumawan A, Shivakumara P, Chan C S and Tan C L. 2014. A robust arbitrary text detection system for natural scene images. Expert Systems with Applications, 41(18): 8027-8048 [DOI: 10.1016/j.eswa.2014.07.008]
Rodríguez P, Gonzàlez J, Cucurull G, Gonfaus J M and Roca X. 2017. Regularizing CNNs with locally constrained decorrelations [EB/OL]. [2022-05-05]. https://arxiv.org/pdf/1611.01967.pdf
Schulz R, Talbot B, Lam O, Dayoub F, Corke P, Upcroft B and Wyeth G. 2015. Robot navigation using human cues: a robot navigation system for symbolic goal-directed exploration//Proceedings of 2015 IEEE International Conference on Robotics and Automation (ICRA). Seattle, USA: IEEE: 1100-1105 [DOI: 10.1109/ICRA.2015.7139313]
Shan C H, Zhang J B, Wang Y J and Xie L. 2018. Attention-based end-to-end models for small-footprint keyword spotting [EB/OL]. [2022-08-29]. https://arxiv.org/pdf/1803.10916.pdf
Sheng F F, Chen Z N and Xu B. 2019. NRTR: a no-recurrence sequence-to-sequence model for scene text recognition//Proceedings of 2019 International Conference on Document Analysis and Recognition (ICDAR). Sydney, Australia: IEEE: 781-786 [DOI: 10.1109/ICDAR.2019.00130]
Shi B G, Bai X and Yao C. 2017. An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(11): 2298-2304 [DOI: 10.1109/TPAMI.2016.2646371]
Shi B G, Yang M K, Wang X G, Lyu P, Yao C and Bai X. 2019. ASTER: an attentional scene text recognizer with flexible rectification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(9): 2035-2048 [DOI: 10.1109/TPAMI.2018.2848939]
Szegedy C, Vanhoucke V, Ioffe S, Shlens J and Wojna Z. 2015. Rethinking the inception architecture for computer vision [EB/OL]. [2022-08-29]. https://arxiv.org/pdf/1512.00567.pdf
Touvron H, Cord M, Douze M, Massa F, Sablayrolles A and Jégou H. 2021. Training data-efficient image Transformers and distillation through attention//Proceedings of the 38th International Conference on Machine Learning. PMLR: 10347-10357
Tran B H, Le-Cong T, Nguyen H M, Anh Le D, Nguyen T H and Le Nguyen P. 2020. SAFL: a self-attention scene text recognizer with focal loss//Proceedings of the 19th IEEE International Conference on Machine Learning and Applications (ICMLA). Miami, USA: IEEE: 1440-1445 [DOI: 10.1109/ICMLA51294.2020.00223]
Tsai S S, Chen H Z, Chen D, Schroth G, Grzeszczuk R and Girod B. 2011. Mobile visual search on printed documents using text and low bit-rate features//Proceedings of the 18th IEEE International Conference on Image Processing. Brussels, Belgium: IEEE: 2601-2604 [DOI: 10.1109/ICIP.2011.6116198]
Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez A N, Kaiser Ł and Polosukhin I. 2017. Attention is all you need//Proceedings of the 31st International Conference on Neural Information Processing Systems. Long Beach, USA: Curran Associates Inc.: 6000-6010
Vorontsov E, Trabelsi C, Kadoury S and Pal C. 2017. On orthogonality and learning recurrent networks with long term dependencies [EB/OL]. https://arxiv.org/pdf/1702.00071.pdf
Wang J F and Hu X L. 2017. Gated recurrent convolution neural network for OCR//Proceedings of the 31st International Conference on Neural Information Processing Systems. Long Beach, USA: Curran Associates Inc.: 334-343
Wang K, Babenko B and Belongie S. 2011. End-to-end scene text recognition//Proceedings of 2011 International Conference on Computer Vision. Barcelona, Spain: IEEE: 1457-1464 [DOI: 10.1109/ICCV.2011.6126402]
Wang X L, Man Z P, You M Y and Shen C H. 2017. Adversarial generation of training examples: applications to moving vehicle license plate recognition [EB/OL]. [2022-08-11]. https://arxiv.org/pdf/1707.03124.pdf
Xie D, Xiong J and Pu S L. 2017. All you need is beyond a good init: exploring better solution for training extremely deep convolutional neural networks with orthonormality and modulation//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Honolulu, USA: IEEE: 5075-5084 [DOI: 10.1109/CVPR.2017.539]
Yang L, Wang P, Li H, Li Z and Zhang Y N. 2020. A holistic representation guided attention network for scene text recognition. Neurocomputing, 414: 67-75 [DOI: 10.1016/j.neucom.2020.07.010]
Zhang A, Chan A, Tay Y, Fu J, Wang S H, Zhang S, Shao H J, Yao S C and Lee R K W. 2021. On orthogonality constraints for Transformers//Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers). Online: ACL: 375-382 [DOI: 10.18653/v1/2021.acl-short.48]
Zhang S Z, Dinan E, Urbanek J, Szlam A, Kiela D and Weston J. 2018. Personalizing dialogue agents: I have a dog, do you have pets too?//Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Melbourne, Australia: ACL: 2204-2213 [DOI: 10.18653/v1/P18-1205]
Zhang Y P, Nie S, Liu W J, Xu X, Zhang D X and Shen H T. 2019. Sequence-to-sequence domain adaptation network for robust text image recognition//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Long Beach, USA: IEEE: 2735-2744 [DOI: 10.1109/CVPR.2019.00285]