Scene text recognition method incorporating soft attention mask embedding

Chen Weida; Wang Linfei; Tao Dapeng

Published: 2024-05-20
DOI: 10.11834/jig.230081
2024 | Volume 29 | Number 5


Chen Weida, Wang Linfei, Tao Dapeng (School of Information, Yunnan University, Kunming 650500)

Abstract

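Objective End-to-end scene text recognition based on deep learning has made great progress. However, limited by multi-scale text, arbitrary shapes, and background interference, most end-to-end text recognizers still suffer from incomplete mask proposals, which degrades the model's recognition results. To improve the accuracy of mask prediction, we propose a soft attention mask embedding (SAME) module.
Method Exploiting the better global receptive field of the Transformer, the module encodes high-level features and computes soft attention, then embeds the encoded features with the predicted mask level by level to generate masks that lie closer to the text boundary and suppress background noise. Building on SAME's strong ability to refine text masks and extract fine-grained text features, we further propose a robust text recognition framework, SAME-Net, which performs accurate end-to-end text recognition without character-level annotations. Specifically, because soft attention is differentiable, SAME-Net can propagate the recognition loss back to the detection branch and guide text detection by learning the attention weights, so that the detection branch is jointly optimized by the detection and recognition objectives.
Result Experiments on several public text recognition datasets demonstrate the effectiveness of the proposed method. On the arbitrarily shaped text dataset Total-Text, SAME-Net achieves an H-mean of 84.02%; compared with GLASS (global to local attention for scene-text spotting) from 2022, the recognition accuracy with the full lexicon improves by 1.02% without additional training data. On the multi-oriented dataset ICDAR 2015 (International Conference on Document Analysis and Recognition), the method also achieves performance comparable to contemporaneous work, with a strong-lexicon recognition result of 83.4%.
Conclusion We propose an end-to-end text recognition method based on SAME. The method uses the global receptive field of the Transformer to generate masks close to the text boundary and suppress background noise. The proposed SAME module can propagate the recognition loss back to the detection module and requires no additional text rectification module. Through joint optimization of the detection and recognition modules, excellent text spotting performance is achieved without character-level annotations.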
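The mechanism summarized in the abstract (attention weights computed over encoded features, then used to refine a predicted mask) can be illustrated with a minimal sketch. This is not the authors' SAME module: the function names `soft_attention` and `embed_mask`, the scaled dot-product form of the attention, the toy features, and the rule "refined mask = attention-weighted average of the coarse mask" are all assumptions made purely for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def soft_attention(features):
    """Scaled dot-product self-attention weights (hypothetical stand-in).

    features: list of N feature vectors, one per spatial position.
    Returns an N x N matrix whose rows each sum to 1.
    """
    scale = math.sqrt(len(features[0]))
    attn = []
    for query in features:
        scores = [sum(q * k for q, k in zip(query, key)) / scale
                  for key in features]
        attn.append(softmax(scores))
    return attn

def embed_mask(attn, coarse_mask):
    """Refine a coarse mask: each position takes the attention-weighted
    average of the mask values at the positions it attends to."""
    return [sum(w * m for w, m in zip(row, coarse_mask)) for row in attn]

# Toy input: positions 0 and 1 look like text, positions 2 and 3 like background.
features = [[4.0, 0.0], [4.0, 0.0], [0.0, 4.0], [0.0, 4.0]]
# Incomplete coarse mask: text position 1 was missed by the detector.
coarse_mask = [1.0, 0.0, 0.0, 0.0]

attn = soft_attention(features)
refined = embed_mask(attn, coarse_mask)
print([round(v, 3) for v in refined])  # prints [0.5, 0.5, 0.0, 0.0]
```

In this toy case the missed text position rises from 0 to about 0.5 because it attends to the similar, correctly masked position, while the background stays near 0. Every operation is a smooth function of the features, which is the property that would let a recognition loss send gradients back through the attention weights in an automatic-differentiation framework.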

Keywords

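natural scene text detection; natural scene text recognition; soft attention embedding; deep learning; end-to-end natural scene text detection and recognition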

SAME-Net: scene text recognition method based on soft attention mask embedding

Chen Weida, Wang Linfei, Tao Dapeng (School of Information, Yunnan University, Kunming 650500, China)

Abstract

Objective Text detection and recognition in natural scenes is a long-standing and challenging problem. This study aims to detect and recognize text in natural scene images. Owing to its wide applications (e.g., traffic sign recognition and content-based image retrieval), text detection and recognition has attracted much attention in computer vision. Traditional scene text methods treat detection and recognition as two independent tasks: they first locate the text regions of the input image, crop those regions, and then send the cropped regions to a recognizer. However, this pipeline has two limitations: 1) inaccurate detection results may seriously degrade recognition performance because errors accumulate between the two tasks, and 2) optimizing the two tasks separately may not improve the final recognition results. In recent years, end-to-end scene text recognition based on deep learning has made great progress. Many studies have found that detection and recognition are closely related. End-to-end recognition integrates the two tasks so that they can promote each other, and it has gradually become an important research direction. In the end-to-end recognition task, natural scene images contain disturbing factors such as lighting, deformation, and stains. In addition, scene text can appear in different colors, fonts, sizes, directions, and shapes, making text detection very difficult. Limited by multi-scale text, arbitrary shapes, background interference, and other issues, most end-to-end text recognizers still face the problem of incomplete mask proposals, which affects the text recognition results of the model. Hence, we propose a soft attention mask embedding (SAME) module to improve the accuracy of mask prediction. This module improves the robustness and accuracy of the model.
Method High-level features are encoded, and soft attention is computed using the global receptive field of the Transformer. Then, the encoded features and the predicted mask are embedded level by level to generate a mask close to the text boundary and suppress background noise. Based on these designs, we propose a simple and robust end-to-end text recognition framework, SAME-Net. Because soft attention is differentiable, the proposed SAME module can propagate the recognition loss back to the detection branch and guide text detection by learning the attention weights, so that the detection branch is jointly optimized by the detection and recognition objectives. SAME-Net needs neither an additional text rectification module nor character-level text annotations.
Result This method can effectively detect multi-scale and arbitrarily shaped text. The recall and H-mean on the public arbitrarily shaped dataset Total-Text are 0.8848 and 0.8796, respectively. Compared with the best results among the compared methods, and without additional training data, the recognition accuracy without dictionary guidance increases by 2.36%, and the recognition accuracy with the full dictionary increases by 5.62%. In terms of detection, the recall and H-mean increase from 0.868 to 0.8848 and from 0.861 to 0.8796, respectively, exceeding the previous method. In terms of end-to-end recognition, the method obtains an 83.4% strong-dictionary recognition result on the multi-directional dataset ICDAR 2015 (International Conference on Document Analysis and Recognition). In short, our method outperforms the compared methods.
Conclusion SAME-Net significantly improves performance on the two scene text datasets ICDAR 2015 and Total-Text, obtaining the best results on this task. This study proposes an end-to-end text recognition method based on SAME. The proposed method has two advantages. First, it uses the global receptive field of the Transformer to embed the high-level encoded features with the predicted mask level by level, generating a mask close to the text boundary that suppresses background noise. Second, the proposed SAME module can propagate the recognition loss back to the detection module, so no additional text rectification module is needed. Through the joint optimization of the detection and recognition modules, strong text spotting performance can be achieved without character-level annotations.
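The recall and H-mean figures quoted in the Result paragraph are related through the usual definition of H-mean as the harmonic mean of precision and recall. The sketch below assumes that definition applies here; `h_mean` and `implied_precision` are hypothetical helper names, and the precision it prints is derived from the two quoted figures, not a value reported in the abstract.

```python
def h_mean(precision, recall):
    """Harmonic mean of precision and recall (H-mean, also called F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def implied_precision(h, recall):
    """Solve H = 2PR / (P + R) for P, giving P = H * R / (2R - H)."""
    return h * recall / (2 * recall - h)

# Recall and H-mean on Total-Text as quoted in the abstract.
recall = 0.8848
reported_h = 0.8796

# Derived, not reported: the precision consistent with those two figures.
precision = implied_precision(reported_h, recall)
print(round(precision, 4))                    # prints 0.8745
print(round(h_mean(precision, recall), 4))    # prints 0.8796
```

Because both inputs are rounded to four decimals, the derived precision is only approximate and should be read as a consistency check on the quoted numbers.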

Keywords

natural scene text detection; natural scene text recognition; soft attention embedding; deep learning; end-to-end natural scene text detection and recognition