层级语义融合的场景文本检测
Hierarchical semantics-fused scene text detection
- 2023年28卷第8期 页码:2343-2355
纸质出版日期: 2023-08-16
DOI: 10.11834/jig.220902
王紫霄, 谢洪涛, 王裕鑫, 张勇东. 2023. 层级语义融合的场景文本检测. 中国图象图形学报, 28(08):2343-2355
Wang Zixiao, Xie Hongtao, Wang Yuxin, Zhang Yongdong. 2023. Hierarchical semantics-fused scene text detection. Journal of Image and Graphics, 28(08):2343-2355
目的
场景文本检测是场景理解和文字识别领域的重要任务之一,尽管基于深度学习的算法显著提升了检测精度,但现有的方法由于对文字局部语义和文字实例间的全局语义的提取能力不足,导致缺乏文字多层语义的建模,从而检测精度不理想。针对此问题,提出了一种层级语义融合的场景文本检测算法。
方法
该方法包括基于文本片段的局部语义理解模块和基于文本实例的全局语义理解模块,以分别引导网络关注文字局部和文字实例间的多层级语义信息。首先,基于文本片段的局部语义理解模块根据相对位置将文本划分为多个片段,在细粒度优化目标的监督下增强网络对局部语义的感知能力。然后,基于文本实例的全局语义理解模块利用文本片段粗分割结果过滤背景区域并提取可靠的文字区域特征,进而通过注意力机制自适应地捕获任意形状文本的全局语义信息并得到最终分割结果。此外,为了降低边界区域的预测噪声对层级语义信息聚合的干扰,提出边界感知损失函数以降低边界区域特征的歧义性。
结果
算法在3个常用的场景文字检测数据集上实验并与其他算法进行了比较,所提方法在性能上获得了显著提升,在Total-Text数据集上,F值为87.0%,相比其他模型提升了1.0%;在MSRA-TD500(MSRA text detection 500 database)数据集上,F值为88.2%,相比其他模型提升了1.0%;在ICDAR 2015(International Conference on Document Analysis and Recognition)数据集上,F值为87.0%。
结论
提出的模型通过分别构建不同层级下的语义上下文和对歧义特征额外的惩罚解决了层级语义提取不充分的问题,获得了更高的检测精度。
Objective
Scene text detection, which aims to localize text instances in natural images, is an essential task in computer vision. It underpins text recognition applications such as scene understanding, translation and text-based visual question answering. Deep learning based convolutional neural networks (CNNs) have been widely applied to text detection. Early work located text by regressing quadrangular bounding boxes. However, since regression-based methods fit texts with arbitrary shapes (e.g., curved texts) poorly, many approaches turn to segmentation. Fully convolutional networks (FCNs) are commonly used to obtain high-resolution feature maps, and a pixel-level mask is predicted to locate the text instances. Due to the extreme aspect ratios and the varied sizes of text instances, it is difficult for existing models to integrate local-level and global-level semantics within a single feature map. Some methods introduce feature maps from multiple levels of the network so that hierarchical semantics can be derived from the corresponding maps. However, these designs require the network to optimize the hierarchical features simultaneously, which may distract the network to a certain extent. Hence, existing networks still struggle to capture accurate hierarchical semantics.
Method
To resolve this problem, we develop a segmentation-based text detection method and present a hierarchical semantic fusion network. We decouple the local and global feature extraction processes so that each branch learns the corresponding semantics. Specifically, two mutually beneficial components are introduced to enhance local and global features: a sub-region based local semantic understanding module (SLM) and an instance based global semantic understanding module (IGM). First, SLM divides each text instance into a kernel and multiple sub-regions according to their position within the text, and learns their segmentation as an auxiliary task. Because a sub-region is only a small part of the text, segmenting it requires more local-level information and less long-range context, which drives the model to learn more accurate local features. Furthermore, the position information supervised by the ground truth helps the network separate adjacent text instances. Second, IGM extracts global contextual features by capturing long-range dependencies among text instances. Using the segmentation maps produced by SLM, IGM filters out the noisy background and obtains instance-level features for each text instance. These features are fed into a Transformer that fuses semantics across instances, producing text features with a global receptive field. The similarity between these text features and the original pixel-level feature map is then computed, and the global-level feature is aggregated by weighting the text features with the similarity map. Integrating SLM and IGM allows the network to learn to segment text progressively, from pixels to local regions and further to text instances. During this procedure, the hierarchical semantics are collected in the corresponding modules, which reduces the distraction between levels.
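The IGM pipeline described above — pooling instance-level features inside the SLM-derived masks, then aggregating global context by pixel-to-instance similarity — can be sketched as follows. This is a minimal NumPy illustration under assumed shapes; the function names and the plain dot-product attention are our own simplifications of the idea, not the paper's exact Transformer design.

```python
import numpy as np

def masked_instance_features(feat, inst_masks):
    """Average-pool pixel features inside each instance mask.
    feat: (C, H, W) feature map; inst_masks: (N, H, W) binary masks.
    The coarse masks filter out background pixels, so each instance
    vector summarizes only reliable text-region features."""
    c = feat.shape[0]
    flat = feat.reshape(c, -1)                                       # (C, HW)
    m = inst_masks.reshape(len(inst_masks), -1).astype(np.float64)   # (N, HW)
    area = m.sum(axis=1, keepdims=True) + 1e-6
    return (m @ flat.T) / area                                       # (N, C)

def aggregate_global(feat, inst_feat):
    """Attend from every pixel to the instance-level features and
    aggregate a global-context feature map (dot-product attention)."""
    c, h, w = feat.shape
    q = feat.reshape(c, -1).T                     # (HW, C) pixel queries
    logits = q @ inst_feat.T / np.sqrt(c)         # (HW, N) similarity map
    attn = np.exp(logits - logits.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)       # softmax over instances
    out = attn @ inst_feat                        # (HW, C) fused feature
    return out.T.reshape(c, h, w)
```

In a real detector these operations would run on framework tensors within the network; the sketch only shows how the similarity map redistributes instance-level semantics back to every pixel.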
In addition, the ambiguous boundary regions in segmentation results involve vague semantics, which may distort the semantic extraction. To alleviate this problem, we introduce a location aware loss (LAL) that increases the penalty on misclassification around the border region. LAL is a weighted loss in which pixels closer to the boundary are assigned higher weights. This loss function encourages the model to make confident and accurate predictions around the boundary, yielding more accurate and discriminative features.
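The idea of weighting the loss by distance to the text boundary can be sketched as below. The exponential weight is an assumed illustrative form — the abstract does not specify the exact weighting — and `location_aware_bce` is a hypothetical name for this sketch.

```python
import numpy as np

def location_aware_bce(pred, gt, dist_to_border, alpha=1.0):
    """Boundary-weighted binary cross-entropy (a sketch of the LAL idea:
    pixels nearer the text border receive a larger weight).
    pred: predicted text probability in (0, 1); gt: binary ground truth;
    dist_to_border: per-pixel distance to the nearest text boundary.
    The exp(-d) weighting is an illustrative choice, not the paper's
    exact formulation."""
    eps = 1e-6
    w = 1.0 + alpha * np.exp(-dist_to_border)   # weight peaks at the border
    ce = -(gt * np.log(pred + eps) + (1 - gt) * np.log(1 - pred + eps))
    return (w * ce).sum() / w.sum()
```

With this weighting, the same prediction error costs more when it occurs next to the boundary than deep inside a text region, pushing the model toward unambiguous border features.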
Result
Comparative analysis is carried out against 12 popular methods on three challenging datasets: Total-Text, MSRA-TD500 and ICDAR 2015. The quantitative evaluation metrics are F-measure, recall and precision. Our method achieves over 1% improvement on Total-Text and MSRA-TD500, with F-measures of 87.0% and 88.2%, respectively. In particular, the recall and precision on MSRA-TD500 reach 92.1% and 84.5%. On ICDAR 2015, the precision is improved to 92.3% and the F-measure reaches 87.0%. Additionally, a series of ablation experiments on Total-Text evaluates the effectiveness of each proposed module: SLM, IGM and LAL improve the F-measure by 1.0%, 0.6% and 0.5%, respectively. Qualitative visualization further demonstrates the improvement over the baseline model.
Conclusion
A hierarchical semantic understanding network is developed, together with a novel loss function for enhancing hierarchical semantics. Decoupling the local and global feature extraction processes proves to be an effective way to obtain more accurate and reliable hierarchical semantics progressively.
场景文本;文字检测;全卷积网络(FCN);卷积神经网络(CNN);特征融合;注意力机制
scene text; text detection; fully convolutional network (FCN); convolutional neural network (CNN); feature fusion; attention mechanism
Baek Y, Lee B, Han D, Yun S and Lee H. 2019. Character region awareness for text detection//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, USA: IEEE: 9357-9366 [DOI: 10.1109/CVPR.2019.00959]
Cheng B W, Girshick R, Dollár P, Berg A C and Kirillov A. 2021. Boundary IoU: improving object-centric image segmentation evaluation//Proceedings of 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville, USA: IEEE: 15329-15337 [DOI: 10.1109/CVPR46437.2021.01508]
Ch'ng C K and Chan C S. 2017. Total-text: a comprehensive dataset for scene text detection and recognition//Proceedings of the 14th IAPR International Conference on Document Analysis and Recognition. Kyoto, Japan: IEEE: 935-942 [DOI: 10.1109/ICDAR.2017.157]
Dai P W, Zhang S Y, Zhang H and Cao X C. 2021. Progressive contour regression for arbitrary-shape scene text detection//Proceedings of 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville, USA: IEEE: 7389-7398 [DOI: 10.1109/CVPR46437.2021.00731]
Fu J, Liu J, Tian H J, Li Y, Bao Y J, Fang Z W and Lu H Q. 2019. Dual attention network for scene segmentation//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, USA: IEEE: 3141-3149 [DOI: 10.1109/CVPR.2019.00326]
Gupta A, Vedaldi A and Zisserman A. 2016. Synthetic data for text localisation in natural images//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, USA: IEEE: 2315-2324 [DOI: 10.1109/CVPR.2016.254]
He K M, Gkioxari G, Dollár P and Girshick R. 2017. Mask R-CNN//Proceedings of 2017 IEEE International Conference on Computer Vision. Venice, Italy: IEEE: 2980-2988 [DOI: 10.1109/ICCV.2017.322]
He K M, Zhang X Y, Ren S Q and Sun J. 2016. Deep residual learning for image recognition//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, USA: IEEE: 770-778 [DOI: 10.1109/CVPR.2016.90]
Karatzas D, Gomez-Bigorda L, Nicolaou A, Ghosh S, Bagdanov A, Iwamura M, Matas J, Neumann L, Chandrasekhar V R, Lu S J, Shafait F, Uchida S and Valveny E. 2015. ICDAR 2015 competition on robust reading//Proceedings of the 13th International Conference on Document Analysis and Recognition. Tunis, Tunisia: IEEE: 1156-1160 [DOI: 10.1109/ICDAR.2015.7333942]
Liao M H, Wan Z Y, Yao C, Chen K and Bai X. 2020. Real-time scene text detection with differentiable binarization. Proceedings of the AAAI Conference on Artificial Intelligence, 34(7): 11474-11481 [DOI: 10.1609/aaai.v34i07.6812]
Liao M H, Zou Z S, Wan Z Y, Yao C and Bai X. 2023. Real-time scene text detection with differentiable binarization and adaptive scale fusion. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(1): 919-931 [DOI: 10.1109/TPAMI.2022.3155612]
Lin T Y, Dollár P, Girshick R, He K M, Hariharan B and Belongie S. 2017. Feature pyramid networks for object detection//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE: 936-944 [DOI: 10.1109/CVPR.2017.106]
Liu C Y, Chen X X, Luo C J, Jin L W, Xue Y and Liu Y L. 2021. Deep learning methods for scene text detection and recognition. Journal of Image and Graphics, 26(6): 1330-1367
刘崇宇, 陈晓雪, 罗灿杰, 金连文, 薛洋, 刘禹良. 2021. 自然场景文本检测与识别的深度学习方法. 中国图象图形学报, 26(6): 1330-1367 [DOI: 10.11834/jig.210044]
Liu Y L, Chen H, Shen C H, He T, Jin L W and Wang L W. 2020. ABCNet: real-time scene text spotting with adaptive bezier-curve network//Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, USA: IEEE: 9806-9815 [DOI: 10.1109/CVPR42600.2020.00983]
Long J, Shelhamer E and Darrell T. 2015. Fully convolutional networks for semantic segmentation//Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston, USA: IEEE: 3431-3440 [DOI: 10.1109/CVPR.2015.7298965]
Lyu P, Yao C, Wu W H, Yan S C and Bai X. 2018b. Multi-oriented scene text detection via corner localization and region segmentation//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 7553-7563 [DOI: 10.1109/CVPR.2018.00788]
Lyu P Y, Liao M H, Yao C, Wu W H and Bai X. 2018a. Mask TextSpotter: an end-to-end trainable neural network for spotting text with arbitrary shapes//Proceedings of the 15th European Conference on Computer Vision. Munich, Germany: Springer: 71-88 [DOI: 10.1007/978-3-030-01264-9_5]
Ma M C, Xia C Q and Li J. 2021. Pyramidal feature shrinking for salient object detection. Proceedings of the AAAI Conference on Artificial Intelligence, 35(3): 2311-2318 [DOI: 10.1609/aaai.v35i3.16331]
Shi B G, Bai X and Belongie S. 2017. Detecting oriented text in natural images by linking segments//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE: 3482-3490 [DOI: 10.1109/CVPR.2017.371]
Shi G C and Wu Y R. 2021. Arbitrary shape scene-text detection based on pixel aggregation and feature enhancement. Journal of Image and Graphics, 26(7): 1614-1624
师广琛, 巫义锐. 2021. 像素聚合和特征增强的任意形状场景文本检测. 中国图象图形学报, 26(7): 1614-1624 [DOI: 10.11834/jig.200522]
Tang J, Yang Z B, Wang Y P, Zheng Q, Xu Y C and Bai X. 2019. SegLink++: detecting dense and arbitrary-shaped scene text by instance-aware component grouping. Pattern Recognition, 96: #106954 [DOI: 10.1016/j.patcog.2019.06.020]
Tian Z, Huang W L, He T, He P and Qiao Y. 2016. Detecting text in natural image with connectionist text proposal network//Proceedings of the 14th European Conference on Computer Vision. Amsterdam, the Netherlands: Springer: 56-72 [DOI: 10.1007/978-3-319-46484-8_4]
Tian Z T, Shu M, Lyu P, Li R Y, Zhou C, Shen X Y and Jia J Y. 2019. Learning shape-aware embedding for scene text detection//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, USA: IEEE: 4229-4238 [DOI: 10.1109/CVPR.2019.00436]
Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez A N, Kaiser Ł and Polosukhin I. 2017. Attention is all you need//Proceedings of the 31st International Conference on Neural Information Processing Systems. Long Beach, USA: Curran Associates Inc.: 6000-6010
Wang W H, Xie E Z, Li X, Hou W B, Lu T, Yu G and Shao S. 2019a. Shape robust text detection with progressive scale expansion network//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, USA: IEEE: 9328-9337 [DOI: 10.1109/CVPR.2019.00956]
Wang W H, Xie E Z, Song X G, Zang Y H, Wang W J, Lu T, Yu G and Shen C H. 2019b. Efficient and accurate arbitrary-shaped text detection with pixel aggregation network//Proceedings of 2019 IEEE/CVF International Conference on Computer Vision. Seoul, Korea (South): IEEE: 8439-8448 [DOI: 10.1109/ICCV.2019.00853]
Wang X L, Girshick R, Gupta A and He K M. 2018. Non-local neural networks//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 7794-7803 [DOI: 10.1109/CVPR.2018.00813]
Wang Y X, Xie H T, Zha Z J, Xing M T, Fu Z L and Zhang Y D. 2020. ContourNet: taking a further step toward accurate arbitrary-shaped scene text detection//Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, USA: IEEE: 11750-11759 [DOI: 10.1109/CVPR42600.2020.01177]
Xu Y C, Wang Y K, Zhou W, Wang Y P, Yang Z B and Bai X. 2019. TextField: learning a deep direction field for irregular scene text detection. IEEE Transactions on Image Processing, 28(11): 5566-5579 [DOI: 10.1109/TIP.2019.2900589]
Xue C H, Lu S J and Zhang W. 2019. MSR: multi-scale shape regression for scene text detection//Proceedings of the 28th International Joint Conference on Artificial Intelligence. Macao, China: AAAI Press: 989-995
Yang S Q, Yi Y H, Tang Z W and Wang X Y. 2021. Text detection in natural scenes embedded attention mechanism. Computer Engineering and Applications, 57(24): 185-191
杨锶齐, 易尧华, 汤梓伟, 王新宇. 2021. 嵌入注意力机制的自然场景文本检测方法. 计算机工程与应用, 57(24): 185-191 [DOI: 10.3778/j.issn.1002-8331.2007-0098]
Yao C, Bai X and Liu W Y. 2014. A unified framework for multioriented text detection and recognition. IEEE Transactions on Image Processing, 23(11): 4737-4749 [DOI: 10.1109/TIP.2014.2353813]
Yao C, Bai X, Liu W Y, Ma Y and Tu Z W. 2012. Detecting texts of arbitrary orientations in natural images//Proceedings of 2012 IEEE Conference on Computer Vision and Pattern Recognition. Providence, USA: IEEE: 1083-1090 [DOI: 10.1109/CVPR.2012.6247787]
Yi Y H, He J J, Lu L Q and Tang Z W. 2020. Association of text and other objects for text detection with natural scene images. Journal of Image and Graphics, 25(1): 126-135
易尧华, 何婧婧, 卢利琼, 汤梓伟. 2020. 顾及目标关联的自然场景文本检测. 中国图象图形学报, 25(1): 126-135 [DOI: 10.11834/jig.190179]
Yi Y H, Shen C H, Liu J H and Lu L Q. 2017. Natural scene text detection method by integrating MSCRs into MSERs. Journal of Image and Graphics, 22(2): 154-160
易尧华, 申春辉, 刘菊华, 卢利琼. 2017. 结合MSCRs与MSERs的自然场景文本检测. 中国图象图形学报, 22(2): 154-160 [DOI: 10.11834/jig.20170202]
Yuan Y H, Chen X L and Wang J D. 2020. Object-contextual representations for semantic segmentation//Proceedings of the 16th European Conference on Computer Vision. Glasgow, UK: Springer: 173-190 [DOI: 10.1007/978-3-030-58539-6_11]
Zhang S X, Zhu X B, Hou J B, Liu C, Yang C, Wang H F and Yin X C. 2020. Deep relational reasoning graph network for arbitrary shape text detection//Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, USA: IEEE: 9696-9705 [DOI: 10.1109/CVPR42600.2020.00972]
Zhang Z, Zhang C Q, Shen W, Yao C, Liu W Y and Bai X. 2016. Multi-oriented text detection with fully convolutional networks//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, USA: IEEE: 4159-4167 [DOI: 10.1109/CVPR.2016.451]
Zhou X Y, Yao C, Wen H, Wang Y Z, Zhou S C, He W R and Liang J J. 2017. EAST: an efficient and accurate scene text detector//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE: 2642-2651 [DOI: 10.1109/CVPR.2017.283]
Zhu Y Q, Chen J Y, Liang L Y, Kuang Z H, Jin L W and Zhang W. 2021. Fourier contour embedding for arbitrary-shaped text detection//Proceedings of 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville, USA: IEEE: 3122-3130 [DOI: 10.1109/CVPR46437.2021.00314]
Zhu Y X and Du J. 2021. TextMountain: accurate scene text detection via instance segmentation. Pattern Recognition, 110: #107336 [DOI: 10.1016/j.patcog.2020.107336]