Arbitrary shape scene-text detection based on pixel aggregation and feature enhancement
2021, Vol. 26, No. 7, pp. 1614-1624
Received: 2020-08-27; Revised: 2021-03-01; Accepted: 2021-03-08; Published in print: 2021-07-16
DOI: 10.11834/jig.200522
Objective
Extracting text information from scene images is of great significance for understanding scene content, and text detection is the basis of text recognition and understanding. To address inaccurate text localization in scene text recognition, this paper proposes an efficient arbitrary-shape text detector: the non-local pixel aggregation network (non-local PAN).
Method
The method uses a feature pyramid enhancement module and a feature fusion module for lightweight feature extraction, which preserves its speed advantage. At the same time, non-local operations are introduced to strengthen the feature extraction capability of the backbone network and thereby improve detection accuracy. The non-local operation is an attention mechanism that can capture the intrinsic relationships among text pixels. In addition, a feature-vector fusion module is designed to fuse feature maps of different scales, enhancing the feature representation of scene text instances whose scales vary widely.
Result
The proposed method is compared with other methods on three scene text datasets and stands out in both speed and accuracy. On the ICDAR (International Conference on Document Analysis and Recognition) 2015 dataset, its F-measure is 0.9% higher than that of the best competing method, with a detection speed of 23.1 frames/s. On the CTW1500 (Curve Text in the Wild) dataset, its F-measure is 1.2% higher, with a detection speed of 71.8 frames/s. On the Total-Text dataset, its F-measure is 1.3% higher, with a detection speed of 34.3 frames/s, far exceeding the other methods.
Conclusion
The proposed method takes both accuracy and real-time performance into account and reaches a high level in both.
Objective
Text can be seen everywhere, such as on street signs, billboards, newspapers, and other items, and this text expresses the information those items intend to convey. The quality of text detection determines the level of text recognition and scene understanding. With the rapid development of modern technologies such as computer vision and the internet of things, many emerging applications need to extract text information from images. In recent years, several new methods for detecting scene text have been proposed. However, many of them detect slowly because of large, complex post-processing stages, which limits their practical deployment. On the other hand, previous high-efficiency text detectors mainly predicted quadrilateral bounding boxes, which makes it difficult to accurately localize text of arbitrary shapes.
Method
In this paper, an efficient arbitrary-shape text detector called the non-local pixel aggregation network (non-local PAN) is proposed. Non-local PAN follows a segmentation-based approach to detecting scene text instances. To increase detection speed, the backbone must be a lightweight network; however, the representation capability of lightweight backbones is usually weak. Therefore, a non-local module is added to the backbone to enhance its ability to extract features. ResNet-18 is used as the backbone of non-local PAN, and a non-local module is embedded before the last residual block of its third stage, as sketched below.
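The following PyTorch sketch shows an embedded-Gaussian non-local block in the style of Wang et al. (2018) and one way of splicing it into torchvision's ResNet-18. It is illustrative only: the class name, the channel reduction, and the mapping of the paper's "third layer" to torchvision's layer2 (the conv3_x stage) are assumptions of ours, not the authors' released code.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class NonLocalBlock(nn.Module):
    """Embedded-Gaussian non-local block: y = x + W(softmax(theta(x) phi(x)^T) g(x))."""
    def __init__(self, channels, reduction=2):
        super().__init__()
        inter = channels // reduction
        self.theta = nn.Conv2d(channels, inter, 1)    # query projection
        self.phi = nn.Conv2d(channels, inter, 1)      # key projection
        self.g = nn.Conv2d(channels, inter, 1)        # value projection
        self.out = nn.Conv2d(inter, channels, 1)      # restore channel count

    def forward(self, x):
        n, c, h, w = x.shape
        q = self.theta(x).flatten(2).transpose(1, 2)  # (N, HW, C')
        k = self.phi(x).flatten(2)                    # (N, C', HW)
        v = self.g(x).flatten(2).transpose(1, 2)      # (N, HW, C')
        attn = torch.softmax(q @ k, dim=-1)           # pairwise pixel affinities
        y = (attn @ v).transpose(1, 2).reshape(n, -1, h, w)
        return x + self.out(y)                        # residual connection

backbone = resnet18()
# Assumption: the paper counts conv2_x..conv5_x as stages two to five, so its
# "third layer" maps to torchvision's layer2 (128 channels). Insert the
# non-local block just before that stage's last residual block.
blocks = list(backbone.layer2.children())
backbone.layer2 = nn.Sequential(*blocks[:-1], NonLocalBlock(128), blocks[-1])
```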
In addition, a feature-vector fusion module is designed to fuse feature vectors of different levels and thereby enhance the feature expression of scene texts of different scales. The module is formed by concatenating multiple feature-vector fusion blocks, whose core component is the causal convolution. After training, the module can predict the fused feature vector from the previously input feature vectors; a sketch of this idea follows.
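Below is a minimal sketch of how a causal convolution can fuse a sequence of per-scale feature vectors so that each output depends only on the current and earlier inputs, in the spirit of WaveNet (van den Oord et al., 2016). Treating the pyramid levels as a 1-D sequence, along with all names and sizes here, is our assumption about the design, not the paper's exact module.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalFusionBlock(nn.Module):
    def __init__(self, dim, kernel_size=2):
        super().__init__()
        self.pad = kernel_size - 1                    # left padding only => causal
        self.conv = nn.Conv1d(dim, dim, kernel_size)

    def forward(self, seq):                           # seq: (N, dim, num_scales)
        x = F.pad(seq, (self.pad, 0))                 # no access to later scales
        return F.relu(self.conv(x)) + seq             # residual fusion

# Usage: stack pooled feature vectors from four pyramid levels as a sequence,
# then cascade several fusion blocks, mirroring the concatenated-block design.
vectors = torch.randn(8, 128, 4)                      # (batch, dim, scales)
fusion = nn.Sequential(*[CausalFusionBlock(128) for _ in range(3)])
fused = fusion(vectors)                               # same shape, scale-aware
```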
This study also uses a lightweight segmentation head that can process features effectively at a small computational cost. The segmentation head contains two key modules, namely, the feature pyramid enhancement module (FPEM) and the feature fusion module (FFM). FPEM is cascadable and has a low computational cost: attached behind the backbone, it refines the features of different scales and makes the network more expressive. A rough sketch follows.
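The sketch below gives one plausible reading of a cascadable FPEM, following the pixel aggregation network paper (Wang et al., 2019a): an up-scale pass fuses deep features into shallower ones and a down-scale pass propagates them back, with separable convolutions keeping the cost low. The exact wiring and the 128-channel width are assumptions of ours, not the published implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def sep_conv(c, stride=1):
    """Depthwise-separable 3x3 convolution, the cheap building block of FPEM."""
    return nn.Sequential(
        nn.Conv2d(c, c, 3, stride=stride, padding=1, groups=c),  # depthwise
        nn.Conv2d(c, c, 1), nn.BatchNorm2d(c), nn.ReLU())        # pointwise

class FPEM(nn.Module):
    def __init__(self, c=128):
        super().__init__()
        self.up = nn.ModuleList(sep_conv(c) for _ in range(3))
        self.down = nn.ModuleList(sep_conv(c, stride=2) for _ in range(3))

    def forward(self, feats):
        f1, f2, f3, f4 = feats            # shallow (high-res) -> deep (low-res)
        # Up-scale enhancement: fuse deep features into shallower levels.
        f3 = self.up[0](f3 + F.interpolate(f4, scale_factor=2))
        f2 = self.up[1](f2 + F.interpolate(f3, scale_factor=2))
        f1 = self.up[2](f1 + F.interpolate(f2, scale_factor=2))
        # Down-scale enhancement: push refined detail back to deeper levels.
        f2 = self.down[0](f1) + f2
        f3 = self.down[1](f2) + f3
        f4 = self.down[2](f3) + f4
        return [f1, f2, f3, f4]           # cascadable: feed into the next FPEM

# Four 128-channel maps at strides 4, 8, 16, 32 of a 640x640 input.
feats = [torch.randn(1, 128, s, s) for s in (160, 80, 40, 20)]
out = FPEM()(feats)
```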
FFM then merges the features generated by the FPEMs at different depths into the final features for segmentation. Non-local PAN uses the predicted text region to describe the complete shape of each text instance and predicts text kernels to distinguish the instances from one another. The network also predicts a similarity vector for each text pixel to guide the pixel to the correct kernel; the aggregation sketch below illustrates the idea.
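As a rough illustration of this aggregation step, the sketch below assigns each predicted text pixel to the kernel whose mean similarity vector is closest. The actual PAN post-processing grows kernels with a breadth-first search and a learned distance threshold, so the nearest-mean assignment, the threshold value, and the helper names here are simplifications of ours.

```python
import numpy as np
from scipy import ndimage

def aggregate(text_mask, kernel_mask, sim, dist_thresh=0.8):
    """text_mask, kernel_mask: (H, W) bool arrays; sim: (D, H, W) similarity vectors."""
    labels, num = ndimage.label(kernel_mask)          # one label per text kernel
    out = np.zeros_like(labels)
    if num == 0:                                      # no kernels, no instances
        return out
    means = [sim[:, labels == i].mean(axis=1) for i in range(1, num + 1)]
    for y, x in zip(*np.nonzero(text_mask)):          # every predicted text pixel
        d = [np.linalg.norm(sim[:, y, x] - m) for m in means]
        best = int(np.argmin(d))
        if d[best] < dist_thresh:                     # attach only nearby pixels
            out[y, x] = best + 1
    return out                                        # (H, W) instance label map
```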
Result
The proposed method is compared with other methods on three scene-text datasets and shows outstanding performance in both speed and accuracy. On the International Conference on Document Analysis and Recognition (ICDAR) 2015 dataset, its F-measure is 0.9% higher than that of the best competing method, and the detection speed reaches 23.1 frames/s. On the Curve Text in the Wild (CTW) 1500 dataset, its F-measure is 1.2% higher, and the detection speed reaches 71.8 frames/s. On the Total-Text dataset, its F-measure is 1.3% higher, and the detection speed reaches 34.3 frames/s, far beyond the other methods. In addition, we design parameter-setting experiments to explore the best location for embedding the non-local module. The experiments show that embedding the non-local module outperforms not embedding it, indicating that the module plays an active role in the detection process. In terms of detection accuracy, embedding non-local blocks into the second, third, and fourth stages of ResNet-18 brings a significant gain, whereas embedding into the fifth stage has little effect; embedding into the third stage works best. We also design ablation experiments on the ICDAR 2015 dataset for the non-local and feature-vector fusion modules. The results show that the superiority of the non-local module comes not from deepening the network but from its structural characteristics. The feature-vector fusion module also plays an active role in scene text detection: it combines feature maps of different scales to enhance the feature expression of scene texts with variable scales.
Conclusion
In this paper, an efficient detection method for arbitrary-shape scene text is proposed that balances accuracy and real-time performance. The experimental results show that our model outperforms previous methods and is superior in both accuracy and speed.
References
Baek Y, Lee B, Han D, Yun S and Lee H. 2019. Character region awareness for text detection//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, USA: IEEE: 9357-9366 [DOI: 10.1109/CVPR.2019.00959]
Chng C K and Chan C S. 2017. Total-Text: a comprehensive dataset for scene text detection and recognition//Proceedings of the 14th IAPR International Conference on Document Analysis and Recognition. Kyoto, Japan: IEEE: 935-942 [DOI: 10.1109/ICDAR.2017.157]
He K M, Zhang X Y, Ren S Q and Sun J. 2016. Deep residual learning for image recognition//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, USA: IEEE: 770-778 [DOI: 10.1109/CVPR.2016.90]
He P, Huang W L, He T, Zhu Q L, Qiao Y and Li X L. 2017a. Single shot text detector with regional attention//Proceedings of 2017 IEEE International Conference on Computer Vision. Venice, Italy: IEEE: 3066-3074 [DOI: 10.1109/ICCV.2017.331]
He W H, Zhang X Y, Yin F and Liu C L. 2017b. Deep direct regression for multi-oriented scene text detection//Proceedings of 2017 IEEE International Conference on Computer Vision. Venice, Italy: IEEE: 745-753 [DOI: 10.1109/ICCV.2017.87]
Liao M H, Lyu P Y, He M H, Yao C, Wu W H and Bai X. 2021. Mask TextSpotter: an end-to-end trainable neural network for spotting text with arbitrary shapes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(2): 532-548 [DOI: 10.1109/TPAMI.2019.2937086]
Lin T Y, Dollár P, Girshick R, He K M, Hariharan B and Belongie S J. 2017. Feature pyramid networks for object detection//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE: 936-944 [DOI: 10.1109/CVPR.2017.106]
Liu J C, Liu X B, Sheng J, Liang D, Li X and Liu Q J. 2019. Pyramid mask text detector [EB/OL]. [2020-11-07]. http://arxiv.org/pdf/1903.11800.pdf
Liu Y L, Jin L W, Zhang S T and Zhang S. 2017. Detecting curve text in the wild: new dataset and new solution [EB/OL]. [2020-11-07]. http://arxiv.org/pdf/1712.02170.pdf
Liu Z C, Lin G S, Yang S, Feng J S, Lin W S and Goh W L. 2018. Learning Markov clustering networks for scene text detection//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 6936-6944 [DOI: 10.1109/CVPR.2018.00725]
Long J, Shelhamer E and Darrell T. 2015. Fully convolutional networks for semantic segmentation//Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston, USA: IEEE: 3431-3440 [DOI: 10.1109/CVPR.2015.7298965]
Long S B, Ruan J Q, Zhang W J, He X, Wu W H and Yao C. 2018. TextSnake: a flexible representation for detecting text of arbitrary shapes//Proceedings of the 15th European Conference on Computer Vision. Munich, Germany: Springer: 19-35 [DOI: 10.1007/978-3-030-01216-8_2]
Lyu P Y, Liao M H, Yao C, Wu W H and Bai X. 2018. Mask TextSpotter: an end-to-end trainable neural network for spotting text with arbitrary shapes//Proceedings of the 15th European Conference on Computer Vision. Munich, Germany: Springer: 71-88 [DOI: 10.1007/978-3-030-01264-9_5]
Milletari F, Navab N and Ahmadi S A. 2016. V-Net: fully convolutional neural networks for volumetric medical image segmentation//Proceedings of the 4th International Conference on 3D Vision. Stanford, USA: IEEE: 565-571 [DOI: 10.1109/3DV.2016.79]
Qin S Y, Bissacco A, Raptis M, Fujii Y and Xiao Y. 2019. Towards unconstrained end-to-end text spotting//Proceedings of 2019 IEEE/CVF International Conference on Computer Vision. Seoul, Korea (South): IEEE: 4703-4713 [DOI: 10.1109/ICCV.2019.00480]
Shi B G, Bai X and Belongie S. 2017. Detecting oriented text in natural images by linking segments//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE: 3482-3490 [DOI: 10.1109/CVPR.2017.371]
van den Oord A, Dieleman S, Zen H G, Simonyan K, Vinyals O, Graves A, Kalchbrenner N, Senior A and Kavukcuoglu K. 2016. WaveNet: a generative model for raw audio [EB/OL]. [2020-11-07]. http://arxiv.org/pdf/1609.03499.pdf
Wang W H, Xie E Z, Song X G, Zang Y H, Wang W J, Lu T, Yu G and Shen C H. 2019a. Efficient and accurate arbitrary-shaped text detection with pixel aggregation network//Proceedings of 2019 IEEE/CVF International Conference on Computer Vision. Seoul, Korea (South): IEEE: 8439-8448 [DOI: 10.1109/ICCV.2019.00853]
Wang X B, Jiang Y Y, Luo Z B, Liu C L, Choi H and Kim S. 2019b. Arbitrary shape scene text detection with adaptive text region representation//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, USA: IEEE: 6442-6451 [DOI: 10.1109/CVPR.2019.00661]
Wang X L, Girshick R B, Gupta A and He K M. 2018. Non-local neural networks//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 7794-7803 [DOI: 10.1109/CVPR.2018.00813]
Xu Y C, Fu M T, Wang Q M, Wang Y K, Chen K, Xia G S and Bai X. 2021. Gliding vertex on the horizontal bounding box for multi-oriented object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(4): 1452-1459 [DOI: 10.1109/TPAMI.2020.2974745]
Zhang C Q, Liang B R, Huang Z M, En M Y, Han J Y, Ding E R and Ding X H. 2019. Look more than once: an accurate detector for text of arbitrary shapes//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, USA: IEEE: 10544-10553 [DOI: 10.1109/CVPR.2019.01080]
Zhou X Y, Yao C, Wen H, Wang Y Z, Zhou S C, He W R and Liang J J. 2017. EAST: an efficient and accurate scene text detector//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE: 2642-2651 [DOI: 10.1109/CVPR.2017.283]