Natural scene text detection considering object association
Association of text and other objects for text detection with natural scene images
- 2020, Vol. 25, No. 1, pp. 126-135
Received: 2019-05-08
Revised: 2019-08-08
Published in print: 2020-01-16
DOI: 10.11834/jig.190179

Objective
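Current text detection methods based on convolutional neural networks (CNNs) have great difficulty localizing small-scale text in natural scenes. However, text in natural scene images is strongly associated with other objects: scene text usually appears together with specific objects such as billboards and road signs. Based on this observation, this paper proposes a cascaded CNN method for natural scene text detection that takes object association into account.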
Method
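First, a CNN detects text objects and the associated objects that contain text, producing text candidate boxes and candidate boxes of the text-containing associated objects. Next, the candidate boxes of the associated objects are enlarged and cropped from the original image, and each cropped image is fed to the CNN to detect text candidate boxes more precisely. Finally, non-maximum suppression fuses the text candidate boxes produced by these two steps to obtain the text detection result.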
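The enlarge, crop, and remap step can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' code. It uses NumPy arrays for images and `(x1, y1, x2, y2)` pixel boxes. It applies the 10%-of-width margin given in the detailed Method section to all four sides, because the text does not say whether the top and bottom margins use width or height. Clipping to the image border is our own addition. The helper names `expand_box`, `crop_region`, and `to_image_coords` are hypothetical. Text boxes found inside a crop must be shifted back by the crop's offset before they can be fused with the first-stage boxes.

```python
from typing import Tuple

import numpy as np

# Axis-aligned box in pixel coordinates: (x1, y1, x2, y2).
Box = Tuple[float, float, float, float]


def expand_box(box: Box, img_w: int, img_h: int, ratio: float = 0.1) -> Box:
    """Enlarge a box on every side by `ratio` times its width, clipped to the image.

    Assumption: the width-based margin is applied to all four sides.
    """
    x1, y1, x2, y2 = box
    margin = ratio * (x2 - x1)
    return (
        max(0.0, x1 - margin),
        max(0.0, y1 - margin),
        min(float(img_w), x2 + margin),
        min(float(img_h), y2 + margin),
    )


def crop_region(image: np.ndarray, box: Box) -> Tuple[np.ndarray, Tuple[int, int]]:
    """Crop an (H, W, C) image to `box`; return the crop and its top-left offset."""
    x1, y1, x2, y2 = (int(round(v)) for v in box)
    return image[y1:y2, x1:x2], (x1, y1)


def to_image_coords(box: Box, offset: Tuple[int, int]) -> Box:
    """Map a box detected inside a crop back to original-image coordinates."""
    ox, oy = offset
    x1, y1, x2, y2 = box
    return (x1 + ox, y1 + oy, x2 + ox, y2 + oy)


# Usage with a dummy 100x200 RGB image and one hypothetical object box.
image = np.zeros((100, 200, 3), dtype=np.uint8)
obj_box = (50.0, 40.0, 150.0, 60.0)
region = expand_box(obj_box, img_w=200, img_h=100)  # (40.0, 30.0, 160.0, 70.0)
crop, offset = crop_region(image, region)            # crop.shape == (40, 120, 3)
text_in_crop = (5.0, 5.0, 25.0, 15.0)                # hypothetical second-stage output
print(to_image_coords(text_in_crop, offset))         # (45.0, 35.0, 65.0, 45.0)
```

In a full cascade, the second-stage detector would run on `crop`. Each resulting box would then go through `to_image_coords`, so that both stages' boxes share one coordinate frame.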
Result
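The proposed method detects small-scale text effectively. On the ICDAR-2013 dataset, it achieves a recall, precision, and F-score of 0.817, 0.880, and 0.847, respectively.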
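As a quick consistency check on the reported figures, the F-score can be recomputed as the harmonic mean of precision and recall. This assumes the standard F1 definition. The ICDAR-2013 protocol computes precision and recall itself with its own box-matching rules, and this snippet does not reproduce them.

```python
def f_score(precision: float, recall: float) -> float:
    """F1 score: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Reported ICDAR-2013 figures: precision 0.880, recall 0.817.
print(round(f_score(0.880, 0.817), 3))  # 0.847
```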
Conclusion
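By taking into account the strong association between text and the objects that contain text in natural scenes, the proposed method improves the recall of small-scale text detection in natural scene images.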
Objective
Natural scene images contain abundant text that carries semantic information, and this text is key to describing and understanding the content of such images. Correct detection of text is therefore an important preliminary step for computer vision tasks such as image retrieval, image understanding, and intelligent navigation. However, complex environments, flexible image acquisition styles, and varied text content pose many challenges for text detection in natural scene images. Natural scene backgrounds contain disturbing factors such as lighting, distortion, and stains. In addition, scene text can appear in different colors, fonts, sizes, orientations, and shapes, which makes detection difficult. Moreover, the aspect ratios and layouts of scene text vary widely, which further hinders detection. Before deep learning, most text detection methods relied on connected-component analysis or sliding-window classification. These methods extract low- or mid-level hand-crafted image features and require demanding, repetitive pre- and post-processing steps. Because of the limitations of hand-crafted features and the complexity of their pipelines, such methods struggle in intricate scenes and achieve low precision. Recently, text detection based on convolutional neural networks (CNNs) has become the mainstream approach for natural scene text detection. However, existing CNN-based methods struggle to detect small-scale text and produce unsatisfactory results on it. Observation of text in natural scene images reveals a strong association between text and other objects. Text is usually attached to man-made objects (e.g., books, computers, and signboards) rather than to natural objects (e.g., water, sky, trees, and grass). Exploiting this association, this study proposes a cascaded CNN-based method for text detection in natural scene images, especially for small-scale text.
Method
We propose a cascaded CNN-based text detection method built on the RefineDet algorithm that takes the association between text and other objects into account. First, candidate bounding boxes are detected for both text and the objects that contain text. Small-scale text usually lies within these objects, so detecting the object boxes first improves the recall of text detection. Then, each candidate bounding box of a text-containing object is enlarged, cropped from the original image as a new image, and fed to the CNN detector to detect the text candidate boxes precisely. Some candidate boxes do not completely frame their objects, so cropping them directly would cut off part of the text and degrade detection in this second step. We therefore expand each box on every side by 10% of its width before cropping. Finally, the non-maximum suppression algorithm fuses the text candidate boxes from the two previous steps to produce the final detection results. The intersection over union (IoU) threshold used in non-maximum suppression affects detection performance, and the highest F-score is obtained with an IoU threshold of 20%.

We also collected a new dataset of objects containing text for training the object detector. It contains 350 images from the Street View Text (SVT) training set and 229 images from the ICDAR-2013 training set. All images are manually annotated with tight ground-truth bounding boxes of the object regions.
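The fusion step can be sketched with standard greedy non-maximum suppression over the union of first-stage and second-stage text boxes. This is a minimal sketch, not the paper's implementation. It assumes that all boxes already share the original image's coordinate frame and that each box has a confidence score. A box is suppressed when its IoU with a higher-scoring kept box exceeds the threshold. The paper's exact NMS variant may differ. The names `nms` and `fuse_text_boxes` are hypothetical, and the default `iou_thresh=0.2` mirrors the 20% value reported above.

```python
from typing import List, Sequence, Tuple

import numpy as np

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def nms(boxes: Sequence[Box], scores: Sequence[float], iou_thresh: float = 0.2) -> List[int]:
    """Greedy NMS: return indices of kept boxes, highest score first."""
    b = np.asarray(boxes, dtype=float).reshape(-1, 4)
    s = np.asarray(scores, dtype=float)
    areas = (b[:, 2] - b[:, 0]) * (b[:, 3] - b[:, 1])
    order = np.argsort(-s, kind="stable")  # indices sorted by descending score
    keep: List[int] = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        # Intersection of box i with every remaining box.
        w = np.clip(np.minimum(b[i, 2], b[rest, 2]) - np.maximum(b[i, 0], b[rest, 0]), 0, None)
        h = np.clip(np.minimum(b[i, 3], b[rest, 3]) - np.maximum(b[i, 1], b[rest, 1]), 0, None)
        inter = w * h
        union = np.maximum(areas[i] + areas[rest] - inter, 1e-12)  # avoid division by zero
        # Keep only the boxes whose IoU with box i does not exceed the threshold.
        order = rest[inter / union <= iou_thresh]
    return keep


def fuse_text_boxes(stage1, scores1, stage2, scores2, iou_thresh=0.2):
    """Merge first-stage and (remapped) second-stage text boxes with NMS."""
    boxes = list(stage1) + list(stage2)
    scores = list(scores1) + list(scores2)
    keep = nms(boxes, scores, iou_thresh)
    return [boxes[k] for k in keep], [scores[k] for k in keep]


boxes, scores = fuse_text_boxes(
    [(0, 0, 10, 10)], [0.9],
    [(1, 1, 11, 11), (50, 50, 60, 60)], [0.8, 0.7],
)
print(boxes)  # [(0, 0, 10, 10), (50, 50, 60, 60)]
```

In this toy input, the second-stage box `(1, 1, 11, 11)` overlaps the first-stage box with an IoU of about 0.68, so it is suppressed. The isolated box survives. A low threshold such as 0.2 suppresses aggressively, which suits duplicate detections of the same word produced by the two stages.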
Result
The results show that the proposed method effectively detects small-scale text and is computationally efficient, at 0.33 s/image. On the ICDAR-2013 dataset, its recall, precision, and F-score are 0.817, 0.880, and 0.847, respectively. Compared with RefineDet, our baseline, the proposed method improves recall by 5.5% and F-score by 2.7%. Compared with state-of-the-art methods, it raises recall from 0.780 to 0.817 and F-score from 0.830 to 0.847. In terms of computational efficiency, it reduces the processing time from 2 s/image to 0.33 s/image. Fast TextBoxes has the best computational efficiency among the compared methods. The proposed method is slower than Fast TextBoxes but achieves a higher F-score. In summary, our approach is superior to the compared methods.
Conclusion
This study proposes a text detection method based on a cascaded CNN. The method has two advantages. First, it detects text by way of the real-world objects that contain it. Second, it builds a cascaded CNN model on RefineDet to perform text detection. By exploiting the strong association between text and the objects that contain text in natural scene images, the method improves the recall of text detection. In addition, using RefineDet in the cascade strengthens this association and yields higher detection precision. In conclusion, the proposed cascaded CNN-based method can effectively detect small-scale text in natural scene images.
Bai X, Yang M K, Lyu P Y, Xu Y and Luo J B. 2018. Integrating scene text and visual appearance for fine-grained image classification. IEEE Access, 6:66322-66335[DOI:10.1109/ACCESS.2018.2878899]
Gómez L, Mafla A, Rusiñol M and Karatzas D. 2018. Single shot scene text retrieval//Proceedings of the 15th European Conference on Computer Vision. Munich, Germany: Springer: 728-744[DOI:10.1007/978-3-030-01264-9_43]
He S N, Guo Y J and Zhang L. 2018. Multi-orientation natural scene text detection. Application Research of Computers, 35(7):2193-2196[DOI:10.3969/j.issn.1001-3695.2018.07.066]
He K M, Zhang X Y, Ren S Q and Sun J. 2016. Deep residual learning for image recognition//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, NV, USA: IEEE: 770-778[DOI:10.1109/CVPR.2016.90]
Karatzas D, Shafait F, Uchida S, Iwamura M, Bigorda L G I, Mestre S R, Mas J, Mota D F, Almazan J A and de las Heras L P. 2013. ICDAR 2013 robust reading competition//Proceedings of the 12th International Conference on Document Analysis and Recognition. Washington, DC, USA: IEEE: 1484-1493[DOI:10.1109/ICDAR.2013.221]
Liao M H, Shi B G, Bai X, Wang X G and Liu W Y. 2017. TextBoxes: a fast text detector with a single deep neural network//Proceedings of the 31st AAAI Conference on Artificial Intelligence. San Francisco, California, USA: AAAI
Liu W, Anguelov D, Erhan D, Szegedy C, Reed S, Fu C Y and Berg A C. 2016. SSD: single shot multibox detector//Proceedings of the 14th European Conference on Computer Vision. Amsterdam, The Netherlands: Springer: 21-37[DOI:10.1007/978-3-319-46448-0_2]
Long S B, He X and Yao C. 2018. Scene text detection and recognition: the deep learning era[EB/OL]. 2018-05-22[2019-09-05]. https://arxiv.xilesou.top/pdf/1811.04256.pdf
Mishra A, Alahari K and Jawahar C V. 2013. Image retrieval using textual cues//Proceedings of 2013 IEEE International Conference on Computer Vision. Sydney, Australia: IEEE: 3040-3047[DOI:10.1109/ICCV.2013.378]
Neubeck A and Van Gool L. 2006. Efficient non-maximum suppression//Proceedings of the 18th International Conference on Pattern Recognition. Hong Kong, China: IEEE: 850-855[DOI:10.1109/ICPR.2006.479]
Neumann L and Matas J. 2015. Efficient scene text localization and recognition with local character refinement//Proceedings of the 13th International Conference on Document Analysis and Recognition. Tunis, Tunisia: IEEE: 746-750[DOI:10.1109/ICDAR.2015.7333861]
Neumann L and Matas J. 2011. A method for text localization and recognition in real-world images//Proceedings of the 10th Asian Conference on Computer Vision. Queenstown, New Zealand: Springer: 770-783[DOI:10.1007/978-3-642-19318-7_60]
Ren S Q, He K M, Girshick R and Sun J. 2017. Faster R-CNN:towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(6):1137-1149[DOI:10.1109/TPAMI.2016.2577031]
Rong X J, Yi C C and Tian Y L. 2016. Recognizing text-based traffic guide panels with cascaded localization network//Proceedings of 2016 European Conference on Computer Vision. Amsterdam, The Netherlands: Springer: 109-121[DOI:10.1007/978-3-319-46604-0_8]
Simonyan K and Zisserman A. 2015. Very deep convolutional networks for large-scale image recognition[EB/OL]. 2015-04-10[2019-11-03]. https://arxiv.org/pdf/1409.1556.pdf
Tian Z, Huang W L, He T, He P and Qiao Y. 2016. Detecting text in natural image with connectionist text proposal network//Proceedings of the 14th European Conference on Computer Vision. Amsterdam, The Netherlands: Springer: 56-72[DOI:10.1007/978-3-319-46484-8_4]
Wang K and Belongie S. 2010. Word spotting in the wild//Proceedings of the 11th European Conference on Computer Vision. Heraklion, Greece: Springer: 591-604[DOI:10.1007/978-3-642-15549-9_43]
Wolf C and Jolion J M. 2006. Object count/area graphs for the evaluation of object detection and segmentation algorithms. International Journal of Document Analysis and Recognition (IJDAR), 8(4):280-296[DOI:10.1007/s10032-006-0014-0]
Ye Q X and Doermann D. 2014. Scene text detection via integrated discrimination of component appearance and consensus//Proceedings of the 5th International Workshop on Camera-Based Document Analysis and Recognition. Washington, DC, USA: Springer: 47-59[DOI:10.1007/978-3-319-05167-3_4]
Yi Y H, Shen C H, Liu J H and Lu L Q. 2017. Natural scene text detection method by integrating MSCRs into MSERs. Journal of Image and Graphics, 22(2):154-160[DOI:10.11834/jig.20170202]
Yin X C, Yin X W, Huang K Z and Hao H W. 2014. Robust text detection in natural scene images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(5):970-983[DOI:10.1109/TPAMI.2013.182]
Zeiler M D, Taylor G W and Fergus R. 2011. Adaptive deconvolutional networks for mid and high level feature learning//Proceedings of 2011 International Conference on Computer Vision. Barcelona, Spain: IEEE: 6[DOI:10.1109/ICCV.2011.6126474]
Zhan F N, Lu S J and Xue C H. 2018. Verisimilar image synthesis for accurate detection and recognition of texts in scenes//Proceedings of the 15th European Conference on Computer Vision. Munich, Germany: Springer: 257-273[DOI:10.1007/978-3-030-01237-3_16]
Zhang H G, Zhao K L, Song Y Z and Guo J. 2013. Text extraction from natural scene image:a survey. Neurocomputing, 122:310-323[DOI:10.1016/j.neucom.2013.05.037]
Zhang S F, Wen L Y, Bian X, Lei Z and Li S Z. 2018. Single-shot refinement neural network for object detection//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, UT, USA: IEEE: 4203-4212[DOI:10.1109/CVPR.2018.00442]
Zhang Z, Shen W, Yao C and Bai X. 2015. Symmetry-based text line detection in natural scenes//Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston, MA, USA: IEEE: 2558-2567[DOI:10.1109/CVPR.2015.7298871]
Zhang Z, Zhang C Q, Shen W, Yao C, Liu W and Bai X. 2016. Multi-oriented text detection with fully convolutional networks//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, NV, USA: IEEE: 4159-4167[DOI:10.1109/CVPR.2016.451]