Natural scene text detection that accounts for object association

Yi Yaohua, He Jingjing, Lu Liqiong, Tang Ziwei (School of Printing and Packaging, Wuhan University, Wuhan 430079)

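Abstract
Objective Current text detection methods based on convolutional neural networks (CNNs) have great difficulty localizing small-scale text in natural scenes. However, text in natural scene images is strongly associated with other objects: it usually appears together with specific objects such as billboards and road signs. Based on this observation, this paper proposes a cascaded CNN method for natural scene text detection that takes object association into account. Method First, a CNN detects text and the associated objects that contain text, yielding candidate boxes for both. Next, each candidate box of an associated object is enlarged and cropped from the original image, and the cropped image is fed to the CNN to detect text candidate boxes more precisely. Finally, non-maximum suppression fuses the text candidate boxes produced by the two steps to obtain the text detection result. Result The proposed method detects small-scale text effectively; on the ICDAR-2013 dataset, its recall rate, precision rate, and F-score are 0.817, 0.880, and 0.847, respectively. Conclusion By taking into account the strong association between text and the objects that contain it in natural scenes, the proposed method improves the recall rate of small-scale text detection in natural scene images.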
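As a quick consistency check on the reported figures, the sketch below recomputes the F-score from the reported precision and recall. It assumes that the F-score is the harmonic mean of precision and recall (the usual F1 definition), which the abstract does not state explicitly; the function name `f_score` is illustrative only.

```python
def f_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (assumed F1 definition)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values reported for the ICDAR-2013 dataset in the abstract.
reported_precision = 0.880
reported_recall = 0.817

recomputed = f_score(reported_precision, reported_recall)
print(round(recomputed, 3))
```

Under that assumption, the recomputed value rounds to the reported F-score of 0.847.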
Keywords
Association of text and other objects for text detection with natural scene images

Yi Yaohua, He Jingjing, Lu Liqiong, Tang Ziwei (School of Printing and Packaging, Wuhan University, Wuhan 430079, China)

Abstract
Objective Natural scene images contain abundant text that carries semantic information, which is key to describing and understanding image content. Correct detection of this text is an important preliminary step for computer vision tasks such as image retrieval, image understanding, and intelligent navigation. However, complex environments, flexible image acquisition styles, and varied text content pose many challenges for text detection in natural scene images. The background of a natural scene introduces disturbing factors such as lighting, distortion, and stains. In addition, scene text appears in different colors, fonts, sizes, orientations, and shapes, and its aspect ratios and layouts vary, all of which make detection difficult. Before deep learning, most text detection methods relied on connected-component analysis or sliding-window-based classification. These methods extract low- or mid-level hand-crafted image features and require demanding, repetitive pre- and post-processing steps. Owing to the limitations of hand-crafted features and the complexity of their pipelines, these methods can hardly handle intricate circumstances and therefore have lower precision rates. Recently, detection based on convolutional neural networks (CNNs) has become the mainstream approach to natural scene text detection. However, existing CNN-based methods struggle to detect small-scale text and produce unsatisfactory results. Given the association between text and other objects, this study proposes a cascaded CNN-based method for text detection in natural scene images, with a focus on small-scale text. A strong association between text and other objects becomes apparent when text in natural scene images is examined.
Text is usually attached to man-made objects (e.g., books, computers, and signboards) rather than to natural objects (e.g., water, sky, trees, and grass). Method We propose a cascaded CNN-based text detection method, built on the RefineDet algorithm, that takes the association between text and other objects into account. First, the candidate bounding boxes of text and of objects containing text are detected. Small-scale text usually lies within these objects; thus, detecting the object candidate boxes first can improve the recall rate of text detection. Then, each object candidate bounding box is enlarged, cropped from the original image as a new image, and input to the CNN detector to detect the candidate bounding boxes of text more accurately. Because a candidate bounding box may not completely frame its object, direct cropping would cut off part of the text and degrade text detection in the next step. Therefore, we expand the boundary of each candidate bounding box on each side by 10% of its width. Finally, the non-maximum suppression algorithm fuses the text candidate bounding boxes from the two previous steps to obtain the final detection results. The intersection over union (IOU) threshold used in non-maximum suppression affects text detection; the highest F-score is obtained when the IOU threshold is 20%. We also collected a new, available dataset of objects containing text for training the object detector. This dataset contains 350 and 229 images from the street view text (SVT) and ICDAR-2013 training sets, respectively. All images are manually labeled with tight ground-truth bounding boxes of the object regions. Result The results show that the proposed method can effectively detect small-scale text and is computationally efficient, at 0.33 s/image. The recall rate, precision rate, and F-score on the ICDAR-2013 dataset are 0.817, 0.880, and 0.847, respectively.
Compared with RefineDet, which is our baseline, the proposed method improves the recall rate by 5.5% and the F-score by 2.7%. Compared with state-of-the-art methods, the proposed method raises the recall rate from 0.780 to 0.817 and the F-score from 0.830 to 0.847. In terms of computational efficiency, it reduces the processing time from 2 s/image to 0.33 s/image. Compared with Fast TextBoxes, which has the best computational efficiency, the proposed method is slower but achieves a higher F-score. Overall, the proposed method achieves a higher recall rate and F-score than the compared methods. Conclusion This study proposes a text detection method based on a cascaded CNN. The method has two advantages. First, it can obtain text from real objects. Second, a cascaded CNN model based on RefineDet is established to perform text detection. By exploiting the strong association between text and the objects that contain it in natural scene images, the proposed method improves the recall rate of text detection. In addition, the use of RefineDet strengthens this association and yields a higher precision rate. In conclusion, the proposed cascaded CNN-based method can effectively detect small-scale text in natural scene images.
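The box handling described in the Method part (enlarging each object box by 10% of its width on every side, cropping, and fusing the two sets of text boxes with non-maximum suppression at an IOU threshold of 20%) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the CNN detector itself is omitted, all function names are hypothetical, and it is an assumption that second-stage boxes are reported in crop coordinates at the original resolution, so that mapping them back only requires adding the crop offset.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]         # (x1, y1, x2, y2)
Det = Tuple[float, float, float, float, float]  # (x1, y1, x2, y2, score)


def expand_box(box: Box, img_w: float, img_h: float, ratio: float = 0.10) -> Box:
    """Grow a box on all four sides by ratio * box width, clipped to the image."""
    x1, y1, x2, y2 = box
    margin = ratio * (x2 - x1)
    return (max(0.0, x1 - margin), max(0.0, y1 - margin),
            min(float(img_w), x2 + margin), min(float(img_h), y2 + margin))


def to_image_coords(det: Det, crop: Box) -> Det:
    """Shift a detection from crop coordinates back to image coordinates
    (assumes the crop was not resized before detection)."""
    x1, y1, x2, y2, score = det
    return (x1 + crop[0], y1 + crop[1], x2 + crop[0], y2 + crop[1], score)


def iou(a: Det, b: Det) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def fuse(stage1: List[Det], stage2: List[Det], iou_threshold: float = 0.20) -> List[Det]:
    """Greedy non-maximum suppression over the union of both stages."""
    kept: List[Det] = []
    for det in sorted(stage1 + stage2, key=lambda d: d[4], reverse=True):
        if all(iou(det, k) <= iou_threshold for k in kept):
            kept.append(det)
    return kept


# Made-up example values for a 640 x 480 image.
object_box: Box = (80.0, 80.0, 280.0, 180.0)
crop = expand_box(object_box, 640, 480)
stage1_text = [(100.0, 100.0, 200.0, 140.0, 0.90)]
stage2_in_crop = [(42.0, 41.0, 141.0, 81.0, 0.95),
                  (150.0, 100.0, 190.0, 115.0, 0.80)]
stage2_text = [to_image_coords(d, crop) for d in stage2_in_crop]
fused = fuse(stage1_text, stage2_text)
print(fused)
```

In this example the first-stage box and the first second-stage box cover the same text, so only the higher-scoring one survives, while the small box found only in the cropped image is kept. That second case is the situation in which the cascade is meant to raise recall.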
Keywords
