  • 2021 | Volume  | Number 6


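Summary
Many natural scene images contain rich text, which plays an important role in scene understanding. With the rapid development of mobile Internet technology, many new applications need to exploit this text information, for example signboard recognition and autonomous driving. As a result, the analysis and processing of natural scene text has increasingly become a research hotspot in computer vision; the task mainly comprises text detection and text recognition. Traditional text detection and recognition methods rely on hand-designed features and rules, and their models are complex to design, inefficient, and generalize poorly. In recent years, with the development of deep learning, scene text detection, scene text recognition, and end-to-end scene text detection and recognition have all achieved breakthroughs, with marked improvements in both performance and efficiency. This paper introduces the research background of the field and organizes, classifies, and summarizes recent deep learning-based methods for scene text detection, recognition, and end-to-end detection and recognition, explaining the basic ideas, strengths, and weaknesses of each class of methods. For the methods in each category, it further discusses and analyzes the algorithmic pipelines of the main models, their applicable scenarios, and their lines of technical development. It also lists and describes several mainstream public datasets and compares the performance of the models on representative datasets. Finally, the paper summarizes the limitations of current scene text detection, recognition, and end-to-end detection and recognition algorithms under different scene data, along with future challenges and development trends.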

Abstract
With the rapid development of Internet and mobile Internet technologies, many new applications, such as signboard recognition and autonomous driving, make extensive use of the rich text information in natural scenes. The analysis and processing of scene text therefore plays an essential role in these applications and has become a research hotspot in computer vision. Traditional text detection and recognition methods often rely on manually designed features and suffer from heavy computation and low efficiency. They also generalize poorly to complex scenes. With the development of deep learning in recent years, convolutional neural networks have driven great progress in scene text detection and recognition. Deep learning-based methods outperform traditional ones by a large margin and have become the mainstream approach to reading text in the wild.
For scene text detection, methods can be divided into two categories according to the target they predict: top-down methods and bottom-up methods. Top-down methods mainly inherit the basic idea of general object detection or instance segmentation and directly regress the whole bounding box of a text instance. Bottom-up methods, by contrast, follow the idea of traditional approaches: they first detect components of a text instance and then group them according to a set of rules. Compared with top-down methods, bottom-up methods are more effective at detecting text of arbitrary shape and orientation and are less sensitive to text scale. However, grouping the detected components into text instances requires complex design and processing, which makes the inference stage of bottom-up approaches inefficient. These methods also have difficulty detecting long text, and when text is dense, adjacent instances may stick together (text conglutination). Top-down methods do not have this issue and can achieve higher detection precision.
In recent years, recognizing text in natural scenes, known as scene text recognition (STR), has attracted great interest in academia and industry. The objective of STR is to translate a cropped text instance image into a target character sequence. Although optical character recognition (OCR) for scanned documents is well developed, STR remains challenging because of factors such as complex backgrounds, varied fonts, and imperfect imaging conditions. Early work relies on hand-crafted features, such as histogram of oriented gradients descriptors, connected components, and the stroke width transform, but the performance of these approaches is limited by the low representational capability of such features. With the rise of deep learning, the community has witnessed substantial advances. Deep learning-based STR approaches can be roughly divided into two branches: segmentation-based approaches and segmentation-free approaches. Segmentation-based approaches attempt to locate each character in the input text instance image, apply a character classifier to recognize each character, and then group the characters into text lines to obtain the final recognition result. Segmentation-free approaches treat the text instance image as a whole and map it directly to a target character sequence. Both branches have their advantages and limitations, so practitioners should choose the trade-off that best fits the needs of their application scenario. Although the practicality and efficiency of recognition approaches have improved significantly over the past few decades, ample room remains for future research in areas such as generalization ability, evaluation protocols, and the application scenarios of STR.
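The last step of a segmentation-based recognizer, grouping recognized characters into a text line, can be illustrated with a minimal sketch. It assumes a single horizontal text line and assumes that character localization and classification have already produced the detections; the `CharDetection` type, the `group_characters` function, and the `min_score` threshold are hypothetical names introduced here for illustration and do not correspond to any specific method covered by the survey.

```python
from dataclasses import dataclass


@dataclass
class CharDetection:
    """Hypothetical output of a character locator plus classifier."""
    x_center: float  # horizontal center of the character box, in pixels
    label: str       # character predicted by the classifier
    score: float     # classifier confidence in [0, 1]


def group_characters(detections, min_score=0.5):
    """Drop low-confidence characters and read the rest left to right.

    Assumption: the crop holds one horizontal text line, so sorting by
    x_center recovers the reading order.
    """
    kept = [d for d in detections if d.score >= min_score]
    kept.sort(key=lambda d: d.x_center)
    return "".join(d.label for d in kept)


# Unordered detections, including one spurious low-confidence character.
detections = [
    CharDetection(52.0, "O", 0.91),
    CharDetection(10.0, "S", 0.97),
    CharDetection(31.0, "T", 0.88),
    CharDetection(40.0, "I", 0.20),  # filtered out by min_score
    CharDetection(73.0, "P", 0.95),
]
print(group_characters(detections))  # STOP
```

Curved or multi-line text would need a more elaborate grouping rule than a left-to-right sort, which is one reason segmentation-free approaches map the whole image to a sequence instead.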
Finally, end-to-end scene text spotting aims to combine text detection and text recognition into a unified system that can be optimized as a single pipeline. How to bridge the gap between the detection branch and the recognition branch is the central problem in designing an end-to-end text spotting system. As with general object detection and instance segmentation, end-to-end text spotting methods can be divided into two categories: two-stage methods and one-stage methods. Two-stage methods are mainly based on Faster R-CNN and Mask R-CNN, in which RoI Pooling/Align acts as the bridge between the two branches. These operations may lose information, however, because the region proposals from the region proposal network (RPN) are not accurate enough. One-stage methods follow a detect-then-recognize pipeline, with various feature-alignment operations carefully designed to strengthen the link between the detection and recognition branches. In this paper, we sort out and summarize scene text detection and recognition methods, and further elaborate and analyze the basic ideas of the various methods along with their pros and cons. We hope that the paper can serve as a reference for researchers and help in future work.