Abstract
Frontiers of Intelligent Document Analysis and Recognition: Review and Prospects

Cheng-Lin Liu, Lianwen Jin, Xiang Bai, Xiao-Hui Li, Fei Yin (South China University of Technology; Huazhong University of Science and Technology; Institute of Automation, Chinese Academy of Sciences)

Document analysis and recognition (document recognition for short) aims to convert unstructured documents (typically document images and online handwriting) into structured data that computers can process and understand. Because documents are communicated and used so pervasively, the technology is needed in a wide range of applications. The field has attracted intensive attention and made enormous progress in research and applications since the 1960s. In particular, the recent development of deep learning has remarkably improved the performance of document recognition over traditional methods, and the technology has been applied successfully to document digitization, form processing, handwriting input, intelligent transportation, document retrieval and information extraction. In this article, we first introduce the background and the techniques involved in document recognition, give an overview of the research history (divided into four periods according to the objects of research, the methods and the applications), and then review the main research progress, with emphasis on the deep learning based methods developed in recent years. After identifying the insufficiencies of current technology, we finally suggest some important issues for future research. The review of recent progress is divided into sections corresponding to the main processing steps, namely image pre-processing, layout analysis, scene text detection, text recognition, structured symbol and graphics recognition, and document retrieval and information extraction.
(1) Due to the popularity of camera-captured document images, the main current task in image pre-processing is the rectification of distorted images, while binarization remains a concern. Recent methods are mostly end-to-end deep learning based transformation methods. (2) Layout analysis is divided into physical layout analysis (page segmentation) and logical layout analysis (semantic region segmentation and reading order prediction). Recent page segmentation methods based on fully convolutional networks (FCNs) or graph neural networks (GNNs) have shown promise. Logical layout analysis has been addressed by deep neural networks that fuse multi-modal information. Table structure analysis is a special task of layout analysis and has been studied intensively in recent years. (3) Scene text detection is a hot topic in the document analysis and computer vision fields. Deep learning based methods for text detection can be divided into regression-based methods, segmentation-based methods and hybrid methods. FCNs are prevalently used to extract visual features, on which models are built to predict text regions. (4) Text recognition is the core task in document analysis. We review recent work on handwritten text recognition and scene text recognition, which share some common strategies but also show different preferences. There are two main streams of methods: segmentation-based methods and sequence-to-sequence learning methods. The convolutional recurrent neural network (CRNN) model has received much attention in recent years and is being extended in terms of encoding, decoding and learning strategies, while segmentation-based methods combined with deep learning still perform competitively. A noteworthy trend is the extension of text line recognition to page-level recognition. Following text recognition, we also review work on end-to-end scene text recognition (also called text spotting), in which text detection and recognition models are learned jointly.
(5) Among the symbols and graphics in documents, mathematical expressions and flowcharts have received increasing attention. Recent methods for mathematical expression recognition are mostly image-to-markup generation methods using encoder-decoder models, while graph-based methods are promising because they produce both recognition and segmentation results. Flowchart recognition is addressed using structured prediction models such as GNNs. (6) Document retrieval was mainly concerned with keyword spotting in the pre-deep-learning era, while recent work focuses on information extraction (spotting semantic entities) by fusing layout and language information. Pre-trained layout and multi-modal language models are showing promise, although visual information is not yet considered adequately. Overall, the recent progress shows that the objects of recognition have expanded in breadth and depth, the methods increasingly rely on deep neural networks and deep learning, recognition performance is improving constantly, and the technology is being applied to a wide range of scenarios. The review also reveals the insufficiencies of current technology in accuracy and reliability on various tasks, as well as in interpretability, learning ability and adaptability. Future work is suggested in three respects: performance improvement, application extension and improved learning. Issues of performance improvement include the reliability of recognition, interpretability, omni-element recognition, long-tailed recognition, multi-lingual documents, complex layout analysis and understanding, and recognition of distorted documents. Issues related to applications include new applications (such as robotic process automation (RPA), text transcription in natural scenes, and archeology) and new technical problems arising in applications (such as semantic information extraction, cross-modal fusion, and reasoning and decision-making related to application scenes).
To improve automatic system design, learning ability and adaptability, the learning problems and methods involved include small sample learning, transfer learning, multi-task learning, domain adaptation, structured prediction, weakly-supervised learning, self-supervised learning, open set learning, and cross-modal learning.