Deep learning methods for visual information extraction: a critical review

Lin Zening, Wang Jiapeng, Jin Lianwen (School of Electronics and Information Engineering, South China University of Technology, Guangzhou 510640, China)

Abstract
As information exchange becomes increasingly frequent, large numbers of documents are digitized and then stored and distributed as images. Many practical applications, such as receipt recognition and understanding, card recognition, automatic exam scoring, and document matching, need to obtain text content of specific categories from document images. This process is known as visual information extraction (VIE), which aims to mine, analyze, and extract information of designated categories contained in visually rich document images. With the rapid development of deep learning, many VIE algorithms with excellent performance and efficient pipelines have been proposed on the basis of this technology and deployed at scale in real-world business, effectively overcoming the slow speed and low accuracy of earlier manual processing and greatly improving productivity. This paper surveys the deep-learning-based information extraction methods and public datasets proposed in recent years, and organizes, categorizes, and summarizes them. First, we introduce the research background of visual information extraction and describe the main difficulties in this field. Second, according to the principal characteristics of the algorithms, we present the processing pipelines and technical development routes of the main models in each category, and summarize their respective advantages, disadvantages, and applicable scenarios. Next, we describe the content and characteristics of the mainstream public datasets and several commonly used evaluation metrics, and compare the performance of representative models on these datasets. Finally, we summarize the characteristics and limitations of each class of methods and discuss the challenges and development trends that the field of visual information extraction will face in the future.


With the growth of information exchange, huge numbers of documents are digitized, stored, and distributed as images. Many application scenarios, such as receipt understanding, card recognition, automatic paper scoring, and document matching, require key information of specific categories to be obtained from document images. This process is called visual information extraction (VIE), which focuses on mining, analyzing, and extracting the designated information contained in visually rich documents.

VIE faces several difficulties. The text in documents is diverse and varied, and multi-language documents are common in addition to single-language ones. The text corpus also differs from field to field; for example, the content of legal files and that of medical documents must be handled differently. Complex layouts arise when a document contains a variety of visual elements such as pictures, tables, and statistical curves. In addition, document images are often degraded by ink stains, wrinkles, geometric distortion, and uneven illumination, which makes them hard to read.

A complete VIE pipeline can be divided into four steps. First, a pre-processing algorithm removes interference and noise through correction and denoising. Second, text detection and recognition methods extract the text strings in the document image together with their locations. Third, multimodal feature extraction performs high-level encoding and fusion of the textual, layout, and visual features contained in the visually rich document. Finally, entity category parsing determines the category of each entity. Existing methods mainly focus on the latter two steps, although some also take text detection and recognition into account.

Early works queried key information with hand-crafted rules. These rule-based algorithms are not very effective and generalize poorly. Deep-learning-based feature extractors such as convolutional neural networks and Transformers learn deep features and markedly improve both performance and efficiency, and in recent years deep learning methods have been widely applied in real scenarios.

In this paper, we review the deep-learning-based VIE methods and public datasets proposed in recent years and classify the algorithms by their main characteristics. Recent methods can be roughly grouped into six categories: grid-based, graph-neural-network-based (GNN-based), Transformer-based, end-to-end, few-shot, and other methods.

Grid-based methods treat the document image as a two-dimensional matrix: the pixels inside each text bounding box are filled with the corresponding text embedding, and the resulting grid representation is then processed by a deep network. Grid-based methods are simple and computationally cheap, but their representation ability is limited, and the features of small text regions may not be fully exploited.

GNN-based methods take text segments as graph nodes and encode the relations between segment coordinates as edge representations; graph convolution operations are then applied for further feature extraction. GNN-based schemes achieve a good balance between cost and performance, but characteristics of GNNs themselves, such as over-smoothing and gradient vanishing, often make the models difficult to train.
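To make the GNN-based formulation above concrete, the following is a minimal sketch of how a document graph could be built from OCR output and processed by one edge-conditioned message-passing layer. It is an illustration only, not the pipeline of any specific surveyed model; the module names, feature dimensions, and the attention-style aggregation are assumptions.

```python
# Illustrative sketch: nodes = OCR text segments, edges = pairwise box geometry.
import torch
import torch.nn as nn

class SegmentEncoder(nn.Module):
    """Encode one text segment (token ids + box) into a node feature vector."""
    def __init__(self, vocab_size=30522, dim=256):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, dim)
        self.box_proj = nn.Linear(4, dim)  # (x0, y0, x1, y1), normalized to [0, 1]

    def forward(self, token_ids, box):
        text_feat = self.tok_emb(token_ids).mean(dim=0)   # average token embeddings
        return text_feat + self.box_proj(box)             # fuse text and layout

def edge_features(boxes):
    """Pairwise geometric relations: center offsets and size ratios."""
    centers = (boxes[:, :2] + boxes[:, 2:]) / 2            # (N, 2)
    sizes = boxes[:, 2:] - boxes[:, :2]                     # (N, 2)
    dxdy = centers[:, None, :] - centers[None, :, :]        # (N, N, 2)
    ratio = sizes[:, None, :] / (sizes[None, :, :] + 1e-6)  # (N, N, 2)
    return torch.cat([dxdy, ratio], dim=-1)                 # (N, N, 4)

class GraphLayer(nn.Module):
    """One round of message passing with edge-conditioned attention weights."""
    def __init__(self, dim=256):
        super().__init__()
        self.edge_mlp = nn.Sequential(nn.Linear(2 * dim + 4, dim), nn.ReLU(), nn.Linear(dim, 1))
        self.update = nn.Linear(2 * dim, dim)

    def forward(self, nodes, edges):
        n = nodes.size(0)
        pair = torch.cat([nodes[:, None, :].expand(n, n, -1),
                          nodes[None, :, :].expand(n, n, -1), edges], dim=-1)
        attn = torch.softmax(self.edge_mlp(pair).squeeze(-1), dim=-1)  # (N, N)
        msg = attn @ nodes                                              # aggregate neighbors
        return torch.relu(self.update(torch.cat([nodes, msg], dim=-1)))

# Toy usage on a document with two OCR segments; the output node features
# would then be fed to an entity classification head.
enc, layer = SegmentEncoder(), GraphLayer()
boxes = torch.tensor([[0.1, 0.1, 0.4, 0.15], [0.5, 0.1, 0.9, 0.15]])
nodes = torch.stack([enc(torch.tensor([101, 2054]), boxes[0]),
                     enc(torch.tensor([101, 7592]), boxes[1])])
out = layer(nodes, edge_features(boxes))   # (2, 256) updated node features
```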
Transformer-based methods achieve outstanding performance through pre-training on vast amounts of data. They generalize well, can be applied to multiple scenarios, and extend naturally to other document understanding tasks. However, these models are computationally expensive and demand considerable computing resources; more efficient architectures and pre-training strategies remain open problems (a minimal illustrative sketch of this layout-aware token-tagging formulation is given at the end of this abstract).

VIE is not an isolated process: text detection and recognition, that is, optical character recognition (OCR), are needed as prerequisites, and OCR problems such as coordinate mismatches and text recognition errors propagate to the following steps. End-to-end paradigms couple the two stages so that OCR and information extraction can benefit from each other, and they alleviate the accumulation of OCR errors to some extent.

Few-shot methods introduce structures that efficiently enhance the generalization ability of models, exploiting the intrinsic features of documents with only a small number of training samples.

In this review, we first describe the growth of this research domain and its remaining challenges. Then, recent deep-learning-based visual information extraction methods are summarized and analyzed: the methods are grouped into the categories above, and the algorithm flow and technical development route of the representative models in each category are discussed. Additionally, the characteristics of the main public datasets are described, together with a comparison of the performance of representative models on these benchmarks. Finally, the highlights and limitations of each kind of method are laid out, and future research directions are discussed.
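As a complement to the taxonomy above, the following is a minimal, illustrative sketch of the Transformer-based formulation in which serialized OCR tokens and their layout coordinates are jointly encoded and each token is assigned an entity category. The embedding scheme, dimensions, and tag set are assumptions for illustration, not the design of any specific surveyed model, and pre-training is omitted.

```python
# Illustrative sketch: text embeddings + 2-D layout embeddings -> Transformer
# encoder -> per-token entity tags (e.g., BIO labels).
import torch
import torch.nn as nn

class LayoutAwareTagger(nn.Module):
    def __init__(self, vocab_size=30522, dim=256, num_tags=7, num_coord_bins=1000):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, dim)
        # Separate embeddings for the discretized box coordinates (x0, y0, x1, y1).
        self.x_emb = nn.Embedding(num_coord_bins, dim)
        self.y_emb = nn.Embedding(num_coord_bins, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(dim, num_tags)   # per-token entity tag logits

    def forward(self, token_ids, boxes):
        # token_ids: (B, T); boxes: (B, T, 4), coordinates bucketed to [0, 1000).
        layout = (self.x_emb(boxes[..., 0]) + self.y_emb(boxes[..., 1]) +
                  self.x_emb(boxes[..., 2]) + self.y_emb(boxes[..., 3]))
        h = self.encoder(self.tok_emb(token_ids) + layout)   # fuse text and layout
        return self.head(h)                                   # (B, T, num_tags)

# Toy forward pass over two tokens of one document.
model = LayoutAwareTagger()
logits = model(torch.tensor([[101, 2054]]),
               torch.tensor([[[120, 80, 260, 110], [300, 80, 420, 110]]]))
tags = logits.argmax(-1)   # predicted entity category per token
```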
Keywords
