Deep learning methods for visual information extraction: a critical review
2023, Vol. 28, No. 8, pp. 2276-2297
Print publication date: 2023-08-16
DOI: 10.11834/jig.220904
Lin Zening, Wang Jiapeng and Jin Lianwen. 2023. Deep learning methods for visual information extraction: a critical review. Journal of Image and Graphics, 28(08): 2276-2297
With the increasing frequency of information exchange, large numbers of documents are digitized and then stored and distributed in image format. In daily life and work, many application scenarios, such as receipt recognition and understanding, card recognition, automatic grading, and document matching, require obtaining text content of a specific category from document images. This process is visual information extraction, which aims to mine, analyze, and extract information of specified categories contained in visually rich document images. With the rapid development of deep learning, many VIE algorithms with excellent performance and efficient pipelines have been proposed on this basis and applied at scale in real business, effectively solving the slowness and low accuracy of earlier manual processing and greatly improving productivity. This paper surveys the deep-learning-based information extraction methods and public datasets proposed in recent years and organizes, classifies, and summarizes them. First, the research background of visual information extraction is introduced and the difficulties of the field are described. Second, according to the main characteristics of the algorithms, the algorithm flow and technical development route of the main models in each category are introduced, and their respective advantages, disadvantages, and applicable scenarios are summarized. Then, the content and characteristics of the mainstream public datasets and some commonly used evaluation metrics are introduced, and the performance of representative models on common datasets is compared. Finally, the characteristics and limitations of each kind of method are summarized, and the future challenges and development trends of the visual information extraction field are discussed.
With the increasing frequency of information exchange, huge numbers of documents are digitized, stored, and distributed as images. Many application scenarios, such as receipt understanding, card recognition, automatic paper scoring, and document matching, require obtaining key information from document images. This process is called visual information extraction (VIE), which focuses on mining, analyzing, and extracting information of specified categories from visually rich documents. The task faces several challenges. The text in documents is diverse and varied, and multi-language documents commonly appear alongside single-language ones. Furthermore, the text corpus differs from field to field; for example, legal files and medical documents contain very different text content. A complex layout may arise when a variety of visual elements is involved in a document, such as pictures, tables, and statistical curves. Document images are also often degraded by noise such as ink stains, wrinkles, distortion, and uneven illumination, which can make them hard to read. A complete visual information extraction pipeline can be divided into four steps. First, a pre-processing algorithm removes interference and noise through correction and denoising. Second, text detection and recognition methods extract the text strings in the document image together with their locations. Third, multimodal feature extraction performs high-level encoding and fusion of the text, layout, and visual features contained in the visually rich document. Finally, entity category parsing determines the category of each entity. Existing methods mainly focus on the latter two steps, while some also take text detection and recognition into account. Early works queried key information manually via rule-based methods.
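The four-stage pipeline described above can be sketched as the composition of four functions. This is only an illustrative skeleton: every function, data structure, and output below is hypothetical, and the OCR and entity-parsing stages are stubbed with fixed toy results purely to show how the stages connect.

```python
from dataclasses import dataclass
from typing import List, Tuple

# One OCR text segment: the recognized string plus its bounding
# box (x0, y0, x1, y1) on the document image.
@dataclass
class Segment:
    text: str
    box: Tuple[int, int, int, int]

def preprocess(image):
    # Stage 1: correction and denoising (deskew, dewarp, illumination) -- stubbed.
    return image

def detect_and_recognize(image) -> List[Segment]:
    # Stage 2: text detection and recognition -- stubbed with a fixed toy output.
    return [Segment("Total:", (10, 200, 60, 215)),
            Segment("$42.50", (70, 200, 120, 215))]

def extract_features(segments: List[Segment]):
    # Stage 3: fuse text, layout, and visual features. Here a trivial
    # feature (normalized box center) stands in for a learned
    # multimodal embedding.
    return [((s.box[0] + s.box[2]) / 2, (s.box[1] + s.box[3]) / 2)
            for s in segments]

def parse_entities(segments: List[Segment], feats):
    # Stage 4: assign an entity category to each segment -- a rule stub
    # standing in for a learned classifier.
    return [(s.text, "total_key" if s.text.endswith(":") else "total_value")
            for s in segments]

def vie_pipeline(image):
    image = preprocess(image)
    segments = detect_and_recognize(image)
    feats = extract_features(segments)
    return parse_entities(segments, feats)
```

In a real system each stage would be a trained model; the point of the sketch is only that the stages run sequentially, so errors in early stages (notably OCR) propagate downstream.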
The effectiveness of these rule-based algorithms is low, and they generalize poorly. Emerging deep-learning-based feature extractors such as convolutional neural networks and Transformers learn deep features that greatly improve both performance and efficiency, and in recent years deep-learning-based methods have been widely applied in real scenarios. This paper reviews the deep-learning-based VIE methods and public datasets proposed in recent years and classifies the algorithms by their main characteristics. Recent deep-learning-based VIE methods can be roughly divided into six categories: grid-based, graph-neural-network-based (GNN-based), Transformer-based, end-to-end, few-shot, and others. Grid-based methods treat the document image as a two-dimensional matrix: the pixels inside each text bounding box are filled with the text embedding, forming a grid representation for deep processing. Grid-based methods are simple and incur little computational cost, but their representation ability is limited, and the features of small text regions may not be fully exploited. GNN-based methods take text segments as graph nodes and encode the relations between segment coordinates as edge representations, then apply graph-convolution operations for further feature extraction. GNN-based schemes achieve a good balance between cost and performance, but characteristics of GNNs themselves, such as over-smoothing and gradient vanishing, often make the models hard to train. Transformer-based methods achieve outstanding performance through pre-training on vast amounts of data. They have powerful generalizability, can be applied to multiple scenarios, and extend readily to other document understanding tasks.
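The grid representation used by grid-based methods (in the style of Chargrid) can be illustrated with a small NumPy sketch. The embedding table, box coordinates, and grid dimensions below are all toy assumptions; a real model would use learned character or word embeddings and image-resolution grids.

```python
import numpy as np

def build_grid(segments, vocab, height, width, embed_dim):
    """Chargrid-style encoding: every pixel inside a segment's bounding
    box is filled with that segment's embedding vector; pixels outside
    any box (background) stay zero."""
    rng = np.random.default_rng(0)
    # Toy embedding table standing in for learned text embeddings.
    table = {tok: rng.normal(size=embed_dim) for tok in sorted(vocab)}
    grid = np.zeros((height, width, embed_dim))
    for text, (x0, y0, x1, y1) in segments:
        grid[y0:y1, x0:x1] = table[text]  # fill the box region
    return grid

# Two toy text segments with (x0, y0, x1, y1) boxes on an 8x12 page.
segments = [("Invoice", (2, 1, 6, 3)), ("No.2024", (7, 1, 10, 3))]
grid = build_grid(segments, {"Invoice", "No.2024"},
                  height=8, width=12, embed_dim=4)
```

The resulting `grid` can then be fed to a fully convolutional network (e.g., a segmentation head) that predicts an entity category per pixel, which is why small text regions, occupying few pixels, contribute little signal.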
However, these models are often computationally expensive, and their demand for computing resources needs to be reduced; more efficient architectures and pre-training strategies remain a challenging open problem. VIE and optical character recognition (OCR) form a mutually beneficial process, with text detection and recognition needed as prerequisites. OCR problems such as coordinate mismatches and text recognition errors also affect the subsequent steps. End-to-end paradigms can mitigate this OCR error accumulation to some extent. Few-shot methods use structures designed to enhance the generalization ability of models efficiently, so that intrinsic features can be exploited from only a small number of samples. This review is organized as follows. First, the growth of this research domain is reviewed and its challenges are summarized. Then, recent deep-learning-based visual information extraction methods are summarized and analyzed by category, and the algorithm flow and technical development route of the representative models are further discussed. Additionally, the features of several public datasets are described, together with the performance of representative models on these benchmarks. Finally, the research highlights and limitations of each kind of model are laid out, and future research directions are forecast.
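Benchmark comparisons of VIE models commonly report entity-level F1. A minimal sketch of such a metric, under the common assumption that a predicted entity counts as correct only when both its category and its text exactly match a ground-truth entity (the sample entities below are made up):

```python
from collections import Counter

def entity_f1(predictions, ground_truth):
    """Entity-level micro F1 over (category, text) pairs. Counter
    intersection gives the multiset of exactly matched entities."""
    pred, gold = Counter(predictions), Counter(ground_truth)
    tp = sum((pred & gold).values())
    precision = tp / max(sum(pred.values()), 1)
    recall = tp / max(sum(gold.values()), 1)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

preds = [("company", "ACME Ltd."), ("date", "2023-08-16"), ("total", "42.50")]
golds = [("company", "ACME Ltd."), ("date", "2023-08-16"), ("total", "43.50")]
score = entity_f1(preds, golds)  # 2 of 3 match, so precision = recall = F1 = 2/3
```

Exact-match scoring is why OCR recognition errors hurt VIE benchmarks directly: an entity whose category is right but whose text has a single misrecognized character is counted as wrong.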
Keywords: visual information extraction (VIE); document image analysis and understanding; computer vision; natural language processing; optical character recognition (OCR); deep learning; survey