A survey of text-centric image understanding techniques

Zhang Yan; Li Qiang; Shen Huawen; Zeng Gangyan; Zhou Yu; Ma Can; Zhang Yuan; Wang Weiping

doi:10.11834/jig.220968

Intelligent Processing and Recognition of Document Images


Text-centric image analysis techniques: a critical review
2023, Vol. 28, No. 8, pp. 2253-2275
Received: 2022-09-26
Revised: 2022-12-23
Published in print: 2023-08-16

Zhang Yan, Li Qiang, Shen Huawen, Zeng Gangyan, Zhou Yu, Ma Can, Zhang Yuan, Wang Weiping. 2023. Text-centric image analysis techniques: a critical review. Journal of Image and Graphics, 28(08): 2253-2275. DOI: 10.11834/jig.220968.

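Abstract (translated from the Chinese)

Text is ubiquitous in document images and natural scene images, and it carries rich and critical semantic information. With the development of deep learning, researchers are no longer satisfied with merely obtaining the text content of an image and increasingly focus on understanding that text, so text-centric image understanding has attracted growing attention. The technique aims to understand text images thoroughly by exploiting multimodal information such as text and visual objects. It is an interdisciplinary research direction between computer vision and natural language processing and has considerable practical significance. This paper surveys representative text-centric image understanding tasks and, according to the level of understanding and cognition required, divides them into two categories: the first only requires a model to extract information, while the second additionally requires a degree of analysis and reasoning. The paper reviews the datasets, evaluation metrics, and classical methods involved in these tasks, compares and analyzes them, and identifies open problems in related work and future trends, in the hope of providing a reference for subsequent research.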

Abstract

Text is one of the key carriers of information, and in digital media it appears widely in both document images and scene images. To extract and analyze the information in such images automatically, conventional research has focused mainly on text extraction techniques such as scene text detection and recognition. However, recognizing and analyzing the semantic information of text-centric images, a task downstream of text spotting, remains challenging because it is difficult to fully leverage multimodal features from both vision and language. Text-centric image understanding has therefore become an emerging research topic, and many related tasks have been proposed. For example, visual information extraction pulls specified content out of a given image and can be used to improve productivity in finance, social media, and other fields. In this paper, we introduce five representative text-centric image understanding tasks and survey them systematically. According to the level of understanding required, these tasks fall broadly into two categories. The first category requires the basic ability to extract and distinguish information and includes visual information extraction and scene text retrieval. The second category requires, in addition to this basic ability, high-level semantic understanding such as information aggregation and logical reasoning. With progress in deep learning and multimodal learning, the second category has recently attracted considerable attention. For this category, the survey mainly covers document visual question answering, scene text visual question answering, and scene text image captioning.

Over the past few decades, text-centric image understanding techniques have developed through several stages. Earlier approaches are based on heuristic rules and may use only unimodal features. Deep learning methods now dominate the area, and multimodal features are exploited to improve performance. More specifically, traditional visual information extraction depends on pre-defined templates or task-specific rules. Traditional text retrieval tends to represent words with pyramid histograms of characters and to predict the matching image according to the distance between representations. Earlier document visual question answering and scene text visual question answering approaches extend the conventional visual question answering framework by simply adding an optical character recognition branch to extract text information. Because integrating knowledge from multimodal signals helps to understand images better, graph neural networks and Transformer-based frameworks have recently been used to fuse multimodal features. Furthermore, self-supervised pre-training schemes are applied to learn the alignment between modalities, which boosts model capability by a large margin. For each text-centric image understanding task, we summarize the classical methods and discuss their strengths and weaknesses.

We also discuss open problems and directions for further research. First, because the features of the different modalities are complex, with variable layouts and diverse fonts for example, current deep learning architectures still fail to model the interaction of multimodal information efficiently. Second, existing text-centric image understanding methods remain limited in reasoning abilities such as counting, sorting, and arithmetic. For instance, in document visual question answering and scene text visual question answering, current models have difficulty predicting accurate answers when a question requires joint reasoning over image layout, textual content, and visual elements. Finally, current text-centric understanding tasks are usually trained independently, and the correlation between tasks has not been leveraged effectively. We hope this survey helps researchers follow the latest progress in text-centric image understanding and inspires the design of more advanced models and algorithms.
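The traditional retrieval scheme mentioned in the abstract (words represented as pyramid histograms of characters, images matched by representation distance) can be illustrated with a minimal sketch. It is not taken from the surveyed methods: the alphabet, the pyramid levels, the squared Euclidean distance, and the helper names `phoc` and `rank_images` are all assumptions made for illustration. In practice the descriptor of a word in an image is typically predicted from pixels by a learned model, whereas here the words in each image are assumed to be already known.

```python
import string
from typing import Dict, List, Sequence

# Assumed alphabet: lowercase letters and digits (36 symbols).
ALPHABET = string.ascii_lowercase + string.digits


def phoc(word: str, levels: Sequence[int] = (1, 2, 3)) -> List[int]:
    """Binary pyramid histogram of characters for a word (illustrative)."""
    word = "".join(ch for ch in word.lower() if ch in ALPHABET)
    n = len(word)
    vec: List[int] = []
    for level in levels:
        for region in range(level):
            hist = [0] * len(ALPHABET)
            # Work in integer units of 1 / (n * level) to avoid float rounding.
            r_lo, r_hi = region * n, (region + 1) * n
            for i, ch in enumerate(word):
                c_lo, c_hi = i * level, (i + 1) * level
                overlap = min(r_hi, c_hi) - max(r_lo, c_lo)
                # A character belongs to a region if at least half of its
                # span (of length `level` in these units) lies inside it.
                if 2 * overlap >= level:
                    hist[ALPHABET.index(ch)] = 1
            vec.extend(hist)
    return vec


def sq_distance(a: Sequence[int], b: Sequence[int]) -> int:
    """Squared Euclidean distance between two descriptors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def rank_images(query: str, image_words: Dict[str, List[str]]) -> List[str]:
    """Rank image ids by the closest word descriptor to the query."""
    q = phoc(query)

    def score(image_id: str) -> float:
        return min(
            (sq_distance(q, phoc(w)) for w in image_words[image_id]),
            default=float("inf"),  # images without text rank last
        )

    return sorted(image_words, key=score)


if __name__ == "__main__":
    gallery = {"a.jpg": ["hotel", "open"], "b.jpg": ["coffee", "shop"], "c.jpg": []}
    print(rank_images("coffee", gallery))
```

An image is scored by its best-matching word, so a single exact match is enough to rank it first; other aggregation rules are equally possible.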

Keywords

references

Agarwal A and Lavie A . 2008 . METEOR， M-BLEU and M-TER： evaluation metrics for high-correlation with human rankings of machine translation output // Proceedings of the 3rd Workshop on Statistical Machine Translation . Columbus， USA ： Association for Computational Linguistics： 115 - 118 ［ DOI： 10.3115/1626394.1626406 http://dx.doi.org/10.3115/1626394.1626406 ］

Anand M ， Karteek A and Jawahar C V . 2013 . Image retrieval using textual cues // Proceedings of 2013 ICCV .［s.l.］：［s.n.］

Almaz􀆦n J ， Gordo A ， Fornés A and Valveny E . 2014 . Word spotting and recognition with embedded attributes . IEEE Transactions on Pattern Analysis and Machine Intelligence ， 36 （ 12 ）： 2552 - 2566 ［ DOI： 10.1109/TPAMI.2014.2339814 http://dx.doi.org/10.1109/TPAMI.2014.2339814 ］

Anderson P ， Fernando B ， Johnson M and Gould S . 2016 . SPICE： semantic propositional image caption evaluation // Proceedings of the 14th European Conference on Computer Vision . Amsterdam， the Netherlands ： Springer： 382 - 398 ［ DOI： 10.1007/978-3-319-46454-1_24 http://dx.doi.org/10.1007/978-3-319-46454-1_24 ］

Appalaraju S ， Jasani B ， Kota B U ， Xie Y S and Manmatha R . 2021 . DocFormer： end-to-end transformer for document understanding // Proceedings of 2021 IEEE/CVF International Conference on Computer Vision . Montreal， Canada ： IEEE： 973 - 983 ［ DOI： 10.1109/ICCV48922.2021.00103 http://dx.doi.org/10.1109/ICCV48922.2021.00103 ］

Biten A F ， Litman R ， Xie Y S ， Appalaraju S and Manmatha R . 2021 . LaTr： layout-aware transformer for scene-text VQA ［EB/OL］. ［ 2022-09-10 ］. https://arxiv.org/pdf/2112.12494.pdf https://arxiv.org/pdf/2112.12494.pdf

Biten A F ， Tito R ， Mafla A ， Gomez L ， Rusiñol M ， Jawahar C V ， Valveny E and Karatzas D . 2019 . Scene text visual question answering // Proceedings of 2019 IEEE/CVF International Conference on Computer Vision . Seoul， Korea （South）： IEEE： 4290 - 4300 ［ DOI： 10.1109/ICCV.2019.00439 http://dx.doi.org/10.1109/ICCV.2019.00439 ］

Carbonell M ， Riba P ， Villegas M ， Fornés A and Lladós J . 2020 . Named entity recognition and relation extraction with graph neural networks in semi structured documents // Proceedings of the 25th International Conference on Pattern Recognition . Milan， Italy ： IEEE： 9622 - 9627 ［ DOI： 10.1109/ICPR48806.2021.9412669 http://dx.doi.org/10.1109/ICPR48806.2021.9412669 ］

Chu X X ， Tian Z ， Zhang B ， Wang X L and Shen C H . 2021 . Conditional positional encodings for vision transformers ［EB/OL］. ［ 2022-09-10 ］. https://arxiv.org/pdf/2102.10882.pdf https://arxiv.org/pdf/2102.10882.pdf

Cui L ， Xu Y H ， Lv T C and Wei F R . 2021 . Document AI： benchmarks， models and applications ［EB/OL］. ［ 2022-09-10 ］. https://arxiv.org/pdf/2111.08609.pdf https://arxiv.org/pdf/2111.08609.pdf

Dai Z H ， Yang Z L ， Yang Y M ， Carbonell J ， Le Q and Salakhutdinov R . 2019 . Transformer-XL： attentive language models beyond a fixed-length context // Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics . Florence， Italy ： Association for Computational Linguistics： 2978 - 2988 ［ DOI： 10.18653/v1/P19-1285 http://dx.doi.org/10.18653/v1/P19-1285 ］

Denk T I and Reisswig C . 2019 . BERTgrid： contextualized embedding for 2D document representation and understanding ［EB/OL］. ［ 2022-09-10 ］. https://arxiv.org/pdf/1909.04948.pdf https://arxiv.org/pdf/1909.04948.pdf

Devlin J ， Chang M W ， Lee K and Toutanova K . 2019 . BERT： pre-training of deep bidirectional transformers for language understanding // Proceedings of 2019 Conference of the North American Chapter of the Association for Computational Linguistics： Human Language Technologies， Volume 1 （Long and Short Papers） . Minneapolis， Minnesota， USA ： Association for Computational Linguistics： 4171 - 4186 ［ DOI： 10.18653/v1/N19-1423 http://dx.doi.org/10.18653/v1/N19-1423 ］

Dosovitskiy A ， Beyer L ， Kolesnikov A ， Weissenborn D ， Zhai X H ， Unterthiner T ， Dehghani M ， Minderer M ， Heigold G ， Gelly S ， Uszkoreit J and Houlsby N . 2021 . An image is worth 16 × 16 words： transformers for image recognition at scale ［EB/OL］. ［ 2022-09-10 ］. https://arxiv.org/pdf/2010.11929.pdf https://arxiv.org/pdf/2010.11929.pdf

Gao C Y ， Zhu Q ， Wang P ， Li H ， Liu Y L ， van den Hengel A and Wu Q . 2020a . Structured multimodal attentions for TextVQA ［EB/OL］. ［ 2022-09-10 ］. https://arxiv.org/pdf/2006.00753.pdf https://arxiv.org/pdf/2006.00753.pdf

Gao D F ， Li K ， Wang R P ， Shan S G and Chen X L . 2020b . Multi-modal graph neural network for joint reasoning on vision and scene text // Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition . Seattle， USA ： IEEE： 12743 - 12753 ［ DOI： 10.1109/CVPR42600.2020.01276 http://dx.doi.org/10.1109/CVPR42600.2020.01276 ］

Ghosh S K ， Gómez L ， Karatzas D and Valveny E . 2015 . Efficient indexing for query by string text retrieval // Proceedings of the 13th International Conference on Document Analysis and Recognition . Tunis， Tunisia ： IEEE： 1236 - 1240 ［ DOI： 10.1109/ICDAR.2015.7333961 http://dx.doi.org/10.1109/ICDAR.2015.7333961 ］

Gómez L ， Biten A F ， Tito R ， Mafla A ， Rusiñol M ， Valveny E and Karatzas D . 2021 . Multimodal grid features and cell pointers for scene text visual question answering . Pattern Recognition Letters ， 150 ： 242 - 249 ［ DOI： 10.1016/j.patrec.2021.06.026 http://dx.doi.org/10.1016/j.patrec.2021.06.026 ］

Gómez L ， Mafla A ， Rusiñol M and Karatzas D . 2018 . Single shot scene text retrieval // Proceedings of the 15th European Conference on Computer Vision . Munich， Germany ： Springer： 728 - 744 ［ DOI： 10.1007/978-3-030-01264-9_43 http://dx.doi.org/10.1007/978-3-030-01264-9_43 ］

Graves A and Schmidhuber J . 2005 . Framewise phoneme classification with bidirectional LSTM and other neural network architectures . Neural Networks ， 18 （ 5/6 ）： 602 - 610 ［ DOI： 10.1016/j.neunet.2005.06.042 http://dx.doi.org/10.1016/j.neunet.2005.06.042 ］

Gu Z X ， Meng C H ， Wang K ， Lan J ， Wang W Q ， Gu M and Zhang L Q . 2022 . XYLayoutLM： towards layout-aware multimodal networks for visually-rich document understanding ［EB/OL］. ［ 2022-09-10 ］. https://arxiv.org/pdf/2203.06947.pdf https://arxiv.org/pdf/2203.06947.pdf

Han W ， Huang H T and Han T . 2020 . Finding the evidence： localization-aware answer prediction for text visual question answering // Proceedings of the 28th International Conference on Computational Linguistics . Barcelona， Spain ： International Committee on Computational Linguistics： 3118 - 3131 ［ DOI： 10.18653/v1/2020.coling-main.278 http://dx.doi.org/10.18653/v1/2020.coling-main.278 ］

He K M ， Gkioxari G ， Doll􀆦r P and Girshick R . 2017 . Mask R-CNN // Proceedings of 2017 IEEE International Conference on Computer Vision . Venice， Italy ： IEEE： 2980 - 2988 ［ DOI： 10.1109/ICCV.2017.322 http://dx.doi.org/10.1109/ICCV.2017.322 ］

He K M ， Zhang X Y ， Ren S Q and Sun J . 2016 . Deep residual learning for image recognition // Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition . Las Vegas， USA ： IEEE： 770 - 778 ［ DOI： 10.1109/CVPR.2016.90 http://dx.doi.org/10.1109/CVPR.2016.90 ］

Hu A W ， Chen S Z and Jin Q . 2021 . Question-controlled text-aware image captioning // Proceedings of the 29th ACM International Conference on Multimedia . Virtual Event， China ： ACM： 3097 - 3105 ［ DOI： 10.1145/3474085.3475452 http://dx.doi.org/10.1145/3474085.3475452 ］

Hu R H ， Singh A ， Darrell T and Rohrbach M . 2020 . Iterative answer prediction with pointer-augmented multimodal Transformers for TextVQA // Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition . Seattle， USA ： IEEE： 9989 - 9999 ［ DOI： 10.1109/CVPR42600.2020.01001 http://dx.doi.org/10.1109/CVPR42600.2020.01001 ］

Huang Y P ， Lv T C ， Cui L ， Lu Y T and Wei F R . 2022 . LayoutLMv3： pre-training for document AI with unified text and image masking ［EB/OL］. ［ 2022-09-10 ］. https://arxiv.org/pdf/2204.08387.pdf https://arxiv.org/pdf/2204.08387.pdf

Huang Z ， Chen K ， He J H ， Bai X ， Karatzas D ， Lu S J and Jawahar C V . 2019 . ICDAR2019 competition on scanned receipt OCR and information extraction // Proceedings of 2019 International Conference on Document Analysis and Recognition . Sydney， Australia ： IEEE： 1516 - 1520 ［ DOI： 10.1109/ICDAR.2019.00244 http://dx.doi.org/10.1109/ICDAR.2019.00244 ］

Jaume G ， Ekenel H K and Thiran J P . 2019 . FUNSD： a dataset for form understanding in noisy scanned documents // Proceedings of 2019 International Conference on Document Analysis and Recognition Workshops . Sydney， Australia ： IEEE： 1 - 6 ［ DOI： 10.1109/ICDARW.2019.10029 http://dx.doi.org/10.1109/ICDARW.2019.10029 ］

Jin Z X ， Wu H R ， Yang C ， Zhou F ， Qin J Y ， Xiao L and Yin X C . 2020 . RUArt： a novel text-centered solution for text-based visual question answering ［EB/OL］. ［ 2022-09-10 ］. https://arxiv.org/pdf/2010.12917.pdf https://arxiv.org/pdf/2010.12917.pdf

Joulin A ， Grave E ， Bojanowski P and Mikolov T . 2017 . Bag of tricks for efficient text classification // Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics： Volume 2 ， Short Papers. Valencia， Spain ： Association for Computational Linguistics： 427 - 431 ［ DOI： 10.18653/v1/E17-2068 http://dx.doi.org/10.18653/v1/E17-2068 ］

Kai W ， Boris B and Serge J B . 2011 . End-toend scene text recognition // Proceedings of 2011 ICCV . ［s.l.］：［s.n.］

Kant Y ， Batra D ， Anderson P ， Schwing A ， Parikh D ， Lu J S and Agrawal H . 2020 . Spatially aware multimodal Transformers for TextVQA // Proceedings of the 16th European Conference on Computer Vision . Glasgow， UK ： Springer： 715 - 732 ［ DOI： 10.1007/978-3-030-58545-7_41 http://dx.doi.org/10.1007/978-3-030-58545-7_41 ］

Katti A R ， Reisswig C ， Guder C ， Brarda S ， Bickel S ， Höhne J and Faddoul J B . 2018 . Chargrid： towards understanding 2d documents // Proceedings of 2018 Conference on Empirical Methods in Natural Language Processing . Brussels， Belgium ： Association for Computational Linguistics： 4459 - 4469 ［ DOI： 10.18653/v1/D18-1476 http://dx.doi.org/10.18653/v1/D18-1476 ］

Kim W ， Son B and Kim I . 2021 . ViLT： vision-and-language transformer without convolution or region supervision // Proceedings of the 38th International Conference on Machine Learning . Virtual ： PMLR： 5583 - 5594

Krasin I ， Duerig T ， Alldrin N ， Veit A ， Abu-El-Haija S ， Belongie S ， Cai D ， Feng Z Y ， Ferrari V and Gomes V . 2016 . Openimages： a public dataset for large-scale multi-label and multi-class image classification .

Lafferty J D ， McCallum A and Pereira F C N . 2001 . Conditional random fields： probabilistic models for segmenting and labeling sequence data // Proceedings of the 18th International Conference on Machine Learning . Williams College， USA ： Morgan Kaufmann： 282 - 289

Lee C Y ， Li C L ， Dozat T ， Perot V ， Su G L ， Hua N ， Ainslie J ， Wang R S ， Fujii Y and Pfister T . 2022 . FormNet： structural encoding beyond sequential modeling in form document information extraction // Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics （Volume 1： Long Papers） . Dublin， Ireland ： Association for Computational Linguistics： 3735 - 3754 ［ DOI： 10.18653/v1/2022.acl-long.260 http://dx.doi.org/10.18653/v1/2022.acl-long.260 ］

Li C L ， Bi B ， Yan M ， Wang W ， Huang S F ， Huang F and Si L . 2021a . StructuralLM： Structural pre-training for form understanding // Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing （Volume 1： Long Papers） . Virtual ： Association for Computational Linguistics： 6309 - 6318 ［ DOI： 10.18653/v1/2021.acl-long.493 http://dx.doi.org/10.18653/v1/2021.acl-long.493 ］

Li J W ， Galley M ， Brockett C ， Gao J F and Dolan B . 2016 . A diversity-promoting objective function for neural conversation models // Proceedings of 2016 Conference of the North American Chapter of the Association for Computational Linguistics： Human Language Technologies . San Diego， USA ： The Association for Computational Linguistics： 110 - 119 ［ DOI： 10.18653/v1/N16-1014 http://dx.doi.org/10.18653/v1/N16-1014 ］

Li P Z ， Gu J X ， Kuen J ， Morariu V I ， Zhao H D ， Jain R ， Manjunatha V and Liu H F . 2021b . SelfDoc： self-supervised document representation learning // Proceedings of 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition . Nashville， USA ： IEEE： 5648 - 5656 ［ DOI： 10.1109/CVPR46437.2021.00560 http://dx.doi.org/10.1109/CVPR46437.2021.00560 ］

Li X P ， Wu B ， Song J K ， Gao L L ， Zeng P P and Gan C . 2022 . Text-instance graph： exploring the relational semantics for text-based visual question answering . Pattern Recognition ， 124 ： # 108455 ［ DOI： 10.1016/j.patcog.2021.108455 http://dx.doi.org/10.1016/j.patcog.2021.108455 ］

Li Y L ， Qian Y X ， Yu Y C ， Qin X M ， Zhang C Q ， Liu Y ， Yao K ， Han J Y ， Liu J T and Ding E R . 2021c . StrucTexT： structured text understanding with multi-modal transformers // Proceedings of the 29th ACM International Conference on Multimedia . Virtual， China ： ACM： 1912 - 1920 ［ DOI： 10.1145/3474085.3475345 http://dx.doi.org/10.1145/3474085.3475345 ］

Liao M H ， Pang G ， Huang J ， Hassner T and Bai X . 2020 . Mask TextSpotter v3： segmentation proposal network for robust scene text spotting // Proceedings of the 16th European Conference on Computer Vision . Glasgow， UK ： Springer： 706 - 722 ［ DOI： 10.1007/978-3-030-58621-8_41 http://dx.doi.org/10.1007/978-3-030-58621-8_41 ］

Lin C Y . 2004 . ROUGE： a package for automatic evaluation of summaries // Text Summarization Branches Out . Barcelona， Spain ： Association for Computational Linguistics： 74 - 81

Lin W H ， Gao Q F ， Sun L ， Zhong Z Y ， Hu K ， Ren Q and Huo Q . 2021 . VIBERTgrid： a jointly trained multi-modal 2D document representation for key information extraction from documents // Proceedings of the 16th International Conference on Document Analysis and Recognition . Lausanne， Switzerland ： Springer： 548 - 563 ［ DOI： 10.1007/978-3-030-86549-8_35 http://dx.doi.org/10.1007/978-3-030-86549-8_35 ］

Liu C Y ， Chen X X ， Luo C J ， Jin L W ， Xue Y and Liu Y L . 2021 . Deep learning methods for scene text detection and recognition . Journal of Image and Graphics ， 26 （ 6 ）： 1330 - 1367

刘崇宇，陈晓雪，罗灿杰，金连文，薛洋，刘禹良 . 2021 . 自然场景文本检测与识别的深度学习方法 . 中国图象图形学报， 26 （ 6 ）： 1330 - 1367 ［ DOI： 10.11834/jig.210044 http://dx.doi.org/10.11834/jig.210044 ］

Liu F ， Xu G H ， Wu Q ， Du Q ， Jia W and Tan M K . 2020 . Cascade reasoning network for text-based visual question answering // Proceedings of the 28th ACM International Conference on Multimedia . Seattle， USA ： ACM： 4060 - 4069 ［ DOI： 10.1145/3394171.3413924 http://dx.doi.org/10.1145/3394171.3413924 ］

Liu T Y . 2009 . Learning to rank for information retrieval . Foundations and Trends in Information Retrieval ， 3 （ 3 ）： 225 - 331 ［ DOI： 10.1561/1500000016 http://dx.doi.org/10.1561/1500000016 ］

Liu X J ， Gao F Y ， Zhang Q and Zhao H S . 2019 . Graph convolution for multimodal information extraction from visually rich documents //Proceedings of 2019 Conference of the North American Chapter of the Association for Computational Linguistics： Human Language Technologies， Volume 2 （Industry Papers） . Minneapolis， Minnesota ： Association for Computational Linguistics： 32 - 39 ［ DOI： 10.18653/v1/N19-2005 http://dx.doi.org/10.18653/v1/N19-2005 ］

Lu X P ， Fan Z ， Wang Y S ， Oh J and Rosé C P . 2021 . Localize， group， and select： boosting text-VQA by scene text modeling // Proceedings of 2021 IEEE/CVF International Conference on Computer Vision Workshops . Montreal， Canada ： IEEE： 2631 - 2639 ［ DOI： 10.1109/ICCVW54120.2021.00297 http://dx.doi.org/10.1109/ICCVW54120.2021.00297 ］

Mafla A ， Tito R ， Dey S ， Gómez L ， Rusiñol M ， Valveny E and Karatzas D . 2021 . Real-time lexicon-free scene text retrieval . Pattern Recognition ， 110 ： # 107656 ［ DOI： 10.1016/j.patcog.2020.107656 http://dx.doi.org/10.1016/j.patcog.2020.107656 ］

Mathew M ， Bagal V ， Tito R ， Karatzas D ， Valveny E and Jawahar C V . 2022 . InfographicVQA // Proceedings of 2022 IEEE/CVF Winter Conference on Applications of Computer Vision . Waikoloa， USA ： IEEE： 2582 - 2591 ［ DOI： 10.1109/WACV51458.2022.00264 http://dx.doi.org/10.1109/WACV51458.2022.00264 ］

Mathew M ， Karatzas D and Jawahar C V . 2021 . DocVQA： a dataset for VQA on document images // Proceedings of 2021 IEEE Winter Conference on Applications of Computer Vision . Waikoloa， USA ： IEEE： 2199 - 2208 ［ DOI： 10.1109/WACV48630.2021.00225 http://dx.doi.org/10.1109/WACV48630.2021.00225 ］

Mishra A ， Alahari K and Jawahar C V . 2013 . Image retrieval using textual cues // Proceedings of 2013 IEEE International Conference on Computer Vision . Sydney， Australia ： IEEE： 3040 - 3047 ［ DOI： 10.1109/ICCV.2013.378 http://dx.doi.org/10.1109/ICCV.2013.378 ］

Mishra A ， Shekhar S ， Singh A K and Chakraborty A . 2019 . OCR-VQA： visual question answering by reading text in images // Proceedings of 2019 International Conference on Document Analysis and Recognition . Sydney， Australia ： IEEE： 947 - 952 ［ DOI： 10.1109/ICDAR.2019.00156 http://dx.doi.org/10.1109/ICDAR.2019.00156 ］

Papineni K ， Roukos S ， Ward T and Zhu W J . 2002 . BLEU： a method for automatic evaluation of machine translation // Proceedings of the 40th Annual Meeting on Association for Computational Linguistics . Philadelphia， USA ： ACL： 311 - 318 ［ DOI： 10.3115/1073083.1073135 http://dx.doi.org/10.3115/1073083.1073135 ］

Pennington J ， Socher R and Manning C . 2014 . GloVe： global vectors for word representation // Proceedings of 2014 Conference on Empirical Methods in Natural Language Processing . Doha， Qatar ： ACL： 1532 - 1543 ［ DOI： 10.3115/v1/D14-1162 http://dx.doi.org/10.3115/v1/D14-1162 ］

Powalski R ， Borchmann Ł ， Jurkiewicz D ， Dwojak T ， Pietruszka M and Pałka G . 2021 . Going full-TILT boogie on document understanding with text-image-layout transformer // Proceedings of the 16th International Conference on Document Analysis and Recognition . Lausanne， Switzerland ： Springer： 732 - 747 ［ DOI： 10.1007/978-3-030-86331-9_47 http://dx.doi.org/10.1007/978-3-030-86331-9_47 ］

Qian Y J ， Santus E ， Jin Z J ， Guo J and Barzilay R . 2019 . GraphIE： a graph-based framework for information extraction // Proceedings of 2019 Conference of the North American Chapter of the Association for Computational Linguistics： Human Language Technologies， Volume 1 （Long and Short Papers） . Minneapolis， Minnesota ： Association for Computational Linguistics： 751 - 761 ［ DOI： 10.18653/v1/N19-1082 http://dx.doi.org/10.18653/v1/N19-1082 ］

Qiao Z ， Zhou Y ， Wei J ， Wang W ， Zhang Y ， Jiang N ， Wang H B and Wang W P . 2021 . PIMNet： a parallel， iterative and mimicking network for scene text recognition // Proceedings of the 29th ACM International Conference on Multimedia . Virtual， China ： ACM： 2046 - 2055 ［ DOI： 10.1145/3474085.3475238 http://dx.doi.org/10.1145/3474085.3475238 ］

Qiao Z ， Zhou Y ， Yang D B ， Zhou Y C and Wang W P . 2020 . SEED： semantics enhanced encoder-decoder framework for scene text recognition // Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition （CVPR） . Seattle， USA ： IEEE： 13525 - 13534 ［ DOI： 10.1109/CVPR42600.2020.01354 http://dx.doi.org/10.1109/CVPR42600.2020.01354 ］

Qin X G ， Zhou Y ， Guo Y H ， Wu D Y ， Tian Z H ， Jiang N ， Wang H B and Wang W P . 2021 . Mask is all you need： rethinking mask R-CNN for dense and arbitrary-shaped scene text detection // Proceedings of the 29th ACM International Conference on Multimedia . Virtual， China ： ACM： 414 - 423 ［ DOI： 10.1145/3474085.3475178 http://dx.doi.org/10.1145/3474085.3475178 ］

Rajpurkar P ， Zhang J ， Lopyrev K and Liang P . 2016 . SQuAD ： 100 ， 000+ questions for machine comprehension of text //Proceedings of 2016 Conference on Empirical Methods in Natural Language Processing. Austin， USA： The Association for Computational Linguistics： 2383 - 2392 ［ DOI： 10.18653/v1/D16-1264 http://dx.doi.org/10.18653/v1/D16-1264 ］

Ren S Q ， He K M ， Girshick R and Sun J . 2017 . Faster R-CNN： towards real-time object detection with region proposal networks . IEEE Transactions on Pattern Analysis and Machine Intelligence ， 39 （ 6 ）： 1137 - 1149 ［ DOI： 10.1109/TPAMI.2016.2577031 http://dx.doi.org/10.1109/TPAMI.2016.2577031 ］

Rong X J ， Yi C C and Tian Y L . 2020 . Unambiguous scene text segmentation with referring expression comprehension . IEEE Transactions on Image Processing ， 29 ： 591 - 601 ［ DOI： 10.1109/TIP.2019.2930176 http://dx.doi.org/10.1109/TIP.2019.2930176 ］

Rong X J ， Yi C C and Tian Y L . 2022 . Unambiguous text localization， retrieval， and recognition for cluttered scenes . IEEE Transactions on Pattern Analysis and Machine Intelligence ， 44 （ 3 ）： 1638 - 1652 ［ DOI： 10.1109/TPAMI.2020.3018491 http://dx.doi.org/10.1109/TPAMI.2020.3018491 ］

Sang E F T K and Veenstra J . 1999 . Representing text chunks // Proceedings of the 9th Conference on European Chapter of the Association for Computational Linguistics . Bergen， Norway ： The Association for Computer Linguistics： 173 - 179 ［ DOI： 10.3115/977035.977059 http://dx.doi.org/10.3115/977035.977059 ］

Sharma H and Jalal A S . 2022 . Improving visual question answering by combining scene-text information . Multimedia Tools and Applications ， 81 （ 9 ）： 12177 - 12208 ［ DOI： 10.1007/s11042-022-12317-0 http://dx.doi.org/10.1007/s11042-022-12317-0 ］

Sidorov O ， Hu R H ， Rohrbach M and Singh A . 2020 . TextCaps： a dataset for image captioning with reading comprehension // Proceedings of the 16th European Conference on Computer Vision . Glasgow， UK ： Springer： 742 - 758 ［ DOI： 10.1007/978-3-030-58536-5_44 http://dx.doi.org/10.1007/978-3-030-58536-5_44 ］

Singh A ， Natarajan V ， Jiang Y ， Chen X ， Shah M ， Rohrbach M ， Batra D and Parikh D . 2018 . Pythia —— a platform for vision and language research // SysML Workshop ， NeurIPS. Montréal ， Canada ： MIT Press

Singh A ， Natarajan V ， Shah M ， Jiang Y ， Chen X L ， Batra D ， Parikh D and Rohrbach M . 2019a . Towards VQA models that can read // Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition . Long Beach， USA ： IEEE： 8309 - 8318 ［ DOI： 10.1109/CVPR.2019.00851 http://dx.doi.org/10.1109/CVPR.2019.00851 ］

Singh A ， Pang G ， Toh M ， Huang J ， Galuba W and Hassner T . 2021 . TextOCR： towards large-scale end-to-end reasoning for arbitrary-shaped scene text // Proceedings of 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition . Nashville， USA ： IEEE： 8798 - 8808 ［ DOI： 10.1109/CVPR46437.2021.00869 http://dx.doi.org/10.1109/CVPR46437.2021.00869 ］

Singh A K ， Mishra A ， Shekhar S and Chakraborty A . 2019b . From strings to things： knowledge-enabled VQA model that can read and reason // Proceedings of 2019 IEEE/CVF International Conference on Computer Vision . Seoul， Korea （South）： IEEE： 4601 - 4611 ［ DOI： 10.1109/ICCV.2019.00470 http://dx.doi.org/10.1109/ICCV.2019.00470 ］

Tang G Z ， Xie L L ， Jin L W ， Wang J P ， Chen J D ， Xu Z ， Wang Q Y ， Wu Y Q and Li H . 2021 . MatchVIE： exploiting match relevancy between entities for visual information extraction // Proceedings of the 30th International Joint Conference on Artificial Intelligence . Montreal， Canada ： IJCAI.org： 1039 - 1045 ［ DOI： 10.24963/ijcai.2021/144 http://dx.doi.org/10.24963/ijcai.2021/144 ］

Tito R ， Karatzas D and Valveny E . 2021 . Document collection visual question answering // Proceedings of the 16th International Conference on Document Analysis and Recognition . Lausanne， Switzerland ： Springer： 778 - 792 ［ DOI： 10.1007/978-3-030-86331-9_50 http://dx.doi.org/10.1007/978-3-030-86331-9_50 ］

Vaswani A ， Shazeer N ， Parmar N ， Uszkoreit J ， Jones L ， Gomez A N ， Kaiser Ł and Polosukhin I . 2017 . Attention is all you need // Proceedings of the 31st International Conference on Neural Information Processing Systems . Long Beach， USA ： Curran Associates Inc.： 6000 - 6010

Vedantam R ， Zitnick C L and Parikh D . 2015 . CIDEr： consensus-based image description evaluation // Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition . Boston， USA ： IEEE： 4566 - 4575 ［ DOI： 10.1109/CVPR.2015.7299087 http://dx.doi.org/10.1109/CVPR.2015.7299087 ］

Wang H ， Bai X ， Yang M K ， Zhu S G ， Wang J and Liu W Y . 2021a . Scene text retrieval via joint text detection and similarity learning // Proceedings of 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition . Nashville， USA ： IEEE： 4556 - 4565 ［ DOI： 10.1109/CVPR46437.2021.00453 http://dx.doi.org/10.1109/CVPR46437.2021.00453 ］

Wang J ， Tang J H and Luo J B . 2020a . Multimodal attention with image text spatial relationship for OCR-based image captioning // Proceedings of the 28th ACM International Conference on Multimedia . Seattle， USA ： ACM： 4337 - 4345 ［ DOI： 10.1145/3394171.3413753 http://dx.doi.org/10.1145/3394171.3413753 ］

Wang J ， Tang J H ， Yang M K ， Bai X and Luo J B . 2021c . Improving OCR-based image captioning by incorporating geometrical relationship // Proceedings of 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition . Nashville， USA ： IEEE： 1306 - 1315 ［ DOI： 10.1109/CVPR46437.2021.00136 http://dx.doi.org/10.1109/CVPR46437.2021.00136 ］

Wang J P ， Jin L W and Ding K . 2022a . LiLT： a simple yet effective language-independent layout Transformer for structured document understanding // Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics （Volume 1： Long Papers） . Dublin， Ireland ： Association for Computational Linguistics： 7747 - 7757 ［ DOI： 10.18653/v1/2022.acl-long.534 http://dx.doi.org/10.18653/v1/2022.acl-long.534 ］

Wang J P ， Liu C Y ， Jin L W ， Tang G Z ， Zhang J X ， Zhang S T ， Wang Q Y ， Wu Y Q and Cai M X . 2021b . Towards robust visual information extraction in real world： new dataset and novel solution . Proceedings of the AAAI Conference on Artificial Intelligence ， 35 （ 4 ）： 2738 - 2745 ［ DOI： 10.1609/aaai.v35i4.16378 http://dx.doi.org/10.1609/aaai.v35i4.16378 ］

Wang Q Z and Chan A B . 2019 . Describing like humans： on diversity in image captioning // Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition . Long Beach， USA ： IEEE： 4190 - 4198 ［ DOI： 10.1109/CVPR.2019.00432 http://dx.doi.org/10.1109/CVPR.2019.00432 ］

Wang W ， Zhou Y ， Lv J H ， Wu D Y ， Zhao G Q ， Jiang N and Wang W P . 2022b . TPSNet： reverse thinking of thin plate splines for arbitrary shape scene text representation // Proceedings of the 30th ACM International Conference on Multimedia . Lisboa， Portugal ： ACM： 5014 - 5025 ［ DOI： 10.1145/3503161.3547882 http://dx.doi.org/10.1145/3503161.3547882 ］

Wang X Y ， Liu Y L ， Shen C H ， Ng C C ， Luo C J ， Jin L W ， Chan C S ， van den Hengel A and Wang L W . 2020b . On the general value of evidence， and bilingual scene-text visual question answering // Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition . Seattle， USA ： IEEE： 10123 - 10132 ［ DOI： 10.1109/CVPR42600.2020.01014 http://dx.doi.org/10.1109/CVPR42600.2020.01014 ］

Wang Z K ， Bao R D ， Wu Q and Liu S . 2021d . Confidence-aware non-repetitive multimodal Transformers for TextCaps . Proceedings of the AAAI Conference on Artificial Intelligence ， 35 （ 4 ）： 2835 - 2843 ［ DOI： 10.1609/aaai.v35i4.16389 http://dx.doi.org/10.1609/aaai.v35i4.16389 ］

Wei J ， Zhang Y ， Zhou Y ， Zeng G Y ， Qiao Z ， Guo Y H ， Wu H Y ， Wang H B and Wang W P . 2022 . TextBlock： towards scene text spotting without fine-grained detection // Proceedings of the 30th ACM International Conference on Multimedia . Lisboa， Portugal ： ACM： 5892 - 5902 ［ DOI： 10.1145/3503161.3548051 http://dx.doi.org/10.1145/3503161.3548051 ］

Wu J J ， Du J ， Wang F R ， Yang C ， Jiang X Z ， Hu J S ， Yin B ， Zhang J S and Dai L R . 2022 . A multimodal attention fusion network with a dynamic vocabulary for TextVQA . Pattern Recognition ， 122 ： # 108214 ［ DOI： 10.1016/j.patcog.2021.108214 http://dx.doi.org/10.1016/j.patcog.2021.108214 ］

Xu G H ， Niu S C ， Tan M K ， Luo Y C ， Du Q and Wu Q . 2021a . Towards accurate text-based image captioning with content diversity exploration // Proceedings of 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition . Nashville， USA ： IEEE： 12632 - 12641 ［ DOI： 10.1109/CVPR46437.2021.01245 http://dx.doi.org/10.1109/CVPR46437.2021.01245 ］

Xu Y ， Xu Y H ， Lyu T C ， Cui L ， Wei F R ， Wang G X ， Lu Y J ， Florêncio D ， Zhang C ， Che W X ， Zhang M and Zhou L D . 2021c . LayoutLMv2： multi-modal pre-training for visually-rich document understanding // Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing （Volume 1： Long Papers） . Online ： Association for Computational Linguistics： 2579 - 2591 ［ DOI： 10.18653/v1/2021.acl-long.201 http://dx.doi.org/10.18653/v1/2021.acl-long.201 ］

Xu Y H ， Li M H ， Cui L ， Huang S H ， Wei F R and Zhou M . 2020 . LayoutLM： pre-training of text and layout for document image understanding // Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining . Virtual， USA ： ACM： 1192 - 1200 ［ DOI： 10.1145/3394486.3403172 http://dx.doi.org/10.1145/3394486.3403172 ］

Xu Y H ， Lv T C ， Cui L ， Wang G X ， Lu Y J ， Florêncio D ， Zhang C and Wei F R . 2021b . LayoutXLM： multimodal pre-training for multilingual visually-rich document understanding ［EB/OL］. ［ 2022-09-10 ］. https://arxiv.org/pdf/2104.08836.pdf

Xu Y H ， Lyu T C ， Cui L ， Wang G X ， Lu Y J ， Florêncio D ， Zhang C and Wei F R . 2022 . XFUND： a benchmark dataset for multilingual visually rich form understanding // Findings of the Association for Computational Linguistics： ACL 2022 . Dublin， Ireland ： Association for Computational Linguistics： 3214 - 3224 ［ DOI： 10.18653/v1/2022.findings-acl.253 http://dx.doi.org/10.18653/v1/2022.findings-acl.253 ］

Yang Z C ， He X D ， Gao J F ， Deng L and Smola A . 2016 . Stacked attention networks for image question answering // Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition . Las Vegas， USA ： IEEE： 21 - 29 ［ DOI： 10.1109/CVPR.2016.10 http://dx.doi.org/10.1109/CVPR.2016.10 ］

Yang Z Y ， Lu Y J ， Wang J F ， Yin X ， Florêncio D ， Wang L J ， Zhang C ， Zhang L and Luo J B . 2021 . TAP： text-aware pre-training for text-VQA and text-caption // Proceedings of 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition . Nashville， USA ： IEEE： 8747 - 8757 ［ DOI： 10.1109/CVPR46437.2021.00864 http://dx.doi.org/10.1109/CVPR46437.2021.00864 ］

Zeng G Y ， Zhang Y ， Zhou Y and Yang X M . 2021 . Beyond OCR + VQA： involving OCR into the flow for robust and accurate TextVQA // Proceedings of the 29th ACM International Conference on Multimedia . Virtual Event， China ： ACM： 376 - 385 ［ DOI： 10.1145/3474085.3475606 http://dx.doi.org/10.1145/3474085.3475606 ］

Zhang P ， Xu Y L ， Cheng Z Z ， Pu S L ， Lu J ， Qiao L ， Niu Y and Wu F . 2020 . TRIE： end-to-end text reading and information extraction for document understanding // Proceedings of the 28th ACM International Conference on Multimedia . Seattle， USA ： ACM： 1413 - 1422 ［ DOI： 10.1145/3394171.3413900 http://dx.doi.org/10.1145/3394171.3413900 ］

Zhang W Q ， Shi H C ， Guo J N ， Zhang S Y ， Cai Q P ， Li J C ， Luo S H and Zhuang Y T . 2022 . MAGIC： multimodal relational graph adversarial inference for diverse and unpaired text-based image captioning ［EB/OL］. ［ 2022-09-10 ］. https://arxiv.org/pdf/2112.06558.pdf

Zhang X Y and Yang Q . 2021 . Position-augmented Transformers with entity-aligned mesh for TextVQA // Proceedings of the 29th ACM International Conference on Multimedia . Virtual， China ： ACM： 2519 - 2528 ［ DOI： 10.1145/3474085.3475425 http://dx.doi.org/10.1145/3474085.3475425 ］

Zhu C G ， Zeng M and Huang X D . 2019 . SDNet： contextualized attention-based deep network for conversational question answering ［EB/OL］. ［ 2022-09-10 ］. https://arxiv.org/pdf/1812.03593.pdf

Zhu Q ， Gao C Y ， Wang P and Wu Q . 2021 . Simple is not easy： a simple strong baseline for TextVQA and TextCaps . Proceedings of the AAAI Conference on Artificial Intelligence ， 35 （ 4 ）： 3608 - 3615 ［ DOI： 10.1609/aaai.v35i4.16476 http://dx.doi.org/10.1609/aaai.v35i4.16476 ］

Park S ， Shin S ， Lee B ， Lee J ， Surh J ， Seo M and Lee H . 2019 . CORD： a consolidated receipt dataset for post-OCR parsing // Workshop on Document Intelligence at NeurIPS 2019 . Vancouver， Canada ：［s.n.］

Kerroumi M ， Sayem O and Shabou A . 2020 . VisualWordGrid： information extraction from scanned documents using a multimodal approach ［EB/OL］. ［ 2022-09-10 ］. http://arxiv.org/pdf/2010.02358.pdf

Hong T ， Kim D ， Ji M ， Hwang W ， Nam D and Park S . 2021 . BROS： a pre-trained language model focusing on text and layout for better key information extraction from documents // Proceedings of AAAI Conference on Artificial Intelligence . Vancouver， Canada ： AAAI： 10767 - 10775

Yu W ， Lu N ， Qi X ， Gong P and Xiao R . 2020 . PICK： processing key information extraction from documents using improved graph learning-convolutional networks // Proceedings of the 25th International Conference on Pattern Recognition （ICPR） . Milan， Italy ： IEEE： 4363 - 4370

Gu J ， Kuen J ， Morariu V I ， Zhao H ， Barmpalios N ， Jain R ， Nenkova A and Sun T . 2022 . Unified pretraining framework for document understanding ［EB/OL］. ［ 2022-09-10 ］. https://arxiv.org/pdf/2204.10939.pdf

Veit A ， Matera T ， Neumann L ， Matas J and Belongie S J . 2016 . COCO-Text： dataset and benchmark for text detection and recognition in natural images ［EB/OL］. ［ 2022-09-10 ］. https://arxiv.org/pdf/1601.07140.pdf

Zhu Q ， Gao C ， Wang P and Wu Q . 2020 . Simple is not easy： a simple strong baseline for TextVQA and TextCaps // Proceedings of the AAAI Conference on Artificial Intelligence . New York， USA ： AAAI
