TextLLM:基于动态分辨率的文档多模态大模型
TextLLM: a document multimodal large model based on dynamic resolution
2025, pp. 1-15
Received: 2024-10-07; Revised: 2025-01-17; Accepted: 2025-02-18; Published online: 2025-02-20
DOI: 10.11834/jig.240608
Yang Biao, Liu Yuliang, Liu Qiang, Zhu Yingying. TextLLM: a document multimodal large model based on dynamic resolution [J/OL]. Journal of Image and Graphics, 2025: 1-15.
Objective
Document intelligence aims to process paper-based textual information automatically and intelligently, including but not limited to tables, forms, and invoices, greatly facilitating the electronic management of information. However, traditional deep learning methods typically optimize for a single task, which limits their effectiveness in complex and varied document scenarios. Moreover, these methods require an external optical character recognition (OCR) tool to extract the text in documents, which not only adds processing steps but may also introduce additional errors. The emergence of multimodal large models offers the prospect of OCR-free, unified document processing, yet such models still face considerable challenges when handling high-resolution document images and the growing number of visual tokens. To address these challenges, we propose TextLLM, a document multimodal large model based on dynamic resolution that can process high-resolution document images without OCR tools.
Method
We train a document multimodal large model capable of handling dynamic resolution on top of a recent multimodal large model. On the basis of dynamic resolution, we propose a dynamic feature compression algorithm: a dynamic, learnable compression rate determines the feature length to retain, and feature-similarity scores identify the important features, which are then used to aggregate the key information. Furthermore, we exploit the attention mechanism of the large language model to capture the visual features relevant to the prompt: the most relevant features are selected according to the prompt's attention distribution map, and their surrounding features are retained as well.
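A minimal PyTorch sketch of this dynamic compression step is given below. It is an illustration under stated assumptions rather than the paper's released implementation: the module name DynamicCompressor, the candidate rate set, and the use of mean cosine similarity as the importance score are choices made here for exposition.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicCompressor(nn.Module):
    """Sketch: pick a discrete compression rate with Gumbel-Softmax,
    then keep and aggregate the most important visual tokens."""
    def __init__(self, dim, candidate_rates=(0.25, 0.5, 0.75)):
        super().__init__()
        self.register_buffer("rates", torch.tensor(candidate_rates))
        # Learnable logits over the discrete candidate compression rates.
        self.rate_logits = nn.Parameter(torch.zeros(len(candidate_rates)))

    def forward(self, feats):                      # feats: (N, dim) visual tokens
        # Differentiable discrete choice of a rate (end-to-end trainable).
        probs = F.gumbel_softmax(self.rate_logits, tau=1.0, hard=True)
        rate = (probs * self.rates).sum()
        keep = max(1, int(feats.size(0) * rate))   # feature length to retain

        # Importance via similarity: tokens close to many others score higher.
        normed = F.normalize(feats, dim=-1)
        sim = normed @ normed.t()                  # (N, N) cosine similarities
        importance = sim.mean(dim=-1)
        idx = importance.topk(keep).indices

        # Aggregate: each kept token absorbs its most similar neighbours.
        weights = sim[idx].softmax(dim=-1)         # (keep, N)
        return weights @ feats                     # (keep, dim) compressed tokens
```

With hard=True, the forward pass commits to one discrete rate while the straight-through estimator keeps the rate logits trainable, matching the end-to-end rate learning described above.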
Result
Experiments compare TextLLM with six recent methods on multiple datasets, and TextLLM achieves significant performance gains on several document understanding benchmarks. It outperforms existing models on DocVQA, WTQ, ChartQA, and TextVQA, scoring 82.4, 37.6, 70.8, and 65.3, respectively. On the comprehensive OCRBench evaluation, the model scores 601, demonstrating its adaptability and overall effectiveness across diverse text-related tasks. Ablation studies on multiple datasets further verify that the proposed dynamic algorithms improve the model's performance.
Conclusion
We propose TextLLM, a document large model based on dynamic resolution, together with dynamic feature compression and dynamic selection algorithms for handling documents in diverse scenarios. Experimental results show that our model outperforms several state-of-the-art document large models while combining efficiency and accuracy.
Objective
The advancement of document intelligence aims to realize the intelligent processing and interpretation of diverse document information, covering structured documents containing tables, forms, and invoices, as well as text in natural scenes. Traditional deep learning methods excel at specific tasks, but because they are typically optimized for a single task, they struggle to adapt to increasingly complex needs and diverse scenarios. Processing document text usually requires optical character recognition (OCR) to extract textual information, which limits both speed and accuracy, and the multiple stages of such a reading pipeline allow errors to accumulate. Moreover, relying on off-the-shelf OCR models or APIs introduces additional engineering complexity, weakens the connection between text and its surrounding context, and can increase computational cost. To avoid these drawbacks of external systems, OCR-free solutions have recently attracted growing attention. With the rise of multimodal large models, the field of document intelligence is undergoing a transformation: these models integrate textual and visual information to achieve a more comprehensive and accurate understanding of document content, potentially eliminating the reliance on OCR tools. However, multimodal large models still face challenges, especially in processing high-resolution document images and coping with the ever-growing number of visual tokens. The working resolution of previous approaches is limited by the input size of the visual encoder, so small text is generally hard to discern, while slicing the image to raise the effective resolution greatly increases the number of image tokens, straining memory and computation. To overcome these challenges, a multimodal large model based on dynamic resolution is proposed. The model is designed to handle high-resolution document images without OCR tools while remaining flexible enough to accommodate the growing number of visual tokens.
Method
In this study, we build on a recent multimodal large model and train a document large model based on dynamic resolution. The pipeline involves dynamic image adjustment, block partitioning, visual encoding, and feature compression and selection, so that useful information is extracted and retained to the greatest extent. First, based on the original image size, the closest predefined scaling size is found for subsequent processing, including resizing, image slicing, and global-view extraction. Next, the image blocks are fed into the visual encoder, where window attention mechanisms recognize and fuse information to extract local regional visual features and global visual features. Then, an image resampling module processes the sliced features and produces compressed visual features via an attention mechanism. The compression rate is set dynamically: a set of discrete compression rates and a learnable parameter allow the rate to be learned, with Gumbel-Softmax enabling end-to-end training. The chosen rate determines the number of features after compression; the most important features are selected and ranked by computing similarity and importance scores, and these important features are used for further aggregation. Finally, the attention mechanism of the large language model captures the visual features related to the prompt: the most relevant features are selected according to the prompt's attention distribution map, and their surrounding features are retained. Together, these steps process and exploit image information effectively and improve both the efficiency and the accuracy of information extraction.
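To make the prompt-guided selection concrete, the sketch below keeps the visual tokens that receive the most attention from the prompt tokens, together with their spatial neighbours. The function name, the row-major grid layout, and the neighbourhood radius are assumptions made for illustration; the paper's exact procedure may differ.

```python
import torch

def select_by_prompt_attention(vis_feats, attn, grid_w, top_k=64, radius=1):
    """vis_feats: (N, d) visual tokens laid out row-major on a grid of width grid_w.
    attn: (P, N) attention weights from P prompt tokens to the N visual tokens."""
    score = attn.mean(dim=0)                        # aggregate over prompt tokens
    top = score.topk(min(top_k, score.numel())).indices

    grid_h = score.numel() // grid_w
    keep = torch.zeros_like(score, dtype=torch.bool)
    rows, cols = top // grid_w, top % grid_w
    for dr in range(-radius, radius + 1):           # also retain surrounding tokens
        for dc in range(-radius, radius + 1):
            r = (rows + dr).clamp(0, grid_h - 1)
            c = (cols + dc).clamp(0, grid_w - 1)
            keep[r * grid_w + c] = True
    return vis_feats[keep]                          # pruned visual token sequence
```

Keeping a small neighbourhood around each selected token preserves local context, for example the characters adjacent to a matched word, so pruning by prompt relevance reduces the token count without severing text from its surroundings.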
Result
Experiments conducted across multiple datasets compare TextLLM with six recent methods and demonstrate significant performance improvements on various document understanding benchmarks. TextLLM outperforms existing models on datasets such as DocVQA, WTQ, ChartQA, and TextVQA, achieving scores of 82.4, 37.6, 70.8, and 65.3, respectively. Furthermore, on the comprehensive OCRBench evaluation dataset, the model achieves a score of 601, confirming its adaptability and overall strength across a variety of text-related tasks. Ablation experiments are also carried out across multiple datasets to verify the effectiveness of the algorithms, and the results confirm that the proposed dynamic processing improves the model's performance. To further demonstrate the model's capabilities on text-related tasks in various scenarios, several mainstream scene images are selected for visualization in this paper. The model accurately recognizes text in scene images and documents and answers questions based on its understanding, showing strong text processing capability and adaptability. By introducing dynamic feature compression, the model learns the compression rate autonomously and samples different rates during training, which to some extent serves as data augmentation. Furthermore, by incorporating visual feature selection, the model focuses more on text-related features and, while further reducing the number of features, achieves improvements across all datasets.
Conclusion
In this study, we introduce TextLLM, an innovative document large model based on dynamic resolution that combines a dynamic feature compression algorithm with a dynamic selection algorithm. Extensive experimental validation shows that our model not only outperforms several state-of-the-art document large models but also achieves significant gains in efficiency and accuracy.
References
Liu C., Jin L., Bai X., Li X. and Yin F. 2023. Frontiers of intelligent document analysis and recognition: review and prospects. Journal of Image and Graphics, 28(8): 2223-2252 [DOI: 10.11834/jig.221112]
Ying Z., Zhao Y., Xuan C. and Deng W. 2020. Layout analysis of document images based on multifeature fusion. Journal of Image and Graphics, 25(2): 311-320 [DOI: 10.11834/jig.190190]
Xu Y., Li M., Cui L., Huang S., Wei F. and Zhou M. 2020. LayoutLM: pre-training of text and layout for document image understanding. //Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining: 1192-1200 [DOI: 10.1145/3394486.3403172]
Xu Y., Xu Y., Lv T., Cui L., Wei F., Wang G., Liu Y., Florencio D., Zhang C., Che W., Zhang M. and Zhou L. 2021. LayoutLMv2: multi-modal pre-training for visually-rich document understanding. //Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers): 2579-2591 [DOI: 10.18653/v1/2021.acl-long.201]
Huang Y., Lv T., Cui L., Lu Y. and Wei F. 2022. LayoutLMv3: pre-training for document AI with unified text and image masking. //Proceedings of the 30th ACM International Conference on Multimedia: 4083-4091 [DOI: 10.1145/3503161.3548112]
Lin Z., Wang J. and Jin L. 2023. Visual information extraction deep learning method: a critical review. Journal of Image and Graphics, 28(8): 2276-2297 [DOI: 10.11834/jig.220904]
Tang Z., Yang Z., Wang G., Fang Y., Liu Y., Zhu C., Zeng M., Zhang C. and Bansal M. 2023. Unifying vision, text, and layout for universal document processing. //Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition: 19254-19264 [DOI: 10.1109/CVPR52729.2023.01845]
Wang D., Raman N., Sibue M., Ma Z., Babkin P., Kaur S., Pei Y., Nourbakhsh A. and Liu X. 2023. DocLLM: a layout-aware generative language model for multimodal document understanding [EB/OL]. [2023-12-21]. https://arxiv.org/pdf/2401.00908
Kim G., Hong T., Yim M., Nam J., Park J., Yim J., Hwang W., Yun S., Han S. and Park S. 2022. OCR-free document understanding transformer. //European Conference on Computer Vision: 498-517 [DOI: 10.1007/978-3-031-19815-1_29]
Davis B., Morse B., Price B., Tensmeyer C., Wigington C. and Morariu V. 2022. End-to-end document recognition and understanding with Dessurt. //European Conference on Computer Vision: 280-296 [DOI: 10.1007/978-3-031-25069-9_19]
Lee K., Joshi M., Turc I. R., Hu H., Liu F., Eisenschlos J. M., Khandelwal U., Shaw P., Chang M. and Toutanova K. 2023. Pix2Struct: screenshot parsing as pretraining for visual language understanding. //International Conference on Machine Learning: 18893-18912 [DOI: 10.48550/arXiv.2210.03347]
Zhang Y., Zhang R., Gu J., Zhou Y., Lipka N., Yang D. and Sun T. 2023. LLaVAR: enhanced visual instruction tuning for text-rich image understanding [EB/OL]. [2023-06-29]. https://arxiv.org/pdf/2306.17107
Ye J., Hu A., Xu H., Ye Q., Yan M., Dan Y., Zhao C., Xu G., Li C., Tian J., Qi Q., Zhang J. and Huang F. 2023. mPLUG-DocOwl: modularized multimodal large language model for document understanding [EB/OL]. [2023-07-04]. https://arxiv.org/abs/2307.02499
Feng H., Wang Z., Tang J., Lu J., Zhou W., Li H. and Huang C. 2023. UniDoc: a universal large multimodal model for simultaneous text detection, recognition, spotting and understanding [EB/OL]. [2023-08-19]. https://arxiv.org/pdf/2308.11592
Ye J., Hu A., Xu H., Ye Q., Yan M., Dan Y., Zhao C., Xu G., Li C., Tian J., Qi Q., Zhang J. and Huang F. 2023. UReader: universal OCR-free visually-situated language understanding with multimodal large language model. //Findings of the Association for Computational Linguistics: EMNLP 2023: 2841-2858 [DOI: 10.18653/v1/2023.findings-emnlp.187]
Feng H., Liu Q., Liu H., Zhou W., Li H. and Huang C. 2023. DocPedia: unleashing the power of large multimodal model in the frequency domain for versatile document understanding [EB/OL]. [2023-11-20]. https://arxiv.org/pdf/2311.11810
Wei H., Kong L., Chen J., Zhao L., Ge Z., Yang J., Sun J., Han C. and Zhang X. 2023. Vary: scaling up the vision vocabulary for large vision-language models [EB/OL]. [2023-12-11]. https://arxiv.org/abs/2312.06109
Liu Y., Yang B., Liu Q., Li Z., Ma Z., Zhang S. and Bai X. 2024. TextMonkey: an OCR-free large multimodal model for understanding document [EB/OL]. [2024-03-07]. https://arxiv.org/pdf/2403.04473
Radford A., Kim J. W., Hallacy C., Ramesh A., Goh G., Agarwal S., Sastry G., Askell A., Mishkin P., Clark J., Krueger G. and Sutskever I. 2021. Learning transferable visual models from natural language supervision. //International Conference on Machine Learning: 8748-8763 [DOI: 10.48550/arXiv.2103.00020]
Li J., Li D., Savarese S. and Hoi S. 2023. BLIP-2: bootstrapping language-image pre-training with frozen image encoders and large language models. //International Conference on Machine Learning: 19730-19742 [DOI: 10.48550/arXiv.2301.12597]
Liu Z., Lin Y., Cao Y., Hu H., Wei Y., Zhang Z., Lin S. and Guo B. 2021. Swin Transformer: hierarchical vision transformer using shifted windows. //Proceedings of the IEEE/CVF International Conference on Computer Vision: 10012-10022 [DOI: 10.1109/ICCV48922.2021.00986]
Jang E., Gu S. and Poole B. 2016. Categorical reparameterization with Gumbel-Softmax [EB/OL]. [2016-11-03]. https://arxiv.org/abs/1611.01144
Chen L., Zhao H., Liu T., Bai S., Lin J., Zhou C. and Chang B. 2024. An image is worth 1/2 tokens after layer 2: plug-and-play inference acceleration for large vision-language models [EB/OL]. [2024-03-01]. https://arxiv.org/abs/2403.06764
Bai J., Bai S., Yang S., Wang S., Tan S., Wang P., Lin J., Zhou C. and Zhou J. 2023. Qwen-VL: a frontier large vision-language model with versatile abilities [EB/OL]. [2023-08-24]. https://arxiv.org/pdf/2308.12966
Liu H., Li C., Wu Q. and Lee Y. J. 2023. Visual instruction tuning. //Advances in Neural Information Processing Systems, 36: 34892-34916 [DOI: 10.48550/arXiv.2304.08485]
Hu A., Xu H., Ye J., Yan M., Zhang L., Zhang B., Li C., Zhang J., Jin Q., Huang F. and Zhou J. 2024. mPLUG-DocOwl 1.5: unified structure learning for OCR-free document understanding [EB/OL]. [2024-03-19]. https://arxiv.org/pdf/2403.12895
Lewis D., Agam G., Argamon S., Frieder O., Grossman D. and Heard J. 2006. Building a test collection for complex document information processing. //Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval: 665-666 [DOI: 10.1145/1148170.1148307]
Mathew M., Karatzas D. and Jawahar C. V. 2021. DocVQA: a dataset for VQA on document images. //Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision: 2200-2209 [DOI: 10.1109/WACV48630.2021.00225]
Mathew M., Bagal V., Tito R., Karatzas D., Valveny E. and Jawahar C. V. 2022. InfographicVQA. //Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision: 1697-1706 [DOI: 10.1109/WACV51458.2022.00264]
Schuhmann C., Beaumont R., Vencu R., Gordon C., Wightman R. and Svetlichnaya S. 2020. DeepForm: understand structured documents at scale [EB/OL]. https://wandb.ai/stacey
Stanisławek T., Graliński F., Wróblewska A., Lipiński D., Kaliska A., Rosalska P., Topolski B. and Biecek P. 2021. Kleister: key information extraction datasets involving long documents with complex layouts. //International Conference on Document Analysis and Recognition: 564-579 [DOI: 10.1007/978-3-030-86549-8_36]
Pasupat P. and Liang P. 2015. Compositional semantic parsing on semi-structured tables [EB/OL]. [2015-08-03]. https://arxiv.org/pdf/1508.00305
Masry A., Long D. X., Tan J. Q., Joty S. and Hoque E. 2022. ChartQA: a benchmark for question answering about charts with visual and logical reasoning [EB/OL]. [2022-05-19]. https://arxiv.org/pdf/2203.10244
Veit A., Matera T., Neumann L., Matas J. and Belongie S. 2016. COCO-Text: dataset and benchmark for text detection and recognition in natural images [EB/OL]. [2016-06-19]. https://arxiv.org/pdf/1601.07140
Biten A. F., Tito R., Mafla A., Gomez L., Rusinol M., Valveny E., Jawahar C. V. and Karatzas D. 2019. Scene text visual question answering. //Proceedings of the IEEE/CVF International Conference on Computer Vision: 4291-4301 [DOI: 10.1109/ICCV.2019.00439]
Long S., Qin S., Panteleev D., Bissacco A., Fujii Y. and Raptis M. 2022. Towards end-to-end unified scene text detection and layout analysis. //Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition: 1049-1059 [DOI: 10.1109/CVPR52688.2022.00112]
Singh A., Pang G., Toh M., Huang J., Galuba W. and Hassner T. 2021. TextOCR: towards large-scale end-to-end reasoning for arbitrary-shaped scene text. //Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition: 8802-8812 [DOI: 10.48550/arXiv.2105.05486]
Nayef N., Patel Y., Busta M., Chowdhury P. N., Karatzas D., Khlif W., Matas J., Pal U., Burie J., Liu C. and Ogier J. M. 2019. ICDAR2019 robust reading challenge on multi-lingual scene text detection and recognition (RRC-MLT-2019). //2019 International Conference on Document Analysis and Recognition: 1582-1587 [DOI: 10.1109/ICDAR.2019.00254]
Singh A., Natarajan V., Shah M., Jiang Y., Chen X., Batra D., Parikh D. and Rohrbach M. 2019. Towards VQA models that can read. //Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition: 8317-8326 [DOI: 10.1109/CVPR.2019.00851]
Tanaka R., Nishida K. and Yoshida S. 2021. VisualMRC: machine reading comprehension on document images. //Proceedings of the AAAI Conference on Artificial Intelligence: 13878-13888 [DOI: 10.1609/aaai.v35i15.17635]
Yu Y., Liao M., Wu J., Liao Y., Zheng X. and Zeng W. 2024. TextHawk: exploring efficient fine-grained perception of multimodal large language models [EB/OL]. [2024-04-14]. https://arxiv.org/pdf/2404.09204
Luo C., Shen Y., Zhu Z., Zheng Q., Yu Z. and Yao C. 2024. LayoutLLM: layout instruction tuning with large language models for document understanding. //Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition: 15630-15640 [DOI: 10.1109/CVPR52733.2024.01480]
Hu A., Xu H., Zhang L., Ye J., Yan M., Zhang J., Jin Q., Huang F. and Zhou J. 2024. mPLUG-DocOwl2: high-resolution compressing for OCR-free multi-page document understanding [EB/OL]. [2024-09-05]. https://arxiv.org/pdf/2409.03420
Liu Y., Li Z., Yang B., Li C., Yin X., Liu C. L., Jin L. and Bai X. 2023. On the hidden mystery of OCR in large multimodal models [EB/OL]. [2023-05-13]. https://arxiv.org/pdf/2305.07895
Li Z., Yang B., Liu Q., Ma Z., Zhang S., Yang J., Sun Y., Liu Y. and Bai X. 2024. Monkey: image resolution and text label are important things for large multi-modal models. //Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition: 26763-26773 [DOI: 10.1109/CVPR52733.2024.02527]