SCID: a scanned Chinese invoice dataset for key information extraction from visually-rich document images
Vol. 28, Issue 8, Pages: 2298-2313 (2023)
Published: 16 August 2023
DOI: 10.11834/jig.220911
Qiao Liang, Li Zaisheng, Cheng Zhanzhan, Li Xi. 2023. SCID: a scanned Chinese invoice dataset for key information extraction from visually-rich document images. Journal of Image and Graphics, 28(08):2298-2313
Objective
Visually-rich document information extraction aims to extract the key textual information in an input document image in a structured form so as to solve practical business problems, and financial invoices are one of the most common data types involved. Solving this kind of problem usually requires techniques from several fields, such as optical character recognition (OCR) and information extraction. However, only a few related datasets are publicly available, and each contains a relatively small number of images, which has become an important factor restricting technical progress in this field. To this end, this paper collects, annotates, and publicly releases SCID (scanned Chinese invoice dataset), a real-world scanned Chinese invoice dataset covering six common types of financial invoices with 40 716 images in total.
Method
The dataset provides two kinds of labels, one for the OCR task and one for information extraction. For this dataset, we propose a baseline built on LayoutLM v2 (layout language model v2) that performs end-to-end inference from the input image to the final result. The CSIG (China Society of Image and Graphics) 2022 invoice recognition and analysis challenge, which was organized on this dataset, attracted many researchers and produced excellent solutions.
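The abstract names the two label types but not their concrete schema. As a purely hypothetical Python illustration of how an OCR label (content plus location) and an information-extraction label (entity category) might be organized for a single image (all field names below are invented, not taken from the SCID release):

```python
# Hypothetical illustration only: the real SCID schema is defined by the
# dataset release, not by this abstract; all field names are invented.
annotation = {
    "image": "train_ticket_000001.jpg",
    "ocr": [   # OCR-task label: text content plus quadrilateral location
        {"text": "北京南站", "points": [120, 64, 268, 64, 268, 102, 120, 102]},
        {"text": "二等座", "points": [120, 130, 210, 130, 210, 160, 120, 160]},
    ],
    "kie": [   # information-extraction label: entity category per instance
        {"text": "北京南站", "category": "departure_station"},
        {"text": "二等座", "category": "seat_class"},
    ],
}
```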
Result
The baseline experiments evaluate three settings: inference with an off-the-shelf OCR engine, fine-tuned OCR models, and OCR ground truth, which reach F1 scores of 0.768 7, 0.857 0, and 0.985 7, respectively. These results demonstrate the effectiveness of the LayoutLM v2 model on the one hand, and the difficulty that OCR still poses in this scenario on the other.
Conclusion
The proposed scanned invoice dataset SCID exhibits many of the challenges found in real-world OCR application scenarios and can provide important data support for the research, development, and deployment of visually-rich document information extraction techniques. The dataset can be downloaded at https://davar-lab.github.io/dataset/scid.html.
Objective
Visually-rich document information extraction is committed to extracting the key textual information in document images in a structured form. Invoices are one of the most common document types, and enterprise reimbursement processes create a strong demand for extracting key information from them. To solve this problem, techniques such as optical character recognition (OCR) and information extraction have been developed intensively. However, the number of publicly available datasets in this area is small, and the number of images in each dataset is also limited.
Method
We collect, annotate, and release a real-world scanned Chinese financial invoice dataset. It consists of 40 716 images of six invoice types: air itinerary tickets, taxi invoices, general quota invoices, passenger invoices, train tickets, and toll invoices, and is further divided into training/validation/test sets of 19 999/10 358/10 359 images. The labeling process involves pseudo-label generation, manual rechecking and cleaning, and manual desensitization, and yields two kinds of labels, one for the OCR task and one for information extraction. The images retain real-world challenges such as print misalignment, blurring, and overlap. We also provide a baseline scheme that realizes end-to-end inference in four steps: 1) an OCR module predicts the content and location of all text instances; 2) a text-block ordering module rearranges the text instances into a more reasonable order, serializing the 2D layout into a 1D sequence; 3) a LayoutLM v2 model fuses information from three modalities (text, visual, and layout) and predicts sequence labels, exploiting the knowledge in the pre-trained language model; 4) a post-processing module converts the model output into the final structured information. Handling multiple invoice types in a single pipeline simplifies the overall invoice-processing system.
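The following is a minimal sketch of the four inference steps, assuming the HuggingFace transformers port of LayoutLMv2 rather than the authors' own implementation (which builds on DavarOCR). Here run_ocr is a hypothetical placeholder for any text detector plus recognizer, the reading-order heuristic and BIO decoding are common simplifications not specified in the abstract, and in practice a fine-tuned checkpoint would replace the base weights:

```python
import torch
from PIL import Image
from transformers import LayoutLMv2Processor, LayoutLMv2ForTokenClassification

def reading_order(instances):
    # Step 2: serialize the 2D layout into a 1D sequence with a simple
    # top-to-bottom, left-to-right sort on coarse row bands (a common
    # heuristic; the baseline's exact ordering rule is not given here).
    return sorted(instances, key=lambda t: (round(t["box"][1] / 20), t["box"][0]))

def decode_bio(words, tags):
    # Step 4: merge BIO-tagged words into {field: value} structured output.
    fields, key, val = {}, None, []
    for word, tag in zip(words, tags):
        inside = tag.startswith("I-") and tag[2:] == key
        if key is not None and not inside:
            fields[key] = "".join(val)  # close the running entity
            key, val = None, []
        if tag.startswith("B-"):
            key, val = tag[2:], [word]
        elif inside:
            val.append(word)
    if key is not None:
        fields[key] = "".join(val)
    return fields

def extract(image_path, id2label):
    image = Image.open(image_path).convert("RGB")
    # Step 1: OCR -- `run_ocr` is a hypothetical placeholder returning
    # [{"text": str, "box": [x0, y0, x1, y1]}, ...] with boxes already
    # normalized to the 0-1000 range LayoutLMv2 expects.
    instances = reading_order(run_ocr(image))
    words = [t["text"] for t in instances]
    boxes = [t["box"] for t in instances]

    processor = LayoutLMv2Processor.from_pretrained(
        "microsoft/layoutlmv2-base-uncased", apply_ocr=False)
    # A fine-tuned checkpoint would be loaded here; base weights give an
    # untrained classification head and are shown only for completeness.
    model = LayoutLMv2ForTokenClassification.from_pretrained(
        "microsoft/layoutlmv2-base-uncased", num_labels=len(id2label))

    # Step 3: fuse text, layout, and visual features; predict a tag per token.
    inputs = processor(image, words, boxes=boxes,
                       truncation=True, return_tensors="pt")
    with torch.no_grad():
        pred_ids = model(**inputs).logits.argmax(-1).squeeze(0).tolist()

    # Map subword predictions back to words (first subword wins).
    word_tag = {}
    for idx, wid in enumerate(inputs.word_ids()):
        if wid is not None and wid not in word_tag:
            word_tag[wid] = id2label[pred_ids[idx]]
    tags = [word_tag.get(i, "O") for i in range(len(words))]
    return decode_bio(words, tags)  # Step 4
```

The point of step 2 is that LayoutLM-style models consume a 1D token sequence, so the 2D boxes must first be flattened into a stable reading order before tokenization.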
Result
The baseline is verified under three settings: OCR-engine inference, fine-tuned OCR models, and OCR ground truth, reaching F1 scores of 0.768 7, 0.857 0, and 0.985 7, respectively. These results confirm the effectiveness of the overall solution and the LayoutLM v2 model, and also reflect how challenging OCR remains in this scenario. On a Tesla V100 GPU, the model runs at 1.88 frames per second, and an accuracy of 90% can be reached with only the raw image as input. The best challenge solutions fall roughly into two categories. One folds the structuring task directly into text detection (i.e., multi-category detection), so that the recognition model only needs to read the text belonging to the categories of interest. The other follows a general information-extraction strategy, in which an independent information extraction model extracts the key information. These solutions further integrate the strengths of OCR and information extraction technologies.
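The abstract does not state the exact evaluation protocol behind these F1 scores. As a minimal sketch, assuming an entity-level, micro-averaged F1 with exact (field, value) matching, such a score could be computed as follows:

```python
def entity_f1(predictions, ground_truths):
    # Micro-averaged entity-level F1 over documents; a prediction counts as
    # a true positive only on an exact (field, value) match -- an assumption,
    # since the abstract does not define the matching protocol.
    tp = fp = fn = 0
    for pred, gt in zip(predictions, ground_truths):
        pred_set, gt_set = set(pred.items()), set(gt.items())
        tp += len(pred_set & gt_set)
        fp += len(pred_set - gt_set)
        fn += len(gt_set - pred_set)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return (2 * precision * recall / (precision + recall)
            if precision + recall else 0.0)

# Example: one document, one field correct out of two ground-truth fields.
print(entity_f1([{"date": "2022-08-15"}],
                [{"date": "2022-08-15", "total": "45.00"}]))  # ~0.667
```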
Conclusion
The proposed scanned Chinese invoice dataset SCID exhibits many of the challenges of real-world OCR application scenarios and can provide data support for the research, development, and deployment of visually-rich document information extraction technology. The dataset can be downloaded from https://davar-lab.github.io/dataset/scid.html.
dataset; financial invoices; visually-rich documents; information extraction; optical character recognition (OCR); multi-modal information
Appalaraju S, Jasani B, Kota B U, Xie Y S and Manmatha R. 2021. DocFormer: end-to-end transformer for document understanding//Proceedings of 2021 IEEE/CVF International Conference on Computer Vision. Montreal, Canada: IEEE: 973-983 [DOI: 10.1109/ICCV48922.2021.00103]
Cai Z W and Vasconcelos N. 2018. Cascade R-CNN: delving into high quality object detection//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 6154-6162 [DOI: 10.1109/CVPR.2018.00644]
Devlin J, Chang M W, Lee K and Toutanova K. 2019. BERT: pre-training of deep bidirectional transformers for language understanding//Proceedings of 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Minneapolis, Minnesota, USA: ACL: 4171-4186 [DOI: 10.18653/v1/n19-1423]
Du Y N, Li C X, Guo R Y, Yin X T, Liu W W, Zhou J, Bai Y F, Yu Z L, Yang Y H, Dang Q Q and Wang H S. 2020. PP-OCR: a practical ultra-lightweight OCR system [EB/OL]. [2022-08-15]. https://arxiv.org/pdf/2009.09941.pdf
Graliński F, Stanisławek T, Wróblewska A, Lipiński D, Kaliska A, Rosalska P, Topolski B and Biecek P. 2020. Kleister: a novel task for information extraction involving long documents with complex layout [EB/OL]. [2022-08-15]. https://arxiv.org/pdf/2003.02356.pdf
Guo H, Qin X M, Liu J M, Han J Y, Liu J T and Ding E R. 2019. EATEN: entity-aware attention for single shot visual text extraction//Proceedings of 2019 International Conference on Document Analysis and Recognition. Sydney, Australia: IEEE: 254-259 [DOI: 10.1109/ICDAR.2019.00049]
He K M, Zhang X Y, Ren S Q and Sun J. 2016. Deep residual learning for image recognition//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, USA: IEEE: 770-778 [DOI: 10.1109/CVPR.2016.90]
He K M, Gkioxari G, Dollár P and Girshick R. 2017. Mask R-CNN//Proceedings of 2017 IEEE International Conference on Computer Vision. Venice, Italy: IEEE: 2980-2988 [DOI: 10.1109/ICCV.2017.322]
Howard J and Ruder S. 2018. Universal language model fine-tuning for text classification//Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Melbourne, Australia: ACL: 328-339 [DOI: 10.18653/v1/P18-1031]
Huang Y P, Lv T C, Cui L, Lu Y T and Wei F R. 2022. LayoutLMv3: pre-training for document AI with unified text and image masking//Proceedings of the 30th ACM International Conference on Multimedia. Lisbon, Portugal: ACM: 4083-4091 [DOI: 10.1145/3503161.3548112]
Huang Z, Chen K, He J H, Bai X, Karatzas D, Lu S J and Jawahar C V. 2019. ICDAR2019 competition on scanned receipt OCR and information extraction//Proceedings of 2019 International Conference on Document Analysis and Recognition. Sydney, Australia: IEEE: 1516-1520 [DOI: 10.1109/ICDAR.2019.00244]
Jaume G, Ekenel H K and Thiran J P. 2019. FUNSD: a dataset for form understanding in noisy scanned documents//Proceedings of 2019 International Conference on Document Analysis and Recognition Workshops (ICDARW). Sydney, Australia: IEEE: 1-6 [DOI: 10.1109/ICDARW.2019.10029]
Li C L, Bi B, Yan M, Wang W, Huang S F, Huang F and Si L. 2021. StructuralLM: structural pre-training for form understanding//Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Virtual: ACL: 6309-6318 [DOI: 10.18653/v1/2021.acl-long.493]
Liao M H, Wan Z Y, Yao C, Chen K and Bai X. 2020. Real-time scene text detection with differentiable binarization. Proceedings of the AAAI Conference on Artificial Intelligence, 34(7): 11474-11481 [DOI: 10.1609/aaai.v34i07.6812]
Lin T Y, Dollár P, Girshick R, He K M, Hariharan B and Belongie S. 2017a. Feature pyramid networks for object detection//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE: 936-944 [DOI: 10.1109/CVPR.2017.106]
Lin T Y, Goyal P, Girshick R, He K M and Dollár P. 2017b. Focal loss for dense object detection//Proceedings of 2017 IEEE International Conference on Computer Vision. Venice, Italy: IEEE: 2999-3007 [DOI: 10.1109/ICCV.2017.324]
Lin W H, Gao Q F, Sun L, Zhong Z Y, Hu K, Ren Q and Huo Q. 2021. ViBERTgrid: a jointly trained multi-modal 2D document representation for key information extraction from documents//Proceedings of the 16th International Conference on Document Analysis and Recognition. Lausanne, Switzerland: Springer: 548-563 [DOI: 10.1007/978-3-030-86549-8_35]
Liu Z, Lin Y T, Cao Y, Hu H, Wei Y X, Zhang Z, Lin S and Guo B N. 2021. Swin transformer: hierarchical vision transformer using shifted windows//Proceedings of 2021 IEEE/CVF International Conference on Computer Vision. Montreal, Canada: IEEE: 9992-10002 [DOI: 10.1109/ICCV48922.2021.00986]
Loshchilov I and Hutter F. 2019. Decoupled weight decay regularization [EB/OL]. [2022-08-15]. http://arxiv.org/pdf/1711.05101.pdf
Park S, Shin S, Lee B, Lee J, Surh J, Seo M and Lee H. 2019. CORD: a consolidated receipt dataset for post-OCR parsing [EB/OL]. [2022-08-15]. https://openreview.net/pdf?id=SJl3z659UH
Qiao L, Jiang H, Chen Y, Li C, Li P F, Li Z S, Zou B R, Guo D S, Xu Y D, Xu Y L, Cheng Z Z and Niu Y. 2022. DavarOCR: a toolbox for OCR and multi-modal document understanding//Proceedings of the 30th ACM International Conference on Multimedia. Lisbon, Portugal: ACM: 7355-7358 [DOI: 10.1145/3503161.3548547]
Qiao L, Tang S L, Cheng Z Z, Xu Y L, Niu Y, Pu S L and Wu F. 2020. Text perceptron: towards end-to-end arbitrary-shaped text spotting. Proceedings of the AAAI Conference on Artificial Intelligence, 34(7): 11899-11907 [DOI: 10.1609/aaai.v34i07.6864]
Raffel C, Shazeer N, Roberts A, Lee K, Narang S, Matena M, Zhou Y Q, Li W and Liu P J. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(1): 5485-5551
Shi B G, Bai X and Yao C. 2017. An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(11): 2298-2304 [DOI: 10.1109/TPAMI.2016.2646371]
Sun H B, Kuang Z H, Yue X Y, Lin C H and Zhang W. 2021. Spatial dual-modality graph reasoning for key information extraction [EB/OL]. [2022-08-15]. https://arxiv.org/pdf/2103.14470.pdf
Sun S, Zhang W M, Fang H and Yu N H. 2022. Automatic generation of Chinese document watermarking fonts. Journal of Image and Graphics, 27(1): 262-276 [DOI: 10.11834/jig.200695]
Tang G Z, Xie L L, Jin L W, Wang J P, Chen J D, Xu Z, Wang Q Y, Wu Y Q and Li H. 2021. MatchVIE: exploiting match relevancy between entities for visual information extraction//Proceedings of the 30th International Joint Conference on Artificial Intelligence. Montreal, Canada: IJCAI: 1039-1045 [DOI: 10.24963/ijcai.2021/144]
Wang J P, Lu L W, Jin L W, Tang G Z, Zhang J X, Zhang S T, Wang Q Y, Wu Y Q and Cai M X. 2021. Towards robust visual information extraction in real world: new dataset and novel solution. Proceedings of the AAAI Conference on Artificial Intelligence, 35(4): 2738-2745 [DOI: 10.1609/aaai.v35i4.16378]
Wang P F, Zhang C Q, Qi F, Huang Z M, En M Y, Han J Y, Liu J T, Ding E R and Shi G M. 2019. A single-shot arbitrarily-shaped text detector based on context attended multi-task learning//Proceedings of the 27th ACM International Conference on Multimedia. Nice, France: ACM: 1277-1285 [DOI: 10.1145/3343031.3350988]
Xu Y, Xu Y H, Lv T C, Cui L, Wei F R, Wang G X, Lu Y J, Florêncio D, Zhang C, Che W X, Zhang M and Zhou L D. 2021a. LayoutLMv2: multi-modal pre-training for visually-rich document understanding//Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Online: ACL: 2579-2591 [DOI: 10.18653/v1/2021.acl-long.201]
Xu Y H, Li M H, Cui L, Huang S H, Wei F R and Zhou M. 2020. LayoutLM: pre-training of text and layout for document image understanding//Proceedings of the 26th Conference on Knowledge Discovery and Data Mining. Virtual Event, USA: ACM: 1192-1200 [DOI: 10.1145/3394486.3403172]
Xu Y H, Lv T C, Cui L, Wang G X, Lu Y J, Florêncio D, Zhang C and Wei F R. 2021b. LayoutXLM: multimodal pre-training for multilingual visually-rich document understanding [EB/OL]. [2022-08-15]. https://arxiv.org/pdf/2104.08836.pdf
Ying Z L, Zhao Y H, Xuan C and Deng W B. 2020. Layout analysis of document images based on multifeature fusion. Journal of Image and Graphics, 25(2): 311-320 [DOI: 10.11834/jig.190190]
Zhang L, Zhu Y and Wu G W. 1998. An algorithm to establish optimal trees for the description of document structures in document segmentation. Journal of Image and Graphics, 3(7): 553-556 [DOI: 10.11834/jig.199807179]
Zhang P, Xu Y L, Cheng Z Z, Pu S L, Lu J, Qiao L, Niu Y and Wu F. 2020. TRIE: end-to-end text reading and information extraction for document understanding//Proceedings of the 28th ACM International Conference on Multimedia. Seattle, USA: ACM: 1413-1422 [DOI: 10.1145/3394171.3413900]