针对文档图像的非对称式几何校正网络
AsymcNet: a document images-relevant asymmetric geometry correction network
2023, Vol. 28, No. 8, Pages: 2314-2329
Print publication date: 2023-08-16
DOI: 10.11834/jig.220426
秦海, 李艺杰, 梁桥康, 王耀南. 2023. 针对文档图像的非对称式几何校正网络. 中国图象图形学报, 28(08):2314-2329
Qin Hai, Li Yijie, Liang Qiaokang, Wang Yaonan. 2023. AsymcNet: a document images-relevant asymmetric geometry correction network. Journal of Image and Graphics, 28(08):2314-2329
Objective
Geometric correction of document images refers to removing, by means of image processing, geometric interference such as warping, distortion, and skew introduced during image acquisition, so as to improve the visual quality of the original image and the accuracy of optical character recognition (OCR). Before deep learning became widespread, traditional image processing methods required auxiliary hardware such as laser scanners or required the document to be photographed from multiple viewpoints, and the algorithms were not sufficiently robust. Deep learning models can avoid the shortcomings of traditional algorithms, but at present such models still have certain limitations. To address the deficiencies of existing algorithms, this paper proposes a lightweight geometric correction network, AsymcNet (asymmetric geometry correction network), which integrates document region localization and correction and performs geometric correction of document images end to end.
Method
AsymcNet consists of a segmentation network for locating the document region and a regression network for regressing the rectification grid, and the two sub-networks are cascaded. Owing to the segmentation network, AsymcNet achieves good correction results for document images captured under various fields of view. In the regression network, the resolution of the output rectification grid is reduced to lower the memory consumption and running time of AsymcNet during training and inference.
Result
On a self-built test dataset, AsymcNet is compared with four of the latest methods in the field. AsymcNet improves the multi-scale structural similarity (MS-SSIM) of the original images from 0.318 to 0.467, reduces the local distortion (LD) from 33.608 to 11.615, and reduces the character error rate (CER) from 0.570 to 0.273. Compared with DFE-FC (displacement flow estimation with fully convolutional network), one of the best-performing methods in the field, AsymcNet improves MS-SSIM by 0.036, lowers LD by 2.193, and reduces CER by 0.033, and its average processing time for a single image is only 8.85% of that of DFE-FC.
Conclusion
The experiments verify the effectiveness and superiority of the proposed AsymcNet.
Objective
Electronic entry of paper documents normally relies on optical character recognition (OCR) technology. A typical OCR system consists of four sequential steps: image acquisition, image preprocessing, character recognition, and typesetting output. The acquired digital image often suffers from geometric distortion, because the paper document may not be parallel to the plane of the image acquisition device, the lens of the device may introduce distortion, or the paper itself may be deformed. These interferences and distortions become more severe when handheld capture devices such as mobile phone cameras are used. Highly robust computer vision algorithms are therefore needed to remove the geometric distortion introduced while imaging paper documents. Current research focuses on neural-network-based geometric correction of document images. Compared with traditional geometric correction algorithms, neural-network-based methods are attractive in terms of both hardware requirements and algorithm implementation, but their performance still needs improvement, especially in offline and lightweight settings. Geometric correction of document images removes distortion, warping, skew, and other geometric perturbations introduced during image capture, thereby improving the visual quality of the original image and the accuracy of OCR. Conventional image processing methods require auxiliary hardware such as laser scanners or documents captured from multiple views, and their robustness is limited. Deep learning methods can overcome these shortcomings through learned models, but existing models still have certain limitations. We therefore propose a lightweight geometric correction network, AsymcNet, which integrates document region localization and correction and performs geometric correction of document images end to end.
Method
AsymcNet is designed to handle the geometric interference that arises during image acquisition. It consists of a segmentation network for locating the document region and a regression network for regressing the rectification grid, and the two sub-networks are cascaded. Thanks to the segmentation network, AsymcNet achieves good correction results for document images captured under various fields of view. In the regression network, the resolution of the output rectification grid is reduced to lower the memory consumption and running time of training and inference. The two sub-networks are designed as follows. 1) Segmentation network: a simplified U-Net in which skip connections between the encoder and decoder let features from lower layers flow directly into higher layers, combined with small-resolution inputs and outputs. Because the segmentation task is relatively simple, the segmentation network takes a small-resolution (128 × 128 pixels) document image as input and outputs a small-resolution segmentation result, which keeps the network lightweight and facilitates subsequent localization and mobile deployment. 2) Regression network: compared with the segmentation task, regressing the rectification grid is more complex. To capture more details of the image to be corrected for the final grid regression, the regression network takes as input a large-resolution (512 × 512 pixels) document image multiplied element-wise by the segmentation result produced by the segmentation network, and it outputs a small-resolution (128 × 128 pixels) rectification grid.
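To make the asymmetric design concrete, the following PyTorch-style sketch shows how the two sub-networks could be cascaded and how a small-resolution rectification grid can be applied to the full-resolution image with grid sampling. Only the resolutions stated above (128 × 128 segmentation input and output, 512 × 512 regression input, 128 × 128 output grid) come from the paper; the module interfaces, interpolation modes, and variable names are illustrative assumptions rather than the authors' implementation.

```python
# Hypothetical sketch of the cascade described above (not the authors' code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class AsymcNetSketch(nn.Module):
    def __init__(self, seg_net: nn.Module, reg_net: nn.Module):
        super().__init__()
        self.seg_net = seg_net   # document-region segmentation sub-network
        self.reg_net = reg_net   # rectification-grid regression sub-network

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        # image: (N, 3, H, W) distorted document photograph
        # 1) Segmentation branch runs on a small 128 x 128 input.
        seg_in = F.interpolate(image, size=(128, 128), mode='bilinear', align_corners=False)
        mask = torch.sigmoid(self.seg_net(seg_in))              # (N, 1, 128, 128) document mask
        # 2) Regression branch runs on a large 512 x 512 input that is
        #    gated (element-wise product) by the upsampled mask.
        reg_in = F.interpolate(image, size=(512, 512), mode='bilinear', align_corners=False)
        mask_up = F.interpolate(mask, size=(512, 512), mode='bilinear', align_corners=False)
        grid_small = self.reg_net(reg_in * mask_up)             # (N, 2, 128, 128) small grid
        # 3) Upsample the small grid once and resample the original image.
        grid = F.interpolate(grid_small, size=image.shape[-2:], mode='bilinear', align_corners=False)
        grid = grid.permute(0, 2, 3, 1)                         # (N, H, W, 2), coordinates in [-1, 1]
        return F.grid_sample(image, grid, mode='bilinear', align_corners=False)
```

Because the regression target is a 128 × 128 grid rather than a full-resolution displacement field, the final layers and the loss computation stay small, and the grid only has to be upsampled once at sampling time.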
Result
AsymcNet is compared with four popular recent methods on a self-built test dataset. It improves the multi-scale structural similarity (MS-SSIM) of the raw images from 0.318 to 0.467, reduces the local distortion (LD) from 33.608 to 11.615, and reduces the character error rate (CER) from 0.570 to 0.273. Compared with displacement flow estimation with a fully convolutional network (DFE-FC), AsymcNet improves MS-SSIM by 0.036, lowers LD by 2.193, and reduces CER by 0.033, and its average processing time for a single image is only 8.85% of that of DFE-FC. The experimental results demonstrate that the proposed AsymcNet has clear advantages over related correction algorithms. In particular, when the document region occupies only a small relative area of the image to be processed, the advantage of AsymcNet is even more pronounced because a sub-network for document region segmentation is integrated into its structure.
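For reference, the character error rate reported above is conventionally defined as the Levenshtein edit distance between the OCR output of the corrected image and the ground-truth text, divided by the length of the ground-truth text. The sketch below is a minimal Python illustration of that formula; the function names are ours, and the exact OCR engine and text normalization used in the experiments are not restated here.

```python
# Minimal CER sketch: edit distance between reference text and OCR hypothesis,
# normalized by the reference length. Function names are illustrative.
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute ca -> cb
        prev = cur
    return prev[-1]

def cer(reference: str, hypothesis: str) -> float:
    """Character error rate = edit distance / number of reference characters."""
    return levenshtein(reference, hypothesis) / max(len(reference), 1)

# Example: cer("document", "docurnent") == 0.25 (2 edits over 8 reference characters)
```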
Conclusion
The proposed AsymcNet is validated for its effectiveness and generalization. Compared with existing methods, AsymcNet has advantages in correction accuracy, computational efficiency, and generalization. Furthermore, AsymcNet uses a small-resolution grid as the regression target of the network, which alleviates the difficulty of network convergence and reduces the memory consumption during training and inference. The generalizability of the network can be improved further in future work.
图像预处理；几何校正；全卷积网络(FCN)；网格采样；端到端
image preprocessing; geometric correction; fully convolutional network (FCN); grid sampling; end-to-end
Bandyopadhyay H, Dasgupta T, Das N and Nasipuri M. 2021. A gated and bifurcated stacked U-Net module for document image dewarping//Proceedings of the 25th International Conference on Pattern Recognition. Milan, Italy: IEEE: 10548-10554 [DOI: 10.1109/ICPR48806.2021.9413001]
Brown M S and Seales W B. 2001. Document restoration using 3D shape: a general deskewing algorithm for arbitrarily warped documents//Proceedings of the 8th IEEE International Conference on Computer Vision. Vancouver, Canada: IEEE: 367-374 [DOI: 10.1109/ICCV.2001.937649]
Das S, Ma K, Shu Z X, Samaras D and Shilkrot R. 2019. DewarpNet: single-image document unwarping with stacked 3D and 2D regression networks//Proceedings of 2019 IEEE/CVF International Conference on Computer Vision. Seoul, Korea (South): IEEE: 131-140 [DOI: 10.1109/ICCV.2019.00022]
Das S, Mishra G, Sudharshana A and Shilkrot R. 2017. The common fold: utilizing the four-fold to dewarp printed documents from a single image//Proceedings of the 2017 ACM Symposium on Document Engineering. Valletta, Malta: ACM: 125-128 [DOI: 10.1145/3103010.3121030]
Das S, Singh K Y, Wu J, Bas E, Mahadevan V, Bhotika R and Samaras D. 2021. End-to-end piece-wise unwarping of document images//Proceedings of 2021 IEEE/CVF International Conference on Computer Vision. Montreal, Canada: IEEE: 4248-4257 [DOI: 10.1109/ICCV48922.2021.00423]
Feng H, Wang Y C, Zhou W G, Deng J J and Li H Q. 2021. DocTr: document image transformer for geometric unwarping and illumination correction//Proceedings of the 29th ACM International Conference on Multimedia. [s.l.]: ACM: 273-281 [DOI: 10.1145/3474085.3475388]
Gao L C, Li Y B, Du L, Zhang X P, Zhu Z Y, Lu N, Jin L W, Huang Y S and Tang Z. 2022. A survey on table recognition technology. Journal of Image and Graphics, 27(6): 1898-1917
高良才, 李一博, 都林, 张新鹏, 朱子仪, 卢宁, 金连文, 黄永帅, 汤帜. 2022. 表格识别技术研究进展. 中国图象图形学报, 27(6): 1898-1917 [DOI: 10.11834/jig.220152]
Glorot X and Bengio Y. 2010. Understanding the difficulty of training deep feedforward neural networks//Proceedings of the 13th International Conference on Artificial Intelligence and Statistics. Sardinia, Italy: AISTATS: 249-256
Goodfellow I, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, Courville A and Bengio Y. 2020. Generative adversarial networks. Communications of the ACM, 63(11): 139-144 [DOI: 10.1145/3422622]
He K M, Zhang X Y, Ren S Q and Sun J. 2016. Deep residual learning for image recognition//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, USA: IEEE: 770-778 [DOI: 10.1109/CVPR.2016.90]
Isola P, Zhu J Y, Zhou T H and Efros A A. 2017. Image-to-image translation with conditional adversarial networks//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE: 5967-5976 [DOI: 10.1109/CVPR.2017.632]
Jaderberg M, Simonyan K, Zisserman A and Kavukcuoglu K. 2015. Spatial transformer networks//Proceedings of the 28th International Conference on Neural Information Processing Systems. Montreal, Canada: MIT Press: 2017-2025
Jiang X W, Long R J, Xue N, Yang Z B, Yao C and Xia G S. 2022. Revisiting document image dewarping by grid regularization//Proceedings of 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New Orleans, USA: IEEE: 4533-4542 [DOI: 10.1109/CVPR52688.2022.00450]
Kingma D P and Ba J. 2015. Adam: a method for stochastic optimization//Proceedings of the 3rd International Conference on Learning Representations. San Diego, USA: ICLR
Krizhevsky A, Sutskever I and Hinton G E. 2017. ImageNet classification with deep convolutional neural networks. Communications of the ACM, 60(6): 84-90 [DOI: 10.1145/3065386]
Levenshtein V I. 1966. Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady, 10(8): 707-710
Li X Y, Zhang B, Liao J and Sander P V. 2019. Document rectification and illumination correction using a patch-based CNN. ACM Transactions on Graphics, 38(6): #168 [DOI: 10.1145/3355089.3356563]
Liang J, DeMenthon D and Doermann D. 2008. Geometric rectification of camera-captured document images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(4): 591-605 [DOI: 10.1109/TPAMI.2007.70724]
Liu C, Yuen J and Torralba A. 2011. SIFT flow: dense correspondence across scenes and its applications. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(5): 978-994 [DOI: 10.1109/TPAMI.2010.147]
Liu C S, Zhang Y, Wang B K and Ding X Q. 2015. Restoring camera-captured distorted document images. International Journal on Document Analysis and Recognition, 18(2): 111-124 [DOI: 10.1007/s10032-014-0233-8]
Long J, Shelhamer E and Darrell T. 2015. Fully convolutional networks for semantic segmentation//Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston, USA: IEEE: 3431-3440 [DOI: 10.1109/CVPR.2015.7298965]
Ma K, Das S, Shu Z X and Samaras D. 2022. Learning from documents in the wild to improve document unwarping//ACM SIGGRAPH 2022 Conference Proceedings. Vancouver, Canada: ACM: #34 [DOI: 10.1145/3528233.3530756]
Ma K, Shu Z X, Bai X, Wang J and Samaras D. 2018. DocUNet: document image unwarping via a stacked U-Net//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 4700-4709 [DOI: 10.1109/CVPR.2018.00494]
Markovitz A, Lavi I, Perel O, Mazor S and Litman R. 2020. Can you read me now? Content aware rectification using angle supervision//Proceedings of the 16th European Conference on Computer Vision. Glasgow, UK: Springer: 208-223 [DOI: 10.1007/978-3-030-58610-2_13]
Meng G F, Wang Y, Qu S Q, Xiang S M and Pan C H. 2014. Active flattening of curved document images via two structured beams//Proceedings of 2014 IEEE Conference on Computer Vision and Pattern Recognition. Columbus, USA: IEEE: 3890-3897 [DOI: 10.1109/CVPR.2014.497]
Odena A, Dumoulin V and Olah C. 2016. Deconvolution and checkerboard artifacts. Distill, 1(10): #e3 [DOI: 10.23915/distill.00003]
Paszke A, Gross S, Massa F, Lerer A, Bradbury J, Chanan G, Killeen T, Lin Z M, Gimelshein N, Antiga L, Desmaison A, Köpf A, Yang E Z, DeVito Z, Raison M, Tejani A, Chilamkurthy S, Steiner B, Fang L, Bai J J and Chintala S. 2019. PyTorch: an imperative style, high-performance deep learning library//Proceedings of the 33rd International Conference on Neural Information Processing Systems. Vancouver, Canada: Curran Associates Inc.: #721
Ronneberger O, Fischer P and Brox T. 2015. U-Net: convolutional networks for biomedical image segmentation//Proceedings of 2015 International Conference on Medical Image Computing and Computer-Assisted Intervention. Munich, Germany: Springer: 234-241 [DOI: 10.1007/978-3-319-24574-4_28]
Smith R. 2007. An overview of the Tesseract OCR engine//Proceedings of the 9th International Conference on Document Analysis and Recognition. Curitiba, Brazil: IEEE: 629-633 [DOI: 10.1109/ICDAR.2007.4376991]
Tian Y D and Narasimhan S G. 2011. Rectification and 3D reconstruction of curved document images//Proceedings of 2011 IEEE Conference on Computer Vision and Pattern Recognition. Colorado Springs, USA: IEEE: 377-384 [DOI: 10.1109/CVPR.2011.5995540]
Tsoi Y C and Brown M S. 2007. Multi-view document rectification using boundary//Proceedings of 2007 IEEE Conference on Computer Vision and Pattern Recognition. Minneapolis, USA: IEEE: 1-8 [DOI: 10.1109/CVPR.2007.383251]
Wang Z, Bovik A C, Sheikh H R and Simoncelli E P. 2004. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4): 600-612 [DOI: 10.1109/TIP.2003.819861]
Wang Z, Simoncelli E P and Bovik A C. 2003. Multiscale structural similarity for image quality assessment//Proceedings of the 37th Asilomar Conference on Signals, Systems and Computers. Pacific Grove, USA: IEEE: 1398-1402 [DOI: 10.1109/ACSSC.2003.1292216]
Xie G W, Yin F, Zhang X Y and Liu C L. 2020. Dewarping document image by displacement flow estimation with fully convolutional network//Proceedings of the 14th International Workshop on Document Analysis Systems. Wuhan, China: Springer: 131-144 [DOI: 10.1007/978-3-030-57058-3_10]
Xie G W, Yin F, Zhang X Y and Liu C L. 2021. Document dewarping with control points//Proceedings of the 16th International Conference on Document Analysis and Recognition. Lausanne, Switzerland: Springer: 466-480 [DOI: 10.1007/978-3-030-86549-8_30]
Xing Y C, Li R, Cheng L L and Wu Z J. 2018. Research on curved Chinese document correction based on deep neural network//Proceedings of the 11th International Symposium on Computational Intelligence and Design. Hangzhou, China: IEEE: 342-345 [DOI: 10.1109/ISCID.2018.10179]
Xue C H, Tian Z C, Zhan F N, Lu S J and Bai S. 2022. Fourier document restoration for robust document dewarping and recognition//Proceedings of 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New Orleans, USA: IEEE: 4563-4572 [DOI: 10.1109/CVPR52688.2022.00453]
Ying Z L, Zhao Y H, Xuan C and Deng W B. 2020. Layout analysis of document images based on multifeature fusion. Journal of Image and Graphics, 25(2): 311-320
应自炉, 赵毅鸿, 宣晨, 邓文博. 2020. 多特征融合的文档图像版面分析. 中国图象图形学报, 25(2): 311-320 [DOI: 10.11834/jig.190190]
You S D, Matsushita Y, Sinha S, Bou Y and Ikeuchi K. 2018. Multiview rectification of folded documents. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(2): 505-511 [DOI: 10.1109/TPAMI.2017.2675980]
Yu F, Koltun V and Funkhouser T. 2017. Dilated residual networks//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE: 636-644 [DOI: 10.1109/CVPR.2017.75]
Zhang J X, Luo C J, Jin L W, Guo F J and Ding K. 2022. Marior: margin removal and iterative content rectification for document dewarping in the wild//Proceedings of the 30th ACM International Conference on Multimedia. Lisboa, Portugal: ACM: 2805-2815 [DOI: 10.1145/3503161.3548214]
Zhang L, Zhang Y and Tan C. 2008. An improved physically-based method for geometric restoration of distorted document images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(4): 728-734 [DOI: 10.1109/TPAMI.2007.70831]
Zhang Z Y. 1999. Flexible camera calibration by viewing a plane from unknown orientations//Proceedings of the 7th IEEE International Conference on Computer Vision. Kerkyra, Greece: IEEE: 666-673 [DOI: 10.1109/ICCV.1999.791289]
Zhu J Y, Park T, Isola P and Efros A A. 2017. Unpaired image-to-image translation using cycle-consistent adversarial networks//Proceedings of 2017 IEEE International Conference on Computer Vision. Venice, Italy: IEEE: 2242-2251 [DOI: 10.1109/ICCV.2017.244]