Structure prior guided text image inpainting model
2023, Vol. 28, No. 12, Pages: 3699-3712
Print publication date: 2023-12-16
DOI: 10.11834/jig.220960
Liu Yuxuan, Zhao Qijun, Pan Fan, Gao Dingguo, Danzeng Pubu. 2023. Structure prior guided text image inpainting model. Journal of Image and Graphics, 28(12):3699-3712
Objective
Image inpainting automatically restores the missing content of an image from its known content. Deep-learning-based inpainting models have achieved promising results on natural and face images, but text image inpainting has rarely been studied, and existing methods that ensure structural coherence and texture consistency do not attend to the restoration of the text itself. To address this problem, we propose a structure prior guided text image inpainting model.
Method
First, a structure prior reconstruction network is built on a Transformer to capture global dependencies and reconstruct text-skeleton and edge prior images. Then, a novel static-to-dynamic residual block (StDRB) is proposed to convert static features into dynamic text image sequence features, and it is fused into an encoder-decoder inpainting network. Supervised by joint losses, including structure prior guidance and a gradient prior loss, the model produces coherent text strokes and realistic, natural content, which benefits downstream text detection and recognition tasks.
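As a concrete illustration of this two-stage design, below is a minimal PyTorch sketch of the inference flow. The module names (PriorReconstructionNet, InpaintingNet) and the placeholder layers are assumptions for illustration, not the authors' released implementation.

```python
# Minimal two-stage inference sketch; module names and placeholder layers
# are hypothetical, standing in for the Transformer prior network and the
# StDRB-based encoder-decoder described in the paper.
import torch
import torch.nn as nn

class PriorReconstructionNet(nn.Module):
    """Stage 1 stub: reconstructs text-skeleton and edge prior maps
    from the masked RGB image and masked prior images."""
    def __init__(self):
        super().__init__()
        # Placeholder for downsampling + Transformer layers + upsampling.
        self.body = nn.Conv2d(5, 2, kernel_size=3, padding=1)

    def forward(self, masked_rgb, masked_skeleton, masked_edge):
        x = torch.cat([masked_rgb, masked_skeleton, masked_edge], dim=1)
        return torch.sigmoid(self.body(x))  # 2 channels: skeleton, edge

class InpaintingNet(nn.Module):
    """Stage 2 stub: encoder-decoder inpainting guided by the priors."""
    def __init__(self):
        super().__init__()
        # Placeholder for the CNN encoder, StDRBs, and CNN decoder.
        self.body = nn.Conv2d(5, 3, kernel_size=3, padding=1)

    def forward(self, masked_rgb, priors):
        return torch.sigmoid(self.body(torch.cat([masked_rgb, priors], dim=1)))

# Inference on a 256 x 256 masked text image (the paper's input size).
masked_rgb = torch.rand(1, 3, 256, 256)
masked_skeleton = torch.rand(1, 1, 256, 256)
masked_edge = torch.rand(1, 1, 256, 256)
priors = PriorReconstructionNet()(masked_rgb, masked_skeleton, masked_edge)
inpainted = InpaintingNet()(masked_rgb, priors)
```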
Result
Experiments on synthetic Tibetan and English datasets compare the proposed model with four image inpainting models. The results show that our model achieves good subjective visual quality. Its peak signal-to-noise ratio and structural similarity reach 42.31 dB and 98.10% on the Tibetan dataset and 39.23 dB and 98.55% on the English dataset. The accuracy of Tesseract OCR (optical character recognition) on text in the inpainted Tibetan images reaches 62.83%, and the accuracies of Tesseract OCR, CRNN (convolutional recurrent neural network), and ASTER (attentional scene text recognizer) on text in the inpainted English images reach 85.13%, 86.04%, and 76.71%, respectively, all of which are superior to the compared models.
Conclusion
Drawing on ideas from general image inpainting methods, the proposed model exploits the characteristics of the text itself in text images and achieves more accurate text image inpainting results.
Objective
Image inpainting is the process of reconstructing the missing regions of corrupted images so that they become visually complete and semantically plausible. It is widely used in applications such as object removal, old photo restoration, and image editing. To date, deep-learning-based inpainting methods have achieved good performance on natural and human face images. Nevertheless, methods designed to ensure consistent image texture and structure have limitations in text image inpainting because they do not focus on the text itself. Meanwhile, studies on text images have mainly concentrated on text image super-resolution, text detection, and text recognition. However, many ancient documents contain broken text regions, which hinder downstream detection and recognition tasks as well as the digital preservation of ancient literature. Therefore, reconstructing broken text in images is worthy of further study. This paper proposes a novel text image inpainting model guided by text structure priors to address this problem.
Method
First, the model introduces a structure prior reconstruction network. Given that the text skeleton contains important text stroke information and that the text edge contains texture and structure information, the network adopts both priors to guide the inpainting. Because of the limited receptive fields of convolutional neural networks (CNNs), the network applies a Transformer to capture the long-range dependencies of the text image and reconstructs robust and readable text-skeleton and edge images from the features extracted from the masked RGB image, the masked text skeleton, and the masked edge image. To reduce the computational cost of self-attention in the Transformer, the network first downsamples the input image, sends the compressed features through sequential Transformer layers, and then upsamples the features to recover the prior images. To reconstruct an accurate text skeleton, the network is trained with a combination of binary cross-entropy (BCE) loss and Dice loss.

Second, to exploit the sequential nature of the text on the images, this paper designs a static-to-dynamic residual block (StDRB). The text image inpainting network adopts an encoder-decoder as its main architecture and integrates sequential StDRBs to enhance inpainting performance. The text-skeleton and edge images carry significant stroke and structure information about the whole image, and the StDRB module exploits this prior information to effectively assist the inpainting. The input image is first sent to the CNN encoder to obtain static fused features, which StDRB then converts into dynamic text sequence features. By assuming that text follows a pseudo-dynamic process from left to right and top to bottom, StDRB applies bidirectional gated recurrent units along the horizontal and vertical directions in parallel to extract useful text semantic information. The residual connection also deepens the network and facilitates convergence. Finally, the CNN decoder recovers the missing regions from the features to produce the inpainting results.

To make the restored text images visually realistic and semantically clear, the network combines several loss functions with preset weights, including adversarial, pixel reconstruction, perceptual, and style losses. Given that the aim of text image inpainting is to reconstruct the text strokes, the network also introduces a gradient prior loss into the joint loss. The gradient prior loss penalizes differences between the gradient fields of the inpainted and ground-truth images, constraining the network to generate sharp contrast between text strokes and the background. The training set consists of Tibetan and English text images randomly synthesized from corpora and noisy background images. All input images are resized to 256 × 256 pixels for training. The model is implemented in PyTorch and accelerated with an NVIDIA GeForce GTX 1080Ti GPU. The structure prior reconstruction network and the text image inpainting network are trained in two stages to obtain the inpainting results.
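As an illustration of the prior-network objective described above, here is a short PyTorch sketch of the BCE plus Dice combination; the weight lam is an assumed hyperparameter, not a value taken from the paper.

```python
import torch
import torch.nn.functional as F

def dice_loss(pred, target, eps=1e-6):
    # Soft Dice loss over (B, 1, H, W) skeleton maps in [0, 1]; it is
    # robust to the heavy foreground/background imbalance of thin strokes.
    inter = (pred * target).sum(dim=(1, 2, 3))
    union = pred.sum(dim=(1, 2, 3)) + target.sum(dim=(1, 2, 3))
    return (1.0 - (2.0 * inter + eps) / (union + eps)).mean()

def skeleton_loss(pred, target, lam=1.0):
    # BCE supervises per-pixel classification; Dice supervises region
    # overlap. lam is an assumed weighting between the two terms.
    return F.binary_cross_entropy(pred, target) + lam * dice_loss(pred, target)
```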
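The static-to-dynamic conversion can be pictured with the following sketch, assuming one plausible reading of the description: feature-map rows (left to right) and columns (top to bottom) are treated as sequences and scanned by bidirectional GRUs in parallel, with a residual connection back to the static features. The layer sizes and the 1 × 1 fusion convolution are assumptions, not the published architecture.

```python
import torch
import torch.nn as nn

class StDRB(nn.Module):
    """Sketch of a static-to-dynamic residual block. `channels` must be
    even so the bidirectional GRU output width matches the input width."""
    def __init__(self, channels):
        super().__init__()
        self.h_rnn = nn.GRU(channels, channels // 2, batch_first=True, bidirectional=True)
        self.v_rnn = nn.GRU(channels, channels // 2, batch_first=True, bidirectional=True)
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, x):  # x: (B, C, H, W) static fused features
        b, c, h, w = x.shape
        # Horizontal scan: every row becomes a length-W sequence.
        rows = x.permute(0, 2, 3, 1).reshape(b * h, w, c)
        rows, _ = self.h_rnn(rows)
        rows = rows.reshape(b, h, w, c).permute(0, 3, 1, 2)
        # Vertical scan: every column becomes a length-H sequence.
        cols = x.permute(0, 3, 2, 1).reshape(b * w, h, c)
        cols, _ = self.v_rnn(cols)
        cols = cols.reshape(b, w, h, c).permute(0, 3, 2, 1)
        # Fuse both directions; the residual path keeps the static features.
        return x + self.fuse(torch.cat([rows, cols], dim=1))
```

Stacking several such blocks between encoder and decoder deepens the network while the residual path eases convergence, matching the role the paper assigns to sequential StDRBs.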
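The gradient prior loss can likewise be sketched as below, assuming an L1 penalty between the finite-difference gradient fields of the inpainted image and the ground truth; the exact formulation in the paper may differ.

```python
import torch
import torch.nn.functional as F

def gradient_prior_loss(pred, target):
    # Finite-difference gradients along width (dx) and height (dy).
    def grads(img):
        dx = img[:, :, :, 1:] - img[:, :, :, :-1]
        dy = img[:, :, 1:, :] - img[:, :, :-1, :]
        return dx, dy
    pdx, pdy = grads(pred)
    tdx, tdy = grads(target)
    # Matching gradient fields sharpens stroke/background contrast.
    return F.l1_loss(pdx, tdx) + F.l1_loss(pdy, tdy)
```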
Result
Because few studies address text image inpainting, we compare our model with four natural image and face image inpainting models qualitatively and quantitatively, using the official implementations of all compared methods. From the perspective of human vision, the proposed model obtains better holistic inpainting results than the other methods and reconstructs the text in large missing regions in more detail and with greater accuracy. As quantitative metrics, this paper uses not only the image quality measures widely adopted in previous inpainting work but also optical character recognition (OCR) results, which directly reflect how well broken text on images is restored. Our model achieves an average peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) of 42.31 dB and 98.10% on the Tibetan dataset and 39.23 dB and 98.55% on the English dataset. The character accuracy of Tesseract OCR on the Tibetan dataset is 62.83%, and the character accuracies of Tesseract OCR, convolutional recurrent neural network (CRNN), and attentional scene text recognizer (ASTER) on the English dataset are 85.13%, 86.04%, and 76.71%, respectively. Our model clearly outperforms the other algorithms on both datasets.
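For reference, the PSNR values quoted above follow the standard definition; a minimal sketch for images scaled to [0, 1]:

```python
import torch

def psnr(pred, target, max_val=1.0):
    # Peak signal-to-noise ratio in dB: 10 * log10(MAX^2 / MSE).
    mse = torch.mean((pred - target) ** 2)
    return 10.0 * torch.log10(max_val ** 2 / mse)
```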
Conclusion
This paper proposes a structure prior guided text image inpainting model that reconstructs text structure priors and uses them to guide text image inpainting. To obtain accurate priors, we employ a Transformer-based network, which improves the quality of the results. In the inpainting stage, the StDRBs integrated into the network extract useful text sequence information and boost text inpainting performance. The model is also trained with effective joint loss functions to further improve its results. The results on Tibetan and English datasets demonstrate the effectiveness of the proposed model.
Keywords: image inpainting; text image inpainting; structure prior; static-to-dynamic residual block (StDRB); joint loss
Cho K, van Merrienboer B, Gulcehre C, Bahdanau D, Bougares F, Schwenk H and Bengio Y. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation//Proceedings of 2014 Conference on Empirical Methods in Natural Language Processing. Doha, Qatar: Association for Computational Linguistics: 1724-1734 [DOI: 10.3115/v1/d14-1179]
Dong Q L, Cao C J and Fu Y W. 2022. Incremental Transformer structure enhanced image inpainting with masking positional encoding [EB/OL]. [2022-09-06]. https://arxiv.org/pdf/2203.00867v2.pdf
Duan Y, Long H, Qu Y Q, Shao Y B and Du Q Z. 2021. An irregular interference repair algorithm of text images based on partial convolution. Computer Engineering and Science, 43(9): 1634-1644 [DOI: 10.3969/j.issn.1007-130X.2021.09.014]
Guo J T. 2021. Research on Face Image Inpainting and Editing Based on Generative Adversarial Networks. Beijing: Beijing Jiaotong University [DOI: 10.26944/d.cnki.gbfju.2021.000352]
Guo X F, Yang H Y and Huang D. 2021. Image inpainting via conditional texture and structure dual generation//Proceedings of 2021 IEEE/CVF International Conference on Computer Vision. Montreal, Canada: IEEE: 14114-14123 [DOI: 10.1109/ICCV48922.2021.01387]
Hu J, Shen L and Sun G. 2018. Squeeze-and-excitation networks//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 7132-7141 [DOI: 10.1109/CVPR.2018.00745]
Li C T, Siu W C, Liu Z S, Wang L W and Lun D P K. 2020. DeepGIN: deep generative inpainting network for extreme image inpainting//Proceedings of the 2020 European Conference on Computer Vision Workshops. Glasgow, UK: Springer: 5-22 [DOI: 10.1007/978-3-030-66823-5_1]
Liao L, Xiao J, Wang Z, Lin C W and Satoh S. 2020. Guidance and evaluation: semantic-aware image inpainting for mixed scenes//Proceedings of the 16th European Conference on Computer Vision. Glasgow, UK: Springer: 683-700 [DOI: 10.1007/978-3-030-58583-9_41]
Liu G L, Reda F A, Shih K J, Wang T C, Tao A and Catanzaro B. 2018. Image inpainting for irregular holes using partial convolutions//Proceedings of the 15th European Conference on Computer Vision. Munich, Germany: Springer: 89-105 [DOI: 10.1007/978-3-030-01252-6_6]
Liu H Y, Jiang B, Xiao Y and Yang C. 2019. Coherent semantic attention for image inpainting//Proceedings of 2019 IEEE/CVF International Conference on Computer Vision. Seoul, Korea (South): IEEE: 4169-4178 [DOI: 10.1109/ICCV.2019.00427]
Liu Z W, Luo P, Wang X G and Tang X O. 2015. Deep learning face attributes in the wild//Proceedings of 2015 IEEE International Conference on Computer Vision. Santiago, Chile: IEEE [DOI: 10.1109/ICCV.2015.425]
Nazeri K, Ng E, Joseph T, Qureshi F Z and Ebrahimi M. 2019. EdgeConnect: generative image inpainting with adversarial edge learning [EB/OL]. [2022-09-06]. https://arxiv.org/pdf/1901.00212v3.pdf
Pathak D, Krahenbuhl P, Donahue J, Darrell T and Efros A A. 2016. Context encoders: feature learning by inpainting//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, USA: IEEE: 2536-2544 [DOI: 10.1109/CVPR.2016.278]
Qiang Z P, He L B, Chen X and Xu D. 2019. Survey on deep learning image inpainting methods. Journal of Image and Graphics, 24(3): 447-463 [DOI: 10.11834/jig.180408]
Russakovsky O, Deng J, Su H, Krause J, Satheesh S, Ma S A, Huang Z H, Karpathy A, Khosla A, Bernstein M, Berg A C and Li F F. 2015. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3): 211-252 [DOI: 10.1007/s11263-015-0816-y]
Shi B G, Bai X and Yao C. 2017. An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(11): 2298-2304 [DOI: 10.1109/TPAMI.2016.2646371]
Shi B G, Yang M K, Wang X G, Lyu P Y, Yao C and Bai X. 2019. ASTER: an attentional scene text recognizer with flexible rectification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(9): 2035-2048 [DOI: 10.1109/TPAMI.2018.2848939]
Simo-Serra E, Iizuka S and Ishikawa H. 2018. Real-time data-driven interactive rough sketch inking. ACM Transactions on Graphics, 37(4): #98 [DOI: 10.1145/3197517.3201370]
Sun J, Sun J, Xu Z B and Shum H Y. 2011. Gradient profile prior and its applications in image super-resolution and enhancement. IEEE Transactions on Image Processing, 20(6): 1529-1542 [DOI: 10.1109/TIP.2010.2095871]
Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez A N, Kaiser L and Polosukhin I. 2017. Attention is all you need [EB/OL]. [2022-09-06]. https://arxiv.org/pdf/1706.03762.pdf
Wan Z Y, Zhang B, Chen D D, Zhang P, Chen D, Liao J and Wen F. 2020. Bringing old photos back to life//Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, USA: IEEE: 2744-2754 [DOI: 10.1109/CVPR42600.2020.00282]
Wan Z Y, Zhang J B, Chen D D and Liao J. 2021. High-fidelity pluralistic image completion with Transformers//Proceedings of 2021 IEEE/CVF International Conference on Computer Vision. Montreal, Canada: IEEE: 4672-4681 [DOI: 10.1109/ICCV48922.2021.00465]
Wang N, Li J Y, Zhang L F and Du B. 2019. MUSICAL: multi-scale image contextual attention learning for inpainting//Proceedings of the 28th International Joint Conference on Artificial Intelligence. Macao, China: Morgan Kaufmann: 3748-3754 [DOI: 10.24963/ijcai.2019/520]
Wang W H. 2021. Image Restoration Algorithm Based on U-Net Networks and Its Application on Yi Character Restoration. Kunming: Yunnan University [DOI: 10.27456/d.cnki.gyndu.2021.000783]
Wu H W, Zhou J T and Li Y M. 2022. Deep generative model for image inpainting with local binary pattern learning and spatial attention. IEEE Transactions on Multimedia, 24: 4016-4027 [DOI: 10.1109/TMM.2021.3111491]
Yan Z Y, Li X M, Li M, Zuo W M and Shan S G. 2018. Shift-Net: image inpainting via deep feature rearrangement//Proceedings of the 15th European Conference on Computer Vision. Munich, Germany: Springer: 3-19 [DOI: 10.1007/978-3-030-01264-9_1]
Yu B X, Xu Y, Huang Y, Yang S and Liu J Y. 2021. Mask-guided GAN for robust text editing in the scene. Neurocomputing, 441: 192-201 [DOI: 10.1016/j.neucom.2021.02.045]
Yu J H, Lin Z, Yang J M, Shen X H, Lu X and Huang T. 2019. Free-form image inpainting with gated convolution//Proceedings of 2019 IEEE/CVF International Conference on Computer Vision. Seoul, Korea (South): IEEE: 4470-4479 [DOI: 10.1109/ICCV.2019.00457]
Yu J H, Lin Z, Yang J M, Shen X H, Lu X and Huang T S. 2018. Generative image inpainting with contextual attention//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 5505-5514 [DOI: 10.1109/CVPR.2018.00577]
Zhang L S, Chen Q C, Hu B T and Jiang S R. 2020. Text-guided neural image inpainting//Proceedings of the 28th ACM International Conference on Multimedia. Seattle, USA: Association for Computing Machinery: 1302-1310 [DOI: 10.1145/3394171.3414017]
Zhou B L, Lapedriza A, Khosla A, Oliva A and Torralba A. 2018. Places: a 10 million image database for scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(6): 1452-1464 [DOI: 10.1109/TPAMI.2017.2723009]