Structure prior guided text image inpainting model

Liu Yuxuan1, Zhao Qijun1,2,3, Pan Fan4, Gao Dingguo2,3, Pubu Danzeng2 (1. College of Computer Science, Sichuan University, Chengdu 610065, China; 2. School of Information Science and Technology, Tibet University, Lhasa 850011, China; 3. Tibetan Information Technology Innovative Talent Cultivation Demonstration Base, Lhasa 850011, China; 4. College of Electronic Information, Sichuan University, Chengdu 610065, China)

Abstract
Objective Image inpainting is the process of reconstructing the missing regions of corrupted images so that they become visually complete and semantically plausible. It is widely used in applications such as object removal, old photo restoration, and image editing. Deep-learning-based inpainting methods have achieved good performance on natural and human face images, but the mechanisms they use to keep image texture and structure consistent fall short on text images because they do not attend to the text itself. Meanwhile, research on text images has concentrated on text image super-resolution, text detection, and text recognition, yet many ancient documents contain broken text regions that obstruct downstream detection and recognition tasks as well as the digital preservation of ancient literature. Reconstructing broken text in images is therefore worthy of further study. This paper proposes a novel text image inpainting model guided by text structure priors to address this problem.

Method First, we build a structure prior reconstruction network. Because the text skeleton carries important stroke information and the text edge carries texture and structure information, the network reconstructs both priors to guide the inpainting. To overcome the limited receptive field of convolutional neural networks (CNNs), the network applies a Transformer to capture long-range dependencies in the text image and reconstructs robust, readable text skeleton and edge images from the features extracted from the masked RGB image, the masked text skeleton, and the masked edge image. To reduce the computational cost of self-attention, the network first downsamples the input image, passes the compressed features through sequential Transformer layers, and then upsamples the features to recover the prior images. The prior network is trained with a combination of binary cross-entropy and Dice losses so that it produces accurate text skeletons. Second, to exploit the sequential nature of the text itself, we design a static-to-dynamic residual block (StDRB). The text image inpainting network adopts an encoder-decoder backbone and integrates sequential StDRBs, which make effective use of the stroke and structure information carried by the skeleton and edge priors. The CNN encoder first produces static fused features from the input; the StDRB then converts these static features into dynamic text sequence features. Assuming that text follows a pseudodynamic process from left to right and top to bottom, the StDRB applies bidirectional gated recurrent units along the horizontal and vertical directions in parallel to extract text semantic information, while its residual connection deepens the network and eases convergence; a minimal sketch of such a block is given below. Finally, the CNN decoder recovers the missing regions from the features to produce the inpainting result. To make the restored text images visually realistic and semantically clear, the network is supervised by a preset weighted combination of adversarial, pixel reconstruction, perceptual, and style losses.
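For illustration only, the following is a minimal PyTorch sketch of a static-to-dynamic residual block in the spirit described above; the hidden sizes, the 1 × 1 fusion convolution, and the exact scanning scheme are assumptions of this sketch, not the authors' design.

```python
# Minimal sketch of a static-to-dynamic residual block (StDRB).
# Assumption: row-wise and column-wise bidirectional GRUs model the
# pseudodynamic reading order of text; details differ from the paper.
import torch
import torch.nn as nn

class StDRB(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        assert channels % 2 == 0, "bidirectional GRU needs an even channel count"
        # One GRU scans every row (left-to-right and back),
        # one scans every column (top-to-bottom and back).
        self.row_rnn = nn.GRU(channels, channels // 2,
                              batch_first=True, bidirectional=True)
        self.col_rnn = nn.GRU(channels, channels // 2,
                              batch_first=True, bidirectional=True)
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        # Horizontal pass: each row becomes a sequence of length w.
        rows = x.permute(0, 2, 3, 1).reshape(b * h, w, c)
        rows, _ = self.row_rnn(rows)
        rows = rows.reshape(b, h, w, c).permute(0, 3, 1, 2)
        # Vertical pass: each column becomes a sequence of length h.
        cols = x.permute(0, 3, 2, 1).reshape(b * w, h, c)
        cols, _ = self.col_rnn(cols)
        cols = cols.reshape(b, w, h, c).permute(0, 3, 2, 1)
        # Fuse both directions and add the residual shortcut.
        return x + self.fuse(torch.cat([rows, cols], dim=1))
```

Scanning rows and columns with bidirectional recurrent units lets each spatial position aggregate context along the natural reading directions of text, which is the intuition behind converting static features into dynamic sequence features.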
Given that the aim of text image inpainting is to reconstruct text strokes, the network also introduces a gradient prior loss as one of the joint losses. This loss compares the gradient fields of the inpainted and ground truth images, pushing the network to generate text strokes with sharp contrast against the background; a sketch of this loss is given at the end of the abstract. The training set consists of Tibetan and English text images randomly synthesized from corpora and noisy background images, and all input images are resized to 256 × 256 pixels for training. The model is implemented in PyTorch and accelerated with an NVIDIA GeForce GTX 1080Ti GPU, and the structure prior reconstruction network and the text image inpainting network are trained in two stages to obtain the inpainting results.

Result Because few studies address text image inpainting, we compare our model qualitatively and quantitatively with four natural image and face image inpainting models, all run from their official code releases. In terms of human visual perception, the proposed model produces more coherent overall results than the other methods and reconstructs text in large missing regions with more accurate detail. For quantitative evaluation, we report not only the image quality metrics widely used by previous inpainting methods but also optical character recognition (OCR) results, which directly reflect how well broken text is restored. Our model achieves an average peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) of 42.31 dB and 98.10% on the Tibetan dataset and 39.23 dB and 98.55% on the English dataset. The character accuracy of Tesseract OCR on the restored Tibetan images reaches 62.83%, and the character accuracies of Tesseract OCR, the convolutional recurrent neural network (CRNN), and the attentional scene text recognizer (ASTER) on the restored English images reach 85.13%, 86.04%, and 76.71%, respectively. Our model clearly outperforms the other algorithms on both datasets.

Conclusion This paper proposes a structure prior guided text image inpainting model that reconstructs text structure priors and uses them to guide the inpainting. A Transformer-based network improves the quality of the reconstructed priors, the StDRBs integrated into the inpainting network extract useful text sequence information and boost inpainting performance, and effective joint loss functions further improve the results. Experiments on Tibetan and English datasets demonstrate the effectiveness of the proposed model.
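As a concrete illustration of the gradient prior loss mentioned in the Method section, the following is a minimal sketch assuming an L1 penalty on finite-difference gradient fields; the paper's exact formulation and loss weights may differ.

```python
# Minimal sketch of a gradient prior loss: penalize mismatched
# horizontal/vertical gradients between prediction and ground truth.
# Assumption: simple finite differences and L1 distance.
import torch
import torch.nn.functional as F

def gradient_prior_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """pred, target: (N, C, H, W) image tensors."""
    def grads(img):
        dx = img[..., :, 1:] - img[..., :, :-1]   # horizontal gradient
        dy = img[..., 1:, :] - img[..., :-1, :]   # vertical gradient
        return dx, dy
    pdx, pdy = grads(pred)
    tdx, tdy = grads(target)
    return F.l1_loss(pdx, tdx) + F.l1_loss(pdy, tdy)
```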
Keywords
