Gate recurrent unit and generative adversarial networks for scene text removal
2022, Vol. 27, No. 4, Pages: 1264-1276
Received: 2020-12-14; Revised: 2021-02-08; Accepted: 2021-02-15; Published in print: 2022-04-16
DOI: 10.11834/jig.200764
Objective
Textual information in images is ubiquitous in daily life. While such text conveys information, it also creates a risk of information leakage. Scene text removal algorithms address this risk, but existing methods leave text incompletely removed and fill the removed regions with visually unconvincing content. To this end, this paper proposes an image text removal model based on the gate recurrent unit (GRU) that removes text from images with high quality and high efficiency.
Method
A stroke-level binary mask detection module composed of gate recurrent units first obtains an accurate stroke-level binary mask of the input image. This mask is then fed as auxiliary information into a generative adversarial network (GAN)-based text removal module, which erases the text and fills the background color back in. The text loss function and brightness loss function proposed in this paper are used to improve the removal results and achieve high-quality text removal, and inverted residual blocks replace standard convolutions to make the removal efficient.
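For reference, the detection module is built from standard GRUs; the GRU update (Chung et al., 2014) at step t is

    z_t = σ(W_z x_t + U_z h_{t−1})                  (update gate)
    r_t = σ(W_r x_t + U_r h_{t−1})                  (reset gate)
    h̃_t = tanh(W_h x_t + U_h (r_t ⊙ h_{t−1}))      (candidate state)
    h_t = (1 − z_t) ⊙ h_{t−1} + z_t ⊙ h̃_t          (new hidden state)

where x_t is the input, h_t the hidden state, σ the sigmoid function, and ⊙ elementwise multiplication; the two gates control how much previously accumulated stroke context is kept.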
Result
On 1 080 groups of real-world data obtained through manual processing and 1 000 groups of data generated with a text synthesis method, we conduct comparative experiments against three other text removal methods. The results show that our method achieves better performance on image quality metrics such as peak signal-to-noise ratio and structural similarity, as well as in visual quality.
Conclusion
Compared with the baseline methods, the proposed GRU-based image text removal model not only effectively resolves incomplete text removal and the inconsistency between the text-removed regions and the background, but also effectively reduces the model's parameter count and computational cost, lowering the overall computation by 72.0%.
Objective
The textual information in digital images is ubiquitous in our daily life. However, while it delivers valuable information, it also runs the risk of leaking private information. For example, when taking photos or collecting data, some private information, such as phone numbers, will inevitably appear in the images. Image text removal technology can protect privacy by removing sensitive information from the images. At the same time, this technology can also be widely used in image and video editing, text translation, and other related tasks. Tursun et al. added a binary mask as auxiliary information to make the model focus on the text area, which brought obvious improvements over existing scene text removal methods. However, this binary mask is redundant because it covers a large amount of background between text strokes, which means the removed area (indicated by the binary mask) is larger than what actually needs to be removed (i.e., the text strokes); this limitation leaves room for further improvement. Considering the unclean text removal and poor visual quality after removal that affect existing methods, we propose a gate recurrent unit (GRU)-based generative adversarial network (GAN) framework to effectively remove the text and obtain high-quality results.
Method
Our framework is fully "end-to-end". We first take the image containing text and the binary mask of the corresponding text area as inputs; the stroke-level binary mask of the input image is then accurately obtained by our detection module composed of multiple GRUs. Next, the GAN-based text removal module combines the input image, the text area mask, and the stroke-level mask to remove the text from the image. Meanwhile, we propose a brightness loss function to further improve visual quality, based on the observation that human eyes are more sensitive to changes in image brightness. Specifically, we transfer the output image from the RGB space to the YCrCb color space and minimize the difference between the brightness channels of the output image and the ground truth. The purpose of the weighted text loss function is to make the model focus more on the text area. Using the weighted text loss function and the brightness loss function proposed in this paper effectively improves the text removal performance. Our method applies inverted residual blocks instead of standard convolutions to obtain a high-efficiency text removal model and balance model size against inference performance. The inverted residual structure first uses a pointwise convolution with a 1 × 1 kernel to expand the dimension of the input feature map, which prevents too much information from being lost at the activation function because of low dimensionality. Then, a depth-wise convolution with a 3 × 3 kernel is applied to extract features, and a 1 × 1 pointwise convolution compresses the number of channels of the feature map.
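To make these ingredients concrete, the following minimal PyTorch sketch shows one plausible form of the brightness loss, the weighted text loss, and an inverted residual block. It is an illustration under stated assumptions, not the authors' released code: the YCrCb conversion uses the standard BT.601 luma weights, and the weight alpha and the expansion factor are hypothetical choices.

import torch
import torch.nn as nn

def rgb_to_y(img):
    # Y (brightness) channel of YCrCb via the BT.601 luma transform;
    # img is an (N, 3, H, W) RGB tensor scaled to [0, 1].
    r, g, b = img[:, 0:1], img[:, 1:2], img[:, 2:3]
    return 0.299 * r + 0.587 * g + 0.114 * b

def brightness_loss(output, target):
    # Penalize differences only on the brightness channel,
    # to which human eyes are most sensitive.
    return torch.mean(torch.abs(rgb_to_y(output) - rgb_to_y(target)))

def weighted_text_loss(output, target, stroke_mask, alpha=5.0):
    # L1 loss that up-weights pixels inside the stroke-level mask so the
    # model focuses on the text area; alpha = 5.0 is a hypothetical weight.
    weight = 1.0 + alpha * stroke_mask
    return torch.mean(weight * torch.abs(output - target))

class InvertedResidual(nn.Module):
    # 1 x 1 pointwise expansion -> 3 x 3 depthwise convolution ->
    # 1 x 1 linear projection, as in MobileNetV2 (Sandler et al., 2018).
    def __init__(self, in_ch, out_ch, expand=6):
        super().__init__()
        hidden = in_ch * expand
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, hidden, 1, bias=False),   # expand dimensions
            nn.BatchNorm2d(hidden),
            nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, hidden, 3, padding=1,
                      groups=hidden, bias=False),      # depthwise 3 x 3
            nn.BatchNorm2d(hidden),
            nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, out_ch, 1, bias=False),  # compress channels
            nn.BatchNorm2d(out_ch),                    # no activation: linear bottleneck
        )
        self.use_skip = in_ch == out_ch

    def forward(self, x):
        out = self.block(x)
        return x + out if self.use_skip else out

Because the 3 x 3 convolution here acts on each channel separately, the block needs far fewer multiply-adds than a standard convolution of the same width, which is where the FLOPs reduction reported below comes from.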
Result
We conduct extensive experiments on 1 080 groups of real-world data obtained through manual processing and 1 000 groups of synthetic data generated with the SynthText method to validate the proposed approach. In this work, we compare our method with several state-of-the-art text removal methods. For quantitative evaluation, we adopt two kinds of measures. The first is PSNR (peak signal-to-noise ratio) and SSIM (structural similarity index), which measure the difference between the text-removed results and the corresponding ground truth. The second is recall, precision, and F-measure, which measure the model's ability to remove text. The experimental results show that our method consistently performs better in terms of PSNR and SSIM. In addition, we also compare our results qualitatively with the state-of-the-art (SOTA) methods, and our method achieves better visual quality. The inverted residual blocks reduce the floating-point operations (FLOPs) by 72.0% with only a slight reduction in performance.
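As a reminder of how the first metric is defined, PSNR between a text-removed result and its ground truth can be computed as follows (a standard definition, not evaluation code from the paper); SSIM can be obtained analogously, e.g., with skimage.metrics.structural_similarity.

import numpy as np

def psnr(result, ground_truth):
    # Peak signal-to-noise ratio for 8-bit images:
    # 10 * log10(MAX^2 / MSE) with MAX = 255.
    diff = result.astype(np.float64) - ground_truth.astype(np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(255.0 ** 2 / mse)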
Conclusion
We propose a high-quality and efficient text removal method based on the gate recurrent unit, which takes the image containing text and the binary mask of the text area as inputs and produces the text-removed image in an "end-to-end" manner. Compared with existing methods, our method not only effectively alleviates unclean image text removal and the inconsistency between the text-removed area and the background, but also effectively reduces the model parameters and FLOPs.
Badrinarayanan V, Kendall A and Cipolla R. 2017. SegNet: a deep convolutional encoder-decoder architecture for image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(12): 2481-2495 [DOI: 10.1109/TPAMI.2016.2644615]
Baek Y, Lee B, Han D, Yun S and Lee H. 2019. Character region awareness for text detection//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Long Beach, USA: IEEE: 9357-9366 [DOI: 10.1109/CVPR.2019.00959]
Chung J, Gulcehre C, Cho K H and Bengio Y. 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling [EB/OL]. [2020-11-24]. https://arxiv.org/pdf/1412.3555.pdf
Dey R and Salem F M. 2017. Gate-variants of gated recurrent unit (GRU) neural networks//Proceedings of the 60th IEEE International Midwest Symposium on Circuits and Systems (MWSCAS). Boston, USA: IEEE: 1597-1600 [DOI: 10.1109/MWSCAS.2017.8053243]
Dong C, Loy C C, He K M and Tang X O. 2016. Image super-resolution using deep convolutional networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(2): 295-307 [DOI: 10.1109/TPAMI.2015.2439281]
Goodfellow I J, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, Courville A and Bengio Y. 2014. Generative adversarial nets//Proceedings of the 27th International Conference on Neural Information Processing Systems. Montreal, Canada: ACM: 2672-2680
Gupta A, Vedaldi A and Zisserman A. 2016. Synthetic data for text localisation in natural images//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Las Vegas, USA: IEEE: 2315-2324 [DOI: 10.1109/CVPR.2016.254]
Howard A G, Zhu M L, Chen B, Kalenichenko D, Wang W J, Weyand T, Andreetto M and Adam H. 2017. MobileNets: efficient convolutional neural networks for mobile vision applications [EB/OL]. [2020-11-24]. https://arxiv.org/pdf/1704.04861.pdf
Ji L Q and Wang J J. 2008. Automatic text detection and removal in video images. Journal of Image and Graphics, 13(3): 461-466 [DOI: 10.11834/jig.20080315]
Khodadadi M and Behrad A. 2012. Text localization, extraction and inpainting in color images//Proceedings of the 20th Iranian Conference on Electrical Engineering (ICEE2012). Tehran, Iran: IEEE: 1035-1040 [DOI: 10.1109/IranianCEE.2012.6292505]
Kingma D P and Ba J. 2014. Adam: a method for stochastic optimization [EB/OL]. [2020-11-24]. https://arxiv.org/pdf/1412.6980.pdf
Lawrence S, Giles C L, Tsoi A C and Back A D. 1997. Face recognition: a convolutional neural-network approach. IEEE Transactions on Neural Networks, 8(1): 98-113 [DOI: 10.1109/72.554195]
Lee C W, Jung K and Kim H J. 2003. Automatic text detection and removal in video sequences. Pattern Recognition Letters, 24(15): 2607-2623 [DOI: 10.1016/S0167-8655(03)00105-3]
Liu C Y, Liu Y L, Jin L W, Zhang S T, Luo C J and Wang Y P. 2020. EraseNet: end-to-end text removal in the wild. IEEE Transactions on Image Processing, 29: 8760-8775 [DOI: 10.1109/TIP.2020.3018859]
Liu G L, Reda F A, Shih K J, Wang T C, Tao A and Catanzaro B. 2018. Image inpainting for irregular holes using partial convolutions//Proceedings of the 15th European Conference on Computer Vision. Munich, Germany: Springer: 85-100 [DOI: 10.1007/978-3-030-01252-6_6]
Liu J and Wang H Z. 2015. Image search scheme in content sharing environments based on privacy protection. Computer Applications and Software, 32(7): 207-211, 227 [DOI: 10.3969/j.issn.1000-386x.2015.07.050]
Mikolov T, Karafiát M, Burget L, Černocký J and Khudanpur S. 2010. Recurrent neural network based language model//Proceedings of the 11th Annual Conference of the International Speech Communication Association. Makuhari, Japan: ISCA: 1045-1048
Modha U and Dave P. 2012. Image inpainting-automatic detection and removal of text from images. International Journal of Engineering Research and Applications, 2(2): 930-932
Mosleh A, Bouguila N and Hamza A B. 2013. Automatic inpainting scheme for video text detection and removal. IEEE Transactions on Image Processing, 22(11): 4460-4472 [DOI: 10.1109/TIP.2013.2273672]
Nakamura T, Zhu A N, Yanai K and Uchida S. 2017. Scene text eraser//Proceedings of the 14th IAPR International Conference on Document Analysis and Recognition (ICDAR). Kyoto, Japan: IEEE: 832-837 [DOI: 10.1109/ICDAR.2017.141]
Nayef N, Yin F, Bizid I, Choi H, Feng Y, Karatzas D, Luo Z B, Pal U, Rigaud C, Chazalon J, Khlif W, Luqman M M, Burie J C, Liu C L and Ogier J M. 2017. ICDAR2017 robust reading challenge on multi-lingual scene text detection and script identification-RRC-MLT//Proceedings of the 14th IAPR International Conference on Document Analysis and Recognition (ICDAR). Kyoto, Japan: IEEE: 1454-1459 [DOI: 10.1109/ICDAR.2017.237]
Qian R, Tan R T, Yang W H, Su J J and Liu J Y. 2018. Attentive generative adversarial network for raindrop removal from a single image//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 2482-2491 [DOI: 10.1109/CVPR.2018.00263]
Ronneberger O, Fischer P and Brox T. 2015. U-Net: convolutional networks for biomedical image segmentation//Proceedings of the 18th International Conference on Medical Image Computing and Computer-Assisted Intervention. Munich, Germany: Springer: 234-241 [DOI: 10.1007/978-3-319-24574-4_28]
Sandler M, Howard A, Zhu M L, Zhmoginov A and Chen L C. 2018. MobileNetV2: inverted residuals and linear bottlenecks//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 4510-4520 [DOI: 10.1109/CVPR.2018.00474]
Shi X J, Chen Z R, Wang H, Yeung D Y, Wong W K and Woo W C. 2015. Convolutional LSTM network: a machine learning approach for precipitation nowcasting//Proceedings of the 28th International Conference on Neural Information Processing Systems. Montreal, Canada: ACM: 802-810
Tursun O, Denman S, Zeng R, Sivapalan S, Sridharan S and Fookes C. 2020. MTRNet++: one-stage mask-based scene text eraser. Computer Vision and Image Understanding, 201: #103066 [DOI: 10.1016/j.cviu.2020.103066]
Tursun O, Zeng R, Denman S, Sivapalan S, Sridharan S and Fookes C. 2019. MTRNet: a generic scene text eraser//Proceedings of 2019 International Conference on Document Analysis and Recognition (ICDAR). Sydney, Australia: IEEE: 39-44 [DOI: 10.1109/ICDAR.2019.00016]
van den Oord A, Kalchbrenner N and Kavukcuoglu K. 2016. Pixel recurrent neural networks//Proceedings of the 33rd International Conference on Machine Learning. New York City, USA: JMLR: 1747-1756
Wagh P D and Patil D R. 2015. Text detection and removal from image using inpainting with smoothing//Proceedings of 2015 International Conference on Pervasive Computing (ICPC). Pune, India: IEEE: 1-4 [DOI: 10.1109/PERVASIVE.2015.7087154]
Wang W H, Yang N, Wei F R, Chang B B and Zhou M. 2017. Gated self-matching networks for reading comprehension and question answering//Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics. Vancouver, Canada: ACL: 189-198 [DOI: 10.18653/v1/P17-1018]
Wolf C and Jolion J M. 2006. Object count/area graphs for the evaluation of object detection and segmentation algorithms. International Journal of Document Analysis and Recognition, 8(4): 280-296 [DOI: 10.5555/2722900.2723092]
Xu L, Yan Q, Xia Y and Jia J Y. 2012. Structure extraction from texture via relative total variation. ACM Transactions on Graphics, 31(6): 139 [DOI: 10.1145/2366145.2366158]
Yu J H, Lin Z, Yang J M, Shen X H, Lu X and Huang T S. 2018. Generative image inpainting with contextual attention//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 5505-5514 [DOI: 10.1109/CVPR.2018.00577]
Yu J H, Lin Z, Yang J M, Shen X H, Lu X and Huang T. 2019. Free-form image inpainting with gated convolution//Proceedings of 2019 IEEE/CVF International Conference on Computer Vision. Seoul, Korea (South): IEEE: 4470-4479 [DOI: 10.1109/ICCV.2019.00457]
Zhang S T, Liu Y L, Jin L W, Huang Y X and Lai S X. 2019. EnsNet: ensconce text in the wild//Proceedings of the 33rd AAAI Conference on Artificial Intelligence. Honolulu, USA: AAAI: 801-808 [DOI: 10.1609/aaai.v33i01.3301801]