A real-time high-resolution video portrait matting network combined with background images

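Peng Hong1, Zhang Jiabao1, Jia Di1,2, An Tong1, Cai Peng1, Zhao Jinyuan1 (1. School of Electronic and Information Engineering, Liaoning Technical University, Huludao 125105, China; 2. Faculty of Electrical and Control Engineering, Liaoning Technical University, Huludao 125105, China)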

Abstract
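Objective In recent years, real-time portrait matting with neural networks has become a research hotspot in computer vision, but existing networks still cannot meet real-time requirements when processing high-resolution video. This paper therefore proposes a real-time high-resolution video portrait matting network combined with background images. Method A double-layer network consisting of a base network and a refinement network is presented. In the base network, an encoder module extracts multi-scale features from each video frame, and a pyramid pooling module fuses these features to form the input of a recurrent decoder network. In the recurrent decoder, a residual gated recurrent unit aggregates temporal information across consecutive video frames to generate an alpha matte, a foreground residual map, and a hidden feature map; the residual structure reduces the number of model parameters and improves the real-time performance of the network. To improve real-time matting performance on high-resolution images, a high-resolution information guidance module is designed in the refinement network, which generates high-quality portrait matting results by using high-resolution image information to guide the low-resolution images. Result In experimental comparisons with related network models of recent years, the proposed method outperforms existing methods on the high-resolution Human2K dataset, improving the evaluation metrics (sum of absolute differences, mean squared error, gradient error, and connectivity error) by 18.8%, 39.2%, 40.7%, and 20.9%, respectively. On an NVIDIA GTX 1080Ti GPU, it runs at up to 26 frames/s on 4K-resolution video and at up to 43 frames/s on HD (high definition) video. Conclusion The proposed model better accomplishes real-time high-resolution portrait matting and can provide better support for advanced applications such as film and television, short-video social networking, and online conferencing.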
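As a minimal illustration of how the two image outputs of such a network are typically used downstream, the sketch below recovers a foreground from a frame and a foreground residual, then composites it onto a new background with the standard matting equation C = αF + (1 − α)B. The abstract does not specify how the residual is applied; adding it to the input frame and clamping to [0, 1] is an assumption borrowed from common practice in background-based matting, and all function names are hypothetical. Images are single-channel nested lists so that the example needs no external libraries.

```python
def clamp01(x):
    """Clamp a pixel value to the valid range [0, 1]."""
    return max(0.0, min(1.0, x))


def recover_foreground(frame, residual):
    """Assumed rule: foreground = clamp(frame + foreground residual)."""
    return [
        [clamp01(i + r) for i, r in zip(frame_row, res_row)]
        for frame_row, res_row in zip(frame, residual)
    ]


def composite(alpha, foreground, background):
    """Standard matting equation: C = alpha * F + (1 - alpha) * B."""
    return [
        [a * f + (1.0 - a) * b for a, f, b in zip(a_row, f_row, b_row)]
        for a_row, f_row, b_row in zip(alpha, foreground, background)
    ]


# Tiny 1 x 2 example with hypothetical values.
frame = [[0.5, 0.9]]
residual = [[0.25, 0.5]]      # predicted foreground residual (illustrative)
alpha = [[1.0, 0.5]]          # predicted alpha matte (illustrative)
new_background = [[0.0, 0.0]]

foreground = recover_foreground(frame, residual)
result = composite(alpha, foreground, new_background)
```

An alpha of 1 keeps the recovered foreground unchanged, while an alpha of 0.5 blends foreground and background equally, which is what happens along hair and contour pixels.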
Keywords
Real-time high-resolution video portrait matting network combined with background image

Peng Hong1, Zhang Jiabao1, Jia Di1,2, An Tong1, Cai Peng1, Zhao Jinyuan1 (1. School of Electronic and Information Engineering, Liaoning Technical University, Huludao 125105, China; 2. Faculty of Electrical and Control Engineering, Liaoning Technical University, Huludao 125105, China)

Abstract
Objective Video matting is one of the most commonly used operations in image processing. It aims to separate a chosen part of an image into its own layer so that it can later be composited into other scenes during video synthesis. In recent years, real-time portrait matting with neural networks has become a research hotspot in computer vision. Existing networks cannot meet real-time requirements when processing high-resolution video, and their matting results remain blurry at the edges of targets in high-resolution images. To address these problems, several recently proposed methods use auxiliary information to guide alpha matte estimation for high-resolution images and have demonstrated good performance. However, many of them still fail to learn the edges and fine details of portraits accurately. Therefore, this study proposes a real-time high-resolution video portrait matting network combined with background images.
Method A double-layer network composed of a base network and a refinement network is presented. To keep the network lightweight, high-resolution feature maps are first downsampled at a sampling rate D. In the base network, the encoder module extracts multi-scale features from the video frames, and the pyramid pooling module fuses these features to form the input of the recurrent decoder network, which helps the recurrent decoder learn the multi-scale features of the video frames. In the recurrent decoder, a residual gated recurrent unit (GRU) aggregates temporal information across consecutive video frames, and the alpha matte, foreground residual map, and hidden feature map are generated. The residual structure reduces the number of model parameters and improves the real-time performance of the network. Within the residual GRU, the temporal information of the video is fully exploited when constructing the alpha mattes of the frame sequence. To improve real-time matting performance on high-resolution images, a high-resolution information guidance module is designed in the refinement network. The original high-resolution video frames and the low-resolution predictions (alpha matte, foreground residual map, and hidden feature map) are fed into this module, which generates high-quality portrait matting results by using high-resolution image information to guide the low-resolution images. In the high-resolution information guidance module, the combination of covariance mean filtering, variance mean filtering, and pointwise convolution effectively improves the matting quality in the detailed contour areas of people in high-resolution video frames. Through the combined effect of the base and refinement networks, the designed network not only fully extracts multi-scale information from low-resolution video frames but also learns the edge information of portraits in high-resolution video frames more thoroughly. This leads to more accurate prediction of alpha mattes and foreground images and improves the generalization ability of the matting network across resolutions. In addition, the high-resolution image downsampling scheme, the lightweight pyramid pooling module, and the residual connection structure further reduce the number of network parameters and improve the real-time performance of the network.
Result The network is implemented in PyTorch on an NVIDIA GTX 1080Ti GPU with 11 GB of memory. The batch size is 1, and the Adam optimizer is used. The network is trained on three datasets in sequence. The base network is first trained on the Video240K SD dataset with an input sequence length of 15 frames for 8 epochs; the refinement network is then trained on the Video240K HD dataset for 1 epoch. To improve robustness on high-resolution video, the refinement network is further trained on the Human2K dataset for 50 epochs, with a downsampling rate D of 0.25 and an input sequence length of 2 frames. Compared with related network models of recent years, the proposed method is superior on both the Video240K SD and Human2K datasets. On the Video240K SD dataset, the evaluation indicators (sum of absolute differences (SAD), mean squared error (MSE), gradient error (Grad), and connectivity error (Conn)) improve by 26.1%, 50.6%, 56.9%, and 39.5%, respectively. On the high-resolution Human2K dataset in particular, the proposed method significantly outperforms other state-of-the-art methods, improving SAD, MSE, Grad, and Conn by 18.8%, 39.2%, 40.7%, and 20.9%, respectively. It also achieves the lowest network complexity at 4K resolution (28.78 GMac). The network processes low-resolution video (512 × 288 pixels) at up to 49 frames/s and medium-resolution video (1024 × 576 pixels) at up to 42.4 frames/s. On the NVIDIA GTX 1080Ti GPU, it processes 4K-resolution video at up to 26 frames/s and HD-resolution video at up to 43 frames/s, a significant improvement over other state-of-the-art methods.
Conclusion The network model proposed in this study better accomplishes real-time matting of high-resolution portraits. The pyramid pooling module in the base network effectively extracts and integrates multi-scale information from video frames, while the residual GRU module effectively aggregates temporal information across consecutive frames. The high-resolution information guidance module captures high-resolution information in images and uses it to guide the low-resolution predictions. The improved network effectively enhances the matting of portrait edges at high resolution. Experiments on the high-resolution Human2K dataset show that the proposed network predicts high-resolution alpha mattes more effectively. It runs at high real-time processing speed and can provide better support for advanced applications such as film and television, short-video social networking, and online conferencing.
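The two simplest evaluation indicators and the reported percentage gains can be sketched as follows. This is a minimal sketch, assuming that SAD and MSE are computed per pixel on alpha mattes with values in [0, 1], that an improvement of X% means a relative reduction of the error against a baseline, and that the downsampling rate D scales width and height equally. The function names are hypothetical, and Grad and Conn are omitted because the abstract does not define them.

```python
def sad(pred, gt):
    """Sum of absolute differences between predicted and ground-truth mattes."""
    return sum(
        abs(p - g)
        for pred_row, gt_row in zip(pred, gt)
        for p, g in zip(pred_row, gt_row)
    )


def mse(pred, gt):
    """Mean squared error over all pixels of the matte."""
    pixel_count = sum(len(row) for row in gt)
    squared_error = sum(
        (p - g) ** 2
        for pred_row, gt_row in zip(pred, gt)
        for p, g in zip(pred_row, gt_row)
    )
    return squared_error / pixel_count


def relative_improvement(baseline_error, new_error):
    """Assumed meaning of 'improved by X%': relative error reduction in percent."""
    return 100.0 * (baseline_error - new_error) / baseline_error


def downsampled_size(width, height, rate):
    """Resolution seen by the base network for a given downsampling rate D."""
    return round(width * rate), round(height * rate)


# Tiny 2 x 2 mattes with hypothetical values.
ground_truth = [[0.0, 1.0], [1.0, 0.0]]
prediction = [[0.0, 0.5], [1.0, 0.5]]

sad_value = sad(prediction, ground_truth)
mse_value = mse(prediction, ground_truth)
base_size = downsampled_size(3840, 2160, 0.25)  # 4K taken as 3840 x 2160
```

Under these assumptions, and taking 4K to mean 3840 × 2160 pixels, a downsampling rate of D = 0.25 means the base network operates on 960 × 540 inputs, with the refinement network restoring detail at full resolution.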
Keywords
