Jia Di, Zhang Jiabao, An Tong, Cai Peng, Zhao Jinyuan (Liaoning Technical University)
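Objective In recent years, real-time portrait matting with neural networks has become a research hotspot in computer vision, but existing networks still cannot meet real-time requirements when processing high-resolution video. To address this, this paper proposes a real-time high-resolution video portrait matting network that incorporates a background image. Method A two-level network composed of a base network and a refinement network is presented. In the base network, an encoder module extracts multi-scale features from each video frame, and a pyramid pooling module fuses these features as the input to a recurrent decoder. In the recurrent decoder, residual gated recurrent units aggregate temporal information across consecutive video frames to generate an alpha matte, a foreground residual map, and a hidden feature map; the residual structure reduces the number of model parameters and improves the real-time performance of the network. To improve real-time matting on high-resolution images, a high-resolution information guidance module is designed in the refinement network, which produces high-quality portrait matting results by using high-resolution image information to guide the low-resolution predictions. Result In experimental comparisons with related network models from recent years, the proposed method outperforms existing methods on the high-resolution Human2K dataset, improving the evaluation metrics (absolute error, mean squared error, gradient, and connectivity) by 18.8%, 39.2%, 40.7%, and 20.9%, respectively. On an NVIDIA GTX 1080Ti GPU, it runs at 26 FPS (frames per second) on 4K video and at 43 FPS on HD (high definition) video. Conclusion The proposed network model better accomplishes real-time high-resolution portrait matting and can provide better support for advanced applications such as film and television, short-video social networking, and online conferencing.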
Real-time high resolution video portrait matting network combined with background image
Jia Di, Zhang Jiabao, An Tong, Cai Peng, Zhao Jinyuan (Liaoning Technical University)
Objective Video matting is one of the most commonly used operations in visual image processing. It aims to separate a chosen part of an image or video from the original into its own layer, which can then be applied to specific scenes in later video synthesis. In recent years, real-time portrait matting with neural networks has become a research hotspot in computer vision. Existing networks cannot meet real-time requirements when processing high-resolution video, and their matting results remain blurry at the edges of high-resolution targets. To address these problems, several recently proposed methods use various kinds of auxiliary information to guide alpha matte estimation on high-resolution images, and they have shown good performance. However, many of these methods still fail to learn the edges and fine details of portraits adequately. This paper therefore proposes a real-time high-resolution video portrait matting network that incorporates background images. Method A two-level network composed of a base network and a refinement network is presented. To keep the network lightweight, high-resolution feature maps are first downsampled at a sampling rate D. In the base network, an encoder module extracts multi-scale features from the video frames, and a pyramid pooling module fuses these features as the input to the recurrent decoder, which helps the recurrent decoder learn the multi-scale features of the video frames. In the recurrent decoder, residual gated recurrent units aggregate temporal information across consecutive video frames to generate the alpha matte, the foreground residual map, and the hidden feature map. The residual structure reduces the number of model parameters and improves the real-time performance of the network.
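The residual gated recurrent unit mentioned above can be illustrated with a minimal sketch. It assumes the standard GRU update equations applied element-wise, with the input added back to the new hidden state as a residual connection; this is one plausible reading, not the paper's exact layer, which would use convolutions over feature maps. The function name `residual_gru_step` and the scalar weight dictionary are hypothetical and exist only for illustration.

```python
import math


def sigmoid(x):
    """Logistic function used for the GRU gates."""
    return 1.0 / (1.0 + math.exp(-x))


def residual_gru_step(x, h_prev, w):
    """One element-wise GRU step with a residual connection (illustrative only).

    x      : list of floats, features of the current frame
    h_prev : list of floats, hidden state carried over from the previous frame
    w      : dict of scalar weights (hypothetical stand-ins for convolution kernels)
    Returns (y, h): the residual output and the new hidden state.
    """
    h = []
    for xi, hi in zip(x, h_prev):
        z = sigmoid(w["wz_x"] * xi + w["wz_h"] * hi)    # update gate
        r = sigmoid(w["wr_x"] * xi + w["wr_h"] * hi)    # reset gate
        cand = math.tanh(w["wc_x"] * xi + w["wc_h"] * (r * hi))  # candidate state
        h.append((1.0 - z) * hi + z * cand)             # blend old state and candidate
    # Residual connection (an assumption): add the input back to the new state.
    y = [xi + hi for xi, hi in zip(x, h)]
    return y, h


if __name__ == "__main__":
    weights = {"wz_x": 0.5, "wz_h": 0.5, "wr_x": 0.5,
               "wr_h": 0.5, "wc_x": 1.0, "wc_h": 1.0}
    hidden = [0.0, 0.0, 0.0]
    # Three consecutive "frames"; the hidden state carries information across them.
    for frame in ([0.1, 0.5, 0.9], [0.2, 0.5, 0.8], [0.3, 0.5, 0.7]):
        out, hidden = residual_gru_step(frame, hidden, weights)
        print(out)
```

The hidden state is the only quantity passed from one frame to the next, which is how temporal information is aggregated in this kind of unit. Note that some libraries swap the roles of `z` and `1 - z` in the blend; both conventions are equivalent up to relabelling the gate.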
The residual gated recurrent unit makes full use of the temporal information in the video, so that the alpha mattes of the video frame sequence are constructed with the help of that information. To improve real-time matting performance on high-resolution images, a high-resolution information guidance module is designed in the refinement network. The original high-resolution video frames and the low-resolution predicted features (alpha matte, foreground residual map, and hidden feature map) are passed to this module as inputs, and high-quality portrait matting results are generated by using high-resolution image information to guide the low-resolution images. Within the module, the combination of covariance mean filtering, variance mean filtering, and pointwise convolution effectively improves the matting quality in the detailed regions of portrait contours in high-resolution video frames. With the base network and the refinement network working together, the designed network not only extracts multi-scale information fully from low-resolution video frames but also learns the edge information of portraits in high-resolution video frames more completely. This leads to more accurate prediction of alpha mattes and foreground images and improves the generalization ability of the matting network across multiple resolutions. In addition, the high-resolution image downsampling scheme, the lightweight pyramid pooling module, and the residual connection structure used in the network further reduce the number of network parameters, thereby improving real-time performance. Result The network is implemented in PyTorch on an NVIDIA GTX 1080Ti GPU with 11 GB of memory. The batch size is 1, and the Adam optimizer is used.
Training proceeds on three datasets in sequence. The base network is first trained on the Video240K SD dataset for 8 epochs with an input sequence length of 15 frames, and the refinement network is then trained on the Video240K HD dataset for 1 epoch. In addition, to improve the robustness of the model on high-resolution video, the refinement network is further trained on the Human2K dataset for 50 epochs, with a downsampling rate D of 0.25 and an input sequence length of 2 frames. Compared with related network models from recent years, the proposed method is superior on both the Video240K SD dataset and the Human2K dataset. On Video240K SD, the evaluation metrics (SAD, MSE, Grad, Conn) improve by 26.1%, 50.6%, 56.9%, and 39.5%, respectively. On the high-resolution Human2K dataset in particular, the method clearly outperforms other state-of-the-art methods, improving the same metrics (SAD, MSE, Grad, Conn) by 18.8%, 39.2%, 40.7%, and 20.9%, respectively. It also achieves the lowest network complexity at 4K resolution (28.78 GMac). On an NVIDIA GTX 1080Ti GPU, the running speed reaches 49 FPS on low-resolution video (512x288), 42.4 FPS on medium-resolution video (1024x576), 43 FPS on HD video, and 26 FPS on 4K video, a significant improvement over other state-of-the-art methods. Conclusion The network model proposed in this paper better accomplishes real-time matting of high-resolution portraits. The pyramid pooling module in the base network effectively extracts and integrates multi-scale information from video frames, while the residual gated recurrent unit effectively aggregates temporal information across consecutive frames.
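For readers unfamiliar with the metrics quoted in the results, the following is a minimal sketch of how SAD and MSE are commonly computed between a predicted and a ground-truth alpha matte, and how a relative improvement percentage can be derived. It assumes alpha values in [0, 1] stored as flat lists and applies no scaling factor; published benchmarks often rescale these values, and the exact evaluation protocol of this paper is not stated here. The function names are hypothetical, and Grad and Conn are omitted because they require more involved definitions.

```python
def sad(pred, gt):
    """Sum of absolute differences between predicted and ground-truth alpha."""
    return sum(abs(p - g) for p, g in zip(pred, gt))


def mse(pred, gt):
    """Mean squared error between predicted and ground-truth alpha."""
    return sum((p - g) ** 2 for p, g in zip(pred, gt)) / len(pred)


def relative_improvement(baseline, ours):
    """Percentage by which an error metric drops relative to a baseline (lower is better)."""
    return 100.0 * (baseline - ours) / baseline


if __name__ == "__main__":
    # Toy 4-pixel alpha mattes, flattened; values are made up for illustration.
    predicted = [0.0, 0.5, 1.0, 1.0]
    truth = [0.0, 1.0, 1.0, 0.5]
    print("SAD:", sad(predicted, truth))
    print("MSE:", mse(predicted, truth))
    print("Improvement (%):", relative_improvement(10.0, 8.0))
```

Because all four metrics are error measures, an "improvement" of 18.8% in SAD means the error fell by that fraction relative to the compared method, not that an accuracy score rose.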
The high-resolution information guidance module captures high-resolution information in the images and uses it to guide the low-resolution predictions. The improved network effectively enhances matting quality along the edges of portraits in high-resolution frames. Experiments on the high-resolution Human2K dataset show that the proposed network predicts high-resolution alpha mattes more effectively while maintaining a high real-time processing speed, and it can provide better support for advanced applications such as film and television, short-video social networking, and online conferencing.
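The combination of covariance and variance mean filtering described for the high-resolution information guidance module resembles the classic guided filter, so a small sketch of that filter may help make the idea concrete. This is the textbook guided filter on a 1D signal at a single resolution, not the paper's module: the learned pointwise convolution and the low-resolution to high-resolution transfer are omitted, and the names `box_mean` and `guided_filter_1d` are hypothetical.

```python
def box_mean(values, radius):
    """Mean filter with a window of up to 2*radius+1 samples, truncated at the borders."""
    n = len(values)
    out = []
    for i in range(n):
        window = values[max(0, i - radius):min(n, i + radius + 1)]
        out.append(sum(window) / len(window))
    return out


def guided_filter_1d(guide, src, radius=2, eps=1e-4):
    """Classic guided filter: smooth `src` while following the edges of `guide`."""
    mean_i = box_mean(guide, radius)
    mean_p = box_mean(src, radius)
    mean_ip = box_mean([g * s for g, s in zip(guide, src)], radius)
    mean_ii = box_mean([g * g for g in guide], radius)

    # Local covariance of guide and source, and local variance of the guide.
    cov_ip = [ip - i * p for ip, i, p in zip(mean_ip, mean_i, mean_p)]
    var_i = [ii - i * i for ii, i in zip(mean_ii, mean_i)]

    # Per-window linear model: src is approximated by a * guide + b.
    a = [c / (v + eps) for c, v in zip(cov_ip, var_i)]
    b = [p - ai * i for p, ai, i in zip(mean_p, a, mean_i)]

    # Average the coefficients, then apply them to the guide.
    mean_a = box_mean(a, radius)
    mean_b = box_mean(b, radius)
    return [ma * g + mb for ma, g, mb in zip(mean_a, guide, mean_b)]


if __name__ == "__main__":
    guide = [0.0, 0.1, 0.9, 1.0, 0.2, 0.7]        # stands in for a sharp image row
    coarse_alpha = [0.0, 0.3, 0.6, 0.9, 0.4, 0.5]  # stands in for a blurry alpha row
    print(guided_filter_1d(guide, coarse_alpha, radius=1))
```

In a two-resolution setting, the coefficients `a` and `b` would typically be estimated from the low-resolution prediction, upsampled, and then applied to the full-resolution frame, which is what lets sharp edges in the high-resolution image shape the final matte.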