Super-resolution video frame reconstruction through lightweight attention constraint alignment network
2022, Vol. 27, No. 10: 2984-2993
Received: 2021-05-18; Revised: 2021-10-11; Accepted: 2021-10-18; Published in print: 2022-10-16
DOI: 10.11834/jig.210345

目的 (Objective)
Deep learning has demonstrated excellent performance in video super-resolution reconstruction. This paper proposes a lightweight attention-constrained deformable alignment network, aiming to reconstruct realistic high-resolution video frames with a network that has few model parameters.
方法 (Method)
The proposed network consists of three parts: a feature extraction module, an attention-constrained alignment sub-network, and a dynamic fusion branch. 1) A shared-weight feature extraction module fully extracts multi-scale semantic information from the input frames without increasing the parameter count. 2) The extracted features are fed into the attention-constrained alignment sub-network to generate aligned features with accurate matching relationships. 3) The concatenated aligned features are fed, as a shared condition, into the dynamic fusion branch, which fuses the temporally aligned features of the reference frame in the feed-forward network with the spatial features of the original low-resolution (LR) frame at different stages. 4) The high-resolution (HR) frame is reconstructed by upsampling.
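The attention-constrained alignment step can be illustrated with a minimal NumPy sketch: under the polar axis (horizontal line) constraint, each reference-frame position attends only to positions on the same row of the neighboring frame, rather than over the whole frame. All names below are hypothetical and not from the paper's code:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def row_constrained_attention(ref, nbr):
    """Align neighbor-frame features to the reference frame, row by row.

    ref, nbr: feature maps of shape (H, W, C). Each reference position
    attends only to positions on the same horizontal line of the
    neighbor frame, so the attention matrix is W x W per row instead
    of (H*W) x (H*W) as in full non-local attention.
    """
    H, W, C = ref.shape
    aligned = np.empty_like(ref)
    for y in range(H):
        scores = ref[y] @ nbr[y].T       # (W, W) correlations along the row
        attn = softmax(scores, axis=-1)  # weights over the neighbor row
        aligned[y] = attn @ nbr[y]       # weighted sum of neighbor features
    return aligned
```

Restricting attention to one row reduces the correspondence computation from (HW)^2 to H·W^2 entries, which is the source of the efficiency the abstract attributes to the polar axis constraint.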
结果 (Result)
Quantitative evaluations were conducted on two benchmark test sets, Vid4 and REDS4 (realistic and diverse scenes dataset). Compared with state-of-the-art video super-resolution networks, the proposed method achieves better scores on the image quality metrics peak signal-to-noise ratio (PSNR) and structural similarity (SSIM), and further improves the detail of the super-resolved frames. At the same PSNR, the model uses nearly 50% fewer parameters.
结论 (Conclusion)
The polar axis constraint greatly reduces the number of parameters of the attention alignment network while still allowing it to capture long-range information for feature alignment and to produce efficient spatio-temporal features. Together with the designed dynamic fusion mechanism, this achieves high-quality reconstruction results.
Objective
Deep learning has benefited video super-resolution (SR) reconstruction. However, existing methods are constrained by the accuracy of optical-flow-based motion estimation and compensation, and they reconstruct large-scale moving targets poorly. Deformable convolutional alignment networks capture a target's motion by learning adaptive receptive fields and offer a new route to video super-resolution reconstruction. To reconstruct realistic high-resolution (HR) video frames, our lightweight attention-constrained deformable alignment network uses few model parameters while making full use of the redundant information between the reference frame and its adjacent frames.
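Deformable alignment replaces fixed-grid sampling with learned per-position offsets. A toy NumPy sketch of the sampling idea (nearest-neighbor sampling for brevity; real deformable convolution uses bilinear interpolation and learns the offsets; the function name is hypothetical):

```python
import numpy as np

def deform_sample(feat, offsets):
    """Sample a feature map at offset positions (nearest-neighbor for brevity).

    feat: (H, W) array; offsets: (H, W, 2) per-position (dy, dx).
    Each output position reads from feat[y + dy, x + dx] instead of a
    fixed grid, so the receptive field adapts to the motion between frames.
    """
    H, W = feat.shape
    ys = np.clip(np.arange(H)[:, None] + np.round(offsets[..., 0]).astype(int), 0, H - 1)
    xs = np.clip(np.arange(W)[None, :] + np.round(offsets[..., 1]).astype(int), 0, W - 1)
    return feat[ys, xs]
```

With zero offsets this reduces to an identity lookup; a learned offset field shifts each position toward where the corresponding content moved in the adjacent frame.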
Method
Our attention constraint alignment network (ACAN) consists of three key components: a feature extraction module, an attention constraint alignment sub-network, and a dynamic fusion branch. First, the shared-weight feature extraction module has five layers: three plain residual blocks without batch normalization (BN) and two residual atrous spatial pyramid pooling (res_ASPP) blocks, connected alternately, which extract multi-scale and multi-level information without increasing the parameter count. Next, the polar axis constraint and the attention mechanism are integrated to design a lightweight attention constraint alignment sub-network (ACAS). Under the polar axis constraint, the sub-network regulates the input features of the deformable convolution by capturing the global correspondence between adjacent frames and the reference frame in the time domain, and generates reasonable offsets to achieve implicit alignment. Specifically, ACAS combines deformable convolution with attention under the polar axis constraint. Its attention constraint blocks (ACB) constrain the features of neighboring frames along the horizontal axis: to find the most similar features, each block encodes the feature correlation between any two positions along a horizontal line. At the same time, an effective mask is designed to handle the unavoidable occlusions in video. The extracted features are sent to the alignment module to generate aligned features with exact matching relationships. An ablation experiment verifies that a single ACB layer already captures the matching relationship between the reference frame and adjacent frames, while a cascade of three ACB layers additionally handles large motion in the video; we therefore cascade three ACB layers in the final design. The dynamic fusion branch is composed of 16 dynamic fusion blocks, each made of two spatial feature transformation (SFT) layers and two 1×1 convolutions. This branch fuses the temporally aligned features of the reference frame in the feed-forward network with the spatial features of the original low-resolution (LR) frame at different stages. Finally, the high-resolution frame is reconstructed by upsampling. The network is trained on the widely used Vimeo-90K dataset and tested on the Vid4 and REDS4 datasets. The loss function is the Charbonnier penalty alone, and the channel size of each layer is set to 64. The network takes seven consecutive frames as input; RGB patches of size 64 × 64 are used as input, with a mini-batch size of 16. We use the Adam optimizer with an initial learning rate of 4e-4. All experiments are conducted with PyTorch 1.0 on four NVIDIA Tesla T4 GPUs.
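The Charbonnier penalty used as the sole training loss is a smooth approximation of the L1 loss. A minimal NumPy sketch (the eps value is a common default, not taken from the paper):

```python
import numpy as np

def charbonnier_loss(pred, target, eps=1e-3):
    """Charbonnier penalty: mean of sqrt(diff^2 + eps^2).

    Behaves like L1 for large errors but stays smooth and
    differentiable near zero, which makes it a robust training
    loss for super-resolution.
    """
    diff = pred - target
    return np.mean(np.sqrt(diff * diff + eps * eps))
```

Note that the loss never falls below eps, so a perfect reconstruction yields eps rather than exactly zero.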
Result
Our method is evaluated quantitatively on two benchmark datasets, Vid4 and the realistic and diverse scenes dataset (REDS4), and obtains better results on the image quality metrics peak signal-to-noise ratio (PSNR) and structural similarity (SSIM). We compare the model with 10 recognized super-resolution models, including single-image super-resolution (SISR) and video super-resolution (VSR) methods, on the two common datasets (Vid4, REDS4); the reconstructed images of each method are provided for visual comparison. The reconstruction results show that the proposed model recovers precise details. The effectiveness of the alignment module with polar axis constraints is verified by comparing no alignment against one or three layers of attention-constrained alignment: without alignment, the PSNR is 22.11 dB; one ACB layer increases it by 1.81 dB; and a cascade of three ACB layers adds a further 1.21 dB. This proves the effectiveness of the attention constraint alignment blocks and shows that the cascaded three-layer ACB network captures long-distance spatial information. A comparative experiment also verifies that the dynamic fusion (DF) module improves reconstruction performance. Compared with EDVR_M, the PSNR increases by more than 0.33 dB on the Vid4 dataset (an increase of about 1.2%) and by 0.49 dB on the REDS4 dataset (about 1.6%). Moreover, at the same PSNR, the proposed model has nearly 50% fewer parameters than the recurrent back-projection network (RBPN); at the same number of parameters, its PSNR is much higher than that of the dynamic upsampling filter (DUF) method; and although its parameter count is slightly higher than that of EDVR_M, its PSNR is 0.21 dB higher.
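PSNR, the main metric reported above, can be computed as follows. This NumPy sketch assumes images scaled to [0, 1]:

```python
import numpy as np

def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB between two images in [0, max_val]."""
    mse = np.mean((pred - target) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)
```

Because the scale is logarithmic, a gain of 0.33 dB corresponds to roughly a 7% reduction in mean squared error (10^(0.33/10) ≈ 1.079).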
Conclusion
The polar axis constraint dramatically reduces the number of parameters of the attention alignment network while still allowing long-distance information to be captured for feature alignment, producing efficient spatio-temporal features. Together with the designed dynamic fusion mechanism, which integrates the spatio-temporal features of video frames, this achieves high-quality reconstruction results.
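The SFT layers inside the dynamic fusion blocks described in the Method modulate the LR spatial features with the aligned features, rather than simply concatenating them. A minimal NumPy sketch of the scale-and-shift idea (the 1×1 convolutions that predict gamma and beta are reduced to per-channel linear maps here; all names are hypothetical):

```python
import numpy as np

def sft_modulate(feat, cond, w_gamma, b_gamma, w_beta, b_beta):
    """Spatial feature transform: scale and shift `feat` conditioned on `cond`.

    feat, cond: (H, W, C) arrays. The condition (here: aligned features)
    is mapped through two 1x1-convolution-like linear layers into a
    per-position scale (gamma) and shift (beta), applied elementwise:
        out = gamma * feat + beta
    """
    gamma = cond @ w_gamma + b_gamma  # a 1x1 conv is a matmul over channels
    beta = cond @ w_beta + b_beta
    return gamma * feat + beta
```

Because gamma and beta vary per spatial position, the fusion can emphasize well-aligned regions and suppress occluded ones, which a plain concatenation cannot do.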
References
Caballero J, Ledig C, Aitken A, Acosta A, Totz J, Wang Z H and Shi W Z. 2017. Real-time video super-resolution with spatio-temporal networks and motion compensation//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Honolulu, USA: IEEE: 2848-2857 [DOI: 10.1109/CVPR.2017.304]
Cheng S S and Pan J S. 2021. Video super-resolution method based on deep learning feature warping. Computer Science, 48(7): 184-189
Dai J F, Qi H Z, Xiong Y W, Yi L, Zhang G D, Hu H and Wei Y C. 2017. Deformable convolutional networks//Proceedings of 2017 IEEE International Conference on Computer Vision (ICCV). Venice, Italy: IEEE: 764-773 [DOI: 10.1109/ICCV.2017.89]
Farsiu S, Robinson M D, Elad M and Milanfar P. 2004. Fast and robust multiframe super resolution. IEEE Transactions on Image Processing, 13(10): 1327-1344 [DOI: 10.1109/TIP.2004.834669]
Haris M, Shakhnarovich G and Ukita N. 2018. Deep back-projection networks for super-resolution//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 1664-1673 [DOI: 10.1109/CVPR.2018.00179]
Haris M, Shakhnarovich G and Ukita N. 2019. Recurrent back-projection network for video super-resolution//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Long Beach, USA: IEEE: 3892-3901 [DOI: 10.1109/CVPR.2019.00402]
He K M, Zhang X Y, Ren S Q and Sun J. 2015. Delving deep into rectifiers: surpassing human-level performance on ImageNet classification//Proceedings of 2015 IEEE International Conference on Computer Vision (ICCV). Santiago, Chile: IEEE: 1026-1034 [DOI: 10.1109/ICCV.2015.123]
Jo Y, Oh S W, Kang J and Kim S J. 2018. Deep video super-resolution network using dynamic upsampling filters without explicit motion compensation//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 3224-3232 [DOI: 10.1109/CVPR.2018.00340]
Kappeler A, Yoo S, Dai Q Q and Katsaggelos A K. 2016. Video super-resolution with convolutional neural networks. IEEE Transactions on Computational Imaging, 2(2): 109-122 [DOI: 10.1109/TCI.2016.2532323]
Kingma D P and Ba J. 2017. Adam: a method for stochastic optimization [EB/OL]. [2021-05-03]. https://arxiv.org/pdf/1412.6980.pdf
Lai W S, Huang J B, Ahuja N and Yang M H. 2019. Fast and accurate image super-resolution with deep Laplacian pyramid networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(11): 2599-2613 [DOI: 10.1109/TPAMI.2018.2865304]
Liao R J, Tao X, Li R Y, Ma Z Y and Jia J Y. 2015. Video super-resolution via deep draft-ensemble learning//Proceedings of 2015 IEEE International Conference on Computer Vision (ICCV). Santiago, Chile: IEEE: 531-539 [DOI: 10.1109/ICCV.2015.68]
Liu C and Sun D Q. 2014. On Bayesian adaptive video super resolution. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(2): 346-360 [DOI: 10.1109/TPAMI.2013.127]
Liu D, Wang Z W, Fan Y C, Liu X M, Wang Z Y, Chang S Y and Huang T. 2017. Robust video super-resolution with learned temporal dynamics//Proceedings of 2017 IEEE International Conference on Computer Vision (ICCV). Venice, Italy: IEEE: 2526-2534 [DOI: 10.1109/ICCV.2017.274]
Sajjadi M S M, Vemulapalli R and Brown M. 2018. Frame-recurrent video super-resolution//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 6626-6634 [DOI: 10.1109/CVPR.2018.00693]
Shen M Y, Yu P F, Wang R G, Yang J and Xue L X. 2019. Image super-resolution reconstruction via deep network based on multi-staged fusion. Journal of Image and Graphics, 24(8): 1258-1269 [DOI: 10.11834/jig.180619]
Song H H, Xu W J, Liu D, Liu B, Liu Q S and Metaxas D N. 2021. Multi-stage feature fusion network for video super-resolution. IEEE Transactions on Image Processing, 30: 2923-2934 [DOI: 10.1109/TIP.2021.3056868]
Tao X, Gao H Y, Liao R J, Wang J and Jia J Y. 2017. Detail-revealing deep video super-resolution//Proceedings of 2017 IEEE International Conference on Computer Vision (ICCV). Venice, Italy: IEEE: 4482-4490 [DOI: 10.1109/ICCV.2017.479]
Tian Y P, Zhang Y L, Fu Y and Xu C L. 2020. TDAN: temporally-deformable alignment network for video super-resolution//Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, USA: IEEE: 3357-3366 [DOI: 10.1109/CVPR42600.2020.00342]
Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez A N, Kaiser L and Polosukhin I. 2017. Attention is all you need [EB/OL]. [2021-05-12]. https://arxiv.org/pdf/1706.03762.pdf
Wang L G, Guo Y L, Wang Y Q, Liang Z F, Lin Z P, Yang J G and An W. 2022. Parallax attention for unsupervised stereo correspondence learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(4): 2108-2125 [DOI: 10.1109/TPAMI.2020.3026899]
Wang L G, Wang Y Q, Liang Z F, Lin Z P, Yang J G, An W and Guo Y L. 2019b. Learning parallax attention for stereo image super-resolution//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Long Beach, USA: IEEE: 12242-12251 [DOI: 10.1109/CVPR.2019.01253]
Wang X L, Girshick R, Gupta A and He K M. 2018a. Non-local neural networks [EB/OL]. [2021-05-12]. https://arxiv.org/pdf/1711.07971.pdf
Wang X T, Chan K C K, Yu K, Dong C and Loy C C. 2019a. EDVR: video restoration with enhanced deformable convolutional networks//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW). Long Beach, USA: IEEE: 1954-1963 [DOI: 10.1109/CVPRW.2019.00247]
Wang X T, Yu K, Dong C and Loy C C. 2018b. Recovering realistic texture in image super-resolution by deep spatial feature transform//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 606-615 [DOI: 10.1109/CVPR.2018.00070]
Wang Z, Bovik A C, Sheikh H R and Simoncelli E P. 2004. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4): 600-612 [DOI: 10.1109/TIP.2003.819861]
Wu H, Lai H C, Qian X Z and Chen H. 2021. Video super-resolution reconstruction algorithm based on optical flow residuals. Computer Engineering and Applications, 58(15): 220-228 [DOI: 10.3778/j.issn.1002-8331.2012-0409]
Xue T F, Chen B A, Wu J J, Wei D L and Freeman W T. 2019. Video enhancement with task-oriented flow. International Journal of Computer Vision, 127(8): 1106-1125 [DOI: 10.1007/s11263-018-01144-2]
Zhang Y L, Li K P, Li K, Wang L C, Zhong B N and Fu Y. 2018. Image super-resolution using very deep residual channel attention networks//Proceedings of the 15th European Conference on Computer Vision. Munich, Germany: Springer: 294-310 [DOI: 10.1007/978-3-030-01234-2_18]
Zhu X Z, Hu H, Lin S and Dai J F. 2018. Deformable ConvNets v2: more deformable, better results//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, USA: IEEE: 9300-9308 [DOI: 10.1109/CVPR.2019.00953]