Super-resolution reconstruction of binocular image based on multi-level fusion attention network
2023, Vol. 28, No. 4, pp. 1079-1090
Print publication date: 2023-04-16
DOI: 10.11834/jig.211119
Xu Lei, Song Huihui, Liu Qingshan. 2023. Super-resolution reconstruction of binocular image based on multi-level fusion attention network. Journal of Image and Graphics, 28(04):1079-1090
Objective
As deep convolutional neural networks have been widely applied to binocular stereo image super-resolution, information fusion between the two views has become a research hotspot in recent years. Current stereo super-resolution algorithms, however, learn little of the internal information of a single image. To address this, we propose a binocular image super-resolution reconstruction algorithm based on a multi-level fusion attention network, which learns the rich internal information of each image on top of stereo matching.
Method
First, a feature extraction module captures low-frequency features of the left and right images at different scales and depths. These low-frequency features are fed into a mixed attention module, which first applies a second-order channel non-local attention module to learn the channel and spatial features within each image, and then applies a parallax attention module to stereo-match the left and right feature maps. Next, a multi-level fusion module captures the correlations between features at different depths, further guiding high-quality reconstruction. Sub-pixel convolution then up-samples the feature maps, which are added to the enlarged features of the low-resolution left image to obtain the reconstructed features. Finally, a single convolutional layer yields the reconstructed high-resolution image. A sketch of this pipeline is given below.
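To make the data flow concrete, the following is a minimal PyTorch sketch of the pipeline. Every block body here is a simplified stand-in for illustration only: the class name StereoSRSketch and the single-convolution substitutes for the feature extraction, mixed attention, and multi-level fusion modules are our own assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StereoSRSketch(nn.Module):
    """Wiring of the pipeline only; the real blocks are far richer."""
    def __init__(self, c=64, scale=2):
        super().__init__()
        # stand-in for the multi-scale/depth feature extraction module
        self.extract = nn.Sequential(
            nn.Conv2d(3, c, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(c, c, 3, padding=1))
        self.cross = nn.Conv2d(2 * c, c, 1)         # stand-in for the mixed attention module
        self.fuse = nn.Conv2d(c, c, 3, padding=1)   # stand-in for multi-level fusion
        self.up = nn.Sequential(                    # sub-pixel convolution up-sampling
            nn.Conv2d(c, c * scale ** 2, 3, padding=1),
            nn.PixelShuffle(scale))
        self.tail = nn.Conv2d(c, 3, 3, padding=1)   # one conv -> HR image
        self.scale = scale

    def forward(self, left_lr, right_lr):
        f_left, f_right = self.extract(left_lr), self.extract(right_lr)
        fused = self.cross(torch.cat([f_left, f_right], dim=1))  # cross-view fusion
        up = self.up(self.fuse(fused))
        # global skip: add the enlarged features of the low-resolution left image
        skip = F.interpolate(f_left, scale_factor=self.scale,
                             mode='bicubic', align_corners=False)
        return self.tail(up + skip)

left, right = torch.randn(1, 3, 30, 90), torch.randn(1, 3, 30, 90)
print(StereoSRSketch()(left, right).shape)  # torch.Size([1, 3, 60, 180])
```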
Result
Our algorithm is trained on 800 images from the Flickr1024 dataset and 60 two-times down-sampled Middlebury images, with peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) as metrics. Quantitative and qualitative evaluations are conducted on three benchmark test sets: Middlebury, KITTI2012 and KITTI2015. The results show that our algorithm produces the sharpest images. At a scale factor of 2, compared with PASSRnet (learning parallax attention for stereo image super-resolution), our PSNR is higher by 0.56 dB, 0.31 dB and 0.26 dB on the three datasets, and our SSIM is higher by 0.005 on each.
Conclusion
The proposed network fully learns the rich internal information of each image and effectively guides the stereo matching of the left and right feature maps. It also continually fuses high- and low-frequency information, achieving good reconstruction results.
Objective
Information fusion between the two views of a binocular image pair has been studied intensively in deep convolutional neural network (CNN) based stereo image super-resolution. However, current stereo image super-resolution algorithms learn little of the internal information of a single image. To resolve this problem, we develop a binocular image super-resolution reconstruction algorithm based on a multi-level fusion attention network, which learns the rich information inside each image on top of stereo matching.
Method
Our network consists of four parts: 1) feature extraction, 2) mixed attention, 3) multi-level fusion, and 4) reconstruction.

The feature extraction module is built from a convolutional layer, residual units, and a residual-dense atrous spatial pyramid pooling module. The convolutional layer extracts shallow features from the low-resolution image, and the residual units and residual-dense atrous spatial pyramid pooling modules then process these shallow features alternately. Inside the residual-dense atrous spatial pyramid pooling module, three atrous (dilated) convolutions with dilation rates of 1, 4, and 8 are connected in parallel to form a spatial pyramid pooling group. Three such groups with the same structure are cascaded, and the output and input features of each group are passed on to the next group in a densely connected manner. A convolutional layer at the end of each group performs feature fusion and channel reduction. At the end of the module, dense feature fusion and a global residual connection fuse the output features of all groups and superimpose them linearly on the module's input features.

The mixed attention module is composed of a second-order channel non-local attention module and a parallax attention module. The former is further divided into a second-order channel and spatial attention module and an efficient non-local module. The second-order channel and spatial attention module extracts useful information from the features along the channel and spatial dimensions simultaneously. In the channel branch, global covariance pooling is applied first, and convolutions then reduce and expand the channel dimensionality to capture the correlations between channels, yielding a channel attention map that rescales the input features. In the spatial branch, global average pooling and global max pooling are applied to the input feature map simultaneously, the resulting maps are concatenated, and a convolution followed by a sigmoid function produces a spatial attention map that adjusts the input features. The efficient non-local module uses non-local operations to learn the global correlations of the features, enlarging the receptive field and capturing contextual information. The parallax attention module first processes the left and right feature maps with a convolutional layer and a residual unit, and then uses the parallax attention mechanism to capture the stereo correspondence between the two views for stereo matching.

The multi-level fusion module takes the dense residual block as its basic block and uses an attention mechanism to explore the interconnections between features at different depths, assigning different attention weights to them and improving the representational power of the features. The two attention components are sketched below.
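A minimal sketch of the two attention components follows, under assumed names (SecondOrderChannelAttention, ParallaxAttention). It omits the matrix square-root normalization used in second-order attention networks, as well as the residual branches and validity masks of the full parallax attention mechanism:

```python
import torch
import torch.nn as nn

class SecondOrderChannelAttention(nn.Module):
    """Channel attention driven by global covariance pooling (simplified)."""
    def __init__(self, c, reduction=8):
        super().__init__()
        self.down = nn.Conv2d(c, c // reduction, 1)
        self.up = nn.Conv2d(c // reduction, c, 1)

    def forward(self, x):
        b, c, h, w = x.shape
        feat = x.reshape(b, c, h * w)
        feat = feat - feat.mean(dim=2, keepdim=True)
        cov = feat @ feat.transpose(1, 2) / (h * w)   # b x c x c channel covariance
        pooled = cov.mean(dim=2).reshape(b, c, 1, 1)  # covariance pooling per channel
        attn = torch.sigmoid(self.up(torch.relu(self.down(pooled))))
        return x * attn                               # rescale input by channel attention

class ParallaxAttention(nn.Module):
    """Matches left/right features along the horizontal epipolar line."""
    def __init__(self, c):
        super().__init__()
        self.q = nn.Conv2d(c, c, 1)
        self.k = nn.Conv2d(c, c, 1)

    def forward(self, f_left, f_right):
        b, c, h, w = f_left.shape
        q = self.q(f_left).permute(0, 2, 3, 1).reshape(b * h, w, c)   # (b*h, w, c)
        k = self.k(f_right).permute(0, 2, 1, 3).reshape(b * h, c, w)  # (b*h, c, w)
        attn = torch.softmax(q @ k, dim=-1)           # right-to-left parallax attention
        v = f_right.permute(0, 2, 3, 1).reshape(b * h, w, c)
        warped = (attn @ v).reshape(b, h, w, c).permute(0, 3, 1, 2)   # right aligned to left
        return f_left + warped

f_l, f_r = torch.randn(2, 64, 30, 90), torch.randn(2, 64, 30, 90)
f_l = SecondOrderChannelAttention(64)(f_l)
print(ParallaxAttention(64)(f_l, f_r).shape)  # torch.Size([2, 64, 30, 90])
```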
To obtain the reconstructed features, sub-pixel convolution is used to up-sample the feature maps, and the result is added to the enlarged features of the low-resolution left image. Finally, a single convolutional layer produces the reconstructed high-resolution image.
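Sub-pixel convolution itself is a standard operation: a convolution expands the channel count by the square of the scale factor, and PixelShuffle rearranges those channels into a map that is scale times larger in each spatial dimension. A minimal example:

```python
import torch
import torch.nn as nn

scale, c = 2, 64
upsample = nn.Sequential(
    nn.Conv2d(c, c * scale ** 2, 3, padding=1),  # 64 -> 256 channels
    nn.PixelShuffle(scale))                      # 256 channels -> 64 at 2x resolution
print(upsample(torch.randn(1, c, 48, 48)).shape)  # torch.Size([1, 64, 96, 96])
```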
Result
The training set consists of 800 images from the Flickr1024 dataset and 60 two-times down-sampled Middlebury images. High-resolution images are down-sampled by bicubic interpolation to generate the low-resolution images, which are cropped into patches with a stride of 20; the high-resolution images are cropped accordingly. The test set comprises 5 images from the Middlebury dataset, 20 images from the KITTI2012 dataset and 20 images from the KITTI2015 dataset. To evaluate the reconstruction quality of the model quantitatively and compare it with other methods, peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) are used as evaluation metrics. On the three benchmark test sets, we compare our model with several single-image super-resolution methods and with recent stereo image super-resolution methods, including StereoSR, PASSRnet, SRResNet + SAM, SRResNet + DFAM and CVCnet, at different scales. Taking the KITTI2012 test set at scale ×2 as an example, our PSNR and SSIM are 0.17 dB and 0.002 higher, respectively, than those of CVCnet.
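For reference, PSNR is computed from the mean squared error as PSNR = 10 log10(MAX^2 / MSE); a minimal sketch is given below (SSIM additionally compares local luminance, contrast, and structure statistics and is typically taken from a library such as scikit-image):

```python
import numpy as np

def psnr(reference, reconstructed, peak=255.0):
    """Peak signal-to-noise ratio in dB between two same-sized images."""
    mse = np.mean((reference.astype(np.float64) -
                   reconstructed.astype(np.float64)) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)  # e.g. psnr(hr, sr) for images in [0, 255]
```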
Conclusion
Our model fully learns the rich and effective information within each image, and this effectively guides the stereo matching of the left and right feature maps. Furthermore, high- and low-frequency information is fused continually, and a good reconstruction effect is achieved. Nevertheless, our model still has room to better exploit the rich information in a single image and the complementary information between the left and right images. Future work will design a dedicated single-image feature extraction module and a further left-right image feature fusion module.
convolutional neural network (CNN); stereo image super-resolution; attention mechanism; stereo matching; information fusion
Caballero J, Ledig C, Aitken A, Acosta A, Totz J, Wang Z H and Shi W Z. 2017. Real-time video super-resolution with spatio-temporal networks and motion compensation//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE: 2848-2857 [DOI: 10.1109/CVPR.2017.304]
Chan K C K, Wang X T, Xu X Y, Gu J W and Loy C C. 2021. GLEAN: generative latent bank for large-factor image super-resolution//Proceedings of 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville, USA: IEEE: 14240-14249 [DOI: 10.1109/CVPR46437.2021.01402]
Chang H, Yeung D Y and Xiong Y M. 2004. Super-resolution through neighbor embedding//Proceedings of 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Washington, USA: IEEE [DOI: 10.1109/CVPR.2004.1315043]
Dai Q Y, Li J C, Yi Q S, Fang F M and Zhang G X. 2021. Feedback network for mutually boosted stereo image super-resolution and disparity estimation//Proceedings of the 29th ACM International Conference on Multimedia. [s.l.]: ACM: 1985-1993 [DOI: 10.1145/3474085.3475356]
Dai T, Cai J R, Zhang Y B, Xia S T and Zhang L. 2019. Second-order attention network for single image super-resolution//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, USA: IEEE: 11065-11074 [DOI: 10.1109/CVPR.2019.01132]
Dan J W, Qu Z W, Wang X R and Gu J H. 2021. A disparity feature alignment module for stereo image super-resolution. IEEE Signal Processing Letters, 28: 1285-1289 [DOI: 10.1109/LSP.2021.3088050]
Dong C, Loy C C and Tang X O. 2016. Accelerating the super-resolution convolutional neural network//Proceedings of the 14th European Conference on Computer Vision. Amsterdam, the Netherlands: Springer: 391-407 [DOI: 10.1007/978-3-319-46475-6_25]
Duan C Y and Xiao N F. 2019. Parallax-based spatial and channel attention for stereo image super-resolution. IEEE Access, 7: 183672-183679 [DOI: 10.1109/ACCESS.2019.2960561]
Geiger A, Lenz P and Urtasun R. 2012. Are we ready for autonomous driving? The KITTI vision benchmark suite//Proceedings of 2012 IEEE Conference on Computer Vision and Pattern Recognition. Providence, USA: IEEE: 3354-3361 [DOI: 10.1109/CVPR.2012.6248074]
Guo Y, Chen J, Wang J D, Chen Q, Cao J Z, Deng Z S, Xu Y W and Tan M K. 2020. Closed-loop matters: dual regression networks for single image super-resolution//Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, USA: IEEE: 5407-5416 [DOI: 10.1109/CVPR42600.2020.00545]
Hou Q B, Zhou D Q and Feng J S. 2021. Coordinate attention for efficient mobile network design//Proceedings of 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville, USA: IEEE: 13713-13722 [DOI: 10.1109/CVPR46437.2021.01350]
Huang J B, Singh A and Ahuja N. 2015. Single image super-resolution from transformed self-exemplars//Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston, USA: IEEE: 5197-5206 [DOI: 10.1109/CVPR.2015.7299156]
Jeon D S, Baek S H, Choi I and Kim M H. 2018. Enhancing the spatial resolution of stereo images using a parallax prior//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 1721-1730 [DOI: 10.1109/CVPR.2018.00185]
Kim J, Lee J K and Lee K M. 2016. Accurate image super-resolution using very deep convolutional networks//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, USA: IEEE: 1646-1654 [DOI: 10.1109/CVPR.2016.182]
Lim B, Son S, Kim H, Nah S and Lee K M. 2017. Enhanced deep residual networks for single image super-resolution//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops. Honolulu, USA: IEEE: 1132-1140 [DOI: 10.1109/CVPRW.2017.151]
Menze M and Geiger A. 2015. Object scene flow for autonomous vehicles//Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston, USA: IEEE: 3061-3070 [DOI: 10.1109/CVPR.2015.7298925]
Niu B, Wen W L, Ren W Q, Zhang X D, Yang L P, Wang S Z, Zhang K H, Cao X C and Shen H F. 2020. Single image super-resolution via a holistic attention network//Proceedings of the 16th European Conference on Computer Vision. Glasgow, UK: Springer: 191-207 [DOI: 10.1007/978-3-030-58610-2_12]
Paszke A, Gross S, Chintala S, Chanan G, Yang E, DeVito Z, Lin Z, Desmaison A, Antiga L and Lerer A. 2017. Automatic differentiation in pytorch//Proceedings of the 31st Conference on Neural Information Processing Systems. Long Beach, USA: [s.n.]
Protter M, Elad M, Takeda H and Milanfar P. 2009. Generalizing the nonlocal-means to super-resolution reconstruction. IEEE Transactions on Image Processing, 18(1): 36-51 [DOI: 10.1109/TIP.2008.2008067]
Scharstein D, Hirschmüller H, Kitajima Y, Krathwohl G, Nešić N, Wang X and Westling P. 2014. High-resolution stereo datasets with subpixel-accurate ground truth//Proceedings of the 36th German Conference on Pattern Recognition. Münster, Germany: Springer: 31-42 [DOI: 10.1007/978-3-319-11752-2_3]
Schultz R R and Stevenson R L. 1996. Extraction of high-resolution frames from video sequences. IEEE Transactions on Image Processing, 5(6): 996-1011 [DOI: 10.1109/83.503915]
Song W, Choi S, Jeong S and Sohn K. 2020. Stereoscopic image super-resolution with stereo consistent feature. Proceedings of the AAAI Conference on Artificial Intelligence, 34(7): 12031-12038 [DOI: 10.1609/aaai.v34i07.6880]
Sun J, Xu Z B and Shum H Y. 2008. Image super-resolution using gradient profile prior//Proceedings of 2008 IEEE Conference on Computer Vision and Pattern Recognition. Anchorage, USA: IEEE: 1-8 [DOI: 10.1109/CVPR.2008.4587659]
Tai Y, Yang J and Liu X M. 2017. Image super-resolution via deep recursive residual network//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE: 2790-2798 [DOI: 10.1109/CVPR.2017.298]
Takeda H, Milanfar P, Protter M and Elad M. 2009. Super-resolution without explicit subpixel motion estimation. IEEE Transactions on Image Processing, 18(9): 1958-1975 [DOI: 10.1109/TIP.2009.2023703]
Tao X, Gao H Y, Liao R J, Wang J and Jia J Y. 2017. Detail-revealing deep video super-resolution//Proceedings of 2017 IEEE International Conference on Computer Vision. Venice, Italy: IEEE: 4482-4490 [DOI: 10.1109/ICCV.2017.479]
Timofte R, De Smet V and Van Gool L. 2014. A+: adjusted anchored neighborhood regression for fast super-resolution//Proceedings of the 12th Asian Conference on Computer Vision. Singapore: Springer: 111-126 [DOI: 10.1007/978-3-319-16817-3_8]
Wang J, Song H H, Zhang K H and Liu Q S. 2021. Learning global attention-gated multi-scale memory residual networks for single-image super-resolution. Journal of Image and Graphics, 26(4): 766-775 [DOI: 10.11834/jig.200174]
Wang L G, Wang Y Q, Liang Z F, Lin Z P, Yang J G, An W and Guo Y L. 2019a. Learning parallax attention for stereo image super-resolution//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, USA: IEEE: 12250-12259 [DOI: 10.1109/CVPR.2019.01253]
Wang Q L, Wu B G, Zhu P F, Li P H, Zuo W M and Hu Q H. 2020. ECA-Net: efficient channel attention for deep convolutional neural networks//Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, USA: IEEE: 11531-11539 [DOI: 10.1109/CVPR42600.2020.01155]
Wang X L, Girshick R, Gupta A and He K M. 2018. Non-local neural networks//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 7794-7803 [DOI: 10.1109/CVPR.2018.00813]
Wang Y Q, Wang L G, Yang J G, An W and Guo Y L. 2019b. Flickr1024: a large-scale dataset for stereo image super-resolution//Proceedings of 2019 IEEE/CVF International Conference on Computer Vision Workshop. Seoul, Korea (South): IEEE: 3852-3857 [DOI: 10.1109/ICCVW.2019.00478]
Xu K, Ba J, Kiros R, Cho K, Courville A C, Salakhutdinov R, Zemel R S and Bengio Y. 2015. Show, attend and tell: neural image caption generation with visual attention//Proceedings of the 32nd International Conference on Machine Learning. Lille, France: PMLR.org: 2048-2057
Yang J C, Wright J, Huang T and Ma Y. 2008. Image super-resolution as sparse representation of raw image patches//Proceedings of 2008 IEEE Conference on Computer Vision and Pattern Recognition. Anchorage, USA: IEEE: 1-8 [DOI: 10.1109/CVPR.2008.4587647]
Ying X Y, Wang Y Q, Wang L G, Sheng W D, An W and Guo Y L. 2020. A stereo attention module for stereo image super-resolution. IEEE Signal Processing Letters, 27: 496-500 [DOI: 10.1109/LSP.2020.2973813]
Ying Z L and Long X. 2019. Single-image super-resolution construction based on multi-scale dense residual network. Journal of Image and Graphics, 24(3): 410-419 [DOI: 10.11834/jig.180431]
Zamir S W, Arora A, Khan S, Hayat M, Khan F S, Yang M and Shao L. 2020. Learning enriched features for real image restoration and enhancement//Proceedings of the 16th European Conference on Computer Vision. Glasgow, UK: Springer: 492-511 [DOI: 10.1007/978-3-030-58595-2_30]
Zhang Y L, Tian Y P, Kong Y, Zhong B N and Fu Y. 2018. Residual dense network for image super-resolution//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 2472-2481 [DOI: 10.1109/CVPR.2018.00262]
Zhu X Y, Guo K H, Fang H, Chen L, Ren S and Hu B. 2022. Cross view capture for stereo image super-resolution. IEEE Transactions on Multimedia, 24: 3074-3086 [DOI: 10.1109/TMM.2021.3092571]