Deep attention guided image cropping with fine-grained feature aggregation
Journal of Image and Graphics, Vol. 27, Issue 2, Pages 586-601 (2022)
Published: 16 February 2022
Accepted: 30 September 2021
DOI: 10.11834/jig.210544
Yuming Fang, Yu Zhong, Jiebin Yan, Lixia Liu. Deep attention guided image cropping with fine-grained feature aggregation. Journal of Image and Graphics, 27(2): 586-601 (2022)
Objective
Image cropping, which aims to extract a region of interest (RoI) with better aesthetic composition, is one of the most effective means of improving the aesthetics of a photograph. It has been widely used in photography, printing, thumbnail generation, and related fields, especially in image processing and computer vision tasks that must handle large numbers of images simultaneously. However, modeling the aesthetic properties of image composition is highly challenging due to the subjectivity of image aesthetic assessment (IAA). In the past few years, many researchers have tried to crop a target region by maximizing visually important information, relying on salient object detection or eye-fixation prediction. Because such methods ignore the integrity of image composition, their results often do not match human preferences. Recently, owing to the powerful representation ability of deep learning (mainly convolutional neural networks (CNNs)), many data-driven image cropping methods have been proposed and have achieved great success. Unlike natural IAA, the candidate cropped RoIs of one image are highly similar to each other, which makes distinguishing their aesthetics more difficult. Most existing CNN-based methods focus only on the feature corresponding to each cropped RoI and use coarse location information, which is not robust enough to complex scenes, spatial deformation, and translation. Few methods consider fine-grained features together with local and global context dependence, which are remarkably beneficial to understanding image composition. Motivated by these observations, a novel deep attention guided image cropping network with fine-grained feature aggregation, namely DAIC-Net, is proposed.
Method
In an end-to-end learning manner, the overall model of DAIC-Net consists of three modules: semantic feature extraction with channel calibration (ECC), fine-grained feature aggregation (FFA), and global-to-local contextual attention fusion (CAF). Our main idea is to combine multiscale features and to incorporate global and local contexts, which enhances the informative contextual representation from coarse to fine. First, in ECC, a backbone extracts high-level semantic feature maps from the input. Three popular architectures, namely the Visual Geometry Group 16-layer network (VGG16), MobileNetV2, and ShuffleNetV2, are tested, and all variants achieve competitive performance. The backbone output is followed by a squeeze-and-excitation module, which exploits the attention between channels to calibrate channel features adaptively. Then, an FFA module concatenates multiscale regional information to generate various fine-grained features; this operation is designed to capture higher semantic representations and the complex composition rules of an image. Because the FFA module shares the low-dimensional semantic features across regions, it adds almost no running time. Moreover, to mimic the human visual attention mechanism, the CAF module recalibrates the fine-grained features, generating contextual knowledge for each pixel by selectively scanning across different directions and scales. The input features of the CAF module are re-encoded explicitly by fusing global and local attention features, and top-to-bottom and left-to-right contextual regional attention is generated for each pixel, yielding richer context features that facilitate the final decision. Finally, considering the particularity of score regression in image cropping, a multi-task loss combining score regression, pairwise comparison, and correlation ranking is defined to train DAIC-Net; it explicitly ranks aesthetics so as to model the relation between every two different regions. An NVIDIA GeForce GTX 1060 device is used to train and test the proposed DAIC-Net.
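The channel calibration in ECC follows the standard squeeze-and-excitation pattern. The following is a minimal NumPy sketch rather than the paper's implementation: `w1` and `w2` stand in for the learned bottleneck layers (their shapes imply the reduction ratio), and biases are omitted for brevity.

```python
import numpy as np

def squeeze_excite(feat, w1, w2):
    """Channel calibration on a (C, H, W) feature map in squeeze-and-excitation style.

    feat: (C, H, W) feature map; w1: (C//r, C) squeeze weights; w2: (C, C//r) excite weights.
    """
    # Squeeze: global average pooling collapses each channel to a scalar descriptor.
    z = feat.mean(axis=(1, 2))                                   # (C,)
    # Excitation: bottleneck MLP with ReLU, then sigmoid gives per-channel gates in (0, 1).
    s = 1.0 / (1.0 + np.exp(-(w2 @ np.maximum(w1 @ z, 0.0))))    # (C,)
    # Recalibration: rescale each channel by its gate.
    return feat * s[:, None, None]
```

Because the gates lie in (0, 1), the module can only attenuate channels, which is what makes the calibration adaptive per input.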
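The abstract does not spell out the FFA pooling grid; cascading multiscale regional features can nonetheless be sketched as spatial-pyramid-style pooling over a region's feature map, where `bins` is an assumed set of grid sizes:

```python
import numpy as np

def multiscale_pool(region, bins=(1, 2, 4)):
    """Pool a (C, H, W) region feature map over several grid sizes and
    concatenate the results into a fixed-length multiscale descriptor."""
    c, h, w = region.shape
    parts = []
    for b in bins:
        # Split the map into a b x b grid and average-pool each cell.
        rows = np.array_split(np.arange(h), b)
        cols = np.array_split(np.arange(w), b)
        for hi in rows:
            for wi in cols:
                parts.append(region[:, hi][:, :, wi].mean(axis=(1, 2)))
    return np.concatenate(parts)   # length C * sum(b * b for b in bins)
```

Because every scale reuses the same shared feature map, only cheap pooling is added per candidate region, which is consistent with the near-zero extra running time reported for FFA.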
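The exact scanning mechanism inside CAF is defined in the full paper, not this abstract. As an illustration only, the top-to-bottom and left-to-right context accumulation can be approximated with exponentially decayed scans; `decay` here is a hypothetical mixing weight standing in for the learned recurrent gating:

```python
import numpy as np

def directional_context(feat, decay=0.5):
    """Accumulate context for each pixel of a (C, H, W) map by scanning
    top-to-bottom and left-to-right with exponential decay, then fuse the
    two directional summaries with the local feature."""
    c, h, w = feat.shape
    down = np.zeros_like(feat)
    right = np.zeros_like(feat)
    for i in range(h):                      # top-to-bottom scan
        prev = down[:, i - 1, :] if i > 0 else 0.0
        down[:, i, :] = feat[:, i, :] + decay * prev
    for j in range(w):                      # left-to-right scan
        prev = right[:, :, j - 1] if j > 0 else 0.0
        right[:, :, j] = feat[:, :, j] + decay * prev
    # Fuse the local feature with the two directional context maps.
    return (feat + down + right) / 3.0
```

Each output pixel thus carries a memory of everything above and to the left of it, mirroring the "memory context" scanning order described for CAF.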
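The precise multi-task loss is given in the full paper; a minimal sketch of its spirit combines L1 score regression with a margin-based pairwise ranking term over the candidate crops of one image (`margin`, `alpha`, and `beta` are illustrative hyperparameters, not the paper's values):

```python
import numpy as np

def multitask_loss(pred, gt, margin=0.1, alpha=1.0, beta=1.0):
    """Regression plus pairwise-ranking objective over the predicted (pred)
    and ground-truth (gt) aesthetic scores of N candidate crops."""
    pred, gt = np.asarray(pred, float), np.asarray(gt, float)
    reg = np.mean(np.abs(pred - gt))        # per-crop score regression (L1)
    # Pairwise term: if gt[i] > gt[j], pred[i] should exceed pred[j] by a margin.
    rank, pairs = 0.0, 0
    n = len(pred)
    for i in range(n):
        for j in range(n):
            if gt[i] > gt[j]:
                rank += max(0.0, margin - (pred[i] - pred[j]))
                pairs += 1
    rank = rank / max(pairs, 1)
    return alpha * reg + beta * rank
```

The pairwise term is what lets the model rank every two regions against each other rather than fitting each score in isolation.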
Result
The performance of our method is compared with six state-of-the-art methods on three public datasets, namely the grid anchor based image cropping database (GAICD), the image cropping database (ICDB), and the Flickr cropping database (FCDB). The quantitative evaluation metrics on GAICD are the average Pearson correlation coefficient ($\overline{PCC}$), the average Spearman rank-order correlation coefficient ($\overline{SRCC}$), the best return metrics ($Acc^K/N$), and the rank-weighted best return metrics ($wAcc^K/N$); higher is better for all of them. Intersection over union and boundary displacement error are adopted as evaluation metrics on the other two datasets. GAICD is split into 2 636 training images, 200 validation images, and 500 test images. ICDB and FCDB contain 950 and 348 test images, respectively, which are not used for training by any compared method. Experimental results demonstrate the effectiveness of DAIC-Net: $\overline{SRCC}$ and $\overline{PCC}$ increase by 2.0% and 1.9%, and the best return metrics increase by up to 4.1% on GAICD. DAIC-Net also outperforms most of the other methods on ICDB and FCDB, despite very limited room for improvement there. Qualitative analysis and a user study are also provided; their results show that DAIC-Net generates views with better composition than the compared methods.
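The two correlation metrics reported on GAICD can be computed as below. This simple sketch ignores tie handling in the Spearman ranks, which a full implementation (e.g., scipy.stats.spearmanr) would resolve by averaging tied ranks:

```python
import numpy as np

def pcc(x, y):
    """Pearson linear correlation between predicted and ground-truth scores."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc)))

def srcc(x, y):
    """Spearman rank-order correlation: Pearson correlation of the ranks."""
    rank = lambda v: np.argsort(np.argsort(v)).astype(float)
    return pcc(rank(x), rank(y))
```

SRCC rewards any monotonic agreement between predicted and ground-truth score orderings, while PCC additionally penalizes nonlinear distortions of the scores, which is why both are reported.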
Conclusion
In this paper, a new automatic image cropping method with fine-grained feature aggregation and contextual attention is presented. The ablation study demonstrates the effectiveness of each module of DAIC-Net, and further experiments show that DAIC-Net obtains better results than the other methods on the GAICD dataset. Comparison experiments on the ICDB and FCDB datasets verify the generalization ability of DAIC-Net.
Keywords: automatic image cropping; image aesthetics assessment (IAA); region of interest (RoI); spatial pyramid pooling (SPP); attention mechanism; multi-task learning