Image generation from scene graph with graph attention network
2020, Vol. 25, No. 8, Pages 1591-1603
Received: 2019-10-10; Revised: 2020-01-08; Accepted: 2020-01-15; Published in print: 2020-08-16
DOI: 10.11834/jig.190515
Objective
Current text-to-image generation models perform well only on image datasets containing a single object; when an image involves multiple objects and relationships, the generated image becomes cluttered. An existing solution converts the text description into a scene graph, a structure that better represents the scene relationships in the image, and then generates the image from the scene graph. However, the images produced by existing scene-graph-to-image generation models are not clear enough and lack object detail. We therefore propose a scene-graph-to-image generation model based on a graph attention network to generate higher-quality images.
Method
The model consists of a graph attention network that extracts scene graph features, an object layout network that composes the scene layout, a cascaded refinement network that converts the scene layout into the generated image, and a discriminator network that improves the quality of the generated image. The graph attention network passes output object feature vectors with stronger representational power to the improved object layout network, which composes a scene layout closer to the ground-truth annotations. In addition, we propose computing the image loss by feature matching, which makes the final generated image more semantically similar to the real image.
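To make the attention step concrete, the following is a minimal, single-head GAT-style update over scene-graph object nodes in PyTorch; the abstract does not specify the model's exact formulation, so the layer design, names, and dimensions below are illustrative assumptions.

```python
# A minimal, single-head GAT-style attention update over scene-graph object
# nodes. The layer design is an illustrative assumption, not the paper's
# exact formulation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphAttentionLayer(nn.Module):
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.proj = nn.Linear(in_dim, out_dim, bias=False)  # shared node projection
        self.attn = nn.Linear(2 * out_dim, 1, bias=False)   # scores a (src, dst) pair

    def forward(self, x: torch.Tensor, edges: torch.Tensor) -> torch.Tensor:
        # x: (num_nodes, in_dim) object feature vectors
        # edges: (num_edges, 2) long tensor of directed (src, dst) node indices
        h = self.proj(x)
        src, dst = edges[:, 0], edges[:, 1]
        # Unnormalized attention logit for every edge.
        e = F.leaky_relu(self.attn(torch.cat([h[src], h[dst]], dim=-1))).squeeze(-1)
        e = e - e.max()  # shift logits for numerical stability
        # Softmax over all edges that share the same destination node.
        num = torch.exp(e)
        denom = torch.zeros(x.size(0), device=x.device).index_add_(0, dst, num)
        alpha = num / (denom[dst] + 1e-10)
        # Attention-weighted aggregation of source features into each destination.
        out = torch.zeros_like(h).index_add_(0, dst, alpha.unsqueeze(-1) * h[src])
        return F.elu(out)
```

Stacking such layers lets each object feature absorb context from its related objects before the layout network composes the scene layout.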
Result
The model is trained on the COCO-Stuff image dataset, whose images contain multiple objects, to generate 64×64-pixel images. The proposed model can generate complex scene images containing multiple objects and relationships, and the generated images reach an Inception Score of about 7.8, an improvement of 0.5 over the original scene-graph-to-image generation model.
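For reference, the Inception Score above can be estimated from the class probabilities that a pretrained Inception classifier assigns to the generated images; a minimal sketch of the standard formula (Salimans et al., 2016):

```python
# A minimal Inception Score estimate: IS = exp(E_x[KL(p(y|x) || p(y))]).
# The probabilities are assumed to be softmax outputs of a pretrained
# Inception classifier on generated images.
import numpy as np

def inception_score(probs: np.ndarray, eps: float = 1e-10) -> float:
    # probs: (num_images, num_classes) class probabilities p(y|x)
    p_y = probs.mean(axis=0, keepdims=True)                 # marginal p(y)
    kl = probs * (np.log(probs + eps) - np.log(p_y + eps))  # per-image KL integrand
    return float(np.exp(kl.sum(axis=1).mean()))
```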
Conclusion
The proposed scene-graph-to-image generation model based on a graph attention network not only generates complex scene images containing multiple objects and relationships but also produces images of higher quality with clearer details.
Objective
With the development of deep learning, image generation has made great progress. Text-to-image generation is an important research field in deep-learning-based image generation, and a large number of studies have proposed models that implement it. However, a significant limitation exists: such models handle relationships poorly when generating images that involve multiple objects. The existing solution is to replace the descriptive text with a scene graph, a structure that closely represents the scene relationships in the image, and then use the scene graph to generate an image. Scene graphs are a preferred structured representation between natural language and images, which facilitates the transfer of information between the objects in the graph. Although scene-graph-to-image generation models solve the problem of generating images that include multiple objects and relationships, existing models ultimately produce images of lower quality, and the object details are unremarkable compared with real samples. A model with improved performance should therefore be developed to generate high-quality images and reduce these errors.
Method
We propose image generation from scene graphs with a graph attention network (GA-SG2IM), an improved scene-graph-to-image model, to generate high-quality images containing multiple objects and relationships. The proposed model realizes image generation in three parts. First, a feature extraction network extracts features from the scene graph; the graph attention network introduces an attention mechanism into the original graph convolution network, giving the output object vectors strong expressive power. The object vectors are then passed to the improved object layout network to obtain a scene layout faithful to the ground truth. Finally, the scene layout is passed to the cascaded refinement network to obtain the final output image. A discriminator network consisting of an object discriminator and an image discriminator is attached at the end to ensure that the generated image is sufficiently realistic. At the same time, we use feature matching as our image loss function to ensure that the final generated image and the real image are semantically similar and to obtain high-quality images.
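A minimal sketch of such a feature matching loss follows, assuming the discriminator exposes its intermediate feature maps; the extract_features hook and the per-layer L1 distance are illustrative assumptions, not the paper's exact formulation.

```python
# A minimal feature matching image loss: compare real and generated images
# in the discriminator's intermediate feature space instead of pixel space.
# The extract_features method is a hypothetical hook, not the paper's API.
import torch
import torch.nn.functional as F

def feature_matching_loss(discriminator, real_images, fake_images):
    real_feats = discriminator.extract_features(real_images)  # list of feature maps
    fake_feats = discriminator.extract_features(fake_images)
    loss = torch.zeros((), device=fake_images.device)
    for rf, ff in zip(real_feats, fake_feats):
        loss = loss + F.l1_loss(ff, rf.detach())  # no gradient through the real path
    return loss / len(real_feats)
```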
Result
We use the COCO-Stuff image dataset to train and validate the proposed model. The dataset includes more than 40 000 images of different scenes, each providing annotations of the bounding boxes and segmentation masks of the objects in the image; these annotations can be used to synthesize the scene graphs that are input to the proposed model. We train the proposed model to generate 64×64 images and compare them with those of other image generation models to demonstrate its feasibility. At the same time, quantitative results in terms of the Inception Score and the bounding-box intersection over union (IoU) of the generated images are compared against the SG2IM (image generation from scene graph) and StackGAN models to measure the improvement achieved by the proposed model. The final experimental results show that the proposed model achieves an Inception Score of 7.8, an increase of 0.5 over the SG2IM model.
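For clarity, a minimal sketch of the bounding-box IoU metric used in this comparison follows; boxes are assumed to be (x0, y0, x1, y1) corner tuples, and the exact evaluation protocol is our assumption.

```python
# A minimal bounding-box IoU with boxes as (x0, y0, x1, y1) corner tuples,
# as would be used to compare predicted object boxes against ground truth.
def box_iou(a, b):
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])  # intersection top-left
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])  # intersection bottom-right
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-10)
```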
Conclusion
Qualitative experimental results show that the proposed model can generate complex scene images containing multiple objects and relationships and improves the quality of the generated images to a certain extent, making the final generated images clear and the object details evident. A machine that can generate high-quality images containing multiple objects and relationships is autonomously modeling its input data and taking a step toward "wisdom". Our next goal is to enable the proposed model to generate high-resolution images, such as photographic images, in real time, which requires considerable theoretical support and practical work.
Caesar H, Uijlings J and Ferrari V. 2018. COCO-Stuff: thing and stuff classes in context//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City: IEEE: 1209-1218[DOI:10.1109/CVPR.2018.00132]
Chen Q F and Koltun V. 2017. Photographic image synthesis with cascaded refinement networks//Proceedings of 2017 IEEE International Conference on Computer Vision. Venice: IEEE: 1511-1520[DOI:10.1109/ICCV.2017.168]
Chen X X, Xu L, Liu Z Y, Sun M S and Luan H B. 2015. Joint learning of character and word embeddings//Proceedings of the 24th International Conference on Artificial Intelligence.[s.l.]: AAAI Press: 1236-1242
Girshick R, Donahue J, Darrell T and Malik J. 2016. Region-based convolutional networks for accurate object detection and segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(1):142-158[DOI:10.1109/TPAMI.2015.2437384]
Goodfellow I J, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, Courville A and Bengio Y. 2014. Generative adversarial nets//Proceedings of the 27th International Conference on Neural Information Processing Systems. Cambridge: MIT Press: 2672-2680
Ioffe S and Szegedy C. 2015. Batch normalization: accelerating deep network training by reducing internal covariate shift//Proceedings of the 32nd International Conference on Machine Learning.[s.l.]: ACM: 448-456
Isola P, Zhu J Y, Zhou T H and Efros A A. 2017. Image-to-image translation with conditional adversarial networks//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu: IEEE: 1125-1134[DOI:10.1109/CVPR.2017.632]
Jaderberg M, Simonyan K, Zisserman A and Kavukcuoglu K. 2015. Spatial transformer networks//Advances in Neural Information Processing Systems.[s.l.]: [s.n.]: 2017-2025
Johnson J, Gupta A and Li F F. 2018. Image generation from scene graphs//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City: IEEE: 1219-1228[DOI:10.1109/CVPR.2018.00133]
Kingma D P and Ba J. 2015. Adam: a method for stochastic optimization[EB/OL].[2019-10-01]. https://arxiv.org/pdf/1412.6980.pdf
Kingma D P and Welling M. 2013. Auto-encoding variational bayes[EB/OL].[2019-10-01]. https://arxiv.org/pdf/1312.6114.pdf
Kipf T N and Welling M. 2016. Semi-supervised classification with graph convolutional networks[EB/OL].[2019-10-01]. https://arxiv.org/pdf/1609.02907.pdf
Lan H and Fang Z Y. 2019. Image recognition of steel plate defects based on a 3D gray matrix. Journal of Image and Graphics, 24(6):859-869[DOI:10.11834/jig.180555]
Lin T Y, Maire M, Belongie S, Hays J, Perona P, Ramanan D, Dollár P and Zitnick C L. 2014. Microsoft COCO: common objects in context//Proceedings of the 13th European Conference on Computer Vision. Zurich: Springer: 740-755[DOI:10.1007/978-3-319-10602-1_48]
Liu Z L, Zhu W and Yuan Z Y. 2019. Image instance style transfer combined with fully convolutional network and CycleGAN. Journal of Image and Graphics, 24(8):1283-1291[DOI:10.11834/jig.180624]
Maas A L, Hannun A Y and Ng A Y. 2013. Rectifier nonlinearities improve neural network acoustic models//ICML Workshop on Deep Learning for Audio, Speech and Language Processing.[s.l.]: [s.n.]: #3
Odena A, Olah C and Shlens J. 2016. Conditional image synthesis with auxiliary classifier GANs//Proceedings of the 34th International Conference on Machine Learning.[s.l.]: JMLR.org: 2642-2651
Qiang Z P, He L B, Chen X and Xu D. 2019. Survey on deep learning image inpainting methods. Journal of Image and Graphics, 24(3):447-463[DOI:10.11834/jig.180408]
Radford A, Metz L and Chintala S. 2015. Unsupervised representation learning with deep convolutional generative adversarial networks[EB/OL].[2019-10-01]. https://arxiv.org/pdf/1511.06434.pdf
Reed S, Akata Z, Yan X C, Logeswaran L, Schiele B and Lee H. 2016a. Generative adversarial text to image synthesis//Proceedings of the 33rd International Conference on Machine Learning.[s.l.]: ACM: 1060-1069
Reed S E, Akata Z, Mohan S, Tenka S, Schiele B and Lee H. 2016b. Learning what and where to draw//Advances in Neural Information Processing Systems. Barcelona, Spain: [s.n.]: 217-225
Salimans T, Goodfellow I, Zaremba W, Cheung V, Radford A and Chen X. 2016. Improved techniques for training GANs//Advances in Neural Information Processing Systems. Barcelona, Spain: [s.n.]: 2234-2242
Salimans T, Karpathy A, Chen X and Kingma D P. 2017. PixelCNN++: improving the pixelCNN with discretized logistic mixture likelihood and other modifications[EB/OL].[2019-10-10]. https://arxiv.org/pdf/1701.05517.pdf
Shelhamer E, Long J and Darrell T. 2017. Fully convolutional networks for semantic segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(4):640-651[DOI:10.1109/TPAMI.2016.2572683]
van den Oord A, Kalchbrenner N and Kavukcuoglu K. 2016. Pixel recurrent neural networks[EB/OL].[2019-10-10]. https://arxiv.org/pdf/1601.06759.pdf
Xu K, Ba J L, Kiros R, Cho K, Courville A, Salakhutdinov R, Zemel R and Bengio Y. 2015. Show, attend and tell: neural image caption generation with visual attention//Proceedings of the 32nd International Conference on Machine Learning.[s.l.]: [s.n.]: 2048-2057
Xu T, Zhang P C, Huang Q Y, Zhang H, Gan Z, Huang X L and He X D. 2018. AttnGAN: fine-grained text to image generation with attentional generative adversarial networks//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City: IEEE: 1316-1324[DOI:10.1109/CVPR.2018.00143]
Zhang H, Xu T, Li H S, Zhang S T, Wang X G, Huang X L and Metaxas D. 2017. StackGAN: text to photo-realistic image synthesis with stacked generative adversarial networks//Proceedings of 2017 IEEE International Conference on Computer Vision. Venice: IEEE: 5907-5915[DOI:10.1109/ICCV.2017.629]
Zhang H, Xu T, Li H S, Zhang S T, Wang X G, Huang X L and Metaxas D N. 2019. StackGAN++: realistic image synthesis with stacked generative adversarial networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(8):1947-1962[DOI:10.1109/TPAMI.2018.2856256]