Global semantic information extraction based scene graph generation algorithm
2022, Vol. 27, No. 7, Pages 2214-2225
Print publication date: 2022-07-16
Accepted: 2021-06-23
DOI: 10.11834/jig.210032
Jingwen Duan, Weidong Min, Ziyuan Yang, Yu Zhang, Xinhao Chen, Shengbao Yang. Global semantic information extraction based scene graph generation algorithm[J]. Journal of Image and Graphics, 2022,27(7):2214-2225.
Objective
A scene graph describes an image concisely and in a structured form. Existing scene graph generation methods focus on the visual features of images and neglect the rich semantic information in the dataset. Moreover, affected by the long-tailed distribution of the dataset, most methods cannot reason well about low-frequency triplets and instead tend to produce high-frequency ones. In addition, most existing methods use the same network structure to infer both object and relationship categories, which lacks specificity. To address these problems, this paper proposes a scene graph generation algorithm that extracts global semantic information.
Method
The network consists of four modules: semantic encoding, feature encoding, object inference, and relationship reasoning. The semantic encoding module extracts semantic information from the region descriptions of images and computes global statistical knowledge, fusing them into robust global semantic information that assists the inference of uncommon triplets. The feature encoding module extracts the visual features of the image. The object inference and relationship reasoning modules adopt different feature fusion methods and learn features with a gated graph neural network and gated recurrent units, respectively. On this basis, object and relationship categories are inferred with the aid of global statistical knowledge. Finally, a parser constructs the scene graph, yielding a structured description of the image.
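The final parsing step can be illustrated with a short sketch: predicted (subject, predicate, object) triplets are assembled into a graph whose nodes are object instances and whose edges are relationships. The class and naming below are hypothetical illustrations, not the paper's released code.

```python
# A minimal, illustrative scene graph container: nodes are detected object
# instances and edges are predicted relationships between them.
from collections import defaultdict

class SceneGraph:
    def __init__(self):
        self.nodes = set()                 # object instances, e.g. "man_0"
        self.edges = defaultdict(list)     # subject -> [(predicate, object)]

    def add_triplet(self, subj, pred, obj):
        self.nodes.update([subj, obj])
        self.edges[subj].append((pred, obj))

    def __repr__(self):
        return "\n".join(f"{s} --{p}--> {o}"
                         for s, pairs in self.edges.items()
                         for p, o in pairs)

# Hypothetical triplets produced by the inference modules.
graph = SceneGraph()
for triplet in [("man_0", "riding", "horse_0"), ("man_0", "wearing", "hat_0")]:
    graph.add_triplet(*triplet)
print(graph)
```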
Result
On the public Visual Genome dataset, the proposed method was compared with 10 other methods on three tasks: predicate classification, scene graph element classification, and scene graph generation. Under the settings that restrict and do not restrict each object pair to a single relationship, the mean recall reached 44.2% and 55.3%, respectively. In the visualization experiments, compared with the second-best method, the proposed method strengthens the inference of uncommon relationship categories while also improving the inference of object categories and common relationships.
Conclusion
The proposed algorithm improves the inference of uncommon triplets while retaining good inference of common triplets, and it generates scene graphs effectively.
Objective
A scene graph represents an image as a graph structure for image interpretation: objects become nodes and their inter-relations become edges. However, existing methods focus on visual features and neglect semantic information, even though semantic information can provide robust features and improve inference. In addition, the dataset suffers from a long-tailed distribution: the 30 frequent relationship categories account for 69% of the samples, while triplets of the 20 infrequent relationship categories account for only 31%. Most methods cannot maintain qualified results on these rare triplets and tend to infer the frequent ones instead. To improve the reasoning of infrequent triplets, we propose a scene graph generation algorithm that produces robust features.
Method
The network consists of four modules: semantic encoding, feature encoding, object inference, and relationship reasoning. The semantic encoding module first maps each word in the region descriptions to a low-dimensional vector via word embedding. Because the Word2Vec model is trained on a large corpus, it represents word semantics well. We use the Word2Vec network to traverse the region descriptions of the dataset and extract the intermediate word embedding vectors of the 150 object categories and 50 relationship categories as semantic information. In addition, this module explicitly computes global statistical knowledge, which captures the global characteristics of the dataset, and integrates it with the semantic information through graph convolutional networks. The result is robust global semantic information that strengthens the reasoning of rare triplets.
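A minimal sketch of this fusion follows. The random tensors stand in for pretrained Word2Vec embeddings of the 150 object and 50 relationship categories and for statistical knowledge derived from the training annotations; the single normalized GCN layer is an illustrative assumption, not the paper's exact architecture.

```python
# Fuse category word embeddings (semantic information) with a statistics
# graph (global statistical knowledge) through one GCN propagation step.
import torch
import torch.nn as nn

NUM_CLASSES, EMBED_DIM = 150 + 50, 300   # object + relationship categories

# Placeholders: in practice, Word2Vec vectors and dataset co-occurrence counts.
word_embeddings = torch.randn(NUM_CLASSES, EMBED_DIM)
statistics = torch.rand(NUM_CLASSES, NUM_CLASSES)

class GCNLayer(nn.Module):
    """One graph convolution step: X' = ReLU(A_norm X W)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.weight = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, x, adj):
        # Row-normalize so each category averages over related categories.
        adj_norm = adj / adj.sum(dim=1, keepdim=True).clamp(min=1e-6)
        return torch.relu(self.weight(adj_norm @ x))

gcn = GCNLayer(EMBED_DIM, EMBED_DIM)
global_semantic_info = gcn(word_embeddings, statistics)   # shape (200, 300)
```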
The feature encoding module extracts visual image features with the faster region-based convolutional neural network (Faster R-CNN). We remove its classification network and use its feature extraction network, region proposal network, and region-of-interest pooling layer to obtain the visual features of each image region.
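As a sketch of this step, the code below uses torchvision's Faster R-CNN as a stand-in detector, discarding the classification head and keeping the backbone, RPN, and RoI pooling to produce per-region visual features. Attribute names follow torchvision, not the paper's code, and the preprocessing transform is omitted.

```python
# Extract RoI visual features with a Faster R-CNN whose classifier is unused.
import torch
import torchvision
from torchvision.models.detection.image_list import ImageList

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

with torch.no_grad():
    image = torch.rand(1, 3, 600, 800)               # dummy input image
    features = model.backbone(image)                 # FPN feature maps
    image_list = ImageList(image, [(600, 800)])
    proposals, _ = model.rpn(image_list, features)   # region proposals
    pooled = model.roi_heads.box_roi_pool(features, proposals, [(600, 800)])
    visual_features = model.roi_heads.box_head(pooled)   # per-region vectors
print(visual_features.shape)                         # (num_proposals, 1024)
```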
In the object inference and relationship reasoning modules, visual features and global semantic information are fused into global semantic features via different fusion methods. These features improve performance on rare triplets by sharpening the distinctions among objects and relationships. In the object inference module, we represent each image as a graph and use gated graph neural networks to aggregate context information. After three propagation steps, the object features are fully refined, and a classifier trained on these final global semantic features determines the object classes. The inferred object classes, in turn, benefit relationship reasoning.
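A minimal sketch of this gated graph aggregation follows, with a GRU cell as the gated update and three propagation steps as described; the dense adjacency, feature dimension, and message function are illustrative assumptions.

```python
# GGNN-style context aggregation: nodes exchange messages along image-graph
# edges and a GRU cell gates each node-state update; repeated three times.
import torch
import torch.nn as nn

class GGNN(nn.Module):
    def __init__(self, dim, steps=3):
        super().__init__()
        self.steps = steps
        self.message = nn.Linear(dim, dim)   # edge/message transformation
        self.update = nn.GRUCell(dim, dim)   # gated node-state update

    def forward(self, node_states, adj):
        # node_states: (N, dim) fused visual + semantic feature per object
        # adj: (N, N) connectivity of the image graph
        for _ in range(self.steps):
            messages = adj @ self.message(node_states)   # aggregate neighbors
            node_states = self.update(messages, node_states)
        return node_states

objects = torch.randn(5, 512)               # 5 detected objects
adj = (torch.rand(5, 5) > 0.5).float()      # hypothetical connectivity
refined = GGNN(512)(objects, adj)
object_logits = nn.Linear(512, 150)(refined)   # classifier over 150 classes
```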
In the relationship reasoning module, we use both the object classes and the global semantic features of relationships to perform inference. Gated recurrent units refine the features and reason about the relationships, with each relationship feature aggregating information from its corresponding object pair. Finally, a parser constructs the scene graph to describe the image in a structured way.
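The relationship refinement can be sketched as below: subject, object, and relationship features are fused (concatenation is an assumption for illustration) and a gated recurrent unit refines the relationship state before classification over the 50 predicate categories.

```python
# Refine one candidate relationship with its object pair, then classify it.
import torch
import torch.nn as nn

DIM, NUM_PREDICATES = 512, 50

fuse = nn.Linear(3 * DIM, DIM)        # fuse subject + object + relation
gru = nn.GRUCell(DIM, DIM)            # gated refinement of the relation state
classifier = nn.Linear(DIM, NUM_PREDICATES)

subj_feat = torch.randn(1, DIM)       # refined subject feature (class-aware)
obj_feat = torch.randn(1, DIM)        # refined object feature
rel_feat = torch.randn(1, DIM)        # initial relationship feature

pair_info = fuse(torch.cat([subj_feat, obj_feat, rel_feat], dim=1))
rel_state = gru(pair_info, rel_feat)          # aggregate pair information
predicate_logits = classifier(rel_state)      # scores over 50 predicates
```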
Result
We conducted experiments on the public Visual Genome dataset and compared our method with 10 existing methods on the predicate classification, scene graph classification, and scene graph generation tasks. Ablation experiments were also performed. The mean recall reached 44.2% and 55.3% under the constrained and unconstrained settings, respectively. Compared with the neural motifs method, R@50 on the scene graph classification task improved by 1.3%. For the visualization, we show the results of the scene graph generation task: object locations and classes are marked in the original image, and object and relationship classes are represented as nodes and edges. Compared with the second-best method from the quantitative analysis, our network significantly enhances the reasoning of rare relationships while also improving the reasoning of objects and common relationships.
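For reference, the recall metric behind these numbers can be sketched as follows. Triplets are simplified to class-level tuples; the real protocol also matches predicted and ground-truth boxes with an IoU threshold.

```python
# Recall@K: fraction of ground-truth triplets found in the top-K predictions.
def recall_at_k(pred_triplets, gt_triplets, k=50):
    """pred_triplets must be sorted by decreasing confidence."""
    top_k = set(pred_triplets[:k])
    hits = sum(1 for t in gt_triplets if t in top_k)
    return hits / max(len(gt_triplets), 1)

gt = [("man", "riding", "horse"), ("man", "wearing", "hat")]
preds = [("man", "riding", "horse"), ("man", "on", "horse"),
         ("man", "wearing", "hat")]
print(recall_at_k(preds, gt))   # 1.0
```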
Conclusion
The proposed algorithm improves the reasoning of rare triplets. It also performs well on common triplets and generates scene graphs effectively.
scene graph; global semantic information; target inference; relationship reasoning; image interpretation
Cho K, van Merriënboer B, Gulcehre C, Bahdanau D, Bougares F, Schwenk H and Bengio Y. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation[EB/OL]. [2021-01-23]. https://arxiv.org/pdf/1406.1078.pdf
Gu J X, Zhao H D, Lin Z, Li S, Cai J F and Ling M Y. 2019. Scene graph generation with external knowledge and image reconstruction//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, USA: IEEE: 1969-1978 [DOI: 10.1109/CVPR.2019.00207]
Herzig R, Bar A, Xu H J, Chechik G, Darrell T and Globerson A. 2020. Learning canonical representations for scene graph to image generation//Proceedings of the 16th European Conference on Computer Vision. Glasgow, UK: Springer: 210-227 [DOI: 10.1007/978-3-030-58574-7_13]
Hung Z S, Mallya A and Lazebnik S. 2021. Contextual translation embedding for visual relationship detection and scene graph generation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(11): 3820-3832[DOI: 10.1109/TPAMI.2020.2992222]
Johnson J, Krishna R, Stark M, Li L J, Shamma D A, Bernstein M S and Li F F. 2015. Image retrieval using scene graphs//Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston, USA: IEEE: 3668-3678 [DOI: 10.1109/CVPR.2015.7298990]
Kipf T N and Welling M. 2017. Semi-supervised classification with graph convolutional networks[EB/OL]. [2021-01-23]. https://arxiv.org/pdf/1609.02907.pdf
Krishna R, Zhu Y K, Groth O, Johnson J, Hata K, Kravitz J, Chen S, Kalantidis Y, Li L J, Shamma D A, Bernstein M S and Li F F. 2017. Visual genome: connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision, 123(1): 32-73[DOI: 10.1007/s11263-016-0981-7]
Leng L, Yang Z Y and Min W D. 2020. Democratic voting downsampling for coding-based palmprint recognition. IET Biometrics, 9(6): 290-296[DOI: 10.1049/iet-bmt.2020.0106]
Li Y J, Zemel R, Brockschmidt M and Tarlow D. 2017a. Gated graph sequence neural networks[EB/OL]. [2022-04-19]. https://arxiv.org/pdf/1511.05493.pdf
Li Y K, Ouyang W L, Zhou B L, Shi J P, Zhang C and Wang X G. 2018. Factorizable net: an efficient subgraph-based framework for scene graph generation//Proceedings of the 15th European Conference on Computer Vision. Munich, Germany: Springer: 346-363 [DOI: 10.1007/978-3-030-01246-5_21]
Li Y K, Ouyang W L, Zhou B L, Wang K and Wang X G. 2017b. Scene graph generation from objects, phrases and region captions//Proceedings of 2017 IEEE International Conference on Computer Vision. Venice, Italy: IEEE: 1270-1279 [DOI: 10.1109/ICCV.2017.142]
Lu C W, Krishna R, Bernstein M and Li F F. 2016. Visual relationship detection with language priors//Proceedings of the 14th European Conference on Computer Vision. Amsterdam, the Netherlands: Springer: 852-869 [DOI: 10.1007/978-3-319-46448-0_51]
Mikolov T, Chen K, Corrado G and Dean J. 2013. Efficient estimation of word representations in vector space[EB/OL]. [2021-01-23]. https://arxiv.org/pdf/1301.3781.pdf
Newell A and Deng J. 2018. Pixels to graphs by associative embedding[EB/OL]. [2021-01-23]. https://arxiv.org/pdf/1706.07365.pdf
Prabhu N and Babu R V. 2015. Attribute-graph: a graph based approach to image ranking//Proceedings of 2015 IEEE International Conference on Computer Vision. Santiago, Chile: IEEE: 1071-1079 [DOI: 10.1109/ICCV.2015.128]
Ren S Q, He K M, Girshick R and Sun J. 2017. Faster R-CNN: towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(6): 1137-1149[DOI: 10.1109/TPAMI.2016.2577031]
Scarselli F, Gori M, Tsoi A C, Hagenbuchner M and Monfardini G. 2009. The graph neural network model. IEEE Transactions on Neural Networks, 20(1): 61-80[DOI: 10.1109/TNN.2008.2005605]
Wan H, Luo Y H, Peng B and Zheng W S. 2018. Representation learning for scene graph completion via jointly structural and visual embedding//Proceedings of the 27th International Joint Conference on Artificial Intelligence. Stockholm, Sweden: IJCAI: 949-956 [DOI: 10.24963/ijcai.2018/132]
Xi Y L, Zhang Y N, Ding S T and Wan S H. 2020. Visual question answering model based on visual relationship detection. Signal Processing: Image Communication, 80: #115648[DOI: 10.1016/j.image.2019.115648]
Xu D F, Zhu Y K, Choy C B and Li F F. 2017. Scene graph generation by iterative message passing//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE: 3097-3106 [DOI: 10.1109/CVPR.2017.330]
Xu N, Liu A A, Liu J, Nie W Z and Su Y T. 2018. Scene graph captioner: image captioning based on structural visual representation. Journal of Visual Communication and Image Representation, 58: 477-485[DOI: 10.1016/j.jvcir.2018.12.027]
Yang J W, Lu J S, Lee S, Batra D and Parikh D. 2018. Graph R-CNN for scene graph generation//Proceedings of the 15th European Conference on Computer Vision. Munich, Germany: Springer: 690-706 [DOI: 10.1007/978-3-030-01246-5_41]
Yang Z Y, Li J, Min W D and Wang Q. 2019. Real-time pre-identification and cascaded detection for tiny faces. Applied Sciences, 9(20): #4344[DOI: 10.3390/app9204344]
Zaremba W, Sutskever I and Vinyals O. 2015. Recurrent neural network regularization[EB/OL]. [2022-04-19]. https://arxiv.org/pdf/1409.2329.pdf
Zellers R, Yatskar M, Thomson S and Choi Y. 2018. Neural motifs: scene graph parsing with global context//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 5831-5840 [DOI: 10.1109/CVPR.2018.00611]
Zhao Y Q, Rao Y, Dong S P and Zhang J Y. 2020. Survey on deep learning object detection. Journal of Image and Graphics, 25(4): 629-654 [DOI: 10.11834/jig.190307]