Image description based on the fusion of scene and object category prior knowledge
2017, Vol. 22, No. 9: 1251-1260
Online publication: 2017-08-25
Print publication: 2017
DOI: 10.11834/jig.170052
Current approaches to image description based on deep convolutional neural networks (CNN) and long short-term memory (LSTM) networks generally use only object category information as prior knowledge when extracting CNN features and ignore the scene prior knowledge in the image. As a result, the generated sentences lack an accurate description of the scene and tend to misjudge the positional relationships among the objects in the image. To address this problem, an image description generation model that fuses scene and object category prior information (F-SOCPK) is designed; the scene prior information and the object category prior information of an image are incorporated into the model so that they jointly generate the description sentence and improve its quality.

First, the parameters of a scene CNN are trained on the large-scale scene category dataset Place205 so that the model contains rich scene prior information, and these parameters are then transferred to CNN-S via transfer learning to capture the scene information of the image to be described. Meanwhile, the parameters of an object CNN are trained on the large-scale object category dataset ImageNet and transferred to CNN-O to capture the object information in the image. After the scene and object information of the image is extracted, it is fed into the language models LM-S and LM-O, respectively. The outputs of LM-S and LM-O are then transformed by a Softmax function to obtain a probability score for each word in the vocabulary. Finally, weighted fusion is used to compute the final score of each word, and the word with the highest probability is taken as the output at the current time step; in this way, the description sentence of the image is generated.

Experiments are conducted on three public datasets: MSCOCO, Flickr30k, and Flickr8k. The proposed model outperforms the model that uses only object category information on several metrics, including BLEU, which reflects sentence coherence and precision; METEOR, which reflects the precision and recall of words; and CIDEr, which reflects semantic richness. In particular, on the Flickr8k dataset, the CIDEr score is 9% higher than that of the Object-based model, which uses only object categories, and nearly 11% higher than that of the Scene-based model, which uses only scene categories.

The proposed method is effective and greatly improves performance over the baseline models; its performance also compares very favorably with other mainstream methods. Its advantage is more evident on larger datasets such as MSCOCO, whereas its performance on smaller datasets such as Flickr8k still needs further improvement. In future work, more visual prior information, such as action categories and the relationships between objects, will be incorporated into the model to further improve the quality of the description sentences. More visual techniques, such as deeper CNN models, object detection, and scene understanding, will also be combined to further improve sentence accuracy.
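As a compact illustration of the weighted fusion used at each time step (the notation below is introduced here for readability and is not taken verbatim from the paper), the final score of a word w at time step t can be written as

    score_t(w) = λ · P_S(w | w_<t, I) + (1 − λ) · P_O(w | w_<t, I),

where P_S and P_O are the Softmax outputs of LM-S and LM-O for image I, w_<t denotes the words generated before step t, and λ is the fusion weight; the word with the maximum score_t(w) is emitted at step t.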
Object category prior knowledge is typically used to extract image features in popular approaches to the image description task based on deep convolutional neural network (CNN) and long short-term memory (LSTM) models. The task of image description involves detecting numerous objects in an image and generating a highly accurate description sentence. In general, people are sensitive to the objects in an image and direct considerable attention toward them when describing the image. However, the scene is also important in image description, because objects are typically described within a specific scene (e.g., a place, location, or space); otherwise, a sentence may lack semantic information, which leads to a poor-quality description of the candidate image. In current popular approaches, scene information is not treated seriously, and the scene category prior knowledge of an image is usually disregarded. Consequently, the generated sentences do not describe the scene correctly, and the positional relationships among objects are easy to misjudge, which lowers the quality of these sentences. To address this problem, the effects of the object category and the scene category on generating a description sentence for an image are surveyed and studied, and both factors are found to be useful for producing an accurate and semantic sentence. Based on these findings, a novel framework called fusion of scene and object category prior knowledge (F-SOCPK), in which scene category prior knowledge and object category prior knowledge are fused, is proposed in this work to generate an accurate description of an image, improve the quality and semantics of the sentence, and achieve efficient performance.

The objects and the scene in an image are detected using CNN models optimized on large-scale datasets, and transfer learning is utilized to exploit object category and scene category prior knowledge. A deep CNN model for scene recognition (CNN-Scene, abbreviated as CNN-S) is pre-trained on the large-scale scene dataset Place205 so that its CNN feature contains considerable scene category prior information; the pre-trained parameters are transferred to the CNN-S used in this framework to extract the CNN feature that captures the scene category information of a candidate image, and they are further optimized by fine-tuning. Another deep CNN model (CNN-Object, abbreviated as CNN-O) is optimized on ImageNet, a large-scale dataset for object recognition; its parameters are applied to the CNN-O model through transfer learning and are continuously trained by fine-tuning to capture the object category information in the image.
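As a minimal sketch of this transfer-learning step, the Python code below loads a backbone pre-trained on a source dataset and keeps it trainable for fine-tuning; the backbone choice (ResNet-50), the weight-file names, and the use of PyTorch/torchvision are illustrative assumptions, not details taken from the paper.

import torch
import torch.nn as nn
from torchvision import models

def build_branch_cnn(pretrained_weights_path):
    # Transfer parameters pre-trained on the source dataset
    # (Place205 for the scene branch, ImageNet for the object branch).
    cnn = models.resnet50()                       # hypothetical backbone choice
    state = torch.load(pretrained_weights_path)   # weights trained on the source dataset
    cnn.load_state_dict(state, strict=False)      # transfer-learning step
    cnn.fc = nn.Identity()                        # drop the source classifier, keep the feature vector
    return cnn                                    # all parameters stay trainable for fine-tuning

cnn_s = build_branch_cnn("place205_weights.pth")      # scene branch CNN-S (hypothetical weight file)
cnn_o = build_branch_cnn("imagenet_weights.pth")      # object branch CNN-O (hypothetical weight file)
image = torch.randn(1, 3, 224, 224)                   # dummy pre-processed image
scene_feat, object_feat = cnn_s(image), cnn_o(image)  # features later fed to LM-S and LM-O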
The CNN features from CNN-S and CNN-O are fed into the scene-based and object-based language models (LM-S and LM-O), respectively. Each language model consists of two stacked LSTM layers constructed via factoring: the first layer receives the embedding features of the words, and its outputs are combined with the CNN feature of the candidate image to form multimodal features that are fed into the second LSTM layer. A fully connected layer serves as the classification layer that maps the feature vector to category information, and a Softmax function computes a probability score for each word in the vocabulary.
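The following sketch shows one such language-model branch in Python; the layer sizes, the concatenation used to form the multimodal feature, and the PyTorch implementation are assumptions made for illustration rather than the paper's exact configuration.

import torch
import torch.nn as nn

class FactoredCaptionLM(nn.Module):
    # One branch (LM-S or LM-O): two stacked LSTM layers, where the first consumes
    # word embeddings and the second consumes the multimodal combination of the
    # first layer's output and the image feature produced by CNN-S or CNN-O.
    def __init__(self, vocab_size, embed_dim=512, hidden_dim=512, cnn_dim=2048):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm1 = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.img_proj = nn.Linear(cnn_dim, hidden_dim)           # project the CNN feature
        self.lstm2 = nn.LSTM(2 * hidden_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, vocab_size)      # fully connected classification layer

    def forward(self, word_ids, cnn_feat):
        # word_ids: (batch, seq_len) token indices; cnn_feat: (batch, cnn_dim)
        h1, _ = self.lstm1(self.embed(word_ids))                 # first LSTM over word embeddings
        img = self.img_proj(cnn_feat).unsqueeze(1).expand_as(h1)
        multimodal = torch.cat([h1, img], dim=-1)                # combine text and image information
        h2, _ = self.lstm2(multimodal)                           # second LSTM over multimodal features
        return torch.softmax(self.classifier(h2), dim=-1)        # per-word probabilities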
The vocabulary contains all the words that appear in the reference sentences of the training dataset. A late fusion strategy based on a weighted average is then applied to compute the final score of each word, the word with the maximum score is taken as the output at the current time step, and the words generated over all time steps form the description sentence of the candidate image.
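A greedy decoding loop in the spirit of this procedure is sketched below, assuming language-model branches with the interface of the sketch above (probabilities of shape batch x length x vocabulary); the fusion weight alpha, the maximum length, and the token bookkeeping are illustrative assumptions.

import torch

def generate_caption(lm_s, lm_o, scene_feat, object_feat, start_id, end_id,
                     max_len=20, alpha=0.5):
    # At every time step, fuse the per-word probabilities of LM-S and LM-O with a
    # weighted average and append the highest-scoring word to the sentence.
    words = [start_id]
    for _ in range(max_len):
        tokens = torch.tensor([words])                 # partial sentence generated so far
        p_s = lm_s(tokens, scene_feat)[:, -1, :]       # scene-branch probabilities at this step
        p_o = lm_o(tokens, object_feat)[:, -1, :]      # object-branch probabilities at this step
        fused = alpha * p_s + (1.0 - alpha) * p_o      # weighted late fusion of the two branches
        next_id = int(torch.argmax(fused, dim=-1))     # word with the maximum final score
        if next_id == end_id:                          # stop when the end token is produced
            break
        words.append(next_id)
    return words[1:]                                   # word ids of the generated description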
Three popular public datasets, namely, MSCOCO, Flickr30k, and Flickr8k, are used to evaluate the effectiveness of the proposed method, and the usage protocols of the three datasets adopted by other popular approaches are followed. Three evaluation metrics are adopted: BLEU, which evaluates consistency and precision; METEOR, which reflects the precision and recall of words; and CIDEr, which evaluates the semantics of candidate sentences. Results on the three datasets demonstrate that the proposed model significantly improves performance on all three metrics. In particular, the CIDEr score on the MSCOCO dataset is 1.9% and 6.2% higher than those of the object-based and scene-based models, respectively. Performance on the Flickr30k dataset is likewise improved by 1.7% and 2.6% over the two benchmark models. On the Flickr8k dataset, CIDEr reaches 52.8%, outperforming the two benchmark models by 9% and approximately 11%. On the BLEU and METEOR metrics, the proposed model also achieves better performance than the models based only on object or scene information. On multiple metrics, the performance of the proposed model on the three datasets surpasses that of most current state-of-the-art methods.
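As a small, hedged illustration of how one of these metrics can be computed for a single candidate sentence, the snippet below uses NLTK's sentence-level BLEU; the sentences are made up, and METEOR and CIDEr are more commonly obtained with the MS COCO caption evaluation toolkit rather than computed by hand.

from nltk.translate.bleu_score import sentence_bleu

references = [["a", "dog", "runs", "on", "the", "beach"],
              ["a", "dog", "is", "running", "along", "the", "beach"]]
candidate = ["a", "dog", "runs", "along", "the", "beach"]
print(sentence_bleu(references, candidate))   # BLEU-4 with default uniform weights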
The experimental results show that the proposed model is effective. It considerably improves performance compared with the benchmark models (the object-based and scene-based models) and most popular state-of-the-art approaches. The results also indicate that the proposed model can generate more semantic description sentences for an image because it captures the relationships between the objects and the scene when scene category information is applied. However, the performance of the proposed model on the BLEU metric should be improved further. In future work, additional prior knowledge, such as action categories and the relationships among objects, will be fused into the proposed framework to further improve the quality of the generated sentences, particularly their accuracy. Other techniques, such as residual networks for effective CNN features, region-based CNNs for accurate object recognition, and gated recurrent units for a concise language model, will also be used to further improve performance.