An image caption generation model fusing scene and object prior knowledge

Tang Pengjie1,2,3, Tan Yunlan2,4, Li Jinzhong2,4,3 (1. School of Mathematical and Physical Science, Jinggangshan University, Ji'an 343009, China; 2. Key Laboratory of Watershed Ecology and Geographical Environment Monitoring, National Administration of Surveying, Mapping and Geoinformation, Jinggangshan University, Ji'an 343009, China; 3. Department of Computer Science and Technology, Tongji University, Shanghai 201804, China; 4. School of Electronics and Information Engineering, Jinggangshan University, Ji'an 343009, China)

Abstract
Objective Current image description methods based on deep convolutional neural networks (CNN) and long short-term memory (LSTM) networks generally use only object category information as prior knowledge when extracting CNN features and ignore the scene prior knowledge contained in the image. As a result, the generated sentences lack an accurate description of the scene and tend to misjudge the positional relationships among objects. To address this problem, we design an image caption generation model that fuses scene and object category prior information (F-SOCPK). The model incorporates both the scene prior information and the object category prior information of an image and uses them jointly to generate the description sentence, thereby improving sentence quality.

Method First, the parameters of the CNN-S model are trained on the large-scale scene category dataset Places205 so that CNN-S contains rich scene prior information; these parameters are then transferred to CNNd-S via transfer learning to capture the scene information of the image to be described. At the same time, the parameters of the CNN-O model are trained on the large-scale object category dataset ImageNet and then transferred to CNNd-O to capture the object information in the image. After the scene and object features are extracted, they are fed into the language models LM-S and LM-O, respectively. The outputs of LM-S and LM-O are transformed by a Softmax function to obtain a probability score for every word in the vocabulary. Finally, a weighted fusion is used to compute the final score of each word, the word with the highest probability is taken as the output at the current time step, and the description sentence of the image is generated accordingly.

Result Experiments are conducted on three public datasets: MSCOCO, Flickr30k, and Flickr8k. The proposed model outperforms the model that uses only object category information on multiple metrics, including BLEU, which reflects sentence coherence and precision; METEOR, which reflects the precision and recall of words; and CIDEr, which reflects semantic richness. On the Flickr8k dataset in particular, the proposed model improves CIDEr by 9% over the Object-based model, which uses only object categories, and by nearly 11% over the Scene-based model, which uses only scene categories.

Conclusion The proposed method is effective and improves performance considerably over the baseline models; it is also highly competitive with other mainstream methods. Its advantage is more obvious on larger datasets such as MSCOCO, whereas its performance on smaller datasets such as Flickr8k still needs further improvement. In future work, more visual prior information, such as action categories and the relationships between objects, will be incorporated into the model to further improve sentence quality. More visual techniques, such as deeper CNN models, object detection, and scene understanding, will also be combined to further improve sentence accuracy.
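The core of F-SOCPK, as described above, is the weighted late fusion of the per-word probability distributions produced by LM-S and LM-O at each time step. The following is a minimal Python sketch of that single fusion step; the function name fuse_word_scores and the weight alpha are illustrative placeholders rather than the paper's actual settings.

    import numpy as np

    def fuse_word_scores(p_scene, p_object, alpha=0.5):
        """Weighted late fusion of the per-word probabilities from
        LM-S (scene branch) and LM-O (object branch) at one time step.

        p_scene, p_object : 1-D arrays of length |vocabulary|, already
                            normalized by Softmax in each language model.
        alpha             : fusion weight for the scene branch (a placeholder
                            value; the actual weighting is tuned by the authors).
        """
        fused = alpha * p_scene + (1.0 - alpha) * p_object
        return int(np.argmax(fused))   # index of the word emitted at this step

    # Toy usage with a 5-word vocabulary
    p_s = np.array([0.1, 0.5, 0.2, 0.1, 0.1])   # output of LM-S
    p_o = np.array([0.3, 0.2, 0.3, 0.1, 0.1])   # output of LM-O
    word_id = fuse_word_scores(p_s, p_o, alpha=0.5)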
Keywords
Image description based on the fusion of scene and object category prior knowledge

Tang Pengjie1,2,3, Tan Yunlan2,4, Li Jinzhong2,4,3(1.School of Mathematical and Physical Science, Jinggangshan University, Ji'an 343009, China;2.Key Laboratory of Watershed Ecology and Geographical Environment Monitoring, National Administration of Surveying, Mapping and Geoinformation, Ji'an 343009, China;3.Department of Computer Science and Technology, Tongji University, Shanghai 201804, China;4.School of Electronics and Information Engineering, Jinggangshan University, Ji'an 343009, China)

Abstract
Objective Object category prior knowledge is typically used to extract image features in popular approaches to the image description task based on deep convolutional neural network (CNN) and long short-term memory (LSTM) models. The task of image description involves detecting numerous objects in an image and generating a highly accurate description sentence. In general, people are sensitive to the objects in an image and direct considerable attention toward them when describing the image. However, the scene is also important in image description because objects are typically described within a specific scene (e.g., a place, location, or space); otherwise, a sentence may lack semantic information, leading to a poor-quality description of the candidate image. In current popular approaches, scene category information is not taken seriously, and the scene category prior knowledge of an image is usually disregarded. Consequently, the generated sentences do not describe the scene correctly, and the positional relationships among objects are easily misjudged, which leads to poor sentence quality. The effects of the object category and the scene category on generating a description sentence for an image are surveyed and studied to address this problem, and both factors are found to be useful for producing an accurate and semantically rich sentence. Based on these findings, a novel framework, called fusion of scene and object category prior knowledge (F-SOCPK), in which scene category prior knowledge and object category prior knowledge are fused, is proposed and designed in this work to generate an accurate description of an image, improve the quality and semantics of the sentence, and achieve efficient performance.

Method The objects and scene in an image are detected using CNN models that are optimized on large-scale datasets, and transfer learning is mainly utilized to exploit object category and scene category prior knowledge. A deep CNN model for scene recognition (CNN-Scene, abbreviated as CNN-S) is trained on the large-scale scene dataset Places205 so that the CNN feature includes considerable scene category prior information. The parameters of CNN-S are transferred to CNNd-S to extract the CNN feature that captures the scene category information of a candidate image, and these parameters are further optimized via fine-tuning. Another deep CNN model (CNN-Object, abbreviated as CNN-O) is optimized on ImageNet, a large-scale dataset for object recognition. Its parameters are applied to the CNNd-O model through transfer learning and continuously trained via fine-tuning to capture the object category information in the image. The CNN features from CNNd-S and CNNd-O are then fed into the scene-based and object-based language models (LM-S and LM-O), respectively. Each language model consists of two stacked LSTM layers and is constructed via factoring: the first layer receives the embedding features of the words, and the output of the first layer is combined with the CNN feature of the candidate image to form a multimodal feature that is fed into the second LSTM layer. A fully connected layer is then used as a classification layer to map the feature vector to the word categories, and a Softmax function computes the probability of each word in the vocabulary, which contains all the words that appear in the reference sentences of the training set. A late fusion strategy based on a weighted average is applied to calculate the final score of each word, and the word with the maximum score is taken as the output at the current time step. The words generated over all time steps form the description sentence of the candidate image.
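To make the factored two-layer LSTM language model concrete, the following is a minimal sketch of one branch (LM-S or LM-O) in PyTorch. The class name FactoredCaptionLM and all layer sizes (embedding, hidden, and CNN feature dimensions) are assumptions for illustration only, not the configuration reported in the paper; the structure, however, follows the description above: the first LSTM consumes word embeddings, its output is combined with the image CNN feature into a multimodal input for the second LSTM, and a fully connected layer plus Softmax yields per-word probabilities.

    import torch
    import torch.nn as nn

    class FactoredCaptionLM(nn.Module):
        """Sketch of one branch language model (LM-S or LM-O); sizes and names
        are illustrative assumptions, not the paper's exact settings."""

        def __init__(self, vocab_size, embed_dim=512, hidden_dim=512, cnn_dim=4096):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, embed_dim)
            # First LSTM layer: consumes word embeddings only.
            self.lstm1 = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
            # Second LSTM layer: consumes the multimodal feature formed by
            # combining the first layer's output with the image CNN feature.
            self.lstm2 = nn.LSTM(hidden_dim + cnn_dim, hidden_dim, batch_first=True)
            # Fully connected classification layer over the vocabulary.
            self.classifier = nn.Linear(hidden_dim, vocab_size)

        def forward(self, word_ids, cnn_feat):
            # word_ids: (batch, seq_len) token indices
            # cnn_feat: (batch, cnn_dim) feature from CNNd-S or CNNd-O
            emb = self.embed(word_ids)                          # (B, T, E)
            h1, _ = self.lstm1(emb)                             # (B, T, H)
            img = cnn_feat.unsqueeze(1).expand(-1, h1.size(1), -1)
            multimodal = torch.cat([h1, img], dim=-1)           # (B, T, H + C)
            h2, _ = self.lstm2(multimodal)                      # (B, T, H)
            logits = self.classifier(h2)                        # (B, T, |V|)
            return torch.softmax(logits, dim=-1)                # per-word probabilities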
Result Three popular public datasets, namely, MSCOCO, Flickr30k, and Flickr8k, are used to evaluate the effectiveness of the proposed method, and the standard protocols for these datasets, as adopted by other popular approaches, are followed. Three evaluation metrics are used: BLEU for evaluating consistency and precision, METEOR for reflecting the precision and recall of words, and CIDEr for evaluating the semantics of candidate sentences. Results on the three datasets demonstrate that the proposed model improves performance markedly on all three metrics. In particular, the performance on the CIDEr metric is 1.9% and 6.2% higher than that of the object-based and scene-based models, respectively, on the MSCOCO dataset. Performance is also improved on the Flickr30k dataset, by 1.7% and 2.6% compared with the two benchmark models. On the Flickr8k dataset, CIDEr reaches 52.8%, outperforming the two benchmark models by 9% and approximately 11%. On the BLEU and METEOR metrics, the proposed model likewise achieves better performance than the models based only on object or scene information, and its performance on the three datasets surpasses that of most current state-of-the-art methods on multiple metrics.

Conclusion The experimental results show that the proposed model is effective. It considerably improves performance compared with the benchmark models (the object-based and scene-based models) and most popular state-of-the-art approaches. The results also indicate that the proposed model can generate more semantically rich description sentences for an image because it captures the relationships between the objects and the scene when scene category information is applied. However, the performance of the proposed model on the BLEU metric should be improved further. In future work, additional prior knowledge, such as action categories and the relationships among objects, will be fused into the proposed framework to further improve the quality of the generated sentences, particularly their accuracy. Other techniques, such as residual networks for more effective CNN features, region-based CNNs for accurate object recognition, and gated recurrent units for a more concise language model, will also be used to further improve performance.
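For reference, BLEU, METEOR, and CIDEr scores of the kind reported above are commonly computed with the COCO caption evaluation toolkit. The snippet below is a sketch of such an evaluation assuming the pycocoevalcap package; whether the authors used this exact toolkit is not stated, and the example captions and image id are purely illustrative.

    from pycocoevalcap.bleu.bleu import Bleu
    from pycocoevalcap.meteor.meteor import Meteor
    from pycocoevalcap.cider.cider import Cider

    # Reference (ground-truth) captions and generated captions, keyed by image id.
    # Both sides map an id to a list of plain, whitespace-tokenized sentences.
    gts = {"391895": ["a man riding a horse on a beach",
                      "a person rides a horse near the ocean"]}
    res = {"391895": ["a man is riding a horse on the beach"]}

    scorers = [(Bleu(4), ["BLEU-1", "BLEU-2", "BLEU-3", "BLEU-4"]),
               (Meteor(), "METEOR"),
               (Cider(), "CIDEr")]

    for scorer, name in scorers:
        score, _ = scorer.compute_score(gts, res)   # corpus-level score(s)
        print(name, score)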
Keywords
