Li Jun, Lyu Shaohe, Chen Fei, Yang Guogui, Dou Yong (College of Computer, National University of Defense Technology, Changsha 410073, China)
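Objective Image retrieval is an important task in computer vision. The key to image retrieval is the description of image content, and describing the content of complex images is particularly challenging. Traditional methods describe image content with a fixed-length vector. We therefore propose a variable-length sequence description model that aims to enrich the expressive power of the feature encoding and improve retrieval precision. Method We propose a sequence description model that describes an image with a variable-length feature sequence. The model first extracts low-level features with a CNN (convolutional neural network), then generates a contextual representation of the local features with an intermediate LSTM (long short-term memory), and finally produces a group of vectors describing the image with a visual attention LSTM (attention LSTM). Image retrieval is completed by computing the similarity between images with the Hungarian algorithm. The model is trained end to end with a label-level triplet loss function. Result We conducted image retrieval experiments on the MIRFLICKR-25K and NUS-WIDE datasets and compared the model with related algorithms. Relative to the other methods, our model improves retrieval precision by 5 to 12 percentage points. Relative to fixed-length image descriptions, our model significantly improves retrieval on multi-label datasets. Conclusion We propose a new image sequence description model that significantly improves retrieval and is suited to multi-label image retrieval.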
Image retrieval by combining recurrent neural network and visual attention mechanism
Li Jun, Lyu Shaohe, Chen Fei, Yang Guogui, Dou Yong (College of Computer, National University of Defense Technology, Changsha 410073, China)
Objective Image retrieval is an important task in computer vision. Image content description is the key to image retrieval, and an accurate and complete description of image content can significantly improve retrieval precision. Traditional methods describe image content with a single fixed-length vector. However, a simple image contains only one object, whereas a complex image can contain several. Describing a complex image with the same fixed-length vector used for a simple image is generally insufficient. This study proposes a variable-length sequence description model. Method We propose a sequence description model based on a recurrent neural network and a visual attention mechanism. The model describes images with variable-length sequences. It first extracts low-level features with a CNN (convolutional neural network), then generates a contextual representation of the local features with an intermediate LSTM (long short-term memory), and finally produces a group of vectors describing the image with an attention LSTM. The attention mechanism lets the number of vectors describing an image match the number of labels of that image. The model is end-to-end trainable, and we train it with a label-level triplet loss function. We apply the Hungarian algorithm to compute the similarity between two images. We also study how retrieval precision changes with the depth of the multilayer LSTM by varying the number of LSTM layers. Result We conducted experiments on two common datasets: MIRFLICKR-25K and NUS-WIDE. On MIRFLICKR-25K, the sequence description model improved retrieval accuracy by 10 to 12 percentage points over the DNN-lai method in the single-label image retrieval experiment. On NUS-WIDE, it improved retrieval accuracy by approximately 10 percentage points over the CCA-ITQ and DSRH methods in the multi-label image retrieval experiment.
We also provided comparative results of the performance of our method against the DNN-lai method. Because our extracted features are variable-length, we compute the similarity between two images with the Hungarian algorithm, which is time-consuming. Consequently, our method takes a long time to query an image in the dataset. Conclusion This study presented a model that uses a recurrent neural network with an attention LSTM to generate a descriptive sequence for an image. The proposed model is applicable to multi-label image retrieval.
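
As an illustration of the matching step mentioned in the abstract, the following Python sketch compares two variable-length vector sequences by solving an assignment problem with a textbook Hungarian (Kuhn-Munkres) routine. It is a minimal sketch under assumptions the abstract does not specify, not the authors' implementation: the function names `cosine`, `hungarian`, and `sequence_similarity` are hypothetical, cosine similarity is assumed as the pairwise measure, and the matched similarities are assumed to be averaged over the longer sequence so that unmatched vectors lower the score.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (assumed pairwise measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def hungarian(cost):
    """Minimum-cost assignment for an n x m cost matrix with n <= m.

    Returns a list `match` where match[i] is the column assigned to row i.
    Standard O(n^2 * m) potentials-based formulation (1-indexed internally).
    """
    n, m = len(cost), len(cost[0])
    inf = float("inf")
    u = [0.0] * (n + 1)      # row potentials
    v = [0.0] * (m + 1)      # column potentials
    p = [0] * (m + 1)        # p[j]: row currently assigned to column j (0 = none)
    way = [0] * (m + 1)      # back-pointers along the augmenting path
    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [inf] * (m + 1)
        used = [False] * (m + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], inf, 0
            for j in range(1, m + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(m + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:
                break
        # Flip assignments along the augmenting path back to the virtual column 0.
        while j0 != 0:
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    match = [-1] * n
    for j in range(1, m + 1):
        if p[j] != 0:
            match[p[j] - 1] = j - 1
    return match


def sequence_similarity(seq_a, seq_b):
    """Similarity of two variable-length vector sequences via optimal matching.

    Assumption: matched cosine similarities are summed and divided by the
    length of the longer sequence, so unmatched vectors reduce the score.
    """
    if not seq_a or not seq_b:
        return 0.0
    if len(seq_a) > len(seq_b):          # hungarian() expects rows <= columns
        seq_a, seq_b = seq_b, seq_a
    sim = [[cosine(a, b) for b in seq_b] for a in seq_a]
    cost = [[1.0 - s for s in row] for row in sim]
    match = hungarian(cost)
    return sum(sim[i][j] for i, j in enumerate(match)) / len(seq_b)


if __name__ == "__main__":
    query = [[1.0, 0.0], [0.0, 1.0]]                  # two description vectors
    candidate = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]  # three description vectors
    print(sequence_similarity(query, candidate))
```

Each comparison solves one assignment problem, whose cost grows roughly cubically with the sequence length, and a query repeats it for every image in the database. This is consistent with the long query time reported in the abstract.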