Image retrieval is an important task in computer vision, and describing image content is the key to it: accurate and complete descriptions of image content can significantly improve retrieval precision. Traditional methods describe image content with a single fixed-length vector. A simple image contains only one object, whereas a complex image can contain several. Describing a complex image in the same way as a simple one, with a fixed-length vector, is generally insufficient. This study proposes a sequence description model, based on a recurrent neural network and a visual attention mechanism, that describes images with varying-length sequences. The model first extracts low-level features with a CNN (convolutional neural network),
then generates a contextual representation of local features with an intermediate LSTM (long short-term memory), and finally produces a group of vectors that describes the image with an attention LSTM. The attention mechanism allows the number of vectors describing an image to equal the number of labels of that image. The model is end-to-end trainable,
and we train it with a label-level triplet loss function. We apply the Hungarian algorithm to compute the similarity between two images. We also study how retrieval precision varies with the depth of the multilayer LSTMs by changing the number of LSTM layers. We ran experiments on two common datasets: MIRFLICKR-25K and NUS-WIDE. In the single-label image retrieval experiment on the MIRFLICKR-25K dataset, our method improved the accuracy rate by 10 to 12 percent over the DNN-lai method. In the multi-label image retrieval experiment on the NUS-WIDE dataset, it improved by approximately 10 percent over the CCA-ITQ and DSRH methods. We also provide comparative results of our method against the DNN-lai method. We applied the Hungarian algorithm to compute the similarity between two images,
which was time-consuming because our extracted features are varying-length. As a result, querying an image in the dataset with our method took a long time. This study presented a model that uses a recurrent neural network with an attention LSTM to generate descriptive sequences of an image. The proposed model is applicable to multi-label image retrieval.
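
The matching step described above, comparing two images that are each represented by a varying-length group of vectors, can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes Euclidean distance as the pairwise cost, a score equal to the mean distance over matched pairs, and that surplus vectors in the larger group are simply ignored; none of these choices is stated in the text. The names `hungarian` and `image_distance` are hypothetical. The assignment routine is a standard potentials-based variant of the Hungarian algorithm.

```python
import math


def hungarian(cost):
    """Minimum-cost assignment for an n x m cost matrix with n <= m.

    Returns a list `match` where match[i] is the column assigned to row i.
    """
    n, m = len(cost), len(cost[0])
    INF = float("inf")
    u = [0.0] * (n + 1)   # row potentials (1-based)
    v = [0.0] * (m + 1)   # column potentials (1-based)
    p = [0] * (m + 1)     # p[j]: row currently matched to column j, 0 if free
    way = [0] * (m + 1)   # back-pointers along the augmenting path
    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [INF] * (m + 1)
        used = [False] * (m + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], INF, 0
            for j in range(1, m + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            # Shift potentials so at least one new column becomes reachable.
            for j in range(m + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:  # reached a free column: augmenting path found
                break
        while j0:           # flip the matching along the path
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    match = [0] * n
    for j in range(1, m + 1):
        if p[j]:
            match[p[j] - 1] = j - 1
    return match


def image_distance(group_a, group_b):
    """Mean Euclidean distance over the optimal one-to-one vector matching.

    Assumes both groups are non-empty; unmatched vectors are ignored.
    """
    if len(group_a) > len(group_b):
        group_a, group_b = group_b, group_a  # rows must be the smaller group
    cost = [[math.dist(a, b) for b in group_b] for a in group_a]
    match = hungarian(cost)
    return sum(cost[i][j] for i, j in enumerate(match)) / len(match)


query = [(0.0, 0.0), (1.0, 1.0)]                   # two descriptive vectors
candidate = [(1.0, 1.0), (5.0, 5.0), (0.0, 0.0)]   # three descriptive vectors
print(image_distance(query, candidate))            # 0.0: both query vectors have exact matches
```

Each comparison solves a full assignment problem, which is why querying against every image in a dataset is slow compared with a single distance between two fixed-length vectors.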