Multimedia Engineering 2016: Research Progress and Development Trends in Image Retrieval

Yu Junqing; Wu Zebin; Wu Fei; Sun Lifeng

doi:10.11834/jig.170503

Review


Multimedia technology 2016: advances and trends in image retrieval
2017, Vol. 22, No. 11, pp. 1467-1485
Published online: 2017-10-30

Published in print: 2017
DOI: 10.11834/jig.170503

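Yu Junqing, Wu Zebin, Wu Fei, Sun Lifeng. Multimedia engineering 2016: research progress and development trends in image retrieval[J]. Journal of Image and Graphics (中国图象图形学报), 2017, 22(11): 1467-1485. DOI: 10.11834/jig.170503.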

Yu Junqing, Wu Zebin, Wu Fei, Sun Lifeng. Multimedia technology 2016: advances and trends in image retrieval[J]. Journal of Image and Graphics, 2017, 22(11): 1467-1485. DOI: 10.11834/jig.170503.

Abstract (Chinese version)

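Content-based image retrieval uses features extracted from images to find the images most similar to a query image as accurately as possible and at low time and space cost. This paper reviews research progress in China and abroad, together with the open challenges, from three perspectives: shallow features, deep features, and feature fusion. It also discusses future development trends. The scale-invariant feature transform (SIFT) lacks spatial geometric information and color information and expresses high-level semantics poorly, whereas convolutional neural network (CNN) features often lack sufficient low-level information. To enrich the information carried by a descriptor, SIFT is commonly fused with CNN and other features. The main fusion approaches are concatenation, kernel fusion, graph fusion, index-level fusion, and score-level fusion. Fusion exploits the complementarity of different features and improves retrieval accuracy. Compared with SIFT, CNN features are weaker in both generality and geometric invariance, which remains a challenge for the image retrieval field.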
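Two of the fusion approaches listed above can be illustrated with a minimal sketch. It assumes that a SIFT-based and a CNN-based descriptor (or similarity score) are already available for each image. The function names, the min-max rescaling, and the equal default weight are illustrative assumptions, not the specific methods surveyed in the paper.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit L2 norm (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def concatenate_fusion(sift_vec, cnn_vec):
    """Feature-level fusion: normalize each descriptor, join them, renormalize."""
    return l2_normalize(l2_normalize(sift_vec) + l2_normalize(cnn_vec))

def min_max(scores):
    """Rescale a dict of image_id -> similarity score into [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    span = hi - lo
    return {k: (v - lo) / span if span > 0 else 0.0 for k, v in scores.items()}

def score_level_fusion(sift_scores, cnn_scores, weight=0.5):
    """Result-level fusion: weighted sum of the rescaled per-feature scores.

    Returns image ids ordered from most to least similar. The weight is a
    hypothetical tuning parameter, not a value taken from the paper.
    """
    s, c = min_max(sift_scores), min_max(cnn_scores)
    ids = set(s) | set(c)
    fused = {i: weight * s.get(i, 0.0) + (1 - weight) * c.get(i, 0.0) for i in ids}
    return sorted(fused, key=lambda i: fused[i], reverse=True)
```

Concatenation acts on the descriptors themselves, while score-level fusion acts on the ranked results of each feature. Normalizing before combining keeps one feature from dominating only because of its numeric range.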

Abstract

Content-based image retrieval uses features extracted from an image to retrieve similar images from a large-scale dataset accurately and with low memory and time consumption. Scale-invariant feature transform (SIFT) is robust to translation, scaling, rotation, viewpoint change, and occlusion, and it can be extracted quickly. Thus, SIFT is widely used both in theory and in practice. However, SIFT has some shortcomings, such as a lack of spatial geometric information and color information. Convolutional neural networks (CNNs) have good domain transferability, and deep features from a pre-trained CNN can be applied to various domains. CNN deep features have recently attracted considerable attention and exhibit superior performance over SIFT. However, in contrast to SIFT, CNN features lack shallow information. Thus, SIFT is usually fused with CNN features and other shallow features. This report reviews the recent advances and challenges in image retrieval in China and worldwide, covering shallow features, deep features, and feature fusion. Future development trends are also explored. For shallow features, we mainly review SIFT and its variants, the encoding methods, and the development of these methods. For deep features, we divide the descriptors into categories according to the type of CNN layer used: fully connected layer, convolutional layer, and softmax layer. Many features can be extracted from a convolutional layer, and many pooling methods have been proposed. The encoding methods for SIFT mainly include bag of features (BOF), vector of locally aggregated descriptors (VLAD), Fisher vector (FV), and triangulation embedding (TE), and they mostly consist of two steps: embedding and aggregation (or pooling). For CNN features, features from the fully connected layer are typically used because of their good transferability and accuracy. However, deep features from the convolutional layer have recently become an increasingly attractive option because they can be effectively combined with a variety of pooling methods, such as sum-pooling, max-pooling, VLAD-pooling, and FV-pooling, and they perform well in image classification and retrieval. The fusion methods can mainly be divided into five types: concatenation, kernel fusion, graph fusion, index-level fusion, and score-level fusion. Concatenation, kernel fusion, and index-level fusion work directly on the different features, whereas graph fusion and score-level fusion work on the retrieval results of the different features. Fusion exploits the complementarity of different features and can effectively improve image retrieval accuracy. SIFT and CNN features are complementary to each other: SIFT contains rich low-level information, and CNN features contain rich high-level semantic information; SIFT has good invariance properties, which CNN features lack. Fusion is an effective way to maximize the image information captured. However, time and space consumption will inevitably increase, and a good algorithm for distinguishing good features from bad ones is yet to be studied. At present, the generalizability and geometric invariance of CNN features are inferior to those of SIFT; this issue continues to be a challenge for image retrieval researchers. The generalizability of CNN features is limited by the domain and statistical differences between the source task (usually ImageNet) and the target task. Fine-tuning is a good strategy for solving this problem; however, this approach needs an additional labeled dataset similar to the target task. Enhancing the geometric invariance of CNN features inevitably increases descriptor space consumption and extraction time, and usually only scale invariance is considered for simplicity, ignoring other aspects of invariance. Moreover, the number of CNN features from one image is usually much smaller than that of SIFT features; thus, the information captured for encoding is insufficient. The most commonly used CNNs are designed for image classification tasks and not for image retrieval. However, image retrieval is a more fine-grained task; a relevant algorithm needs to find similar images, not just images from the same class. Thus, a CNN trained for image retrieval may be a good future research direction. More work is still needed to strike a better balance among generalizability, invariance, memory consumption, and extraction time for an effective and efficient image retrieval descriptor.
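The sum-pooling and max-pooling of convolutional features mentioned above can be sketched as follows. This is a toy illustration under stated assumptions: the feature maps are plain nested lists indexed as channel, row, column, as if already extracted from some convolutional layer. All function names and the cosine-similarity ranking step are hypothetical choices for the example, not the paper's pipeline.

```python
import math

def sum_pool(feature_maps):
    """Sum each channel's activations over all spatial positions."""
    return [sum(sum(row) for row in channel) for channel in feature_maps]

def max_pool(feature_maps):
    """Keep each channel's largest activation over all spatial positions."""
    return [max(max(row) for row in channel) for channel in feature_maps]

def l2_normalize(vec):
    """Scale a vector to unit L2 norm (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def rank_by_cosine(query_desc, database):
    """Order database image ids by cosine similarity to the query descriptor."""
    q = l2_normalize(query_desc)
    sims = {
        image_id: sum(x * y for x, y in zip(q, l2_normalize(desc)))
        for image_id, desc in database.items()
    }
    return sorted(sims, key=lambda i: sims[i], reverse=True)

# Toy input: 2 channels, each a 2x2 spatial grid (made-up numbers).
toy_maps = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[0.0, 5.0], [0.0, 0.0]],
]
sum_descriptor = sum_pool(toy_maps)  # one value per channel
max_descriptor = max_pool(toy_maps)  # one value per channel
```

Either pooled vector has one entry per channel regardless of the spatial size of the feature map, which is what makes such descriptors compact enough for large-scale retrieval.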