Multimedia technology 2016: advances and trends in image retrieval
2017, Vol. 22, No. 11, pp. 1467-1485
Published online: 2017-10-30
Published in print: 2017
DOI: 10.11834/jig.170503

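Content-based image retrieval uses features extracted from images to find images similar to a query image as accurately as possible with low time and space overhead. This paper reviews research progress and challenges in image retrieval, in China and abroad, from three perspectives: shallow features, deep features, and feature fusion, and discusses future development trends. The scale-invariant feature transform (SIFT) lacks spatial geometric information and color information and expresses high-level semantics poorly, whereas convolutional neural network (CNN) features often lack sufficient low-level information. To enrich the descriptor, SIFT is usually fused with CNN and other features. The main fusion methods are concatenation, kernel fusion, graph fusion, index-level fusion, and score-level fusion. Fusion effectively exploits the complementarity of different features and improves retrieval accuracy. Compared with SIFT, CNN features are weaker in both generalizability and geometric invariance, which remains a challenge for image retrieval.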
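Two of the fusion types listed above, concatenation and score-level fusion, can be sketched as follows. This is a minimal illustration and not the method of any particular paper covered by the review: the function names, the min-max score normalization, and the `weight` parameter are all assumptions made for the example.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit L2 norm (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def concatenate(sift_vec, cnn_vec):
    """Feature-level fusion: normalize each descriptor, then join them."""
    return l2_normalize(sift_vec) + l2_normalize(cnn_vec)


def min_max(scores):
    """Rescale a {image_id: score} dict to [0, 1] (assumed normalization)."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {k: 0.0 for k in scores}
    return {k: (v - lo) / (hi - lo) for k, v in scores.items()}


def score_level_fusion(sift_scores, cnn_scores, weight=0.5):
    """Result-level fusion: weighted sum of normalized similarity scores.

    `weight` is a hypothetical mixing parameter for the SIFT scores.
    Returns image ids ranked from most to least similar.
    """
    a, b = min_max(sift_scores), min_max(cnn_scores)
    ids = set(a) | set(b)
    fused = {i: weight * a.get(i, 0.0) + (1 - weight) * b.get(i, 0.0) for i in ids}
    return sorted(fused, key=lambda i: (-fused[i], i))


# Image "b" wins once the CNN scores are taken into account.
ranking = score_level_fusion({"a": 10, "b": 5, "c": 0},
                             {"a": 0.2, "b": 0.9, "c": 0.1})
print(ranking)  # ['b', 'a', 'c']
```

Concatenation acts on the descriptors before any search is run, whereas score-level fusion only needs the ranked results of each feature, so the two retrieval pipelines can stay independent.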
Content-based image retrieval uses features extracted from an image to retrieve similar images from a large-scale dataset accurately and with low memory and time consumption. Scale-invariant feature transform (SIFT) is robust to translation, scaling, rotation, viewpoint changes, and occlusion, and it can be extracted quickly. Thus, SIFT is widely used in both theory and practice. However, SIFT has some shortcomings, such as a lack of spatial geometric information and color information. Convolutional neural networks (CNNs) have good domain transferability, and deep features from a pre-trained CNN can be applied to various domains. CNN deep features have recently attracted considerable attention and exhibit superior performance over SIFT. However, unlike SIFT, CNN features lack shallow information. Thus, SIFT is usually fused with CNN features and other shallow features.

This report reviews recent advances and challenges in image retrieval, worldwide and in China, covering shallow features, deep features, and feature fusion. Future development trends are also explored. For shallow features, we mainly review SIFT and its variants, the encoding methods, and the development of these methods. For deep features, we divide the feature descriptors into categories according to the type of CNN layer used: fully connected layer, convolutional layer, and softmax layer. Many features can be extracted from a convolutional layer, and many pooling methods have been proposed.

The encoding methods for SIFT mainly include bag of features (BOF), vector of locally aggregated descriptors (VLAD), Fisher vector (FV), and triangulation embedding (TE), and they mostly consist of two steps: embedding and aggregation (or pooling). For CNN features, features from the fully connected layer are typically used because of their good transferability and accuracy. However, deep features from the convolutional layer have recently become increasingly attractive because they can be effectively combined with a variety of pooling methods, such as sum-pooling, max-pooling, VLAD-pooling, and FV-pooling, and they perform well in image classification and retrieval. The fusion methods can mainly be divided into five types: concatenation, kernel fusion, graph fusion, index-level fusion, and score-level fusion. Concatenation, kernel fusion, and index-level fusion work directly on different features, whereas graph fusion and score-level fusion work on the retrieval results of different features. Fusion exploits the complementarity of different features and can effectively improve image retrieval accuracy.

SIFT and CNN features are complementary: SIFT contains rich low-level information, and CNN features contain rich high-level semantic information; SIFT has good invariance properties, which CNN features lack. Fusion is an effective way to maximize image information. However, time and space consumption will inevitably increase, and a good algorithm for distinguishing good features from bad ones is yet to be studied. At present, the generalizability and geometric invariance of CNN features are inferior to those of SIFT; this issue continues to be a challenge for image retrieval researchers. The generalizability of CNN features is limited by the domain and statistical differences between the source task (usually ImageNet) and the target task. Fine-tuning is a good strategy to solve this problem; however, this approach needs an additional labeled dataset similar to the target task. Enhancing the geometric invariance of CNN features inevitably increases descriptor space consumption and extraction time, and usually only scale invariance is considered for simplicity, ignoring other aspects of invariance. Moreover, the number of CNN features from one image is usually much smaller than that of SIFT; thus, insufficient information is captured for encoding. The most commonly used CNNs are designed for image classification, not for image retrieval. However, image retrieval is a more fine-grained task; a relevant algorithm needs to find similar images, not just images from the same class. Thus, a CNN trained for image retrieval may be a good future research direction. More work is still needed to strike a better balance among generalizability, invariance, memory consumption, and extraction time for an effective and efficient image retrieval descriptor.
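To make the pooling of convolutional features concrete, the sketch below turns a stack of feature maps into a global descriptor by sum-pooling or max-pooling and compares two descriptors by cosine similarity. It assumes the feature maps are already available as nested lists of shape channels x height x width; the function names are hypothetical, and real systems would take the maps from a CNN and typically add further steps (such as weighting or whitening) that are not shown here.

```python
import math


def sum_pool(feature_maps):
    """Sum-pooling: one value per channel, the sum over all spatial positions."""
    return [sum(sum(row) for row in channel) for channel in feature_maps]


def max_pool(feature_maps):
    """Max-pooling: one value per channel, the maximum over all spatial positions."""
    return [max(max(row) for row in channel) for channel in feature_maps]


def cosine_similarity(a, b):
    """Cosine similarity between two descriptors (0.0 if either is all zeros)."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


# Toy example: 2 channels, each a 2x2 activation map (made-up values).
maps = [[[1, 2], [3, 4]],
        [[0, 0], [0, 5]]]
print(sum_pool(maps))  # [10, 5]
print(max_pool(maps))  # [4, 5]
```

Either pooling yields a descriptor whose length equals the number of channels, regardless of the spatial size of the input, which is one reason convolutional features combine easily with different pooling schemes.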