发布时间: 2021-09-16
摘要点击次数:
全文下载次数:
DOI: 10.11834/jig.200371
2021 | Volume 26 | Number 10

NCIG 2020

融合词袋连通图的图像检索特征选择

李国祥^1,2, 王继军^2,3, 马文斌^1,2

1. 广西财经学院教务处, 南宁 530003;

2. 广西师范大学广西多源信息挖掘与安全重点实验室, 桂林 541004;

3. 广西财经学院信息与统计学院, 南宁 530003

收稿日期: 2020-07-16; 修回日期: 2020-10-28; 预印本日期: 2020-11-04

基金项目: 国家自然科学基金项目（71862003）；广西重点研发计划项目（2018AB15003）；广西多源信息挖掘与安全重点实验室开放基金项目（MIMS17-02）；广西高校中青年教师基础能力提升资助项目（2021KY0650，2019KY0661）；广西跨境电商智能信息处理重点实验室培育基地（广西财经学院）专项资助项目

作者简介: 李国祥, 1984年生, 男, 副教授, 主要研究方向为模式识别、人工智能。E-mail: masterlgx@163.com
王继军, 通信作者, 男, 副教授, 主要研究方向为图像图形学、信息安全。E-mail: wangjijun1209@126.com
马文斌, 男, 助理研究员, 主要研究方向为机器学习、模式识别。E-mail: mwbwy2014@163.com
*通信作者: 王继军 wangjijun1209@126.com

中图法分类号: TP391

文献标识码: A

文章编号: 1006-8961(2021)10-2533-12

摘要

目的随着图像检索所依赖的特征愈发精细化，在提高检索精度的同时，也不可避免地产生众多非相关和冗余的特征。针对在大规模图像检索和分类中高维度特征所带来的时间和空间挑战，从减少特征数量这一简单思路出发，提出了一种有效的连通图特征点选择方法，探寻图像检索精度和特征选择间的平衡。方法基于词袋模型（bag of words，BOW）的图像检索机制，结合最近邻单词交叉核、特征距离和特征尺度等属性，构建包含若干个连通分支和平凡图的像素级特征分离图，利用子图特征点的逆文本频率修正边权值，从各连通分量的节点数量和孤立点最近邻单词相关性两个方面开展特征选择，将问题转化为在保证图像匹配精度情况下，最小化特征分离图的阶。结果实验采用Oxford和Paris公开数据集，在特征存储容量、时间复杂度集和检索精度等方面进行评估，并对不同特征抽取和选择方法进行了对比。实验结果表明选择后的特征数量和存储容量有效约简50%以上；100 k词典的KD-Tree查询时间减少近58%；相对于其他编码方法和全连接层特征，Oxford数据集检索精度平均提升近7.5%；Paris数据集中检索精度平均高于其他编码方法4%，但检索效果不如全连接层特征。大量实验表明了大连通域的冗余性和孤立点的可选择性。结论通过构建特征分离图，摒弃大连通域的冗余特征点，保留具有最近邻单词相关性的孤立特征点，最终形成图像的精简特征点集。整体检索效果稳定，其检索精度基本与原始特征点集持平，且部分类别效果优于原始特征和其他方法。同时，选择后特征的重用性好，方便进一步聚合集成。

关键词

词袋模型(BOW); 特征选择; 图像检索; 连通分量; 聚合特征

Feature selection method for image retrieval based on connected graphs and bag of words

Li Guoxiang^1,2, Wang Jijun^2,3, Ma Wenbin^1,2

1. Department of Academic Affairs, Guangxi University of Finance and Economics, Nanning 530003, China;

2. Guangxi Key Laboratory of Multi-Source Information Mining & Security, Guangxi Normal University, Guilin 541004, China;

3. Department of Information and Statistics, Guangxi University of Finance and Economics, Nanning 530003, China

Supported by: National Natural Science Foundation of China(71862003); Guangxi Key Research and Development Program(2018AB15003)

Abstract

Objective Features have to be more refined to improve the accuracy of image retrieval. As a result, a large amount of irrelevant and redundant features is also produced inevitably, which leads to a high requirement of memory and computation, especially for large-scale image retrieval.Thus, feature selection plays a critical role in image retrieval.Based on the principle of feature number reduction, we propose a novel, effectively connected component feature selection method and explore a tradeoff between image retrieval accuracy and feature selection in this paper. Method First, we construct a pixel-level feature separate graph that contains several connected branches and trivial graphs based on the bag of words(BOW) principle by combining different characteristics such as nearest word cross kernel, feature distance, and feature scale. Then, we calculate the cross kernel among the first D nearest neighbor words of each feature point. If the crossing set is empty and the distance and scale between feature points satisfy the established conditions, we assume that these two feature points belong to the same group. Then, we select features according to the node number of each connected component and the correlation of nearest words of isolated points. In this process, we use inverse document frequency as the weight of the first D nearest neighbor words to measure the contribution. Finally, we transform the problem to minimize the network order of the feature separated graph in guaranteeing the accuracy of image matching and select feature points from isolated point and connected branches. If the maximum cross kernel of the isolated point with other points is greater than the threshold $n$, then we retain it as a valid feature point. If the connected component of the graph is less than the preset threshold $γ$, then we retain these points in the connected branch as valid feature points Result We adopt the public Oxford datasets and Paris datasets, and evaluate the proposed method on the aspects of feature restore requirement, time complex set, and retrieval accuracy using the Kronecker product as the matching kernel. We also compare the proposed method with different feature extraction and selection methods, such as Vlad, spetral regression-locality preserving projection(SR-LPP), and deep learning features. Experimental results demonstrate that the number of features and storage are reduced to more than 50% in guaranteeing the original retrieve accuracy. Compared with other methods, the retrieval time of KD-Tree for the 100 k dictionary is reduced by nearly 58%. The retrieval method is stable, and the selected features have excellent reusability for further clustering and assembling. When using Oxford as a test set, theretrieval accuracies of selected features are similar to the original features in each type of building, and the retrieval accuracy is better for several categories such as Allsouls, Ashmolean, and Keble. Compared with other coding methods and features from fully connected layers, the retrieval accuracy is improved by nearly 7.5% on average.When tested on the Paris dataset, our algorithm improves retrieval accuracy by approximately 2%5% overall. Conclusion Extensive experiments demonstrate the redundancy of largely connected areas and the selectivity of isolated points. We can retain the isolated feature points with the nearest word correlation and finally form a refined feature point set of image by constructing the feature separated graph and discarding the redundant feature points of the large connected area. The retrieval accuracy is comparable with the original feature point set and outperforms the original feature and other encoding methods in several categories. Moreover, the obtained feature points maintain original independence, reduce dimension reduction, and can further achieve clustering. It is convenient to be transplanted and coded in different dictionaries.We attempt to integrate it with principal component analysis(PCA), which can reduce the dimension again only with the time of feature projection, but has an outstanding retrieval effect.

Key words

bag of words(BOW); features selection; image retrieval; connected component; aggregated descriptors

0 引言

图像特征提取是机器视觉研究的重要组成部分，广泛应用于图像的检索分类、识别和目标跟踪等领域。随着机器学习研究的不断开展，特征的提取越来越精细化，众多特征提取算法用于图像的分类与检索，诸如SIFT(scale-invariant feature transform)，VLAD(vector of locally aggregated descriptors)，FV(Fisher vector)以及众多在原有特征描述子的基础上开展的二次优化(Zheng等，2018)。在图像检索领域，特征所具有的表征能力是决定检索精度的关键，通过一些算法在图像标准库上检索精度的明显提升可以看出，特征越细腻，其检索分类能力越强，但细腻的特征由此又引出了在大规模的图像检索中的计算时间和空间成本问题。对于较高分辨率的图像。如果每幅图像包含2 k个特征点，那么对于百万当量的数据集，需要2×10⁹个特征索引，其所需要的内存空间约1 TB，所带来的聚类、编码和量化的时间成本巨大，因此从众多特征点中选择能够与其他图像进行有效区分的特征便显得尤为重要。

目前特征选择或约简主要从原始特征编码聚合和特征二次选择集成两个方面开展。二者在特征处理的时间顺序上完全不同，一个是在原始特征生成的过程进行聚合；另一个是在原始特征生成之后开展二次特征选择。代表性的如弱几何一致性(Jégou等，2008)方法，根据特征点主方向和尺度在图像发生变换后直方图上的一致变化，通过弱几何一致性来实现特征匹配点的校验，以主方向旋转角度为例，多数的匹配特征点应该集中在图像的实际旋转角度范围内，而偏离匹配点角度波峰较远的特征点实际为误匹配，可以将其剔除，从而实现特征点选择。Jégou和Chum(2012)利用主成分分析(principal component analysis, PCA)和白化技术分别在同一尺寸的复合词典、不同尺寸的复合词典和主流低维特征向量等方面开展了对比研究，利用余弦相似性测量函数与倒排文档集成，规避了稀疏矩阵与权重均值向量直接相减所带来的检索负面影响，并在实验中表明了简单的SSR(signed square rooting)、PCA和白化的集成，可以有效降低特征向量维度，且保证检索精度。Turcot和Lowe(2009)提出了无效的特征点仅存在于单幅图像中，有效特征存在于包含该目标特征的不同图像中的思想，利用词袋模型对数据集中的每一幅图像进行检索，对前$M$个近邻图像进行几何一致性检验，如果检验的特征点数量大于阈值，则认为两幅图像是联通的，特征点作为有效特征保存，并在图像的连通域内修正TFIDF(term frequency-inverse document frequency)权重。该方法将原特征点数量约简至12 % 左右。Tolias等人(2013)提出的匹配核函数选择方法，对所有属于同一个单词的特征向量求均值，以此生成汉明二进制编码，以及在汉明编码方面进行的一些查询扩展(Tolias等，2013)。

在原始特征编码聚合方面，VLAD(Jégou等，2010)将特征向量表示为特征点与聚类中心差值之和，增强特征表达能力。Fisher vector(Sánchez等，2013)利用Fisher kernel形成图像紧凑特征的表示。以及各类以核函数为基础建立的描述子和混合核函数描述子(Mukundan等，2017；Bursuc等，2015)，该类特征利用笛卡尔坐标和极坐标、梯度模和梯度方向、梯度方向与极坐标角度差等物理位置信息，通过傅里叶级数逼近上述特征，将其映射为特征向量，最后使用克罗内克积形成新的特征向量。为了有效解决核函数逼近问题，Vedaldi和Zisserman(2010)提出了一种有效的核函数映射方法，Chum(2015)建立低维度特征映射，解决了拟合精度和维度选择的平衡问题，通过特征维度及拟合误差之和的线性规划，选择平衡常数$γ$和kernel signature的系数$ω$，实现了低维度特征映射，使上述特征得到了广泛应用。另外，还有在图像分类中表现优异的卷积神经网络，其迁移学习而来的全连接层和卷积层特征广泛地应用于图像检索。此类方法是利用在大规模数据集中预先训练好的卷积神经网络，提取某层的输出作为图像的特征表示(Babenko等，2014；Razavian等，2014)。Yan等人(2016)分别从场景、目标和特征点3个层面进行特征提取，有效地将SIFT和CNN(convolutional neural network)特征融合在一起。

结合上述特征选择研究，以剔除冗余原始特征点这一简单思路为出发点，构建基于词袋模型的特征点连通图，选择有效特征，提高检索效率。

1 词袋模型

词袋模型(bag of words，BOW)的核心思想是通过K均值聚类或者高斯混合模型聚类在训练数据集中训练生成词典(vocabulary)，对一幅图像中每一个特征点测量其到所有词典单词(words)的距离，并归到最近邻单词所属类，通过计量每一个单词所归属特征点的数量形成特征直方图，即表征了整个图像的特征信息。其模型流程如图 1所示。

图 1 BOW模型框架

Fig. 1 Framework of BOW model

在整个BOW的建模过程中，一系列的量化编码方法用于特征直方图的优化表达，如经典的空间金字塔结构模型(Lazebnik等，2006)，通过对在特征空间中不同物理位置的特征点分块加权，形成不同权重的特征表达，以及众多金字塔的改进模型(Avrithis和Tolias，2014；Karakasis等，2015)。还有诸如FV(Perronnin等，2010)、VLAD使用多阶绝对差构建编码空间，描述局部特征分布与视觉单词之间的多阶差异，实现对原始特征的压缩聚合优化。Rootsift(Arandjelović和Zisserman，2012)利用平方根函数完成SIFT空间到RootSift的映射。最后通过距离函数、核函数映射、支持向量机(support vector machines，SVM)等一系列相似性测量来计算直方图的差异，完成图像的匹配、分类和检索。

2 BOW的图像检索机制

设图像提取的$D$维局部特征${\mathit{\boldsymbol{X}}}=[{\mathit{\boldsymbol{x}}}_{1}, {\mathit{\boldsymbol{x}}}_{2}, …{\mathit{\boldsymbol{x}}}_M, ]∈ {\bf{R}} ^{D×M}$，BOW使用特征向量量化的方式来形成词典。其量化方式一般可以表示为

$ \begin{gathered} q: {\bf{R}}^{D \times M} \rightarrow[1, k]\\ \boldsymbol{x} \rightarrow q(\boldsymbol{x}) \end{gathered} $

(1)

量化函数$q$完成特征向量${\mathit{\boldsymbol{x}}}$到一个索引的映射，在BOW模型一般采用K-means聚类，其聚类中心作为映射的量化索引，即称之为words，从而形成词典${\mathit{\boldsymbol{B}}}=[b_{1}, b_{2}, …, b_N]∈ {\bf{R}}^{D×N}$。那么, 相似的两个特征向量，必然在words上面具有相似的概率分布，令特征向量子集$\boldsymbol{X}_{b}=\{\boldsymbol{x} \in \boldsymbol{X}: q(\boldsymbol{x})=\boldsymbol{b}\}$表示归属于单词${\mathit{\boldsymbol{b}}}$的子集，对于两幅图像的BOW模型，二者间的相似性测量函数可以表示为

$ \begin{gathered} \boldsymbol{K}(\boldsymbol{X}, \boldsymbol{Y})= \\ \gamma(\boldsymbol{x}) \gamma(\boldsymbol{y}) \sum\limits_{\boldsymbol{b} \in \boldsymbol{B}}(T F \sim I D F(\boldsymbol{b})) M\left(\boldsymbol{X}_{b}, \boldsymbol{Y}_{b}\right) \end{gathered} $

(2)

式中，$TF \sim IDF$为单词${\mathit{\boldsymbol{c}}}$的词频(term frequency)和逆文本频率(inverse document frequency)。归一化因子$\gamma(\boldsymbol{x})=\left(\sum\limits_{\boldsymbol{b} \in B}(T F \sim \operatorname{IDF}(\boldsymbol{b})) M\left(\boldsymbol{X}_{b}, \boldsymbol{X}_{b}\right)^{-\frac{1}{2}}\right.$，$K$(${\mathit{\boldsymbol{X}}}$, ${\mathit{\boldsymbol{Y}}}$)则为两幅图像间的相似性差异，分值的大小决定了其在检索图像的排序(rank)。

3 特征连通图的构建

设当前有图像集$\boldsymbol{I}=\left\{\left[\boldsymbol{I}_{1}, \boldsymbol{I}_{2}, \cdots, \boldsymbol{I}_{i}\right]\right\}$，图像${\mathit{\boldsymbol{I}}}_i$中包含$N$个特征点$\boldsymbol{V}=\left\{\left(p_{i}, s_{i}, \boldsymbol{x}_{i}\right)\right\}_{i}^{N}$, 式中$p_{i}$是特征的物理位置，$s_{i}$是特征的尺度函数，${\mathit{\boldsymbol{x}}}_{i}$是特征向量。根据上述词袋模型理论，每一个特征点分别归属于词典中的单词，根据最近邻和次近邻特征的理论(Lowe，2004)，本文仿照repetitive structures(Torii等，2015)从特征归属$D$个最近邻单词的角度开展特征点连通分组。

设有两个特征点${\mathit{\boldsymbol{V}}}_{a}和{\mathit{\boldsymbol{V}}}_{b}$，选择每个图像特征点的前$D$个最近邻单词，令${\mathit{\boldsymbol{H}}}_{a}={[{\mathit{\boldsymbol{b}}}_{a1}, {\mathit{\boldsymbol{b}}}_{a2}, …, {\mathit{\boldsymbol{b}}}_{aD}]}$，${\mathit{\boldsymbol{H}}}_b={[{\mathit{\boldsymbol{b}}}_{b1}, {\mathit{\boldsymbol{b}}}_{b2}, …, {\mathit{\boldsymbol{b}}}_{bD}]}, {\mathit{\boldsymbol{b}}}_{ai}, {\mathit{\boldsymbol{b}}}_{bi}≤NumWords, NumWords$为既定词典单词数量。二者交叉核为

$ \boldsymbol{I}\left(\boldsymbol{H}_{a}, \boldsymbol{H}_{b}\right)=\sum\limits_{i}^{D} \min \left(\boldsymbol{H}_{a i}, \boldsymbol{H}_{b i}\right) $

(3)

如果${\mathit{\boldsymbol{I}}}({\mathit{\boldsymbol{H}}}_{a}, {\mathit{\boldsymbol{H}}}_b)≠{\mathit{\boldsymbol{∅}}}$，且$ {\mathit{\boldsymbol{x}}}_a-{\mathit{\boldsymbol{x}}}_b<c(s_a+s_b)$，其中$c$是常数项，且$0.5<s_a/s_b<1.5$，则认为${\mathit{\boldsymbol{V}}}_a和{\mathit{\boldsymbol{V}}}_b$在同一组内。以此类推，对图像内所有特征点遍历之后, 得到一个包含$n$个连通分支或平凡图的分离图，即

$ \begin{gathered} \boldsymbol{G}(\boldsymbol{V}, \boldsymbol{E})=\boldsymbol{G}_{1}\left(\boldsymbol{V}_{1}, \boldsymbol{E}_{1}\right) \cup \boldsymbol{G}_{2}\left(\boldsymbol{V}_{2}, \boldsymbol{E}_{2}\right) \cup \cdots \cup \\ \boldsymbol{G}_{n}\left(\boldsymbol{V}_{n}, \boldsymbol{E}_{n}\right) \end{gathered} $

(4)

式中，${\mathit{\boldsymbol{V}}}$表示该图内的特征点索引，$|{\mathit{\boldsymbol{V}}}| $=$N$；${\mathit{\boldsymbol{E}}}$表示特征点边集，每一个连通分支${\mathit{\boldsymbol{G}}}_{i}({\mathit{\boldsymbol{V}}}, {\mathit{\boldsymbol{E}}})$边权值相同，令其$ω({\mathit{\boldsymbol{e}}})=ω_n，{\mathit{\boldsymbol{e}}}∈{\mathit{\boldsymbol{E}}}$。

根据IDF(inverse document frequency)思想，少量图像所包含的单词更具有区分性。通过IDF对每个单词设定权重，从而减少具有普遍性的单词对分类的影响，增加重要的、区分性更强的单词的影响，同时图像特征点的前$D$个最近邻单词中，不同单词的归属贡献度不同，贡献度越大其具有代表图像更重要特征的能力，在图像的匹配上更具有优先权。根据IDF完成特征点连通图权值的$ω({\mathit{\boldsymbol{e}}})$设定，即

$ \omega_{n}=\frac{1}{D} \sum\limits_{i}^{D} \log \frac{N}{c\left(\boldsymbol{b}_{i}\right)+1} $

(5)

式中，$c({\mathit{\boldsymbol{b}}}_{i})$为单词${\mathit{\boldsymbol{b}}}_{i}$在图像中出现的次数，最后，不同图像间的相似性模型修订为

$ \boldsymbol{K}(\boldsymbol{X}, \boldsymbol{Y})=\gamma(\boldsymbol{x}) \gamma(\boldsymbol{y}) \sum\limits_{\boldsymbol{b} \in \boldsymbol{B}} \omega_{n} M\left(\boldsymbol{X}_{b}, \boldsymbol{Y}_{b}\right) $

(6)

采用特征向量的克罗内克积作为匹配核，即$M({\mathit{\boldsymbol{X}}}_b, {\mathit{\boldsymbol{Y}}}_b)=\sum\limits_{{\mathit{\boldsymbol{x}}} \in {\mathit{\boldsymbol{X}}}} \sum\limits_{{\mathit{\boldsymbol{y}}} \in {\mathit{\boldsymbol{Y}}}} ϕ({\mathit{\boldsymbol{x}}})^{\rm{T}}ϕ({\mathit{\boldsymbol{y}}})$，其中$ϕ({\mathit{\boldsymbol{x}}})，ϕ({\mathit{\boldsymbol{y}}})$为特征编码方法。

4 连通域内的特征选择

根据大规模图像内容，无论图像是否受到仿射变换、噪声等方面的影响，大连通域通常包含在图像的重复结构和冗余特征上，其所包含的特征点不具有典型的特征代表意义。正是如此，Torii等人(2015)在复杂的地域场景信息中建立了重复结构特征，完成对重复场景的特征有效分类。而小连通域包含了大量的孤立点，记为${\mathit{\boldsymbol{V}}}_{\rm{singleton}}$，这些孤立点的度为0，即$\boldsymbol{V}_{\text {singleton }}=\{v \mid {deg}(v)=0, v \in \boldsymbol{V}(\boldsymbol{G})\}$。

图 2第1行图像用不同颜色展现了大连通图的特征点，图像中的天空、地面和墙面等物体通常为极大连通子图所在，所包含的点集对于后期的图像分类和检索并不具有典型的可区分性。第2行图像标注了图像中的孤立点${\mathit{\boldsymbol{V}}}_{\rm{singleton}}$，这些点包含了两个层面的信息：1)是一些具有典型可区分性的角点等，这符合IDF中少量图像中所包含的单词更具有代表性的思想；2)属于既没有与其他点集相关，单词中也无共性，无法在其他图像中匹配的特征点。

图 2 Oxford Building部分图片连通域和singleton特征点

Fig. 2 Connected regions and singleton feature points in Oxford Building

实验中，Keble和All_Souls的部分匹配实验可以直观看出连通分量对于特征点匹配的影响，如图 3所示。Keble中连通分量在23以下的子图包含了大多数的正确匹配点，而数量仅为原始特征的50 %，All_Souls中连通分量20以下的子图则包含了81个正确匹配特征点，接近其最大正确匹配值(85)，而随着更多连通分支的加入，正确匹配的个数反而呈下降趋势。孤立点同样存在占比大而匹配度低的问题，由此可见特征点的数量并不是越多越好，冗余特征的加入反而会增加误匹配的数量。

图 3 Keble和All_Souls的部分匹配实验

Fig. 3 Matching experiments of Keble and All_Souls

((a) Keble feature matching sample; (b) the relationship between matching numbers and connected components of Keble; (c) All_Souls feature matching sample; (d) the relationship between matching numbers and connected components of All-Souls)

因此在大连通图和孤立点中进行有效的特征选择十分必要，在精度不变的情况下，数量的减少势必降低计算复杂度和存储空间。从孤立点单词间的联系和连通分支的连通分量两个角度出发，进行特征选择。

1) 孤立点的选择为

$ \begin{gathered} \boldsymbol{V}_{\text {sels }}(\boldsymbol{G})=v_{\text {sels }} \mid \max \left(I\left(\boldsymbol{H}_{v_{\text {sels }}}, \boldsymbol{H}_{v_{\text {com }}}\right)\right)>n, \\ v_{\text {sels }}, v_{\text {com }} \in \boldsymbol{V}_{\text {singleton }} \end{gathered} $

(7)

式中，$v_{\rm{sels}}$是孤立点集合中的候选点，$v_{\rm{com}}$是其他点，${\mathit{\boldsymbol{V}}}_{\rm{sels}}({\mathit{\boldsymbol{G}}})$为选择后的特征集合。

2) 连通分支的选择为

$ \boldsymbol{V}_{\mathrm{cn}}(\boldsymbol{G})=\left\{\boldsymbol{V}_{i}(\boldsymbol{G})|| \boldsymbol{V}_{i}(\boldsymbol{G}) \mid \leqslant \gamma, i \leqslant n\right\} $

(8)

式中，${\mathit{\boldsymbol{V}}}_{\rm{cn}}({\mathit{\boldsymbol{G}}})$为选择后的连通分支集合，最终的特征点集合${\mathit{\boldsymbol{V}}}_{\rm{sf}}({\mathit{\boldsymbol{G}}})={\mathit{\boldsymbol{V}}}_{\rm{sels}}({\mathit{\boldsymbol{G}}})∪{\mathit{\boldsymbol{V}}}_{\rm{cn}}({\mathit{\boldsymbol{G}}})$。其中连通分量$γ$，$n$的选择变成为保证匹配精度$ε$情况下，最小化分离图${\mathit{\boldsymbol{G}}}({\mathit{\boldsymbol{V}}}, {\mathit{\boldsymbol{E}}})$的阶${\mathit{\boldsymbol{V}}}_{\rm{sf}}({\mathit{\boldsymbol{G}}})$，即

$ \begin{aligned} &\min \left(\left|\boldsymbol{V}_{\mathrm{sf}}(\boldsymbol{G})\right|\right) \\ &\text { s. t. } \quad C_{\mathrm{s}}\left(\boldsymbol{I}_{i}, \boldsymbol{I}_{j}\right) \geqslant \varepsilon C_{\mathrm{o}}\left(\boldsymbol{I}_{i}, \boldsymbol{I}_{j}\right) \end{aligned} $

(9)

式中，$C({\mathit{\boldsymbol{I}}}_{i}, {\mathit{\boldsymbol{I}}}_{j})$为两幅图像间特征匹配数量。

$ C\left(\boldsymbol{I}_{i}, \boldsymbol{I}_{j}\right)=\left|\left\{\left(v_{i}, v_{j}\right) \in \boldsymbol{I}_{i} \times \boldsymbol{I}_{j} ; h\left(v_{i}, v_{j}\right) \leqslant \boldsymbol{h}_{t}\right\}\right| $

(10)

$C_{\mathrm{o}}\left(\boldsymbol{I}_{i}, \boldsymbol{I}_{j}\right)$和$C_{\mathrm{s}}\left(\boldsymbol{I}_{i}, \boldsymbol{I}_{j}\right)$分别为原始特征匹配数量和特征选择后的匹配数量。

本文特征选择方法与UsefulFeatures(Turcot和Lowe, 2009)不同之处在于，UsefulFeatures属于后验型的特征选择，也就是首先将每一幅图像完成图像检索之后，把满足几何一致性的特征保存下来，通过校验两两图像构成近邻图像，形成图像级的无向图。其优点在完成一次彻底的全图像检索后，形成有效特征数据集，便于以后的再次查询。而本文算法是像素集的分离图，其特征选择是在完整的图像检索前发生的，对于一些大规模的图像集能够显著地减少其算法复杂度。

5 实验

为了验证特征选择的有效性，实验采用Oxford数据集和Paris数据集对其进行评估。图像检索性能指标采用平均查询准确率(mean average precision, MAP)。使用SIFT作为特征向量，其Peakthrold取0.001；BOW模型的词典选择Oxford数据集中提供的100 k词典，特征选择参数$n=3，D=50$。硬件环境为：Xeon(R) CPU E5-2640，64 GB内存的Station；软件环境为Linux下的MATLAB 2018a。

5.1 Oxford数据集

该图像集共包含11个建筑类，其中每个建筑类包含5个查询图像，共55幅标准查询图像，查询结果包含Good、OK、Junk和Bad这4个分类。鉴于研究方法重点在于特征选择的有效性，为进一步扩大查询范围，实验采用Good分类中所有的图像作为查询图像。参数$γ$=18。

1) 特征约简。特征约简率并不会因为数据集的大小而发生明显变化，因此这里的约简实验仅采用Oxford数据集标注为Good、OK和Junk的图像，共660幅图像。实验结果如表 1所示，特征点筛选率接近原始特征点的一半，存储空间节省54 % 左右。

表 1 特征约简表
Table 1 Feature reduction

下载CSV

阈值$γ$	原始特征点数量	存储容量/MB	特征选择数量	存储容量/MB	约简率/%	时间复杂度/s
10	1 419 844	693.3	544 103	265.7	38	99.71
15	1 419 844	693.3	615 611	300.6	43	99.84
20	1 419 844	693.3	664 231	324.3	47	100.05
25	1 419 844	693.3	700 207	341.9	49	100.137
30	1 419 844	693.3	728 270	355.8	51	99.83

2) 特征点选择前后在不同方法上的对比。使用原始特征点(original features，OF)和选择后的特征点(selected features，SF)分别与IDF，RootSift, Vlad，以及深度学习的全连接层特征, 卷积层的Bilinear特征(Lin等，2018)进行对比。如图 4和图 5所示，通过对比可以发现，BOW+IDF和RootSift的OF和SF之间每一类建筑的MAP实际上相差无几，对于类别All_Souls、Ashmolean、Keble等，选择后的特征向量检索精度反而优于原始特征向量，其误差棒范围中的最高AP(average precision)也高于原始特征。而对于聚合编码的Vlad特征，SF相对于OF有着明显的下降，其原因主要在于Vlad是全部特征上进行编码降维的，特征点的选择会减少其编码信息量，从而带来特征表达能力的损失，深度学习特征在Magdalen和Radcliffe两类建筑中表现优异，其他类检索效果一般，其中Bilinear特征相比其他深度学习特征检索精度高出3 % 左右，但整体检索精度仍有待提高。各类图像MAP的均值如表 2所示，SF+IDF显著高于其他特征表示方法，略低于原始特征。结合上述特征约简实验中50 % 的约简率来看，微弱检索精度的损失尚在可接受范围内。部分图像检索结果如图 6所示。

图 4 编码聚合特征对比

Fig. 4 Performance comparison with aggregation features

图 5 深度学习特征对比

Fig. 5 Performance comparison with deep learning features

表 2 各类图像MAP的均值
Table 2 Means of MAP for different classes

下载CSV

方法	平均MAP
Selected features+IDF	0.554
Original features+IDF	0.571
RootSelected	0.510
RootOriginal	0.527
VladSelected	0.287
VladOriginal	0.413
VGG19-FC6	0.454
Bilinear	0.482
Alex-FC6	0.454
Resnet18-FC1000	0.449

图 6 基于特征选择的部分图像检索结果

Fig. 6 Retrieval results based on feature selection

3) 不同特征抽取和选择方法的对比。将算法SF与PCA、LPP(locality preserving projection)、SR-LPP(spectral regression-LPP)、LS(Laplacian score)共4种不同特征抽取和选择方法进行对比，同时增加SF+PCA的集成实验，进一步验证SF的拓展集成性。Jégou和Chum(2012)将特征约简维度设置为128，实验结果如图 7和表 3所示。

图 7 不同特征抽取和选择方法的对比

Fig. 7 Retrieval results based on feature selection

表 3 时间复杂度对比
Table 3 Time complexity comparison

下载CSV

计算时间和频率	方法
计算时间和频率	PCA	LPP	SR-LPP	LS	SF	SF+PCA
100 k词典的KD-Tree查询/s	528.3	528.3	528.3	528.3	221.9	221.9
特征抽取和选择/s	17.1	40.2	3 872.5	31.0	864.3	22.9
复合$N$个词典的特征选择频率	$N$	$N$	$N$	$N$	1	1

通过对比可以看出，LPP和LS的特征选择在Ashmolean、Balliol等6个建筑物类别的检索中表现较差，不适合在多类别的数据集中进行降维。PCA、SR-LPP和算法SF在11类的检索实验中表现较为稳定，SR-LPP虽然检索效果较好，但是对于100 k的词典，其矩阵转置相乘耗时较长，无法满足大规模图像检索的需要。PCA的检索精度则与本文算法相似，与其他3种算法相比，其优点是单次降维时间较少，但同样每次特征直方图的形成需要重新降维。而词典层面的降维则需要重新聚类，大词典的聚类则具有相当高的时间复杂度和空间复杂度。

算法SF的主要特点在于其基于原始特征点的选择，所得到的特征点保持了原始独立性，在此基础上又可以从特征向量的角度进一步降维、聚类，方便在不同词典中移植和编码，具有很好的拓展性，其中SF+PCA的集成，仅需特征向量进行投影约简，即可实现特征向量的再次降维，且具有突出的检索效果。

5.2 Paris数据集

Paris数据库包含12类，共6 412幅图像。除去General类包含涉及到其他各类的通用图像外，其他每类包含5个查询图像，共55幅标准查询图像，与Oxford Buliding相似，查询结果包含Good、OK、Junk和Bad等4个分类。针对每类图像，对不同连通域所选择的特征分别进行检索, 如图 8所示。随着连通域参数$γ$的不断增加，选择的特征逐渐包含所有原始特征。

图 8 连通域对于检索结果的影响

Fig. 8 The influence of connected domains on retrieval

((a) Defense and Eiffel; (b) Invalides and Louvre; (c) Moulinrouge and Museedorsay; (d)Notredame and Pantheon; (e)Pompidou and Sacrecoeur; (f)Triomphe and Mean map)

检索结果表明，除了Defense类随着特征数量增加，图像检索精度得到提升之外，其他10类图像检索精度相对于特征数量普遍存在一个明显的波峰，也即说明原始特征中存在一定数量的特征是典型表征图像意义的，而其他特征则是冗余的，这些特征的加入不但没有明显效果的提升，反而由于冗余信息的混淆降低了检索精度。

接下来使用选择后的特征点(Selected features，SF)分别与原始特征点、RootSift、Fisher vector、Vlad及深度学习方法进行对比。各类图像的检索精度和均值如图 9和表 4所示。

图 9 不同特征提取方法对比

Fig. 9 Comparison of different feature extraction methods

表 4 各类图像MAP的均值
Table 4 Means of MAP for different classes

下载CSV

方法	平均MAP
Selected features	0.397
Original features	0.299
Root	0.346
Vlad	0.377
Fisher vector	0.396
Alex-FC6	0.592
VGG19-FC6	0.604
Resnet18-FC1000	0.597
Bilinear	0.558
注：加粗字体表示最优结果。

实验结果表明，11类图像的检索中，与原始特征相比，特征选择后8类图像的MAP明显高于原始特征，1类持平，另外两类略低；与Root特征相比，同样有8类图像明显较高，1类持平，2类略低；与Vlad相比，7类较高，4类偏低；与Fisher vector相比，6类较高，1类持平，4类偏低。所以，在大部分类别中特征选择是行之有效的。图 10则显示了其部分图像检索结果，另一方面，深度学习特征在该数据集中表现出了良好的检索效果，除了Notre Dame和Pantheon外，各类检索都优于其他算法，MAP的均值在55 % 以上，普遍高于编码聚合类特征提取算法。因此，如何将SF算法与深度学习特征进一步融合，提高特征选择的普遍适用性，将是下一步研究的重点。

图 10 基于特征选择的部分图像检索结果

Fig. 10 Retrieval results based on feature selection

6 结论

针对大规模图像检索，提出了一种简单有效的特征点选择方法。大规模图像数据集中特征向量体量巨大，为后期BOW词典聚类和相似性测量计算带来了时间和空间的挑战，本文从减少特征点数量这一简单思路为出发点，结合词典前$D$个最近邻单词归属、尺度特征和特征距离等特征点属性，构建特征分离图，实验表明了大连通图点集的冗余性和孤立点的部分可区分性，最后通过设置连通分量阈值摒弃大连通区域的冗余特征点，根据孤立点间的最近邻单词相关性保留具有典型特征意义的点。实验结果验证了该特征选择方法的有效性，在保证原有检索精度的基础上，有效的约简特征在50 % 以上，且部分类别的检索效果甚至优于原始特征，在与其他特征抽取和选择方法的对比中，同样保证了稳定和优异的检索性能。

但是，不同图像最优的连通域阈值是不同的，实验中连通分量的选择是根据先验知识统一设定的，忽略了图像个体间的差异性。在未来的研究中，将针对不同图像，自适应地选择连通分量，形成最优特征点集，并将其与深度学习特征有效结合。

参考文献

Arandjelović R and Zisserman A. 2012. Three things everyone should know to improve object retrieval//Proceedings of 2012 IEEE Conference on Computer Vision and Pattern Recognition. Providence, USA: IEEE: 2911-2918[DOI: 10.1109/CVPR.2012.6248018]

Avrithis Y, Tolias G. 2014. Hough pyramid matching: speeded-up geometry re-ranking for large scale image retrieval. International Journal of Computer Vision, 107(1): 1-19 [DOI:10.1007/s11263-013-0659-3]

Babenko A, Slesarev A, Chigorin A and Lempitsky V. 2014. Neural codes for image retrieval//Proceedings of European Conference on Computer Vision. Zurich, Switzerland: Springer-Verlag: 584-599[DOI: 10.1007/978-3-319-10590-1_38]

Bursuc A, Tolias G and Jégou H. 2015. Kernel local descriptors with implicit rotation matching//Proceedings of the 5th ACM on International Conference on Multimedia Retrieval. Shanghai, China: ACM: 595-598[DOI: 10.1145/2671188.2749379]

Chum O. 2015. Low dimensional explicit feature maps//Proceedings of 2015 IEEE International Conference on Computer Vision. Santiago, Chile: IEEE: 4077-4085[DOI: 10.1109/ICCV.2015.464]

Jégou H and Chum O. 2012. Negative evidences and co-occurences in image retrieval: the benefit of PCA and whitening//Proceedings of European Conference on Computer Vision. Florence, Italy: Springer: 774-787[DOI: 10.1007/978-3-642-33709-3_55]

Jégou H, Douze M and Schmid C. 2008. Hamming embedding and weak geometric consistency for large scale image search//Proceedings of European Conference on Computer Vision. Marseille, France: Springer-Verlag: 304-317[DOI: 10.1007/978-3-540-88682-2_24]

Jégou H, Douze M, Schmid C and Pérez P. 2010. Aggregating local descriptors into a compact image representation//Proceedings of 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. San Francisco, USA: IEEE: 3304-3311[DOI: 10.1109/CVPR.2010.5540039]

Karakasis E G, Amanatiadis A, Gasteratos A, Chatzichristofis S A. 2015. Image moment invariants as local features for content based image retrieval using the bag-of-visual-words model. Pattern Recognition Letters, 55: 22-27 [DOI:10.1016/j.patrec.2015.01.005]

Lazebnik S, Schmid C and Ponce J. 2006. Beyond bags of features: spatial pyramid matching for recognizing natural scene categories//Proceedings of 2016 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. New York, USA: IEEE: 2169-2178[DOI: 10.1109/CVPR.2006.68]

Lin T, Roychowdhury A, Maji S. 2018. Bilinear convolutional neural networks for fine-grained visual recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(6): 1309-1322 [DOI:10.1109/TPAMI.2017.2723400]

Lowe D G. 2004. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60: 91-110 [DOI:10.1023/B:VISI.0000029664.99615.94]

Mukundan A, Tolias G and Chum O. 2017. Multiple-kernel local-patch descriptor[EB/OL]. [2020-08-08]. https://arxiv.org/pdf/1707.07825.pdf

Perronnin F, Sánchez J and Mensink T. 2010. Improving the fisher kernel for large-scale image classification//Proceedings of European Conference on Computer Vision. Heraklion, Greece: Springer: 143-156[DOI: 10.1007/978-3-642-15561-1_11]

Razavian A S, Azizpour H, Sullivan J and Carlsson S. 2014. CNN features off-the-shelf: an astounding baseline for recognition//Proceedings of 2014 IEEE Conference on Computer Vision and Pattern Recognition Workshops. Columbus, USA: IEEE: 512-519[DOI: 10.1109/CVPRW.2014.131]

Sánchez J, Perronnin F, Mensink T, Verbeek J. 2013. Image classification with the fisher vector: theory and practice. International Journal of Computer Vision, 105(3): 222-245 [DOI:10.1007/s11263-013-0636-x]

Tolias G, Avrithis Y and Jégou H. 2013. To aggregate or not to aggregate: selective match kernels for image search//Proceedings of 2013 IEEE International Conference on Computer Vision. Sydney, Australia: IEEE: 1401-1408[DOI: 10.1109/ICCV.2013.177]

Torii A, Sivic J, Okutomi M, Pajdla T. 2015. Visual place recognition with repetitive structures. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(11): 2346-2359 [DOI:10.1109/TPAMI.2015.2409868]

Turcot P and Lowe D G. 2009. Better matching with fewer features: the selection of useful features in large database recognition problems//Proceedings of the 12th IEEE International Conference on Computer Vision Workshops. Kyoto, Japan: IEEE: 2109-2116[DOI: 10.1109/ICCVW.2009.5457541]

Vedaldi A and Zisserman A. 2010. Efficient additive kernels via explicit feature maps//Proceedings of 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. San Francisco, USA: IEEE: 3539-3546[DOI: 10.1109/CVPR.2010.5539949]

Yan K, Wang Y W, Liang D W, Huang T J and Tian Y H. 2016. CNN vs. SIFT for image retrieval: alternative or complementary?//Proceedings of the 24th ACM International Conference on Multimedia. Amsterdam, The Netherlands: ACM: 407-411[DOI: 10.1145/2964284.2967252]

Zheng L, Yang Y, Tian Q. 2018. SIFT meets CNN: a decade survey of instance retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(5): 1224-1244 [DOI:10.1109/TPAMI.2017.2709749]