Published: 2016-10-25
DOI: 10.11834/jig.20161002
2016 | Volume 21 | Number 10

Review

视觉地形分类的词袋框架综述

吴航¹, 刘保真¹, 苏卫华¹, 张文昌², 孙景工¹

1. 军事医学科学院卫生装备研究所, 天津 300161;

2. 清华大学国家智能技术与系统重点实验室, 北京 100084

Received: 2016-03-04; Revised: 2016-06-15

基金项目: 国家科学技术重大专项基金项目（2012ZX10004801）

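First author: Wu Hang (b. 1991), male, doctoral student (2012 cohort) at the Institute of Medical Equipment, Academy of Military Medical Science. His main research interest is visual terrain recognition for mobile robots. E-mail: 2008.wuhang@163.com

CLC number: TP301.6

Document code: A

Article ID: 1006-8961(2016)10-1276-13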

摘要

目的视觉地形分类是室外移动机器人领域的一个研究热点。基于词袋框架的视觉地形分类方法，聚集和整合地形图像的视觉底层特征，建立底层特征统计分布与高层语义之间的联系，已成为目前视觉地形分类的常用方法和标准范式。本文全面综述视觉地形分类中的词袋框架，系统性总结现有研究工作，同时指出未来的研究方向。方法词袋框架主要包括4个步骤：特征提取、码本聚类、特征编码、池化与正则化。对各步骤中的不同方法加以总结和比较，建立地形分类数据集，评估不同方法对地形识别效果的影响。结果对词袋框架各步骤的多种方法进行系统性的分类和总结，利用地形数据集进行评估，发现每个步骤对最后生成的中层特征性能都至关重要。特异性特征设计、词袋框架改进和特征融合研究是未来重要的研究方向。结论词袋框架缩小低层视觉特征与高层语义之间的语义鸿沟，生成中层语义表达，提高视觉地形分类效果。视觉地形分类的词袋框架方法研究具有重要意义。

关键词

视觉地形分类; 非几何地形特征危险; 词袋框架; 编码方法; 池化方法; 移动机器人

Bag of words for visual terrain classification: a comprehensive study

Wu Hang¹, Liu Baozhen¹, Su Weihua¹, Zhang Wenchang², Sun Jinggong¹

1. Institute of Medical Equipment, Academy of Military Medical Science, Tianjin 300161, China;

2. The State Key Laboratory of Intelligent Technology and System, Tsinghua University, Beijing 100084, China

Supported by: National Science and Technology Major Project of China (2012ZX10004801)

Abstract

Objective Unlike a mobile robot in a structured indoor environment, an outdoor robot must recognize non-geometric terrain characteristics within a reasonable time and adapt its path, gait, and motion planning strategies to the terrain. Visual terrain classification has become a hot topic in outdoor mobile robot research. The bag-of-visual-words (BOVW) framework, which aggregates low-level visual descriptors and links them to high-level semantics, has become the most common approach and an effective paradigm for visual terrain classification. In this paper, we provide a comprehensive study of each step in the BOVW framework for visual terrain classification. Diverse methods in each step are introduced and summarized, and their characteristics and relations are explored. Method The BOVW framework includes four main steps: 1) feature extraction, 2) codebook generation, 3) feature coding, and 4) pooling and normalization. Feature extraction acquires low-level feature information from the terrain images to form local descriptors. In the codebook generation step, a codebook is formed through clustering. The coding step uses the codebook to map the descriptors in the terrain image to the coding space. The coding results are then aggregated by pooling and normalization into a single fixed-length vector, the mid-level feature. Finally, the mid-level feature is fed into a linear or nonlinear classifier, such as an SVM, for terrain classification. The diverse methods in each step are summarized and compared systematically, and their performance is preliminarily tested on a terrain dataset. Results The BOVW framework for visual terrain classification is reviewed, and different BOVW configurations are preliminarily compared on the terrain dataset. The results show that every step contributes crucially to the final classification performance, and an improper choice in one step markedly weakens the effectiveness and efficiency of the visual classification system as a whole. New handcrafted descriptors specific to visual terrain, modified BOVW frameworks, and feature fusion are three potential research directions. Conclusion Visual terrain classification is an important technology by which outdoor mobile robots recognize non-geometric terrain characteristics. Compared with other sensors, vision most closely resembles the way humans perceive the environment and provides richer terrain information, so visual terrain classification has become a hotspot in outdoor mobile robot technology. However, the same terrain type may vary greatly in visual appearance, and different terrain types may appear highly similar, which poses numerous challenges. Both effectiveness and efficiency must be taken into account when designing a visual terrain classification system. Studies on the BOVW framework for visual terrain classification are therefore of considerable significance.

Key words

visual terrain classification; non-geometric hazard; bag of words; encoding methods; pooling methods; mobile robot

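0 Introduction

Outdoor mobile robots now have important applications in planetary exploration, field search and rescue, and disaster relief. Unlike in structured indoor environments, a robot in the field must cope with varied ground surfaces; soft, muddy, or rugged ground can all endanger it. Such hazardous surfaces are collectively called non-geometric hazards^[1]. A robot therefore needs to perceive and classify the terrain around it accurately before it can plan a reasonable path, select a gait, and choose an appropriate motion control strategy. If the terrain cannot be recognized accurately, the robot may make wrong motion control decisions^[2]. For example, in 2005 the Opportunity Mars rover developed by the National Aeronautics and Space Administration (NASA), lacking the ability to analyze the surrounding terrain, became stuck in soft sand and could not move for several weeks; the Spirit rover ran into the same problem in 2006.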

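Methods for classifying non-geometric hazards fall into two groups: proprioceptive methods and appearance-based methods. Proprioceptive methods^[3-6] classify terrain mainly from the vibration signals measured while the robot traverses it. Their main drawback is that the terrain of a target region cannot be identified in advance, so the robot is easily put at risk, which makes them unsuitable for most robots. Appearance-based methods instead use visual information about the terrain. Compared with other sensor data, vision is closest to how humans perceive the environment and provides comparatively rich terrain information^[3, 7-8]. Vision-based terrain recognition and classification is currently a research hotspot in mobile robotics.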

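However, the same terrain type can look very different from image to image, while different terrain types sometimes look highly similar; local and global features are mixed in terrain images; and recognition must run while the robot is moving, which imposes strict real-time requirements. All of these make visual terrain classification challenging, and performing it accurately and quickly remains a key open problem.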

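A variety of color and texture features have been applied to visual terrain recognition. Ref.^[9] compared the terrain recognition performance of several color features^[10]; color features are easy to extract but are affected by the external environment and are sensitive to illumination and noise. Ref.^[11] used texture features^[12], which characterize the spatial distribution of gray levels in a terrain image; because they do not depend on image color or brightness, they are more practical. Ref.^[13] combined color and texture information into composite descriptors and, using several classifiers, assigned terrain to five predefined classes. However, because these methods describe terrain with low-level features only and lack a mid-level semantic representation, they are not very robust, generalize poorly, and are limited in use.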

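To bridge the semantic gap^[14] between low-level visual features and high-level semantics, many researchers have turned to mid-level features for visual terrain. In 2012, Filitchkin and Byl of the robotics institute at the University of California, Santa Barbara (UCSB) used the bag-of-visual-words (BOVW) framework to generate compact mid-level image features, which markedly improved terrain classification; their results were also deployed on LittleDog, a US military robot, as an important functional payload module^[15-16]. Many researchers have since used the BOVW framework and obtained good terrain classification results in different scenarios^[17-18]. By aggregating and integrating low-level visual features of terrain and linking the statistical distribution of those features to high-level semantics, the BOVW framework has gradually become a common method and standard paradigm for terrain classification.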

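Deep learning has recently attracted wide attention. Its end-to-end image classification builds multi-layer recognition networks that capture deep semantic information, and it has achieved striking breakthroughs in many fields. Behind this strong recognition ability, however, lies a large computational cost: GPU (graphics processing unit) acceleration is needed, as are very large training sets to prevent overfitting. BOVW-based recognition needs a much smaller training set and can run on a CPU (central processing unit) alone, which makes it better suited to terrain recognition on mobile robot platforms.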

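The BOVW approach to visual terrain classification comprises four main steps: 1) feature extraction, 2) codebook clustering, 3) feature coding, and 4) pooling and normalization. Each step is crucial to the performance of the mid-level feature. This paper introduces the BOVW framework in full: it takes the steps in turn, explains the role of each, summarizes the methods available at each step, analyzes how they relate to one another, and systematically presents the BOVW-based approach to visual terrain classification.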

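1 BOVW-based visual terrain classification

The BOVW framework re-expresses local features to bridge the semantic gap and describe terrain images better. As shown in Fig. 1, it comprises four steps: feature extraction, codebook clustering, feature coding, and pooling and normalization.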

Fig. 1 Bag-of-visual-words framework for terrain classification

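Let $X=\left[ {{x}_{1}},{{x}_{2}},\cdots ,{{x}_{K}} \right]\in {{\mathbb{R}}^{D\times K}}$ denote the $K$ local features of dimension $D$ extracted from a terrain image. A clustering algorithm run on the training set yields a codebook of size $M$, $B=\left[ {{b}_{1}},{{b}_{2}},\cdots ,{{b}_{M}} \right]\in {{\mathbb{R}}^{D\times M}}$, where each ${{b}_{i}}\in {{\mathbb{R}}^{D}}$ is a codeword. The codewords are used to re-express the local features, producing the coding result D; pooling and normalization then produce a compact mid-level image representation F. Finally, F is fed to a linear or nonlinear classifier (for example, a support vector machine) to obtain the terrain class.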

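1.1 Feature extraction and codebook generation

1.1.1 Feature extraction

Feature extraction covers two aspects, detectors and descriptors, and captures low-level visual information from terrain images. The detector is the strategy for sampling feature points in a terrain image; the main options are sparse sampling, dense sampling, and random sampling^[19]. The descriptor describes each sampled point and its neighborhood. Many handcrafted low-level features have been applied to terrain recognition, chiefly color and texture features.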

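Color features are widely used because most terrains, such as soil, vegetation, and rock, differ clearly in color. They mainly include color entropy, color moments, and color histograms^[20]. Color features are easy to extract and widely applied, but they change with illumination, weather, and shadow and are strongly affected by external conditions. Texture features mainly include the gray-level co-occurrence matrix (GLCM)^[21], Gabor filter responses (GIST)^[22-23], local binary patterns (LBP)^[24], and local ternary patterns (LTP)^[25]; they characterize the spatial distribution of gray levels and do not depend on image color or brightness. Features that combine color and texture information, such as CEDD^[26], FCTH^[27], and JCD^[28], have also been used for visual terrain recognition and classification^[13]. In 2004, David Lowe proposed the scale-invariant feature transform (SIFT)^[29], a milestone in low-level visual feature design; its variants include SURF^[30] and Root-SIFT^[31]. SIFT features are invariant to scale and orientation and robust to moderate viewpoint change, have low sensitivity to illumination changes and noise, are widely used, and have performed well in terrain classification and recognition^[15].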

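The raw low-level visual features $X\in {{\mathbb{R}}^{D\times K}}$ are usually highly correlated and contain much redundant information, which makes it harder to generate an effective codebook afterwards. Feature preprocessing alleviates this. The common preprocessing techniques are principal component analysis (PCA)^[32] and whitening^[19]. PCA projects the original features onto a lower-dimensional space through an orthogonal transformation while retaining most of their information. Whitening reduces the correlation between feature dimensions and gives them equal variance. PCA and whitening are usually applied together, and the combined transformation is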

${{\operatorname{X}}_{i}}^{'}=\Lambda {{\operatorname{U}}^{\text{T}}}{{\operatorname{X}}_{i}}$

(1)

where ${{\operatorname{X}}_{i}}\in {{\mathbb{R}}^{D}}$ is the original low-level visual feature, ${{\operatorname{X}}_{i}}^{'}\in {{\mathbb{R}}^{N}}$ is the preprocessed result, $\operatorname{U}\in {{\mathbb{R}}^{D\times N}}$ is the PCA projection matrix, $\Lambda $ is the diagonal whitening matrix with $\mathrm{diag}(\Lambda )=[1/\sqrt{{{\lambda }_{1}}},1/\sqrt{{{\lambda }_{2}}},\ldots ,1/\sqrt{{{\lambda }_{N}}}]$, and ${{\lambda }_{i}}$ is the $i$-th largest eigenvalue of the feature covariance matrix.

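1.1.2 Building the visual terrain codebook

The core idea of the BOVW framework is to re-express local features with an overcomplete set of basis vectors. These basis vectors $b_m$ are called visual words or codewords, and together they form the codebook $B=[{{b}_{1}},{{b}_{2}},\ldots ,{{b}_{M}}]\in {{\mathbb{R}}^{D\times M}}$. Visual words must be sufficiently representative of the feature space and discriminative within it, and how the visual dictionary is built strongly affects overall classification performance^[33]. Because manual annotation is laborious and subjective, visual words are currently generated mainly by unsupervised clustering; typical methods are K-means clustering and the Gaussian mixture model (GMM).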

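1) K-means clustering. K-means partitions the feature space into $M$ regions and takes the centroid of each region as a visual word^[34]. Select $T$ feature vectors $X_{\text{train}}=[{{x}_{1}},{{x}_{2}},\ldots ,{{x}_{T}}]\in {{\mathbb{R}}^{D\times T}}$ from the training set and randomly choose $M$ of them, $[{{b}_{1}},{{b}_{2}},\ldots ,{{b}_{M}}]\in {{\mathbb{R}}^{D\times M}}$, as the initial centroids. Then measure the Euclidean distance from each remaining feature vector to every centroid and assign the vector to the class of the nearest centroid, that is,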

${{q}_{i}}=\underset{m}{\mathop{\text{arg min}}}\,{{\left\| {{\operatorname{x}}_{i}}-{{\operatorname{b}}_{m}} \right\|}^{2}}$

(2)

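Next, recompute the centroid of each class, ${{\operatorname{b}}_{m}}=avg\{{{\operatorname{x}}_{i}}:{{q}_{i}}=m\}$, and iterate until the error $\sum\limits_{i=1}^{T}{{{\left\| {{\operatorname{x}}_{i}}-{{\operatorname{b}}_{{{q}_{i}}}} \right\|}^{2}}}$ falls below a threshold or the maximum number of iterations is reached. The final centroids form the visual terrain codebook.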
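The K-means loop of Eq. (2) can be sketched in a few lines of plain Python. This is a minimal illustration, not the implementation used in the paper: the function names `sq_dist` and `kmeans_codebook`, the `seed` argument, and the stopping rule (stop when the error no longer decreases by more than `tol`, a common variant of the error threshold described above) are assumptions made for the example.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def kmeans_codebook(features, M, max_iter=100, tol=1e-9, seed=0):
    """Cluster `features` (a list of D-dim lists) into M codewords."""
    rng = random.Random(seed)
    # Initial centroids: M feature vectors drawn at random without replacement.
    centroids = [list(f) for f in rng.sample(features, M)]
    assign = [0] * len(features)
    prev_error = float("inf")
    for _ in range(max_iter):
        # Assignment step, Eq. (2): index of the nearest centroid.
        assign = [min(range(M), key=lambda m: sq_dist(x, centroids[m]))
                  for x in features]
        # Update step: each centroid becomes the mean of its members.
        for m in range(M):
            members = [x for x, q in zip(features, assign) if q == m]
            if members:  # keep the old centroid if a cluster is empty
                centroids[m] = [sum(col) / len(members) for col in zip(*members)]
        error = sum(sq_dist(x, centroids[q]) for x, q in zip(features, assign))
        if prev_error - error < tol:  # assumed stopping rule: no further gain
            break
        prev_error = error
    return centroids, assign
```

In practice the codebook would be learned from many thousands of descriptors, where an optimized library routine is a better choice than this didactic loop.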

2) Gaussian mixture model (GMM) clustering. This method uses a Gaussian mixture model to capture the probability distribution $\text{P}(x|\theta )$ of the local features, that is,

$P(\text{x}|\theta )=\sum\limits_{m=1}^{M}{{{\pi }_{m}}P(\text{x}|{{\mu }_{m}},{{\Sigma }_{m}})}$

(3)

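Whereas K-means only describes which cluster a feature belongs to, the GMM describes both the assignment and how tightly the features are concentrated. Here $M$ is the codebook size and $\theta =({{\pi }_{1}},{{\mu }_{1}},{{\Sigma }_{1}},\ldots ,{{\pi }_{M}},{{\mu }_{M}},{{\Sigma }_{M}})$ are the model parameters: the prior probabilities ${{\pi }_{m}}\in {{\mathbb{R}}_{+}}$, the means ${{\mu }_{m}}\in {{\mathbb{R}}^{D}}$, and the diagonal covariance matrices ${{\Sigma }_{m}}\in {{\mathbb{R}}^{D\times D}}$. These parameters are usually learned from the training set with the expectation-maximization (EM) algorithm^[35]. EM is sensitive to its initial values, so K-means can be used to initialize it.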

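1.2 Feature coding

Feature coding is the core of the BOVW framework. It uses the visual dictionary $B\in {{\mathbb{R}}^{D\times M}}$ to re-express the local features $X\in {{\mathbb{R}}^{D\times K}}$, narrowing the semantic gap and producing a more discriminative representation, the coding result D.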

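Feature coding projects the feature space into a coding space to capture more information about the terrain image. Different coding spaces and different captured information give rise to different coding methods. In contrast to the traditional view, we argue that the essential difference between coding methods lies in the kind of information they capture: different information builds different coding spaces and yields representations of different discriminative power. We divide coding methods into the two classes shown in Fig. 2: activation-based encoding methods and difference-based encoding methods.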

Fig. 2 Comparison between activation-based and difference-based encoding methods

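Activation-based encoding methods capture information from the feature space through the notion of activation.

1) The coding space is made up of the codewords. These methods ask which visual words are activated and how strongly. The activation weights of the visual words together form the final coding result D, and different coding methods define different activation rules.

2) They use the 0th-order statistics of the feature space. Each feature in an image is coded independently, so every local feature gets its own code, which records which visual words it activates and how strongly.

3) They need a subsequent pooling step. Activation-based methods require many visual words and hence a large codebook; concatenating the per-feature codes directly would cause the curse of dimensionality (millions of dimensions). Various pooling methods have therefore been studied and applied to obtain a more compact representation.

Typical methods of this type are HA^[15], SA^[33], LSA^[36], SC^[37], LCC^[38], and LLC^[39].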

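Difference-based encoding methods capture information from the feature space through the notion of difference.

1) The differences between the feature space and the visual dictionary constitute the coding space. Different coding methods record different kinds of difference; their core is the rule used to represent the difference.

2) They use multi-order information about the feature space (0th, 1st, and 2nd order). The distribution formed by all features of the input image is compared with the visual codebook, and the multi-dimensional differences from each visual word are recorded as the coding result.

3) Direct concatenation yields a compact image-level representation without an additional pooling step. Because these methods exploit more prior information about the feature space, they need a smaller visual dictionary, and concatenating the per-word differences already gives a suitable mid-level representation.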

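1.2.1 Activation-based encoding methods

These methods acquire information through activation; different coding methods define different activation rules to produce the coding result D. By activation strategy, they divide into voting-based encoding methods and sparse coding methods.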

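1) Voting-based encoding methods. Voting-based methods activate visual words according to similarity. The more similar a visual word is to a local feature, that is, the closer the two are, the more likely the word is to be activated and the stronger its activation. HA^[15], SA^[33], and LSA^[36] belong to this class.

Hard assignment (HA) is the most basic BOVW coding method: each local feature $x_i$ fully activates only the single visual word $b_j$ most similar to it, that is,

$\text{HA}:{{d}_{j}}=\left\{ \begin{matrix} 1 & j=\underset{l}{\mathop{\text{arg min}}}\,{{\left\| {{\operatorname{x}}_{i}}-{{\operatorname{b}}_{l}} \right\|}_{2}} \\ 0 & \text{otherwise} \\ \end{matrix} \right.$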

(4)

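A local feature may, however, be related to several visual words at once, and representing it by a single word loses considerable information. To address this, soft assignment (SA) activates the whole codebook, with the activation of each visual word depending on its distance from the local feature, that is,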

$\text{SA:}{{d}_{j}}=\frac{\text{exp(}-\beta \widehat{e}\text{(}{{\operatorname{x}}_{i}},{{\operatorname{b}}_{j}}\text{))}}{\sum\limits_{l=1}^{M}{\text{exp}(-\beta \widehat{e}\text{(}{{\operatorname{x}}_{i}},{{\operatorname{b}}_{l}}\text{))}}}$

(5)

$\widehat{e}({{\operatorname{x}}_{i}},{{\operatorname{b}}_{j}})={{\left\| {{\operatorname{x}}_{i}}-{{\operatorname{b}}_{j}} \right\|}^{2}}$

(6)

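where β is the softness factor that controls how soft the voting is, and the squared Euclidean distance $\widehat{e}$ measures similarity. In a high-dimensional feature space, Euclidean distance is no longer a reliable measure of similarity when two points are far apart. Localized soft assignment (LSA) therefore differs from SA in that it activates only the $k$ visual words nearest to the local feature and uses them to express it, that is,

$\text{LSA:}\ \widehat{e}({{\text{x}}_{i}},{{\text{b}}_{j}})=\left\{ \begin{matrix} {{\left\| {{\text{x}}_{i}}-{{\text{b}}_{j}} \right\|}^{2}} & {{\text{b}}_{j}}\in {{N}_{k}}({{\text{x}}_{i}}) \\ \infty & \text{otherwise} \\ \end{matrix} \right.$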

(7)

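Each local feature activates only the visual words in its neighborhood. Similar local features have overlapping activation regions and therefore similar codes, which better reflects how information is distributed in the feature space.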
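The three voting rules of Eqs. (4) to (7) differ only in which codewords receive weight. The sketch below is illustrative and assumes a codebook stored as a list of vectors; the function names and the default `beta` are hypothetical, and subtracting the minimum distance before exponentiating is a numerical-stability step that leaves the normalized weights of Eq. (5) unchanged.

```python
import math


def sq_dist(a, b):
    """Squared Euclidean distance, the similarity measure of Eq. (6)."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def hard_assignment(x, codebook):
    """HA, Eq. (4): fully activate only the nearest codeword."""
    nearest = min(range(len(codebook)), key=lambda j: sq_dist(x, codebook[j]))
    return [1.0 if j == nearest else 0.0 for j in range(len(codebook))]


def soft_assignment(x, codebook, beta=1.0, k=None):
    """SA, Eq. (5); with k set, LSA, Eq. (7): only the k nearest words vote."""
    dists = [sq_dist(x, b) for b in codebook]
    if k is not None:
        keep = set(sorted(range(len(codebook)), key=lambda j: dists[j])[:k])
        # Codewords outside the k-neighborhood get infinite distance.
        dists = [d if j in keep else math.inf for j, d in enumerate(dists)]
    d_min = min(dists)  # shift for numerical stability; ratios are unchanged
    weights = [math.exp(-beta * (d - d_min)) for d in dists]
    total = sum(weights)
    return [w / total for w in weights]
```

Each call codes one local feature; stacking the codes of all K features of an image gives the matrix D that the pooling step later aggregates. The sketch assumes `beta > 0`.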

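2) Sparse coding methods.

Sparse coding methods activate only a very small number of visual words in the codebook $B\in {{\mathbb{R}}^{D\times M}}$ and use their linear combination to reconstruct the features $X\in {{\mathbb{R}}^{D\times K}}$; the corresponding coefficient matrix $\mathrm{D}=[{{\mathrm{d}}_{1}},\ldots ,{{\mathrm{d}}_{K}}]\in {{\mathbb{R}}^{M\times K}}$ is the coding result. The principle resembles how biological vision recognizes images and can capture deeper feature information from an image. Typical sparse coding methods are SC^[37], LCC^[38], and LLC^[39]. Their common formulation is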

$\underset{D}{\mathop{\text{arg min}}}\,\sum\limits_{i=1}^{K}{\text{(}{{\left\| {{\operatorname{x}}_{i}}-\operatorname{B}{{\operatorname{d}}_{i}} \right\|}^{2}}+\lambda \psi \text{(}{{\operatorname{d}}_{i}}\text{)}}\text{)}$

(8)

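which consists of a least-squares term ${{\left\| {{\operatorname{x}}_{i}}-\operatorname{B}{{\operatorname{d}}_{i}} \right\|}^{2}}$ and a regularization term $\psi ({{\operatorname{d}}_{i}})$ balanced by the factor λ. The least-squares term keeps the reconstruction error as small as possible. The regularization term delimits the region from which activated words are chosen (so that the activated visual words are representative and discriminative) and increases the between-class difference and within-class similarity of the codes. Different sparse coding methods define different regularization terms and hence different rules for delimiting the activation region. SC uses $L_1$ regularization, that is,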

$\text{SC}:\psi \left( {{\text{d}}_{\text{i}}} \right)={{\left\| {{\text{d}}_{\text{i}}} \right\|}_{1}}$

(9)

$L_1$ regularization limits the number of visual words each local feature activates and thus ensures sparsity. To account for the manifold structure of the feature space during reconstruction, LCC introduces a locality constraint and defines a new regularization term

$\text{LCC}:\psi ({{\text{d}}_{i}})={{\left\| {{\widehat{\text{e}}}_{i}}\odot \left| {{\text{d}}_{i}} \right| \right\|}_{1}},\quad \text{s.t.}\ {{1}^{T}}{{\text{d}}_{i}}=1,\ \forall i$

(10)

${{\widehat{\text{e}}}_{i}}={{[\widehat{e}({{\text{x}}_{i}},{{\text{b}}_{1}}),\ldots ,\widehat{e}({{\text{x}}_{i}},{{\text{b}}_{m}}),\ldots ,\widehat{e}({{\text{x}}_{i}},{{\text{b}}_{M}})]}^{T}}$

(11)

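where $\odot $ denotes element-wise multiplication, ${{\widehat{\text{e}}}_{i}}$ is the locality weight, and its entries $\widehat{e}({{\text{x}}_{i}},{{\text{b}}_{m}})$ measure the Euclidean distance between feature $x_i$ and visual word $b_m$. $L_1$-regularized optimization requires iterative computation, which is time-consuming and difficult. LLC defines a new regularization term $\psi ({{\text{d}}_{i}})$ that speeds up coding:

$\text{LLC}:\psi ({{\text{d}}_{i}})={{\left\| {{\text{E}}_{i}}\odot {{\text{d}}_{i}} \right\|}^{2}},\quad \text{s.t.}\ {{1}^{T}}{{\text{d}}_{i}}=1,\ \forall i$

(12)

${{\text{E}}_{i}}=\exp \left( \frac{{{\widehat{\text{e}}}_{i}}}{\sigma } \right)$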

(13)

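where σ adjusts how fast the locality weight decays^[37]. With the regularization term of Eq. (12), LLC admits a closed-form solution for the code $d_i$:

${{\widetilde{\text{d}}}_{i}}=({{\text{C}}_{i}}+\lambda \,\mathrm{diag}({{\text{E}}_{i}}))\backslash 1$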

(14)

${{\text{d}}_{\text{i}}}=\widetilde{{{\text{d}}_{\text{i}}}}/{{1}^{T}}\widetilde{{{\text{d}}_{i}}}$

(15)

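where ${{\text{C}}_{i}}=({{\text{B}}^{T}}-1{{\text{x}}_{i}}^{T}){{({{\text{B}}^{T}}-1{{\text{x}}_{i}}^{T})}^{T}}$ is the data covariance matrix. The closed-form solution greatly reduces the cost of coding. In practice an approximated version of LLC is also used: it drops the regularization term, directly takes the $k$ nearest codewords of feature $x_i$ as the activation space, and reconstructs the local feature from these words alone by minimizing the least-squares reconstruction error. Approximated LLC speeds up coding further.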
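A minimal sketch of the approximated LLC variant follows, assuming the k nearest codewords are selected by squared Euclidean distance and the constrained least-squares problem is solved through the local covariance matrix, as in Eqs. (14) and (15) without the locality term. The small diagonal term `reg` is an assumption added here only to keep the linear system well conditioned, and the helper names are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def solve(A, b):
    """Solve A w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [bv] for row, bv in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= f * aug[col][c]
    w = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * w[c] for c in range(r + 1, n))
        w[r] = (aug[r][n] - tail) / aug[r][r]
    return w


def llc_approx(x, codebook, k=2, reg=1e-4):
    """Code x with its k nearest codewords; all other entries stay zero."""
    M = len(codebook)
    nn = sorted(range(M), key=lambda j: sq_dist(x, codebook[j]))[:k]
    # Rows of (B_k - 1 x^T): the selected codewords shifted by x.
    shifted = [[bj - xj for bj, xj in zip(codebook[j], x)] for j in nn]
    C = [[sum(a * b for a, b in zip(u, v)) for v in shifted] for u in shifted]
    trace = sum(C[i][i] for i in range(k))
    for i in range(k):
        # Assumed conditioning term, not part of the equations above.
        C[i][i] += reg * trace if trace > 0 else reg
    w = solve(C, [1.0] * k)          # unnormalized code, cf. Eq. (14)
    s = sum(w)
    code = [0.0] * M
    for j, wj in zip(nn, w):
        code[j] = wj / s             # enforce the sum-to-one constraint, Eq. (15)
    return code
```

For a one-dimensional codebook `[[0.0], [1.0], [5.0]]` and `x = [0.25]` with `k=2`, the code is close to `[0.75, 0.25, 0.0]`, the weights that reconstruct `x` exactly from its two nearest codewords.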

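1.2.2 Difference-based encoding methods

Difference-based encoding methods build the coding space from differences: they describe the multi-order differences between the distribution of local features and the visual words, and different methods define different descriptions of the difference. Typical methods of this type are FV^[40-42], VLAD^[43], LTC^[44], and SVC^[45].

Fisher vector (FV) coding is based on the Fisher kernel and combines the advantages of generative and discriminative approaches. It compares the local features with every visual word in a GMM codebook and records the differences with respect to the Gaussian means (1st order) and variances (2nd order) as the coding result.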

$\text{d}_{\text{m}}^{\text{(1)}}=\frac{1}{K\sqrt{{{w}_{m}}}}\sum\limits_{p=1}^{K}{{{\alpha }_{p}}(m)(\frac{{{\text{x}}_{\text{p}}}-{{\mu }_{m}}}{{{\sigma }_{m}}})}$

(16)

$\text{d}_{\text{m}}^{\text{(2)}}=\frac{1}{K\sqrt{2{{w}_{m}}}}\sum\limits_{p=1}^{K}{{{\alpha }_{p}}(m)(\frac{{{({{\text{x}}_{\text{p}}}-{{\mu }_{m}})}^{2}}}{\sigma _{m}^{2}}-1)}$

(17)

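where $\{{{w}_{m}},{{\mu }_{m}},{{\sigma }_{m}}\}_{m=1}^{M}$ are the mixture weight, mean, and diagonal covariance of each visual word in the GMM dictionary B (the weights ${{w}_{m}}$ are the priors ${{\pi }_{m}}$ of Eq. (3)), and ${{\alpha }_{p}}(m)$ is the soft-assignment weight of feature $x_p$ to the $m$-th visual word. The FV coding result $\mathrm{D}\in {{\mathbb{R}}^{2D\times M}}$ is formed by stacking the 1st- and 2nd-order differences.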

$\text{FV}:D=[\text{d}_{\text{1}}^{\text{(1)}},\text{d}_{\text{1}}^{\text{(2)}},\text{d}_{\text{2}}^{\text{(1)}},\text{d}_{\text{2}}^{\text{(2)}},...,\text{d}_{\text{M}}^{\text{(1)}},\text{d}_{\text{M}}^{\text{(2)}}]$

(18)
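Eqs. (16) to (18) can be illustrated with a short routine. This is a sketch under stated assumptions: the GMM is given as lists `weights`, `means`, and `sigmas` (per-dimension standard deviations of a diagonal model), the soft-assignment weights are taken to be the GMM posteriors, and the function name `fisher_vector` is hypothetical.

```python
import math


def fisher_vector(features, weights, means, sigmas):
    """Stack 1st- and 2nd-order differences per visual word, Eq. (18)."""
    K, M, D = len(features), len(weights), len(means[0])
    d1 = [[0.0] * D for _ in range(M)]
    d2 = [[0.0] * D for _ in range(M)]
    for x in features:
        # Log of (prior * diagonal Gaussian density) for every component.
        logp = []
        for m in range(M):
            lp = math.log(weights[m])
            for xd, mu, s in zip(x, means[m], sigmas[m]):
                lp += -0.5 * math.log(2.0 * math.pi * s * s) - 0.5 * ((xd - mu) / s) ** 2
            logp.append(lp)
        top = max(logp)
        post = [math.exp(l - top) for l in logp]
        total = sum(post)
        for m in range(M):
            a = post[m] / total  # soft-assignment weight of x to word m
            for d in range(D):
                z = (x[d] - means[m][d]) / sigmas[m][d]
                d1[m][d] += a * z                # summand of Eq. (16)
                d2[m][d] += a * (z * z - 1.0)    # summand of Eq. (17)
    code = []
    for m in range(M):
        code += [v / (K * math.sqrt(weights[m])) for v in d1[m]]
        code += [v / (K * math.sqrt(2.0 * weights[m])) for v in d2[m]]
    return code
```

The returned list has length 2·D·M, so even a small GMM codebook yields a long image-level vector, which is why the evaluation later uses GMM sizes far smaller than the K-means codebooks.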

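VLAD coding can be seen as a non-probabilistic simplification of FV coding^[43] that uses only the first-order statistical differences between the feature space and the codebook. It uses a K-means codebook and assigns each local feature $x_t$ to its nearest visual word $b_i$. For each visual word $b_i$, the accumulated residual of all features assigned to it is computed, and together these residuals form the coding result $\mathrm{D}\in {{\mathbb{R}}^{D\times M}}$.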

$\text{VLAD}:D=[{{\text{d}}_{\text{1}}},{{\text{d}}_{\text{2}}},...,{{\text{d}}_{\text{M}}}]$

(19)

${{\text{d}}_{\text{i}}}=\sum\limits_{{{\text{x}}_{\text{t}}}:NN({{\text{x}}_{\text{t}}})=i}{({{\text{x}}_{\text{t}}}-{{\text{b}}_{\text{i}}})}$

(20)
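The accumulated residuals of Eqs. (19) and (20) take only a few lines. The sketch below assumes a K-means codebook stored as a list of vectors and returns the M residual blocks concatenated into one flat list; the function name `vlad` is illustrative.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def vlad(features, codebook):
    """Sum the residuals x_t - b_i over the features assigned to each word."""
    M, D = len(codebook), len(codebook[0])
    resid = [[0.0] * D for _ in range(M)]
    for x in features:
        # NN(x): index of the nearest visual word.
        i = min(range(M), key=lambda j: sq_dist(x, codebook[j]))
        for d in range(D):
            resid[i][d] += x[d] - codebook[i][d]
    # Concatenate [d_1, d_2, ..., d_M] as in Eq. (19).
    return [v for block in resid for v in block]
```

With codebook `[[0.0, 0.0], [10.0, 10.0]]` and features `[[1.0, 0.0], [0.0, 2.0], [9.0, 10.0]]`, the first two features fall to the first word and the third to the second, giving `[1.0, 2.0, -1.0, 0.0]`.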

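However, the saliency of each visual word differs from one terrain image to another. LTC and SVC add a weight factor $\theta_i$ to correct the accumulated residual $d_i$ of VLAD, and the corrected difference terms $\theta_i d_i$ make up the LTC and SVC coding result $\mathrm{D}\in {{\mathbb{R}}^{(D+1)\times M}}$.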

$D=[\alpha {{\theta }_{1}},{{\theta }_{1}}{{\text{d}}_{1}},\alpha {{\theta }_{2}},{{\theta }_{2}}{{\text{d}}_{\text{2}}},...,\alpha {{\theta }_{M}},{{\theta }_{M}}{{\text{d}}_{\text{M}}}]$

(21)

where α is a positive balancing factor. LTC and SVC differ in how the weight vector $\theta =[{{\theta }_{1}},{{\theta }_{2}},\ldots ,{{\theta }_{M}}]\in {{\mathbb{R}}^{M}}$ is defined: in LTC it is obtained by LCC coding, and in SVC it is defined by LSA coding^[19]. LTC and SVC capture the 0th- and 1st-order statistics of the local feature space.

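1.3 Pooling and normalization

The coding result D is a matrix and cannot be fed to a classifier directly. Pooling converts D into a fixed-length vector F, making the representation more compact and more robust to noise and viewpoint change^[46]. Normalization cancels the effect of differing numbers of local features across images so that the final representations of terrain images are of the same magnitude.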

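Difference-based encoding methods achieve the purpose of pooling by concatenation, so pooling methods proper are applied mainly to activation-based encoding. Typical pooling methods are Avg^[47], Max^[36], $L_p$^[48], MaxExp^[49], ExaPro^[49], and AxMin^[46]. They fall into two classes: classical pooling methods and likelihood-based pooling methods.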

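1) Classical pooling methods.

Avg, Max, and $L_p$ are classical pooling methods, which take an aggregate activation value of each visual word as the final representation. Avg pooling uses the mean activation response of all local features on each visual word:

$\text{Avg}:{{F}_{m}}=\frac{1}{K}\sum\limits_{k=1}^{K}{{{d}_{km}}}$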

(22)

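where $d_{km}$ is the activation response of the $k$-th local feature $x_k$ to the $m$-th visual word $b_m$, and $K$ is the number of local features in the image. Avg pooling is mathematically simple and widely used, but its output is dominated by frequently occurring, low-information features, so highly discriminative features are hard to express^[50]. Max pooling goes to the other extreme and keeps only the strongest activation response of each visual word, that is,

$\text{Max:}{{F}_{m}}=\underset{1\le k\le K}{\mathop{\max }}\,{{d}_{km}}$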

(23)

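$L_p$ pooling^[48] lies between Avg and Max and is expressed as

${{L}_{p}}:{{F}_{m}}={{\left( \frac{1}{K}\sum\limits_{k=1}^{K}{{{\left| {{d}_{km}} \right|}^{p}}} \right)}^{1/p}}$

(24)

where $p$ is a tuning parameter: when $p=1$, $L_p$ is identical to Avg pooling (for non-negative responses), and as $p\to \infty $, $L_p$ approaches Max pooling.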
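The three classical rules of Eqs. (22) to (24) operate column-wise on the code matrix. In the sketch below, `codes` is assumed to be a list of K per-feature codes, each of length M; the function names are illustrative.

```python
def avg_pool(codes):
    """Avg, Eq. (22): mean response of each visual word over all features."""
    return [sum(col) / len(codes) for col in zip(*codes)]


def max_pool(codes):
    """Max, Eq. (23): strongest response of each visual word."""
    return [max(col) for col in zip(*codes)]


def lp_pool(codes, p):
    """L_p, Eq. (24): p = 1 gives Avg; large p approaches Max."""
    return [(sum(abs(v) ** p for v in col) / len(codes)) ** (1.0 / p)
            for col in zip(*codes)]
```

Calling `lp_pool(codes, 1)` reproduces `avg_pool(codes)` for non-negative codes, and raising `p` moves the result toward `max_pool(codes)`, which makes `p` a single knob between the two extremes.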

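2) Likelihood-based pooling methods. These methods describe the probability that a visual word occurs in the input image. They assume that the codes of the local features are independent and identically distributed Bernoulli variables and compute different probability expressions as the pooling result. The main likelihood-based pooling methods are MaxExp, ExaPro, and AxMin. The @n strategy is widely used with them: for a visual word $b_m$, only its $n$ strongest activation responses ${{\Phi }_{km}}$ ($k=1,\ldots ,n$) are pooled^[46].

MaxExp and ExaPro describe the probability that visual word $b_m$ occurs at least once in the image. The two methods share this goal but use different mathematical expressions, that is,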

$\text{MaxExp:}{{F}_{m}}=1-{{(1-\frac{1}{n}\sum\limits_{k=1}^{n}{{{\Phi }_{km}}})}^{n}}$

(25)

$\text{ExaPro:}{{F}_{m}}=1-\prod\limits_{k=1}^{n}{(1-{{\Phi }_{km}})}$

(26)

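Because low-level image features are extracted from overlapping regions, the i.i.d. assumption overestimates their independence^[46]. AxMin introduces a parameter β to correct for this.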

$\text{AxMin:}{{F}_{m}}=\min (1,\beta \frac{1}{n}\sum\limits_{k=1}^{n}{{{\Phi }_{km}}}),1\le \beta \le n$

(27)
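Eqs. (25) to (27) all act on the n strongest responses of one visual word. The sketch below assumes activation values in [0, 1] and a code matrix stored as a list of per-feature codes; the helper `top_n` implements the @n selection, and all function names are illustrative.

```python
def top_n(codes, m, n):
    """@n strategy: the n strongest responses of visual word m."""
    return sorted((c[m] for c in codes), reverse=True)[:n]


def maxexp_pool(phi):
    """MaxExp, Eq. (25)."""
    n = len(phi)
    return 1.0 - (1.0 - sum(phi) / n) ** n


def exapro_pool(phi):
    """Eq. (26): probability that the word occurs at least once."""
    prod = 1.0
    for v in phi:
        prod *= 1.0 - v
    return 1.0 - prod


def axmin_pool(phi, beta):
    """AxMin, Eq. (27), with 1 <= beta <= n."""
    return min(1.0, beta * sum(phi) / len(phi))


def pool_image(codes, n, pool_fn, **kwargs):
    """Pool every visual word of an image from its top-n responses."""
    M = len(codes[0])
    return [pool_fn(top_n(codes, m, n), **kwargs) for m in range(M)]
```

For example, `pool_image(codes, 2, axmin_pool, beta=1.5)` returns one value per visual word. When all selected responses are equal, Eqs. (25) and (26) coincide, which is a quick sanity check on an implementation.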

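The coding and pooling methods of the BOVW framework are summarized in Table 1. Pooling makes image features more compact and robust but discards the spatial relationships between image regions. The main remedy at present is spatial pyramid matching (SPM)^[51]: pooling is performed separately in the cells of a pyramid grid and the results are concatenated, which partly preserves spatial information between regions at a higher computational cost. Note that terrain images differ from object images: flipping or shifting regions has little effect on the overall terrain image, so spatial information is not significant^[52]. For efficiency, the spatial relationships between regions can therefore be ignored.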

Table 1 Coding and pooling methods in the BOVW framework


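The main normalization methods are $L_1$ normalization^[37], $L_2$ normalization^[47], power normalization^[40], and intra normalization^[19]. $L_1$ and $L_2$ are the basic ones: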

${{L}_{1}}:\text{F}=\,\text{F}/{{\left\| \text{F} \right\|}_{1}}$

(28)

${{L}_{2}}:\text{F}=\,\text{F}/{{\left\| \text{F} \right\|}_{2}}$

(29)

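Power and intra normalization can be regarded as preprocessing for $L_1$ or $L_2$ and must be used together with one of them. Power normalization makes the final representation F smoother: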

$\text{Power:}{{F}_{i}}=\text{sgn}({{F}_{i}}){{\left| {{F}_{i}} \right|}^{\alpha }}$

(30)

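where $F_i$ is the $i$-th element of F and the exponent α controls the smoothing strength. Intra normalization is used only with difference-based encoding methods: before concatenation, the multi-order difference information $F^m$ of each visual word $b_m$ is $L_1$- or $L_2$-normalized: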

$Intra:\text{F}=\left[ \begin{matrix} \frac{{{\text{F}}^{\text{1}}}}{\left\| {{\text{F}}^{\text{1}}} \right\|}, & \frac{{{\text{F}}^{\text{2}}}}{\left\| {{\text{F}}^{\text{2}}} \right\|}, & ..., & \frac{{{\text{F}}^{\text{m}}}}{\left\| {{\text{F}}^{\text{m}}} \right\|}, & ..., & \frac{{{\text{F}}^{\text{M}}}}{\left\| {{\text{F}}^{\text{M}}} \right\|} \\ \end{matrix} \right]$

(31)

After intra normalization, the concatenated vector F as a whole must still be $L_1$- or $L_2$-normalized.
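The four normalization steps of Eqs. (28) to (31) compose naturally. The sketch below is illustrative: it assumes F is a flat list, that the M per-word blocks used by intra normalization have equal length, and that intra normalization uses the $L_2$ norm within each block (the $L_1$ norm is the other option named above). All function names and the default `alpha=0.5` are assumptions.

```python
import math


def l1_normalize(F):
    """Eq. (28): divide by the L1 norm (zero vectors are returned unchanged)."""
    s = sum(abs(v) for v in F)
    return [v / s for v in F] if s > 0 else list(F)


def l2_normalize(F):
    """Eq. (29): divide by the L2 norm (zero vectors are returned unchanged)."""
    s = math.sqrt(sum(v * v for v in F))
    return [v / s for v in F] if s > 0 else list(F)


def power_normalize(F, alpha=0.5):
    """Eq. (30): signed power; alpha controls the smoothing strength."""
    return [math.copysign(abs(v) ** alpha, v) for v in F]


def intra_normalize(F, M):
    """Eq. (31): normalize the block of each of the M visual words separately."""
    size = len(F) // M
    out = []
    for m in range(M):
        out += l2_normalize(F[m * size:(m + 1) * size])
    return out
```

A typical chain for a difference-based code would be `l2_normalize(intra_normalize(power_normalize(F), M))`, with the final global step applied as the paragraph above requires.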

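After pooling and normalization, the mid-level feature vector F describes the whole image. Compared with low-level visual features, F carries more semantic information, which helps the subsequent classification; F is then fed to a classifier to obtain the terrain class. Common classifiers include K-nearest neighbors (KNN)^[53], naive Bayes^[54], random forest^[55], artificial neural networks (ANN)^[56], support vector machines (SVM)^[57], and extreme learning machines (ELM)^[58]. The most widely used is the SVM, which is simple in structure and convenient to apply; with a kernel, it maps a feature space that is not linearly separable in low dimension to a higher-dimensional space where it becomes linearly separable. Common kernels are the polynomial, Gaussian, and linear kernels. The use of the various classifiers in terrain recognition is not the focus of this paper; see Ref.^[59] for details.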

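2 Evaluation

Because visual terrain classification has no common benchmark dataset, we built dataset DS1 for evaluation. DS1 contains eight typical terrain surfaces: asphalt, mud, grass, tile, gravel, coarse gravel, sand, and leaf-covered ground. The images were acquired under different illumination and weather conditions and share a resolution of 256×256 pixels. Each class has 300 images, for 2400 images in total. Typical examples of each class are shown in Fig. 3.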

Fig. 3 Terrain samples in dataset DS1

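For each terrain class, 90 images are randomly selected as the training set and 100 as the test set, with no overlap between the two. After the BOVW framework generates the mid-level features, an SVM with a linear kernel is used as the classifier, the penalty factor C is chosen by ten-fold cross-validation, and the implementation relies on LIBSVM^[57]. The evaluation metric is the mean classification accuracy (AP), averaged over 10 runs. SIFT is used throughout as the low-level visual feature, computed with the VLFeat toolbox^[60]; PCA-whitening then reduces the descriptors to 80 dimensions.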

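We first evaluate the effect of the coding method on terrain recognition accuracy. Six coding methods are selected from the two classes: SA, LSA, and LLC (activation-based) and VLAD, SVC, and FV (difference-based). The codebook size is varied (K-means: 100 to 3200; GMM: 2 to 64), $L_2$ normalization is used throughout, the activation-based methods use Max pooling, and the other coding parameters follow Ref.^[61]. The results are shown in Fig. 4.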

Fig. 4 Classification performance of different coding methods

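Fig. 4 shows that the coding methods differ widely in performance. LLC and LSA introduce a locality constraint, so similar local features receive similar codes, and they classify better than SA. The difference-based methods (VLAD, SVC, and FV) bring in multi-order feature information and improve classification markedly again. Codebook size also strongly affects the final accuracy and can be regarded as a key parameter of a coding method: the accuracy of every coding method increases with codebook size, but the gain slows as the codebook keeps growing.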

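For the activation-based methods (LSA, LLC, and SA), we evaluate the effect of pooling on terrain classification with six pooling methods: Avg, Max, $L_p$, MaxExp, AxMin, and ExaPro. The codebook size is set to 800, $L_2$ normalization is used throughout, and the pooling parameters follow Ref.^[46]. The results, shown in Fig. 5, indicate that the choice of pooling method has a striking effect: with a suitable pooling method, classification accuracy can increase severalfold.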

Fig. 5 Terrain classification performance of different pooling methods

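Finally, we evaluate the effect of normalization on terrain recognition. For the activation-based methods (SA, LSA, and LLC), the codebook size is set to 800 and Max pooling is used throughout; four normalization schemes are possible (with or without Power, combined with $L_1$ or $L_2$). For the difference-based methods (VLAD, SVC, and FV), the codebook size is set to 8 and eight schemes are possible (with or without Power and Intra, combined with $L_1$ or $L_2$). The results are shown in Fig. 6.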

Fig. 6 Terrain classification performance of different normalization methods

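Normalization strongly affects the final classification performance. The results show that the most critical factor is the choice between $L_1$ and $L_2$; for terrain recognition, $L_2$ normalization is clearly better. The effect of Power and Intra varies with the coding method, slightly strengthening or weakening classification.

Every step is crucial to good terrain classification, and an inappropriate method at any one step considerably degrades the final result. Evaluating the BOVW pipeline to find the best configuration is therefore an important part of designing a visual terrain classification system.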

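3 Conclusion

Visual terrain classification is an important means for outdoor mobile robots to perceive the surrounding terrain. The BOVW framework aggregates and integrates low-level visual features of terrain images into mid-level semantic features, narrows the semantic gap between low-level visual features and high-level semantics, and improves visual terrain classification; it is currently an important method and paradigm for visual terrain recognition. It comprises four main steps: feature extraction, codebook clustering, feature coding, and pooling and normalization. This paper has explained the role of each step and how the steps relate, summarized the methods available at each step, and systematically presented the BOVW-based approach to visual terrain classification. Evaluation on a terrain classification dataset built for this purpose shows that every step of the BOVW framework is crucial to classification performance. Only by determining the best pipeline and the corresponding optimal parameters can discriminative mid-level features be generated and good visual terrain classification achieved. Further study of the BOVW framework is therefore of considerable significance.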

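Summarizing research on the BOVW framework for visual terrain classification, we consider the main future research directions to be:

1) Terrain-specific feature design. Low-level features capture the basic visual appearance of an image, and their discriminative power directly affects the performance of BOVW mid-level features. Designing new terrain-specific features according to the characteristics of terrain scenes, so as to describe color, texture, edges, and other visual information more completely, would further improve how well the mid-level features describe terrain.

2) Improving the BOVW framework. The BOVW framework re-describes low-level image features with overcomplete basis vectors to gain discriminative power. Designing new coding methods^[62] and adopting new pooling techniques^[63] can therefore both strengthen the mid-level representation. Using multi-layer coding structures^[64-65] to add depth, generate higher-level feature representations, and obtain richer semantic information is another important direction.

3) Feature fusion. The BOVW framework emphasizes local feature description of terrain images and lacks a global representation; moreover, no single feature is equally discriminative for all terrain types. Fusing multiple complementary feature representations will therefore become a common technique in the design of terrain recognition systems.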

References

[1] Wilcox B H. Non-geometric hazard detection for a Mars microrover[C]//Proceedings of the 1994 AIAA Conference on Intelligent Robotics in Field, Factory, Service and Space. Houston, USA: IEEE, 1994: 675-684.

[2] Li B. Research and application on visual terrain classification and gait planning approaches of quadruped robot[D]. Ji'nan: Shandong University, 2012. [李彬. 视觉地形分类和四足机器人步态规划方法研究与应用[D]. 济南: 山东大学, 2012.] http://cdmd.cnki.com.cn/article/cdmd-10422-1013140674.htm

[3] Papadakis P. Terrain traversability analysis methods for unmanned ground vehicles: A survey[J]. Engineering Applications of Artificial Intelligence, 2013, 26(4): 1373–1385.[DOI: 10.1016/j.engappai.2013.01.006]

[4] Bajracharya M, Howard A, Matthies L H, et al. Autonomous off-road navigation with end-to-end learning for the LAGR program[J]. Journal of Field Robotics, 2009, 26(1): 3–25.[DOI: 10.1002/rob.20269]

[5] Garcia Bermudez F L, Julian R C, Haldane D W, et al. Performance analysis and terrain classification for a legged robot over rough terrain[C]//Proceedings of the 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems. Vilamoura, Portugal: IEEE, 2012: 513-519.[DOI: 10.1109/IROS.2012.6386243]

[6] Li Q, Xue K, Xu H, et al. Vibration-based terrain classification for mobile robots using support vector machine[J]. Robot, 2012, 34(6): 660–667. [李强, 薛开, 徐贺, 等. 基于振动采用支持向量机方法的移动机器人地形分类[J]. 机器人, 2012, 34(6): 660–667.][DOI: 10.3724/SP.J.1218.2012.00660]

[7] Kim D, Oh S M, Rehg J M. Traversability classification for ugv navigation: A comparison of patch and superpixel representations[C]//2007 IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE, 2007: 3166-3173.

[8] Khan Y N, Komma P, Zell A. High resolution visual terrain classification for outdoor robots[C]//Proceedings of the 2011 IEEE International Conference on Computer Vision Workshops. Barcelona, Spain: IEEE, 2011: 1014-1021.[DOI: 10.1109/ICCVW.2011.6130362]

[9] Chen M. Terrain classification in field environment based on illumination recognition and dynamic feature selection[D]. Tianjin: Nankai University, 2014. [陈铭. 基于光照识别和动态特征选择的野外环境地形分类[D]. 天津: 南开大学, 2014.]

[10] Gong Y H, Chuan C H, Guo X Y. Image indexing and retrieval based on color histograms[J]. Multimedia Tools and Applications, 1996, 2(2): 133–156.[DOI: 10.1007/BF00672252]

[11] Angelova A, Matthies L, Helmick D, et al. Fast terrain classification using variable-length representation for autonomous navigation[C]//Proceedings of the 2007 IEEE Conference on Computer Vision and Pattern Recognition. Minneapolis, MN, USA: IEEE, 2007: 1-8.[DOI: 10.1109/CVPR.2007.383024]

[12] Liu L, Kuang G Y. Overview of image textural feature extraction methods[J]. Journal of Image and Graphics, 2009, 14(4): 622–635. [刘丽, 匡纲要. 图像纹理特征提取方法综述[J]. 中国图象图形学报, 2009, 14(4): 622–635.][DOI: 10.11834/jig.20090409]

[13] Zou Y H, Chen W H, Xie L H, et al. Comparison of different approaches to visual terrain classification for outdoor mobile robots[J]. Pattern Recognition Letters, 2014, 38: 54–62.[DOI: 10.1016/j.patrec.2013.11.004]

[14] Zhao L J, Tang P, Huo L Z, et al. Review of the bag-of-visual-words models in image scene classification[J]. Journal of Image and Graphics, 2014, 19(3): 333–343. [赵理君, 唐娉, 霍连志, 等. 图像场景分类中视觉词包模型方法综述[J]. 中国图象图形学报, 2014, 19(3): 333–343.][DOI: 10.11834/jig.20140301]

[15] Filitchkin P. Visual terrain classification for legged robots[D]. Santa Barbara, USA: University of California, 2011.

[16] Filitchkin P, Byl K. Feature-based terrain classification for LittleDog[C]//Proceedings of the 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems. Vilamoura, Portugal: IEEE, 2012: 1387-1392.[DOI: 10.1109/IROS.2012.6386042]

[17] Zenker S, Aksoy E E, Goldschmidt D, et al. Visual terrain classification for selecting energy efficient gaits of a hexapod robot[C]//Proceedings of the 2013 IEEE/ASME International Conference on Advanced Intelligent Mechatronics. Wollongong, NSW, Australia: IEEE, 2013: 577-584.[DOI: 10.1109/AIM.2013.6584154]

[18] Khan Y N, Masselli A, Zell A. Visual terrain classification by flying robots[C]//Proceedings of the 2012 IEEE International Conference on Robotics and Automation. Saint Paul, MN, USA: IEEE, 2012: 498-503.[DOI: 10.1109/ICRA.2012.6224988]

[19] Peng X J, Wang L M, Wang X X, et al. Bag of visual words and fusion methods for action recognition: Comprehensive study and good practice[J]. Computer Vision and Image Understanding, 2016, 150: 109–125.[DOI: 10.1016/j.cviu.2016.03.013]

[20] Tian Y M, Lin G Q. Retrieval technique of color image based on color features[J]. Journal of Xidian University, 2002, 29(1): 43–46. [田玉敏, 林高全. 基于颜色特征的彩色图像检索方法[J]. 西安电子科技大学学报: 自然科学版, 2002, 29(1): 43–46.][DOI: 10.3969/j.issn.1001-2400.2002.01.010]

[21] Gotlieb C C, Kreyszig H E. Texture descriptors based on co-occurrence matrices[J]. Computer Vision, Graphics, and Image Processing, 1990, 51(1): 70–86.[DOI: 10.1016/S0734-189X(05)80063-5]

[22] Song Y, McLoughlin I V, Dai L R. Local coding based matching kernel method for image classification[J]. PLoS One, 2014, 9(8): e103575.[DOI: 10.1371/journal.pone.0103575]

[23] Wang X Z, He Y L, Wang D D. Non-naive Bayesian classifiers for classification problems with continuous attributes[J]. IEEE Transactions on Cybernetics, 2014, 44(1): 21–39.[DOI: 10.1109/TCYB.2013.2245891]

[24] Guo Y M, Zhao G Y, Pietikäinen M. Discriminative features for texture description[J]. Pattern Recognition, 2012, 45(10): 3834–3843.[DOI: 10.1016/j.patcog.2012.04.003]

[25] Guo Z H, Zhang L, Zhang D. Rotation invariant texture classification using LBP variance (LBPV) with global matching[J]. Pattern Recognition, 2010, 43(3): 706–719.[DOI: 10.1016/j.patcog.2009.08.017]

[26] Chatzichristofis S A, Boutalis Y S. CEDD: color and edge directivity descriptor: a compact descriptor for image indexing and retrieval[C]//Proceedings of the 6th International Conference on Computer Vision Systems. Berlin Heidelberg, Germany: Springer, 2008: 312-322.[DOI: 10.1007/978-3-540-79547-6_30]

[27] Chatzichristofis S A, Boutalis Y S. FCTH: fuzzy color and texture histogram-a low level feature for accurate image retrieval[C]//Proceedings of the 9th International Workshop on Image Analysis for Multimedia Interactive Services. Klagenfurt, Austria: IEEE, 2008: 191-196.[DOI: 10.1109/WIAMIS.2008.24]

[28] Chatzichristofis S A, Zagoris K, Boutalis Y S, et al. Accurate image retrieval based on compact composite descriptors and relevance feedback information[J]. International Journal of Pattern Recognition and Artificial Intelligence, 2010, 24(2): 207–244.[DOI: 10.1142/S0218001410007890]

[29] Lowe D G. Distinctive image features from scale-invariant keypoints[J]. International Journal of Computer Vision, 2004, 60(2): 91–110.[DOI: 10.1023/B:VISI.0000029664.99615.94]

[30] Bay H, Ess A, Tuytelaars T, et al. Speeded-up robust features (SURF)[J]. Computer Vision and Image Understanding, 2008, 110(3): 346–359.[DOI: 10.1016/j.cviu.2007.09.014]

[31] Arandjelović R, Zisserman A. Three things everyone should know to improve object retrieval[C]//Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition. Providence, RI: IEEE, 2012: 2911-2918.[DOI: 10.1109/CVPR.2012.6248018]

[32] Abdi H, Williams L J. Principal component analysis[J]. Wiley Interdisciplinary Reviews: Computational Statistics, 2010, 2(4): 433–459.[DOI: 10.1002/wics.101]

[33] van Gemert J C, Veenman C J, Smeulders A W M, et al. Visual word ambiguity[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2010, 32(7): 1271–1283.[DOI: 10.1109/TPAMI.2009.132]

[34] Arthur D, Vassilvitskii S. K-means++: The advantages of careful seeding[C]//18th Annual ACM-SIAM Symposium on Discrete Algorithms. Philadelphia, PA, USA: Society for Industrial and Applied Mathematics, 2007: 1027-1035.

[35] McLachlan G, Peel D. Finite mixture models[M]. Hoboken, New Jersey, USA: John Wiley & Sons, 2004 .

[36] Liu L Q, Wang L, Liu X W. In defense of soft-assignment coding[C]//Proceedings of the 2011 International Conference on Computer Vision. Barcelona, Spain: IEEE, 2011: 2486-2493.[DOI: 10.1109/ICCV.2011.6126534]

[37] Yang J C, Yu K, Gong Y H, et al. Linear spatial pyramid matching using sparse coding for image classification[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Miami, FL, USA: IEEE, 2009: 1794-1801.[DOI: 10.1109/CVPR.2009.5206757]

[38] Yu K, Zhang T, Gong Y. Nonlinear learning using local coordinate coding[C]//Advances in neural information processing systems. 2009: 2223-2231. http://www.oalib.com/references/16875404

[39] Wang J J, Yang J C, Yu K, et al. Locality-constrained linear coding for image classification[C]//Proceedings of the 2010 IEEE Conference on Computer Vision and Pattern Recognition. San Francisco, CA, USA: IEEE, 2010: 3360-3367.[DOI: 10.1109/CVPR.2010.5540018]

[40] Perronnin F, Sánchez J, Mensink T. Improving the fisher kernel for large-scale image classification[C]//Proceedings of the 11th European Conference on Computer Vision. Berlin Heidelberg, Germany: Springer, 2010: 143-156.[DOI: 10.1007/978-3-642-15561-1_11]

[41] Sánchez J, Perronnin F, Mensink T, et al. Image classification with the fisher vector: theory and practice[J]. International Journal of Computer Vision, 2013, 105(3): 222–245.[DOI: 10.1007/s11263-013-0636-x]

[42] Simonyan K, Vedaldi A, Zisserman A. Deep fisher networks for large-scale image classification[C]//Advances in neural information processing systems 26: 27th Annual Conference on Neural Information Processing Systems 2013. Lake Tahoe, Nevada, USA: NIPS, 2013: 163-171.

[43] Jégou H, Perronnin F, Douze M, et al. Aggregating local image descriptors into compact codes[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2012, 34(9): 1704–1716.[DOI: 10.1109/TPAMI.2011.235]

[44] Yu K, Zhang T. Improved local coordinate coding using local tangents[C]//Proceedings of the 27th International Conference on Machine Learning (ICML-10). Haifa, Israel: IMLS, 2010: 1215-1222.

[45] Zhou X, Yu K, Zhang T, et al. Image classification using super-vector coding of local image descriptors[C]//Proceedings of the 11th European Conference on Computer Vision. Berlin Heidelberg, Germany: Springer, 2010: 141-154.[DOI: 10.1007/978-3-642-15555-0_11]

[46] Koniusz P, Yan F, Mikolajczyk K. Comparison of mid-level feature coding approaches and pooling strategies in visual concept detection[J]. Computer Vision and Image Understanding, 2013, 117(5): 479-492.[DOI: 10.1016/j.cviu.2012.10.010]

[47] Lin Y Q, Lv F J, Zhu S H, et al. Large-scale image classification: fast feature extraction and SVM training[C]//Proceedings of the 2011 IEEE Conference on Computer Vision and Pattern Recognition. Providence RI, USA: IEEE, 2011: 1689-1696.[DOI: 10.1109/CVPR.2011.5995477]

[48] Boureau Y L, Ponce J, LeCun Y. A theoretical analysis of feature pooling in visual recognition[C]//Proceedings of the 27th International Conference on Machine Learning (ICML-10). Haifa, Israel: IMLS, 2010: 111-118.

[49] Boureau Y L. Learning hierarchical feature extractors for image recognition[D]. New York, USA: New York University, 2012. http://cn.bing.com/academic/profile?id=330167645&encoded=0&v=paper_preview&mkt=zh-cn

[50] Murray N, Perronnin F. Generalized max pooling[C]//Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Columbus, OH, USA: IEEE, 2014: 2473-2480.[DOI: 10.1109/CVPR.2014.317]

[51] Lazebnik S, Schmid C, Ponce J. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories[C]//Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. New York, NY, USA: IEEE, 2006, 2: 2169-2178.[DOI: 10.1109/CVPR.2006.68]

[52] Zhao B, Zhong Y F, Zhang L P. A spectral-structural bag-of-features scene classifier for very high spatial resolution remote sensing imagery[J]. ISPRS Journal of Photogrammetry and Remote Sensing, 2016, 116: 73-85.[DOI: 10.1016/j.isprsjprs.2016.03.004]

[53] Zhang H, Berg A C, Maire M, et al. SVM-KNN: Discriminative nearest neighbor classification for visual category recognition[C]//Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. New York, NY, USA: IEEE, 2006, 2: 2126-2136.[DOI: 10.1109/CVPR.2006.301]

[54] Zhang H, Su J. Naive Bayesian classifiers for ranking[C]//Proceedings of the 15th European Conference on Machine Learning: ECML 2004. Berlin Heidelberg, Germany: Springer, 2004: 501-512.[DOI: 10.1007/978-3-540-30115-8_46]

[55] Svetnik V, Liaw A, Tong C, et al. Random forest: a classification and regression tool for compound classification and QSAR modeling[J]. Journal of Chemical Information and Computer Sciences, 2003, 43(6): 1947-1958.[DOI: 10.1021/ci034160g]

[56] Yegnanarayana B. Artificial neural networks[M]. New Delhi: PHI Learning Pvt. Ltd, 2009.

[57] Chang C C, Lin C J. LIBSVM: A library for support vector machines[J]. ACM Transactions on Intelligent Systems and Technology, 2011, 2(3): #27.[DOI: 10.1145/1961189.1961199]

[58] Huang G B, Zhu Q Y, Siew C K. Extreme learning machine: a new learning scheme of feedforward neural networks[C]//Proceedings of the 2004 IEEE International Joint Conference on Neural Networks. Budapest, Hungary: IEEE, 2004, 2: 985-990.[DOI: 10.1109/IJCNN.2004.1380068]

[59] Khan Y N. Visual terrain classification for outdoor mobile robots[D]. Tübingen, Germany: Universität Tübingen, 2013.

[60] Vedaldi A, Fulkerson B. VLFeat: An open and portable library of computer vision algorithms[C]//Proceedings of the 18th ACM International Conference on Multimedia. New York, NY, USA: ACM, 2010: 1469-1472.[DOI: 10.1145/1873951.1874249]

[61] Chatfield K, Lempitsky V, Vedaldi A, et al. The devil is in the details: an evaluation of recent feature encoding methods[C]//Proceedings of British Machine Vision Conference. Dundee, UK: British Machine Vision Association, 2011.

[62] Chen S H, Shi W R, Lv X. Feature coding for image classification combining global saliency and local difference[J]. Pattern Recognition Letters, 2015, 51: 44-49.[DOI: 10.1016/j.patrec.2014.08.008]

[63] Avila S, Thome N, Cord M, et al. Pooling in image representation: The visual codeword point of view[J]. Computer Vision and Image Understanding, 2013, 117(5): 453-465.[DOI: 10.1016/j.cviu.2012.09.007]

[64] Wu H, Liu B Z, Su W H, et al. Hierarchical coding vectors for scene level land-use classification[J]. Remote Sensing, 2016, 8(5): #436.[DOI: 10.3390/rs8050436]

[65] Peng X J, Zou C Q, Qiao Y, et al. Action recognition with stacked Fisher vectors[C]//Proceedings of the 13th European Conference on Computer Vision. Switzerland: Springer International Publishing, 2014: 581-595.[DOI: 10.1007/978-3-319-10602-1_38]