Published: 2021-07-16
DOI: 10.11834/jig.200503
2021 | Volume 26 | Number 7

Image Understanding and Computer Vision

Multimodal zero-shot human action recognition

吕露露¹, 黄毅², 高君宇², 杨小汕², 徐常胜²

1. 郑州大学, 郑州 450000;

2. 中国科学院自动化研究所模式识别国家重点实验室, 北京 100190

Received: 2020-08-22; Revised: 2021-01-17; Preprint: 2021-01-24

Funding: National Key Research and Development Program of China (2018AAA0100604); National Natural Science Foundation of China (61720106006, 62072455, 61702511, 61751211, U1836220, U1705262)

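Author biographies: Lyu Lulu (吕露露), born in 1995, female, master's student; main research interest: zero-shot image classification. E-mail: 1176515019@qq.com
Huang Yi (黄毅), male, doctoral student; main research interests: video and text analysis and multimodal collaborative analysis. E-mail: yi.huang@nlpr.ia.ac.cn
Gao Junyu (高君宇), male, assistant researcher; main research interests: pattern recognition and intelligent systems. E-mail: gaojunyu2015@ia.ac.cn
Yang Xiaoshan (杨小汕), male, associate researcher; main research interests: pattern recognition and multimedia content analysis. E-mail: xiaoshan.yang@nlpr.ia.ac.cn
Xu Changsheng (徐常胜), corresponding author, male, researcher; main research interests: multimedia content analysis, indexing, and retrieval, pattern recognition, and computer vision. E-mail: csxu@nlpr.ia.ac.cn
*Corresponding author: Xu Changsheng (徐常胜), csxu@nlpr.ia.ac.cn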

CLC number: TP391

Document code: A

Article ID: 1006-8961(2021)07-1658-10

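Abstract

Objective In human action recognition research, a growing number of studies achieve zero-shot recognition from video features. Most of them, however, rely on single-modality data, and multimodal fusion has received little attention. To study how multiple modalities affect zero-shot human action recognition, this paper proposes a zero-shot human action recognition framework based on multimodal fusion (ZSAR-MF). Method The framework consists of a sensor feature extraction module, a classification module, and a video feature extraction module. Specifically, the sensor feature extraction module uses a convolutional neural network (CNN) to extract heart rate and acceleration features; the classification module uses the word vectors of all concepts (sensor features, actions, and object names) to generate action category classifiers; the video feature extraction module maps each action's attributes, object scores, and sensor features into an attribute-feature space, and the classifiers generated by the classification module are finally used to evaluate each action's attributes and sensor features. Result Experiments on the Stanford-ECM dataset show that the proposed ZSAR-MF model improves recognition accuracy by about 4 percentage points over zero-shot recognition models based on single-modality data. Conclusion The proposed framework effectively fuses sensor and video features and significantly improves the accuracy of zero-shot human action recognition.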

Keywords

zero-shot; multimodal fusion; action recognition; sensor data; video features

Multimodal-based zero-shot human action recognition

Lyu Lulu¹, Huang Yi², Gao Junyu², Yang Xiaoshan², Xu Changsheng²

1. Zhengzhou University, Zhengzhou 450000, China;

2. National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China

Supported by: National Key Research and Development Program of China (2018AAA0100604); National Natural Science Foundation of China (61720106006, 62072455, 61702511, 61751211, U1836220, U1705262)

Abstract

Objective Human action recognition is one of the research hotspots in computer vision because of its wide application in human-computer interaction, virtual reality, and video surveillance. With the development of related technology in recent years, human action recognition algorithms based on deep learning have achieved good recognition performance when the sample size is sufficient. However, recognizing human actions is difficult when samples are scarce or missing. Zero-shot recognition addresses this problem and has attracted considerable attention because it can directly classify "unseen" categories that are not in the training set. In the past decade, numerous methods have been proposed to perform zero-shot human action recognition by using video features, and they have achieved promising improvements. However, most current methods are based on single-modality data, and few studies have addressed multimodal fusion. To study the influence of multimodal fusion on zero-shot human action recognition, this study proposes a zero-shot human action recognition framework based on multimodal fusion (ZSAR-MF). Method Unlike most previous methods, which fuse external information with video features or study single-modality video features only, our study focuses on how the sensor features most related to the activity state can improve recognition performance. The framework is mainly composed of a sensor feature-extraction module, a classification module, and a video feature-extraction module. Specifically, the sensor feature-extraction module uses a convolutional neural network (CNN) to extract the acceleration and heart rate features of human actions and to predict the most relevant feature words for each action.
The classification module uses the word vectors of all concepts (sensor features, action names, and object names) to generate action category classifiers. The "seen" category classifiers are obtained by learning from the training data of these categories, and the "unseen" category classifiers are generalized from the "seen" category classifiers by using a graph convolutional network (GCN). The video feature-extraction module extracts the video features of each action and maps the attributes of human actions, object scores, and sensor features into the attribute-feature space. Finally, the classifiers generated by the classification module are used to evaluate the features of each video and calculate the action class scores. Result The experiments are conducted on the Stanford-ECM dataset, which provides both sensor and video data. The dataset includes 23 types of human action video together with heart rate and acceleration data synchronized with the collected video. The experiment can be divided into three steps. First, we remove the 7 actions that do not meet the experimental conditions and select the remaining 16 actions as the experimental dataset. Then, we select three methods to perform experiments on zero-shot human action recognition. A comparison of the experimental results shows that the zero-shot action recognition via two-stream GCNs and knowledge graphs (TS-GCN) method scores approximately 8 percentage points higher than the zero-shot image classification based on generative adversarial network (ZSIC-GAN) method, which demonstrates the auxiliary role of knowledge graphs in describing actions with external semantic information and the advantage of GCN. Our proposed method scores approximately 12 and 4 percentage points higher than the ZSIC-GAN and TS-GCN methods, respectively, which indicates that, for zero-shot human action recognition, fusing sensor and video features is better than using video features only.
Furthermore, we verify the influence of the number of GCN layers on the recognition accuracy and analyze the reasons for the result. The experimental results show that adding more layers to the three-layer model does not significantly improve recognition accuracy. One potential reason is that the amount of training data is too small, so overfitting occurs in the deeper network. Conclusion Sensor and video data describe human activity patterns comprehensively from different views, which facilitates zero-shot human action recognition based on multimodal fusion. Unlike most multimodal fusion methods, which rely on text descriptions of the action or on audio data and image features, our study fuses the sensor and video features that are most related to the activity state and pays close attention to the original features of the action. Our framework includes three parts: a sensor feature-extraction module, a classification module, and a video feature-extraction module. It integrates video features with features extracted from sensor data. The two kinds of features are modeled by using knowledge graphs, and the entire network is optimized by using a classification loss function. The experimental results on the Stanford-ECM dataset demonstrate the effectiveness of the proposed framework. By fully fusing sensor and video features, we significantly improve the accuracy of zero-shot human action recognition.

Key words

zero-shot; multimodal fusion; action recognition; sensor data; video features

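0 Introduction

Human action recognition is one of the research hotspots in computer vision and is widely used in biometrics, video surveillance, and human-computer interaction (He, 2019). As the related technology has matured, deep-learning-based human action recognition algorithms achieve good recognition performance when training samples are plentiful. How to recognize new actions when samples are scarce or entirely missing, however, has received comparatively little study.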

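Because human actions are diverse and complex, researchers urgently need a technique that can still recognize target categories when visual annotations for those categories are completely absent (He, 2019). Zero-shot recognition addresses this need and has attracted considerable attention because it can directly classify categories that do not appear in the training set, called unseen categories (Zhu et al., 2018).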

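Existing zero-shot recognition methods fall into two types. The first recognizes actions through manually defined attributes (Liu et al., 2011), using the relationship between actions and attributes to identify unseen classes. Because defining attributes for actions is very difficult, these attribute-based methods are impractical for zero-shot human action recognition in real settings. The second models the relationships among actions in a semantic space by using semantic representations of the action names (Xu et al., 2016); commonly used semantic information includes manually defined attributes (Farhadi et al., 2009) and word vectors extracted automatically from auxiliary text corpora (Akata et al., 2015). Fig. 1 shows an example of zero-shot human action recognition in which the training and test sets share no actions.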

Fig. 1 Example of zero-shot human action recognition

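Although these methods are simple and effective, a word-embedding space can express only action-to-action relationships and can hardly exploit the semantic information inside action videos to improve accuracy. To address this, Jain et al. (2015) and Mettes and Snoek (2017) treated objects as attributes for zero-shot recognition and used pretrained object classifiers to find the objects in action videos. Gao et al. (2019) incorporated a knowledge graph (Speer et al., 2017) into this approach, using external knowledge to improve the generalization of zero-shot recognition, and obtained good results on common datasets.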

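Building on the method of Gao et al. (2019), this paper proposes a zero-shot human action recognition framework based on multimodal fusion, whose structure is shown in Fig. 2. The training and test sets share no actions; solid arrows denote the training process and dashed arrows denote the testing process.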

Fig. 2 Structure diagram of zero-shot human action recognition based on multimodal fusion

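Specifically, the framework consists of a sensor feature extraction module, a classification module, and a video feature extraction module. The sensor feature extraction module uses a convolutional neural network (CNN) to extract heart rate and acceleration features. To extract information from sensor features and video more effectively, the classification module and the video feature extraction module are built on graph convolutional networks. The classification module uses the word vectors of all concepts (sensor features, actions, and object names) to generate action category classifiers. The video feature extraction module maps each action's attributes, object scores, and sensor features into an attribute-feature space, and the whole network is optimized with a classification loss. Experiments on the Stanford-ECM dataset (Nakamura et al., 2017) demonstrate the effectiveness of the model.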

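In summary, this work makes the following contributions:

1) A new zero-shot human action recognition method based on multimodal fusion (ZSAR-MF) is proposed. Its classification module uses the word vectors of sensor features, actions, and object names to generate action category classifiers; its video feature extraction module adds the feature words corresponding to the sensor features to the constructed knowledge graph as part of its nodes, thereby achieving multimodal fusion.

2) A new way of fusing data is proposed. Previous multimodal fusion methods use either early fusion by feature concatenation or late fusion by weighting results. In contrast, this paper uses a graph convolutional network (Kipf and Welling, 2017) to fuse action attributes, object scores, and sensor features, and uses the graph convolutional network together with a knowledge graph to learn the action classifiers.

3) Experiments on the Stanford-ECM dataset show that, compared with methods that use only video features for zero-shot human action recognition, the proposed model improves recognition accuracy by about 4 percentage points.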

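1 Related work

1.1 Multimodal fusion

A modality is the way in which something happens or exists, and multimodality is any combination of two or more modalities. Multimodal research has passed through four periods: multimodal research on human behavior, multimodal computer processing, multimodal interaction, and multimodal deep learning (Liu et al., 2020). This paper addresses one branch of multimodal research on human behavior: zero-shot human action recognition with multimodal fusion.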

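A conventional video contains several data sources, such as video, audio, and text, and each modality has different properties that describe different aspects of an event. Early multimodal learning methods used restricted Boltzmann machines (RBM) (Srivastava and Salakhutdinov, 2014) or log-bilinear models (Kiros et al., 2014) to learn the distributions of sentences and images, fusing text descriptions with image features. Piergiovanni and Ryoo (2020) proposed a joint multimodal representation space and used an adversarial formulation on unpaired text and video data to improve the joint embedding space. Frome et al. (2013) proposed a deep visual-semantic embedding model that identifies visual objects by using labeled image data together with semantic information from unannotated text. Ngiam et al. (2011) designed an autoencoder that learns a fused audio-video representation, with the resulting classifier trained on one modality and tested on the other; this method, however, relies on layer-wise training rather than end-to-end training.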

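1.2 Zero-shot action recognition

Zero-shot action recognition recognizes target actions even when visual annotations for them are completely absent, by using known semantic information to train classifiers for unknown actions. It can therefore be regarded as research built on human descriptions of categories.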

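Manually defined attributes are highly subjective, and a word-embedding space can express only simple action-to-action relationships and cannot exploit other information in action videos to improve accuracy. In recent years, using objects as category attributes has been shown to suit action recognition well. Jain et al. (2015) built a semantic embedding model from thousands of object categories; Mettes and Snoek (2017) further applied spatial-aware object embeddings to zero-shot action classification; Gan et al. (2016) constructed analogy pooling from external knowledge to achieve zero-shot action recognition. None of these methods can be trained end to end. Gao et al. (2019) improved on them by treating objects as attributes for zero-shot action recognition and incorporating a knowledge graph (Speer et al., 2017) on the one hand, and by using graph convolutional networks to extract external knowledge on the other; their method explicitly models the relationships between objects and action categories in an end-to-end manner.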

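Building on Gao et al. (2019), this paper proposes a zero-shot human action recognition framework based on multimodal fusion. Unlike most previous methods, which fuse external information with image features or study image features alone, the proposed method focuses on how sensor features, which are the most closely related to the action state, affect recognition, and fuses each action's sensor features with its image features so that the sensor features complement the image features.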

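1.3 Graph convolutional networks

The graph convolutional network (Kipf and Welling, 2017) generalizes convolutional neural networks to graph-structured data for feature extraction. Because it applies directly to graph data, uses the structural information of the graph, and mines graph data effectively, it is widely used in action recognition. Lee et al. (2018) modeled a knowledge graph with a gated graph neural network to describe the relationships among multiple labels, but this method generates only one classifier for all categories and does not incorporate the labels of unseen classes during training. Gao et al. (2018) designed a graph convLSTM model that identifies the local knowledge structure within video clips while modeling the dynamics between consecutive clips to improve video classification. Wang et al. (2018) designed a zero-shot recognition model that takes the semantic embedding of each node as the input to a graph convolutional network, but this method considers only label-label relationships and does not use attributes.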

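To extract information from sensor features and video more effectively, this paper builds the classification module and the video feature extraction module on graph convolutional networks, uses a knowledge graph (Speer et al., 2017) to model the sensor features, the actions, and the objects extracted from video, and optimizes the whole framework directly with a classification loss. During training the model learns a separate classifier for each action, which gives it good generalization ability.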

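2 Method

2.1 Problem definition

In zero-shot human action recognition based on multimodal fusion, $N_{s}$ videos are selected from a source dataset with $S$ seen categories to form the training set $\boldsymbol{D}^{s}$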

$ \boldsymbol{D}^{s}=\left\{\boldsymbol{F}^{s}, \boldsymbol{Y}^{s}\right\} $

where $s \in \left[1, \cdots, S\right]$, each video $\boldsymbol{f}^{s} \in \boldsymbol{F}^{s}$ is associated with an action label $y^{s} \in \boldsymbol{Y}^{s}$, and $\boldsymbol{f}^{s}$ consists of video data $\boldsymbol{v}^{s}$, acceleration features $\boldsymbol{a}^{s}$, and heart rate features $\boldsymbol{h}^{s}$.

Similarly, $N_{u}$ videos are selected from a source dataset with $U$ unseen categories to form the test set $\boldsymbol{D}^{u}$

$ \boldsymbol{D}^{u}=\left\{\boldsymbol{F}^{u}, \boldsymbol{Y}^{u}\right\} $

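where $\boldsymbol{Y}^{s} \cup \boldsymbol{Y}^{u} = \boldsymbol{Y}$, $\boldsymbol{Y}^{s} \cap \boldsymbol{Y}^{u} = \varnothing$, and each target video $\boldsymbol{f}^{u}$ consists of video data $\boldsymbol{v}^{u}$, acceleration features $\boldsymbol{a}^{u}$, and heart rate features $\boldsymbol{h}^{u}$. In addition, an object set $\boldsymbol{O}$ containing $o$ objects serves as the attributes that describe actions.

The model takes video features and sensor features as input and is evaluated by its zero-shot human action recognition accuracy; higher accuracy indicates better recognition performance.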

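2.2 Overall framework

The overall framework consists of the sensor feature extraction module, the classification module, and the video feature extraction module. Sensor features ($\boldsymbol{G}$) are first extracted from the sensor data, and video features are then extracted from frames sampled from the video, yielding the concepts associated with all actions, namely the seen classes ($S$), unseen classes ($U$), and objects ($O$). Finally, a graph is built with one node for each concept (the sensor features and the concepts associated with all actions). The overall framework is shown in Fig. 3.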

Fig. 3 Framework diagram of zero-shot human action recognition based on multimodal fusion

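2.3 Sensor feature extraction module

This paper uses the convolutional neural network proposed by Zhang et al. (2019) for sensor-signal feature extraction, which derives the sensor features of human actions from triaxial acceleration data and heart rate data. The model structure is shown in Fig. 4, where Conv denotes a convolutional layer, MaxPool a max-pooling layer, and FC a fully connected layer.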

Fig. 4 Sensor feature extraction network

As stated in the problem definition, $\boldsymbol{f}^{s}$ contains, in addition to video data, a series of acceleration and heart rate data

$ \left\{\left(\boldsymbol{a}_{1}, \boldsymbol{h}_{1}\right),\left(\boldsymbol{a}_{2}, \boldsymbol{h}_{2}\right), \cdots,\left(\boldsymbol{a}_{T}, \boldsymbol{h}_{T}\right)\right\} $

where $\boldsymbol{a}_{t}$ and $\boldsymbol{h}_{t}$ are the acceleration data and the heart rate data of sample $t$ ($t = 1, \cdots, T$), and $T$ is the number of data samples.

For acceleration data with three directions (the x, y, and z axes), $\boldsymbol{a}_{t}=\left\{\boldsymbol{a}_{x, t}, \boldsymbol{a}_{y, t}, \boldsymbol{a}_{z, t}\right\}$, a convolution kernel of width 2 is used during feature extraction to capture the correlation between the sensor axes and the single-axis time-series features

$ \boldsymbol{acc}_{t}^{s}=\mathrm{CONV}_{2}\left(\boldsymbol{a}_{t}\right)=\boldsymbol{a}_{t} * \boldsymbol{\varGamma}_{2} $

(1)

where $\mathrm{CONV}(\cdot)$ and $*$ denote the convolution operation, and $\boldsymbol{\varGamma}_{2}$ denotes a convolution kernel of width 2.

For heart rate data $\boldsymbol{h}_{t}$, which has only a single direction, a convolution kernel of width 1 is used to extract the time-series features of heart rate variation

$ \boldsymbol{hea}_{t}^{s}=\mathrm{CONV}_{1}\left(\boldsymbol{h}_{t}\right)=\boldsymbol{h}_{t} * \boldsymbol{\varGamma}_{1} $

(2)

where $\boldsymbol{\varGamma}_{1}$ denotes a convolution kernel of width 1.

These convolutions yield the acceleration and heart rate features, which are concatenated to obtain the sensor signal feature $\boldsymbol{e}_{t}^{s}$

$ \boldsymbol{e}_{t}^{s}=\boldsymbol{acc}_{t}^{s} \oplus \boldsymbol{hea}_{t}^{s} $

(3)

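To enrich the semantic information of the sensor features, the feature $\boldsymbol{c}_{t}^{s}$ of the fully connected layer before the output layer is also retained and concatenated with the sensor signal feature $\boldsymbol{e}_{t}^{s}$ to obtain the final sensor feature $\boldsymbol{g}_{t}^{s}$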

$ \boldsymbol{g}_{t}^{s}=\boldsymbol{e}_{t}^{s} \oplus \boldsymbol{c}_{t}^{s} $

(4)

The sensor feature $\boldsymbol{G}^{s}$ of each action can then be expressed as

$ \boldsymbol{G}^{s}=\left[\boldsymbol{g}_{1}^{s}, \boldsymbol{g}_{2}^{s}, \cdots, \boldsymbol{g}_{T}^{s}\right] $
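Eqs. (1) to (4) can be illustrated with a small pure-Python sketch. It is not the network of Zhang et al. (2019): the kernel weights, the input values, and the helper names `conv1d_valid` and `sensor_feature` are hypothetical, and the width-2 kernel is assumed to slide across the three axis values of a single sample.

```python
def conv1d_valid(signal, kernel):
    """Slide `kernel` over `signal` without padding (cross-correlation,
    which is what deep-learning libraries usually call convolution)."""
    width = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(width))
        for i in range(len(signal) - width + 1)
    ]


def sensor_feature(a_t, h_t, kernel_2, kernel_1, c_t):
    """Build g_t from one acceleration sample a_t = (x, y, z) and one
    heart-rate value h_t, following Eqs. (1)-(4)."""
    acc_t = conv1d_valid(a_t, kernel_2)    # Eq. (1): width-2 kernel over the axes
    hea_t = conv1d_valid([h_t], kernel_1)  # Eq. (2): width-1 kernel on heart rate
    e_t = acc_t + hea_t                    # Eq. (3): list concatenation plays the role of the concatenation operator
    return e_t + c_t                       # Eq. (4): append the fully connected feature c_t


# Hypothetical numbers, for illustration only.
g_t = sensor_feature(
    a_t=[0.1, -0.2, 0.9],
    h_t=72.0,
    kernel_2=[1.0, -1.0],
    kernel_1=[0.01],
    c_t=[0.5, 0.5],
)
print(len(g_t))
```

With these inputs `g_t` has 2 + 1 + 2 = 5 entries: two from the acceleration convolution, one from the heart rate convolution, and two from the assumed fully connected feature.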

Next, a fully connected layer with a softmax activation generates the probability distribution over action categories for the sensor features

$ y_{\text {seen }}^{s}=\sigma\left(W \boldsymbol{G}^{s}+b\right) $

(5)

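where $\sigma$ is the softmax operation, and $W$ and $b$ are the weight and bias of the fully connected layer. Finally, a cross-entropy loss is optimized to obtain the category corresponding to the sensor features, from which the most relevant feature words are predicted for each action

$ \mathcal{L}_{\text {seen }}=-\sum\limits_{s=1}^{S} y^{s} \log y_{\text {seen }}^{s} $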

(6)
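As a minimal sketch of Eqs. (5) and (6), the following code applies a softmax to a vector of class scores and evaluates the cross-entropy against a one-hot label. The scores stand in for the output of the fully connected layer and are hypothetical, as are the function names.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of class scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(one_hot, probs):
    """Cross-entropy of Eq. (6) for a single sample with a one-hot label."""
    return -sum(y * math.log(p) for y, p in zip(one_hot, probs) if y > 0)


# Hypothetical pre-softmax scores for three seen classes.
y_seen = softmax([2.0, 1.0, 0.1])
loss = cross_entropy([1, 0, 0], y_seen)
print(y_seen, loss)
```

Because the label is one-hot, the loss reduces to the negative log-probability assigned to the true class.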

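2.4 Classification module

A graph convolutional network extracts useful feature vectors by mining graph data and exploiting the structural information of the graph. The model here is built on the graph convolutional network of Kipf and Welling (2017). Given an undirected graph with $m$ nodes, a set of edges between the nodes, and an adjacency matrix $\boldsymbol{A} \in \mathbf{R}^{m \times m}$, the linear transformation of graph convolution is obtained from the product of the graph signal $\boldsymbol{X} \in \mathbf{R}^{k \times m}$ and the convolution kernel $\boldsymbol{W} \in \mathbf{R}^{k \times c}$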

$ \boldsymbol{Z}=\hat{\boldsymbol{D}}^{-\frac{1}{2}} \hat{\boldsymbol{A}} \hat{\boldsymbol{D}}^{-\frac{1}{2}} \boldsymbol{X}^{\mathrm{T}} \boldsymbol{W} $

(7)

where $\boldsymbol{Z}$ contains the generated feature vectors, $\hat{\boldsymbol{A}}=\boldsymbol{A}+\boldsymbol{I}$, $\boldsymbol{I}$ is the identity matrix, and $\hat{\boldsymbol{D}}$ is the degree matrix with $\hat{\boldsymbol{D}}_{ii}=\sum\limits_{j} \hat{\boldsymbol{A}}_{ij}$.
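The propagation rule of Eq. (7) can be sketched in plain Python as follows. This is an illustrative single linear layer, not the authors' implementation; the toy graph, the signal `x`, and the kernel `w` are hypothetical, and matrices are stored as lists of rows.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def gcn_layer(adj, x, w):
    """Z = D^-1/2 (A + I) D^-1/2 X^T W, as in Eq. (7).

    adj: m x m adjacency matrix, x: k x m graph signal, w: k x c kernel.
    """
    m = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(m)] for i in range(m)]
    d_inv_sqrt = [sum(row) ** -0.5 for row in a_hat]  # diagonal of D^-1/2
    a_norm = [[d_inv_sqrt[i] * a_hat[i][j] * d_inv_sqrt[j] for j in range(m)] for i in range(m)]
    return matmul(matmul(a_norm, transpose(x)), w)


# Hypothetical 3-node path graph, k = 2 input dimensions, c = 1 output dimension.
adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
x = [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
w = [[1.0], [1.0]]
z = gcn_layer(adj, x, w)
print(z)
```

The result has one row per node ($m \times c$), and the two end nodes of the path receive the same value because the graph and the signal are symmetric.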

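Inspired by graph convolutional networks, the classification module takes the word vectors of the actions' sensor feature words, the actions, and the object names as graph nodes, and during training uses the graph convolutional network to derive classifiers for the unseen classes ($U$) from the seen classes ($S$). The seen-class classifiers are learned from labeled training data, and the unseen-class classifiers are generalized from the seen-class classifiers through the graph convolutional network during training.

The classification module uses an $L$-layer graph convolutional network in which layer $l$ takes the feature matrix produced by the previous layer ($\boldsymbol{Z}_{l-1}^{\mathrm{cls}}$) as input and produces the feature matrix of the current layer ($\boldsymbol{Z}_{l}^{\mathrm{cls}}$). The input to the module is a $k \times (S+U+O+G)$ matrix $\boldsymbol{X}^{\mathrm{cls}}$, that is, the word-embedding vectors of the sensor feature words, actions, and object names, where $k$ is the dimension of the word vectors. The output is a $d \times (S+U+O+G)$ matrix $\boldsymbol{w}^{\mathrm{cls}}$, where $d$ is the dimension of the classifiers. The module uses the GloVe text model (Pennington et al., 2014) trained on the Wikipedia dataset as the word-embedding model, which gives a 300-dimensional vector representation for each action and object.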

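2.5 Video feature extraction module

Like the classification module, the video feature extraction module uses a graph convolutional network to map each action's attributes, object scores, and sensor features into the attribute-feature space; the classifiers generated by the classification module are then used to evaluate each action's attributes and sensor features.

To obtain object scores, each video is split into frames, and the GoogLeNet model (Szegedy et al., 2015) is used to obtain, for each image, probabilities over 12 988 categories (Mettes et al., 2016). The top $K$ most relevant objects are then selected for each image (Jain et al., 2015), and their probabilities are used as the corresponding object scores. The object scores of the different frames together form the original video representation $\boldsymbol{V}$. Because modeling the temporal dynamics of a video is important for extracting video features, the method of Jing et al. (2019) is used to apply self-attention to $\boldsymbol{V}$, giving a video representation $\hat{\boldsymbol{V}}$ that carries temporal dynamic information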

$ \hat{\boldsymbol{V}}_{s}=\gamma \sum\limits_{i=1}^{I} \alpha_{s, i} \boldsymbol{r}\left(\boldsymbol{v}_{i}\right)+\boldsymbol{v}_{s} $

(8)

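where $I$ is the number of frames in the action video, $\gamma$ is a learnable scale parameter (initialized to 0), $\boldsymbol{r}(\cdot)$ is a convolutional layer with a 1×1 kernel, $\boldsymbol{v}_{i}$ and $\boldsymbol{v}_{s}$ are column vectors of $\boldsymbol{V}$, and $\alpha_{s, i}$ is the attention weight obtained by computing the similarity between $\boldsymbol{v}_{i}$ and $\boldsymbol{v}_{s}$ and normalizing it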

$ \alpha_{s, i}=\frac{1}{C_{s}} \exp \left(\boldsymbol{b}\left(\boldsymbol{v}_{i}\right)^{\mathrm{T}} \boldsymbol{c}\left(\boldsymbol{v}_{s}\right)\right) $

(9)

where $\boldsymbol{b}(\cdot)$ and $\boldsymbol{c}(\cdot)$ are convolutional layers with 1×1 kernels, and $C_{s}=\sum\limits_{i=1}^{I} \exp \left(\boldsymbol{b}\left(\boldsymbol{v}_{i}\right)^{\mathrm{T}} \boldsymbol{c}\left(\boldsymbol{v}_{s}\right)\right)$ is the normalization coefficient, which sums over all frames $i$ for a fixed frame $s$.
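A minimal sketch of the self-attention in Eqs. (8) and (9) is given below. It is a simplification under stated assumptions: the three 1×1 convolutions $\boldsymbol{r}$, $\boldsymbol{b}$, and $\boldsymbol{c}$ are replaced by identity maps unless other callables are passed in, and the frame vectors are hypothetical.

```python
import math


def _identity(v):
    return v


def self_attention(frames, gamma, r=None, b=None, c=None):
    """Apply Eqs. (8)-(9) to `frames`, a list of per-frame score vectors v_i.

    r, b, c stand in for the 1x1 convolutions; they default to the identity.
    """
    r, b, c = r or _identity, b or _identity, c or _identity
    attended = []
    for v_s in frames:
        logits = [sum(p * q for p, q in zip(b(v_i), c(v_s))) for v_i in frames]
        peak = max(logits)                      # subtracting the max leaves the ratios unchanged
        weights = [math.exp(l - peak) for l in logits]
        norm = sum(weights)                     # plays the role of the normalization coefficient
        alphas = [wgt / norm for wgt in weights]
        mixed = [
            sum(alpha * r(v_i)[d] for alpha, v_i in zip(alphas, frames))
            for d in range(len(v_s))
        ]
        attended.append([gamma * m + v for m, v in zip(mixed, v_s)])
    return attended


# Hypothetical object scores for three frames over two objects.
frames = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
at_init = self_attention(frames, gamma=0.0)   # gamma starts at 0
later = self_attention(frames, gamma=1.0)
print(at_init == frames, later)
```

With $\gamma = 0$ the output equals the input, so the residual form of Eq. (8) starts from the raw object scores and lets training decide how much temporal context to mix in.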

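To make full use of the explicit relationships among actions, the English subgraph of ConceptNet (Speer et al., 2017) is used in both the classification module and the video feature extraction module, and string matching maps the actions, objects, and sensor feature words to ConceptNet nodes. To avoid losing the main semantic information, feature words with no corresponding node are converted, by changing letter case and replacing connectors, into common words that can be found in ConceptNet; for example, "Used_car" is replaced with "used_car". The most important step in building a knowledge graph is determining the relationships between nodes, and different knowledge graphs have different relation types. When building the graph, the method of Marino et al. (2017) is followed: the relationships between nodes are reduced to an adjacency matrix, which represents semantic consistency effectively and propagates information between nodes. The adjacency matrix is kept fixed during training to preserve the original knowledge structure.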
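The case and connector normalization described above might look like the following sketch. The function names and the tiny node vocabulary are hypothetical; the paper does not specify the exact rules beyond the "Used_car" example, so lowercasing and mapping spaces and hyphens to underscores are assumptions.

```python
def normalize_concept(name):
    """Lowercase a label and use underscores as the only connector."""
    return name.strip().lower().replace("-", "_").replace(" ", "_")


def match_concepts(names, node_vocabulary):
    """Split labels into those found in the node vocabulary and those not found."""
    matched, unmatched = {}, []
    for name in names:
        key = normalize_concept(name)
        if key in node_vocabulary:
            matched[name] = key
        else:
            unmatched.append(name)
    return matched, unmatched


# Hypothetical node vocabulary; a real run would load ConceptNet's English nodes.
nodes = {"used_car", "bicycle", "standing_in_line"}
matched, unmatched = match_concepts(["Used_car", "standing in line", "Segway"], nodes)
print(matched, unmatched)
```

Labels that remain unmatched after normalization are the ones whose semantic information is lost by plain string matching.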

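Next, the features obtained from the video and the sensor features $\boldsymbol{G}$ are used together to build the knowledge graph, which comprises two parts:

1) The top 20 most relevant objects are taken for each video frame; after deduplication, 1 050 objects remain and serve as one part of the graph nodes.

2) Likewise, the feature words corresponding to the sensor features $\boldsymbol{G}$ are added to the graph as the other part of the nodes, fusing them with the objects extracted from video as multimodal features.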
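The first step, per-frame top-K selection followed by deduplication, can be sketched as below. The frame scores are hypothetical and `k=2` is used only to keep the example short; the paper uses the top 20 objects per frame.

```python
def top_k_objects(frame_scores, k):
    """Return the k object names with the highest scores in one frame."""
    ranked = sorted(frame_scores, key=frame_scores.get, reverse=True)
    return ranked[:k]


def collect_object_nodes(video_frames, k=20):
    """Union of per-frame top-k objects, deduplicated, in first-seen order."""
    nodes, seen = [], set()
    for frame_scores in video_frames:
        for name in top_k_objects(frame_scores, k):
            if name not in seen:
                seen.add(name)
                nodes.append(name)
    return nodes


# Hypothetical scores for two frames.
frames = [
    {"bicycle": 0.7, "helmet": 0.2, "fork": 0.1},
    {"bicycle": 0.6, "road": 0.3, "helmet": 0.1},
]
object_nodes = collect_object_nodes(frames, k=2)
print(object_nodes)
```

Sensor feature words would then be appended to this node list so that both modalities share one graph.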

The sensor feature word vectors and the object category word vectors are the input to an $L$-layer graph convolutional network, whose last layer outputs the feature matrix $\boldsymbol{Z}_{L}^{\mathrm{vid}} \in \mathbf{R}^{(G+O) \times d}$.

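2.6 Training and prediction

In the output matrix $\boldsymbol{w}^{\mathrm{cls}}$ of the classification module, the first $S$ column vectors are the action classifiers of the seen categories, and columns $S+1$ to $S+U$ are the action classifiers of the unseen categories. During training, the seen-category classifiers are used to classify the features of all labeled samples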

$ \hat{y}_{n}^{s}=\sigma\left(q_{n}^{s}\right) $

(10)

$ q_{n}^{s}=\left(\boldsymbol{w}_{s}^{\mathrm{cls}}\right){ }^{\mathrm{T}} \boldsymbol{Z}_{s}^{\text {vid }} $

(11)

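where $\hat{y}_{n}^{s}$ is the predicted class probability obtained by softmax-normalizing the prediction score $q_{n}^{s}$ of the $s$-th action, $\boldsymbol{w}_{s}^{\mathrm{cls}}$ is the classifier of the $s$-th action produced by the classification module, and $\boldsymbol{Z}_{s}^{\mathrm{vid}}$ is the sum of the feature vectors of all features of the seen-class action produced by the video feature extraction module: $\sum\limits_{o \in \boldsymbol{N}(s), G \in \boldsymbol{G}^{s}} \boldsymbol{Z}_{L, n, o, G}^{\mathrm{vid}}$. Here $\boldsymbol{N}(s)$ is the set of one-hop objects of the $s$-th action in the knowledge graph (Speer et al., 2017) and $\boldsymbol{G}^{s}$ denotes the sensor features of the seen-class action, so classification of a given action concentrates on the objects and sensor features most relevant to it. Finally, the whole network is optimized with a cross-entropy loss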

$ \mathcal{L}_{\mathrm{obj}}=-\frac{1}{N_{s}} \sum\limits_{n=1}^{N_{s}} \sum\limits_{s=1}^{S} y_{n}^{s} \log \left(\hat{y}_{n}^{s}\right) $

(12)

where $S$ is the number of seen categories, $s$ indexes the seen-class actions, and $N_{s}$ is the total number of training samples.

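During training, the model not only optimizes the seen-class classifiers but also, by modeling the relationship between the classification module and the video feature extraction module, generalizes to the unseen categories. During testing, the unseen-class classifiers generated by the classification module classify the object features and sensor features of unseen-class actions from the video feature extraction module, which achieves zero-shot human action recognition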

$ q_{n}^{u}=\left(\boldsymbol{w}_{u}^{\mathrm{cls}}\right)^{\mathrm{T}} \boldsymbol{Z}_{u}^{\text {vid }} $

(13)

where $\boldsymbol{w}_{u}^{\mathrm{cls}}$ is the classifier of an unseen-class action and $\boldsymbol{Z}_{u}^{\mathrm{vid}}$ is the sum of the feature vectors of all features of the unseen-class action produced by the video feature extraction module: $\sum\limits_{o \in \boldsymbol{N}(u), G \in \boldsymbol{G}^{u}} \boldsymbol{Z}_{L, n, o, G}^{\mathrm{vid}}$, where $\boldsymbol{N}(u)$ is the set of one-hop objects of the unseen-class action and $\boldsymbol{G}^{u}$ denotes its sensor features.
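The scoring of Eqs. (11) and (13) can be sketched as a sum over the graph nodes related to a class, followed by a dot product with that class's classifier. All values below are hypothetical, including the two-dimensional features, the node names, and the one-hop neighbourhoods; in the paper these come from the two graph convolutional modules and from ConceptNet.

```python
def class_score(classifier, node_features, node_ids):
    """q = w^T Z, where Z sums the features of the listed graph nodes."""
    pooled = [sum(node_features[n][d] for n in node_ids) for d in range(len(classifier))]
    return sum(w_d * z_d for w_d, z_d in zip(classifier, pooled))


def predict(classifiers, node_features, related_nodes):
    """Score every candidate class and return the best label with all scores."""
    scores = {
        label: class_score(w, node_features, related_nodes[label])
        for label, w in classifiers.items()
    }
    return max(scores, key=scores.get), scores


# Hypothetical 2-dimensional node features for one test video.
node_features = {
    "bicycle": [0.9, 0.1],
    "fork": [0.1, 0.2],
    "fast_heart_rate": [0.6, 0.0],
    "slow_heart_rate": [0.0, 0.1],
}
classifiers = {"bicycling": [1.0, 0.0], "eating": [0.0, 1.0]}
related_nodes = {
    "bicycling": ["bicycle", "fast_heart_rate"],
    "eating": ["fork", "slow_heart_rate"],
}
label, scores = predict(classifiers, node_features, related_nodes)
print(label, scores)
```

Restricting the sum to one-hop objects and the class's sensor feature words is what keeps each classifier focused on the evidence most relevant to its action.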

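3 Experiments

3.1 Dataset

First-person video and body acceleration data describe human activity patterns and states comprehensively from different perspectives. The experiments use the Stanford-ECM dataset (Nakamura et al., 2017), which contains 23 kinds of daily activities together with heart rate and acceleration data synchronized with the recorded video; video durations range from 3 min to 51 min.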

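To keep the results objective, the training and test sets are split by user in the Stanford-ECM dataset: data from the same user may appear only in the training set or only in the test set. Because the samples in this dataset are unevenly distributed, some actions, for example among ascending and descending stairs or bicycling and bicycling uphill, do not satisfy this splitting rule. The experiments therefore use the 16 action classes that do satisfy it, listed in Table 1.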

Table 1 Action classes selected in the experiment


Action name	Action name
bicycling	talking standing
calisthenics	talking sitting
walking	sitting tasks
descending stairs	meeting
presenting	eating
driving	standing in line
shopping	riding
food preparation	reading

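3.2 Dataset splits and comparison of results

3.2.1 Dataset splits

The dataset is split following Xu et al. (2017): 50% of the actions are selected for training and the other 50% serve as unseen classes for testing. Fifty independent splits of the 16 actions are generated, and the average recognition accuracy is reported to evaluate performance.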
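A hedged sketch of this protocol: draw 50 random half-and-half partitions of the 16 action classes from Table 1 and average a per-split accuracy. The seed, the use of Python's `random` module, and the function names are assumptions; the paper does not say how the splits were drawn.

```python
import random

ACTIONS = [
    "bicycling", "calisthenics", "walking", "descending stairs",
    "presenting", "driving", "shopping", "food preparation",
    "talking standing", "talking sitting", "sitting tasks", "meeting",
    "eating", "standing in line", "riding", "reading",
]


def make_splits(actions, n_splits=50, seed=0):
    """Return n_splits (seen, unseen) pairs, each a random half/half split."""
    rng = random.Random(seed)
    half = len(actions) // 2
    splits = []
    for _ in range(n_splits):
        shuffled = rng.sample(actions, len(actions))
        splits.append((shuffled[:half], shuffled[half:]))
    return splits


def mean_accuracy(per_split_accuracy):
    """Average the accuracy measured on each split."""
    return sum(per_split_accuracy) / len(per_split_accuracy)


splits = make_splits(ACTIONS)
print(len(splits), splits[0])
```

Each split trains on its 8 seen classes and tests on the 8 unseen ones; `mean_accuracy` would then be applied to the 50 resulting accuracies.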

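3.2.2 Comparison of different methods

To demonstrate the effectiveness of the proposed method, three methods are compared:

1) ZSIC-GAN (zero-shot image classification based on generative adversarial network) (Wei and Zhang, 2019). This method generates image features for unseen classes, turning zero-shot classification into a conventional image classification task.

2) TS-GCN (zero-shot action recognition via two-stream GCNs and knowledge graphs) (Gao et al., 2019). This method achieves zero-shot human action recognition with an end-to-end framework that incorporates a knowledge graph.

3) ZSAR-MF. This is the proposed zero-shot human action recognition model based on multimodal fusion, which fuses features from the sensor and video modalities.

For each method, the five highest recognition accuracies are reported; the comparison is shown in Table 2.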

Table 2 Comparison of results of different methods


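Method	Accuracy/% (five highest values, in descending order)
ZSIC-GAN	35.3	33.7	30.2	28.1	26.8
TS-GCN	42.4	40.9	39.4	37.9	36.4
ZSAR-MF	47.0	45.5	42.4	40.9	39.4
Note: the best value in each column, marked in bold in the typeset table, belongs to ZSAR-MF.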

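The zero-shot human action recognition results on the Stanford-ECM dataset (Nakamura et al., 2017) in Table 2 show that:

1) TS-GCN scores about 8 percentage points higher than ZSIC-GAN, which demonstrates the auxiliary role of the knowledge graph in describing actions through external semantic information and the advantage of graph convolutional networks.

2) The proposed method scores about 12 and 4 percentage points higher than ZSIC-GAN and TS-GCN, respectively, which indicates that, for zero-shot human action recognition, fusing sensor and video features outperforms the existing methods to some extent.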
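The gaps quoted above can be checked against Table 2 with a few lines of arithmetic. Averaging the five listed values per method is an assumption about how the approximate figures were obtained; the numbers themselves are copied from the table.

```python
TABLE_2 = {
    "ZSIC-GAN": [35.3, 33.7, 30.2, 28.1, 26.8],
    "TS-GCN": [42.4, 40.9, 39.4, 37.9, 36.4],
    "ZSAR-MF": [47.0, 45.5, 42.4, 40.9, 39.4],
}


def mean(values):
    return sum(values) / len(values)


means = {method: mean(values) for method, values in TABLE_2.items()}
gap_ts_gcn_vs_zsic_gan = means["TS-GCN"] - means["ZSIC-GAN"]
gap_zsar_mf_vs_zsic_gan = means["ZSAR-MF"] - means["ZSIC-GAN"]
gap_zsar_mf_vs_ts_gcn = means["ZSAR-MF"] - means["TS-GCN"]

for name, gap in [
    ("TS-GCN over ZSIC-GAN", gap_ts_gcn_vs_zsic_gan),
    ("ZSAR-MF over ZSIC-GAN", gap_zsar_mf_vs_zsic_gan),
    ("ZSAR-MF over TS-GCN", gap_zsar_mf_vs_ts_gcn),
]:
    print(f"{name}: {gap:.2f} percentage points")
```

The means are 30.82, 39.40, and 43.04, so the gaps are about 8.58, 12.22, and 3.64 percentage points, in line with the rounded figures of 8, 12, and 4.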

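3.2.3 Comparison of different numbers of GCN layers

To study how the number of graph convolutional layers affects the results, experiments are run with three different depths. In theory, a deeper graph convolutional network improves information propagation between nodes. As Fig. 5 shows, however, adding more layers on top of the 3-layer model does not significantly improve recognition accuracy. One potential reason that accuracy falls as layers are added is that the training data are too few, so the deeper network overfits.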

Fig. 5 Comparison of results with different GCN layers

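4 Conclusion

This paper proposes a zero-shot human action recognition framework based on multimodal fusion. The framework comprises a sensor feature extraction module, a classification module, and a video feature extraction module; it fuses sensor and video features, models the two kinds of features jointly with a knowledge graph, and optimizes the whole network with a classification loss.

1) Zero-shot human action recognition based on multimodal fusion was carried out and evaluated in several experiments on the Stanford-ECM dataset. The comparison shows that the proposed ZSAR-MF model improves recognition accuracy by about 4 percentage points over zero-shot recognition models based on single-modality data.

2) String matching is used to map actions, objects, and sensor feature words to ConceptNet nodes, which loses a small amount of semantic information. The cause of this shortcoming is that the object words produced by the GoogLeNet model do not exactly match the words of the knowledge graph nodes.

3) This work fuses two modalities and has not yet studied others. Future work will refine the method, for example by also fusing audio features.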

References

Akata Z, Reed S, Walter D, Lee H and Schiele B. 2015. Evaluation of output embeddings for fine-grained image classification//Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston, USA: IEEE: 2927-2936 [DOI: 10.1109/CVPR.2015.7298911]

Farhadi A, Endres I, Hoiem D and Forsyth D. 2009. Describing objects by their attributes//Proceedings of 2009 IEEE Conference on Computer Vision and Pattern Recognition. Miami, USA: IEEE: 1778-1785 [DOI: 10.1109/CVPR.2009.5206772]

Frome A, Corrado G S, Shlens J, Bengio S, Dean J, Ranzato M A and Mikolov T. 2013. DeViSE: a deep visual-semantic embedding model//Proceedings of the 26th International Conference on Neural Information Processing Systems. Lake Tahoe, USA: ACM: 2121-2129

Gan C, Yang Y, Zhu L C, Zhao D L and Zhuang Y T. 2016. Recognizing an action using its name: a knowledge-based approach. International Journal of Computer Vision, 120(1): 61-77 [DOI: 10.1007/s11263-016-0893-6]

Gao J Y, Zhang T Z and Xu C S. 2018. Watch, think and attend: end-to-end video classification via dynamic knowledge evolution modeling//Proceedings of the 26th ACM International Conference on Multimedia. Seoul, Korea(South): ACM: 690-699 [DOI: 10.1145/3240508.3240566]

Gao J Y, Zhang T Z and Xu C S. 2019. I know the relationships: zero-shot action recognition via two-stream graph convolutional networks and knowledge graphs//Proceedings of the AAAI Conference on Artificial Intelligence. Palo Alto, USA: AAAI: 8303-8311 [DOI: 10.1609/aaai.v33i01.33018303]

He J Y. 2019. Research on Multimodal Human Motion Recognition. Beijing: Beijing University of Posts and Telecommunications (何俊佑. 2019. 多模态人体动作识别研究. 北京: 北京邮电大学)

Jain M, Van Gemert J C, Mensink T and Snoek C G M. 2015. Objects2action: classifying and localizing actions without any video example//Proceedings of 2015 IEEE International Conference on Computer Vision. Santiago, Chile: IEEE: 4588-4594 [DOI: 10.1109/ICCV.2015.521]

Jing Y, Hao J S and Li P. 2019. Learning spatiotemporal features of CSI for indoor localization with dual-stream 3D convolutional neural networks. IEEE Access, 7: 147571-147585 [DOI: 10.1109/ACCESS.2019.2946870]

Kipf T N and Welling M. 2017. Semi-supervised classification with graph convolutional networks//Proceedings of the 5th International Conference on Learning Representations. Toulon, France: OpenReview: 1-6

Kiros R, Salakhutdinov R and Zemel R. 2014. Multimodal neural language models//Proceedings of the 31st International Conference on Machine Learning. Beijing, China: Journal of Machine Learning Research: 595-603

Lee C W, Fang W, Yeh C K and Wang Y C F. 2018. Multi-label zero-shot learning with structured knowledge graphs//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 1576-1585 [DOI: 10.1109/CVPR.2018.00170]

Liu J G, Kuipers B and Savarese S. 2011. Recognizing human actions by attributes//Proceedings of the 24th IEEE Conference on Computer Vision and Pattern Recognition. Colorado Springs, USA: IEEE: 3337-3344 [DOI: 10.1109/CVPR.2011.5995353]

Liu J W, Ding X H and Luo X L. 2020. Survey of multimodal deep learning. Computer Application Research, 37(6): 1601-1614 (刘建伟, 丁熙浩, 罗雄麟. 2020. 多模态深度学习综述. 计算机应用研究, 37(6): 1601-1614) [DOI: 10.19734/j.issn.1001-3695.2018.12.0857]

Marino K, Salakhutdinov R and Gupta A. 2017. The more you know: using knowledge graphs for image classification//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE: 20-28 [DOI: 10.1109/CVPR.2017.10]

Mettes P, Koelma D C and Snoek C G M. 2016. The ImageNet shuffle: reorganized pre-training for video event detection//Proceedings of 2016 ACM on International Conference on Multimedia Retrieval. New York, USA: ACM: 175-182 [DOI: 10.1145/2911996.2912036]

Mettes P and Snoek C G M. 2017. Spatial-aware object embeddings for zero-shot localization and classification of actions//Proceedings of 2017 IEEE International Conference on Computer Vision. Venice, Italy: IEEE: 4453-4460 [DOI: 10.1109/ICCV.2017.476]

Nakamura K, Yeung S, Alahi A and Li F F. 2017. Jointly learning energy expenditures and activities using egocentric multimodal signals//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE: 6817-6826 [DOI: 10.1109/CVPR.2017.721]

Ngiam J, Khosla A, Kim M, Nam J, Lee H and Ng A Y. 2011. Multimodal deep learning//Proceedings of the 28th International Conference on Machine Learning. Bellevue, USA: ICML: 689-696

Pennington J, Socher R and Manning C. 2014. GloVe: global vectors for word representation//Proceedings of 2014 Conference on Empirical Methods in Natural Language Processing. Doha, Qatar: ACL: 1532-1543 [DOI: 10.3115/v1/d14-1162]

Piergiovanni A J and Ryoo M S. 2020. Learning multimodal representations for unseen activities//Proceedings of 2020 IEEE Winter Conference on Applications of Computer Vision. Snowmass Village, USA: IEEE: 506-515 [DOI: 10.1109/WACV45572.2020.9093612]

Speer R, Chin J and Havasi C. 2017. ConceptNet 5.5: an open multilingual graph of general knowledge//Proceedings of the 31st AAAI Conference on Artificial Intelligence. San Francisco, USA: AAAI: 4444-4451

Srivastava N and Salakhutdinov R. 2014. Multimodal learning with deep Boltzmann machines. The Journal of Machine Learning Research, 15(1): 2949-2980

Szegedy C, Liu W, Jia Y Q, Sermanet P, Reed S E, Anguelov D, Erhan D, Vanhoucke V and Rabinovich A. 2015. Going deeper with convolutions//Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston, USA: IEEE: 1-9 [DOI: 10.1109/CVPR.2015.7298594]

Wang X L, Ye Y F and Gupta A. 2018. Zero-shot recognition via semantic embeddings and knowledge graphs//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 6857-6866 [DOI: 10.1109/CVPR.2018.00717]

Wei H X and Zhang Y. 2019. Zero-shot image classification based on generative adversarial network. Journal of Beijing University of Aeronautics and Astronautics, 45(12): 2345-2350 (魏宏喜, 张越. 2019. 基于生成对抗网络的零样本图像分类. 北京航空航天大学学报, 45(12): 2345-2350) [DOI: 10.13700/j.bh.1001-5965.2019.0363]

Xu X, Hospedales T and Gong S G. 2017. Transductive zero-shot action recognition by word-vector embedding. International Journal of Computer Vision, 123(3): 309-333 [DOI: 10.1007/s11263-016-0983-5]

Xu X, Hospedales T M and Gong S G. 2016. Multi-task zero-shot action recognition with prioritised data augmentation//Proceedings of the 14th European Conference on Computer Vision. Amsterdam, the Netherlands: Springer: 343-359 [DOI: 10.1007/978-3-319-46475-6_22]

Zhang W M, Huang Y, Yu W T, Yang X S, Wang W and Sang J T. 2019. Multimodal attribute and feature embedding for activity recognition//Proceedings of the ACM Multimedia Asia. Beijing, China: ACM: 1-7 [DOI: 10.1145/3338533.3366592]

Zhu Y, Long Y, Guan Y, Newsam S and Shao L. 2018. Towards universal representation for unseen action recognition//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 9436-9445 [DOI: 10.1109/CVPR.2018.00983]