多模态零样本人体动作识别
Multimodal-based zero-shot human action recognition
2021年26卷第7期 页码: 1658-1667
纸质出版日期: 2021-07-16
录用日期: 2021-01-24
DOI: 10.11834/jig.200503
吕露露, 黄毅, 高君宇, 杨小汕, 徐常胜. 多模态零样本人体动作识别[J]. 中国图象图形学报, 2021,26(7):1658-1667.
Lulu Lyu, Yi Huang, Junyu Gao, Xiaoshan Yang, Changsheng Xu. Multimodal-based zero-shot human action recognition[J]. Journal of Image and Graphics, 2021,26(7):1658-1667.
目的
在人体行为识别算法的研究领域，通过视频特征实现零样本识别的研究越来越多。但是，目前大部分研究是基于单模态数据展开的，关于多模态融合的研究还较少。为了研究多种模态数据对零样本人体动作识别的影响，本文提出了一种基于多模态融合的零样本人体动作识别(zero-shot human action recognition framework based on multimodal fusion, ZSAR-MF)框架。
方法
本文框架主要由传感器特征提取模块、分类模块和视频特征提取模块组成。具体来说,传感器特征提取模块使用卷积神经网络(convolutional neural network,CNN)提取心率和加速度特征;分类模块利用所有概念(传感器特征、动作和对象名称)的词向量生成动作类别分类器;视频特征提取模块将每个动作的属性、对象分数和传感器特征映射到属性—特征空间中,最后使用分类模块生成的分类器对每个动作的属性和传感器特征进行评估。
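为便于理解，下面给出传感器特征提取模块的一个示意性实现草图（基于PyTorch的假设性写法，网络层数、通道数和特征维度均为说明用的假设，并非论文的原始配置）：

```python
import torch
import torch.nn as nn

class SensorFeatureExtractor(nn.Module):
    """示意性的传感器特征提取模块：用一维CNN对心率(1维)与
    三轴加速度(3维)信号提取定长特征。各层参数均为假设值。"""

    def __init__(self, feat_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(4, 32, kernel_size=5, padding=2),  # 输入4通道 = 心率1 + 加速度3
            nn.ReLU(),
            nn.MaxPool1d(2),
            nn.Conv1d(32, 64, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),                      # 全局平均池化，适应任意信号长度
        )
        self.fc = nn.Linear(64, feat_dim)

    def forward(self, x):
        # x: (batch, 4, T)，T为采样点数
        h = self.conv(x).squeeze(-1)   # (batch, 64)
        return self.fc(h)              # (batch, feat_dim) 的传感器特征

# 用法示例：8段信号，每段300个采样点
features = SensorFeatureExtractor()(torch.randn(8, 4, 300))   # -> (8, 128)
```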
结果
本文实验在Stanford-ECM数据集上展开，对比结果表明本文ZSAR-MF模型比基于单模态数据的零样本识别模型在识别准确率上提高了4%左右。
结论
本文所提出的基于多模态融合的零样本人体动作识别框架,有效地融合了传感器特征和视频特征,并显著提高了零样本人体动作识别的准确率。
Objective
Human action recognition is one of the research hotspots in computer vision because of its wide application in human-computer interaction, virtual reality, and video surveillance. With the development of related technology in recent years, deep learning-based human action recognition algorithms have achieved good recognition performance when the sample size is sufficient. However, human action recognition becomes difficult when samples are scarce or missing. Zero-shot recognition addresses this problem and has attracted considerable attention because it can directly classify "unseen" categories that are not in the training set. In the past decade, numerous methods have been proposed to perform zero-shot human action recognition using video features and have achieved promising improvements. However, most of the current methods are based on single-modality data, and few studies have addressed multimodal fusion. To study the influence of multimodal fusion on zero-shot human action recognition, this study proposes a zero-shot human action recognition framework based on multimodal fusion (ZSAR-MF).
Method
Unlike most previous methods, which fuse external information with video features or study single-modality video features only, our study focuses on the sensor features that are most closely related to the activity state to improve recognition performance. The zero-shot human action recognition framework based on multimodal fusion is mainly composed of a sensor feature-extraction module, a classification module, and a video feature-extraction module. Specifically, the sensor feature-extraction module uses a convolutional neural network (CNN) to extract the acceleration and heart rate features of human actions and to predict the most relevant feature words for each action. The classification module uses the word vectors of all concepts (sensor features, action names, and object names) to generate action category classifiers. The "seen" category classifiers are learned from the training data of these categories, and the "unseen" category classifiers are generalized from the "seen" category classifiers by using a graph convolutional network (GCN). The video feature-extraction module extracts the video features of each action and maps the attributes of human actions, object scores, and sensor features into the attribute-feature space. Finally, the classifiers generated by the classification module are used to evaluate the feature of each video and calculate the action class scores.
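As an illustration of how the classification module could generate classifiers from concept word vectors, the following sketch shows a minimal two-layer GCN in PyTorch; the layer sizes, the toy concept graph, and the scoring step are assumptions for exposition rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn

def normalize_adj(adj):
    # Symmetrically normalize the adjacency matrix with self-loops,
    # as in Kipf and Welling (2017): D^{-1/2} (A + I) D^{-1/2}.
    adj = adj + torch.eye(adj.size(0))
    deg_inv_sqrt = adj.sum(dim=1).pow(-0.5)
    return deg_inv_sqrt.unsqueeze(1) * adj * deg_inv_sqrt.unsqueeze(0)

class ClassifierGCN(nn.Module):
    """Minimal sketch of the classification module: a GCN over the concept
    graph (sensor-feature words, action names, object names) maps each
    node's word vector to classifier weights. Dimensions are assumptions."""

    def __init__(self, word_dim=300, hidden_dim=512, feat_dim=1024):
        super().__init__()
        self.w1 = nn.Linear(word_dim, hidden_dim)
        self.w2 = nn.Linear(hidden_dim, feat_dim)

    def forward(self, word_vecs, adj):
        # word_vecs: (num_concepts, word_dim), adj: (num_concepts, num_concepts)
        a = normalize_adj(adj)
        h = torch.relu(a @ self.w1(word_vecs))
        return a @ self.w2(h)          # (num_concepts, feat_dim) classifier weights

# Toy usage: the first 16 nodes are assumed to be the action concepts.
num_concepts, num_actions = 60, 16
word_vecs = torch.randn(num_concepts, 300)            # e.g., GloVe word vectors
adj = (torch.rand(num_concepts, num_concepts) > 0.9).float()
adj = ((adj + adj.t()) > 0).float()                   # symmetric toy concept graph
classifiers = ClassifierGCN()(word_vecs, adj)
video_feat = torch.randn(1, 1024)                     # fused video + sensor feature
scores = video_feat @ classifiers[:num_actions].t()   # class scores for 16 actions
```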
Result
The experiment is conducted on the Stanford-ECM dataset, which contains both sensor and video data: 23 types of human action videos together with heart rate and acceleration data synchronized with the collected videos. Our experiment can be divided into three steps. First, we remove the 7 actions that do not meet the experimental conditions and select the remaining 16 actions as the experimental dataset. Then, we select three methods to perform zero-shot human action recognition experiments. A comparison of the experimental results shows that zero-shot action recognition via two-stream GCNs and knowledge graphs (TS-GCN) achieves results approximately 8% higher than zero-shot image classification based on a generative adversarial network (ZSIC-GAN), which demonstrates the auxiliary role of knowledge graphs in describing actions with external semantic information and the advantage of the GCN. Compared with ZSIC-GAN and TS-GCN, our proposed method achieves recognition results that are 12% and 4% higher, respectively, which shows that, for zero-shot human action recognition, fusing sensor and video features is better than using video features alone. Furthermore, we verify the influence of the number of GCN layers on the recognition accuracy and analyze the reasons for the result. The experimental results show that adding more layers to the three-layer model cannot significantly improve the recognition accuracy; one potential reason is that the amount of training data is too small, so a deeper network overfits.
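For reference, the following minimal sketch illustrates how zero-shot accuracy on held-out classes can be computed; the function name, the toy scores, and the particular seen/unseen split are hypothetical and do not reproduce the split used in the paper.

```python
import torch

def unseen_top1_accuracy(scores, labels, unseen_ids):
    """Illustrative zero-shot evaluation: restrict predictions to the
    'unseen' classes and compute top-1 accuracy on their test samples."""
    unseen_ids = torch.tensor(unseen_ids)
    # Keep only the test samples whose ground-truth class is unseen.
    mask = (labels.unsqueeze(1) == unseen_ids.unsqueeze(0)).any(dim=1)
    scores, labels = scores[mask], labels[mask]
    # Argmax over the unseen classes only, then map back to class ids.
    pred = unseen_ids[scores[:, unseen_ids].argmax(dim=1)]
    return (pred == labels).float().mean().item()

# Toy usage with 16 action classes, 4 of which are held out as "unseen".
scores = torch.randn(100, 16)              # class scores from the framework
labels = torch.randint(0, 16, (100,))      # ground-truth action labels
acc = unseen_top1_accuracy(scores, labels, unseen_ids=[3, 7, 11, 15])
```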
Conclusion
Sensor and video data describe human activity patterns comprehensively from different views, which facilitates zero-shot human action recognition based on multimodal fusion. Unlike most multimodal fusion methods, which combine textual descriptions of actions or audio data with image features, our study fuses the sensor and video features that are most closely related to the activity state and thus pays close attention to the original characteristics of the action. In general, our zero-shot human action recognition framework based on multimodal fusion includes three parts: a sensor feature-extraction module, a classification module, and a video feature-extraction module. The framework integrates video features with features extracted from sensor data; the two kinds of features are modeled with knowledge graphs, and the entire network is optimized with a classification loss function. The experimental results on the Stanford-ECM dataset demonstrate the effectiveness of our proposed framework. By fully fusing sensor and video features, we significantly improve the accuracy of zero-shot human action recognition.
零样本；多模态融合；动作识别；传感器数据；视频特征
zero-shot; multimodal fusion; action recognition; sensor data; video features
Akata Z, Reed S, Walter D, Lee H and Schiele B. 2015. Evaluation of output embeddings for fine-grained image classification//Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston, USA: IEEE: 2927-2936 [DOI: 10.1109/CVPR.2015.7298911]
Farhadi A, Endres I, Hoiem D and Forsyth D. 2009. Describing objects by their attributes//Proceedings of 2009 IEEE Conference on Computer Vision and Pattern Recognition. Miami, USA: IEEE: 1778-1785 [DOI: 10.1109/CVPR.2009.5206772]
Frome A, Corrado G S, Shlens J, Bengio S, Dean J, Ranzato M A and Mikolov T. 2013. DeViSE: a deep visual-semantic embedding model//Proceedings of the 26th International Conference on Neural Information Processing Systems. Lake Tahoe, USA: ACM: 2121-2129
Gan C, Yang Y, Zhu L C, Zhao D L and Zhuang Y T. 2016. Recognizing an action using its name: a knowledge-based approach. International Journal of Computer Vision, 120(1): 61-77 [DOI: 10.1007/s11263-016-0893-6]
Gao J Y, Zhang T Z and Xu C S. 2018. Watch, think and attend: end-to-end video classification via dynamic knowledge evolution modeling//Proceedings of the 26th ACM International Conference on Multimedia. Seoul, Korea(South): ACM: 690-699 [DOI: 10.1145/3240508.3240566]
Gao J Y, Zhang T Z and Xu C S. 2019. I know the relationships: zero-shot action recognition via two-stream graph convolutional networks and knowledge graphs//Proceedings of the AAAI Conference on Artificial Intelligence. Palo Alto, USA: AAAI: 8303-8311 [DOI: 10.1609/aaai.v33i01.33018303]
He J Y. 2019. Research on Multimodal Human Motion Recognition. Beijing: Beijing University of Posts and Telecommunications
何俊佑. 2019. 多模态人体动作识别研究. 北京: 北京邮电大学
Jain M, Van Gemert J C, Mensink T and Snoek C G M. 2015. Objects2action: classifying and localizing actions without any video example//Proceedings of 2015 IEEE International Conference on Computer Vision. Santiago, Chile: IEEE: 4588-4594 [DOI: 10.1109/ICCV.2015.521]
Jing Y, Hao J S and Li P. 2019. Learning spatiotemporal features of CSI for indoor localization with dual-stream 3D convolutional neural networks. IEEE Access, 7: 147571-147585 [DOI: 10.1109/ACCESS.2019.2946870]
Kipf T N and Welling M. 2017. Semi-supervised classification with graph convolutional networks//Proceedings of the 5th International Conference on Learning Representations. Toulon, France: OpenReview: 1-6
Kiros R, Salakhutdinov R and Zemel R. 2014. Multimodal neural language models//Proceedings of the 31st International Conference on Machine Learning. Beijing, China: Journal of Machine Learning Research: 595-603
Lee C W, Fang W, Yeh C K and Wang Y C F. 2018. Multi-label zero-shot learning with structured knowledge graphs//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 1576-1585 [DOI: 10.1109/CVPR.2018.00170]
Liu J G, Kuipers B and Savarese S. 2011. Recognizing human actions by attributes//Proceedings of the 24th IEEE Conference on Computer Vision and Pattern Recognition. Colorado Springs, USA: IEEE: 3337-3344 [DOI: 10.1109/CVPR.2011.5995353]
Liu J W, Ding X H and Luo X L. 2020. Survey of multimodal deep learning. Computer Application Research, 37(6): 1601-1614
刘建伟, 丁熙浩, 罗雄麟. 2020. 多模态深度学习综述. 计算机应用研究, 37(6): 1601-1614 [DOI: 10.19734/j.issn.1001-3695.2018.12.0857]
Marino K, Salakhutdinov R and Gupta A. 2017. The more you know: using knowledge graphs for image classification//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE: 20-28 [DOI: 10.1109/CVPR.2017.10]
Mettes P, Koelma D C and Snoek C G M. 2016. The ImageNet shuffle: reorganized pre-training for video event detection//Proceedings of 2016 ACM on International Conference on Multimedia Retrieval. New York, USA: ACM: 175-182 [DOI: 10.1145/2911996.2912036]
Mettes P and Snoek C G M. 2017. Spatial-aware object embeddings for zero-shot localization and classification of actions//Proceedings of 2017 IEEE International Conference on Computer Vision. Venice, Italy: IEEE: 4453-4460 [DOI: 10.1109/ICCV.2017.476]
Nakamura K, Yeung S, Alahi A and Li F F. 2017. Jointly learning energy expenditures and activities using egocentric multimodal signals//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE: 6817-6826 [DOI: 10.1109/CVPR.2017.721]
Ngiam J, Khosla A, Kim M, Nam J, Lee H and Ng A Y. 2011. Multimodal deep learning//Proceedings of the 28th International Conference on International Conference on Machine Learning. Bellevue, USA: ICML: 689-696
Pennington J, Socher R and Manning C. 2014. GloVe: global vectors for word representation//Proceedings of 2014 Conference on Empirical Methods in Natural Language Processing. Doha, Qatar: ACL: 1532-1543 [DOI: 10.3115/v1/d14-1162]
Piergiovanni A J and Ryoo M S. 2020. Learning multimodal representations for unseen activities//Proceedings of 2020 IEEE Winter Conference on Applications of Computer Vision. Snowmass Village, USA: IEEE: 506-515 [DOI: 10.1109/WACV45572.2020.9093612]
Speer R, Chin J and Havasi C. 2017. ConceptNet 5.5: an open multilingual graph of general knowledge//Proceedings of the 31st AAAI Conference on Artificial Intelligence. San Francisco, USA: AAAI: 4444-4451
Srivastava N and Salakhutdinov R. 2014. Multimodal learning with deep boltzmann machines. The Journal of Machine Learning Research, 15(1): 2949-2980
Szegedy C, Liu W, Jia Y Q, Sermanet P, Reed S E, Anguelov D, Erhan D, Vanhoucke V and Rabinovich A. 2015. Going deeper with convolutions//Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston, USA: IEEE: 1-9 [DOI: 10.1109/CVPR.2015.7298594]
Wang X L, Ye Y F and Gupta A. 2018. Zero-shot recognition via semantic embeddings and knowledge graphs//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 6857-6866 [DOI: 10.1109/CVPR.2018.00717]
Wei H X and Zhang Y. 2019. Zero-shot image classification based on generative adversarial network. Journal of Beijing University of Aeronautics and Astronautics, 45(12): 2345-2350
魏宏喜, 张越. 2019. 基于生成对抗网络的零样本图像分类. 北京航空航天大学学报, 45(12): 2345-2350 [DOI: 10.13700/j.bh.1001-5965.2019.0363]
Xu X, Hospedales T and Gong S G. 2017. Transductive zero-shot action recognition by word-vector embedding. International Journal of Computer Vision, 123(3): 309-333 [DOI: 10.1007/s11263-016-0983-5]
Xu X, Hospedales T M and Gong S G. 2016. Multi-task zero-shot action recognition with prioritised data augmentation//Proceedings of the 14th European Conference on Computer Vision. Amsterdam, the Netherlands: Springer: 343-359 [DOI: 10.1007/978-3-319-46475-6_22]
Zhang W M, Huang Y, Yu W T, Yang X S, Wang W and Sang J T. 2019. Multimodal attribute and feature embedding for activity recognition//Proceedings of the ACM Multimedia Asia. Beijing, China: ACM: 1-7 [DOI: 10.1145/3338533.3366592]
Zhu Y, Long Y, Guan Y, Newsam S and Shao L. 2018. Towards universal representation for unseen action recognition//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 9436-9445 [DOI: 10.1109/CVPR.2018.00983]