A review of deep learning based human-object interaction detection
Vol. 27, Issue 9, Pages 2611-2628 (2022)
Received: 31 December 2021
Revised: 16 June 2022
Accepted: 23 June 2022
Published: 16 September 2022
DOI: 10.11834/jig.211268
Human-object interaction (HOI) detection aims to understand and analyze human behavior by precisely localizing, in an image or video, the person performing a specific action together with the object that person interacts with, and by recognizing the action relationship between them. HOI detection is a forward-looking research direction of great practical value and a key cornerstone of high-level visual understanding. With the development of deep learning, deep learning based methods have led recent progress in HOI detection research. On the one hand, this paper analyzes the spatial-domain (image-level) HOI detection task, summarizing and analyzing current datasets and benchmarks from two aspects: data content scenarios and annotation granularity. It then systematically reviews the current state of detection methods from the perspectives of two schools, the two-stage pipeline methods and the one-stage end-to-end methods, analyzes the characteristics, strengths, and weaknesses of both schools, and clarifies the development trajectory of methods in this field. Among them, two-stage methods comprise two main paradigms, multi-stream models and graph models, while one-stage models include box-based, relation-point-based, and query-based paradigms. On the other hand, this paper summarizes the spatio-temporal (video-level) HOI detection task, analyzing the construction and characteristics of existing spatio-temporal interaction datasets and the strengths and weaknesses of existing baseline algorithms. Finally, future research directions are discussed.
Human-object interaction (HOI) detection is essential for intelligent analysis of human behavior. It targets fine-grained understanding of images and videos by localizing interacting human-object pairs and recognizing their interaction types. HOI detection underpins high-level visual applications such as dangerous-behavior detection and human-robot interaction, and recent deep learning based methods have driven rapid progress in the field. This review critically examines these methods. We first cover image-level HOI detection, beginning with its datasets and benchmarks, since the growth of datasets has been a key driver of deep learning research. Organized by annotation granularity, the conventional image-level datasets fall into three levels: instance-level, part-level, and pixel-level. For each dataset we describe its image collection, annotation procedure, and statistics. Next,
we analyze conventional HOI detection methods according to their deep learning architectures, grouping them into two main folds: two-stage methods with a serial architecture and one-stage methods with an end-to-end framework. Two-stage methods consist of two sequential stages: an instance detector first localizes humans and objects, and a dedicated interaction classifier then reasons about the interaction types between the detected human-object pairs. Because the first stage typically reuses an off-the-shelf detector, two-stage research concentrates on designing an accurate interaction classifier; a minimal sketch of this pipeline is given below.
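To make the serial two-stage recipe concrete, the following sketch pairs an off-the-shelf torchvision detector with a toy second-stage classifier. It illustrates the general pipeline rather than any specific reviewed method; the spatial-layout feature, hidden size, and verb count (117, as in HICO-DET) are assumptions chosen for the example.

```python
# Minimal two-stage HOI sketch. Stage 1: a pretrained detector proposes
# humans and objects. Stage 2: a small classifier scores interaction types
# for every human-object pair. All sizes here are illustrative assumptions.
import torch
import torch.nn as nn
import torchvision

NUM_VERBS = 117  # e.g., HICO-DET defines 117 verb classes

def box_pair_features(h_box, o_box):
    """Encode the spatial layout of a human-object box pair (a common cue in
    multi-stream second stages). Boxes are (x1, y1, x2, y2) tensors."""
    return torch.cat([h_box, o_box, o_box - h_box])  # 12-d layout feature

class InteractionClassifier(nn.Module):
    """Stage-2 head: maps pairwise spatial features to verb scores.
    Real methods add appearance streams, human pose, graphs, etc."""
    def __init__(self, in_dim=12, hidden=256, num_verbs=NUM_VERBS):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, num_verbs))

    def forward(self, pair_feats):
        return self.mlp(pair_feats).sigmoid()  # multi-label verb probabilities

detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()
classifier = InteractionClassifier().eval()

image = torch.rand(3, 480, 640)                  # stand-in for a real image
with torch.no_grad():
    det = detector([image])[0]                   # stage 1: boxes, labels, scores
    humans = det["boxes"][det["labels"] == 1]    # COCO label 1 = person
    objects = det["boxes"][det["labels"] != 1]
    for h in humans:                             # stage 2: score every pair
        for o in objects:
            verbs = classifier(box_pair_features(h, o))
            # keep <human box, verb, object box> triplets above a threshold
```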
In contrast, one-stage methods detect HOI triplets directly in an end-to-end framework. They can also be viewed as a top-down paradigm: an anchor denoting the interaction is designed and detected first, then associated with its human and object. We retrace the representative methods of both folds, analyze their development paths, and weigh their pros, cons, and potential. We introduce the two-stage methods first, dividing the fold into the multi-stream pipeline and the graph-based pipeline according to the design of the second stage. We then introduce the one-stage methods, split into point-based, bounding-box-based, and query-based paradigms according to how the interaction anchor is defined; the query-based paradigm is sketched below.
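As an illustration of the query-based paradigm, the sketch below shows a DETR-style decoder in which a fixed set of learned HOI queries attends to encoded image features and each query directly decodes one <human, verb, object> triplet. The layer counts, dimensions, and head layout are assumptions chosen for brevity, not a reproduction of any particular published model.

```python
# Minimal query-based one-stage HOI sketch: learned queries attend to image
# tokens; per-query heads decode the triplet components in parallel.
import torch
import torch.nn as nn

class QueryHOIHead(nn.Module):
    def __init__(self, d_model=256, num_queries=64,
                 num_obj_classes=80, num_verbs=117):
        super().__init__()
        self.queries = nn.Embedding(num_queries, d_model)   # learned HOI queries
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=6)
        # one prediction head per triplet component
        self.human_box = nn.Linear(d_model, 4)              # (cx, cy, w, h), normalized
        self.object_box = nn.Linear(d_model, 4)
        self.object_cls = nn.Linear(d_model, num_obj_classes + 1)  # +1 "no object"
        self.verb_cls = nn.Linear(d_model, num_verbs)

    def forward(self, memory):
        """memory: (batch, num_tokens, d_model) flattened backbone features."""
        q = self.queries.weight.unsqueeze(0).expand(memory.size(0), -1, -1)
        hs = self.decoder(q, memory)          # each query gathers image context
        return {
            "human_boxes": self.human_box(hs).sigmoid(),
            "object_boxes": self.object_box(hs).sigmoid(),
            "object_logits": self.object_cls(hs),
            "verb_probs": self.verb_cls(hs).sigmoid(),   # multi-label verbs
        }

# usage: the memory would come from a backbone + transformer encoder
memory = torch.rand(2, 25 * 25, 256)   # stand-in for encoded image tokens
out = QueryHOIHead()(memory)           # 64 candidate triplets per image
# training matches queries to ground-truth triplets (e.g., Hungarian matching)
```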
At the end, we review the progress of zero-shot HOI detection. The growth of video-level HOI detection is likewise reviewed in terms of datasets and methods. Finally, we outline future directions of HOI detection: 1) large-scale pre-trained model-guided HOI detection: because human behaviors are diverse, the full space of HOI types is too large to annotate exhaustively, so zero-shot HOI discovery remains a challenging open problem; 2) self-supervised pre-training for HOI detection: large-scale image-text pre-trained models are expected to substantially benefit HOI understanding (the sketch after this abstract illustrates the underlying region-phrase matching idea); and 3) efficient video HOI detection: conventional multi-phase detection mechanisms struggle to detect video-level HOIs efficiently. In summary, this review systematically analyzes deep learning based HOI detection.
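The image-text pre-training direction can be sketched as follows: an interaction-region embedding is compared with text embeddings of candidate verb-object phrases in a shared space, so verbs unseen during HOI training can still be scored. Both encoders below are random stand-ins; a real system would obtain the embeddings from a large-scale pre-trained vision-language model such as CLIP.

```python
# Hedged sketch of zero-shot interaction scoring via region-phrase matching.
# The "encoders" are random placeholders standing in for pretrained models.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
phrases = ["a person riding a bicycle", "a person repairing a bicycle",
           "a person washing a bicycle"]     # may include unseen verbs

embed_dim = 512
text_embeds = F.normalize(torch.rand(len(phrases), embed_dim), dim=-1)  # stand-in text encoder
region_embed = F.normalize(torch.rand(embed_dim), dim=-1)               # stand-in visual encoder

logits = 100.0 * region_embed @ text_embeds.T   # temperature-scaled cosine similarity
probs = logits.softmax(dim=-1)
best = probs.argmax().item()
print(phrases[best], probs[best].item())        # best-matching interaction phrase
```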