Human-centered visual understanding technologies for complex scenes
Visual recognition technologies for complex scenarios analysis
2022, Vol. 27, No. 6, pp. 1723-1742
Received: 2022-02-28,
Revised: 2022-03-07,
Accepted: 2022-03-15,
Published in print: 2022-06-16
DOI: 10.11834/jig.220157

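Human-centered visual understanding technology for complex scenes can improve the efficiency of intelligent social collaboration and accelerate the intelligent transformation of social governance. It shows great vitality in serving economic activity and building smart cities, and it has significant social and economic value. The technology mainly covers real-time person recognition, individual behavior analysis and group interaction understanding, human-machine collaborative learning, facial expression and speech emotion recognition, and knowledge-guided visual understanding. These problems become more challenging in complex scenes, especially when visual representation and understanding must account for the overall "person-behavior-scene" relationship. Real-time person recognition in large-scale complex scenes focuses on face detection, person feature understanding, and scene analysis, and is an important research foundation for human visual understanding in complex scenes. Individual behavior analysis and group interaction understanding focus on video-based person re-identification, video action recognition, video question answering, and video dialog, and form the key behavioral component of visual understanding. Within individual behavior analysis and group interaction understanding, a machine learning paradigm that makes comprehensive use of knowledge and priors has also taken shape, with two key research directions: visual question answering and dialog, and vision-language navigation. Emotion recognition and synthesis focus on facial expression recognition, speech emotion recognition and synthesis, and knowledge-guided visual analysis, and are the core technologies of affective interaction. Centered on these core technologies, this paper describes the research hotspots and application scenarios of human visual understanding in complex scenes, summarizes related results and progress in China and abroad, and looks ahead to the frontier technologies and development trends of the field.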
Public security and social governance are essential to national development. Preventing large-scale riots in communities and urban crime, and carrying out social governance across space and time during the coronavirus disease 2019 (COVID-19) pandemic, call for highly accurate human identity verification, highly efficient human behavior analysis, and crowd flow tracking and tracing. The core of the challenge is to use computer vision technologies to extract visual information in complex scenarios and to fully express, identify, and understand the relationship between human behavior and scenes, so as to raise the level of social administration and governance. Visual recognition technologies oriented to complex scenarios can improve the efficiency of intelligent social collaboration and accelerate the process of intelligent social governance. Human recognition faces three main challenges: 1) diverse attacks, including mask and occlusion attacks, threaten the security of human identity recognition; 2) large spans in time and space reduce the accuracy of cross-age face recognition, especially for retrieval at the scale of tens of millions; 3) complex and changeable scenarios require systems that are highly robust and adapt to diverse environments. Therefore, it is necessary to advance remote human identity verification with a high degree of security, together with accurate face recognition, human behavior analysis, and scene semantic recognition. Individual behavior analysis and group interaction understanding are the key components of human visual understanding in complex scenarios. In detail, individual behavior analysis mainly includes video-based person re-identification and video-based action recognition, and group interaction understanding is mainly based on video question answering and video dialog. Video networks can record image information of individuals and groups from multiple cameras, and multi-camera research on human behavior covers group segmentation, group tracking, group behavior analysis, and abnormal behavior detection. However, individual behavior and group interaction recorded by multiple cameras in real scenarios are extremely complex, and it is still a great challenge to improve multi-camera, multi-target behavior recognition through integrated modeling of real scene structure, individual behavior, and group interaction. Recognition of individual and group behavior in video networks mainly depends on the captured visual information about scenes, individuals, and groups. Nonetheless, individual behavior analysis and group interaction understanding in complex scenarios often require human knowledge and priors that are not contained in the visual information. Specifically, the use of crowdsourced data has improved visual computing performance in visual question answering and dialog and in vision-language navigation. The knowledge embedded in crowdsourced data can be used to develop data-driven machine learning models that make comprehensive use of knowledge and priors in individual behavior analysis and group interaction understanding, and to establish a new method of data-driven, knowledge-guided visual computing. In addition, facial expressions can be regarded as a language of facial micro-motions, just as speech is a language of the voice. Speech emotion recognition can capture and understand human emotions and better support human-machine collaborative learning, so it is important for research on human visual recognition to go deeper in this direction. Current research focuses on facial expression recognition, speech emotion recognition, expression synthesis, and speech emotion synthesis. This paper reviews real-time human identification in complex scenarios, individual behavior analysis and group interaction understanding, visual and speech emotion recognition and synthesis, and machine learning that makes comprehensive use of knowledge and priors. It describes the research hotspots and application scenarios of visual recognition in complex scenarios, summarizes the current state of the field, and anticipates its frontier technologies and development trends. Human visual recognition technology will build the visual ability to recognize the relationships among humans, behaviors, and scenes. Further progress is expected in standard dataset construction, model computing resources, and model robustness and interpretability.
Agrawal A, Batra D, Parikh D and Kembhavi A. 2018. Don't just assume; look and answer: overcoming priors for visual question answering//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 4971-4980 [ DOI: 10.1109/CVPR.2018.00522 http://dx.doi.org/10.1109/CVPR.2018.00522 ]
Ahonen T, Hadid A and Pietikainen M. 2006. Face description with local binary patterns: application to face recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(12): 2037-2041 [DOI: 10.1109/TPAMI.2006.244]
Ai H Z, Liang L H and Xu G Y. 2000. A general framework for face detection//Proceedigns of the 3rd International Conference on Multimodal Interfaces. Beijing, China: Springer: 119-126 [ DOI: 10.1007/3-540-40063-X_16 http://dx.doi.org/10.1007/3-540-40063-X_16 ]
Anjum A, Abdullah T, Tariq M F, Baltaci Y and Antonopoulos N. 2019. Video stream analysis in clouds: an object detection and classification framework for high performance video analytics. IEEE Transactions on Cloud Computing, 7(4): 1152-1167 [DOI: 10.1109/TCC.2016.2517653]
Basri R and Jacobs D W. 2003. Lambertian reflectance and linear subspaces. IEEE Transactions on Pattern Analysis and MachineIntelligence, 25(2): 218-233 [DOI: 10.1109/TPAMI.2003.1177153]
Cao X D, Wei Y C, Wen F and Sun J. 2012. Face alignment by explicit shape regression//Proceedings of 2012 IEEE Conference on Computer Vision and Pattern Recognition. Providence, USA: IEEE: 2887-2894 [ DOI: 10.1109/CVPR.2012.6248015 http://dx.doi.org/10.1109/CVPR.2012.6248015 ]
Carreira J and Zisserman A. 2017. Quo vadis, action recognition? A new model and the kinetics dataset//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Honolulu, USA: IEEE: 4724-4733 [ DOI: 10.1109/CVPR.2017.502 http://dx.doi.org/10.1109/CVPR.2017.502 ]
Chen B Y, Guan W L, Li P X, Ikeda N, Hirasawa K and Lu H C. 2021. Residual multi-task learning for facial landmark localization and expression recognition. Pattern Recognition, 115: #107893 [DOI: 10.1016/j.patcog.2021.107893]
Chen S K, Wang J F, Chen Y D, Shi Z C, Geng X and Rui Y. 2020a. Label distribution learning on auxiliary label space graphs for facial expression recognition//Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Seattle, USA: IEEE: 13981-13990 [ DOI: 10.1109/CVPR42600.2020.01400 http://dx.doi.org/10.1109/CVPR42600.2020.01400 ]
Chen X F, Lao S Y and Duan T. 2020b. Multimodal fusi on of visual dialog: a survey//Proceedings of the 2nd International Conference on Robotics, Intelligent Control and Artificial Intelligence. Shanghai, China: ACM: 302-308 [ DOI: 10.1145/3438872.3439098 http://dx.doi.org/10.1145/3438872.3439098 ]
Chen Y Z, Huang T D, Niu Y Z, Ke X and Lin Y Y. 2019. Pose-guided spatial alignment and key frame selection for one-shot video-based person re-identification. IEEE Access, 7: 78991-79004 [DOI: 10.1109/ACCESS.2019.2922679]
Cootes T F, Edwards G J and Taylor C J. 2001. Active appearance models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(6): 681-685 [DOI: 10.1109/34.927467]
Cootes T F, Taylor C J, Cooper D H and Graham J. 1995. Active shape models-their training and application. Computer Vision and Image Understanding, 61(1): 38-59 [DOI: 10.1006/cviu.1995.1004]
Cristinacce D and Cootes T F. 2006. Feature detection and tracking with constrained local models//Proceedings of the British Machine Vision Conference. Edinburgh, UK: BMVA Press: #95 [ DOI: 10.5244/C.20.95 http://dx.doi.org/10.5244/C.20.95 ]
Farenzena M, Bazzani L, Perina A, Murino V and Cristani M. 2010. Person re-identification by symmetry-driven accumulation of local features//Proceedings of 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR). San Francisco, USA: IEEE: 2360-2367 [ DOI: 10.1109/CVPR.2010.5539926 http://dx.doi.org/10.1109/CVPR.2010.5539926 ]
Feng L T, Po L M, Li Y M, Xu X Y, Yuan F, Cheung T C H and Cheung K W. 2016. Integration of image quality and motion cues for face anti-spoofing: a neural network approach. Journal of Visual Communication and Image Representation, 38: 451-460 [DOI: 10.1016/j.jvcir.2016.03.019]
Gao C, Chen J Y, Liu S, Wang L T, Zhang Q and Wu Q. 2021. Room-and-object aware knowledge reasoning for remote embodied referring expression//Proceedings of 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville, USA: IEEE: 3063-3072 [ DOI: 10.1109/CVPR46437.2021.00308 http://dx.doi.org/10.1109/CVPR46437.2021.00308 ]
Gardères F, Ziaeefard M, Abeloos B and Lecue F. 2020. ConceptBert: concept-aware representation for visual question answering[s. l. ] : Association for Computational Linguistics: 489-498 [ DOI: 10.18653/v1/2020.findings-emnlp.44 http://dx.doi.org/10.18653/v1/2020.findings-emnlp.44 ]
Gheissari N, Sebastian T B and Hartley R. 2006. Person reidentification using spatiotemporal appearance//Proceedings of 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. New York, USA: IEEE: 1528-1535 [ DOI: 10.1109/CVPR.2006.223 http://dx.doi.org/10.1109/CVPR.2006.223 ]
Girshick R. 2015. Fast R-CNN//Proceedings of 2015 IEEE International Conference on Computer Vision. Santiago, Chile: IEEE: 1440-1448 [ DOI: 10.1109/ICCV.2015.169 http://dx.doi.org/10.1109/ICCV.2015.169 ]
Hara K, Kataoka H and Satoh Y. 2018. Can spatiotemporal 3D CNNs retrace the history of 2D CNNs and ImageNet?//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 6546-6555 [ DOI: 10.1109/CVPR.2018.00685 http://dx.doi.org/10.1109/CVPR.2018.00685 ]
He K M, Gkioxari G, Dollár P and Girshick R. 2017. Mask R-CNN//Proceedings of 2017 IEEE International Conference on Computer Vision. Venice, Italy: IEEE: 2980-2988 [ DOI: 10.1109/ICCV.2017.322 http://dx.doi.org/10.1109/ICCV.2017.322 ]
Hong Y C, Rodriguez-Opazo C, Qi Y K, Wu Q and Gould S. 2020. Language and visual entity relationship graph for agent navigation//Proceedings of the 34th Conference on Neural Information Processing Systems. Vancouver, Canada: Curran Associates Inc. : 7685-7696
Hong Y C, Wu Q, Qi Y K, Rodríguez-Opazo C and Gould S. 2021. VLN [ DOI: 10.1109/CVPR46437.2021.00169 http://dx.doi.org/10.1109/CVPR46437.2021.00169 ]
Hou R B, Ma B P, Chang H, Gu X Q, Shan S G and Chen X L. 2019. VRSTC: occlusion-free video person re-identification//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, USA: IEEE: 7176-7185 [ DOI: 10.1109/CVPR.2019.00735 http://dx.doi.org/10.1109/CVPR.2019.00735 ]
Jiang X Z, Du S Y, Qin Z C, Sun Y J and Yu J. 2020. KBGN: knowledge-bridge graph network for adaptive vision-text reasoning in visual dialogue//Proceedings of the 28th ACM International Conference on Multimedia. Seattle, USA: ACM: 1265-1273 [ DOI: 10.1145/3394171.3413826 http://dx.doi.org/10.1145/3394171.3413826 ]
Jing C C, Wu Y W, Zhang X X, Jia Y D and Wu Q. 2020. Overcoming language priors in VQA via decomposed linguistic representations. Proceedings of the AAAI Conference on Artificial Intelligence, 34(7): 11181-11188 [DOI: 10.1609/aaai.v34i07.6776]
Kil J, Zhang C, Xuan D and Cha o W L. 2021. Discovering the unknown knowns: turning implicit knowledge in the dataset into explicit training examples for visual question answering [EB/OL ] . [2022-02-13 ] . https://arxiv.org/pdf/2109.06122.pdf https://arxiv.org/pdf/2109.06122.pdf
Lao M R, Guo Y M, Liu Y and Lew M S. 2021. A language prior based focal loss for visual question answering//Proceedings of 2021 IEEE International Conference on Multimedia and Expo (ICME). Shenzhen, China: IEEE: 1-6 [ DOI: 10.1109/ICME51207.2021.9428165 http://dx.doi.org/10.1109/ICME51207.2021.9428165 ]
Laptev I. 2005. On space-time interest points. International Journal of Computer Vision, 64(2): 107-123 [DOI: 10.1007/s11263-005-1838-7]
Laptev I, Marszalek M, Schmid C and Rozenfeld B. 2008. Learning realistic human actions from movies//Proceedings of 2008 IEEE Conference on Computer Vision and Pattern Recognition. Anchorage, USA: IEEE: 1-8 [ DOI: 10.1109/CVPR.2008.4587756 http://dx.doi.org/10.1109/CVPR.2008.4587756 ]
Li G H, Wang X and Zhu W W. 2020. Boosting visual question answering with context-aware knowledge aggregation//Proceedings of the 28th ACM International Conference on Multimedia. Seattle, USA: ACM: 1227-1235 [ DOI: 10.1145/3394171.3413943 http://dx.doi.org/10.1145/3394171.3413943 ]
Li J L, Tan H and Bansal M. 2021. Improving cross-modal alignment in vision language navigation via syntactic information [EB/OL ] . [2022-02-13 ] . https://arxiv.org/pdf/2104.09580.pdf https://arxiv.org/pdf/2104.09580.pdf
Li L, Feng X Y, Boulkenafet Z, Xia Z Q, Li M M and Hadid A. 2016. An original face anti-spoofing approach using partial convolutional neural network//Proceedings of the 6th International Conference on Image Processing Theory, Tools and Applications (IPTA). Oulu, Finland: IEEE: 1-6 [ DOI: 10.1109/IPTA.2016.7821013 http://dx.doi.org/10.1109/IPTA.2016.7821013 ]
Li S, Bak S, Carr P and Wang X G. 2018. Diversity regularized spatiotemporal attention for video-based person re-identification//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 369-378 [ DOI: 10.1109/CVPR.2018.00046 http://dx.doi.org/10.1109/CVPR.2018.00046 ]
Li S and Deng W H. 2020. Deep facial expression recognition: a survey. Journal of Image and Graphics, 25(11): 2306-2320
李珊, 邓伟洪. 2020. 深度人脸表情识别研究进展. 中国图象图形学报, 25(11): 2306-2320 [DOI: 10.11834/jig.200233]
Liang L H, Ai H Z and He K Z. 1999. Multi-template-matching based single face detection. Journal of Image and Graphics, 4(10): 825-830
梁路宏, 艾海舟, 何克忠. 1999. 基于多模板匹配的单人脸检测. 中国图象图形学报, 4(10): 825-830 [DOI: 10.11834/jig.1999010197]
Lin J, Gan C and Han S. 2019. TSM: temporal shift module for efficient video understanding//Proceedings of 2019 IEEE/CVF International Conference on Computer Vision (ICCV). Seoul, Korea (South): IEEE: 7082-7092 [ DOI: 10.1109/ICCV.2019.00718 http://dx.doi.org/10.1109/ICCV.2019.00718 ]
Lin T Y, Dollár P, Girshick R, He K M, Hariharan B and Belongie S. 2017. Feature pyramid networks for object detection//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE: 936-944 [ DOI: 10.1109/CVPR.2017.106 http://dx.doi.org/10.1109/CVPR.2017.106 ]
Lindeberg T. 2012. Scale invariant feature transform. Scholarpedia, 7(5): #10491 [DOI: 10.4249/scholarpedia.10491]
Liu M B, Yao H X and Gao W. 1998. Real-time human face tracking in color images. Chinese Journal of Computers, 21(6): 527-532
刘明宝, 姚鸿勋, 高文. 1998. 彩色图像的实时人脸跟踪方法. 计算机学报, 21(6): 527-532 [DOI: 10.3321/j.issn:0254-4164.1998.06.007]
Liu P, Wei Y C, Meng Z B, Deng W H, Zhou J T and Yang Y. 2021b. Omni-supervised facial expression recognition: a simple baseline [EB/OL ] . [2022-02-13 ] . https://arxiv.org/pdf/2005.08551.pdf https://arxiv.org/pdf/2005.08551.pdf
Liu Y J, Jourabloo A and Liu X M. 2018. Learning deep models for face anti-spoofing: binary or auxiliary supervision//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 389-398 [ DOI: 10.1109/CVPR.2018.00048 http://dx.doi.org/10.1109/CVPR.2018.00048 ]
Liu Y J, Stehouwer J, Jourabloo A and Liu X M. 2019. Deep tree learning for zero-shot face anti-spoofing//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Long Beach, USA: IEEE: 4675-4684 [ DOI: 10.1109/CVPR.2019.00481 http://dx.doi.org/10.1109/CVPR.2019.00481 ]
Liu Y J, Stehouwer J and Liu X M. 2020. On disentangling spoof trace for generic face anti-spoofing [EB/OL ] . [2022-02-13 ] . https://arxiv.org/pdf/2007.09273.pdf https://arxiv.org/pdf/2007.09273.pdf
Lu C H, Wen X, Liu R L and Chen X. 2021. Multi-speaker emotional speech synthesis with fine-grained prosody modeling//Proceedings of the 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Toronto, Canada: IEEE: 5729-5733 [ DOI: 10.1109/ICASSP39728.2021.9413398 http://dx.doi.org/10.1109/ICASSP39728.2021.9413398 ]
Lu C Y, Zhang C S, Wen F and Yan P F. 1999. Regional feature based fast human face detection. Journal of Tsinghua University (Science and Technology), 39(1): 101-105
卢春雨, 张长水, 闻芳, 阎平凡. 1999. 基于区域特征的快速人脸检测法. 清华大学学报(自然科学版), 39(1): 101-105 [DOI: 10.3321/j.issn:1000-0054.1999.01.027]
Lu J W, Peng Y X, Qi G J and Yu J. 2020. Guest editorial introduction to the special section on representation learning for visual content understanding. IEEE Transactions on Circuits and Systems for Video Technology, 30(9): 2797-2800 [DOI: 10.1109/TCSVT.2020.3009095]
Lyu X G, Zhou J and Zhang C S. 2000. A novel algorithm for rotated human face detection//Proceedings of 2000 IEEE Conference on Computer Vision and Pattern Recognition. Hilton Head, USA: IEEE: 760-765 [ DOI: 10.1109/CVPR.2000.855897 http://dx.doi.org/10.1109/CVPR.2000.855897 ]
Ma B P, Su Y and Jurie F. 2014. Covariance descriptor based on bio-inspired features for person re-identification and face verification. Image and Vision Computing, 32(6/7): 379-390 [DOI: 10.1016/j.imavis.2014.04.002]
Mahdi M K, Wu Q, Abbasnejad E and Shi J. 2020. Utilising Prior Knowledge for Visual Navigation: Distil and Adapt[EB/OL ] . [2022-02-13 ] . https://arxiv.org/pdf/2004.03222v2.pdf https://arxiv.org/pdf/2004.03222v2.pdf
Maninis K K, Caelles S, Chen Y, Pont-Tuset J, Leal-Taixé L, Cremers D and Van Gool L. 2019. Video object segmentation without temporal information. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(6): 1515-1530 [DOI: 10.1109/TPAMI.2018.2838670]
Marino K, Chen X L, Parikh D, Gupta A and Rohrbach M. 2021. KRISP: integrating implicit and symbolic knowledge for open-domain knowledge-based VQA//Proceedings of 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville, USA: IEEE: 14106-14116 [ DOI: 10.1109/CVPR46437.2021.01389 http://dx.doi.org/10.1109/CVPR46437.2021.01389 ]
Miao J, Yin B C, Wang K Q, Shen L S and Chen X C. 1999. A hierarchical multiscale and multiangle system for human face detection in a complex background using gravity-center template. Pattern Recognition, 32(7): 1237-1248
Murahari V, Batra D, Parikh D and Das A. 2020. Large-scale pretraining for visual dialog: a simple state-of-the-art baseline//Proceedings of the 16th European Conference on Computer Vision. Glasgow, UK: Springer: 336-352 [ DOI: 10.1007/978-3-030-58523-5_20 http://dx.doi.org/10.1007/978-3-030-58523-5_20 ]
Navaneet K L, Todi V, Babu R V and Chakraborty A. 2019. All for one: frame-wise rank loss for improving video-based person re-identification//Proceedings of 2019 IEEE International Conference on Acoustics, Speech and Signal Processing. Brighton, UK: IEEE: 2472-2476 [ DOI: 10.1109/ICASSP.2019.8682292 http://dx.doi.org/10.1109/ICASSP.2019.8682292 ]
Otberdout N, Daoudi M, Kacem A, Ballihi L and Berretti S. 2022. Dynamic facial expression generation on hilbert hypersphere with conditional wasserstein generative adversarial nets. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(2): 848-863 [DOI: 10.1109/TPAMI.2020.3002500]
Qi J X, Niu Y L, Huang J Q and Zhang H W. 2020. Two causal principles for improving visual dialog//Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, USA: IEEE: 10857-10866 [ DOI: 10.1109/CVPR42600.2020.01087 http://dx.doi.org/10.1109/CVPR42600.2020.01087 ]
Qi Y K, Pan Z Z, Hong Y C, Yang M H, van den Hengel A and Wu Q. 2021. The road to know-where: an object-and-room informed sequential BERT for indoor vision-language navigation [EB/OL ] . [2022-02-13 ] . https://arxiv.org/pdf/2104.04167.pdf https://arxiv.org/pdf/2104.04167.pdf
Qing L Y, Shan S G, Chen X L and Gao W. 2006. Face recognition under varying lighting based on the harmonic images. Chinese Journal of Computers, 29(5): 760-768
卿来云, 山世光, 陈熙霖, 高文. 2006. 基于球面谐波基图像的任意光照下的人脸识别. 计算机学报, 29(5): 760-768 [doi: 10.3321/j.issn:0254-4164.2006.05.011]
Qiu Z F, Yao T and Mei T. 2017. Learning spatio-temporal representation with pseudo-3D residual networks//Proceedings of 2017 IEEE International Conference on Computer Vision (ICCV). Venice, Italy: IEEE: 5534-5542 [ DOI: 10.1109/ICCV.2017.590 http://dx.doi.org/10.1109/ICCV.2017.590 ]
Ramnath K and Hasegawa-Johnson M. 2021. Seeing is knowing! Fact-based visual question answering using knowledge graph embeddings [EB/OL ] . [2022-02-13 ] . https://arxiv.org/pdf/2012.15484.pdf https://arxiv.org/pdf/2012.15484.pdf
Savvides M, Kumar B V K V and Khosla P K. 2004. Eigenphases vs//Proceedings of the 17th International Conference on Pattern Recognition. Cambridge, UK: IEEE: 810-813 [ DOI: 10.1109/ICPR.2004.1334652 http://dx.doi.org/10.1109/ICPR.2004.1334652 ]
Shan S G, Gao W, Cao B and Zhao D B. 2003. Illumination normalization for robust face recognition against varying lighting conditions//Proceedings of 2003 IEEE International SOI Conference. Nice, France: IEEE: 157-164 [ DOI: 10.1109/AMFG.2003.1240838 http://dx.doi.org/10.1109/AMFG.2003.1240838 ]
Shao R, Lan X Y, Li J W and Yuen P C. 2019. Multi-adversarial discriminative deep domain generalization for face presentation attack detection//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Long Beach, USA: IEEE: 10015-10023 [ DOI: 10.1109/CVPR.2019.01026 http://dx.doi.org/10.1109/CVPR.2019.01026 ]
Shao Z W, Liu Z L, Cai J F and Ma L Z. 2021a JAA-Net: joint facial action unit detection and face alignment via adaptive attention. International Journal of Computer Vision, 129(2): 321-340
Shao Z W, Zhu H L, Tang J S, Lu X Q and Ma L Z. 2021b Explicit Facial Expression Transfervia Fine-Grained Representations. IEEE Transactions on Image Processing. 30: 4610-4621
Shashua A and Riklin-Raviv T. 2001. The quotient image: class-based re-rendering and recognition with varying illuminations. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(2): 129-139 [DOI: 10.1109/34.908964]
Simonyan K and Zisserman A. 2014. Two-stream convolutional networks for action recognition in videos [EB/OL ] . [2022-02-13 ] . https://arxiv.org/pdf/1406.2199.pdf https://arxiv.org/pdf/1406.2199.pdf
Su Z, Zhu C, Dong Y P, Cai D Q, Chen Y R and Li J G. 2018. Learning visual knowledge memory networks for visual question answering//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 7736-7745 [ DOI: 10.1109/CVPR.2018.00807 http://dx.doi.org/10.1109/CVPR.2018.00807 ]
Taigman Y,Yang M, Ranzato M A and Wolf L. 2014. DeepFace: closing the gap to human-level performance in face verification//Proceedings of 2014 IEEE Conference on Computer Vision and Pattern Recognition. Columbus, USA: IEEE: 1701-1708 [ DOI: 10.1109/CVPR.2014.220 http://dx.doi.org/10.1109/CVPR.2014.220 ]
Tang J S, Shao Z and Ma L Z. 2020. Fine-Grained Expression Manipulation Via Structured Latent Space// Proceedings of 2020 IEEE International Conference on Multimedia and Expo (ICME). London, UK: IEEE: 1-6[10.1109/ICME46284.2020.9102852]
Tang J S, Shao Z and Ma L Z. 2021. EGGAN: Learning Latent Space for Fine-Grained Expression Manipulation. IEEE MultiMedia, 28(3): 42-51
Tan X, Xu K, Cao Y, Zhang Y, Ma L Z and Lau Rynson W H. 2021. Night-time scene parsing with a large real dataset. IEEE Transactions on Image Processing, 30: 9085-9098 [DOI: 10.1109/TIP.2021.3122004]
Tran D, Bourdev L, Fergus R, Torresani L and Paluri M. 2015. Learning spatiotemporal features with 3D convolutional networks//Proceedings of 2015 IEEE International Conference on Computer Vision (ICCV). Santiago, Chile: IEEE: 4489-4497 [ DOI: 10.1109/ICCV.2015.510 http://dx.doi.org/10.1109/ICCV.2015.510 ]
Tu T, Ping Q, Thattai G, Tur G and Natarajan P. 2021. Learning better visual dialog agents with pretrained visual-linguistic representation//Proceedings of 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville, USA: IEEE: 5618-5627 [ DOI: 10.1109/CVPR46437.2021.00557 http://dx.doi.org/10.1109/CVPR46437.2021.00557 ]
Vries H D, Strub F, Chandar S, Pietquin O and Courville A. 2016. GuessWhat?! Visual object discovery through multi-modal dialogue[EB/OL ] . [2022-02-13 ] . https://arxiv.org/pdf/1611.08481.pdf https://arxiv.org/pdf/1611.08481.pdf
Wang H and Schmid C. 2013. Action recognition with improved trajectories//Proceedings of 2013 IEEE International Conference on Computer Vision. Sydney, Australia: IEEE: 3551-3558 [ DOI: 10.1109/ICCV.2013.441 http://dx.doi.org/10.1109/ICCV.2013.441 ]
Wang H T, Li S Z and Wang Y S. 2004. Face recognition under varying lighting conditions using self quotient image//Proceedings of the 6th IEEE International Conference on Automatic Face and Gesture Recognition. Seoul, Korea (South): IEEE: 819-824 [ DOI: 10.1109/AFGR.2004.1301635 http://dx.doi.org/10.1109/AFGR.2004.1301635 ]
Wang J G and Tan T N. 2000. A new face detection method based on shape information. Pattern Recognition Letters, 21(6/7): 463-471 [DOI: 10.1016/S0167-8655(00)00008-8]
Wang K, Peng X J, Yang J F, Lu S J and Qiao Y. 2020a. Suppressing uncertainties for large-scale facial expression recognition//Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recogniti on (CVPR). Seattle, USA: IEEE: 6896-6905 [ DOI: 10.1109/CVPR42600.2020.00693 http://dx.doi.org/10.1109/CVPR42600.2020.00693 ]
Wang K, Peng X J, Yang J F, Meng D B and Qiao Y. 2020b. Region attention networks for pose and occlusion robust facial expression recognition. IEEE Transactions on Image Processing, 29: 4057-4069 [DOI: 10.1109/TIP.2019.2956143]
Wang L M, Xiong Y J, Wang Z, Qiao Y, Lin D H, Tang X O and van Gool L. 2016. Temporal segment networks: towards good practices for deep action recognition//Proceedings of the 14th European Conference on Computer Vision. Amsterdam, the Netherlands: Springer: 20-36 [ DOI: 10.1007/978-3-319-46484-8_2 http://dx.doi.org/10.1007/978-3-319-46484-8_2 ]
Wang P, Wu Q, Shen C H, Dick A and van den Hengel A. 2017. Explicit knowledge-based reasoning for visual question answering//Proceedings of the 26th International Joint Conference on Artificial Intelligence. Melbourne, Australia: AAAI Press: 1290-1296
Wang P, Wu Q, Shen C H, Dick A and van den Hengel A. 2018a. FVQA: fact-based visual question answering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(10): 2413-2427 [DOI: 10.1109/TPAMI.2017.2754246]
Wang W X, Sun Q, Fu Y W, Chen T, Cao C J, Zheng Z Q, Xu G Q, Qiu H, Jiang Y G and Xue X Y. 2019. Comp-GAN: compositional generative adversarial network in synthesizing and recognizing facial expression//Proceedings of the 27th ACM International Conference on Multimedia. Nice, France: ACM: 211-219 [ DOI: 10.1145/3343031.3351032 http://dx.doi.org/10.1145/3343031.3351032 ]
Wang Y, Joty S, Lyu M R, King I, Xiong C M and Hoi S C H. 2020c. VD-BERT: a unified vision and dialog transformer with BERT [EB/OL ] . [2022-02-13 ] . https:arxiv.org/pdf/2004.13278.pdf https:arxiv.org/pdf/2004.13278.pdf
Wang Y Q. 2014. An analysis of the Viola-Jones face detection algorithm. Image Processing on Line, 4: 128-148 [DOI: 10.5201/ipol.2014.104]
Wang ZN, Zeng F W, Liu S C and Zeng B. 2021c. OAENet: oriented attention ensemble for accurate facial expression recognition. Pattern Recognition, 112: #107694 [DOI: 10.1016/j.patcog.2020.107694]
Wojciech Z, Zoran Z and Ben J A K. 2005. Keeping track of humans: have I seen this person before?//Proceedings of 2005 IEEE International Conference on Robotics and Automation. Barcelona, Spain: IEEE: 2081-2086 [ DOI: 10.1109/ROBOT.2005.1570420 http://dx.doi.org/10.1109/ROBOT.2005.1570420 ]
Wong W K, Lai Z H, Wen J J, Fang X Z and Lu Y W. 2017. Low-rank embedding for robust image feature extraction. IEEE Transactions on Image Processing, 26(6): 2905-2917 [DOI: 10.1109/TIP.2017.2691543]
Wu J J, Jiang J G, Qi M B and Liu H. 2019. Independent metric learning with aligned multi-part features for video-based person re-identification. Multimedia Tools and Applications, 78(20): 29323-29341 [DOI: 10.1007/s11042-018-7119-6]
Wu K, Zhu H L, Hao Y Y and Ma L Z. 2017. Cascade regression based multi-poseface alignment. Journal of Image and Graphics, 22(2): 257-264
伍凯, 朱恒亮, 郝阳阳, 马利庄. 2017. 级联回归的多姿态人脸配准. 中国图象图形学报, 22(2): 257-264 [DOI: 10.11834/jig.20170214]
Wu Q, Wang P, Shen C H, Dick A and Van Den Hengel A. 2016. Ask me anything: free-form visual question answering based on knowledge from external sources//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, USA: IEEE: 4622-4630 [ DOI: 10.1109/CVPR.2016.500 http://dx.doi.org/10.1109/CVPR.2016.500 ]
Wu Q, Shen C H, Wang P, Dick A and van den Hengel A. 2018. Image captioning and visual question answering based on attributes and external knowledge. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(6): 1367-1381 [DOI: 10.1109/TPAMI.2017.2708709]
Wu W S, Chang T and Li X M. 2021. Visual-and-language navigation: a survey and taxonomy [EB/OL ] . [2022-02-03 ] . https://arxiv.org/pdf/2108.11544.pdf https://arxiv.org/pdf/2108.11544.pdf
Xie S Y, Hu H F and Chen Y Z. 2021. Facial expression recognition with two-branch disentangled generative adversarial network. IEEE Transactions on Circuits and Systems for Video Technology, 31(6): 2359-2371 [DOI: 10.1109/TCSVT.2020.3024201]
Xing X, Wang K Q and Shen L S. 2000. A real-time algorithm for tracking human faces based on organ tracking. Acta Electronica Sinica, 28(6): 29-31
邢昕, 汪孔桥, 沈兰荪. 2000. 基于器官跟踪的人脸实时跟踪方法. 电子学报, 28(6): 29-31 [DOI: 10.3321/j.issn:0372-2112.2000.06.008]
Xiong X H and de la Torre F. 2013. Supervised descent method and its applications to face alignment//Proceedings of 2013 IEEE Conference on Computer Vision and Pattern Recognition. Portland, USA: IEEE: 532-539 [ DOI: 10.1109/CVPR.2013.75 http://dx.doi.org/10.1109/CVPR.2013.75 ]
Xu B H, Ye H, Zheng Y B, Wang H, Luwang T and Jiang Y G. 2019. Dense dilated network for video action recognition. IEEE Transactions on Image Processing, 28(10): 4941-4953 [DOI: 10.1109/TIP.2019.2917283]
Yan J J, Lei Z, Wen L Y and Li S Z. 2014. The fastest deformable part model for object detection//Proceedings of 2014 IEEE Conference on Computer Vision and Pattern Recognition. Columbus, USA: IEEE: 2497-2504 [ DOI: 10.1109/CVPR.2014.320 http://dx.doi.org/10.1109/CVPR.2014.320 ]
Yan Y, Huang Y, Chen S, Shen C H and Wang H Z. 2020. Joint deep learning of facial expression synthesis and recognition. IEEE Transactions on Multimedia, 22(11): 2792-2807 [DOI: 10.1109/TMM.2019.2962317]
Yang J W, Lei Z and Li S Z. 2014. Learn convolutional neural network for face anti-spoofing [EB/OL]. [2022-02-03]. https://arxiv.org/pdf/1408.5601.pdf
Yang X, Luo W H, Bao L C, Gao Y, Gong D H, Zheng S B, Li Z F and Liu W. 2019. Face anti-spoofing: model matters, so does data//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, USA: IEEE: 3502-3511 [DOI: 10.1109/CVPR.2019.00362]
Yu J, Zhu Z H, Wang Y J, Zhang W F, Hu Y and Tan J L. 2020. Cross-modal knowledge reasoning for knowledge-based visual question answering. Pattern Recognition, 108: #107563 [DOI: 10.1016/j.patcog.2020.107563]
Yu J H, Zhang C Y, Song Y and Cai W D. 2021. ICE-GAN: identity-aware and capsule-enhanced GAN with graph-based reasoning for micro-expression recognition and synthesis//Proceedings of 2021 International Joint Conference on Neural Networks (IJCNN). Shenzhen, China: IEEE: 1-8 [DOI: 10.1109/IJCNN52387.2021.9533988]
Zafeiriou S, Zhang C and Zhang Z Y. 2015. A survey on face detection in the wild: past, present and future. Computer Vision and Image Understanding, 138: 1-24 [DOI: 10.1016/j.cviu.2015.03.015]
Zhang F F, Zhang T Z, Mao Q R and Xu C S. 2020a. Geometry guided pose-invariant facial expression recognition. IEEE Transactions on Image Processing, 29: 4445-4460 [DOI: 10.1109/TIP.2020.2972114]
Zhang F F, Zhang T Z, Mao Q R and Xu C S. 2020b. A unified deep model for joint facial expression recognition, face synthesis, and face alignment. IEEE Transactions on Image Processing, 29: 6574-6589
Zhang K Y, Yao T P, Zhang J, Tai Y, Ding S H, Li J L, Huang F Y, Song H C and Ma L Z. 2020c. Face anti-spoofing via disentangled representation learning//Proceedings of the 16th European Conference on Computer Vision. Glasgow, UK: Springer: 641-657 [DOI: 10.1007/978-3-030-58529-7_38]
Zhang L Y, Liu S C, Liu D H, Zeng P P, Li X P, Song J K and Gao L L. 2021a. Rich visual knowledge-based augmentation network for visual question answering. IEEE Transactions on Neural Networks and Learning Systems, 32(10): 4362-4373 [DOI: 10.1109/TNNLS.2020.3017530]
Zhang W, Li Y M, Lu W Z, Xu X S, Liu Z W and Ji X Y. 2019. Learning intra-video difference for person re-identification. IEEE Transactions on Circuits and Systems for Video Technology, 29(10): 3028-3036 [DOI: 10.1109/TCSVT.2018.2872957]
Zhang Y F, Ming J and Qi Z. 2021b. Explicit knowledge incorporation for visual reasoning//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Virtual: IEEE/CVF
Zhao L, Gao L L, Guo Y Y, Song J K and Shen H T. 2021a. SKANet: structured knowledge-aware network for visual dialog//Proceedings of 2021 IEEE International Conference on Multimedia and Expo (ICME). Shenzhen, China: IEEE: 1-6 [DOI: 10.1109/ICME51207.2021.9428279]
Zhao Y, Yang L, Pei E C, Oveneke M C, Alioscha-Perez M, Li L F, Jiang D M and Sahli H. 2021b. Action unit driven facial expression synthesis from a single image with patch attentive GAN. Computer Graphics Forum, 40(6): 47-61 [DOI: 10.1111/cgf.14202]
Zhao Z Q, Liu Q S and Wang S M. 2021c. Learning deep global multi-scale and local attention features for facial expression recognition in the wild. IEEE Transactions on Image Processing, 30: 6544-6556 [DOI: 10.1109/TIP.2021.3093397]
Zhi R C, Xu H R, Wan M and Li T T. 2019. Combining 3D convolutional neural networks with transfer learning by supervised pre-training for facial micro-expression recognition. IEICE Transactions on Information and Systems, E102.D(5): 1054-1064 [DOI: 10.1587/transinf.2018EDP7153]
Zhou J, Lu C Y, Zhang C S and Li Y D. 2000. A survey of automatic human face recognition. Acta Electronica Sinica, 28(4): 102-106 [DOI: 10.3321/j.issn:0372-2112.2000.04.027]
Zhou Z, Huang Y, Wang W, Wang L and Tan T N. 2017. See the forest for the trees: joint spatial and temporal recurrent neural networks for video-based person re-identification//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE: 6776-6785 [DOI: 10.1109/CVPR.2017.717]
Zhu F D, Zhu Y, Chang X J and Liang X D. 2020a. Vision-language navigation with self-supervised auxiliary reasoning tasks//Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, USA: IEEE: 10009-10019 [DOI: 10.1109/CVPR42600.2020.01003]
Zhu X K, Jing X Y, You X G, Zhang X Y and Zhang T P. 2018. Video-based person re-identification by simultaneously learning intra-video and inter-video distance metrics. IEEE Transactions on Image Processing, 27(11): 5683-5695 [DOI: 10.1109/TIP.2018.2861366]
Zhu Y, Weng Y, Zhu F D, Liang X D, Ye Q X, Lu Y T and Jiao J B. 2021. Self-motivated communication agent for real-world vision-dialog navigation//Proceedings of 2021 IEEE/CVF International Conference on Computer Vision. Montreal, Canada: IEEE: 1574-1583 [DOI: 10.1109/ICCV48922.2021.00162]
Zhu Y K, Zhang C, Ré C and Li F F. 2015. Building a large-scale multimodal knowledge base system for answering visual queries [EB/OL]. [2022-02-13]. https://arxiv.org/pdf/1507.05670.pdf
Zhu Z H, Yu J, Wang Y J, Sun Y J, Hu Y and Wu Q. 2020b. Mucko: multi-layer cross-modal knowledge reasoning for fact-based visual question answering [EB/OL]. [2022-02-13]. https://arxiv.org/pdf/2006.09073.pdf