The review of visual knowledge: a new pivot for cross-media intelligence evolution
2022, Vol. 27, No. 9, Pages: 2574-2588
Print publication date: 2022-09-16
Accepted: 2022-05-25
DOI: 10.11834/jig.211264
Yi Yang, Yueting Zhuang, Yunhe Pan. The review of visual knowledge: a new pivot for cross-media intelligence evolution[J]. Journal of Image and Graphics, 2022, 27(9): 2574-2588.
We review the recent development of cross-media intelligence, analyze its emerging trends and practical bottlenecks, and discuss its future prospects. Cross-media intelligence focuses on the integration of multi-source and multi-modal data, and attempts to exploit the relationships among different media for high-level semantic understanding and logical reasoning. Existing cross-media algorithms mainly follow the paradigm of "single-media representation" followed by "multimedia integration", in which the processes of feature learning and logical reasoning are relatively disconnected. Such algorithms can hardly synthesize multi-source, multi-level semantic information into unified features, which prevents the reasoning and learning processes from benefiting and correcting each other. This paradigm lacks explicit knowledge accumulation and multi-level structure understanding, and it also restricts the interpretability and robustness of the model. Against this background, we turn to a new representation of intelligence, i.e., visual knowledge. Visual-knowledge-driven cross-media intelligence features multi-level modeling and knowledge reasoning, and its built-in mechanisms support visual operations and reconstruction, which facilitates knowledge alignment and association.
To establish a unified way of knowledge representation learning, we illustrate the theory of visual knowledge as follows. 1) We introduce the three key elements of visual knowledge, i.e., visual concepts, visual relationships, and visual reasoning, and discuss each of them in detail. Visual knowledge is capable of abstract knowledge representation and of complementing multiple other forms of knowledge. Visual relationships describe how visual concepts relate to one another and provide an effective basis for more complex cross-media visual reasoning. We demonstrate vision-based spatio-temporal and causal relationships, although visual relationships are not limited to these categories; we suggest that pairwise visual relationships be extended to multi-object cascade relationships that effectively integrate spatio-temporal and causal representations. Visual knowledge is derived from visual concepts and visual relationships, enabling more interpretable and generalizable high-level cross-media visual reasoning: it provides a structured knowledge representation, a multi-level basis for visual reasoning, and an effective way to explain neural network decisions. Broadly, visual reasoning here covers a variety of visual operations, such as prediction, reconstruction, association, and decomposition.
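To make these three elements concrete, the following minimal sketch (our own illustration under simplified assumptions, not an implementation from the paper; the class names, the toy "left_of" predicate, and the transitivity rule are all hypothetical) encodes visual concepts as attributed entities, visual relationships as typed triples, and a trivial visual-reasoning step that derives new spatial relations by transitive closure:

```python
from dataclasses import dataclass, field

@dataclass
class VisualConcept:
    # A visual concept: a nameable entity with visual attributes
    # (here only a 2D bounding box; shape, pose, or parts could be added).
    name: str
    box: tuple  # (x, y, w, h)

@dataclass
class VisualKnowledge:
    concepts: dict = field(default_factory=dict)
    relations: set = field(default_factory=set)  # (subject, predicate, object)

    def add_concept(self, c: VisualConcept):
        self.concepts[c.name] = c

    def relate(self, subj: str, pred: str, obj: str):
        self.relations.add((subj, pred, obj))

    def infer(self):
        # Toy visual reasoning: the spatial predicate "left_of" is
        # transitive, so derive new relations until a fixed point.
        changed = True
        while changed:
            changed = False
            for (a, p1, b) in list(self.relations):
                for (b2, p2, c) in list(self.relations):
                    if p1 == p2 == "left_of" and b == b2:
                        if (a, "left_of", c) not in self.relations:
                            self.relations.add((a, "left_of", c))
                            changed = True

kb = VisualKnowledge()
kb.add_concept(VisualConcept("cup", (10, 40, 20, 20)))
kb.add_concept(VisualConcept("plate", (40, 45, 30, 10)))
kb.add_concept(VisualConcept("fork", (80, 45, 5, 15)))
kb.relate("cup", "left_of", "plate")
kb.relate("plate", "left_of", "fork")
kb.infer()
print(("cup", "left_of", "fork") in kb.relations)  # True
```

In practice, such hand-written rules would be replaced by learned predictors over scene-graph-style structures; the sketch only shows how concepts, relations, and reasoning fit together.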
2) We discuss the applications of visual knowledge and analyze their future challenges in detail. We select three applications: the structured representation of visual knowledge, the operation and reasoning of visual knowledge, and cross-media reconstruction and generation. Visual knowledge is expected to resolve ambiguities in relational descriptions and to suppress data bias effectively. It is worth noting that these three applications cover only some cross-media intelligence examples of visual knowledge. Although hand-crafted features are less capable of abstracting multimedia data than deep learning features, such descriptors tend to be more interpretable. Effectively integrating hand-crafted and deep learning features for cross-media representation modeling is thus a typical application of visual knowledge representation in the context of cross-media intelligence, and the structured representation of visual knowledge contributes to improving model interpretability.
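As one hedged illustration of such integration (a minimal sketch under our own assumptions; the fusion-by-concatenation scheme and the choice of HOG plus ResNet-18 are ours, not prescribed by the paper), an interpretable hand-crafted descriptor can be concatenated with a deep embedding to form a joint representation, e.g., for cross-media retrieval:

```python
import numpy as np
import torch
from PIL import Image
from skimage.feature import hog
from torchvision.models import resnet18, ResNet18_Weights

def hybrid_feature(path: str) -> np.ndarray:
    """Concatenate an interpretable HOG descriptor with a deep
    ResNet-18 embedding (one plausible hand-crafted + deep fusion)."""
    img = Image.open(path).convert("RGB").resize((224, 224))

    # Hand-crafted part: HOG on the grayscale image (interpretable
    # orientation histograms over local cells).
    gray = np.asarray(img.convert("L"))
    hog_vec = hog(gray, orientations=9, pixels_per_cell=(8, 8),
                  cells_per_block=(2, 2))

    # Deep part: 512-D global embedding from a pretrained ResNet-18
    # with its classification head removed.
    weights = ResNet18_Weights.DEFAULT
    model = resnet18(weights=weights)
    model.fc = torch.nn.Identity()
    model.eval()
    with torch.no_grad():
        deep_vec = model(weights.transforms()(img).unsqueeze(0))[0].numpy()

    # L2-normalize each part so neither feature family dominates
    # distance computations, then concatenate.
    hog_vec /= (np.linalg.norm(hog_vec) + 1e-8)
    deep_vec /= (np.linalg.norm(deep_vec) + 1e-8)
    return np.concatenate([hog_vec, deep_vec])
```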
3) We analyze the advantages of visual knowledge: it helps achieve a unified framework driven by both data and knowledge, learn explainable and traceable structured representations, and promote cross-media knowledge association and intelligent reasoning. As visual-knowledge-based cross-media intelligence develops, more cross-media intelligence applications will emerge. Decision-making assistance becomes more credible through the structured, multi-granularity representation of visual knowledge and the integrated optimization of multi-source, cross-domain data; the reasoning process can be reviewed and clarified, and model generalization can be improved systematically. These factors provide a powerful new pivot for the evolution of cross-media intelligence. Visual knowledge can also greatly improve generative models and strengthen simulation technology. In the future, visual knowledge can serve as a prior to improve scene rendering and to enable interactive visual editing tools and controllable semantic understanding of scene objects. A data-driven, visual-knowledge-derived graphics system will focus on integrating the strengths of data and rules, extracting semantic features from visual data, optimizing model complexity, improving simulation, and producing realistic and sustainable content from new perspectives and in new scenarios.
Keywords: cross-media intelligence; visual knowledge; visual concepts; visual relationships; visual reasoning