Multimedia intelligence: the convergence of multimedia and artificial intelligence

Wenwu Zhu; Xin Wang; Yonghong Tian; Wen Gao

doi:10.11834/jig.220086


2022, Vol. 27, No. 9, pp. 2551-2573
Received: 2022-01-27; Revised: 2022-06-29; Accepted: 2022-07-06; Published in print: 2022-09-16
DOI: 10.11834/jig.220086

Wenwu Zhu, Xin Wang, Yonghong Tian, Wen Gao. Multimedia intelligence: the convergence of multimedia and artificial intelligence[J]. Journal of Image and Graphics (中国图象图形学报), 2022, 27(9): 2551-2573. DOI: 10.11834/jig.220086.

Summary

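Over the past 10 years, a large number of emerging multimedia applications and services have appeared, producing a wealth of multimedia data that can be used for frontier multimedia research. Multimedia research has made important progress in image/video content analysis, multimedia search and recommendation, streaming services, and multimedia content delivery. Meanwhile, thanks to major breakthroughs in deep learning, artificial intelligence (AI), which was formally recognized as a discipline in the 1950s, has entered a "new" wave of development. A question therefore arises naturally: what happens when multimedia meets AI? To answer it, this paper introduces the concept of multimedia intelligence by studying the mutual influence between multimedia and AI. The mutual influence is examined from two aspects: first, multimedia drives AI toward greater explainability; second, AI in turn brings new ways of thinking to multimedia research. These two aspects form a virtuous cycle in which multimedia and AI continually advance each other. The paper discusses related research and progress and shares insights on research directions worth further exploration, in the hope of bringing new ideas to the future development of multimedia intelligence.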

Abstract

Multimedia can be regarded as an integration of various media such as videos, static images, audio, and text. Thanks to the rapid development of emerging multimedia applications and services, a huge amount of multimedia data has been generated to advance multimedia research. Furthermore, multimedia research has made great progress in image/video processing and analysis, including search, recommendation, streaming, and content delivery. Since artificial intelligence (AI) became an official academic discipline in the 1950s, it has experienced a "new" wave of growth driven by deep learning techniques. Its development over the past decades has spanned expert systems, intelligent search and optimization, symbolic and logical reasoning, probabilistic methods, statistical learning methods, artificial neural networks, etc. As such, a natural question arises: "What will happen when multimedia meets AI?" To answer this question, we introduce the concept of multimedia intelligence by investigating the mutual influences between multimedia and AI. Multimedia drives AI towards a more explainable paradigm, because semantic information is able to enhance the explainability of AI models. At the same time, AI helps multimedia technology possess advanced reasoning ability. AI promotes human-like perception and reasoning processes, which can lead to more inferable multimedia processing and analysis techniques. These mutual influences form a loop in which multimedia and AI interactively enhance each other. To sum up, we discuss recent advances in the literature and share our insights on future research directions deserving further study. We hope this paper can bring new inspiration for the future development of multimedia intelligence.


references

Afouras T, Chung J S, Senior A, Vinyals O and Zisserman A. 2018. Deep audio-visual speech recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence: #2889052 [DOI: 10.1109/TPAMI.2018.2889052]

Agrawal R, Faloutsos C and Swami A. 1993. Efficient similarity search in sequence databases//Proceedings of the 4th International Conference on Foundations of Data Organization and Algorithms. Chicago, USA: Springer: 69-84 [ DOI:10.1007/3-540-57301-1_5 http://dx.doi.org/10.1007/3-540-57301-1_5 ]

Akiba T, Sano S, Yanase T, Ohta T and Koyama M. 2019. Optuna: a next-generation hyperparameter optimization framework//Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Anchorage, USA: Association for Computing Machinery: 2623-2631 [ DOI:10.1145/3292500.3330701 http://dx.doi.org/10.1145/3292500.3330701 ]

Anderson P, Wu Q, Teney D, Bruce J, Johnson M, Sünderhauf N, Reid I, Gould S and van den Hengel A. 2018. Vision-and-language navigation: interpreting visually-grounded navigation instructions in real environments//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Salt Lake City, USA: IEEE: 3674-3683 [ DOI:10.1109/CVPR.2018.00387 http://dx.doi.org/10.1109/CVPR.2018.00387 ]

Andreas J, Rohrbach M, Darrell T and Klein D. 2016a. Neural module networks//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Las Vegas, USA: IEEE: 39-48 [ DOI:10.1109/CVPR.2016.12 http://dx.doi.org/10.1109/CVPR.2016.12 ]

Andreas J, Rohrbach M, Darrell T and Klein D. 2016b. Learning to compose neural networks for question answering//Proceedings of 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. San Diego, USA: Association for Computational Lingu istics: 1545-1554 [ DOI:10.18653/v1/n16-1181 http://dx.doi.org/10.18653/v1/n16-1181 ]

Antol S, Agrawal A, Lu J S, Mitchell M, Batra D, Zitnick C L and Parikh D. 2015. VQA: visual question answering//Proceedings of 2015 IEEE International Conference on Computer Vision (ICCV). Santiago, Chile: IEEE: 2425-2433 [ DOI:10.1109/ICCV.2015.279 http://dx.doi.org/10.1109/ICCV.2015.279 ]

Badamdorj T, Rochan M, Wang Y and Cheng L. 2021. Joint visual and audio learning for video highlight detection//Proceedings of 2021 IEEE/CVF International Conferenceon Computer Vision (ICCV). Montreal, Canada: IEEE: 8107-8117 [ DOI:10.1109/ICCV48922.2021.00802 http://dx.doi.org/10.1109/ICCV48922.2021.00802 ]

Baevski A, Hsu W N, Conneau A and Auli M. 2021. Unsupervised speech recognition//Proceedings of the 35th Conference on Neural Information Processing Systems. [s. l.]: [s. n.]: 27826-27839

Baevski A, Zhou Y, Mohamed A and Auli M. 2020. wav2vec 2.0: a framework for self-supervised learning of speech representations//Proceedings of the 34th Conference on Neural Information Processing Systems. Vancouver, Canada: [s. n.]

Bahdanau D, Cho K and Bengio Y. 2015. Neural machine translation by jointly learning to align and translate//Proceedings of the 3rd International Conference on Learning Representations. San Diego, USA: [s. n.]

BaltruŠaitis T, Ahuja C and Morency L P. 2019. Multimodal machine learning: a survey and taxonomy.IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(2): 423-443 [DOI: 10.1109/TPAMI.2018.2798607]

Baroni M. 2015. Grounding distributional semantics in the visual world. Language and Linguistics Compass, 10(1): 3-13 [DOI: 10.1111/lnc3.12170]

Bergstra J, Bardenet R, Bengio Y and Kégl B. 2011. Algorithms for hyper-parameter optimization//Proceedings of the 24th International Conference on Neural Information Processing Systems. Granada, Spain: Curran Associates Inc. : 2546-2554

Bergstra J and Bengio Y. 2012. Random search for hyper-parameter optimization. The Journal of Machine Learning Research, 13: 281-305

Bigham J P, Jayant C, Ji H J, Little G, Miller A, Miller R C, Miller R, Tatarowicz A, White B, White S and Yeh T. 2010. VizWiz: nearly real-time answers to visual questions//The 23rd Annual ACM Symposium on User Interface Soft ware and Technology. New York, USA: Association for Computing Machinery: 333-342 [ DOI:10.1145/1866029.1866080 http://dx.doi.org/10.1145/1866029.1866080 ]

Bojanowski P, Lajugie R, Grave E, Bach F, Laptev I, Ponce J and Schmid C. 2015. Weakly-supervised alignment of video with text//Proceedings of 2015 IEEE International Conference on Computer Vision (ICCV). Santiago, Chile: IEEE: 4462-4470 [ DOI:10.1109/ICCV.2015.507 http://dx.doi.org/10.1109/ICCV.2015.507 ]

Cadene R, Ben-younes H, Cord M and Thome N. 2019. MUREL: multimodal relational reasoning for visual question answering//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Long Beach, USA: IEEE: 1989-1998 [ DOI:10.1109/CVPR.2019.00209 http://dx.doi.org/10.1109/CVPR.2019.00209 ]

Cai D S, Qian S S, Fang Q, Hu J, Ding W K and Xu C S. 2022a. Heterogeneous graph contrastive learning network for personalized micro-video recommendation. IEEE Transactions on Multimedia: #3151026 [DOI: 10.1109/TMM.2022.3151026]

Cai D S, Qian S S, Fang Q and Xu C S. 2022b. Heterogeneous hierarchical feature aggregation network for personalized micro-video recommendation. IEEE Transactions on Multimedia, 24: 805-818 [DOI: 10.1109/TMM.2021.3059508]

Cao Q X, Liang X D, Li B L, Li G B and Lin L. 2018. Visual questionreasoning on general dependency tree//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Salt Lake City, USA: IEEE: 7249-7257 [ DOI:10.1109/CVPR.2018.00757 http://dx.doi.org/10.1109/CVPR.2018.00757 ]

Cao X J, Zhang Z T, Sun Y Z, Wang P, Xu S G, Liu F Q, Wang C, Peng F, Mu S Y, Liu W Y and Yang Y. 2022. The review of image processing and edge computing for intelligent transportation system. Journal of Image and Graphics, 27(6): 1743-1767

曹行健, 张志涛, 孙彦赞, 王平, 徐树公, 刘富强, 王超, 彭飞, 穆世义, 刘文予, 杨铀. 2022. 面向智慧交通的图像处理与边缘计算. 中国图象图形学报, 27(6): 1743-1767 [DOI: 10.11834/jig.211266]

Chan W, Jaitly N, Le Q and Vinyals O. 2016. Listen, attend and spell: a neural network for large vocabulary conversational speech recognition//Proceedings of 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Shanghai, China: IEEE: 4960-4964 [ DOI:10.1109/ICASSP.2016.7472621 http://dx.doi.org/10.1109/ICASSP.2016.7472621 ]

Chen C, Jafari R and Kehtarnavaz N. 2015. UTD-MHAD: a multimodal dataset for human action recognition utilizing a depth camera and a wearable inertial sensor//Proceedings of 2015 IEEE International Conference on Image P rocessing (ICIP). Quebec City, Canada: IEEE: 168-172 [ DOI:10.1109/ICIP.2015.7350781 http://dx.doi.org/10.1109/ICIP.2015.7350781 ]

Chen J Y, Chen X P, Ma L, Jie Z Q and Chua T S. 2018a. Temporally grounding natural sentence in video//Proceedings of 2018 Conference on Empirical Methods in Natural Language Processing. Brussels, Belgium: Association for Computational Linguistics: 162-171 [ DOI:10.18653/v1/d18-1015 http://dx.doi.org/10.18653/v1/d18-1015 ]

Chen L J, Lin S Y, Xie Y S, Lin Y Y and Xie X H. 2021. MVHM: a large-scale multi-view hand mesh benchmark for accurate 3D hand pose estimation//Proceedings of 2021 IEEE Winter Conference on Applications of Computer Vision. Waikoloa, USA: IEEE: 836-845 [ DOI:10.1109/WACV48630.2021.00088 http://dx.doi.org/10.1109/WACV48630.2021.00088 ]

Chen X, Li L J, Li F F and Gupta A. 2018b. Iterative visual reasoning beyond convolutions//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Salt Lake City, USA: IEEE: 7239-7248 [ DOI:10.1109/CVPR.2018.00756 http://dx.doi.org/10.1109/CVPR.2018.00756 ]

Chen Y P, Rohrbach M, Yan Z C, Yan S C, Feng J S and Kalantidis Y. 2019. Graph-based global reasoning networks//Proceedings of 2019 I EEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Long Beach, USA: IEEE: 433-442 [ DOI:10.1109/CVPR.2019.00052 http://dx.doi.org/10.1109/CVPR.2019.00052 ]

Ciaccia P, Patella M and Zezula P. 1997. M-tree: an efficient access method for similarity search in metric spaces//Proceedings of the 23rd International Conference on Very Large Data Bases (VLDB). Athens, Greece: Morgan Kaufmann: 426-435

Cord M and Cunningham P. 2008. Machine Learning Techniques for Multimedia: Case Studies on Organization and Retrieval. Berlin, Heidelberg, Germany: Springer [ DOI:10.1007/978-3-540-75171-7 http://dx.doi.org/10.1007/978-3-540-75171-7 ]

Das R, Dhuliawala S, Zaheer M and McCallum A. 2019. Multi-step retriever-reader interaction for scalable open-domain question answering//Proceedings of the 7th International Conference on Learning Representations. New Orleans, USA: OpenReview. net

Deng J, Dong W, Socher R, Li L J, Kai L and Li F F. 2009. ImageNet: a large-scale hierarchical image database//Proceedings of 2009 IEEE Conference on Computer Vision and Pattern Recognition. Miami, USA: IEEE: 248-255 [ DOI:10.1109/CVPR.2009.5206848 http://dx.doi.org/10.1109/CVPR.2009.5206848 ]

Ding C X and Tao D C. 2015. Robust face recognition via multimodal deep face representation. IEEE Transactions on Multimedia, 17(11): 2049-2058 [DOI: 10.1109/TMM.2015.2477042]

Duan X G, Huang W B, Gan C, Wang J D, Zhu W W and Huang J Z. 2018. Weakly supervised dense event captioning in videos//Proceedings of the 32nd International Conference on Neural Information Processing Systems. Montreal, Canada: Curran Associates Inc. : 3063-3073

Duan X G, Wu Q, Gan C, Zhang Y W, Huang W B, van den Hengel A and Zhu W W. 2019a. Watch, reason and code: learning to represent videos using program//Proceedings of the 27th ACM International Conference on Multimedia. Nice, France: Association for Computing Machinery: 1543-1551 [ DOI:10.1145/3343031.3351094 http://dx.doi.org/10.1145/3343031.3351094 ]

Duan Y Q, Zheng Y, Lu J W, Zhou J and Tian Q. 2019b. Structural relational reasoning of point clouds//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Long Beach, USA: IEEE: 949-958 [ DOI:10.1109/CVPR.2019.00104 http://dx.doi.org/10.1109/CVPR.2019.00104 ]

Ephrat A, Mosseri I, Lang O, Dekel T, Wilson K, Hassidim A, Freeman W T and Rubinstein M. 2018. Looking to listen at the cocktail party: a speaker-independent audio-visual model for speech separation. ACM Transactions on Graphics, 37(4): #112 [DOI: 10.1145/3197517.3201357]

Escalera S, Baró X, González J, Bautista M A, Madadi M, Reyes M, Ponce-López V, Escalante H J, Shotton J and Guyon I. 2015. Chalearn looking at people challenge 2014: dataset and results//Computer Vision - ECCV 2014 Workshops. Zurich, Switzerland: Springer: 459-473 [ DOI:10.1007/978-3-319-16178-5_32 http://dx.doi.org/10.1007/978-3-319-16178-5_32 ]

Fan H Q and Zhou J T. 2018. Stacked latent attention for multimodal reasoning//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Salt Lake City, USA: IEEE: 1072-1080 [ DOI:10.1109/CVPR.2018.00118 http://dx.doi.org/10.1109/CVPR.2018.00118 ]

Finn C, Abbeel P and Levine S. 2017. Model-agnostic meta-learning for fast adaptation of deep networks//Proceedings of the 34th International Conference on Machine Learning. Sydney, Australia: PMLR: 1126-1135

Frome A, Corrado G S, Shlens J, Bengio S, Dean J, Ranzato M A and Mikolov T. 2013. DeViSE: a deep visual-semantic embedding model//Proceedings of the 26th International Conference on Neural Information Processing Systems. Lake Tahoe, USA: Curran Associates Inc. : 2121-2129

Fukui A, Park D H, Yang D, Rohrbach A, Darrell T and Rohrbach M. 2016. Multimodal compact bilinear pooling for visual question answering and visual grounding//Proceedings of 2016 Conference on Empirical Methods in Natural Language Processing. Austin, USA: Association for Computational Linguistics: 457-468 [ DOI:10.18653/v1/D16-1044 http://dx.doi.org/10.18653/v1/D16-1044 ]

Gao J Y, Sun C, Yang Z H and Nevatia R. 2017. TALL: temporal activity localization via language query//Proceedings of 2017 IEEE International Conference on Computer Vision (ICCV). Venice, Italy: IEEE: 5277-5285 [ DOI:10.1109/ICCV.2017.563 http://dx.doi.org/10.1109/ICCV.2017.563 ]

Gao W. 2020a. City brain: challenges and solution. CAAI Transactions on Intelligent Systems, 15(4): 818-824

高文. 2020a. 城市大脑的痛点与对策. 智能系统学报, 15(4): 818-824 [DOI: 10.11992/tis.202011038]

Gao W. 2020b. Digital retina, let smart city evolve from "see" to "understand". Scientific Chinese, (12): 30-31

高文. 2020b. 数字视网膜, 让智慧城市从"看清"向"看懂"进化. 科学中国人, (12): 30-31

Gao W, Ma S W, Duan L Y, Tian Y H, Xing P Y, Wang Y W, Wang S S, Jia H Z and Huang T J. 2021. Digital retina: a way to make the city brain more efficient by visual coding. IEEE Transactions on Circuits and Systems for Video Technology, 31(11): 4147-4161 [DOI: 10.1109/TCSVT.2021.3104305]

Gao W, Tian Y H and Wang J. 2018. Digital retina: revolutionizing camera systems for the smart city. Scientia Sinica Informationis, 48(8): 1076-1082

高文, 田永鸿, 王坚. 2018. 数字视网膜: 智慧城市系统演进的关键环节. 中国科学: 信息科学, 48(8): 1076-1082) [DOI: 10.1360/N112018-00025]

Garcez A S D, Broda K B and Gabbay D M. 2002. Neural-Symbolic Learning Systems: Foundations and Applications. London, UK: Springer [DOI: 10.1007/978-1-4471-0211-3]

Garg A, Pavlovic V and Rehg J M. 2003. Boosted learning in dynamic Bayesian networks for multimodal speaker detection. Proceedings of the IEEE, 91(9): 1355-1369 [DOI: 10.1109/JPROC.2003.817119]

Geman D, Geman S, Hallonquist N and Younes L. 2015. Visual turing test for computer vision systems. Proceedings of the National Academy of Sciences of the United States of America, 112(12): 3618-3623 [DOI: 10.1073/pnas.1422953112]

Ghahramani Z and Jordan M I. 1997. Factorial hidden markov models. Machine Learning, 29(2): 245-273 [DOI: 10.1023/A:1007425814087]

Gionis A, Indyk P and Motwani R. 1999. Similarity search in high dimensions via hashing//Proceedings of the 25th International Conference on Very Large Data Bases. Edinburgh, UK: Morgan Kaufmann Publishers Inc. : 518-529

Gönen M and Alpaydán E. 2011. Multiple kernel learning algorithms. The Journal of Machine Learning Research, 12: 2211-2268

Goyal Y, Khot T, Summers-Stay D, Batra D and Parikh D. 2017. Making the V in VQA matter: elevating the role of image understanding in visual question answering//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Honolulu, USA: IEEE: 6325-6334 [ DOI:10.1109/CVPR.2017.670 http://dx.doi.org/10.1109/CVPR.2017.670 ]

Guo J Z, Zhu X Y, Zhao C X, Cao D, Lei Z and Li S Z. 2020a. Learning meta face recognition in unseen domains//Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Seattle, USA: IEEE: 6162-6171 [ DOI:10.1109/CVPR42600.2020.00620 http://dx.doi.org/10.1109/CVPR42600.2020.00620 ]

Guo Z C, Zhang X Y, Mu H Y, Heng W, L iu Z C, Wei Y C and Sun J. 2020b. Single path one-shot neural architecture search with uniform sampling//Proceedings of the 16th European Conference on Computer Vision. Glasgow, UK: Springer: 544-560 [ DOI:10.1007/978-3-030-58517-4_32 http://dx.doi.org/10.1007/978-3-030-58517-4_32 ]

Gurban M, Thiran J P, Drugman T and Dutoit T. 2008. Dynamic modality weighting for multi-stream hmms inaudio-visual speech recognition//Proceedings of the 10th International Conference on Multimodal Interfaces. Chania, Greece: Association for Computing Machinery: 237-240 [ DOI:10.1145/1452392.1452442 http://dx.doi.org/10.1145/1452392.1452442 ]

Hendricks L A, Wang O, Shechtman E, Sivic J, Darrell T and Russell B. 2017. Localizing moments in video with natural language//Proceedings of 2017 IEEE International Conference on Computer Vision (ICCV). Venice, Italy: IEEE: 5804-5813 [ DOI:10.1109/ICCV.2017.618 http://dx.doi.org/10.1109/ICCV.2017.618 ]

Hochreiter S and Schmidhuber J. 1997. Long short-term memory. Neural Computation, 9(8): 1735-1780 [DOI: 10.1162/neco.1997.9.8.1735]

Hossain Z, Sohel F, Shiratuddin M F and Laga H. 2019. A comprehensive survey of deep learning for image captioning. ACM Computing Surveys, 51(6): #118 [DOI: 10.1145/3295748]

Hu R H, Andreas J, Darrell T and Saenko K. 2018. Explainable neural computation via stack neural module networks//Proceedings of the 15th European Conference on Computer Vision (ECCV). Munich, Germany: Springer: 55-71 [ DOI:10.1007/978-3-030-01234-2_4 http://dx.doi.org/10.1007/978-3-030-01234-2_4 ]

Hu R H, Andreas J, Rohrbach M, Darrell T and Saenko K. 2017. Learning to reason: end-to-end module networks for visual question answering//Proceedings of 2017 IEEE International Conference on Computer Vision (ICCV). Venice, Italy: IEEE: 804-813 [ DOI:10.1109/ICCV.2017.93 http://dx.doi.org/10.1109/ICCV.2017.93 ]

Hudson D A and Manning C D. 2018. Compositional attention networks for machine reasoning//Proceedings of the 6th International Conference on Learning Representations. Vancouver, Canada: OpenReview. net

Hudson D A and Manning C D. 2019. GQA: a new dataset for compositional question answering over real-world images [EB/OL ] . [2022-01-12 ] . https://arxiv.org/pdf/1902.09506.pdf https://arxiv.org/pdf/1902.09506.pdf

Huo Y Q, Zhang M L, Liu G Z, Lu H Y, Gao Y Z, Yang G X, Wen J Y, Zhang H, Xu B G, Zheng W H, Xi Z Z, Yang Y Q, Hu A W, Zhao J M, Li R C, Zhao Y D, Zhang L, Song Y Q, Hong X, Cui W Q, Hou D Y, Li Y Y, Li J Y, Liu P Y, Gong Z, Jin C H, Sun Y C, Chen S Z, Lu Z W, Dou Z C, Jin Q, Lan Y Y, Zhao W X, Song R H and Wen J R. 2021. WenLan: bridging vision and language by large-scale multi-modal pre-training [EB/OL ] . [2022-01-12 ] . https://arxiv.org/pdf/2103.06561.pdf https://arxiv.org/pdf/2103.06561.pdf

Johnson J, Hariharan B, van der Maaten L, Li F F, Zitnick C L and Girshick R. 2017a. CLEVR: a diagnostic dataset for compositional language and elementary visual reasoning//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Honolulu, USA: IEEE: 1988-1997 [ DOI:10.1109/CVPR.2017.215 http://dx.doi.org/10.1109/CVPR.2017.215 ]

Johnson J, Hariharan B, van der Maaten L, Hoffman J, Li F F, Zitnick C L and Girshick R. 2017b. Inferring and executing programs for visual reasoning//Proceedings of 2017 IEEE International Conference on Computer Vision (ICCV). Venice, Italy: IEEE: 3008-3017 [ DOI:10.1109/ICCV.2017.325 http://dx.doi.org/10.1109/ICCV.2017.325 ]

Kahou S E, Bouthillier X, Lamblin P, Gulcehre C, Michalski V, Konda K, Jean S, Froumenty P, Dauphin Y, Boulanger-Lewandowski N, Chandias Ferrari R, Mirza M, Warde-Farley D, Courville A, Vincent P, Memisevic R, Pal C and Bengio Y. 2016. EmoNets: multimodal deep learning approaches for emotion recognition in video. Journal on Multimodal User Interfaces, 10(2): 99-111 [DOI: 10.1007/s12193-015-0195-2]

Kalchbrenner N and Blunsom P. 2013. Recurrent continuous translation models//Proceedings of 2013 Conference on Empirical Methods in Natural Language Processing. Washington, USA: Association for Computational Linguistics: 1700-1709

Khapra M M, Kumaran A and Bhattacharyya P. 2010. Everybody loves a rich cousin: an empirical study of transliteration through bridge languages//Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics. Los Angeles, USA: Association for Computational Linguistics: 420-428

Kuznetsova A, Talati A, Luo Y W, Simmons K and Ferrari V. 2021. Efficient video annotation with visual interpolation and frame selection guidance//Proceedings of 2021 IEEE Winter Conference on Applications of Computer Vision. Waikoloa, USA: IEEE: 3069-3078 [ DOI:10.1109/WACV48630.2021.00311 http://dx.doi.org/10.1109/WACV48630.2021.00311 ]

Kuznetsova P, Ordonez V, Berg A C, Berg T L and Choi Y. 2012. Collective generation of natural image descriptions//Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers - Volume 1. Jeju Island, Korea(South): Association for Computational Linguistics: 359-368

Lafferty J D, McCallum A and Pereira F C N. 2001. Conditional random fields: probabilistic models for segmenting and labeling sequence data//Proceedings of the 18th International Conference on Machine Learning. Williamstown, USA: Morgan Kaufmann Publishers Inc. : 282-289

Li B C, Wang Z, Liu J C and Zhu W W. 2013. Two decades of internet video streaming: a retrospective view. ACM Transactions on Multimedia Computing, Communications, and Applications, 9(1 s): #33 [DOI: 10.1145/2505805]

Li G H, Wang X and Zhu W W. 2019. Perceptual visual reasoning with knowledge propagation//Proceedings of the 27th ACM International Conference on Multimedia. Nice, France: Association for Computing Machinery: 530-538 [ DOI:10.1145/3343031.3350922 http://dx.doi.org/10.1145/3343031.3350922 ]

Li G X. 2021. The application of "urban brain" in urban architectural planning. Urbanism and Architecture, 18(23): 79-81, 154

李赣湘. 2021. 城市建筑规划中"城市大脑"的应用. 城市建筑, 18(23): 79-81, 154 [DOI: 10.19892/j.cnki.csjz.2021.23.21]

Li X J, Yin X, Li C Y, Zhang P C, Hu X W, Zhang L, Wang L J, Hu H D, Dong L, Wei F R, Choi Y and Gao J F. 2020. OSCAR : object-semantics aligned pre-training for vision-language tasks//Proceedings of the 16th European Conference on Computer Vision. Glasgow, UK: Springer: 121-137 [ DOI:10.1007/978-3-030-58577-8_8 http://dx.doi.org/10.1007/978-3-030-58577-8_8 ]

Lian D Z, Hu L N, Luo W X, Xu Y Y, Duan L X, Yu J Y and Gao S H. 2019. Multiview multitask gaze estimation with deep convolutional neural networks. IEEE Transactions on Neural Networks and Learning Systems, 30(10): 3010-3023 [DOI: 10.1109/TNNLS.2018.2865525]

Liu A A, Tian H S, Xu N, Nie W Z, Zhang Y D and Kankanhalli M. 2021. Toward region-aware attention learning for scene graph generation. IEEE Transactions on Neural Networks and Learning Systems: #3086066 [ DOI:10.1109/TNNLS.2021.3086066 http://dx.doi.org/10.1109/TNNLS.2021.3086066 ]

Liu D Q, Zhang H W, Zha Z J and Wang F L. 2019a. Referring expression grounding by marginalizing scene graph likelihood [EB/OL ] . [2022-01-12 ] . https://arxiv.org/pdf/1906.03561.pdf https://arxiv.org/pdf/1906.03561.pdf

Liu H X, Simonyan K and Yang Y M. 2019b. DARTS: differentiable architecture search//Proceedings of the 7th International Conference on Learning Representations. New Orleans, USA: OpenReview. net

Liu X C, Liu W, Zhang M, Chen J W, Gao L L, Yan C G and Mei T. 2019c. Social relation recognition from videos via multi-scale spatial-temporal reasoning//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Long Beach, USA: IEEE: 3561-3569 [ DOI:10.1109/CVPR.2019.00368 http://dx.doi.org/10.1109/CVPR.2019.00368 ]

Liu Y, Wang X, Yuan Y T and Zhu W W. 2019d. Cross-modal dual learning for sentence-to-video generation//Proceedings of the 27th ACM International Conference on Multimedia. Nice, France: Association for Computing Machinery: 1239-1247 [ DOI:10.1145/3343031.3350986 http://dx.doi.org/10.1145/3343031.3350986 ]

Lou Y H, Duan L Y, Luo Y, Chen Z Q, Liu T L, Wang S Q and Gao W. 2019. Towards digital retina in smart cities: a model generation, utilization and communication paradigm//Proceedings of 2019IEEE International Conference on Multimedia and Expo (ICME). Shanghai, China: IEEE: 19-24 [ DOI:10.1109/ICME.2019.00012 http://dx.doi.org/10.1109/ICME.2019.00012 ]

Lu X, Zhu L, Liu L, Nie L Q and Zhang H X. 2021. Graph convolutional multi-modal hashing for flexible multimedia retrieval//Proceedings of the 29th ACM International Conference on Multimedia. [s. n. ] : Association for Computing Machinery: 1414-1422 [ DOI:10.1145/3474085.3475598 http://dx.doi.org/10.1145/3474085.3475598 ]

Ma L, Lu Z D, Shang L F and Li H. 2015. Multimodal convolutional neural networks for matching image and sentence//Proceedings of 2015 IEEE International Conference on Computer Vision (ICCV). Santiago, Chile: IEEE: 2623-2631 [ DOI:10.1109/ICCV.2015.301 http://dx.doi.org/10.1109/ICCV.2015.301 ]

Malmaud J, Huang J, Rathod V, Johnston N, Rabinovich A and Murphy K. 2015. What's Cookin′? Interpreting cooking videos using text, speech and vision//Proceedings of 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Denver, Colorado, USA: Association for Computational Linguistics: 143-152 [ DOI:10.3115/v1/N15-1015 http://dx.doi.org/10.3115/v1/N15-1015 ]

Manhaeve R, Dumancˇic' S, Kimmig A, Demeester T and De Raedt L. 2018. DeepProbLog: neural probabilistic logic programming//Proceedings of the 32nd International Conference on Neural Information Processing Systems. Montreal, Canada: Curran Associates Inc. : 3753-3763

Mansimov E, Parisotto E, Ba L J and Salakhutdinov R. 2016. Generating images from captions with attention//Proceedings of the 4th International Conference on Learning Representations. San Juan, Puerto Rico, USA: [s. n.]

Mascharka D, Tran P, Soklaski R and Majumdar A. 2018. Transparency by design: closing the gap between performance and interpretability in visual reasoning//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Salt Lake City, USA: IEEE: 4942-4950 [ DOI:10.1109/CVPR.2018.00519 http://dx.doi.org/10.1109/CVPR.2018.00519 ]

McGurk H and MacDonald J. 1976. Hearing lips and seeing voices. Nature, 264(5588): 746-748 [ DOI:10.1038/264746a0 http://dx.doi.org/10.1038/264746a0 ]

Meng Q, Zhao S C, Huang Z and Zhou F. 2021. MagFace: a universal representation for face recognition and quality assessment//Proceedings of 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Nashville, USA: IEEE: 14220-14229 [ DOI:10.1109/CVPR46437.2021.01400 http://dx.doi.org/10.1109/CVPR46437.2021.01400 ]

Mikolov T, Sutskever I, Chen K, Corrado G and Dean J. 2013. Distributed representations of words and phrases and their compositionality//Proceedings of the 26th International Conference on Neural Information Processing Systems. Lake Tahoe, USA: Curran Associates Inc. : 3111-3119

Mukherjee S S and Robertson N M. 2015. Deep head pose: gaze-direction estimation in multimodal video. IEEE Transactions on Multimedia, 17(11): 2094-2107 [ DOI:10.1109/TMM.2015.2482819 http://dx.doi.org/10.1109/TMM.2015.2482819 ]

Nagrani A, Yang S, Arnab A, Jansen A, Schmid C and Sun C. 2021. Attention bottlenecks for multimodal fusion//Proceedings of the 34th International Conference on Neural Information Processing Systems. [s. l.]: [s. n.]: 14200-14213

Nakov P and Ng H T. 2009. Improved statistical machine translation for resource-poor languages using related resource-rich languages//Proceedings of 2009 Conference on Empirical Methods in Natural Language Processing. Singapore, Singapore: Association for Computational Linguistics: 1358-1367

Narasimhan M, Lazebnik S and Schwing A G. 2018. Out of the box: reasoning with graph convolution nets for factual visual question answering//Proceedings of the 32nd International Conference on Neural Information Processing Systems. Montreal, Canada: Curran Associates Inc. : 2659-2670

Narasimhan M, Rohrbach A and Darrell T. 2021. CLIP-it! language-guided video summarization//Proceedings of the 34th International Conference on Neural Information Processing Systems. [s. l.]: [s. n.]: 13988-14000

Natarajan P, Wu S, Vitaladevuni S, Zhuang X D, Tsakalidis S, Park U, Prasad R and Natarajan P. 2012. Multimodal feature fusion for robust event detection in web videos//Proceedings of 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Providence, USA: IEEE: 1298-1305 [ DOI:10.1109/CVPR.2012.6247814 http://dx.doi.org/10.1109/CVPR.2012.6247814 ]

Nefian A V, Liang L H, Pi X B, Liu X X, Mao C and Murphy K. 2002. A coupled HMM for audio-visual speech recognition//Proceedings of 2002 IEEE International Conference on Acoustics, Speech, and Signal Processing. Orlando, USA: IEEE: 2013-2016 [ DOI:10.1109/ICASSP.2002.5745027 http://dx.doi.org/10.1109/ICASSP.2002.5745027 ]

Neverova N, Wolf C, Taylor G and Nebout F. 2016. ModDrop: adaptive multi-modal gesture recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(8): 1692-1706 [ DOI:10.1109/TPAMI.2015.2461544 http://dx.doi.org/10.1109/TPAMI.2015.2461544 ]

Ngiam J, Khosla A, Kim M, Nam J, Lee H and Ng A Y. 2011. Multimodal deep learning//Proceedings of the 28th International Conference on International Conference on Machine Learning. Washington, USA: Omnipress: 689-696

Ofli F, Chaudhry R, Kurillo G, Vidal R and Bajcsy R. 2013. Berkeley MHAD: a comprehensive multimodal human action database//Proceedings of 2013 IEEE Workshop on Applications of Computer Vision (WACV). Clearwater Beach, USA: IEEE: 53-60 [ DOI:10.1109/WACV.2013.6474999 http://dx.doi.org/10.1109/WACV.2013.6474999 ]

Olague G, Olague M, Jacobo-Lopez A R and Ibarra-Vázquez G. 2021. Less is more: pursuing the visual turing test with the kuleshov effect//Proceedings of 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW). Nashville, USA: IEEE: 1553-1561 [ DOI:10.1109/CVPRW53098.2021.00171 http://dx.doi.org/10.1109/CVPRW53098.2021.00171 ]

Ordonez V, Kulkarni G and Berg T L. 2011. Im2Text: describing images using 1 million captioned photographs//Proceedings of the24th International Conference on Neural Information Processing Systems. Granada, Spain: Curran Associates Inc. : 1143-1151

Owens A, Isola P, McDermott J, Torralba A, Adelson E H and Freeman W T. 2016. Visually indicated sounds//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Las Vegas, USA: IEEE: 2405-2413 [ DOI:10.1109/CVPR.2016.264 http://dx.doi.org/10.1109/CVPR.2016.264 ]

Palm R B, Paquet U and Winther O. 2018. Recurrent relational networks//Proceedings of the 32nd International Conference on Neural Information Processing Systems. Montreal, Canada: Curran Associates Inc. : 3372-3382

Pan Y W, Qiu Z F, Yao T, Li H Q and Mei T. 2017a. To create what you tell: generating videos from captions//Proceedings of the 25th ACM International Conference on Multimedia. Mountain View, USA: Association for Computing Machinery: 1789-1798 [ DOI:10.1145/3123266.3127905 http://dx.doi.org/10.1145/3123266.3127905 ]

Pan Y W, Yao T, Li H Q and Mei T. 2017b. Video captioning with transferred semantic attributes//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Honolulu, USA: IEEE: 984-992 [ DOI:10.1109/CVPR.2017.111 http://dx.doi.org/10.1109/CVPR.2017.111 ]

Pashevich A, Schmid C and Sun C. 2021. Episodic transformer for vision-and-language navigation//Proceedings of 2021 IEEE/CVF International Conference on Computer Vision (ICCV). Montreal, Canada: IEEE: 15922-15932 [ DOI:10.1109/ICCV48922.2021.01564 http://dx.doi.org/10.1109/ICCV48922.2021.01564 ]

Peng Y X, Zhu W W, Zhao Y, Xu C S, Huang Q M, Lu H Q, Zheng Q H, Huang T J and Gao W. 2017. Cross-media analysis and reasoning: advances and directions. Frontiers of Information Technology and Electronic Engineering, 18(1): 44-57 [ DOI:10.1631/FITEE.1601787 http://dx.doi.org/10.1631/FITEE.1601787 ]

Petridis S, Stafylakis T, Ma P, Cai F P, Tzimiropoulos G and Pantic M. 2018. End-to-end audiovisual speech recognition//Proceedings of 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Calgary, Canada: IEEE: 6548-6552 [ DOI:10.1109/ICASSP.2018.8461326 http://dx.doi.org/10.1109/ICASSP.2018.8461326 ]

Pham H, Guan M, Zoph B, Le Q and Dean J. 2018. Efficient neural architecture search via parameters sharing//Proceedings of the 35th International Conference on Machine Learning. Stockholm, Sweden: PMLR: 4095-4104

Poria S, Chaturvedi I, Cambria E and Hussain A. 2016. Convolutional MKL based multimodal emotion recognition and sentiment analysis//Proceedings of the 16th IEEE International Conference on Data Mining (ICDM). Barcelona, Spain: IEEE: 439-448 [ DOI:10.1109/ICDM.2016.0055 http://dx.doi.org/10.1109/ICDM.2016.0055 ]

Qi H, Wu T, Lee M W and Zhu S C. 2015. A restricted visual turing test for deep scene and event understanding [EB/OL]. [2022-01-12]. https://arxiv.org/pdf/1512.01715.pdf

Qiao T T, Zhang J, Xu D Q and Tao D C. 2019. MirrorGAN: learning text-to-image generation by redescription//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Long Beach, USA: IEEE: 1505-1514 [ DOI:10.1109/CVPR.2019.00160 http://dx.doi.org/10.1109/CVPR.2019.00160 ]

Radford A, Kim J W, Hallacy C, Ramesh A, Goh G, Agarwal S, Sastry G, Askell A, Mishkin P, Clark J, Krueger G and Sutskever I. 2021. Learning transferable visual models from natural language supervision//Proceedings of the 38th International Conference on Machine Learning. [s. l.]: PMLR: 8748-8763

Rajendran J, Khapra M M, Chandar S and Ravindran B. 2016. Bridge correlational neural networks for multilingual multimodal representation learning//Proceedings of 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. San Diego, USA: Association for Computational Linguistics: 171-181 [ DOI:10.18653/v1/N16-1021 http://dx.doi.org/10.18653/v1/N16-1021 ]

Ramachandram D and Taylor G W. 2017. Deep multimodal learning: a survey on recent advances and trends. IEEE Signal Processing Magazine, 34(6): 96-108 [ DOI:10.1109/MSP.2017.2738401 http://dx.doi.org/10.1109/MSP.2017.2738401 ]

Ramesh A, Dhariwal P, Nichol A, Chu C and Chen M. 2022. Hierarchical text-conditional image generation with CLIP latents [EB/OL]. [2022-01-12]. https://arxiv.org/pdf/2204.06125.pdf

Ramesh A, Pavlov M, Goh G, Gray S, Voss C, Radford A, Chen M and Sutskever I. 2021. Zero-shot text-to-image generation//Proceedings of the 38th International Conference on Machine Learning. [s. l.]: PMLR: 8821-8831

Rasiwasia N, Moreno P J and Vasconcelos N. 2007. Bridging the gap: query by semantic example. IEEE Transactions on Multimedia, 9(5): 923-938 [ DOI:10.1109/TMM.2007.900138 http://dx.doi.org/10.1109/TMM.2007.900138 ]

Rasiwasia N, Pereira J C, Coviello E, Doyle G, Lanckriet G R G, Levy R and Vasconcelos N. 2010. A new approach to cross-modal multimedia retrieval//Proceedings of the 18th ACM International Conference on Multimedia. Firenze, Italy: Association for Computing Machinery: 251-260 [ DOI:10.1145/1873951.1873987 http://dx.doi.org/10.1145/1873951.1873987 ]

Reed S, Akata Z, Yan X C, Logeswaran L, Schiele B and Lee H. 2016. Generative adversarial text to image synthesis//Proceedings of the 33rd International Conference on Machine Learning. New York, USA: JMLR. org: 1060-1069

Rusu A A, Rao D, Sygnowski J, Vinyals O, Pascanu R, Osindero S and Hadsell R. 2019. Meta-learning with latent embedding optimization//Proceedings of the 7th International Conference on Learning Representations. New Orleans, USA: OpenReview. net

Santoro A, Bartunov S, Botvinick M M, Wierstra D and Lillicrap T P. 2016. Meta-learning with memory-augmented neural networks//Proceedings of the 33rd International Conference on Machine Learning. New York, USA: JMLR. org: 1842-1850

Santoro A, Raposo D, Barrett D G T, Malinowski M, Pascanu R, Battaglia P and Lillicrap T. 2017. A simple neural network module for relational reasoning//Proceedings of the 31st International Conference on Neural Information Processing Systems. Long Beach, USA: Curran Associates Inc. : 4974-4983

Scarselli F, Gori M, Tsoi A C, Hagenbuchner M and Monfardini G. 2009. The graph neural network model. IEEE Transactions on Neural Networks, 20(1): 61-80 [DOI: 10.1109/TNN.2008.2005605]

Schultz P T and Sartini R A. 2016. Method and system for multi-factor biometric authentication. U.S., No. 9 323 912

Singh B, Marks T K, Jones M, Tuzel O and Shao M. 2016. A multi-stream Bi-directional recurrent neural network for fine-grained action detection//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Las Vegas, USA: IEEE: 1961-1970 [ DOI:10.1109/CVPR.2016.216 http://dx.doi.org/10.1109/CVPR.2016.216 ]

Sitová Z, Šeděnka J, Yang Q, Peng G, Zhou G, Gasti P and Balagani K S. 2016. HMOG: new behavioral biometric features for continuous authentication of smartphone users. IEEE Transactions on Information Forensics and Security, 11(5): 877-892 [ DOI:10.1109/TIFS.2015.2506542 http://dx.doi.org/10.1109/TIFS.2015.2506542 ]

Snoek J, Larochelle H and Adams R P. 2012. Practical bayesian optimization of machine learning algorithms//Proceedings of the 25th International Conference on Neural Information Processing Systems. Lake Tahoe, USA: Curran Associates Inc. : 2951-2959

Socher R, Ganjoo M, Manning C D and Ng A Y. 2013. Zero-shot learning through cross-modal transfer//Proceedings of the 26th International Conference on Neural Information Processing Systems. Lake Tahoe, USA: Curran Associates Inc. : 935-943

Srivastava N and Salakhutdinov R. 2012. Learning representations for multimodal data with deep belief nets//Proceedings of 2012 International Conference on Machine Learning Workshop. Edinburgh, UK: [s. n.]: 978-971

Sutskever I, Vinyals O and Le Q V. 2014. Sequence to sequence learning with neural networks//Proceedings of the 27th International Conference on Neural Information Processing Systems. Montreal, Canada: MIT Press: 3104-3112

Tan H and Bansal M. 2019. LXMERT: learning cross-modality encoder representations from transformers//Proceedings of 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Hong Kong, China: Association for Computational Linguistics: 5100-5111 [ DOI:10.18653/v1/D19-1514 http://dx.doi.org/10.18653/v1/D19-1514 ]

Trivedi D, Zhang J, Sun S H and Lim J J. 2021. Learning to synthesize programs as interpretable and generalizable policies//Proceedings of the 34th International Conference on Neural Information Processing Systems. [s. l.]: [s. n.]: 25146-25163

Tsai Y H H, Divvala S, Morency L P, Salakhutdinov R and Farhadi A. 2019. Video relationship reasoning using gated spatio-temporal energy graph//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Long Beach, USA: IEEE: 10416-10425 [ DOI:10.1109/CVPR.2019.01067 http://dx.doi.org/10.1109/CVPR.2019.01067 ]

van den Oord A, Dieleman S and Schrauwen B. 2013. Deep content-based music recommendation//Proceedings of the 26th International Conference on Neural Information Processing Systems. Lake Tahoe, USA: Curran Associates Inc. : 2643-2651

van den Oord A, Dieleman S, Zen H, Simonyan K, Vinyals O, Graves A, Kalchbrenner N, Senior A W and Kavukcuoglu K. 2016. WaveNet: a generative model for raw audio//Proceedings of the 9th ISCA Speech Synthesis Workshop. Sunnyvale, USA: ISCA: #125

Venugopalan S, Xu H J, Donahue J, Rohrbach M, Mooney R and Saenko K. 2015. Translating videos to natural language using deep recurrent neural networks//Proceedings of 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Denver, USA: Association for Computational Linguistics: 1494-1504 [ DOI:10.3115/v1/N15-1173 http://dx.doi.org/10.3115/v1/N15-1173 ]

Verma A, Murali V, Singh R, Kohli P and Chaudhuri S. 2018. Programmatically interpretable reinforcement learning//Proceedings of the 35th International Conference on Machine Learning. Stockholm, Sweden: PMLR: 5045-5054

Vinyals O, Toshev A, Bengio S and Erhan D. 2015. Show and tell: a neural image caption generator//Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Boston, USA: IEEE: 3156-3164 [ DOI:10.1109/CVPR.2015.7298935 http://dx.doi.org/10.1109/CVPR.2015.7298935 ]

Wang D X, Cui P, Ou M D and Zhu W W. 2015. Deep multimodal hashing with orthogonal regularization//Proceedings of the 24th International Conference on Artificial Intelligence. Buenos Aires, Argentina: AAAI Press: 2291-2297

Wang J D, Zhang T, Song J K, Sebe N and Shen H T. 2018. A survey on learning to hash. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(4): 769-790 [ DOI:10.1109/TPAMI.2017.2699960 http://dx.doi.org/10.1109/TPAMI.2017.2699960 ]

Wang K Y, Yin Q Y, Wang W, Wu S and Wang L. 2016a. A comprehensive survey on cross-modal retrieval [EB/OL]. [2022-01-12]. https://arxiv.org/pdf/1607.06215.pdf

Wang X, Donaldson R, Nell C, Gorniak P, Ester M and Bu J J. 2016b. Recommending groups to users using user-group engagement and time-dependent matrix factorization//Proceedings of the 30th AAAI Conference on Artificial Intelligence. Phoenix, USA: AAAI: 1331-1337

Wang X, Hoi S C H, Ester M, Bu J J and Chen C. 2017. Learning personalized preference of strong and weak ties for social recommendation//Proceedings of the 26th International Conference on World Wide Web. Perth, Australia: International World Wide Web Conferences Steering Committee: 1601-1610 [ DOI:10.1145/3038912.3052556 http://dx.doi.org/10.1145/3038912.3052556 ]

Wang X, Lu W, Ester M, Wang C and Chen C. 2016c. Social recommendation with strong and weak ties//Proceedings of the 25th ACM International on Conference on Information and Knowledge Management. Indianapolis, USA: Association for Computing Machinery: 5-14 [ DOI:10.1145/2983323.2983701 http://dx.doi.org/10.1145/2983323.2983701 ]

Wang X, Zhu W W and Liu C H. 2019a. Social recommendation with optimal limited attention//Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Anchorage, USA: Association for Computing Machinery: 1518-1527 [ DOI:10.1145/3292500.3330939 http://dx.doi.org/10.1145/3292500.3330939 ]

Wang X, Zhu W W and Liu C H. 2019b. Semi-supervised deep quantization for cross-modal search//Proceedings of the 27th ACM International Conference on Multimedia. Nice, France: Association for Computing Machinery: 1730-1739 [ DOI:10.1145/3343031.3350934 http://dx.doi.org/10.1145/3343031.3350934 ]

Wang Y K, Huang W B, Sun F C, Xu T Y, Rong Y and Huang J Z. 2020. Deep multimodal fusion by channel exchanging//Proceedings of the 34th International Conference on Neural Information Processing Systems. Vancouver, Canada: Curran Associates Inc. : #406

Wang Z T. 2022. Design and application of intelligent road traffic system based on multi-access edge computing. Traffic and Transportation, 38(3): 50-54

汪志涛. 2022. 基于边缘计算的智能道路交通系统设计及应用. 交通与运输, 38(3): 50-54

Wen Y, Yang Y D, Luo R, Wang J and Pan W. 2019. Probabilistic recursive reasoning for multi-agent reinforcement learning//Proceedings of the 7th International Conference on Learning Representations. New Orleans, USA: OpenReview. net

Wöllmer M, Kaiser M, Eyben F, Schuller B and Rigoll G. 2013. LSTM-Modeling of continuous emotions in an audiovisual affect recognition framework. Image and Vision Computing, 31(2): 153-163 [ DOI:10.1016/j.imavis.2012.03.001 http://dx.doi.org/10.1016/j.imavis.2012.03.001 ]

Wu C F, Liu J L, Wang X J and Dong X. 2018. Chain of reasoning for visual question answering//Proceedings of the 32nd International Conference on Neural Information Processing Systems. Montreal, Canada: Curran Associates Inc. : 273-283

Wu D, Pigou L, Kindermans P J, Le N D H, Shao L, Dambre J and Odobez J M. 2016. Deep dynamic neural networks for multimodal gesture segmentation and recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(8): 1583-1597 [ DOI:10.1109/TPAMI.2016.2537340 http://dx.doi.org/10.1109/TPAMI.2016.2537340 ]

Xiong P X, Zhan H Y, Wang X, Sinha B and Wu Y. 2019. Visual query answering by entity-attribute graph matching and reasoning//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Long Beach, USA: IEEE: 8349-8358 [ DOI:10.1109/CVPR.2019.00855 http://dx.doi.org/10.1109/CVPR.2019.00855 ]

Xu H, Jiang C H, Liang X D, Lin L and Li Z G. 2019. Reasoning-RCNN: unifying adaptive global reasoning into large-scale object detection//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Long Beach, USA: IEEE: 6412-6421 [ DOI:10.1109/CVPR.2019.00658 http://dx.doi.org/10.1109/CVPR.2019.00658 ]

Xu M M, Zhao C, Rojas D S, Thabet A and Ghanem B. 2020a. G-TAD: sub-graph localization for temporal action detection//Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Seattle, USA: IEEE: 10153-10162 [ DOI:10.1109/CVPR42600.2020.01017 http://dx.doi.org/10.1109/CVPR42600.2020.01017 ]

Xu M Z, Xiong Y J, Chen H, Li X Y, Xia W, Tu Z W and Soatto S. 2021. Long short-term transformer for online action detection//Proceedings of the 34th International Conference on Neural Information Processing Systems. [s. l.]: [s. n.]: 1086-1099

Xu Q T, Likhomanenko T, Kahn J, Hannun A Y, Synnaeve G and Collobert R. 2020b. Iterative pseudo-labeling for speech recognition//Proceedings of Interspeech 2020, 21st Annual Conference of the International Speech Communication Association. Shanghai, China: ISCA: 1006-1010 [ DOI:10.21437/Interspeech.2020-1800 http://dx.doi.org/10.21437/Interspeech.2020-1800 ]

Xu Y L, Qin L, Liu X B, Xie J W and Zhu S C. 2018. A causal and-or graph model for visibility fluent reasoning in tracking interacting objects//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Salt Lake City, USA: IEEE: 2178-2187 [ DOI:10.1109/CVPR.2018.00232 http://dx.doi.org/10.1109/CVPR.2018.00232 ]

Yale S, Vallmitjana J, Stent A and Jaimes A. 2015. TVSum: summarizing web videos using titles//Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Boston, USA: IEEE: 5179-5187 [ DOI:10.1109/CVPR.2015.7299154 http://dx.doi.org/10.1109/CVPR.2015.7299154 ]

Yao L, Torabi A, Cho K, Ballas N, Pal C, Larochelle H and Courville A. 2015. Describing videos by exploiting temporal structure//Proceedings of 2015 IEEE International Conference on Computer Vision (ICCV). Santiago, Chile: IEEE: 4507-4515 [ DOI:10.1109/ICCV.2015.512 http://dx.doi.org/10.1109/ICCV.2015.512 ]

Yao T, Mei T and Rui Y. 2016. Highlight detection with pairwise deep ranking for first-person video summarization//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Las Vegas, USA: IEEE: 982-990 [ DOI:10.1109/CVPR.2016.112 http://dx.doi.org/10.1109/CVPR.2016.112 ]

Yi K X, Wu J J, Gan C, Torralba A, Kohli P and Tenenbaum J B. 2018. Neural-symbolic VQA: disentangling reasoning from vision and language understanding//Proceedings of the 32nd International Conference on Neural Information Processing Systems. Montreal, Canada: Curran Associates Inc. : 1039-1050

You Q Z, Jin H L, Wang Z W, Fang C and Luo J B. 2016. Image captioning with semantic attention//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Las Vegas, USA: IEEE: 4651-4659 [ DOI:10.1109/CVPR.2016.503 http://dx.doi.org/10.1109/CVPR.2016.503 ]

Yu H N, Wang J, Huang Z H, Yang Y and Xu W. 2016. Video paragraph captioning using hierarchical recurrent neural networks//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Las Vegas, USA: IEEE: 4584-4593 [ DOI:10.1109/CVPR.2016.496 http://dx.doi.org/10.1109/CVPR.2016.496 ]

Yu S Z, Wang X, Zhu W W, Cui P and Wang J D. 2019a. Disparity-preserved deep cross-platform association for cross-platform video recommendation//Proceedings of the 28th International Joint Conference on Artificial Intelligence. Macao, China: IJCAI. org: 4635-4641 [ DOI:10.24963/ijcai.2019/644 http://dx.doi.org/10.24963/ijcai.2019/644 ]

Yu W J, Liang X D, Gong K, Jiang C H, Xiao N and Lin L. 2019b. Layout-graph reasoning for fashion landmark detection//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Long Beach, USA: IEEE: 2932-2940 [ DOI:10.1109/CVPR.2019.00305 http://dx.doi.org/10.1109/CVPR.2019.00305 ]

Yuan Y T, Mei T, Cui P and Zhu W W. 2019a. Video summarization by learning deep side semantic embedding. IEEE Transactions on Circuits and Systems for Video Technology, 29(1): 226-237 [ DOI:10.1109/TCSVT.2017.2771247 http://dx.doi.org/10.1109/TCSVT.2017.2771247 ]

Yuan Y T, Mei T and Zhu W W. 2019b. To find where you talk: temporal sentence localization in video with attention based location regression//Proceedings of the 33rd AAAI Conference on Artificial Intelligence. Honolulu, USA: AAAI Press: 9159-9166 [ DOI:10.1609/aaai.v33i01.33019159 http://dx.doi.org/10.1609/aaai.v33i01.33019159 ]

Yuhas B P, Goldstein M H and Sejnowski T J. 1989. Integration of acoustic and visual speech signals using neural networks. IEEE Communications Magazine, 27(11): 65-71 [ DOI:10.1109/35.41402 http://dx.doi.org/10.1109/35.41402 ]

Zeng R H, Xu H M, Huang W B, Chen P H, Tan M K and Gan C. 2020. Dense regression network for video grounding//Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Seattle, USA: IEEE: 10284-10293 [ DOI:10.1109/CVPR42600.2020.01030 http://dx.doi.org/10.1109/CVPR42600.2020.01030 ]

Zhang D, Dai X Y, Wang X, Wang Y F and Davis L S. 2019. MAN: moment alignment network for natural language moment retrieval via iterative graph adjustment//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Long Beach, USA: IEEE: 1247-1257 [ DOI:10.1109/CVPR.2019.00134 http://dx.doi.org/10.1109/CVPR.2019.00134 ]

Zhang H, Xu T, Li H S, Zhang S T, Wang X G, Huang X L and Metaxas D. 2017. StackGAN: text to photo-realistic image synthesis with stacked generative adversarial networks//Proceedings of 2017 IEEE International Conference on Computer Vision (ICCV). Venice, Italy: IEEE: 5908-5916 [ DOI:10.1109/ICCV.2017.629 http://dx.doi.org/10.1109/ICCV.2017.629 ]

Zhang H W, Niu Y L and Chang S F. 2018. Grounding referring expressions in images by variational context//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 4158-4166 [ DOI:10.1109/CVPR.2018.00437 http://dx.doi.org/10.1109/CVPR.2018.00437 ]

Zhang J H, Zhu Y Q, Liu Q, Wu S, Wang S H and Wang L. 2021. Mining latent structures for multimedia recommendation//Proceedings of the 29th ACM International Conference on Multimedia. [s. l.]: Association for Computing Machinery: 3872-3880 [ DOI:10.1145/3474085.3475259 http://dx.doi.org/10.1145/3474085.3475259 ]

Zhang L and Rui Y. 2013. Image search—from thousands to billions in 20 years. ACM Transactions on Multimedia Computing, Communications, and Applications, 9(1 s): #36 [DOI: 10.1145/2490823]

Zhang W, Zhang Y M, Ma L, Guan J W and Gong S J. 2015. Multimodal learning for facial expression recognition. Pattern Recognition, 48(10): 3191-3202 [ DOI:10.1016/j.patcog.2015.04.012 http://dx.doi.org/10.1016/j.patcog.2015.04.012 ]

Zhang X C, Park S, Beeler T, Bradley D, Tang S Y and Hilliges O. 2020. ETH-XGaze: a large scale dataset for gaze estimation under extreme head pose and gaze variation//Proceedings of the 16th European Conference on Computer Vision. Glasgow, UK: Springer: 365-381 [ DOI:10.1007/978-3-030-58558-7_22 http://dx.doi.org/10.1007/978-3-030-58558-7_22 ]

Zhao J W, Han R Z, Gan Y Y, Wan L, Feng W and Wang S. 2020. Human identification and interaction detection in cross-view multi-person videos with wearable cameras//Proceedings of the 28th ACM International Conference on Multimedia. Seattle, USA: Association for Computing Machinery: 2608-2616 [ DOI:10.1145/3394171.3413903 http://dx.doi.org/10.1145/3394171.3413903 ]

Zhao L M, Li X, Zhuang Y T and Wang J D. 2017. Deeply-learned part-aligned representations for person re-identification//Proceedings of 2017 IEEE International Conference on Computer Vision (ICCV). Venice, Italy: IEEE: 3239-3248 [ DOI:10.1109/ICCV.2017.349 http://dx.doi.org/10.1109/ICCV.2017.349 ]

Zhu W W, Cui P, Wang Z and Hua G. 2015. Multimedia big data computing. IEEE Multimedia, 22(3): #96 [DOI: 10.1109/MMUL.2015.66]

Zhu W W, Wang X and Gao W. 2020a. Multimedia intelligence: when multimedia meets artificial intelligence. IEEE Transactions on Multimedia, 22(7): 1823-1835 [DOI: 10.1109/TMM.2020.2969791]

Zhu W W, Wang X and Li H Z. 2020b. Multi-modal deep analysis for multimedia. IEEE Transactions on Circuits and Systems for Video Technology, 30(10): 3740-3764 [ DOI:10.1109/TCSVT.2019.2940647 http://dx.doi.org/10.1109/TCSVT.2019.2940647 ]

Zoph B and Le Q V. 2017. Neural architecture search with reinforcement learning//Proceedings of the 5th International Conference on Learning Representations. Toulon, France: OpenReview. net
