Review of multimodal data processing techniques with limited data
2022, Vol. 27, No. 10, pp. 2803-2834
Received: 2022-01-19; Revised: 2022-04-22; Accepted: 2022-04-29; Published in print: 2022-10-16
DOI: 10.11834/jig.220049

With the development of multimedia technology, the variety and volume of obtainable media data have grown substantially. Inspired by the way humans perceive the world, the fused processing of multiple kinds of media data has promoted artificial intelligence research in computer vision, with wide applications in remote sensing image interpretation, biomedicine, depth estimation, and other areas. Although multimodal data have clear advantages in describing the characteristics of objects, they still face considerable challenges: 1) limited by differences among imaging devices and sensors, large-scale, high-quality multimodal datasets are hard to collect; 2) multimodal data must be matched into pairs for research, so the absence of any one modality reduces the amount of usable data; 3) processing and annotating image and video data consume substantial time and labor. These problems mean that key techniques in this field remain to be worked out. Focusing on multimodal learning under data-limited conditions, this paper divides data-limited multimodal methods in computer vision into five directions according to sample quantity, annotation information, and sample quality: few-shot learning, lack of strongly supervised annotation information, active learning, data denoising, and data augmentation, and elaborates the sample characteristics and the latest model advances of each. It also introduces the datasets used by multimodal learning methods under limited data and their application directions (including human pose estimation and person re-identification), and compares and analyzes the strengths and weaknesses of existing algorithms as well as future development directions, which is of positive significance to the development of this field.
The growth of multimedia technology has made far more media data available, in more forms. Inspired by human perception, the fused processing of multiple media modalities has promoted artificial intelligence (AI) research in computer vision, with wide applications such as remote sensing image interpretation, biomedicine, and depth estimation. Multimodality is a form of representation that describes objects from multiple perspectives. Early AI technology focused on a single data modality; research on human perception has since clarified that each modality gives a relatively independent description of objects, and that the complementary representations in multimodal data yield a fuller, more complete description. The processing and application of multimodal data have accordingly been intensively developed in areas such as sentiment analysis, machine translation, natural language processing, and biomedicine. This critical review focuses on the development of multimodality in computer vision, where multimodal learning mainly analyzes image and video data, learns across modalities to exploit their complementary information, and serves tasks such as image detection and recognition, semantic segmentation, and video action prediction.
Although multimodal data excel at describing objects, three challenges remain. First, limited by the variety of imaging devices and sensors, it is difficult to collect large-scale, high-quality multimodal datasets. Second, multimodal data must be matched into pairs for research, so the loss of any one modality reduces the amount of usable data. Third, processing and labeling image and video data are time-consuming and labor-intensive. Starting from multimodal learning under limited data, we divide the data-limited multimodal methods in computer vision into five directions: few-shot learning, lack of strongly supervised information, active learning, data denoising, and data augmentation. The sample characteristics and the latest model advances of each direction are critically reviewed as follows (minimal Python sketches of one representative technique per direction follow this paragraph). 1) When multimodal data are insufficient, few-shot learning emulates the human cognitive ability to make correct judgments after learning from only a few samples, so it can learn target features effectively despite the scarcity of data. 2) Because data labeling is costly, it is difficult to obtain ground-truth labels for every modality, as strongly supervised training of a model requires; methods for incomplete supervision, commonly comprising weakly supervised, unsupervised, semi-supervised, and self-supervised learning, make better use of the available label information and reduce manual labeling costs. 3) Active learning combines prior knowledge with regulation of the learning process to give a model autonomous learning ability; it is committed to extracting the maximum benefit from a few samples, and by selecting the most informative samples to label it effectively reduces labeling costs. 4) Multimodal data denoising reduces the noise in the data, restores the original data, and then extracts the information of interest. 5) To make full use of limited multimodal data, data augmentation under few-sample conditions extends realistic data by applying a series of transformations to the original dataset.
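As an illustration of direction 1), the following minimal sketch shows one representative few-shot technique, nearest-prototype classification in the style of prototypical networks. It assumes embeddings of the (e.g., fused RGB-D) samples have already been extracted; the toy data and all names are illustrative, not taken from any surveyed method.

```python
import numpy as np

def prototypical_episode(support, support_labels, query, n_classes):
    """Nearest-prototype classification for one N-way K-shot episode.

    support:        (n_support, d) embeddings of the few labeled samples
    support_labels: (n_support,)   integer class ids in [0, n_classes)
    query:          (n_query, d)   embeddings to classify
    """
    # A class prototype is the mean embedding of its support samples.
    prototypes = np.stack([support[support_labels == c].mean(axis=0)
                           for c in range(n_classes)])
    # Squared Euclidean distance from every query to every prototype.
    dists = ((query[:, None, :] - prototypes[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)

# Toy 2-way 3-shot episode on 8-dimensional fused embeddings.
rng = np.random.default_rng(0)
support = np.concatenate([rng.normal(0, 1, (3, 8)), rng.normal(5, 1, (3, 8))])
labels = np.array([0, 0, 0, 1, 1, 1])
query = rng.normal(5, 1, (2, 8))
print(prototypical_episode(support, labels, query, n_classes=2))  # -> [1 1]
```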
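For direction 2), self-supervised methods can train without manual labels by contrasting naturally paired modalities. Below is a minimal NumPy sketch of a symmetric InfoNCE objective between RGB and depth embeddings of the same scenes; the temperature value and the batch construction are illustrative assumptions.

```python
import numpy as np

def log_softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def cross_modal_infonce(z_rgb, z_depth, temperature=0.1):
    """Matched (RGB, depth) pairs are positives; every other pairing in
    the batch serves as a negative. No manual labels are required."""
    # L2-normalize so dot products are cosine similarities.
    z_rgb = z_rgb / np.linalg.norm(z_rgb, axis=1, keepdims=True)
    z_depth = z_depth / np.linalg.norm(z_depth, axis=1, keepdims=True)
    logits = z_rgb @ z_depth.T / temperature        # (B, B) similarity matrix
    idx = np.arange(len(logits))
    loss_r2d = -log_softmax(logits, axis=1)[idx, idx].mean()  # RGB -> depth
    loss_d2r = -log_softmax(logits, axis=0)[idx, idx].mean()  # depth -> RGB
    return 0.5 * (loss_r2d + loss_d2r)
```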
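For direction 3), the classic query strategy is uncertainty sampling: the annotator is asked to label the samples the current model is least confident about. A minimal entropy-based sketch, with the labeling budget as an assumed parameter:

```python
import numpy as np

def entropy_query(probs, budget):
    """Pick the `budget` unlabeled samples with the highest predictive
    entropy; these are the ones sent to a human annotator.

    probs: (n_unlabeled, n_classes) class probabilities from the model
    """
    entropy = -(probs * np.log(probs + 1e-12)).sum(axis=1)  # per-sample uncertainty
    return np.argsort(entropy)[-budget:]                    # indices to label
```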
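For direction 4), many surveyed denoising methods shrink small transform-domain coefficients, which noise dominates. The following deliberately simple Fourier-domain soft-thresholding sketch stands in for the wavelet- and sparse-representation-based approaches; the threshold rule is a rough assumption, not a tuned method.

```python
import numpy as np

def soft_threshold_denoise(img, sigma):
    """Attenuate Fourier coefficients whose magnitude is comparable to
    the noise level; large coefficients (image structure) survive."""
    coeffs = np.fft.fft2(img)
    mag, phase = np.abs(coeffs), np.angle(coeffs)
    # For white noise of std sigma, |FFT| coefficients scale ~ sigma*sqrt(N);
    # the factor 3.0 is a crude, assumed noise-floor margin.
    thresh = 3.0 * sigma * np.sqrt(img.size)
    mag = np.maximum(mag - thresh, 0.0)   # soft shrinkage
    return np.real(np.fft.ifft2(mag * np.exp(1j * phase)))
```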
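For direction 5), augmentation of multimodal data must keep the modalities aligned: geometric transforms are applied identically to every modality, while photometric jitter touches only the modalities for which it makes sense. A minimal sketch for an RGB-depth pair, with illustrative transform choices:

```python
import numpy as np

def augment_pair(rgb, depth, rng):
    """rgb: (H, W, 3) array in [0, 255]; depth: (H, W) array."""
    if rng.random() < 0.5:                       # shared horizontal flip
        rgb, depth = rgb[:, ::-1], depth[:, ::-1]
    k = int(rng.integers(0, 4))                  # shared 90-degree rotation
    rgb, depth = np.rot90(rgb, k), np.rot90(depth, k)
    # Photometric change applied to RGB only; depth values stay physical.
    rgb = np.clip(rgb * rng.uniform(0.8, 1.2), 0, 255)
    return rgb, depth

rng = np.random.default_rng(42)
rgb_aug, depth_aug = augment_pair(np.zeros((4, 4, 3)), np.zeros((4, 4)), rng)
```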
In addition, the datasets used by data-limited multimodal learning methods and their potential applications, such as human pose estimation and person re-identification, are introduced, and the performance of existing algorithms is compared and analyzed. The pros and cons, as well as future development directions, are projected as follows. 1) Lightweight multimodal data processing: we argue that multimodal learning under limited data still faces the challenge of deploying models on mobile devices. To fuse the information of multiple modalities, existing methods generally extract features with two or more networks and then fuse the features (the pattern is sketched after this list), so the large number of parameters and the complex model structure limit their application to mobile devices; future lightweight models therefore have strong potential. 2) A general-purpose multimodal intelligent processing model: most existing multimodal data processing methods are separate algorithms developed for separate tasks and must be trained on each specific task. This tailored training greatly increases the cost of developing models and makes it difficult to meet the needs of more application scenarios. For data of different modalities, it is therefore necessary to pursue a unified perception model that learns general representations of multimodal data, whose parameters and features can be shared across multiple scenarios. 3) A model driven jointly by multi-source knowledge and data: introducing knowledge beyond the features of the multimodal data themselves to establish an integrated knowledge-and-data-driven model can enhance both model performance and interpretability.
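As context for future direction 1), the fusion pattern described above, separate per-modality feature extractors followed by feature-level fusion, can be sketched as below. The stand-in extractors and head are hypothetical; the point is that duplicating a backbone per modality is what roughly doubles the parameter count relative to a unimodal model.

```python
import numpy as np

def late_fusion_predict(x_rgb, x_depth, extract_rgb, extract_depth, W_head):
    """Two modality-specific backbones, then concatenation fusion and a
    shared linear head. The duplicated backbones are what make such
    models heavy for mobile deployment."""
    feats = np.concatenate([extract_rgb(x_rgb), extract_depth(x_depth)], axis=-1)
    return feats @ W_head

# Stand-in 'backbones': any callables mapping inputs to feature vectors.
extract_rgb = lambda x: x.reshape(len(x), -1)[:, :16]
extract_depth = lambda x: x.reshape(len(x), -1)[:, :16]
W_head = np.zeros((32, 5))  # 32 = 16 + 16 fused features, 5 classes
logits = late_fusion_predict(np.ones((2, 8, 8, 3)), np.ones((2, 8, 8)),
                             extract_rgb, extract_depth, W_head)
```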