Review of multimodal data processing techniques with limited data
Vol. 27, Issue 10, Pages 2803-2834 (2022)
Accepted: 29 April 2022
Published: 16 October 2022
DOI: 10.11834/jig.220049
Peijin Wang, Zhiyuan Yan, Xuee Rong, Junxi Li, Xiaonan Lu, Huiyang Hu, Qiwei Yan and Xian Sun. 2022. Review of multimodal data processing techniques with limited data [J]. Journal of Image and Graphics, 27(10): 2803-2834.
The growth of multimedia technology has greatly increased the variety and volume of available media data. Inspired by human perception, the fusion of multiple media types has advanced artificial intelligence (AI) research in computer vision, with wide applications in remote sensing image interpretation, biomedicine, and depth estimation. Multimodality is a form of representation that describes things from multiple perspectives. Early AI technology focused on a single modality of data; research on human perception has since clarified that each modality gives a relatively independent description of an object, and that complementary multimodal representations yield a richer, more complete characterization. Multimodal data processing has accordingly been developed intensively for applications such as sentiment analysis, machine translation, natural language processing, and biomedicine. This critical review focuses on multimodal learning in computer vision, which analyzes related multimodal data, chiefly images and videos, learns across modalities with complementary information, and supports tasks such as image detection and recognition, semantic segmentation, and video action prediction.
etc. Multimodal data has its priority for objects description. First
it is challenged to collect large-scale
high-quality multimodal datasets due to the equipment-limited like multiple imaging devices and sensors. Next
Image and video data processing and labeling are time-consuming and labor-intensive. Based on the limited-data-derived multimodal learning methods
the multimodal data limited methods in the context of computer vision can be segmented into five aspects
including few-shot learning
lack of strong supervised information
active learning
data denoising and data augmentation. The multi-features of samples and the models evolution are critically reviewed as mentioned below: 1) in the case of insufficient multi-modal data
the few-shot learning method has the cognitive ability to make correct judgments via learning a small number of samples only
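As a concrete illustration, the following is a minimal sketch of one standard few-shot formulation, prototype-based nearest-class classification; the 16-dimensional embeddings are random stand-ins for fused multimodal features (e.g., RGB plus depth), not the output of any method surveyed here.

```python
# Minimal prototypical-style few-shot classification sketch (illustrative only).
import numpy as np

rng = np.random.default_rng(0)

def prototypes(support, labels, n_classes):
    # Mean embedding per class, computed from the few labeled support samples.
    return np.stack([support[labels == c].mean(axis=0) for c in range(n_classes)])

def classify(queries, protos):
    # Assign each query to the nearest class prototype (squared Euclidean distance).
    d = ((queries[:, None, :] - protos[None, :, :]) ** 2).sum(axis=-1)
    return d.argmin(axis=1)

# A 3-way 5-shot episode; 16-dim vectors stand in for fused multimodal features.
support = rng.normal(size=(15, 16))
labels = np.repeat(np.arange(3), 5)
queries = rng.normal(size=(6, 16))
print(classify(queries, prototypes(support, labels, 3)))
```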
2) Because data labeling is costly, it is difficult to obtain ground-truth labels for every modality, as strongly supervised training of a model would require. Methods that cope with incomplete supervision commonly include weakly supervised, unsupervised, semi-supervised, and self-supervised learning; they make the most of whatever modal annotation exists and reduce the cost of manual labeling.
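As one illustrative example of the self-supervised route, the sketch below implements a cross-modal contrastive (InfoNCE-style) objective that pulls paired embeddings of the same sample together without any class labels; the batch size, embedding dimension, and temperature are illustrative assumptions.

```python
# Cross-modal contrastive learning sketch: no class labels are needed,
# only the natural pairing between modalities of the same sample.
import torch
import torch.nn.functional as F

def cross_modal_infonce(z_a, z_b, temperature=0.1):
    # z_a, z_b: (N, D) embeddings of the same N samples in two modalities.
    z_a = F.normalize(z_a, dim=1)
    z_b = F.normalize(z_b, dim=1)
    logits = z_a @ z_b.t() / temperature    # (N, N) similarity matrix
    targets = torch.arange(z_a.size(0))     # matching pairs lie on the diagonal
    # Symmetric loss: modality A retrieves B, and B retrieves A.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

loss = cross_modal_infonce(torch.randn(8, 128), torch.randn(8, 128))
print(loss.item())
```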
3) Active learning combines prior knowledge with a regulated learning loop in which the model autonomously selects the samples most worth labeling, so that a small sample budget is exploited to the maximum; choosing samples in this way effectively reduces annotation cost.
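A minimal sketch of one common query strategy, uncertainty sampling, is given below; the pool of softmax outputs is simulated, and a real loop would retrain the model after each labeling round.

```python
# Uncertainty-sampling sketch: query the annotator on the most ambiguous samples.
import numpy as np

rng = np.random.default_rng(0)

def entropy(probs, eps=1e-12):
    # Predictive entropy per sample; higher means the model is less certain.
    return -(probs * np.log(probs + eps)).sum(axis=1)

# Hypothetical softmax outputs of a partially trained model on a 100-sample pool.
pool_probs = rng.dirichlet(np.ones(5), size=100)
budget = 10
to_label = np.argsort(-entropy(pool_probs))[:budget]  # top-10 most uncertain
print(to_label)
```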
4) Multimodal data denoising reduces the noise in the data, restores the original signal, and then extracts the information of interest.
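The toy sketch below shows this recover-then-extract pipeline in its simplest form, Gaussian smoothing of a noisy image; the surveyed methods are far more sophisticated (wavelet-, sparse-representation-, and variational-based), but the goal is the same.

```python
# Toy denoising sketch: suppress sensor noise before extracting information.
import numpy as np
from scipy.ndimage import gaussian_filter

rng = np.random.default_rng(0)
clean = np.zeros((64, 64))
clean[16:48, 16:48] = 1.0                                 # toy "original data"
noisy = clean + rng.normal(scale=0.3, size=clean.shape)   # simulated sensor noise
restored = gaussian_filter(noisy, sigma=1.5)              # reduce the noise
# The restored image is closer to the original than the noisy one (prints True).
print(np.abs(restored - clean).mean() < np.abs(noisy - clean).mean())
```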
5) To make full use of limited multimodal data, few-sample data augmentation expands realistic data by applying a series of transformation operations to the original dataset.
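A minimal sketch of augmenting paired multimodal data follows; the key point is that geometric transforms must be applied jointly so the cross-modal pairing stays valid, while photometric noise can be modality-specific. Shapes and magnitudes are illustrative assumptions.

```python
# Paired multimodal augmentation sketch: shared geometry, per-modality photometrics.
import numpy as np

rng = np.random.default_rng(0)

def augment_pair(rgb, depth):
    # Jointly transform an (H, W, 3) RGB image and its (H, W) depth map.
    if rng.random() < 0.5:                        # horizontal flip, applied jointly
        rgb, depth = rgb[:, ::-1], depth[:, ::-1]
    top, left = rng.integers(0, 8, size=2)        # shared random-crop offset
    rgb = rgb[top:top + 56, left:left + 56]
    depth = depth[top:top + 56, left:left + 56]
    rgb = rgb + rng.normal(scale=0.02, size=rgb.shape)  # RGB-only photometric noise
    return rgb, depth

rgb_aug, depth_aug = augment_pair(rng.random((64, 64, 3)), rng.random((64, 64)))
print(rgb_aug.shape, depth_aug.shape)  # (56, 56, 3) (56, 56)
```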
In addition, the datasets used by multimodal learning methods under limited data are introduced, together with their application directions, such as human pose estimation and person re-identification, and the performance of existing algorithms is compared and analyzed. Their pros and cons, as well as future development directions, are projected as follows: 1) Lightweight multimodal data processing. Multimodal learning under limited data still faces the challenge of deploying models on mobile devices: to fuse information from multiple modalities, existing methods generally need two or more networks for feature extraction before the features are fused, so the large number of parameters and the complex model structure limit mobile applications, and lightweight models therefore hold promise (a sketch of the two-branch pattern follows below).
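The sketch below shows the two-branch late-fusion pattern described above in its simplest form; layer sizes are illustrative, and real backbones are far heavier, which is precisely the parameter-count obstacle to mobile deployment.

```python
# Two-branch late-fusion sketch: one encoder per modality, concatenated features.
import torch
import torch.nn as nn

class TwoBranchFusion(nn.Module):
    def __init__(self, dim_a=512, dim_b=128, hidden=256, n_classes=10):
        super().__init__()
        self.enc_a = nn.Sequential(nn.Linear(dim_a, hidden), nn.ReLU())  # e.g. RGB branch
        self.enc_b = nn.Sequential(nn.Linear(dim_b, hidden), nn.ReLU())  # e.g. depth branch
        self.head = nn.Linear(2 * hidden, n_classes)                     # fused classifier

    def forward(self, x_a, x_b):
        return self.head(torch.cat([self.enc_a(x_a), self.enc_b(x_b)], dim=1))

model = TwoBranchFusion()
logits = model(torch.randn(4, 512), torch.randn(4, 128))
print(sum(p.numel() for p in model.parameters()))  # every added branch adds parameters
```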
2) A general multimodal intelligent processing model. Most existing multimodal data processing methods stem from algorithms developed separately for individual tasks and must be trained on each specific task. Such tailored training greatly increases the cost of developing models and makes it difficult to serve more application scenarios. For data of different modalities, it is therefore necessary to promote a unified perception model that learns a general representation of multimodal data, so that the parameters and features of the general model can be shared across multiple scenarios. 3) A model driven by multi-source knowledge and data. Beyond the multimodal data themselves, characteristic features and domain knowledge can be introduced to establish an integrated knowledge-and-data-driven model and enhance the model's performance and interpretability.
Keywords: multimodal data; limited data; deep learning; fusion algorithms; computer vision