于玉海,林鸿飞,孟佳娜,郭海,赵哲焕(大连理工大学计算机科学与技术学院, 大连 116024;大连民族大学计算机科学与工程学院, 大连 116600)
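Objective Images in the biomedical literature are often compound figures that contain several modalities. Automatically labeling their categories helps improve image retrieval performance and supports medical research and teaching. Method Information from two modalities, image content and caption text, is fused, and a multi-label classification model based on deep convolutional neural networks is built for each. The visual classification model borrows natural images and single-label simple biomedical figures to perform heterogeneous and homogeneous transfer learning, capturing general features of the generic domain and specific features of the biomedical domain, whereas the textual classification model uses the captions of simple biomedical figures to perform homogeneous transfer learning. A multi-stage fusion strategy then combines the outputs of the two modality models to recognize the relevant modalities of multi-label medical images. Result The proposed cross-modal multi-label classification algorithm is evaluated on the dataset of the ImageCLEF2016 biomedical image multi-label classification task. The hybrid transfer learning method based on image content achieves a lower Hamming loss and a higher macro-averaged F1 score than the method that uses heterogeneous transfer learning alone. Introducing homogeneous transfer learning into the textual classification model markedly improves label classification performance. Finally, the multi-label classification model that fuses the two modalities obtains a Hamming loss close to the best result of the evaluation task, while the macro-averaged F1 score rises from 0.320 to 0.488, an improvement of about 52.5%. Conclusion The experimental results show that the cross-modal multi-label biomedical image classification algorithm, which fuses image content and caption text and introduces homogeneous and heterogeneous data for transfer learning, alleviates the problems of small labeled datasets and imbalanced label distributions in the biomedical image domain, recognizes modality information in compound medical figures more effectively, and thereby improves image retrieval performance.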
Classification modeling and recognition for cross-modal and multi-label biomedical images
Yu Yuhai, Lin Hongfei, Meng Jiana, Guo Hai, Zhao Zhehuan (School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China; School of Computer Science & Engineering, Dalian Minzu University, Dalian 116600, China)
Objective The amount of biomedical literature in electronic format has increased considerably with the development of the Internet. PubMed comprises more than 27 million citations for biomedical literature, linking to full-text content from PubMed Central and publisher web sites. The figures in these biomedical studies can be retrieved with tools along with the full text. However, the lack of associated metadata, apart from the captions, makes it hard to meet the richer information needs of biomedical researchers and educators. The modality of a figure is an extremely useful type of metadata. Biomedical modality classification is therefore an important first step that helps users access the biomedical images they need and further improves the performance of literature retrieval systems. Many images in the biomedical literature (more than 40%) are compound figures that include several subfigures of various biomedical modalities, such as computerized tomography, X-ray, or generic biomedical illustrations. The subfigures in one compound figure may describe one medical problem from several views and have strong semantic correlation with each other, which makes these figures valuable to biomedical research and education. The standard approach to modality recognition for biomedical compound figures first detects whether a figure is compound. If it is, a figure separation algorithm is invoked to split it into its constituent subfigures, and a multi-class classifier is then used to predict the modality of each subfigure. However, figure separation algorithms are not perfect, and their errors propagate to the multi-class model for modality classification. Recently, some multi-label learning models have used pre-trained convolutional neural networks to extract high-level features and recognize the image modalities directly from compound figures.
These deep learning methods learn more expressive representations of image data. However, the limited number of samples with high variability and the imbalanced label distribution of the training data may hinder convolutional neural networks from disentangling the factors of variation. A new cross-modal multi-label classification model using convolutional neural networks based on hybrid transfer learning is presented to learn biomedical modality information from a compound figure without separating it into subfigures. Method An end-to-end training and multi-label classification method, which does not require additional classifiers, is proposed. Two convolutional neural networks are built so that the modalities of an image are learned not from individually separated subfigures but directly from labeled compound figures and their captions. The proposed cross-modal model learns general domain features from large-scale natural images and more specific biomedical domain features from the simple figures and their captions in the biomedical literature, leveraging heterogeneous and homogeneous transfer learning. Specifically, the proposed visual convolutional neural network (CNN) is pre-trained on a large auxiliary dataset, which contains approximately 1.2 million labeled training images of 1000 classes. Then, the top layer of the deep CNN is trained from scratch on single-label simple biomedical figures to achieve homogeneous transfer learning. The key point of such transfer learning is fine-tuning the pre-trained deep visual models on the current multi-label compound figure dataset, which requires a slight change to the architecture of the deep visual models before fine-tuning. For the textual model, the weights of the embedding layer are initialized with word vectors that are pre-trained on captions extracted from 300 000 biomedical articles in PubMed and are updated while the network is trained.
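The abstract does not state how the visual architecture is changed for the multi-label task. One common choice, assumed here purely for illustration, is to replace the single-label softmax output with independent sigmoid outputs trained with binary cross-entropy. The minimal Python sketch below contrasts the two kinds of output; the function names and the 0.5 threshold are hypothetical and are not taken from the paper.

```python
import math


def softmax(logits):
    """Single-label head: scores compete and sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(z):
    """Multi-label head: each label gets an independent probability."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)  # avoids overflow for large negative z
    return e / (1.0 + e)


def single_label_prediction(logits):
    """Return the index of the single most probable class."""
    probs = softmax(logits)
    return max(range(len(probs)), key=probs.__getitem__)


def multi_label_prediction(logits, threshold=0.5):
    """Return every label whose probability reaches the threshold.

    The threshold value is an assumption made for this sketch.
    """
    return [i for i, z in enumerate(logits) if sigmoid(z) >= threshold]


def binary_cross_entropy(logits, targets):
    """Mean per-label binary cross-entropy for one sample."""
    eps = 1e-12  # guards against log(0)
    losses = []
    for z, t in zip(logits, targets):
        p = sigmoid(z)
        losses.append(-(t * math.log(p + eps) + (1 - t) * math.log(1 - p + eps)))
    return sum(losses) / len(losses)


# Hypothetical output scores for three modality labels of one compound figure.
logits = [2.0, -1.0, 0.5]
print(single_label_prediction(logits))
print(multi_label_prediction(logits))
print(binary_cross_entropy(logits, [1, 0, 1]))
```

With these scores, the softmax head can name only label 0, whereas the sigmoid head reports labels 0 and 2 together, which is the behavior a compound figure with several modalities needs.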
Similar to the homogeneous transfer learning strategy of the visual model, the proposed textual convolutional neural network is first pre-trained on the captions of the simple biomedical figures. The pre-trained textual model is then fine-tuned on the current multi-label compound figures to capture more biomedical features. Finally, the cross-modal multi-label learning model combines the outputs of the visual and textual models to predict labels using a multi-stage fusion strategy. Result The proposed cross-modal multi-label classification model based on hybrid transfer learning is evaluated on the dataset of the multi-label classification task in ImageCLEF2016. Following the evaluation criteria of the benchmark, the approach is assessed with the multi-label Hamming loss and the macro F1 score. The two baseline models learn multi-label information only from visual content. They pre-train AlexNet on large-scale natural images; the DeCAF features are then extracted from the pre-trained AlexNet and fed into an SVM classifier with a linear kernel. One baseline predicts modalities from the highest SVM score, and the other from the highest posterior probability. By introducing homogeneous transfer learning, the visual model achieves a 33.9% lower Hamming loss and a 100.3% higher macro F1 score, and the textual model also improves on both metrics. The proposed cross-modal model achieves a Hamming loss of 0.0157, similar to that of the state-of-the-art model, and a 52.5% higher macro F1 score, which increases from 0.320 to 0.488. Conclusion A new method to extract biomedical modalities from compound figures is proposed. The proposed models obtain more competitive results than the other methods reported in the literature. The proposed cross-modal model exhibits acceptable generalization capability and could achieve higher performance.
The results imply that homogeneous transfer learning helps deep convolutional neural networks (DCNNs) capture more biomedical domain features and improves the performance of multi-label classification. The proposed cross-modal model addresses the problems of overfitting and dataset imbalance and effectively recognizes modalities from biomedical compound figures based on visual content and textual information. In the future, building DCNNs and training networks with new techniques could further improve the proposed method.
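The two evaluation metrics and the reported relative improvement in macro F1 can be illustrated with a short Python sketch. The function names and the toy label matrices are hypothetical, and scoring a label with no true and no predicted positives as F1 = 0 is an assumed convention; the benchmark's own evaluation script may differ.

```python
def hamming_loss(y_true, y_pred):
    """Fraction of (sample, label) slots where prediction and truth differ."""
    total = 0
    wrong = 0
    for t_row, p_row in zip(y_true, y_pred):
        for t, p in zip(t_row, p_row):
            total += 1
            wrong += int(t != p)
    return wrong / total


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-label F1 scores.

    A label with no true and no predicted positives scores 0.0 here,
    which is an assumption made for this sketch.
    """
    n_labels = len(y_true[0])
    scores = []
    for j in range(n_labels):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 0 and p[j] == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 0)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / n_labels


def relative_change(old, new):
    """Relative change from old to new, as a fraction of old."""
    return (new - old) / old


# Toy data: two figures, three modality labels (rows are samples).
y_true = [[1, 0, 1], [0, 1, 0]]
y_pred = [[1, 0, 0], [0, 1, 1]]

print(hamming_loss(y_true, y_pred))
print(macro_f1(y_true, y_pred))
print(relative_change(0.320, 0.488))
```

On the toy data, two of the six label slots are wrong, so the Hamming loss is 1/3, and the per-label F1 scores are 1, 1, and 0, so the macro F1 is 2/3. The last line reproduces the reported gain: (0.488 - 0.320) / 0.320 = 0.525, or 52.5%. Because macro F1 weights every label equally, it is sensitive to rare labels, which is why it can improve substantially while the Hamming loss stays nearly unchanged.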