杜长德1,2, 周琼怡1,3, 刘澈1,3, 何晖光1,2,3(1.中国科学院自动化研究所脑图谱与类脑智能研究中心, 北京 100190;2.中国科学院自动化研究所多模态人工智能系统全国重点实验室, 北京 100190;3.中国科学院大学人工智能学院, 北京 100049)
视觉神经信息编解码旨在利用功能磁共振成像（functional magnetic resonance imaging，fMRI）等神经影像数据研究视觉刺激与大脑神经活动之间的关系。编码研究可以对神经活动模式进行建模和预测，有助于脑科学与类脑智能的发展；解码研究可以对人的视知觉状态进行解译，能够促进脑机接口领域的发展。因此，基于fMRI的视觉神经信息编解码方法研究具有重要的科学意义和工程价值。本文在总结基于fMRI的视觉神经信息编解码关键技术与研究进展的基础上，分析现有视觉神经信息编解码方法的局限。在视觉神经信息编码方面，详细介绍了基于群体感受野估计方法的发展过程；在视觉神经信息解码方面，首先，按照任务类型将其划分为语义分类、图像辨识和图像重建3个部分，并深入阐述了每个部分的代表性研究工作和所用的方法。特别地，在图像重建部分着重介绍了基于深度生成模型（主要包括变分自编码器和生成对抗网络）的简单图像、人脸图像和复杂自然图像的重建技术。其次，统计整理了该领域常用的10个开源数据集，并对数据集的样本规模、被试个数、刺激类型、研究用途及下载地址进行了详细归纳。最后，详细介绍了视觉神经信息编解码模型常用的度量指标，分析了当前视觉神经信息编码和解码方法的不足，提出可行的改进意见，并对未来发展方向进行展望。
Review of visual neural encoding and decoding methods in fMRI
Du Changde1,2, Zhou Qiongyi1,3, Liu Che1,3, He Huiguang1,2,3(1.Research Center for Brain Mapping and Brain-Inspired Intelligence, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China;2.State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China;3.School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing 100049, China)
The relationship between human visual experience and evoked neural activity is central to the field of computational neuroscience. The purpose of visual neural encoding and decoding is to study the relationship between visual stimuli and the evoked neural activity by using neuroimaging data such as functional magnetic resonance imaging (fMRI). Neural encoding researches attempt to predict the brain activity according to the presented external stimuli, which contributes to the development of brain science and brain-like artificial intelligence. Neural decoding researches attempt to predict the information about external stimuli by analyzing the observed brain activities, which can interpret the state of human visual perception and promote the development of brain computer interface (BCI). Therefore, fMRI based visual neural encoding and decoding researches have important scientific significance and engineering value. Typically, the encoding models are based on the specific computations that are thought to underlie the observed brain responses for specific visual stimuli. Early studies of visual neural encoding relied heavily on Gabor wavelet features because these features are very good at modeling brain responses in the primary visual cortex. Recently, given the success of deep neural networks (DNNs) in classifying objects in natural images, the representations within these networks have been used to build encoding models of cortical responses to complex visual stimuli. Most of the existing decoding studies are based on multi-voxel pattern analysis (MVPA) method, but brain connectivity pattern is also a key feature of the brain state and can be used for brain decoding. Although recent studies have demonstrated the feasibility of decoding the identity of binary contrast patterns, handwritten characters, human facial images, natural picture/video stimuli and dreams from the corresponding brain activation patterns, the accurate reconstruction of the visual stimuli from fMRI still lacks adequate examination and requires plenty of efforts to improve. On the basis of summarizing the key technologies and research progress of fMRI based visual neural encoding and decoding, this paper further analyzes the limitations of existing visual neural encoding and decoding methods. In terms of visual neural encoding, the development process of population receptive field (pRF) estimation method is introduced in detail. In terms of visual neural decoding, it is divided into semantic classification, image identification and image reconstruction according to task types, and the representative research work of each part and the methods used are described in detail. From the perspective of machine learning, semantic classification is a single label or multi-label classification problem. Simple visual stimuli only contain a single object, while natural visual stimuli often contain multiple semantic labels. For example, an image may contain flowers, water, trees, cars, etc. Predicting one or more semantic labels of the visual stimulus from the brain signal is called semantic decoding. Image retrieval based on brain signal is also a common visual decoding task where the model is created to "decode" neural activity by retrieving a picture of what a person has just seen or imagined. In particular, the reconstruction techniques of simple image, face image and complex natural image based on deep generative models (including variational auto-encoders (VAEs) and generative adversarial networks (GANs)) are introduced in the part of image reconstruction. Secondly, 10 open source datasets commonly used in this field were statistically sorted out, and the sample size, number of subjects, types of stimuli, research purposes and download links of the datasets were summarized in detail. These datasets have made important contributions to the development of this field. Finally, we introduce the commonly used measurement metrics of visual neural encoding and decoding model in detail, analyze the shortcomings of current visual neural encoding and decoding methods, propose feasible suggestions for improvement, and show the future development directions. Specifically, for neural encoding, the existing methods still have the following shortcomings:1) the computational models are mostly based on the existing neural network architecture, which cannot reflect the real biological visual information flow; 2) due to the selective attention of each person in the visual perception and the inevitable noise in the fMRI data collection, individual differences are significant; 3) the sample size of the existing fMRI data set is insufficient; 4) most researchers construct feature spaces of neural encoding models based on fixed types of pre-trained neural networks (such as AlexNet), causing problems such as insufficient diversity of visual features. On the other hand, although the existing visual neural decoding methods perform well in the semantic classification and image identification tasks, it is still very difficult to establish an accurate mapping between visual stimuli and visual neural signals, and the results of image reconstruction are often blurry and lack of clear semantics. Moreover, most of the existing visual neural decoding methods are based on linear transformation or deep network transformation of visual images, lacking exploration of new visual features. Factors that hinder researchers from effectively decoding visual information and reconstructing images or videos mainly include high dimension of fMRI data, small sample size and serious noise. In the future, more advanced artificial intelligence technology should be used to develop more effective methods of neural encoding and decoding, and try to translate brain signals into images, video, voice, text and other multimedia content, so as to achieve more BCI applications. The significant research directions include 1) multi-modal neural encoding and decoding based on the union of image and text; 2) brain-guided computer vision model training and enhancement; 3) visual neural encoding and decoding based on the high efficient features of large-scale pre-trained models. In addition, since brain signals are characterized by complexity, high dimension, large individual diversity, high dynamic nature and small sample size, future research needs to combine computational neuroscience and artificial intelligence theories to develop visual neural encoding and decoding methods with higher robustness, adaptability and interpretability.