A general backdoor defense strategy based on suppressing non-semantic image information

Guo Yusheng1, Qian Zhenxing1,2, Zhang Xinpeng1,2, Chai Hongfeng1,3 (1. School of Computer Science, Fudan University, Shanghai 200438, China; 2. Key Laboratory of Digital Culture Protection and Tourism Data Intelligent Computing, Ministry of Culture and Tourism, Shanghai 200438, China; 3. Fintech Research Institute, Fudan University, Shanghai 200438, China)

Abstract
Objective Backdoor attacks have become a major threat to convolutional neural networks. However, current backdoor defense methods often require prior knowledge of the backdoor attack or of the neural network model, which limits their application scenarios. Based on the image classification task, this paper proposes a backdoor defense method that suppresses non-semantic information: it requires no such prior knowledge and achieves backdoor defense simply by encoding and decoding the network input. Method The core idea is to weaken, as far as possible, the information in the original sample that is irrelevant to the image semantics while keeping the semantics unchanged, thereby suppressing the trigger. This is realized by placing a plug-and-play U-shaped network (the information purification network) in front of the model to be protected. Its input is a clean original sample and its output is called the reinforced sample. During training, several clean classifiers with different structures are first trained with different hyperparameters; the information purification network is then optimized to make the difference between the reinforced sample and the original sample as large as possible, under the constraint that the reinforced samples are still correctly classified by these classifiers. Result Experiments are conducted on the MNIST, CIFAR10, and ImageNet10 datasets. The results show that after encoding and decoding by the information purification network, the classification accuracy on clean samples drops only slightly, the success rate of backdoor attacks drops sharply, and samples carrying triggers are correctly predicted with an accuracy close to that of clean samples. Conclusion The proposed non-semantic information suppression defense corrects trigger-carrying samples into normal samples without requiring any related prior knowledge, while maintaining the classification accuracy on clean samples.
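The optimization objective above is described only in words. One possible formalization, with notation chosen here for illustration rather than taken from the paper (G_θ is the information purification network, f_1, …, f_K are the frozen clean classifiers, L_CE is cross-entropy, λ balances the two terms, and the L2 distance is one plausible choice for measuring the sample difference), might read:

```latex
\min_{\theta}\;
\mathbb{E}_{(x,y)\sim\mathcal{D}}
\Big[
\underbrace{\textstyle\sum_{k=1}^{K}\mathcal{L}_{\mathrm{CE}}\big(f_k(G_{\theta}(x)),\,y\big)}_{\text{semantic information retention}}
\;-\;
\lambda\,
\underbrace{\big\lVert G_{\theta}(x)-x\big\rVert_2^2}_{\text{non-semantic information suppression}}
\Big]
```

Minimizing the cross-entropy terms keeps the reinforced samples correctly classified by every clean classifier, while the negative distance term pushes G_θ(x) away from x, suppressing information the classifiers do not rely on.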
Keywords

Non-semantic information suppression based general backdoor defense strategy

Guo Yusheng1, Qian Zhenxing1,2, Zhang Xinpeng1,2, Chai Hongfeng1,3(1.School of Computer Science, Fudan University, Shanghai 200438, China;2.Key Laboratory of Digital Culture Protection and Tourism Data Intelligent Computing, Ministry of Culture and Tourism, Shanghai 200438, China;3.Fintech Research Institute, Fudan University, Shanghai 200438, China)

Abstract
Objective Convolutional neural networks (CNNs) have shown great potential in computer science, electronic information, mathematics, and finance. However, their security is challenged in many of these domains. In supervised learning, an adversary can add trigger-stamped samples to the training set and relabel them with a target class; the resulting model then predicts any trigger-carrying sample as the target label at inference time. Such backdoor attacks severely threaten the interests of model owners, especially in high value-added areas such as financial security. A series of defense strategies have been proposed to protect neural network models from backdoor attacks. However, conventional defenses usually require prior knowledge of the attack method or of the victim model, such as the type and size of the trigger, which is often unavailable and limits their application scenarios. To resolve this problem, we develop a backdoor defense method for image classification that modifies the model input through what we call an information purification network (IPN). Processing samples with the IPN eliminates the impact of any embedded trigger. Method Image samples carry a large amount of redundant information, so we divide image information into two categories: 1) semantic information that the classification task relies on, and 2) non-semantic information that is irrelevant to the classification task. A backdoor attack forces the model to attend to the non-semantic information of a sample during training so that any sample carrying the trigger is predicted as the target label. To suppress the trigger, the IPN is a CNN that encodes and decodes the input samples, aiming to keep the image semantics unchanged while weakening the non-semantic information in the original samples as much as possible. The inputs to the IPN are clean samples, and its outputs are called modified (reinforced) samples. For training, several clean classifiers with different structures and training hyperparameters are trained first. The IPN is then optimized to make the difference between the modified sample and the original sample as large as possible, under the constraint that the modified sample is still correctly predicted by these classifiers. The loss function therefore consists of two parts: 1) semantic information retention and 2) non-semantic information suppression, and the weight between the two parts balances image distortion against classification accuracy. Encoding and decoding by the IPN disrupts the structure of the trigger, so a trigger-carrying sample is no longer predicted as the target label even if the model has been injected with a backdoor. Moreover, because the semantic information of the sample is not weakened, trigger-carrying samples are predicted as their correct labels whether or not the model contains a backdoor.
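As a rough PyTorch-style sketch of the training procedure just described (our illustration, not the authors' released code; the optimizer, learning rate, MSE distance, and the weight lam are assumptions), the IPN could be trained as follows:

```python
import torch
import torch.nn.functional as F

def train_ipn(ipn, clean_classifiers, loader, epochs=50, lam=0.1, device="cuda"):
    """Sketch of IPN training: keep modified samples correctly classified by
    several frozen clean classifiers while pushing them away from the originals."""
    for clf in clean_classifiers:            # classifiers are pre-trained and frozen
        clf.eval()
        for p in clf.parameters():
            p.requires_grad_(False)
    opt = torch.optim.Adam(ipn.parameters(), lr=1e-4)
    for _ in range(epochs):
        for x, y in loader:                  # clean training samples only
            x, y = x.to(device), y.to(device)
            x_mod = ipn(x)                   # modified (reinforced) samples
            # 1) semantic information retention: modified samples must still be
            #    classified correctly by every clean classifier
            ce = sum(F.cross_entropy(clf(x_mod), y) for clf in clean_classifiers)
            # 2) non-semantic information suppression: enlarge the pixel-level
            #    distance to the original sample (MSE is an assumed choice)
            dist = F.mse_loss(x_mod, x)
            loss = ce - lam * dist           # lam balances the two objectives
            opt.zero_grad()
            loss.backward()
            opt.step()
    return ipn
```

Only the IPN parameters are updated; the clean classifiers act as fixed constraints that anchor the image semantics.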
Result All experiments are performed on an NVIDIA GeForce RTX 3090 graphics card with Python 3.8.5 and PyTorch 1.9.1. The method is evaluated on the CIFAR10, MNIST, and ImageNet10 datasets. ImageNet10 is constructed by randomly selecting 10 categories from the ImageNet dataset, 12 831 images in total; 10 264 images are randomly chosen for training and the remaining 2 567 images are used for testing. The IPN adopts a U-Net architecture. To evaluate the defense performance in detail, a variety of triggers are used to implement backdoor attacks. On MNIST, the clean model classifies the original clean samples with 99% accuracy. We implement backdoor attacks with two different triggers; in each case the average classification accuracy on clean samples remains 99% and the attack success rate is 100%. After all samples are encoded and decoded by the IPN, the classification accuracy on clean samples is unchanged, the success rate of the backdoor attacks drops to 10%, and 98% of the backdoor samples are predicted as their correct labels. The results on the other two datasets are similar: the classification accuracy on clean samples decreases slightly, the success rate of the backdoor attacks drops to about 10%, and the backdoor samples are correctly predicted with high accuracy. It should be noted that the intensity and size of the trigger affect the defensive performance to a certain extent, and the weight between the two parts of the loss function affects the accuracy on clean samples: a larger weight on the non-semantic information suppression loss increases the difference between the modified and original images but decreases the classification accuracy on clean samples. Conclusion The proposed strategy requires no prior knowledge of the trigger or of the model to be protected. The classification accuracy on clean samples remains almost unchanged, the success rate of backdoor attacks is reduced to that of random guessing, and backdoor samples are predicted as their correct labels regardless of whether the classifier has been injected with a backdoor. Training the IPN requires only clean training data and knowledge of the protected model's task. At deployment time, the IPN is simply placed in front of the protected model to preprocess the input samples. Multiple backdoor attacks are simulated on the three datasets, and the experimental results show that the proposed defense strategy generalizes well across different attacks and models.
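For completeness, a minimal sketch of the plug-and-play deployment described in the conclusion (class and variable names are illustrative, not taken from the paper):

```python
import torch.nn as nn

class DefendedModel(nn.Module):
    """Wrap a possibly backdoored classifier with the trained IPN so that every
    input is encoded and decoded before it reaches the protected model."""
    def __init__(self, ipn, protected_model):
        super().__init__()
        self.ipn = ipn
        self.protected_model = protected_model

    def forward(self, x):
        return self.protected_model(self.ipn(x))

# usage: logits = DefendedModel(trained_ipn, suspect_classifier)(batch)
```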
Keywords
