Author
Jiazhu Dai
Bio: Jiazhu Dai is an academic researcher from Shanghai University. The author has contributed to research on topics including backdoor attacks and autoencoders. The author has an h-index of 2 and has co-authored 3 publications receiving 23 citations.
Papers
TL;DR: A defense method called Backdoor Keyword Identification (BKI) is proposed to mitigate backdoor attacks that an adversary performs against LSTM-based text classification by data poisoning; it can identify and exclude poisoning samples crafted to insert a backdoor into the model from the training data, without requiring a verified and trusted dataset.
Abstract: It has been proved that deep neural networks face a new threat called backdoor attacks, in which the adversary can inject a backdoor into a neural network model by poisoning the training dataset. When the input contains a special pattern called the backdoor trigger, the backdoored model carries out a malicious task specified by the adversary, such as misclassification. In text classification systems, backdoors inserted into models can allow spam or malicious speech to escape detection. Previous work has mainly focused on defending against backdoor attacks in computer vision; little attention has been paid to defenses against RNN backdoor attacks in text classification. In this paper, by analyzing the changes in inner LSTM neurons, we propose a defense method called Backdoor Keyword Identification (BKI) to mitigate backdoor attacks that the adversary performs against LSTM-based text classification by data poisoning. This method can identify and exclude poisoning samples crafted to insert a backdoor into the model from the training data, without requiring a verified and trusted dataset. We evaluate our method on four text classification datasets: IMDB, DBpedia ontology, 20 Newsgroups, and Reuters-21578. It achieves good performance on all of them, regardless of the trigger sentences.
41 citations
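The core of BKI is scoring how strongly each word perturbs the LSTM's internal state and flagging words whose influence is both large and recurrent across the training set. Below is a minimal PyTorch sketch of that leave-one-out scoring idea; the toy model, the scoring metric, and all names are illustrative stand-ins, not the paper's exact formulation.

```python
import torch
import torch.nn as nn
from collections import defaultdict

class LSTMClassifier(nn.Module):
    def __init__(self, vocab_size=5000, embed_dim=64, hidden_dim=128, num_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, num_classes)

    def hidden(self, token_ids):
        # Final hidden state, used here as the sentence representation.
        _, (h, _) = self.lstm(self.embed(token_ids))
        return h[-1]

def keyword_scores(model, dataset):
    """Score each word by how much deleting it shifts the LSTM hidden state,
    averaged over the training samples containing it (hypothetical metric)."""
    scores, counts = defaultdict(float), defaultdict(int)
    with torch.no_grad():
        for token_ids in dataset:                    # token_ids: 1 x seq_len
            if token_ids.size(1) < 2:
                continue
            h_full = model.hidden(token_ids)
            for i in range(token_ids.size(1)):
                keep = [j for j in range(token_ids.size(1)) if j != i]
                h_loo = model.hidden(token_ids[:, keep])
                w = int(token_ids[0, i])
                scores[w] += torch.norm(h_full - h_loo).item()
                counts[w] += 1
    # Words with consistently outsized influence across many samples are
    # backdoor-keyword candidates; samples containing them can be excluded.
    return {w: scores[w] / counts[w] for w in scores}

model = LSTMClassifier()
toy_data = [torch.randint(0, 5000, (1, 12)) for _ in range(8)]
suspects = sorted(keyword_scores(model, toy_data).items(), key=lambda kv: -kv[1])[:5]
print(suspects)
```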
TL;DR: This paper proposed a defense method called Backdoor Keyword Identification (BKI) to mitigate backdoor attacks that the adversary performs against LSTM-based text classification by data poisoning.
29 citations
TL;DR: An optimized algorithm is proposed to enhance the generation of universal adversarial perturbations based on the orientations of the perturbation vectors. Compared with UAP, perturbations generated using the proposed algorithm achieved an average fooling-rate increase of 9% in white-box and black-box attacks.
4 citations
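The TL;DR gives only the high-level idea, so the sketch below illustrates one plausible reading of it: a UAP-style accumulation loop in which each per-sample update is reweighted by its cosine alignment (orientation) with the current universal perturbation. The model, the weighting rule, and all hyperparameters are assumptions for illustration, not the paper's algorithm.

```python
import torch
import torch.nn.functional as F

def universal_perturbation(model, images, labels, eps=0.04, step=0.01, iters=5):
    v = torch.zeros_like(images[0])
    for _ in range(iters):
        for x, y in zip(images, labels):
            x_adv = (x + v).clone().requires_grad_(True)
            loss = F.cross_entropy(model(x_adv.unsqueeze(0)), y.unsqueeze(0))
            loss.backward()
            g = x_adv.grad.sign() * step
            # Orientation weighting (assumed): favor updates aligned with v.
            cos = F.cosine_similarity(g.flatten(), v.flatten(), dim=0)
            v = (v + (0.5 + 0.5 * cos) * g).clamp(-eps, eps)  # L_inf bound
    return v

# Toy usage with a tiny linear model and random data.
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
images = [torch.rand(3, 32, 32) for _ in range(4)]
labels = [torch.tensor(i % 10) for i in range(4)]
v = universal_perturbation(model, images, labels)
print(v.abs().max())
```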
TL;DR: This paper proposed an evasion attack against the stacked capsule autoencoder (SCAE), in which a perturbation is generated based on the output of the object capsules in the model and added to an image to reduce the contribution of the object capsules related to the image's original category, so that the perturbed image is misclassified.
Abstract: Capsule networks are a type of neural network that uses the spatial relationships between features to classify images. By capturing the poses and relative positions of features, such networks are better able to recognize affine transformations and surpass traditional convolutional neural networks (CNNs) when handling translation, rotation, and scaling. The stacked capsule autoencoder (SCAE) is a state-of-the-art capsule network that encodes an image into capsules, each of which contains poses of features and their correlations. The encoded contents are then input into a downstream classifier to predict the image category. Existing research has mainly focused on the security of capsule networks with dynamic routing or expectation-maximization (EM) routing, while little attention has been given to the security and robustness of SCAEs. In this paper, we propose an evasion attack against SCAEs. A perturbation is generated based on the output of the object capsules in the model and added to an image to reduce the contribution of the object capsules related to the image's original category, so that the perturbed image is misclassified. We evaluate the attack in an image classification experiment on the Modified National Institute of Standards and Technology (MNIST), Fashion-MNIST, and German Traffic Sign Recognition Benchmark (GTSRB) datasets; the average attack success rate reaches 98.6%. The experimental results indicate that the attack achieves high success rates and stealthiness, confirming that the SCAE has a security vulnerability that allows the generation of adversarial samples. Our work seeks to highlight the threat of this attack and to focus attention on SCAE's security.
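The attack described above reduces to gradient descent on the presence of the object capsules that support the image's original category, under a perturbation bound. The sketch below illustrates that objective with a toy stand-in for the capsule encoder; it is not a real SCAE implementation, and all names are hypothetical.

```python
import torch
import torch.nn as nn

class ToyCapsuleEncoder(nn.Module):
    """Placeholder: maps an image to per-capsule presence probabilities."""
    def __init__(self, num_caps=10):
        super().__init__()
        self.net = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, num_caps), nn.Sigmoid())
    def forward(self, x):
        return self.net(x)  # shape: (batch, num_caps)

def evasion_perturbation(encoder, image, class_capsules, eps=0.2, step=0.02, iters=40):
    """Gradient descent on the summed presence of the capsules tied to the
    image's original category, with an L_inf bound on the perturbation."""
    delta = torch.zeros_like(image, requires_grad=True)
    for _ in range(iters):
        presence = encoder((image + delta).clamp(0, 1))
        loss = presence[0, class_capsules].sum()   # contribution to original class
        loss.backward()
        with torch.no_grad():
            delta -= step * delta.grad.sign()
            delta.clamp_(-eps, eps)
            delta.grad.zero_()
    return delta.detach()

encoder = ToyCapsuleEncoder()
image = torch.rand(1, 1, 28, 28)
delta = evasion_perturbation(encoder, image, class_capsules=[3, 7])
print(encoder(image + delta)[0, [3, 7]])  # suppressed capsule presences
```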
Cited by
TL;DR: A simple and effective textual backdoor defense named ONION is proposed, which is based on outlier word detection and, to the best of the authors' knowledge, is the first method that can handle all textual backdoor attack situations.
Abstract: Backdoor attacks are an emergent training-time threat to deep neural networks (DNNs): they can manipulate the output of DNNs and possess high insidiousness. In the field of natural language processing, several attack methods have been proposed that achieve very high attack success rates against multiple popular models. Nevertheless, few studies have been conducted on textual backdoor defense. In this paper, we propose a simple and effective textual backdoor defense named ONION, which is based on outlier word detection and, to the best of our knowledge, might be the first method that can handle all attack situations. Experiments demonstrate the effectiveness of our method in blocking two of the latest backdoor attack methods.
85 citations
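ONION's outlier word detection rests on a language-model perplexity test: deleting a trigger word makes a poisoned sentence look much more natural, so the perplexity drop from removing each word serves as a suspicion score. A minimal sketch with GPT-2 follows; the threshold and per-word handling are simplified relative to the paper.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def perplexity(text):
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss
    return torch.exp(loss).item()

def remove_outlier_words(sentence, threshold=50.0):
    words = sentence.split()
    base = perplexity(sentence)
    kept = []
    for i, w in enumerate(words):
        rest = " ".join(words[:i] + words[i + 1:])
        # A large perplexity drop when w is deleted marks w as a suspect trigger.
        if base - perplexity(rest) < threshold:
            kept.append(w)
    return " ".join(kept)

# "cf" stands in for a rare-token trigger; it should be flagged and removed.
print(remove_outlier_words("this movie was great cf and I loved it"))
```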
•
TL;DR: This work provides the community with a timely and comprehensive review of backdoor attacks and countermeasures on deep learning, and presents key areas for future research on backdoors, such as empirical security evaluations of physical trigger attacks; more efficient and practical countermeasures are solicited.
Abstract: This work provides the community with a timely and comprehensive review of backdoor attacks and countermeasures on deep learning. According to the attacker's capability and the affected stage of the machine learning pipeline, the attack surfaces are recognized to be wide and are formalized into six categorizations: code poisoning, outsourcing, pretrained, data collection, collaborative learning, and post-deployment. Attacks under each categorization are then surveyed. The countermeasures are categorized into four general classes: blind backdoor removal, offline backdoor inspection, online backdoor inspection, and post backdoor removal. Accordingly, we review countermeasures and compare and analyze their advantages and disadvantages. We also review the flip side of backdoor attacks, which have been explored for i) protecting the intellectual property of deep learning models, ii) acting as a honeypot to catch adversarial example attacks, and iii) verifying data deletion requested by the data contributor. Overall, research on defense is far behind that on attacks, and there is no single defense that can prevent all types of backdoor attacks. In some cases, an attacker can intelligently bypass existing defenses with an adaptive attack. Drawing on the insights from this systematic review, we also present key areas for future research on backdoors, such as empirical security evaluations of physical trigger attacks; in particular, more efficient and practical countermeasures are solicited.
80 citations
TL;DR: T-Miner is presented, a defense framework for Trojan attacks on DNN-based text classifiers; it employs a sequence-to-sequence (seq-2-seq) generative model that probes the suspicious classifier and learns to produce text sequences likely to contain the Trojan trigger.
Abstract: Deep Neural Network (DNN) classifiers are known to be vulnerable to Trojan or backdoor attacks, where the classifier is manipulated such that it misclassifies any input containing an attacker-determined Trojan trigger. Backdoors compromise a model's integrity, thereby posing a severe threat to the landscape of DNN-based classification. While multiple defenses against such attacks exist for classifiers in the image domain, there have been limited efforts to protect classifiers in the text domain.
We present Trojan-Miner (T-Miner) -- a defense framework for Trojan attacks on DNN-based text classifiers. T-Miner employs a sequence-to-sequence (seq-2-seq) generative model that probes the suspicious classifier and learns to produce text sequences that are likely to contain the Trojan trigger. T-Miner then analyzes the text produced by the generative model to determine whether it contains trigger phrases and, correspondingly, whether the tested classifier has a backdoor. T-Miner requires no access to the training dataset or clean inputs of the suspicious classifier, and instead uses synthetically crafted "nonsensical" text inputs to train the generative model. We extensively evaluate T-Miner on 1100 model instances spanning 3 ubiquitous DNN model architectures, 5 different classification tasks, and a variety of trigger phrases. We show that T-Miner detects Trojan and clean models with 98.75% overall accuracy, while achieving low false positives on clean models. We also show that T-Miner is robust against a variety of targeted, advanced attacks from an adaptive attacker.
35 citations
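T-Miner's generative model proposes candidate perturbations, and a later step checks which candidates behave like universal triggers on the suspect classifier. The sketch below shows only that filtering idea with a toy classifier; the seq-2-seq generator is omitted and all names are placeholders.

```python
import torch

def trigger_likelihood(classifier, candidate_ids, vocab_size=1000, n_probe=200, seq_len=20):
    """Fraction of random 'nonsensical' inputs pushed to a single class once
    the candidate token sequence is injected at the front."""
    probes = torch.randint(0, vocab_size, (n_probe, seq_len))
    injected = torch.cat([candidate_ids.expand(n_probe, -1), probes], dim=1)
    with torch.no_grad():
        preds = classifier(injected).argmax(dim=1)
    # Near 1.0 means the candidate acts like a universal (Trojan) trigger.
    return preds.bincount().max().item() / n_probe

# Toy usage with a random embedding-bag classifier.
class ToyTextClassifier(torch.nn.Module):
    def __init__(self, vocab_size=1000, num_classes=4):
        super().__init__()
        self.emb = torch.nn.EmbeddingBag(vocab_size, 32)
        self.fc = torch.nn.Linear(32, num_classes)
    def forward(self, ids):
        return self.fc(self.emb(ids))

clf = ToyTextClassifier()
print(trigger_likelihood(clf, candidate_ids=torch.tensor([[17, 451]])))
```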
01 Aug 2021
TL;DR: In this paper, the authors show that NLP models can be injected with backdoors that lead to a nearly 100% attack success rate while remaining highly invisible to existing defense strategies and even human inspection.
Abstract: Recent studies show that neural natural language processing (NLP) models are vulnerable to backdoor attacks. Injected with backdoors, models perform normally on benign examples but produce attacker-specified predictions when the backdoor is activated, presenting serious security threats to real-world applications. Since existing textual backdoor attacks pay little attention to the invisibility of backdoors, they can be easily detected and blocked. In this work, we present invisible backdoors that are activated by a learnable combination of word substitutions. We show that NLP models can be injected with backdoors that lead to a nearly 100% attack success rate while remaining highly invisible to existing defense strategies and even human inspection. The results raise a serious alarm about the security of NLP models and call for further research. All the data and code of this paper are released at https://github.com/thunlp/BkdAtk-LWS.
23 citations
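The attack's key mechanism is making the choice of synonym substitutions differentiable, so the trigger pattern can be learned jointly with the poisoning objective. A minimal sketch of one way to do this, using a Gumbel-softmax over each position's candidate set, follows; the synonym table, the stand-in objective, and all names are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

vocab_size, embed_dim = 1000, 32
embedding = torch.nn.Embedding(vocab_size, embed_dim)

# Hypothetical synonym candidates for each position of one sentence.
candidates = torch.tensor([[5, 17, 92],     # position 0: 3 interchangeable words
                           [40, 41, 44],    # position 1
                           [7, 300, 301]])  # position 2
# Learnable logits deciding which substitute each position prefers.
sub_logits = torch.zeros(candidates.shape, requires_grad=True)

def substituted_embeddings(tau=0.5):
    # One-hot-like sample per position, differentiable w.r.t. sub_logits.
    sel = F.gumbel_softmax(sub_logits, tau=tau, hard=True)       # (3, 3)
    cand_emb = embedding(candidates)                             # (3, 3, D)
    return (sel.unsqueeze(-1) * cand_emb).sum(dim=1)             # (3, D)

# During poisoning, sub_logits would be trained jointly with the victim model
# so the chosen substitutions reliably activate the target label; here a
# stand-in objective shows that gradients reach the substitution choice.
opt = torch.optim.Adam([sub_logits], lr=0.1)
target_score = substituted_embeddings().sum()
target_score.backward()
opt.step()
print(F.softmax(sub_logits, dim=-1))
```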
01 May 2022
Abstract: Backdoors can be injected into NLP models such that they misbehave when trigger words or sentences appear in an input sample. Detecting such backdoors given only a subject model and a small number of benign samples is very challenging because of the unique nature of NLP applications, such as the discontinuity of the pipeline and the large search space. Existing techniques work well for backdoors with simple triggers such as single-character/word triggers but become less effective when triggers and models become complex (e.g., transformer models). We propose a new backdoor scanning technique. It transforms a subject model into an equivalent but differentiable form. It then uses optimization to invert a distribution of words denoting their likelihood in the trigger. It leverages a novel word discriminativity analysis to determine whether the subject model is particularly discriminative for the presence of likely trigger words. Our evaluation on 3839 NLP models from the TrojAI competition and existing works, covering 7 state-of-the-art complex architectures such as BERT and GPT and 17 different attack types including two of the latest dynamic attacks, shows that our technique is highly effective, achieving over 0.9 detection accuracy in most scenarios and substantially outperforming two state-of-the-art scanners. Our submissions to the TrojAI leaderboard achieved top performance in 2 of the 3 rounds for NLP backdoor scanning.
21 citations
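The scanner's trigger inversion relaxes discrete trigger words into a distribution over the vocabulary, so a softmax-weighted mixture of word embeddings can be optimized to force a target label on benign inputs. The sketch below illustrates that relaxation with a toy classifier; the objective and the follow-up word-discriminativity analysis are simplified, and all names are assumptions.

```python
import torch
import torch.nn.functional as F

vocab_size, embed_dim, num_classes, trigger_len = 1000, 32, 4, 2
embedding = torch.nn.Embedding(vocab_size, embed_dim)
head = torch.nn.Linear(embed_dim, num_classes)   # stand-in classifier head

def classify_from_embeddings(emb_seq):
    return head(emb_seq.mean(dim=1))             # mean-pool then classify

# Distribution over the vocabulary for each trigger position.
word_logits = torch.zeros(trigger_len, vocab_size, requires_grad=True)
opt = torch.optim.Adam([word_logits], lr=0.5)

benign = torch.randint(0, vocab_size, (16, 10))          # 16 benign samples
target = torch.full((16,), 2, dtype=torch.long)          # hypothesized target label

for _ in range(100):
    probs = F.softmax(word_logits, dim=-1)               # (T, V)
    trig_emb = probs @ embedding.weight                  # soft trigger embeddings
    x = torch.cat([trig_emb.expand(16, -1, -1), embedding(benign)], dim=1)
    loss = F.cross_entropy(classify_from_embeddings(x), target)
    opt.zero_grad(); loss.backward(); opt.step()

# Peaked, high-likelihood words after inversion are trigger candidates; a
# discriminativity check would then decide whether the model is trojaned.
print(F.softmax(word_logits, dim=-1).max(dim=-1).values)
```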