Proceedings ArticleDOI

Piccolo: Exposing Complex Backdoors in NLP Transformer Models

Abstract
Backdoors can be injected into NLP models such that they misbehave when trigger words or sentences appear in an input sample. Detecting such backdoors given only a subject model and a small number of benign samples is very challenging because of the unique nature of NLP applications, such as the discontinuity of the pipeline and the large search space. Existing techniques work well for backdoors with simple triggers, such as single-character or single-word triggers, but become less effective when triggers and models grow complex (e.g., transformer models). We propose a new backdoor scanning technique. It transforms a subject model into an equivalent but differentiable form. It then uses optimization to invert a distribution over words denoting each word's likelihood of being in the trigger. It leverages a novel word discriminativity analysis to determine whether the subject model is particularly discriminative for the presence of the likely trigger words. Our evaluation on 3839 NLP models from the TrojAI competition and existing works, covering 7 state-of-the-art complex architectures such as BERT and GPT and 17 different attack types including two recent dynamic attacks, shows that our technique is highly effective, achieving over 0.9 detection accuracy in most scenarios and substantially outperforming two state-of-the-art scanners. Our submissions to the TrojAI leaderboard achieved top performance in 2 of the 3 rounds for NLP backdoor scanning.
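The abstract's core step, inverting a word distribution by optimization against a differentiable form of the model, can be sketched in a toy setting. This is a minimal illustration only, not Piccolo's actual implementation: the vocabulary, embeddings, and linear "backdoored" scorer below are all assumed stand-ins, and the optimization drives a softmax over trigger logits toward the word the scorer responds to.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (illustrative only, not Piccolo's actual code): an 8-word
# vocabulary with near-one-hot embeddings and a linear "backdoored"
# scorer that responds strongly to word index 7 (the planted trigger).
V = 8
E = np.eye(V) + 0.05 * rng.normal(size=(V, V))   # word embeddings
w = 3.0 * E[7]                                   # scorer aligned with word 7

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Trigger inversion: optimize logits z so the softmax-weighted average
# word embedding maximizes the differentiable scorer's output.
z = np.zeros(V)
s = E @ w                                        # per-word scores
for _ in range(200):
    p = softmax(z)
    grad = (np.diag(p) - np.outer(p, p)) @ s     # d(score)/dz via softmax Jacobian
    z += 0.5 * grad

likely_trigger = int(np.argmax(softmax(z)))      # recovers word 7
```

In the real setting the scorer is the (differentiable surrogate of the) subject transformer, and the inverted distribution is then passed to the word discriminativity analysis rather than read off directly.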



Citations
Proceedings Article

Training with More Confidence: Mitigating Injected and Natural Backdoors During Training

TL;DR: This paper designs a novel training method that forces training to avoid generating the hyperplanes associated with backdoors, thereby removing injected backdoors; it outperforms existing state-of-the-art defenses.
Journal ArticleDOI

Stealthy Backdoor Attack for Code Models

TL;DR: This paper proposes AFRAIDOOR, which leverages adversarial perturbations to inject adaptive triggers into different inputs, exposing the security weaknesses of code models under stealthy backdoor attacks; it shows that a state-of-the-art defense method cannot provide sufficient protection.
Proceedings ArticleDOI

UNICORN: A Unified Backdoor Trigger Inversion Framework

TL;DR: This paper proposes UNICORN, a unified framework that inverts backdoor triggers based on a formalization of triggers and the inner behaviors of backdoored models identified in its analysis; the framework is general and effective at inverting backdoor triggers in DNNs.
Proceedings ArticleDOI

Jigsaw Puzzle: Selective Backdoor Attack to Subvert Malware Classifiers

TL;DR: This paper proposes a new attack, Jigsaw Puzzle (JP), based on the key observation that malware authors have little to no incentive to protect any other authors' malware but their own. JP is effective as a backdoor, remains stealthy against state-of-the-art defenses, and is a threat in realistic settings that go beyond feature-space-only attacks.
Proceedings ArticleDOI

NOTABLE: Transferable Backdoor Attacks Against Prompt-based NLP Models

TL;DR: This paper proposes NOTABLE, a transferable backdoor attack against prompt-based models that is independent of downstream tasks and prompting strategies; it achieves superior attack performance (an attack success rate of over 90% on all datasets) and outperforms two state-of-the-art baselines.
References
Proceedings Article

Attention is All you Need

TL;DR: This paper proposes a simple network architecture based solely on an attention mechanism, dispensing with recurrence and convolutions entirely; it achieves state-of-the-art performance on English-to-French translation.
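The attention mechanism this reference centers on is scaled dot-product attention, softmax(QKᵀ/√d_k)·V. A minimal numpy sketch (shapes and values here are arbitrary illustrations):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V, the Transformer's core operation."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # query-key similarities
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)        # rows are distributions
    return attn @ V, attn                           # weighted sum of values

rng = np.random.default_rng(1)
Q = rng.normal(size=(3, 4))   # 3 queries, dimension 4
K = rng.normal(size=(5, 4))   # 5 keys
V = rng.normal(size=(5, 4))   # 5 values
out, attn = scaled_dot_product_attention(Q, K, V)   # out has shape (3, 4)
```

The full Transformer runs many such attentions in parallel (multi-head) with learned projections of Q, K, and V, but each head reduces to this computation.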
Posted Content

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

TL;DR: A new language representation model, BERT, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
Posted Content

RoBERTa: A Robustly Optimized BERT Pretraining Approach

TL;DR: This paper finds that BERT was significantly undertrained and, when trained properly, can match or exceed the performance of every model published after it; the best model achieves state-of-the-art results on GLUE, RACE, and SQuAD.

Automatic differentiation in PyTorch

TL;DR: This paper describes the automatic differentiation module of PyTorch, a library designed to enable rapid research on machine learning models; it differentiates purely imperative programs, with a focus on extensibility and low overhead.
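The idea behind such a module, reverse-mode automatic differentiation of imperative code, can be shown with a deliberately tiny pure-Python sketch. This is an illustration of the concept only, not PyTorch's API or implementation:

```python
class Var:
    """Minimal reverse-mode autodiff value (conceptual sketch, not PyTorch)."""

    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents   # pairs of (parent Var, local derivative)
        self.grad = 0.0

    def __add__(self, other):
        other = other if isinstance(other, Var) else Var(other)
        return Var(self.value + other.value, [(self, 1.0), (other, 1.0)])

    def __mul__(self, other):
        other = other if isinstance(other, Var) else Var(other)
        return Var(self.value * other.value,
                   [(self, other.value), (other, self.value)])

    def backward(self, seed=1.0):
        # Chain rule over the recorded graph. Simple recursion suffices here;
        # a real implementation walks a topologically ordered tape instead.
        self.grad += seed
        for parent, local in self.parents:
            parent.backward(seed * local)

x = Var(3.0)
y = x * x + x    # y = x^2 + x, built by ordinary imperative code
y.backward()     # x.grad is now dy/dx = 2x + 1 = 7
```

Each arithmetic operation records its inputs and local derivatives as it executes, which is what lets "purely imperative programs" be differentiated without a separate symbolic graph.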
Proceedings Article

Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank

TL;DR: This paper introduces a Sentiment Treebank with fine-grained sentiment labels for 215,154 phrases in the parse trees of 11,855 sentences, posing new challenges for sentiment compositionality, and introduces the Recursive Neural Tensor Network to address them.