Title: Obfuscated gradients give a false sense of security: circumventing defenses to adversarial examples
Authors: Anish Athalye, Nicholas Carlini, David Wagner
Published: 2018
Link: https://arxiv.org/abs/1802.00420
Github: https://github.com/anishathalye/obfuscated-gradients
Summary: Many defenses against adversarial examples rely on a technique called obfuscated gradients; such defenses can be circumvented by three attack techniques. The authors argue that future work should no longer rely on this technique.
Misc: ICML 2018 Best Paper Award

[Sketchnote by Daniel Etzold: Obfuscated gradients give a false sense of security: circumventing defenses to adversarial examples]

Extended summary: To create adversarial examples in the white-box setting (i.e. the adversary has full access to the network) via optimization-based methods, useful gradients are required. Therefore, many defenses rely on gradient masking, a technique which results in useless gradients. Because gradient masking can also happen by accident (some classifiers, such as kNN, have no gradients at all), the authors use the term obfuscated gradients instead of gradient masking whenever a defense is explicitly designed to cause gradient masking in order to protect against adversarial examples.

The authors identified three ways in which defenses usually cause obfuscated gradients:

  1. shattered gradients
    are caused when a defense uses non-differentiable operations (e.g. thermometer encoding or input transformations like cropping, rescaling, JPEG compression)

  2. stochastic gradients
    are either caused by randomizing the input before it is given to the classifier or by randomized classifiers (e.g. stochastic activation pruning, rescaling the input at random)

  3. vanishing/exploding gradients
    are often caused by multiple iterations of neural network evaluation, feeding the output of one iteration as the input of the next (which can be seen as an extremely deep neural network and can therefore cause vanishing or exploding gradients)

Of the nine defenses presented at ICLR 2018, the authors identified seven that rely on obfuscated gradients, each falling into at least one of the three categories shown above. To circumvent these defenses, the authors suggest the following techniques:

  1. Backward pass differentiable approximation (BPDA) to bypass shattered gradients

    Let $f^i(\cdot)$ be a non-differentiable layer of the network. The idea is to replace such a layer with a differentiable function $g$ such that $g(x) \approx f^i(x)$. The forward pass is done through $f^i(x)$ and only the backward pass is done through $g(x)$ (see the BPDA sketch after this list).

  2. Expectation over transformation (EOT) for randomized classifiers

    Defenses which apply random transformations to the input before feeding it to the classifier can be bypassed using expectation over transformation.

    Let $f(\cdot)$ be the classifier and $t(\cdot)$ a transformation sampled from a distribution of transformations $T$ (e.g. if $T$ is the set of all angles by which an image can be rotated, then $t$ is a function which rotates an image by one particular angle chosen at random from this set). The output of the classifier for some input $x$ is $f(t(x))$, where $t$ is chosen at random for each evaluation of $f$. Now, when an adversarial example $x'$ is generated for some $x$ under one particular $t$, the classifier might still return the true label for $x'$, because it could apply a different random transformation $t' \neq t$ to $x'$ which removes the adversarial modifications.
    Because the attacker does not know in advance which transformation the classifier will apply, he cannot reliably create an adversarial example against a single fixed transformation.

    To bypass this defense, the idea is to optimize the expectation over transformations $\mathbb{E}_{t\sim T}f(t(x))$ when computing adversarial examples. The adversary chooses the adversarial example which performs best in expectation over all considered transformations. To compute an adversarial example, the adversary usually needs the derivative of $f(t(x))$, i.e. $\nabla f(t(x))$. When using expectation over transformation, the adversary instead computes $\nabla \mathbb{E}_{t\sim T}f(t(x)) = \mathbb{E}_{t\sim T} \nabla f(t(x))$, which in practice is estimated by averaging gradients over several sampled transformations (see the EOT sketch after this list).

  3. Reparameterization for vanishing/exploding gradients

    Let $f(\cdot)$ be the classifier and let $g(\cdot)$ be a function which transforms the input to a new input before feeding it to the classifier, i.e. to classify an input $x$ we compute $f(g(x))$. Further assume, that $g$ is differentiable but yields vanishing/exploding gradients.

    For example, if it’s true that adversarial examples only lie in the low-probability region of the data distribution, the function $g$ could purify adversarial examples by projecting them back onto the data manifold by finding the highest probability example within an $\epsilon$-ball of the input image.

    Defenses which rely on this assumption and for which $g$ yields vanishing/exploding gradients can be bypassed. The authors solve the problem of vanishing/exploding gradients by using a change of variable.

    The input $x$ is replaced by some differentiable function $h(\cdot)$ (change of variable, i.e. $x=h(z)$) such that $g(h(z)) = h(z)$ for all $z$. This can be achieved if $h(z)$ never returns examples from the low-probability region of the data distribution, because $g$ then simply returns $h(z)$ unchanged. Now, we can compute $\nabla f(h(z))$, as $h$ is differentiable, in order to optimize over $z$. Finally, once we have found a suitable $z$, we obtain the adversarial example $x$ via $x=h(z)$ (see the reparameterization sketch after this list).
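
The three attacks can each be summarized in a few lines of code. Below is a minimal BPDA sketch in PyTorch (an assumption; the authors' linked implementation uses TensorFlow). The names `nondiff_defense` (standing in for the non-differentiable layer $f^i$) and `smooth_approx` (the approximation $g$) are hypothetical placeholders, and the quantization step is only an illustrative example of an operation that shatters gradients.

```python
import torch

def nondiff_defense(x):
    # Illustrative non-differentiable preprocessing f^i: quantize inputs to 16 levels.
    return torch.round(x * 15.0) / 15.0

def smooth_approx(x):
    # Differentiable approximation g with g(x) ~= f^i(x); the identity suffices here,
    # since quantization stays close to its input.
    return x

def bpda(x):
    # Forward pass returns f^i(x); the backward pass sees only g(x), because the
    # difference between the two is detached from the autograd graph.
    return smooth_approx(x) + (nondiff_defense(x) - smooth_approx(x)).detach()

# Usage inside an attack step (hypothetical `model`, inputs `x_adv`, labels `y`):
# loss = torch.nn.functional.cross_entropy(model(bpda(x_adv)), y)
# loss.backward()  # gradients now flow through g instead of the shattered f^i
```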
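
Next, a minimal sketch of the EOT gradient estimate under the same assumptions (PyTorch, hypothetical `model`), using a random rotation as an illustrative transformation distribution $T$; the defenses in the paper use their own transformations.

```python
import torch
import torchvision.transforms.functional as TF

def random_transform(x):
    # Sample one transformation t ~ T: here, a rotation by a random angle in [-30, 30] degrees.
    angle = float(torch.empty(1).uniform_(-30.0, 30.0))
    return TF.rotate(x, angle)

def eot_gradient(model, x, y, n_samples=30):
    # Monte-Carlo estimate of the gradient of E_{t~T} L(f(t(x)), y), using the identity
    # grad E_{t~T} L(f(t(x)), y) = E_{t~T} grad L(f(t(x)), y).
    x = x.clone().detach().requires_grad_(True)
    loss = 0.0
    for _ in range(n_samples):
        logits = model(random_transform(x))
        loss = loss + torch.nn.functional.cross_entropy(logits, y)
    grad, = torch.autograd.grad(loss / n_samples, x)
    return grad

# A PGD-style attack would repeatedly step x along sign(eot_gradient(model, x, y))
# and project back into the allowed perturbation ball.
```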
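
Finally, a minimal sketch of the reparameterization attack, assuming a hypothetical `generator` that plays the role of $h$: its outputs lie on the data manifold, so $g(h(z)) = h(z)$ and the purifier $g$ can be dropped during the attack. `model`, `z_dim`, the number of steps, and the distance weight are illustrative choices; this version is targeted towards labels `y_target`.

```python
import torch

def reparam_attack(model, generator, x_orig, y_target, z_dim=128, steps=200, lr=0.05):
    # Change of variable: optimize over the latent z with x = h(z) = generator(z),
    # so gradients flow through h instead of through the vanishing/exploding g.
    z = torch.zeros(x_orig.shape[0], z_dim, requires_grad=True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        x = generator(z)                      # candidate input, always on the manifold
        cls_loss = torch.nn.functional.cross_entropy(model(x), y_target)
        dist = ((x - x_orig) ** 2).mean()     # keep x close to the original input
        loss = cls_loss + 10.0 * dist
        opt.zero_grad()
        loss.backward()
        opt.step()
    return generator(z).detach()              # adversarial example x = h(z)
```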

The authors looked at seven of the nine ICLR 2018 defenses (only those which rely on obfuscated gradients) and found that six of them can be completely bypassed and one can be bypassed partially.