Investigating Pixel Robustness using Input Gradients


1. Adversarial Attack and Defense

This post covers the main concepts of the paper "Where to be Adversarial Perturbations Added? Investigating and Manipulating Pixel Robustness using Input Gradients" by Hwang et al 1. The paper connects the gradients of the input features to the robustness of a classification model, and shows that this robustness can be manipulated indirectly by changing the gradient flow within the model.


Adversarial attacks can change a panda into a gibbon.
Source : Explaining and Harnessing Adversarial Examples , Goodfellow et al, ICLR 2015.

An adversarial attack can be defined as the process of generating adversarial examples for a given classifier: samples that are misclassified by the model but differ only slightly from correctly classified samples drawn from the data distribution 2. Projected Gradient Descent (PGD) 3 is a popular attack method that iteratively generates adversarial examples as follows:

$$x^{t+1} = \mathrm{Clip}_{x,\epsilon}\!\left( x^{t} + \alpha \, \mathrm{sign}\!\big( \nabla_{x} L(\theta, x^{t}, y) \big) \right)$$

where $x$ is the original image, $y$ its label, $L$ the loss function of the classifier, $\theta$ the parameters of the classifier, and $\alpha$ the step size. $\mathrm{Clip}_{x,\epsilon}$ denotes the element-wise clipping operation that keeps the perturbation within size $\epsilon$ under the $\ell_\infty$ vector norm. If $\epsilon$ is small enough, one can generate perturbations that are imperceptible to the human eye but can successfully fool the target classifier. These adversarial attacks pose potential security threats for practical machine learning applications 4, such as face recognition 5 and autonomous driving systems 6.
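As a concrete illustration, here is a minimal sketch of the PGD update against a toy logistic-regression classifier, assuming numpy. The model, parameter values, and function name are illustrative, not from the paper; the input gradient is computed analytically rather than by backpropagation through a deep network:

```python
import numpy as np

def pgd_attack(x, y, w, b, epsilon=0.1, alpha=0.02, iters=40):
    """L-infinity PGD against a toy logistic-regression classifier.

    x: input vector, y: label in {0, 1}, (w, b): model parameters.
    Each step ascends the cross-entropy loss along sign(grad) and
    clips the total perturbation into the epsilon-ball around x."""
    x_orig = x.copy()
    x_adv = x.copy()
    for _ in range(iters):
        z = w @ x_adv + b
        p = 1.0 / (1.0 + np.exp(-z))           # sigmoid probability
        grad = (p - y) * w                     # d(cross-entropy)/dx
        x_adv = x_adv + alpha * np.sign(grad)  # signed gradient-ascent step
        x_adv = np.clip(x_adv, x_orig - epsilon, x_orig + epsilon)
    return x_adv
```

Each iteration takes a signed gradient-ascent step and projects back into the $\epsilon$-ball, so the final perturbation stays bounded while the classifier's loss on the true label increases.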

Adversarial attacks can be performed under various scenarios, i.e., with different levels of access to the target classifier. In the white-box setting, the adversary has full access to the classifier except for its test-time randomness 7. In the black-box setting, by contrast, the adversary has no information about the classifier or the data used to train it.

Adversarial defense refers to methods used during training or inference to alleviate the damage from potential adversarial examples. Training the model explicitly against some of the well-known attack methods is one of the simplest and most robust defenses 2.


Adversarial defense by random resizing and padding of image.
Source: Mitigating Adversarial Effects Through Randomization, Xie et al, ICLR 2018.

Many defense methods exploit domain-specific invariances in the dataset. In computer vision, random resizing and padding of the input image can serve as a defense 8, since the predicted label must remain the same regardless of the image size. Swapping a small number of pixels or using wavelet-based approximations can serve as another defense 9.

2. Measuring Pixel Robustness

The robustness of a classifier can be measured differently according to the distance metric and the attack method used for evaluation 10. However, most adversarial attacks generate an example by solving an optimization problem related to the original loss function of the classifier 2,3,10,11,12. The solution therefore depends heavily on the direction and magnitude of the gradients with respect to the input features.

Experiments on the ImageNet dataset show that the input gradients depend not only on the individual images, but also on the structure of the classifier itself. In order to remove the effect of the individual images, one can compute the mean absolute value of the input gradient at each pixel location, averaged over all images in the dataset:

$$g_{ijk} = \frac{1}{N} \sum_{n=1}^{N} \left| \frac{\partial L}{\partial x^{(n)}_{ijk}} \right|$$

where $x_{ijk}$ is the value of an input pixel, $i$ and $j$ are the xy-coordinates, and $k \in \{1, \dots, C\}$ is the channel of the pixel. $C$ denotes the total number of channels, e.g. $C = 3$ for an RGB image, and $N$ the number of images. It turns out that the pixels with large $g_{ijk}$ are more vulnerable to adversarial attacks.
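The per-pixel statistic described above can be sketched as follows, assuming numpy and a caller-supplied `grad_fn` that returns the input gradient of the loss for one image (a hypothetical stand-in for backpropagation through a real classifier):

```python
import numpy as np

def mean_abs_input_gradient(images, labels, grad_fn):
    """Average |dL/dx| at each pixel location over a dataset.

    images: array of shape (N, H, W, C); grad_fn(x, y) returns the
    input gradient of the classifier loss for a single image."""
    g = np.zeros(images.shape[1:])
    for x, y in zip(images, labels):
        g += np.abs(grad_fn(x, y))
    return g / len(images)
```

The resulting map has one value per pixel location and channel, independent of any single image, which is what makes it usable as a robustness measure for the classifier itself.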


Top-1 accuracy on ImageNet validation dataset after masked PGD attacks.
Source: Where to be Adversarial Perturbations Added? Investigating and Manipulating Pixel Robustness using Input Gradients, Hwang et al, ICLR DebugML Workshop 2019.

To illustrate the relationship between $g_{ijk}$ and the robustness of pixels, each image in the ImageNet validation dataset was first attacked using the PGD algorithm (1,000 iterations). The attacked image $x_{adv}$ was then linearly combined with the original image $x$ according to a binary mask $m$ of the same size as the image:

$$x' = m \odot x_{adv} + (1 - m) \odot x$$

The above figure shows the average top-1 classification accuracy on the ImageNet validation dataset for the combined input $x'$. The red line denotes the case where a pixel in the mask was set to 1 if its $g_{ijk}$ was in the bottom $p$ percent, whereas the blue line denotes the case where it was in the top $p$ percent. The accuracy on the red line is higher than on the blue line for all values of $p$, meaning that $g_{ijk}$ can serve as a good approximation for measuring the robustness of an input feature to adversarial attacks.
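The masking experiment can be sketched with two small helpers, assuming numpy (both function names are illustrative, not the paper's code): one selects the top or bottom p% of pixels by their mean absolute gradient, and the other blends the adversarial image in only at those locations:

```python
import numpy as np

def percentile_mask(g, p, top=True):
    """Binary mask marking the top (or bottom) p% of pixels by g."""
    if top:
        return g >= np.percentile(g, 100 - p)
    return g <= np.percentile(g, p)

def combine(x, x_adv, mask):
    """Keep adversarial pixels where the mask is 1, originals elsewhere."""
    m = mask.astype(x.dtype)
    return m * x_adv + (1.0 - m) * x
```

Running the classifier on `combine(x, x_adv, percentile_mask(g, p))` for both `top=True` and `top=False` reproduces the two accuracy curves compared in the figure.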

3. Pixel Robustness Manipulator


Overview of a PRM module. The first convolutional layer can be replaced with any convolutional filter with sparse connections.
Source: Same as above.

Now if one can control the distribution of $g_{ijk}$, one may move the weak points of a classifier to pre-defined locations. To support this claim, the paper proposes a simple auxiliary module named the Pixel Robustness Manipulator (PRM). The PRM is a convolutional autoencoder with a skip connection that can be turned on or off by controlling a weighting parameter $\lambda$. The encoder has a sparse connection to the input image, i.e., some of the pixels are dropped from the input and must be reconstructed by the decoder. This sparse connection ensures that the gradient flows only to the connected pixels when $\lambda = 1$, and gradually flows into the rest as $\lambda \to 0$. The PRM module in the above figure generates a checkerboard-like pattern on the input gradient when $\lambda = 1$, due to its first convolutional layer with a kernel size of 1 and a stride of 2.
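The sparse-connection pattern of such a first layer (kernel size 1, stride 2) can be visualized with a small sketch, assuming numpy; the stride-2 sampling touches roughly a quarter of the pixels, producing the regular grid pattern described above:

```python
import numpy as np

def connected_pixel_mask(h, w, stride=2):
    """Pixels reached by a kernel-size-1, stride-2 first convolution.

    Only these locations receive gradient through the encoder path,
    yielding a regular grid covering ~25% of the image for stride 2."""
    mask = np.zeros((h, w), dtype=bool)
    mask[::stride, ::stride] = True
    return mask
```

This mask marks the "designated" pixel locations to which the PRM concentrates the input gradient.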


Top-1 accuracy on ImageNet validation dataset (Acc) and Intersection over Union (IoU) with the designated pixels for various values of λ using the PRM module.
Source: Same as above.


Heatmaps of the average input gradient for various values of λ using the PRM module.
Source: Same as above.

The above results were obtained from experiments on the ImageNet dataset using the PRM module depicted above. Since the PRM module is connected to approximately 25% of the input image pixels, the Intersection over Union (IoU) was calculated between the top 25% of $g_{ijk}$ values and the connected pixels. Regardless of the base model, most of the gradient begins to align with the designated pixels at relatively small $\lambda$.
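The IoU between the most-vulnerable pixels and the designated pixels can be computed with a small helper (a sketch assuming numpy boolean masks; the function name is illustrative):

```python
import numpy as np

def mask_iou(a, b):
    """Intersection over Union between two boolean pixel masks."""
    intersection = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return intersection / union if union else 0.0
```

An IoU near 1 means the gradient mass sits almost entirely on the pixels the PRM was wired to, i.e., the weak points have been moved to the pre-defined locations.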

If one can then stochastically change the distribution of $g_{ijk}$ so that the adversary does not know its exact form at test time, the classifier can evade the generated attacks. The authors conducted experiments in a gray-box scenario to show that this is possible: the adversary knows everything about the classifier, but does not know about the PRM module.


Top-1 accuracy (%) of 4 defense methods against 5 attack methods on CIFAR-10, CIFAR-100, and ImageNet validation datasets. The number below the name of the classifier stands for the top-1 accuracy without adversarial attacks.
Source: Same as above.

Five adversarial attack methods were used for evaluation: One Pixel attack (OP) 13, JSMA 11, DeepFool (DF) 12, Carlini & Wagner (CW) 10, and PGD. Region-based defense 14, Randomization 8, and Pixel Deflection 9 were used as benchmark defenses. "Ours" stands for the defense strategy using the same pretrained PRM module shown above. The PRM module aligns the weak points to the checkerboard when the adversary generates the attacked samples. At test time, the image is randomly shifted by 1 pixel and zero-padded to match the original image size. The top-1 accuracy from the ImageNet experiment shows that the adversarial attacks have almost no effect on the classifier; the only accuracy drop comes from using the PRM module with a high $\lambda$.
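The test-time randomization step described above (a shift of at most 1 pixel, zero-padded back to the original size) can be sketched as follows, assuming numpy; the function name and pad-then-crop strategy are illustrative:

```python
import numpy as np

def random_shift_pad(x, max_shift=1, rng=None):
    """Randomly shift an (H, W, C) image and zero-pad back to its size."""
    rng = rng if rng is not None else np.random.default_rng()
    dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
    h, w = x.shape[:2]
    # Zero-pad all spatial borders, then crop a window offset by (dy, dx).
    pad = ((max_shift, max_shift), (max_shift, max_shift)) + ((0, 0),) * (x.ndim - 2)
    padded = np.pad(x, pad)
    return padded[max_shift + dy : max_shift + dy + h,
                  max_shift + dx : max_shift + dx + w]
```

Because the shift is sampled at inference, the adversary's perturbation, crafted against the checkerboard of designated pixels, no longer lines up with the vulnerable locations.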

4. Summary

Recent progress in the field of adversarial attacks shows that many contemporary deep learning models are vulnerable to imperceptible perturbations, yet most defense methods are either resource-heavy or domain-specific. Although this paper's empirical support is also limited to the image domain, analyzing and manipulating the robustness of a model through its input gradients is a method that can easily be applied in many other domains as well. Further research on modifying the gradients during inference may provide defense strategies that also work in full white-box scenarios.



1. Where to be Adversarial Perturbations Added? Investigating and Manipulating Pixel Robustness using Input Gradients, Hwang et al, ICLR DebugML Workshop 2019.
2. Explaining and Harnessing Adversarial Examples, Goodfellow et al, ICLR 2015.
3. Towards Deep Learning Models Resistant to Adversarial Attacks, Madry et al, ICLR 2018.
4. Adversarial Examples in the Physical World, Kurakin et al, ICLR 2017.
5. Accessorize to a Crime: Real and Stealthy Attacks on State-of-the-Art Face Recognition, Sharif et al, CCS 2016.
6. Robust Physical-World Attacks on Machine Learning Models, Eykholt et al, CVPR 2018.
7. Obfuscated Gradients Give a False Sense of Security: Circumventing Defenses to Adversarial Examples, Athalye et al, ICML 2018.
8. Mitigating Adversarial Effects Through Randomization, Xie et al, ICLR 2018.
9. Deflecting Adversarial Attacks with Pixel Deflection, Prakash et al, CVPR 2018.
10. Towards Evaluating the Robustness of Neural Networks, Carlini et al, IEEE S&P 2017.
11. The Limitations of Deep Learning in Adversarial Settings, Papernot et al, IEEE EuroS&P 2016.
12. DeepFool: A Simple and Accurate Method to Fool Deep Neural Networks, Moosavi-Dezfooli et al, CVPR 2016.
13. One Pixel Attack for Fooling Deep Neural Networks, Su et al, arXiv preprint 2017.
14. Mitigating Evasion Attacks to Deep Neural Networks via Region-based Classification, Cao et al, ACSAC 2017.



Young Hoon KIM

Research Resident
