Extended summary:
Building a state-of-the-art deep learning model from scratch requires large datasets and a lot of computational power. Hence, only a few companies like Google, Microsoft or Facebook are in a position to do this. Transfer learning is a popular technique for building models on top of such existing models when large datasets and computational power are not available.
The idea is to adapt an existing model to a similar task. For example, a face recognition system that was trained on one dataset of faces can easily be retrained to identify faces from another dataset. This is typically achieved by retraining only the last few layers of the model (which are responsible for high-level features) and freezing the parameters of the preceding layers (low-level features).
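The freeze-and-retrain recipe can be illustrated with a minimal numpy sketch. This is not the paper's setup: the "pre-trained" feature extractor here is just a random frozen layer, and the new task is a toy labeling problem, both invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# "Pre-trained" feature extractor: a frozen hidden layer standing in for the
# low-level layers of a public model (hypothetical weights, not a real model).
W_frozen = rng.normal(size=(4, 8))

def features(x):
    return np.tanh(x @ W_frozen)        # frozen layers: never updated

# New task: retrain only the classification head on a small dataset.
X = rng.normal(size=(64, 4))
y = np.where(X[:, 0] > 0, 1.0, -1.0)    # toy labels for the new task

W_head = rng.normal(size=(8,)) * 0.1    # trainable last layer
for _ in range(300):                    # plain gradient descent on squared loss
    h = features(X)
    grad = h.T @ (h @ W_head - y) / len(X)
    W_head -= 0.5 * grad                # only the head is updated

acc = np.mean(np.sign(features(X) @ W_head) == y)
print(f"training accuracy after head-only retraining: {acc:.2f}")
```

Only `W_head` ever changes; `W_frozen` plays the role of the shared low-level layers that, as the next paragraphs explain, the attacker can exploit.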
This centralized model training results in a lack of diversity, and the authors argue that it allows an attacker to launch highly effective misclassification attacks. An adversary who wants to attack a model (the target model) that was built from some public model via transfer learning can leverage his knowledge of the public model even if he cannot access the parameters of the target model (a black-box attack).
The authors assume that the attacker has full (white-box) knowledge of the public model but only black-box access to the target model.
The authors' key insight is that if an adversary can craft a sample whose internal representation at some layer K (i.e. the output of the neurons at that layer) perfectly matches the internal representation of the target image at that layer, then it must be misclassified into the same label as the target image.
Hence, the optimization problem is reformulated: the goal is not to minimize the error at the output of the network, i.e. the last layer (the way it is usually done), but to minimize the mismatch at layer K. If the victim freezes this layer and the preceding layers but not the subsequent ones during retraining, the sample will always be classified into the target label no matter how the subsequent layers are retrained.
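The reformulated objective can be sketched in a few lines of numpy: perturb a source sample by gradient descent until its layer-K representation matches that of the target sample. The frozen layer, step size and iteration count here are invented for illustration, and the real attack additionally constrains the perturbation so it stays imperceptible (the paper bounds it with a perceptual distance), which this sketch omits.

```python
import numpy as np

rng = np.random.default_rng(1)

# Frozen layer K of a hypothetical public teacher model (white-box to the attacker).
W = rng.normal(size=(4, 8)) * 0.5

def layer_K(x):
    return np.tanh(x @ W)

x_source = rng.normal(size=4)       # the sample the attacker starts from
x_target = rng.normal(size=4)       # a sample of the class to impersonate
f_target = layer_K(x_target)        # internal representation to mimic

x_adv = x_source.copy()
start = np.linalg.norm(layer_K(x_adv) - f_target)
for _ in range(2000):               # gradient descent on ||layer_K(x) - f_target||^2
    h = layer_K(x_adv)
    # chain rule through tanh, using tanh' = 1 - tanh^2
    grad = 2.0 * ((h - f_target) * (1.0 - h ** 2)) @ W.T
    x_adv -= 0.1 * grad

end = np.linalg.norm(layer_K(x_adv) - f_target)
print(f"layer-K mismatch: {start:.3f} -> {end:.3f}")
```

Once the mismatch at layer K is (near) zero, every frozen layer up to K produces identical activations for `x_adv` and `x_target`, so any student model that kept those layers frozen must assign both the same label, regardless of how its later layers were retrained.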
Experiments on face, iris, traffic sign and flower recognition were carried out with different transfer methods. The authors find that the attack is effective for face and iris recognition, where a large number of layers could be frozen, and less effective for the other datasets, where many layers had to be retrained to achieve good accuracy on normal data.