Extended summary: This paper studies common failures of state-of-the-art object detectors. It does so with a technique the authors call object transplanting: an object is cut out of one image and embedded into another image at various locations. With this technique the authors make the following observations:
- detection is not stable: an object that is correctly detected at one location is no longer detected when it is moved slightly.
- identity is not consistent: an object that is correctly identified at one location is classified incorrectly at another location.
- non-local impact: the detection of other objects in the image is affected even when their regions do not overlap with the transplanted object.
- non-local effects on OCR: OCR gives different results depending on the location of a transplanted object, even if the transplanted object does not overlap the text and is far from it.
Since these effects could arise simply because the network has never seen a particular combination of two categories (that of the transplanted object and that of the affected object), the authors run similar experiments in which an object already present in an image is duplicated at another location in the same image. The same effects were observed.
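The transplanting and duplication experiments described above can be illustrated with a minimal sketch. This is not the authors' code: images are plain lists of rows, the object is an axis-aligned box, and the names `transplant` and `sweep_positions` are hypothetical. The patch is pasted with hard edges and no blending, which is also an assumption of this sketch.

```python
def transplant(src, box, dst, dest):
    """Return a copy of `dst` with the `box` region of `src` pasted at `dest`.

    src, dst: images as lists of rows (each row a list of pixel values)
    box:      (top, left, height, width) of the object in `src`
    dest:     (top, left) of the paste location in `dst`
    Passing the same image as `src` and `dst` duplicates an object
    within one image.
    """
    top, left, h, w = box
    dtop, dleft = dest
    rows, cols = len(dst), len(dst[0])
    if dtop < 0 or dleft < 0 or dtop + h > rows or dleft + w > cols:
        raise ValueError("patch does not fit at the destination")
    # Cut the patch first so that src == dst is handled correctly.
    patch = [row[left:left + w] for row in src[top:top + h]]
    out = [list(row) for row in dst]  # leave the input image untouched
    for i in range(h):
        out[dtop + i][dleft:dleft + w] = patch[i]
    return out


def sweep_positions(src, box, dst, stride):
    """Yield (dest, image) for every paste location on a regular grid."""
    _, _, h, w = box
    rows, cols = len(dst), len(dst[0])
    for dtop in range(0, rows - h + 1, stride):
        for dleft in range(0, cols - w + 1, stride):
            yield (dtop, dleft), transplant(src, box, dst, (dtop, dleft))


# Toy 4x4 "image" with a 2x2 object in the top-left corner.
image = [
    [1, 2, 0, 0],
    [3, 4, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
duplicated = transplant(image, (0, 0, 2, 2), image, (2, 2))
variants = list(sweep_positions(image, (0, 0, 2, 2), image, stride=2))
```

In an experiment of this kind, each image yielded by `sweep_positions` would be passed to the detector under test, and the detections compared across paste locations.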
Different reasons could cause the effects:
- partial occlusions: this is a well-known problem. Correctly identifying overlapping objects is still a challenging task for an object detector.
- contextual reasoning: the context in which the object is shown seems to be important because the global image information is encoded somehow in each decision.
- out-of-distribution objects: the transplanted objects have abrupt edges that do not occur naturally. Such out-of-distribution inputs could confuse a network.
- not shift-invariant: networks are known not to be shift-invariant, i.e. a small change in position can cause a large change in the output.
- non-maximum suppression (NMS): this technique is common to most detectors. If an object A is no longer detected (because it overlaps with some other object), another object B, which NMS suppressed while A was detected, may become visible.
- feature interference: a network also considers features that do not belong to the object (e.g. because the region of interest is rectangular). As a result, a small black object near a television is more likely to be classified as a remote control than the same object near some other object. This is useful when the object itself does not provide enough evidence because of its size or partial occlusion. On the other hand, this behavior can cause errors.
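The NMS effect listed above can be made concrete with a small sketch of greedy non-maximum suppression. It assumes a class-agnostic variant with boxes given as (x1, y1, x2, y2) and an IoU threshold of 0.5; the detections, labels, and scores are made up for illustration and do not come from the paper.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def nms(detections, iou_threshold=0.5):
    """Greedy NMS: keep the highest-scoring box, drop boxes overlapping it."""
    kept = []
    for det in sorted(detections, key=lambda d: d["score"], reverse=True):
        if all(iou(det["box"], k["box"]) <= iou_threshold for k in kept):
            kept.append(det)
    return kept


# Hypothetical detections: A and B overlap heavily, A scores higher.
det_a = {"label": "A", "box": (0, 0, 10, 10), "score": 0.9}
det_b = {"label": "B", "box": (1, 1, 11, 11), "score": 0.6}

with_a = [d["label"] for d in nms([det_a, det_b])]   # B is suppressed by A
without_a = [d["label"] for d in nms([det_b])]       # A missing, B survives
```

When the detector stops producing A, B is no longer suppressed and appears in the output, even though nothing about B itself has changed.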
Finally, the authors argue that feature interference could be the root cause of most of the observed phenomena, and that contextual reasoning and partial occlusion are just special cases of it.