Motivated by the fact that many users experience frequent misinterpretations when talking to Amazon Alexa, the authors conduct an empirical analysis of the interpretation errors Amazon Alexa makes. Based on these observations they develop a new attack, which they call the skill squatting attack. An attacker can leverage systematic interpretation errors (i.e., a specific word is reliably misunderstood as some other word) to route a user to a malicious application. The user does not recognize that they are talking to an Amazon Skill (an application developed by a third-party vendor for Alexa) created and controlled by an adversary.
The authors also develop a variant of the attack, which they call the spear skill squatting attack. It works like skill squatting but targets a specific group of people (e.g., only men, only women, or people from a particular region). The name derives from spear phishing, i.e., phishing attacks aimed at specific groups of individuals.
The attacks are possible because Alexa sometimes makes systematic interpretation errors. Experiments showed that specific words are predictably misinterpreted, i.e., one word is consistently misunderstood as another. For example, the word “coal” is interpreted as “call”, “sell” as “cell”, “boil” as “boyle”, and so on. Short words in particular tend to cause such errors.
Spear skill squatting attacks become possible because Alexa tends to make different predictable misinterpretations depending on the gender or accent of the speaker.
Predictable misinterpretations can be exploited as follows: the attacker registers a skill whose name matches the common misinterpretation of a popular skill’s invocation name. When a user asks for the legitimate skill and Alexa mishears the name, the request is routed to the attacker’s skill instead.
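The routing logic behind the attack can be illustrated with a small sketch. This is a toy simulation, not Alexa's actual pipeline: the misinterpretation table is built from the paper's examples, and the skill names and registry are hypothetical.

```python
# Toy simulation of skill squatting (illustrative only; the skill names
# and the misinterpretation table are assumptions based on the examples).

# Systematic speech-recognition errors: spoken word -> transcribed word
MISHEARD = {"coal": "call", "sell": "cell", "boil": "boyle"}

def transcribe(spoken: str) -> str:
    """Mimic the systematic misinterpretations, word by word."""
    return " ".join(MISHEARD.get(word, word) for word in spoken.split())

# Skill registry: invocation name -> skill owner (hypothetical entries)
registry = {"boil an egg": "legitimate developer"}

# The attacker squats on the predictable transcription of the name.
registry["boyle an egg"] = "attacker"

def route(spoken_request: str) -> str:
    """Route the transcribed request to whichever skill name matches."""
    return registry.get(transcribe(spoken_request), "no skill found")

print(route("boil an egg"))  # prints "attacker"
```

The user said the legitimate skill's name correctly, yet the systematic transcription error delivers the request to the squatted skill; no user mistake is involved.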
This attack is reminiscent of domain name typosquatting, where an attacker registers domains corresponding to common typos of popular domain names. However, typosquatting relies on the user making a mistake; here, the error is intrinsic to the speech recognition service itself.