Elves explain how to understand adversarial attacks

This blog revolves around AI & security, so the term "adversarial attacks" features heavily. Previous posts give both a definition and a lot of discussion, but I still feel I can do more to build an intuitive understanding that does not lean too much on technical AI or security terminology. My first post on understanding adversarial attacks was more broadly scoped and still somewhat technical. This one is meant to be really light on technical terms, introducing adversarial attacks through the intuition behind language. And it’s about elves.

Let’s move to the magical Discworld of Terry Pratchett. The Discworld novel Lords and Ladies takes us to the kingdom of Lancre. The local witch coven (the protagonists) discovers that a growing number of crop circles are appearing in the fields around the town. This signals the coming of a time when beings from alternate dimensions try to enter Discworld. The local girls start a new coven, which is increasingly drawn to a circle of iron stones close to the village, a gateway to the captivating world of the Elves.

"Elves are wonderful. They provoke wonder.
Elves are marvellous. They cause marvels.
Elves are fantastic. They create fantasies.
Elves are glamorous. They project glamour.
Elves are enchanting. They weave enchantment.
Elves are terrific. They beget terror.
The thing about words is that meanings can twist just like a snake, and if you want to find snakes look for them behind words that have changed their meaning.
No one ever said elves are nice.
Elves are bad."
― Terry Pratchett, Lords and Ladies

These Elves have all the high fantasy elven characteristics, but they are not the noble force of good we’d expect. In fact, they are capricious, sadistic beings, and indeed the main antagonists of Lords and Ladies. They manage to influence enough locals to dismantle the iron stone circle, which was set up in the past to protect the world from the Elves (they are weak to iron). This allows them to enter Discworld and wreak considerable havoc. In the end, the witch coven manages to subdue the Elf threat and banish the Elves back into their own dimension.

I am sure Sir Terry didn’t deliberately try to describe an adversarial attack, but basically, that’s what the Elves did. They have all the key traits of a good adversarial attacker:

  • Manipulative: The Elves conceal the truth (their evil nature) by diverting attention to their superficial characteristics: good looks and style. Adversarial attacks likewise disguise the payload so that its true content looks like something else to the target model.
  • Subtle: The Elves manipulate the charmed people subtly, with whispers. Drawing too much attention to their antics would harm their goals. This also applies to adversarial attacks: if the target learns about the attack, the attacker loses the gained advantage and may face repercussions.
  • Knowledgeable: The Elves know their audience and which buttons to push for each person, and they try to hide from people who see through their glamour. Adversarial attacks are more dangerous if tailored specifically to the target model, and it’s best not to deploy them against models suspected of being resistant, lest the attacker draw unwanted attention.
  • Opportunistic: The initial situation does not look good for the Elves. They are kept out of Discworld by what appears to be an impenetrable barrier. But they wait for their opportunity, and when the naive locals come closer, they pounce… Adversarial attacks may look difficult to conceive at first, but a good adversarial attacker knows they only have to succeed once. The defender, on the other hand, needs to succeed every time.
  • Subverting expectations: Breaking the fourth wall somewhat, Pratchett’s Elves are captivating because they actively play with what we readers know about elves from previous works. Adversarial attacks likewise skate around what the model expects from its previous experience with the phenomenon, that is, its training data.
Fig. 1: Trustworthy or not?

Defending against the likes of Pratchett’s Elves presents quite the security challenge. Sweeping denials such as “no Elves may enter” are certainly not the right defense. I mean, “deny all <insert race> people” has never been a good rule for anything… What matters is picking the right aspects of the data to look at. Would you trust the elf in Fig. 1? Fans of The Lord of the Rings know it’s Galadriel, who certainly is a good character and a boon to the good cause (if we forgive that one moment of power temptation). But perhaps her gaze is alarming to you… What do we focus on, then? That’s exactly the question AI security research aims to answer. This post kicks off 2024 on this blog. Stay tuned for more!


