Adversarial Examples & Robustness

Adversarial examples are inputs deliberately crafted to cause AI models to fail. The classic demonstration involves images: adding tiny, invisible-to-humans changes to a photo of a panda can make a model confidently classify it as a gibbon. For language models, adversarial attacks include prompt injection (hiding malicious instructions in input text), jailbreaking (persuading the model to ignore its safety training) and data poisoning (corrupting training data to create exploitable weaknesses). These attacks matter because they reveal that AI models are brittle in ways that are difficult to predict and defend against. A model that performs perfectly on normal inputs can be systematically fooled by an adversary who understands its weaknesses. Robustness research aims to make models resistant to these attacks, through techniques like adversarial training (exposing the model to adversarial examples during training), input preprocessing (detecting and neutralising adversarial perturbations) and architectural changes that make models inherently more resistant to manipulation. For businesses, the practical concern is that any AI system exposed to untrusted inputs - which includes virtually any customer-facing application - needs defences against adversarial manipulation. This is especially critical for systems that take actions based on their outputs, where a manipulated input could trigger unintended consequences.