How AI Works›Reliability & Safety Engineering

Adversarial Examples & Robustness

Explore further

Informed By

Red-Teaming & Adversarial Evaluation

How AI Works

Informs

Scale, Emergence & Evaluation

How AI Works

Go Wider

Reliability & Safety Engineering

How AI Works

Related To

Prompt Injection & Model Attacks

Data & Infrastructure

Adversarial examples are inputs deliberately crafted to cause AI models to fail. The classic demonstration involves images: adding tiny, invisible-to-humans changes to a photo of a panda can make a model confidently classify it as a gibbon. For language models, adversarial attacks include prompt injection (hiding malicious instructions in input text), jailbreaking (persuading the model to ignore its safety training) and data poisoning (corrupting training data to create exploitable weaknesses). These attacks matter because they reveal that AI models are brittle in ways that are difficult to predict and defend against. A model that performs perfectly on normal inputs can be systematically fooled by an adversary who understands its weaknesses. Robustness research aims to make models resistant to these attacks, through techniques like adversarial training (exposing the model to adversarial examples during training), input preprocessing (detecting and neutralising adversarial perturbations) and architectural changes that make models inherently more resistant to manipulation. For businesses, the practical concern is that any AI system exposed to untrusted inputs - which includes virtually any customer-facing application - needs defences against adversarial manipulation. This is especially critical for systems that take actions based on their outputs, where a manipulated input could trigger unintended consequences.