Prompt Injection & Model Attacks
Prompt injection has emerged as one of the most pressing security concerns for applications built on large language models. The attack exploits the fact that LLMs process instructions and data in the same way - they can't reliably distinguish between your system prompt and malicious instructions embedded in user input or retrieved content. A user might craft an input that overrides your system prompt, causing the model to ignore its safety guidelines, reveal its instructions, or perform unintended actions. Indirect prompt injection is even trickier - malicious instructions are hidden in content that the model processes, like a web page being summarised or an email being analysed. Despite significant research effort, there's no complete solution to prompt injection. Defences include input filtering, output validation, separating model capabilities with different permission levels, and limiting the actions a model can take. But these are mitigations, not fixes - determined attackers can often find workarounds. For applications where the model can take consequential actions - making purchases, sending messages, accessing databases - this is a serious concern that requires defence in depth. Treat the model as an untrusted component in your security architecture, validate its outputs before acting on them, and design your system so that the worst case of a successful injection is limited and recoverable.