Cost Architecture & Unit Economics
Understanding the cost of running AI in production is essential for building sustainable AI products and features. The key metric is cost per inference: what it costs each time the model processes a request. This varies enormously with model size, hardware choice, optimisation level, and utilisation rate. A simple classification model might cost fractions of a penny per inference; a large language model generating a long response can cost several pence or more. Multiplied by millions of daily requests, these small numbers become significant.

The unit economics of AI features determine what is commercially viable. If your AI-powered feature costs five pence per use but generates only two pence of value, it is not sustainable, however impressive the technology.

Cost architecture involves choosing the right model size for the task (smaller is cheaper), optimising serving infrastructure, using spot or preemptible instances where possible, caching repeated queries, and routing simpler requests to cheaper, smaller models while reserving expensive models for complex ones. This last technique, sometimes called model routing or cascading, is increasingly common and can cut costs by 50% or more.

Building cost awareness into your AI development process from the start, rather than treating it as an afterthought, prevents unpleasant surprises when you move from prototype to production scale.
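The arithmetic above, plus the routing and caching techniques, can be sketched in a few lines. This is a minimal illustration, not a production router: the per-request prices, the word-count complexity threshold, and the function names are all hypothetical stand-ins (a real system would use measured costs and a learned difficulty classifier).

```python
from functools import lru_cache

# Hypothetical per-request prices in pounds; illustrative only,
# not real vendor rates.
COST_PER_REQUEST = {
    "small": 0.0004,  # e.g. a distilled classifier-sized model
    "large": 0.02,    # e.g. a large generative model
}


def estimate_monthly_cost(requests_per_day: int, cost_per_request: float) -> float:
    """Unit-economics check: cost per inference scaled to monthly volume."""
    return requests_per_day * 30 * cost_per_request


def is_viable(cost_per_use: float, value_per_use: float) -> bool:
    """A feature is sustainable only if each use creates more value than it costs."""
    return value_per_use > cost_per_use


def route(prompt: str, complexity_threshold: int = 20) -> tuple[str, float]:
    """Toy model router: short prompts go to the cheap model, long ones
    to the expensive model. Word count is a crude stand-in for a real
    request-difficulty estimator."""
    model = "small" if len(prompt.split()) < complexity_threshold else "large"
    return model, COST_PER_REQUEST[model]


@lru_cache(maxsize=10_000)
def answer(prompt: str) -> str:
    """Caching layer: identical repeated prompts are served from the cache
    and incur no further inference cost."""
    model, _cost = route(prompt)
    # Placeholder for the real model call.
    return f"[{model}] response to: {prompt}"
```

At a million requests a day, even the "cheap" hypothetical model above costs `estimate_monthly_cost(1_000_000, 0.0004)` = £12,000 a month, which is why routing the bulk of traffic away from the expensive model moves the needle so much.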