Latency Management
Latency - how long it takes for an AI system to respond - is often the difference between a good user experience and a frustrating one. Human-computer interaction research consistently shows that users notice delays beyond about 200 milliseconds and that satisfaction drops sharply past one second. For AI applications embedded in interactive products, managing latency is therefore critical.

Latency in an AI system comes from several sources: network round trips (especially when the model runs remotely), preprocessing of the input, the model inference itself, and any postprocessing of the output. Each can be optimised independently. Network latency can be reduced by deploying models closer to users through content delivery networks, regional deployments, or edge computing. Preprocessing can be streamlined or moved earlier in the pipeline. Inference latency depends on model size, hardware, and serving software; techniques such as model distillation, quantisation, caching common requests, and optimised inference engines (TensorRT, vLLM, ONNX Runtime) can dramatically reduce response times.

For large language models, streaming responses token by token creates the perception of a faster response even though total generation time is unchanged, because users can start reading the output while the rest is still being generated.

Setting and monitoring latency budgets - maximum acceptable response times for each component - is a practical discipline that keeps performance from gradually degrading as systems grow more complex. The sketches below illustrate three of these practices: response caching, token streaming, and budget enforcement.
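To give a concrete sense of request caching, here is a minimal sketch of an in-memory cache keyed on a normalised prompt. The model call (call_model), the normalisation rule, and the timings are assumptions for illustration, not any particular library's API; production systems would typically use a shared cache such as Redis and a more careful key scheme.

```python
import hashlib
import time

def call_model(prompt: str) -> str:
    # Hypothetical stand-in for the real model call; in practice this would
    # be a network request to an inference server.
    time.sleep(0.5)  # simulate inference latency
    return f"response to: {prompt}"

_cache: dict[str, str] = {}

def cached_generate(prompt: str) -> str:
    # Normalise the prompt so trivially different requests share a cache entry.
    key = hashlib.sha256(prompt.strip().lower().encode()).hexdigest()
    if key not in _cache:
        _cache[key] = call_model(prompt)
    return _cache[key]

if __name__ == "__main__":
    start = time.perf_counter()
    cached_generate("What is the capital of France?")
    print(f"first call:  {time.perf_counter() - start:.3f}s")  # pays full inference cost

    start = time.perf_counter()
    cached_generate("what is the capital of France?  ")
    print(f"cached call: {time.perf_counter() - start:.3f}s")  # served from cache
```

Note that exact-match caching only helps when requests genuinely repeat; whether that holds depends on the application's traffic.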
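The following sketch illustrates why streaming improves perceived latency: time to first token is much shorter than total generation time, so the user sees output almost immediately. The token generator here is a stand-in that simulates per-token delays, not a real model interface.

```python
import time
from typing import Iterator

def generate_tokens(prompt: str) -> Iterator[str]:
    # Stand-in for a model that yields tokens as they are produced.
    for token in ["The", " capital", " of", " France", " is", " Paris", "."]:
        time.sleep(0.2)  # simulate per-token generation time
        yield token

def stream_response(prompt: str) -> None:
    start = time.perf_counter()
    first_token_at = None
    for token in generate_tokens(prompt):
        if first_token_at is None:
            first_token_at = time.perf_counter() - start
        print(token, end="", flush=True)  # user starts reading right away
    total = time.perf_counter() - start
    print(f"\ntime to first token: {first_token_at:.2f}s, total time: {total:.2f}s")

if __name__ == "__main__":
    stream_response("What is the capital of France?")
```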
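Finally, latency budgets can be enforced with something as simple as per-stage timers checked against configured maxima. The stage names and thresholds below are illustrative assumptions; real budgets would come from product requirements and load testing, and violations would feed a metrics or alerting system rather than print statements.

```python
import time
from contextlib import contextmanager

# Illustrative per-component budgets in seconds (assumed values).
BUDGETS = {"preprocess": 0.05, "inference": 0.80, "postprocess": 0.05}

timings: dict[str, float] = {}

@contextmanager
def timed(stage: str):
    # Record how long a pipeline stage takes and flag budget violations.
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        timings[stage] = elapsed
        if elapsed > BUDGETS.get(stage, float("inf")):
            print(f"WARNING: {stage} took {elapsed:.3f}s, budget is {BUDGETS[stage]:.3f}s")

if __name__ == "__main__":
    with timed("preprocess"):
        time.sleep(0.02)   # e.g. tokenisation, input validation
    with timed("inference"):
        time.sleep(0.90)   # e.g. model call (deliberately over budget here)
    with timed("postprocess"):
        time.sleep(0.01)   # e.g. formatting, safety filters
    print(timings)
```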