Interpretability (Mechanistic Understanding)
While explainability generates human-friendly accounts of AI behaviour, interpretability aims for genuine understanding of how models work internally - what individual neurons, layers, or circuits actually do. This is sometimes called mechanistic interpretability, and it's one of the most active areas of AI safety research. The goal is to move beyond "this input feature mattered" to "here's the specific computation the model performed."

Progress has been real but limited. Researchers have identified individual circuits in language models that handle specific tasks - detecting sentiment, tracking grammatical number, performing simple arithmetic. But understanding isolated circuits is a long way from understanding an entire model's behaviour across all possible inputs. The sketch at the end of this section shows the kind of raw measurement this work starts from.

For most businesses, mechanistic interpretability is a research frontier rather than a practical tool. Its relevance is indirect but important: as interpretability improves, it becomes more feasible to verify that AI systems are doing what we think they're doing, to identify hidden failure modes before they cause harm, and to build justified rather than merely hoped-for trust in AI behaviour.
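To make "individual neurons" slightly more concrete, here is a minimal sketch of the most basic ingredient of this research: reading a model's internal activations on contrasting inputs. It assumes the open-source Hugging Face transformers library and the public gpt2 checkpoint; the layer choice, the example sentences, and the "top differing neurons" heuristic are illustrative assumptions, not a real circuit-finding method.

```python
# Toy activation inspection with a forward hook - a starting point,
# not genuine circuit analysis. Assumes: pip install torch transformers
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2")
model.eval()

captured = {}

def hook(module, inputs, output):
    # Keep the MLP output at the final token position.
    captured["acts"] = output[0, -1, :].detach()

# Attach to the MLP of a middle layer (layer 6, chosen arbitrarily).
handle = model.h[6].mlp.register_forward_hook(hook)

def final_token_acts(text):
    tokens = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        model(**tokens)
    return captured["acts"]

pos = final_token_acts("The film was wonderful and I loved it")
neg = final_token_acts("The film was terrible and I hated it")
handle.remove()

# Dimensions whose activations differ most between the two inputs are
# crude candidates for sentiment-sensitive units - correlation only.
diff = (pos - neg).abs()
top = torch.topk(diff, k=5)
for idx, delta in zip(top.indices.tolist(), top.values.tolist()):
    print(f"neuron {idx}: abs activation diff = {delta:.3f}")
```

Real mechanistic work goes far beyond a comparison like this: it traces how such activations flow through attention heads and later layers, and it validates candidate circuits with causal interventions rather than correlations on a pair of sentences.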