Tokenisation & Tokens

Before an AI model can process your text, it needs to break it into pieces called tokens. A token might be a whole word, part of a word, or even a single character, depending on the tokeniser's design. The word "understanding" might be split into "under" and "standing," while common words like "the" stay whole. Numbers, punctuation and code are all split up too, sometimes in ways that seem odd.

This matters for three practical reasons. First, AI pricing is usually based on tokens, so understanding tokenisation helps you estimate costs. Second, the way text is tokenised affects the model's capabilities: some tokenisers handle non-English languages poorly, splitting simple words into many tokens, which wastes capacity and can reduce quality. Third, a model can only process a limited number of tokens at once (its context window), so inefficient tokenisation means you can fit less actual content into each request.

Most users never need to think about tokenisation directly, but if you're building products on top of AI models, it's one of those low-level details that can quietly cause problems.
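As a concrete illustration, here is a minimal sketch using OpenAI's tiktoken library (pip install tiktoken). The choice of tokeniser is an assumption on my part, since the text above doesn't name one, and the per-token price and context limit in the sketch are made-up example figures, not any provider's real numbers. It shows all three points at once: how a sentence actually splits, how a token count feeds a rough cost estimate, and how that count compares against a context window.

```python
# A minimal sketch of inspecting tokenisation with OpenAI's tiktoken
# library. The encoding name, example price, and context limit are
# illustrative assumptions, not any particular provider's real figures.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # encoding used by several OpenAI models

text = "Tokenisation quietly shapes cost, quality and context budget."
tokens = enc.encode(text)

# See how the text was actually split into token pieces.
print(f"{len(tokens)} tokens:")
for t in tokens:
    print(repr(enc.decode([t])))

# Rough cost estimate (hypothetical rate of $0.50 per million input tokens).
rate_per_million = 0.50
print(f"estimated input cost: ${len(tokens) * rate_per_million / 1_000_000:.8f}")

# Would this text fit in a (hypothetical) 8,000-token context window?
context_window = 8_000
print("fits in context:", len(tokens) <= context_window)
```

Running the same loop over non-English text or source code makes the capability point visible: the same amount of meaning can come back as far more tokens, which is exactly the wasted capacity described above. Bear in mind that different models ship different tokenisers, so counts from one encoding are only an estimate for another.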