When you type a question to an AI assistant, the model never sees your words. It sees numbers: specifically, a sequence of integers that represent chunks of text called tokens. The process of converting your prose into these chunks is called tokenization, and it is perhaps the most underappreciated architectural decision in modern artificial intelligence. It shapes what a model can spell, how it handles foreign languages, whether it can do arithmetic, and how large your API bill is. Yet most users have never encountered the term.
Tokenization is not word-splitting. Early natural language systems did treat spaces as boundaries, but that approach breaks down on agglutinative languages like Finnish, on German compounds, and even on English contractions: is "don't" one unit or two? Modern tokenizers instead learn statistical patterns from large text corpora, identifying frequently recurring substrings and assigning each a unique integer. The result is a vocabulary, typically between about 30,000 and a few hundred thousand entries, that includes whole common words, frequent prefixes and suffixes, individual characters or bytes, and strange fragments that happen to appear often in the training data.
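To make "learning recurring substrings" concrete, here is a minimal sketch in the spirit of byte-pair encoding, one common family of tokenizer training algorithms. It is an illustration under simplifying assumptions, not any particular model's tokenizer: the function names (`learn_bpe_merges`, `build_vocab`) and the six-word corpus are hypothetical, and real systems work on bytes, handle whitespace, and train on vastly more text.

```python
from collections import Counter


def learn_bpe_merges(corpus, num_merges):
    """Learn up to num_merges merges from a list of words (toy BPE-style training)."""
    # Start with every word split into single characters, weighted by frequency.
    words = Counter(tuple(word) for word in corpus)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent pair of symbols occurs across the corpus.
        pair_counts = Counter()
        for symbols, freq in words.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        # Fuse the most frequent pair (ties go to the pair seen first).
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        new_words = Counter()
        for symbols, freq in words.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_words[tuple(merged)] += freq
        words = new_words
    return merges


def build_vocab(corpus, merges):
    """Assign an integer ID to each single character and each merged substring."""
    symbols = sorted({ch for word in corpus for ch in word})
    symbols += [left + right for left, right in merges]
    unique_symbols = list(dict.fromkeys(symbols))  # drop duplicates, keep order
    return {symbol: idx for idx, symbol in enumerate(unique_symbols)}


if __name__ == "__main__":
    toy_corpus = ["low", "low", "lower", "lowest", "newest", "newest"]
    toy_merges = learn_bpe_merges(toy_corpus, num_merges=3)
    print(toy_merges)
    print(build_vocab(toy_corpus, toy_merges))
```

Even on this tiny corpus, frequent fragments earn their own entries while rarer words stay in pieces, which is the same dynamic that plays out at the scale of a real vocabulary.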
The economics of the token
Every token costs compute. In Transformer models, the architecture behind most contemporary large language models, the cost of the attention mechanism grows quadratically with sequence length, so doubling the number of tokens roughly quadruples the attention computation. This is one reason API providers charge per token and why efficient tokenization is a competitive advantage. A tokenizer that represents the same English sentence in 15 tokens instead of 20 cuts the billed token count by a quarter. But efficiency in English often comes at the expense of other languages. Studies have shown that the same semantic content requires two to three times as many tokens, and sometimes far more, in Burmese or Amharic as in English, largely because those scripts were sparsely represented in the text the tokenizer was trained on. The AI economy, at its most granular level, is priced in a currency that favors the already-dominant.
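The arithmetic behind the 15-versus-20 example is easy to check. The sketch below compares two quantities: the token count, which per-token billing tracks, and the number of query-key pairs in full self-attention, which grows with the square of the sequence length. The helper names are hypothetical, and treating attention as exactly n × n pairs is a simplifying assumption; real inference cost mixes terms that grow linearly and quadratically.

```python
def attention_pairs(num_tokens):
    """Query-key score entries in full self-attention over num_tokens tokens."""
    return num_tokens * num_tokens


def relative_saving(baseline, improved):
    """Fraction by which `improved` undercuts `baseline`."""
    return 1 - improved / baseline


efficient, wasteful = 15, 20

# Linear view: what a per-token price list sees.
token_saving = relative_saving(wasteful, efficient)

# Quadratic view: the attention-score term alone.
attention_saving = relative_saving(
    attention_pairs(wasteful), attention_pairs(efficient)
)

print(f"fewer tokens billed:      {token_saving:.1%}")
print(f"fewer attention pairs:    {attention_saving:.1%}")
```

Under these assumptions the token bill falls by a quarter while the attention term falls by closer to 44 percent, which is why tokenizer efficiency matters more as sequences grow longer.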
When tokens betray the model
Tokenization explains several of AI's most mocked failures. Ask a language model to count the letters in "strawberry" and it may stumble, because the word is not processed letter by letter; it arrives as a few opaque tokens. Arithmetic suffers similarly: the number 7,842 might tokenize as "7" + "," + "842" or as a single token, depending on the tokenizer and the surrounding text, and the model has no native sense that these represent quantities rather than symbols. Spelling, rhyming, and character-level reasoning all become harder when the atomic unit of perception is a statistical substring rather than a grapheme. Researchers have experimented with byte-level and character-level tokenizers to address these limitations, but they produce far longer sequences, and the added compute remains a serious obstacle at scale.
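A toy example shows what the model actually receives. The sketch below uses a greedy longest-match rule and a hand-written vocabulary, both of which are assumptions made for illustration: `TOY_VOCAB` and `greedy_tokenize` are hypothetical, and production tokenizers apply learned merge rules and split words differently.

```python
# Hypothetical vocabulary: substring -> integer ID.
TOY_VOCAB = {"straw": 0, "berry": 1, "7": 2, ",": 3, "842": 4, "8": 5, "4": 6, "2": 7}


def greedy_tokenize(text, vocab):
    """Split text by repeatedly taking the longest vocabulary entry that matches."""
    max_len = max(len(piece) for piece in vocab)
    pieces, ids = [], []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink one character at a time.
        for length in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if piece in vocab:
                pieces.append(piece)
                ids.append(vocab[piece])
                i += length
                break
        else:
            raise ValueError(f"no vocabulary entry covers {text[i]!r}")
    return pieces, ids


if __name__ == "__main__":
    for text in ["strawberry", "7,842"]:
        pieces, ids = greedy_tokenize(text, TOY_VOCAB)
        byte_length = len(text.encode("utf-8"))
        # The model is given only `ids`; the letters inside each piece are not visible.
        print(text, pieces, ids, f"({len(ids)} tokens vs {byte_length} bytes)")
```

The integer IDs carry no trace of how many letters each piece contains, which is why counting characters is hard. The byte count in the last column also hints at the trade-off: a byte-level model would see every character, but its sequences would be several times longer.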
The vocabulary as a cultural artifact
A tokenizer's vocabulary is a fossilized record of its training data. Common English words get dedicated tokens; rare technical terms are shattered into subword fragments. Proper nouns from underrepresented cultures may be split into meaningless pieces, subtly signaling to the model that these are unusual, peripheral, less coherent. The tokenizer for one widely used model assigns a single token to "Christmas" but requires three tokens for "Diwali." This is not malice; it is statistics. But statistics encode history, and history is not neutral. Every design choice in AI, even the ones that seem purely technical, carries assumptions about whose language matters.
Our take
Tokenization is the kind of infrastructural decision that rarely makes headlines but quietly shapes outcomes for billions of interactions. It is a reminder that artificial intelligence is not a single monolithic technology but a stack of choices, each with trade-offs, each reflecting the priorities of its creators. Understanding how models see — or rather, how they are forced to see — is the first step toward demanding they see more fairly.




