Tokenization

How text is split before the LLM processes it — the Universal Translator parsing your sentence one syllable at a time.

Tokenization is the conversion of text into the sequence of integer IDs that a language model actually processes. LLMs don't operate on characters or words; they operate on tokens, which are fragments of text defined by a fixed vocabulary that is learned from a text corpus before the model itself is trained. Common English words are often a single token ("the," "cat," "running"). Less common words, technical terms, and words in languages underrepresented in the tokenizer's training data often split into multiple tokens ("tokenization" might be ["token," "ization"] or even ["token," "iz," "ation"], depending on the model's vocabulary). Numbers, code, and special characters pass through the same tokenizer but tend to break into more tokens per character than prose. Different models use different tokenization schemes (GPT-4 uses a BPE vocabulary, available through OpenAI's tiktoken library; Claude uses Anthropic's tokenizer; open-source models vary), so the same text can yield different token counts on different models. The average ratio for English prose is approximately 0.75 words per token, meaning 100 words require approximately 133 tokens.
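To make the splitting concrete, here is a minimal sketch of a greedy longest-match tokenizer. It is a simplification for illustration, not the BPE merge procedure real models use, and the vocabulary, its IDs, and the function names `tokenize` and `encode` are all hypothetical.

```python
# Toy vocabulary: hypothetical fragments and IDs, not taken from any real model.
VOCAB = {"the": 0, "cat": 1, "running": 2, "token": 3, "ization": 4, "iz": 5, "ation": 6}


def tokenize(word, vocab):
    """Greedy longest-match split of one word into vocabulary fragments."""
    pieces = []
    start = 0
    while start < len(word):
        # Try the longest remaining substring first, then shrink it.
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                pieces.append(word[start:end])
                start = end
                break
        else:
            # Nothing matched: fall back to a single character.
            pieces.append(word[start])
            start += 1
    return pieces


def encode(text, vocab):
    """Return (fragments, IDs) for whitespace-separated text; unknown fragments get -1."""
    pieces = [p for word in text.lower().split() for p in tokenize(word, vocab)]
    return pieces, [vocab.get(p, -1) for p in pieces]


print(encode("the cat", VOCAB))
print(tokenize("tokenization", VOCAB))

# A smaller vocabulary splits the same word into more pieces.
smaller = {k: v for k, v in VOCAB.items() if k != "ization"}
print(tokenize("tokenization", smaller))
```

With the full toy vocabulary, "tokenization" comes out as two fragments; once "ization" is removed, the same word needs three. That is the effect described above: the split depends entirely on which fragments the vocabulary happens to contain.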

Tokenization has practical implications for both cost and capability. Since LLM API pricing is per token, understanding how different content types tokenize helps predict and optimize inference costs. Code, JSON, and technical documentation often tokenize less efficiently than prose: a 1,000-word technical document might consume 1,500 tokens, while 1,000 words of plain prose consume closer to 1,300. Rare words, specialized terminology, and non-English text tokenize particularly inefficiently. A word that is a single token in English might be 5-10 tokens in a language with less representation in the training data, making the same task dramatically more expensive in some languages. Because context window limits are also measured in tokens, tokenization determines how much content can fit in a single interaction.
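The arithmetic behind these estimates can be sketched as follows. The words-per-token ratios are the rough figures quoted in this entry (0.75 for prose, 1,000 words per 1,500 tokens for technical text), the price is a made-up placeholder, and the function names are hypothetical; real ratios and prices depend on the model and provider.

```python
# Rough words-per-token ratios taken from the figures in this entry.
# Real ratios vary by model, language, and content.
WORDS_PER_TOKEN = {
    "prose": 0.75,
    "technical": 1000 / 1500,
}


def estimate_tokens(word_count, content_type="prose"):
    """Estimate token count from a word count using a fixed ratio."""
    return round(word_count / WORDS_PER_TOKEN[content_type])


def estimate_cost(word_count, content_type, price_per_1k_tokens):
    """Estimate cost given a price per 1,000 tokens (placeholder value supplied by caller)."""
    return estimate_tokens(word_count, content_type) / 1000 * price_per_1k_tokens


prose_tokens = estimate_tokens(1000, "prose")
technical_tokens = estimate_tokens(1000, "technical")
print(prose_tokens, technical_tokens)

# Hypothetical price of 0.01 per 1,000 tokens, for illustration only.
print(estimate_cost(1000, "technical", 0.01))
```

An estimate like this is only a starting point. For anything cost-sensitive, counting tokens on representative inputs with the target model's own tokenizer is more reliable than a fixed ratio.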

For B2B teams building AI applications, tokenization awareness informs architecture decisions in several ways. Chunking strategy for RAG systems should account for token counts rather than character or word counts, because chunking by token count means each retrieved chunk takes a predictable share of the context window and does not overflow it. Cost modeling should estimate token consumption based on realistic input lengths, including the system prompt and any retrieved context, not just user messages. Applications that process technical documentation, code, or multilingual content should benchmark token consumption on representative real inputs rather than relying on English prose estimates, which can underestimate actual token usage by 50-100% or more for technical or non-English content.
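A minimal sketch of token-based chunking and a context budget check follows. To stay self-contained it uses a whitespace split as a stand-in for the model's real tokenizer, so the counts are illustrative only; the function names and the overlap parameter are assumptions, not part of any particular RAG framework.

```python
def count_tokens(text):
    """Placeholder token counter: whitespace split stands in for a real tokenizer."""
    return len(text.split())


def chunk_by_tokens(text, max_tokens, overlap=0):
    """Split text into chunks of at most max_tokens, sharing `overlap` tokens between neighbors."""
    if max_tokens <= 0 or not 0 <= overlap < max_tokens:
        raise ValueError("need max_tokens > 0 and 0 <= overlap < max_tokens")
    tokens = text.split()
    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + max_tokens]))
        # Stop once this chunk has reached the end of the text.
        if start + max_tokens >= len(tokens):
            break
    return chunks


def fits_context(system_prompt, chunks, user_message, context_limit):
    """Check that system prompt, retrieved chunks, and user message fit the token limit."""
    total = (
        count_tokens(system_prompt)
        + sum(count_tokens(c) for c in chunks)
        + count_tokens(user_message)
    )
    return total <= context_limit


document = " ".join(str(i) for i in range(10))
chunks = chunk_by_tokens(document, max_tokens=4, overlap=1)
print(chunks)
print(fits_context("you are helpful", chunks[:2], "what is six", context_limit=16))
```

In a real pipeline, `count_tokens` and the split inside `chunk_by_tokens` would be replaced by the tokenizer that matches the target model, since the budget check counts the system prompt and retrieved context as well as the user message.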

tokenization, token, LLM, NLP, text processing, AI
