Context Window

How much the AI holds in working memory — the Pensieve has infinite capacity; LLMs are still catching up.

The context window defines the total amount of information, measured in tokens, that a large language model can "see" when generating a response. Everything the model receives in a single API call counts toward this limit: the system prompt, the full conversation history, any retrieved documents, tool call results, and the user's current message. If the total exceeds the context window, content must be truncated or summarized, potentially losing critical information. Frontier models have dramatically expanded context windows: the original GPT-3 had a 2,048-token window (roughly 1,500 words) and GPT-3.5 doubled that to 4,096 tokens (roughly 3,000 words); models like Claude 3.5 support 200,000 tokens (approximately 150,000 words, about the length of a long novel), and some models, such as Gemini 1.5 Pro, accept a million tokens or more. Going from 2,048 to 200,000 tokens is roughly a 100-fold increase in about four years, changing what's possible in a single AI interaction.
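The budgeting described above can be sketched in a few lines of Python. This is a minimal illustration, not any vendor's method: `estimate_tokens` uses a rough "four characters per token" rule of thumb for English text (an assumption; a real application would use the model's own tokenizer), and `fit_to_window`, `window`, and `reserve_for_output` are hypothetical names. The sketch also assumes that space must be reserved for the model's reply, which holds for many models but should be checked against the provider's documentation.

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate: ~4 characters per token (an assumption, not a tokenizer)."""
    if not text:
        return 0
    return max(1, len(text) // 4)


def fit_to_window(system_prompt, history, user_message,
                  window=200_000, reserve_for_output=4_096):
    """Keep the newest history turns that fit; drop the oldest ones.

    Returns (kept_turns_in_original_order, estimated_tokens_used).
    """
    budget = window - reserve_for_output
    # The system prompt and the current user message are always included.
    used = estimate_tokens(system_prompt) + estimate_tokens(user_message)
    if used > budget:
        raise ValueError("system prompt and user message alone exceed the budget")

    kept = []
    # Walk from newest to oldest so recent turns survive truncation.
    for turn in reversed(history):
        cost = estimate_tokens(turn)
        if used + cost > budget:
            break
        kept.append(turn)
        used += cost

    kept.reverse()  # restore chronological order
    return kept, used
```

Dropping the oldest turns is only one truncation policy; summarizing the dropped turns instead would preserve more information at the cost of an extra model call.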

Context window size fundamentally shapes application architecture. Small context windows require aggressive chunking and retrieval strategies, because only the most relevant information can fit alongside the prompt and instructions. Large context windows enable new patterns: you can include an entire codebase, a full meeting transcript, or a complete customer history in a single interaction rather than selecting and filtering what to include. However, longer contexts aren't free. API providers typically bill per input token, so cost grows roughly in proportion to context length, latency rises with it, and model quality can degrade on very long contexts as the model's "attention" is spread across more tokens. "Lost in the middle" is the name for a documented phenomenon in which LLMs retrieve information less reliably from the middle of a long context than from its beginning or end.
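One mitigation sometimes suggested for the "lost in the middle" effect is to reorder retrieved passages so the most relevant ones sit at the edges of the prompt and the least relevant ones fall in the middle. The sketch below illustrates that idea only; `reorder_for_edges` is a hypothetical helper, it assumes the input is already sorted from most to least relevant, and whether the reordering helps for a given model is something to measure rather than assume.

```python
def reorder_for_edges(docs_by_relevance):
    """Place the most relevant items at the start and end of the list.

    Input must be sorted from most to least relevant. Items at even
    positions fill the front; items at odd positions fill the back in
    reverse, so the least relevant items end up in the middle.
    """
    front, back = [], []
    for i, doc in enumerate(docs_by_relevance):
        if i % 2 == 0:
            front.append(doc)
        else:
            back.append(doc)
    return front + back[::-1]


# Ranks 1 (best) to 5 (worst): the best two land at the two ends.
ordered = reorder_for_edges([1, 2, 3, 4, 5])
```

With five passages ranked 1 to 5, the result places rank 1 first, rank 2 last, and rank 5 in the center.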

For B2B teams building AI applications, context window management is a central architectural decision. A customer support AI with a 200K-token window can include the full account history, all relevant documentation, and the complete conversation in every interaction, but at significant per-call cost. A smaller context paired with good retrieval (retrieval-augmented generation, or RAG) can achieve similar quality at lower cost by including only the most relevant snippets rather than everything. The right choice depends on three things: how much of the included material is actually useful relative to what it costs to process, the quality requirements of the application, and how often the "right" information is hard to retrieve reliably, in which case simply including everything is the safer option.
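The cost side of this trade-off is simple arithmetic, sketched below. Every number is a made-up assumption for illustration: the price of 3.00 USD per million input tokens, the call volume, and the per-call token counts are not taken from any provider's price list. The sketch also ignores output tokens, retrieval infrastructure, and any caching discounts.

```python
def monthly_input_cost(tokens_per_call, calls_per_month, usd_per_million_tokens):
    """Input-token spend per month, assuming flat per-token pricing."""
    total_tokens = tokens_per_call * calls_per_month
    return total_tokens * usd_per_million_tokens / 1_000_000


PRICE = 3.00      # hypothetical USD per million input tokens
CALLS = 10_000    # hypothetical support interactions per month

# Full-context design: account history, docs, and conversation in every call.
full_context = monthly_input_cost(150_000, CALLS, PRICE)

# Retrieval design: only the top-ranked snippets plus the conversation.
with_retrieval = monthly_input_cost(6_000, CALLS, PRICE)

print(f"full context: ${full_context:,.2f}/month")
print(f"retrieval:    ${with_retrieval:,.2f}/month")
print(f"ratio:        {full_context / with_retrieval:.0f}x")
```

Under these assumed figures the full-context design costs 25 times more in input tokens, which is the gap that retrieval quality has to justify closing or keeping.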

