What Tokens and Context Windows Mean for Enterprise AI Decisions
A clear explanation of tokens and context windows, and why these two technical limits shape cost, accuracy and feasibility in enterprise AI projects.
An explanation of how AI models handle long documents, the practical limits imposed by context windows, and the architectural approaches organisations use to work around them.
AI models do not read documents the way humans do. Every model has a context window — a maximum amount of text it can hold in working memory during a single inference call. Documents that fit within that window can be processed in full; documents that exceed it require deliberate architectural workarounds. For organisations whose AI use cases involve contracts, reports, manuals, policies or research documents, understanding these limits is foundational to building systems that actually work.
A context window is measured in tokens, each roughly four characters or about three-quarters of an English word. A 100-page contract might run to 70,000–90,000 tokens. Current frontier models offer context windows ranging from approximately 128,000 tokens (GPT-4o) to over one million tokens (Gemini 1.5 Pro). However, larger windows do not solve the problem entirely: cost scales linearly with input tokens, latency increases, and retrieval accuracy within very large contexts can degrade. This last phenomenon is sometimes called the "lost in the middle" effect: the model attends more strongly to content near the beginning and end of a long context than to content buried in the middle.
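As a rough illustration of this arithmetic, the sketch below estimates a token count from character length using the four-characters-per-token rule of thumb, checks whether the input fits a window, and projects input cost. The function names, the 4,000-token output reserve and the price per million tokens are illustrative assumptions, not figures from any provider; a production system would count tokens with the model's own tokeniser.

```python
import math

CHARS_PER_TOKEN = 4  # rule of thumb only; real tokenisers vary by language and content


def estimate_tokens(text: str) -> int:
    """Approximate the token count of a text from its character length."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def fits_in_window(token_count: int, window_tokens: int, reserved_for_output: int = 0) -> bool:
    """True if the input fits once space is reserved for the model's reply."""
    return token_count + reserved_for_output <= window_tokens


def input_cost(token_count: int, price_per_million_tokens: float) -> float:
    """Input cost scales linearly with the number of input tokens."""
    return token_count / 1_000_000 * price_per_million_tokens


if __name__ == "__main__":
    contract_tokens = 80_000  # mid-point of the 70,000-90,000 range for a 100-page contract
    print(fits_in_window(contract_tokens, window_tokens=128_000, reserved_for_output=4_000))
    print(input_cost(contract_tokens, price_per_million_tokens=2.50))  # hypothetical price
```

Doubling the input doubles the input cost under this model, which is why sending a full document on every query becomes expensive at volume.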
The practical consequence is that processing long documents well requires more than simply choosing a model with a large enough window. It requires a deliberate strategy matched to the specific task.
Organisations in professional services, legal, financial services, insurance, healthcare and government regularly work with documents that are long, dense and non-linear: multi-hundred-page tender responses, regulatory submissions, technical standards, insurance policies, and accumulated contract libraries. If the AI system cannot reliably extract information from these documents, the use case does not deliver its promised value.
The risk is not always obvious. A system that appears to work during testing — where documents were short — may fail silently in production when real documents arrive. The model may answer with high apparent confidence while drawing on only the first third of a 200-page document because the rest exceeded the context window.
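One way to make this failure loud rather than silent is to check the token budget before every call and refuse to proceed if the document would be cut off. The following is a minimal sketch under that assumption; `ContextOverflowError` and `check_fits` are hypothetical names, and the token counts are expected to come from whatever tokeniser the chosen model uses.

```python
class ContextOverflowError(ValueError):
    """Raised when a document will not fit in the model's context window."""


def check_fits(doc_tokens: int, window_tokens: int,
               prompt_tokens: int = 0, output_tokens: int = 0) -> int:
    """Return the spare token budget, or raise if the document would be truncated."""
    # Space left for the document after the instructions and the reply are accounted for.
    available = window_tokens - prompt_tokens - output_tokens
    if doc_tokens > available:
        coverage = max(available, 0) / doc_tokens if doc_tokens else 0.0
        raise ContextOverflowError(
            f"Document needs {doc_tokens} tokens but only {max(available, 0)} are available; "
            f"about {coverage:.0%} of it would be seen."
        )
    return available - doc_tokens


if __name__ == "__main__":
    try:
        check_fits(doc_tokens=300_000, window_tokens=100_000)
    except ContextOverflowError as exc:
        print(exc)
```

Raising an error forces the pipeline to route the document to a retrieval or summarisation path instead of answering from a fragment.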
There are three primary architectural patterns for handling long documents:
1. Full-context loading (stuff the window): The entire document is loaded into a single large context window. This is the simplest approach and works well when documents are reliably under the model's context limit and latency and cost are acceptable. It is most appropriate for moderate-length documents where completeness matters more than speed.
2. Retrieval-Augmented Generation (RAG): The document is pre-processed: split into chunks, each chunk converted to an embedding (a numerical vector representation of its semantic content), and stored in a vector database. When a query arrives, the query is also embedded, and the most semantically similar chunks are retrieved and passed to the model as context. Only the relevant fragments enter the context window, not the full document. This is the standard approach for large document libraries and knowledge base applications. The quality of this approach depends heavily on chunking strategy, embedding model quality and retrieval configuration.
3. Hierarchical summarisation (map-reduce): The document is divided into segments. Each segment is summarised independently (the "map" step). The summaries are then consolidated into a final output (the "reduce" step). This is effective for tasks like summarisation or extracting recurring themes across a long report, but loses granular detail in the compression step.
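The map and reduce steps of the third pattern can be sketched in a few lines. This is an illustration only: `split_segments` and `map_reduce_summarise` are hypothetical names, segments are split on a fixed word count rather than on document structure, and `first_sentence` is a trivial stand-in for what would be a model call in a real pipeline.

```python
from typing import Callable


def split_segments(text: str, max_words: int) -> list[str]:
    """Divide the text into consecutive segments of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def map_reduce_summarise(text: str, summarise: Callable[[str], str], max_words: int = 500) -> str:
    """Summarise each segment independently (map), then consolidate the summaries (reduce)."""
    segments = split_segments(text, max_words)
    if not segments:
        return ""
    partials = [summarise(segment) for segment in segments]  # map step
    if len(partials) == 1:
        return partials[0]  # nothing to consolidate
    return summarise("\n".join(partials))  # reduce step


def first_sentence(text: str) -> str:
    """Stand-in summariser (assumption): keeps only the first sentence."""
    head, _, _ = text.partition(".")
    return head.strip() + "."


if __name__ == "__main__":
    report = "Revenue grew in the first half. Costs were flat. " * 200
    print(map_reduce_summarise(report, first_sentence, max_words=100))
```

The stand-in makes the loss of detail visible: anything a segment summary drops never reaches the reduce step, which is the trade-off the pattern accepts.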
For complex documents with mixed structure — text, tables, figures, footnotes — pre-processing quality significantly affects downstream accuracy. A poorly extracted PDF produces corrupted chunks that neither retrieval nor full-context loading can fully compensate for.
Choosing the right approach requires understanding the specific task. Full-context loading suits analysis tasks where completeness is critical and the document set is bounded in size. RAG suits large, frequently updated document libraries where users ask targeted questions. Hierarchical summarisation suits narrative summarisation tasks where granular recall is less important than thematic coverage.
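These selection rules can be written down as a small routing function, which is useful when one system handles several task types. The sketch below is a simplified heuristic, not a complete decision procedure: the task labels and the `choose_strategy` name are assumptions, and sending oversized analysis tasks to RAG is a fallback chosen here for illustration.

```python
from enum import Enum


class Strategy(Enum):
    FULL_CONTEXT = "full-context loading"
    RAG = "retrieval-augmented generation"
    HIERARCHICAL = "hierarchical summarisation"


def choose_strategy(task: str, doc_tokens: int, window_tokens: int) -> Strategy:
    """Pick a pattern from the task type and whether the material fits the window.

    task is one of "analysis" (completeness is critical), "targeted_qa" or "summary".
    """
    if task not in {"analysis", "targeted_qa", "summary"}:
        raise ValueError(f"unknown task type: {task!r}")
    fits = doc_tokens <= window_tokens
    if task == "analysis" and fits:
        return Strategy.FULL_CONTEXT
    if task == "summary":
        return Strategy.HIERARCHICAL
    # Targeted questions, or analysis over material too large for the window (assumed fallback).
    return Strategy.RAG


if __name__ == "__main__":
    print(choose_strategy("analysis", doc_tokens=80_000, window_tokens=128_000).value)
    print(choose_strategy("targeted_qa", doc_tokens=5_000_000, window_tokens=128_000).value)
```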
In practice, many enterprise document AI systems combine approaches: RAG for targeted question-answering, hierarchical summarisation for regular report digests, and full-context loading for time-sensitive, high-stakes review tasks where the extra cost is justified.
Several implementation factors deserve early attention when designing a long-document pipeline:
Edison AI's implementation team regularly helps organisations design document processing pipelines that match task requirements to the right architectural pattern, from contract analysis to knowledge base construction.
For any AI use case that involves documents, begin with a document audit: catalogue the formats, lengths, quality levels and access patterns of the documents the system will process. This audit will determine which architectural approach is appropriate and what pre-processing investment is required before any model selection decision is made. Build that pre-processing effort into the project plan and budget from the start.
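The length and format parts of such an audit are straightforward to automate. The sketch below assumes character counts have already been extracted for each document; `DocRecord` and `audit` are hypothetical names, token counts use the four-characters-per-token approximation, and quality levels and access patterns still need separate review.

```python
import statistics
from collections import Counter
from dataclasses import dataclass


@dataclass
class DocRecord:
    name: str
    fmt: str         # file format, e.g. "pdf" or "docx"
    char_count: int  # characters of extracted text


def audit(docs: list[DocRecord], window_tokens: int, chars_per_token: int = 4) -> dict:
    """Summarise formats and estimated token lengths, and flag documents over the window."""
    # Ceiling division: a partial token still counts as one token.
    token_counts = [-(-d.char_count // chars_per_token) for d in docs]
    return {
        "documents": len(docs),
        "by_format": dict(Counter(d.fmt for d in docs)),
        "median_tokens": statistics.median(token_counts) if token_counts else 0,
        "max_tokens": max(token_counts, default=0),
        "oversized": [d.name for d, t in zip(docs, token_counts) if t > window_tokens],
    }


if __name__ == "__main__":
    library = [
        DocRecord("policy.pdf", "pdf", 40_000),
        DocRecord("tender.docx", "docx", 900_000),
        DocRecord("manual.pdf", "pdf", 250_000),
    ]
    print(audit(library, window_tokens=128_000))
```

The list of oversized documents indicates how much of the library cannot use full-context loading and must go through retrieval or summarisation instead.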
Current models have context windows ranging from roughly 32,000 to over 1 million tokens, but cost rises in proportion to input length and latency rises with it. Documents that exceed the window must be chunked and retrieved selectively, or summarised hierarchically, rather than passed in full.
Content beyond the context window is simply not seen by the model. Without a retrieval or summarisation strategy, the model will answer based only on what fits within its window — which may exclude the most relevant section of a long document.
For most enterprise use cases, retrieval-augmented generation (RAG) is the preferred approach: the document is chunked, each chunk is embedded, and only the most relevant chunks are retrieved and passed to the model at query time. For narrative tasks like summarisation, hierarchical chunking and map-reduce summarisation are common alternatives.
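The chunk, embed and retrieve steps described here can be illustrated with a toy example. Everything in it is a stand-in chosen for the sketch: `embed` builds a bag-of-words count vector rather than calling a real embedding model, the in-memory list replaces a vector database, and all function names are hypothetical.

```python
import math
import re
from collections import Counter


def chunk_words(text: str, chunk_size: int = 50, overlap: int = 10) -> list[str]:
    """Split text into overlapping chunks of up to chunk_size words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    chunks = []
    for start in range(0, len(words), chunk_size - overlap):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks


def embed(text: str) -> Counter:
    """Toy embedding (assumption): word counts instead of a learned semantic vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def build_index(chunks: list[str]) -> list[tuple[str, Counter]]:
    """Embed every chunk once, ahead of query time."""
    return [(chunk, embed(chunk)) for chunk in chunks]


def retrieve(query: str, index: list[tuple[str, Counter]], top_k: int = 3) -> list[str]:
    """Embed the query and return the top_k most similar chunks."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]


if __name__ == "__main__":
    clauses = [
        "Either party may terminate with 30 days written notice.",
        "Payment is due within 14 days of invoice.",
        "The governing law is that of New South Wales.",
    ]
    print(retrieve("When is payment due?", build_index(clauses), top_k=1))
```

Only the retrieved chunks would be placed in the prompt. Because word counts match vocabulary rather than meaning, a real embedding model is what lets retrieval find a relevant chunk that uses different wording from the query.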
Article: How AI Models Handle Long Documents: Context Limits and Workarounds