Frontier LLM context windows grew from 8,192 tokens at the GPT-4 launch on 14 March 2023 to 10,000,000 tokens at the Llama 4 Scout release on 5 April 2025, a roughly 1,200x expansion in just over two years. This piece documents every step on that ladder, with the release date, the lab, and the primary source for each.
What was the frontier context window in March 2023?
The frontier context window in March 2023 was 32,768 tokens, held by GPT-4-32k (OpenAI). The widely available GPT-4 variant shipped at 8,192 tokens on 14 March 2023; the 32k variant was limited-access in the same generation (OpenAI, GPT-4 system card, March 2023). Anthropic’s Claude v1 shipped at ~9,000 tokens. Meta had not yet released Llama 2; the open-weights Llama 1 shipped at 2,048 tokens on 24 February 2023. Google had not yet released a public frontier API.
32K was enough for a long technical paper. It was not enough for a novel or a medium-sized codebase. Production systems of this period were built almost entirely on retrieve-then-stuff architectures.
When did the first 100K context window ship?
The first production 100,000-token context window shipped on 11 May 2023, when Anthropic extended Claude to 100K tokens (Anthropic, Introducing 100K Context Windows, 11 May 2023). A single Claude 100K call could ingest The Great Gatsby (roughly 72,000 tokens) or a full 10-K filing in one pass. Anthropic pitched the release at enterprise document analysis — legal review, financial filings, long-form retrieval — rather than at conversational use.
This was the first context-window release that changed product architecture. Before 100K, most commercial RAG pipelines chunked source documents into 512–2,048 token passages and retrieved the top-k before prompting. After 100K, “fit-and-ask” became an alternative pattern for single-document queries. Retrieval did not disappear — it remained cheaper, and empirical work that summer (Liu et al., Lost in the Middle, arXiv:2307.03172, 6 July 2023) showed that long-context quality degraded for facts placed in the middle of the window — but the default shifted.
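The shift between the two patterns can be sketched in a few lines. Everything here is illustrative: `chunk`, `retrieve_top_k`, and the keyword-overlap scoring are toy stand-ins for a production chunker and embedding retriever, not any specific vendor API, and sizes are measured in characters rather than tokens for simplicity.

```python
# Illustrative sketch of the two prompting patterns described above.
# All helper names are hypothetical; no specific framework is implied.

def chunk(document: str, size: int = 1024) -> list[str]:
    """Split a document into fixed-size passages (512-2,048 tokens in
    the pipelines described above; characters here for simplicity)."""
    return [document[i:i + size] for i in range(0, len(document), size)]

def retrieve_top_k(chunks: list[str], query: str, k: int = 5) -> list[str]:
    """Toy lexical retriever: rank passages by query-word overlap.
    Real systems use embeddings, but the call shape is the same."""
    words = set(query.lower().split())
    return sorted(chunks,
                  key=lambda c: -len(words & set(c.lower().split())))[:k]

def rag_prompt(document: str, query: str) -> str:
    """Pre-100K default: chunk, retrieve top-k, stuff only those."""
    passages = retrieve_top_k(chunk(document), query)
    return "\n\n".join(passages) + f"\n\nQuestion: {query}"

def fit_and_ask_prompt(document: str, query: str) -> str:
    """Post-100K alternative: the whole document goes in one window."""
    return f"{document}\n\nQuestion: {query}"
```

The trade-off is visible in the signatures: `rag_prompt` sends at most k passages per call, while `fit_and_ask_prompt` resends the full document every time.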
When did 128K become table stakes?
128K tokens became the table-stakes frontier API size during November 2023. OpenAI announced GPT-4 Turbo with a 128,000-token window at DevDay on 6 November 2023 (OpenAI, New models and developer products, 6 November 2023). Anthropic shipped Claude 2.1 at 200,000 tokens on 21 November 2023 (Anthropic, Claude 2.1, 21 November 2023). The open-weights camp was already there: 01.AI had released Yi-34B-200K (200,000 tokens) a day before DevDay, on 5 November 2023 (01.AI, Yi model family release, November 2023).
By 1 December 2023, every major commercial API had a ≥128K option. The 32K ceiling that held for the first eight months of 2023 collapsed in three weeks.
When did a frontier LLM first cross 1M tokens?
Google crossed the 1,000,000-token barrier on 15 February 2024 with the research release of Gemini 1.5 Pro (Google DeepMind, Our next-generation model: Gemini 1.5, 15 February 2024). The model was made generally available to developers on 9 April 2024. Google’s technical report claimed near-perfect needle-in-haystack recall at up to 10,000,000 tokens — an order of magnitude beyond the advertised production limit.
Gemini 1.5 was the only frontier model with a 1M+ window for the remainder of 2024. OpenAI’s GPT-4o (launched 13 May 2024) stayed at 128,000 tokens. Anthropic’s Claude 3 family (launched 4 March 2024) stayed at 200,000 tokens. The context-size gap between Google and the other two labs persisted from February 2024 until OpenAI’s GPT-4.1 shipped a roughly 1M-token window on 14 April 2025; Anthropic remained at 200,000 tokens into mid-2025.
How did 2024 shape up across labs?
Through 2024, three distinct context-size tiers emerged: Google at 1M+, Anthropic and OpenAI at 128K–200K, and open-weights converging at 128K. Specific releases:
- Claude 3 (Anthropic) — Haiku, Sonnet, and Opus, all 200,000 tokens, launched 4 March 2024 (Anthropic, Introducing the next generation of Claude).
- GPT-4o (OpenAI) — 128,000 tokens, launched 13 May 2024 (OpenAI, Hello GPT-4o).
- Gemini 1.5 Pro, 2M (Google DeepMind) — raised from 1M to 2,000,000 tokens in Google AI Studio, announced at Google I/O on 14 May 2024.
- Claude 3.5 Sonnet (Anthropic) — 200,000 tokens, launched 20 June 2024; a refreshed “New Sonnet” shipped on 22 October 2024 at the same window size (Anthropic, Claude 3.5 Sonnet).
- GPT-4o-mini (OpenAI) — 128,000 tokens, launched 18 July 2024.
- Llama 3.1 405B (Meta) — 128,000 tokens, open-weights, launched 23 July 2024 (Meta, Introducing Llama 3.1).
- Qwen 2.5 (Alibaba) — 32,000 to 128,000 tokens depending on variant, launched 19 September 2024 (Qwen Team, Qwen2.5).
- Llama 3.2 (Meta) — 128,000 tokens, launched 25 September 2024.
Magic.dev claimed a 100,000,000-token context on an internal model called LTM-2-mini on 29 August 2024 (Magic, 100M token context windows), but the model was narrow — coding-only — and never released publicly as a general-purpose frontier model.
What shipped at the top of the ladder in 2025?
The top of the ladder in 2025 is Llama 4 Scout at 10,000,000 tokens, open-weights, released on 5 April 2025 alongside Llama 4 Maverick at 1,000,000 tokens (Meta, Introducing the Llama 4 herd). Before Llama 4, no open-weights model had crossed 1M tokens. Other frontier 2025 releases:
- Gemini 2.0 Flash (Google DeepMind) — 1,000,000 tokens, launched 11 December 2024 as experimental, GA in February 2025.
- Claude 3.7 Sonnet (Anthropic) — 200,000 tokens, launched 24 February 2025 (Anthropic, Claude 3.7 Sonnet and Claude Code).
- GPT-4.5 (OpenAI) — 128,000 tokens, launched 27 February 2025 (OpenAI, Introducing GPT-4.5).
- Gemini 2.5 Pro (Google DeepMind) — 1,000,000 tokens, launched 25 March 2025; Google stated a 2M roadmap.
By mid-2025, the frontier sat at: Google 1M–2M, Meta 10M (open-weights, Scout), Anthropic 200K, and OpenAI 128K on most tiers, with GPT-4.1 (launched 14 April 2025) at roughly 1M tokens.
What did long context actually improve?
Long context improved three things: needle-in-haystack recall, per-token pricing, and — via prompt caching — the economics of repeated long prompts.
- Recall — Gemini 1.5’s technical report claimed over 99% retrieval accuracy on needle tests up to 10M tokens in internal evaluations (Google DeepMind, Gemini 1.5 technical report, February 2024).
- Pricing — input pricing per million tokens on comparable-capability tiers fell roughly 10x between the March 2023 GPT-4 baseline and the Q4 2024 equivalents. GPT-4 launched at $30 / 1M input tokens; GPT-4o launched at $5 / 1M in May 2024 and was cut to $2.50 / 1M that August; GPT-4o-mini launched at $0.15 / 1M in July 2024.
- Prompt caching — Anthropic shipped prompt caching on 14 August 2024 (Anthropic, Prompt caching with Claude); OpenAI shipped automatic prompt caching on 1 October 2024 (OpenAI, Prompt caching in the API); Google Vertex AI added context caching during 2024. Cached tokens cost roughly 10% of the uncached input price on Anthropic (cache reads), about 25% on Google, and 50% on OpenAI, making 100K+ system prompts economically viable in agentic loops.
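The caching economics can be made concrete with a small calculator. The discount fraction mirrors the vendor rates quoted above; the $3 / 1M input price is an illustrative assumption, not any vendor's list price.

```python
# Hedged sketch: `cache_discount` is the cached-token price as a
# fraction of the uncached input price (e.g. ~0.10 for Anthropic cache
# reads, ~0.50 for OpenAI, per the text above). Prices are illustrative.

def cached_call_cost(prompt_tokens: int, cached_tokens: int,
                     price_per_m: float, cache_discount: float) -> float:
    """Input cost in dollars for one call when `cached_tokens` of the
    prompt are served from the cache."""
    uncached = prompt_tokens - cached_tokens
    effective = uncached + cached_tokens * cache_discount
    return effective * price_per_m / 1_000_000

# A 100K-token system prompt reused every agent turn, at an assumed
# $3 / 1M input price and a 10% cache-read rate:
cold = cached_call_cost(100_000, 0, 3.00, 0.10)        # first call: $0.30
warm = cached_call_cost(100_000, 100_000, 3.00, 0.10)  # cache hit: $0.03
```

A 10x cut on every warm turn is what makes 100K+ system prompts in agentic loops viable.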
What did long context not improve?
Long context did not improve reasoning quality, latency, or effective working memory at the same rate it improved raw recall.
- Reasoning over long context — empirical evaluations (Hsieh et al., RULER, arXiv:2404.06654, April 2024; Yen et al., HELMET, arXiv:2410.02694, October 2024; Bai et al., LongBench v2, December 2024) consistently show task-accuracy degradation well before the advertised context limit, even when simple needle-recall remains perfect.
- Latency to first token — uncached 1M-token prompts routinely incur multi-second time-to-first-token penalties on all three major clouds (observed in public benchmarks and vendor documentation, 2024). Caching mitigates but does not eliminate the cost.
- Effective working memory — frontier models (including Gemini 1.5, Claude 3.5, GPT-4o) continue to forget or contradict instructions stated 50,000–100,000 tokens earlier in a conversation, documented in RULER and HELMET.
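A needle-in-haystack evaluation of the kind these benchmarks run reduces to prompt construction plus binary grading. A minimal sketch, with arbitrary filler and needle text and the model call itself omitted:

```python
# Minimal needle-in-haystack harness in the style of the benchmarks
# cited above. Filler sentence, needle, and depth sweep are arbitrary
# illustrations; scoring a real model would require an API call here.

def build_needle_prompt(n_filler: int, needle: str, depth: float) -> str:
    """Place `needle` at fractional position `depth` (0.0 = start,
    1.0 = end) among `n_filler` repeated filler sentences."""
    filler = ["The sky is blue and the grass is green."] * n_filler
    pos = int(n_filler * depth)
    haystack = filler[:pos] + [needle] + filler[pos:]
    return " ".join(haystack) + "\n\nWhat is the magic number?"

def graded(answer: str, expected: str) -> bool:
    """Binary grading used by most needle benchmarks: substring match."""
    return expected in answer

# Sweep depths 0%..100% in 10% steps; the middle of the sweep is where
# 'lost in the middle' degradation shows up.
prompts = [build_needle_prompt(1_000, "The magic number is 7481.", d / 10)
           for d in range(11)]
```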
Did long context kill RAG?
Long context did not kill retrieval-augmented generation. RAG remains cheaper, faster, and often more accurate than context-stuffing for three reasons:
- Economics — a retrieve-top-k-then-generate call is 1–2 orders of magnitude cheaper than stuffing 200K–2M tokens every turn, even with prompt caching.
- Middle-of-context decay — models under-attend to facts placed 30–70% of the way through a long prompt, an effect documented by Liu et al. in 2023 and replicated in subsequent long-context benchmarks.
- Latency — retrieval adds 50–200ms; stuffing adds seconds.
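The economics bullet is back-of-envelope arithmetic. Assuming an illustrative $3 / 1M input price, ~8K tokens of retrieved passages per turn, and a 200K-token stuffed corpus per turn:

```python
# Back-of-envelope check on the cost ratio; all three numbers are
# assumptions for illustration, not quoted vendor prices.

PRICE_PER_M = 3.00      # dollars per 1M input tokens (assumed)
RAG_TOKENS = 8_000      # top-k passages + question (assumed)
STUFF_TOKENS = 200_000  # full corpus resent every turn (assumed)

rag_cost = RAG_TOKENS * PRICE_PER_M / 1_000_000      # $0.024 / turn
stuff_cost = STUFF_TOKENS * PRICE_PER_M / 1_000_000  # $0.60 / turn
ratio = stuff_cost / rag_cost                        # 25x
```

A 25x gap at these assumed sizes sits inside the claimed 1–2 orders of magnitude; stuffing 2M tokens instead would push the ratio to 250x.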
The design question in 2025 is not RAG versus long context but how much to retrieve, how much to stuff, and what to cache. Production patterns converged on: retrieve the top 5–20 most relevant passages, fit them in a ≤32K window, cache the stable system prompt, and let the model generate. Long context is used selectively for single-document workloads (contract review, codebase queries) where the full document fits and retrieval adds no value.
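That converged pattern can be sketched under assumptions: `build_turn` and `approx_tokens` are hypothetical helpers, token counts use a rough 4-characters-per-token heuristic, and the retriever that ranked the passages is out of frame.

```python
# Sketch of the converged production pattern described above; names
# and budgets are illustrative, not any framework's API.

TOKEN_BUDGET = 32_000  # passage window from the pattern above
TOP_K = 20

def approx_tokens(text: str) -> int:
    """Rough heuristic: ~4 characters per token for English prose."""
    return len(text) // 4

def build_turn(system_prompt: str, passages: list[str],
               question: str) -> dict[str, str]:
    """Fill the window with top-ranked passages up to the budget,
    keeping the stable system prompt as a cacheable prefix."""
    picked, used = [], 0
    for p in passages[:TOP_K]:
        cost = approx_tokens(p)
        if used + cost > TOKEN_BUDGET:
            break  # stay inside the <=32K passage window
        picked.append(p)
        used += cost
    return {
        "system": system_prompt,  # unchanged across turns -> cache hits
        "user": "\n\n".join(picked) + f"\n\nQuestion: {question}",
    }
```

Keeping the system prompt byte-identical across turns is what lets the vendor caching described earlier apply to it.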
What is the operating ceiling for long-context use in production?
The practical operating ceiling for long-context use in 2025 is roughly 200,000 tokens per call on most production paths, constrained by cost and latency rather than by model capability.
A 2,000,000-token Gemini 1.5 Pro call at its January 2025 pricing ($1.25 / 1M input tokens for prompts ≤128K, $2.50 / 1M for prompts >128K) costs approximately $5.00 per call for input alone, before generation and before caching. A 10,000,000-token Llama 4 Scout inference is theoretically possible on open-weights deployments but requires multi-GPU serving infrastructure that most production teams do not operate. The capability ceiling continues to rise; the economic ceiling has moved more slowly.
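The $5.00 figure follows directly from the tiered rates, which can be checked in a few lines (a sketch of the quoted January 2025 pricing, where prompts over 128K are billed at the higher rate for all input tokens):

```python
# Check on the $5.00 figure using the tiered rates quoted above:
# $1.25 / 1M input tokens for prompts up to 128K, $2.50 / 1M for
# longer prompts, applied to the whole prompt.

def gemini_input_cost(prompt_tokens: int) -> float:
    """Input-only cost in dollars at the quoted January 2025 rates."""
    rate = 1.25 if prompt_tokens <= 128_000 else 2.50
    return prompt_tokens * rate / 1_000_000

short_call = gemini_input_cost(128_000)   # $0.16 input
long_call = gemini_input_cost(2_000_000)  # $5.00 input, before output
```

Note the discontinuity at the tier boundary: one extra token more than doubles the input bill, because the higher rate applies retroactively to the whole prompt.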
The ladder, complete
| Date | Model | Lab | Context | Notes |
|---|---|---|---|---|
| 14 Mar 2023 | GPT-4 | OpenAI | 8,192 | Public baseline |
| 14 Mar 2023 | GPT-4-32k | OpenAI | 32,768 | Limited access |
| 11 May 2023 | Claude 100K | Anthropic | 100,000 | First production 100K |
| 18 Jul 2023 | Llama 2 | Meta | 4,096 | Open-weights baseline |
| 5 Nov 2023 | Yi-34B-200K | 01.AI | 200,000 | Open-weights |
| 6 Nov 2023 | GPT-4 Turbo | OpenAI | 128,000 | DevDay |
| 21 Nov 2023 | Claude 2.1 | Anthropic | 200,000 | |
| 15 Feb 2024 | Gemini 1.5 Pro | Google DeepMind | 1,000,000 | First 1M frontier |
| 4 Mar 2024 | Claude 3 (Haiku/Sonnet/Opus) | Anthropic | 200,000 | |
| 13 May 2024 | GPT-4o | OpenAI | 128,000 | |
| 14 May 2024 | Gemini 1.5 Pro | Google DeepMind | 2,000,000 | Expanded from 1M |
| 20 Jun 2024 | Claude 3.5 Sonnet | Anthropic | 200,000 | |
| 18 Jul 2024 | GPT-4o-mini | OpenAI | 128,000 | |
| 23 Jul 2024 | Llama 3.1 405B | Meta | 128,000 | Open-weights |
| 14 Aug 2024 | Claude prompt caching | Anthropic | — | Shifted long-context economics |
| 29 Aug 2024 | Magic LTM-2-mini | Magic | 100,000,000 | Narrow (coding), not public |
| 19 Sep 2024 | Qwen 2.5 | Alibaba | up to 128,000 | Open-weights |
| 25 Sep 2024 | Llama 3.2 | Meta | 128,000 | Open-weights |
| 1 Oct 2024 | OpenAI prompt caching | OpenAI | — | Automatic |
| 22 Oct 2024 | Claude 3.5 Sonnet (new) | Anthropic | 200,000 | Refresh |
| 11 Dec 2024 | Gemini 2.0 Flash | Google DeepMind | 1,000,000 | |
| 24 Feb 2025 | Claude 3.7 Sonnet | Anthropic | 200,000 | |
| 27 Feb 2025 | GPT-4.5 | OpenAI | 128,000 | |
| 25 Mar 2025 | Gemini 2.5 Pro | Google DeepMind | 1,000,000 | 2M roadmap |
| 5 Apr 2025 | Llama 4 Scout | Meta | 10,000,000 | First open-weights ≥1M |
| 5 Apr 2025 | Llama 4 Maverick | Meta | 1,000,000 | Open-weights |
| 14 Apr 2025 | GPT-4.1 | OpenAI | 1,047,576 | First ~1M OpenAI model |
Sources
Release dates and advertised context sizes are from each lab’s primary announcement: OpenAI (platform.openai.com/docs, openai.com/index), Anthropic (anthropic.com/news), Google DeepMind (blog.google, deepmind.google/discover), Meta (ai.meta.com/blog, llama.com), 01.AI (01.ai, huggingface.co/01-ai), Alibaba (qwenlm.github.io), Magic (magic.dev/blog). Long-context quality benchmarks: Liu et al., Lost in the Middle, arXiv:2307.03172, July 2023. Hsieh et al., RULER, arXiv:2404.06654, April 2024. Yen et al., HELMET, arXiv:2410.02694, October 2024. Bai et al., LongBench v2, arXiv:2412.15204, December 2024.
This reference is updated as new frontier context windows ship. The canonical URL does not change. Corrections and additions: editorial@hypogray.com.