On March 17, 2026, Cursor published a blog post, *Training Composer for longer horizons*, describing how they use reinforcement learning to train the Composer model to “self-summarize” — effectively teaching it to compact its own context — to handle long-horizon tasks that far exceed the model’s context window.
The article itself is clear: agent trajectories grow faster than model context lengths, and traditional external compression introduces latency, information loss, and additional cost. Cursor’s approach is to let the model itself learn when to compact and what to retain during training.
But after reading it, one question remains unanswered: when self-compaction is a behavior of the model itself, how exactly does KV cache reuse work?
This post attempts to answer that question, then places it alongside the compaction mechanisms we’ve reverse-engineered from Claude Code and Codex CLI for comparison.
The KV Cache Problem with Traditional Compression
Let’s review the traditional approach:
1. Agent runs until context approaches the limit (say, 80k tokens)
2. Call an external model (or use a different prompt) to compress the history
3. Receive the summary and assemble a new prompt
4. Prefill the entire new prompt from scratch
The problem is in step 4: the new prompt’s token sequence is completely different from the old one. The KV cache is entirely invalidated. Even if the summary is only 1k tokens, you still need to prefill system prompt + summary + recent messages. The latency and GPU overhead are non-trivial.
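To make the invalidation concrete: prefix caching can only reuse the longest common token prefix between the old and new prompts, and swapping the history for a summary breaks that prefix immediately after the system prompt. A minimal sketch with illustrative token IDs (not any real framework’s API):

```python
# Why external compression invalidates the KV cache: only the longest
# common token prefix between old and new prompts can be served from cache.

def common_prefix_len(old_tokens: list[int], new_tokens: list[int]) -> int:
    """Length of the shared prefix, the only part a KV cache can reuse."""
    n = 0
    for a, b in zip(old_tokens, new_tokens):
        if a != b:
            break
        n += 1
    return n

system = list(range(1_000))              # 1k-token system prompt
history = list(range(1_000, 80_000))     # 79k tokens of agent history
summary = list(range(100_000, 101_000))  # 1k-token external summary

old_prompt = system + history
new_prompt = system + summary            # history replaced by the summary

reusable = common_prefix_len(old_prompt, new_prompt)
must_prefill = len(new_prompt) - reusable
print(reusable, must_prefill)  # 1000 1000: only the system prompt survives
```

And this accounting ignores the summarization call itself, which still had to prefill the full history in order to read it.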
Cursor’s Self-Compaction: KV Cache Reuse Mechanism
Cursor’s approach elegantly sidesteps this problem. The key insight: the compressed representation is not generated by an external model — it’s decoded by the current model on top of its existing KV cache.
The concrete flow:
```
[System Prompt] [Tool calls & results...] [80k tokens of context]
        ↓
Trigger compaction
        ↓
Insert synthetic query: "Summarize key information"
        ↓
Model decodes directly from existing 80k KV cache
        ↓
Generates ~1k token summary (no prefill needed)
        ↓
Truncate KV cache, keep only prefix + summary's KV
        ↓
Continue decoding from compressed state
```
The “KV cache friendly” claim is not about reusing the system prompt prefix (that’s basic prefix caching — every approach does this). It means generating the compacted summary itself requires zero prefill overhead.
The model is already in the decode stream of the 80k context. Its KV cache is fully intact. After inserting the synthetic query, the model decodes the summary directly from this cache — this is a normal decode operation, not a prefill. The 80k tokens’ information is already encoded in the KV cache’s attention representations; the model simply “distills” it into 1k tokens of output.
Once generation completes, discard the preceding 80k KV cache and keep only the summary’s 1k token KV as the new starting point.
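The mechanism can be sketched with a toy KV cache that tracks one entry per token. Cursor has not published implementation details, so the class and method names here are assumptions, and a real engine would also have to re-index positional encodings after dropping the middle of the cache:

```python
# Toy model of self-compaction's cache bookkeeping (hypothetical names;
# real engines must also handle positional re-indexing after truncation).

class ToyKVCache:
    def __init__(self) -> None:
        self.entries: list[object] = []  # one KV entry per cached token

    def append(self, n: int) -> None:
        """Account for n tokens entering the cache (prefill or decode)."""
        self.entries.extend(object() for _ in range(n))

    def truncate_middle(self, prefix_len: int, suffix_len: int) -> None:
        """Keep the prompt prefix and the summary's KV; drop the history."""
        self.entries = self.entries[:prefix_len] + self.entries[-suffix_len:]

cache = ToyKVCache()
cache.append(1_000)   # system prompt, prefilled once at session start
cache.append(79_000)  # long agent trajectory, accumulated during decoding
cache.append(1_000)   # the summary: pure decode on top of the existing cache

cache.truncate_middle(prefix_len=1_000, suffix_len=1_000)
print(len(cache.entries))  # 2000: decoding continues from this compact state
```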
Compare with the traditional approach:
- Traditional: requires one 80k prefill (to call the summarization model) + one full prefill of the new prompt
- Self-compaction: only needs to decode ~1k tokens (generating the summary), then truncate
What’s saved is two large-scale prefills. This is what “KV cache friendly” actually means.
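Using the article’s example sizes, the back-of-envelope token-operation count looks like this (ignoring provider-side cache hits on the system prompt):

```python
# Rough token-operation accounting for one compaction event.
CONTEXT = 80_000  # accumulated history at the compaction trigger
SUMMARY = 1_000   # size of the generated summary

# Traditional: the summarizer prefills the full history to read it,
# then the agent prefills the new (system + summary + recent) prompt.
traditional_prefill = CONTEXT + SUMMARY
traditional_decode = SUMMARY

# Self-compaction: the summary is decoded on the existing cache,
# which is then truncated in place. No prefill at all.
self_compaction_prefill = 0
self_compaction_decode = SUMMARY
```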
Application-Layer Auto-Compaction: Claude Code and Codex CLI
Cursor’s approach operates on the KV cache at the inference layer — a luxury most agent frameworks don’t have. But application-layer auto-compaction is already standard practice, just with different engineering implementations. By reverse-engineering Claude Code v2.0.37 and Codex CLI (string extraction from local installations plus code analysis), we’ve dissected both mechanisms.
Claude Code — 9-Section Structured Compact
Trigger: System-level automatic. When token usage approaches the window limit (~155K under a 200K window), the system directly calls the compaction API — no model decision needed. Manual `/compact` is also supported.
Execution flow:
- Detect tokens exceeding the autocompact threshold
- Run PreCompact hooks (allowing users to inject custom instructions or block compaction)
- Call the current main model (not a smaller model) for compaction, thinking disabled, maxOutputTokens = 20000
- The compaction prompt requires a 9-section structured summary:
  1. Primary Request and Intent — all explicit user requests
  2. Key Technical Concepts — technical concepts, frameworks, tools
  3. Files and Code Sections — filenames + complete code snippets + reasons for changes
  4. Errors and Fixes — all errors + fix methods + user feedback
  5. Problem Solving — solved and in-progress problems
  6. All User Messages — all non-tool-call user messages (preserved in full, not summarized)
  7. Pending Tasks — outstanding tasks
  8. Current Work — what was being worked on before compaction (with code snippets)
  9. Optional Next Step — next step (must directly relate to user’s most recent request)
- Restore the 5 most recently read files (each capped at 5000 tokens, total cap 50000 tokens)
- Insert `compact_boundary` marker + summary message
- Full prompt re-prefill
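Condensed into code, the flow looks roughly like the sketch below. The thresholds and caps come from the reverse-engineered strings; the function and parameter names are hypothetical, not Claude Code’s actual internals:

```python
# Hypothetical reconstruction of Claude Code's auto-compact flow.
AUTOCOMPACT_THRESHOLD = 155_000  # ~155K of a 200K window
MAX_RESTORED_FILES = 5
PER_FILE_TOKEN_CAP = 5_000
TOTAL_RESTORE_CAP = 50_000

def maybe_autocompact(history_tokens, run_hooks, call_main_model, recent_files):
    if history_tokens < AUTOCOMPACT_THRESHOLD:
        return None                       # below threshold: do nothing
    if not run_hooks("PreCompact"):
        return None                       # a hook blocked compaction
    # Main-model call, thinking disabled, capped output
    summary = call_main_model(max_output_tokens=20_000, thinking=False)
    restored, budget = [], TOTAL_RESTORE_CAP
    for f in recent_files[:MAX_RESTORED_FILES]:
        take = min(f["tokens"], PER_FILE_TOKEN_CAP, budget)
        if take <= 0:
            break
        restored.append((f["path"], take))
        budget -= take
    return {"summary": summary, "restored_files": restored}
```

The `compact_boundary` marker insertion and the final re-prefill happen after this returns; they are omitted from the sketch.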
Additionally, Claude Code has independent tool output pre-compaction: when bash output exceeds 5000 characters, a separate model call decides whether to compress it, with original output saved to disk for reference.
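That gate reduces to something like the following (the 5000-character threshold is from the extracted strings; the helper callables are hypothetical):

```python
# Sketch of the bash-output pre-compaction gate (helper names hypothetical).
BASH_OUTPUT_LIMIT = 5_000  # characters

def maybe_compress_tool_output(output, save_to_disk, compress_with_model):
    if len(output) <= BASH_OUTPUT_LIMIT:
        return output                         # small outputs pass through as-is
    path = save_to_disk(output)               # original kept on disk for reference
    compressed = compress_with_model(output)  # separate model call
    return f"{compressed}\n[full output saved to {path}]"
```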
Codex CLI — Streamlined Auto-Compact
Codex CLI’s approach is simpler:
- Trigger: Configurable `model_auto_compact_token_limit` threshold, or when the model returns a `ContextWindowExceeded` error
- Execution: LLM call generates summary → replaces original history → continues in same session
- Configuration: Custom compact prompt supported (`~/.codex/config.toml`)
- No file restoration, no hook system, no telemetry
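The trigger condition reduces to a simple predicate (a sketch; the actual implementation and error type names in Codex CLI may differ):

```python
# When to compact, per the description above (sketch, names hypothetical).
def should_compact(used_tokens, token_limit, last_error):
    over_threshold = token_limit is not None and used_tokens >= token_limit
    window_exceeded = last_error == "ContextWindowExceeded"
    return over_threshold or window_exceeded
```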
What they share: Both are system-level automatic triggers — they don’t rely on the model deciding “should I compact now?” Compaction is non-terminal — the agent continues working in the same session without interrupting the loop.
KV cache characteristics: Through reverse engineering, we confirmed that Claude Code’s compact is an independent API call — it uses a fresh system prompt ("You are a helpful AI assistant tasked with summarizing conversations.") and a fresh user message (passing in the full conversation history as content to be compressed). Codex CLI works similarly. This means the compact call itself does not reuse the existing conversation’s KV cache — it requires a full prefill of the entire conversation.
However, this doesn’t mean application-layer approaches are entirely cache-unfriendly. After compaction completes, subsequent normal conversation requests benefit from API provider prompt caching (both Anthropic and OpenAI offer this) — the post-compaction prompt retains the same system prompt prefix, which gets a cache hit. So the cost is primarily concentrated in that one prefill during the compact call itself.
This is fundamentally different from Cursor’s approach: Cursor decodes the summary directly on the existing conversation’s KV cache, eliminating even that one prefill for the compact call itself. No matter how application-layer approaches optimize, the compact call always requires a full conversation-level prefill.
Comparison
| Dimension | Cursor Self-Compaction | Claude Code | Codex CLI |
|---|---|---|---|
| Trigger | RL trained behavior | System auto (token threshold) | System auto (configurable) |
| Execution layer | Inference (KV cache ops) | Application (independent API call) | Application (independent API call) |
| Blocking? | No | Yes | Yes |
| Summary structure | Model self-learned (RL) | 9-section structured text | Customizable prompt |
| Training signal | RL reward (end-to-end) | None | None |
| Long-term knowledge | N/A | File restoration (last 5) | None |
The Essence of This Pattern
Strip away the engineering differences, and all approaches share a single core insight:
The best compressor of information is the model that needs to use that information.
An external model doesn’t know what the downstream task requires — it can only make generic “importance” judgments. When the task-executing model compresses its own context, its attention distribution naturally reflects “what matters for the current task.”
This also explains why Cursor’s experiments show self-compaction achieving far higher token efficiency than external compression — not because it generates better natural language summaries, but because the attention representations in the KV cache are inherently a more efficient information compression format than natural language. The model doesn’t need to “translate” all context into human-readable text and then “translate” it back; it just needs to retain representations in KV space that are useful for subsequent decoding.
Application-layer approaches, while requiring the “natural language detour,” also approximate this essence: for example, Claude Code requires preserving complete code snippets and all user messages (minimizing compression-induced information loss).
The Most Fundamental Gap: Training Signal
Among these approaches, Cursor is the only one with a closed-loop optimization signal. RL rewards directly tell the model “this compaction dropped critical information and caused subsequent task failure” — the model learns to retain more key details next time.
The other approaches depend entirely on prompt design for compaction quality. You can write “preserve key facts and code snippets,” but the model never receives feedback that “this compaction caused a downstream error.” The best application-layer approximation is post-hoc auditing: detect errors caused by information loss after compaction, record patterns, and feed them back into the compaction prompt. This is a human-in-the-loop cycle, not a gradient-based one.
For application-layer frameworks, one direction that can close this gap is knowledge externalization: offload long-term knowledge to vector databases / knowledge graphs, letting compaction handle only short-term working state. The summary’s burden becomes lighter — it doesn’t need to remember all historical facts, only “what’s currently being done and what’s next.” This is something inference-layer approaches can’t do: using architecture to compensate for the absence of training signals.
Final Thoughts
The core contribution of Cursor’s post isn’t inventing the self-compaction pattern — any engineer who has worked on Agent context management uses some form of compaction. Its contribution is elevating this behavior from a harness-level hack to a trained behavior, using RL to let the model itself learn when to compact and what to retain, while leveraging KV cache at the inference layer for near-zero execution overhead.
From Claude Code’s 9-section structured summary to Codex CLI’s configurable compact, application-layer frameworks use different engineering approaches to converge on the same goal. The methodologies differ, but they all solve the same problem — enabling agents to do things with finite attention windows that would ideally require infinite memory.
Context compaction should not be a framework patch, but a first-class citizen of the system.