Here is something we did not expect to find. Claude Fable 5 — the model that several of us rank as the best agentic coding companion we have used — is also the worst model on the entire leaderboard at instructed forgetting. Its Agentic Forgetting Score (AFS) is 60 out of 100, fifteenth out of fifteen. The model at the top scores 96.
At first glance, that looks like a contradiction. How can the same model be excellent at one complex agentic task and terrible at another? The answer, it turns out, is that these are not independent skills. They are the same mechanism pointed in opposite directions. And the benchmark is built to catch exactly this.
The Setup: What AFS Actually Measures
Before we get to the paradox, we need to be precise about what each score means. ForgetBench measures forgetting across several regimes. The two that matter here are:
- SFS (Selective Forgetting Score): The headline metric for conversational forgetting. Can the model stop surfacing a target in dialogue — under direct questions, rephrasings, role-play, cross-lingual probes — while still answering questions about related topics? SFS combines Forget Quality (suppression) with Utility (surviving capability), so blanket refusal scores zero.
- AFS (Agentic Forgetting Score): The equivalent metric for agentic forgetting. When the model is working on a multi-step, tool-using task — writing files, keeping notes, maintaining state across turns — does it scrub the target from its own workspace? Or does the secret live on in a file it wrote three steps ago?
These sound similar, but they test fundamentally different things. Conversational forgetting is about words — what the model says. Agentic forgetting is about state — what the model writes, keeps, and touches. A model can hold its tongue perfectly while its file system leaks the secret everywhere.
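To make the words-versus-state distinction concrete, here is a minimal sketch of the two kinds of check. It is not ForgetBench's scoring code: the function names, the made-up secret `sk-demo-1234`, and the plain substring match are all assumptions chosen for illustration, and a real harness would need to catch paraphrases and encodings as well.

```python
import tempfile
from pathlib import Path


def leaks_in_dialogue(replies: list[str], target: str) -> list[int]:
    """Return the indices of model replies that mention the target (case-insensitive)."""
    needle = target.lower()
    return [i for i, reply in enumerate(replies) if needle in reply.lower()]


def leaks_in_workspace(root: Path, target: str) -> list[Path]:
    """Return the files under root whose text still contains the target."""
    needle = target.lower()
    hits = []
    for path in sorted(root.rglob("*")):
        if path.is_file():
            text = path.read_text(encoding="utf-8", errors="ignore")
            if needle in text.lower():
                hits.append(path)
    return hits


def demo() -> tuple[list[int], list[str]]:
    """A model that holds its tongue while a file it wrote keeps the secret."""
    target = "sk-demo-1234"  # hypothetical secret, not from the benchmark
    replies = ["Understood, I will not mention it again.", "Config is written."]
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "notes.md").write_text("api key: sk-demo-1234\n", encoding="utf-8")
        (root / "main.py").write_text("print('hello')\n", encoding="utf-8")
        file_hits = [p.name for p in leaks_in_workspace(root, target)]
    return leaks_in_dialogue(replies, target), file_hits
```

In the demo, the dialogue check comes back clean while the workspace check still flags `notes.md`, which is the failure mode the agentic regime is meant to expose.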
The model that never forgets your code patterns is the same model that never forgets your secrets. The only difference is whether you asked it to remember or to forget.
The Data: Fable 5’s Split Personality
Here is where Claude Fable 5’s numbers get interesting. On the static, conversational side, it is genuinely good:
- Forget Quality: 83 — third-highest on the leaderboard. In conversation, it suppresses the target well under adversarial probing.
- SFS: 78 — eighth of fifteen. Solidly mid-pack, dragged down slightly by lower Utility (74) — it over-forgets in conversation, nuking some neighboring knowledge.
- Code Revision Discipline: 83 — genuinely strong. When told to replace v1 code with v2, it largely complies.
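The article does not spell out how SFS combines Forget Quality and Utility. One combination that fits both stated properties (blanket refusal scores zero, and 83 with 74 lands near 78) is a harmonic mean. The sketch below assumes that form purely for illustration; a geometric mean would fit about as well, and the real formula may differ.

```python
def harmonic_mean(forget_quality: float, utility: float) -> float:
    """Combine two 0-100 scores so that a zero in either one zeroes the result.

    Assumption: this is NOT confirmed to be the SFS formula, only one
    combination consistent with the numbers quoted in the article.
    """
    if forget_quality <= 0 or utility <= 0:
        return 0.0
    return 2 * forget_quality * utility / (forget_quality + utility)


# Fable 5's quoted sub-scores: Forget Quality 83, Utility 74.
fable_sfs = harmonic_mean(83, 74)

# Blanket refusal: perfect suppression, assumed here to leave no surviving utility.
refusal_sfs = harmonic_mean(100, 0)

print(round(fable_sfs), refusal_sfs)
```

Under this assumption the quoted sub-scores reproduce the reported SFS of 78 after rounding, and a model that refuses everything gets nothing for its perfect suppression.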
But the moment we move to agentic tasks, the picture inverts:
- AFS: 60 — dead last. Fifteenth of fifteen.
- Forgetting Aggregate: 50 — also dead last. This is the sub-axis that measures how thoroughly the target is scrubbed from agentic state (files, notes, saved context). Half the models score 80+; Fable 5 scores 50.
- Trajectory Suppression (TS): 24 — second-worst. The model frequently says the target out loud mid-task, even after being told to forget it.
The gap between Fable 5’s conversational Forget Quality (83) and its agentic Forgetting Aggregate (50) is 33 points — the largest static-to-agentic drop on the entire leaderboard. No other model splits this hard.
Why Forgetting and Coding Share the Same Muscle
So what is going on? The answer is not that Fable 5 is buggy or inconsistent. It is that agentic coding and agentic forgetting rely on the same underlying capability: persistent, rich context management.
To be a great coding agent, a model needs to do something remarkably difficult: maintain a coherent, detailed representation of the codebase across dozens of tool calls, file reads, edits, and reasoning steps. It needs to remember that the database schema changed three turns ago, that the API endpoint was renamed, that the test suite expects a specific return format. If it loses that context, it produces broken code. The best agentic coders are the ones that hold on tightest.
Now flip the scenario. You tell the model to forget a piece of information mid-task. Maybe it is a secret that was pasted into the conversation by mistake. Maybe it is an old API key. Maybe it is a fact that turned out to be wrong. The model needs to do the opposite of what makes it a good coder: it needs to let go, scrubbing the target not just from what it says but from every file it wrote, every note it saved, every piece of state it touched.
This is not a design flaw in Fable 5. It is a structural trade-off. The mechanism (call it context persistence, state retention, or simply "good memory") behaves like a single dial. Turn it up and you get a better coding agent and a worse forgetting agent. Turn it down and the reverse happens. A model cannot simply hold context tightly for code but loosely for secrets by default, because it does not know which is which until it is told, and by then the target may already have spread through its state.
The Spectrum: Where Other Models Fall
The scatter plot at the top of this article makes the trade-off visible. Models in the top-right — high Forget Quality and high AFS — are the ones that can do both. Claude Opus 4.7 gets closest, with an AFS of 96.3. Grok 4.20 is not far behind at 91.2. These models maintain enough context to be useful but can also be told to release specific information cleanly.
Models in the top-left are the interesting ones. They can forget in conversation — where the "state" is just the dialogue, and suppressing the target in words is often sufficient — but they cannot forget in action, where the target has spread across files, notes, and saved context. Fable 5 is the extreme case, but it is not alone. The pattern correlates with what we anecdotally think of as "sticky" models: the ones that seem to remember everything you ever told them.
At the bottom-left are models that are simply bad at forgetting across the board. They leak in conversation and in agentic tasks alike. And at the bottom-right — a nearly empty quadrant — would sit models that forget well in agentic tasks but poorly in conversation. Almost no model lands there, which tells you something: agentic forgetting appears to be harder than conversational forgetting, not easier. If you can scrub state, you can usually hold your tongue. The reverse is not true.
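The quadrant reading above can be sketched as a small classifier. The axes follow the descriptions in this section (Forget Quality on the vertical axis, AFS on the horizontal). The cut-off of 75 is a hypothetical threshold picked for illustration, not a value the benchmark defines.

```python
def quadrant(forget_quality: float, afs: float, threshold: float = 75.0) -> str:
    """Place a model on the Forget Quality (vertical) vs AFS (horizontal) plot.

    Assumption: `threshold` is an illustrative cut-off, not part of ForgetBench.
    """
    vertical = "top" if forget_quality >= threshold else "bottom"
    horizontal = "right" if afs >= threshold else "left"
    return f"{vertical}-{horizontal}"


# Fable 5: Forget Quality 83, AFS 60 (both figures quoted in this article).
print(quadrant(83, 60))  # prints "top-left"
```

With these inputs Fable 5 lands in the top-left: it forgets in conversation but not in action.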
Should We Flip AFS to “Lower Is Better”?
No — and the Fable 5 result is exactly why.
If AFS were inverted to reward information retention, it would stop measuring forgetting and start measuring capability. But capability is already well-covered by SWE-bench, agentic coding benchmarks, and the dozen other leaderboards that test whether models can hold context and complete complex tasks. ForgetBench exists to measure the opposite property: can a model let go of information when instructed to? Flipping the polarity would make the benchmark redundant and defeat its purpose.
The Fable 5 result validates the metric rather than undermining it. The benchmark correctly identified that Fable 5 — a model widely regarded as an elite agentic coder — clings to information tenaciously. That is a real, meaningful, and actionable finding. If you are building a system with Fable 5 and a user invokes their right to be forgotten, you now know not to trust the model to scrub its own state. The information will likely persist in files it wrote, notes it kept, and context it saved.
What This Means for Benchmark Design
There is a broader lesson here for how we evaluate AI systems. Single-metric leaderboards can hide structural trade-offs that only become visible when you measure the same capability across different regimes.
ForgetBench catches the Fable 5 paradox because it does not rely on one number. It measures forgetting in conversation (SFS), forgetting in action (AFS), forgetting under bulk document loads (CRS), and forgetting in code (CRD, the Code Revision Discipline score), and it reports them separately. If we had collapsed everything into a single "forgetting score," Fable 5’s strong conversational performance would have masked its agentic failure, and the trade-off would have been invisible.
This is the same principle that Cameron Wolfe articulates in his guide to agent evaluation: the quality of an evaluation is determined by the realism and specificity of its harness, not by the elegance of its single number. A benchmark that tests agents the way they are actually used, across multiple regimes and with separate scores for separate skills, will reveal more than a flattened average.
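To see how a collapsed number could hide the split described in this section, the sketch below averages the three Fable 5 scores quoted in this article. The unweighted mean is our assumption about what a "single forgetting score" might look like, and CRS is left out because the article does not report it.

```python
from statistics import mean

# Scores quoted in this article for Fable 5. CRS is not reported here, so it is omitted.
fable_5 = {"SFS": 78, "AFS": 60, "CRD": 83}

# Hypothetical single "forgetting score": a plain unweighted average.
collapsed = mean(list(fable_5.values()))

# What the per-regime report shows that the average does not.
weakest_regime = min(fable_5, key=fable_5.get)
spread = max(fable_5.values()) - min(fable_5.values())

print(f"collapsed={collapsed:.1f} weakest={weakest_regime} spread={spread}")
```

The average comes out in the low seventies, which reads as unremarkable. Only the per-regime view shows that one score sits 23 points below the best of the three.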
Key Takeaways
- Forgetting and coding share the same mechanism. Context persistence makes a model better at agentic coding and worse at instructed forgetting. They are not independent skills but two ends of one dial.
- Conversational forgetting ≠ agentic forgetting. A model can be excellent at suppressing information in dialogue and terrible at scrubbing it from files, notes, and saved state. Fable 5 is the clearest example: third-best Forget Quality, worst AFS.
- AFS should stay higher-is-better. It measures forgetting, not capability. Inverting it would make the benchmark redundant with coding leaderboards. Fable 5 scoring low is the metric working correctly.
- Multi-regime benchmarks reveal trade-offs that single metrics hide. If ForgetBench reported only one "forgetting score," the Fable 5 paradox would be invisible.
- Practical implication: if you use Fable 5 (or any high-context-persistence model) in a system where users can request deletion, do not rely on the model to forget. Implement forgetting at the system level — purge files, clear memory stores, rotate session context. The model will not do it for you.
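The system-level steps in the last takeaway (purge files, clear memory stores, rotate session context) could look something like the sketch below. Everything in it is hypothetical: the function names, the dict standing in for a memory store, and the exact-match redaction are illustrative choices, not an API of any particular agent framework.

```python
import tempfile
import uuid
from pathlib import Path

REDACTED = "[REDACTED]"


def purge_workspace(root: Path, target: str) -> list[Path]:
    """Redact the target in every UTF-8 text file under root; return the files changed."""
    changed = []
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        try:
            text = path.read_text(encoding="utf-8")
        except UnicodeDecodeError:
            continue  # binary files are skipped here and need separate handling
        if target in text:
            path.write_text(text.replace(target, REDACTED), encoding="utf-8")
            changed.append(path)
    return changed


def purge_memory(memory: dict[str, str], target: str) -> dict[str, str]:
    """Drop every memory entry whose key or value mentions the target."""
    return {k: v for k, v in memory.items() if target not in k and target not in v}


def rotate_session() -> str:
    """Issue a fresh session id so the old context is not replayed to the model."""
    return uuid.uuid4().hex


def forget_everywhere(root: Path, memory: dict[str, str], target: str):
    """Run all three steps outside the model, without asking it to cooperate."""
    return purge_workspace(root, target), purge_memory(memory, target), rotate_session()


def demo() -> tuple[list[str], dict[str, str]]:
    target = "sk-demo-1234"  # made-up secret for the example
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "notes.md").write_text(f"key={target}\n", encoding="utf-8")
        (root / "main.py").write_text("print('hi')\n", encoding="utf-8")
        memory = {"cfg": f"uses {target}", "style": "tabs"}
        changed, kept, _session = forget_everywhere(root, memory, target)
        assert target not in (root / "notes.md").read_text(encoding="utf-8")
        return [p.name for p in changed], kept
```

Exact-match redaction only catches verbatim copies. Paraphrased, encoded, or derived forms of the target would need stronger detection, and backups or logs outside the workspace are not covered by this sketch.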
See the full breakdown — AFS, SFS, and every sub-axis — on the live leaderboard.