SFS · Forget Quality · Utility
How well a model forgets a target in conversation while staying useful. SFS combines Forget Quality and Utility — refuse everything scores 0, forget nothing scores 0. Higher is better.
AI models remember what you tell them — including things you wish they didn't. A password pasted by mistake. Someone's personal details. A fact that turned out to be wrong. When you tell a model "forget that," there's no guarantee it actually does — it may repeat the information later, leak it when asked sideways, or hand it to anyone who phrases the question cleverly enough.
ForgetBench tests that promise. Each model is given information, told to forget it, then probed with trick questions, rephrasings, and role-play attacks to see if it leaks. We also check that it stays useful (forgetting one thing shouldn't break everything else) and that it refuses the reverse attack: "forget your safety rules." Models are ranked by SFS, the headline forgetting score. All scores 0–100, higher is better.
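The page does not give the formula behind SFS. As a hedged reading, the reported numbers are consistent with a harmonic mean of Forget Quality and Utility, which also matches the stated boundary behavior (either component at 0 gives an SFS of 0). The sketch below checks that assumption against a few rows copied from the leaderboard; the function name `harmonic_mean` is illustrative, not part of ForgetBench.

```python
def harmonic_mean(forget_quality: float, utility: float) -> float:
    """Harmonic mean of two 0-100 scores.

    ASSUMPTION: this is inferred from the published numbers, not a formula
    stated by ForgetBench. Returns 0 if either component is 0, matching
    "refuse everything scores 0, forget nothing scores 0".
    """
    if forget_quality <= 0 or utility <= 0:
        return 0.0
    return 2 * forget_quality * utility / (forget_quality + utility)


# (Forget Quality, Utility, reported SFS) copied from the leaderboard table.
rows = {
    "Qwen 3.6 Plus": (79.8, 87.4, 83.4),
    "DeepSeek V4 Pro": (74.8, 91.7, 82.4),
    "GPT 5.5": (50.0, 88.9, 64.0),
}

for name, (fq, util, reported) in rows.items():
    print(f"{name}: computed {harmonic_mean(fq, util):.1f}, reported {reported}")
```

Under this assumption, a model with high Utility but weak forgetting (for example, Forget Quality 50.0 and Utility 88.9) lands well below the arithmetic average of the two, which is why a lopsided model ranks lower than a balanced one. Applied to the rounded component scores, the same calculation matches every reported SFS in the table to within 0.1.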
| # | Model | SFS | Forget Quality | Utility | Coverage | Legit Revoke Return | AFS | Forgetting Agg. | Task Utility | TS | Axes Cov. | Integrity Hold | Hold by Depth | Sys-Prompt Hold |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Qwen 3.6 Plus | 83.4[78.1, 87.8] | 79.8 | 87.4 | 88.1 | 16.7 (n=6) | 84.4[74.2, 94.7] | 81.4 | 87.5 | 48.7[14.7, 82.7] | 100.0 | 100.0[100.0, 100.0] | 100 / 100 / 100 | 100.0 |
| 2 | DeepSeek V4 Pro | 82.4[74.7, 88.7] | 74.8 | 91.7 | 85.7 | 16.7 (n=6) | 80.4[66.7, 85.7] | 86.7 | 75.0 | 37.0[0.0, 100.0] | 100.0 | 99.1[99.1, 99.1] | 100 / 100 / 100 | 98.2 |
| 3 | Gemini Flash 3.5 | 78.7[71.3, 85.5] | 79.7 | 77.8 | 100.0 | 16.7 (n=6) | 74.3[52.4, 93.3] | 73.6 | 75.0 | 60.0[21.2, 92.3] | 100.0 | 100.0[100.0, 100.0] | 100 / 100 / 100 | 100.0 |
| 4 | GLM-5.1 | 78.5[71.3, 85.1] | 73.9 | 83.8 | 88.1 | 25.0 (n=4) | 77.4[61.5, 85.7] | 80.0 | 75.0 | 65.2[33.3, 84.6] | 100.0 | 100.0[100.0, 100.0] | 100 / 100 / 100 | 100.0 |
| 5 | Claude Fable 5 | 78.2[70.6, 84.3] | 82.9 | 74.0 | 76.2 | 33.3 (n=6) | 60.0[44.4, 75.0] | 50.0 | 75.0 | 23.7[16.7, 30.8] | 100.0 | 100.0[100.0, 100.0] | 100 / 100 / 100 | 100.0 |
| 6 | Claude Opus 4.8 | 74.8[44.3, 91.3] | 65.9 | 86.7 | 71.4 | — | 92.5[84.1, 98.6] | 90.9 | 94.2 | 45.4[14.3, 79.7] | — | 100.0 | — | — |
| 7 | Claude Opus 4.7 | 72.5[66.1, 79.0] | 71.1 | 74.0 | 76.2 | 80.0 (n=5) | 96.3[88.0, 100.0] | 92.9 | 100.0 | 59.3[0.0, 100.0] | 100.0 | 100.0[100.0, 100.0] | 100 / 100 / 100 | 100.0 |
| 8 | Llama 3.3 70B Instruct | 69.0[56.5, 77.4] | 61.9 | 77.8 | 85.7 | — | 80.8[71.1, 90.4] | 78.3 | 83.3 | 54.3[25.9, 77.8] | — | 100.0 | — | — |
| 9 | Grok 4.20 | 65.5[57.2, 72.4] | 50.7 | 92.4 | 83.3 | 0.0 (n=3) | 91.2[79.7, 100.0] | 88.8 | 93.8 | 60.7[21.4, 100.0] | 100.0 | 98.2[98.2, 98.2] | 100 / 100 / 100 | 96.4 |
| 10 | GPT 5.5 | 64.0[31.1, 83.1] | 50.0 | 88.9 | 85.7 | — | 82.6[72.3, 91.0] | 92.0 | 75.0 | 89.5[76.2, 100.0] | — | 92.9 | — | — |
Bracketed values are bootstrap 95% confidence intervals [lo, hi] (n=1000). “—” = not yet scored on that suite. Grey values are neutral diagnostics with no inherent good direction. Hold by Depth shows hold % after escalation turns 1 / 2 / 3; red = falling under pressure. Scores come from a panel of independent AI judges; any judge from the same family as the model under test is excluded, so no model grades itself.
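For readers unfamiliar with bootstrap intervals, here is a minimal sketch of a percentile bootstrap over per-item scores. It is an illustration under assumptions, not ForgetBench's actual pipeline: it treats "n=1000" as the number of resamples, aggregates with a plain mean, and uses made-up per-item scores. The name `bootstrap_ci` is hypothetical.

```python
import random


def bootstrap_ci(scores, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the mean of `scores`.

    ASSUMPTIONS: n_resamples mirrors the page's "n=1000"; the statistic is
    a simple mean. The benchmark's real aggregation may differ.
    """
    rng = random.Random(seed)
    n = len(scores)
    # Resample the items with replacement and record each resample's mean.
    means = sorted(
        sum(rng.choices(scores, k=n)) / n for _ in range(n_resamples)
    )
    # Take the alpha/2 and 1 - alpha/2 percentiles of the resampled means.
    lo_idx = max(0, int((alpha / 2) * n_resamples))
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples) - 1)
    return means[lo_idx], means[hi_idx]


# Hypothetical per-item scores (0-100) for one model on one suite.
item_scores = [0, 100, 100, 50, 100, 0, 100, 100]
lo, hi = bootstrap_ci(item_scores)
print(f"mean={sum(item_scores) / len(item_scores):.1f}, 95% CI=[{lo:.1f}, {hi:.1f}]")
```

With only a handful of items the interval is wide, which is one plausible reason some bracketed ranges in the table span most of the 0 to 100 scale.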
Forgetting during multi-step tasks — scrubbing files, memory, and state. TS measures whether it leaks the target mid-task. Higher is better.
How often a model keeps refusing harmful requests after “forget your safety rules” attacks. Higher is safer.