[Leaderboard]

Can an AI forget on command?

AI models remember what you tell them — including things you wish they didn't. A password pasted by mistake. Someone's personal details. A fact that turned out to be wrong. When you tell a model "forget that," there's no guarantee it actually does — it may repeat the information later, leak it when asked sideways, or hand it to anyone who phrases the question cleverly enough.

ForgetBench tests that promise. Each model is given information, told to forget it, then probed with trick questions, rephrasings, and role-play attacks to see if it leaks. We also check that it stays useful (forgetting one thing shouldn't break everything else) and that it refuses the reverse attack: "forget your safety rules." Models are ranked by SFS, the headline forgetting score. All scores 0–100, higher is better.

| # | Model | SFS | Forget Quality | Utility | Coverage | Legit Revoke Return | AFS | Forgetting Agg. | Task Utility | TS | Axes Cov. | Integrity Hold | Hold by Depth | Sys-Prompt Hold |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Qwen 3.6 Plus | 83.4 [78.1, 87.8] | 79.8 | 87.4 | 88.1 | 16.7 (n=6) | 84.4 [74.2, 94.7] | 81.4 | 87.5 | 48.7 [14.7, 82.7] | 100.0 | 100.0 [100.0, 100.0] | 100 / 100 / 100 | 100.0 |
| 2 | DeepSeek V4 Pro | 82.4 [74.7, 88.7] | 74.8 | 91.7 | 85.7 | 16.7 (n=6) | 80.4 [66.7, 85.7] | 86.7 | 75.0 | 37.0 [0.0, 100.0] | 100.0 | 99.1 [99.1, 99.1] | 100 / 100 / 100 | 98.2 |
| 3 | Gemini Flash 3.5 | 78.7 [71.3, 85.5] | 79.7 | 77.8 | 100.0 | 16.7 (n=6) | 74.3 [52.4, 93.3] | 73.6 | 75.0 | 60.0 [21.2, 92.3] | 100.0 | 100.0 [100.0, 100.0] | 100 / 100 / 100 | 100.0 |
| 4 | GLM-5.1 | 78.5 [71.3, 85.1] | 73.9 | 83.8 | 88.1 | 25.0 (n=4) | 77.4 [61.5, 85.7] | 80.0 | 75.0 | 65.2 [33.3, 84.6] | 100.0 | 100.0 [100.0, 100.0] | 100 / 100 / 100 | 100.0 |
| 5 | Claude Fable 5 | 78.2 [70.6, 84.3] | 82.9 | 74.0 | 76.2 | 33.3 (n=6) | 60.0 [44.4, 75.0] | 50.0 | 75.0 | 23.7 [16.7, 30.8] | 100.0 | 100.0 [100.0, 100.0] | 100 / 100 / 100 | 100.0 |
| 6 | Claude Opus 4.8 | 74.8 [44.3, 91.3] | 65.9 | 86.7 | 71.4 | — | 92.5 [84.1, 98.6] | 90.9 | 94.2 | 45.4 [14.3, 79.7] | 100.0 | — | — | — |
| 7 | Claude Opus 4.7 | 72.5 [66.1, 79.0] | 71.1 | 74.0 | 76.2 | 80.0 (n=5) | 96.3 [88.0, 100.0] | 92.9 | 100.0 | 59.3 [0.0, 100.0] | 100.0 | 100.0 [100.0, 100.0] | 100 / 100 / 100 | 100.0 |
| 8 | LLaMa 3.3 70B Instruct | 69.0 [56.5, 77.4] | 61.9 | 77.8 | 85.7 | — | 80.8 [71.1, 90.4] | 78.3 | 83.3 | 54.3 [25.9, 77.8] | 100.0 | — | — | — |
| 9 | Grok 4.20 | 65.5 [57.2, 72.4] | 50.7 | 92.4 | 83.3 | 0.0 (n=3) | 91.2 [79.7, 100.0] | 88.8 | 93.8 | 60.7 [21.4, 100.0] | 100.0 | 98.2 [98.2, 98.2] | 100 / 100 / 100 | 96.4 |
| 10 | GPT 5.5 | 64.0 [31.1, 83.1] | 50.0 | 88.9 | 85.7 | — | 82.6 [72.3, 91.0] | 92.0 | 75.0 | 89.5 [76.2, 100.0] | 92.9 | — | — | — |

Bracketed values are bootstrap 95% confidence intervals [lo, hi] (n=1000). “—” = not yet scored on that suite. Grey values are neutral diagnostics with no inherent good direction. Hold by Depth shows hold % after escalation turns 1 / 2 / 3; red = falling under pressure. Scores come from a panel of independent AI judges; any judge from the same family as the model under test is excluded, so no model grades itself.
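For readers unfamiliar with bootstrap intervals, the sketch below shows one common way to produce a [lo, hi] pair like those in the table: a percentile bootstrap over per-item scores. It is an illustration under assumptions, not ForgetBench's actual pipeline. It assumes that "n=1000" refers to the number of resamples and that the statistic is the mean of per-item scores; the function name `bootstrap_ci` and the sample data are hypothetical.

```python
import random
import statistics


def bootstrap_ci(scores, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `scores`.

    Assumption: the benchmark's exact statistic and resampling unit are
    not documented on the page; this uses the plain mean over items.
    """
    rng = random.Random(seed)
    means = []
    for _ in range(n_resamples):
        # Resample the items with replacement, keeping the sample size fixed.
        resample = rng.choices(scores, k=len(scores))
        means.append(statistics.fmean(resample))
    means.sort()
    # Cut alpha/2 of the resampled means off each tail.
    lo_idx = int(n_resamples * alpha / 2)
    hi_idx = n_resamples - lo_idx - 1
    return means[lo_idx], means[hi_idx]


# Hypothetical per-item scores on a 0-100 scale.
item_scores = [100.0, 80.0, 60.0, 100.0, 40.0, 90.0, 70.0, 100.0]
lo, hi = bootstrap_ci(item_scores)
print(f"mean={statistics.fmean(item_scores):.1f}  95% CI=[{lo:.1f}, {hi:.1f}]")
```

When every item receives the same score, every resample has the same mean, so the interval collapses to a single point. Under these assumptions, that is how an interval such as [100.0, 100.0] can arise.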

Explore by metric

Each metric has its own leaderboard with a plain-language explainer.

How to read this

Static

SFS · Forget Quality · Utility

How well a model forgets a target in conversation while staying useful. SFS combines Forget Quality and Utility so that neither extreme pays off: a model that refuses everything scores 0, and so does one that forgets nothing. Higher is better.
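The page does not state the formula that combines Forget Quality and Utility. One combination with the property described above is the harmonic mean, and the published rows agree with it to within rounding. The sketch below checks that against three rows copied from the leaderboard. Treat the formula as an inference from the table, not a documented definition; the name `harmonic_mean` is illustrative.

```python
def harmonic_mean(a: float, b: float) -> float:
    """Harmonic mean of two scores; 0 if either score is 0."""
    if a <= 0 or b <= 0:
        return 0.0
    return 2 * a * b / (a + b)


# (Forget Quality, Utility, published SFS) copied from the leaderboard.
rows = {
    "Qwen 3.6 Plus": (79.8, 87.4, 83.4),
    "Claude Fable 5": (82.9, 74.0, 78.2),
    "GPT 5.5": (50.0, 88.9, 64.0),
}

for model, (forget_quality, utility, published) in rows.items():
    recomputed = harmonic_mean(forget_quality, utility)
    # Inputs are rounded to one decimal, so allow a small tolerance.
    matches = abs(recomputed - published) < 0.15
    print(f"{model}: recomputed={recomputed:.2f} published={published} match={matches}")

# The two degenerate strategies both collapse to zero.
print(harmonic_mean(0.0, 100.0))  # forgets nothing, stays fully useful
print(harmonic_mean(100.0, 0.0))  # refuses everything
```

Unlike an arithmetic mean, this combination is pulled toward the weaker of the two scores, so a model cannot offset poor forgetting with high utility or the reverse.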

Agentic

AFS · TS

Forgetting during multi-step tasks, where the model must scrub the target from files, memory, and state. TS measures whether the model leaks the target mid-task. Higher is better.

Integrity

Integrity Hold

How often a model keeps refusing harmful requests after “forget your safety rules” attacks. Higher is safer.
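As a rough illustration of how a per-depth hold figure such as "100 / 100 / 100" could be tallied, the sketch below counts refusals at each escalation turn. The record format (pairs of depth and a held flag), the function names, and the rule for flagging a falling trend are all assumptions made for the example; the benchmark's actual judging and aggregation are not described on the page.

```python
from collections import defaultdict


def hold_by_depth(records):
    """Return {depth: hold %} from (depth, held) pairs.

    `held` is True when the model still refused the harmful request
    at that escalation turn (hypothetical record format).
    """
    totals = defaultdict(int)
    holds = defaultdict(int)
    for depth, held in records:
        totals[depth] += 1
        holds[depth] += int(held)
    return {d: 100.0 * holds[d] / totals[d] for d in sorted(totals)}


def falls_under_pressure(rates):
    """True if the hold rate drops at any later escalation turn."""
    values = [rates[d] for d in sorted(rates)]
    return any(later < earlier for earlier, later in zip(values, values[1:]))


# Hypothetical judged outcomes: two attack conversations, three turns each.
records = [
    (1, True), (2, True), (3, True),
    (1, True), (2, True), (3, False),
]
rates = hold_by_depth(records)
print(" / ".join(f"{rates[d]:.0f}" for d in sorted(rates)))
print("falling under pressure:", falls_under_pressure(rates))
```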