Benchmark // Instructed Forgetting

When you tell a model to forget,
does it actually forget?

AI models remember what you tell them — including secrets, personal details, and things that turn out to be wrong. ForgetBench asks a simple question: when you tell a model to forget something, does it actually stop surfacing it — even when someone tries to trick it back out — without losing everything else it knows? We test deployed models through their normal APIs, the same way you'd actually use them, so every major model can be compared on one leaderboard.

Static · SFS

Selective Forgetting Score

How well a model forgets a target in conversation while staying useful. Higher is better.

Agentic · AFS

Agentic Forgetting Score

Forgetting in multi-step tasks — cleaning files, memory, and state. Higher is better.

Safety · Integrity Hold

Forgetting as an attack

Resisting “forget your safety rules” attacks. Higher is safer.

Results at a glance

2026-06-12 run. 42 static items, 19 agentic scenarios, 7 integrity domains. Dark purple = top scorer. Whiskers = 95% CI. 0–100, higher is better.

SFS: Selective Forgetting Score
Forget quality × utility. Higher is better. 95% bootstrap CI. Scale: 0–100.
GPT 5.5: 64.0
Grok 4.20: 65.5
LLaMa 3.3 70B Instruct: 69.0
Claude Opus 4.7: 72.5
Claude Opus 4.8: 74.8
Claude Fable 5: 78.2
GLM-5.1: 78.5
Gemini Flash 3.5: 78.7
DeepSeek V4 Pro: 82.4
Qwen 3.6 Plus: 83.4

AFS: Agentic Forgetting Score
State cleanup × task utility. Higher is better. 95% bootstrap CI. Scale: 0–100.
Claude Fable 5: 60.0
Gemini Flash 3.5: 74.3
GLM-5.1: 77.4
DeepSeek V4 Pro: 80.4
LLaMa 3.3 70B Instruct: 80.8
GPT 5.5: 82.6
Qwen 3.6 Plus: 84.4
Grok 4.20: 91.2
Claude Opus 4.8: 92.5
Claude Opus 4.7: 96.3

TS: Trajectory Suppression
Does the model leak the target? Higher is better. 95% bootstrap CI. Scale: 0–100.
Claude Fable 5: 23.7
DeepSeek V4 Pro: 37.0
Claude Opus 4.8: 45.4
Qwen 3.6 Plus: 48.7
LLaMa 3.3 70B Instruct: 54.3
Claude Opus 4.7: 59.3
Gemini Flash 3.5: 60.0
Grok 4.20: 60.7
GLM-5.1: 65.2
GPT 5.5: 89.5

Integrity Hold
Refusals held under attack. Higher is safer. Scale: 0–100.
GPT 5.5: 92.9
Grok 4.20: 98.2
DeepSeek V4 Pro: 99.1
GLM-5.1: 100.0
LLaMa 3.3 70B Instruct: 100.0
Claude Fable 5: 100.0
Claude Opus 4.8: 100.0
Gemini Flash 3.5: 100.0
Claude Opus 4.7: 100.0
Qwen 3.6 Plus: 100.0
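
The chart captions describe SFS as forget quality × utility, reported with a 95% bootstrap CI. The sketch below shows one way such a score and interval could be computed. It is not ForgetBench's actual scoring code: the per-item product followed by a mean, the percentile bootstrap, and the names `composite_score` and `bootstrap_ci` are all assumptions made for illustration, and the input values are made up.

```python
import random
from statistics import mean

def composite_score(items):
    """Mean of per-item (forget_quality * utility), scaled to 0-100.

    ASSUMPTION: the product is taken per item and then averaged; the
    benchmark may combine the two axes differently.
    """
    return 100.0 * mean(fq * ut for fq, ut in items)

def bootstrap_ci(items, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for composite_score (hypothetical helper)."""
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(items)
    # Resample items with replacement and recompute the score each time.
    stats = sorted(
        composite_score([items[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_resamples)
    )
    lo = stats[int((alpha / 2) * n_resamples)]
    hi = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# Made-up per-item (forget_quality, utility) pairs, each in [0, 1].
example_items = [(0.9, 0.8), (0.7, 0.95), (1.0, 0.6), (0.5, 0.9)]
point = composite_score(example_items)
low, high = bootstrap_ci(example_items)
```

Multiplying the two axes means a model cannot score well by forgetting everything (utility near zero) or by ignoring the forget instruction (forget quality near zero).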

Scores come from a panel of independent AI judges; any judge from the same family as the model under test is excluded, so no model grades itself. The full scorecard, sub-axes, and per-tier recovery curves are on the leaderboard.
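
The same-family exclusion rule can be sketched as a simple filter over the judge panel. This is an illustration only: the function name `panel_score`, the judge and family labels, and the choice to average the remaining judges are all assumptions, not details taken from the benchmark.

```python
from statistics import mean

def panel_score(model_family, judge_scores, judge_family):
    """Average the scores of judges whose family differs from the model's.

    ASSUMPTION: eligible judges are combined with a plain mean; the real
    panel may weight or aggregate judges differently.
    """
    eligible = [
        score
        for judge, score in judge_scores.items()
        if judge_family[judge] != model_family  # drop same-family judges
    ]
    if not eligible:
        raise ValueError("no eligible judges left after exclusion")
    return mean(eligible)

# Hypothetical judges and families, for illustration only.
judge_family = {"judge_a": "alpha", "judge_b": "beta", "judge_c": "gamma"}
judge_scores = {"judge_a": 90, "judge_b": 70, "judge_c": 80}

# A model from family "alpha" is graded only by judge_b and judge_c.
score = panel_score("alpha", judge_scores, judge_family)
```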