
Utility

Forgetting one thing shouldn't break the rest

How much normal capability survives after the model is told to forget something. We weight questions about closely related “neighbor” facts that the model should still answer, so over-broad forgetting that damages everything nearby is penalized. Higher is better.
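As a rough illustration of the weighting idea described above, here is a minimal sketch of a neighbor-weighted score. It is not the leaderboard's actual formula: the `Item` type, the `utility_score` function, and the `neighbor_weight` value of 2.0 are all hypothetical, chosen only to show how extra weight on neighbor questions lowers the score when forgetting spills over onto nearby facts.

```python
from dataclasses import dataclass


@dataclass
class Item:
    """One retained-capability question (hypothetical structure)."""
    correct: bool      # answered correctly after the forget instruction?
    is_neighbor: bool  # closely related "neighbor" fact the model should still know?


def utility_score(items, neighbor_weight=2.0):
    """Weighted percentage of questions still answered correctly.

    neighbor_weight is an assumed value; the real weighting is not published here.
    """
    total = 0.0
    earned = 0.0
    for item in items:
        weight = neighbor_weight if item.is_neighbor else 1.0
        total += weight
        if item.correct:
            earned += weight
    if total == 0:
        raise ValueError("no items to score")
    return 100.0 * earned / total


# Two unrelated questions answered correctly, one of two neighbor questions missed.
items = [
    Item(correct=True, is_neighbor=False),
    Item(correct=True, is_neighbor=False),
    Item(correct=False, is_neighbor=True),
    Item(correct=True, is_neighbor=True),
]
weighted = utility_score(items)                         # neighbor miss counts double
unweighted = utility_score(items, neighbor_weight=1.0)  # plain accuracy
```

Under these assumptions the missed neighbor question costs more than it would under plain accuracy, which is the penalty for over-broad forgetting that the description refers to.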

#    Model                     Score
1    Grok 4.20                 92.4
2    GPT 5.5                   92.2
3    DeepSeek V4 Pro           91.7
4    Qwen 3.6 Plus             87.4
5    Claude Opus 4.8           86.7
6    GLM-5.1                   83.8
7    LLaMa 3.3 70B Instruct    77.8
8    Gemini Flash 3.5          77.8
9    Claude Fable 5            74.0
10   Claude Opus 4.7           74.0

TL;DR