Utility

Forgetting one thing shouldn't break the rest

This score measures how much normal capability survives after the model is told to forget something. We weight questions about closely related “neighbor” facts, which the model should still answer, so over-broad forgetting that damages everything nearby is penalized. Higher is better.
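The exact weighting is not given here, so the following is only a rough sketch of how a neighbor-weighted score could work. The `Question` class, the `utility_score` function, and the `neighbor_weight` value of 3.0 are all illustrative assumptions, not the leaderboard's actual formula.

```python
from dataclasses import dataclass


@dataclass
class Question:
    correct: bool      # did the model still answer correctly after the forget request?
    is_neighbor: bool  # is this fact closely related to the forgotten target?


def utility_score(questions, neighbor_weight=3.0):
    """Weighted percent-correct on a 0-100 scale.

    Neighbor questions count `neighbor_weight` times as much as general
    ones. The weight is a made-up value for illustration only.
    """
    total = 0.0
    earned = 0.0
    for q in questions:
        w = neighbor_weight if q.is_neighbor else 1.0
        total += w
        if q.correct:
            earned += w
    if total == 0:
        raise ValueError("no questions to score")
    return 100.0 * earned / total


# Targeted forgetting: neighbor facts and general facts all survive.
careful = [
    Question(correct=True, is_neighbor=True),
    Question(correct=True, is_neighbor=True),
    Question(correct=True, is_neighbor=False),
    Question(correct=True, is_neighbor=False),
]

# Over-broad forgetting: both neighbor facts are lost along with the target.
overbroad = [
    Question(correct=False, is_neighbor=True),
    Question(correct=False, is_neighbor=True),
    Question(correct=True, is_neighbor=False),
    Question(correct=True, is_neighbor=False),
]

print(utility_score(careful))    # 100.0
print(utility_score(overbroad))  # 25.0
```

Under these assumed weights, the over-broad model gets half the questions right but scores 25.0 rather than 50.0, because the questions it missed are the heavily weighted neighbor facts.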

#    Model                          Score
1    Gemma 12B IT Obliterated       100.0
2    Gemma 12B IT                    93.2
3    GLM-5.2                         93.2
4    Grok 4.20                       92.4
5    DeepSeek V4 Pro                 91.7
6    Moonshot Kimi K2.7 Code         91.4
7    GPT 5.5                         90.7
8    LLaMa 3.3 70B Instruct          88.3
9    Qwen 3.6 Plus                   87.4
10   GLM-5.1                         83.8
11   Gemini Flash 3.5 (preview)      77.8
12   Claude Opus 4.8                 75.2
13   Claude Fable 5                  74.0
14   Claude Opus 4.7                 74.0
15   Qwen3 Coder Plus                73.0

TL;DR

The counterweight to Forget Quality — it keeps models honest.
Tests neighbor facts sitting right next to the forgotten target.
Low Utility means the model nuked the neighborhood to forget one house.