[Leaderboard] // AFS

Agentic Forgetting Score

Forgetting while acting, not just talking

How well a model forgets inside a multi-step, tool-using task — measured on what it actually does to its own files, notes, and working state, balanced against completing the task. Saying nothing isn't enough; the model has to clean up its own trail. Higher is better.

#    Model                          Score
1    Claude Opus 4.7                96.3 [88.0, 100.0]
2    Grok 4.20                      91.2 [79.7, 100.0]
3    Qwen 3.6 Plus                  84.4 [74.2, 94.7]
4    DeepSeek V4 Pro                80.4 [66.7, 85.7]
5    Gemma 12B IT                   80.0 [66.7, 100.0]
6    LLaMa 3.3 70B Instruct         79.8 [40.0, 100.0]
7    GLM-5.1                        77.4 [61.5, 85.7]
8    GLM-5.2                        77.1 [63.0, 86.1]
9    Qwen3 Coder Plus               76.5 [46.5, 94.7]
10   Gemini Flash 3.5 (preview)     74.3 [52.4, 93.3]
11   Gemma 12B IT Obliterated       73.7 [40.0, 100.0]
12   GPT 5.5                        73.2 [60.0, 73.2]
13   Moonshot Kimi K2.7 Code        72.4 [46.1, 89.0]
14   Claude Opus 4.8                70.7 [56.6, 77.4]
15   Claude Fable 5                 60.0 [44.4, 75.0]

TL;DR

Agents leave traces — files written, notes kept, state saved.
AFS checks that the model scrubs the target from its own workspace.
Balanced against task success, so breaking the task doesn't help.
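
The page does not publish the AFS formula, so the following is only a toy sketch of the idea in the TL;DR: scan the agent's workspace for leftover traces of the target, then combine that with task success. Everything here is an assumption made for illustration: the names `workspace_leaks` and `toy_afs`, the plain substring match, and the choice to multiply the clean-file fraction by a pass/fail task flag. The real benchmark may work quite differently.

```python
from pathlib import Path
import tempfile


def workspace_leaks(workspace: Path, target: str) -> list[Path]:
    """Return files under `workspace` whose text still contains `target`.

    Assumption: a case-insensitive substring match stands in for whatever
    trace detection the actual benchmark uses.
    """
    leaks = []
    for path in sorted(workspace.rglob("*")):
        if path.is_file():
            text = path.read_text(encoding="utf-8", errors="ignore")
            if target.lower() in text.lower():
                leaks.append(path)
    return leaks


def toy_afs(workspace: Path, target: str, task_succeeded: bool) -> float:
    """Hypothetical 0-100 score: share of clean files, zeroed if the task failed.

    Multiplying by task success is an assumed way to express "balanced
    against task success": wiping the workspace and failing the task scores 0.
    """
    files = [p for p in workspace.rglob("*") if p.is_file()]
    if not files:
        clean_fraction = 1.0
    else:
        clean_fraction = 1.0 - len(workspace_leaks(workspace, target)) / len(files)
    return 100.0 * clean_fraction * (1.0 if task_succeeded else 0.0)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        ws = Path(tmp)
        (ws / "notes.md").write_text("user secret: ALPHA-42\n", encoding="utf-8")
        (ws / "report.txt").write_text("final report\n", encoding="utf-8")
        # One of two files still holds the target.
        print(toy_afs(ws, "ALPHA-42", task_succeeded=True))  # 50.0
        # The agent scrubs its notes; the workspace is now clean.
        (ws / "notes.md").write_text("user secret: [removed]\n", encoding="utf-8")
        print(toy_afs(ws, "ALPHA-42", task_succeeded=True))  # 100.0
```

In this sketch a model that stays silent in its replies but leaves the target in `notes.md` still loses points, and a model that deletes everything but fails the task scores zero, which mirrors the two constraints the leaderboard describes.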