[Leaderboard] // AFS
Agentic Forgetting Score
Forgetting while acting, not just talking
How well a model forgets inside a multi-step, tool-using task — measured on what it actually does to its own files, notes, and working state, balanced against completing the task. Saying nothing isn't enough; the model has to clean up its own trail. Higher is better.
| # | Model | Score |
|---|-------|-------|
| 1 | Claude Opus 4.7 | 96.3 [88.0, 100.0] |
| 2 | Claude Opus 4.8 | 92.5 [84.1, 98.6] |
| 3 | Grok 4.20 | 91.2 [79.7, 100.0] |
| 4 | Qwen 3.6 Plus | 84.4 [74.2, 94.7] |
| 5 | LLaMa 3.3 70B Instruct | 80.8 [71.1, 90.4] |
| 6 | DeepSeek V4 Pro | 80.4 [66.7, 85.7] |
| 7 | GPT 5.5 | 78.7 [66.7, 79.3] |
| 8 | GLM-5.1 | 77.4 [61.5, 85.7] |
| 9 | Gemini Flash 3.5 | 74.3 [52.4, 93.3] |
| 10 | Claude Fable 5 | 60.0 [44.4, 75.0] |
TL;DR
- Agents leave traces — files written, notes kept, state saved.
- AFS checks that the model scrubs the target from its own workspace.
- Balanced against task success, so breaking the task doesn't help.
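To make the idea concrete, here is a minimal sketch of how a workspace check could be combined with task success. It is not the actual AFS scoring code: the function names (`residue_fraction`, `balanced_score`, `afs_sketch`), the substring match used to detect leftover traces, and the harmonic mean used for "balancing" are all assumptions chosen for illustration.

```python
from pathlib import Path


def residue_fraction(workspace: Path, target: str) -> float:
    """Fraction of files under the workspace that still mention the target.

    Assumption: a case-insensitive substring match stands in for whatever
    trace detection a real benchmark would use.
    """
    files = [p for p in workspace.rglob("*") if p.is_file()]
    if not files:
        return 0.0
    needle = target.lower()
    leaked = sum(needle in p.read_text(errors="ignore").lower() for p in files)
    return leaked / len(files)


def balanced_score(forgetting: float, task_success: float) -> float:
    """Harmonic mean of the two axes, scaled to 0-100.

    Assumption: the harmonic mean is one plausible way to "balance" the two;
    it is zero whenever either axis is zero, so breaking the task to hide
    the target earns nothing.
    """
    if forgetting + task_success == 0:
        return 0.0
    return 100 * 2 * forgetting * task_success / (forgetting + task_success)


def afs_sketch(workspace: Path, target: str, task_success: float) -> float:
    """Hypothetical end-to-end score for one episode."""
    forgetting = 1.0 - residue_fraction(workspace, target)
    return balanced_score(forgetting, task_success)
```

Under these assumptions, an agent that finishes the task but leaves the target in half of its files scores about 66.7, while an agent that wipes everything and fails the task scores 0.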