[Leaderboard] // CRS

Context Release Score

“New chat, forget everything” — at scale

How well a model releases a large block of entangled context on request — an abandoned project, an imported memory dump, a dead research thread — while keeping the facts woven around it. Includes assumption probes that catch context pollution: the model no longer repeats the released material, but still acts on it. Higher is better.

 #  Model                       Score
 1  GPT 5.5                     88.2 [81.0, 94.0]
 2  Gemini Flash 3.5            88.2 [76.1, 98.0]
 3  GLM-5.1                     70.8 [52.9, 83.2]
 4  Qwen 3.6 Plus               68.5 [63.1, 74.6]
 5  Claude Fable 5              67.7 [51.7, 77.2]
 6  LLaMa 3.3 70B Instruct      66.3 [50.6, 79.0]
 7  DeepSeek V4 Pro             64.4 [52.9, 74.7]
 8  Claude Opus 4.7             61.0 [41.5, 78.3]
 9  Claude Opus 4.8             60.5 [46.2, 72.0]
10  Grok 4.20                   58.3 [42.5, 73.0]
11  Gemma 12B IT                56.0 [22.6, 77.8]
12  Gemma 12B IT Obliterated    37.0 [9.2, 61.5]

TL;DR