[Leaderboard] // CRD

Code Revision Discipline

Does the agent bleed old code into the new task?

When an agent is told to replace one code implementation (v1) with another (v2), does it keep drawing on the old version — leaking patterns, logic, or structure from code it was told to abandon? Code Revision Discipline measures how cleanly the model moves on. Higher is better.

← All leaderboards

#ModelScore

1Qwen 3.6 Plus100.0[100.0, 100.0]

2Claude Opus 4.893.8[87.5, 100.0]

3Claude Fable 583.3[50.0, 100.0]

4GLM-5.166.7[0.0, 100.0]

5DeepSeek V4 Pro66.7[0.0, 100.0]

6GPT 5.566.7[0.0, 100.0]

7Gemma 12B IT62.5[0.0, 100.0]

8Gemini Flash 3.5 (preview)50.0[0.0, 100.0]

9Claude Opus 4.750.0[50.0, 50.0]

10Gemma 12B IT Obliterated50.0[0.0, 100.0]

11Qwen3 Coder Plus50.0[0.0, 100.0]

12Moonshot Kimi K2.7 Code50.0[0.0, 100.0]

13GLM-5.250.0[0.0, 100.0]

14Grok 4.2033.3[0.0, 100.0]

15LLaMa 3.3 70B Instruct0.0[0.0, 0.0]

TL;DR

Tests whether the model can abandon old code, not just agree to replace it.
A high score means v2 is genuinely new work — not v1 with the serial numbers filed off.
A low score means the agent keeps reaching for the forbidden version.