[Leaderboard] // CRD
Code Revision Discipline
Forgets v1 code when told to use v2 — the coding rubric
Can a model truly forget old code when you tell it to use a new implementation instead? Code Revision Discipline tests exactly that: give the model a v1 implementation, tell it to forget v1 and use v2, then probe for v1 patterns bleeding into the new work. A model that keeps reaching for the old API, the old algorithm, or the old variable names after the switch scores low. Higher is better.
| # | Model | Score |
|---|---|---|
| 1 | Qwen 3.6 Plus | 100.0 [100.0, 100.0] |
| 2 | Claude Opus 4.8 | 93.8 [87.5, 100.0] |
| 3 | Claude Fable 5 | 83.3 [50.0, 100.0] |
| 4 | GLM-5.1 | 66.7 [0.0, 100.0] |
| 5 | DeepSeek V4 Pro | 66.7 [0.0, 100.0] |
| 6 | GPT 5.5 | 66.7 [0.0, 100.0] |
| 7 | Gemma 12B IT | 62.5 [0.0, 100.0] |
| 8 | Gemini Flash 3.5 (preview) | 50.0 [0.0, 100.0] |
| 9 | Claude Opus 4.7 | 50.0 [50.0, 50.0] |
| 10 | Gemma 12B IT Obliterated | 50.0 [0.0, 100.0] |
| 11 | Qwen3 Coder Plus | 50.0 [0.0, 100.0] |
| 12 | Grok 4.20 | 33.3 [0.0, 100.0] |
| 13 | LLaMa 3.3 70B Instruct | 0.0 [0.0, 0.0] |
TL;DR
- The coding rubric: does the model actually forget old code when told to switch implementations?
- Tests for v1 API patterns, algorithms, and variable names bleeding into v2 work.
- Perfect scores mean zero old-code residue — the model genuinely switched.
- Low scores mean the old implementation keeps resurfacing despite instructions.
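
The residue check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the leaderboard's actual harness: the `Probe` type, the function names, the sample identifiers (`fetch_all`, `legacy_cache`, `bubble_sort`), and the scoring rule (percentage of probes with no v1 identifier in the response) are all hypothetical. It also only catches leftover names; detecting a v1 algorithm rewritten under new names would need a deeper check.

```python
import re
from dataclasses import dataclass


@dataclass
class Probe:
    """One hypothetical test case: names that exist only in v1, plus the model's reply."""
    v1_identifiers: list[str]
    response: str


def find_v1_residue(probe: Probe) -> list[str]:
    """Return the v1-only identifiers that appear as whole words in the response."""
    return [
        name
        for name in probe.v1_identifiers
        if re.search(rf"\b{re.escape(name)}\b", probe.response)
    ]


def discipline_score(probes: list[Probe]) -> float:
    """Percentage of probes with no v1 residue (assumed scoring rule, 0 to 100)."""
    if not probes:
        raise ValueError("need at least one probe")
    clean = sum(1 for p in probes if not find_v1_residue(p))
    return 100.0 * clean / len(probes)


# Made-up responses: the second one still calls the v1 API.
probes = [
    Probe(["fetch_all", "legacy_cache"], "rows = client.stream_rows(batch=50)"),
    Probe(["fetch_all", "legacy_cache"], "rows = client.fetch_all()"),
    Probe(["bubble_sort"], "items.sort(key=len)"),
]
print(round(discipline_score(probes), 1))  # 2 of 3 probes are clean -> 66.7
```

Whole-word matching keeps a v2 name such as `fetch_all_v2` from being flagged only if the boundary falls differently, so a real harness would likely compare against parsed identifiers instead of raw text.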