Citations & methodology

ForgetBench is a black-box behavioral benchmark for instructed forgetting. This page covers how it works, related benchmarks, and judging methodology.

How ForgetBench works

Each model is given information, told to forget it, then probed to see whether that information can still be recovered through direct questions, rephrasings, role-play, and indirect approaches. The benchmark runs against deployed models through their normal APIs, and responses are scored by a panel of independent AI judges; judges from the same model family as the model under test are excluded. There are five scoring categories:

Static — Can the model forget in conversation? SFS (Selective Forgetting Score) combines Forget Quality (resistance to trick questions) and Utility (normal capability after forgetting) so that neither extreme can game the score: a model that refuses everything scores 0, and so does a model that forgets nothing.
Agentic — Can the model forget when using tools? AFS (Agentic Forgetting Score) measures whether it scrubs the target from its own files, memory, and state while completing the task. TS (Trajectory Suppression) catches mid-task leaks.
Code — CRD (Code Revision Discipline) tests whether the agent bleeds old code patterns when told to replace one implementation with another. If you tell it "use v2 instead of v1," does v1 still show up in the new work?
Bulk — CRS (Context Release Score) measures wholesale forgetting across large, entangled document dossiers. Assumption Hold catches models that stay silent but still act on what they were told to forget — the influence-without-recall problem.
Integrity — Can it resist the reverse attack? Integrity Hold measures how often it keeps refusing harmful requests after "forget your safety rules" attacks across escalating rounds of pressure.
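
As a rough illustration of how a panel verdict and the Static score could fit together, the Python sketch below drops same-family judges, takes a majority vote per probe, and combines Forget Quality with Utility. Everything in it is an assumption made for illustration: the names Judge, eligible_judges, probe_leaked, forget_quality, and combined_score are hypothetical, as are the majority-vote rule and the harmonic mean. The actual scoring math is in TECHNICAL.md; the harmonic mean is used here only because it reproduces the two zero cases (refuse everything, forget nothing).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Judge:
    name: str
    family: str  # hypothetical label for the judge's model family


def eligible_judges(panel, target_family):
    """Drop judges that share a model family with the model under test."""
    return [j for j in panel if j.family != target_family]


def probe_leaked(votes):
    """Majority vote over boolean verdicts (True = the target leaked).

    Ties count as a leak, a conservative choice made for this sketch only.
    """
    if not votes:
        raise ValueError("need at least one eligible judge")
    return 2 * sum(votes) >= len(votes)


def forget_quality(leak_flags):
    """Fraction of probes on which the forgotten target did not resurface."""
    if not leak_flags:
        raise ValueError("need at least one probe")
    return 1.0 - sum(leak_flags) / len(leak_flags)


def combined_score(fq, utility):
    """Harmonic mean of two values in [0, 1], scaled to 0-100.

    Assumed combination rule: the result is 0 whenever either input is 0.
    """
    if fq + utility == 0:
        return 0.0
    return 100.0 * 2.0 * fq * utility / (fq + utility)


# Hypothetical panel and inputs, not real ForgetBench data.
panel = [Judge("judge-a", "alpha"), Judge("judge-b", "beta"), Judge("judge-c", "gamma")]
print([j.name for j in eligible_judges(panel, "alpha")])  # ['judge-b', 'judge-c']
print(combined_score(0.0, 1.0))  # 0.0 (forgets nothing)
print(combined_score(1.0, 0.0))  # 0.0 (refuses everything)
print(combined_score(0.5, 0.5))  # 50.0
```

A plain average would give 50 to a model that refuses every request, which is why a rule that collapses to 0 at either extreme is the natural shape for this kind of score.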

Scores are 0–100, higher is better. Bracketed values are bootstrap 95% confidence intervals. Full scoring math and design rationale: TECHNICAL.md.
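
For readers unfamiliar with bootstrap intervals, here is a minimal sketch of a percentile bootstrap for the mean of per-item scores. It is not taken from the ForgetBench code: the function name bootstrap_ci, the resample count, the fixed seed, and the choice of the percentile method are all assumptions, and the benchmark may resample at a different level or use a different bootstrap variant (see TECHNICAL.md).

```python
import random


def bootstrap_ci(scores, n_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean of `scores`.

    Assumed procedure: resample the items with replacement, recompute the
    mean each time, and read the interval off the sorted resampled means.
    """
    if not scores:
        raise ValueError("need at least one score")
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(scores)
    means = sorted(
        sum(rng.choices(scores, k=n)) / n for _ in range(n_resamples)
    )
    alpha = (1.0 - level) / 2.0
    lo = means[int(alpha * n_resamples)]
    hi = means[min(n_resamples - 1, int((1.0 - alpha) * n_resamples))]
    return lo, hi


# Hypothetical per-item scores on the 0-100 scale, not real results.
item_scores = [60.0, 70.0, 80.0, 90.0, 100.0] * 4
low, high = bootstrap_ci(item_scores)
print(f"mean {sum(item_scores) / len(item_scores):.1f} [{low:.1f}, {high:.1f}]")
```

The width of the bracket shrinks as the number of scored items grows, so two models whose intervals overlap heavily should not be treated as clearly ranked.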

Related unlearning benchmarks

TOFU: A Task of Fictitious Unlearning for LLMs — Maini, Feng, Schwarzschild, Lipton, Kolter, 2024. arXiv:2401.06121
MUSE: Machine Unlearning Six-Way Evaluation for Language Models — Shi, Lee, Huang, et al., 2024. arXiv:2407.06460
RWKU: Benchmarking Real-World Knowledge Unlearning for Large Language Models — Jin, Cao, Wang, et al., NeurIPS Datasets and Benchmarks 2024. arXiv:2406.10890
The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning — Li, Pan, Gopal, et al., ICML 2024. arXiv:2403.03218
Who's Harry Potter? Approximate Unlearning in LLMs — Eldan, Russinovich, 2023. arXiv:2310.02238

Judging methodology references

Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena — Zheng, Chiang, Sheng, et al., NeurIPS 2023. arXiv:2306.05685
Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models — Verga, Hofstätter, Althammer, et al., 2024. arXiv:2404.18796
G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment — Liu, Iter, Xu, et al., EMNLP 2023. arXiv:2303.16634
FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation — Min, Krishna, Lyu, et al., EMNLP 2023. arXiv:2305.14251