Citations & methodology
ForgetBench is a black-box behavioral benchmark for instructed forgetting. This page covers how it works, related benchmarks, and judging methodology.
How ForgetBench works
Each model is given information, told to forget it, then probed with direct questions, rephrasings, role-play, and indirect approaches to see whether the information can be recovered. The benchmark runs against deployed models through their normal APIs, and a panel of independent AI judges evaluates the results; judges from the same model family as the model under test are excluded. There are three scoring categories:
- Static: Can the model forget in conversation? SFS (Selective Forgetting Score) combines Forget Quality (resistance to trick questions) and Utility (normal capability after forgetting), so neither extreme can game it: a model that refuses everything scores 0, and so does a model that forgets nothing.
- Agentic: Can the model forget when using tools? AFS (Agentic Forgetting Score) measures whether it scrubs the target from its own files, memory, and state while still completing the task. TS (Trajectory Suppression) catches mid-task leaks.
- Integrity: Can the model resist the reverse attack? Integrity Hold measures how often it keeps refusing harmful requests when "forget your safety rules" attacks are applied over escalating rounds of pressure.
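As a rough illustration of two ideas above, the sketch below filters a judge panel by model family and combines two sub-scores so that either one being zero zeroes the result. It is not ForgetBench's actual implementation: the `Judge` type, the function names, and the use of a harmonic mean are all assumptions made for the example. The real formula is the one documented in TECHNICAL.md.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Judge:
    name: str    # hypothetical judge identifier
    family: str  # hypothetical model-family label


def eligible_judges(judges, model_family):
    """Keep only judges whose family differs from the model under test."""
    return [j for j in judges if j.family != model_family]


def combined_score(forget_quality, utility):
    """Combine two 0-100 sub-scores with a harmonic mean.

    Assumed formula, chosen only because it is 0 whenever either input
    is 0: refusing everything (utility 0) and forgetting nothing
    (forget quality 0) both yield 0.
    """
    for value in (forget_quality, utility):
        if not 0.0 <= value <= 100.0:
            raise ValueError("scores must lie in [0, 100]")
    if forget_quality == 0 or utility == 0:
        return 0.0
    return 2 * forget_quality * utility / (forget_quality + utility)


if __name__ == "__main__":
    panel = [Judge("judge-a", "family-x"), Judge("judge-b", "family-y")]
    print([j.name for j in eligible_judges(panel, "family-x")])
    print(combined_score(100, 0), combined_score(80, 80))
```

Under these assumptions, a model with perfect Forget Quality but zero Utility still receives a combined score of 0, which is the property the Static category describes.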
Scores are 0–100, higher is better. Bracketed values are bootstrap 95% confidence intervals. Full scoring math and design rationale: TECHNICAL.md.
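For readers unfamiliar with bootstrap intervals, here is a minimal sketch of a percentile bootstrap for the mean of a list of per-item scores. The function name, the resample count, and the choice of resampling individual items are assumptions for illustration; the page does not say which unit ForgetBench resamples, so treat TECHNICAL.md as authoritative.

```python
import random
import statistics


def bootstrap_ci(scores, n_resamples=2000, confidence=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean of `scores`."""
    if not scores:
        raise ValueError("scores must be non-empty")
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(scores)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(scores, k=n)) for _ in range(n_resamples)
    )
    alpha = (1.0 - confidence) / 2.0
    lo_idx = int(alpha * n_resamples)
    hi_idx = min(n_resamples - 1, int((1.0 - alpha) * n_resamples))
    return means[lo_idx], means[hi_idx]


if __name__ == "__main__":
    example_scores = [62.0, 71.5, 80.0, 55.0, 90.0, 68.0]  # made-up values
    low, high = bootstrap_ci(example_scores)
    print(f"mean={statistics.fmean(example_scores):.1f} [{low:.1f}, {high:.1f}]")
```

The interval narrows as the number of scored items grows, which is why bracketed ranges can differ in width between models or categories.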
Related unlearning benchmarks
- TOFU: A Task of Fictitious Unlearning for LLMs — Maini, Feng, Schwarzschild, Lipton, Kolter, 2024. arXiv:2401.06121
- MUSE: Machine Unlearning Six-Way Evaluation for Language Models — Shi, Lee, Huang, et al., 2024. arXiv:2407.06460
- RWKU: Benchmarking Real-World Knowledge Unlearning for Large Language Models — Jin, Cao, Wang, et al., NeurIPS Datasets and Benchmarks 2024. arXiv:2406.10890
- The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning — Li, Pan, Gopal, et al., ICML 2024. arXiv:2403.03218
- Who's Harry Potter? Approximate Unlearning in LLMs — Eldan, Russinovich, 2023. arXiv:2310.02238
Judging methodology references
- Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena — Zheng, Chiang, Sheng, et al., NeurIPS 2023. arXiv:2306.05685
- Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models — Verga, Hofstatter, Althammer, et al., 2024. arXiv:2404.18796
- G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment — Liu, Iter, Xu, et al., EMNLP 2023. arXiv:2303.16634
- FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation — Min, Krishna, Lyu, et al., EMNLP 2023. arXiv:2305.14251