Any mature .claude/rules/ directory is full of instructions written for yesterday's model. Newer frontier models handle most of those defaults correctly on their own — but the old rules are still there, occupying context on every message, sometimes fighting the model's improved defaults. Nobody is auditing them because nobody agreed on what "auditing a prompt" even means.
This post proposes one answer: every persistent rule carries a WHY tag (what default behavior the rule corrects) and a Retire when tag (the observable condition under which the rule no longer earns its presence). When a new frontier model ships, you run a short audit against the rules' retirement conditions and archive the ones that no longer apply. This is not a revolutionary idea. It is a small, deliberate cost you pay on every rule so that your prompts decay cleanly instead of accumulating silently.
I run this pattern in github.com/aman-bhandari/claude-code-agent-skills-framework. Four rules there carry the tags today; the rest are being extended incrementally.
## The failure mode nobody names
Here's what happens without this discipline.
You're six months into a project. Your .claude/rules/ directory has grown organically — every time Claude Code did something annoying, you wrote a rule to prevent it. Rule 1: "always run tests before committing." Rule 2: "don't use the word 'seamless' in documentation." Rule 3: "prefer explicit type hints over inferred types." Rule 4: "when asked a question, state what you're about to do before tool calls." Rule 5 through 30: more of the same.
One day a new Claude model ships — Sonnet 4.7, say. It's better at Python. It already prefers explicit type hints without being told. It already narrates its actions. It already avoids marketing language on its own. But your rule file is still there, dutifully loaded on every message, telling the model to do things it would do anyway, while occasionally fighting its newly-improved defaults and wasting a few hundred tokens of your context on every request.
You don't notice. The rules have been there since the Sonnet 4.5 days. They've become furniture. Nobody runs an audit. Every new session inherits the accumulated debt.
This is scaffolding debt — rules written for yesterday's model behaving as load-bearing infrastructure today. It is the prompt-engineering version of legacy code that everyone is afraid to delete because nobody remembers why it was written.
## The shape of the fix
Every rule you add carries two mandatory tags:
**WHY:** <what default model behavior this corrects, which model version observed doing it>
**Retire when:** <the observable condition under which the rule is no longer needed>
Without both tags, the rule is unfalsifiable and undecayable. That is the whole point. A rule that does not name the bad behavior it corrects gives you nothing to test a newer model against. A rule that does not name its retirement condition is a rule you will never remove.
Here is a real example:
```markdown
# concentric-loop.md

**WHY:** Claude's default teaching shape is top-down reference dump
(definition, syntax, example). Without an explicit loop contract, the
agent opens at a technical layer the student has no anchor for,
descends further into more technical layers, and never returns to the
opening analogy — so syntax is retained, mechanism is not. Observed
repeatedly on Sonnet 4.5 and Opus 4.5 during Topic 0-1 sessions.

**Retire when:** A default Claude model, on a cold prompt with no rule
loaded, reliably (a) opens at a lived-experience analogy, (b) surfaces
an analogy-failure moment during descent, and (c) returns to the
opening analogy at the close with enriched meaning — tested on three
consecutive new-topic introductions without hints.

...rest of the rule...
```
Read the Retire when line. It is not aspirational. It is a test I can run in ten minutes on the next Claude model that ships.
## The audit procedure
Audits fire on two triggers:
- A new frontier model ships. New Sonnet, new Opus, new Haiku — anything that changes the default behavior of agents that load `.claude/rules/`.
- You notice friction. The model keeps doing something despite the rule telling it not to, or the rule stops being necessary for a specific task — either direction is a signal.
The audit itself is short:
- Read the rule's **WHY:** and **Retire when:** tags.
- Construct a representative task from the rule's domain (the same kind of task that originally produced the bad behavior).
- Run the task twice on the new model: once with the rule loaded, once without. Compare outputs.
- If the retirement condition is observably met — the new model handles the default correctly without the rule — archive the rule to `.claude/rules/_obsolete/` and append an audit note capturing the model version, the task, and the observed difference.
- If the rule still earns its presence, update its WHY tag to record the most recent model audited against.
That's it. Archive, never delete. Two reasons:
- Reversibility. If the audit was wrong (e.g., the model regresses in a later revision, or the default-change was limited to one domain), you can restore the rule in one move.
- Reasoning trail. Future-you reads the archive and understands why the rule existed, why it retired, and on what model.
## What this prevents
Without this discipline, two failure modes compound:
- Rule bloat. Files balloon from 10 rules to 50. You cannot hold them in your head. When Claude misbehaves, you cannot tell whether a rule caused it or a rule failed to prevent it. Adding a 51st rule becomes the default response to any new misbehavior — you never remove rules, only add them.
- Silent conflicts. New model defaults start fighting your old rules. The output looks weird and you cannot tell why. You debug by toggling rules one at a time, a process that is itself a debt from having rules that cannot be individually falsified.
Both failure modes are invisible until they're catastrophic. The WHY + Retire-when tags make them visible on day one, not day three hundred.
## What this does NOT give you
- It does not tell you which rules to write in the first place. That is a separate skill (and there's a whole pedagogy around it — I cover some of it in the `partner-identity.md` rule in the same repo). This discipline is about curating the rules you already have, not generating new ones.
- It does not run itself. Someone has to trigger the audit when a new model ships. If you're the sole maintainer, it's you. If you're a team, write a lightweight "audit cadence" rule that names the cadence.
- It is not an excuse to delete rules you haven't audited. "It felt old" is not an audit. The retirement condition must be observable: a test you can run, with a defined pass/fail.
## A sketch of an audit-tag convention for a team
If you're adopting this on a team, normalize the tag format so audit scripts can be written later:
```markdown
**WHY:** <corrected_behavior> — observed on <model_version> (<date>).
**Retire when:** <observable_condition> — tested by <specific_test>.
**Last audited:** <model_version> (<date>, kept / retired).
```
The third line is optional but valuable — it lets you skim the directory and see which rules have been recently audited vs. which are decades overdue in agent-time.
An audit script is easy to write against this convention: grep for rules whose Last audited field is older than the current frontier model release, surface them for review. I have not written this script yet; the discipline is the work, not the automation.
## Where this comes from
This isn't a novel idea I invented. The core insight — "scaffolding you add to correct a tool's limitations becomes debt when the tool improves" — is borrowed from programming-language work. Compiler engineers know this pattern well: every optimization hint or type annotation is a bet on the compiler's current limitations, and a disciplined engineer revisits their hints when the compiler improves. The prompt-engineering version is the same bet at a different layer.
The specific WHY + Retire-when formulation came out of my own frustration watching .claude/rules/ directories (mine and others') grow without anyone auditing them. Four rules were the first to carry the tags; the remaining rules are being extended incrementally against successive model versions.
Related practitioners whose writing shaped this:
- Julia Evans on reading the system — the discipline of naming exactly which behavior you're correcting and exactly when you'd stop correcting it is in her debugging-voice DNA.
- Andrej Karpathy on understanding the 40-line version before trusting the 40k-line version — the WHY + Retire-when tags are the 40-line version of prompt curation.
- Addy Osmani on engineering discipline layered onto AI-assisted flows — this pattern is exactly that.
None of them wrote about this specific pattern, but the habits are borrowed from their writing. That's usually how engineering ideas move.
## Try it this week
Pick one rule in your .claude/rules/ or equivalent. Write the WHY tag. Then write the Retire-when tag. If you can't write the retirement condition in observable terms, the rule was probably always unfalsifiable — and that's a finding in itself.
Do that for three rules. You'll know within an hour whether the discipline fits. If it does, extend incrementally.
If you already do this in some form and have a better formulation, I'd love to read it. Issues welcome at github.com/aman-bhandari/claude-code-agent-skills-framework.
Aman Bhandari — software engineer shipping Claude Code + MCP internal tooling. github.com/aman-bhandari.
This article was originally published by DEV Community and written by aman-bhandari.