Claude Code remembers nothing between sessions by default. Every correction you give it and every block of context you re-type evaporates when the window closes — the fix lives in a CLAUDE.md file, but almost nobody keeps it current. We built claude-harness-tuner to mine your own session history for the friction you keep re-creating and turn it into reviewable setup improvements.
The tool is now open source (MIT, Python, Linux-first): github.com/ironmindai/claude-harness-tuner. This post is the build story — what we learned mining 1,300+ of our own sessions, and why the placement of a learned preference matters as much as the preference itself.
Why does Claude Code forget your preferences between sessions in 2026?
Claude Code starts every session with a blank slate apart from what you've written into CLAUDE.md files. Corrections you type mid-session — "verify by running the app," "use OpenRouter," "stop paid jobs immediately" — live only in that conversation's context window. When the session ends, that signal is gone, and you re-teach it next time.
This is not a bug. Persistent memory in Claude Code is opt-in: it reads global and project CLAUDE.md files at startup, and that's the whole memory model. The problem is human. Keeping those files current is manual work nobody prioritizes, so months of hard-won preferences never get captured.
We noticed we'd been telling Claude the same things for months — scattered across more than 1,300 sessions — and writing almost none of it down. The signal already existed in our transcripts. It just needed extracting.
What does claude-harness-tuner actually do?
claude-harness-tuner reads your Claude Code session transcripts — the JSONL files under ~/.claude/projects/ — and extracts "effort moments": places where you corrected the agent, or re-typed substantial context it should have already known. It clusters those moments with an LLM and writes a single reviewable suggestions file. It never edits your config.
Clean sessions where nothing went wrong produce nothing, by design. The tool is hunting for friction, not summarizing your work. A smooth session is a session with no lesson in it.
There are two kinds of effort moment it looks for:
- Corrections — you told the agent it did something wrong and how to do it right.
- Re-typed context — you pasted or restated substantial information the harness should already carry.
Both are signals that your setup is leaking. Every correction you give an agent that doesn't get written down is a correction you'll give again.
How does the extraction pipeline work without leaking secrets?
The pipeline is roughly 95% deterministic Python; the only LLM involvement is plain HTTP calls — no agent framework, no SDK lock-in. Before anything leaves your machine, a mandatory redaction pass replaces API keys, passwords, and tokens with «REDACTED». Then moments are sent to the model in parallel shards, followed by a merge pass that deduplicates and clusters across shards.
Redaction is the headline safety feature, and it runs unconditionally. You are feeding your raw development history to an LLM; secrets must never be part of that payload. We treated this as non-negotiable rather than a flag.
The sharded-then-merged shape exists for a practical reason. A few thousand effort moments won't fit one prompt cleanly, and a single giant call would be slow and lossy. Sharding parallelizes extraction; the merge pass is where cross-shard patterns — the same correction appearing in five projects — get recognized and ranked as high-confidence.
Why does WHERE a learned preference goes matter as much as the preference?
The killer feature is placement intelligence: each suggestion is routed to where it belongs. A cross-project habit goes to your global CLAUDE.md; a project-specific quirk goes to that project's CLAUDE.md; a reusable reference (like a service-restart procedure) goes to a KB article. A preference in the wrong file is either ignored or pollutes every unrelated project.
This is the part most "summarize my preferences" approaches miss. A correction that's true everywhere belongs in global config. A correction that's only true for one repo will create noise — or active harm — if you hoist it to global. Routing is a real decision, and the tool makes it explicitly.
Confidence scales with repetition. A pattern that recurs across five or more projects is strong signal that it's a genuine cross-project habit, not a one-off. That frequency count is exactly what a human skimming their own transcripts can't compute, and it's what makes the global-vs-project routing trustworthy.
If you're already thinking about how learned context gets organized, this pairs directly with the question of how do you stop a single CLAUDE.md file from bloating your agent's context — placement intelligence decides what to capture; a KB cascade decides how to load it without drowning the model.
What did running it on 1,300+ real sessions actually surface?
On our real history the funnel was: 1,300+ sessions → ~1,400 effort moments → ~26 shards → 10–14 final insights. That aggressive distillation is the entire point. You don't want 1,400 line items; you want the dozen lessons your team keeps re-learning, ranked by how much they cost you.
The insights it surfaced were uncomfortably accurate about how we work:
- "Verify by driving the running app — never trust 'the code is correct'" — our single most expensive recurring correction, found across 5 projects. Routed to global
CLAUDE.md. - "OpenRouter is our default LLM provider" — re-asserted across 6 projects and never once written down. Routed to global.
- "Stop paid jobs immediately on request — cost is a hard gate" — 4 projects.
- A service-restart procedure — routed to a KB article rather than global config, because it's reference material, not a behavioral rule. A good illustration of the routing actually thinking.
None of these were surprising in hindsight. That's the tell — they were obvious in aggregate and invisible session-by-session, which is exactly the blind spot a transcript miner is built to cover.
How much does a full run cost?
A full ~1,400-moment run costs about $1.90 on Claude Sonnet, or roughly $0.50 on Haiku — the price of a coffee, even pointed at an entire software house's server. Every run prints its exact token count and cost. It's cheap because it distills sessions down to friction moments before the LLM ever sees them.
This is the deliberate consequence of the deterministic-first design. The expensive part of any LLM pipeline is tokens; we spend Python to minimize them. The model only ever reads the small, redacted set of effort moments, not your raw history — so cost stays bounded no matter how large your archive grows.
What did the first real run teach us that mocks didn't?
Our mock tests all passed. The first real LLM run exposed a bug anyway: the merge step asked for JSON, but the model led its reply with a ```bash code block, so our parser grabbed the wrong fence and silently returned zero insights. A clean test suite said everything worked; reality said otherwise.
We fixed it with a JSON-fence-preferring, depth-matched parser plus a retry, and added it to the suite. The lesson generalizes well past this tool: mocks verify the contract you imagined, not the one the model actually honors. Live testing against a real provider is where the gap shows up.
It's a small story, but it's the kind of thing that separates a tool that demos from a tool that runs. The shipped version has 104 tests and a cyber-styled live terminal UI built on Rich, with automatic plain-text fallback when output is piped or running in CI.
Why does the tool never edit your config?
claude-harness-tuner writes a single SETUP-SUGGESTIONS.md with checkboxes and stops there. You tick the items you want, then tell your own Claude Code "apply the checked items in SETUP-SUGGESTIONS.md" and your agent makes the edits. Human approves, agent applies — the tool never touches your config directly.
We made this choice deliberately over auto-applying. Setup files are load-bearing; a bad global rule degrades every future session. A reviewable diff you opt into is worth the extra thirty seconds, and it keeps a human in the loop on the one file that shapes everything the agent does.
It also closes the loop nicely. The tool that found the friction hands the work back to the agent that caused it — your own Claude Code applies the lessons mined from your own history.
The real takeaway: agents that don't learn waste your time
An AI coding agent that forgets every correction is an agent you pay for twice — once to do the work, once to re-explain what you already taught it. The signal is already sitting in your transcripts; the only question is whether you capture it. claude-harness-tuner mines that signal, redacts it, distills it, and routes it to the right file — and your harness compounds instead of resetting.
It's open source and Linux-first: github.com/ironmindai/claude-harness-tuner. Run it on your own history, see what it says about how your team works, and if it's useful, star the repo. The funnel from 1,300 sessions to a dozen lessons is the part you have to see to believe.