A context-aware knowledge base cascade is the most effective pattern we've found for managing Claude Code memory at scale. Instead of stuffing every instruction, API reference, and workflow into a single CLAUDE.md file, we built a two-tier lazy-loading system where the agent self-selects which knowledge to load based on the task at hand — the same way a senior developer knows where to look things up without memorizing every API doc.
This pattern emerged from a real problem. As our team grew to 11 specialized AI sub-agents, our shared CLAUDE.md ballooned past 800 lines. Agents were burning context on irrelevant instructions, and we started hitting the exact degradation described in Stanford's "Lost in the Middle" research — relevant information buried deep in the context was getting ignored. We needed a better architecture.
Why a Single CLAUDE.md File Breaks Down at Scale
CLAUDE.md is how Claude Code remembers your project. It's loaded into context at the start of every session. That works fine for a solo project with a handful of conventions and build commands. It stops working when you have a dozen integrations, multiple infrastructure services, and specialized workflows across different agents.
The failure mode is subtle. The agent doesn't crash — it just gets worse. With 2,000+ tokens of irrelevant DataForSEO docs loaded when the agent is doing frontend work, you're not just wasting context window capacity. You're actively degrading the agent's ability to focus on the instructions that matter. We measured this: agents given lean, relevant context completed tasks with fewer tool calls and fewer errors than agents loaded with everything.
The second problem is maintenance. A monolithic CLAUDE.md becomes a nightmare to update. When your Twilio integration changes, you're editing the same file that holds your Git conventions, your database credentials, and your deployment workflow. One bad merge and three agents lose their instructions.
How the Two-Tier KB Cascade Works
The architecture has three layers — the always-loaded CLAUDE.md plus the two lazy-loaded tiers that give the pattern its name. Only the first layer is loaded up front; the rest is pulled on demand.
Layer 1 — CLAUDE.md (always loaded, ~50 lines): Contains team identity, agent routing rules, and one critical line: a pointer to kb-index.md. This file is lean by design — it tells the agent who it is and where to find everything else.
Layer 2 — kb-index.md (loaded on reference, ~100 lines): A lightweight index with one-liner descriptions and file paths for every knowledge base article. Each entry has a "Use when" trigger that helps the agent decide relevance. The agent reads this file and immediately knows whether it needs the Twilio docs, the S3 configuration, or the SEO keyword system — without loading any of them yet.
Layer 3 — Individual KB articles (loaded on demand): Full-detail reference docs like kb/dataforseo.md, kb/twilio-elevenlabs-outbound.md, or kb/local-s3.md. These contain API keys, endpoint specs, code patterns, and operational details. An agent only reads these when the current task requires them.
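Concretely, the three layers map onto a small file tree. This layout is a sketch assembled from the file names mentioned in this article; the exact root directory and comments are illustrative:

```
~/.claude/
├── CLAUDE.md       # Layer 1: identity, routing, pointer to kb-index.md (~50 lines)
├── kb-index.md     # Layer 2: one "Use when" entry per article (~100 lines)
└── kb/
    ├── dataforseo.md                  # Layer 3: full reference docs,
    ├── twilio-elevenlabs-outbound.md  # loaded only on demand
    └── local-s3.md
```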
What Makes the Index Entry Design Critical
The cascade only works if the index entries are well-designed. After several iterations, we landed on a format that gives the agent just enough information to make a relevance decision without loading the full article.
Each index entry contains four fields: a heading (the service or tool name), the file path, a one-line description of what's inside, and a "Use when" clause that describes the trigger conditions. That last field is the key innovation. It's not metadata for humans — it's a decision prompt for the agent.
```markdown
### Twilio + ElevenLabs
File: ~/.claude/kb/twilio-elevenlabs-outbound.md
Description: Programmatic outbound calls with Twilio + ElevenLabs conversational AI
Use when: Building voice AI applications, implementing outbound calling systems
```
We tried vaguer descriptions early on ("Twilio integration docs") and the agent would either load everything to be safe or skip articles it actually needed. The explicit "Use when" pattern reduced unnecessary reads by roughly 60% in our tracking.
How Project-Specific Sub-Indexes Add a Fourth Layer
For complex projects with their own tooling ecosystems, we add a sub-index. Our iron-mind.ai project, for example, has kb/ironmind/README.md that indexes SEO scripts, CMS publishing tools, GSC reporting, and keyword research systems. The main kb-index.md points to this sub-index the same way it points to individual articles.
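Wiring in the sub-index takes nothing new: it is just another entry in kb-index.md, using the same four-field format. The Description and "Use when" wording below is illustrative; only the file path comes from our actual setup:

```markdown
### iron-mind.ai tooling
File: ~/.claude/kb/ironmind/README.md
Description: Sub-index of SEO scripts, CMS publishing, GSC reporting, and keyword research systems
Use when: Working on iron-mind.ai tasks that need project-specific tooling
```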
This creates a tree structure that mirrors how engineering teams actually organize knowledge — a top-level directory of capabilities, project-specific directories underneath, and detailed reference docs at the leaves. The agent navigates this tree the same way a developer navigates a wiki: start broad, drill into the relevant section.
We chose this tree structure over a flat index because the alternative — one massive index with 50+ entries — recreated the original problem. The sub-index pattern keeps each layer scannable in under 100 lines, the threshold past which we noticed agents skimming instead of reading.
The Tradeoff: Stateful Agents vs One-Shot Contexts
This pattern has a hard prerequisite: the agent must be able to read files mid-task. In Claude Code, that's a given — the agent has tool access and can open any file on the filesystem during execution. The cascade works because the agent operates in a loop: read index, decide what's relevant, load the article, continue working.
In pure one-shot contexts — a single API call with no tool use, or a chatbot without file access — this pattern breaks down entirely. If the agent can't read files during execution, you have to pre-inject all potentially relevant context before the task starts. That means you're back to the monolithic approach, or you need an orchestration layer that selects context before the agent runs.
We considered building that orchestration layer (a pre-processing step that reads the task description and pre-loads relevant KB articles) but rejected it. The pre-processor would need to understand task semantics well enough to predict which articles matter — and at that point, you're building a second agent just to prepare context for the first one. The lazy-loading approach is simpler and more accurate because the agent that needs the context is the one deciding what to load.
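The decision step the agent performs is easy to make concrete. The sketch below parses an index in the entry format shown earlier and matches a task description against each entry's "Use when" clause. In production the agent's own judgment makes the relevance call; the keyword-overlap heuristic here is only a toy stand-in to show the data flow, and the S3 entry's wording is illustrative:

```python
import re

# A miniature kb-index.md in the four-field entry format.
KB_INDEX = """\
### Twilio + ElevenLabs
File: ~/.claude/kb/twilio-elevenlabs-outbound.md
Description: Programmatic outbound calls with Twilio + ElevenLabs conversational AI
Use when: Building voice AI applications, implementing outbound calling systems

### Local S3
File: ~/.claude/kb/local-s3.md
Description: Local S3-compatible object storage configuration
Use when: Configuring object storage, debugging S3-compatible uploads
"""

def parse_index(index_text):
    """Split an index file into entries keyed by their ### headings."""
    entries = []
    for block in re.split(r"\n(?=### )", index_text.strip()):
        lines = block.splitlines()
        heading = lines[0].lstrip("# ").strip()
        fields = dict(line.split(":", 1) for line in lines[1:] if ":" in line)
        entries.append({
            "name": heading,
            "file": fields.get("File", "").strip(),
            "use_when": fields.get("Use when", "").strip().lower(),
        })
    return entries

def relevant_articles(task, entries):
    """Return file paths whose 'Use when' triggers overlap the task words."""
    task_words = set(re.findall(r"[a-z0-9-]+", task.lower()))
    hits = []
    for entry in entries:
        trigger_words = set(re.findall(r"[a-z0-9-]+", entry["use_when"]))
        if len(task_words & trigger_words) >= 2:  # crude overlap threshold
            hits.append(entry["file"])
    return hits
```

The point of the format is visible in the last function: each entry carries its trigger conditions right next to its file path, so the relevance decision and the subsequent file read are one step apart, with no separate orchestration layer in between.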
Why This Mirrors How Senior Engineers Actually Work
The best engineers we've worked with don't memorize API documentation. They maintain a mental index — "I know we have a Twilio integration, and the config is in the wiki" — and look up the details when a task requires them. Junior engineers try to hold everything in their head, and they make more mistakes because their working memory is overloaded with irrelevant details.
The KB cascade gives AI agents the same cognitive architecture. The always-loaded CLAUDE.md is the agent's identity and orientation. The index is its mental map of available knowledge. The individual articles are reference material accessed on demand. This isn't just an analogy — it's the same information retrieval pattern, and it fails in the same ways when you violate it (context overload leads to missed instructions, just as cognitive overload leads to missed details).
We've seen this pattern described as a form of agentic RAG — and structurally, it is. The agent retrieves its own context from a knowledge store based on task relevance. The difference is that our "retrieval" is deterministic file reads from a curated index, not vector similarity search against an embedding database. That makes it faster, more predictable, and easier to debug when the agent loads the wrong context.
Practical Results After Six Months
We've been running this system in production across our entire agent fleet since late 2025. The numbers that matter: average context utilization per task dropped from ~12,000 tokens of instruction overhead to ~3,200 tokens. Agents that previously loaded every integration doc now load only what they need, and they get it right about 90% of the time without any explicit routing logic.
The maintenance cost dropped even more dramatically. Adding a new integration means writing one KB article and adding a three-line entry to the index. No agent configurations change. No CLAUDE.md surgery. The new knowledge is immediately available to every agent that needs it — and invisible to every agent that doesn't.
The pattern also made onboarding new agents trivial. When we spun up a new specialized agent for scheduled tasks, we pointed it at the same CLAUDE.md and kb-index.md. It inherited the entire knowledge base on day one, but only loaded the cron and worker docs that were relevant to its role.
The Complete KB Cascade Pattern
A context-aware knowledge base cascade solves the fundamental tension in AI agent memory: agents need access to broad knowledge, but they perform best with narrow, relevant context. The pattern — a lean identity file, a scannable index with explicit relevance triggers, deep reference articles loaded on demand, and optional project sub-indexes — gives Claude Code agents the information architecture of a well-organized engineering team. The agent that knows where to look will always outperform the agent that tries to remember everything.