We don't use vector search for our AI's knowledge base, and after running this pattern in production for months we wouldn't go back. For a curated knowledge base under ~500 entries, a hand-written lean index file — names, one-liner descriptions, file paths — outperforms RAG, embeddings, and re-ranking stacks on every axis that matters: retrieval accuracy, latency, cost, and debuggability.
The reason is simple and, once you see it, hard to unsee. The LLM itself is the retriever. It reads English better than any embedding model reads a 1,536-dimensional vector. When the whole index fits in context, there is no retrieval step to fail — the model just picks.
## Why vector search is overkill for a curated KB under 500 items
Vector search earns its complexity when you have a corpus you cannot fit in context: millions of support tickets, an entire legal library, a decade of Slack history. The economics of RAG — chunk, embed, store, query, re-rank — only pay off when the raw material is too big to hand to the model directly.
A curated internal KB is the opposite situation. We have roughly 120 entries across scripts, API docs, integration guides, and tool references. Each has a one-line description. The full index is under 160 lines of markdown and costs a few hundred tokens to load. Claude Sonnet 4.5's 200k context window swallows that without noticing. At that scale, embedding retrieval isn't a feature — it's a middleman you're paying to be less accurate than the model already is.
We considered the RAG route early on. We ruled it out when we realized we'd be spending engineering effort tuning chunk sizes, picking an embedding model, running a vector DB, and building re-rankers — all to approximate a decision the LLM would make correctly on its own if we just showed it the menu.
## The LLM is the retriever — descriptions are the signal
This is the insight the pattern rests on, and it's worth saying plainly: the index descriptions ARE the retrieval function. When you write a one-liner for a KB entry, you are hand-writing the embedding. The quality of the description directly determines retrieval accuracy.
In a RAG system, the embedding model decides which chunks are "close" to a query. You don't control that mapping — you tune around it. In our system, we write the mapping ourselves, in English, and the LLM reads it directly. That's not a downgrade. It's a promotion. Natural language is a higher-bandwidth channel for semantic intent than any embedding we'd have picked off the shelf.
Concretely, the index sits at `~/.claude/kb-index.md`, always loaded into context. Entries look roughly like this:
- send-email.md — SMTP mailer with attachments, HTML, multi-recipient support
- imgproxy.md — self-hosted image proxy, signed URLs, resize + convert on the fly
- scripts/ironmind-add-post.py — publish/upsert blog posts to iron-mind.ai
That's it. No embeddings, no vector DB, no re-ranker. When a task comes in, the model reads the index, decides which files are relevant, and loads only those. The detailed KB articles — the actual "how" — live on disk and only enter context on demand. Any agent on the server can discover and use any tool, because the index is the discovery mechanism.
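In code, the whole retrieval path is two steps: build a prompt that carries the index, then load only the files the model names in its reply. A minimal Python sketch, with hypothetical paths and prompt wording, and the actual LLM call left out:

```python
from pathlib import Path

# Hypothetical layout: the lean index lives alongside the KB articles.
KB_ROOT = Path.home() / ".claude"
INDEX_FILE = KB_ROOT / "kb-index.md"

def build_prompt(task: str, index_text: str) -> str:
    """The index rides along with every request; the model is the retriever."""
    return (
        "Knowledge-base index:\n\n"
        f"{index_text}\n\n"
        f"Task: {task}\n"
        "Reply with the paths of the index entries you need, one per line."
    )

def load_selected(reply: str, root: Path = KB_ROOT) -> dict[str, str]:
    """Pull only the articles the model asked for; everything else stays on disk."""
    articles = {}
    for line in reply.splitlines():
        path = line.strip()
        if path and (root / path).is_file():
            articles[path] = (root / path).read_text()
    return articles
```

The detailed articles never touch the context window unless the model names them, which is the on-demand loading described above.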
## Why descriptions must be distinctive, not just accurate
The first time this pattern broke for us, it broke in an instructive way. We had two entries related to AI video generation: one for our internal pre-deployed microservice wrapping a vendor model, and one for the raw vendor API. The descriptions were both technically accurate. Both mentioned "video generation." Both named the underlying model.
The LLM kept loading the wrong one. Not every time — but often enough to matter. It would pull the raw API doc when the internal service was the right choice, or vice versa. Debugging it was quick: we looked at the index, and the two lines were nearly interchangeable from a semantic-distance perspective. Of course the model was picking poorly. We'd given it a coin flip.
The fix wasn't better tooling. It wasn't adding embeddings, or a re-ranker, or a routing layer. The fix was rewriting the descriptions to be distinctive, not just accurate. The internal service became "our pre-deployed microservice — use this 99% of the time for video generation." The raw API entry became "raw vendor API, only if building a new integration from scratch." After that change, retrieval became effectively deterministic.
This is the part that surprises people. Retrieval quality in this architecture is a writing problem, not an infrastructure problem. You're not tuning a vector space — you're authoring a menu. The rule we now enforce: every description must answer "when should I pick this one instead of the others?" Not just "what is this?"
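One way to catch the coin-flip failure before the model does is to lint the index for near-interchangeable descriptions. A rough sketch using word-overlap (Jaccard) similarity; the 0.4 threshold is an arbitrary assumption, not a tuned value:

```python
import re

def words(desc: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", desc.lower()))

def near_duplicates(entries: dict[str, str], threshold: float = 0.4):
    """Flag pairs of index entries whose descriptions overlap so much
    that the model is effectively choosing by coin flip."""
    flagged = []
    items = list(entries.items())
    for i, (name_a, desc_a) in enumerate(items):
        for name_b, desc_b in items[i + 1:]:
            a, b = words(desc_a), words(desc_b)
            jaccard = len(a & b) / len(a | b) if a | b else 0.0
            if jaccard >= threshold:
                flagged.append((name_a, name_b, round(jaccard, 2)))
    return flagged
```

Run it over the index after every edit: the "two video entries" from the story above would trip it, and the rewritten, distinctive versions would not.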
## When to branch the tree
The index grows over time, and flat indices have a natural ceiling. Our trigger for branching is simple: when four or more entries share a domain, they get lifted into a sub-branch.
Email-related entries moved into kb/email/ with their own sub-index. Video-related entries moved into kb/video/. The top-level index now has a single line describing the branch — "email/ — SMTP, transactional senders, list management (see kb/email/index.md)" — and the model follows the pointer only when needed.
This is the same structural logic as a well-organized API: a flat namespace until the point where grouping reduces cognitive load, then hierarchy. The knowledge base becomes a tree that grows sub-branches as domains mature. We've never pre-planned a branch. We let entries accumulate, and when one domain clearly dominates, we lift it. This mirrors the pattern we described in our write-up on Claude Code memory and context-aware KB cascades — progressive disclosure applied to long-lived agent context.
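The four-or-more trigger is mechanical enough to script. A sketch that counts how many entries claim each domain and suggests a branch once a domain passes the threshold (the tag-per-entry representation is an illustrative assumption, not how our index is stored):

```python
from collections import Counter

def suggest_branches(entries: dict[str, set[str]], threshold: int = 4) -> list[str]:
    """entries maps file name -> domain tags (e.g. {'email'}).
    Any domain claimed by `threshold` or more entries is a branch candidate."""
    counts = Counter(tag for tags in entries.values() for tag in tags)
    return sorted(domain for domain, n in counts.items() if n >= threshold)
```

When a domain comes back as a candidate, the manual step stays manual: move the entries into the sub-directory, write the sub-index, and replace them with a single pointer line at the top level.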
## Where this pattern breaks down
We're honest about the ceiling. This pattern works because the full index fits in context with room to spare. At 500 one-line entries you're still comfortable. At 2,000 you're burning real tokens on every request for an index the model mostly doesn't need. At 10,000 you're past the point where flat or even two-level indices make sense, and you genuinely need a proper RAG pipeline with embeddings and re-ranking.
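The ceiling is easy to sanity-check with rough token math. Assuming ~15 tokens per one-liner entry (an assumption; real counts depend on the tokenizer and how long your descriptions run):

```python
TOKENS_PER_ENTRY = 15  # rough assumption; real counts depend on the tokenizer

def index_cost(entries: int) -> int:
    """Approximate always-loaded token cost of a flat one-liner index."""
    return entries * TOKENS_PER_ENTRY

for n in (500, 2_000, 10_000):
    share = index_cost(n) / 200_000  # fraction of a 200k context window
    print(f"{n:>6} entries -> ~{index_cost(n):,} tokens ({share:.0%} of window)")
```

At 500 entries the index is a rounding error against a 200k window; by 10,000 it would eat most of the window on every request, which is where RAG starts earning its keep.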
The pattern also breaks when entries are genuinely similar and should both sometimes apply — say, a library of hundreds of legal clauses where the right one depends on subtle situational context. Embeddings plus a re-ranker will beat hand-written descriptions there, because the distinctions are too fine-grained for a one-liner to carry.
And it breaks when the knowledge base isn't curated. User-generated content, scraped documents, raw meeting transcripts — anything where nobody is going to sit down and write distinctive one-liners — isn't a fit. The whole pattern leans on the fact that a human (or an agent under human review) is authoring the retrieval signal. Remove that and you're back to needing a retrieval model.
So: curated, under a few hundred entries, with domain owners willing to write sharp descriptions. That's the sweet spot. Outside it, RAG earns its keep.
## Progressive disclosure, applied to LLM context
The general principle is older than LLMs. Good API design hides detail behind a small surface until you ask for more. Good documentation puts the table of contents up front and the chapter behind a click. Good software keeps the hot path small and the cold path lazy.
We think the same discipline applies to the context you hand an LLM. Keep the navigation layer always-loaded and lean. Keep the content layer on disk, loaded on demand when the model decides it's relevant. Let the model's own reading comprehension do the routing. Reserve infrastructure — embeddings, vector DBs, re-rankers — for the scale at which it actually starts paying for itself.
For most internal knowledge bases inside an engineering team, that scale is further out than the RAG-for-everything discourse suggests. A 160-line index file, written with care, will quietly outperform a vector search stack you spent a month building. The LLM is the retriever. The descriptions are the signal. And the reason we don't use vector search for our AI's knowledge base is that — at our scale — we don't need one.