We just shipped a buyer-facing agentic AI assistant for a UK-based B2B wholesale distributor — running against their live inventory system, over ~30,000 SKUs, with eleven tools and an 80-line tool-use loop we wrote by hand. No LangChain, no agent framework, no abstraction layer between us and the model. This is what good AI integration services look like when you actually have to ship: pick the right architecture, write the small amount of plumbing yourself, and let the model do the rest.
This is a case study on a real production build. We're going to walk through the architecture choices that mattered — the hybrid data model, the bare-Anthropic-SDK agent loop, prompt caching, parallel tool execution, and a lazy image pipeline — and a couple of the failure modes that nearly cost us a day each. If you're considering wiring an LLM into a real B2B catalogue, this is the shape of build that works.
What the assistant actually does
The client sells physical product into trade buyers. Their website previously had a search bar and a category tree. We replaced the discovery experience with a chat assistant that handles questions like "do you have any walnut TV stands in stock?" or "build me a quote for 10 of SKU-12345 and 5 of SKU-67890, ship to a single pallet." The answers come back with real-time stock, tier-aware pricing, and product images, all grounded in the wholesaler's live inventory system of record.
Under the hood the assistant has eleven tools available: semantic product search, category browsing, category drill-down, full product detail, image fetch, single-SKU stock check, parallel bulk stock check, vector-similarity alternatives (for out-of-stock recovery), tier-aware pricing, a multi-line quote builder with pallet and weight estimates, and a lazy image pipeline. Claude orchestrates them. We don't.
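For context, a tool in the Messages API is just a name, a description, and a JSON Schema for the inputs. A representative sketch (the schema here is illustrative, not the production definition):

```python
# Illustrative tool definition in the Anthropic tools format.
CHECK_STOCK_TOOL = {
    "name": "check_stock",
    "description": "Return the live stock level for a single SKU "
                   "from the inventory system of record.",
    "input_schema": {
        "type": "object",
        "properties": {
            "sku": {"type": "string", "description": "Product SKU, e.g. SKU-12345"},
        },
        "required": ["sku"],
    },
}
```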
The hybrid data architecture — local index, live source of truth
This is the single most important architectural choice in the build, and it's the one most teams get wrong. We do not mirror the inventory catalogue into a local database. We keep a local SQLite file that contains only what's needed for semantic search: embeddings, SKU, name, and a couple of display fields. Everything that changes — stock levels, prices, descriptions, images — is fetched live from the inventory system at query time.
Most teams attempting this fall into one of two traps. Either they replicate the whole catalogue into Postgres and spend the next six months debugging sync drift, stale stock numbers, and webhook failures. Or they avoid replication entirely and hammer the source API on every keystroke, ending up with a 3-second-per-character search experience. The hybrid pattern sidesteps both: the local index makes search fast (sub-100ms semantic matches over ~32k SKUs), and the live fetch on detail/stock/price means a buyer never sees a stale number. The cost of this design is one extra API hop per detail view — entirely worth it.
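A minimal sketch of the hybrid lookup, assuming embeddings are stored as float32 blobs in the SQLite file; embed and inventory_api are stand-ins for the embedding call and the live-system client, not the production names:

```python
import sqlite3
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def semantic_search(query: str, top_k: int = 5) -> list[dict]:
    """Rank against the local index, then fetch live data for each hit."""
    q = np.array(embed(query), dtype=np.float32)  # embed() = one embedding API call
    db = sqlite3.connect("catalogue_index.db")
    rows = db.execute("SELECT sku, embedding FROM products").fetchall()
    hits = sorted(
        rows, key=lambda r: -cosine(q, np.frombuffer(r[1], dtype=np.float32))
    )[:top_k]
    # Stock, price, description, and images come from the live system of
    # record here, so the buyer never sees a stale number.
    return [inventory_api.get_product(sku) for sku, _ in hits]
```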
The embeddings themselves cost about $0.08 to generate once for the full catalogue using OpenAI's text-embedding-3-small model. Re-indexing on catalogue changes is incremental — we only re-embed SKUs whose name or description has changed. This is similar in spirit to the approach we've written about in RAG systems without the hype: keep the retrieval layer cheap and fast, keep the authoritative data where it lives.
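The incremental re-index is a content-hash check. A sketch, continuing the assumptions above (the content_hash column and the embed helper are illustrative):

```python
import hashlib

def reindex_incremental(db, catalogue: list[dict]) -> None:
    """Re-embed only the SKUs whose name or description text changed."""
    for item in catalogue:
        text = f"{item['name']}\n{item['description']}"
        digest = hashlib.sha256(text.encode()).hexdigest()
        row = db.execute(
            "SELECT content_hash FROM products WHERE sku = ?", (item["sku"],)
        ).fetchone()
        if row and row[0] == digest:
            continue  # text unchanged: keep the existing embedding
        blob = np.array(embed(text), dtype=np.float32).tobytes()
        db.execute(
            "INSERT OR REPLACE INTO products (sku, embedding, content_hash) "
            "VALUES (?, ?, ?)",
            (item["sku"], blob, digest),
        )
    db.commit()
```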
Why we wrote the agent loop ourselves instead of using LangChain
We deliberately chose not to use LangChain, the Claude Agent SDK, or any other agent abstraction. The entire tool-use loop is about 80 lines of Python. Here's the shape:
```python
# Pseudocode — the real loop is ~80 lines
from concurrent.futures import ThreadPoolExecutor

def run_conversation(first_user_message):
    messages = [first_user_message]
    while True:
        response = anthropic.messages.create(
            model=MODEL,
            system=SYSTEM_PROMPT,      # cached
            tools=TOOL_DEFINITIONS,    # cached
            messages=messages,
        )
        if response.stop_reason == "end_turn":
            return response.content

        # Run all tool_use blocks in this turn in parallel
        tool_uses = [b for b in response.content if b.type == "tool_use"]
        with ThreadPoolExecutor() as pool:
            results = list(pool.map(run_tool, tool_uses))  # consume inside the pool

        messages.append({"role": "assistant", "content": response.content})
        messages.append({"role": "user", "content": tool_results(results)})
```
That's it. Send messages plus tool definitions, check for tool_use blocks in the response, run them in parallel, append the results, loop until end_turn. The reason this matters: every framework abstraction is a place where future you has to fight the model, the framework, and the version mismatches between them. With 80 lines of plumbing we get full control over prompt caching headers, parallel execution semantics, error handling, retries, and per-tool timeouts. We can tune behaviour at the prompt level without wondering whether the framework is doing something clever in the middle.
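For completeness, the tool_results helper in the loop just reshapes outputs into the API's tool_result content blocks. A sketch, assuming run_tool returns (tool_use, output) pairs as in the sketch further down:

```python
import json

def tool_results(results) -> list[dict]:
    """Reshape (tool_use, output) pairs into tool_result content blocks."""
    return [
        {
            "type": "tool_result",
            "tool_use_id": tool_use.id,   # ties each result to its matching call
            "content": json.dumps(output),
        }
        for tool_use, output in results
    ]
```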
The Anthropic SDK already does the heavy lifting — tool schema validation, content-block decoding, streaming. There's nothing left for an agent framework to add that's worth the dependency.
Prompt caching and parallel tool execution — where the real speed wins live
Two settings make the assistant feel instant. First, we mark the system prompt and tool definitions with cache_control: ephemeral. The system prompt plus eleven tool schemas is a chunky payload, and in a multi-turn conversation it gets re-sent on every turn. Caching cuts the cost of that prefix by roughly 90% on cache hits. Per-conversation cost lands in the £0.01–0.03 range for typical chats — well below the gross margin on a single line item in the quote.
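Concretely, the cache markers sit directly in the create call. A sketch:

```python
response = anthropic.messages.create(
    model=MODEL,
    system=[{
        "type": "text",
        "text": SYSTEM_PROMPT,
        "cache_control": {"type": "ephemeral"},  # cache breakpoint after the prompt
    }],
    tools=[
        *TOOL_DEFINITIONS[:-1],
        # a breakpoint on the last tool caches the whole tool-definition prefix
        {**TOOL_DEFINITIONS[-1], "cache_control": {"type": "ephemeral"}},
    ],
    messages=messages,
)
```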
Second, when Claude returns multiple tool_use blocks in a single turn — which it does constantly, e.g. "search for walnut TV stands AND check stock on these three candidates" — we run them in a thread pool instead of serially. This is the single biggest perceived-speed improvement in the whole build. A turn that would take 4 seconds sequentially completes in about 1 second because all the I/O happens at once. The model is happy to issue parallel tool calls when it can; most agent frameworks default to sequential execution and leave that latency on the table.
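The function the pool maps over is deliberately boring. A sketch, where TOOL_REGISTRY is a hypothetical name-to-callable map and failures come back as data so one broken tool doesn't kill the whole turn:

```python
def run_tool(tool_use):
    """Dispatch one tool call; failures go back to the model as data."""
    fn = TOOL_REGISTRY[tool_use.name]  # hypothetical name -> callable map
    try:
        return tool_use, fn(**tool_use.input)
    except Exception as exc:
        # surfaced via tool_result so Claude can retry or re-plan in-conversation
        return tool_use, {"error": str(exc)}
```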
Behavioural rules live in the prompt, not the code
A lot of the assistant's good behaviour is enforced through carefully written system-prompt rules rather than wrapper code. A sample of what we ended up with after iteration:
- Never claim availability without calling the stock-check tool first.
- Always use tier-aware pricing when the buyer specifies a quantity — never quote the unit price for a bulk order.
- If find_alternatives returns results, present them and STOP — do not chase with more searches.
- If the buyer specifies attributes that may not match the catalogue (e.g. "premium Swiss watches" when we stock generic watches), lead with a qualifier-honest acknowledgement rather than pivoting to whatever loosely related items the search returned.
That last one took the longest to get right. The model's default helpfulness training wants to find something for the buyer, even when the honest answer is "we don't stock that category." We tested adversarial queries throughout the build and tuned the prompt until the assistant led with "we don't stock premium Swiss watches specifically, but here are the watches we do carry if that's useful" instead of pretending the generic watches matched the premium-Swiss request. Honest-by-design is a prompt engineering problem, and the hard-won rules belong explicitly in the system prompt.
The lazy image pipeline — fetch once, serve fast forever
Product images in the inventory system are auth-gated, so we can't link to them directly from a public chat UI. We built a lazy pipeline that solves this without pre-mirroring 30k+ images. When a chat needs an image, we check our S3-compatible store for it. If it's missing, we fetch from the inventory system, upload to S3, then return a signed URL through an on-the-fly image processor that thumbnails, resizes, and auto-converts to WebP for modern browsers.
After the first fetch, the image lives in S3 permanently and every future chat hit is instant. The on-the-fly processor handles size and format per request, so the same source image serves a 200×200 chat thumbnail or a 1200px detail view without us having to pre-generate variants. We've used this pattern before — there's a longer write-up on the bandwidth side in how we cut gallery bandwidth 98% with imgproxy.
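The whole pipeline is a handful of lines. A sketch, assuming boto3 against the S3-compatible store; inventory_api.fetch_image and signed_processor_url stand in for the auth-gated source fetch and the imgproxy-style URL builder:

```python
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")    # S3-compatible store; bucket and paths are illustrative
BUCKET = "product-images"

def image_url(sku: str, width: int = 200) -> str:
    """Return a processor URL for a product image, filling S3 on first miss."""
    key = f"products/{sku}.jpg"
    try:
        s3.head_object(Bucket=BUCKET, Key=key)    # cheap existence check
    except ClientError:
        raw = inventory_api.fetch_image(sku)       # auth-gated source fetch, once
        s3.put_object(Bucket=BUCKET, Key=key, Body=raw)
    # the processor resizes and converts to WebP per request, so one source
    # image serves every variant
    return signed_processor_url(key, width=width, format="webp")
```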
Failure modes we actually hit during the build
Two of these are worth airing because they're the kind of thing you only find when you ship.
OAuth rate limiting on token refreshes. Every service restart was burning a fresh OAuth refresh against the inventory platform's auth endpoint. During the iteration-heavy phase of the build we restarted the service dozens of times an hour, hit the auth rate limit, and got locked out. Fix: persist the access token to disk so it survives restarts and only refresh when actually expired. Obvious in hindsight; not obvious at 11pm on day three.
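The fix is a dozen lines. A sketch, where oauth_refresh stands in for the platform's token-endpoint call and the path is illustrative:

```python
import json
import time
from pathlib import Path

TOKEN_FILE = Path("/var/lib/assistant/token.json")  # illustrative path

def get_access_token() -> str:
    """Reuse the on-disk token across restarts; refresh only when expired."""
    if TOKEN_FILE.exists():
        tok = json.loads(TOKEN_FILE.read_text())
        if tok["expires_at"] > time.time() + 60:    # 60s safety margin
            return tok["access_token"]
    tok = oauth_refresh()  # hypothetical call to the platform's token endpoint
    TOKEN_FILE.write_text(json.dumps(tok))
    return tok["access_token"]
```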
The multi-turn content-block serialisation bug. The Anthropic SDK's response content blocks come back with extra metadata fields (specifically parsed_output) that the API rejects when you echo those same blocks back as part of the assistant turn on the next call. Single-turn CLI tests never caught this because each test was a fresh conversation. Only the multi-turn web UI exposed it — and only after about turn three, when the conversation got long enough to require re-sending earlier assistant content. Fix: whitelist only the API-spec fields per content-block type before re-sending. Lesson: always test multi-turn explicitly, and never trust that "it works once" generalises to "it works repeatedly in the same conversation."
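The fix is a whitelist per block type. A sketch; the allowed-key sets below cover the block types this assistant sees and aren't an exhaustive spec:

```python
# Keys the Messages API accepts per content-block type when echoed back.
ALLOWED_KEYS = {
    "text": {"type", "text"},
    "tool_use": {"type", "id", "name", "input"},
}

def sanitize_blocks(blocks) -> list[dict]:
    """Strip SDK-only metadata before re-sending assistant content."""
    out = []
    for block in blocks:
        d = block if isinstance(block, dict) else block.model_dump()
        out.append({k: v for k, v in d.items() if k in ALLOWED_KEYS[d["type"]]})
    return out
```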
There were also smaller papercuts — the single-item endpoint on the inventory platform doesn't return custom fields like pricing tiers, but the list endpoint does. So any tool that needed tier-aware pricing had to call the list endpoint with a one-item ID filter instead of the obvious GET. Real APIs have quirks; you find them by hitting them.
The pattern at a glance
The complete recipe: hybrid data architecture (local embedding index, live source of truth for everything mutable), bare Anthropic SDK with an ~80-line tool-use loop, prompt caching on the system prompt and tools, parallel tool execution via a thread pool, behavioural rules enforced in the prompt rather than wrapper code, and a lazy S3-backed image pipeline. Eleven tools, one model, around 1-second response times, £0.01–0.03 per conversation, and zero framework lock-in.
This is the kind of agentic AI integration work Iron Mind ships — short cycle, no abstraction debt, production-grade from day one. If you have a B2B product catalogue and want this for your business, we should talk.