Any Claude MCP server that controls a browser burns through tokens fast -- a single page interaction can consume 125,000+ tokens just to read the DOM. By replacing raw HTML extraction with accessibility trees, natural language element finding, and a reference ID system, we cut token usage by 95% across real browser automation workflows without losing any capability.
This matters the moment you move AI browser automation out of demos and into production. At scale, a workflow that touches 50 pages per run goes from costing tens of dollars per execution to well under a dollar. The architecture change that makes this possible is surprisingly straightforward, but the impact on cost and latency is dramatic.
Why Raw HTML Is the Wrong Input for AI Agents
The default approach to AI-driven browser automation is simple: fetch the page HTML, send it to the LLM, let it figure out what to click. The problem is that a modern web page -- Instagram, LinkedIn, any SPA -- generates 100,000 to 200,000 tokens of HTML. Most of that is nested divs, inline styles, SVG paths, and tracking scripts that have zero relevance to the task.
The LLM processes all of it anyway. It parses through thousands of tokens of CSS classes and data attributes to find a single button. Then on the next step, you fetch the HTML again because the page changed slightly, and the LLM burns another 125,000 tokens. A five-step workflow easily hits 500,000+ tokens before accomplishing anything meaningful.
The cost problem is obvious. But there is a subtler issue: accuracy drops as context grows. LLMs perform worse at extracting specific elements from massive HTML blobs than they do from compact, structured representations. Fixing the token problem also fixes the reliability problem.
How Accessibility Trees Replace Full HTML
Instead of sending raw HTML to the model, we extract the page's accessibility tree -- the same semantic structure that screen readers use. This tree contains only the meaningful elements: buttons, links, inputs, textareas, checkboxes, and their labels. All the layout noise disappears.
A page that produces 125,000 tokens of raw HTML generates roughly 5,000 tokens as a filtered accessibility tree. That is a 96% reduction before any other optimization. The tree includes each element's role, human-readable name, tag, depth, and a stable reference ID -- everything an AI agent needs to understand the page and act on it.
Filtering matters here. For most automation tasks, you only care about interactive elements -- things you can click, type into, or toggle. Filtering the tree to interactive-only elements cuts it further, often down to 1,000-3,000 tokens for a complex page. The AI agent gets a clean, scannable list of exactly the elements it can interact with.
Why Natural Language Element Finding Beats CSS Selectors
Traditional browser automation depends on CSS selectors: button[aria-label='Post'], input.search-box, #submit-btn. These are brittle. A single class name change in a deploy breaks the selector, and the AI agent has to re-analyze the full HTML to generate a new one.
Natural language finding flips this. The agent says "find the post button" and a scoring system matches it against the accessibility tree. Elements are scored on exact name matches, partial matches, role alignment, placeholder text, and semantic similarity. The top results come back with reference IDs, scores, and enough metadata for the agent to pick the right one.
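A minimal version of that scoring pass might look like the following. The signals mirror the ones described above (exact name, partial name, word overlap, role hints, placeholder text), but the specific weights and function names are hypothetical.

```python
# Hypothetical scorer for "find the post button"-style queries against
# the flattened accessibility tree. Weights are illustrative.
def score_element(query: str, el: dict) -> float:
    q = query.lower()
    name = el.get("name", "").lower()
    score = 0.0
    if name and name == q:
        score += 100                      # exact name match dominates
    elif name and (name in q or q in name):
        score += 50                       # partial name overlap
    score += 10 * len(set(q.split()) & set(name.split()))
    if el.get("role") and el["role"] in q:
        score += 25                       # query mentions the role
    ph = el.get("placeholder", "").lower()
    if ph and ph in q:
        score += 15                       # placeholder text match
    return score

def find_elements(query: str, tree: list[dict], top_k: int = 3) -> list[dict]:
    """Return the top-scoring candidates with their refs and metadata."""
    ranked = sorted(tree, key=lambda el: score_element(query, el),
                    reverse=True)
    return [el for el in ranked if score_element(query, el) > 0][:top_k]
```

The agent receives only the top few candidates, each a compact record of ref, role, name, and score, which is where the roughly 300-500 tokens per lookup comes from.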
This approach costs roughly 300-500 tokens per lookup. Compare that to the alternative: sending 125,000 tokens of HTML so the LLM can craft a selector, then hoping that selector works. The natural language path is cheaper by more than two orders of magnitude and more resilient to page changes.
How Reference IDs Eliminate Redundant Queries
Once you find an element -- whether through the accessibility tree or a natural language query -- the system assigns it a reference ID like ref_42. Every subsequent action on that element uses the reference instead of re-querying the page.
In a typical form-filling workflow, the old approach fetches the full HTML before every action: fetch HTML to find the email field, type into it, fetch HTML again to find the password field, type into that, fetch HTML again to find the submit button, click it. Three full HTML fetches, 375,000+ tokens. With reference IDs, you query the accessibility tree once, store the refs for all three elements, and execute all actions using those refs. Total cost: roughly 6,000 tokens.
References persist for the lifetime of the page. They only invalidate on navigation or when the element is removed from the DOM. For multi-step workflows on the same page -- filling forms, clicking through modals, interacting with dynamic content -- this means a single upfront tree query covers dozens of subsequent actions.
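The bookkeeping behind those refs can be sketched as a small registry. This is a hedged sketch under the assumption that the automation layer holds some opaque handle per DOM element; the class and method names are invented for illustration.

```python
# Sketch of a reference registry: elements are registered once,
# actions resolve refs instead of re-querying the page, and refs
# invalidate on navigation or DOM removal.
class ElementRegistry:
    def __init__(self) -> None:
        self._refs: dict[str, object] = {}
        self._next = 0

    def register(self, handle: object) -> str:
        """Assign a stable ref like ref_1 to a located element."""
        self._next += 1
        ref = f"ref_{self._next}"
        self._refs[ref] = handle
        return ref

    def resolve(self, ref: str) -> object:
        """Later actions (click, type) look elements up by ref."""
        if ref not in self._refs:
            raise KeyError(f"{ref} is stale or unknown; re-query the tree")
        return self._refs[ref]

    def invalidate(self, ref: str) -> None:
        """Called when a single element is removed from the DOM."""
        self._refs.pop(ref, None)

    def invalidate_all(self) -> None:
        """Called on navigation: every ref points at the old page."""
        self._refs.clear()
```

In the form-filling scenario above, one tree query registers the email field, password field, and submit button; the three type/click actions then pass only their refs, and a navigation after submit clears the registry so stale refs fail loudly instead of acting on the wrong element.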
What the Numbers Look Like in Production
Here are token counts from real workflows, comparing the raw HTML approach against the optimized stack:
Single button click: 125,000 tokens (HTML) vs. 600 tokens (find + ref click). That is a 99.5% reduction.
Form with three fields and a submit button: 375,000 tokens (three HTML fetches) vs. 6,000 tokens (one tree query, four ref-based actions). A 98% reduction.
Multi-step social media workflow (navigate, find comment box, type, submit): 250,000 tokens (two HTML fetches minimum) vs. 5,700 tokens (tree + two find queries + ref actions). A 98% reduction, and the optimized version is actually more reliable because it uses semantic matching instead of fragile selectors.
At the API pricing for current-generation models, a workflow that cost $0.50-$1.00 per run with raw HTML drops to $0.01-$0.02 per run. That is the difference between "interesting prototype" and "viable production system."
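The per-run arithmetic is easy to reproduce. The price below is a placeholder, not a quote -- check current model pricing -- but at an assumed $4 per million input tokens the numbers line up with the ranges above.

```python
# Back-of-the-envelope per-run cost at an assumed input price of
# $4 per million tokens (illustrative placeholder, not a real quote).
PRICE_PER_MTOK = 4.00

def run_cost(tokens: int) -> float:
    """Dollar cost of sending this many input tokens to the model."""
    return tokens * PRICE_PER_MTOK / 1_000_000

raw_single_click = run_cost(125_000)   # one click via raw HTML: $0.50
optimized_click = run_cost(600)        # find + ref click: $0.0024
raw_form = run_cost(375_000)           # three HTML fetches: $1.50
optimized_form = run_cost(6_000)       # one tree query + refs: $0.024
```

The ratio holds regardless of the exact per-token price: whatever the model costs, the optimized path spends roughly 1-2% of what the raw HTML path spends.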
Why This Architecture Matters for AI Integration Services
Token cost is the hidden barrier that kills most AI integration services involving browser automation. The demo works great -- an agent navigates a page, clicks some buttons, extracts data. Then you calculate the per-run cost and realize it is 50x too expensive for the client's volume. The project dies or gets descoped to something simpler.
The accessibility tree approach removes that barrier. When your per-run cost drops by 95%, workflows that were economically impossible become routine. Automated data collection across hundreds of pages, social media management at scale, form processing pipelines that run thousands of times per day -- all of these become viable at production economics.
The latency improvement matters too. Sending 125,000 tokens to an LLM takes time -- both in network transfer and inference. Sending 5,000 tokens is faster by a factor that users notice. Workflows feel snappy instead of sluggish, which matters for anything user-facing or time-sensitive.
The Tradeoffs Worth Knowing About
Accessibility trees are not a free lunch. There are real tradeoffs to consider when adopting this approach.
First, you lose visual context. The accessibility tree does not tell the AI agent what the page looks like -- element positions, colors, relative layout. For tasks where visual understanding matters (screenshot comparison, layout verification), you still need screenshots or targeted HTML extraction as a complement.
Second, not every web application has good accessibility markup. Pages with poor semantic HTML produce sparse, unhelpful accessibility trees. The natural language scoring helps compensate, but a page where every interactive element is a generic div with no aria labels is harder to work with than one built with proper semantics.
Third, reference IDs add state. The system has to track which refs point to which DOM elements, and handle invalidation when the page changes. This is manageable complexity, but it is complexity -- the stateless "fetch HTML every time" approach is simpler even if it is wildly more expensive.
The Pattern That Makes AI Browser Automation Viable
The core insight is that AI agents do not need to see the full page to act on it. They need a structured, semantic summary of actionable elements -- which is exactly what an accessibility tree provides. Combine that with natural language element finding (so the agent never needs to craft CSS selectors) and stable reference IDs (so elements are queried once, not on every action), and you get a Claude MCP browser automation system that costs 95% less to operate and runs faster at every step. This is the pattern that moves AI browser control from expensive experiment to production infrastructure.