Large-scale web scraping against bot-protected government portals requires more than a headless browser and a list of proxies. It requires on-demand infrastructure — fresh IP pools spun up in minutes, disposable compute that finishes in hours instead of days, and resume logic that treats the database as the single source of truth. We recently collected over 1.1 million records from a major public data portal protected by Akamai Bot Manager, and the architecture decisions that got us there are worth walking through.

The project had a simple goal and a hard execution path: scrape hundreds of data contracts in parallel from a large government portal, each contract containing thousands of provider records behind aggressive bot detection. No public API. No bulk export. Just a browser-rendered search interface with JavaScript challenges, fingerprinting, and IP reputation scoring standing between us and the data.

Why Raw HTTP Requests Fail Against Modern Bot Detection

Our first instinct — and every scraper's first instinct — was to reverse-engineer the network requests and replay them with requests or httpx. That approach dies instantly against Akamai. The site serves a JavaScript challenge on first load that must execute in a real browser context. It fingerprints the TLS handshake, the browser's navigator properties, canvas rendering, and WebGL output. A raw HTTP client fails every one of those checks.

So we went with Playwright running headless Chromium, wrapped with playwright-stealth to patch the most common detection vectors — navigator.webdriver, Chrome DevTools protocol leaks, missing plugin arrays. This got us past the initial challenge. But it was only the first gate. Similar to how we discovered a hidden API behind a provider portal on a previous project, the real challenge wasn't getting in — it was staying in at scale.
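A minimal sketch of that setup, assuming the Python playwright and playwright-stealth packages (the proxy URL is a placeholder, and stealth_sync is the patch entry point in playwright-stealth's sync API):

```python
def launch_options(proxy_url):
    # Chromium flag that suppresses the most obvious automation signal;
    # playwright-stealth patches the rest after the page is created.
    return {
        "headless": True,
        "proxy": {"server": proxy_url},
        "args": ["--disable-blink-features=AutomationControlled"],
    }

def new_stealth_page(proxy_url):
    # Imported lazily so launch_options stays dependency-free.
    from playwright.sync_api import sync_playwright
    from playwright_stealth import stealth_sync

    pw = sync_playwright().start()
    browser = pw.chromium.launch(**launch_options(proxy_url))
    page = browser.new_context().new_page()
    stealth_sync(page)  # patches navigator.webdriver, plugin arrays, CDP leaks
    return pw, browser, page
```

The stealth patches run per-page, so every fresh context needs them applied again before first navigation.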

How a Fixed Proxy Pool Became a Liability Overnight

We started with a pool of 19 UK datacenter proxy IPs. For the first few hours, everything worked. Workers rotated through the pool, each browser session picked a fresh proxy, and data flowed in. Then Akamai's IP reputation system caught up.

Datacenter IPs are inherently suspicious to bot detection systems — they don't belong to residential ISPs, and when 19 of them start hammering the same government portal in parallel, the pattern is obvious. Within 12 hours, the entire pool was flagged. Requests started returning CAPTCHAs, then hard blocks. We needed fresh IPs, and we needed them fast.

Why We Spun Up 37 VMs Instead of Buying Residential Proxies

The standard play here is to buy residential proxy bandwidth from a provider like Bright Data or Oxylabs. That works, but it's expensive at scale — residential bandwidth runs $8-15 per GB, and browser-rendered pages with JavaScript challenges are heavy. We were looking at thousands of dollars in proxy costs for a multi-day scrape.

Instead, we used the Linode API to programmatically spin up 37 fresh Nanode instances across US, EU, and Asia Pacific regions. Each VM was a $5/month Nanode — 1GB RAM, 1 CPU, more than enough to run a Squid forward proxy. We wrote a deployment script that SSH'd into each VM in parallel, installed Squid, configured authentication, and registered the IP back to our proxy pool. The entire fleet was operational in under four minutes.
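The provisioning half of that script reduces to a loop against Linode's instance-creation endpoint. A sketch (the endpoint, region IDs, and g6-nanode-1 type are Linode's published API values; the token and root password are placeholders, and the Squid install over SSH is omitted):

```python
import itertools

REGIONS = ["us-east", "eu-west", "ap-south"]  # Linode region IDs

def fleet_plan(n, regions=REGIONS):
    # Spread n Nanodes round-robin across the regions.
    cycle = itertools.cycle(regions)
    return [
        {"type": "g6-nanode-1", "region": next(cycle), "image": "linode/ubuntu22.04"}
        for _ in range(n)
    ]

def provision(token, plan, root_pass):
    # One POST per VM; the response includes the public IPv4 address
    # we register back into the proxy pool.
    import requests  # lazy so fleet_plan stays dependency-free

    ips = []
    for spec in plan:
        r = requests.post(
            "https://api.linode.com/v4/linode/instances",
            headers={"Authorization": f"Bearer {token}"},
            json={**spec, "root_pass": root_pass},
            timeout=30,
        )
        r.raise_for_status()
        ips.append(r.json()["ipv4"][0])
    return ips
```

In practice the POSTs would also run in parallel, matching the parallel SSH deployment described above.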

Total cost of the 37-node proxy fleet: roughly $185/month. But we only needed it for a few days. Spin up, scrape, tear down. The per-run cost was closer to $30. Compare that to residential proxy pricing for the same volume of traffic and the math isn't even close.

What a Multi-Worker Browser Scraper Architecture Looks Like

The scraper ran on a dedicated 8-core, 150GB RAM cloud machine. That sounds like overkill until you realize what headless Chromium actually consumes — each browser context eats 200-400MB of RAM, and we needed multiple workers running simultaneously across hundreds of contracts.

The architecture was straightforward: a task queue of contracts to scrape, a pool of browser workers that each claimed a contract, and a database that tracked completion state. Each worker launched a fresh browser context with a randomly assigned proxy from the fleet, navigated to the portal, and started querying.
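Stripped to its skeleton, that architecture fits in a few dozen lines. A sketch, with scrape_contract as a stub standing in for the real browser work and an illustrative proxy list:

```python
import queue
import random
import sqlite3
import threading

PROXIES = ["http://203.0.113.10:3128", "http://203.0.113.11:3128"]  # fleet IPs

def scrape_contract(contract_id, proxy):
    # Stub for the real work: launch a browser context on `proxy`,
    # query the portal, return provider rows for this contract.
    return [(contract_id, f"provider-{i}") for i in range(3)]

def worker(tasks, db_path):
    # Each worker opens its own SQLite connection (connections can't be
    # shared across threads) and drains the queue until it's empty.
    db = sqlite3.connect(db_path)
    while True:
        try:
            contract = tasks.get_nowait()
        except queue.Empty:
            return
        rows = scrape_contract(contract, random.choice(PROXIES))
        db.executemany("INSERT INTO providers VALUES (?, ?)", rows)
        db.commit()

def run(contracts, db_path, n_workers=4):
    db = sqlite3.connect(db_path)
    db.execute("CREATE TABLE IF NOT EXISTS providers (contract TEXT, name TEXT)")
    db.commit()
    tasks = queue.Queue()
    for c in contracts:
        tasks.put(c)
    threads = [threading.Thread(target=worker, args=(tasks, db_path))
               for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

The production version used more workers and a heavier database, but the shape — queue in, workers claim, database records — is the same.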

We chose the expensive machine deliberately. A cheaper 2-core box could run the same code — it would just take four to five days instead of one night. Engineer time spent babysitting a slow scrape, handling timeouts, and restarting failed workers costs far more than the hourly premium on a machine that finishes while you sleep. This is a tradeoff we make on every data collection project, and it has never been wrong.

How Parallel ZIP Code Queries Turned 77 Seconds Into 3

The portal's search interface accepted ZIP code queries — enter a ZIP, get back providers in that area. The naive approach is sequential: query ZIP 1, wait for results, query ZIP 2, wait, repeat. For a contract with 30+ relevant ZIP codes, that's over a minute of serial waiting.

We noticed the portal's internal API accepted queries independently — each ZIP code search fired its own XHR request. So instead of driving the search UI one query at a time, we fired all ZIP code queries in parallel with Promise.all from inside a single browser tab, hitting the same internal endpoints the page itself uses. Thirty queries that took 77 seconds sequentially completed in under 3 seconds.
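The fan-out looked roughly like this — the /api/providers/search route is a placeholder for the portal's real internal endpoint, but the pattern (JS evaluated inside the page, so every request carries the page's own cookies and sensor data) is the point:

```python
# JavaScript evaluated in-page: all ZIP queries fire concurrently and
# resolve together, instead of one round-trip per ZIP.
SEARCH_JS = """
async (zips) => {
  const results = await Promise.all(
    zips.map(z =>
      fetch(`/api/providers/search?zip=${z}`).then(r => r.json())
    )
  );
  return results.flat();
}
"""

def search_all_zips(page, zips):
    # `page` is a Playwright Page that has already cleared the challenge;
    # evaluate() passes the ZIP list in as the function argument.
    return page.evaluate(SEARCH_JS, zips)
```

Because the fetches originate from the authenticated page context, they pass the same bot checks the UI's own XHRs do.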

This is the kind of optimization that only matters at scale, and at scale it matters enormously. Across 300+ contracts with dozens of ZIP codes each, we saved hours of wall-clock time — which on an expensive hourly machine translates directly to dollars saved.

Why the Database Is a Better Checkpoint Than a File

Most scraper tutorials use a checkpoint file — a JSON or CSV that tracks which pages have been scraped. This breaks in exactly the scenarios where you need it most: when the scraper crashes mid-write, when you're running multiple workers, or when you need to resume after a partial failure.

We skipped the checkpoint file entirely. The database is the checkpoint. Before a worker starts a contract, it queries the database: how many records exist for this contract? If the count matches the expected total, skip it. If it's partial, resume from where it left off. If it's zero, start fresh. No file locking, no corruption risk, no sync issues between workers. The data store and the progress tracker are the same thing.
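The resume check is a single COUNT query. A sketch, with illustrative table and column names:

```python
import sqlite3

def plan_contract(db, contract, expected):
    # The destination table doubles as the checkpoint: count what's
    # already stored and decide whether to skip, resume, or start fresh.
    (have,) = db.execute(
        "SELECT COUNT(*) FROM providers WHERE contract = ?", (contract,)
    ).fetchone()
    if have >= expected:
        return "skip", have
    return ("resume", have) if have else ("start", 0)
```

The count is also what multiple workers coordinate on: a worker that claims an already-finished contract sees the full count and moves on without touching it.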

This pattern — using your destination as your progress tracker — is something we apply on every production scraping system we build. It's simpler, more reliable, and eliminates an entire class of bugs.

What 1.1 Million Records Across 300+ Contracts Taught Us

The scrape ran overnight and into the next morning. By the time we checked in, we had 1.1 million records across 300+ contracts loaded into the database, with the remaining contracts still churning through. A few observations from the run:

Fresh IPs have a half-life. Even the 37-node fleet started seeing increased challenge rates after 18-20 hours of continuous use. For a multi-day scrape, you'd want to rotate the fleet — tear down stale VMs and spin up fresh ones every 12-16 hours. The Linode API makes this trivial to automate.
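The rotation sweep reduces to an age check plus API calls. A sketch — the 14-hour threshold splits the 12-16 hour window, the fleet-record shape is our own convention, and the teardown would go against Linode's DELETE /v4/linode/instances/{id} endpoint:

```python
import time

ROTATE_AFTER_S = 14 * 3600  # midpoint of the 12-16 hour window

def stale_nodes(fleet, now=None):
    # Fleet entries look like {"id": 123, "created": epoch_seconds};
    # anything older than the threshold gets torn down and replaced
    # with a freshly provisioned VM (and a fresh IP).
    now = time.time() if now is None else now
    return [n for n in fleet if now - n["created"] > ROTATE_AFTER_S]
```

Run on a timer, this keeps the pool's average IP age — and therefore its reputation score — low for the life of the scrape.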

Browser memory leaks are real. Chromium contexts that run for hours accumulate memory. We added a hard restart cycle — every 50 contracts, the worker kills its browser and launches a fresh one. This kept memory usage stable across the full run.
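The restart cycle is a small wrapper around browser launch. A sketch, with the launch callable standing in for Playwright's chromium.launch:

```python
RESTART_EVERY = 50  # contracts per browser lifetime

class BrowserCycler:
    def __init__(self, launch):
        self._launch = launch   # callable returning a fresh browser
        self._browser = None
        self._used = 0

    def get(self):
        # Kill and relaunch the browser every RESTART_EVERY contracts
        # to reclaim memory leaked by long-lived Chromium contexts.
        if self._browser is None or self._used >= RESTART_EVERY:
            if self._browser is not None:
                self._browser.close()
            self._browser = self._launch()
            self._used = 0
        self._used += 1
        return self._browser
```

Each worker owns one cycler and calls get() at the top of every contract, so the restart is invisible to the rest of the pipeline.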

Rate limiting yourself is cheaper than getting blocked. We added deliberate delays between queries — not because we were being polite, but because steady, human-paced traffic triggers fewer bot detection heuristics than burst patterns. Slowing down by 20% saved us from IP burns that would have cost hours of re-provisioning.
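The pacing itself is one helper — the base and jitter values below are illustrative, not the ones tuned for this portal:

```python
import random
import time

def human_pause(base=1.5, jitter=0.6):
    # Sleep base +/- jitter seconds; the uniform jitter breaks up the
    # fixed-interval cadence that bot-detection heuristics key on.
    delay = base + random.uniform(-jitter, jitter)
    time.sleep(delay)
    return delay
```

Called between queries, this trades a small, predictable slowdown for a much lower chance of burning an IP mid-run.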

The On-Demand Infrastructure Pattern

The real takeaway from this project isn't about scraping — it's about infrastructure philosophy. We spent roughly $30 on disposable proxy VMs and a few hundred on a powerful compute machine for one night. The entire 1.1 million record collection cost less than a single month of residential proxy service would have. On-demand infrastructure that you spin up, use hard, and tear down beats permanent infrastructure you maintain and pay for continuously, and for large-scale web scraping — or any other burst-compute workload — that is the pattern to reach for.