When a healthcare insurer offers a "Find a Doctor" portal but no public API, most engineering teams default to building a web scraper. We took a different approach: we used a headless browser to intercept every network request the portal made in real time, discovered the entire search experience was powered by a third-party service with publicly exposed credentials, and extracted 6.2 million provider records in a few hours — no scraping required.
Healthcare data extraction at this scale is a common challenge for health-tech companies, payers, and analytics platforms. Provider directories are locked behind search UIs designed for one-at-a-time lookups, not bulk access. But modern web applications are built in layers, and when you know how to peel them back, the data layer is often far more accessible than the UI suggests.
Why We Didn't Build a Scraper
The task was straightforward on paper: gather a comprehensive list of healthcare providers — with NPI numbers, specialties, network assignments, and addresses — from a major Medicare Advantage insurer's provider directory. The directory had no documented API, no bulk download option, and no data export feature. Just a search box.
A traditional scraper would mean parsing HTML, handling JavaScript rendering, managing pagination through a UI designed for human users, and fighting rate limits — all while the insurer could change their DOM structure at any time and break everything. We estimated weeks of development and ongoing maintenance for a brittle solution.
Instead, we asked a different question: what is the browser actually doing when someone searches this portal? The frontend has to get its data from somewhere. If we could see those requests, we could skip the UI entirely.
How Intercepting Live Network Traffic Changes the Game
We spun up a headless browser — not to scrape content from the page, but to act as a passive observer of its own traffic. Think of the interception hook as a tap between the page and the network: every HTTP request and response flows through it in real time. This is a well-documented capability in tools like Puppeteer and Playwright, where request interception lets you observe exactly which requests and responses are exchanged during a page interaction.
We pointed the headless browser at the insurer's provider search portal, performed a simple search, and watched the network tab light up. What we saw immediately changed our entire approach.
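A minimal sketch of that observation step, using Playwright's sync API. The portal URL is a placeholder, and the Algolia hostname filter reflects what we saw in the traffic — adjust both for the site under study:

```python
from urllib.parse import urlparse


def is_search_backend(url: str) -> bool:
    """True when a request targets Algolia's search infrastructure.

    Algolia serves queries from *.algolia.net (plus *.algolianet.com
    fallback hosts), so a hostname suffix check cleanly separates
    search calls from the portal's own traffic.
    """
    host = urlparse(url).hostname or ""
    return host.endswith(".algolia.net") or host.endswith(".algolianet.com")


def observe_search_traffic(portal_url: str) -> None:
    """Load the portal headlessly and log every search-backend request.

    Requires `pip install playwright` and `playwright install chromium`.
    """
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        # Passive observation: log matching requests without blocking any.
        page.on(
            "request",
            lambda req: print(req.method, req.url)
            if is_search_backend(req.url)
            else None,
        )
        page.goto(portal_url)
        page.wait_for_timeout(5_000)  # let the SPA fire its search calls
        browser.close()
```

Performing one search in the page while this listener is attached is enough to surface every backend endpoint the frontend depends on.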
The search requests weren't going to the insurer's own backend. They were going to *.algolia.net — the API endpoint for Algolia, a popular search-as-a-service platform. The insurer's entire "Find a Doctor" experience was a React frontend making direct API calls to an Algolia index. The insurer's own servers weren't involved in the search at all.
Why the Credentials Were Sitting in Plain Sight
Here's where it gets interesting. Algolia's architecture is designed for client-side search. The browser needs to query Algolia directly — that's the whole point of the product. To make that work, the Application ID and a Search API Key must be embedded in the frontend JavaScript. Algolia's own documentation confirms that search keys are designed to be public. They're read-only by design — you can search, but you can't modify or delete anything.
We found both credentials hardcoded in the portal's JavaScript bundle. No obfuscation, no token exchange, no authentication wall. This wasn't a security oversight — it's literally how the platform is architected. Every website using Algolia for client-side search has these keys in their frontend code.
With those two values, we could query the insurer's provider index directly via Algolia's REST API, bypassing the portal's UI entirely.
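In concrete terms, a query against Algolia's public search REST API is a single HTTPS POST. A standard-library sketch — the index name `providers` is illustrative, and the `{app_id}-dsn.algolia.net` read host is Algolia's documented convention for search traffic:

```python
import json
import urllib.request


def build_query_request(app_id: str, api_key: str, index: str,
                        params: str) -> urllib.request.Request:
    """Build a POST against Algolia's search REST API.

    app_id and api_key are the public values found in the portal's
    JavaScript bundle; params is an Algolia query string such as
    "query=cardiology&hitsPerPage=100".
    """
    url = f"https://{app_id}-dsn.algolia.net/1/indexes/{index}/query"
    body = json.dumps({"params": params}).encode()
    return urllib.request.Request(
        url,
        data=body,
        headers={
            "X-Algolia-Application-Id": app_id,
            "X-Algolia-API-Key": api_key,
            "Content-Type": "application/json",
        },
        method="POST",
    )


def search(app_id: str, api_key: str, index: str, params: str) -> dict:
    """Execute one search and return the decoded JSON response."""
    req = build_query_request(app_id, api_key, index, params)
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

No session, no cookies, no rendered page — just the two credentials the frontend already ships to every visitor.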
What 6.2 Million Records Look Like in Clean JSON
Once we had direct API access, the scope of the data became clear. The Algolia index contained 6.2 million provider records spanning 29 states. Each record included the provider's full name, NPI number, specialty, practice addresses, phone numbers, network assignments, and plan affiliations — all returned as structured JSON.
Compare that to what the portal's UI showed: a paginated list of cards, ten at a time, with no way to export or filter beyond basic search terms. The UI was designed for a patient looking up their doctor. The underlying data layer held the entire provider network.
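For a sense of shape, a record with those fields might look like the following — the field names and values here are illustrative, not the insurer's actual schema:

```json
{
  "objectID": "prov-00012345",
  "npi": "1234567890",
  "full_name": "Jane Doe, MD",
  "specialty": "Cardiology",
  "addresses": [
    {"street": "100 Main St", "city": "Springfield", "state": "IL", "zip": "62701"}
  ],
  "phone": "555-555-0100",
  "networks": ["Medicare Advantage PPO"],
  "plans": ["Plan A", "Plan B"]
}
```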
We chose Algolia's browse endpoint rather than its standard search endpoint. The browse API is designed specifically for iterating over large datasets — it uses cursor-based pagination rather than offset-based, which means you can traverse millions of records without hitting the 1,000-hit pagination limit that applies to standard search queries.
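The cursor loop itself is simple. A sketch with the HTTP transport injected as a callable, so the pagination logic stands alone — a real `fetch_page` would POST to `/1/indexes/{index}/browse`, passing the previous response's cursor:

```python
from typing import Callable, Iterator, Optional


def browse_all(fetch_page: Callable[[Optional[str]], dict]) -> Iterator[dict]:
    """Yield every record in an index via cursor-based pagination.

    fetch_page takes the previous response's cursor (None for the first
    page) and returns the decoded JSON of one browse response. Algolia's
    browse responses carry a "cursor" field until the index is exhausted,
    at which point the field is absent.
    """
    cursor = None
    while True:
        page = fetch_page(cursor)
        yield from page.get("hits", [])
        cursor = page.get("cursor")
        if cursor is None:
            break
```

Because each cursor encodes a position in the index rather than an offset, the loop never revisits or skips records, no matter how large the dataset.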
Why Resume Capability Matters at Scale
Extracting 6.2 million records isn't a single API call — it's thousands of paginated requests over several hours. Network interruptions, rate limits, and process crashes are inevitable at that scale. We built the extraction script with resume capability from the start: every batch of records was written to disk immediately, and the cursor position was checkpointed after each successful page.
If the script stopped for any reason — a network timeout, a rate limit response, even a machine restart — it could pick up exactly where it left off. No re-downloading records we already had, no duplicate detection needed. This is a pattern we apply to any large-scale data extraction, similar to the approach we used when building a production LinkedIn profile scraper with Python and the Voyager API, where session persistence and warm/cold path architecture kept extraction reliable across millions of profiles.
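The pattern reduces to two rules: persist every batch before advancing, and checkpoint the cursor only after its page is safely on disk. A sketch under those rules — file names are placeholders, and `fetch_page` is the same browse-response callable as above:

```python
import json
import os

CHECKPOINT = "browse.cursor"     # holds the last safely-completed cursor
OUTPUT = "providers.jsonl"       # records appended as JSON lines


def load_cursor():
    """Return the checkpointed cursor, or None on a fresh start."""
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            return f.read().strip() or None
    return None


def extract(fetch_page) -> None:
    """Stream records to disk, checkpointing after each successful page.

    On restart, extraction resumes from the checkpointed cursor, so
    pages already written are never re-fetched.
    """
    cursor = load_cursor()
    while True:
        page = fetch_page(cursor)
        with open(OUTPUT, "a") as out:
            for hit in page.get("hits", []):
                out.write(json.dumps(hit) + "\n")
        cursor = page.get("cursor")
        if cursor is None:
            break
        # Checkpoint only after the full page has landed on disk.
        with open(CHECKPOINT, "w") as f:
            f.write(cursor)
    if os.path.exists(CHECKPOINT):
        os.remove(CHECKPOINT)  # clean finish: nothing left to resume
```

Writing records before the checkpoint gives at-least-once delivery — a crash in the narrow window between the two can repeat one page, but can never lose one.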
The entire extraction completed in under four hours. A traditional scraper attempting the same volume through the portal's UI would have taken weeks — if it worked at all.
The Pattern: Every Frontend Has a Data Layer
This project reinforced a principle we apply across all our data extraction work: when there's no API, the API still exists — you just have to find it. Modern web applications are rarely monolithic. They're assembled from services: a search provider here, an authentication service there, a CDN for assets, a separate API for dynamic content. The frontend is just the presentation layer stitching these services together.
When you intercept network traffic instead of scraping the DOM, you're looking at the application the way its own engineers see it — as a collection of API calls. This is the same mindset we apply in our browser automation work with Claude MCP, where understanding the underlying structure of a page (its accessibility tree, not its rendered HTML) unlocks capabilities that brute-force approaches can't match.
What We Chose Not to Do
We considered several alternative approaches before settling on the headless interception strategy. A full DOM scraper was the obvious default, but the maintenance burden was unacceptable for a one-time extraction of this scale. We also considered using the CMS NPPES public download files — the federal NPI registry does offer bulk data — but those files don't include network assignments or plan affiliations, which were critical to the project's requirements.
We also deliberately chose not to parallelize the extraction aggressively. We could have run dozens of concurrent requests and finished in minutes instead of hours. But hammering a third-party API — even one with public credentials — is poor engineering practice. We rate-limited ourselves to a respectful cadence, which also reduced the risk of triggering any IP-level blocks.
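The cadence itself needs almost no machinery. A minimal pacer, with the clock and sleep functions injectable so the timing logic can be tested without waiting — the two-second interval below is illustrative, not the cadence we actually used:

```python
import time


class Pacer:
    """Enforce a minimum interval between successive requests.

    now/sleep default to the real clock; inject fakes to test the
    delay arithmetic deterministically.
    """

    def __init__(self, interval: float, now=time.monotonic, sleep=time.sleep):
        self.interval = interval
        self.now = now
        self.sleep = sleep
        self._last = None

    def wait(self) -> None:
        """Block until at least `interval` seconds since the last call."""
        t = self.now()
        if self._last is not None:
            remaining = self.interval - (t - self._last)
            if remaining > 0:
                self.sleep(remaining)
                t += remaining
        self._last = t


# Usage: call pacer.wait() before each browse request.
pacer = Pacer(interval=2.0)
```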
Healthcare Data Extraction Without Building a Scraper
The lesson from this project isn't about Algolia specifically — it's about methodology. Healthcare data extraction doesn't have to mean building and maintaining fragile scrapers that break every time a portal redesigns its UI. By treating the browser as a network traffic observatory rather than a screen reader, we turned a weeks-long scraping project into a few hours of clean API consumption. The data was always there, structured and accessible. We just had to look at the right layer.