When a client handed us a CSV mapping 3,837 health events to source URLs from six public health providers, the obvious move was to import it and start querying. We didn't. Instead we used the URLs themselves as a discovery surface, reverse-engineered each provider's content structure, and built a normalized health knowledge base that ended up roughly an order of magnitude larger than the input data — and dramatically cleaner.

This is post one of two on that build. Here we cover the discovery and normalization stages: how we audited the source CSV, found the seam in each provider's URL structure, scraped the full catalogs, and merged them with four official medical code tables into a single portable reference database. The next post covers the embedding and semantic-mapping layer that sits on top.

Why we didn't trust the input CSV

The CSV had 4,317 rows covering 3,837 unique health events across six providers: MedlinePlus, the CDC, NHS.UK, the WHO, the National Institute on Aging, and HealthInAging. On paper that's a reasonable starting catalog. In practice, roughly 90% of events had only one source URL attached, and most providers were severely under-represented relative to what they actually publish.
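A minimal version of that audit is a few lines of pandas. The column names here (event_id, source_url, provider) are illustrative stand-ins, not the client's actual headers:

```python
import pandas as pd

# Column names are illustrative assumptions, not the client's actual headers.
df = pd.read_csv("health_events.csv")  # columns: event_id, source_url, provider

urls_per_event = df.groupby("event_id")["source_url"].nunique()
single_source_share = (urls_per_event == 1).mean()

print(f"{len(df)} rows, {urls_per_event.size} unique events")
print(f"{single_source_share:.0%} of events have exactly one source URL")
print(df["provider"].value_counts())  # how thin each provider's coverage is
```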

That distribution is a tell. A real-world catalog of human health events — symptoms, conditions, procedures, screenings, lifestyle topics — should look richly cross-referenced, not 90% single-source. What we had wasn't a clean knowledge base. It was the residue of whatever the previous data pass happened to find. We needed the catalog the providers actually publish, not the slice that survived a partial scrape.

How URL patterns turned into a catalog enumerator

Here is the part that paid off. Before scraping anything, we audited the URLs already in the CSV and lined them up by provider. Five of the six providers turned out to use perfectly consistent topic-page URL patterns — predictable slugs under predictable path prefixes, with no query strings or per-page variation. One provider used a slightly more complex hierarchy, but still deterministic.

That meant we didn't need to crawl each site link-by-link. We needed to enumerate the slug space. For five providers, that's a sitemap parse or an index-page scrape followed by URL templating. For MedlinePlus specifically, the National Library of Medicine publishes an official set of XML files covering virtually all health topic pages — 1,080 entries at the time of the audit. We didn't have to discover anything; we just had to read the index.
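In practice that enumeration is very little code. The sketch below shows the two shapes it takes — parsing a downloaded XML index for MedlinePlus and slug templating for the flat-pattern providers. The file name, the `<health-topic url="...">` structure, and the template URL are assumptions for illustration, not the providers' documented schemas:

```python
import xml.etree.ElementTree as ET

# MedlinePlus: read a locally downloaded copy of the NLM health-topics XML index.
# The file name and the <health-topic url="..."> element shape are assumptions;
# check the published XML for the actual schema.
tree = ET.parse("mplus_topics.xml")
medlineplus_urls = [t.get("url") for t in tree.iter("health-topic") if t.get("url")]

# The flat-pattern providers reduce to URL templating: scrape each A-Z index once,
# collect the slugs, then expand them under the known path prefix.
def expand_slugs(template: str, slugs: list[str]) -> list[str]:
    # e.g. template = "https://provider.example/conditions/{slug}/"
    return [template.format(slug=slug) for slug in slugs]
```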

The shift in mindset matters more than the technique. The CSV told us about events. The URL patterns told us about the universe of events each provider considers worth publishing. Those are very different things.

The decision: hybrid HTTP/browser crawling

Some of the providers serve clean static HTML. Others render content via JavaScript or gate parts of the page behind interactive widgets. We didn't want to run a full headless browser against the static-HTML sites — it's wasteful, slow, and introduces non-determinism we don't need — and we didn't want to maintain two separate scrapers either.

We routed the crawl through our internal headless crawler service: a FastAPI front end wrapping Crawlee's Python library, with a hybrid mode that picks per-request between a plain HTTP fetch and a browser-rendered fetch. Static pages take the cheap path. JS-heavy pages get the browser. Same queue, same retry logic, same output schema. We've used this same service across other large scraping engagements, and the per-provider config is usually a 30-line file.
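The internal service isn't public, but the per-request routing decision it makes looks roughly like this sketch — plain httpx for the cheap path, Playwright for the browser path. The needs-a-browser check and the provider set are illustrative placeholders, not the real config:

```python
import httpx
from playwright.async_api import async_playwright

# Illustrative per-provider config: which providers need a rendered DOM.
JS_HEAVY_PROVIDERS = {"example-js-provider"}

async def fetch(url: str, provider: str) -> str:
    """Return page HTML via the cheapest fetch mode that works for this provider."""
    if provider not in JS_HEAVY_PROVIDERS:
        # Cheap path: plain HTTP fetch for static-HTML providers.
        async with httpx.AsyncClient(follow_redirects=True, timeout=30) as client:
            resp = await client.get(url)
            resp.raise_for_status()
            return resp.text
    # Expensive path: browser-rendered fetch for JS-heavy providers.
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        page = await browser.new_page()
        await page.goto(url, wait_until="networkidle")
        html = await page.content()
        await browser.close()
        return html
```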

We chose this hybrid model over a pure-browser approach because the cost difference at the catalog scale we needed (thousands of pages per provider) is not subtle — browser-mode requests are 5–10× slower and use proportionally more memory. For a deeper look at how we run this kind of crawl across millions of records, we wrote about the on-demand proxy fleet we built for large-scale web scraping.

What we actually scraped per provider

For each provider we extracted: the canonical topic title, the cleaned body text, any structured sub-sections (overview, symptoms, treatment, prevention — these vary by provider), the publication or last-updated date when available, and a stable provider-side identifier. Everything got normalized into a single row schema before storage.

The "before storage" part is doing real work. Each provider has its own opinions about heading hierarchy, list formatting, sidebar boilerplate, and what counts as "the article". A condition page on NHS.UK does not look like one on the WHO site. Without normalization at ingestion time, downstream consumers — search, embedding, retrieval — would be paying that cost on every query forever. We took the hit once, at ingest.

Why we copied four official medical code tables in alongside the scrape

Health events on their own are useful. Health events linked to billing and coding standards are dramatically more useful — both for downstream applications and for any clinical or insurance-adjacent feature. We folded four official US medical code tables into the same reference database:

ICD-10-CM — the diagnosis code set used in the US for clinical and billing purposes. We loaded 74,719 codes. (For 2024, the official total is around 78,044 codes following the FY 2024 update; our snapshot predates the latest revision.)

ICD-10-PCS — the procedure coding system used for inpatient hospital procedures. 79,115 codes. ICD-10-PCS is structurally different from ICD-10-CM: it's a multi-axis system where each character position has a specific meaning (body system, root operation, approach, etc.) rather than a flat hierarchy.

CPT — Current Procedural Terminology, maintained by the AMA. Used for outpatient and physician services. 1,358 codes in our slice (CPT is licensed; the practical "all CPT" set is much larger and gated behind licensing).

HCPCS — Healthcare Common Procedure Coding System (Level II), the Medicare-aligned code set covering supplies, ambulance services, and items not covered by CPT. 7,376 codes.

These four tables were already living in a Postgres instance from prior work. We exported them as-is into the same SQLite database alongside the scraped health-topic content. No transformations on the codes themselves — they're authoritative reference data, and we didn't want to introduce any silent edits.
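Mechanically, that export is a straight copy. A minimal sketch using pandas to stream each table from Postgres into the SQLite file without touching the rows — connection strings and table names are placeholders:

```python
import sqlite3

import pandas as pd
from sqlalchemy import create_engine

# Placeholder connection details; real values live in the pipeline config.
pg = create_engine("postgresql://user:pass@localhost:5432/reference")
lite = sqlite3.connect("health_reference.db")

# Copy each code table verbatim -- no transformations on authoritative data.
for table in ("icd10cm", "icd10pcs", "cpt", "hcpcs"):
    df = pd.read_sql_table(table, pg)
    df.to_sql(table, lite, index=False, if_exists="replace")

lite.close()
```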

Why SQLite, and why we deferred embeddings

Two architectural calls worth flagging, because they're the kind of thing that's cheap to get right early and expensive to undo.

First, we picked SQLite over Postgres for the unified reference DB. The whole point of this artifact is that it's a self-contained, portable snapshot — every consumer (a worker, a notebook, a local test, a future container) gets the same file and reads from it directly. No connection strings, no network round-trips, no "is the DB seeded yet" checks in CI. Postgres makes sense when multiple writers are mutating shared state. This database has one writer (the ingest pipeline) and many readers, and it ships as an artifact.

Second, we deliberately did not generate embeddings during this stage. Embeddings are cheap to produce later, and — more importantly — the embedding model has to match whatever the query-time service uses. Baking in a specific model now would either lock the future system to that choice or guarantee a re-embedding pass when it changes. We left the schema with embedding columns reserved and empty. When the retrieval service is finalized, we'll do a single batch embed.
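Concretely, "reserved and empty" just means the column exists in the schema and nothing writes to it yet. A sketch of what the topic table's DDL might look like, with illustrative column names:

```python
import sqlite3

conn = sqlite3.connect("health_reference.db")
conn.execute("""
CREATE TABLE IF NOT EXISTS health_topics (
    provider      TEXT NOT NULL,   -- e.g. 'medlineplus'
    provider_id   TEXT NOT NULL,   -- stable provider-side identifier
    url           TEXT NOT NULL,
    title         TEXT NOT NULL,
    body_text     TEXT NOT NULL,
    sections_json TEXT,            -- structured sub-sections, serialized
    last_updated  TEXT,
    embedding     BLOB,            -- reserved: stays NULL until the single batch embed
    PRIMARY KEY (provider, provider_id)
)
""")
conn.commit()
conn.close()
```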

This is the same logic behind a broader pattern we've written about: why we don't use vector search as the default for AI knowledge bases. Embeddings are a tool, not the architecture.

The numbers, end to end

The input CSV had 4,317 rows mapping 3,837 unique events to URLs across six providers, ~90% single-sourced. The discovery pass — five URL patterns plus one official sitemap — gave us the actual published catalog per provider. The normalized output combines that scraped content with 74,719 ICD-10-CM codes, 79,115 ICD-10-PCS codes, 1,358 CPT codes, and 7,376 HCPCS codes, all in a single SQLite file with a unified schema and reserved-but-empty embedding columns.

The total published catalog after the scrape is multiples of the original CSV's 3,837 events. We're deliberately not quoting the exact final count here because it's a moving target — providers add and retire pages constantly, and the value of this database is that it can be re-snapshotted on demand, not that any one snapshot is canonical.

What the next post covers

With the reference database in place, the next problem is the interesting one: given a free-text health event from a user or an upstream system, how do we map it accurately to the right ICD-10-CM diagnosis, the right CPT or HCPCS procedure, and the right authoritative source page? That's where embeddings finally earn their keep — and where the model choice, the chunking strategy, and the disambiguation logic stop being theoretical. Post two breaks down the vector similarity layer and the retrieval pipeline that sits on top of this normalized health knowledge base.