HTML email receipts from SaaS vendors like Stripe, Paddle, and PayPal are the silent failure mode in automated expense detection. Most expense automation pipelines assume receipts arrive as PDF attachments — but an increasing number of vendors send receipts as styled HTML embedded directly in the email body, with no attachment at all. If your system only parses attachments, it silently misses a growing share of real business expenses.
We hit this exact wall on an AI-powered expense detection system we built for a client. The system ingested business emails via IMAP, ran them through a two-step LLM workflow, and uploaded detected expenses to their accounting platform. It worked beautifully — until we realized it was missing every Stripe charge, every OpenAI billing notification, and every Paddle subscription receipt. The reason: none of those vendors attach a PDF. They embed the receipt in the HTML email body itself.
Why Modern SaaS Vendors Stopped Attaching PDFs
The shift is practical. Stripe's receipt system sends a styled HTML email with a link to a hosted receipt page — no PDF, no attachment. Paddle, PayPal, and OpenAI follow the same pattern. The receipt is the email. From the vendor's perspective this makes sense: fewer support tickets about corrupted attachments, better rendering across email clients, and easier branding control.
From an automation perspective, it is a nightmare. A standard email parsing pipeline checks for attachments (typically filtering by MIME type for PDFs and images), extracts them, and runs analysis. When there is no attachment, the pipeline has nothing to analyze beyond the raw email text — which for HTML emails is a mess of nested tables, inline CSS, and tracking pixels that defeats simple text extraction.
Why Plain Text Extraction From HTML Emails Falls Short
The obvious first attempt is to strip HTML tags and feed the plain text to an LLM for expense detection. We tried this. The results were unreliable for a specific reason: HTML email receipts use visual layout to convey meaning. The amount "$49.00" only makes sense as the charge amount because of its position relative to the line items above it. Strip the HTML and you get a flat string where "$49.00" appears alongside tax amounts, subtotals, and previous balance figures with no structural context.
For simple receipts this might work. But production systems need to handle Stripe's multi-line-item invoices, PayPal's currency-converted transactions, and Paddle's VAT-inclusive European receipts. The layout carries information that plain text loses.
The Fix: Rendering HTML Emails to Images for Vision Analysis
The solution we landed on converts the HTML email body into a PNG image, then feeds that image to a vision-capable LLM. This preserves the visual layout — the same layout a human would see when opening the email — and lets the model read the receipt the way a person would: visually, with spatial context intact.
We chose WeasyPrint for the HTML-to-image conversion. The rendering pipeline looks roughly like this:
# Pseudocode — simplified from production
def render_email_to_image(html_body):
# WeasyPrint renders HTML+CSS to a document
doc = weasyprint.HTML(string=html_body).render()
# Extract first page as PNG
# (receipts are almost always single-page)
page = doc.pages[0]
image_bytes = page.write_png()
return base64_encode(image_bytes)
The rendered PNG then gets passed to the vision step of our LLM workflow as a base64-encoded image. The model sees the receipt exactly as a human would — line items, totals, vendor logo, invoice number — all in their correct visual positions.
Why We Chose a Two-Step LLM Pipeline Over a Single Pass
We run expense detection as a conditional two-step workflow, not a single LLM call. Step one uses a lightweight model (gpt-4o-mini) to analyze the email text and determine whether the email is expense-related at all. Step two — the expensive vision step using gpt-4o — only fires if step one flags the email and an attachment or HTML receipt is present.
This matters for cost. The text-only classification step costs roughly $0.0005 per email. The vision analysis step costs roughly $0.0036 per email. If you ran vision on every email in a busy inbox processing 1,000 messages per month, you would spend $3.60 on vision alone. With conditional routing, the actual cost drops to around $2.35 total — because most emails are not expenses and never hit the vision step. For teams evaluating state machines versus task queues for multi-stage AI pipelines, this conditional execution pattern is one of the strongest arguments for explicit workflow orchestration.
We considered running a single gpt-4o call with both text and image for every email. We ruled it out when we ran the numbers: the cost difference was 7x per email, and the accuracy gain on the classification step was negligible. The lightweight model catches 99%+ of non-expense emails correctly, so the expensive model only needs to handle the hard cases.
How Structured Outputs Prevent Extraction Hallucinations
Vision LLMs are good at reading receipts — but they occasionally hallucinate amounts, invent invoice numbers, or misattribute currency codes. In a financial automation system, a single hallucinated decimal point can create real accounting problems.
We enforce structured outputs using a strict JSON schema validated against a Pydantic model. The LLM must return vendor name, amount, currency, invoice number, and expense category in a predefined format. If the response fails schema validation, the system retries with exponential backoff rather than ingesting garbage data.
# Schema shape (simplified)
{
"vendor": "string",
"amount": "number",
"currency": "ISO 4217 code",
"invoice_number": "string | null",
"category": "enum[software, hosting, services, ...]",
"confidence": "number 0-1"
}
The confidence score is key. Extractions below a threshold get flagged for human review rather than auto-uploaded to the accounting system. This is the kind of guardrail that separates a demo from a production system — and the kind of decision-making that defines real production AI system architecture.
Handling IMAP Without Corrupting the Mailbox
A subtle but critical detail: when your automation reads emails via IMAP, a standard FETCH command marks messages as read. For a shared business inbox, this means a human opening their email client sees everything as already read — with no way to know which messages are new. We use IMAP's PEEK flag to read messages without altering their status.
Deduplication is the other landmine. IMAP message UIDs are not globally unique and can reset when a mailbox is rebuilt. We deduplicate on the email Message-ID header, enforced as a unique constraint at the database level. This means the scanner can safely re-process an entire mailbox without creating duplicate expense records — a property you absolutely need when deploying updates to a production system.
What the Full Pipeline Looks Like
The complete flow from email to accounting record runs as follows. The IMAP scanner polls the inbox and fetches new messages using PEEK. Each email's Message-ID is checked against the database. New emails enter step one: text analysis by gpt-4o-mini, which classifies whether the email contains an expense. If yes, the system checks for PDF attachments. If none are found, it checks whether the email body contains HTML receipt patterns. If it does, WeasyPrint renders the HTML to a PNG image. The image (or PDF attachment) then enters step two: vision analysis by gpt-4o, which extracts structured expense data. Validated results get uploaded to the accounting platform via OAuth, with token refresh handled automatically.
The entire pipeline processes a typical business email in under 3 seconds. Emails without expenses exit after step one in under 500 milliseconds. At roughly $0.004 per expense-containing email, the LLM cost is a rounding error compared to the hours of manual receipt processing it replaces.
The Pattern Worth Remembering
HTML email receipts are only going to become more common as SaaS vendors move away from PDF attachments. Any expense automation or AI integration service that only handles attachments will silently miss a growing percentage of real transactions. The fix is not complicated — render the HTML to an image, feed it to a vision model, enforce structured outputs — but it requires knowing the problem exists in the first place. We discovered it the hard way, after a client asked why their Stripe charges were not showing up. The lesson: when building AI-powered document processing, always account for the documents that are not documents at all.