AI-generated outreach sounds like a bot for two fixable reasons. First, the prompts impose a sales-deck structure (flatter, pitch, ask for a call) with no guardrails against AI tells like em-dashes, "leverage," and fabricated stats. Second, the system knows everything about the recipient and nothing about the sender. We fixed both with a voice contract plus a sender voice profile, and messages started reading like they came from a real person.

We hit this while building an AI outreach assistant for a LinkedIn automation product. The premise of good outreach is simple: it should read like one real person typing to one real person. Ours didn't. Here is how we diagnosed it, the dead end we walked into, and the architectural fix that actually worked.

What makes AI-generated outreach sound like a bot?

AI outreach sounds robotic because of three recognizable tells: typographic giveaways (em-dashes, "navigating," "leverage"), fabricated metrics ("our clients see 40% faster onboarding"), and the universal mass-outreach skeleton — flatter their role, pitch what you do, ask for a call. Readers pattern-match all three instantly, and the message dies on contact.

None of these are model failures. A frontier model can write warm, specific, human prose without trying. It produces bot-speak when the prompt tells it to behave like a sales deck and gives it nothing real to say.

The fabricated-metrics tell is the most damaging. When the model has no concrete fact about what the sender offers, it invents one — a plausible-sounding percentage, a fake client outcome — because the template asked for a value proposition and the model abhors a vacuum. That is not hallucination for its own sake. It is the model dutifully filling a slot we left empty.

How did we diagnose the problem instead of guessing?

We built a test harness that ran multiple synthetic-but-realistic contact scenarios through the real generation pipeline, then printed the raw output. Reading actual messages — not re-reading prompts and imagining what they'd produce — is the only way to diagnose tone. Eyeballing a prompt tells you your intent; reading output tells you what the model actually did.

The scenarios varied the recipient deliberately: a contact with an obvious shared connection, a contact with a clear professional hook, and a genuinely cold contact with no hook at all. That last category is where everything fell apart, and it turned out to be the most informative case we ran.

The harness made the failure modes legible. With a strong hook, the model wrote something passable. With no hook, it defaulted to "I work with teams like yours" filler or invented a business for the sender wholesale. Same pipeline, same model — the only variable was how much real material the model had to work with.
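A harness like this can be very small. The sketch below is a minimal illustration, not our production code: the `Scenario` type, the scenario contents, and the `generate` callable are all hypothetical stand-ins for whatever your real pipeline exposes.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Scenario:
    name: str
    recipient: dict  # synthetic-but-realistic contact data (hypothetical fields)


# Hypothetical scenarios mirroring the three cases described above.
SCENARIOS = [
    Scenario("shared_connection", {"first_name": "Sarah", "role": "Head of Ops", "mutual": "Dana"}),
    Scenario("professional_hook", {"first_name": "Sarah", "role": "Head of Ops", "hook": "talk on ops tooling"}),
    Scenario("cold_no_hook", {"first_name": "Sarah", "role": "Head of Ops"}),
]


def run_harness(generate: Callable[[dict], str], scenarios=SCENARIOS) -> dict[str, str]:
    """Run every scenario through the real generation function and print raw output."""
    results = {}
    for scenario in scenarios:
        message = generate(scenario.recipient)  # call the real pipeline, not a mock
        results[scenario.name] = message
        print(f"=== {scenario.name} ===\n{message}\n")
    return results
```

The point of printing rather than asserting is that tone has to be read. Pass in the same function production uses, so the harness exercises the real prompt and the real model.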

What is a voice contract, and why did it only half-work?

A voice contract is an explicit instruction block in the prompt that defines what good output is and bans what bad output looks like: no em-dashes, no "leverage" or "navigating," no fabricated metrics, no flatter-pitch-ask skeleton, write like one human to another. It applies to tone the same idea as a semantic contract that fixes silent failures in multi-step LLM pipelines.
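As a rough illustration, a voice contract can live as a single constant that every generation prompt includes. The wording and the `build_prompt` helper below are assumptions for the sake of the example, not the exact text we shipped.

```python
# Illustrative wording only; the rules themselves come from the list above.
VOICE_CONTRACT = """\
Voice contract (applies to every message):
- Write like one real person typing to one real person.
- Do not use em-dashes.
- Do not use the words "leverage" or "navigating".
- Do not state any metric, statistic, or client outcome that is not in the provided context.
- Do not follow the flatter-pitch-ask skeleton (compliment their role, pitch, ask for a call).
"""


def build_prompt(task: str, recipient_context: str) -> str:
    """Prepend the voice contract so every generation sees the same rules."""
    return f"{VOICE_CONTRACT}\nRecipient context:\n{recipient_context}\n\nTask:\n{task}"
```

Keeping the contract in one place means the generator and any later compliance check can share the same text.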

The voice contract immediately improved the messages that already had a hook. It did nothing for the hookless cold contacts, because you cannot instruct your way out of having nothing to say. That gap was the first clue that the real problem lived somewhere other than the prompt.

It also raised an enforcement question. A contract is only as good as your ability to check compliance. We knew the model would drift — some percentage of generations would still smell automated. So we tried to enforce the rules in code. That is where we walked into a wall.

Why did regex fail to catch salesy messages, and what worked instead?

Regex failed because "does this sound human or salesy" is a subjective judgment, and regex only matches surface patterns. A keyword stripper for "leverage" flags legitimate sentences as false positives and misses the same idea phrased ten other ways. For subjective quality, use an LLM as a judge; reserve regex for deterministic cleanup only.

Our first instinct was a banned-word list with keyword stripping. It produced two failure modes at once. It mangled legitimate text — a contact who literally worked in "leverage finance" got censored — and it sailed past messages that were unmistakably salesy without using any banned word. Surface matching cannot judge meaning.
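Both failure modes are easy to reproduce. The snippet below is a reconstruction for illustration, assuming a naive case-insensitive stripper; the two sample sentences are made up to match the cases described above.

```python
import re

BANNED = ["leverage", "navigating"]
_BANNED_RE = re.compile("|".join(map(re.escape, BANNED)), re.IGNORECASE)


def strip_banned(text: str) -> str:
    """Naive keyword stripper: delete banned words, then collapse leftover spaces."""
    stripped = _BANNED_RE.sub("", text)
    return re.sub(r"\s{2,}", " ", stripped).strip()


# Failure 1: legitimate text gets mangled.
legit = "Saw you spent six years in leverage finance before moving to ops."
print(strip_banned(legit))
# prints: Saw you spent six years in finance before moving to ops.

# Failure 2: a clearly salesy message passes untouched (no banned word in it).
salesy = "We help teams like yours unlock growth. Open to a quick call this week?"
print(strip_banned(salesy) == salesy)
# prints: True
```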

So we built an LLM judge. Each generated message is scored by a separate model call against the voice contract: does this read as one human to one human, or does it smell automated? In production the gate is a generate → judge → regenerate loop — messages that fail the judge are silently regenerated before the user ever sees them. This is the same discipline behind a good LLM eval suite that catches production regressions before users do, pointed at tone instead of correctness.
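The loop itself is only a few lines. This is a minimal sketch with the model calls injected as plain callables: `generate` and `judge` are hypothetical names, the attempt cap is an assumption, and returning `None` when every attempt fails is one possible policy rather than a description of what our product does.

```python
from typing import Callable, Optional


def generate_with_gate(
    generate: Callable[[], str],
    judge: Callable[[str], bool],
    max_attempts: int = 3,
) -> Optional[str]:
    """Generate, judge, and regenerate until a message passes or attempts run out.

    `generate` wraps the message-writing model call.
    `judge` wraps a separate model call that scores the message against the
    voice contract and returns True when it reads as human.
    """
    for _ in range(max_attempts):
        candidate = generate()
        if judge(candidate):
            return candidate  # only passing messages ever reach the user
    return None  # caller decides the fallback (assumption: not specified above)
```

Injecting the two calls keeps the loop testable with stubs, and the cap on attempts bounds cost when a contact simply has no good message available.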

We kept regex for exactly one job: deterministic typographic cleanup. An em-dash should always become a comma. That is a rule with no judgment in it, so code is the right tool. The line we drew — and the line worth teaching — is code for deterministic rules, LLMs for subjective judgment. Knowing which is which is most of the battle.
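The deterministic half fits in one substitution. A minimal sketch, assuming the rule is exactly "em-dash becomes a comma" and that surrounding spaces should be absorbed; the function name is ours for illustration.

```python
import re

# U+2014 is the em-dash; absorb any whitespace on either side of it.
_EM_DASH = re.compile(r"\s*\u2014\s*")


def clean_typography(text: str) -> str:
    """Replace every em-dash (with surrounding spaces) by a comma and a space."""
    return _EM_DASH.sub(", ", text)


print(clean_typography("Hi Sarah \u2014 really impressed by your work"))
# prints: Hi Sarah, really impressed by your work
```

There is no judgment call here, so there is nothing for a model to add. Note that an em-dash at the very end of a message would leave a trailing comma, which a fuller version would need to handle.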

What was the real root cause of generic outreach?

The real root cause was architectural, not linguistic: the system knew everything about the recipient and nothing about the sender. When the model had to write "why I'm reaching out," it had no real sender context, so it filled the vacuum with generic filler or an invented company. A voice contract polishes phrasing; it cannot supply facts that were never in the prompt.

This reframed the entire problem. We had been treating generic output as a prompt-tuning problem when it was a missing-context problem. The model wasn't writing badly — it was writing accurately about a sender it had been told nothing about. Garbage context, generic output.

The fix was a sender voice profile: a structured record of who you are, what you actually do, who you target, what you want out of a conversation, and your tone. With that in the prompt, the model stops inventing. It has real material — a real offer, a real audience, a real ask — and the writing turns specific and human almost on its own.
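One way to represent that record is a small dataclass that renders itself into a prompt block. The field names below map to the five items listed above, but the names, the rendered wording, and the example values (other than the offer, taken from the "after" message later in this post) are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class SenderVoiceProfile:
    who_you_are: str
    what_you_do: str
    who_you_target: str
    conversation_goal: str
    tone: str

    def to_prompt_block(self) -> str:
        """Render the profile as plain text the generation prompt can include."""
        return (
            "Sender profile (use only these facts about the sender):\n"
            f"- Who I am: {self.who_you_are}\n"
            f"- What I do: {self.what_you_do}\n"
            f"- Who I target: {self.who_you_target}\n"
            f"- What I want from a conversation: {self.conversation_goal}\n"
            f"- Tone: {self.tone}\n"
        )


# Hypothetical example values.
profile = SenderVoiceProfile(
    who_you_are="Independent developer",
    what_you_do="I build small internal tools for ops teams",
    who_you_target="People who run operations at small companies",
    conversation_goal="Learn what their team still does by hand",
    tone="Plain, direct, honest about cold outreach",
)
print(profile.to_prompt_block())
```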

How do you get a user to fill in a sender profile without a blank form?

You don't make them fill in a blank form. We auto-draft the sender voice profile by pointing our existing profile-research function at the sender's own LinkedIn profile. The user opens an editable first draft generated from their real profile data, then molds it instead of staring at empty fields. Bootstrapping input from data you already have beats asking the user to start from nothing.

We already had a profile-research function — it was built to understand recipients. Aiming it at the sender cost almost nothing and changed the UX entirely. A blank "describe your voice and offer" form gets abandoned; a pre-filled draft that is 80% right gets edited and saved.
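In code, the reuse amounts to calling the same research function with a different target and treating the result as a draft rather than a final answer. Everything named here is hypothetical: `research_profile` stands in for our existing function, whose real signature is not shown in this post, and the field list and URL are placeholders.

```python
from dataclasses import dataclass
from typing import Callable

FIELDS = ("who_you_are", "what_you_do", "who_you_target", "conversation_goal", "tone")


@dataclass(frozen=True)
class ProfileDraft:
    fields: dict
    confirmed: bool = False  # stays False until the user has reviewed the draft


def draft_sender_profile(research_profile: Callable[[str], dict], sender_url: str) -> ProfileDraft:
    """Point the recipient-research function at the sender's own profile."""
    researched = research_profile(sender_url)
    # Fields the research could not fill stay empty for the user to complete.
    return ProfileDraft({key: researched.get(key, "") for key in FIELDS})


def apply_user_edits(draft: ProfileDraft, edits: dict) -> ProfileDraft:
    """Merge the user's corrections over the auto-draft and mark it confirmed."""
    unknown = set(edits) - set(FIELDS)
    if unknown:
        raise ValueError(f"unknown profile fields: {sorted(unknown)}")
    return ProfileDraft({**draft.fields, **edits}, confirmed=True)
```

The `confirmed` flag is a design choice worth keeping: an auto-draft is a guess about the sender, so nothing should be generated from it until the sender has had the chance to correct it.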

We made one more deliberate UX choice in the same spirit: the canned templates became invisible. Users describe their goal and their voice in plain language; they never pick an abstract template from a menu. The template machinery still exists under the hood — we just stopped making the human translate their intent into our internal abstraction.

What does the fix look like in a real message?

Same genuinely cold contact, same pipeline. The only change is whether the sender voice profile is present. Without it, the model invents a company and fabricates a claim to fill the empty "why I'm reaching out" slot. With it, the model uses the real offer and owns the cold reach honestly.

Before — no sender profile (model invents a sender):

Hi Sarah — really impressed by your work navigating growth at
Northwind. At Apex Digital we help teams like yours leverage
automation to see 40% faster onboarding. Would love to grab
15 minutes to show you how. Open to a quick call this week?

Every tell is present: the em-dash, "navigating," "leverage," a fabricated 40% stat, an invented company ("Apex Digital" does not exist), and the flatter-pitch-ask skeleton end to end.

After — with sender profile (real offer, honest cold reach):

Hi Sarah, I'll be honest, this is a cold message. I build
small internal tools for ops teams and saw you run ops at
Northwind. Not pitching anything specific, just curious what
your team currently does by hand that you wish a script
handled. Happy to trade notes either way.

It owns the cold reach, makes no fabricated claim, references something real about the sender's actual work, and asks a question a human would actually ask. It reads like one person typing to another, because now the model has a real person to write as.

What are the transferable lessons for building AI writing systems?

Four principles carried beyond this project. Don't patch the output, fix the input — generic AI output is usually a missing-context problem, not a prompt-tuning problem. Use LLMs to judge subjective quality and code for deterministic rules. Test by reading real output through the real pipeline. And auto-draft-then-edit beats blank forms every time.

The pattern that ties them together: when AI output is generic, the reflex is to tune the prompt, and the prompt is rarely the real lever. We spent the most time on the voice contract and got the smallest return; we spent the least time on the sender voice profile and got the biggest. The model was never the bottleneck — the context we fed it was.

If your AI-generated writing reads like a bot, resist the urge to keep rewriting the prompt. Ask first what the model knows about the speaker. More often than not, the vacuum is the problem, and filling it honestly — with real facts about who is actually talking — is what makes the writing sound human again.