You filter at the cheapest layer first. Before spending a single paid download token, we pull YouTube's free thumbnail CDN frames, send them to gpt-4o-mini with a strict JSON Schema, and score each video 0–10 for usability. That zero-cost pre-screen took 1,833 candidate music videos down to 500 worth downloading — a 27% pass rate — at fractions of a cent each.

We are fine-tuning Wan2.2-S2V-14B, an open-source audio-to-video model, to generate lipsync performance videos. Our thesis is unfashionable: you beat existing lipsync models not with a cleverer architecture but with cleaner training data. The architecture is published. The moat is the dataset — and the dataset starts with ruthless filtering.

Why pre-screen at all instead of just downloading and inspecting?

Because the expensive step is the download, not the inspection. Our video source runs through a paid RapidAPI endpoint that bills per fetch. Downloading 1,833 full videos to discover that two-thirds are unusable — wrong framing, no visible singer, rapid cuts — is paying full price to learn what a thumbnail could have told us for a tenth of a cent.

Most music videos are useless for lipsync training. They cut every two seconds, shoot the singer from behind, drown the face in stage lighting, or never show a clear mouth. We needed continuous single shots with a large frontal face actually singing. That describes maybe a quarter of what's out there.

So we built a gate in front of the download. The pre-screener's only job is to be cheap enough to run on everything and accurate enough that a "pass" is worth the download token it unlocks. It doesn't need to be perfect — a later visual audit catches the survivors that slip through. It needs to be cheap and to almost never throw away a good video.

How does a zero-cost video pre-screener actually work?

YouTube serves preview frames from a free CDN — no auth, no quota, no download token. We grab maxresdefault.jpg plus three frames auto-sampled at the 25/50/75% marks of the video, then send those four images to gpt-4o-mini and ask one question: is this footage usable for singer lipsync training? Cost is a fraction of a cent per video.

The model returns a structured verdict: a yield_score from 0 to 10, a shot_type label, an estimate of likely_chunks (how many clean training segments the video could plausibly yield), and an official_channel flag. Anything below our score threshold never costs a download token.

Four CDN thumbnails is thin evidence compared to the full video, and we knew that going in. The pre-screener is deliberately tuned toward recall over precision: we would rather pass a few duds that the download-stage analysis later rejects than discard a genuinely good source on the strength of one badly-timed thumbnail. The cost asymmetry justifies the bias — a false pass wastes one download token, a false reject loses a training clip forever.

Why gpt-4o-mini with strict JSON Schema, and not Claude via prompt-based JSON?

Because at this volume, parse reliability matters more than model intelligence, and constrained decoding gives you parse reliability for free. We started with Claude through OpenRouter, asking for JSON in the prompt. It worked most of the time — but "most of the time" across 1,833 calls means dozens of malformed responses, trailing prose, fenced code blocks, and the occasional apology where a score should be.

We switched the screener to gpt-4o-mini with OpenAI's Structured Outputs, which constrains generation against a supplied JSON Schema at the decoding level. The model is not asked nicely to return JSON; it is structurally prevented from emitting anything else. Parse errors went to zero. No retry loop, no regex salvage, no defensive try/except wrapping every call.

This was a deliberate tradeoff. Claude is the stronger vision reasoner, and for a borderline artistic judgement we would reach for it. But the pre-screen is a high-volume, low-nuance classification: four thumbnails, one usability score. We chose the cheaper model with the harder schema guarantee over the smarter model with the softer output contract — because at 1,833 calls, the failure mode that actually hurts is a malformed response, not a slightly-off score.

The broader lesson generalises beyond this project. When you orchestrate LLMs in a pipeline, the output contract is part of the architecture, not an afterthought. We have written before about how negative constraints fixed our multi-step LLM video pipeline — same principle, different layer: tell the model exactly what shape its output must take, and stop hoping it cooperates.

How do you find 1,833 candidate videos without an LLM hallucinating fake IDs?

You let the LLM propose artists and songs — never video IDs — then resolve real IDs yourself. Ask a model directly for a YouTube video ID and it will confidently invent an 11-character string that 404s. So our research harness asks only for artist-plus-song pairs, bucketed by genre and demographic, then resolves each pair to a real ID by scraping YouTube search and validating through the free oEmbed endpoint.

The harness rotates through twelve genre and demographic buckets so the dataset doesn't collapse into one era or one vocal style. For each generated pair it scrapes the search results page for the top real video ID, confirms the ID exists via oembed (free, no auth), then hands it to the pre-screener. Ten parallel workers run the pre-screen, and the whole loop is resumable — it checkpoints state after every batch, so a crash at video 1,400 doesn't restart from zero.

It runs until a target number of videos pass, not until a number are screened. We set the target at 500 passed and let it churn through 1,833 candidates to get there. Resolving real IDs from scraped search pages is the same discipline we use across our data work — when the documented path lies or doesn't exist, you go to the source. That's the lesson behind how we built an on-demand proxy fleet to collect 1.1M records: trust the data you can verify, not the convenient API that hands you garbage.

What happens to a video after it passes the pre-screen?

It enters the real pipeline, where the analysis is heavier and the bar is higher. We run PySceneDetect to split the video into single continuous shots with no cuts, then send representative frames from each shot to Claude vision for the judgement calls a thumbnail can't make: is the singer frontal, is the mouth visible, is the face large enough, are they actually singing rather than just on screen?

Face size is measured, not guessed. We run YuNet face detection to compute the fraction of the frame the face occupies, and reject shots where the face is too small to drive lipsync learning. Surviving shots are normalised to 16fps and emitted as per-chunk metadata JSON, ready for training.

This is where the cheap pre-screen earns its keep. The expensive analysis — scene detection, vision classification, face measurement — only ever runs on the 27% that cleared the gate. We spend the heavy compute on videos that already have a fighting chance, not on the two-thirds that a thumbnail told us to skip.

What is the out-of-window reference trick, and why does it matter?

It forces the model to learn audio-to-motion coupling instead of copying its input. We train on a roughly five-second window but pull the reference frame from outside that window. If the reference came from inside the clip, the model could cheat by copying a frame it has effectively already seen. Pulling it from elsewhere in the same shot means the reference fixes identity while the audio has to drive the mouth.

Wan2.2-S2V uses the reference frame as an identity anchor — it tells the model who this person is, not what they should be doing. By deliberately decoupling the reference from the training window, we make the audio signal the only thing that can explain the mouth movement. That's the behaviour we actually want at inference: give it a face and a song, get a performance.

Why do lipsync models invent flickering teeth, and how does a second reference fix it?

Because a closed-mouth reference never shows the model what this person's open mouth looks like, so it hallucinates a generic one — usually a flickering smear of teeth that belongs to no one. Our fix is the dual-ref: we feed both a closed-mouth and an open-mouth reference frame, both pulled from outside the training window, so the model has a real example of the teeth and inner mouth it needs to render.

This is a core hypothesis of our first training run, not a settled result — we're testing whether dual-ref measurably reduces the invented-teeth artifact versus a single closed-mouth reference. But the reasoning is sound and the failure mode is well-known to anyone who has watched a lipsync model open a mouth it has never seen open. Extracting a clean open-mouth, teeth-visible frame for every chunk is exactly the kind of work the automated pipeline makes feasible at scale.

If the pipeline is automated, why audit the output by hand?

Because "automated" and "perfect" are different words, and the data quality bar for fine-tuning is unforgiving. We ran a full visual audit of our first 67 chunks using a browser-based agent that pulled frames from every clip and every reference PNG, then flagged the failures the automated checks missed: crossfade dissolves mid-clip, scene cuts sitting right on a chunk boundary, and wrong-identity reference frames.

The most instructive failure: a Celine Dion chunk from the Titanic music video whose reference frame contained Kate Winslet. The face detector saw a large frontal face and passed it; it had no concept of which face. The audit caught it. We dropped 14 imperfect chunks and 5 bad reference frames out of the original 67, leaving 53 clean ones.

Our rule is simple and expensive: remove anything that isn't perfect. For a 14-billion-parameter fine-tune, a handful of poisoned chunks teaches the model the wrong thing faster than a hundred clean ones teach it the right thing. We would rather ship a smaller, immaculate dataset than a larger, contaminated one.

Where does the dataset stand, and what's the layered-filtering takeaway?

We started with 50 hand-curated videos yielding 67 chunks, audited down to 53 clean ones, added six prescreened sources for nine more chunks, and are now running 500 prescreened candidates through the full pipeline toward a target of 200-plus immaculate chunks before the first training run on RunPod. The pattern that makes this affordable is layered filtering: cheapest gate first, heaviest analysis last, human eyes on the survivors.

The pre-screener is the load-bearing idea. By filtering on free CDN thumbnails with a constrained-decoding model before committing a single paid download, we turned a budget-blowing brute-force download job into a 27%-yield funnel that spends real money only on footage that already looks right. The architecture of Wan2.2-S2V is open to everyone. What isn't open is a clean dataset built this carefully — and that's exactly where we think the advantage lives.