An LLM should write code against a documented toolkit, not fill in a JSON spec, whenever the output requires relationships rather than static values. We learned this building an AI After Effects edit generator in 2026: a JSON edit-spec couldn't express "this caption appears one second before the cut," but a single line of code can. Code expresses relationships; schemas only hold constants.

The system turns a set of video clips into a native, fully-editable After Effects composition — cuts, transitions, text overlays, effects — with no human editor assembling it by hand. The architectural journey from a declarative spec to a code-generating toolkit is the real story, and the lesson transfers far beyond video.

What does the tool actually build, and why doesn't it render anything?

The tool is a compiler, not a renderer. It generates an ExtendScript (.jsx) file that an operator runs inside After Effects via File > Scripts > Run Script File. The script talks to AE's DOM and builds a live composition with discrete layers and keyframes. Crucially, the tool itself never launches After Effects and never renders any video.

That decision is the foundation of the whole design. We deliberately did not automate a headless After Effects on a server — AE is heavy, license-bound, and brittle to drive programmatically. By emitting a script the user runs locally, we sidestep server-side AE entirely.

The output is also the point: editors get a native project with editable layers and keyframes, not a flattened MP4. They can still nudge a cut, retime a caption, or swap an effect. We chose a generated .jsx over a rendered video precisely because the deliverable had to stay editable.

Why did the JSON edit-spec approach fail?

The JSON edit-spec failed because timing in an edit is relational, and a declarative schema can only hold static numbers. In our first version, the LLM emitted a JSON "edit spec" and a Python builder translated it into ExtendScript. That translation layer bred persistent bugs: timing collisions, gaps between clips, and text overlapping the wrong frames.

The root problem was expressiveness. A caption that should appear one second before a cut is a relationship: label_in = cut_point - 1.0. In JSON you can only write the already-computed number — so the moment any upstream timing shifted, every dependent value was silently wrong, and nothing in the spec recorded why it had that value.
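A minimal sketch of that failure mode, written in ES3-style JavaScript. The field names cutPoint and labelIn and the helper labelInFor are illustrative assumptions, not the schema of the original edit spec.

```javascript
// Hypothetical static edit spec: labelIn was computed once as cutPoint - 1.0,
// but only the resulting number is stored, not the rule that produced it.
var staticSpec = { cutPoint: 4.0, labelIn: 3.0 };

// An upstream change moves the cut half a second later.
staticSpec.cutPoint = 4.5;
// staticSpec.labelIn is still 3.0, so the caption now leads the cut by 1.5 s
// and nothing in the data says that is wrong.

// The same intent written as code keeps the relationship itself.
function labelInFor(cutPoint) {
  return cutPoint - 1.0; // "caption appears one second before the cut"
}

var staleLead = staticSpec.cutPoint - staticSpec.labelIn;                 // 1.5
var derivedLead = staticSpec.cutPoint - labelInFor(staticSpec.cutPoint);  // 1.0
```

The stale value is not a bug in any one number. It is a rule that the data format had no place to store.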

We spent real time patching the translation layer before admitting the layer itself was the bug. Every fix for one timing collision quietly created another, because the spec had no way to say "these two things move together."

Why is it better to let the LLM write code instead of a spec?

Letting the LLM write code is better because code is the natural language of relationships, and an edit is almost entirely relationships. Instead of emitting static timing values into a schema, the model writes a short script of creative decisions, such as crossDissolve(clipA, clipB, 0.5) or punchText(comp, "CUT", cut_point - 1.0), where every value can be derived from another.
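To make "every value can be derived from another" concrete, here is a small sketch. The punchText and flashTo functions are stand-in recorders with the same argument order as the toolkit signatures, not the real toolkit, and buildCut is a hypothetical name.

```javascript
// Stand-in recorders, NOT the real toolkit: they only log what was requested.
var log = [];
function punchText(comp, text, atSec) { log.push("text@" + atSec); }
function flashTo(comp, atSec) { log.push("flash@" + atSec); }

// Hypothetical creative script: one anchor value, everything else derived.
function buildCut(comp, cutPoint) {
  punchText(comp, "CUT", cutPoint - 1.0); // caption leads the cut by one second
  flashTo(comp, cutPoint);                // flash lands exactly on the cut
}

buildCut(null, 4.0); // log: text@3, flash@4
log = [];
buildCut(null, 6.5); // log: text@5.5, flash@6.5
```

Moving the cut moves the caption and the flash with it, because both are written in terms of cutPoint rather than as independent numbers.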

This reframes the LLM's job. It no longer fills in a form a human pre-designed; it expresses intent directly in a medium that captures dependencies. The same shift applies any time you're tempted to hand a model a rigid schema for something inherently relational: pricing rules, scheduling, layout, animation. Our earlier write-up, "How negative constraints fixed our multi-step LLM video pipeline," taught us a related lesson: the shape of what you ask the model to produce determines how often it gets it right.

What does the final toolkit architecture look like?

The final architecture splits responsibility cleanly: a reusable ExtendScript toolkit library holds all the heavy AE-DOM machinery, and the LLM authors only a short creative script that calls named functions in it. A thin Python step concatenates the two into the final .jsx. The model writes the decisions; the toolkit holds the engineering.

The toolkit (ae_toolkit.js) is a library of reusable "maneuvers" — named operations an editor would recognize:

// ae_toolkit.js: the heavy machinery (≈800 lines), written once
function crossDissolve(layerA, layerB, durationSec) { /* opacity keyframes, frame-snapped */ }
function dipToBlack(comp, atSec, durationSec)       { /* black solid + fades */ }
function flashTo(comp, atSec)                       { /* white frame punch */ }
function punchText(comp, text, atSec)               { /* subtitle bar + scale-in */ }
function kenBurns(layer, fromScale, toScale)        { /* slow zoom keyframes */ }
function addClip(comp, filename, inSec, outSec)     { /* footage layer, trimmed */ }
function addAudioTrack(comp, filename)              { /* audio layer */ }

The LLM-authored creative script is tiny by comparison — roughly forty lines of decisions:

// creative script — what the LLM writes (~40 lines)
var comp = newComp("Promo", 1920, 1080, 25);
addClip(comp, "clip_01.mp4", 0.0, 3.2);
punchText(comp, "OPENING", 0.4);
crossDissolve(layer("clip_01"), addClip(comp, "clip_02.mp4", 3.2, 6.0), 0.5);
dipToBlack(comp, 6.0, 0.4);
// ...creative decisions only, calling the toolkit

We chose this split over having the LLM write the entire script for two concrete reasons. First, token cost dropped roughly 80% — the model emits forty lines of intent instead of regenerating eight hundred lines of boilerplate AE-DOM code every run. Second, the toolkit is durable: every bug we fix in crossDissolve is fixed forever, for every future edit, instead of being re-rolled probabilistically each generation.

What are the hardest ExtendScript lessons we hit?

The hardest lessons were all about ExtendScript being a frozen, ancient dialect with sharp edges around timing and paths. ExtendScript implements ECMAScript 3: no let/const, no arrow functions, no template literals, and no built-in JSON object, so no JSON.parse. These constraints shaped the whole toolkit.

Because there's no JSON.parse, we don't hand the script JSON to deserialize. We inject data as a literal var SPEC = {...} baked directly into the generated file. The script reads a native object that already exists at parse time — no parsing step to fail.
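A sketch of what such a baked-in literal might look like, with an ES3-safe way to read it. The field names (fps, clips, file, startSec, endSec) and the clipFiles helper are assumptions for illustration; the real SPEC layout is not shown in this article.

```javascript
// What the generator might bake into the .jsx (field names are assumptions).
// An object literal is ordinary ES3 syntax, so no parsing step is involved.
var SPEC = {
  fps: 25,
  clips: [
    { file: "clip_01.mp4", startSec: 0.0, endSec: 3.2 },
    { file: "clip_02.mp4", startSec: 3.2, endSec: 6.0 }
  ]
};

// ES3-safe traversal: a plain for loop, since forEach and map arrived in ES5.
function clipFiles(spec) {
  var names = [];
  for (var i = 0; i < spec.clips.length; i++) {
    names.push(spec.clips[i].file);
  }
  return names;
}
```

Because the data arrives as source code, a malformed spec surfaces as a syntax error when the script loads rather than as a failure partway through building the composition.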

Frame accuracy was the second trap. Every cut time must land on the frame grid, meaning a whole multiple of 1/fps, or you get one-frame gaps and visible flicker between clips. We validate all timing math in Python before generating the script, so the .jsx only ever receives frame-aligned values.
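The arithmetic behind that check is small. The pipeline described here does it in Python; the sketch below shows the same idea in ES3-style JavaScript purely as an illustration. All function names and the startSec/endSec fields are hypothetical.

```javascript
// Convert seconds to the nearest whole frame index.
function toFrame(sec, fps) {
  return Math.round(sec * fps);
}

// Snap a time in seconds onto the frame grid (a multiple of 1/fps).
function snapToFrame(sec, fps) {
  return toFrame(sec, fps) / fps;
}

// True when a time already sits on the grid, within a small float tolerance.
function isFrameAligned(sec, fps) {
  return Math.abs(sec * fps - toFrame(sec, fps)) < 1e-6;
}

// Report seams where one clip's end frame is not the next clip's start frame.
// A positive gapFrames is a gap; a negative one is an overlap.
function findSeamErrors(clips, fps) {
  var problems = [];
  for (var i = 0; i + 1 < clips.length; i++) {
    var endFrame = toFrame(clips[i].endSec, fps);
    var startFrame = toFrame(clips[i + 1].startSec, fps);
    if (endFrame !== startFrame) {
      problems.push({ afterClip: i, gapFrames: startFrame - endFrame });
    }
  }
  return problems;
}
```

Comparing integer frame indices rather than floating-point seconds avoids false alarms from rounding error. At 25 fps, 3.2 s is frame 80, while 3.21 s falls between frames and would be rejected or snapped.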

Then there's a classic AE gotcha: "Cannot move a layer before or after itself." Our layer-ordering code tried to reposition a layer relative to its own index after the index had shifted. The fix was to track stable layer references rather than positional indices — indices move under you the moment you reorder anything.
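The difference between an index and a reference is easy to see in a toy model. The sketch below uses a plain array of objects as the layer stack; it is not the AE DOM, and indexOfLayer and moveBefore are hypothetical helpers that only mimic the behaviour described above.

```javascript
// Toy model of a layer stack (an array of objects), NOT the AE DOM.
function indexOfLayer(stack, layer) {
  for (var i = 0; i < stack.length; i++) {
    if (stack[i] === layer) { return i; }
  }
  return -1;
}

// Move `layer` directly before `target`, looking both up by reference at the
// moment of the move. Returns false for a self-move or an unknown layer.
function moveBefore(stack, layer, target) {
  if (layer === target) { return false; }
  var from = indexOfLayer(stack, layer);
  if (from < 0 || indexOfLayer(stack, target) < 0) { return false; }
  stack.splice(from, 1);
  stack.splice(indexOfLayer(stack, target), 0, layer);
  return true;
}

var a = { name: "clip_01" }, b = { name: "clip_02" }, c = { name: "caption" };
var stack = [a, b, c];

var captionIndex = 2;     // positional index captured before reordering
moveBefore(stack, c, a);  // stack is now [caption, clip_01, clip_02]
// stack[captionIndex] is clip_02 now; the reference `c` is still the caption.
```

Any code that kept captionIndex would now be operating on the wrong layer, and a move expressed through stale indices can end up targeting the layer itself.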

How does it run on any machine without path configuration?

It runs anywhere by resolving footage relative to the script's own location, never via absolute paths. ExtendScript exposes the running script's path through $.fileName, so the generated .jsx finds its sibling footage by filename alone. The convention is simple: all footage plus the script live in one folder, referenced by filename only.

This makes the output portable across Mac and PC with zero path config. An operator drops the folder anywhere and runs the script; it resolves clip_01.mp4 next to itself regardless of drive letters, usernames, or OS path separators. Hard-coded absolute paths are the single most common reason a teammate's generated script breaks on someone else's machine.
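A minimal sketch of the sibling-file convention. The helper siblingPath is hypothetical and is written as a pure string function so it can be checked outside After Effects; how the real toolkit resolves paths is not shown in this article.

```javascript
// Build the path of a file sitting next to the script.
// Accepts "/" or "\" separators and reuses whichever the script path ends its
// folder part with. If there is no folder part, the bare filename is returned.
function siblingPath(scriptPath, filename) {
  var cut = Math.max(scriptPath.lastIndexOf("/"), scriptPath.lastIndexOf("\\"));
  return scriptPath.substring(0, cut + 1) + filename;
}

// Inside After Effects this would be fed by the running script's own path,
// for example (ExtendScript only, so left as a comment here):
//   var footage = new File(siblingPath($.fileName, "clip_01.mp4"));
```

The generated script never needs to know where the folder lives, only that the footage sits beside it.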

What small craft details mattered for quality?

Two craft details disproportionately affected output quality: text contrast and not double-keyframing. For captions over unpredictable footage, we settled on yellow text (#FFE600) on a dark, semi-opaque bar — the 80s-subtitle look — because it reads over anything, including the white frame of a flash transition.

The double-keyframe bug is subtler and worth flagging. Our animation helpers (fade, punch, zoom) already set their own opacity keyframes. Early on we stacked a manual fade on top of a helper that also faded, and the two keyframe sets corrupted each other. The rule baked into the toolkit: a helper owns its property; never set keyframes a helper already manages.
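One way such a rule could be enforced mechanically is an ownership registry that refuses a second writer. This is a sketch under assumptions, not a description of how the toolkit implements the rule; claimProperty, punchIn, and manualFade are hypothetical names.

```javascript
// Hypothetical ownership registry: each (layer, property) pair gets one owner.
var owners = {};

function claimProperty(layerName, propName, helperName) {
  var key = layerName + "." + propName;
  if (owners.hasOwnProperty(key) && owners[key] !== helperName) {
    throw new Error(key + " is already keyframed by " + owners[key]);
  }
  owners[key] = helperName;
}

// Stand-in helpers: real ones would write keyframes after a successful claim.
function punchIn(layerName) {
  claimProperty(layerName, "opacity", "punchIn");
}
function manualFade(layerName) {
  claimProperty(layerName, "opacity", "manualFade");
}
```

With a guard like this, stacking manualFade on a layer that punchIn already animates fails loudly at generation time instead of producing corrupted keyframes in the composition.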

Why does this plug cleanly into an agentic system?

It plugs cleanly into an agent because the toolkit gives the agent a documented, deterministic API to write against — no second LLM, no spec translation. A Claude SDK agent that already holds the footage filenames, in/out timecodes, and subtitle timing writes the creative script directly against the toolkit's function signatures, and the Python step compiles it.

That closes the loop. The agent that made the editorial decisions is the same actor that emits the code, against an interface it can read documentation for. There's no lossy hand-off to a JSON schema and back. The output is frame-accurate, gap-free, and fully editable — production-grade, not a demo. This is the same philosophy behind the 80-line agent loop we shipped against a live B2B catalogue: give the agent a tight, well-documented surface and let it act, rather than wrapping it in layers of abstraction.

What's the transferable lesson beyond video editing?

The transferable lesson: when an LLM needs to express relationships, give it a code API to call, not a JSON schema to fill in. Schemas are perfect for flat, independent values. The moment outputs depend on each other — timing, pricing tiers, layout constraints, scheduling — a code interface lets the model write the dependency directly, and a hand-authored toolkit absorbs the complexity and improves with every fix.