Multi-step LLM pipelines that generate sequential content — video segments, story chapters, slide decks — break down when each step runs in isolation. The fix is a global planning step that runs before any content is generated, producing a binding contract that tells each downstream step not just what to do, but what it cannot do. We call this pattern vectorial constraint, and it solved the single biggest quality problem in our AI music video generation pipeline.
We built this for a platform that generates multi-segment music videos from uploaded tracks. Each segment is 8-12 seconds of video, and each gets its own LLM-generated prompt fed to a text-to-video model. Before we introduced a global shot plan, segment 3 had no idea what segments 1 and 2 had already done. The result: repeated visual beats, inconsistent emotional arcs, and segments that felt like three separate videos stitched together rather than one cohesive piece.
Why Sequential LLM Calls Produce Repetitive Output
The naive approach to multi-segment generation is straightforward: loop over your segments, call the LLM for each one, pass it the creative brief, get back a prompt. The problem is that LLMs are mode-seeking. Given the same creative concept, they converge on the same "best" interpretation every time.
If your concept is "melancholic urban night scene," every segment will independently decide that the emotional peak is a character standing in the rain looking up. The model isn't wrong — that is a strong visual beat for that concept. But when three segments all deliver the same emotional release, the video has no arc. It's the same moment repeated three times.
Passing prior prompts as context helps with surface-level deduplication ("don't repeat the rain shot"), but it doesn't solve the deeper problem: each segment still independently decides what emotional function it serves. You get visual variety with emotional monotony.
How a Global Shot Plan Creates Coherence Across Segments
We added a planning step (Step 2c in our pipeline) that runs once, before any video prompt is generated. A single GPT-4o-mini call using OpenAI's structured outputs produces a concrete visual blueprint for all segments simultaneously. Each segment in the plan gets:
- visual_content — what physically happens on screen (location-scout level, not abstract emotions)
- entry_state / exit_state — the emotional state the viewer is in entering and leaving this segment
- prohibited_state_changes — emotional transitions this segment must NOT deliver
- cut_transition — how this segment begins relative to the previous one
- camera_strategy — specific camera technique with emotional justification
The entry/exit states are chained: each segment's entry_state must be the logical consequence of the previous segment's exit_state. Segment 0 always starts from "cold open — viewer has no prior context." The final segment's exit state must deliver the creative concept's target emotional destination.
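The chaining rule is mechanical enough to lint before any prompt generation runs. Here is a minimal sketch, assuming the plan is a list of dicts using the field names above — validate_chain and its error messages are our illustration, not the pipeline's actual code:

```python
# Hypothetical continuity lint for a global shot plan. Field names follow
# the schema described in this post; the function itself is illustrative.
COLD_OPEN = "cold open — viewer has no prior context"

def validate_chain(segments: list[dict], target_exit: str) -> list[str]:
    """Return a list of continuity violations in a global shot plan."""
    errors = []
    if segments and segments[0]["entry_state"] != COLD_OPEN:
        errors.append("segment 0 must start from a cold open")
    for prev, curr in zip(segments, segments[1:]):
        if curr["entry_state"] != prev["exit_state"]:
            errors.append(
                f"segment {curr['segment_index']} enters in "
                f"'{curr['entry_state']}' but the previous segment "
                f"exits in '{prev['exit_state']}'"
            )
    if segments and segments[-1]["exit_state"] != target_exit:
        errors.append("final segment must land on the target emotion")
    return errors
```

Because the invariant is checkable in plain code, a malformed plan can be rejected and regenerated before any expensive video generation happens.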
This plan is stored as JSONB in PostgreSQL alongside the rest of the pipeline state, which means we can query, debug, and compare shot plans across generations without parsing log files.
Why Telling the Model What NOT to Do Matters More Than What to Do
The prohibited_state_changes field is the most powerful part of the schema. Rather than only telling each segment what emotional beat to hit, we explicitly constrain what it cannot do. Research on negative constraints in LLM alignment suggests that training and prompting with negative-only feedback can match or exceed positive guidance — our production experience points the same way in the pipeline orchestration context.
Here's a concrete example. Say you have a three-segment video with an emotional arc of "quiet tension, building dread, cathartic release." The plan might assign:
- Segment 0 — prohibited: "Do not deliver cathartic release (owned by segment 2). Do not escalate beyond unease (escalation owned by segment 1)."
- Segment 1 — prohibited: "Do not resolve tension (resolution owned by segment 2). Do not return to calm (calm was spent in segment 0)."
- Segment 2 — prohibited: "Do not re-establish tension (already spent in segments 0-1). Do not end in ambiguity (the arc demands resolution)."
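Spelled out as data, the ownership convention is easy to sanity-check. Below is a hypothetical encoding of the plan above, with a small lint confirming that every owner a prohibition names actually exists in the plan — referenced_owners is illustrative, not part of the pipeline:

```python
import re

# Illustrative data only: the example three-segment plan's prohibitions,
# each naming the forbidden transition and the segment that owns it.
prohibitions = {
    0: ["Do not deliver cathartic release (owned by segment 2).",
        "Do not escalate beyond unease (escalation owned by segment 1)."],
    1: ["Do not resolve tension (resolution owned by segment 2).",
        "Do not return to calm (calm was spent in segment 0)."],
    2: ["Do not re-establish tension (already spent in segments 0-1).",
        "Do not end in ambiguity (the arc demands resolution)."],
}

def referenced_owners(rules: list[str]) -> set[int]:
    """Segment indices named as owners inside prohibition sentences."""
    return {int(n) for rule in rules
            for n in re.findall(r"segments? (\d+)", rule)}

# Quick lint: every owner a prohibition points at must exist in the plan.
for rules in prohibitions.values():
    for owner in referenced_owners(rules):
        assert owner in prohibitions
```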
Each prohibition names which segment owns that transition. This is what makes it vectorial — we're defining each segment's direction by ruling out wrong directions, not just pointing at the right one. The model can still be creative within its assigned lane, but it can't accidentally deliver someone else's emotional payload.
Why Duration Awareness Changes the Arc Shape
We learned the hard way that a generic three-act structure doesn't fit every clip. An 8-second video with one segment doesn't have room for setup, escalation, and payoff. It has room for one defining moment.
The arc planner receives duration-aware guidance before it plans anything:
- 1 segment (8-12s): Single Moment — one charged emotional beat caught mid-stream. No arc, no progression.
- 2 segments (13-24s): Contrast / Juxtaposition — a before/after pair or tension/release duality.
- 3 segments (25-30s): Three-Act Progression — establish, complicate, resolve.
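The mapping above is simple enough to inject as a pure function ahead of the planning call. A sketch using the guidance strings from the list — arc_shape is a hypothetical helper name, not the pipeline's actual API:

```python
# Hypothetical duration-aware guidance lookup, keyed on segment count
# rather than wall-clock time, since the LLM only sees tokens.
def arc_shape(segment_count: int) -> str:
    """Map segment count to the arc-shape guidance given to the planner."""
    if segment_count <= 1:
        return ("Single Moment: one charged emotional beat caught "
                "mid-stream. No arc, no progression.")
    if segment_count == 2:
        return ("Contrast / Juxtaposition: a before/after pair or "
                "tension/release duality.")
    return "Three-Act Progression: establish, complicate, resolve."
```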
Without this, we saw the LLM cramming a full hero's journey into an 8-second clip. The model doesn't know how long a segment is in real time — it only sees tokens. Injecting arc shape guidance based on segment count is a cheap fix that eliminated an entire class of pacing failures.
How the Plan Feeds Downstream Without Becoming a Bottleneck
Each downstream video prompt generation step (Step 3 in our pipeline) receives the full global shot plan for all segments, plus all previously generated prompts. The prompt generator isn't inventing content — it's executing a pre-planned arc slice.
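A minimal sketch of how that per-segment context might be assembled, assuming the plan is the list of segment dicts described earlier — build_step3_context and the section labels are our illustration, not the production prompt template:

```python
import json

# Hypothetical context builder for the execution step: the full plan,
# all prior prompts, and this segment's assigned slice of the arc.
def build_step3_context(plan: list[dict], index: int,
                        previous_prompts: list[str]) -> str:
    """Give the execution model the whole plan plus its assigned slice."""
    me = plan[index]
    prior = ("PREVIOUS PROMPTS:\n" + "\n".join(previous_prompts)
             if previous_prompts else "PREVIOUS PROMPTS: none")
    parts = [
        "GLOBAL SHOT PLAN:\n" + json.dumps(plan, indent=2),
        prior,
        f"YOUR SEGMENT: {index}",
        f"ENTER IN: {me['entry_state']}",
        f"EXIT IN: {me['exit_state']}",
        "PROHIBITED:\n" + "\n".join(
            f"- {p}" for p in me["prohibited_state_changes"]),
    ]
    return "\n\n".join(parts)
```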
We considered two alternatives before landing on this architecture. First, we tried embedding all constraints directly in the Step 3 system prompt — a single mega-prompt with rules for each segment inline. This worked for 2 segments but became unreliable at 3, because the model would lose track of which constraints applied to which segment. Second, we tried a feedback loop where Step 3 would generate, then a critic model would reject prompts that violated continuity. This doubled latency and still missed subtle emotional repetitions.
The global plan approach works because it separates planning from execution. The planning model (GPT-4o-mini) is good at holistic reasoning about arc structure. The execution model (Gemini Flash) is good at generating rich visual descriptions. Each model does what it's best at, and the structured plan is the interface contract between them. This kind of state machine architecture for multi-stage AI pipelines gives us crash recovery and debuggability at every step.
What Vectorial Constraint Looks Like in the Schema
The structured output schema enforces the plan's shape at parse time. Each segment must include all fields, and the prohibited_state_changes field requires 2-4 entries. Here's a simplified view, sketched as Pydantic models (one common way to drive OpenAI's structured outputs; the field list matches our schema, the exact code is illustrative):

```python
from pydantic import BaseModel, Field

class SegmentShotPlan(BaseModel):
    segment_index: int
    visual_content: str    # concrete actions, subjects, locations
    entry_state: str       # emotional state entering (chained)
    exit_state: str        # emotional state leaving
    # 2-4 vectorial prohibitions, enforced at parse time
    prohibited_state_changes: list[str] = Field(min_length=2, max_length=4)
    cut_transition: str    # how this segment begins
    camera_strategy: str   # technique + emotional justification

class GlobalShotPlan(BaseModel):
    segments: list[SegmentShotPlan]
```
The key design choice here is that prohibited_state_changes is a list of strings, not an enum. Each prohibition is a natural-language sentence that names the forbidden transition and attributes it to the segment that owns it. We tried using structured enums (e.g., RELEASE, ESCALATION, RESOLUTION) but found that natural language prohibitions were more effective — the downstream model understood "do not deliver cathartic release (owned by segment 2)" better than prohibited: [RELEASE].
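The difference is easiest to see side by side. A toy comparison — both dicts are illustrative, not the production schema:

```python
# The same constraint expressed two ways. The enum tag is compact but
# drops the ownership attribution; the natural-language sentence carries
# both the forbidden transition and which segment owns it.
enum_style = {"prohibited": ["RELEASE"]}
prose_style = {"prohibited_state_changes": [
    "Do not deliver cathartic release (owned by segment 2).",
]}
```

In our experience the attribution is what the downstream model actually uses: it explains why the beat is off-limits, not just that it is.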
Where This Pattern Applies Beyond Video
The vectorial constraint pattern isn't specific to video generation. Any LLM pipeline that produces sequential, interdependent outputs will hit the same convergence problem. If you're generating a multi-chapter report, each chapter's LLM call will gravitate toward the same key insight unless you pre-plan which chapter owns which argument. If you're building a text-to-video music pipeline, each prompt needs to know what emotional territory has already been claimed.
The pattern has three components that transfer cleanly:
- A global planning step that sees all segments before any content is generated
- Chained state transitions where each segment's entry state must match the previous segment's exit state
- Negative constraints that explicitly prohibit each segment from delivering transitions owned by other segments
The global planning step adds one LLM call to the pipeline — roughly 800-1200 tokens of output for a 3-segment plan using GPT-4o-mini. The cost is negligible. The downstream improvement in coherence is not.
The Pattern at a Glance
Vectorial constraint in multi-step LLM pipelines means planning globally before executing locally, and constraining each step by what it cannot do — not just what it should. The global shot plan with chained entry/exit states and prohibited state changes turned our AI video pipeline from producing disconnected segments into producing coherent visual narratives. The cost was one additional LLM call. The improvement was the difference between a slideshow and a story.