Multi-step LLM pipelines that generate sequential content — video segments, story chapters, slide decks — break down when each step runs in isolation. The fix is a global planning step that runs before any content is generated, producing a binding contract that tells each downstream step not just what to do, but what it cannot do. We call this pattern vectorial constraint, and it solved the single biggest quality problem in our AI music video generation pipeline.
We built this for a platform that generates multi-segment music videos from uploaded tracks. Each segment is 8-12 seconds of video, and each gets its own LLM-generated prompt fed to a text-to-video model. Before we introduced a global shot plan, segment 3 had no idea what segments 1 and 2 had already done. The result: repeated visual beats, inconsistent emotional arcs, and segments that felt like three separate videos stitched together rather than one cohesive piece.
Why Sequential LLM Calls Produce Repetitive Output
The naive approach to multi-segment generation is straightforward: loop over your segments, call the LLM for each one, pass it the creative brief, get back a prompt. The problem is that LLMs are mode-seeking. Given the same creative concept, they converge on the same "best" interpretation every time.
If your concept is "melancholic urban night scene," every segment will independently decide that the emotional peak is a character standing in the rain looking up. The model isn't wrong — that is a strong visual beat for that concept. But when three segments all deliver the same emotional release, the video has no arc. It's the same moment repeated three times.
Passing prior prompts as context helps with surface-level deduplication ("don't repeat the rain shot"), but it doesn't solve the deeper problem: each segment still independently decides what emotional function it serves. You get visual variety with emotional monotony.
How a Global Shot Plan Creates Coherence Across Segments
We added a planning step (Step 2c in our pipeline) that runs once, before any video prompt is generated. A single GPT-4o-mini call using OpenAI's structured outputs produces a concrete visual blueprint for all segments simultaneously. Each segment in the plan gets:
- visual_content — what physically happens on screen (location-scout level, not abstract emotions)
- entry_state / exit_state — the emotional state the viewer is in entering and leaving this segment
- prohibited_state_changes — emotional transitions this segment must NOT deliver
- cut_transition — how this segment begins relative to the previous one
- camera_strategy — specific camera technique with emotional justification
The entry/exit states are chained: each segment's entry_state must be the logical consequence of the previous segment's exit_state. Segment 0 always starts from "cold open — viewer has no prior context." The final segment's exit state must deliver the creative concept's target emotional destination.
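The chaining rule is mechanical enough to lint before any prompt generation runs. Here is a minimal sketch, assuming the plan is a list of dicts using the field names above — validate_chain and its error messages are our illustration, not the pipeline's actual code:

```python
# Hypothetical continuity lint for a global shot plan. Field names follow
# the schema described in this post; the function itself is illustrative.
COLD_OPEN = "cold open — viewer has no prior context"

def validate_chain(segments: list[dict], target_exit: str) -> list[str]:
    """Return a list of continuity violations in a global shot plan."""
    errors = []
    if segments and segments[0]["entry_state"] != COLD_OPEN:
        errors.append("segment 0 must start from a cold open")
    for prev, curr in zip(segments, segments[1:]):
        if curr["entry_state"] != prev["exit_state"]:
            errors.append(
                f"segment {curr['segment_index']} enters in "
                f"'{curr['entry_state']}' but the previous segment "
                f"exits in '{prev['exit_state']}'"
            )
    if segments and segments[-1]["exit_state"] != target_exit:
        errors.append("final segment must land on the target emotion")
    return errors
```

Because the invariant is checkable in plain code, a malformed plan can be rejected and regenerated before any expensive video generation happens.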
This plan is stored as JSONB in PostgreSQL alongside the rest of the pipeline state, which means we can query, debug, and compare shot plans across generations without parsing log files.
Why Telling the Model What NOT to Do Matters More Than What to Do
The prohibited_state_changes field is the most powerful part of the schema. Rather than only telling each segment what emotional beat to hit, we explicitly constrain what it cannot do. Research on negative constraints in LLM alignment suggests that training and prompting with negative-only feedback can match or exceed positive guidance — our production experience points the same way in the pipeline orchestration context.
Here's a concrete example. Say you have a three-segment video with an emotional arc of "quiet tension, building dread, cathartic release." The plan might assign:
- Segment 0 — prohibited: "Do not deliver cathartic release (owned by segment 2). Do not escalate beyond unease (escalation owned by segment 1)."
- Segment 1 — prohibited: "Do not resolve tension (resolution owned by segment 2). Do not return to calm (calm was spent in segment 0)."
- Segment 2 — prohibited: "Do not re-establish tension (already spent in segments 0-1). Do not end in ambiguity (the arc demands resolution)."
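Spelled out as data, the ownership convention is easy to sanity-check. Below is a hypothetical encoding of the plan above, with a small lint confirming that every owner a prohibition names actually exists in the plan — referenced_owners is illustrative, not part of the pipeline:

```python
import re

# Illustrative data only: the example three-segment plan's prohibitions,
# each naming the forbidden transition and the segment that owns it.
prohibitions = {
    0: ["Do not deliver cathartic release (owned by segment 2).",
        "Do not escalate beyond unease (escalation owned by segment 1)."],
    1: ["Do not resolve tension (resolution owned by segment 2).",
        "Do not return to calm (calm was spent in segment 0)."],
    2: ["Do not re-establish tension (already spent in segments 0-1).",
        "Do not end in ambiguity (the arc demands resolution)."],
}

def referenced_owners(rules: list[str]) -> set[int]:
    """Segment indices named as owners inside prohibition sentences."""
    return {int(n) for rule in rules
            for n in re.findall(r"segments? (\d+)", rule)}

# Quick lint: every owner a prohibition points at must exist in the plan.
for rules in prohibitions.values():
    for owner in referenced_owners(rules):
        assert owner in prohibitions
```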
Each prohibition names which segment owns that transition. This is what makes it vectorial — we're defining each segment's direction by ruling out wrong directions, not just pointing at the right one. The model can still be creative within its assigned lane, but it can't accidentally deliver someone else's emotional payload.
Why Duration Awareness Changes the Arc Shape
We learned the hard way that a generic three-act structure doesn't fit every clip. An 8-second video with one segment doesn't have room for setup, escalation, and payoff. It has room for one defining moment.
The arc planner receives duration-aware guidance before it plans anything:
- 1 segment (8-12s): Single Moment — one charged emotional beat caught mid-stream. No arc, no progression.
- 2 segments (13-24s): Contrast / Juxtaposition — a before/after pair or tension/release duality.
- 3 segments (25-30s): Three-Act Progression — establish, complicate, resolve.
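The mapping above is simple enough to inject as a pure function ahead of the planning call. A sketch using the guidance strings from the list — arc_shape is a hypothetical helper name, not the pipeline's actual API:

```python
# Hypothetical duration-aware guidance lookup, keyed on segment count
# rather than wall-clock time, since the LLM only sees tokens.
def arc_shape(segment_count: int) -> str:
    """Map segment count to the arc-shape guidance given to the planner."""
    if segment_count <= 1:
        return ("Single Moment: one charged emotional beat caught "
                "mid-stream. No arc, no progression.")
    if segment_count == 2:
        return ("Contrast / Juxtaposition: a before/after pair or "
                "tension/release duality.")
    return "Three-Act Progression: establish, complicate, resolve."
```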
Without this, we saw the LLM cramming a full hero's journey into an 8-second clip. The model doesn't know how long a segment is in real time — it only sees tokens. Injecting arc shape guidance based on segment count is a cheap fix that eliminated an entire class of pacing failures.
How the Plan Feeds Downstream Without Becoming a Bottleneck
Each downstream video prompt generation step (Step 3 in our pipeline) receives the full global shot plan for all segments, plus all previously generated prompts. The prompt generator isn't inventing content — it's executing a pre-planned arc slice.
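A minimal sketch of how that per-segment context might be assembled, assuming the plan is the list of segment dicts described earlier — build_step3_context and the section labels are our illustration, not the production prompt template:

```python
import json

# Hypothetical context builder for the execution step: the full plan,
# all prior prompts, and this segment's assigned slice of the arc.
def build_step3_context(plan: list[dict], index: int,
                        previous_prompts: list[str]) -> str:
    """Give the execution model the whole plan plus its assigned slice."""
    me = plan[index]
    prior = ("PREVIOUS PROMPTS:\n" + "\n".join(previous_prompts)
             if previous_prompts else "PREVIOUS PROMPTS: none")
    parts = [
        "GLOBAL SHOT PLAN:\n" + json.dumps(plan, indent=2),
        prior,
        f"YOUR SEGMENT: {index}",
        f"ENTER IN: {me['entry_state']}",
        f"EXIT IN: {me['exit_state']}",
        "PROHIBITED:\n" + "\n".join(
            f"- {p}" for p in me["prohibited_state_changes"]),
    ]
    return "\n\n".join(parts)
```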
We considered two alternatives before landing on this architecture. First, we tried embedding all constraints directly in the Step 3 system prompt — a single mega-prompt with rules for each segment inline. This worked for 2 segments but became unreliable at 3, because the model would lose track of which constraints applied to which segment. Second, we tried a feedback loop where Step 3 would generate, then a critic model would reject prompts that violated continuity. This doubled latency and still missed subtle emotional repetitions.
The global plan approach works because it separates planning from execution. The planning model (GPT-4o-mini) is good at holistic reasoning about arc structure. The execution model (Gemini Flash) is good at generating rich visual descriptions. Each model does what it's best at, and the structured plan is the interface contract between them. This kind of state machine architecture for multi-stage AI pipelines gives us crash recovery and debuggability at every step.
What Vectorial Constraint Looks Like in the Schema
The structured output schema enforces the plan's shape at parse time. Each segment must include all fields, and the prohibited_state_changes field requires 2-4 entries. Here's a simplified view, sketched as Pydantic models (one common way to drive OpenAI's structured outputs; the field list matches our schema, the exact code is illustrative):

```python
from pydantic import BaseModel, Field

class SegmentShotPlan(BaseModel):
    segment_index: int
    visual_content: str    # concrete actions, subjects, locations
    entry_state: str       # emotional state entering (chained)
    exit_state: str        # emotional state leaving
    # 2-4 vectorial prohibitions, enforced at parse time
    prohibited_state_changes: list[str] = Field(min_length=2, max_length=4)
    cut_transition: str    # how this segment begins
    camera_strategy: str   # technique + emotional justification

class GlobalShotPlan(BaseModel):
    segments: list[SegmentShotPlan]
```
The key design choice here is that prohibited_state_changes is a list of strings, not an enum. Each prohibition is a natural-language sentence that names the forbidden transition and attributes it to the segment that owns it. We tried using structured enums (e.g., RELEASE, ESCALATION, RESOLUTION) but found that natural language prohibitions were more effective — the downstream model understood "do not deliver cathartic release (owned by segment 2)" better than prohibited: [RELEASE].
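The difference is easiest to see side by side. A toy comparison — both dicts are illustrative, not the production schema:

```python
# The same constraint expressed two ways. The enum tag is compact but
# drops the ownership attribution; the natural-language sentence carries
# both the forbidden transition and which segment owns it.
enum_style = {"prohibited": ["RELEASE"]}
prose_style = {"prohibited_state_changes": [
    "Do not deliver cathartic release (owned by segment 2).",
]}
```

In our experience the attribution is what the downstream model actually uses: it explains why the beat is off-limits, not just that it is.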
Where This Pattern Applies Beyond Video
The vectorial constraint pattern isn't specific to video generation. Any LLM pipeline that produces sequential, interdependent outputs will hit the same convergence problem. If you're generating a multi-chapter report, each chapter's LLM call will gravitate toward the same key insight unless you pre-plan which chapter owns which argument. If you're building a text-to-video music pipeline, each prompt needs to know what emotional territory has already been claimed.
The pattern has three components that transfer cleanly:
- A global planning step that sees all segments before any content is generated
- Chained state transitions where each segment's entry state must match the previous segment's exit state
- Negative constraints that explicitly prohibit each segment from delivering transitions owned by other segments
The global planning step adds one LLM call to the pipeline — roughly 800-1200 tokens of output for a 3-segment plan using GPT-4o-mini. The cost is negligible. The downstream improvement in coherence is not.
The Pattern at a Glance
Vectorial constraint in multi-step LLM pipelines means planning globally before executing locally, and constraining each step by what it cannot do — not just what it should. The global shot plan with chained entry/exit states and prohibited state changes turned our AI video pipeline from producing disconnected segments into producing coherent visual narratives. The cost was one additional LLM call. The improvement was the difference between a slideshow and a story.