Syncing AI-generated voiceover, music, and video into a coherent timeline requires word-level timestamps as the single source of truth for duration and pacing. Without them, each AI output — TTS audio, generated music, animated video clips — runs at its own arbitrary length, and the final render drifts out of sync within seconds.
This is the core challenge in any AI video pipeline that combines multiple generative models. The voiceover dictates narrative pacing, but the music has its own tempo, and each AI-generated video clip lands at whatever duration the model decides. Word-level timestamps solve this by giving you millisecond-accurate anchor points to cut, stretch, and align every layer.
Why AI outputs don't naturally align
When you generate voiceover with a TTS model like ElevenLabs, the audio length depends on the text, speaking rate, and SSML pauses you insert. A 45-second voiceover script might render as 38 seconds or 52 seconds depending on the voice and parameters.
AI music generation is even less predictable. You request a "90-second cinematic score" and get back something between 80 and 100 seconds. The model doesn't know your voiceover length because it never heard it.
AI video generation (image-to-video models like Seedance) produces clips at fixed durations — typically 5 or 10 seconds — regardless of how long your scene's narration actually runs. You can't tell the model "make this shot exactly 7.3 seconds." It gives you what it gives you.
How word-level timestamps become the sync bridge
The fix is to generate the voiceover first, then extract word-level timestamps from it using a transcription model. ElevenLabs Scribe returns every word with its start and end time in milliseconds. This gives you a timing map of the entire narration.
With that map, you know exactly when each scene's narration starts and ends. If Scene 3 covers words 47 through 62 and word 47 starts at 23,400ms while word 62 ends at 31,200ms, Scene 3's duration is exactly 7,800 milliseconds. That becomes the target duration for Scene 3's video clip and the window in which Scene 3's portion of the music must play.
# Extract scene durations from word-level timestamps
def calculate_scene_durations(words, scene_boundaries):
    """
    words: list of {"word": str, "start": float, "end": float} from Scribe
    scene_boundaries: list of (start_word_idx, end_word_idx) per scene
    """
    durations = []
    for start_idx, end_idx in scene_boundaries:
        scene_start_ms = words[start_idx]["start"] * 1000
        scene_end_ms = words[end_idx]["end"] * 1000
        durations.append({
            "start_ms": scene_start_ms,
            "end_ms": scene_end_ms,
            "duration_ms": scene_end_ms - scene_start_ms,
        })
    return durations
This function maps each scene to an exact millisecond window derived from the voiceover's actual pacing — not from the script's word count or an estimate.
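As a quick sanity check, here is the same duration math applied to a tiny hypothetical Scribe-style word list (the words, times, and scene split are invented for illustration):

```python
# Hypothetical Scribe-style output: times in seconds, as above
words = [
    {"word": "welcome", "start": 0.00, "end": 0.42},
    {"word": "to", "start": 0.42, "end": 0.55},
    {"word": "the", "start": 0.55, "end": 0.68},
    {"word": "show", "start": 0.68, "end": 1.10},
]
scene_boundaries = [(0, 1), (2, 3)]  # two scenes of two words each

durations_ms = []
for start_idx, end_idx in scene_boundaries:
    start_ms = words[start_idx]["start"] * 1000
    end_ms = words[end_idx]["end"] * 1000
    durations_ms.append(end_ms - start_ms)
# Scene 1 spans 0-550 ms, Scene 2 spans 550-1100 ms
```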
How to trim AI video clips to exact scene durations
AI video models produce fixed-length clips. If your scene needs 7.8 seconds but the model generated a 10-second clip, you need to trim it. If the scene needs 12 seconds, you may need to slow the clip or generate a second clip and crossfade.
The simplest reliable approach is to generate clips slightly longer than needed and trim them to the exact duration with ffmpeg. Speed ramping (adjusting playback speed within a narrow range, roughly 0.8x-1.2x) absorbs small mismatches without visible distortion.
# Trim video clip to exact scene duration
import subprocess

def trim_clip_to_duration(input_path, output_path, target_duration_ms):
    """Trim or speed-adjust a video clip to hit an exact target duration."""
    # Get source duration in milliseconds
    probe = subprocess.run(
        ["ffprobe", "-v", "quiet", "-show_entries",
         "format=duration", "-of", "csv=p=0", input_path],
        capture_output=True, text=True
    )
    source_duration_ms = float(probe.stdout.strip()) * 1000
    speed_factor = source_duration_ms / target_duration_ms
    # Only speed-adjust if within acceptable range (0.8x - 1.2x)
    if 0.8 <= speed_factor <= 1.2:
        subprocess.run([
            "ffmpeg", "-i", input_path,
            "-filter:v", f"setpts={1/speed_factor}*PTS",
            "-t", f"{target_duration_ms / 1000:.3f}",
            "-y", output_path
        ], check=True)
    else:
        # Hard trim: re-encode so the cut is frame-accurate
        # (a stream copy would snap the cut to the nearest keyframe)
        subprocess.run([
            "ffmpeg", "-i", input_path,
            "-t", f"{target_duration_ms / 1000:.3f}",
            "-c:v", "libx264", "-an", "-y", output_path
        ], check=True)
    return output_path
Speed factors between 0.8x and 1.2x are generally imperceptible on AI-generated footage, which rarely contains the fast, familiar motion that exposes retiming. Beyond that range, a hard trim is safer than a visible speed change.
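That decision rule can be factored into a small helper. The function name and return shape are illustrative, but the 0.8x-1.2x range matches the one used above:

```python
def plan_adjustment(source_ms, target_ms, lo=0.8, hi=1.2):
    """Decide whether to speed-adjust or hard-trim a clip (illustrative helper)."""
    factor = source_ms / target_ms
    if lo <= factor <= hi:
        return ("speed", factor)  # retime the whole clip
    return ("trim", None)         # cut at the target duration instead

# A 9 s clip reaching a 7.8 s target is a ~1.15x speed-up;
# a 12 s clip overshoots the range and gets trimmed instead.
```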
How to segment and mix audio to match scene timing
The voiceover is already correctly timed — it's the source of truth. The music score needs to be trimmed to the same total duration and mixed underneath the voiceover at the right volume levels.
pydub handles this cleanly. You load both audio tracks, trim the music to match voiceover length, apply dynamic compression to the voiceover (so quiet words don't get buried), and mix them with the music ducked below the narration.
from pydub import AudioSegment
from pydub.effects import compress_dynamic_range

def mix_voiceover_and_score(vo_path, music_path, output_path):
    """Mix voiceover and music score with proper leveling."""
    voiceover = AudioSegment.from_file(vo_path)
    music = AudioSegment.from_file(music_path)
    # Trim music to match voiceover duration
    target_duration = len(voiceover)
    if len(music) > target_duration:
        # Fade out music at the end
        music = music[:target_duration].fade_out(2000)
    elif len(music) < target_duration:
        # Loop music if too short
        loops_needed = (target_duration // len(music)) + 1
        music = (music * loops_needed)[:target_duration].fade_out(2000)
    # Compress voiceover dynamics so quiet words stay audible
    voiceover = compress_dynamic_range(voiceover, threshold=-20.0, ratio=4.0)
    # Duck music under voiceover (14 dB below narration)
    music = music - 14
    # Mix
    mixed = voiceover.overlay(music)
    mixed.export(output_path, format="wav")
    return output_path
Ducking the music 14 dB below the voiceover keeps the score audible as atmosphere without competing with the narration. Adjust this value based on the voice model's output levels.
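If you want the duck level to track actual levels rather than a fixed constant, you can compute the gain from measured loudness. pydub exposes an AudioSegment.dBFS property and an apply_gain method; the helper below just does the dB arithmetic, with the 14 dB gap as a default (the function name is my own):

```python
def ducking_gain(vo_dbfs, music_dbfs, gap_db=14.0):
    """Gain (dB) to apply to the music so its average level sits gap_db below the voiceover."""
    return (vo_dbfs - gap_db) - music_dbfs

# e.g. voiceover at -16 dBFS, music at -12 dBFS: apply -18 dB to the music
```

In the mixing function, music.apply_gain(ducking_gain(voiceover.dBFS, music.dBFS)) would then replace the fixed 14 dB subtraction.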
How to concatenate synced scenes into the final video
Once every video clip is trimmed to its scene duration and the audio mix is complete, the final step is concatenating all clips in order and laying the mixed audio on top.
# Generate ffmpeg concat file and merge with audio
import subprocess

def concat_final_video(clip_paths, audio_path, output_path):
    """Concatenate trimmed clips and overlay mixed audio."""
    # Write concat list
    concat_file = "/tmp/concat_list.txt"
    with open(concat_file, "w") as f:
        for path in clip_paths:
            f.write(f"file '{path}'\n")
    # Concat video, replace audio with mixed track
    subprocess.run([
        "ffmpeg",
        "-f", "concat", "-safe", "0", "-i", concat_file,
        "-i", audio_path,
        "-map", "0:v", "-map", "1:a",
        "-c:v", "libx264", "-preset", "slow", "-crf", "18",
        "-c:a", "aac", "-b:a", "192k",
        "-shortest", "-y", output_path
    ], check=True)
    return output_path
The -shortest flag ensures the output matches the shorter of video or audio, catching any sub-frame rounding errors. The CRF 18 setting preserves visual quality without ballooning file size.
Where timing errors actually come from
The most common sync failures aren't in the code — they're in the voiceover generation step. If your TTS model inserts unexpected pauses or rushes through a sentence, the word-level timestamps will be accurate to what was spoken, but what was spoken won't match what you expected.
SSML controls in the TTS request are the first line of defense. Explicit <break> tags between scenes give you predictable pause points. Rate adjustments via <prosody> tags keep the pacing consistent across emotional tones in the script.
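A minimal sketch of what that looks like in the request text. Exact tag support varies by TTS provider, so treat the attribute values here as placeholders and check your model's SSML documentation:

```xml
<speak>
  <p>Scene one narration ends here.</p>
  <break time="700ms"/>
  <p><prosody rate="95%">Scene two continues at a slightly slower rate.</prosody></p>
</speak>
```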
The second failure point is scene boundary detection. If your pipeline splits narration into scenes at the wrong word boundaries, every downstream duration calculation is wrong. Validating scene splits against the original script structure — before generating any video — prevents cascading timing errors.
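A cheap structural check catches most boundary mistakes before any video is generated. This sketch is my own, assuming boundaries are inclusive word-index pairs as in the earlier duration function:

```python
def validate_scene_splits(words, scene_boundaries):
    """Check that scene boundaries are ordered, contiguous, and cover every word."""
    expected_start = 0
    for start_idx, end_idx in scene_boundaries:
        # Each scene must begin where the previous one ended, with no gaps or overlaps
        if start_idx != expected_start or end_idx < start_idx:
            return False
        expected_start = end_idx + 1
    # All words must be assigned to some scene
    return expected_start == len(words)
```

A False here means every downstream scene duration would be computed from the wrong words, so it is worth failing loudly at this point.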
The complete sync pattern at a glance
Syncing AI voiceover, music, and video using word-level timestamps follows a strict sequence: generate the voiceover first with SSML controls, extract word-level timestamps via transcription, calculate exact scene durations from those timestamps, trim each video clip to its scene's millisecond window, mix the music score to match the total voiceover duration, and concatenate everything with ffmpeg. The word-level timestamps are the single coordination layer that makes independently generated AI outputs behave as a unified timeline. Every other approach — estimating durations from word counts, using fixed clip lengths, or hoping the models agree — breaks at scale.