Syncing AI-generated voiceover, music, and video into a coherent timeline requires word-level timestamps as the single source of truth for duration and pacing. Without them, each AI output — TTS audio, generated music, animated video clips — runs at its own arbitrary length, and the final render drifts out of sync within seconds.
This is the core challenge in any AI video pipeline that combines multiple generative models. The voiceover dictates narrative pacing, but the music has its own tempo, and each AI-generated video clip lands at whatever duration the model decides. Word-level timestamps solve this by giving you millisecond-accurate anchor points to cut, stretch, and align every layer.
Why AI outputs don't naturally align
When you generate voiceover with a TTS model like ElevenLabs, the audio length depends on the text, speaking rate, and SSML pauses you insert. A 45-second voiceover script might render as 38 seconds or 52 seconds depending on the voice and parameters.
AI music generation is even less predictable. You request a "90-second cinematic score" and get back something between 80 and 100 seconds. The model doesn't know your voiceover length because it never heard it.
AI video generation (image-to-video models like Seedance) produces clips at fixed durations — typically 5 or 10 seconds — regardless of how long your scene's narration actually runs. You can't tell the model "make this shot exactly 7.3 seconds." It gives you what it gives you.
How word-level timestamps become the sync bridge
The fix is to generate the voiceover first, then extract word-level timestamps from it using a transcription model. ElevenLabs Scribe returns every word with its start and end time in milliseconds. This gives you a timing map of the entire narration.
With that map, you know exactly when each scene's narration starts and ends. If Scene 3 covers words 47 through 62 and word 47 starts at 23,400ms while word 62 ends at 31,200ms, Scene 3's duration is exactly 7,800 milliseconds. That becomes the target duration for Scene 3's video clip and the window in which Scene 3's portion of the music must play.
# Extract scene durations from word-level timestamps
def calculate_scene_durations(words, scene_boundaries):
    """
    words: list of {"word": str, "start": float, "end": float} from Scribe
    scene_boundaries: list of (start_word_idx, end_word_idx) per scene
    """
    durations = []
    for start_idx, end_idx in scene_boundaries:
        scene_start_ms = words[start_idx]["start"] * 1000
        scene_end_ms = words[end_idx]["end"] * 1000
        durations.append({
            "start_ms": scene_start_ms,
            "end_ms": scene_end_ms,
            "duration_ms": scene_end_ms - scene_start_ms,
        })
    return durations
This function maps each scene to an exact millisecond window derived from the voiceover's actual pacing — not from the script's word count or an estimate.
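As a quick sanity check, here is the same duration math applied to a tiny hypothetical Scribe-style word list (the words, times, and scene split are invented for illustration):

```python
# Hypothetical Scribe-style output: times in seconds, as above
words = [
    {"word": "welcome", "start": 0.00, "end": 0.42},
    {"word": "to", "start": 0.42, "end": 0.55},
    {"word": "the", "start": 0.55, "end": 0.68},
    {"word": "show", "start": 0.68, "end": 1.10},
]
scene_boundaries = [(0, 1), (2, 3)]  # two scenes of two words each

durations_ms = []
for start_idx, end_idx in scene_boundaries:
    start_ms = words[start_idx]["start"] * 1000
    end_ms = words[end_idx]["end"] * 1000
    durations_ms.append(end_ms - start_ms)
# Scene 1 spans 0-550 ms, Scene 2 spans 550-1100 ms
```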
How to trim AI video clips to exact scene durations
AI video models produce fixed-length clips. If your scene needs 7.8 seconds but the model generated a 10-second clip, you need to trim it. If the scene needs 12 seconds, you may need to slow the clip or generate a second clip and crossfade.
The simplest reliable approach is to generate clips slightly longer than needed and trim them to the exact duration with ffmpeg. Speed ramping (adjusting playback speed within a narrow range, roughly 0.8x-1.2x) absorbs small mismatches without visible distortion.
# Trim video clip to exact scene duration
import subprocess

def trim_clip_to_duration(input_path, output_path, target_duration_ms):
    """Trim or speed-adjust a video clip to hit an exact target duration."""
    # Get source duration in milliseconds
    probe = subprocess.run(
        ["ffprobe", "-v", "quiet", "-show_entries",
         "format=duration", "-of", "csv=p=0", input_path],
        capture_output=True, text=True
    )
    source_duration_ms = float(probe.stdout.strip()) * 1000
    speed_factor = source_duration_ms / target_duration_ms
    # Only speed-adjust if within acceptable range (0.8x - 1.2x)
    if 0.8 <= speed_factor <= 1.2:
        subprocess.run([
            "ffmpeg", "-i", input_path,
            "-filter:v", f"setpts={1/speed_factor}*PTS",
            "-t", f"{target_duration_ms / 1000:.3f}",
            "-y", output_path
        ], check=True)
    else:
        # Hard trim: re-encode so the cut is frame-accurate
        # (a stream copy would snap the cut to the nearest keyframe)
        subprocess.run([
            "ffmpeg", "-i", input_path,
            "-t", f"{target_duration_ms / 1000:.3f}",
            "-c:v", "libx264", "-an", "-y", output_path
        ], check=True)
    return output_path
Speed factors between 0.8x and 1.2x are generally imperceptible on AI-generated footage, which rarely contains the fast, familiar motion that exposes retiming. Beyond that range, a hard trim is safer than a visible speed change.
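That decision rule can be factored into a small helper. The function name and return shape are illustrative, but the 0.8x-1.2x range matches the one used above:

```python
def plan_adjustment(source_ms, target_ms, lo=0.8, hi=1.2):
    """Decide whether to speed-adjust or hard-trim a clip (illustrative helper)."""
    factor = source_ms / target_ms
    if lo <= factor <= hi:
        return ("speed", factor)  # retime the whole clip
    return ("trim", None)         # cut at the target duration instead

# A 9 s clip reaching a 7.8 s target is a ~1.15x speed-up;
# a 12 s clip overshoots the range and gets trimmed instead.
```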
How to segment and mix audio to match scene timing
The voiceover is already correctly timed — it's the source of truth. The music score needs to be trimmed to the same total duration and mixed underneath the voiceover at the right volume levels.
pydub handles this cleanly. You load both audio tracks, trim the music to match voiceover length, apply dynamic compression to the voiceover (so quiet words don't get buried), and mix them with the music ducked below the narration.
from pydub import AudioSegment
from pydub.effects import compress_dynamic_range

def mix_voiceover_and_score(vo_path, music_path, output_path):
    """Mix voiceover and music score with proper leveling."""
    voiceover = AudioSegment.from_file(vo_path)
    music = AudioSegment.from_file(music_path)
    # Trim music to match voiceover duration
    target_duration = len(voiceover)
    if len(music) > target_duration:
        # Fade out music at the end
        music = music[:target_duration].fade_out(2000)
    elif len(music) < target_duration:
        # Loop music if too short
        loops_needed = (target_duration // len(music)) + 1
        music = (music * loops_needed)[:target_duration].fade_out(2000)
    # Compress voiceover dynamics so quiet words stay audible
    voiceover = compress_dynamic_range(voiceover, threshold=-20.0, ratio=4.0)
    # Duck music under voiceover (14 dB below narration)
    music = music - 14
    # Mix
    mixed = voiceover.overlay(music)
    mixed.export(output_path, format="wav")
    return output_path
Ducking the music 14 dB below the voiceover keeps the score audible as atmosphere without competing with the narration. Adjust this value based on the voice model's output levels.
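If you want the duck level to track actual levels rather than a fixed constant, you can compute the gain from measured loudness. pydub exposes an AudioSegment.dBFS property and an apply_gain method; the helper below just does the dB arithmetic, with the 14 dB gap as a default (the function name is my own):

```python
def ducking_gain(vo_dbfs, music_dbfs, gap_db=14.0):
    """Gain (dB) to apply to the music so its average level sits gap_db below the voiceover."""
    return (vo_dbfs - gap_db) - music_dbfs

# e.g. voiceover at -16 dBFS, music at -12 dBFS: apply -18 dB to the music
```

In the mixing function, music.apply_gain(ducking_gain(voiceover.dBFS, music.dBFS)) would then replace the fixed 14 dB subtraction.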
How to concatenate synced scenes into the final video
Once every video clip is trimmed to its scene duration and the audio mix is complete, the final step is concatenating all clips in order and laying the mixed audio on top.
# Generate ffmpeg concat file and merge with audio
import subprocess

def concat_final_video(clip_paths, audio_path, output_path):
    """Concatenate trimmed clips and overlay mixed audio."""
    # Write concat list
    concat_file = "/tmp/concat_list.txt"
    with open(concat_file, "w") as f:
        for path in clip_paths:
            f.write(f"file '{path}'\n")
    # Concat video, replace audio with mixed track
    subprocess.run([
        "ffmpeg",
        "-f", "concat", "-safe", "0", "-i", concat_file,
        "-i", audio_path,
        "-map", "0:v", "-map", "1:a",
        "-c:v", "libx264", "-preset", "slow", "-crf", "18",
        "-c:a", "aac", "-b:a", "192k",
        "-shortest", "-y", output_path
    ], check=True)
    return output_path
The -shortest flag ensures the output matches the shorter of video or audio, catching any sub-frame rounding errors. The CRF 18 setting preserves visual quality without ballooning file size.
Where timing errors actually come from
The most common sync failures aren't in the code — they're in the voiceover generation step. If your TTS model inserts unexpected pauses or rushes through a sentence, the word-level timestamps will be accurate to what was spoken, but what was spoken won't match what you expected.
SSML controls in the TTS request are the first line of defense. Explicit <break> tags between scenes give you predictable pause points. Rate adjustments via <prosody> tags keep the pacing consistent across emotional tones in the script.
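A minimal sketch of what that looks like in the request text. Exact tag support varies by TTS provider, so treat the attribute values here as placeholders and check your model's SSML documentation:

```xml
<speak>
  <p>Scene one narration ends here.</p>
  <break time="700ms"/>
  <p><prosody rate="95%">Scene two continues at a slightly slower rate.</prosody></p>
</speak>
```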
The second failure point is scene boundary detection. If your pipeline splits narration into scenes at the wrong word boundaries, every downstream duration calculation is wrong. Validating scene splits against the original script structure — before generating any video — prevents cascading timing errors.
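A cheap structural check catches most boundary mistakes before any video is generated. This sketch is my own, assuming boundaries are inclusive word-index pairs as in the earlier duration function:

```python
def validate_scene_splits(words, scene_boundaries):
    """Check that scene boundaries are ordered, contiguous, and cover every word."""
    expected_start = 0
    for start_idx, end_idx in scene_boundaries:
        # Each scene must begin where the previous one ended, with no gaps or overlaps
        if start_idx != expected_start or end_idx < start_idx:
            return False
        expected_start = end_idx + 1
    # All words must be assigned to some scene
    return expected_start == len(words)
```

A False here means every downstream scene duration would be computed from the wrong words, so it is worth failing loudly at this point.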
The complete sync pattern at a glance
Syncing AI voiceover, music, and video using word-level timestamps follows a strict sequence: generate the voiceover first with SSML controls, extract word-level timestamps via transcription, calculate exact scene durations from those timestamps, trim each video clip to its scene's millisecond window, mix the music score to match the total voiceover duration, and concatenate everything with ffmpeg. The word-level timestamps are the single coordination layer that makes independently generated AI outputs behave as a unified timeline. Every other approach — estimating durations from word counts, using fixed clip lengths, or hoping the models agree — breaks at scale.