How to Generate Multi-Language Subtitles for AI Videos Using Kling, Whisper, and Claude API
Build a local subtitle pipeline combining Kling 3.0, Whisper, and Claude API to generate SRT files in 8+ languages — no captioning software needed.
Here’s the dirty secret about AI video creation in 2026: generating the video is the easy part. The moment you want captions — let alone captions in Spanish, Japanese, and Portuguese simultaneously — you’re suddenly staring at a $30/month captioning SaaS, a pile of misaligned timestamps, and the slow creep of existential dread. Kling 3.0 generates genuinely impressive video. What it doesn’t do, despite what some workflow articles claim, is transcribe speech natively. That feature isn’t built into Kling.
So instead of pretending it is, here’s a workflow that actually works: Kling 3.0 handles video generation, OpenAI’s Whisper handles speech-to-text transcription locally on your machine, and Claude API handles translation into as many languages as you need. You export clean SRT files, drop them into YouTube or TikTok, and you’re done. No third-party captioning platform. No recurring subscription. Just three tools doing exactly what they’re each best at.
This tutorial walks through the full pipeline, from generating your video in Kling to uploading a perfectly timed SRT to YouTube. You’ll need basic comfort with a terminal and Python, but nothing exotic.
What You’ll Actually Build
By the end of this tutorial, you’ll have a working local pipeline that takes any MP4 file, Kling-generated or otherwise, and produces: a timestamped English SRT file, translated SRT files in eight or more languages, and optionally, burned-in subtitle overlays with custom fonts using FFmpeg. The whole process for a 60-second video runs in under two minutes on a modern laptop.
What You Need Before You Start
You need four things: a Kling 3.0 account (available at klingai.com), an Anthropic API key for Claude, Python 3.10+ with pip, and FFmpeg installed locally. Whisper runs locally for free — no API key, no quota limits, no data leaving your machine unless you want it to. For Claude API access, Anthropic charges per token; translating a 60-second video transcript into 8 languages costs roughly $0.02 with Claude Haiku 4.5. Yes, two cents.
Note 💡
Whisper comes in five model sizes: tiny, base, small, medium, and large. For subtitle generation, medium hits the sweet spot: accurate enough for clean speech, fast enough that you won’t fall asleep waiting. Use large-v3 only if your audio has heavy accents or background noise worth fighting.
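The two-cent translation figure can be sanity-checked with simple arithmetic. The sketch below is a rough estimate, not a billing calculation: the per-million-token prices and the token counts are assumptions chosen for illustration, so check Anthropic’s current pricing before relying on them. The function name is hypothetical.

```python
# Rough cost estimate for translating one transcript into several languages.
# Every default and sample number here is an assumption, not a quoted price.

def estimate_translation_cost(
    input_tokens_per_call: int,
    output_tokens_per_call: int,
    num_languages: int,
    usd_per_million_input: float = 1.00,   # assumed input price
    usd_per_million_output: float = 5.00,  # assumed output price
) -> float:
    """Return the estimated USD cost of one API call per language."""
    input_cost = input_tokens_per_call / 1_000_000 * usd_per_million_input
    output_cost = output_tokens_per_call / 1_000_000 * usd_per_million_output
    return num_languages * (input_cost + output_cost)


# Assumed sizes for a 60-second clip: about 300 tokens sent, 400 returned.
cost = estimate_translation_cost(300, 400, num_languages=8)
print(f"${cost:.4f}")
```

Output tokens dominate the total under these assumptions, so languages that tokenize less compactly than English will cost somewhat more per call.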
Step 1 — Generate Your Video in Kling 3.0
Start in Kling 3.0 with a text-to-video or image-to-video prompt. For subtitle workflows, the single most important thing is audio clarity. If you’re generating talking-head style content or voiceover-driven video, your prompt should specify clean, unambiguous speech delivery. Here are prompts that consistently produce subtitle-friendly output:
A confident presenter speaking directly to camera, clear diction, minimal background noise, well-lit indoor setting, neutral background, close-up framing, natural speech pace -- professional explainer video style
This prompt keeps Kling focused on producing a video where Whisper will actually have something clean to work with. The “clear diction” and “minimal background noise” descriptors meaningfully affect the generated audio character. Swap “confident presenter” for “documentary narrator” or “product demo host” depending on your content type.
Professional product demonstration video, calm voiceover narration describing product features, neutral indoor background, medium shot, steady camera, clear audio, no music overlay -- e-commerce style
For e-commerce content specifically, this format works well because the pacing is deliberate and the vocabulary is repetitive enough that Whisper’s accuracy climbs above 95% even on the base model.
Pro tip ✅
If Kling generates a video with background music mixed into the speech track, Whisper will struggle. Before transcribing, run a quick audio separation pass with Demucs (pip install demucs). It splits vocals from background music in about 30 seconds and saves you hours of manual timestamp correction.
Step 2 — Transcribe with Whisper
Download your Kling video as an MP4. Then install and run Whisper:
pip install openai-whisper
whisper your_video.mp4 --model medium --output_format srt --language en --output_dir ./subtitles
This single command transcribes the audio and writes a properly formatted SRT file into the ./subtitles folder. The --language en flag tells Whisper the audio is English, so it skips automatic language detection, which speeds things up. If your source content is in another language, swap en for the appropriate language code (es, ja, pt, etc.) and Whisper will transcribe in that language before you send the text to Claude for translation.
whisper your_video.mp4 --model large-v3 --output_format srt --language en --word_timestamps True --output_dir ./subtitles
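Adding --word_timestamps True makes Whisper compute word-level timing, which it uses to place cue boundaries more precisely. The SRT still contains ordinary cues, so the visible output looks the same; the per-word timings themselves are written out when you also request JSON output (--output_format json, or all). That data is useful if you later want karaoke-style highlighting, and the flag is required by the line-splitting options --max_line_width and --max_line_count. Keep this flag on by default: the extra alignment pass costs little and gives you options.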
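If you would rather drive Whisper from Python than from the command line, the library’s transcribe() call returns a dictionary whose "segments" list carries start, end, and text for each cue. The sketch below shows how such segments map onto SRT blocks. The helper names (srt_timestamp, segments_to_srt) are illustrative and not part of Whisper, and the sample segments are made up.

```python
def srt_timestamp(seconds: float) -> str:
    """Convert seconds to the HH:MM:SS,mmm form that SRT expects."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments: list[dict]) -> str:
    """Build SRT text from dicts that carry 'start', 'end', and 'text' keys."""
    blocks = []
    for number, seg in enumerate(segments, start=1):
        start = srt_timestamp(seg["start"])
        end = srt_timestamp(seg["end"])
        blocks.append(f"{number}\n{start} --> {end}\n{seg['text'].strip()}")
    return "\n\n".join(blocks) + "\n"


# Made-up segments in the shape Whisper returns.
sample = [
    {"start": 0.0, "end": 2.5, "text": " Welcome to the demo."},
    {"start": 2.5, "end": 5.0, "text": " Let's get started."},
]
print(segments_to_srt(sample))
```

With openai-whisper installed, whisper.load_model("medium").transcribe("your_video.mp4", language="en")["segments"] should supply the real input in place of the sample list.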
Step 3 — Translate with Claude API
Here’s where it gets genuinely fast. The following Python script reads your English SRT, sends the text content to Claude Haiku 4.5 for translation, and writes new SRT files for each target language — preserving all the original timestamps exactly.
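import os
import re

import anthropic

# Language name -> code used in output filenames.
# (Slicing the name, e.g. lang[:2], would give "sp", "ge", and "po".)
LANG_CODES = {
    'Spanish': 'es', 'French': 'fr', 'German': 'de', 'Japanese': 'ja',
    'Portuguese': 'pt', 'Korean': 'ko', 'Arabic': 'ar', 'Hindi': 'hi',
}

def parse_srt(filepath):
    """Read an SRT file into a list of {'index', 'timestamp', 'text'} dicts."""
    with open(filepath, 'r', encoding='utf-8') as f:
        content = f.read()
    # Cues are separated by one or more blank lines.
    blocks = re.split(r'\n\s*\n', content.strip())
    parsed = []
    for block in blocks:
        lines = block.strip().split('\n')
        if len(lines) >= 3:
            index = lines[0]
            timestamp = lines[1]
            text = ' '.join(lines[2:])  # flatten multi-line cues to one line
            parsed.append({'index': index, 'timestamp': timestamp, 'text': text})
    return parsed

def translate_srt(blocks, target_language, api_key):
    """Translate all cues in one API call and rebuild the SRT with the original timestamps."""
    client = anthropic.Anthropic(api_key=api_key)
    texts = [b['text'] for b in blocks]
    batch_text = '\n---\n'.join(texts)
    message = client.messages.create(
        model='claude-haiku-4-5',
        max_tokens=4096,
        messages=[{
            'role': 'user',
            'content': f'Translate each subtitle segment below into {target_language}. Preserve the --- separator between segments. Return only the translated text, nothing else, in the same order.\n\n{batch_text}'
        }]
    )
    # Tolerate stray whitespace around the separator in the reply.
    translated = re.split(r'\n\s*---\s*\n', message.content[0].text.strip())
    if len(translated) != len(blocks):
        print(f'Warning: expected {len(blocks)} segments for {target_language}, got {len(translated)}')
    output_blocks = []
    for i, block in enumerate(blocks):
        # Fall back to the source text if the reply is missing a segment.
        trans_text = translated[i].strip() if i < len(translated) else block['text']
        output_blocks.append(f"{block['index']}\n{block['timestamp']}\n{trans_text}")
    return '\n\n'.join(output_blocks) + '\n'

# Usage
api_key = os.environ['ANTHROPIC_API_KEY']  # keep the key out of source control
languages = ['Spanish', 'French', 'German', 'Japanese', 'Portuguese', 'Korean', 'Arabic', 'Hindi']
blocks = parse_srt('./subtitles/your_video.srt')
for lang in languages:
    translated_srt = translate_srt(blocks, lang, api_key)
    lang_code = LANG_CODES[lang]
    with open(f'./subtitles/your_video_{lang_code}.srt', 'w', encoding='utf-8') as f:
        f.write(translated_srt)
    print(f'Done: {lang}')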
This script batches all subtitle segments into a single Claude API call per language, which keeps costs low and maintains translation consistency across segments: Claude can see the full context of your video transcript, not just one line at a time. For right-to-left languages such as Arabic, the script writes the text correctly because the SRT files are saved as UTF-8; YouTube and TikTok handle RTL rendering on their end.
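A single call per language works for short clips, but a long transcript could exceed the max_tokens=4096 reply limit set in the script. One possible workaround, sketched here under the assumption that a character budget is a good enough proxy for tokens, is to split the cues into consecutive chunks and translate each chunk with its own call. The chunk_blocks helper and its 3,000-character default are hypothetical.

```python
def chunk_blocks(blocks: list[dict], max_chars: int = 3000) -> list[list[dict]]:
    """Group consecutive cues so each group's text stays within max_chars.

    A single cue longer than max_chars still gets a chunk of its own.
    """
    chunks: list[list[dict]] = []
    current: list[dict] = []
    current_len = 0
    for block in blocks:
        size = len(block["text"])
        # Start a new chunk when adding this cue would exceed the budget.
        if current and current_len + size > max_chars:
            chunks.append(current)
            current, current_len = [], 0
        current.append(block)
        current_len += size
    if current:
        chunks.append(current)
    return chunks


# Example with a tiny budget so the split is visible.
cues = [{"index": str(i), "timestamp": "", "text": "x" * 40} for i in range(1, 6)]
groups = chunk_blocks(cues, max_chars=100)
print([len(g) for g in groups])
```

Each chunk can then be passed to translate_srt in turn and the results concatenated in order. The trade-off is context: smaller chunks mean Claude sees less of the transcript at once, so the budget is worth keeping as large as the reply limit allows.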
Pro tip ✅
Tell Claude what your video is about before asking for translation. Prepend your prompt with “This is a subtitle file from a [product demo / cooking tutorial / tech explainer] video.” Domain context pushes translation accuracy up noticeably, especially for technical vocabulary that has multiple valid translations in languages like Japanese or Arabic.
Step 4 — Burn-In Subtitles with Brand Fonts (Optional)
If you’re posting to TikTok or Instagram Reels where viewers watch without selecting a subtitle track, burned-in subtitles get dramatically higher engagement than external SRT tracks. FFmpeg handles this with one command:
ffmpeg -i your_video.mp4 -vf "subtitles=./subtitles/your_video.srt:force_style='FontName=Montserrat,FontSize=22,PrimaryColour=&H00FFFFFF,OutlineColour=&H00000000,Outline=2,Alignment=2'" -c:a copy output_with_subs.mp4
Here’s what each option does: FontName=Montserrat sets your brand font (any font installed on your system works), FontSize=22 is readable on mobile without covering too much frame, PrimaryColour=&H00FFFFFF is white text, OutlineColour=&H00000000 with Outline=2 draws a black outline for readability on any background, and Alignment=2 places subtitles at the bottom center. Change it to Alignment=6 for top center if your video has a lower-third graphic.
ffmpeg -i your_video.mp4 -vf "subtitles=./subtitles/your_video.srt:force_style='FontName=Inter,FontSize=24,PrimaryColour=&H0000FFFF,OutlineColour=&H00000000,Outline=3,Bold=1,Alignment=2'" -c:a copy output_branded.mp4
This variant uses yellow text (&H0000FFFF in BGR format — yes, FFmpeg uses BGR, not RGB, which trips everyone up the first time) with bold weight. Yellow subtitles on dark footage are extremely readable and have become something of a default for YouTube educational content. Adjust the hex color code to match your brand palette.
Warning ⚠️
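The color values in force_style follow the ASS subtitle format, &HAABBGGRR, so the channel order is BGR, not RGB, and the leading 00 is the alpha byte (00 means fully opaque). Pure red is &H000000FF, pure blue is &H00FF0000, and pure green is &H0000FF00. This is one of those things that causes 20 minutes of confusion until you know it, then takes two seconds forever after.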
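Swapping the channels by hand is easy to get wrong, so a small converter from the usual #RRGGBB notation can help. This is a minimal sketch; rgb_hex_to_ass is a hypothetical helper name, and the alpha handling assumes the ASS convention in which 0 is opaque.

```python
def rgb_hex_to_ass(rgb_hex: str, alpha: int = 0) -> str:
    """Convert '#RRGGBB' to the &HAABBGGRR form used in force_style.

    alpha follows the ASS convention: 0 is fully opaque, 255 fully transparent.
    """
    value = rgb_hex.lstrip("#")
    if len(value) != 6:
        raise ValueError(f"expected 6 hex digits, got {rgb_hex!r}")
    int(value, 16)  # raises ValueError on non-hex characters
    if not 0 <= alpha <= 255:
        raise ValueError("alpha must be between 0 and 255")
    red, green, blue = value[0:2], value[2:4], value[4:6]
    return f"&H{alpha:02X}{blue}{green}{red}".upper()


print(rgb_hex_to_ass("#FFFF00"))  # yellow, as used in the branded example
```

The result drops straight into PrimaryColour= or OutlineColour= in the FFmpeg command.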
Step 5 — Export and Upload SRT to YouTube and TikTok
YouTube accepts SRT files directly. In YouTube Studio, open your video, go to Subtitles, click Add Language, select your target language, and upload the corresponding SRT file. YouTube won’t auto-translate your manually uploaded SRT — it treats each language file as authoritative. Repeat the upload for each of your eight language files. The whole process takes about three minutes per language.
TikTok’s SRT support works through the desktop creator portal. Upload your video, go to Captions, and select “Upload captions file.” TikTok accepts SRT format and applies it as a selectable caption track. Note that TikTok’s auto-caption feature will override your uploaded SRT unless you explicitly disable it in caption settings — disable it, because TikTok’s auto-captions are noticeably worse than Whisper + Claude output, especially for any non-standard vocabulary.
Pro tip ✅
YouTube uses SRT timestamps in HH:MM:SS,mmm format. If Whisper produces timestamps with a period instead of a comma (some versions do this), YouTube will reject the file silently. Run a quick find-and-replace in any text editor — replace all periods in timestamp lines with commas — before uploading. Takes 10 seconds and saves 10 minutes of debugging.
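A blanket find-and-replace would also change periods inside the subtitle text if you are not careful, so a script that touches only the timestamp lines is safer. The following is a sketch with a hypothetical function name; it assumes timestamps appear on lines containing the "-->" arrow, as they do in standard SRT.

```python
import re

# Matches HH:MM:SS.mmm so the period can be rewritten as a comma.
_DOT_TIMESTAMP = re.compile(r"(\d{2}:\d{2}:\d{2})\.(\d{3})")


def normalize_srt_timestamps(srt_text: str) -> str:
    """Replace '.' with ',' in timestamp lines only, leaving cue text untouched."""
    fixed_lines = []
    for line in srt_text.splitlines():
        if "-->" in line:
            line = _DOT_TIMESTAMP.sub(r"\1,\2", line)
        fixed_lines.append(line)
    return "\n".join(fixed_lines) + "\n"


broken = "1\n00:00:01.000 --> 00:00:02.500\nVersion 2.0 is out.\n"
print(normalize_srt_timestamps(broken))
```

To fix a file in place, read it with pathlib.Path.read_text(encoding="utf-8"), pass the text through the function, and write the result back with write_text.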
Timing Sync: When Whisper Gets It Wrong
Whisper’s output needs attention in two situations: speech over music, which throws off timestamps (covered earlier: use Demucs), and fast speech, where a single subtitle block can run longer than about 7 seconds. Long blocks don’t cause sync issues, but they’re hard to read. You can split them automatically with this addition to your workflow:
whisper your_video.mp4 --model medium --output_format srt --language en --word_timestamps True --max_line_width 42 --max_line_count 2 --output_dir ./subtitles
The --max_line_width 42 flag limits subtitle lines to 42 characters (a widely used readability guideline) and --max_line_count 2 caps each block at two lines. Both options depend on word-level timing, so --word_timestamps True must be set; without it Whisper exits with an error. Whisper then breaks longer segments at word boundaries to meet these constraints. The result is subtitle timing that feels natural instead of forcing viewers to pause mid-read.
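Whisper’s limits apply only to the English transcript. The translation script joins each cue into a single line, and translated text is often longer than the source, so translated cues may need re-wrapping. The sketch below uses Python’s textwrap for that; wrap_cue_text is a hypothetical helper, and counting characters is only a proxy for on-screen width (CJK text is usually held to a lower limit).

```python
import textwrap


def wrap_cue_text(text: str, max_width: int = 42, max_lines: int = 2) -> tuple[str, bool]:
    """Re-wrap one cue's text to max_width columns.

    Returns the wrapped text and a flag that is True when the text needs more
    than max_lines lines, meaning the cue should be split or shortened by hand.
    """
    lines = textwrap.wrap(" ".join(text.split()), width=max_width)
    return "\n".join(lines), len(lines) > max_lines


wrapped, too_long = wrap_cue_text(
    "Este producto incluye una garantía de dos años y envío gratuito a todo el país."
)
print(wrapped)
print("needs manual split:", too_long)
```

Cues flagged as too long are better split into two timed cues than squeezed into three lines.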
Pro tip ✅
Run your final SRT through a free validator like the online SubtitleEdit tool before uploading to YouTube. It catches overlapping timestamps, malformed headers, and encoding issues that YouTube’s subtitle parser handles badly. Five minutes of validation prevents a frustrating upload-fail-fix loop.
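For a quick local check before reaching for a full validator, a few lines of Python can flag the most common timing faults. This sketch covers only reversed and overlapping cues, not header or encoding problems, and find_timing_problems is a hypothetical name.

```python
import re

_CUE_TIME = re.compile(
    r"(\d{2}):(\d{2}):(\d{2}),(\d{3}) --> (\d{2}):(\d{2}):(\d{2}),(\d{3})"
)


def _to_ms(h: str, m: str, s: str, ms: str) -> int:
    """Convert timestamp fields to milliseconds."""
    return ((int(h) * 60 + int(m)) * 60 + int(s)) * 1000 + int(ms)


def find_timing_problems(srt_text: str) -> list[str]:
    """Return human-readable warnings for reversed or overlapping cues."""
    problems = []
    previous_end = 0
    for line_number, line in enumerate(srt_text.splitlines(), start=1):
        match = _CUE_TIME.search(line)
        if not match:
            continue  # not a timestamp line
        parts = match.groups()
        start, end = _to_ms(*parts[:4]), _to_ms(*parts[4:])
        if end <= start:
            problems.append(f"line {line_number}: cue ends before it starts")
        if start < previous_end:
            problems.append(f"line {line_number}: cue overlaps the previous one")
        previous_end = max(previous_end, end)
    return problems


sample = "1\n00:00:00,000 --> 00:00:02,000\nHi\n\n2\n00:00:01,500 --> 00:00:04,000\nThere\n"
print(find_timing_problems(sample))
```

An empty list means no timing faults of these two kinds were found; it does not guarantee the file will pass every platform’s parser.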
The Full Pipeline in One Script
Once you’ve run through this manually a few times, collapse it into a single shell script so you can run the whole pipeline — transcription, translation into 8 languages, and burned-in subtitle render — with one command:
#!/bin/bash
# Usage: bash subtitle_pipeline.sh your_video.mp4
set -euo pipefail  # stop at the first failing step
INPUT="${1:?Usage: bash subtitle_pipeline.sh your_video.mp4}"
BASENAME=$(basename "$INPUT" .mp4)
OUTDIR="./subtitles"
mkdir -p "$OUTDIR"
echo "Transcribing..."
# --max_line_width and --max_line_count require --word_timestamps True
whisper "$INPUT" --model medium --output_format srt --language en --word_timestamps True --max_line_width 42 --max_line_count 2 --output_dir "$OUTDIR"
echo "Translating..."
# Assumes the Step 3 code is saved as translate_srt.py and reads --input and --languages
python3 translate_srt.py --input "$OUTDIR/${BASENAME}.srt" --languages "Spanish,French,German,Japanese,Portuguese,Korean,Arabic,Hindi"
echo "Rendering burned-in version..."
ffmpeg -i "$INPUT" -vf "subtitles=${OUTDIR}/${BASENAME}.srt:force_style='FontName=Inter,FontSize=22,PrimaryColour=&H00FFFFFF,OutlineColour=&H00000000,Outline=2,Alignment=2'" -c:a copy "${BASENAME}_captioned.mp4"
echo "Done. SRT files in $OUTDIR/"
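This wraps everything, provided the Step 3 code is saved as translate_srt.py and reads the --input and --languages arguments. You drop your Kling-exported MP4 into a folder, run the script, and walk away. Two minutes later you have eight translated SRT files ready to upload and a burned-in version ready for short-form platforms.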
Why This Stack Actually Holds Up
The case for this workflow over a dedicated captioning platform comes down to three things: cost, control, and portability. Whisper is free and runs locally, so your video and audio never touch a third-party server; only the transcript text is sent to the Claude API for translation. Claude API translation at Haiku 4.5 rates is genuinely cheap; translating a 10-minute video into 8 languages costs under $0.15. And because your SRT files are plain text sitting in a folder, you can version-control them, edit them in any text editor, and reuse them across platforms without re-uploading to a SaaS dashboard. When a captioning platform changes its pricing or shuts down, your workflow doesn’t care. The tools here aren’t going anywhere, and two of them run entirely on your own hardware. That’s not a small thing when you’re building production pipelines that need to work reliably six months from now.


