PPT Audio To Video
Use this skill when the source video has narration audio but no usable slide visuals, and the final deliverable should be a slide-based lecture video.
Resolve bundled scripts relative to this skill directory. If the runtime has already opened this SKILL.md, prefer paths like scripts/extract_slide_outline.py and scripts/render_from_timing_csv.py instead of machine-specific absolute paths.
Core workflow
- 1. Inventory inputs.
- Confirm which of these exist: audio-only mp4/m4a/mp3/wav, ppt/pptx, pdf, and any pre-rendered slide images.
- Prefer an existing pdf or image directory for rendering. Treat pptx as the source of slide text and as a fallback for export.
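The inventory step can be sketched as a small classifier over file extensions. The names inventory_inputs and preferred_render_source, the category labels, and the list of image extensions are illustrative assumptions, not part of the bundled scripts.

```python
from pathlib import Path

# Extension groups from the inventory step. The image extensions and the
# category names are assumptions made for this sketch.
CATEGORIES = {
    "audio": {".mp4", ".m4a", ".mp3", ".wav"},
    "deck": {".ppt", ".pptx"},
    "pdf": {".pdf"},
    "image": {".png", ".jpg", ".jpeg"},
}


def inventory_inputs(paths):
    """Group candidate input paths by category, ignoring unknown extensions."""
    found = {name: [] for name in CATEGORIES}
    for raw in paths:
        path = Path(raw)
        suffix = path.suffix.lower()
        for name, extensions in CATEGORIES.items():
            if suffix in extensions:
                found[name].append(path.name)
    return found


def preferred_render_source(found):
    """Prefer existing images or a pdf; fall back to exporting the deck."""
    if found["image"]:
        return "image"
    if found["pdf"]:
        return "pdf"
    if found["deck"]:
        return "deck (export to pdf first)"
    return None
```

With a pptx and a pdf both present, preferred_render_source returns "pdf", which matches the preference stated above.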
- 2. Prepare tools.
- Required for deterministic steps: ffmpeg, ffprobe, pdftoppm.
- Required for transcription: whisper-cli from whisper-cpp plus a multilingual model such as ggml-small.bin.
- If only pptx exists and no pdf/images exist, prefer Keynote or PowerPoint export on macOS. Use soffice only as a fallback because profile and rendering issues are common.
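A quick preflight check for the command-line tools listed above might look like the following. The function name missing_tools is hypothetical; shutil.which is the standard-library PATH lookup.

```python
import shutil

# Tools named in the "Prepare tools" step.
REQUIRED_TOOLS = ["ffmpeg", "ffprobe", "pdftoppm", "whisper-cli"]


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    print("missing:", ", ".join(missing) if missing else "none")
```

The which parameter exists only so the lookup can be replaced in tests. The check does not cover the whisper model file, which is a plain file rather than an executable.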
- 3. Produce slide images.
- If pdf exists, render it to images:
pdftoppm -png -r 200 "$PDF" "$OUTDIR/slide"
- If only pptx exists, export to pdf or slide images with Keynote or PowerPoint, then continue from pdf.
- Keep slide filenames ordered and stable, such as slide-01.png, slide-02.png, ...
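Zero padding in rendered filenames can vary with the page count (for example slide-1.png versus slide-01.png), so a plain string sort may put slide-10 before slide-2. A minimal sketch of a numeric ordering check, with hypothetical helper names:

```python
import re
from pathlib import Path


def slide_number(path):
    """Return the trailing integer in a name like slide-07.png, or None."""
    match = re.search(r"(\d+)$", Path(path).stem)
    return int(match.group(1)) if match else None


def ordered_slides(paths):
    """Sort slide images by numeric index and check the numbering is 1..N."""
    numbered = [(slide_number(p), str(p)) for p in paths]
    if any(number is None for number, _ in numbered):
        raise ValueError("every slide image needs a trailing number")
    numbered.sort()
    numbers = [number for number, _ in numbered]
    if numbers != list(range(1, len(numbers) + 1)):
        raise ValueError(f"slide numbers are not contiguous from 1: {numbers}")
    return [name for _, name in numbered]
```

Failing loudly on a gap is a design choice for this sketch: a missing slide image would otherwise shift every later slide against the timing plan.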
- 4. Extract slide text.
- Run:
python3 scripts/extract_slide_outline.py \
--pptx "$PPTX" \
--out "$WORKDIR/slide_outline.csv"
- Use the output to identify slide titles, distinctive keywords, and section changes.
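The internals of the bundled extraction script are not shown here. As a rough illustration of where slide text lives, a pptx is a zip archive whose slides are stored as ppt/slides/slideN.xml, with text runs in DrawingML a:t elements. The function below is a hypothetical stand-in, not the bundled script.

```python
import re
import zipfile
import xml.etree.ElementTree as ET

# DrawingML namespace used for text runs (<a:t>) inside slide XML.
A_NS = "{http://schemas.openxmlformats.org/drawingml/2006/main}"


def extract_slide_texts(pptx_path):
    """Map slide file number -> list of text runs in ppt/slides/slideN.xml."""
    texts = {}
    with zipfile.ZipFile(pptx_path) as archive:
        for name in archive.namelist():
            match = re.fullmatch(r"ppt/slides/slide(\d+)\.xml", name)
            if not match:
                continue
            root = ET.fromstring(archive.read(name))
            runs = [node.text for node in root.iter(A_NS + "t") if node.text]
            texts[int(match.group(1))] = runs
    # Caveat: file numbers usually follow deck order, but the authoritative
    # order is recorded in ppt/presentation.xml, so reordered decks can differ.
    return dict(sorted(texts.items()))
```

The first text run of a slide is often, but not always, its title, so treat the output as raw material for keyword spotting rather than a clean outline.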
- 5. Extract clean audio for ASR.
- For audio-only mp4, extract mono wav:
ffmpeg -y -i "$AUDIO_MP4" -ar 16000 -ac 1 -c:a pcm_s16le "$WORKDIR/audio.wav"
- If the source is already wav/mp3/m4a, convert to the same mono wav form if needed.
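Whether conversion is needed can be checked with the standard-library wave module. This sketch assumes the target form is the one the ffmpeg command above produces (16 kHz, mono, 16-bit PCM); the function name is_asr_ready is hypothetical.

```python
import wave


def is_asr_ready(wav_path, rate=16000, channels=1, sample_width=2):
    """True if the file is already 16 kHz, mono, 16-bit PCM wav."""
    try:
        with wave.open(str(wav_path), "rb") as handle:
            return (
                handle.getframerate() == rate
                and handle.getnchannels() == channels
                and handle.getsampwidth() == sample_width
            )
    except (wave.Error, EOFError):
        # Not a PCM wav file that the wave module can read (mp3, m4a, ...).
        return False
```

A False result means the ffmpeg conversion should be run; a True result means the file can be passed to whisper-cli directly.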
- 6. Transcribe with whisper-cli.
- Example:
whisper-cli -ng \
-m "$MODEL" \
-f "$WORKDIR/audio.wav" \
-l zh \
-ocsv -osrt -of "$WORKDIR/transcript"
- Prefer transcript.csv for downstream parsing. transcript.srt is useful for manual review.
- If GPU allocation fails on macOS, retry with -ng to force CPU mode. The example above already passes -ng; set -l to the language of the narration.
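For downstream parsing, transcript.csv can be read with the csv module. This sketch assumes the file has a start,end,text header with timestamps in integer milliseconds, which is what recent whisper.cpp builds appear to write; check your own file before relying on it. The function name read_segments is hypothetical.

```python
import csv
import io


def read_segments(csv_text):
    """Parse whisper-cli CSV text into (start_sec, end_sec, text) tuples."""
    reader = csv.DictReader(io.StringIO(csv_text))
    segments = []
    for row in reader:
        # Assumption: start and end are integer milliseconds.
        start = int(row["start"]) / 1000.0
        end = int(row["end"]) / 1000.0
        segments.append((start, end, row["text"].strip()))
    return segments
```

Typical use is read_segments(open(path, encoding="utf-8").read()), after which the segments can be read alongside the slide outline.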
- 7. Build slide_timings.csv.
- Do not average slide durations unless the user explicitly asks for it.
- Read the transcript and slide outline together, then create a monotonic timing plan by topic changes, section boundaries, and unique keywords.
- Use this schema:
slide,start_sec,end_sec,duration_sec,reason
1,0.000,15.000,15.000,opening title and agenda
2,15.000,100.000,85.000,architecture overview starts here
- Keep slide numbers sequential and ensure duration_sec = end_sec - start_sec.
- Validate that the last end_sec matches the audio duration or is within a small tolerance.
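The checks in this step can be sketched as a validator over the CSV text. The 0.5 second tolerance, the requirement that the first slide starts at 0, and the function name validate_timings are assumptions for illustration; the bundled render script may apply different rules.

```python
import csv
import io


def validate_timings(csv_text, audio_duration, tolerance=0.5):
    """Return a list of problems found in slide_timings.csv text."""
    problems = []
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    previous_end = 0.0  # assumption: the first slide starts at 0
    for index, row in enumerate(rows, start=1):
        slide = int(row["slide"])
        start = float(row["start_sec"])
        end = float(row["end_sec"])
        duration = float(row["duration_sec"])
        if slide != index:
            problems.append(f"row {index}: slide is {slide}, expected {index}")
        if abs(start - previous_end) > 1e-3:
            problems.append(
                f"slide {slide}: starts at {start}, previous ended at {previous_end}"
            )
        if end <= start:
            problems.append(f"slide {slide}: end_sec must exceed start_sec")
        if abs(duration - (end - start)) > 1e-3:
            problems.append(f"slide {slide}: duration_sec != end_sec - start_sec")
        previous_end = end
    if rows and abs(previous_end - audio_duration) > tolerance:
        problems.append(
            f"last end_sec {previous_end} differs from audio duration {audio_duration}"
        )
    return problems
```

An empty list means the plan is monotonic, gap-free, and ends close to the audio length.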
- 8. Render the final video.
- Run:
python3 scripts/render_from_timing_csv.py \
--images "$SLIDE_IMAGES_DIR" \
--timings "$WORKDIR/slide_timings.csv" \
--audio "$WORKDIR/audio.wav" \
--output "$OUT_VIDEO"
- The script generates an ffconcat file, validates timing continuity, and calls ffmpeg to encode the final mp4.
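To show what an ffconcat file for a slideshow looks like, here is a minimal sketch. It is not the bundled script's implementation, and build_ffconcat is a hypothetical name. The last image is listed twice because the concat demuxer commonly ignores the final duration otherwise.

```python
def build_ffconcat(image_paths, durations):
    """Return ffconcat text that shows each image for its duration in seconds."""
    if len(image_paths) != len(durations) or not image_paths:
        raise ValueError("need one duration per image and at least one image")
    lines = ["ffconcat version 1.0"]
    for path, duration in zip(image_paths, durations):
        # Escape single quotes the way the concat file syntax expects.
        escaped = path.replace("'", "'\\''")
        lines.append(f"file '{escaped}'")
        lines.append(f"duration {duration:.3f}")
    # Repeat the last file without a duration so its hold time is honored.
    lines.append(f"file '{escaped}'")
    return "\n".join(lines) + "\n"
```

A file like this is typically read with ffmpeg's concat demuxer (-f concat -safe 0 -i slides.ffconcat) alongside the audio input; the encoder settings the bundled script uses are not shown here.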
- 9. Verify and iterate.
- Check output duration with ffprobe.
- If a slide cuts too early or too late, edit only the affected rows in slide_timings.csv and rerun the render script.
- Keep the transcript, outline, and timing CSV as reproducible working files.
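The duration check can be scripted around ffprobe. The helper names and the 0.5 second tolerance are assumptions for this sketch; the ffprobe options shown print only the container duration in seconds.

```python
import subprocess


def probe_duration(path, run=subprocess.run):
    """Return the container duration in seconds as reported by ffprobe."""
    command = [
        "ffprobe", "-v", "error",
        "-show_entries", "format=duration",
        "-of", "default=noprint_wrappers=1:nokey=1",
        str(path),
    ]
    result = run(command, capture_output=True, text=True, check=True)
    return float(result.stdout.strip())


def durations_match(video_sec, audio_sec, tolerance=0.5):
    """True if the rendered video length is within tolerance of the audio."""
    return abs(video_sec - audio_sec) <= tolerance
```

Comparing probe_duration of the output video with probe_duration of audio.wav catches a truncated or over-long render before anyone watches it. The run parameter exists only so the subprocess call can be replaced in tests.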
Heuristics for timing alignment
- Keep section-divider slides brief. These slides usually hold for 5-20 seconds.
- Use the first segment that clearly switches topic as the next slide start.
- Prefer exact topic transitions over title-word matching. ASR often distorts proper nouns and product names.
- Let the model infer timings, but keep the render step deterministic through slide_timings.csv.
- When confidence is low, produce a first-cut video and tell the user which slide boundaries likely need review.
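Once a start time has been chosen for each slide, the rest of the timing CSV follows mechanically: each slide ends where the next one starts, and the last slide ends with the audio. A minimal sketch, with the hypothetical name rows_from_starts:

```python
import csv
import io


def rows_from_starts(starts, audio_duration, reasons):
    """Turn per-slide start times into slide_timings.csv text."""
    if len(starts) != len(reasons) or not starts:
        raise ValueError("need one reason per slide start")
    if any(later <= earlier for earlier, later in zip(starts, starts[1:])):
        raise ValueError("start times must be strictly increasing")
    if starts[-1] >= audio_duration:
        raise ValueError("last slide must start before the audio ends")
    ends = list(starts[1:]) + [audio_duration]
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["slide", "start_sec", "end_sec", "duration_sec", "reason"])
    for number, (start, end, reason) in enumerate(zip(starts, ends, reasons), start=1):
        writer.writerow(
            [number, f"{start:.3f}", f"{end:.3f}", f"{end - start:.3f}", reason]
        )
    return buffer.getvalue()
```

Deriving end_sec and duration_sec from the start times keeps the plan gap-free by construction, so a manual fix only has to move one boundary.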
Bundled scripts
- scripts/extract_slide_outline.py: Extract slide text from pptx into CSV or JSON for timing analysis.
- scripts/render_from_timing_csv.py: Validate a timing CSV, generate an ffconcat file, and render the final video with ffmpeg.
Common commands
Install dependencies on macOS if missing:
brew install ffmpeg poppler whisper-cpp
Typical multilingual model download:
mkdir -p .models
curl -L https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-small.bin -o .models/ggml-small.bin