PPT Audio To Video

Use this skill when the source video has narration audio but no usable slide visuals, and the final deliverable should be a slide-based lecture video.

Resolve bundled scripts relative to this skill directory. If the runtime has already opened this SKILL.md, prefer paths like scripts/extract_slide_outline.py and scripts/render_from_timing_csv.py instead of machine-specific absolute paths.

Core workflow

1. Inventory inputs.

- Confirm which of these exist: audio-only mp4/m4a/mp3/wav, ppt/pptx, pdf, and any pre-rendered slide images.
- Prefer an existing pdf or image directory for rendering. Treat pptx as the source of slide text and as a fallback for export.
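
The inventory step can be sketched as a small helper. This is a minimal illustration, not part of the bundled scripts: the function names, the grouping labels, and the inclusion of .jpg/.jpeg are assumptions, and the choice to try images before pdf is arbitrary since the text above only says to prefer either over pptx.

```python
from pathlib import Path

# Extension groups follow the inventory step; the labels and the
# .jpg/.jpeg entries are illustrative assumptions.
KINDS = {
    "audio": {".mp4", ".m4a", ".mp3", ".wav"},
    "deck": {".ppt", ".pptx"},
    "pdf": {".pdf"},
    "image": {".png", ".jpg", ".jpeg"},
}


def inventory(root):
    """Group files under root by kind, based only on file extension."""
    found = {kind: [] for kind in KINDS}
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        for kind, exts in KINDS.items():
            if path.suffix.lower() in exts:
                found[kind].append(path)
    return found


def pick_render_source(found):
    """Prefer images or pdf for rendering; fall back to the deck for export."""
    for kind in ("image", "pdf", "deck"):
        if found[kind]:
            return kind
    return None
```

An mp4 is listed as audio here purely by extension; whether it really lacks usable slide visuals still has to be checked by hand.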

2. Prepare tools.

- Required for deterministic steps: ffmpeg, ffprobe, pdftoppm.
- Required for transcription: whisper-cli from whisper-cpp plus a multilingual model such as ggml-small.bin.
- If only pptx exists and no pdf/images exist, prefer Keynote or PowerPoint export on macOS. Use soffice only as a fallback because profile or rendering issues are common.
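
A quick way to confirm the tools are on PATH before starting is sketched below. The helper name missing_tools is hypothetical; the tool list is taken from the bullets above.

```python
import shutil

# Tools named in the "Prepare tools" step.
REQUIRED_TOOLS = ("ffmpeg", "ffprobe", "pdftoppm", "whisper-cli")


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]
```

Calling missing_tools() with no arguments returns an empty list when everything is installed. The which parameter exists only so the lookup can be replaced in tests.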

3. Produce slide images.

- If pdf exists, render it to images:

     pdftoppm -png -r 200 "$PDF" "$OUTDIR/slide"

- If only pptx exists, export to pdf or slide images with Keynote or PowerPoint, then continue from pdf.
- Keep slide filenames ordered and stable, such as slide-01.png, slide-02.png, ...
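
Because a plain alphabetical sort can misorder names when the zero padding varies, it may help to sort slide images by their trailing number and check for gaps. The following is a sketch under the assumption that every image name ends in its slide number, as in slide-01.png; the function names are hypothetical.

```python
import re
from pathlib import Path


def slide_number(path):
    """Extract the trailing page number from names like slide-07.png."""
    match = re.search(r"(\d+)$", Path(path).stem)
    if match is None:
        raise ValueError(f"no slide number in {path}")
    return int(match.group(1))


def ordered_slides(paths):
    """Sort slide images numerically and check the numbering has no gaps."""
    ordered = sorted(paths, key=slide_number)
    numbers = [slide_number(p) for p in ordered]
    expected = list(range(1, len(ordered) + 1))
    if numbers != expected:
        raise ValueError(f"slide numbers {numbers} are not 1..{len(ordered)}")
    return ordered
```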

4. Extract slide text.

- Run:

     python3 scripts/extract_slide_outline.py \
       --pptx "$PPTX" \
       --out "$WORKDIR/slide_outline.csv"

- Use the output to identify slide titles, distinctive keywords, and section changes.

5. Extract clean audio for ASR.

- For audio-only mp4, extract mono wav:

     ffmpeg -y -i "$AUDIO_MP4" -ar 16000 -ac 1 -c:a pcm_s16le "$WORKDIR/audio.wav"

- If the source is already wav/mp3/m4a, convert to the same mono wav form if needed.
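
To decide whether an existing wav still needs conversion, its header can be inspected with the Python standard library. This is a minimal sketch; is_asr_ready is a hypothetical helper, and its defaults mirror the ffmpeg command above (16 kHz, mono, 16-bit PCM).

```python
import wave


def is_asr_ready(wav_path, rate=16000, channels=1, sample_width=2):
    """Check that a WAV file is 16 kHz, mono, 16-bit PCM."""
    with wave.open(str(wav_path), "rb") as wav:
        return (
            wav.getframerate() == rate
            and wav.getnchannels() == channels
            and wav.getsampwidth() == sample_width
        )
```

The wave module only reads uncompressed PCM, so other encodings raise wave.Error; those files need the ffmpeg conversion in any case.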

6. Transcribe with whisper-cli.

- Example:

     whisper-cli -ng \
       -m "$MODEL" \
       -f "$WORKDIR/audio.wav" \
       -l zh \
       -ocsv -osrt -of "$WORKDIR/transcript"

- Prefer transcript.csv for downstream parsing. transcript.srt is useful for manual review.
- The example passes -ng, which forces CPU mode. Drop the flag to try GPU acceleration; if GPU allocation fails on macOS, add -ng back and retry.
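
For downstream parsing, transcript.csv can be loaded into a list of timed segments. The sketch below assumes the file has a start,end,text header with times in milliseconds; verify this against the file your whisper-cli build produces before relying on it. read_segments is a hypothetical helper.

```python
import csv


def read_segments(csv_file):
    """Return (start_sec, end_sec, text) tuples from a transcript CSV.

    Assumes a header row of start,end,text with times in milliseconds.
    """
    segments = []
    for row in csv.DictReader(csv_file):
        start_sec = int(row["start"]) / 1000.0
        end_sec = int(row["end"]) / 1000.0
        segments.append((start_sec, end_sec, row["text"].strip()))
    return segments
```

A typical call opens the file with open(path, newline="", encoding="utf-8") and passes the file object in.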

7. Build slide_timings.csv.

- Do not average slide durations unless the user explicitly asks for it.
- Read the transcript and slide outline together, then create a monotonic timing plan by topic changes, section boundaries, and unique keywords.
- Use this schema:

     slide,start_sec,end_sec,duration_sec,reason
     1,0.000,15.000,15.000,opening title and agenda
     2,15.000,100.000,85.000,architecture overview starts here

- Keep slide numbers sequential and ensure duration_sec = end_sec - start_sec.
- Validate that the last end_sec matches the audio duration or is within a small tolerance.
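
The checks listed above can be sketched as a standalone validator. This is an illustration only and not the logic of the bundled render script; validate_timings, the 0.5 second tolerance, and the rounding margin eps are assumptions.

```python
import csv


def validate_timings(csv_file, audio_duration, tolerance=0.5, eps=1e-3):
    """Return a list of problems found in a slide_timings.csv file object."""
    problems = []
    prev_slide, prev_end = None, None
    for row in csv.DictReader(csv_file):
        slide = int(row["slide"])
        start = float(row["start_sec"])
        end = float(row["end_sec"])
        duration = float(row["duration_sec"])
        if prev_slide is not None and slide != prev_slide + 1:
            problems.append(f"slide {slide}: number does not follow {prev_slide}")
        if end <= start:
            problems.append(f"slide {slide}: end_sec must be greater than start_sec")
        if abs((end - start) - duration) > eps:
            problems.append(f"slide {slide}: duration_sec != end_sec - start_sec")
        if prev_end is not None and abs(start - prev_end) > eps:
            problems.append(f"slide {slide}: gap or overlap after {prev_end:.3f}")
        prev_slide, prev_end = slide, end
    if prev_end is None:
        problems.append("no rows found")
    elif abs(prev_end - audio_duration) > tolerance:
        problems.append(
            f"last end_sec {prev_end:.3f} differs from audio duration {audio_duration:.3f}"
        )
    return problems
```

An empty list means the plan passed every check; otherwise each entry names the slide row to fix.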

8. Render the final video.

- Run:

     python3 scripts/render_from_timing_csv.py \
       --images "$SLIDE_IMAGES_DIR" \
       --timings "$WORKDIR/slide_timings.csv" \
       --audio "$WORKDIR/audio.wav" \
       --output "$OUT_VIDEO"

- The script generates an ffconcat file, validates timing continuity, and calls ffmpeg to encode the final mp4.
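
For orientation, an ffconcat file for a slide show might be built roughly as follows. This is a sketch of the general technique, not the bundled script's actual implementation; build_ffconcat and image_for_slide are hypothetical names, and the paths are assumed to contain no single quotes.

```python
def build_ffconcat(rows, image_for_slide):
    """Build ffconcat text from (slide_number, duration_sec) pairs."""
    lines = ["ffconcat version 1.0"]
    for slide, duration in rows:
        lines.append(f"file '{image_for_slide(slide)}'")
        lines.append(f"duration {duration:.3f}")
    # The concat demuxer commonly needs the final file listed once more,
    # otherwise the last duration may not be honored.
    if rows:
        lines.append(f"file '{image_for_slide(rows[-1][0])}'")
    return "\n".join(lines) + "\n"
```

The resulting text is written to a file and passed to ffmpeg's concat demuxer as the video input, alongside the audio track.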

9. Verify and iterate.

- Check output duration with ffprobe.
- If a slide cuts too early or too late, edit only the affected rows in slide_timings.csv and rerun the render script.
- Keep the transcript, outline, and timing CSV as reproducible working files.
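
The duration check can be scripted along these lines. This is a minimal sketch: probe_duration and duration_matches are hypothetical helpers, and the 0.5 second tolerance is an assumption to adjust as needed.

```python
import subprocess


def probe_duration(path, run=subprocess.run):
    """Return the container duration in seconds as reported by ffprobe."""
    cmd = [
        "ffprobe", "-v", "error",
        "-show_entries", "format=duration",
        "-of", "default=noprint_wrappers=1:nokey=1",
        str(path),
    ]
    result = run(cmd, capture_output=True, text=True, check=True)
    return float(result.stdout.strip())


def duration_matches(video_sec, audio_sec, tolerance=0.5):
    """True when the rendered video length is close to the audio length."""
    return abs(video_sec - audio_sec) <= tolerance
```

Comparing probe_duration on the output video and on audio.wav gives a quick pass or fail before any manual review of slide cuts.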

Heuristics for timing alignment

- Hold section-divider slides only briefly. These slides usually stay on screen for 5-20 seconds.
- Use the first segment that clearly switches topic as the next slide start.
- Prefer exact topic transitions over title-word matching. ASR often distorts proper nouns and product names.
- Let the model infer timings, but keep the render step deterministic through slide_timings.csv.
- When confidence is low, produce a first-cut video and tell the user which slide boundaries likely need review.
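
One way to keep slide starts aligned with transcript segments is to snap each proposed boundary to the nearest segment start. The sketch below is illustrative; snap_to_segment_start is a hypothetical helper and expects segment_starts to be sorted in ascending order.

```python
import bisect


def snap_to_segment_start(boundary_sec, segment_starts):
    """Return the segment start closest to a proposed slide boundary."""
    if not segment_starts:
        return boundary_sec
    i = bisect.bisect_left(segment_starts, boundary_sec)
    # Only the neighbors on either side of the insertion point can be closest.
    candidates = segment_starts[max(i - 1, 0): i + 1]
    return min(candidates, key=lambda start: abs(start - boundary_sec))
```

Snapping only tidies a boundary that was already chosen by topic; it does not decide where the topic changes.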

Common commands

Install dependencies on macOS if missing:

     brew install ffmpeg poppler whisper-cpp

Typical multilingual model download:

     mkdir -p .models
     curl -L https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-small.bin -o .models/ggml-small.bin

Bundled scripts

- scripts/extract_slide_outline.py

Extract slide text from pptx into CSV or JSON for timing analysis.

- scripts/render_from_timing_csv.py

Validate a timing CSV, generate an ffconcat, and render the final video with ffmpeg.