Slide to Video Converter

Complete end-to-end pipeline for converting PPT/PPTX/PDF slides with speaker notes into high-quality narrated MP4 videos with auto-synced subtitles.

Architecture

3-Stage Pipeline with Audio Validation

CODEBLOCK0

TTS Mode Support

- Edge TTS (Default): Free online API, no local model required
- Qwen3-TTS: Local GPU acceleration (Apple Silicon)
- HTTP Service: Independent TTS server for multi-client usage

PPTX Support

- PPTX → PDF → PNG: Optimized conversion path using LibreOffice for best quality
- Fallback Method: python-pptx based conversion when LibreOffice is not available
- Auto-detection: Automatically uses PPTX if no PDF is available
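
The PPTX → PDF step can be sketched as below. This is a minimal illustration, not the code in scripts/pipeline.py: the function names are hypothetical, and the only external interface it relies on is LibreOffice's headless `--convert-to` command line. When no LibreOffice binary is found it returns None so the caller can switch to the python-pptx fallback.

```python
import shutil
import subprocess
from pathlib import Path


def libreoffice_pdf_command(pptx_path, out_dir, binary="soffice"):
    """Build the headless LibreOffice command that converts a PPTX to PDF."""
    return [
        binary,
        "--headless",
        "--convert-to", "pdf",
        "--outdir", str(out_dir),
        str(pptx_path),
    ]


def pptx_to_pdf(pptx_path, out_dir):
    """Convert a PPTX to PDF; return None when LibreOffice is not installed."""
    binary = shutil.which("soffice") or shutil.which("libreoffice")
    if binary is None:
        return None  # caller can use the python-pptx fallback instead
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(libreoffice_pdf_command(pptx_path, out_dir, binary), check=True)
    # LibreOffice names the output after the input file's stem.
    return out_dir / (Path(pptx_path).stem + ".pdf")
```

The resulting PDF can then be rasterized with pdf2image, for example `convert_from_path(pdf_path, dpi=150)`, where the DPI value is only an example.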

Workflow

Step 1: Prepare Inputs

Require two inputs from the user:

1. Slide file: PPT/PPTX or PDF. Supports automatic PPTX conversion:

- PPTX: Uses LibreOffice for high-quality conversion (recommended)
- PDF: Direct conversion using pdf2image
- Auto-detection: Automatically uses PPTX if no PDF is available

2. Speaker notes: A JSON file with per-page narration. See references/script-format.md for the expected format.

Step 2: Install Dependencies

CODEBLOCK1

Step 3: Run Pipeline (Default: Edge TTS)

Default Mode: Edge TTS - Free online API (recommended for universal compatibility)
CODEBLOCK2

PPTX Support Options:
CODEBLOCK3

Alternative TTS Modes:

Edge TTS - Free online API, no local model required (default)
CODEBLOCK4

HTTP Service - Independent TTS server for multi-client usage
CODEBLOCK5

Qwen3-TTS - Local GPU acceleration
CODEBLOCK6

Step 4: Run Pipeline

CODEBLOCK7

Step 5: Customize

Edit config.json to adjust:

- Edge TTS: voice, rate, volume
- Video: fps, codec
- Image: dpi, resolution
- Subtitle: font, size, color, position
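
One way to apply such overrides is to merge config.json onto a set of defaults, section by section. The sketch below is illustrative only: the section names, key names, and default values are assumptions modeled on the list above, not the actual schema of config.json.

```python
import copy
import json

# Hypothetical defaults: the section and key names mirror the list above,
# but the real config.json schema and default values may differ.
DEFAULTS = {
    "edge_tts": {"voice": "zh-CN-YunyangNeural", "rate": "+0%", "volume": "+0%"},
    "video": {"fps": 24, "codec": "libx264"},
    "image": {"dpi": 150, "resolution": [1920, 1080]},
    "subtitle": {"font": "Arial", "size": 36, "color": "white", "position": "bottom"},
}


def merge_config(defaults, overrides):
    """Return a copy of defaults with per-section overrides applied."""
    merged = copy.deepcopy(defaults)
    for section, values in overrides.items():
        if isinstance(values, dict) and isinstance(merged.get(section), dict):
            merged[section].update(values)  # keep keys the user did not set
        else:
            merged[section] = values
    return merged


def load_config(path):
    """Read a JSON config file and fill in any missing settings."""
    with open(path, encoding="utf-8") as fh:
        return merge_config(DEFAULTS, json.load(fh))
```

With this approach a config.json containing only `{"video": {"fps": 30}}` changes the frame rate and leaves every other setting at its default.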

Available TTS Voices

Edge TTS Voices (Default - Online API)

| Voice | Gender | Style |
| --- | --- | --- |
| INLINECODE1 | Female | Warm, natural (default) |
| INLINECODE2 | Female | Professional, clear |

Edge TTS Chinese Voices (Online API)

| Voice | Gender | Style |
| --- | --- | --- |
| INLINECODE5 | Male | Professional news anchor (default) |
| INLINECODE6 | Female | Warm, natural |

List all Edge voices: INLINECODE10

Key Design Decisions

- Three TTS Modes: Support for Qwen3-TTS (local GPU), Edge TTS (online), and HTTP service - flexibility for different use cases
- Audio Validation Pipeline: STT-based quality control ensures TTS output matches original text (configurable threshold)
- Per-Slide Processing: Independent audio/video generation for each slide enables partial regeneration and parallel processing
- Smart Subtitle Sync: Text segmentation at sentence boundaries with proportional time allocation based on character count
- Zero-Reencoding Merge: FFmpeg concat demuxer for fast video assembly without quality loss
- GPU Acceleration: Apple Silicon Metal support for fast local TTS inference (~3 seconds per page)
- Quality Control: Multi-stage validation including duration checks, silent audio detection, and similarity scoring
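
The subtitle sync idea above (split at sentence boundaries, then give each sentence a share of the audio duration proportional to its character count) can be sketched as follows. The function names are hypothetical and the sentence splitter is deliberately naive: it would also split inside a decimal such as "3.5", so a real implementation may need something more careful.

```python
import re

# One or more non-terminator characters followed by any run of terminators.
# Covers both Latin and Chinese sentence-ending punctuation.
_SENTENCE = re.compile(r"[^.!?。！？]+[.!?。！？]*")


def split_sentences(text):
    """Split text into sentences, dropping empty or whitespace-only pieces."""
    pieces = (m.group().strip() for m in _SENTENCE.finditer(text))
    return [p for p in pieces if p]


def allocate_subtitles(text, duration):
    """Return (start, end, sentence) cues whose lengths follow character counts."""
    sentences = split_sentences(text)
    total_chars = sum(len(s) for s in sentences)
    cues = []
    start = 0.0
    consumed = 0
    for sentence in sentences:
        consumed += len(sentence)
        # Cumulative share avoids drift: the last cue ends exactly at duration.
        end = duration * consumed / total_chars
        cues.append((round(start, 3), round(end, 3), sentence))
        start = end
    return cues
```

Because the end time is computed from the cumulative character count, rounding errors do not accumulate and the final cue always ends at the clip duration.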

Resources

This skill includes example resource directories that demonstrate how to organize different types of bundled resources:

scripts/

Executable code (Python/Bash/etc.) that can be run directly to perform specific operations.

Examples from other skills:

- PDF skill: fill_fillable_fields.py, extract_form_field_info.py - utilities for PDF manipulation
- DOCX skill: document.py, utilities.py - Python modules for document processing

Appropriate for: Python scripts, shell scripts, or any executable code that performs automation, data processing, or specific operations.

Note: Scripts may be executed without loading into context, but can still be read by Claude for patching or environment adjustments.

references/

Documentation and reference material intended to be loaded into context to inform Claude's process and thinking.

Examples from other skills:

- Product management: communication.md, context_building.md - detailed workflow guides
- BigQuery: API reference documentation and query examples
- Finance: Schema documentation, company policies

Appropriate for: In-depth documentation, API references, database schemas, comprehensive guides, or any detailed information that Claude should reference while working.

assets/

Files not intended to be loaded into context, but rather used within the output Claude produces.

Examples from other skills:

- Brand styling: PowerPoint template files (.pptx), logo files
- Frontend builder: HTML/React boilerplate project directories
- Typography: Font files (.ttf, .woff2)

Appropriate for: Templates, boilerplate code, document templates, images, icons, fonts, or any files meant to be copied or used in the final output.

Any unneeded directories can be deleted. Not every skill requires all three types of resources.

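Slide to Video Converter

Complete end-to-end pipeline for converting PPT/PPTX/PDF slides with speaker notes into high-quality narrated MP4 videos with auto-synced subtitles.

Architecture

3-Stage Pipeline with Audio Validation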

Stage 1: Audio Generation and Validation
┌─────────────────────────────────────────────────────────┐
│ PPTX/PDF → Images (PNG)                                 │
│ Script → TTS → Audio → STT Validation → Validated Audio │
└─────────────────────────────────────────────────────────┘

Stage 2: Per-Slide Video Composition
┌─────────────────────────────────────────────────────────┐
│ Image + Validated Audio + Subtitles → Per-Slide MP4     │
└─────────────────────────────────────────────────────────┘

Stage 3: Final Video Assembly
┌─────────────────────────────────────────────────────────┐
│ Merge All Slide Videos → final.mp4                      │
└─────────────────────────────────────────────────────────┘
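
Because Stage 2 produces one independent MP4 per slide, a run can skip slides that are already rendered and process the rest in parallel. The sketch below shows that idea only. The file naming scheme (`slide_001.mp4`) and the `render_one` callback are assumptions for illustration, not the pipeline's actual layout or API.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def slide_video_path(out_dir, slide_no):
    """Hypothetical naming scheme for per-slide videos."""
    return Path(out_dir) / f"slide_{slide_no:03d}.mp4"


def slides_to_render(slide_numbers, out_dir, force=False):
    """Return the slides whose video is missing (or all of them when forced)."""
    return [
        n for n in slide_numbers
        if force or not slide_video_path(out_dir, n).exists()
    ]


def render_missing(slide_numbers, out_dir, render_one, workers=4):
    """Render only the missing slides in parallel; return the slides rendered.

    render_one(slide_no, output_path) is supplied by the caller.
    """
    todo = slides_to_render(slide_numbers, out_dir)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [
            pool.submit(render_one, n, slide_video_path(out_dir, n)) for n in todo
        ]
        for future in futures:
            future.result()  # re-raise any rendering error
    return todo
```

A thread pool is enough here if each render mostly waits on an external process such as FFmpeg; CPU-bound rendering in Python itself would call for processes instead.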

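TTS Mode Support

- Edge TTS (Default): Free online API, no local model required
- Qwen3-TTS: Local GPU acceleration (Apple Silicon)
- HTTP Service: Independent TTS server for multi-client usage

PPTX Support

- PPTX → PDF → PNG: Optimized conversion path using LibreOffice for best quality
- Fallback Method: python-pptx based conversion when LibreOffice is not available
- Auto-detection: Automatically uses PPTX if no PDF is available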

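Workflow

Step 1: Prepare Inputs

Require two inputs from the user:

1. Slide file: PPT/PPTX or PDF. Supports automatic PPTX conversion:

- PPTX: Uses LibreOffice for high-quality conversion (recommended)
- PDF: Direct conversion using pdf2image
- Auto-detection: Automatically uses PPTX if no PDF is available

2. Speaker notes: A JSON file with per-page narration. See references/script-format.md for the expected format.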

Step 2: Install Dependencies

```bash
# System dependencies
brew install poppler ffmpeg libreoffice  # macOS (libreoffice adds PPTX support)
apt install poppler-utils ffmpeg libreoffice  # Linux

# Python dependencies
pip install -U mlx-audio soundfile numpy edge-tts pdf2image Pillow moviepy python-pptx

# Optional: HTTP service dependencies
pip install fastapi uvicorn python-multipart
```

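Step 3: Run Pipeline (Default: Edge TTS)

Default Mode: Edge TTS - Free online API (recommended for universal compatibility)
```bash
python scripts/pipeline.py
```

PPTX Support Options:
```bash
# Use the PPTX file (when both PDF and PPTX exist)
python scripts/pipeline.py --use-pptx

# Force PPTX conversion even if a PDF exists
python scripts/pipeline.py --use-pptx --force-audio

# Convert PPTX with the fallback method (no LibreOffice required)
python scripts/pipeline.py --use-pptx --fallback
```

Alternative TTS Modes:

Edge TTS - Free online API, no local model required (default)
```bash
python scripts/pipeline.py --tts-edge
```

HTTP Service - Independent TTS server for multi-client usage
```bash
# Start the TTS server
python scripts/tts_server.py &

# Run the pipeline
python scripts/pipeline.py --tts-http
```

Qwen3-TTS - Local GPU acceleration
```bash
python scripts/pipeline.py --tts-direct
```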

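Step 4: Run Pipeline

```bash
# Full pipeline (with audio validation)
python scripts/pipeline.py

# Process specific slides only
python scripts/pipeline.py --slides 1-5

# Fast preview mode (lower quality, faster)
python scripts/pipeline.py --fast

# Skip image generation (use existing images)
python scripts/pipeline.py --skip-images

# Force audio regeneration
python scripts/pipeline.py --force-audio

# Skip audio validation (use existing audio as-is)
python scripts/pipeline.py --skip-validation

# Custom validation threshold
python scripts/pipeline.py --threshold 0.7 --max-retries 3
```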
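
The threshold and retry options suggest a loop of the following shape: synthesize, transcribe, score the transcript against the script, and retry while the score stays below the threshold. The sketch below is an assumption about how such a check could work, not the pipeline's actual validator. It scores similarity with `difflib.SequenceMatcher` from the standard library, and the `synthesize` and `transcribe` callables are hypothetical stand-ins for the TTS and STT back ends.

```python
import difflib
import re


def normalize(text):
    """Lowercase and drop punctuation and whitespace so only content is compared."""
    return re.sub(r"[\W_]+", "", text.lower())


def similarity(expected, transcript):
    """Return a 0.0-1.0 similarity ratio between the script and the transcript."""
    matcher = difflib.SequenceMatcher(
        None, normalize(expected), normalize(transcript), autojunk=False
    )
    return matcher.ratio()


def synthesize_validated(text, synthesize, transcribe, threshold=0.7, max_retries=3):
    """Retry TTS until the transcript is similar enough to the script.

    Returns (audio, score, attempts). If no attempt reaches the threshold,
    the best-scoring audio is returned so the caller can decide what to do.
    """
    best_score, best_audio = -1.0, None
    for attempt in range(1, max_retries + 1):
        audio = synthesize(text)
        score = similarity(text, transcribe(audio))
        if score > best_score:
            best_score, best_audio = score, audio
        if score >= threshold:
            return audio, score, attempt
    return best_audio, best_score, max_retries
```

Normalizing before comparison keeps punctuation and casing differences, which STT output rarely reproduces, from lowering the score.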

Step 5: Customize

Edit config.json to adjust:

- Edge TTS: voice, rate, volume
- Video: fps, codec
- Image: dpi, resolution
- Subtitle: font, size, color, position

Available TTS Voices

Edge TTS Voices (Default - Online API)

| Voice | Gender | Style |
| --- | --- | --- |
| serena | Female | Warm, natural (default) |
| chelsea | Female | Professional, clear |
| max | Male | Authoritative, deep |
| brian | Male | Friendly, energetic |

Edge TTS Chinese Voices (Online API)

| Voice | Gender | Style |
| --- | --- | --- |
| zh-CN-YunyangNeural | Male | Professional news anchor (default) |
| zh-CN-XiaoxiaoNeural | Female | Warm, natural |

List all Edge voices: edge-tts --list-voices | grep zh-CN

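Key Design Decisions

- Three TTS Modes: Support for Qwen3-TTS (local GPU), Edge TTS (online), and HTTP service - flexibility for different use cases
- Audio Validation Pipeline: STT-based quality control ensures TTS output matches original text (configurable threshold)
- Per-Slide Processing: Independent audio/video generation for each slide enables partial regeneration and parallel processing
- Smart Subtitle Sync: Text segmentation at sentence boundaries with proportional time allocation based on character count
- Zero-Reencoding Merge: FFmpeg concat demuxer for fast video assembly without quality loss
- GPU Acceleration: Apple Silicon Metal support for fast local TTS inference (~3 seconds per page)
- Quality Control: Multi-stage validation including duration checks, silent audio detection, and similarity scoring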
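
A zero-reencoding merge with the FFmpeg concat demuxer can be sketched as below: write a list file with one `file '...'` line per slide video, then run FFmpeg with `-f concat` and `-c copy`. The helper names and file names are illustrative, not the pipeline's own. Stream copy only works when all per-slide videos share the same codec parameters, which holds when they come from the same encoder settings.

```python
from pathlib import Path


def write_concat_list(video_paths, list_path):
    """Write an FFmpeg concat demuxer list file and return its path."""
    lines = []
    for path in video_paths:
        # Inside single quotes, a literal quote is written as '\''.
        escaped = str(Path(path).resolve()).replace("'", "'\\''")
        lines.append(f"file '{escaped}'")
    list_path = Path(list_path)
    list_path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return list_path


def concat_command(list_path, output_path):
    """Build the FFmpeg command that joins the listed videos without re-encoding."""
    return [
        "ffmpeg", "-y",
        "-f", "concat",
        "-safe", "0",          # allow absolute paths in the list file
        "-i", str(list_path),
        "-c", "copy",          # copy streams as-is, no re-encode
        str(output_path),
    ]
```

The command list can be passed to `subprocess.run(..., check=True)`. Building it separately from running it keeps the step easy to log and test.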

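Resources

This skill includes example resource directories that demonstrate how to organize different types of bundled resources:

scripts/

Executable code (Python/Bash/etc.) that can be run directly to perform specific operations.

Examples from other skills:

- PDF skill: fill_fillable_fields.py, extract_form_field_info.py - utilities for PDF manipulation
- DOCX skill: document.py, utilities.py - Python modules for document processing

Appropriate for: Python scripts, shell scripts, or any executable code that performs automation, data processing, or specific operations.

Note: Scripts may be executed without loading into context, but can still be read by Claude for patching or environment adjustments.

references/

Documentation and reference material intended to be loaded into context to inform Claude's process and thinking.

Examples from other skills:

- Product management: communication.md, context_building.md - detailed workflow guides
- BigQuery: API reference documentation and query examples
- Finance: Schema documentation, company policies

Appropriate for: In-depth documentation, API references, database schemas, comprehensive guides, or any detailed information that Claude should reference while working.

assets/

Files not intended to be loaded into context, but rather used within the output Claude produces.

Examples from other skills:

- Brand styling: PowerPoint template files (.pptx), logo files
- Frontend builder: HTML/React boilerplate project directories
- Typography: Font files (.ttf, .woff2)

Appropriate for: Templates, boilerplate code, document templates, images, icons, fonts, or any files meant to be copied or used in the final output.

Any unneeded directories can be deleted. Not every skill requires all three types of resources.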
