Douyin Content Tracker
Scrapes Douyin creator videos via MediaCrawler, downloads audio with ffmpeg, and transcribes speech with Whisper.
Finding the Skill Base Directory
All commands must run from this skill's directory. To locate it, run:
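```bash
python -c "import pathlib; print([p for p in pathlib.Path.home().rglob('douyin-content-tracker-skill/SKILL.md')])"
```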
Or check common locations:
- ~/.claude/skills/douyin-content-tracker-skill/
- The path shown when the skill was installed
Set it as a variable for convenience:
```bash
# $HOME is used because ~ does not expand inside double quotes
SKILL_DIR="$HOME/.claude/skills/douyin-content-tracker-skill"  # adjust to actual path
cd "$SKILL_DIR"
```
First-Time Setup
Run these steps once on a new machine.
1. Install Python dependencies
```bash
cd "$SKILL_DIR"
pip install -r scripts/requirements.txt
python -m playwright install chromium
```
2. Install MediaCrawler
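```bash
# Windows
git clone https://github.com/NanmiCoder/MediaCrawler D:/MediaCrawler
cd D:/MediaCrawler && pip install -r requirements.txt

# macOS/Linux
git clone https://github.com/NanmiCoder/MediaCrawler ~/MediaCrawler
cd ~/MediaCrawler && pip install -r requirements.txt
```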
3. Configure .env
```bash
cd "$SKILL_DIR"
cp .env.template .env
```
Edit .env — required field:
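```dotenv
# Adjust to the actual MediaCrawler path (use ~/MediaCrawler on macOS/Linux)
MEDIACRAWLER_DIR=D:/MediaCrawler
```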
Optional overrides:
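```dotenv
# Where data/audio/subtitles/models are stored
# (default: ~/DouyinContentTracker or %USERPROFILE%\DouyinContentTracker)
OUTPUT_BASE_DIR=/Users/me/DouyinContentTracker

# Whisper model size (default: medium)
WHISPER_MODEL=small
```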
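As a rough illustration of how these settings and their documented defaults fit together, the sketch below resolves them from a plain dictionary of environment values. The function name `resolve_settings` and the returned keys are hypothetical and are not taken from the skill's scripts; only the variable names and defaults come from the text above.

```python
from pathlib import Path


def resolve_settings(env: dict[str, str]) -> dict[str, object]:
    """Hypothetical helper: apply the documented defaults to .env-style values."""
    if not env.get("MEDIACRAWLER_DIR"):
        raise ValueError("MEDIACRAWLER_DIR is required in .env")
    # Path.home() is ~ on macOS/Linux and %USERPROFILE% on Windows.
    default_output = Path.home() / "DouyinContentTracker"
    return {
        "mediacrawler_dir": Path(env["MEDIACRAWLER_DIR"]).expanduser(),
        "output_base_dir": Path(env.get("OUTPUT_BASE_DIR") or default_output).expanduser(),
        "whisper_model": env.get("WHISPER_MODEL") or "medium",
    }


if __name__ == "__main__":
    print(resolve_settings({"MEDIACRAWLER_DIR": "~/MediaCrawler"}))
```

Empty values are treated the same as missing ones, which is a design choice of this sketch rather than documented behavior.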
4. Add target accounts
Edit accounts.txt (or set TRACKER_ACCOUNTS_FILE / pass --accounts-file when running):
```
Creator name | https://www.douyin.com/user/MS4wLjABAAAA...
```
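Each line of the accounts file pairs a creator name with a profile URL, separated by `|`. A minimal parsing sketch follows; the names `Account` and `parse_accounts` are hypothetical, and skipping blank lines and `#` comments is an assumption of this sketch, not documented behavior of the skill.

```python
from typing import NamedTuple


class Account(NamedTuple):
    name: str
    url: str


def parse_accounts(text: str) -> list[Account]:
    """Parse 'name | url' lines into Account records."""
    accounts = []
    for raw in text.splitlines():
        line = raw.strip()
        # Assumption: blank lines and lines starting with '#' are ignored.
        if not line or line.startswith("#"):
            continue
        name, sep, url = line.partition("|")
        if not sep:
            raise ValueError(f"expected 'name | url', got: {line!r}")
        accounts.append(Account(name.strip(), url.strip()))
    return accounts


if __name__ == "__main__":
    sample = "Creator A | https://www.douyin.com/user/example-id\n"
    print(parse_accounts(sample))
```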
5. First login (generates cookie)
```bash
cd "$SKILL_DIR"
python scripts/scrape_profile.py
```
A browser window opens; scan the Douyin QR code to log in. The cookie is saved to .douyin_cookies.json.
Daily Usage
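```bash
cd "$SKILL_DIR"

# Track the latest 3 videos per account (default). main.py is equivalent to track_latest.py
python scripts/track_latest.py
# or
python scripts/main.py

# Track the latest N videos
python scripts/track_latest.py --limit 5

# Use a custom account list (can also be set via the TRACKER_ACCOUNTS_FILE environment variable)
python scripts/track_latest.py --accounts-file /path/to/accounts.txt

# Skip audio download and transcription (metadata only)
python scripts/track_latest.py --no-audio
```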
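The flags used in daily runs (`--limit`, `--accounts-file`, `--no-audio`) and the `TRACKER_ACCOUNTS_FILE` variable could be wired together roughly as below. This is a sketch with `argparse`, not the actual code of `track_latest.py`; the helper names are hypothetical, and the precedence order (flag, then environment variable, then `accounts.txt`) is an assumption.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented command-line flags."""
    parser = argparse.ArgumentParser(prog="track_latest.py")
    parser.add_argument("--limit", type=int, default=3,
                        help="number of latest videos to track per account")
    parser.add_argument("--accounts-file", default=None,
                        help="path to a custom account list")
    parser.add_argument("--no-audio", action="store_true",
                        help="skip audio download and transcription")
    return parser


def resolve_accounts_file(args: argparse.Namespace, env: dict[str, str]) -> str:
    # Assumed precedence: explicit flag, then environment variable, then default file.
    return args.accounts_file or env.get("TRACKER_ACCOUNTS_FILE") or "accounts.txt"


if __name__ == "__main__":
    import os
    parsed = build_parser().parse_args()
    print(parsed.limit, parsed.no_audio, resolve_accounts_file(parsed, dict(os.environ)))
```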
Cookie Refresh
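When scraping returns 0 videos or warns "Cookie 已 N 天未更新" (cookie not refreshed for N days):
```bash
cd "$SKILL_DIR"
python scripts/scrape_profile.py  # opens a browser; scan the QR code
```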
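The staleness warning implies some kind of age check on the saved cookie. One way to approximate it is from the cookie file's modification time, as sketched below. How the skill actually measures cookie age is not stated here, so the mtime approach, the function names, and the 7-day threshold are all assumptions.

```python
import time
from pathlib import Path
from typing import Optional


def cookie_age_days(cookie_file: Path, now: Optional[float] = None) -> Optional[float]:
    """Days since the cookie file was last modified, or None if it is missing."""
    if not cookie_file.exists():
        return None
    current = time.time() if now is None else now
    return (current - cookie_file.stat().st_mtime) / 86400


def needs_refresh(cookie_file: Path, max_age_days: float = 7.0,
                  now: Optional[float] = None) -> bool:
    # A missing cookie file always calls for a fresh login.
    age = cookie_age_days(cookie_file, now)
    return age is None or age > max_age_days


if __name__ == "__main__":
    print(needs_refresh(Path(".douyin_cookies.json")))
```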
Pipeline Flow
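```
accounts.txt (or the list given by --accounts-file / TRACKER_ACCOUNTS_FILE)
    ↓
scripts/scrape_profile.py   → MediaCrawler (CDP) → OUTPUT_BASE_DIR/data/*.csv
    ↓
scripts/clean_data.py       → normalized OUTPUT_BASE_DIR/data/cleaned_*.csv
    ↓
scripts/download_video.py   → Playwright + ffmpeg → OUTPUT_BASE_DIR/audio/{blogger}/*.m4a
    ↓
scripts/extract_subtitle.py → Whisper → OUTPUT_BASE_DIR/subtitles/{blogger}/{video_id}.md
```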
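For the audio stage, the core of an ffmpeg-based extraction is a command that drops the video stream and writes the audio to an .m4a file. The sketch below only builds such a command. The exact flags used by the skill's download script are not documented here, so this particular combination (`-vn`, `-c:a aac`) and the function name are assumptions.

```python
from pathlib import Path


def ffmpeg_audio_command(source: str, output: Path) -> list[str]:
    """Build an ffmpeg argument list that keeps only the audio track."""
    return [
        "ffmpeg",
        "-y",            # overwrite an existing output file
        "-i", source,    # input: local file path or media URL
        "-vn",           # drop the video stream
        "-c:a", "aac",   # encode audio as AAC for the .m4a container
        str(output),
    ]


if __name__ == "__main__":
    print(" ".join(ffmpeg_audio_command("video.mp4", Path("audio/creator/123.m4a"))))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` so that a non-zero ffmpeg exit code raises an exception.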
Output Locations
All generated files live under OUTPUT_BASE_DIR (defaults to ~/DouyinContentTracker on macOS/Linux, %USERPROFILE%\DouyinContentTracker on Windows).
| Subdir | Contents |
|---|---|
| data/cleaned_*.csv | Scraped + normalized video metadata |
| audio/{blogger}/{video_id}.m4a | Extracted audio |
| subtitles/{blogger}/{video_id}.md | Whisper transcript (title as first line) |
| subtitles/{blogger}.md | All transcripts for one blogger merged |
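The layout above can be expressed as a small path builder, which is handy when checking whether a given video has already been processed. The function name and dictionary keys are hypothetical; only the directory structure comes from the table.

```python
from pathlib import Path


def output_paths(base: Path, blogger: str, video_id: str) -> dict[str, Path]:
    """Map one video to the files the pipeline is expected to produce."""
    return {
        "audio": base / "audio" / blogger / f"{video_id}.m4a",
        "transcript": base / "subtitles" / blogger / f"{video_id}.md",
        "merged": base / "subtitles" / f"{blogger}.md",
    }


if __name__ == "__main__":
    base = Path.home() / "DouyinContentTracker"
    for kind, path in output_paths(base, "creator", "123").items():
        print(f"{kind}: {path}")
```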
Execution Logging Guide
When running the pipeline, report progress to the user after each step completes. Do not wait until the entire pipeline finishes.
Step-by-step reporting template:
After each Bash tool call returns, immediately tell the user:
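| Step | What to report |
|---|---|
| Scrape | Creator name and number of videos scraped; on failure, state the reason |
| Clean | Number of valid records after cleaning |
| Audio download | Audio files downloaded successfully / total, and number skipped |
| Speech recognition (Whisper) | Number of subtitle files generated and the output path |
| Done | Summary: creators processed, videos, subtitle files generated, and the output directory path |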
If a step fails, stop the pipeline, report the error output verbatim, and suggest the matching fix from references/troubleshooting.md before asking the user whether to continue.
Example output style:
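```
[Step 1/4 Scrape] Creator "Example": scrape complete, 10 videos
[Step 2/4 Clean] 10 valid records → data/cleaned_profile_xxx.csv
[Step 3/4 Audio] Downloaded 8/10 (2 had no audio stream and were skipped)
[Step 4/4 Subtitles] Generated 8 subtitle files → subtitles/Example/
[Done] 1 creator · 10 videos · 8 subtitle files, output directory: ~/DouyinContentTracker
```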
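The reporting rules in this section (report after every step, stop on the first failure, pass the error through verbatim) can be sketched as a small runner. This illustrates the control flow only; `run_pipeline` and the step callables are hypothetical and do not invoke the skill's scripts.

```python
from typing import Callable, Iterable

# A step is (label, function returning a one-line summary); both are illustrative.
Step = tuple[str, Callable[[], str]]


def run_pipeline(steps: Iterable[Step], report: Callable[[str], None] = print) -> bool:
    """Run steps in order, reporting after each one and stopping on the first failure."""
    ordered = list(steps)
    total = len(ordered)
    for index, (label, action) in enumerate(ordered, start=1):
        try:
            summary = action()
        except Exception as exc:
            # Report the error text unchanged and halt; later steps are not run.
            report(f"[Step {index}/{total} {label}] FAILED: {exc}")
            return False
        report(f"[Step {index}/{total} {label}] {summary}")
    report("[Done]")
    return True


if __name__ == "__main__":
    run_pipeline([
        ("Scrape", lambda: "10 videos"),
        ("Clean", lambda: "10 valid records"),
    ])
```

Passing `report` as a parameter keeps the runner independent of how progress reaches the user, whether printed or collected into a list.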
References
Load these files into context when debugging or extending the pipeline:
- references/pipeline.md: per-script technical breakdown, data schemas, key function signatures
- references/troubleshooting.md: fixes for cookie, MediaCrawler, ffmpeg, Whisper, and data errors
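Douyin Content Tracker
Scrapes Douyin creator videos via MediaCrawler, downloads audio with ffmpeg, and transcribes speech to text with Whisper.
Finding the Skill Base Directory
All commands must run from this skill's directory. To locate it, run:
```bash
python -c "import pathlib; print([p for p in pathlib.Path.home().rglob('douyin-content-tracker-skill/SKILL.md')])"
```
Or check common locations:
- ~/.claude/skills/douyin-content-tracker-skill/
- The path shown when the skill was installed
Set it as a variable for convenience:
```bash
SKILL_DIR=~/.claude/skills/douyin-content-tracker-skill  # adjust to actual path
cd "$SKILL_DIR"
```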
First-Time Setup
Run these steps once on a new machine.
1. Install Python dependencies
```bash
cd "$SKILL_DIR"
pip install -r scripts/requirements.txt
python -m playwright install chromium
```
2. Install MediaCrawler
```bash
# Windows
git clone https://github.com/NanmiCoder/MediaCrawler D:/MediaCrawler
cd D:/MediaCrawler && pip install -r requirements.txt

# macOS/Linux
git clone https://github.com/NanmiCoder/MediaCrawler ~/MediaCrawler
cd ~/MediaCrawler && pip install -r requirements.txt
```
3. Configure .env
```bash
cd "$SKILL_DIR"
cp .env.template .env
```
Edit .env. Required field:
```dotenv
# Adjust to the actual MediaCrawler path (use ~/MediaCrawler on macOS/Linux)
MEDIACRAWLER_DIR=D:/MediaCrawler
```
Optional overrides:
```dotenv
# Where data/audio/subtitles/models are stored
# (default: ~/DouyinContentTracker or %USERPROFILE%\DouyinContentTracker)
OUTPUT_BASE_DIR=/Users/me/DouyinContentTracker

# Whisper model size (default: medium)
WHISPER_MODEL=small
```
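4. Add target accounts
Edit accounts.txt (or set TRACKER_ACCOUNTS_FILE / pass --accounts-file when running):
```
Creator name | https://www.douyin.com/user/MS4wLjABAAAA...
```
5. First login (generates cookie)
```bash
cd "$SKILL_DIR"
python scripts/scrape_profile.py
```
A browser window opens; scan the Douyin QR code to log in. The cookie is saved to .douyin_cookies.json.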
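Daily Usage
```bash
cd "$SKILL_DIR"

# Track the latest 3 videos per account (default). main.py is equivalent to track_latest.py
python scripts/track_latest.py
# or
python scripts/main.py

# Track the latest N videos
python scripts/track_latest.py --limit 5

# Use a custom account list (can also be set via the TRACKER_ACCOUNTS_FILE environment variable)
python scripts/track_latest.py --accounts-file /path/to/accounts.txt

# Skip audio download and transcription (metadata only)
python scripts/track_latest.py --no-audio
```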
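Cookie Refresh
When scraping returns 0 videos or warns "Cookie 已 N 天未更新" (cookie not refreshed for N days):
```bash
cd "$SKILL_DIR"
python scripts/scrape_profile.py  # opens a browser; scan the QR code
```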
Pipeline Flow
```
accounts.txt (or the list given by --accounts-file / TRACKER_ACCOUNTS_FILE)
    ↓
scripts/scrape_profile.py   → MediaCrawler (CDP) → OUTPUT_BASE_DIR/data/*.csv
    ↓
scripts/clean_data.py       → normalized OUTPUT_BASE_DIR/data/cleaned_*.csv
    ↓
scripts/download_video.py   → Playwright + ffmpeg → OUTPUT_BASE_DIR/audio/{blogger}/*.m4a
    ↓
scripts/extract_subtitle.py → Whisper → OUTPUT_BASE_DIR/subtitles/{blogger}/{video_id}.md
```
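Output Locations
All generated files live under OUTPUT_BASE_DIR (defaults to ~/DouyinContentTracker on macOS/Linux, %USERPROFILE%\DouyinContentTracker on Windows).
| Subdir | Contents |
|---|---|
| data/cleaned_*.csv | Scraped + normalized video metadata |
| audio/{blogger}/{video_id}.m4a | Extracted audio |
| subtitles/{blogger}/{video_id}.md | Whisper transcript (title as first line) |
| subtitles/{blogger}.md | All transcripts for one blogger merged |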
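Execution Logging Guide
When running the pipeline, report progress to the user after each step completes. Do not wait until the entire pipeline finishes.
Step-by-step reporting template:
After each Bash tool call returns, immediately tell the user:
| Step | What to report |
|---|---|
| Scrape | Creator name and number of videos scraped; on failure, state the reason |
| Clean | Number of valid records after cleaning |
| Audio download | Audio files downloaded successfully / total, and number skipped |
| Speech recognition (Whisper) | Number of subtitle files generated and the output path |
| Done | Summary: creators processed, videos, subtitle files generated, and the output directory path |
If a step fails, stop the pipeline, report the error output verbatim, and suggest the matching fix from references/troubleshooting.md before asking the user whether to continue.
Example output style:
```
[Step 1/4 Scrape] Creator "Example": scrape complete, 10 videos
[Step 2/4 Clean] 10 valid records → data/cleaned_profile_xxx.csv
[Step 3/4 Audio] Downloaded 8/10 (2 had no audio stream and were skipped)
[Step 4/4 Subtitles] Generated 8 subtitle files → subtitles/Example/
[Done] 1 creator · 10 videos · 8 subtitle files, output directory: ~/DouyinContentTracker
```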
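References
Load these files into context when debugging or extending the pipeline:
- references/pipeline.md: per-script technical breakdown, data schemas, key function signatures
- references/troubleshooting.md: fixes for cookie, MediaCrawler, ffmpeg, Whisper, and data errors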