Douyin Content Tracker

Scrapes Douyin creator videos via MediaCrawler, downloads audio with ffmpeg, and transcribes speech with Whisper.

Finding the Skill Base Directory

All commands must run from this skill's directory. To locate it, run:

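    python -c "import pathlib; print([p for p in pathlib.Path.home().rglob('douyin-content-tracker-skill/SKILL.md')])"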

Or check common locations:

- ~/.claude/skills/douyin-content-tracker-skill/
- The path shown when the skill was installed

Set it as a variable for convenience:

    # Use $HOME rather than ~ here: a tilde inside double quotes is not expanded by the shell.
    SKILL_DIR="$HOME/.claude/skills/douyin-content-tracker-skill"   # adjust to actual path
    cd "$SKILL_DIR"
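
The lookup above can also be written as a small reusable function. This is a minimal sketch, not part of the skill itself: the function name `find_skill_dirs` is hypothetical, and it assumes the skill directory is named `douyin-content-tracker-skill` and contains a `SKILL.md` file, as the one-liner implies.

```python
from pathlib import Path


def find_skill_dirs(root: Path, skill_name: str = "douyin-content-tracker-skill") -> list:
    """Return every directory under `root` named `skill_name` that contains SKILL.md."""
    # rglob searches recursively; the parent of each SKILL.md hit is the skill directory.
    return sorted(p.parent for p in root.rglob(f"{skill_name}/SKILL.md"))


# Example (may be slow on a large home directory):
#   for skill_dir in find_skill_dirs(Path.home()):
#       print(skill_dir)
```

If the search returns more than one directory, pick the one matching the path shown at install time.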

First-Time Setup

Run these steps once on a new machine.

1. Install Python dependencies

    cd "$SKILL_DIR"
    pip install -r scripts/requirements.txt
    python -m playwright install chromium

2. Install MediaCrawler

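    # Windows
    git clone https://github.com/NanmiCoder/MediaCrawler D:/MediaCrawler
    cd D:/MediaCrawler && pip install -r requirements.txt

    # macOS/Linux
    git clone https://github.com/NanmiCoder/MediaCrawler ~/MediaCrawler
    cd ~/MediaCrawler && pip install -r requirements.txt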

3. Configure `.env`

    cd "$SKILL_DIR"
    cp .env.template .env

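Edit .env. The one required field is:

    # Adjust to your actual MediaCrawler path (use ~/MediaCrawler on macOS/Linux)
    MEDIACRAWLER_DIR=D:/MediaCrawler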

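Optional overrides:

    # Where data/audio/subtitles/models are stored
    # (default: ~/DouyinContentTracker or %USERPROFILE%\DouyinContentTracker)
    OUTPUT_BASE_DIR=/Users/me/DouyinContentTracker

    # Whisper model size (default: medium)
    WHISPER_MODEL=small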
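
To make the defaults concrete, here is a minimal sketch of how these three settings could be resolved from environment variables. The function name `resolve_settings` and the returned dictionary keys are hypothetical, and the skill's scripts may read `.env` differently. Only the variable names and defaults come from this document.

```python
import os
from pathlib import Path


def resolve_settings(env=None):
    """Resolve tracker settings from a mapping of environment variables."""
    env = os.environ if env is None else env

    mediacrawler_dir = env.get("MEDIACRAWLER_DIR")
    if not mediacrawler_dir:
        # MEDIACRAWLER_DIR is the only required field.
        raise ValueError("MEDIACRAWLER_DIR must be set in .env")

    # Path.home() gives ~ on macOS/Linux and %USERPROFILE% on Windows.
    base = env.get("OUTPUT_BASE_DIR") or str(Path.home() / "DouyinContentTracker")

    return {
        "mediacrawler_dir": Path(mediacrawler_dir).expanduser(),
        "output_base_dir": Path(base).expanduser(),
        "whisper_model": env.get("WHISPER_MODEL") or "medium",
    }
```

Passing an explicit mapping instead of relying on `os.environ` keeps the function easy to check in isolation.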

4. Add target accounts

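Edit accounts.txt (or set TRACKER_ACCOUNTS_FILE / pass --accounts-file when running), one account per line in the form "name | profile URL":

    Blogger Name | https://www.douyin.com/user/MS4wLjABAAAA...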
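
The "name | URL" line format can be parsed as sketched below. This is an illustration rather than the skill's own parser: the function name `parse_accounts` is hypothetical, and skipping blank lines and `#` comment lines is an assumption not stated in this document.

```python
def parse_accounts(text):
    """Parse 'name | url' lines into a list of (name, url) tuples."""
    accounts = []
    for raw in text.splitlines():
        line = raw.strip()
        # Assumption: blank lines and '#' comments are ignored.
        if not line or line.startswith("#"):
            continue
        name, sep, url = line.partition("|")
        if not sep:
            raise ValueError(f"expected 'name | url', got: {line!r}")
        accounts.append((name.strip(), url.strip()))
    return accounts
```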

5. First login (generates cookie)

    cd "$SKILL_DIR"
    python scripts/scrape_profile.py

A browser window opens; scan the Douyin QR code to log in. The cookie is saved to .douyin_cookies.json.

Daily Usage

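    cd "$SKILL_DIR"

    # Track the latest 3 videos per account (default).
    # main.py and track_latest.py are equivalent.
    python scripts/track_latest.py
    # or
    python scripts/main.py

    # Track the latest N videos
    python scripts/track_latest.py --limit 5

    # Use a custom account list (can also be set via the TRACKER_ACCOUNTS_FILE environment variable)
    python scripts/track_latest.py --accounts-file /path/to/accounts.txt

    # Skip audio download and transcription (metadata only)
    python scripts/track_latest.py --no-audio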
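
When the tracker is launched from another Python program, the flags above can be assembled programmatically. The sketch below uses only the `--limit`, `--accounts-file`, and `--no-audio` flags named in this section. The helper name `build_tracker_command` is hypothetical, and the command is assumed to be run with the skill directory as the working directory.

```python
import sys


def build_tracker_command(limit=None, accounts_file=None, no_audio=False):
    """Build the argv list for scripts/track_latest.py."""
    cmd = [sys.executable, "scripts/track_latest.py"]
    if limit is not None:
        cmd += ["--limit", str(limit)]
    if accounts_file:
        cmd += ["--accounts-file", str(accounts_file)]
    if no_audio:
        cmd.append("--no-audio")
    return cmd


# Example:
#   import subprocess
#   subprocess.run(build_tracker_command(limit=5), cwd=skill_dir, check=True)
```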

Cookie Refresh

When scraping returns 0 videos or warns "Cookie 已 N 天未更新" (the cookie has not been refreshed for N days), log in again to refresh the cookie:

    cd "$SKILL_DIR"
    python scripts/scrape_profile.py   # opens a browser; scan the QR code
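
One way to anticipate the staleness warning is to check how old the cookie file is. The sketch below uses the file's modification time as a proxy. That is an assumption, since this document does not say how the skill measures cookie age or which threshold triggers the warning. The function name `cookie_age_days` is hypothetical.

```python
import time
from pathlib import Path


def cookie_age_days(cookie_path, now=None):
    """Return the cookie file's age in days, or None if the file does not exist."""
    path = Path(cookie_path)
    if not path.exists():
        return None  # no cookie yet: run the first-login step
    now = time.time() if now is None else now
    # Assumption: the file's mtime reflects the last successful login.
    return max(0.0, (now - path.stat().st_mtime) / 86400)


# Example:
#   age = cookie_age_days(".douyin_cookies.json")
```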

Pipeline Flow

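    accounts.txt (or the list given by --accounts-file / TRACKER_ACCOUNTS_FILE)
        ↓
    scripts/scrape_profile.py → MediaCrawler (CDP) → OUTPUT_BASE_DIR/data/*.csv
        ↓
    scripts/clean_data.py → normalized OUTPUT_BASE_DIR/data/cleaned_*.csv
        ↓
    scripts/download_video.py → Playwright + ffmpeg → OUTPUT_BASE_DIR/audio/{blogger}/*.m4a
        ↓
    scripts/extract_subtitle.py → Whisper → OUTPUT_BASE_DIR/subtitles/{blogger}/{video_id}.md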

Output Locations

All generated files live under OUTPUT_BASE_DIR (defaults to ~/DouyinContentTracker on macOS/Linux, %USERPROFILE%\DouyinContentTracker on Windows).

| Subdir | Contents |
|---|---|
| data/cleaned_*.csv | Scraped + normalized video metadata |
| audio/{blogger}/{video_id}.m4a | Extracted audio |
| subtitles/{blogger}/{video_id}.md | Whisper transcript (title as first line) |
| subtitles/{blogger}.md | All transcripts for one blogger merged |
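
The layout above maps directly to paths. As a minimal sketch, the hypothetical helper `output_paths` below builds the per-video locations from a base directory, a blogger name, and a video ID. The dictionary keys are illustrative only.

```python
from pathlib import Path


def output_paths(base_dir, blogger, video_id):
    """Return the expected output files for one video under OUTPUT_BASE_DIR."""
    base = Path(base_dir)
    return {
        "audio": base / "audio" / blogger / f"{video_id}.m4a",
        "transcript": base / "subtitles" / blogger / f"{video_id}.md",
        # One merged file per blogger, next to the per-blogger directory.
        "merged_transcript": base / "subtitles" / f"{blogger}.md",
    }
```

A caller could use these paths to check which videos already have a transcript before re-running the pipeline.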

Execution Logging Guide

When running the pipeline, report progress to the user after each step completes. Do not wait until the entire pipeline finishes.

Step-by-step reporting template:

After each Bash tool call returns, immediately tell the user:

| Step | What to report |
|---|---|
| Scrape | Blogger name and number of videos collected; on failure, state the reason |
| Clean | |

If a step fails, stop the pipeline, report the error output verbatim, and suggest the matching fix from references/troubleshooting.md before asking the user whether to continue.
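
The stop-on-failure rule can be sketched as a small runner for cases where the steps are driven from a script rather than from individual tool calls. The function `run_steps` and its progress-line format are hypothetical, and the sketch assumes each step can be invoked as a separate command.

```python
import subprocess


def run_steps(steps, report=print):
    """Run (name, argv) steps in order; stop at the first failure.

    Returns True if every step succeeded, False otherwise.
    """
    total = len(steps)
    for index, (name, argv) in enumerate(steps, start=1):
        result = subprocess.run(argv, capture_output=True, text=True)
        if result.returncode != 0:
            report(f"[Step {index}/{total} {name}] FAILED (exit code {result.returncode})")
            # Pass the error output through verbatim, without summarizing it.
            report(result.stderr or result.stdout)
            return False
        # Report as soon as the step finishes, not at the end of the pipeline.
        report(f"[Step {index}/{total} {name}] done")
    return True
```

After a False return, the caller would look up the matching fix in references/troubleshooting.md and ask the user whether to continue.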

Example output style:

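    [Step 1/4 Scrape] Blogger "XX": scrape complete, 10 videos
    [Step 2/4 Clean] 10 valid rows → data/cleaned_profile_xxx.csv
    [Step 3/4 Audio] Downloaded 8/10 (2 had no audio stream, skipped)
    [Step 4/4 Subtitles] Generated 8 subtitle files → subtitles/XX/
    [Done] 1 blogger · 10 videos · 8 subtitles; output directory: ~/DouyinContentTracker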

References

Load these files into context when debugging or extending the pipeline:

- references/pipeline.md: per-script technical breakdown, data schemas, key function signatures
- references/troubleshooting.md: fixes for cookie, MediaCrawler, ffmpeg, Whisper, and data errors
