F5-TTS Mining Rig Monitor Skill
This skill provides instructions for ADA to safely monitor the ongoing F5-TTS training process on the 9-GPU mining rig (Local-LLM), without interfering with the data or environment.
IMPORTANT:
- The training dataset and checkpoints are strictly located on the HDD of the mining rig at /mnt/toshiba/projects/F5-TTS/.
- Do not attempt to run training locally on asus-z170k.
- Use uv exclusively when interacting with the Python environment on the mining rig.
Steps to Monitor Training
1. Check GPU Utilization
To confirm that all 9 GPUs are actively training and that none is bottlenecked or out of memory (OOM), run the following command via SSH. If you wrap it in watch, allocate a pseudo-terminal with ssh -t:
ssh Local-LLM "nvidia-smi"
You should see 9 python3 processes consistently consuming ~11GB of VRAM each.
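The visual check above can also be reduced to a single number. The sketch below is a minimal example, assuming the training workers show up with "python" in their process name. The `count_training_procs` helper and its 10000 MiB threshold are hypothetical choices rather than values from the training setup; the `nvidia-smi` query flags in the usage comment are standard.

```shell
#!/usr/bin/env bash
# Count GPU compute processes that look like training workers.
# Reads CSV rows "pid, process_name, used_memory_MiB" on stdin.
# The default threshold (10000 MiB) is an assumption standing in for "~11GB".
count_training_procs() {
  local min_mib="${1:-10000}"
  awk -F', *' -v min="$min_mib" '
    $2 ~ /python/ && $3 + 0 >= min { n++ }
    END { print n + 0 }
  '
}

# Usage on the rig (read-only query, does not touch the training job):
#   ssh Local-LLM "nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv,noheader,nounits" \
#     | count_training_procs 10000
```

A result below 9 suggests that at least one worker has exited or is holding far less memory than expected, which is worth confirming in the full `nvidia-smi` output.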
2. Check Training Epoch Progress
Check the Accelerate training logs to see the current epoch and global step:
ssh Local-LLM "tail -n 100 /mnt/toshiba/projects/F5-TTS/outputs/training_mining_rig.log"
Look for Epoch: and Step: progression.
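To pull out only the latest values, something like the following may help. It is a sketch that assumes the log prints progress as `Epoch: <integer>` and `Step: <integer>`; the exact log layout is not shown here, so the patterns may need adjusting. `latest_progress` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Print the most recent "Epoch:" and "Step:" numbers found on stdin.
# Assumes the log writes them as "Epoch: <int>" and "Step: <int>".
latest_progress() {
  local log epoch step
  log="$(cat)"
  # Take the last match of each pattern, then keep only its digits.
  epoch="$(printf '%s\n' "$log" | grep -oE 'Epoch: *[0-9]+' | tail -n 1 | grep -oE '[0-9]+')" || true
  step="$(printf '%s\n' "$log" | grep -oE 'Step: *[0-9]+' | tail -n 1 | grep -oE '[0-9]+')" || true
  printf 'epoch=%s step=%s\n' "${epoch:-unknown}" "${step:-unknown}"
}

# Usage on the rig:
#   ssh Local-LLM "tail -n 100 /mnt/toshiba/projects/F5-TTS/outputs/training_mining_rig.log" \
#     | latest_progress
```

Running it twice a few minutes apart and comparing the step values is a simple way to confirm that training is still advancing.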
3. Check System RAM and CPU Load
The mining rig only has a 2-core Pentium CPU and 16GB of RAM. Make sure the system isn't buckling under the DDP overhead:
ssh Local-LLM "free -h && uptime"
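The memory and load check can be turned into a pass/fail signal. The sketch below is one possible approach, not part of the original procedure. It reads `free -m` output and the contents of `/proc/loadavg` (used here instead of `uptime` because its format is fixed), and compares them against thresholds. The `check_pressure` name and the default thresholds (2048 MiB available, 1-minute load of 4) are assumptions chosen for a 2-core, 16GB machine.

```shell
#!/usr/bin/env bash
# Flag memory or CPU pressure from `free -m` output and /proc/loadavg content.
# Thresholds are illustrative assumptions, not values from the rig's setup.
check_pressure() {
  local free_out="$1" loadavg="$2" min_avail_mib="${3:-2048}" max_load="${4:-4}"
  local avail load1
  # "available" is the 7th column of the "Mem:" row in procps-ng `free`.
  avail="$(printf '%s\n' "$free_out" | awk '/^Mem:/ { print $7 }')"
  # The first field of /proc/loadavg is the 1-minute load average.
  load1="$(printf '%s\n' "$loadavg" | awk '{ print $1; exit }')"
  awk -v a="$avail" -v l="$load1" -v ma="$min_avail_mib" -v ml="$max_load" 'BEGIN {
    status = "ok"
    if (a + 0 < ma + 0) status = "low-memory"
    if (l + 0 > ml + 0) status = (status == "ok") ? "high-load" : status ",high-load"
    print status
  }'
}

# Usage on the rig:
#   check_pressure "$(ssh Local-LLM 'free -m')" "$(ssh Local-LLM 'cat /proc/loadavg')"
```

If the `free` output is missing or unparsable, the available memory is treated as 0 and the function reports `low-memory`, so a failed probe is not mistaken for a healthy system.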
4. Update the Heartbeat
After successfully probing the status, update your local HEARTBEAT.md files to report to Master Seiya the current epoch, step, GPU temperature, and estimated time remaining.
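The heartbeat update could be scripted roughly as follows. This is a sketch under stated assumptions: the layout of HEARTBEAT.md is not specified here, so the one-line format, the `eta_hours` and `write_heartbeat` helpers, and the total-step and steps-per-hour inputs are all hypothetical. The estimate is simply the remaining steps divided by a measured rate.

```shell
#!/usr/bin/env bash
# Estimate remaining hours from step counts and a measured rate.
# args: current_step total_steps steps_per_hour (all supplied by the caller)
eta_hours() {
  awk -v c="$1" -v t="$2" -v r="$3" 'BEGIN {
    if (r + 0 <= 0) { print "unknown"; exit }
    printf "%.1f\n", (t - c) / r
  }'
}

# Append one status line to a heartbeat file.
# args: file epoch step max_gpu_temp_c eta
# The line layout is hypothetical; adapt it to the real HEARTBEAT.md.
write_heartbeat() {
  printf '%s | epoch %s | step %s | max GPU temp %s C | ETA %s h\n' \
    "$(date -u +%Y-%m-%dT%H:%MZ)" "$2" "$3" "$4" "$5" >> "$1"
}

# Usage (the hottest GPU can be read with a standard nvidia-smi query):
#   temp="$(ssh Local-LLM 'nvidia-smi --query-gpu=temperature.gpu --format=csv,noheader' | sort -n | tail -n 1)"
#   write_heartbeat HEARTBEAT.md 3 1250 "$temp" "$(eta_hours 1250 10000 350)"
```

Measuring the rate from two step readings taken a known interval apart keeps the estimate tied to the rig's actual throughput rather than to a guess.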