Operating AutoDL Training

Use this skill for remote training operations on an AutoDL Linux server. It is designed for high-frequency workflows around "start training, watch progress, inspect resources, read logs, diagnose failures, and decide what to do next" while keeping execution constrained to one configured project directory.

What This Skill Does

- Starts a configured training command in the target project directory over SSH.
- Activates the remote Python environment with Conda or virtualenv fallbacks.
- Checks whether training is still running by combining process, GPU, and log freshness signals.
- Summarizes GPU, CPU, memory, and disk pressure instead of dumping raw command output.
- Reads recent logs and extracts likely metrics such as epoch, step, loss, lr, grad_norm, val_loss, accuracy, mAP, and F1.
- Detects common training failures such as CUDA OOM, NCCL errors, NaN, disk full, timeout, and segmentation faults.
- Produces a human-readable training summary and recommends whether to continue, tune, or resume from a checkpoint.
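
As a rough illustration of the metric extraction mentioned above, the sketch below pulls `name=value` or `name: value` pairs out of a single log line. The regular expression and the function name `extract_metrics` are assumptions made for this example; the skill's real parser may recognize other formats.

```python
import re

# Hypothetical pattern: a known metric name, then "=" or ":", then a number.
# The \b anchors keep "loss" from matching inside "val_loss".
METRIC_RE = re.compile(
    r"\b(epoch|step|loss|lr|grad_norm|val_loss|accuracy|mAP|F1)\b\s*[=:]\s*"
    r"([-+]?(?:\d+\.?\d*|\.\d+)(?:[eE][-+]?\d+)?)"
)


def extract_metrics(line: str) -> dict[str, float]:
    """Return every recognized metric found in one log line as floats."""
    return {name: float(value) for name, value in METRIC_RE.findall(line)}


if __name__ == "__main__":
    print(extract_metrics("epoch=3 step=1200 loss=0.4521 lr=1e-4 grad_norm: 2.5"))
```

Log lines that use a different layout, such as `epoch 3/10`, would need additional patterns.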

Required Inputs

Collect or confirm these values before running any script:

- host: AutoDL server hostname or IP.
- port: SSH port, usually 22.
- username: Remote Linux username.
- project_path: Absolute project directory on the remote server, for example /root/autodl-tmp/your-project.
- One environment option: env_name, env_activate, or venv_path.
- train_command: The training launch command, such as python train.py, python -m torch.distributed.run ..., or bash scripts/train.sh.
- Optional password mode: provide AUTOCLAW_TRAIN_SSH_PASSWORD as an environment variable or local .env file when SSH key login is not available.

Prefer a config file: copy config.example.json to a real file such as config.json and fill in the values. Alternatively, set environment variables based on .env.example.
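
To make the required inputs concrete, here is a minimal sketch of a pre-flight check that reports missing values before any script runs. The key names are taken from this section as best they can be read, and the helper `missing_fields` is hypothetical; the real schema in config.example.json is authoritative.

```python
import json

# Assumed key names, based on the inputs listed in this section.
REQUIRED = ("host", "port", "username", "project_path", "train_command")
ENV_OPTIONS = ("env_name", "env_activate", "venv_path")


def missing_fields(config: dict) -> list[str]:
    """List required values that are absent or empty in the config."""
    missing = [key for key in REQUIRED if not config.get(key)]
    # Exactly which environment option is used does not matter, but one is needed.
    if not any(config.get(key) for key in ENV_OPTIONS):
        missing.append("one of: " + ", ".join(ENV_OPTIONS))
    return missing


if __name__ == "__main__":
    # Placeholder values only; "example.invalid" is not a real server.
    config = json.loads(
        '{"host": "example.invalid", "port": 22, "username": "root",'
        ' "project_path": "/root/autodl-tmp/your-project",'
        ' "train_command": "python train.py"}'
    )
    print(missing_fields(config))
```

Anything the check reports should be requested from the user rather than guessed.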

Safety Rules

- Only operate inside the configured project_path.
- Do not invent missing SSH credentials or secrets.
- Do not write plaintext passwords into files.
- Prefer SSH keys or environment variables.
- Refuse obviously destructive launch commands such as rm -rf, reboot, shutdown, mkfs, or fork bombs.
- Do not kill unrelated processes or run global destructive recovery commands.
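
Two of these rules lend themselves to a small sketch: refusing destructive commands and keeping paths inside the project directory. The pattern list and both function names below are illustrative assumptions, not the skill's actual checks, and the path test is purely lexical (it does not resolve symlinks on the remote host).

```python
import posixpath
import re

# Hypothetical deny-list covering the examples named in the rules above.
DESTRUCTIVE = re.compile(
    r"\brm\s+-\w*r\w*f|\brm\s+-\w*f\w*r"  # rm -rf, rm -fr and similar flag orders
    r"|\b(?:reboot|shutdown|mkfs)\b"
    r"|:\(\)\s*\{"                        # opening of the classic shell fork bomb
)


def is_destructive(command: str) -> bool:
    """Return True when the command matches an obviously destructive pattern."""
    return bool(DESTRUCTIVE.search(command))


def is_inside_project(project_path: str, candidate: str) -> bool:
    """Return True when candidate, resolved against project_path, stays inside it."""
    root = posixpath.normpath(project_path)
    # An absolute candidate replaces root in join(), so it is checked as-is.
    target = posixpath.normpath(posixpath.join(root, candidate))
    return target == root or target.startswith(root + "/")


if __name__ == "__main__":
    print(is_destructive("rm -rf /"))
    print(is_inside_project("/root/autodl-tmp/your-project", "../other"))
```

A deny-list like this only catches obvious cases, so it complements rather than replaces reviewing the launch command.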

Workflow

1. Confirm Configuration

Read config.example.json and references/usage.md to understand the expected fields. Ask the user for any missing values instead of guessing.

2. Start Or Resume Training

Run scripts/remote_train.py to start a background job or build a resume command:

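```bash
python scripts/remote_train.py --config config.json
python scripts/remote_train.py --config config.json --resume-from outputs/checkpoints/last.ckpt
```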

Use this when the user asks to launch training, re-launch after interruption, or resume from a checkpoint.
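
For orientation, the sketch below shows one plausible way a background launch line could be assembled, with optional resume templating. Everything here is an assumption for illustration: the function name `build_launch_command`, the default log file name, and especially the `--resume` flag in the template, which depends entirely on the user's own training script.

```python
import shlex
from typing import Optional


def build_launch_command(
    project_path: str,
    train_command: str,
    log_file: str = "train.log",
    resume_from: Optional[str] = None,
    # Hypothetical template; the real flag depends on the training script.
    resume_template: str = "{command} --resume {checkpoint}",
) -> str:
    """Build a shell line that starts training in the background."""
    command = train_command
    if resume_from:
        # Quote the checkpoint path so spaces or metacharacters stay literal.
        command = resume_template.format(
            command=train_command, checkpoint=shlex.quote(resume_from)
        )
    # cd into the project first, then detach with nohup and capture all output.
    return (
        f"cd {shlex.quote(project_path)} && "
        f"nohup {command} > {shlex.quote(log_file)} 2>&1 &"
    )


if __name__ == "__main__":
    print(build_launch_command("/root/autodl-tmp/your-project", "python train.py"))
```

Returning the full command as a string keeps the resume invocation visible, so it can be shown to the user before it is executed.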

3. Check Live Status

Run scripts/check_status.py when the user asks whether training is still running:

```bash
python scripts/check_status.py --config config.json
```

This script combines process matching, nvidia-smi, and recent log updates to classify the run as running, stopped, failed, or unknown.
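
The way the three signals might be combined can be sketched as a small decision function. The `Signals` fields, the 600-second staleness threshold, and the rule ordering are all assumptions for illustration; the script's real logic may weigh the signals differently.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Signals:
    process_found: bool                # a process matching the training command exists
    gpu_busy: bool                     # nvidia-smi reports compute activity
    log_age_seconds: Optional[float]   # None when no log file was found
    failure_in_log: bool               # a known failure pattern appears in the log


def classify(s: Signals, stale_after: float = 600.0) -> str:
    """Map the combined signals to running, stopped, failed, or unknown."""
    if s.failure_in_log and not s.process_found:
        return "failed"
    log_fresh = s.log_age_seconds is not None and s.log_age_seconds <= stale_after
    if s.process_found and (s.gpu_busy or log_fresh):
        return "running"
    if not s.process_found and not s.gpu_busy:
        return "stopped"
    # Conflicting evidence, e.g. a process exists but shows no GPU or log activity.
    return "unknown"


if __name__ == "__main__":
    print(classify(Signals(True, True, 30.0, False)))
```

Falling back to unknown on conflicting evidence avoids reporting a hung job as healthy.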

4. Inspect Resource Pressure

Run scripts/monitor_resources.py to summarize GPU/CPU/memory/disk usage:

```bash
python scripts/monitor_resources.py --config config.json
```

Use the human-readable bottleneck assessment in the output instead of pasting raw command output unless the user asks for raw data.
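
As an example of turning raw output into a bottleneck hint, the sketch below interprets the CSV rows produced by `nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total --format=csv,noheader,nounits`. The thresholds (95% memory, 30% utilization), the wording of the hints, and the function name `assess_gpu` are assumptions, not values taken from the skill.

```python
def assess_gpu(csv_text: str) -> list[str]:
    """Turn 'util, mem_used, mem_total' rows (one per GPU) into short hints."""
    hints = []
    for index, row in enumerate(csv_text.strip().splitlines()):
        util, used, total = (float(field) for field in row.split(","))
        mem_pct = 100.0 * used / total
        if mem_pct >= 95:
            hints.append(f"GPU {index}: memory nearly full ({mem_pct:.0f}%), OOM risk")
        elif util < 30:
            hints.append(
                f"GPU {index}: low utilization ({util:.0f}%), "
                "possible data-loading bottleneck"
            )
        else:
            hints.append(f"GPU {index}: healthy ({util:.0f}% util, {mem_pct:.0f}% memory)")
    return hints


if __name__ == "__main__":
    # Sample rows made up for the example, not real measurements.
    for hint in assess_gpu("87, 20345, 24576\n12, 4000, 24576"):
        print(hint)
```

A single sample can mislead, so repeated readings give a more reliable picture than one snapshot.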

5. Read Logs And Summaries

Run scripts/summarize_log.py in one of these modes:

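```bash
python scripts/summarize_log.py --config config.json --action read --tail 200
python scripts/summarize_log.py --config config.json --action detect-failure --tail 400
python scripts/summarize_log.py --config config.json --action summarize --tail 400
```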

Use read for recent excerpts and metrics, detect-failure for exception diagnosis, and summarize for a concise human-facing assessment with next steps.
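
To illustrate what failure detection over a log tail can look like, here is a minimal pattern-matching sketch. The labels, the regular expressions, and the function name `detect_failures` are illustrative assumptions; real logs vary by framework and version, so the skill's own detector likely covers more cases.

```python
import re

# Hypothetical (label, pattern) pairs for the failure classes named in this document.
FAILURE_PATTERNS = [
    ("cuda_oom", re.compile(r"CUDA out of memory|OutOfMemoryError", re.I)),
    ("nccl", re.compile(r"NCCL error", re.I)),
    ("nan", re.compile(r"\bloss\b.*\bnan\b|\bnan\b.*\bloss\b", re.I)),
    ("disk_full", re.compile(r"No space left on device")),
    ("timeout", re.compile(r"timed? ?out", re.I)),
    ("segfault", re.compile(r"Segmentation fault")),
]


def detect_failures(log_text: str) -> list[str]:
    """Return the labels of every failure pattern found in the log text."""
    return [label for label, pattern in FAILURE_PATTERNS if pattern.search(log_text)]


if __name__ == "__main__":
    tail = (
        "step 10 loss 0.9\n"
        "RuntimeError: CUDA out of memory. Tried to allocate 2.00 GiB\n"
    )
    print(detect_failures(tail))
```

Matching on the log tail keeps the check cheap, at the cost of missing errors that scrolled out of the window.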

Script Map

- scripts/remote_train.py: start training, optional resume templating, structured launch result.
- scripts/check_status.py: process/GPU/log-based training status.
- scripts/monitor_resources.py: GPU/CPU/memory/disk summary and bottleneck hints.
- scripts/summarize_log.py: read logs, detect failures, summarize convergence and next actions.
- scripts/common.py: shared config loading, SSH execution, safe path checks, remote helpers.
- scripts/log_utils.py: reusable log parsing, failure detection, trend analysis, recommendation logic.

References

- Read references/usage.md for setup steps, example configs, and example commands.
- Read references/troubleshooting.md when SSH, environment activation, logs, or training recovery fail.

Agent Guidance

- Start with the least invasive action that answers the user’s request.
- When the user asks a yes/no status question, prefer scripts/check_status.py before reading a long log.
- When the user asks why training stopped, run scripts/check_status.py and then scripts/summarize_log.py --action detect-failure.
- When the user asks whether to continue training, run scripts/summarize_log.py --action summarize and include the recommendations from the script in the final response.
- When a checkpoint path is provided, prefer scripts/remote_train.py --resume-from ... so the resume command is explicit and auditable.
