ROCm vLLM Deployment Skill
Production-ready automation for deploying vLLM inference services on AMD ROCm GPUs using Docker Compose.
Features
- Environment Auto-Check - Detects and repairs missing dependencies
- Model Parameter Detection - Auto-reads config.json for optimal settings
- VRAM Estimation - Calculates memory requirements before deployment
- Secure Token Handling - Never writes tokens to compose files
- Structured Output - All logs and test results saved per-model
- Deployment Reports - Human-readable summary for each deployment
- Health Verification - Automated health checks and functional tests
- Troubleshooting Guide - Common issues and solutions
Environment Prerequisites
Recommended (for production): Add to ~/.bash_profile:
CODEBLOCK0
Not required for testing: The skill will proceed without these set:
- HF_TOKEN: Optional. Public models work without it; gated models fail at download with a clear error
- HF_HOME: Optional. Defaults to INLINECODE1
Environment Variable Detection
Priority Order:
1. Explicit parameter (highest): provided in the task/request (e.g., hf_token: "xxx")
2. Environment variable: already set in the shell or inherited from the parent process
3. ~/.bash_profile: sourced to load variables
4. Default value (lowest): HF_HOME defaults to INLINECODE3
| Variable | Required | If Missing |
|---|---|---|
| INLINECODE4 | Conditional | Continue without token (public models work; gated models fail at download with a clear error) |
| INLINECODE5 | No | Warning + default: use /root/.cache/huggingface/hub |
Philosophy: Fail fast for configuration errors, fail at download time for authentication errors.
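The priority order above can be sketched as a small lookup. This is only an illustration, not the skill's actual implementation: the function name `resolve_setting` and the `profile_vars` dictionary (standing in for values read from ~/.bash_profile) are hypothetical.

```python
import os

# Default cache location stated in this document.
DEFAULT_HF_HOME = "/root/.cache/huggingface/hub"


def resolve_setting(name, explicit=None, env=None, profile_vars=None, default=None):
    """Return (value, source) for `name`, following the documented priority order.

    `profile_vars` is a hypothetical dict of values parsed from ~/.bash_profile.
    """
    env = os.environ if env is None else env
    profile_vars = profile_vars or {}
    if explicit:                        # 1. explicit parameter (highest)
        return explicit, "parameter"
    if env.get(name):                   # 2. environment variable
        return env[name], "environment"
    if profile_vars.get(name):          # 3. value loaded from ~/.bash_profile
        return profile_vars[name], "bash_profile"
    return default, "default"           # 4. default value (may be None)


hf_home, source = resolve_setting("HF_HOME", env={}, default=DEFAULT_HF_HOME)
print(hf_home, source)
```

Returning the source alongside the value makes it easy to log a warning when a default is used, as the table above requires for HF_HOME.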
Helper Scripts
Location: INLINECODE7
check-env.sh
Validate and load environment variables before deployment.
Usage:
CODEBLOCK1
Exit Codes:
| Code | Meaning |
|---|---|
| 0 | Environment check completed (variables loaded or defaulted) |
| 2 | Critical error (e.g., cannot source ~/.bash_profile) |
Note: This script is optional. You can also directly run source ~/.bash_profile.
generate-report.sh
Generate human-readable deployment report after successful deployment.
Usage:
CODEBLOCK2
Parameters:
| Parameter | Required | Description |
|---|---|---|
| INLINECODE9 | Yes | Model ID (with / replaced by -) |
| INLINECODE12 | Yes | Docker container name |
| port | Yes | Host port for API endpoint |
| status | Yes | Deployment status (e.g., "✅ Success") |
| model-load-time | No | Model loading time in seconds |
| memory-used | No | Memory consumption in GiB |
Output: INLINECODE17
Exit Codes:
| Code | Meaning |
|---|---|
| 0 | Report generated successfully |
| 1 | Missing required parameters |
| 2 | Output directory not found |
Integration: This script is automatically called in Phase 7 of the deployment workflow.
Input Schema
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| model_id | String | Yes | - | HuggingFace model ID |
| docker_image | String | No | rocm/vllm-dev:nightly | vLLM Docker image |
| tensor_parallel_size | Integer | No | 1 | Number of GPUs |
| port | Integer | No | 9999 | API server port |
| hf_home | String | No | ${HF_HOME} or /root/.cache/huggingface/hub | Model cache directory |
| hf_token | Secret | Conditional | ${HF_TOKEN} | HuggingFace token (optional for public models, required for gated models) |
| max_model_len | Integer | No | Auto-detect | Maximum sequence length |
| gpu_memory_utilization | Float | No | 0.85 | GPU memory utilization |
| auto_install | Boolean | No | true | Auto-install dependencies |
| log_level | String | No | INFO | Logging verbosity |
Output Structure
All deployment artifacts MUST be saved to:
CODEBLOCK3
Convert model ID to directory name by replacing / with -:
- openai/gpt-oss-20b → INLINECODE24
- INLINECODE25 → INLINECODE26
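The conversion rule is a one-line string replacement. A minimal sketch, assuming the base directory is $HOME/vllm-compose as used elsewhere in this document; the helper name `model_output_dir` is hypothetical:

```python
from pathlib import Path


def model_output_dir(model_id, base=None):
    """Map a HuggingFace model ID to its per-model output directory."""
    # Assumed base: $HOME/vllm-compose, unless the caller overrides it.
    base = Path(base) if base is not None else Path.home() / "vllm-compose"
    # Replace every "/" with "-" so the ID becomes a single directory name.
    return base / model_id.replace("/", "-")


print(model_output_dir("openai/gpt-oss-20b"))
```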
Per-model directory structure:
CODEBLOCK4
File requirements:
- deployment.log: capture ALL container logs during deployment
- INLINECODE28: save API response from functional test request
- INLINECODE29: generated in Phase 7
- All three files MUST exist before marking deployment as complete
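The completion rule can be checked mechanically. The sketch below assumes the three file names are deployment.log, test-results.json, and DEPLOYMENT_REPORT.md (the names used in the per-model directory layout); the helper `missing_artifacts` is hypothetical and not part of the skill's scripts.

```python
from pathlib import Path

# File names assumed from the per-model directory structure.
REQUIRED_FILES = ("deployment.log", "test-results.json", "DEPLOYMENT_REPORT.md")


def missing_artifacts(output_dir):
    """Return the required files that are not yet present in output_dir."""
    out = Path(output_dir)
    return [name for name in REQUIRED_FILES if not (out / name).is_file()]


def deployment_complete(output_dir):
    """A deployment counts as complete only when nothing is missing."""
    return not missing_artifacts(output_dir)
```

Calling `missing_artifacts` rather than only the boolean makes the failure message actionable, since it names the files that still need to be produced.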
Execution Workflow
Phase 0: Environment Check & Auto-Repair
Step 0.1: Load Environment Variables
CODEBLOCK5
If HF_HOME is not defined in ~/.bash_profile, it defaults to /root/.cache/huggingface/hub.
Step 0.2: Create Output Directory
Step 0.3: Initialize Logging
- All output → INLINECODE32
Step 0.4: System Checks
- Detect OS and package manager
- Check Python, pip, huggingface_hub
- Check Docker, docker compose
- Check ROCm tools (rocm-smi/amd-smi)
- Check GPU access (/dev/kfd, /dev/dri)
- Check disk space (20GB minimum)
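A few of the checks above (tools on PATH, GPU device nodes, free disk space) can be sketched with the Python standard library. This is an illustrative outline only: the function names are hypothetical, the 20 GB threshold is interpreted as decimal gigabytes, and the real skill may perform these checks differently.

```python
import shutil
from pathlib import Path

MIN_FREE_GB = 20  # minimum from the checklist; decimal GB is an assumption


def free_gb(path):
    """Free space on the filesystem containing `path`, in decimal gigabytes."""
    return shutil.disk_usage(path).free / 1e9


def system_check(cache_dir="/",
                 tools=("docker", "rocm-smi", "amd-smi"),
                 devices=("/dev/kfd", "/dev/dri")):
    """Collect pass/fail results for tools, GPU device nodes, and disk space."""
    return {
        # Either rocm-smi or amd-smi is enough; both are reported here.
        "tools": {tool: shutil.which(tool) is not None for tool in tools},
        "devices": {dev: Path(dev).exists() for dev in devices},
        "disk_ok": free_gb(cache_dir) >= MIN_FREE_GB,
    }


print(system_check())
```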
Phase 1: Model Download
Use HF_HOME from Phase 0 (environment variable or default):
CODEBLOCK6
Authentication Handling:
| Scenario | Behavior |
|---|---|
| Public model + no token | ✅ Download succeeds |
| Public model + token provided | ✅ Download succeeds |
| Gated model + no token | ❌ Download fails with "authentication required" error |
| Gated model + invalid token | ❌ Download fails with "invalid token" error |
| Gated model + valid token | ✅ Download succeeds |
On Authentication Failure:
CODEBLOCK7
- Locate model path in HF cache: INLINECODE33
- Log download progress to INLINECODE34
Phase 2: Model Parameter Detection
- Read config.json from model
- Auto-detect: max_model_len, hidden_size, num_attention_heads, num_hidden_layers, vocab_size, dtype
- Validate TP size divides attention heads
- Estimate VRAM requirement
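The tensor-parallel validation and a rough memory estimate might look like the sketch below. The estimate is an assumption, not the skill's formula: it uses the common dense-decoder approximation of 12·L·h² + V·h parameters and counts weights only, so it ignores KV cache, activations, and architectures such as MoE or grouped-query attention. The function names and the `torch_dtype` key lookup are likewise illustrative.

```python
import json

BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2}


def load_config(path):
    """Read a model's config.json into a dict."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def check_tensor_parallel(config, tp_size):
    """Raise ValueError unless tp_size evenly divides the attention head count."""
    heads = config["num_attention_heads"]
    if heads % tp_size != 0:
        raise ValueError(
            f"tensor_parallel_size={tp_size} does not divide "
            f"num_attention_heads={heads}"
        )


def estimate_weight_gib(config):
    """Weight-only estimate in GiB (rough heuristic, see assumptions above)."""
    h = config["hidden_size"]
    layers = config["num_hidden_layers"]
    vocab = config["vocab_size"]
    params = 12 * layers * h * h + vocab * h
    dtype = config.get("torch_dtype") or config.get("dtype") or "float16"
    return params * BYTES_PER_PARAM.get(dtype, 2) / 2**30


example = {"hidden_size": 4096, "num_hidden_layers": 32,
           "num_attention_heads": 32, "vocab_size": 32000,
           "torch_dtype": "bfloat16"}
check_tensor_parallel(example, 4)
print(round(estimate_weight_gib(example), 2))
```

Comparing this figure against total GPU memory multiplied by gpu_memory_utilization gives an early warning before the container is launched.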
Phase 3: Docker Compose Configuration
Generate files in output directory:
- docker-compose.yml → INLINECODE35
  - Mount HF_HOME as volume (read-only for models)
  - NO hardcoded tokens in compose file
- .env → $HOME/vllm-compose/<model-id>/.env (optional)
  - Contains: HF_TOKEN=<value>
  - Permissions: chmod 600
  - Only created if user explicitly requests persistent token storage
Volume mount example:
CODEBLOCK8
Important: Docker Compose reads ${HF_HOME} from the host environment at runtime. Before running docker compose, source ~/.bash_profile: INLINECODE40
Phase 4: Container Launch
Important: Before deploying, pull the latest image to ensure updates:
CODEBLOCK9
Note: Default port is 9999. Before running docker compose, check if port is available: ss -tlnp | grep :<port>. If port is in use, specify a different port in docker-compose.yml.
- Pass HF_TOKEN at runtime: HF_TOKEN=$HF_TOKEN docker compose up -d
- Wait for container initialization
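As an alternative to inspecting ss output, the port check can be done by attempting to bind the port. A minimal sketch; the helper name `port_is_free` is hypothetical, and the check is conservative (a port lingering in TIME_WAIT may also be reported as busy).

```python
import socket


def port_is_free(port, host="0.0.0.0"):
    """Return True if a TCP socket can currently be bound to host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            # Typically EADDRINUSE: another process already holds the port.
            return False
    return True


if not port_is_free(9999):
    print("Port 9999 is in use; choose a different port in docker-compose.yml")
```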
Phase 5: Health Verification
- Check container status
- Test /health endpoint
- Test /v1/models endpoint
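The endpoint checks can be polled until the server is ready, since model loading may take a while. The sketch below uses only the standard library; the retry count, delay, and function names are assumptions rather than values taken from the skill.

```python
import time
import urllib.error
import urllib.request


def http_status(url, timeout=5):
    """Return the HTTP status code of a GET, or None if the server is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:   # server answered with an error code
        return err.code
    except (urllib.error.URLError, OSError):  # refused, timed out, DNS failure
        return None


def wait_until_healthy(base_url, attempts=30, delay=10,
                       fetch=http_status, sleep=time.sleep):
    """Poll /health and /v1/models until both return 200 or attempts run out."""
    for _ in range(attempts):
        if (fetch(f"{base_url}/health") == 200
                and fetch(f"{base_url}/v1/models") == 200):
            return True
        sleep(delay)
    return False
```

A call such as `wait_until_healthy("http://localhost:9999")` would cover both endpoint checks; `fetch` and `sleep` are injectable so the loop can be exercised without a running server.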
Phase 6: Functional Testing
- Run completion test via /v1/chat/completions API
- Save response to: INLINECODE43
- Verify response contains valid completion
- Log deployment complete → append to INLINECODE44
- Deployment is complete only when both files exist:
  - deployment.log
  - INLINECODE46
Phase 7: Deployment Report
Generate human-readable deployment report using the helper script.
Step 7.1: Extract Deployment Metrics
CODEBLOCK10
Step 7.2: Generate Report
CODEBLOCK11
Output: INLINECODE47
Report Contents:
- Output structure verification (file checklist)
- Deployment summary table (health, test, metrics)
- Test results (request/response preview)
- Environment configuration
- Quick commands for operations
Completion Criteria:
- DEPLOYMENT_REPORT.md exists in output directory
- Report contains all required sections
- All file checks show ✅
Security Best Practices
1. Never commit tokens to version control: add .env to INLINECODE50
2. Use .env files with chmod 600: restrict access to owner only
3. Mask tokens in logs: show only first 10 chars: INLINECODE51
4. Pass tokens at runtime: INLINECODE52
5. Store tokens in ~/.bash_profile: for production environments, set HF_TOKEN in user's shell config
6. Set token for gated models: HF_TOKEN is validated at download time; set in ~/.bash_profile for production
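The masking rule (show only the first 10 characters) can be sketched as follows. The helper name `mask_token`, the "..." suffix, and the handling of missing or very short tokens are assumptions added for illustration.

```python
def mask_token(token, visible=10):
    """Return a log-safe form of a token, exposing at most `visible` characters."""
    if not token:
        return "<not set>"
    if len(token) <= visible:
        # Assumption: a token this short would be fully exposed, so hide it all.
        return "***"
    return token[:visible] + "..."


print(mask_token("hf_abcdefghijklmnopqrstuvwxyz"))
```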
Troubleshooting
Environment Variables
| Issue | Solution |
|---|---|
| INLINECODE54 | Add export HF_TOKEN="hf_xxx" to ~/.bash_profile, then source ~/.bash_profile. Or provide via parameter. |
| INLINECODE58 | Defaults to /root/.cache/huggingface/hub. For production, add export HF_HOME="/path" to ~/.bash_profile. |
| ~/.bash_profile not found | Create ~/.bash_profile and add environment variables. |
| Changes not taking effect | Run source ~/.bash_profile or restart terminal. |
| HF_TOKEN provided but download still fails | Token may be invalid or lack access to the model. Verify token at https://huggingface.co/settings/tokens |
Model Download
| Issue | Solution |
|---|---|
| INLINECODE67 (gated model) | Set HF_TOKEN in ~/.bash_profile or provide via parameter. Ensure token has access to the model. |
| INLINECODE70 | Verify model ID is correct (case-sensitive). Check model exists on HuggingFace. |
| Download timeout | Check network connection. Large models may take time. |
Deployment
| Issue | Solution |
|---|---|
| hf CLI not found | INLINECODE72 |
| Docker Compose fails | Use docker compose (no hyphen) |
| GPU access fails | Add user to render group: sudo usermod -aG render $USER |
| Port in use | Change port parameter |
| OOM | Reduce gpu_memory_utilization |
Cleanup
CODEBLOCK12
Status Check
Check deployment status and logs:
CODEBLOCK13
Quick Start (Production)
Step 1: Add environment variables to ~/.bash_profile
CODEBLOCK14
Step 2: Verify environment is ready
CODEBLOCK15
Step 3: Run deployment
CODEBLOCK16
Version History
| Version | Changes |
|---|---|
| 1.0.0 | Initial release |
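ROCm vLLM Deployment Skill
Production-grade automation for deploying vLLM inference services on AMD ROCm GPUs with Docker Compose.
Features
- Environment Auto-Check - Detects and repairs missing dependencies
- Model Parameter Detection - Auto-reads config.json for optimal settings
- VRAM Estimation - Calculates memory requirements before deployment
- Secure Token Handling - Never writes tokens to compose files
- Structured Output - All logs and test results saved per-model
- Deployment Reports - Human-readable summary for each deployment
- Health Verification - Automated health checks and functional tests
- Troubleshooting Guide - Common issues and solutions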
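Environment Prerequisites
Recommended (for production): Add to ~/.bash_profile:
```bash
# HuggingFace authentication token (required for gated models)
export HF_TOKEN="hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"

# Model cache directory (optional)
export HF_HOME="$HOME/models"

# Apply changes
source ~/.bash_profile
```
Not required for testing: The skill still runs when the following variables are unset:
- HF_TOKEN: Optional. Public models work without a token; gated models fail at download with a clear error
- HF_HOME: Optional. Defaults to /root/.cache/huggingface/hub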
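Environment Variable Detection
Priority Order:
1. Explicit parameter (highest): provided in the task/request (e.g., hf_token: xxx)
2. Environment variable: already set in the shell or the parent process
3. ~/.bash_profile: sourced to load variables
4. Default value (lowest): HF_HOME defaults to /root/.cache/huggingface/hub
| Variable | Required | If Missing |
|---|---|---|
| HF_TOKEN | Conditional | Continue without token (public models work; gated models fail at download with a clear error) |
| HF_HOME | No | Warning + default: use /root/.cache/huggingface/hub |
Philosophy: Fail fast for configuration errors, fail at download time for authentication errors.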
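Helper Scripts
Location: <skill-directory>/scripts/
check-env.sh
Validate and load environment variables before deployment.
Usage:
```bash
# Basic check (HF_TOKEN optional, HF_HOME optional with a default)
./scripts/check-env.sh

# Strict mode (HF_HOME required; fails if unset)
./scripts/check-env.sh --strict

# Quiet mode (minimal output, suited to automation)
./scripts/check-env.sh --quiet

# Test with environment variables
HF_TOKEN=hf_xxx HF_HOME=/models ./scripts/check-env.sh
```
Exit Codes:
| Code | Meaning |
|---|---|
| 0 | Environment check completed (variables loaded or defaulted) |
| 2 | Critical error (e.g., cannot source ~/.bash_profile) |
Note: This script is optional. You can also run source ~/.bash_profile directly.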
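generate-report.sh
Generate a human-readable deployment report after a successful deployment.
Usage:
```bash
./scripts/generate-report.sh <model-id> <container-name> <port> <status> [model-load-time] [memory-used]

# Example (the status is quoted so it is passed as a single argument):
./scripts/generate-report.sh \
  Qwen-Qwen3-0.6B \
  vllm-qwen3-0-6b \
  8001 \
  "✅ Success" \
  3.6 \
  1.2
```
Parameters:
| Parameter | Required | Description |
|---|---|---|
| model-id | Yes | Model ID (with / replaced by -) |
| container-name | Yes | Docker container name |
| port | Yes | Host port for API endpoint |
| status | Yes | Deployment status (e.g., ✅ Success) |
| model-load-time | No | Model loading time in seconds |
| memory-used | No | Memory consumption in GiB |
Output: $HOME/vllm-compose/<model-id>/DEPLOYMENT_REPORT.md
Exit Codes:
| Code | Meaning |
|---|---|
| 0 | Report generated successfully |
| 1 | Missing required parameters |
| 2 | Output directory not found |
Integration: This script is called automatically in Phase 7 of the deployment workflow.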
Input Schema
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| model_id | String | Yes | - | HuggingFace model ID |
| docker_image | String | No | rocm/vllm-dev:nightly | vLLM Docker image |
| tensor_parallel_size | Integer | No | 1 | Number of GPUs |
| port | Integer | No | 9999 | API server port |
| hf_home | String | No | ${HF_HOME} or /root/.cache/huggingface/hub | Model cache directory |
| hf_token | Secret | Conditional | ${HF_TOKEN} | HuggingFace token (optional for public models, required for gated models) |
| max_model_len | Integer | No | Auto-detect | Maximum sequence length |
| gpu_memory_utilization | Float | No | 0.85 | GPU memory utilization |
| auto_install | Boolean | No | true | Auto-install dependencies |
| log_level | String | No | INFO | Logging verbosity |
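Output Structure
All deployment artifacts MUST be saved to:
```
$HOME/vllm-compose/<model-id-with-slashes-replaced-by-dashes>/
```
Convert model ID to directory name by replacing / with -:
- openai/gpt-oss-20b → $HOME/vllm-compose/openai-gpt-oss-20b/
- Qwen/Qwen3-Coder-Next-FP8 → $HOME/vllm-compose/Qwen-Qwen3-Coder-Next-FP8/
Per-model directory structure:
```
$HOME/vllm-compose/<model-id>/
├── deployment.log          # Full deployment log (stdout + stderr)
├── test-results.json       # Functional test results (JSON)
├── docker-compose.yml      # Generated Docker Compose file
├── .env                    # HF_TOKEN environment variable (chmod 600, optional)
└── DEPLOYMENT_REPORT.md    # Human-readable deployment summary
```
File requirements:
- deployment.log: capture ALL container logs during deployment
- test-results.json: save API response from functional test request
- DEPLOYMENT_REPORT.md: generated in Phase 7
- All three files MUST exist before marking deployment as complete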
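Execution Workflow
Phase 0: Environment Check & Auto-Repair
Step 0.1: Load Environment Variables
```bash
# Source ~/.bash_profile to load HF_HOME and HF_TOKEN
source ~/.bash_profile

# If HF_HOME is undefined, default to /root/.cache/huggingface/hub
export HF_HOME="${HF_HOME:-/root/.cache/huggingface/hub}"
```
If HF_HOME is not defined in ~/.bash_profile, it defaults to /root/.cache/huggingface/hub.
Step 0.2: Create Output Directory
- Create: $HOME/vllm-compose/<model-id>/
Step 0.3: Initialize Logging
- All output → $HOME/vllm-compose/<model-id>/deployment.log
Step 0.4: System Checks
- Detect OS and package manager
- Check Python, pip, huggingface_hub
- Check Docker, docker compose
- Check ROCm tools (rocm-smi/amd-smi)
- Check GPU access (/dev/kfd, /dev/dri)
- Check disk space (20GB minimum)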
Phase 1: Model Download
Use HF_HOME from Phase 0 (environment variable or default):
```bash
# Download the model into HF_HOME
huggingface-cli download <model-id> --local-dir "$HF_HOME/hub/models--<org>--<model>"

# Or use snapshot_download from Python:
python -c "from huggingface_hub import snapshot_download; snapshot_download(repo_id='<model-id>', cache_dir='$HF_HOME')"
```