ROCm vLLM Deployment Skill

Production-ready automation for deploying vLLM inference services on AMD ROCm GPUs using Docker Compose.

Features

- Environment Auto-Check - Detects and repairs missing dependencies
- Model Parameter Detection - Auto-reads config.json for optimal settings
- VRAM Estimation - Calculates memory requirements before deployment
- Secure Token Handling - Never writes tokens to compose files
- Structured Output - All logs and test results saved per-model
- Deployment Reports - Human-readable summary for each deployment
- Health Verification - Automated health checks and functional tests
- Troubleshooting Guide - Common issues and solutions

Environment Prerequisites

Recommended (for production): Add to ~/.bash_profile:

CODEBLOCK0

Not required for testing: The skill will proceed without these set:

- HF_TOKEN: Optional. Public models work without it; gated models fail at download with a clear error.
- HF_HOME: Optional. Defaults to `/root/.cache/huggingface/hub`.

Environment Variable Detection

Priority Order:

1. Explicit parameter (highest): provided in the task/request (e.g., hf_token: "xxx")
2. Environment variable: already set in the shell or inherited from the parent process
3. ~/.bash_profile: sourced to load variables
4. Default value (lowest): HF_HOME defaults to `/root/.cache/huggingface/hub`

| Variable | Required | If Missing |
|---|---|---|
| `HF_TOKEN` | Conditional | Continue without token (public models work; gated models fail at download with a clear error) |
| `HF_HOME` | No | Warning + default: use /root/.cache/huggingface/hub |

Philosophy: Fail fast for configuration errors, fail at download time for authentication errors.
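
The priority order above can be sketched as a small shell function. This is only an illustration, not the skill's actual implementation: `resolve_hf_home` and its two arguments are hypothetical names, and the profile file is read in a subshell so the caller's environment is left untouched.

```shell
# Sketch: resolve HF_HOME by priority (explicit > environment > profile > default).
# resolve_hf_home is a hypothetical helper, not part of the skill's scripts.
DEFAULT_HF_HOME="/root/.cache/huggingface/hub"

# Usage: resolve_hf_home [explicit_value] [profile_path]
resolve_hf_home() {
  local explicit="${1:-}"
  local profile="${2:-$HOME/.bash_profile}"
  local from_profile=""

  # 1. An explicit parameter always wins.
  if [ -n "$explicit" ]; then
    printf '%s\n' "$explicit"
    return 0
  fi

  # 2. A value already present in the environment.
  if [ -n "${HF_HOME:-}" ]; then
    printf '%s\n' "$HF_HOME"
    return 0
  fi

  # 3. A value defined in the profile file (sourced inside a subshell).
  if [ -r "$profile" ]; then
    from_profile="$( . "$profile" >/dev/null 2>&1; printf '%s' "${HF_HOME:-}" )"
    if [ -n "$from_profile" ]; then
      printf '%s\n' "$from_profile"
      return 0
    fi
  fi

  # 4. Fall back to the documented default.
  printf '%s\n' "$DEFAULT_HF_HOME"
}
```

Calling `resolve_hf_home "" ` with nothing set anywhere prints the default path; passing a first argument overrides every other source.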

Helper Scripts

Location: `<skill-directory>/scripts/`

check-env.sh

Validate and load environment variables before deployment.

Usage:
CODEBLOCK1

Exit Codes:

| Code | Meaning |
|---|---|
| 0 | Environment check completed (variables loaded or defaulted) |
| 2 | Critical error (e.g., cannot source ~/.bash_profile) |

Note: This script is optional. You can also directly run source ~/.bash_profile.

generate-report.sh

Generate human-readable deployment report after successful deployment.

Usage:
CODEBLOCK2

Parameters:

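| Parameter | Required | Description |
|---|---|---|
| `<model-id>` | Yes | Model ID (with `/` replaced by `-`) |
| `<container-name>` | Yes | Docker container name |
| `<port>` | Yes | Host port of the API endpoint |
| `<status>` | Yes | Deployment status (e.g., ✅ Success) |
| `[model-load-time]` | No | Model load time (seconds) |
| `[memory-used]` | No | Memory consumption (GiB) |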

Output: `$HOME/vllm-compose/<model-id>/DEPLOYMENT_REPORT.md`

Exit Codes:

| Code | Meaning |
|---|---|
| 0 | Report generated successfully |
| 1 | Missing required parameters |
| 2 | Output directory not found |

Integration: This script is automatically called in Phase 7 of the deployment workflow.

Input Schema

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| model_id | String | Yes | - | HuggingFace model ID |
| docker_image | String | No | rocm/vllm-dev:nightly | vLLM Docker image |
| tensor_parallel_size | Integer | No | 1 | Number of GPUs |
| port | Integer | No | 9999 | API server port |
| hf_home | String | No | ${HF_HOME} or /root/.cache/huggingface/hub | Model cache directory |
| hf_token | Secret | Conditional | ${HF_TOKEN} | HuggingFace token (optional for public models, required for gated models) |
| max_model_len | Integer | No | Auto-detect | Maximum sequence length |
| gpu_memory_utilization | Float | No | 0.85 | Fraction of GPU memory vLLM may use |
| auto_install | Boolean | No | true | Auto-install dependencies |
| log_level | String | No | INFO | Logging verbosity |

Output Structure

All deployment artifacts MUST be saved to:
CODEBLOCK3

Convert model ID to directory name by replacing / with -:

- openai/gpt-oss-20b → `$HOME/vllm-compose/openai-gpt-oss-20b/`
- Qwen/Qwen3-Coder-Next-FP8 → `$HOME/vllm-compose/Qwen-Qwen3-Coder-Next-FP8/`
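
As a minimal sketch of this conversion, the helper below maps a model ID to its output directory. `model_output_dir` is a hypothetical name used only for illustration; it uses `tr` so it also runs under a plain POSIX shell.

```shell
# Sketch: derive the per-model output directory from a model ID.
# model_output_dir is a hypothetical helper name.
model_output_dir() {
  local model_id="$1"
  local dir_name
  # Replace every "/" in the model ID with "-".
  dir_name="$(printf '%s' "$model_id" | tr '/' '-')"
  printf '%s\n' "$HOME/vllm-compose/$dir_name/"
}

model_output_dir "openai/gpt-oss-20b"
```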

Per-model directory structure:
CODEBLOCK4

File requirements:

- deployment.log: capture ALL container logs during deployment
- test-results.json: save the API response from the functional test request
- DEPLOYMENT_REPORT.md: generated in Phase 7
- All three files MUST exist before marking deployment as complete

Execution Workflow

Phase 0: Environment Check & Auto-Repair

Step 0.1: Load Environment Variables

CODEBLOCK5

If HF_HOME is not defined in ~/.bash_profile, it defaults to /root/.cache/huggingface/hub.

Step 0.2: Create Output Directory

- Create: `$HOME/vllm-compose/<model-id>/`

Step 0.3: Initialize Logging

- All output → `$HOME/vllm-compose/<model-id>/deployment.log`

Step 0.4: System Checks

- Detect OS and package manager
- Check Python, pip, huggingface_hub
- Check Docker, docker compose
- Check ROCm tools (rocm-smi/amd-smi)
- Check GPU access (/dev/kfd, /dev/dri)
- Check disk space (20GB minimum)
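
The checks in this step could look roughly like the sketch below. It only reports what is present and does not attempt any repair; `have_cmd`, `free_gib`, `report`, and `run_system_checks` are hypothetical names, and the disk check assumes the model cache lives under `HF_HOME` (falling back to `$HOME`).

```shell
# Sketch of the Step 0.4 checks; reports status only, no auto-repair.
# All function names here are hypothetical.

# True if the command is on PATH.
have_cmd() { command -v "$1" >/dev/null 2>&1; }

# Free space, in whole GiB, on the filesystem that holds the given path.
free_gib() {
  df -Pk "$1" | awk 'NR==2 { print int($4 / 1024 / 1024) }'
}

# report <label> <command...>: run the command quietly and print OK or MISSING.
report() {
  local label="$1"
  shift
  if "$@" >/dev/null 2>&1; then
    echo "OK      $label"
  else
    echo "MISSING $label"
  fi
}

run_system_checks() {
  report "python3"             have_cmd python3
  report "pip"                 python3 -m pip --version
  report "huggingface_hub"     python3 -c "import huggingface_hub"
  report "docker"              have_cmd docker
  report "docker compose"      docker compose version
  report "rocm-smi or amd-smi" sh -c 'command -v rocm-smi || command -v amd-smi'
  report "/dev/kfd"            test -e /dev/kfd
  report "/dev/dri"            test -e /dev/dri
  report "disk >= 20GB"        test "$(free_gib "${HF_HOME:-$HOME}")" -ge 20
}

run_system_checks
```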

Phase 1: Model Download

Use HF_HOME from Phase 0 (environment variable or default):

CODEBLOCK6

Authentication Handling:

| Scenario | Behavior |
|---|---|
| Public model + no token | ✅ Download succeeds |
| Public model + token provided | |

On Authentication Failure:
CODEBLOCK7

- Locate model path in HF cache: `$HF_HOME/hub/models--<org>--<model>`
- Log download progress to deployment.log

Phase 2: Model Parameter Detection

- Read config.json from model
- Auto-detect: max_model_len, hidden_size, num_attention_heads, num_hidden_layers, vocab_size, dtype
- Validate TP size divides attention heads
- Estimate VRAM requirement
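
The two validation steps can be approximated as below. This is a rough sketch under stated assumptions, not the skill's own estimator: the function names are hypothetical, the parameter count is passed in by the caller (config.json does not normally list it), and the estimate covers model weights only, so KV cache, activations, and runtime overhead still come on top. Typical bytes per parameter are 2 for float16/bfloat16 and 4 for float32.

```shell
# Sketch: tensor-parallel validation and a weights-only VRAM estimate.
# Function names are hypothetical.

# tp_divides_heads <num_attention_heads> <tensor_parallel_size>
tp_divides_heads() {
  [ "$2" -gt 0 ] && [ $(( $1 % $2 )) -eq 0 ]
}

# estimate_weight_gib <parameter_count> <bytes_per_parameter>
# Prints the weight footprint in GiB with one decimal place.
estimate_weight_gib() {
  LC_ALL=C awk -v p="$1" -v b="$2" 'BEGIN { printf "%.1f\n", p * b / (1024 ^ 3) }'
}

# fits_in_budget <needed_gib> <vram_per_gpu_gib> <tensor_parallel_size> <gpu_memory_utilization>
# Succeeds when the need is within the memory vLLM is allowed to use across all GPUs.
fits_in_budget() {
  LC_ALL=C awk -v n="$1" -v v="$2" -v tp="$3" -v u="$4" \
    'BEGIN { exit !(n <= v * tp * u) }'
}
```

For example, `tp_divides_heads 32 4` succeeds while `tp_divides_heads 32 3` fails, and `fits_in_budget 40 64 1 0.85` succeeds because 40 GiB is below 64 × 0.85 = 54.4 GiB.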

Phase 3: Docker Compose Configuration

Generate files in output directory:

- docker-compose.yml → `$HOME/vllm-compose/<model-id>/docker-compose.yml`
  - Mount HF_HOME as volume (read-only for models)
  - NO hardcoded tokens in compose file
- .env → $HOME/vllm-compose/<model-id>/.env (optional)
  - Contains: HF_TOKEN=<value>
  - Permissions: chmod 600
  - Only created if user explicitly requests persistent token storage
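
For the optional `.env` file, one way to make sure the token is never readable by other users is to create the file with owner-only permissions before writing the secret into it. The sketch below assumes a hypothetical helper named `write_env_file`; it is not the skill's own code.

```shell
# Sketch: write HF_TOKEN to <output_dir>/.env with owner-only permissions.
# write_env_file is a hypothetical helper name.
# Usage: write_env_file <output_dir> <token>
write_env_file() {
  local dir="$1"
  local token="$2"
  local env_file="$dir/.env"

  # Create (or truncate) the file under a restrictive umask first.
  ( umask 077; : > "$env_file" ) || return 1
  # Enforce 600 even if the file already existed with looser permissions.
  chmod 600 "$env_file" || return 1
  printf 'HF_TOKEN=%s\n' "$token" >> "$env_file"
}
```

Call it only when persistent token storage was explicitly requested, for example `write_env_file "$HOME/vllm-compose/<model-id>" "$HF_TOKEN"`.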

Volume mount example:
CODEBLOCK8

Important: Docker Compose reads ${HF_HOME} from the host environment at runtime. Before running docker compose, run `source ~/.bash_profile` so the variable is set in the current shell.

Phase 4: Container Launch

Important: Before deploying, pull the latest image to ensure updates:
CODEBLOCK9

Note: Default port is 9999. Before running docker compose, check if port is available: ss -tlnp | grep :<port>. If port is in use, specify a different port in docker-compose.yml.

- Pass HF_TOKEN at runtime: HF_TOKEN=$HF_TOKEN docker compose up -d
- Wait for container initialization

Phase 5: Health Verification

- Check container status
- Test /health endpoint
- Test /v1/models endpoint
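
Because model loading can take a while, the health check is usually a polling loop rather than a single request. The following is a sketch under assumptions: `wait_for_health` is a hypothetical name, and the default timeout of 300 seconds and the 5-second interval are illustrative values, not settings taken from the skill.

```shell
# Sketch: poll the /health endpoint until it answers or a timeout expires.
# wait_for_health and its defaults are hypothetical.
# Usage: wait_for_health <base_url> [timeout_seconds] [interval_seconds]
wait_for_health() {
  local base_url="$1"
  local timeout="${2:-300}"
  local interval="${3:-5}"
  local waited=0

  while [ "$waited" -lt "$timeout" ]; do
    # -f: fail on HTTP errors, -s: no progress output.
    if curl -fs --max-time 5 "$base_url/health" >/dev/null 2>&1; then
      return 0
    fi
    sleep "$interval"
    waited=$(( waited + interval ))
  done
  return 1
}

# Example (not executed here):
#   wait_for_health "http://localhost:9999" && curl -fs "http://localhost:9999/v1/models"
```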

Phase 6: Functional Testing

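- Run a completion test via the /v1/chat/completions API
- Save the response to `$HOME/vllm-compose/<model-id>/test-results.json`
- Verify the response contains a valid completion
- Log deployment completion: append to `$HOME/vllm-compose/<model-id>/deployment.log`
- This phase is complete only when both files exist (DEPLOYMENT_REPORT.md follows in Phase 7):
  - deployment.log
  - test-results.json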
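
A functional test along these lines might be sketched as follows. The helper names, the prompt, and the `max_tokens` value are assumptions made for illustration; the validity check is deliberately crude (it only looks for the `choices` and `content` keys), and a JSON-aware tool would be stricter.

```shell
# Sketch: send one chat completion request and save the raw response.
# run_chat_test and has_completion are hypothetical helper names.
# Usage: run_chat_test <base_url> <model_id> <output_file>
run_chat_test() {
  local base_url="$1"
  local model_id="$2"
  local out_file="$3"

  curl -fs --max-time 120 "$base_url/v1/chat/completions" \
    -H "Content-Type: application/json" \
    -d "{\"model\": \"$model_id\", \"messages\": [{\"role\": \"user\", \"content\": \"Say hello.\"}], \"max_tokens\": 32}" \
    -o "$out_file"
}

# has_completion <file>: succeeds if the saved response looks like a completion.
has_completion() {
  grep -q '"choices"' "$1" && grep -q '"content"' "$1"
}

# Example (not executed here):
#   run_chat_test "http://localhost:9999" "<model-id>" "$HOME/vllm-compose/<model-id>/test-results.json" \
#     && has_completion "$HOME/vllm-compose/<model-id>/test-results.json"
```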

Phase 7: Deployment Report

Generate human-readable deployment report using the helper script.

Step 7.1: Extract Deployment Metrics

CODEBLOCK10

Step 7.2: Generate Report

CODEBLOCK11

Output: `$HOME/vllm-compose/<model-id>/DEPLOYMENT_REPORT.md`

Report Contents:

- Output structure verification (file checklist)
- Deployment summary table (health, test, metrics)
- Test results (request/response preview)
- Environment configuration
- Quick commands for operations

Completion Criteria:

- DEPLOYMENT_REPORT.md exists in output directory
- Report contains all required sections
- All file checks show ✅

Security Best Practices

1. Never commit tokens to version control: add .env to `.gitignore`
2. Use .env files with chmod 600: restrict access to the owner only
3. Mask tokens in logs: show only the first 10 characters
4. Pass tokens at runtime: `HF_TOKEN=$HF_TOKEN docker compose up -d`
5. Store tokens in ~/.bash_profile: for production environments, set HF_TOKEN in the user's shell config
6. Set a token for gated models: HF_TOKEN is validated at download time; set it in ~/.bash_profile for production
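
Token masking as described in item 3 can be done with a single `printf`. The sketch below uses a hypothetical helper name, `mask_token`, and the trailing `...` and the `(not set)` text are illustrative choices rather than the skill's exact log format.

```shell
# Sketch: print only the first 10 characters of a token for logging.
# mask_token is a hypothetical helper name.
mask_token() {
  local token="${1:-}"
  if [ -z "$token" ]; then
    printf '(not set)\n'
  else
    # %.10s prints at most the first 10 characters of the argument.
    printf '%.10s...\n' "$token"
  fi
}

# Example: echo "Using HF_TOKEN=$(mask_token "$HF_TOKEN")"
```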

Troubleshooting

Environment Variables

| Issue | Solution |
|---|---|
| INLINECODE54 | Add `export HF_TOKEN="hf_xxx"` to `~/.bash_profile`, then `source ~/.bash_profile`. Or provide via parameter. |
| INLINECODE58 | Defaults to /root/.cache/huggingface/hub. For production, add export HF_HOME="/path" to ~/.bash_profile. |
| ~/.bash_profile not found | Create ~/.bash_profile and add environment variables. |
| Changes not taking effect | Run source ~/.bash_profile or restart terminal. |
| HF_TOKEN provided but download still fails | Token may be invalid or lack access to the model. Verify token at https://huggingface.co/settings/tokens |

Model Download

| Issue | Solution |
|---|---|
| INLINECODE67 (gated model) | Set `HF_TOKEN` in `~/.bash_profile` or provide via parameter. Ensure token has access to the model. |
| INLINECODE70 | Verify model ID is correct (case-sensitive). Check model exists on HuggingFace. |
| Download timeout | Check network connection. Large models may take time. |

Deployment

| Issue | Solution |
|---|---|
| hf CLI not found | INLINECODE72 |
| Docker Compose fails | |

Cleanup

CODEBLOCK12

Status Check

Check deployment status and logs:

CODEBLOCK13

Quick Start (Production)

Step 1: Add environment variables to ~/.bash_profile

CODEBLOCK14

Step 2: Verify environment is ready

CODEBLOCK15

Step 3: Run deployment

CODEBLOCK16

Version History

| Version | Changes |
|---|---|
| 1.0.0 | Initial release |

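ROCm vLLM Deployment Skill

Production-ready automation for deploying vLLM inference services on AMD ROCm GPUs using Docker Compose.

Features

- Environment Auto-Check - Detects and repairs missing dependencies
- Model Parameter Detection - Auto-reads config.json for optimal settings
- VRAM Estimation - Calculates memory requirements before deployment
- Secure Token Handling - Never writes tokens to compose files
- Structured Output - All logs and test results saved per-model
- Deployment Reports - Human-readable summary for each deployment
- Health Verification - Automated health checks and functional tests
- Troubleshooting Guide - Common issues and solutions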

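Environment Prerequisites

Recommended (for production): Add to ~/.bash_profile:

```bash
# HuggingFace authentication token (required for gated models)
export HF_TOKEN="hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"

# Model cache directory (optional)
export HF_HOME="$HOME/models"

# Apply the changes
source ~/.bash_profile
```

Not required for testing: The skill will proceed without these set:

- HF_TOKEN: Optional. Public models work without it; gated models fail at download with a clear error.
- HF_HOME: Optional. Defaults to /root/.cache/huggingface/hub.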

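Environment Variable Detection

Priority Order:

1. Explicit parameter (highest): provided in the task/request (e.g., hf_token: "xxx")
2. Environment variable: already set in the shell or inherited from the parent process
3. ~/.bash_profile: sourced to load variables
4. Default value (lowest): HF_HOME defaults to /root/.cache/huggingface/hub

| Variable | Required | If Missing |
|---|---|---|
| HF_TOKEN | Conditional | Continue without token (public models work; gated models fail at download with a clear error) |
| HF_HOME | No | Warning + default: use /root/.cache/huggingface/hub |

Philosophy: Fail fast for configuration errors, fail at download time for authentication errors.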

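Helper Scripts

Location: <skill-directory>/scripts/

check-env.sh

Validate and load environment variables before deployment.

Usage:

```bash
# Basic check (HF_TOKEN optional, HF_HOME optional with a default)
./scripts/check-env.sh

# Strict mode (HF_HOME required, fails if not set)
./scripts/check-env.sh --strict

# Quiet mode (minimal output, suited to automation)
./scripts/check-env.sh --quiet

# Test with environment variables
HF_TOKEN="hf_xxx" HF_HOME="/models" ./scripts/check-env.sh
```

Exit Codes:

| Code | Meaning |
|---|---|
| 0 | Environment check completed (variables loaded or defaulted) |
| 2 | Critical error (e.g., cannot source ~/.bash_profile) |

Note: This script is optional. You can also directly run source ~/.bash_profile.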

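generate-report.sh

Generate human-readable deployment report after successful deployment.

Usage:

```bash
./scripts/generate-report.sh <model-id> <container-name> <port> <status> [model-load-time] [memory-used]

# Example:
./scripts/generate-report.sh \
  "Qwen-Qwen3-0.6B" \
  "vllm-qwen3-0-6b" \
  "8001" \
  "✅ Success" \
  "3.6" \
  "1.2"
```

Parameters:

| Parameter | Required | Description |
|---|---|---|
| model-id | Yes | Model ID (with / replaced by -) |
| container-name | Yes | Docker container name |
| port | Yes | Host port of the API endpoint |
| status | Yes | Deployment status (e.g., ✅ Success) |
| model-load-time | No | Model load time (seconds) |
| memory-used | No | Memory consumption (GiB) |

Output: $HOME/vllm-compose/<model-id>/DEPLOYMENT_REPORT.md

Exit Codes:

| Code | Meaning |
|---|---|
| 0 | Report generated successfully |
| 1 | Missing required parameters |
| 2 | Output directory not found |

Integration: This script is automatically called in Phase 7 of the deployment workflow.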

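Input Schema

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| model_id | String | Yes | - | HuggingFace model ID |
| docker_image | String | No | rocm/vllm-dev:nightly | vLLM Docker image |
| tensor_parallel_size | Integer | No | 1 | Number of GPUs |
| port | Integer | No | 9999 | API server port |
| hf_home | String | No | ${HF_HOME} or /root/.cache/huggingface/hub | Model cache directory |
| hf_token | Secret | Conditional | ${HF_TOKEN} | HuggingFace token (optional for public models, required for gated models) |
| max_model_len | Integer | No | Auto-detect | Maximum sequence length |
| gpu_memory_utilization | Float | No | 0.85 | GPU memory utilization |
| auto_install | Boolean | No | true | Auto-install dependencies |
| log_level | String | No | INFO | Logging verbosity |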

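Output Structure

All deployment artifacts MUST be saved to:

```
$HOME/vllm-compose/<model-id-with-slashes-replaced-by-dashes>/
```

Convert model ID to directory name by replacing / with -:

- openai/gpt-oss-20b → $HOME/vllm-compose/openai-gpt-oss-20b/
- Qwen/Qwen3-Coder-Next-FP8 → $HOME/vllm-compose/Qwen-Qwen3-Coder-Next-FP8/

Per-model directory structure:

```
$HOME/vllm-compose/<model-id>/
├── deployment.log          # Full deployment log (stdout + stderr)
├── test-results.json       # Functional test results (JSON)
├── docker-compose.yml      # Generated Docker Compose file
├── .env                    # HF_TOKEN environment variable (chmod 600, optional)
└── DEPLOYMENT_REPORT.md    # Human-readable deployment summary
```

File requirements:

- deployment.log: capture ALL container logs during deployment
- test-results.json: save the API response from the functional test request
- DEPLOYMENT_REPORT.md: generated in Phase 7
- All three files MUST exist before marking deployment as complete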

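Execution Workflow

Phase 0: Environment Check & Auto-Repair

Step 0.1: Load Environment Variables

```bash
# Source ~/.bash_profile to load HF_HOME and HF_TOKEN
source ~/.bash_profile

# If HF_HOME is not defined, it defaults to /root/.cache/huggingface/hub
```

If HF_HOME is not defined in ~/.bash_profile, it defaults to /root/.cache/huggingface/hub.

Step 0.2: Create Output Directory

- Create: $HOME/vllm-compose/<model-id>/

Step 0.3: Initialize Logging

- All output → $HOME/vllm-compose/<model-id>/deployment.log

Step 0.4: System Checks

- Detect OS and package manager
- Check Python, pip, huggingface_hub
- Check Docker, docker compose
- Check ROCm tools (rocm-smi/amd-smi)
- Check GPU access (/dev/kfd, /dev/dri)
- Check disk space (20GB minimum)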

Phase 1: Model Download

Use HF_HOME from Phase 0 (environment variable or default):

```bash
# Download the model into HF_HOME
huggingface-cli download <model-id> --local-dir "$HF_HOME/hub/models--<org>--<model>"

# Or use snapshot_download via Python:
python -c "from huggingface_hub import snapshot_download; snapshot_download(repo_id='<model-id>', cache_dir='$HF_HOME')"
```
