NCCL Optimizer
Finds the best NCCL communication configuration for distributed training with clear
separation of intra-node and inter-node bandwidth metrics.
What it does
1. GPU topology: `nvidia-smi topo -m` to detect NVLink vs PCIe.
2. RDMA check: `ibv_devinfo` `PORT_ACTIVE` state for InfiniBand/RoCE.
   - ✅ RDMA → emit recommended `NCCL_IB_*` env-vars.
   - ❌ No RDMA → socket benchmark sweep.
3. Intra-node all-reduce: sweeps `NCCL_SOCKET_IFNAME` × `NCCL_NET_GDR_LEVEL` × `NCCL_IB_TIMEOUT`, runs `all_reduce_perf -g <N>`, picks best bus bandwidth.
4. Intra-node P2P: `p2p_bw` for GPU↔GPU pair bandwidth (if available).
5. Inter-node benchmark: if `nodes=` is passed, runs MPI `all_reduce_perf` across nodes; otherwise emits a ready-to-run command.
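The two detection steps can be sketched in a few lines of bash. This is a minimal illustration, not the skill's actual implementation: the function names `classify_topology`, `has_active_rdma_port` and `choose_transport` are hypothetical, and each function reads tool output from stdin so it can be tried without GPUs or an RDMA NIC.

```bash
# Sketch only: function names are illustrative assumptions, not the skill's code.

# Classify the GPU interconnect from `nvidia-smi topo -m` output on stdin.
# Prints "nvlink" if any GPU row contains an NV<n> cell, otherwise "pcie".
classify_topology() {
  awk '/^GPU[0-9]+/ {
         for (i = 2; i <= NF; i++) if ($i ~ /^NV[0-9]+$/) found = 1
       }
       END { print (found ? "nvlink" : "pcie") }'
}

# Succeeds if `ibv_devinfo` output on stdin shows at least one active port.
has_active_rdma_port() {
  grep -q 'PORT_ACTIVE'
}

# Decide which path to take: "rdma" (emit NCCL_IB_* hints) or "socket" (sweep).
choose_transport() {
  if has_active_rdma_port; then
    echo rdma
  else
    echo socket
  fi
}

# On a real machine the functions would be fed like this:
#   nvidia-smi topo -m | classify_topology
#   ibv_devinfo 2>/dev/null | choose_transport
```

Reading from stdin keeps the parsing separate from the tools themselves, which makes it easy to check the logic against saved output from another machine.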
Prerequisites
| Tool | Purpose | Install |
|---|---|---|
| `nvidia-smi` | GPU info + topology | NVIDIA driver |
| `ibv_devinfo` | RDMA detection | `apt install ibverbs-utils` |
| `all_reduce_perf` | Collective benchmark | See below |
| `p2p_bw` | Peer-to-peer benchmark | Same nccl-tests build |
| `mpirun` | Inter-node benchmark | `apt install openmpi-bin` |
Build nccl-tests
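```bash
git clone https://github.com/NVIDIA/nccl-tests.git
cd nccl-tests
# Set the gencode to match your GPU: V100 (sm_70), A100 (sm_80), A800 (sm_80), H100 (sm_90).
# The example below targets sm_80.
make -j$(nproc) CUDA_HOME=/usr/local/cuda \
     NVCC_GENCODE="-gencode=arch=compute_80,code=sm_80"
export PATH=$PWD/build:$PATH
```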
Usage
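```bash
# Intra-node only
openclaw skill run nccl_optimizer
# With inter-node benchmark (requires passwordless SSH + MPI)
openclaw skill run nccl_optimizer nodes=10.0.0.1,10.0.0.2
```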
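When `nodes=` is not passed, the skill emits a ready-to-run inter-node command instead of running it. The exact command it prints is not shown here, so the following is only a sketch of how such a command could be assembled from a node list. The helper name `build_mpirun_cmd`, the one-rank-per-GPU layout and the message-size range are assumptions; `-np`, `-H` and `-x` are standard Open MPI options, and `-b`, `-e`, `-f`, `-g` are standard nccl-tests options.

```bash
# Sketch only: build_mpirun_cmd is a hypothetical helper.
# Usage: build_mpirun_cmd <nodes_csv> <gpus_per_node>
build_mpirun_cmd() {
  local nodes_csv=$1 gpus_per_node=$2
  local -a nodes
  # Split "10.0.0.1,10.0.0.2" into an array on commas.
  IFS=, read -r -a nodes <<< "$nodes_csv"

  # Build the Open MPI host list as host:slots pairs.
  local hosts="" node
  for node in "${nodes[@]}"; do
    hosts+="${hosts:+,}${node}:${gpus_per_node}"
  done

  # One MPI rank per GPU, each rank driving a single GPU (-g 1).
  local np=$(( ${#nodes[@]} * gpus_per_node ))
  echo "mpirun -np ${np} -H ${hosts} -x NCCL_DEBUG=WARN all_reduce_perf -b 8 -e 128M -f 2 -g 1"
}

# Example: build_mpirun_cmd 10.0.0.1,10.0.0.2 8
```

Multi-node runs of this kind need an nccl-tests binary built with MPI support (`make MPI=1`), which the single-node build may not include.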
Metrics explained
| Metric | What it measures |
|---|---|
| All-reduce bus BW (intra) | Collective throughput across local GPUs; relevant for single-node training |
| P2P bandwidth | GPU↔GPU direct copy speed (NVLink ≫ PCIe) |
| All-reduce bus BW (inter) | Collective throughput across nodes; bottleneck for multi-node training |
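The intra-node all-reduce figure comes from sweeping environment variables and keeping the configuration with the highest bus bandwidth. A rough sketch of that loop follows, assuming the summary line `# Avg bus bandwidth` that `all_reduce_perf` prints at the end of a run. The helper names and the specific values swept are illustrative assumptions, not the skill's real search space.

```bash
# Sketch only: helper names and the swept values are illustrative assumptions.

# Pull the "Avg bus bandwidth" figure (GB/s) out of all_reduce_perf output on stdin.
avg_busbw() {
  awk -F: '/Avg bus bandwidth/ { gsub(/[[:space:]]/, "", $2); print $2 }'
}

# From lines of "<label> <busbw>", print the label with the highest bandwidth.
pick_best() {
  awk 'NR == 1 || $2 + 0 > best { best = $2 + 0; label = $1 } END { print label }'
}

# Try each combination and report the best one. Needs all_reduce_perf on PATH.
# Usage: sweep_intra_node <num_gpus> <ifname>...
sweep_intra_node() {
  local ngpus=$1; shift
  local ifname gdr timeout bw
  for ifname in "$@"; do
    for gdr in LOC PHB SYS; do
      for timeout in 18 22; do
        bw=$(NCCL_SOCKET_IFNAME=$ifname NCCL_NET_GDR_LEVEL=$gdr NCCL_IB_TIMEOUT=$timeout \
               all_reduce_perf -b 8 -e 128M -f 2 -g "$ngpus" | avg_busbw)
        # Failed runs produce no summary line; record them as 0 GB/s.
        echo "${ifname}/${gdr}/${timeout} ${bw:-0}"
      done
    done
  done | pick_best
}

# Example: sweep_intra_node 8 eth0 ib0
```

Keeping the parsing (`avg_busbw`) and the selection (`pick_best`) as separate filters means the same two helpers can be reused on inter-node results.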
Notes
- Bus bandwidth normalises for GPU count. For all-reduce, nccl-tests reports it as `2(N-1)/N × data / time`. Compare at the same N.
- Multi-node training is almost always bottlenecked by inter-node bandwidth, not intra-node.
- RDMA (InfiniBand/RoCE) typically gives 10-100× better inter-node bandwidth than TCP.
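As a worked example of the bus bandwidth normalisation, the sketch below follows the nccl-tests convention for all-reduce, where algorithm bandwidth is `data / time` and bus bandwidth multiplies it by `2(N-1)/N`. The function name `allreduce_busbw` is hypothetical, and the result is in GB/s with 1 GB taken as 10^9 bytes.

```bash
# Sketch only: allreduce_busbw is a hypothetical helper.
# Usage: allreduce_busbw <num_gpus> <bytes> <seconds>
#   algbw = bytes / seconds
#   busbw = algbw * 2 * (N - 1) / N      (all-reduce convention in nccl-tests)
allreduce_busbw() {
  awk -v n="$1" -v bytes="$2" -v secs="$3" \
    'BEGIN { printf "%.2f\n", (bytes / secs / 1e9) * 2 * (n - 1) / n }'
}

# 1 GB reduced in 10 ms is 100 GB/s of algorithm bandwidth:
#   allreduce_busbw 8 1000000000 0.01   # 8 GPUs
#   allreduce_busbw 2 1000000000 0.01   # 2 GPUs
```

The same algorithm bandwidth gives 175.00 GB/s at N=8 and 100.00 GB/s at N=2, which is why results should only be compared at the same GPU count.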