Hi,
I am building a program that uses NCCL. I have installed hpc-sdk-23.9, which includes CUDA 12.2, HPCX 2.16, and NCCL 2.18.3. I compile the program with the following commands:
module load nvhpc-hpcx-cuda12/23.9 # provided by the module files in the SDK
mpif90 -cuda -gpu=sm_80,cuda12.2 -cudalib=nccl myprogram.f90 -o myexe
The compute node has 4 x A100-SXM4-40G GPUs, with driver version 535.154.05 and CUDA runtime version 12.2. All GPUs are connected with NVLink, and running nvidia-smi topo -m shows the expected NVLink connections:
GPU0 GPU1 GPU2 GPU3 NIC0 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X NV4 NV4 NV4 SYS 0-3 0 N/A
GPU1 NV4 X NV4 NV4 SYS 0-3 0 N/A
GPU2 NV4 NV4 X NV4 SYS 1 N/A
GPU3 NV4 NV4 NV4 X SYS 1 N/A
NIC0 SYS SYS SYS SYS X
However, when I run the executable with mpirun -np 4 ./myexe and set the NCCL debug option NCCL_DEBUG=INFO, the output suggests that the program does not detect the NVLinks between the GPUs:
[0] NCCL INFO NVLS multicast support is not available on dev 0
[2] NCCL INFO NVLS multicast support is not available on dev 2
[3] NCCL INFO NVLS multicast support is not available on dev 3
[1] NCCL INFO NVLS multicast support is not available on dev 1
[0] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
[0] NCCL INFO 24 coll channels, 0 nvls channels, 32 p2p channels, 8 p2p channels per peer
[3] NCCL INFO Connected all trees
[3] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
[3] NCCL INFO 24 coll channels, 0 nvls channels, 32 p2p channels, 8 p2p channels per peer
[1] NCCL INFO Connected all trees
[1] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
[1] NCCL INFO 24 coll channels, 0 nvls channels, 32 p2p channels, 8 p2p channels per peer
[2] NCCL INFO Connected all trees
[2] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
[2] NCCL INFO 24 coll channels, 0 nvls channels, 32 p2p channels, 8 p2p channels per peer
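For reference, the channel summary lines in the log above can be tallied per process with a small script. This is only a sketch: the regular expression is written against the exact line format quoted above and is not part of any NCCL API, the names LOG, PATTERN, and summarize_channels are hypothetical, and the bracketed prefix is assumed to be the rank index.

```python
import re

# Channel summary lines copied from the NCCL_DEBUG=INFO output above.
LOG = """\
[0] NCCL INFO 24 coll channels, 0 nvls channels, 32 p2p channels, 8 p2p channels per peer
[3] NCCL INFO 24 coll channels, 0 nvls channels, 32 p2p channels, 8 p2p channels per peer
[1] NCCL INFO 24 coll channels, 0 nvls channels, 32 p2p channels, 8 p2p channels per peer
[2] NCCL INFO 24 coll channels, 0 nvls channels, 32 p2p channels, 8 p2p channels per peer
"""

# Matches only the summary line format seen in this log (an assumption,
# other NCCL versions may print it differently).
PATTERN = re.compile(
    r"^\[(?P<rank>\d+)\] NCCL INFO (?P<coll>\d+) coll channels, "
    r"(?P<nvls>\d+) nvls channels, (?P<p2p>\d+) p2p channels, "
    r"(?P<per_peer>\d+) p2p channels per peer$"
)


def summarize_channels(log_text):
    """Return {rank: {"coll": n, "nvls": n, "p2p": n, "per_peer": n}}."""
    summary = {}
    for line in log_text.splitlines():
        match = PATTERN.match(line.strip())
        if match:
            fields = {key: int(value) for key, value in match.groupdict().items()}
            rank = fields.pop("rank")  # bracketed prefix, assumed to be the rank
            summary[rank] = fields
    return summary


if __name__ == "__main__":
    for rank, counts in sorted(summarize_channels(LOG).items()):
        print(rank, counts)
```

Run against the lines above, every process reports 0 nvls channels and 32 p2p channels.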
This is very confusing, since all the library versions are compatible. I have been stuck on this issue for about half a year.