Program compiled with HPCX failed to use NVLink in NCCL function

Hi,

I am building a program that uses NCCL. I have installed hpc-sdk-23.9, which includes CUDA 12.2, HPCX 2.16, and NCCL 2.18.3. I compile the program with the following commands:

module load nvhpc-hpcx-cuda12/23.9    # provided by the module files in the SDK
mpif90 -cuda -gpu=sm_80,cuda12.2 -cudalib=nccl myprogram.f90 -o myexe

The compute node has 4 x A100-SXM4-40G GPUs, with driver version 535.154.05 and CUDA runtime version 12.2. All GPUs are connected with NVLink, and nvidia-smi topo -m shows the expected NVLink connections:

        GPU0    GPU1    GPU2    GPU3    NIC0    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      NV4     NV4     NV4     SYS     0-3     0               N/A
GPU1    NV4      X      NV4     NV4     SYS     0-3     0               N/A
GPU2    NV4     NV4      X      NV4     SYS             1               N/A
GPU3    NV4     NV4     NV4      X      SYS             1               N/A
NIC0    SYS     SYS     SYS     SYS      X
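As a quick sanity check on a matrix like the one above, the following is a minimal sketch that counts the NV# entries in each GPU row. The function name count_nvlink_peers is hypothetical, and it assumes the matrix layout shown here (a header row, then one row per device with an X on the diagonal).

```shell
# Count NVLink (NV#) entries per GPU row of `nvidia-smi topo -m` output.
# Reads the matrix on stdin and prints "<gpu> <number of NV# peers>".
# count_nvlink_peers is a hypothetical helper name, not an NVIDIA tool.
count_nvlink_peers() {
    awk '$1 ~ /^GPU[0-9]+$/ {
        n = 0; is_row = 0
        for (i = 2; i <= NF; i++) {
            if ($i == "X") is_row = 1            # only data rows have the diagonal X
            else if ($i ~ /^NV[0-9]+$/) n++      # NV4, NV12, ... mark NVLink peers
        }
        if (is_row) print $1, n                  # the header row has no X and is skipped
    }'
}
```

It could be used as nvidia-smi topo -m | count_nvlink_peers. On a fully connected 4-GPU node, each GPU row should report 3 NVLink peers.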

However, when I run the executable with mpirun -np 4 ./myexe and set the NCCL debug option NCCL_DEBUG=INFO, the output suggests that the program does not detect the NVLinks between the GPUs:

[0] NCCL INFO NVLS multicast support is not available on dev 0
[2] NCCL INFO NVLS multicast support is not available on dev 2
[3] NCCL INFO NVLS multicast support is not available on dev 3
[1] NCCL INFO NVLS multicast support is not available on dev 1
[0] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
[0] NCCL INFO 24 coll channels, 0 nvls channels, 32 p2p channels, 8 p2p channels per peer
[3] NCCL INFO Connected all trees
[3] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
[3] NCCL INFO 24 coll channels, 0 nvls channels, 32 p2p channels, 8 p2p channels per peer
[1] NCCL INFO Connected all trees
[1] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
[1] NCCL INFO 24 coll channels, 0 nvls channels, 32 p2p channels, 8 p2p channels per peer
[2] NCCL INFO Connected all trees
[2] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
[2] NCCL INFO 24 coll channels, 0 nvls channels, 32 p2p channels, 8 p2p channels per peer

This is very confusing, since all the library versions are compatible. This issue has been bothering me for about half a year.

Why do you think NVLink isn’t being used? Because of the “NVLS” message?

I’m not an expert in using NCCL or in reading this debug output. However, in searching around, I found this post on NCCL’s GitHub which indicates that NVLS isn’t supported on A100, only on H100 or later. NCCL should still be using NVLink, just with point-to-point transfers.
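One way to check which transport NCCL actually selected is to look at the connection lines in the debug log rather than the NVLS messages. The sketch below assumes the log was saved with NCCL_DEBUG=INFO and that the connection lines name the transport after the word "via" (for example "via P2P/IPC"). The function name summarize_nccl_transports and the log file name nccl.log are hypothetical.

```shell
# Summarize the transports named in a saved NCCL debug log.
# Assumption: connection lines contain "via <transport>", e.g. "via P2P/IPC".
# summarize_nccl_transports is a hypothetical helper name.
summarize_nccl_transports() {
    # $1: path to the saved log file
    grep -o 'via [A-Za-z0-9/]*' "$1" |
        awk '{count[$2]++} END {for (t in count) print t, count[t]}' |
        sort
}
```

For example, save the output with NCCL_DEBUG=INFO mpirun -np 4 ./myexe > nccl.log 2>&1 and then run summarize_nccl_transports nccl.log. Lines reporting P2P indicate direct GPU-to-GPU transfers. They do not by themselves distinguish NVLink from PCIe, so read them together with the nvidia-smi topo -m matrix.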