I work with a server with two AMD EPYC 7343 16-core processors and three A40 GPUs under Linux.
When I launch multiple simulations, i.e. one simulation per GPU, I face a performance collapse. For example, when running the same three simulations, the calculation times are both poor and unstable.
To rule out interference, I tested the same code without any disk writes; the performance is similarly problematic.
I would add that in my case there is no bottleneck in CPU or GPU memory usage.
The two-CPU system has two NUMA nodes with distinct memory banks, and I assume the PCIe slots are also connected to specific CPUs. Anything else has to travel through the interconnect between the CPUs, which may under some circumstances be a bottleneck.
One thing to look out for is to make host memory allocations on the same NUMA node that the respective GPU's PCIe slot is directly connected to.
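For illustration, here is a minimal sketch of that idea, assuming libnuma (`-lnuma`) is available and that the GPU's NUMA node is exposed in sysfs; names and the placeholder `main` are just an example, not your actual simulation code:

```cpp
// Sketch: bind the current process's threads and memory policy to the
// NUMA node hosting a given GPU's PCIe slot. Error handling abbreviated.
#include <cuda_runtime.h>
#include <numa.h>
#include <cctype>
#include <cstdio>
#include <cstdlib>
#include <fstream>
#include <string>

// Read /sys/bus/pci/devices/<busid>/numa_node to find the GPU's NUMA node.
static int numa_node_of_gpu(int device) {
    char busid[32] = {0};
    cudaDeviceGetPCIBusId(busid, sizeof(busid), device);
    std::string path = "/sys/bus/pci/devices/";
    for (char c : std::string(busid))
        path += static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
    path += "/numa_node";
    std::ifstream f(path);
    int node = -1;
    f >> node;                      // -1 means no NUMA info on this system
    return node;
}

int main(int argc, char** argv) {
    int device = (argc > 1) ? std::atoi(argv[1]) : 0;
    cudaSetDevice(device);

    int node = numa_node_of_gpu(device);
    std::printf("GPU %d is attached to NUMA node %d\n", device, node);

    if (numa_available() >= 0 && node >= 0) {
        numa_run_on_node(node);     // keep this process's threads on that socket
        numa_set_preferred(node);   // prefer that node for subsequent host allocations
    }

    // ... allocate host buffers and run the simulation from here ...
    return 0;
}
```

Alternatively, launching each simulation under `numactl` with the matching node achieves the same binding without code changes.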
Does your computation involve a lot of memory transfers to/from the individual GPUs?
In my case, there is nearly no GPU/CPU memory transfer; I only launch kernels in loops. Only once in a while is the GPU data transferred back for a backup. My code uses unified memory. Maybe that is the problem with the dual-CPU/NUMA setup?
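If unified memory does turn out to be the culprit, one thing to try (a sketch under the assumption that the kernels only touch data resident on their GPU, with a placeholder kernel and sizes) is advising the managed allocations to stay on their device and prefetching them once before the kernel loop, so pages are not migrated over the inter-socket interconnect:

```cpp
// Sketch: keep a cudaMallocManaged buffer resident on its GPU so page
// migration does not cross the CPU interconnect on every access.
#include <cuda_runtime.h>

__global__ void step(float* data, size_t n) {
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;     // stand-in for the real simulation kernel
}

int main() {
    const size_t N = 1 << 24;       // placeholder problem size
    int device = 0;
    cudaSetDevice(device);

    float* data = nullptr;
    cudaMallocManaged(&data, N * sizeof(float));

    // Tell the driver this memory should live on the GPU; the host only
    // reads it occasionally (for the backups mentioned above).
    cudaMemAdvise(data, N * sizeof(float), cudaMemAdviseSetPreferredLocation, device);
    cudaMemPrefetchAsync(data, N * sizeof(float), device);

    for (int iter = 0; iter < 1000; ++iter)
        step<<<(N + 255) / 256, 256>>>(data, N);
    cudaDeviceSynchronize();

    // Before the occasional backup, prefetch explicitly back to the host.
    cudaMemPrefetchAsync(data, N * sizeof(float), cudaCpuDeviceId);
    cudaDeviceSynchronize();

    cudaFree(data);
    return 0;
}
```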