I work with a server with two AMD EPYC 7343 16-core processors and three A40 GPUs under Linux.
When I launch multiple simulations, i.e. one simulation per GPU, I face a performance collapse. For example, when running the same three simulations, the calculation times are both poor and unstable.
To rule out interference, I tested the same code without any disk writes; the performance is similarly problematic.
I would add that in my case there is no bottleneck in CPU or GPU memory usage.
The two-CPU system has two NUMA nodes with distinct memory banks, and I assume the PCIe slots are also connected to specific CPUs. Anything else has to travel through the interconnect between the CPUs, which may under some circumstances be a bottleneck.
One thing to look out for is to make host memory allocations on the same NUMA node that the respective GPU's PCIe slot is directly connected to.
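For illustration, here is a minimal sketch of that idea, assuming libnuma (`-lnuma`) is available and that the GPU's NUMA node is exposed in sysfs; names and the placeholder `main` are just an example, not your actual simulation code:

```cpp
// Sketch: bind the current process's threads and memory policy to the
// NUMA node hosting a given GPU's PCIe slot. Error handling abbreviated.
#include <cuda_runtime.h>
#include <numa.h>
#include <cctype>
#include <cstdio>
#include <cstdlib>
#include <fstream>
#include <string>

// Read /sys/bus/pci/devices/<busid>/numa_node to find the GPU's NUMA node.
static int numa_node_of_gpu(int device) {
    char busid[32] = {0};
    cudaDeviceGetPCIBusId(busid, sizeof(busid), device);
    std::string path = "/sys/bus/pci/devices/";
    for (char c : std::string(busid))
        path += static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
    path += "/numa_node";
    std::ifstream f(path);
    int node = -1;
    f >> node;                      // -1 means no NUMA info on this system
    return node;
}

int main(int argc, char** argv) {
    int device = (argc > 1) ? std::atoi(argv[1]) : 0;
    cudaSetDevice(device);

    int node = numa_node_of_gpu(device);
    std::printf("GPU %d is attached to NUMA node %d\n", device, node);

    if (numa_available() >= 0 && node >= 0) {
        numa_run_on_node(node);     // keep this process's threads on that socket
        numa_set_preferred(node);   // prefer that node for subsequent host allocations
    }

    // ... allocate host buffers and run the simulation from here ...
    return 0;
}
```

Alternatively, launching each simulation under `numactl` with the matching node achieves the same binding without code changes.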
Does your computation involve a lot of memory transfers to/from the individual GPUs?
In my case, there is nearly no GPU/CPU memory transfer; I only launch kernels in loops. Only once in a while is the GPU data transferred back for a backup. My code uses unified memory. Maybe that is the problem with the dual-CPU/NUMA setup?
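If unified memory does turn out to be the culprit, one thing to try (a sketch under the assumption that the kernels only touch data resident on their GPU, with a placeholder kernel and sizes) is advising the managed allocations to stay on their device and prefetching them once before the kernel loop, so pages are not migrated over the inter-socket interconnect:

```cpp
// Sketch: keep a cudaMallocManaged buffer resident on its GPU so page
// migration does not cross the CPU interconnect on every access.
#include <cuda_runtime.h>

__global__ void step(float* data, size_t n) {
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;     // stand-in for the real simulation kernel
}

int main() {
    const size_t N = 1 << 24;       // placeholder problem size
    int device = 0;
    cudaSetDevice(device);

    float* data = nullptr;
    cudaMallocManaged(&data, N * sizeof(float));

    // Tell the driver this memory should live on the GPU; the host only
    // reads it occasionally (for the backups mentioned above).
    cudaMemAdvise(data, N * sizeof(float), cudaMemAdviseSetPreferredLocation, device);
    cudaMemPrefetchAsync(data, N * sizeof(float), device);

    for (int iter = 0; iter < 1000; ++iter)
        step<<<(N + 255) / 256, 256>>>(data, N);
    cudaDeviceSynchronize();

    // Before the occasional backup, prefetch explicitly back to the host.
    cudaMemPrefetchAsync(data, N * sizeof(float), cudaCpuDeviceId);
    cudaDeviceSynchronize();

    cudaFree(data);
    return 0;
}
```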