PCIe read/write performance of a Xilinx FPGA card on Jetson Orin AGX is slower than on x86_64

I ran an XDMA FPGA performance test program on a Jetson Orin AGX, and the throughput was only about half of what the same test achieves on an x86_64 PC.
The xdma module parameters are identical on both systems, and dmesg shows no errors.
Any thoughts on why the difference is so large?

The same program, with the same version of the xdma driver (2020.2), on the x86_64 PC (12 cores, 16 GB RAM):

FPGA PCIe LnkSta: Speed 5GT/s (ok), Width x8 (ok)
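For reference, the negotiated link state above can be read back with lspci; a sketch (the bus address 0005:01:00.0 is a placeholder, find the real one with plain `lspci`):

```shell
# Show the PCIe capabilities of the FPGA endpoint and compare the
# negotiated link (LnkSta) against what the device supports (LnkCap).
sudo lspci -vv -s 0005:01:00.0 | grep -E 'LnkCap|LnkSta'
```

If LnkSta reports a lower speed or narrower width than LnkCap, the link itself is the bottleneck; here both sides show 5 GT/s x8, so the link is not the problem.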

./st_speed -d /dev/xdma0_c2h_0 -r -n 0x80
pktnum: 128
recv speed: 2723.40 MB/s 8000000 Byte
recv speed: 2723.40 MB/s 10000000 Byte
recv speed: 2723.40 MB/s 18000000 Byte

./st_speed -d /dev/xdma0_h2c_0 -w -n 0x80
pktnum: 128
send speed: 2723.40 MB/s 8000000 Byte
send speed: 2782.61 MB/s 10000000 Byte
send speed: 2782.61 MB/s 18000000 Byte

But on the Orin (12 cores, 64 GB RAM), with the FPGA card in the Orin's C5 slot,
the speed is only about half of the x86_64 result.

./st_speed -d /dev/xdma0_c2h_0 -r -n 0x80
pktnum: 128
recv speed: 1376.34 MB/s 8000000 Byte
recv speed: 1391.30 MB/s 10000000 Byte
recv speed: 1376.34 MB/s 18000000 Byte
recv speed: 1391.30 MB/s 20000000 Byte
recv speed: 1422.22 MB/s 28000000 Byte

./st_speed -d /dev/xdma0_h2c_0 -w -n 0x80
pktnum: 128
send speed: 1219.05 MB/s 8000000 Byte
send speed: 1219.05 MB/s 10000000 Byte
send speed: 1219.05 MB/s 18000000 Byte
send speed: 1219.05 MB/s 20000000 Byte
send speed: 1230.77 MB/s 28000000 Byte
send speed: 1230.77 MB/s 30000000 Byte

Is the Orin's memory the bottleneck? The Jetson Orin AGX uses LPDDR5, which should in principle outperform the PC's DDR4.

Can anyone also explain why dd throughput on the Orin is only half of that on the x86_64 PC?

on Orin:
dd if=/dev/zero of=/dev/null bs=1M count=1024
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 0.0657541 s, 16.3 GB/s

on x86_64 PC:
dd if=/dev/zero of=/dev/null bs=1M count=1024
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 0.034232 s, 31.4 GB/s
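Worth noting: `dd` from /dev/zero to /dev/null is a single-threaded CPU copy benchmark, not a DRAM-bandwidth test, so it mostly reflects per-core performance. A sketch that makes the core dependency explicit by pinning dd to one CPU (core 2 is an arbitrary choice):

```shell
# Pin the copy to a single core; the result tracks that core's clock
# and memcpy throughput rather than aggregate memory bandwidth.
taskset -c 2 dd if=/dev/zero of=/dev/null bs=1M count=1024
```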

sysbench --test=memory run # on orin
Total operations: 40834833 (4082796.18 per second)
39877.77 MiB transferred (3987.11 MiB/sec)

sysbench --test=memory run # on same x86_64 PC
Total operations: 104857600 (10828779.36 per second)
102400.00 MiB transferred (10574.98 MiB/sec)
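A caveat on the sysbench numbers above: the default run is single-threaded with a 1 KiB block size, so it measures per-core small-block write speed rather than peak DRAM bandwidth. A sketch of a run closer to peak bandwidth (thread count and sizes are illustrative):

```shell
# Larger blocks and multiple threads approach aggregate memory
# bandwidth instead of single-core copy speed.
sysbench memory --memory-block-size=1M --memory-total-size=20G --threads=8 run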

Hi,
It looks like the deviation comes from CPU capability. You can run sudo tegrastats on the Orin and check whether some CPU cores are at maximum load. The latest production release is JetPack 5.1.3; if you are on an earlier release, please upgrade to the latest version and try again.

I flashed the latest R36 release, but performance actually got worse.

./st_speed -d /dev/xdma0_c2h_0 -r -n 0x80
pkt num: 128
recv speed: 934.31 MB/s 8000000 Byte
recv speed: 882.76 MB/s 10000000 Byte
recv speed: 859.06 MB/s 18000000 Byte
recv speed: 859.06 MB/s 20000000 Byte
recv speed: 859.06 MB/s 28000000 Byte

nvpmodel -q
NV Power Mode: MAXN
0


tegrastats
03-11-2024 17:28:58 RAM 1939/62841MB (lfb 4x4MB) SWAP 0/31421MB (cached 0MB) CPU [20%@729,6%@729,34%@729,17%@729,0%@729,0%@729,0%@729,0%@729,0%@729,1%@729,0%@729,2%@729] EMC_FREQ 3%@665 GR3D_FREQ 0%@[0,0] NVENC off NVDEC off NVJPG off NVJPG1 off VIC off OFA off NVDLA0 off NVDLA1 off PVA0_FREQ off APE 174 [email protected] [email protected] [email protected] [email protected] [email protected] [email protected] [email protected] VDD_GPU_SOC 2405mW/2405mW VDD_CPU_CV 400mW/400mW VIN_SYS_5V0 4550mW/4550mW VDDQ_VDD2_1V8AO 707mW/707mW
03-11-2024 17:28:59 RAM 1937/62841MB (lfb 4x4MB) SWAP 0/31421MB (cached 0MB) CPU [18%@729,10%@729,40%@729,3%@729,7%@729,0%@729,0%@729,1%@729,0%@729,0%@729,2%@729,0%@729] EMC_FREQ 3%@665 GR3D_FREQ 0%@[0,0] NVENC off NVDEC off NVJPG off NVJPG1 off VIC off OFA off NVDLA0 off NVDLA1 off PVA0_FREQ off APE 174 [email protected] [email protected] [email protected] [email protected] [email protected] [email protected] [email protected] VDD_GPU_SOC 2405mW/2405mW VDD_CPU_CV 400mW/400mW VIN_SYS_5V0 4651mW/4600mW VDDQ_VDD2_1V8AO 808mW/757mW
03-11-2024 17:29:00 RAM 1938/62841MB (lfb 4x4MB) SWAP 0/31421MB (cached 0MB) CPU [17%@729,0%@729,56%@729,0%@729,1%@729,0%@729,0%@729,0%@729,0%@1267,0%@1036,0%@1036,0%@1036] EMC_FREQ 1%@2133 GR3D_FREQ 0%@[0,0] NVENC off NVDEC off NVJPG off NVJPG1 off VIC off OFA off NVDLA0 off NVDLA1 off PVA0_FREQ off APE 174 [email protected] [email protected] [email protected] [email protected] [email protected] [email protected] [email protected] VDD_GPU_SOC 2806mW/2538mW VDD_CPU_CV 400mW/400mW VIN_SYS_5V0 4752mW/4651mW VDDQ_VDD2_1V8AO 808mW/774mW
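One thing that stands out in the capture above is that the CPU cores sit at 729 MHz during the test even though nvpmodel reports MAXN; nvpmodel only sets the cap, and DVFS can still scale clocks down under light load. A sketch of pinning clocks for benchmarking (standard Jetson tooling, not something suggested in the thread itself):

```shell
# Pin CPU/GPU/EMC clocks to their maximum under the current nvpmodel
# profile, then verify the resulting frequencies.
sudo jetson_clocks
sudo jetson_clocks --show
```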

I also compared with another Arm64 server, whose dd speed and sysbench results are roughly the same as the Orin's.
Yet the Arm64 server's XDMA throughput is twice that of the Orin.

dd if=/dev/zero of=/dev/null bs=1M count=1024
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 0.0967686 s, 11.1 GB/s

sysbench --test=memory run
WARNING: the --test option is deprecated. You can pass a script name or path on the command line without any options.
sysbench 1.0.20 (using system LuaJIT 2.1.0-beta3)

Running the test with following options:
Number of threads: 1
Initializing random number generator from current time


Running memory speed test with the following options:
  block size: 1KiB
  total size: 102400MiB
  operation: write
  scope: global

Initializing worker threads...

Threads started!

Total operations: 43201886 (4319620.61 per second)

42189.34 MiB transferred (4218.38 MiB/sec)


General statistics:
    total time:                          10.0001s
    total number of events:              43201886

Latency (ms):
         min:                                    0.00
         avg:                                    0.00
         max:                                    0.05
         95th percentile:                        0.00
         sum:                                 4351.84

Threads fairness:
    events (avg/stddev):           43201886.0000/0.00
    execution time (avg/stddev):   4.3518/0.00

./st_speed -d /dev/xdma0_c2h_0 -r -n 0x80
pkt num: 128
recv speed: 2206.90 MB/s 8000000 Byte
recv speed: 2206.90 MB/s 10000000 Byte
recv speed: 2206.90 MB/s 18000000 Byte

./st_speed -d /dev/xdma0_h2c_0 -w -n 0x80
pkt num: 128
send speed: 2031.75 MB/s 8000000 Byte
send speed: 2031.75 MB/s 10000000 Byte
send speed: 2031.75 MB/s 18000000 Byte

JetPack 5.1.3 is R35.5.0. Compared to R35.4.1, XDMA performance remains basically unchanged.

sudo ./st_speed -d /dev/xdma0_c2h_0 -r -n 0x80
pkt num: 128
recv speed: 1391.30 MB/s 8000000 Byte
recv speed: 1391.30 MB/s 10000000 Byte
recv speed: 1391.30 MB/s 18000000 Byte
recv speed: 1391.30 MB/s 20000000 Byte
recv speed: 1391.30 MB/s 28000000 Byte
recv speed: 1391.30 MB/s 30000000 Byte
recv speed: 1406.59 MB/s 38000000 Byte

sudo nvpmodel -q
NV Power Mode: MAXN
0

tegrastats
03-12-2024 15:25:51 RAM 1674/62780MB (lfb 15048x4MB) SWAP 0/31390MB (cached 0MB) CPU [13%@2201,54%@2201,2%@2201,30%@2201,0%@2201,0%@2201,0%@2201,0%@2201,0%@2201,0%@2201,0%@2201,0%@2201] EMC_FREQ 0%@3199 GR3D_FREQ 0%@[1300,1300] VIC_FREQ 729 APE 174 CV0@-256C [email protected] Tboard@38C [email protected] [email protected] [email protected] CV1@-256C [email protected] [email protected] [email protected] CV2@-256C VDD_GPU_SOC 4811mW/4811mW VDD_CPU_CV 2004mW/2004mW VIN_SYS_5V0 5359mW/5359mW VDDQ_VDD2_1V8AO 1011mW/1011mW
03-12-2024 15:25:52 RAM 1674/62780MB (lfb 15048x4MB) SWAP 0/31390MB (cached 0MB) CPU [10%@2201,0%@2201,30%@2201,59%@2201,0%@2201,0%@2201,0%@2201,0%@1984,0%@2201,0%@2201,0%@2201,0%@2201] EMC_FREQ 0%@3199 GR3D_FREQ 0%@[1300,1300] VIC_FREQ 729 APE 174 CV0@-256C [email protected] Tboard@38C [email protected] [email protected] [email protected] CV1@-256C [email protected] [email protected] [email protected] CV2@-256C VDD_GPU_SOC 4811mW/4811mW VDD_CPU_CV 2004mW/2004mW VIN_SYS_5V0 5359mW/5359mW VDDQ_VDD2_1V8AO 1011mW/1011mW
03-12-2024 15:25:53 RAM 1674/62780MB (lfb 15048x4MB) SWAP 0/31390MB (cached 0MB) CPU [13%@2201,0%@2201,35%@2201,49%@2201,0%@2201,0%@2201,0%@2201,0%@2201,0%@2201,0%@2201,0%@2201,0%@2201] EMC_FREQ 0%@3199 GR3D_FREQ 0%@[1300,1300] VIC_FREQ 729 APE 174 CV0@-256C [email protected] Tboard@38C [email protected] [email protected] [email protected] CV1@-256C [email protected] [email protected] [email protected] CV2@-256C VDD_GPU_SOC 4811mW/4811mW VDD_CPU_CV 2004mW/2004mW VIN_SYS_5V0 5359mW/5359mW VDDQ_VDD2_1V8AO 1011mW/1011mW
03-12-2024 15:25:54 RAM 1676/62780MB (lfb 15048x4MB) SWAP 0/31390MB (cached 0MB) CPU [14%@2201,87%@2201,0%@2201,0%@2201,0%@2201,0%@2201,0%@2201,0%@2201,0%@2201,0%@2201,0%@2406,0%@2201] EMC_FREQ 0%@3199 GR3D_FREQ 0%@[1300,1300] VIC_FREQ 729 APE 174 CV0@-256C [email protected] Tboard@38C [email protected] [email protected] [email protected] CV1@-256C [email protected] [email protected] [email protected] CV2@-256C VDD_GPU_SOC 4811mW/4811mW VDD_CPU_CV 2004mW/2004mW VIN_SYS_5V0 5359mW/5359mW VDDQ_VDD2_1V8AO 1011mW/1011mW
03-12-2024 15:25:55 RAM 1676/62780MB (lfb 15048x4MB) SWAP 0/31390MB (cached 0MB) CPU [14%@2201,75%@2201,10%@2201,0%@2201,0%@2201,0%@2201,0%@2201,0%@2201,0%@2201,0%@2201,0%@2201,0%@2201] EMC_FREQ 0%@3199 GR3D_FREQ 0%@[1300,1300] VIC_FREQ 729 APE 174 CV0@-256C [email protected] Tboard@38C [email protected] [email protected] [email protected] CV1@-256C [email protected] [email protected] [email protected] CV2@-256C VDD_GPU_SOC 4811mW/4811mW VDD_CPU_CV 2004mW/2004mW VIN_SYS_5V0 5359mW/5359mW VDDQ_VDD2_1V8AO 1011mW/1011mW
03-12-2024 15:25:56 RAM 1669/62780MB (lfb 15048x4MB) SWAP 0/31390MB (cached 0MB) CPU [22%@2026,0%@2201,77%@2201,0%@2201,0%@2201,0%@2201,0%@2201,0%@2201,0%@2201,0%@2201,0%@2201,0%@2201] EMC_FREQ 0%@3199 GR3D_FREQ 0%@[1300,1300] VIC_FREQ 729 APE 174 CV0@-256C [email protected] Tboard@38C [email protected] [email protected] [email protected] CV1@-256C [email protected] [email protected] [email protected] CV2@-256C VDD_GPU_SOC 4811mW/4811mW VDD_CPU_CV 2004mW/2004mW VIN_SYS_5V0 5359mW/5359mW VDDQ_VDD2_1V8AO 1011mW/1011mW
03-12-2024 15:25:57 RAM 1669/62780MB (lfb 15048x4MB) SWAP 0/31390MB (cached 0MB) CPU [12%@2201,0%@2201,87%@2201,0%@2201,0%@2201,0%@2201,0%@2201,0%@2201,0%@2201,0%@2201,0%@2201,0%@2356] EMC_FREQ 0%@3199 GR3D_FREQ 0%@[1300,1300] VIC_FREQ 729 APE 174 CV0@-256C [email protected] Tboard@38C [email protected] [email protected] [email protected] CV1@-256C [email protected] [email protected] [email protected] CV2@-256C VDD_GPU_SOC 4811mW/4811mW VDD_CPU_CV 2004mW/2004mW VIN_SYS_5V0 5359mW/5359mW VDDQ_VDD2_1V8AO 1011mW/1011mW
03-12-2024 15:25:58 RAM 1669/62780MB (lfb 15048x4MB) SWAP 0/31390MB (cached 0MB) CPU [15%@2201,0%@2201,86%@2201,0%@2201,0%@2201,0%@2201,0%@2201,0%@2201,0%@2201,0%@2201,0%@2201,0%@2201] EMC_FREQ 0%@3199 GR3D_FREQ 0%@[1300,1300] VIC_FREQ 729 APE 174 CV0@-256C [email protected] Tboard@38C [email protected] [email protected] [email protected] CV1@-256C [email protected] [email protected] [email protected] CV2@-256C VDD_GPU_SOC 4811mW/4811mW VDD_CPU_CV 2004mW/2004mW VIN_SYS_5V0 5460mW/5371mW VDDQ_VDD2_1V8AO 1011mW/1011mW
03-12-2024 15:25:59 RAM 1669/62780MB (lfb 15048x4MB) SWAP 0/31390MB (cached 0MB) CPU [15%@2201,0%@2201,27%@2201,59%@2201,0%@2201,0%@2201,0%@2201,0%@2201,0%@2201,0%@2201,0%@2201,0%@2201] EMC_FREQ 0%@3199 GR3D_FREQ 0%@[1300,1300] VIC_FREQ 729 APE 174 CV0@-256C [email protected] Tboard@38C [email protected] [email protected] [email protected] CV1@-256C [email protected] [email protected] [email protected] CV2@-256C VDD_GPU_SOC 4811mW/4811mW VDD_CPU_CV 2004mW/2004mW VIN_SYS_5V0 5359mW/5370mW VDDQ_VDD2_1V8AO 1011mW/1011mW
03-12-2024 15:26:00 RAM 1669/62780MB (lfb 15048x4MB) SWAP 0/31390MB (cached 0MB) CPU [15%@2201,40%@2201,0%@2201,45%@2201,0%@2201,0%@2201,0%@2201,0%@2201,0%@2201,0%@2201,0%@2201,0%@2201] EMC_FREQ 0%@3199 GR3D_FREQ 0%@[1300,1300] VIC_FREQ 729 APE 174 CV0@-256C [email protected] Tboard@38C [email protected] [email protected] SOC0@48C CV1@-256C [email protected] [email protected] [email protected] CV2@-256C VDD_GPU_SOC 4811mW/4811mW VDD_CPU_CV 2004mW/2004mW VIN_SYS_5V0 5359mW/5369mW VDDQ_VDD2_1V8AO 1011mW/1011mW

Hi,
Please try the settings in these posts and see if they help:
Poor DMA performance over PCIe from FPGA - #4 by WayneWWW
Poor DMA performance over PCIe from FPGA - #6 by WayneWWW

dma-coherent was already present in the device tree when the test was run.
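For anyone wanting to confirm the same thing, the live device tree can be inspected from /proc; a sketch (the node path containing "pcie" varies by platform):

```shell
# List dma-coherent properties in the running device tree and keep
# only those under PCIe controller nodes.
find /proc/device-tree -name dma-coherent | grep -i pcie
```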
