We have ported our DPDK-enabled FPGA data mover IP and application to the Jetson AGX Orin. We are seeing the expected data transfer rates from the host to the FPGA over PCIe; however, data rates from the FPGA to the host are about 1/8 of what we expected.
We are using the vfio-pci driver for the DPDK application, just as we do on AMD and x86 architectures. With the IOMMU enabled, the kernel's ACS-based grouping places our device, the bridge, and the IOMMU device in the same VFIO group. Since all devices in a VFIO group must either be bound to the vfio-pci driver or be unbound altogether, and we cannot unbind the IOMMU device from its driver, we had to turn the IOMMU off. When the IOMMU is turned off in the device tree, dma-coherent must also be turned off. We think this is likely causing the performance issue. We tried applying the ACS override patch to break up the VFIO groups, but it is not fully supported on ARM.
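To make the grouping issue concrete, a minimal sketch that lists every device in the endpoint's IOMMU group and the driver each one is bound to looks roughly like the following. It only uses standard Linux sysfs paths; the default BDF is a placeholder, not our endpoint's actual bus address.

```python
#!/usr/bin/env python3
"""List every PCI device sharing an IOMMU group with a given endpoint.

Minimal illustrative sketch using standard Linux sysfs paths; the default
BDF below is a placeholder -- substitute the real bus address of the FPGA EP.
"""
import os
import sys

def iommu_group_members(bdf):
    group_link = f"/sys/bus/pci/devices/{bdf}/iommu_group"
    if not os.path.exists(group_link):
        sys.exit(f"{bdf}: no iommu_group link (IOMMU disabled or device absent)")
    group = os.path.basename(os.path.realpath(group_link))      # group number
    members = sorted(os.listdir(os.path.join(group_link, "devices")))
    return group, members

if __name__ == "__main__":
    bdf = sys.argv[1] if len(sys.argv) > 1 else "0005:01:00.0"  # placeholder BDF
    group, members = iommu_group_members(bdf)
    print(f"IOMMU group {group}:")
    for m in members:
        drv = f"/sys/bus/pci/devices/{m}/driver"
        bound = os.path.basename(os.path.realpath(drv)) if os.path.exists(drv) else "(unbound)"
        # For vfio-pci, every member must be bound to vfio-pci or left unbound.
        print(f"  {m}  driver={bound}")
```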
Here are my questions.
Can anyone confirm that, with the IOMMU turned off (and dma-coherent turned off), PCIe transfer rates will suffer?
Has anyone attempted to run a DPDK application with a device in the PCIe slot using the vfio-pci driver?
Can anyone provide guidance on how to get around this problem?
Thank you for posting this, John. We're hoping there are others in the community who can answer the questions - or, in the "misery loves company" department, confirm they see the same speed bump.
There is nothing FPGA-specific here - it is just a PCIe endpoint. We would expect to see the same behavior with an NVIDIA CX6 NIC ASIC if we were programming to that ASIC's register set.
Not sure if this is appropriate to say/ask here, but this is gating an NVIDIA OEM win for the AGX Orin. Not sure a few-$M AGX Orin design win even shows up on NVIDIA's radar these days! (friendly sarcasm). In any event, Atomic Rules would sure like the win.
Can anyone confirm that, with the IOMMU turned off (and dma-coherent turned off), PCIe transfer rates will suffer?
→ This might cause poor performance because IO coherency might be disabled here.
When the IOMMU is enabled, it drives the coherency bit based on the "dma-coherent" flag.
When the IOMMU is disabled, the PCIe controller drives the IO coherency bit, which is taken from the TLPs sent by the FPGA. So, if the FPGA sends TLPs with the IO coherency bit enabled, then IO coherency will be enabled (see the sketch below).
If we don't want to rely on the FPGA (EP), there is an override bit in the Tegra PCIe controller.
Note: the "dma-coherent" flag should be kept intact even when the "iommus" entries are removed.
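For what it's worth, if the IO coherency attribute discussed above corresponds to the standard PCIe No Snoop TLP bit, one quick host-side sanity check is whether the endpoint is even permitted to set No Snoop (the Enable No Snoop bit in its Device Control register). Below is a minimal illustrative sketch, not a Tegra-specific tool: the BDF is a placeholder, and whether the FPGA DMA engine actually honors the bit depends on the IP.

```python
#!/usr/bin/env python3
"""Read the Enable No Snoop bit from a PCIe endpoint's Device Control register.

Illustrative sketch only (run as root; the default BDF is a placeholder).
If the endpoint sets the No Snoop attribute in its TLPs, those transfers are
handled non-coherently; clearing Enable No Snoop tells a spec-compliant EP
not to set that attribute.
"""
import sys

PCI_CAP_ID_EXP = 0x10              # PCI Express capability ID
PCI_EXP_DEVCTL = 0x08              # Device Control offset within the capability
PCI_EXP_DEVCTL_NOSNOOP_EN = 1 << 11

def read_devctl(bdf):
    with open(f"/sys/bus/pci/devices/{bdf}/config", "rb") as f:
        cfg = f.read(256)          # standard config space (full read needs root)
    pos = cfg[0x34]                # capabilities pointer
    while pos:
        cap_id, nxt = cfg[pos], cfg[pos + 1]
        if cap_id == PCI_CAP_ID_EXP:
            off = pos + PCI_EXP_DEVCTL
            return int.from_bytes(cfg[off:off + 2], "little")
        pos = nxt
    raise RuntimeError("PCI Express capability not found")

if __name__ == "__main__":
    bdf = sys.argv[1] if len(sys.argv) > 1 else "0005:01:00.0"  # placeholder BDF
    devctl = read_devctl(bdf)
    enabled = bool(devctl & PCI_EXP_DEVCTL_NOSNOOP_EN)
    print(f"{bdf}: Device Control = 0x{devctl:04x}, Enable No Snoop = {enabled}")
```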
Has anyone attempted to run a DPDK application with a device in the PCIe slot using the vfio-pci driver?
→ No, we didn’t attempt this.
Can anyone provide guidance on how to get around this problem?
→ With the current information, we don't see any data comparison with and without the IOMMU - probably because you can't run your application with the IOMMU enabled? Thus, we cannot conclusively attribute this issue to the IOMMU/DMA coherency.
Are you using the DMA engine in the FPGA for both Tx and Rx?
If yes, then the Tegra side hardly plays any role here. Since Tx is working fine, please check on the FPGA side why Rx is low.
Thanks WayneWWW,
Could you provide more information on the "override bit in the Tegra PCIe controller"? Is there a programmer's reference document available? Any additional information you could provide to help us with the override would be appreciated.
We are using the DMA engine in the FPGA for both Tx and Rx.