Extremely slow inference in TensorRT for live semantic segmentation model

Description

Inference with TensorRT is too slow after converting the model from ONNX to a .plan file. It takes an average of 0.60 seconds to infer a single frame, so only around 2 frames are processed per second.

The slow inference is evident from this video → slow_inference_TensorRT.mkv - Google Drive

Environment

TensorRT Version: 8.0.1.6
GPU Type:
Nvidia Driver Version:
CUDA Version: 10.2.300
CUDNN Version: 8.2.1.32
Operating System + Version: Ubuntu 18.04
Python Version (if applicable): 3.6.9
TensorFlow Version (if applicable): 1.15.5+nv21.12
PyTorch Version (if applicable):
Baremetal or Container (if container which image + tag):

Relevant Files

GitHub repo → GitHub - sachinkmohan/Jetson_test_projects: Medium to high complexity Machine learning projects to compare the inference times between optimized and non-optimized neural networks

ONNX file used for Inference → seg_model_unet_40_ep_op13.onnx - Google Drive

Video file used for Inference →

Steps To Reproduce

  1. Convert the ONNX file to a .plan engine file (see the example trtexec command after this list).
  2. Load the generated .plan file and update the path in this file → Jetson_test_projects/test_img_seg.py at main · sachinkmohan/Jetson_test_projects · GitHub
  3. You will now observe slow inference on the video frames.
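
For step 1, a possible conversion command (trtexec ships with TensorRT; the ONNX file name matches the one shared above, and the output name is illustrative):

$ /usr/src/tensorrt/bin/trtexec --onnx=seg_model_unet_40_ep_op13.onnx --saveEngine=seg_model_unet_40_ep_op13.plan

Adding --fp16 builds a half-precision engine, which is usually noticeably faster on Xavier than the default FP32 engine.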

Packages installed:

Package Version


absl-py 1.0.0
appdirs 1.4.4
astor 0.8.1
astunparse 1.6.3
certifi 2021.10.8
charset-normalizer 2.0.11
cityscapesScripts 2.2.0
coloredlogs 15.0.1
cycler 0.11.0
Cython 0.29.27
dataclasses 0.8
decorator 4.4.2
fire 0.4.0
flatbuffers 1.12
future 0.18.2
gast 0.4.0
google-pasta 0.2.0
graphsurgeon 0.4.5
grpcio 1.44.0rc2
h5py 2.10.0
humanfriendly 10.0
idna 3.3
imageio 2.15.0
importlib-metadata 4.8.3
jetson-stats 3.1.2
Keras-Applications 1.0.8
Keras-Preprocessing 1.1.2
keras2onnx 1.7.0
kiwisolver 1.3.1
Mako 1.1.6
Markdown 3.3.6
MarkupSafe 2.0.1
matplotlib 3.3.4
mock 3.0.5
networkx 2.5.1
numpy 1.19.4
nvidia-pyindex 1.0.9
onnx 1.10.2
onnxconverter-common 1.9.0
opt-einsum 3.3.0
pbr 5.8.1
Pillow 8.4.0
pip 21.3.1
pkg_resources 0.0.0
pkgconfig 1.5.5
protobuf 3.19.4
pybind11 2.9.1
pycuda 2020.1
pyparsing 3.0.7
pyquaternion 0.9.9
python-dateutil 2.8.2
pytools 2021.2.9
PyWavelets 1.1.1
requests 2.27.1
scikit-image 0.17.2
scipy 1.5.4
setuptools 49.6.0
six 1.16.0
tensorboard 1.15.0
tensorflow 1.15.5+nv21.12
tensorflow-estimator 1.15.1
tensorrt 8.0.1.6
termcolor 1.1.0
testresources 2.0.1
tf2onnx 1.9.3
tifffile 2020.9.3
tqdm 4.62.3
typing 3.7.4.3
typing_extensions 4.0.1
uff 0.6.9
urllib3 1.26.8
Werkzeug 2.0.3
wheel 0.37.1
wrapt 1.13.3
zipp 3.6.0

Hi,
Could you share the model, script, profiler, and performance output (if not already shared) so that we can help you better?
Alternatively, you can try running your model with the trtexec command.
https://github.com/NVIDIA/TensorRT/tree/master/samples/opensource/trtexec

While measuring the model performance, make sure you consider the latency and throughput of the network inference, excluding the data pre- and post-processing overhead.
Please refer to the links below for more details:
https://docs.nvidia.com/deeplearning/tensorrt/archives/tensorrt-722/best-practices/index.html#measure-performance
https://docs.nvidia.com/deeplearning/tensorrt/best-practices/index.html#model-accuracy
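
To separate the network inference time from the pre- and post-processing time in a Python loop, each stage can be timed on its own. A minimal sketch (the video path, the resize/normalization step, and run_trt_inference() are illustrative stand-ins, not code from the shared repository):

import time
import cv2

cap = cv2.VideoCapture("input_video.mp4")     # illustrative path
t_pre = t_inf = 0.0
n = 0
while n < 100:
    ok, frame = cap.read()
    if not ok:
        break
    t0 = time.perf_counter()
    # pre-processing only: resize to the 480x320 model input
    # (the /255.0 normalization is an assumption, not taken from the shared script)
    blob = cv2.resize(frame, (480, 320)).astype("float32") / 255.0
    t1 = time.perf_counter()
    out = run_trt_inference(blob)              # stand-in for your own TensorRT call
    t2 = time.perf_counter()
    t_pre += t1 - t0
    t_inf += t2 - t1
    n += 1

print("avg pre-processing: %.1f ms" % (1000 * t_pre / n))
print("avg inference:      %.1f ms" % (1000 * t_inf / n))

If the inference figure alone is close to what trtexec reports, the remaining per-frame time is being spent outside TensorRT (capture, resize, drawing, etc.).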

Thanks!

The ONNX file has already been shared.

The output from trtexec is as follows:

user@user-desktop:/usr/src/tensorrt/bin$ ./trtexec --loadEngine=seg_model_unet_40_ep_op13.trt --shapes=input:1*3*320*480
&&&& RUNNING TensorRT.trtexec [TensorRT v8001] # ./trtexec --loadEngine=seg_model_unet_40_ep_op13.trt --shapes=input:1*3*320*480
[03/03/2022-16:03:49] [I] === Model Options ===
[03/03/2022-16:03:49] [I] Format: *
[03/03/2022-16:03:49] [I] Model: 
[03/03/2022-16:03:49] [I] Output:
[03/03/2022-16:03:49] [I] === Build Options ===
[03/03/2022-16:03:49] [I] Max batch: explicit
[03/03/2022-16:03:49] [I] Workspace: 16 MiB
[03/03/2022-16:03:49] [I] minTiming: 1
[03/03/2022-16:03:49] [I] avgTiming: 8
[03/03/2022-16:03:49] [I] Precision: FP32
[03/03/2022-16:03:49] [I] Calibration: 
[03/03/2022-16:03:49] [I] Refit: Disabled
[03/03/2022-16:03:49] [I] Sparsity: Disabled
[03/03/2022-16:03:49] [I] Safe mode: Disabled
[03/03/2022-16:03:49] [I] Restricted mode: Disabled
[03/03/2022-16:03:49] [I] Save engine: 
[03/03/2022-16:03:49] [I] Load engine: seg_model_unet_40_ep_op13.trt
[03/03/2022-16:03:49] [I] NVTX verbosity: 0
[03/03/2022-16:03:49] [I] Tactic sources: Using default tactic sources
[03/03/2022-16:03:49] [I] timingCacheMode: local
[03/03/2022-16:03:49] [I] timingCacheFile: 
[03/03/2022-16:03:49] [I] Input(s)s format: fp32:CHW
[03/03/2022-16:03:49] [I] Output(s)s format: fp32:CHW
[03/03/2022-16:03:49] [I] Input build shape: input=1+1+1
[03/03/2022-16:03:49] [I] Input calibration shapes: model
[03/03/2022-16:03:49] [I] === System Options ===
[03/03/2022-16:03:49] [I] Device: 0
[03/03/2022-16:03:49] [I] DLACore: 
[03/03/2022-16:03:49] [I] Plugins:
[03/03/2022-16:03:49] [I] === Inference Options ===
[03/03/2022-16:03:49] [I] Batch: Explicit
[03/03/2022-16:03:49] [I] Input inference shape: input=1
[03/03/2022-16:03:49] [I] Iterations: 10
[03/03/2022-16:03:49] [I] Duration: 3s (+ 200ms warm up)
[03/03/2022-16:03:49] [I] Sleep time: 0ms
[03/03/2022-16:03:49] [I] Streams: 1
[03/03/2022-16:03:49] [I] ExposeDMA: Disabled
[03/03/2022-16:03:49] [I] Data transfers: Enabled
[03/03/2022-16:03:49] [I] Spin-wait: Disabled
[03/03/2022-16:03:49] [I] Multithreading: Disabled
[03/03/2022-16:03:49] [I] CUDA Graph: Disabled
[03/03/2022-16:03:49] [I] Separate profiling: Disabled
[03/03/2022-16:03:49] [I] Time Deserialize: Disabled
[03/03/2022-16:03:49] [I] Time Refit: Disabled
[03/03/2022-16:03:49] [I] Skip inference: Disabled
[03/03/2022-16:03:49] [I] Inputs:
[03/03/2022-16:03:49] [I] === Reporting Options ===
[03/03/2022-16:03:49] [I] Verbose: Disabled
[03/03/2022-16:03:49] [I] Averages: 10 inferences
[03/03/2022-16:03:49] [I] Percentile: 99
[03/03/2022-16:03:49] [I] Dump refittable layers:Disabled
[03/03/2022-16:03:49] [I] Dump output: Disabled
[03/03/2022-16:03:49] [I] Profile: Disabled
[03/03/2022-16:03:49] [I] Export timing to JSON file: 
[03/03/2022-16:03:49] [I] Export output to JSON file: 
[03/03/2022-16:03:49] [I] Export profile to JSON file: 
[03/03/2022-16:03:49] [I] 
[03/03/2022-16:03:49] [I] === Device Information ===
[03/03/2022-16:03:49] [I] Selected Device: Xavier
[03/03/2022-16:03:49] [I] Compute Capability: 7.2
[03/03/2022-16:03:49] [I] SMs: 8
[03/03/2022-16:03:49] [I] Compute Clock Rate: 1.377 GHz
[03/03/2022-16:03:49] [I] Device Global Memory: 31928 MiB
[03/03/2022-16:03:49] [I] Shared Memory per SM: 96 KiB
[03/03/2022-16:03:49] [I] Memory Bus Width: 256 bits (ECC disabled)
[03/03/2022-16:03:49] [I] Memory Clock Rate: 1.377 GHz
[03/03/2022-16:03:49] [I] 
[03/03/2022-16:03:49] [I] TensorRT version: 8001
[03/03/2022-16:03:50] [I] [TRT] [MemUsageChange] Init CUDA: CPU +353, GPU +0, now: CPU 402, GPU 3768 (MiB)
[03/03/2022-16:03:50] [I] [TRT] Loaded engine size: 30 MB
[03/03/2022-16:03:50] [I] [TRT] [MemUsageSnapshot] deserializeCudaEngine begin: CPU 402 MiB, GPU 3768 MiB
[03/03/2022-16:03:50] [W] [TRT] Using an engine plan file across different models of devices is not recommended and is likely to affect performance or even cause errors.
[03/03/2022-16:03:51] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +226, GPU +233, now: CPU 647, GPU 4023 (MiB)
[03/03/2022-16:03:52] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +307, GPU +304, now: CPU 954, GPU 4327 (MiB)
[03/03/2022-16:03:52] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 954, GPU 4327 (MiB)
[03/03/2022-16:03:52] [I] [TRT] [MemUsageSnapshot] deserializeCudaEngine end: CPU 954 MiB, GPU 4327 MiB
[03/03/2022-16:03:52] [I] Engine loaded in 3.38538 sec.
[03/03/2022-16:03:52] [I] [TRT] [MemUsageSnapshot] ExecutionContext creation begin: CPU 924 MiB, GPU 4297 MiB
[03/03/2022-16:03:52] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 924, GPU 4297 (MiB)
[03/03/2022-16:03:52] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +0, now: CPU 924, GPU 4297 (MiB)
[03/03/2022-16:03:52] [I] [TRT] [MemUsageSnapshot] ExecutionContext creation end: CPU 926 MiB, GPU 4417 MiB
[03/03/2022-16:03:52] [I] Created input binding for input with dimensions 1x320x480x3
[03/03/2022-16:03:52] [I] Created output binding for sigmoid with dimensions 1x320x480x1
[03/03/2022-16:03:52] [I] Starting inference
[03/03/2022-16:03:56] [I] Warmup completed 2 queries over 200 ms
[03/03/2022-16:03:56] [I] Timing trace has 33 queries over 3.09938 s
[03/03/2022-16:03:56] [I] 
[03/03/2022-16:03:56] [I] === Trace details ===
[03/03/2022-16:03:56] [I] Trace averages of 10 runs:
[03/03/2022-16:03:56] [I] Average on 10 runs - GPU latency: 100.411 ms - Host latency: 100.705 ms (end to end 100.717 ms, enqueue 100.345 ms)
[03/03/2022-16:03:56] [I] Average on 10 runs - GPU latency: 98.3098 ms - Host latency: 98.6065 ms (end to end 98.6184 ms, enqueue 98.4107 ms)
[03/03/2022-16:03:56] [I] Average on 10 runs - GPU latency: 89.1673 ms - Host latency: 89.43 ms (end to end 89.4423 ms, enqueue 91.116 ms)
[03/03/2022-16:03:56] [I] 
[03/03/2022-16:03:56] [I] === Performance summary ===
[03/03/2022-16:03:56] [I] Throughput: 10.6473 qps
[03/03/2022-16:03:56] [I] Latency: min = 70.1868 ms, max = 100.871 ms, mean = 93.9085 ms, median = 98.0884 ms, percentile(99%) = 100.871 ms
[03/03/2022-16:03:56] [I] End-to-End Host Latency: min = 70.1992 ms, max = 100.882 ms, mean = 93.9205 ms, median = 98.1062 ms, percentile(99%) = 100.882 ms
[03/03/2022-16:03:56] [I] Enqueue Time: min = 69.8557 ms, max = 100.793 ms, mean = 94.4267 ms, median = 97.7693 ms, percentile(99%) = 100.793 ms
[03/03/2022-16:03:56] [I] H2D Latency: min = 0.131592 ms, max = 0.217773 ms, mean = 0.20166 ms, median = 0.214844 ms, percentile(99%) = 0.217773 ms
[03/03/2022-16:03:56] [I] GPU Compute Time: min = 70.0078 ms, max = 100.58 ms, mean = 93.6333 ms, median = 97.7971 ms, percentile(99%) = 100.58 ms
[03/03/2022-16:03:56] [I] D2H Latency: min = 0.0473633 ms, max = 0.0871582 ms, mean = 0.0735067 ms, median = 0.076355 ms, percentile(99%) = 0.0871582 ms
[03/03/2022-16:03:56] [I] Total Host Walltime: 3.09938 s
[03/03/2022-16:03:56] [I] Total GPU Compute Time: 3.0899 s
[03/03/2022-16:03:56] [W] * Throughput may be bound by Enqueue Time rather than GPU Compute and the GPU may be under-utilized.
[03/03/2022-16:03:56] [W]   If not already in use, --useCudaGraph (utilize CUDA graphs where possible) may increase the throughput.
[03/03/2022-16:03:56] [I] Explanations of the performance metrics are printed in the verbose logs.
[03/03/2022-16:03:56] [I] 
&&&& PASSED TensorRT.trtexec [TensorRT v8001] # ./trtexec --loadEngine=seg_model_unet_40_ep_op13.trt --shapes=input:1*3*320*480
[03/03/2022-16:03:56] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 927, GPU 4416 (MiB)

Please advise on how to proceed further.

Hi,

Sorry for the delayed response. Based on the logs, the performance looks normal.
We will move this post to the Jetson Xavier related forum to get better help.

Thanks!

I can move this over for you.

Hi,

First, in case you didn’t notice this:
You can maximize Xavier performance with the following commands:

$ sudo nvpmodel -m 0
$ sudo jetson_clocks
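
To confirm that the change took effect, the current power model can be queried (assuming the standard L4T nvpmodel tool):

$ sudo nvpmodel -q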

Based on your description, have you tested the same model with other frameworks?
For example, do you know the performance of ONNXRuntime or PyTorch for the same model on Xavier?

Moreover, the end-to-end inference time from trtexec is around 93.92 ms.

[03/03/2022-16:03:56] [I] === Performance summary ===
[03/03/2022-16:03:56] [I] Throughput: 10.6473 qps
[03/03/2022-16:03:56] [I] Latency: min = 70.1868 ms, max = 100.871 ms, mean = 93.9085 ms, median = 98.0884 ms, percentile(99%) = 100.871 ms
[03/03/2022-16:03:56] [I] End-to-End Host Latency: min = 70.1992 ms, max = 100.882 ms, mean = 93.9205 ms, median = 98.1062 ms, percentile(99%) = 100.882 ms
...

Since this is much faster than the numbers you report, the pipeline bottleneck might come from other tasks, e.g. image processing.

Do you use OpenCV to read frames and do the pre-processing?
If so, it’s recommended to use the DeepStream SDK instead, since it is optimized for the Jetson platform.
https://developer.nvidia.com/deepstream-sdk

Thanks.

Please check this: I am not doing any pre-processing. All I am doing is resizing the frame and running the inference.

Also, the inference time I calculated is for the mi.inference() function alone, i.e. line 91.

At the moment, due to time constraints, I don’t want to invest time in experimenting with the DeepStream SDK. Also, trtexec always reports great inference numbers, which clearly do not carry over to my pipeline.

I have 2 questions.

  1. Does TensorRT work without any pre-processing, as in this case? Or does it need pre-processing, such as the normalization I have seen in some demo GitHub repositories, in order to reach the inference numbers promised by TensorRT?

  2. Also, is there any documentation on the image pre-processing and post-processing that one should follow when working with TensorRT inference?

Hi,

Sorry for the unclear statement.

There is some pre-processing in the implementation.
For example, a decoder converts the camera frame into BGR format, and a resizer applies interpolation.
Since OpenCV does this on the CPU, it might be the bottleneck of your pipeline.

Moreover, based on the source below:

It seems that you create the input/output buffers for each frame.
But these buffers can be reused for better performance.
You can find an example of doing so below:
https://elinux.org/Jetson/L4T/TRT_Customized_Example#OpenCV_with_PLAN_model
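
A minimal sketch of that reuse pattern, assuming a PyCUDA-based pipeline like the one in the linked example (the class and names are illustrative, not the poster's actual script; the binding shapes come from the trtexec log above: NHWC input 1x320x480x3 and output 1x320x480x1):

import numpy as np
import pycuda.autoinit            # creates and holds a CUDA context
import pycuda.driver as cuda
import tensorrt as trt

class TRTSegmenter:
    def __init__(self, plan_path):
        logger = trt.Logger(trt.Logger.WARNING)
        with open(plan_path, "rb") as f, trt.Runtime(logger) as runtime:
            self.engine = runtime.deserialize_cuda_engine(f.read())
        self.context = self.engine.create_execution_context()
        self.stream = cuda.Stream()
        # Allocate pinned host buffers and device buffers ONCE, at init time.
        self.host_in = cuda.pagelocked_empty(trt.volume(self.engine.get_binding_shape(0)), dtype=np.float32)
        self.host_out = cuda.pagelocked_empty(trt.volume(self.engine.get_binding_shape(1)), dtype=np.float32)
        self.dev_in = cuda.mem_alloc(self.host_in.nbytes)
        self.dev_out = cuda.mem_alloc(self.host_out.nbytes)
        self.bindings = [int(self.dev_in), int(self.dev_out)]

    def infer(self, frame_float32):
        # Reuse the pre-allocated buffers for every frame instead of recreating them.
        np.copyto(self.host_in, frame_float32.ravel())
        cuda.memcpy_htod_async(self.dev_in, self.host_in, self.stream)
        self.context.execute_async_v2(self.bindings, self.stream.handle)
        cuda.memcpy_dtoh_async(self.host_out, self.dev_out, self.stream)
        self.stream.synchronize()
        return self.host_out.reshape(tuple(self.engine.get_binding_shape(1)))

The point is that deserialize_cuda_engine(), create_execution_context(), pagelocked_empty() and mem_alloc() all run once in __init__(), so the per-frame cost is only the two memcpys and the execute call.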

Thanks.

Thanks for the response! No luck yet!

I tried the above, but it is still giving me a low frame rate, i.e. 2 fps.

I added the pre-processing here

I added the inference part here

The same code works for object detection. I get 30-32 fps. I don’t do any pre-processing here.

Please suggest!

Hi,

It looks like you still create the buffers for every single frame.

Would you mind moving the buffer creation to initialization time and reusing the buffers instead of creating them at inference time?

Thanks.

Thanks a ton @AastaLLL! Your tip really helped me. I reused the buffers as you suggested, allocating them once and reusing them for every frame.

Now the segmentation improved drastically from 2 fps to 27-30 fps. My object detection code, which was previously running at 30 fps, now runs at 100 fps.
