Extremely slow inference in TensorRT for live semantic segmentation model

Description

Inference with TensorRT is too slow after converting the model from ONNX to a .plan file. It takes an average of 0.60 seconds to infer a single frame, so only around 2 frames are processed per second.

The slow inference is evident from this video → slow_inference_TensorRT.mkv - Google Drive

Environment

TensorRT Version: 8.0.1.6
GPU Type:
Nvidia Driver Version:
CUDA Version: 10.2.300
CUDNN Version: 8.2.1.32
Operating System + Version: Ubuntu 18.04
Python Version (if applicable): 3.6.9
TensorFlow Version (if applicable): 1.15.5+nv21.12
PyTorch Version (if applicable):
Baremetal or Container (if container which image + tag):

Relevant Files

GitHub repo → GitHub - sachinkmohan/Jetson_test_projects: Medium to high complexity Machine learning projects to compare the inference times between optimized and non-optimized neural networks

ONNX file used for Inference → seg_model_unet_40_ep_op13.onnx - Google Drive

Video file used for Inference →

Steps To Reproduce

  1. Convert the ONNX file to a .plan engine file (see the example trtexec command after this list).
  2. Load the generated .plan file and update the path in this file → Jetson_test_projects/test_img_seg.py at main · sachinkmohan/Jetson_test_projects · GitHub
  3. You will now observe slow inference on the video frames.
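
For step 1, a possible conversion command (trtexec ships with TensorRT; the ONNX file name matches the one shared above, and the output name is illustrative):

$ /usr/src/tensorrt/bin/trtexec --onnx=seg_model_unet_40_ep_op13.onnx --saveEngine=seg_model_unet_40_ep_op13.plan

Adding --fp16 builds a half-precision engine, which is usually noticeably faster on Xavier than the default FP32 engine.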

Packages installed:

Package Version


absl-py 1.0.0
appdirs 1.4.4
astor 0.8.1
astunparse 1.6.3
certifi 2021.10.8
charset-normalizer 2.0.11
cityscapesScripts 2.2.0
coloredlogs 15.0.1
cycler 0.11.0
Cython 0.29.27
dataclasses 0.8
decorator 4.4.2
fire 0.4.0
flatbuffers 1.12
future 0.18.2
gast 0.4.0
google-pasta 0.2.0
graphsurgeon 0.4.5
grpcio 1.44.0rc2
h5py 2.10.0
humanfriendly 10.0
idna 3.3
imageio 2.15.0
importlib-metadata 4.8.3
jetson-stats 3.1.2
Keras-Applications 1.0.8
Keras-Preprocessing 1.1.2
keras2onnx 1.7.0
kiwisolver 1.3.1
Mako 1.1.6
Markdown 3.3.6
MarkupSafe 2.0.1
matplotlib 3.3.4
mock 3.0.5
networkx 2.5.1
numpy 1.19.4
nvidia-pyindex 1.0.9
onnx 1.10.2
onnxconverter-common 1.9.0
opt-einsum 3.3.0
pbr 5.8.1
Pillow 8.4.0
pip 21.3.1
pkg_resources 0.0.0
pkgconfig 1.5.5
protobuf 3.19.4
pybind11 2.9.1
pycuda 2020.1
pyparsing 3.0.7
pyquaternion 0.9.9
python-dateutil 2.8.2
pytools 2021.2.9
PyWavelets 1.1.1
requests 2.27.1
scikit-image 0.17.2
scipy 1.5.4
setuptools 49.6.0
six 1.16.0
tensorboard 1.15.0
tensorflow 1.15.5+nv21.12
tensorflow-estimator 1.15.1
tensorrt 8.0.1.6
termcolor 1.1.0
testresources 2.0.1
tf2onnx 1.9.3
tifffile 2020.9.3
tqdm 4.62.3
typing 3.7.4.3
typing_extensions 4.0.1
uff 0.6.9
urllib3 1.26.8
Werkzeug 2.0.3
wheel 0.37.1
wrapt 1.13.3
zipp 3.6.0

Hi,
Could you share the model, script, profiler, and performance output (if not already shared) so that we can help you better?
Alternatively, you can try running your model with the trtexec command.
https://github.com/NVIDIA/TensorRT/tree/master/samples/opensource/trtexec

While measuring the model performance, make sure you consider the latency and throughput of the network inference, excluding the data pre- and post-processing overhead.
Please refer to the links below for more details:
https://docs.nvidia.com/deeplearning/tensorrt/archives/tensorrt-722/best-practices/index.html#measure-performance
https://docs.nvidia.com/deeplearning/tensorrt/best-practices/index.html#model-accuracy
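
To separate the network inference time from the pre- and post-processing time in a Python loop, each stage can be timed on its own. A minimal sketch (the video path, the resize/normalization step, and run_trt_inference() are illustrative stand-ins, not code from the shared repository):

import time
import cv2

cap = cv2.VideoCapture("input_video.mp4")     # illustrative path
t_pre = t_inf = 0.0
n = 0
while n < 100:
    ok, frame = cap.read()
    if not ok:
        break
    t0 = time.perf_counter()
    # pre-processing only: resize to the 480x320 model input
    # (the /255.0 normalization is an assumption, not taken from the shared script)
    blob = cv2.resize(frame, (480, 320)).astype("float32") / 255.0
    t1 = time.perf_counter()
    out = run_trt_inference(blob)              # stand-in for your own TensorRT call
    t2 = time.perf_counter()
    t_pre += t1 - t0
    t_inf += t2 - t1
    n += 1

print("avg pre-processing: %.1f ms" % (1000 * t_pre / n))
print("avg inference:      %.1f ms" % (1000 * t_inf / n))

If the inference figure alone is close to what trtexec reports, the remaining per-frame time is being spent outside TensorRT (capture, resize, drawing, etc.).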

Thanks!

The ONNX file has already been shared.

The output from trtexec is as follows:

user@user-desktop:/usr/src/tensorrt/bin$ ./trtexec --loadEngine=seg_model_unet_40_ep_op13.trt --shapes=input:1*3*320*480
&&&& RUNNING TensorRT.trtexec [TensorRT v8001] # ./trtexec --loadEngine=seg_model_unet_40_ep_op13.trt --shapes=input:1*3*320*480
[03/03/2022-16:03:49] [I] === Model Options ===
[03/03/2022-16:03:49] [I] Format: *
[03/03/2022-16:03:49] [I] Model: 
[03/03/2022-16:03:49] [I] Output:
[03/03/2022-16:03:49] [I] === Build Options ===
[03/03/2022-16:03:49] [I] Max batch: explicit
[03/03/2022-16:03:49] [I] Workspace: 16 MiB
[03/03/2022-16:03:49] [I] minTiming: 1
[03/03/2022-16:03:49] [I] avgTiming: 8
[03/03/2022-16:03:49] [I] Precision: FP32
[03/03/2022-16:03:49] [I] Calibration: 
[03/03/2022-16:03:49] [I] Refit: Disabled
[03/03/2022-16:03:49] [I] Sparsity: Disabled
[03/03/2022-16:03:49] [I] Safe mode: Disabled
[03/03/2022-16:03:49] [I] Restricted mode: Disabled
[03/03/2022-16:03:49] [I] Save engine: 
[03/03/2022-16:03:49] [I] Load engine: seg_model_unet_40_ep_op13.trt
[03/03/2022-16:03:49] [I] NVTX verbosity: 0
[03/03/2022-16:03:49] [I] Tactic sources: Using default tactic sources
[03/03/2022-16:03:49] [I] timingCacheMode: local
[03/03/2022-16:03:49] [I] timingCacheFile: 
[03/03/2022-16:03:49] [I] Input(s)s format: fp32:CHW
[03/03/2022-16:03:49] [I] Output(s)s format: fp32:CHW
[03/03/2022-16:03:49] [I] Input build shape: input=1+1+1
[03/03/2022-16:03:49] [I] Input calibration shapes: model
[03/03/2022-16:03:49] [I] === System Options ===
[03/03/2022-16:03:49] [I] Device: 0
[03/03/2022-16:03:49] [I] DLACore: 
[03/03/2022-16:03:49] [I] Plugins:
[03/03/2022-16:03:49] [I] === Inference Options ===
[03/03/2022-16:03:49] [I] Batch: Explicit
[03/03/2022-16:03:49] [I] Input inference shape: input=1
[03/03/2022-16:03:49] [I] Iterations: 10
[03/03/2022-16:03:49] [I] Duration: 3s (+ 200ms warm up)
[03/03/2022-16:03:49] [I] Sleep time: 0ms
[03/03/2022-16:03:49] [I] Streams: 1
[03/03/2022-16:03:49] [I] ExposeDMA: Disabled
[03/03/2022-16:03:49] [I] Data transfers: Enabled
[03/03/2022-16:03:49] [I] Spin-wait: Disabled
[03/03/2022-16:03:49] [I] Multithreading: Disabled
[03/03/2022-16:03:49] [I] CUDA Graph: Disabled
[03/03/2022-16:03:49] [I] Separate profiling: Disabled
[03/03/2022-16:03:49] [I] Time Deserialize: Disabled
[03/03/2022-16:03:49] [I] Time Refit: Disabled
[03/03/2022-16:03:49] [I] Skip inference: Disabled
[03/03/2022-16:03:49] [I] Inputs:
[03/03/2022-16:03:49] [I] === Reporting Options ===
[03/03/2022-16:03:49] [I] Verbose: Disabled
[03/03/2022-16:03:49] [I] Averages: 10 inferences
[03/03/2022-16:03:49] [I] Percentile: 99
[03/03/2022-16:03:49] [I] Dump refittable layers:Disabled
[03/03/2022-16:03:49] [I] Dump output: Disabled
[03/03/2022-16:03:49] [I] Profile: Disabled
[03/03/2022-16:03:49] [I] Export timing to JSON file: 
[03/03/2022-16:03:49] [I] Export output to JSON file: 
[03/03/2022-16:03:49] [I] Export profile to JSON file: 
[03/03/2022-16:03:49] [I] 
[03/03/2022-16:03:49] [I] === Device Information ===
[03/03/2022-16:03:49] [I] Selected Device: Xavier
[03/03/2022-16:03:49] [I] Compute Capability: 7.2
[03/03/2022-16:03:49] [I] SMs: 8
[03/03/2022-16:03:49] [I] Compute Clock Rate: 1.377 GHz
[03/03/2022-16:03:49] [I] Device Global Memory: 31928 MiB
[03/03/2022-16:03:49] [I] Shared Memory per SM: 96 KiB
[03/03/2022-16:03:49] [I] Memory Bus Width: 256 bits (ECC disabled)
[03/03/2022-16:03:49] [I] Memory Clock Rate: 1.377 GHz
[03/03/2022-16:03:49] [I] 
[03/03/2022-16:03:49] [I] TensorRT version: 8001
[03/03/2022-16:03:50] [I] [TRT] [MemUsageChange] Init CUDA: CPU +353, GPU +0, now: CPU 402, GPU 3768 (MiB)
[03/03/2022-16:03:50] [I] [TRT] Loaded engine size: 30 MB
[03/03/2022-16:03:50] [I] [TRT] [MemUsageSnapshot] deserializeCudaEngine begin: CPU 402 MiB, GPU 3768 MiB
[03/03/2022-16:03:50] [W] [TRT] Using an engine plan file across different models of devices is not recommended and is likely to affect performance or even cause errors.
[03/03/2022-16:03:51] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +226, GPU +233, now: CPU 647, GPU 4023 (MiB)
[03/03/2022-16:03:52] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +307, GPU +304, now: CPU 954, GPU 4327 (MiB)
[03/03/2022-16:03:52] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 954, GPU 4327 (MiB)
[03/03/2022-16:03:52] [I] [TRT] [MemUsageSnapshot] deserializeCudaEngine end: CPU 954 MiB, GPU 4327 MiB
[03/03/2022-16:03:52] [I] Engine loaded in 3.38538 sec.
[03/03/2022-16:03:52] [I] [TRT] [MemUsageSnapshot] ExecutionContext creation begin: CPU 924 MiB, GPU 4297 MiB
[03/03/2022-16:03:52] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 924, GPU 4297 (MiB)
[03/03/2022-16:03:52] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +0, now: CPU 924, GPU 4297 (MiB)
[03/03/2022-16:03:52] [I] [TRT] [MemUsageSnapshot] ExecutionContext creation end: CPU 926 MiB, GPU 4417 MiB
[03/03/2022-16:03:52] [I] Created input binding for input with dimensions 1x320x480x3
[03/03/2022-16:03:52] [I] Created output binding for sigmoid with dimensions 1x320x480x1
[03/03/2022-16:03:52] [I] Starting inference
[03/03/2022-16:03:56] [I] Warmup completed 2 queries over 200 ms
[03/03/2022-16:03:56] [I] Timing trace has 33 queries over 3.09938 s
[03/03/2022-16:03:56] [I] 
[03/03/2022-16:03:56] [I] === Trace details ===
[03/03/2022-16:03:56] [I] Trace averages of 10 runs:
[03/03/2022-16:03:56] [I] Average on 10 runs - GPU latency: 100.411 ms - Host latency: 100.705 ms (end to end 100.717 ms, enqueue 100.345 ms)
[03/03/2022-16:03:56] [I] Average on 10 runs - GPU latency: 98.3098 ms - Host latency: 98.6065 ms (end to end 98.6184 ms, enqueue 98.4107 ms)
[03/03/2022-16:03:56] [I] Average on 10 runs - GPU latency: 89.1673 ms - Host latency: 89.43 ms (end to end 89.4423 ms, enqueue 91.116 ms)
[03/03/2022-16:03:56] [I] 
[03/03/2022-16:03:56] [I] === Performance summary ===
[03/03/2022-16:03:56] [I] Throughput: 10.6473 qps
[03/03/2022-16:03:56] [I] Latency: min = 70.1868 ms, max = 100.871 ms, mean = 93.9085 ms, median = 98.0884 ms, percentile(99%) = 100.871 ms
[03/03/2022-16:03:56] [I] End-to-End Host Latency: min = 70.1992 ms, max = 100.882 ms, mean = 93.9205 ms, median = 98.1062 ms, percentile(99%) = 100.882 ms
[03/03/2022-16:03:56] [I] Enqueue Time: min = 69.8557 ms, max = 100.793 ms, mean = 94.4267 ms, median = 97.7693 ms, percentile(99%) = 100.793 ms
[03/03/2022-16:03:56] [I] H2D Latency: min = 0.131592 ms, max = 0.217773 ms, mean = 0.20166 ms, median = 0.214844 ms, percentile(99%) = 0.217773 ms
[03/03/2022-16:03:56] [I] GPU Compute Time: min = 70.0078 ms, max = 100.58 ms, mean = 93.6333 ms, median = 97.7971 ms, percentile(99%) = 100.58 ms
[03/03/2022-16:03:56] [I] D2H Latency: min = 0.0473633 ms, max = 0.0871582 ms, mean = 0.0735067 ms, median = 0.076355 ms, percentile(99%) = 0.0871582 ms
[03/03/2022-16:03:56] [I] Total Host Walltime: 3.09938 s
[03/03/2022-16:03:56] [I] Total GPU Compute Time: 3.0899 s
[03/03/2022-16:03:56] [W] * Throughput may be bound by Enqueue Time rather than GPU Compute and the GPU may be under-utilized.
[03/03/2022-16:03:56] [W]   If not already in use, --useCudaGraph (utilize CUDA graphs where possible) may increase the throughput.
[03/03/2022-16:03:56] [I] Explanations of the performance metrics are printed in the verbose logs.
[03/03/2022-16:03:56] [I] 
&&&& PASSED TensorRT.trtexec [TensorRT v8001] # ./trtexec --loadEngine=seg_model_unet_40_ep_op13.trt --shapes=input:1*3*320*480
[03/03/2022-16:03:56] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 927, GPU 4416 (MiB)

Please advise on how to proceed further.

Hi,

Sorry for the delayed response. Based on the logs, the performance looks normal.
We will move this post to the Jetson Xavier related forum to get better help.

Thanks!

I can move this over for you.

Hi,

First, in case you didn’t notice this:
You can maximize Xavier performance with the following commands:

$ sudo nvpmodel -m 0
$ sudo jetson_clocks
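
To confirm that the change took effect, the current power model can be queried (assuming the standard L4T nvpmodel tool):

$ sudo nvpmodel -q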

Based on your description, have you tested the same model with other frameworks?
For example, do you know the performance of ONNXRuntime or PyTorch for the same model on Xavier?

Moreover, the end-to-end inference time from trtexec is around 93.92 ms.

[03/03/2022-16:03:56] [I] === Performance summary ===
[03/03/2022-16:03:56] [I] Throughput: 10.6473 qps
[03/03/2022-16:03:56] [I] Latency: min = 70.1868 ms, max = 100.871 ms, mean = 93.9085 ms, median = 98.0884 ms, percentile(99%) = 100.871 ms
[03/03/2022-16:03:56] [I] End-to-End Host Latency: min = 70.1992 ms, max = 100.882 ms, mean = 93.9205 ms, median = 98.1062 ms, percentile(99%) = 100.882 ms
...

Since this is much faster than the numbers you report, the pipeline bottleneck might come from other tasks, e.g. image processing.

Do you use OpenCV to read frames and do the pre-processing?
If so, it’s recommended to use the DeepStream SDK instead, since it is optimized for the Jetson platform.
https://developer.nvidia.com/deepstream-sdk

Thanks.

Please check this: I am not doing any pre-processing. All I am doing is resizing the frame and running the inference.

Also, the inference time I calculated is for the mi.inference() function alone, i.e. line 91.

At the moment, due to time constraints, I don’t want to invest time in experimenting with the DeepStream SDK. Also, trtexec always reports great inference numbers, which clearly do not carry over to my pipeline.

I have 2 questions.

  1. Does TensorRT work without any pre-processing, as in this case? Or does it need pre-processing, such as the normalization I have seen in some demo GitHub repositories, in order to reach the inference numbers promised by TensorRT?

  2. Also, is there any documentation on the image pre-processing and post-processing that one should follow when working with TensorRT inference?

Hi,

Sorry for the unclear statement.

There is some pre-processing in the implementation.
For example, a decoder converts the camera frame into BGR format, and a resizer applies interpolation.
Since OpenCV does this on the CPU, it might be the bottleneck of your pipeline.

Moreover, based on the source below:

It seems that you create the input/output buffers for each frame.
But these buffers can be reused for better performance.
You can find an example of doing so below:
https://elinux.org/Jetson/L4T/TRT_Customized_Example#OpenCV_with_PLAN_model
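
A minimal sketch of that reuse pattern, assuming a PyCUDA-based pipeline like the one in the linked example (the class and names are illustrative, not the poster's actual script; the binding shapes come from the trtexec log above: NHWC input 1x320x480x3 and output 1x320x480x1):

import numpy as np
import pycuda.autoinit            # creates and holds a CUDA context
import pycuda.driver as cuda
import tensorrt as trt

class TRTSegmenter:
    def __init__(self, plan_path):
        logger = trt.Logger(trt.Logger.WARNING)
        with open(plan_path, "rb") as f, trt.Runtime(logger) as runtime:
            self.engine = runtime.deserialize_cuda_engine(f.read())
        self.context = self.engine.create_execution_context()
        self.stream = cuda.Stream()
        # Allocate pinned host buffers and device buffers ONCE, at init time.
        self.host_in = cuda.pagelocked_empty(trt.volume(self.engine.get_binding_shape(0)), dtype=np.float32)
        self.host_out = cuda.pagelocked_empty(trt.volume(self.engine.get_binding_shape(1)), dtype=np.float32)
        self.dev_in = cuda.mem_alloc(self.host_in.nbytes)
        self.dev_out = cuda.mem_alloc(self.host_out.nbytes)
        self.bindings = [int(self.dev_in), int(self.dev_out)]

    def infer(self, frame_float32):
        # Reuse the pre-allocated buffers for every frame instead of recreating them.
        np.copyto(self.host_in, frame_float32.ravel())
        cuda.memcpy_htod_async(self.dev_in, self.host_in, self.stream)
        self.context.execute_async_v2(self.bindings, self.stream.handle)
        cuda.memcpy_dtoh_async(self.host_out, self.dev_out, self.stream)
        self.stream.synchronize()
        return self.host_out.reshape(tuple(self.engine.get_binding_shape(1)))

The point is that deserialize_cuda_engine(), create_execution_context(), pagelocked_empty() and mem_alloc() all run once in __init__(), so the per-frame cost is only the two memcpys and the execute call.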

Thanks.

Thanks for the response! No luck yet!

I tried the above, but it is still giving me a low frame rate, i.e. 2 fps.

I added the pre-processing here

I added the inference part here

The same code works for object detection. I get 30-32 fps. I don’t do any pre-processing here.

Please suggest!

Hi,

It looks like you still create the buffers for every single frame.

Would you mind moving the buffer creation to initialization time and reusing the buffers instead of creating them at inference time?

Thanks.

Thanks a ton @AastaLLL! Your tip really helped me. I reused the buffers as you suggested, allocating them once and reusing them for every frame.

Now the segmentation improved drastically from 2 fps to 27-30 fps. My object detection code, which was previously running at 30 fps, now runs at 100 fps.
