The ONNX file has already been shared. The output from trtexec is as follows:
user@user-desktop:/usr/src/tensorrt/bin$ ./trtexec --loadEngine=seg_model_unet_40_ep_op13.trt --shapes=input:1*3*320*480
&&&& RUNNING TensorRT.trtexec [TensorRT v8001] # ./trtexec --loadEngine=seg_model_unet_40_ep_op13.trt --shapes=input:1*3*320*480
[03/03/2022-16:03:49] [I] === Model Options ===
[03/03/2022-16:03:49] [I] Format: *
[03/03/2022-16:03:49] [I] Model:
[03/03/2022-16:03:49] [I] Output:
[03/03/2022-16:03:49] [I] === Build Options ===
[03/03/2022-16:03:49] [I] Max batch: explicit
[03/03/2022-16:03:49] [I] Workspace: 16 MiB
[03/03/2022-16:03:49] [I] minTiming: 1
[03/03/2022-16:03:49] [I] avgTiming: 8
[03/03/2022-16:03:49] [I] Precision: FP32
[03/03/2022-16:03:49] [I] Calibration:
[03/03/2022-16:03:49] [I] Refit: Disabled
[03/03/2022-16:03:49] [I] Sparsity: Disabled
[03/03/2022-16:03:49] [I] Safe mode: Disabled
[03/03/2022-16:03:49] [I] Restricted mode: Disabled
[03/03/2022-16:03:49] [I] Save engine:
[03/03/2022-16:03:49] [I] Load engine: seg_model_unet_40_ep_op13.trt
[03/03/2022-16:03:49] [I] NVTX verbosity: 0
[03/03/2022-16:03:49] [I] Tactic sources: Using default tactic sources
[03/03/2022-16:03:49] [I] timingCacheMode: local
[03/03/2022-16:03:49] [I] timingCacheFile:
[03/03/2022-16:03:49] [I] Input(s)s format: fp32:CHW
[03/03/2022-16:03:49] [I] Output(s)s format: fp32:CHW
[03/03/2022-16:03:49] [I] Input build shape: input=1+1+1
[03/03/2022-16:03:49] [I] Input calibration shapes: model
[03/03/2022-16:03:49] [I] === System Options ===
[03/03/2022-16:03:49] [I] Device: 0
[03/03/2022-16:03:49] [I] DLACore:
[03/03/2022-16:03:49] [I] Plugins:
[03/03/2022-16:03:49] [I] === Inference Options ===
[03/03/2022-16:03:49] [I] Batch: Explicit
[03/03/2022-16:03:49] [I] Input inference shape: input=1
[03/03/2022-16:03:49] [I] Iterations: 10
[03/03/2022-16:03:49] [I] Duration: 3s (+ 200ms warm up)
[03/03/2022-16:03:49] [I] Sleep time: 0ms
[03/03/2022-16:03:49] [I] Streams: 1
[03/03/2022-16:03:49] [I] ExposeDMA: Disabled
[03/03/2022-16:03:49] [I] Data transfers: Enabled
[03/03/2022-16:03:49] [I] Spin-wait: Disabled
[03/03/2022-16:03:49] [I] Multithreading: Disabled
[03/03/2022-16:03:49] [I] CUDA Graph: Disabled
[03/03/2022-16:03:49] [I] Separate profiling: Disabled
[03/03/2022-16:03:49] [I] Time Deserialize: Disabled
[03/03/2022-16:03:49] [I] Time Refit: Disabled
[03/03/2022-16:03:49] [I] Skip inference: Disabled
[03/03/2022-16:03:49] [I] Inputs:
[03/03/2022-16:03:49] [I] === Reporting Options ===
[03/03/2022-16:03:49] [I] Verbose: Disabled
[03/03/2022-16:03:49] [I] Averages: 10 inferences
[03/03/2022-16:03:49] [I] Percentile: 99
[03/03/2022-16:03:49] [I] Dump refittable layers:Disabled
[03/03/2022-16:03:49] [I] Dump output: Disabled
[03/03/2022-16:03:49] [I] Profile: Disabled
[03/03/2022-16:03:49] [I] Export timing to JSON file:
[03/03/2022-16:03:49] [I] Export output to JSON file:
[03/03/2022-16:03:49] [I] Export profile to JSON file:
[03/03/2022-16:03:49] [I]
[03/03/2022-16:03:49] [I] === Device Information ===
[03/03/2022-16:03:49] [I] Selected Device: Xavier
[03/03/2022-16:03:49] [I] Compute Capability: 7.2
[03/03/2022-16:03:49] [I] SMs: 8
[03/03/2022-16:03:49] [I] Compute Clock Rate: 1.377 GHz
[03/03/2022-16:03:49] [I] Device Global Memory: 31928 MiB
[03/03/2022-16:03:49] [I] Shared Memory per SM: 96 KiB
[03/03/2022-16:03:49] [I] Memory Bus Width: 256 bits (ECC disabled)
[03/03/2022-16:03:49] [I] Memory Clock Rate: 1.377 GHz
[03/03/2022-16:03:49] [I]
[03/03/2022-16:03:49] [I] TensorRT version: 8001
[03/03/2022-16:03:50] [I] [TRT] [MemUsageChange] Init CUDA: CPU +353, GPU +0, now: CPU 402, GPU 3768 (MiB)
[03/03/2022-16:03:50] [I] [TRT] Loaded engine size: 30 MB
[03/03/2022-16:03:50] [I] [TRT] [MemUsageSnapshot] deserializeCudaEngine begin: CPU 402 MiB, GPU 3768 MiB
[03/03/2022-16:03:50] [W] [TRT] Using an engine plan file across different models of devices is not recommended and is likely to affect performance or even cause errors.
[03/03/2022-16:03:51] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +226, GPU +233, now: CPU 647, GPU 4023 (MiB)
[03/03/2022-16:03:52] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +307, GPU +304, now: CPU 954, GPU 4327 (MiB)
[03/03/2022-16:03:52] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 954, GPU 4327 (MiB)
[03/03/2022-16:03:52] [I] [TRT] [MemUsageSnapshot] deserializeCudaEngine end: CPU 954 MiB, GPU 4327 MiB
[03/03/2022-16:03:52] [I] Engine loaded in 3.38538 sec.
[03/03/2022-16:03:52] [I] [TRT] [MemUsageSnapshot] ExecutionContext creation begin: CPU 924 MiB, GPU 4297 MiB
[03/03/2022-16:03:52] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 924, GPU 4297 (MiB)
[03/03/2022-16:03:52] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +0, now: CPU 924, GPU 4297 (MiB)
[03/03/2022-16:03:52] [I] [TRT] [MemUsageSnapshot] ExecutionContext creation end: CPU 926 MiB, GPU 4417 MiB
[03/03/2022-16:03:52] [I] Created input binding for input with dimensions 1x320x480x3
[03/03/2022-16:03:52] [I] Created output binding for sigmoid with dimensions 1x320x480x1
[03/03/2022-16:03:52] [I] Starting inference
[03/03/2022-16:03:56] [I] Warmup completed 2 queries over 200 ms
[03/03/2022-16:03:56] [I] Timing trace has 33 queries over 3.09938 s
[03/03/2022-16:03:56] [I]
[03/03/2022-16:03:56] [I] === Trace details ===
[03/03/2022-16:03:56] [I] Trace averages of 10 runs:
[03/03/2022-16:03:56] [I] Average on 10 runs - GPU latency: 100.411 ms - Host latency: 100.705 ms (end to end 100.717 ms, enqueue 100.345 ms)
[03/03/2022-16:03:56] [I] Average on 10 runs - GPU latency: 98.3098 ms - Host latency: 98.6065 ms (end to end 98.6184 ms, enqueue 98.4107 ms)
[03/03/2022-16:03:56] [I] Average on 10 runs - GPU latency: 89.1673 ms - Host latency: 89.43 ms (end to end 89.4423 ms, enqueue 91.116 ms)
[03/03/2022-16:03:56] [I]
[03/03/2022-16:03:56] [I] === Performance summary ===
[03/03/2022-16:03:56] [I] Throughput: 10.6473 qps
[03/03/2022-16:03:56] [I] Latency: min = 70.1868 ms, max = 100.871 ms, mean = 93.9085 ms, median = 98.0884 ms, percentile(99%) = 100.871 ms
[03/03/2022-16:03:56] [I] End-to-End Host Latency: min = 70.1992 ms, max = 100.882 ms, mean = 93.9205 ms, median = 98.1062 ms, percentile(99%) = 100.882 ms
[03/03/2022-16:03:56] [I] Enqueue Time: min = 69.8557 ms, max = 100.793 ms, mean = 94.4267 ms, median = 97.7693 ms, percentile(99%) = 100.793 ms
[03/03/2022-16:03:56] [I] H2D Latency: min = 0.131592 ms, max = 0.217773 ms, mean = 0.20166 ms, median = 0.214844 ms, percentile(99%) = 0.217773 ms
[03/03/2022-16:03:56] [I] GPU Compute Time: min = 70.0078 ms, max = 100.58 ms, mean = 93.6333 ms, median = 97.7971 ms, percentile(99%) = 100.58 ms
[03/03/2022-16:03:56] [I] D2H Latency: min = 0.0473633 ms, max = 0.0871582 ms, mean = 0.0735067 ms, median = 0.076355 ms, percentile(99%) = 0.0871582 ms
[03/03/2022-16:03:56] [I] Total Host Walltime: 3.09938 s
[03/03/2022-16:03:56] [I] Total GPU Compute Time: 3.0899 s
[03/03/2022-16:03:56] [W] * Throughput may be bound by Enqueue Time rather than GPU Compute and the GPU may be under-utilized.
[03/03/2022-16:03:56] [W] If not already in use, --useCudaGraph (utilize CUDA graphs where possible) may increase the throughput.
[03/03/2022-16:03:56] [I] Explanations of the performance metrics are printed in the verbose logs.
[03/03/2022-16:03:56] [I]
&&&& PASSED TensorRT.trtexec [TensorRT v8001] # ./trtexec --loadEngine=seg_model_unet_40_ep_op13.trt --shapes=input:1*3*320*480
[03/03/2022-16:03:56] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 927, GPU 4416 (MiB)
Throughput is only ~10.6 qps (mean GPU compute time ~94 ms) on Xavier, and trtexec warns that throughput may be bound by enqueue time. Please advise on how to proceed further.
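One detail worth double-checking before profiling further: the log reports "Input inference shape: input=1", which suggests the `--shapes` value was not parsed as intended. trtexec expects `x` as the dimension separator, not `*` (which the shell may also expand as a glob). Note also that the engine's input binding is reported as 1x320x480x3 (NHWC), not 1x3x320x480. A hedged sketch of a corrected re-run, using the engine name from the log and the `--useCudaGraph` flag the tool itself suggests (for a fixed-shape engine the `--shapes` flag may be ignored entirely, so this is an assumption about intent, not a guaranteed fix):

```shell
# Re-run with 'x' as the dimension separator and quotes to avoid shell
# globbing; dimensions follow the NHWC binding reported in the log.
# --useCudaGraph is the mitigation trtexec's own warning recommends for
# enqueue-bound throughput.
./trtexec --loadEngine=seg_model_unet_40_ep_op13.trt \
          --shapes="input:1x320x480x3" \
          --useCudaGraph
```

If throughput remains enqueue-bound after this, comparing against a rebuild from the ONNX file on the target device (the log also warns that the engine plan was built on a different device model) would be the next thing to isolate.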