Hello,
I'm using the code below to create a CUDA stream and run inference on an SSD MobileNet V2 320x320 model converted to TensorRT. The inference itself runs fast, but I'm seeing extreme slowness when moving data back from device to host in the d_to_h step: inference takes about 5 ms while the transfer takes about 20 ms.
Is there anything in the code I can change to improve the transfer speed, or could something else be the issue?
I'm using a Xavier and TensorRT 8.
Thanks
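For context, the standalone test below is how I would compare a raw device-to-host copy into ordinary (pageable) numpy memory against one into page-locked (pinned) memory allocated with cuda.pagelocked_empty, to see whether the copy itself or something else queued on the stream is what is slow. The 4 MB buffer size is arbitrary and not taken from my model.

# Standalone sketch: time a device-to-host copy with pageable vs. pinned host
# memory. The payload size is just an example, not the real SSD output size.
import numpy as np
import pycuda.driver as cuda
import pycuda.autoinit
from time import perf_counter

nbytes = 4 * 1024 * 1024                                   # 4 MB example payload
d_buf = cuda.mem_alloc(nbytes)                             # device buffer
h_pageable = np.empty(nbytes, dtype=np.uint8)              # ordinary (pageable) numpy memory
h_pinned = cuda.pagelocked_empty(nbytes, dtype=np.uint8)   # page-locked (pinned) host memory
stream = cuda.Stream()

for name, host in (("pageable", h_pageable), ("pinned", h_pinned)):
    for rep in range(3):                                   # first rep may include warm-up overhead
        t0 = perf_counter()
        cuda.memcpy_dtoh_async(host, d_buf, stream)
        stream.synchronize()                               # make the timing honest
        print(name, "rep", rep, "d_to_h:", perf_counter() - t0)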
import os
import sys
import time
from time import sleep
import ctypes
import argparse
import numpy as np
import tensorrt as trt
import pycuda.driver as cuda
import pycuda.autoinit
import threading
from concurrent.futures import ThreadPoolExecutor
from multiprocessing import Process, Queue, Manager
import multiprocessing
import cv2
class TensorRTInfer:
    """
    Implements inference for the Model TensorRT engine.
    """

    def __init__(self, engine):
        """
        :param engine: The deserialized TensorRT engine to run inference with.
        """
        # Load TRT engine
        self.cfx = cuda.Device(0).make_context()
        self.stream = cuda.Stream()
        self.engine = engine
        self.context = self.engine.create_execution_context()

        # Setup I/O bindings
        self.inputs1 = []
        self.outputs1 = []
        self.allocations1 = []
        for i in range(self.engine.num_bindings):
            name = self.engine.get_binding_name(i)
            dtype = self.engine.get_binding_dtype(i)
            shape = self.engine.get_binding_shape(i)
            size = np.dtype(trt.nptype(dtype)).itemsize
            for s in shape:
                size *= s
            allocation1 = cuda.mem_alloc(size)
            binding1 = {
                'index': i,
                'name': name,
                'dtype': np.dtype(trt.nptype(dtype)),
                'shape': list(shape),
                'allocation': allocation1,
            }
            self.allocations1.append(allocation1)
            if self.engine.binding_is_input(i):
                self.inputs1.append(binding1)
            else:
                self.outputs1.append(binding1)

        # Host-side output buffers, one per output binding
        self.outputs2 = []
        for shape, dtype in self.output_spec():
            self.outputs2.append(np.zeros(shape, dtype))
        print("done building..")

    def input_spec(self):
        """
        Get the specs for the input tensor of the network. Useful to prepare memory allocations.
        :return: Two items, the shape of the input tensor and its (numpy) datatype.
        """
        return self.inputs1[0]['shape'], self.inputs1[0]['dtype']

    def output_spec(self):
        """
        Get the specs for the output tensors of the network. Useful to prepare memory allocations.
        :return: A list with two items per element, the shape and (numpy) datatype of each output tensor.
        """
        specs = []
        for o in self.outputs1:
            specs.append((o['shape'], o['dtype']))
        return specs

    def h_to_d(self, batch):
        # Copy the input batch to the device on this instance's stream
        self.batch = batch
        cuda.memcpy_htod_async(self.inputs1[0]['allocation'], np.ascontiguousarray(batch), self.stream)

    def destroy(self):
        self.cfx.pop()

    def d_to_h(self):
        # Copy every output binding back to its matching host buffer
        for o in range(len(self.outputs2)):
            cuda.memcpy_dtoh_async(self.outputs2[o], self.outputs1[o]['allocation'], self.stream)
        return self.outputs2

    def infer_this(self):
        self.cfx.push()
        self.context.execute_async(batch_size=1, bindings=self.allocations1, stream_handle=self.stream.handle)
        self.cfx.pop()
if __name__ == '__main__':
    logger = trt.Logger(trt.Logger.ERROR)
    trt.init_libnvinfer_plugins(logger, namespace="")
    with open('/home/zenith/Desktop/model1_16.trt', "rb") as f, trt.Runtime(logger) as runtime:
        engine1 = runtime.deserialize_cuda_engine(f.read())

    mat1 = cv2.imread('/home/zenith/Desktop/tf16/img108.jpg')
    stretch_near1 = cv2.resize(mat1, (640, 640))
    _image1 = np.expand_dims(stretch_near1, axis=0).astype(np.float32)
    images = np.random.rand(1, 640, 640, 3).astype(np.float32)

    trt_infer_big1 = TensorRTInfer(engine1)

    for n in range(100):
        tic = time.perf_counter()

        tiic = time.perf_counter()
        trt_infer_big1.h_to_d(_image1)
        tooc = time.perf_counter()
        print("h_to_d:" + str(tooc - tiic))

        act1 = time.perf_counter()
        trt_infer_big1.infer_this()
        act2 = time.perf_counter()
        print("inference:" + str(act2 - act1))

        teec = time.perf_counter()
        trt_infer_big1.d_to_h()
        toec = time.perf_counter()
        print("d_to_h:" + str(toec - teec))

        toc = time.perf_counter()
        print("whole time:" + str(toc - tic))
        sleep(0.05)
In the above for loop I'm trying to follow the CUDA concurrency pattern, which should reduce the time considerably compared with a purely sequential approach.
You will notice that d_to_h takes the largest amount of time in the loop, while the inference itself takes very little.
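One thing I'm unsure about with these numbers: as far as I understand, execute_async() only enqueues work on the stream and returns immediately, while memcpy_dtoh_async() into an ordinary (pageable) numpy array effectively blocks until the copy, and everything queued before it on the stream, has finished. So the d_to_h timing may be absorbing the wait for the inference itself. Below is a sketch of a variant that uses page-locked output buffers and an explicit synchronize; the helper name _build_pinned_outputs is made up, these are meant as drop-in methods for the TensorRTInfer class above, and I have not verified the effect on Xavier.

# Hedged sketch (not verified): same d_to_h step, but with pinned host buffers
# and an explicit synchronize, so the measured time covers only the copies
# rather than the wait for everything queued earlier on the stream.

def _build_pinned_outputs(self):
    # replaces the np.zeros() host buffers created in __init__
    self.outputs2 = [
        cuda.pagelocked_empty(shape, dtype) for shape, dtype in self.output_spec()
    ]

def d_to_h(self):
    for o in range(len(self.outputs2)):
        cuda.memcpy_dtoh_async(self.outputs2[o], self.outputs1[o]['allocation'], self.stream)
    self.stream.synchronize()   # wait for the inference + copies queued on self.stream
    return self.outputs2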
Thanks for sharing. However, it still goes through the samples by repeating the full cycle for each one, instead of processing them as a pipeline. The inference takes much less time than moving data between host and device.
That is, the sample code does the following: copy sample1's input from host to device, run inference, copy sample1's output from device to host; then copy sample2's input from host to device, run inference, copy sample2's output from device to host.
Is there a way to do this in a pipelined style, i.e. copy sample1's output from device to host while simultaneously copying sample2's input from host to device? Thanks!
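To make the question concrete, below is a rough sketch of the kind of double-buffered pipeline I have in mind (illustrative only, not tested on the Xavier): two slots, each with its own stream, execution context, device buffers and page-locked host buffers, so the output copy of one sample can overlap with the input copy and inference of the next. The names PipelineSlot and run_pipelined are made up, it assumes an explicit-batch engine with static shapes (hence execute_async_v2), and it leaves out the pycuda context push/pop used in the class above.

# Hedged sketch of a double-buffered pipeline: H2D, compute and D2H of
# consecutive samples are enqueued on two independent streams so they can overlap.
import numpy as np
import tensorrt as trt
import pycuda.driver as cuda
import pycuda.autoinit


class PipelineSlot:
    """One pipeline stage: its own stream, execution context and buffers."""

    def __init__(self, engine):
        self.stream = cuda.Stream()
        self.context = engine.create_execution_context()
        self.bindings = []
        self.outputs = []   # (device buffer, pinned host buffer) pairs
        for i in range(engine.num_bindings):
            dtype = np.dtype(trt.nptype(engine.get_binding_dtype(i)))
            shape = tuple(engine.get_binding_shape(i))
            dev = cuda.mem_alloc(int(dtype.itemsize * np.prod(shape)))
            self.bindings.append(int(dev))
            if engine.binding_is_input(i):
                self.input_dev = dev
            else:
                # pinned host memory so the async D2H copy can really overlap
                self.outputs.append((dev, cuda.pagelocked_empty(shape, dtype)))

    def enqueue(self, batch):
        # input copy, inference and output copies all go onto this slot's stream
        cuda.memcpy_htod_async(self.input_dev, np.ascontiguousarray(batch), self.stream)
        self.context.execute_async_v2(bindings=self.bindings,
                                      stream_handle=self.stream.handle)
        for dev, host in self.outputs:
            cuda.memcpy_dtoh_async(host, dev, self.stream)

    def wait(self):
        self.stream.synchronize()
        # copy out before the pinned buffers are reused by the next sample
        return [host.copy() for _, host in self.outputs]


def run_pipelined(engine, batches):
    slots = [PipelineSlot(engine), PipelineSlot(engine)]
    results = []
    for n, batch in enumerate(batches):
        if n >= 2:
            results.append(slots[n % 2].wait())  # finish sample n-2 before reusing its slot
        slots[n % 2].enqueue(batch)              # overlaps with the other slot's work
    for n in range(max(0, len(batches) - 2), len(batches)):
        results.append(slots[n % 2].wait())      # drain the last in-flight samples
    return results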