Segmentation fault when updating from enqueueV2() to enqueueV3()

Description

Following up on my post about deprecated functions in TensorRT 8.5, I updated my code from enqueueV2 to enqueueV3.

From:

void GreenModel::LaunchInferenceAsyc() {
    cudaMemcpyAsync(buffer_bindings[BINDING_PTR_IDX_INPUT], input.data(), input.size() * sizeof(float), cudaMemcpyHostToDevice, stream);
    context->enqueueV2(buffer_bindings, stream, nullptr);
    cudaMemcpyAsync(output.data(), buffer_bindings[BINDING_PTR_IDX_OUTPUT], output.size() * sizeof(float), cudaMemcpyDeviceToHost, stream);
}

To:

void GreenModel::LaunchInferenceAsyc() {
    cudaMemcpyAsync(buffer_bindings[BINDING_PTR_IDX_INPUT], input.data(), input.size() * sizeof(float), cudaMemcpyHostToDevice, stream);
    context->enqueueV3(stream);
    cudaMemcpyAsync(output.data(), buffer_bindings[BINDING_PTR_IDX_OUTPUT], output.size() * sizeof(float), cudaMemcpyDeviceToHost, stream);
}

I’m getting a segmentation fault in:

bool enqueueV3(cudaStream_t stream) noexcept
{
    return mImpl->enqueueV3(stream);
}

It’s working fine with enqueueV2.
Am I missing an extra step here?

Environment

TensorRT Version: 8.5.2
GPU Type: NVIDIA Jetson AGX Orin
CUDA Version: 11.4
Operating System + Version: Ubuntu 20.04 (aarch64)

Hi @lioriz,
Could you please share the verbose logs here?

Thanks


@AakankshaS Here are the verbose logs:

INFO: Loaded engine size: 43 MiB
WARNING: Using an engine plan file across different models of devices is not recommended and is likely to affect performance or even cause errors.
VERBOSE: Trying to load shared library libcudnn.so.8
VERBOSE: Loaded shared library libcudnn.so.8
VERBOSE: Using cuDNN as plugin tactic source
INFO: [MemUsageChange] Init cuDNN: CPU +619, GPU +660, now: CPU 942, GPU 10444 (MiB)
INFO: [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +42, now: CPU 0, GPU 42 (MiB)
INFO: [MemUsageChange] Init cuDNN: CPU +0, GPU +3, now: CPU 901, GPU 10409 (MiB)
VERBOSE: Using cuDNN as core library tactic source
VERBOSE: Deserialization required 1644704 microseconds.
VERBOSE: Trying to load shared library libcudnn.so.8
VERBOSE: Loaded shared library libcudnn.so.8
VERBOSE: Using cuDNN as plugin tactic source
VERBOSE: Using cuDNN as core library tactic source
VERBOSE: Total per-runner device persistent memory is 0
VERBOSE: Total per-runner host persistent memory is 214528
VERBOSE: Allocated activation device memory of size 24331264
INFO: [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +23, now: CPU 0, GPU 65 (MiB)

Solved by calling setTensorAddress() on the execution context when initializing CUDA.

Pseudocode:

void InitCuda() {
    // Allocate device memory for the input and output tensors
    cudaMalloc(&buffer_bindings[BINDING_PTR_IDX_INPUT], input_size_bytes);
    cudaMalloc(&buffer_bindings[BINDING_PTR_IDX_OUTPUT], output_size_bytes);

    this->context = this->model_->createExecutionContext();

    // enqueueV3 no longer takes a bindings array, so the device addresses
    // must be registered on the context by tensor name
    context->setTensorAddress(input_image_blob_name_.c_str(), buffer_bindings[BINDING_PTR_IDX_INPUT]);
    context->setTensorAddress(output_blob_name_.c_str(), buffer_bindings[BINDING_PTR_IDX_OUTPUT]);
}

void LaunchInferenceAsyc() {
    // Copy the input into the device buffer that was registered with setTensorAddress()
    cudaMemcpyAsync(buffer_bindings[BINDING_PTR_IDX_INPUT], input.data(), input.size() * sizeof(float), cudaMemcpyHostToDevice, stream);

    // The tensor addresses were set in InitCuda(), so only the stream is passed
    context->enqueueV3(stream);

    cudaMemcpyAsync(output.data(), buffer_bindings[BINDING_PTR_IDX_OUTPUT], output.size() * sizeof(float), cudaMemcpyDeviceToHost, stream);
}
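
As a possible refinement (a sketch that is not in my actual code), the tensor names can also be queried from the engine instead of being hard-coded. This assumes the same buffer_bindings array and BINDING_PTR_IDX_INPUT / BINDING_PTR_IDX_OUTPUT indices as above, with exactly one input and one output tensor, using the TensorRT 8.5 I/O tensor API:

#include <NvInfer.h>

// Hypothetical helper: register every I/O tensor's device buffer on the context
// by asking the engine for the tensor names instead of hard-coding them.
void SetAllTensorAddresses(nvinfer1::ICudaEngine* engine,
                           nvinfer1::IExecutionContext* context,
                           void* const* buffer_bindings) {
    for (int32_t i = 0; i < engine->getNbIOTensors(); ++i) {
        char const* name = engine->getIOTensorName(i);
        // Assumption: the input binding slot holds the single input tensor,
        // the output slot holds the single output tensor
        bool const is_input =
            engine->getTensorIOMode(name) == nvinfer1::TensorIOMode::kINPUT;
        context->setTensorAddress(
            name,
            buffer_bindings[is_input ? BINDING_PTR_IDX_INPUT : BINDING_PTR_IDX_OUTPUT]);
    }
}

Also note that both cudaMemcpyAsync calls are asynchronous, so the host-side output buffer is only valid after the stream has been synchronized (for example with cudaStreamSynchronize(stream)) before reading output.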

Thanks to this issue and this code example.

In addition, this issue was very interesting and helped my understanding: enqueueV3 is slower than enqueueV2 · Issue #2877 · NVIDIA/TensorRT · GitHub.
