Custom Bounding Box Parsing function for RetinaNet in DeepStream without handling Anchors and Backbone

• Hardware Platform (Jetson / GPU) : NVIDIA Jetson AGX Orin
• DeepStream Version : 7.1
• JetPack Version (valid for Jetson only) : 6.1
• TensorRT Version : 8.6.2.3
• Issue Type( questions, new requirements, bugs) : question
Hello,

I am working with a PyTorch RetinaNet model that has a ResNet-50 backbone. I have exported this model to ONNX and inspected it with Netron.app and polygraphy. Here is the output of polygraphy:

[I] ==== ONNX Model ====
    Name: main_graph | ONNX Opset: 17
    
    ---- 1 Graph Input(s) ----
    {input [dtype=float32, shape=('batch_size', 3, 1080, 1920)]}
    
    ---- 2 Graph Output(s) ----
    {cls_logits [dtype=float32, shape=('batch_size', 'Concatcls_logits_dim_1', 2)],
     bbox_regression [dtype=float32, shape=('batch_size', 'Concatbbox_regression_dim_1', 4)]}
    
    ---- 195 Initializer(s) ----
    
    ---- 626 Node(s) ----

I would like to use this model in a DeepStream app. Since this is an object detection model, I implemented a custom bounding box parser function in C++ to handle the network's outputs. Here’s my current implementation:

extern "C" bool NvDsInferParseCustomRetinaNet(
    const std::vector<NvDsInferLayerInfo> &outputLayersInfo,
    const NvDsInferNetworkInfo &networkInfo,
    const NvDsInferParseDetectionParams &detectionParams,
    std::vector<NvDsInferObjectDetectionInfo> &objectList)
{
    // Ensure output layers are valid
    if (outputLayersInfo.size() != 2)
    {
        std::cerr << "Error: Expected 2 output layers (class logits and bbox regression)." << std::endl;
        return false;
    }

    const NvDsInferLayerInfo *classLayer = nullptr;
    const NvDsInferLayerInfo *boxLayer = nullptr;

    for (const auto &layer : outputLayersInfo)
    {
        if (strcmp(layer.layerName, "cls_logits") == 0)
            classLayer = &layer;
        else if (strcmp(layer.layerName, "bbox_regression") == 0)
            boxLayer = &layer;
    }

    if (!classLayer || !boxLayer)
    {
        std::cerr << "Error: Missing class logits or bbox regression layers." << std::endl;
        return false;
    }

    // Dimensions of classLayer and boxLayer (observed for this model):
    // classLayer->inferDims.numElements == 778410 (= 389205 * 2)
    // classLayer->inferDims.d[0] == 389205, classLayer->inferDims.d[1] == 2
    // boxLayer->inferDims.numElements == 1556820 (= 389205 * 4)
    // boxLayer->inferDims.d[0] == 389205, boxLayer->inferDims.d[1] == 4

    // Extract buffers from classLayer and boxLayer
    float *classBuffer = (float *)classLayer->buffer;
    float *boxBuffer = (float *)boxLayer->buffer;
    const int numClasses = classLayer->inferDims.d[1];
    int numDetsToParse = classLayer->inferDims.numElements / numClasses;

    // Get parameters from config
    const float confidenceThreshold = detectionParams.perClassPreclusterThreshold[0];

    // Temporary vectors for detections prior to NMS (currently unused;
    // note that no NMS is actually performed in this parser)
    std::vector<NvDsInferObjectDetectionInfo> allDetections;
    std::vector<bool> keep;

    // Iterate through all detections and process them
    objectList.clear();
    for (int i = 0; i < numDetsToParse; i++)
    {
        // Convert the two class logits to probabilities with a sigmoid,
        // matching torchvision's RetinaNet classification head
        float scores[2];
        scores[0] = sigmoid(classBuffer[i * numClasses]);
        scores[1] = sigmoid(classBuffer[i * numClasses + 1]);

        // Find highest scoring class
        int maxClassId = (scores[1] > scores[0]) ? 1 : 0;
        float maxScore = std::max(scores[0], scores[1]);

        // Filter by confidence threshold
        if (maxScore < confidenceThreshold)
            continue;

        // Get bounding box coordinates
        float x1 = boxBuffer[i * 4];
        float y1 = boxBuffer[i * 4 + 1];
        float x2 = boxBuffer[i * 4 + 2];
        float y2 = boxBuffer[i * 4 + 3];

        // Clip boxes to image boundaries
        x1 = std::max(0.0f, std::min(x1, (float)networkInfo.width));
        y1 = std::max(0.0f, std::min(y1, (float)networkInfo.height));
        x2 = std::max(0.0f, std::min(x2, (float)networkInfo.width));
        y2 = std::max(0.0f, std::min(y2, (float)networkInfo.height));

        // Create detection object
        NvDsInferObjectDetectionInfo detection;
        detection.classId = maxClassId;
        detection.detectionConfidence = maxScore;
        detection.left = x1;
        detection.top = y1;
        detection.width = x2 - x1;
        detection.height = y2 - y1;

        objectList.push_back(detection);
    }
    return true;
}

Issue:

RetinaNet relies on anchors and feature pyramid networks (FPNs) to predict bounding boxes, which means that directly interpreting the cls_logits and bbox_regression outputs does not yield correct bounding boxes without additional processing.
In PyTorch, the model's forward function relies on the AnchorGenerator, RetinaNetHead, and GeneralizedRCNNTransform, among other components, to generate the final detections; it then applies postprocess_detections (decoding the regressions against the anchors, score thresholding, and NMS) and rescales the boxes back to the original image size. Reimplementing this entire post-processing pipeline (anchors + NMS) in C++ would require significant effort, as the sketch below suggests.
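
To give a sense of that effort, here is a minimal sketch of just the per-anchor decode step that torchvision applies inside postprocess_detections, assuming the default BoxCoder weights of (1, 1, 1, 1) and omitting the dw/dh clamping; the anchor grids themselves (one per FPN level) would also have to be regenerated in C++:

#include <cmath>

// Corner-format box (x1, y1, x2, y2)
struct Box { float x1, y1, x2, y2; };

// Decode one (dx, dy, dw, dh) regression vector against its matching anchor,
// following torchvision's BoxCoder.decode_single with unit weights
static Box decodeSingle(const Box &anchor, const float reg[4])
{
    const float w  = anchor.x2 - anchor.x1;
    const float h  = anchor.y2 - anchor.y1;
    const float cx = anchor.x1 + 0.5f * w;
    const float cy = anchor.y1 + 0.5f * h;

    // Shift the anchor center and scale its size by the predicted deltas
    const float predCx = reg[0] * w + cx;
    const float predCy = reg[1] * h + cy;
    const float predW  = std::exp(reg[2]) * w;
    const float predH  = std::exp(reg[3]) * h;

    return { predCx - 0.5f * predW, predCy - 0.5f * predH,
             predCx + 0.5f * predW, predCy + 0.5f * predH };
}

On top of this, per-level top-k selection and class-wise NMS would still be needed, which is why fusing these steps into the exported graph is attractive.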

I came across NVIDIA’s RetinaNet example repository and, surprisingly, its nvdsparsebbox_retinanet.cpp does not include anchor generation or backbone processing at all.

Questions:

  • How does NVIDIA’s example handle anchors and backbone if they are not explicitly present in the bounding box parsing code?
  • Is this functionality incorporated during ONNX to TensorRT conversion, embedding anchor generation directly into the TensorRT model?
  • How can I structure my custom bounding box parsing function so that I do not need to handle anchors and backbone manually, similar to NVIDIA’s RetinaNet example?

Looking forward to any insights on handling this efficiently within DeepStream!

This requires you to modify the PyTorch RetinaNet model so that these post-processing operators are fused into the exported ONNX graph; TensorRT can then build the engine and expose output layers that are compatible with DeepStream post-processing.

However, DeepStream 7.1 no longer supports RetinaNet; you can refer to some legacy code.
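
For illustration: if the post-processing is fused so that the exported graph ends in an NMS operator such as TensorRT's EfficientNMS_TRT plugin (whose conventional outputs are num_detections, detection_boxes, detection_scores, and detection_classes), the DeepStream parser shrinks to a copy loop. A rough sketch, with the layer names assumed from that plugin's convention rather than taken from your model:

#include <cstring>
#include <vector>

#include "nvdsinfer_custom_impl.h"

// Sketch of a parser for an engine with decode + NMS fused into the graph
// (EfficientNMS_TRT-style outputs; all layer names here are assumptions)
extern "C" bool NvDsInferParseCustomRetinaNetFused(
    const std::vector<NvDsInferLayerInfo> &outputLayersInfo,
    const NvDsInferNetworkInfo &networkInfo,
    const NvDsInferParseDetectionParams &detectionParams,
    std::vector<NvDsInferObjectDetectionInfo> &objectList)
{
    (void)networkInfo;
    (void)detectionParams; // unused: thresholding already ran in the engine

    const NvDsInferLayerInfo *numLayer = nullptr, *boxLayer = nullptr,
                             *scoreLayer = nullptr, *clsLayer = nullptr;
    for (const auto &l : outputLayersInfo)
    {
        if (!strcmp(l.layerName, "num_detections"))         numLayer   = &l;
        else if (!strcmp(l.layerName, "detection_boxes"))   boxLayer   = &l;
        else if (!strcmp(l.layerName, "detection_scores"))  scoreLayer = &l;
        else if (!strcmp(l.layerName, "detection_classes")) clsLayer   = &l;
    }
    if (!numLayer || !boxLayer || !scoreLayer || !clsLayer)
        return false;

    // The engine already applied decoding, thresholding, and NMS, so every
    // surviving entry can be forwarded as-is
    const int numDets   = ((const int *)numLayer->buffer)[0];
    const float *boxes  = (const float *)boxLayer->buffer; // x1, y1, x2, y2
    const float *scores = (const float *)scoreLayer->buffer;
    const int *classes  = (const int *)clsLayer->buffer;

    objectList.clear();
    for (int i = 0; i < numDets; i++)
    {
        NvDsInferObjectDetectionInfo det;
        det.classId             = classes[i];
        det.detectionConfidence = scores[i];
        det.left   = boxes[i * 4];
        det.top    = boxes[i * 4 + 1];
        det.width  = boxes[i * 4 + 2] - boxes[i * 4];
        det.height = boxes[i * 4 + 3] - boxes[i * 4 + 1];
        objectList.push_back(det);
    }
    return true;
}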

@junshengy Thank you for your response!

Just to clarify - by “fusing” the anchor postprocessing and NMS into the ONNX model as one of the final layers, does this mean that all RetinaNet postprocessing will be handled within the model itself?

Also, why is RetinaNet no longer supported in DeepStream 7.1? Does this mean I won’t be able to use it in my DeepStream pipeline at all?

Regarding the TAO Toolkit, if I use it, will it ensure that the postprocessing step is included in the final model? Additionally, how can I run TAO Toolkit on Jetson? As far as I know, TAO isn’t officially supported on Jetson devices.

Looking forward to your insights!

Not exactly. In the demo I gave above, converting the output layers to bounding boxes still requires a parser in DeepStream; that post-processing is not part of the model.

You can use it with versions prior to DS-7.0, or you can port it to the latest version yourself, but it will not be officially supported.

Please refer to the TAO documentation above. You need to train on a dGPU, export the model, and then deploy it to Jetson.
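
For completeness, a sketch of how such a custom parser might be wired into nvinfer; the file names and the function name below are placeholders, and cluster-mode=4 turns off DeepStream-side clustering because the NMS already runs inside the engine:

[property]
onnx-file=retinanet_fused.onnx
# 2 = FP16 precision
network-mode=2
num-detected-classes=2
# 4 = no clustering; NMS is already fused into the model
cluster-mode=4
parse-bbox-func-name=NvDsInferParseCustomRetinaNetFused
custom-lib-path=./libnvds_retinanet_parser.so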

@junshengy Thank you for your reply.

Do you mean it still requires a custom bounding box parser in C++? But instead of processing all the anchors, threshold filtering, and NMS there, we “fuse” those functionalities into the ONNX model, and the C++ custom bounding box parser then handles only the already-processed bounding boxes, i.e. the ones we can display directly. Is that correct?

I still do not understand this. I currently have a RetinaNet model that runs fine on DeepStream 7.1; however, it runs without a proper bounding box parser function because, as you wrote, the NMS should be “fused” into the model. Why would this model not run on DeepStream 7.1? I have no trouble converting it to TensorRT and then appending an nvinfer element to the pipeline.

Thank you, I will try training it on dGPU.

Yes.

I mean the model provided by deepstream_tao_apps, not your model. If you want to use the deepstream_tao_apps post-processing, you need to modify the output layers of your model to match those provided by TAO.

