• Hardware Platform (Jetson / GPU) : NVIDIA Jetson AGX Orin
• DeepStream Version : 7.1
• JetPack Version (valid for Jetson only) : 6.1
• TensorRT Version : 8.6.2.3
• Issue Type (questions, new requirements, bugs) : question
Hello,
I am working with a PyTorch RetinaNet model that has a ResNet-50 backbone. I have exported the model to ONNX, and below are its properties as reported by Netron.app and polygraphy.
Here is the polygraphy output:
[I] ==== ONNX Model ====
Name: main_graph | ONNX Opset: 17
---- 1 Graph Input(s) ----
{input [dtype=float32, shape=('batch_size', 3, 1080, 1920)]}
---- 2 Graph Output(s) ----
{cls_logits [dtype=float32, shape=('batch_size', 'Concatcls_logits_dim_1', 2)],
bbox_regression [dtype=float32, shape=('batch_size', 'Concatbbox_regression_dim_1', 4)]}
---- 195 Initializer(s) ----
---- 626 Node(s) ----
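For reference, the summary above was produced with polygraphy's model inspection command, roughly as follows (the model file name is a placeholder):

polygraphy inspect model retinanet_resnet50.onnx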
I would like to use this model in a DeepStream app. Since this is an object detection model, I implemented a custom bounding-box parser function in C++ to handle the outputs of the network. Here is my current implementation:
#include <algorithm>
#include <cmath>
#include <cstring>
#include <iostream>
#include <vector>

#include "nvdsinfer_custom_impl.h"

// Logistic sigmoid used to turn raw class logits into probabilities
static inline float sigmoid(float x)
{
    return 1.0f / (1.0f + std::exp(-x));
}

extern "C" bool NvDsInferParseCustomRetinaNet(
    const std::vector<NvDsInferLayerInfo> &outputLayersInfo,
    const NvDsInferNetworkInfo &networkInfo,
    const NvDsInferParseDetectionParams &detectionParams,
    std::vector<NvDsInferObjectDetectionInfo> &objectList)
{
    // Ensure output layers are valid
    if (outputLayersInfo.size() != 2)
    {
        std::cerr << "Error: Expected 2 output layers (class logits and bbox regression)." << std::endl;
        return false;
    }

    const NvDsInferLayerInfo *classLayer = nullptr;
    const NvDsInferLayerInfo *boxLayer = nullptr;
    for (const auto &layer : outputLayersInfo)
    {
        if (strcmp(layer.layerName, "cls_logits") == 0)
            classLayer = &layer;
        else if (strcmp(layer.layerName, "bbox_regression") == 0)
            boxLayer = &layer;
    }
    if (!classLayer || !boxLayer)
    {
        std::cerr << "Error: Missing class logits or bbox regression layers." << std::endl;
        return false;
    }

    // Observed dimensions of classLayer and boxLayer for this model:
    // classLayer->inferDims.numElements == 778410 (= 389205 * 2)
    // classLayer->inferDims.d[0] == 389205, classLayer->inferDims.d[1] == 2
    // boxLayer->inferDims.numElements == 1556820 (= 389205 * 4)
    // boxLayer->inferDims.d[0] == 389205, boxLayer->inferDims.d[1] == 4

    // Extract buffers from classLayer and boxLayer
    const float *classBuffer = static_cast<const float *>(classLayer->buffer);
    const float *boxBuffer = static_cast<const float *>(boxLayer->buffer);
    const int numClasses = classLayer->inferDims.d[1];
    const int numDetsToParse = classLayer->inferDims.numElements / numClasses;

    // Get parameters from config (the class-0 threshold is used for all classes here)
    const float confidenceThreshold = detectionParams.perClassPreclusterThreshold[0];

    // Temporary vectors intended to hold detections before NMS (NMS not implemented yet)
    std::vector<NvDsInferObjectDetectionInfo> allDetections;
    std::vector<bool> keep;

    // Iterate through all detections and process them
    objectList.clear();
    for (int i = 0; i < numDetsToParse; i++)
    {
        // Find the highest-scoring class for this prediction
        int maxClassId = 0;
        float maxScore = -1.0f;
        for (int c = 0; c < numClasses; c++)
        {
            float score = sigmoid(classBuffer[i * numClasses + c]);
            if (score > maxScore)
            {
                maxScore = score;
                maxClassId = c;
            }
        }

        // Filter by confidence threshold
        if (maxScore < confidenceThreshold)
            continue;

        // Get bounding box coordinates
        float x1 = boxBuffer[i * 4];
        float y1 = boxBuffer[i * 4 + 1];
        float x2 = boxBuffer[i * 4 + 2];
        float y2 = boxBuffer[i * 4 + 3];

        // Clip boxes to image boundaries
        x1 = std::max(0.0f, std::min(x1, (float)networkInfo.width));
        y1 = std::max(0.0f, std::min(y1, (float)networkInfo.height));
        x2 = std::max(0.0f, std::min(x2, (float)networkInfo.width));
        y2 = std::max(0.0f, std::min(y2, (float)networkInfo.height));

        // Create detection object
        NvDsInferObjectDetectionInfo detection;
        detection.classId = maxClassId;
        detection.detectionConfidence = maxScore;
        detection.left = x1;
        detection.top = y1;
        detection.width = x2 - x1;
        detection.height = y2 - y1;
        objectList.push_back(detection);
    }
    return true;
}
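For completeness, this is roughly how the parser is wired into gst-nvinfer in my config file (the file names, paths, and library name below are placeholders for my setup; only the parser-related keys matter here):

[property]
onnx-file=retinanet_resnet50.onnx
network-mode=2
network-type=0
num-detected-classes=2
cluster-mode=2
parse-bbox-func-name=NvDsInferParseCustomRetinaNet
custom-lib-path=/path/to/libnvdsinfer_customparser_retinanet.so

As far as I understand, cluster-mode=2 makes nvinfer run NMS on whatever detections the parser returns.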
Issue:
RetinaNet relies on anchors and a feature pyramid network (FPN) to predict bounding boxes, which means that directly interpreting the cls_logits and bbox_regression outputs does not yield correct bounding boxes without additional processing: bbox_regression holds per-anchor offsets rather than absolute coordinates.
In PyTorch, the model's forward function relies on the AnchorGenerator, RetinaNetHead, and GeneralizedRCNNTransform, among other components, to generate the final detections, and it additionally applies postprocess_detections (box decoding plus NMS) and a transform back to the original image size. Reimplementing this entire post-processing pipeline (anchor generation + box decoding + NMS) in C++ would require significant effort.
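To give a sense of what that would involve, here is a minimal, untested sketch of just the box-decoding step, assuming torchvision's default BoxCoder with weights (1, 1, 1, 1); the decodeBox helper and its anchor input are hypothetical, and the anchors themselves would still have to be regenerated in C++ exactly as AnchorGenerator produces them for each FPN level:

#include <cmath>

// Hypothetical helper: decode one (dx, dy, dw, dh) regression vector against its
// anchor (x1, y1, x2, y2), following torchvision's BoxCoder.decode_single with
// weights (1, 1, 1, 1). torchvision also clamps dw and dh to log(1000/16) before
// the exp; that is omitted here for brevity.
static void decodeBox(const float anchor[4], const float delta[4], float out[4])
{
    float aw  = anchor[2] - anchor[0];   // anchor width
    float ah  = anchor[3] - anchor[1];   // anchor height
    float acx = anchor[0] + 0.5f * aw;   // anchor center x
    float acy = anchor[1] + 0.5f * ah;   // anchor center y

    float cx = delta[0] * aw + acx;      // predicted center x
    float cy = delta[1] * ah + acy;      // predicted center y
    float w  = std::exp(delta[2]) * aw;  // predicted width
    float h  = std::exp(delta[3]) * ah;  // predicted height

    out[0] = cx - 0.5f * w;              // x1
    out[1] = cy - 0.5f * h;              // y1
    out[2] = cx + 0.5f * w;              // x2
    out[3] = cy + 0.5f * h;              // y2
}

One such anchor would be needed for every one of the 389205 predictions (for a 1920x1080 input that count matches 9 anchors per location summed over the P3 to P7 feature maps), and NMS would still have to run on the decoded boxes afterwards.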
I came across NVIDIA’s RetinaNet example repository, and surprisingly the nvdsparsebbox_retinanet.cpp file there does not include any anchor generation or backbone-related processing.
Questions:
- How does NVIDIA’s example handle anchors and the backbone if they are not explicitly present in the bounding-box parsing code?
- Is this functionality incorporated during the ONNX-to-TensorRT conversion, i.e. is anchor generation embedded directly into the TensorRT engine?
- How can I structure my custom bounding-box parsing function so that I do not need to handle anchors and the backbone manually, similar to NVIDIA’s RetinaNet example?
Looking forward to any insights on handling this efficiently within DeepStream!