NVIDIA Triton Inference Server

Table of Contents

  • Home
  • Release notes
  • Compatibility matrix

Getting Started

  • Quick Deployment Guide by backend
    • TRT-LLM
    • vLLM
    • Python with HuggingFace
    • PyTorch
    • ONNX
    • TensorFlow
    • OpenVINO
  • LLM With TRT-LLM
  • Multimodal model
  • Stable Diffusion

Scaling Guide

  • Multi-Node (AWS)
  • Multi-Instance

LLM Features

  • Constrained Decoding
  • Function Calling
  • Speculative Decoding
    • TRT-LLM
    • vLLM

Client

  • API Reference
    • OpenAI API
    • KServe API
      • HTTP/REST and GRPC Protocol
      • Extensions
        • Binary tensor data extension
        • Classification extension
        • Schedule policy extension
        • Sequence extension
        • Shared-memory extension
        • Model configuration extension
        • Model repository extension
        • Statistics extension
        • Trace extension
        • Logging extension
        • Parameters extension
  • In-Process Triton Server API
    • C/C++
    • Python
      • Kafka I/O
      • Rayserve
    • Java
  • Client Libraries
  • Python tritonclient Package API
    • tritonclient
      • tritonclient.grpc
        • tritonclient.grpc.aio
        • tritonclient.grpc.auth
      • tritonclient.http
        • tritonclient.http.aio
        • tritonclient.http.auth
      • tritonclient.utils
        • tritonclient.utils.cuda_shared_memory
        • tritonclient.utils.shared_memory

Server

  • Concurrent Model Execution
  • Scheduler
  • Batcher
  • Model Pipelines
    • Ensemble
    • Business Logic Scripting
  • State Management
    • Implicit State Management
  • Request Cancellation
  • Rate Limiter
  • Caching
  • Metrics
  • Tracing

Model Management

  • Repository
  • Configuration
  • Optimization
  • Controls
  • Decoupled models
  • Custom operators

Backends

  • TRT-LLM
  • vLLM
    • vLLM Backend
    • Multi-LoRA
  • Python Backend
  • PyTorch (LibTorch) Backend
  • ONNX Runtime
  • TensorFlow
  • TensorRT
  • FIL
  • DALI
  • Custom

Performance Benchmarking and Tuning

  • GenAI Perf Analyzer
    • Large language models
    • Visual language models
    • Embedding models
    • Ranking models
    • Multiple LoRA adapters
  • Performance Analyzer
    • Recommended Installation Method
    • Inference Load Modes
    • Input Data
    • Measurement Modes
  • Model Analyzer
    • Model Analyzer CLI
    • Launch Modes
    • Table of Contents
    • Model Analyzer Metrics
    • Table of Contents
    • Checkpointing in Model Analyzer
    • Model Analyzer Reports
    • Deploying Model Analyzer on a Kubernetes cluster
  • Model Navigator

Debugging

  • Guide
  • API Reference

API Reference
