NVIDIA Triton Inference Server

Table of Contents

  • Home
  • Release notes
  • Compatibility matrix

Getting Started

  • Quick Deployment Guide by backend
    • TRT-LLM
    • vLLM
    • Python with HuggingFace
    • PyTorch
    • ONNX
    • TensorFlow
    • OpenVINO
  • LLM With TRT-LLM
  • Multimodal model
  • Stable Diffusion

Scaling Guide

  • Multi-Node (AWS)
  • Multi-Instance

LLM Features

  • Constrained Decoding
  • Function Calling
  • Speculative Decoding
    • TRT-LLM
    • vLLM

Client

  • API Reference
    • OpenAI API
    • KServe API
      • HTTP/REST and GRPC Protocol
      • Extensions
        • Binary tensor data extension
        • Classification extension
        • Schedule policy extension
        • Sequence extension
        • Shared-memory extension
        • Model configuration extension
        • Model repository extension
        • Statistics extension
        • Trace extension
        • Logging extension
        • Parameters extension
  • In-Process Triton Server API
    • C/C++
    • Python
      • Kafka I/O
      • Rayserve
    • Java
  • Client Libraries
  • Python tritonclient Package API
    • tritonclient
      • tritonclient.grpc
        • tritonclient.grpc.aio
        • tritonclient.grpc.auth
      • tritonclient.http
        • tritonclient.http.aio
        • tritonclient.http.auth
      • tritonclient.utils
        • tritonclient.utils.cuda_shared_memory
        • tritonclient.utils.shared_memory

Server

  • Concurrent Model Execution
  • Scheduler
  • Batcher
  • Model Pipelines
    • Ensemble
    • Business Logic Scripting
  • State Management
    • Implicit State Management
  • Request Cancellation
  • Rate Limiter
  • Caching
  • Metrics
  • Tracing

Model Management

  • Repository
  • Configuration
  • Optimization
  • Controls
  • Decoupled models
  • Custom operators

Backends

  • TRT-LLM
  • vLLM
    • vLLM Backend
    • Multi-LoRA
  • Python Backend
  • PyTorch (LibTorch) Backend
  • ONNX Runtime
  • TensorFlow
  • TensorRT
  • FIL
  • DALI
  • Custom

Performance Benchmarking and Tuning

  • GenAI Perf Analyzer
    • Large language models
    • Visual language models
    • Embedding models
    • Ranking models
    • Multiple LoRA adapters
  • Performance Analyzer
    • Recommended Installation Method
    • Inference Load Modes
    • Input Data
    • Measurement Modes
  • Model Analyzer
    • Model Analyzer CLI
    • Launch Modes
    • Table of Contents
    • Model Analyzer Metrics
    • Table of Contents
    • Checkpointing in Model Analyzer
    • Model Analyzer Reports
    • Deploying Model Analyzer on a Kubernetes cluster
  • Model Navigator

Debugging

  • Guide
  • API Reference

API Reference
