Run LLM inference on Cloud Run GPUs with Hugging Face TGI

The following example shows how to run a backend service with the Hugging Face Text Generation Inference (TGI) toolkit, a toolkit for deploying and serving Large Language Models (LLMs), using Llama 3.

See the entire example at Deploy Llama 3.1 8B with TGI DLC on Cloud Run.
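To make the request flow concrete, here is a minimal Python sketch (not taken from the linked example) that sends a single prompt to a TGI service already deployed on Cloud Run. The service URL is a placeholder, the identity token is fetched from the locally installed gcloud CLI, and the request goes to TGI's /generate endpoint, which accepts the prompt in "inputs" and generation settings in "parameters".

import subprocess

import requests

# Placeholder URL: replace with the URL printed by `gcloud run deploy` or
# `gcloud run services describe`.
SERVICE_URL = "https://text-generation-inference-HASH-uc.a.run.app"

# Cloud Run services that require authentication expect an identity token in
# the Authorization header. Here the token comes from the local gcloud CLI;
# a production client would typically use a service account with the
# google-auth library instead.
id_token = subprocess.run(
    ["gcloud", "auth", "print-identity-token"],
    capture_output=True, text=True, check=True,
).stdout.strip()

# TGI's /generate endpoint performs non-streaming text generation and returns
# a JSON body with a "generated_text" field.
response = requests.post(
    f"{SERVICE_URL}/generate",
    headers={"Authorization": f"Bearer {id_token}"},
    json={
        "inputs": "What is Cloud Run?",
        "parameters": {"max_new_tokens": 128, "temperature": 0.7},
    },
    timeout=300,
)
response.raise_for_status()
print(response.json()["generated_text"])

If the service allows unauthenticated access, the Authorization header can be dropped; otherwise the caller needs the Cloud Run Invoker role on the service.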


Last updated 2025-05-13 UTC.
