Create an AI-optimized GKE cluster with default configuration

This page shows you how to create your own AI-optimized Google Kubernetes Engine (GKE) cluster that uses Cluster Director for GKE to support your AI and ML workloads, using A4 or A3 Ultra virtual machines (VMs).

Cluster Director for GKE lets you deploy and manage large AI-optimized clusters of accelerated VMs with features such as targeted workload placement, advanced cluster maintenance controls, and topology-aware scheduling. For more information, see Cluster Director.

GKE provides a single platform surface to run a diverse set of workloads for your organization's needs. This includes high performance distributed pre-training, model fine-tuning, model inference, application serving, and supporting services. GKE reduces the operational burden of managing multiple platforms.

Choose how to create an AI-optimized GKE cluster

The following options for cluster creation each provide varying degrees of ease and flexibility in cluster configuration and workload scheduling:

  • Create clusters with the default configuration for compute, storage, and networking resources, and with GPUDirect RDMA-over-Converged-Ethernet (RoCE) enabled, by following the instructions on this page.

  • Alternatively, you can create your GKE cluster manually for precise customization or expansion of existing production GKE environments. To create an AI-optimized GKE cluster manually, see Create a custom AI-optimized GKE cluster.

Before you begin

Before you start, make sure you have performed the following tasks:

  • Enable the Google Kubernetes Engine API.
  • If you want to use the Google Cloud CLI for this task, install and then initialize the gcloud CLI. If you previously installed the gcloud CLI, get the latest version by running gcloud components update.
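
For example, if you use the gcloud CLI, you can enable the API and update your components from the command line. PROJECT_ID is a placeholder for your project ID:

    # Enable the Google Kubernetes Engine API in your project.
    gcloud services enable container.googleapis.com --project=PROJECT_ID

    # Make sure that your gcloud CLI components are up to date.
    gcloud components update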

Choose a consumption option and obtain capacity

  1. Choose a consumption option. Make your choice based on how you want to get and use GPU resources. To learn more, see Choose a consumption option.

  2. Obtain capacity. Learn how to obtain capacity for your consumption option.

    To learn more, see Obtain capacity.

Requirements

The following requirements apply to an AI-optimized GKE cluster:

  • Ensure that you use at least the minimum GPU driver version for your machine type:

    • A4: The B200 GPUs in A4 VMs require GPU driver version 570 or later. By default, GKE automatically installs this driver version on all A4 nodes that run the minimum required GKE version for A4 (1.32.1-gke.1729000) or later.
    • A3 Ultra: The H200 GPUs in A3 Ultra VMs require GPU driver version 550 or later, which is available in GKE 1.31 as the latest driver version. With GKE 1.31, you must set gpu-driver-version=latest for A3 Ultra. For GKE version 1.31.5-gke.1169000 or later, GKE automatically installs the 550 driver version on A3 Ultra nodes by default.
  • For A3 Ultra node pools, you must set the disk type to hyperdisk-balanced.

  • To use GPUDirect RDMA, use the following minimum versions depending on the machine type:

    • A4: Use 1.32.2-gke.1475000 or later.
    • A3 Ultra: Use 1.31.4-gke.1183000 or later.
  • To use GPUDirect RDMA, the GKE nodes must use a Container-Optimized OS node image. Ubuntu and Windows node images are not supported.
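
To check whether an existing cluster meets these GKE version requirements, you can, for example, inspect its control plane and node versions with the gcloud CLI. This is a minimal sketch; CLUSTER_NAME and COMPUTE_REGION are placeholders for your own values:

    # Show the control plane and node versions to compare against the minimums listed above.
    gcloud container clusters describe CLUSTER_NAME \
        --location=COMPUTE_REGION \
        --format="value(currentMasterVersion,currentNodeVersion)"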

Create a cluster

Use the following instructions to create a cluster using either Cluster Toolkit or XPK.

Create a cluster using Cluster Toolkit

This section guides you through the cluster creation process, ensuring that your project follows best practices and meets the requirements for an AI-optimized GKE cluster.

A4

  1. Launch Cloud Shell. You can use a different environment; however, we recommend Cloud Shell because the dependencies for Cluster Toolkit are preinstalled. If you don't want to use Cloud Shell, follow the instructions to install dependencies to prepare a different environment.
  2. Clone the Cluster Toolkit from the git repository:

    cd ~
    git clone https://github.com/GoogleCloudPlatform/cluster-toolkit.git
    
  3. Install the Cluster Toolkit:

    cd cluster-toolkit && git checkout main && make
    
  4. Create a Cloud Storage bucket to store the state of the Terraform deployment:

    gcloud storage buckets create gs://BUCKET_NAME \
        --default-storage-class=STANDARD \
        --location=COMPUTE_REGION_TERRAFORM_STATE \
        --uniform-bucket-level-access
    gcloud storage buckets update gs://BUCKET_NAME --versioning
    

    Replace the following variables:

    • BUCKET_NAME: the name of the new Cloud Storage bucket.
    • COMPUTE_REGION_TERRAFORM_STATE: the compute region where you want to store the state of the Terraform deployment.
  5. The files that you need to edit to create a cluster depend on the consumption option that you're using for your deployment. Select the tab that corresponds to your consumption option's provisioning model.

    Reservation-bound

    In the examples/gke-a4/gke-a4-deployment.yaml file, fill in the following settings in the terraform_backend_defaults and vars sections to match the specific values for your deployment:

    • DEPLOYMENT_NAME: a unique name for the deployment. If the deployment name isn't unique within a project, cluster creation fails.
    • BUCKET_NAME: the name of the Cloud Storage bucket you created in the previous step.
    • PROJECT_ID: your Google Cloud project ID.
    • COMPUTE_REGION: the compute region for the cluster.
    • COMPUTE_ZONE: the compute zone for the node pool of A4 machines.
    • NODE_COUNT: the number of A4 nodes in your cluster.
    • IP_ADDRESS/SUFFIX: the IP address range that you want to allow to connect with the cluster. This CIDR block must include the IP address of the machine from which you run Terraform. For more information, see How authorized networks work.
    • For the extended_reservation field, use one of the following, depending on whether you want to target specific blocks in a reservation when provisioning the node pool:

      • To place the node pool anywhere in the reservation, provide the name of your reservation (RESERVATION_NAME).
      • To target a specific block within your reservation, use the reservation and block names in the following format:

        RESERVATION_NAME/reservationBlocks/BLOCK_NAME
        

      If you don't know which blocks are available in your reservation, see View a reservation topology, or use the example command after these settings.

    • SYSTEM_NODE_POOL_DISK_SIZE_GB: the size of disk for each node of the system node pool. The default value is 100 GB.

    • A4_NODE_POOL_DISK_SIZE_GB: the size of disk for each node of the A4 node pool. The default value is 100 GB.

    To modify advanced settings, edit examples/gke-a4/gke-a4.yaml.
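
    If you need to find block names for the extended_reservation field, the following command is one way to list them. This is a sketch that assumes the beta reservations blocks commands are available in your gcloud CLI release; RESERVATION_NAME and COMPUTE_ZONE are placeholders:

        # List the blocks in a reservation to find BLOCK_NAME values.
        gcloud beta compute reservations blocks list RESERVATION_NAME \
            --zone=COMPUTE_ZONE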

    Flex-start

    1. In the examples/gke-a4/gke-a4-deployment.yaml file, fill in the following settings in the terraform_backend_defaults and vars sections to match the specific values for your deployment:
      • DEPLOYMENT_NAME: a unique name for the deployment. If the deployment name isn't unique within a project, cluster creation fails.
      • BUCKET_NAME: the name of the Cloud Storage bucket you created in the previous step.
      • PROJECT_ID: your Google Cloud project ID.
      • COMPUTE_REGION: the compute region for the cluster.
      • COMPUTE_ZONE: the compute zone for the node pool of A4 machines.
      • Remove static_node_count.
      • IP_ADDRESS/SUFFIX: the IP address range that you want to allow to connect with the cluster. This CIDR block must include the IP address of the machine from which you run Terraform. For more information, see How authorized networks work.
      • Remove the extended_reservation field and replace it with enable_flex_start: true. If you also want to use queued provisioning, add enable_queued_provisioning: true on the next line. For more information, see Use node pools with flex-start with queued provisioning.
      • SYSTEM_NODE_POOL_DISK_SIZE_GB: the size of disk for each node of the system node pool. The default value is 100 GB.
      • A4_NODE_POOL_DISK_SIZE_GB: the size of disk for each node of the A4 node pool. The default value is 100 GB.
    2. In the examples/gke-a4/gke-a4.yaml file, make the following changes:

      • In the vars block, remove static_node_count.
      • In the vars block, replace the entire extended_reservation block (including the extended_reservation line itself) with enable_flex_start: true, and, optionally, enable_queued_provisioning: true.
      • In the vars block, remove the following line: kueue_configuration_path: $(ghpc_stage("./kueue-configuration.yaml.tftpl")).
      • Under id: a4-pool, remove the following line: static_node_count: $(vars.static_node_count).
      • Under id: a4-pool, remove the reservation_affinity block. Replace this block with the following lines:

        • enable_flex_start: $(vars.enable_flex_start)
        • auto_repair: false
        • If you want to enable queued provisioning, also add the following lines:
          • enable_queued_provisioning: $(vars.enable_queued_provisioning)
          • autoscaling_total_min_nodes: 0
      • Under id: workload-manager-install, remove the following block:

        config_path: $(vars.kueue_configuration_path)
        config_template_vars:
          num_gpus: $(a4-pool.static_gpu_count)
          accelerator_type: $(vars.accelerator_type)
        
      • Under id: job-template, remove the following line: node_count: $(vars.static_node_count).

  6. Generate Application Default Credentials (ADC) to provide access to Terraform. If you're using Cloud Shell, you can run the following command:

    gcloud auth application-default login
    
  7. Deploy the blueprint to provision the GKE infrastructure using A4 machine types:

    cd ~/cluster-toolkit
    ./gcluster deploy -d \
    examples/gke-a4/gke-a4-deployment.yaml \
    examples/gke-a4/gke-a4.yaml
    
  8. When prompted, select (A)pply to deploy the blueprint.

    • The blueprint creates VPC networks, a GPU RDMA VPC network, service accounts, a cluster, and a node pool.
    • To support the fio-bench-job-template job template in the blueprint, Cloud Storage buckets, network storage, and persistent volume resources are created.
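
    After the deployment finishes, you can optionally verify the result. The following commands are a sketch: cluster and node pool names derive from DEPLOYMENT_NAME, so adjust CLUSTER_NAME and the other placeholders to match your deployment:

    # List the clusters that the blueprint created in your project.
    gcloud container clusters list --project=PROJECT_ID --filter="name~DEPLOYMENT_NAME"

    # Fetch credentials and confirm that the A4 GPU nodes registered, including their accelerator label.
    gcloud container clusters get-credentials CLUSTER_NAME --location=COMPUTE_REGION --project=PROJECT_ID
    kubectl get nodes -L cloud.google.com/gke-accelerator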

A3 Ultra

  1. Launch Cloud Shell. You can use a different environment; however, we recommend Cloud Shell because the dependencies for Cluster Toolkit are preinstalled. If you don't want to use Cloud Shell, follow the instructions to install dependencies to prepare a different environment.
  2. Clone the Cluster Toolkit from the git repository:

    cd ~
    git clone https://github.com/GoogleCloudPlatform/cluster-toolkit.git
    
  3. Install the Cluster Toolkit:

    cd cluster-toolkit && git checkout main && make
    
  4. Create a Cloud Storage bucket to store the state of the Terraform deployment:

    gcloud storage buckets create gs://BUCKET_NAME \
        --default-storage-class=STANDARD \
        --location=COMPUTE_REGION_TERRAFORM_STATE \
        --uniform-bucket-level-access
    gcloud storage buckets update gs://BUCKET_NAME --versioning
    

    Replace the following variables:

    • BUCKET_NAME: the name of the new Cloud Storage bucket.
    • COMPUTE_REGION_TERRAFORM_STATE: the compute region where you want to store the state of the Terraform deployment.
  5. The files that you need to edit to create a cluster depend on the consumption option that you're using for your deployment. Select the tab that corresponds to your consumption option's provisioning model.

    Reservation-bound

    In the examples/gke-a3-ultragpu/gke-a3-ultragpu-deployment.yaml file, replace the following variables in the terraform_backend_defaults and vars sections to match the specific values for your deployment:

    • DEPLOYMENT_NAME: a unique name for the deployment. If the deployment name isn't unique within a project, cluster creation fails.
    • BUCKET_NAME: the name of the Cloud Storage bucket you created in the previous step.
    • PROJECT_ID: your Google Cloud project ID.
    • COMPUTE_REGION: the compute region for the cluster.
    • COMPUTE_ZONE: the compute zone for the node pool of A3 Ultra machines.
    • NODE_COUNT: the number of A3 Ultra nodes in your cluster.
    • IP_ADDRESS/SUFFIX: the IP address range that you want to allow to connect with the cluster. This CIDR block must include the IP address of the machine from which you run Terraform. For more information, see How authorized networks work.
    • For the extended_reservation field, use one of the following, depending on whether you want to target specific blocks in a reservation when provisioning the node pool:

      • To place the node pool anywhere in the reservation, provide the name of your reservation (RESERVATION_NAME).
      • To target a specific block within your reservation, use the reservation and block names in the following format:

        RESERVATION_NAME/reservationBlocks/BLOCK_NAME
        

      If you don't know which blocks are available in your reservation, see View a reservation topology.

    • SYSTEM_NODE_POOL_DISK_SIZE_GB: the size of disk for each node of the system node pool. The default value is 100 GB.

    • A3ULTRA_NODE_POOL_DISK_SIZE_GB: the size of disk for each node of the A3 Ultra node pool. The default value is 100 GB.

    To modify advanced settings, edit examples/gke-a3-ultragpu/gke-a3-ultragpu.yaml.

    Flex-start

    1. In the examples/gke-a3-ultragpu/gke-a3-ultragpu-deployment.yaml file, replace the following variables in the terraform_backend_defaults and vars sections to match the specific values for your deployment:

      • DEPLOYMENT_NAME: a unique name for the deployment. If the deployment name isn't unique within a project, cluster creation fails.
      • BUCKET_NAME: the name of the Cloud Storage bucket you created in the previous step.
      • PROJECT_ID: your Google Cloud project ID.
      • COMPUTE_REGION: the compute region for the cluster.
      • COMPUTE_ZONE: the compute zone for the node pool of A3 Ultra machines.
      • Remove static_node_count.
      • IP_ADDRESS/SUFFIX: the IP address range that you want to allow to connect with the cluster. This CIDR block must include the IP address of the machine from which you run Terraform. For more information, see How authorized networks work.
      • Remove the extended_reservation field and replace it with enable_flex_start: true. If you also want to use queued provisioning, add enable_queued_provisioning: true on the next line. For more information, see Use node pools with flex-start with queued provisioning.
      • SYSTEM_NODE_POOL_DISK_SIZE_GB: the size of disk for each node of the system node pool. The default value is 100 GB.
      • A3ULTRA_NODE_POOL_DISK_SIZE_GB: the size of disk for each node of the A3 Ultra node pool. The default value is 100 GB.
    2. In the examples/gke-a3-ultragpu/gke-a3-ultragpu.yaml file, make the following changes:

      • In the vars block, remove static_node_count.
      • In the vars block, replace the entire extended_reservation block (including the extended_reservation line itself) with enable_flex_start: true, and, optionally, enable_queued_provisioning: true.
      • In the vars block, remove the following line: kueue_configuration_path: $(ghpc_stage("./kueue-configuration.yaml.tftpl")).
      • Under id: a3-ultragpu-pool, remove the following line: static_node_count: $(vars.static_node_count).
      • Under id: a3-ultragpu-pool, remove the reservation_affinity block. Replace this block with the following lines:

        • enable_flex_start: $(vars.enable_flex_start)
        • auto_repair: false
        • If you want to enable queued provisioning, also add the following lines:
          • enable_queued_provisioning: $(vars.enable_queued_provisioning)
          • autoscaling_total_min_nodes: 0
      • Under id: workload-manager-install, remove the following block:

        config_path: $(vars.kueue_configuration_path)
        config_template_vars:
          num_gpus: $(a3-ultragpu-pool.static_gpu_count)
          accelerator_type: $(vars.accelerator_type)
        
  6. Generate Application Default Credentials (ADC) to provide access to Terraform. If you're using Cloud Shell, you can run the following command:

    gcloud auth application-default login
    
  7. Deploy the blueprint to provision the GKE infrastructure using A3 Ultra machine types:

    cd ~/cluster-toolkit
    ./gcluster deploy -d \
    examples/gke-a3-ultragpu/gke-a3-ultragpu-deployment.yaml \
    examples/gke-a3-ultragpu/gke-a3-ultragpu.yaml
    
  8. When prompted, select (A)pply to deploy the blueprint.

    • The blueprint creates VPC networks, a GPU RDMA VPC network, service accounts, a cluster, and a node pool.
    • To support the fio-bench-job-template job template in the blueprint, Cloud Storage buckets, network storage, and persistent volume resources are created.
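
    After the deployment finishes, you can optionally confirm that the A3 Ultra nodes registered. This is a sketch: the cluster name derives from DEPLOYMENT_NAME, and the accelerator label value shown (nvidia-h200-141gb) is an assumption that you might need to adjust:

    # Fetch credentials for the new cluster and list the H200 GPU nodes.
    gcloud container clusters get-credentials CLUSTER_NAME --location=COMPUTE_REGION --project=PROJECT_ID
    kubectl get nodes -l cloud.google.com/gke-accelerator=nvidia-h200-141gb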

Create a cluster and run workloads using XPK

Accelerated Processing Kit (XPK) lets you quickly provision and use clusters. XPK generates preconfigured, training-optimized infrastructure, which is ideal when workload execution is your primary focus.

Create a cluster and run workloads with A3 Ultra VMs using XPK:

  1. Install the required tools to meet the XPK prerequisites.
  2. Copy the version number of the latest tagged release of XPK, for example, "v0.8.0". In the following command, replace XPK_TAG with that version number.
  3. Open a shell window on a Linux machine, and enter the following commands to clone XPK from the Git repository, and install the required packages:

      ## Setup virtual environment.
      VENV_DIR=~/venvp3
      python3 -m venv $VENV_DIR
      source $VENV_DIR/bin/activate
      ## Clone the repository.
      git clone --branch XPK_TAG https://github.com/google/xpk.git
      cd xpk
      ## Install required packages
      make install && export PATH=$PATH:$PWD/bin
    
  4. Create a Standard cluster using A3 Ultra VMs. You can provision the cluster's nodes using reserved capacity:

      python3 xpk.py cluster create \
         --cluster=CLUSTER_NAME \
         --device-type=h200-141gb-8 \
         --zone=COMPUTE_ZONE  \
         --project=PROJECT_ID \
         --num-nodes=NUM_NODES \
         --reservation=RESERVATION_NAME
    

    Replace the following variables:

    • CLUSTER_NAME: a name for the cluster.
    • COMPUTE_ZONE: the compute zone for the node pool of A3 Ultra machines. To use reserved capacity, use the zone where you reserved the capacity. In general, we also recommend choosing a zone close to your users to minimize latency.
    • PROJECT_ID: your Google Cloud project ID.
    • NUM_NODES: the number of worker nodes in the node pool.
    • RESERVATION_NAME: the name of your reservation.

      XPK offers additional arguments for cluster creation, including those for creating private clusters, creating Vertex AI Tensorboards, and using node auto-provisioning. For more information, refer to the cluster creation guide for XPK.

  5. Verify that the cluster was created successfully:

      python3 xpk.py cluster list --zone=COMPUTE_ZONE --project=PROJECT_ID
    
  6. Optional: Run a workload to test the cluster environment:

      python3 xpk.py workload create \
         --workload WORKLOAD_NAME --command "echo goodbye" \
         --cluster CLUSTER_NAME \
         --device-type=h200-141gb-8 \
         --num-nodes=WORKLOAD_NUM_NODES \
         --zone=COMPUTE_ZONE \
         --project=PROJECT_ID
    

    Replace the following variables:

    • WORKLOAD_NAME: name of your workload.
    • CLUSTER_NAME: the name of the cluster.
    • WORKLOAD_NUM_NODES: number of worker nodes used for workload execution.
    • COMPUTE_ZONE: the compute zone for the node pool of A3 Ultra machines.
    • PROJECT_ID: your Google Cloud project ID.
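
    To check on or clean up the test workload, XPK also provides workload list and workload delete commands. The following is a sketch; flag names can vary between XPK releases, so run python3 xpk.py workload list --help to confirm them for your version:

      ## List workloads on the cluster.
      python3 xpk.py workload list \
         --cluster=CLUSTER_NAME \
         --zone=COMPUTE_ZONE \
         --project=PROJECT_ID

      ## Delete the test workload when you no longer need it.
      python3 xpk.py workload delete \
         --workload WORKLOAD_NAME \
         --cluster CLUSTER_NAME \
         --zone=COMPUTE_ZONE \
         --project=PROJECT_ID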

Deploy and run NCCL test

To validate the functionality of the provisioned cluster, you can run the following NCCL test. For nodes provisioned with reservations, the NCCL test runs with Topology Aware Scheduling (TAS). Nodes provisioned with flex-start don't use TAS.

Run the NCCL test by completing the following steps:

  1. Connect to your cluster:

    gcloud container clusters get-credentials CLUSTER_NAME --location=COMPUTE_REGION
    

    Replace CLUSTER_NAME with the name of your cluster. For clusters created with Cluster Toolkit, the cluster name is based on DEPLOYMENT_NAME. Replace COMPUTE_REGION with the name of the compute region.

  2. Deploy an all-gather NCCL performance test with Topology Aware Scheduling enabled by using the gke-a3-ultragpu/nccl-jobset-example.yaml file for A3 Ultra VMs or the gke-a4/nccl-jobset-example.yaml file for A4 VMs:

    1. If any of the following conditions apply to your deployment, modify the YAML file accordingly:

      • The tests use a certain number of nodes by default. If you want to change the number of nodes, change the following values to your required number of nodes:

        • parallelism
        • completions
        • N_NODES
      • If you want to test nodes provisioned by flex-start, under metadata, do the following:

        • Replace the kueue.x-k8s.io/queue-name value with dws-local-queue.
        • Add the following annotation:

          annotations:
             provreq.kueue.x-k8s.io/maxRunDurationSeconds: "600"
          
    2. Create the resources to run the test.

      For A3 Ultra VMs, use the following:

      kubectl create -f ~/cluster-toolkit/examples/gke-a3-ultragpu/nccl-jobset-example.yaml
      

      For A4 VMs, use the following:

      kubectl create -f ~/cluster-toolkit/examples/gke-a4/nccl-jobset-example.yaml
      

      This command returns a JobSet name.

      The output should be similar to the following:

      jobset.jobset.x-k8s.io/all-gather8t7dt created
      
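      Optionally, before inspecting individual Pods, you can check the overall JobSet status. This assumes the JobSet CRD is installed in the cluster, which is the case if the preceding kubectl create command succeeded:

      kubectl get jobsets
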
  3. To view the results of the NCCL test, first list all of the running Pods:

    kubectl get pods
    

    The output should be similar to the following:

    NAME                          READY   STATUS      RESTARTS   AGE
    all-gather8t7dt-w-0-0-n9s6j   0/1     Completed   0          9m34s
    all-gather8t7dt-w-0-1-rsf7r   0/1     Completed   0          9m34s
    
  4. Find a Pod name matching the pattern jobset-name-w-0-0-*. The logs of this Pod contain the results of the NCCL test.

    To fetch the logs for this Pod, run this command:

    kubectl logs all-gather8t7dt-w-0-0-n9s6j
    

    The output should be similar to the following:

    #       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
    #        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
            1024            16     float    none      -1    54.07    0.02    0.02      0    55.80    0.02    0.02      0
            2048            32     float    none      -1    55.46    0.04    0.03      0    55.31    0.04    0.03      0
            4096            64     float    none      -1    55.59    0.07    0.07      0    55.38    0.07    0.07      0
            8192           128     float    none      -1    56.05    0.15    0.14      0    55.92    0.15    0.14      0
           16384           256     float    none      -1    57.08    0.29    0.27      0    57.75    0.28    0.27      0
           32768           512     float    none      -1    57.49    0.57    0.53      0    57.22    0.57    0.54      0
           65536          1024     float    none      -1    59.20    1.11    1.04      0    59.20    1.11    1.04      0
          131072          2048     float    none      -1    59.58    2.20    2.06      0    63.57    2.06    1.93      0
          262144          4096     float    none      -1    63.87    4.10    3.85      0    63.61    4.12    3.86      0
          524288          8192     float    none      -1    64.83    8.09    7.58      0    64.40    8.14    7.63      0
         1048576         16384     float    none      -1    79.74   13.15   12.33      0    76.66   13.68   12.82      0
         2097152         32768     float    none      -1    78.41   26.74   25.07      0    79.05   26.53   24.87      0
         4194304         65536     float    none      -1    83.21   50.41   47.26      0    81.25   51.62   48.39      0
         8388608        131072     float    none      -1    94.35   88.91   83.35      0    99.07   84.68   79.38      0
        16777216        262144     float    none      -1    122.9  136.55  128.02      0    121.7  137.83  129.21      0
        33554432        524288     float    none      -1    184.2  182.19  170.80      0    178.1  188.38  176.60      0
        67108864       1048576     float    none      -1    294.7  227.75  213.51      0    277.7  241.62  226.52      0
       134217728       2097152     float    none      -1    495.4  270.94  254.00      0    488.8  274.60  257.43      0
       268435456       4194304     float    none      -1    877.5  305.92  286.80      0    861.3  311.65  292.17      0
       536870912       8388608     float    none      -1   1589.8  337.71  316.60      0   1576.2  340.61  319.33      0
      1073741824      16777216     float    none      -1   3105.7  345.74  324.13      0   3069.2  349.85  327.98      0
      2147483648      33554432     float    none      -1   6161.7  348.52  326.74      0   6070.7  353.75  331.64      0
      4294967296      67108864     float    none      -1    12305  349.03  327.22      0    12053  356.35  334.08      0
      8589934592     134217728     float    none      -1    24489  350.77  328.85      0    23991  358.05  335.67      0
    # Out of bounds values : 0 OK
    # Avg bus bandwidth    : 120.248
    
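  5. Optional: When you finish reviewing the results, delete the test JobSet to free the GPU nodes. This is a sketch; replace the JobSet name with the name that the kubectl create command returned:

    kubectl delete jobset all-gather8t7dt
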

Run reproducible benchmarks

You can reproduce pre-training benchmarks for large open machine learning models on A4 and A3 Ultra VMs on GKE.

Each recipe provides you with the instructions to complete the following tasks:

  • Prepare your environment.
  • Run the benchmark.
  • Analyze the benchmark results, including detailed logs for further analysis.

To view all the recipes available, see the GPU recipes repository.

Models          Framework   Recipe
Llama-3.1-70B   MaxText     32 node workload
Llama-3.1-70B   NeMo        32 node workload
Mixtral-8-7B    MaxText     32 node workload
Mixtral-8-7B    NeMo        32 node workload

Clean up resources created by Cluster Toolkit

To avoid recurring charges for the resources used on this page, clean up the resources provisioned by Cluster Toolkit, including the VPC networks and GKE cluster:

   cd ~/cluster-toolkit
   ./gcluster destroy CLUSTER_NAME/

Replace CLUSTER_NAME with the name of your cluster. For clusters created with Cluster Toolkit, the cluster name is based on DEPLOYMENT_NAME.
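
If you no longer need the Terraform state from the deployment, you can also delete the Cloud Storage bucket that you created earlier. This is a sketch; it permanently deletes the bucket and its contents, so only run it after the destroy command completes successfully:

   gcloud storage rm --recursive gs://BUCKET_NAME

Replace BUCKET_NAME with the name of the Cloud Storage bucket that stores the Terraform state.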

What's next