
XPK's Workload Policy Naming Causes Failures When Recreating Clusters with Different Topologies #752

@bzantium

Description

Problem:

When using xpk cluster create for TPU types that require a workload-policy (e.g., tpu7x with specific topologies), XPK correctly creates a resource policy named {cluster-name}-placement-policy with the specified acceleratorTopology.

However, xpk cluster delete does not delete this associated resource policy. If a user then runs xpk cluster create again with the same cluster name but a different TPU topology (e.g., deleting a tpu7x-64 cluster and creating a tpu7x-128 cluster), the cluster create command fails.

This failure occurs because the gcloud beta container node-pools create command (generated internally by XPK) attempts to use the existing {cluster-name}-placement-policy, which still contains the acceleratorTopology from the previous cluster configuration (e.g., 2x4x4 instead of the newly requested 4x4x4). This mismatch leads to an error during node pool creation, similar to the one encountered when manually providing an incorrect --placement-policy with a different topology.
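
The leftover policy is easy to confirm after the delete; a minimal check, assuming the policy lives in the region used at create time (CLUSTER_NAME and REGION are placeholders):

    # Hypothetical check: describe the policy that `xpk cluster delete` left behind.
    # The output still shows the acceleratorTopology from the previous cluster (e.g., 2x4x4).
    gcloud compute resource-policies describe "${CLUSTER_NAME}-placement-policy" \
        --region="${REGION}" --format=yaml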

Steps to Reproduce:

  1. Create a cluster requiring a workload policy with a specific topology:
    xpk cluster create --cluster my-tpu-cluster --tpu-type=tpu7x-64 --reservation=RESERVATION_ID ...
    (A workload policy named my-tpu-cluster-placement-policy with acceleratorTopology: 2x4x4 is created)
  2. Delete the cluster:
    xpk cluster delete --cluster my-tpu-cluster ...
    (The workload policy my-tpu-cluster-placement-policy remains)
  3. Attempt to create the cluster again with the same name but a different topology:
    xpk cluster create --cluster my-tpu-cluster --tpu-type=tpu7x-128 --reservation=RESERVATION_ID ...
  4. Observe the node pool creation failure due to the topology mismatch between the requested node pool (4x4x4) and the existing placement policy (2x4x4):
ERROR: (gcloud.beta.container.node-pools.create) Operation [<Operation
 clusterConditions: [<StatusCondition
 canonicalCode: CanonicalCodeValueValuesEnum(INVALID_ARGUMENT, 3)
 message: "Google Compute Engine: Invalid value for field 'resource.resizeBy': '16'. Requested invalid target size '16' for a Managed Instance Group in the gang mode of size '8'. Should be equal to the gang size.">]
 detail: "Google Compute Engine: Invalid value for field 'resource.resizeBy': '16'. Requested invalid target size '16' for a Managed Instance Group in the gang mode of size '8'. Should be equal to the gang size."
 endTime: '2025-10-28T14:39:44.912501721Z'
 error: <Status
 code: 3
 details: []
 message: "Google Compute Engine: Invalid value for field 'resource.resizeBy': '16'. Requested invalid target size '16' for a Managed Instance Group in the gang mode of size '8'. Should be equal to the gang size.">
 name: 'operation-1761662360188-eb0a551b-11ee-4ffd-8c31-a778e534cff8'
 nodepoolConditions: []
 operationType: OperationTypeValueValuesEnum(CREATE_NODE_POOL, 7)
 selfLink: 'https://container.googleapis.com/v1beta1/projects/579803409133/locations/us-central1/operations/operation-1761662360188-eb0a551b-11ee-4ffd-8c31-a778e534cff8'
 startTime: '2025-10-28T14:39:20.188771275Z'
 status: StatusValueValuesEnum(DONE, 3)
 statusMessage: "Google Compute Engine: Invalid value for field 'resource.resizeBy': '16'. Requested invalid target size '16' for a Managed Instance Group in the gang mode of size '8'. Should be equal to the gang size."
 targetLink: 'https://container.googleapis.com/v1beta1/projects/579803409133/locations/us-central1/clusters/tpu-v7/nodePools/tpu-v7-np-0'
 zone: 'us-central1'>] finished with error: Google Compute Engine: Invalid value for field 'resource.resizeBy': '16'. Requested invalid target size '16' for a Managed Instance Group in the gang mode of size '8'. Should be equal to the gang size.
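
Until XPK cleans this up itself, the stale policy can be removed by hand before re-creating the cluster; a sketch under the same placeholder assumptions:

    # Assumed manual workaround: delete the stale policy so the next
    # `xpk cluster create` can create one with the new topology.
    gcloud compute resource-policies delete my-tpu-cluster-placement-policy \
        --region="${REGION}" --quiet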

Expected Behavior:

XPK should handle the lifecycle of the associated workload policy more robustly: xpk cluster delete should remove the policy it created, and xpk cluster create should not reuse a stale policy whose acceleratorTopology no longer matches the requested tpu-type.

Proposed Solution:

  1. Enhance xpk cluster delete:
    • Check whether a resource policy named {cluster-name}-placement-policy exists.
    • If it does, delete it with gcloud compute resource-policies delete {cluster-name}-placement-policy ..., handling the error case where the policy is still in use (though ideally any node pools using it have already been deleted).
  2. Enhance xpk cluster create:
    • Before creating a workload policy named {cluster-name}-placement-policy, check whether one already exists.
    • If it does, delete it with gcloud compute resource-policies delete ... (attempt the deletion even though it may fail while the policy is in use, and let the subsequent create fail with a clear error).
    • Always create a fresh policy with the topology required by the requested tpu-type using gcloud compute resource-policies create workload-policy ... (a combined sketch of both enhancements follows this list).
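
A minimal sketch of the shared clean-up step, not XPK's actual code: CLUSTER_NAME and REGION stand in for values XPK already computes, the same describe/delete pair would run in xpk cluster delete, and the trailing flags of the create command are elided as above:

    # Sketch: drop any stale placement policy, then recreate it with the
    # topology for the currently requested tpu-type.
    POLICY="${CLUSTER_NAME}-placement-policy"
    if gcloud compute resource-policies describe "${POLICY}" --region="${REGION}" >/dev/null 2>&1; then
        # This may fail if the policy is still attached to a node pool; in that
        # case the error should be surfaced rather than swallowed.
        gcloud compute resource-policies delete "${POLICY}" --region="${REGION}" --quiet
    fi
    gcloud compute resource-policies create workload-policy "${POLICY}" ...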
