Description
Problem:
When using `xpk cluster create` for TPU types that require a workload policy (e.g., tpu7x with specific topologies), XPK correctly creates a resource policy named `{cluster-name}-placement-policy` with the specified `acceleratorTopology`.
However, `xpk cluster delete` does not delete this associated resource policy. If a user then runs `xpk cluster create` again with the same cluster name but a different TPU topology (e.g., deleting a tpu7x-64 cluster and creating a tpu7x-128 cluster), the cluster create command fails.
This failure occurs because the `gcloud beta container node-pools create` command (generated internally by XPK) attempts to use the existing `{cluster-name}-placement-policy`, which still contains the `acceleratorTopology` from the previous cluster configuration (e.g., 2x4x4 instead of the newly requested 4x4x4). This mismatch causes an error during node pool creation, similar to the one seen when manually passing an incorrect `--placement-policy` with a different topology.
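A quick way to confirm the mismatch is to inspect the leftover policy. The helper below is only a sketch: it builds the `gcloud compute resource-policies describe` command string (the function name and the example region are assumptions, not part of XPK) rather than executing it:

```python
import shlex


def describe_policy_cmd(cluster_name: str, region: str) -> str:
    """Build a gcloud command to inspect a leftover resource policy.

    The policy name follows XPK's {cluster-name}-placement-policy convention
    described in this issue.
    """
    policy = f"{cluster_name}-placement-policy"
    return shlex.join([
        "gcloud", "compute", "resource-policies", "describe", policy,
        f"--region={region}",
    ])


# For the reproduction example above:
print(describe_policy_cmd("my-tpu-cluster", "us-central1"))
# → gcloud compute resource-policies describe my-tpu-cluster-placement-policy --region=us-central1
```

Running the printed command after `xpk cluster delete` shows the stale `acceleratorTopology` (here, 2x4x4) that the next `xpk cluster create` will collide with.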
Steps to Reproduce:
- Create a cluster requiring a workload policy with a specific topology:
  ```shell
  xpk cluster create --cluster my-tpu-cluster --tpu-type=tpu7x-64 --reservation=RESERVATION_ID ...
  ```
  (A workload policy named `my-tpu-cluster-placement-policy` with `acceleratorTopology: 2x4x4` is created.)
- Delete the cluster:
  ```shell
  xpk cluster delete --cluster my-tpu-cluster ...
  ```
  (The workload policy `my-tpu-cluster-placement-policy` remains.)
- Attempt to create the cluster again with the same name but a different topology:
  ```shell
  xpk cluster create --cluster my-tpu-cluster --tpu-type=tpu7x-128 --reservation=RESERVATION_ID ...
  ```
- Observe the node pool creation failure due to the topology mismatch between the requested node pool (`4x4x4`) and the existing placement policy (`2x4x4`):
```
ERROR: (gcloud.beta.container.node-pools.create) Operation [<Operation
clusterConditions: [<StatusCondition
canonicalCode: CanonicalCodeValueValuesEnum(INVALID_ARGUMENT, 3)
message: "Google Compute Engine: Invalid value for field 'resource.resizeBy': '16'. Requested invalid target size '16' for a Managed Instance Group in the gang mode of size '8'. Should be equal to the gang size.">]
detail: "Google Compute Engine: Invalid value for field 'resource.resizeBy': '16'. Requested invalid target size '16' for a Managed Instance Group in the gang mode of size '8'. Should be equal to the gang size."
endTime: '2025-10-28T14:39:44.912501721Z'
error: <Status
code: 3
details: []
message: "Google Compute Engine: Invalid value for field 'resource.resizeBy': '16'. Requested invalid target size '16' for a Managed Instance Group in the gang mode of size '8'. Should be equal to the gang size.">
name: 'operation-1761662360188-eb0a551b-11ee-4ffd-8c31-a778e534cff8'
nodepoolConditions: []
operationType: OperationTypeValueValuesEnum(CREATE_NODE_POOL, 7)
selfLink: 'https://container.googleapis.com/v1beta1/projects/579803409133/locations/us-central1/operations/operation-1761662360188-eb0a551b-11ee-4ffd-8c31-a778e534cff8'
startTime: '2025-10-28T14:39:20.188771275Z'
status: StatusValueValuesEnum(DONE, 3)
statusMessage: "Google Compute Engine: Invalid value for field 'resource.resizeBy': '16'. Requested invalid target size '16' for a Managed Instance Group in the gang mode of size '8'. Should be equal to the gang size."
targetLink: 'https://container.googleapis.com/v1beta1/projects/579803409133/locations/us-central1/clusters/tpu-v7/nodePools/tpu-v7-np-0'
zone: 'us-central1'>] finished with error: Google Compute Engine: Invalid value for field 'resource.resizeBy': '16'. Requested invalid target size '16' for a Managed Instance Group in the gang mode of size '8'. Should be equal to the gang size.
```
Expected Behavior:
XPK should manage the lifecycle of the associated workload policy: `xpk cluster delete` should remove the policy it created, and `xpk cluster create` should not reuse a stale policy whose topology does not match the requested `tpu-type`.
Proposed Solution:
- Enhance `xpk cluster delete`:
  - Check if a resource policy named `{cluster-name}-placement-policy` exists.
  - If it exists, attempt to delete it using `gcloud compute resource-policies delete {cluster-name}-placement-policy ...`. Handle potential errors if the policy is still in use (though ideally, node pools using it should already be deleted).
- Enhance `xpk cluster create`:
  - Before creating a workload policy named `{cluster-name}-placement-policy`, check if it already exists.
  - If it exists, delete the existing policy using `gcloud compute resource-policies delete ...` (attempt deletion even if it might fail while in use, letting the subsequent create fail clearly).
  - Always create a new policy with the correct topology for the requested `tpu-type` using `gcloud compute resource-policies create workload-policy ...`.
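The two enhancements can be sketched as helpers that build the gcloud invocations for XPK's existing command runner to execute. This is a minimal sketch, not XPK's actual code: the function names, the `--quiet` flag, and especially the `--accelerator-topology` flag on `gcloud compute resource-policies create workload-policy` are assumptions; only the command names themselves come from this issue.

```python
import shlex

POLICY_SUFFIX = "-placement-policy"  # XPK's {cluster-name}-placement-policy convention


def cleanup_policy_cmds(cluster_name: str, region: str) -> list[str]:
    """Commands `xpk cluster delete` could run: check for the leftover
    policy, then delete it (surfacing an error if it is still in use)."""
    policy = cluster_name + POLICY_SUFFIX
    return [
        shlex.join(["gcloud", "compute", "resource-policies", "describe",
                    policy, f"--region={region}"]),
        shlex.join(["gcloud", "compute", "resource-policies", "delete",
                    policy, f"--region={region}", "--quiet"]),
    ]


def recreate_policy_cmds(cluster_name: str, region: str, topology: str) -> list[str]:
    """Commands `xpk cluster create` could run: drop any stale policy,
    then create one with the topology for the requested tpu-type."""
    policy = cluster_name + POLICY_SUFFIX
    # Reuse the delete command from the cleanup path, then create fresh.
    delete_cmd = cleanup_policy_cmds(cluster_name, region)[1]
    create_cmd = shlex.join(["gcloud", "compute", "resource-policies",
                             "create", "workload-policy", policy,
                             f"--region={region}",
                             f"--accelerator-topology={topology}"])
    return [delete_cmd, create_cmd]


for cmd in recreate_policy_cmds("my-tpu-cluster", "us-central1", "4x4x4"):
    print(cmd)
```

With this flow, re-creating `my-tpu-cluster` as tpu7x-128 would first delete the stale 2x4x4 policy and then create a fresh one for 4x4x4, avoiding the gang-size mismatch shown in the error above.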