
XPK's Workload Policy Naming Causes Failures When Recreating Clusters with Different Topologies #752

@bzantium

Description

Problem:

When using xpk cluster create for TPU types that require a workload-policy (e.g., tpu7x with specific topologies), XPK correctly creates a resource policy named {cluster-name}-placement-policy with the specified acceleratorTopology.

However, xpk cluster delete does not delete this associated resource policy. If a user then runs xpk cluster create again with the same cluster name but a different TPU topology (e.g., deleting a tpu7x-64 cluster and creating a tpu7x-128 cluster), the cluster create command fails.

This failure occurs because the gcloud beta container node-pools create command (generated internally by XPK) attempts to use the existing {cluster-name}-placement-policy, which still contains the acceleratorTopology from the previous cluster configuration (e.g., 2x4x4 instead of the newly requested 4x4x4). This mismatch leads to an error during node pool creation, similar to the one encountered when manually providing an incorrect --placement-policy with a different topology.
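
The leftover policy is easy to confirm after the delete; a minimal check, assuming the policy lives in the region used at create time (CLUSTER_NAME and REGION are placeholders):

    # Hypothetical check: describe the policy that `xpk cluster delete` left behind.
    # The output still shows the acceleratorTopology from the previous cluster (e.g., 2x4x4).
    gcloud compute resource-policies describe "${CLUSTER_NAME}-placement-policy" \
        --region="${REGION}" --format=yaml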

Steps to Reproduce:

  1. Create a cluster requiring a workload policy with a specific topology:
    xpk cluster create --cluster my-tpu-cluster --tpu-type=tpu7x-64 --reservation=RESERVATION_ID ...
    (A workload policy named my-tpu-cluster-placement-policy with acceleratorTopology: 2x4x4 is created)
  2. Delete the cluster:
    xpk cluster delete --cluster my-tpu-cluster ...
    (The workload policy my-tpu-cluster-placement-policy remains)
  3. Attempt to create the cluster again with the same name but a different topology:
    xpk cluster create --cluster my-tpu-cluster --tpu-type=tpu7x-128 --reservation=RESERVATION_ID ...
  4. Observe the node pool creation failure due to the topology mismatch between the requested node pool (4x4x4) and the existing placement policy (2x4x4):
ERROR: (gcloud.beta.container.node-pools.create) Operation [<Operation
 clusterConditions: [<StatusCondition
 canonicalCode: CanonicalCodeValueValuesEnum(INVALID_ARGUMENT, 3)
 message: "Google Compute Engine: Invalid value for field 'resource.resizeBy': '16'. Requested invalid target size '16' for a Managed Instance Group in the gang mode of size '8'. Should be equal to the gang size.">]
 detail: "Google Compute Engine: Invalid value for field 'resource.resizeBy': '16'. Requested invalid target size '16' for a Managed Instance Group in the gang mode of size '8'. Should be equal to the gang size."
 endTime: '2025-10-28T14:39:44.912501721Z'
 error: <Status
 code: 3
 details: []
 message: "Google Compute Engine: Invalid value for field 'resource.resizeBy': '16'. Requested invalid target size '16' for a Managed Instance Group in the gang mode of size '8'. Should be equal to the gang size.">
 name: 'operation-1761662360188-eb0a551b-11ee-4ffd-8c31-a778e534cff8'
 nodepoolConditions: []
 operationType: OperationTypeValueValuesEnum(CREATE_NODE_POOL, 7)
 selfLink: 'https://container.googleapis.com/v1beta1/projects/579803409133/locations/us-central1/operations/operation-1761662360188-eb0a551b-11ee-4ffd-8c31-a778e534cff8'
 startTime: '2025-10-28T14:39:20.188771275Z'
 status: StatusValueValuesEnum(DONE, 3)
 statusMessage: "Google Compute Engine: Invalid value for field 'resource.resizeBy': '16'. Requested invalid target size '16' for a Managed Instance Group in the gang mode of size '8'. Should be equal to the gang size."
 targetLink: 'https://container.googleapis.com/v1beta1/projects/579803409133/locations/us-central1/clusters/tpu-v7/nodePools/tpu-v7-np-0'
 zone: 'us-central1'>] finished with error: Google Compute Engine: Invalid value for field 'resource.resizeBy': '16'. Requested invalid target size '16' for a Managed Instance Group in the gang mode of size '8'. Should be equal to the gang size.
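
Until XPK cleans this up itself, the stale policy can be removed by hand before re-creating the cluster; a sketch under the same placeholder assumptions:

    # Assumed manual workaround: delete the stale policy so the next
    # `xpk cluster create` can create one with the new topology.
    gcloud compute resource-policies delete my-tpu-cluster-placement-policy \
        --region="${REGION}" --quiet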

Expected Behavior:

XPK should handle the lifecycle of the associated workload policy more robustly: xpk cluster delete should remove the policy it created, and xpk cluster create should not reuse a stale policy whose acceleratorTopology no longer matches the requested tpu-type.

Proposed Solution:

  1. Enhance xpk cluster delete:
    • Check whether a resource policy named {cluster-name}-placement-policy exists.
    • If it does, delete it with gcloud compute resource-policies delete {cluster-name}-placement-policy ..., handling the error case where the policy is still in use (though ideally any node pools using it have already been deleted).
  2. Enhance xpk cluster create:
    • Before creating a workload policy named {cluster-name}-placement-policy, check whether one already exists.
    • If it does, delete it with gcloud compute resource-policies delete ... (attempt the deletion even though it may fail while the policy is in use, and let the subsequent create fail with a clear error).
    • Always create a fresh policy with the topology required by the requested tpu-type using gcloud compute resource-policies create workload-policy ... (a combined sketch of both enhancements follows this list).
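
A minimal sketch of the shared clean-up step, not XPK's actual code: CLUSTER_NAME and REGION stand in for values XPK already computes, the same describe/delete pair would run in xpk cluster delete, and the trailing flags of the create command are elided as above:

    # Sketch: drop any stale placement policy, then recreate it with the
    # topology for the currently requested tpu-type.
    POLICY="${CLUSTER_NAME}-placement-policy"
    if gcloud compute resource-policies describe "${POLICY}" --region="${REGION}" >/dev/null 2>&1; then
        # This may fail if the policy is still attached to a node pool; in that
        # case the error should be surfaced rather than swallowed.
        gcloud compute resource-policies delete "${POLICY}" --region="${REGION}" --quiet
    fi
    gcloud compute resource-policies create workload-policy "${POLICY}" ...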
