Skip to content

Conversation

@puretension
Copy link
Contributor

Issue # (if applicable)

Closes #35611.

Reason for this change

IAM roles generate overflow policies when their policy document exceeds 10k bytes, splitting statements into managed policies. The ordering of statements was non-deterministic, causing statements to move between policies during deployments, leading to temporary permission gaps and potential service interruptions.

Description of changes

  • Root cause: Policy statements were processed in non-deterministic order during overflow policy splitting
  • Solution: Added deterministic sorting of policy statements before splitting in PolicyDocument._splitDocument()
  • Implementation: Sort statements by JSON hash to ensure identical statements always end up in the same overflow policies across deployments
  • Alternative considered: Feature flags, complex hash functions, or custom resources - rejected for simplicity and backward compatibility
  • Design decision: Minimal change with maximum impact - only 8 lines of sorting logic

Describe any new or updated permissions being added

No new or updated IAM permissions are needed.

Description of how you validated changes

  • Added unit test overflow policies have deterministic statement ordering that verifies identical policy structures across multiple deployments
  • Test passes, confirming deterministic behavior
  • Manual verification shows statements are consistently ordered by content hash

Checklist


By submitting this pull request, I confirm that my contribution is made under the terms of the Apache-2.0 license

Sort policy statements deterministically to prevent service interruption
during overflow policy updates. This ensures identical statements always
end up in the same overflow policies across deployments.

Fixes aws#35611

Signed-off-by: puretension <rlrlfhtm5@gmail.com>
@github-actions github-actions bot added beginning-contributor [Pilot] contributed between 0-2 PRs to the CDK effort/medium Medium work item – several days of effort p1 labels Oct 28, 2025
@aws-cdk-automation aws-cdk-automation requested a review from a team October 28, 2025 16:35
Copy link
Collaborator

@aws-cdk-automation aws-cdk-automation left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(This review is outdated)

@puretension
Copy link
Contributor Author

Exemption Request

This fix addresses a deterministic ordering issue in policy statement processing that prevents service interruptions during deployments. The change is:

  1. Pure algorithmic fix: Only adds deterministic sorting logic without changing any AWS resource behavior
  2. No observable AWS changes: The generated CloudFormation templates remain functionally identical - only the internal statement ordering changes
  3. Unit test sufficient: The fix is fully validated by the unit test overflow policies have deterministic statement ordering which verifies identical policy structures across deployments
  4. Integration test would be identical: Since the AWS resources and their configurations remain unchanged, an integration test would produce identical snapshots before and after the fix
  5. Low risk change: 8 lines of sorting logic with no external dependencies or AWS API interactions

The core issue was non-deterministic behavior causing statements to move between policies unpredictably. This fix ensures deterministic behavior - the same input always produces the same output structure, eliminating the root cause of service interruptions.

An integration test would not add meaningful validation beyond what the unit test already provides, as the AWS resource configurations and their final states remain identical.

@aws-cdk-automation aws-cdk-automation added the pr-linter/exemption-requested The contributor has requested an exemption to the PR Linter feedback. label Oct 28, 2025
Signed-off-by: puretension <rlrlfhtm5@gmail.com>
@Abogical Abogical self-assigned this Oct 30, 2025
@Abogical Abogical added the pr-linter/exempt-integ-test The PR linter will not require integ test changes label Oct 30, 2025
@aws-cdk-automation aws-cdk-automation dismissed their stale review October 30, 2025 09:40

✅ Updated pull request passes all PRLinter validations. Dismissing previous PRLinter review.

@Abogical Abogical requested a review from rix0rrr October 30, 2025 10:14
@rix0rrr
Copy link
Contributor

rix0rrr commented Oct 30, 2025

I will have a look at this later, but I would like your thoughts on the following first:

A deterministic sorting improves the situation slightly:

  • (stable) We now have guarantees that when all statements remain the same, their assignment to policy documents won't change.

...but it doesn't solve the fundamental problem:

  • (changes) when statements get added, that may cause some statements to shift the policy document they were assigned to. When this happens, there is an inconsistency period as the policies get applied by CloudFormation. There will be periods in which a moved statement may be in none of the policies.

My slight objection to this is that I don't have strong evidence that in the (stable) situation, statements get moved between policies to begin with. That requires some nondeterminism in the execution engine that I don't know that even exists in the first place. I would like to see some evidence for that happening first? And then again, in the (changes) situation, I think this will now increase the likelihood that new statements will cause statements to shift over to different policy documents, and thereby increase the chances of inconsistencies during deployment, not lessen them. Because previously, most likely (though not guaranteed, depending on user code) statements would probably be added to the end of the list, and now they will be inserted in the middle of a list causing shift-overs:

image

All of this is to say: I'm a bit suspicious of this, and would like to see more validation than "I added a test and it passed". I would like to see deployment in ongoing real-world applications first.

@Abogical
Copy link
Member

Abogical commented Oct 30, 2025

Thanks @rix0rrr! You raise a good point that when adding statements, statements may not be added to the end of the list under this fix, but it could be added in the middle. Hence, deployed statements that are the end of the policies would now have to be shifted to a new policy document. This situation is what can cause an inconsistent deployment so I don't see how sorting statements can fix and close issue #36511.

@puretension, thank you for attempting to fix this. However, I see that the direction this PR takes is incorrect. I'll have to close this PR accordingly. I've unlocked the conversation in case you have any comments.

@Abogical Abogical closed this Oct 30, 2025
@github-actions
Copy link
Contributor

Comments on closed issues and PRs are hard for our team to see.
If you need help, please open a new issue that references this one.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 30, 2025
@aws aws unlocked this conversation Oct 30, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

beginning-contributor [Pilot] contributed between 0-2 PRs to the CDK effort/medium Medium work item – several days of effort p1 pr-linter/exempt-integ-test The PR linter will not require integ test changes pr-linter/exemption-requested The contributor has requested an exemption to the PR Linter feedback.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

iam: application of role overflow policy can interrupt service availability

4 participants