Copilot AI commented Oct 21, 2025

Description

Fixes #7702

This PR fixes a logging issue where AWS Auto Scaling Group activity failures were being logged in multi-line format, breaking log parsing tools like Datadog.

Problem

When a scaling operation fails, the cluster-autoscaler logs the AWS Activity details at line 551 of auto_scaling_groups.go. The code passed the dereferenced Activity struct directly to the %s verb:

klog.Warningf("ASG %s scaling failed with %s", asgRef.Name, *activity)

Because the Activity struct implements String(), the %s verb calls that method, and the pretty-printed result spans multiple lines:

W0116 07:10:42.815402 auto_scaling_groups.go:551] ASG eks-batch-spot-xxxx scaling failed with {
  ActivityId: "4b6651b7-...",
  AutoScalingGroupARN: "arn:aws:autoscaling:...",
  AutoScalingGroupName: "eks-batch-spot-...",
  Cause: "At 2025-01-16T07:09:48Z an instance was started...",
  Description: "Launching a new EC2 instance...",
  Details: "{\"Subnet ID\":\"subnet-xxx\"...}",
  EndTime: 2025-01-16 07:09:49 +0000 UTC,
  Progress: 100,
  StartTime: 2025-01-16 07:09:49.096 +0000 UTC,
  StatusCode: "Failed",
  StatusMessage: "Could not launch Spot Instances..."
}

This multi-line format breaks log aggregation services and makes it difficult to search and analyze logs.

Solution

The log statement now extracts and formats the key fields in a single line:

klog.Warningf("ASG %s scaling failed: StatusCode=%s, StatusMessage=%s, Description=%s, Cause=%s", 
    asgRef.Name, 
    aws.StringValue(activity.StatusCode),
    aws.StringValue(activity.StatusMessage),
    aws.StringValue(activity.Description),
    aws.StringValue(activity.Cause))

Output is now single-line and parseable:

W0116 07:10:42.815402 auto_scaling_groups.go:551] ASG eks-batch-spot-xxxx scaling failed: StatusCode=Failed, StatusMessage=Could not launch Spot Instances..., Description=Launching a new EC2 instance..., Cause=At 2025-01-16T07:09:48Z an instance was started...
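For readers who want to try the format string outside the autoscaler, here is a self-contained sketch. The `activity` struct, `stringValue`, and `formatFailure` are hypothetical names introduced for illustration; `stringValue` is assumed to behave like `aws.StringValue` (a nil pointer yields an empty string), and `fmt.Sprintf` stands in for `klog.Warningf`.

```go
package main

import "fmt"

// activity is a hypothetical stand-in with the pointer-typed string
// fields that the PR reads from the Activity struct.
type activity struct {
	StatusCode    *string
	StatusMessage *string
	Description   *string
	Cause         *string
}

// stringValue is a local stand-in for aws.StringValue: nil becomes "".
func stringValue(s *string) string {
	if s == nil {
		return ""
	}
	return *s
}

// formatFailure builds the message with the same format string as the PR.
func formatFailure(asgName string, a *activity) string {
	return fmt.Sprintf("ASG %s scaling failed: StatusCode=%s, StatusMessage=%s, Description=%s, Cause=%s",
		asgName,
		stringValue(a.StatusCode),
		stringValue(a.StatusMessage),
		stringValue(a.Description),
		stringValue(a.Cause))
}

func main() {
	code, msg := "Failed", "Could not launch Spot Instances"
	// Description and Cause are left nil to exercise the nil-safe path.
	a := &activity{StatusCode: &code, StatusMessage: &msg}
	fmt.Println(formatFailure("eks-batch-spot-xxxx", a))
}
```

Unset fields render as empty values (for example `Cause=`) rather than causing a nil-pointer panic, so the record keeps a stable shape for parsers.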

Changes

  • Modified logging at line 551 to format Activity fields explicitly
  • Used aws.StringValue() for safe nil pointer handling
  • Included the most relevant fields: StatusCode, StatusMessage, Description, and Cause
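  • The fields most useful for debugging are preserved; ActivityId, AutoScalingGroupARN, Details, Progress, StartTime, and EndTime are no longer logged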

Testing

  • All existing AWS cloudprovider tests pass
  • Verified single-line log output format
  • No regressions introduced
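The PR does not show how the single-line format was verified. One possible check, sketched here under the assumption that the formatted message is available as a string, is to assert that it contains no line breaks; `isSingleLine` and both sample messages are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// isSingleLine reports whether a formatted log message has no line breaks.
func isSingleLine(msg string) bool {
	return !strings.ContainsAny(msg, "\r\n")
}

func main() {
	// Shape of the old output: the pretty-printed struct embeds newlines.
	oldStyle := "ASG example scaling failed with {\n  StatusCode: \"Failed\"\n}"
	// Shape of the new output: explicit fields on one line.
	newStyle := "ASG example scaling failed: StatusCode=Failed, StatusMessage=, Description=, Cause="

	fmt.Println(isSingleLine(oldStyle), isSingleLine(newStyle)) // prints "false true"
}
```

A check like this would also flag a field value that itself contains a newline; if that ever occurs in practice, quoting the values with `%q` instead of `%s` would be one way to keep the record on a single line.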

/area cluster-autoscaler
/kind bug

Original prompt

This section details the original issue you should resolve.

<issue_title>Logging multi-line klog output on cluster-autoscaler when scaling failed</issue_title>
<issue_description>

Which component are you using?:

/area cluster-autoscaler

What version of the component are you using?:

Component version: v1.31.1

What k8s version are you using (kubectl version)?:

kubectl version Output
$ kubectl version
Client Version: v1.30.1
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.31.4-eks-2d5f260

What environment is this in?:

AWS EKS

What did you expect to happen?:

cluster-autoscaler always logs single-line klog output.

What happened instead?:

The log statement at auto_scaling_groups.go:551 emits multi-line klog output.

I0116 07:10:42.719737       1 auto_scaling_groups.go:508] Instance group eks-batch-spot202407091400-ap-northeast-1a-72c85d24-0d0c-3317-6d99-xxxxxxxxxxxx has only 0 instances created while requested count is 1. Creating placeholder instances.
W0116 07:10:42.815402       1 auto_scaling_groups.go:551] ASG eks-batch-spot202407091400-ap-northeast-1a-72c85d24-0d0c-3317-6d99-xxxxxxxxxxxx scaling failed with {
  ActivityId: "4b6651b7-c98a-a0c3-5fe1-xxxxxxxxxxxx",
  AutoScalingGroupARN: "arn:aws:autoscaling:ap-northeast-1:xxxxxxxxxxxx:autoScalingGroup:15be2ddc-7564-4bf2-bb35-xxxxxxxxxxxx:autoScalingGroupName/eks-batch-spot202407091400-ap-northeast-1a-72c85d24-0d0c-3317-6d99-xxxxxxxxxxxx",
  AutoScalingGroupName: "eks-batch-spot202407091400-ap-northeast-1a-72c85d24-0d0c-3317-6d99-xxxxxxxxxxxx",
  Cause: "At 2025-01-16T07:09:48Z an instance was started in response to a difference between desired and actual capacity, increasing the capacity from 0 to 1.",
  Description: "Launching a new EC2 instance.  Status Reason: Could not launch Spot Instances. UnfulfillableCapacity - Unable to fulfill capacity due to your request configuration. Please adjust your request and try again. Launching EC2 instance failed.",
  Details: "{\"Subnet ID\":\"subnet-xxxxxxxxxxxx\",\"Availability Zone\":\"ap-northeast-1a\"}",
  EndTime: 2025-01-16 07:09:49 +0000 UTC,
  Progress: 100,
  StartTime: 2025-01-16 07:09:49.096 +0000 UTC,
  StatusCode: "Failed",
  StatusMessage: "Could not launch Spot Instances. UnfulfillableCapacity - Unable to fulfill capacity due to your request configuration. Please adjust your request and try again. Launching EC2 instance failed."
}
W0116 07:10:42.815520       1 auto_scaling_groups.go:516] Instance group eks-batch-spot202407091400-ap-northeast-1a-72c85d24-0d0c-3317-6d99-xxxxxxxxxxxx cannot provision any more nodes!
I0116 07:10:43.011817       1 auto_scaling_groups.go:161] Updated ASG cache for eks-daemon-spot-202407091400-ap-northeast-1a-56c85d24-0d05-8da9-1e78-xxxxxxxxxxxx. min/max/current is 1/450/2

As a result, some services, such as Datadog, cannot parse this output as valid klog.

How to reproduce it (as minimally and precisely as possible):

Trigger a failed scaling activity so that the following statement is executed:

klog.Warningf("ASG %s scaling failed with %s", asgRef.Name, *activity)

Anything else we need to know?:

N/A</issue_description>

Comments on the Issue (you are @copilot in this section)

@jackfrancis: /remove-lifecycle stale
@jackfrancis: /assign @copilot
@k8s-ci-robot: @jackfrancis: GitHub didn't allow me to assign the following users: copilot.

Note that only kubernetes members with read permissions, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time.
For more information please see the contributor guide

In response to this:

/assign @copilot

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Fixes #7702



Fix multi-line logging of AWS Activity struct

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-area needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Oct 21, 2025
@k8s-ci-robot
Contributor

Hi @Copilot. Thanks for your PR.

I'm waiting for a github.com member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. label Oct 21, 2025
Co-authored-by: jackfrancis <1494193+jackfrancis@users.noreply.github.com>
@k8s-ci-robot k8s-ci-robot added area/cluster-autoscaler area/provider/aws Issues or PRs related to aws provider and removed do-not-merge/needs-area labels Oct 21, 2025
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: Copilot
Once this PR has been reviewed and has the lgtm label, please ask for approval from jackfrancis. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Copilot AI changed the title [WIP] Fix logging of multi-line klog output in cluster-autoscaler Fix multi-line logging of AWS Activity struct in cluster-autoscaler Oct 21, 2025
Copilot AI requested a review from jackfrancis October 21, 2025 21:26
Copilot finished work on behalf of jackfrancis October 21, 2025 21:26
@jackfrancis
Contributor

/ok-to-test

@k8s-ci-robot k8s-ci-robot added the ok-to-test Indicates a non-member PR verified by an org member that is safe to test. label Oct 21, 2025
@jackfrancis jackfrancis marked this pull request as ready for review October 21, 2025 21:31
@k8s-ci-robot k8s-ci-robot removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. labels Oct 21, 2025
@jackfrancis
Contributor

/release-note-edit

Fix multi-line logging of AWS Activity struct

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Oct 21, 2025
@jackfrancis
Contributor

/label tide/merge-method-squash

@k8s-ci-robot k8s-ci-robot added the tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges. label Oct 21, 2025

Labels

area/cluster-autoscaler area/provider/aws Issues or PRs related to aws provider cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

CA: Use default go vet analyzers
Logging multi-line klog output on cluster-autoscaler when scaling failed

3 participants