
refactor(ci): Add actions.summerwind.dev ARC runner deployment option #797

Open
XinyueZhang369 wants to merge 3 commits into main from xz/runner-deployment

Conversation

@XinyueZhang369 (Collaborator) commented Mar 17, 2026

Description

Problem

The existing ARC deployment guide only covers the official GitHub ARC controller (ghcr.io/actions/actions-runner-controller-charts). Some clusters require or prefer the community actions.summerwind.dev controller which uses RunnerDeployment CRDs and provides HorizontalRunnerAutoscaler support.

Solution

Add an alternative deployment path using the actions.summerwind.dev ARC controller to the README, along with the corresponding Kubernetes manifests for RBAC, CPU runners, GPU runners, and autoscaling.
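For readers unfamiliar with the community controller, this is roughly the shape of the two CRDs this deployment path relies on. A minimal sketch only: the resource names, repository value, and thresholds below are illustrative placeholders, not the manifests added in this PR.

```yaml
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: example-runner-deployment      # placeholder name
spec:
  template:
    spec:
      repository: my-org/my-repo       # placeholder repository
      labels:
        - k8s-runner-cpu
---
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  name: example-runner-autoscaler      # placeholder name
spec:
  scaleTargetRef:
    name: example-runner-deployment
  minReplicas: 1
  maxReplicas: 5
  metrics:
    - type: PercentageRunnersBusy
      scaleUpThreshold: "0.75"
      scaleDownThreshold: "0.25"
```

The controller watches the RunnerDeployment and registers ephemeral runner pods against the repository; the autoscaler adjusts its replica count between the min/max bounds.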

Changes

  • Add RunnerDeployment-based runner manifests (arc-runner-cpu.yaml, arc-runner-gpu.yaml, arc-runner-rbac.yaml, arc-runner-autoscaler.yaml)
  • Update README.md with an alternative section documenting the actions.summerwind.dev ARC installation and deployment steps
  • Fix GPU resource units and adjust min/max for HorizontalRunnerAutoscaler

Test Plan

  • Deploy the actions.summerwind.dev controller via Helm and verify pods are running in actions-runner-system
  • Apply the RBAC, runner, and autoscaler manifests and confirm RunnerDeployment and HorizontalRunnerAutoscaler resources are created
  • Trigger a GitHub Actions workflow targeting the runner labels and verify runner pods are provisioned
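Assuming the community chart's published Helm repository and the actions-runner-system namespace used in this PR (the release name and file paths below are illustrative), the test plan corresponds roughly to:

```shell
# Install the community ARC controller via Helm.
helm repo add actions-runner-controller \
  https://actions-runner-controller.github.io/actions-runner-controller
helm upgrade --install actions-runner-controller \
  actions-runner-controller/actions-runner-controller \
  --namespace actions-runner-system --create-namespace

# Verify the controller pods are running.
kubectl get pods -n actions-runner-system

# Apply the manifests from this PR, then confirm the CRD resources exist.
kubectl apply -f scripts/k8s-runner-resources/arc-runner-rbac.yaml
kubectl apply -f scripts/k8s-runner-resources/arc-runner-cpu.yaml
kubectl apply -f scripts/k8s-runner-resources/arc-runner-gpu.yaml
kubectl apply -f scripts/k8s-runner-resources/arc-runner-autoscaler.yaml
kubectl get runnerdeployment,horizontalrunnerautoscaler -n actions-runner-system
```

After triggering a workflow that targets the runner labels, `kubectl get pods -n actions-runner-system` should show runner pods being provisioned.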
Checklist
  • cargo +nightly fmt passes
  • cargo clippy --all-targets --all-features -- -D warnings passes
  • (Optional) Documentation updated
  • (Optional) Please join us on Slack #sig-smg to discuss, review, and merge PRs

Summary by CodeRabbit

Release Notes

  • Documentation

    • Added comprehensive guide for deploying self-hosted runners using the community-maintained actions-runner-controller with step-by-step setup instructions.
  • New Features

    • GPU and CPU-optimized runner deployment configurations for diverse workload requirements.
    • Kubernetes-native autoscaling capabilities for self-hosted runners.
    • Security configurations for runner access control.

Signed-off-by: XinyueZhang369 <zoeyzhang369@gmail.com>
@github-actions github-actions bot added the documentation Improvements or additions to documentation label Mar 17, 2026

coderabbitai bot commented Mar 17, 2026

📝 Walkthrough

Walkthrough

This PR introduces Kubernetes manifests and documentation for deploying GitHub Actions runners using the community actions-runner-controller (ARC) variant from actions.summerwind.dev. The changes include CPU and GPU runner deployments, autoscaling configurations, RBAC setup, and an extended README guide covering installation and verification procedures.

Changes

Cohort / File(s) Summary
Documentation
scripts/k8s-runner-resources/README.md
Adds a comprehensive guide section for alternative ARC controller installation, GitHub App setup, secret creation, resource deployment, and verification steps for the summerwind.dev variant.
RBAC & Access Control
scripts/k8s-runner-resources/arc-runner-rbac.yaml
Establishes ServiceAccount, Role, and RoleBinding for arc-runner-sa in actions-runner-system namespace with permissions for secrets and pod access.
Runner Deployments
scripts/k8s-runner-resources/arc-runner-cpu.yaml, scripts/k8s-runner-resources/arc-runner-gpu.yaml
Adds RunnerDeployment configurations for CPU-based runners (8 CPU, 16Gi memory) and multiple GPU variants (H100, A10) with node selectors, tolerations, persistent volume claims for model caching, DinD sidecars, and environment variable bindings for API keys.
Autoscaling
scripts/k8s-runner-resources/arc-runner-autoscaler.yaml
Defines HorizontalRunnerAutoscaler resources with scaling metrics (TotalNumberOfQueuedAndInProgressWorkflowRuns, PercentageRunnersBusy) and min/max replica constraints for each runner deployment variant.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes


Suggested labels

ci, documentation

Suggested reviewers

  • CatherineSue
  • key4ng
  • slin1237

Poem

🐰 A runner hops through Kubernetes fields,
With CPU cores and GPU shields,
Autoscalers scale when jobs run deep,
RBAC guards secrets we keep! ✨

🚥 Pre-merge checks | ✅ 3 passed
  • Description check ✅ Passed — Check skipped; CodeRabbit's high-level summary is enabled.
  • Title check ✅ Passed — The title accurately describes the main change: adding an alternative ARC runner deployment option using the actions.summerwind.dev controller, which is the primary objective of this PR.
  • Docstring coverage ✅ Passed — No functions found in the changed files to evaluate docstring coverage; skipping the docstring coverage check.


@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request expands the available options for deploying GitHub Actions self-hosted runners on Kubernetes. It provides a new, fully documented method utilizing the actions.summerwind.dev controller, offering greater flexibility for environments that prefer or require this specific community-driven solution. The changes include all necessary Kubernetes configuration files to set up CPU and GPU runners with proper access controls and dynamic scaling capabilities.

Highlights

  • New ARC Deployment Option: Added an alternative deployment path for GitHub Actions self-hosted runners using the community actions.summerwind.dev ARC controller, which leverages RunnerDeployment CRDs and HorizontalRunnerAutoscaler.
  • Kubernetes Manifests: Introduced new Kubernetes manifests for RBAC, CPU runners, various GPU runners (H100, A10), and their corresponding autoscalers.
  • Documentation Update: Updated the README.md with comprehensive installation and deployment steps for the actions.summerwind.dev ARC option.
  • Resource Configuration: Adjusted GPU resource units and refined minimum/maximum replica settings for HorizontalRunnerAutoscaler configurations.

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces an alternative deployment method for ARC runners using the actions.summerwind.dev controller, adding new Kubernetes manifests and updating documentation. My review found several critical and high-severity issues in the new manifests. The RBAC role for runner pods is overly permissive, creating a security risk. The CPU runner's Docker-in-Docker configuration is incomplete, lacking proper setup for the Docker socket and requiring DOCKER_TLS_CERTDIR to be explicitly set to an empty string, which will cause Docker-dependent jobs to fail. Additionally, a GPU runner deployment is missing necessary CPU and memory resource definitions and also requires the DOCKER_TLS_CERTDIR environment variable to be set for its Docker-in-Docker setup, potentially causing instability. Finally, the CPU runner's autoscaler configuration contains invalid parameters that will prevent it from functioning correctly. Addressing these points will improve the security, stability, and functionality of the new runner deployments.

Comment on lines +104 to +109

  metrics:
    - type: PercentageRunnersBusy
      scaleUpThreshold: "0.95"
      scaleDownThreshold: "0.25"
      scaleUpFactor: "0.5"
      scaleDownFactor: "0.5"


critical

The arc-cpu-runner-autoscaler is configured to use the PercentageRunnersBusy metric, but it includes scaleUpFactor and scaleDownFactor fields. These fields are only valid for the TotalNumberOfQueuedAndInProgressWorkflowRuns metric type and will be ignored or cause an error here. This will prevent the autoscaler from functioning as expected.

  metrics:
    - type: PercentageRunnersBusy
      scaleUpThreshold: "0.95"
      scaleDownThreshold: "0.25"

Comment on lines +9 to +48

    spec:
      ephemeral: true
      repository: lightseekorg/smg
      labels:
        - k8s-runner-cpu
      serviceAccountName: arc-runner-sa

      containers:
        - name: runner
          image: fra.ocir.io/idqj093njucb/action-runner:v0.0.1
          resources:
            requests:
              cpu: "8"
              memory: "16Gi"
            limits:
              cpu: "8"
              memory: "16Gi"
          env:
            - name: HF_TOKEN
              valueFrom:
                secretKeyRef:
                  key: HUGGINGFACE_API_KEY
                  name: huggingface-secret
            - name: OPENAI_API_KEY
              valueFrom:
                secretKeyRef:
                  key: OPENAI_API_KEY
                  name: openai-api-key
            - name: ANTHROPIC_API_KEY
              valueFrom:
                secretKeyRef:
                  key: ANTHROPIC_API_KEY
                  name: anthropic-api-key
            - name: XAI_API_KEY
              valueFrom:
                secretKeyRef:
                  key: XAI_API_KEY
                  name: xai-api-key
        - name: docker
          image: fra.ocir.io/idqj093njucb/docker:dind


critical

The Docker-in-Docker (dind) configuration for the CPU runner is incomplete. The runner container is missing the DOCKER_HOST environment variable and volume mounts for the Docker socket. The docker sidecar is missing the privileged security context, resource definitions, and volume mounts required for it to function correctly. Additionally, the DOCKER_TLS_CERTDIR environment variable must be set to an empty string in the docker sidecar to disable TLS for the Docker socket, which is necessary for DinD setups. This will cause any Docker operations in workflows on this runner to fail. The configuration should be updated to properly set up the dind sidecar and the communication between the two containers, similar to the GPU runner definitions.

    spec:
      ephemeral: true
      repository: lightseekorg/smg
      labels:
        - k8s-runner-cpu
      serviceAccountName: arc-runner-sa

      volumes:
        - name: docker-sock
          emptyDir: {}
        - name: docker-storage
          emptyDir: {}

      containers:
        - name: runner
          image: fra.ocir.io/idqj093njucb/action-runner:v0.0.1
          resources:
            requests:
              cpu: "8"
              memory: "16Gi"
            limits:
              cpu: "8"
              memory: "16Gi"
          volumeMounts:
            - name: docker-sock
              mountPath: /var/run
          env:
            - name: DOCKER_HOST
              value: unix:///var/run/docker.sock
            - name: HF_TOKEN
              valueFrom:
                secretKeyRef:
                  key: HUGGINGFACE_API_KEY
                  name: huggingface-secret
            - name: OPENAI_API_KEY
              valueFrom:
                secretKeyRef:
                  key: OPENAI_API_KEY
                  name: openai-api-key
            - name: ANTHROPIC_API_KEY
              valueFrom:
                secretKeyRef:
                  key: ANTHROPIC_API_KEY
                  name: anthropic-api-key
            - name: XAI_API_KEY
              valueFrom:
                secretKeyRef:
                  key: XAI_API_KEY
                  name: xai-api-key
        - name: docker
          image: fra.ocir.io/idqj093njucb/docker:dind
          securityContext:
            privileged: true
          resources:
            requests:
              cpu: "1"
              memory: "2Gi"
            limits:
              cpu: "2"
              memory: "4Gi"
          env:
            - name: DOCKER_TLS_CERTDIR
              value: ""
            - name: DOCKER_DRIVER
              value: overlay2
          volumeMounts:
            - name: docker-sock
              mountPath: /var/run
            - name: docker-storage
              mountPath: /var/lib/docker
References
  1. When using a Docker-in-Docker (DinD) setup, it is necessary to disable TLS for the Docker socket by setting the DOCKER_TLS_CERTDIR environment variable to an empty string.

Comment on lines +168 to +215

      containers:
        - name: runner
          image: fra.ocir.io/idqj093njucb/action-runner:v0.0.1
          resources:
            limits:
              nvidia.com/gpu: 4
          volumeMounts:
            - name: model-cache
              mountPath: /models
            - name: docker-sock
              mountPath: /var/run
            - name: dshm
              mountPath: /dev/shm
          env:
            - name: DOCKER_HOST
              value: unix:///var/run/docker.sock
            - name: HF_TOKEN
              valueFrom:
                secretKeyRef:
                  key: HUGGINGFACE_API_KEY
                  name: huggingface-secret
            - name: OPENAI_API_KEY
              valueFrom:
                secretKeyRef:
                  key: OPENAI_API_KEY
                  name: openai-api-key
            - name: ANTHROPIC_API_KEY
              valueFrom:
                secretKeyRef:
                  key: ANTHROPIC_API_KEY
                  name: anthropic-api-key
            - name: XAI_API_KEY
              valueFrom:
                secretKeyRef:
                  key: XAI_API_KEY
                  name: xai-api-key
        - name: docker
          image: fra.ocir.io/idqj093njucb/docker:dind
          securityContext:
            privileged: true  # Required for DinD
          env:
            - name: DOCKER_TLS_CERTDIR
              value: ""  # Disables TLS for shared socket use
          volumeMounts:
            - name: docker-sock
              mountPath: /var/run
            - name: docker-storage
              mountPath: /var/lib/docker


high

The arc-runner-gpu-a10 deployment is missing CPU and memory resource requests and limits for both the runner and docker containers. Additionally, for the docker sidecar, the DOCKER_TLS_CERTDIR environment variable must be set to an empty string to disable TLS for the Docker socket, which is necessary for DinD setups. This results in a lower Quality of Service (QoS) class, making the pods more likely to be evicted under node pressure. It is a best practice to explicitly define resources for all containers to ensure predictable performance and scheduling.

      containers:
        - name: runner
          image: fra.ocir.io/idqj093njucb/action-runner:v0.0.1
          resources:
            requests:
              cpu: "16"
              memory: "64Gi"
            limits:
              cpu: "16"
              memory: "64Gi"
              nvidia.com/gpu: 4
          volumeMounts:
            - name: model-cache
              mountPath: /models
            - name: docker-sock
              mountPath: /var/run
            - name: dshm
              mountPath: /dev/shm
          env:
            - name: DOCKER_HOST
              value: unix:///var/run/docker.sock
            - name: HF_TOKEN
              valueFrom:
                secretKeyRef:
                  key: HUGGINGFACE_API_KEY
                  name: huggingface-secret
            - name: OPENAI_API_KEY
              valueFrom:
                secretKeyRef:
                  key: OPENAI_API_KEY
                  name: openai-api-key
            - name: ANTHROPIC_API_KEY
              valueFrom:
                secretKeyRef:
                  key: ANTHROPIC_API_KEY
                  name: anthropic-api-key
            - name: XAI_API_KEY
              valueFrom:
                secretKeyRef:
                  key: XAI_API_KEY
                  name: xai-api-key
        - name: docker
          image: fra.ocir.io/idqj093njucb/docker:dind
          securityContext:
            privileged: true  # Required for DinD
          resources:
            requests:
              cpu: "1"
              memory: "2Gi"
            limits:
              cpu: "2"
              memory: "4Gi"
          env:
            - name: DOCKER_TLS_CERTDIR
              value: ""  # Disables TLS for shared socket use
            - name: DOCKER_DRIVER
              value: overlay2
          volumeMounts:
            - name: docker-sock
              mountPath: /var/run
            - name: docker-storage
              mountPath: /var/lib/docker
References
  1. When using a Docker-in-Docker (DinD) setup, it is necessary to disable TLS for the Docker socket by setting the DOCKER_TLS_CERTDIR environment variable to an empty string.

Comment on lines +25 to +27

      - pods
      - pods/log
      - pods/exec


high

The Role for the runner pods grants permissions for pods/log and pods/exec. This is overly permissive and violates the principle of least privilege. A standard runner pod does not need to execute commands in or view logs of other pods. These permissions could be abused if a workflow is compromised and should be removed to enhance security.

      - pods

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 6

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@scripts/k8s-runner-resources/arc-runner-autoscaler.yaml`:
- Around line 104-109: The PercentageRunnersBusy metric configuration uses scale
factors that decrease capacity on scale-up; update the metric block (metrics /
type: PercentageRunnersBusy) so that scaleUpFactor is greater than 1 (e.g.,
"1.4" or "1.5") to increase runners when busy and scaleDownFactor remains less
than 1 (e.g., "0.7") to reduce runners when underutilized, keeping the existing
thresholds (scaleUpThreshold and scaleDownThreshold) as-is.

In `@scripts/k8s-runner-resources/arc-runner-cpu.yaml`:
- Around line 47-48: The docker sidecar container (name: docker) is missing
critical DinD configuration: add securityContext.privileged: true to the docker
container, add volumeMounts for the docker socket and storage (mounts named
docker-sock and docker-storage) and ensure matching volumes are defined at the
pod level, add environment variables DOCKER_TLS_CERTDIR (empty string) and
DOCKER_DRIVER (e.g., overlay2) to the docker container, and add appropriate
resources.requests and resources.limits (cpu/memory) similar to the GPU runner's
docker sidecar so DinD can run properly and the runner can access the
socket/storage.
- Around line 16-48: The CPU deployment is missing the Docker socket and related
volumes/volumeMounts so the runner container cannot talk to the dind container;
add a top-level volumes block defining docker-sock (hostPath
/var/run/docker.sock), docker-storage (emptyDir) and dshm (emptyDir with medium:
Memory) and update the runner container (name: runner) to include volumeMounts
for docker-sock (mountPath: /var/run/docker.sock), docker-storage (mountPath:
/var/lib/docker) and dshm (mountPath: /dev/shm); ensure the dind container
(name: docker) also mounts those same volumes so DOCKER_HOST:
unix:///var/run/docker.sock works correctly.

In `@scripts/k8s-runner-resources/arc-runner-gpu.yaml`:
- Around line 168-215: The arc-runner-gpu-a10 deployment is missing resource
requests for the runner container, resources for the docker sidecar, and a
memory-backed docker-storage volume; update the YAML to match other GPU
deployments by adding a resources block to the runner container (keep
limits.nvidia.com/gpu: 4 and add cpu and memory requests and limits), add a
resources.requests/limits block to the docker sidecar container (name: docker)
similar to other DinD sidecars, and change the docker-storage volume definition
(name: docker-storage) to an emptyDir with medium: Memory and sizeLimit: 4Gi to
make it memory-backed. Ensure you modify the runner and docker container specs
and the docker-storage volume declaration accordingly.

In `@scripts/k8s-runner-resources/arc-runner-rbac.yaml`:
- Around line 13-20: Replace the misleading "# Argo Workflows" comment above the
RBAC block with an accurate label for ARC/GitHub Actions runners (e.g., "#
Actions Runner Controller (ARC) Runners" or "# ARC runners") so the comment
correctly reflects the RBAC purpose; update the comment that immediately
precedes the secrets resource block (the line containing "# Argo Workflows") to
the new text in the same location.
- Around line 22-31: The RBAC rule for the Pods resource includes the pods/exec
subresource but only grants read verbs (get, list, watch), which prevents exec
operations; update the verbs array for the Pods rule that lists "pods" and
"pods/exec" to include "create" if runners should be able to exec into pods, or
add a clear inline comment next to the resources/verbs block (referencing the
Pods rule and the "pods/exec" entry) stating the intent is read-only and that
"create" was intentionally omitted.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: d7c51f99-fc49-4531-b171-992bec599165

📥 Commits

Reviewing files that changed from the base of the PR and between 7f54c64 and 070ea73.

📒 Files selected for processing (5)
  • scripts/k8s-runner-resources/README.md
  • scripts/k8s-runner-resources/arc-runner-autoscaler.yaml
  • scripts/k8s-runner-resources/arc-runner-cpu.yaml
  • scripts/k8s-runner-resources/arc-runner-gpu.yaml
  • scripts/k8s-runner-resources/arc-runner-rbac.yaml

Comment on lines +104 to +109

  metrics:
    - type: PercentageRunnersBusy
      scaleUpThreshold: "0.95"
      scaleDownThreshold: "0.25"
      scaleUpFactor: "0.5"
      scaleDownFactor: "0.5"

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🌐 Web query:

actions-runner-controller PercentageRunnersBusy scaleUpFactor scaleDownFactor configuration

💡 Result:

In Actions Runner Controller (ARC), PercentageRunnersBusy is a HorizontalRunnerAutoscaler.spec.metrics[].type that scales a RunnerDeployment based on the percentage of runners currently busy (running jobs). ARC queries GitHub for the number of busy runners in the RunnerDeployment’s namespace, then adjusts the desired replica count when thresholds are crossed. [1]

Example configuration

apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
spec:
  scaleTargetRef:
    name: example-runner-deployment
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: PercentageRunnersBusy
    scaleUpThreshold: "0.75"
    scaleDownThreshold: "0.30"
    scaleUpFactor: "1.4"
    scaleDownFactor: "0.7"

[1][2]

What the fields mean (for PercentageRunnersBusy)

  • scaleUpThreshold: if % busy is greater than this value, ARC re-evaluates desired runners to scale up. [2]
  • scaleDownThreshold: if % busy is less than this value, ARC re-evaluates desired runners to scale down. [2][3]
  • scaleUpFactor: multiplier applied to the current runner count to compute the new desired count on scale-up (e.g., 10 → 10 * 1.4 → 14). [1][2]
  • scaleDownFactor: multiplier applied to the current runner count to compute the new desired count on scale-down (e.g., 10 → 10 * 0.7 → 7). [1][2]

Notes:

  • You typically pair these with minReplicas / maxReplicas bounds. [1]
  • ARC supports either factor-based scaling (scaleUpFactor/scaleDownFactor) or step-based scaling (scaleUpAdjustment/scaleDownAdjustment) for a metric (not both). [3]

Sources:
[1] ARC docs (actions.github.io/actions-runner-controller)
[2] ARC API docs on pkg.go.dev (actions-runner-controller)
[3] HorizontalRunnerAutoscalerMetrics docs (docs.rs)


Correct scaleUpFactor and scaleDownFactor values for PercentageRunnersBusy metric.

The format of scaleUpFactor: "0.5" and scaleDownFactor: "0.5" as strings is correct, but the values are problematic. These fields are multipliers applied to the current runner count:

  • When scaleUpThreshold: "0.95" is exceeded (≥95% runners busy), the system applies scaleUpFactor: "0.5", which reduces the runner count to 50% of current (e.g., 4 → 2 runners). This is the opposite of the intended behavior—you should add runners when they are busy, not remove them.
  • Similarly, scaleDownFactor: "0.5" also reduces runners, making both directions reduce capacity.

Use a factor greater than 1 for scaleUpFactor (e.g., "1.5" or "1.4") to increase runners, and keep scaleDownFactor less than 1 (e.g., "0.7") to decrease runners during low usage.
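The factor arithmetic can be sanity-checked with a tiny calculation. This is a sketch of the documented multiplier behavior, not ARC's actual implementation; the helper name and the rounding choice are ours.

```python
import math

def desired_replicas(current: int, factor: float, min_r: int, max_r: int) -> int:
    """Apply an ARC-style scale factor and clamp to the replica bounds."""
    return max(min_r, min(max_r, math.ceil(current * factor)))

# With this PR's values, a "scale-up" event on 4 busy runners shrinks the fleet:
print(desired_replicas(4, 0.5, 1, 10))   # 2 -- capacity removed while busy
# With a factor above 1, the same event grows it:
print(desired_replicas(4, 1.5, 1, 10))   # 6
```

Note that maxReplicas still caps growth, so a factor above 1 cannot scale the fleet without bound.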


Comment on lines +16 to +48

      containers:
        - name: runner
          image: fra.ocir.io/idqj093njucb/action-runner:v0.0.1
          resources:
            requests:
              cpu: "8"
              memory: "16Gi"
            limits:
              cpu: "8"
              memory: "16Gi"
          env:
            - name: HF_TOKEN
              valueFrom:
                secretKeyRef:
                  key: HUGGINGFACE_API_KEY
                  name: huggingface-secret
            - name: OPENAI_API_KEY
              valueFrom:
                secretKeyRef:
                  key: OPENAI_API_KEY
                  name: openai-api-key
            - name: ANTHROPIC_API_KEY
              valueFrom:
                secretKeyRef:
                  key: ANTHROPIC_API_KEY
                  name: anthropic-api-key
            - name: XAI_API_KEY
              valueFrom:
                secretKeyRef:
                  key: XAI_API_KEY
                  name: xai-api-key
        - name: docker
          image: fra.ocir.io/idqj093njucb/docker:dind


⚠️ Potential issue | 🔴 Critical

Missing volumes and volume mounts for Docker socket sharing.

The runner container references DOCKER_HOST: unix:///var/run/docker.sock in GPU deployments, but this CPU deployment is missing:

  1. The volumes section entirely (no docker-sock, docker-storage, dshm volumes)
  2. Volume mounts in the runner container

Without shared volumes, the runner and DinD containers cannot communicate.

Proposed fix to add volumes section
       serviceAccountName: arc-runner-sa
+
+      volumes:
+        - name: docker-sock
+          emptyDir: {}
+        - name: docker-storage
+          emptyDir:
+            medium: Memory
+            sizeLimit: 4Gi
 
       containers:
         - name: runner
           image: fra.ocir.io/idqj093njucb/action-runner:v0.0.1
           resources:
             requests:
               cpu: "8"
               memory: "16Gi"
             limits:
               cpu: "8"
               memory: "16Gi"
+          volumeMounts:
+            - name: docker-sock
+              mountPath: /var/run
           env:
+            - name: DOCKER_HOST
+              value: unix:///var/run/docker.sock
             - name: HF_TOKEN

Comment on lines +47 to +48

        - name: docker
          image: fra.ocir.io/idqj093njucb/docker:dind

⚠️ Potential issue | 🔴 Critical

Docker-in-Docker sidecar is missing critical configuration.

The docker container is incomplete compared to the GPU runner manifests. It's missing:

  • securityContext.privileged: true (required for DinD)
  • Volume mounts for docker-sock and docker-storage
  • Environment variables (DOCKER_TLS_CERTDIR, DOCKER_DRIVER)
  • Resource requests/limits

Without these, the DinD sidecar will fail to function, and the runner container won't be able to use Docker.

Proposed fix based on GPU runner configuration
         - name: docker
           image: fra.ocir.io/idqj093njucb/docker:dind
+          securityContext:
+            privileged: true  # Required for DinD
+          resources:
+            requests:
+              cpu: "1"
+              memory: "2Gi"
+            limits:
+              cpu: "2"
+              memory: "4Gi"
+          env:
+            - name: DOCKER_TLS_CERTDIR
+              value: ""  # Disables TLS for shared socket use
+            - name: DOCKER_DRIVER
+              value: overlay2
+          volumeMounts:
+            - name: docker-sock
+              mountPath: /var/run
+            - name: docker-storage
+              mountPath: /var/lib/docker

Comment on lines +168 to +215
      containers:
        - name: runner
          image: fra.ocir.io/idqj093njucb/action-runner:v0.0.1
          resources:
            limits:
              nvidia.com/gpu: 4
          volumeMounts:
            - name: model-cache
              mountPath: /models
            - name: docker-sock
              mountPath: /var/run
            - name: dshm
              mountPath: /dev/shm
          env:
            - name: DOCKER_HOST
              value: unix:///var/run/docker.sock
            - name: HF_TOKEN
              valueFrom:
                secretKeyRef:
                  key: HUGGINGFACE_API_KEY
                  name: huggingface-secret
            - name: OPENAI_API_KEY
              valueFrom:
                secretKeyRef:
                  key: OPENAI_API_KEY
                  name: openai-api-key
            - name: ANTHROPIC_API_KEY
              valueFrom:
                secretKeyRef:
                  key: ANTHROPIC_API_KEY
                  name: anthropic-api-key
            - name: XAI_API_KEY
              valueFrom:
                secretKeyRef:
                  key: XAI_API_KEY
                  name: xai-api-key
        - name: docker
          image: fra.ocir.io/idqj093njucb/docker:dind
          securityContext:
            privileged: true  # Required for DinD
          env:
            - name: DOCKER_TLS_CERTDIR
              value: ""  # Disables TLS for shared socket use
          volumeMounts:
            - name: docker-sock
              mountPath: /var/run
            - name: docker-storage
              mountPath: /var/lib/docker
⚠️ Potential issue | 🟠 Major

Inconsistent configuration in arc-runner-gpu-a10 deployment.

This deployment differs from the other GPU deployments in several ways:

  1. Missing runner container resource requests (lines 171-173): Only limits.nvidia.com/gpu is specified; no CPU/memory requests like other deployments have.
  2. Missing docker container resources (lines 204-215): No resource requests/limits for the DinD sidecar.
  3. docker-storage volume not memory-backed (lines 161-162): Other deployments use medium: Memory with sizeLimit: 4Gi.

These inconsistencies may cause scheduling issues or resource contention.

Proposed fix to align with other GPU deployments
       containers:
         - name: runner
           image: fra.ocir.io/idqj093njucb/action-runner:v0.0.1
           resources:
+            requests:
+              cpu: "32"
+              memory: "128Gi"
             limits:
               nvidia.com/gpu: 4
         - name: docker-storage
-          emptyDir: {}
+          emptyDir:
+            medium: Memory
+            sizeLimit: 4Gi
         - name: docker
           image: fra.ocir.io/idqj093njucb/docker:dind
           securityContext:
             privileged: true  # Required for DinD
+          resources:
+            requests:
+              cpu: "1"
+              memory: "2Gi"
+            limits:
+              cpu: "2"
+              memory: "4Gi"
           env:
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/k8s-runner-resources/arc-runner-gpu.yaml` around lines 168 - 215, The
arc-runner-gpu-a10 deployment is missing resource requests for the runner
container, resources for the docker sidecar, and a memory-backed docker-storage
volume; update the YAML to match other GPU deployments by adding a resources
block to the runner container (keep limits.nvidia.com/gpu: 4 and add cpu and
memory requests and limits), add a resources.requests/limits block to the docker
sidecar container (name: docker) similar to other DinD sidecars, and change the
docker-storage volume definition (name: docker-storage) to an emptyDir with
medium: Memory and sizeLimit: 4Gi to make it memory-backed. Ensure you modify
the runner and docker container specs and the docker-storage volume declaration
accordingly.

Comment on lines +13 to +20
  # Argo Workflows
  - apiGroups: [""]
    resources:
      - secrets
    verbs:
      - get
      - list
      - watch
🧹 Nitpick | 🔵 Trivial

Misleading comment: this is for ARC runners, not Argo Workflows.

The comment # Argo Workflows on line 13 appears to be a copy-paste artifact. This RBAC manifest is for GitHub Actions runners via the actions-runner-controller, not Argo Workflows.

Suggested fix
 rules:
-  # Argo Workflows
+  # Access to secrets for runner configuration
   - apiGroups: [""]
     resources:
       - secrets
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/k8s-runner-resources/arc-runner-rbac.yaml` around lines 13 - 20,
Replace the misleading "# Argo Workflows" comment above the RBAC block with an
accurate label for ARC/GitHub Actions runners (e.g., "# Actions Runner
Controller (ARC) Runners" or "# ARC runners") so the comment correctly reflects
the RBAC purpose; update the comment that immediately precedes the secrets
resource block (the line containing "# Argo Workflows") to the new text in the
same location.

Comment on lines +22 to +31
  # Pods
  - apiGroups: [""]
    resources:
      - pods
      - pods/log
      - pods/exec
    verbs:
      - get
      - list
      - watch
⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check if other runner RBAC files in the repo grant create verb for pods/exec
rg -n "pods/exec" --type yaml -A 5 -B 2

Repository: lightseekorg/smg

Length of output: 585


🌐 Web query:

Kubernetes RBAC pods/exec create verb permission required

💡 Result:

To allow kubectl exec via Kubernetes RBAC, you must grant the create verb on the pods/exec subresource (core API group ""). [1], [2]

Typical minimal Role rules (namespace-scoped):

rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get"]          # kubectl commonly reads the Pod first
- apiGroups: [""]
  resources: ["pods/exec"]
  verbs: ["create"]       # required for exec (kubectl uses POST -> RBAC "create")

Note: some client libraries negotiate exec using an HTTP GET first (e.g., WebSocket), in which case you may also need verbs: ["get","create"] on pods/exec. [2]

Quick check:

  • kubectl auth can-i create pods/exec -n <ns> [3]

Sources:
[1] Kubernetes RBAC docs (subresources via resource/subresource) https://kubernetes.io/docs/reference/access-authn-authz/rbac/
[2] Stack Overflow: kubectl exec requires create on pods/exec (and sometimes get) https://stackoverflow.com/questions/48118125/kubernetes-rbac-role-verbs-to-exec-to-pod
[3] Kubenomicon: kubectl auth can-i create pods/exec https://kubenomicon.com/Execution/Exec_inside_container.html


🏁 Script executed:

#!/bin/bash
# Check if there's any documentation in the repo about this RBAC config
find . -type f \( -name "*.md" -o -name "*.txt" -o -name "README*" \) | xargs grep -l "rbac\|pods/exec" 2>/dev/null || echo "No documentation found"

# Also check if there are comments in the RBAC file itself
head -50 scripts/k8s-runner-resources/arc-runner-rbac.yaml

Repository: lightseekorg/smg

Length of output: 1024


🏁 Script executed:

#!/bin/bash
# Check the README for context about the runner RBAC requirements
cat scripts/k8s-runner-resources/README.md

Repository: lightseekorg/smg

Length of output: 6147


Add create verb to pods/exec or clarify intent with a comment.

Kubernetes RBAC requires the create verb on the pods/exec subresource to enable actual pod exec operations. The current permissions grant only get, list, and watch verbs, which are insufficient for exec functionality. Either add create to the verbs list if runners need to exec into pods, or add an inline comment explaining that this is intentionally limited to read-only access.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/k8s-runner-resources/arc-runner-rbac.yaml` around lines 22 - 31, The
RBAC rule for the Pods resource includes the pods/exec subresource but only
grants read verbs (get, list, watch), which prevents exec operations; update the
verbs array for the Pods rule that lists "pods" and "pods/exec" to include
"create" if runners should be able to exec into pods, or add a clear inline
comment next to the resources/verbs block (referencing the Pods rule and the
"pods/exec" entry) stating the intent is read-only and that "create" was
intentionally omitted.
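If exec access is intended, one way to grant it while keeping the read verbs narrow is to split the rule. A sketch, assuming the same core API group (some WebSocket-based clients may additionally need get on pods/exec):

```yaml
# Sketch: read-only access to pods, plus create on the exec subresource
- apiGroups: [""]
  resources:
    - pods
    - pods/log
  verbs: [get, list, watch]
- apiGroups: [""]
  resources:
    - pods/exec
  verbs: [create]   # kubectl exec issues a POST, which RBAC maps to "create"
```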

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 070ea73062


Comment on lines +47 to +48
        - name: docker
          image: fra.ocir.io/idqj093njucb/docker:dind


P1 Badge Configure DinD sidecar for CPU RunnerDeployment

This RunnerDeployment adds a docker:dind sidecar but does not wire it up for usable Docker access from the runner (no shared /var/run volume, no privileged DinD setup, and no runner-side Docker endpoint wiring). In workflows that run on k8s-runner-cpu and invoke Docker (for example container actions or docker build), jobs will fail because the runner cannot reach a functional daemon.


Comment on lines +28 to +31
verbs:
- get
- list
- watch


P2 Badge Grant create on pods/exec in runner Role

The Role includes pods/exec but only grants get, list, and watch; exec requests are authorized via the create verb on the pods/exec subresource. As written, any workflow step that uses kubectl exec from these runners will get RBAC forbidden errors even though exec access appears intended.


@CatherineSue
Collaborator

What about the old files using the GitHub CRD? Should we remove them?


Labels

documentation Improvements or additions to documentation
