
refactor(ci): Add actions.summerwind.dev ARC runner deployment option #797

Open
XinyueZhang369 wants to merge 3 commits into main from xz/runner-deployment

Conversation

@XinyueZhang369 (Collaborator) commented Mar 17, 2026

Description

Problem

The existing ARC deployment guide only covers the official GitHub ARC controller (ghcr.io/actions/actions-runner-controller-charts). Some clusters require or prefer the community actions.summerwind.dev controller which uses RunnerDeployment CRDs and provides HorizontalRunnerAutoscaler support.

Solution

Add an alternative deployment path using the actions.summerwind.dev ARC controller to the README, along with the corresponding Kubernetes manifests for RBAC, CPU runners, GPU runners, and autoscaling.
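For readers unfamiliar with the community controller, this is roughly the shape of the two CRDs this deployment path relies on. A minimal sketch only: the resource names, repository value, and thresholds below are illustrative placeholders, not the manifests added in this PR.

```yaml
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: example-runner-deployment      # placeholder name
spec:
  template:
    spec:
      repository: my-org/my-repo       # placeholder repository
      labels:
        - k8s-runner-cpu
---
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  name: example-runner-autoscaler      # placeholder name
spec:
  scaleTargetRef:
    name: example-runner-deployment
  minReplicas: 1
  maxReplicas: 5
  metrics:
    - type: PercentageRunnersBusy
      scaleUpThreshold: "0.75"
      scaleDownThreshold: "0.25"
```

The controller watches the RunnerDeployment and registers ephemeral runner pods against the repository; the autoscaler adjusts its replica count between the min/max bounds.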

Changes

  • Add RunnerDeployment-based runner manifests (arc-runner-cpu.yaml, arc-runner-gpu.yaml, arc-runner-rbac.yaml, arc-runner-autoscaler.yaml)
  • Update README.md with an alternative section documenting the actions.summerwind.dev ARC installation and deployment steps
  • Fix GPU resource units and adjust min/max for HorizontalRunnerAutoscaler

Test Plan

  • Deploy the actions.summerwind.dev controller via Helm and verify pods are running in actions-runner-system
  • Apply the RBAC, runner, and autoscaler manifests and confirm RunnerDeployment and HorizontalRunnerAutoscaler resources are created
  • Trigger a GitHub Actions workflow targeting the runner labels and verify runner pods are provisioned
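Assuming the community chart's published Helm repository and the actions-runner-system namespace used in this PR (the release name and file paths below are illustrative), the test plan corresponds roughly to:

```shell
# Install the community ARC controller via Helm.
helm repo add actions-runner-controller \
  https://actions-runner-controller.github.io/actions-runner-controller
helm upgrade --install actions-runner-controller \
  actions-runner-controller/actions-runner-controller \
  --namespace actions-runner-system --create-namespace

# Verify the controller pods are running.
kubectl get pods -n actions-runner-system

# Apply the manifests from this PR, then confirm the CRD resources exist.
kubectl apply -f scripts/k8s-runner-resources/arc-runner-rbac.yaml
kubectl apply -f scripts/k8s-runner-resources/arc-runner-cpu.yaml
kubectl apply -f scripts/k8s-runner-resources/arc-runner-gpu.yaml
kubectl apply -f scripts/k8s-runner-resources/arc-runner-autoscaler.yaml
kubectl get runnerdeployment,horizontalrunnerautoscaler -n actions-runner-system
```

After triggering a workflow that targets the runner labels, `kubectl get pods -n actions-runner-system` should show runner pods being provisioned.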
Checklist
  • cargo +nightly fmt passes
  • cargo clippy --all-targets --all-features -- -D warnings passes
  • (Optional) Documentation updated
  • (Optional) Please join us on Slack #sig-smg to discuss, review, and merge PRs

Summary by CodeRabbit

Release Notes

  • Documentation

    • Added comprehensive guide for deploying self-hosted runners using the community-maintained actions-runner-controller with step-by-step setup instructions.
  • New Features

    • GPU and CPU-optimized runner deployment configurations for diverse workload requirements.
    • Kubernetes-native autoscaling capabilities for self-hosted runners.
    • Security configurations for runner access control.

Signed-off-by: XinyueZhang369 <zoeyzhang369@gmail.com>
@github-actions github-actions bot added the documentation Improvements or additions to documentation label Mar 17, 2026

coderabbitai bot commented Mar 17, 2026

📝 Walkthrough

Walkthrough

This PR introduces Kubernetes manifests and documentation for deploying GitHub Actions runners using the community actions-runner-controller (ARC) variant from actions.summerwind.dev. The changes include CPU and GPU runner deployments, autoscaling configurations, RBAC setup, and an extended README guide covering installation and verification procedures.

Changes

Cohort / File(s) Summary
Documentation
scripts/k8s-runner-resources/README.md
Adds a comprehensive guide section for alternative ARC controller installation, GitHub App setup, secret creation, resource deployment, and verification steps for the summerwind.dev variant.
RBAC & Access Control
scripts/k8s-runner-resources/arc-runner-rbac.yaml
Establishes ServiceAccount, Role, and RoleBinding for arc-runner-sa in actions-runner-system namespace with permissions for secrets and pod access.
Runner Deployments
scripts/k8s-runner-resources/arc-runner-cpu.yaml, scripts/k8s-runner-resources/arc-runner-gpu.yaml
Adds RunnerDeployment configurations for CPU-based runners (8 CPU, 16Gi memory) and multiple GPU variants (H100, A10) with node selectors, tolerations, persistent volume claims for model caching, DinD sidecars, and environment variable bindings for API keys.
Autoscaling
scripts/k8s-runner-resources/arc-runner-autoscaler.yaml
Defines HorizontalRunnerAutoscaler resources with scaling metrics (TotalNumberOfQueuedAndInProgressWorkflowRuns, PercentageRunnersBusy) and min/max replica constraints for each runner deployment variant.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes


Suggested labels

ci, documentation

Suggested reviewers

  • CatherineSue
  • key4ng
  • slin1237

Poem

🐰 A runner hops through Kubernetes fields,
With CPU cores and GPU shields,
Autoscalers scale when jobs run deep,
RBAC guards secrets we keep! ✨

🚥 Pre-merge checks | ✅ 3 passed
  • Description check ✅ Passed — Check skipped; CodeRabbit's high-level summary is enabled.
  • Title check ✅ Passed — The title accurately describes the main change: adding an alternative ARC runner deployment option using the actions.summerwind.dev controller, which is the primary objective of this PR.
  • Docstring coverage ✅ Passed — No functions found in the changed files to evaluate docstring coverage; skipping the docstring coverage check.


@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request expands the available options for deploying GitHub Actions self-hosted runners on Kubernetes. It provides a new, fully documented method utilizing the actions.summerwind.dev controller, offering greater flexibility for environments that prefer or require this specific community-driven solution. The changes include all necessary Kubernetes configuration files to set up CPU and GPU runners with proper access controls and dynamic scaling capabilities.

Highlights

  • New ARC Deployment Option: Added an alternative deployment path for GitHub Actions self-hosted runners using the community actions.summerwind.dev ARC controller, which leverages RunnerDeployment CRDs and HorizontalRunnerAutoscaler.
  • Kubernetes Manifests: Introduced new Kubernetes manifests for RBAC, CPU runners, various GPU runners (H100, A10), and their corresponding autoscalers.
  • Documentation Update: Updated the README.md with comprehensive installation and deployment steps for the actions.summerwind.dev ARC option.
  • Resource Configuration: Adjusted GPU resource units and refined minimum/maximum replica settings for HorizontalRunnerAutoscaler configurations.

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces an alternative deployment method for ARC runners using the actions.summerwind.dev controller, adding new Kubernetes manifests and updating documentation. My review found several critical and high-severity issues in the new manifests. The RBAC role for runner pods is overly permissive, creating a security risk. The CPU runner's Docker-in-Docker configuration is incomplete, lacking proper setup for the Docker socket and requiring DOCKER_TLS_CERTDIR to be explicitly set to an empty string, which will cause Docker-dependent jobs to fail. Additionally, a GPU runner deployment is missing necessary CPU and memory resource definitions and also requires the DOCKER_TLS_CERTDIR environment variable to be set for its Docker-in-Docker setup, potentially causing instability. Finally, the CPU runner's autoscaler configuration contains invalid parameters that will prevent it from functioning correctly. Addressing these points will improve the security, stability, and functionality of the new runner deployments.

Comment on lines +104 to +109

  metrics:
    - type: PercentageRunnersBusy
      scaleUpThreshold: "0.95"
      scaleDownThreshold: "0.25"
      scaleUpFactor: "0.5"
      scaleDownFactor: "0.5"


critical

The arc-cpu-runner-autoscaler is configured to use the PercentageRunnersBusy metric, but it includes scaleUpFactor and scaleDownFactor fields. These fields are only valid for the TotalNumberOfQueuedAndInProgressWorkflowRuns metric type and will be ignored or cause an error here. This will prevent the autoscaler from functioning as expected.

  metrics:
    - type: PercentageRunnersBusy
      scaleUpThreshold: "0.95"
      scaleDownThreshold: "0.25"

Comment on lines +9 to +48

    spec:
      ephemeral: true
      repository: lightseekorg/smg
      labels:
        - k8s-runner-cpu
      serviceAccountName: arc-runner-sa

      containers:
        - name: runner
          image: fra.ocir.io/idqj093njucb/action-runner:v0.0.1
          resources:
            requests:
              cpu: "8"
              memory: "16Gi"
            limits:
              cpu: "8"
              memory: "16Gi"
          env:
            - name: HF_TOKEN
              valueFrom:
                secretKeyRef:
                  key: HUGGINGFACE_API_KEY
                  name: huggingface-secret
            - name: OPENAI_API_KEY
              valueFrom:
                secretKeyRef:
                  key: OPENAI_API_KEY
                  name: openai-api-key
            - name: ANTHROPIC_API_KEY
              valueFrom:
                secretKeyRef:
                  key: ANTHROPIC_API_KEY
                  name: anthropic-api-key
            - name: XAI_API_KEY
              valueFrom:
                secretKeyRef:
                  key: XAI_API_KEY
                  name: xai-api-key
        - name: docker
          image: fra.ocir.io/idqj093njucb/docker:dind


critical

The Docker-in-Docker (dind) configuration for the CPU runner is incomplete. The runner container is missing the DOCKER_HOST environment variable and volume mounts for the Docker socket. The docker sidecar is missing the privileged security context, resource definitions, and volume mounts required for it to function correctly. Additionally, the DOCKER_TLS_CERTDIR environment variable must be set to an empty string in the docker sidecar to disable TLS for the Docker socket, which is necessary for DinD setups. This will cause any Docker operations in workflows on this runner to fail. The configuration should be updated to properly set up the dind sidecar and the communication between the two containers, similar to the GPU runner definitions.

    spec:
      ephemeral: true
      repository: lightseekorg/smg
      labels:
        - k8s-runner-cpu
      serviceAccountName: arc-runner-sa

      volumes:
        - name: docker-sock
          emptyDir: {}
        - name: docker-storage
          emptyDir: {}

      containers:
        - name: runner
          image: fra.ocir.io/idqj093njucb/action-runner:v0.0.1
          resources:
            requests:
              cpu: "8"
              memory: "16Gi"
            limits:
              cpu: "8"
              memory: "16Gi"
          volumeMounts:
            - name: docker-sock
              mountPath: /var/run
          env:
            - name: DOCKER_HOST
              value: unix:///var/run/docker.sock
            - name: HF_TOKEN
              valueFrom:
                secretKeyRef:
                  key: HUGGINGFACE_API_KEY
                  name: huggingface-secret
            - name: OPENAI_API_KEY
              valueFrom:
                secretKeyRef:
                  key: OPENAI_API_KEY
                  name: openai-api-key
            - name: ANTHROPIC_API_KEY
              valueFrom:
                secretKeyRef:
                  key: ANTHROPIC_API_KEY
                  name: anthropic-api-key
            - name: XAI_API_KEY
              valueFrom:
                secretKeyRef:
                  key: XAI_API_KEY
                  name: xai-api-key
        - name: docker
          image: fra.ocir.io/idqj093njucb/docker:dind
          securityContext:
            privileged: true
          resources:
            requests:
              cpu: "1"
              memory: "2Gi"
            limits:
              cpu: "2"
              memory: "4Gi"
          env:
            - name: DOCKER_TLS_CERTDIR
              value: ""
            - name: DOCKER_DRIVER
              value: overlay2
          volumeMounts:
            - name: docker-sock
              mountPath: /var/run
            - name: docker-storage
              mountPath: /var/lib/docker
References
  1. When using a Docker-in-Docker (DinD) setup, it is necessary to disable TLS for the Docker socket by setting the DOCKER_TLS_CERTDIR environment variable to an empty string.

Comment on lines +168 to +215

      containers:
        - name: runner
          image: fra.ocir.io/idqj093njucb/action-runner:v0.0.1
          resources:
            limits:
              nvidia.com/gpu: 4
          volumeMounts:
            - name: model-cache
              mountPath: /models
            - name: docker-sock
              mountPath: /var/run
            - name: dshm
              mountPath: /dev/shm
          env:
            - name: DOCKER_HOST
              value: unix:///var/run/docker.sock
            - name: HF_TOKEN
              valueFrom:
                secretKeyRef:
                  key: HUGGINGFACE_API_KEY
                  name: huggingface-secret
            - name: OPENAI_API_KEY
              valueFrom:
                secretKeyRef:
                  key: OPENAI_API_KEY
                  name: openai-api-key
            - name: ANTHROPIC_API_KEY
              valueFrom:
                secretKeyRef:
                  key: ANTHROPIC_API_KEY
                  name: anthropic-api-key
            - name: XAI_API_KEY
              valueFrom:
                secretKeyRef:
                  key: XAI_API_KEY
                  name: xai-api-key
        - name: docker
          image: fra.ocir.io/idqj093njucb/docker:dind
          securityContext:
            privileged: true  # Required for DinD
          env:
            - name: DOCKER_TLS_CERTDIR
              value: ""  # Disables TLS for shared socket use
          volumeMounts:
            - name: docker-sock
              mountPath: /var/run
            - name: docker-storage
              mountPath: /var/lib/docker


high

The arc-runner-gpu-a10 deployment is missing CPU and memory resource requests and limits for both the runner and docker containers. Additionally, for the docker sidecar, the DOCKER_TLS_CERTDIR environment variable must be set to an empty string to disable TLS for the Docker socket, which is necessary for DinD setups. This results in a lower Quality of Service (QoS) class, making the pods more likely to be evicted under node pressure. It is a best practice to explicitly define resources for all containers to ensure predictable performance and scheduling.

      containers:
        - name: runner
          image: fra.ocir.io/idqj093njucb/action-runner:v0.0.1
          resources:
            requests:
              cpu: "16"
              memory: "64Gi"
            limits:
              cpu: "16"
              memory: "64Gi"
              nvidia.com/gpu: 4
          volumeMounts:
            - name: model-cache
              mountPath: /models
            - name: docker-sock
              mountPath: /var/run
            - name: dshm
              mountPath: /dev/shm
          env:
            - name: DOCKER_HOST
              value: unix:///var/run/docker.sock
            - name: HF_TOKEN
              valueFrom:
                secretKeyRef:
                  key: HUGGINGFACE_API_KEY
                  name: huggingface-secret
            - name: OPENAI_API_KEY
              valueFrom:
                secretKeyRef:
                  key: OPENAI_API_KEY
                  name: openai-api-key
            - name: ANTHROPIC_API_KEY
              valueFrom:
                secretKeyRef:
                  key: ANTHROPIC_API_KEY
                  name: anthropic-api-key
            - name: XAI_API_KEY
              valueFrom:
                secretKeyRef:
                  key: XAI_API_KEY
                  name: xai-api-key
        - name: docker
          image: fra.ocir.io/idqj093njucb/docker:dind
          securityContext:
            privileged: true  # Required for DinD
          resources:
            requests:
              cpu: "1"
              memory: "2Gi"
            limits:
              cpu: "2"
              memory: "4Gi"
          env:
            - name: DOCKER_TLS_CERTDIR
              value: ""  # Disables TLS for shared socket use
            - name: DOCKER_DRIVER
              value: overlay2
          volumeMounts:
            - name: docker-sock
              mountPath: /var/run
            - name: docker-storage
              mountPath: /var/lib/docker
References
  1. When using a Docker-in-Docker (DinD) setup, it is necessary to disable TLS for the Docker socket by setting the DOCKER_TLS_CERTDIR environment variable to an empty string.

Comment on lines +25 to +27

      - pods
      - pods/log
      - pods/exec


high

The Role for the runner pods grants permissions for pods/log and pods/exec. This is overly permissive and violates the principle of least privilege. A standard runner pod does not need to execute commands in or view logs of other pods. These permissions could be abused if a workflow is compromised and should be removed to enhance security.

      - pods

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 6

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@scripts/k8s-runner-resources/arc-runner-autoscaler.yaml`:
- Around line 104-109: The PercentageRunnersBusy metric configuration uses scale
factors that decrease capacity on scale-up; update the metric block (metrics /
type: PercentageRunnersBusy) so that scaleUpFactor is greater than 1 (e.g.,
"1.4" or "1.5") to increase runners when busy and scaleDownFactor remains less
than 1 (e.g., "0.7") to reduce runners when underutilized, keeping the existing
thresholds (scaleUpThreshold and scaleDownThreshold) as-is.

In `@scripts/k8s-runner-resources/arc-runner-cpu.yaml`:
- Around line 47-48: The docker sidecar container (name: docker) is missing
critical DinD configuration: add securityContext.privileged: true to the docker
container, add volumeMounts for the docker socket and storage (mounts named
docker-sock and docker-storage) and ensure matching volumes are defined at the
pod level, add environment variables DOCKER_TLS_CERTDIR (empty string) and
DOCKER_DRIVER (e.g., overlay2) to the docker container, and add appropriate
resources.requests and resources.limits (cpu/memory) similar to the GPU runner's
docker sidecar so DinD can run properly and the runner can access the
socket/storage.
- Around line 16-48: The CPU deployment is missing the Docker socket and related
volumes/volumeMounts so the runner container cannot talk to the dind container;
add a top-level volumes block defining docker-sock (hostPath
/var/run/docker.sock), docker-storage (emptyDir) and dshm (emptyDir with medium:
Memory) and update the runner container (name: runner) to include volumeMounts
for docker-sock (mountPath: /var/run/docker.sock), docker-storage (mountPath:
/var/lib/docker) and dshm (mountPath: /dev/shm); ensure the dind container
(name: docker) also mounts those same volumes so DOCKER_HOST:
unix:///var/run/docker.sock works correctly.

In `@scripts/k8s-runner-resources/arc-runner-gpu.yaml`:
- Around line 168-215: The arc-runner-gpu-a10 deployment is missing resource
requests for the runner container, resources for the docker sidecar, and a
memory-backed docker-storage volume; update the YAML to match other GPU
deployments by adding a resources block to the runner container (keep
limits.nvidia.com/gpu: 4 and add cpu and memory requests and limits), add a
resources.requests/limits block to the docker sidecar container (name: docker)
similar to other DinD sidecars, and change the docker-storage volume definition
(name: docker-storage) to an emptyDir with medium: Memory and sizeLimit: 4Gi to
make it memory-backed. Ensure you modify the runner and docker container specs
and the docker-storage volume declaration accordingly.

In `@scripts/k8s-runner-resources/arc-runner-rbac.yaml`:
- Around line 13-20: Replace the misleading "# Argo Workflows" comment above the
RBAC block with an accurate label for ARC/GitHub Actions runners (e.g., "#
Actions Runner Controller (ARC) Runners" or "# ARC runners") so the comment
correctly reflects the RBAC purpose; update the comment that immediately
precedes the secrets resource block (the line containing "# Argo Workflows") to
the new text in the same location.
- Around line 22-31: The RBAC rule for the Pods resource includes the pods/exec
subresource but only grants read verbs (get, list, watch), which prevents exec
operations; update the verbs array for the Pods rule that lists "pods" and
"pods/exec" to include "create" if runners should be able to exec into pods, or
add a clear inline comment next to the resources/verbs block (referencing the
Pods rule and the "pods/exec" entry) stating the intent is read-only and that
"create" was intentionally omitted.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: d7c51f99-fc49-4531-b171-992bec599165

📥 Commits

Reviewing files that changed from the base of the PR and between 7f54c64 and 070ea73.

📒 Files selected for processing (5)
  • scripts/k8s-runner-resources/README.md
  • scripts/k8s-runner-resources/arc-runner-autoscaler.yaml
  • scripts/k8s-runner-resources/arc-runner-cpu.yaml
  • scripts/k8s-runner-resources/arc-runner-gpu.yaml
  • scripts/k8s-runner-resources/arc-runner-rbac.yaml

Comment on lines +104 to +109

  metrics:
    - type: PercentageRunnersBusy
      scaleUpThreshold: "0.95"
      scaleDownThreshold: "0.25"
      scaleUpFactor: "0.5"
      scaleDownFactor: "0.5"

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🌐 Web query:

actions-runner-controller PercentageRunnersBusy scaleUpFactor scaleDownFactor configuration

💡 Result:

In Actions Runner Controller (ARC), PercentageRunnersBusy is a HorizontalRunnerAutoscaler.spec.metrics[].type that scales a RunnerDeployment based on the percentage of runners currently busy (running jobs). ARC queries GitHub for the number of busy runners in the RunnerDeployment’s namespace, then adjusts the desired replica count when thresholds are crossed. [1]

Example configuration

apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
spec:
  scaleTargetRef:
    name: example-runner-deployment
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: PercentageRunnersBusy
    scaleUpThreshold: "0.75"
    scaleDownThreshold: "0.30"
    scaleUpFactor: "1.4"
    scaleDownFactor: "0.7"

[1][2]

What the fields mean (for PercentageRunnersBusy)

  • scaleUpThreshold: if % busy is greater than this value, ARC re-evaluates desired runners to scale up. [2]
  • scaleDownThreshold: if % busy is less than this value, ARC re-evaluates desired runners to scale down. [2][3]
  • scaleUpFactor: multiplier applied to the current runner count to compute the new desired count on scale-up (e.g., 10 → 10 * 1.4 → 14). [1][2]
  • scaleDownFactor: multiplier applied to the current runner count to compute the new desired count on scale-down (e.g., 10 → 10 * 0.7 → 7). [1][2]

Notes:

  • You typically pair these with minReplicas / maxReplicas bounds. [1]
  • ARC supports either factor-based scaling (scaleUpFactor/scaleDownFactor) or step-based scaling (scaleUpAdjustment/scaleDownAdjustment) for a metric (not both). [3]

Sources:
[1] ARC docs (actions.github.io/actions-runner-controller)
[2] ARC API docs on pkg.go.dev (actions-runner-controller)
[3] HorizontalRunnerAutoscalerMetrics docs (docs.rs)


Correct scaleUpFactor and scaleDownFactor values for PercentageRunnersBusy metric.

The format of scaleUpFactor: "0.5" and scaleDownFactor: "0.5" as strings is correct, but the values are problematic. These fields are multipliers applied to the current runner count:

  • When scaleUpThreshold: "0.95" is exceeded (≥95% runners busy), the system applies scaleUpFactor: "0.5", which reduces the runner count to 50% of current (e.g., 4 → 2 runners). This is the opposite of the intended behavior—you should add runners when they are busy, not remove them.
  • Similarly, scaleDownFactor: "0.5" also reduces runners, making both directions reduce capacity.

Use a factor greater than 1 for scaleUpFactor (e.g., "1.5" or "1.4") to increase runners, and keep scaleDownFactor less than 1 (e.g., "0.7") to decrease runners during low usage.
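The factor arithmetic can be sanity-checked with a tiny calculation. This is a sketch of the documented multiplier behavior, not ARC's actual implementation; the helper name and the rounding choice are ours.

```python
import math

def desired_replicas(current: int, factor: float, min_r: int, max_r: int) -> int:
    """Apply an ARC-style scale factor and clamp to the replica bounds."""
    return max(min_r, min(max_r, math.ceil(current * factor)))

# With this PR's values, a "scale-up" event on 4 busy runners shrinks the fleet:
print(desired_replicas(4, 0.5, 1, 10))   # 2 -- capacity removed while busy
# With a factor above 1, the same event grows it:
print(desired_replicas(4, 1.5, 1, 10))   # 6
```

Note that maxReplicas still caps growth, so a factor above 1 cannot scale the fleet without bound.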


Comment on lines +16 to +48

      containers:
        - name: runner
          image: fra.ocir.io/idqj093njucb/action-runner:v0.0.1
          resources:
            requests:
              cpu: "8"
              memory: "16Gi"
            limits:
              cpu: "8"
              memory: "16Gi"
          env:
            - name: HF_TOKEN
              valueFrom:
                secretKeyRef:
                  key: HUGGINGFACE_API_KEY
                  name: huggingface-secret
            - name: OPENAI_API_KEY
              valueFrom:
                secretKeyRef:
                  key: OPENAI_API_KEY
                  name: openai-api-key
            - name: ANTHROPIC_API_KEY
              valueFrom:
                secretKeyRef:
                  key: ANTHROPIC_API_KEY
                  name: anthropic-api-key
            - name: XAI_API_KEY
              valueFrom:
                secretKeyRef:
                  key: XAI_API_KEY
                  name: xai-api-key
        - name: docker
          image: fra.ocir.io/idqj093njucb/docker:dind


⚠️ Potential issue | 🔴 Critical

Missing volumes and volume mounts for Docker socket sharing.

The runner container references DOCKER_HOST: unix:///var/run/docker.sock in GPU deployments, but this CPU deployment is missing:

  1. The volumes section entirely (no docker-sock, docker-storage, dshm volumes)
  2. Volume mounts in the runner container

Without shared volumes, the runner and DinD containers cannot communicate.

Proposed fix to add volumes section
       serviceAccountName: arc-runner-sa
+
+      volumes:
+        - name: docker-sock
+          emptyDir: {}
+        - name: docker-storage
+          emptyDir:
+            medium: Memory
+            sizeLimit: 4Gi
 
       containers:
         - name: runner
           image: fra.ocir.io/idqj093njucb/action-runner:v0.0.1
           resources:
             requests:
               cpu: "8"
               memory: "16Gi"
             limits:
               cpu: "8"
               memory: "16Gi"
+          volumeMounts:
+            - name: docker-sock
+              mountPath: /var/run
           env:
+            - name: DOCKER_HOST
+              value: unix:///var/run/docker.sock
             - name: HF_TOKEN

Comment on lines +47 to +48

        - name: docker
          image: fra.ocir.io/idqj093njucb/docker:dind

⚠️ Potential issue | 🔴 Critical

Docker-in-Docker sidecar is missing critical configuration.

The docker container is incomplete compared to the GPU runner manifests. It's missing:

  • securityContext.privileged: true (required for DinD)
  • Volume mounts for docker-sock and docker-storage
  • Environment variables (DOCKER_TLS_CERTDIR, DOCKER_DRIVER)
  • Resource requests/limits

Without these, the DinD sidecar will fail to function, and the runner container won't be able to use Docker.

Proposed fix based on GPU runner configuration
         - name: docker
           image: fra.ocir.io/idqj093njucb/docker:dind
+          securityContext:
+            privileged: true  # Required for DinD
+          resources:
+            requests:
+              cpu: "1"
+              memory: "2Gi"
+            limits:
+              cpu: "2"
+              memory: "4Gi"
+          env:
+            - name: DOCKER_TLS_CERTDIR
+              value: ""  # Disables TLS for shared socket use
+            - name: DOCKER_DRIVER
+              value: overlay2
+          volumeMounts:
+            - name: docker-sock
+              mountPath: /var/run
+            - name: docker-storage
+              mountPath: /var/lib/docker

Comment on lines +168 to +215
      containers:
        - name: runner
          image: fra.ocir.io/idqj093njucb/action-runner:v0.0.1
          resources:
            limits:
              nvidia.com/gpu: 4
          volumeMounts:
            - name: model-cache
              mountPath: /models
            - name: docker-sock
              mountPath: /var/run
            - name: dshm
              mountPath: /dev/shm
          env:
            - name: DOCKER_HOST
              value: unix:///var/run/docker.sock
            - name: HF_TOKEN
              valueFrom:
                secretKeyRef:
                  key: HUGGINGFACE_API_KEY
                  name: huggingface-secret
            - name: OPENAI_API_KEY
              valueFrom:
                secretKeyRef:
                  key: OPENAI_API_KEY
                  name: openai-api-key
            - name: ANTHROPIC_API_KEY
              valueFrom:
                secretKeyRef:
                  key: ANTHROPIC_API_KEY
                  name: anthropic-api-key
            - name: XAI_API_KEY
              valueFrom:
                secretKeyRef:
                  key: XAI_API_KEY
                  name: xai-api-key
        - name: docker
          image: fra.ocir.io/idqj093njucb/docker:dind
          securityContext:
            privileged: true  # Required for DinD
          env:
            - name: DOCKER_TLS_CERTDIR
              value: ""  # Disables TLS for shared socket use
          volumeMounts:
            - name: docker-sock
              mountPath: /var/run
            - name: docker-storage
              mountPath: /var/lib/docker
⚠️ Potential issue | 🟠 Major

Inconsistent configuration in arc-runner-gpu-a10 deployment.

This deployment differs from the other GPU deployments in several ways:

  1. Missing runner container resource requests (lines 171-173): Only limits.nvidia.com/gpu is specified; no CPU/memory requests like other deployments have.
  2. Missing docker container resources (lines 204-215): No resource requests/limits for the DinD sidecar.
  3. docker-storage volume not memory-backed (lines 161-162): Other deployments use medium: Memory with sizeLimit: 4Gi.

These inconsistencies may cause scheduling issues or resource contention.

Proposed fix to align with other GPU deployments
       containers:
         - name: runner
           image: fra.ocir.io/idqj093njucb/action-runner:v0.0.1
           resources:
+            requests:
+              cpu: "32"
+              memory: "128Gi"
             limits:
               nvidia.com/gpu: 4
         - name: docker-storage
-          emptyDir: {}
+          emptyDir:
+            medium: Memory
+            sizeLimit: 4Gi
         - name: docker
           image: fra.ocir.io/idqj093njucb/docker:dind
           securityContext:
             privileged: true  # Required for DinD
+          resources:
+            requests:
+              cpu: "1"
+              memory: "2Gi"
+            limits:
+              cpu: "2"
+              memory: "4Gi"
           env:
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/k8s-runner-resources/arc-runner-gpu.yaml` around lines 168 - 215, The
arc-runner-gpu-a10 deployment is missing resource requests for the runner
container, resources for the docker sidecar, and a memory-backed docker-storage
volume; update the YAML to match other GPU deployments by adding a resources
block to the runner container (keep limits.nvidia.com/gpu: 4 and add cpu and
memory requests and limits), add a resources.requests/limits block to the docker
sidecar container (name: docker) similar to other DinD sidecars, and change the
docker-storage volume definition (name: docker-storage) to an emptyDir with
medium: Memory and sizeLimit: 4Gi to make it memory-backed. Ensure you modify
the runner and docker container specs and the docker-storage volume declaration
accordingly.

Comment on lines +13 to +20
  # Argo Workflows
  - apiGroups: [""]
    resources:
      - secrets
    verbs:
      - get
      - list
      - watch
🧹 Nitpick | 🔵 Trivial

Misleading comment: this is for ARC runners, not Argo Workflows.

The comment # Argo Workflows on line 13 appears to be a copy-paste artifact. This RBAC manifest is for GitHub Actions runners via the actions-runner-controller, not Argo Workflows.

Suggested fix
 rules:
-  # Argo Workflows
+  # Access to secrets for runner configuration
   - apiGroups: [""]
     resources:
       - secrets
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/k8s-runner-resources/arc-runner-rbac.yaml` around lines 13 - 20,
Replace the misleading "# Argo Workflows" comment above the RBAC block with an
accurate label for ARC/GitHub Actions runners (e.g., "# Actions Runner
Controller (ARC) Runners" or "# ARC runners") so the comment correctly reflects
the RBAC purpose; update the comment that immediately precedes the secrets
resource block (the line containing "# Argo Workflows") to the new text in the
same location.

Comment on lines +22 to +31
  # Pods
  - apiGroups: [""]
    resources:
      - pods
      - pods/log
      - pods/exec
    verbs:
      - get
      - list
      - watch
⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check if other runner RBAC files in the repo grant create verb for pods/exec
rg -n "pods/exec" --type yaml -A 5 -B 2

Repository: lightseekorg/smg

Length of output: 585


🌐 Web query:

Kubernetes RBAC pods/exec create verb permission required

💡 Result:

To allow kubectl exec via Kubernetes RBAC, you must grant the create verb on the pods/exec subresource (core API group ""). [1], [2]

Typical minimal Role rules (namespace-scoped):

rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get"]          # kubectl commonly reads the Pod first
- apiGroups: [""]
  resources: ["pods/exec"]
  verbs: ["create"]       # required for exec (kubectl uses POST -> RBAC "create")

Note: some client libraries negotiate exec using an HTTP GET first (e.g., WebSocket), in which case you may also need verbs: ["get","create"] on pods/exec. [2]

Quick check:

  • kubectl auth can-i create pods/exec -n <ns> [3]

Sources:
[1] Kubernetes RBAC docs (subresources via resource/subresource) https://kubernetes.io/docs/reference/access-authn-authz/rbac/
[2] Stack Overflow: kubectl exec requires create on pods/exec (and sometimes get) https://stackoverflow.com/questions/48118125/kubernetes-rbac-role-verbs-to-exec-to-pod
[3] Kubenomicon: kubectl auth can-i create pods/exec https://kubenomicon.com/Execution/Exec_inside_container.html


🏁 Script executed:

#!/bin/bash
# Check if there's any documentation in the repo about this RBAC config
find . -type f \( -name "*.md" -o -name "*.txt" -o -name "README*" \) | xargs grep -l "rbac\|pods/exec" 2>/dev/null || echo "No documentation found"

# Also check if there are comments in the RBAC file itself
head -50 scripts/k8s-runner-resources/arc-runner-rbac.yaml

Repository: lightseekorg/smg

Length of output: 1024


🏁 Script executed:

#!/bin/bash
# Check the README for context about the runner RBAC requirements
cat scripts/k8s-runner-resources/README.md

Repository: lightseekorg/smg

Length of output: 6147


Add create verb to pods/exec or clarify intent with a comment.

Kubernetes RBAC requires the create verb on the pods/exec subresource to enable actual pod exec operations. The current permissions grant only get, list, and watch verbs, which are insufficient for exec functionality. Either add create to the verbs list if runners need to exec into pods, or add an inline comment explaining that this is intentionally limited to read-only access.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/k8s-runner-resources/arc-runner-rbac.yaml` around lines 22 - 31, The
RBAC rule for the Pods resource includes the pods/exec subresource but only
grants read verbs (get, list, watch), which prevents exec operations; update the
verbs array for the Pods rule that lists "pods" and "pods/exec" to include
"create" if runners should be able to exec into pods, or add a clear inline
comment next to the resources/verbs block (referencing the Pods rule and the
"pods/exec" entry) stating the intent is read-only and that "create" was
intentionally omitted.
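If exec access is intended, one way to grant it while keeping the read verbs narrow is to split the rule. A sketch, assuming the same core API group (some WebSocket-based clients may additionally need get on pods/exec):

```yaml
# Sketch: read-only access to pods, plus create on the exec subresource
- apiGroups: [""]
  resources:
    - pods
    - pods/log
  verbs: [get, list, watch]
- apiGroups: [""]
  resources:
    - pods/exec
  verbs: [create]   # kubectl exec issues a POST, which RBAC maps to "create"
```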

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 070ea73062


Comment on lines +47 to +48
        - name: docker
          image: fra.ocir.io/idqj093njucb/docker:dind


P1 Badge Configure DinD sidecar for CPU RunnerDeployment

This RunnerDeployment adds a docker:dind sidecar but does not wire it up for usable Docker access from the runner (no shared /var/run volume, no privileged DinD setup, and no runner-side Docker endpoint wiring). In workflows that run on k8s-runner-cpu and invoke Docker (for example container actions or docker build), jobs will fail because the runner cannot reach a functional daemon.


Comment on lines +28 to +31
verbs:
- get
- list
- watch


P2 Badge Grant create on pods/exec in runner Role

The Role includes pods/exec but only grants get, list, and watch; exec requests are authorized via the create verb on the pods/exec subresource. As written, any workflow step that uses kubectl exec from these runners will get RBAC forbidden errors even though exec access appears intended.


@CatherineSue
Collaborator

What about the old files using the GitHub CRD? Should we remove them?


Labels

documentation Improvements or additions to documentation
