refactor(ci): Add actions.summerwind.dev ARC runner deployment option #797
XinyueZhang369 wants to merge 3 commits into main
Conversation
Signed-off-by: XinyueZhang369 <zoeyzhang369@gmail.com>
📝 Walkthrough
This PR introduces Kubernetes manifests and documentation for deploying GitHub Actions runners using the community actions-runner-controller (ARC) variant from actions.summerwind.dev. The changes include CPU and GPU runner deployments, autoscaling configurations, RBAC setup, and an extended README guide covering installation and verification procedures.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ✅ 3 passed
Summary of Changes
This pull request expands the available options for deploying GitHub Actions self-hosted runners on Kubernetes. It provides a new, fully documented method utilizing the community actions.summerwind.dev controller.
Code Review
This pull request introduces an alternative deployment method for ARC runners using the actions.summerwind.dev controller, adding new Kubernetes manifests and updating documentation. My review found several critical and high-severity issues in the new manifests. The RBAC role for runner pods is overly permissive, creating a security risk. The CPU runner's Docker-in-Docker configuration is incomplete, lacking proper setup for the Docker socket and requiring DOCKER_TLS_CERTDIR to be explicitly set to an empty string, which will cause Docker-dependent jobs to fail. Additionally, a GPU runner deployment is missing necessary CPU and memory resource definitions and also requires the DOCKER_TLS_CERTDIR environment variable to be set for its Docker-in-Docker setup, potentially causing instability. Finally, the CPU runner's autoscaler configuration contains invalid parameters that will prevent it from functioning correctly. Addressing these points will improve the security, stability, and functionality of the new runner deployments.
```yaml
metrics:
- type: PercentageRunnersBusy
  scaleUpThreshold: "0.95"
  scaleDownThreshold: "0.25"
  scaleUpFactor: "0.5"
  scaleDownFactor: "0.5"
```
The arc-cpu-runner-autoscaler is configured to use the PercentageRunnersBusy metric, but it includes scaleUpFactor and scaleDownFactor fields. These fields are only valid for the TotalNumberOfQueuedAndInProgressWorkflowRuns metric type and will be ignored or cause an error here. This will prevent the autoscaler from functioning as expected.
```yaml
metrics:
- type: PercentageRunnersBusy
  scaleUpThreshold: "0.95"
  scaleDownThreshold: "0.25"
```

```yaml
spec:
  ephemeral: true
  repository: lightseekorg/smg
  labels:
    - k8s-runner-cpu
  serviceAccountName: arc-runner-sa

  containers:
  - name: runner
    image: fra.ocir.io/idqj093njucb/action-runner:v0.0.1
    resources:
      requests:
        cpu: "8"
        memory: "16Gi"
      limits:
        cpu: "8"
        memory: "16Gi"
    env:
    - name: HF_TOKEN
      valueFrom:
        secretKeyRef:
          key: HUGGINGFACE_API_KEY
          name: huggingface-secret
    - name: OPENAI_API_KEY
      valueFrom:
        secretKeyRef:
          key: OPENAI_API_KEY
          name: openai-api-key
    - name: ANTHROPIC_API_KEY
      valueFrom:
        secretKeyRef:
          key: ANTHROPIC_API_KEY
          name: anthropic-api-key
    - name: XAI_API_KEY
      valueFrom:
        secretKeyRef:
          key: XAI_API_KEY
          name: xai-api-key
  - name: docker
    image: fra.ocir.io/idqj093njucb/docker:dind
```
The Docker-in-Docker (dind) configuration for the CPU runner is incomplete. The runner container is missing the DOCKER_HOST environment variable and volume mounts for the Docker socket. The docker sidecar is missing the privileged security context, resource definitions, and volume mounts required for it to function correctly. Additionally, the DOCKER_TLS_CERTDIR environment variable must be set to an empty string in the docker sidecar to disable TLS for the Docker socket, which is necessary for DinD setups. This will cause any Docker operations in workflows on this runner to fail. The configuration should be updated to properly set up the dind sidecar and the communication between the two containers, similar to the GPU runner definitions.
```yaml
spec:
  ephemeral: true
  repository: lightseekorg/smg
  labels:
    - k8s-runner-cpu
  serviceAccountName: arc-runner-sa
  volumes:
  - name: docker-sock
    emptyDir: {}
  - name: docker-storage
    emptyDir: {}
  containers:
  - name: runner
    image: fra.ocir.io/idqj093njucb/action-runner:v0.0.1
    resources:
      requests:
        cpu: "8"
        memory: "16Gi"
      limits:
        cpu: "8"
        memory: "16Gi"
    volumeMounts:
    - name: docker-sock
      mountPath: /var/run
    env:
    - name: DOCKER_HOST
      value: unix:///var/run/docker.sock
    - name: HF_TOKEN
      valueFrom:
        secretKeyRef:
          key: HUGGINGFACE_API_KEY
          name: huggingface-secret
    - name: OPENAI_API_KEY
      valueFrom:
        secretKeyRef:
          key: OPENAI_API_KEY
          name: openai-api-key
    - name: ANTHROPIC_API_KEY
      valueFrom:
        secretKeyRef:
          key: ANTHROPIC_API_KEY
          name: anthropic-api-key
    - name: XAI_API_KEY
      valueFrom:
        secretKeyRef:
          key: XAI_API_KEY
          name: xai-api-key
  - name: docker
    image: fra.ocir.io/idqj093njucb/docker:dind
    securityContext:
      privileged: true
    resources:
      requests:
        cpu: "1"
        memory: "2Gi"
      limits:
        cpu: "2"
        memory: "4Gi"
    env:
    - name: DOCKER_TLS_CERTDIR
      value: ""
    - name: DOCKER_DRIVER
      value: overlay2
    volumeMounts:
    - name: docker-sock
      mountPath: /var/run
    - name: docker-storage
      mountPath: /var/lib/docker
```

References
- When using a Docker-in-Docker (DinD) setup, it is necessary to disable TLS for the Docker socket by setting the `DOCKER_TLS_CERTDIR` environment variable to an empty string.
```yaml
containers:
- name: runner
  image: fra.ocir.io/idqj093njucb/action-runner:v0.0.1
  resources:
    limits:
      nvidia.com/gpu: 4
  volumeMounts:
  - name: model-cache
    mountPath: /models
  - name: docker-sock
    mountPath: /var/run
  - name: dshm
    mountPath: /dev/shm
  env:
  - name: DOCKER_HOST
    value: unix:///var/run/docker.sock
  - name: HF_TOKEN
    valueFrom:
      secretKeyRef:
        key: HUGGINGFACE_API_KEY
        name: huggingface-secret
  - name: OPENAI_API_KEY
    valueFrom:
      secretKeyRef:
        key: OPENAI_API_KEY
        name: openai-api-key
  - name: ANTHROPIC_API_KEY
    valueFrom:
      secretKeyRef:
        key: ANTHROPIC_API_KEY
        name: anthropic-api-key
  - name: XAI_API_KEY
    valueFrom:
      secretKeyRef:
        key: XAI_API_KEY
        name: xai-api-key
- name: docker
  image: fra.ocir.io/idqj093njucb/docker:dind
  securityContext:
    privileged: true # Required for DinD
  env:
  - name: DOCKER_TLS_CERTDIR
    value: "" # Disables TLS for shared socket use
  volumeMounts:
  - name: docker-sock
    mountPath: /var/run
  - name: docker-storage
    mountPath: /var/lib/docker
```
The arc-runner-gpu-a10 deployment is missing CPU and memory resource requests and limits for both the runner and docker containers. This results in a lower Quality of Service (QoS) class, making the pods more likely to be evicted under node pressure; it is a best practice to explicitly define resources for all containers to ensure predictable performance and scheduling. Additionally, the docker sidecar's DOCKER_TLS_CERTDIR environment variable must be set to an empty string to disable TLS for the Docker socket, which is necessary for DinD setups.
```yaml
containers:
- name: runner
  image: fra.ocir.io/idqj093njucb/action-runner:v0.0.1
  resources:
    requests:
      cpu: "16"
      memory: "64Gi"
    limits:
      cpu: "16"
      memory: "64Gi"
      nvidia.com/gpu: 4
  volumeMounts:
  - name: model-cache
    mountPath: /models
  - name: docker-sock
    mountPath: /var/run
  - name: dshm
    mountPath: /dev/shm
  env:
  - name: DOCKER_HOST
    value: unix:///var/run/docker.sock
  - name: HF_TOKEN
    valueFrom:
      secretKeyRef:
        key: HUGGINGFACE_API_KEY
        name: huggingface-secret
  - name: OPENAI_API_KEY
    valueFrom:
      secretKeyRef:
        key: OPENAI_API_KEY
        name: openai-api-key
  - name: ANTHROPIC_API_KEY
    valueFrom:
      secretKeyRef:
        key: ANTHROPIC_API_KEY
        name: anthropic-api-key
  - name: XAI_API_KEY
    valueFrom:
      secretKeyRef:
        key: XAI_API_KEY
        name: xai-api-key
- name: docker
  image: fra.ocir.io/idqj093njucb/docker:dind
  securityContext:
    privileged: true # Required for DinD
  resources:
    requests:
      cpu: "1"
      memory: "2Gi"
    limits:
      cpu: "2"
      memory: "4Gi"
  env:
  - name: DOCKER_TLS_CERTDIR
    value: "" # Disables TLS for shared socket use
  - name: DOCKER_DRIVER
    value: overlay2
  volumeMounts:
  - name: docker-sock
    mountPath: /var/run
  - name: docker-storage
    mountPath: /var/lib/docker
```

References
- When using a Docker-in-Docker (DinD) setup, it is necessary to disable TLS for the Docker socket by setting the `DOCKER_TLS_CERTDIR` environment variable to an empty string.
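For context on the QoS point: Kubernetes assigns the Guaranteed QoS class only when every container in the pod sets requests equal to limits for both CPU and memory; otherwise the pod is Burstable (or BestEffort if nothing is set) and is evicted earlier under node pressure. A minimal sketch, with illustrative values rather than the exact figures from this PR:

```yaml
# Hedged sketch: for Guaranteed QoS, every container must set
# requests == limits for both cpu and memory (values illustrative).
containers:
- name: runner
  resources:
    requests: { cpu: "16", memory: "64Gi" }
    limits:   { cpu: "16", memory: "64Gi" }
- name: docker
  resources:
    requests: { cpu: "2", memory: "4Gi" }
    limits:   { cpu: "2", memory: "4Gi" }
```

Note that the proposed fix above (docker requests below its limits) would yield a Burstable pod, which may be acceptable; matching requests to limits is only required if Guaranteed QoS is the goal.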
```yaml
- pods
- pods/log
- pods/exec
```
The Role for the runner pods grants permissions for pods/log and pods/exec. This is overly permissive and violates the principle of least privilege. A standard runner pod does not need to execute commands in or view logs of other pods. These permissions could be abused if a workflow is compromised and should be removed to enhance security.
```yaml
- pods
```
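A hedged least-privilege sketch of the pods rule this comment suggests, assuming runners only need to read pod status (dropping the pods/log and pods/exec subresources):

```yaml
# Hedged sketch: pods rule reduced to read-only access on pods only,
# per the least-privilege concern raised above.
- apiGroups: [""]
  resources:
    - pods
  verbs:
    - get
    - list
    - watch
```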
Actionable comments posted: 6
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@scripts/k8s-runner-resources/arc-runner-autoscaler.yaml`:
- Around line 104-109: The PercentageRunnersBusy metric configuration uses scale
factors that decrease capacity on scale-up; update the metric block (metrics /
type: PercentageRunnersBusy) so that scaleUpFactor is greater than 1 (e.g.,
"1.4" or "1.5") to increase runners when busy and scaleDownFactor remains less
than 1 (e.g., "0.7") to reduce runners when underutilized, keeping the existing
thresholds (scaleUpThreshold and scaleDownThreshold) as-is.
In `@scripts/k8s-runner-resources/arc-runner-cpu.yaml`:
- Around line 47-48: The docker sidecar container (name: docker) is missing
critical DinD configuration: add securityContext.privileged: true to the docker
container, add volumeMounts for the docker socket and storage (mounts named
docker-sock and docker-storage) and ensure matching volumes are defined at the
pod level, add environment variables DOCKER_TLS_CERTDIR (empty string) and
DOCKER_DRIVER (e.g., overlay2) to the docker container, and add appropriate
resources.requests and resources.limits (cpu/memory) similar to the GPU runner's
docker sidecar so DinD can run properly and the runner can access the
socket/storage.
- Around line 16-48: The CPU deployment is missing the Docker socket and related
volumes/volumeMounts so the runner container cannot talk to the dind container;
add a top-level volumes block defining docker-sock (hostPath
/var/run/docker.sock), docker-storage (emptyDir) and dshm (emptyDir with medium:
Memory) and update the runner container (name: runner) to include volumeMounts
for docker-sock (mountPath: /var/run/docker.sock), docker-storage (mountPath:
/var/lib/docker) and dshm (mountPath: /dev/shm); ensure the dind container
(name: docker) also mounts those same volumes so DOCKER_HOST:
unix:///var/run/docker.sock works correctly.
In `@scripts/k8s-runner-resources/arc-runner-gpu.yaml`:
- Around line 168-215: The arc-runner-gpu-a10 deployment is missing resource
requests for the runner container, resources for the docker sidecar, and a
memory-backed docker-storage volume; update the YAML to match other GPU
deployments by adding a resources block to the runner container (keep
limits.nvidia.com/gpu: 4 and add cpu and memory requests and limits), add a
resources.requests/limits block to the docker sidecar container (name: docker)
similar to other DinD sidecars, and change the docker-storage volume definition
(name: docker-storage) to an emptyDir with medium: Memory and sizeLimit: 4Gi to
make it memory-backed. Ensure you modify the runner and docker container specs
and the docker-storage volume declaration accordingly.
In `@scripts/k8s-runner-resources/arc-runner-rbac.yaml`:
- Around line 13-20: Replace the misleading "# Argo Workflows" comment above the
RBAC block with an accurate label for ARC/GitHub Actions runners (e.g., "#
Actions Runner Controller (ARC) Runners" or "# ARC runners") so the comment
correctly reflects the RBAC purpose; update the comment that immediately
precedes the secrets resource block (the line containing "# Argo Workflows") to
the new text in the same location.
- Around line 22-31: The RBAC rule for the Pods resource includes the pods/exec
subresource but only grants read verbs (get, list, watch), which prevents exec
operations; update the verbs array for the Pods rule that lists "pods" and
"pods/exec" to include "create" if runners should be able to exec into pods, or
add a clear inline comment next to the resources/verbs block (referencing the
Pods rule and the "pods/exec" entry) stating the intent is read-only and that
"create" was intentionally omitted.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: d7c51f99-fc49-4531-b171-992bec599165
📒 Files selected for processing (5)
- scripts/k8s-runner-resources/README.md
- scripts/k8s-runner-resources/arc-runner-autoscaler.yaml
- scripts/k8s-runner-resources/arc-runner-cpu.yaml
- scripts/k8s-runner-resources/arc-runner-gpu.yaml
- scripts/k8s-runner-resources/arc-runner-rbac.yaml
```yaml
metrics:
- type: PercentageRunnersBusy
  scaleUpThreshold: "0.95"
  scaleDownThreshold: "0.25"
  scaleUpFactor: "0.5"
  scaleDownFactor: "0.5"
```
🧩 Analysis chain
🌐 Web query:
actions-runner-controller PercentageRunnersBusy scaleUpFactor scaleDownFactor configuration
💡 Result:
In Actions Runner Controller (ARC), PercentageRunnersBusy is a HorizontalRunnerAutoscaler.spec.metrics[].type that scales a RunnerDeployment based on the percentage of runners currently busy (running jobs). ARC queries GitHub for the number of busy runners in the RunnerDeployment’s namespace, then adjusts the desired replica count when thresholds are crossed. [1]
Example configuration
```yaml
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
spec:
  scaleTargetRef:
    name: example-runner-deployment
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: PercentageRunnersBusy
    scaleUpThreshold: "0.75"
    scaleDownThreshold: "0.30"
    scaleUpFactor: "1.4"
    scaleDownFactor: "0.7"
```
[1][2]
What the fields mean (for PercentageRunnersBusy)
- `scaleUpThreshold`: if the percentage of busy runners is greater than this value, ARC re-evaluates desired runners to scale up. [2]
- `scaleDownThreshold`: if the percentage of busy runners is less than this value, ARC re-evaluates desired runners to scale down. [2][3]
- `scaleUpFactor`: multiplier applied to the current runner count to compute the new desired count on scale-up (e.g., 10 → 10 * 1.4 → 14). [1][2]
- `scaleDownFactor`: multiplier applied to the current runner count to compute the new desired count on scale-down (e.g., 10 → 10 * 0.7 → 7). [1][2]
Notes:
- You typically pair these with `minReplicas`/`maxReplicas` bounds. [1]
- ARC supports either factor-based scaling (`scaleUpFactor`/`scaleDownFactor`) or step-based scaling (`scaleUpAdjustment`/`scaleDownAdjustment`) for a metric (not both). [3]
Sources:
[1] ARC docs (actions.github.io/actions-runner-controller)
[2] ARC API docs on pkg.go.dev (actions-runner-controller)
[3] HorizontalRunnerAutoscalerMetrics docs (docs.rs)
Correct the scaleUpFactor and scaleDownFactor values for the PercentageRunnersBusy metric.
The format of scaleUpFactor: "0.5" and scaleDownFactor: "0.5" as strings is correct, but the values are problematic. These fields are multipliers applied to the current runner count:
- When `scaleUpThreshold: "0.95"` is exceeded (≥95% runners busy), the system applies `scaleUpFactor: "0.5"`, which reduces the runner count to 50% of current (e.g., 4 → 2 runners). This is the opposite of the intended behavior: you should add runners when they are busy, not remove them.
- Similarly, `scaleDownFactor: "0.5"` also reduces runners, making both directions reduce capacity.

Use a factor greater than 1 for scaleUpFactor (e.g., "1.5" or "1.4") to increase runners, and keep scaleDownFactor less than 1 (e.g., "0.7") to decrease runners during low usage.
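A hedged sketch of the corrected metric block, keeping the PR's thresholds; the factor values are illustrative, not prescriptive:

```yaml
metrics:
- type: PercentageRunnersBusy
  scaleUpThreshold: "0.95"   # re-evaluate when >= 95% of runners are busy
  scaleDownThreshold: "0.25" # re-evaluate when < 25% are busy
  scaleUpFactor: "1.4"       # e.g. 10 runners -> ceil(10 * 1.4) = 14
  scaleDownFactor: "0.7"     # e.g. 10 runners -> 10 * 0.7 = 7
```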
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@scripts/k8s-runner-resources/arc-runner-autoscaler.yaml` around lines 104 -
109, The PercentageRunnersBusy metric configuration uses scale factors that
decrease capacity on scale-up; update the metric block (metrics / type:
PercentageRunnersBusy) so that scaleUpFactor is greater than 1 (e.g., "1.4" or
"1.5") to increase runners when busy and scaleDownFactor remains less than 1
(e.g., "0.7") to reduce runners when underutilized, keeping the existing
thresholds (scaleUpThreshold and scaleDownThreshold) as-is.
```yaml
containers:
- name: runner
  image: fra.ocir.io/idqj093njucb/action-runner:v0.0.1
  resources:
    requests:
      cpu: "8"
      memory: "16Gi"
    limits:
      cpu: "8"
      memory: "16Gi"
  env:
  - name: HF_TOKEN
    valueFrom:
      secretKeyRef:
        key: HUGGINGFACE_API_KEY
        name: huggingface-secret
  - name: OPENAI_API_KEY
    valueFrom:
      secretKeyRef:
        key: OPENAI_API_KEY
        name: openai-api-key
  - name: ANTHROPIC_API_KEY
    valueFrom:
      secretKeyRef:
        key: ANTHROPIC_API_KEY
        name: anthropic-api-key
  - name: XAI_API_KEY
    valueFrom:
      secretKeyRef:
        key: XAI_API_KEY
        name: xai-api-key
- name: docker
  image: fra.ocir.io/idqj093njucb/docker:dind
```
Missing volumes and volume mounts for Docker socket sharing.
The runner container references DOCKER_HOST: unix:///var/run/docker.sock in the GPU deployments, but this CPU deployment is missing:
- The `volumes` section entirely (no `docker-sock`, `docker-storage`, `dshm` volumes)
- Volume mounts in the runner container

Without shared volumes, the runner and DinD containers cannot communicate.
Proposed fix to add volumes section
```diff
   serviceAccountName: arc-runner-sa
+
+  volumes:
+  - name: docker-sock
+    emptyDir: {}
+  - name: docker-storage
+    emptyDir:
+      medium: Memory
+      sizeLimit: 4Gi
   containers:
   - name: runner
     image: fra.ocir.io/idqj093njucb/action-runner:v0.0.1
     resources:
       requests:
         cpu: "8"
         memory: "16Gi"
       limits:
         cpu: "8"
         memory: "16Gi"
+    volumeMounts:
+    - name: docker-sock
+      mountPath: /var/run
     env:
+    - name: DOCKER_HOST
+      value: unix:///var/run/docker.sock
     - name: HF_TOKEN
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@scripts/k8s-runner-resources/arc-runner-cpu.yaml` around lines 16 - 48, The
CPU deployment is missing the Docker socket and related volumes/volumeMounts so
the runner container cannot talk to the dind container; add a top-level volumes
block defining docker-sock (hostPath /var/run/docker.sock), docker-storage
(emptyDir) and dshm (emptyDir with medium: Memory) and update the runner
container (name: runner) to include volumeMounts for docker-sock (mountPath:
/var/run/docker.sock), docker-storage (mountPath: /var/lib/docker) and dshm
(mountPath: /dev/shm); ensure the dind container (name: docker) also mounts
those same volumes so DOCKER_HOST: unix:///var/run/docker.sock works correctly.
```yaml
- name: docker
  image: fra.ocir.io/idqj093njucb/docker:dind
```
Docker-in-Docker sidecar is missing critical configuration.
The docker container is incomplete compared to the GPU runner manifests. It's missing:
- `securityContext.privileged: true` (required for DinD)
- Volume mounts for `docker-sock` and `docker-storage`
- Environment variables (`DOCKER_TLS_CERTDIR`, `DOCKER_DRIVER`)
- Resource requests/limits

Without these, the DinD sidecar will fail to function, and the runner container won't be able to use Docker.
Proposed fix based on GPU runner configuration
- name: docker
image: fra.ocir.io/idqj093njucb/docker:dind
+ securityContext:
+ privileged: true # Required for DinD
+ resources:
+ requests:
+ cpu: "1"
+ memory: "2Gi"
+ limits:
+ cpu: "2"
+ memory: "4Gi"
+ env:
+ - name: DOCKER_TLS_CERTDIR
+ value: "" # Disables TLS for shared socket use
+ - name: DOCKER_DRIVER
+ value: overlay2
+ volumeMounts:
+ - name: docker-sock
+ mountPath: /var/run
+ - name: docker-storage
+ mountPath: /var/lib/docker🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@scripts/k8s-runner-resources/arc-runner-cpu.yaml` around lines 47 - 48, The
docker sidecar container (name: docker) is missing critical DinD configuration:
add securityContext.privileged: true to the docker container, add volumeMounts
for the docker socket and storage (mounts named docker-sock and docker-storage)
and ensure matching volumes are defined at the pod level, add environment
variables DOCKER_TLS_CERTDIR (empty string) and DOCKER_DRIVER (e.g., overlay2)
to the docker container, and add appropriate resources.requests and
resources.limits (cpu/memory) similar to the GPU runner's docker sidecar so DinD
can run properly and the runner can access the socket/storage.
```yaml
containers:
- name: runner
  image: fra.ocir.io/idqj093njucb/action-runner:v0.0.1
  resources:
    limits:
      nvidia.com/gpu: 4
  volumeMounts:
  - name: model-cache
    mountPath: /models
  - name: docker-sock
    mountPath: /var/run
  - name: dshm
    mountPath: /dev/shm
  env:
  - name: DOCKER_HOST
    value: unix:///var/run/docker.sock
  - name: HF_TOKEN
    valueFrom:
      secretKeyRef:
        key: HUGGINGFACE_API_KEY
        name: huggingface-secret
  - name: OPENAI_API_KEY
    valueFrom:
      secretKeyRef:
        key: OPENAI_API_KEY
        name: openai-api-key
  - name: ANTHROPIC_API_KEY
    valueFrom:
      secretKeyRef:
        key: ANTHROPIC_API_KEY
        name: anthropic-api-key
  - name: XAI_API_KEY
    valueFrom:
      secretKeyRef:
        key: XAI_API_KEY
        name: xai-api-key
- name: docker
  image: fra.ocir.io/idqj093njucb/docker:dind
  securityContext:
    privileged: true # Required for DinD
  env:
  - name: DOCKER_TLS_CERTDIR
    value: "" # Disables TLS for shared socket use
  volumeMounts:
  - name: docker-sock
    mountPath: /var/run
  - name: docker-storage
    mountPath: /var/lib/docker
```
Inconsistent configuration in arc-runner-gpu-a10 deployment.
This deployment differs from the other GPU deployments in several ways:
- Missing runner container resource requests (lines 171-173): Only `limits.nvidia.com/gpu` is specified; no CPU/memory requests like the other deployments have.
- Missing docker container resources (lines 204-215): No resource requests/limits for the DinD sidecar.
- `docker-storage` volume not memory-backed (lines 161-162): Other deployments use `medium: Memory` with `sizeLimit: 4Gi`.

These inconsistencies may cause scheduling issues or resource contention.
Proposed fix to align with other GPU deployments
containers:
- name: runner
image: fra.ocir.io/idqj093njucb/action-runner:v0.0.1
resources:
+ requests:
+ cpu: "32"
+ memory: "128Gi"
limits:
nvidia.com/gpu: 4 - name: docker-storage
- emptyDir: {}
+ emptyDir:
+ medium: Memory
+ sizeLimit: 4Gi - name: docker
image: fra.ocir.io/idqj093njucb/docker:dind
securityContext:
privileged: true # Required for DinD
+ resources:
+ requests:
+ cpu: "1"
+ memory: "2Gi"
+ limits:
+ cpu: "2"
+ memory: "4Gi"
env:🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@scripts/k8s-runner-resources/arc-runner-gpu.yaml` around lines 168 - 215, The
arc-runner-gpu-a10 deployment is missing resource requests for the runner
container, resources for the docker sidecar, and a memory-backed docker-storage
volume; update the YAML to match other GPU deployments by adding a resources
block to the runner container (keep limits.nvidia.com/gpu: 4 and add cpu and
memory requests and limits), add a resources.requests/limits block to the docker
sidecar container (name: docker) similar to other DinD sidecars, and change the
docker-storage volume definition (name: docker-storage) to an emptyDir with
medium: Memory and sizeLimit: 4Gi to make it memory-backed. Ensure you modify
the runner and docker container specs and the docker-storage volume declaration
accordingly.
```yaml
# Argo Workflows
- apiGroups: [""]
  resources:
    - secrets
  verbs:
    - get
    - list
    - watch
```
🧹 Nitpick | 🔵 Trivial
Misleading comment: this is for ARC runners, not Argo Workflows.
The comment # Argo Workflows on line 13 appears to be a copy-paste artifact. This RBAC manifest is for GitHub Actions runners via the actions-runner-controller, not Argo Workflows.
Suggested fix
```diff
 rules:
-  # Argo Workflows
+  # Access to secrets for runner configuration
   - apiGroups: [""]
     resources:
       - secrets
```
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```diff
-# Argo Workflows
+# Access to secrets for runner configuration
 - apiGroups: [""]
   resources:
     - secrets
   verbs:
     - get
     - list
     - watch
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@scripts/k8s-runner-resources/arc-runner-rbac.yaml` around lines 13 - 20,
Replace the misleading "# Argo Workflows" comment above the RBAC block with an
accurate label for ARC/GitHub Actions runners (e.g., "# Actions Runner
Controller (ARC) Runners" or "# ARC runners") so the comment correctly reflects
the RBAC purpose; update the comment that immediately precedes the secrets
resource block (the line containing "# Argo Workflows") to the new text in the
same location.
```yaml
# Pods
- apiGroups: [""]
  resources:
    - pods
    - pods/log
    - pods/exec
  verbs:
    - get
    - list
    - watch
```
🧩 Analysis chain
🏁 Script executed:
```shell
#!/bin/bash
# Check if other runner RBAC files in the repo grant create verb for pods/exec
rg -n "pods/exec" --type yaml -A 5 -B 2
```
Repository: lightseekorg/smg
Length of output: 585
🌐 Web query:
Kubernetes RBAC pods/exec create verb permission required
💡 Result:
To allow kubectl exec via Kubernetes RBAC, you must grant the create verb on the pods/exec subresource (core API group ""). [1], [2]
Typical minimal Role rules (namespace-scoped):
```yaml
rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get"]     # kubectl commonly reads the Pod first
- apiGroups: [""]
  resources: ["pods/exec"]
  verbs: ["create"]  # required for exec (kubectl uses POST -> RBAC "create")
```
Note: some client libraries negotiate exec using an HTTP GET first (e.g., WebSocket), in which case you may also need `verbs: ["get","create"]` on pods/exec. [2]
Quick check:
```shell
kubectl auth can-i create pods/exec -n <ns>
```
[3]
Sources:
[1] Kubernetes RBAC docs (subresources via resource/subresource) https://kubernetes.io/docs/reference/access-authn-authz/rbac/
[2] Stack Overflow: kubectl exec requires create on pods/exec (and sometimes get) https://stackoverflow.com/questions/48118125/kubernetes-rbac-role-verbs-to-exec-to-pod
[3] Kubenomicon: kubectl auth can-i create pods/exec https://kubenomicon.com/Execution/Exec_inside_container.html
🏁 Script executed:
```shell
#!/bin/bash
# Check if there's any documentation in the repo about this RBAC config
find . -type f \( -name "*.md" -o -name "*.txt" -o -name "README*" \) | xargs grep -l "rbac\|pods/exec" 2>/dev/null || echo "No documentation found"
# Also check if there are comments in the RBAC file itself
head -50 scripts/k8s-runner-resources/arc-runner-rbac.yaml
```
Repository: lightseekorg/smg
Length of output: 1024
🏁 Script executed:
```shell
#!/bin/bash
# Check the README for context about the runner RBAC requirements
cat scripts/k8s-runner-resources/README.md
```
Repository: lightseekorg/smg
Length of output: 6147
Add create verb to pods/exec or clarify intent with a comment.
Kubernetes RBAC requires the create verb on the pods/exec subresource to enable actual pod exec operations. The current permissions grant only get, list, and watch verbs, which are insufficient for exec functionality. Either add create to the verbs list if runners need to exec into pods, or add an inline comment explaining that this is intentionally limited to read-only access.
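A hedged sketch of the rule with exec enabled, only needed if runner workflows actually exec into pods; otherwise drop the subresource entirely:

```yaml
# Hedged sketch: grants exec by adding "create" on the pods/exec
# subresource, while pods and pods/log stay read-only.
- apiGroups: [""]
  resources: ["pods", "pods/log"]
  verbs: ["get", "list", "watch"]
- apiGroups: [""]
  resources: ["pods/exec"]
  verbs: ["create"]
```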
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@scripts/k8s-runner-resources/arc-runner-rbac.yaml` around lines 22 - 31, The
RBAC rule for the Pods resource includes the pods/exec subresource but only
grants read verbs (get, list, watch), which prevents exec operations; update the
verbs array for the Pods rule that lists "pods" and "pods/exec" to include
"create" if runners should be able to exec into pods, or add a clear inline
comment next to the resources/verbs block (referencing the Pods rule and the
"pods/exec" entry) stating the intent is read-only and that "create" was
intentionally omitted.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 070ea73062
```yaml
- name: docker
  image: fra.ocir.io/idqj093njucb/docker:dind
```
Configure DinD sidecar for CPU RunnerDeployment
This RunnerDeployment adds a docker:dind sidecar but does not wire it up for usable Docker access from the runner (no shared /var/run volume, no privileged DinD setup, and no runner-side Docker endpoint wiring). In workflows that run on k8s-runner-cpu and invoke Docker (for example container actions or docker build), jobs will fail because the runner cannot reach a functional daemon.
```yaml
verbs:
- get
- list
- watch
```
Grant create on pods/exec in runner Role
The Role includes pods/exec but only grants get, list, and watch; exec requests are authorized via the create verb on the pods/exec subresource. As written, any workflow step that uses kubectl exec from these runners will get RBAC forbidden errors even though exec access appears intended.
What about the old files using the GitHub CRD? Should we remove them?
Description
Problem
The existing ARC deployment guide only covers the official GitHub ARC controller (`ghcr.io/actions/actions-runner-controller-charts`). Some clusters require or prefer the community `actions.summerwind.dev` controller, which uses `RunnerDeployment` CRDs and provides `HorizontalRunnerAutoscaler` support.
Solution
Add an alternative deployment path using the `actions.summerwind.dev` ARC controller to the README, along with the corresponding Kubernetes manifests for RBAC, CPU runners, GPU runners, and autoscaling.
Changes
- Add `RunnerDeployment`-based runner manifests (`arc-runner-cpu.yaml`, `arc-runner-gpu.yaml`, `arc-runner-rbac.yaml`, `arc-runner-autoscaler.yaml`)
- Update `README.md` with an alternative section documenting the `actions.summerwind.dev` ARC installation and deployment steps
- Add a `HorizontalRunnerAutoscaler`

Test Plan
- Install the `actions.summerwind.dev` controller via Helm and verify pods are running in `actions-runner-system`
- Verify the `RunnerDeployment` and `HorizontalRunnerAutoscaler` resources are created

Checklist
- `cargo +nightly fmt` passes
- `cargo clippy --all-targets --all-features -- -D warnings` passes
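The two summerwind CRDs named in the test plan can be sketched minimally as follows; this is an illustrative outline, not the exact manifests in `scripts/k8s-runner-resources/`, and field values are placeholders:

```yaml
# Hedged sketch of the actions.summerwind.dev CRDs this PR relies on.
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: arc-cpu-runner
spec:
  template:
    spec:
      repository: lightseekorg/smg
      labels:
        - k8s-runner-cpu
---
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  name: arc-cpu-runner-autoscaler
spec:
  scaleTargetRef:
    name: arc-cpu-runner   # must match the RunnerDeployment above
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: PercentageRunnersBusy
    scaleUpThreshold: "0.75"
    scaleDownThreshold: "0.25"
```

Once applied, `kubectl get runnerdeployments,horizontalrunnerautoscalers` should list both resources.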
Release Notes
Documentation
New Features