55 changes: 55 additions & 0 deletions scripts/k8s-runner-resources/README.md
@@ -139,3 +139,58 @@ helm uninstall <runner-set-name> -n actions-runner-system
# Remove the controller (after all runner sets are removed)
helm uninstall arc -n actions-runner-system
```

---

## Alternative: Using `actions.summerwind.dev` ARC

Instead of the official GitHub ARC controller above, you can use the community [actions-runner-controller](https://github.com/actions/actions-runner-controller) (`actions.summerwind.dev`). This uses `RunnerDeployment` CRDs instead of runner scale sets.

### 1. Install the Controller

```bash
helm repo add actions-runner-controller https://actions-runner-controller.github.io/actions-runner-controller
helm repo update
helm install actions-runner-controller actions-runner-controller/actions-runner-controller \
--namespace actions-runner-system \
--create-namespace
```

### 2. Create a GitHub App

Follow the same steps as [Section 1](#1-create-a-github-app) and [Section 2](#2-install-the-github-app) above to create and install a GitHub App.

### 3. Create the Kubernetes Secret

Create a secret named `controller-manager` in the `actions-runner-system` namespace with your GitHub App credentials:

```bash
kubectl create secret generic controller-manager \
-n actions-runner-system \
--from-literal=github_app_id=<your-app-id> \
--from-literal=github_app_installation_id=<your-installation-id> \
--from-file=github_app_private_key=<path-to-your-pem-file>
```

### 4. Apply Runner Resources

```bash
# RBAC for runner pods
kubectl apply -f scripts/k8s-runner-resources/arc-runner-rbac.yaml

# CPU runner deployment
kubectl apply -f scripts/k8s-runner-resources/arc-runner-cpu.yaml

# GPU runner deployment
kubectl apply -f scripts/k8s-runner-resources/arc-runner-gpu.yaml

# Autoscaler
kubectl apply -f scripts/k8s-runner-resources/arc-runner-autoscaler.yaml
```
Comment on lines +175 to +189
⚠️ Potential issue | 🟠 Major

Document the missing cluster prerequisites before kubectl apply.

On Lines 179-188, the referenced manifests depend on preexisting huggingface-secret, openai-api-key, anthropic-api-key, xai-api-key, and, for GPU runners, the model-cache PVC. A fresh install following this section will create RunnerDeployments that either stay Pending or fail to start. Please add those prerequisites here, or split the CPU/GPU steps so the dependency surface is explicit.
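The prerequisites named above could be created ahead of time with manifests along these lines. This is a hedged sketch, not part of the PR: the secret key names mirror the `secretKeyRef` entries in `arc-runner-cpu.yaml`, while the PVC namespace, access mode, and storage size are assumptions that depend on the cluster and the models being cached.

```yaml
# Sketch of the prerequisite resources; adjust values before applying.
apiVersion: v1
kind: Secret
metadata:
  name: huggingface-secret
  namespace: actions-runner-system
stringData:
  HUGGINGFACE_API_KEY: "<your-hf-token>"
---
apiVersion: v1
kind: Secret
metadata:
  name: openai-api-key
  namespace: actions-runner-system
stringData:
  OPENAI_API_KEY: "<your-openai-key>"
---
# anthropic-api-key and xai-api-key follow the same pattern, with keys
# ANTHROPIC_API_KEY and XAI_API_KEY respectively.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-cache
  namespace: actions-runner-system   # assumption: same namespace as the runners
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 500Gi   # assumption: size depends on the models cached
```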



### 5. Verify

```bash
kubectl get runnerdeployments -n actions-runner-system
kubectl get pods -n actions-runner-system
```
107 changes: 107 additions & 0 deletions scripts/k8s-runner-resources/arc-runner-autoscaler.yaml
@@ -0,0 +1,107 @@
```yaml
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  name: arc-runner-4-h100-autoscaler
  namespace: actions-runner-system
spec:
  scaleTargetRef:
    kind: RunnerDeployment
    name: arc-runner-4-gpu-h100

  minReplicas: 4
  maxReplicas: 20

  metrics:
    - type: TotalNumberOfQueuedAndInProgressWorkflowRuns
      repositoryNames:
        - lightseekorg/smg
```
Comment on lines +14 to +17
⚠️ Potential issue | 🟠 Major

All five GPU autoscalers will scale based on the same repo-wide queue metric, causing unrelated pools to scale unnecessarily.

The TotalNumberOfQueuedAndInProgressWorkflowRuns metric (lines 14–17, 32–35, 50–53, 68–71, 86–89) counts all queued and in-progress workflow runs across lightseekorg/smg without filtering by RunnerDeployment labels. Each GPU autoscaler observes the total repository queue depth and scales independently, so a single job queued for one GPU pool will trigger scale-up across all five pools.

Consider scoping each HRA to a dedicated repository, workflow label, or using a different metric that respects runner labels to prevent over-scaling.


Comment on lines +15 to +17
**P2**: Use label-aware autoscaling for each runner pool

Each autoscaler is configured with TotalNumberOfQueuedAndInProgressWorkflowRuns on the same repository (lightseekorg/smg), which is repo-wide rather than runner-label specific. That means queue pressure from one workload class can scale unrelated RunnerDeployments (e.g., CPU backlog scaling H100 pools), causing unnecessary scale-outs and resource/cost churn. Consider a per-deployment signal such as PercentageRunnersBusy or splitting autoscaler scope.
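As a sketch of the suggested alternative, a `PercentageRunnersBusy` metric scales on the busy fraction of the target RunnerDeployment's own runners rather than the repo-wide queue, so pressure in one pool does not scale the others. The thresholds and factors below are illustrative, not values from the PR:

```yaml
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  name: arc-runner-4-h100-autoscaler
  namespace: actions-runner-system
spec:
  scaleTargetRef:
    kind: RunnerDeployment
    name: arc-runner-4-gpu-h100
  minReplicas: 4
  maxReplicas: 20
  metrics:
    # Scales on this deployment's own busy ratio; other pools' queues are ignored.
    - type: PercentageRunnersBusy
      scaleUpThreshold: "0.75"    # illustrative thresholds
      scaleDownThreshold: "0.25"
      scaleUpFactor: "2"
      scaleDownFactor: "0.5"
```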


```yaml
---
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  name: arc-runner-a10-autoscaler
  namespace: actions-runner-system
spec:
  scaleTargetRef:
    kind: RunnerDeployment
    name: arc-runner-gpu-a10

  minReplicas: 2
  maxReplicas: 4

  metrics:
    - type: TotalNumberOfQueuedAndInProgressWorkflowRuns
      repositoryNames:
        - lightseekorg/smg
---
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  name: arc-runner-1-h100-autoscaler
  namespace: actions-runner-system
spec:
  scaleTargetRef:
    kind: RunnerDeployment
    name: arc-runner-1-gpu-h100

  minReplicas: 4
  maxReplicas: 20

  metrics:
    - type: TotalNumberOfQueuedAndInProgressWorkflowRuns
      repositoryNames:
        - lightseekorg/smg
---
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  name: arc-runner-1-autoscaler
  namespace: actions-runner-system
spec:
  scaleTargetRef:
    kind: RunnerDeployment
    name: arc-runner-1-gpu

  minReplicas: 2
  maxReplicas: 16

  metrics:
    - type: TotalNumberOfQueuedAndInProgressWorkflowRuns
      repositoryNames:
        - lightseekorg/smg
---
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  name: arc-runner-2-h100-autoscaler
  namespace: actions-runner-system
spec:
  scaleTargetRef:
    kind: RunnerDeployment
    name: arc-runner-2-gpu-h100

  minReplicas: 2
  maxReplicas: 10

  metrics:
    - type: TotalNumberOfQueuedAndInProgressWorkflowRuns
      repositoryNames:
        - lightseekorg/smg
---
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  name: arc-cpu-runner-autoscaler
  namespace: actions-runner-system
spec:
  scaleTargetRef:
    kind: RunnerDeployment
    name: arc-runner-cpu

  minReplicas: 4
  maxReplicas: 8

  metrics:
    - type: TotalNumberOfQueuedAndInProgressWorkflowRuns
      repositoryNames:
        - lightseekorg/smg
```
48 changes: 48 additions & 0 deletions scripts/k8s-runner-resources/arc-runner-cpu.yaml
@@ -0,0 +1,48 @@
```yaml
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: arc-runner-cpu
  namespace: actions-runner-system
spec:
  replicas: 4
  template:
    spec:
      ephemeral: true
      repository: lightseekorg/smg
      labels:
        - k8s-runner-cpu
      serviceAccountName: arc-runner-sa

      containers:
        - name: runner
          image: fra.ocir.io/idqj093njucb/action-runner:v0.0.1
          resources:
            requests:
              cpu: "8"
              memory: "16Gi"
            limits:
              cpu: "8"
              memory: "16Gi"
          env:
            - name: HF_TOKEN
              valueFrom:
                secretKeyRef:
                  key: HUGGINGFACE_API_KEY
                  name: huggingface-secret
            - name: OPENAI_API_KEY
              valueFrom:
                secretKeyRef:
                  key: OPENAI_API_KEY
                  name: openai-api-key
            - name: ANTHROPIC_API_KEY
              valueFrom:
                secretKeyRef:
                  key: ANTHROPIC_API_KEY
                  name: anthropic-api-key
            - name: XAI_API_KEY
              valueFrom:
                secretKeyRef:
                  key: XAI_API_KEY
                  name: xai-api-key
        - name: docker
          image: fra.ocir.io/idqj093njucb/docker:dind
```
Comment on lines +9 to +48
🔴 Critical

The Docker-in-Docker (dind) configuration for the CPU runner is incomplete. The runner container is missing the DOCKER_HOST environment variable and volume mounts for the Docker socket. The docker sidecar is missing the privileged security context, resource definitions, and volume mounts required for it to function correctly. Additionally, the DOCKER_TLS_CERTDIR environment variable must be set to an empty string in the docker sidecar to disable TLS for the Docker socket, which is necessary for DinD setups. This will cause any Docker operations in workflows on this runner to fail. The configuration should be updated to properly set up the dind sidecar and the communication between the two containers, similar to the GPU runner definitions.

```yaml
spec:
  ephemeral: true
  repository: lightseekorg/smg
  labels:
    - k8s-runner-cpu
  serviceAccountName: arc-runner-sa

  volumes:
    - name: docker-sock
      emptyDir: {}
    - name: docker-storage
      emptyDir: {}

  containers:
    - name: runner
      image: fra.ocir.io/idqj093njucb/action-runner:v0.0.1
      resources:
        requests:
          cpu: "8"
          memory: "16Gi"
        limits:
          cpu: "8"
          memory: "16Gi"
      volumeMounts:
        - name: docker-sock
          mountPath: /var/run
      env:
        - name: DOCKER_HOST
          value: unix:///var/run/docker.sock
        - name: HF_TOKEN
          valueFrom:
            secretKeyRef:
              key: HUGGINGFACE_API_KEY
              name: huggingface-secret
        - name: OPENAI_API_KEY
          valueFrom:
            secretKeyRef:
              key: OPENAI_API_KEY
              name: openai-api-key
        - name: ANTHROPIC_API_KEY
          valueFrom:
            secretKeyRef:
              key: ANTHROPIC_API_KEY
              name: anthropic-api-key
        - name: XAI_API_KEY
          valueFrom:
            secretKeyRef:
              key: XAI_API_KEY
              name: xai-api-key
    - name: docker
      image: fra.ocir.io/idqj093njucb/docker:dind
      securityContext:
        privileged: true
      resources:
        requests:
          cpu: "1"
          memory: "2Gi"
        limits:
          cpu: "2"
          memory: "4Gi"
      env:
        - name: DOCKER_TLS_CERTDIR
          value: ""
        - name: DOCKER_DRIVER
          value: overlay2
      volumeMounts:
        - name: docker-sock
          mountPath: /var/run
        - name: docker-storage
          mountPath: /var/lib/docker
```
References:
  1. When using a Docker-in-Docker (DinD) setup, it is necessary to disable TLS for the Docker socket by setting the `DOCKER_TLS_CERTDIR` environment variable to an empty string.

Comment on lines +16 to +48
⚠️ Potential issue | 🔴 Critical

Missing volumes and volume mounts for Docker socket sharing.

The runner container references DOCKER_HOST: unix:///var/run/docker.sock in GPU deployments, but this CPU deployment is missing:

  1. The volumes section entirely (no docker-sock, docker-storage, dshm volumes)
  2. Volume mounts in the runner container

Without shared volumes, the runner and DinD containers cannot communicate.

Proposed fix to add volumes section
```diff
       serviceAccountName: arc-runner-sa
+
+      volumes:
+        - name: docker-sock
+          emptyDir: {}
+        - name: docker-storage
+          emptyDir:
+            medium: Memory
+            sizeLimit: 4Gi

       containers:
         - name: runner
           image: fra.ocir.io/idqj093njucb/action-runner:v0.0.1
           resources:
             requests:
               cpu: "8"
               memory: "16Gi"
             limits:
               cpu: "8"
               memory: "16Gi"
+          volumeMounts:
+            - name: docker-sock
+              mountPath: /var/run
           env:
+            - name: DOCKER_HOST
+              value: unix:///var/run/docker.sock
             - name: HF_TOKEN
```

Comment on lines +47 to +48
⚠️ Potential issue | 🔴 Critical

Docker-in-Docker sidecar is missing critical configuration.

The docker container is incomplete compared to the GPU runner manifests. It's missing:

  • securityContext.privileged: true (required for DinD)
  • Volume mounts for docker-sock and docker-storage
  • Environment variables (DOCKER_TLS_CERTDIR, DOCKER_DRIVER)
  • Resource requests/limits

Without these, the DinD sidecar will fail to function, and the runner container won't be able to use Docker.

Proposed fix based on GPU runner configuration
```diff
         - name: docker
           image: fra.ocir.io/idqj093njucb/docker:dind
+          securityContext:
+            privileged: true  # Required for DinD
+          resources:
+            requests:
+              cpu: "1"
+              memory: "2Gi"
+            limits:
+              cpu: "2"
+              memory: "4Gi"
+          env:
+            - name: DOCKER_TLS_CERTDIR
+              value: ""  # Disables TLS for shared socket use
+            - name: DOCKER_DRIVER
+              value: overlay2
+          volumeMounts:
+            - name: docker-sock
+              mountPath: /var/run
+            - name: docker-storage
+              mountPath: /var/lib/docker
```

Comment on lines +47 to +48
**P1**: Configure DinD sidecar for CPU RunnerDeployment

This RunnerDeployment adds a docker:dind sidecar but does not wire it up for usable Docker access from the runner (no shared /var/run volume, no privileged DinD setup, and no runner-side Docker endpoint wiring). In workflows that run on k8s-runner-cpu and invoke Docker (for example container actions or docker build), jobs will fail because the runner cannot reach a functional daemon.
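Once the sidecar is wired up, a minimal workflow could smoke-test Docker access on this pool. This is a hypothetical example, not part of the PR; the `runs-on` label comes from the RunnerDeployment's `labels` field, and the workflow name is made up:

```yaml
# Hypothetical smoke test: confirms the runner can reach a working Docker daemon.
name: docker-smoke-test
on: workflow_dispatch
jobs:
  check-docker:
    runs-on: [self-hosted, k8s-runner-cpu]
    steps:
      - name: Verify Docker daemon connectivity
        run: |
          docker info                      # fails if DOCKER_HOST is unreachable
          docker run --rm hello-world      # fails if the dind sidecar cannot run containers
```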
