
Conversation

Member
@dejanzele dejanzele commented Dec 2, 2025

What type of PR is this?

Enhancement

What this PR does / why we need it

Previously, when the scheduler hit its timeout during the scheduling cycle, it would return an error and discard all work, even jobs that were successfully scheduled before the timeout.

This change implements the following approach: when a timeout occurs, stop considering new jobs but continue scheduling any evicted jobs that still need to be rescheduled.

The key change is in QueueScheduler.Schedule: instead of returning an error on context timeout, we call OnlyYieldEvicted() to switch the iterator to only yield evicted jobs, finish scheduling those, then return the partial results. The subsequent stages (oversubscription handling, optimiser, unbind) continue to run normally.
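To make the shape of that change concrete, here is a minimal, self-contained sketch of the control flow (not the actual Armada implementation: QueueScheduler.Schedule and OnlyYieldEvicted() are the names used in this PR, while the iterator layout, field names, and toy scheduling loop below are illustrative assumptions only):

package main

import (
    "context"
    "fmt"
    "time"
)

type job struct {
    id      string
    evicted bool // true if the job was evicted from a node earlier in the cycle
}

// jobIterator is a stand-in for the scheduler's job iterator.
type jobIterator struct {
    jobs        []job
    idx         int
    evictedOnly bool
}

// OnlyYieldEvicted switches the iterator so that only evicted jobs are
// yielded; fresh jobs from the queue are skipped from now on.
func (it *jobIterator) OnlyYieldEvicted() { it.evictedOnly = true }

func (it *jobIterator) Next() (job, bool) {
    for it.idx < len(it.jobs) {
        j := it.jobs[it.idx]
        it.idx++
        if it.evictedOnly && !j.evicted {
            continue
        }
        return j, true
    }
    return job{}, false
}

// schedule mirrors the behaviour described above: on context timeout it logs
// once, stops taking new jobs, keeps rescheduling evicted jobs, and returns
// the partial result instead of an error.
func schedule(ctx context.Context, it *jobIterator) []job {
    var scheduled []job
    timedOut := false
    for {
        if !timedOut && ctx.Err() != nil {
            fmt.Println("Timeout reached, switching to evicted-only mode")
            it.OnlyYieldEvicted()
            timedOut = true
        }
        j, ok := it.Next()
        if !ok {
            return scheduled // partial result, no error
        }
        scheduled = append(scheduled, j) // stand-in for binding the job to a node
    }
}

func main() {
    ctx, cancel := context.WithTimeout(context.Background(), time.Millisecond)
    defer cancel()
    time.Sleep(2 * time.Millisecond) // let the deadline pass so the switch is visible
    it := &jobIterator{jobs: []job{
        {id: "evicted-1", evicted: true},
        {id: "new-1", evicted: false},
        {id: "evicted-2", evicted: true},
    }}
    fmt.Println(schedule(ctx, it)) // only the evicted jobs are scheduled
}

In the real change the same idea applies inside QueueScheduler.Schedule, with the later stages (oversubscription handling, optimiser, unbind) still running on the partial result as described above.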

Expected output

When a timeout happens, we should see log lines like the ones below:

  INFO Timeout reached for pool default, switching to evicted-only mode
  INFO Scheduling cycle interrupted by context deadline exceeded: scheduled 873 jobs for pool default
  INFO Scheduled on executor pool default in 19.983083ms with error <nil>

How to test

Check the section at the end called Additional Files for the test script and test Armada job.

  1. Configure the scheduler with a short timeout in _local/scheduler/config.yaml:

     maxSchedulingDuration: 200ms

  2. Configure the fake executor with enough capacity in _local/fakeexecutor/config.yaml:

     nodes:
       - name: "fake-node"
         count: 50
         allocatable:
           cpu: "64"
           memory: "256Gi"

  3. Start the local environment:

     goreman -f _local/procfiles/fake-executor.Procfile start

  4. Create two test queues:

     armadactl create queue queue-a
     armadactl create queue queue-b

  5. Run the following commands to generate jobs:

     ./scripts/submit-jobs.sh -c 5000 -q queue-a -j jobset-timeout-a example/fair-share-test.yaml
     ./scripts/submit-jobs.sh -c 5000 -q queue-b -j jobset-timeout-b example/fair-share-test.yaml

  6. Assert that the following logs appear in the scheduler output:

     INFO Timeout reached for pool default, switching to evicted-only mode
     INFO Scheduling cycle interrupted by context deadline exceeded: scheduled 873 jobs for pool default
     INFO Scheduled on executor pool default in 19.983083ms with error <nil>
Additional Files
# scripts/submit-jobs.sh

#!/bin/bash
set -e

COUNT=1
JOBSET="test-jobset"
QUEUE="test-queue"
JOB_TEMPLATE=""
MAX_PARALLEL=50

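# Parse flags; the last positional argument is the job template file.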
while [[ $# -gt 0 ]]; do
    case $1 in
        -c|--count) COUNT="$2"; shift 2 ;;
        -j|--jobset) JOBSET="$2"; shift 2 ;;
        -q|--queue) QUEUE="$2"; shift 2 ;;
        -p|--parallel) MAX_PARALLEL="$2"; shift 2 ;;
        -*) echo "Unknown option $1"; exit 1 ;;
        *) JOB_TEMPLATE="$1"; shift ;;
    esac
done

[[ -z "$JOB_TEMPLATE" ]] && JOB_TEMPLATE="example/fair-share-test.yaml"
[[ ! -f "$JOB_TEMPLATE" ]] && echo "Error: $JOB_TEMPLATE not found" && exit 1

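# Prefer a locally built ./armadactl, otherwise fall back to armadactl on PATH.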
ARMADACTL="./armadactl"
[[ ! -f "$ARMADACTL" ]] && ARMADACTL="armadactl"

TEMP_DIR=$(mktemp -d)
trap "rm -rf $TEMP_DIR" EXIT

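# Rewrite the template's jobSetId and queue fields to the requested values.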
JOB_FILE="$TEMP_DIR/job.yaml"
sed -e "s/^jobSetId:.*/jobSetId: $JOBSET/" -e "s/^queue:.*/queue: $QUEUE/" "$JOB_TEMPLATE" > "$JOB_FILE"

$ARMADACTL create queue "$QUEUE" 2>/dev/null || true

echo "Submitting $COUNT batches to queue '$QUEUE' jobset '$JOBSET'..."

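# Submit in parallel batches of up to MAX_PARALLEL, waiting for each batch to finish before starting the next.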
PIDS=()
for ((i=1; i<=COUNT; i++)); do
    $ARMADACTL submit "$JOB_FILE" >/dev/null 2>&1 &
    PIDS+=($!)
    if ((${#PIDS[@]} >= MAX_PARALLEL)) || ((i == COUNT)); then
        for pid in "${PIDS[@]}"; do wait $pid; done
        PIDS=()
        echo "Progress: $i/$COUNT"
    fi
done

echo "Done. Submitted $COUNT batches to queue '$QUEUE'"
# example/fair-share-test.yaml

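# Each submission creates 20 identical jobs; the queue and jobSetId below are placeholders that submit-jobs.sh rewrites.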
queue: test-queue
jobSetId: fair-share-test
jobs:
  - namespace: default
    priority: 1000
    podSpec: &podspec
      terminationGracePeriodSeconds: 0
      restartPolicy: Never
      containers:
        - name: worker
          image: busybox:latest
          command: ["sleep", "3600"]
          resources:
            limits:
              memory: 64Mi
              cpu: 50m
            requests:
              memory: 64Mi
              cpu: 50m
  - namespace: default
    priority: 1000
    podSpec: *podspec
  - namespace: default
    priority: 1000
    podSpec: *podspec
  - namespace: default
    priority: 1000
    podSpec: *podspec
  - namespace: default
    priority: 1000
    podSpec: *podspec
  - namespace: default
    priority: 1000
    podSpec: *podspec
  - namespace: default
    priority: 1000
    podSpec: *podspec
  - namespace: default
    priority: 1000
    podSpec: *podspec
  - namespace: default
    priority: 1000
    podSpec: *podspec
  - namespace: default
    priority: 1000
    podSpec: *podspec
  - namespace: default
    priority: 1000
    podSpec: *podspec
  - namespace: default
    priority: 1000
    podSpec: *podspec
  - namespace: default
    priority: 1000
    podSpec: *podspec
  - namespace: default
    priority: 1000
    podSpec: *podspec
  - namespace: default
    priority: 1000
    podSpec: *podspec
  - namespace: default
    priority: 1000
    podSpec: *podspec
  - namespace: default
    priority: 1000
    podSpec: *podspec
  - namespace: default
    priority: 1000
    podSpec: *podspec
  - namespace: default
    priority: 1000
    podSpec: *podspec
  - namespace: default
    priority: 1000
    podSpec: *podspec

nikola-jokic previously approved these changes Dec 2, 2025
sctx.TerminationReason = ctx.Err().Error()
default:
}
if ctx.Err() != nil {
Collaborator

Are you sure this is safe? The scheduler has multiple stages (evict, reschedule, remove overfill) and I'm not sure that a timeout at an arbitrary point will result in a usable set of schedulable jobs. The test below handles the simple case where we are scheduling jobs onto an empty cluster with no preemption, but I suspect other cases will fail:

  • Fair-share preemption
  • Urgency-based preemption (particularly in the case of overfill)

Member Author

Good point, I'll work on covering them.

Collaborator

@d80tb7 d80tb7 left a comment

I'm not convinced this works in the case of preemption.

@dejanzele dejanzele force-pushed the feat/scheduler-graceful-shutdown branch 8 times, most recently from 1ff8fc9 to 27cd367 on December 16, 2025 at 07:22
… of rolling back all work

Signed-off-by: Dejan Zele Pejchev <pejcev.dejan@gmail.com>
@dejanzele dejanzele force-pushed the feat/scheduler-graceful-shutdown branch from 27cd367 to fa91955 on December 18, 2025 at 00:04