Fix ti_skip_downstream overwriting RUNNING tasks to SKIPPED #63266
Open
sam-dumont wants to merge 2 commits into apache:main
In HA deployments, ti_skip_downstream() issues a bulk UPDATE without a state guard. When a BranchOperator decides to skip downstream tasks, it can overwrite a task already RUNNING on a worker to SKIPPED, causing a 409 heartbeat conflict that kills the task mid-execution.

Add a skippable_state_clause to the UPDATE WHERE clause so RUNNING, SUCCESS, and FAILED tasks are never overwritten to SKIPPED.

QUEUED tasks are intentionally allowed to be skipped: no work has been done yet and the BranchOperator's decision should take priority. The worker pod will get a benign 409 on PATCH /run and exit cleanly.

closes: apache#59378
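As a rough illustration of the guarded bulk UPDATE described in this commit message, the sketch below uses Python's standard sqlite3 module rather than Airflow's SQLAlchemy models. The task_instance table layout, the PROTECTED tuple, and the skip_downstream helper are hypothetical stand-ins, not the PR's actual code, and rows whose state is NULL are deliberately left out here.

```python
import sqlite3

# Hypothetical, heavily simplified stand-in for Airflow's task_instance table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE task_instance (task_id TEXT PRIMARY KEY, state TEXT)")
conn.executemany(
    "INSERT INTO task_instance VALUES (?, ?)",
    [
        ("t_queued", "queued"),
        ("t_running", "running"),
        ("t_success", "success"),
        ("t_failed", "failed"),
    ],
)

# States that must never be overwritten to 'skipped' (assumed from the text above).
PROTECTED = ("running", "success", "failed")


def skip_downstream(conn, task_ids):
    """Mark the given tasks as skipped unless they are in a protected state."""
    id_marks = ", ".join("?" for _ in task_ids)
    state_marks = ", ".join("?" for _ in PROTECTED)
    cur = conn.execute(
        "UPDATE task_instance SET state = 'skipped' "
        f"WHERE task_id IN ({id_marks}) AND state NOT IN ({state_marks})",
        [*task_ids, *PROTECTED],
    )
    return cur.rowcount  # number of rows actually changed


updated = skip_downstream(conn, ["t_queued", "t_running", "t_success", "t_failed"])
states = dict(conn.execute("SELECT task_id, state FROM task_instance"))
print(updated, states)
```

Without the state condition in the WHERE clause, the same statement would rewrite all four rows, including the one a worker is still running.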
f4a3ded to 589633a
ti_skip_downstream() issues an UPDATE filtered by (dag_id, run_id, task_id, map_index) without a state guard. When a BranchOperator on one scheduler decides to skip downstream tasks, the UPDATE can overwrite a task already RUNNING on a worker. The worker's next heartbeat returns 409 with current_state: skipped, killing the task mid-execution.

This is a companion fix to #60330, which guards schedule_tis() against the same class of race condition. Different code path (Execution API routes vs dagrun.py), same root cause: unguarded bulk UPDATEs on TI state.

Production data (12 days, 5 schedulers, ~500 concurrent workers)

We deployed both fixes as monkey patches on our prod cluster and monitored 409 heartbeat errors via CloudWatch:

current_state: scheduled
current_state: failed
current_state: skipped

Fix
Add skippable_state_clause to the UPDATE's WHERE clause.

The or_(IS NULL, NOT IN) pattern handles SQL NULL semantics: NULL NOT IN (...) evaluates to NULL (falsy), so tasks with state=None need an explicit IS NULL check to remain skippable.

QUEUED is intentionally NOT guarded: a QUEUED task hasn't started executing yet, so the BranchOperator's decision should take priority. The worker pod will get a benign 409 on PATCH /run and exit cleanly. Blocking QUEUED would cause a semantic error where the wrong branch executes.

Tests
5 regression tests in TestTISkipDownstreamRaceCondition.

related: #59378
related: #60330
related: #57618
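The NULL handling described under Fix can be checked with a small sketch. It uses raw SQL through Python's standard sqlite3 module instead of the SQLAlchemy or_() expression, and the ti table and its two rows are hypothetical; it only illustrates why the IS NULL branch is needed, not how the PR builds the clause.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# NULL NOT IN (...) evaluates to NULL rather than TRUE; sqlite3 returns NULL as None.
null_check = conn.execute(
    "SELECT NULL NOT IN ('running', 'success', 'failed')"
).fetchone()[0]

# Hypothetical table: one task with no state yet, one already running.
conn.execute("CREATE TABLE ti (task_id TEXT PRIMARY KEY, state TEXT)")
conn.executemany(
    "INSERT INTO ti VALUES (?, ?)",
    [("no_state", None), ("busy", "running")],
)

# NOT IN alone: the NULL-state row does not match, so it could never be skipped.
not_in_only = conn.execute(
    "SELECT task_id FROM ti "
    "WHERE state NOT IN ('running', 'success', 'failed')"
).fetchall()

# IS NULL OR NOT IN: the NULL-state row stays skippable, the running row stays protected.
with_is_null = conn.execute(
    "SELECT task_id FROM ti "
    "WHERE state IS NULL OR state NOT IN ('running', 'success', 'failed')"
).fetchall()

print(null_check, not_in_only, with_is_null)
```

The same three-valued logic applies on the databases Airflow supports, which is why a bare NOT IN filter would silently exclude tasks that have no state yet.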
Was generative AI tooling used to co-author this PR?
Generated-by: Claude Code (claude-opus-4-6) following the guidelines