Fix: Preserve TaskInstance history during Kubernetes API rate limiting errors - Task Instance Fix #55159
Open
HsiuChuanHsu wants to merge 4 commits into apache:main from
edbf605 to f0ab406
Member
Fix looks reasonable but tests don’t agree. This should include a test case too.
f0ab406 to 01962e3
75dfd76 to 63a5ad1
Contributor
Author
When implementing unit tests for the new orphaned task detection logic in the
Original problem
Solution
63a5ad1 to e9dbba6
Contributor
@HsiuChuanHsu this PR combines changes to Airflow core and the k8s provider. If these changes are not coupled, can you please separate them? Providers and core have different release cycles.
Contributor
Author
@eladkal Sure, will work on it.
e9dbba6 to 7d78088
7d78088 to 94b9ea9
- Handle 429 errors in KubernetesExecutor task publishing retry logic
- Detect orphaned tasks and record TaskInstanceHistory in failure handler
- Add detailed logging for rate limiting scenarios
Move orphaned task detection before end_date assignment to ensure TaskInstanceHistory is recorded for tasks that become detached during scheduler restarts due to Kubernetes API 429 errors.
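The first commit message mentions handling 429 errors in the task publishing retry logic. A minimal sketch of that idea follows. It is not the code from this PR: FakeApiError, publish_with_retry, and their parameters are hypothetical names used only for illustration, and the backoff policy is an assumption.

```python
import time
from typing import Callable

RATE_LIMITED = 429  # HTTP "Too Many Requests"


class FakeApiError(Exception):
    """Hypothetical stand-in for a client error that carries an HTTP status."""

    def __init__(self, status: int) -> None:
        super().__init__(f"API call failed with HTTP {status}")
        self.status = status


def publish_with_retry(
    publish: Callable[[], None],
    max_attempts: int = 3,
    base_delay: float = 0.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call publish(), retrying only on 429. Returns the attempts used."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            publish()
            return attempt
        except FakeApiError as err:
            # Anything other than rate limiting, or the final attempt,
            # is re-raised to the caller unchanged.
            if err.status != RATE_LIMITED or attempt == max_attempts:
                raise
            # Assumed policy: exponential backoff between attempts.
            sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")
```

Passing sleep as a parameter keeps the sketch testable without real delays; a real executor would more likely re-queue the task than block in place.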
94b9ea9 to d9b125d
Description
This PR fixes issue #49517 where TaskInstanceHistory records were lost when Kubernetes API rate limiting (429 errors) prevented task adoption during scheduler restarts.
Problem
When using KubernetesExecutor or CeleryKubernetesExecutor:
None
RUNNING
Solution
KubernetesExecutor: Add 429 error handling to retry logic and detailed logging for adoption failures
TaskInstance: Detect orphaned tasks (state=None + start_date set + end_date unset) and record TaskInstanceHistory
Impact
Before:
After:
Fixes: #49517
Related: #49244
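The orphaned-task condition named in the Solution (state=None, start_date set, end_date unset) can be sketched as a simple predicate. The sketch also shows why the check has to run before end_date is assigned, as one of the commit messages notes. FakeTaskInstance, looks_orphaned, and handle_failure are hypothetical stand-ins, not Airflow's actual classes or functions, and the history list only imitates recording a TaskInstanceHistory row.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List, Optional, Tuple


@dataclass
class FakeTaskInstance:
    """Hypothetical stand-in holding only the fields the check needs."""

    state: Optional[str]
    start_date: Optional[datetime]
    end_date: Optional[datetime] = None


def looks_orphaned(ti: FakeTaskInstance) -> bool:
    # Started at some point, never finished, and the state was reset to None.
    return ti.state is None and ti.start_date is not None and ti.end_date is None


def handle_failure(
    ti: FakeTaskInstance,
    history: List[Tuple[Optional[str], Optional[datetime]]],
) -> None:
    # The check must come first: once end_date is assigned below,
    # looks_orphaned() can no longer return True for this task.
    if looks_orphaned(ti):
        history.append((ti.state, ti.start_date))  # imitates a history record
    ti.end_date = datetime.now(timezone.utc)
    ti.state = "failed"
```

If the two steps in handle_failure were swapped, the predicate would always see a populated end_date and the history entry would be skipped, which is the ordering problem the commit message describes.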
Read the Pull Request Guidelines for more information.
In case of fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in a newsfragment file, named {pr_number}.significant.rst or {issue_number}.significant.rst, in airflow-core/newsfragments.