[data] Add node_id, pid, attempt # for hanging tasks #59793
base: master
Conversation
Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
Code Review
This pull request aims to enhance the debuggability of hanging tasks by incorporating node_id, pid, and attempt # into the hanging task detector's output. This is achieved by passing the task_id through the operator pipeline to OpRuntimeMetrics; the HangingExecutionIssueDetector then uses that task_id to fetch detailed task information. The implementation is generally sound, with a beneficial refactoring in physical_operator.py. However, I've identified a critical issue in hash_shuffle.py where arguments to on_task_submitted are incorrectly ordered, which would result in a runtime error.
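As a rough illustration (not the PR's actual code), a detector that holds a task_id could resolve it to node and worker details along these lines. The TaskState attribute names used here (node_id, worker_pid, attempt_number) are assumptions based on the ray.util.state schema and may differ by Ray version:

```python
# Hypothetical sketch: resolve a Ray task_id into node/worker details for a
# hanging-task report. Assumes ray.util.state.get_task returns a TaskState
# with node_id, worker_pid, and attempt_number attributes (verify against
# your Ray version's state API schema).
from ray.util.state import get_task


def describe_hanging_task(task_id_hex: str) -> str:
    task_state = get_task(task_id_hex, timeout=1.0)
    if task_state is None:
        # The state API may not have the task anymore (e.g. GCS eviction);
        # fall back to the raw id so the report still identifies the task.
        return f"task {task_id_hex} (no state available)"
    return (
        f"task {task_id_hex} on node {task_state.node_id}, "
        f"pid {task_state.worker_pid}, attempt {task_state.attempt_number}"
    )
```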
python/ray/data/_internal/issue_detection/detectors/hanging_detector.py (outdated; resolved)
Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
@dataclass
class RunningTaskInfo:
I assume this state is not serialized and persisted anywhere, correct?
nah it's not
task_state = ray.util.state.get_task(
    task_info.task_id.hex(),
    timeout=1.0
)
Can we pass in _explain=True and log the explanation in the event of a failure?
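For reference, a hedged sketch of what that suggestion could look like; _explain=True is the state API flag the comment refers to, while the try/except and logging shape here is purely illustrative:

```python
import logging

import ray.util.state

logger = logging.getLogger(__name__)

try:
    # _explain=True asks the state API to log an explanation when the query
    # hits a failure or returns partial/missing data.
    task_state = ray.util.state.get_task(
        task_info.task_id.hex(),
        timeout=1.0,
        _explain=True,
    )
except Exception:
    logger.exception(
        "Failed to fetch task state for %s", task_info.task_id.hex()
    )
    task_state = None
```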
Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
python/ray/data/_internal/issue_detection/detectors/hanging_detector.py (outdated; resolved)
…tector.py Co-authored-by: Alexey Kudinkin <alexey.kudinkin@gmail.com> Signed-off-by: iamjustinhsu <140442892+iamjustinhsu@users.noreply.github.com>
…/add-more-info-to-hanging-tasks
Description
Currently, when displaying hanging tasks, we only show the Ray Data-level task index, which is not useful for Ray Core debugging. This PR adds more info to long-running tasks, namely:
- node_id
- pid
- attempt #
I did consider adding this to the high memory detector, but avoided it for two reasons.
Example script to trigger hanging issues:
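The original script is collapsed in the PR; below is a minimal sketch of the kind of pipeline that would trip the hanging-task detector. The sleep duration and dataset shape are arbitrary, and the detector's threshold/configuration is not shown:

```python
# Hypothetical repro: one map task sleeps far longer than its siblings, so
# the hanging-task detector should eventually flag it and (with this PR)
# report its node_id, pid, and attempt number.
import time

import ray


def maybe_hang(row):
    if row["id"] == 0:
        time.sleep(10_000)  # simulate a stuck task
    return row


ray.init()
ds = ray.data.range(100).map(maybe_hang)
ds.materialize()
```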
Related issues
None
Additional information
None