
Conversation

@iamjustinhsu (Contributor) commented on Dec 31, 2025

Description

Currently, when displaying hanging tasks, we show the Ray Data-level task index, which is not useful for Ray Core debugging. This PR adds more information to long-running tasks, namely (see the sketch after this list):

  • node_id
  • pid
  • attempt #
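
For context, a minimal sketch (not the PR's actual code) of how these fields can be resolved from a task ID via the public ray.util.state API. The helper name describe_hanging_task is hypothetical, and the TaskState attribute names used here (node_id, worker_pid, attempt_number) are assumptions about the state API schema:

from ray.util.state import get_task


def describe_hanging_task(task_id_hex: str) -> str:
    # Hypothetical helper: resolve node, PID, and attempt # for a hanging task.
    # A short timeout keeps the detector from blocking if the state API is
    # slow or unavailable; fall back to the bare task ID in that case.
    try:
        task_state = get_task(task_id_hex, timeout=1.0)
    except Exception:
        task_state = None
    if task_state is None:
        return f"task {task_id_hex} (state lookup unavailable)"
    return (
        f"task {task_id_hex} on node {task_state.node_id} "
        f"(pid={task_state.worker_pid}, attempt={task_state.attempt_number})"
    )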

I considered adding this to the high-memory detector as well, but decided against it for two reasons:

  • it requires more refactoring of RunningTaskInfo
  • AFAIK it is not helpful for debugging, since high memory is only reported after the task completes

Example script to trigger hanging issues

import time

import ray
from ray.data._internal.issue_detection.detectors import (
    HangingExecutionIssueDetectorConfig,
)

# Shorten the hanging-task detection interval so the warning fires quickly.
ctx = ray.data.DataContext.get_current()
ctx.issue_detectors_config.hanging_detector_config = HangingExecutionIssueDetectorConfig(
    detection_time_interval_s=1.0,
)


def sleep(x):
    # Make the task processing the first block hang far longer than the
    # detection interval, so the hanging-execution detector reports it.
    if x["id"] == 0:
        time.sleep(100)
    return x


ray.data.range(100, override_num_blocks=100).map_batches(sleep).materialize()

Related issues

None

Additional information

None

Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
iamjustinhsu requested a review from a team as a code owner on December 31, 2025 at 21:56

@gemini-code-assist (bot) left a comment


Code Review

This pull request aims to enhance the debuggability of hanging tasks by incorporating node_id, pid, and attempt # into the hanging task detector's output. This is achieved by passing the task_id through the operator pipeline to OpRuntimeMetrics, which is then used by the HangingExecutionIssueDetector to fetch detailed task information. The implementation is generally sound, with a beneficial refactoring in physical_operator.py. However, I've identified a critical issue in hash_shuffle.py where arguments to on_task_submitted are incorrectly ordered, which would result in a runtime error.

The ray-gardener bot added the data (Ray Data-related issues) and observability (Issues related to the Ray Dashboard, Logging, Metrics, Tracing, and/or Profiling) labels on Jan 1, 2026


@dataclass
class RunningTaskInfo:
A reviewer (Contributor) commented:

I assume this state is not serialized or persisted anywhere, correct?

@iamjustinhsu (Author) replied:
nah it's not
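
For orientation, a hypothetical sketch of the kind of in-memory (not serialized or persisted) bookkeeping RunningTaskInfo could hold after this change; only task_id is grounded in the snippet reviewed below, the other field names are illustrative assumptions:

from dataclasses import dataclass

import ray


@dataclass
class RunningTaskInfo:
    # Hypothetical sketch; the PR's actual fields may differ.
    task_id: ray.TaskID   # used to query ray.util.state for node_id / pid / attempt #
    start_time_s: float   # assumed: when the task was submitted
    num_outputs: int      # assumed: outputs produced so far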

Comment on lines 159 to 162
task_state = ray.util.state.get_task(
    task_info.task_id.hex(),
    timeout=1.0,
)
A reviewer (Contributor) commented:
Can we pass in _explain=True and log the explanation in the event of a failure?
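
For illustration, a hedged sketch of what that could look like, assuming the state API's _explain flag behaves as documented (it surfaces an explanation when a query returns partial or missing results). task_info comes from the snippet quoted above; logger is a placeholder, not the PR's code:

import logging

import ray.util.state

logger = logging.getLogger(__name__)

try:
    # _explain=True asks the state API to explain partial or missing results.
    task_state = ray.util.state.get_task(
        task_info.task_id.hex(),
        timeout=1.0,
        _explain=True,
    )
except Exception:
    logger.warning(
        "Failed to fetch task state for %s",
        task_info.task_id.hex(),
        exc_info=True,
    )
    task_state = None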

iamjustinhsu added the go label (add ONLY when ready to merge, run all tests) on Jan 7, 2026
iamjustinhsu and others added 4 commits January 7, 2026 15:18
…tector.py

Co-authored-by: Alexey Kudinkin <alexey.kudinkin@gmail.com>
Signed-off-by: iamjustinhsu <140442892+iamjustinhsu@users.noreply.github.com>

Labels

data (Ray Data-related issues), go (add ONLY when ready to merge, run all tests), observability (Issues related to the Ray Dashboard, Logging, Metrics, Tracing, and/or Profiling)
