
Conversation

@iamjustinhsu (Contributor) commented on Dec 31, 2025

Description

Currently, when displaying hanging tasks, we show the Ray Data-level task index, which is not useful for Ray Core debugging. This PR adds more information to long-running tasks, namely (see the sketch after this list):

  • node_id
  • pid
  • attempt #
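
For context, a minimal sketch (not the PR's actual code) of how these fields can be resolved from a task ID via the public ray.util.state API. The helper name describe_hanging_task is hypothetical, and the TaskState attribute names used here (node_id, worker_pid, attempt_number) are assumptions about the state API schema:

from ray.util.state import get_task


def describe_hanging_task(task_id_hex: str) -> str:
    # Hypothetical helper: resolve node, PID, and attempt # for a hanging task.
    # A short timeout keeps the detector from blocking if the state API is
    # slow or unavailable; fall back to the bare task ID in that case.
    try:
        task_state = get_task(task_id_hex, timeout=1.0)
    except Exception:
        task_state = None
    if task_state is None:
        return f"task {task_id_hex} (state lookup unavailable)"
    return (
        f"task {task_id_hex} on node {task_state.node_id} "
        f"(pid={task_state.worker_pid}, attempt={task_state.attempt_number})"
    )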

I considered adding this to the high-memory detector as well, but decided against it for two reasons:

  • it requires more refactoring of RunningTaskInfo
  • AFAIK it is not helpful for debugging, since high memory is only reported after the task completes

Example script to trigger hanging issues

import time

import ray
from ray.data._internal.issue_detection.detectors import (
    HangingExecutionIssueDetectorConfig,
)

# Shorten the hanging-task detection interval so the warning fires quickly.
ctx = ray.data.DataContext.get_current()
ctx.issue_detectors_config.hanging_detector_config = HangingExecutionIssueDetectorConfig(
    detection_time_interval_s=1.0,
)


def sleep(x):
    # Make the task processing the first block hang far longer than the
    # detection interval, so the hanging-execution detector reports it.
    if x["id"] == 0:
        time.sleep(100)
    return x


ray.data.range(100, override_num_blocks=100).map_batches(sleep).materialize()

Related issues

None

Additional information

None

Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
iamjustinhsu requested a review from a team as a code owner on December 31, 2025 at 21:56

@gemini-code-assist (bot) left a comment


Code Review

This pull request aims to enhance the debuggability of hanging tasks by incorporating node_id, pid, and attempt # into the hanging task detector's output. This is achieved by passing the task_id through the operator pipeline to OpRuntimeMetrics, which is then used by the HangingExecutionIssueDetector to fetch detailed task information. The implementation is generally sound, with a beneficial refactoring in physical_operator.py. However, I've identified a critical issue in hash_shuffle.py where arguments to on_task_submitted are incorrectly ordered, which would result in a runtime error.

The ray-gardener bot added the data (Ray Data-related issues) and observability (Issues related to the Ray Dashboard, Logging, Metrics, Tracing, and/or Profiling) labels on Jan 1, 2026


@dataclass
class RunningTaskInfo:
A reviewer (Contributor) commented:

I assume this state is not serialized or persisted anywhere, correct?

@iamjustinhsu (Author) replied:
nah it's not
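
For orientation, a hypothetical sketch of the kind of in-memory (not serialized or persisted) bookkeeping RunningTaskInfo could hold after this change; only task_id is grounded in the snippet reviewed below, the other field names are illustrative assumptions:

from dataclasses import dataclass

import ray


@dataclass
class RunningTaskInfo:
    # Hypothetical sketch; the PR's actual fields may differ.
    task_id: ray.TaskID   # used to query ray.util.state for node_id / pid / attempt #
    start_time_s: float   # assumed: when the task was submitted
    num_outputs: int      # assumed: outputs produced so far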

Comment on lines 159 to 162
task_state = ray.util.state.get_task(
    task_info.task_id.hex(),
    timeout=1.0,
)
A reviewer (Contributor) commented:
Can we pass in _explain=True and log the explanation in the event of a failure?
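
For illustration, a hedged sketch of what that could look like, assuming the state API's _explain flag behaves as documented (it surfaces an explanation when a query returns partial or missing results). task_info comes from the snippet quoted above; logger is a placeholder, not the PR's code:

import logging

import ray.util.state

logger = logging.getLogger(__name__)

try:
    # _explain=True asks the state API to explain partial or missing results.
    task_state = ray.util.state.get_task(
        task_info.task_id.hex(),
        timeout=1.0,
        _explain=True,
    )
except Exception:
    logger.warning(
        "Failed to fetch task state for %s",
        task_info.task_id.hex(),
        exc_info=True,
    )
    task_state = None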

iamjustinhsu added the go label (add ONLY when ready to merge, run all tests) on Jan 7, 2026
iamjustinhsu and others added 4 commits January 7, 2026 15:18
…tector.py

Co-authored-by: Alexey Kudinkin <alexey.kudinkin@gmail.com>
Signed-off-by: iamjustinhsu <140442892+iamjustinhsu@users.noreply.github.com>

Labels

data (Ray Data-related issues), go (add ONLY when ready to merge, run all tests), observability (Issues related to the Ray Dashboard, Logging, Metrics, Tracing, and/or Profiling)
