Skip to content

Fix group name lookup crash for flattened mesh collectives in debug_helpers#364

Open
fmassa wants to merge 1 commit intomainfrom
fmassa/fix_group_name_lookup_cost
Open

Fix group name lookup crash for flattened mesh collectives in debug_helpers#364
fmassa wants to merge 1 commit intomainfrom
fmassa/fix_group_name_lookup_cost

Conversation

@fmassa
Copy link
Contributor

@fmassa fmassa commented Mar 13, 2026

make_custom_runtime_estimation assumed every collective's process group corresponds to a single mesh dimension, using groups_name.index(group_name) to resolve it. This crashes with ValueError when a collective uses the flattened mesh group created by mesh._flatten() in _apply_placement_common, since that group isn't in mesh.get_all_groups().

The fix resolves the group name against per-dimension groups first, then falls back to mesh._flatten() (which is cached/idempotent) with mesh_dim=0. All downstream shape and topology references use cost_mesh so the cost estimation is correct for both per-dimension and flattened collectives.

Authored with Claude.

…elpers

make_custom_runtime_estimation assumed every collective's process group
corresponds to a single mesh dimension, using groups_name.index(group_name)
to resolve it. This crashes with ValueError when a collective uses the
flattened mesh group created by mesh._flatten() in _apply_placement_common,
since that group isn't in mesh.get_all_groups().

The fix resolves the group name against per-dimension groups first, then falls
back to mesh._flatten() (which is cached/idempotent) with mesh_dim=0. All
downstream shape and topology references use cost_mesh so the cost estimation
is correct for both per-dimension and flattened collectives.

Authored with Claude.
@meta-cla meta-cla bot added the CLA Signed This label is managed by the Meta Open Source bot. label Mar 13, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CLA Signed This label is managed by the Meta Open Source bot.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant