
@nfnt nfnt commented Dec 9, 2025

Handle the case where multiple data providers serve the same dataset. For that case we assume that a dataset name uniquely identifies a dataset, i.e. providers that advertise the same name hold the same data. With that assumption, we can inform workers of all data providers for a dataset and have them choose between them when downloading data slices. Right now, workers choose at random if multiple data providers are available.

Tested locally by having 2 data providers with the MNIST dataset. Workers pulled slices from both of them.
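
The random choice described above could look roughly like the following. This is a minimal sketch using only the standard library, not the code from this PR: `random_index` and `choose_provider` are hypothetical names, and providers are represented as plain strings for illustration.

```rust
use std::collections::hash_map::RandomState;
use std::hash::{BuildHasher, Hasher};

/// Returns a pseudo-random index in `0..len`, or `None` if `len` is zero.
/// Hypothetical helper: it borrows the random seeding of `RandomState`
/// so that no external crate is needed.
fn random_index(len: usize) -> Option<usize> {
    if len == 0 {
        return None;
    }
    // Every `RandomState` carries randomly seeded keys, so the hash of
    // an empty input differs between instances.
    let seed = RandomState::new().build_hasher().finish();
    Some((seed % len as u64) as usize)
}

/// Picks one of the providers that advertise a dataset.
/// Returns `None` if no provider is known.
fn choose_provider(providers: &[String]) -> Option<&String> {
    random_index(providers.len()).map(|i| &providers[i])
}
```

Calling `choose_provider` once per slice download spreads requests across all known providers, which matches the simple load balancing described here.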

Remarks:
While this allows for simple load balancing when pulling data, it won't provide any fault tolerance if data providers become unavailable: the scheduler determines the available data providers only once. For fault tolerance, this would need to be done for each data slice request from workers. Furthermore, data providers add themselves to the DHT using start_providing. After a provider is gone, it takes a while until its record is removed from the DHT. We would have to account for this by checking whether the data providers reported by the DHT are actually available.
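
One possible shape for the missing fault tolerance is to try providers in turn and fall back when one fails. The sketch below is not part of this PR and only illustrates the idea: `fetch_with_fallback` is a hypothetical name, and the `fetch` closure stands in for whatever call actually downloads a slice.

```rust
/// Tries each provider in order and returns the first slice that downloads.
/// `fetch` is a hypothetical stand-in for the real download call.
fn fetch_with_fallback<F>(
    providers: &[String],
    slice: u64,
    mut fetch: F,
) -> Result<Vec<u8>, String>
where
    F: FnMut(&str, u64) -> Result<Vec<u8>, String>,
{
    let mut last_error = String::from("no data providers available");
    for provider in providers {
        match fetch(provider.as_str(), slice) {
            Ok(bytes) => return Ok(bytes),
            // A DHT record may outlive its provider, so a failure only
            // means "try the next provider", not "give up".
            Err(err) => last_error = format!("{provider}: {err}"),
        }
    }
    Err(last_error)
}
```

This only covers stale records on the worker side. The provider list would still need to be refreshed per slice request to pick up providers that join later.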

@nfnt nfnt requested review from l45k and orlandohohmeier December 9, 2025 14:52

codecov bot commented Dec 9, 2025

Codecov Report

❌ Patch coverage is 0% with 32 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| crates/scheduler/src/bin/hypha-scheduler.rs | 0.00% | 19 Missing ⚠️ |
| crates/worker/src/executor/bridge.rs | 0.00% | 7 Missing ⚠️ |
| crates/scheduler/src/scheduling/data_scheduler.rs | 0.00% | 5 Missing ⚠️ |
| crates/data/src/bin/hypha-data.rs | 0.00% | 1 Missing ⚠️ |


@nfnt nfnt force-pushed the nfnt/support-multiple-data-providers branch from 851ad9d to 8283838 Compare December 10, 2025 13:43
Comment on lines +213 to +261
// Only a single record is stored for this key in the DHT.
// I.e. if there are multiple data providers with the same dataset,
// we might overwrite an existing record.
// For now, we assume that a dataset name is always used for the same record.
// With that assumption, overwriting existing records is okay.
// We might need to change this in the future.

This is captured in: #203

@nfnt nfnt merged commit db773ec into alpha Dec 11, 2025
8 of 9 checks passed
@nfnt nfnt deleted the nfnt/support-multiple-data-providers branch December 11, 2025 07:36
@github-actions

🎉 This PR is included in version 1.0.0-alpha.39 🎉

The release is available on GitHub release

Your semantic-release bot 📦🚀
