
feat: Sampler interface and ability to add custom samplers#101

Merged
ilan-gold merged 58 commits into scverse:main from selmanozleyen:feat/sampler
Jan 22, 2026

Conversation

@selmanozleyen
Member

@selmanozleyen selmanozleyen commented Dec 17, 2025

fixes: #43

Not ready for review, just wanted an initial blessing or feedback in case your eye caught something.

Key points for my main design proposal for the interaction with Loader and Sampler:

  • Loader always takes a Sampler[list[slice]]
  • ChunkAlignedSampler(Sampler[list[slice]]) expects an implementation of _iter_chunk_batches, which returns the chunk ids and at most two slices
  • ChunkAlignedSampler is a subclass of Sampler that already has __iter__ implemented; it uses _iter_chunk_batches and converts the chunk ids into slices
  • This way a child of ChunkAlignedSampler is always easily validated to be chunk aligned for our loader.

Why at most two slices per batch? Because one can always rearrange fragmentations so that a batch has at most 2 unaligned chunk slices. I already have an idea for how mask and category sampling would look following this logic.
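A minimal sketch of that split, assuming the class and method names from the bullets above (the constructor arguments and exact yield types are my guesses, not the PR's actual code):

```python
from abc import ABC, abstractmethod
from collections.abc import Iterator


class Sampler(ABC):
    """The Loader consumes batches of slices from any Sampler."""

    @abstractmethod
    def __iter__(self) -> Iterator[list[slice]]: ...


class ChunkAlignedSampler(Sampler):
    """Subclasses only implement _iter_chunk_batches, yielding chunk ids
    plus at most two unaligned slices; __iter__ turns ids into slices,
    so any child is chunk-aligned by construction."""

    def __init__(self, chunk_size: int, n_obs: int):
        self.chunk_size = chunk_size
        self.n_obs = n_obs

    @abstractmethod
    def _iter_chunk_batches(self) -> Iterator[tuple[list[int], list[slice]]]: ...

    def __iter__(self) -> Iterator[list[slice]]:
        for chunk_ids, extra in self._iter_chunk_batches():
            assert len(extra) <= 2, "at most two unaligned fragments per batch"
            yield [
                slice(i * self.chunk_size, min((i + 1) * self.chunk_size, self.n_obs))
                for i in chunk_ids
            ] + extra
```

A child class then only decides which chunk ids go into each batch; alignment is enforced in one place.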

Stuff I haven't paid attention to much:

  • RNG generator storage: we shuffle chunks in the sampler, but do we need to use the same generator to shuffle the batches themselves?
  • worker logic: mostly ignored for now
  • same for preload_nchunks; haven't given it much thought
  • shuffle default value (should it be optional like torch?)
  • also, the unit tests were vibe-coded and not checked yet

ping @ilan-gold, @felix0097 since it's a draft PR

@codecov

codecov bot commented Dec 17, 2025

Codecov Report

❌ Patch coverage is 94.65241% with 10 lines in your changes missing coverage. Please review.
✅ Project coverage is 92.00%. Comparing base (76ba1a1) to head (7237023).
⚠️ Report is 1 commit behind head on main.

Files with missing lines Patch % Lines
src/annbatch/loader.py 93.33% 4 Missing ⚠️
src/annbatch/samplers/_chunk_sampler.py 95.23% 4 Missing ⚠️
src/annbatch/utils.py 90.90% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #101      +/-   ##
==========================================
- Coverage   92.45%   92.00%   -0.45%     
==========================================
  Files           6       10       +4     
  Lines         636      738     +102     
==========================================
+ Hits          588      679      +91     
- Misses         48       59      +11     
Files with missing lines Coverage Δ
src/annbatch/__init__.py 100.00% <100.00%> (ø)
src/annbatch/abc/__init__.py 100.00% <100.00%> (ø)
src/annbatch/abc/sampler.py 100.00% <100.00%> (ø)
src/annbatch/samplers/__init__.py 100.00% <100.00%> (ø)
src/annbatch/types.py 100.00% <100.00%> (ø)
src/annbatch/utils.py 84.61% <90.90%> (-2.20%) ⬇️
src/annbatch/loader.py 90.32% <93.33%> (-1.67%) ⬇️
src/annbatch/samplers/_chunk_sampler.py 95.23% <95.23%> (ø)

@giovp
Member

giovp commented Dec 17, 2025

this looks great @selmanozleyen, just chiming in because I was working on something tangentially related that could land here as well: #100. Basically the idea of my PR is to implement a DistributedSampler-like strategy, but for iterable loaders. I think it would be great if, instead of being standalone, it lands as a Sampler implementation.

More generally, in my opinion the most useful "samplers" that I would personally use straight away are: DistributedSampler and WeightedSampler.

@selmanozleyen
Member Author

@giovp thanks for the PR, I saw it and still have it in mind. I will read it once I know this one is good enough; then I plan to also incorporate your changes.

Collaborator

@ilan-gold ilan-gold left a comment


I think the base class makes sense.

I don't really get _chunk_ids_to_slices - why not just only return slices?

And why do we need to check unaligned slices at all? What is the failure case if we don't? It seems like too much. This would simplify the class structure since I think ChunkAlignedSampler would not need to exist.

I'm also not clear what "chunk aligned" means - the on-disk sparse data doesn't really have a useful concept of chunking, so what is being aligned?

Nice start!

@selmanozleyen
Member Author

selmanozleyen commented Dec 18, 2025

I'm also not clear what "chunk aligned" means - the on-disk sparse data doesn't really have a useful concept of chunking, so what is being aligned?

oh ok, for the sparse case this whole chunk-aligned concept might be useless then. So should I just revert? But then I don't understand why we would initially shuffle only chunks in the original implementation

@ilan-gold
Collaborator

oh ok, for the sparse case this whole chunk-aligned concept might be useless then. So should I just revert? But then I don't understand why we would initially shuffle only chunks in the original implementation

You can think of our concept of chunks as entirely virtual and divorced from on-disk chunks: a chunk = a contiguous slice of obs to fetch.
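That virtual-chunk definition is just arithmetic; a sketch (not the PR's actual helper):

```python
def chunk_ids_to_slices(chunk_ids, chunk_size, n_obs):
    """A "chunk" here is purely virtual: chunk i is the contiguous run of
    obs rows [i * chunk_size, (i + 1) * chunk_size), clipped at n_obs."""
    return [
        slice(i * chunk_size, min((i + 1) * chunk_size, n_obs))
        for i in chunk_ids
    ]
```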

@selmanozleyen selmanozleyen marked this pull request as ready for review December 20, 2025 13:55
@selmanozleyen
Member Author

@ilan-gold yes, we can get rid of nobs. But I don't get why we'd want sampler.sample(nobs) since start and end are already given.

@ilan-gold
Collaborator

I'm not saying I'm married to this - I still don't think samplers even need an awareness (aside from start/end) of the "size" of the dataset.

I was saying I don't know why anyone would need this information, but if they were to need it, it's a runtime problem (relative to construction of the Loader) because you can keep adding datasets and thus n_obs can change. I agree there's probably no need for it. We should cross that bridge when/if we get there, but in the meantime, try to avoid it.

@selmanozleyen
Member Author

This PR has gotten very big, so I won't include the categorical sampling and distributed sampling here. Once you approve of the SliceSampler, LoaderBuilder, and Loader, I suggest we either merge this to main or I create another branch on top of this one, just to make reviews easier.

Collaborator

@ilan-gold ilan-gold left a comment


I agree this is getting big, so I would probably limit the refactoring to what is purely necessary and focus only on the Sampler and slotting it into the current implementation (i.e., what is on main), which I think is doable. Separately, we can do a refactor PR. In general I would aim to make the PR as small as possible! But great that things are taking shape now :) Good to see all tests pass, so it seems like the basics are functional.

Collaborator

@ilan-gold ilan-gold left a comment


Nice! Well on our way now :)

@selmanozleyen
Member Author

@ilan-gold before I ask for a full review and polish, let me recap the things I think we aren't on the same page about:

  • Where should drop_last work, in the Sampler or in the Loader? There is no defined way to handle drop_last in IterableDataset since it's abstract, but I should point out that the fetchers of IterableDataset handle drop_last in torch.DataLoader, not in the dataset. That handling is almost meaningless in our case, as I demonstrated in #101 (comment). Furthermore, torch's DataLoader raises an error if batch_size != 1 and batch_sampler is not None. Since Loader will always have a batch_sampler within it, I think it's reasonable to expect users to call torch.DataLoader with batch_size=1 as the default behavior. So drop_last is a no-op in torch.DataLoader when annbatch.Loader is given, which means we should handle it ourselves. I think the common torch-like pattern is to handle it in a default sampler and error if mutually exclusive arguments are given.

  • I also want to clarify a behavior difference. We used to carry over leftovers even when preload_size % batch_size == 0, because the last, incomplete chunk would often end up in the middle when loading:

    min(self.n_obs, (index + 1) * self._chunk_size),

    (this is only true when shuffle is true). That leads to incomplete batches, which leads to carrying over, as I explained in #101 (comment).

  • About set_sampler: I think we are on the same page, but the solution still needs some back-and-forth.
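To illustrate the leftover point above: if per-chunk slices are shuffled and their rows flattened before batching, only the final batch can ever be short, no matter where the incomplete chunk lands, and drop_last reduces to dropping that one batch. A sketch with hypothetical names, not the PR's code:

```python
import random


def iter_batches(chunk_slices, batch_size, drop_last, seed=0):
    """Shuffle per-chunk slices, flatten their row indices, and emit
    fixed-size batches; only the final batch can be short, and
    drop_last decides whether that short batch is yielded at all."""
    rng = random.Random(seed)
    slices = list(chunk_slices)
    rng.shuffle(slices)
    rows = [i for s in slices for i in range(s.start, s.stop)]
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        if len(batch) < batch_size and drop_last:
            return
        yield batch
```

Even if the short chunk (here slice(8, 10)) is shuffled into the middle, flattening carries its rows into full batches.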

@ilan-gold
Collaborator

ilan-gold commented Jan 7, 2026

For the first drop_last point, I don't think it's clear even what our end goal for the torch.utils.data.DataLoader integration is - we may end up only supporting DataLoader(batch_size=None, ...) i.e., pure IterableDataset yielding (although no idea how to enforce this) in which case I think we're golden modulo some worker handling maybe. So I would not worry about the interaction between torch.utils.data.DataLoader and drop_last especially since it is currently untested. Just make sure our Loader + drop_last interact correctly for which there should already be tests :)

I also want to clarify the behavior difference. We used to still carry over leftovers even when preload_size % batch_size == 0 because often the last chunk that is incomplete would get in the middle when loading

Ah, great catch, yes, you're absolutely right, so our tests are somewhat wrong. The last "incomplete" batch slice is shuffled in midway. In this case, why don't we change the way slices are generated/shuffled a bit to ensure the following, which I think will both simplify the sampler code and inject more randomness:

  1. Until the very last batch, preload_size % batch_size == 0 is always true i.e., the amount of data loaded from disk up until the final iteration obeys this constraint. This way, there is no leftover to worry about with SliceSampler
  2. The "shorter" slice is allowed to be located in the middle of the on-disk datasets so that this shorter slice is not always the same one (i.e., the truncated end of the dataset). This will probably require a bit of new code. Actually, if you want to really go for broke, you could implement a wraparound of the short slice and allow it to be at the beginning, or broken up between the start and end. This would increase our randomness a bit, which would be nice, so you could have partitions of an n_obs=259, slice_size=64 on-disk loader as [slice(0, 2), slice(2, 66), slice(66, 130), slice(130, 194), slice(194, 258), slice(258, 259)] or [slice(0, 64), slice(64, 67), slice(67, 131), slice(131, 195), slice(195, 259)] or [slice(0, 64), slice(64, 128), slice(128, 192), slice(192, 256), slice(256, 259)]. Then you shuffle only the slice_size slices and put the shorter ones at the end.

Do these two invariants sound reasonable?

EDIT: I did a strikethrough of two things. I think the current behavior is correct and uses the leftover handling to deal with the slice that is shuffled in. So I really think the above two points are the way to go. I added #106 to make sure the behavior is expected.
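Point 2 above (a movable short slice) could be sketched like this; partition and short_pos are hypothetical names, and the wraparound variant would need one extra parameter to split the remainder between head and tail:

```python
def partition(n_obs, slice_size, short_pos):
    """Split range(n_obs) into contiguous slices of slice_size, with the
    single remainder slice placed at index short_pos (0..n_full) instead
    of always sitting at the truncated end of the dataset."""
    n_full, rem = divmod(n_obs, slice_size)
    sizes = [slice_size] * n_full
    if rem:
        sizes.insert(short_pos, rem)
    out, start = [], 0
    for sz in sizes:
        out.append(slice(start, start + sz))
        start += sz
    return out
```

With n_obs=259 and slice_size=64, short_pos=1 reproduces the mid-dataset example above and short_pos=4 the trailing one.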

@selmanozleyen
Member Author

selmanozleyen commented Jan 7, 2026

Those two sound reasonable; in fact I tried to implement that but didn't have time to make sure it worked well with the tests. Basically the same idea you got: the incomplete slice would also be randomly selected.

Collaborator

@ilan-gold ilan-gold left a comment


Nice!

@selmanozleyen
Member Author

thanks @ilan-gold again. The only thing unclear, I think, is here: #101 (comment), but I wrote a response.

sorry about the tests, I admit I overlooked the "vibe-check" on them :D. They should be good now; if there is something wrong there, they are organic, hand-made blunders

@selmanozleyen
Member Author

As a note for the future: the sampler doesn't actually have to deal with slices; it could just give the start points of the chunks and the size of the last chunk. Maybe one can do more optimized stuff using this, maybe not, just pointing it out.
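That alternative representation could look like this (hypothetical helper, not part of the PR):

```python
def starts_to_slices(starts, chunk_size, last_size):
    """Rebuild slices from chunk start offsets; only the final chunk's
    size can differ from chunk_size, so it is carried separately."""
    if not starts:
        return []
    slices = [slice(s, s + chunk_size) for s in starts[:-1]]
    slices.append(slice(starts[-1], starts[-1] + last_size))
    return slices
```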

Collaborator

@ilan-gold ilan-gold left a comment


Just small things now!

Collaborator

@ilan-gold ilan-gold left a comment


Ok, this looks good! Great stuff - how do you want to proceed here? Do you want to try implementing a new sampler off this branch, or merge this as-is?

@selmanozleyen
Member Author

Nicee!! I'd like to merge this as is if you don't mind. It's already big enough

Collaborator

@ilan-gold ilan-gold left a comment


Nicee!! I'd like to merge this as is if you don't mind. It's already big enough

Ok!

@ilan-gold ilan-gold merged commit 2c14c73 into scverse:main Jan 22, 2026
9 checks passed


Successfully merging this pull request may close these issues.

Implement Sampling Strategies API
