
feat: Sampler interface and ability to add custom samplers#101

Merged
ilan-gold merged 58 commits into scverse:main from selmanozleyen:feat/sampler
Jan 22, 2026

Conversation

@selmanozleyen
Member

@selmanozleyen selmanozleyen commented Dec 17, 2025

fixes: #43

Not ready for review, just wanted an initial blessing or feedback in case your eye caught something.

Key points for my main design proposal for the interaction with Loader and Sampler:

  • Loader always takes a Sampler[list[slice]]
  • ChunkAlignedSampler(Sampler[list[slice]]) expects an implementation of _iter_chunk_batches, which returns the chunk ids and at most two slices
  • ChunkAlignedSampler is a subclass of Sampler that already has __iter__ implemented; it uses _iter_chunk_batches and converts the chunk ids into slices
  • This way a child of ChunkAlignedSampler is always easily validated to be chunk aligned for our loader.

Why at most two slices per batch? Because one can always rearrange fragmentations so that a batch has at most 2 unaligned chunk slices. I already have an idea for how mask and category sampling would look following this logic.
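A minimal sketch of that split, assuming the class and method names from the bullets above (the constructor arguments and exact yield types are my guesses, not the PR's actual code):

```python
from abc import ABC, abstractmethod
from collections.abc import Iterator


class Sampler(ABC):
    """The Loader consumes batches of slices from any Sampler."""

    @abstractmethod
    def __iter__(self) -> Iterator[list[slice]]: ...


class ChunkAlignedSampler(Sampler):
    """Subclasses only implement _iter_chunk_batches, yielding chunk ids
    plus at most two unaligned slices; __iter__ turns ids into slices,
    so any child is chunk-aligned by construction."""

    def __init__(self, chunk_size: int, n_obs: int):
        self.chunk_size = chunk_size
        self.n_obs = n_obs

    @abstractmethod
    def _iter_chunk_batches(self) -> Iterator[tuple[list[int], list[slice]]]: ...

    def __iter__(self) -> Iterator[list[slice]]:
        for chunk_ids, extra in self._iter_chunk_batches():
            assert len(extra) <= 2, "at most two unaligned fragments per batch"
            yield [
                slice(i * self.chunk_size, min((i + 1) * self.chunk_size, self.n_obs))
                for i in chunk_ids
            ] + extra
```

A child class then only decides which chunk ids go into each batch; alignment is enforced in one place.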

Stuff I haven't paid attention to much:

  • RNG generator storage: we shuffle chunks in the sampler, but do we need to use the same generator to shuffle the batches themselves?
  • worker logic: mostly ignored for now
  • same for preload_nchunks; haven't given it much thought
  • shuffle default value (should it be optional like torch?)
  • also, the unit tests were vibe-coded and not checked yet

ping @ilan-gold, @felix0097 since it's a draft PR

@codecov

codecov bot commented Dec 17, 2025

Codecov Report

❌ Patch coverage is 94.65241% with 10 lines in your changes missing coverage. Please review.
✅ Project coverage is 92.00%. Comparing base (76ba1a1) to head (7237023).
⚠️ Report is 1 commit behind head on main.

Files with missing lines Patch % Lines
src/annbatch/loader.py 93.33% 4 Missing ⚠️
src/annbatch/samplers/_chunk_sampler.py 95.23% 4 Missing ⚠️
src/annbatch/utils.py 90.90% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #101      +/-   ##
==========================================
- Coverage   92.45%   92.00%   -0.45%     
==========================================
  Files           6       10       +4     
  Lines         636      738     +102     
==========================================
+ Hits          588      679      +91     
- Misses         48       59      +11     
Files with missing lines Coverage Δ
src/annbatch/__init__.py 100.00% <100.00%> (ø)
src/annbatch/abc/__init__.py 100.00% <100.00%> (ø)
src/annbatch/abc/sampler.py 100.00% <100.00%> (ø)
src/annbatch/samplers/__init__.py 100.00% <100.00%> (ø)
src/annbatch/types.py 100.00% <100.00%> (ø)
src/annbatch/utils.py 84.61% <90.90%> (-2.20%) ⬇️
src/annbatch/loader.py 90.32% <93.33%> (-1.67%) ⬇️
src/annbatch/samplers/_chunk_sampler.py 95.23% <95.23%> (ø)

@giovp
Member

giovp commented Dec 17, 2025

this looks great @selmanozleyen, just chiming in because I was working on something tangentially related that could land here as well: #100. Basically the idea of my PR is to implement a DistributedSampler-like strategy, but for iterable loaders. I think it would be great if, instead of being standalone, it lands as a Sampler implementation.

More generally, in my opinion the most useful "samplers" that I would personally use straight away are: DistributedSampler and WeightedSampler.

@selmanozleyen
Member Author

@giovp thanks for the PR, I saw it and still have it in mind. I will read it once I know this one is good enough; then I plan to also incorporate your changes.

Collaborator

@ilan-gold ilan-gold left a comment


I think the base class makes sense.

I don't really get _chunk_ids_to_slices - why not just only return slices?

And why do we need to check unaligned slices at all? What is the failure case if we don't? It seems like too much. This would simplify the class structure since I think ChunkAlignedSampler would not need to exist.

I'm also not clear what "chunk aligned" means - the on-disk sparse data doesn't really have a useful concept of chunking, so what is being aligned?

Nice start!

@selmanozleyen
Member Author

selmanozleyen commented Dec 18, 2025

I'm also not clear what "chunk aligned" means - the on-disk sparse data doesn't really have a useful concept of chunking, so what is being aligned?

oh ok, for the sparse case this whole chunk-aligned concept might be useless then. So should I just revert? But then I don't understand why we would initially shuffle only chunks in the original implementation

@ilan-gold
Collaborator

oh ok, for the sparse case this whole chunk-aligned concept might be useless then. So should I just revert? But then I don't understand why we would initially shuffle only chunks in the original implementation

You can think of our concept of chunks as entirely virtual and divorced from on-disk chunks: a chunk = a contiguous slice of obs to fetch.
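That virtual-chunk definition is just arithmetic; a sketch (not the PR's actual helper):

```python
def chunk_ids_to_slices(chunk_ids, chunk_size, n_obs):
    """A "chunk" here is purely virtual: chunk i is the contiguous run of
    obs rows [i * chunk_size, (i + 1) * chunk_size), clipped at n_obs."""
    return [
        slice(i * chunk_size, min((i + 1) * chunk_size, n_obs))
        for i in chunk_ids
    ]
```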

@selmanozleyen selmanozleyen marked this pull request as ready for review December 20, 2025 13:55
@selmanozleyen
Member Author

@ilan-gold yes, we can get rid of nobs. But I don't get why we'd want sampler.sample(nobs) since start and end are already given.

@ilan-gold
Collaborator

I'm not saying I'm married to this - I still don't think samplers even need an awareness (aside from start/end) of the "size" of the dataset.

I was saying I don't know why anyone would need this information, but if they were to need it, it's a runtime problem (relative to construction of the Loader) because you can keep adding datasets and thus n_obs can change. I agree there's probably no need for it. We should cross that bridge when/if we get there, but in the meantime, try to avoid it.

@selmanozleyen
Member Author

This PR has gotten very big, so I won't include the categorical sampling and distributed sampling here. Once you approve of the SliceSampler, LoaderBuilder, and Loader, I suggest we either merge this to main or I create another branch on top of this one, just to make reviews easier.

Collaborator

@ilan-gold ilan-gold left a comment


I agree this is getting big, so I would probably limit the refactoring to what is purely necessary and focus only on the Sampler and slotting it into the current implementation (i.e., what is on main), which I think is doable. Separately, we can do a refactor PR. In general I would aim to make the PR as small as possible! But great that things are taking shape now :) Good to see all tests pass, so it seems like the basics are functional.

Collaborator

@ilan-gold ilan-gold left a comment


Nice! Well on our way now :)

@selmanozleyen
Member Author

@ilan-gold before I ask for a full review and polish, let me recap the things I think we aren't on the same page about:

  • Where should drop_last work, in the Sampler or in the Loader? There is no defined way to handle drop_last in IterableDataset since it's abstract, but I should point out that the fetchers of IterableDataset handle drop_last in torch.DataLoader, not in the dataset. That handling is almost meaningless in our case, as I demonstrated in #101 (comment). Furthermore, torch's DataLoader raises an error if batch_size != 1 and batch_sampler is not None. Since Loader will always have a batch_sampler within it, I think it's reasonable to expect users to call torch.DataLoader with batch_size=1 as the default behavior. So drop_last is a no-op in torch.DataLoader when annbatch.Loader is given, which means we should handle it ourselves. I think the common torch-like pattern is to handle it in a default sampler and error if mutually exclusive arguments are given.

  • I also want to clarify a behavior difference. We used to carry over leftovers even when preload_size % batch_size == 0, because the last, incomplete chunk would often end up in the middle when loading:

    min(self.n_obs, (index + 1) * self._chunk_size),

    (this is only true when shuffle is true). That leads to incomplete batches, which leads to carrying over, as I explained in #101 (comment).

  • About set_sampler: I think we are on the same page, but the solution still needs some back-and-forth.
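To illustrate the leftover point above: if per-chunk slices are shuffled and their rows flattened before batching, only the final batch can ever be short, no matter where the incomplete chunk lands, and drop_last reduces to dropping that one batch. A sketch with hypothetical names, not the PR's code:

```python
import random


def iter_batches(chunk_slices, batch_size, drop_last, seed=0):
    """Shuffle per-chunk slices, flatten their row indices, and emit
    fixed-size batches; only the final batch can be short, and
    drop_last decides whether that short batch is yielded at all."""
    rng = random.Random(seed)
    slices = list(chunk_slices)
    rng.shuffle(slices)
    rows = [i for s in slices for i in range(s.start, s.stop)]
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        if len(batch) < batch_size and drop_last:
            return
        yield batch
```

Even if the short chunk (here slice(8, 10)) is shuffled into the middle, flattening carries its rows into full batches.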

@ilan-gold
Collaborator

ilan-gold commented Jan 7, 2026

For the first drop_last point, I don't think it's clear even what our end goal for the torch.utils.data.DataLoader integration is - we may end up only supporting DataLoader(batch_size=None, ...) i.e., pure IterableDataset yielding (although no idea how to enforce this) in which case I think we're golden modulo some worker handling maybe. So I would not worry about the interaction between torch.utils.data.DataLoader and drop_last especially since it is currently untested. Just make sure our Loader + drop_last interact correctly for which there should already be tests :)

I also want to clarify the behavior difference. We used to still carry over leftovers even when preload_size % batch_size == 0 because often the last chunk that is incomplete would get in the middle when loading

Ah, great catch, yes, you're absolutely right, so our tests are somewhat wrong. The last "incomplete" batch slice is shuffled in midway. In this case, why don't we change the way slices are generated/shuffled a bit to ensure the following, which I think will both simplify the sampler code and inject more randomness:

  1. Until the very last batch, preload_size % batch_size == 0 is always true i.e., the amount of data loaded from disk up until the final iteration obeys this constraint. This way, there is no leftover to worry about with SliceSampler
  2. The "shorter" slice is allowed to be located in the middle of the on-disk datasets so that this shorter slice is not always the same one (i.e., the truncated end of the dataset). This will probably require a bit of new code. Actually, if you want to really go for broke, you could implement a wraparound of the short slice and allow it to be at the beginning, or broken up between the start and end. This would increase our randomness a bit, which would be nice, so you could have partitions of an n_obs=259, slice_size=64 on-disk loader as [slice(0, 2), slice(2, 66), slice(66, 130), slice(130, 194), slice(194, 258), slice(258, 259)] or [slice(0, 64), slice(64, 67), slice(67, 131), slice(131, 195), slice(195, 259)] or [slice(0, 64), slice(64, 128), slice(128, 192), slice(192, 256), slice(256, 259)]. Then you shuffle only the slice_size slices and put the shorter ones at the end.

Do these two invariants sound reasonable?

EDIT: I did a strikethrough of two things. I think the current behavior is correct and uses the leftover handling to deal with the slice that is shuffled in. So I really think the above two points are the way to go. I added #106 to make sure the behavior is expected.
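Point 2 above (a movable short slice) could be sketched like this; partition and short_pos are hypothetical names, and the wraparound variant would need one extra parameter to split the remainder between head and tail:

```python
def partition(n_obs, slice_size, short_pos):
    """Split range(n_obs) into contiguous slices of slice_size, with the
    single remainder slice placed at index short_pos (0..n_full) instead
    of always sitting at the truncated end of the dataset."""
    n_full, rem = divmod(n_obs, slice_size)
    sizes = [slice_size] * n_full
    if rem:
        sizes.insert(short_pos, rem)
    out, start = [], 0
    for sz in sizes:
        out.append(slice(start, start + sz))
        start += sz
    return out
```

With n_obs=259 and slice_size=64, short_pos=1 reproduces the mid-dataset example above and short_pos=4 the trailing one.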

@selmanozleyen
Member Author

selmanozleyen commented Jan 7, 2026

Those two sound reasonable; in fact I tried to implement that but didn't have time to make sure it worked well with the tests. Basically the same idea you got: the incomplete slice would also be randomly selected.

Collaborator

@ilan-gold ilan-gold left a comment


Nice!

@selmanozleyen
Member Author

thanks @ilan-gold again. The only thing unclear, I think, is here: #101 (comment), but I wrote a response.

sorry about the tests, I admit I overlooked the "vibe-check" on them :D. They should be good now; if there is something wrong there, they are organic, hand-made blunders

@selmanozleyen
Member Author

As a note for the future: the sampler doesn't actually have to deal with slices; it could just give the start points of the chunks and the size of the last chunk. Maybe one can do more optimized stuff using this, maybe not, just pointing it out.
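That alternative representation could look like this (hypothetical helper, not part of the PR):

```python
def starts_to_slices(starts, chunk_size, last_size):
    """Rebuild slices from chunk start offsets; only the final chunk's
    size can differ from chunk_size, so it is carried separately."""
    if not starts:
        return []
    slices = [slice(s, s + chunk_size) for s in starts[:-1]]
    slices.append(slice(starts[-1], starts[-1] + last_size))
    return slices
```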

Collaborator

@ilan-gold ilan-gold left a comment


Just small things now!

Collaborator

@ilan-gold ilan-gold left a comment


Ok, this looks good! Great stuff - how do you want to proceed here? Do you want to try implementing a new sampler off this branch, or merge this as-is?

@selmanozleyen
Member Author

Nicee!! I'd like to merge this as is if you don't mind. It's already big enough

Collaborator

@ilan-gold ilan-gold left a comment


Nicee!! I'd like to merge this as is if you don't mind. It's already big enough

Ok!

@ilan-gold ilan-gold merged commit 2c14c73 into scverse:main Jan 22, 2026
9 checks passed


Successfully merging this pull request may close these issues.

Implement Sampling Strategies API
