feat: ChunkSampler with replacement when n_iters is set #143

Open
selmanozleyen wants to merge 27 commits into scverse:main from selmanozleyen:feat/chunksampler-with-replacement

Conversation

@selmanozleyen
Member

@selmanozleyen selmanozleyen commented Feb 10, 2026

We decided it would be sensible to add with-replacement sampling here so that work on the categorical sampler can continue: #119 (comment)

@selmanozleyen selmanozleyen self-assigned this Feb 10, 2026
@codecov

codecov bot commented Feb 10, 2026

Codecov Report

❌ Patch coverage is 96.96970% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 91.88%. Comparing base (d47240f) to head (729809f).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/annbatch/samplers/_chunk_sampler.py | 96.87% | 2 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #143      +/-   ##
==========================================
- Coverage   93.71%   91.88%   -1.83%     
==========================================
  Files          11       11              
  Lines         811      863      +52     
==========================================
+ Hits          760      793      +33     
- Misses         51       70      +19     
| Files with missing lines | Coverage Δ |
|---|---|
| src/annbatch/loader.py | 89.33% <100.00%> (-3.29%) ⬇️ |
| src/annbatch/samplers/_chunk_sampler.py | 94.92% <96.87%> (+1.74%) ⬆️ |

... and 2 files with indirect coverage changes


Comment on lines +38 to +41
    Must be ``False`` when ``n_iters`` is set.
n_iters
    If set, enables with-replacement sampling for exactly this many
    batches instead of epoch-based iteration.
Collaborator

It probably makes sense to make this explicit, i.e., n_iters can override the without-replacement sampling behavior, drop_last should be renamed to drop_undersized (we can keep drop_last in the top-level Loader API), and then we add a with_replacement arg.

Member Author

Then with_replacement would need to be None by default in the signature so that it resolves to True when n_iters is set.

Comment on lines +183 to +187
last = chunk_pool[-1]
if last.stop - last.start < self._chunk_size:
    new_stop = min(last.start + self._chunk_size, stop)
    new_start = new_stop - self._chunk_size
    chunk_pool[-1] = slice(new_start, new_stop)
Collaborator

And then this just becomes a question of whether or not to drop the last chunk, which is also the smallest one.
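For readers following the thread, here is a minimal sketch of what the snippet under review appears to do, next to the simpler alternative the comment suggests. The helper name `build_chunk_pool`, its parameters, and the way the pool is first constructed are assumptions made for illustration; they are not taken from the annbatch source.

```python
def build_chunk_pool(start: int, stop: int, chunk_size: int, drop_undersized: bool = False) -> list[slice]:
    """Split [start, stop) into chunk slices (hypothetical helper, not the annbatch API)."""
    if stop - start < chunk_size:
        raise ValueError("range is smaller than a single chunk")
    chunk_pool = [slice(s, min(s + chunk_size, stop)) for s in range(start, stop, chunk_size)]
    last = chunk_pool[-1]
    if last.stop - last.start < chunk_size:
        if drop_undersized:
            # Reviewer's alternative: simply drop the last (smallest) chunk.
            chunk_pool.pop()
        else:
            # Behavior of the snippet under review: shift the last chunk back
            # so it is full-sized, overlapping the previous chunk.
            new_stop = min(last.start + chunk_size, stop)
            new_start = new_stop - chunk_size
            chunk_pool[-1] = slice(new_start, new_stop)
    return chunk_pool


print(build_chunk_pool(0, 10, 4))
# [slice(0, 4, None), slice(4, 8, None), slice(6, 10, None)]
print(build_chunk_pool(0, 10, 4, drop_undersized=True))
# [slice(0, 4, None), slice(4, 8, None)]
```

In the shifted variant, observations 6 and 7 appear in two chunks, which only matters if the pool is also used for without-replacement sampling.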

Comment on lines +197 to +200
num_workers = worker_handle.num_workers
worker_id = worker_handle._worker_info.id
base, remainder = divmod(n_iters, num_workers)
n_iters = base + (1 if worker_id < remainder else 0)
Collaborator

I'm very tempted to just error here, i.e., we only support torch in the basic chunking use case. I don't think we should be encouraging people to do this, but @felix0097, I'm curious what you think.
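As a point of reference for this discussion, the `divmod` split in the snippet above can be sketched in isolation. The function name `iters_for_worker` is hypothetical; only the arithmetic is taken from the diff.

```python
def iters_for_worker(n_iters: int, num_workers: int, worker_id: int) -> int:
    """Share n_iters across workers; the first `remainder` workers get one extra."""
    base, remainder = divmod(n_iters, num_workers)
    return base + (1 if worker_id < remainder else 0)


per_worker = [iters_for_worker(10, 4, worker_id) for worker_id in range(4)]
print(per_worker)       # [3, 3, 2, 2]
print(sum(per_worker))  # 10
```

The per-worker counts always sum to the requested `n_iters`, so the total number of batches is preserved regardless of the worker count.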

    If ``None``, it's set to ``True`` if ``n_iters`` is provided, otherwise ``False``.
n_iters
    Number of batches to yield. Required when ``with_replacement`` is ``True``.
    Can't be provided if ``with_replacement`` is ``False``.
Collaborator

This isn't what I meant, sorry for being unclear. If n_iters is passed in, we just use that instead of deriving the epoch length from n_obs. This way, there's no "default" behavior switching, i.e., the sentence "If ``None``, it's set to True if ``n_iters`` is provided, otherwise False." is unnecessary. To highlight:

  1. If n_iters is set, that number of iterations is done, independent of with_replacement.
  2. with_replacement=True means that n_iters has to be set.
  3. with_replacement=False can use n_iters if it's set.

Does this make sense? I've proposed three changes that highlight this logic
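The three rules above can be sketched as two small functions. The names `validate_sampling_args` and `resolve_n_iters` are hypothetical and the epoch length of 25 is a made-up value; this is an illustration of the proposed logic, not the code in the PR.

```python
from typing import Optional


def validate_sampling_args(with_replacement: bool = False, n_iters: Optional[int] = None) -> None:
    """Rule 2: with-replacement sampling has no natural epoch, so n_iters must be given."""
    if with_replacement and n_iters is None:
        raise ValueError("n_iters is required when with_replacement is True.")


def resolve_n_iters(n_iters: Optional[int], possible_n_iters: int) -> int:
    """Rules 1 and 3: an explicit n_iters wins; otherwise use the epoch length."""
    return possible_n_iters if n_iters is None else n_iters


validate_sampling_args(with_replacement=False, n_iters=None)  # ok: epoch-based iteration
validate_sampling_args(with_replacement=False, n_iters=10)    # ok: rule 3
validate_sampling_args(with_replacement=True, n_iters=10)     # ok: rule 2 satisfied
print(resolve_n_iters(None, 25))  # 25
print(resolve_n_iters(10, 25))    # 10
```

Under this scheme `with_replacement` can default to a plain `False`, since nothing has to be inferred from whether `n_iters` was passed.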

Member Author

I am not sure, as sampling without replacement would also mean that there are no duplicates in what you see, right? If you think it's fine, then I will do it. I thought of just supporting truncation, but then we would need to error when possible_n_iters < n_iters.

Collaborator

> I am not sure, as sampling without replacement would also mean that there are no duplicates in what you see, right? If you think it's fine, then I will do it. I thought of just supporting truncation, but then we would need to error when possible_n_iters < n_iters.

This seems reasonable - I hadn't considered it, but it could just be part of validation to check n_iters against n_obs if sampling without replacement.
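A minimal sketch of such a validation step, assuming the epoch length is computed as in the existing `n_iters` code (floor division when dropping the last batch, ceiling otherwise). The function names and the error message are hypothetical.

```python
import math
from typing import Optional


def possible_n_iters(n_obs: int, batch_size: int, drop_last: bool = False) -> int:
    """Number of batches one pass over n_obs observations can yield."""
    return n_obs // batch_size if drop_last else math.ceil(n_obs / batch_size)


def check_n_iters(
    n_iters: Optional[int],
    n_obs: int,
    batch_size: int,
    with_replacement: bool,
    drop_last: bool = False,
) -> None:
    """Reject an n_iters that a without-replacement pass cannot satisfy."""
    if n_iters is None or with_replacement:
        return  # nothing to check: epoch-based, or duplicates are allowed
    limit = possible_n_iters(n_obs, batch_size, drop_last)
    if n_iters > limit:
        raise ValueError(
            f"n_iters={n_iters} exceeds the {limit} batches available without replacement."
        )


check_n_iters(3, n_obs=10, batch_size=4, with_replacement=False)   # ok: truncates or matches the epoch
check_n_iters(50, n_obs=10, batch_size=4, with_replacement=True)   # ok: duplicates allowed
```

With this check in place, passing `n_iters` without replacement only ever truncates an epoch and never silently repeats observations.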

start, stop = self._mask.start or 0, self._mask.stop or n_obs
total_obs = stop - start
return total_obs // self._batch_size if self._drop_last else math.ceil(total_obs / self._batch_size)
return self._possible_n_iters(n_obs) if not self._with_replacement else self._n_iters
Collaborator

Then this line is clearer because you wouldn't rely on _with_replacement but instead just _n_iters

Suggested change
- return self._possible_n_iters(n_obs) if not self._with_replacement else self._n_iters
+ return self._possible_n_iters(n_obs) if self._n_iters is None else self._n_iters

Comment on lines +98 to +104
if with_replacement is None:
    with_replacement = n_iters is not None
if with_replacement and n_iters is None:
    raise ValueError("n_iters is required when with_replacement is True.")
if not with_replacement and n_iters is not None:
    raise ValueError("n_iters is only supported when with_replacement is True.")
self._n_iters, self._with_replacement = n_iters, with_replacement
Collaborator

Suggested change
- if with_replacement is None:
-     with_replacement = n_iters is not None
- if with_replacement and n_iters is None:
-     raise ValueError("n_iters is required when with_replacement is True.")
- if not with_replacement and n_iters is not None:
-     raise ValueError("n_iters is only supported when with_replacement is True.")
- self._n_iters, self._with_replacement = n_iters, with_replacement
+ if with_replacement and n_iters is None:
+     raise ValueError("n_iters is required when with_replacement is True.")
+ self._n_iters, self._with_replacement = n_iters, with_replacement

shuffle: bool = False,
drop_last: bool = False,
drop_undersized: bool = False,
with_replacement: bool | None = None,
Collaborator

Suggested change
- with_replacement: bool | None = None,
+ with_replacement: bool = False,

2 participants