feat: Sampler interface and ability to add custom samplers #101
ilan-gold merged 58 commits into scverse:main from
Conversation
Codecov Report: ❌ Patch coverage is
@@ Coverage Diff @@
## main #101 +/- ##
==========================================
- Coverage 92.45% 92.00% -0.45%
==========================================
Files 6 10 +4
Lines 636 738 +102
==========================================
+ Hits 588 679 +91
- Misses 48 59 +11
This looks great @selmanozleyen! Just chiming in because I was working on something tangentially related that could land here as well: #100. The basic idea of my PR is to implement the DistributedSampler-like strategy, but for iterable loaders. I think it would be great if, instead of being standalone, it lands into a
More generally, in my opinion the most useful "samplers" that I would personally use straight away are:
@giovp thanks! Thanks for the PR, I saw it. I still have it in mind. I will read your PR once I get an idea of whether this one is good enough. Then I plan to also incorporate your changes.
ilan-gold
left a comment
I think the base class makes sense.
I don't really get _chunk_ids_to_slices - why not just only return slices?
And why do we need to check unaligned slices at all? What is the failure case if we don't? It seems like too much. This would simplify the class structure since I think ChunkAlignedSampler would not need to exist.
I'm also not clear what "chunk aligned" means - the on-disk sparse data doesn't really have a useful concept of chunking, so what is being aligned?
Nice start!
Oh ok, for the sparse case then this whole chunk-aligned concept might be useless. So should I just revert? But then I don't understand why we would initially shuffle only chunks in the original implementation.
(force-pushed 06aa938 to 64d6108)
You can think of our concept of chunks as entirely virtual and divorced from on-disk chunks: chunk = contiguous slice of
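To make the virtual-chunk idea concrete, here is a minimal sketch of mapping virtual chunk ids to contiguous row slices, clipping the final chunk to the total number of observations. The helper name and signature are my own illustration, not the PR's actual API:

```python
def chunk_ids_to_slices(chunk_ids, chunk_size, n_obs):
    """Map each virtual chunk id to a contiguous slice of rows.

    Hypothetical helper: a 'chunk' here is purely virtual, i.e. just a
    contiguous run of `chunk_size` observations, independent of any
    on-disk chunking. The last chunk is clipped to `n_obs`.
    """
    slices = []
    for cid in chunk_ids:
        start = cid * chunk_size
        slices.append(slice(start, min(start + chunk_size, n_obs)))
    return slices
```

For example, with `chunk_size=10` and `n_obs=25`, chunk id 2 maps to the short final slice `slice(20, 25)`.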
@ilan-gold yes, we can get rid of nobs. But I don't get why we'd want sampler.sample(nobs) since start and end are already given.
I was saying I don't know why anyone would need this information, but if they were to need it, it's a runtime (relative to construction of the |
This PR has gotten very big. So I won't include the categorical sampling and distributed sampling here. Once you approve of the |
I agree this is getting big, so I would probably limit the refactoring to what is purely necessary and focus only on the Sampler and slotting it into the current implementation (i.e., what is on main), which I think is doable. Separately, we can do a refactor PR. In general I would aim to make the PR as small as possible! But great that things are taking shape now :) Good to see all tests pass, so it seems like the basics are functional.
ilan-gold
left a comment
Nice! Well on our way now :)
@ilan-gold before I ask for a full review and polish, let me recap the things I think we aren't on the same page about:
For the first
Ah great catch, yes, you're absolutely right
Do these two invariants sound reasonable? EDIT: I did a strikethrough of two things. I think the current behavior is correct and uses the leftover handling to deal with the slice that is shuffled in. So I really think the above two points are the way to go. I added #106 to make sure the behavior is expected.
Those two sound reasonable; in fact I tried to implement that but didn't have time to make sure it worked well with the tests. Basically the same idea you got. The incomplete slice would also be randomly selected.
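A sketch of the invariant discussed above: whole virtual chunks are shuffled, with the incomplete leftover chunk shuffled in like any other, so every row is emitted exactly once per epoch. The function name and signature are hypothetical, not the PR's implementation:

```python
import random


def shuffled_chunk_slices(n_obs, chunk_size, seed=0):
    """Yield all virtual chunks as slices in shuffled order.

    Hypothetical sketch: the final chunk may be shorter than
    `chunk_size`; it is treated like any other chunk, so it lands
    at a random position rather than always coming last.
    """
    n_chunks = -(-n_obs // chunk_size)  # ceiling division
    ids = list(range(n_chunks))
    random.Random(seed).shuffle(ids)  # leftover chunk shuffled in too
    return [
        slice(i * chunk_size, min((i + 1) * chunk_size, n_obs))
        for i in ids
    ]
```

Regardless of the seed, the slices together cover every observation exactly once, which is the per-epoch guarantee at stake here.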
Thanks again @ilan-gold. The only unclear thing, I think, is here: #101 (comment), but I wrote a response. Sorry about the tests; I admit I overlooked the "vibe-check" on them :D. Should be good now; if there is something wrong there, it is organic hand-made blunders.
(force-pushed 8ffa7ae to ba2cbe9)
As a note for the future: the sampler doesn't actually have to deal with slices; it could just give the start points of the chunks and the size of the last chunk. Maybe one can do more optimized stuff using this, maybe not; just pointing it out.
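A rough illustration of that alternative representation (all names here are hypothetical): given the chunk start offsets plus the size of the final, possibly short, chunk, the slices remain recoverable on demand:

```python
def starts_to_slices(starts, chunk_size, last_size):
    """Recover slices from chunk start offsets.

    Hypothetical sketch: `starts` are the (possibly shuffled) start
    rows of each virtual chunk; every chunk has length `chunk_size`
    except the one with the greatest start offset, which has length
    `last_size`.
    """
    last_start = max(starts)
    return [
        slice(s, s + (last_size if s == last_start else chunk_size))
        for s in starts
    ]
```

This keeps the sampler's output compact (a list of ints plus one size) while losing no information relative to emitting slices directly.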
ilan-gold
left a comment
Just small things now!
ilan-gold
left a comment
Ok, this looks good! Great stuff. How do you want to proceed here? Do you want to try implementing a new sampler off this branch, or merge this as-is?
Nicee!! I'd like to merge this as-is if you don't mind. It's already big enough.
ilan-gold
left a comment
> Nicee!! I'd like to merge this as is if you don't mind. It's already big enough
Ok!
fixes: #43
Not ready for review. Just wanted an initial blessing, or feedback if your eye caught something.
Key points for my main design proposal for the interaction with `Loader` and `Sampler`:

- `Sampler[list[slice]]`
- `ChunkAlignedSampler(Sampler[list[slice]])` expects an implementation `_iter_chunk_batches` which returns the chunk ids and at most two slices.
- `ChunkAlignedSampler(Sampler[list[slice]])` is a subclass of it that already has `__iter__` implemented and will use `_iter_chunk_batches` and will convert chunk ids into slices.
- `ChunkAlignedSampler` is always easily validated to be chunk aligned for our loader.

Why at most two slices per batch? Because one can always rearrange fragmentations so that a batch would have at most 2 unaligned chunk slices. I already have an idea for how a mask and category would look following this logic.
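As a hedged sketch of the class structure proposed above; the constructor arguments and the exact batch format returned by `_iter_chunk_batches` are my assumptions, not the PR's final API:

```python
from abc import ABC, abstractmethod
from collections.abc import Iterator
from typing import Generic, TypeVar

T = TypeVar("T")


class Sampler(ABC, Generic[T]):
    """Base interface: iterating yields batches of type T."""

    @abstractmethod
    def __iter__(self) -> Iterator[T]: ...


class ChunkAlignedSampler(Sampler[list[slice]]):
    """Sampler with __iter__ already implemented.

    Subclasses provide _iter_chunk_batches, yielding chunk ids plus
    at most two extra (unaligned) slices per batch; __iter__ converts
    the chunk ids into slices.
    """

    def __init__(self, chunk_size: int, n_obs: int):
        self.chunk_size = chunk_size
        self.n_obs = n_obs

    @abstractmethod
    def _iter_chunk_batches(self) -> Iterator[tuple[list[int], list[slice]]]: ...

    def __iter__(self) -> Iterator[list[slice]]:
        for chunk_ids, extra in self._iter_chunk_batches():
            yield [
                slice(c * self.chunk_size, min((c + 1) * self.chunk_size, self.n_obs))
                for c in chunk_ids
            ] + extra
```

A concrete subclass then only has to decide which chunk ids (and leftover slices) make up each batch; conversion to slices and loader-facing iteration come for free from the base class.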
Stuff I haven't paid attention to much:

- `preload_nchunks`: haven't given it much thought

ping @ilan-gold , @felix0097 because it's a draft PR