Hacky tensor parallel for Vision Models #791
Draft
Ph0rk0z wants to merge 3 commits into turboderp-org:dev from
Conversation
Seems faster than through torch by a tiny bit.
I was frustrated with TP not working on pixtral-large (it's slower than qwen235b) and messed around a little bit. The model still sees images and generates text. No obvious side effects were observed, but I will test more, along with other architectures. Torch asserts because stuff is in inference mode, and this bypasses the check with a regular copy. Somehow it's also faster, by about 0.10 t/s.
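The inference-mode workaround described above can be sketched roughly as follows. This is an illustrative minimal example of the pattern (copying an inference-mode tensor into a freshly allocated regular tensor), not the actual patch; the tensor names are made up:

```python
import torch

# Simulate a tensor produced under inference mode, as happens
# during tensor-parallel generation.
with torch.inference_mode():
    x = torch.ones(4, 4)

# In-place updates to an inference tensor outside inference mode
# trip torch's internal check:
#   x.add_(1.0)  # RuntimeError: inplace update to inference tensor
#                # outside InferenceMode is not allowed

# A regular copy into a freshly allocated (non-inference) tensor
# sidesteps the check:
y = torch.empty_like(x)  # y is a normal tensor
y.copy_(x)               # reading x here is permitted
y.add_(1.0)              # fine: y carries no inference flag
```

The copy costs an extra allocation, but since it replaces a failing path rather than adding to a working one, the small speedup reported above is plausible.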
I will also try it on qwen-VL at some point. Mainly this is here for anyone else who likes to chat with memes but wants higher speeds. I still have to test on long context too; I only did a handful of images. Maybe at 32k ctx it blows up or goes OOM. Everyone feel free to tell me why this is a horrible idea :P
update: I have used up to 20k context on pixtral and have tested qwen2 VL 72b. It works as well. 1MB images eat your context, who would have thought...