support qwen2-vl with turbomind backend #2720
base: main
Conversation
Postpone the review until @irexyc refactors tm's attention module.
Any updates?
    p3 = *(p + 2);
}
else {
    p1 = p2 = p3 = (int)timestep - mrope_position_delta_;
I have some doubts about this line. Should it be addition or subtraction here?
`p1 = p2 = p3 = (int)timestep + mrope_position_delta_;`
I look forward to your reply, if you have the time.
Thanks! @irexyc
Besides the error in `p1 = p2 = p3 = (int)timestep - mrope_position_delta_;`, the current branch produces incorrect results during batch inference. @irexyc
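For context on the sign question, here is a minimal Python sketch of the convention the Hugging Face Qwen2-VL code follows for the M-RoPE delta (names simplified; an illustration only, not the turbomind kernel under review):

```python
# Illustrative sketch, simplified from the HF transformers Qwen2-VL
# get_rope_index() convention; not the turbomind implementation.

def mrope_position_delta(prefill_position_ids, prompt_len):
    # The delta records how far the multimodal positions ran ahead of (or
    # behind) the raw token index during prefill.
    return max(prefill_position_ids) + 1 - prompt_len

def decode_position(timestep, delta):
    # During decoding every new token is plain text, so the temporal, height
    # and width sections collapse to the same scalar, and the delta is ADDED
    # to the running timestep, hence '+' rather than '-' in the kernel.
    return timestep + delta
```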
Any updates?
A PR based on the current code will be submitted this week.
Will it also support Qwen2.5-VL?
As I saw in the modification of
qwen2.5-vl will be supported by the pytorch engine.
Waiting for a demo of inference with qwen2-vl using the turbomind backend.
What's more, is there any plan to support qwen2-vl quantized with AWQ (w4a16) on turbomind?
@xiaoxiangshusheng @chenzhengda Thanks for pointing out the bug. It should be addition, and the mrope timestep calculation also seems wrong; I hope the new PR fixes it. For image input, there is no difference between qwen2_5-vl and qwen2-vl, so I added some mapping to support it. This branch supports qwen2_5-vl inference with the turbomind backend and quantization with the lmdeploy lite API. It is developed on top of another branch, so it may not be merged quickly. You can try this release. @randomseed713 @Juniper1021 @piotr-sikora-v @quanfeifan
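For anyone who wants to try the branch, a minimal sketch of VLM inference through the turbomind backend with the pipeline API; the model path and image URL are placeholders:

```python
from lmdeploy import pipeline, TurbomindEngineConfig
from lmdeploy.vl import load_image

# Placeholder checkpoint path; point it at your local Qwen2-VL / Qwen2.5-VL model.
pipe = pipeline('Qwen/Qwen2-VL-7B-Instruct',
                backend_config=TurbomindEngineConfig(session_len=8192))

image = load_image('https://example.com/demo.jpg')  # placeholder image URL
response = pipe(('Describe this image.', image))
print(response.text)
```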
Thanks for sharing! I found this error:
I know there were some changes in the model and architectures... my latest configuration works with the vLLM release and matches the latest changes in qwen-2.5-vl.
@piotr-sikora-v It seems you are using
I'm running it from the CLI with the backend set to turbomind. No errors.
Here is my full command:
Great! It works! I don't know why yet, but sometimes it freezes while generating... it is possibly because of my configuration.
After one hour of running I got a crash.
In the log I saw that GPU memory kept increasing the whole time. Command:
System:
BTW.
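If the growth is coming from the k/v cache, capping it is one knob worth checking; a hedged sketch with TurbomindEngineConfig (the ratio is only an example value, and the path is a placeholder):

```python
from lmdeploy import pipeline, TurbomindEngineConfig

# cache_max_entry_count limits the fraction of free GPU memory the k/v cache
# may occupy; 0.4 is only an example here, not a recommended setting.
pipe = pipeline('/path/to/Qwen2-VL-7B-Instruct',  # placeholder path
                backend_config=TurbomindEngineConfig(cache_max_entry_count=0.4))
```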
Thanks for your feedback, I will check this. For better performance, you can quantize the model according to this doc.
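A rough sketch of what that flow can look like, assuming the checkpoint has already been quantized to w4a16 with the lite API from the linked doc (paths are placeholders):

```python
from lmdeploy import pipeline, TurbomindEngineConfig

# Step 1 (not shown): quantize the model to w4a16 with the lmdeploy lite API,
# following the linked doc.
# Step 2: load the resulting AWQ checkpoint with the turbomind backend.
pipe = pipeline('/path/to/qwen2-vl-7b-awq',  # placeholder output dir of step 1
                backend_config=TurbomindEngineConfig(model_format='awq'))
print(pipe('Say hello.').text)
```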
Thanks for your contribution; we appreciate it a lot. The following instructions will make your pull request healthier and help it receive feedback more easily. If you do not understand some items, don't worry, just make the pull request and seek help from maintainers.
Motivation
Please describe the motivation of this PR and the goal you want to achieve through this PR.
Modification
Please briefly describe what modification is made in this PR.
BC-breaking (Optional)
Does the modification introduce changes that break the backward-compatibility of the downstream repositories?
If so, please describe how it breaks the compatibility and how the downstream projects should modify their code to keep compatibility with this PR.
Use cases (Optional)
If this PR introduces a new feature, it is better to list some use cases here, and update the documentation.
Checklist