
support qwen2-vl with turbomind backend #2720

Open · wants to merge 1 commit into main

Conversation

@irexyc (Collaborator) commented Nov 6, 2024

Thanks for your contribution and we appreciate it a lot. The following instructions will make your pull request healthier and help it receive feedback more easily. If you do not understand some items, don't worry, just make the pull request and seek help from the maintainers.

Motivation

Please describe the motivation of this PR and the goal you want to achieve through this PR.

Modification

Please briefly describe what modification is made in this PR.

BC-breaking (Optional)

Does the modification introduce changes that break the backward-compatibility of the downstream repositories?
If so, please describe how it breaks the compatibility and how the downstream projects should modify their code to keep compatibility with this PR.

Use cases (Optional)

If this PR introduces a new feature, it is better to list some use cases here, and update the documentation.

Checklist

  1. Pre-commit or other linting tools are used to fix the potential lint issues.
  2. The modification is covered by complete unit tests. If not, please add more unit tests to ensure the correctness.
  3. If the modification has a dependency on downstream projects of a newer version, this PR should be tested with all supported versions of downstream projects.
  4. The documentation has been modified accordingly, like docstring or example tutorials.

@lvhan028 added the enhancement (New feature or request) label on Nov 6, 2024
@lvhan028 (Collaborator) commented Nov 8, 2024

Postpone the review until @irexyc refactors tm's attention module.
cc @AllentDan @lzhangzz

@serser commented Dec 27, 2024

Any updates?

    p3 = *(p + 2);
}
else {
    p1 = p2 = p3 = (int)timestep - mrope_position_delta_;
@xiaoxiangshusheng commented Jan 1, 2025

I have some doubts about this line. Should it be addition or subtraction here?
p1 = p2 = p3 = (int)timestep + mrope_position_delta_;
I look forward to your reply when you have time.
Thanks! @irexyc

@chenzhengda

Besides the error in p1 = p2 = p3 = (int)timestep - mrope_position_delta_, the current branch produces incorrect results during batch inference. @irexyc

@randomseed713

Any updates?

@irexyc (Collaborator, Author) commented Feb 19, 2025

A PR based on the current code will be submitted this week

@Juniper1021 (Contributor)

> A PR based on the current code will be submitted this week

Will it also support Qwen2.5VL?

@piotr-sikora-v

> A PR based on the current code will be submitted this week
>
> Will it also support Qwen2.5VL?

As I saw in the modification of lmdeploy/turbomind/supported_models.py, it will not support the Qwen2_5_VLForConditionalGeneration architecture :/

@lvhan028 (Collaborator)

Qwen2.5-VL will be supported by the PyTorch engine.

@quanfeifan

Waiting for a demo of inference with qwen2-vl on the turbomind backend.

@quanfeifan

> Waiting for a demo of inference with qwen2-vl on the turbomind backend.

What's more, is there any plan to support qwen2-vl quantized with AWQ (w4a16) on turbomind?

@irexyc (Collaborator, Author) commented Feb 25, 2025

@xiaoxiangshusheng @chenzhengda Thanks for pointing out the bug. It should be addition, and the timestep calculation of mrope also seems wrong. I hope the new PR fixes it.

For image input, there is no difference between qwen2_5-vl and qwen2-vl, so I added a mapping to support it.

This branch supports qwen2_5-vl inference with the turbomind backend and quantization with the lmdeploy lite API. It is developed on top of another branch, so it may not be merged quickly. You can try this release. @randomseed713 @Juniper1021 @piotr-sikora-v @quanfeifan
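For the sign question above, a rough Python sketch of the M-RoPE decode-position arithmetic (the delta definition follows the Hugging Face Qwen2-VL modeling code; the concrete numbers are made up for illustration):

```python
# Illustrative sketch only, not the turbomind kernel.
# During prefill, vision tokens share temporal/height/width position ids, so the
# largest 3D position id can be smaller than the prompt length. The per-sequence
# delta records that offset once:
prompt_length = 12      # prompt tokens after the image placeholder is expanded
max_position_id = 8     # max over the 3D (t, h, w) position ids of the prompt
mrope_position_delta = max_position_id + 1 - prompt_length   # here: -3

# During decoding, the token generated at `timestep` (counted from the start of
# the sequence) continues from where prefill left off, so its position is the
# timestep plus the stored delta -- i.e. addition, as confirmed above:
for timestep in range(prompt_length, prompt_length + 4):
    p1 = p2 = p3 = timestep + mrope_position_delta
    print(f'timestep={timestep} -> mrope position={p1}')
```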

@piotr-sikora-v

> @xiaoxiangshusheng @chenzhengda Thanks for pointing out the bug. It should be addition, and the timestep calculation of mrope also seems wrong. I hope the new PR fixes it.
>
> For image input, there is no difference between qwen2_5-vl and qwen2-vl, so I added a mapping to support it.
>
> This branch supports qwen2_5-vl inference with the turbomind backend and quantization with the lmdeploy lite API. It is developed on top of another branch, so it may not be merged quickly. You can try this release. @randomseed713 @Juniper1021 @piotr-sikora-v @quanfeifan

Thanks for sharing!

I found this error:

[hardware details truncated: Intel host with VT-x, 56 logical CPUs across 2 NUMA nodes, standard CPU vulnerability mitigations reported]

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] nvidia-cublas-cu12==12.4.5.8
[pip3] nvidia-cuda-cupti-cu12==12.4.127
[pip3] nvidia-cuda-nvrtc-cu12==12.4.127
[pip3] nvidia-cuda-runtime-cu12==12.4.127
[pip3] nvidia-cudnn-cu12==9.1.0.70
[pip3] nvidia-cufft-cu12==11.2.1.3
[pip3] nvidia-curand-cu12==10.3.5.147
[pip3] nvidia-cusolver-cu12==11.6.1.9
[pip3] nvidia-cusparse-cu12==12.3.1.170
[pip3] nvidia-nccl-cu12==2.21.5
[pip3] nvidia-nvjitlink-cu12==12.4.127
[rank0]:   File "/root/lmdeploy-0.7.0.post3-qwen2.5-vl/lmdeploy/pytorch/engine/model_agent.py", line 642, in _build_model
[rank0]:     model, cache_engine, cache_config = _tp_build_model(
[rank0]:                                         ^^^^^^^^^^^^^^^^
[rank0]:   File "/root/lmdenv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/lmdeploy-0.7.0.post3-qwen2.5-vl/lmdeploy/pytorch/engine/model_agent.py", line 343, in _tp_build_model
[rank0]:     raise e
[rank0]:   File "/root/lmdeploy-0.7.0.post3-qwen2.5-vl/lmdeploy/pytorch/engine/model_agent.py", line 320, in _tp_build_model
[rank0]:     patched_model = build_patched_model(model_config, device=device_map)
[rank0]:                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/lmdenv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/lmdeploy-0.7.0.post3-qwen2.5-vl/lmdeploy/pytorch/models/patch.py", line 204, in build_patched_model
[rank0]:     return build_model_from_hf_config(model_config, dtype=dtype, device=device)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/lmdeploy-0.7.0.post3-qwen2.5-vl/lmdeploy/pytorch/models/patch.py", line 194, in build_model_from_hf_config
[rank0]:     model_cls = _get_model_class(model_config, module_map)
[rank0]:                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/lmdeploy-0.7.0.post3-qwen2.5-vl/lmdeploy/pytorch/models/patch.py", line 184, in _get_model_class
[rank0]:     raise RuntimeError(f'Can not found rewrite for architectures: {architectures}')
[rank0]: RuntimeError: Can not found rewrite for architectures: ['Qwen2_5_VLForConditionalGeneration']
[rank0]:[W225 09:24:17.484472721 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present,  but this warning has only been added since PyTorch 2.4 (function operator())

I know there were some changes in the model and architectures... my latest configuration works with the vLLM release and matches the latest changes in qwen-2.5-vl.

@irexyc (Collaborator, Author) commented Feb 25, 2025

@piotr-sikora-v It seems you are using the pytorch backend, which is not supported yet. You can try passing a TurbomindEngineConfig backend config. If you still encounter problems, please provide reproducible code.
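For reference, a minimal Python sketch of explicitly selecting the turbomind backend through the pipeline API (it assumes a wheel built with the turbomind backend is installed; the model path and image URL are placeholders):

```python
# Minimal sketch: explicitly select the turbomind backend instead of relying on
# the automatic backend detection. Model path and image URL are placeholders.
from lmdeploy import pipeline, TurbomindEngineConfig
from lmdeploy.vl import load_image

backend_config = TurbomindEngineConfig(tp=4, cache_max_entry_count=0.1)
pipe = pipeline('Qwen/Qwen2-VL-7B-Instruct', backend_config=backend_config)

image = load_image('https://example.com/sample.jpg')
print(pipe(('Describe this image.', image)))
```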

@piotr-sikora-v

> @piotr-sikora-v It seems you are using the pytorch backend, which is not supported yet. You can try passing a TurbomindEngineConfig backend config. If you still encounter problems, please provide reproducible code.

I'm running it from the CLI with the backend set to turbomind,
but I don't know why it is falling back to pytorch.
I built it from your release and then ran:
pip install -e .

No errors

...
Requirement already satisfied: six>=1.5 in /root/lmdenv/lib/python3.12/site-packages (from python-dateutil>=2.8.2->pandas->datasets->outlines<0.1.0->lmdeploy==0.7.0.post3) (1.17.0)
Building wheels for collected packages: lmdeploy
  Building editable for lmdeploy (pyproject.toml) ... done
  Created wheel for lmdeploy: filename=lmdeploy-0.7.0.post3-0.editable-py3-none-any.whl size=12668 sha256=0677ea3d15cc5ff3cb7df91227fcaac23e2eeca8ea5712616584327c4bc79bf2
  Stored in directory: /tmp/pip-ephem-wheel-cache-b_p080vj/wheels/34/17/c2/7b396938fa7c074d4d5a12e9b171b3e1e3d09d1f65f742809e
Successfully built lmdeploy
Installing collected packages: lmdeploy
  Attempting uninstall: lmdeploy
    Found existing installation: lmdeploy 0.7.0.post3
    Uninstalling lmdeploy-0.7.0.post3:
      Successfully uninstalled lmdeploy-0.7.0.post3
Successfully installed lmdeploy-0.7.0.post3

here is my full command:

lmdeploy serve api_server --dtype float16 --cache-max-entry-count 0.1 --max-concurrent-requests 16 --log-level INFO --enable-prefix-caching --model-name /root/model_qwen3_cmon3 --tp 4 --server-port 8000 --backend turbomind /root/model_qwen3_cmon3-vllm-latest/

2025-02-25 09:58:07,712 - lmdeploy - WARNING - archs.py:57 - Fallback to pytorch engine because turbomind engine is not installed correctly. If you insist to use turbomind engine, you may need to reinstall lmdeploy from pypi or build from source and try again.
2025-02-25 09:58:07,712 - lmdeploy - WARNING - archs.py:62 - Try to run with pytorch engine because `/root/model_qwen3_cmon3-vllm-latest/` is not explicitly supported by lmdeploy.
2025-02-25 09:58:10,215 - lmdeploy - INFO - builder.py:63 - matching vision model: Qwen2VLModel
Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.48, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.
2025-02-25 09:58:11,862 - lmdeploy - INFO - async_engine.py:260 - input backend=pytorch, backend_config=PytorchEngineConfig(dtype='float16', tp=4, session_len=None, max_batch_size=128, cache_max_entry_count=0.1, prefill_interval=16, block_size=64, num_cpu_blocks=0, num_gpu_blocks=0, adapters=None, max_prefill_token_num=8192, thread_safe=False, enable_prefix_caching=True, device_type='cuda', eager_mode=False, custom_module_map=None, download_dir=None, revision=None, quant_policy=0)
2025-02-25 09:58:11,863 - lmdeploy - INFO - async_engine.py:261 - input chat_template_config=None
2025-02-25 09:58:11,870 - lmdeploy - INFO - async_engine.py:270 - updated chat_template_onfig=ChatTemplateConfig(model_name='qwen', system=None, meta_instruction=None, eosys=None, user=None, eoh=None, assistant=None, eoa=None, tool=None, eotool=None, separator=None, capability=None, stop_words=None)
2025-02-25 09:58:13,923 - lmdeploy - WARNING - transformers.py:22 - LMDeploy requires transformers version: [4.33.0 ~ 4.46.1], but found version: 4.50.0.dev0

@irexyc (Collaborator, Author) commented Feb 25, 2025

> I built it from your release and then ran:
> pip install -e .

pip install -e . won't build the turbomind backend. You can install the wheel package from here

@piotr-sikora-v

> > I built it from your release and then ran:
> > pip install -e .
>
> pip install -e . won't build the turbomind backend. You can install the wheel package from here

Great! It works!
I think it's 20% faster than on vLLM, but I need to do some benchmarks.

I don't know why yet, but sometimes it freezes while generating... possibly because of my configuration.

@piotr-sikora-v

After one hour of running I got a crash:

2025-02-25 13:48:25,776 - lmdeploy - INFO - async_engine.py:675 - session=120, history_tokens=0, input_tokens=2028, max_new_tokens=4096, seq_start=True, seq_end=True, step=0, prep=True
2025-02-25 13:48:25,776 - lmdeploy - INFO - turbomind.py:560 - [async_stream_infer] session 120 start
[TM][INFO] [ProcessInferRequests] Request for 120 received.
[TM][INFO] [Forward] [0, 1), dc=0, pf=1, sum_q=2028, sum_k=2028, max_q=2028, max_k=2028
2025-02-25 13:48:25,848 - lmdeploy - ERROR - async_engine.py:592 - [safe_run] exception caught: OverflowError out of range integral type conversion attempted
2025-02-25 13:48:25,849 - lmdeploy - INFO - turbomind.py:622 - [async_stream_infer] GeneratorExit
terminate called after throwing an instance of 'std::runtime_error'
  what():  [TM][ERROR] CUDA runtime error: an illegal memory access was encountered /lmdeploy/src/turbomind/models/llama/unified_decoder.cc:265 

In the log I saw that GPU memory kept increasing the whole time.

Command:

lmdeploy serve api_server  --dtype float16 --cache-max-entry-count 0.1 --max-concurrent-requests 16  --log-level INFO    --enable-prefix-caching  --tp 4 --server-port 8000 --backend turbomind Qwen/Qwen2.5-VL-3B-Instruct

System:
4x V100 SXM2

BTW, I don't have metrics in lmdeploy, so I can only compare the average time of my jobs: on vLLM it was 10.52 s, on lmdeploy it was 9.03 s.
Any hint to get a better value with this hardware and model?

@irexyc (Collaborator, Author) commented Feb 26, 2025

@piotr-sikora-v

Thanks for your feedback, I will check this.

For better performance, you can quantize the model according to this doc.
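A minimal sketch of what serving the quantized weights could look like afterwards (it assumes an AWQ/w4a16 checkpoint has already been produced with the lmdeploy lite tooling and that the GPU supports lmdeploy's W4A16 kernels; the local path is a placeholder):

```python
# Minimal sketch: load an already-quantized (AWQ/w4a16) checkpoint with the
# turbomind backend. The checkpoint path is a placeholder.
from lmdeploy import pipeline, TurbomindEngineConfig

backend_config = TurbomindEngineConfig(
    model_format='awq',         # weights were quantized to AWQ/w4a16 beforehand
    quant_policy=8,             # optional: online int8 kv-cache quantization
    cache_max_entry_count=0.1,  # keep kv-cache memory usage modest
    tp=4,
)
pipe = pipeline('./qwen2-vl-awq', backend_config=backend_config)
print(pipe('Describe the weather in one sentence.'))
```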

Labels: enhancement (New feature or request)
9 participants