
support qwen2-vl with turbomind backend #2720

Open · wants to merge 1 commit into main

Conversation

@irexyc (Collaborator) commented Nov 6, 2024

Thanks for your contribution and we appreciate it a lot. The following instructions will make your pull request healthier and help it receive feedback more easily. If you do not understand some items, don't worry, just make the pull request and seek help from the maintainers.

Motivation

Please describe the motivation of this PR and the goal you want to achieve through this PR.

Modification

Please briefly describe what modification is made in this PR.

BC-breaking (Optional)

Does the modification introduce changes that break the backward-compatibility of the downstream repositories?
If so, please describe how it breaks the compatibility and how the downstream projects should modify their code to keep compatibility with this PR.

Use cases (Optional)

If this PR introduces a new feature, it is better to list some use cases here, and update the documentation.

Checklist

  1. Pre-commit or other linting tools are used to fix the potential lint issues.
  2. The modification is covered by complete unit tests. If not, please add more unit tests to ensure the correctness.
  3. If the modification has a dependency on downstream projects of a newer version, this PR should be tested with all supported versions of downstream projects.
  4. The documentation has been modified accordingly, like docstring or example tutorials.

@lvhan028 added the enhancement (New feature or request) label on Nov 6, 2024
@lvhan028 (Collaborator) commented Nov 8, 2024

Postpone the review until @irexyc refactors tm's attention module.
cc @AllentDan @lzhangzz

@serser commented Dec 27, 2024

Any updates?

    p3 = *(p + 2);
}
else {
    p1 = p2 = p3 = (int)timestep - mrope_position_delta_;
@xiaoxiangshusheng commented Jan 1, 2025

I have some doubts about this line. Should it be addition or subtraction here?
p1 = p2 = p3 = (int)timestep + mrope_position_delta_;
I look forward to your reply when you have time.
Thanks! @irexyc

@chenzhengda

Besides the error in p1 = p2 = p3 = (int)timestep - mrope_position_delta_, the current branch produces incorrect results during batch inference. @irexyc

@randomseed713

Any updates?

@irexyc (Collaborator, Author) commented Feb 19, 2025

A PR based on the current code will be submitted this week

@Juniper1021 (Contributor)

> A PR based on the current code will be submitted this week

Will it also support Qwen2.5VL?

@piotr-sikora-v

> A PR based on the current code will be submitted this week
>
> Will it also support Qwen2.5VL?

As I saw in the modification of lmdeploy/turbomind/supported_models.py, it will not support the Qwen2_5_VLForConditionalGeneration architecture :/

@lvhan028 (Collaborator)

Qwen2.5-VL will be supported by the PyTorch engine.

@quanfeifan

Waiting for a demo of inference with qwen2-vl on the turbomind backend.

@quanfeifan

> Waiting for a demo of inference with qwen2-vl on the turbomind backend.

What's more, is there any plan to support qwen2-vl quantized with AWQ (w4a16) on turbomind?

@irexyc (Collaborator, Author) commented Feb 25, 2025

@xiaoxiangshusheng @chenzhengda Thanks for pointing out the bug. It should be addition, and the timestep calculation of mrope also seems wrong. I hope the new PR fixes it.

For image input, there is no difference between qwen2_5-vl and qwen2-vl, so I added a mapping to support it.

This branch supports qwen2_5-vl inference with the turbomind backend and quantization with the lmdeploy lite API. It is developed on top of another branch, so it may not be merged quickly. You can try this release. @randomseed713 @Juniper1021 @piotr-sikora-v @quanfeifan
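For the sign question above, a rough Python sketch of the M-RoPE decode-position arithmetic (the delta definition follows the Hugging Face Qwen2-VL modeling code; the concrete numbers are made up for illustration):

```python
# Illustrative sketch only, not the turbomind kernel.
# During prefill, vision tokens share temporal/height/width position ids, so the
# largest 3D position id can be smaller than the prompt length. The per-sequence
# delta records that offset once:
prompt_length = 12      # prompt tokens after the image placeholder is expanded
max_position_id = 8     # max over the 3D (t, h, w) position ids of the prompt
mrope_position_delta = max_position_id + 1 - prompt_length   # here: -3

# During decoding, the token generated at `timestep` (counted from the start of
# the sequence) continues from where prefill left off, so its position is the
# timestep plus the stored delta -- i.e. addition, as confirmed above:
for timestep in range(prompt_length, prompt_length + 4):
    p1 = p2 = p3 = timestep + mrope_position_delta
    print(f'timestep={timestep} -> mrope position={p1}')
```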

@piotr-sikora-v

> @xiaoxiangshusheng @chenzhengda Thanks for pointing out the bug. It should be addition, and the timestep calculation of mrope also seems wrong. I hope the new PR fixes it.
>
> For image input, there is no difference between qwen2_5-vl and qwen2-vl, so I added a mapping to support it.
>
> This branch supports qwen2_5-vl inference with the turbomind backend and quantization with the lmdeploy lite API. It is developed on top of another branch, so it may not be merged quickly. You can try this release. @randomseed713 @Juniper1021 @piotr-sikora-v @quanfeifan

Thanks for sharing!

I found this error:

[hardware details truncated: Intel host with VT-x, 56 logical CPUs across 2 NUMA nodes, standard CPU vulnerability mitigations reported]

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] nvidia-cublas-cu12==12.4.5.8
[pip3] nvidia-cuda-cupti-cu12==12.4.127
[pip3] nvidia-cuda-nvrtc-cu12==12.4.127
[pip3] nvidia-cuda-runtime-cu12==12.4.127
[pip3] nvidia-cudnn-cu12==9.1.0.70
[pip3] nvidia-cufft-cu12==11.2.1.3
[pip3] nvidia-curand-cu12==10.3.5.147
[pip3] nvidia-cusolver-cu12==11.6.1.9
[pip3] nvidia-cusparse-cu12==12.3.1.170
[pip3] nvidia-nccl-cu12==2.21.5
[pip3] nvidia-nvjitlink-cu12==12.4.127
[rank0]:   File "/root/lmdeploy-0.7.0.post3-qwen2.5-vl/lmdeploy/pytorch/engine/model_agent.py", line 642, in _build_model
[rank0]:     model, cache_engine, cache_config = _tp_build_model(
[rank0]:                                         ^^^^^^^^^^^^^^^^
[rank0]:   File "/root/lmdenv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/lmdeploy-0.7.0.post3-qwen2.5-vl/lmdeploy/pytorch/engine/model_agent.py", line 343, in _tp_build_model
[rank0]:     raise e
[rank0]:   File "/root/lmdeploy-0.7.0.post3-qwen2.5-vl/lmdeploy/pytorch/engine/model_agent.py", line 320, in _tp_build_model
[rank0]:     patched_model = build_patched_model(model_config, device=device_map)
[rank0]:                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/lmdenv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/lmdeploy-0.7.0.post3-qwen2.5-vl/lmdeploy/pytorch/models/patch.py", line 204, in build_patched_model
[rank0]:     return build_model_from_hf_config(model_config, dtype=dtype, device=device)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/lmdeploy-0.7.0.post3-qwen2.5-vl/lmdeploy/pytorch/models/patch.py", line 194, in build_model_from_hf_config
[rank0]:     model_cls = _get_model_class(model_config, module_map)
[rank0]:                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/lmdeploy-0.7.0.post3-qwen2.5-vl/lmdeploy/pytorch/models/patch.py", line 184, in _get_model_class
[rank0]:     raise RuntimeError(f'Can not found rewrite for architectures: {architectures}')
[rank0]: RuntimeError: Can not found rewrite for architectures: ['Qwen2_5_VLForConditionalGeneration']
[rank0]:[W225 09:24:17.484472721 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present,  but this warning has only been added since PyTorch 2.4 (function operator())

I know there were some changes in the model and architectures... my latest configuration works with the vLLM release and matches the latest changes in qwen-2.5-vl.

@irexyc (Collaborator, Author) commented Feb 25, 2025

@piotr-sikora-v It seems you are using the pytorch backend, which is not supported yet. You can try passing a TurbomindEngineConfig backend config. If you still encounter problems, please provide reproducible code.
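For reference, a minimal Python sketch of explicitly selecting the turbomind backend through the pipeline API (it assumes a wheel built with the turbomind backend is installed; the model path and image URL are placeholders):

```python
# Minimal sketch: explicitly select the turbomind backend instead of relying on
# the automatic backend detection. Model path and image URL are placeholders.
from lmdeploy import pipeline, TurbomindEngineConfig
from lmdeploy.vl import load_image

backend_config = TurbomindEngineConfig(tp=4, cache_max_entry_count=0.1)
pipe = pipeline('Qwen/Qwen2-VL-7B-Instruct', backend_config=backend_config)

image = load_image('https://example.com/sample.jpg')
print(pipe(('Describe this image.', image)))
```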

@piotr-sikora-v

> @piotr-sikora-v It seems you are using the pytorch backend, which is not supported yet. You can try passing a TurbomindEngineConfig backend config. If you still encounter problems, please provide reproducible code.

I'm running it from the CLI with the backend set to turbomind,
but I don't know why it is falling back to pytorch.
I built it from your release and then ran:
pip install -e .

No errors

...
Requirement already satisfied: six>=1.5 in /root/lmdenv/lib/python3.12/site-packages (from python-dateutil>=2.8.2->pandas->datasets->outlines<0.1.0->lmdeploy==0.7.0.post3) (1.17.0)
Building wheels for collected packages: lmdeploy
  Building editable for lmdeploy (pyproject.toml) ... done
  Created wheel for lmdeploy: filename=lmdeploy-0.7.0.post3-0.editable-py3-none-any.whl size=12668 sha256=0677ea3d15cc5ff3cb7df91227fcaac23e2eeca8ea5712616584327c4bc79bf2
  Stored in directory: /tmp/pip-ephem-wheel-cache-b_p080vj/wheels/34/17/c2/7b396938fa7c074d4d5a12e9b171b3e1e3d09d1f65f742809e
Successfully built lmdeploy
Installing collected packages: lmdeploy
  Attempting uninstall: lmdeploy
    Found existing installation: lmdeploy 0.7.0.post3
    Uninstalling lmdeploy-0.7.0.post3:
      Successfully uninstalled lmdeploy-0.7.0.post3
Successfully installed lmdeploy-0.7.0.post3

here is my full command:

lmdeploy serve api_server --dtype float16 --cache-max-entry-count 0.1 --max-concurrent-requests 16 --log-level INFO --enable-prefix-caching --model-name /root/model_qwen3_cmon3 --tp 4 --server-port 8000 --backend turbomind /root/model_qwen3_cmon3-vllm-latest/

2025-02-25 09:58:07,712 - lmdeploy - WARNING - archs.py:57 - Fallback to pytorch engine because turbomind engine is not installed correctly. If you insist to use turbomind engine, you may need to reinstall lmdeploy from pypi or build from source and try again.
2025-02-25 09:58:07,712 - lmdeploy - WARNING - archs.py:62 - Try to run with pytorch engine because `/root/model_qwen3_cmon3-vllm-latest/` is not explicitly supported by lmdeploy.
2025-02-25 09:58:10,215 - lmdeploy - INFO - builder.py:63 - matching vision model: Qwen2VLModel
Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.48, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.
2025-02-25 09:58:11,862 - lmdeploy - INFO - async_engine.py:260 - input backend=pytorch, backend_config=PytorchEngineConfig(dtype='float16', tp=4, session_len=None, max_batch_size=128, cache_max_entry_count=0.1, prefill_interval=16, block_size=64, num_cpu_blocks=0, num_gpu_blocks=0, adapters=None, max_prefill_token_num=8192, thread_safe=False, enable_prefix_caching=True, device_type='cuda', eager_mode=False, custom_module_map=None, download_dir=None, revision=None, quant_policy=0)
2025-02-25 09:58:11,863 - lmdeploy - INFO - async_engine.py:261 - input chat_template_config=None
2025-02-25 09:58:11,870 - lmdeploy - INFO - async_engine.py:270 - updated chat_template_onfig=ChatTemplateConfig(model_name='qwen', system=None, meta_instruction=None, eosys=None, user=None, eoh=None, assistant=None, eoa=None, tool=None, eotool=None, separator=None, capability=None, stop_words=None)
2025-02-25 09:58:13,923 - lmdeploy - WARNING - transformers.py:22 - LMDeploy requires transformers version: [4.33.0 ~ 4.46.1], but found version: 4.50.0.dev0

@irexyc (Collaborator, Author) commented Feb 25, 2025

> I built it from your release and then ran:
> pip install -e .

pip install -e . won't build the turbomind backend. You can install the wheel package from here

@piotr-sikora-v

> > I built it from your release and then ran:
> > pip install -e .
>
> pip install -e . won't build the turbomind backend. You can install the wheel package from here

Great! It works!
I think it's 20% faster than on vLLM, but I need to do some benchmarks.

I don't know why yet, but sometimes it freezes while generating... possibly because of my configuration.

@piotr-sikora-v

After one hour of running I got a crash:

2025-02-25 13:48:25,776 - lmdeploy - INFO - async_engine.py:675 - session=120, history_tokens=0, input_tokens=2028, max_new_tokens=4096, seq_start=True, seq_end=True, step=0, prep=True
2025-02-25 13:48:25,776 - lmdeploy - INFO - turbomind.py:560 - [async_stream_infer] session 120 start
[TM][INFO] [ProcessInferRequests] Request for 120 received.
[TM][INFO] [Forward] [0, 1), dc=0, pf=1, sum_q=2028, sum_k=2028, max_q=2028, max_k=2028
2025-02-25 13:48:25,848 - lmdeploy - ERROR - async_engine.py:592 - [safe_run] exception caught: OverflowError out of range integral type conversion attempted
2025-02-25 13:48:25,849 - lmdeploy - INFO - turbomind.py:622 - [async_stream_infer] GeneratorExit
terminate called after throwing an instance of 'std::runtime_error'
  what():  [TM][ERROR] CUDA runtime error: an illegal memory access was encountered /lmdeploy/src/turbomind/models/llama/unified_decoder.cc:265 

In the log I saw that GPU memory kept increasing the whole time.

Command:

lmdeploy serve api_server  --dtype float16 --cache-max-entry-count 0.1 --max-concurrent-requests 16  --log-level INFO    --enable-prefix-caching  --tp 4 --server-port 8000 --backend turbomind Qwen/Qwen2.5-VL-3B-Instruct

System:
4x V100 SXM2

BTW, I don't have metrics in lmdeploy, so I can only compare the average time of my jobs: on vLLM it was 10.52 s, on lmdeploy it was 9.03 s.
Any hint to get a better value with this hardware and model?

@irexyc (Collaborator, Author) commented Feb 26, 2025

@piotr-sikora-v

Thanks for your feedback, I will check this.

For better performance, you can quantize the model according to this doc.
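A minimal sketch of what serving the quantized weights could look like afterwards (it assumes an AWQ/w4a16 checkpoint has already been produced with the lmdeploy lite tooling and that the GPU supports lmdeploy's W4A16 kernels; the local path is a placeholder):

```python
# Minimal sketch: load an already-quantized (AWQ/w4a16) checkpoint with the
# turbomind backend. The checkpoint path is a placeholder.
from lmdeploy import pipeline, TurbomindEngineConfig

backend_config = TurbomindEngineConfig(
    model_format='awq',         # weights were quantized to AWQ/w4a16 beforehand
    quant_policy=8,             # optional: online int8 kv-cache quantization
    cache_max_entry_count=0.1,  # keep kv-cache memory usage modest
    tp=4,
)
pipe = pipeline('./qwen2-vl-awq', backend_config=backend_config)
print(pipe('Describe the weather in one sentence.'))
```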

Labels: enhancement (New feature or request)
9 participants