vllm_infer is very slow on qwen2.5vl inference; it hangs for a long time on 10,000 image-text pairs #7216

2019211753 · 2025-03-08T05:24:30Z

Reminder

I have read the above rules and searched the existing issues.

System Info

llamafactory version: 0.9.2.dev0
Platform: Linux-5.15.0-134-generic-x86_64-with-glibc2.35
Python version: 3.11.11
PyTorch version: 2.5.1+cu121 (GPU)
Transformers version: 4.49.0
Datasets version: 3.2.0
Accelerate version: 1.2.1
PEFT version: 0.12.0
TRL version: 0.9.6
GPU type: NVIDIA GeForce RTX 4090
GPU number: 4
GPU memory: 23.64GB
vLLM version: 0.7.3

Reproduction

~/LLaMA-Factory main !2 ?10 ❯ python scripts/vllm_infer.py --model_name_or_path /data1_8t/user/yb/LLaMA-Factory/outputs/qwen2_5_vl_lora_sft --dataset evalmuse_test --template qwen2_vl
[INFO|training_args.py:2183] 2025-03-08 13:06:57,056 >> PyTorch: setting up devices
[INFO|training_args.py:1862] 2025-03-08 13:06:57,223 >> The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
[INFO|configuration_utils.py:697] 2025-03-08 13:06:57,258 >> loading configuration file /data1_8t/user/yb/LLaMA-Factory/outputs/qwen2_5_vl_lora_sft/config.json
[INFO|configuration_utils.py:771] 2025-03-08 13:06:57,264 >> Model config Qwen2_5_VLConfig {
  "_name_or_path": "/data1_8t/user/yb/LLaMA-Factory/outputs/qwen2_5_vl_lora_sft",
  "architectures": [
    "Qwen2_5_VLForConditionalGeneration"
  ],
  "attention_dropout": 0.0,
  "bos_token_id": 151643,
  "eos_token_id": 151645,
  "hidden_act": "silu",
  "hidden_size": 3584,
  "image_token_id": 151655,
  "initializer_range": 0.02,
  "intermediate_size": 18944,
  "max_position_embeddings": 128000,
  "max_window_layers": 28,
  "model_type": "qwen2_5_vl",
  "num_attention_heads": 28,
  "num_hidden_layers": 28,
  "num_key_value_heads": 4,
  "rms_norm_eps": 1e-06,
  "rope_scaling": {
    "mrope_section": [
      16,
      24,
      24
    ],
    "rope_type": "default",
    "type": "default"
  },
  "rope_theta": 1000000.0,
  "sliding_window": 32768,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.49.0",
  "use_cache": true,
  "use_sliding_window": false,
  "video_token_id": 151656,
  "vision_config": {
    "hidden_size": 1280,
    "in_chans": 3,
    "model_type": "qwen2_5_vl",
    "spatial_patch_size": 14,
    "tokens_per_second": 2,
    "torch_dtype": "bfloat16"
  },
  "vision_end_token_id": 151653,
  "vision_start_token_id": 151652,
  "vision_token_id": 151654,
  "vocab_size": 152064
}

[INFO|tokenization_utils_base.py:2048] 2025-03-08 13:06:57,347 >> loading file vocab.json
[INFO|tokenization_utils_base.py:2048] 2025-03-08 13:06:57,348 >> loading file merges.txt
[INFO|tokenization_utils_base.py:2048] 2025-03-08 13:06:57,348 >> loading file tokenizer.json
[INFO|tokenization_utils_base.py:2048] 2025-03-08 13:06:57,348 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:2048] 2025-03-08 13:06:57,348 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:2048] 2025-03-08 13:06:57,348 >> loading file tokenizer_config.json
[INFO|tokenization_utils_base.py:2048] 2025-03-08 13:06:57,348 >> loading file chat_template.jinja
[INFO|tokenization_utils_base.py:2313] 2025-03-08 13:06:57,772 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[INFO|image_processing_base.py:379] 2025-03-08 13:06:57,773 >> loading configuration file /data1_8t/user/yb/LLaMA-Factory/outputs/qwen2_5_vl_lora_sft/preprocessor_config.json
[INFO|image_processing_base.py:379] 2025-03-08 13:06:57,774 >> loading configuration file /data1_8t/user/yb/LLaMA-Factory/outputs/qwen2_5_vl_lora_sft/preprocessor_config.json
[WARNING|logging.py:329] 2025-03-08 13:06:57,774 >> Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.48, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.
[INFO|image_processing_base.py:434] 2025-03-08 13:06:57,775 >> Image processor Qwen2VLImageProcessor {
  "do_convert_rgb": true,
  "do_normalize": true,
  "do_rescale": true,
  "do_resize": true,
  "image_mean": [
    0.48145466,
    0.4578275,
    0.40821073
  ],
  "image_processor_type": "Qwen2VLImageProcessor",
  "image_std": [
    0.26862954,
    0.26130258,
    0.27577711
  ],
  "max_pixels": 12845056,
  "merge_size": 2,
  "min_pixels": 3136,
  "patch_size": 14,
  "processor_class": "Qwen2_5_VLProcessor",
  "resample": 3,
  "rescale_factor": 0.00392156862745098,
  "size": {
    "longest_edge": 12845056,
    "shortest_edge": 3136
  },
  "temporal_patch_size": 2
}

[INFO|tokenization_utils_base.py:2048] 2025-03-08 13:06:57,775 >> loading file vocab.json
[INFO|tokenization_utils_base.py:2048] 2025-03-08 13:06:57,775 >> loading file merges.txt
[INFO|tokenization_utils_base.py:2048] 2025-03-08 13:06:57,776 >> loading file tokenizer.json
[INFO|tokenization_utils_base.py:2048] 2025-03-08 13:06:57,776 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:2048] 2025-03-08 13:06:57,776 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:2048] 2025-03-08 13:06:57,776 >> loading file tokenizer_config.json
[INFO|tokenization_utils_base.py:2048] 2025-03-08 13:06:57,776 >> loading file chat_template.jinja
[INFO|tokenization_utils_base.py:2313] 2025-03-08 13:06:58,142 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[INFO|processing_utils.py:876] 2025-03-08 13:06:58,591 >> Processor Qwen2_5_VLProcessor:
- image_processor: Qwen2VLImageProcessor {
  "do_convert_rgb": true,
  "do_normalize": true,
  "do_rescale": true,
  "do_resize": true,
  "image_mean": [
    0.48145466,
    0.4578275,
    0.40821073
  ],
  "image_processor_type": "Qwen2VLImageProcessor",
  "image_std": [
    0.26862954,
    0.26130258,
    0.27577711
  ],
  "max_pixels": 12845056,
  "merge_size": 2,
  "min_pixels": 3136,
  "patch_size": 14,
  "processor_class": "Qwen2_5_VLProcessor",
  "resample": 3,
  "rescale_factor": 0.00392156862745098,
  "size": {
    "longest_edge": 12845056,
    "shortest_edge": 3136
  },
  "temporal_patch_size": 2
}

- tokenizer: Qwen2TokenizerFast(name_or_path='/data1_8t/user/yb/LLaMA-Factory/outputs/qwen2_5_vl_lora_sft', vocab_size=151643, model_max_length=2048, is_fast=True, padding_side='left', truncation_side='right', special_tokens={'eos_token': '<|im_end|>', 'pad_token': '<|endoftext|>', 'additional_special_tokens': ['<|im_start|>', '<|im_end|>', '<|object_ref_start|>', '<|object_ref_end|>', '<|box_start|>', '<|box_end|>', '<|quad_start|>', '<|quad_end|>', '<|vision_start|>', '<|vision_end|>', '<|vision_pad|>', '<|image_pad|>', '<|video_pad|>']}, clean_up_tokenization_spaces=False, added_tokens_decoder={
        151643: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
        151644: AddedToken("<|im_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
        151645: AddedToken("<|im_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
        151646: AddedToken("<|object_ref_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
        151647: AddedToken("<|object_ref_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
        151648: AddedToken("<|box_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
        151649: AddedToken("<|box_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
        151650: AddedToken("<|quad_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
        151651: AddedToken("<|quad_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
        151652: AddedToken("<|vision_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
        151653: AddedToken("<|vision_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
        151654: AddedToken("<|vision_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
        151655: AddedToken("<|image_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
        151656: AddedToken("<|video_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
        151657: AddedToken("<tool_call>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False),
        151658: AddedToken("</tool_call>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False),
        151659: AddedToken("<|fim_prefix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False),
        151660: AddedToken("<|fim_middle|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False),
        151661: AddedToken("<|fim_suffix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False),
        151662: AddedToken("<|fim_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False),
        151663: AddedToken("<|repo_name|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False),
        151664: AddedToken("<|file_sep|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False),
}
)

{
  "processor_class": "Qwen2_5_VLProcessor"
}

[INFO|2025-03-08 13:06:58] llamafactory.data.template:157 >> Add <|im_end|> to stop words.
[INFO|2025-03-08 13:06:58] llamafactory.data.loader:157 >> Loading dataset evalmuse_test_data.json...
training example:
input_ids:
[151644, 8948, 198, 2610, 525, 264, 10950, 17847, 13, 151645, 198, 151644, 872, 198, 151652, 151655, 151653, 2082, 55856, 279, 2701, 9934, 323, 2968, 264, 5456, 315, 220, 15, 311, 220, 20, 369, 279, 24629, 315, 279, 2661, 2168, 13, 758, 5256, 11, 23643, 279, 3204, 20844, 304, 279, 9934, 323, 2968, 264, 5456, 369, 279, 2392, 12579, 279, 2168, 624, 20566, 315, 279, 23694, 45209, 11, 14538, 48810, 11, 39034, 11, 2841, 594, 2311, 11, 43582, 13330, 11, 9906, 7987, 11, 18378, 1879, 11, 71670, 151645, 198, 151644, 77091, 198]
inputs:
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
<|vision_start|><|image_pad|><|vision_end|>Analyze the following prompt and give a score of 0 to 5 for the overlay of the given image. In addition, analyze the possible keywords in the prompt and give a score for the element matching the image.
attack of the pig giants, giant pigs, illustration, children's book, fictional drawing, bright colors, fantasy world, minimalist<|im_end|>
<|im_start|>assistant

label_ids:
[5035, 10405, 25, 2240, 11, 2392, 10405, 25, 314, 89500, 45209, 320, 47899, 1648, 2240, 11, 3359, 320, 7175, 1648, 2240, 11, 14538, 320, 9116, 1648, 2240, 11, 39034, 320, 9116, 1648, 2240, 11, 2841, 594, 2311, 320, 1700, 1648, 2240, 11, 43582, 13330, 320, 9116, 1648, 2240, 11, 9906, 7987, 320, 3423, 1648, 2240, 11, 18378, 1879, 320, 2527, 1648, 2240, 11, 71670, 320, 9116, 1648, 2240, 92, 151645, 198]
labels:
total_score: None, element_score: {pig giants (animal): None, attack (activity): None, giant (attribute): None, illustration (attribute): None, children's book (object): None, fictional drawing (attribute): None, bright colors (color): None, fantasy world (location): None, minimalist (attribute): None}<|im_end|>

INFO 03-08 13:11:40 __init__.py:207] Automatically detected platform cuda.
[INFO|configuration_utils.py:697] 2025-03-08 13:11:40,313 >> loading configuration file /data1_8t/user/yb/LLaMA-Factory/outputs/qwen2_5_vl_lora_sft/config.json
[INFO|configuration_utils.py:697] 2025-03-08 13:11:40,313 >> loading configuration file /data1_8t/user/yb/LLaMA-Factory/outputs/qwen2_5_vl_lora_sft/config.json
[INFO|configuration_utils.py:771] 2025-03-08 13:11:40,314 >> Model config Qwen2_5_VLConfig {
  "_name_or_path": "/data1_8t/user/yb/LLaMA-Factory/outputs/qwen2_5_vl_lora_sft",
  "architectures": [
    "Qwen2_5_VLForConditionalGeneration"
  ],
  "attention_dropout": 0.0,
  "bos_token_id": 151643,
  "eos_token_id": 151645,
  "hidden_act": "silu",
  "hidden_size": 3584,
  "image_token_id": 151655,
  "initializer_range": 0.02,
  "intermediate_size": 18944,
  "max_position_embeddings": 128000,
  "max_window_layers": 28,
  "model_type": "qwen2_5_vl",
  "num_attention_heads": 28,
  "num_hidden_layers": 28,
  "num_key_value_heads": 4,
  "rms_norm_eps": 1e-06,
  "rope_scaling": {
    "mrope_section": [
      16,
      24,
      24
    ],
    "rope_type": "default",
    "type": "default"
  },
  "rope_theta": 1000000.0,
  "sliding_window": 32768,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.49.0",
  "use_cache": true,
  "use_sliding_window": false,
  "video_token_id": 151656,
  "vision_config": {
    "hidden_size": 1280,
    "in_chans": 3,
    "model_type": "qwen2_5_vl",
    "spatial_patch_size": 14,
    "tokens_per_second": 2,
    "torch_dtype": "bfloat16"
  },
  "vision_end_token_id": 151653,
  "vision_start_token_id": 151652,
  "vision_token_id": 151654,
  "vocab_size": 152064
}

INFO 03-08 13:11:47 config.py:549] This model supports multiple tasks: {'classify', 'reward', 'embed', 'score', 'generate'}. Defaulting to 'generate'.
INFO 03-08 13:11:47 config.py:1382] Defaulting to use mp for distributed inference
WARNING 03-08 13:11:47 arg_utils.py:1197] The model has a long context length (128000). This may cause OOM errors during the initial memory profiling phase, or result in low performance due to small KV cache space. Consider setting --max-model-len to a smaller value.
INFO 03-08 13:11:47 llm_engine.py:234] Initializing a V0 LLM engine (v0.7.3) with config: model='/data1_8t/user/yb/LLaMA-Factory/outputs/qwen2_5_vl_lora_sft', speculative_config=None, tokenizer='/data1_8t/user/yb/LLaMA-Factory/outputs/qwen2_5_vl_lora_sft', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=128000, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=4, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=/data1_8t/user/yb/LLaMA-Factory/outputs/qwen2_5_vl_lora_sft, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=False, 
[INFO|tokenization_utils_base.py:2048] 2025-03-08 13:11:47,358 >> loading file vocab.json
[INFO|tokenization_utils_base.py:2048] 2025-03-08 13:11:47,358 >> loading file merges.txt
[INFO|tokenization_utils_base.py:2048] 2025-03-08 13:11:47,358 >> loading file tokenizer.json
[INFO|tokenization_utils_base.py:2048] 2025-03-08 13:11:47,358 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:2048] 2025-03-08 13:11:47,358 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:2048] 2025-03-08 13:11:47,358 >> loading file tokenizer_config.json
[INFO|tokenization_utils_base.py:2048] 2025-03-08 13:11:47,358 >> loading file chat_template.jinja
[INFO|tokenization_utils_base.py:2313] 2025-03-08 13:11:47,773 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[INFO|configuration_utils.py:1093] 2025-03-08 13:11:47,862 >> loading configuration file /data1_8t/user/yb/LLaMA-Factory/outputs/qwen2_5_vl_lora_sft/generation_config.json
[INFO|configuration_utils.py:1140] 2025-03-08 13:11:47,862 >> Generate config GenerationConfig {
  "bos_token_id": 151643,
  "do_sample": true,
  "eos_token_id": [
    151645,
    151643
  ],
  "pad_token_id": 151643,
  "repetition_penalty": 1.05,
  "temperature": 0.1,
  "top_k": 1,
  "top_p": 0.001
}

WARNING 03-08 13:11:47 utils.py:2128] CUDA was previously initialized. We must use the `spawn` multiprocessing start method. Setting VLLM_WORKER_MULTIPROC_METHOD to 'spawn'. See https://docs.vllm.ai/en/latest/getting_started/troubleshooting.html#python-multiprocessing for more information.
WARNING 03-08 13:11:47 multiproc_worker_utils.py:300] Reducing Torch parallelism from 48 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
INFO 03-08 13:11:47 custom_cache_manager.py:19] Setting Triton cache manager to: vllm.triton_utils.custom_cache_manager:CustomCacheManager
INFO 03-08 13:11:48 cuda.py:229] Using Flash Attention backend.
INFO 03-08 13:11:52 __init__.py:207] Automatically detected platform cuda.
INFO 03-08 13:11:52 __init__.py:207] Automatically detected platform cuda.
(VllmWorkerProcess pid=2100252) INFO 03-08 13:11:52 multiproc_worker_utils.py:229] Worker ready; awaiting tasks
(VllmWorkerProcess pid=2100251) INFO 03-08 13:11:52 multiproc_worker_utils.py:229] Worker ready; awaiting tasks
INFO 03-08 13:11:52 __init__.py:207] Automatically detected platform cuda.
(VllmWorkerProcess pid=2100253) INFO 03-08 13:11:52 multiproc_worker_utils.py:229] Worker ready; awaiting tasks
(VllmWorkerProcess pid=2100252) INFO 03-08 13:11:53 cuda.py:229] Using Flash Attention backend.
(VllmWorkerProcess pid=2100251) INFO 03-08 13:11:53 cuda.py:229] Using Flash Attention backend.
(VllmWorkerProcess pid=2100253) INFO 03-08 13:11:53 cuda.py:229] Using Flash Attention backend.
(VllmWorkerProcess pid=2100252) INFO 03-08 13:11:54 utils.py:916] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=2100253) INFO 03-08 13:11:54 utils.py:916] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=2100251) INFO 03-08 13:11:54 utils.py:916] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=2100252) INFO 03-08 13:11:54 pynccl.py:69] vLLM is using nccl==2.21.5
(VllmWorkerProcess pid=2100253) INFO 03-08 13:11:54 pynccl.py:69] vLLM is using nccl==2.21.5
(VllmWorkerProcess pid=2100251) INFO 03-08 13:11:54 pynccl.py:69] vLLM is using nccl==2.21.5
INFO 03-08 13:11:54 utils.py:916] Found nccl from library libnccl.so.2
INFO 03-08 13:11:54 pynccl.py:69] vLLM is using nccl==2.21.5
(VllmWorkerProcess pid=2100251) WARNING 03-08 13:11:54 custom_all_reduce.py:136] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(VllmWorkerProcess pid=2100253) WARNING 03-08 13:11:54 custom_all_reduce.py:136] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
WARNING 03-08 13:11:54 custom_all_reduce.py:136] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(VllmWorkerProcess pid=2100252) WARNING 03-08 13:11:54 custom_all_reduce.py:136] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
INFO 03-08 13:11:54 shm_broadcast.py:258] vLLM message queue communication handle: Handle(connect_ip='127.0.0.1', local_reader_ranks=[1, 2, 3], buffer_handle=(3, 4194304, 6, 'psm_647dc9ff'), local_subscribe_port=38071, remote_subscribe_port=None)
INFO 03-08 13:11:54 model_runner.py:1110] Starting to load model /data1_8t/user/yb/LLaMA-Factory/outputs/qwen2_5_vl_lora_sft...
(VllmWorkerProcess pid=2100252) INFO 03-08 13:11:54 model_runner.py:1110] Starting to load model /data1_8t/user/yb/LLaMA-Factory/outputs/qwen2_5_vl_lora_sft...
(VllmWorkerProcess pid=2100253) INFO 03-08 13:11:54 model_runner.py:1110] Starting to load model /data1_8t/user/yb/LLaMA-Factory/outputs/qwen2_5_vl_lora_sft...
(VllmWorkerProcess pid=2100251) INFO 03-08 13:11:54 model_runner.py:1110] Starting to load model /data1_8t/user/yb/LLaMA-Factory/outputs/qwen2_5_vl_lora_sft...
(VllmWorkerProcess pid=2100252) WARNING 03-08 13:11:55 vision.py:94] Current `vllm-flash-attn` has a bug inside vision module, so we use xformers backend instead. You can run `pip install flash-attn` to use flash-attention backend.
(VllmWorkerProcess pid=2100253) WARNING 03-08 13:11:55 vision.py:94] Current `vllm-flash-attn` has a bug inside vision module, so we use xformers backend instead. You can run `pip install flash-attn` to use flash-attention backend.
(VllmWorkerProcess pid=2100251) WARNING 03-08 13:11:55 vision.py:94] Current `vllm-flash-attn` has a bug inside vision module, so we use xformers backend instead. You can run `pip install flash-attn` to use flash-attention backend.
WARNING 03-08 13:11:55 vision.py:94] Current `vllm-flash-attn` has a bug inside vision module, so we use xformers backend instead. You can run `pip install flash-attn` to use flash-attention backend.
INFO 03-08 13:11:55 config.py:3054] cudagraph sizes specified by model runner [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256] is overridden by config [256, 128, 2, 1, 4, 136, 8, 144, 16, 152, 24, 160, 32, 168, 40, 176, 48, 184, 56, 192, 64, 200, 72, 208, 80, 216, 88, 120, 224, 96, 232, 104, 240, 112, 248]
(VllmWorkerProcess pid=2100251) INFO 03-08 13:11:55 config.py:3054] cudagraph sizes specified by model runner [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256] is overridden by config [256, 128, 2, 1, 4, 136, 8, 144, 16, 152, 24, 160, 32, 168, 40, 176, 48, 184, 56, 192, 64, 200, 72, 208, 80, 216, 88, 120, 224, 96, 232, 104, 240, 112, 248]
(VllmWorkerProcess pid=2100252) INFO 03-08 13:11:55 config.py:3054] cudagraph sizes specified by model runner [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256] is overridden by config [256, 128, 2, 1, 4, 136, 8, 144, 16, 152, 24, 160, 32, 168, 40, 176, 48, 184, 56, 192, 64, 200, 72, 208, 80, 216, 88, 120, 224, 96, 232, 104, 240, 112, 248]
(VllmWorkerProcess pid=2100253) INFO 03-08 13:11:55 config.py:3054] cudagraph sizes specified by model runner [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256] is overridden by config [256, 128, 2, 1, 4, 136, 8, 144, 16, 152, 24, 160, 32, 168, 40, 176, 48, 184, 56, 192, 64, 200, 72, 208, 80, 216, 88, 120, 224, 96, 232, 104, 240, 112, 248]
Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  25% Completed | 1/4 [00:10<00:31, 10.57s/it]
Loading safetensors checkpoint shards:  50% Completed | 2/4 [00:18<00:17,  8.93s/it]
Loading safetensors checkpoint shards:  75% Completed | 3/4 [00:19<00:05,  5.46s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:24<00:00,  5.29s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:24<00:00,  6.18s/it]

(VllmWorkerProcess pid=2100253) INFO 03-08 13:12:20 model_runner.py:1115] Loading model weights took 3.9972 GB
(VllmWorkerProcess pid=2100252) INFO 03-08 13:12:20 model_runner.py:1115] Loading model weights took 3.9972 GB
(VllmWorkerProcess pid=2100251) INFO 03-08 13:12:20 model_runner.py:1115] Loading model weights took 3.9972 GB
INFO 03-08 13:12:20 model_runner.py:1115] Loading model weights took 3.9972 GB
[INFO|tokenization_utils_base.py:2048] 2025-03-08 13:12:20,841 >> loading file vocab.json
[INFO|tokenization_utils_base.py:2048] 2025-03-08 13:12:20,841 >> loading file merges.txt
[INFO|tokenization_utils_base.py:2048] 2025-03-08 13:12:20,841 >> loading file tokenizer.json
[INFO|tokenization_utils_base.py:2048] 2025-03-08 13:12:20,841 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:2048] 2025-03-08 13:12:20,841 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:2048] 2025-03-08 13:12:20,841 >> loading file tokenizer_config.json
[INFO|tokenization_utils_base.py:2048] 2025-03-08 13:12:20,841 >> loading file chat_template.jinja
[INFO|tokenization_utils_base.py:2313] 2025-03-08 13:12:21,228 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
(VllmWorkerProcess pid=2100252) Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.48, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.
(VllmWorkerProcess pid=2100252) WARNING 03-08 13:12:21 model_runner.py:1288] Computed max_num_seqs (min(256, 128000 // 131072)) to be less than 1. Setting it to the minimum value of 1.
(VllmWorkerProcess pid=2100251) Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.48, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.
(VllmWorkerProcess pid=2100251) WARNING 03-08 13:12:21 model_runner.py:1288] Computed max_num_seqs (min(256, 128000 // 131072)) to be less than 1. Setting it to the minimum value of 1.
[INFO|image_processing_base.py:379] 2025-03-08 13:12:21,328 >> loading configuration file /data1_8t/user/yb/LLaMA-Factory/outputs/qwen2_5_vl_lora_sft/preprocessor_config.json
[INFO|image_processing_base.py:434] 2025-03-08 13:12:21,329 >> Image processor Qwen2VLImageProcessor {
  "do_convert_rgb": true,
  "do_normalize": true,
  "do_rescale": true,
  "do_resize": true,
  "image_mean": [
    0.48145466,
    0.4578275,
    0.40821073
  ],
  "image_processor_type": "Qwen2VLImageProcessor",
  "image_std": [
    0.26862954,
    0.26130258,
    0.27577711
  ],
  "max_pixels": 12845056,
  "merge_size": 2,
  "min_pixels": 3136,
  "patch_size": 14,
  "processor_class": "Qwen2_5_VLProcessor",
  "resample": 3,
  "rescale_factor": 0.00392156862745098,
  "size": {
    "longest_edge": 12845056,
    "shortest_edge": 3136
  },
  "temporal_patch_size": 2
}

WARNING 03-08 13:12:21 model_runner.py:1288] Computed max_num_seqs (min(256, 128000 // 131072)) to be less than 1. Setting it to the minimum value of 1.
[INFO|image_processing_base.py:379] 2025-03-08 13:12:21,329 >> loading configuration file /data1_8t/user/yb/LLaMA-Factory/outputs/qwen2_5_vl_lora_sft/preprocessor_config.json
[INFO|image_processing_base.py:434] 2025-03-08 13:12:21,329 >> Image processor Qwen2VLImageProcessor {
  "do_convert_rgb": true,
  "do_normalize": true,
  "do_rescale": true,
  "do_resize": true,
  "image_mean": [
    0.48145466,
    0.4578275,
    0.40821073
  ],
  "image_processor_type": "Qwen2VLImageProcessor",
  "image_std": [
    0.26862954,
    0.26130258,
    0.27577711
  ],
  "max_pixels": 12845056,
  "merge_size": 2,
  "min_pixels": 3136,
  "patch_size": 14,
  "processor_class": "Qwen2_5_VLProcessor",
  "resample": 3,
  "rescale_factor": 0.00392156862745098,
  "size": {
    "longest_edge": 12845056,
    "shortest_edge": 3136
  },
  "temporal_patch_size": 2
}

[INFO|tokenization_utils_base.py:2048] 2025-03-08 13:12:21,330 >> loading file vocab.json
[INFO|tokenization_utils_base.py:2048] 2025-03-08 13:12:21,330 >> loading file merges.txt
[INFO|tokenization_utils_base.py:2048] 2025-03-08 13:12:21,330 >> loading file tokenizer.json
[INFO|tokenization_utils_base.py:2048] 2025-03-08 13:12:21,330 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:2048] 2025-03-08 13:12:21,330 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:2048] 2025-03-08 13:12:21,330 >> loading file tokenizer_config.json
[INFO|tokenization_utils_base.py:2048] 2025-03-08 13:12:21,330 >> loading file chat_template.jinja
(VllmWorkerProcess pid=2100253) Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.48, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.
(VllmWorkerProcess pid=2100253) WARNING 03-08 13:12:21 model_runner.py:1288] Computed max_num_seqs (min(256, 128000 // 131072)) to be less than 1. Setting it to the minimum value of 1.
[INFO|tokenization_utils_base.py:2313] 2025-03-08 13:12:22,150 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[INFO|processing_utils.py:876] 2025-03-08 13:12:22,675 >> Processor Qwen2_5_VLProcessor:
- image_processor: Qwen2VLImageProcessor {
  "do_convert_rgb": true,
  "do_normalize": true,
  "do_rescale": true,
  "do_resize": true,
  "image_mean": [
    0.48145466,
    0.4578275,
    0.40821073
  ],
  "image_processor_type": "Qwen2VLImageProcessor",
  "image_std": [
    0.26862954,
    0.26130258,
    0.27577711
  ],
  "max_pixels": 12845056,
  "merge_size": 2,
  "min_pixels": 3136,
  "patch_size": 14,
  "processor_class": "Qwen2_5_VLProcessor",
  "resample": 3,
  "rescale_factor": 0.00392156862745098,
  "size": {
    "longest_edge": 12845056,
    "shortest_edge": 3136
  },
  "temporal_patch_size": 2
}

- tokenizer: CachedQwen2TokenizerFast(name_or_path='/data1_8t/user/yb/LLaMA-Factory/outputs/qwen2_5_vl_lora_sft', vocab_size=151643, model_max_length=2048, is_fast=True, padding_side='left', truncation_side='left', special_tokens={'eos_token': '<|im_end|>', 'pad_token': '<|endoftext|>', 'additional_special_tokens': ['<|im_start|>', '<|im_end|>', '<|object_ref_start|>', '<|object_ref_end|>', '<|box_start|>', '<|box_end|>', '<|quad_start|>', '<|quad_end|>', '<|vision_start|>', '<|vision_end|>', '<|vision_pad|>', '<|image_pad|>', '<|video_pad|>']}, clean_up_tokenization_spaces=False, added_tokens_decoder={
        151643: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
        151644: AddedToken("<|im_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
        151645: AddedToken("<|im_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
        151646: AddedToken("<|object_ref_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
        151647: AddedToken("<|object_ref_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
        151648: AddedToken("<|box_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
        151649: AddedToken("<|box_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
        151650: AddedToken("<|quad_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
        151651: AddedToken("<|quad_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
        151652: AddedToken("<|vision_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
        151653: AddedToken("<|vision_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
        151654: AddedToken("<|vision_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
        151655: AddedToken("<|image_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
        151656: AddedToken("<|video_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
        151657: AddedToken("<tool_call>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False),
        151658: AddedToken("</tool_call>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False),
        151659: AddedToken("<|fim_prefix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False),
        151660: AddedToken("<|fim_middle|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False),
        151661: AddedToken("<|fim_suffix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False),
        151662: AddedToken("<|fim_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False),
        151663: AddedToken("<|repo_name|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False),
        151664: AddedToken("<|file_sep|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False),
}
)

{
  "processor_class": "Qwen2_5_VLProcessor"
}

[WARNING|logging.py:329] 2025-03-08 13:12:29,201 >> It looks like you are trying to rescale already rescaled images. If the input images have pixel values between 0 and 1, set `do_rescale=False` to avoid rescaling them again.
(VllmWorkerProcess pid=2100252) It looks like you are trying to rescale already rescaled images. If the input images have pixel values between 0 and 1, set `do_rescale=False` to avoid rescaling them again.
(VllmWorkerProcess pid=2100251) It looks like you are trying to rescale already rescaled images. If the input images have pixel values between 0 and 1, set `do_rescale=False` to avoid rescaling them again.
(VllmWorkerProcess pid=2100253) It looks like you are trying to rescale already rescaled images. If the input images have pixel values between 0 and 1, set `do_rescale=False` to avoid rescaling them again.
[WARNING|tokenization_utils_base.py:3945] 2025-03-08 13:12:39,140 >> Token indices sequence length is longer than the specified maximum sequence length for this model (131072 > 2048). Running this sequence through the model will result in indexing errors
(VllmWorkerProcess pid=2100252) Token indices sequence length is longer than the specified maximum sequence length for this model (131072 > 2048). Running this sequence through the model will result in indexing errors
(VllmWorkerProcess pid=2100251) Token indices sequence length is longer than the specified maximum sequence length for this model (131072 > 2048). Running this sequence through the model will result in indexing errors
(VllmWorkerProcess pid=2100253) Token indices sequence length is longer than the specified maximum sequence length for this model (131072 > 2048). Running this sequence through the model will result in indexing errors
WARNING 03-08 13:12:44 profiling.py:192] The context length (128000) of the model is too short to hold the multi-modal embeddings in the worst case (131072 tokens in total, out of which {'image': 65536, 'video': 65536} are reserved for multi-modal embeddings). This may cause certain multi-modal inputs to fail during inference, even when the input text is short. To avoid this, you should increase `max_model_len`, reduce `max_num_seqs`, and/or reduce `mm_counts`.
(VllmWorkerProcess pid=2100251) WARNING 03-08 13:12:47 profiling.py:192] The context length (128000) of the model is too short to hold the multi-modal embeddings in the worst case (131072 tokens in total, out of which {'image': 65536, 'video': 65536} are reserved for multi-modal embeddings). This may cause certain multi-modal inputs to fail during inference, even when the input text is short. To avoid this, you should increase `max_model_len`, reduce `max_num_seqs`, and/or reduce `mm_counts`.
(VllmWorkerProcess pid=2100252) WARNING 03-08 13:12:47 profiling.py:192] The context length (128000) of the model is too short to hold the multi-modal embeddings in the worst case (131072 tokens in total, out of which {'image': 65536, 'video': 65536} are reserved for multi-modal embeddings). This may cause certain multi-modal inputs to fail during inference, even when the input text is short. To avoid this, you should increase `max_model_len`, reduce `max_num_seqs`, and/or reduce `mm_counts`.
(VllmWorkerProcess pid=2100253) WARNING 03-08 13:12:48 profiling.py:192] The context length (128000) of the model is too short to hold the multi-modal embeddings in the worst case (131072 tokens in total, out of which {'image': 65536, 'video': 65536} are reserved for multi-modal embeddings). This may cause certain multi-modal inputs to fail during inference, even when the input text is short. To avoid this, you should increase `max_model_len`, reduce `max_num_seqs`, and/or reduce `mm_counts`.
(VllmWorkerProcess pid=2100251) INFO 03-08 13:13:07 worker.py:267] Memory profiling takes 46.98 seconds
(VllmWorkerProcess pid=2100251) INFO 03-08 13:13:07 worker.py:267] the current vLLM instance can use total_gpu_memory (23.64GiB) x gpu_memory_utilization (0.90) = 21.28GiB
(VllmWorkerProcess pid=2100251) INFO 03-08 13:13:07 worker.py:267] model weights take 4.00GiB; non_torch_memory takes 0.25GiB; PyTorch activation peak memory takes 7.68GiB; the rest of the memory reserved for KV Cache is 9.35GiB.
(VllmWorkerProcess pid=2100253) INFO 03-08 13:13:07 worker.py:267] Memory profiling takes 46.63 seconds
(VllmWorkerProcess pid=2100253) INFO 03-08 13:13:07 worker.py:267] the current vLLM instance can use total_gpu_memory (23.64GiB) x gpu_memory_utilization (0.90) = 21.28GiB
(VllmWorkerProcess pid=2100253) INFO 03-08 13:13:07 worker.py:267] model weights take 4.00GiB; non_torch_memory takes 0.25GiB; PyTorch activation peak memory takes 7.68GiB; the rest of the memory reserved for KV Cache is 9.35GiB.
(VllmWorkerProcess pid=2100252) INFO 03-08 13:13:07 worker.py:267] Memory profiling takes 47.02 seconds
(VllmWorkerProcess pid=2100252) INFO 03-08 13:13:07 worker.py:267] the current vLLM instance can use total_gpu_memory (23.64GiB) x gpu_memory_utilization (0.90) = 21.28GiB
(VllmWorkerProcess pid=2100252) INFO 03-08 13:13:07 worker.py:267] model weights take 4.00GiB; non_torch_memory takes 0.25GiB; PyTorch activation peak memory takes 7.68GiB; the rest of the memory reserved for KV Cache is 9.35GiB.
INFO 03-08 13:13:07 worker.py:267] Memory profiling takes 47.12 seconds
INFO 03-08 13:13:07 worker.py:267] the current vLLM instance can use total_gpu_memory (23.64GiB) x gpu_memory_utilization (0.90) = 21.28GiB
INFO 03-08 13:13:07 worker.py:267] model weights take 4.00GiB; non_torch_memory takes 0.25GiB; PyTorch activation peak memory takes 7.68GiB; the rest of the memory reserved for KV Cache is 9.35GiB.
INFO 03-08 13:13:08 executor_base.py:111] # cuda blocks: 43777, # CPU blocks: 18724
INFO 03-08 13:13:08 executor_base.py:116] Maximum concurrency for 128000 tokens per request: 5.47x
INFO 03-08 13:13:11 model_runner.py:1434] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
Capturing CUDA graph shapes:   0%|                                               | 0/35 [00:00<?, ?it/s](VllmWorkerProcess pid=2100252) INFO 03-08 13:13:12 model_runner.py:1434] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
(VllmWorkerProcess pid=2100253) INFO 03-08 13:13:12 model_runner.py:1434] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
(VllmWorkerProcess pid=2100251) INFO 03-08 13:13:12 model_runner.py:1434] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
Capturing CUDA graph shapes: 100%|██████████████████████████████████████| 35/35 [00:24<00:00,  1.44it/s]INFO 03-08 13:13:36 model_runner.py:1562] Graph capturing finished in 24 secs, took 0.52 GiB
(VllmWorkerProcess pid=2100253) INFO 03-08 13:13:36 model_runner.py:1562] Graph capturing finished in 23 secs, took 0.52 GiB
(VllmWorkerProcess pid=2100251) INFO 03-08 13:13:36 model_runner.py:1562] Graph capturing finished in 24 secs, took 0.52 GiB
(VllmWorkerProcess pid=2100252) INFO 03-08 13:13:36 model_runner.py:1562] Graph capturing finished in 24 secs, took 0.52 GiB
INFO 03-08 13:13:36 llm_engine.py:436] init engine (profile, create kv cache, warmup model) took 75.59 seconds
[INFO|image_processing_base.py:379] 2025-03-08 13:13:36,970 >> loading configuration file /data1_8t/user/yb/LLaMA-Factory/outputs/qwen2_5_vl_lora_sft/preprocessor_config.json
[INFO|image_processing_base.py:434] 2025-03-08 13:13:36,970 >> Image processor Qwen2VLImageProcessor {
  "do_convert_rgb": true,
  "do_normalize": true,
  "do_rescale": true,
  "do_resize": true,
  "image_mean": [
    0.48145466,
    0.4578275,
    0.40821073
  ],
  "image_processor_type": "Qwen2VLImageProcessor",
  "image_std": [
    0.26862954,
    0.26130258,
    0.27577711
  ],
  "max_pixels": 12845056,
  "merge_size": 2,
  "min_pixels": 3136,
  "patch_size": 14,
  "processor_class": "Qwen2_5_VLProcessor",
  "resample": 3,
  "rescale_factor": 0.00392156862745098,
  "size": {
    "longest_edge": 12845056,
    "shortest_edge": 3136
  },
  "temporal_patch_size": 2
}

[INFO|tokenization_utils_base.py:2048] 2025-03-08 13:13:36,971 >> loading file vocab.json
[INFO|tokenization_utils_base.py:2048] 2025-03-08 13:13:36,971 >> loading file merges.txt
[INFO|tokenization_utils_base.py:2048] 2025-03-08 13:13:36,971 >> loading file tokenizer.json
[INFO|tokenization_utils_base.py:2048] 2025-03-08 13:13:36,971 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:2048] 2025-03-08 13:13:36,971 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:2048] 2025-03-08 13:13:36,971 >> loading file tokenizer_config.json
[INFO|tokenization_utils_base.py:2048] 2025-03-08 13:13:36,971 >> loading file chat_template.jinja
[INFO|tokenization_utils_base.py:2313] 2025-03-08 13:13:37,876 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[INFO|processing_utils.py:876] 2025-03-08 13:13:38,346 >> Processor Qwen2_5_VLProcessor:
- image_processor: Qwen2VLImageProcessor {
  "do_convert_rgb": true,
  "do_normalize": true,
  "do_rescale": true,
  "do_resize": true,
  "image_mean": [
    0.48145466,
    0.4578275,
    0.40821073
  ],
  "image_processor_type": "Qwen2VLImageProcessor",
  "image_std": [
    0.26862954,
    0.26130258,
    0.27577711
  ],
  "max_pixels": 12845056,
  "merge_size": 2,
  "min_pixels": 3136,
  "patch_size": 14,
  "processor_class": "Qwen2_5_VLProcessor",
  "resample": 3,
  "rescale_factor": 0.00392156862745098,
  "size": {
    "longest_edge": 12845056,
    "shortest_edge": 3136
  },
  "temporal_patch_size": 2
}

- tokenizer: CachedQwen2TokenizerFast(name_or_path='/data1_8t/user/yb/LLaMA-Factory/outputs/qwen2_5_vl_lora_sft', vocab_size=151643, model_max_length=2048, is_fast=True, padding_side='left', truncation_side='left', special_tokens={'eos_token': '<|im_end|>', 'pad_token': '<|endoftext|>', 'additional_special_tokens': ['<|im_start|>', '<|im_end|>', '<|object_ref_start|>', '<|object_ref_end|>', '<|box_start|>', '<|box_end|>', '<|quad_start|>', '<|quad_end|>', '<|vision_start|>', '<|vision_end|>', '<|vision_pad|>', '<|image_pad|>', '<|video_pad|>']}, clean_up_tokenization_spaces=False, added_tokens_decoder={
        151643: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
        151644: AddedToken("<|im_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
        151645: AddedToken("<|im_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
        151646: AddedToken("<|object_ref_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
        151647: AddedToken("<|object_ref_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
        151648: AddedToken("<|box_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
        151649: AddedToken("<|box_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
        151650: AddedToken("<|quad_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
        151651: AddedToken("<|quad_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
        151652: AddedToken("<|vision_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
        151653: AddedToken("<|vision_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
        151654: AddedToken("<|vision_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
        151655: AddedToken("<|image_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
        151656: AddedToken("<|video_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
        151657: AddedToken("<tool_call>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False),
        151658: AddedToken("</tool_call>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False),
        151659: AddedToken("<|fim_prefix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False),
        151660: AddedToken("<|fim_middle|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False),
        151661: AddedToken("<|fim_suffix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False),
        151662: AddedToken("<|fim_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False),
        151663: AddedToken("<|repo_name|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False),
        151664: AddedToken("<|file_sep|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False),
}
)

{
  "processor_class": "Qwen2_5_VLProcessor"
}

^C(VllmWorkerProcess pid=2100251) INFO 03-08 13:21:58 multiproc_worker_utils.py:253] Worker exiting
(VllmWorkerProcess pid=2100252) INFO 03-08 13:21:58 multiproc_worker_utils.py:253] Worker exiting
(VllmWorkerProcess pid=2100253) INFO 03-08 13:21:58 multiproc_worker_utils.py:253] Worker exiting
[rank0]: Traceback (most recent call last):
[rank0]:   File "/data1_8t/user/yb/LLaMA-Factory/scripts/vllm_infer.py", line 159, in <module>
[rank0]:     fire.Fire(vllm_infer)
[rank0]:   File "/data1_8t/user/yb/anaconda3/envs/factory/lib/python3.11/site-packages/fire/core.py", line 135, in Fire
[rank0]:     component_trace = _Fire(component, args, parsed_flag_args, context, name)
[rank0]:                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/data1_8t/user/yb/anaconda3/envs/factory/lib/python3.11/site-packages/fire/core.py", line 468, in _Fire
[rank0]:     component, remaining_args = _CallAndUpdateTrace(
[rank0]:                                 ^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/data1_8t/user/yb/anaconda3/envs/factory/lib/python3.11/site-packages/fire/core.py", line 684, in _CallAndUpdateTrace
[rank0]:     component = fn(*varargs, **kwargs)
[rank0]:                 ^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/data1_8t/user/yb/LLaMA-Factory/scripts/vllm_infer.py", line 136, in vllm_infer
[rank0]:     results = LLM(**engine_args).generate(inputs, sampling_params, lora_request=lora_request)
[rank0]:               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/data1_8t/user/yb/anaconda3/envs/factory/lib/python3.11/site-packages/vllm/utils.py", line 1057, in inner
[rank0]:     return fn(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/data1_8t/user/yb/anaconda3/envs/factory/lib/python3.11/site-packages/vllm/entrypoints/llm.py", line 461, in generate
[rank0]:     self._validate_and_add_requests(
[rank0]:   File "/data1_8t/user/yb/anaconda3/envs/factory/lib/python3.11/site-packages/vllm/entrypoints/llm.py", line 1330, in _validate_and_add_requests
[rank0]:     self._add_request(
[rank0]:   File "/data1_8t/user/yb/anaconda3/envs/factory/lib/python3.11/site-packages/vllm/entrypoints/llm.py", line 1348, in _add_request
[rank0]:     self.llm_engine.add_request(
[rank0]:   File "/data1_8t/user/yb/anaconda3/envs/factory/lib/python3.11/site-packages/vllm/utils.py", line 1057, in inner
[rank0]:     return fn(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/data1_8t/user/yb/anaconda3/envs/factory/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 756, in add_request
[rank0]:     preprocessed_inputs = self.input_preprocessor.preprocess(
[rank0]:                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/data1_8t/user/yb/anaconda3/envs/factory/lib/python3.11/site-packages/vllm/inputs/preprocess.py", line 762, in preprocess
[rank0]:     return self._process_decoder_only_prompt(
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/data1_8t/user/yb/anaconda3/envs/factory/lib/python3.11/site-packages/vllm/inputs/preprocess.py", line 711, in _process_decoder_only_prompt
[rank0]:     prompt_comps = self._prompt_to_llm_inputs(
[rank0]:                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/data1_8t/user/yb/anaconda3/envs/factory/lib/python3.11/site-packages/vllm/inputs/preprocess.py", line 343, in _prompt_to_llm_inputs
[rank0]:     return self._process_multimodal(
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/data1_8t/user/yb/anaconda3/envs/factory/lib/python3.11/site-packages/vllm/inputs/preprocess.py", line 273, in _process_multimodal
[rank0]:     return mm_processor.apply(prompt, mm_data, mm_processor_kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/data1_8t/user/yb/anaconda3/envs/factory/lib/python3.11/site-packages/vllm/multimodal/processing.py", line 1239, in apply
[rank0]:     ) = self._cached_apply_hf_processor(
[rank0]:         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/data1_8t/user/yb/anaconda3/envs/factory/lib/python3.11/site-packages/vllm/multimodal/processing.py", line 1031, in _cached_apply_hf_processor
[rank0]:     ) = self._apply_hf_processor_main(
[rank0]:         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/data1_8t/user/yb/anaconda3/envs/factory/lib/python3.11/site-packages/vllm/multimodal/processing.py", line 976, in _apply_hf_processor_main
[rank0]:     mm_kwargs = self._apply_hf_processor_mm_only(
[rank0]:                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/data1_8t/user/yb/anaconda3/envs/factory/lib/python3.11/site-packages/vllm/multimodal/processing.py", line 937, in _apply_hf_processor_mm_only
[rank0]:     _, mm_kwargs, _ = self._apply_hf_processor_text_mm(
[rank0]:                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/data1_8t/user/yb/anaconda3/envs/factory/lib/python3.11/site-packages/vllm/multimodal/processing.py", line 865, in _apply_hf_processor_text_mm
[rank0]:     processed_data = self._call_hf_processor(
[rank0]:                      ^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/data1_8t/user/yb/anaconda3/envs/factory/lib/python3.11/site-packages/vllm/model_executor/models/qwen2_vl.py", line 985, in _call_hf_processor
[rank0]:     return self.info.ctx.call_hf_processor(
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/data1_8t/user/yb/anaconda3/envs/factory/lib/python3.11/site-packages/vllm/inputs/registry.py", line 167, in call_hf_processor
[rank0]:     return hf_processor(**data, **merged_kwargs, return_tensors="pt")
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/data1_8t/user/yb/anaconda3/envs/factory/lib/python3.11/site-packages/transformers/models/qwen2_5_vl/processing_qwen2_5_vl.py", line 123, in __call__
[rank0]:     image_inputs = self.image_processor(images=images, videos=None, **output_kwargs["images_kwargs"])
[rank0]:                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/data1_8t/user/yb/anaconda3/envs/factory/lib/python3.11/site-packages/transformers/image_processing_utils.py", line 42, in __call__
[rank0]:     return self.preprocess(images, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/data1_8t/user/yb/anaconda3/envs/factory/lib/python3.11/site-packages/transformers/models/qwen2_vl/image_processing_qwen2_vl.py", line 375, in preprocess
[rank0]:     patches, image_grid_thw = self._preprocess(
[rank0]:                               ^^^^^^^^^^^^^^^^^
[rank0]:   File "/data1_8t/user/yb/anaconda3/envs/factory/lib/python3.11/site-packages/transformers/models/qwen2_vl/image_processing_qwen2_vl.py", line 253, in _preprocess
[rank0]:     patches = np.concatenate([patches, repeats], axis=0)
[rank0]:               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: KeyboardInterrupt
[rank0]:[W308 13:22:05.759240212 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present,  but this warning has only been added since PyTorch 2.4 (function operator())
/data1_8t/user/yb/anaconda3/envs/factory/lib/python3.11/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '```


### Others

After manually interrupting the run with Ctrl+C, the traceback suggests it was stuck in image preprocessing.
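For scale, here is a rough back-of-the-envelope sketch using only the values printed in the preprocessor config above (`max_pixels`, `patch_size`, `merge_size`). It assumes each patch covers `patch_size` x `patch_size` pixels and that `merge_size` x `merge_size` patches are merged into one visual token. It ignores aspect-ratio rounding and the temporal dimension, so treat the numbers as an upper-bound estimate only. The function name `visual_token_budget` and the smaller pixel cap are hypothetical, chosen just for comparison.

```python
def visual_token_budget(max_pixels: int, patch_size: int = 14, merge_size: int = 2) -> tuple[int, int]:
    """Estimate (max_patches, max_merged_tokens) for one image under a pixel budget.

    Assumption: one patch = patch_size x patch_size pixels, and
    merge_size x merge_size patches collapse into one visual token.
    """
    patches = max_pixels // (patch_size * patch_size)
    tokens = patches // (merge_size * merge_size)
    return patches, tokens


# Value taken from the logged preprocessor config.
logged_cap = visual_token_budget(12845056)

# Hypothetical smaller cap (1024 blocks of 28 x 28 pixels), for comparison only.
smaller_cap = visual_token_budget(1024 * 28 * 28)

print("logged max_pixels  ->", logged_cap)
print("smaller max_pixels ->", smaller_cap)
```

If the images in the dataset are large, the per-image preprocessing cost under the logged cap could be much higher than under a smaller one, so it may be worth checking the actual image sizes before concluding that the engine itself is hanging.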

2019211753 added bug Something isn't working pending This problem is yet to be addressed labels Mar 8, 2025
