[Bug]: Short prompts -> !!!!!!! output from Qwen2.5-32B-Instruct-GPTQ-Int4 w/ROCm #14715

bjj opened this issue Mar 13, 2025 · 0 comments
bjj commented Mar 13, 2025

Your current environment

The output of `python collect_env.py`
INFO 03-12 18:02:27 [__init__.py:256] Automatically detected platform rocm.
Collecting environment information...
PyTorch version: 2.7.0.dev20250221+rocm6.3
Is debug build: False
CUDA used to build PyTorch: N/A
ROCM used to build PyTorch: 6.3.42131-fa1d09cbd

OS: Ubuntu 24.04.2 LTS (x86_64)
GCC version: (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0
Clang version: Could not collect
CMake version: version 3.31.4
Libc version: glibc-2.39

Python version: 3.12.3 (main, Feb  4 2025, 14:48:35) [GCC 13.3.0] (64-bit runtime)
Python platform: Linux-6.11.0-17-generic-x86_64-with-glibc2.39
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: AMD Instinct MI100 (gfx908:sramecc+:xnack-)
Nvidia driver version: Could not collect
cuDNN version: Could not collect
HIP runtime version: 6.3.42131
MIOpen runtime version: 3.3.0
Is XNNPACK available: True

CPU:
Architecture:                         x86_64
CPU op-mode(s):                       32-bit, 64-bit
Address sizes:                        39 bits physical, 48 bits virtual
Byte Order:                           Little Endian
CPU(s):                               4
On-line CPU(s) list:                  0-3
Vendor ID:                            GenuineIntel
Model name:                           Intel(R) Core(TM) i5-6600K CPU @ 3.50GHz
CPU family:                           6
Model:                                94
Thread(s) per core:                   1
Core(s) per socket:                   4
Socket(s):                            1
Stepping:                             3
CPU(s) scaling MHz:                   100%
CPU max MHz:                          3900.0000
CPU min MHz:                          800.0000
BogoMIPS:                             6999.82
Flags:                                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault pti ssbd ibrs ibpb stibp tpr_shadow flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx rdseed adx smap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp vnmi md_clear flush_l1d arch_capabilities
Virtualization:                       VT-x
L1d cache:                            128 KiB (4 instances)
L1i cache:                            128 KiB (4 instances)
L2 cache:                             1 MiB (4 instances)
L3 cache:                             6 MiB (1 instance)
NUMA node(s):                         1
NUMA node0 CPU(s):                    0-3
Vulnerability Gather data sampling:   Vulnerable: No microcode
Vulnerability Itlb multihit:          KVM: Mitigation: VMX disabled
Vulnerability L1tf:                   Mitigation; PTE Inversion; VMX conditional cache flushes, SMT disabled
Vulnerability Mds:                    Mitigation; Clear CPU buffers; SMT disabled
Vulnerability Meltdown:               Mitigation; PTI
Vulnerability Mmio stale data:        Mitigation; Clear CPU buffers; SMT disabled
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed:               Mitigation; IBRS
Vulnerability Spec rstack overflow:   Not affected
Vulnerability Spec store bypass:      Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:             Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:             Mitigation; IBRS; IBPB conditional; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds:                  Mitigation; Microcode
Vulnerability Tsx async abort:        Mitigation; TSX disabled

Versions of relevant libraries:
[pip3] lion-pytorch==0.2.3
[pip3] numpy==1.26.4
[pip3] nvidia-ml-py==12.570.86
[pip3] pynvml==12.0.0
[pip3] pytorch-triton-rocm==3.2.0+git4b3bb1f8
[pip3] pyzmq==26.2.1
[pip3] torch==2.7.0.dev20250221+rocm6.3
[pip3] torchaudio==2.6.0.dev20250221+rocm6.3
[pip3] torchvision==0.22.0.dev20250221+rocm6.3
[pip3] transformers==4.49.0
[pip3] triton==3.2.0+gite5be006a
[conda] Could not collect
ROCM Version: 6.3.42134-a9a80e791
Neuron SDK Version: N/A
vLLM Version: 0.7.4.dev403+g23d99d94
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
============================ ROCm System Management Interface ============================
================================ Weight between two GPUs =================================
       GPU0
GPU0   0

================================= Hops between two GPUs ==================================
       GPU0
GPU0   0

=============================== Link Type between two GPUs ===============================
       GPU0
GPU0   0

======================================= Numa Nodes =======================================
GPU[0]          : (Topology) Numa Node: 0
GPU[0]          : (Topology) Numa Affinity: -1
================================== End of ROCm SMI Log ===================================

LD_LIBRARY_PATH=/home/bjj/vllm-rocm/lib/python3.12/site-packages/cv2/../../lib64:
NCCL_CUMEM_ENABLE=0
TORCHINDUCTOR_COMPILE_THREADS=1
CUDA_MODULE_LOADING=LAZY

🐛 Describe the bug

Using this exact model: https://huggingface.co/Qwen/Qwen2.5-32B-Instruct-GPTQ-Int4

With ROCm, short prompts produce a never-ending string of !!!!!!!!!!!!!!!!!!!!!!!!!!!!!! as output. If the prompt is longer (e.g. tools are provided, or the user simply asks a longer question), generation is fine.

The same model works fine on CUDA.

Command:

vllm serve --dtype auto --gpu-memory-utilization 0.95 --served-model-name qwen2.5:32b \
--max-model-len 16384 ~/models/Qwen2.5-32B-Instruct-GPTQ-Int4/  --enable-auto-tool-choice \
--tool-call-parser llama3_json  --generation-config auto

Successful generation:

curl -sS -X POST http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
  "messages": [{"role": "user", "content": "hello"}],
  "stream": false,
  "model": "qwen2.5:32b"
}'

Result:

{"id":"chatcmpl-06b26b685535420ba52eacebfce2a025","object":"chat.completion","created":1741829679,"model":"qwen2.5:32b","choices":[{"index":0,"message":{"role":"assistant","reasoning_content":null,
"content":"Hello! How can I assist you today?","tool_calls":[]},"logprobs":null,"finish_reason":"stop","stop_reason":null}],
"usage":{"prompt_tokens":30,"total_tokens":40,"completion_tokens":10,"prompt_tokens_details":null},"prompt_logprobs":null}

To demonstrate the bug, turn on streaming so the repeated tokens are visible as they are generated:

curl -sS -X POST http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
  "messages": [{"role": "user", "content": "hello"}],
  "stream": true,
  "model": "qwen2.5:32b"
}'

Result:

data: {"id":"chatcmpl-0223b5d5bc024b4da9e8c56fefbad50e","object":"chat.completion.chunk","created":1741829884,"model":"qwen2.5:32b","choices":[{"index":0,"delta":{"role":"assistant","content":""},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-0223b5d5bc024b4da9e8c56fefbad50e","object":"chat.completion.chunk","created":1741829884,"model":"qwen2.5:32b","choices":[{"index":0,"delta":{"content":"!"},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-0223b5d5bc024b4da9e8c56fefbad50e","object":"chat.completion.chunk","created":1741829884,"model":"qwen2.5:32b","choices":[{"index":0,"delta":{"content":"!"},"logprobs":null,"finish_reason":null}]}
...
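For convenience, the short-vs-long comparison can be scripted against the same endpoint. This is a minimal sketch based on the curl calls above; the longer prompt text, the max_tokens cap (added so the "!" stream terminates), and curl's -N flag are illustrative additions, not taken from the report:

for PROMPT in "hello" "please explain in a few sentences how transformer attention works"; do
  echo "=== prompt: $PROMPT"
  curl -sS -N -X POST http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d "{
    \"messages\": [{\"role\": \"user\", \"content\": \"$PROMPT\"}],
    \"stream\": true,
    \"max_tokens\": 32,
    \"model\": \"qwen2.5:32b\"
  }"
  echo
done

On ROCm the short prompt streams "!" chunks as in the output above, while the longer prompt streams normal text.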
Startup messages for CUDA
INFO 03-12 17:54:05 __init__.py:207] Automatically detected platform cuda.
INFO 03-12 17:54:05 api_server.py:912] vLLM API server version 0.7.3
INFO 03-12 17:54:05 api_server.py:913] args: Namespace(subparser='serve', model_tag='/home/bjj/models/Qwen2.5-32B-Instruct-GPTQ-Int4/', config='', host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=True, enable_reasoning=False, reasoning_parser=None, tool_call_parser='llama3_json', tool_parser_plugin='', model='/home/bjj/models/Qwen2.5-32B-Instruct-GPTQ-Int4/', task='auto', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, allowed_local_media_path=None, download_dir=None, load_format='auto', config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto', max_model_len=16384, guided_decoding_backend='xgrammar', logits_processor_pattern=None, model_impl='auto', distributed_executor_backend=None, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=None, enable_prefix_caching=None, disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.95, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_partial_prefills=1, max_long_partial_prefills=1, long_prefill_token_threshold=0, max_num_seqs=None, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=['qwen2.5:32b'], qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', scheduler_cls='vllm.core.scheduler.Scheduler', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', generation_config='auto', override_generation_config=None, 
enable_sleep_mode=False, calculate_kv_scales=False, additional_config=None, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False, dispatch_function=<function ServeSubcommand.cmd at 0x7ed0c8a96980>)
INFO 03-12 17:54:05 api_server.py:209] Started engine process with PID 39703
INFO 03-12 17:54:08 __init__.py:207] Automatically detected platform cuda.
INFO 03-12 17:54:09 config.py:549] This model supports multiple tasks: {'classify', 'embed', 'score', 'generate', 'reward'}. Defaulting to 'generate'.
INFO 03-12 17:54:10 gptq_marlin.py:143] The model is convertible to gptq_marlin during runtime. Using gptq_marlin kernel.
INFO 03-12 17:54:12 config.py:549] This model supports multiple tasks: {'generate', 'reward', 'score', 'embed', 'classify'}. Defaulting to 'generate'.
INFO 03-12 17:54:13 gptq_marlin.py:143] The model is convertible to gptq_marlin during runtime. Using gptq_marlin kernel.
INFO 03-12 17:54:13 llm_engine.py:234] Initializing a V0 LLM engine (v0.7.3) with config: model='/home/bjj/models/Qwen2.5-32B-Instruct-GPTQ-Int4/', speculative_config=None, tokenizer='/home/bjj/models/Qwen2.5-32B-Instruct-GPTQ-Int4/', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=16384, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=gptq_marlin, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=qwen2.5:32b, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=True,
INFO 03-12 17:54:13 cuda.py:229] Using Flash Attention backend.
INFO 03-12 17:54:14 model_runner.py:1110] Starting to load model /home/bjj/models/Qwen2.5-32B-Instruct-GPTQ-Int4/...
INFO 03-12 17:54:14 gptq_marlin.py:235] Using MarlinLinearKernel for GPTQMarlinLinearMethod
Loading safetensors checkpoint shards:   0% Completed | 0/5 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  20% Completed | 1/5 [00:00<00:01,  2.22it/s]
Loading safetensors checkpoint shards:  40% Completed | 2/5 [00:00<00:01,  2.02it/s]
Loading safetensors checkpoint shards:  60% Completed | 3/5 [00:01<00:01,  1.97it/s]
Loading safetensors checkpoint shards:  80% Completed | 4/5 [00:01<00:00,  1.99it/s]
Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:02<00:00,  2.07it/s]
Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:02<00:00,  2.04it/s]

INFO 03-12 17:54:16 model_runner.py:1115] Loading model weights took 18.0193 GB
INFO 03-12 17:54:25 worker.py:267] Memory profiling takes 8.17 seconds
INFO 03-12 17:54:25 worker.py:267] the current vLLM instance can use total_gpu_memory (47.38GiB) x gpu_memory_utilization (0.95) = 45.01GiB
INFO 03-12 17:54:25 worker.py:267] model weights take 18.02GiB; non_torch_memory takes 0.07GiB; PyTorch activation peak memory takes 4.02GiB; the rest of the memory reserved for KV Cache is 22.90GiB.
INFO 03-12 17:54:25 executor_base.py:111] # cuda blocks: 5863, # CPU blocks: 1024
INFO 03-12 17:54:25 executor_base.py:116] Maximum concurrency for 16384 tokens per request: 5.73x
INFO 03-12 17:54:26 model_runner.py:1434] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
Capturing CUDA graph shapes: 100%|██████████████████████████████████████████████████████| 35/35 [00:12<00:00,  2.90it/s]
INFO 03-12 17:54:38 model_runner.py:1562] Graph capturing finished in 12 secs, took 0.45 GiB
INFO 03-12 17:54:38 llm_engine.py:436] init engine (profile, create kv cache, warmup model) took 21.93 seconds
INFO 03-12 17:54:39 serving_chat.py:76] "auto" tool choice has been enabled please note that while the parallel_tool_calls client option is preset for compatibility reasons, it will be ignored.
INFO 03-12 17:54:39 serving_chat.py:111] Overwriting default chat sampling param with: {'repetition_penalty': 1.05, 'temperature': 0.7, 'top_k': 20, 'top_p': 0.8}
INFO 03-12 17:54:39 serving_completion.py:56] Overwriting default completion sampling param with: {'repetition_penalty': 1.05, 'temperature': 0.7, 'top_k': 20, 'top_p': 0.8}
INFO 03-12 17:54:39 api_server.py:958] Starting vLLM API server on http://0.0.0.0:8000
INFO 03-12 17:54:39 launcher.py:23] Available routes are:
INFO 03-12 17:54:39 launcher.py:31] Route: /openapi.json, Methods: HEAD, GET
INFO 03-12 17:54:39 launcher.py:31] Route: /docs, Methods: HEAD, GET
INFO 03-12 17:54:39 launcher.py:31] Route: /docs/oauth2-redirect, Methods: HEAD, GET
INFO 03-12 17:54:39 launcher.py:31] Route: /redoc, Methods: HEAD, GET
INFO 03-12 17:54:39 launcher.py:31] Route: /health, Methods: GET
INFO 03-12 17:54:39 launcher.py:31] Route: /ping, Methods: POST, GET
INFO 03-12 17:54:39 launcher.py:31] Route: /tokenize, Methods: POST
INFO 03-12 17:54:39 launcher.py:31] Route: /detokenize, Methods: POST
INFO 03-12 17:54:39 launcher.py:31] Route: /v1/models, Methods: GET
INFO 03-12 17:54:39 launcher.py:31] Route: /version, Methods: GET
INFO 03-12 17:54:39 launcher.py:31] Route: /v1/chat/completions, Methods: POST
INFO 03-12 17:54:39 launcher.py:31] Route: /v1/completions, Methods: POST
INFO 03-12 17:54:39 launcher.py:31] Route: /v1/embeddings, Methods: POST
INFO 03-12 17:54:39 launcher.py:31] Route: /pooling, Methods: POST
INFO 03-12 17:54:39 launcher.py:31] Route: /score, Methods: POST
INFO 03-12 17:54:39 launcher.py:31] Route: /v1/score, Methods: POST
INFO 03-12 17:54:39 launcher.py:31] Route: /v1/audio/transcriptions, Methods: POST
INFO 03-12 17:54:39 launcher.py:31] Route: /rerank, Methods: POST
INFO 03-12 17:54:39 launcher.py:31] Route: /v1/rerank, Methods: POST
INFO 03-12 17:54:39 launcher.py:31] Route: /v2/rerank, Methods: POST
INFO 03-12 17:54:39 launcher.py:31] Route: /invocations, Methods: POST
INFO:     Started server process [39683]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
Startup messages for ROCm
INFO 03-12 18:34:30 [__init__.py:256] Automatically detected platform rocm.
INFO 03-12 18:34:31 [api_server.py:912] vLLM API server version 0.7.4.dev403+g23d99d94
INFO 03-12 18:34:31 [api_server.py:913] args: Namespace(subparser='serve', model_tag='/home/bjj/models/Qwen2.5-32B-Instruct-GPTQ-Int4/', config='', host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, enable_ssl_refresh=False, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=True, tool_call_parser='llama3_json', tool_parser_plugin='', model='/home/bjj/models/Qwen2.5-32B-Instruct-GPTQ-Int4/', task='auto', tokenizer=None, hf_config_path=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, allowed_local_media_path=None, download_dir=None, load_format='auto', config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto', max_model_len=16384, guided_decoding_backend='xgrammar', logits_processor_pattern=None, model_impl='auto', distributed_executor_backend=None, pipeline_parallel_size=1, tensor_parallel_size=1, enable_expert_parallel=False, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=None, enable_prefix_caching=None, disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=None, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.95, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_partial_prefills=1, max_long_partial_prefills=1, long_prefill_token_threshold=0, max_num_seqs=None, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, use_tqdm_on_load=True, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=['qwen2.5:32b'], qlora_adapter_name_or_path=None, show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', scheduler_cls='vllm.core.scheduler.Scheduler', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, 
worker_cls='auto', worker_extension_cls='', generation_config='auto', override_generation_config=None, enable_sleep_mode=False, calculate_kv_scales=False, additional_config=None, enable_reasoning=False, reasoning_parser=None, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False, dispatch_function=<function ServeSubcommand.cmd at 0x78e272344a40>)
INFO 03-12 18:34:31 [api_server.py:209] Started engine process with PID 3623102
INFO 03-12 18:34:34 [__init__.py:256] Automatically detected platform rocm.
INFO 03-12 18:34:48 [config.py:577] This model supports multiple tasks: {'score', 'embed', 'generate', 'classify', 'reward'}. Defaulting to 'generate'.
INFO 03-12 18:34:52 [config.py:577] This model supports multiple tasks: {'generate', 'score', 'embed', 'classify', 'reward'}. Defaulting to 'generate'.
WARNING 03-12 18:34:54 [config.py:656] gptq quantization is not fully optimized yet. The speed can be slower than non-quantized models.
INFO 03-12 18:34:54 [config.py:1522] Disabled the custom all-reduce kernel because it is not supported on AMD GPUs.
WARNING 03-12 18:34:58 [config.py:656] gptq quantization is not fully optimized yet. The speed can be slower than non-quantized models.
INFO 03-12 18:34:58 [config.py:1522] Disabled the custom all-reduce kernel because it is not supported on AMD GPUs.
INFO 03-12 18:34:58 [llm_engine.py:235] Initializing a V0 LLM engine (v0.7.4.dev403+g23d99d94) with config: model='/home/bjj/models/Qwen2.5-32B-Instruct-GPTQ-Int4/', speculative_config=None, tokenizer='/home/bjj/models/Qwen2.5-32B-Instruct-GPTQ-Int4/', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=16384, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=True, quantization=gptq, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=qwen2.5:32b, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=True,
INFO 03-12 18:34:59 [rocm.py:130] None is not supported in AMD GPUs.
INFO 03-12 18:34:59 [rocm.py:131] Using ROCmFlashAttention backend.
INFO 03-12 18:34:59 [parallel_state.py:948] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 03-12 18:34:59 [model_runner.py:1110] Starting to load model /home/bjj/models/Qwen2.5-32B-Instruct-GPTQ-Int4/...
WARNING 03-12 18:34:59 [rocm.py:206] Model architecture 'Qwen2ForCausalLM' is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in Triton flash attention. For half-precision SWA support, please use CK flash attention by setting `VLLM_USE_TRITON_FLASH_ATTN=0`
Loading safetensors checkpoint shards:   0% Completed | 0/5 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  20% Completed | 1/5 [00:03<00:13,  3.26s/it]
Loading safetensors checkpoint shards:  40% Completed | 2/5 [00:06<00:10,  3.53s/it]
Loading safetensors checkpoint shards:  60% Completed | 3/5 [00:29<00:24, 12.05s/it]
Loading safetensors checkpoint shards:  80% Completed | 4/5 [00:34<00:09,  9.23s/it]
Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:45<00:00,  9.92s/it]
Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:45<00:00,  9.04s/it]

INFO 03-12 18:35:45 [loader.py:429] Loading weights took 45.49 seconds
INFO 03-12 18:35:50 [model_runner.py:1146] Model loading took 18.3379 GB and 46.682949 seconds
/home/bjj/vllm-rocm/vllm/vllm/model_executor/layers/vocab_parallel_embedding.py:43: UserWarning: Attempting to use hipBLASLt on an unsupported architecture! Overriding blas backend to hipblas (Triggered internally at /pytorch/aten/src/ATen/Context.cpp:340.)
  return F.linear(x, layer.weight, bias)
INFO 03-12 18:36:25 [worker.py:267] Memory profiling takes 34.73 seconds
INFO 03-12 18:36:25 [worker.py:267] the current vLLM instance can use total_gpu_memory (31.98GiB) x gpu_memory_utilization (0.95) = 30.39GiB
INFO 03-12 18:36:25 [worker.py:267] model weights take 18.34GiB; non_torch_memory takes 0.38GiB; PyTorch activation peak memory takes 3.45GiB; the rest of the memory reserved for KV Cache is 8.22GiB.
INFO 03-12 18:36:25 [executor_base.py:111] # rocm blocks: 2103, # CPU blocks: 1024
INFO 03-12 18:36:25 [executor_base.py:116] Maximum concurrency for 16384 tokens per request: 2.05x
INFO 03-12 18:36:27 [model_runner.py:1442] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
Capturing CUDA graph shapes: 100%|██████████████████████████████████████████████████████| 35/35 [00:41<00:00,  1.20s/it]
INFO 03-12 18:37:09 [model_runner.py:1570] Graph capturing finished in 42 secs, took 0.69 GiB
INFO 03-12 18:37:09 [llm_engine.py:441] init engine (profile, create kv cache, warmup model) took 79.53 seconds
INFO 03-12 18:37:10 [serving_chat.py:76] "auto" tool choice has been enabled please note that while the parallel_tool_calls client option is preset for compatibility reasons, it will be ignored.
INFO 03-12 18:37:10 [serving_chat.py:114] Using default chat sampling params from model: {'repetition_penalty': 1.05, 'temperature': 0.7, 'top_k': 20, 'top_p': 0.8}
INFO 03-12 18:37:10 [serving_completion.py:60] Using default completion sampling params from model: {'repetition_penalty': 1.05, 'temperature': 0.7, 'top_k': 20, 'top_p': 0.8}
INFO 03-12 18:37:10 [api_server.py:958] Starting vLLM API server on http://0.0.0.0:8000
INFO 03-12 18:37:10 [launcher.py:26] Available routes are:
INFO 03-12 18:37:10 [launcher.py:34] Route: /openapi.json, Methods: HEAD, GET
INFO 03-12 18:37:10 [launcher.py:34] Route: /docs, Methods: HEAD, GET
INFO 03-12 18:37:10 [launcher.py:34] Route: /docs/oauth2-redirect, Methods: HEAD, GET
INFO 03-12 18:37:10 [launcher.py:34] Route: /redoc, Methods: HEAD, GET
INFO 03-12 18:37:10 [launcher.py:34] Route: /health, Methods: GET
INFO 03-12 18:37:10 [launcher.py:34] Route: /ping, Methods: GET, POST
INFO 03-12 18:37:10 [launcher.py:34] Route: /tokenize, Methods: POST
INFO 03-12 18:37:10 [launcher.py:34] Route: /detokenize, Methods: POST
INFO 03-12 18:37:10 [launcher.py:34] Route: /v1/models, Methods: GET
INFO 03-12 18:37:10 [launcher.py:34] Route: /version, Methods: GET
INFO 03-12 18:37:10 [launcher.py:34] Route: /v1/chat/completions, Methods: POST
INFO 03-12 18:37:10 [launcher.py:34] Route: /v1/completions, Methods: POST
INFO 03-12 18:37:10 [launcher.py:34] Route: /v1/embeddings, Methods: POST
INFO 03-12 18:37:10 [launcher.py:34] Route: /pooling, Methods: POST
INFO 03-12 18:37:10 [launcher.py:34] Route: /score, Methods: POST
INFO 03-12 18:37:10 [launcher.py:34] Route: /v1/score, Methods: POST
INFO 03-12 18:37:10 [launcher.py:34] Route: /v1/audio/transcriptions, Methods: POST
INFO 03-12 18:37:10 [launcher.py:34] Route: /rerank, Methods: POST
INFO 03-12 18:37:10 [launcher.py:34] Route: /v1/rerank, Methods: POST
INFO 03-12 18:37:10 [launcher.py:34] Route: /v2/rerank, Methods: POST
INFO 03-12 18:37:10 [launcher.py:34] Route: /invocations, Methods: POST
INFO:     Started server process [3623061]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
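The ROCm startup warning above ([rocm.py:206]) notes that Triton flash attention does not yet support sliding-window attention and points to `VLLM_USE_TRITON_FLASH_ATTN=0` for the CK flash-attention path. As a sketch only (not verified as a fix here), the same serve command with that variable set would be:

VLLM_USE_TRITON_FLASH_ATTN=0 vllm serve --dtype auto --gpu-memory-utilization 0.95 --served-model-name qwen2.5:32b \
--max-model-len 16384 ~/models/Qwen2.5-32B-Instruct-GPTQ-Int4/ --enable-auto-tool-choice \
--tool-call-parser llama3_json --generation-config auto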

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.