GPU Imbalanced Loading #7250

Open

WillDreamer opened this issue Mar 11, 2025 · 1 comment
Labels: bug (Something isn't working), pending (This problem is yet to be addressed)

Comments

@WillDreamer

Reminder

  • I have read the above rules and searched the existing issues.

System Info

When I submit a training job as follows:
llamafactory-cli train \
    --stage sft \
    --do_train True \
    --model_name_or_path Qwen/Qwen2-VL-7B-Instruct \
    --preprocessing_num_workers 16 \
    --finetuning_type lora \
    --template qwen2_vl \
    --rope_scaling linear \
    --flash_attn auto \
    --dataset_dir /data/LLaMA-Factory/data \
    --dataset EMMA-mini \
    --cutoff_len 4096 \
    --learning_rate 5e-05 \
    --num_train_epochs 3.0 \
    --max_samples 100000 \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 2 \
    --lr_scheduler_type cosine \
    --max_grad_norm 1.0 \
    --logging_steps 5 \
    --save_steps 100 \
    --warmup_steps 0 \
    --packing False \
    --report_to \
    --output_dir saves/Qwen2-VL-7B-Instruct/lora/train_2025-03-11-07-34-45 \
    --pure_bf16 True \
    --plot_loss True \
    --trust_remote_code True \
    --ddp_timeout 180000000 \
    --include_num_input_tokens_seen True \
    --optim adamw_torch \
    --quantization_bit 4 \
    --quantization_method bitsandbytes \
    --double_quantization True \
    --lora_rank 8 \
    --lora_alpha 16 \
    --lora_dropout 0 \
    --lora_target all \
    --deepspeed cache/ds_z2_config.json

The load across the GPUs is not balanced. How can I deal with this?

[Image: screenshot of the uneven GPU load]

Reproduction

GPU imbalance

Others

No response

WillDreamer added the bug and pending labels on Mar 11, 2025
@Kuangdd01
Contributor

This issue looks similar to #5991.
In your case, per_device_train_batch_size is set to 1, so GPU utilization differs between devices because each GPU processes a sequence of a different length on every step.
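If the imbalance matters for throughput, two knobs that already appear in your command can smooth it out. This is only a sketch, not a guaranteed fix, and whether packing is appropriate depends on your dataset and template:

    --per_device_train_batch_size 2    (a larger per-device batch averages out sequence-length differences between ranks)
    --packing True                      (packs several short samples into one cutoff_len-sized sequence, so each GPU sees a similar number of tokens per step)

If you raise per_device_train_batch_size from 1 to 2, you can lower gradient_accumulation_steps from 2 to 1 to keep the effective batch size unchanged.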
