This issue seems similar to #5991.
In your case, `per_device_train_batch_size` is set to 1, so GPU utilization differs because each GPU processes sequences of different lengths.
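A purely illustrative sketch of the effect (the GPU count, step count, and sequence lengths below are made up; only the mechanism matters): with a per-device batch size of 1, each step is gated by the GPU that drew the longest sample, so the other GPUs sit partially idle until the gradient sync.

```python
import random

random.seed(0)
num_gpus, steps = 4, 200          # made-up values for illustration

busy = [0.0] * num_gpus           # time each GPU spends computing
wall = 0.0                        # simulated wall-clock time

for _ in range(steps):
    # With per_device_train_batch_size=1, every GPU gets exactly one
    # sample per step; its compute time is roughly proportional to the
    # sample's token count (here drawn uniformly up to cutoff_len=4096).
    lengths = [random.randint(64, 4096) for _ in range(num_gpus)]
    wall += max(lengths)          # DDP syncs at the gradient all-reduce,
                                  # so a step takes as long as the slowest GPU
    for g, n in enumerate(lengths):
        busy[g] += n

for g in range(num_gpus):
    print(f"GPU {g} utilization: {busy[g] / wall:.0%}")
```

Running this prints per-GPU utilization well below 100% and unequal across devices, which matches what `nvidia-smi` shows in this situation.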
Reminder
System Info
When I submit a training job as follows:
```bash
llamafactory-cli train \
    --stage sft \
    --do_train True \
    --model_name_or_path Qwen/Qwen2-VL-7B-Instruct \
    --preprocessing_num_workers 16 \
    --finetuning_type lora \
    --template qwen2_vl \
    --rope_scaling linear \
    --flash_attn auto \
    --dataset_dir /data/LLaMA-Factory/data \
    --dataset EMMA-mini \
    --cutoff_len 4096 \
    --learning_rate 5e-05 \
    --num_train_epochs 3.0 \
    --max_samples 100000 \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 2 \
    --lr_scheduler_type cosine \
    --max_grad_norm 1.0 \
    --logging_steps 5 \
    --save_steps 100 \
    --warmup_steps 0 \
    --packing False \
    --report_to \
    --output_dir saves/Qwen2-VL-7B-Instruct/lora/train_2025-03-11-07-34-45 \
    --pure_bf16 True \
    --plot_loss True \
    --trust_remote_code True \
    --ddp_timeout 180000000 \
    --include_num_input_tokens_seen True \
    --optim adamw_torch \
    --quantization_bit 4 \
    --quantization_method bitsandbytes \
    --double_quantization True \
    --lora_rank 8 \
    --lora_alpha 16 \
    --lora_dropout 0 \
    --lora_target all \
    --deepspeed cache/ds_z2_config.json
```
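The issue does not include the contents of `cache/ds_z2_config.json`. For context, a typical DeepSpeed ZeRO-2 config used with a command like this might look like the following (a sketch of the usual shape, not necessarily the exact file used here):

```json
{
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "zero_allow_untested_optimizer": true,
  "bf16": {
    "enabled": "auto"
  },
  "zero_optimization": {
    "stage": 2,
    "allgather_partitions": true,
    "overlap_comm": true,
    "reduce_scatter": true,
    "contiguous_gradients": true
  }
}
```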
GPU utilization is not balanced across devices. How can I deal with this?
Reproduction
Others
No response